Statistical Modeling of Agatston Score in Multi-Ethnic Study of Atherosclerosis (MESA)

Shuangge Ma; Anna Liu; Jeffrey Carr; Wendy Post; Richard Kronmal

doi:10.1371/journal.pone.0012036

Abstract

The MESA (Multi-Ethnic Study of Atherosclerosis) is an ongoing study of the prevalence, risk factors, and progression of subclinical cardiovascular disease in a multi-ethnic cohort. It provides a valuable opportunity to examine the development and progression of CAC (coronary artery calcium), which is an important risk factor for the development of coronary heart disease. In MESA, about half of the CAC scores are zero and the rest are continuously distributed. Such data has been referred to as “zero-inflated data” and may be described using two-part models. Existing two-part model studies have limitations in that they usually consider parametric models only, make the assumption of known forms of the covariate effects, and focus only on the estimation property of the models. In this article, we investigate statistical modeling of CAC in MESA. Building on existing studies, we focus on two-part models. We investigate both parametric and semiparametric, and both proportional and nonproportional models. For various models, we study their estimation as well as prediction properties. We show that, to fully describe the relationship between covariates and CAC development, the semiparametric model with nonproportional covariate effects is needed. In contrast, for the purpose of prediction, the parametric model with proportional covariate effects is sufficient. This study provides a statistical basis for describing the behaviors of CAC and insights into its biological mechanisms.

Citation: Ma S, Liu A, Carr J, Post W, Kronmal R (2010) Statistical Modeling of Agatston Score in Multi-Ethnic Study of Atherosclerosis (MESA). PLoS ONE 5(8): e12036. https://doi.org/10.1371/journal.pone.0012036

Editor: Massimo Federici, University of Tor Vergata, Italy

Received: March 25, 2010; Accepted: June 30, 2010; Published: August 9, 2010

Copyright: © 2010 Ma et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study has been supported by DMS 0805984 from the National Science Foundation (Ma) and contracts N01-HC-95159 through N01-HC-95169 from the National Heart, Lung and Blood Institute (Carr, Post, Kronmal). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The MESA (Multi-Ethnic Study of Atherosclerosis) is an ongoing study of the prevalence, risk factors, and progression of subclinical cardiovascular disease in a multi-ethnic cohort (http://www.mesa-nhlbi.org/) [1]. It provides a valuable opportunity to investigate the development and progression of CAC (coronary artery calcium), which is an important risk factor for the development of coronary heart disease events [2]. In MESA, CAC is measured with the Agatston score, which is the amount of calcium at each lesion scaled by an attenuation factor and summed over all lesions [3]. The histogram of log(1+CAC) in Figure 1 shows that about half of the CAC scores are zero and the rest are continuously distributed. Such a mixture distribution of CAC is commonly observed in relatively healthy cohorts.

Figure 1. Histogram of log(1+CAC).

https://doi.org/10.1371/journal.pone.0012036.g001

The CAC has a “point mass at zero+continuous” distribution and is a special case of zero-inflated data. Simple regression models are not capable of describing such data. It is not our intention to comprehensively review analytic methodologies for zero-inflated data. Instead, we focus on the statistical models for CAC. To describe nonzero CAC values, existing methods include generalized estimating equations [4], Tobit regression [5], zero-inflated normal model [6], quantile regression [7], and others. To describe zero versus nonzero CAC values, existing methods include logistic regression [5], relative risk regression, and others.

In MESA, after extensive comparisons and evaluations, Kronmal [8] suggested two-part models as the default for CAC. Two-part models have a long history in economic, statistical, and biomedical literature and can be a natural choice for data with a mixture distribution. With two-part models, the development of CAC is modeled in two steps (parts). The first step describes the development from zero to nonzero CAC values. In this step, the response variable is binary. The second step describes the progression of nonzero CAC values. In this step, the response variable is continuously distributed. The two steps have different purposes and different types of response variables, and hence demand different models with different covariate effects. Compared with other models that can also describe mixture data, two-part models have the advantage of being intuitive and not making strong assumptions on the unknown data generating mechanisms. On the negative side, our literature review suggests that existing two-part model studies may have the following limitations. First, they only consider parametric models. Such models are limited in that they cannot describe the subtle, nonlinear relationships between covariates and CAC. Second, they assume that the forms of the covariate effects are known. Such an assumption is usually not sufficiently justified. Third, they often focus on the estimation property and do not provide a comprehensive description of the models.

Building on existing studies [8], we investigate two-part CAC models in this article. This study has been motivated by the clinical importance of CAC and limitations of existing models. It advances from published studies in the following directions. First, besides parametric models, we also consider semiparametric models, with which we may discover the nonlinear relationships between covariates and CAC development. Second, multiple forms of covariate effects are considered. Particularly, besides ordinary nonproportional covariate effects, we also consider proportional covariate effects, which have fewer parameters, can be more accurately estimated, and may provide insights into the biological mechanisms underlying CAC development. Third, besides estimation, we also investigate the prediction performance of various models and thus are able to provide a more comprehensive description of those models.

Methods

The MESA Study

The MESA is a study of the characteristics of subclinical cardiovascular disease (disease detected non-invasively before it has produced clinical signs and symptoms) and the risk factors that predict progression to clinically overt cardiovascular disease or progression of the subclinical disease [1]. 6814 participants 45 to 84 years of age were recruited from six US communities from 2000 to 2002. Among them, 2619 are white, 1898 are African-American, 1494 are Hispanic, and 803 are Asian – predominantly Chinese descent. At recruitment, all participants were free of clinically apparent cardiovascular disease. Each participant received an extensive examination to determine coronary calcification, ventricular mass and function, flow-mediated endothelial vasodilation, carotid intimal-medial wall thickness and presence of atherosclerotic plaque, lower extremity vascular insufficiency, arterial wave forms, electrocardiographic (ECG) measures, standard coronary risk factors, sociodemographic factors, lifestyle factors, and psychosocial factors. Written consents were obtained from all participants.

CAC was measured with electron-beam computed tomography (EBT) at three field centers or multidetector computed tomography (MDCT) at the other three field centers. Each participant was scanned twice consecutively, and the results from the two scans were averaged to provide a more accurate estimation. The amount of calcium was quantified with the Agatston scoring method [3]. Calcium scores were adjusted with a standard calcium phantom that was scanned along with the participant. This phantom makes it possible to calibrate the degree of brightness between sites and participants. Rescan agreement was found to be high with both EBT and MDCT scanners. Interobserver agreement and intraobserver agreement were found to be satisfactory ($\kappa$ = 0.93 and 0.90, respectively) [9].

The MESA study has been approved by the Human Subjects Research Review Committee at the University of Washington and at all six sites. Detailed information is available at the MESA website http://www.mesa-nhlbi.org. The study presented in this article has been approved by the Human Subjects Research Review Committee at Yale University.

Two-part CAC Models

The distribution of CAC is highly skewed. We make the logarithm transformation and study $Y = \log(1 + \mathrm{CAC})$. Figure 1 shows that, with probability ∼0.5, $Y = 0$. For $Y > 0$, $Y$ has a continuous distribution. Denote $X = (X_1, \ldots, X_K)$ as the K covariates of interest.

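Consider two-part models, where in the first part, we model the occurrence of a nonzero CAC value. More specifically, consider
$$P(Y > 0 \mid X) = g^{-1}(\phi_1(X)),$$
where $g$ is the link function, $g^{-1}$ is the inverse of $g$, and $\phi_1(X)$ is the covariate effect. In the second part, consider
$$Y = \phi_2(X) + \epsilon \quad \text{for } Y > 0,$$
where $\phi_2(X)$ is the covariate effect, and $\epsilon$ is the random error.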

We determine the link function using the techniques described in [10] and find that the logistic link function, which has been suggested in [11], [12], is proper. We determine the distribution of random error using the approaches described in [13] and find that the normal distribution is proper. This is intuitively reasonable by “eyeballing” Figure 1. There are multiple possibilities for the covariate effects, including:

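(i) Parametric, proportional covariate effects: the first-part effect is $\phi_1(X) = \alpha_1 + X^\top \beta$ and the second-part effect is $\phi_2(X) = \alpha_2 + \tau X^\top \beta$, where $\alpha_1, \alpha_2$ are the unknown intercepts, $\beta$ is the unknown length-K vector of regression coefficients, and $\tau$ is the unknown scale parameter;
(ii) Parametric, nonproportional covariate effects: $\phi_1(X) = \alpha_1 + X^\top \beta_1$ and $\phi_2(X) = \alpha_2 + X^\top \beta_2$, where $\alpha_1, \alpha_2$ are the unknown intercepts, $\beta_1, \beta_2$ are the unknown length-K vectors of regression coefficients, and there is no proportionality constraint on $\beta_1$ and $\beta_2$;
(iii) Semiparametric, proportional covariate effects: $\phi_1(X) = \alpha_1 + \sum_{j=1}^{m} X_j \beta_j + \sum_{j=m+1}^{K} f_j(X_j)$ and $\phi_2(X) = \alpha_2 + \tau \left[ \sum_{j=1}^{m} X_j \beta_j + \sum_{j=m+1}^{K} f_j(X_j) \right]$, where $\alpha_1, \alpha_2$ are the unknown intercepts, $\beta = (\beta_1, \ldots, \beta_m)$ is the length-m vector of unknown regression coefficients, $f_{m+1}, \ldots, f_K$ are the K-m unknown nonparametric covariate effects, and $\tau$ is the unknown scale parameter;
(iv) Semiparametric, nonproportional covariate effects: $\phi_k(X) = \alpha_k + \sum_{j=1}^{m} X_j \beta_{kj} + \sum_{j=m+1}^{K} f_{kj}(X_j)$ for $k = 1, 2$, where $\alpha_1, \alpha_2$ are the unknown intercepts, $\beta_1 = (\beta_{11}, \ldots, \beta_{1m})$ and $\beta_2 = (\beta_{21}, \ldots, \beta_{2m})$ are the length-m vectors of unknown regression coefficients, $f_{1j}, f_{2j}$, $j = m+1, \ldots, K$, are the unknown nonparametric covariate effects, and there is no proportionality constraint on $(\beta_1, f_{1j})$ and $(\beta_2, f_{2j})$.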
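
As an illustration of how a two-part model with parametric, proportional covariate effects generates zero-inflated data, the following is a minimal simulation sketch. The function name, the parameter names (alpha1, alpha2, beta, tau, sigma), and all numeric values are hypothetical choices made for this example; none of them are estimates from MESA. The sketch assumes the logistic link and normal error discussed above.

```python
import math
import random


def simulate_two_part(x_rows, alpha1, alpha2, beta, tau, sigma, rng):
    """Simulate Y = log(1 + CAC) under parametric, proportional effects.

    All parameter names and values are illustrative assumptions.
    """
    ys = []
    for x in x_rows:
        # Shared linear index: it drives both parts under proportionality.
        index = sum(b * xj for b, xj in zip(beta, x))
        # Part one: logistic model for the probability of a nonzero value.
        p_nonzero = 1.0 / (1.0 + math.exp(-(alpha1 + index)))
        if rng.random() < p_nonzero:
            # Part two: normal model for the nonzero value, index scaled by tau.
            ys.append(alpha2 + tau * index + rng.gauss(0.0, sigma))
        else:
            ys.append(0.0)
    return ys


rng = random.Random(2010)
x_rows = [[rng.gauss(0.0, 1.0)] for _ in range(2000)]
y = simulate_two_part(x_rows, alpha1=0.0, alpha2=4.0, beta=[0.8],
                      tau=1.5, sigma=1.0, rng=rng)
share_zero = sum(1 for v in y if v == 0.0) / len(y)
print(f"share of zeros: {share_zero:.2f}")
```

With a zero first-part intercept and a covariate symmetric about zero, roughly half of the simulated values are zero, which mimics the point mass seen in Figure 1. Dropping the proportionality constraint would amount to using a separate coefficient vector in part two instead of tau times the shared index.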

Remarks: Parametric and semiparametric models.

Models (i) and (ii) are parametric, whereas models (iii) and (iv) are semiparametric. There is a rich literature on the advantages and disadvantages of parametric and semiparametric models [14]. Parametric models assume linear relationships between covariates (or their transformations) and response variables. They are usually easy to interpret, with the regression coefficients measuring the increase rates of response variables with changes of covariates. In addition, they can be easily estimated using existing software and the estimates usually have the desired root-n convergence rate. Statistical inference can be easily conducted using likelihood-based methods. On the negative side, the assumption of linear relationships can be limited and subject to model misspecification. Semiparametric models, on the other hand, allow nonlinear relationships between covariates and response variables. Thus, they are able to describe more subtle data structure. The tradeoff is that semiparametric models can be hard to estimate and interpret. In addition, the estimates of nonparametric functions may not have the desired root-n convergence rate. Moreover, inference with semiparametric models may not be straightforward. Computationally intensive methods, such as the bootstrap or jackknife, may be needed. Of note, most existing studies assume parametric two-part models. In this study, to comprehensively describe CAC, both parametric and semiparametric models are considered.

Remarks: Proportional and nonproportional models.

Most existing two-part models share a similar spirit with models (ii) and (iv) in that there is no constraint linking the first-part and second-part covariate effects, $\phi_1(X)$ and $\phi_2(X)$. Unlike those models, models (i) and (iii) have a proportionality constraint. That is, other than the intercepts, the covariate effects $\phi_1(X)$ and $\phi_2(X)$ differ only by a scale parameter. Compared with nonproportional models, proportional models have fewer unknown parameters and thus can be more accurately estimated. This improved accuracy has been rigorously proved and observed in numerical studies [11]. In addition, in proportional models, covariates contribute to $\phi_1(X)$ and $\phi_2(X)$ in the same manner. Thus, it is reasonable to hypothesize that, when the proportionality holds, the same biological process determines whether the CAC is zero as well as its actual value if nonzero. This may provide insights into the biological mechanisms underlying CAC development. Moreover, under proportionality, the same index can be used to predict the whole range of CAC, from zero to nonzero as well as progression of nonzero values. This may simplify practice involving predicting the CAC values.

Remarks: Estimation and prediction.

With a statistical model, we are interested in its two closely related but distinct properties. The first is the estimation property, where the goal is to fully describe the relationship between covariates and response variable. The second is the prediction property, where the goal is to accurately predict values of the response variable for subjects that are not used in model building. Theoretically speaking, there exists a true data generating model. This model not only provides the best description of the relationship between covariates and response but also has the best prediction performance. However, in practice with finite sample data, the true model is not known, and the models most suitable for estimation and prediction may differ.

With a simple linear regression model (M1): , we consider scenarios under which the models most suitable for estimation and prediction are different and possible causes of the difference. It is expected that similar arguments hold for more complicated models. We have conducted a small scale simulation, where we fix the values of and . With the simulated data, we are able to increase the magnitudes of and , but keep (the estimate of ) statistically significant with p-value<0.05 (more details available upon request).

We also consider the alternative model (M2): . For estimation, since the goal is to fully describe the relationship between the covariates and response variable, model (M1) is needed, while model (M2) is misspecified and improper. For prediction, the goal is to minimize the squared error SE = (predicted value - observed value)² for subjects not used in model building. This quantity can be decomposed into two components. The first is a bias component, and the second is a variance component. Model (M2) is misspecified, so it may have the bias component larger than that of (M1). However, model (M2) has fewer unknown parameters and can be more accurately estimated. So the variance component for (M2) may be smaller than that for (M1). Thus, because of the bias-variance tradeoff, the misspecified (M2) can be more suitable for prediction. In studies of statistical models for CAC, the two aspects of model fitting have not been well distinguished, and the best estimation models have been used for prediction without rigorous justification, or vice versa. Our study shows that, for CAC in MESA, the models most suitable for estimation and prediction are in fact different.
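
The bias-variance argument can be checked numerically. The sketch below is not the simulation reported in the article, and its two models are hypothetical stand-ins rather than the article's (M1) and (M2): a "full" model with an intercept and a slope, which is correctly specified, and a "small" model with an intercept only, which is misspecified whenever the slope is nonzero. The sample sizes, slopes, and error variance are arbitrary illustrative values.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mean_test_error(n_train, n_test, slope, n_rep, rng):
    """Average squared prediction error on fresh subjects for both models."""
    err_full = err_small = 0.0
    for _ in range(n_rep):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
        ys = [1.0 + slope * x + rng.gauss(0.0, 1.0) for x in xs]
        a, b = fit_line(xs, ys)        # full model: intercept and slope
        ybar = sum(ys) / n_train       # small model: intercept only
        for _ in range(n_test):
            # Fresh subject that was not used in model building.
            x0 = rng.gauss(0.0, 1.0)
            y0 = 1.0 + slope * x0 + rng.gauss(0.0, 1.0)
            err_full += (a + b * x0 - y0) ** 2
            err_small += (ybar - y0) ** 2
    total = n_rep * n_test
    return err_full / total, err_small / total


rng = random.Random(1)
for slope in (0.05, 1.0):
    full, small = mean_test_error(n_train=20, n_test=50, slope=slope,
                                  n_rep=2000, rng=rng)
    print(f"slope={slope}: full={full:.3f}  small={small:.3f}")
```

With a weak effect and a small training sample, the variance saved by estimating one fewer parameter outweighs the squared bias of ignoring the covariate, so the small model predicts better on average. With a strong effect the ordering reverses.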

Estimation and inference methods

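With a normally distributed random error, up to a constant, the log-likelihood function for a single observation is
$$l = I(Y > 0)\left[\log p(X) - \log \sigma - \frac{(Y - \phi_2(X))^2}{2\sigma^2}\right] + I(Y = 0)\log\{1 - p(X)\},$$
where $p(X) = g^{-1}(\phi_1(X))$ is the first-part probability of a nonzero value, $\phi_2(X)$ is the second-part covariate effect, and $\sigma^2$ is the variance of the random error. Assume n iid observations. Denote $P_n$ as the empirical measure.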

With models (i) and (ii), we consider the maximum likelihood estimates (MLE), which are defined as the maximizers of $P_n l$, the empirical mean of the single-observation log-likelihood $l$. Under regularity conditions, the MLEs are consistent and asymptotically normally distributed. This result can be established using the standard M-estimation theories.
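
The objective being maximized can be written down in a few lines. The sketch below assumes the logistic link and the normal error described earlier; the function names and the arguments eta1 (first-part linear predictor on the logit scale), eta2 (second-part mean), and sigma (error standard deviation) are hypothetical names chosen for this example. Any general-purpose optimizer could then be applied to the mean log-likelihood; that step is omitted here.

```python
import math


def log_lik_one(y, eta1, eta2, sigma):
    """Two-part log-likelihood of one observation, up to an additive constant."""
    # log P(Y > 0) under the logistic link, computed in a numerically stable way.
    if eta1 >= 0:
        log_p = -math.log1p(math.exp(-eta1))
    else:
        log_p = eta1 - math.log1p(math.exp(eta1))
    # log P(Y = 0) = log(1 - p) = log(p) - eta1 for the logistic link.
    log_q = log_p - eta1
    if y > 0:
        # Nonzero value: binary part plus normal density of the residual.
        return log_p - math.log(sigma) - (y - eta2) ** 2 / (2.0 * sigma ** 2)
    return log_q


def mean_log_lik(ys, eta1s, eta2s, sigma):
    """Empirical mean of the log-likelihood over n observations."""
    terms = [log_lik_one(y, e1, e2, sigma)
             for y, e1, e2 in zip(ys, eta1s, eta2s)]
    return sum(terms) / len(terms)


print(mean_log_lik([0.0, 2.5, 0.0, 4.0], [0.0, 0.3, -0.2, 1.0],
                   [3.0, 3.0, 3.0, 3.5], 1.0))
```

Under a proportional model the two predictors share one index, so eta2 would be an intercept plus a scale parameter times the covariate part of eta1. Under a nonproportional model they are parameterized separately.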

With model (iii), we further assume that the $f_j$s are smooth functions (or more specifically, spline functions). This assumption has been motivated by the observation that the change of covariate values affects CAC development in a continuous manner. Following [15], we consider the penalized maximum likelihood estimate (PMLE) defined as the maximizer of
$$P_n l - \lambda \sum_{j=m+1}^{K} \int \left(f_j''(x)\right)^2 dx.$$
Here, $\lambda$ is the data-dependent tuning parameter and can be selected using the approach described in [12], [16]. $f_j''$ is the second-order derivative of $f_j$, and $\int (f_j''(x))^2 dx$ is the penalty on smoothness [15]. We assume that, (A1) belongs to a compact subset of ; and are bounded; (A2) The asymptotic variance matrix of the parametric parameters is non-singular and component-wise bounded; and (A3) . Under (A1)–(A3), the PMLEs of the $f_j$s are consistent, and the PMLEs of the parametric parameters are consistent and asymptotically normally distributed. This result can be proved using the empirical processes techniques described in [17].
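
To make the role of the smoothness penalty concrete, the sketch below approximates a roughness penalty of the commonly used form, the integral of the squared second derivative, by second differences on an equally spaced grid. The penalty form, the function name, and the grid size are assumptions made for this illustration; the exact penalty used in the article is the one given in [15].

```python
def roughness_penalty(f, lo, hi, n_grid=201):
    """Approximate the integral of f''(x)^2 over [lo, hi] by finite differences."""
    h = (hi - lo) / (n_grid - 1)
    vals = [f(lo + i * h) for i in range(n_grid)]
    total = 0.0
    for i in range(1, n_grid - 1):
        # Central second difference approximates f'' at an interior grid point.
        second_diff = (vals[i + 1] - 2.0 * vals[i] + vals[i - 1]) / h ** 2
        total += second_diff ** 2 * h
    return total


print(roughness_penalty(lambda x: 3.0 + 2.0 * x, 0.0, 1.0))  # linear effect
print(roughness_penalty(lambda x: x ** 2, 0.0, 1.0))          # curved effect
```

A linear effect incurs essentially no penalty, while a curved effect is penalized in proportion to its curvature. A larger tuning parameter therefore pulls the nonparametric estimates toward straight lines, and a smaller one lets them follow the data more closely.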

With model (iv), we adopt a similar estimation strategy and consider the PMLE defined as the maximizer of the analogous penalized log-likelihood, with smoothness penalties on the nonparametric covariate effects in both parts. Under conditions similar to (A1)–(A3), the PMLEs of the nonparametric parameters are consistent, and the PMLEs of the parametric parameters are consistent and asymptotically normally distributed.

With parametric models (i) and (ii), inference can be based on the asymptotic normality result and the Fisher information matrix. However, with semiparametric models (iii) and (iv), such an approach involves smoothed estimation and is very difficult to employ. We propose the following bootstrap approach for inference of all parameters in all models: (a) Fit the model and compute the MLEs (PMLEs); (b) With the observed covariate values, generate random errors from the normal distribution with mean zero and variance $\hat{\sigma}^2$, the estimated error variance; (c) Generate the binary indicators of nonzero values using the fitted logistic model. For those with a nonzero indicator, generate the continuous values; (d) With the generated responses, re-estimate the model; (e) Repeat steps (b)–(d) B (for example 500) times. We estimate the variances of the MLEs (PMLEs) using the variances of the estimates from the bootstrap samples.
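
The bootstrap steps can be sketched on a deliberately simplified model. The example below assumes an intercept-only two-part model with no covariates, so that the MLEs have closed forms (the share of nonzero values, and the mean and standard deviation of the nonzero values). This simplification, the function names, and the numeric settings are assumptions made to keep the example self-contained; with covariates, the fitting step would be replaced by the (penalized) likelihood maximization described above.

```python
import random
import statistics


def fit_intercept_only(ys):
    """Closed-form MLEs for an intercept-only two-part model (simplification)."""
    nonzero = [y for y in ys if y != 0.0]
    p_hat = len(nonzero) / len(ys)            # P(Y is nonzero)
    mu_hat = statistics.fmean(nonzero)        # mean of nonzero values
    sigma_hat = statistics.pstdev(nonzero)    # MLE uses divisor n, not n - 1
    return p_hat, mu_hat, sigma_hat


def simulate(n, p, mu, sigma, rng):
    """Draw indicators from the binary part, then normal values where nonzero."""
    return [mu + rng.gauss(0.0, sigma) if rng.random() < p else 0.0
            for _ in range(n)]


def bootstrap_se(ys, n_boot, rng):
    """Parametric bootstrap standard errors of (p_hat, mu_hat, sigma_hat)."""
    p_hat, mu_hat, sigma_hat = fit_intercept_only(ys)       # fit once
    draws = []
    for _ in range(n_boot):                                  # repeat B times
        # Generate errors, indicators, and responses from the fitted model.
        y_star = simulate(len(ys), p_hat, mu_hat, sigma_hat, rng)
        draws.append(fit_intercept_only(y_star))             # re-estimate
    return [statistics.stdev(column) for column in zip(*draws)]


rng = random.Random(0)
ys = simulate(400, 0.5, 3.0, 1.0, rng)
print(bootstrap_se(ys, n_boot=200, rng=rng))
```

The standard deviation of each parameter across the bootstrap fits serves as its standard error, and squaring it gives the bootstrap variance estimate.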

Results

Estimation Properties

We collect measurements on the following covariates, which have been suggested as possibly associated with CAC in various publications: gender (female is used as the reference group), race/ethnicity (Caucasian, Chinese, African-American, and Hispanic; Caucasian is used as the reference group), former smoker (binary indicator), current smoker (binary indicator), diabetes (binary indicator), SBP (systolic blood pressure), DBP (diastolic blood pressure), age, BMI (body mass index), LDL cholesterol, and HDL cholesterol. We consider the following parametric models: (i.1) model (i) with linear effects for all covariates; (i.2) model (i) with linear effects for all covariates plus quadratic effects for LDL and HDL; (ii.1) model (ii) with linear effects for all covariates; and (ii.2) model (ii) with linear effects for all covariates plus quadratic effects for LDL and HDL. Models (i.1) and (ii.1) are more commonly adopted in practice, whereas models (i.2) and (ii.2) have been motivated by the nonproportional semiparametric model, i.e., the “biggest model”, and were suggested by a reviewer. In semiparametric models (iii) and (iv), among the 13 covariates, 7 are binary and naturally have parametric effects. Our preliminary analysis also suggests parametric covariate effects for SBP and DBP. Thus, there are 9 parametric covariate effects and 4 nonparametric ones. In total, six models are considered.

For the covariates with parametric effects in all models, we show the MLEs (PMLEs) in Table 1. For all covariates, their estimates under different models have almost the same signs. Thus the biological conclusions on whether they are positively or negatively associated with CAC are the same in all models. However, the magnitudes of the estimates may be considerably different. For example, the estimates of the regression coefficients for in the linear parts are −0.151, −0.144, −0.291, −0.273, −0.141, and −0.285, respectively. For the covariates with nonparametric effects in models (iii) and (iv), we show the estimates and point-wise 95% confidence intervals in Figures 2–7. We note that the lines intersect at the mean of the X-axis since every fitted line has been mean-centered for identifiability. In addition, estimates under models (i) and (ii) are straight lines (i.e., parametric). Figures 2–7 show that the estimates of the Age and BMI effects under different models are reasonably close. It is interesting that under models (iii) and (iv), the nonparametric estimates of the Age and BMI effects are close to linear functions. However, the estimates of the HDL and LDL effects under different models are significantly different, with the HDL effect in models (i.2), (ii.2), (iii) and (iv) and the LDL effect in models (i.2), (ii.2) and (iv) significantly deviating from straight lines.