Daniel Iong (University of Michigan Department of Statistics) daniong@umich.edu
Qingyuan Zhao (University of Pennsylvania Department of Statistics) qyzhao@wharton.upenn.edu
HDL comprises heterogeneous subpopulations of discrete particles that differ in apolipoprotein and lipid composition. Previous Mendelian randomization (MR) studies reported heterogeneous associations between genetically determined HDL cholesterol (HDL-C) and coronary artery disease (CAD) (Voight et al. 2012; Zhao et al. 2018). Our research questions are:
We constructed a dataset using the GWAS summary data published by GLGC, CARDIoGRAMplusC4D and Kettunen et al. (2016). We used the GWAS results of Teslovich et al. (2010) to select independent instruments that are associated with HDL-C (p-value \(\leq 5 \times 10^{-8}\)) and then obtained the estimated associations of the SNPs with HDL-C, CAD and lipoprotein subfractions using the GWAS results of Kettunen et al. (2016) and Nikpay et al. (2015). The table below summarizes how the data were obtained.
Purpose | Phenotype | Data source/Citation |
---|---|---|
Instrument selection | HDL-C | GLGC (Teslovich et al. 2010) |
SNP association with exposure | HDL-C | Kettunen et al. (2016) |
SNP association with outcome | CAD | CARDIoGRAMplusC4D (Nikpay et al. 2015) |
Explore heterogeneity | Lipoprotein subfractions | Kettunen et al. (2016) |
For the first research question, we introduced a mixture model to account for effect heterogeneity of the instruments. This model has several components:
Measurement error: The SNP-exposure and SNP-outcome effects are observed with measurement error of known variance around unknown true effects.
Independence: All the estimated SNP associations are mutually independent. The independence between the SNP-exposure and SNP-outcome effect for a given SNP is guaranteed if they are computed using non-overlapping samples. Independence across SNPs is a reasonable assumption if the selected SNPs are in linkage equilibrium.
Effect heterogeneity: We assume each instrument may estimate a different “causal” effect and the instrument-specific effects follow a Gaussian mixture distribution.
Assumptions 1 & 2 are standard in MR analysis, see for example Zhao et al. (2018). Assumption 3 differs from the standard MR assumption and the mixture distribution captures the idea that instruments on the same genetic pathway may correspond to similar causal effect estimates. Small and balanced horizontal pleiotropy is captured by the within-cluster variation.
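The generative model implied by Assumptions 1-3 can be sketched with a short simulation. This is an illustrative numpy sketch, not the code used in our analysis; the function name and parameter values are ours.

```python
import numpy as np

def simulate_snps(p, m_x, lam_x, pi, mu, sigma, se_x, se_y, rng):
    """Simulate GWAS summary statistics from the mixture model.

    mu_X ~ N(m_x, lam_x^2)                 true SNP-exposure effects
    Z ~ Categorical(pi)                    mixture component assignments
    beta | Z = k ~ N(mu[k], sigma[k]^2)    instrument-specific "causal" effects
    X ~ N(mu_X, se_x^2), Y ~ N(beta * mu_X, se_y^2)   observed effects
    """
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    mu_X = rng.normal(m_x, lam_x, size=p)
    z = rng.choice(len(pi), size=p, p=pi)
    beta = rng.normal(mu[z], sigma[z])
    X = rng.normal(mu_X, se_x)           # measurement error with known SE
    Y = rng.normal(beta * mu_X, se_y)
    return X, Y, z, beta, mu_X

rng = np.random.default_rng(0)
X, Y, z, beta, mu_X = simulate_snps(
    p=200, m_x=0.1, lam_x=0.05, pi=[0.7, 0.3],
    mu=[0.0, 0.5], sigma=[0.05, 0.05], se_x=0.01, se_y=0.01, rng=rng)
```

Plotting the simulated \((X_i, Y_i)\) pairs shows two clouds of points with different slopes, the pattern the modal plot and scatterplot diagnostics below are designed to reveal.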
Visual inspection of effect heterogeneity: We first use a modal plot developed in Wang et al. (forthcoming, implemented in the modal.plot function from the mr.raps package) to visualize the data. This plot shows a robustified log-likelihood function over the causal effect and exhibits multiple modes when there is clustered effect heterogeneity. Visual inspection of the scatterplot of estimated SNP-exposure against SNP-outcome effects can also suggest heterogeneity among variant-specific estimates.
Bayesian inference for the mixture model: We used a Monte-Carlo EM algorithm to estimate the prior parameters in the mixture model and then obtained (empirical Bayes) posterior distribution of the instrument-specific effects. See the Technical Appendix below for more detail.
Examine heatmap of lipoprotein subfractions: We ranked the variants by the posterior mean of their instrument-specific effects and plotted a heatmap of the associations of the SNPs with lipoprotein subfractions. This heatmap may provide insight into whether groups of SNPs with similar instrument-specific effects correspond to distinct biological mechanisms.
We fit our model with 2 mixture components to the data discussed above and provide interactive visualizations for the model output. We chose 2 mixture components because there appear to be 2 modes in the modal plot from mr.raps (plotted below).
In the scatterplot below, the slopes of the colored solid lines are the estimated cluster means and the shaded regions are the associated 68% confidence intervals. Hovering over a point displays the RSID of the SNP and its posterior probabilities of belonging to each cluster. Each point is colored according to the cluster to which it most likely belongs. Furthermore, hovering over the solid lines displays the mixture component parameters.
The second plot displays the 95% posterior interval of the “causal” effect for each SNP. Hovering over an interval displays additional information about the SNP (chromosome, gene, BP). The third plot is a heatmap of the p-values for the associations between the SNPs and the lipoprotein subfraction traits. Negative associations are colored in red. Hovering over each box displays the SNP, trait, and p-value. One can also zoom in on specific portions by dragging a box across the heatmap. These plots were created with the plotly R package.
The 2 SNPs with the highest (posterior) probability of belonging to the 2nd cluster (rs1532085, rs588136) are both in chromosome 15 and are close to the LIPC gene.
rs1532085, rs588136, and rs174546 (cluster 2) are positively associated with all of the large and extra-large HDL subfraction traits (p-value \(\leq 10^{-11}\)).
rs1532085, rs588136, and rs174546 are negatively associated with concentration of small HDL particles (p-value \(\leq 10^{-3}\)). rs1532085 and rs588136 are negatively associated with total lipids in small HDL (p-value \(\leq 10^{-14}\)).
rs1532085, rs588136, and rs174546 are positively associated with LDL diameter (p-value \(\leq 10^{-7}\)).
rs1532085 and rs588136 are positively associated with all of the very small VLDL subfraction traits (p-value \(\leq 10^{-14}\)). rs174546 is positively associated with total lipids and phospholipids in very small VLDL (p-value \(\leq 10^{-3}\)).
rs1532085, rs588136, and rs174546 are positively associated with all of the IDL subfraction traits (p-value \(\leq 10^{-4}\)).
The specification of the model we fit in our analysis is given below.
For SNP \(i = 1,\dots,p\),
\[ \begin{align*} \mu_{X_i} & \sim N(m_x, \lambda_x^2) \\ Z_i & \sim \text{Categorical}(\pi_1,\dots,\pi_K) \\ \beta_i | Z_i = k & \sim N(\mu_k, \sigma_k^2) \\ \begin{pmatrix} X_i \\ Y_i \end{pmatrix} | \beta_i, \mu_{X_i} & \sim N \Big( \begin{pmatrix} \mu_{X_i} \\ \beta_i \mu_{X_i} \end{pmatrix}, \begin{pmatrix} \sigma_{X_i}^2 & 0 \\ 0 & \sigma_{Y_i}^2 \end{pmatrix} \Big) \end{align*} \]
Variables | Type | Description |
---|---|---|
\(\mu_{X_i}\) | Latent (continuous) | True SNP-exposure effect |
\(Z_i\) | Latent (categorical) | Mixture component assignment for SNP \(i\) |
\(\beta_i\) | Latent (continuous) | True variant-specific causal effect |
\(m_x\) | Estimated parameter (continuous) | Mean of the true SNP-exposure effects |
\(\lambda_x\) | Estimated parameter (continuous) | Standard deviation of the true SNP-exposure effects |
\(\pi_k\) | Estimated parameter (continuous) | Mixture component proportions |
\(\mu_k\) | Estimated parameter (continuous) | Mixture component means |
\(\sigma_k\) | Estimated parameter (continuous) | Mixture component standard deviations |
\(X_i\) | Observed (continuous) | Observed SNP-exposure effects |
\(Y_i\) | Observed (continuous) | Observed SNP-outcome effects |
\(\sigma_{X_i}\) | Known constant | Standard errors for observed SNP-exposure effects |
\(\sigma_{Y_i}\) | Known constant | Standard errors for observed SNP-outcome effects |
\(K\) | Fixed | Number of mixture components |
Let \(\theta = (m_x, \lambda_x, \{\pi_k, \mu_k, \sigma_k : k = 1,\dots,K\})\) denote the parameters we want to estimate.
Obtain Empirical Bayes estimates \(\hat{\theta}\) of \(\theta\) using a Monte-Carlo EM algorithm (details provided below).
Sample from \(P(\mu_{X_i}, \beta_i, Z_i | X_i, Y_i, \hat{\theta})\), the posterior distribution of the latent variables given \(\hat{\theta}\) (details provided below).
We will use importance sampling (Liu 2004) to sample from \(P(\mu_{X_i}, \beta_i, Z_i | X_i, Y_i, \theta)\). Let \(\phi(\cdot ; \mu, \sigma^2)\) denote the Gaussian density with mean \(\mu\) and variance \(\sigma^2\).
First, we can re-write the latent variable posterior as
\[ \begin{align*} P(\beta_i, Z_i = k, \mu_{X_i} | X_i, Y_i, \theta) & = P(\mu_{X_i}| X_i, Y_i, \theta) P(\beta_i | \mu_{X_i}, Y_i, \theta) P(Z_i = k | \beta_i, \theta) \\ & = P(\mu_{X_i} | X_i, Y_i, \theta) \Big[ \sum_{j=1}^K \tilde{\pi}_{ij} \phi(\beta_i; \tilde{\mu}_{ij}, \tilde{\sigma}_{ij}^2) \Big] \Big[ \frac{\pi_k \phi(\beta_i; \mu_k, \sigma_k^2)}{\sum_{j=1}^K \pi_j \phi(\beta_i; \mu_j, \sigma_j^2)} \Big] \\ \end{align*} \] where \(\tilde{\pi}_{ij} \propto \pi_j \phi(Y_i; \mu_{X_i}\mu_j, \mu_{X_i}^2 \sigma_j^2 + \sigma_{Y_i}^2)\) (normalized to sum to one over \(j\)) and
\[ \begin{align*} \tilde{\sigma}_{ik}^2 &= \Big(\frac{1}{\sigma_k^2} + \frac{\mu_{X_i}^2}{\sigma_{Y_i}^2}\Big)^{-1} \\ \tilde{\mu}_{ik} & = \tilde{\sigma}_{ik}^2 \Big( \frac{Y_i \mu_{X_i}}{\sigma_{Y_i}^2} + \frac{\mu_k}{\sigma_k^2} \Big) \end{align*} \]
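The conjugate Gaussian update above can be verified numerically. The sketch below (illustrative parameter values, not from the analysis) compares the closed-form mean of \(\beta_i \mid \mu_{X_i}, Y_i, Z_i = k\) against a grid approximation of the same density:

```python
import numpy as np

def conditional_beta_params(mu_X, Y, mu_k, sigma_k, se_y):
    """Closed-form parameters of beta_i | mu_X_i, Y_i, Z_i = k."""
    var = 1.0 / (1.0 / sigma_k**2 + mu_X**2 / se_y**2)     # sigma_tilde^2
    mean = var * (Y * mu_X / se_y**2 + mu_k / sigma_k**2)  # mu_tilde
    return mean, var

# Illustrative values (not from the analysis)
mu_X, Y, mu_k, sigma_k, se_y = 0.12, 0.05, 0.3, 0.2, 0.04
mean, var = conditional_beta_params(mu_X, Y, mu_k, sigma_k, se_y)

# Grid approximation of the same unnormalized conditional density
beta = np.linspace(-3, 3, 200001)
log_post = (-(Y - beta * mu_X) ** 2 / (2 * se_y**2)
            - (beta - mu_k) ** 2 / (2 * sigma_k**2))
w = np.exp(log_post - log_post.max())
grid_mean = np.sum(w * beta) / np.sum(w)
```

The grid mean agrees with the closed form to numerical precision, confirming the completing-the-square step.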
Thus, we can sample from \(P(\mu_{X_i} | X_i, Y_i, \theta)\) using an importance proposal distribution, then sequentially sample from \(P(\beta_i | \mu_{X_i}, Y_i, \theta)\) and \(P(Z_i = k | \beta_i, \theta)\) directly, and weight the samples according to the importance weights.
\[ \begin{align*} P(\mu_{X_i} | X_i, Y_i, \theta) & \propto P(\mu_{X_i} | X_i, \theta) P(Y_i | \mu_{X_i}, \theta) \\ & = \phi(\mu_{X_i}; \tilde{m}_{X_i},\tilde{\lambda}_{X_i}^2) \Big[ \sum_{k=1}^K \pi_k \phi(Y_i; \mu_{X_i}\mu_k, \mu_{X_i}^2 \sigma_k^2 + \sigma_{Y_i}^2) \Big] =: \pi(\mu_{X_i}) \\ \end{align*} \]
where
\[ \begin{align*} \tilde{\lambda}_{X_i}^2 &= \Big(\frac{1}{\sigma_{X_i}^2} + \frac{1}{\lambda_x^2}\Big)^{-1} \\ \tilde{m}_{X_i} & = \tilde{\lambda}_{X_i}^2 \Big(\frac{X_i}{\sigma_{X_i}^2} + \frac{m_x}{\lambda_x^2}\Big) \end{align*} \]
We chose to use \(g(\mu_{X_i}) = P(\mu_{X_i} | X_i, \theta)\) as the proposal distribution with associated importance weights
\[ w_i = \frac{\pi(\mu_{X_i})}{g(\mu_{X_i})} = \sum_{k=1}^K \pi_k \phi(Y_i; \mu_{X_i}\mu_k, \mu_{X_i}^2 \sigma_k^2 + \sigma_{Y_i}^2) \] Since \(\phi(Y_i; \mu_{X_i}\mu_k, \mu_{X_i}^2 \sigma_k^2 + \sigma_{Y_i}^2) \leq \frac{1}{\sqrt{2\pi \sigma_{Y_i}^2}}\), \[ 0 \leq w_i \leq \frac{1}{\sqrt{2\pi \sigma_{Y_i}^2}} \] Therefore, the importance weights are bounded.
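Putting these formulas together, the importance sampler for a single SNP can be sketched as follows. This is a numpy-only illustration (function names are ours); for convenience it draws \(Z_i \mid \mu_{X_i}, Y_i\) first and then \(\beta_i \mid Z_i, \mu_{X_i}, Y_i\), which yields the same joint draw as sampling \(\beta_i\) first.

```python
import numpy as np

def norm_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def importance_sample(X, Y, se_x, se_y, m_x, lam_x, pi, mu, sigma, n, rng):
    """Weighted posterior draws of (mu_X, Z, beta) for a single SNP.

    Proposal: g(mu_X) = P(mu_X | X, theta) = N(m_tilde, lam_tilde^2).
    Weights:  w = sum_k pi_k phi(Y; mu_X mu_k, mu_X^2 sigma_k^2 + se_y^2),
    bounded above by 1 / sqrt(2 pi se_y^2).
    """
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    lam2 = 1.0 / (1.0 / se_x**2 + 1.0 / lam_x**2)     # lam_tilde^2
    m = lam2 * (X / se_x**2 + m_x / lam_x**2)         # m_tilde
    mu_X = rng.normal(m, np.sqrt(lam2), size=n)

    comp = pi * norm_pdf(Y, np.outer(mu_X, mu),
                         np.outer(mu_X**2, sigma**2) + se_y**2)
    w = comp.sum(axis=1)                              # importance weights

    # Draw Z | mu_X, Y (tilted component weights), then beta | Z, mu_X, Y
    probs = comp / w[:, None]
    z = np.array([rng.choice(len(pi), p=p_ / p_.sum()) for p_ in probs])
    var_b = 1.0 / (1.0 / sigma[z]**2 + mu_X**2 / se_y**2)
    beta = rng.normal(var_b * (Y * mu_X / se_y**2 + mu[z] / sigma[z]**2),
                      np.sqrt(var_b))
    return mu_X, z, beta, w

rng = np.random.default_rng(1)
mu_X, z, beta, w = importance_sample(
    X=0.1, Y=0.02, se_x=0.01, se_y=0.01, m_x=0.1, lam_x=0.05,
    pi=[0.6, 0.4], mu=[0.0, 0.5], sigma=[0.1, 0.1], n=2000, rng=rng)
```

Because the weights are bounded, the self-normalized importance-sampling estimates of posterior expectations have finite variance.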
Briefly, the EM algorithm (Dempster, Laird, and Rubin 1977) is an iterative algorithm for computing maximum likelihood estimates in latent variable models. It consists of two steps:
E Step: Compute the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variables, given the parameter estimates from the previous iteration.
M Step: Maximize the expectation computed in the E step with respect to the parameters.
In our model, the expectation in the E step does not have an analytical form so we resort to Monte-Carlo methods to approximate it (Levine and Casella 2001).
The complete-data log-likelihood is given by
\[ \begin{align*} l(\theta | \mathbf{X}, \mathbf{Y}, \{\beta_i\}, \{Z_i\}, \{\mu_{X_i}\}) & := \sum_{i=1}^p l(\theta | X_i, Y_i, \beta_i, Z_i, \mu_{X_i}) \\ & = \sum_{i=1}^p \Big\{ \text{log} \phi(\mu_{X_i}; m_x, \lambda_x^2) + \sum_{k=1}^K Z_{ik} \big[\text{log} \pi_k + \text{log} \phi(\beta_i; \mu_k, \sigma_k^2) \big] \Big\} \end{align*} \] where \(Z_{ik} = 1\) if \(Z_i = k\) and \(Z_{ik} = 0\) otherwise.
To approximate the E-step expectation, we approximate \(E(\mu_{X_i} | X_i, Y_i, \theta)\), \(E(\mu_{X_i}^2 | X_i, Y_i, \theta)\), \(E(Z_{ik} \beta_i | X_i, Y_i, \theta)\), \(E(Z_{ik} \beta_i^2 | X_i, Y_i, \theta)\), and \(E(Z_{ik} | X_i, Y_i, \theta)\) using the importance sampling procedure described in the previous section. After obtaining the importance sampling estimates, the M step is straightforward so we omit the details here.
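Given Monte-Carlo estimates of these expectations, the M step reduces to the standard weighted Gaussian-mixture updates. A sketch in our notation (the function name and array layout are ours), with per-SNP expectations passed as arrays:

```python
import numpy as np

def m_step(E_muX, E_muX2, E_Z, E_Zb, E_Zb2):
    """Closed-form M-step updates from Monte-Carlo E-step expectations.

    E_muX, E_muX2    : (p,) arrays of E[mu_Xi], E[mu_Xi^2]
    E_Z, E_Zb, E_Zb2 : (p, K) arrays of E[Z_ik], E[Z_ik beta_i], E[Z_ik beta_i^2]
    """
    p = len(E_muX)
    m_x = E_muX.mean()
    # lam_x^2 = average of E[(mu_Xi - m_x)^2]
    lam_x2 = (E_muX2 - 2 * m_x * E_muX + m_x**2).mean()
    n_k = E_Z.sum(axis=0)                      # expected cluster sizes
    pi = n_k / p
    mu = E_Zb.sum(axis=0) / n_k
    # sigma_k^2 = sum_i E[Z_ik (beta_i - mu_k)^2] / n_k
    sigma2 = (E_Zb2 - 2 * mu * E_Zb + mu**2 * E_Z).sum(axis=0) / n_k
    return m_x, np.sqrt(lam_x2), pi, mu, np.sqrt(sigma2)
```

Each MCEM iteration alternates the importance-sampling E step above with this M step until the parameter estimates stabilize.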
Code to replicate the results in this analysis is available at https://github.com/danieliong/MRDataChallenge2019. An implementation of the Monte-Carlo EM algorithm is available at https://github.com/danieliong/MR-MCEM.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1). [Royal Statistical Society, Wiley]: 1–38. http://www.jstor.org/stable/2984875.
Kettunen, Johannes, Ayşe Demirkan, Peter Würtz, Harmen HM Draisma, Toomas Haller, Rajesh Rawal, Anika Vaarhorst, et al. 2016. “Genome-Wide Study for Circulating Metabolites Identifies 62 Loci and Reveals Novel Systemic Effects of LPA.” Nature Communications 7. Nature Publishing Group: 11122.
Levine, Richard A., and George Casella. 2001. “Implementations of the Monte Carlo EM Algorithm.” Journal of Computational and Graphical Statistics 10 (3). [American Statistical Association, Taylor & Francis, Ltd., Institute of Mathematical Statistics, Interface Foundation of America]: 422–39. http://www.jstor.org/stable/1391097.
Liu, Jun S. 2004. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-76371-2.
Nikpay, Majid, Anuj Goel, Hong-Hee Won, Leanne M Hall, Christina Willenborg, Stavroula Kanoni, Danish Saleheen, et al. 2015. “A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease.” Nature Genetics 47 (10). Europe PMC Funders: 1121–30. https://doi.org/10.1038/ng.3396.
Teslovich, Tanya M, Kiran Musunuru, Albert V Smith, Andrew C Edmondson, Ioannis M Stylianou, Masahiro Koseki, James P Pirruccello, et al. 2010. “Biological, clinical and population relevance of 95 loci for blood lipids.” Nature 466 (7307). NIH Public Access: 707–13. https://doi.org/10.1038/nature09270.
Voight, Benjamin F, Gina M Peloso, Marju Orho-Melander, Ruth Frikke-Schmidt, Maja Barbalic, Majken K Jensen, George Hindy, et al. 2012. “Plasma HDL Cholesterol and Risk of Myocardial Infarction: A Mendelian Randomisation Study.” Lancet 380 (9841). Elsevier: 572–80.
Zhao, Qingyuan, Yang Chen, Jingshu Wang, and Dylan S Small. 2018. “Powerful three-sample genome-wide design and robust statistical inference in summary-data Mendelian randomization.” International Journal of Epidemiology, to appear. https://arxiv.org/pdf/1804.07371.pdf.