MR Data Challenge 2019

Participants:

Borges, Maria Carolina 1,2
Haycock, Philip 1,2
Hingorani, Aroon 3,4
Lawlor, Deborah 1,2

MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK
Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
UCL Institute of Cardiovascular Science, University College London, London, UK
Farr Institute, University College London, London, UK

Motivation:

Improving reliability of MR studies testing the effect of molecular phenotypes:

The case of omega-3 fatty acids and cardiovascular disease risk

One major limitation of Mendelian randomization (MR) is the unprovable assumption that the association between genetic instrument and outcome is entirely mediated by the exposure (i.e. “vertical pleiotropy”) (Figure 1A) and not by direct effects of the genetic instrument (i.e. hereafter “spurious pleiotropy”) (Figures 1B-1D).

A large proportion of recently developed methods aim at accounting for spurious pleiotropy to improve the reliability of MR results. However, these methods ideally require that many genetic instruments for the exposure are available, which is often not the case when the exposure is a molecular phenotype (e.g. gene expression levels, DNA methylation, protein or metabolite).

In the context of MR, spurious pleiotropy can arise from three main scenarios:

Counfounding by linkage disequilibrium: the genetic instrument influences the exposure and is correlated (i.e. in linkage disequilibrium) with another genetic variant that influences the outcome independently (Figure 1B)
Reverse causation: the genetic instrument primarily influences the outcome, which subsequently affects the exposure (Figure 1C)
Horizontal pleiotropy: the genetic instrument influences exposure and outcome via two independent biological pathways (Figure 1D)

We propose a framework combining methods developed within and outside the MR community to explore robustness of MR findings to spurious pleiotropy when only one or a few genetic instruments are available for a particular exposure.

Briefly, the framework aims to assess the plausibility that any of the above-mentioned spurious pleiotropic mechanisms introduce bias in MR findings as follows:

Confounding by linkage disequilibrium (Figure 1B) can be tested using genetic colocalisation analysis.
Reverse causation (Figure 1C) can be investigated using MR methods to test whether the outcome is likely to affect the exposure (hereafter reverse MR). Evidence of bidirectional effects could indicate a genuine bidirectional relation or, alternatively, wrong inferences on the causal direction of the relation between exposure and outcome, as detailed in Analysis methods (item 3.2).
Horizontal pleiotropy (Figure 1D) can be explored by using genomic annotation tools and integrating high-dimensional omics data to improve knowledge on biological mechanisms linking genetic variants, exposure, and outcome.

As a motivating example, we have used this framework to assess the causal role of circulating docosahexaenoic acid (DHA) on coronary artery disease (CAD) risk. For several decades, this long chain omega-3 fatty acid has been considered a protective factor against CAD risk. However, recent systematic reviews of observational studies and randomised controlled trials (RCTs) have cast doubt on long-held beliefs concerning the cardioprotective effects of DHA (Abdelhamid et al., 2018).

Figure 1. Diagrams representing the putative causal relations between genetic instrument, exposure (DHA) and outcome (CAD risk)

Data:

Summary data for the association of genetic instrument with circulating DHA from Kettunen et al. (2016) was extracted from MR-Base (Hemani et al., 2018)

##     author consortium  id sample_size     nsnp     pmid population
## 1 Kettunen       <NA> 852       13499 11415610 27005778   European
##                 sex                      trait unit
## 1 Males and females 22:6, docosahexaenoic acid   SD

Summary data for the association of genetic instrument with CAD risk from Nikpay et al.(2015) was extracted from MR-Base (Hemani et al., 2018)

##   author        consortium id ncase ncontrol    nsnp     pmid population
## 1 Nikpay CARDIoGRAMplusC4D  7 60801   123504 9455779 26343387      Mixed
##                 sex                  trait     unit
## 1 Males and females Coronary heart disease log odds

Analysis methods:

Mendelian randomization analyses were performed with R software version 3.5.1 (R Foundation for Statistical Computing).

1. Selection of genetic instruments for DHA

Genetic variants were selected if strongly associated with circulating DHA ($P < 5*10^{-8}$) and if not in linkage disequilibrium ($R^{2} < 0.001$).

2. Main analysis

Mendelian randomization estimates were calculated using Wald ratios for each genetic variant and the inverse-variance weighted (IVW) method for the combined set of genetic instruments.

3. Sensitivity analyses

We conducted a series of sensitivity analyses to explore whether results from the main analyses could be biased by the spurious pleiotropic mechanisms shown in Figures 1B-D.

3.1. Counfounding by linkage disequilibrium

Genetic colocalisation was used to test whether associations with exposure and outcome were driven by a shared genetic factor. While not sufficient, sharing causal variants is a necessary condition for two traits to be causally related. Thus, genetic colocalisation helps to mitigate the possibility of bias due to spurious pleiotropy driven by two separate (but correlated) variants in a given genomic region (Figure 1B).

The genetic colocalisation was performed using moloc (Giambartolomei et al., 2018), a Bayesian method where posterior probabilities are computed for each possible configuration, and these probabilities can be summed over all configurations and combined with the prior probabilities to assess the support for each hypothesis. In the case of two traits, there are five possible hypotheses:

H0: No association with either trait
H1: Association with DHA, not with CAD risk
H2: Association with CAD risk, not with DHA
H3: Association with both traits by two independent genetic variants
H4: Association with both traits by one shared genetic variant

Evidence of genetic colocalisation between the two traits was considered if there was a high posterior probability (PPA) for H4 and low PPA for H3.

Total variance was set at 0.15 for the continuous trait and 0.2 for the binary trait. Prior probability that a SNP is associated with one or two traits was set to $P = 1*10^{-4}$ and $P = 1*10^{-6}$, respectively.

3.2. Reverse causation

With sufficiently large sample sizes, a genetic variant associated with an outcome through a mediating exposure could reach the conventional threshold for statistical significance for both outcome and exposure. Therefore, using such thresholds to select instruments could lead to situations where the genetic instrument influences the hypothesised exposure via the hypothesised outcome (i.e. the hypothesised outcome actually has a causal effect on the hypothesised exposure and not vice versa) (Figure 1C). This could lead to wrong inferences on the causal direction of the relation between the two traits and reverse MR can help addressing whether this is a likely source of bias in the analysis.

We conducted a reverse MR analysis to explore the effect of genetic predisposition to CAD on circulating DHA. To do that, we have selected genetic variants strongly and independently associated with CAD risk using the same criteria described in item 1 (“Selection of genetic instruments for DHA”). MR estimates were calculated using the IVW method and several other MR methods that are more robust to horizontal pleiotropy (i.e. MR-Egger, weighted median and weighted mode estimator).

3.3 Horizontal pleiotropy

Horizontal pleiotropy (Figure 1D) is one of main sources of invalid instruments since it is a widespread biological phenomenon. Given valid instruments are expected to produce consistent estimates of the causal effect, most current MR methods attempt to address horizontal pleiotropy by directly or indirectly accounting for statistical heterogeneity in single instrument MR effect estimates.

When only one or few instruments are available, the application of these methods can be impossible, misleading or of limited usefulness. Therefore, in this context, we propose that horizontal pleiotropy should be investigated by using knowledge on the biological mechanisms linking genetic variants, exposure and outcome, which can be informed by genomic annotation and by integrating high-dimensional omics data.

We have adopted two approaches to gain insight into the biological mechanism underpinning the association of the genetic instruments with exposure and outcome:

Genomic annotation: We annotated the selected genetic variants (and variants in linkage disequilibrium with those) using FUMA (Watanabe et al., 2017), an integrated platform for functional annotation that uses information from multiple databases. This includes protein-coding annotation, which focuses on the impact of the genetic variant on protein structure, and non-protein-coding annotation, which focuses on genetic variation that are often involved in gene regulation.
Integrating omics data:

First, we conducted a hypothesis-free test to examine the association of each genetic instrument with gene expression levels. We searched the GTEx portal data to identify cis-expression quantitative trait loci ( cis-eQTLs) among the selected genetic variants (FDR <= 5%).
Then, we checked for the association of genetic variants across all metabolic traits included in Kettunen et al. (2016) GWAS. To account for multiple testing, we considered a p-value < 0.0005 (0.05/22/5) as evidence of association, where 22 is the number of principal components explaining 95% of the phenotypic variation in the metabolomics data (Kettunen et al., 2016) and five is the number of assessed genetic variants. The metabolites showing evidence of association with any of the genetic variants were carried forward to the multi-trait colocalisation analysis (Foley et al., 2019), which is a computationally efficient extension of moloc (Giambartolomei et al., 2018). The multi-trait colocalisation analysis was conducted using default and uniform priors as described by Burgess et al. (2019), given that the default prior can be overly conservative.

To further explore the possibility of MR findings being biased by horizontal pleiotropy, we conducted two additional sensitivity analyses:

Adjusting main MR estimates by total fatty acids using multivariable IVW
Restricting MR analysis to genetic instruments close to genes directly related to fatty acids metabolism (i.e. FADS and ELOVL)

Results:

1. Selection of genetic instruments for DHA

Six independent genetic variants were selected, explaining from 0.25% to 0.78% of variance in circulating DHA individually and 2.4% when combined.

##           SNP effect_allele.exposure other_allele.exposure eaf.exposure
## 1   rs2281591                      A                     G     0.866280
## 2    rs174546                      C                     T     0.597151
## 3  rs11604424                      C                     T     0.243328
## 4    rs261334                      G                     C     0.230871
## 5 rs145717049                      C                     T     0.955942
## 6 rs143988316                      C                     T     0.930513
##   beta.exposure se.exposure pval.exposure units.exposure rsquared.exposure
## 1      0.108394    0.018174      3.66e-09             SD       0.002722044
## 2      0.127635    0.012483      4.81e-24             SD       0.007837833
## 3      0.083110    0.014241      7.84e-09             SD       0.002543527
## 4      0.110247    0.014749      1.44e-13             SD       0.004316504
## 5      0.201292    0.032750      1.21e-09             SD       0.003413024
## 6      0.150045    0.024351      1.10e-09             SD       0.002911381

sum(exp_dat$rsquared.exposure)

## [1] 0.02374431

One of the variants (rs145717049) could not be found in the CAD GWAS (neither a proxy variant in strong linkage disequilibrium - i.e. $R^{2} > 0.8$). Therefore, this variant was not considered in further analysis.

2. Main analysis

Results from the main MR analysis indicated that higher circulating DHA is related to higher CAD risk (Figure 2), which is conflict with evidence from observational studies and RCTs pointing to either a null or protective effect of DHA on CAD risk (Abdelhamid et al., 2018).

Figure 2. Mendelian randomization estimates for the effect of DHA on CAD risk

Note: Results are expressed as log odds of CAD per standard unit increase in circulating DHA.

3. Sensitivity analyses

3.1. Linkage disequilibrium

In the five assessed genomic regions, PPA for both H3 and H4 was very low, and, therefore, there was no clear evidence supporting shared (or distinct) causal variants influencing DHA and CAD risk. Instead, there was strong evidence that individual genomic regions were strongly associated with circulating DHA only (i.e. PPA for H1 ranged from 89% to 99%) (Figure 3). This might be the result of limited power in the colocalization analysis to detect an association between variants and CAD risk in single genomic regions.

Figure 3. Posterior probabilities (PPA) for each hypothesis tested in the colocalisation of DHA and CAD risk across DHA genomic regions

Note: Names for genomic regions were assigned according to the gene closest to the selected genetic instrument in the region.

These results are broadly consistent with the visual inspection of volcano plots as shown in Figure 4.

$Figure 4. P-values (log 10 units) for the association of genetic variants with DHA and CAD risk across DHA genomic regions$

Figure 4. P-values (log 10 units) for the association of genetic variants with DHA and CAD risk across DHA genomic regions

3.2. Reverse causation

Forty one genetic variants were selected as instruments for CAD genetic predisposition. Two variants were removed for being ambiguous palindromic SNPs (MAF > 42%) and, therefore, 39 variants were used in further analyses.

The IVW method indicated that higher predisposition to CAD is related to higher circulating DHA.

##                      method nsnp          b         se        pval
## 1 Inverse variance weighted   39 0.09225259 0.03174137 0.003656332

Note: Results are expressed as standardised units of circulating DHA per log odds of CAD.

The evidence from MR findings of a bidirectional positive effect between DHA and CAD risk could indicate that results from main analysis might be biased due to inclusion of invalid instruments affecting primarily CAD predisposition (or its risk factors). Alternatively, there might be a genuine bidirectional relation between DHA and CAD risk.

Overall, the effect estimate was partly attenuated when using MR methods that are usually more robust to horizontal pleiotropy.

##            method nsnp          b         se      pval
## 1        MR Egger   39 0.08632377 0.08121313 0.2947027
## 2 Weighted median   39 0.04199586 0.04112948 0.3072238
## 3   Weighted mode   39 0.01111937 0.05741033 0.8474561

Note: Results are expressed as standardised units of circulating DHA per log odds of CAD.

3.3 Horizontal pleiotropy

Genomic annotation

Most genetic instruments were located in introns (rs2281591, rs11604424, and rs261334), one was intergenic (rs143988316) and the other was a 3 Prime UTR variant (rs174546).

##           SNP chr        bp Closest.gene            location
## 1   rs2281591   6  10990493       ELOVL2              intron
## 2    rs174546  11  61569830        FADS1 3 Prime UTR Variant
## 3  rs11604424  11 116651115         ZPR1              intron
## 4    rs261334  15  58726744         LIPC              intron
## 5 rs143988316  19  19667254         PBX4          intergenic

Genetic variants in linkage disequilibrium with the genetic instruments were largely intronic or intergenic and the ones located in exons were mostly synonymous (Figure 5).

Figure 5. Functional consequences of SNPs in linkage disequilibrium with DHA genetic instruments

These results are broadly consistent with these variants affecting phenotypes through regulatory mechanism, such as changes in gene expression.

Integrating omics data

Gene expression

Four out of the five genetic variants were identified as cis-eQTLs (Figure 6).

One of the variants (rs174546) had strong and opposite effects on FADS1 and FADS2 across several tissues, both genes code for key enzymes in omega-3 and omega-6 fatty acids metabolism. This SNP also affects the expression of FADS3 (liver and cerebellum), MYRF (esophagus) and TMEM258 (several tissues).

Another variant (rs2281591) is associated with increased expression of ELOLV2-AS1, the antisense RNA 1 for ELOLV2 (spinal cord), which codes for a key enzyme in the biosynthetic pathway of long chain polyunsaturated fatty acids such as DHA, and with SYCP2L (testis).

The variant rs143988316 was related to a decrease in the expression of ATP13A1 (nerve and subcutaneous adipose tissue), GATAD2A (whole blood and esophagus), and an increase in MAU2 (whole blood and skeletal muscle).

Lastly, the variant rs261334 was related to decreased expression of LIPC (liver and pancreas), which encodes a triglyceride lipase that acts in liver as triglyceride hydrolase and ligand/bridging factor for receptor-mediated lipoprotein uptake, and, therefore, is related to removal of fatty acids (in the form of tryglicerides) from the circulation.

Figure 6. Association of DHA genetic instruments with gene expression levels

Circulating metabolites

Despite having reasonably similar effects on DHA, there were substantial differences in the metabolic profile across the five assessed genetic variants (Figures 7A and 7B).

The variants rs11604424 (closest gene: ZPR1) and rs143988316 (closest gene: PBX4) had a broadly similar metabolic signature and ubiquitous effect on lipoprotein-related traits. For both variants, the DHA increasing allele was related to increases in apolipoprotein B (ApoB) and in lipids in ApoB containing particles (VLDL, LDL, IDL). There was some evidence from the multi-trait colocalisation analysis that these associations were due to a shared causal variant ( ZPR1 locus: PPA = 74%-99%; PBX4 locus: PPA = 72%-97%).

The variant rs174546 (closest gene: FADS1) had the largest effects on multiple polyunsaturated fatty acids-related traits and was related to increases in apolipoprotein A1 (apoA1), larger HDL, IDL and larger VLDL particle lipids. There was some evidence of a shared causal variant with DHA (PPA = 55-86%).

The variant rs2281591 (closest gene: ELOVL2) did not seem to be strongly associated with metabolites other than DHA. Therefore, this variant was not carried forward to the multi-trait colocalisation analysis.

Finally, the DHA increasing allele of the variant rs261334 (closest gene: LIPC) was related to a substantial increase in circulating ApoA1 and in larger HDL particle lipids and, to a smaller extent, in circulating ApoB and large LDL and IDL particle lipids. However, there was some evidence that DHA and HDL-related traits are influenced by distinct causal variants.

Figure 7A. Association of DHA genetic instruments with lipoprotein-related traits

Note: full circles indicate p-value < 0.0005

Figure 7B. Association of DHA genetic instruments with other metabolites

Note: full circles indicate p-value < 0.0005

Multivariable MR

When adjusting for total fatty acids using multivariable MR, estimates for the effect of DHA on CAD risk were attenuated:

##                               exposure nsnp          b         se
## 1 22:6, docosahexaenoic acid || id:852    5 0.07159174 0.11702609
## 2          Total fatty acids || id:936   10 0.20140084 0.09533509
##         pval
## 1 0.54069733
## 2 0.03463868

MR restricted to genetic variants regulating genes directly implicated in fatty acids metabolism

MR estimates were partially attenuated when restricting analysis to the two instruments (rs2281591 and rs174546) directly implicated in DHA metabolism (i.e. ELOVL 2 and FADS), as identified by physical mapping and gene expression analysis:

##                               exposure                        outcome nsnp
## 1 22:6, docosahexaenoic acid || id:852 Coronary heart disease || id:7    2
##           b         se       pval
## 1 0.1248601 0.06465978 0.05347916

Discussion

The IVW estimate points to a detrimental effect of circulating DHA on CAD risk.

However, inferences on the causal effect of DHA on CAD risk should be made in light of three additional aspects:

1. Are causal variants shared between exposure and outcome?

This was tested using genetic colocalisation. There was no strong evidence supporting shared (or distinct) causal variants influencing DHA and CAD risk, which might be related to limited statistical power in the analysis.

2. Is there evidence that the outcome (or its risk factors) influences the exposure?

This was tested using reverse MR. Results indicate that CAD predisposition (or, more likely, CAD risk factors) influences circulating DHA. Therefore, results from main analysis could be biased due to inclusion of invalid instruments affecting primarily CAD predisposition (or its risk factors).

3. Do genetic instruments regulate other biological mechanisms affecting the outcome independently of the exposure?

This was explored using genomic annotation and gene expression/metabolomics data.

These analyses provide strong evidence of an association between DHA genetic instruments and lipoprotein-related traits. The MR estimates for the effect of DHA on CAD risk would not be invalidated by these pleiotropic associations if the influence of genetic instruments on lipoprotein-related traits were fully mediated DHA. On the other hand, MR estimates are expected to be biased if the association between genetic instruments and lipoprotein-related traits were not fully mediated by DHA, specially given lipoprotein metabolism is a major determinant of CAD risk.

In combination, results indicate that horizontal pleiotropy (possibly via lipoprotein-related traits) is introducing bias in MR estimates by the reasons detailed below:

Despite having reasonably similar effects on DHA, genetic variants had substantially different impacts on the patterns and magnitude of changes in lipoprotein-related traits.
One of the genetic variants (rs261334) was found to regulate the expression of a key gene in lipoprotein metabolism (i.e. LIPC), and, therefore, its association with DHA is likely to be a consequence of regulating lipoprotein particles, which are major carriers of lipids in the circulation, including DHA.
The two variants (rs2281591 and rs174546) likely to regulate genes directly implicated in DHA metabolism (i.e. ELOVL 2 and FADS), identified by physical mapping and gene expression analysis, provided the weakest evidence for a causal relation between DHA and CAD risk (Figure 2). In particular, rs2281591 was the least pleiotropic variant and the one estimating an effect closest to the null (Figure 2).
When adjusting for total fatty acids using multivariable MR, estimates for the effect of DHA on CAD risk were attenuated.

Final considerations:

Spurious pleiotropy is a key source of bias in MR studies. Current methods developed to explore this type of bias are of limited use when one or few genetic instruments are available, as is typically the case when the exposure is a molecular phenotype.

We have presented a framework combining different methods that can be used to explore spurious pleiotropy in order to improve reliability of MR interrogating molecular mechanisms of diseases.

The framework captilises on the combination of using multiple statistical approaches (e.g. genetic colocalisation, reverse MR) and high dimensional molecular phenotyping data to explore the plausibility that MR results are explained by any alternative spurious pleiotropic scenario.

Confidence in the main MR estimates increases if there is no strong evidence that results might be explained by one or more spurious pleiotropic scenarios (i.e. Linkage disequilibrium, reverse causation or horizontal pleiotropy).

Likewise, confidence in main MR estimates decreases if one or more spurious pleiotropic scenarios seem plausible. Importantly, better understanding the mechanisms of spurious pleiotropy can inform sensitivity analyses and are key to interpretation of findings.

Using the case of DHA and CAD risk as a motivating example, we have shown that the main MR estimates of a detrimental effect of circulating DHA on CAD risk are likely to be biased by horizontal pleiotropy since selected DHA genetic instruments seem to affect key determinants of CAD risk (i.e. lipoprotein-related traits) independently of DHA. When accounting for this pleiotropic mechanism by adjusting for changes in total fatty acids or by restricting analysis to the least pleiotropic variants, evidence of a detrimental effect of DHA on CAD risk was much weaker.

As next steps to this framework, we aim to:

Incorporate gene expression data from a larger dataset (eQTLGen: N > 30,000)
Conduct multi-trait colocalization analysis with DHA and gene expression data
Conduct coloc analysis using conditional summary statistics to test plausibility of the single causal variant assumption

Software: R code used to run the analysis.

############################################################################
#                        MR data challenge                                 #
############################################################################

############################################################################
#                                Set - UP                                  #
############################################################################

##
## Clear the work environment
rm(list = ls())


##
## Set CRAN mirror
options(repos = c(CRAN ="http://www.stats.bris.ac.uk/R/"))


##
## Load packages
library(devtools)
library(TwoSampleMR)
library(MRInstruments)
library(data.table)
library(dplyr)
library(purrr)
library(moloc)
library(xlsx)
library(ggplot2)
library(ggforestplot)


############################################################################
#                         Extract & prepare data                           #
############################################################################

##
## List data for docosahexaenoic acid (DHA)

ao <- available_outcomes()

filter(ao, id == 852)


##
## Extract summary data for DHA (effect alelle = DHA increasing allele)

exp_dat <- extract_instruments(852) %>%
            mutate(increase.allele = ifelse(beta.exposure >=0, effect_allele.exposure, other_allele.exposure),
                   decrease.allele = ifelse(beta.exposure >=0, other_allele.exposure, effect_allele.exposure),
                   eaf.exposure = ifelse(beta.exposure >=0, eaf.exposure, 1-eaf.exposure),
                   effect_allele.exposure = increase.allele, 
                   other_allele.exposure = decrease.allele,
                   beta.exposure = abs(beta.exposure)) %>%
            select(-increase.allele, -decrease.allele)


##
## Estimate variance explained by each SNP 

exp_dat <- mutate(exp_dat, rsquared.exposure = 2 * beta.exposure ^ 2 * eaf.exposure * (1-eaf.exposure))

sum(exp_dat$rsquared.exposure)


##
## Extract corresponding data for CHD

out_dat <- extract_outcome_data(snps = exp_dat$SNP, outcomes = 7)


##
## Harmonise exposure and outcome data

dat <- harmonise_data(exp_dat, out_dat)


############################################################################
#                             Main MR estimate                             #
############################################################################

##
## 2. Main MR estimate (IVW)

res <- mr(dat, method_list = "mr_ivw")

res_single <- mr_singlesnp(dat, all_method = c("mr_ivw"))

fplot <- mr_forest_plot(res_single)


############################################################################
#                 Sensitivity analyses - part 1                            #
#                      Explore LD scenario                                 #
############################################################################


##
## Import file with SNPs coordinates and exclude if MAF < 1% (EUR)

coord <- fread("./data/coord_hg19.txt") %>%
          filter(MAF > 0.01) %>%
          rename(SNP = rsid)


##
## Import file with SNPs annotation

genes <- read.xlsx("./data/genes.xlsx", sheetName = "dha") 


##
## Define loci of interest (+/- 500kB around top hit)

loci <- filter(coord, SNP %in% dat$SNP) %>%
        mutate(start = position - 500000, end = position + 500000) %>%
        merge(., genes[c("SNP", "Closest.gene")], by = "SNP") %>%
        split(., f = seq(nrow(.))) 


##
## Map loci of interest to top hit closest gene

loci <- map(loci, ~pull(., Closest.gene)) %>%
        unlist %>%
        as.character %>%
        set_names(loci, .)


##
## List SNPs to be included for each loci

snps <- map(loci, ~filter(.y,
                            .y$chromosome == .x$chromosome &
                            .y$position > .x$start &
                            .y$position < .x$end), 
        coord) %>%
        map(., ~pull(., SNP)) 


##
## Extract data for colocalisation analysis

extr_dat <- function(rsid, trait.id) {
  
          df <- extract_outcome_data(snps = rsid, outcomes = trait.id) 
          
          names(df) <- sub(".outcome", "", names(df)) %>% toupper
          
          df <- df %>%
                merge(., coord, by = "SNP") %>%
                mutate(., log10P = log(.$PVAL, base=10), 
                          Trait = OUTCOME,
                          N = SAMPLESIZE,
                          MAF = ifelse(EAF < 0.5, EAF, 1 - EAF),
                          MB = position / 1000000) %>%
                select(., SNP, BETA, SE, N, MAF, Trait, MB, log10P)
}  

coloc_dha_dat <- map(snps, ~extr_dat(rsid = ., trait.id = 852))
coloc_chd_dat <- map(snps, ~extr_dat(rsid = ., trait.id = 7)) 


##
## Format data for colocalisation analysis

prep_dat <- function(df1, df2) {
        
        snps <- Reduce(intersect, Map("[[", list(df1,df2), "SNP")) 
        
        ls <- list(df1[df1$SNP %in% snps,], df2[df2$SNP %in% snps,])
}

moloc_ls <- map2(coloc_dha_dat, coloc_chd_dat, prep_dat)


##
## Run colocalisation analysis

moloc_res <- map(moloc_ls, ~moloc_test(., 
                                        prior_var = c(0.15, 0.2), 
                                        priors = c(1e-04, 1e-06), 
                                        save.SNP.info = T)) %>%
              map(., ~pluck(., 1)) %>%
              map2(., names(.), ~mutate(.x, 
                                          Hypotheses = c("H1: DHA", "H3: DHA,CAD", "H2: CAD", "H4: DHA.CAD", "H0: none"),
                                          close.gene = .y)) %>%
              map(., ~arrange(., Hypotheses)) %>%
              bind_rows


coloc_plot <- ggplot(moloc_res, aes(x=Hypotheses, y=PPA)) +
                        geom_col(aes(color = Hypotheses, fill = Hypotheses)) +
                        theme(axis.title.x= element_blank(),
                                      axis.text.x=element_blank(),
                                      axis.ticks.x=element_blank(),
                                      panel.background = element_blank(),
                                      legend.position = "bottom",
                                      legend.title=element_blank(), 
                                      legend.text = element_text(size=9)) +
                        facet_wrap(~ close.gene)

              
##
## Create volcano plots

make_vplot <- function(df1, df2, gene) {
        
        df <- rbind(df1, df2)
        
        p <- ggplot(df, aes(x=MB, y=log10P)) +
                geom_point(aes(color = Trait), size = 0.5) + 
                scale_y_reverse() +
                labs(title = paste("Closest gene:", gene)) +
                theme(axis.text = element_text(size = 7),
                      axis.title = element_text(size = 7, face = "bold"))
}

vplot.ls <- pmap(list(coloc_dha_dat, coloc_chd_dat, names(coloc_dha_dat)), make_vplot) 


############################################################################
#               Sensitivity analyses - part 2                              #
#              Explore reverse causation scenario                          #
############################################################################


##
## Extract summary data for CHD 

exp2_dat <- extract_instruments(outcomes = 7) # Only SNPs p1 = 5e-08 & r2 < 0.001


##
## Extract corresponding data for FA

out2_dat <- extract_outcome_data(snps = exp2_dat$SNP, outcomes = 852)


##
## Harmonise exposure and outcome data

dat2 <- harmonise_data(exp2_dat, out2_dat)


##
## Reverse MR estimate

res2 <- mr(dat2)


############################################################################
#                  Sensitivity analyses - part 3                           #
#              Explore horizontal pleiotropy scenario                      #
############################################################################


############################################################################
#      Part 3.1.1. Genotype x gene expression - top hits look up           #
############################################################################

##
## Format GTEx data

gtex_files <- list.files(path = "data", pattern = "^gtex_rs", full.names = T) 

gtex_snps <- stringr::str_extract_all(gtex_files, "rs[0-9]*")

format_gtex_dat <- function(files) {
        
        df <- fread(files) %>%
                rename(., SNP = "SNP Id", snpid = "Variant Id", Gene = "Gene Symbol", pval = "P-Value") %>%
                mutate(., alleles = stringr::str_extract_all(snpid, "[A-C-G-T]")) %>%
                mutate(., EA = alleles[[1]][2], NEA  = alleles[[1]][1]) %>%
                merge(., dat, by = "SNP") %>%
                mutate(., beta.eqtl = ifelse(EA == effect_allele.exposure, NES, -NES)) %>%
                mutate(., p_dir = abs(log(pval, base=10))*sign(beta.eqtl)) %>%
                select(., SNP, Gene, Tissue, effect_allele.exposure, p_dir) %>%
                arrange(Tissue)
}

gtex_dat <- map(gtex_files, format_gtex_dat)
        

##
## Make plots for eQTLs GTex data

make_gtex_plot <- function(df, top) {
        
        p <- ggplot(df, aes(x=Tissue, y=p_dir)) +
                geom_col(aes(color = Gene, fill = Gene)) + 
                coord_flip() +
                labs(title = paste("Top hit:", top)) + 
                ylab("-log10(p) x direction of effect") + 
                theme(axis.title.y = element_blank(),
                      panel.grid.major = element_blank(), 
                      panel.grid.minor = element_blank(),
                      panel.background = element_blank(), 
                      axis.line = element_line(colour = "black")) + 
                geom_hline(yintercept = 0) + 
                scale_y_continuous(limit = c(-30, 32),
                                   breaks = c(-30, -20, -10, 0, 10, 20, 30)) 
}

gtex_plots <- map2(gtex_dat, gtex_snps, ~make_gtex_plot(.x, .y))      


############################################################################
#       Part 3.2.1. Genotype x metabolites - top hits look up              #
############################################################################

##
## Format abbreviated names of metabolites

metab_ao <- filter(ao, author == "Kettunen") %>%
                mutate_at(., vars(trait), ~gsub("Summary_statistics_MAGNETIC_|.txt.tab", "", filename) )


##
## Format metabolomics GWAS data

metab_dat <- filter(ao, author == "Kettunen") %>% 
                pull(id) %>%
                extract_outcome_data(snps = dat$SNP, outcomes = .) %>%
                harmonise_data(exp_dat, .) %>%
                merge(., metab_ao, by.x = "id.outcome", by.y = "id") %>%
                select(SNP, beta.outcome, se.outcome, pval.outcome, trait, subcategory, id.outcome) 


##
## Split metabolomics GWAS data (lipoprotein vs non-lipoprotein traits)

metab_dat.ls <- list(filter(metab_dat, subcategory == "Lipid"),
                     filter(metab_dat, subcategory != "Lipid"))


# Make plots for top hit x metabolite association

make_metabplot <- function(dat) {
        
        p <- forestplot(df = dat,
                                name = trait,
                                estimate = beta.outcome,
                                se = se.outcome,
                                pvalue = pval.outcome,
                                psignif = 0.0005,
                                logodds = FALSE,
                                colour = SNP,
                                xlab = "1-SD increment per DHA increasing allele"
                        ) + 
                        theme(axis.text.y = element_text(size = 6)) + 
                        theme(axis.title.x = element_text(size=6)) +
                        theme(legend.position="bottom", legend.box = "horizontal") +
                        theme(legend.title=element_blank(), legend.text = element_text(size=8)) 
}

metab_plot <- map(metab_dat.ls, make_metabplot)


############################################################################
#     Part 3.2.2. DHA x metabolites - Multi-trait colocalization analysis  #
############################################################################

##
## Select IDs for metabolites if P < 0.05 with any top hit

metab_id <- filter(metab_dat, pval.outcome < 0.0005) %>%
        arrange(SNP) %>%
        group_by(SNP) %>%
        group_split %>%
        map(., ~pull(., id.outcome)) 


##
## Import and format metabolites summary data 

ket_all <- fread("./data/ket_file.txt") %>%
        merge(., metab_ao, by.x = "file", by.y = "filename") 


##
## Create matrices of SNP-metabolite betas and SEs for each loci

make_mtx <- function(snps.coloc, metab.coloc, est) {
        
          mtx <- ket_all %>%
                  filter(., snp %in% snps.coloc &
                            id %in% metab.coloc) %>%
                  select(., snp, trait, beta, se) %>%
                  data.table %>%
                  data.table::dcast(., snp ~ trait, value.var = c("beta", "se")) %>%
                  select(., snp, starts_with(paste0(est, "_"))) %>%
                  rename_at(., vars(-snp), ~sub(paste0(est, "_"), "", .)) %>%
                  tibble::column_to_rownames(., var = "snp") %>%
                  as.matrix %>%
                  na.omit 
          
          if(ncol(mtx) > 1) {
            
                  mtx <- mtx
                  
               } else {
                  
                  mtx <- NULL
          }
}


betas.ls <- map2(snps, metab_id, ~make_mtx(.x, .y, est = "beta")) %>% compact
  
ses.ls <- map2(snps, metab_id, ~make_mtx(.x, .y, est = "se")) %>% compact


##
## Function for multi-trait colocalization analysis

run_hypr_coloc <- function(beta, se, uni.priors = F) {
        
              traits <- colnames(beta)
              
              cont_out <- c(rep(0, ncol(beta)))
              
              rsid <- rownames(beta)
              
              res <- hyprcoloc::hyprcoloc(beta, se, 
                                          binary.outcomes = cont_out, 
                                          trait.names = traits, 
                                          snp.id = rsid,
                                          uniform.priors = uni.priors)
}


##
## Using default priors (can be overly conservative)

multi_coloc_res <- map2(betas.ls, ses.ls, ~run_hypr_coloc(beta = .x, se = .y)) %>%
        flatten %>%
        map2(., names(betas.ls), ~mutate(.x, loci = .y)) %>%
        bind_rows 


##
## Using uniform priors

multi_coloc_res_uni <- map2(betas.ls, ses.ls, ~run_hypr_coloc(beta = .x, se = .y, uni.priors = T)) %>%
        flatten %>%
        map2(., names(betas.ls), ~mutate(.x, loci = .y)) %>%
        bind_rows 


############################################################################
#                    Part 3.3.1. Multivariable MR                          #
############################################################################


exp_mvdat <- mv_extract_exposures(c(852, 936))

out_mvdat <- extract_outcome_data(exp_mvdat$SNP, 7)

mvdat <- mv_harmonise_data(exp_mvdat, out_mvdat)

mv_res <- mv_multiple(mvdat)


############################################################################
#                    Part 3.3.2. Conservative approach                     #
############################################################################

cons_res <- filter(dat, SNP == "rs174546" | SNP == "rs2281591") %>%
              mr(.)

References:

Abdelhamid AS, Brown TJ, Brainard JS, Biswas P, Thorpe GC, Moore HJ, Deane KHO, AlAbdulghafoor FK, Summerbell CD, Worthington HV, Song F, Hooper L. Omega 3 fatty acids for the primary and secondary prevention of cardiovascular disease. Cochrane Database of Systematic Reviews 2018, Issue 7. Art. No.: CD003177. DOI: 10.1002/14651858.CD003177.pub3.
Kettunen J, Demirkan A, Wurtz P, et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of lpa. Nature communications [Internet]. 2016 March;7:11122. Available from: http://europepmc.org/articles/PMC4814583/
Hemani G, Zheng J, Elsworth B, Wade KH, Baird D, Haberland V, Laurin C, Burgess S, Bowden J, Langdon R, Tan VY, Yarmolinsky J, Shihab HA, Timpson NJ, Evans DM, Relton C, Martin RM, Davey Smith G, Gaunt TR, Haycock PC, The MR-Base Collaboration. The MR-Base platform supports systematic causal inference across the human phenome. eLife 2018;7:e34408. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5976434/
Nikpay M, Goel A, Won HH, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease.Nat Genet. 2015 Oct;47(10):1121-1130. doi: 10.1038/ng.3396. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4589895/
Giambartolomei C, Zhenli Liu J, Zhang W, Hauberg M, Shi H, Boocock J, Pickrell J, Jaffe AE; CommonMind Consortium, Pasaniuc B, Roussos P.A Bayesian framework for multiple trait colocalization from summary association statistics. Bioinformatics. 2018 Aug 1;34(15):2538-2545. doi: 10.1093/bioinformatics/bty147.
Watanabe K, Taskesen E, van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 2017;8:1826. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5705698/
Foley CN, Staley JR, Breen PG, Sun BB, Kirk PDW, Burgess S, Howson JMM. A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits. BioRxiv 2019. doi: https://doi.org/10.1101/592238
Burgess S, Foley CN, Allara E, Staley JR, Howson JMM. A robust and efficient method for Mendelian randomization with hundreds of genetic variants: unravelling mechanisms linking HDL-cholesterol and coronary heart disease. BioRxiv 2019. doi: https://doi.org/10.1101/566851

MR Data Challenge 2019

Maria Carolina Borges

June 7, 2019

Participants:

Motivation:

Improving reliability of MR studies testing the effect of molecular phenotypes:

The case of omega-3 fatty acids and cardiovascular disease risk

Data:

Analysis methods:

1. Selection of genetic instruments for DHA

2. Main analysis

3. Sensitivity analyses

3.1. Counfounding by linkage disequilibrium

3.2. Reverse causation

3.3 Horizontal pleiotropy

Results:

1. Selection of genetic instruments for DHA

2. Main analysis

3. Sensitivity analyses

3.1. Linkage disequilibrium

3.2. Reverse causation

3.3 Horizontal pleiotropy

Genomic annotation

Integrating omics data

Gene expression

Circulating metabolites

Multivariable MR

MR restricted to genetic variants regulating genes directly implicated in fatty acids metabolism

Discussion

1. Are causal variants shared between exposure and outcome?

2. Is there evidence that the outcome (or its risk factors) influences the exposure?

3. Do genetic instruments regulate other biological mechanisms affecting the outcome independently of the exposure?

Final considerations:

Software: R code used to run the analysis.

References: