2. Main analysis
Mendelian randomization estimates were calculated using Wald ratios for each genetic variant and the inverse-variance weighted (IVW) method for the combined set of genetic instruments.
One major limitation of Mendelian randomization (MR) is the unprovable assumption that the association between genetic instrument and outcome is entirely mediated by the exposure (i.e. “vertical pleiotropy”) (Figure 1A) and not by direct effects of the genetic instrument (i.e. hereafter “spurious pleiotropy”) (Figures 1B-1D).
A large proportion of recently developed methods aim at accounting for spurious pleiotropy to improve the reliability of MR results. However, these methods ideally require that many genetic instruments for the exposure are available, which is often not the case when the exposure is a molecular phenotype (e.g. gene expression levels, DNA methylation, protein or metabolite).
In the context of MR, spurious pleiotropy can arise from three main scenarios:
Counfounding by linkage disequilibrium: the genetic instrument influences the exposure and is correlated (i.e. in linkage disequilibrium) with another genetic variant that influences the outcome independently (Figure 1B)
Reverse causation: the genetic instrument primarily influences the outcome, which subsequently affects the exposure (Figure 1C)
Horizontal pleiotropy: the genetic instrument influences exposure and outcome via two independent biological pathways (Figure 1D)
We propose a framework combining methods developed within and outside the MR community to explore robustness of MR findings to spurious pleiotropy when only one or a few genetic instruments are available for a particular exposure.
Briefly, the framework aims to assess the plausibility that any of the above-mentioned spurious pleiotropic mechanisms introduce bias in MR findings as follows:
Confounding by linkage disequilibrium (Figure 1B) can be tested using genetic colocalisation analysis.
Reverse causation (Figure 1C) can be investigated using MR methods to test whether the outcome is likely to affect the exposure (hereafter reverse MR). Evidence of bidirectional effects could indicate a genuine bidirectional relation or, alternatively, wrong inferences on the causal direction of the relation between exposure and outcome, as detailed in Analysis methods (item 3.2).
Horizontal pleiotropy (Figure 1D) can be explored by using genomic annotation tools and integrating high-dimensional omics data to improve knowledge on biological mechanisms linking genetic variants, exposure, and outcome.
As a motivating example, we have used this framework to assess the causal role of circulating docosahexaenoic acid (DHA) on coronary artery disease (CAD) risk. For several decades, this long chain omega-3 fatty acid has been considered a protective factor against CAD risk. However, recent systematic reviews of observational studies and randomised controlled trials (RCTs) have cast doubt on long-held beliefs concerning the cardioprotective effects of DHA (Abdelhamid et al., 2018).
Figure 1. Diagrams representing the putative causal relations between genetic instrument, exposure (DHA) and outcome (CAD risk)
## author consortium id sample_size nsnp pmid population
## 1 Kettunen <NA> 852 13499 11415610 27005778 European
## sex trait unit
## 1 Males and females 22:6, docosahexaenoic acid SD
## author consortium id ncase ncontrol nsnp pmid population
## 1 Nikpay CARDIoGRAMplusC4D 7 60801 123504 9455779 26343387 Mixed
## sex trait unit
## 1 Males and females Coronary heart disease log odds
Mendelian randomization analyses were performed with R software version 3.5.1 (R Foundation for Statistical Computing).
Genetic variants were selected if strongly associated with circulating DHA (\(P < 5*10^{-8}\)) and if not in linkage disequilibrium (\(R^{2} < 0.001\)).
Mendelian randomization estimates were calculated using Wald ratios for each genetic variant and the inverse-variance weighted (IVW) method for the combined set of genetic instruments.
We conducted a series of sensitivity analyses to explore whether results from the main analyses could be biased by the spurious pleiotropic mechanisms shown in Figures 1B-D.
Genetic colocalisation was used to test whether associations with exposure and outcome were driven by a shared genetic factor. While not sufficient, sharing causal variants is a necessary condition for two traits to be causally related. Thus, genetic colocalisation helps to mitigate the possibility of bias due to spurious pleiotropy driven by two separate (but correlated) variants in a given genomic region (Figure 1B).
The genetic colocalisation was performed using moloc (Giambartolomei et al., 2018), a Bayesian method where posterior probabilities are computed for each possible configuration, and these probabilities can be summed over all configurations and combined with the prior probabilities to assess the support for each hypothesis. In the case of two traits, there are five possible hypotheses:
Evidence of genetic colocalisation between the two traits was considered if there was a high posterior probability (PPA) for H4 and low PPA for H3.
Total variance was set at 0.15 for the continuous trait and 0.2 for the binary trait. Prior probability that a SNP is associated with one or two traits was set to \(P = 1*10^{-4}\) and \(P = 1*10^{-6}\), respectively.
With sufficiently large sample sizes, a genetic variant associated with an outcome through a mediating exposure could reach the conventional threshold for statistical significance for both outcome and exposure. Therefore, using such thresholds to select instruments could lead to situations where the genetic instrument influences the hypothesised exposure via the hypothesised outcome (i.e. the hypothesised outcome actually has a causal effect on the hypothesised exposure and not vice versa) (Figure 1C). This could lead to wrong inferences on the causal direction of the relation between the two traits and reverse MR can help addressing whether this is a likely source of bias in the analysis.
We conducted a reverse MR analysis to explore the effect of genetic predisposition to CAD on circulating DHA. To do that, we have selected genetic variants strongly and independently associated with CAD risk using the same criteria described in item 1 (“Selection of genetic instruments for DHA”). MR estimates were calculated using the IVW method and several other MR methods that are more robust to horizontal pleiotropy (i.e. MR-Egger, weighted median and weighted mode estimator).
Horizontal pleiotropy (Figure 1D) is one of main sources of invalid instruments since it is a widespread biological phenomenon. Given valid instruments are expected to produce consistent estimates of the causal effect, most current MR methods attempt to address horizontal pleiotropy by directly or indirectly accounting for statistical heterogeneity in single instrument MR effect estimates.
When only one or few instruments are available, the application of these methods can be impossible, misleading or of limited usefulness. Therefore, in this context, we propose that horizontal pleiotropy should be investigated by using knowledge on the biological mechanisms linking genetic variants, exposure and outcome, which can be informed by genomic annotation and by integrating high-dimensional omics data.
We have adopted two approaches to gain insight into the biological mechanism underpinning the association of the genetic instruments with exposure and outcome:
Genomic annotation: We annotated the selected genetic variants (and variants in linkage disequilibrium with those) using FUMA (Watanabe et al., 2017), an integrated platform for functional annotation that uses information from multiple databases. This includes protein-coding annotation, which focuses on the impact of the genetic variant on protein structure, and non-protein-coding annotation, which focuses on genetic variation that are often involved in gene regulation.
Integrating omics data:
First, we conducted a hypothesis-free test to examine the association of each genetic instrument with gene expression levels. We searched the GTEx portal data to identify cis-expression quantitative trait loci ( cis-eQTLs) among the selected genetic variants (FDR <= 5%).
Then, we checked for the association of genetic variants across all metabolic traits included in Kettunen et al. (2016) GWAS. To account for multiple testing, we considered a p-value < 0.0005 (0.05/22/5) as evidence of association, where 22 is the number of principal components explaining 95% of the phenotypic variation in the metabolomics data (Kettunen et al., 2016) and five is the number of assessed genetic variants. The metabolites showing evidence of association with any of the genetic variants were carried forward to the multi-trait colocalisation analysis (Foley et al., 2019), which is a computationally efficient extension of moloc (Giambartolomei et al., 2018). The multi-trait colocalisation analysis was conducted using default and uniform priors as described by Burgess et al. (2019), given that the default prior can be overly conservative.
To further explore the possibility of MR findings being biased by horizontal pleiotropy, we conducted two additional sensitivity analyses:
Six independent genetic variants were selected, explaining from 0.25% to 0.78% of variance in circulating DHA individually and 2.4% when combined.
## SNP effect_allele.exposure other_allele.exposure eaf.exposure
## 1 rs2281591 A G 0.866280
## 2 rs174546 C T 0.597151
## 3 rs11604424 C T 0.243328
## 4 rs261334 G C 0.230871
## 5 rs145717049 C T 0.955942
## 6 rs143988316 C T 0.930513
## beta.exposure se.exposure pval.exposure units.exposure rsquared.exposure
## 1 0.108394 0.018174 3.66e-09 SD 0.002722044
## 2 0.127635 0.012483 4.81e-24 SD 0.007837833
## 3 0.083110 0.014241 7.84e-09 SD 0.002543527
## 4 0.110247 0.014749 1.44e-13 SD 0.004316504
## 5 0.201292 0.032750 1.21e-09 SD 0.003413024
## 6 0.150045 0.024351 1.10e-09 SD 0.002911381
sum(exp_dat$rsquared.exposure)
## [1] 0.02374431
One of the variants (rs145717049) could not be found in the CAD GWAS (neither a proxy variant in strong linkage disequilibrium - i.e. \(R^{2} > 0.8\)). Therefore, this variant was not considered in further analysis.
Results from the main MR analysis indicated that higher circulating DHA is related to higher CAD risk (Figure 2), which is conflict with evidence from observational studies and RCTs pointing to either a null or protective effect of DHA on CAD risk (Abdelhamid et al., 2018).
Figure 2. Mendelian randomization estimates for the effect of DHA on CAD risk
Note: Results are expressed as log odds of CAD per standard unit increase in circulating DHA.
In the five assessed genomic regions, PPA for both H3 and H4 was very low, and, therefore, there was no clear evidence supporting shared (or distinct) causal variants influencing DHA and CAD risk. Instead, there was strong evidence that individual genomic regions were strongly associated with circulating DHA only (i.e. PPA for H1 ranged from 89% to 99%) (Figure 3). This might be the result of limited power in the colocalization analysis to detect an association between variants and CAD risk in single genomic regions.
Figure 3. Posterior probabilities (PPA) for each hypothesis tested in the colocalisation of DHA and CAD risk across DHA genomic regions
Note: Names for genomic regions were assigned according to the gene closest to the selected genetic instrument in the region.
These results are broadly consistent with the visual inspection of volcano plots as shown in Figure 4.
Figure 4. P-values (log 10 units) for the association of genetic variants with DHA and CAD risk across DHA genomic regions
Forty one genetic variants were selected as instruments for CAD genetic predisposition. Two variants were removed for being ambiguous palindromic SNPs (MAF > 42%) and, therefore, 39 variants were used in further analyses.
The IVW method indicated that higher predisposition to CAD is related to higher circulating DHA.
## method nsnp b se pval
## 1 Inverse variance weighted 39 0.09225259 0.03174137 0.003656332
Note: Results are expressed as standardised units of circulating DHA per log odds of CAD.
The evidence from MR findings of a bidirectional positive effect between DHA and CAD risk could indicate that results from main analysis might be biased due to inclusion of invalid instruments affecting primarily CAD predisposition (or its risk factors). Alternatively, there might be a genuine bidirectional relation between DHA and CAD risk.
Overall, the effect estimate was partly attenuated when using MR methods that are usually more robust to horizontal pleiotropy.
## method nsnp b se pval
## 1 MR Egger 39 0.08632377 0.08121313 0.2947027
## 2 Weighted median 39 0.04199586 0.04112948 0.3072238
## 3 Weighted mode 39 0.01111937 0.05741033 0.8474561
Note: Results are expressed as standardised units of circulating DHA per log odds of CAD.
Most genetic instruments were located in introns (rs2281591, rs11604424, and rs261334), one was intergenic (rs143988316) and the other was a 3 Prime UTR variant (rs174546).
## SNP chr bp Closest.gene location
## 1 rs2281591 6 10990493 ELOVL2 intron
## 2 rs174546 11 61569830 FADS1 3 Prime UTR Variant
## 3 rs11604424 11 116651115 ZPR1 intron
## 4 rs261334 15 58726744 LIPC intron
## 5 rs143988316 19 19667254 PBX4 intergenic
Genetic variants in linkage disequilibrium with the genetic instruments were largely intronic or intergenic and the ones located in exons were mostly synonymous (Figure 5).
Figure 5. Functional consequences of SNPs in linkage disequilibrium with DHA genetic instruments
These results are broadly consistent with these variants affecting phenotypes through regulatory mechanism, such as changes in gene expression.
Four out of the five genetic variants were identified as cis-eQTLs (Figure 6).
One of the variants (rs174546) had strong and opposite effects on FADS1 and FADS2 across several tissues, both genes code for key enzymes in omega-3 and omega-6 fatty acids metabolism. This SNP also affects the expression of FADS3 (liver and cerebellum), MYRF (esophagus) and TMEM258 (several tissues).
Another variant (rs2281591) is associated with increased expression of ELOLV2-AS1, the antisense RNA 1 for ELOLV2 (spinal cord), which codes for a key enzyme in the biosynthetic pathway of long chain polyunsaturated fatty acids such as DHA, and with SYCP2L (testis).
The variant rs143988316 was related to a decrease in the expression of ATP13A1 (nerve and subcutaneous adipose tissue), GATAD2A (whole blood and esophagus), and an increase in MAU2 (whole blood and skeletal muscle).
Lastly, the variant rs261334 was related to decreased expression of LIPC (liver and pancreas), which encodes a triglyceride lipase that acts in liver as triglyceride hydrolase and ligand/bridging factor for receptor-mediated lipoprotein uptake, and, therefore, is related to removal of fatty acids (in the form of tryglicerides) from the circulation.
Figure 6. Association of DHA genetic instruments with gene expression levels
Despite having reasonably similar effects on DHA, there were substantial differences in the metabolic profile across the five assessed genetic variants (Figures 7A and 7B).
The variants rs11604424 (closest gene: ZPR1) and rs143988316 (closest gene: PBX4) had a broadly similar metabolic signature and ubiquitous effect on lipoprotein-related traits. For both variants, the DHA increasing allele was related to increases in apolipoprotein B (ApoB) and in lipids in ApoB containing particles (VLDL, LDL, IDL). There was some evidence from the multi-trait colocalisation analysis that these associations were due to a shared causal variant ( ZPR1 locus: PPA = 74%-99%; PBX4 locus: PPA = 72%-97%).
The variant rs174546 (closest gene: FADS1) had the largest effects on multiple polyunsaturated fatty acids-related traits and was related to increases in apolipoprotein A1 (apoA1), larger HDL, IDL and larger VLDL particle lipids. There was some evidence of a shared causal variant with DHA (PPA = 55-86%).
The variant rs2281591 (closest gene: ELOVL2) did not seem to be strongly associated with metabolites other than DHA. Therefore, this variant was not carried forward to the multi-trait colocalisation analysis.
Finally, the DHA increasing allele of the variant rs261334 (closest gene: LIPC) was related to a substantial increase in circulating ApoA1 and in larger HDL particle lipids and, to a smaller extent, in circulating ApoB and large LDL and IDL particle lipids. However, there was some evidence that DHA and HDL-related traits are influenced by distinct causal variants.
Figure 7A. Association of DHA genetic instruments with lipoprotein-related traits
Note: full circles indicate p-value < 0.0005
Figure 7B. Association of DHA genetic instruments with other metabolites
Note: full circles indicate p-value < 0.0005
When adjusting for total fatty acids using multivariable MR, estimates for the effect of DHA on CAD risk were attenuated:
## exposure nsnp b se
## 1 22:6, docosahexaenoic acid || id:852 5 0.07159174 0.11702609
## 2 Total fatty acids || id:936 10 0.20140084 0.09533509
## pval
## 1 0.54069733
## 2 0.03463868
MR estimates were partially attenuated when restricting analysis to the two instruments (rs2281591 and rs174546) directly implicated in DHA metabolism (i.e. ELOVL 2 and FADS), as identified by physical mapping and gene expression analysis:
## exposure outcome nsnp
## 1 22:6, docosahexaenoic acid || id:852 Coronary heart disease || id:7 2
## b se pval
## 1 0.1248601 0.06465978 0.05347916
The IVW estimate points to a detrimental effect of circulating DHA on CAD risk.
However, inferences on the causal effect of DHA on CAD risk should be made in light of three additional aspects:
This was tested using reverse MR. Results indicate that CAD predisposition (or, more likely, CAD risk factors) influences circulating DHA. Therefore, results from main analysis could be biased due to inclusion of invalid instruments affecting primarily CAD predisposition (or its risk factors).
This was explored using genomic annotation and gene expression/metabolomics data.
These analyses provide strong evidence of an association between DHA genetic instruments and lipoprotein-related traits. The MR estimates for the effect of DHA on CAD risk would not be invalidated by these pleiotropic associations if the influence of genetic instruments on lipoprotein-related traits were fully mediated DHA. On the other hand, MR estimates are expected to be biased if the association between genetic instruments and lipoprotein-related traits were not fully mediated by DHA, specially given lipoprotein metabolism is a major determinant of CAD risk.
In combination, results indicate that horizontal pleiotropy (possibly via lipoprotein-related traits) is introducing bias in MR estimates by the reasons detailed below:
Despite having reasonably similar effects on DHA, genetic variants had substantially different impacts on the patterns and magnitude of changes in lipoprotein-related traits.
One of the genetic variants (rs261334) was found to regulate the expression of a key gene in lipoprotein metabolism (i.e. LIPC), and, therefore, its association with DHA is likely to be a consequence of regulating lipoprotein particles, which are major carriers of lipids in the circulation, including DHA.
The two variants (rs2281591 and rs174546) likely to regulate genes directly implicated in DHA metabolism (i.e. ELOVL 2 and FADS), identified by physical mapping and gene expression analysis, provided the weakest evidence for a causal relation between DHA and CAD risk (Figure 2). In particular, rs2281591 was the least pleiotropic variant and the one estimating an effect closest to the null (Figure 2).
When adjusting for total fatty acids using multivariable MR, estimates for the effect of DHA on CAD risk were attenuated.
Spurious pleiotropy is a key source of bias in MR studies. Current methods developed to explore this type of bias are of limited use when one or few genetic instruments are available, as is typically the case when the exposure is a molecular phenotype.
We have presented a framework combining different methods that can be used to explore spurious pleiotropy in order to improve reliability of MR interrogating molecular mechanisms of diseases.
The framework captilises on the combination of using multiple statistical approaches (e.g. genetic colocalisation, reverse MR) and high dimensional molecular phenotyping data to explore the plausibility that MR results are explained by any alternative spurious pleiotropic scenario.
Confidence in the main MR estimates increases if there is no strong evidence that results might be explained by one or more spurious pleiotropic scenarios (i.e. Linkage disequilibrium, reverse causation or horizontal pleiotropy).
Likewise, confidence in main MR estimates decreases if one or more spurious pleiotropic scenarios seem plausible. Importantly, better understanding the mechanisms of spurious pleiotropy can inform sensitivity analyses and are key to interpretation of findings.
Using the case of DHA and CAD risk as a motivating example, we have shown that the main MR estimates of a detrimental effect of circulating DHA on CAD risk are likely to be biased by horizontal pleiotropy since selected DHA genetic instruments seem to affect key determinants of CAD risk (i.e. lipoprotein-related traits) independently of DHA. When accounting for this pleiotropic mechanism by adjusting for changes in total fatty acids or by restricting analysis to the least pleiotropic variants, evidence of a detrimental effect of DHA on CAD risk was much weaker.
As next steps to this framework, we aim to:
############################################################################
# MR data challenge #
############################################################################
############################################################################
# Set - UP #
############################################################################
##
## Clear the work environment
rm(list = ls())
##
## Set CRAN mirror
options(repos = c(CRAN ="http://www.stats.bris.ac.uk/R/"))
##
## Load packages
library(devtools)
library(TwoSampleMR)
library(MRInstruments)
library(data.table)
library(dplyr)
library(purrr)
library(moloc)
library(xlsx)
library(ggplot2)
library(ggforestplot)
############################################################################
# Extract & prepare data #
############################################################################
##
## List data for docosahexaenoic acid (DHA)
ao <- available_outcomes()
filter(ao, id == 852)
##
## Extract summary data for DHA (effect alelle = DHA increasing allele)
exp_dat <- extract_instruments(852) %>%
mutate(increase.allele = ifelse(beta.exposure >=0, effect_allele.exposure, other_allele.exposure),
decrease.allele = ifelse(beta.exposure >=0, other_allele.exposure, effect_allele.exposure),
eaf.exposure = ifelse(beta.exposure >=0, eaf.exposure, 1-eaf.exposure),
effect_allele.exposure = increase.allele,
other_allele.exposure = decrease.allele,
beta.exposure = abs(beta.exposure)) %>%
select(-increase.allele, -decrease.allele)
##
## Estimate variance explained by each SNP
exp_dat <- mutate(exp_dat, rsquared.exposure = 2 * beta.exposure ^ 2 * eaf.exposure * (1-eaf.exposure))
sum(exp_dat$rsquared.exposure)
##
## Extract corresponding data for CHD
out_dat <- extract_outcome_data(snps = exp_dat$SNP, outcomes = 7)
##
## Harmonise exposure and outcome data
dat <- harmonise_data(exp_dat, out_dat)
############################################################################
# Main MR estimate #
############################################################################
##
## 2. Main MR estimate (IVW)
res <- mr(dat, method_list = "mr_ivw")
res_single <- mr_singlesnp(dat, all_method = c("mr_ivw"))
fplot <- mr_forest_plot(res_single)
############################################################################
# Sensitivity analyses - part 1 #
# Explore LD scenario #
############################################################################
##
## Import file with SNPs coordinates and exclude if MAF < 1% (EUR)
coord <- fread("./data/coord_hg19.txt") %>%
filter(MAF > 0.01) %>%
rename(SNP = rsid)
##
## Import file with SNPs annotation
genes <- read.xlsx("./data/genes.xlsx", sheetName = "dha")
##
## Define loci of interest (+/- 500kB around top hit)
loci <- filter(coord, SNP %in% dat$SNP) %>%
mutate(start = position - 500000, end = position + 500000) %>%
merge(., genes[c("SNP", "Closest.gene")], by = "SNP") %>%
split(., f = seq(nrow(.)))
##
## Map loci of interest to top hit closest gene
loci <- map(loci, ~pull(., Closest.gene)) %>%
unlist %>%
as.character %>%
set_names(loci, .)
##
## List SNPs to be included for each loci
snps <- map(loci, ~filter(.y,
.y$chromosome == .x$chromosome &
.y$position > .x$start &
.y$position < .x$end),
coord) %>%
map(., ~pull(., SNP))
##
## Extract data for colocalisation analysis
extr_dat <- function(rsid, trait.id) {
df <- extract_outcome_data(snps = rsid, outcomes = trait.id)
names(df) <- sub(".outcome", "", names(df)) %>% toupper
df <- df %>%
merge(., coord, by = "SNP") %>%
mutate(., log10P = log(.$PVAL, base=10),
Trait = OUTCOME,
N = SAMPLESIZE,
MAF = ifelse(EAF < 0.5, EAF, 1 - EAF),
MB = position / 1000000) %>%
select(., SNP, BETA, SE, N, MAF, Trait, MB, log10P)
}
coloc_dha_dat <- map(snps, ~extr_dat(rsid = ., trait.id = 852))
coloc_chd_dat <- map(snps, ~extr_dat(rsid = ., trait.id = 7))
##
## Format data for colocalisation analysis
prep_dat <- function(df1, df2) {
snps <- Reduce(intersect, Map("[[", list(df1,df2), "SNP"))
ls <- list(df1[df1$SNP %in% snps,], df2[df2$SNP %in% snps,])
}
moloc_ls <- map2(coloc_dha_dat, coloc_chd_dat, prep_dat)
##
## Run colocalisation analysis
moloc_res <- map(moloc_ls, ~moloc_test(.,
prior_var = c(0.15, 0.2),
priors = c(1e-04, 1e-06),
save.SNP.info = T)) %>%
map(., ~pluck(., 1)) %>%
map2(., names(.), ~mutate(.x,
Hypotheses = c("H1: DHA", "H3: DHA,CAD", "H2: CAD", "H4: DHA.CAD", "H0: none"),
close.gene = .y)) %>%
map(., ~arrange(., Hypotheses)) %>%
bind_rows
coloc_plot <- ggplot(moloc_res, aes(x=Hypotheses, y=PPA)) +
geom_col(aes(color = Hypotheses, fill = Hypotheses)) +
theme(axis.title.x= element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
panel.background = element_blank(),
legend.position = "bottom",
legend.title=element_blank(),
legend.text = element_text(size=9)) +
facet_wrap(~ close.gene)
##
## Create volcano plots
make_vplot <- function(df1, df2, gene) {
df <- rbind(df1, df2)
p <- ggplot(df, aes(x=MB, y=log10P)) +
geom_point(aes(color = Trait), size = 0.5) +
scale_y_reverse() +
labs(title = paste("Closest gene:", gene)) +
theme(axis.text = element_text(size = 7),
axis.title = element_text(size = 7, face = "bold"))
}
vplot.ls <- pmap(list(coloc_dha_dat, coloc_chd_dat, names(coloc_dha_dat)), make_vplot)
############################################################################
# Sensitivity analyses - part 2 #
# Explore reverse causation scenario #
############################################################################
##
## Extract summary data for CHD
exp2_dat <- extract_instruments(outcomes = 7) # Only SNPs p1 = 5e-08 & r2 < 0.001
##
## Extract corresponding data for FA
out2_dat <- extract_outcome_data(snps = exp2_dat$SNP, outcomes = 852)
##
## Harmonise exposure and outcome data
dat2 <- harmonise_data(exp2_dat, out2_dat)
##
## Reverse MR estimate
res2 <- mr(dat2)
############################################################################
# Sensitivity analyses - part 3 #
# Explore horizontal pleiotropy scenario #
############################################################################
############################################################################
# Part 3.1.1. Genotype x gene expression - top hits look up #
############################################################################
##
## Format GTEx data
gtex_files <- list.files(path = "data", pattern = "^gtex_rs", full.names = T)
gtex_snps <- stringr::str_extract_all(gtex_files, "rs[0-9]*")
format_gtex_dat <- function(files) {
df <- fread(files) %>%
rename(., SNP = "SNP Id", snpid = "Variant Id", Gene = "Gene Symbol", pval = "P-Value") %>%
mutate(., alleles = stringr::str_extract_all(snpid, "[A-C-G-T]")) %>%
mutate(., EA = alleles[[1]][2], NEA = alleles[[1]][1]) %>%
merge(., dat, by = "SNP") %>%
mutate(., beta.eqtl = ifelse(EA == effect_allele.exposure, NES, -NES)) %>%
mutate(., p_dir = abs(log(pval, base=10))*sign(beta.eqtl)) %>%
select(., SNP, Gene, Tissue, effect_allele.exposure, p_dir) %>%
arrange(Tissue)
}
gtex_dat <- map(gtex_files, format_gtex_dat)
##
## Make plots for eQTLs GTex data
make_gtex_plot <- function(df, top) {
p <- ggplot(df, aes(x=Tissue, y=p_dir)) +
geom_col(aes(color = Gene, fill = Gene)) +
coord_flip() +
labs(title = paste("Top hit:", top)) +
ylab("-log10(p) x direction of effect") +
theme(axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
geom_hline(yintercept = 0) +
scale_y_continuous(limit = c(-30, 32),
breaks = c(-30, -20, -10, 0, 10, 20, 30))
}
gtex_plots <- map2(gtex_dat, gtex_snps, ~make_gtex_plot(.x, .y))
############################################################################
# Part 3.2.1. Genotype x metabolites - top hits look up #
############################################################################
##
## Format abbreviated names of metabolites
metab_ao <- filter(ao, author == "Kettunen") %>%
mutate_at(., vars(trait), ~gsub("Summary_statistics_MAGNETIC_|.txt.tab", "", filename) )
##
## Format metabolomics GWAS data
metab_dat <- filter(ao, author == "Kettunen") %>%
pull(id) %>%
extract_outcome_data(snps = dat$SNP, outcomes = .) %>%
harmonise_data(exp_dat, .) %>%
merge(., metab_ao, by.x = "id.outcome", by.y = "id") %>%
select(SNP, beta.outcome, se.outcome, pval.outcome, trait, subcategory, id.outcome)
##
## Split metabolomics GWAS data (lipoprotein vs non-lipoprotein traits)
metab_dat.ls <- list(filter(metab_dat, subcategory == "Lipid"),
filter(metab_dat, subcategory != "Lipid"))
# Make plots for top hit x metabolite association
make_metabplot <- function(dat) {
p <- forestplot(df = dat,
name = trait,
estimate = beta.outcome,
se = se.outcome,
pvalue = pval.outcome,
psignif = 0.0005,
logodds = FALSE,
colour = SNP,
xlab = "1-SD increment per DHA increasing allele"
) +
theme(axis.text.y = element_text(size = 6)) +
theme(axis.title.x = element_text(size=6)) +
theme(legend.position="bottom", legend.box = "horizontal") +
theme(legend.title=element_blank(), legend.text = element_text(size=8))
}
metab_plot <- map(metab_dat.ls, make_metabplot)
############################################################################
# Part 3.2.2. DHA x metabolites - Multi-trait colocalization analysis #
############################################################################
##
## Select IDs for metabolites if P < 0.05 with any top hit
metab_id <- filter(metab_dat, pval.outcome < 0.0005) %>%
arrange(SNP) %>%
group_by(SNP) %>%
group_split %>%
map(., ~pull(., id.outcome))
##
## Import and format metabolites summary data
ket_all <- fread("./data/ket_file.txt") %>%
merge(., metab_ao, by.x = "file", by.y = "filename")
##
## Create matrices of SNP-metabolite betas and SEs for each loci
make_mtx <- function(snps.coloc, metab.coloc, est) {
mtx <- ket_all %>%
filter(., snp %in% snps.coloc &
id %in% metab.coloc) %>%
select(., snp, trait, beta, se) %>%
data.table %>%
data.table::dcast(., snp ~ trait, value.var = c("beta", "se")) %>%
select(., snp, starts_with(paste0(est, "_"))) %>%
rename_at(., vars(-snp), ~sub(paste0(est, "_"), "", .)) %>%
tibble::column_to_rownames(., var = "snp") %>%
as.matrix %>%
na.omit
if(ncol(mtx) > 1) {
mtx <- mtx
} else {
mtx <- NULL
}
}
betas.ls <- map2(snps, metab_id, ~make_mtx(.x, .y, est = "beta")) %>% compact
ses.ls <- map2(snps, metab_id, ~make_mtx(.x, .y, est = "se")) %>% compact
##
## Function for multi-trait colocalization analysis
run_hypr_coloc <- function(beta, se, uni.priors = F) {
traits <- colnames(beta)
cont_out <- c(rep(0, ncol(beta)))
rsid <- rownames(beta)
res <- hyprcoloc::hyprcoloc(beta, se,
binary.outcomes = cont_out,
trait.names = traits,
snp.id = rsid,
uniform.priors = uni.priors)
}
##
## Using default priors (can be overly conservative)
multi_coloc_res <- map2(betas.ls, ses.ls, ~run_hypr_coloc(beta = .x, se = .y)) %>%
flatten %>%
map2(., names(betas.ls), ~mutate(.x, loci = .y)) %>%
bind_rows
##
## Using uniform priors
multi_coloc_res_uni <- map2(betas.ls, ses.ls, ~run_hypr_coloc(beta = .x, se = .y, uni.priors = T)) %>%
flatten %>%
map2(., names(betas.ls), ~mutate(.x, loci = .y)) %>%
bind_rows
############################################################################
# Part 3.3.1. Multivariable MR #
############################################################################
exp_mvdat <- mv_extract_exposures(c(852, 936))
out_mvdat <- extract_outcome_data(exp_mvdat$SNP, 7)
mvdat <- mv_harmonise_data(exp_mvdat, out_mvdat)
mv_res <- mv_multiple(mvdat)
############################################################################
# Part 3.3.2. Conservative approach #
############################################################################
cons_res <- filter(dat, SNP == "rs174546" | SNP == "rs2281591") %>%
mr(.)