Run the following command in R to install the MultiABEL package:
install.packages("MultiABEL", repos = "http://R-Forge.R-project.org")
For Mac users, please download the source .tar.gz and install via:
R CMD INSTALL MultiABEL_1.1-x.tar.gz
in your terminal. You need Fortran 4.8 compiler.
In R, load the package via:
library(MultiABEL)
##
## MultiABEL: Multi-Trait Genome-Wide Association Analyses
## Version 1.1-5 (2016-05-07) installed
## Author: Xia Shen
## Maintainer: Xia Shen <xia.shen@ki.se>
## Use citation("MultiABEL") to know how to cite this work.
or
require(MultiABEL)
You need to gather multiple GWAS summary statistics into your working directory. Let us demonstrate this via an example, and the example files can be found at https://goo.gl/r0Mcav. Here, we have the summary statistics for three traits: height, weight, body mass index (BMI), across 890 variants. There is also an .RData file indep.snps.RData to download, which contains the names of a set of independent variants in European ancestry populations. These variants are used for estimating the phenotypic correlations if unknown.
Each file of single-trait GWAS summary statistics should contain columns for variant names (default column name snp
), the first (coding or reference) alleles (default column name a1
), allele frequencies (default column name freq
), effect sizes (default column name beta
), standard errors (default column name se
), and sample sizes (default column name n
). For example, here the top of the summary statistics file for height looks like:
height <- read.table('height.txt', header = TRUE)
head(height)
## snp a1 a2 freq beta se p n
## 1 rs4747841 a g 0.5500 0.0025 0.0061 0.680 60558.2
## 2 rs4749917 t c 0.4500 -0.0025 0.0061 0.680 60558.1
## 3 rs737656 a g 0.3667 -0.0073 0.0064 0.250 60529.2
## 4 rs737657 a g 0.3583 -0.0075 0.0064 0.240 60527.3
## 5 rs17524355 t c NA -0.0460 0.0460 0.320 1700.0
## 6 rs7086391 t c 0.2000 -0.0130 0.0079 0.099 59675.0
where the columns p
and a2
are extra information. For default simple pleiotropic analysis (MultiSummary(..., type = 'direct')
, see below), n
is not essentially required, if unknown, simply give a large number, e.g. 10,000. For MultiSummary(..., type = 'precise')
, freq
can be any values between 0 and 1, as the exact genotypic variances are given.
MultiABEL does NOT require different single-trait GWAS having been performed in exactly the same individuals.
We can easily load the summary-level data into R via:
load('indep.snps.RData')
data <- load.summary(c('height.txt', 'weight.txt', 'bmi.txt'), indep.snps = indep.snps)
## loading data ...
## Progress: 33% Progress: 67% Progress: 100%
## checking markers ...
## Progress: 67% Progress: 100%
## cleaning data ...
## Progress: 33% Progress: 67% Progress: 100%
## correcting parameters ...
## Progress: 67% Progress: 100%
## adjusting sample size ... done.
## finalizing summary statistics ...
## Progress: 33% Progress: 67% Progress: 100%
## samples partially overlap!
## estimating shrinkage phenotypic correlations ... done.
The first command reads a set of independent SNPs for correlation estimation. If you have your own set of independent markers to import, simply replace. The loaded data contains three sub-objects: $gwa
, $cor.pheno
and $var.pheno
, where $gwa
is a cleaned data frame of single-trait GWAS results, and the rest are shrinkage phenotypic correlation matrix and phenotypic variances, both estimated or given by user input.
head(data$gwa)
## height.txt.beta height.txt.se weight.txt.beta weight.txt.se
## rs4747841 0.0025 0.0061 0.0003 0.0064
## rs4749917 -0.0025 0.0061 -0.0003 0.0064
## rs737656 -0.0073 0.0064 -0.0063 0.0066
## rs737657 -0.0075 0.0064 -0.0056 0.0066
## rs7086391 -0.0130 0.0079 0.0094 0.0083
## rs878177 0.0140 0.0066 0.0008 0.0069
## bmi.txt.beta bmi.txt.se f n
## rs4747841 0.0005 0.0063 0.5500 58322.7
## rs4749917 -0.0005 0.0063 0.4500 58322.6
## rs737656 -0.0025 0.0066 0.3667 58322.7
## rs737657 -0.0020 0.0066 0.3583 58316.8
## rs7086391 0.0230 0.0082 0.2000 58322.5
## rs878177 -0.0100 0.0068 0.3000 58322.4
data$cor.pheno
## height.txt weight.txt bmi.txt
## height.txt 1.00000000 0.31558706 -0.16242127
## weight.txt 0.31558706 1.00000000 0.85458914
## bmi.txt -0.16242127 0.85458914 1.00000000
data$var.pheno
## [1] 1 1 1
Default load.summary(..., est.var = FALSE)
assumes all phenotypic variances are 1, which is a known value for GWAS with inverse-Gaussian transformation of the phenotypes. Setting est.var = TRUE
will estimate the phenotypic variances using summary statistics, which is useful e.g. for case-control studies where the variance of liability can be estimated.
Once the data are successfully loaded, the simplest multi-trait pleiotropic analysis is straightforward:
result <- MultiSummary(data)
## Multi-trait genome scan ... Done.
The result is a list with two sub-objects $scan
and $coef
. For this simple analysis, only the data frame $scan
is reported, where the column p
gives the multi-trait analysis p-values.
head(result$scan)
## marker freq n p
## rs4747841 rs4747841 0.5500 58322.7 0.79726544739
## rs4749917 rs4749917 0.4500 58322.6 0.79726544741
## rs737656 rs737656 0.3667 58322.7 0.63421469840
## rs737657 rs737657 0.3583 58316.8 0.65263056840
## rs7086391 rs7086391 0.2000 58322.5 0.00043028835
## rs878177 rs878177 0.3000 58322.4 0.01823161595
The result is analog to the MANOVA analysis in R, such as manova(cbind(Trait1, Trait2, Trait3) ~ SNP)
.
If we assume that our outbred populations for the three phenotypes follow Hardy-Weinberg equilibrium (HWE), we can then perform deeper pleiotropic meta-analysis, with more estimates including conditional genetic effects. For example:
result.out <- MultiSummary(data, type = 'outbred')
## Multi-trait genome scan ... Done.
The result is a list with two sub-objects $scan
and $coef
. Now, the data frame $scan
is reported with more columns, corresponding to three conditional GWAS analyses, i.e. single-trait GWAS for each trait with the other traits included as covariates:
head(result.out$scan)
## marker freq n p beta.score
## rs4747841 rs4747841 0.5500 58322.7 0.79726544739 1.8754020e-05
## rs4749917 rs4749917 0.4500 58322.6 0.79726544741 1.8754020e-05
## rs737656 rs737656 0.3667 58322.7 0.63421469840 3.3824150e-05
## rs737657 rs737657 0.3583 58316.8 0.65263056840 3.1213268e-05
## rs7086391 rs7086391 0.2000 58322.5 0.00043028835 3.9293148e-04
## rs878177 rs878177 0.3000 58322.4 0.01823161595 2.0098628e-04
## se.score beta.cond.height.txt se.cond.height.txt
## rs4747841 7.3004733e-05 0.00279569491 0.0026842035
## rs4749917 7.3004733e-05 -0.00279569491 0.0026842058
## rs737656 7.1088553e-05 -0.00069077805 0.0027711098
## rs737657 6.9345748e-05 -0.00126907245 0.0027851258
## rs7086391 1.1160370e-04 0.00801804470 0.0033387984
## rs878177 8.5131924e-05 -0.00337208905 0.0029142862
## p.cond.height.txt beta.cond.weight.txt se.cond.weight.txt
## rs4747841 0.297626798 -0.00133195655 0.0014126825
## rs4749917 0.297627212 0.00133195655 0.0014126838
## rs737656 0.803145371 -0.00056711104 0.0014584137
## rs737657 0.648633972 -0.00023895596 0.0014657971
## rs7086391 0.016329074 -0.00593304553 0.0017570168
## rs878177 0.247235918 0.00357046716 0.0015336614
## p.cond.weight.txt beta.cond.bmi.txt se.cond.bmi.txt
## rs4747841 0.34575444223 1.3979808e-03 0.0014689944
## rs4749917 0.34575485575 -1.3979808e-03 0.0014689957
## rs737656 0.69738363269 3.3470484e-04 0.0015165516
## rs737657 0.87050176200 3.4480639e-05 0.0015242273
## rs7086391 0.00073341764 7.3051451e-03 0.0018269113
## rs878177 0.01990852313 -4.0863748e-03 0.0015947648
## p.cond.bmi.txt
## rs4747841 3.4127027e-01
## rs4749917 3.4127068e-01
## rs737656 8.2532506e-01
## rs737657 9.8195202e-01
## rs7086391 6.3709785e-05
## rs878177 1.0396101e-02
The p
column is analog to the MANOVA analysis in R, such as manova(cbind(Trait1, Trait2, Trait3) ~ SNP)
. The remaining three columns per trait are analog to the regression analysis in R, such as lm(Trait1 ~ Trait2 + Trait3 + SNP)
.
The $coef
data frame in the result gives an alternative conditional analysis, i.e. the multiple regression of the SNP dosage on the phenotypes, analog to R analysis lm(SNP ~ Trait1 + Trait2 + Trait3)
.
In inbred populations, we have the allele frequencies being equal to the corresponding genotype frequencies. Therefore, different from a HWE outbred population where \(genotype\sim Binomial(2,freq)\), we have \(genotype\sim Bernoulli(freq)\). This does not change the pleiotropic analysis p-values, but will have an effect on the conditional analysis. The analysis using MultiABEL is very similar to above: MultiSummary(data, type = 'inbred')
.
Regardless of HWE or the genotype frequencies, if the genotypic variance per variant is precisely calculated, one can perform the above analysis in a generic way. As an example, let us assume our summary statistics were indeed obtained in a HWE population, but somehow our minor allele frequencies (MAF) were over-estimated by 25%. We can then calculate the correct genotypic variances:
f <- data$gwa[,'f']
minor <- data$gwa[,'f'] <= 0.5
f[minor] <- data$gwa[minor,'f']*.8
f[!minor] <- 1 - (1 - data$gwa[!minor,'f'])*.8
gv <- 2*f*(1 - f)
Then, the correct pleiotropic and conditional meta analysis can be performed by:
result.pre <- MultiSummary(data, type = 'precise', vars = gv)
## Multi-trait genome scan ... Done.
We can see that the joint analysis is not affected by allele frequencies being incorrectly reported, but the conditional analysis is. Over-estimating MAF inflates the conditional analysis signals.
For direct R documentation of the two functions above, you can simply use question mark in R:
?load.summary
?MultiSummary
If you have specific questions, we advice you to use our GenABEL project forum at http://www.genabel.org and for urgent questions, you may email the maintainer of MultiABEL via xia dot shen at ed dot ac dot uk.