subgxe
is an R package for combining summary data from
multiple association studies (or multiple phenotypes in a single study)
by incorporating potential gene-environment (G-E) interactions into the
testing procedure. It is an implementation of the p value-assisted
subset testing for associations (pASTA) framework proposed by Yu et
al(2019). pASTA aims to identify a subset of studies or traits that
yields the strongest evidence of associations and calculates a
meta-analytic p-value from that subset. This vignette offers a brief
introduction to the basic use of subgxe
. For more details
on how pASTA works, please refer to the paper.
We simulated K = 5
independent case-control studies that are included in the package
(subgxe::studies
) to illustrate the basic use of
subgxe
. In each data set, G
, E
,
and D
denote the genetic variant, environmental factor, and
disease status (binary outcome) respectively. In this case,
G
is coded as binary (under a dominant or recessive
susceptibility model). It can also be coded as allele count (under the
additive model). Two of the 5 studies have non-null genetic associations
with the true marginal genetic odds ratio being 1.09. Each study has
6,000 cases and 6,000 controls, with the total sample size nk being 12000.
For the specific underlying parameters of the data generating model,
please refer to the original article (Table A4, Scenario
2).
We first obtain a vector of input p-values of length K by conducting an association test for each study. For each study k, the joint model with G-E interaction taken into account is logit[E(Dki|Gki, Eki)] = β0(k) + βG(k)Gki + βE(k)Eki + βGE(k)GkiEki where i = 1, ⋯, nk. This model can be further adjusted for potential confounders, which we drop from the formula for simplicity of notation.
To detect the genetic association while accounting for the G-E
interaction, we test the null hypothesis βG(k) = βGE(k) = 0
based on the joint model for each study. The coefficients can be
estimated by maximum likelihood using the glm
function. For
alternative null hypotheses and methods for estimation of coefficients,
refer to the paper.
A common choice for testing the null hypothesis is the likelihood
ratio test (LRT) with 2 degrees of freedom, which can be carried out
with the lrtest
function in the package
lmtest
. We will use the results of the LRT in our example.
For comparative purposes, we also look at the p-values of the
marginal genetic associations obtained by Wald test, i.e., the
p-values of α̂G(k)
in the model
logit[E(Dki|Gki)] = α0(k) + αG(k)Gki
library(lmtest)
# number of studies
K <- 5
study.pvals.marg <- NULL
study.pvals.joint <- NULL
for (i in 1:K) {
joint.model <- glm(D ~ G + E + I(G*E), data = studies[[i]], family = "binomial")
null.model <- glm(D ~ E, data = studies[[i]], family = "binomial")
marg.model <- glm(D ~ G, data = studies[[i]], family = "binomial")
study.pvals.marg[i] <- summary(marg.model)$coef[2, 4]
study.pvals.joint[i] <- lmtest::lrtest(null.model, joint.model)[2, 5]
}
Now, we just need to specify the cor
parameter – the
correlation matrix of the study-specific p-values. In this example,
since the studies are independent, the p-values are also independent,
and therefore the cor
should be the identity matrix. In a
multiple-phenotype analysis where the phenotypes are measured
on the same set of subjects, one way to approximate the correlations
among p-values is to use the phenotypic correlations.
Now, lets use the pasta
function to perform subset
analysis and obtain a meta-analytic p-value for the genetic
association.
study.sizes <- c(nrow(studies[[1]]), nrow(studies[[2]]), nrow(studies[[3]]),
nrow(studies[[4]]), nrow(studies[[5]]))
cor.matrix <- diag(1, K)
pasta.joint <- pasta(p.values = study.pvals.joint, study.sizes = study.sizes,
cor = cor.matrix
)
pasta.marg <- pasta(p.values = study.pvals.marg, study.sizes = study.sizes,
cor = cor.matrix
)
pasta.joint$p.pasta
#> [1] 0.001859015
pasta.joint$test.statistic$selected.subset
#> Var1 Var2 Var3 Var4 Var5
#> 4 1 1 0 0 0
pasta.marg$p.pasta
#> [1] 0.03312643
pasta.marg$test.statistic$selected.subset
#> Var1 Var2 Var3 Var4 Var5
#> 8 1 1 1 0 0
pasta
yields a meta-analytic p-value of 0.002 when the
G-E interaction is taken into account, and identifies the first two
studies as non-null. On the other hand, if we only consider the marginal
associations, the meta-analytic p-value becomes much larger (p=0.033)
and the first three studies are identified as having significant
associations.