Getting started with subgxe

Introduction

subgxe is an R package for combining summary data from multiple association studies (or multiple phenotypes in a single study) by incorporating potential gene-environment (G-E) interactions into the testing procedure. It is an implementation of the p value-assisted subset testing for associations (pASTA) framework proposed by Yu et al(2019). pASTA aims to identify a subset of studies or traits that yields the strongest evidence of associations and calculates a meta-analytic p-value from that subset. This vignette offers a brief introduction to the basic use of subgxe. For more details on how pASTA works, please refer to the paper.

subgxe Example

We simulated \(K=5\) independent case-control studies that are included in the package (subgxe::studies) to illustrate the basic use of subgxe. In each data set, G, E, and D denote the genetic variant, environmental factor, and disease status (binary outcome) respectively. In this case, G is coded as binary (under a dominant or recessive susceptibility model). It can also be coded as allele count (under the additive model). Two of the 5 studies have non-null genetic associations with the true marginal genetic odds ratio being 1.09. Each study has 6,000 cases and 6,000 controls, with the total sample size \(n_k\) being 12000. For the specific underlying parameters of the data generating model, please refer to the original article (Table A4, Scenario 2).

# install.packages("subgxe")
library(subgxe)

We first obtain a vector of input p-values of length \(K\) by conducting an association test for each study. For each study \(k\), the joint model with G-E interaction taken into account is \[ \mathrm{logit}[E(D_{ki}|G_{ki},E_{ki})] = \beta_0^{(k)} + \beta_G^{(k)}G_{ki} + \beta_E^{(k)}E_{ki} + \beta_{GE}^{(k)} G_{ki}E_{ki} \] where \(i = 1, \cdots, n_k\). This model can be further adjusted for potential confounders, which we drop from the formula for simplicity of notation.

To detect the genetic association while accounting for the G-E interaction, we test the null hypothesis \[ \beta_{G}^{(k)}=\beta_{GE}^{(k)}=0 \] based on the joint model for each study. The coefficients can be estimated by maximum likelihood using the glm function. For alternative null hypotheses and methods for estimation of coefficients, refer to the paper.

A common choice for testing the null hypothesis is the likelihood ratio test (LRT) with 2 degrees of freedom, which can be carried out with the lrtest function in the package lmtest. We will use the results of the LRT in our example. For comparative purposes, we also look at the p-values of the marginal genetic associations obtained by Wald test, i.e., the p-values of \(\hat{\alpha}_G^{(k)}\) in the model

\[\text{logit}[E(D_{ki}|G_{ki})]=\alpha_0^{(k)}+\alpha_G^{(k)}G_{ki}\]

library(lmtest)

# number of studies
K <- 5
study.pvals.marg  <- NULL
study.pvals.joint <- NULL

for (i in 1:K) {
  joint.model <- glm(D ~ G + E + I(G*E), data = studies[[i]], family = "binomial")
  null.model  <- glm(D ~ E, data = studies[[i]], family = "binomial")
  marg.model  <- glm(D ~ G, data = studies[[i]], family = "binomial")
  study.pvals.marg[i]  <- summary(marg.model)$coef[2, 4]
  study.pvals.joint[i] <- lmtest::lrtest(null.model, joint.model)[2, 5]
}

Now, we just need to specify the cor parameter – the correlation matrix of the study-specific p-values. In this example, since the studies are independent, the p-values are also independent, and therefore the cor should be the identity matrix. In a multiple-phenotype analysis where the phenotypes are measured on the same set of subjects, one way to approximate the correlations among p-values is to use the phenotypic correlations.

Now, lets use the pasta function to perform subset analysis and obtain a meta-analytic p-value for the genetic association.

study.sizes <- c(nrow(studies[[1]]), nrow(studies[[2]]), nrow(studies[[3]]),
                 nrow(studies[[4]]), nrow(studies[[5]]))

cor.matrix <- diag(1, K)
pasta.joint <- pasta(p.values = study.pvals.joint, study.sizes = study.sizes,
                       cor = cor.matrix
)
pasta.marg <- pasta(p.values = study.pvals.marg, study.sizes = study.sizes,
                      cor = cor.matrix
)

pasta.joint$p.pasta
#> [1] 0.001859015
pasta.joint$test.statistic$selected.subset
#>   Var1 Var2 Var3 Var4 Var5
#> 4    1    1    0    0    0

pasta.marg$p.pasta
#> [1] 0.03312643
pasta.marg$test.statistic$selected.subset
#>   Var1 Var2 Var3 Var4 Var5
#> 8    1    1    1    0    0

pasta yields a meta-analytic p-value of 0.002 when the G-E interaction is taken into account, and identifies the first two studies as non-null. On the other hand, if we only consider the marginal associations, the meta-analytic p-value becomes much larger (p=0.033) and the first three studies are identified as having significant associations.

Reference

Yu Y, Xia L, Lee S, Zhou X, Stringham H, M, Boehnke M, Mukherjee B: Subset-Based Analysis Using Gene-Environment Interactions for Discovery of Genetic Associations across Multiple Studies or Phenotypes. Hum Hered 2019. doi: 10.1159/000496867