--- title: "Getting started with subgxe" output: rmarkdown::pdf_document vignette: > %\VignetteIndexEntry{Getting started with subgxe} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction `subgxe` is an R package for combining summary data from multiple association studies (or multiple phenotypes in a single study) by incorporating potential gene-environment (G-E) interactions into the testing procedure. It is an implementation of the *p value-assisted subset testing for associations (pASTA)* framework proposed by Yu et al(2019). pASTA aims to identify a subset of studies or traits that yields the strongest evidence of associations and calculates a meta-analytic p-value from that subset. This vignette offers a brief introduction to the basic use of `subgxe`. For more details on how pASTA works, please refer to the paper. ## subgxe Example We simulated $K=5$ independent case-control studies that are included in the package (`subgxe::studies`) to illustrate the basic use of `subgxe`. In each data set, `G`, `E`, and `D` denote the genetic variant, environmental factor, and disease status (binary outcome) respectively. In this case, `G` is coded as binary (under a dominant or recessive susceptibility model). It can also be coded as allele count (under the additive model). Two of the 5 studies have non-null genetic associations with the true marginal genetic odds ratio being 1.09. Each study has 6,000 cases and 6,000 controls, with the total sample size $n_k$ being 12000. For the specific underlying parameters of the data generating model, please refer to the original [article](https://doi.org/10.1159/000496867) (Table A4, Scenario 2). ```{r setup} # install.packages("subgxe") library(subgxe) ``` We first obtain a vector of input p-values of length $K$ by conducting an association test for each study. For each study $k$, the *joint* model with G-E interaction taken into account is $$ \mathrm{logit}[E(D_{ki}|G_{ki},E_{ki})] = \beta_0^{(k)} + \beta_G^{(k)}G_{ki} + \beta_E^{(k)}E_{ki} + \beta_{GE}^{(k)} G_{ki}E_{ki} $$ where $i = 1, \cdots, n_k$. This model can be further adjusted for potential confounders, which we drop from the formula for simplicity of notation. To detect the genetic association while accounting for the G-E interaction, we test the null hypothesis $$ \beta_{G}^{(k)}=\beta_{GE}^{(k)}=0 $$ based on the joint model for each study. The coefficients can be estimated by maximum likelihood using the `glm` function. For alternative null hypotheses and methods for estimation of coefficients, refer to the paper. A common choice for testing the null hypothesis is the likelihood ratio test (LRT) with 2 degrees of freedom, which can be carried out with the `lrtest` function in the package `lmtest`. We will use the results of the LRT in our example. For comparative purposes, we also look at the p-values of the *marginal* genetic associations obtained by Wald test, i.e., the p-values of $\hat{\alpha}_G^{(k)}$ in the model $$\text{logit}[E(D_{ki}|G_{ki})]=\alpha_0^{(k)}+\alpha_G^{(k)}G_{ki}$$ ```{r pvalues, message = FALSE} library(lmtest) # number of studies K <- 5 study.pvals.marg <- NULL study.pvals.joint <- NULL for (i in 1:K) { joint.model <- glm(D ~ G + E + I(G*E), data = studies[[i]], family = "binomial") null.model <- glm(D ~ E, data = studies[[i]], family = "binomial") marg.model <- glm(D ~ G, data = studies[[i]], family = "binomial") study.pvals.marg[i] <- summary(marg.model)$coef[2, 4] study.pvals.joint[i] <- lmtest::lrtest(null.model, joint.model)[2, 5] } ``` Now, we just need to specify the `cor` parameter -- the correlation matrix of the study-specific p-values. In this example, since the studies are independent, the p-values are also independent, and therefore the `cor` should be the identity matrix. In a *multiple-phenotype* analysis where the phenotypes are measured on the same set of subjects, one way to approximate the correlations among p-values is to use the phenotypic correlations. Now, lets use the `pasta` function to perform subset analysis and obtain a meta-analytic p-value for the genetic association. ```{r pasta} study.sizes <- c(nrow(studies[[1]]), nrow(studies[[2]]), nrow(studies[[3]]), nrow(studies[[4]]), nrow(studies[[5]])) cor.matrix <- diag(1, K) pasta.joint <- pasta(p.values = study.pvals.joint, study.sizes = study.sizes, cor = cor.matrix ) pasta.marg <- pasta(p.values = study.pvals.marg, study.sizes = study.sizes, cor = cor.matrix ) pasta.joint$p.pasta pasta.joint$test.statistic$selected.subset pasta.marg$p.pasta pasta.marg$test.statistic$selected.subset ``` `pasta` yields a meta-analytic p-value of `r formatC(pasta.joint$p.pasta, format = "f", digits = 3)` when the G-E interaction is taken into account, and identifies the first two studies as non-null. On the other hand, if we only consider the marginal associations, the meta-analytic p-value becomes much larger (p=`r formatC(pasta.marg$p.pasta, format = "f", digits = 3)`) and the first three studies are identified as having significant associations. ## Reference * Yu Y, Xia L, Lee S, Zhou X, Stringham H, M, Boehnke M, Mukherjee B: Subset-Based Analysis Using Gene-Environment Interactions for Discovery of Genetic Associations across Multiple Studies or Phenotypes. *Hum Hered* 2019. doi: [10.1159/000496867](https://doi.org/10.1159/000496867)