Bioinformatics. 2021 Jul 12;37(Suppl_1):i34-i41.
doi: 10.1093/bioinformatics/btab327.

Statistical approaches for differential expression analysis in metatranscriptomics

Yancong Zhang et al. Bioinformatics. 2021.

Abstract

Motivation: Metatranscriptomics (MTX) has become an increasingly practical way to profile the functional activity of microbial communities in situ. However, MTX remains underutilized due to experimental and computational limitations. The latter are complicated by non-independent changes in both RNA transcript levels and their underlying genomic DNA copies (as microbes simultaneously change their overall abundance in the population and regulate individual transcripts), genetic plasticity (as whole loci are frequently gained and lost in microbial lineages) and measurement compositionality and zero-inflation. Here, we present a systematic evaluation of and recommendations for differential expression (DE) analysis in MTX.

Results: We designed and assessed six statistical models for DE discovery in MTX that incorporate different combinations of DNA and RNA normalization and assumptions about the underlying changes of gene copies or species abundance within communities. We evaluated these models on multiple simulated and real multi-omic datasets. Models adjusting transcripts relative to their encoding gene copies as a covariate were significantly more accurate in identifying DE from MTX in both simulated and real datasets. Moreover, we show that when paired DNA measurements (metagenomic data) are not available, models normalizing MTX measurements within-species while also adjusting for total-species RNA balance sensitivity, specificity and interpretability of DE detection, as does filtering likely technical zeros. The efficiency and accuracy of these models pave the way for more effective MTX-based DE discovery in microbial communities.
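
To make the covariate idea concrete, the following is a minimal sketch (not the authors' implementation) of why adjusting a transcript for its encoding gene copies matters. It fits two ordinary least-squares models to one feature: a naive model of log RNA on phenotype, and a model that adds log DNA as a covariate. The function names, the toy values and the use of plain log abundances are all assumptions made for illustration; the abstract does not specify transformations or pseudocounts, and no p-values are computed here.

```python
def ols(X, y):
    """Least-squares coefficients for y ~ X, solving the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def fit_naive(log_rna, phenotype):
    """Hypothetical helper: log RNA ~ intercept + phenotype."""
    X = [[1.0, p] for p in phenotype]
    return ols(X, log_rna)


def fit_dna_covariate(log_rna, log_dna, phenotype):
    """Hypothetical helper: log RNA ~ intercept + phenotype + log DNA."""
    X = [[1.0, p, d] for p, d in zip(phenotype, log_dna)]
    return ols(X, log_rna)


# Toy feature (assumed values): gene copies rise in the second group,
# but expression per gene copy is constant, so there is no true DE.
phenotype = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
log_dna = [1.0, 1.2, 0.9, 2.0, 2.1, 1.8]
log_rna = [d + 0.5 for d in log_dna]

naive_effect = fit_naive(log_rna, phenotype)[1]
adjusted_effect = fit_dna_covariate(log_rna, log_dna, phenotype)[1]
print("phenotype effect, naive model:        ", round(naive_effect, 4))
print("phenotype effect, DNA-covariate model:", round(adjusted_effect, 4))
```

In this toy case the naive model reports a phenotype effect of roughly 0.93 log units that is driven entirely by copy number, while the phenotype coefficient in the covariate model is essentially zero. A real analysis would also need standard errors, a zero-handling strategy and multiple-testing correction.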

Availability and implementation: The analysis code and synthetic datasets used in this evaluation are available online at http://huttenhower.sph.harvard.edu/mtx2021.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Normalization and DE models for microbial community transcript abundances. (A) Each panel corresponds to a simple conceptual community (two species, A and B, each contributing two genes) assayed by MTX and MGX sequencing. Case 0 represents a reference condition, while Cases 1–4 correspond to perturbations of the reference. While the RNA abundance of gene B:4 (dashed outline) differs under each perturbation, only in Case 4 is the change attributable to DE rather than gene copy-number variation. (B) A summary of six linear models for assaying DE of an MTX feature f with respect to a sample phenotype/property p. Models 2–6 incorporate transformations and covariates aimed at minimizing spurious DE signals from gene copy number
Fig. 2.
DE models for MTX that account for underlying variation in gene copy number control FP rates while maintaining statistical power. We evaluated the performance of six models for DE in MTX data (M1–6; Fig. 1B) on nine synthetic datasets with paired MTX and MGX measurements of known taxonomy (Table 1). ‘Null’ datasets (top row) contained no positive DE signatures and were evaluated in comparison with the theoretical nominal type-I error rate only (dashed lines; FPR = 0.05). ‘True’ datasets contained 10% positive DE relationships and were evaluated on the basis of their sensitivity (accounting for multiple hypothesis correction; middle row) versus nominal type-I error rate (bottom row). Error bars reflect the 95% CI for percentages.
Fig. 3.
DE models for MTX benefit from pre-filtering of probable technical zeros. We assessed three pre-filtering strategies for balancing biological and technical zeros in MTX data. (A) Under ‘lenient’ pre-filtering, features were analyzed for DE if they were ever detected (non-zero) at the RNA level (or gene-copy level, where applicable). (B) Under ‘semi-strict’ pre-filtering, samples were excluded if both a feature’s RNA count and gene-copy estimate (where applicable) were zero. (C) Under ‘strict’ pre-filtering, samples were excluded if either a feature’s RNA count or gene-copy estimate was zero. Features that were excluded from analysis were scored as ‘not DE’ (i.e. negatives). Gray cells indicate ‘undefined’ TPR for datasets lacking positive DE signals.
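
The three pre-filtering strategies in this caption can be sketched as a per-feature sample filter. This is an illustrative reading of the caption rather than the published code: the function name `prefilter` and the toy counts are hypothetical, and the 'lenient' rule is interpreted here as "keep every sample as long as the feature is non-zero somewhere at the RNA or gene-copy level", which is an assumption.

```python
def prefilter(rna, dna, mode):
    """Return the indices of samples kept for one feature.

    rna, dna: per-sample RNA counts and gene-copy estimates for the feature.
    An empty result means the feature is not analyzed at all.
    """
    pairs = list(zip(rna, dna))
    if mode == "lenient":
        # Analyze the feature if it was ever detected; keep all samples.
        detected = any(r > 0 or d > 0 for r, d in pairs)
        return list(range(len(pairs))) if detected else []
    if mode == "semi-strict":
        # Drop samples where BOTH RNA and gene copies are zero.
        return [i for i, (r, d) in enumerate(pairs) if not (r == 0 and d == 0)]
    if mode == "strict":
        # Drop samples where EITHER RNA or gene copies are zero.
        return [i for i, (r, d) in enumerate(pairs) if r > 0 and d > 0]
    raise ValueError(f"unknown pre-filtering mode: {mode}")


# Toy feature across four samples (assumed values).
rna = [0, 5, 0, 3]
dna = [0, 2, 4, 0]
for mode in ("lenient", "semi-strict", "strict"):
    print(mode, prefilter(rna, dna, mode))
```

With these toy values, lenient filtering keeps all four samples, semi-strict drops only the double-zero sample, and strict keeps the single sample in which both measurements are non-zero.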
Fig. 4.
Adjusting RNA for DNA gene copy number provides consistently higher performance in DE analysis for communities containing unknown taxonomy. We assessed the performance of community DE models that do not require a mapping of MTX features to taxa (M1, M4 and M6) on communities of unknown taxonomy. One of these community datasets (‘group-null-enc’) was spiked with confounding gene presence/absence signals, while a second (‘group-true-exp’) contained positive DE signals. Spikes were generated at the gene family (orthogroup) level. Each method was evaluated in combination with three pre-filtering schemes for managing zero inflation (Supplementary Fig. S1). Error bars reflect the 95% CI for percentages
Fig. 5.
DE of E. coli pilin-like proteins in the IBD gut microbiome. We previously prioritized 113 pilin-family proteins assigned to E. coli for roles in IBD-associated inflammation based on their MGX properties. We surveyed these proteins for differential functional activity during inflammation using three models of community DE that performed well during synthetic evaluations (Fig. 2). (A) The feature-DNA covariate model (M6) identified 16 genes with significantly elevated expression among dysbiotic samples (FDR q < 0.25, as emphasized in bold text). (B) The taxon-RNA covariate model (M3) tended to agree with these trends in sign but not statistical significance, while (C) the RNA/DNA ratio model (M4) tended to identify these trends as significant in the opposite direction. M3- and M4-specific trends are further compared in Supplementary Figure S5.
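
The FDR q-value threshold mentioned in this caption can be illustrated with a short sketch of the Benjamini-Hochberg step-up adjustment. Whether the authors used exactly this procedure is an assumption here; the function name `bh_qvalues` and the example p-values are hypothetical.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


# Toy p-values for four features (assumed values).
pvals = [0.01, 0.04, 0.03, 0.5]
qvals = bh_qvalues(pvals)
significant = [q < 0.25 for q in qvals]
print([round(q, 4) for q in qvals], significant)
```

Under these toy inputs, the first three features pass a q < 0.25 threshold and the fourth does not.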
