Bioinformatics. 2013 Feb 15;29(4):476-82.
doi: 10.1093/bioinformatics/bts727. Epub 2013 Jan 6.

Assessing identity, redundancy and confounds in Gene Ontology annotations over time

Jesse Gillis et al. Bioinformatics.

Abstract

Motivation: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored.

Results: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their 'functional identity' over time, with 20% of genes not matching themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more thoroughly characterized than others, has declined in yeast but generally increased in humans. Finally, we find that many entries in protein interaction databases derive from the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks.

Availability: Data available at http://chibi.ubc.ca/assessGO.

Figures

Fig. 1.
Gene functional identity changes over GO editions. The shading indicates the fraction of genes that retain a functional identity between GO editions. Semantic similarity is calculated and genes are matched between editions; if a gene is most similar to itself between editions, it is said to retain its identity. Similarity is not symmetric in time (gene i in edition GO_A may rank gene i in edition GO_B as most similar to it, without the reverse being true). Matching backward in time is shown below the diagonal; matching forward in time, above it
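
The matching procedure in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity measure here is a plain Jaccard overlap of GO term sets (the page does not say which semantic similarity measure the authors used), ties are counted as a loss of identity, and the gene names, GO identifiers and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two GO term sets (a stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retains_identity(gene, edition_from, edition_to):
    """True if `gene` in edition_from is strictly most similar to itself in edition_to."""
    if gene not in edition_to:
        return False
    terms = edition_from[gene]
    self_score = jaccard(terms, edition_to[gene])
    # Assumption: a tie with any other gene counts as a lost identity.
    return all(
        jaccard(terms, other_terms) < self_score
        for other, other_terms in edition_to.items()
        if other != gene
    )


def fraction_retained(edition_from, edition_to):
    """Fraction of genes in edition_from that keep their identity in edition_to."""
    if not edition_from:
        return 0.0
    kept = sum(retains_identity(g, edition_from, edition_to) for g in edition_from)
    return kept / len(edition_from)


# Toy editions (hypothetical genes and GO identifiers).
edition_a = {"G1": {"GO:1", "GO:2"}, "G2": {"GO:3"}, "G3": {"GO:4", "GO:5"}}
edition_b = {"G1": {"GO:1", "GO:2", "GO:6"}, "G2": {"GO:4", "GO:5"}, "G3": {"GO:4", "GO:5"}}

forward = fraction_retained(edition_a, edition_b)   # matching A into B
backward = fraction_retained(edition_b, edition_a)  # matching B into A
```

In this toy data the forward and backward fractions differ, which mirrors the asymmetry in time that the caption describes.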
Fig. 2.
Annotation bias persists in the GO. (A) Annotation bias has risen among human genes, indicating genes with many annotations have become more dominant within GO over time. (B) Annotation bias has generally fallen for yeast, aligned to remove two discontinuities that we regarded as artifactual. (C) The relative number of annotations a gene possesses has remained stable over time, with some change (correlation shown). (D) Annotation bias (expressed as the number of GO terms for a gene) is correlated with the rank of the numerical ID of the gene in NCBI, indicating a historical bias
Fig. 3.
Data are reused in protein-interaction networks and GO. (A) Many GO groups have a large fraction of their network functional connectivity coming from the same publication as the GO annotations (‘confounded’). (B) Most network connections can be used to infer some function due to confounds
Fig. 4.
Confounded edges are likely to have either very low or very high impact on determining function within networks. ‘Confound’ is calculated as the fraction of a protein pair's shared functional assignments that overlap (in either part) with the article reporting the protein interaction. Exceptionality was calculated as the effect of a given edge’s removal on network function prediction performance in cross-validation (Gillis and Pavlidis, 2012). The data are binned (bins of 100 edges per point, non-overlapping) to emphasize the trend
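
One possible reading of the ‘confound’ score for a single protein pair is sketched below in Python. It is an interpretation, not the authors' implementation: a shared GO term is treated as confounded when the annotation of either protein with that term cites a PubMed ID that also reports the interaction. The function name, data layout and all identifiers are hypothetical.

```python
def confound_fraction(annotations_a, annotations_b, interaction_pmids):
    """Fraction of shared GO terms whose supporting articles also report the interaction.

    annotations_a, annotations_b: dict mapping GO term -> set of PubMed IDs
        supporting that annotation for each protein (assumed layout).
    interaction_pmids: set of PubMed IDs reporting the interaction of the pair.
    """
    shared = annotations_a.keys() & annotations_b.keys()
    if not shared:
        return 0.0
    confounded = sum(
        1
        for term in shared
        # "in either part": evidence for either protein may overlap.
        if (annotations_a[term] | annotations_b[term]) & interaction_pmids
    )
    return confounded / len(shared)


# Toy protein pair with placeholder identifiers.
protein_x = {"GO:a": {"pmid_1"}, "GO:b": {"pmid_2"}, "GO:c": {"pmid_3"}}
protein_y = {"GO:a": {"pmid_4"}, "GO:b": {"pmid_5"}, "GO:d": {"pmid_1"}}
interaction = {"pmid_1"}

score = confound_fraction(protein_x, protein_y, interaction)
```

Here the pair shares two terms, one of which is supported by the article that reports the interaction, so half of the shared assignments are confounded. Terms held by only one protein do not enter the score.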
Fig. 5.
Potential confounds in functional analysis of protein interactions over time. ‘Confound’ is defined as in Figure 3A (function centered, black lines) and 3B (connection centered, gray lines). (A) The number of functions per connection with PubMed ID overlaps between function assignment and interaction report is shown (connection centered) as well as the number of functional edges within a function that have PubMed ID overlap (function centered). (B) Confounds computed using only ‘IPI’ (inferred from physical interaction). (C) Confounds calculated using changing Gene Annotations on a fixed GO (most recent). (D) Confounds for IPI annotations calculated using a fixed ontology
Fig. 6.
Module 3 from the PSP case study. The module is shown with genes annotated with the enriched functions shown in dark gray. JUP and CDH2 (diamonds) received annotations from articles reporting both their functional annotation and their interaction (PubMed IDs 1639850 and 7650039)
