Open Access. Powered by Scholars. Published by Universities.®

Life Sciences Commons

Bioinformatics

SelectedWorks

GO classification

Full-Text Articles in Life Sciences

Pitfalls Of Ascertainment Biases In Genome Annotations—Computing Comparable Protein Domain Distributions In Eukarya, Arli A. Parikesit, Lydia Steiner, Peter F. Stadler, Sonja J. Prohaska Jan 2014

Arli A Parikesit

Most investigations into the large-scale patterns of protein evolution are based on gene annotations that have been compiled in reference databases. The use of these resources for quantitative comparisons, however, is complicated by sometimes vast differences in coverage. More importantly, we also observe substantial ascertainment biases that cannot be removed by simple normalization procedures. A striking example is provided by the correlations between protein domains. We observe that statistics derived from different computational gene annotation procedures show dramatic discrepancies, and even qualitative changes from negative to positive correlation, when compared to statistics obtained from annotation databases.


Evolution And Quantitative Comparison Of Genome-Wide Protein Domain Distributions, Arli A. Parikesit, Peter F. Stadler, Sonja J. Prohaska Jan 2011

Arli A Parikesit

The metabolic and regulatory capabilities of an organism are implicit in its protein content. This content is often hard to estimate, however, due to ascertainment biases inherent in the available genome annotations. The organism's complement of recognizable functional protein domains and their combinations convey essentially the same information and are at the same time much more readily accessible, although protein domain models trained for one phylogenetic group frequently fail on distantly related sequences. Pooling related domain models based on their GO annotation, in combination with de novo gene prediction methods, provides estimates that seem to be less affected by phylogenetic biases. We show …