Open Access. Powered by Scholars. Published by Universities.®

Life Sciences Commons

Selected Works

Bioinformatics

Articles 1 - 30 of 36

Full-Text Articles in Life Sciences

Saccharomyces Genome Database & UniProt Bioinformatics Analysis, Ray A. Enke Dec 2018

Ray Enke Ph.D.

This in-class activity introduces basic bioinformatics analysis using the Saccharomyces Genome Database (SGD) and the UniProt database. The activity studies the yeast URA3 gene, but any other yeast gene can be substituted. It is designed for novice instructors and students and can be implemented in core biology lecture or lab courses.


Fast And Space-Efficient Location Of Heavy Or Dense Segments In Run-Length Encoded Sequences, Ronald I. Greenberg Jan 2018

Ronald Greenberg

This paper considers several variations of an optimization problem with potential applications in such areas as biomolecular sequence analysis and image processing. Given a sequence of items, each with a weight and a length, the goal is to find a subsequence of consecutive items of optimal value, where value is either total weight or total weight divided by total length. There may also be a specified lower and/or upper bound on the acceptable length of subsequences. This paper shows that all the variations of the problem are solvable in linear time and space even with non-uniform item lengths and divisible …
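
As an illustration of the problem statement above (not the paper's linear-time algorithm), the two objectives can be sketched in Python. The function names, the (weight, length) tuple layout, and the sample data are assumptions made for this sketch. The heaviest-segment scan is the standard Kadane technique for the case without length bounds, and the densest-segment search is a simple quadratic baseline with a hypothetical minimum-length bound.

```python
from typing import List, Optional, Tuple

Item = Tuple[float, float]  # (weight, length); layout assumed for this sketch


def heaviest_segment(items: List[Item]) -> Tuple[float, int, int]:
    """Return (total_weight, start, end) of a maximum-weight run of
    consecutive items, with `end` exclusive. Kadane-style single scan;
    ignores item lengths and length bounds."""
    if not items:
        raise ValueError("items must be non-empty")
    best_weight, best_start, best_end = items[0][0], 0, 1
    current, current_start = 0.0, 0
    for i, (weight, _length) in enumerate(items):
        if current <= 0:
            # A non-positive running total cannot help; restart here.
            current, current_start = weight, i
        else:
            current += weight
        if current > best_weight:
            best_weight, best_start, best_end = current, current_start, i + 1
    return best_weight, best_start, best_end


def densest_segment(
    items: List[Item], min_length: float = 0.0
) -> Optional[Tuple[float, int, int]]:
    """Return (density, start, end) maximizing total weight / total length
    over consecutive runs whose total length is at least `min_length`.
    Quadratic brute force, for illustration only. Returns None if no run
    satisfies the bound."""
    best = None
    for start in range(len(items)):
        weight = length = 0.0
        for end in range(start, len(items)):
            weight += items[end][0]
            length += items[end][1]
            if length > 0 and length >= min_length:
                density = weight / length
                if best is None or density > best[0]:
                    best = (density, start, end + 1)
    return best
```

For example, with items [(2, 1), (-3, 2), (4, 1), (-1, 1), (3, 2), (-5, 1)], both functions select items 2 through 4: the run has total weight 6 and, with a minimum length of 3, density 6/4.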


A Polyglot Approach To Bioinformatics Data Integration: A Phylogenetic Analysis Of HIV-1, Steven Reisman, Thomas Hatzopoulous, Konstantin Läufer, George K. Thiruvathukal, Catherine Putonti Oct 2017

Konstantin Läufer

As sequencing technologies continue to drop in price and increase in throughput, new challenges emerge for the management and accessibility of genomic sequence data. We have developed a pipeline for facilitating the storage, retrieval, and subsequent analysis of molecular data, integrating both sequence and metadata. Taking a polyglot approach involving multiple languages, libraries, and persistence mechanisms, sequence data can be aggregated from publicly available and local repositories. Data are exposed in the form of a RESTful web service, formatted for easy querying, and retrieved for downstream analyses. As a proof of concept, we have developed a resource for annotated HIV-1 …


A Polyglot Approach To Bioinformatics Data Integration: Phylogenetic Analysis Of HIV-1, Steven Reisman, Catherine Putonti, George K. Thiruvathukal, Konstantin Läufer Oct 2017

Konstantin Läufer

RNA interference has potential therapeutic use against HIV-1 by targeting highly functional mRNA sequences that contribute to the virulence of the virus. Empirical work has shown that within cell lines, all of the HIV-1 genes are affected by RNAi-induced gene silencing. While promising, this treatment inherently requires that RNAi sequences be highly specific. HIV, however, mutates rapidly, leading to the evolution of viral escape mutants. In fact, such strains are under strong selection to include mutations within the targeted region, evading the RNAi therapy and thus increasing the virus’ fitness in the host. Taking a phylogenetic approach, we …


A Polyglot Approach To Bioinformatics Data Integration: A Phylogenetic Analysis Of HIV-1, Steven Reisman, Thomas Hatzopoulous, Konstantin Läufer, George K. Thiruvathukal, Catherine Putonti Sep 2017

Catherine Putonti

As sequencing technologies continue to drop in price and increase in throughput, new challenges emerge for the management and accessibility of genomic sequence data. We have developed a pipeline for facilitating the storage, retrieval, and subsequent analysis of molecular data, integrating both sequence and metadata. Taking a polyglot approach involving multiple languages, libraries, and persistence mechanisms, sequence data can be aggregated from publicly available and local repositories. Data are exposed in the form of a RESTful web service, formatted for easy querying, and retrieved for downstream analyses. As a proof of concept, we have developed a resource for annotated HIV-1 …


A Polyglot Approach To Bioinformatics Data Integration: Phylogenetic Analysis Of HIV-1, Steven Reisman, Catherine Putonti, George K. Thiruvathukal, Konstantin Läufer Sep 2017

Catherine Putonti

RNA interference has potential therapeutic use against HIV-1 by targeting highly functional mRNA sequences that contribute to the virulence of the virus. Empirical work has shown that within cell lines, all of the HIV-1 genes are affected by RNAi-induced gene silencing. While promising, this treatment inherently requires that RNAi sequences be highly specific. HIV, however, mutates rapidly, leading to the evolution of viral escape mutants. In fact, such strains are under strong selection to include mutations within the targeted region, evading the RNAi therapy and thus increasing the virus’ fitness in the host. Taking a phylogenetic approach, we …


Using Phylogenetically-Informed Annotation (PIA) To Search For Light-Interacting Genes In Transcriptomes From Non-Model Organisms, Daniel I. Speiser, M. Sabrina Pankey, Alexander K. Zaharoff, Barbara A. Battelle, Heather D. Bracken-Grissom, Jesse W. Breinholt, Seth M. Bybee, Thomas W. Cronin, Anders Garm, Annie R. Lindgren, Nipam H. Patel, Megan L. Porter, Meredith E. Protas, Anja S. Rivera, Jeanne M. Serb, Kirk S. Zigler, Keith A. Crandall, Todd H. Oakley Jan 2017

Meredith Protas

Background: Tools for high-throughput sequencing and de novo assembly make the analysis of transcriptomes (i.e., the suite of genes expressed in a tissue) feasible for almost any organism. Yet a challenge for biologists is that it can be difficult to assign identities to gene sequences, especially those from non-model organisms. Phylogenetic analysis is one useful method for assigning identities to these sequences, but such methods tend to be time-consuming because trees must be recalculated for every gene of interest and each time a new data set is analyzed. In response, we employed existing tools for phylogenetic analysis to …


A Polyglot Approach To Bioinformatics Data Integration: A Phylogenetic Analysis Of HIV-1, Steven Reisman, Thomas Hatzopoulous, Konstantin Läufer, George K. Thiruvathukal, Catherine Putonti Jan 2017

George K. Thiruvathukal

As sequencing technologies continue to drop in price and increase in throughput, new challenges emerge for the management and accessibility of genomic sequence data. We have developed a pipeline for facilitating the storage, retrieval, and subsequent analysis of molecular data, integrating both sequence and metadata. Taking a polyglot approach involving multiple languages, libraries, and persistence mechanisms, sequence data can be aggregated from publicly available and local repositories. Data are exposed in the form of a RESTful web service, formatted for easy querying, and retrieved for downstream analyses. As a proof of concept, we have developed a resource for annotated HIV-1 …


Statistical Contributions To Bioinformatics: Design, Modeling, Structure Learning, And Integration, Jeffrey S. Morris, Veera Baladandayuthapani Dec 2016

Jeffrey S. Morris

The advent of high-throughput multi-platform genomics technologies providing whole-genome molecular summaries of biological samples has revolutionized biomedical research. These technologies yield highly structured big data, whose analysis poses significant quantitative challenges. The field of Bioinformatics has emerged to deal with these challenges, and comprises many quantitative and biological scientists working together to effectively process these data and extract the treasure trove of information they contain. Statisticians, with their deep understanding of variability and uncertainty quantification, play a key role in these efforts. In this article, we attempt to summarize some of the key contributions of statisticians to bioinformatics, …


Using RStudio For Manipulating And Visualizing Data (Updated 11/17), Ray A. Enke, Bejan A. Rasoul Dec 2016

Ray Enke Ph.D.

This in-class exercise is designed to teach novices the basic features of R and RStudio using a non-biological data set called Gapminder. It is a modified version of a Data Carpentry Workshop that I use to teach programming to beginners.


Genomics RNA-Seq Analysis Part 2: Kallisto Indexing And Quantification (Updated 11/17), Ray A. Enke, Melika Rahmani-Mofrad Dec 2016

Ray Enke Ph.D.

This in-class exercise is a hands-on activity designed to teach students how to run Kallisto indexing and quantification using CyVerse DE apps as part of a eukaryotic RNA-seq analysis pipeline.


Genomics RNA-Seq Analysis Part 3-Sleuth Data Visualization (Updated 11/17), Ray A. Enke, Scott Schumacker Dec 2016

Ray Enke Ph.D.

This in-class exercise is a hands-on activity designed to teach students how to run Sleuth statistical modeling and data visualization in RStudio using Kallisto pseudoalignment output files as part of a eukaryotic RNA-seq analysis pipeline.


Supporting Biomedical Research In The Era Of Omics And Precision Medicine, Rolando Garcia-Milian, Denise Hersey, Nathan Rupp Aug 2016

Rolando Garcia-Milian


This annual report (2015-2016) provides a continuing view of the position of the Cushing/Whitney Medical Library End-user Bioinformatics Program. Besides reporting on the three main areas of training, resources and tools, and consultations, it contains the results of the recent assessment “Information and Needs Assessment for Biomedical Research in the Omics Era.” During this period, 741 Yale affiliates (out of 1240 registered) attended the end-user bioinformatics training and presentations organized by the Medical Library. This year, the number of Ingenuity Pathway Analysis and MetaCore accounts continued to grow. Consequently, the number and length of research support consultations (130 researchers …


Bringing Toxicology Into The 21st Century: A Global Call To Action, Troy Seidle, Martin Stephens Jul 2016

Martin Stephens, PhD

Conventional toxicological testing methods are often decades old, costly and low-throughput, with questionable relevance to the human condition. Several of these factors have contributed to a backlog of chemicals that have been inadequately assessed for toxicity. Some authorities have responded to this challenge by implementing large-scale testing programmes. Others have concluded that a paradigm shift in toxicology is warranted. One such call came in 2007 from the United States National Research Council (NRC), which articulated a vision of “21st century toxicology” based predominantly on non-animal techniques. Potential advantages of such an approach include the capacity to examine a far greater …


Analysis Of RNA-Seq Alignments Using DNA Subway Green Line (Computational), Raymond A. Enke May 2016

Ray Enke Ph.D.

This class-tested protocol guides students through the following activities:
  • Review basic steps of RNA-Seq bioinformatics analysis in DNA Subway Green Line
  • View and run basic analytics of RNA-Seq data set in DNA Subway Green Line


BayesMotif: De Novo Protein Sorting Motif Discovery From Impure Datasets, Jianjun Hu, F. Zhang Jun 2015

Jianjun Hu

Background

Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals is amino acid subsequences, usually located at the N-termini or C-termini of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms.

Methods

We formulated the protein sorting motif discovery problem as a classification problem …


HemeBIND: A Novel Method For Heme Binding Residue Prediction By Combining Structural And Sequence Information, R. Liu, Jianjun Hu Jun 2015

Jianjun Hu

Background: Accurate prediction of binding residues involved in the interactions between proteins and small ligands is one of the major challenges in structural bioinformatics. Heme is an essential and commonly used ligand that plays critical roles in electron transfer, catalysis, signal transduction and gene expression. Although much effort has been devoted to the development of various generic algorithms for ligand binding site prediction over the last decade, no algorithm has been specifically designed to complement experimental techniques for identification of heme binding residues. Consequently, there is an urgent need to develop a computational method for recognizing these important residues. Results: Here …


Integrative Disease Classification Based On Cross-Platform Microarray Data, C.-C. Liu, Jianjun Hu, M. Kalakrishnan, H. Huang, X. Zhou Jun 2015

Jianjun Hu

Background: Disease classification has been an important application of microarray technology. However, most microarray-based classifiers can only handle data generated within the same study, since microarray data generated by different laboratories or with different platforms cannot be compared directly due to systematic variations. This issue has severely limited the practical use of microarray-based disease classification. Results: In this study, we tested the feasibility of disease classification by integrating the large number of heterogeneous microarray datasets from the public microarray repositories. Cross-platform data compatibility is created by deriving expression log-rank ratios within datasets. One may then compare vectors of log-rank …


Integrative Missing Value Estimation For Microarray Data, Jianjun Hu, H. Li, M. Waterman, X. Zhou Jun 2015

Jianjun Hu

Background: Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples. Results: We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets …


Library Support For Biomedical Research In The Omics Era: 2014-2015 Report, Rolando Garcia-Milian May 2015

Rolando Garcia-Milian

The decreased cost of high-throughput technologies has enabled their use as the main research methods for studying biological processes and disorders. To understand the relevance of the data generated by these methods, researchers need to mine and integrate the enormous amount of biomedical information and knowledge contained in the text of the scientific literature and in biomedical databases. Accordingly, the ability to access and examine molecular data should not be restricted to bioinformaticians or those with exceptional computer skills. In May 2014, the Cushing/Whitney Medical Library began to provide end-user bioinformatics support to the biomedical researchers of the Yale …


Introduction To Gene Enrichment Analysis Tools, Rolando Garcia-Milian Feb 2015

Rolando Garcia-Milian

Bioinformatics enrichment tools play an important role in identifying, annotating, and functionally analyzing large lists of genes generated by high-throughput technologies (e.g., microarray, RNA-seq, ChIP-chip). This workshop will provide an overview of the principles, types of enrichment, and infrastructure of enrichment tools. Using concrete examples, it will also introduce some of the most popular tools for gene enrichment analysis, such as DAVID, GSEA, and WebGestalt.


Deciphering The Associations Between Gene Expression And Copy Number Alteration Using A Sparse Double Laplacian Shrinkage Approach, Shuangge Ma Dec 2014

Shuangge Ma

Both gene expression levels (GEs) and copy number alterations (CNAs) have important implications in the development of complex diseases. GEs are partly regulated by CNAs, and much effort has been devoted to understanding their relations. The expression of a gene can be regulated by multiple CNAs, and one CNA can regulate the expression of multiple genes. In addition, multiple GEs (CNAs) can be correlated with each other. The existing methods for associating GEs with CNAs have limitations in deciphering the complex data structures. In this study, we develop a sparse double Laplacian shrinkage approach. It jointly models the effects of …


A Penalized Robust Semiparametric Approach For Gene-Environment Interactions, Shuangge Ma Dec 2014

Shuangge Ma

In genetic and genomic studies, gene-environment (G×E) interactions have important implications. Some of the existing G×E interaction methods are limited by analyzing a small number of G factors at a time, by assuming linear effects of E factors, by assuming no data contamination, and by adopting ineffective selection techniques. In this study, we propose a new approach for identifying important G×E interactions. It jointly models the effects of all E and G factors and their interactions. A partially linear varying coefficient model (PLVCM) is adopted to accommodate possible nonlinear effects of E factors. A rank-based loss function is used to …


Bringing Toxicology Into The 21st Century: A Global Call To Action, Troy Seidle, Martin Stephens Dec 2014

Troy Seidle, PhD

Conventional toxicological testing methods are often decades old, costly and low-throughput, with questionable relevance to the human condition. Several of these factors have contributed to a backlog of chemicals that have been inadequately assessed for toxicity. Some authorities have responded to this challenge by implementing large-scale testing programmes. Others have concluded that a paradigm shift in toxicology is warranted. One such call came in 2007 from the United States National Research Council (NRC), which articulated a vision of “21st century toxicology” based predominantly on non-animal techniques. Potential advantages of such an approach include the capacity to examine a far greater …


SNP-E: A New Method For Multiple Sequence Alignments Analysis And Accurate Single Nucleotide Polymorphism Evaluation, David A. Lightfoot Jan 2014

David A. Lightfoot

Identification of single nucleotide polymorphisms (SNPs) and insertion-deletion mutations is important for discovering the connection between genetic mutations and complex diseases. The objective of this study was to develop a sensitive and accurate computational method for SNP detection among Multiple Sequence Alignments (MSAs) to be run on Microsoft Office Suite™ and Windows™. The SNP-Evaluator was designed to simulate the process by which the human eye visually identifies changes. Analysis of three 82-Kbp genomic loci derived from Sanger sequencing and the corresponding SNPs from 31 genomes from Illumina™ sequencing of soybean (Glycine max L. Merr.) demonstrated that the SNP-E was an effective method …


Penalized Integrative Analysis Of High-Dimensional Omics Data, Shuangge Ma Dec 2013

Shuangge Ma

No abstract provided.


A Polyglot Approach To Bioinformatics Data Integration: Phylogenetic Analysis Of HIV-1, Steven Reisman, Catherine Putonti, George K. Thiruvathukal, Konstantin Läufer Jul 2013

George K. Thiruvathukal

RNA interference has potential therapeutic use against HIV-1 by targeting highly functional mRNA sequences that contribute to the virulence of the virus. Empirical work has shown that within cell lines, all of the HIV-1 genes are affected by RNAi-induced gene silencing. While promising, this treatment inherently requires that RNAi sequences be highly specific. HIV, however, mutates rapidly, leading to the evolution of viral escape mutants. In fact, such strains are under strong selection to include mutations within the targeted region, evading the RNAi therapy and thus increasing the virus’ fitness in the host. Taking a phylogenetic approach, we …


Functionally Compensating Coevolving Positions Are Neither Homoplasic Nor Conserved In Clades, Gregory Gloor, Gaurav Tyagi, Dana Abrassart, Andrew Kingston, Andrew Fernandes, Stanley Dunn, Christopher Brandl Oct 2012

Stanley D Dunn

We demonstrated that a pair of positions in phosphoglycerate kinase that score highly by three nonparametric covariation measures are important for function even though the positions can be occupied by aliphatic, aromatic, or charged residues. Examination of these pairs suggested that the majority of the covariation scores could be explained by within-clade conservation. However, an analysis of diversity showed that the conservation within clades of covarying pairs was indistinguishable from pairs of positions that do not covary, thus ruling out both clade conservation and extensive homoplasy as means to identify covarying positions. Mutagenesis showed that the residues in the covarying …


A Cluster-Based Approach For Biological Hypothesis Testing And Its Application, Ahmed Mustafa Jun 2012

Ahmed Mustafa Dr.

No abstract provided.


Using Comparative Genomics For Inquiry-Based Learning To Dissect Virulence Of Escherichia Coli O157:H7 And Yersinia Pestis, David J. Baumler, Lois M. Banta, Kai F. Hung, Jodi A. Schwarz, Eric L. Cabot, Jeremy D. Glasner, Nicole T. Perna Jan 2012

Kai F. Hung

Genomics and bioinformatics are topics of increasing interest in undergraduate biological science curricula. Many existing exercises focus on gene annotation and analysis of a single genome. In this paper, we present two educational modules designed to enable students to learn and apply fundamental concepts in comparative genomics using examples related to bacterial pathogenesis. Students first examine alignments of genomes of Escherichia coli O157:H7 strains isolated from three food-poisoning outbreaks using the multiple-genome alignment tool Mauve. Students investigate conservation of virulence factors using the Mauve viewer and by browsing annotations available at the A Systematic Annotation Package for Community Analysis of …