Open Access. Powered by Scholars. Published by Universities.®

Life Sciences Commons


Articles 1 - 30 of 33

Full-Text Articles in Life Sciences

The Role Of Software Engineering In Bioinformatics, Brendan Sean Lawlor Jan 2021


Theses

This thesis proposes that by applying state-of-the-art software engineering tools, techniques and frameworks to currently recognised challenges in bioinformatics, improved outcomes can be attained in that field. It begins by decomposing software engineering into two categories, namely process and architecture, and choosing two key challenges in the practice of bioinformatics: reproducibility and scalability. The body of the thesis is an exploration of the intersection between these two software engineering categories and these two bioinformatics challenges. The question is asked: Can best practices in professional software engineering be applied to address key issues in the bioinformatics domain, creating positive outcomes? And …


New Methods For Deep Learning Based Real-Valued Inter-Residue Distance Prediction, Jacob Barger Nov 2020


Theses

Background: Much of the recent success in protein structure prediction has been a result of accurate protein contact prediction--a binary classification problem. Dozens of methods, built from various types of machine learning and deep learning algorithms, have been published over the last two decades for predicting contacts. Recently, many groups, including Google DeepMind, have demonstrated that reformulating the problem as a multi-class classification problem is a more promising direction to pursue. As an alternative approach, we recently proposed real-valued distance predictions, formulating the problem as a regression problem. The nuances of protein 3D structures make this formulation appropriate, allowing predictions …


Protein Inter-Residue Distance Prediction Using Residual And Capsule Networks, Andrew Dillon Oct 2019


Theses

The protein folding problem, also known as protein structure prediction, is the task of building three-dimensional protein models given their one-dimensional amino acid sequence. New methods that have been successfully used in the most recent CASP challenge have demonstrated that predicting a protein's inter-residue distances is key to solving this problem. Various deep learning algorithms including fully convolutional neural networks and residual networks have been developed to solve the distance prediction problem. In this work, we develop a hybrid method based on residual networks and capsule networks. We demonstrate that our method can predict distances more accurately than the algorithms …


Polya Db3: A Database Cataloging Polyadenation Sites(Pas) Across Different Species And Their Conservation, Ram Mohan Nambiar Dec 2018


Theses

Polyadenylation is an important process occurring in messenger RNA that involves cleavage of the 3’ end of nascent mRNAs and addition of poly(A) tails. In this thesis, I present PolyA DB3, a database cataloging cleavage and polyadenylation sites (PASs) in several genomes, specifically human, mouse, rat and chicken. This database is based on deep sequencing data. PASs are mapped by the 3’ region extraction and deep sequencing (3’READS) method, ensuring unequivocal PAS identification. A large volume of data based on diverse biological samples is used to increase PAS coverage and provide PAS usage information. Strand-specific RNA-seq data were used to extend annotated 3’ ends …


A Parallelized Implementation Of Cut-And-Solve And A Streamlined Mixed-Integer Linear Programming Model For Finding Genetic Patterns Optimally Associated With Complex Diseases, Michael Yip-Hin Chan Nov 2018


Theses

With the advent of genetic sequencing, there was much hope of finding the inherited elements underlying complex diseases, such as late-onset Alzheimer’s disease (AD), but it has been a challenge to fully uncover the necessary information hidden in the data. A likely contributor to this failure is the fact that the pathogenesis of most complex diseases does not involve single markers working alone, but patterns of genetic markers interacting additively or epistatically. But as we move upwards beyond patterns of size two, it quickly becomes computationally infeasible to examine all combinations in the solution space. A common solution to solving …


Hypoxic And Viral Contributions To The Etiopathogenesis Of Schizophrenia: A Whole Transcriptome Analysis, Kathryn A. Gorski May 2018


Theses

Schizophrenia is a mental illness with a complex and as yet unclear etiology. It is highly heritable and has a strong polygenic character; however, studies examining the genetics of schizophrenia have not sufficiently explained all variability in its prevalence. Environmental causes are theorized to make a nontrivial contribution to the pathoetiology of schizophrenia, including interactions with genetic components, but these mechanisms remain unclear. Analyzing schizophrenia dysfunction using transcriptomic approaches is a paradigm still in its infancy, and fewer studies still have examined non-neurological contributions to schizophrenia pathology with next-generation sequencing technologies. This pilot study uses several …


Efficient Reduced Bias Genetic Algorithm For Generic Community Detection Objectives, Aditya Karnam Gururaj Rao Apr 2018


Theses

The problem of community structure identification has been an extensively investigated area for biology, physics, social sciences, and computer science in recent years for studying the properties of networks representing complex relationships. Most traditional methods, such as K-means and hierarchical clustering, are based on the assumption that communities have spherical configurations. Lately, Genetic Algorithms (GA) are being utilized for efficient community detection without imposing sphericity. GAs are machine learning methods which mimic natural selection and scale with the complexity of the network. However, traditional GA approaches employ a representation method that dramatically increases the solution space to be searched by …


The Use Of Machine Learning To Detect Suckling In Pre-Weaned Calves, Sukumar Katamreddy Jan 2018


Theses

The weaning of cattle is a process which is known to be labour intensive and to have stressful effects on both cow and calf. Common methods used in the weaning process include the temporary removal of a mother from the calf and manual observation and intervention. Early and speedy weaning is known to have a number of benefits, including health benefits for both cow and calf, additional weight gains for the calves as well as reduced labour and feed requirements. The process known as Two-Stage Weaning is recognised to be an effective low-stress approach to weaning in which the calf …


Network Exploration Of Correlated Multivariate Protein Data For Alzheimer's Disease Association, Matthew J. Lane Apr 2017


Theses

Alzheimer's disease (AD) is difficult to diagnose using genetic testing or other traditional methods. Unlike diseases with simple genetic risk components, no single marker determines whether someone will develop AD. Furthermore, AD is highly heterogeneous, and different subgroups of individuals develop the disease due to differing factors. Traditional diagnostic methods based on perceivable cognitive deficiencies often come too late, because the brain has already suffered damage from decades of disease progression. In order to observe AD at early stages prior to the observation of cognitive deficiencies, biomarkers with greater accuracy are required. By using …


Novel Neuroevolution Techniques For The Life Science Domain, Timothy Manning Jan 2017


Theses

The life science domain is a high-value research area, both in terms of the benefits of increased knowledge and in societal impact. Much of the research funding has focused on wet-lab-based approaches to increase visibility into biological processes and to produce maximal relevant information on which to make decisions. Given the complexity of biological functions, in many cases this has led to an information overload. Researchers are now able to routinely generate and access petabytes of data as a result of high-throughput experiments, and this capability is growing. This data can be difficult to interpret and intractable …


Gene Network Understanding And Analysis, Maria E. Somoza May 2016


Theses

A gene regulatory network (GRN) is a collection of regulators that interact with each other in the cell to govern the expression levels of mRNA and proteins. These regulators can be DNA, RNA, proteins, or complexes of these. Transcriptional gene regulation is an important mechanism whose in-depth study can lead to various practical applications and a greater understanding of how organisms control their cellular behavior. Among the most widely studied organisms in gene regulatory network research are Mycobacterium tuberculosis and Corynebacterium glutamicum ATCC 13032.

Gene co-expression networks are of biological interest because co-expressed genes are …


Uusing The Kdj As A Trading Strategy On Biotech Companies, Shijie Zha May 2016


Theses

Mean Reversion is the most commonly used model in quantitative trading. This model is associated with several factors, such as the ma5 and ma10 lines. These factors are the most significant in stock markets. However, the disadvantages of this model are lag and inaccuracy.
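
As an illustration only, the ma5 and ma10 lines mentioned above can be sketched as trailing simple moving averages. This assumes the common convention that they are 5-period and 10-period means of closing prices, which the abstract does not state; the function name and the sample prices are hypothetical.

```python
from typing import List, Optional


def simple_moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Trailing mean over `window` prices; None until enough prices exist."""
    if window <= 0:
        raise ValueError("window must be positive")
    out: List[Optional[float]] = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            # Drop the price that just left the trailing window.
            running -= prices[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


# Hypothetical closing prices, rising by 1 each period.
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]
ma5 = simple_moving_average(closes, 5)    # assumed meaning of "ma5"
ma10 = simple_moving_average(closes, 10)  # assumed meaning of "ma10"
print(ma5[-1], ma10[-1])
```

On a steadily rising series, both averages trail the latest price, and the longer window trails further, which is one way to picture the lag the abstract names as a disadvantage.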

In this research, we obtain historical and current stock data with a web crawler, analyze the quantitative data, and build a new model involving the KDJ. Taking biotech companies marketed in the United States and B-shares marketed in China as the research subjects, the result shows increased profits compared with the Mean Reversion model. It also shows …


Unsupervised Gene Regulatory Network Inference On Microarray Data, Nidhi Radia May 2015


Theses

Obtaining gene regulatory networks (GRNs) from expression data is a challenging and crucial task. Many computational methods and algorithms have been developed to infer gene networks from gene expression data, which are usually obtained from microarray experiments. A gene network is a way to depict the relations among clusters of genes. To infer gene networks, an unsupervised method is used in this study. The two types of data used are time-series data and steady-state data. The data are analyzed using various tools containing different algorithms and concepts. GRNs are obtained from time-series data using the Time-delayed Algorithm for the …


Exact Genome Alignment, Nandini Ghosh May 2015


Theses

The increase in the volume of genomic data due to the decrease in the cost of whole genome sequencing techniques has opened up new avenues of research in the field of bioinformatics, such as comparative genomics and evolutionary dynamics. The fundamental task in these studies is to align the genome sequences accurately. Sequence alignment helps to identify regions of similarity between the sequences to establish their functional, evolutionary and structural relationship. The thesis investigates the performance of two sequence alignment programs: LASTZ, a faster hash-table-based method, and SSEARCH, a slower but more rigorous Smith-Waterman-based approach, on whole genome …
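
For readers unfamiliar with the Smith-Waterman approach named above, a minimal sketch of its local alignment scoring recurrence follows. It is not SSEARCH's implementation: the linear gap penalty and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of a and b (illustrative scoring values)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere,
            # which is what makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGACGTGG", "ACGT"))
```

The table has one cell per pair of positions, so the work grows with the product of the two sequence lengths. That cost is why an exhaustive method of this kind is described as slower but more rigorous than a hash-table-based one.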


Identifying Modifier Genes In Sma Model Mice, Weiting Xu May 2015


Theses

Spinal Muscular Atrophy (SMA) involves the loss of nerve cells called motor neurons in the spinal cord and is classified as a motor neuron disease. It affects 1 in 5,000-10,000 newborns and is one of the leading genetic causes of infant death in the USA. Mutations in the SMN1, UBA1, DYNC1H1 and VAPB genes cause spinal muscular atrophy. Extra copies of the SMN2 gene modify the severity of spinal muscular atrophy. Mutations in SMN1 (Survival of Motor Neuron 1) are the main cause of SMA (autosomal recessive inheritance). SMN1 gene mutations lead to a shortage of the SMN protein, and the SMN protein forms the SMN complex …


Cancer Risk Prediction With Next Generation Sequencing Data Using Machine Learning, Nihir Patel Jan 2015


Theses

The use of computational biology for next generation sequencing (NGS) analysis is rapidly increasing in genomics research. However, the effectiveness of NGS data to predict disease abundance is as yet unclear. This research investigates the problem in the whole exome NGS data of chronic lymphocytic leukemia (CLL) available at dbGaP. Initially, raw reads from samples are aligned to the human reference genome using the Burrows-Wheeler Aligner. From the samples, structural variants, namely Single Nucleotide Polymorphisms (SNPs) and Insertions/Deletions (INDELs), are identified and filtered using SAMtools as well as the Genome Analysis Toolkit (GATK). Subsequently, the variants are …


Rice And Mouse Quantitative Phenotype Prediction In Genome-Wide Association Studies With Support Vector Regression, Abdulrhman Fahad M. Aljouie Jan 2015


Theses

Quantitative phenotype prediction from genotype data is significant for pathogenesis, crop yields, and immunity tests. The scientific community has conducted many studies to find models with high predictive ability for unobserved quantitative phenotypes. Early genome-wide association studies (GWAS) focused on genetic variants that are associated with disease or phenotype; however, these variants mainly cover a small portion of the whole genetic variance, and therefore the effectiveness of predictions obtained using this information may be limited [1].

Instead, this study shows prediction ability from whole genome single nucleotide polymorphism (SNP) data of 1940 genotyped stock mice with ~12k SNPs, and 413 …


Risk Prediction With Genomic Data, Bharati Jadhav May 2014


Theses

Genome wide association study (GWAS) is widely used with various machine learning algorithms to predict disease risk. This thesis investigates this widely used approach of GWAS using Single Nucleotide Polymorphism (SNP) genotype data and a novel approach of disease risk prediction with whole exome sequencing data, namely Whole Exome Wide Association Study (WEWAS). It further applies a discriminating machine learning algorithm, namely a Support Vector Machine (SVM) with different Kernel functions. For this study, only SNPs generated using genotyping technology, which focuses more on common variants, are used initially for disease prediction. Later, the whole exome data generated using Next …


Comparison Of Different Differential Expression Analysis Tools For Rna-Seq Data, Junfei Zhu Jan 2014


Theses

In molecular biology research, RNA-seq is a relatively new method for transcriptome profiling. It utilizes next-generation sequencing technology to provide a huge amount of information about the variety and abundance of RNA present in an organism of interest at a specific state and a given time. One of the most important tasks of RNA-seq analysis is finding genes that are expressed differently in different subject groups. Many differential expression analysis tools for RNA-seq have been developed, but there is no gold standard in this field. In this research, four commonly used tools (DESeq, edgeR, limma, and cuffdiff) are …


Polyaseeker: A Computational Framework For Identifying Polyadenylation Cleavage Site From Rna-Seq, Xiao Ling May 2013


Theses

Alternative polyadenylation (APA) of mRNA plays a crucial role in post-transcriptional gene regulation. Recently, advances in next generation sequencing technology have made it possible to efficiently characterize the transcriptome and identify the 3’ end of polyadenylated RNAs. However, no comprehensive bioinformatic pipelines have fulfilled this goal. PolyASeeker, a computational framework for identifying polyadenylation cleavage sites from RNA-Seq data, is proposed in this thesis. Using a simulated RNA-seq dataset, a novel method is developed to evaluate the performance of the proposed framework versus the traditional A-stretch approach, and to compute accurate precision and recall values that previous estimates could not provide. …


Performance Comparison Of Five Rna-Seq Alignment Tools, Yuanpeng Lu May 2013


Theses

Aligning millions of short reads to a reference genome is a critical task in high throughput sequencing. In recent years, a large number of mapping algorithms have been developed, all of which have in common that they align a vast number of reads to genomic or transcriptomic sequences. RNA-Seq data is discrete in nature; therefore, with reasonable gene models and comparative metrics, RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. To provide guidance in the choice of alignment algorithms, five different alignment tools for RNA-Seq data are evaluated. In order to compare the …


A Gpu Program To Compute Snp-Snp Interactions In Genome-Wide Association Studies, Srividya Ramakrishnan May 2013


Theses

With the recent advances in the next generation sequencing technologies, short read sequences of human genome are made more accessible. Paired end sequencing of short reads is currently the most sensitive method for detecting somatic mutations that arise during tumor development. In this study, a novel approach to optimize the detection of structural variants using a new short read alignment program is presented.

Pairwise interaction effects of Single Nucleotide Polymorphisms (SNPs) have proven useful in uncovering the underlying complex disease traits. Computing the disease risk based on the interaction effects of SNPs in a case-control study is a …


Genome Wide Search For Pseudo Knotted Non-Coding Rnas, Meghana S. Vasavada May 2013


Theses

Non-coding RNAs (ncRNAs) are the functional RNA molecules that are involved in many biological processes including gene regulation, chromosome replication and RNA modification. Searching genomes using computational methods has become an important asset for prediction and annotation of ncRNAs. To annotate an individual genome for a specific family of ncRNAs, a computational tool is interpreted to scan through the genome and align its sequence segments to some structure model for the ncRNA family. With the recent advances in detecting an ncRNA in the genome, heuristic techniques are designed to perform an accurate search and sequence-structure alignment. This study uses a …


Rna-Sequence Analysis Of Human Melanoma Cells, Jharna Miya May 2013


Theses

RNA-sequencing refers to the use of high-throughput sequencing technologies to sequence cDNA in order to obtain complete information about a sample’s RNA content. The objective of this study is to analyze these data from different aspects and to characterize gene expression. Besides this characterization, the data were also used to investigate the effect of sequencing depth on gene expression measurements.

This research focuses on quantitative measurement of expression levels of genes and their transcripts. In this study, complementary DNA fragments of cultured human melanoma cells are sequenced and a total of 139,501,106 200-bp reads …


A Comparative Analysis Of Machine Learning Algorithms For Genome Wide Association Studies, Neha Singh May 2012


Theses

Variations present in the human genome play a vital role in the emergence of genetic disorders and abnormal traits. The Single Nucleotide Polymorphism (SNP) is considered the most common source of genetic variation. Genome Wide Association Studies (GWAS) probe these variations in the human population and find their association with complex genetic disorders. Nowadays, advances in technology and a drastic reduction in the cost of Genome Wide Association Studies provide a plethora of genomic data that delivers a great deal of information about these variations to analyze. In fact, there is a significant difference in the pace of data generation and …


Phenotype Prediction And Feature Selection In Genome-Wide Association Studies, Andrew Roberts May 2012


Theses

Genome wide association studies (GWAS) search for correlations between single nucleotide polymorphisms (SNPs) in a subject genome and an observed phenotype. GWAS can be used to generate models for predicting phenotype based on genotype, as well as aiding in identification of specific genes affecting the biological mechanism underlying the phenotype.
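
The idea of searching for correlations between SNPs and a phenotype can be sketched as a simple univariate filter. The example below ranks SNPs by the absolute Pearson correlation between genotype and phenotype. It is an illustrative stand-in, not one of the ranking methods used in the thesis, and it assumes genotypes coded as minor-allele counts (0, 1, 2). All names and data are hypothetical.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_snps(genotypes: Dict[str, List[float]], phenotype: List[float]) -> List[str]:
    """SNP names ordered from strongest to weakest absolute correlation."""
    scores = {name: abs(pearson(calls, phenotype)) for name, calls in genotypes.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical data: six subjects, genotypes assumed coded as 0/1/2 allele counts.
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],  # tracks the phenotype closely
    "snp_b": [1, 1, 1, 1, 1, 1],  # monomorphic, carries no information
    "snp_c": [2, 0, 1, 1, 2, 0],  # weakly related
}
phenotype = [1.0, 2.0, 3.0, 1.1, 2.1, 2.9]
ranking = rank_snps(genotypes, phenotype)
print(ranking)
```

A filter of this kind scores each SNP on its own, so it cannot see effects that only appear when several SNPs are considered together.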

In this investigation, phenotype prediction models are constructed from GWAS training data and are evaluated for performance on test data. Three methods are used to rank SNPs by their correlation with the phenotype: the univariate Wald test, a multivariate, support vector machine (SVM) based technique, and a hybrid method where …


Data Mining Of Tetraloop-Tetraloop Receptors In Rna Xml Files, Sinan Ramazanoglu May 2012


Theses

RNA (ribonucleic acid) motifs are tertiary structures that play an important role in the folding mechanism of the RNA molecule. The overall function of an RNA motif depends on its specific bp (base pair) sequence that constitutes the secondary structure. Data mining is a novel method for both discovering potential tertiary structures within DNA (deoxyribonucleic acid), RNA, and protein molecules and storing the information in databases. The RNA motif of interest is the tetraloop-tetraloop receptor, which is composed of a highly conserved 11 nt (nucleotide) sequence and a tetraloop with the generic form of GNRA (where N = any base …


Fast Program For Sequence Alignment Using Partition Function Posterior Probabilities, Meera Prasad May 2011


Theses

The key requirements of a good sequence alignment tool are high accuracy and fast execution. The existing Probalign program is a highly accurate tool for sequence alignment of both proteins and nucleotides. However, its execution time is fairly high. The focus is, therefore, to reduce the running time of the existing version of Probalign while maintaining its current accuracy level.

The thesis conducts a detailed analysis of the performance of Probalign to bring down the running time of the existing code. A modified version of Probalign, Version 1.4, is released. A new program for sequence alignment with faster computation is …


Aminormotiffinder - A Graph Grammar Based Tool To Effectively Search A Minor Motifs In 3d Rna Molecules, Ankur Malhotra Jan 2011


Theses

RNA motifs are three-dimensional folds that play an important role in RNA folding and its interaction with other molecules. They have a modular structure and are composed of conserved building blocks dependent upon the sequence. Their automated in silico identification remains a challenging task. Existing motif identification tools do not correctly identify motifs with large structural variations. Here a “graph rewriting” based method is proposed to identify motifs in real three-dimensional structures. The unique encoding of A Minor Searcher takes into consideration the non-canonical base pairs and also the multipairing of RNA structural motifs. The accuracy is demonstrated by …


Type-1 Diabetes Risk Prediction Using Multiple Kernel Learning, Paras Garg May 2010


Theses

This thesis presents an analysis of multiple kernel learning (MKL) for type-1 diabetes risk prediction. MKL combines different models and representations of data to find a linear combination of these representations. MKL has been successfully implemented in image detection, splice site detection, ribosomal and membrane protein prediction, etc. In this thesis, this method was applied to a genome-wide association study (GWAS) for classifying cases and controls.

This thesis has shown that combined kernel does not perform better than the individual kernels and that MKL does not select the best model for this problem. Also, the effect of …