Open Access. Powered by Scholars. Published by Universities.®

Bioinformatics Commons

Articles 1 - 30 of 70

Full-Text Articles in Bioinformatics

Convolutional Neural Network-Based Gene Prediction Using Buffalograss As A Model System, Michael Morikone Nov 2023

Complex Biosystems PhD Program: Dissertations

Gene prediction has seen little algorithmic improvement since the first gene-prediction algorithms were developed thirty years ago. Rather than iteratively improving the underlying algorithms with better-performing models, most current approaches update existing tools by incorporating increasing amounts of extrinsic data. Genes are traditionally predicted using Hidden Markov Models (HMMs), which are constrained by strict assumptions about the independence of genes that do not always hold true. To address this, a Convolutional Neural Network (CNN) …


DeepHTLV: A Deep Learning Framework For Detecting Human T-Lymphotrophic Virus 1 Integration Sites, Johnathan Jia May 2023

Dissertations & Theses (Open Access)

In the 1980s, researchers discovered the first human oncogenic retrovirus, human T-lymphotrophic virus type 1 (HTLV-1). Since then, HTLV-1 has been identified as the causative agent of several diseases, such as adult T-cell leukemia/lymphoma (ATL) and HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP). As part of its normal replication cycle, the viral genome is converted into DNA and integrated into the host genome. With several hundred to several thousand unique viral integration sites (VISs) distributed with indeterminate preference throughout the genome, detecting HTLV-1 VISs is a challenging task. Experimental studies typically use molecular biology …


Framework For The Evaluation Of Perturbations In The Systems Biology Landscape And Inter-Sample Similarity From Transcriptomic Datasets — A Digital Twin Perspective, Mariah Marie Hoffman Jan 2022

Dissertations and Theses

One approach to interrogating the complexities of human systems in their well-regulated and dysregulated states is through the use of digital twins. Digital twins are virtual representations of physical systems that are descriptive of an individual's state of health, an object fundamentally related to precision medicine. A key element for building a functional digital twin type for a disease or predicting the therapeutic efficacy of a potential treatment is harmonized, machine-parsable domain knowledge. Hypothesis-driven investigations are the gold standard for representing subsystems, but their results encompass a limited knowledge of the full biosystem. Multi-omics data is one rich source of …


Building A Learning Healthcare System: A Path To Optimizing Big Health Data To Inform Clinical Care Decisions, Danne Charlotte Emily Elbers Jan 2022

Graduate College Dissertations and Theses

The explosive growth of data and computing power over the last decades has had a large impact on a myriad of domains, not least on one of society’s most complex systems: healthcare. In this work, a version of the resulting Learning Healthcare System (LHS) is explored; elements of it have been implemented and are in use at the Department of Veterans Affairs today. After an overview of what an LHS is and what it could be once executed in its full form, the chapters describe in detail some of the individual elements and how they address cogs …


Simulation Of The Interaction Between Striated Muscle UNC-45 And Transcription Factor GATA-4, Drake Alexander Duncan May 2021

Electronic Theses and Dissertations

Striated Muscle UNC-45, also known as UNC-45b, is an important protein that acts as a chaperone for myosin in cardiac and skeletal muscles, binding to myosin at its C-terminal UCS domain and regulating its assembly into thick filaments and sarcomeric structures. The UCS domain contains a large loop that is believed to be the first point of interaction between myosin and UNC-45b. GATA-4 is an essential transcription factor that facilitates transcription of several genes in cardiac development, particularly alpha-heavy chain myosin in heart tissue. Recently, studies have shown that there is interaction of GATA-4 with UNC-45b and that GATA-4 binds …


TruncTrimmer: A First Step Towards Automating Standard Bioinformatic Analysis, Z. Gunner Lawless, Dana Dittoe, Dale R. Thompson, Steven C. Ricke May 2021

Computer Science and Computer Engineering Undergraduate Honors Theses

Bioinformatic analysis is a time-consuming process for labs performing research on various microbiomes. Researchers use tools like QIIME 2 to help standardize bioinformatic analysis methods, but even large, extensible platforms like QIIME 2 have drawbacks due to the attention they require from researchers. In this project, we propose to automate additional standard lab bioinformatic procedures by eliminating the existing manual process of determining the trim and truncate locations for paired end 2 sequences. We introduce a new QIIME 2 plugin called TruncTrimmer to automate the process that usually requires the researcher to make a decision on where to trim and truncate manually after …


Gene Selection And Classification In High-Throughput Biological Data With Integrated Machine Learning Algorithms And Bioinformatics Approaches, Abhijeet R Patil May 2021

Open Access Theses & Dissertations

With the rise of high-throughput technologies in biomedical research, large volumes of expression profiling, methylation profiling, and RNA-sequencing data are being generated. These high-dimensional data have a large number of features and a small number of samples, a characteristic called the "curse of dimensionality." The selection of optimal features, which largely affects the performance of classification algorithms in machine learning models, has led to challenging problems in bioinformatics analyses of such high-dimensional datasets. In this work, I focus on the design of two-stage frameworks of feature selection and classification and their applications in multiple sets of colorectal cancer data. The first …


Machine Learning And Bioinformatic Insights Into Key Enzymes For A Bio-Based Circular Economy, Japheth E. Gado Jan 2021

Theses and Dissertations--Chemical and Materials Engineering

The world is presently faced with a sustainability crisis; it is becoming increasingly difficult to meet the energy and material needs of a growing global population without depleting and polluting our planet. Greenhouse gases released from the continuous combustion of fossil fuels engender accelerated climate change, and plastic waste accumulates in the environment. There is a need for a circular economy, where energy and materials are renewably derived from waste items rather than by consuming limited resources. Deconstruction of the recalcitrant linkages in natural and synthetic polymers is crucial for a circular economy, as deconstructed monomers can be used to manufacture …


The Role Of Software Engineering In Bioinformatics, Brendan Sean Lawlor Jan 2021

Theses

This thesis proposes that by applying state-of-the-art software engineering tools, techniques and frameworks to currently recognised challenges in bioinformatics, improved outcomes can be attained in that field. It begins by decomposing software engineering into two categories, namely process and architecture, and choosing two key challenges in the practice of bioinformatics: reproducibility and scalability. The body of the thesis is an exploration of the intersection between these two software engineering categories and these two bioinformatics challenges. The question is asked: Can best practices in professional software engineering be applied to address key issues in the bioinformatics domain, creating positive outcomes? And …


An Automated Method To Enrich And Expand Consumer Health Vocabularies Using GloVe Word Embeddings, Mohammed Ibrahim Jan 2021

Graduate Theses and Dissertations

Clear language makes communication easier between any two parties. However, a layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical jargon, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map lay medical terms to professional medical terms and vice versa. Many of the presented vocabularies are built manually or semi-automatically, requiring large investments of time and human effort and consequently the slow …


Ensemble Protein Inference Evaluation, Kyle Lee Lucke Jan 2021

Graduate Student Theses, Dissertations, & Professional Papers

Protein inference is becoming an increasingly important tool that aids in the characterization of complex proteomes and the analysis of complex protein samples. In bottom-up shotgun proteomics experiments, the metrics for evaluation (like AUC and calibration error) are based on an often imperfect target-decoy database. These metrics make the inherent assumption that all of the proteins in the target set are present in the sample being analyzed. In general, this is not the case; the target set is typically a mix of present and absent proteins. To objectively evaluate inference methods, protein standard datasets are used. These datasets are special in …


SODA: An Open-Source Library For Visualizing Biological Sequence Annotation, Jack W. Roddy, Travis J. Wheeler Jan 2021

Graduate Student Theses, Dissertations, & Professional Papers

Genome annotation is the process of identifying and labeling known genetic sequences or features within a genome. Across the various subfields within modern molecular biology, there is a common need for the visualization of such annotations. Genomic data is often visualized on web browser platforms, providing users with easy access to visualization tools without the need for installing any software or, in many cases, underlying datasets. While there exists a broad range of web-based visualization tools, there is, to my knowledge, no lightweight, modern library tailored towards the visualization of genomic data. Instead, developers charged with the task of producing …


Development Of Computational Tools To Target MicroRNA, Luo Song Dec 2020

Dissertations & Theses (Open Access)

MicroRNAs (a.k.a. miRNAs) play an important role in disease development. However, few of their structures have been determined, and structure-based computational methods still struggle to accurately predict their interactions with small molecules. To address this issue, my thesis develops integrated approaches to screening for novel inhibitors by targeting specific structure motifs in miRNAs. The project starts with implementing a tool to find potential miRNA targets with desired motifs. I combined both sequence information of miRNAs and known RNA structure data from the Protein Data Bank (PDB) to predict the miRNA structure and identify the motif to target, then I …


New Methods For Deep Learning Based Real-Valued Inter-Residue Distance Prediction, Jacob Barger Nov 2020

Theses

Background: Much of the recent success in protein structure prediction has been a result of accurate protein contact prediction, a binary classification problem. Dozens of methods, built from various types of machine learning and deep learning algorithms, have been published over the last two decades for predicting contacts. Recently, many groups, including Google DeepMind, have demonstrated that reformulating the problem as a multi-class classification problem is a more promising direction to pursue. As an alternative approach, we recently proposed real-valued distance predictions, formulating the problem as a regression problem. The nuances of protein 3D structures make this formulation appropriate, allowing predictions …
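
The three formulations mentioned in this abstract differ mainly in how an inter-residue distance is turned into a prediction target. A minimal Python sketch follows, assuming an 8 Å contact cutoff and a hypothetical set of bin edges; neither value is stated in the abstract, and the function names are illustrative only.

```python
import bisect

# Assumed values for illustration; the abstract does not specify them.
CONTACT_THRESHOLD = 8.0               # angstroms, a commonly used contact cutoff
BIN_EDGES = [4.0, 8.0, 12.0, 16.0]    # hypothetical distance-bin boundaries


def as_contact(distance):
    """Binary classification target: are the two residues in contact?"""
    return distance < CONTACT_THRESHOLD


def as_bin(distance):
    """Multi-class target: index of the distance bin (0 .. len(BIN_EDGES))."""
    return bisect.bisect_right(BIN_EDGES, distance)


def as_regression_target(distance):
    """Regression target: the real-valued distance itself."""
    return float(distance)
```

The regression target keeps the full distance, whereas the two classification targets discard information inside each class or bin.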


Machine Learning With Digital Signal Processing For Rapid And Accurate Alignment-Free Genome Analysis: From Methodological Design To A Covid-19 Case Study, Gurjit Singh Randhawa Jun 2020

Electronic Thesis and Dissertation Repository

In the field of bioinformatics, taxonomic classification is the scientific practice of identifying, naming, and grouping organisms based on their similarities and differences. The problem of taxonomic classification is of immense importance considering that nearly 86% of existing species on Earth and 91% of marine species remain unclassified. Given the magnitude of the datasets, there is a need for an approach and software tool that is scalable enough to handle large datasets and can be used for rapid sequence comparison and analysis. We propose ML-DSP, a stand-alone alignment-free software tool that uses Machine Learning and Digital Signal Processing to …


Using CUDA To Enhance Data Processing Of Variant Call Format Files For Statistical Genetic Analysis, Heather Mckinnon Jan 2020

All Graduate Projects

Utilizing the power of GPU parallel processing with CUDA can speed up the processing of Variant Call Format (VCF) files and the statistical analysis of genomic data. A software package designed for this purpose would benefit genetic researchers by saving them time that they could spend on other aspects of their research. A data set containing genetics from a study of trichome production in Mimulus guttatus, or yellow monkey flower, was used to develop a package to test the effectiveness of GPU parallel processing versus serial executions. After a serial version of the code was generated and benchmarked, OpenACC …
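
For orientation, a serial baseline for this kind of VCF processing might look like the following Python sketch. It is not the project's code: the function name is hypothetical, and it assumes tab-separated VCF data lines whose FORMAT column includes a GT field.

```python
def alt_allele_frequency(vcf_line):
    """Return (chrom, pos, alternate-allele frequency) for one VCF data line.

    Assumes the standard fixed columns (CHROM, POS, ID, REF, ALT, QUAL,
    FILTER, INFO), followed by FORMAT and one column per sample.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    gt_index = fields[8].split(":").index("GT")

    alt_count = 0
    called = 0
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        # Phased (|) and unphased (/) genotypes are treated alike here.
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":      # missing call
                continue
            called += 1
            if allele != "0":      # any non-reference allele
                alt_count += 1

    frequency = alt_count / called if called else None
    return chrom, pos, frequency
```

Each line is processed independently of the others, which is what makes per-variant statistics of this kind a natural candidate for parallel execution.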


The Role Of Influenza Infection On Cell Metabolism, Megha Mokkapati Jan 2020

Williams Honors College, Honors Research Projects

In this study, global metabolomics LC-MS data were analyzed to determine the effect of influenza infection on host metabolic processes. Specifically, wild-type male and female mouse lung samples from mice infected with PR8 (H1N1), or with PR8 followed by a reinfection with X31 (H3N2), were analyzed. LC-MS (liquid chromatography-mass spectrometry) based global metabolomics will be performed on the samples using hydrophilic interaction liquid chromatography to separate polar analytes from the tissues (Dettmer, Aronov, 2007). Metabolites impacted by influenza infection in all samples will be analyzed to compare male and female mice, as well as differences between PR8 infection and X31 …


Model-Based Deep Autoencoders For Characterizing Discrete Data With Application To Genomic Data Analysis, Tian Tian May 2019

Dissertations

Deep learning techniques have achieved tremendous success in a wide range of real applications in recent years. For dimension reduction, deep neural networks (DNNs) provide a natural choice to parameterize a non-linear transforming function that maps the original high-dimensional data to a lower-dimensional latent space. An autoencoder is a kind of DNN used to learn efficient feature representations in an unsupervised manner. Deep autoencoders have been widely explored and applied to the analysis of continuous data, but they are understudied for characterizing discrete data. This dissertation focuses on developing model-based deep autoencoders for modeling discrete data. A motivating example of …


Simplicity DiffExpress: A Bespoke Cloud-Based Interface For RNA-Seq Differential Expression Modeling And Analysis, Cintia C. Palu, Marcelo Ribeiro-Alves, Yanxin Wu, Brendan Lawlor, Pavel V. Baranov, Brian Kelly, Paul Walsh May 2019

Department of Computer Science Publications

One of the key challenges for transcriptomics-based research is not only the processing of large data but also modeling the complexity of features that are sources of variation across samples, which is required for an accurate statistical analysis. Therefore, our goal is to foster access for wet lab researchers to bioinformatics tools, in order to enhance their ability to explore biological aspects and validate hypotheses with robust analysis. In this context, user-friendly interfaces can enable researchers to apply computational biology methods without requiring bioinformatics expertise. Such bespoke platforms can improve the quality of the findings by allowing the researcher to …


Designing Computational Biology Workflows With Perl - Part 1, Esma Yildirim May 2019

Open Educational Resources

This material introduces Linux file system structures and demonstrates how to use commands to communicate with the operating system through a Terminal program. Basic program structures and the system() function of Perl are discussed. A brief introduction to gene-sequencing terminology and file formats is given.


Designing Computational Biology Workflows With Perl - Part 1, Esma Yildirim May 2019

Open Educational Resources

This material introduces the AWS console interface and describes how to create an instance on AWS with the VMI provided and connect to that machine instance using the SSH protocol. Once connected, it requires the students to write a script that enters the data folder, which includes gene-sequencing input files, and prints the first five lines of each file remotely. The same exercise can be applied if the VMI is installed on a local machine using virtualization software (e.g. Oracle VirtualBox). In this case, the Terminal program of the VMI can be used to do the exercise.
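
The exercise is meant to be written in Perl; as a rough illustration of the same task, here is a Python sketch. The function names and the folder argument are assumptions made for this example and are not part of the course material.

```python
from itertools import islice
from pathlib import Path


def head_of_each_file(folder, n=5):
    """Return {file name: first n lines} for every regular file in folder."""
    heads = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with path.open("r", errors="replace") as handle:
                # islice stops after n lines, so large files are not read fully.
                heads[path.name] = [line.rstrip("\n") for line in islice(handle, n)]
    return heads


def print_heads(folder, n=5):
    """Print the first n lines of each file, preceded by a header line."""
    for name, lines in head_of_each_file(folder, n).items():
        print(f"==> {name} <==")
        for line in lines:
            print(line)
```

Calling `print_heads` on the data folder of the machine instance would produce output similar to running `head -n 5` on each file.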


Designing Computational Biology Workflows With Perl - Part 2, Esma Yildirim May 2019

Open Educational Resources

This material introduces the AWS console interface and describes how to create an instance on AWS with the VMI provided and connect to that machine instance using the SSH protocol. Once connected, it requires the students to write a script that automates the creation of VCF files from two different sample genomes of E. coli, using the FASTA and FASTQ files in the input folder of the virtual machine. The same exercise can be applied if the VMI is installed on a local machine using virtualization software (e.g. Oracle VirtualBox). In this case, the Terminal program of the …


Designing Computational Biology Workflows With Perl - Part 2, Esma Yildirim May 2019

Open Educational Resources

This material briefly reintroduces the DNA double helix structure, explains SNP and INDEL mutations in genes, and describes the FASTA, FASTQ, BAM and VCF file formats. It also explains the index creation, alignment, sorting, duplicate marking and variant calling steps of a simple preprocessing workflow, and how to write a Perl script to automate the execution of these steps on a Virtual Machine Image.
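
As an illustration of the FASTQ format described in this material, the following Python sketch reads four-line records and decodes their quality scores. It assumes unwrapped records and Phred+33 quality encoding; the function name is hypothetical and not taken from the course.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) from four-line FASTQ records.

    Each record is: '@' + identifier, the sequence, a '+' separator line,
    and one quality character per base (assumed Phred+33 encoded).
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")

    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Phred+33: the quality score is the character code minus 33.
        yield header[1:], sequence, [ord(char) - 33 for char in quality]
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, for example `parse_fastq(open(path))`.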


Designing Computational Biology Workflows With Perl - Part 1 & 2, Esma Yildirim May 2019

Open Educational Resources

This manual guides the instructor in combining the partial files of the virtual machine image to construct the sequencer.ova file. It is accompanied by the partial files of the virtual machine image.


Exploring Strategies To Integrate Disparate Bioinformatics Datasets, Charbel Bader Fakhry Jan 2019

Walden Dissertations and Doctoral Studies

Distinct bioinformatics datasets make it challenging for bioinformatics specialists to locate the required datasets and unify their format for result extraction. The purpose of this single case study was to explore strategies to integrate distinct bioinformatics datasets. The technology acceptance model was used as the conceptual framework to understand the perceived usefulness and ease of use of integrating bioinformatics datasets. The population of this study included bioinformatics specialists of a research institution in Lebanon that has strategies to integrate distinct bioinformatics datasets. The data collection process included interviews with 6 bioinformatics specialists and reviewing 27 organizational documents relating to integrating …


Fast And Space-Efficient Location Of Heavy Or Dense Segments In Run-Length Encoded Sequences, Ronald I. Greenberg Jan 2018

Ronald Greenberg

This paper considers several variations of an optimization problem with potential applications in such areas as biomolecular sequence analysis and image processing. Given a sequence of items, each with a weight and a length, the goal is to find a subsequence of consecutive items of optimal value, where value is either total weight or total weight divided by total length. There may also be a specified lower and/or upper bound on the acceptable length of subsequences. This paper shows that all the variations of the problem are solvable in linear time and space even with non-uniform item lengths and divisible …
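
The simplest variation described here (maximize total weight, with no length bounds) can be solved by a classic single-pass scan often called Kadane's algorithm. The Python sketch below shows only that case, with a hypothetical function name; it is not the paper's algorithm, which also covers the weight-per-length objective, length bounds, and non-uniform item lengths.

```python
def heaviest_segment(weights):
    """Return (best_sum, start, end) so that weights[start:end] is a
    non-empty run of consecutive items with maximum total weight."""
    if not weights:
        raise ValueError("weights must be non-empty")

    best_sum = current_sum = weights[0]
    best_start, best_end = 0, 1
    current_start = 0

    for i in range(1, len(weights)):
        if current_sum < 0:
            # A negative running total can never help; restart at item i.
            current_sum = weights[i]
            current_start = i
        else:
            current_sum += weights[i]
        if current_sum > best_sum:
            best_sum = current_sum
            best_start, best_end = current_start, i + 1

    return best_sum, best_start, best_end
```

The scan visits each item once and keeps a constant number of variables, so it runs in linear time and constant extra space.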


An Interdisciplinary Approach To The Target Elucidation Of Novel Antibiotic 31G12, Larissa A. Walker Jan 2018

Graduate Student Theses, Dissertations, & Professional Papers

Staphylococcus aureus is a Gram-positive bacterial pathogen responsible for nosocomial and community-acquired infections that can quickly acquire antibiotic resistance. We have identified a novel triazole antimicrobial, 31G12, based on the natural product core of nonactin isolated from the fermentation of Streptomyces griseus, that is active against many Gram-positive bacteria as well as antibiotic-resistant methicillin-resistant S. aureus and vancomycin-resistant Enterococcus. The synthesis and characterization indicate that 31G12 exists as a mixture of two rotamers at room temperature and displays bacteriostatic activity against S. aureus with moderate mammalian cell toxicity. We have currently identified potential protein targets of 31G12 in …


Scalable Feature Selection And Extraction With Applications In Kinase Polypharmacology, Derek Jones Jan 2018

Theses and Dissertations--Computer Science

To reduce the time and costs associated with drug discovery, machine learning is being used to automate much of the work in this process. However, the size and complex nature of molecular data make the application of machine learning especially challenging. Much work must go into engineering the features that are then used to train machine learning models, costing considerable amounts of time and requiring the knowledge of domain experts to be most effective. The purpose of this work is to demonstrate data-driven approaches to perform the feature selection and extraction steps in …


A Polyglot Approach To Bioinformatics Data Integration: A Phylogenetic Analysis Of HIV-1, Steven Reisman, Thomas Hatzopoulous, Konstantin Läufer, George K. Thiruvathukal, Catherine Putonti Oct 2017

Konstantin Läufer

As sequencing technologies continue to drop in price and increase in throughput, new challenges emerge for the management and accessibility of genomic sequence data. We have developed a pipeline for facilitating the storage, retrieval, and subsequent analysis of molecular data, integrating both sequence and metadata. Taking a polyglot approach involving multiple languages, libraries, and persistence mechanisms, sequence data can be aggregated from publicly available and local repositories. Data are exposed in the form of a RESTful web service, formatted for easy querying, and retrieved for downstream analyses. As a proof of concept, we have developed a resource for annotated HIV-1 …


A Polyglot Approach To Bioinformatics Data Integration: Phylogenetic Analysis Of HIV-1, Steven Reisman, Catherine Putonti, George K. Thiruvathukal, Konstantin Läufer Oct 2017

Konstantin Läufer

RNA-interference has potential therapeutic use against HIV-1 by targeting highly-functional mRNA sequences that contribute to the virulence of the virus. Empirical work has shown that within cell lines, all of the HIV-1 genes are affected by RNAi-induced gene silencing. While promising, inherent in this treatment is the fact that RNAi sequences must be highly specific. HIV, however, mutates rapidly, leading to the evolution of viral escape mutants. In fact, such strains are under strong selection to include mutations within the targeted region, evading the RNAi therapy and thus increasing the virus’ fitness in the host. Taking a phylogenetic approach, we …