Open Access. Powered by Scholars. Published by Universities.®

Bioinformatics Commons

Data mining

Articles 1 - 22 of 22

Full-Text Articles in Bioinformatics

Repurposing Normal Chromosomal Microarray Data To Harbor Genetic Insights Into Congenital Heart Disease, Nephi Walton, Hoang Nguyen, Sara Procknow, Darren Johnson, Alexander Anzelmi, Patrick Jay Sep 2023

Department of Medicine Faculty Papers

About 15% of congenital heart disease (CHD) patients have a known pathogenic copy number variant. The majority of their chromosomal microarray (CMA) tests are deemed normal. Diagnostic interpretation typically ignores microdeletions smaller than 100 kb. We hypothesized that unreported microdeletions are enriched for CHD genes. We analyzed "normal" CMAs of 1762 patients who were evaluated at a pediatric referral center, of which 319 (18%) had CHD. Using CMAs from monozygotic twins or replicates from the same individual, we established a size threshold based on probe count for the reproducible detection of small microdeletions. Genes in the microdeletions were sequentially filtered …


Drug Repurposing Using Gene Expression Data Mining, Yue Qiu Sep 2023

Dissertations, Theses, and Capstone Projects

The conventional drug discovery process that employs the "one disease, one target, one drug" paradigm is expensive, time-consuming, and has a high rate of failure for multi-genic complex diseases. An alternative approach to drug discovery is to repurpose an existing drug that has been used to treat some medical conditions. Drug repurposing is considered a promising method because it accelerates the drug discovery process and lowers overall cost and risk.

Drug-perturbed gene expression profiles are powerful phenotype readouts of biological systems, and they have been widely used in drug repurposing studies. However, the existing drug-perturbed gene expression datasets …


Framework For The Evaluation Of Perturbations In The Systems Biology Landscape And Inter-Sample Similarity From Transcriptomic Datasets — A Digital Twin Perspective, Mariah Marie Hoffman Jan 2022

Dissertations and Theses

One approach to interrogating the complexities of human systems in their well-regulated and dysregulated states is through the use of digital twins. Digital twins are virtual representations of physical systems that are descriptive of an individual's state of health, an object fundamentally related to precision medicine. A key element for building a functional digital twin type for a disease or predicting the therapeutic efficacy of a potential treatment is harmonized, machine-parsable domain knowledge. Hypothesis-driven investigations are the gold standard for representing subsystems, but their results encompass a limited knowledge of the full biosystem. Multi-omics data is one rich source of …


Applications Of Machine Learning In Microbial Forensics, Ryan B. Ghannam Jan 2021

Dissertations, Master's Theses and Master's Reports

Microbial ecosystems are complex, with hundreds of members interacting with each other and the environment. The intricate and hidden behaviors underlying these interactions make research questions challenging, but the interactions can be better understood through machine learning. However, most machine learning used in microbiome work is a black-box form of investigation, where accurate predictions can be made, but the inner logic driving the prediction is hidden behind nontransparent layers of complexity.

Accordingly, the goal of this dissertation is to provide an interpretable and in-depth machine learning approach to investigate microbial biogeography and to use micro-organisms as …


Prediction Of O-Glycosylation In Proteins For Different Polypeptide GalNAc-Transferases, Jonathon Edward Mohl Jan 2019

Open Access Theses & Dissertations

Mucin-type O-glycosylation is a posttranslational modification found on secreted and cell-surface proteins in most animals, where it serves multiple important biological functions. In humans, mutations and changes in the expression levels of O-glycosylating polypeptide GalNAc-transferases (GALNTs) have been linked to diabetes and multiple cancers. In most animals the GALNTs compose a large family of isoforms; humans have 20. Presently, predicting which sites will be O-glycosylated is difficult because each isoform has a different substrate preference. In this work ISOGlyP, an isoform-specific O-glycosylation prediction program, was redesigned and expanded to …


Citationally Enhanced Semantic Literature Based Discovery, John David Fleig Jan 2019

CCE Theses and Dissertations

We are living in the age of information. The ever-increasing flow of data and publications poses a monumental bottleneck to scientific progress: despite its amazing abilities, the human mind is woefully inadequate at processing such a vast quantity of multidimensional information. The small bits of flotsam and jetsam that we leverage belie the amount of useful information beneath the surface. It is imperative that automated tools exist to better search, retrieve, and summarize this content. Combinations of document indexing and search engines can quickly find you a document whose content best matches your query - if …


Prediction Of 1p/19q Codeletion Status In Diffuse Glioma Patients Using Preoperative Multiparametric Magnetic Resonance Imaging, Donnie Kim Aug 2018

Dissertations & Theses (Open Access)

A complete codeletion of chromosome 1p/19q is strongly correlated with better overall survival of diffuse glioma patients; hence, determining the codeletion status early in the course of a patient’s disease would be valuable in that patient’s care. Current practice requires a surgical biopsy to assess the codeletion status, which exposes patients to risks and is limited in its accuracy by sampling variations. To overcome such limitations, we utilized four conventional magnetic resonance imaging sequences to predict the 1p/19q status. We extracted three sets of image-derived features, namely texture-based, topology-based, and convolutional neural network (CNN)-based, and analyzed each …


Efficient Reduced Bias Genetic Algorithm For Generic Community Detection Objectives, Aditya Karnam Gururaj Rao Apr 2018

Theses

The problem of community structure identification has been an extensively investigated area for biology, physics, social sciences, and computer science in recent years for studying the properties of networks representing complex relationships. Most traditional methods, such as K-means and hierarchical clustering, are based on the assumption that communities have spherical configurations. Lately, Genetic Algorithms (GA) are being utilized for efficient community detection without imposing sphericity. GAs are machine learning methods which mimic natural selection and scale with the complexity of the network. However, traditional GA approaches employ a representation method that dramatically increases the solution space to be searched by …


Clinical Information Extraction From Unstructured Free-Texts, Mingzhe Tao Jan 2018

Legacy Theses & Dissertations (2009 - 2024)

Information extraction (IE) is a fundamental component of natural language processing (NLP) that provides a deeper understanding of texts. In the clinical domain, documents prepared by medical experts (e.g., discharge summaries, drug labels, medical history records) contain a significant amount of clinically relevant information that is crucial to the overall well-being of patients. Unfortunately, in many cases, clinically relevant information is presented in an unstructured format, predominantly consisting of free-texts, making it inaccessible to computerized methods. Automatic extraction of this information can improve accessibility. However, the presence of synonymous expressions, medical acronyms, misspellings, negated phrases, and ambiguous terminologies makes automatic extraction …


Process Mining Of Medication Revisions In Electronic Health Records, Deevakar Rogith Dec 2015

Dissertations & Theses (Open Access)

Objective: The objective of this work is to develop process mining techniques for analysing Electronic Health Record (EHR) events in order to uncover factors contributing to an event and to understand deviations in the process. We have outlined a method for combining data mining with expert review to model the EHR process and develop automated algorithms that can be used to detect potential deviations for a defined process.

Introduction: To analyse EHR events meaningfully, process mining can be applied to distil structured process description from a set of real executions. Process mining can be applied for 1) Discovery, 2) Conformance, and …


Novel Computational Methods For Transcript Reconstruction And Quantification Using RNA-Seq Data, Yan Huang Jan 2015

Theses and Dissertations--Computer Science

The advent of RNA-seq technologies provides an unprecedented opportunity to precisely profile the mRNA transcriptome of a specific cell population. It helps reveal the characteristics of the cell under a particular condition, such as a disease. It is now possible to discover mRNA transcripts not cataloged in existing databases, in addition to assessing the identities and quantities of the known transcripts in a given sample or cell. However, each sequence read obtained from an RNA-seq experiment is only a short fragment of the original transcript. How to recapitulate the mRNA transcriptome from short RNA-seq reads remains a challenging problem. We …


Evolutionary Approaches For Feature Selection In Biological Data, Vinh Q. Dang Jan 2014

Theses: Doctorates and Masters

Data mining techniques have been used widely in many areas such as business, science, engineering and medicine. The techniques allow a vast amount of data to be explored in order to extract useful information from the data. One of the foci in the health area is finding interesting biomarkers from biomedical data. Mass throughput data generated from microarrays and mass spectrometry from biological samples are high dimensional and small in sample size. Examples include DNA microarray datasets with up to 500,000 genes and mass spectrometry data with 300,000 m/z values. While the availability of such datasets can aid in …


Development And Evaluation Of An Ontology-Based Quality Metrics Extraction System, Sina Madani Nov 2013

Dissertations & Theses (Open Access)

The Institute of Medicine reports a growing demand in recent years for quality improvement within the healthcare industry. In response, numerous organizations have been involved in the development and reporting of quality measurement metrics. However, disparate data models from such organizations shift the burden of accurate and reliable metrics extraction and reporting to healthcare providers. Furthermore, manual abstraction of quality metrics and diverse implementation of Electronic Health Record (EHR) systems deepen the complexity of consistent, valid, explicit, and comparable quality measurement reporting within healthcare provider organizations.

The main objective of this research is to evaluate an ontology-based information extraction framework …


A Novel Computational Framework For Transcriptome Analysis With RNA-Seq Data, Yin Hu Jan 2013

Theses and Dissertations--Computer Science

The advance of high-throughput sequencing technologies and their application to mRNA transcriptome sequencing (RNA-seq) have enabled comprehensive and unbiased profiling of the landscape of transcription in a cell. To address current limitations in the accuracy and scalability of transcriptome analysis, a novel computational framework has been developed for large-scale RNA-seq datasets with no dependence on transcript annotations. Directly from raw reads, a probabilistic approach is first applied to infer the best transcript fragment alignments from paired-end reads. Empowered by the identification of alternative splicing modules, this framework then performs precise and efficient differential analysis at automatically detected …


Data Mining Of Tetraloop-Tetraloop Receptors In RNA XML Files, Sinan Ramazanoglu May 2012

Theses

RNA (ribonucleic acid) motifs are tertiary structures that play an important role in the folding mechanism of the RNA molecule. The overall function of an RNA motif depends on its specific bp (base pair) sequence that constitutes the secondary structure. Data mining is a novel method for both discovering potential tertiary structures within DNA (deoxyribonucleic acid), RNA, and protein molecules and storing the information in databases. The RNA motif of interest is the tetraloop-tetraloop receptor, which is composed of a highly conserved 11 nt (nucleotide) sequence and a tetraloop with the generic form of GNRA (where N = any base …
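
As a rough illustration of the GNRA pattern mentioned in this abstract, the sketch below scans a plain RNA string for candidate GNRA 4-mers. It assumes the standard IUPAC nucleotide codes (N = any base, R = purine, i.e. A or G). The function name `find_gnra` and the example sequence are hypothetical, and the thesis itself works on RNA XML files, which are not modeled here. A sequence match alone does not establish that a tetraloop actually forms; that also depends on the surrounding stem.

```python
import re

# IUPAC nucleotide codes: N = any base (A, C, G, U), R = purine (A or G).
# The lookahead lets overlapping candidates be reported as well.
GNRA = re.compile(r"(?=(G[ACGU][AG]A))")


def find_gnra(seq):
    """Return (0-based start, 4-mer) for every GNRA-like match in seq."""
    return [(m.start(), m.group(1)) for m in GNRA.finditer(seq.upper())]


# Hypothetical example sequence, not taken from the thesis.
hits = find_gnra("CCGGAAAGGCGCAAGCC")
for start, tetramer in hits:
    print(start, tetramer)
```

A structure-aware search would additionally check that the flanking bases can pair to close the loop.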


Reconstructability Analysis Of Epistasis, Martin Zwick Dec 2010

Systems Science Faculty Publications and Presentations

The literature on epistasis describes various methods to detect epistatic interactions and to classify different types of epistasis. Reconstructability analysis (RA) has recently been used to detect epistasis in genomic data. This paper shows that RA offers a classification of types of epistasis at three levels of resolution (variable-based models without loops, variable-based models with loops, state-based models). These types can be defined by the simplest RA structures that model the data without information loss; a more detailed classification can be defined by the information content of multiple candidate structures. The RA classification can be augmented with structures from related …


Reconstructability Analysis As A Tool For Identifying Gene-Gene Interactions In Studies Of Human Diseases, Stephen Shervais, Patricia L. Kramer, Shawn K. Westaway, Nancy J. Cox, Martin Zwick Mar 2010

Systems Science Faculty Publications and Presentations

There are a number of common human diseases for which the genetic component may include an epistatic interaction of multiple genes. Detecting these interactions with standard statistical tools is difficult because there may be an interaction effect, but minimal or no main effect. Reconstructability analysis (RA) uses Shannon’s information theory to detect relationships between variables in categorical datasets. We applied RA to simulated data for five different models of gene-gene interaction, and found that even with heritability levels as low as 0.008, and with the inclusion of 50 non-associated genes in the dataset, we can identify the interacting gene pairs …
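
The idea of an interaction effect with no main effect can be made concrete with a small information-theoretic sketch. The code below is not the RA procedure or any of the paper's five models. It assumes a hypothetical toy dataset in which disease status is the XOR of two binary loci, and it uses plain Shannon mutual information to show that each gene alone carries no information about the disease while the pair carries all of it. All names (`entropy`, `mutual_information`, `gene_a`, `gene_b`, `disease`) are illustrative.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable observations."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Toy data (an assumption, not the paper's simulation): two binary loci,
# disease = XOR of the loci, every genotype combination equally frequent.
rows = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)] * 25
gene_a = [r[0] for r in rows]
gene_b = [r[1] for r in rows]
disease = [r[2] for r in rows]

main_a = mutual_information(gene_a, disease)  # single-gene (main) effect
main_b = mutual_information(gene_b, disease)
joint = mutual_information(list(zip(gene_a, gene_b)), disease)  # pair effect

print(f"I(A;D)={main_a:.3f}  I(B;D)={main_b:.3f}  I(A,B;D)={joint:.3f}")
```

In this noise-free toy case, a single-variable test sees nothing, which is why methods that examine variables jointly are needed. Real data with low heritability would show a much smaller joint signal.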


Word Sense Disambiguation In Biomedical Ontologies With Term Co-Occurrence Analysis And Document Clustering, Bill Andreopoulos, Dimitra Alexopoulou, Michael Schroeder Sep 2008

Faculty Publications, Computer Science

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the GeneOntology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as "development" can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an …


Word Sense Disambiguation In Biomedical Ontologies With Term Co-Occurrence Analysis And Document Clustering, Bill Andreopoulos, Dimitra Alexopoulou, Michael Schroeder Sep 2008

William B. Andreopoulos

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the GeneOntology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as "development" can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an …


Mobile Semantic Computing, Karthik Gomadam, Anupam Joshi, Amit P. Sheth Jan 2008

Kno.e.sis Publications

We propose to organize a special session on research in the intersection of mobile computing, the Semantic Web and Web services.

This session will examine how the research in these areas can serve as a foundation for new architectural and communication paradigms that can enhance service creation, distribution, discovery, integration and utilization in distributed and ubiquitous environments. Some of the initial areas that our early research has highlighted are:

  1. Semantic annotation of data in bandwidth constrained environments such as mobile networks to promote efficient bandwidth utilization
  2. Possibilities of using microformats such as RDFa and opportunities that can be explored …


Bi-Level Clustering Of Mixed Categorical And Numerical Biomedical Data, Bill Andreopoulos, Aijun An, Xiaogang Wang Jun 2006

Faculty Publications, Computer Science

Biomedical data sets often have mixed categorical and numerical types, where the former represent semantic information on the objects and the latter represent experimental results. We present the BILCOM algorithm for "Bi-Level Clustering of Mixed categorical and numerical data types". BILCOM performs a pseudo-Bayesian process, where the prior is categorical clustering. BILCOM partitions biomedical data sets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations, more accurately than if using one type alone.


Bi-Level Clustering Of Mixed Categorical And Numerical Biomedical Data, Bill Andreopoulos, Aijun An, Xiaogang Wang Jun 2006

William B. Andreopoulos

Biomedical data sets often have mixed categorical and numerical types, where the former represent semantic information on the objects and the latter represent experimental results. We present the BILCOM algorithm for "Bi-Level Clustering of Mixed categorical and numerical data types". BILCOM performs a pseudo-Bayesian process, where the prior is categorical clustering. BILCOM partitions biomedical data sets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations, more accurately than if using one type alone.