Open Access. Powered by Scholars. Published by Universities.®

Bioinformatics Commons

Clustering

Articles 1 - 26 of 26

Full-Text Articles in Bioinformatics

Model-Based Deep Autoencoders For Clustering Single-Cell RNA Sequencing Data With Side Information, Xiang Lin Dec 2023

Dissertations

Clustering analysis has been conducted extensively in single-cell RNA sequencing (scRNA-seq) studies. scRNA-seq can profile the activities of tens of thousands of genes within a single cell. Thousands or tens of thousands of cells can be captured simultaneously in a typical scRNA-seq experiment. Biologists would like to cluster these cells to explore and elucidate cell types or subtypes. Numerous methods have been designed for clustering scRNA-seq data. Yet single-cell technologies have developed so fast in the past few years that existing methods have not kept up with these rapid changes and fail to fulfil their full potential. For instance, besides profiling transcription …


Biofilmgeneset: Leveraging Multi-Omics Data Mining And ICA To Discover Biofilm Stage Genes Of Interest From Condition-Specific Expression Dataset, Mathew Olakunle Alaba Jan 2022

Dissertations and Theses

Biofilm formation occurs in the attachment, colony, maturation, and dispersion stages. Understanding the molecular basis at every point of this process is essential to developing efficient diagnostic devices and effective antibiofilm agents. Gene expression data provide molecular insight for both static and temporal biofilm development. The most used analytic techniques for biofilm gene expression data are clustering and network inference algorithms, which group genes with similar expression across the samples. However, these methods are inherently deficient because they do not capture gene(s) expressed in a subset of the samples. These subsets might be unique to a developmental stage, for example. …


High Performance Computing Techniques To Better Understand Protein Conformational Space, Arpita Joshi Aug 2019

Graduate Doctoral Dissertations

This thesis presents an amalgamation of high performance computing techniques to get better insight into protein molecular dynamics. Key aspects of protein function and dynamics can be learned from their conformational space. Datasets that represent the complex nuances of a protein molecule are high dimensional. Efficient dimensionality reduction becomes indispensable for the analysis of such exorbitant datasets. Dimensionality reduction forms a formidable portion of this work and its application has been explored for other datasets as well. It begins with the parallelization of a known non-linear feature reduction algorithm called Isomap. The code for the algorithm was re-written in C …


Efficient Reduced Bias Genetic Algorithm For Generic Community Detection Objectives, Aditya Karnam Gururaj Rao Apr 2018

Theses

The problem of community structure identification has been an extensively investigated area for biology, physics, social sciences, and computer science in recent years for studying the properties of networks representing complex relationships. Most traditional methods, such as K-means and hierarchical clustering, are based on the assumption that communities have spherical configurations. Lately, Genetic Algorithms (GA) are being utilized for efficient community detection without imposing sphericity. GAs are machine learning methods which mimic natural selection and scale with the complexity of the network. However, traditional GA approaches employ a representation method that dramatically increases the solution space to be searched by …


Machine Learning Techniques Implementation In Power Optimization, Data Processing, And Bio-Medical Applications, Khalid Khairullah Mezied Al-Jabery Jan 2018

Doctoral Dissertations

The rapid progress and development in machine-learning algorithms has become a key factor in determining the future of humanity. These algorithms and techniques were utilized to solve a wide spectrum of problems extending from data mining and knowledge discovery to unsupervised learning and optimization. This dissertation consists of two study areas. The first area investigates the use of reinforcement learning and adaptive critic design algorithms in the field of power grid control. The second area in this dissertation, consisting of three papers, focuses on developing and applying clustering algorithms on biomedical data. The first paper presents a novel modelling approach for …


Bioinformatic Interrogation Of Phosphonate Tailoring Pathways, Monica Papinski Jan 2018

Theses and Dissertations (Comprehensive)

Phosphonates represent an underexploited class of natural products despite their tremendous potential for use in medicine and agriculture. Even less characterized are phosphonate-containing macromolecules such as cell wall lipids and glycans, distinguished by a P-C bond known to provide stability towards hydrolysis. Despite some progress made in revealing cell wall phosphonate tailoring (Pnt) pathways, several barriers impede the discovery and characterization of novel phosphonate biosynthetic pathways. Specifically, a large diversity of gene composition and arrangement is evident surrounding key genes established to participate in phosphonate tailoring pathways, which are identified alongside the presence of the ppm gene encoding the P-C …


Basic Science To Clinical Research: Segmentation Of Ultrasound And Modelling In Clinical Informatics, Ali K. Hamou Apr 2017

Electronic Thesis and Dissertation Repository

The world of basic science is a world of minutiae; it boils down to improving even a fraction of a percent over the baseline standard. It is a domain of peer reviewed fractions of seconds and the world of squeezing every last ounce of efficiency from a processor, a storage medium, or an algorithm. The field of health data is based on extracting knowledge from segments of data that may improve some clinical process or practice guideline to improve the time and quality of care. Clinical informatics and knowledge translation provide this information in order to reveal insights to …


A Novel Approach For Classifying Gene Expression Data Using Topic Modeling, Soon Jye Kho, Himi Yalamanchili, Michael L. Raymer, Amit Sheth Jan 2017

Kno.e.sis Publications

Understanding the role of differential gene expression in cancer etiology and cellular process is a complex problem that continues to pose a challenge due to the sheer number of genes and inter-related biological processes involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to mitigate overfitting of high-dimensionality gene expression data and to facilitate understanding of the associated pathways. LDA has recently been applied for clustering and exploring genomic data but not for classification and prediction. Here, we propose to use LDA in clustering as well as in classification of cancer and healthy tissues using lung cancer …


Global Gene Expression Profiling Of Healthy Human Brain And Its Application In Studying Neurological Disorders, Simarjeet K. Negi Dec 2016

Theses & Dissertations

The human brain is the most complex structure known to mankind and one of the greatest challenges in modern biology is to understand how it is built and organized. The power of the brain arises from its variety of cells and structures, and ultimately where and when different genes are switched on and off throughout the brain tissue. In other words, brain function depends on the precise regulation of gene expression in its sub-anatomical structures. But our understanding of the complexity and dynamics of the transcriptome of the human brain is still incomplete. To fill this need, we designed …


A Framework For The Statistical Analysis Of Mass Spectrometry Imaging Experiments, Kyle Bemis Dec 2016

Open Access Dissertations

Mass spectrometry (MS) imaging is a powerful investigation technique for a wide range of biological applications such as molecular histology of tissue, whole body sections, and bacterial films, and biomedical applications such as cancer diagnosis. MS imaging visualizes the spatial distribution of molecular ions in a sample by repeatedly collecting mass spectra across its surface, resulting in complex, high-dimensional imaging datasets. Two of the primary goals of statistical analysis of MS imaging experiments are classification (for supervised experiments), i.e. assigning pixels to pre-defined classes based on their spectral profiles, and segmentation (for unsupervised experiments), i.e. assigning pixels to newly …


Near Infrared Spectroscopy For Estimating The Age Of Malaria Transmitting Mosquitoes, Masabho Peter Milali Oct 2016

Master's Theses (2009 -)

We explore the use of near infrared spectrometry to classify the age of a wild malaria-transmitting mosquito. In Chapter Two, using a different set of lab-reared mosquitoes, we replicate the Mayagaya et al. study of the accuracy of near-infrared spectrometry (NIRS) to estimate the age of lab-reared mosquitoes, reproducing the published accuracy. Our results strengthen the Mayagaya et al. study and increase confidence in using NIRS to estimate age classes of mosquitoes. In the field, we wish to classify the ages of wild, not lab-reared, mosquitoes, but the necessary training data from wild mosquitoes is difficult to find. Applying …


A Computational Framework For Learning From Complex Data: Formulations, Algorithms, And Applications, Wenlu Zhang Jul 2016

Computer Science Theses & Dissertations

Many real-world processes are dynamically changing over time. As a consequence, the observed complex data generated by these processes also evolve smoothly. For example, in computational biology, the expression data matrices are evolving, since gene expression controls are deployed sequentially during development in many biological processes. Investigations into the spatial and temporal gene expression dynamics are essential for understanding the regulatory biology governing development. In this dissertation, I mainly focus on two types of complex data: genome-wide spatial gene expression patterns in the model organism fruit fly and Allen Brain Atlas mouse brain data. I provide a framework to explore …


Classification Of Breast Cancer Patients Using Somatic Mutation Profiles And Machine Learning Approaches, Suleyman Vural Dec 2015

Theses & Dissertations

The high degree of heterogeneity observed in breast cancers makes it very difficult to classify cancer patients into distinct clinical subgroups and consequently limits the ability to devise effective therapeutic strategies. In this study, we explore the use of gene mutation profiles to classify, characterize and predict the subgroups of breast cancers. We analyzed the whole exome sequencing data from 358 ethnically similar breast cancer patients in The Cancer Genome Atlas (TCGA) project. Identified somatic and non-synonymous single nucleotide variants were assigned a quantitative score (C-score) that represents the extent of negative impact on the function of the gene. Using …


Apply Data Clustering To Gene Expression Data, Abdullah Jameel Abualhamayl Mr. Dec 2015

Electronic Theses, Projects, and Dissertations

Data clustering plays an important role in effective analysis of gene expression. Although DNA microarray technology facilitates expression monitoring, several challenges arise when dealing with gene expression datasets. Some of these challenges are the enormous number of genes, the dimensionality of the data, and the change of data over time. The genetic groups which are biologically interlinked can be identified through clustering. This project aims to clarify the steps to apply clustering analysis of genes involved in a published dataset. The methodology for this project includes the selection of the dataset representation, the selection of gene datasets, Similarity Matrix Selection, …


Plsi: A Computational Software Pipeline For Pathway Level Disease Subtype Identification, Michele Donato Jan 2015

Wayne State University Theses

It is accepted that many complex diseases, like cancer, consist of collections of distinct genetic diseases. Clinical advances in treatments are attributed to molecular treatments aimed at specific genes resulting in greater efficacy and fewer debilitating side effects. This proves that it is important to identify and appropriately treat each individual disease subtype. Our current understanding of subtypes is limited: despite targeted treatment advances, targeted therapies often fail for some patients. The main limitation of current methods for subtype identification is that they focus on gene expression, and they are subject to its intrinsic noise. Signaling pathways describe biological processes …


The Structural Heterogeneity And Dynamics Of Base Stacking And Unstacking In Nucleic Acids, Ada Anna Sedova Jan 2015

Legacy Theses & Dissertations (2009 - 2024)

Base stacking provides stability to nucleic acid duplexes, and base unstacking is involved in numerous biological functions related to nucleic acids, including replication, repair, transcription, and translation. The patterns of base stacking and unstacking in available nucleic acid crystal structures were classified after separation into their individual single strand dinucleotide components and clustering using a k-means-based ensemble clustering method. The A- and B-form proximity of these dinucleotide structures was assessed, revealing that RNA dinucleotides can approach B-form-like structures. Umbrella sampling molecular dynamics simulations were used to obtain the potential of mean force profiles for base unstacking at 5'-termini for …


Analysis Of DNA Motifs In The Human Genome, Yupu Liang Feb 2014

Dissertations, Theses, and Capstone Projects

DNA motifs include repeat elements, promoter elements and gene regulator elements, and play a critical role in the human genome. This thesis describes a genome-wide computational study on two groups of motifs: tandem repeats and core promoter elements.

Tandem repeats in DNA sequences are extremely relevant in biological phenomena and diagnostic tools. Computational programs that discover tandem repeats generate a huge volume of data, which can be difficult to decipher without further organization. A new method is presented here to organize and rank detected tandem repeats through clustering and classification. Our work presents multiple ways of expressing tandem repeats using …


Genome Jigsaw: Implications Of 16S Ribosomal RNA Gene Fragment Position For Bacterial Species Identification, Jennifer Mitchell Jan 2014

Theses and Dissertations (Comprehensive)

The 16S rRNA gene is present within all bacteria, and contains nine variable regions interspersed within conserved regions of the gene. While conserved regions remain mostly constant over time, variable regions can be used for taxonomic identification purposes. Current methodologies for characterizing microbial communities, such as those used to study the human microbiome, involve sequencing short fragments of this ubiquitous gene, and comparing these fragments to reference sequences in databases to identify the microbes present. Traditionally, whole 16S rRNA sequences with more than 97% sequence identity (id) are assigned to a single operational taxonomic unit (OTU); each OTU being a …
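
The 97% identity rule mentioned in this abstract can be illustrated with a small sketch. It is not the thesis's method: it assumes the sequences are already aligned to equal length, measures identity as the fraction of matching positions, and uses a simple greedy assignment to the first matching seed. The names `identity`, `greedy_otus`, and `with_mismatches`, and the toy sequences, are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose seed it matches above threshold."""
    otus = []  # list of (seed_sequence, [member_ids])
    for seq_id, seq in seqs.items():
        for seed, members in otus:
            if identity(seq, seed) > threshold:
                members.append(seq_id)
                break
        else:
            # No existing seed is similar enough, so this sequence seeds a new OTU.
            otus.append((seq, [seq_id]))
    return [members for _, members in otus]


def with_mismatches(seq, n):
    """Toy mutation: replace the first n 'A' bases with 'C'."""
    return seq.replace("A", "C", n)


# Hypothetical toy data: 100-base sequences derived from one reference.
base = "ACGT" * 25
toy_seqs = {
    "s1": base,
    "s2": with_mismatches(base, 2),  # 98% identical to s1
    "s3": with_mismatches(base, 5),  # 95% identical to s1
    "s4": with_mismatches(base, 6),  # 99% identical to s3
}
otus = greedy_otus(toy_seqs)
print(otus)
```

With this toy input, `s1` and `s2` share one OTU while `s3` and `s4` form a second. The result of a greedy scheme like this depends on input order, which is one reason real pipelines differ in how they choose seeds.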


On Identifying And Analyzing Significant Nodes In Protein-Protein Interaction Networks, Rohan Khazanchi, Kathryn Dempsey Cooper, Ishwor Thapa, Hesham Ali Jan 2013

Interdisciplinary Informatics Faculty Proceedings & Presentations

Network theory has been used for modeling biological data as well as social networks, transportation logistics, business transcripts, and many other types of data sets. Identifying important features/parts of these networks for a multitude of applications is becoming increasingly significant as the need for big data analysis techniques grows. When analyzing a network of protein-protein interactions (PPIs), identifying nodes of significant importance can direct the user toward biologically relevant network features. In this work, we propose that a node of structural importance in a network model can correspond to a biologically vital or significant property. This relationship between topological and …


On Mining Biological Signals Using Correlation Networks, Kathryn Dempsey Cooper, Ishwor Thapa, Claudia Cortes, Zack Eriksen, Dhundy Raj Bastola, Hesham Ali Jan 2013

Interdisciplinary Informatics Faculty Proceedings & Presentations

Correlation networks have been used in biological networks to analyze and model high-throughput biological data, such as gene expression from microarray or RNA-seq assays. Typically in biological network modeling, structures can be mined from these networks that represent biological functions; for example, a cluster of proteins in an interactome can represent a protein complex. In correlation networks built from high-throughput gene expression data, it has often been speculated or even assumed that clusters represent sets of genes that are coregulated. This research aims to validate this concept using network systems biology and data mining by identification of correlation network clusters …
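
As a rough illustration of the idea of a correlation network, the sketch below links two genes when the absolute Pearson correlation of their expression profiles meets a threshold, then reads off connected components as a crude stand-in for clusters. The threshold of 0.9, the use of connected components, the function names, and the toy expression values are all assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_network(expr, threshold=0.9):
    """Return an adjacency dict linking genes whose |r| >= threshold."""
    graph = {gene: set() for gene in expr}
    for g1, g2 in combinations(expr, 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            graph[g1].add(g2)
            graph[g2].add(g1)
    return graph


def connected_components(graph):
    """Group nodes into connected components using an iterative depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Hypothetical toy data: four genes measured across five samples.
toy_expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneD": [4.9, 1.2, 4.1, 1.8, 3.1],  # tracks geneC closely
}
clusters = connected_components(correlation_network(toy_expr, threshold=0.9))
print(clusters)
```

On the toy data the two strongly correlated pairs end up in separate components. Whether such clusters correspond to coregulated genes is exactly the question the abstract raises; the sketch only shows how the network itself could be built.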


Modeling And Quantitative Analysis Of White Matter Fiber Tracts In Diffusion Tensor Imaging, Xuwei Liang Jan 2011

University of Kentucky Doctoral Dissertations

Diffusion tensor imaging (DTI) is a structural magnetic resonance imaging (MRI) technique to record incoherent motion of water molecules and has been used to detect microstructural white matter alterations in clinical studies to explore certain brain disorders. A variety of DTI based techniques for detecting brain disorders and facilitating clinical group analysis have been developed in the past few years. However, there are two crucial issues that have great impacts on the performance of those algorithms. One is that brain neural pathways appear in complicated 3D structures which cannot be approximated appropriately or accurately by simple 2D structures, …


Informatics And Statistics For Analyzing 2-D Gel Electrophoresis Images, Andrew W. Dowsey, Jeffrey S. Morris, Howard G. Gutstein, Guang Z. Yang Jan 2010

Jeffrey S. Morris

Whilst recent progress in ‘shotgun’ peptide separation by integrated liquid chromatography and mass spectrometry (LC/MS) has enabled its use as a sensitive analytical technique, proteome coverage and reproducibility are still limited and obtaining enough replicate runs for biomarker discovery is a challenge. For these reasons, recent research demonstrates the continuing need for protein separation by two-dimensional gel electrophoresis (2-DE). However, with traditional 2-DE informatics, the digitized images are reduced to symbolic data through spot detection and quantification before proteins are compared for differential expression by spot matching. Recently, a more robust and automated paradigm has emerged where gels are directly …


Finding Molecular Complexes Through Multiple Layer Clustering Of Protein Interaction Networks, Bill Andreopoulos, Aijun An, Xiangji Huang, Xiaogang Wang Jan 2007

Faculty Publications, Computer Science

Clustering protein-protein interaction networks (PINs) helps to identify complexes that guide the cell machinery. Clustering algorithms often create a flat clustering, without considering the layered structure of PINs. We propose the MULIC clustering algorithm that produces layered clusters. We applied MULIC to five PINs. Clusters correlate with known MIPS protein complexes. For example, a cluster of 79 proteins overlaps with a known complex of 88 proteins. Proteins in top cluster layers tend to be more representative of complexes than proteins in bottom layers. Lab work on finding unknown complexes or determining drug effects can be guided by top layer proteins.


Finding Molecular Complexes Through Multiple Layer Clustering Of Protein Interaction Networks, Bill Andreopoulos, Aijun An, Xiangji Huang, Xiaogang Wang Dec 2006

William B. Andreopoulos

Clustering protein-protein interaction networks (PINs) helps to identify complexes that guide the cell machinery. Clustering algorithms often create a flat clustering, without considering the layered structure of PINs. We propose the MULIC clustering algorithm that produces layered clusters. We applied MULIC to five PINs. Clusters correlate with known MIPS protein complexes. For example, a cluster of 79 proteins overlaps with a known complex of 88 proteins. Proteins in top cluster layers tend to be more representative of complexes than proteins in bottom layers. Lab work on finding unknown complexes or determining drug effects can be guided by top layer proteins.


Cluster Analysis Of Genomic Data With Applications In R, Katherine S. Pollard, Mark J. Van Der Laan Jan 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

In this paper, we provide an overview of existing partitioning and hierarchical clustering algorithms in R. We discuss statistical issues and methods in choosing the number of clusters, the choice of clustering algorithm, and the choice of dissimilarity matrix. In particular, we illustrate how the bootstrap can be employed as a statistical method in cluster analysis to establish the reproducibility of the clusters and the overall variability of the followed procedure. We also show how to visualize a clustering result by plotting ordered dissimilarity matrices in R. We present a new R package, hopach, which implements the hybrid clustering method, …
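
One way to picture the bootstrap idea described in this abstract is the sketch below, written in Python rather than R and not using the hopach package. It assumes that samples (columns) are resampled with replacement, that a basic single-linkage clustering stands in for whatever algorithm is being assessed, and that reproducibility is summarized as the fraction of replicates in which two genes share a cluster. All function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def single_linkage(profiles, k):
    """Agglomerative single-linkage clustering; returns a list of k sets of names."""
    clusters = [{name} for name in profiles]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest closest-member distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                euclidean(profiles[a], profiles[b])
                for a in clusters[ij[0]]
                for b in clusters[ij[1]]
            ),
        )
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return clusters


def bootstrap_coclustering(profiles, k, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each pair of items shares a cluster."""
    rng = random.Random(seed)
    n_cols = len(next(iter(profiles.values())))
    counts = {pair: 0 for pair in combinations(sorted(profiles), 2)}
    for _ in range(n_boot):
        # Resample the columns (samples) with replacement, then recluster.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = {g: [v[c] for c in cols] for g, v in profiles.items()}
        for cluster in single_linkage(resampled, k):
            for pair in combinations(sorted(cluster), 2):
                counts[pair] += 1
    return {pair: n / n_boot for pair, n in counts.items()}


# Hypothetical toy data: two clearly separated groups of genes.
toy_profiles = {
    "g1": [0.1, 0.3, 0.2, 0.0],
    "g2": [0.2, 0.1, 0.3, 0.1],
    "g3": [5.0, 5.2, 4.9, 5.1],
    "g4": [5.1, 5.0, 5.2, 4.8],
}
stability = bootstrap_coclustering(toy_profiles, k=2)
print(stability)
```

Pairs with a co-clustering fraction near 1 are reproducible under resampling, and pairs near 0 are reliably separated; intermediate values would flag unstable assignments. The toy groups are so far apart that every replicate recovers them.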


Statistical Inference For Simultaneous Clustering Of Gene Expression Data, Katherine S. Pollard, Mark J. Van Der Laan Jul 2001

U.C. Berkeley Division of Biostatistics Working Paper Series

Current methods for analysis of gene expression data are mostly based on clustering and classification of either genes or samples. We offer support for the idea that more complex patterns can be identified in the data if genes and samples are considered simultaneously. We formalize the approach and propose a statistical framework for two-way clustering. A simultaneous clustering parameter is defined as a function of the true data generating distribution, and an estimate is obtained by applying this function to the empirical distribution. We illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as …