Open Access. Powered by Scholars. Published by Universities.®

Genetics and Genomics Commons

Articles 1 - 28 of 28

Full-Text Articles in Genetics and Genomics

Unified Methods For Feature Selection In Large-Scale Genomic Studies With Censored Survival Outcomes, Lauren Spirko-Burns, Karthik Devarajan Mar 2019

COBRA Preprint Series

One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease's process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards …


hpcNMF: A High-Performance Toolbox For Non-Negative Matrix Factorization, Karthik Devarajan, Guoli Wang Feb 2016

COBRA Preprint Series

Non-negative matrix factorization (NMF) is a widely used machine learning algorithm for dimension reduction of large-scale data. It has found successful applications in a variety of fields such as computational biology, neuroscience, natural language processing, information retrieval, image processing and speech recognition. In bioinformatics, for example, it has been used to extract patterns and profiles from genomic and text-mining data as well as in protein sequence and structure analysis. While the scientific performance of NMF is very promising in dealing with high dimensional data sets and complex data structures, its computational cost is high and sometimes could be critical for …


Models For HSV Shedding Must Account For Two Levels Of Overdispersion, Amalia Magaret Jan 2016

UW Biostatistics Working Paper Series

We have frequently implemented crossover studies to evaluate new therapeutic interventions for genital herpes simplex virus infection. The outcome measured to assess the efficacy of interventions on herpes disease severity is the viral shedding rate, defined as the frequency of detection of HSV on the genital skin and mucosa. We performed a simulation study to ascertain whether our standard model, which we have used previously, was appropriately considering all the necessary features of the shedding data to provide correct inference. We simulated shedding data under our standard, validated assumptions and assessed the ability of 5 different models to reproduce the …


Multiple Testing Of Local Maxima For Detection Of Peaks In ChIP-Seq Data, Armin Schwartzman, Andrew Jaffe, Yulia Gavrilov, Clifford A. Meyer Aug 2011

Harvard University Biostatistics Working Paper Series

No abstract provided.


A Unified Approach To Non-Negative Matrix Factorization And Probabilistic Latent Semantic Indexing, Karthik Devarajan, Guoli Wang, Nader Ebrahimi Jul 2011

COBRA Preprint Series

Non-negative matrix factorization (NMF) by the multiplicative updates algorithm is a powerful machine learning method for decomposing a high-dimensional nonnegative matrix V into two matrices, W and H, each with nonnegative entries, V ~ WH. NMF has been shown to have a unique parts-based, sparse representation of the data. The nonnegativity constraints in NMF allow only additive combinations of the data which enables it to learn parts that have distinct physical representations in reality. In the last few years, NMF has been successfully applied in a variety of areas such as natural language processing, information retrieval, image processing, speech recognition …
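To make the factorization V ~ WH concrete, here is a minimal sketch of multiplicative updates for the squared-error (Frobenius) objective, the classical Lee and Seung form. It is not the unified algorithm of this preprint, whose details are not given in the truncated abstract. The function name `nmf_multiplicative`, the rank `r`, the iteration count, the random initialization, and the small constant `eps` are all assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf_multiplicative(v, r, iters=200, eps=1e-9, seed=0):
    """Factor a nonnegative n x m matrix V into W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive starting values (an assumption of this sketch).
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(r)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return w, h
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, W and H stay nonnegative throughout, which is what restricts the model to additive combinations.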


Component Extraction Of Complex Biomedical Signal And Performance Analysis Based On Different Algorithm, Hemant Pasusangai Kasturiwale Jun 2011

Johns Hopkins University, Dept. of Biostatistics Working Papers

Biomedical signals can arise from one or many sources, including the heart, brain, and endocrine systems. Multiple sources pose a challenge to researchers, since the signals may be contaminated with artifacts and noise. Biomedical time series signals include the electroencephalogram (EEG), the electrocardiogram (ECG), etc. The morphology of the cardiac signal is very important in most diagnostics based on the ECG. A diagnosis based on visual observation of recorded ECG, EEG, etc. may not be accurate. To achieve better understanding, PCA (Principal Component Analysis) and ICA algorithms help in analyzing ECG signals. The immense scope in the field of biomedical-signal processing Independent Component Analysis( …


Sparse Linear Discriminant Analysis For Simultaneous Testing For The Significance Of A Gene Set/Pathway And Gene Selection, Michael C. Wu, Lingson Zhang, Zhaoxi Wang, David C. Christiani, Xihong Lin Jan 2009

Harvard University Biostatistics Working Paper Series

No abstract provided.


Model-Based Clustering Of Methylation Array Data: A Recursive-Partitioning Algorithm For High-Dimensional Data Arising As A Mixture Of Beta Distributions, E. Andres Houseman, Brock C. Christensen, Ru-Fang Yeh, Carmen J. Marsit, Margaret R. Karagas, Margaret Wrensch, Heather H. Nelson, Joseph Wiemels, Shichun Zheng, John K. Wiencke, Karl T. Kelsey Jun 2008

Harvard University Biostatistics Working Paper Series

No abstract provided.


Empirical Null And False Discovery Rate Inference For Exponential Families, Armin Schwartzman Feb 2008

Harvard University Biostatistics Working Paper Series

No abstract provided.


Power Boosting In Genome-Wide Studies Via Methods For Multivariate Outcomes, Mary J. Emond Feb 2007

UW Biostatistics Working Paper Series

Whole-genome studies are becoming a mainstay of biomedical research. Examples include expression array experiments, comparative genomic hybridization analyses and large case-control studies for detecting polymorphism/disease associations. The tactic of applying a regression model to every locus to obtain test statistics is useful in such studies. However, this approach ignores potential correlation structure in the data that could be used to gain power, particularly when a Bonferroni correction is applied to adjust for multiple testing. In this article, we propose using regression techniques for misspecified multivariate outcomes to increase statistical power over independence-based modeling at each locus. Even when the outcome …
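The Bonferroni correction mentioned above can be sketched as follows: with m per-locus tests, each p-value is compared against alpha / m. The function name `bonferroni_reject` and the example p-values are hypothetical, chosen only to show how the adjustment costs power as m grows.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected
    at family-wise level alpha under the Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # each test is held to the stricter per-test level
    return [p <= threshold for p in p_values]

# Hypothetical p-values from four per-locus regressions.
decisions = bonferroni_reject([0.001, 0.02, 0.04, 0.3], alpha=0.05)
```

In this made-up example three of the four p-values fall below 0.05 unadjusted, but only the first falls below the adjusted threshold of 0.0125.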


Semiparametric Regression Of Multi-Dimensional Genetic Pathway Data: Least Squares Kernel Machines And Linear Mixed Models, Dawei Liu, Xihong Lin, Debashis Ghosh Nov 2006

Harvard University Biostatistics Working Paper Series

No abstract provided.


Multiple Tests Of Association With Biological Annotation Metadata, Sandrine Dudoit, Sunduz Keles, Mark J. Van Der Laan Mar 2006

U.C. Berkeley Division of Biostatistics Working Paper Series

We propose a general and formal statistical framework for the multiple tests of associations between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known fixed gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating genome-wide transcript levels or DNA copy numbers to possibly censored biological and …


New Statistical Paradigms Leading To Web-Based Tools For Clinical/Translational Science, Knut M. Wittkowski May 2005

COBRA Preprint Series

As the field of functional genetics and genomics is beginning to mature, we become confronted with new challenges. The constant drop in price for sequencing and gene expression profiling as well as the increasing number of genetic and genomic variables that can be measured makes it feasible to address more complex questions. The success with rare diseases caused by single loci or genes has provided us with a proof-of-concept that new therapies can be developed based on functional genomics and genetics.

Common diseases, however, typically involve genetic epistasis, genomic pathways, and proteomic pattern. Moreover, to better understand the underlying biological …


The Clustering Of Regression Models Method With Applications In Gene Expression Data, Li-Xuan Qin, Steven G. Self Jan 2005

UW Biostatistics Working Paper Series

Identification of differentially expressed genes and clustering of genes are two important and complementary objectives addressed with gene expression data. For the differential expression question, many "per-gene" analytic methods have been proposed. These methods can generally be characterized as using a regression function to independently model the observations for each gene; various adjustments for multiplicity are then used to interpret the statistical significance of these per-gene regression models over the collection of genes analyzed. Motivated by this common structure of per-gene models, we propose a new model-based clustering method -- the clustering of regression models method, which groups genes that …


Cluster Analysis Of Genomic Data With Applications In R, Katherine S. Pollard, Mark J. Van Der Laan Jan 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

In this paper, we provide an overview of existing partitioning and hierarchical clustering algorithms in R. We discuss statistical issues and methods in choosing the number of clusters, the choice of clustering algorithm, and the choice of dissimilarity matrix. In particular, we illustrate how the bootstrap can be employed as a statistical method in cluster analysis to establish the reproducibility of the clusters and the overall variability of the followed procedure. We also show how to visualize a clustering result by plotting ordered dissimilarity matrices in R. We present a new R package, hopach, which implements the hybrid clustering method, …


Finding Cancer Subtypes In Microarray Data Using Random Projections, Debashis Ghosh Oct 2004

The University of Michigan Department of Biostatistics Working Paper Series

One of the benefits of profiling of cancer samples using microarrays is the generation of molecular fingerprints that will define subtypes of disease. Such subgroups have typically been found in microarray data using hierarchical clustering. A major problem in interpretation of the output is determining the number of clusters. We approach the problem of determining disease subtypes using mixture models. A novel estimation procedure of the parameters in the mixture model is developed based on a combination of random projections and the expectation-maximization algorithm. Because the approach is probabilistic, our approach provides a measure for the number of true clusters …


Significance Analysis Of Time Course Microarray Experiments, John D. Storey, Wenzhong Xiao, Jeffrey T. Leek, Ronald G. Tompkins, Ron W. Davis Aug 2004

UW Biostatistics Working Paper Series

Characterizing the genome-wide dynamic regulation of gene expression is important and will be of much interest in the future. However, there is currently no established method for identifying differentially expressed genes in a time course study. Here we propose a significance method for analyzing time course microarray studies that can be applied to the typical types of comparisons and sampling schemes. This method is applied to two studies on humans. In one study, genes are identified that show differential expression over time in response to in vivo endotoxin administration. Using our method 7409 genes are called significant at a 1% …


Quantification And Visualization Of LD Patterns And Identification Of Haplotype Blocks, Yan Wang, Sandrine Dudoit Jun 2004

U.C. Berkeley Division of Biostatistics Working Paper Series

Classical measures of linkage disequilibrium (LD) between two loci, based only on the joint distribution of alleles at these loci, present noisy patterns. In this paper, we propose a new distance-based LD measure, R, which takes into account multilocus haplotypes around the two loci in order to exploit information from neighboring loci. The LD measure R yields a matrix of pairwise distances between markers, based on the correlation between the lengths of shared haplotypes among chromosomes around these markers. Data analysis demonstrates that visualization of LD patterns through the R matrix reveals more deterministic patterns, with much less noise, than …


Classification Using Generalized Partial Least Squares, Beiying Ding, Robert Gentleman May 2004

Bioconductor Project Working Papers

The advances in computational biology have made simultaneous monitoring of thousands of features possible. The high throughput technologies not only bring about a much richer information context in which to study various aspects of gene functions but they also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. In this paper, we address the question of classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, in …


Calibrating Observed Differential Gene Expression For The Multiplicity Of Genes On The Array, Yingye Zheng, Margaret S. Pepe Jan 2004

UW Biostatistics Working Paper Series

In a gene expression array study, the expression levels of thousands of genes are monitored simultaneously across various biological conditions on a small set of subjects. One goal of such studies is to explore a large pool of genes in order to select a subset of genes that appear to be differently expressed for further investigation. Of particular interest here is how to select the top k genes once genes are ranked based on their evidence for differential expression in two tissue types. We consider statistical methods that provide a more rigorous and intuitively appealing selection process for k. We …


Loss-Based Estimation With Cross-Validation: Applications To Microarray Data Analysis And Motif Finding, Sandrine Dudoit, Mark J. Van Der Laan, Sunduz Keles, Annette M. Molinaro, Sandra E. Sinisi, Siew Leng Teng Dec 2003

U.C. Berkeley Division of Biostatistics Working Paper Series

Current statistical inference problems in genomic data analysis involve parameter estimation for high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables. Addressing these inference questions satisfactorily requires: (i) an intensive and thorough search of the parameter space to generate good candidate estimators, (ii) an approach for selecting an optimal estimator among these candidates, and (iii) a method for reliably assessing the performance of the resulting estimator. We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable …


A Nested Unsupervised Approach To Identifying Novel Molecular Subtypes, Elizabeth Garrett, Giovanni Parmigiani Oct 2003

Johns Hopkins University, Dept. of Biostatistics Working Papers

In classification problems arising in genomics research it is common to study populations for which a broad class assignment is known (say, normal versus diseased) and one seeks to find undiscovered subclasses within one or both of the known classes. Formally, this problem can be thought of as an unsupervised analysis nested within a supervised one. Here we take the view that the nested unsupervised analysis can successfully utilize information from the entire data set for constructing and/or selecting useful predictors. Specifically, we propose a mixture model approach to the nested unsupervised problem, where the supervised information is used to …


Tree-Based Multivariate Regression And Density Estimation With Right-Censored Data, Annette M. Molinaro, Sandrine Dudoit, Mark J. Van Der Laan Sep 2003

U.C. Berkeley Division of Biostatistics Working Paper Series

We propose a unified strategy for estimator construction, selection, and performance assessment in the presence of censoring. This approach is entirely driven by the choice of a loss function for the full (uncensored) data structure and can be stated in terms of the following three main steps. (1) Define the parameter of interest as the minimizer of the expected loss, or risk, for a full data loss function chosen to represent the desired measure of performance. Map the full data loss function into an observed (censored) data loss function having the same expected value and leading to an efficient estimator …


Cluster Stability Scores For Microarray Data In Cancer Studies, Mark Smolkin, Debashis Ghosh Jun 2003

The University of Michigan Department of Biostatistics Working Paper Series

A potential benefit of profiling of tissue samples using microarrays is the generation of molecular fingerprints that will define subtypes of disease. Hierarchical clustering has been the primary analytical tool used to define disease subtypes from microarray experiments in cancer settings. Assessing cluster reliability poses a major complication in analyzing output from these procedures. While much work has been done on assessing the global question of number of clusters in a dataset, relatively little research exists on assessing stability of individual clusters. …


Selecting Differentially Expressed Genes From Microarray Experiments, Margaret S. Pepe, Gary M. Longton, Garnet L. Anderson, Michel Schummer Jan 2003

UW Biostatistics Working Paper Series

High throughput technologies, such as gene expression arrays and protein mass spectrometry, allow one to simultaneously evaluate thousands of potential biomarkers that distinguish different tissue types. Of particular interest here is cancer versus normal organ tissues. We consider statistical methods to rank genes (or proteins) with regard to differential expression between tissues. Various statistical measures are considered and we argue that two measures related to the Receiver Operating Characteristic Curve are particularly suitable for this purpose. We also propose that sampling variability in the gene rankings be quantified and suggest using the “selection probability function”, the probability distribution of rankings …


Comparative Genomic Hybridization Array Analysis, Annette M. Molinaro, Mark J. Van Der Laan, Dan H. Moore Apr 2002

U.C. Berkeley Division of Biostatistics Working Paper Series

At the present time, there is increasing evidence that cancer may be regulated by the number of copies of genes in tumor cells. Through microarray technology it is now possible to measure the number of copies of thousands of genes and gene segments in samples of chromosomal DNA. Microarray comparative genomic hybridization (array CGH) provides the opportunity to both measure DNA sequence copy number gains and losses and map these aberrations to the genomic sequence. Gains can signify the over-expression of oncogenes, genes which stimulate cell growth and have become hyperactive, while losses can signify under-expression of tumor suppressor genes, …


A New Partitioning Around Medoids Algorithm, Mark J. Van Der Laan, Katherine S. Pollard, Jennifer Bryan Feb 2002

U.C. Berkeley Division of Biostatistics Working Paper Series

Kaufman & Rousseeuw (1990) proposed a clustering algorithm Partitioning Around Medoids (PAM) which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common context that many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM does have problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this …
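The idea of mapping a distance matrix into a specified number of clusters can be sketched with a simple swap-based search over medoids. This is only an illustration of the classical criterion (the sum of distances from each element to its nearest medoid), not the new algorithm proposed in this working paper. The function names, the naive choice of the first k elements as starting medoids, and the toy data are assumptions; PAM proper starts from a dedicated BUILD step.

```python
from itertools import product

def total_cost(dist, medoids):
    """Sum over all elements of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam_swap(dist, k):
    """Swap-based medoid search on a precomputed distance matrix.

    Returns the sorted medoid indices and the final cost.
    """
    n = len(dist)
    medoids = list(range(k))  # naive start (assumption of this sketch)
    cost = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid element.
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            trial_cost = total_cost(dist, trial)
            if trial_cost < cost:  # accept only strict improvements
                medoids, cost, improved = trial, trial_cost, True
    return sorted(medoids), cost

# Toy one-dimensional data with two well-separated groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, cost = pam_swap(dist, 2)
```

Since the search only needs `dist`, any dissimilarity can be plugged in, which reflects the property noted above that PAM works with any specified distance metric.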


Statistical Inference For Simultaneous Clustering Of Gene Expression Data, Katherine S. Pollard, Mark J. Van Der Laan Jul 2001

U.C. Berkeley Division of Biostatistics Working Paper Series

Current methods for analysis of gene expression data are mostly based on clustering and classification of either genes or samples. We offer support for the idea that more complex patterns can be identified in the data if genes and samples are considered simultaneously. We formalize the approach and propose a statistical framework for two-way clustering. A simultaneous clustering parameter is defined as a function of the true data generating distribution, and an estimate is obtained by applying this function to the empirical distribution. We illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as …