Open Access. Powered by Scholars. Published by Universities.®

Genetics and Genomics Commons

Open Access. Powered by Scholars. Published by Universities.®

Articles 1 - 10 of 10

Full-Text Articles in Genetics and Genomics

Loss-Based Estimation With Cross-Validation: Applications To Microarray Data Analysis And Motif Finding, Sandrine Dudoit, Mark J. Van Der Laan, Sunduz Keles, Annette M. Molinaro, Sandra E. Sinisi, Siew Leng Teng Dec 2003

Loss-Based Estimation With Cross-Validation: Applications To Microarray Data Analysis And Motif Finding, Sandrine Dudoit, Mark J. Van Der Laan, Sunduz Keles, Annette M. Molinaro, Sandra E. Sinisi, Siew Leng Teng

U.C. Berkeley Division of Biostatistics Working Paper Series

Current statistical inference problems in genomic data analysis involve parameter estimation for high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables. Addressing these inference questions satisfactorily requires: (i) an intensive and thorough search of the parameter space to generate good candidate estimators, (ii) an approach for selecting an optimal estimator among these candidates, and (iii) a method for reliably assessing the performance of the resulting estimator. We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable …


Unification Of Variance Components And Haseman-Elston Regression For Quantitative Trait Linkage Analysis, Wei-Min Chen, Karl W. Broman, Kung-Yee Liang Oct 2003

Unification Of Variance Components And Haseman-Elston Regression For Quantitative Trait Linkage Analysis, Wei-Min Chen, Karl W. Broman, Kung-Yee Liang

Johns Hopkins University, Dept. of Biostatistics Working Papers

Two of the major approaches for linkage analysis with quantitative traits in humans include variance components and Haseman-Elston regression. Previously, these have been viewed as quite separate methods. We describe a general model, fit by use of generalized estimating equations (GEE), for which the variance components and Haseman-Elston methods (including many of the extensions to the original Haseman-Elston method) are special cases, corresponding to different choices for a working covariance matrix. We also show that the regression-based test of Sham et al.(2002) is equivalent to a robust score statistic derived from our GEE approach. These results have several important implications. …


A Nested Unsupervised Approach To Identifying Novel Molecular Subtypes, Elizabeth Garrett, Giovanni Parmigiani Oct 2003

A Nested Unsupervised Approach To Identifying Novel Molecular Subtypes, Elizabeth Garrett, Giovanni Parmigiani

Johns Hopkins University, Dept. of Biostatistics Working Papers

In classification problems arising in genomics research it is common to study populations for which a broad class assignment is known (say, normal versus diseased) and one seeks to find undiscovered subclasses within one or both of the known classes. Formally, this problem can be thought of as an unsupervised analysis nested within a supervised one. Here we take the view that the nested unsupervised analysis can successfully utilize information from the entire data set for constructing and/or selecting useful predictors. Specifically, we propose a mixture model approach to the nested unsupervised problem, where the supervised information is used to …


Tree-Based Multivariate Regression And Density Estimation With Right-Censored Data , Annette M. Molinaro, Sandrine Dudoit, Mark J. Van Der Laan Sep 2003

Tree-Based Multivariate Regression And Density Estimation With Right-Censored Data , Annette M. Molinaro, Sandrine Dudoit, Mark J. Van Der Laan

U.C. Berkeley Division of Biostatistics Working Paper Series

We propose a unified strategy for estimator construction, selection, and performance assessment in the presence of censoring. This approach is entirely driven by the choice of a loss function for the full (uncensored) data structure and can be stated in terms of the following three main steps. (1) Define the parameter of interest as the minimizer of the expected loss, or risk, for a full data loss function chosen to represent the desired measure of performance. Map the full data loss function into an observed (censored) data loss function having the same expected value and leading to an efficient estimator …


Cluster Stability Scores For Microarray Data In Cancer Studies, Mark Smolkin, Debashis Ghosh Jun 2003

Cluster Stability Scores For Microarray Data In Cancer Studies, Mark Smolkin, Debashis Ghosh

The University of Michigan Department of Biostatistics Working Paper Series

A potential benefit of profiling of tissue samples using microarrays is the generation of molecular fingerprints that will define subtypes of disease. Hierarchical clustering has been the primary analytical tool used to define disease subtypes from microarray experiments in cancer settings. Assessing cluster reliability poses a major complication in analyzing output from these procedures. While much work has been done on assessing the global question of number of clusters in a dataset, relatively little research exists on assessing stability of individual clusters. A potential benefit of profiling of tissue samples using microarrays is the generation of molecular fingerprints that will …


Supervised Detection Of Regulatory Motifs In Dna Sequences, Sunduz Keles, Mark J. Van Der Laan, Sandrine Dudoit, Biao Xing, Michael B. Eisen May 2003

Supervised Detection Of Regulatory Motifs In Dna Sequences, Sunduz Keles, Mark J. Van Der Laan, Sandrine Dudoit, Biao Xing, Michael B. Eisen

U.C. Berkeley Division of Biostatistics Working Paper Series

Identification of transcription factor binding sites (regulatory motifs) is a major interest in contemporary biology. We propose a new likelihood based method, COMODE, for identifying structural motifs in DNA sequences. Commonly used methods (e.g. MEME, Gibbs sampler) model binding sites as families of sequences described by a position weight matrix (PWM) and identify PWMs that maximize the likelihood of observed sequence data under a simple multinomial mixture model. This model assumes that the positions of the PWM correspond to independent multinomial distributions with four cell probabilities. We address supervising the search for DNA binding sites using the information derived from …


Simple Parallel Statistical Computing In R, Anthony Rossini, Luke Tierney, Na Li Mar 2003

Simple Parallel Statistical Computing In R, Anthony Rossini, Luke Tierney, Na Li

UW Biostatistics Working Paper Series

Theoretically, many modern statistical procedures are trivial to parallelize. However, practical deployment of a parallelized implementation which is robust and reliably runs on different computational cluster configurations and environments is far from trivial. We present a framework for the R statistical computing language that provides a simple yet powerful programming interface to a computational cluster. This interface allows the development of R functions that distribute independent computations across the nodes of the computational cluster. The resulting framework allows statisticians to obtain significant speed-ups for some computations at little additional development cost. The particular implementation can be deployed in heterogeneous computing …


Literate Statistical Practice, Anthony Rossini, Friedrich Leisch Mar 2003

Literate Statistical Practice, Anthony Rossini, Friedrich Leisch

UW Biostatistics Working Paper Series

Literate Statistical Practice (LSP, Rossini, 2001) describes an approach for creating self-documenting statistical results. It applies literate programming (Knuth, 1992) and related techniques in a natural fashion to the practice of statistics. In particular, documentation, specification, and descriptions of results are written concurrently with writing and evaluation of statistical programs. We discuss how and where LSP can be integrated into practice and illustrate this with an example derived from an actual statistical consulting project. The approach is simplified through the use of a comprehensive, open source toolset incorporating Noweb, Emacs Speaks Statistics (ESS), Sweave (Ramsey, 1994; Rossini, et al, 2002; …


Ibd Configuration Transition Matrices And Linkage Score Tests For Unilineal Relative Pairs, Sandrine Dudoit Feb 2003

Ibd Configuration Transition Matrices And Linkage Score Tests For Unilineal Relative Pairs, Sandrine Dudoit

U.C. Berkeley Division of Biostatistics Working Paper Series

Properties of transition matrices between IBD configurations are derived for four general classes of unilineal relative pairs obtained from the grand-parent/ grand-child, half-sib, avuncular, and cousin relationships. In this setting, IBD configurations are defined as orbits of groups acting on a set of inheritance vectors. Properties of the transition matrix between IBD configurations at two linked loci are derived by relating its infinitesimal generator to the adjacency matrix of a quotient graph. The second largest eigenvalue of the infinitesimal generator and its multiplicity are key in determining the form of the transition matrix and of likelihood-based linkage tests such as …


Selecting Differentially Expressed Genes From Microarray Experiments, Margaret S. Pepe, Gary M. Longton, Garnet L. Anderson, Michel Schummer Jan 2003

Selecting Differentially Expressed Genes From Microarray Experiments, Margaret S. Pepe, Gary M. Longton, Garnet L. Anderson, Michel Schummer

UW Biostatistics Working Paper Series

High throughput technologies, such as gene expression arrays and protein mass spectrometry, allow one to simultaneously evaluate thousands of potential biomarkers that distinguish different tissue types. Of particular interest here is cancer versus normal organ tissues. We consider statistical methods to rank genes (or proteins) in regards to differential expression between tissues. Various statistical measures are considered and we argue that two measures related to the Receiver Operating Characteristic Curve are particularly suitable for this purpose. We also propose that sampling variability in the gene rankings be quantified and suggest using the “selection probability function”, the probability distribution of rankings …