Full-Text Articles in Entire DC Network
Cluster Analysis Of Genomic Data With Applications In R, Katherine S. Pollard, Mark J. Van Der Laan
U.C. Berkeley Division of Biostatistics Working Paper Series
In this paper, we provide an overview of existing partitioning and hierarchical clustering algorithms in R. We discuss statistical issues and methods in choosing the number of clusters, the choice of clustering algorithm, and the choice of dissimilarity matrix. In particular, we illustrate how the bootstrap can be employed as a statistical method in cluster analysis to establish the reproducibility of the clusters and the overall variability of the followed procedure. We also show how to visualize a clustering result by plotting ordered dissimilarity matrices in R. We present a new R package, hopach, which implements the hybrid clustering method, …
Resampling-Based Multiple Testing: Asymptotic Control Of Type I Error And Applications To Gene Expression Data, Katherine S. Pollard, Mark J. Van Der Laan
U.C. Berkeley Division of Biostatistics Working Paper Series
We define a general statistical framework for multiple hypothesis testing and show that the correct null distribution for the test statistics is obtained by projecting the true distribution of the test statistics onto the space of mean zero distributions. For common choices of test statistics (based on an asymptotically linear parameter estimator), this distribution is asymptotically multivariate normal with mean zero and the covariance of the vector influence curve for the parameter estimator. This test statistic null distribution can be estimated by applying the non-parametric or parametric bootstrap to correctly centered test statistics. We prove that this bootstrap estimated null …
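The centering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the test statistic (sqrt(n) times the column mean), the function names `bootstrap_null` and `max_stat_cutoff`, and the use of a single max-statistic cutoff are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_null(data, n_boot=500, seed=0):
    """Bootstrap test statistics, centered to mean zero per hypothesis.

    data: list of n rows, each a list of m values (one column per hypothesis).
    The statistic sqrt(n) * column mean is an assumption for illustration.
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    stats = []
    for _ in range(n_boot):
        # Non-parametric bootstrap: resample whole rows with replacement.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        stats.append(
            [n ** 0.5 * statistics.fmean(row[j] for row in sample) for j in range(m)]
        )
    # Center each hypothesis's bootstrap statistics so the null has mean zero.
    centers = [statistics.fmean(s[j] for s in stats) for j in range(m)]
    return [[s[j] - centers[j] for j in range(m)] for s in stats]


def max_stat_cutoff(null_stats, alpha=0.05):
    """Empirical (1 - alpha) quantile of max_j |Z_j| (hypothetical usage)."""
    maxima = sorted(max(abs(z) for z in row) for row in null_stats)
    index = min(len(maxima) - 1, int((1 - alpha) * len(maxima)))
    return maxima[index]
```

Resampling whole rows keeps the dependence between test statistics intact, which is why the centered replicates can serve as a joint null distribution rather than m separate marginal ones.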
A Method To Identify Significant Clusters In Gene Expression Data, Katherine S. Pollard, Mark J. Van Der Laan
U.C. Berkeley Division of Biostatistics Working Paper Series
Clustering algorithms have been widely applied to gene expression data. For both hierarchical and partitioning clustering algorithms, selecting the number of significant clusters is an important problem and many methods have been proposed. Existing methods for selecting the number of clusters tend to find only the global patterns in the data (e.g., the over- and under-expressed genes). We have noted the need for a better method in the gene expression context, where small, biologically meaningful clusters can be difficult to identify. In this paper, we define a new criterion, Mean Split Silhouette (MSS), which is a measure of cluster …
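For background, the ordinary silhouette width (Rousseeuw's measure of how well each element fits its assigned cluster) can be computed as below. This sketch is not the MSS criterion, whose definition is cut off in the abstract above; the function name `silhouette_widths` and the input layout are assumptions for the example.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of each element.

    dist: symmetric n x n dissimilarity matrix (list of lists).
    labels: cluster label of each element; at least two clusters are required.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention: singletons get width 0
            continue
        # a: mean dissimilarity to the other members of the element's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths
```

Widths near 1 indicate well-separated elements and negative widths indicate elements that sit closer to a neighbouring cluster. Averaging the widths gives a single score that is often compared across candidate numbers of clusters.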
A New Partitioning Around Medoids Algorithm, Mark J. Van Der Laan, Katherine S. Pollard, Jennifer Bryan
U.C. Berkeley Division of Biostatistics Working Paper Series
Kaufman & Rousseeuw (1990) proposed a clustering algorithm, Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common context in which many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM does have problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this …
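To make the idea of partitioning around medoids concrete, here is a minimal swap-based sketch in the spirit of the classical PAM of Kaufman & Rousseeuw. It is not the new algorithm proposed in the paper, whose description is cut off above. The naive initialization (the first k elements) and the names `total_cost` and `pam_sketch` are assumptions for the example; PAM proper uses a separate build step to pick starting medoids.

```python
def total_cost(dist, medoids):
    """Sum over all elements of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam_sketch(dist, k):
    """Greedy medoid swapping on an n x n distance matrix.

    Returns (medoid indices, cluster label per element, total cost).
    """
    n = len(dist)
    medoids = list(range(k))  # naive start, an assumption for illustration
    cost = total_cost(dist, medoids)
    while True:
        best_cost, best_medoids = cost, None
        # Try replacing each medoid with each non-medoid; keep the best swap.
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < best_cost:
                    best_cost, best_medoids = trial_cost, trial
        if best_medoids is None:
            break  # no swap lowers the cost: a local optimum
        cost, medoids = best_cost, best_medoids
    # Assign each element to the position of its nearest medoid.
    labels = [min(range(k), key=lambda p: dist[i][medoids[p]]) for i in range(n)]
    return medoids, labels, cost
```

Because the procedure only reads the distance matrix, any dissimilarity can be plugged in, which is the property the abstract highlights. Each accepted swap strictly lowers the cost, so the loop always terminates, though only at a local optimum.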
Identification Of Regulatory Elements Using A Feature Selection Method, Sunduz Keles, Mark J. Van Der Laan, Michael B. Eisen
U.C. Berkeley Division of Biostatistics Working Paper Series
Many methods have been described to identify regulatory motifs in the transcription control regions of genes that exhibit similar patterns of gene expression across a variety of experimental conditions. Here we focus on a single experimental condition, and utilize gene expression data to identify sequence motifs associated with genes that are activated under this experimental condition. We use a linear model with two-way interactions to model gene expression as a function of sequence features (words) present in presumptive transcription control regions. The most relevant features are selected by a feature selection method called stepwise selection with Monte Carlo cross …
Statistical Inference For Simultaneous Clustering Of Gene Expression Data, Katherine S. Pollard, Mark J. Van Der Laan
U.C. Berkeley Division of Biostatistics Working Paper Series
Current methods for analysis of gene expression data are mostly based on clustering and classification of either genes or samples. We offer support for the idea that more complex patterns can be identified in the data if genes and samples are considered simultaneously. We formalize the approach and propose a statistical framework for two-way clustering. A simultaneous clustering parameter is defined as a function of the true data generating distribution, and an estimate is obtained by applying this function to the empirical distribution. We illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as …