Keywords
- Causal inference (3)
- MCMC (3)
- Confounding (2)
- Counterfactual (2)
- Cross-validation (2)
- Double robust estimation (2)
- Estimation (2)
- G-computation estimation (2)
- Loss function (2)
- Model selection (2)
- Prediction (2)
- Risk (2)
- 3D conformal radiation therapy (1)
- Adaptive quadrature (1)
- Adaptivity (1)
- Adjacency matrix; disease mapping; epidemiology; Markov processes (1)
- Airglow (1)
- Antiretroviral resistance (1)
- Antiretroviral therapy (1)
- Auxiliary variables (1)
- Backfitting algorithm; CAR model; collapsibility; epidemiology; Gauss-Seidel algorithm; iterative weighted least squares algorithm (1)
- Bandwidth Selection (1)
- Baseball (1)
- Bayesian analysis (1)
- Bayesian estimation (1)
- Bayesian inference (1)
- Bayesian statistics; Fourier basis; FFT; generalized linear mixed model; geostatistics; spatial statistics (1)
- Bayesian statistics; Fourier basis; FFT; geostatistics; generalized linear mixed model; generalized additive model; Markov chain Monte Carlo; spatial statistics; spectral representation (1)
- Bayesian variable selection (1)
- Bioinformatics (1)
Publications
- Harvard University Biostatistics Working Paper Series (7)
- Johns Hopkins University, Dept. of Biostatistics Working Papers (7)
- The University of Michigan Department of Biostatistics Working Paper Series (6)
- U.C. Berkeley Division of Biostatistics Working Paper Series (6)
- COBRA Preprint Series (2)
- Publications and Research (2)
- Articles (1)
- Basic Science Engineering (1)
- Department of Mathematics Faculty Scholarship and Creative Works (1)
- FIU Electronic Theses and Dissertations (1)
- Masters Theses & Specialist Projects (1)
- Masters Theses 1911 - February 2014 (1)
- Mathematical Sciences Spring Lecture Series (1)
- Publications (1)
- Rowan-Virtua School of Osteopathic Medicine Faculty Scholarship (1)
- UW Biostatistics Working Paper Series (1)
- United States Geological Survey: Water Reports and Publications (1)
Articles 1 - 30 of 41
Full-Text Articles in Statistical Models
Modeling Biphasic, Non-Sigmoidal Dose-Response Relationships: Comparison Of Brain-Cousens And Cedergreen Models For A Biochemical Dataset, Venkat D. Abbaraju, Tamaraty L. Robinson, Brian P. Weiser
Rowan-Virtua School of Osteopathic Medicine Faculty Scholarship
Biphasic, non-sigmoidal dose-response relationships are frequently observed in biochemistry and pharmacology, but they are not always analyzed with appropriate statistical methods. Here, we examine curve fitting methods for “hormetic” dose-response relationships where low and high doses of an effector produce opposite responses. We provide the full dataset used for modeling, and we provide the code for analyzing the dataset in SAS using two established mathematical models of hormesis, the Brain-Cousens model and the Cedergreen model. We show how to obtain and interpret curve parameters such as the ED50 that arise from modeling, and we discuss how curve parameters might change …
Statistical Characteristics Of High-Frequency Gravity Waves Observed By An Airglow Imager At Andes Lidar Observatory, Alan Z. Liu, Bing Cao
Publications
The long-term statistical characteristics of high-frequency quasi-monochromatic gravity waves are presented using multi-year airglow images observed at Andes Lidar Observatory (ALO, 30.3° S, 70.7° W) in northern Chile. The distributions of the primary gravity wave parameters, including horizontal wavelength, vertical wavelength, intrinsic wave speed, and intrinsic wave period, are obtained and fall in the ranges of 20–30 km, 15–25 km, 50–100 m s⁻¹, and 5–10 min, respectively. The duration of persistent gravity wave events captured by the imager approximately follows an exponential distribution with an average duration of 7–9 min. The waves tend to propagate against the local background winds and …
A Simple Algorithm For Generating A New Two Sample Type-Ii Progressive Censoring With Applications, E. M. Shokr, Rashad Mohamed El-Sagheer, Mahmoud Mansour, H. M. Faied, B. S. El-Desouky
Basic Science Engineering
In this article, we introduce a simple algorithm for generating a new type-II progressive censoring scheme for two samples. It is observed that the proposed algorithm can be applied to any continuous probability distribution. Moreover, the model description and necessary assumptions are discussed. In addition, the steps of the generation algorithm, along with programming steps, are illustrated on a real example. The inference for two Weibull Frechet populations is discussed under the proposed algorithm. Both classical and Bayesian inferential approaches to the distribution parameters are discussed. Furthermore, approximate confidence intervals are constructed based on the asymptotic distribution of the maximum …
Application Of Randomness In Finance, Jose Sanchez, Daanial Ahmad, Satyanand Singh
Publications and Research
Brownian motion, also known as a Wiener process, can be thought of as a random walk. In our project we briefly discussed the fluctuations of financial indices and related them to Brownian motion and the modeling of stock prices.
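The standard way to model stock prices with Brownian motion is geometric Brownian motion. A minimal sketch (assuming NumPy; the drift, volatility, and starting price are illustrative values, not figures from the project):

```python
import numpy as np

# Geometric Brownian motion: S_{t+dt} = S_t * exp((mu - sigma^2/2) dt + sigma sqrt(dt) Z),
# where Z is standard normal. mu, sigma, and S_0 below are assumed for illustration.
rng = np.random.default_rng(0)
mu, sigma = 0.05, 0.2            # annual drift and volatility (assumed)
dt, n_steps = 1 / 252, 252       # daily steps over one trading year
z = rng.standard_normal(n_steps)
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
prices = 100 * np.exp(np.cumsum(log_returns))  # price path starting from S_0 = 100
```

Because the increments enter through an exponential, the simulated price path stays strictly positive, unlike a plain random walk.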
Lecture 04: Spatial Statistics Applications Of Hrl, Trl, And Mixed Precision, David Keyes
Mathematical Sciences Spring Lecture Series
As simulation and analytics enter the exascale era, numerical algorithms, particularly implicit solvers that couple vast numbers of degrees of freedom, must span a widening gap between ambitious applications and austere architectures to support them. We present fifteen universals for researchers in scalable solvers: imperatives from computer architecture that scalable solvers must respect, strategies towards achieving them that are currently well established, and additional strategies currently being developed for an effective and efficient exascale software ecosystem. We consider recent generalizations of what it means to “solve” a computational problem, which suggest that we have often been “oversolving” them at the …
Pawnee Dam Inflow Design Flood (Idf) Update And Stage-Frequency Curve Development Using Rmcrfa, Jennifer P. Christensen, Joshua J. Melliger
United States Geological Survey: Water Reports and Publications
Pawnee Dam is one of the ten Salt Creek Dams designed and built in the 1960s to mitigate flooding in Lincoln, Nebraska. This short paper illustrates the update of the Pawnee Dam inflow design flood (IDF) through calibration to recent high flow events and the development of its stage-frequency or hydrologic loading curve with the U.S. Army Corps of Engineers’ Risk Management Center Reservoir Frequency Analysis (RMC-RFA) model. The IDF update follows Engineering Regulation 1110-8-2, Inflow Design Flood for Dams and Reservoirs, including unit hydrograph peaking and two antecedent pool elevations. Background information on the original design of the dam …
Sabermetrics - Statistical Modeling Of Run Creation And Prevention In Baseball, Parker Chernoff
FIU Electronic Theses and Dissertations
The focus of this thesis was to investigate which baseball metrics are most conducive to run creation and prevention. Stepwise regression and Liu estimation were used to formulate two models for the dependent variables and for cross-validation. Finally, the predicted values were fed into the Pythagorean Expectation formula to predict a team's most important goal: winning.
Each model fit strongly, and collinearity among the offensive predictors was assessed using variance inflation factors. Hits, walks, and home runs allowed, infield putouts, errors, defense-independent earned run average ratio, defensive efficiency ratio, saves, runners left on base, shutouts, and walks per …
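The Pythagorean Expectation step is simple to sketch (the run totals below are hypothetical, not the thesis's fitted values):

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2):
    """Bill James's Pythagorean expectation: a team's expected win
    fraction from runs scored and runs allowed (exponent 2 is the
    classic choice)."""
    rs, ra = runs_scored**exponent, runs_allowed**exponent
    return rs / (rs + ra)

# Hypothetical team: 800 runs scored, 700 allowed over a 162-game season.
win_pct = pythagorean_expectation(800, 700)   # ~0.566
expected_wins = round(162 * win_pct)          # ~92 wins
```

Feeding model-predicted run totals through this formula converts run creation and prevention estimates into an expected win count.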
On The Three Dimensional Interaction Between Flexible Fibers And Fluid Flow, Bogdan Nita, Ryan Allaire
Department of Mathematics Facuty Scholarship and Creative Works
In this paper we discuss the deformation of a flexible fiber clamped to a spherical body and immersed in a flow of fluid moving with a speed ranging between 0 and 50 cm/s, by means of a three-dimensional numerical simulation developed in COMSOL. The effects of flow speed and the initial configuration angle of the fiber relative to the flow are analyzed. A rigorous analysis of the numerical procedure is performed and our code is benchmarked against well-established cases. The flow velocity and pressure are used to compute drag forces upon the fiber. Of particular interest is the behavior …
Hpcnmf: A High-Performance Toolbox For Non-Negative Matrix Factorization, Karthik Devarajan, Guoli Wang
COBRA Preprint Series
Non-negative matrix factorization (NMF) is a widely used machine learning algorithm for dimension reduction of large-scale data. It has found successful applications in a variety of fields such as computational biology, neuroscience, natural language processing, information retrieval, image processing and speech recognition. In bioinformatics, for example, it has been used to extract patterns and profiles from genomic and text-mining data as well as in protein sequence and structure analysis. While the scientific performance of NMF is very promising in dealing with high dimensional data sets and complex data structures, its computational cost is high and sometimes could be critical for …
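The core NMF computation (a sketch of the classical Lee-Seung multiplicative updates for the Frobenius-norm objective, not the HPC-NMF toolbox itself) looks like this:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as V ~ W @ H with
    W (n x rank) >= 0 and H (rank x m) >= 0, reducing ||V - WH||_F^2
    via multiplicative updates that preserve non-negativity."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, elementwise
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, elementwise
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((20, 10)))
W, H = nmf(V, rank=3)
error = np.linalg.norm(V - W @ H)
```

The elementwise updates never introduce negative entries, which is what makes the resulting factors interpretable as additive parts; the high per-iteration cost of the dense matrix products is exactly the bottleneck a high-performance toolbox targets.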
Stochastic Dea With A Perfect Object And Its Application To Analysis Of Environmental Efficiency, Alexander Vaninsky
Publications and Research
The paper introduces stochastic DEA with a Perfect Object (SDEA PO). The Perfect Object (PO) is a virtual Decision Making Unit (DMU) that has the smallest inputs and greatest outputs. Including the PO in a collection of actual objects yields an explicit formula of the efficiency index. Given the distributions of DEA inputs and outputs, this formula allows us to derive the probability distribution of the efficiency score, to find its mathematical expectation, and to deliver common (group-related) and partial (object-related) efficiency components. We apply this approach to a prospective analysis of environmental efficiency of the major national and regional …
Spatial And Temporal Correlations Of Freeway Link Speeds: An Empirical Study, Piotr J. Rachtan
Masters Theses 1911 - February 2014
Congestion on roadways and high level of uncertainty of traffic conditions are major considerations for trip planning. The purpose of this research is to investigate the characteristics and patterns of spatial and temporal correlations and also to detect other variables that affect correlation in a freeway setting. 5-minute speed aggregates from the Performance Measurement System (PeMS) database are obtained for two directions of an urban freeway – I-10 between Santa Monica and Los Angeles, California. Observations are for all non-holiday weekdays between January 1st and June 30th, 2010. Other variables include traffic flow, ramp locations, number of lanes and the …
Flexible Distributed Lag Models Using Random Functions With Application To Estimating Mortality Displacement From Heat-Related Deaths, Roger D. Peng
Johns Hopkins University, Dept. of Biostatistics Working Papers
No abstract provided.
Generalized Bathtub Hazard Models For Binary-Transformed Climate Data, James Polcer
Masters Theses & Specialist Projects
In this study, we use hazard-based modeling as an alternative statistical framework to time series methods as applied to climate data. Data collected from the Kentucky Mesonet will be used to study the distributional properties of the duration of high- and low-energy wind events relative to an arbitrary threshold. Our objectives were to fit bathtub models proposed in the literature, propose a generalized bathtub model, apply these models to Kentucky Mesonet data, and make recommendations as to the feasibility of wind power generation. Using two different thresholds (1.8 and 10 mph respectively), results show that the Hjorth bathtub model consistently performed better …
Shrinkage Estimation Of Expression Fold Change As An Alternative To Testing Hypotheses Of Equivalent Expression, Zahra Montazeri, Corey M. Yanofsky, David R. Bickel
COBRA Preprint Series
Research on analyzing microarray data has focused on the problem of identifying differentially expressed genes to the neglect of the problem of how to integrate evidence that a gene is differentially expressed with information on the extent of its differential expression. Consequently, researchers currently prioritize genes for further study either on the basis of volcano plots or, more commonly, according to simple estimates of the fold change after filtering the genes with an arbitrary statistical significance threshold. While the subjective and informal nature of the former practice precludes quantification of its reliability, the latter practice is equivalent to using a …
A Method For Visualizing Multivariate Time Series Data, Roger D. Peng
Johns Hopkins University, Dept. of Biostatistics Working Papers
Visualization and exploratory analysis are an important part of any data analysis and are made more challenging when the data are voluminous and high-dimensional. One such example is environmental monitoring data, which are often collected over time and at multiple locations, resulting in a geographically indexed multivariate time series. Financial data, although not necessarily containing a geographic component, present another source of high-volume multivariate time series data. We present the mvtsplot function which provides a method for visualizing multivariate time series data. We outline the basic design concepts and provide some examples of its usage by applying it to a …
Bayesian Analysis For Penalized Spline Regression Using Win Bugs, Ciprian M. Crainiceanu, David Ruppert, M.P. Wand
Johns Hopkins University, Dept. of Biostatistics Working Papers
Penalized splines can be viewed as BLUPs in a mixed model framework, which allows the use of mixed model software for smoothing. Thus, software originally developed for Bayesian analysis of mixed models can be used for penalized spline regression. Bayesian inference for nonparametric models enjoys the flexibility of nonparametric models and the exact inference provided by the Bayesian inferential machinery. This paper provides a simple, yet comprehensive, set of programs for the implementation of nonparametric Bayesian analysis in WinBUGS. MCMC mixing is substantially improved from the previous versions by using low-rank thin-plate splines instead of a truncated polynomial basis. Simulation time …
Diffusion And Fractional Diffusion Based Models For Multiple Light Scattering And Image Analysis, Jonathan Blackledge
Articles
This paper considers a fractional light diffusion model as an approach to characterizing the case when intermediate scattering processes are present, i.e. the scattering regime is neither strong nor weak. In order to introduce the basis for this approach, we revisit the elements of formal scattering theory and the classical diffusion problem in terms of solutions to the inhomogeneous wave and diffusion equations respectively. We then address the significance of these equations in terms of a random walk model for multiple scattering. This leads to the proposition of a fractional diffusion equation for modelling intermediate strength scattering that is based …
Spatio-Temporal Analysis Of Areal Data And Discovery Of Neighborhood Relationships In Conditionally Autoregressive Models, Subharup Guha, Louise Ryan
Harvard University Biostatistics Working Paper Series
No abstract provided.
Bayesian Smoothing Of Irregularly-Spaced Data Using Fourier Basis Functions, Christopher J. Paciorek
Harvard University Biostatistics Working Paper Series
No abstract provided.
Gauss-Seidel Estimation Of Generalized Linear Mixed Models With Application To Poisson Modeling Of Spatially Varying Disease Rates, Subharup Guha, Louise Ryan
Harvard University Biostatistics Working Paper Series
Generalized linear mixed models (GLMMs) provide an elegant framework for the analysis of correlated data. Because the likelihood has no closed form, GLMMs are often fit by computational procedures like penalized quasi-likelihood (PQL). Special cases of these models are generalized linear models (GLMs), which are often fit using algorithms like iterative weighted least squares (IWLS). High computational costs and memory space constraints often make it difficult to apply these iterative procedures to data sets with a very large number of cases.
This paper proposes a computationally efficient strategy based on the Gauss-Seidel algorithm that iteratively fits sub-models of the GLMM …
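For the GLM special case mentioned above, the IWLS iteration is easy to sketch (an illustrative logistic-regression version in NumPy; this is not the paper's Gauss-Seidel GLMM procedure):

```python
import numpy as np

def iwls_logistic(X, y, n_iter=25):
    """Fit logistic regression by iterative weighted least squares
    (Fisher scoring): each pass solves a weighted least-squares
    problem for a linearized 'working response' z."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse logit link
        w = mu * (1.0 - mu)               # working weights
        z = eta + (y - mu) / w            # working response
        XtW = X.T * w                     # X^T W (W diagonal)
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Synthetic check: data generated from a known coefficient vector.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.standard_normal(500)])
true_beta = np.array([-0.5, 1.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat = iwls_logistic(X, y)
```

The weighted solve above forms and factors the full X^T W X at every pass, which is the memory and compute bottleneck that motivates fitting sub-models iteratively instead.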
Computational Techniques For Spatial Logistic Regression With Large Datasets, Christopher J. Paciorek, Louise Ryan
Harvard University Biostatistics Working Paper Series
In epidemiological work, outcomes are frequently non-normal, sample sizes may be large, and effects are often small. To relate health outcomes to geographic risk factors, fast and powerful methods for fitting spatial models, particularly for non-normal data, are required. We focus on binary outcomes, with the risk surface a smooth function of space. We compare penalized likelihood models, including the penalized quasi-likelihood (PQL) approach, and Bayesian models based on fit, speed, and ease of implementation.
A Bayesian model using a spectral basis representation of the spatial surface provides the best tradeoff of sensitivity and specificity in simulations, detecting real spatial …
Robust Inferences For Covariate Effects On Survival Time With Censored Linear Regression Models, Larry Leon, Tianxi Cai, L. J. Wei
Harvard University Biostatistics Working Paper Series
Various inference procedures for linear regression models with censored failure times have been studied extensively. Recent developments on efficient algorithms to implement these procedures enhance the practical usage of such models in survival analysis. In this article, we present robust inferences for certain covariate effects on the failure time in the presence of "nuisance" confounders under a semiparametric, partial linear regression setting. Specifically, the estimation procedures for the regression coefficients of interest are derived from a working linear model and are valid even when the function of the confounders in the model is not correctly specified. The new proposals are …
A Hybrid Newton-Type Method For The Linear Regression In Case-Cohort Studies, Menggang Yu, Bin Nan
The University of Michigan Department of Biostatistics Working Paper Series
Case-cohort designs are increasingly common in large epidemiological cohort studies. Nan, Yu, and Kalbfleisch (2004) provided the asymptotic results for censored linear regression models in case-cohort studies. In this article, we consider computational aspects of their proposed rank-based estimating methods. We show that the rank-based discontinuous estimating functions for case-cohort studies are monotone, a property established for cohort data in the literature, when generalized Gehan-type weights are used. Though the estimating problem can be formulated as a linear programming problem, as for cohort data, due to its easily uncontrollable large scale even for a …
Semiparametric Regression In Capture-Recapture Modelling, O. Gimenez, C. Barbraud, Ciprian M. Crainiceanu, S. Jenouvrier, B.T. Morgan
Johns Hopkins University, Dept. of Biostatistics Working Papers
Capture-recapture models were developed to estimate survival using data arising from marking and monitoring wild animals over time. Variation in the survival process may be explained by incorporating relevant covariates. We develop nonparametric and semiparametric regression models for estimating survival in capture-recapture models. A fully Bayesian approach using MCMC simulations was employed to estimate the model parameters. The work is illustrated by a study of Snow petrels, in which survival probabilities are expressed as nonlinear functions of a climate covariate, using data from a 40-year study on marked individuals, nesting at Petrels Island, Terre Adelie.
A Bayesian Mixture Model Relating Dose To Critical Organs And Functional Complication In 3d Conformal Radiation Therapy, Tim Johnson, Jeremy Taylor, Randall K. Ten Haken, Avraham Eisbruch
The University of Michigan Department of Biostatistics Working Paper Series
A goal of radiation therapy is to deliver maximum dose to the target tumor while minimizing complications due to irradiation of critical organs. Technological advances in 3D conformal radiation therapy have allowed great strides in realizing this goal; however, complications may still arise. Critical organs may be adjacent to tumors or in the path of the radiation beam. Several mathematical models have been proposed that describe a relationship between dose and observed functional complication; however, only a few published studies have successfully fit these models to data using modern statistical methods which make efficient use of the data. One complication …
A Bayesian Method For Finding Interactions In Genomic Studies, Wei Chen, Debashis Ghosh, Trivellore E. Raghunathan, Sharon Kardia
The University of Michigan Department of Biostatistics Working Paper Series
An important step in building a multiple regression model is the selection of predictors. In genomic and epidemiologic studies, datasets with a small sample size and a large number of predictors are common. In such settings, most standard methods for identifying a good subset of predictors are unstable. Furthermore, there is an increasing emphasis towards identification of interactions, which has not been studied much in the statistical literature. We propose a method, called BSI (Bayesian Selection of Interactions), for selecting predictors in a regression setting when the number of predictors is considerably larger than the sample size with a focus …
Spatially Adaptive Bayesian P-Splines With Heteroscedastic Errors, Ciprian M. Crainiceanu, David Ruppert, Raymond J. Carroll
Johns Hopkins University, Dept. of Biostatistics Working Papers
Penalized splines (P-splines) are an increasingly popular tool for nonparametric smoothing; they use low-rank spline bases to make computations tractable while maintaining accuracy as good as smoothing splines. This paper extends penalized spline methodology by both modeling the variance function nonparametrically and using a spatially adaptive smoothing parameter. These extensions have been studied before, but never together and never in the multivariate case. This combination is needed for satisfactory inference and can be implemented effectively by Bayesian MCMC. The variance process controlling the spatially adaptive shrinkage of the mean and the variance of the heteroscedastic error process are modeled as log-penalized …
Gllamm Manual, Sophia Rabe-Hesketh, Anders Skrondal, Andrew Pickles
U.C. Berkeley Division of Biostatistics Working Paper Series
This manual describes a Stata program gllamm that can estimate Generalized Linear Latent and Mixed Models (GLLAMMs). GLLAMMs are a class of multilevel latent variable models for (multivariate) responses of mixed type including continuous responses, counts, duration/survival data, dichotomous, ordered and unordered categorical responses and rankings. The latent variables (common factors or random effects) can be assumed to be discrete or to have a multivariate normal distribution. Examples of models in this class are multilevel generalized linear models or generalized linear mixed models, multilevel factor or latent trait models, item response models, latent class models and multilevel structural equation models. …
Data Adaptive Estimation Of The Treatment Specific Mean, Yue Wang, Oliver Bembom, Mark J. Van Der Laan
U.C. Berkeley Division of Biostatistics Working Paper Series
An important problem in epidemiology and medical research is the estimation of the causal effect of a treatment action at a single point in time on the mean of an outcome, possibly within strata of the target population defined by a subset of the baseline covariates. Current approaches to this problem are based on marginal structural models, i.e., parametric models for the marginal distribution of counterfactual outcomes as a function of treatment and effect modifiers. The various estimators developed in this context furthermore each depend on a high-dimensional nuisance parameter whose estimation currently also relies on parametric models. Since misspecification …
Finding Cancer Subtypes In Microarray Data Using Random Projections, Debashis Ghosh
The University of Michigan Department of Biostatistics Working Paper Series
One of the benefits of profiling cancer samples using microarrays is the generation of molecular fingerprints that will define subtypes of disease. Such subgroups have typically been found in microarray data using hierarchical clustering. A major problem in interpretation of the output is determining the number of clusters. We approach the problem of determining disease subtypes using mixture models. A novel estimation procedure for the parameters in the mixture model is developed based on a combination of random projections and the expectation-maximization algorithm. Because the approach is probabilistic, it provides a measure for the number of true clusters …