Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Statistics and Probability

Prediction

Articles 31 - 60 of 67

Full-Text Articles in Physical Sciences and Mathematics

Well I'll Be Damned - Insights Into Predictive Value Of Pedigree Information In Horse Racing, Timothy Baker, Ming-Chien Sung, Johnnie Johnson, Tiejun Ma Jun 2016

International Conference on Gambling & Risk Taking

Fundamental form characteristics, such as how fast a horse ran at its last start, are widely used to help predict the outcome of horse races. The exception is races in which the horses have not previously competed, such as Maiden races, where little or no past performance information is publicly available. In these events bettors need consider only a simplified suite of factors, but this is offset by a higher level of uncertainty. This paper examines the information content embedded in a horse’s ancestry and the extent to which this information is discounted in the United Kingdom bookmaker …


A Scalable Supervised Subsemble Prediction Algorithm, Stephanie Sapp, Mark J. Van Der Laan Apr 2014

U.C. Berkeley Division of Biostatistics Working Paper Series

Subsemble is a flexible ensemble method that partitions a full data set into subsets of observations, fits the same algorithm on each subset, and uses a tailored form of V-fold cross-validation to construct a prediction function that combines the subset-specific fits with a second metalearner algorithm. Previous work studied the performance of Subsemble with subsets created randomly, and showed that these types of Subsembles often result in better prediction performance than the underlying algorithm fit just once on the full dataset. Since the final Subsemble estimator varies depending on the data used to create the subset-specific fits, different strategies for …
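The partition-fit-combine recipe described in this abstract can be sketched as follows. All function and variable names here are illustrative, and a least-squares fit stands in for both the generic underlying algorithm and the metalearner; this is a reading of the abstract, not the authors' reference implementation.

```python
import numpy as np

def fit_ls(X, y):
    """Least-squares base learner; returns a prediction function."""
    beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda Xnew: np.c_[np.ones(len(Xnew)), Xnew] @ beta

def subsemble(X, y, n_subsets=3, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    subset = rng.integers(0, n_subsets, size=n)   # random partition into subsets
    fold = rng.integers(0, n_folds, size=n)       # V-fold assignment
    # Cross-validated predictions: refit each subset-specific learner without
    # fold v, then predict the held-out fold v.
    Z = np.empty((n, n_subsets))
    for v in range(n_folds):
        tr, va = fold != v, fold == v
        for s in range(n_subsets):
            idx = tr & (subset == s)
            Z[va, s] = fit_ls(X[idx], y[idx])(X[va])
    meta = fit_ls(Z, y)                           # metalearner on CV predictions
    # Final subset-specific fits use all of each subset's data.
    fits = [fit_ls(X[subset == s], y[subset == s]) for s in range(n_subsets)]
    return lambda Xnew: meta(np.column_stack([f(Xnew) for f in fits]))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=300)
predict = subsemble(X, y)
```

On this toy linear problem the combined predictor attains a small in-sample error; the point of the sketch is only the mechanics of subset-specific fits plus a cross-validated combination.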


Risk Factors For Physical Violence Against Partners In The U.S., K. Daniel O'Leary, Nathan L. Tintle, Evelyn Bromet Jan 2014

Faculty Work Comprehensive List

Objective: To examine unique and relative predictive values of demographic, social learning, developmental, psychopathology, and dyadic variables as risk factors for perpetration of intimate partner physical aggression in a national sample of married or cohabiting individuals. Method: Men (n = 798) and women (n = 770) were selected from the public use data file of the 2003 National Comorbidity Survey Replication (NCS-R) which used a multistage cluster sampling design. Results: Eight percent of women and 5% of men reported perpetrating physical aggression in the past year. Based on multivariable regression analyses, among men, the unique risk factors for perpetrating physical …


Parametric And Nonparametric Statistical Methods For Genomic Selection Of Traits With Additive And Epistatic Genetic Architectures, Reka Howard, Alicia L. Carriquiry, William D. Beavis Jan 2014

Department of Statistics: Faculty Publications

Parametric and nonparametric methods have been developed for purposes of predicting phenotypes. These methods are based on retrospective analyses of empirical data consisting of genotypic and phenotypic scores. Recent reports have indicated that parametric methods are unable to predict phenotypes of traits with known epistatic genetic architectures. Herein, we review parametric methods including least squares regression, ridge regression, Bayesian ridge regression, least absolute shrinkage and selection operator (LASSO), Bayesian LASSO, best linear unbiased prediction (BLUP), Bayes A, Bayes B, Bayes C, and Bayes Cπ. We also review nonparametric methods including the Nadaraya-Watson estimator, reproducing kernel Hilbert space, support vector machine regression, …
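One of the parametric methods listed in this abstract, ridge regression, has a closed form that is easy to sketch: the estimate solves (X'X + λI)β = X'y. The data below are simulated for illustration; the penalty is applied to all coefficients, which is reasonable here because no intercept is modeled.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)
b_ols = ridge(X, y, 0.0)       # lam = 0 recovers ordinary least squares
b_shrunk = ridge(X, y, 50.0)   # a heavier penalty shrinks the coefficients
```

The norm of the ridge solution is non-increasing in λ, which is the shrinkage behavior the genomic-selection literature exploits when the number of markers is large relative to the number of phenotyped individuals.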


Regularization Methods For Predicting An Ordinal Response Using Longitudinal High-Dimensional Genomic Data, Jiayi Hou Nov 2013

Theses and Dissertations

Ordinal scales are commonly used to measure health status and disease-related outcomes in hospital settings as well as in translational medical research. Notable examples include cancer staging, a five-category ordinal scale indicating tumor size, node involvement, and likelihood of metastasizing, and the Glasgow Coma Scale (GCS), an ordinal-scaled measure that gives a reliable and objective assessment of a patient's conscious state. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical ordinal modeling methods based on the likelihood approach have contributed to the analysis of data in …


Subsemble: An Ensemble Method For Combining Subset-Specific Algorithm Fits, Stephanie Sapp, Mark J. Van Der Laan, John Canny May 2013

U.C. Berkeley Division of Biostatistics Working Paper Series

Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive datasets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large datasets. Subsemble partitions the full dataset into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be …


Using Methods From The Data-Mining And Machine-Learning Literature For Disease Classification And Prediction: A Case Study Examining Classification Of Heart Failure Subtypes, Peter C. Austin Jan 2013

Peter Austin

OBJECTIVE: Physicians classify patients into those with or without a specific disease. Furthermore, there is often interest in classifying patients according to disease etiology or subtype. Classification trees are frequently used to classify patients according to the presence or absence of a disease. However, classification trees can suffer from limited accuracy. In the data-mining and machine-learning literature, alternate classification schemes have been developed. These include bootstrap aggregation (bagging), boosting, random forests, and support vector machines.

STUDY DESIGN AND SETTING: We compared the performance of these classification methods with that of conventional classification trees to classify patients with heart failure (HF) …


Prediction In M-Complete Problems With Limited Sample Size, Jennifer Lynn Clarke, Bertrand Clarke, Chi-Wai Yu Jan 2013

Department of Statistics: Faculty Publications

We define a new Bayesian predictor called the posterior weighted median (PWM) and compare its performance to several other predictors including the Bayes model average under squared error loss, the Barbieri-Berger median model predictor, the stacking predictor, and the model average predictor based on Akaike's information criterion. We argue that PWM generally gives better performance than other predictors over a range of M-complete problems. This range is between the M-closed–M-complete boundary and the M-complete–M-open boundary. Indeed, as a problem gets closer to M-open, it seems that M-complete predictive methods begin to break down. Our comparisons rest on extensive simulations …


Prediction In Several Conventional Contexts, Bertrand Clarke, Jennifer Clarke Jan 2012

Department of Statistics: Faculty Publications

We review predictive techniques from several traditional branches of statistics. Starting with prediction based on the normal model and on the empirical distribution function, we proceed to techniques for various forms of regression and classification. Then, we turn to time series, longitudinal data, and survival analysis. Our focus throughout is on the mechanics of prediction more than on the properties of predictors.


A Comparison Of Spatial Prediction Techniques Using Both Hard And Soft Data, Megan L. Liedtke Tesar May 2011

Department of Statistics: Dissertations, Theses, and Student Work

The overall goal of this research, which is common to most spatial studies, is to predict a value of interest at an unsampled location based on measured values at nearby sampled locations. To accomplish this goal, ordinary kriging can be used to obtain the best linear unbiased predictor. However, there is often a large amount of variability surrounding the measurements of environmental variables, and traditional prediction methods, such as ordinary kriging, do not account for an attribute with more than one level of uncertainty. This dissertation addresses this limitation by introducing a new methodology called weighted kriging. This prediction technique …
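Ordinary kriging, the baseline predictor named in this abstract, can be sketched compactly: the weights solve a linear system built from a variogram model, with a Lagrange row enforcing that the weights sum to one (unbiasedness). The exponential variogram and its parameters below are illustrative, not fitted to data.

```python
import numpy as np

def variogram(h, sill=1.0, scale=2.0):
    """Exponential variogram model (illustrative parameters)."""
    return sill * (1.0 - np.exp(-h / scale))

def ordinary_krige(coords, values, target):
    """Ordinary kriging predictor at `target` from sampled `coords`/`values`."""
    coords = np.asarray(coords, float)
    values = np.asarray(values, float)
    n = len(values)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    # Bordered kriging system: [Gamma 1; 1' 0] [w; mu] = [gamma0; 1]
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(coords - np.asarray(target, float), axis=1))
    w = np.linalg.solve(A, b)[:n]          # kriging weights (sum to 1)
    return float(w @ values)

coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [1.0, 2.0, 3.0, 2.5]
z_hat = ordinary_krige(coords, values, (0.5, 0.5))
```

A useful sanity check on any kriging implementation is exact interpolation: predicting at a sampled location returns the observed value there.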


Bayesian Logistic Regression Model For Siting Biomass-Using Facilities, Xia Huang Dec 2010

Masters Theses

Key sources of oil for western markets are located in complex geopolitical environments that increase economic and social risk. The amalgamation of economic, environmental, social, and national security concerns for petroleum-based economies has created a renewed emphasis on alternative sources of energy, including biomass. The stability of sustainable biomass markets hinges on improved methods to predict and visualize business risk and cost to the supply chain.

This thesis develops Bayesian logistic regression models, with comparisons of classical maximum likelihood models, to quantify significant factors that influence the siting of biomass-using facilities and predict potential locations in the 13-state Southeastern …


Improving Accuracy Of Large-Scale Prediction Of Forest Disease Incidence Through Bayesian Data Reconciliation, Ephraim M. Hanks Jan 2010

All Graduate Plan B and other Reports, Spring 1920 to Spring 2023

Increasing the accuracy of predictions made from ecological data typically involves replacing or replicating the data, but the cost of updating large-scale data sets can be prohibitive. Focusing resources on a small sample of locations from a large, less accurate data set can result in more reliable observations, though on a smaller scale. We present an approach for increasing the accuracy of predictions made from a large-scale ecological data set through reconciliation with a small, highly accurate data set within a Bayesian hierarchical modeling framework. This approach is illustrated through a study of incidence of eastern spruce dwarf mistletoe …


Predicting Hearing Threshold In Nonresponsive Subjects Using A Log-Normal Bayesian Linear Model In The Presence Of Left-Censored Covariates, Byron J. Gajewski, Nannette Nicholson, Judith E. Widen Jan 2009

Byron J Gajewski

We provide a nontrivial example illustrating analysis of a Bayesian clinical trial. Many of the issues discussed in the article are emphasized in a recent Food and Drug Administration (FDA) guidance on use of Bayesian statistics in medical device clinical trials. Here we present a fully Bayesian data analysis for predicting hearing thresholds in subjects who cannot respond to usual hearing tests. The article begins with simple concepts such as simple linear regression and proceeds into more complex issues such as censoring in the dependent and independent variables. Throughout, we emphasize the substantive interpretation of the analysis. Of particular interest …


Estimating Time-To-Event From Longitudinal Categorical Data Using Random Effects Markov Models: Application To Multiple Sclerosis Progression, Micha Mandel, Rebecca A. Betensky Jun 2007

Harvard University Biostatistics Working Paper Series

No abstract provided.


Evaluating The Roc Performance Of Markers For Future Events, Margaret Pepe, Yingye Zheng, Yuying Jin May 2007

UW Biostatistics Working Paper Series

Receiver operating characteristic (ROC) curves play a central role in the evaluation of biomarkers and tests for disease diagnosis. Predictors for event time outcomes can also be evaluated with ROC curves, but the time lag between marker measurement and event time must be acknowledged. We discuss different definitions of time-dependent ROC curves in the context of real applications. Several approaches have been proposed for estimation. We contrast retrospective versus prospective methods with regard to assumptions and flexibility, including their capacity to incorporate censored data, competing risks, and different sampling schemes. Applications to two datasets are presented.


Comment: Boosting Algorithms: Regularization, Prediction And Model Fitting, A. Buja, David Mease, A. Wyner Jan 2007

David Mease

The authors are doing the readers of Statistical Science a true service with a well-written and up-to-date overview of boosting that originated with the seminal algorithms of Freund and Schapire. Equally, we are grateful for high-level software that will permit a larger readership to experiment with, or simply apply, boosting-inspired model fitting. The authors show us a world of methodology that illustrates how a fundamental innovation can penetrate every nook and cranny of statistical thinking and practice. They introduce the reader to one particular interpretation of boosting and then give a display of its potential with extensions from classification (where …


Comment: Boosting Algorithms: Regularization, Prediction And Model Fitting, A. Buja, David Mease, A. Wyner Jan 2007

Faculty Publications

The authors are doing the readers of Statistical Science a true service with a well-written and up-to-date overview of boosting that originated with the seminal algorithms of Freund and Schapire. Equally, we are grateful for high-level software that will permit a larger readership to experiment with, or simply apply, boosting-inspired model fitting. The authors show us a world of methodology that illustrates how a fundamental innovation can penetrate every nook and cranny of statistical thinking and practice. They introduce the reader to one particular interpretation of boosting and then give a display of its potential with extensions from classification (where …


Survival Point Estimate Prediction In Matched And Non-Matched Case-Control Subsample Designed Studies, Annette M. Molinaro, Mark J. Van Der Laan, Dan H. Moore, Karla Kerlikowske Aug 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

Providing information about the risk of disease and clinical factors that may increase or decrease a patient's risk of disease is standard medical practice. Although case-control studies can provide evidence of strong associations between diseases and risk factors, clinicians need to be able to communicate to patients the age-specific risks of disease over a defined time interval for a set of risk factors.

An estimate of absolute risk cannot be determined from case-control studies because cases are generally chosen from a population whose size is not known (necessary for calculation of absolute risk) and where duration of follow-up is not …


An Exploration Of Using Data Mining In Educational Research, Yonghong Jade Xu May 2005

Journal of Modern Applied Statistical Methods

Advances in technology have popularized large databases in education. Traditional statistical methods are limited in their ability to analyze large quantities of data. This article discusses data mining by analyzing a data set with three models: multiple regression, data mining, and a combination of the two. It is concluded that data mining is applicable in educational research.


Survival Ensembles, Torsten Hothorn, Peter Buhlmann, Sandrine Dudoit, Annette M. Molinaro, Mark J. Van Der Laan Apr 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

We propose a unified and flexible framework for ensemble learning in the presence of censoring. For right-censored data, we introduce a random forest algorithm and a generic gradient boosting algorithm for the construction of prognostic models. The methodology is utilized for predicting the survival time of patients suffering from acute myeloid leukemia based on clinical and genetic covariates. Furthermore, we compare the diagnostic capabilities of the proposed censored data random forest and boosting methods applied to the recurrence free survival time of node positive breast cancer patients with previously published findings.


Standardizing Markers To Evaluate And Compare Their Performances, Margaret S. Pepe, Gary M. Longton Jan 2005

UW Biostatistics Working Paper Series

Introduction: Markers that purport to distinguish subjects with a condition from those without a condition must be evaluated rigorously for their classification accuracy. A single approach to statistically evaluating and comparing markers is not yet established.

Methods: We suggest a standardization that uses the marker distribution in unaffected subjects as a reference. For an affected subject with marker value Y, the standardized placement value is the proportion of unaffected subjects with marker values that exceed Y.

Results: We apply the standardization to two illustrative datasets. In patients with pancreatic cancer placement values calculated for the CA 19-9 marker are smaller …
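The standardization in this abstract is concrete enough to compute directly: for each affected subject's marker value Y, the placement value is the proportion of unaffected (reference) subjects whose marker exceeds Y. The data below are illustrative, not from the paper's pancreatic cancer example.

```python
def placement_values(affected, unaffected):
    """Placement value of each affected marker value Y: the fraction of
    unaffected subjects whose marker value exceeds Y."""
    n = len(unaffected)
    return [sum(u > y for u in unaffected) / n for y in affected]

unaffected = [1.0, 1.5, 2.0, 2.5, 3.0]   # reference (unaffected) distribution
affected = [2.6, 0.9]                     # marker values in affected subjects
print(placement_values(affected, unaffected))  # → [0.2, 1.0]
```

Small placement values indicate an affected subject whose marker is extreme relative to the unaffected reference distribution, which is why, in the abstract's pancreatic cancer example, a well-performing marker yields small placement values in cases.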


Deletion/Substitution/Addition Algorithm For Partitioning The Covariate Space In Prediction, Annette Molinaro, Mark J. Van Der Laan Nov 2004

U.C. Berkeley Division of Biostatistics Working Paper Series

We propose a new method for predicting censored (and non-censored) clinical outcomes from a highly-complex covariate space. Previously we suggested a unified strategy for predictor construction, selection, and performance assessment. Here we introduce a new algorithm which generates a piecewise constant estimation sieve of candidate predictors based on an intensive and comprehensive search over the entire covariate space. This algorithm allows us to elucidate interactions and correlation patterns in addition to main effects.


Multiple Testing And Data Adaptive Regression: An Application To Hiv-1 Sequence Data, Merrill D. Birkner, Sandra E. Sinisi, Mark J. Van Der Laan Oct 2004

U.C. Berkeley Division of Biostatistics Working Paper Series

Analysis of viral strand sequence data and viral replication capacity could potentially lead to biological insights regarding the replication ability of HIV-1. Determining specific target codons on the viral strand will facilitate the manufacturing of target specific antiretrovirals. Various algorithmic and analysis techniques can be applied to this application. We propose using multiple testing to find codons which have significant univariate associations with replication capacity of the virus. We also propose using a data adaptive multiple regression algorithm to obtain multiple predictions of viral replication capacity based on an entire mutant/non-mutant sequence profile. The data set to which these techniques …


Jmasm10: A Fortran Routine For Sieve Bootstrap Prediction Intervals, Andrés M. Alonso May 2004

Journal of Modern Applied Statistical Methods

A Fortran routine for constructing nonparametric prediction intervals for a general class of linear processes is described. The approach uses the sieve bootstrap procedure of Bühlmann (1997) based on residual resampling from an autoregressive approximation to the given process.
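The procedure described here translates readily out of Fortran: fit an AR(p) approximation to the series, resample centered residuals, and take quantiles of simulated h-step-ahead values. This is a minimal Python sketch of the sieve bootstrap idea; Bühlmann's procedure additionally lets the autoregressive order p grow with the sample size, which is omitted here.

```python
import numpy as np

def sieve_bootstrap_pi(x, p=2, h=1, B=500, alpha=0.1, seed=0):
    """Sieve-bootstrap (1 - alpha) prediction interval for x_{n+h}."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)
    n = len(x)
    # AR(p) approximation fit by least squares on lagged values
    lags = np.column_stack([x[p - j - 1:n - j - 1] for j in range(p)])
    A = np.c_[np.ones(n - p), lags]
    coef, *_ = np.linalg.lstsq(A, x[p:], rcond=None)
    resid = x[p:] - A @ coef
    resid = resid - resid.mean()           # centered residuals for resampling
    sims = np.empty(B)
    for b in range(B):
        hist = list(x[-p:])                # last p values, oldest first
        for _ in range(h):
            eps = rng.choice(resid)        # residual resampling
            nxt = coef[0] + sum(c * v for c, v in zip(coef[1:], hist[::-1])) + eps
            hist = hist[1:] + [nxt]
        sims[b] = nxt
    return np.quantile(sims, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(1)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.5 * x[t - 1] + rng.normal()   # toy AR(1) series
lo, hi = sieve_bootstrap_pi(x)
```

For this unit-variance AR(1) toy series the one-step 90% interval should have width on the order of 2 × 1.645 times the innovation scale.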


Loss-Based Cross-Validated Deletion/Substitution/Addition Algorithms In Estimation, Sandra E. Sinisi, Mark J. Van Der Laan Mar 2004

U.C. Berkeley Division of Biostatistics Working Paper Series

In van der Laan and Dudoit (2003) we propose and theoretically study a unified loss function based statistical methodology, which provides a road map for estimation and performance assessment. Given a parameter of interest which can be described as the minimizer of the population mean of a loss function, the road map involves as important ingredients cross-validation for estimator selection and minimizing over subsets of basis functions the empirical risk of the subset-specific estimator of the parameter of interest, where the basis functions correspond to a parameterization of a specified subspace of the complete parameter space. In this article we …


The Cross-Validated Adaptive Epsilon-Net Estimator, Mark J. Van Der Laan, Sandrine Dudoit, Aad W. Van Der Vaart Feb 2004

U.C. Berkeley Division of Biostatistics Working Paper Series

Suppose that we observe a sample of independent and identically distributed realizations of a random variable. Assume that the parameter of interest can be defined as the minimizer, over a suitably defined parameter space, of the expectation (with respect to the distribution of the random variable) of a particular (loss) function of a candidate parameter value and the random variable. Examples of commonly used loss functions are the squared error loss function in regression and the negative log-density loss function in density estimation. Minimizing the empirical risk (i.e., the empirical mean of the loss function) over the entire parameter space …


Survival Model Predictive Accuracy And Roc Curves, Patrick Heagerty, Yingye Zheng Dec 2003

UW Biostatistics Working Paper Series

The predictive accuracy of a survival model can be summarized using extensions of the proportion of variation explained by the model, or R^2, commonly used for continuous response models, or using extensions of sensitivity and specificity which are commonly used for binary response models.

In this manuscript we propose new time-dependent accuracy summaries based on time-specific versions of sensitivity and specificity calculated over risk sets. We connect the accuracy summaries to a previously proposed global concordance measure which is a variant of Kendall's tau. In addition, we show how standard Cox regression output can be used to obtain estimates of …
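The time-specific sensitivity and specificity discussed in this abstract can be illustrated with the cumulative/dynamic definition on uncensored data: at a fixed time t, cases are subjects with an event by t and controls are subjects still event-free at t. The manuscript's risk-set versions, censoring handling, and Cox-based estimation are omitted in this sketch; names and data are illustrative.

```python
def time_dependent_roc(marker, event_time, t, thresholds):
    """ROC points (1 - specificity, sensitivity) at time t, cumulative/dynamic
    definition, assuming no censoring."""
    cases = [m for m, s in zip(marker, event_time) if s <= t]     # event by t
    controls = [m for m, s in zip(marker, event_time) if s > t]   # event-free at t
    points = []
    for c in thresholds:
        sens = sum(m > c for m in cases) / len(cases)
        spec = sum(m <= c for m in controls) / len(controls)
        points.append((1 - spec, sens))
    return points

marker = [3.0, 2.5, 2.0, 1.0, 0.5]
event_time = [1.0, 2.0, 3.0, 10.0, 12.0]
print(time_dependent_roc(marker, event_time, t=5.0, thresholds=[1.5]))
# → [(0.0, 1.0)]
```

Here the marker perfectly separates early events from survivors at t = 5, so the single threshold yields the ideal ROC point (0, 1); sweeping thresholds traces out the full time-dependent ROC curve.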


Loss-Based Estimation With Cross-Validation: Applications To Microarray Data Analysis And Motif Finding, Sandrine Dudoit, Mark J. Van Der Laan, Sunduz Keles, Annette M. Molinaro, Sandra E. Sinisi, Siew Leng Teng Dec 2003

U.C. Berkeley Division of Biostatistics Working Paper Series

Current statistical inference problems in genomic data analysis involve parameter estimation for high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables. Addressing these inference questions satisfactorily requires: (i) an intensive and thorough search of the parameter space to generate good candidate estimators, (ii) an approach for selecting an optimal estimator among these candidates, and (iii) a method for reliably assessing the performance of the resulting estimator. We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable …


Unified Cross-Validation Methodology For Selection Among Estimators And A General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities And Examples, Mark J. Van Der Laan, Sandrine Dudoit Nov 2003

U.C. Berkeley Division of Biostatistics Working Paper Series

In Part I of this article we propose a general cross-validation criterion for selecting among a collection of estimators of a particular parameter of interest based on n i.i.d. observations. It is assumed that the parameter of interest minimizes the expectation (w.r.t. the distribution of the observed data structure) of a particular loss function of a candidate parameter value and the observed data structure, possibly indexed by a nuisance parameter. The proposed cross-validation criterion is defined as the empirical mean over the validation sample of the loss function at the parameter estimate based on the training sample, averaged over …
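The cross-validation criterion described in this abstract reduces, in the simplest setting, to a V-fold selector: for each candidate estimator, average the loss over validation folds at the estimate fit on the corresponding training folds, and pick the minimizer. The toy example below uses squared-error loss and two location estimators; everything here is illustrative, not the paper's general framework.

```python
import statistics

def cv_risk(data, estimator, V=5):
    """V-fold cross-validated risk of `estimator` under squared-error loss."""
    folds = [data[v::V] for v in range(V)]
    total, count = 0.0, 0
    for v in range(V):
        train = [x for w in range(V) if w != v for x in folds[w]]
        est = estimator(train)                      # fit on training folds
        total += sum((x - est) ** 2 for x in folds[v])  # loss on validation fold
        count += len(folds[v])
    return total / count

candidates = {"mean": statistics.mean, "median": statistics.median}
data = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 25.0, 1.05]  # one gross outlier
risks = {name: cv_risk(data, f) for name, f in candidates.items()}
print(min(risks, key=risks.get))  # → median
```

With a gross outlier in the sample, the cross-validated risk of the mean is inflated whenever the outlier sits in a training fold, so the selector prefers the median; this is the oracle-like behavior the paper's finite-sample inequalities formalize.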


Estimating Predictors For Long- Or Short-Term Survivors, Lu Tian, Wei Wang, L. J. Wei Nov 2003

Harvard University Biostatistics Working Paper Series

No abstract provided.