Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

2005

Statistical Theory

COBRA

Articles 1 - 30 of 41

Full-Text Articles in Physical Sciences and Mathematics

Alleviating Linear Ecological Bias And Optimal Design With Subsample Data, Adam Glynn, Jon Wakefield, Mark Handcock, Thomas Richardson Dec 2005

UW Biostatistics Working Paper Series

In this paper, we illustrate that combining ecological data with subsample data in situations in which a linear model is appropriate provides three main benefits. First, by including the individual level subsample data, the biases associated with linear ecological inference can be eliminated. Second, by supplementing the subsample data with ecological data, the information about parameters will be increased. Third, we can use readily available ecological data to design optimal subsampling schemes, so as to further increase the information about parameters. We present an application of this methodology to the classic problem of estimating the effect of a college degree …


Empirical Likelihood Inference For The Area Under The Roc Curve, Gengsheng Qin, Xiao-Hua Zhou Dec 2005

UW Biostatistics Working Paper Series

For a continuous-scale diagnostic test, the most commonly used summary index of the receiver operating characteristic (ROC) curve is the area under the curve (AUC) that measures the accuracy of the diagnostic test. In this paper we propose an empirical likelihood approach for the inference of AUC. We first define an empirical likelihood ratio for AUC and show that its limiting distribution is a scaled chi-square distribution. We then obtain an empirical likelihood based confidence interval for AUC using the scaled chi-square distribution. This empirical likelihood inference for AUC can be extended to stratified samples and the resulting limiting distribution …
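
As background for this abstract (and not the authors' empirical likelihood method itself), the AUC equals the probability that a randomly chosen diseased subject scores higher than a randomly chosen non-diseased one, and its usual nonparametric estimate is the Mann-Whitney statistic. A minimal sketch, with hypothetical score arrays `diseased` and `healthy`:

```python
import numpy as np

def auc_mann_whitney(diseased, healthy):
    """Nonparametric AUC estimate: proportion of (diseased, healthy) pairs
    with a higher score in the diseased group (ties count one half)."""
    d = np.asarray(diseased, dtype=float)[:, None]
    h = np.asarray(healthy, dtype=float)[None, :]
    return (d > h).mean() + 0.5 * (d == h).mean()

# AUC of 0.5 means an uninformative test, 1.0 means perfect separation.
print(auc_mann_whitney([2.1, 3.5, 4.0], [1.0, 2.0, 2.5]))
```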


Interval Estimation For The Ratio And Difference Of Two Lognormal Means, Yea-Hung Chen, Xiao-Hua Zhou Dec 2005

UW Biostatistics Working Paper Series

Health research often gives rise to data that follow lognormal distributions. In two sample situations, researchers are likely to be interested in estimating the difference or ratio of the population means. Several methods have been proposed for providing confidence intervals for these parameters. However, it is not clear which techniques are most appropriate, or how their performance might vary. Additionally, methods for the difference of means have not been adequately explored. We discuss in the present article five methods of analysis. These include two methods based on the log-likelihood ratio statistic and a generalized pivotal approach. Additionally, we provide and …
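
For reference, the targets of these interval methods follow from standard lognormal facts: if log X ~ N(mu, sigma^2), then

\[ E[X] = e^{\mu + \sigma^2/2}, \qquad \frac{E[X_1]}{E[X_2]} = e^{(\mu_1-\mu_2) + (\sigma_1^2-\sigma_2^2)/2}, \qquad E[X_1]-E[X_2] = e^{\mu_1+\sigma_1^2/2} - e^{\mu_2+\sigma_2^2/2}, \]

so both the ratio and the difference of population means depend on the log-scale means and the log-scale variances.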


Inferences In Censored Cost Regression Models With Empirical Likelihood, Xiao-Hua Zhou, Gengsheng Qin, Huazhen Lin, Gang Li Dec 2005

UW Biostatistics Working Paper Series

In many studies of health economics, we are interested in the expected total cost over a certain period for a patient with given characteristics. Problems can arise if cost estimation models do not account for distributional aspects of costs. Two such problems are 1) the skewed nature of the data and 2) censored observations. In this paper we propose an empirical likelihood (EL) method for constructing a confidence region for the vector of regression parameters and a confidence interval for the expected total cost of a patient with the given covariates. We show that this new method has good theoretical …


Confidence Intervals For Predictive Values Using Data From A Case Control Study, Nathaniel David Mercaldo, Xiao-Hua Zhou, Kit F. Lau Dec 2005

UW Biostatistics Working Paper Series

The accuracy of a binary-scale diagnostic test can be represented by sensitivity (Se), specificity (Sp) and positive and negative predictive values (PPV and NPV). Although Se and Sp measure the intrinsic accuracy of a diagnostic test that does not depend on the prevalence rate, they do not provide information on the diagnostic accuracy of a particular patient. To obtain this information we need to use PPV and NPV. Since PPV and NPV are functions of both the intrinsic accuracy and the prevalence of the disease, constructing confidence intervals for PPV and NPV for a particular patient in a population with …
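
For context, the dependence of the predictive values on prevalence follows from Bayes' theorem: with prevalence p,

\[ \mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1-Sp)(1-p)}, \qquad \mathrm{NPV} = \frac{Sp\,(1-p)}{(1-Se)\,p + Sp\,(1-p)}. \]

In a case-control design Se and Sp can be estimated directly, but the prevalence p must be supplied from outside the sample, which is one reason interval construction for PPV and NPV from case-control data is nontrivial.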


Issues Of Processing And Multiple Testing Of Seldi-Tof Ms Proteomic Data, Merrill D. Birkner, Alan E. Hubbard, Mark J. Van Der Laan, Christine F. Skibola, Christine M. Hegedus, Martyn T. Smith Dec 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

A new data filtering method for SELDI-TOF MS proteomic spectra data is described. We examined technical repeats (2 per subject) of intensity versus m/z (mass/charge) of bone marrow cell lysate for two groups of childhood leukemia patients: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). As others have noted, the type of data processing as well as experimental variability can have a disproportionate impact on the list of "interesting" proteins (see Baggerly et al. (2004)). We propose a sequence of processing and multiple testing techniques: 1) correction for background drift; 2) filtering using smooth regression and cross-validated bandwidth …


Quantile-Function Based Null Distribution In Resampling Based Multiple Testing, Mark J. Van Der Laan, Alan E. Hubbard Nov 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

Simultaneously testing a collection of null hypotheses about a data generating distribution based on a sample of independent and identically distributed observations is a fundamental and important statistical problem involving many applications. Methods based on marginal null distributions (i.e., marginal p-values) are attractive since the marginal p-values can be based on a user supplied choice of marginal null distributions and they are computationally trivial, but they, by necessity, are known to either be conservative or to rely on assumptions about the dependence structure between the test-statistics. Resampling based multiple testing (Westfall and Young, 1993) involves sampling from a joint null …
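
A generic sketch of the single-step maxT adjustment in the Westfall and Young resampling framework cited above (not the quantile-function-based null distribution the paper proposes); `null_stats` is a hypothetical matrix of test statistics resampled under the joint null:

```python
import numpy as np

def single_step_maxT(stats, null_stats):
    """Single-step maxT adjusted p-values: compare each observed statistic
    with the resampled distribution of the maximum statistic over all hypotheses.
    stats: observed test statistics, shape (m,).
    null_stats: statistics resampled under the joint null, shape (B, m)."""
    max_null = np.abs(null_stats).max(axis=1)        # per-resample maximum over hypotheses
    return np.array([(max_null >= abs(t)).mean() for t in stats])
```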


Optimal Feature Selection For Nearest Centroid Classifiers, With Applications To Gene Expression Microarrays, Alan R. Dabney, John D. Storey Nov 2005

UW Biostatistics Working Paper Series

Nearest centroid classifiers have recently been successfully employed in high-dimensional applications. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is typically carried out by computing univariate statistics for each feature individually, without consideration for how a subset of features performs as a whole. For subsets of a given size, we characterize the optimal choice of features, corresponding to those yielding the smallest misclassification rate. Furthermore, we propose an algorithm for estimating this optimal subset in practice. Finally, we investigate the applicability of shrinkage ideas to nearest centroid classifiers. We use gene-expression microarrays for …
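
A minimal sketch of the baseline strategy the abstract describes, univariate screening followed by a nearest centroid classifier for two classes coded 0/1 (this illustrates the standard practice, not the authors' optimal-subset algorithm; the names are illustrative):

```python
import numpy as np

def nearest_centroid_fit(X, y, n_features):
    """Rank features by a univariate two-sample statistic, keep the top ones,
    and store the per-class centroids of the retained features."""
    X0, X1 = X[y == 0], X[y == 1]
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0) + X1.var(axis=0, ddof=1) / len(X1))
    stat = np.abs(X1.mean(axis=0) - X0.mean(axis=0)) / (se + 1e-12)
    keep = np.argsort(stat)[::-1][:n_features]       # each feature scored individually
    centroids = np.vstack([X0[:, keep].mean(axis=0), X1[:, keep].mean(axis=0)])
    return keep, centroids

def nearest_centroid_predict(X, keep, centroids):
    """Assign each sample to the class whose centroid is closest in Euclidean distance."""
    dist = ((X[:, keep, None] - centroids.T[None, :, :]) ** 2).sum(axis=1)
    return dist.argmin(axis=1)
```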


A New Approach To Intensity-Dependent Normalization Of Two-Channel Microarrays, Alan R. Dabney, John D. Storey Nov 2005

UW Biostatistics Working Paper Series

A two-channel microarray measures the relative expression levels of thousands of genes from a pair of biological samples. In order to reliably compare gene expression levels between and within arrays, it is necessary to remove systematic errors that distort the biological signal of interest. The standard for accomplishing this is smoothing "MA-plots" to remove intensity-dependent dye bias and array-specific effects. However, MA methods require strong assumptions. We review these assumptions and derive several practical scenarios in which they fail. The "dye-swap" normalization method has been much less frequently used because it requires two arrays per pair of samples. We show …
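
A minimal sketch of the standard MA-plot (loess) normalization this abstract takes as its starting point, not the authors' new approach; R and G are hypothetical arrays of red and green channel intensities for one array:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def ma_normalize(R, G, frac=0.3):
    """Compute M = log2(R/G) and A = mean log2 intensity, then subtract a
    lowess-estimated intensity-dependent trend from M."""
    M = np.log2(R) - np.log2(G)
    A = 0.5 * (np.log2(R) + np.log2(G))
    trend = lowess(M, A, frac=frac, return_sorted=False)   # fitted dye bias at each A
    return M - trend, A
```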


A General Imputation Methodology For Nonparametric Regression With Censored Data, Dan Rubin, Mark J. Van Der Laan Nov 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

We consider the random design nonparametric regression problem when the response variable is subject to a general mode of missingness or censoring. A traditional approach to such problems is imputation, in which the missing or censored responses are replaced by well-chosen values, and then the resulting covariate/response data are plugged into algorithms designed for the uncensored setting. We present a general methodology for imputation with the property of double robustness, in that the method works well if either a parameter of the full data distribution (covariate and response distribution) or a parameter of the censoring mechanism is well approximated. These …


Estimating A Treatment Effect With Repeated Measurements Accounting For Varying Effectiveness Duration, Ying Qing Chen, Jingrong Yang, Su-Chun Cheng Nov 2005

UW Biostatistics Working Paper Series

To assess treatment efficacy in clinical trials, certain clinical outcomes are repeatedly measured for the same subject over time and can be regarded as functions of time. The difference in their mean functions between the treatment arms usually characterises a treatment effect. Due to the potential existence of subject-specific treatment effectiveness lag and saturation times, the treatment effect reflected in this difference may erode over the observation period. Instead of using ad hoc parametric or purely nonparametric time-varying coefficients in statistical modeling, we first propose to model the treatment effectiveness durations, which are the varying time intervals between the …


A Fine-Scale Linkage Disequilibrium Measure Based On Length Of Haplotype Sharing, Yan Wang, Lue Ping Zhao, Sandrine Dudoit Oct 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

High-throughput genotyping technologies for single nucleotide polymorphisms (SNP) have enabled the recent completion of the International HapMap Project (Phase I), which has stimulated much interest in studying genome-wide linkage disequilibrium (LD) patterns. Conventional LD measures, such as D' and r-square, are two-point measurements, and their relationship with physical distance is highly noisy. We propose a new LD measure, defined in terms of the correlation coefficient for shared haplotype lengths around two loci, thereby borrowing information from multiple loci. A U-statistic-based estimator of the new LD measure, which takes into consideration the dependence structure of the observed data, is developed and …
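
For reference, the conventional two-point measures mentioned here are, for two loci with allele frequencies p_A and p_B and haplotype frequency p_AB,

\[ D = p_{AB} - p_A p_B, \qquad D' = \frac{|D|}{D_{\max}}, \qquad r^2 = \frac{D^2}{p_A(1-p_A)\,p_B(1-p_B)}, \]

where D_max = min{p_A(1-p_B), (1-p_A)p_B} if D > 0 and min{p_A p_B, (1-p_A)(1-p_B)} if D < 0. Both use information from only the two loci, which is the limitation the haplotype-sharing measure is designed to overcome.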


Population Intervention Models In Causal Inference, Alan E. Hubbard, Mark J. Van Der Laan Oct 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

Marginal structural models (MSM) provide a powerful tool for estimating the causal effect of a treatment variable or risk variable on the distribution of a disease in a population. These models, as originally introduced by Robins (e.g., Robins (2000a), Robins (2000b), van der Laan and Robins (2002)), model the marginal distributions of treatment-specific counterfactual outcomes, possibly conditional on a subset of the baseline covariates, and their dependence on treatment. Marginal structural models are particularly useful in the context of longitudinal data structures, in which each subject's treatment and covariate history are measured over time, and an outcome is recorded at …
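
As an illustrative example (not taken from the paper), a simple MSM for a counterfactual outcome Y_a under treatment level a, possibly conditional on baseline covariates V, specifies

\[ E[Y_a \mid V] = m(a, V \mid \beta), \qquad \text{for instance} \quad E[Y_a \mid V] = \beta_0 + \beta_1 a + \beta_2 V, \]

so that beta_1 summarizes the causal effect of treatment on the mean outcome.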


Designed Extension Of Survival Studies: Application To Clinical Trials With Unrecognized Heterogeneity, Yi Li, Mei-Chiung Shih, Rebecca A. Betensky Oct 2005

Harvard University Biostatistics Working Paper Series

It is well known that unrecognized heterogeneity among patients, such as that conferred by genetic subtype, can undermine the power of a randomized trial, designed under the assumption of homogeneity, to detect a truly beneficial treatment. We consider the conditional power approach to allow for recovery of power under unexplained heterogeneity. While Proschan and Hunsberger (1995) confined the application of conditional power design to normally distributed observations, we consider more general and difficult settings in which the data are in the framework of continuous time and are subject to censoring. In particular, we derive a procedure appropriate for the analysis of …


A Pseudolikelihood Approach For Simultaneous Analysis Of Array Comparative Genomic Hybridizations (Acgh), David A. Engler, Gayatry Mohapatra, David N. Louis, Rebecca Betensky Sep 2005

Harvard University Biostatistics Working Paper Series

DNA sequence copy number has been shown to be associated with cancer development and progression. Array-based Comparative Genomic Hybridization (aCGH) is a recent development that seeks to identify the copy number ratio at large numbers of markers across the genome. Due to experimental and biological variations across chromosomes and across hybridizations, current methods are limited to analyses of single chromosomes. We propose a more powerful approach that borrows strength across chromosomes and across hybridizations. We assume a Gaussian mixture model, with a hidden Markov dependence structure, and with random effects to allow for intertumoral variation, as well as intratumoral clonal …


Semiparametric Estimation In General Repeated Measures Problems, Xihong Lin, Raymond J. Carroll Sep 2005

Harvard University Biostatistics Working Paper Series

This paper considers a wide class of semiparametric problems with a parametric part for some covariate effects and repeated evaluations of a nonparametric function. Special cases in our approach include marginal models for longitudinal/clustered data, conditional logistic regression for matched case-control studies, multivariate measurement error models, generalized linear mixed models with a semiparametric component, and many others. We propose profile-kernel and backfitting estimation methods for these problems, derive their asymptotic distributions, and show that in likelihood problems the methods are semiparametric efficient. Although profiling and backfitting are not equivalent in general, with our methods they are asymptotically equivalent. We also consider pseudolikelihood methods …


The Optimal Discovery Procedure: A New Approach To Simultaneous Significance Testing, John D. Storey Sep 2005

UW Biostatistics Working Paper Series

Significance testing is one of the main objectives of statistics. The Neyman-Pearson lemma provides a simple rule for optimally testing a single hypothesis when the null and alternative distributions are known. This result has played a major role in the development of significance testing strategies that are used in practice. Most of the work extending single testing strategies to multiple tests has focused on formulating and estimating new types of significance measures, such as the false discovery rate. These methods tend to be based on p-values that are calculated from each test individually, ignoring information from the other tests. As …
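
For reference, the Neyman-Pearson lemma mentioned here says that for testing a simple null density f_0 against a simple alternative f_1, the most powerful level-alpha test rejects for large values of the likelihood ratio:

\[ \text{reject } H_0 \;\Longleftrightarrow\; \frac{f_1(x)}{f_0(x)} \ge c, \qquad c \text{ chosen so that } P_{f_0}\!\left(\frac{f_1(X)}{f_0(X)} \ge c\right) = \alpha. \]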


Mixture Cure Survival Models With Dependent Censoring, Yi Li, Ram C. Tiwari, Subharup Guha Sep 2005

Harvard University Biostatistics Working Paper Series

A number of authors have studied the mixture survival model to analyze survival data with nonnegligible cure fractions. A key assumption made by these authors is the independence between the survival time and the censoring time. To our knowledge, no one has studied the mixture cure model in the presence of dependent censoring. To account for such dependence, we propose a more general cure model which allows for dependent censoring. In particular, we derive the cure models from the perspective of competing risks and model the dependence between the censoring time and the survival time using a class of Archimedean …
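
For context, the standard mixture cure model referenced here writes the population survival function as a mixture over cured and uncured subjects: with cure fraction pi and survival function S_u for the uncured,

\[ S(t) = \pi + (1-\pi)\,S_u(t), \qquad \lim_{t\to\infty} S(t) = \pi > 0, \]

so the survival curve plateaus at the cure fraction.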


Semiparametric Normal Transformation Models For Spatially Correlated Survival Data, Yi Li, Xihong Lin Sep 2005

Harvard University Biostatistics Working Paper Series

There is an emerging interest in modeling spatially correlated survival data in biomedical and epidemiological studies. In this paper, we propose a new class of semiparametric normal transformation models for right censored spatially correlated survival data. This class of models assumes that survival outcomes marginally follow a Cox proportional hazard model with unspecified baseline hazard, and their joint distribution is obtained by transforming survival outcomes to normal random variables, whose joint distribution is assumed to be multivariate normal with a spatial correlation structure. A key feature of the class of semiparametric normal transformation models is that it provides a rich …


Inference On Survival Data With Covariate Measurement Error - An Imputation-Based Approach, Yi Li, Louise Ryan Sep 2005

Harvard University Biostatistics Working Paper Series

We propose a new method for fitting proportional hazards models with error-prone covariates. Regression coefficients are estimated by solving an estimating equation that is the average of the partial likelihood scores based on imputed true covariates. For the purpose of imputation, a linear spline model is assumed on the baseline hazard. We discuss consistency and asymptotic normality of the resulting estimators, and propose a stochastic approximation scheme to obtain the estimates. The algorithm is easy to implement, and reduces to the ordinary Cox partial likelihood approach when the measurement error has a degenerate distribution. Simulations indicate high efficiency and robustness. …


The Optimal Discovery Procedure For Large-Scale Significance Testing, With Applications To Comparative Microarray Experiments, John D. Storey, James Y. Dai, Jeffrey T. Leek Sep 2005

UW Biostatistics Working Paper Series

As much of the focus of genetics and molecular biology has shifted toward the systems level, it has become increasingly important to accurately extract biologically relevant signal from thousands of related measurements. The common property among these high-dimensional biological studies is that the measured features have a rich and largely unknown underlying structure. One example of much recent interest is identifying differentially expressed genes in comparative microarray experiments. We propose a new approach aimed at optimally performing many hypothesis tests in a high-dimensional study. This approach estimates the Optimal Discovery Procedure (ODP), which has recently been introduced and theoretically shown …


Direct Effect Models, Mark J. Van Der Laan, Maya L. Petersen Aug 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

The causal effect of a treatment on an outcome is generally mediated by several intermediate variables. Estimation of the component of the causal effect of a treatment that is mediated by a given intermediate variable (the indirect effect of the treatment), and the component that is not mediated by that intermediate variable (the direct effect of the treatment) is often relevant to mechanistic understanding and to the design of clinical and public health interventions. Under the assumption of no-unmeasured confounders for treatment and the intermediate variable, Robins & Greenland (1992) define an individual direct effect as the counterfactual effect of …


Application Of A Multiple Testing Procedure Controlling The Proportion Of False Positives To Protein And Bacterial Data, Merrill D. Birkner, Alan E. Hubbard, Mark J. Van Der Laan Aug 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

Simultaneously testing multiple hypotheses is important in high-dimensional biological studies. In these situations, one is often interested in controlling a Type-I error rate, such as the tail probability that the proportion of false positives among total rejections exceeds a given fraction (TPPFP), at a specified level alpha. This article presents an application of the E-Bayes/Bootstrap TPPFP procedure of van der Laan et al. (2005), which controls the tail probability of the proportion of false positives (TPPFP), to two biological datasets. The first application is to a mass-spectrometry dataset of two leukemia subtypes, AML and ALL. The protein data measurements include intensity and …


Cross-Validating And Bagging Partitioning Algorithms With Variable Importance, Annette M. Molinaro, Mark J. Van Der Laan Aug 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

We present a cross-validated bagging scheme in the context of partitioning algorithms. To explore the benefits of the various bagging schemes, we compare via simulations the predictive ability of a single Classification and Regression Tree (CART) with several previously suggested bagging schemes and with our proposed approach. Additionally, a variable importance measure is explained and illustrated.
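
A minimal sketch of ordinary bootstrap aggregation of CART trees, the kind of baseline scheme involved in the comparison (generic bagging, not the authors' cross-validated variant); the names and the use of scikit-learn are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_cart_predict(X, y, X_new, n_trees=100, seed=0):
    """Fit a CART tree on each bootstrap resample of (X, y) and predict X_new
    by majority vote over the trees (binary 0/1 labels assumed)."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(X_new), n_trees), dtype=int)
    for b in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample of row indices
        votes[:, b] = DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X_new)
    return (votes.mean(axis=1) > 0.5).astype(int)
```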


Test Statistics Null Distributions In Multiple Testing: Simulation Studies And Applications To Genomics, Katherine S. Pollard, Merrill D. Birkner, Mark J. Van Der Laan, Sandrine Dudoit Jul 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

Multiple hypothesis testing problems arise frequently in biomedical and genomic research, for instance, when identifying differentially expressed or co-expressed genes in microarray experiments. We have developed generally applicable resampling-based single-step and stepwise multiple testing procedures (MTP) for control of a broad class of Type I error rates, defined as tail probabilities and expected values for arbitrary functions of the numbers of false positives and rejected hypotheses (Dudoit and van der Laan, 2005; Dudoit et al., 2004a,b; Pollard and van der Laan, 2004; van der Laan et al., 2005, 2004a,b). As argued in the early article of Pollard and van der …
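
For reference, writing V for the number of false positives and R for the number of rejections, the error rates in this class include

\[ \mathrm{FWER} = P(V \ge 1), \qquad \mathrm{gFWER}(k) = P(V > k), \qquad \mathrm{TPPFP}(q) = P\!\left(\tfrac{V}{R} > q\right), \qquad \mathrm{FDR} = E\!\left[\tfrac{V}{R}\right], \]

with the convention V/R = 0 when R = 0.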


Linear Regression Of Censored Length-Biased Lifetimes, Ying Qing Chen, Yan Wang Jul 2005

UW Biostatistics Working Paper Series

Length-biased lifetimes may be collected in observational studies or sample surveys due to a biased sampling scheme. In this article, we use a linear regression model, namely, the accelerated failure time model, for the population lifetime distributions in regression analysis of the length-biased lifetimes. It is discovered that the associated regression parameters are invariant under the length-biased sampling scheme. Based on this discovery, we propose quasi partial score estimating equations to estimate the population regression parameters. The proposed methodologies are evaluated and demonstrated by simulation studies and an application to an actual data set.
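
For context, under length-biased sampling a lifetime with population density f and mean mu is observed with density proportional to its length,

\[ f_{LB}(t) = \frac{t\,f(t)}{\mu}, \qquad \mu = \int_0^\infty t\,f(t)\,dt, \]

so longer lifetimes are over-represented in the sample; the invariance result above says the accelerated failure time regression parameters are unaffected by this reweighting.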


A Note On The Construction Of Counterfactuals And The G-Computation Formula, Zhuo Yu, Mark J. Van Der Laan Jun 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

Robins' causal inference theory assumes existence of treatment specific counterfactual variables so that the observed data augmented by the counterfactual data will satisfy a consistency and a randomization assumption. In this paper we provide an explicit function that maps the observed data into a counterfactual variable which satisfies the consistency and randomization assumptions. This offers a practically useful imputation method for counterfactuals. Gill & Robins [2001]'s construction of counterfactuals can be used as an imputation method in principle, but it is very hard to implement in practice. Robins [1987] shows that the counterfactual distribution can be identified from the observed …
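
For reference, in the simplest point-treatment case with baseline covariates W, the G-computation formula identifies the counterfactual distribution from the observed data as

\[ P(Y_a = y) = \sum_{w} P(Y = y \mid A = a, W = w)\,P(W = w), \]

under the consistency and randomization (no unmeasured confounding) assumptions discussed in the abstract.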


Cross-Validated Bagged Learning, Mark J. Van Der Laan, Sandra E. Sinisi, Maya L. Petersen Jun 2005

U.C. Berkeley Division of Biostatistics Working Paper Series

Many applications aim to learn a high dimensional parameter of a data generating distribution based on a sample of independent and identically distributed observations. For example, the goal might be to estimate the conditional mean of an outcome given a list of input variables. In this prediction context, Breiman (1996a) introduced bootstrap aggregating (bagging) as a method to reduce the variance of a given estimator at little cost to bias. Bagging involves applying the estimator to multiple bootstrap samples, and averaging the result across bootstrap samples. In order to deal with the curse of dimensionality, typical practice has been to …


On Additive Regression Of Expectancy, Ying Qing Chen Jun 2005

UW Biostatistics Working Paper Series

Regression models have been important tools for studying the association between outcome variables and their covariates. Traditional linear regression models usually specify such an association through the expectations of the outcome variables as functions of the covariates and some parameters. In reality, however, interest often focuses on their expectancies, characterized by the conditional means. In this article, a new class of additive regression models is proposed to model the expectancies. The model parameters carry practical implications, which may allow the models to be useful in applications such as treatment assessment, resource planning or short-term forecasting. Moreover, the new model …


An Empirical Process Limit Theorem For Sparsely Correlated Data, Thomas Lumley Jun 2005

UW Biostatistics Working Paper Series

We consider data that are dependent, but where most small sets of observations are independent. By extending Bernstein's inequality we prove a strong law of large numbers and an empirical process central limit theorem under bracketing entropy conditions.
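
For reference, the classical Bernstein inequality being extended states that for independent mean-zero random variables X_1, ..., X_n with |X_i| <= M,

\[ P\!\left(\Big|\sum_{i=1}^n X_i\Big| > t\right) \le 2\exp\!\left(-\frac{t^2/2}{\sum_{i=1}^n E[X_i^2] + Mt/3}\right). \]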