Online Cross-Validation-Based Ensemble Learning, 2016 Division of Biostatistics, University of California, Berkeley

#### Online Cross-Validation-Based Ensemble Learning, David Benkeser, Samuel D. Lendle, Cheng Ju, Mark J. Van Der Laan

*U.C. Berkeley Division of Biostatistics Working Paper Series*

Online estimators update a current estimate with a new incoming batch of data without having to revisit past data, thereby providing streaming estimates that are scalable to big data. We develop flexible, ensemble-based online estimators of an infinite-dimensional target parameter, such as a regression function, in the setting where data are generated sequentially by a common conditional data distribution given summary measures of the past. This setting encompasses a wide range of time-series models and, as a special case, models for independent and identically distributed data. Our estimator considers a large library of candidate online estimators and uses online cross-validation to ...

Doubly-Robust Nonparametric Inference On The Average Treatment Effect, 2016 Division of Biostatistics, University of California, Berkeley

#### Doubly-Robust Nonparametric Inference On The Average Treatment Effect, David Benkeser, Marco Carone, Mark J. Van Der Laan, Peter Gilbert

*U.C. Berkeley Division of Biostatistics Working Paper Series*

Doubly-robust estimators are widely used to draw inference about the average effect of a treatment. Such estimators are consistent for the effect of interest if either one of two nuisance parameters is consistently estimated. However, if flexible, data-adaptive estimators of these nuisance parameters are used, double-robustness does not readily extend to inference. We present a general theoretical study of the behavior of doubly-robust estimators of an average treatment effect when one of the nuisance parameters is inconsistently estimated. We contrast different approaches for constructing such estimators and investigate the extent to which they may be modified to also allow doubly-robust ...
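
The double-robustness property described in this abstract can be illustrated with a minimal sketch. It uses the standard augmented inverse-probability-weighted (AIPW) form of a doubly-robust estimator on simulated data; it is not the estimator studied in the paper, and it shows only consistency of the point estimate, not the inference problem the abstract raises. The data-generating model, the function names `aipw_ate` and `simulate`, and the true effect of 2 are all assumptions made for illustration.

```python
import math
import random


def aipw_ate(y, a, g, m1, m0):
    """Augmented IPW (doubly-robust) estimate of the average treatment effect.

    y: outcomes; a: 0/1 treatment indicators; g: propensity scores P(A=1 | X);
    m1, m0: outcome-regression predictions under treatment and under control.
    """
    total = 0.0
    for yi, ai, gi, m1i, m0i in zip(y, a, g, m1, m0):
        total += (m1i - m0i
                  + ai * (yi - m1i) / gi
                  - (1 - ai) * (yi - m0i) / (1 - gi))
    return total / len(y)


def simulate(n, seed=0):
    """Hypothetical data: Y = 1 + 2*A + X + noise, P(A=1 | X) = expit(X)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    g = [1 / (1 + math.exp(-xi)) for xi in x]
    a = [1 if rng.random() < gi else 0 for gi in g]
    y = [1 + 2 * ai + xi + rng.gauss(0, 1) for ai, xi in zip(a, x)]
    return x, g, a, y


n = 20000
x, g_true, a, y = simulate(n)

# Case 1: true propensity score, deliberately wrong outcome model (all zeros).
zeros = [0.0] * n
est_wrong_outcome = aipw_ate(y, a, g_true, zeros, zeros)

# Case 2: true outcome model, deliberately wrong propensity score (constant 0.5).
m1_true = [3 + xi for xi in x]
m0_true = [1 + xi for xi in x]
est_wrong_ps = aipw_ate(y, a, [0.5] * n, m1_true, m0_true)

print(est_wrong_outcome, est_wrong_ps)
```

In both cases the estimate should land close to the simulated effect of 2, because one of the two nuisance parameters is correct. The sketch plugs in known nuisance functions; with flexible, data-adaptive estimates of them, the abstract's caveat about inference applies.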

Comparative Effectiveness Research Using Observational Data: Active Comparators To Emulate Target Trials With Inactive Comparators, 2016 Harvard T.H. Chan School of Public Health

#### Comparative Effectiveness Research Using Observational Data: Active Comparators To Emulate Target Trials With Inactive Comparators, Anders Huitfeldt, Miguel A. Hernan, Mette Kalager, James M. Robins

*eGEMs (Generating Evidence & Methods to improve patient outcomes)*

**Introduction:** Because a comparison of non-initiators and initiators of treatment may be hopelessly confounded, guidelines for the conduct of observational research often recommend using an “active” comparator group consisting of people who initiate a treatment other than the medication of interest. In this paper, we discuss the conditions under which this approach is valid if the goal is to emulate a trial with an inactive comparator.

**Identification of Effects:** We provide conditions under which a target trial in a subpopulation can be validly emulated from observational data, using an active comparator that is known or believed to be inactive for ...

Performance-Constrained Binary Classification Using Ensemble Learning: An Application To Cost-Efficient Targeted Prep Strategies, 2016 Division of Biostatistics, School of Public Health, University of California, Berkeley

#### Performance-Constrained Binary Classification Using Ensemble Learning: An Application To Cost-Efficient Targeted Prep Strategies, Wenjing Zheng, Laura Balzer, Maya L. Petersen, Mark J. Van Der Laan

*U.C. Berkeley Division of Biostatistics Working Paper Series*

Binary classification problems are ubiquitous in health and social science applications. In many cases, one wishes to balance two conflicting criteria for an optimal binary classifier. For instance, in resource-limited settings, an HIV prevention program based on offering Pre-Exposure Prophylaxis (PrEP) to select high-risk individuals must balance the sensitivity of the binary classifier in detecting future seroconverters (and hence offering them PrEP regimens) with the total number of PrEP regimens that is financially and logistically feasible for the program to deliver. In this article, we consider a general class of performance-constrained binary classification problems wherein the objective function and the ...

Matching The Efficiency Gains Of The Logistic Regression Estimator While Avoiding Its Interpretability Problems, In Randomized Trials, 2016 Johns Hopkins Bloomberg School of Public Health, Department of Biostatistics

#### Matching The Efficiency Gains Of The Logistic Regression Estimator While Avoiding Its Interpretability Problems, In Randomized Trials, Michael Rosenblum, Jon Arni Steingrimsson

*Johns Hopkins University, Dept. of Biostatistics Working Papers*

Adjusting for prognostic baseline variables can lead to improved power in randomized trials. For binary outcomes, a logistic regression estimator is commonly used for such adjustment. This has resulted in substantial efficiency gains in practice, e.g., gains equivalent to reducing the required sample size by 20-28% were observed in a recent survey of traumatic brain injury trials. Robinson and Jewell (1991) proved that the logistic regression estimator is guaranteed to have equal or better asymptotic efficiency compared to the unadjusted estimator (which ignores baseline variables). Unfortunately, the logistic regression estimator has the following dangerous vulnerabilities: it is only interpretable ...

A Synthesis Of Current Surveillance Planning Methods For The Sequential Monitoring Of Drug And Vaccine Adverse Effects Using Electronic Health Care Data, 2016 Group Health Research Institute; University of Washington

#### A Synthesis Of Current Surveillance Planning Methods For The Sequential Monitoring Of Drug And Vaccine Adverse Effects Using Electronic Health Care Data, Jennifer C. Nelson, Robert Wellman, Onchee Yu, Andrea J. Cook, Judith C. Maro, Rita Ouellet-Hellstrom, Denise Boudreau, James S. Floyd, Susan R. Heckbert, Simone Pinheiro, Marsha Reichman, Azadeh Shoaibi

*eGEMs (Generating Evidence & Methods to improve patient outcomes)*

**Introduction:** The large-scale assembly of electronic health care data combined with the use of sequential monitoring has made proactive postmarket drug- and vaccine-safety surveillance possible. Although sequential designs have been used extensively in randomized trials, less attention has been given to methods for applying them in observational electronic health care database settings.

**Existing Methods:** We review current sequential-surveillance planning methods from randomized trials, and the Vaccine Safety Datalink (VSD) and Mini-Sentinel Pilot projects—two national observational electronic health care database safety monitoring programs.

**Future Surveillance Planning:** Based on this examination, we suggest three steps for future surveillance planning in health ...

Model Averaged Double Robust Estimation, 2016 Harvard School of Public Health

#### Model Averaged Double Robust Estimation, Matthew Cefalu, Francesca Dominici, Nils D. Arvold Md, Giovanni Parmigiani

*Harvard University Biostatistics Working Paper Series*

Existing methods in causal inference do not account for the uncertainty in the selection of confounders. We propose a new class of estimators for the average causal effect, the model averaged double robust estimators, that formally account for model uncertainty in both the propensity score and outcome model through the use of Bayesian model averaging. These estimators build on the desirable double robustness property by only requiring the true propensity score model or the true outcome model be within a specified class of models to maintain consistency. We provide asymptotic results and conduct a large scale simulation study that indicates ...

Prevalence Estimation At The Cluster Level For Correlated Binary Data Using Random Partial-Cluster Sampling, 2016 University of North Carolina at Chapel Hill

#### Prevalence Estimation At The Cluster Level For Correlated Binary Data Using Random Partial-Cluster Sampling, Rujin Wang, John S. Preisser

*The University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series*

For clustered data in the medical sciences, disease is present when one or more of the observations in the cluster has the disease condition. This paper focuses on estimation of periodontal disease prevalence defined as the probability that one or more tooth sites have disease in a randomly selected subject. The prohibitive exam time and monetary cost of the full-mouth examination make partial-mouth recording protocols attractive alternative methods to assess chronic periodontitis. In particular, Beck et al. (2006) proposed the random site selection method (RSSM), which pre-specifies a fixed number of tooth sites to be selected randomly from each subject ...

Distance-Based Analysis Of Variance For Brain Connectivity, 2016 Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania

#### Distance-Based Analysis Of Variance For Brain Connectivity, Russell T. Shinohara, Haochang Shou, Marco Carone, Robert Schultz, Birkan Tunc, Drew Parker, Ragini Verma

*UPenn Biostatistics Working Papers*

The field of neuroimaging dedicated to mapping connections in the brain is increasingly being recognized as key for understanding neurodevelopment and pathology. Networks of these connections are quantitatively represented using complex structures including matrices, functions, and graphs, which require specialized statistical techniques for estimation and inference about developmental and disorder-related changes. Unfortunately, classical statistical testing procedures are not well suited to high-dimensional testing problems. In the context of global or regional tests for differences in neuroimaging data, traditional analysis of variance (ANOVA) is not directly applicable without first summarizing the data into univariate or low-dimensional features, a process that may ...

Addition To Pglr Chap 6, 2016 Arizona State University

#### Addition To Pglr Chap 6, Joseph M. Hilbe

*Joseph M Hilbe*

The Use Of Permutation Tests For The Analysis Of Parallel And Stepped-Wedge Cluster Randomized Trials, 2016 Harvard University

#### The Use Of Permutation Tests For The Analysis Of Parallel And Stepped-Wedge Cluster Randomized Trials, Rui Wang, Victor Degruttola

*Harvard University Biostatistics Working Paper Series*

We investigate the use of permutation tests for the analysis of parallel and stepped-wedge cluster randomized trials. Permutation tests for parallel designs with exponential family endpoints have been extensively studied. The optimal permutation tests developed for exponential family alternatives require information on intraclass correlation, a quantity not yet defined for time-to-event endpoints. Therefore, it is unclear how efficient permutation tests can be constructed for cluster-randomized trials with such endpoints. We consider a class of test statistics formed by a weighted average of pair-specific treatment effect estimates and offer practical guidance on the choice of weights to improve efficiency. We apply ...

Improving Precision By Adjusting For Baseline Variables In Randomized Trials With Binary Outcomes, Without Regression Model Assumptions, 2016 Johns Hopkins Bloomberg School of Public Health

#### Improving Precision By Adjusting For Baseline Variables In Randomized Trials With Binary Outcomes, Without Regression Model Assumptions, Jon Arni Steingrimsson, Daniel F. Hanley, Michael Rosenblum

*Johns Hopkins University, Dept. of Biostatistics Working Papers*

In randomized clinical trials with baseline variables that are prognostic for the primary outcome, there is potential to improve precision and reduce sample size by appropriately adjusting for these variables. A major challenge is that there are multiple statistical methods to adjust for baseline variables, but little guidance on which is best to use in a given context. The choice of method can have important consequences. For example, one commonly used method leads to uninterpretable estimates if there is any treatment effect heterogeneity, which would jeopardize the validity of trial conclusions. We give practical guidance on how to avoid this ...

An Exploration Of Information Exchange By Adolescents And Parents Participating In Adolescent Idiopathic Scoliosis Online Support Groups, 2016 University of Iowa

#### An Exploration Of Information Exchange By Adolescents And Parents Participating In Adolescent Idiopathic Scoliosis Online Support Groups, Traci Schwieger, Shelly Campo, Keli R. Steuber, Stuart L. Weinstein, Sato Ashida

*Department of Biostatistics Publications*

### Background

Research indicates that healthcare providers frequently fail to adequately address patients’ health information needs. Therefore, it is not surprising that patients or parents of a sick child are seeking health information on the internet, in particular in online support groups (OSGs). In order to improve our understanding of the unmet health information needs of families dealing with adolescent idiopathic scoliosis (AIS), this study assessed and compared the types of information that adolescents and parents are seeking in OSGs.

### Methods

This study used two publicly accessible AIS-related OSGs on the National Scoliosis Foundation (NSF) website that targeted those who are ...

An Activity Index For Raw Accelerometry Data And Its Comparison With Other Activity Metrics, 2016 Selected Works

#### An Activity Index For Raw Accelerometry Data And Its Comparison With Other Activity Metrics, J Bai, C Z. Di, L Xiao, K R. Evenson, A Z. Lacroix, C M. Crainiceanu, D M. Buchner

*Chongzhi Di*

Using Machine Learning And Natural Language Processing Algorithms To Automate The Evaluation Of Clinical Decision Support In Electronic Medical Record Systems, 2016 University of Southern Maine

#### Using Machine Learning And Natural Language Processing Algorithms To Automate The Evaluation Of Clinical Decision Support In Electronic Medical Record Systems, Donald A. Szlosek, Jonathan M. Ferretti

*eGEMs (Generating Evidence & Methods to improve patient outcomes)*

**Introduction:** As the number of clinical decision support systems incorporated into electronic medical records increases, so does the need to evaluate their effectiveness. The use of medical record review and similar manual methods for evaluating decision rules is laborious and inefficient. Here we use machine learning and natural language processing (NLP) algorithms to accurately evaluate a clinical decision support rule through an electronic medical record system and compare it against manual evaluation.

**Methods:** Modeled after the electronic medical record system EPIC at Maine Medical Center, we developed a dummy dataset containing physician notes in free text for 3621 artificial patients ...

Level Of Patient-Physician Agreement In Assessment Of Change Following Conservative Rehabilitation For Shoulder Pain, 2016 California State University, Fresno

#### Level Of Patient-Physician Agreement In Assessment Of Change Following Conservative Rehabilitation For Shoulder Pain, Stephanie D. Moore-Reed, W. Ben Kibler, Heather M. Bush, Timothy L. Uhl

*Tim L. Uhl*

**Background** Assessment of health-related status has been shown to vary between patients and physicians, although the degree of patient–physician discordance in the assessment of the change in status is unknown.

**Methods** Ninety-nine patients with shoulder dysfunction underwent a standardized physician examination and completed several self-reported questionnaires. All patients were prescribed the same physical therapy intervention. Six weeks later, the patients returned to the physician, when self-report questionnaires were re-assessed and the Global Rating of Change (GROC) was completed by the patient. The physician completed the GROC retrospectively. To determine agreement between patient and physician, intra-class correlation (ICC) coefficient and ...

Mediation Analysis For A Survival Outcome With Time-Varying Exposures, Mediators, And Confounders, 2016 Department of Biostatistics, Columbia Mailman School of Public Health

#### Mediation Analysis For A Survival Outcome With Time-Varying Exposures, Mediators, And Confounders, Sheng-Hsuan Lin, Jessica G. Young, Roger Logan, Tyler J. Vanderweele

*Harvard University Biostatistics Working Paper Series*

We propose an approach to conduct mediation analysis for survival data with time-varying exposures, mediators, and confounders. We identify certain interventional direct and indirect effects through a survival mediational g-formula and describe the required assumptions. We also provide a feasible parametric approach along with an algorithm and software to estimate these effects. We apply this method to analyze the Framingham Heart Study data to investigate the causal mechanism of smoking on mortality through coronary artery disease. The risk ratio of smoking 30 cigarettes per day for ten years compared with no smoking on mortality is 2.34 (95% CI = (1 ...

Multilevel Models For Longitudinal Data, 2016 East Tennessee State University

#### Multilevel Models For Longitudinal Data, Aastha Khatiwada

*Electronic Theses and Dissertations*

Longitudinal data arise when individuals are measured several times during an observation period and thus the data for each individual are not independent. There are several ways of analyzing longitudinal data when different treatments are compared. Multilevel models are used to analyze data that are clustered in some way. In this work, multilevel models are used to analyze longitudinal data from a case study. Results from other more commonly used methods are compared to multilevel models. Output from two software packages, SAS and R, is also compared. Finally, a method consisting of fitting individual models for each ...

Propensity Score Based Methods For Estimating The Treatment Effects Based On Observational Studies., 2016 University of Louisville

#### Propensity Score Based Methods For Estimating The Treatment Effects Based On Observational Studies., Younathan Abdia

*Electronic Theses and Dissertations*

This dissertation consists of two interconnected research projects. The first project was a study of propensity score-based statistical methods for estimating the average treatment effect (ATE) and the average treatment effect among the treated (ATT) when there are two treatment groups. The ATE is defined as the mean of the individual causal effects in the whole population, while the ATT is defined as the treatment effect for the treated population. Propensity score-based statistical methods, such as matching, regression, stratification, inverse probability weighting (IPW), and doubly robust (DR) methods, were used to estimate the ATE and ATT. Simulation studies and case ...
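
The two estimands defined in this abstract can be written in potential-outcome notation. The symbols are assumptions introduced here for illustration, not taken from the dissertation: $A \in \{0, 1\}$ is the treatment indicator and $Y(1)$, $Y(0)$ are the outcomes a subject would have under treatment and under control.

$$
\mathrm{ATE} = E\bigl[\,Y(1) - Y(0)\,\bigr],
\qquad
\mathrm{ATT} = E\bigl[\,Y(1) - Y(0) \mid A = 1\,\bigr].
$$

The two coincide when the individual effect $Y(1) - Y(0)$ has the same mean among the treated as in the whole population, for example under randomized treatment assignment.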

Variable Selection Via Penalized Regression And The Genetic Algorithm Using Information Complexity, With Applications For High-Dimensional -Omics Data, 2016 University of Tennessee, Knoxville

#### Variable Selection Via Penalized Regression And The Genetic Algorithm Using Information Complexity, With Applications For High-Dimensional -Omics Data, Tyler J. Massaro

*Doctoral Dissertations*

This dissertation is a collection of examples, algorithms, and techniques for researchers interested in selecting influential variables from statistical regression models. Chapters 1, 2, and 3 provide background information that will be used throughout the remaining chapters, on topics including but not limited to information complexity, model selection, covariance estimation, stepwise variable selection, penalized regression, and especially the genetic algorithm (GA) approach to variable subsetting.

In chapter 4, we fully develop the framework for performing GA subset selection in logistic regression models. We present advantages of this approach against stepwise and elastic net regularized regression in selecting variables from a ...