Open Access. Powered by Scholars. Published by Universities.®
Physical Sciences and Mathematics Commons™
Articles 1 - 13 of 13
Full-Text Articles in Physical Sciences and Mathematics
High-Dimensional Feature Selection And Multi-Level Causal Mediation Analysis With Applications To Human Aging And Cluster-Based Intervention Studies, Hachem Saddiki
Doctoral Dissertations
Many questions in public health and medicine are fundamentally causal in that our objective is to learn the effect of some exposure, randomized or not, on an outcome of interest. As a result, causal inference frameworks and methodologies have gained interest as a promising tool to reliably answer scientific questions. However, the tasks of identifying and efficiently estimating causal effects from observed data still pose significant challenges under complex data generating scenarios. We focus on (1) high-dimensional settings where the number of variables is orders of magnitude higher than the number of observations; and (2) multi-level settings, where study participants …
Vif-Regression Screening Ultrahigh Dimensional Feature Space, Hassan S. Uraibi
Journal of Modern Applied Statistical Methods
Iterative Sure Independence Screening (ISIS) was proposed for the problem of variable selection in ultrahigh-dimensional feature spaces. Unfortunately, the ISIS method reduces the dimensionality of the features from ultrahigh to ultra-low and may result in unreliable inference, particularly when the number of important variables is greater than the screening threshold. The proposed method instead reduces the ultrahigh dimensionality of the features to a high-dimensional space in order to remedy the loss of information incurred by the ISIS method. The proposed method is compared with the ISIS method using real data and simulation. The results show this method is more efficient and more reliable …
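For readers unfamiliar with screening, the following is a minimal sketch of plain (non-iterative) marginal correlation screening, the basic idea underlying sure independence screening. It is not the VIF-regression method proposed in the abstract above, and the function name `marginal_screen`, the simulated data, and the threshold `d` are all assumptions made for illustration.

```python
import numpy as np

def marginal_screen(X, y, d):
    """Return indices of the d features with the largest absolute
    marginal (Pearson) correlation with the response y."""
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the response
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = (Xc.T @ yc) / denom       # one correlation per feature
    order = np.argsort(-np.abs(corr))
    return order[:d]

# Hypothetical simulated data: 200 observations, 1000 features,
# of which only features 0 and 5 influence the response.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.5 * rng.standard_normal(n)

kept = marginal_screen(X, y, d=10)
print(sorted(kept.tolist()))
```

If the number of truly important variables exceeds `d`, some of them are necessarily discarded, which is the limitation the abstract attributes to an overly aggressive screening threshold.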
A Xgboost Risk Model Via Feature Selection And Bayesian Hyper-Parameter Optimization, Yan Wang, Sherry Ni
Published and Grey Literature from PhD Candidates
This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimizations are simultaneously considered during model training. The five most commonly used FS methods, including weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information, are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and the Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effects of different FS and hyper-parameter optimization methods on the model performance are investigated by the Wilcoxon Signed Rank …
Selecting Maximally-Predictive Deep Features To Explain What Drives Fixations In Free-Viewing, Matthias Kümmerer, Thomas S.A. Wallis, Matthias Bethge
MODVIS Workshop
No abstract provided.
Incorporating Pathway Information Into Feature Selection Towards Better Performed Gene Signatures, Suyan Tian, Chi Wang, Bing Wang
Biostatistics Faculty Publications
To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, pathway-based feature selection is first divided into three major categories, namely, pathway-level selection, …
Feature Selection For Longitudinal Data By Using Sign Averages To Summarize Gene Expression Values Over Time, Suyan Tian, Chi Wang
Biostatistics Faculty Publications
With the rapid evolution of high-throughput technologies, time series/longitudinal high-throughput experiments have become possible and affordable. However, the development of statistical methods dealing with gene expression profiles across time points has not kept up with the explosion of such data. The feature selection process is of critical importance for longitudinal microarray data. In this study, we proposed aggregating a gene’s expression values across time into a single value using the sign average method, thereby degrading a longitudinal feature selection process into a classic one. Regularized logistic regression models with pseudogenes (i.e., the sign average of genes across time as predictors) …
Unified Methods For Feature Selection In Large-Scale Genomic Studies With Censored Survival Outcomes, Lauren Spirko-Burns, Karthik Devarajan
COBRA Preprint Series
One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease's process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards …
Data Patterns Discovery Using Unsupervised Learning, Rachel A. Lewis
Electronic Theses and Dissertations
Self-care activities classification poses significant challenges in identifying children’s unique functional abilities and needs within the exceptional children healthcare system. The accuracy of diagnosing a child's self-care problem, such as toileting or dressing, is highly influenced by an occupational therapist’s experience and time constraints. Thus, there is a need for objective means to detect and predict in advance the self-care problems of children with physical and motor disabilities. We use clustering to discover interesting information from self-care problems, perform automatic classification of binary data, and discover outliers. The advantages are twofold: the advancement of knowledge on identifying self-care problems in …
A Longitudinal Feature Selection Method Identifies Relevant Genes To Distinguish Complicated Injury And Uncomplicated Injury Over Time, Suyan Tian, Chi Wang, Howard H. Chang
Biostatistics Faculty Publications
Background: Feature selection and gene set analysis are of increasing interest in the field of bioinformatics. While these two approaches have been developed for different purposes, we describe how some gene set analysis methods can be utilized to conduct feature selection.
Methods: We adopted a gene set analysis method, the significance analysis of microarray gene set reduction (SAMGSR) algorithm, to carry out feature selection for longitudinal gene expression data.
Results: Using a real-world application and simulated data, it is demonstrated that the proposed SAMGSR extension outperforms other relevant methods. In this study, we illustrate that a gene’s expression profiles over …
An Empirical Study On Different Ranking Methods For Effective Data Classification, Ilangovan Sangaiah, A. Vincent Antony Kumar, Appavu Balamurugan
Journal of Modern Applied Statistical Methods
Ranking is the attribute selection technique used in the pre-processing phase to emphasize the most relevant attributes, which makes classification models simpler and easier to understand. It is a very important and central task for information retrieval applications such as web search engines, recommendation systems, and advertisement systems. A comparison between eight ranking methods was conducted. Ten different learning algorithms (NaiveBayes, J48, SMO, JRIP, Decision table, RandomForest, Multilayerperceptron, Kstar) were used to test the accuracy. The ranking methods with different supervised learning algorithms give different results for balanced accuracy. It was shown that the selection of ranking methods could be …
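As an illustration of what attribute ranking means in practice, the sketch below scores discrete attributes by information gain and sorts them. The abstract above does not name its eight ranking methods, so the choice of information gain is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_attributes(columns, labels):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: "signal" matches the label, "noise" does not.
labels = [0, 0, 1, 1]
columns = {
    "noise": [0, 1, 0, 1],
    "signal": [0, 0, 1, 1],
}
ranking = rank_attributes(columns, labels)
print(ranking)
```

A classifier would then be trained on only the top-ranked attributes; different scoring functions can produce different orderings, which is why the choice of ranking method can matter.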
Multi-Tgdr, A Multi-Class Regularization Method, Identifies The Metabolic Profiles Of Hepatocellular Carcinoma And Cirrhosis Infected With Hepatitis B Or Hepatitis C Virus, Suyan Tian, Howard H. Chang, Chi Wang, Jing Jiang, Xiaomei Wang, Junqi Niu
Biostatistics Faculty Publications
BACKGROUND: Over the last decade, metabolomics has evolved into a mainstream enterprise utilized by many laboratories globally. Like other "omics" data, metabolomics data has the characteristics of a smaller sample size compared to the number of features evaluated. Thus the selection of an optimal subset of features with a supervised classifier is imperative. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multi-class classification of "omics" data, and proposed two such extensions referred to as multi-TGDR. Both multi-TGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) …
Score Test Variable Screening, Sihai Dave Zhao, Yi Li
The University of Michigan Department of Biostatistics Working Paper Series
Variable screening has emerged as a crucial first step in the analysis of high-throughput data, but existing procedures can be computationally cumbersome, difficult to justify theoretically, or inapplicable to certain types of analyses. Motivated by a high-dimensional censored quantile regression problem in multiple myeloma genomics, this paper makes three contributions. First, we establish a score test-based screening framework, which is widely applicable, extremely computationally efficient, and relatively simple to justify. Secondly, we propose a resampling-based procedure for selecting the number of variables to retain after screening according to the principle of reproducibility. Finally, we propose a new iterative score test …
Selection Of Independent Binary Features Using Probabilities: An Example From Veterinary Medicine, Ludmila I. Kuncheva, Zoë S.J. Hoare, Peter D. Cockcroft
Journal of Modern Applied Statistical Methods
Supervised classification into c mutually exclusive classes based on n binary features is considered. The only information available is an n×c table with probabilities. Knowing that the best d features are not the d best, simulations were run for 4 feature selection methods and an application to diagnosing BSE in cattle and Scrapie in sheep is presented.