Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 30 of 36

Full-Text Articles in Physical Sciences and Mathematics

A Framework For The Statistical Analysis Of Mass Spectrometry Imaging Experiments, Kyle Bemis Dec 2016

Open Access Dissertations

Mass spectrometry (MS) imaging is a powerful investigation technique for a wide range of biological applications such as molecular histology of tissue, whole body sections, and bacterial films, and biomedical applications such as cancer diagnosis. MS imaging visualizes the spatial distribution of molecular ions in a sample by repeatedly collecting mass spectra across its surface, resulting in complex, high-dimensional imaging datasets. Two of the primary goals of statistical analysis of MS imaging experiments are classification (for supervised experiments), i.e. assigning pixels to pre-defined classes based on their spectral profiles, and segmentation (for unsupervised experiments), i.e. assigning pixels to newly …


Group Transformation And Identification With Kernel Methods And Big Data Mixed Logistic Regression, Chao Pan Dec 2016

Open Access Dissertations

Exploratory Data Analysis (EDA) is a crucial step in the life cycle of data analysis. Exploring data with effective methods reveals the main characteristics of the data and provides guidance for model building. The goal of this thesis is to develop effective and efficient methods for data exploration in the regression setting.

First, we propose to use optimal group transformations as a general approach for exploring the relationship between predictor variables X and the response Y. This approach can be considered an automatic procedure to identify the best characteristic of P(Y|X) under which the relationship …


Characterizing The Effects Of Repetitive Head Trauma In Female Soccer Athletes For Prevention Of Mild Traumatic Brain Injury, Diana Otero Svaldi Dec 2016

Open Access Dissertations

As participation in women’s soccer continues to grow and the longevity of female athletes’ careers continues to increase, prevention of mTBI in women’s soccer has become a major concern for female athletes as the long-term risks associated with a history of mTBI are well documented. Among women’s sports, soccer exhibits the highest concussion rates, on par with those of men’s football at the collegiate level. Head impact monitoring technology has revealed that “concussive hits” occurring directly before symptomatic injury are not predictive of mTBI, suggesting that the cumulative effect of repetitive head impacts experienced by collision sport athletes should be …


Computational Environment For Modeling And Analysing Network Traffic Behaviour Using The Divide And Recombine Framework, Ashrith Barthur Dec 2016

Open Access Dissertations

There are two essential goals of this research. The first goal is to design and construct a computational environment that is used for studying large and complex datasets in the cybersecurity domain. The second goal is to analyse the Spamhaus blacklist query dataset which includes uncovering the properties of blacklisted hosts and understanding the nature of blacklisted hosts over time.

The analytical environment enables deep analysis of very large and complex datasets by exploiting the divide and recombine framework. The capability to analyse data in depth enables one to go beyond just summary statistics in research. This deep analysis is …


Functional Regression Models In The Frame Work Of Reproducing Kernel Hilbert Space, Simeng Qu Dec 2016

Open Access Dissertations

The aim of this thesis is to systematically investigate some functional regression models for accurately quantifying the effect of functional predictors. In particular, three functional models are studied: functional linear regression model, functional Cox model, and function-on-scalar model. Both theoretical properties and numerical algorithms are studied in depth. The new models find broad applications in many areas.

For the functional linear regression model, the focus is on testing the nullity of the slope function, and a generalized likelihood ratio test based on an easily implementable data-driven estimate is proposed. The quality of the test is measured by the minimal distance between …


Divide And Recombined For Large Complex Data: Nonparametric-Regression Modelling Of Spatial And Seasonal-Temporal Time Series, Xiaosu Tong Dec 2016

Open Access Dissertations

In the first chapter of this dissertation, I briefly introduce one type of nonparametric regression method, namely local polynomial regression, followed by emphasis on one specific application of loess to time series decomposition, called Seasonal Trend Loess (STL). The chapter closes with an introduction to the D&R (Divide and Recombine) statistical framework. Data can be divided into subsets, to each of which a statistical analysis method is applied. This is an embarrassingly parallel procedure since there is no communication between subsets. Then the analysis results for the subsets are combined to form the final analysis outcome for the …


Controlling For Confounding Network Properties In Hypothesis Testing And Anomaly Detection, Timothy La Fond Aug 2016

Open Access Dissertations

An important task in network analysis is the detection of anomalous events in a network time series. These events could merely be times of interest in the network timeline or they could be examples of malicious activity or network malfunction. Hypothesis testing using network statistics to summarize the behavior of the network provides a robust framework for the anomaly detection decision process. Unfortunately, choosing network statistics that are dependent on confounding factors like the total number of nodes or edges can lead to incorrect conclusions (e.g., false positives and false negatives). In this dissertation we describe the challenges that face …


Learning From Data: Plant Breeding Applications Of Machine Learning, Alencar Xavier Aug 2016

Open Access Dissertations

Increasingly, new sources of data are being incorporated into plant breeding pipelines. Enormous amounts of data from field phenomics and genotyping technologies place data mining and analysis on a completely different level that is challenging from practical and theoretical standpoints. Intelligent decision-making relies on our capability of extracting from data useful information that may help us to achieve our goals more efficiently. Many plant breeders, agronomists and geneticists perform analyses without knowing relevant underlying assumptions, strengths or pitfalls of the employed methods. The study endeavors to assess statistical learning properties and plant breeding applications of supervised and unsupervised machine learning …


Extreme-Strike And Small-Time Asymptotics For Gaussian Stochastic Volatility Models, Xin Zhang Aug 2016

Open Access Dissertations

This dissertation studies the asymptotic behavior of implied volatility. For extreme strike, we consider a stochastic volatility asset price model in which the volatility is the absolute value of a continuous Gaussian process with arbitrary prescribed mean and covariance. By exhibiting a Karhunen-Loève expansion for the integrated variance, and using sharp estimates of the density of a general second-chaos variable, we derive asymptotics for the asset price density for large or small values of the variable, and study the wing behavior of the implied volatility in these models. Our main result provides explicit expressions for the first …


The Design And Statistical Analysis Of Single-Cell Rna-Sequencing Experiments, Faye H. Zheng Aug 2016

Open Access Dissertations

Next-generation DNA- and RNA-sequencing (RNA-seq) technologies have expanded rapidly in both throughput and accuracy within the last decade. The momentum continues as emerging techniques become increasingly capable of profiling molecular content at the level of individual cells. One goal of this research is to put forward best practices in the design of single-cell RNA-sequencing (scRNA-seq) experiments, specifically as it relates to choices regarding the trade-off between sequencing depth and sample size. In addition to general guidelines, an interactive tool is presented to aid researchers in making experiment-specific decisions that are informed by real data and practical constraints. Further, a new …


Model-Free Variable Screening, Sparse Regression Analysis And Other Applications With Optimal Transformations, Qiming Huang Aug 2016

Open Access Dissertations

Variable screening and variable selection methods play important roles in modeling high dimensional data. Variable screening is the process of filtering out irrelevant variables, with the aim to reduce the dimensionality from ultrahigh to high while retaining all important variables. Variable selection is the process of selecting a subset of relevant variables for use in model construction. The main theme of this thesis is to develop variable screening and variable selection methods for high dimensional data analysis. In particular, we will present two relevant methods for variable screening and selection under a unified framework based on optimal transformations.

In the …


Maximum Empirical Likelihood Estimation In U-Statistics Based General Estimating Equations, Lingnan Li Aug 2016

Open Access Dissertations

In the first part of this thesis, we study maximum empirical likelihood estimates (MELE's) in U-statistics based general estimating equations (UGEE's). Our technical tool is the jackknife empirical likelihood (JEL) approach. We give the local uniform asymptotic normality condition for the log-JEL for UGEE's. We derive the estimating equations for finding MELE's and establish their asymptotic normality. We obtain easy MELE's which have less computational burden than the usual MELE's and can be easily implemented using existing software. We investigate the use of side information of the data to improve efficiency. We show that the MELE's are fully efficient, and …


Is Metabolism Goal-Directed? Investigating The Validity Of Modeling Biological Systems With Cybernetic Control Via Omic Data, Frank T. Devilbiss Apr 2016

Open Access Dissertations

Cybernetic models are uniquely juxtaposed to other metabolic modeling frameworks in that they describe the time-dependent regulation of cellular reactions in terms of dynamic "metabolic goals." This approach contrasts starkly with purely mechanistic descriptions of metabolic regulation which seek to explain metabolic processes in high resolution — a clearly daunting undertaking. Over a span of three decades, cybernetic models have been used to predict metabolic phenomena ranging from resource consumption in mixed-substrate environments to intracellular reaction fluxes of intricate metabolic networks. While the cybernetic approach has been validated in its utility for the prediction of metabolic phenomena, its central feature, …


User-Centric Workload Analytics: Towards Better Cluster Management, Suhas Raveesh Javagal Apr 2016

Open Access Theses

Managing computing clusters effectively and providing high-quality customer support are not trivial tasks. Due to the rise of community clusters there is an increase in the diversity of workloads and the user demographic. Owing to this and privacy concerns of the user, it is difficult to identify performance issues, reduce resource wastage and understand implicit user demands. In this thesis, we perform in-depth analysis of user behavior, performance issues, resource usage patterns and failures in the workloads collected from a university-wide community cluster and two clusters maintained by a government lab. We also introduce a set of …


Implementation And Validation Of A Probabilistic Open Source Baseball Engine (Posbe): Modeling Hitters And Pitchers, Rhett Tracy Schaefer Apr 2016

Open Access Theses

This manuscript details the implementation and validation of an open source probabilistic baseball engine (POSBE) that focuses on the hitter and pitcher model of the simulation. The simulation produced outcomes that parallel those observed in actual professional Major League Baseball games. The observed data were taken from the nineteen games played between the New York Yankees (NYY) and Boston Red Sox (BOS) during the 2015 season. The potential hitter/pitcher outcomes of interest were singles, doubles, triples, homeruns, walks, hit-by-pitch, and strikeouts. The nineteen game series was simulated 1000 times, resulting in a total of 19,000 simulations. The eighteen hitters and …


A Flexible And Versatile Framework For Statistical Design And Analysis Of Quantitative Mass Spectrometry-Based Proteomic Experiments, Meena Choi Feb 2016

Open Access Dissertations

Quantitative mass spectrometry (MS)-based proteomics is an indispensable technology for biological and clinical research. As the proteomics field grows, MS-based proteomic workflows are becoming more complex and diverse. The accuracy and the throughput of the MS measurements and of the signal processing tools have dramatically increased. However, many existing statistical tools and workflows have not kept pace with the technological development. Therefore, there is a need for flexible statistical tools, which reflect diverse and complex workflows, are computationally efficient for large datasets, and maximize the reproducibility of the results.

We propose a family of linear mixed effects models, and a split-plot view of …


Overcoming Uncertainty For Within-Network Relational Machine Learning, Joseph J. Pfeiffer Apr 2015

Open Access Dissertations

People increasingly communicate through email and social networks to maintain friendships and conduct business, as well as share online content such as pictures, videos and products. Relational machine learning (RML) utilizes a set of observed attributes and network structure to predict corresponding labels for items; for example, to predict individuals engaged in securities fraud, we can utilize phone calls and workplace information to make joint predictions over the individuals. However, in large scale and partially observed network domains, missing labels and edges can significantly impact standard relational machine learning methods by introducing bias into the learning and inference processes. In …


Stability Of Machine Learning Algorithms, Wei Sun Apr 2015

Open Access Dissertations

In the literature, the predictive accuracy is often the primary criterion for evaluating a learning algorithm. In this thesis, I will introduce novel concepts of stability into the machine learning community. A learning algorithm is said to be stable if it produces consistent predictions with respect to small perturbation of training samples. Stability is an important aspect of a learning procedure because unstable predictions can potentially reduce users' trust in the system and also harm the reproducibility of scientific conclusions. As a prototypical example, stability of the classification procedure will be discussed extensively. In particular, I will present two new …


The Stability Of The Iris As A Biometric Modality, Benjamin Wright Petry Apr 2015

Open Access Theses

In this thesis, the question of the stability of a group of individual subjects' irises is examined and answered. This stability is examined in regards to the time scale of the month range. The covariate for this research was time. Images collected during one month of separation between captures were examined. The genuine and impostor scores for these images were calculated and then interpreted using the stability score index. This index produced a quantifiable value for the stability of iris match scores over the months of the examination.

Additionally, a new framework for collecting and analyzing time in biometrics …


Divide And Recombine For Large Complex Data: The Subset Likelihood Modeling Approach To Recombination, Philip Gautier Apr 2015

Open Access Dissertations

Divide and recombine (D&R) is a statistical framework for the analysis of large complex data. The data are divided into subsets. Numeric and visualization methods, which collectively are analytic methods, are applied to each subset. For each analytic method, the outputs of the application of the method to the subsets are recombined. So each analytic method has associated with it a division method and a recombination method. Here we study D&R methods for likelihood-based model fitting. We introduce a notion of likelihood analysis and modeling. We divide the data and fit a likelihood model on each subset. The fitted model …


A Pure-Jump Market-Making Model For High-Frequency Trading, Chi Wai Law Apr 2015

Open Access Dissertations

We propose a new market-making model which incorporates a number of realistic features relevant for high-frequency trading. In particular, we model the dependency structure of prices and order arrivals with novel self- and cross-exciting point processes. Furthermore, instead of assuming the bid and ask prices can be adjusted continuously by the market maker, we formulate the market maker's decisions as an optimal switching problem. Moreover, the risk of overtrading has been taken into consideration by allowing each order to have a different size, and the market maker can make use of market orders, which are treated as impulse control, to get …


Spatial Analysis Of Passenger Vehicle Use And Ownership And Its Impact On The Sustainability Of Highway Infrastructure Funding, Matthew Volovski Apr 2015

Open Access Dissertations

Across the United States, the sustainability of highway funding is at risk due to increasing need and uncertainty in the factors that drive revenue. Past studies on highway funding sustainability have identified that the root causes of changing highway revenue are shifts in social demographics and economic characteristics. Unfortunately, from the revenue perspective (the focus of this dissertation), the ability of previous research to account for these factors has been rather limited in two ways: first, the inability to accurately assess current regional vehicle use (a typical prerequisite for statistical modeling of highway revenues) due to difficulties associated with …


Probabilistic Uncertainty Quantification And Experiment Design For Nonlinear Models: Applications In Systems Biology, Vu Cao Duy Thien Dinh Oct 2014

Open Access Dissertations

Despite the ever-increasing interest in understanding biology at the system level, there are several factors that hinder studies and analyses of biological systems. First, unlike systems from other applied fields whose parameters can be effectively identified, biological systems are usually unidentifiable, even in the ideal case when all possible system outputs are known with high accuracy. Second, the presence of multivariate bifurcations often leads the system to behaviors that are completely different in nature. In such cases, system outputs (as functions of parameters/inputs) are usually discontinuous or have sharp transitions across domains with different behaviors. Finally, models from systems biology …


Application Of Bayesian Networks In Consumer Service Industry, Yuan Gao Oct 2014

Open Access Theses

Gao, Yuan. M.S.I.E., Purdue University. December 2014. Application of Bayesian Networks in Consumer Service Industry. Major professor: Vincent G. Duffy.

The purpose of the present study is to explore the application of Bayesian networks in the consumer service industry to model causal relationships within complex risk factor structures using aggregate data. An analysis of the Hawaii tourism market was conducted to find out how visitor characteristics affect their behavior and experience as consumers during the trips, and influence the tourism market outcomes represented by measurable factors. Two hypotheses were proposed regarding the use of aggregate data and the influence of …


On The Occurrences Of Motifs In Recursive Trees, With Applications To Random Structures, Mohan Gopaladesikan Oct 2014

Open Access Dissertations

In this dissertation we study three problems related to motifs and recursive trees. In the first problem we consider a collection of uncorrelated motifs and their occurrences on the fringe of random recursive trees. We compute the exact mean and variance of the multivariate random vector of the counts of occurrences of the motifs. We further use the Cramér-Wold device and the contraction method to show an asymptotic convergence in distribution to a multivariate normal random variable with this mean and variance.

The second problem we study is that of the probability that a collection of motifs (of the …


Divide And Recombine: Autoregressive Models And Stl+, Xiang Han Oct 2014

Open Access Dissertations

In this thesis multiple methods are proposed and applied to the Akamai CIDR time series data. The Akamai network is one of the world's largest distributed-computing platforms, with more than 250,000 servers in more than 80 countries. It is responsible for 15-20 percent of all web traffic. We obtained 110 GB of raw CIDR data over an 18-month period, collected on the Akamai network from November 2011 to April 2013.

The Seasonal-Trend Decomposition procedure based on loess (STL+) is used to model the CIDR series. Motivated by the CIDR series analysis, we propose a general prediction based model selection …


Spatial Marked Point Processes: Models And Inferences, Yen-Ning E Huang Oct 2014

Open Access Dissertations

A spatial marked point process describes the locations of randomly distributed events in a region, with a mark attached to each observed point. Nowadays, the availability of spatiotemporal data is increasing and many spatiotemporal models are studied with applications in a wide range of disciplines. Spatial marked point processes are then extended to spatiotemporal marked point processes if the time component is taken into account. In general, the marks can be quantitative or categorical variables. Independence between points and marks is a convenient assumption, but may not be true in practice. Tests for independence between points and marks have been proposed previously, …


The Tessera D&R Computational Environment: Designed Experiments For R-Hadoop Performance And Bitcoin Analysis, Jianfu Li Oct 2014

Open Access Dissertations

D&R is a statistical framework that enables feasible and practical analysis of large complex data. The analyst selects a division method to divide the data into subsets, applies an analytic method to each subset independently with no communication among subsets, and selects a recombination method that is applied to the outputs across subsets to form a result of the analytic method for the entire data. The computational tasking of D&R is nearly embarrassingly parallel, so D&R can readily exploit distributed, parallel computational environments, such as our D&R computational environment, Tessera.

In …
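
The divide, apply, and recombine steps named in this abstract can be illustrated with a minimal Python sketch. It is not taken from the dissertation or from Tessera: the function names, the round-robin division, the choice of the sample mean as the analytic method, and the size-weighted average as the recombination method are all assumptions made for illustration.

```python
from statistics import fmean


def divide(data, n_subsets):
    """Division method (assumed): deal records round-robin into n_subsets lists."""
    return [data[i::n_subsets] for i in range(n_subsets)]


def analyze(subset):
    """Analytic method (assumed): the subset's mean, returned with its size."""
    return fmean(subset), len(subset)


def recombine(outputs):
    """Recombination method (assumed): size-weighted average of subset means."""
    total = sum(n for _, n in outputs)
    return sum(mean * n for mean, n in outputs) / total


def divide_and_recombine(data, n_subsets):
    # Drop empty subsets, which occur when n_subsets exceeds len(data).
    subsets = [s for s in divide(data, n_subsets) if s]
    # Each subset is analyzed on its own, with no communication among subsets,
    # so this step could be handed to parallel workers unchanged.
    outputs = [analyze(s) for s in subsets]
    return recombine(outputs)
```

For the mean, this particular recombination reproduces the all-data answer exactly; for most analytic methods a D&R result is only an approximation of the result computed on the entire data.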


Modeling Spatial Covariance Functions, Inkyung Choi Jul 2014

Open Access Dissertations

Covariance modeling plays a key role in spatial data analysis as it provides important information about the dependence structure of underlying processes and determines the performance of spatial prediction. Various parametric models have been developed to accommodate the idiosyncratic features of a given dataset. However, the parametric models may impose unjustified restrictions on the covariance structure and the procedure of choosing a specific model is often ad hoc. In the first part of the dissertation, a new nonparametric covariance model that can avoid the choice of parametric forms is proposed. The estimator is obtained via a nonparametric approximation of completely monotone …


Identification Of Genomic Factors Using Family-Based Association Studies, Libo Wang Jan 2014

Open Access Dissertations

Genome-wide association studies have become increasingly popular and important for detecting genetic associations of complex traits. However, it is well known that spurious associations could arise from statistical analysis without proper consideration of genetic relatedness of samples. Many methods have been proposed to guard against these spurious associations. Here we focus on multi-locus association studies of quantitative traits and the case-control status, and propose algorithms that take genetically related samples into consideration to address possible confounding issues. As supervised dimension reduction methods, these algorithms perform well in association studies with a large number of biomarkers but a relatively small …