Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 26 of 26

Full-Text Articles in Physical Sciences and Mathematics

Contrastive Learning, With Application To Forensic Identification Of Source, Cole Ryan Patten Jan 2024

Electronic Theses and Dissertations

Forensic identification of source problems often fall under the category of verification problems, where recent advances in deep learning have been made by contrastive learning methods. Many forensic identification of source problems deal with a scarcity of data, an issue addressed by few-shot learning. In this work, we make specific what makes a neural network a contrastive network. We then consider the use of contrastive neural networks for few-shot learning classification problems and compare them to other statistical and deep learning methods. Our findings indicate similar performance between models trained by contrastive loss and models trained by cross-entropy loss. We …
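As a rough illustration of the kind of objective a contrastive network optimizes, the sketch below implements one common pairwise contrastive loss (pull same-source pairs together, push different-source pairs at least a margin apart). The function name, the `margin` parameter, and the 0.5 scaling are assumptions chosen for illustration; the abstract does not state which loss the thesis uses.

```python
def contrastive_loss(distance, same_source, margin=1.0):
    """Pairwise contrastive loss on an embedding distance (illustrative form).

    distance    -- non-negative distance between the two embeddings
    same_source -- True if the pair shares a source, False otherwise
    margin      -- hypothetical separation required for different-source pairs
    """
    if same_source:
        # Same-source pairs are penalized for being far apart.
        return 0.5 * distance ** 2
    # Different-source pairs are penalized only when closer than the margin.
    return 0.5 * max(0.0, margin - distance) ** 2
```

A cross-entropy model, by contrast, scores each example against fixed class labels rather than against another example, which is why the pairwise form is a natural fit for verification problems.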


A Class Of Regression Models For Pairwise Comparisons Of Forensic Handwriting Comparison Systems, Cami M. Fuglsby Jan 2023

Electronic Theses and Dissertations

Handwriting analysis is a complex field largely living in forensic science and the legal realm. One task of a forensic document examiner (FDE) may be to determine the writer(s) of handwritten documents. Automated identification systems (AIS) were built to aid FDEs in their examinations. Among the uses of these AIS (such as FISH [5] [7], WANDA [6], CEDAR-FOX [17], and FLASHID®2) are to measure features of a handwriting sample and to provide the user with a numeric value of the evidence. These systems use their own algorithms and definitions of features to quantify the writing and can be considered black boxes. The …


Using Deep Neural Networks To Analyze Precision Agriculture Data, Stephanie Liebl Jan 2022

Electronic Theses and Dissertations

As the population of the Earth increases, there is a growing need for food to feed the inhabitants. Precision agriculture offers techniques and tools that can be used to help accommodate the growing population. One specific precision agriculture tool is remote sensing data, which can be used to image fields as an effort to better predict or understand the crops. In this thesis, deep neural networks are used to evaluate various spatial, spectral, and temporal resolutions of three different satellite images to determine which best predicts corn yield. The main metrics we used to evaluate the models were R-squared (R²), …


Comparison Of Software Packages For Detecting Differentially Expressed Genes From Single-Sample RNA-Seq Data, Rong Zhou Jan 2021

Electronic Theses and Dissertations

RNA-sequencing (RNA-seq) has rapidly become the tool in many genome-wide transcriptomic studies. It provides a way to understand the RNA environment of cells in different physiological or pathological states to determine how cells respond to these changes. RNA-seq provides quantitative information about the abundance of different RNA species present in a given sample. If the difference or change observed in the read counts or expression level between two experimental conditions is statistically significant, the gene is declared as differentially expressed. A large number of methods for detecting differentially expressed genes (DEGs) with RNA-seq have been developed, such as the methods …


Methods For High-Dimensional Spatial Data: Dimension Reduction And Covariance Approximation, Paul May Jan 2021

Electronic Theses and Dissertations

In spatial statistics, because quantities are correlated based on their relative positions in space, data is modeled as a single realization of a multivariate stochastic process. Spatial data can be high-dimensional either through a large number of observed variables per location, or through a large number of observed locations. The two are often handled differently, with the former addressed through dimension reduction and the latter addressed through appropriate modeling of the spatial correlation between locations. The main body of this dissertation is a three-part work. Parts 2 and 3 pertain to the "many variables" problem, proposing novel methods of dimension …


Development And Properties Of The ROC-ABC Bayes Factor For The Quantification Of The Weight Of Forensic Evidence, Jessie Hendricks Jan 2021

Electronic Theses and Dissertations

Many scholars have proposed the use of a Bayes factor to quantify the weight of forensic evidence. However, due to the complex and high-dimensional nature of pattern evidence, likelihood functions are intractable and thus, Bayes factors cannot be assigned using traditional methods. Approximate Bayesian Computation (ABC) model selection algorithms provide likelihood-free methods to assign Bayes factors. ABC Bayes factors leverage the use of the scoring functions commonly used in recent years in forensic statistics in a rigorous statistical manner. However, traditional methods for assigning ABC Bayes factors are subject to several criticisms. In this dissertation, one of the main criticisms …
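To make the likelihood-free idea concrete, the following is a minimal sketch of basic rejection ABC for model selection, not the ROC-ABC method developed in the dissertation. It assumes equal prior model probabilities and equal simulation budgets per model, so the ratio of acceptance counts approximates the Bayes factor. All names (`abc_bayes_factor`, `simulate_m1`, `simulate_m2`, `eps`) and the toy Gaussian models are hypothetical.

```python
import random

def abc_bayes_factor(observed, simulate_m1, simulate_m2, distance, eps, n_sims, rng):
    """Approximate the Bayes factor of model 1 versus model 2 by rejection ABC.

    Each simulator takes a random.Random instance and returns one simulated
    summary (or score). A draw is accepted when it falls within eps of the
    observed summary under the supplied distance function.
    """
    accepted = [0, 0]
    for _ in range(n_sims):
        for k, simulate in enumerate((simulate_m1, simulate_m2)):
            if distance(simulate(rng), observed) <= eps:
                accepted[k] += 1
    if accepted[1] == 0:
        # No acceptances under model 2: the ratio is unbounded or undefined.
        return float("inf") if accepted[0] > 0 else float("nan")
    return accepted[0] / accepted[1]

# Toy usage: the observed value 0.2 is far more plausible under N(0, 1) than N(3, 1).
rng = random.Random(0)
bf = abc_bayes_factor(
    observed=0.2,
    simulate_m1=lambda r: r.gauss(0.0, 1.0),
    simulate_m2=lambda r: r.gauss(3.0, 1.0),
    distance=lambda x, y: abs(x - y),
    eps=0.25,
    n_sims=5000,
    rng=rng,
)
```

The tolerance `eps` trades bias against variance: a smaller value tracks the true Bayes factor more closely but accepts fewer draws, which is one reason plain rejection ABC draws criticism.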


Development Of A Probabilistic Multi-Class Model Selection Algorithm For High-Dimensional And Complex Data, Madeline Anne Ausdemore Jan 2021

Electronic Theses and Dissertations

The development of quantifiable measures of uncertainty in forensic conclusions has resulted in the debut of several ad-hoc methods for approximating the weight of evidence (WoE). In particular, forensic researchers have attempted to use similarity measures, or scores, to approximate the weight of evidence characterized by high-dimensional and complex data. Score-based methods have been proposed to approximate the WoE for numerous evidence types (e.g., fingerprints, handwriting, inks, voice analysis). In general, score-based methods consider the score as a projection onto the real line. For example, the score-based likelihood ratio evaluates and compares the likelihoods of a score calculated between two objects …


Finite Mixture Of Regression Models For Complex Survey Data, Abdelbaset Abdalla Jan 2019

Electronic Theses and Dissertations

Over time, survey data has become an essential source of information for modern society. However, to be effective, the structures of survey data require sampling designs that are more complex than simple random sampling. The complex sampling data collected from enormous national surveys via these complex designs ideally include sample weights that allow analysis to take account of complicated population structures. When the target of inference is the parameters of a regression model, it is crucial to know whether these weights should be incorporated into the sampling weight when fitting the model to the survey data. The finite mixture models …


Applying Machine Learning Algorithms For The Analysis Of Biological Sequences And Medical Records, Shaopeng Gu Jan 2019

Electronic Theses and Dissertations

Modern sequencing technology has revolutionized genomic research and triggered explosive growth in DNA, RNA, and protein sequence data. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps (feature extraction, dimensionality reduction, feature selection, and construction of predictive models based on machine learning and deep learning approaches) to analyze sequences. We used enhancers, RNA N6-methyladenosine sites and …


Development Of A Data-Driven Patient Engagement Score Using Finite Mixture Models, Eric Bae Jan 2019

Electronic Theses and Dissertations

Patient activation measure (PAM), licensed by Insignia Health, is widely adopted by health care providers to assess an individual's knowledge, skill, and confidence for managing one's health and healthcare. Multiple studies corroborate the effectiveness of the activation measure in predicting most health behaviors, including preventive behaviors, healthy behaviors, self-management behaviors, and health information seeking. However, PAM is heavily dependent on subjective patient-reported data, which are often incomplete. The purpose of this study is to develop an objective statistical …


Location Optimization Of A Coal Power Plant To Balance Coal Supply And Electric Transmission Costs Against Plant’s Emission Exposure, Najam Khan Jan 2018

Electronic Theses and Dissertations

This research is focused on developing a location analysis methodology that can minimize the pollutant exposure to the public while ensuring that the combined costs of electric transmission losses and coal logistics are minimized. Coal power plants will provide a critical contribution towards meeting electricity demands for various nations in the foreseeable future. The site selection for a new coal power plant is extremely important from an investment point of view. The operational costs for running a coal power plant can be minimized by a combined emphasis on placing a coal power plant near coal mines as well as customers. …


Statistical Algorithms And Bioinformatics Tools Development For Computational Analysis Of High-Throughput Transcriptomic Data, Adam McDermaid Jan 2018

Electronic Theses and Dissertations

Next-Generation Sequencing technologies allow for a substantial increase in the amount of data available for various biological studies. In order to effectively and efficiently analyze this data, computational approaches combining mathematics, statistics, computer science, and biology are implemented. Even with the substantial efforts devoted to development of these approaches, numerous issues and pitfalls remain. One of these issues is mapping uncertainty, in which read alignment results are biased due to the inherent difficulties associated with accurately aligning RNA-Sequencing reads. GeneQC is an alignment quality control tool that provides insight into the severity of mapping uncertainty in each annotated gene from …


Variable Selection Techniques For Clustering On The Unit Hypersphere, Damon Bayer Jan 2018

Electronic Theses and Dissertations

Mixtures of von Mises-Fisher distributions have been shown to be an effective model for clustering data on a unit hypersphere, but variable selection for these models remains an important and challenging problem. In this paper, we derive two variants of the expectation-maximization framework, which are each used to identify a specific type of irrelevant variables for these models. The first type are noise variables, which are not useful for separating any pairs of clusters. The second type are redundant variables, which may be useful for separating pairs of clusters, but do not enable any additional separation beyond the separability provided …


The Impact Of Data Sovereignty On American Indian Self-Determination: A Framework Proof Of Concept Using Data Science, Joseph Carver Robertson Jan 2018

Electronic Theses and Dissertations

The Data Sovereignty Initiative is a collection of ideas that was designed to create SMART solutions for tribal communities. This concept was to develop a horizontal governance framework to create a strategic act of sovereignty using data science. The core concept of this idea was to present data sovereignty as a way for tribal communities to take ownership of data in order to affect policy and strategic decisions that are data driven in nature. The case studies in this manuscript were developed around statistical theories of spatial statistics, exploratory data analysis, and machine learning. And although these case studies are …


Development Of Biclustering Techniques For Gene Expression Data Modeling And Mining, Juan Xie Jan 2018

Electronic Theses and Dissertations

The next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than the array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that allows clustering of rows and columns in a gene …


U-Statistics For Characterizing Forensic Sufficiency Studies, Cami Fuglsby Jan 2017

Electronic Theses and Dissertations

One of the main metrics for deciding if a given forensic modality is useful across a broad spectrum of cases, within a given population, is the Random Match Probability (RMP), or the corresponding discriminating power. Traditionally, the RMP of a given modality is studied by comparing full 'templates' and estimating the rate at which pairs of templates 'match' in a given population. This strategy leads to a natural U-statistic of degree two. However, in questioned document examination, the RMP is studied as a function of the amount of handwriting contained in the two documents being compared, turning the U-statistic into …
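The traditional estimator described here can be sketched as a U-statistic of degree two: the average of a symmetric match indicator over all unordered pairs of templates. The function names and the equality-based match rule in the toy example are assumptions for illustration only; a real system would supply its own matching criterion.

```python
from itertools import combinations

def rmp_u_statistic(templates, match):
    """Estimate the Random Match Probability as a degree-two U-statistic.

    templates -- sample of templates drawn from the population
    match     -- symmetric function returning True when two templates 'match'
    """
    pairs = list(combinations(templates, 2))
    if not pairs:
        raise ValueError("need at least two templates")
    # Average the match indicator over all unordered pairs.
    return sum(1 for a, b in pairs if match(a, b)) / len(pairs)

def discriminating_power(templates, match):
    """Complement of the estimated RMP."""
    return 1.0 - rmp_u_statistic(templates, match)

# Toy usage with a hypothetical match rule (exact equality of labels).
toy_templates = ["A", "B", "A", "C"]
estimate = rmp_u_statistic(toy_templates, lambda a, b: a == b)
```

Because every pair shares templates with other pairs, the summands are dependent, which is what makes U-statistic theory (rather than a simple binomial argument) the appropriate tool for the variance.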


Response Surface Methodology And Its Application In Optimizing The Efficiency Of Organic Solar Cells, Rajab Suliman Jan 2017

Electronic Theses and Dissertations

Response surface methodology (RSM) is a ubiquitous optimization approach used in a wide variety of scientific research studies. The philosophy behind a response surface method is to sequentially run relatively simple experiments or models in order to optimize a response variable of interest. In other words, we run a small number of experiments sequentially that can provide a large amount of information upon augmentation. In this dissertation, the RSM technique is utilized in order to find the optimum fabrication condition of a polymer solar cell that maximizes the cell efficiency. The optimal device performance was achieved using 10.25 mg/ml polymer …


Threshold Models For Genome-Wide Association Mapping Of Familial Breast Cancer Incidence In Humans, Nasir Elmesmari Jan 2017

Electronic Theses and Dissertations

Breast cancer is the second most fatal cancer in the world and one of the most harmful cancers from which people suffer. Breast cancer studies have been able to uncover some knowledge about genetic susceptibility for familial breast cancer in humans. Hence, determining genetic factors may potentially help track the disease, as well as discover the cancer in early stages, or perhaps before it starts. In addition, this may allow early determination of possible treatment strategies, which will make it easier to prevent the disease. In this context, it is important to determine whether the heritability of breast cancer …


Comparative Study Of The Distribution Of Repetitive DNA In Model Organisms, Mohamed K. Aburweis Jan 2017

Electronic Theses and Dissertations

Repetitive DNA elements are abundant in the genome of a wide range of organisms. In mammals, repetitive elements comprise about 40-50% of the total genomes. However, their biological functions remain largely unknown. Analysis of their abundance and distribution may shed some light on how they affect genome structure, function, and evolution. We conducted a detailed comparative analysis of repetitive DNA elements across ten different eukaryotic organisms, including chicken (G. gallus), zebrafish (D. rerio), Fugu (T. rubripes), fruit fly (D. melanogaster), and nematode worm (C. elegans), along with five mammalian organisms: human (H. sapiens), mouse (M. musculus), cow (B. taurus), rat …


Development Of Computational Techniques For Regulatory DNA Motif Identification Based On Big Biological Data, Jinyu Yang Jan 2017

Electronic Theses and Dissertations

Accurate regulatory DNA motif (or motif) identification plays a fundamental role in the elucidation of transcriptional regulatory mechanisms in a cell and can strongly support the regulatory network construction for both prokaryotic and eukaryotic organisms. Next-generation sequencing techniques generate a huge amount of biological data for motif identification. Specifically, Chromatin Immunoprecipitation followed by high throughput DNA sequencing (ChIP-seq) enables researchers to identify motifs on a genome scale. Recently, technological improvements have allowed for DNA structural information to be obtained in a high-throughput manner, which can provide four DNA shape features. The DNA shape has been found to be a complementary factor …


Development And Properties Of Kernel-Based Methods For The Interpretation And Presentation Of Forensic Evidence, Douglas Armstrong Jan 2017

Electronic Theses and Dissertations

The inference of the source of forensic evidence is related to model selection. Many forms of evidence can only be represented by complex, high-dimensional random vectors and cannot be assigned a likelihood structure. A common approach to circumvent this is to measure the similarity between pairs of objects composing the evidence. Such methods are ad-hoc and unstable approaches to the judicial inference process. While these methods address the dimensionality issue, they also engender dependencies between scores (which arise when two scores have one object in common) that are not taken into account in these models. The model developed in this research captures …


Approximate Statistical Solutions To The Forensic Identification Of Source Problem, Danica M. Ommen Jan 2017

Electronic Theses and Dissertations

Currently in forensic science, the statistical methods for solving the identification of source problems are inherently subjective and generally ad-hoc. The formal Bayesian decision framework provides the most statistically rigorous foundation for these problems to date. However, computing a solution under this framework, which relies on a Bayes Factor, tends to be computationally intensive and highly sensitive to the subjective choice of prior distributions for the parameters. Therefore, this dissertation aims to develop statistical solutions to the forensic identification of source problems which are less subjective, but which retain the statistical rigor of the Bayesian solution. First, this dissertation focuses …


Identifying Predictors Of Weight Loss And Drop-Out Using Joint Modeling, Valerie Bares Jan 2017

Electronic Theses and Dissertations

Profile by Sanford is a membership-based weight loss program that helps its members make lifestyle changes with diet, exercise, and one-on-one interactions with a weight loss coach. Discovery of characteristics and behaviors influencing weight loss will benefit current and future members of Profile. This research utilizes massive data from Profile by Sanford to analyze member behavior. Fourteen data sets are evaluated, some containing millions of observations. All data is combined into one comprehensive table of 33,487 members. Members of Profile by Sanford are 77% female and two-thirds of all members start the program classified as obese. Attending meetings with …


A Kernel Based Approach To Determine Atypicality, Austin O'Brien Jan 2017

Electronic Theses and Dissertations

This dissertation outlines the development and use of a new probabilistic measure for categorization, referred to as atypicality. Given a set of known source objects, we can create a corresponding set of similarity scores between them. Assuming the set of scores has a normal distribution, we can estimate its parameters. Then, we can introduce new trace objects to the problem, and compute similarity scores for them. The main goal of the atypicality score is to determine if the new trace objects are similar to the source objects. To do this, we bootstrap many new scores using the estimated parameters (from …
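The first steps described above (pairwise scores among source objects, normal parameter estimates, then scores for a trace object) can be sketched as follows. The similarity function used here, negative Euclidean distance, is a stand-in chosen for illustration, and all function names are hypothetical. The bootstrap step is not sketched because the abstract is truncated before it is fully described.

```python
import math
from itertools import combinations
from statistics import mean, stdev

def similarity(a, b):
    # Hypothetical score: negative Euclidean distance (larger means more similar).
    return -math.dist(a, b)

def source_score_parameters(sources):
    """Estimate normal parameters from all pairwise source-to-source scores."""
    scores = [similarity(a, b) for a, b in combinations(sources, 2)]
    # Sample mean and sample standard deviation of the scores.
    return mean(scores), stdev(scores)

def trace_scores(trace, sources):
    """Similarity scores between one trace object and every source object."""
    return [similarity(trace, s) for s in sources]

# Toy usage with three 2-D source objects and one trace object.
sources = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
mu, sigma = source_score_parameters(sources)
scores_for_trace = trace_scores((0.0, 0.0), sources)
```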


Spatial And Spatiotemporal Modeling Of Epidemiological Data, Laxman Karki Jan 2017

Electronic Theses and Dissertations

This dissertation focuses on modeling approaches for spatial and spatiotemporal data with epidemiological applications. Chapter one gives a general overview of spatial and spatiotemporal data, the challenges in their statistical analysis, and the motivation and objectives of the study. Chapter two describes the regression models commonly used in spatial data analysis. Various types of regression methods such as OLS, GWR and MGWR were used to study the association between diabetes prevalence and socioeconomic and lifestyle factors using county-level data from the Midwestern United States. A new analysis workflow is proposed for regression analysis of spatial data. …


Identifying Data Centers From Satellite Imagery, Adam Buskirk Jan 2016

Electronic Theses and Dissertations

We develop two different descriptors which can be utilized to describe satellite imagery. The first, the differential-magnitude and radius descriptor, describes a scene by computing the directional gradient of the scene with respect to a vector field whose solutions are circles around a pixel to be described, and then counts pixels in a descriptor matrix according to the magnitude of this gradient and the distance at which this magnitude occurs. The second, the radial Fourier descriptor, extracts from the scene a sequence of annuloid sectors, and uses this to approximate the behavior of the image on a circle around the …