Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 14 of 14

Full-Text Articles in Physical Sciences and Mathematics

Computationally Efficient Confidence Intervals For Cross-Validated Area Under The ROC Curve Estimates, Erin Ledell, Maya L. Petersen, Mark J. Van Der Laan Dec 2012

U.C. Berkeley Division of Biostatistics Working Paper Series

In binary classification problems, the area under the ROC curve (AUC) is an effective means of measuring the performance of a model. Cross-validation is most often used as well, to assess how the results will generalize to an independent data set. To evaluate the quality of an estimate of cross-validated AUC, we must obtain an estimate of its variance. For massive data sets, the process of generating a single performance estimate can be computationally expensive. Additionally, when using a complex prediction method, calculating the cross-validated AUC on even a relatively small data set can still require a …
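As a rough illustration of the quantity this abstract refers to, the sketch below computes a cross-validated AUC point estimate from held-out predictions. It is not the authors' method: it covers only the point estimate, not the variance or confidence interval that the paper is about. The function names `auc` and `cross_validated_auc` are hypothetical, and the fold-averaging definition is an assumption made for illustration.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive example outscores
    a random negative example, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def cross_validated_auc(
    fold_labels: Sequence[Sequence[int]],
    fold_scores: Sequence[Sequence[float]],
) -> float:
    """Average the AUC over folds, where each fold holds the true labels
    and the predictions made by a model trained on the other folds."""
    fold_aucs = [auc(y, s) for y, s in zip(fold_labels, fold_scores)]
    return sum(fold_aucs) / len(fold_aucs)
```

The pairwise comparison is quadratic in the fold size, which is acceptable for a sketch but hints at why a single estimate becomes expensive on massive data sets.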


Identification Of TCP Protocols, Juan Shao Dec 2012

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

Recently, many new TCP algorithms, such as BIC, CUBIC, and CTCP, have been deployed in the Internet. Investigating the deployment statistics of these TCP algorithms is useful for studying the performance and stability of the Internet. Currently, there is a tool named Congestion Avoidance Algorithm Identification (CAAI) for identifying the TCP algorithm of a web server and then for investigating the TCP deployment statistics. However, CAAI, which uses a simple k-NN algorithm, cannot achieve high identification accuracy. In this thesis, we comprehensively study the identification accuracy of five popular machine learning models. We find that the random forest model …


Bayesian Text Analytics For Document Collections, Daniel David Walker Nov 2012

Theses and Dissertations

Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians, and other scholars increasingly need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the document modeling community towards this end, though most of it is focused on modern, relatively clean text data. We present research for improved …


Geocam: A Geovisual Analytics Workspace To Contextualize And Interpret Statements About Movement, Anuj Jaiswal, Scott Pezanowski, Prasenjit Mitra, Xiao Zhang, Sen Xu, Ian Turton, Alexander Klippel, Alan M. Maceachren Oct 2012

Journal of Spatial Information Science

This article focuses on integrating computational and visual methods in a system that supports analysts to identify, extract, map, and relate linguistic accounts of movement. We address two objectives: (1) build the conceptual, theoretical, and empirical framework needed to represent and interpret human-generated directions; and (2) design and implement a geovisual analytics workspace for direction document analysis. We have built a set of geo-enabled computational methods to identify documents containing movement statements and a visual analytics environment that uses natural language processing methods iteratively with geographic database support to extract, interpret, and map geographic movement references in context. Additionally, analysts …


Linguistic Spatial Classifications Of Event Domains In Narratives Of Crime, Blake Stephen Howald Oct 2012

Journal of Spatial Information Science

Structurally, formal definitions of the linguistic narrative minimally require two temporally linked past-time events. The role of space in this definition, based on spatial language indicating where events occur, is considered optional and non-structural. However, based on narratives with a high frequency of spatial language, recent research has questioned this perspective, suggesting that space is more critical than may be readily apparent. Through an analysis of spatially rich serial criminal narratives, it will be demonstrated that spatial information qualitatively varies relative to narrative events. In particular, statistical classifiers in a supervised machine learning task achieve a 90% accuracy in predicting …


A New Web Search Engine With Learning Hierarchy, Da Kuang Aug 2012

Electronic Thesis and Dissertation Repository

Most existing web search engines (such as Google and Bing) take the form of keyword-based search. Typically, after the user issues a query with the keywords, the search engine returns a flat list of results. When the query issued by the user is related to a topic, keyword matching alone may not accurately retrieve the whole set of webpages on that topic. On the other hand, there exists another type of search system, particularly in e-Commerce websites, where the user can search in the categories of different faceted hierarchies (e.g., product types and price …


A Confidence-Prioritization Approach To Data Processing In Noisy Data Sets And Resulting Estimation Models For Predicting Streamflow Diel Signals In The Pacific Northwest, Nathaniel Lee Gustafson Aug 2012

Theses and Dissertations

Streams in small watersheds are often known to exhibit diel fluctuations, in which streamflow oscillates on a 24-hour cycle. Streamflow diel fluctuations, which we investigate in this study, are an informative indicator of environmental processes. However, in environmental data sets, as in many others, there is a range of noise associated with individual data points. Some points are extracted under relatively clear and defined conditions, while others may include a range of known or unknown confounding factors, which may decrease those points' validity. These points may or may not remain useful for training, depending on how much uncertainty they …


On The K-Mer Frequency Spectra Of Organism Genome And Proteome Sequences With A Preliminary Machine Learning Assessment Of Prime Predictability, Nathan O. Schmidt Aug 2012

Boise State University Theses and Dissertations

A regular expression and region-specific filtering system for biological records at the National Center for Biotechnology database is integrated into an object-oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains for the coding and non-coding regions are comparatively scrutinized. We observe that the naturally-evolved (NCBI/organism) and the artificially-biased (randomly-generated) sequences exhibit a clear deviation from the artificially-unbiased (randomly-generated) histogram distributions. …
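The two central notions in this abstract, k-mer frequencies and nullomers (k-mers that never occur in a sequence), can be sketched in a few lines. This is a minimal illustration and not the thesis software: the function names `kmer_counts` and `nullomers` are hypothetical, the default DNA alphabet is an assumption, and none of the record filtering or statistical analysis described above is reproduced.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def nullomers(sequence: str, k: int, alphabet: str = "ACGT") -> list:
    """List all k-mers over the alphabet that are absent from the sequence."""
    seen = kmer_counts(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in seen)


if __name__ == "__main__":
    toy = "ACGTAC"  # toy sequence, not real genomic data
    print(kmer_counts(toy, 2))
    print(len(nullomers(toy, 2)))
```

Enumerating every candidate k-mer grows as the alphabet size to the power k, so this direct approach is only practical for small k.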


On The Automatic Recognition Of Human Activities Using Heterogeneous Wearable Sensors, Oscar David Lara Yejas Jun 2012

USF Tampa Graduate Theses and Dissertations

Delivering accurate and opportune information on people's activities and behaviors has become one of the most important tasks within pervasive computing. Its wide spectrum of potential applications in medical, entertainment, and tactical scenarios motivates further research and development of new strategies to improve accuracy, pervasiveness, and efficiency.

This dissertation addresses human activity recognition (HAR) with wearable sensors in three main regards: In the first place, physiological signals have been incorporated as a new source of information to improve the recognition accuracy achieved by conventional approaches, which rely on accelerometer signals solely. A new HAR system, Centinela, was born …


Bayesian And Related Methods: Techniques Based On Bayes' Theorem, Mehmet Vurkaç May 2012

Systems Science Friday Noon Seminar Series

Bayes' theorem is a simple algebraic consequence of conditional probability. Yet, its consequences are critical to philosophy, society, and technology. Starting from its simple derivation, we will show how its interpretation in terms of base rates (priors) and class-conditional likelihoods illuminates everyday problems in medicine and law, and provides signal processing, communications, machine learning, model selection, and other applications of statistics with powerful classification and estimation tools. Next, we will briefly examine some of the ways in which this theorem can be adapted to include multiple attributes, contexts, hypotheses, and levels of risk. Methods derived from or related to Bayes’ …
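The role of base rates and class-conditional likelihoods can be made concrete with a small worked example. The sketch below is an illustration only and is not taken from the seminar: the function name `posterior` and the numbers in the usage example (a 1% base rate, a 90% true-positive rate, a 5% false-positive rate) are assumptions chosen for clarity.

```python
def posterior(prior: float, p_evidence_given_h: float, p_evidence_given_not_h: float) -> float:
    """Bayes' theorem for a binary hypothesis H and observed evidence E:
    P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]."""
    numerator = p_evidence_given_h * prior
    denominator = numerator + p_evidence_given_not_h * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("the evidence has zero probability under both hypotheses")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical medical test: condition present in 1% of people,
    # test positive for 90% of those with it and 5% of those without it.
    print(posterior(prior=0.01, p_evidence_given_h=0.90, p_evidence_given_not_h=0.05))
```

With these assumed numbers the posterior is 2/13, roughly 15%, which shows how a low base rate keeps the probability of the condition modest even after a positive result from a fairly accurate test.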


Weakly Supervised Learning For Unconstrained Face Processing, Gary B. Huang May 2012

Open Access Dissertations

Machine face recognition has traditionally been studied under the assumption of a carefully controlled image acquisition process. By controlling image acquisition, variation due to factors such as pose, lighting, and background can be either largely eliminated or specifically limited to a study over a discrete number of possibilities. Applications of face recognition have had mixed success when deployed in conditions where the assumption of controlled image acquisition no longer holds. This dissertation focuses on this unconstrained face recognition problem, where face images exhibit the same amount of variability that one would encounter in everyday life. We formalize unconstrained face recognition …


The Glass Is Half-Full: Overestimating The Quality Of A Novel Environment Is Advantageous, Oded Berger-Tal, Tal Avgar Apr 2012

Wildland Resources Faculty Publications

According to optimal foraging theory, foraging decisions are based on the forager's current estimate of the quality of its environment. However, in a novel environment, a forager does not possess information regarding the quality of the environment, and may make a decision based on a biased estimate. We show, using a simple simulation model, that when facing uncertainty in heterogeneous environments it is better to overestimate the quality of the environment (to be an “optimist”) than to underestimate it, as optimistic animals learn the true value of the environment faster due to a higher exploration rate. Moreover, we show that when the …


A Study Of Localization And Latency Reduction For Action Recognition, Syed Zain Masood Jan 2012

Electronic Theses and Dissertations

The success of recognizing periodic actions in single-person-simple-background datasets, such as Weizmann and KTH, has created a need for more complex datasets to push the performance of action recognition systems. In this work, we create a new synthetic action dataset and use it to highlight weaknesses in current recognition systems. Experiments show that introducing background complexity to action video sequences causes a significant degradation in recognition performance. Moreover, this degradation cannot be fixed by fine-tuning system parameters or by selecting better feature points. Instead, we show that the problem lies in the spatio-temporal cuboid volume extracted from the interest point …


Ensemble Methods For Malware Diagnosis Based On One-Class SVMs, Xing An Jan 2012

LSU Master's Theses

Malware diagnosis is one of today’s most popular topics in machine learning. Instead of simply applying all the classical classification algorithms to the problem and claiming the highest accuracy as the prediction result, which is the typical approach adopted by studies of this kind, we stick to the Support Vector Machine (SVM) classifier. Based on our observation of some principles of learning, characteristics of statistics, and the behavior of SVMs, we employed a number of potential preprocessing or ensemble methods, including rescaling, bagging, and clustering, that may enhance the performance of the classical algorithm. We implemented the …