Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons

Articles 1 - 28 of 28

Full-Text Articles in Data Science

Detecting Hacker Threats: Performance Of Word And Sentence Embedding Models In Identifying Hacker Communications, Susan Mckeever, Brian Keegan, Andrei Quieroz Dec 2020

Conference papers

Cyber security is striving to find new forms of protection against hacker attacks. An emerging approach nowadays is the investigation of security-related messages exchanged on deep/dark web and even surface web channels. This approach can be supported by the use of supervised machine learning models and text mining techniques. In our work, we compare a variety of machine learning algorithms, text representations and dimension reduction approaches by their accuracy in detecting software-vulnerability-related communications. Given the imbalanced nature of the three public datasets used, we investigate appropriate sampling approaches to boost detection accuracies of our models. In addition, we ...


Identifying Structure Transitions Using Machine Learning Methods, Nicholas Walker Jul 2020

LSU Doctoral Dissertations

Methodologies from data science and machine learning, both new and old, provide an exciting opportunity to investigate physical systems using extremely expressive statistical modeling techniques. Physical transitions are of particular interest, as they are accompanied by pattern changes in the configurations of the systems. Detecting and characterizing pattern changes in data happens to be a particular strength of statistical modeling in data science, especially with the highly expressive and flexible neural network models that have become increasingly computationally accessible in recent years through performance improvements in both hardware and algorithmic implementations. Conceptually, the machine learning approach can be regarded as ...


Prediction Of Feed Utilization Performance In Clarias Gariepinus Using Multiple Linear Regression In Machine Learning, Adekunle Oluwatosin Familusi Jun 2020

Journal of Bioresource Management

Machine learning models can be used to make predictions about nutrient utilization performance index using available proximate analysis data on feed composition. Data from similar experiments on nutrient utilization performance were used to fit a multiple linear regression model for the prediction of four performance indexes. A relationship between the Specific Growth Rate and percentage inclusion, with strength of 0.57, was noted along with a negative relationship between protein efficiency and protein content. A negative relationship between Nitrogen Free Extract (NFE) and Protein Efficiency Ratio (PER) at NFE content ≥25% was observed. PER was predicted with 85% accuracy, while Weight Gain (WG ...


Pathways To The Native Storyteller: A Method To Enable Computational Story Understanding, Aramide O. Kehinde Jun 2020

College of Computing and Digital Media Dissertations

The primary objective of this thesis is to develop a method that uses machine learning algorithms to enable computational story understanding. This research is conducted with the aim of establishing a system called the Native Storyteller that plans and creates storytelling experiences for human users. The paper first establishes the desired capabilities of the system and then deep dives into how to enable story understanding, which is the core ability the system needs to function. As such, the research places emphasis on natural language processing and its application to solving key problems in this context. Namely, machine representation of story ...


Utilizing Neural Networks And Wearables To Quantify Hip Joint Angles And Moments During Walking And Stair Ascent, Megan V. Mccabe Jun 2020

ENGS 88 Honors Thesis (AB Students)

Wearable sensors were leveraged to develop two methods for computing hip joint angles and moments during walking and stair ascent that are more portable than the gold standard. The Insole-Standard (I-S) approach replaced force plates with force-measuring insoles and achieved results that match the curvature of results from similar studies. Peaks in I-S kinetic results are high due to error induced by applying the ground reaction force to the talus. The Wearable-ANN (W-A) approach combines wearables with artificial neural networks to compute the same results. Compared against the I-S, the W-A approach performs well (average rRMSE = 18%, R2 0 ...


Mining User-Generated Content Of Mobile Patient Portal: Dimensions Of User Experience, Mohammad Al-Ramahi, Cherie Noteboom Jun 2020

Faculty Research & Publications

Patient portals are positioned as a central component of patient engagement through the potential to change the physician-patient relationship and enable chronic disease self-management. The incorporation of patient portals provides the promise to deliver excellent quality, at optimized costs, while improving the health of the population. This study extends the existing literature by extracting dimensions related to the Mobile Patient Portal Use. We use a topic modeling approach to systematically analyze users’ feedback from the actual use of a common mobile patient portal, Epic’s MyChart. Comparing results of Latent Dirichlet Allocation analysis with those of human analysis validated the ...


A Web-Based, Positive Emotion Skills Intervention For Enhancing Posttreatment Psychological Well-Being In Young Adult Cancer Survivors (Empower): Protocol For A Single-Arm Feasibility Trial, John M. Salsman, Laurie E. Mclouth, Michael Cohn, Janet A. Tooze, Mia Sorkin, Judith T. Moskowitz May 2020

Behavioral Science Faculty Publications

BACKGROUND: Adolescent and young adult cancer survivors (AYAs) experience clinically significant distress and have limited access to supportive care services. Interventions to enhance psychological well-being have improved positive affect and reduced depression in clinical and healthy populations but have not been routinely tested in AYAs.

OBJECTIVE: The aim of this protocol is to (1) test the feasibility and acceptability of a Web-based positive emotion skills intervention for posttreatment AYAs called Enhancing Management of Psychological Outcomes With Emotion Regulation (EMPOWER) and (2) examine proof of concept for reducing psychological distress and enhancing psychological well-being.

METHODS: The intervention development and testing are ...


Using Case-Level Context To Classify Cancer Pathology Reports, Shang Gao, Mohammed Alawad, Noah Schaefferkoetter, Lynne Penberthy, Xiao-Cheng Wu, Eric B. Durbin, Linda Coyle, Arvind Ramanathan, Georgia Tourassi May 2020

Kentucky Cancer Registry Faculty Publications

Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence; for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We ...


Using Data Mining To Identify The Most Influential Factors In Training Results, Xiaoqing Wu, Daanial Ahmad May 2020

Publications and Research

Data Science is used as a tool to find hidden facts in the data. We want to find out which factors, such as ‘AGE’, ‘TAX’, ‘PUPIL-TEACHER RATIO’ and ‘PER-CAPITA INCOME’, contribute the most to housing prices. To answer this question, we studied the dataset of “Boston Houses Prices”. By applying the Lasso Regression (a Data Mining Technique) on the data set of “Boston Houses Prices” we identified the influential factors in the linear model. As a conclusion we found that there were six inputs which contributed the most to the prices of houses and those inputs are as follows: (i) CRIM-per ...
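
The abstract above names Lasso regression as the technique used to pick out influential inputs. As a rough illustration only, the sketch below fits a Lasso model by coordinate descent on a tiny made-up dataset (not the Boston data). The function names, the penalty value and the toy data are all assumptions introduced here; the abstract does not describe the authors' own implementation.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values inside [-penalty, penalty] become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=100):
    """Minimise (1/2n) * sum((y - X.w)^2) + penalty * sum(|w|).

    Assumes the columns of `rows` and the `targets` are already centred,
    so no intercept term is fitted (a simplification for this sketch).
    """
    n = len(rows)
    p = len(rows[0])
    weights = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared length of feature j
            for row, y in zip(rows, targets):
                pred_without_j = sum(
                    weights[k] * row[k] for k in range(p) if k != j
                )
                rho += row[j] * (y - pred_without_j)
                norm += row[j] ** 2
            if norm == 0.0:
                weights[j] = 0.0
            else:
                weights[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return weights


# Hypothetical toy data: the target depends only on the first column.
toy_rows = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
toy_targets = [3.0, -3.0, 3.0, -3.0]
toy_weights = lasso_coordinate_descent(toy_rows, toy_targets, penalty=0.5)
print(toy_weights)  # the second weight is driven exactly to zero
```

The L1 penalty sets the coefficients of uninformative inputs to exactly zero, which is why Lasso can serve as a feature-selection step: the inputs with nonzero weights are the "influential" ones.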


Sensor Data Analysis In Smart Buildings, Manuel A. Mane Penton May 2020

Publications and Research

Data analysis and Machine Learning are destined to evolve the current technology infrastructure by solving technology and economy demands present mainly in developed cities like New York. This research proposes a machine learning (ML) based solution to alleviate one of the main issues that big buildings such as CUNY campuses have, that is the waste of energy resources. The analysis of data coming from the readings of different deployed sensors such as CO2, humidity and temperature can be used to estimate occupancy in a specific room and building in general. The outcome of this research established a relationship between the ...


Design Principles Influencing Secondary School Counselors' Satisfaction Of A Decision-Support System, Kodey S. Crandall Apr 2020

Masters Theses & Doctoral Dissertations

In the current era of accountability, secondary school counselors are expected to use data to drive program decision-making, identify and implement evidence-based interventions to create systemic change, and utilize emerging technology. Research shows it is difficult for school counselors to meet any of these expectations. A decision support system (DSS) is a technology that takes minimal effort to learn and can assist in decision-making processes. This design science research builds and evaluates an IT artifact, a decision-support system, in an attempt to solve the problems facing school counselors. To develop this system, four design principles (system usefulness, interface quality, information ...


Digital Forensic Readiness: An Examination Of Law Enforcement Agencies In The State Of Maryland, James B. Mcnicholas Iii Apr 2020

Masters Theses & Doctoral Dissertations

Digital forensic readiness within the law enforcement community, especially at the local level, has gone mostly unexplored. As a result, a current lack of data exists that examines the digital forensic readiness of individual agencies, the possibility of proximity relationships, and correlations between readiness and backlogs. This quantitative, cross-sectional research study sought to explore these issues by focusing on the state of Maryland. The study resulted in the creation of a digital forensic readiness scoring model that was then used to assign digital forensic readiness scores to thirty (30) of the one-hundred-forty-one (141) law enforcement agencies throughout Maryland. It was ...


The Effectiveness Of Transfer Learning Systems On Medical Images, James Boit Apr 2020

Masters Theses & Doctoral Dissertations

Deep neural networks have revolutionized the performance of many machine learning tasks such as medical image classification and segmentation. Current deep learning (DL) algorithms, specifically convolutional neural networks, are increasingly becoming the methodological choice for most medical image analysis. However, training these deep neural networks requires high computational resources and very large amounts of labeled data, which are often expensive and laborious to obtain. Meanwhile, recent studies have shown the transfer learning (TL) paradigm as an attractive choice in providing promising solutions to challenges of shortage in the availability of labeled medical images. Accordingly, TL enables us to leverage the knowledge learned ...


Data Science Meets Compliance, Christian Clarke Apr 2020

Petersheim Academic Exposition

No abstract provided.


Mobile Identity, Credential, And Access Management Framework, Peggy Renee Camley Mar 2020

Masters Theses & Doctoral Dissertations

Organizations today gather unprecedented quantities of data from their operations. This data is coming from transactions made by a person or from a connected system/application. From personal devices to industry including government, the internet has become the primary means of modern communication, further increasing the need for a method to track and secure these devices. Protecting the integrity of connected devices collecting data is critical to ensure the trustworthiness of the system. An organization must not only know the identity of the users on their networks and have the capability of tracing the actions performed by a user but ...


Algorithm Selection Framework: A Holistic Approach To The Algorithm Selection Problem, Marc W. Chalé Mar 2020

Theses and Dissertations

A holistic approach to the algorithm selection problem is presented. The “algorithm selection framework” uses a combination of user input and meta-data to streamline the algorithm selection for any data analysis task. The framework removes the conjecture of the common trial and error strategy and generates a preference ranked list of recommended analysis techniques. The framework is performed on nine analysis problems. Each of the recommended analysis techniques is implemented on the corresponding data sets. Algorithm performance is assessed using the primary metric of recall and the secondary metric of run time. In six of the problems, the recall of ...


An Analysis Of Learning Curve Theory & Diminishing Rates Of Learning, Dakotah W. Hogan Mar 2020

Theses and Dissertations

Traditional learning curve theory assumes a constant learning rate regardless of the number of units produced; however, a collection of theoretical and empirical evidence indicates that learning rates decrease as more units are produced in some cases. These diminishing learning rates cause traditional learning curves to underestimate required resources, potentially resulting in cost overruns. A diminishing learning rate model, Boone's Learning Curve (2018), was recently developed to model this phenomenon. This research confirmed that Boone's Learning Curve is more accurate in modeling observed learning curves using production data of 169 Department of Defense end-items. However, further empirical analysis revealed deficiencies ...
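
To make the "constant learning rate" assumption in the abstract above concrete, here is a minimal sketch of one common traditional formulation, the Wright-style unit curve, in which cost falls by a fixed fraction each time cumulative output doubles. The abstract does not say which traditional form the thesis uses, so the choice of this formula, the function name and the example numbers are assumptions made here. The diminishing-rate model discussed in the abstract is not reproduced.

```python
import math


def unit_cost(first_unit_cost: float, unit_number: int, learning_rate: float) -> float:
    """Cost of the n-th unit under a constant learning rate (Wright-style curve).

    learning_rate is the fraction of cost retained each time cumulative
    output doubles, e.g. 0.8 for an "80% curve" (hypothetical example value).
    """
    if unit_number < 1:
        raise ValueError("unit_number must be at least 1")
    if not 0.0 < learning_rate <= 1.0:
        raise ValueError("learning_rate must be in (0, 1]")
    # The exponent is fixed: it does not change as more units are produced.
    exponent = math.log2(learning_rate)
    return first_unit_cost * unit_number ** exponent


if __name__ == "__main__":
    # Illustrative values only: first unit costs 100, 80% learning curve.
    for n in (1, 2, 4, 8):
        print(n, round(unit_cost(100.0, n, 0.8), 2))
```

Because the exponent never changes, the predicted cost keeps dropping at the same proportional rate forever. If real learning slows down at higher quantities, a curve of this form would predict costs that are too low, which is the underestimation the abstract describes.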


Invariance And Invertibility In Deep Neural Networks, Han Zhang Jan 2020

Theses and Dissertations

Machine learning is concerned with computer systems that learn from data instead of being explicitly programmed to solve a particular task. One of the main approaches behind recent advances in machine learning involves neural networks with a large number of layers, often referred to as deep learning. In this dissertation, we study how to equip deep neural networks with two useful properties: invariance and invertibility. The first part of our work is focused on constructing neural networks that are invariant to certain transformations in the input, that is, some outputs of the network stay the same even if the input ...


An Analytical Examination On The Effects Of Vegetarian And Omnivorous Diets On C-Reactive Protein, Aletha Kleis Jan 2020

Undergraduate Honors Theses

There is a lack of research regarding how following a vegetarian or omnivorous diet affects C-Reactive Protein (CRP) levels of people as seen through results from an analysis of data gathered from the National Health and Nutrition Examination Survey (NHANES). The level of CRP is a reflection of how much inflammation there is in one’s body and is a popular indicator of risk for heart disease. Thus, in this research, I use the NHANES data to look at the relationship of CRP levels of people who identified themselves as vegetarian or not, while also considering the general healthiness of ...


Sediment Dynamics In The Magdalena River Basin, Colombia: Implications For Understanding Tropical River Processes And Hydropower Development, Luke H. Fisher Jan 2020

Graduate Student Theses, Dissertations, & Professional Papers

The Magdalena River Basin of Colombia has a globally relevant sediment flux; however, studies of the sediment regime in the basin are limited in scope. This knowledge gap limits application of understanding of sediment dynamics to hydropower decision making. To close this gap, we implemented a sediment budget framework to quantify the impacts of hydropower development in a 118,000 km² portion of the Magdalena River basin. We informed this framework with analysis of background erosion rates derived from ¹⁰Be cosmogenic nuclides and modern sediment fluxes derived from monitoring and optical remote sensing. We standardized these data to ...


The Mathematics, Computer Science, And Data Science Student Research Showcase, Seton Hall University Jan 2020

Petersheim Academic Exposition

No abstract provided.


Multimodal Fusion Strategies For Outcome Prediction In Stroke, Esra Zihni, John D. Kelleher, Vince I. Madai, Ahmed Khalil, Ivana Galinovic, Jochen Fiebach, Michelle Livne, Dietmar Frey Jan 2020

Conference papers

Data driven methods are increasingly being adopted in the medical domain for clinical predictive modeling. Prediction of stroke outcome using machine learning could provide a decision support system for physicians to assist them in patient-oriented diagnosis and treatment. While patient-specific clinical parameters play an important role in outcome prediction, a multimodal fusion approach that integrates neuroimaging with clinical data has the potential to improve accuracy. This paper addresses two research questions: (a) does multimodal fusion aid in the prediction of stroke outcome, and (b) what fusion strategy is more suitable for the task at hand. The baselines for our experimental ...


Image Features For Tuberculosis Classification In Digital Chest Radiographs, Brian Hooper Jan 2020

All Master's Theses

Tuberculosis (TB) is a respiratory disease which affects millions of people each year, accounting for the tenth leading cause of death worldwide, and is especially prevalent in underdeveloped regions where access to adequate medical care may be limited. Analysis of digital chest radiographs (CXRs) is a common and inexpensive method for the diagnosis of TB; however, a trained radiologist is required to interpret the results, and the interpretation is subject to human error. Computer-Aided Detection (CAD) systems are a promising machine-learning-based solution to automate the diagnosis of TB from CXR images. As the dimensionality of a high-resolution CXR image is very ...


Eavesdropping Hackers: Detecting Software Vulnerability Communication On Social Media Using Text Mining, Susan Mckeever, Brian Keegan, Andrei Quieroz Sep 2019

Conference papers

Cyber security is striving to find new forms of protection against hacker attacks. An emerging approach nowadays is the investigation of security-related messages exchanged on Deep/Dark Web and even Surface Web channels. This approach can be supported by the use of supervised machine learning models and text mining techniques. In our work, we compare a variety of machine learning algorithms, text representations and dimension reduction approaches by their accuracy in detecting software-vulnerability-related communications. Given the imbalanced nature of the three public datasets used, we investigate appropriate sampling approaches to boost detection accuracies of our models. In addition, we ...


Streaming Feature Grouping And Selection (Sfgs) For Big Data Classification, Noura Helal Hamad Al Nuaimi Mar 2019

Dissertations

Real-time data has always been an essential element for organizations when the quickness of data delivery is critical to their businesses. Today, organizations understand the importance of real-time data analysis to maintain benefits from their generated data. Real-time data analysis is also known as real-time analytics, streaming analytics, real-time streaming analytics, and event processing. Stream processing is the key to getting results in real-time. It allows us to process the data stream in real-time as it arrives. The concept of streaming data means the data are generated dynamically, and the full stream is unknown or even infinite. This data becomes ...


Special Issue: Neutrosophic Theories Applied In Engineering, Florentin Smarandache, Jun Ye Jan 2017

Mathematics and Statistics Faculty and Staff Publications

Neutrosophic sets and logic are generalizations of fuzzy and intuitionistic fuzzy sets and logic. Neutrosophic sets and logic are gaining significant attention in solving many real-life decision-making problems that involve uncertainty, impreciseness, vagueness, incompleteness, inconsistency, and indeterminacy. They have been applied in computational intelligence, multiple criteria decision making, image processing, medical diagnoses, etc. This Special Issue presents original research papers that report on state-of-the-art and recent advancements in neutrosophic sets and logic in soft computing, artificial intelligence, big and small data mining, decision-making problems, and practical achievements.


Mapsnap System To Perform Vector-To-Raster Fusion, Boris Kovalerchuk, Peter Doucette, Gamal Seedahmed, Jerry Tagestad, Sergei Kovalerchuk, Brian Graff May 2011

All Faculty Scholarship for the College of the Sciences

As the availability of geospatial data increases, there is a growing need to match these datasets together. However, since these datasets often vary in their origins and spatial accuracy, they frequently do not correspond well to each other, which creates multiple problems. To accurately align with imagery, analysts currently either: 1) manually move the vectors, 2) perform a labor-intensive spatial registration of vectors to imagery, 3) move imagery to vectors, or 4) redigitize the vectors from scratch and transfer the attributes. All of these are time-consuming and labor-intensive operations. Automated matching and fusing of vector datasets has been a subject ...


Extreme Data Mining: Inference From Small Datasets, Răzvan Andonie Sep 2010

All Faculty Scholarship for the College of the Sciences

Neural networks have been applied successfully in many fields. However, satisfactory results can only be obtained under large-sample conditions. With small training sets, performance may be poor, or the learning task may not be accomplished at all. This deficiency severely limits the applications of neural networks. The main reason small datasets cannot provide enough information is that gaps exist between samples, and even the domain of the samples cannot be ensured. Several computational intelligence techniques have been proposed to overcome the limits of learning from small datasets.

We have the following goals: i. To ...