Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Data Science

2024

Articles 1 - 30 of 117

Full-Text Articles in Physical Sciences and Mathematics

Data Collector Selection Ranking-Based Method For Collaborative Multi-Tasks In Ubiquitous Environments, Belal Z. Hassan, Ahmed. A. A. Gad-Elrab, Mohamed S. Farag, S. E. Abu-Youssef Aug 2024

Al-Azhar Bulletin of Science

In Ubiquitous Computing and the Internet of Things, the sensing and control of objects involve numerous devices collecting and transmitting data. However, connecting these devices without fostering collaboration leads to suboptimal system performance. As the number of connected sensing devices in the Internet of Things increases, efficient task accomplishment through collaboration becomes imperative. This paper proposes a Data Collector Selection Method for Collaborative Multi-Tasks to address this challenge, considering task preferences and uncertainty in data collectors' contributions. The proposed method incorporates three key aspects: (1) Using Fuzzy Analytical Hierarchy Process to determine optimal weights for task preferences; (2) Ranking data collectors …


Three Essays Applying Dynamic Models In Economics, Finance, And Machine Learning, Lucas C. Dowiak Jun 2024

Dissertations, Theses, and Capstone Projects

This dissertation is a composition in three parts. Collectively, these essays investigate dynamic methods and their application in the fields of Economics, Finance, and Machine Learning. It pulls liberally from all three. In particular, this dissertation makes repeated use of multi-state modeling frameworks popular in Economics to bring a faceted view to the underlying data and detect its hidden heterogeneity. The challenge of modeling financial assets and estimating their dependence is another focus. For stimulus, concepts in the Machine Learning field are brought in to aid or compete with established econometric techniques.

Econometric Applications of the Hierarchical Mixture-of-Experts

In this …


Predictive Analysis Of Local House Prices: Leveraging Machine Learning For Real Estate Valuation, Joey Hernandez, Danny Chang, Santiago Gutierrez, Paul Huggins May 2024

SMU Data Science Review

This paper presents a comprehensive study examining the real estate market potential in the dynamic urban landscapes of Frisco and Plano, Texas. Combining traditional real estate analysis with cutting-edge machine learning techniques, the study aims to predict home prices and assess investment feasibility. Leveraging these findings, the study proposes a strategic focus on predictive modeling and investment potential identification, emphasizing the continual refinement of machine learning models with updated data to accurately forecast changes in the real estate market. By harnessing the predictive power of these models, investors can identify high-growth areas and optimize their investment decisions, thus capitalizing on …


A Symbolic Approach To Nonlinear Time Series Analysis, Ranjan Karki, Nibhrat Lohia, Michael B. Schulte May 2024

SMU Data Science Review

Current nonlinear time series methods such as neural networks forecast well. However, they act as a black box and are difficult to interpret, leaving the researchers and the audience with little insight into why the forecasts are the way they are. There is a need for a method that forecasts accurately while also being easy to interpret. This paper aims to develop a method to build an interpretable model for univariate and multivariate nonlinear time series data using wavelets and symbolic regression. The final method relies on multilayer perceptron (MLP) neural networks as a form of dimensionality reduction and the …


Intelligent Solutions For Retroactive Anomaly Detection And Resolution With Log File Systems, Derek G. Rogers, Chanvo Nguyen, Abhay Sharma May 2024

SMU Data Science Review

This paper explores the intricate challenges log files pose from data science and machine learning perspectives. Drawing inspiration from existing methods, LAnoBERT, PULL, LLMs, and the breadth of recent research, this paper aims to push the boundaries of machine learning for log file systems. Our study comprehensively examines the unique challenges presented in our problem setup, delineates the limitations of existing methods, and introduces innovative solutions. These contributions are organized to offer valuable insights, predictions, and actionable recommendations tailored for Microsoft's engineers working on log data analysis.


Baseball Decision-Making: Optimizing At-Bat Simulations, Varun Gopal, Krithika Kondakindi, Nibhrat Lohia, Morgan Williams May 2024

SMU Data Science Review

Pitch selection in baseball plays a crucial role, involving pitchers, catchers, and batters working together. This practice, dating back to early baseball, has seen teams try various methods to gain an advantage. This research aims to use reinforcement learning and pitch-by-pitch Statcast data to improve batting strategies. It also builds on previous statistical work (sabermetrics) to make better choices in pitch selection and plate discipline. The dataset used, including over 700,000 pitches for each full season and 200,000 pitches for the COVID-shortened 2020 season, encompasses a wealth of crucial metrics including pitch release point, velocity, and launch angle. This study …


Reevaluating Texas Energy Market Forecasts In The Wake Of Recent Extreme Weather Events, Robert A. Derner, Richard W. Butler II, Alexandria Neff, Adam R. Ruthford May 2024

SMU Data Science Review

This paper provides updated forecasts of energy demand in Texas and recognizes the impact of sustainable energy. It is important that the forecasts of the adoption of sustainable energy are reexamined after Winter Storm Uri crippled the Texas power grid and left many without power. This storm highlighted the issues the Texas power grid had and has continued to struggle with in supplying the state with energy. This paper will offer an overview of the relevant literature on the adoption of sustainable energy and relevant events that have occurred in the state of Texas that will give the reader the …


Multi-Class Emotion Classification With XGBoost Model Using Wearable EEG Headband Data, James Khamthung, Nibhrat Lohia, Seement Srivastava May 2024

SMU Data Science Review

Electroencephalography (EEG) or brainwave signals serve as a valuable source for discerning human activities, thoughts, and emotions. This study explores the efficacy of EXtreme Gradient Boosting (XGBoost) models in sentiment classification using EEG signals, specifically those captured by the MUSE EEG headband. The MUSE device, equipped with four EEG electrodes (TP9, AF7, AF8, TP10), offers a cost-effective alternative to traditional EEG setups, which often utilize over 60 channels in laboratory-grade settings. Leveraging a dataset from previous MUSE research (Bird, J. et al., 2019), emotional states (positive, neutral, and negative) were observed in a male and a female participant, each for …


Building Effective Large Language Model Agents, Sydney Holder, Shreyash Taywade May 2024

SMU Data Science Review

The advancement of large language models (LLMs) has significantly expanded the influence of artificial intelligence across various sectors. This paper explores building LLM agents to power applications and examines what is necessary to build an efficient and helpful AI assistant. The research investigates the core components necessary to create specialized agents, facilitate collaboration in problem-solving, and improve human task performance. The development and application of tools designed to augment the capabilities of LLM agents are also explored. The paper addresses the potential risks of the unknowns, such as hallucinations, which can compromise the success of agent-based solutions within LLM applications. …


Game Recommendation Analysis Using Steam Profiles And Reviews, Robert Blue, Luis Garcia, Jacob Turner May 2024

SMU Data Science Review

Smaller game studios are at a disadvantage when it comes to getting their product noticed by users. This study aims to provide insights on how recommendation engines work so that these smaller studios can have their games noticed on Steam. Steam is one of the largest video game distribution services, and it has a recommendation engine that promotes games to its user base. This study used user information such as the number of games played, the types of games, and the hours played to build recommendation engines that identify the qualities in a game that drive recommendations.


Leveraging Transformer Models For Genre Classification, Andreea C. Craus, Ben Berger, Yves Hughes, Hayley Horn May 2024

SMU Data Science Review

As the digital music landscape continues to expand, the need for effective methods to understand and contextualize the diverse genres of lyrical content becomes increasingly critical. This research focuses on the application of transformer models in the domain of music analysis, specifically in the task of lyric genre classification. By leveraging the advanced capabilities of transformer architectures, this project aims to capture intricate linguistic nuances within song lyrics, thereby enhancing the accuracy and efficiency of genre classification. The relevance of this project lies in its potential to contribute to the development of automated systems for music recommendation and genre-based playlist …


Context Aware Music Recommendation And Playlist Generation, Elias Mann May 2024

SMU Journal of Undergraduate Research

There are many reasons people listen to music, and the type of music is largely determined by what the listener may be doing while they listen. For example, one may listen to one type of music while commuting, another while exercising, and yet another while relaxing. Without access to the physiological state of the user, current music recommendation methods rely on collaborative filtering - recommending music based on what other similar users listen to - and content based filtering - recommending songs based on their similarities to songs the user already prefers. With the rise in popularity of smart devices …


Dual-Domain Clustering Of Spatiotemporal Infectious Disease Data, Samuel R. Thornton, Erin C.S. Acquesta, Patrick D. Finley, Mansoor A. Haider May 2024

Biology and Medicine Through Mathematics Conference

No abstract provided.


Characteristics Based Factor Models - Comparison Of Estimation Procedures, Henri Ohl May 2024

McKelvey School of Engineering Theses & Dissertations

Understanding cross-sectional and time series variation of asset returns is fundamental in finance, particularly in asset pricing. This thesis explores the integration of factor theory with machine learning to deepen our comprehension of these dynamics. Characteristics based factor models offer a systematic framework for quantifying an asset's underlying risk-return structure, leveraging time-varying conditional information on model parameters carried by firm-specific characteristics. These models serve as valuable tools for discerning the driving components of an asset's expected excess return. Recent research established a novel methodology for consistent parameter estimation within this framework, only requiring a large cross-section but not a long …


An NLP Approach To Automating The Generation Of Surveys For Market Research, Anav Chug May 2024

Honors College Theses

Market research is vital but includes activities that are often laborious and time-consuming. Survey questionnaires are one possible output of the process, and market researchers spend a lot of time manually developing questions for focus groups. The proposed research aims to develop a software prototype that utilizes Natural Language Processing (NLP) to automate the process of generating survey questions for market research. The software uses a pre-trained OpenAI language model to generate multiple-choice survey questions based on a given product prompt, sends them to a targeted email list, and provides a real-time analysis of the responses …


Evaluating Neuroimaging Modalities In The A/T/N Framework: Single And Combined FDG-PET And T1-Weighted MRI For Alzheimer’s Diagnosis, Peiwang Liu May 2024

McKelvey School of Engineering Theses & Dissertations

With the escalating prevalence of dementia, particularly Alzheimer's Disease (AD), the need for early and precise diagnostic techniques is rising. This study delves into the comparative efficacy of Fluorodeoxyglucose Positron Emission Tomography (FDG-PET) and T1-weighted Magnetic Resonance Imaging (MRI) in diagnosing AD, where the integration of multimodal models is becoming a trend. Leveraging data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), we employed linear Support Vector Machines (SVM) to assess the diagnostic potential of these modalities, both individually and in combination, within the AD continuum. Our analysis, under the A/T/N framework's 'N' category, reveals that FDG-PET consistently outperforms T1w-MRI across …


Surmounting Challenges In Aggregating Results From Static Analysis Tools, Dr. Ann Marie Reinhold, Brittany Boles, A. Redempta Manzi Muneza, Thomas Mcelroy, Dr. Clemente Izurieta May 2024

Military Cyber Affairs

Aggregation poses a significant challenge for software practitioners because it requires a comprehensive and nuanced understanding of raw data from diverse sources. Suites of static-analysis tools (SATs) are commonly used to assess organizational security but simultaneously introduce significant challenges. Challenges include unique results, scales, configuration environments for each SAT execution, and incompatible formats between SAT outputs. Here, we document our experiences addressing these issues. We highlight the problem of relying on a single vendor's SAT version and offer a solution for aggregating findings across multiple SATs, aiming to enhance software security practices and deter threats early with robust defensive operations.


Data Analysis Project For Preferred Credit Inc., Emily Smith, Greta Nesbit, Jack Simonet, Ignacio Sanchez-Romero May 2024

Celebrating Scholarship and Creativity Day (2018-)

This project focuses on transforming real data within PCI's operations into valuable insights through an approach of coding, data cleaning, and visualization. By leveraging advanced techniques, the project aims to uncover key trends and create visually compelling representations to aid decision-making within the company. The outcome will allow PCI stakeholders the ability to extract valuable insights, optimize processes, and drive initiatives for growth and competitive advantage in the finance industry.


A Novel Correction For The Multivariate Ljung-Box Test, Minhao Huang May 2024

Computational and Data Sciences (PhD) Dissertations

This research introduces an analytical improvement to the Multivariate Ljung-Box test that addresses significant deviations of the original test from the nominal Type I error rates under almost all scenarios. Prior attempts to mitigate this issue have been directed at modification of the test statistics or correction of the test distribution to achieve precise results in finite samples. In previous studies focused on designing corrections to the univariate Ljung-Box test, a method that specifically adjusts the test rejection region has been the most successful at attaining the best Type I error rates. We adopt the same approach for the more complex, …


Code For Care: Hypertension Prediction In Women Aged 18-39 Years, Kruti Sheth May 2024

Electronic Theses, Projects, and Dissertations

The longstanding prevalence of hypertension, often undiagnosed, poses significant risks of severe chronic and cardiovascular complications if left untreated. This study investigated the causes and underlying risks of hypertension in females aged 18-39 years. The research questions were: (Q1.) What factors affect the occurrence of hypertension in females aged 18-39 years? (Q2.) What machine learning algorithms are suited for effectively predicting hypertension? (Q3.) How can SHAP values be leveraged to analyze the factors from model outputs? The findings are: (Q1.) Performing feature selection using a binary classification logistic regression algorithm reveals an array of the 30 most influential factors at an …


Evaluation Of An End-To-End Radiotherapy Treatment Planning Pipeline For Prostate Cancer, Mohammad Daniel El Basha, Court Laurence, Carlos Eduardo Cardenas, Julianne Pollard-Larkin, Steven Frank, David T. Fuentes, Falk Poenisch, Zhiqian H. Yu May 2024

Dissertations & Theses (Open Access)

Radiation treatment planning is a crucial and time-intensive process in radiation therapy. This planning involves carefully designing a treatment regimen tailored to a patient’s specific condition, including the type, location, and size of the tumor with reference to surrounding healthy tissues. For prostate cancer, this tumor may be either local, locally advanced with extracapsular involvement, or extend into the pelvic lymph node chain. Automating essential parts of this process would allow for the rapid development of effective treatment plans and better plan optimization to enhance tumor control for better outcomes.

The first objective of this work, to automate the treatment …


A Framework That Explores The Cognitive Load Of CS1 Assignments Using Pausing Behavior, Joshua O. Urry May 2024

All Graduate Theses and Dissertations, Fall 2023 to Present

Pausing behavior in introductory Computer Science (CS1) courses has been related to a student’s performance in the course and could be linked to a student’s cognitive load, or assignment difficulty. Having an objective measure of the cognitive load would be beneficial to course instructors as it would help them design assignments that are not too difficult. Two studies are presented in this work. The first study uses Cognitive Load Theory and Vygotsky’s Zone of Proximal Development as a theoretical framework to analyze pause times between keystrokes to better understand what types of assignments need more educational support than others. The …


Code Syntax Understanding In Large Language Models, Cole Granger May 2024

Undergraduate Honors Theses

In recent years, tasks for automated software engineering have been achieved using Large Language Models trained on source code, such as Seq2Seq, LSTM, GPT, T5, BART and BERT. The inherent textual nature of source code allows it to be represented as a sequence of sub-words (or tokens), drawing parallels to prior work in NLP. Although these models have shown promising results according to established metrics (e.g., BLEU, CODEBLEU), there remains a deeper question about the extent of syntax knowledge they truly grasp when trained and fine-tuned for specific tasks.

To address this question, this thesis introduces a taxonomy of syntax …


Security And Interpretability In Large Language Models, Lydia Danas May 2024

Undergraduate Honors Theses

Large Language Models (LLMs) have the capability to model long-term dependencies in sequences of tokens, and are consequently often utilized to generate text through language modeling. These capabilities are increasingly being used for code generation tasks; however, LLM-powered code generation tools such as GitHub's Copilot have been generating insecure code and thus pose a cybersecurity risk. To generate secure code we must first understand why LLMs are generating insecure code. This non-trivial task can be realized through interpretability methods, which investigate the hidden state of a neural network to explain model outputs. A new interpretability method is rationales, which obtains …


The Quantitative Analysis And Visualization Of NFL Passing Routes, Sandeep Chitturi May 2024

Computer Science and Computer Engineering Undergraduate Honors Theses

The strategic planning of offensive passing plays in the NFL incorporates numerous variables, including defensive coverages, player positioning, and historical data. This project develops an application using an analytical framework and an interactive model to simulate and visualize an NFL offense's passing strategy under varying conditions. Using R programming and data management, the model dynamically represents potential passing routes in response to different defensive schemes. The system architecture integrates data from historical NFL league years to generate quantified route scores through designed mathematical equations. This allows for the prediction of potential passing routes for offensive skill players in response to the …


Sequential Optimization For Stressor-Informed Test Planning Through Integration Of Experimental And Simulated Data, Jacob Brecheisen May 2024

Data Science Undergraduate Honors Theses

This technical report details an innovative approach in reliability engineering aimed at maximizing system durability through a synergistic use of physical experimentation and computer-based modeling. Our methodology explores the efficient design and analysis of computer experiments and physical tests to facilitate accelerated reliability growth, while leveraging a sequential integration of data from these two distinct sources: costly physical experiments, characterized by random errors, and inexpensive computer simulations, marked by inherent systematic errors. The key innovation lies in the adoption of a closed-loop design and analysis method. This method begins by identifying a viable subset of important environmental stressors—such as temperature, …


Traffic Analysis Of Cities In San Bernardino County, Sai Kalyan Ayyagari May 2024

Electronic Theses, Projects, and Dissertations

This research offers an in-depth analysis of vehicular traffic within San Bernardino County, California, aiming to spotlight congestion areas and suggest improvements for more efficient and sustainable transportation. Leveraging 2021 data from StreetLight Data, traffic patterns in 15 key cities were examined based on their population sizes, covering various vehicle types to dissect dynamics and flow. The methodology focused on analyzing trip purposes and metrics to calculate Vehicle Miles Traveled (VMT) and its influence on congestion and environmental factors.

Findings indicate considerable disparities in traffic volume, purposes, and timings across different urban areas, with population density and intercity connections significantly …


Truck Traffic Analysis In The Inland Empire, Bhavik Khatri May 2024

Electronic Theses, Projects, and Dissertations

This study undertakes a meticulous examination of truck traffic within the Inland Empire, focusing on the distribution and dynamics of medium and heavy-duty vehicles, to advocate for the region's transition to electric trucks. Utilizing advanced spatial analysis and data from StreetLight Data, it segments the region into six subregions, revealing distinct traffic patterns and environmental impacts. Notably, the research uncovers that the North Center and West zones, integral to the logistics and warehousing sectors, exhibit the highest traffic volumes, significantly influencing air quality and infrastructure.

Quantitative results from 2021 illustrate a pronounced disparity in truck activity: medium-weight vehicles accounted for …


Computational Linguistics And Multilingualism: A Comparative Analysis With Spanish And English Data, Evelyn Lawrie May 2024

Student Scholar Symposium Abstracts and Posters

Computational linguistics is an increasingly ubiquitous field, serving as the basis for artificial intelligence and machine translation. It aims to analyze the syntax and semantics of individual words and phrases. While there have been in-depth advancements in computational linguistics strategies for the English language, those for other languages have not been developed as thoroughly. This lack of emphasis on multilingualism has contributed to the disappearance of Hispanic perspectives in the digital world, especially those of indigenous heritage, as the decline of many indigenous languages has been exacerbated by the lack of digital translation services. Sentiment analysis is a branch of computational linguistics that …


Roads And Corresponding Travel Time To Markets: Assessing Climate Vulnerability In Nepal, Kaitlyn Crowley May 2024

Undergraduate Honors Theses

Roads exist as a physical and theoretical connection between people and places around the globe. In addition to providing a route from one point to another, roads are also an indicator of access to markets and of poverty. However, current road datasets, particularly the Global Roads Open Access Data Set, are out of date or incomplete, necessitating new sources of data for analyses involving road networks. This study explores the relationship between climate change and access to markets in Nepal. We seek to identify isolated communities that are likely to experience detrimental outcomes associated with environmental threats, such as increasing …