Articles 31 - 60 of 1969
Full-Text Articles in Engineering
Delaunay Walk For Fast Nearest Neighbor: Accelerating Correspondence Matching For Icp, James D. Anderson, Ryan M. Raettig, Josh Larson, Scott L. Nykl, Clark N. Taylor, Thomas Wischgoll
Computer Science and Engineering Faculty Publications
Point set registration algorithms such as Iterative Closest Point (ICP) are commonly utilized in time-constrained environments like robotics. Finding the nearest neighbor of a point in a reference 3D point set is a common operation in ICP and frequently consumes at least 90% of the computation time. We introduce a novel approach to performing the distance-based nearest neighbor step based on Delaunay triangulation. This greedy algorithm finds the nearest neighbor of a query point by traversing the edges of the Delaunay triangulation created from a reference 3D point set. Our work integrates the Delaunay traversal into the correspondences search of …
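The greedy traversal described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes 2D points rather than the 3D point sets used in the paper, builds the Delaunay edges with a brute-force empty-circumcircle test that is only practical for small inputs, and all function names (`orient`, `in_circle`, `delaunay_neighbors`, `delaunay_walk`) are hypothetical.

```python
from itertools import combinations


def dist2(p, q):
    """Squared Euclidean distance between 2D points p and q."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def orient(a, b, c):
    """Positive if a, b, c turn counter-clockwise, zero if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circle(a, b, c, d):
    """Positive if d is strictly inside the circle through a, b, c (given in CCW order)."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    return ((ax * ax + ay * ay) * (bx * cy - cx * by)
            - (bx * bx + by * by) * (ax * cy - cx * ay)
            + (cx * cx + cy * cy) * (ax * by - bx * ay))


def delaunay_neighbors(points):
    """Brute-force Delaunay adjacency (O(n^4)); an illustration-only assumption."""
    n = len(points)
    nbrs = {i: set() for i in range(n)}
    for i, j, k in combinations(range(n), 3):
        a, b, c = points[i], points[j], points[k]
        o = orient(a, b, c)
        if o == 0:
            continue  # collinear triple, not a triangle
        if o < 0:
            b, c = c, b  # reorder to counter-clockwise
        # Keep the triangle only if no other point is strictly inside its circumcircle.
        if all(in_circle(a, b, c, points[m]) <= 0
               for m in range(n) if m not in (i, j, k)):
            for u, v in ((i, j), (j, k), (i, k)):
                nbrs[u].add(v)
                nbrs[v].add(u)
    return nbrs


def delaunay_walk(points, nbrs, query, start=0):
    """Greedy walk: step to the neighbor closest to the query until none is closer."""
    current = start
    while True:
        best = min(nbrs[current], key=lambda m: dist2(points[m], query),
                   default=current)
        if dist2(points[best], query) >= dist2(points[current], query):
            return current  # no neighbor improves on the current vertex
        current = best
```

Calling `delaunay_walk(points, delaunay_neighbors(points), query)` returns the index of the reference point nearest to `query`. In an ICP loop, seeding `start` with the match from the previous iteration would be one plausible way to keep walks short, although the abstract does not say how the authors choose the starting vertex.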
Semantics-Driven Abstractive Document Summarization, Amanuel Alambo
Browse all Theses and Dissertations
The evolution of the Web over the last three decades has led to a deluge of scientific and news articles on the Internet. Harnessing these publications in different fields of study is critical to effective end user information consumption. Similarly, in the domain of healthcare, one of the key challenges with the adoption of Electronic Health Records (EHRs) for clinical practice has been the tremendous volume of clinical notes generated; unless these notes are summarized, clinical decision making and communication will be inefficient and costly. In spite of the rapid advances in information retrieval and deep learning techniques towards …
Building An Understanding Of Human Activities In First Person Video Using Fuzzy Inference, Bradley A. Schneider
Browse all Theses and Dissertations
Activities of Daily Living (ADLs) are the activities that people perform every day in their home as part of their typical routine. The in-home, automated monitoring of ADLs has broad utility for intelligent systems that enable independent living for the elderly and mentally or physically disabled individuals. With rising interest in electronic health (e-Health) and mobile health (m-Health) technology, opportunities abound for the integration of activity monitoring systems into these newer forms of healthcare. In this dissertation we propose a novel system for describing ADLs based on video collected from a wearable camera. Most in-home activities are naturally defined by …
Novel Natural Language Processing Models For Medical Terms And Symptoms Detection In Twitter, Farahnaz Golrooy Motlagh
Browse all Theses and Dissertations
This dissertation focuses on disambiguation of language use on Twitter about drug use, consumption types of drugs, drug legalization, ontology-enhanced approaches, and data-driven prediction analysis by developing novel NLP models. Three technical aims comprise this work: (a) leveraging pattern recognition techniques to improve the quality and quantity of crawled Twitter posts related to drug abuse; (b) using an expert-curated, domain-specific DsOn ontology model that improves knowledge extraction in the form of drug-to-symptom and drug-to-side effect relations; and (c) modeling the prediction of public perception of the drug’s legalization and the sentiment analysis of drug consumption on Twitter. We collected …
A Cloud Computing-Based Dashboard For The Visualization Of Motivational Interviewing Metrics, E Jinq Heng
Browse all Theses and Dissertations
Motivational Interviewing (MI) is an evidence-based brief interventional technique that has been demonstrated to be effective in triggering behavior change in patients. To facilitate behavior change, healthcare practitioners adopt a nonconfrontational, empathetic dialogic style, a core component of MI. Despite its advantages, MI has been severely underutilized mainly due to the cognitive overload on the part of the MI dialogue evaluator, who has to assess MI dialogue in real-time and calculate MI characteristic metrics (number of open-ended questions, close-ended questions, reflection, and scale-based sentences) for immediate post-session evaluation both in MI training and clinical settings. To automate dialogue assessment and …
Deep Understanding Of Technical Documents : Automated Generation Of Pseudocode From Digital Diagrams & Analysis/Synthesis Of Mathematical Formulas, Nikolaos Gkorgkolis
Browse all Theses and Dissertations
The technical document is an entity that consists of several essential and interconnected parts, often referred to as modalities. Despite the extensive attention that certain parts have already received, such as the textual information, there are several aspects that are severely under-researched. Two such modalities are the utility of diagram images and the deep automated understanding of mathematical formulas. Inspired by existing holistic approaches to the deep understanding of technical documents, we develop a novel formal scheme for the modelling of digital diagram images. This extends to a generative framework that allows for the creation of artificial images and their …
Synthetic Aperture Ladar Automatic Target Recognizer Design And Performance Prediction Via Geometric Properties Of Targets, Jacob W. Ross
Browse all Theses and Dissertations
Synthetic Aperture LADAR (SAL) has several phenomenological differences from Synthetic Aperture RADAR (SAR), making it a promising candidate for automatic target recognition (ATR) purposes. The diffuse nature of SAL results in more pixels on target. Optical wavelengths offer centimeter-class resolution with an aperture baseline that is 10,000 times smaller than an SAR baseline. While diffuse scattering and optical wavelengths have several advantages, there are also a number of challenges. The diffuse nature of SAL leads to a more pronounced speckle effect than in the SAR case. Optical wavelengths are more susceptible to atmospheric noise, leading to distortions in formed …
Evaluating Similarity Of Cross-Architecture Basic Blocks, Elijah L. Meyer
Browse all Theses and Dissertations
Vulnerabilities in source code can be compiled for multiple processor architectures and make their way into several different devices. Security researchers frequently have no way to obtain this source code to analyze for vulnerabilities. Therefore, the ability to effectively analyze binary code is essential. Similarity detection is one facet of binary code analysis. Because source code can be compiled for different architectures, the need can arise for detecting code similarity across architectures. This need is especially apparent when analyzing firmware from embedded computing environments such as Internet of Things devices, where the processor architecture is dependent on the product and …
Validating Software States Using Reverse Execution, Nathaniel Christian Boland
Browse all Theses and Dissertations
A key feature of software analysis is determining whether it is possible for a program to reach a certain state. Various methods have been devised to accomplish this including directed fuzzing and dynamic execution. In this thesis we present a reverse execution engine to validate states, the Complex Emulator. The Complex Emulator seeks to validate a program state by emulating it in reverse to discover if a contradiction exists. When unknown variables are found during execution, the emulator is designed to use constraint solving to compute their values. The Complex Emulator has been tested on small assembly programs and is …
Topological Hierarchies And Decomposition: From Clustering To Persistence, Kyle A. Brown
Browse all Theses and Dissertations
Hierarchical clustering is a class of algorithms commonly used in exploratory data analysis (EDA) and supervised learning. However, they suffer from some drawbacks, including the difficulty of interpreting the resulting dendrogram, arbitrariness in the choice of cut to obtain a flat clustering, and the lack of an obvious way of comparing individual clusters. In this dissertation, we develop the notion of a topological hierarchy on recursively-defined subsets of a metric space. We look to the field of topological data analysis (TDA) for the mathematical background to associate topological structures such as simplicial complexes and maps of covers to clusters in …
Secure Authenticated Key Exchange For Enhancing The Security Of Routing Protocol For Low-Power And Lossy Networks, Sarah Mohammed Alzahrani
Browse all Theses and Dissertations
The current Routing Protocol for Low Power and Lossy Networks (RPL) standard provides three security modes: Unsecured Mode (UM), Preinstalled Secure Mode (PSM), and Authenticated Secure Mode (ASM). The PSM and ASM are designed to prevent external routing attacks and specific replay attacks through an optional replay protection mechanism. RPL's PSM mode does not support key replacement when a malicious party obtains the key via differential cryptanalysis since it considers the key to be provided to nodes during the configuration of the network. This thesis presents an approach to implementing a secure authenticated key exchange mechanism for RPL, which ensures …
Locality Analysis Of Patched Php Vulnerabilities, Luke N. Holt
Browse all Theses and Dissertations
The size and complexity of modern software programs are constantly growing, making it increasingly difficult to diligently find and diagnose security exploits. The ability to quickly and effectively release patches to prevent existing vulnerabilities significantly limits the exploitation of users and/or the company itself. Due to this, it has become crucial to provide the capability of not only releasing a patched version, but also to do so quickly to mitigate the potential damage. In this thesis, we propose metrics for evaluating the locality between exploitable code and its corresponding sanitation API such that we can statistically determine the proximity of …
Realistic Virtual Human Character Design Strategy And Experience For Supporting Serious Role-Playing Simulations On Mobile Devices, Sindhu Kumari
Browse all Theses and Dissertations
Promoting awareness of social determinants of health (SDoH) among healthcare providers is important to improve the patient care experience and outcome as it helps providers understand their patients in a better way which can facilitate more efficient and effective communication about health conditions. Healthcare professionals are typically educated about SDoH through lectures, questionnaires, or role-play-based approaches; but in today’s world, it is becoming increasingly possible to leverage modern technology to create more impactful and accessible tools for SDoH education. Wright LIFE (Lifelike Immersion for Equity) is a simulation-based training tool especially created for this purpose. It is a mobile app …
Design, Analysis, And Optimization Of Traffic Engineering For Software Defined Networks, Mohammed Ibrahim Salman
Browse all Theses and Dissertations
Network traffic has been growing exponentially due to the rapid development of applications and communications technologies. Conventional routing protocols, such as Open-Shortest Path First (OSPF), do not provide optimal routing and result in weak network resources. Optimal traffic engineering (TE) is not applicable in practice due to operational constraints such as limited memory on the forwarding devices and route oscillation. Recently, a new way of centralized management of networks enabled by Software-Defined Networking (SDN) made it easy to apply most traffic engineering ideas in practice. Toward creating an applicable traffic engineering system, we created a TE simulator for experimenting …
Data Analytics And Visualization For Virtual Simulation, Sri Lekha Koppaka
Browse all Theses and Dissertations
Healthcare organizations attract a diversity of caregivers and patients by providing essential care. While interacting with people of various races, ethnicities, and economic backgrounds, caregivers need to be empathetic and compassionate. Proper training and exposure are needed to understand the patient’s background, handle different situations, and provide the best care for the patient. With social determinants of health (SDOH) as the basis, the thesis focuses on providing exposure through “Wright LIFE (Lifelike Immersion for Equity) - A simulation-based training tool” to two such scenarios covering patients from the LGBTQIA+ community & autism spectrum disorder (ASD). This interactive tool helps …
Development Of Enhanced User Interaction And User Experience For Supporting Serious Role-Playing Games In A Healthcare Setting, Mark Lee Alow
Browse all Theses and Dissertations
Education about implicit bias in clinical settings is essential for improving the quality of healthcare for underrepresented groups. Such a learning experience can be delivered in the form of a serious game simulation. WrightLIFE (Lifelike Immersion for Equity) is a project that combines two serious game simulations, each addressing a group that faces implicit bias. These groups are individuals who identify as LGBTQIA+ and people with autism spectrum disorder (ASD). The project presents healthcare providers with a training tool that puts them in the roles of the patient and a medical specialist and immerses them in social and clinical …
Few-Shot Malware Detection Using A Novel Adversarial Reprogramming Model, Ekula Praveen Kumar
Browse all Theses and Dissertations
The increasing sophistication of malware has made detecting and defending against new strains a major challenge for cybersecurity. One promising approach to this problem is using machine learning techniques that extract representative features and train classification models to detect malware in an early stage. However, training such machine learning-based malware detection models represents a significant challenge that requires a large number of high-quality labeled data samples while it is very costly to obtain them in real-world scenarios. In other words, training machine learning models for malware detection requires the capability to learn from only a few labeled examples. To address …
A Solder-Defined Computer Architecture For Backdoor And Malware Resistance, Marc W. Abel
Browse all Theses and Dissertations
This research is about securing control of those devices we most depend on for integrity and confidentiality. An emerging concern is that complex integrated circuits may be subject to exploitable defects or backdoors, and measures for inspection and audit of these chips are neither supported nor scalable. One approach for providing a “supply chain firewall” may be to forgo such components, and instead to build central processing units (CPUs) and other complex logic from simple, generic parts. This work investigates the capability and speed ceiling when open-source hardware methodologies are fused with maker-scale assembly tools and visible-scale final inspection.
The …
Computer Enabled Interventions To Communication And Behavioral Problems In Collaborative Work Environments, Ashutosh Shivakumar
Browse all Theses and Dissertations
Task success in co-located and distributed collaborative work settings is characterized by clear and efficient communication between participating members. Communication issues like 1) unwanted interruptions and 2) delayed feedback in distributed collaborative work scenarios have the potential to impede task coordination and significantly decrease the probability of accomplishing the task objective. Research shows that 1) interrupting tasks at random moments can cause users to take up to 30% longer to resume tasks, commit up to twice the errors, and experience up to twice the negative effect than when interrupted at boundaries, and 2) skill retention in collaborative learning tasks improves with …
Automatically Inferring Image Bases Of Arm32 Binaries, Daniel T. Chong
Browse all Theses and Dissertations
Reverse engineering tools rely on the critical image base value for tasks such as correctly mapping code into virtual memory for an emulator or accurately determining branch destinations for a disassembler. However, binaries are often stripped and therefore do not explicitly state this value. Currently available solutions for calculating this essential value generally require user input in the form of parameter configurations or manual binary analysis; thus, these methods are limited by the experience and knowledge of the user. In this thesis, we propose a user-independent solution for determining the image base of ARM32 binaries and describe our implementation. Our …
Automatically Generating Searchable Fingerprints For Wordpress Plugins Using Static Program Analysis, Chuang Li
Browse all Theses and Dissertations
This thesis introduces a novel method to automatically generate fingerprints for WordPress plugins. Our method performs static program analysis using Abstract Syntax Trees (ASTs) of WordPress plugins. The generated fingerprints can be used for identifying these plugins using search engines, which can support critical applications such as proactively identifying web servers with vulnerable WordPress plugins. We have used our method to generate fingerprints for over 10,000 WordPress plugins and analyzed the resulting fingerprints. Our fingerprints have also revealed 453 websites that are potentially vulnerable. We have also compared fingerprints for vulnerable plugins and those for vulnerability-free plugins.
Ufuzzer: Lightweight Detection Of Php-Based Unrestricted File Upload Vulnerabilities Via Static-Fuzzing Co-Analysis, Jin Huang, Junjie Zhang, Jialun Liu, Chuang Li
Computer Science and Engineering Faculty Publications
Unrestricted file upload vulnerabilities enable attackers to upload malicious scripts to a web server for later execution. We have built a system, namely UFuzzer, to effectively and automatically detect such vulnerabilities in PHP-based server-side web programs. Different from existing detection methods that use either static program analysis or fuzzing, UFuzzer integrates both (i.e., static-fuzzing co-analysis). Specifically, it leverages static program analysis to generate executable code templates that compactly and effectively summarize the vulnerability-relevant semantics of a server-side web application. UFuzzer then “fuzzes” these templates in a local, native PHP runtime environment for vulnerability detection. Compared to static-analysis-based methods, UFuzzer preserves …
Clustering Of Pain Dynamics In Sickle Cell Disease From Sparse, Uneven Samples, Gary K. Nave Jr, Swati Padhee, Amanuel Alambo, Tanvi Banerjee, Nirmish Shah, Daniel M. Abrams
Computer Science and Engineering Faculty Publications
Irregularly sampled time series data are common in a variety of fields. Many typical methods for drawing insight from data fail in this case. Here we attempt to generalize methods for clustering trajectories to irregularly and sparsely sampled data. We first construct synthetic data sets, then propose and assess four methods of data alignment to allow for application of spectral clustering. We also repeat the same process for real data drawn from medical records of patients with sickle cell disease -- patients whose subjective experiences of pain were tracked for several months via a mobile app. We find that different …
Uncertainty-Aware Visualization In Medical Imaging - A Survey, Christina Gillmann, Dorothee Saur, Thomas Wischgoll, Gerik Scheuermann
Computer Science and Engineering Faculty Publications
Medical imaging (image acquisition, image transformation, and image visualization) is a standard tool for clinicians in order to make diagnoses, plan surgeries, or educate students. Each of these steps is affected by uncertainty, which can highly influence the decision-making process of clinicians. Visualization can help in understanding and communicating these uncertainties. In this manuscript, we aim to summarize the current state-of-the-art in uncertainty-aware visualization in medical imaging. Our report is based on the steps involved in medical imaging as well as its applications. Requirements are formulated to examine the considered approaches. In addition, this manuscript shows which approaches can be …
Nomophobia Before And After The Covid-19 Pandemic-Can Social Media Be Used To Understand Mobile Phone Dependency, Vaishnavi Visweswaraiah, Tanvi Banerjee, William Romine, Sarah Fryman
Computer Science and Engineering Faculty Publications
No abstract provided.
Neuro-Symbolic Deductive Reasoning For Cross-Knowledge Graph Entailment, Monireh Ebrahimi, Md Kamruzzaman Sarker, Federico Bianchi, Ning Xie, Aaron Eberhart, Derek Doran, Hyeongsik Kim, Pascal Hitzler
Computer Science and Engineering Faculty Publications
A significant recent development in neural-symbolic learning is the emergence of deep neural networks that can reason over symbolic knowledge graphs (KGs). A particular task of interest is KG entailment, which is to infer the set of all facts that are a logical consequence of current and potential facts of a KG. Initial neural-symbolic systems that can deduce the entailment of a KG have been presented, but they are limited: current systems learn fact relations and entailment patterns specific to a particular KG and hence do not truly generalize, and must be retrained for each KG they are tasked with entailing. We …
Leveraging Natural Language Processing To Mine Issues On Twitter During The Covid-19 Pandemic, Ankita Agarwal, Preetham Salehundam, Swati Padhee, William Romine, Tanvi Banerjee
Computer Science and Engineering Faculty Publications
The recent global outbreak of the coronavirus disease (COVID-19) has spread to all corners of the globe. The international travel ban, panic buying, and the need for self-quarantine are among the many other social challenges brought about in this new era. Twitter platforms have been used in various public health studies to identify public opinion about an event at the local and global scale. To understand the public concerns and responses to the pandemic, a system that can leverage machine learning techniques to filter out irrelevant tweets and identify the important topics of discussion on social media platforms like Twitter …
Topic-Centric Unsupervised Multi-Document Summarization Of Scientific And News Articles, Amanuel Alambo, Cori Lohstroh, Erik Madaus, Swati Padhee, Brandy Foster, Tanvi Banerjee, Krishnaprasad Thirunarayan, Michael Raymer
Computer Science and Engineering Faculty Publications
Recent advances in natural language processing have enabled automation of a wide range of tasks, including machine translation, named entity recognition, and sentiment analysis. Automated summarization of documents, or groups of documents, however, has remained elusive, with many efforts limited to extraction of keywords, key phrases, or key sentences. Accurate abstractive summarization has yet to be achieved due to the inherent difficulty of the problem, and limited availability of training data. In this paper, we propose a topic-centric unsupervised multi-document summarization framework to generate extractive and abstractive summaries for groups of scientific articles across 20 Fields of Study (FoS) in …
Can Subjective Pain Be Inferred From Objective Physiological Data? Evidence From Patients With Sickle Cell Disease, Mark J. Panaggio, Daniel M. Abrams, Fan Yang, Tanvi Banerjee, Nirmish R. Shah
Computer Science and Engineering Faculty Publications
Patients with sickle cell disease (SCD) experience lifelong struggles with both chronic and acute pain, often requiring medical intervention. Pain can be managed with medications, but dosages must balance the goal of pain mitigation against the risks of tolerance, addiction and other adverse effects. Setting appropriate dosages requires knowledge of a patient's subjective pain, but collecting pain reports from patients can be difficult for clinicians and disruptive for patients, and is only possible when patients are awake and communicative. Here we investigate methods for estimating SCD patients' pain levels indirectly using vital signs that are routinely collected and documented in …
An Analysis Of C/C++ Datasets For Machine Learning-Assisted Software Vulnerability Detection, Daniel Grahn, Junjie Zhang
Computer Science and Engineering Faculty Publications
As machine learning-assisted vulnerability detection research matures, it is critical to understand the datasets being used by existing papers. In this paper, we explore 7 C/C++ datasets and evaluate their suitability for machine learning-assisted vulnerability detection. We also present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files – a sufficiently large sample to be reasonably considered representative of typical C/C++ code. To facilitate comparison, we tokenize all of the datasets and perform the analysis at this level. We make three primary contributions. First, while all the datasets differ from our Wild C dataset, some …