Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Computer Sciences

Data Mining

Full-Text Articles in Physical Sciences and Mathematics

A Method For Generating A Non-Manual Feature Model For Sign Language Processing, Robert G. Smith Dr, Markus Hofmann Dr Aug 2023

Articles

While recent approaches to sign language processing have shifted to the domain of Machine Learning (ML), the treatment of Non-Manual Features (NMFs) remains an open question. The principal challenge facing this method is the comparatively small sign language corpora available for training machine learning models. This study produces a statistical model which may be used in future ML, rules-based, and hybrid-learning approaches for sign language processing tasks. In doing so, this research explores the emerging patterns of non-manual articulation concerning grammatical classes in Irish Sign Language (ISL). The experimental method applied here is a novel implementation of an association rules …
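The abstract above is truncated, so the authors' own procedure is not shown here. As a minimal sketch of the two standard measures that association rules mining builds on, the following computes support and confidence over a toy set of transactions. The item labels (`brow_raise`, `question`, and so on) are hypothetical placeholders and are not taken from the ISL corpus.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


# Hypothetical toy data: each set pairs a non-manual feature with a class label.
transactions = [
    {"brow_raise", "question"},
    {"brow_raise", "question"},
    {"brow_raise", "statement"},
    {"head_nod", "statement"},
]

rule_support = support(transactions, {"brow_raise", "question"})
rule_confidence = confidence(transactions, {"brow_raise"}, {"question"})
```

A rule such as `brow_raise -> question` would typically be kept only if both values exceed thresholds chosen by the analyst; those thresholds are a modelling decision and are not specified in the abstract.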


Exploiting Association Rules Mining To Inform The Use Of Non-Manual Features In Sign Language Processing, Robert G. Smith Jun 2023

Other Resources

In recent years, the use of virtual assistants and voice user interfaces has become a latent part of modern living. Unseen to the user are the various artificial intelligence and natural language processing technologies, the vast datasets, and the linguistic insights that underpin such tools. The technologies supporting them have chiefly targeted widely used spoken languages, leaving sign language users at a disadvantage. One important reason why sign languages are unsupported by such tools is a requirement of the underpinning technologies for a comprehensive description of the language. Sign language processing technologies endeavour to bridge this technology inequality.

Recent approaches …


Adaptive Resolution Loss: An Efficient And Effective Loss For Time Series Self-Supervised Learning Framework, Kevin Garcia, Juan Manuel Perez, Yifeng Gao Jan 2023

Computer Science Faculty Publications and Presentations

Time series data is a crucial form of information that offers vast opportunities. With the widespread use of sensor networks, large-scale time series data has become ubiquitous. One of the most prominent problems in time series data mining is representation learning. Recently, with the introduction of self-supervised learning (SSL) frameworks, a large body of research has focused on designing an effective SSL framework for time series data. One of the current state-of-the-art SSL frameworks for time series is TS2Vec, a specially designed hierarchical contrastive learning framework that uses loss-based training and performs outstandingly in benchmark testing. However, the computational cost …


An Analysis On Network Flow-Based IoT Botnet Detection Using Weka, Cian Porteous Jan 2022

Dissertations

Botnets pose a significant and growing risk to modern networks. Detection of botnets remains an important area of open research in order to prevent the proliferation of botnets and to mitigate the damage that can be caused by botnets that have already been established. Botnet detection can be broadly categorised into two main categories: signature-based detection and anomaly-based detection. This paper sets out to measure the accuracy, false-positive rate, and false-negative rate of four algorithms that are available in Weka for anomaly-based detection of a dataset of HTTP and IRC botnet data. The algorithms that were selected to detect botnets …
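The three measures named in the abstract can be derived from the counts of a binary confusion matrix. The sketch below is a plain-Python illustration and not the Weka workflow used in the dissertation; the function name `detection_metrics` and the label convention (1 for botnet traffic, 0 for benign traffic) are assumptions made for this example.

```python
def detection_metrics(y_true, y_pred):
    """Return (accuracy, false-positive rate, false-negative rate).

    Labels are assumed to be 1 for botnet traffic and 0 for benign traffic.
    """
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 1:
            fp += 1
        elif truth == 0 and guess == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # benign flows flagged as botnet
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # botnet flows that were missed
    return accuracy, fpr, fnr


# Hypothetical labels for eight network flows.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
accuracy, fpr, fnr = detection_metrics(y_true, y_pred)
```

Reporting the false-positive and false-negative rates alongside accuracy matters for anomaly-based detection, because a detector can score high accuracy on imbalanced traffic while still missing most botnet flows.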


Unsupervised Data Mining Technique For Clustering Library In Indonesia, Robbi Rahim, Joseph Teguh Santoso, Sri Jumini, Gita Widi Bhawika, Daniel Susilo, Danny Wibowo Feb 2021

Library Philosophy and Practice (e-journal)

Organizing school libraries not only preserves library materials but also helps students and teachers complete tasks in the teaching process, supporting the national development goal of improving community welfare by producing high-quality, competitive human resources. The purpose of this study is to analyze the Unsupervised Learning technique in conducting cluster mapping of the number of libraries at education levels in Indonesia. The data were obtained from the Ministry of Education and Culture and processed by the Central Statistics Agency (abbreviated as BPS; URL: bps.go.id/). The data consisted of 34 records where the attribute …


Novel Technique To Analyze The Effects Of Cognitive And Non-Cognitive Predictors On Students Course Withdrawal In College, Mohammed Ali Jul 2020

Technology Faculty Publications and Presentations

A novel technique was applied to a college student database to identify the cognitive and non-cognitive factors that predict college students’ course withdrawal behaviors. Predictors such as high school grade point average (HSGPA), standardized test scores (ACT, American College Test, or SAT, Scholastic Aptitude Test), number of credit hours enrolled, and age were analyzed in this study. Data mining software algorithms were used to study information about undergraduate students at a west-south-central state university in the United States. The study results revealed that two factors, the number of enrolled credit hours and a student’s age, have the greatest effect on collegiate course withdrawal …


Human Activity Recognition & Mobily Path Prediction, Priya Patel Apr 2020

Other Student Works

Individual mobility is the study of how individuals move within a region or system. Several recent studies have addressed this question, and there has been a surge in the large-scale data available on individual movements. Most of this information is gathered from cell phones and/or GPS, with accuracy that varies with the distance from the tower. Large-scale data such as cell phone traces are an important source for urban modeling. Individual travel patterns collapse into a single probability distribution; however, despite the diversity of their travel histories, people follow simple, reproducible patterns. This …


Using Data Analytics To Filter Insincere Posts From Online Social Networks. A Case Study: Quora Insincere Questions, Mohammad A. Al-Ramahi, Izzat Alsmadi Jan 2020

Computer Information Systems Faculty Publications

The internet in general and Online Social Networks (OSNs) in particular continue to play a significant role in our lives, where information is massively uploaded and exchanged. With such high importance and attention, abuses of such media of communication for different purposes are common. Driven by goals such as marketing and financial gains, some users use OSNs to post misleading or insincere content. In this context, we utilized a real-world dataset posted by Quora on Kaggle.com to evaluate different mechanisms and algorithms to filter insincere and spam content. We evaluated different preprocessing and analysis models. Moreover, we analyzed the …


Reputation-Aware Trajectory-Based Data Mining In The Internet Of Things (IoT), Samia Tasnim Nov 2019

FIU Electronic Theses and Dissertations

Internet of Things (IoT) is a critically important technology for the acquisition of spatiotemporally dense data in diverse applications, ranging from environmental monitoring to surveillance systems. Such data helps us improve our transportation systems, monitor our air quality and the spread of diseases, respond to natural disasters, and support a bevy of other applications. However, IoT sensor data is error-prone for a number of reasons: sensors may be deployed in hazardous environments, may deplete their energy resources, may have mechanical faults, or may become the targets of malicious attacks by adversaries. While previous research has attempted to improve the quality of …


LearnFCA: A Fuzzy FCA And Probability Based Approach For Learning And Classification, Suraj Ketan Samal Aug 2019

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

Formal concept analysis (FCA) is a mathematical theory based on lattice and order theory used for data analysis and knowledge representation. Over the past several years, many of its extensions have been proposed and applied in several domains including data mining, machine learning, knowledge management, semantic web, software development, chemistry, biology, medicine, data analytics, and ontology engineering.

This thesis reviews the state of the art in the theory of Formal Concept Analysis (FCA) and its various extensions that have been developed and well-studied in the past several years. We discuss their historical roots and reproduce the original definitions and derivations with illustrative examples. Further, we provide …


Econometrics In R Program, Ian Connors May 2019

Senior Honors Projects

Econometrics and Datamining using R Programming

I provide an analysis of Rhode Island economic conditions by comparing economic variables in the state to other states in New England and the country as a whole. I learned the programming language R to complete the analysis using published economic statistics. Statistics provided from the Bureau of Economic Analysis (BEA) show quarterly or annual trends which can assist the researcher in predicting future trends. This data includes figures such as real personal income, real GDP, per capita real GDP, regional price parities, housing prices, and total full-time and part-time employment by state; additionally, …


Analyzing And Estimating Cyberattack Trends By Performing Data Mining On A Cybersecurity Data Set, Chan Young Koh Apr 2019

Honors Program Theses and Projects

More than five billion pieces of personal information have been compromised over the past eight years through data breaches at notable companies, and the damage related to cybercrime is expected to reach six trillion USD annually by 2021. Interestingly, recent cyberattacks were aimed specifically at credit agencies and companies that hold credit information of their customers and employees. The question is: “Why is it difficult to protect against or evade cyberattacks even for these prestigious companies?” The purpose of this research is to bring attention to notorious, rapidly multiplying cyberthreats. Hence, the research focuses on analyzing cyberattack techniques and …


The Political Power Of Twitter, James Usher, Pierpaolo Dondio, Lucia Morales Jan 2019

Conference papers

In June 2016, the British voted by 52 per cent to leave the EU, a club the UK joined in 1973. This paper examines Twitter public and political party discourse surrounding the BREXIT withdrawal agreement. In particular, we focus on tweets from four different BREXIT exit strategies known as “Norway”, “Article 50”, the “Backstop” and “No Deal” and their effect on the pound and FTSE 100 index from the period of December 10th 2018 to February 24th 2019. Our approach focuses on using a Naive Bayes classification algorithm to assess political party and public Twitter sentiment. A Granger causality analysis …


Auditing SNOMED CT Hierarchical Relations Based On Lexical Features Of Concepts In Non-Lattice Subgraphs, Licong Cui, Olivier Bodenreider, Jay Shi, Guo-Qiang Zhang Feb 2018

Computer Science Faculty Publications

Objective—We introduce a structural-lexical approach for auditing SNOMED CT using a combination of non-lattice subgraphs of the underlying hierarchical relations and enriched lexical attributes of fully specified concept names. Our goal is to develop a scalable and effective approach that automatically identifies missing hierarchical IS-A relations.

Methods—Our approach involves 3 stages. In stage 1, all non-lattice subgraphs of SNOMED CT’s IS-A hierarchical relations are extracted. In stage 2, lexical attributes of fully-specified concept names in such non-lattice subgraphs are extracted. For each concept in a non-lattice subgraph, we enrich its set of attributes with attributes from its ancestor …


Data Mining Techniques To Understand Textual Data, Wubai Zhou Oct 2017

FIU Electronic Theses and Dissertations

More than ever, online information delivery and storage rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task involving broad research topics, and contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, conversational bots, and so on. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation mainly focuses on textual understanding tasks …


Recommendation Vs Sentiment Analysis: A Text-Driven Latent Factor Model For Rating Prediction With Cold-Start Awareness, Kaisong Song, Wei Gao, Shi Feng Feng, Daling Wang, Kam-Fai Wong, Chengqi Zhang Aug 2017

Research Collection School Of Computing and Information Systems

Review rating prediction is an important research topic. The problem was approached from either the perspective of recommender systems (RS) or that of sentiment analysis (SA). Recent SA research using deep neural networks (DNNs) has realized the importance of user and product interaction for better interpreting the sentiment of reviews. However, the complexity of DNN models in terms of the scale of parameters is very high, and the performance is not always satisfying especially when user-product interaction is sparse. In this paper, we propose a simple, extensible RS-based model, called Text-driven Latent Factor Model (TLFM), to capture the semantics of …


Semantic Visualization For Short Texts With Word Embeddings, Van Minh Tuan Le, Hady W. Lauw Aug 2017

Research Collection School Of Computing and Information Systems

Semantic visualization integrates topic modeling and visualization, such that every document is associated with a topic distribution as well as visualization coordinates on a low-dimensional Euclidean space. We address the problem of semantic visualization for short texts. Such documents are increasingly common, including tweets, search snippets, news headlines, or status updates. Due to their short lengths, it is difficult to model semantics as the word co-occurrences in such a corpus are very sparse. Our approach is to incorporate auxiliary information, such as word embeddings from a larger corpus, to supplement the lack of co-occurrences. This requires the development of a …


Mining Non-Lattice Subgraphs For Detecting Missing Hierarchical Relations And Concepts In SNOMED CT, Licong Cui, Wei Zhu, Shiqiang Tao, James T. Case, Olivier Bodenreider, Guo-Qiang Zhang Jul 2017

Computer Science Faculty Publications

Objective: Quality assurance of large ontological systems such as SNOMED CT is an indispensable part of the terminology management lifecycle. We introduce a hybrid structural-lexical method for scalable and systematic discovery of missing hierarchical relations and concepts in SNOMED CT.

Material and Methods: All non-lattice subgraphs (the structural part) in SNOMED CT are exhaustively extracted using a scalable MapReduce algorithm. Four lexical patterns (the lexical part) are identified among the extracted non-lattice subgraphs. Non-lattice subgraphs exhibiting such lexical patterns are often indicative of missing hierarchical relations or concepts. Each lexical pattern is associated with a potential specific type of error. …


Significant Permission Identification For Android Malware Detection, Lichao Sun Jul 2016

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

A recent report indicates that a newly developed malicious app for Android is introduced every 11 seconds. To combat this alarming rate of malware creation, we need a scalable malware detection approach that is effective and efficient. In this thesis, we introduce SigPID, a malware detection system based on permission analysis to cope with the rapid increase in the number of Android malware. Instead of analyzing all 135 Android permissions, our approach applies 3-level pruning by mining the permission data to identify only significant permissions that can be effective in distinguishing benign and malicious apps. Based on the identified significant …


EFSPredictor: Predicting Configuration Bugs With Ensemble Feature Selection, Bowen Xu, David Lo, Xin Xia, Ashish Sureka, Shanping Li May 2016

Research Collection School Of Computing and Information Systems

The configuration of a system determines the system's behavior, and wrong configuration settings can adversely impact the system's availability, performance, and correctness. We refer to these wrong configuration settings as configuration bugs. The importance of configuration bugs has prompted many researchers to study them, and past studies can be grouped into three categories: detection, localization, and fixing of configuration bugs. In this work, we focus on the detection of configuration bugs; in particular, we follow the line of work that tries to predict whether a bug report is caused by a wrong configuration setting. Automatic prediction of whether a bug is a configuration …


Domain Specific Document Retrieval Framework For Real-Time Social Health Data, Swapnil Soni Jul 2015

Kno.e.sis Publications

With the advent of web search and microblogging, the percentage of Online Health Information Seekers (OHIS) using these online services to share and seek real-time health information has increased exponentially. OHIS use web search engines or microblogging search services to seek out the latest, relevant, and reliable health information. When OHIS turn to microblogging search services to search for real-time content, trends, breaking news, etc., the search results are not promising. Two major challenges in current microblogging search engines are their reliance on keyword-based techniques and results that do not contain real-time information. To address these challenges, …


Using Support Vector Machine Ensembles For Target Audience Classification On Twitter, Siaw Ling Lo, Raymond Chiong, David Cornforth Apr 2015

Research Collection School Of Computing and Information Systems

The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation efforts. Topic domains were automatically discovered from contents shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using contents from different account owners of the various topic domains identified by Twitter LDA. Experimental results …


Temporal Mining For Distributed Systems, Yexi Jiang Mar 2015

FIU Electronic Theses and Dissertations

Many systems and applications are continuously producing events. These events are used to record the status of the system and trace the behaviors of the systems. By examining these events, system administrators can check the potential problems of these systems. If the temporal dynamics of the systems are further investigated, the underlying patterns can be discovered. The uncovered knowledge can be leveraged to predict the future system behaviors or to mitigate the potential risks of the systems. Moreover, the system administrators can utilize the temporal patterns to set up event management rules to make the system more intelligent.

With the …


Analysis Into Developing Accurate And Efficient Intrusion Detection Approaches, Priya Rabadia, Craig Valli Jan 2015

Australian Digital Forensics Conference

Cyber-security has become more prevalent as more organisations are relying on cyber-enabled infrastructures to conduct their daily activities. Subsequently, cybercrime and cyber-attacks are increasing. An Intrusion Detection System (IDS) is a cyber-security tool that is used to mitigate cyber-attacks. An IDS is a system deployed to monitor network traffic and trigger an alert when unauthorised activity has been detected. It is important for IDSs to accurately identify cyber-attacks against assets on cyber-enabled infrastructures, while also being efficient at processing current and predicted network traffic flows. The purpose of the paper is to outline the importance of developing an accurate and …


Time-Series Data Mining In Transportation: A Case Study On Singapore Public Train Commuter Travel Patterns, Tin Seong Kam, Roy Ka Wei Lee Mar 2014

Research Collection School Of Computing and Information Systems

The adoption of smart card technologies and automated data collection systems (ADCS) in the transportation domain has provided public transport planners opportunities to amass a huge and continuously increasing amount of time-series data about the behaviors and travel patterns of commuters. However, the explosive growth of temporal-related databases has far outpaced the transport planners’ ability to interpret these data using conventional statistical techniques, creating an urgent need for new techniques to support the analyst in transforming the data into actionable information and knowledge. This research study thus explores and discusses the potential use of time-series data mining, a relatively new …


Mining Effective Multi-Segment Sliding Window For Pathogen Incidence Rate Prediction, Lei Duan, Changjie Tang, Xiasong Li, Guozhu Dong, Xianming Wang, Jie Zuo, Min Jiang, Zhongqi Li, Yongqing Zhang Sep 2013

Kno.e.sis Publications

Pathogen incidence rate prediction, which can be treated as time series modeling, is an important task for infectious disease incidence rate prediction and for public health. This paper investigates the application of a genetic computation technique, namely GEP, for pathogen incidence rate prediction. To overcome the shortcomings of traditional sliding windows in GEP-based time series modeling, the paper introduces the problem of mining effective sliding windows, that is, discovering optimal sliding windows for building accurate prediction models. To exploit the periodic characteristics of pathogen incidence rates, a multi-segment sliding window consisting of several segments from different periodical intervals is proposed and …
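The abstract is truncated before the window construction is specified, so the following is only one plausible reading of "several segments from different periodical intervals". It is a sketch under stated assumptions, not the authors' method. The function name `multi_segment_window` and the parameters `period`, `n_periods`, and `seg_len` are hypothetical.

```python
def multi_segment_window(series, t, period, n_periods, seg_len):
    """Collect model inputs for predicting series[t].

    Assumed layout (not taken from the paper): the `seg_len` values
    immediately before t, followed by, for each of the previous
    `n_periods` periods, the `seg_len` values ending at the same phase
    (index t - k * period).
    """
    earliest = t - n_periods * period + 1 - seg_len
    if t - seg_len < 0 or earliest < 0:
        raise ValueError("not enough history for the requested window")

    features = list(series[t - seg_len:t])  # most recent segment
    for k in range(1, n_periods + 1):
        end = t - k * period + 1            # slice end is exclusive
        features.extend(series[end - seg_len:end])
    return features


# Hypothetical monthly series whose values equal their own index,
# which makes it easy to see which positions were selected.
series = list(range(30))
window = multi_segment_window(series, t=26, period=12, n_periods=2, seg_len=2)
```

With a yearly period of 12 months, the window for month 26 combines the two most recent months with two-month segments ending at the same month one and two years earlier, so a model sees both short-term and seasonal context.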


Hybrid Methods For Feature Selection, Iunniang Cheng May 2013

Masters Theses & Specialist Projects

Feature selection is one of the important data preprocessing steps in data mining. The feature selection problem involves finding a feature subset such that a classification model built only with this subset would have better predictive accuracy than a model built with the complete set of features. In this study, we propose two hybrid methods for feature selection. The best features are selected through either the hybrid methods or existing feature selection methods. Next, the reduced dataset is used to build classification models using five classifiers. The classification accuracy was evaluated in terms of the area under the Receiver Operating Characteristic …


A Hybrid Recommendation System Based On Association Rules, Ahmed Alsalama May 2013

Masters Theses & Specialist Projects

Recommendation systems are widely used in e-commerce applications. The engine of a current recommendation system recommends items to a particular user based on user preferences and previous high ratings. Various recommendation schemes such as collaborative filtering and content-based approaches are used to build a recommendation system. Most current recommendation systems were developed to fit a certain domain such as books, articles, and movies. We propose a hybrid framework recommendation system to be applied on two-dimensional spaces (User × Item) with a large number of users and a small number of items. Moreover, our proposed framework makes use of …


A Framework For Generating Data To Simulate Application Scoring, Kenneth Kennedy, Sarah Jane Delany, Brian Mac Namee Aug 2011

Conference papers

In this paper we propose a framework to generate artificial data that can be used to simulate credit risk scenarios. Artificial data is useful in the credit scoring domain for two reasons. Firstly, the use of artificial data allows for the introduction and control of variability that can realistically be expected to occur, but has yet to materialise in practice. The ability to control parameters allows for a thorough exploration of the performance of classification models under different conditions. Secondly, due to non-disclosure agreements and commercial sensitivities, obtaining real credit scoring data is a problematic and time consuming task. By …


Polygonal Spatial Clustering, Deepti Joshi Apr 2011

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

Clustering, the process of grouping together similar objects, is a fundamental task in data mining to help perform knowledge discovery in large datasets. With the growing number of sensor networks, geospatial satellites, global positioning devices, and human networks, tremendous amounts of spatio-temporal data that measure the state of the planet Earth are being collected every day. This large amount of spatio-temporal data has increased the need for efficient spatial data mining techniques. Furthermore, most of the anthropogenic objects in space are represented using polygons, for example, counties, census tracts, and watersheds. Therefore, it is important to develop data mining …