Open Access. Powered by Scholars. Published by Universities.®

Computer Engineering Commons

Data mining

Articles 61 - 90 of 94

Full-Text Articles in Computer Engineering

Pre-Processing Techniques Applied To Automatic Taxon Identification On Fish Otoliths, Ramon Reig-Bolaño, Pere Marti-Puig Jun 2014

International Congress on Environmental Modelling and Software

This paper analyzes the characteristics of a rotation-invariant feature space to be used in a classifier of fish otoliths. It is compared to two other feature spaces, one with raw data and another with transformed data (using Elliptic Fourier Descriptors, EFD). Otoliths are found in the inner ear of fish. Their shape can be analyzed to determine sex, age, populations and species, and thus they can provide necessary and relevant information for ecological studies. The Automatic Taxon Identifier (ATI) is used to classify fish otoliths directly from a query image and is implemented online in a public database. This …


Text Stylometry For Chat Bot Identification And Intelligence Estimation., Nawaf Ali May 2014

Electronic Theses and Dissertations

Authorship identification is a technique used to identify the author of an unclaimed document by attempting to find traits that match those of the original author. Authorship identification has great potential for applications in forensics. It can also be used to identify chat bots, a form of intelligent software created to mimic human conversation, by their unique style. The online criminal community is utilizing chat bots as a new way to steal private information and commit fraud and identity theft. The need for identifying chat bots by their style is becoming essential to overcome the danger of …


Using The K-Means Clustering Algorithm To Classify Features For Choropleth Maps, Mark Polczynski, Michael Polczynski Apr 2014

Electrical and Computer Engineering Faculty Research and Publications

Common methods for classifying choropleth map features typically form classes based on a single feature attribute. This technical note reviews the use of the k-means clustering algorithm to perform feature classification using multiple feature attributes. The k-means clustering algorithm is described and compared to other common classification methods, and two examples of choropleth maps prepared using k-means clustering are provided.
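As a rough illustration of classifying map features on several attributes at once, the following is a minimal k-means sketch in plain Python. It is not taken from the technical note: the function name `kmeans`, the fixed iteration count, and the random seeding are assumptions made here for illustration, and each point stands for one map feature described by a tuple of attribute values.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Assign each point (a tuple of attribute values) to one of k classes.

    Hypothetical helper for illustration; returns one class label per point.
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the class of its nearest centroid
        # (squared Euclidean distance over all attributes).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels
```

Each returned label could then be mapped to one fill colour on the choropleth map. In practice the attributes would usually be rescaled first so that no single attribute dominates the distance.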


An Urgent Precaution System To Detect Students At Risk Of Substance Abuse Through Classification Algorithms, Faruk Bulut, İhsan Ömür Bucak Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

In recent years, the use of addictive drugs and substances has turned out to be a challenging social problem worldwide. The illicit use of these types of drugs and substances appears to be increasing among elementary and high school students. Once users become addicted, life becomes unbearable and gets even worse for them. Scientific studies show that it becomes extremely difficult for an individual to break this habit after being a user. Hence, preventing teenagers from addiction becomes an important issue. This study focuses on an urgent precaution system that helps families and educators prevent teenagers from developing …


M-FDBSCAN: A Multicore Density-Based Uncertain Data Clustering Algorithm, Atakan Erdem, Taflan İmre Gündem Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

In many data mining applications, we use a clustering algorithm on a large amount of uncertain data. In this paper, we adapt an uncertain data clustering algorithm called fast density-based spatial clustering of applications with noise (FDBSCAN) to multicore systems in order to have fast processing. The new algorithm, which we call multicore FDBSCAN (M-FDBSCAN), splits the data domain into c rectangular regions, where c is the number of cores in the system. The FDBSCAN algorithm is then applied to each rectangular region simultaneously. After the clustering operation is completed, semiclusters that occur during splitting are detected and merged to …
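The domain-splitting step described above can be sketched as follows. This is only an assumed partitioning scheme: the abstract says the domain is split into c rectangular regions but does not say how, so the equal-width vertical strips, the function name `split_into_strips`, and the 2-D point format are all hypothetical. The per-region clustering and the merging of semiclusters are not shown.

```python
def split_into_strips(points, c):
    """Partition 2-D points into c equal-width vertical strips.

    Assumed scheme for illustration: one strip per core, split along x only.
    """
    xs = [x for x, _ in points]
    lo, hi = min(xs), max(xs)
    # Guard against a zero-width domain when all x values are equal.
    width = (hi - lo) / c or 1.0
    regions = [[] for _ in range(c)]
    for p in points:
        # Clamp so the point with the largest x lands in the last strip.
        idx = min(int((p[0] - lo) / width), c - 1)
        regions[idx].append(p)
    return regions
```

Each list in `regions` could then be handed to a separate worker process. Clusters that straddle a strip boundary would still need the detection and merging pass that the abstract mentions.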


Discovery Of Hydrometeorological Patterns, Mete Çelik, Filiz Dadaşer Çelik, Ahmet Şakir Dokuz Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

Hydrometeorological patterns can be defined as meaningful and nontrivial associations between hydrological and meteorological parameters over a region. Discovering hydrometeorological patterns is important for many applications, including forecasting hydrometeorological hazards (floods and droughts), predicting the hydrological responses of ungauged basins, and filling in missing hydrological or meteorological records. However, discovering these patterns is challenging due to the special characteristics of hydrological and meteorological data, and is computationally complex due to the archival history of the datasets. Moreover, defining monotonic interest measures to quantify these patterns is difficult. In this study, we propose a new monotonic interest measure, called the hydrometeorological …


Hot Zone Identification: Analyzing Effects Of Data Sampling On Spam Clustering, Rasib Khan, Mainul Mizan, Ragib Hasan, Alan Sprague Jan 2014

Journal of Digital Forensics, Security and Law

Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given the widespread use of emails in all sectors, they have been the target of spammers since the beginning. Filtering spam emails has now led to critical actions such as forensic activities based on mining spam email. The data mine for spam emails at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely researched repository used by researchers from different global organizations. The usual process …


A Knowledge-Based Clinical Toxicology Consultant For Diagnosing Multiple Exposures, Joel D. Schipper, Douglas D. Dankel II, A. Antonio Arroyo, Jay L. Schauben May 2013

Publications

Objective: This paper presents continued research toward the development of a knowledge-based system for the diagnosis of human toxic exposures. In particular, this research focuses on the challenging task of diagnosing exposures to multiple toxins. Although only 10% of toxic exposures in the United States involve multiple toxins, multiple exposures account for more than half of all toxin-related fatalities. Using simple medical mathematics, we seek to produce a practical decision support system capable of supplying useful information to aid in the diagnosis of complex cases involving multiple unknown substances.

Methods: The system is automatically trained using data mining …


Rank Based Anomaly Detection Algorithms, Huaming Huang May 2013

Electrical Engineering and Computer Science - Dissertations

Anomaly or outlier detection problems are of considerable importance, arising frequently in diverse real-world applications such as finance and cyber-security. Several algorithms have been formulated for such problems, usually based on formulating a problem-dependent heuristic or distance metric. This dissertation proposes anomaly detection algorithms that exploit the notion of "rank," expressing relative outlierness of different points in the relevant space, and exploiting asymmetry in nearest neighbor relations between points: a data point is "more anomalous" if it is not the nearest neighbor of its nearest neighbors. Although rank is computed using distance, it is a more robust and higher level …
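To make the asymmetry idea concrete, here is a minimal sketch of one possible rank-based score. It is an illustration under assumptions, not the dissertation's algorithm: the function name `rank_outlierness`, the choice of k, and the use of a plain average are all hypothetical. For each point p, the code looks at p's k nearest neighbours and asks how far down p appears in each neighbour's own nearest-neighbour ordering.

```python
def rank_outlierness(points, k=2):
    """Score each point by its average rank among its k nearest neighbours.

    Illustrative sketch only; a higher score means the point's neighbours
    do not regard it as a close neighbour in return.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    # order[i] lists the indices of all other points, nearest to i first.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: dist(points[i], points[j]))
        for i in range(n)
    ]
    scores = []
    for i in range(n):
        neighbours = order[i][:k]
        # Rank of i as seen from each neighbour q (1 = q's nearest neighbour).
        ranks = [order[q].index(i) + 1 for q in neighbours]
        scores.append(sum(ranks) / k)
    return scores
```

For four points on a unit square plus one distant point, the distant point receives the highest score, because every other point ranks it last among its neighbours.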


Data Mining The Harness Track And Predicting Outcomes, Robert P. Schumaker Apr 2013

Journal of International Technology and Information Management

This paper presented the S&C Racing system that uses Support Vector Regression (SVR) to predict harness race finishes and analyzed it on fifteen months of data from Northfield Park. We found that our system outperforms the most common betting strategies of wagering on the favorites and the mathematical arbitrage Dr. Z system in five of the seven wager types tested. This work would suggest that an informational inequality exists within the harness racing market that is not apparent to domain experts.


Predicting SQL Injection And Cross Site Scripting Vulnerabilities Through Mining Input Sanitization Patterns, Lwin Khin Shar, Hee Beng Kuan Tan Apr 2013

Research Collection School Of Computing and Information Systems

Context: SQL injection (SQLI) and cross site scripting (XSS) have been the two most common and serious web application vulnerabilities for the past decade. To mitigate these two security threats, many vulnerability detection approaches based on static and dynamic taint analysis techniques have been proposed. Alternatively, there are also vulnerability prediction approaches based on machine learning techniques, which showed that static code attributes such as code complexity measures are cheap and useful predictors. However, current prediction approaches target general vulnerabilities, and most of these approaches locate vulnerable code only at software component or file levels. Some approaches also involve process attributes that …


A Rule Induction Algorithm For Knowledge Discovery And Classification, Ömer Akgöbek Jan 2013

Turkish Journal of Electrical Engineering and Computer Sciences

Classification and rule induction are key topics in the fields of decision making and knowledge discovery. The objective of this study is to present a new algorithm developed for automatic knowledge acquisition in data mining. The proposed algorithm has been named RES-2 (Rule Extraction System). It aims at eliminating the pitfalls and disadvantages of the techniques and algorithms currently in use. The proposed algorithm makes use of the direct rule extraction approach, rather than the decision tree. For this purpose, it uses a set of examples to induce general rules. In this study, 15 datasets consisting of multiclass values with …


A Window Of Opportunity: Assessing Behavioural Scoring, Kenneth Kennedy, Brian Mac Namee, Sarah Jane Delany, Michael O'Sullivan, Neil Watson Jan 2013

Articles

After credit has been granted, lenders use behavioural scoring to assess the likelihood of default occurring during some specific outcome period. This assessment is based on customers’ repayment performance over a given fixed period. Often the outcome period and fixed performance period are arbitrarily selected, causing instability in making predictions. Behavioural scoring has failed to receive the same attention from researchers as application scoring. The bias for application scoring research can be attributed, in part, to the large volume of data required for behavioural scoring studies. Furthermore, the commercial sensitivities associated with such a large pool of customer data often …


An Efficient Algorithm To Solve High-Dimensional Data Clustering: Candidate Subspace Clustering Algorithm, Chin-Chieh Kao Jan 2013

Theses Digitization Project

For this project, a comprehensive literature review on high dimensional data clustering is conducted and a novel density-algorithm to perform high dimensional data clustering is developed.


Semi-Automatic Simulation Initialization By Mining Structured And Unstructured Data Formats From Local And Web Data Sources, Olcay Sahin Oct 2012

Computational Modeling & Simulation Engineering Theses & Dissertations

Initialization is one of the most important processes for obtaining successful results from a simulation. However, initialization is a challenge when 1) a simulation requires hundreds or even thousands of input parameters or 2) the simulation must be re-initialized due to different initial conditions or runtime errors. These challenges lead to the modeler spending more time initializing a simulation and may lead to errors due to poor input data.

This thesis proposes two semi-automatic simulation initialization approaches that provide initialization using data mining from structured and unstructured data formats from local and web data sources. First, the System Initialization with Retrieval (SIR) …


Measuring MERCI: Exploring Data Mining Techniques For Examining Surgical Outcomes Of Stroke Patients, Matthew Ronald Mcnabb Aug 2012

Masters Theses and Doctoral Dissertations

Mechanical Embolus Removal in Cerebral Ischemia (MERCI) has been supported by medical trials as an improved method of treating ischemic stroke past the safe window of time for administering clot-busting drugs, and was released for medical use in 2004. Analyzing real-world data collected from MERCI clinical trials is key to providing insights into the effectiveness of MERCI. Most of the existing data analysis on MERCI results has thus far employed conventional statistical analysis techniques. To the best of the knowledge acquired in preliminary research, advanced data analytics and data mining techniques have not yet been systematically applied. …


Data Mining Of Protein Databases, Christopher Assi Jul 2012

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

Data mining of protein databases poses special challenges because many protein databases are non-relational whereas most data mining and machine learning algorithms assume the input data to be a relational database. Protein databases are non-relational mainly because they often contain set data types. We developed new data mining algorithms that can restructure non-relational protein databases so that they become relational and amenable to various data mining and machine learning tools. We applied the new restructuring algorithms to a pancreatic protein database. After the restructuring, we also applied two classification methods, decision tree and SVM classifiers, and compared their …


An Interactive Visualization Model For Analyzing Data Storage System Workloads, Steven Charubhat Pungdumri Mar 2012

Master's Theses

The performance of hard disks has become increasingly important as the volume of data storage increases. At the bottom level of large-scale storage networks is the hard disk. Despite the importance of hard drives in a storage network, it is often difficult to analyze the performance of hard disks due to the sheer size of the datasets seen by hard disks. Additionally, hard drive workloads can have several multi-dimensional characteristics, such as access time, queue depth and block-address space. The result is that hard drive workloads are extremely diverse and large, making extracting meaningful information from hard drive workloads very …


A Review Of Situation Identification Techniques In Pervasive Computing, Juan Ye, Simon Dobson, Susan Mckeever Feb 2012

Articles

Pervasive systems must offer an open, extensible, and evolving portfolio of services which integrate sensor data from a diverse range of sources. The core challenge is to provide appropriate and consistent adaptive behaviours for these services in the face of huge volumes of sensor data exhibiting varying degrees of precision, accuracy and dynamism. Situation identification is an enabling technology that resolves noisy sensor data and abstracts it into higher-level concepts that are interesting to applications. We provide a comprehensive analysis of the nature and characteristics of situations, discuss the complexities of situation identification, and review the techniques that are most …


Sports Data Mining Technology Used In Basketball Outcome Prediction, Chenjie Cao Jan 2012

Dissertations

Driven by increasingly comprehensive sports datasets and the successful use of data mining techniques in other areas, sports data mining has emerged and enables us to find hidden knowledge that can impact the sports industry. Predicting the outcomes of sporting events has always been challenging and attractive work, and it is therefore drawing wide research interest. This project focuses on using machine learning algorithms to build a model for predicting NBA game outcomes; the algorithms include Simple Logistics Classifier, Artificial Neural Networks, SVM and Naïve Bayes. In order to …


Using Textual Features To Predict Popular Content On Digg, Paul H. Miller May 2011

Paul H Miller

Over the past few years, collaborative rating sites, such as Netflix, Digg and Stumble, have become increasingly prevalent sites for users to find trending content. I used various data mining techniques to study Digg, a social news site, to examine the influence of content on popularity. What influence does content have on popularity, and what influence does content have on users’ decisions? Overwhelmingly, prior studies have consistently shown that predicting popularity based on content is difficult and maybe even inherently impossible. The same submission can have multiple outcomes and content neither determines popularity, nor individual user decisions. My results show …


Using Textual Features To Predict Popular Content On Digg, Paul H. Miller Apr 2011

Department of English: Dissertations, Theses, and Student Research

Over the past few years, collaborative rating sites, such as Netflix, Digg and Stumble, have become increasingly prevalent sites for users to find trending content. I used various data mining techniques to study Digg, a social news site, to examine the influence of content on popularity. What influence does content have on popularity, and what influence does content have on users’ decisions? Overwhelmingly, prior studies have consistently shown that predicting popularity based on content is difficult and maybe even inherently impossible. The same submission can have multiple outcomes and content neither determines popularity, nor individual user decisions. My results show …


Infoextractor – A Tool For Social Media Data Mining, Chirag Shah, Charles File Jan 2011

JITP 2011: The Future of Computational Social Science

We present InfoExtractor, a web-based tool for collecting data and metadata from focused social media content. InfoExtractor then provides this data in various structured and unstructured formats for easy manipulation and analysis. The tool allows social science researchers to easily collect data for quantitative analysis, and is designed to deliver data from popular and influential social media sites in a useful and easy to access way. InfoExtractor was designed to replace traditional means of content aggregation, such as page scraping and brute-force copying.


Data Mining Based Learning Algorithms For Semi-Supervised Object Identification And Tracking, Michael P. Dessauer Jan 2011

Doctoral Dissertations

Sensor exploitation (SE) is a crucial step in surveillance applications such as airport security and search and rescue operations. It allows localization and identification of movement in urban settings and can significantly boost knowledge gathering, interpretation and action. Data mining techniques offer the promise of precise and accurate knowledge acquisition in high-dimensional data domains (and of diminishing the “curse of dimensionality” prevalent in such datasets), coupled with algorithmic design in feature extraction, discriminative ranking, feature fusion and supervised learning (classification). Consequently, data mining techniques and algorithms can be used to refine and process captured data and to detect, recognize, classify, …


Knowledge Discovery And Analysis In Manufacturing, Mark Polczynski, Andzrej Kochanski Jun 2010

Electrical and Computer Engineering Faculty Research and Publications

The quality and reliability requirements for next-generation manufacturing are reviewed, and current approaches are cited. The potential for augmenting current quality/reliability technology is described, and characteristics of potential future directions are postulated. Methods based on knowledge discovery and analysis in manufacturing (KDAM) are reviewed, and related successful applications in business and social fields are discussed. Typical KDAM applications are noted, along with general functions and specific KDAM-related technologies. A systematic knowledge discovery process model is reviewed, and examples of current work are given, including description of successful applications of KDAM to creation of rules for optimizing gas porosity in sand …


Dynamic Application Level Security Sensors, Christopher Thomas Rathgeb May 2010

Masters Theses

The battle for cyber supremacy is a cat and mouse game: evolving threats from internal and external sources make it difficult to protect critical systems. With the diverse and high risk nature of these threats, there is a need for robust techniques that can quickly adapt and address this evolution. Existing tools such as Splunk, Snort, and Bro help IT administrators defend their networks by actively parsing through network traffic or system log data. These tools have been thoroughly developed and have proven to be a formidable defense against many cyberattacks. However, they are vulnerable to zero-day attacks, slow attacks, …


Artificial Intelligence – I: A Two-Step Approach For Improving Efficiency Of Feedforward Multilayer Perceptrons Network, Shoukat Ullah, Zakia Hussain Aug 2009

International Conference on Information and Communication Technologies

Artificial neural networks have gained importance in the field of data mining. Although a neural network may have a complex structure, a long training time, and results that are not easily understandable, it has high accuracy and is preferable in data mining. This research paper aims to improve efficiency and to provide accurate results on the basis of same behaviour data. To achieve these objectives, an algorithm is proposed that uses two data mining techniques, that is, attribute selection method and cluster analysis. The algorithm works by applying attribute selection method to eliminate irrelevant attributes, so that input dimensionality is reduced …


Opinion Mining With The SentiWordNet Lexical Resource, Bruno Ohana Mar 2009

Dissertations

Sentiment classification concerns the application of automatic methods for predicting the orientation of sentiment present in text documents. It is an important subject in opinion mining research, with applications in a number of areas including recommender and advertising systems, customer intelligence and information retrieval. SentiWordNet is a lexical resource of sentiment information for terms in the English language designed to assist in opinion mining tasks, where each term is associated with numerical scores for positive and negative sentiment information. A resource that makes term level sentiment information readily available could be of use in building more effective sentiment classification methods. …


Investigation Of The Visual Aspects Of Business Intelligence, Niall Cunningham Jan 2008

Dissertations

As the need for Irish firms to move into the knowledge economy increases, knowledge management is becoming more important. Among its central aims is for knowledge creation and development to enhance knowledge and inspire innovations. One of the ways knowledge management achieves these aims is through the development of knowledge management tools for the firm. Among these tools is the option to use business intelligence to enhance knowledge in the firm by exposing hidden knowledge and building on it. In particular, data mining has been highlighted as being effective within business intelligence and knowledge management in discovering hidden knowledge. This dissertation …


Multivariate Discretization Of Continuous Valued Attributes., Ehab Ahmed El Sayed Ahmed 1978- Dec 2006

Electronic Theses and Dissertations

The area of knowledge discovery and data mining is growing rapidly. Feature discretization is a crucial issue in Knowledge Discovery in Databases (KDD), or data mining, because most data sets used in real-world applications have features with continuous values. Discretization is performed as a preprocessing step of data mining to make data mining techniques useful for these data sets. This thesis addresses the discretization issue by proposing a multivariate discretization (MVD) algorithm. It begins with a number of common discretization algorithms like Equal width discretization, Equal frequency discretization, Naïve, Entropy based discretization, Chi square discretization, and orthogonal hyper planes. After …
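Two of the common univariate methods named above, equal width and equal frequency discretization, can be sketched in a few lines. This is a generic illustration, not the thesis's MVD algorithm; the function names and the bin-index output format are assumptions made here.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index using n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    # Guard against a zero-width range when all values are equal.
    width = (hi - lo) / n_bins or 1.0
    # Clamp so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Map each value to a bin index so bins hold roughly equal counts."""
    # Indices of the values, smallest value first.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # The position in sorted order decides the bin.
        bins[i] = rank * n_bins // len(values)
    return bins
```

On skewed data the two methods behave quite differently: equal width bins can leave most values in a single bin, while equal frequency bins spread them evenly. Note that this simple equal frequency version may place tied values in different bins.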