Full-Text Articles in Arts and Humanities
Graph-Theoretic Techniques For Web Content Mining, Adam Schenker
USF Tampa Graduate Theses and Dissertations
In this dissertation we introduce several novel techniques for data mining on web documents that use graph representations of document content. Graphs are more robust than typical vector representations because they can model structural information that is usually lost when the original web document content is converted to a vector representation. For example, we can capture information such as the location, order, and proximity of term occurrence, which is discarded under the standard document vector representation models. Many machine learning methods rely on distance computations, centroid calculations, and other numerical techniques. Thus many of these methods have not been applied …
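To make the contrast between the two representations concrete, here is a minimal sketch in Python. It assumes one simple graph model, in which each unique term is a node and a directed edge links a term to the term that immediately follows it. The function name `build_term_graph` and the sample sentences are illustrative only and are not taken from the dissertation, which may define its graphs differently.

```python
from collections import Counter


def build_term_graph(text):
    """Return (nodes, edges) for a document.

    Assumed model (not necessarily the dissertation's): nodes are the
    unique lowercase terms, and a directed edge (a, b) records that
    term b immediately follows term a somewhere in the text.
    """
    terms = text.lower().split()
    nodes = set(terms)
    edges = set(zip(terms, terms[1:]))
    return nodes, edges


doc_a = "dog bites man"
doc_b = "man bites dog"

# A bag-of-words vector only counts terms, so the two documents look identical.
same_vector = Counter(doc_a.split()) == Counter(doc_b.split())

# The graphs share the same nodes but differ in their edges, so term order survives.
nodes_a, edges_a = build_term_graph(doc_a)
nodes_b, edges_b = build_term_graph(doc_b)
same_graph = edges_a == edges_b

print(same_vector, same_graph)
```

Under these assumptions the two documents have identical term counts but different edge sets, which is the kind of order information the abstract says the vector model discards.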
Scavenger: A Junk Mail Classification Program, Rohan V. Malkhare
USF Tampa Graduate Theses and Dissertations
The problem of junk mail, also called spam, has reached epic proportions and various efforts are underway to fight spam. Junk mail classification using machine learning techniques is a key method to fight spam. We have devised a machine learning algorithm where features are created from individual sentences in the subject and body of a message by forming all possible word-pairings from a sentence. Weights are assigned to the features based on the strength of their predictive capabilities for spam/legitimate determination. The predictive capabilities are estimated by the frequency of occurrence of the feature in spam/legitimate collections as well as …
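The word-pairing idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the Scavenger implementation: pairs are treated as unordered, each pair is counted once per sentence, and the score is a placeholder (the difference in relative frequency between the two collections), because the abstract is truncated before it finishes describing the weighting. The names `sentence_pairs`, `pair_counts`, and `spam_score` and the toy messages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def sentence_pairs(sentence):
    """All unordered pairs of distinct words in one sentence (assumed feature form)."""
    words = sorted(set(sentence.lower().split()))
    return set(combinations(words, 2))


def pair_counts(sentences):
    """Count, for each pair, the number of sentences it occurs in."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence_pairs(sentence))
    return counts


# Toy collections, invented for illustration only.
spam = ["buy cheap pills now", "cheap pills online"]
legit = ["meeting moved to monday", "buy lunch on monday"]

spam_counts = pair_counts(spam)
legit_counts = pair_counts(legit)


def spam_score(pair):
    """Placeholder weight: relative frequency in spam minus that in legitimate mail.

    This stands in for the weighting scheme, which the abstract does not
    fully specify.
    """
    return spam_counts[pair] / len(spam) - legit_counts[pair] / len(legit)


print(spam_score(("cheap", "pills")))
print(spam_score(("buy", "monday")))
```

A positive score marks a pair seen more often in the spam collection, and a negative score marks one seen more often in legitimate mail. Any real classifier would need a principled weighting and far larger collections.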