Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

Full-Text Articles in Entire DC Network

Mining Frequency Of Drug Side Effects Over A Large Twitter Dataset Using Apache Spark, Dennis Hsu May 2017

Master's Projects

Despite clinical trials by pharmaceutical companies and the FDA's current reporting systems, some drug side effects still go undetected. One way to obtain a larger sample of reports is to mine online social media. With its widespread use, social media such as Twitter has given rise to massive amounts of data, which can serve as reports of drug side effects. For processing such large datasets, Apache Spark has become popular because it offers fast, distributed batch processing. In this work, we have improved on previous pipelines in sentiment-analysis-based mining, processing, and extracting tweets …


Image Spam Detection, Aneri Chavda May 2017

Master's Projects

Email is one of the most common forms of digital communication. Spam can be defined as unsolicited bulk email, while image spam consists of spam text embedded inside images. Spammers use image spam to evade text-based spam filters, and hence it poses a threat to email-based communication. In this research, we analyze image spam detection methods based on various combinations of image processing and machine learning techniques.


Malware Detection Using The Index Of Coincidence, Bhavna Gurnani Jan 2017

Master's Projects

In this research, we apply the Index of Coincidence (IC) to problems in malware analysis. The IC, which is often used in cryptanalysis of classic ciphers, is a technique for measuring the repeat rate in a string of symbols. A score based on the IC is applied to a variety of challenging malware families. We find that this relatively simple IC score performs surprisingly well, with superior results in comparison to various machine-learning-based scores, at least in some cases.
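
To illustrate the repeat-rate idea, the textbook IC of a sequence is the probability that two symbols drawn from it without replacement are identical. The sketch below computes that quantity in Python. It assumes the standard cryptanalysis definition; the abstract does not say how the project's malware score is derived from the IC, and the function name `index_of_coincidence` is hypothetical.

```python
from collections import Counter


def index_of_coincidence(symbols):
    """Return the textbook index of coincidence of a symbol sequence.

    Assumption: this is the standard definition,
        IC = sum(n_i * (n_i - 1)) / (N * (N - 1)),
    where n_i is the count of symbol i and N is the sequence length.
    It is not necessarily the exact score used in the project.
    """
    n = len(symbols)
    if n < 2:
        # Fewer than two symbols cannot form a pair, so report 0.0.
        return 0.0
    counts = Counter(symbols)
    # Count ordered pairs of equal symbols, then divide by all ordered pairs.
    matching_pairs = sum(c * (c - 1) for c in counts.values())
    return matching_pairs / (n * (n - 1))
```

The function accepts any sequence of hashable symbols, so it can be called on a text string or on a `bytes` object such as the raw contents of a file. A sequence with a single repeated symbol gives the maximum value of 1.0, and a sequence with no repeated symbols gives 0.0.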


Automated Classification To Improve The Efficiency Of Weeding Library Collections, Kiri Lou Wagstaff Jan 2017

Master's Theses

Studies have shown that library weeding (the selective removal of unused, worn, outdated, or irrelevant items) benefits patrons and increases circulation rates. However, the time required to review the collection and make weeding decisions presents a formidable obstacle. In this study, we empirically evaluated methods for automatically classifying weeding candidates. A data set containing 80,346 items from a large-scale academic library weeding project by Wesleyan University from 2011 to 2014 was used to train six machine learning classifiers to predict “Keep” or “Weed” for each candidate. We found statistically significant agreement (p = 0.001) between classifier predictions and librarian judgments …