Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Full-Text Articles in Physical Sciences and Mathematics

Ensemble Methods For Historical Machine-Printed Document Recognition, William Lund Sep 2014

William Lund

The usefulness of digitized documents is directly related to the quality of the extracted text. Optical Character Recognition (OCR) has reached a point where well-formatted and clean machine-printed documents are easily recognizable by current commercial OCR products; however, older or degraded machine-printed documents present problems to OCR engines resulting in word error rates (WER) that severely limit either automated or manual use of the extracted text. Major archives of historical machine-printed documents are being assembled around the globe, requiring an accurate transcription of the text for the automated creation of descriptive metadata, full-text searching, and information extraction. Given document …


Adam: Automated Detection And Attribution Of Malicious Webpages, Ahmed E. Kosba, Aziz Mohaisen, Andrew G. West, Trevor Tonn, Huy Kang Kim Aug 2014

Andrew G. West

Malicious webpages are a prevalent and severe threat in the Internet security landscape. This fact has motivated numerous static and dynamic techniques to alleviate such threats. Building on this existing literature, this work introduces the design and evaluation of ADAM, a system that uses machine learning over network metadata derived from the sandboxed execution of webpage content. ADAM aims to detect malicious webpages and identify the nature of those vulnerabilities using a simple set of features. Machine-trained models are not novel in this problem space. Instead, it is the dynamic network artifacts (and their subsequent feature representations) collected during rendering that …