Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Articles 1 - 5 of 5

Full-Text Articles in Computer Sciences

Webarc: Website Archival Using A Structured Approach, Ee Peng Lim, Maria Marissa Dec 2005

Research Collection School Of Computing and Information Systems

Website archival refers to the task of monitoring and storing snapshots of a website for future retrieval and analysis. This task is particularly important for websites whose content changes over time, with older information constantly overwritten by newer information. In this paper, we propose WEBARC, a set of software tools that allows users to construct a logical structure for a website to be archived. Classifiers are trained to determine relevant web pages and their categories, and are subsequently used in website downloading. The archival schedule can be specified and executed by a scheduler. A website viewer is also developed to …
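To make the classification step concrete, the following is a minimal sketch of how relevant-page detection during downloading might look. It assumes a scikit-learn TF-IDF plus Naive Bayes text classifier and invented toy categories ("people", "events", "irrelevant"); the actual features, categories, and scheduler used in WEBARC are not reproduced here.

# Minimal sketch of a WEBARC-style page classifier (assumptions: scikit-learn,
# toy data and category names); not the authors' actual implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled pages: page text paired with a category in the site's logical structure.
train_text = [
    "faculty members office hours contact email",
    "call for papers submission deadline conference",
    "campus parking map visitor information",
]
train_labels = ["people", "events", "irrelevant"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_text, train_labels)

def should_archive(page_text):
    # During downloading, archive only pages predicted to belong to a relevant category.
    return classifier.predict([page_text])[0] != "irrelevant"

print(should_archive("new faculty member joins the department"))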


Translation Initiation Sites Prediction With Mixture Gaussian Models In Human cDNA Sequences, G. Li, Tze-Yun Leong, Louxin Zhang Aug 2005

Research Collection School Of Computing and Information Systems

Translation initiation sites (TISs) are important signals in cDNA sequences. Many research efforts have tried to predict TISs in cDNA sequences. In this paper, we propose to use mixture Gaussian models for TIS prediction. Using both local features and some features generated from global measures, the proposed method predicts TISs with a sensitivity of 98 percent and a specificity of 93.6 percent. Our method outperforms many other existing methods in sensitivity while keeping specificity high. We attribute the improvement in sensitivity to the nature of the global features and the mixture Gaussian models. © 2005 IEEE.
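As an illustration of the modelling idea, the sketch below fits one Gaussian mixture to positive (true TIS) feature vectors and one to negatives, then classifies a candidate site by comparing log-likelihoods. The synthetic four-dimensional features and component counts are assumptions made for the example; the paper's actual local and global sequence features are not reproduced here.

# Sketch of TIS classification with per-class Gaussian mixtures (assumptions:
# scikit-learn GaussianMixture, synthetic features); not the paper's pipeline.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical feature vectors for candidate start sites in cDNA sequences.
tis_features = rng.normal(loc=1.0, scale=0.5, size=(200, 4))       # true TIS examples
non_tis_features = rng.normal(loc=-1.0, scale=0.8, size=(200, 4))  # negative examples

# Fit one mixture model per class.
tis_model = GaussianMixture(n_components=3, random_state=0).fit(tis_features)
non_tis_model = GaussianMixture(n_components=3, random_state=0).fit(non_tis_features)

def predict_tis(features):
    # Classify a candidate site by comparing per-class log-likelihoods.
    x = np.asarray(features).reshape(1, -1)
    return tis_model.score(x) > non_tis_model.score(x)

print(predict_tis(rng.normal(loc=1.0, scale=0.5, size=4)))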


Automatically Discovering The Number Of Clusters In Web Page Datasets, Zhongmei Yao Jun 2005

Computer Science Faculty Publications

Clustering is well-suited for Web mining by automatically organizing Web pages into categories, each of which contains Web pages having similar contents. However, one problem in clustering is the lack of general methods to automatically determine the number of categories or clusters. For the Web domain in particular, currently there is no such method suitable for Web page clustering. In an attempt to address this problem, we discover a constant factor that characterizes the Web domain, based on which we propose a new method for automatically determining the number of clusters in Web page data sets. We discover that the …
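For comparison, a generic way to estimate the number of clusters is to search over candidate values of k and keep the one with the best internal validity score. The sketch below uses k-means on TF-IDF vectors with the silhouette criterion over toy pages; this is an assumed baseline for illustration, not the constant-factor method proposed in the paper.

# Generic cluster-number search for web-page text (assumptions: scikit-learn
# k-means + silhouette score on toy pages); not the paper's proposed method.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

pages = [
    "python tutorial loops functions",
    "java classes objects inheritance",
    "soccer world cup final score",
    "basketball playoffs championship game",
    "stock market shares earnings report",
    "interest rates inflation central bank",
]
X = TfidfVectorizer().fit_transform(pages)

best_k, best_score = None, -1.0
for k in range(2, len(pages) - 1):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print("estimated number of clusters:", best_k)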


The Edam Project: Mining Atmospheric Aerosol Datasets, Raghu Ramakrishnan, James J. Schauer, Lei Chen, Zheng Huang, Martin M. Shafer, Deborah S. Gross, David R. Musicant Jan 2005

Faculty Work

Data mining has been a very active area of research in the database, machine learning, and mathematical programming communities in recent years. EDAM (Exploratory Data Analysis and Management) is a joint project between researchers in Atmospheric Chemistry and Computer Science at Carleton College and the University of Wisconsin-Madison that aims to develop data mining techniques for advancing the state of the art in analyzing atmospheric aerosol datasets. There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach for particle measurement, which is the collection of bulk samples of particulates on filters, …


Collective Multi-Label Classification, Nadia Ghamrawi, Andrew McCallum Jan 2005

Computer Science Department Faculty Publication Series

Common approaches to multi-label classification learn independent classifiers for each category and employ ranking or thresholding schemes for classification. Because they do not exploit dependencies between labels, such techniques are only well-suited to problems in which categories are independent. However, in many domains labels are highly interdependent. This paper explores multi-label conditional random field (CRF) classification models that directly parameterize label co-occurrences in multi-label classification. Experiments show that the models outperform their single-label counterparts on standard text corpora. Even when multi-labels are sparse, the models improve subset classification error by as much as 40%.
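The independent-classifier baseline mentioned in the first sentence can be sketched as one binary classifier per label. The example below uses scikit-learn's OneVsRestClassifier on toy documents with invented labels; it illustrates the baseline the paper improves on, not the collective CRF model itself.

# Independent per-label classifiers (the baseline described above), sketched
# with scikit-learn on toy data; not the paper's collective CRF model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "oil prices rise as supply tightens",
    "central bank raises interest rates",
    "new vaccine trial shows promising results",
    "hospital funding and interest rate policy debated",
]
label_sets = [{"economy"}, {"economy", "policy"}, {"health"}, {"health", "policy"}]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(label_sets)  # one indicator column per label

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),  # one classifier per label
)
model.fit(docs, Y)

pred = model.predict(["interest rates and hospital budgets"])
print(binarizer.inverse_transform(pred))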