Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons


Articles 1 - 10 of 10

Full-Text Articles in Computer Sciences

Word Sense Disambiguation In Biomedical Ontologies With Term Co-Occurrence Analysis And Document Clustering, Bill Andreopoulos, Dimitra Alexopoulou, Michael Schroeder Sep 2008


Faculty Publications, Computer Science

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the Gene Ontology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as "development" can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an …


Word Sense Disambiguation In Biomedical Ontologies With Term Co-Occurrence Analysis And Document Clustering, Bill Andreopoulos, Dimitra Alexopoulou, Michael Schroeder Sep 2008


William B. Andreopoulos

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the Gene Ontology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as "development" can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an …


Relational Methodology For Data Mining And Knowledge Discovery, Evgenii Vityaev, Boris Kovalerchuk Apr 2008

All Faculty Scholarship for the College of the Sciences

Knowledge discovery and data mining methods have been successful in many domains. However, their abilities to build or discover a domain theory remain unclear. This is largely because many fundamental KDD&DM methodological questions are still unexplored, such as (1) the nature of the information contained in input data relative to the domain theory, and (2) the nature of the knowledge that these methods discover. The goal of this paper is to clarify methodological questions of KDD&DM methods. This is done by using the concept of Relational Data Mining (RDM), representative measurement theory, an ontology of a …


Symbolic Methodology For Numeric Data Mining, Boris Kovalerchuk, Evgenii Vityaev Apr 2008

All Faculty Scholarship for the College of the Sciences

Currently, statistical and artificial neural network methods dominate in data mining applications. Alternative relational (symbolic) data mining methods have shown their effectiveness in robotics, drug design, and other areas. Neural networks and decision tree methods have serious limitations in capturing relations that may have a variety of forms. Learning systems based on symbolic first-order logic (FOL) representations capture relations naturally. The learned regularities are directly understandable in domain terms, which helps to build a domain theory. This paper describes relational data mining methodology and develops it further for numeric data such as financial and spatial data. This includes (1) comparing …


Using PLSI-U To Detect Insider Threats By Datamining Email, James S. Okolica, Gilbert L. Peterson, Robert F. Mills Feb 2008

Faculty Publications

Despite a technology bias that focuses on external electronic threats, insiders pose the greatest threat to an organisation. This paper discusses an approach to assist investigators in identifying potential insider threats. We discern employees' interests from e-mail using an extended version of PLSI. These interests are transformed into implicit and explicit social network graphs, which are used to locate potential insiders by identifying individuals who feel alienated from the organisation or have a hidden interest in a sensitive topic. By applying this technique to the Enron e-mail corpus, a small number of employees appear as potential insider threats.


Unsupervised Deduplication Using Cross-Field Dependencies, Robert Hall, Charles Sutton, Andrew McCallum Jan 2008

Andrew McCallum

Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent (because venues tend to focus on a few research areas), but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set …


OptRR: Optimizing Randomized Response Schemes For Privacy-Preserving Data Mining, Zhengli Huang, Wenliang Du Jan 2008

Electrical Engineering and Computer Science - All Scholarship

The randomized response (RR) technique is a promising technique to disguise private categorical data in Privacy-Preserving Data Mining (PPDM). Although a number of RR-based methods have been proposed for various data mining computations, no study has systematically compared them to find optimal RR schemes. The difficulty of comparison lies in the fact that to compare two PPDM schemes, one needs to consider two conflicting metrics: privacy and utility. An optimal scheme based on one metric is usually the worst based on the other metric. In this paper, we first describe a method to quantify privacy and utility. We formulate the …
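
As a minimal sketch of the randomized response idea the abstract refers to, the Python below uses the simplest binary case: each respondent reports the true bit with probability p and the opposite bit otherwise, and the analyst recovers the population rate from the disguised reports. This is the classic single-parameter scheme, not the optimized schemes the paper searches for; the function names and the toy data are assumptions made for illustration.

```python
import random


def randomize(bit, p, rng):
    """Report the true bit with probability p, otherwise report its opposite."""
    return bit if rng.random() < p else 1 - bit


def estimate_true_rate(reports, p):
    """Estimate the fraction of true 1s from the disguised reports."""
    if p == 0.5:
        raise ValueError("p = 0.5 removes all information about the true rate")
    observed = sum(reports) / len(reports)
    # E[observed] = p * pi + (1 - p) * (1 - pi); solve for pi.
    return (observed - (1 - p)) / (2 * p - 1)


# Toy data (hypothetical): 30% of respondents hold the sensitive value 1.
rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000
reports = [randomize(b, 0.8, rng) for b in true_bits]
estimate = estimate_true_rate(reports, 0.8)
```

Moving p toward 0.5 hides each individual answer better but makes the estimate noisier, which is the privacy versus utility conflict the abstract describes.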


Mobile Semantic Computing, Karthik Gomadam, Anupam Joshi, Amit P. Sheth Jan 2008


Kno.e.sis Publications

We propose to organize a special session on research in the intersection of mobile computing, the Semantic Web and Web services.

This session will examine how the research in these areas can serve as a foundation for new architectural and communication paradigms that can enhance service creation, distribution, discovery, integration and utilization in distributed and ubiquitous environments. Some of the initial areas that our early research has highlighted are:

  1. Semantic annotation of data in bandwidth-constrained environments such as mobile networks to promote efficient bandwidth utilization
  2. Possibilities of using microformats such as RDFa and opportunities that can be explored …


The Impact Of Directionality In Predications On Text Mining, Gondy Leroy, Marcelo Fiszman, Thomas C. Rindflesch Jan 2008


CGU Faculty Publications and Research

The number of publications in biomedicine is increasing enormously each year. To help researchers digest the information in these documents, text mining tools are being developed that present co-occurrence relations between concepts. Statistical measures are used to mine interesting subsets of relations. We demonstrate how directionality of these relations affects interestingness. Support and confidence, simple data mining statistics, are used as proxies for interestingness metrics. We first built a test bed of 126,404 directional relations extracted from biomedical abstracts, which we represent as graphs containing a central starting concept and 2 rings of associated relations. We manipulated directionality in four …
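
To illustrate why direction matters for the two statistics named in the abstract, the sketch below computes support and confidence over directed pairs using the usual association-rule definitions: support is the share of all relations that are source → target, and confidence is the share of relations leaving the source that reach the target. Whether the paper defines them exactly this way is an assumption, and the relations listed are invented toy data.

```python
from collections import Counter


def support_and_confidence(relations):
    """Map each directed pair (source, target) to (support, confidence)."""
    pair_counts = Counter(relations)
    source_counts = Counter(source for source, _ in relations)
    total = len(relations)
    return {
        (source, target): (count / total, count / source_counts[source])
        for (source, target), count in pair_counts.items()
    }


# Hypothetical directed relations between concepts.
relations = [
    ("aspirin", "headache"),
    ("aspirin", "headache"),
    ("aspirin", "bleeding"),
    ("headache", "aspirin"),
]
scores = support_and_confidence(relations)
```

Because the pair is ordered, `("aspirin", "headache")` and `("headache", "aspirin")` receive different scores; collapsing them into one undirected pair would merge their counts.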


Data Exploration By Using The Monotonicity Property, Hongyi Chen Jan 2008


LSU Master's Theses

Dealing with different misclassification costs has long been a major problem in classification. Some algorithms, like most rule induction methods, can predict quite accurately when the misclassification costs for each class are assumed to be the same. However, when the misclassification costs change, which is common in practice, these algorithms are not capable of adjusting their results. Some other algorithms, like the Bayesian methods, can yield the probabilities that an unclassified example belongs to each of the given classes, which helps in adjusting the results according to different misclassification costs. The shortcoming of such algorithms is, when the …
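
The point about probability-producing classifiers can be sketched as follows: given class probabilities for an example and a cost table, pick the prediction with the lowest expected misclassification cost. This is the standard minimum-expected-cost rule, shown only to illustrate the abstract's remark, not the thesis's monotonicity-based method; the function name, class labels, and numbers are hypothetical.

```python
def min_expected_cost_class(probabilities, cost):
    """Return the predicted class with the lowest expected misclassification cost.

    probabilities: {true_class: P(true_class | example)}
    cost: {(true_class, predicted_class): cost}; missing pairs cost 0
    """
    def expected_cost(predicted):
        return sum(
            p * cost.get((true, predicted), 0.0)
            for true, p in probabilities.items()
        )

    return min(probabilities, key=expected_cost)


# Hypothetical probabilities and cost tables.
probabilities = {"benign": 0.7, "malignant": 0.3}
equal_costs = {("benign", "malignant"): 1.0, ("malignant", "benign"): 1.0}
skewed_costs = {("benign", "malignant"): 1.0, ("malignant", "benign"): 5.0}

equal_choice = min_expected_cost_class(probabilities, equal_costs)
skewed_choice = min_expected_cost_class(probabilities, skewed_costs)
```

With equal costs the more probable class wins; once missing a malignant case costs five times as much, the same probabilities lead to the other prediction, without retraining the model.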