Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

2019

Artificial Intelligence and Robotics

Series

Background Corpus

Articles 1 - 2 of 2

Full-Text Articles in Physical Sciences and Mathematics

Update Frequency And Background Corpus Selection In Dynamic Tf-Idf Models For First Story Detection, Fei Wang, Robert J. Ross, John D. Kelleher Oct 2019

Conference papers

First Story Detection (FSD) requires a system to detect the very first story that mentions an event from a stream of stories. Nearest neighbour-based models, using traditional term vector document representations such as TF-IDF, currently achieve the state of the art in FSD. Because FSD is an online task, a dynamic term vector model that is incrementally updated during the detection process is usually adopted instead of a static model. However, very little research has investigated the selection of hyper-parameters and background corpora for a dynamic model. In this paper, we analyse how a dynamic term vector model …
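To make the setting in this abstract concrete, the following is a minimal sketch of nearest neighbour-based FSD with a dynamic TF-IDF model. It is not the authors' implementation: the class name `DynamicTfidfFsd`, the smoothed IDF formula, the whitespace tokenizer, the novelty threshold of 0.8, and the choice to update document frequencies after every story are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DynamicTfidfFsd:
    """Toy first story detector (hypothetical names and parameters throughout)."""

    def __init__(self, background_docs, threshold=0.8):
        self.threshold = threshold   # assumed novelty cut-off, not from the paper
        self.doc_freq = Counter()    # term -> number of documents containing it
        self.n_docs = 0
        self.seen = []               # token lists of stories processed so far
        # Initialise document frequencies from the background corpus.
        for doc in background_docs:
            self._update_counts(doc.lower().split())

    def _update_counts(self, tokens):
        self.n_docs += 1
        self.doc_freq.update(set(tokens))

    def _vector(self, tokens):
        # Smoothed IDF (an assumption) so unseen terms get a finite weight.
        tf = Counter(tokens)
        return {
            t: c * (math.log((1 + self.n_docs) / (1 + self.doc_freq[t])) + 1.0)
            for t, c in tf.items()
        }

    def process(self, story):
        """Return (is_first_story, novelty) and then update the model."""
        tokens = story.lower().split()
        vec = self._vector(tokens)
        # Earlier stories are re-weighted with the current IDF values:
        # simple for a sketch, but too slow for a real stream.
        nearest = max(
            (cosine(vec, self._vector(old)) for old in self.seen), default=0.0
        )
        novelty = 1.0 - nearest
        self.seen.append(tokens)
        self._update_counts(tokens)  # dynamic update after every story
        return novelty >= self.threshold, novelty


detector = DynamicTfidfFsd(["the cat sat on the mat", "stocks fell on monday"])
for story in ["earthquake hits city",
              "earthquake hits city centre",
              "election results announced"]:
    print(story, detector.process(story))
```

In this sketch, the update frequency corresponds to how often `_update_counts` is called during detection, and the background corpus is whatever is passed to the constructor; both are the kinds of choices the abstract says have received little study.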


Bigger Versus Similar: Selecting A Background Corpus For First Story Detection Based On Distributional Similarity, Fei Wang, Robert J. Ross, John D. Kelleher Sep 2019

Conference papers

The current state of the art for First Story Detection (FSD) is achieved by nearest neighbour-based models with traditional term vector representations; however, one challenge faced by FSD models is that the document representation is usually defined by the vocabulary and term frequencies of a background corpus. Consequently, the ideal background corpus should arguably be both large-scale, to ensure adequate term coverage, and similar to the target domain in terms of language distribution. However, given that these two factors cannot always be satisfied together, in this paper we examine whether the distributional similarity of common terms is more important than the scale …