Full-Text Articles in Computational Linguistics
Size Matters: The Impact Of Training Size In Taxonomically-Enriched Word Embeddings, Alfredo Maldonado, Filip Klubicka, John D. Kelleher
Articles
Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel at capturing thematic similarity (“topical relatedness”) for word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g., those trained on a random walk over WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with …
Synthetic, Yet Natural: Properties Of Wordnet Random Walk Corpora And The Impact Of Rare Words On Embedding Performance, Filip Klubicka, Alfredo Maldonado, Abhijit Mahalunkar, John D. Kelleher
Conference papers
Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and use this corpus to train embeddings. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora, and their impact on the resulting embeddings. We find …
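The random-walk idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy taxonomy, the function names `build_neighbours` and `random_walk_corpus`, the uniform choice of start node and the `stop_prob` stopping rule are all assumptions made for illustration, and no real WordNet data is used.

```python
import random

# Hypothetical toy taxonomy (hypernym -> hyponyms); not taken from WordNet.
TOY_TAXONOMY = {
    "entity": ["vehicle", "container"],
    "vehicle": ["bus", "train"],
    "container": ["cup", "mug"],
}


def build_neighbours(taxonomy):
    """Return an undirected adjacency map for a hypernym -> hyponyms dict."""
    neighbours = {}
    for parent, children in taxonomy.items():
        for child in children:
            neighbours.setdefault(parent, []).append(child)
            neighbours.setdefault(child, []).append(parent)
    return neighbours


def random_walk_corpus(taxonomy, num_sentences, stop_prob=0.3, seed=0):
    """Generate pseudo-sentences by walking the taxonomy at random.

    Each walk starts at a uniformly chosen node. Before every step the
    walk stops with probability `stop_prob` (an assumed stopping rule),
    so a sentence may contain a single word.
    """
    if not 0 < stop_prob <= 1:
        raise ValueError("stop_prob must be in (0, 1]")
    rng = random.Random(seed)
    neighbours = build_neighbours(taxonomy)
    nodes = sorted(neighbours)
    corpus = []
    for _ in range(num_sentences):
        node = rng.choice(nodes)
        sentence = [node]
        while rng.random() >= stop_prob:
            # Move to a random neighbour (hypernym or hyponym).
            node = rng.choice(neighbours[node])
            sentence.append(node)
        corpus.append(sentence)
    return corpus


if __name__ == "__main__":
    for sentence in random_walk_corpus(TOY_TAXONOMY, num_sentences=5):
        print(" ".join(sentence))
```

Each generated list of words plays the role of one sentence in the pseudo-corpus, which could then be passed to any word-embedding trainer that accepts tokenised sentences. Fixing the seed keeps the output reproducible, which helps when comparing corpora of different sizes.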