Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Computer Sciences

Old Dominion University

Computer Science Theses & Dissertations

2022

Reference string parser

Full-Text Articles in Physical Sciences and Mathematics

Transparscit: A Transformer-Based Citation Parser Trained On Large-Scale Synthesized Data, Md Sami Uddin, May 2022

Accurately parsing citation strings is key to automatically building large-scale citation graphs, so a robust citation parser is an essential module in academic search engines. One limitation of state-of-the-art models (such as ParsCit and Neural-ParsCit) is the lack of a large-scale training corpus, because manually annotating hundreds of thousands of citation strings is laborious and time-consuming. This thesis presents a novel transformer-based citation parser that leverages the GIANT dataset, which consists of 1 billion synthesized citation strings covering over 1500 citation styles. Rather than relying on handcrafted features, our model benefits from word embeddings and character-based embeddings by combining the bidirectional long …