Title: Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia
Abstract: Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. This paper presents a method for automatically gathering massive amounts of naturally occurring cross-document reference data. We also present the Wikilinks dataset comprising 40 million mentions over 3 million entities, gathered using this method. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
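The labeling idea in the abstract (anchor text as the mention, the linked Wikipedia page as the entity label) can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `WikipediaLinkParser` and `group_mentions` are hypothetical, the input is assumed to be a small dictionary of already-fetched HTML pages, and any filtering a real crawl would need is omitted.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikipediaLinkParser(HTMLParser):
    """Collects (anchor text, Wikipedia page title) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.mentions = []    # list of (anchor_text, entity_title)
        self._target = None   # title linked by the <a> currently open, if any
        self._text = []       # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        url = urlparse(href)
        host = url.netloc.lower()
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        if is_wikipedia and url.path.startswith("/wiki/"):
            # Assumption: the page title after /wiki/ serves as the entity label.
            self._target = unquote(url.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            if anchor:
                self.mentions.append((anchor, self._target))
            self._target = None


def group_mentions(pages):
    """Map each Wikipedia title to the (doc_id, anchor text) mentions linking to it."""
    clusters = defaultdict(list)
    for doc_id, html in pages.items():
        parser = WikipediaLinkParser()
        parser.feed(html)
        parser.close()
        for anchor, title in parser.mentions:
            clusters[title].append((doc_id, anchor))
    return dict(clusters)
```

Each key of the returned dictionary plays the role of one entity, and its value is the cross-document set of mentions that would be treated as coreferent under this labeling scheme.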
Publication Year: 2012
Publication Date: 2012-01-01
Language: en
Type: article
Access and Citation
Cited By Count: 99