Title: Archiving and Analysing Techniques of the Ultra-Large-Scale Web-Based Corpus Project of NINJAL, Japan
Abstract: In 2011, the National Institute for Japanese Language and Linguistics (NINJAL) launched a project to compile a ten-billion-word web corpus for linguistic research by 2016. The project is divided into four categories: Page Collection, Linguistic Annotation, Release, and Preservation. For Page Collection, web crawlers collect web text by crawling 100 million pages every three months, retaining several versions of the text for three-month periods. For Linguistic Annotation, the web corpus is annotated with linguistic information; to improve the usability of these resources, normalization tasks such as tag removal, word segmentation, dependency parsing, and register estimation are performed. For Release, word lists and n-gram data derived from the crawled and annotated text are published. In addition, applications are being developed to search for morphosyntactic patterns in the ten-billion-word corpus. For Preservation, crawled web pages are preserved in chronological order as web archives, primarily to support the study of ongoing linguistic change. In this paper, we present the basic design of the four categories. We also report the current status of the corpus using basic statistics of the crawled data and discuss the importance of deduplicating sentences.
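Illustrative note: the abstract does not describe how sentences are deduplicated or how n-grams are counted, so the following is only a minimal sketch of why deduplication matters for published n-gram counts. It assumes exact-match deduplication and sentences that are already word-segmented with spaces; the function names `dedupe_sentences` and `count_ngrams` and the sample sentences are hypothetical and are not taken from the NINJAL project.

```python
from collections import Counter
from typing import Iterable, Iterator


def dedupe_sentences(sentences: Iterable[str]) -> Iterator[str]:
    """Yield each sentence once, in first-seen order (exact-match dedup)."""
    seen = set()
    for sentence in sentences:
        key = sentence.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def count_ngrams(sentences: Iterable[str], n: int = 2) -> Counter:
    """Count word n-grams; tokens are assumed to be space-separated already."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical pre-segmented sentences, with one boilerplate line repeated,
# as often happens with navigation text on crawled pages.
crawled = [
    "続き を 読む",
    "今日 は 雨 です",
    "続き を 読む",
    "続き を 読む",
]

raw_counts = count_ngrams(crawled, n=2)
deduped_counts = count_ngrams(dedupe_sentences(crawled), n=2)
```

In this toy input, the repeated boilerplate sentence inflates its bigram counts in `raw_counts`, while `deduped_counts` counts each distinct sentence once. A real pipeline for Japanese would need a morphological analyzer for word segmentation rather than `str.split`.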
Publication Year: 2014
Publication Date: 2014-08-01
Language: en
Type: article
Indexed In: ['crossref']
Access and Citation
Cited By Count: 21