Archiving and Analysing Techniques of the Ultra-Large-Scale Web-Based Corpus Project of NINJAL, Japan

Masayuki Asahara; Kikuo Maekawa; Mizuho Imada; Sachi Kato; Hikari Konishi

Journal: Alexandria: The Journal of National and International Library and Information Issues (SAGE Publishing)
Volume 25, Issue 1-2, pages 129-148
Published: 2014-08-01
DOI: https://doi.org/10.7227/alx.0024
ISSN: 0955-7490, 2050-4551
Type: journal article
Language: English
Access: closed (no open-access copy listed)
OpenAlex ID: https://openalex.org/W1981379484
Citations: 21 (OpenAlex record updated 2024-09-19)
Topics: Statistical Machine Translation and Natural Language Processing; Natural Language Processing; Web Data Extraction and Crawling Techniques
Keywords: Web Data Extraction; Web Crawling; Dependency grammar; Page Segmentation; Language Modeling; Corpus linguistics

Abstract

In 2011, the National Institute for Japanese Language and Linguistics (NINJAL) launched a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words by 2016. The project is divided into four categories: Page Collection, Linguistic Annotation, Release and Preservation.

For Page Collection, web crawlers are employed to collect web text by crawling 100 million pages every three months and retaining several versions of the text for three-month periods. For Linguistic Annotation, the linguistic studies web corpus contains annotated linguistic information. To improve the usability of these linguistic resources, normalization tasks such as tag removal, word segmentation, dependency parsing, and register estimation are performed. For Release, word lists and n-gram data are published based on the crawled and annotated text corpus. In addition, applications are being developed to enable searching for morphosyntax patterns in the ten-billion-word corpus. For Preservation, crawled web pages are preserved in chronological order as web archives primarily to support the survey of ongoing linguistic changes.

In this paper, we present the basic design of the four categories. Additionally, we report the current status of the corpus using basic statistics of the crawled data and discuss the importance of deduplicating sentences.
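
The abstract mentions publishing n-gram data and stresses the importance of deduplicating sentences. The Python sketch below illustrates why the two are connected: repeated sentences, which are common in crawled web text, inflate n-gram counts unless they are removed first. It is not the project's actual pipeline. The function names `dedupe_sentences` and `ngram_counts` and the sample sentences are hypothetical, and the input is assumed to be already word-segmented with tokens separated by spaces (in practice a Japanese morphological analyzer would perform that step).

```python
import hashlib
from collections import Counter


def dedupe_sentences(sentences):
    """Keep the first occurrence of each sentence (hypothetical helper).

    Sentences are compared after collapsing runs of whitespace, and a hash
    of the normalized text is stored instead of the text itself.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        normalized = " ".join(sentence.split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(normalized)
    return unique


def ngram_counts(sentences, n):
    """Count token n-grams in space-separated sentences (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Illustrative, pre-segmented sample data (an assumption, not from the corpus).
sample = [
    "今日 は 晴れ です",
    "今日 は 晴れ です",   # exact duplicate, e.g. repeated page boilerplate
    "明日 は 雨 です",
]

raw_bigrams = ngram_counts(sample, 2)
unique_sentences = dedupe_sentences(sample)
deduped_bigrams = ngram_counts(unique_sentences, 2)

# The duplicate sentence doubles the raw count of ("今日", "は").
print(raw_bigrams[("今日", "は")])      # 2
print(deduped_bigrams[("今日", "は")])  # 1
```

Exact-match hashing only removes verbatim repeats. Near-duplicates such as templated text with small variations would need a fuzzier comparison, which this sketch does not attempt.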