A Statistical Approach for Content Extraction from Web Page

Yi Guan
Journal: Zhongwen xinxi xuebao (ISSN 1003-0077)
Host organization: Beijing Institute of Information Engineering
Publication year: 2004
Type: journal article
Corresponding author: Yi Guan (raw author name: Guan Yi; ORCID: https://orcid.org/0000-0001-6057-9243)
Landing page: https://en.cnki.com.cn/Article_en/CJFDTOTAL-MESS200405002.htm
OpenAlex record: https://openalex.org/W2393035056
Open access: no (status: closed)
Cited by: 20 (2012: 2, 2013: 1, 2014: 4, 2015: 3, 2016: 1, 2018: 1)
Topics: Web Data Extraction and Crawling Techniques; Artificial Intelligence and Expert Systems; QoS-Aware Web Services Composition and Semantic Matching
Keywords: Web Data Extraction; Page Segmentation; Web Crawling; Document Object Model; Automated Web Service Discovery; HTML element; Web Service Composition; Tree (set theory)

Abstract

This paper proposes a statistical approach for extracting text content from Chinese news web pages, so that natural language processing technologies can be applied effectively to web page documents. The method represents a web page as a tree built from its HTML tags, and then chooses the node that contains the text content by using the number of Chinese characters in each node of the tree. In comparison with traditional methods, the method does not need to construct different wrappers for different data sources. It is simple, accurate and easy to implement. Experimental results show that the extraction precision is higher than 95%. The method has been adopted to provide web text data for a question answering system in the travel domain.
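
The abstract describes the method only in outline: build a tree from the HTML tags, count the Chinese characters in each node, and pick the node that holds the text content. The exact selection rule is not given in this record, so the following is a minimal Python sketch under stated assumptions rather than the author's implementation. The names `Node`, `TreeBuilder`, `parse_html` and `select_content_node`, the `ratio` threshold of 0.8, and the definition of a "Chinese character" as a code point in U+4E00..U+9FFF are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


def count_chinese(text):
    """Count code points in the CJK Unified Ideographs block (assumed definition)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


class Node:
    """One element of the tag tree; items holds strings and child Nodes in order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []

    @property
    def children(self):
        return [item for item in self.items if isinstance(item, Node)]

    def text(self):
        return "".join(i if isinstance(i, str) else i.text() for i in self.items)

    def chinese_count(self):
        return count_chinese(self.text())


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML tags, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.items.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.items.append(data)


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def select_content_node(root, ratio=0.8):
    """Assumed rule: descend while one child holds at least `ratio` of the
    Chinese characters of the current node, then return the current node."""
    node = root
    while True:
        total = node.chinese_count()
        best = max(node.children, key=Node.chinese_count, default=None)
        if best is None or total == 0 or best.chinese_count() < ratio * total:
            return node
        node = best
```

Calling `select_content_node(parse_html(page)).text()` returns the text of the selected node. Because the rule depends only on character counts and not on site-specific tag patterns, the same code can be pointed at pages from different sources, which is the property the abstract contrasts with per-source wrappers. The threshold is a tuning knob in this sketch: a lower value descends further into the tree and may select a single paragraph instead of the whole article body.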