Title: A Statistical Approach for Content Extraction from Web Page
Abstract: This paper proposes a statistical approach for extracting text content from Chinese news web pages, so that natural language processing technologies can be applied effectively to web page documents. The method represents a web page as a tree built from its HTML tags, and then selects the node that contains the text content based on the number of Chinese characters in each node of the tree. Unlike traditional methods, it does not need a different wrapper for each data source. It is simple, accurate, and easy to implement. Experimental results show that the extraction precision is higher than 95%. The method has been adopted to provide web text data for a question answering system in the travel domain.
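The idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not state the exact selection rule, so the sketch assumes the simplest one (pick the node with the most Chinese characters). The names `count_chinese`, `TreeBuilder`, and `extract_content`, the `INLINE` tag set, and the restriction to the U+4E00..U+9FFF range are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is credited to the enclosing block node,
# so that a container holding several paragraphs is scored as a whole.
INLINE = {"p", "span", "b", "i", "a", "font", "strong", "em"}
VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag
SKIP = {"script", "style"}                           # non-content text


def count_chinese(text):
    """Count characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text fragments credited to this node


class TreeBuilder(HTMLParser):
    """Builds a simplified tag tree; inline and void tags are transparent."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag not in INLINE and tag not in VOID:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in INLINE and tag not in VOID:
            # Walk up to the nearest open node with this tag (tolerates sloppy HTML).
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.text.append(data)


def extract_content(html):
    """Return the text of the node with the most Chinese characters, or ''."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_count = None, 0
    stack = [builder.root]
    while stack:
        node = stack.pop()
        n = count_chinese("".join(node.text))
        if n > best_count:
            best, best_count = node, n
        stack.extend(node.children)
    return "".join(best.text).strip() if best else ""
```

Calling `extract_content(html_string)` on a page returns the text of the highest-scoring node. Because the score depends only on character counts, no per-site wrapper is required, which matches the advantage claimed in the abstract; a real system would likely need a more careful rule for ties and for pages that split the article across several containers.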
Publication Year: 2004
Publication Date: 2004-01-01
Language: en
Type: article
Access and Citation
Cited By Count: 20