Title: RankMass Crawler: A Crawler with High PageRank Coverage Guarantee.
Abstract: Crawling algorithms have been the subject of extensive research and optimization, but some important questions remain open. In particular, given the infinite number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web they will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.
Publication Year: 2007
Publication Date: 2007-01-01
Language: en
Type: article
Access and Citation
Cited By Count: 13