Abstract:
In this paper, we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit.
We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning.
Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates—the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
The hidden Web has been growing at a very fast pace. It is estimated that there are several million hidden-Web sites. These are sites whose contents typically reside in databases and are only exposed on demand, as users fill out and submit forms. As the volume of hidden information grows, there has been increased interest in techniques that allow users and applications to leverage this information.
Examples of applications that attempt to make hidden-Web information more easily accessible include meta-searchers, hidden-Web crawlers, online-database directories and Web information integration systems. Since for any given domain of interest, there are many hidden-Web sources whose data need to be integrated or searched, a key requirement for these applications is the ability to locate these sources. But doing so at a large scale is a challenging problem.