Web Crawling
─
NIHARIKA
Software Engineer
M.Tech, BITS Pilani
Overview
While at first impression web crawling may appear to be applications that gather information from across hundreds of billions of webpages and organize it in the Search index., the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources.
Goals
- Fast gathering of web pages
- Efficient filtering of useful web pages
Challenges
Given a set of URLs, a crawler downloads all the web pages addressed by the URLs, get the hyperlinks contained in the pages, and iteratively downloads the web pages addressed by these hyperlinks.
Web crawling has many inherent challenges in spite of its basic algorithm :
- Content selection tradeoffs: Crawling is done selectively in a controlled order. To get the most relevant content quickly, and eradicate irrelevant, duplicate, and malicious content.
- Scale: Modern search engine companies use both vertical and horizontal scaling using thousands of workstations and several high-speed network links.
- Adversaries: misdirecting web searches to commercial web sites for revenue through advertisements.
Solutions
- Building an efficient, robust and scalable crawler
Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.
- Avoiding problematic and undesirable content
Specifications from webmasters on what portions of site can be crawled (Explicit Politeness) and Implicit Politeness(even with no specification, avoid hitting any site too often).
- Rescheduling visiting previously crawled content
Should continuously crawl the web (visiting frequency of a page should be close to its modification frequency)
- Distributed computing of results
Crawling should be distributed within several machines and clever use of the processor, memory, bandwidth (e.g. as few idle processes as possible)
Architecture
The figure shows the basic architecture of a web crawler. Each crawling process includes several worker threads, and each worker thread performs repeated work cycles.
At the beginning of each work cycle, a worker obtains a URL from the Frontier data structure, which gives out URLs according to their priority and politeness policies. The worker thread then invokes the HTTP fetcher which first calls a DNS sub-module to resolve the host component of the URL into the IP address of the corresponding web server, and then connects to the web server. It then checks for any robots exclusion rules (which typically are cached ) and downloads the web page. If the download is successful, the web page may or may not be stored in a repository of acquired web pages. In either case, the page is passed to the Link extractor, which parses the page’s HTML content and extracts hyperlinks contained in it.
The corresponding URLs are then passed to a URL distributor, which assigns each URL to a crawling process. This assignment is typically made by hashing the URLs host component, its domain, or its IP address. Since most hyperlinks refer to pages on the same web site, assignment to the local crawling process is the common case. Next, the URL passes through the Custom URL filter (e.g., to exclude URLs belonging to “black-listed” sites, or URLs with particular file extensions that are not of interest) and into the Duplicate URL eliminator, which maintains the set of all URLs discovered so far and passes on only never-before-seen URLs. Finally, the URL prioritizer selects a position for the URL in the Frontier, based on factors such as estimated page importance or rate of change.
Results
The shown implementation does not scale to web corpus sizes that exceed the amount of memory available on a single machine. To scale beyond this limitation, one could either maintain the data structure (e.g., the tree or sorted list) on disk or use an off-the-shelf database management system. Either solution allows maintaining set sizes that exceed main memory; however, the cost of accessing items in the set is fairly expensive operation as it typically involves a disk seek. To achieve high performance, a more specialized approach is needed.
Conclusion
Web crawlers are the most important aspect of all the search engines. They are the basic component of all the web services so they need to provide high performance.
A number of crawling algorithms are used by the search engines namely, Breadth-First Algorithm, Naive Best First Algorithm, Page Rank Algorithm, Fish Search Algorithm. Building an effective web crawler to solve different purposes is not that difficult a task, but choosing the right strategies and building an effective architecture will lead to the implementation of highly intelligent web crawler application.
Limitations
As the referred attached journal indicates, crawling is a well-researched problem. Even in the material that is covered, we can notice many open issues, as follows:
- Expiring unwanted pages. In practice crawlers discard or retire low-quality and spam pages from their collections, to make room for superior pages as the storage capacity is finite.
- Personalized content. Web sites often customize their content to individual users, e.g., Flipkart gives personalized recommendations depending on a user’s searching and purchasing patterns.
- Holistic crawl ordering. Whereas much attention has been paid to problems that are prioritizing the crawling order of new pages, refreshing content from old pages, revisiting pages to discover new links, there is little research on how to integrate the diversified approaches into a unified strategy.
- Collaboration between content providers and crawlers.
References
- International Journal of Computer Trends and Technology (IJCTT) — volume 13 number 3 — Jul 2014
- http://infolab.stanford.edu/olston/publications/crawling_survey.pdf