Design Web Crawler

Problem Statement:


Step-1: What is a Web Crawler?


Step-2: Requirements and Goals of the System


Step-3: Some Design Considerations

Is it a crawler for HTML pages only, or should we also fetch and store other types of media, e.g., sound files, images, and videos?
What is the expected number of pages we will crawl? How big will the URL database become?
What is the 'robots exclusion' protocol (robots.txt) and how should we deal with it?


Step-4: Capacity Estimation and Constraints

Traffic Estimates:
Storage Estimates:
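As a hedged, worked example of the kind of back-of-the-envelope arithmetic this step calls for (every figure below is an assumption picked for illustration, not a number given in the problem statement):

    # All figures are hypothetical assumptions for illustration only.
    pages_to_crawl = 15_000_000_000        # assumed total number of pages
    avg_page_size_kb = 100                 # assumed average size of an HTML page
    metadata_overhead = 0.10               # assumed 10% extra for URLs and metadata

    raw_storage_tb = pages_to_crawl * avg_page_size_kb / (1024 ** 3)   # KB -> TB
    total_storage_tb = raw_storage_tb * (1 + metadata_overhead)

    print(f"Raw page storage: ~{raw_storage_tb:,.0f} TB")
    print(f"With metadata:    ~{total_storage_tb:,.0f} TB")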


Step-5: High Level Design

Basic Algorithm
How to Crawl?

Breadth First or Depth First?
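One way to make the trade-off concrete: the same crawl loop becomes breadth-first or depth-first depending only on how the frontier is consumed, a FIFO queue versus a LIFO stack. A minimal sketch (the link graph below is made up purely for illustration):

    from collections import deque

    # Toy link graph, purely for illustration.
    links = {
        "a.com": ["a.com/1", "a.com/2"],
        "a.com/1": ["a.com/1/x"],
        "a.com/2": [],
        "a.com/1/x": [],
    }

    def crawl_order(seed, breadth_first=True):
        frontier, seen, order = deque([seed]), {seed}, []
        while frontier:
            # popleft() -> FIFO (breadth-first); pop() -> LIFO (depth-first)
            url = frontier.popleft() if breadth_first else frontier.pop()
            order.append(url)
            for link in links.get(url, []):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return order

    print(crawl_order("a.com", breadth_first=True))   # breadth-first order
    print(crawl_order("a.com", breadth_first=False))  # depth-first order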

Path-ascending crawling:
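As a hedged illustration of the idea: a path-ascending crawler also enqueues every ancestor path of a URL it discovers, hoping to reach pages that are not directly linked from anywhere. A minimal sketch:

    from urllib.parse import urlparse

    def ascending_paths(url):
        """Yield the URL itself plus every ancestor path up to the site root."""
        parsed = urlparse(url)
        parts = [p for p in parsed.path.split("/") if p]
        root = f"{parsed.scheme}://{parsed.netloc}"
        yield url
        for i in range(len(parts) - 1, -1, -1):
            yield root + "/" + "/".join(parts[:i]) + ("/" if i else "")

    for u in ascending_paths("http://example.com/a/b/page.html"):
        print(u)   # .../a/b/page.html, .../a/b/, .../a/, .../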

Difficulties in implementing an efficient web crawler

There are two important characteristics of the Web that make crawling it a very difficult task:

  1. Large volume of web pages: The crawler can only download a fraction of the web pages at any given time, so it is critical that it is intelligent enough to prioritize what it downloads first.
  2. Rate of change of web pages: Web pages on the internet change very frequently; as a result, by the time the crawler is downloading the last page from a site, earlier pages may have changed, or new pages may have been added.
A bare minimum crawler needs at least these components (a minimal loop tying them together is sketched after the list):
  1. URL frontier: To store the list of URLs to download and to prioritize which URLs should be crawled first.
  2. HTTP Fetcher: To retrieve a web page from the server.
  3. Extractor: To extract links from HTML documents.
  4. Duplicate Eliminator: To make sure the same content is not extracted twice unintentionally.
  5. Datastore: To store retrieved pages, URLs, and other metadata.
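The sketch below ties these five components together into a single-threaded loop. It is a minimal illustration rather than a production design; the FIFO frontier, SHA-256 checksums, and in-memory stores are simplifying assumptions:

    import hashlib
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Extractor: pulls href links out of an HTML document."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)      # URL frontier (FIFO here; real crawlers prioritize)
        seen_urls = set(seed_urls)       # URL dedupe
        seen_checksums = set()           # document dedupe
        store = {}                       # datastore: url -> page content

        while frontier and len(store) < max_pages:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:   # HTTP fetcher
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                 # skip unreachable pages

            checksum = hashlib.sha256(html.encode()).hexdigest()
            if checksum in seen_checksums:   # duplicate content, skip
                continue
            seen_checksums.add(checksum)
            store[url] = html

            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen_urls:
                    seen_urls.add(absolute)
                    frontier.append(absolute)
        return store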


Step-6: Detailed Component Design


1. URL Frontier
How big would our URL frontier be?
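Since the frontier's job (per the component list above) is to hand out the most important URLs first, one hedged way to sketch it is as a priority queue; the priority values below are arbitrary assumptions:

    import heapq
    from itertools import count

    class UrlFrontier:
        """Minimal prioritized frontier sketch: lower score = crawled sooner."""
        def __init__(self):
            self._heap = []
            self._tie = count()   # tie-breaker so equal-priority URLs keep insertion order

        def add(self, url, priority):
            heapq.heappush(self._heap, (priority, next(self._tie), url))

        def next_url(self):
            return heapq.heappop(self._heap)[2] if self._heap else None

    frontier = UrlFrontier()
    frontier.add("http://example.com/news", priority=0)          # assumed high-priority page
    frontier.add("http://example.com/archive/2009", priority=5)  # assumed low-priority page
    print(frontier.next_url())   # -> the higher-priority URL first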


2. Fetcher Module:
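A hedged sketch of a fetcher that honors the robots exclusion rules raised in Step 3; the user-agent string is a made-up assumption, and Python's standard urllib is used purely for illustration:

    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "MyCrawler/0.1"   # hypothetical user-agent string

    def fetch(url, timeout=5):
        """Fetch a page only if the site's robots.txt allows it; return HTML or None."""
        parsed = urlparse(url)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        try:
            robots.read()
            if not robots.can_fetch(USER_AGENT, url):
                return None                  # disallowed by robots exclusion rules
            request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(request, timeout=timeout) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception:
            return None                      # treat network errors as a failed fetch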


3. Document Input Stream (DIS):


4. Document Dedupe Test:
How big would the checksum store be?
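One hedged way to implement this test is to compute a fixed-size checksum (e.g., SHA-256) of every downloaded document and compare it against a store of previously seen checksums; the in-memory set below stands in for whatever distributed store would be used in practice:

    import hashlib

    seen_checksums = set()   # in a real system this would be a shared key-value store

    def is_duplicate_document(html: str) -> bool:
        """Return True if an identical document has already been processed."""
        checksum = hashlib.sha256(html.encode("utf-8")).digest()   # 32 bytes per document
        if checksum in seen_checksums:
            return True
        seen_checksums.add(checksum)
        return False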


5. URL Filters:


6. Domain Name Resolution (DNS):
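A crawler resolves a very large number of hostnames, so caching DNS lookups locally is a common optimization. A minimal sketch, using an in-memory dictionary as the cache:

    import socket

    _dns_cache = {}   # hostname -> IP address, kept in memory

    def resolve(hostname):
        """Resolve a hostname, caching results to avoid repeated DNS round trips."""
        if hostname not in _dns_cache:
            _dns_cache[hostname] = socket.gethostbyname(hostname)
        return _dns_cache[hostname]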


7. URL Dedupe Test:
How much storage would we need for the URL store?
Can we use Bloom filters for deduping?
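To illustrate the Bloom-filter question above: a Bloom filter answers "definitely not seen" or "probably seen", so it saves memory at the cost of occasionally skipping a URL that was never actually crawled (a false positive). A minimal sketch, where the bit-array size and number of hash functions are arbitrary assumptions:

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter sketch for URL dedupe (sizes are arbitrary assumptions)."""
        def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, url):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, url):
            # False means definitely not seen; True may be a false positive.
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

    bf = BloomFilter()
    bf.add("http://example.com/")
    print(bf.might_contain("http://example.com/"))      # True
    print(bf.might_contain("http://example.com/new"))   # almost certainly False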


8. Checkpointing:


Step-7: Fault tolerance


Step-8: Data Partitioning


Step-9: Crawler Traps



