Concurrent Web Crawler

Build a concurrent web crawler that discovers and processes links using worker pools and depth-limited traversal.

Recommended Prerequisites

Complete these exercises first for the best learning experience:

  • Worker Pool

Web crawlers discover and process web pages by:

  1. Starting from seed URLs
  2. Extracting links from pages
  3. Following links to discover new pages
  4. Avoiding duplicate visits
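
As a point of reference, here is a minimal sequential sketch of that loop in Go. The extractLinks helper and the example URLs are illustrative stand-ins, not part of the exercise starter code.

```go
package main

import "fmt"

// extractLinks is a made-up stand-in for fetching a page and parsing
// its links; a real crawler would do HTTP and HTML parsing here.
func extractLinks(url string) []string {
	links := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
	}
	return links[url]
}

func main() {
	// 1. Start from seed URLs.
	queue := []string{"https://example.com"}
	visited := make(map[string]bool)

	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]

		// 4. Avoid duplicate visits (and cycles).
		if visited[url] {
			continue
		}
		visited[url] = true
		fmt.Println("crawling", url)

		// 2-3. Extract links and follow them by enqueuing.
		queue = append(queue, extractLinks(url)...)
	}
}
```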

Challenges:

  • Deduplication: Track visited URLs to avoid cycles
  • Concurrency: Use workers to crawl multiple pages simultaneously
  • Depth Control: Limit how deep the crawler goes
  • Thread Safety: Safely share visited URL set across workers
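
The deduplication and thread-safety challenges usually meet in a single type: a mutex-protected visited set with a check-and-mark operation. A minimal sketch, with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet is a mutex-protected set of URLs. The type and method
// names here are illustrative, not part of the exercise starter code.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]bool)}
}

// TryVisit returns true only the first time a URL is seen, so exactly
// one worker ends up processing each URL.
func (v *visitedSet) TryVisit(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

func main() {
	v := newVisitedSet()
	fmt.Println(v.TryVisit("https://example.com")) // true
	fmt.Println(v.TryVisit("https://example.com")) // false
}
```

Doing the check and the mark under one lock is what prevents two workers from both seeing "not visited" and crawling the same page.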

Build a crawler with:

  1. Worker pool for concurrent crawling
  2. URL deduplication (visited tracking)
  3. Depth-limited traversal
  4. Thread-safe operations
  5. Simulated link extraction (no actual HTTP for playground)
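
One possible way these pieces fit together is sketched below: a buffered jobs channel feeding the worker pool, a WaitGroup that counts outstanding jobs, and a hard-coded link map in place of real HTTP. Treat it as one workable shape under those assumptions, not the reference solution.

```go
package main

import (
	"fmt"
	"sync"
)

type job struct {
	url   string
	depth int
}

// simulatedLinks stands in for HTTP fetching and HTML parsing so the
// sketch can run in the playground; the URLs are made up.
var simulatedLinks = map[string][]string{
	"https://example.com":   {"https://example.com/a", "https://example.com/b"},
	"https://example.com/a": {"https://example.com/a/1", "https://example.com"},
	"https://example.com/b": {"https://example.com/b/1"},
}

func crawl(seed string, maxDepth, workers int) []string {
	// Buffer sized generously for the simulated graph; a real crawler needs
	// an unbounded queue or a feeder goroutine to avoid blocking on send.
	jobs := make(chan job, 1024)
	var pending sync.WaitGroup // counts jobs enqueued but not yet finished

	var mu sync.Mutex
	visited := make(map[string]bool)
	var found []string

	// enqueue records a URL as visited and schedules it, unless it was
	// already seen or lies beyond the depth limit.
	enqueue := func(u string, depth int) {
		if depth > maxDepth {
			return
		}
		mu.Lock()
		if visited[u] {
			mu.Unlock()
			return
		}
		visited[u] = true
		found = append(found, u)
		mu.Unlock()

		pending.Add(1)
		jobs <- job{url: u, depth: depth}
	}

	enqueue(seed, 0)

	var workersDone sync.WaitGroup
	for i := 0; i < workers; i++ {
		workersDone.Add(1)
		go func() {
			defer workersDone.Done()
			for j := range jobs {
				// "Process" the page, then generate new work from its links.
				for _, link := range simulatedLinks[j.url] {
					enqueue(link, j.depth+1)
				}
				pending.Done()
			}
		}()
	}

	pending.Wait() // every enqueued job has been fully processed
	close(jobs)    // lets the workers fall out of their range loops
	workersDone.Wait()
	return found
}

func main() {
	for _, u := range crawl("https://example.com", 2, 3) {
		fmt.Println("discovered", u)
	}
}
```

The pending WaitGroup is what makes shutdown safe even though workers keep generating new jobs: a child is always added to the count before its parent is marked done, so the count only reaches zero when no work remains.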

Concurrent Web Crawler

~60 min · hard

Build a worker pool-based web crawler with deduplication and depth limiting

Key concepts:

  • Worker Pool Pattern: Fixed number of workers processing from a shared queue
  • BFS Traversal: Breadth-first search controlled by depth tracking
  • URL Deduplication: Mutex-protected map prevents visiting the same URL twice
  • Concurrent Data Structures: Thread-safe visited set shared across workers
  • Dynamic Work Generation: Workers add new jobs as they discover links
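
The earlier sketch generates work dynamically through a single jobs channel. An alternative that makes "breadth-first, controlled by depth tracking" very explicit is to crawl one level at a time, running the worker pool per level. A rough sketch, again with simulated links in place of HTTP:

```go
package main

import (
	"fmt"
	"sync"
)

// simulatedLinks stands in for real link extraction (illustrative data).
var simulatedLinks = map[string][]string{
	"https://example.com":   {"https://example.com/a", "https://example.com/b"},
	"https://example.com/a": {"https://example.com/c"},
}

// crawlByLevel processes the frontier one depth level at a time:
// every URL at depth d is crawled (concurrently) before depth d+1 starts.
func crawlByLevel(seed string, maxDepth, workers int) {
	visited := map[string]bool{seed: true}
	frontier := []string{seed}

	for depth := 0; depth <= maxDepth && len(frontier) > 0; depth++ {
		var mu sync.Mutex
		var next []string

		urls := make(chan string)
		var wg sync.WaitGroup
		for i := 0; i < workers; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for u := range urls {
					fmt.Println("depth", depth, "crawling", u)
					for _, link := range simulatedLinks[u] {
						mu.Lock()
						if !visited[link] {
							visited[link] = true
							next = append(next, link) // becomes the next frontier
						}
						mu.Unlock()
					}
				}
			}()
		}
		for _, u := range frontier {
			urls <- u
		}
		close(urls)
		wg.Wait()

		frontier = next
	}
}

func main() {
	crawlByLevel("https://example.com", 2, 3)
}
```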

Extension ideas:

  • Respect robots.txt
  • Implement rate limiting per domain
  • Add request timeout and retry logic
  • Extract and store page content
  • Build an inverted index for search
  • Add a politeness delay between requests to the same domain
  • Implement URL normalization (handle trailing slashes, query params)
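
For the last item, Go's net/url package does most of the parsing work; the remaining decision is which normalization rules to apply. A possible starting point is sketched below; the policy here (dropping all query parameters and fragments) is a simplification, and real crawlers often need to keep significant query parameters.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL applies one possible policy: lowercase the host, drop
// fragments and query parameters, and trim a trailing slash, so that
// variants of the same page map to one visited-set key.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.RawQuery = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

func main() {
	// All three inputs normalize to https://example.com/docs
	for _, raw := range []string{
		"https://Example.com/docs/",
		"https://example.com/docs?utm_source=feed",
		"https://example.com/docs#intro",
	} {
		n, err := normalizeURL(raw)
		fmt.Println(n, err)
	}
}
```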