Building a Web Crawler with WebMagic

Yatin Batra | September 16th, 2025 | Last Updated: September 16th, 2025

A web crawler is a program that automatically browses and extracts data from websites. In Java, one of the most widely used libraries for building web crawlers is WebMagic. It provides a flexible, pluggable, and easy-to-use API for crawling web pages, parsing content, and storing results. Let us delve into understanding the Java WebMagic web crawler and how it simplifies the process of web scraping.

1. Introduction

Web crawlers, sometimes called spiders or bots, navigate through the web by following links and retrieving content. They are the backbone of modern applications such as search engines, price monitoring systems, sentiment analysis tools, and large-scale data mining. By automatically collecting structured information, they save countless hours of manual work.

When it comes to Java-based solutions, WebMagic stands out as a popular and practical framework. It provides developers with a ready-to-use yet highly extensible toolkit to build robust crawling solutions. A Java WebMagic web crawler can be customized for simple tasks like scraping product details from an e-commerce site, or for complex scenarios such as multi-threaded crawling across multiple domains with real-time data storage.

One of WebMagic’s biggest strengths is its modular and pluggable architecture, which makes it easy to extend, swap, or enhance individual parts of the crawler without rewriting the rest of the codebase. This flexibility allows developers to focus more on business logic and less on low-level details like connection handling or parsing boilerplate.

1.1 WebMagic Core Components

Downloader – Responsible for fetching web pages while handling connection issues, retries, and throttling.
PageProcessor – Defines the parsing logic, extracting links, data fields, or structured information from the fetched pages.
Scheduler – Manages the queue of URLs to be crawled, ensuring efficient and non-repetitive traversal.
Pipeline – Processes and stores the extracted results, which can be directed to databases, JSON files, CSV exports, or even message queues for further processing.
Spider – The orchestrator that ties everything together, coordinating downloaders, processors, schedulers, and pipelines into a seamless crawling workflow.
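
To make the division of labor concrete, here is a minimal, library-free sketch of the loop these components carry out together. It is not WebMagic's implementation: ToyCrawler, crawl, and the in-memory web map are hypothetical names used only for illustration, and the "download" step is a map lookup rather than an HTTP request.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Conceptual sketch only: these names are hypothetical and are not WebMagic classes.
class ToyCrawler {

    // Crawls an in-memory "web" (URL -> links on that page) breadth-first
    // and returns the URLs in the order they were visited.
    static List<String> crawl(String seed, Map<String, List<String>> web) {
        Queue<String> queue = new ArrayDeque<>();  // Scheduler: URLs waiting to be crawled
        Set<String> seen = new HashSet<>();        // Scheduler: duplicate filter
        List<String> results = new ArrayList<>();  // Pipeline: where extracted data ends up

        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            List<String> links = web.get(url);     // Downloader: "fetch" the page
            if (links == null) {
                continue;                          // page could not be fetched, skip it
            }
            results.add(url);                      // PageProcessor: extract a field
            for (String link : links) {            // PageProcessor: discover new links
                if (seen.add(link)) {              // add() returns false for duplicates
                    queue.add(link);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"),
            "c", List.of("a"));
        System.out.println(crawl("a", web)); // prints [a, b, c]
    }
}
```

Even though the three pages link to each other in a cycle, each one is visited exactly once, which is the "non-repetitive traversal" the Scheduler is responsible for.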

2. Code Example

2.1 Dependencies

To get started with WebMagic, you need to include its core and extension libraries in your project. If you are using Maven, add the following dependencies to your pom.xml:

<dependency>
  <groupId>us.codecraft</groupId>
  <artifactId>webmagic-core</artifactId>
  <version>latest__jar__version</version>
</dependency>
<dependency>
  <groupId>us.codecraft</groupId>
  <artifactId>webmagic-extension</artifactId>
  <version>latest__jar__version</version>
</dependency>

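These dependencies bring in the WebMagic core engine and additional utilities (like pipelines and schedulers) that simplify writing custom Java WebMagic web crawlers. Note that latest__jar__version is a placeholder: replace it with the release version you intend to use, as listed on Maven Central.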

2.2 Java Code

Below is a simple implementation of a web crawler using WebMagic. It demonstrates how to fetch a page, extract the title, and print it to the console.

// SimpleWebCrawler.java

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

public class SimpleWebCrawler implements PageProcessor {

  // Crawl settings: retry a failed request up to 3 times and pause 1000 ms between requests
  private final Site site = Site.me()
    .setRetryTimes(3)
    .setSleepTime(1000);

  @Override
  public void process(Page page) {
    // Extract links and add them to the target queue
    page.addTargetRequests(page.getHtml().links().all());

    // Extract the title of the page
    String title = page.getHtml().xpath("//title/text()").get();
    page.putField("title", title);
  }

  @Override
  public Site getSite() {
    return site;
  }

  public static void main(String[] args) {
    Spider.create(new SimpleWebCrawler())
      .addUrl("https://www.wikipedia.org")
      .addPipeline(new ConsolePipeline())
      .thread(5)
      .run();
  }
}

2.2.1 Code Explanation

The SimpleWebCrawler class implements WebMagic’s PageProcessor interface, which defines the core crawling logic. The site object configures crawling behavior: failed requests are retried up to three times, and the crawler waits one second (1000 ms) between requests to avoid overwhelming the target server. The getSite() method simply returns this configuration so the framework can apply it.

Inside the process() method, the crawler first extracts all hyperlinks from the current page using page.getHtml().links().all() and adds them to the target request queue, enabling recursive crawling. Because no filter is applied, this includes links that point to other hosts. It then uses the XPath expression //title/text() to extract the page’s title and stores it in the result fields with page.putField("title", title).

In the main() method, a Spider is created with this custom processor, starting from the seed URL https://www.wikipedia.org. The extracted data is sent to a ConsolePipeline, which prints results to the console, and the crawler runs with five threads so that several pages can be fetched and processed in parallel. Finally, the run() method launches the crawl, tying together downloading, parsing, scheduling, and data handling in one execution flow.
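
Since process() queues every link it finds, a crawl can quickly wander off the starting site. One possible way to keep it contained is to filter the link list before queuing it. The sketch below uses only the JDK; LinkFilter and sameHost are hypothetical names, not part of WebMagic, and matching on the exact host name is an assumption you may want to relax (for example, to allow subdomains).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of WebMagic: keeps only links on an allowed host.
class LinkFilter {

    static List<String> sameHost(List<String> links, String allowedHost) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            try {
                String host = new URI(link).getHost(); // null for relative links
                if (host != null && host.equalsIgnoreCase(allowedHost)) {
                    kept.add(link);
                }
            } catch (URISyntaxException e) {
                // Skip links that are not valid URIs instead of failing the crawl.
            }
        }
        return kept;
    }
}
```

In the example above, the link-queuing line in process() could then become page.addTargetRequests(LinkFilter.sameHost(page.getHtml().links().all(), "www.wikipedia.org")), so that only links on the seed host are followed.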

2.2.2 Code Run and Output

When you run the SimpleWebCrawler class, WebMagic starts a crawl from the seed URL https://www.wikipedia.org. The Spider fetches the page, passes the content to the process() method, extracts the page title using XPath, and prints the result through the ConsolePipeline. At the same time, it collects all links on the page and schedules them for further crawling. In principle this repeats until no new links remain, but because the processor follows every link it finds, a seed such as Wikipedia keeps the crawl running far longer than a quick demo needs, so expect to stop it manually. Since we are running with 5 threads, multiple pages are fetched and processed in parallel, speeding up the crawl.

title: Wikipedia
title: English — Wikipedia
title: Español — Wikipedia
title: 日本語 — Wikipedia
title: Deutsch — Wikipedia
title: Français — Wikipédia
title: Русский — Википедия
title: Italiano — Wikipedia
title: 中文 — 维基百科
title: Português — Wikipédia
title: العربية — ويكيبيديا
title: മലയാളം — വിക്കിപീഡിയ
title: हिन्दी — विकिपीडिया
...

In this example, the crawler fetches the page at https://www.wikipedia.org, extracts its title and links, and prints the title to the console. It then follows the discovered links (here, the language editions of Wikipedia) and prints each of those titles in the same format. Because five threads process pages in parallel, the order of the lines can vary between runs.
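
The console is only one possible destination for the extracted fields; the Pipeline description earlier also mentions CSV exports. As a rough sketch of what such a pipeline would need to do with one page's fields, the helper below turns an ordered map of field values into a single CSV row. CsvRow, escape, and format are hypothetical names, the sample values are made up for illustration, and wiring this into an actual WebMagic Pipeline is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a WebMagic class: formats one page's fields as a CSV row.
class CsvRow {

    // Quotes a value when it contains a comma, quote, or line break (RFC 4180 style).
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins the field values in map iteration order, so pass an ordered map.
    static String format(Map<String, String> fields) {
        List<String> cells = new ArrayList<>();
        for (String value : fields.values()) {
            cells.add(escape(value));
        }
        return String.join(",", cells);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", "https://www.wikipedia.org");
        fields.put("title", "Wikipedia, the free encyclopedia"); // sample value only
        System.out.println(format(fields));
        // prints: https://www.wikipedia.org,"Wikipedia, the free encyclopedia"
    }
}
```

Escaping matters here because page titles routinely contain commas and quotation marks, which would otherwise shift columns in the exported file.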

3. Conclusion

WebMagic simplifies the process of building a robust and scalable crawler in Java. By separating concerns into modular components, it allows developers to extend and customize crawling logic easily. Whether for academic research, monitoring, or building search engines, WebMagic provides a reliable foundation for web crawling.