

My First Attempt at a Full-Site Distributed Crawler - Lofter

 

This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.

I’ve written quite a few crawlers before, but all of them were single-machine scripts—mostly used to automate scraping some specific content I cared about, or to streamline certain repetitive tasks.

However, single-machine crawlers eventually hit performance bottlenecks. The data volume they can collect is limited, so I decided it was time to try building a distributed crawler and learn some new techniques along the way.

Choosing a Target

Note: Avoid scraping sensitive personal data or using scraped data for commercial purposes. Also, make sure your crawler doesn’t generate excessive traffic that could disrupt the target website.

Looking for a relatively simple platform to start with, I decided to scrape some fanfiction from Lofter. (x

Among the platforms I frequently use, Lofter stood out during testing: its APIs don’t require login, there are no IP rate limits, and I wouldn’t need an IP pool or account pool. The main constraints would be bandwidth and concurrent connections—perfect conditions to leverage a cluster. Plus, as a platform under NetEase, it’s likely capable of handling relatively high traffic without being significantly impacted by my crawler.

Setting Up the Cluster

Finding VPS Providers

A cluster, by definition, requires multiple machines. So first things first—I needed a bunch of servers to run the cluster.

Using developer-focused cloud providers like DigitalOcean or Vultr wouldn’t be too expensive.
But let’s be honest—I’m broke (and love free stuff). So I used my credit card to sign up for Google Cloud Platform’s 90-day $300 free trial. That should be more than enough to spin up several crawler instances.

Using a ping tool, I found that Lofter's overseas domain resolves to AWS in Singapore, so I set up my VMs in the Singapore region.

Container Orchestration

Tools like Kuboard can automatically deploy Kubernetes clusters, and AutoK3s can even provision K3s clusters directly on GCE. But for a simple crawler project, these orchestration tools felt overly heavy.

Instead, I went with Docker Swarm—a lightweight, easy-to-use alternative.

I set up one machine as the Swarm Manager, which also runs Redis and MongoDB. Redis handles task queue synchronization between crawlers, while MongoDB stores the scraped data.
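As a minimal sketch of this layout (the hostnames, queue names, and document fields below are placeholders I made up for illustration, not my actual config or Lofter's schema), the shared state could look like this:

```python
import redis
from pymongo import MongoClient

# Both services live on the Swarm manager; hostnames/ports are placeholders.
r = redis.Redis(host="manager", port=6379, decode_responses=True)
db = MongoClient("mongodb://manager:27017")["lofter"]

def enqueue_tag(tag: str) -> None:
    """Queue a tag for crawling, but only the first time we ever see it."""
    if r.sadd("seen_tags", tag):  # SADD returns 1 only on first insertion
        r.lpush("tag_queue", tag)

def next_tag(timeout: int = 30) -> str | None:
    """Block until a tag is available, or return None after `timeout` seconds."""
    item = r.brpop("tag_queue", timeout=timeout)
    return item[1] if item else None

def save_posts(posts: list[dict]) -> None:
    """Persist a batch of scraped posts; every worker writes to one collection."""
    if posts:
        db.posts.insert_many(posts)
```

Using the return value of SADD as the seen-set check keeps deduplication atomic across workers, so two crawlers can never enqueue the same tag twice.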

The setup was straightforward:

  • Manager Node
    Run sudo docker swarm init, then follow the instructions at https://docs.portainer.io/start/install/server/swarm/linux to set up the Portainer dashboard.

  • Worker Nodes
    Join the nodes to the cluster using the command provided by the manager. After configuring one node, I saved it as a template and used it to quickly deploy the rest.
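For reference, the Swarm commands themselves are short (the join token and manager address below are placeholders; docker swarm init prints the real ones):

```bash
# On the manager node
sudo docker swarm init

# On each worker node, paste the join command that the manager printed.
# The token and address here are placeholders.
sudo docker swarm join --token SWMTKN-1-<token> <manager-ip>:2377
```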

(Figure: the cluster)

(Figure: Portainer dashboard)

Writing the Crawler

After inspecting Lofter’s website and identifying potential entry points, here was my plan:

  1. Randomly pick a TAG and add it to the queue.

  2. Dequeue a TAG, scrape its popular posts, and extract any new, unvisited TAGs from the content to add back into the queue.

  3. Repeat step 2 until the queue is empty.
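Sketched in Python, the loop looks roughly like this. It reuses the helpers from the Redis/Mongo snippet above; the endpoint URL, query parameters, and the #tag# extraction regex are my own illustrative assumptions, not Lofter's documented API:

```python
import re

import requests

# Helpers from the Redis/Mongo sketch above (module name is hypothetical).
from shared_state import enqueue_tag, next_tag, save_posts

TAG_RE = re.compile(r"#([^#\s]+)#")  # assumes tags appear inline as #tag#

def crawl_tag(tag: str) -> None:
    """Scrape one tag's popular posts and enqueue any newly discovered tags."""
    # Illustrative endpoint and parameters, not Lofter's real API.
    resp = requests.get(
        "https://www.lofter.com/example/tagPosts",
        params={"tag": tag, "type": "hot", "limit": 100},
        timeout=10,
    )
    posts = resp.json().get("posts", [])
    save_posts(posts)
    for post in posts:
        for new_tag in TAG_RE.findall(post.get("content", "")):
            enqueue_tag(new_tag)

def main() -> None:
    enqueue_tag("seed-tag")  # step 1: start from one randomly picked TAG
    while (tag := next_tag()) is not None:  # steps 2-3: drain the queue
        crawl_tag(tag)

if __name__ == "__main__":
    main()
```

Because enqueue_tag only pushes tags that pass the Redis seen-set check, the loop terminates naturally once no new tags are being discovered.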

For scraping TAG-related content, I borrowed some code from https://github.com/IshtarTang/lofterSpider.

Once all TAGs were crawled, I moved on to collecting author information and scraping every post published by each author.

Since author homepages vary in layout, I analyzed the Android app’s API via packet capture to perform the scraping.

Running the Crawler

Thankfully, Lofter’s APIs for fetching TAGs and author content both return up to 100 entries per request—including full article content. This made the crawling process quite efficient.

Even better—there were no IP-based rate limits, and no login was required.
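A rough sketch of consuming such a paginated endpoint (the URL and parameter names stand in for what the packet capture showed; they are not an official spec):

```python
import requests

def fetch_all_posts(blog_id: str) -> list[dict]:
    """Page through an author's posts, up to 100 at a time, until the API runs dry."""
    posts, offset = [], 0
    while True:
        # Illustrative mobile-API shape; the real parameter names differ.
        resp = requests.get(
            "https://api.lofter.com/example/blogPosts",
            params={"blogId": blog_id, "offset": offset, "limit": 100},
            timeout=10,
        )
        batch = resp.json().get("posts", [])
        if not batch:
            break
        posts.extend(batch)
        offset += len(batch)
    return posts
```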


The entire process—TAG crawling followed by author crawling—took about a week.

In the end, the TAG phase yielded 443.1 million entries, and the follow-up author crawl brought the total to a whopping 1.2 billion data points…

Data Analysis

Still a work in progress—will update when I have time.

This article is licensed under the CC BY-NC-SA 4.0 license.

Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/distributed-crawler-lofter/
If this article was helpful or interesting to you, consider buying me a coffee ¬_¬
Feel free to comment in English below o/