

High-Concurrency Crawler Optimization Notes - From 1 QPS to 10,000 QPS

 

This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.

This article documents a real crawler optimization experience. Data has been anonymized and is for reference and learning purposes only.

One day, you casually wrote a crawler:

import requests
import tqdm

def fetch_data(i):
    return requests.get(f"https://api.example.com/{i}").json()

for i in tqdm.trange(1, 10_0000_0000):
    print(fetch_data(i))

The crawler started happily, but checking the progress bar revealed only single-digit QPS (measured: 56/999999999 [00:09<47662:00:38, 5.83it/s]). At this rate (47,662 hours, roughly five and a half years), it would take forever…

Optimization seemed straightforward—just add multithreading:

import requests
import tqdm
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def fetch_data(i):
    return requests.get(f"https://api.example.com/{i}").json()

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = set()
    for i in tqdm.trange(1, 10_0000_0000):
        while len(futures) >= 20:
            completed, futures = wait(futures, return_when=FIRST_COMPLETED)
            for future in completed:
                print(future.result())
        futures.add(executor.submit(fetch_data, i))

But shortly after starting, the program crashed…

requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

After logging the request responses, you found a flood of 403/429/400/412 or other random status codes—and the response body was a CAPTCHA page.
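The logging that surfaces this is nothing fancy. A minimal sketch, assuming the same endpoint as above (the helper name and the 200-character truncation are just for illustration):

import requests

def fetch_data_logged(i):
    resp = requests.get(f"https://api.example.com/{i}")
    if resp.status_code != 200:
        # Dump the status and the start of the body instead of blindly calling .json()
        print(f"id={i} status={resp.status_code} body={resp.text[:200]!r}")
        return None
    return resp.json()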

“Damn, IP rate limiting again…” you thought.

“Proxy pools are expensive, unstable, and a pain to set up. I’ve used cloud provider IPs as proxies before, but this time the volume is way too big. Re-dialing my home connection or using Tor could work, but efficiency would still be low.”

Suddenly, you remembered the massive IPv6 address space. Some websites apply rate limits per /64 or even /128 subnet. Even if strict, limiting per /48 wouldn’t matter much—nowadays everyone has their own ASN, and who doesn’t have a /40 IPv6 block at home?

After testing, good news: this site uses the loosest form of rate limiting—/128 per address. That means your /64 home prefix is effectively unlimited. “So it’s basically no restriction at all… and I just saved a fortune on server and IP costs,” you muttered as you got to work.
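To put "effectively unlimited" in numbers, a quick back-of-the-envelope calculation (prefix sizes only, nothing site-specific):

# Under per-address limiting, each /128 is its own rate-limit bucket.
print(2 ** (128 - 64))  # /128s in a /64: 18446744073709551616
print(2 ** (128 - 80))  # /128s in even a small /80 slice: 281474976710656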

Starting with OpenWRT, route a subnet (e.g., /80) to the host running the crawler:

# On home broadband, the IPv6 prefix is typically delegated over the PPPoE connection
# Proxying NDP (as some guides suggest) adds overhead; manual routing on the router is preferred unless you're stuck with a dumb modem
# Assume ISP assigns 2409:1:2:3::/64, and crawler host is 2409:1:2:3:abcd::1
ip -6 route add 2409:1:2:3:abcd::/80 via 2409:1:2:3:abcd::1

On the crawler host, configure local routing and enable ip_nonlocal_bind:

ip route add local 2409:1:2:3:abcd::/80 dev ens18
sysctl net.ipv6.ip_nonlocal_bind=1

# Test with curl
curl -v --interface 2409:1:2:3:abcd::2333 ip.p3terx.com
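With ip_nonlocal_bind enabled, any address inside the routed /80 is fair game as a source address. For quick spot checks, a hedged stdlib-only sketch that picks one at random (the prefix mirrors the example above; the result can be fed to curl --interface):

import ipaddress
import random

# The /80 routed to the crawler host in the examples above
PREFIX = ipaddress.ip_network("2409:1:2:3:abcd::/80")

def random_source_address() -> str:
    # Pick a random offset within the prefix's host bits
    offset = random.getrandbits(128 - PREFIX.prefixlen)
    return str(PREFIX.network_address + offset)

print(random_source_address())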

Now you need each request to bind to a random local IPv6 address. There are various tools—let’s go with the familiar Xray:

{
  "inbounds": [
    {
      "listen": "127.0.0.1",
      "port": 1234,
      "protocol": "http"
    }
  ],
  "outbounds": [
    {
      "protocol": "freedom",
      "tag": "direct",
      "sendThrough": "2409:1:2:3:abcd::/80"
    }
  ]
}
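Before pointing the crawler at it, a quick hedged sanity check that source addresses really rotate: each plain requests.get() below opens a fresh connection through the proxy, so the address reported by the echo service (the same one used in the curl test) should change between calls.

import requests

proxies = {"http": "http://127.0.0.1:1234", "https": "http://127.0.0.1:1234"}
for _ in range(5):
    # Print whatever the echo service returns; the reported IPv6 address should differ each time
    print(requests.get("http://ip.p3terx.com", proxies=proxies, timeout=10).text.strip())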

Finally, route the crawler traffic through the local Xray proxy:

https_proxy=http://127.0.0.1:1234 python3 crawl.py

Success! The crawler now runs steadily at… 40 QPS (10773/999999999 [04:48<6206:23:37, 44.76it/s]).

“What the hell? It’s 6-7x faster, but still… (grabs calculator) over 6,000 hours? That’s 250 days!”

You try brute-forcing more threads, increasing max_workers from 20 to 100. But QPS remains stuck at 40, while CPU usage skyrockets.

Now it’s not “one core busy, fifteen cores idle”—it’s “one core busy, fifteen cores thrashing in context switches.”

You sigh, rub your forehead, and fire up the IDE again:

import asyncio
import aiohttp
import tqdm

async def fetch(i):
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.example.com/{i}", proxy='http://127.0.0.1:1234') as response:
            return await response.json()

async def main():
    tasks = set()
    for i in tqdm.trange(1, 10_0000_0000):
        tasks.add(asyncio.create_task(fetch(i)))
        while len(tasks) >= 20:
            finished, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

            for r in finished:
                print(r.result())

if __name__ == "__main__":
    asyncio.run(main())

Sure enough, switching to asyncio gives a big boost—async I/O really shines. CPU usage drops, and QPS jumps to 180+ (4622/999999999 [00:33<1505:59:35, 184.45it/s]).

But before you can celebrate, the progress bar freezes. QPS plummets. No ulimit issues, no “too many open files” errors.

Tracing up the stack, you find a flood of logs in OpenWRT: nf_conntrack: table full, dropping packet.

Ah-ha! The router’s connection tracking table is full—no new connections allowed.

Temporarily fix it with echo 65535 > /proc/sys/net/netfilter/nf_conntrack_max, and QPS stabilizes back to 180. But now the router is tracking over 30,000 connections.

“This isn’t sustainable… though the legendary MT7621 chip is holding up for now. Even if I disable conntrack on OpenWRT, this many connections might get me flagged by my ISP,” you mutter.

“An IP isn’t banned after one request. HTTP supports keep-alive, so why not reuse TCP connections and save the handshake overhead?”

“Need a connection pool… Let me check the docs… Wait, what? Damn it—the ClientSession already has one! I just used it wrong.”

A quick fix: share the same session across requests:

import asyncio
import aiohttp
import tqdm

async def fetch(session, i):
    async with session.get(f"https://api.example.com/{i}", proxy='http://127.0.0.1:1234') as response:
        return await response.json()

async def main():
    tasks = set()
    async with aiohttp.ClientSession() as session:
        for i in tqdm.trange(1, 10_0000_0000):
            tasks.add(asyncio.create_task(fetch(session, i)))
            while len(tasks) >= 20:
                finished, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

                for r in finished:
                    print(r.result())

if __name__ == "__main__":
    asyncio.run(main())

Now the connection count drops to just a few hundred, and performance improves further—QPS hits 280+ (4678/999999999 [00:18<982:43:14, 282.66it/s]).

Code looks solid now, but still needs 982 hours (~40 days).

Check system metrics: downstream ~4 Mbps, upstream ~2 Mbps, Python single-core usage ~30%. Plenty of headroom.

Increase concurrency to 80. After tuning, QPS soars to 1,280 (35126/999999999 [00:28<217:00:56, 1279.94it/s]), bandwidth usage ~12 Mbps down / 5 Mbps up.

Further increases yield diminishing returns—Python CPU usage now hits 90%. We’re CPU-bound again.

“Stupid GIL… forces me into multiprocessing. But at this speed, crawling 1e9 records takes less than 10 days—maybe acceptable?”

Final push: split the range and run 10 processes. Real-world QPS hits 10,000 and holds steady, consuming ~120 Mbps down / 35 Mbps up just for API traffic. (Each of ten processes averages 30962/100000000 [00:36<27:27:53, 1011.08it/s], total QPS = 10 × that.)

Pushing further, on an i9-10900 machine, peak QPS reaches 15,000. Bandwidth still has room, but CPU usage hits 80%. Bottleneck is clearly CPU—though we should leave headroom for Xray and the database, so let’s stop here.

async def main(start, end, pos):
    for i in tqdm.trange(start, end, position=pos, desc=f"Process #{pos}"):
        # Rest same as above, omitted
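One way to drive the ten copies of main is a small multiprocessing launcher; a hedged sketch (run_range and the range math are illustrative, and it assumes main from the snippet above):

import asyncio
from multiprocessing import Process

NUM_PROCS = 10
TOTAL = 10_0000_0000  # 1e9, same upper bound as above

def run_range(start, end, pos):
    # Each process gets its own event loop and its own tqdm row (via position=pos)
    asyncio.run(main(start, end, pos))

if __name__ == "__main__":
    step = TOTAL // NUM_PROCS
    procs = [Process(target=run_range, args=(1 + pos * step, 1 + (pos + 1) * step, pos))
             for pos in range(NUM_PROCS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()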

“Still… programmers shouldn’t make life hard for fellow programmers. Is this even a crawler anymore, or a DDoS attack? If I flood the API at 3 AM and trigger alerts, some poor dev will have to get out of bed. Next thing you know, they tighten the anti-scraping rules, and I’m back to square one. I should be grateful this site is lenient; once sites go strict, everyone loses. I’ll be nice and stick to 1–2k QPS.”

Satisfied, you glance at the clock. “Crap, it’s already 3 AM. ‘Early’ night tonight. Tomorrow I’ll figure out where to dump all this data—and whether my SSD can even hold it…”


Author’s note: This is my first time trying this writing style, and my first deep dive into ultra-high-concurrency crawling. The journey went smoothly: the final code is thousands of times faster than the naive serial version and a few hundred times faster than the threaded requests version, and it successfully bypassed IP rate limiting using IPv6. Quite the learning experience.

This article is licensed under the CC BY-NC-SA 4.0 license.

Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/high-performance-crawler/
If this article was helpful or interesting to you, consider buying me a coffee ¬_¬
Feel free to comment in English below o/