This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.
Recently, while writing a web crawler, I encountered strict IP rate limiting. Since the amount of data to scrape was large, I decided to look for a proxy pool.
Most free proxy pools available online offer relatively few usable IPs, and connections are often unstable. Paid proxy services, on the other hand, tend to be quite expensive.
While searching, I discovered a novel source of proxy IPs: cloud function services, which can provide a large number of stable, low-cost IP addresses.
Principle
Cloud function services operate based on containerized infrastructure and server clusters. User-uploaded code is scheduled and executed on different machines with distinct public IP addresses. Repeatedly accessing the same function service may result in being assigned different public IPs over time.
The setup relies on a custom HTTP proxy server running locally, through which the crawler sends its requests. When the proxy receives an HTTP request, it packages the request in a specific format and forwards it to the cloud function. The cloud function unpacks the request, makes the actual HTTP call, and sends the response back to the local proxy server, which returns the result to the crawler. In other words, the cloud functions act as request forwarders, which achieves an effect similar to that of a traditional proxy pool.
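To make the packaging step concrete, here is a minimal sketch of how a request could be wrapped and unwrapped. The JSON envelope and its field names (method, url, headers, body) are assumptions made for illustration and are not the format used by any of the scf-proxy projects.

```python
import base64
import json


def pack_request(method, url, headers, body=b""):
    """Local proxy side: serialize one HTTP request into a JSON envelope.

    The envelope layout is hypothetical, chosen only for this sketch.
    """
    return json.dumps({
        "method": method,
        "url": url,
        "headers": dict(headers),
        # The body may be binary, so it is base64-encoded to survive JSON.
        "body": base64.b64encode(body).decode("ascii"),
    })


def unpack_request(envelope):
    """Cloud function side: recover the request fields from the envelope."""
    data = json.loads(envelope)
    return (
        data["method"],
        data["url"],
        data["headers"],
        base64.b64decode(data["body"]),
    )


envelope = pack_request(
    "POST", "https://example.com/api", {"User-Agent": "crawler"}, b"q=1"
)
method, url, headers, body = unpack_request(envelope)
```

The cloud function would then issue the real request with the recovered fields and send the response back in a similar envelope.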
Implementation
There are several open-source implementations on GitHub based on Tencent Cloud SCF:
- https://github.com/hashsecteam/scf-proxy
- https://github.com/Sakurasan/scf-proxy (the one I used)
- https://github.com/shimmeris/SCFProxy
I haven’t tested the number of available IPs from Alibaba Cloud’s FaaS yet, but Tencent Cloud consistently provides more than 50 different IPs per region per time period (subject to fluctuations over time).
Tencent Cloud has 4 domestic (mainland China) regions. Deploying functions in all of them and rotating between them yields over 200 domestic IP addresses, more than sufficient for many crawling tasks.
There are a total of 12 regions globally. If foreign IPs are also acceptable, even more IPs become available.
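One way to check figures like the per-region IP counts above is to send repeated requests through the proxy to a service that echoes the caller's IP and tally the distinct answers. The sketch below is not how the numbers in this article were measured. The fetch_exit_ip callable is a hypothetical placeholder that you would implement yourself, and it is faked here so the example runs without network access.

```python
from collections import Counter


def count_exit_ips(fetch_exit_ip, attempts):
    """Call fetch_exit_ip() `attempts` times and tally the IPs it reports."""
    seen = Counter()
    for _ in range(attempts):
        try:
            seen[fetch_exit_ip()] += 1
        except OSError:
            # Network failures (including urllib's URLError) are skipped.
            continue
    return seen


# Fake responses standing in for a real IP echo service behind the proxy.
fake_ips = iter(["203.0.113.5", "203.0.113.9", "203.0.113.5"])
tally = count_exit_ips(lambda: next(fake_ips), attempts=3)
print(len(tally), "distinct IPs")
```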
Currently, Tencent Cloud offers a free tier: for the first three months, you get 1 million requests per month for free. After that, you can purchase a 1 million-request package for just 1 RMB per month.
Setup Instructions
Based on https://github.com/Sakurasan/scf-proxy
- git clone the above repository.
- Run sh build.sh to generate main.zip, which needs to be uploaded to the cloud service.
- Run go build on client.go in the cmd directory to build the client binary.
Optionally, you can deploy in multiple regions and run multiple clients locally, letting your crawler rotate through them.
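Rotating through several local clients can be sketched with a simple round-robin generator. The listening addresses below are hypothetical (one per local client) and should be replaced with whatever addresses your clients actually use. Only plain HTTP is mapped here, since this sketch makes no assumption about how the client handles HTTPS.

```python
import itertools
import urllib.request


def rotating_openers(proxy_urls):
    """Yield (proxy_url, opener) pairs forever, cycling through proxy_urls."""
    for proxy_url in itertools.cycle(proxy_urls):
        handler = urllib.request.ProxyHandler({"http": proxy_url})
        yield proxy_url, urllib.request.build_opener(handler)


# Hypothetical addresses: one local client per deployed region.
LOCAL_CLIENTS = [
    "http://127.0.0.1:8081",
    "http://127.0.0.1:8082",
    "http://127.0.0.1:8083",
]
rotation = rotating_openers(LOCAL_CLIENTS)
```

In the crawler, each request would take the next pair, for example `proxy_url, opener = next(rotation)` followed by `opener.open(target_url, timeout=10)`, so consecutive requests leave through different regions.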
The following deployment steps are adapted from: https://freewechat.com/a/MzI0MDI5MTQ3OQ==/2247484068/1
Open Function Service (https://console.cloud.tencent.com/scf/list?rid=1&ns=default) and click “Create”.
Follow the steps in the screenshots below. Don’t rush to click “Finish” right away.
[Screenshots of the creation steps]
After clicking “Finish”, the ZIP package is automatically uploaded and deployed, and an API Gateway is created. Click “Access Now”.
Obtain the API access URL of the cloud function, then run
./client -sfcurl <access_url>
This article is licensed under the CC BY-NC-SA 4.0 license.
Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/sfc-proxy-pool/