

A Comprehensive Guide to Python Web Scraping

 

This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.

A summary of web scraping strategies and common anti-scraping countermeasures.

(Mind map: overview of scraping strategies and common anti-scraping countermeasures)

Prerequisites for Web Scraping

  • Basic Python syntax (Python has powerful and well-developed scraping libraries)
  • Knowledge of HTTP (learn how to capture network traffic and understand HTTP requests/responses)
  • Basic HTML and CSS (understand HTML structure and parse web content)
  • Understanding of JSON and XML formats (parse API responses)

Quick learning resource: https://www.runoob.com/

Optional skills include:

  • Regular expressions (for data extraction)
  • JavaScript (to understand dynamic content and reverse engineer encryption)
  • Android knowledge (reverse engineering app encryption)
  • Computer Vision (CV) (for CAPTCHA recognition)
  • SQL (for storing large volumes of scraped data)
  • Linux (for running scrapers on servers over long periods)
  • etc.

Fetching and Parsing Data

  1. Construct and send requests to the server

Construct appropriate HTTP requests by analyzing page content or capturing traffic, then send them.
Recommended traffic capture tool: Fiddler https://www.telerik.com/fiddler/fiddler-classic
Recommended Python HTTP library: requests https://docs.python-requests.org/en/latest/

# Example: Google search with a keyword
import requests

def google_search(keyword):
    return requests.get("https://www.google.com/search", params={"q": keyword})
  2. Parse the response content (the specific approach depends on the website’s design)

For content directly embedded in HTML:
Use beautifulsoup4 to parse HTML.
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

# Example: Baidu Baike search result
import requests
from bs4 import BeautifulSoup

keyword = "Python"  # example search term

ret = requests.get("https://baike.baidu.com/search/word",
                   params={"word": keyword},
                   headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                            "Referer": "https://baike.baidu.com/"},
                   timeout=10)

bs = BeautifulSoup(ret.text, features="html.parser")
result = ""
for content in bs.find_all("div", class_="para"):
    if content.string is not None:
        result += content.string + "\n"

For data returned via API:
Use the Response object’s .json() method or json.loads() to parse the body directly.

# Example: NetEase Cloud Music search API
import requests

name = "Lemon"  # example song name to search for

r = requests.get("https://music.163.com/api/search/get/web",
                 params={"s": name, "offset": "0", "limit": "20", "type": "1"})

print(r.json())
# Successfully outputs search results:
# {'result': {'songs': [{'id': xxx, 'name': 'xxx', 'artists': [xxx], ...}, ...], 'songCount': 300}, 'code': 200}

print(r.json()['result']['songs'][0]['id'])  # Get the ID of the first song

Common Anti-Scraping Techniques and Countermeasures

User-Agent / Referer Checks

User-Agent and Referer in HTTP headers can be used to detect bots.
If you immediately get a 403 error when accessing a site, it might be due to UA detection.
The fix is simple: modify the UA and Referer in the headers of your requests.

requests.get("https://www.example.com/",
             headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
                      "Referer": "https://www.example.com/"})

IP Rate Limiting

Repeated access from the same IP in a short time may trigger anti-bot systems.
Solution: use proxies to hide your real IP. When scraping large amounts of data, rotate through multiple IPs.

requests.get("https://www.example.com/", proxies={"https": "http://x.x.x.x:1080"})

You can purchase proxy pool services online, such as https://www.abuyun.com/ or https://http.zhimaruanjian.com/.
I’ve also found a cheaper and stable source of IPs—using Tencent Cloud Functions as a relay to obtain multiple IPs. Details here: https://blog.lyc8503.net/post/sfc-proxy-pool/
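As a minimal illustration of rotating through a proxy pool (the proxy addresses below are placeholders to be replaced with the ones from your proxy provider), something like the following spreads requests across several exit IPs:

import itertools
import requests

# Placeholder proxy addresses -- substitute the IPs supplied by your proxy service
PROXIES = [
    {"http": "http://1.2.3.4:1080", "https": "http://1.2.3.4:1080"},
    {"http": "http://5.6.7.8:1080", "https": "http://5.6.7.8:1080"},
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url, **kwargs):
    # Use the next proxy for each request so no single IP gets hammered
    return requests.get(url, proxies=next(proxy_cycle), timeout=10, **kwargs)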

Account Rate Limiting

When accessing data that requires login, using the same account’s cookie repeatedly in a short time may also be flagged.
Solutions: register multiple accounts, save their cookies, and rotate usage. Alternatively, look for non-login data sources.
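A rough sketch of the rotation idea, assuming the cookies for each account have already been saved to disk (the file names and layout here are made up for illustration):

import itertools
import json
import requests

# Load previously saved cookies, one JSON file per account (hypothetical layout)
accounts = [json.load(open(f"cookies_{i}.json")) for i in range(3)]
account_cycle = itertools.cycle(accounts)

def fetch_with_login(url):
    session = requests.Session()
    session.cookies.update(next(account_cycle))  # switch to the next account on every call
    return session.get(url, timeout=10)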

CAPTCHA

Solutions:

  1. Simple CAPTCHAs can be preprocessed and then recognized with OCR (see the OCR sketch after this list).

  2. Slider CAPTCHAs can be solved using CV to detect the gap and simulate dragging.

    # Example: using OpenCV to locate the gap in Tencent's slider CAPTCHA
    import cv2
    import numpy as np

    def find_gap(target, template):  # target: CAPTCHA background image, template: the slider piece
        target = cv2.cvtColor(target, cv2.COLOR_BGR2GRAY)
        target = abs(255 - target)
        result = cv2.matchTemplate(target, template, cv2.TM_CCOEFF_NORMED)
        x, y = np.unravel_index(result.argmax(), result.shape)
        return x, y
  3. Ultimate solution: use human-powered CAPTCHA-solving services.
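For the OCR case in point 1, here is a minimal sketch using pytesseract and Pillow (the binarization threshold is an arbitrary example value and usually needs tuning per CAPTCHA style):

import pytesseract
from PIL import Image

def recognize_captcha(path):
    img = Image.open(path).convert("L")                 # convert to grayscale
    img = img.point(lambda p: 255 if p > 140 else 0)    # crude binarization to drop background noise
    return pytesseract.image_to_string(img).strip()     # run OCR on the cleaned image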

JavaScript Encryption

Some websites add a “signature” (e.g., a sign parameter) to requests or encrypt responses.
The JavaScript code responsible is often heavily obfuscated.
Solutions:

  • Use selenium to run a full browser. Easy to implement but slow, resource-heavy. Suitable for small-scale scraping.
  • Use PyExecJS or similar to run the encryption JavaScript directly from Python. Good performance and moderate complexity, but environment setup can be tricky, and it is not ideal for heavily obfuscated or hard-to-extract JS (a small sketch follows below).
  • Manually analyze the JS and rewrite the encryption logic in Python. Best performance and clarity, but requires JavaScript knowledge and is time-consuming.

Recommended tools: selenium / Chrome DevTools
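A rough illustration of the PyExecJS approach (it needs a local JS runtime such as Node.js; the sign function below is a made-up stand-in, not any real site's code):

import execjs

# Hypothetical signing function lifted from a site's JS; real-world code is usually much longer
JS_SOURCE = """
function sign(query) {
    var hash = 0;
    for (var i = 0; i < query.length; i++) {
        hash = (hash * 31 + query.charCodeAt(i)) % 1000000007;
    }
    return hash.toString(16);
}
"""

ctx = execjs.compile(JS_SOURCE)
print(ctx.call("sign", "q=python&page=1"))  # run the JS in Python and get the signature back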

Mobile App Encryption

Many websites have companion Android/iOS apps, which may have weaker anti-scraping measures than their web counterparts.
When the web version is hard to scrape, try accessing data via the app’s API.
Recommended mobile traffic capture tools: Fiddler (requires PC) or HttpCanary https://apkpure.com/httpcanary-%E2%80%94-http-sniffer-capture-analysis/com.guoshi.httpcanary
Note: App requests may also include signatures or encryption, requiring reverse engineering.
Recommended Android tools:

HTML Obfuscation

Various techniques exist and can be complex, though not very common. Examples:

  • Custom fonts: the server returns garbled text that only renders correctly with a site-specific font in the browser (a toy decoding example follows after this list).
  • CSS repositioning: actual data is “321” but displayed as “123” by rearranging elements via CSS.
  • Splitting images into multiple pieces (“puzzle”) and reassembling them on the client side.
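For the custom-font trick, the usual countermeasure is to download the web font, recover the glyph-to-character mapping (by hand or with fontTools), and translate the text back. A toy sketch, assuming the mapping has already been recovered manually (the codepoints below are invented):

# Hypothetical mapping from the site's private-use codepoints to real characters
FONT_MAP = {"\ue93f": "0", "\ue7a2": "1", "\ue8c1": "2"}  # ...and so on for every glyph

def decode_obfuscated(text):
    return "".join(FONT_MAP.get(ch, ch) for ch in text)

print(decode_obfuscated("\ue7a2\ue8c1\ue93f"))  # -> "120"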

Practical Examples

Weibo Scraper

After capturing traffic, we find Weibo’s search request is simple:
https://s.weibo.com/weibo?q={keyword}&page={i}
The response is plain HTML (not dynamically loaded), so parsing with bs4 is sufficient.

import json
import logging
import re
import time

import html2text
import requests
from bs4 import BeautifulSoup

# Fetch one page of search results (keyword, i and save_path are defined by the surrounding script)
r = requests.get("https://s.weibo.com/weibo", params={"q": keyword, "page": i})

soup = BeautifulSoup(r.text, "html.parser")
feeds = soup.find_all("div", attrs={"class": "card-wrap", "action-type": "feed_list_item"})

for j in feeds:
    if "mid" in j.attrs:
        try:
            logging.info("keyword: " + keyword + ", mid: " + j.attrs['mid'])
            content = j.find_all("p", attrs={"node-type": "feed_list_content_full"})
            if len(content) == 0:
                content = j.find_all("p", attrs={"node-type": "feed_list_content"})

            content_text = html2text.html2text(str(content[0])).strip().replace("\n\n", "\n")

            nick_name = j.find_all("p", attrs={"node-type": "feed_list_content"})[0].attrs['nick-name']
            uid = j.find_all("a", attrs={"nick-name": nick_name})[0].attrs['href']

            time_str = j.find_all("a", attrs={"suda-data": re.compile(".*wb_time.*")})[0].string.strip()

            info = {"keyword": keyword, "content": content_text, "mid": j.attrs['mid'], "user": nick_name,
                    "uid": uid, "time": time_str, "timestamp": int(time.time())}

            with open(save_path + j.attrs['mid'] + ".json", "w") as f:
                json.dump(info, f)
        except Exception as e:
            logging.error("failed: " + j.attrs['mid'] + ", " + str(e))
            # raise e
    else:
        logging.warning("no mid found, discard")

Nanjing University Health Check-in Automation

Use HttpCanary to capture mobile app traffic.

# "session" is a requests.Session that already holds login cookies; "location" is the check-in location string
r = session.get('https://ehallapp.nju.edu.cn/xgfw/sys/yqfxmrjkdkappnju/apply/getApplyInfoList.do')

dk_info = r.json()['data'][0]
if dk_info['TBZT'] == "0":
    logging.info("Attempting check-in...")
    wid = dk_info['WID']
    data = "?WID={}&IS_TWZC=1&CURR_LOCATION={}&JRSKMYS=1&IS_HAS_JKQK=1&JZRJRSKMYS=1".format(
        wid, location)
    r = session.get("https://ehallapp.nju.edu.cn/xgfw/sys/yqfxmrjkdkappnju/apply/saveApplyInfos.do" + data)
    logging.info("getApplyInfoList.do " + r.text)

Scraping QQ Zone Status Updates

  1. Use Chrome DevTools to capture requests.

We find a mysterious g_tk parameter—without it, the API returns nothing. Other parameters are understandable.
By tracing the source code and searching for g_tk, we find it’s generated by this function:

// Unimportant parts omitted
QZONE.FrontPage.getACSRFToken = function(url) {
    url = QZFL.util.URI(url);
    var skey;
    if (url) {
        if (url.host && url.host.indexOf("qzone.qq.com") > 0) {
            skey = QZFL.cookie.get("p_skey");
        } else {
            // ......
        }
        // ......
    }
    var hash = 5381;
    for (var i = 0, len = skey.length; i < len; ++i) {
        hash += (hash << 5) + skey.charAt(i).charCodeAt();
    }
    return hash & 2147483647;
};

We can directly rewrite this in Python:

def get_gtk(login_cookie):
    p_skey = login_cookie['p_skey']
    h = 5381
    for i in p_skey:
        h += (h << 5) + ord(i)
    return h & 2147483647

After constructing the request with g_tk, the server returns data in JSONP format, which can be parsed directly.
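JSONP responses are just JSON wrapped in a callback call (e.g. _Callback({...});), so they can be unwrapped with a small helper before parsing; a sketch (the callback name varies by endpoint):

import json
import re

def parse_jsonp(text):
    # Strip the "callbackName( ... );" wrapper and parse the JSON payload inside
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text, re.S)
    return json.loads(match.group(1))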

  2. Automatic Login and Cookie Retrieval

Cookies for the web version of QQ Zone expire quickly, so long-term scraping requires automatic login.
Since the login flow only runs occasionally and is complex to reverse engineer, we use selenium for it.
Key issues to handle:

  • Login happens in a separate iframe—remember to call switch_to.frame('login_frame') in code (see the sketch below).
  • Tencent’s slider CAPTCHA.
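A minimal sketch of the login-iframe handling (selenium 3-style calls to match the snippet below; the element IDs are the ones the QQ login page used at the time and may have changed):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://qzone.qq.com/")

driver.switch_to.frame("login_frame")                  # the login form lives in its own iframe
driver.find_element_by_id("switcher_plogin").click()   # switch to account/password login
driver.find_element_by_id("u").send_keys("your_qq_number")
driver.find_element_by_id("p").send_keys("your_password")
driver.find_element_by_id("login_button").click()
# ...then handle the slider CAPTCHA as shown below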

Solving the slider CAPTCHA using CV:

import time

import cv2
import numpy as np
from selenium.webdriver import ActionChains

# Detect the gap position (target / template are the CAPTCHA background and slider-piece images, loaded elsewhere)
target = cv2.cvtColor(target, cv2.COLOR_BGR2GRAY)
target = abs(255 - target)
result = cv2.matchTemplate(target, template, cv2.TM_CCOEFF_NORMED)
x, y = np.unravel_index(result.argmax(), result.shape)

# Simulate the mouse drag ("offset" is the drag distance derived from the match position above)
ActionChains(driver).click_and_hold(on_element=driver.find_element_by_id('tcaptcha_drag_thumb')).perform()
time.sleep(0.2)
ActionChains(driver).move_by_offset(xoffset=offset, yoffset=0).perform()
time.sleep(0.2)
ActionChains(driver).release(on_element=driver.find_element_by_id('tcaptcha_drag_thumb')).perform()

Gaokao.cn (College Entrance Exam) Scraper https://www.gaokao.cn/

Capturing traffic reveals the API for historical admission scores:

https://api.eol.cn/web/api/?local_batch_id=51&local_province_id=14&local_type_id=1&page=1&school_id=111&size=10&special_group=&uri=apidata/api/gk/score/special&year=2020&signsafe=69d89bdf5ca94281643ef5a6a32a2dd4

The server validates the signsafe signature and encrypts the response:

{"code":"0000","message":"success","data":{"method":"aes-256-cbc","text":"eab8325abc5a1440b7708431e83f79ace……"},"location":"","encrydata":""}

Since large-scale data scraping is needed, selenium is not ideal. We reverse-engineer the JavaScript instead.
Using Chrome DevTools, we locate the JS files and search for signsafe to find the signature generation code.

return (
(h = "D23ABC@#56"),
(p = ""),
e.endsWith(".json") ||
e.endsWith(".txt") ||
((m =
e +
(0 < Object.keys(u).length
? "?" +
(function (l) {
return Object.keys(l)
.sort()
.map(function (e) {
var a = l[e];
return (
("keyword" !== e && "ranktype" !== e) ||
(a = decodeURI(decodeURI(l[e]))),
""
.concat(e, "=")
.concat(void 0 === a ? "" : a)
);
})
.join("&");
})(u)
: "")),
(g = void 0),
(g = (t = {
SIGN: h,
str: m.replace(/^\/|https?:\/\/\//, "")
}).SIGN),
(t = t.str),
(g = r.a.HmacSHA1(r.a.enc.Utf8.parse(t), g)),
(g = r.a.enc.Base64.stringify(g).toString()),
(p = c()(g)),
(u.signsafe = p),
s.find(function (l) {
return l === u.uri;
}) || (e = m + "&signsafe=" + p)),
l.abrupt(
"return",
i()({
url: e,
method: a,
timeout: n,
data: (function (l, e) {
var a,
u = {};
if (
(Object.keys(e)
.sort()
.forEach(function (l) {
return (u[l] = Array.isArray(e[l])
? e[l].toString()
: e[l]);
}),
"get" === l)
)
return JSON.stringify(u);
for (a in u)
("elective" !== a && "vote" !== a) ||
"" == u[a] ||
(u[a] =
-1 == u[a].indexOf(",")
? u[a].split(" ")
: u[a].split(","));
return u;
})(
a,
Object(b.a)(
Object(b.a)({}, u),
{},
{ signsafe: p }
)
)
})
.then(function (l) {
return l.data;
})
.catch(function () {
return {};
})
.then(function (l) {
var e, a, t, b, n;
return (
null != l &&
null !== (a = l.data) &&
void 0 !== a &&
a.text &&
(l.data =
((n = (e = {
iv: u.uri,
text: l.data.text,
SIGN: h
}).iv),
(a = e.text),
(e = e.SIGN),
(e = r.a
.PBKDF2(e, "secret", {
keySize: 8,
iterations: 1e3,
hasher: r.a.algo.SHA256
})
.toString()),
(n = r.a
.PBKDF2(n, "secret", {
keySize: 4,
iterations: 1e3,
hasher: r.a.algo.SHA256
})
.toString()),
(a = r.a.lib.CipherParams.create({
ciphertext: r.a.enc.Hex.parse(a)
})),
(n = r.a.AES.decrypt(
a,
r.a.enc.Hex.parse(e),
{ iv: r.a.enc.Hex.parse(n) }
)),
JSON.parse(n.toString(r.a.enc.Utf8)))),
v &&
((t = o),
(b = l),
null !== (n = window.apiConfig) &&
void 0 !== n &&
null !== (n = n.filterCacheList) &&
void 0 !== n &&
n.length
? window.apiConfig.filterCacheList.forEach(
function (l) {
new RegExp(l).test(t) || d.set(t, b);
}
)
: d.set(t, b)),
l
);
})
)
);

We’ve found the relevant request-handling code.
It’s clearly obfuscated, with many strange patterns. We need to carefully trace the logic.
From the calls to HmacSHA1 and enc.Base64.stringify (and their surrounding context), we can see that HMAC-SHA1 and Base64 are used for signing.
The result is then passed through one more function, c().
Stepping through in the Chrome debugger reveals that c() is actually MD5.
We rewrite the signing logic in Python:

import base64
import hmac
from hashlib import sha1, md5

KEY = "D23ABC@#56".encode("utf-8")


def hash_hmac(code, key):
    hmac_code = hmac.new(key, code.encode("utf-8"), sha1).digest()
    return base64.b64encode(hmac_code).decode()


def get_sign(url):
    return md5(hash_hmac(url, KEY).encode("utf-8")).hexdigest()

For decryption (the .then handler that processes l.data.text), we can see PBKDF2 and AES are used.
Again, using breakpoints helps identify the actual meaning of obfuscated variables.
The decryption logic can be rewritten in Python as:

from hashlib import pbkdf2_hmac
from Crypto.Cipher import AES

KEY = "D23ABC@#56".encode("utf-8")

def decrypt_response(text, uri, password=pbkdf2_hmac("sha256", KEY, b"secret", 1000, 32)):
    unpad = lambda s: s[:-ord(s[len(s)-1:])]  # strip PKCS#7 padding
    iv = pbkdf2_hmac("sha256", uri.encode("utf-8"), b"secret", 1000, 16)
    return unpad(AES.new(password, AES.MODE_CBC, iv).decrypt(bytes.fromhex(text))).decode("utf-8")

Once encryption and decryption are implemented, we can freely construct requests to scrape data.
This API doesn’t check login status—only rate-limits by IP.
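A rough sketch of how the pieces fit together (the exact string passed to get_sign and the way parameters are serialized are assumptions inferred from the obfuscated JS and should be verified; the parameter values simply mirror the sample request above):

import requests

API = "https://api.eol.cn/web/api/"
params = {
    "local_batch_id": 51, "local_province_id": 14, "local_type_id": 1,
    "page": 1, "school_id": 111, "size": 10, "special_group": "",
    "uri": "apidata/api/gk/score/special", "year": 2020,
}

# The JS signs the URL together with its alphabetically sorted query string (assumed here)
query = "&".join("{}={}".format(k, params[k]) for k in sorted(params))
params["signsafe"] = get_sign(API + "?" + query)

r = requests.get(API, params=params)
print(decrypt_response(r.json()["data"]["text"], params["uri"]))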

Since we need to scrape a large amount of data, we also need an IP pool for speed.
We can integrate third-party proxy services or use the Tencent Cloud Function proxy mentioned earlier.

This article is licensed under the CC BY-NC-SA 4.0 license.

Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/python-crawler/
If this article was helpful or interesting to you, consider buying me a coffee ¬_¬
Feel free to comment in English below o/