Time Machine - QQ Homework Crawler

lyc8503

2020-06-21

This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.

I wrote a web crawler to fetch my QQ Homework history, and the code has been published on GitHub.

The reason? Homework went totally out of control during winter break and online school — all the printed assignments were scattered everywhere, and everything was a mess…

So I decided to write a crawler to automatically download and archive all the homework I’d ever submitted.

Then it hit me: if I can get my own homework, why not my classmates’ as well?

At first glance, Tencent actually does a decent job on access control — unlike some older platforms (like the post-class website I used before, whose image URLs followed predictable patterns and allowed easy access to others’ work).

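Tencent serves the images from a CDN that performs no authentication checks, but the filenames are randomly generated, so the URLs can't be guessed. Anyone who knows a URL can download the file, but only a group admin gets to see the URLs of everyone's submissions. So unless you're a group admin, you can't access other students' submissions.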
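To illustrate why random filenames are enough to keep outsiders away even without authentication, here is a small sketch comparing the two naming styles. Everything in it is hypothetical: the post does not describe how Tencent or the older site actually name their files, and the 16-byte token length is purely an assumption for the arithmetic.

```python
import secrets


def sequential_names(count, start=1):
    """Predictable names (hypothetical pattern): knowing one lets you guess the rest."""
    return [f"{n:06d}.jpg" for n in range(start, start + count)]


def random_name(nbytes=16):
    """Unpredictable name built from a random token (assumed length: 16 bytes)."""
    return secrets.token_hex(nbytes) + ".jpg"


def search_space(nbytes):
    """Number of possible names an attacker would have to try for a random token."""
    return 256 ** nbytes


if __name__ == "__main__":
    print(sequential_names(3))   # neighbours are trivial to enumerate
    print(random_name())         # different on every run
    print(search_space(16))      # 2**128 candidates for a 16-byte token
```

With sequential names, one known URL reveals its neighbours. With random tokens, the only practical way to get a URL is to be shown it, which is exactly what admin access provides.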

So I made a second version — one that logs in with admin privileges and downloads everyone’s homework…

Now, what’s the point of saving all this? Originally, I thought I could extract more insights from EXIF data (like, maybe track what phone models my classmates use). But Tencent strips all EXIF metadata during upload — probably to save space or protect privacy. So yeah, the data I scraped just ended up sitting there, unused…
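A quick way to confirm that EXIF has been stripped is to look for the APP1 "Exif" segment in a downloaded JPEG. Below is a minimal sketch using only the standard library; it is not the crawler's actual code, `has_exif` is a hypothetical helper name, and it assumes the images are JPEG files with well-formed segment headers.

```python
import struct


def has_exif(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP1 segment holding EXIF data."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # not a valid segment header, give up
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # SOS or EOI: metadata segments are over
            return False
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False
```

Running it on a photo straight from a phone and on the same photo after downloading it back from the homework page would show whether the metadata survived the upload.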

(Still, it’s wild that plain text data piled up to 20 MB. Props to my teachers and classmates — everyone really went through it during the pandemic…)

Database

This article is licensed under the CC BY-NC-SA 4.0 license.

Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/qq-homework-crawler/
If this article was helpful or interesting to you, consider buying me a coffee ¬_¬
Feel free to comment in English below o/