Not long ago, Bob's HomeLab had its system disk fail, and I was casually spectating in the group chat as he recovered his data. I thought to myself: my HomeLab server has extremely robust backups, so something like that could never happen to me (I shouldn't have jinxed it).
Then… as the title suggests, I crashed too.
The Beginning — A Server Dream(?)
On the night of April 19th, I had a nightmare. In a hazy half-sleep, I dreamt that my server had crashed, and all my services went down and became inaccessible. At the time, I didn’t think much of it.
On April 20th, I was traveling back to Nanjing University by high-speed rail and didn’t check my server all day. Everything seemed fine… until I returned to my dorm after dinner. I opened my laptop and was greeted by a flood of monitoring alerts: my HomeLab had actually gone down!
And the outage timestamp? Exactly 01:33 AM on April 20th—the very moment I was asleep (and dreaming)… Looks like the server sent me a dream to come fix it ASAP.
Known Minor Issue
Although I was in Nanjing, I had solid infrastructure: I immediately connected via PiKVM to check the server status. The PVE host reported a general protection fault, and the call stack was full of ZFS-related functions. The entire system had frozen, so I resorted to Alt+SysRQ+REISUB to force a reboot.
This was actually a known issue: ever since I swapped in a shady 10900 ES CPU a few months ago, the system would randomly “miscalculate” under prolonged high load, causing all sorts of mysterious problems. I hadn’t gotten around to replacing the hardware yet. My current observation was that the issue correlated with temperature—the CPU would start glitching at around 70–80°C, which is clearly not normal. As a temporary workaround, I’d lazily limited CPU power and frequency. (Since debugging is tedious and buying another CPU feels wasteful, I plan to upgrade the entire platform eventually—but I’m still procrastinating waiting for the right hardware.)
But the weather has been getting hotter lately, and my heavy weekly backup job runs every Sunday at 1:00 AM. Checking Netdata's history, I found that the CPU temperature had spiked to 75°C right before the crash… and there we go again.
I took the lazy route once more—capped CPU power to ~45W—and the issue seemed to vanish again.
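For reference, this kind of cap can be applied from the PVE host through the Intel RAPL powercap interface, optionally together with cpupower for the frequency side. The sysfs zone path and the clock value below are illustrative and vary by machine; 45 W is the figure I actually used.

RAPL_PKG=/sys/class/powercap/intel-rapl:0                 # package-0 zone; check /sys/class/powercap on your host
echo 45000000 > "$RAPL_PKG/constraint_0_power_limit_uw"   # long-term package power limit, in microwatts (45 W)
cpupower frequency-set --max 3.6GHz                       # optionally also cap the maximum clock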
Minor issue… right? — Windows VM Corruption
I manually restarted all VMs and containers, and everything seemed back online. So I retried the failed backup job.
But it failed with an I/O Error while reading the zvol of one of my Windows VMs. Running zpool status gave me this:
pool: rpool
Although the Windows VM still booted and appeared normal, ZFS reported an unrecoverable error on its system disk zvol and demanded I restore it from backup. I checked the auto-snapshots—all of them for this zvol were also corrupted and unusable.
“Not a big deal,” I thought at the time. “Probably just a CPU miscalculation or an unsafe shutdown.”
The last automated backup of this VM was from April 13th. I checked the data—almost nothing important had been created in the past week. So I went ahead and restored the April 13th backup directly from the PVE web interface.
Uh-oh — Local Backup Also Corrupted
PVE’s restore process works by first deleting the current VM, then restoring from the backup.
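(For reference, what the web UI kicks off is roughly equivalent to running qmrestore on the host; the target storage name here is just a placeholder:)

qmrestore /hdd/pool_backup/dump/vzdump-qemu-204-2025_04_13-01_27_32.vma.gz 204 --storage local-zfs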
But… halfway through the restore, another I/O Error occurred! Not only was the current VM gone, but the only local backup was also corrupted. And because I was saving space, I only kept the most recent backup.
Now we’re in trouble. Checking zpool status again, I found the backup pool was also reporting errors:
pool: hdd
I didn’t expect both the original and the backup to be corrupted at the same time: the VM disk was on SSD (in one ZFS pool), and it was backed up weekly to HDD (a separate ZFS pool). Before this, I’d checked the backup logs from April 13th—the write had completed successfully.
There was no logical reason why an unsafe shutdown on April 20th should affect a backup file created on April 13th and untouched since.
I immediately initiated a scrub on both ZFS pools. After a full day, the scrub finished—and revealed that only these two files (the VM disk and its backup) were corrupted. What are the odds?
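Starting and watching the scrubs is just a couple of commands; zpool status -v is also what lists the files with permanent errors:

zpool scrub rpool
zpool scrub hdd
zpool status -v     # scrub progress plus the list of permanently errored files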
In hindsight, the only plausible explanation is this: during the April 13th backup job, the Windows VM happened to be backed up later in the queue, by which time the CPU had already been running hot and unstable for a while. It likely wrote corrupted data into ZFS, but no panic or error was triggered, and since no one read the backup file afterward, the issue went unnoticed. At other times, the system wasn’t under sustained high load, so no other files were affected. It wasn’t until April 20th’s kernel panic that I discovered the corruption.
Total meltdown! — Cloud Backup on Strike
With both local copies gone, I turned to my remote backup.
But here's the kicker: my OneDrive had been nearly full, and since all of the data had at least two local copies anyway, I had figured it was fine to temporarily disable the restic backup script. The last successful cloud backup was from January 12th…
Time to Rescue the Data…
The Windows VM mostly ran my QQ and WeChat clients. Restoring a three-month-old backup would only lose some chat history—no big deal. But I still hated the idea of losing data without a fight, so I dove into recovery.
ddrescue
The original VM zvol and all snapshots had already been destroyed during the failed restore attempt, and dozens of GB had since been written over them—no hope of recovery there.
So I turned to the only remaining piece: the corrupted backup file vzdump-qemu-204-2025_04_13-01_27_32.vma.gz.
First, I used ddrescue to extract as much data as possible:
ddrescue /hdd/pool_backup/dump/vzdump-qemu-204-2025_04_13-01_27_32.vma.gz 204.vma.gz 204.log
The result showed a 128 KB bad block in the gzip file (reads return I/O Error):
# Mapfile. Created by GNU ddrescue version 1.27
GZIP Repair
So we’ve got 128 KB of contiguous data corrupted in ZFS, and scrub can’t fix it—this data is likely gone for good.
Now I needed to extract as much as possible from a gz file with a 128 KB “hole.”
There’s not much online about repairing such gz files, but I found an archived version of a guide from gzip.org (no idea why it’s 404 now).
According to the guide, I could first decompress the part before the bad block losslessly.
gzip -dc 204.vma.gz > part1.vma
This would fail at the corrupted block. Next, I needed to try decompressing the data after the hole.
The gzip format consists of:
- a 10-byte header (magic number 1f 8b, compression method, flags, timestamp, OS ID)
- optional extra headers
- a body with DEFLATE-compressed payload
- an 8-byte trailer (CRC-32 and original size)
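A quick way to see these fields on the archive itself (and to grab a known-good header to prepend later) is to dump its first 10 bytes; the annotations below follow RFC 1952:

head -c 10 204.vma.gz | xxd
# bytes 0-1: 1f 8b   magic number
# byte  2:   compression method (08 = DEFLATE)
# byte  3:   flags (whether an extra field, name, or comment follows)
# bytes 4-7: mtime, little-endian Unix timestamp
# byte  8:   extra flags (compression level hint)
# byte  9:   OS id (03 = Unix)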
As per the document, I needed to find the start of the next DEFLATE block after the bad block, prepend a valid gzip header, and try decompressing.
But here’s the catch: DEFLATE data is a bitstream, so block boundaries aren’t necessarily byte-aligned.
I really didn’t feel like writing bit-shifting code, so I took a shortcut: I brute-forced byte-aligned offsets after 0x14960C0000 to find the next valid block.
INPUT_FILE="204.vma.gz"
Running this, I found that at i=700, gzip -t no longer failed.
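The script itself is nothing fancy. A sketch of the approach looks roughly like this; the 8 MB test window, the search range, and the "treat a format-violated complaint as a miss" check are choices made for this sketch, not necessarily what the original script did:

INPUT_FILE="204.vma.gz"
BASE=$((0x14960C0000))                               # first readable byte after the hole ddrescue reported
HEADER='\x1f\x8b\x08\x00\x84\xa2\xfa\x67\x00\x03'    # a valid gzip header to glue in front of the candidate data

for i in $(seq 0 65535); do
    # Prepend a header, feed a few MB starting at the candidate offset to `gzip -t`,
    # and treat "format violated" as a miss. Truncation/CRC complaints are expected
    # even on a hit, because the test window is cut off and has no valid trailer.
    err=$( (printf "$HEADER"; dd if="$INPUT_FILE" bs=1M iflag=skip_bytes,count_bytes \
            skip=$((BASE + i)) count=$((8 * 1024 * 1024)) status=none) | gzip -t 2>&1 )
    if ! grep -q 'format violated' <<< "$err"; then
        echo "possible DEFLATE block boundary at BASE + $i"
        break
    fi
done

A hit found this way can still be a false positive, so it is worth verifying it by actually decompressing from that offset before committing to it.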
So I skipped 700 extra bytes past the hole (88,416,715,452 = 0x14960C0000 + 700) and decompressed the rest. The end would still fail its CRC check, but that's fine; we just want the data.
(printf '\x1f\x8b\x08\x00\x84\xa2\xfa\x67\x00\x03'; dd if=204.vma.gz bs=1M iflag=skip_bytes skip=88416715452 status=progress) | gzip -dc > part2.vma
If I’d searched by bit, I might’ve found the block earlier, but after losing 128 KB, who cares about 700 more bytes?
VMA Repair
Now I had two halves of the VMA file with an unknown-sized hole in between. Time to stitch it back together.
After some research, I found that VMA is a custom format invented by PVE, with a detailed spec.
Long story short: the VMA file starts with a header (config and disk definitions), followed by a series of VMA Extents containing actual disk data.
Each VMA Extent header lists the cluster numbers (64 KB each) stored in that extent, followed by the raw cluster data.
Let’s inspect the ends of our two parts (VMAE is the magic number for VMA Extent Header):
# tail -c 10485760 part1.vma | grep -oab 'VMAE'
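The matching check on the other side is the same idea, e.g. scanning the first few MB of part2 for the next extent header:

head -c 10485760 part2.vma | grep -oab 'VMAE'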
Looks like the bad block split a VMAE header in half. I decided to discard the incomplete ends of both parts and concatenate the “complete” sections.
(dd if=part1.vma bs=1M count=137516301824 iflag=count_bytes; dd if=part2.vma bs=1M skip=2492812 iflag=skip_bytes) > try_fix.vma
The official vma extract tool checks for missing clusters and refuses to work if any are found.
I found a Python script on GitHub that extracts VMA files without checking for missing clusters (missing clusters become all-zero).
python3 vma.py try_fix.vma out
After a while, it successfully extracted the VM disk image and config files.
NTFS Check
I created a new VM using the recovered disk. Despite having some zero-filled holes, Windows booted up just fine—very resilient.
Everything looked normal inside. chkdsk reported no errors.
But to avoid future headaches, I wanted to check which files might be corrupted.
I had GPT help me write a Python script that parses the VMA file, lists all the cluster numbers it contains, and finds the missing ones. It returned a bunch:
Missing cluster numbers: [2114645, 2114646, 2114647, 2114677, 2114678, 2114679, 2114694, 2114695, 2114696, 2114708, 2114709, 2114710, 2114711, 2114725, 2114726, 2114727, 2114740, 2114741, 2114742, 2114743, 2114756, 2114757, 2114758, 2114759, 2114772, 2114773, 2114774, 2114775, 2114788, 2114789, 2114790, 2114791, 2114804, 2114805, 2114806, 2114807, 2114820, 2114821, 2114822, 2114823, 2114836, 2114837, 2114838, 2114839, 2114852, 2114853, 2114854, 2114855, 2114868, 2114869, 2114870, 2114871, 2114884, 2114885, 2114886, 2114900, 2114901, 2114902]
These are 64 KB cluster numbers counted from the start of the virtual disk. Another script converted them to 4 KB NTFS logical cluster numbers (each 64 KB backup cluster spans sixteen 4 KB NTFS clusters, and the NTFS partition's starting offset on the disk has to be subtracted), then collapsed them into ranges:
['33804368-33804415', '33804880-33804927', '33805152-33805199', '33805376-33805439', '33805648-33805695', '33805888-33805951', '33806144-33806207', '33806400-33806463', '33806656-33806719', '33806912-33806975', '33807168-33807231', '33807424-33807487', '33807680-33807743', '33807936-33807999', '33808192-33808239', '33808448-33808495']
Using ntfscluster from ntfs-3g, I checked which files used these clusters:
losetup -Pf drive-scsi0 --show
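The check itself boils down to a loop over the ranges with ntfscluster. Which loop partition holds the Windows system volume depends on the layout; p3 below is only a guess for a typical EFI + MSR + C: disk:

LOOP=$(losetup -Pf drive-scsi0 --show)    # e.g. /dev/loop0; -P also creates /dev/loop0p1, p2, ...

for range in 33804368-33804415 33804880-33804927 33805152-33805199; do    # ...and so on for the rest
    ntfscluster --cluster "$range" "${LOOP}p3"
done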
Result? No files:
Searching for cluster range 33804368-33804415
I checked all ranges—every cluster was free. No files were damaged.
To double-check my script, I ran the same search on the January 12th backup:
Searching for cluster range 33804368-33804415
Turns out this was a temporary file used by Windows VSS, which had since been deleted.
Afterword
The entire data corruption and recovery journey was a bizarre mix of bad and good luck—first, incredibly unlucky that both primary and backup got corrupted, and the cloud backup just happened to be full. But then miraculously lucky that the corruption, after passing through three layers (GZ → VMA → NTFS), didn’t break any layer completely and ended up scattered only in unused clusters—no files were damaged.
While this blog post reads smoothly and makes everything look straightforward, the actual process involved writing tons of scripts, hours of research, trial and error, debugging, and waiting for file transfers. It was a pretty grueling experience; let's not do this again.
One thing’s for sure: the 3-2-1 backup rule exists for a reason. I’m turning my restic offsite backup back on—right now.
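For the record, the job being turned back on is nothing exotic; something along these lines, with restic reaching OneDrive through its rclone backend (repository name and paths are illustrative):

export RESTIC_REPOSITORY="rclone:onedrive:backup/homelab"   # restic over the rclone backend
export RESTIC_PASSWORD_FILE=/root/.restic-password

restic backup /hdd/pool_backup/dump --tag pve-dump          # push the PVE dump directory offsite
restic forget --keep-weekly 8 --keep-monthly 12 --prune     # bound the history so OneDrive doesn't fill up again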
This article is licensed under the CC BY-NC-SA 4.0 license.
Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/pve-vm-data-recovery/
If this article was helpful or interesting to you, consider buying me a coffee ¬_¬
Feel free to comment in English below o/