

A (Fantastic) Tale of HomeLab Data Corruption and Recovery

 

This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.

Not long ago, Bob’s HomeLab had its system disk fail, and I was casually spectating in the group chat as he recovered his data. I thought to myself: my HomeLab server has extremely robust backups—there’s no way I could ever have such an issue (I really shouldn’t have jinxed it).

Then… as the title suggests, I crashed too.

The Beginning — A Server Dream(?)

On the night of April 19th, I had a nightmare. In a hazy half-sleep, I dreamt that my server had crashed, and all my services went down and became inaccessible. At the time, I didn’t think much of it.

On April 20th, I was traveling back to Nanjing University by high-speed rail and didn’t check my server all day. Everything seemed fine… until I returned to my dorm after dinner. I opened my laptop and was greeted by a flood of monitoring alerts: my HomeLab had actually gone down!

And the outage timestamp? Exactly 01:33 AM on April 20th—the very moment I was asleep (and dreaming)… Looks like the server sent me a dream to come fix it ASAP.

Known Minor Issue

Although I was in Nanjing, I had solid infrastructure: I immediately connected via PiKVM to check the server status. The PVE host reported a general protection fault, and the call stack was full of ZFS-related functions. The entire system had frozen, so I resorted to Alt+SysRQ+REISUB to force a reboot.

This was actually a known issue: ever since I swapped in a shady 10900 ES CPU a few months ago, the system would randomly “miscalculate” under prolonged high load, causing all sorts of mysterious problems. I hadn’t gotten around to replacing the hardware yet. My current observation was that the issue correlated with temperature—the CPU would start glitching at around 70–80°C, which is clearly not normal. As a temporary workaround, I’d lazily limited CPU power and frequency. (Since debugging is tedious and buying another CPU feels wasteful, I plan to upgrade the entire platform eventually—but I’m still procrastinating waiting for the right hardware.)

But recently, the weather got hotter, and my weekly heavy backup job runs every Sunday at 1:00 AM. Checking Netdata’s history, I found the CPU temperature had spiked to 75°C right before the crash… and there we go again.

I took the lazy route once more—capped CPU power to ~45W—and the issue seemed to vanish again.
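For reference, one common way to apply such a cap on a Linux host is the intel_rapl powercap interface in sysfs. The sketch below is illustrative only (sysfs paths and constraint indices differ between machines), not necessarily the exact mechanism I used:

# Illustrative sketch: cap the CPU package power via the intel_rapl powercap driver.
# Check your own sysfs tree first; paths and constraint indices vary by machine.
RAPL_PKG = "/sys/class/powercap/intel-rapl:0"   # CPU package 0
LIMIT_WATTS = 45

# constraint_0 is normally the long-term limit (PL1); values are in microwatts.
with open(f"{RAPL_PKG}/constraint_0_power_limit_uw", "w") as f:
    f.write(str(LIMIT_WATTS * 1_000_000))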

Minor issue… right? — Windows VM Corruption

I manually restarted all VMs and containers, and everything seemed back online. So I retried the failed backup job.

But it failed with an I/O Error while reading the zvol of one of my Windows VMs. Running zpool status gave me this:

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
......
errors: Permanent errors have been detected in the following files:

        rpool/pve/vm-204-disk-0:<0x1>

Although the Windows VM still booted and appeared normal, ZFS reported an unrecoverable error on its system disk zvol and demanded I restore it from backup. I checked the auto-snapshots—all of them for this zvol were also corrupted and unusable.

“Not a big deal,” I thought at the time. “Probably just a CPU miscalculation or an unsafe shutdown.”

The last automated backup of this VM was from April 13th. I checked the data—almost nothing important had been created in the past week. So I went ahead and restored the April 13th backup directly from the PVE web interface.

Uh-oh — Local Backup Also Corrupted

PVE’s restore process works by first deleting the current VM, then restoring from the backup.

But… halfway through the restore, another I/O Error occurred! Not only was the current VM gone, but the only local backup was also corrupted. And because I was saving space, I only kept the most recent backup.

Now we’re in trouble. Checking zpool status again, I found the backup pool was also reporting errors:

  pool: hdd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
......
errors: Permanent errors have been detected in the following files:

        /hdd/pool_backup/dump/vzdump-qemu-204-2025_04_13-01_27_32.vma.gz

I didn’t expect both the original and the backup to be corrupted at the same time: the VM disk was on SSD (in one ZFS pool), and it was backed up weekly to HDD (a separate ZFS pool). Before this, I’d checked the backup logs from April 13th—the write had completed successfully.

There was no logical reason why an unsafe shutdown on April 20th should affect a backup file created on April 13th and untouched since.

I immediately initiated a scrub on both ZFS pools. After a full day, the scrub finished—and revealed that only these two files (the VM disk and its backup) were corrupted. What are the odds?

In hindsight, the only plausible explanation is this: during the April 13th backup job, the Windows VM happened to be backed up later in the queue, by which time the CPU had already been running hot and unstable for a while. It likely wrote corrupted data into ZFS, but no panic or error was triggered, and since no one read the backup file afterward, the issue went unnoticed. At other times, the system wasn’t under sustained high load, so no other files were affected. It wasn’t until April 20th’s kernel panic that I discovered the corruption.

Total meltdown! — Cloud Backup on Strike

With both local copies gone, I turned to my remote backup.

But here’s the kicker: my OneDrive was nearly full, and since all data had at least two local copies, I figured I could temporarily disable the restic backup script. The last successful backup was on January 12th…

Time to Rescue the Data…

The Windows VM mostly ran my QQ and WeChat clients. Restoring a three-month-old backup would only lose some chat history—no big deal. But I still hated the idea of losing data without a fight, so I dove into recovery.

ddrescue

The original VM zvol and all snapshots had already been destroyed during the failed restore attempt, and dozens of GB had since been written over them—no hope of recovery there.

So I turned to the only remaining piece: the corrupted backup file vzdump-qemu-204-2025_04_13-01_27_32.vma.gz.

First, I used ddrescue to extract as much data as possible:

ddrescue /hdd/pool_backup/dump/vzdump-qemu-204-2025_04_13-01_27_32.vma.gz 204.vma.gz 204.log

The result showed a 128 KB bad block in the gzip file (reads return I/O Error):

# Mapfile. Created by GNU ddrescue version 1.27
# Command line: ddrescue /hdd/pool_backup/dump/vzdump-qemu-204-2025_04_13-01_27_32.vma.gz 204.vma.gz 204.log
# Start time: 2025-04-22 11:46:25
# Current time: 2025-04-22 11:57:47
# Finished
# current_pos current_status current_pass
0x14960BFC00 + 1
# pos size status
0x00000000 0x14960A0000 +
0x14960A0000 0x00020000 -
0x14960C0000 0xBDD6A72C4 +
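For the record, the mapfile is easy to parse if you want to total up the damage instead of eyeballing it: lines starting with # are comments, the first data line is the current position/status, and every following line is a pos/size/status triple where '-' marks an unreadable region. A quick sketch:

# Sum the sizes of the unreadable ('-') regions in a GNU ddrescue mapfile.
def bad_bytes(mapfile_path):
    rows = [line.split() for line in open(mapfile_path)
            if line.strip() and not line.startswith("#")]
    # rows[0] is the "current_pos current_status current_pass" line;
    # the rest are "pos size status" entries.
    return sum(int(size, 16) for _pos, size, status in rows[1:] if status == "-")

print(bad_bytes("204.log"))   # prints 131072, i.e. the single 128 KB hole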

GZIP Repair

So we’ve got 128 KB of contiguous data corrupted in ZFS, and scrub can’t fix it—this data is likely gone for good.

Now I needed to extract as much as possible from a gz file with a 128 KB “hole.”

There’s not much online about repairing such gz files, but I found an archived version of a guide from gzip.org (no idea why it’s 404 now).

According to the guide, I could first decompress the part before the bad block losslessly.

gzip -dc 204.vma.gz > part1.vma

This would fail at the corrupted block. Next, I needed to try decompressing the data after the hole.

The gzip format consists of (a minimal header-parsing sketch follows this list):

  • a 10-byte header (magic number 1f 8b, compression method, flags, timestamp, extra flags, OS ID)
  • optional extra headers
  • a body with DEFLATE-compressed payload
  • an 8-byte trailer (CRC-32 and original uncompressed size)
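As a concrete reference for those 10 fixed header bytes (this is just RFC 1952, nothing specific to my file), a minimal parse looks like this:

import struct

# Decode the fixed 10-byte gzip header described above (RFC 1952).
def gzip_header(path):
    with open(path, "rb") as f:
        magic, method, flags, mtime, xfl, os_id = struct.unpack("<HBBIBB", f.read(10))
    assert magic == 0x8B1F and method == 8      # 1f 8b magic, method 8 = DEFLATE
    return {"flags": flags, "mtime": mtime, "extra_flags": xfl, "os": os_id}

print(gzip_header("204.vma.gz"))

The brute-force search further down fabricates exactly such a 10-byte header with printf.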

As per the document, I needed to find the start of the next DEFLATE block after the bad block, prepend a valid gzip header, and try decompressing.

But here’s the catch: DEFLATE data is a bitstream, so block boundaries aren’t necessarily byte-aligned.

I really didn’t feel like writing bit-shifting code, so I took a shortcut: I brute-forced byte-aligned offsets after 0x14960C0000 to find the next valid block.

INPUT_FILE="204.vma.gz"
INITIAL_SKIP=88416714752   # byte offset right after the bad block (0x14960C0000)
MAX_ATTEMPTS=5000

for ((i = 0; i < MAX_ATTEMPTS; i++)); do
    CURRENT_SKIP=$((INITIAL_SKIP + i))
    echo $i
    (
        printf '\x1f\x8b\x08\x00\x84\xa2\xfa\x67\x00\x03' # a valid 10-byte gzip header
        dd if="$INPUT_FILE" bs=1M iflag=skip_bytes,fullblock skip="$CURRENT_SKIP" status=none 2>/dev/null
    ) | gzip -t   # watch for the offset where the "invalid compressed data" errors stop
done

Running this, I found that at i=700, gzip -t no longer reported invalid compressed data.

So I skipped ahead 700 bytes and decompressed the rest. The end would have CRC errors, but that’s fine—we just want the data.

(printf '\x1f\x8b\x08\x00\x84\xa2\xfa\x67\x00\x03'; dd if=204.vma.gz bs=1M iflag=skip_bytes skip=88416715452 status=progress) | gzip -dc > part2.vma

If I’d searched by bit, I might’ve found the block earlier, but after losing 128 KB, who cares about 700 more bytes?

VMA Repair

Now I had two halves of the VMA file with an unknown-sized hole in between. Time to stitch it back together.

After some research, I found that VMA is a custom format invented by PVE, with a detailed spec.

Long story short: the VMA file starts with a header (config and disk definitions), followed by a series of VMA Extents containing actual disk data.

Each VMA Extent header lists the cluster numbers (64 KB each) stored in that extent, followed by the raw cluster data.
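To make that concrete, here is a rough sketch of decoding a single 512-byte extent header, based on my reading of the spec (treat the exact field offsets as something to double-check against the spec rather than gospel):

import struct

# Rough sketch: decode one 512-byte VMA extent header (all integers big-endian).
# After the magic come reserved bytes, a block count, the archive UUID and an
# md5sum, followed by 59 blockinfo slots of 8 bytes each.
def parse_extent_header(buf):
    assert buf[0:4] == b"VMAE"
    block_count = struct.unpack(">H", buf[6:8])[0]   # number of 4 KB blocks stored
    clusters = []
    for off in range(40, 40 + 59 * 8, 8):
        mask, _reserved, dev_id, cluster_num = struct.unpack(">HBBI", buf[off:off + 8])
        if dev_id:                                   # skip unused slots
            clusters.append((dev_id, cluster_num))   # cluster_num counts 64 KB clusters
    return block_count, clusters

Scanning the whole file for the VMAE magic and collecting these cluster numbers is roughly what the missing-cluster check later in this post boils down to.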

Let’s inspect the ends of our two parts (VMAE is the magic number for VMA Extent Header):

# tail -c 10485760 part1.vma | grep -oab 'VMAE'
2341376:VMAE
6142976:VMAE
9944576:VMAE

# tail -c 10485760 part1.vma | xxd -s 9944576 -l 64
0097be00: 564d 4145 0000 03a0 1700 453d 6193 44e9 VMAE......E=a.D.
0097be10: 8c41 548f 9821 242e ba86 5dc6 ac40 cd80 .AT..!$...]..@..
0097be20: d1f5 a647 c2b6 d450 ffff 0002 0020 44d4 ...G...P..... D.
0097be30: ffff 0002 0020 44c4 ffff 0002 0020 44b4 ..... D...... D.

# head -c 10485760 part2.vma | grep -oab 'VMAE'
2492812:VMAE
6294412:VMAE
10096012:VMAE

# xxd -l 64 part2.vma
00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0041 ...............A
00000030: 0000 0000 0000 0000 0000 4c00 0000 0000 ..........L.....

Looks like the bad block landed in the middle of a VMA extent, splitting it in two. I decided to discard the incomplete ends of both parts (cutting part1 right before its last, truncated extent and starting part2 at its first complete VMAE header) and concatenate the remaining “complete” sections.

(dd if=part1.vma bs=1M count=137516301824 iflag=count_bytes; dd if=part2.vma bs=1M skip=2492812 iflag=skip_bytes) > try_fix.vma

The official vma extract tool checks for missing clusters and refuses to work if any are found.

I found a Python script on GitHub that extracts VMA files without checking for missing clusters (missing clusters become all-zero).

python3 vma.py try_fix.vma out

After a while, it successfully extracted the VM disk image and config files.

NTFS Check

I created a new VM using the recovered disk. Despite having some zero-filled holes, Windows booted up just fine—very resilient.

Everything looked normal inside. chkdsk reported no errors.

But to avoid future headaches, I wanted to check which files might be corrupted.

I had GPT help me write a Python script to parse the VMA file, list all the cluster numbers it contains, and find the missing ones. It returned a bunch:

Missing cluster numbers: [2114645, 2114646, 2114647, 2114677, 2114678, 2114679, 2114694, 2114695, 2114696, 2114708, 2114709, 2114710, 2114711, 2114725, 2114726, 2114727, 2114740, 2114741, 2114742, 2114743, 2114756, 2114757, 2114758, 2114759, 2114772, 2114773, 2114774, 2114775, 2114788, 2114789, 2114790, 2114791, 2114804, 2114805, 2114806, 2114807, 2114820, 2114821, 2114822, 2114823, 2114836, 2114837, 2114838, 2114839, 2114852, 2114853, 2114854, 2114855, 2114868, 2114869, 2114870, 2114871, 2114884, 2114885, 2114886, 2114900, 2114901, 2114902]

These are 64 KB cluster numbers from the start of the disk. Another script converted them to 4 KB NTFS clusters (logical clusters), then to range format:

['33804368-33804415', '33804880-33804927', '33805152-33805199', '33805376-33805439', '33805648-33805695', '33805888-33805951', '33806144-33806207', '33806400-33806463', '33806656-33806719', '33806912-33806975', '33807168-33807231', '33807424-33807487', '33807680-33807743', '33807936-33807999', '33808192-33808239', '33808448-33808495']
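The conversion is just arithmetic: each 64 KB VMA cluster covers sixteen 4 KB NTFS clusters, shifted by the partition's start offset on the disk. A minimal sketch (the partition offset below is a placeholder; read the real value for the NTFS partition from fdisk -l or parted on the loop device):

# Sketch: map 64 KB VMA cluster numbers to "start-end" ranges of 4 KB NTFS
# logical clusters, suitable for ntfscluster -c.
PART_OFFSET_BYTES = 0x7500000          # placeholder: byte offset of the NTFS partition
NTFS_CLUSTER = 4096

def to_ntfs_ranges(vma_clusters):
    lcns = set()
    for c in vma_clusters:
        start = (c * 65536 - PART_OFFSET_BYTES) // NTFS_CLUSTER
        lcns.update(range(start, start + 65536 // NTFS_CLUSTER))   # 16 LCNs per cluster
    if not lcns:
        return []
    ranges, nums = [], sorted(lcns)
    lo = prev = nums[0]
    for n in nums[1:]:
        if n != prev + 1:               # gap: close the current run
            ranges.append(f"{lo}-{prev}")
            lo = n
        prev = n
    ranges.append(f"{lo}-{prev}")
    return ranges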

Using ntfscluster from ntfs-3g, I checked which files used these clusters:

losetup -Pf drive-scsi0 --show
ntfscluster -c 33804368-33804415 /dev/loop0p3 2>/dev/null
ntfscluster -c 33804880-33804927 /dev/loop0p3 2>/dev/null
ntfscluster -c 33805152-33805199 /dev/loop0p3 2>/dev/null
......

Result? No files:

Searching for cluster range 33804368-33804415
* no inode found
Searching for cluster range 33804880-33804927
* no inode found
Searching for cluster range 33805152-33805199
* no inode found
......

I checked all ranges—every cluster was free. No files were damaged.
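For the record, checking dozens of ranges one by one gets tedious; the same ntfscluster invocation is easy to drive from a short loop, for example:

import subprocess

# Run ntfscluster over every suspect range and print only the ranges that hit a file.
DEVICE = "/dev/loop0p3"
ranges = ["33804368-33804415", "33804880-33804927"]   # ...the full list from the script above

for r in ranges:
    out = subprocess.run(["ntfscluster", "-c", r, DEVICE],
                         capture_output=True, text=True).stdout
    if "no inode found" not in out:
        print(r, out.strip())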

To double-check my script, I ran the same search on the January 12th backup:

Searching for cluster range 33804368-33804415
Inode 731838 /System Volume Information/{c784ba3d-bd06-11ef-9dec-ed26e76be199}{3808876b-c176-4e48-b7ae-04046e6cc752}/$DATA
* one inode found

Turns out this was a temporary file used by Windows VSS, which had since been deleted.

Afterword

The entire data corruption and recovery journey was a bizarre mix of bad and good luck—first, incredibly unlucky that both the primary copy and the local backup got corrupted, while the cloud backup happened to have been paused for three months. But then miraculously lucky that the corruption, after passing through three layers (GZ → VMA → NTFS), didn't completely break any of them and ended up scattered only across unused clusters—no files were damaged.

While this blog post reads smooth and straightforward, the actual process involved writing tons of scripts, hours of research, trial and error, debugging, and waiting for file transfers. It was a pretty grueling experience—let’s not do this again.

One thing’s for sure: the 3-2-1 backup rule exists for a reason. I’m turning my restic offsite backup back on—right now.

This article is licensed under the CC BY-NC-SA 4.0 license.

Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/pve-vm-data-recovery/
If this article was helpful or interesting to you, consider buying me a coffee ¬_¬
Feel free to comment in English below o/