“Your backups are only as good as your last successful restore.”
The Discovery
It started with qbittorrent refusing to authenticate. After the Ceph Reef to Tentacle upgrade, several apps needed restoring from backups. Routine stuff—trigger the VolSync ReplicationDestination, wait for completion, scale up the app.
Except the restored data was garbage.
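A quick look inside the restored volume told the story. The command below is a reconstruction; the namespace, pod, and config path are illustrative, and it assumes the image ships basic coreutils:

```sh
# Dump the first 64 bytes of the restored qBittorrent config
kubectl exec -n media deploy/qbittorrent -- \
  od -c -N 64 /config/qBittorrent/qBittorrent.conf
# 0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
# *
# 0000100
```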
That’s not a username. That’s null bytes. The entire config file was zeroed out—the file existed, had the right size, but contained nothing but 0x00 characters.
The Pattern Emerges
Checking other apps revealed the same problem:
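The same spot-check against the other VolSync-restored apps. The namespace and config paths are placeholders (each app keeps its config in a different file); counting the non-null bytes is enough to tell a real config from a zeroed one:

```sh
# A zeroed file reports 0 bytes once the nulls are stripped
for app in sabnzbd sonarr radarr filebrowser; do
  echo "== ${app} =="
  kubectl exec -n media "deploy/${app}" -- \
    sh -c 'tr -d "\0" < /config/config.xml | wc -c'
done
```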
The common factor: all these PVCs were on ceph-filesystem storage class and had been restored via VolSync.
Understanding the Bug
CephFS handles sparse files differently than traditional filesystems. A sparse file is one where regions of null bytes aren’t actually stored on disk—they’re just metadata saying “this region is empty.”
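A quick demonstration on any Linux filesystem shows the difference between a file's logical size and what is actually allocated:

```sh
# Allocate nothing, but claim a 1 GiB logical size
truncate -s 1G sparse.img

ls -lh sparse.img   # logical size: 1.0G
du -h sparse.img    # blocks actually on disk: 0
```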
The problem: when VolSync’s Kopia mover restores files to CephFS, something in the sparse file handling chain goes wrong. Files that should contain data get their content replaced with null bytes, while maintaining their original size and metadata.
This isn’t a VolSync bug or a Kopia bug. It’s a quirk of how CephFS handles certain write patterns during restore operations. The same restore to ceph-block storage works perfectly.
The Damage Assessment
After checking all apps that used ceph-filesystem with VolSync backups:
| App | Status | Impact |
|---|---|---|
| qbittorrent | Config zeroed | Lost WebUI credentials, port settings |
| sabnzbd | Empty directory | Lost entire config, server settings |
| sonarr | Config zeroed | Minimal (uses PostgreSQL for data) |
| sonarr-uhd | Config zeroed | Minimal (uses PostgreSQL for data) |
| sonarr-foreign | Config zeroed | Minimal (uses PostgreSQL for data) |
| radarr | Config zeroed | Minimal (uses PostgreSQL for data) |
| radarr-uhd | Config zeroed | Minimal (uses PostgreSQL for data) |
| filebrowser | Config zeroed | Lost user settings |
The sonarr and radarr instances were lucky—they store actual data in PostgreSQL, so the zeroed config.xml only meant losing some network settings. But qbittorrent and sabnzbd were serious losses.
Recovery Strategy
The immediate fix was obvious: stop using ceph-filesystem for VolSync-backed PVCs. But first, I needed to recover the data.
Attempt 1: Kopia Snapshots with previous: N
Kopia stores multiple snapshots. The previous parameter tells the ReplicationDestination to restore an older snapshot:
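The restore manifest looked roughly like this. I'm reconstructing the mover fields from memory, so verify them against the VolSync documentation before copying anything:

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: qbittorrent-dst
  namespace: media
spec:
  trigger:
    manual: restore-once
  kopia:
    repository: qbittorrent-volsync-kopia   # Secret with the repo URL and password
    destinationPVC: qbittorrent
    copyMethod: Direct
    previous: 3   # skip back N snapshots from the latest
```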
I tried previous: 3, previous: 7, previous: 10, even previous: 13. Every single snapshot was empty.
The CephFS corruption happened before the Kopia migration. All Kopia snapshots were backing up already-corrupted data.
Attempt 2: Kopia with restoreAsOf
Maybe the corruption was more recent? Kopia’s restoreAsOf parameter restores from the most recent snapshot before a given timestamp:
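Same manifest, swapping previous for a timestamp right around the migration (again, field names reconstructed from memory):

```yaml
  kopia:
    repository: qbittorrent-volsync-kopia
    destinationPVC: qbittorrent
    copyMethod: Direct
    restoreAsOf: "2025-12-12T00:00:00Z"   # restore the latest snapshot taken before this time
```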
Same result. Empty. The corruption predated any Kopia backup.
Attempt 3: Old Restic Backups
Before migrating to Kopia on December 11th, I had Restic backups going to Backblaze B2. Those old backups might still have good data.
The Restic backup bucket (nerdz-volsync) was separate from the Kopia bucket (nerdz-volsync-kopia). I still had the credentials in 1Password.
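Pointing restic at the old repository directly went something like this. The endpoint and per-app repo layout are illustrative; the credentials came straight out of 1Password:

```sh
# Old Restic repo for sabnzbd in the nerdz-volsync B2 bucket
export AWS_ACCESS_KEY_ID="<B2 keyID>"
export AWS_SECRET_ACCESS_KEY="<B2 applicationKey>"
export RESTIC_PASSWORD="<repo password from 1Password>"
export RESTIC_REPOSITORY="s3:s3.us-west-004.backblazeb2.com/nerdz-volsync/sabnzbd"

# List the most recent snapshots, then sanity-check the config inside one
restic snapshots --latest 5
restic dump latest /data/sabnzbd.ini | head
```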
Success! The December 10th Restic backup had the full config with all my Usenet server settings.
The Recovery Process
Step 1: Create the Restic Restore Component
I created a one-time-use component specifically for Restic restores:
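A sketch of that component: a ReplicationDestination per app that points at the old Restic bucket and restores directly into the app's new PVC. The names, substitution variables, and exact field set are approximations of my setup, not something to copy verbatim:

```yaml
# components/volsync-restic-restore/replicationdestination.yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: "${APP}-restic-restore"
spec:
  trigger:
    manual: restore-once                    # fire once, then delete the object
  restic:
    repository: "${APP}-volsync-restic"     # Secret pointing at the old nerdz-volsync bucket
    destinationPVC: "${APP}"                # the freshly created ceph-block PVC
    copyMethod: Direct
    restoreAsOf: "2025-12-10T23:59:59Z"     # last known-good snapshot before the corruption
```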
Step 2: Migrate Each App to ceph-block
For each affected app:
- Scale down the deployment
- Delete the corrupted PVC
- Create new PVC on ceph-block
- Restore from Restic backup
- Update ks.yaml to use ceph-block going forward
- Scale up and verify
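In practice each migration boiled down to a handful of commands along these lines; the app name, namespace, Kustomization name, and restore object name are placeholders:

```sh
APP=sabnzbd
NS=media

# Scale down and drop the corrupted ceph-filesystem PVC
kubectl -n "$NS" scale deploy "$APP" --replicas=0
kubectl -n "$NS" delete pvc "$APP"

# Reconcile the app so Flux recreates the PVC on ceph-block
# (after switching the storage class in ks.yaml)
flux reconcile kustomization "$APP" --with-source

# Kick off the one-time Restic restore and wait for it to finish
kubectl -n "$NS" patch replicationdestination "${APP}-restic-restore" \
  --type merge -p "{\"spec\":{\"trigger\":{\"manual\":\"restore-$(date +%s)\"}}}"

# Bring the app back and verify the config is real data this time
kubectl -n "$NS" scale deploy "$APP" --replicas=1
```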
Step 3: Verify and Create Fresh Backups
After confirming each app had valid data, I triggered fresh backups to all three destinations:
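With VolSync that is just a matter of bumping the manual trigger on each ReplicationSource; the three destination suffixes here are placeholders for my actual ones:

```sh
APP=sabnzbd
NS=media

for dest in local r2 b2; do
  kubectl -n "$NS" patch replicationsource "${APP}-${dest}" \
    --type merge -p "{\"spec\":{\"trigger\":{\"manual\":\"post-restore-$(date +%s)\"}}}"
done

# Watch until each source reports a new lastManualSync in its status
kubectl -n "$NS" get replicationsource --watch
```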
The Flux Alert Spam
After fixing all the apps, Flux started bombarding me with alerts: every reconciliation of the migrated apps was failing while trying to apply the recreated PVCs.
The volsync component’s PVC template includes a dataSourceRef pointing to the ReplicationDestination. For existing PVCs, this causes a conflict—you can’t add a dataSourceRef after creation.
The fix was adding the IfNotPresent SSA label to the PVC template:
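In the PVC template that looks roughly like this (abbreviated; the substitution variables are from my setup):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "${APP}"
  labels:
    # Flux SSA policy: create if missing, never server-side-apply over an existing PVC
    kustomize.toolkit.fluxcd.io/ssa: IfNotPresent
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-block
  resources:
    requests:
      storage: "${VOLSYNC_CAPACITY}"
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: "${APP}-dst"
```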
This tells Flux: “Create this PVC if it doesn’t exist, but don’t try to update existing ones.”
Lessons Learned
| Assumption | Reality |
|---|---|
| CephFS works fine for all workloads | Sparse file handling during restores can corrupt data |
| Kopia backups are good if they complete | They can back up already-corrupted data perfectly |
| previous: N is a time machine | Only if the data was good when backed up |
| Old backup systems can be deleted after migration | Keep them until you’ve verified restores work |
| All my apps use PostgreSQL for data | qbittorrent and sabnzbd use local config files |
The 3-2-1-1 Backup Strategy
After this incident, I’ve upgraded from 3-2-1 to 3-2-1-1:
- 3 copies of data
- 2 different storage types
- 1 offsite copy
- 1 air-gapped or delayed-deletion copy
The old Restic backups in B2 were effectively that fourth copy: nothing in the cluster referenced them anymore, and I simply hadn't gotten around to deleting them after the Kopia migration. That laziness saved my data.
Storage Class Selection
Going forward, all VolSync-backed PVCs use ceph-block:
| Use Case | Storage Class |
|---|---|
| App config/data backed by VolSync | ceph-block |
| Shared working storage (media processing) | ceph-filesystem |
| Databases (backed by pgBackRest) | ceph-block |
| Temporary/cache data | openebs-hostpath |
CephFS is still useful for ReadWriteMany workloads where multiple pods need access to the same files. Just don’t use it for data that needs to survive restore operations.
Update 2025-12-23: CSI Read Affinity
After discussing this issue with the home-operations community, I discovered another contributing factor: CSI Read Affinity.
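In a Rook-managed cluster the setting lives in the CephCluster spec; mine looked roughly like this (check the exact field names against your Rook version):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  csi:
    readAffinity:
      # Prefer OSDs whose CRUSH location matches the client node's topology labels
      enabled: true
      crushLocationLabels:
        - kubernetes.io/hostname
        - topology.kubernetes.io/zone
```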
This setting makes the Ceph CSI driver prefer reading from OSDs on the same node as the pod. While this sounds like a performance optimization, it can cause data consistency issues with CephFS—particularly with sparse file handling during restore operations.
The fix: Disable it.
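Which in my case meant flipping the flag in the same block of the CephCluster spec:

```yaml
  csi:
    readAffinity:
      enabled: false
```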
If you’re experiencing CephFS data corruption, check this setting first.
Quick Reference
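The checks I now keep handy for spotting this kind of silent corruption; the paths are illustrative:

```sh
# Is a file nothing but null bytes? 0 means fully zeroed.
tr -d '\0' < /config/qBittorrent/qBittorrent.conf | wc -c

# Sweep a restored volume for files that have size but no non-null content
find /config -type f -size +0c -print0 |
  while IFS= read -r -d '' f; do
    [ "$(tr -d '\0' < "$f" | wc -c)" -eq 0 ] && echo "ZEROED: $f"
  done
```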
Final Thoughts
Data corruption is insidious. The files looked normal—right names, right sizes, right permissions. Only the content was wrong. Without actually reading the files, there was no indication anything was broken.
This is why backup verification matters. Not “did the backup job complete successfully,” but “can I actually restore and use the data.” I’ve added a monthly calendar reminder to do test restores.
The silver lining: this forced me to audit all my apps and migrate everything to consistent storage classes. The cluster is more robust now than before the incident.
This post documents the recovery from my home-ops cluster. The original VolSync Kopia migration is documented in my previous post.