You don’t perform open-heart surgery until you’re sure the blood bank works.
In Part 1 I did the autopsy: three consumer Samsung 990 PROs, same batch, holding /var for my Talos control plane, all wearing out roughly twice as fast as host writes alone would explain — one already dead with a kernel-level IRQ-disable, the other two queued up behind it. Amazon refunded the whole batch, and I bought three Samsung PM9A3 960GB enterprise drives to replace them — power-loss protection, proper endurance, the same family as the OSD drives that have run flawlessly for 22 months.
I called it “Part 1 of 2.” This is Part 2. Reader, it turned into Part 3 as well, because the cluster had opinions. This part is the prep and the canary; Part 3 is the other two nodes and everything that went sideways.
Step one wasn’t the disk swap
I sat down to plan the swap and immediately tripped over a rule I keep relearning: swapping the OS disk on a control-plane node is a rebuild — etcd member, Ceph mon, the lot. If anything goes sideways mid-rebuild, my recovery path is the backups. So before touching a disk, I went to check the backups.
They were broken. Quietly, in three different ways.
This is the CloudNativePG / pgBackRest setup behind my Postgres clusters. What I found:
| What I found | Reality |
|---|---|
| The “daily” backup | Running hourly — a five-field cron (0 3 * * *) silently parsed as six-field (seconds-first), so it meant “every hour at :03” |
| Backup chain depth | 1 full from December + 1,573 incrementals stacked on it. Never re-based. |
| Accumulated Backup objects | 7,537 of them in the namespace: months of ~48/day, over half failed |
| Forcing a fresh full | The plugin silently ignored type: full; the correct key was backupType: full |
That cron one is worth dwelling on, because it’s a trap anyone using CloudNativePG can fall into: ScheduledBackup.spec.schedule is a six-field cron, and the first field is seconds. A normal five-field expression doesn’t error, it just shifts. 0 3 * * *, which I read as “3am daily,” became sec=0 min=3 hour=*, which fires at three minutes past every hour. That single misparse generated the 7,537 objects.
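The shift is easier to see with the fields laid out side by side. This is a minimal sketch of positional field assignment, not CloudNativePG's actual parser; the helper name `read_cron` and the rule that missing trailing fields default to `*` are assumptions made for illustration.

```python
# Field names for the two cron dialects discussed above.
SIX_FIELD = ["second", "minute", "hour", "day_of_month", "month", "day_of_week"]
FIVE_FIELD = SIX_FIELD[1:]  # classic crontab: no seconds field


def read_cron(expr, names):
    """Assign whitespace-separated cron fields to names by position.

    Missing trailing fields default to "*". That padding is an assumption of
    this sketch, chosen to mirror the "doesn't error, just shifts" behaviour.
    """
    fields = expr.split()
    if len(fields) > len(names):
        raise ValueError(f"too many fields in {expr!r}")
    padded = fields + ["*"] * (len(names) - len(fields))
    return dict(zip(names, padded))


intended = read_cron("0 3 * * *", FIVE_FIELD)   # what I thought I wrote
actual = read_cron("0 3 * * *", SIX_FIELD)      # the seconds-first reading
fixed = read_cron("0 0 3 * * *", SIX_FIELD)     # the corrected schedule

for label, parsed in (("intended", intended), ("actual", actual), ("fixed", fixed)):
    print(label, parsed)
```

In the seconds-first reading the 3 lands in the minute slot and the hour becomes a wildcard, which is the hourly schedule described above. Prepending an explicit 0 for seconds puts the 3 back in the hour slot.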
Fixes shipped:
- backupType: full weekly + explicit backupType: incr daily, both on proper six-field crons (0 0 3 * * *, 0 0 1 * * 0)
- Cleaned 7,537 → 339 Backup objects (no finalizers, so the B2 data was never at risk)
- Took a fresh full to Backblaze and an independent pg_dumpall to the NAS as a break-glass copy
Then the part that actually matters: I proved the restore. Recovered the logical dump into a throwaway cluster and confirmed all 42 roles and 50 databases came back, data intact, with the production B2 repo untouched. A backup you haven’t restored is a rumour.
Only then did I let myself near the disks.
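The restore check boils down to comparing two inventories. A rough sketch of that comparison follows, assuming the role and database names have already been pulled from each cluster (for example from pg_roles.rolname and pg_database.datname); the function name and the sample names are hypothetical, not my real inventory.

```python
def missing_after_restore(expected, restored):
    """Return the names present in production but absent from the restore."""
    return sorted(set(expected) - set(restored))


# Hypothetical inventories, purely for illustration.
prod_databases = ["app", "grafana", "postgres"]
restored_databases = ["app", "postgres"]

# Anything listed here did not come back from the dump.
print(missing_after_restore(prod_databases, restored_databases))
```

An empty list for both roles and databases is the minimum bar; it says the objects exist, not that every row survived, so spot-checking data is still worth doing.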
Burn-in
The PM9A3s arrived described as “recertified,” which made me suspicious, but SMART said otherwise — zero power-on hours, zero writes, pristine. I burned them in anyway before trusting them with etcd: two hours of fio per drive, full-tilt 4k random write then a 70/30 read/write mix.
| Drive | randwrite | IOPS | Media errors | Note |
|---|---|---|---|---|
| stanton-01 (S5XMNE0TC58615) | 731 MiB/s | 187k | 0 | sensor 2 brushed 78°C |
| stanton-02 (S5XMNE2TC24056) | 735 MiB/s | 188k | 0 | passive-cooled MS-01 slot |
| stanton-03 (S5XMNE1TC99141) | 732 MiB/s | 187k | 0 | within 0.2% of the others |
All three within 0.2% of each other. That tight a spread is what healthy, identical-firmware drives look like — the opposite of the diverging-wear picture from Part 1. They do run hotter than the 990s (8.25W idle vs ~5W), so the chassis sits warmer after the swap. Worth watching, not worth worrying about: 78°C under a torture test, 85°C is the throttle line.
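As a sanity check on the table, throughput and IOPS should agree with each other at a fixed block size. A small sketch, assuming "4k" means 4 KiB (4096 bytes) and using the IOPS figures as rounded in the table; the function name is made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB


def throughput_mib_s(iops, block_bytes=4 * KIB):
    """Convert an IOPS figure into MiB/s for a fixed block size."""
    return iops * block_bytes / MIB


# IOPS figures as rounded in the table above.
for iops in (187_000, 188_000):
    print(iops, round(throughput_mib_s(iops), 1))

# Thermal headroom under load: throttle line minus the hottest reading.
headroom_c = 85 - 78
print(headroom_c)
```

The results land within 1 MiB/s of the reported throughput, which is what rounding the IOPS to the nearest thousand would predict.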
Then Talos made me read the manual
My first plan was naive: change installDiskSelector to the PM9A3’s serial, apply-config --reboot, done.
Wrong. installDiskSelector is consumed by the installer — initial install and talosctl upgrade — not by a running node. Change the config and reboot a healthy node and it just boots the existing install on the old 990. As a Sidero maintainer put it in a discussion thread: “Existing installations must be wiped before attempting migration.”
So moving the OS to the PM9A3 is a genuine per-node rebuild: drain the node, remove its etcd member, power off, physically pull the 990, boot the PM9A3 fresh from a Talos USB in maintenance mode, let it reinstall and rejoin. etcd drops to 2/3 and recovers. The Ceph mon rebuilds. And the Ceph OSD on the other disk — the PM9A3 that was already in nvme0 — has to be re-adopted, because Rook’s local metadata lives on the /var I’m about to wipe.
That last bit was the real unknown, and it’s why I didn’t do all three at once.
The canary: stanton-02
stanton-02 had the flakiest 990 (the one that actually died in Part 1), so it was both the most urgent node and the one I’d least mind learning on. I wrote the runbook with the five things I genuinely didn’t know would work until I did one, because pretending I knew would just hide where the risk was:
- the exact etcd remove-member syntax
- the maintenance-mode IP
- whether apply-config --insecure actually installs to the new disk
- whether Rook re-adopts the OSD or rebuilds it (the one that mattered)
- how long the Ceph mon takes to rebuild
Drain, then freeze Ceph
Cordon, drain. The expected things went Pending — the Ceph mon and OSD (pinned to the node) and the CNPG replicas (anti-affinity won’t co-locate them). Then, before powering off, the step the first draft of my runbook didn’t have and should have: freeze Ceph. An OSD that’s been down for 10 minutes gets marked out, and Ceph starts re-replicating its data elsewhere — pointless churn when the node’s coming right back.
With noout set, Ceph reported 2 up, 3 in: osd.1 down but still counted, no rebalance. Now I could take as long as I needed.
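The ten-minute rule and the effect of the flag can be written down as a tiny decision function. This is a toy model of the behaviour described above, not Ceph's implementation; the names are hypothetical and the 600-second default simply encodes the 10 minutes mentioned earlier.

```python
DOWN_OUT_INTERVAL_S = 600  # the 10-minute window described above


def marked_out(down_seconds, noout=False, interval=DOWN_OUT_INTERVAL_S):
    """Return True if a down OSD would be marked out, triggering re-replication."""
    if noout:
        return False  # the flag suppresses the down -> out transition
    return down_seconds >= interval


print(marked_out(300))               # short blip, still "in"
print(marked_out(3600))              # an hour-long rebuild without the flag
print(marked_out(3600, noout=True))  # the same rebuild with noout set
```

A node rebuild takes far longer than the window, so without the flag the churn is effectively guaranteed.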
The etcd dance
talosctl etcd remove-member wants the member ID, not the hostname. talosctl etcd members shows it in hex; feed that in, and we’re down to two members, both healthy, quorum intact. Unknown #1: solved.
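"Quorum intact" deserves a footnote, because the arithmetic changes once the member is removed. A quick sketch, assuming the usual majority rule for a Raft-style cluster like etcd; the function names are my own.

```python
def quorum(members):
    """Votes needed for a majority in a cluster of this size."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail before the majority is lost."""
    return members - quorum(members)


for size in (3, 2):
    print(size, quorum(size), tolerated_failures(size))
```

With three members the cluster can lose one. With two it still needs both, so the window between removing the member and rejoining it has no slack at all, which is one more reason to do a single node at a time.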
Pull, boot, install
Pulled the 990, left the OSD disk and the PM9A3 in, inserted the USB, F11, maintenance mode. Because I run MAC-reservation DHCP, the node came up on its usual 10.90.3.102 — no hunting for a maintenance IP. Unknown #2: solved.
Checked the disks before installing, because paranoia is healthy when you’re about to wipe something:
990 gone, PM9A3 present, not yet a system disk. Then I applied the config with apply-config --insecure.
Watched the install run on the JetKVM, pulled the USB at the reboot. Unknown #3: it installs to the configured disk. Solved.
The moment I didn’t like
Node back, etcd rejoined, all three in sync. Then I checked the system disk:
nvme0n1. That was the Ceph OSD disk in maintenance mode. Did I just install Talos over an OSD?
Short answer: no, and this is exactly why you pin by serial, not by /dev/nvmeX. When the 990 left, the two remaining drives re-enumerated, and the new PM9A3 moved from nvme1n1 to nvme0n1. installDiskSelector.serial followed the drive, not the name.
The thing that looked like a disaster was the safety mechanism working. Always verify the system disk’s serial, never trust the device name. (This bites again in Part 3, harder.)
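Here is the same idea as a lookup. It is a simplified model of selecting by serial, not Talos's selector code. The PM9A3 serial is stanton-02's from the burn-in table; the OSD serial is a placeholder, and the post-install position of the OSD drive (nvme1n1) is my inference from the swap described above.

```python
PM9A3_SERIAL = "S5XMNE2TC24056"  # stanton-02's new drive, from the burn-in table
OSD_SERIAL = "OSD-DRIVE-SERIAL"  # placeholder; the real serial isn't given here


def device_for_serial(disks, serial):
    """Return the device name currently assigned to the drive with this serial."""
    matches = [d["name"] for d in disks if d["serial"] == serial]
    if len(matches) != 1:
        raise LookupError(f"expected one disk with serial {serial}, found {len(matches)}")
    return matches[0]


# Enumeration as seen in maintenance mode, before the install.
maintenance_mode = [
    {"name": "nvme0n1", "serial": OSD_SERIAL},
    {"name": "nvme1n1", "serial": PM9A3_SERIAL},
]
# Enumeration after the reinstall and reboot (OSD position assumed).
after_install = [
    {"name": "nvme0n1", "serial": PM9A3_SERIAL},
    {"name": "nvme1n1", "serial": OSD_SERIAL},
]

print(device_for_serial(maintenance_mode, PM9A3_SERIAL))
print(device_for_serial(after_install, PM9A3_SERIAL))
```

The serial resolves to a different device name in each listing, yet it is the same physical drive both times. A selector keyed on nvme0n1 would have pointed at the OSD in the first listing.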
The one that mattered: the OSD
osd.1’s data lives on the OSD drive, which I never touched. The question was whether Rook would re-adopt that OSD or treat the disk as new and rebuild it — a full backfill from the other two nodes. I uncordoned and watched:
No osd-prepare job spawned. That’s the tell — Rook didn’t run a prepare/zap pass, it just started the existing OSD deployment. A minute later:
Re-adopted. Same OSD ID, data intact, 169 active+clean — and because of noout, zero data movement. Unknowns #4 and #5: solved, the good way. Unset the flags, HEALTH_OK.
What the canary cost
| Item | Outcome |
|---|---|
| Ceph data rebalanced | 0 bytes |
| etcd quorum lost | never (2/3 held throughout) |
| Downtime to services | none (everything HA across the other two nodes) |
| Old 990 | in my hand, intact — instant rollback if I’d needed it |
One node migrated onto a datacenter SSD, the failing consumer drive bagged for the RMA, and — the part Part 1 was really asking — the failure mode is structurally gone. No more consumer NAND doing fsync-heavy etcd work without power-loss protection. The PM9A3 has the capacitors; the 990 never did.
So the procedure works. I had a clean runbook, five resolved unknowns, and a node humming on enterprise flash.
Which is precisely the kind of confidence that comes right before the back two nodes find five new ways to make you earn it. That’s Part 3: node-local data that vaporises on reinstall, an OSD that boots faster than its own network, a database password bug I’d only half-fixed, a restore that raced itself, and a serial number I wrongly swore I couldn’t read.