<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Prometheus on Nerdz</title>
        <link>https://blog.nerdz.cloud/tags/prometheus/</link>
        <description>Recent content in Prometheus on Nerdz</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <copyright>Gavin McFall</copyright>
        <lastBuildDate>Tue, 26 May 2026 00:00:00 +1200</lastBuildDate><atom:link href="https://blog.nerdz.cloud/tags/prometheus/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Three 990 PROs, One Batch, All Dying — Part 3: The Part Where the Canary Lied</title>
        <link>https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-3/</link>
        <pubDate>Tue, 26 May 2026 00:00:00 +1200</pubDate>
        
        <guid>https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-3/</guid>
        <description>&lt;blockquote&gt;
&lt;p&gt;In &lt;a class=&#34;link&#34; href=&#34;https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-2&#34; &gt;Part 2&lt;/a&gt; the canary went so cleanly I ended by saying the procedure works. Reader, that was the confidence that comes right before the other two nodes teach you things.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The procedure itself held up perfectly. stanton-01 and stanton-03 both drained, dropped their etcd member, pulled the 990, booted the PM9A3 from a USB stick, reinstalled, and rejoined — etcd 3/3, Ceph &lt;code&gt;HEALTH_OK&lt;/code&gt;, OSD re-adopted with zero data movement, exactly like the canary promised.&lt;/p&gt;
&lt;p&gt;What the canary &lt;em&gt;didn&amp;rsquo;t&lt;/em&gt; teach me is that a control-plane node carries a surprising amount of state that lives &lt;strong&gt;only&lt;/strong&gt; on that node — and a fresh install wipes &lt;code&gt;/var&lt;/code&gt; to bare metal. Five separate things broke or surprised me on the back two nodes. None of them were the disk swap. All of them were worth writing down.&lt;/p&gt;
&lt;h2 id=&#34;1-the-data-that-lives-on-the-node-and-dies-with-it&#34;&gt;1. The data that lives on the node (and dies with it)
&lt;/h2&gt;&lt;p&gt;Talos&amp;rsquo;s &lt;code&gt;EPHEMERAL&lt;/code&gt; partition is &lt;code&gt;/var&lt;/code&gt;, and on these nodes &lt;code&gt;/var/openebs/local&lt;/code&gt; is where the &lt;strong&gt;OpenEBS hostpath&lt;/strong&gt; PVCs live — node-pinned, single-copy volumes. A reinstall doesn&amp;rsquo;t migrate them. It vaporises them.&lt;/p&gt;
&lt;p&gt;I knew the CNPG replicas would need rebuilding (they stream fresh from the primary — a non-event). What I&amp;rsquo;d underweighted was everything &lt;em&gt;else&lt;/em&gt; on node-local storage. After stanton-01 came back, a handful of pods were stuck:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;loki-0          ContainerCreating
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;alertmanager-0  Init:0/1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;MountVolume.NewMounter ... path &amp;#34;/var/openebs/local/pvc-9713ea45-…&amp;#34; does not exist
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The PV object still existed in Kubernetes; its backing directory on the node did not. OpenEBS won&amp;rsquo;t recreate the directory for an &lt;em&gt;already-provisioned&lt;/em&gt; PV, so the pod sits there forever, politely failing to mount a thing that no longer exists.&lt;/p&gt;
&lt;p&gt;Loki and Alertmanager are single-instance. There&amp;rsquo;s no replica to stream from. If I wiped the node, that history was simply &lt;strong&gt;gone&lt;/strong&gt; — unless I&amp;rsquo;d taken it off the node first.&lt;/p&gt;
&lt;p&gt;So before each wipe, I backed up the at-risk node-local volumes to the NAS with a dead-simple privileged pod: mount the node&amp;rsquo;s &lt;code&gt;/var/openebs/local&lt;/code&gt; read-only on one side and an NFS share on the other, then &lt;code&gt;tar&lt;/code&gt; the directories across. (A few things I learned the hard way: TrueNAS NFS root-squashes to a fixed UID, so the tarballs land owned by &lt;code&gt;apps&lt;/code&gt; and you don&amp;rsquo;t have to fight permissions; &lt;code&gt;showmount&lt;/code&gt; isn&amp;rsquo;t available on TrueNAS SCALE, so read &lt;code&gt;/etc/exports&lt;/code&gt; directly; and &lt;code&gt;chmod&lt;/code&gt; over SSH bounces off the NFSv4 ACLs, so lean on the squash instead.)&lt;/p&gt;
&lt;p&gt;The restore is the neat part. Rather than fight OpenEBS to re-provision, I just &lt;strong&gt;recreated the exact directory the PV expected and unpacked the backup into it&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;mkdir -p /var/openebs/local/pvc-9713ea45-…   &lt;span class=&#34;c1&#34;&gt;# the path the mount was crying about&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;tar xf loki_….tar -C /var/openebs/local/pvc-9713ea45-…
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Next mount retry, the kubelet finds a populated directory, and Loki starts on its restored history like nothing happened. For the genuinely throwaway volumes (VolSync caches — ~67 of them on stanton-03 alone), I just &lt;code&gt;mkdir&lt;/code&gt;&amp;rsquo;d empty directories; VolSync refills them on the next sync. Empty dir for caches, restored dir for data, skip the freshly-provisioned CNPG dirs. One pass.&lt;/p&gt;
&lt;p&gt;The lesson is blunt: &lt;strong&gt;on a hyperconverged node, &amp;ldquo;reinstall the OS&amp;rdquo; and &amp;ldquo;destroy the node-local data&amp;rdquo; are the same sentence.&lt;/strong&gt; Know what&amp;rsquo;s single-copy and on &lt;code&gt;/var&lt;/code&gt; before you pull the trigger. For me that was Loki, Alertmanager, and Prometheus — everything else was replicated, on the NAS, or rebuildable.&lt;/p&gt;
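&lt;p&gt;A minimal sketch of the one-pass rule from this section (empty directory for caches, restored directory for data, skip what already exists), assuming hypothetically that the expected PV directory names arrive on stdin and that each backup is a tarball named after its PV directory. Unlike the plain &lt;code&gt;mkdir&lt;/code&gt; plus &lt;code&gt;tar&lt;/code&gt; shown above, it extracts into a staging directory and renames it into place, so a mount retry never sees a half-filled tree. The staging step and the &lt;code&gt;restore_pass&lt;/code&gt; name are illustrative additions, not part of the procedure described in this post.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Sketch only: recreate node-local PV directories in one pass.
# Hypothetical inputs: PV directory names on stdin (one per line),
# backups stored as BACKUP_DIR/NAME.tar.
restore_pass() {
  local root="$1" backup_dir="$2" pv
  while IFS= read -r pv; do
    [ -n "$pv" ] || continue
    if [ -d "$root/$pv" ]; then
      echo "skip    $pv"      # already exists, e.g. freshly provisioned
    elif [ -f "$backup_dir/$pv.tar" ]; then
      mkdir -p "$root/.staging-$pv"
      tar xf "$backup_dir/$pv.tar" -C "$root/.staging-$pv"
      mv "$root/.staging-$pv" "$root/$pv"   # rename into place once complete
      echo "restore $pv"
    else
      mkdir -p "$root/$pv"
      echo "empty   $pv"      # cache volume, refilled on the next sync
    fi
  done
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Usage would look like &lt;code&gt;printf &#39;%s\n&#39; pvc-a pvc-b | restore_pass /var/openebs/local /mnt/backup&lt;/code&gt;, run from wherever the node&amp;rsquo;s directory is mounted; the PV names and the backup path are placeholders.&lt;/p&gt;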
&lt;h2 id=&#34;2-the-osd-that-booted-faster-than-its-network&#34;&gt;2. The OSD that booted faster than its network
&lt;/h2&gt;&lt;p&gt;stanton-03 reinstalled fine, etcd rejoined, the mon came back — and then its Ceph OSD went into &lt;code&gt;CrashLoopBackOff&lt;/code&gt;. New behaviour; the first two nodes re-adopted their OSDs without a hiccup.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-1 unable to find any IPv4 address in networks &amp;#39;169.254.255.0/24&amp;#39; interfaces &amp;#39;&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-1 Failed to pick cluster address.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;My Ceph &lt;strong&gt;cluster_network&lt;/strong&gt; runs over the Thunderbolt mesh between the three MS-01s — the same fragile TB ring that turned a single dead disk into a four-hour incident back in &lt;a class=&#34;link&#34; href=&#34;https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-1&#34; &gt;Part 1&lt;/a&gt;, and that I&amp;rsquo;m in the middle of migrating off entirely (a story for another post). The OSD needs an address on &lt;code&gt;169.254.255.0/24&lt;/code&gt; to bind. On a freshly-booted node, the OSD container started &lt;strong&gt;before&lt;/strong&gt; Thunderbolt had finished negotiating and getting its address. No address, no bind, crash.&lt;/p&gt;
&lt;p&gt;The fix was almost embarrassingly simple once I understood it: wait for Thunderbolt to come up (&lt;code&gt;talosctl get addresses&lt;/code&gt; shows the &lt;code&gt;169.254.255.x&lt;/code&gt; land on the &lt;code&gt;enx…&lt;/code&gt; interfaces), then delete the crash-looping OSD pod so it restarts into a network that now exists. Up &lt;code&gt;2/2&lt;/code&gt;, re-adopted, &lt;code&gt;169 active+clean&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Worth flagging because it&amp;rsquo;s a pure &lt;strong&gt;ordering&lt;/strong&gt; bug, not a config error — the same manifest that worked on two nodes &amp;ldquo;failed&amp;rdquo; on the third purely because a USB-ish interface took a few extra seconds to wake up. If your Ceph cluster network rides on something that negotiates slowly, expect this on a cold node and don&amp;rsquo;t panic.&lt;/p&gt;
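&lt;p&gt;The wait-then-restart step could be scripted roughly like this. It is a sketch under assumptions: the node name, the &lt;code&gt;rook-ceph&lt;/code&gt; namespace and the &lt;code&gt;ceph-osd-id&lt;/code&gt; label in the commented usage lines are placeholders for whatever your deployment uses, and the two function names are made up for illustration. Only &lt;code&gt;talosctl get addresses&lt;/code&gt; and the &lt;code&gt;169.254.255.0/24&lt;/code&gt; range come from the text above.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Sketch only: block until a cluster-network address exists, then bounce the OSD.
has_cluster_addr() {
  # Reads an address listing on stdin; succeeds if any 169.254.255.x/NN appears.
  grep -Eq '(^|[^0-9.])169\.254\.255\.[0-9]+/'
}

wait_for_cluster_addr() {
  local node="$1" tries="${2:-60}"
  while [ "$tries" -gt 0 ]; do
    if talosctl -n "$node" get addresses | has_cluster_addr; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 5
  done
  return 1
}

# Hypothetical usage (names are placeholders):
# wait_for_cluster_addr stanton-03 || exit 1
# kubectl -n rook-ceph delete pod -l ceph-osd-id=2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Polling with a bounded retry count keeps the script from hanging forever if the link never negotiates, which is a different problem from a slow one.&lt;/p&gt;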
&lt;h2 id=&#34;3-the-bug-i-thought-id-fixed&#34;&gt;3. The bug I thought I&amp;rsquo;d fixed
&lt;/h2&gt;&lt;p&gt;Then five app pods fell over at once, all with the same error:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;FATAL:  password authentication failed for user &amp;#34;postgres&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This is my CloudNativePG cluster&amp;rsquo;s original sin: it was bootstrapped with &lt;code&gt;owner: postgres&lt;/code&gt; — the database owner &lt;em&gt;is&lt;/em&gt; the superuser. That gives CNPG two independent reconcile paths that both write the &lt;code&gt;postgres&lt;/code&gt; role&amp;rsquo;s password, and they don&amp;rsquo;t always agree. Restart a node, rebuild some instances, fail a primary over a few times — exactly what a disk migration does — and the live password drifts off what the apps hold. The apps, holding the right value, get rejected.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the honest part. A while back I &amp;ldquo;fixed&amp;rdquo; this. What I &lt;em&gt;actually&lt;/em&gt; fixed was the &lt;strong&gt;backups&lt;/strong&gt; (the &lt;a class=&#34;link&#34; href=&#34;https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-2&#34; &gt;Part 2&lt;/a&gt; excavation) and I wrote myself a recovery runbook. I never fixed the &lt;strong&gt;root cause&lt;/strong&gt;, because that needs a planned outage to recreate the cluster, and there was always something more urgent. So when the migration churned CNPG hard, the race came back precisely as designed.&lt;/p&gt;
&lt;p&gt;The recovery is well-trodden now. A hash comparison (no secrets printed — just &lt;code&gt;sha256&lt;/code&gt; prefixes) showed all five apps and the 1Password-managed secret agreeing on one value; the live database had drifted off it. So I set the database back to the value everyone else already expected — and, importantly, to the value held in the database&amp;rsquo;s &lt;em&gt;own&lt;/em&gt; managed secret, so the next reconcile applies the same thing instead of fighting me:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;ALTER&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;USER&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;postgres&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;WITH&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PASSWORD&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;…&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;   &lt;/span&gt;&lt;span class=&#34;c1&#34;&gt;-- via a local peer-auth session on the primary
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Bounce the five pods, they re-run their init against a database that now accepts them, done. &lt;strong&gt;24/24 apps back in agreement.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;But I&amp;rsquo;m done pretending the recovery &lt;em&gt;is&lt;/em&gt; the fix. The real fix — migrating the cluster to &lt;code&gt;owner: app&lt;/code&gt; so the superuser has exactly one password-writer — is now planned, validated end-to-end against a throwaway cluster, and waiting for an outage window. The backups that Part 2 was all about are what finally make me comfortable doing it. Funny how that comes full circle.&lt;/p&gt;
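&lt;p&gt;For reference, the hash comparison used in the recovery can be sketched in a few lines. This assumes GNU &lt;code&gt;sha256sum&lt;/code&gt; is available and that each secret value is already sitting in a shell variable; how it gets there is left out, and the function names are hypothetical.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Sketch only: compare secrets by hash prefix without printing them.
hash_prefix() {
  # First 12 hex characters of the SHA-256 of the value (no trailing newline).
  printf '%s' "$1" | sha256sum | cut -c1-12
}

same_secret() {
  [ "$(hash_prefix "$1")" = "$(hash_prefix "$2")" ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Printing only &lt;code&gt;hash_prefix&lt;/code&gt; output for each source makes the odd one out obvious at a glance while keeping the password itself out of the terminal scrollback.&lt;/p&gt;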
&lt;h2 id=&#34;4-the-restore-that-raced-itself&#34;&gt;4. The restore that raced itself
&lt;/h2&gt;&lt;p&gt;Prometheus was the biggest single-copy volume — ~32 GB of TSDB on stanton-03, backed up before the wipe. After the node came back, I recreated its directory and started unpacking the 32 GB tarball into it.&lt;/p&gt;
&lt;p&gt;And while &lt;code&gt;tar&lt;/code&gt; was still extracting, Prometheus &lt;strong&gt;started&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The kubelet had been retrying the mount every 20 seconds. The instant my &lt;code&gt;mkdir&lt;/code&gt; created the directory, the mount succeeded, and the pod launched onto a &lt;strong&gt;half-extracted&lt;/strong&gt; TSDB while &lt;code&gt;tar&lt;/code&gt; was still writing files underneath it. That is a great way to corrupt a time-series database.&lt;/p&gt;
&lt;p&gt;I caught it because the pod went &lt;code&gt;2/2 Running&lt;/code&gt; far too early. Recovery:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Scale Prometheus to zero — for an operator-managed Prometheus that means &lt;code&gt;kubectl patch prometheus … replicas: 0&lt;/code&gt;, &lt;em&gt;not&lt;/em&gt; scaling the StatefulSet, which the operator just reverts.&lt;/li&gt;
&lt;li&gt;Empty the directory (&lt;code&gt;find … -delete&lt;/code&gt; — not &lt;code&gt;rm -rf&lt;/code&gt;; my own safety tooling slaps that down, rightly).&lt;/li&gt;
&lt;li&gt;Re-extract cleanly with nothing mounted.&lt;/li&gt;
&lt;li&gt;Scale back up.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This time it loaded properly — nine healthy blocks, ~31 days of history, clean WAL replay. The lesson: &lt;strong&gt;if you&amp;rsquo;re restoring into a directory a controller is actively trying to mount, stop the controller first.&lt;/strong&gt; The kubelet will not wait for your &lt;code&gt;tar&lt;/code&gt; to finish — it grabs the volume the moment it appears.&lt;/p&gt;
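&lt;p&gt;The four recovery steps above can be sketched as follows. The namespace and object name are hypothetical, and scaling back to one replica is an assumption about the original replica count; the only part taken from the steps is that the &lt;code&gt;Prometheus&lt;/code&gt; custom resource gets patched rather than the StatefulSet.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Sketch only. NS and PROM are hypothetical names; substitute your own.
NS=monitoring
PROM=kube-prometheus-stack

replicas_patch() {
  # Merge-patch body for the Prometheus custom resource.
  printf '{"spec":{"replicas":%d}}' "$1"
}

scale_prometheus() {
  kubectl -n "$NS" patch prometheus "$PROM" --type merge -p "$(replicas_patch "$1")"
}

# scale_prometheus 0   # 1. stop it, then wait for the pod to disappear
# ... 2. empty the PV directory, 3. extract the tarball with nothing mounted ...
# scale_prometheus 1   # 4. start it again on the finished restore
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;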
&lt;h2 id=&#34;5-the-serial-number-i-said-i-couldnt-read&#34;&gt;5. The serial number I said I couldn&amp;rsquo;t read
&lt;/h2&gt;&lt;p&gt;This one&amp;rsquo;s a personal favourite, because I was wrong and the record should say so.&lt;/p&gt;
&lt;p&gt;The pulled 990 PROs are going back as warranty returns, and they held etcd — every secret in the cluster. I wanted to confirm &lt;em&gt;which&lt;/em&gt; drive was which before wiping, and read each one&amp;rsquo;s real serial. I dropped one into a USB-NVMe dock, asked the OS for the serial, and got the &lt;strong&gt;dock&amp;rsquo;s&lt;/strong&gt; serial, not the drive&amp;rsquo;s. I&amp;rsquo;d hit this on the Talos side too: &lt;code&gt;smartctl -d sntrealtek&lt;/code&gt; failed, and I concluded the bridge masks the serial and moved on.&lt;/p&gt;
&lt;p&gt;Then I actually researched it instead of giving up, and checked the one thing I&amp;rsquo;d skipped: the bridge&amp;rsquo;s USB ID. &lt;code&gt;152D:0586&lt;/code&gt;. That&amp;rsquo;s &lt;strong&gt;JMicron&lt;/strong&gt;, not Realtek. I&amp;rsquo;d been handing a JMicron bridge the Realtek passthrough and treating the failure as proof of impossibility.&lt;/p&gt;
&lt;p&gt;The right incantation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;smartctl -d sntjmicron -a /dev/sdX     # JMicron&amp;#39;s NVMe passthrough — needs admin
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;And there it was, straight through the dock:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Model Number:   Samsung SSD 990 PRO 1TB
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Serial Number:  S73VNU0X303066H
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Exact match to stanton-01&amp;rsquo;s record. As a bonus, the same call dumps full SMART — and these &amp;ldquo;fine&amp;rdquo; consumer drives were at &lt;strong&gt;43% endurance used&lt;/strong&gt; with 100 TB written, which is rather the entire point of this three-part saga.&lt;/p&gt;
&lt;p&gt;Then the wipe itself tried to take all day. &lt;code&gt;diskpart clean all&lt;/code&gt; was crawling at &lt;strong&gt;41 MB/s&lt;/strong&gt; — textbook USB 2.0. The dock is a 10 Gbps enclosure, so I went hunting, and the culprit was the cable: a premium Anker USB-C–to–USB-C cable that is &lt;strong&gt;USB 2.0 for data&lt;/strong&gt;. Loads of &amp;ldquo;high-end&amp;rdquo; C-to-C cables are — they&amp;rsquo;re built for charging wattage with only the USB 2.0 data pairs wired, and they look identical to a 10 Gbps cable. A charging cable advertises watts; a data cable advertises 5 or 10 Gbps. Swapped to a proper SSD data cable and the same wipe ran at &lt;strong&gt;~1.35 GB/s&lt;/strong&gt; — about 30× faster, ~15 minutes per drive instead of nearly seven hours.&lt;/p&gt;
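&lt;p&gt;Those numbers are easy to sanity-check. A quick sketch, assuming a 1 TB (1000 GB) drive and a constant write rate; &lt;code&gt;wipe_minutes&lt;/code&gt; is a made-up helper using integer maths in decimal units.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Rough wipe time in whole minutes: size in GB, rate in MB/s (1 GB = 1000 MB).
wipe_minutes() {
  local size_gb="$1" rate_mb_per_s="$2"
  echo $(( size_gb * 1000 / rate_mb_per_s / 60 ))
}

wipe_minutes 1000 41     # USB 2.0 data pairs: prints 406, nearly seven hours
wipe_minutes 1000 1350   # proper data cable: prints 12, pure write time only
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The second figure is a lower bound on wall-clock time, which is consistent with the roughly 15 minutes per drive seen in practice.&lt;/p&gt;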
&lt;p&gt;One more trap while aborting the slow wipe: killing the &lt;code&gt;diskpart&lt;/code&gt; process did &lt;strong&gt;not&lt;/strong&gt; stop it. &lt;code&gt;diskpart clean all&lt;/code&gt; hands the actual zeroing to the &lt;strong&gt;Virtual Disk Service&lt;/strong&gt; (&lt;code&gt;vds&lt;/code&gt;), which keeps grinding after the front-end is gone. To truly stop it you stop VDS — or just unplug the drive, which is perfectly safe when the thing is mid-erase anyway.&lt;/p&gt;
&lt;p&gt;All three drives: serial confirmed through the bridge, full-disk zeroed, partition table gone. Ready to ship.&lt;/p&gt;
&lt;p&gt;Two takeaways here. &lt;strong&gt;Identify the bridge chip before you decide something&amp;rsquo;s impossible&lt;/strong&gt; — the passthrough is chip-specific, and &amp;ldquo;it didn&amp;rsquo;t work&amp;rdquo; usually means &amp;ldquo;wrong passthrough,&amp;rdquo; not &amp;ldquo;can&amp;rsquo;t be done.&amp;rdquo; And &lt;strong&gt;a USB enclosure isn&amp;rsquo;t an information black hole&lt;/strong&gt;: the NVMe device is right there behind a thin translation layer; you just have to speak its dialect.&lt;/p&gt;
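&lt;p&gt;One way to make the first takeaway mechanical is a small lookup from the bridge&amp;rsquo;s USB vendor ID to the matching &lt;code&gt;smartctl -d&lt;/code&gt; type. This is a sketch with a hypothetical function name; the table covers only three common vendors, and a vendor ID alone does not prove the bridge is an NVMe one, so treat the result as a first guess.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Sketch only: map a USB bridge vendor ID to a smartctl NVMe passthrough type.
smartctl_type_for_vid() {
  case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
    152d) echo sntjmicron ;;   # JMicron
    0bda) echo sntrealtek ;;   # Realtek
    174c) echo sntasmedia ;;   # ASMedia
    *)    return 1 ;;          # unknown vendor: no guess
  esac
}

# Hypothetical usage, with the vendor half of the VID:PID pair:
# smartctl -d "$(smartctl_type_for_vid 152D)" -a /dev/sdX
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;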
&lt;h2 id=&#34;what-it-cost-across-all-three&#34;&gt;What it cost, across all three
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Measure&lt;/th&gt;
          &lt;th&gt;Outcome&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Nodes migrated to datacenter SSDs&lt;/td&gt;
          &lt;td&gt;3 / 3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Ceph data rebalanced&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;0 bytes&lt;/strong&gt; (re-adopted every time)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;etcd quorum lost&lt;/td&gt;
          &lt;td&gt;never (held ≥2/3 throughout)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Monitoring history lost&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;none&lt;/strong&gt; — Prometheus, Loki, Alertmanager all restored&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Service downtime&lt;/td&gt;
          &lt;td&gt;none that outlived a pod reschedule&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Surprises that were the disk swap&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;zero&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Surprises that were everything &lt;em&gt;around&lt;/em&gt; the disk swap&lt;/td&gt;
          &lt;td&gt;five&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The control plane now runs on PM9A3s with power-loss protection and endurance I won&amp;rsquo;t have to think about for years. The structural failure mode from Part 1 — consumer NAND doing fsync-heavy etcd work with no PLP, on hosts with no UPS — is gone. Every consumer 990 PRO is wiped, serial confirmed, and bagged for the RMA.&lt;/p&gt;
&lt;h2 id=&#34;lessons&#34;&gt;Lessons
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A per-node reinstall on a hyperconverged cluster is a controlled demolition of that node&amp;rsquo;s local state.&lt;/strong&gt; Know — &lt;em&gt;before&lt;/em&gt; you start — exactly what lives on &lt;code&gt;/var&lt;/code&gt; and how you&amp;rsquo;ll bring it back. Replicated and NAS-backed data is free; single-copy node-local data is not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify disks by serial, never by &lt;code&gt;/dev/nvmeX&lt;/code&gt;.&lt;/strong&gt; Device names re-enumerate the moment you pull a drive. The serial follows the hardware.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cold-boot ordering is a real failure class.&lt;/strong&gt; If a daemon needs a network that comes up slowly (Thunderbolt, some SFP+), it&amp;rsquo;ll crash on a fresh node and recover on a restart. Don&amp;rsquo;t mistake it for a config error.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A bug you fixed by treating the symptom is not fixed.&lt;/strong&gt; Write the runbook &lt;em&gt;and&lt;/em&gt; schedule the root-cause work, or the symptom comes back at the worst time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quiesce before restoring into a volume a controller wants.&lt;/strong&gt; The kubelet doesn&amp;rsquo;t wait for your &lt;code&gt;tar&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The passthrough is chip-specific.&lt;/strong&gt; Check the USB bridge&amp;rsquo;s VID:PID before declaring a serial unreadable; and remember a &amp;ldquo;premium&amp;rdquo; C-to-C cable can still be USB 2.0 for data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The disk swap was the easy part. The education was in the blast radius. The one honest piece of unfinished business is finally migrating that CloudNativePG cluster off &lt;code&gt;owner: postgres&lt;/code&gt; so the password race can&amp;rsquo;t come back — backups are solid, the procedure&amp;rsquo;s validated, no more excuses. That might even be Part 4.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;update-the-blast-radius-had-a-seven-hour-fuse&#34;&gt;Update: the blast radius had a seven-hour fuse
&lt;/h2&gt;&lt;p&gt;A few hours after I&amp;rsquo;d declared victory, ntfy went off again: &lt;strong&gt;bazarr&amp;rsquo;s VolSync replication was out of date.&lt;/strong&gt; And here&amp;rsquo;s the thing — it traced straight back to section 1 of this very post, which I&amp;rsquo;d apparently failed to fully internalise about my own cluster.&lt;/p&gt;
&lt;p&gt;stanton-02 was the &lt;strong&gt;canary&lt;/strong&gt;. It was migrated &lt;em&gt;first&lt;/em&gt;, before I&amp;rsquo;d worked out the recreate-the-directory restore dance — I only built that step on stanton-01 and reused it on stanton-03. So stanton-02&amp;rsquo;s wiped OpenEBS cache directories were never recreated. They&amp;rsquo;d been gone the whole time.&lt;/p&gt;
&lt;p&gt;The reason it stayed invisible for seven hours is the sneaky part: &lt;strong&gt;VolSync cache PVCs only get mounted when a replication actually fires.&lt;/strong&gt; So nothing failed at migration time — it failed lazily, one mover at a time, as each app&amp;rsquo;s schedule came around and hit the same &lt;code&gt;path … does not exist&lt;/code&gt; wall I described up top. By the time I looked, six movers were silently wedged in &lt;code&gt;Init&lt;/code&gt; — bazarr, bazarr-foreign, metube, qui, romm, sonarr-uhd — and bazarr&amp;rsquo;s was simply the first stale-replication alert loud enough to notice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The symptom&lt;/strong&gt;: a VolSync &amp;ldquo;out of date&amp;rdquo; / out-of-sync alert hours after a node rebuild, with the mover pod stuck in &lt;code&gt;Init&lt;/code&gt; on &lt;code&gt;FailedMount: path &amp;quot;/var/openebs/local/pvc-…&amp;quot; does not exist&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: the same trick from section 1, applied to &lt;em&gt;every&lt;/em&gt; node-local volume on that node at once — recreate all the missing directories in a single pass (empty for caches, restored-from-backup for real data), then delete the wedged movers so they re-run. They all caught up immediately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The lesson&lt;/strong&gt; — the one I&amp;rsquo;m adding because I learned it the hard way &lt;em&gt;after&lt;/em&gt; publishing: when you wipe a hyperconverged node&amp;rsquo;s &lt;code&gt;/var&lt;/code&gt;, recreate &lt;strong&gt;all&lt;/strong&gt; of that node&amp;rsquo;s node-local directories right then, not just the ones actively complaining. The quiet ones aren&amp;rsquo;t fine — they&amp;rsquo;re just waiting for their next scheduled sync to bite you, and they&amp;rsquo;ll do it on a timer you didn&amp;rsquo;t set, hours after you&amp;rsquo;ve moved on. The blast radius doesn&amp;rsquo;t always go off at once.&lt;/p&gt;
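&lt;p&gt;Getting the full list, including the quiet ones, does not require waiting for pods to complain: the PV objects already record their path and node. A sketch, assuming the PVs are &lt;code&gt;local&lt;/code&gt; volumes whose first node-affinity term names the host (worth verifying against one of your own PVs first); both function names are hypothetical.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Sketch only: list every node-local PV path pinned to one node.
pv_paths_with_nodes() {
  # One "PATH NODE" line per PV; PVs without a local path show a placeholder.
  kubectl get pv --no-headers \
    -o custom-columns='PATH:.spec.local.path,NODE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]'
}

paths_for_node() {
  # stdin: "PATH NODE" lines; prints absolute paths that belong to node $1.
  awk -v n="$1" '$2 == n { if ($1 ~ /^\//) print $1 }'
}

# Hypothetical usage:
# pv_paths_with_nodes | paths_for_node stanton-02
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Comparing that list against what actually exists under &lt;code&gt;/var/openebs/local&lt;/code&gt; on the node shows every missing directory up front, not just the ones whose pods have already tried to mount.&lt;/p&gt;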
</description>
        </item>
        
    </channel>
</rss>
