<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Hardware on Nerdz</title>
        <link>https://blog.nerdz.cloud/categories/hardware/</link>
        <description>Recent content in Hardware on Nerdz</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <copyright>Gavin McFall</copyright>
        <lastBuildDate>Tue, 05 May 2026 00:00:00 +1200</lastBuildDate><atom:link href="https://blog.nerdz.cloud/categories/hardware/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>The Slow Death of Three Samsung 990 PROs</title>
        <link>https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-1/</link>
        <pubDate>Tue, 05 May 2026 00:00:00 +1200</pubDate>
        
        <guid>https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-1/</guid>
        <description>&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;PASSED&amp;rdquo; doesn&amp;rsquo;t mean what you think it means.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;the-2am-alert-storm&#34;&gt;The 2am alert storm
&lt;/h2&gt;&lt;p&gt;My phone exploded with critical alerts overnight. ntfy was angry. Alertmanager was angry. Ceph was angry. Sonarr was rolling back to a state that didn&amp;rsquo;t exist. Prometheus had even disconnected itself from Alertmanager — the canary alert that fires when alerting itself is broken.&lt;/p&gt;
&lt;p&gt;The root cause was buried under several cascading consequences, but the actual finding was simple: the NVMe holding &lt;code&gt;/var&lt;/code&gt; on stanton-02 had stopped responding to interrupts. The kernel logged &lt;code&gt;Disabling IRQ #173&lt;/code&gt; and gave up. Ceph&amp;rsquo;s &lt;code&gt;osd.1&lt;/code&gt; went &lt;code&gt;down&lt;/code&gt;, &lt;code&gt;mon.b&lt;/code&gt; lost its RocksDB, and the whole cluster started swimming.&lt;/p&gt;
&lt;p&gt;Over the next several hours of recovery, I confirmed that this wasn&amp;rsquo;t a one-off. It was the latest event in a documented escalation that had been building for weeks. The Samsung 990 PRO 1TB on stanton-02 had been throwing controller-fatal-status events since mid-April — six events over 19 days, each interval shorter than the last, until the kernel finally gave up on the IRQ and walked away.&lt;/p&gt;
&lt;p&gt;But that&amp;rsquo;s not the interesting bit. The interesting bit is that the drive&amp;rsquo;s two siblings — same batch, same firmware, same workload, in the same cluster — are also dying. Just at different rates.&lt;/p&gt;
&lt;p&gt;This is part 1 of 2. Part 1 is the autopsy. Part 2 will come when the new drives arrive.&lt;/p&gt;
&lt;h2 id=&#34;the-cluster&#34;&gt;The cluster
&lt;/h2&gt;&lt;p&gt;Three &lt;a class=&#34;link&#34; href=&#34;https://store.minisforum.com/products/minisforum-ms-01&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Minisforum MS-01&lt;/a&gt; mini-PCs, each running Talos Linux as a Kubernetes control-plane node. Cluster name: stanton &lt;em&gt;(cf. &lt;a class=&#34;link&#34; href=&#34;https://robertsspaceindustries.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Star Citizen&lt;/a&gt;)&lt;/em&gt;. Each box has two NVMe drives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;nvme0&lt;/strong&gt; — Samsung PM9A3 1.92TB. Enterprise-class with PLP. Rook-Ceph OSD storage. Rock solid.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;nvme1&lt;/strong&gt; — Samsung 990 PRO 1TB. &lt;strong&gt;Consumer&lt;/strong&gt;. Holds &lt;code&gt;/var&lt;/code&gt;: etcd WAL, Ceph mon RocksDB, container logs, kubelet state.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can probably already see where this is going.&lt;/p&gt;
&lt;p&gt;I bought the three 990 PROs as a matched batch from Amazon AU&amp;rsquo;s Global Store on 2024-06-18. Same firmware revision (&lt;code&gt;4B2QJXD7&lt;/code&gt; — the one Samsung released after the &lt;a class=&#34;link&#34; href=&#34;https://www.tomshardware.com/news/samsung-issues-firmware-update-for-990-pro-ssds&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;990 PRO firmware-killing-itself bug&lt;/a&gt;, so we&amp;rsquo;re not even talking about THAT problem). They went into production almost immediately and have been running 24/7 for ~22 months.&lt;/p&gt;
&lt;p&gt;The workload on &lt;code&gt;/var&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;etcd WAL&lt;/strong&gt; — Every Kubernetes API write. Pod scheduling, controller reconciliation, kubelet leases. Constant fsync.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ceph mon RocksDB&lt;/strong&gt; — Cluster state churn. Constant tiny writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container runtime overlay&lt;/strong&gt; — Image extraction, log writes, layer state.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fsync-heavy. Small-block-write heavy. The exact opposite of what consumer SSDs are tuned for.&lt;/p&gt;
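&lt;p&gt;If you want to see what this pattern does to a disk, the fio recipe the etcd docs use to qualify disks reproduces it: small sequential writes with an fdatasync after every single one. A minimal sketch (the &lt;code&gt;--size&lt;/code&gt; and &lt;code&gt;--bs&lt;/code&gt; values come from that recipe and approximate etcd&amp;rsquo;s WAL behaviour; point &lt;code&gt;--directory&lt;/code&gt; at the drive under test):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# emulate the etcd WAL pattern: small writes, an fdatasync after each one.
# fio reports the fdatasync latency percentiles that matter for etcd.
fio --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/tmp/fio-test --size=22m --bs=2300 --name=etcd-wal
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The usual guidance is that etcd wants the 99th-percentile fdatasync latency under roughly 10ms. A healthy drive passes that easily; the question this post keeps circling is what it costs the NAND to keep passing it.&lt;/p&gt;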
&lt;h2 id=&#34;the-autopsy&#34;&gt;The autopsy
&lt;/h2&gt;&lt;p&gt;After getting the cluster back to HEALTH_OK, I pulled SMART data off all three nvme1 drives. Same command on each:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;kubectl debug node/stanton-XX --image&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;alpine --profile&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;sysadmin -- &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  sh -c &lt;span class=&#34;s2&#34;&gt;&amp;#34;apk add -q smartmontools &amp;amp;&amp;amp; smartctl -a /dev/nvme1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here&amp;rsquo;s the comparison:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Metric&lt;/th&gt;
          &lt;th&gt;stanton-01&lt;/th&gt;
          &lt;th&gt;stanton-02 (failed)&lt;/th&gt;
          &lt;th&gt;stanton-03&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Serial&lt;/td&gt;
          &lt;td&gt;S73VNU0X303066H&lt;/td&gt;
          &lt;td&gt;S73VNU0X303413H&lt;/td&gt;
          &lt;td&gt;S73VNU0X303400H&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Firmware&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;4B2QJXD7&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;4B2QJXD7&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;4B2QJXD7&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Power-On Hours&lt;/td&gt;
          &lt;td&gt;15,856&lt;/td&gt;
          &lt;td&gt;15,864&lt;/td&gt;
          &lt;td&gt;15,867&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Percentage Used&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;42%&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;47%&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Data Units Written&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;96.3 TB&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;112 TB&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;133 TB&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Power Cycles&lt;/td&gt;
          &lt;td&gt;83&lt;/td&gt;
          &lt;td&gt;35&lt;/td&gt;
          &lt;td&gt;38&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Unsafe Shutdowns&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;37 (45% of cycles)&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;15 (43% of cycles)&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;13 (34% of cycles)&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Critical Warning&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;0x00&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;0x00&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;0x00&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Media &amp;amp; Data Integrity Errors&lt;/td&gt;
          &lt;td&gt;0&lt;/td&gt;
          &lt;td&gt;0&lt;/td&gt;
          &lt;td&gt;0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Available Spare&lt;/td&gt;
          &lt;td&gt;100%&lt;/td&gt;
          &lt;td&gt;100%&lt;/td&gt;
          &lt;td&gt;100%&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;SMART Self-Test&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;PASSED&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;PASSED&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;PASSED&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Temperature&lt;/td&gt;
          &lt;td&gt;54°C&lt;/td&gt;
          &lt;td&gt;53°C&lt;/td&gt;
          &lt;td&gt;53°C&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Three things should jump out.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, all three drives &amp;ldquo;PASSED&amp;rdquo; the self-test. The drive that just died with a kernel-level IRQ-disable failure says it&amp;rsquo;s healthy. So does the one with 50% wear. So does the one I haven&amp;rsquo;t even seen flap yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, stanton-03 has more wear (50%) than the drive that just died (47%). It&amp;rsquo;s next in line.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Third&lt;/strong&gt;, the wear math doesn&amp;rsquo;t add up. The 990 PRO 1TB has a 600 TBW endurance rating. stanton-03 has written 133 TB — 22% of its rated endurance — but reports 50% used. The drives are wearing &lt;strong&gt;roughly twice as fast as host writes alone would suggest.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That last one is the actually interesting story.&lt;/p&gt;
&lt;h2 id=&#34;why-is-the-wear-accelerating&#34;&gt;Why is the wear accelerating?
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Percentage Used&lt;/code&gt; in NVMe SMART data isn&amp;rsquo;t a measurement of how many host writes you&amp;rsquo;ve done. It&amp;rsquo;s the drive&amp;rsquo;s own estimate of how much of its &lt;strong&gt;internal NAND endurance reserve&lt;/strong&gt; has been consumed.&lt;/p&gt;
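&lt;p&gt;You can put a rough number on that. If the 600 TBW rating maps linearly onto &lt;code&gt;Percentage Used&lt;/code&gt; (an assumption; the spec only calls the field a vendor estimate), stanton-03&amp;rsquo;s figures imply the NAND has absorbed about 2.3× the writes the host actually issued:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# implied wear factor for stanton-03: 50% of a 600 TBW rating consumed
# versus the 133 TB of host writes SMART has recorded
awk &amp;#39;BEGIN {
  consumed_tb = 600 * 0.50   # endurance spent, in TB-equivalents
  host_tb     = 133          # host writes, from Data Units Written
  printf &amp;#34;implied wear factor: %.1fx\n&amp;#34;, consumed_tb / host_tb
}&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;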
&lt;p&gt;For consumer drives, the gap between &amp;ldquo;host writes&amp;rdquo; and &amp;ldquo;NAND wear&amp;rdquo; gets large when you have:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Small random writes&lt;/strong&gt; — etcd does an fsync after every write. The drive can&amp;rsquo;t batch these, so it ends up programming a page for each tiny write and then re-writing those partially filled pages constantly to maintain durability semantics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No power-loss protection&lt;/strong&gt; — every unclean shutdown forces the drive to discard in-flight write buffers and rebuild from journal, which means re-writing pages the drive thought it could batch. Wear amplification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mixed read/write pages&lt;/strong&gt; — when read traffic and write traffic share NAND blocks, the drive shuffles data around to keep cells in spec. All extra writes the host never asked for.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each of those happens constantly under an etcd + mon workload.&lt;/p&gt;
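&lt;p&gt;The host-side symptom to watch is etcd&amp;rsquo;s own fsync latency, which etcd exports as the &lt;code&gt;etcd_disk_wal_fsync_duration_seconds&lt;/code&gt; histogram. A sketch of the Prometheus query (the hostname is a stand-in for wherever your Prometheus lives):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# p99 etcd WAL fsync latency over the last 5m; rising latency here
# while SMART still says PASSED is the kind of warning SMART never gives
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode &amp;#39;query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;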
&lt;p&gt;The unsafe-shutdown counter is the nail in the coffin. Across the three drives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;stanton-01: &lt;strong&gt;37 unsafe shutdowns out of 83 cycles&lt;/strong&gt; (45%)&lt;/li&gt;
&lt;li&gt;stanton-02: &lt;strong&gt;15 unsafe shutdowns out of 35 cycles&lt;/strong&gt; (43%)&lt;/li&gt;
&lt;li&gt;stanton-03: &lt;strong&gt;13 unsafe shutdowns out of 38 cycles&lt;/strong&gt; (34%)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I don&amp;rsquo;t have a UPS. The cluster has weathered multiple powercuts since I built it, plus the occasional kernel-level reboot under stress. Every one of those is a little bit of write-amp punishment to a drive that has no capacitors to flush its DRAM cache to NAND.&lt;/p&gt;
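&lt;p&gt;Worth checking on your own fleet. The two counters that tell the story, assuming &lt;code&gt;nvme-cli&lt;/code&gt; is on the box (the field names below match its human-readable smart-log output):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# a high unsafe-shutdown-to-power-cycle ratio on a drive without PLP
# means each of those events cost endurance, not just uptime
nvme smart-log /dev/nvme1 | grep -Ei &amp;#39;power_cycles|unsafe_shutdowns&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;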
&lt;h2 id=&#34;power-loss-protection--what-consumer-nvme-doesnt-have&#34;&gt;Power Loss Protection — what consumer NVMe doesn&amp;rsquo;t have
&lt;/h2&gt;&lt;p&gt;Enterprise NVMe drives have a row of tantalum capacitors on the PCB. When the host yanks power, those caps hold the drive alive just long enough to flush its DRAM write buffer to flash. Result: no data loss, no in-flight pages stuck in limbo, no journal-replay amp on the next boot.&lt;/p&gt;
&lt;p&gt;Consumer NVMe drives do not have those capacitors. Cost-cut. The 990 PRO is a consumer drive. So is the SN850X. So is anything you&amp;rsquo;d buy at a big-box store with &amp;ldquo;Pro&amp;rdquo; in the name.&lt;/p&gt;
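&lt;p&gt;There&amp;rsquo;s a quick way to spot the difference from software, as a heuristic rather than a guarantee: drives with hardware PLP typically advertise no volatile write cache, because they don&amp;rsquo;t need the host to issue flushes at all:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# compare the volatile write cache (vwc) field across the two drives;
# consumer drives report a volatile cache, PLP drives usually do not
nvme id-ctrl /dev/nvme0 | grep -i vwc
nvme id-ctrl /dev/nvme1 | grep -i vwc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;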
&lt;p&gt;When a consumer drive loses power mid-write:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In-flight writes that were in DRAM are gone. The host&amp;rsquo;s write-cache thinks they hit NAND, but they didn&amp;rsquo;t.&lt;/li&gt;
&lt;li&gt;On next boot, the drive replays its journal to figure out which pages are valid and which are torn.&lt;/li&gt;
&lt;li&gt;That replay re-writes a lot of pages &amp;ldquo;to be safe.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;All of which counts against your NAND endurance reserve.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why enterprise SSD specs say things like &amp;ldquo;0.4 DWPD&amp;rdquo; or &amp;ldquo;1 DWPD&amp;rdquo; or &amp;ldquo;3 DWPD&amp;rdquo; — Drive Writes Per Day, sustained for the warranty period (usually 5 years). The 990 PRO&amp;rsquo;s spec is &lt;code&gt;600 TBW over 5 years&lt;/code&gt;, which works out to about 0.33 DWPD if you do the math. &lt;strong&gt;That assumes a clean workload with no powercut amplification.&lt;/strong&gt;&lt;/p&gt;
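&lt;p&gt;The arithmetic behind that 0.33, for the sceptical:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# 600 TBW spread across a 5-year warranty on a 1 TB drive
awk &amp;#39;BEGIN { printf &amp;#34;%.2f DWPD\n&amp;#34;, 600 / (5 * 365 * 1) }&amp;#39;   # prints 0.33
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;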
&lt;p&gt;What I have is consumer drives, with no PLP, doing fsync-heavy etcd workloads, on hosts with no UPS, in a region that has the occasional powercut. Of course the wear is accelerating.&lt;/p&gt;
&lt;h2 id=&#34;what-smart-didnt-tell-me&#34;&gt;What SMART didn&amp;rsquo;t tell me
&lt;/h2&gt;&lt;p&gt;The most maddening thing about this whole episode is that the drive&amp;rsquo;s &amp;ldquo;PASSED&amp;rdquo; self-test was technically correct, right up until it wasn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;NVMe SMART tracks things like media errors, temperature excursions, and the available-spare counter. None of those tripped. The drive on stanton-02 is still reporting 100% Available Spare and 0 Media Errors as of writing. It also happens to be unable to respond to interrupts anymore.&lt;/p&gt;
&lt;p&gt;The actual signal of impending failure was buried in the kernel log — six controller-fatal-status events over 19 days, with the gap between events shrinking each time:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x11
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;CSTS=0x3&lt;/code&gt; means the drive&amp;rsquo;s own controller is asserting &lt;strong&gt;fatal status&lt;/strong&gt; on itself. That&amp;rsquo;s the drive saying &amp;ldquo;something is wrong with me, please reset me.&amp;rdquo; The kernel resets it, the drive comes back up, and SMART still says PASSED because by the spec, none of the threshold-based metrics have been crossed.&lt;/p&gt;
&lt;p&gt;The escalation timeline:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;#&lt;/th&gt;
          &lt;th&gt;Date (UTC)&lt;/th&gt;
          &lt;th&gt;Failure mode&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;1&lt;/td&gt;
          &lt;td&gt;2026-04-15&lt;/td&gt;
          &lt;td&gt;Soft CFS reset, auto-recovered&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2&lt;/td&gt;
          &lt;td&gt;2026-04-20&lt;/td&gt;
          &lt;td&gt;Soft CFS reset, auto-recovered&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;3&lt;/td&gt;
          &lt;td&gt;2026-04-21&lt;/td&gt;
          &lt;td&gt;Soft CFS reset, auto-recovered&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;4&lt;/td&gt;
          &lt;td&gt;2026-04-25&lt;/td&gt;
          &lt;td&gt;Soft CFS reset, auto-recovered&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;5&lt;/td&gt;
          &lt;td&gt;2026-04-30&lt;/td&gt;
          &lt;td&gt;Soft CFS reset, auto-recovered&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;6&lt;/td&gt;
          &lt;td&gt;2026-05-04&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;IRQ disabled, no auto-recovery&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The pattern is &amp;ldquo;drive needs increasingly frequent kicks until eventually the kernel gives up on it.&amp;rdquo; None of which shows up in &lt;code&gt;smartctl --health&lt;/code&gt;.&lt;/p&gt;
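&lt;p&gt;So the kernel log, not SMART, is the surface to monitor. Talos doesn&amp;rsquo;t hand out node shells, but &lt;code&gt;talosctl dmesg&lt;/code&gt; gets at the same data. A rough sketch of a check you could run on a schedule (node names are this cluster&amp;rsquo;s; tune the pattern to taste):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# scan the kernel ring buffer on each node for the NVMe controller
# events that SMART never surfaces; any hit is a drive on the way out
for node in stanton-01 stanton-02 stanton-03; do
  if talosctl -n $node dmesg | grep -qE &amp;#39;controller is down|Disabling IRQ&amp;#39;; then
    echo WARN: NVMe controller events on $node
  fi
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;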
&lt;h2 id=&#34;side-effects-when-the-cluster_network-is-on-the-same-node&#34;&gt;Side effects when the cluster_network is on the same node
&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s the bit that turned an annoying-but-recoverable single-drive failure into a 4-hour cluster-wide incident: stanton-02 also runs one of three Ceph monitors AND one of three OSDs. The 990 PRO holds &lt;code&gt;/var&lt;/code&gt; (mon RocksDB), and the PM9A3 in nvme0 holds the OSD bluestore data. When the 990 PRO died, mon.b went silent, but the OSD itself was still up.&lt;/p&gt;
&lt;p&gt;Then I rebooted the node to get the drive back. The reboot killed the Thunderbolt ring that Ceph uses for &lt;code&gt;cluster_network&lt;/code&gt; traffic — a documented MS-01 quirk where the second TB port doesn&amp;rsquo;t always re-enumerate after a warm boot. So when the node came back, OSDs were &lt;code&gt;up,in&lt;/code&gt; per Ceph, but &lt;code&gt;osd.1&lt;/code&gt; and &lt;code&gt;osd.2&lt;/code&gt; couldn&amp;rsquo;t actually talk to each other over the cluster network. PGs got stuck &lt;code&gt;peering&lt;/code&gt; for an hour while traffic spilled to &lt;code&gt;public_network&lt;/code&gt; and the slow-heartbeat alarms climbed past 500 seconds.&lt;/p&gt;
&lt;p&gt;I wrote up the Thunderbolt fix separately — kernel arg &lt;code&gt;thunderbolt.host_reset=0&lt;/code&gt; baked into a custom factory.talos.dev schematic — but it&amp;rsquo;s worth mentioning here because it&amp;rsquo;s the failure-mode amplifier. &lt;strong&gt;A single dying disk wouldn&amp;rsquo;t have caused a cluster-wide alert storm if my Ceph cluster network wasn&amp;rsquo;t running over Thunderbolt cables that don&amp;rsquo;t always come back up after a reboot.&lt;/strong&gt; Two unrelated weaknesses combined into one bad night.&lt;/p&gt;
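&lt;p&gt;For reference, baking a kernel arg into a Talos install image goes through the Image Factory&amp;rsquo;s schematics endpoint: POST a customization document, get back a schematic ID, and use that ID in the installer image URL. A sketch of the request as I understand the schematic format (verify against the factory docs):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# request a schematic ID with the Thunderbolt workaround baked in
curl -s -X POST --data-binary @- https://factory.talos.dev/schematics &amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
customization:
  extraKernelArgs:
    - thunderbolt.host_reset=0
EOF
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;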
&lt;h2 id=&#34;what-im-doing-about-it&#34;&gt;What I&amp;rsquo;m doing about it
&lt;/h2&gt;&lt;p&gt;After confirming the failure was real and ongoing, I went back to Amazon AU. The drive had 38 months left on a 5-year warranty, the failure mode is documented in dmesg with timestamps and serial numbers, and the SMART screenshots showed the wear/unsafe-shutdown picture clearly. Amazon&amp;rsquo;s Global Store rep was sympathetic.&lt;/p&gt;
&lt;p&gt;To my surprise, they refunded the &lt;strong&gt;full cost of all three drives&lt;/strong&gt; — not just the failing one. Getting recognition that a same-batch matched set is going to fail in similar ways was a nicer outcome than I expected.&lt;/p&gt;
&lt;p&gt;Now I&amp;rsquo;m shopping for replacements. The path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enterprise NVMe with hardware PLP&lt;/strong&gt; — non-negotiable. The whole point is to remove the consumer-NAND-on-server-workload mismatch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M.2 22110 form factor&lt;/strong&gt; — fits the MS-01&amp;rsquo;s slots 2 and 3. The PM9A3 already in nvme0 has been rock solid; putting more of the same family in nvme1 keeps the cluster homogeneous.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;At least 1 DWPD endurance class&lt;/strong&gt; — overkill for my measured 180 GB/day write rate (~0.18 DWPD on a 1TB drive; see the sketch after this list for one way to measure it), but every doubling of headroom is insurance against future workload growth.&lt;/li&gt;
&lt;/ul&gt;
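&lt;p&gt;Measuring that write rate is just differencing SMART&amp;rsquo;s &lt;code&gt;Data Units Written&lt;/code&gt; over a day. A rough sketch (one NVMe data unit is 1,000 × 512 bytes per the spec; the awk field index assumes smartctl&amp;rsquo;s usual output layout):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# sample Data Units Written twice, 24h apart, and convert to GB/day;
# smartctl prints the counter with thousands separators, hence the gsub
duw() { smartctl -a /dev/nvme1 | awk &amp;#39;/Data Units Written/ {gsub(/,/,&amp;#34;&amp;#34;); print $4}&amp;#39;; }
a=$(duw); sleep 86400; b=$(duw)
awk -v a=$a -v b=$b &amp;#39;BEGIN { printf &amp;#34;%.0f GB/day\n&amp;#34;, (b - a) * 512000 / 1e9 }&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;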
&lt;p&gt;The shortlist I&amp;rsquo;ve narrowed it to is &lt;strong&gt;Samsung PM9A3 M.2 22110 960GB&lt;/strong&gt; (NEW from a Chinese eBay seller at ~AU$554 each) or &lt;strong&gt;Micron 7450 PRO 480GB&lt;/strong&gt; (new retail, but the NZ pricing is eye-watering). The math + budget pushed me toward the PM9A3 — it matches the drive that&amp;rsquo;s been working flawlessly on the same cluster for 22 months.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s where Part 2 comes in. New drives, installation, performance comparison, the burn-in protocol, and the real test: whether enterprise PLP actually fixes the failure mode I&amp;rsquo;ve documented here, or whether the MS-01&amp;rsquo;s chassis is going to throw new and unexpected thermal headaches at me with 8.2W enterprise drives in slots designed for 5W consumer parts.&lt;/p&gt;
&lt;h2 id=&#34;lessons-so-far&#34;&gt;Lessons so far
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&amp;ldquo;PASSED&amp;rdquo; SMART status is necessary but not sufficient.&lt;/strong&gt; Watch the kernel log for &lt;code&gt;CSTS=0x3&lt;/code&gt; and similar; SMART&amp;rsquo;s threshold-based metrics will lag behind the actual drive health by months.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer NVMe under etcd workload is a category error.&lt;/strong&gt; Even on a homelab, if the drive holds &lt;code&gt;/var&lt;/code&gt; for a Kubernetes control-plane, it&amp;rsquo;s doing enterprise work. Buy enterprise.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The &lt;code&gt;Percentage Used&lt;/code&gt; metric tells you the truth.&lt;/strong&gt; When it&amp;rsquo;s growing roughly 2× faster than &lt;code&gt;Data Units Written ÷ TBW&lt;/code&gt; would predict, your drive is wearing out faster than spec, and you need to plan for replacement before the controller events start.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PLP is the structural fix.&lt;/strong&gt; A UPS helps with powercuts but doesn&amp;rsquo;t fix the fsync-amp problem on consumer NAND.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Same-batch drives die together.&lt;/strong&gt; If one drive in a matched set fails, pull SMART on all of them. They&amp;rsquo;ll be on the same trajectory. In my case, the most-worn drive isn&amp;rsquo;t the one that failed first — it&amp;rsquo;s the one I haven&amp;rsquo;t seen flap yet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Architectural single-points-of-pain compound.&lt;/strong&gt; A drive failure on its own is recoverable. A drive failure plus a fragile cluster_network on the same node is a bad night. Audit your dependencies before you have to.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Part 2 incoming when the new drives arrive. Until then I&amp;rsquo;m running on borrowed time on stanton-03 (the 50%-wear sibling). Coffee in hand, alert thresholds tightened, Renovate auto-merge disabled on Ceph until the swap is done.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
