“Your 100MB of data living in a 250MB file is not a feature.”
The Alert
Got this from AlertManager:
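(The alert body got eaten by formatting. It was a fragmentation warning; the stock etcd-mixin rule for this condition looks roughly like the one below, so if you run kube-prometheus you likely have something similar. The exact expression and annotations vary by version.)

```yaml
# Approximate etcd-mixin rule: fires when less than half of the
# allocated database file actually holds live data
- alert: etcdDatabaseHighFragmentationRatio
  expr: |
    (last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes[5m])
      / last_over_time(etcd_mvcc_db_total_size_in_bytes[5m])) < 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    description: >-
      etcd instance {{ $labels.instance }} is using less than 50% of its
      allocated database file; run defragmentation to reclaim the space.
```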
Time to learn what etcd fragmentation actually means.
Why Does etcd Fragment?
etcd stores all Kubernetes state: every pod, deployment, secret, and ConfigMap lives here. Under the hood, it uses a B+ tree key-value store (bbolt) for the data itself, plus an append-only write-ahead log (WAL) for crash recovery.
Here’s the key insight: when you delete or update a key in etcd, the old data isn’t removed from disk. It’s just marked as free space. The database file grows but never shrinks on its own.
Three things constantly churn etcd in a Kubernetes cluster:
- Pod scheduling: Every pod creation, update, and deletion writes to etcd
- Controller loops: Controllers constantly reconcile actual state against desired state, which means constant writes
- Lease renewals: Kubelet heartbeats, leader elections, and endpoint updates
The Kubernetes API server runs automatic compaction, which removes old revisions of keys (you don’t need 1000 historical versions of a ConfigMap). But compaction just marks space as reusable—it doesn’t actually free it.
Over time, your database file becomes Swiss cheese: actual data scattered among holes of freed space. This is fragmentation.
Checking the Damage
In Talos, checking etcd status is straightforward:
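A sketch of the check (substitute your own control plane IPs):

```bash
# Ask every control plane node for its etcd member status; the output
# includes DB SIZE (file size on disk) and IN USE (live data), plus the
# MEMBER and LEADER IDs used below
talosctl -n <cp-1>,<cp-2>,<cp-3> etcd status
```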
All three control plane nodes were using less than 50% of their allocated space. The rest? Fragmented free space doing nothing but wasting disk I/O.
The Fix
Defragmentation rewrites the database file compactly, eliminating the holes. In Talos, it’s a single command.
Step 1: Snapshot First
Paranoia is healthy when touching cluster state:
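Any healthy member can serve the snapshot; the node IP and filename below are placeholders:

```bash
# Stream a consistent snapshot of the etcd database to a local file
talosctl -n <control-plane-IP> etcd snapshot ./etcd.snapshot
```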
This creates a consistent backup you can restore from if something goes wrong.
Step 2: Defrag Each Node Sequentially
Defrag briefly blocks reads and writes on that node, so you want to do one at a time. Best practice: non-leader nodes first, leader last.
To find the leader, look at the LEADER column in the status output. It shows the member ID of the current leader (f0f9525a77920d83). Then match that to the MEMBER column to find which node it is:
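Reconstructed from my notes (the other two members' IDs and IPs are stand-ins, and the remaining columns are elided):

```
NODE          MEMBER             ...   LEADER
10.90.3.x     <member-id-1>      ...   f0f9525a77920d83
10.90.3.x     <member-id-2>      ...   f0f9525a77920d83
10.90.3.102   f0f9525a77920d83   ...   f0f9525a77920d83
```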
Node 10.90.3.102 has member ID f0f9525a77920d83, which matches the leader ID. So that’s our leader.
Now defrag:
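Something like this, with the two non-leader IPs as placeholders:

```bash
# Non-leaders first; defrag briefly blocks reads and writes on that member
talosctl -n <non-leader-1> etcd defrag
talosctl -n <non-leader-2> etcd defrag

# Leader last
talosctl -n 10.90.3.102 etcd defrag
```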
Each defrag takes just a few seconds for a database this size.
Step 3: Verify
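Same status command as before:

```bash
# After defrag, each node should report DB SIZE equal to IN USE (100%)
talosctl -n <cp-1>,<cp-2>,<cp-3> etcd status
```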
~160MB reclaimed per node. 100% utilization means zero fragmentation.
When to Defrag
The general guidance:
- Below 50% utilization: Defrag recommended
- NOSPACE errors: Defrag required (etcd will refuse writes when it hits its quota)
For homelabs, waiting for the Prometheus alert is fine. Production clusters might schedule it monthly or watch the metric more closely.
What If You Hit NOSPACE?
If etcd hits its space quota before you defrag, it raises a NOSPACE alarm and stops accepting writes to protect data integrity. You'll need to:
- Defrag to free space
- Clear the alarm: `talosctl etcd alarm disarm`
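Assuming a Talos version with the etcd alarm subcommands, the whole recovery sequence is three commands (node IP is a placeholder):

```bash
# Confirm the NOSPACE alarm is what's blocking writes
talosctl -n <control-plane-IP> etcd alarm list

# Reclaim space on each control plane node (leader last, as above)
talosctl -n <control-plane-IP> etcd defrag

# Clear the alarm so etcd accepts writes again
talosctl -n <control-plane-IP> etcd alarm disarm
```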
If you’re consistently hitting the quota, you can increase it in your Talos machine config:
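The underlying knob is etcd's quota-backend-bytes flag, passed through extraArgs. A sketch (4 GiB here is an example value; etcd's default quota is 2 GiB, and upstream advises staying under 8 GiB):

```yaml
# controlplane machine config excerpt
cluster:
  etcd:
    extraArgs:
      # 4 GiB, up from the 2 GiB default
      quota-backend-bytes: "4294967296"
```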