“I thought BGP was supposed to solve this hairpin nonsense I had with L2?” — Me, staring at timeout errors
The Setup
Last week I migrated from Cilium L2 announcements to BGP for LoadBalancer IP advertisement. The migration went smoothly — BGP sessions established, routes advertised, services reachable. I patted myself on the back for solving the hairpin routing issues that plagued L2 announcements.
Then qui stopped working.
The Symptom
The qui pod (a qBittorrent web UI) was failing its startup probe:
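Nothing fancy about spotting it; the failed probes show up as events (run this in whatever namespace qui lives in):

```sh
# Failed startup probes appear as "Unhealthy" events on the qui pod
kubectl get events --sort-by=.lastTimestamp | grep -i qui
```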
Looking at the pod logs told the real story:
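Assuming qui runs as a plain Deployment (it does in my cluster):

```sh
# Tail the app logs; OIDC client initialization is where it hangs
kubectl logs deploy/qui --tail=50
```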
The app was timing out trying to reach my OIDC provider at id.nerdz.cloud. The health endpoint wasn’t responding because the app was stuck waiting for OIDC initialization.
The Investigation
My first instinct: test connectivity from the pod.
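The test itself was just a request to the OIDC discovery endpoint from inside the qui pod (this assumes curl exists in the container image):

```sh
# From inside the qui pod, which is scheduled on stanton-01
kubectl exec deploy/qui -- \
  curl -sv --max-time 5 https://id.nerdz.cloud/.well-known/openid-configuration
```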
Timeout.
But wait — from a debug pod on a different node:
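A throwaway curl pod pinned to another node does the job (the image and the nodeName override are just my go-to approach, nothing special):

```sh
# Same request, but from a pod forced onto stanton-02
kubectl run curl-debug --rm -it --restart=Never \
  --image=curlimages/curl:latest \
  --overrides='{"spec": {"nodeName": "stanton-02"}}' \
  --command -- curl -sv --max-time 5 https://id.nerdz.cloud/.well-known/openid-configuration
```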
Success. Full JSON response.
The pattern emerged: pods on stanton-01 couldn’t reach 10.99.8.202 (the LoadBalancer VIP for my internal gateway), but pods on stanton-02 and stanton-03 could.
What was special about stanton-01? Let’s check:
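Scheduling, mostly:

```sh
# What's running on stanton-01, and where envoy-internal landed
kubectl get pods -A -o wide --field-selector spec.nodeName=stanton-01
kubectl get pods -A -o wide | grep envoy-internal
```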
The envoy-internal pod — the backend for 10.99.8.202 — was running on stanton-01. Same node as the failing qui pod.
The Root Cause: DSR and Same-Node Hairpin
Cilium’s DSR (Direct Server Return) mode is great for external traffic:
- Client sends packet to LoadBalancer VIP
- Router (UDM Pro) forwards to a cluster node via BGP
- Cilium DNATs to the backend pod
- Backend sends response directly back to client (skipping the load balancer)
This preserves client IPs and reduces latency. But it breaks when the source and destination are on the same node.
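For context, the relevant pieces of my Cilium Helm values look roughly like this (heavily trimmed):

```yaml
# Cilium Helm values (excerpt): native routing + DSR + BGP control plane
routingMode: native
loadBalancer:
  mode: dsr
bgpControlPlane:
  enabled: true
```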
Here’s what happens when a pod on stanton-01 tries to reach 10.99.8.202:
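Roughly, as far as I could piece together:

```text
qui pod (stanton-01)
  → dst 10.99.8.202 (LoadBalancer VIP)    # nothing intercepts pod-originated VIP traffic
  → node routing table → UDM Pro          # the packet leaves stanton-01
  → BGP route for the VIP → cluster node  # the router hairpins it straight back
  → DSR DNAT → envoy-internal pod         # which is also on stanton-01
  → reply never reaches the qui pod       # the connection just times out
```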
The problem is that Cilium’s DSR mode doesn’t intercept traffic to LoadBalancer VIPs from pods — it’s designed for external traffic entering the cluster. The packet goes out to the router, the router sends it back, and something breaks in the return path.
This is a known Cilium limitation. The GitHub issue title says it all: “Pods are not able to reach Cilium-managed LoadBalancer IP.”
What I Tried (And What Didn’t Work)
Attempt 1: Socket LB
The first suggestion in the docs: enable socket-level load balancing.
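In Helm values, roughly:

```yaml
# Cilium Helm values (excerpt)
socketLB:
  enabled: true
```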
Socket LB intercepts connections at the connect() syscall, before packets ever hit the network. In theory, it should handle hairpin traffic. In practice, the same test from the qui pod still timed out.
Socket LB helps with ClusterIP services, but it doesn’t intercept ExternalIP/LoadBalancer traffic.
Attempt 2: Hybrid Mode
Maybe the problem is pure DSR. What about hybrid mode?
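Hybrid keeps DSR for TCP and falls back to SNAT for UDP:

```yaml
# Cilium Helm values (excerpt)
loadBalancer:
  mode: hybrid
```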
Nope. Still timing out.
Attempt 3: lbExternalClusterIP
There’s an option to allow cluster-internal access to external LB IPs:
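Which maps to:

```yaml
# Cilium Helm values (excerpt)
bpf:
  lbExternalClusterIP: true
```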
Also didn’t work.
At this point I’d tried everything in the Cilium docs. The problem is fundamental to how BGP + native routing + DSR interact. Pods simply can’t reach LoadBalancer VIPs via the normal packet path when the backend is on the same node.
The Solution: CoreDNS Rewriting
If I can’t fix the network path, I can fix the DNS. The key insight: pods don’t need to hit the LoadBalancer VIP — they can use the ClusterIP instead.
| Client | Should resolve to |
|---|---|
| External (internet) | LoadBalancer VIP 10.99.8.202 |
| Pod (internal) | ClusterIP 10.96.9.253 |
CoreDNS can make this happen with the template plugin:
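The Corefile block looks roughly like this (trimmed to the relevant plugin; the rest of the server block is the usual kubernetes / forward / cache setup):

```text
.:53 {
    # Answer A queries for these two names with the envoy-internal ClusterIP
    # instead of forwarding upstream
    template IN A internal.nerdz.cloud id.nerdz.cloud {
        answer "{{ .Name }} 300 IN A ${ENVOY_INTERNAL_CLUSTERIP}"
    }

    # ...kubernetes, forward, cache, etc. continue as before
}
```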
The ENVOY_INTERNAL_CLUSTERIP variable is defined in cluster-settings.yaml and substituted by Flux. That way, if the ClusterIP ever changes (e.g., the service is recreated), there's only one place to update.
The template matches queries for internal.nerdz.cloud and id.nerdz.cloud and returns the ClusterIP instead of forwarding to external DNS (which would return the LoadBalancer VIP).
Why Both Domains?
My DNS is set up with a CNAME:
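Conceptually, the external zone has:

```text
id.nerdz.cloud.        CNAME  internal.nerdz.cloud.
internal.nerdz.cloud.  A      10.99.8.202
```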
If I only intercept internal.nerdz.cloud, the CNAME lookup for id.nerdz.cloud goes to external DNS, which returns the CNAME, and then the A record lookup for internal.nerdz.cloud also goes to external DNS. The whole chain bypasses my template.
By intercepting both domains, I catch the query before it ever leaves the cluster.
Verification
After deploying the CoreDNS change:
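DNS first, from a throwaway pod:

```sh
# id.nerdz.cloud (and internal.nerdz.cloud) should now resolve to 10.96.9.253,
# the ClusterIP, rather than the LoadBalancer VIP
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup id.nerdz.cloud
```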
And the OIDC endpoint:
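This time from the qui pod itself, the same-node case that used to hang:

```sh
kubectl exec deploy/qui -- \
  curl -s --max-time 5 https://id.nerdz.cloud/.well-known/openid-configuration
```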
The qui pod now starts without timing out:
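A boring kubectl get is all it takes to confirm:

```sh
# Startup probe passes and the pod goes Ready instead of restarting
kubectl get pods | grep qui
```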
The Trade-off
This solution has a subtle implication: internal pod traffic to id.nerdz.cloud won’t hit the LoadBalancer layer. It goes directly to the envoy-internal ClusterIP.
In practice, this doesn’t matter — ClusterIP load balancing is still handled by Cilium, and there’s only one replica of envoy-internal anyway. But if you’re doing something fancy with LoadBalancer-level policies or metrics, be aware that pod traffic bypasses that layer.
Lessons Learned
- BGP fixes L2 hairpin, not DSR hairpin — L2 announcements had hairpin issues because one node “owned” the IP via ARP. BGP fixed that by routing through the router. But DSR mode has its own hairpin problem when source and backend share a node.
- ClusterIP always works — When in doubt, use ClusterIP for pod-to-service traffic. It’s what Kubernetes expects.
- DNS is a powerful escape hatch — When the network layer is fighting you, sometimes the answer is above the network layer. CoreDNS templates let you intercept and rewrite queries before they ever become packets.
- Test from the same node — My initial testing from different nodes missed the issue entirely. When debugging network problems, always test from the specific node(s) having issues.
- Read the GitHub issues — The Cilium issue tracker is full of people hitting this exact problem. I could have saved hours by searching for “pods cannot reach LoadBalancer IP” earlier.
The Fix in Context
| Problem | Solution |
|---|---|
| L2 hairpin (ARP ownership) | Migrate to BGP |
| DSR same-node hairpin | CoreDNS rewrite to ClusterIP |
| General pod-to-LB access | Use ClusterIP or service DNS |
BGP was the right call for L2 hairpin. But DSR mode introduced a new hairpin scenario that BGP doesn’t solve. CoreDNS rewriting is the cleanest workaround I’ve found — it’s targeted, doesn’t require Cilium changes, and works regardless of pod scheduling.
References
- Cilium Issue #39198 — Pods are not able to reach Cilium-managed LoadBalancer IP
- Cilium DSR Documentation — How DSR mode works
- CoreDNS Template Plugin — Synthesizing DNS responses
- From L2 to BGP — The migration that started this journey
This post documents a debugging session with help from Claude Code. The CoreDNS workaround is now part of my home-ops repository.