“In production you do not rise to the level of your architecture. You fall to the level of the assumptions you forgot you made.”
At some point on New Year’s Eve, I had three separate tabs open:
- A Wireshark capture showing SYN retransmissions.
- Cilium logs complaining about Service frontends.
- A Proxmox console that would load, freeze, and then disconnect like it had somewhere better to be.
This was supposed to be a simple win.
I was trying to make my ingress feel healthier and more responsive by spreading traffic across multiple Traefik pods. My networking gear does not support BGP, so I went with the next best thing in a homelab: multiple L2-announced VIPs and DNS round robin.
It worked just long enough to build confidence.
Then it started flapping in the most frustrating way possible: retries would succeed, dashboards would lie, and the same IP:port would alternate between working and not.
TL;DR
- The real outage was caused by stateful conntrack filtering on the hypervisors dropping asymmetric return traffic as INVALID.
- The fix was enabling `nf_conntrack_allow_invalid=1` on every hypervisor, which stopped the Proxmox firewall from dropping conntrack INVALID packets.
- I also created a second problem that compounded debugging: a Local Redirect Policy experiment caused Service frontend ownership conflicts, so Cilium refused to reconcile the real LoadBalancer frontends until I removed the policy and forced a resync.
- DNS round robin did not cause the break, but it multiplied the number of packet paths, which multiplied how often clients hit the bad one.
If the TCP handshake flaps, do not start by blaming apps or TLS. Start by proving where SYN and SYN-ACK are going.
Homelab landscape (setting the stage)
Before we autopsy the failure, here is the sandbox in generalized form:
| Layer | Details |
|---|---|
| Platform | Kubernetes on an immutable OS (Talos), GitOps reconciliation (ArgoCD) |
| Networking | Cilium as kube-proxy replacement, tunnel mode with Geneve, WireGuard node encryption enabled, L2 announcements enabled |
| Ingress | Traefik as a DaemonSet (one pod per worker), with per-node LoadBalancer VIPs |
| DNS | LAN DNS forwarder (AdGuard) + CoreDNS authoritative zone for an internal ingress subdomain with A record round robin |
| Virtualization | Proxmox hypervisors with datacenter firewall enabled (stateful conntrack on the bridge) |
This combination is important because it blends three worlds:
- DNS-based distribution
- L2 VIP ownership
- Overlay forwarding and DSR-style return paths
Each of those has its own rules. The outage happened where those rules disagreed.
The intended design
I was trying to approximate “anycast ingress” without BGP.
Not real anycast. More like “three front doors instead of one”.
The DNS model
LAN clients resolve app hostnames through a DNS forwarder. That forwarder points a friendly name to an internal ingress zone that CoreDNS serves.
CoreDNS serves an A record set like:
- VIP-A
- VIP-B
- VIP-C
The answers are round robin (or at least shuffled), so different clients should naturally pick different VIPs.
Here is the important mental model:
DNS is a phonebook.
It does not deliver packets. It hands out addresses and walks away.
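To make the phonebook idea concrete, here is a minimal Python sketch of a rotating record set. It assumes the server rotates its answers by exactly one position per query and that every client connects to the first address it receives. Real resolvers and clients may shuffle, cache, or choose differently, and the function names are mine, not anything CoreDNS exposes.

```python
from collections import Counter, deque


def round_robin_answers(records, queries):
    """Yield the record set for each query, rotating it by one each time."""
    ring = deque(records)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)  # the next query sees the next record first


def first_answer_spread(records, queries):
    """Count how often each record is first, assuming clients use the first answer."""
    return Counter(answer[0] for answer in round_robin_answers(records, queries))


# Placeholder VIPs used throughout this post.
VIPS = ["198.51.100.10", "198.51.100.11", "198.51.100.12"]
```

Calling `first_answer_spread(VIPS, 9)` puts each VIP first three times. Nothing in this model knows whether the path behind an address actually works, which is the point of the phonebook analogy.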
The VIP model
Each VIP is a Kubernetes Service of type LoadBalancer with:
- a specific `loadBalancerIP`
- `loadBalancerClass` set to the Cilium L2 announcer
- `externalTrafficPolicy: Cluster`
- ports for normal HTTP ingress, plus a few TCP ports through Traefik
The VIPs are announced to the LAN via L2 (ARP), which means a node will answer “who owns VIP-A?” with its MAC address.
This is the second important mental model:
A VIP is not a thing. It is a promise.
The promise is “if you send packets to this IP, someone will answer.”
And the “someone” is decided by whatever is answering ARP.
Why externalTrafficPolicy: Cluster was intentional
This decision matters later.
I used externalTrafficPolicy: Cluster because I wanted to avoid blackholing during rollouts.
With Local, if traffic lands on a node without a local endpoint, the node drops the traffic. That is a valid design, but it is not the failure mode I wanted during a Traefik restart or reschedule.
Also, I wanted to preserve client IP without falling into a classic SNAT trap.
So the intended combo was:
- `externalTrafficPolicy: Cluster`
- Cilium LoadBalancer in DSR mode (Geneve dispatch)
- preserve source IP even though traffic might be forwarded across nodes
That “forward across nodes but preserve source IP” is what makes the packet path asymmetric.
When it happened
All on 2025-12-31, in local time.
- 02:12: Added a Local Redirect Policy experiment targeting Traefik VIPs.
- 09:39: Adjusted MTU mitigation in Cilium values.
- 10:59 to 11:21: Ran repeated `cilium connectivity test` runs and captured drop reasons and cilium-agent errors.
- 11:11: Removed conflicting local redirect manifests from GitOps.
- 11:37: Forced reconciliation of per-node VIP services and hard refreshed ArgoCD.
- Later: Final fix applied on hypervisors: enabled `nf_conntrack_allow_invalid=1` everywhere.
I did not capture the timestamp for the hypervisor change, and that is its own lesson. When the fix lives outside GitOps, you need a habit of writing it down immediately.
Impact and symptoms
Symptoms seen by clients
- Intermittent TCP connectivity to cluster-exposed services.
- Postgres connectivity failures through Traefik’s TCP entrypoint.
- Wireshark showed repeated SYN retransmissions. The handshake was not completing.
A database error message can be a networking symptom.
If the TCP handshake does not complete, the app should not be the primary suspect. More than likely, the path is.
Symptoms seen at ingress
- Some requests hit Traefik but got a default certificate for a hostname that should have had a real certificate.
This is a subtle but useful signal. It often means the request hit a Traefik instance that did not match the expected router context, or it could not reach the configured upstream path and fell back to a default behavior.
The clue that broke the case open
- Hypervisor consoles were intermittently unreachable too.
That pushed the investigation down a layer.
If the hypervisor UI flaps at the same time as your ingress VIPs, your cluster might be innocent.
Investigation
This incident had two parallel tracks:
- Things that were actually wrong in the cluster
- Things that were wrong in the underlay and made the cluster look wrong
Both were real. Only one was the primary cause.
Architecture confirmation
I verified the control plane of the design:
- DNS forwarder points to ingress zone.
- CoreDNS returns multiple A records for ingress.
- ARP mapping shows each VIP resolves to the MAC of a worker node.
Client-side repro: make flapping visible
I used two basic probes:
ARP snapshot:
# Windows example, placeholders
arp -a | findstr 198.51.100.
TCP connect loop:
# Client-side loop to make intermittent failures obvious
1..120 | % {
    try {
        $c = New-Object Net.Sockets.TcpClient('198.51.100.12', 5432)
        $c.Close()
        'OK'
    } catch {
        'FAIL'
    }
    Start-Sleep -Milliseconds 150
}
This is the fastest way I know to turn “it feels flaky” into a measurable signal.
Mixed OK and FAIL means one of three things:
- routing is path-dependent
- filtering is path-dependent
- host selection is path-dependent
The loop itself is not smart. That is why it is useful.
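For clients without PowerShell, the same probe can be sketched in Python. This is a rough equivalent, not the script I ran during the incident. The `probe` and `probe_loop` names and the one-second timeout are my own choices.

```python
import socket
import time


def probe(host: str, port: int, timeout: float = 1.0) -> str:
    """Return 'OK' if a TCP handshake to host:port completes, else 'FAIL'."""
    try:
        # create_connection returns only after the three-way handshake succeeds.
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return "FAIL"


def probe_loop(host: str, port: int, attempts: int = 120, delay: float = 0.15) -> list:
    """Probe repeatedly and collect one result per attempt."""
    results = []
    for _ in range(attempts):
        results.append(probe(host, port))
        time.sleep(delay)
    return results
```

Calling `probe_loop("198.51.100.12", 5432)` mirrors the PowerShell loop: 120 attempts, 150 ms apart, against a single VIP and port.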
Blaming Cilium (and finding something real)
I ran cilium connectivity test repeatedly and saw failures in “no unexpected packet drops” with reasons that included:
- no tunnel or encapsulation endpoint
- FIB lookup failed
- fragment-related drops
Those messages are scary. They also have a bad habit of being both true and not the thing breaking your app right now.
I still treat them as data. I just stopped treating them as the conclusion.
Then I found something concrete.
The Service frontend ownership conflict
Cilium agent logs started showing warnings that boiled down to:
- “Failure processing services…”
- “frontend already owned by another service… local-redirect”
level=warning msg="Failure processing service" error="frontend 198.51.100.10:80 already owned by another service (:local-redirect)"
That is not noise.
That is Cilium telling you, very directly:
“I cannot program the real Service frontend because something else has claimed that IP:port tuple.”
And that “something else” was my Local Redirect Policy experiment.
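If you want to pull every conflicting frontend out of a log dump, a small parser helps. This sketch assumes the message format shown in the warning above. I have not checked it against other Cilium versions, and `find_conflicts` is a hypothetical helper name.

```python
import re

# Matches the "already owned" fragment of the warning quoted in this post.
CONFLICT = re.compile(
    r"frontend (?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+) "
    r"already owned by another service \((?P<owner>[^)]*)\)"
)


def find_conflicts(log_text):
    """Return sorted unique (ip, port, owner) tuples, one per conflicting frontend."""
    found = set()
    for match in CONFLICT.finditer(log_text):
        found.add((match["ip"], int(match["port"]), match["owner"]))
    return sorted(found)
```

Feeding it the cilium-agent logs from each node gives a short list of IP:port tuples to check against whatever else claims them.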
What I was trying to do
I was experimenting with Local Redirect Policy to keep traffic local and reduce hops for the VIPs.
The intention was performance and predictability.
The outcome was a controller conflict.
Here is the simplest way to explain it:
A Service frontend is a front door.
A Local Redirect Policy is also a front door rule.
If both try to own the same front door, someone loses, and that someone is usually “the system behaving like the YAML says it should.”
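Here is a toy model of that collision in Python. It is not how Cilium implements frontend ownership. The `FrontendTable` class and the owner names are invented purely to illustrate the behavior I observed: the first claim wins, and later claims fail until it is released.

```python
class FrontendConflict(Exception):
    """Raised when a second owner tries to claim an already-owned frontend."""


class FrontendTable:
    """Toy registry mapping (ip, port) to a single owner."""

    def __init__(self):
        self._owners = {}

    def claim(self, ip, port, owner):
        current = self._owners.get((ip, port))
        if current is not None and current != owner:
            raise FrontendConflict(f"frontend {ip}:{port} already owned by {current}")
        self._owners[(ip, port)] = owner

    def release(self, ip, port, owner):
        # Only the current owner can give the frontend back.
        if self._owners.get((ip, port)) == owner:
            del self._owners[(ip, port)]

    def owner(self, ip, port):
        return self._owners.get((ip, port))
```

In this model the LoadBalancer Service can only claim `198.51.100.10:80` after the redirect policy releases it, which matches the order of operations that worked for me: remove the policy first, then force a resync.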
The fix for this layer
- Remove the Local Redirect Policy manifests from GitOps.
- Force reconciliation of the Traefik app.
- Hard refresh ArgoCD when it looked “Synced” but was actually referencing an older source revision.
After that:
- The per-node VIPs reappeared in `cilium service list`.
- In-cluster pod to VIP TCP checks became consistently successful.
At this point I expected the client flapping to stop.
It did not.
Decision fork: keep blaming the cluster, or chase the underlay
This was the moment where the incident could have dragged on for another day.
Decision fork
A. Keep assuming the cluster is dropping packets and keep tuning Cilium
B. Assume the underlay is involved and prove it with a capture
The Proxmox console was still flapping. That tipped the scale.
I picked B.
The underlay was dropping the handshake
The underlay capture that mattered
I ran a capture on the hypervisor bridge and focused on a single failing path:
- client IP
- VIP-C
- TCP port 5432
Sanitized example:
tcpdump -eni vmbr0 'host 203.0.113.30 and host 198.51.100.12 and tcp port 5432'
I saw the client SYN packets arriving reliably.
But the handshake did not consistently complete.
If the SYN arrives and the connection still fails, you are hunting one of these:
- SYN-ACK never generated
- SYN-ACK generated and dropped on egress
- SYN-ACK generated and lost elsewhere
The clue was that success and failure alternated. That usually implies some form of path asymmetry or statefulness.
Technical deep dive: why conntrack broke DSR-style Cluster forwarding
This is the “aha”.
If you remember nothing else, remember this section. Particularly if you are running your cluster on a hypervisor.
Conntrack as a bouncer
Conntrack is a state machine. It watches packets and tries to classify them:
- NEW
- ESTABLISHED
- RELATED
- INVALID
A stateful firewall uses conntrack to decide what to allow. A very common rule in stateful firewalls is:
- allow ESTABLISHED and RELATED
- drop INVALID
That is a reasonable baseline when your flows are symmetric.
Now combine it with a design that intentionally allows asymmetry.
The packet path, simplified
Here is the simplified North-South flow that matters.
We will use placeholders:
- Client: `203.0.113.30`
- VIP-A: `198.51.100.10`
- VIP-B: `198.51.100.11`
- VIP-C: `198.51.100.12`
- Hypervisors: `HYP-1`, `HYP-2`, `HYP-3`
- Workers: `WKR-1`, `WKR-2`, `WKR-3`, `WKR-4`
Step-by-step
- Client ARPs for VIP-C.
- A worker node answers ARP for VIP-C. That worker lives on some hypervisor, say HYP-1.
- Client sends SYN to VIP-C:5432.
- SYN enters HYP-1 bridge, then reaches WKR-1.
- Traefik VIP Service uses `externalTrafficPolicy: Cluster`, so WKR-1 can select a backend anywhere.
- Cilium forwards the request over the overlay to WKR-3 (for example) via Geneve.
- Backend on WKR-3 generates the response.
Now the important part:
Because LoadBalancer mode is DSR-style and we are preserving source IP, the response can be emitted from WKR-3 directly back to the client with source IP set to VIP-C.
So the SYN arrives via HYP-1.
If WKR-3 lives on a different hypervisor, say HYP-2, the SYN-ACK exits via HYP-2.
What conntrack sees
- HYP-1 sees the SYN and records state.
- HYP-2 sees a SYN-ACK, but it did not see the SYN.
HYP-2’s conntrack table does not have the flow state that makes the SYN-ACK “make sense”.
So conntrack can classify the SYN-ACK as INVALID.
If the firewall drops INVALID, the SYN-ACK never reaches the client.
The client retransmits SYN.
Sometimes the chosen backend is on the same hypervisor as the VIP owner node, and the reply path happens to be symmetric enough to satisfy conntrack.
Sometimes it is not.
That creates the exact symptom: mixed OK and FAIL.
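The whole failure fits in a few lines of Python if you model each hypervisor as its own flow table. This is a deliberately crude sketch: real conntrack tracks far more state, and the `Hypervisor` class, its `allow_invalid` flag, and the `handshake` helper are illustrative names of mine, not Proxmox or kernel APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Hypervisor:
    name: str
    allow_invalid: bool = False  # stands in for the firewall's INVALID handling
    flows: set = field(default_factory=set)

    def see_syn(self, client, vip, port):
        """Record a NEW flow when a SYN crosses this hypervisor's bridge."""
        self.flows.add((client, vip, port))
        return "NEW"

    def see_syn_ack(self, client, vip, port):
        """Classify a SYN-ACK and return (state, delivered)."""
        if (client, vip, port) in self.flows:
            return "ESTABLISHED", True
        # No matching SYN was seen here, so the reply has no state to attach to.
        return "INVALID", self.allow_invalid


def handshake(ingress, egress, client="203.0.113.30", vip="198.51.100.12", port=5432):
    """Send the SYN in through one hypervisor and the SYN-ACK out through another."""
    ingress.see_syn(client, vip, port)
    return egress.see_syn_ack(client, vip, port)
```

With `hyp1 = Hypervisor("HYP-1")` and `hyp2 = Hypervisor("HYP-2")`, `handshake(hyp1, hyp1)` is delivered while `handshake(hyp1, hyp2)` is classified INVALID and dropped. The same client, VIP, and port succeed or fail depending only on where the backend happens to live.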
Why it looked like “VIP-C is cursed”
During the incident, failures were disproportionately reported against one VIP.
That can happen for boring reasons:
- DNS answer ordering bias on a specific resolver
- client-side caching behavior
- consistent hashing behavior in load balancing decisions
- placement of a VIP owner node on a specific hypervisor combined with where the chosen backend lives
The VIP is not cursed.
The path is.
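A bit of arithmetic shows how placement alone can make one VIP look worse than the others. The worker-to-hypervisor mapping below is hypothetical (I am not reproducing my real layout), and the sketch assumes every worker runs a backend and that backends are chosen uniformly, which is a simplification of how Cilium actually selects them.

```python
# Hypothetical placement: which hypervisor hosts each worker.
PLACEMENT = {
    "WKR-1": "HYP-1",
    "WKR-2": "HYP-2",
    "WKR-3": "HYP-2",
    "WKR-4": "HYP-3",
}


def asymmetric_fraction(owner, backends, placement):
    """Fraction of backends whose replies leave via a different hypervisor than the VIP owner's."""
    crossing = [b for b in backends if placement[b] != placement[owner]]
    return len(crossing) / len(backends)


BACKENDS = sorted(PLACEMENT)  # one Traefik pod per worker
```

Under these assumptions, a VIP owned by WKR-1 has `asymmetric_fraction("WKR-1", BACKENDS, PLACEMENT)` equal to 0.75, while a VIP owned by WKR-2 comes out at 0.5. Same cluster, same firewall rule, visibly different failure rates per VIP.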
Root cause
Primary root cause: Hypervisor firewall conntrack was dropping flows as INVALID because conntrack state did not match the asymmetric packet patterns created by Cluster forwarding plus DSR-style return traffic.
Contributing factor: A Local Redirect Policy experiment created Service frontend ownership conflicts, temporarily breaking VIP programming and compounding the debugging process.
Both were real. The hypervisor conntrack behavior was the reason external clients flapped even after the cluster datapath was repaired.
The fix
Fix 1: Remove the Service frontend conflict
- Remove the Local Redirect Policy manifests.
- Force reconciliation and hard refresh GitOps tooling.
- Verify VIP frontends are present in Cilium BPF programming again.
This fixed the “cluster is actually confused” part.
Fix 2: Stop the hypervisors from dropping asymmetric traffic
Enable nf_conntrack_allow_invalid=1 on all hypervisors.
This allowed conntrack INVALID packets through the firewall instead of dropping them, which is necessary when your architecture intentionally creates asymmetric paths.
This was the change that immediately made:
- client TCP loops become consistently OK
- handshake latency improve
- hypervisor console access stabilize
Verification
Client-side verification
The TCP connect loop went from mixed OK and FAIL to consistently OK.
That was the loudest signal, because it verified the handshake itself.
Underlay verification
Hypervisor management access stabilized.
This mattered because it proved the underlay fix addressed a broader issue, not just one Kubernetes Service.
Cluster-side verification
- pod to VIP connectivity remained stable
- `cilium service list` and `cilium bpf lb list` showed the expected VIP frontends and backends
- cilium-agent logs stopped complaining about frontend ownership conflicts
What I would do differently
- The moment the hypervisor console flapped, I should have moved the underlay to the top of the suspect list.
  That was the clue. I treated it like a side quest.
- I would not have tried a completely different course so soon. I should have gone back to basics first.
  The Local Redirect Policy experiment might still be useful in a different context, but it created a second failure mode that looked similar to the first.
- I would write down the packet path before changing knobs.
  As soon as you draw the asymmetry, conntrack becomes an obvious suspect.
- I would add a standing “VIP handshake probe” before touching ingress architecture again.
  If you cannot prove that every VIP completes TCP handshakes consistently, you do not have an ingress system. You have a slot machine.
Debugging this class of outage
This is the practical checklist I should have stopped and followed from the beginning.
Step 0: Do not start with the app
If the TCP handshake fails, the app cannot fix it.
Step 1: Make intermittent failure measurable
Use a connect loop to a specific VIP:port.
If it is 60 percent OK and 40 percent FAIL, you have a real signal.
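Turning a run of probe results into numbers takes only a few lines. This is a small sketch that assumes the results are a list of `"OK"` and `"FAIL"` strings, as produced by a connect loop. The `summarize` name and the choice of metrics are mine.

```python
def summarize(results):
    """Return the attempt count, OK percentage, and longest run of consecutive FAILs."""
    total = len(results)
    ok = sum(1 for r in results if r == "OK")
    longest = current = 0
    for r in results:
        # Extend the current failure streak, or reset it on success.
        current = current + 1 if r == "FAIL" else 0
        longest = max(longest, current)
    ok_pct = round(100 * ok / total, 1) if total else 0.0
    return {"total": total, "ok_pct": ok_pct, "longest_fail_streak": longest}
```

The streak length is worth keeping next to the percentage: short, scattered failures point at per-connection path selection, while long streaks look more like an ownership change or a hard outage.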
Step 2: Pin the path
Do not let DNS keep rotating targets while you collect evidence.
Hit a single VIP directly until you understand it.
Step 3: Check L2 ownership
Confirm which node is answering ARP for the VIP at the time of failure.
If VIP ownership shifts unexpectedly, that is a separate issue.
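Comparing two ARP snapshots by eye gets old quickly. The sketch below parses the table layout printed by Windows `arp -a` and reports addresses whose MAC changed between snapshots. The helper names are hypothetical, and other platforms print ARP tables in different formats that this regex will not match.

```python
import re

# One data row of Windows `arp -a`: IP, dash-separated MAC, entry type.
ARP_ROW = re.compile(
    r"^[ \t]*(\d+\.\d+\.\d+\.\d+)[ \t]+((?:[0-9a-f]{2}-){5}[0-9a-f]{2})[ \t]+\w+",
    re.IGNORECASE | re.MULTILINE,
)


def parse_arp(text):
    """Map each IP in an `arp -a` dump to its lowercased MAC address."""
    return {ip: mac.lower() for ip, mac in ARP_ROW.findall(text)}


def ownership_changes(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs present in both snapshots with different MACs."""
    return {
        ip: (before[ip], after[ip])
        for ip in before.keys() & after.keys()
        if before[ip] != after[ip]
    }
```

An empty result during a failure window suggests the VIP owner did not move, so the flapping has to come from somewhere other than L2 ownership.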
Step 4: Validate datapath programming, not just YAML
In a kube-proxy replacement world, the truth is in BPF maps.
Check the VIP frontends and backend mappings from the node perspective.
Step 5: If external still flaps, capture on the hypervisor bridge
If SYN arrives but SYN-ACK does not leave, stop chasing Cilium policy and start chasing stateful filtering.
Step 6: Treat conntrack as a dependency
If any hop is doing stateful filtering, it must be compatible with your packet path.
If your packet path is intentionally asymmetric, “drop INVALID” may be incompatible.
Key commands
Cluster
# Connectivity suite
cilium connectivity test
# Inspect BPF service programming
kubectl -n kube-system exec <cilium-pod> -- cilium service list
kubectl -n kube-system exec <cilium-pod> -- cilium bpf lb list
# Inspect cilium-agent warnings
kubectl -n kube-system logs <cilium-pod> --since=2h | grep -E 'Failure processing service|frontend already owned|local-redirect'
GitOps
# Force refresh when an app looks synced but behavior suggests stale sources
kubectl -n argocd annotate application <app> argocd.argoproj.io/refresh=hard --overwrite
Client
# ARP snapshot
arp -a | findstr 198.51.100.
# TCP connect loop
1..120 | % {
    try {
        $c = New-Object Net.Sockets.TcpClient('198.51.100.12', 5432)
        $c.Close()
        'OK'
    } catch {
        'FAIL'
    }
    Start-Sleep -Milliseconds 150
}
Underlay
# Capture on hypervisor bridge
tcpdump -eni <bridge0> 'host <client> and host <vip> and tcp port 5432'
Appendix A: Config snippets that shaped the failure mode
CoreDNS zone with round robin A records
$ORIGIN ingress.example.internal.
$TTL 30
ns1  IN A 198.51.100.20
@    IN A 198.51.100.10
     IN A 198.51.100.11
     IN A 198.51.100.12
Per-node Traefik VIP Service example
apiVersion: v1
kind: Service
metadata:
  name: traefik-lb-worker-c
spec:
  type: LoadBalancer
  loadBalancerClass: io.cilium/l2-announcer
  loadBalancerIP: "198.51.100.12"
  externalTrafficPolicy: Cluster
  ports:
    - name: web
      port: 80
      protocol: TCP
      targetPort: 8000
    - name: websecure
      port: 443
      protocol: TCP
      targetPort: 8443
    - name: psql
      port: 5432
      protocol: TCP
      targetPort: 5432
Cilium values that made asymmetry possible
routingMode: "tunnel"
tunnelProtocol: "geneve"
tunnelPort: 6081
loadBalancer:
  mode: "dsr"
  dsrDispatch: "geneve"
Closing thoughts
The humbling part of this incident is that nothing was “mysteriously broken”.
- DNS did what DNS does.
- L2 VIPs did what L2 VIPs do.
- Cilium did what the config told it to do.
- Conntrack did what conntrack does.
- The firewall did what stateful firewalls do when they see packets out of sequence.
The system failed because I built an intentionally asymmetric packet path with a state machine in the middle that assumed symmetry.
Once you see that, this entire outage stops being a ghost story and becomes a checklist.
If you are building L2 VIP ingress without BGP and you want client IP preservation, I hope you never have to meet this class of failure firsthand.