VIPs, DNS, and a Conntrack Ghost: Post-mortem of an Intermittent Ingress Outage

“In production you do not rise to the level of your architecture. You fall to the level of the assumptions you forgot you made.”

At some point on New Year’s Eve, I had three separate tabs open:

A Wireshark capture showing SYN retransmissions.
Cilium logs complaining about Service frontends.
A Proxmox console that would load, freeze, and then disconnect like it had somewhere better to be.

This was supposed to be a simple win.

I was trying to make my ingress feel healthier and more responsive by spreading traffic across multiple Traefik pods. My networking gear does not support BGP, so I went with the next best thing in a homelab: multiple L2-announced VIPs and DNS round robin.

It worked just long enough to build confidence.

Then it started flapping in the most frustrating way possible: retries would succeed, dashboards would lie, and the same IP:port would alternate between working and not.

TL;DR

The real outage was caused by stateful conntrack filtering on the hypervisors dropping asymmetric return traffic as INVALID.
The fix was enabling nf_conntrack_allow_invalid=1 (a Proxmox host firewall option) on every hypervisor, which stopped the Proxmox firewall from dropping packets that conntrack classified as INVALID.
I also created a second problem that compounded debugging: a Local Redirect Policy experiment caused Service frontend ownership conflicts, so Cilium refused to reconcile the real LoadBalancer frontends until I removed the policy and forced a resync.
DNS round robin did not cause the break, but it multiplied the number of packet paths, which multiplied how often clients hit the bad one.

If the TCP handshake flaps, do not start by blaming apps or TLS. Start by proving where SYN and SYN-ACK are going.

Homelab landscape (setting the stage)

Before we autopsy the failure, here is the sandbox in generalized form:

Layer	Details
Platform	Kubernetes on an immutable OS (Talos), GitOps reconciliation (ArgoCD)
Networking	Cilium as kube-proxy replacement, tunnel mode with Geneve, WireGuard node encryption enabled, L2 announcements enabled
Ingress	Traefik as a DaemonSet (one pod per worker), with per-node LoadBalancer VIPs
DNS	LAN DNS forwarder (AdGuard) + CoreDNS authoritative zone for an internal ingress subdomain with A record round robin
Virtualization	Proxmox hypervisors with datacenter firewall enabled (stateful conntrack on the bridge)

This combination is important because it blends three worlds:

DNS-based distribution
L2 VIP ownership
Overlay forwarding and DSR-style return paths

Each of those has its own rules. The outage happened where those rules disagreed.

The intended design

I was trying to approximate “anycast ingress” without BGP.

Not real anycast. More like “three front doors instead of one”.

The DNS model

LAN clients resolve app hostnames through a DNS forwarder. That forwarder points a friendly name to an internal ingress zone that CoreDNS serves.

CoreDNS serves an A record set like:

VIP-A
VIP-B
VIP-C

The answers are round robin (or at least shuffled), so different clients should naturally pick different VIPs.

Here is the important mental model:

DNS is a phonebook.
It does not deliver packets. It hands out addresses and walks away.
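
To make the phonebook model concrete, here is a minimal Python sketch of a rotating A record set. It is a simulation, not a real resolver: the function names are mine, and it assumes each client simply tries the first address in the answer, which real clients do not always do.

```python
from collections import Counter, deque

# Placeholder VIPs used throughout this post (documentation address range).
VIPS = ["198.51.100.10", "198.51.100.11", "198.51.100.12"]


def round_robin_answers(records, queries):
    """Yield one rotated copy of the A record set per query."""
    ring = deque(records)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)  # the next query sees the next record first


def first_answer_counts(records, queries):
    """Count how often each address is the first one a client would try."""
    return Counter(answer[0] for answer in round_robin_answers(records, queries))
```

With nine queries, `first_answer_counts(VIPS, 9)` gives each VIP three first-place answers. That is all the "distribution" DNS provides: it decides where the first packet is aimed and has no idea whether that path works.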

The VIP model

Each VIP is a Kubernetes Service of type LoadBalancer with:

a specific loadBalancerIP
loadBalancerClass set to the Cilium L2 announcer
externalTrafficPolicy: Cluster
ports for HTTP and HTTPS ingress, plus a few raw TCP ports routed through Traefik

The VIPs are announced to the LAN via L2 (ARP), which means a node will answer “who owns VIP-A?” with its MAC address.

This is the second important mental model:

A VIP is not a thing. It is a promise.
The promise is “if you send packets to this IP, someone will answer.”

And the “someone” is decided by whatever is answering ARP.

Why `externalTrafficPolicy: Cluster` was intentional

This decision matters later.

I used externalTrafficPolicy: Cluster because I wanted to avoid blackholing during rollouts.

With Local, if traffic lands on a node without a local endpoint, the node drops the traffic. That is a valid design, but it is not the failure mode I wanted during a Traefik restart or reschedule.

Also, I wanted to preserve client IP without falling into a classic SNAT trap.

So the intended combo was:

externalTrafficPolicy: Cluster
Cilium LoadBalancer in DSR mode (Geneve dispatch)
preserve source IP even though traffic might be forwarded across nodes

That “forward across nodes but preserve source IP” is what makes the packet path asymmetric.

When it happened

All on 2025-12-31, in local time.

02:12: Added a Local Redirect Policy experiment targeting Traefik VIPs.
09:39: Adjusted MTU mitigation in Cilium values.
10:59 to 11:21: Ran repeated cilium connectivity test runs and captured drop reasons and cilium-agent errors.
11:11: Removed conflicting local redirect manifests from GitOps.
11:37: Forced reconciliation of per-node VIP services and hard refreshed ArgoCD.
Later: Final fix applied on hypervisors: enabled nf_conntrack_allow_invalid=1 everywhere.

I did not capture the timestamp for the hypervisor change, and that is its own lesson. When the fix lives outside GitOps, you need a habit of writing it down immediately.

Impact and symptoms

Symptoms seen by clients

Intermittent TCP connectivity to cluster-exposed services.
Postgres connectivity failures through Traefik’s TCP entrypoint.
Wireshark showed repeated SYN retransmissions. The handshake was not completing.

A database error message can be a networking symptom.
If the TCP handshake does not complete, the app should not be the primary suspect. More than likely, the path is.

Symptoms seen at ingress

Some requests hit Traefik but got a default certificate for a hostname that should have had a real certificate.

This is a subtle but useful signal. Traefik serves its default certificate when no configured certificate matches the requested hostname, so this often means the request reached a Traefik instance whose router or certificate state was not what I expected.

The clue that broke the case open

Hypervisor consoles were intermittently unreachable too.

That pushed the investigation down a layer.

If the hypervisor UI flaps at the same time as your ingress VIPs, your cluster might be innocent.

Investigation

This incident had two parallel tracks:

Things that were actually wrong in the cluster
Things that were wrong in the underlay and made the cluster look wrong

Both were real. Only one was the primary cause.

Architecture confirmation

I verified the control plane of the design:

DNS forwarder points to ingress zone.
CoreDNS returns multiple A records for ingress.
ARP mapping shows each VIP resolves to the MAC of a worker node.

Client-side repro: make flapping visible

I used two basic probes:

ARP snapshot:

# Windows example, placeholders
arp -a | findstr 198.51.100.

TCP connect loop:

# Client-side loop to make intermittent failures obvious
1..120 | % {
  try {
    $c = New-Object Net.Sockets.TcpClient('198.51.100.12', 5432)
    $c.Close()
    'OK'
  } catch {
    'FAIL'
  }
  Start-Sleep -Milliseconds 150
}

This is the fastest way I know to turn “it feels flaky” into a measurable signal.

Mixed OK and FAIL means one of three things:

routing is path-dependent
filtering is path-dependent
host selection is path-dependent

The loop itself is not smart. That is why it is useful.
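
For non-Windows clients, a rough Python equivalent of the same loop might look like the sketch below. The function names and the default timeout are my own choices, not something from the incident; the point is only that each attempt either completes the TCP handshake or fails.

```python
import socket
import time


def probe(host, port, attempts=120, timeout=1.0, pause=0.15):
    """Try `attempts` TCP connects and return the results (True = OK)."""
    results = []
    for _ in range(attempts):
        try:
            # create_connection completes the three-way handshake or raises.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(True)
        except OSError:  # covers refused connections and timeouts
            results.append(False)
        time.sleep(pause)
    return results


def success_rate(results):
    """Fraction of attempts that completed the handshake."""
    return sum(results) / len(results) if results else 0.0
```

Calling `probe("198.51.100.12", 5432)` and passing the result to `success_rate` turns the run into a single number you can compare before and after a change.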

Blaming Cilium (and finding something real)

I ran cilium connectivity test repeatedly and saw failures in “no unexpected packet drops” with reasons that included:

no tunnel or encapsulation endpoint
FIB lookup failed
fragment-related drops

Those messages are scary. They also have a bad habit of being both true and not the thing breaking your app right now.

I still treat them as data. I just stopped treating them as the conclusion.

Then I found something concrete.

The Service frontend ownership conflict

Cilium agent logs started showing warnings that boiled down to:

“Failure processing services…”
“frontend already owned by another service… local-redirect”

level=warning msg="Failure processing service" error="frontend 198.51.100.10:80 already owned by another service (:local-redirect)"

That is not noise.

That is Cilium telling you, very directly:

“I cannot program the real Service frontend because something else has claimed that IP:port tuple.”

And that “something else” was my Local Redirect Policy experiment.

What I was trying to do

I was experimenting with Local Redirect Policy to keep traffic local and reduce hops for the VIPs.

The intention was performance and predictability.

The outcome was a controller conflict.

Here is the simplest way to explain it:

A Service frontend is a front door.
A Local Redirect Policy is also a front door rule.
If both try to own the same front door, someone loses, and that someone is usually “the system behaving like the YAML says it should.”
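
The front door analogy can be sketched as a toy ownership table in Python. This is not how Cilium implements it; the class and method names are hypothetical, and the "first claim wins" rule is my assumption based on the log message above.

```python
class FrontendConflict(Exception):
    """Raised when a frontend is already claimed by a different owner."""


class FrontendTable:
    """Toy model: one owner per (ip, port) frontend, first claim wins."""

    def __init__(self):
        self._owners = {}

    def claim(self, ip, port, owner):
        """Register `owner` for ip:port, or raise if someone else holds it."""
        current = self._owners.setdefault((ip, port), owner)
        if current != owner:
            raise FrontendConflict(
                f"frontend {ip}:{port} already owned by another service ({current})"
            )

    def release(self, owner):
        """Drop every frontend held by `owner`, like removing the policy."""
        self._owners = {k: v for k, v in self._owners.items() if v != owner}

    def owner_of(self, ip, port):
        """Return the current owner of ip:port, or None."""
        return self._owners.get((ip, port))
```

In this model, once "local-redirect" has claimed 198.51.100.12:80, a later claim by the real Service raises until the policy's claim is released. That matches the shape of what I saw: the YAML for the Service was fine, but the frontend never got programmed.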

The fix for this layer

Remove the Local Redirect Policy manifests from GitOps.
Force reconciliation of the Traefik app.
Hard refresh ArgoCD when it looked “Synced” but was actually referencing an older source revision.

After that:

The per-node VIPs reappeared in cilium service list.
In-cluster pod to VIP TCP checks became consistently successful.

At this point I expected the client flapping to stop.

It did not.

Decision fork: keep blaming the cluster, or chase the underlay

This was the moment where the incident could have dragged on for another day.

Decision fork
A. Keep assuming the cluster is dropping packets and keep tuning Cilium
B. Assume the underlay is involved and prove it with a capture

The Proxmox console was still flapping. That tipped the scale.

I picked B.

The underlay was dropping the handshake

The underlay capture that mattered

I ran a capture on the hypervisor bridge and focused on a single failing path:

client IP
VIP-C
TCP port 5432

Sanitized example:

tcpdump -eni vmbr0 'host 203.0.113.30 and host 198.51.100.12 and tcp port 5432'

I saw the client SYN packets arriving reliably.

But the handshake did not consistently complete.

If the SYN arrives and the connection still fails, you are hunting one of these:

SYN-ACK never generated
SYN-ACK generated and dropped on egress
SYN-ACK generated and lost elsewhere

The clue was that success and failure alternated. That usually implies some form of path asymmetry or statefulness.
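
When a capture gets long, counting SYNs against SYN-ACKs is easier than reading it by eye. The Python sketch below relies on tcpdump's text output, which prints TCP flags as `Flags [S]` for a SYN and `Flags [S.]` for a SYN-ACK. The sample lines are hand-written in that format for illustration; they are not from the incident capture.

```python
import re

# tcpdump prints TCP flags as "Flags [S]" (SYN) or "Flags [S.]" (SYN-ACK).
FLAGS = re.compile(r"Flags \[([^\]]*)\]")

# Hand-written lines in tcpdump's format, NOT real capture data.
SAMPLE = """\
10:00:00.000001 IP 203.0.113.30.50000 > 198.51.100.12.5432: Flags [S], seq 1, win 64240, length 0
10:00:01.000001 IP 203.0.113.30.50000 > 198.51.100.12.5432: Flags [S], seq 1, win 64240, length 0
10:00:03.000001 IP 203.0.113.30.50001 > 198.51.100.12.5432: Flags [S], seq 9, win 64240, length 0
10:00:03.000201 IP 198.51.100.12.5432 > 203.0.113.30.50001: Flags [S.], seq 5, ack 10, win 65160, length 0
""".splitlines()


def handshake_summary(lines):
    """Count SYN and SYN-ACK packets in tcpdump text output."""
    counts = {"syn": 0, "syn_ack": 0}
    for line in lines:
        match = FLAGS.search(line)
        if not match:
            continue  # not a TCP line, or no flags printed
        flags = match.group(1)
        if flags == "S":
            counts["syn"] += 1
        elif flags == "S.":
            counts["syn_ack"] += 1
    return counts
```

For the sample, `handshake_summary(SAMPLE)` reports three SYNs and one SYN-ACK. A persistent gap like that on the bridge where replies should leave is the signature of a handshake being dropped rather than an application being slow.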

Technical deep dive: why conntrack broke DSR-style Cluster forwarding

This is the “aha”.

If you remember nothing else, remember this section. Particularly if you are running your cluster on a hypervisor.

Conntrack as a bouncer

Conntrack is a state machine. It watches packets and tries to classify them:

NEW
ESTABLISHED
RELATED
INVALID

A stateful firewall uses conntrack to decide what to allow. A very common rule in stateful firewalls is:

allow ESTABLISHED and RELATED
drop INVALID

That is a reasonable baseline when your flows are symmetric.

Now combine it with a design that intentionally allows asymmetry.

The packet path, simplified

Here is the simplified North-South flow that matters.

We will use placeholders:

Client: 203.0.113.30
VIP-A: 198.51.100.10
VIP-B: 198.51.100.11
VIP-C: 198.51.100.12
Hypervisors: HYP-1, HYP-2, HYP-3
Workers: WKR-1, WKR-2, WKR-3, WKR-4

Step-by-step

Client ARPs for VIP-C.
A worker node answers ARP for VIP-C. That worker lives on some hypervisor, say HYP-1.
Client sends SYN to VIP-C:5432.
SYN enters HYP-1 bridge, then reaches WKR-1.
Traefik VIP Service uses externalTrafficPolicy: Cluster, so WKR-1 can select a backend anywhere.
Cilium forwards the request over the overlay to WKR-3 (for example) via Geneve.
Backend on WKR-3 generates the response.

Now the important part:

Because LoadBalancer mode is DSR-style and we are preserving source IP, the response can be emitted from WKR-3 directly back to the client with source IP set to VIP-C.

So the SYN arrives via HYP-1.

The SYN-ACK exits via HYP-2.

What conntrack sees

HYP-1 sees the SYN and records state.
HYP-2 sees a SYN-ACK, but it did not see the SYN.

HYP-2’s conntrack table does not have the flow state that makes the SYN-ACK “make sense”.

So conntrack can classify the SYN-ACK as INVALID.

If the firewall drops INVALID, the SYN-ACK never reaches the client.

The client retransmits SYN.

Sometimes the chosen backend is on the same hypervisor as the VIP owner node, and the reply path happens to be symmetric enough to satisfy conntrack.

Sometimes it is not.

That creates the exact symptom: mixed OK and FAIL.
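
The same logic fits in a few lines of Python. This is a deliberately crude model: real conntrack tracks full 5-tuples and TCP state, while this sketch only remembers which client and VIP pairs a hypervisor has seen a SYN for. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hypervisor:
    """Toy stateful firewall: remembers flows it has seen a SYN for."""
    name: str
    allow_invalid: bool = False
    flows: set = field(default_factory=set)

    def forward(self, src, dst, flags):
        """Return True if the packet passes this hypervisor's firewall."""
        if flags == "SYN":
            self.flows.add((src, dst))  # NEW: start tracking the flow
            return True
        if (dst, src) in self.flows:    # reply to a flow first seen here
            return True
        return self.allow_invalid       # otherwise INVALID: drop unless allowed


def handshake(client, vip, ingress, egress):
    """SYN enters via `ingress`; the reply (source = VIP) leaves via `egress`."""
    if not ingress.forward(client, vip, "SYN"):
        return False
    return egress.forward(vip, client, "SYN-ACK")
```

If the ingress and egress hypervisor are the same object, `handshake` returns True. If the SYN enters via HYP-1 and the SYN-ACK leaves via HYP-2, it returns False until HYP-2 has `allow_invalid` set, which is the whole outage in miniature.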

Why it looked like “VIP-C is cursed”

During the incident, failures were disproportionately reported against one VIP.

That can happen for boring reasons:

DNS answer ordering bias on a specific resolver
client-side caching behavior
consistent hashing behavior in load balancing decisions
placement of a VIP owner node on a specific hypervisor combined with where the chosen backend lives

The VIP is not cursed.

The path is.
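
A back-of-the-envelope Python sketch shows how placement alone can make one VIP look worse than another. The worker-to-hypervisor mapping below is hypothetical (I am not reproducing my real placement), and it assumes backends are picked uniformly, which ignores any hashing or affinity in the load balancer.

```python
from fractions import Fraction

# Hypothetical placement, for illustration only.
WORKER_HOST = {
    "WKR-1": "HYP-1",
    "WKR-2": "HYP-1",
    "WKR-3": "HYP-2",
    "WKR-4": "HYP-3",
}


def asymmetric_fraction(vip_owner, backends, worker_host=WORKER_HOST):
    """Fraction of backends whose reply would leave via a different hypervisor."""
    owner_host = worker_host[vip_owner]
    off_host = [b for b in backends if worker_host[b] != owner_host]
    return Fraction(len(off_host), len(backends))
```

With Traefik on all four workers, a VIP owned by WKR-1 has half of its backends on another hypervisor, while a VIP owned by WKR-3 has three quarters. Under "drop INVALID", the second VIP would fail noticeably more often even though nothing about the VIP itself is different.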

Root cause

Primary root cause: Hypervisor firewall conntrack was dropping flows as INVALID because conntrack state did not match the asymmetric packet patterns created by Cluster forwarding plus DSR-style return traffic.

Contributing factor: A Local Redirect Policy experiment created Service frontend ownership conflicts, temporarily breaking VIP programming and compounding the debugging process.

Both were real. The hypervisor conntrack behavior was the reason external clients flapped even after the cluster datapath was repaired.

The fix

Fix 1: Remove the Service frontend conflict

Remove the Local Redirect Policy manifests.
Force reconciliation and hard refresh GitOps tooling.
Verify VIP frontends are present in Cilium BPF programming again.

This fixed the “cluster is actually confused” part.

Fix 2: Stop the hypervisors from dropping asymmetric traffic

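Enable nf_conntrack_allow_invalid=1 on all hypervisors. In Proxmox this is a host-level firewall option, set per node in the firewall options, rather than a kernel sysctl.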

This allowed conntrack INVALID packets through the firewall instead of dropping them, which is necessary when your architecture intentionally creates asymmetric paths.

This was the change that immediately made:

client TCP loops become consistently OK
handshake latency improve
hypervisor console access stabilize

Verification

Client-side verification

The TCP connect loop went from mixed OK and FAIL to consistently OK.

That was the loudest signal, because it verified the handshake itself.

Underlay verification

Hypervisor management access stabilized.

This mattered because it proved the underlay fix addressed a broader issue, not just one Kubernetes Service.

Cluster-side verification

pod to VIP connectivity remained stable
cilium service list and cilium bpf lb list showed the expected VIP frontends and backends
cilium-agent logs stopped complaining about frontend ownership conflicts

What I would do differently

The moment the hypervisor console flapped, I should have moved the underlay to the top of the suspect list.
That was the clue. I treated it like a side quest.
I would not have tried a completely different course so soon. I should have gone back to basics first.
The Local Redirect Policy experiment might still be useful in a different context, but it created a second failure mode that looked similar to the first.
I would write down the packet path before changing knobs.
As soon as you draw the asymmetry, conntrack becomes an obvious suspect.
I would add a standing “VIP handshake probe” before touching ingress architecture again.
If you cannot prove that every VIP completes TCP handshakes consistently, you do not have an ingress system. You have a slot machine.

Debugging this class of outage

This is the practical checklist I should have stopped and followed from the beginning.

Step 0: Do not start with the app

If the TCP handshake fails, the app cannot fix it.

Step 1: Make intermittent failure measurable

Use a connect loop to a specific VIP:port.

If it is 60 percent OK and 40 percent FAIL, you have a real signal.

Step 2: Pin the path

Do not let DNS keep rotating targets while you collect evidence.

Hit a single VIP directly until you understand it.

Step 3: Check L2 ownership

Confirm which node is answering ARP for the VIP at the time of failure.

If VIP ownership shifts unexpectedly, that is a separate issue.

Step 4: Validate datapath programming, not just YAML

In a kube-proxy replacement world, the truth is in BPF maps.

Check the VIP frontends and backend mappings from the node perspective.

Step 5: If external still flaps, capture on the hypervisor bridge

If SYN arrives but SYN-ACK does not leave, stop chasing Cilium policy and start chasing stateful filtering.

Step 6: Treat conntrack as a dependency

If any hop is doing stateful filtering, it must be compatible with your packet path.

If your packet path is intentionally asymmetric, “drop INVALID” may be incompatible.

Key commands

Cluster

# Connectivity suite
cilium connectivity test

# Inspect BPF service programming
kubectl -n kube-system exec <cilium-pod> -- cilium service list
kubectl -n kube-system exec <cilium-pod> -- cilium bpf lb list

# Inspect cilium-agent warnings
kubectl -n kube-system logs <cilium-pod> --since=2h | grep -E 'Failure processing service|already owned by another service|local-redirect'

GitOps

# Force refresh when an app looks synced but behavior suggests stale sources
kubectl -n argocd annotate application <app> argocd.argoproj.io/refresh=hard --overwrite

Client

# ARP snapshot
arp -a | findstr 198.51.100.

# TCP connect loop
1..120 | % {
  try {
    $c = New-Object Net.Sockets.TcpClient('198.51.100.12', 5432)
    $c.Close()
    'OK'
  } catch {
    'FAIL'
  }
  Start-Sleep -Milliseconds 150
}

Underlay

# Capture on hypervisor bridge
tcpdump -eni <bridge0> 'host <client> and host <vip> and tcp port 5432'

Appendix A: Config snippets that shaped the failure mode

CoreDNS zone with round robin A records

$ORIGIN ingress.example.internal.
$TTL 30
; SOA and NS records omitted from this snippet; a complete zone file needs both

ns1 IN A 198.51.100.20

@   IN A 198.51.100.10
    IN A 198.51.100.11
    IN A 198.51.100.12

Per-node Traefik VIP Service example

apiVersion: v1
kind: Service
metadata:
  name: traefik-lb-worker-c
spec:
  type: LoadBalancer
  loadBalancerClass: io.cilium/l2-announcer
  loadBalancerIP: "198.51.100.12"
  externalTrafficPolicy: Cluster
  ports:
    - name: web
      port: 80
      protocol: TCP
      targetPort: 8000
    - name: websecure
      port: 443
      protocol: TCP
      targetPort: 8443
    - name: psql
      port: 5432
      protocol: TCP
      targetPort: 5432

Cilium values that made asymmetry possible

routingMode: "tunnel"
tunnelProtocol: "geneve"
tunnelPort: 6081

loadBalancer:
  mode: "dsr"
  dsrDispatch: "geneve"

Closing thoughts

The humbling part of this incident is that nothing was “mysteriously broken”.

DNS did what DNS does.
L2 VIPs did what L2 VIPs do.
Cilium did what the config told it to do.
Conntrack did what conntrack does.
The firewall did what stateful firewalls do when they see a reply to a connection they never saw open.

The system failed because I built an intentionally asymmetric packet path with a state machine in the middle that assumed symmetry.

Once you see that, this entire outage stops being a ghost story and becomes a checklist.

If you are building L2 VIP ingress without BGP and you want client IP preservation, I hope this write-up means you won't have to meet this class of failure the hard way.