CNI Migration Post-mortem: Flannel to Cilium on Talos

While migrating from Flannel to Cilium on my Talos cluster, a configuration oversight caused 76 pods to fail during startup. This post documents the root cause and lessons learned about CNI behavior during node upgrades.

Six hours into what should have been a 30-minute CNI migration, I found myself staring at 76 pods stuck in ContainerCreating state, authentication failures cascading through my cluster, and storage volumes refusing to attach. The migration from Flannel to Cilium exposed fundamental misconceptions I had about how Kubernetes networking actually works at the system level.

TL;DR
• Set cni.exclusive: true first, or kubelet may keep picking the old CNI
• Talos wipes /run, so Flannel dies after a reboot
• Cilium ≥1.16.5 + Talos forwardKubeDNSToHost → set bpf.hostLegacyRouting: true or disable the Talos feature
• Disable kube-proxy → set kubeProxyReplacement: true (newer Cilium releases accept only true/false, not strict)
• Service account tokens are 1h; broken DNS blocks rotation and causes 401s

The Initial Context

My homelab cluster had been running Flannel v0.25.1 on Talos v1.10.2 without issues, but I needed NetworkPolicy support for implementing CrowdSec security policies. Flannel, being a simple overlay network, doesn't support NetworkPolicies (it's designed for connectivity, not security enforcement).

Cilium was the natural choice for several reasons:

  • Native eBPF implementation meant better performance and less overhead than iptables
  • NetworkPolicy support with additional CiliumNetworkPolicy CRDs for L7 filtering
  • Hubble observability would give me actual visibility into network flows
  • Could replace kube-proxy entirely, reducing resource consumption

Since I was already planning node upgrades from Talos v1.10.2 to v1.10.5, I figured I'd handle both changes at once, a decision that would complicate debugging significantly.

Understanding CNI Architecture

Before diving into the failure, it's important to understand how Kubernetes actually selects and uses CNI plugins. When kubelet asks the container runtime to start a pod, the runtime reads the configuration files in /etc/cni/net.d/ and selects one based on lexicographic ordering. The ordering isn't configurable; it's hardcoded behavior in the CRI implementation.

# What kubelet sees in /etc/cni/net.d/
05-cilium.conflist
10-flannel.conflist

The container runtime, acting on kubelet's behalf, sorts all configuration files (.conf and .conflist) in lexicographic order and uses the first one it can parse. If the Cilium file parses successfully and the plugin binary executes without error, Flannel is ignored entirely. The problem arose only because exclusive: false told Cilium to leave Flannel's file in place.
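
The selection rule can be sketched in a few lines of Python. This is only an illustrative model of the behavior described above, not the runtime's actual code; the function name select_cni_config is hypothetical.

```python
from typing import Iterable, Optional

# Extensions treated as CNI configuration files in this sketch.
CNI_EXTENSIONS = (".conf", ".conflist")


def select_cni_config(filenames: Iterable[str]) -> Optional[str]:
    """Return the config file that sorts first lexicographically, or None."""
    candidates = sorted(
        name for name in filenames if name.endswith(CNI_EXTENSIONS)
    )
    return candidates[0] if candidates else None


if __name__ == "__main__":
    # Both files present: the lower-numbered Cilium file sorts first.
    print(select_cni_config(["10-flannel.conflist", "05-cilium.conflist"]))
    # Cilium file absent: Flannel is the only candidate left.
    print(select_cni_config(["10-flannel.conflist", "README.md"]))
```

The numeric prefixes matter only because they control sort order, which is why a leftover 10-flannel.conflist becomes active the moment nothing sorts ahead of it.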

The Migration Strategy

I chose a dual-overlay approach to maintain zero downtime. The plan was:

  1. Deploy Cilium alongside Flannel on different subnets
  2. Let new pods use Cilium while existing pods continued with Flannel
  3. Gradually drain and upgrade nodes to complete the transition

A cold switch would have broken 30+ sidecar pods that restart automatically when their parent pods lose network connectivity, creating a cascade of failures.

Initial Cilium Deployment

I deployed Cilium v1.16.5 using Helm with this configuration:

# cilium/values.yaml
ipam:
  mode: "cluster-pool"
  operator:
    clusterPoolIPv4PodCIDRList: ["10.245.0.0/16"]  # Temp subnet during dual-overlay phase

tunnelPort: 8473  # Flannel uses 8472, avoiding conflict

cni:
  exclusive: false  # This would haunt me later
  customConf: true

bpf:
  masquerade: true
  hostLegacyRouting: true  # Workaround for Talos forwardKubeDNSToHost compatibility

kubeProxyReplacement: true  # Since we're disabling kube-proxy (newer Cilium accepts only true/false)
nodePort:
  enabled: true

securityContext:
  capabilities:
    ciliumAgent: "{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}"
    cleanCiliumState: "{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}"

The exclusive: false setting was my first mistake. In exclusive mode, Cilium takes ownership of the CNI directory and renames competing configurations with a .cilium_bak suffix. I thought keeping both CNIs active would provide a safer migration path. I was wrong.

During the dual-overlay period, I expected isolation between the networks. I was using VXLAN encapsulation (tunnelPort: 8473), and autoDirectNodeRoutes stays false in that mode, so the Flannel and Cilium pod CIDRs remained isolated from each other, which is exactly what I wanted during migration.
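
Before running two overlays side by side, it is worth confirming that the two pod CIDRs really are disjoint. The sketch below uses the two ranges from this post; the /24 per-node mask is an assumption (I believe it matches Cilium's default cluster-pool mask size, but check your own values), and the helper names are hypothetical.

```python
import ipaddress

# Pod networks from this post: Flannel's range and the temporary Cilium pool.
flannel_cidr = ipaddress.ip_network("10.244.0.0/16")
cilium_cidr = ipaddress.ip_network("10.245.0.0/16")


def cidrs_are_disjoint(a: ipaddress.IPv4Network, b: ipaddress.IPv4Network) -> bool:
    """True when no address belongs to both networks."""
    return not a.overlaps(b)


def per_node_blocks(pool: ipaddress.IPv4Network, mask_size: int = 24) -> int:
    """Number of per-node blocks of size /mask_size that fit in the pool.

    mask_size=24 is an assumed per-node allocation size.
    """
    return 2 ** (mask_size - pool.prefixlen)


if __name__ == "__main__":
    print(cidrs_are_disjoint(flannel_cidr, cilium_cidr))
    print(per_node_blocks(cilium_cidr))
```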

The Cascade Begins

Phase 1: Talos Configuration Change

After Cilium was running, I updated Talos to stop managing CNI:

# Applied via talosctl patch
cluster:
  network:
    cni:
      name: none  # Tell Talos to stop managing CNI

This change requires a reboot to take effect. Since I had to restart Talos anyway, I proceeded with the node upgrades:

kubectl drain talos-cp-01 --ignore-daemonsets --delete-emptydir-data
talosctl -n 10.0.1.61 upgrade --image ghcr.io/siderolabs/installer:v1.10.5

Phase 2: The Missing Files

Talos has a critical behavior during upgrades: it wipes /run (a tmpfs mount) but preserves /etc. This makes sense, since runtime state should be ephemeral while configuration should persist. But Flannel depends on /run/flannel/subnet.env to operate:

# What Flannel expects
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

When the control plane nodes came back up, pods began failing with:

Warning  FailedCreatePodSandBox  Failed to create pod sandbox:
rpc error: code = Unknown desc = failed to setup network for sandbox:
plugin type="flannel" failed (add): loadFlannelSubnetEnv failed:
open /run/flannel/subnet.env: no such file or directory

The container runtime was still invoking Flannel through 10-flannel.conflist, which exclusive: false had left in place, but Flannel could no longer function without its runtime state.
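
To make the dependency concrete, here is a minimal sketch of reading that file the way a plugin would need to. It is not Flannel's actual loader; parse_subnet_env and load_subnet_env are hypothetical names, and a missing file is modeled as an empty result rather than an error.

```python
from pathlib import Path
from typing import Dict


def parse_subnet_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines such as those in /run/flannel/subnet.env."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_subnet_env(path: str = "/run/flannel/subnet.env") -> Dict[str, str]:
    """Return the parsed values, or an empty dict when the file is missing."""
    file = Path(path)
    if not file.exists():
        return {}
    return parse_subnet_env(file.read_text())


if __name__ == "__main__":
    env = load_subnet_env()
    print(env or "subnet.env is missing: Flannel cannot set up pod networking")
```

After a Talos reboot the file is gone until the Flannel daemon rewrites it, and with Talos no longer managing the CNI there was nothing left to do that.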

Phase 3: DNS Black Hole

Even after switching to exclusive mode, pods that did start had no DNS resolution. The culprit was a subtle interaction between Talos and Cilium.

Since Talos v1.8.0, forwardKubeDNSToHost has been enabled by default. This feature intercepts DNS queries to the cluster DNS service (typically 10.96.0.10) and forwards them to a local resolver on the host at 169.254.116.108. This improves DNS performance and reliability during CoreDNS pod restarts.

However, the issue isn't just bpf.masquerade. It's the new BPF host-routing path introduced in Cilium 1.16: BPF masquerade implicitly enables host-routing (see Cilium #36761). When combined with Talos' forwardKubeDNSToHost, the link-local address (169.254.116.108) becomes unreachable from the pod network namespace. The documented workaround is to either set bpf.hostLegacyRouting: true or disable forwardKubeDNSToHost.
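
The compatibility rule is easy to get backwards, so here it is restated as a small predicate. This only encodes the rule as described in this post (it does not query Cilium or Talos), and the function names are hypothetical.

```python
import ipaddress

# Address of the Talos host DNS forwarder, as given in this post.
HOST_DNS_ADDRESS = "169.254.116.108"


def is_link_local(address: str) -> bool:
    """True for addresses in the link-local range (169.254.0.0/16 for IPv4)."""
    return ipaddress.ip_address(address).is_link_local


def needs_dns_workaround(
    forward_kube_dns_to_host: bool,
    bpf_masquerade: bool,
    host_legacy_routing: bool,
) -> bool:
    """Apply the rule from this post: the combination breaks pod DNS when
    Talos forwards kube-dns to the host and BPF host routing is active."""
    bpf_host_routing = bpf_masquerade and not host_legacy_routing
    return forward_kube_dns_to_host and bpf_host_routing


if __name__ == "__main__":
    print(is_link_local(HOST_DNS_ADDRESS))
    # Talos default plus BPF masquerade without legacy routing: broken DNS.
    print(needs_dns_workaround(True, True, False))
```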

# Testing DNS from a pod
$ kubectl run debug --image=busybox --rm -it -- nslookup kubernetes.default
;; connection timed out; no servers could be reached

# From the host, it works fine
$ talosctl -n 10.0.1.61 logs dns-resolve-cache
{"level":"debug","msg":"processing request","query":"kubernetes.default.svc.cluster.local."}
{"level":"debug","msg":"resolved","response":"10.96.0.1"}

The DNS resolver was working, but responses weren't making it back to pods.

Phase 4: Authentication Failures

Two worker nodes had additional complications. Their Cilium agents couldn't authenticate to the API server:

level=error msg="Unable to contact k8s api-server" error="Unauthorized" ipAddr="https://10.0.1.200:6443"
level=error msg="Unable to initialize Kubernetes subsystem" error="unable to create k8s client: unable to authenticate to Kubernetes API"

This wasn't immediately obvious because Kubernetes uses projected service account tokens with a default TTL of roughly one hour (expirationSeconds: 3607 on the injected token volume). The kubelet, running in the host network namespace, could still refresh tokens, but Cilium agents run as pods. Their tokens expired after an hour and couldn't be renewed over the broken pod network. The result was a classic chicken-and-egg problem: you need network connectivity to refresh the tokens that grant network access.
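
The timing can be worked out with a little arithmetic. The sketch below assumes the rotation rule from the Kubernetes documentation (kubelet requests a new token once the current one is older than 80% of its TTL, or older than 24 hours) and uses a round one-hour TTL; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Cap on token age before rotation, per the rule assumed above.
MAX_TOKEN_AGE = timedelta(hours=24)


def rotation_due_at(issued_at: datetime, ttl: timedelta) -> datetime:
    """Time at which rotation is first attempted: 80% of TTL, capped at 24h."""
    return issued_at + min(ttl * 4 / 5, MAX_TOKEN_AGE)


def is_expired(issued_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """True once the token's full TTL has elapsed."""
    return now >= issued_at + ttl


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    ttl = timedelta(hours=1)
    print("rotation due:", rotation_due_at(issued, ttl))
    print("expired after a 2h outage:", is_expired(issued, ttl, issued + timedelta(hours=2)))
```

With a one-hour token, the gap between the first rotation attempt and hard expiry is only about twelve minutes, so any outage longer than that risks 401s.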

Phase 5: Storage Attachment Limbo

With networking partially restored, I discovered that database pods couldn't start. Longhorn CSI couldn't attach their volumes:

$ kubectl get volumeattachments
NAME                        ATTACHER            PV                    NODE           ATTACHED   AGE
csi-postgres-vol-wrk-02     driver.longhorn.io  pvc-abc123-def456    talos-wrk-02   false      45m

Longhorn CSI uses the pod network for communication between its components. When the CNI changed:

  1. Instance manager pods lost connectivity to the Longhorn manager
  2. The CSI attacher couldn't communicate with the Longhorn backend
  3. Volume state became inconsistent between Kubernetes and Longhorn

$ kubectl -n longhorn-system logs csi-attacher-568c7f9b5b-xxxxx
E0113 15:24:15.792Z Attachment request failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The Recovery Process

Step 1: Fix CNI Exclusive Mode

First, I had to make Cilium take control:

# Enable exclusive mode via ConfigMap
kubectl patch configmap cilium-config -n kube-system \
  --type merge -p '{"data":{"cni-exclusive":"true"}}'

# Restart Cilium pods to apply
kubectl rollout restart daemonset/cilium -n kube-system

Cilium immediately renamed the Flannel configuration:

$ kubectl exec -n kube-system daemonset/cilium -- ls -la /host/etc/cni/net.d/
05-cilium.conflist
10-flannel.conflist.cilium_bak
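
The rename behavior can be modeled as a pure function over file names. This is a simplified sketch of what exclusive mode does according to this post, not Cilium's implementation; the function names and the default owner file name are assumptions.

```python
from typing import List

BACKUP_SUFFIX = ".cilium_bak"
CNI_EXTENSIONS = (".conf", ".conflist")


def apply_exclusive_mode(
    filenames: List[str], owner: str = "05-cilium.conflist"
) -> List[str]:
    """Give every CNI config except `owner` the backup suffix."""
    result = []
    for name in filenames:
        if name != owner and name.endswith(CNI_EXTENSIONS):
            result.append(name + BACKUP_SUFFIX)
        else:
            result.append(name)
    return sorted(result)


def undo_exclusive_mode(filenames: List[str]) -> List[str]:
    """Strip the backup suffix again, e.g. as part of a rollback."""
    return sorted(
        name[: -len(BACKUP_SUFFIX)] if name.endswith(BACKUP_SUFFIX) else name
        for name in filenames
    )


if __name__ == "__main__":
    before = ["05-cilium.conflist", "10-flannel.conflist"]
    after = apply_exclusive_mode(before)
    print(after)
    print(undo_exclusive_mode(after))
```

Because the renamed file no longer ends in .conf or .conflist, it drops out of the candidate list entirely instead of merely sorting later.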

Step 2: Resolve DNS Compatibility

The DNS fix required patching Talos configuration on all nodes:

# talos-dns-patch.yaml
machine:
  features:
    hostDNS:
      forwardKubeDNSToHost: false

# Apply to all nodes (talosctl is addressed by node IP, as elsewhere in this post)
for node in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  talosctl -n "$node" patch machineconfig --patch @talos-dns-patch.yaml
done

# Restart CoreDNS to clear any cached state
kubectl rollout restart deployment/coredns -n kube-system

# Note: Talos applies most network patches live, but if DNS still fails,
# perform a controlled reboot per node to ensure forwardKubeDNSToHost is truly disabled

Step 3: Fix IPAM Configuration

During the chaos, I discovered the Cilium ConfigMap still pointed at the temporary pod CIDR from the dual-overlay phase:

# Check current configuration
$ kubectl get configmap cilium-config -n kube-system -o yaml | grep cluster-pool
cluster-pool-ipv4-cidr: "10.245.0.0/16"  # Still using temp subnet

# Switch back to main pod CIDR now that Flannel is gone
kubectl patch configmap cilium-config -n kube-system \
  --type merge -p '{"data":{"cluster-pool-ipv4-cidr":"10.244.0.0/16"}}'

# Restart Cilium to apply IPAM changes
kubectl rollout restart daemonset/cilium -n kube-system

Step 4: Refresh Authentication

For the nodes with authentication failures, I needed to restart the kubelet service:

# This refreshes service account tokens and clears kubelet state
talosctl -n 10.0.1.62 service kubelet restart
talosctl -n 10.0.1.64 service kubelet restart

Within seconds, the Cilium agents authenticated successfully and removed their node taints.

Step 5: Recover Storage

With networking stable, Longhorn began its self-healing process. I just needed to trigger it:

# Delete stuck instance managers
kubectl -n longhorn-system delete pod -l longhorn.io/component=instance-manager

# Watch Longhorn recover
kubectl -n longhorn-system get volumes.longhorn.io -w
NAME           STATE      NODE
postgres-data  Detached   
postgres-data  Attaching  talos-wrk-02
postgres-data  Attached   talos-wrk-02

Longhorn detected the orphaned volumes, rebuilt the necessary replicas, and restored healthy state.

Technical Deep Dive: What Actually Happened

CNI Selection Mechanism

The kubelet's CNI selection is deceptively simple but has subtle behaviors:

  1. Reads all files from /etc/cni/net.d/
  2. Filters for .conf and .conflist extensions
  3. Sorts lexicographically
  4. Uses the first valid configuration

The key insight: after the reboot, the runtime parsed 05-cilium.conflist first and attempted the ADD call, which failed because /opt/cni/bin/cilium-cni was missing (Talos upgrades re-flash the immutable rootFS, removing /opt/cni/bin/* unless injected via extraFiles). Kubelet surfaced an error and pods stayed in ContainerCreating. Because exclusive: false had left 10-flannel.conflist in place, Flannel's configuration was still eligible for selection, but Flannel couldn't function without its runtime files in /run.

Talos Filesystem Semantics

Talos Linux uses an immutable root filesystem with specific mount points:

  • /etc - Persistent configuration (survives reboots/upgrades)
  • /run - Runtime state (tmpfs, cleared on reboot/upgrade)
  • /var - Persistent data

This design prevents configuration drift but means CNI plugins must be designed with stateless operation in mind. Flannel's dependency on /run/flannel/subnet.env violated this principle.

eBPF vs iptables NAT

Traditional kube-proxy uses iptables for NAT operations. These rules work at the netfilter layer and handle special cases like link-local addresses (169.254.0.0/16) correctly.

Cilium's eBPF programs operate at a lower level (directly in the kernel's network stack). While more efficient, they handle packet flows differently. Since Cilium 1.16, the new BPF host-routing mode breaks compatibility with Talos' forwardKubeDNSToHost. The bpf.hostLegacyRouting: true option reverts to the pre-1.16 behavior for compatibility.

Lessons for Production

1. Understand Platform Integration Points

Each Kubernetes distribution has unique behaviors:

  • Talos: Immutable filesystem, forwardKubeDNSToHost
  • RKE2: Different CNI directory location
  • K3s: Bundled Flannel by default

Research platform-specific documentation before migrations.

2. Plan for Authentication Refresh

Service account tokens expire. During extended outages, plan for authentication refresh:

# Force a token refresh: projected tokens are issued per pod, so recreating
# the Cilium agent pods makes kubelet mount fresh ones
kubectl -n kube-system delete pod -l k8s-app=cilium

3. Monitor CNI Transitions

Add monitoring before starting:

# Watch CNI configuration changes
watch -n 1 'kubectl exec -n kube-system daemonset/cilium -- ls -la /host/etc/cni/net.d/'

# Monitor pod creation failures
kubectl get events -A --field-selector type=Warning -w | grep sandbox
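
When many pods fail at once, it helps to extract which CNI plugin is named in each sandbox error. A minimal sketch, assuming event messages shaped like the FailedCreatePodSandBox example earlier in this post; the regular expression and function name are hypothetical.

```python
import re
from typing import Optional

# Matches fragments such as: plugin type="flannel" failed (add)
PLUGIN_FAILURE = re.compile(r'plugin type="([^"]+)" failed \((\w+)\)')


def failing_cni_plugin(message: str) -> Optional[str]:
    """Return the CNI plugin named in a sandbox failure message, if any."""
    match = PLUGIN_FAILURE.search(message)
    return match.group(1) if match else None


if __name__ == "__main__":
    event = (
        "Failed to create pod sandbox: rpc error: code = Unknown desc = "
        "failed to setup network for sandbox: "
        'plugin type="flannel" failed (add): loadFlannelSubnetEnv failed: '
        "open /run/flannel/subnet.env: no such file or directory"
    )
    print(failing_cni_plugin(event))
```

Seeing the old plugin's name in these events during a migration is an early signal that the runtime is still selecting the wrong configuration.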

4. Have a Rollback Plan

Before starting, document how to roll back:

  1. Restore original CNI configuration
  2. Clear modified configs from /etc/cni/net.d/
  3. Restart all nodes if necessary

Configuration Templates

Cilium for Talos Linux

# values-talos.yaml
ipam:
  mode: "cluster-pool"  # Keep consistent with earlier config
  operator:
    clusterPoolIPv4PodCIDRList: ["10.244.0.0/16"]

cni:
  exclusive: true  # Always use exclusive mode for migrations
  customConf: true

bpf:
  masquerade: true
  hostLegacyRouting: true  # Required for Talos forwardKubeDNSToHost compatibility

kubeProxyReplacement: true  # Required when kube-proxy is disabled (newer Cilium accepts only true/false)
nodePort:
  enabled: true

# Talos-specific capability restrictions
securityContext:
  capabilities:
    ciliumAgent: "{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}"
    cleanCiliumState: "{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}"

# Use Talos's localhost API endpoint
k8sServiceHost: localhost
k8sServicePort: 7445

# Talos already mounts cgroup v2, so skip Cilium's cgroup auto-mount
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup

Talos Machine Config

# talos-network-patch.yaml
machine:
  features:
    hostDNS:
      forwardKubeDNSToHost: false  # Workaround for Cilium compatibility
cluster:
  network:
    cni:
      name: none  # Let Cilium manage CNI
  proxy:
    disabled: true  # Cilium replaces kube-proxy

The Outcome

After six hours of debugging, the cluster was fully operational with Cilium. The benefits were immediate:

  • NetworkPolicies working correctly for CrowdSec
  • Hubble providing network visibility
  • Reduction in CPU usage (no kube-proxy)
  • eBPF-accelerated networking

More importantly, I gained deep insights into Kubernetes networking internals that documentation alone couldn't provide. Sometimes the best learning comes from fixing what you broke.

References