Part 2: Bootstrapping Your Talos Cluster

With your planning complete, it's time to bring your cluster to life. This article walks through installing Talos Linux, configuring high-availability control planes, and joining worker nodes. More importantly, I'll share the critical configurations and gotchas that took me days to debug, with detailed explanations of every command and configuration choice.

Prerequisites

Before starting, ensure you have:

talosctl CLI installed on your workstation (installation guide)
Physical or virtual machines prepared as discussed in Part 1
Network connectivity to all nodes
USB drives (for bare metal) or hypervisor access (for VMs)

VIP address verified as unused and not in DHCP range:

ping -c1 192.168.0.200  # Should fail (no response)
# Also verify VIP is outside DHCP range in router config

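# Optional ARP sanity check (no MAC should be returned):
# 'ip neigh show' exits 0 even when nothing matches, so test its output instead
ip neigh show 192.168.0.200 | grep -q lladdr \
  && echo "VIP answers ARP - pick another address" \
  || echo "No ARP entry for VIP (good)"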
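The DHCP-range half of this check can also be scripted. This is a minimal sketch, assuming the VIP and the pool share the same /24; the pool bounds (.100 to .199) are hypothetical, so read the real ones from your router.

```shell
# Succeeds when the last octet of IP lies outside START..END (inclusive).
# Assumption: IP and the DHCP pool are in the same /24 (192.168.0.0/24 here).
outside_pool() (
  host=${1##*.}   # keep only the last octet
  [ "$host" -lt "$2" ] || [ "$host" -gt "$3" ]
)

VIP=192.168.0.200
POOL_START=100   # hypothetical: last octet of the first DHCP lease
POOL_END=199     # hypothetical: last octet of the last DHCP lease

if outside_pool "$VIP" "$POOL_START" "$POOL_END"; then
  echo "VIP $VIP is outside the DHCP pool"
else
  echo "VIP $VIP is inside the DHCP pool - pick another address"
fi
```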

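Verify your tool versions are compatible:

talosctl version --client    # Should show v1.11.x (--client skips the server query; no cluster exists yet)

# kubectl should be within ±1 minor of the cluster version (Talos bundles K8s)
# Note: the --short flag was removed in kubectl 1.28, so use plain --client
kubectl version --client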

Understanding Talos Boot Process

Unlike traditional Linux distributions, Talos has a unique boot process:

Boot from ISO: Initial boot into installer environment
Apply Configuration: Send configuration via API
Install to Disk: Talos installs itself and reboots
Join Cluster: Node bootstraps and joins the cluster

This API-driven approach means you never SSH into nodes - everything is managed through talosctl.

Preparing Installation Media

First, download the correct Talos Linux image for your hardware:

# For most bare metal servers and Proxmox VMs
# wget downloads files from the internet
# The long hash is a Factory schematic ID that selects the image customizations
# NOTE: 3765...b4ba is the default schematic with NO extensions. If you need the
# extensions listed below, create your own schematic at https://factory.talos.dev
# and substitute its ID in these URLs
wget https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.11.0/metal-amd64.iso

# For platforms that deliver configuration through the cloud-init NoCloud datasource
# (AWS, Azure, and GCP have their own dedicated Talos images)
# Note: nocloud ISO is for cloud-init workflows; some bare-metal BIOSes won't boot it
# Always use metal ISO for USB installs on physical servers
wget https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.11.0/nocloud-amd64.iso

# For VMware virtual machines (include the vmtoolsd guest agent extension)
wget https://factory.talos.dev/image/[your-custom-hash-with-vmware-tools]/v1.11.0/vmware-amd64.iso

Why Use Talos Factory?

The Talos Factory (https://factory.talos.dev) lets you customize your ISO with system extensions. These are kernel modules and tools not included in the base image to keep it minimal. For our cluster, I pre-included:

siderolabs/util-linux-tools: Provides mount utilities that storage CSI drivers need
siderolabs/iscsi-tools: Only needed if using iSCSI-based storage (like Longhorn) - NOT needed for Ceph RBD
siderolabs/qemu-guest-agent: For KVM/Proxmox VMs - enables graceful shutdown
siderolabs/vmtoolsd-guest-agent: For VMware VMs - provides VM integration

For Ceph RBD specifically, you only need util-linux-tools and the rbd kernel module (which is configured later). The iscsi-tools extension is for iSCSI-based storage systems, not Ceph.
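To make the extension list concrete, here is a sketch of a Factory schematic for a Proxmox node using Ceph RBD. The file name schematic.yaml and the SCHEMATIC_ID placeholder are my own; the curl call is left as a comment because it needs network access, and you should confirm the returned ID in the Factory web UI.

```shell
# Write a schematic that requests the extensions discussed above.
cat > schematic.yaml <<'EOF'
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/util-linux-tools
      - siderolabs/qemu-guest-agent
EOF

# Uploading the schematic returns a small JSON document with its ID:
#   curl -sS -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics

SCHEMATIC_ID="<your-schematic-id>"   # placeholder, not a real ID
TALOS_VERSION="v1.11.0"
echo "https://factory.talos.dev/image/${SCHEMATIC_ID}/${TALOS_VERSION}/metal-amd64.iso"
```

The same schematic always produces the same ID, so you can keep schematic.yaml in version control next to your machine configs.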

Proxmox/VMware note: After install, enable the guest agent in your hypervisor UI (Proxmox: VM → Options → QEMU Guest Agent = Enabled; VMware: Install with vmtoolsd extension and ensure the guest agent is allowed to run).

Creating Bootable Media

For bare metal installations, write the ISO to USB drives:

# List all disks to find your USB drive
# BE CAREFUL: choosing the wrong device will erase that disk!
lsblk

# Example output:
# NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
# sda      8:0    0 931.5G  0 disk
# ├─sda1   8:1    0   512M  0 part /boot/efi
# ├─sda2   8:2    0 930.5G  0 part /
# sdb      8:16   1  14.9G  0 disk      ← This is likely your USB drive (RM=1 means removable)

# Write ISO to USB (replace /dev/sdb with your actual USB device)
# sudo: run with administrator privileges
# dd: disk duplicator tool
# if: input file (the ISO)
# of: output file (the USB device)
# bs=4M: use 4MB blocks for faster writing
# status=progress: show progress during write
# conv=fsync: ensure all data is written before completing
sudo dd if=metal-amd64.iso of=/dev/sdb bs=4M status=progress conv=fsync

# Expected output:
# 156+0 records in
# 156+0 records out
# 654311424 bytes (654 MB, 624 MiB) copied, 45.2341 s, 14.5 MB/s

# Ensure data is written and safely eject
sync
# Linux:
sudo udisksctl power-off -b /dev/sdb

# macOS: find the right disk first
# diskutil list  # Find your USB drive
# diskutil unmountDisk /dev/disk2  # Replace with your disk number
# Optional: use rdisk for faster dd on macOS
# WARNING: Triple-check disk number! macOS won't stop you from overwriting system disk
# sudo dd if=metal-amd64.iso of=/dev/rdisk2 bs=4m

Common Boot Issues and Solutions:

# If the USB doesn't boot:
# 1. Check UEFI vs BIOS mode in your system settings
# 2. Disable Secure Boot temporarily
# 3. Enable Legacy/CSM mode if needed

# Verify the USB was written correctly:
sudo fdisk -l /dev/sdb
# Should show the Talos partition structure
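If you want a stronger check than eyeballing the partition table, you can compare the start of the device against the ISO byte for byte. This is a sketch, not part of the official workflow: verify_image is a helper name I made up, and on a real device it needs sudo. Unplug and re-insert the drive first so the read comes from the stick rather than the page cache.

```shell
# Compare the first <ISO size> bytes of TARGET with the ISO itself.
# Returns 0 when they are identical.
verify_image() (
  iso=$1
  target=$2
  size=$(wc -c < "$iso" | tr -d ' ')          # ISO length in bytes
  head -c "$size" "$target" | cmp -s - "$iso"  # silent byte-wise comparison
)

# Example (run as root; replace /dev/sdb with your USB device):
#   verify_image metal-amd64.iso /dev/sdb && echo "USB matches ISO"
```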

For virtual machines, upload the ISO to your hypervisor's datastore and attach it to each VM as a boot device. On Proxmox, a SCSI disk on a VirtIO SCSI controller works well and is a common choice for nodes that will use Ceph RBD through CSI.

Generating Cluster Configuration

Talos uses a declarative configuration model where you define your desired state in YAML files. Unlike traditional Linux where you'd run commands to configure the system, Talos applies the entire configuration at once.

Understanding Talos Configuration Components

Before generating configs, let's understand what we're creating:

secrets.yaml: Cryptographic materials (certificates, tokens) for secure cluster communication
controlplane.yaml: Configuration for control plane nodes
worker.yaml: Configuration for worker nodes
talosconfig: Client configuration for talosctl to communicate with the cluster

Generating the Base Configuration

# Generate the secrets bundle - this contains all certificates and tokens
# CRITICAL: Back this up securely! Lost secrets = lost cluster access
# -o specifies the output FILE (default: secrets.yaml), so create the directory first
mkdir -p _out
talosctl gen secrets -o _out/secrets.yaml

# Generate the base cluster configuration
# homelab-cluster: your cluster name (can be anything)
# https://192.168.0.200:6443: the API endpoint (VIP:port)
# --with-secrets: use our generated secrets file
# --config-patch: apply our customizations on top
# Writes controlplane.yaml, worker.yaml, and talosconfig to the current directory
talosctl gen config homelab-cluster https://192.168.0.200:6443 \
  --with-secrets _out/secrets.yaml \
  --config-patch @cluster-patch.yaml

# Point talosctl at the generated client config for the rest of this guide
# (without this it reads ~/.talos/config)
export TALOSCONFIG="$(pwd)/talosconfig"

# Copy the generated controlplane.yaml/worker.yaml per node (controlplane-1.yaml, -2.yaml, ...)
# and edit hostname/IPs accordingly
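Rather than hand-editing three copies, the per-node differences can be generated. This is a sketch under the addressing scheme used in this article (talos-cp-0N at 192.168.0.(10+N)); cp_patch and the cp-N-patch.yaml file names are hypothetical, and the merge step only runs if talosctl and a generated controlplane.yaml are present.

```shell
# Print the per-node patch for control plane N (1..3).
cp_patch() (
  n=$1
  cat <<EOF
machine:
  network:
    hostname: talos-cp-0${n}
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.0.$((10 + n))/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.0.1
        vip:
          ip: 192.168.0.200
EOF
)

for n in 1 2 3; do
  cp_patch "$n" > "cp-${n}-patch.yaml"
  # Merge the patch into the generated base config when both are available.
  if command -v talosctl >/dev/null 2>&1 && [ -f controlplane.yaml ]; then
    talosctl machineconfig patch controlplane.yaml \
      --patch "@cp-${n}-patch.yaml" --output "controlplane-${n}.yaml"
  fi
done
```

Keeping the base file untouched and storing only small patches makes it easier to see what actually differs between nodes.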

Why These Commands Matter

The gen secrets command creates:

Talos (OS) CA: Signs the Talos API certificates, including the client certificate in talosconfig
etcd CA: Secures etcd peer communication
Kubernetes CA: Signs API server and kubelet certificates
Bootstrap token: Allows nodes to join the cluster

The gen config command uses these secrets to create node configurations that will establish secure communication within your cluster.

The Critical Patch File

Here's the cluster-patch.yaml with lessons learned embedded and explanations for every setting:

# cluster-patch.yaml
cluster:
  network:
    cni:
      name: none  # Don't install default CNI - we'll install Cilium separately
                  # Why: Cilium provides advanced eBPF networking features
    podSubnets:
      - 10.244.0.0/16  # IP range for pods (65,536 addresses)
                       # Why: Separate from node network to avoid conflicts
    serviceSubnets:
      - 10.96.0.0/12   # IP range for services (1,048,576 addresses)
                       # Why: Kubernetes internal service discovery
  proxy:
    disabled: true  # Disable kube-proxy - Cilium replaces it with eBPF
                    # Why: Better performance and observability
                    # WARNING: If kube-proxy is disabled, ensure Cilium is installed
                    # with kube-proxy replacement enabled (we'll do this in Part 3).
                    # Without that, Services won't route.
                    # cilium-values.yaml (preview for Part 3):
                    #   kubeProxyReplacement: true  # Cilium 1.14+ takes true/false; the older
                    #                               # "strict"/"partial" values are deprecated
                    #   k8sServiceHost: 192.168.0.200
                    #   k8sServicePort: 6443
  etcd:
    advertisedSubnets:
      - 192.168.0.0/24  # Restrict etcd peer communication to this network
                        # Why: Security - etcd contains all cluster secrets
  apiServer:
    certSANs:
      - 192.168.0.200  # VIP - ensures all API server certs include it
                       # Why: Prevents cert errors when accessing via VIP
      # - k8s.homelab.example  # Add DNS name if you use one for the VIP

machine:
  network:
    nameservers:
      - 192.168.0.2     # Primary local DNS (your router/Pi-hole)
      - 192.168.0.3     # Secondary local DNS
      - 1.1.1.1         # Fallback to Cloudflare if local DNS fails
                        # Why: Redundancy prevents DNS failures
  time:
    servers:
      - time.cloudflare.com  # NTP servers for time synchronization
      - pool.ntp.org         # Why: Certificates fail if time is wrong
  features:
    kubernetesTalosAPIAccess:
      enabled: true          # Allow Kubernetes to manage Talos
      allowedRoles:
        - os:admin           # Which roles can access Talos API
      allowedKubernetesNamespaces:
        - kube-system        # Only kube-system pods can manage OS
                            # Why: Security - limit OS access
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: false  # Must be false for Cilium without kube-proxy
                                  # Default is true on Talos >=1.8
                                  # Why: Prevents DNS loops
  kubelet:
    extraArgs:
      rotate-certificates: "true"  # Auto-renew kubelet certificates (extraArgs values are strings)
                                # Why: Prevents manual cert renewal
    clusterDNS:
      - 10.96.0.10  # CoreDNS service IP (must be in serviceSubnet)
                    # Why: Pods use this for DNS resolution
    nodeIP:
      validSubnets:
        - 192.168.0.0/24  # Force kubelet to bind to this network
                          # Why: Ensures kubelet uses correct interface
                          # In multi-NIC/VLAN hosts, this prevents kubelet from
                          # binding to a management or storage interface by mistake—
                          # avoids hairpins and VIP weirdness.
  kernel:
    modules:  # Load these kernel modules at boot
      - name: br_netfilter  # Required for Kubernetes networking
      - name: rbd           # Ceph RBD (RADOS Block Device) support
                           # Why: Required for Ceph block storage
      # Confirm modules are present after a node boots:
      # talosctl --nodes 192.168.0.11 read /proc/modules | grep -E 'br_netfilter|rbd'
                           # Note: 'ceph' module is only for CephFS (not needed here)
  sysctls:  # Kernel tuning parameters
    net.core.somaxconn: "65535"         # Max queued connections
    net.core.netdev_max_backlog: "4096" # Network device backlog
    net.netfilter.nf_conntrack_max: "131072"  # Increase connection tracking
                                               # Why: Prevents "table full" errors
    # CRITICAL for Cilium eBPF - I lost 6 hours debugging this!
    net.ipv4.conf.all.rp_filter: "2"      # Loose mode reverse path filtering
    net.ipv4.conf.default.rp_filter: "2"  # Loose mode for new interfaces
                                          # Why: Cilium eBPF needs relaxed rp_filter
                                          # Start with "2" (loose) - keeps basic anti-spoofing
                                          # Only drop to "0" if you see asymmetric routing issues
  install:
    disk: /dev/sda  # Which disk to install Talos on
                    # IMPORTANT: Verify with 'lsblk' - wrong disk = data loss!
    wipe: true      # Erase the disk before installing
                    # Why: Ensures clean installation
    extraKernelArgs:
      - talos.logging.kernel=udp://192.168.0.2:514/  # Send logs to syslog
                                                      # Why: Central logging
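Several values in this patch only work if they fall inside (or stay outside) particular subnets, so I find it worth checking them before applying anything. The helpers below are a small sketch in plain shell arithmetic; ip_to_int and in_cidr are names I chose, and they handle IPv4 only.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() ( IFS=.; set -- $1; echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 )) )

# Succeeds when IP falls inside CIDR, e.g. in_cidr 10.96.0.10 10.96.0.0/12
in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network part before the slash
  bits=${2#*/}                 # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)

in_cidr 10.96.0.10 10.96.0.0/12 && echo "clusterDNS is inside serviceSubnets"
in_cidr 192.168.0.200 192.168.0.0/24 && echo "VIP is inside the node subnet"
in_cidr 192.168.0.11 10.244.0.0/16 || echo "node IP does not collide with podSubnets"
```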

Control Plane Configuration

Each control plane node needs specific configuration. Understanding these settings is crucial for a stable cluster.

Understanding the VIP (Virtual IP)

The Virtual IP (VIP) is a floating IP address that moves between control plane nodes. This ensures:

High Availability: If a control plane fails, the VIP moves to another
Single Endpoint: Clients always connect to the same IP
Automatic Failover: No manual intervention needed

All three control plane nodes compete for the VIP. Talos's VIP controller uses leader election and health checks to attach the VIP to one healthy control-plane node. It often coincides with the etcd leader, but isn't guaranteed to always be the same node.

The VIP requires L2 adjacency (same broadcast domain). It won't float across L3/routed boundaries.

⚠️ Critical VIP Configuration Warning

DO NOT use the VIP as your Talos API endpoint in talosconfig! The VIP depends on etcd and kube-apiserver being healthy. If the cluster is unhealthy, you'll lose Talos API access when you need it most.

For Talos API: Use individual node IPs (192.168.0.11, 192.168.0.12, 192.168.0.13)
For Kubernetes API: Use the VIP (192.168.0.200) - this is safe for kubectl
Note: VIP election relies on etcd; it becomes reliably available once etcd is up (post-bootstrap)

Control Plane 1 Configuration

📌 Common NIC Names

eth0: Traditional naming (older systems, some VMs)
ens18: VMware/Proxmox VMs
enp3s0: Physical servers with predictable naming

Always verify with talosctl get links before finalizing configs! On a node that is still booted from the ISO, add --insecure --nodes <node-ip>.

# controlplane-1.yaml
machine:
  type: controlplane         # This node runs API server, scheduler, controller
  # Note: token is auto-generated from secrets.yaml - no need to set manually
  network:
    hostname: talos-cp-01    # Node's hostname in the cluster
    interfaces:
      - interface: eth0      # Network interface name (verify with talosctl get links)
        dhcp: false         # Static IP required for control planes
                           # Why: Control planes must have predictable IPs
        addresses:
          - 192.168.0.11/24  # This node's static IP/netmask
                            # /24 means 255.255.255.0 subnet mask
        mtu: 1500           # Match your network (underlay) MTU (default 1500)
                           # If Cilium runs in VXLAN tunnel mode, the pod (overlay) MTU is
                           # UNDERLAY_MTU - 50 (e.g., 1500 → 1450). Keep this interface at the
                           # underlay value and use the reduced value for Cilium in Part 3
                           # to avoid fragmentation.
        routes:
          - network: 0.0.0.0/0     # Default route (all traffic)
            gateway: 192.168.0.1   # Your router's IP
                                  # Why: Nodes need internet access
        vip:
          ip: 192.168.0.200  # The floating Virtual IP
                            # Why: Provides single endpoint for API access
  certSANs:  # Subject Alternative Names for TLS certificates
    - 192.168.0.11                  # This node's IP
    - 192.168.0.200                 # The VIP
    - talos-cp-01                   # Hostname
    - talos-cp-01.homelab.example   # FQDN (Fully Qualified Domain Name)
    # Why: Certificates must be valid for all these names/IPs

Control Plane 2 & 3 Configuration

Create similar configs for the other control planes. Note: If you plan to run workloads on control planes (by removing taints), add the same kubelet.extraMounts section from the worker config to ensure Ceph CSI works properly.

# controlplane-2.yaml - Changes from cp-01:
machine:
  network:
    hostname: talos-cp-02    # Different hostname
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.0.12/24   # Different static IP
        # ... rest same as cp-01 including VIP!
  certSANs:
    - 192.168.0.12           # This node's IP
    - 192.168.0.200          # Same VIP as cp-01
    # ... rest of SANs

# controlplane-3.yaml - Similar pattern for 192.168.0.13

Critical Points About Control Plane Config

All control planes MUST have the VIP configuration - This is how they know to compete for it
The VIP must be in the same subnet - It can't route to a different network
Choose an unused IP for the VIP - Conflicts will break everything
Ensure the VIP is outside your DHCP pool - Or make a DHCP reservation to avoid ARP/IP conflicts
Static IPs are mandatory - DHCP would cause nodes to lose contact

Worker Node Configuration

Worker nodes run your actual workloads (pods). They're simpler to configure but need specific settings for storage systems like Ceph.

Understanding Worker Node Role

Worker nodes:

Run application pods - Your actual workloads
Don't run control plane components - No API server, scheduler, etc.
Connect to control plane via VIP - For receiving instructions
Need storage configuration - For persistent volumes

Worker Configuration Explained

# worker-1.yaml
machine:
  type: worker               # This node only runs pods, not control plane
  # Note: token is auto-generated from secrets.yaml - no need to set manually
  network:
    hostname: talos-wrk-01   # Unique hostname for this worker
    interfaces:
      - interface: eth0
        dhcp: false         # Static IP recommended for consistency
        addresses:
          - 192.168.0.14/24  # This worker's IP address
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.0.1
        # Note: No VIP here - only control planes compete for VIP
  kubelet:
    # CRITICAL for Ceph CSI - Without this, volume mounts fail!
    # Note: If you plan to run workloads on control planes (untainted),
    # add this same extraMounts section to control plane configs too
    extraMounts:
      - destination: /var/lib/kubelet   # Where kubelet stores pod data
        type: bind                       # Bind mount type
        source: /var/lib/kubelet         # Source directory (same as dest)
        options:
          - bind      # Create a bind mount
          - rshared   # Share mount points with child mounts
                     # Why: Ceph CSI needs to propagate mounts to pods
          - rw        # Read-write access

Why the Extra Mount?

This was a major gotcha that took hours to debug. Here's why it's needed:

Talos has a read-only root filesystem - Most directories are immutable
Ceph CSI needs to create mount points - For attaching volumes to pods
Mount propagation must be bidirectional - CSI mounts must be visible to kubelet
Without rshared - Pods can't see their persistent volumes!

The error you'd see without this: MountVolume.MountDevice failed: rpc error: code = Internal desc = mount failed: exit status 32
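To see whether a mount point actually has shared propagation, you can look for a shared:N tag in mountinfo output. The helper below is a sketch that parses text in /proc/self/mountinfo format from stdin; is_shared_mount is my own name. Feeding it from talosctl read is an assumption on my part: the file reflects the mount namespace of whichever process serves the read, so treat the result as a hint rather than proof.

```shell
# Reads mountinfo-formatted lines on stdin; succeeds when MOUNTPOINT carries
# a "shared:N" propagation tag.
is_shared_mount() {
  awk -v mp="$1" '
    $5 == mp {
      # Optional fields start at column 7 and end at the "-" separator.
      for (i = 7; i <= NF && $i != "-"; i++)
        if ($i ~ /^shared:/) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Example (hypothetical usage against a worker node):
#   talosctl --nodes 192.168.0.14 read /proc/self/mountinfo \
#     | is_shared_mount /var/lib/kubelet && echo "shared propagation is active"
```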

Creating Worker Configs for All Nodes

Repeat for each worker, changing:

hostname: talos-wrk-02, talos-wrk-03, talos-wrk-04
addresses: 192.168.0.15/24, 192.168.0.16/24, 192.168.0.17/24

Installation Process

Time to bring your cluster to life! This process must be done in order - each step depends on the previous one.

Pre-Installation Checklist

Before starting:

[ ] All nodes booted from Talos ISO
[ ] You can ping each node's IP from your workstation
[ ] VIP address is not in use - verify with: ping -c1 192.168.0.200 && echo "VIP IN USE - PICK ANOTHER!" || echo "VIP available"
[ ] Configuration files ready (controlplane-*.yaml, worker-*.yaml)
[ ] Coffee ready (this takes about 30 minutes)

Step 1: Apply Configuration to First Control Plane

Start with your first control plane node. This becomes the initial cluster leader:

# Apply the configuration to the node
# --insecure: Node doesn't have certs yet, so skip TLS verification
# --nodes: IP of the node booted from ISO
# --file: The configuration file for this specific node
talosctl apply-config --insecure \
  --nodes 192.168.0.11 \
  --file controlplane-1.yaml

# Expected output:
# Applied configuration to 192.168.0.11

The node will now:

Validate the configuration
Install Talos to disk
Reboot into the installed system
Start Talos services

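Wait for it to come back online. At this stage etcd is not running yet (it starts at bootstrap in Step 2), so talosctl health cannot pass. Confirm that the Talos API answers instead:

# Confirm the Talos API responds from the installed system
# --talosconfig: client config written by 'talosctl gen config'
# --endpoints/--nodes: the node's own IP (the VIP is not up before bootstrap)
talosctl --talosconfig talosconfig \
  --endpoints 192.168.0.11 --nodes 192.168.0.11 version

# When the node is ready, this prints both the client and the server version.
# Until then it fails with a connection or certificate error; retry after a minute.

# After Step 2 (bootstrap), run the full health check:
# --wait-timeout: Maximum time to wait
talosctl --talosconfig talosconfig \
  --endpoints 192.168.0.11 --nodes 192.168.0.11 health --wait-timeout 10m

# The health check reports on etcd, kubelet, and apid for each discovered node.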
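Waiting for a node to come back is easy to script with a generic polling helper. This is a sketch: wait_until is a hypothetical function, and the talosctl invocation in the usage comment assumes the command exits non-zero while the node is unreachable.

```shell
# Re-run a command until it succeeds or TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$timeout_s" ] && return 1   # give up
    sleep "$interval_s"
    elapsed=$(( elapsed + interval_s ))
  done
  return 0
}

# Example: poll every 10 s for up to 10 minutes until the Talos API answers
#   wait_until 600 10 talosctl --talosconfig talosconfig \
#     --endpoints 192.168.0.11 --nodes 192.168.0.11 version
```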

Step 2: Bootstrap the Cluster

This is the most critical step - it initializes the cluster. Only run this ONCE on the FIRST control plane:

# Bootstrap initializes etcd and starts Kubernetes
# --nodes: Which node to bootstrap on
# --endpoints: Where to connect (same as nodes for first bootstrap)
# --talosconfig: Client config file (generated earlier)
talosctl bootstrap --nodes 192.168.0.11 --endpoints 192.168.0.11 \
  --talosconfig talosconfig

# Expected output:
# bootstrapping cluster
# bootstrap complete

What Bootstrap Actually Does

Understanding bootstrap helps troubleshoot issues:

Initializes etcd - Creates the distributed key-value store
Generates Kubernetes manifests - API server, scheduler, controller configs
Starts control plane pods - As static pods managed by kubelet
Initializes cluster state - Creates default namespaces, service accounts

Monitor the bootstrap progress:

# Watch real-time kernel/system logs
# -f: Follow log output (like tail -f)
talosctl --nodes 192.168.0.11 dmesg -f

# Check if etcd is running
talosctl --nodes 192.168.0.11 etcd status

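# The output has one row per queried node, with its member ID, DB size,
# the current leader's ID, and the Raft index/term (exact columns vary by version).
# At this point there is a single member, and it is its own leader.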

Step 3: Add Remaining Control Planes

Now add the other control planes to form a highly available cluster:

# Apply configuration to second control plane
talosctl apply-config --insecure \
  --nodes 192.168.0.12 \
  --file controlplane-2.yaml

# Apply configuration to third control plane
talosctl apply-config --insecure \
  --nodes 192.168.0.13 \
  --file controlplane-3.yaml

# Wait for all control planes to be healthy
# This checks multiple nodes at once
talosctl --nodes 192.168.0.11,192.168.0.12,192.168.0.13 health

# Expected output:
# discovered nodes: ["192.168.0.11", "192.168.0.12", "192.168.0.13"]
# waiting for etcd to be healthy: OK (all nodes)
# waiting for kubelet to be healthy: OK (all nodes)

Understanding etcd Cluster Formation

When the new control planes join:

They contact the existing etcd member (cp-01)
Request to join the etcd cluster
Synchronize the database
Form a 3-member quorum

Check etcd membership:

talosctl --nodes 192.168.0.11 etcd members

# Expect three rows, one each for talos-cp-01, talos-cp-02, and talos-cp-03,
# with their peer and client URLs (exact columns vary by version).
# Leader/follower roles are not listed here; 'talosctl etcd status' shows the leader.
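Why three members? etcd needs a majority (quorum) of members to accept writes, so the number of failures it tolerates is the member count minus the quorum. The arithmetic, as a quick sketch:

```shell
# Majority needed for N members, and how many members may fail.
quorum() { echo $(( $1 / 2 + 1 )); }
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
# members=1 quorum=1 tolerated_failures=0
# members=2 quorum=2 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

Two control planes are no safer than one, which is why this guide uses three.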

Step 4: Verify Control Plane and Access Kubernetes

Now that your control plane is running, get Kubernetes access:

# Generate kubeconfig file for kubectl
# --nodes: Any control plane node to query
# --endpoints: Talos API endpoints (NOT the VIP - use actual node IPs)
# --force-context-name: Use a named context (helpful with multiple clusters)
talosctl kubeconfig --nodes 192.168.0.11 \
  --endpoints 192.168.0.11,192.168.0.12,192.168.0.13 \
  --force-context-name homelab-cluster

# By default, this merges into ~/.kube/config
# Use --force to overwrite completely

# Set talosctl defaults to avoid repeating flags
# (endpoints are passed as separate arguments, not as a comma-separated list)
talosctl config endpoint 192.168.0.11 192.168.0.12 192.168.0.13
talosctl config node 192.168.0.11

# Switch to the new context explicitly so you don't clobber another cluster
kubectl config use-context homelab-cluster

# Now kubectl commands will work
kubectl get nodes

# Expected output:
# NAME          STATUS     ROLES           AGE   VERSION
# talos-cp-01   NotReady   control-plane   5m    v1.x.x
# talos-cp-02   NotReady   control-plane   3m    v1.x.x
# talos-cp-03   NotReady   control-plane   2m    v1.x.x

Why Nodes Show "NotReady"

Don't panic! Nodes show NotReady because:

No CNI (Container Network Interface) installed yet - Pods can't network
CoreDNS is pending - Waiting for network to start
This is completely normal - We'll install Cilium CNI in Part 3

If you see NetworkPluginNotReady in kubelet, that's expected until CNI is installed.

You can verify the issue and see CoreDNS pending:

kubectl describe node talos-cp-01 | grep Ready
# Output shows:
# Ready    False   KubeletNotReady   container runtime network not ready

# See CoreDNS pods stuck pending:
kubectl -n kube-system get pods | grep coredns
# Output shows:
# coredns-xxx   0/1     Pending   0          5m
# coredns-yyy   0/1     Pending   0          5m

Step 5: Add Worker Nodes

With control planes ready, add the workers:

# Apply configuration to all workers
# This loop applies configs to IPs 192.168.0.14-17
for i in 1 2 3 4; do
  echo "Configuring worker-${i} at 192.168.0.$((13+i))"
  talosctl apply-config --insecure \
    --nodes 192.168.0.$((13+i)) \
    --file worker-${i}.yaml
done

# List the nodes (add -w to keep watching until all seven appear)
# -o wide: Shows additional info like IPs and container runtime
kubectl get nodes -o wide

# Expected output:
# NAME          STATUS     ROLES           AGE   VERSION   INTERNAL-IP
# talos-cp-01   NotReady   control-plane   10m   v1.x.x    192.168.0.11
# talos-cp-02   NotReady   control-plane   8m    v1.x.x    192.168.0.12
# talos-cp-03   NotReady   control-plane   7m    v1.x.x    192.168.0.13
# talos-wrk-01  NotReady   <none>          2m    v1.x.x    192.168.0.14
# talos-wrk-02  NotReady   <none>          2m    v1.x.x    192.168.0.15
# talos-wrk-03  NotReady   <none>          2m    v1.x.x    192.168.0.16
# talos-wrk-04  NotReady   <none>          2m    v1.x.x    192.168.0.17

Critical Post-Bootstrap Configurations

These configurations fix issues that aren't obvious until you start installing components.

Quick Apply: All Post-Bootstrap Fixes

For experienced users, here's a paste-once script that applies all critical fixes:

# Define your node sets (edit if your IPs differ)
CP_NODES="192.168.0.11,192.168.0.12,192.168.0.13"
ALL_NODES="192.168.0.11,192.168.0.12,192.168.0.13,192.168.0.14,192.168.0.15,192.168.0.16,192.168.0.17"

# DNS: prevent kube-DNS forwarding loops with Cilium (Talos >=1.8 defaults to true)
cat > dns-patch.yaml <<'EOF'
machine:
  features:
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: false
EOF

# Audit logging: pragmatic defaults
cat > audit-patch.yaml <<'EOF'
cluster:
  apiServer:
    auditPolicy:
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
        - level: Metadata
          omitStages: ["RequestReceived"]
          namespaces: ["kube-system", "default"]
        - level: RequestResponse
          omitStages: ["RequestReceived"]
          users: ["admin"]  # Replace "admin" with your real identity (e.g., "kubernetes-admin"
                            # or your OIDC user/group)
          verbs: ["create", "update", "patch", "delete"]
        - level: Metadata
          omitStages: ["RequestReceived"]
EOF

# Remote logging: ship logs to syslog (edit endpoint for your setup)
cat > logging-patch.yaml <<'EOF'
machine:
  logging:
    destinations:
      - endpoint: udp://192.168.0.2:514/
        format: json_lines
EOF

# Apply all patches
talosctl patch machineconfig --nodes "$ALL_NODES" --patch @dns-patch.yaml
talosctl patch machineconfig --nodes "$CP_NODES" --patch @audit-patch.yaml
talosctl patch machineconfig --nodes "$ALL_NODES" --patch @logging-patch.yaml

# Verify key settings
echo "Verifying DNS setting..."
talosctl --nodes 192.168.0.11 get machineconfig -o yaml | grep -n 'forwardKubeDNSToHost: false' || echo "DNS patch failed"
echo "Verifying audit policy..."
talosctl --nodes 192.168.0.11 get machineconfig -o yaml | grep -n 'auditPolicy:' || echo "Audit patch failed"

Or follow the detailed explanations below to understand each fix:

The DNS Forwarding Gotcha (6 Hours of Debugging!)

This was my biggest gotcha. Here's what happened and why:

The Problem

On Talos 1.8+, forwardKubeDNSToHost defaults to true, which means:

CoreDNS forwards upstream (non-cluster) lookups to the Talos host DNS resolver
That resolver listens on a link-local address on the node, not on a normal pod or service IP
With kube-proxy disabled, Cilium's eBPF datapath (eBPF masquerading in particular) doesn't handle this traffic correctly
Result: Pods can't resolve DNS, mysterious connection timeouts

The Solution

# dns-patch.yaml
machine:
  features:
    hostDNS:
      enabled: true              # Keep host DNS resolver active
      forwardKubeDNSToHost: false # CoreDNS keeps its own upstreams instead of the host resolver
                                 # Default is true on Talos 1.8+
                                 # Why: Prevents DNS loops with Cilium without kube-proxy

Apply this fix to ALL nodes:

# Create the patch file
# 'cat > file <<EOF' creates a file with content until EOF
cat > dns-patch.yaml <<EOF
machine:
  features:
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: false
EOF

# Apply the patch to all nodes at once
# patch machineconfig: Modifies node configuration without full reapply
# --nodes: Comma-separated list of all node IPs
# --patch @file: Apply patch from file (@ means read from file)
talosctl patch machineconfig \
  --nodes 192.168.0.11,192.168.0.12,192.168.0.13,192.168.0.14,192.168.0.15,192.168.0.16,192.168.0.17 \
  --patch @dns-patch.yaml

# Nodes will restart their DNS configuration
# No reboot required!

How to Verify the Fix

# Check the configuration was applied
talosctl --nodes 192.168.0.11 get machineconfig -o yaml | grep forwardKubeDNSToHost

# Should show:
# forwardKubeDNSToHost: false

Enable Audit Logging

Audit logging tracks who did what in your cluster - essential for security and debugging:

Why Audit Logging Matters

Security: Track unauthorized access attempts
Compliance: Required for many security standards
Debugging: See what changed before things broke
Learning: Understand how Kubernetes operators work

# audit-patch.yaml
cluster:
  apiServer:
    auditPolicy:
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
        # Log basic info for system namespaces
        - level: Metadata          # What: Log request metadata only
          omitStages:
            - RequestReceived      # Skip: Initial request stage (too noisy)
          namespaces: ["kube-system", "default"]

        # Log full details for admin modifications
        - level: RequestResponse   # What: Log request and response bodies
          omitStages:
            - RequestReceived
          users: ["admin"]         # Who: Admin users (adjust for your identity provider)
                                   # Note: May be email/group from OIDC in production
          verbs: ["create", "update", "patch", "delete"]  # What: Modifications

        # Log metadata for everything else
        - level: Metadata
          omitStages:
            - RequestReceived

Apply to control planes only:

# Create the patch file
cat > audit-patch.yaml <<EOF
[paste the yaml above]
EOF

# Apply to control planes (they run the API server)
talosctl patch machineconfig \
  --nodes 192.168.0.11,192.168.0.12,192.168.0.13 \
  --patch @audit-patch.yaml

Configure Remote Logging

Talos has limited disk space. Send logs to a central server to prevent disk exhaustion:

# logging-patch.yaml
machine:
  logging:
    destinations:
      - endpoint: udp://192.168.0.2:514/  # Your syslog server IP:port
        format: json_lines                # Structured logs for parsing
        # Why: JSON format makes logs searchable in tools like Loki

Setting Up a Simple Syslog Server

If you don't have one, here's a quick rsyslog setup on Ubuntu/Debian:

# On your syslog server (192.168.0.2):
# Install rsyslog
sudo apt-get install rsyslog

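# Configure to receive UDP logs
# (tee runs as root, unlike a ">>" redirect after sudo; the quoted 'EOF' stops
# the shell from expanding $fromhost-ip before rsyslog sees it)
sudo tee -a /etc/rsyslog.conf >/dev/null <<'EOF'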
# Receive logs on UDP 514
module(load="imudp")
input(type="imudp" port="514")

# Store Talos logs separately
if $fromhost-ip startswith '192.168.0.' then /var/log/talos.log
& stop
EOF

# Restart rsyslog
sudo systemctl restart rsyslog

# Watch logs arrive
sudo tail -f /var/log/talos.log

Verifying Your Installation

Before moving forward, thoroughly verify your cluster is healthy. These checks will save hours of debugging later.

Check Overall Cluster Health

# talosctl health targets a single node and discovers the other cluster members itself,
# so point it at one control plane instead of listing every node
talosctl --nodes 192.168.0.11 health

# Note: bash brace expansion such as 192.168.0.{11..17} produces space-separated words,
# not the comma-separated list that --nodes expects. For commands that accept several
# nodes, join the words first: echo 192.168.0.{11..17} | tr ' ' ','

# Expected output for healthy nodes:
# discovered nodes: ["192.168.0.11" ... "192.168.0.17"]
# waiting for etcd to be healthy: OK (control planes only)
# waiting for kubelet to be healthy: OK (all nodes)
# waiting for apid to be ready: OK (all nodes)

Verify etcd Cluster Health

etcd is your cluster's brain - it must be healthy:

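# Check etcd status on every control plane (one row per node queried)
talosctl --nodes 192.168.0.11,192.168.0.12,192.168.0.13 etcd status

# Look for: the same LEADER value on all three rows and an empty ERRORS column

# List the etcd members as seen by one node
talosctl --nodes 192.168.0.11 etcd members

# Look for: three members (talos-cp-01, talos-cp-02, talos-cp-03), none marked as a learner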

# Check etcd alarm status (should be empty)
talosctl --nodes 192.168.0.11 etcd alarm list

# No output = good! Alarms indicate problems

Verify Kubernetes API Access

# Check cluster endpoint
kubectl cluster-info

# Expected output:
# Kubernetes control plane is running at https://192.168.0.200:6443

# List all pods (minimal without CNI)
kubectl get pods -A

# Expected output:
# NAMESPACE     NAME                                  READY   STATUS
# kube-system   kube-apiserver-talos-cp-01            1/1     Running
# kube-system   kube-apiserver-talos-cp-02            1/1     Running
# kube-system   kube-apiserver-talos-cp-03            1/1     Running
# kube-system   kube-controller-manager-talos-cp-01   1/1     Running
# kube-system   kube-controller-manager-talos-cp-02   1/1     Running
# kube-system   kube-controller-manager-talos-cp-03   1/1     Running
# kube-system   kube-scheduler-talos-cp-01            1/1     Running
# kube-system   kube-scheduler-talos-cp-02            1/1     Running
# kube-system   kube-scheduler-talos-cp-03            1/1     Running
# kube-system   coredns-xxx                           0/1     Pending  (normal - needs CNI)
# kube-system   coredns-yyy                           0/1     Pending  (normal - needs CNI)

Test VIP Failover (Critical!)

This test ensures your cluster survives control plane failures:

# Step 1: Find which node currently holds the VIP
talosctl --nodes 192.168.0.11,192.168.0.12,192.168.0.13 get addresses | grep 192.168.0.200
# Or with more detail:
talosctl --nodes 192.168.0.11,192.168.0.12,192.168.0.13 get addresses --output wide | grep 192.168.0.200
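# Or build the node list with brace expansion (joined with commas, because
# --nodes does not accept the space-separated words bash produces):
talosctl --nodes "$(echo 192.168.0.{11..13} | tr ' ' ',')" get addresses -o wide | grep 192.168.0.200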

# Example output:
# NODE            ADDRESS
# 192.168.0.11    192.168.0.200/24  # This node has the VIP

# Step 2: Reboot the VIP holder to trigger failover
echo "Rebooting node with VIP..."
talosctl reboot --nodes 192.168.0.11

# Step 3: Immediately check VIP moved (within 5-10 seconds)
talosctl --nodes 192.168.0.12,192.168.0.13 get addresses | grep 192.168.0.200

# Should show VIP on different node:
# NODE            ADDRESS
# 192.168.0.12    192.168.0.200/24  # VIP moved here!

# Step 4: Verify API still works during failover
kubectl get nodes
# Or test API server directly:
kubectl get --raw '/livez?verbose'   # Quoted so the shell does not treat ? as a glob

# All nodes should still be visible (one will show NotReady briefly)

💡 VIP Failover Tip: If failover seems slow (>10s), check for "gratuitous ARP rate-limit" or "ARP suppression" features on your switch/router and relax them for the control-plane VLAN.
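Step 4 above checks the API once. To put a rough number on how long the API was unreachable during failover, you can probe it once per second and count the failures. This is a minimal sketch; `count_failed_probes` is a hypothetical helper, and the result is only approximate because each probe takes some time itself:

```shell
# Run a probe command N times, one second apart, and print how many probes failed.
# Usage: count_failed_probes N COMMAND [ARGS...]
count_failed_probes() (
  probes="$1"; shift
  failed=0
  i=0
  while [ "$i" -lt "$probes" ]; do
    # Any non-zero exit status counts as "API unreachable" for this probe
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
    sleep 1
  done
  echo "$failed"
)

# Usage: start this in a second terminal just before rebooting the VIP holder
# count_failed_probes 60 kubectl get --raw /readyz --request-timeout=1s
```

A result of a few failed probes is consistent with the 5-10 second failover window mentioned above; a much larger count suggests the ARP issue described in the tip.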

Check Node Resources

Verify each node has the expected CPU and memory:

# Show resource capacity for all nodes
kubectl describe nodes | grep -A 6 "Capacity:"

# Expected output per node:
# Capacity:
#   cpu:                4         # Number of CPU cores
#   ephemeral-storage:  100Gi     # Local disk space
#   memory:             16Gi      # RAM
#   pods:               110       # Max pods per node

# For detailed view of a specific node:
kubectl describe node talos-cp-01

Verify Network Configuration

Ensure nodes are using the correct network interfaces:

# Check network interfaces on each node
for i in 11 12 13 14 15 16 17; do
  echo "=== Node 192.168.0.$i ==="
  talosctl --nodes 192.168.0.$i get addresses | grep -E "eth0|192.168"
done

Troubleshooting Common Issues

Here are the issues I encountered and how to fix them:

Node Stuck in "Booting" State

If talosctl health shows a node stuck in "Booting":

Diagnosis Steps

# 1. Check real-time kernel logs for errors
talosctl --nodes 192.168.0.11 dmesg -f | grep -i error

# 2. Check if the node can reach the API
talosctl --nodes 192.168.0.11 logs machined | grep "connection refused"

# 3. Verify disk was found and formatted
talosctl --nodes 192.168.0.11 dmesg | grep -E "sda|nvme"

Common Causes and Fixes

Wrong disk in config:

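# Check what disks the node sees
# (recent Talos releases expose this as a resource; the older `talosctl disks` is deprecated)
talosctl --nodes 192.168.0.11 get disks

# Illustrative output, simplified to the relevant columns:
# DEV        SIZE    MODEL
# /dev/sda   500GB   Samsung SSD
# /dev/sdb   1TB     WD Blue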

# Update your config with correct disk and reapply

Network misconfiguration:

# Verify network interface name
talosctl --nodes 192.168.0.11 get links

# Should match interface in your config (eth0, ens18, etc.)

Time sync issues (causes cert validation failures):

# Quick check of system time
talosctl --nodes 192.168.0.11 time

# Check time sync status (SYNCED should be true)
talosctl --nodes 192.168.0.11 get timestatus

# Check for time sync in logs if above commands show issues
talosctl --nodes 192.168.0.11 logs machined | grep -i ntp

# Verify NTP servers are configured
talosctl --nodes 192.168.0.11 get machineconfig -o yaml | grep -A2 "time:"

etcd Members Not Forming Cluster

When control planes can't form an etcd cluster:

Quick Diagnosis

# Check etcd container logs
talosctl --nodes 192.168.0.11 logs etcd | grep -E "error|fail|reject"

# Common error: "connection refused"
# Means: etcd can't reach other members

# Check network connectivity between nodes
talosctl --nodes 192.168.0.11 get addresses
talosctl --nodes 192.168.0.11 get routes

Resolution Steps

Network connectivity issue:

# Verify etcd service is running
talosctl --nodes 192.168.0.11 services | grep etcd

# Should show etcd service active

Certificate issues:

# Check for pending certificate signing requests
kubectl get csr
kubectl describe csr <name>

# If a node's certificates cannot be repaired, wipe the node and re-provision it
# ☠️ WARNING: This destroys local etcd/kubelet state on that node; reapply its config after it reboots
talosctl --nodes 192.168.0.11 reset --graceful --reboot

Force etcd recovery (LAST RESORT - causes downtime):

# WARNING: Removing members without quorum can cause data loss!
# Try to snapshot etcd first if possible:
talosctl --nodes 192.168.0.11 etcd snapshot etcd-backup.snapshot

# NOTE: Restoring from snapshot changes the cluster ID
# You'll need to re-join members cleanly after restore

# Only if etcd is completely broken
talosctl --nodes 192.168.0.11 etcd forfeit-leadership
talosctl --nodes 192.168.0.11 etcd leave
talosctl --nodes 192.168.0.11 reboot

API Server Inaccessible via VIP

The scariest issue - can't reach your cluster API:

Systematic Debugging

# 1. Check if ANY node has the VIP
for i in 11 12 13; do
  echo "=== Checking 192.168.0.$i for VIP ==="
  talosctl --nodes 192.168.0.$i get addresses | grep 192.168.0.200
done

# If no output: No node has VIP!

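# 2. Check etcd health (the VIP holder is chosen through an election stored in etcd,
#    so the VIP cannot be assigned while etcd is down or has lost quorum)
talosctl --nodes 192.168.0.11 etcd status

# If this errors or reports no leader: fix etcd first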

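# 3. Manually verify API server is running
# (-k lists containers in the Kubernetes namespace; the default is Talos system containers)
talosctl --nodes 192.168.0.11 containers -k | grep kube-apiserver
# kube-apiserver is a static pod, not a Talos service, so check the kubelet that runs it:
talosctl --nodes 192.168.0.11 service kubelet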

# 4. Test API directly (bypass VIP)
kubectl --server=https://192.168.0.11:6443 get nodes
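Step 1 of this checklist can be condensed into a small filter that reports which node, if any, holds the VIP. This is a sketch under one assumption: that the node address is the first column of the `talosctl get addresses` output, as in the examples earlier in this article. `vip_holders` is a hypothetical helper name.

```shell
# Read `talosctl get addresses` output on stdin and print the first column (NODE)
# of every row that mentions the given VIP, one node per line, without duplicates.
vip_holders() {
  # Match "VIP/" so that 192.168.0.20 does not also match 192.168.0.200
  awk -v vip="$1" 'index($0, vip "/") { print $1 }' | sort -u
}

# Usage:
# talosctl --nodes 192.168.0.11,192.168.0.12,192.168.0.13 get addresses | vip_holders 192.168.0.200
```

No output means no node holds the VIP; more than one line would suggest two nodes are claiming it at once, which is worth investigating before anything else.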

Common Fix

Usually the VIP configuration is missing or wrong:

# Verify VIP config exists on ALL control planes
for i in 11 12 13; do
  echo "=== Node 192.168.0.$i VIP config ==="
  talosctl --nodes 192.168.0.$i get machineconfig -o yaml | grep -A2 "vip:"
done

# All must show:
# vip:
#   ip: 192.168.0.200

DNS Resolution Failures

Pods can't resolve DNS - extremely common before CNI installation:

Understanding the Problem

# Check CoreDNS pod status
kubectl get pods -n kube-system | grep coredns

# Shows: Pending
# Why: No CNI = no pod network = CoreDNS can't start

Temporary Observability Checks (before CNI)

# 1. Check node's DNS resolver
talosctl --nodes 192.168.0.11 read /etc/resolv.conf

# 2. Check DNS resolvers configured on the node
talosctl --nodes 192.168.0.11 get resolvers

# 3. Check network routes
talosctl --nodes 192.168.0.11 get routes

# 4. Check running services
talosctl --nodes 192.168.0.11 services | grep -E 'etcd|kubelet|apid|trustd'

# 5. Check control plane containers (-k selects the Kubernetes container namespace)
talosctl --nodes 192.168.0.11 containers -k | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|coredns'

The Real Fix

Install CNI (Cilium) in Part 3 - DNS issues will resolve automatically.

Kubelet Certificate Errors

Seeing x509 certificate errors in logs:

# Check certificate expiry
talosctl --nodes 192.168.0.11 get KubernetesDynamicCerts -o yaml

# Rotate certificates if needed (rarely necessary)
# rotate-ca stays in dry-run mode unless you pass --dry-run=false, so review the dry run first
# For Talos CA rotation:
talosctl --nodes 192.168.0.11 rotate-ca --dry-run=true --talos=true --kubernetes=false
# If the dry run looks good, apply it:
talosctl --nodes 192.168.0.11 rotate-ca --dry-run=false --talos=true --kubernetes=false

# For Kubernetes CA rotation:
talosctl --nodes 192.168.0.11 rotate-ca --dry-run=true --talos=false --kubernetes=true
# If the dry run looks good, apply it:
talosctl --nodes 192.168.0.11 rotate-ca --dry-run=false --talos=false --kubernetes=true

Security Considerations

Before proceeding, implement these security basics to protect your cluster:

Secure Your Configuration Files

Your configuration files are the keys to the kingdom:

Protect talosconfig

# Copy talosconfig to a secure location with restricted permissions
# (then delete or encrypt the copy left in your working directory)
mkdir -p ~/.talos
cp talosconfig ~/.talos/config
chmod 600 ~/.talos/config  # Only you can read/write

# Set environment variable for talosctl
export TALOSCONFIG=~/.talos/config

# Add to your shell profile (.bashrc/.zshrc)
echo 'export TALOSCONFIG=~/.talos/config' >> ~/.bashrc

Protect secrets.yaml

# This file can recreate your entire cluster's security!
# Option 1: Encrypt with GPG
gpg -c secrets.yaml  # Creates secrets.yaml.gpg
shred -u secrets.yaml  # Securely delete original

# Option 2: Store in password manager
# Option 3: Store in hardware security key
# Option 4: Use sops + age for GitOps-friendly encryption
#   - sops encrypts only values, not keys (diff-friendly)
#   - Store the age key in your password manager

# To decrypt when needed:
gpg -d secrets.yaml.gpg > secrets.yaml

Network Security

Restrict API Server Access

The API server (port 6443) should only be accessible from trusted networks. Important: Apply these firewall rules on your upstream router/firewall, NOT on Talos nodes (which don't have iptables/ufw):

# Apply these on your upstream router/firewall. Talos itself doesn't run iptables/ufw.
# Using iptables (adjust for your firewall solution)
# FORWARD filters traffic routed through the box; INPUT would only cover the router itself
# Allow from local network only
sudo iptables -A FORWARD -p tcp --dport 6443 -s 192.168.0.0/24 -j ACCEPT
sudo iptables -A FORWARD -p tcp --dport 6443 -j DROP

# Using ufw (Ubuntu firewall on router/firewall box); "route" rules apply to forwarded traffic
sudo ufw route allow proto tcp from 192.168.0.0/24 to any port 6443
sudo ufw route deny proto tcp from any to any port 6443

# Verify rules
sudo iptables -L FORWARD -n | grep 6443

Restrict Talos API Access

Talos API (port 50000) is even more sensitive. Again, apply on your firewall/router:

# Apply these on your upstream router/firewall. Talos itself doesn't run iptables/ufw.
# Only allow from your workstation (FORWARD, because the router forwards this traffic)
sudo iptables -A FORWARD -p tcp --dport 50000 -s YOUR_WORKSTATION_IP -j ACCEPT
sudo iptables -A FORWARD -p tcp --dport 50000 -j DROP

For network policies within Talos itself, use Kubernetes NetworkPolicy resources after installing your CNI.

Enable Pod Security Standards

Kubernetes Pod Security Standards prevent risky pod configurations:

# Label kube-system namespace (system pods need privileged access)
kubectl label namespace kube-system \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted

# Explanation:
# enforce=privileged: Allow system pods to run privileged
# audit=restricted: Log violations of restricted policy
# warn=restricted: Warn about violations but don't block

Create a Secure Default Namespace

# Create namespace for your apps with security enforced
kubectl create namespace production
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted

# This namespace will:
# - Block pods requesting privileged access
# - Block pods running as root
# - Require security contexts
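To see what "require security contexts" means in practice, here is a sketch of a pod that should be admitted by the restricted profile. The pod name, image tag, and UID are illustrative choices, and `restricted_pod_manifest` is a hypothetical helper that only prints YAML:

```shell
# Print a Pod manifest with the settings the "restricted" profile asks for:
# non-root user, no privilege escalation, all capabilities dropped, default seccomp.
restricted_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
EOF
}

# Usage: a server-side dry run passes through admission without creating the pod
# restricted_pod_manifest demo | kubectl apply -n production --dry-run=server -f -
```

Removing any of the securityContext fields and repeating the dry run is a quick way to see which violations the namespace rejects.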

Audit Your Configuration

Run these checks to ensure security basics are in place:

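# Find files that contain secret material. Talos machine configs always embed keys and
# tokens, so treat every file that matches as sensitive and never commit it unencrypted
grep -nEi "password|secret|key" *.yaml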

# Verify file permissions
ls -la *.yaml talosconfig
# Should show -rw------- (600) for sensitive files

# Check critical services are running (on Talos nodes)
talosctl --nodes 192.168.0.11,192.168.0.12,192.168.0.13 services | grep -E "etcd|apid|kubelet"
# Should show services as active
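The permission check above relies on reading `ls` output by eye. A small sketch that automates it follows; `is_private` and `report_exposed` are hypothetical helper names, and the file names in the usage line assume the default outputs of `talosctl gen config`:

```shell
# Return 0 only if the file grants no permissions to group or others,
# i.e. characters 5-10 of the `ls -l` mode string are all "-".
is_private() {
  mode=$(ls -ld "$1" | cut -c5-10)
  [ "$mode" = "------" ]
}

# Print a line for every existing file in the argument list that others can access.
report_exposed() {
  for f in "$@"; do
    [ -e "$f" ] || continue
    is_private "$f" || echo "EXPOSED: $f"
  done
}

# Usage:
# report_exposed talosconfig secrets.yaml controlplane.yaml worker.yaml
```

Empty output means every listed file is restricted to its owner; fix anything reported with `chmod 600`.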

Performance Baseline

Establish performance baselines now, before adding workloads. These metrics help identify problems later:

Measure etcd Performance

etcd performance directly impacts cluster responsiveness:

# Check for etcd alarms (none is good)
talosctl --nodes 192.168.0.11 etcd alarm list

# Empty output = healthy
# Common alarms:
# - NOSPACE: Disk full
# - CORRUPT: Database corruption

# Check etcd metrics
talosctl --nodes 192.168.0.11 etcd status

# Look for:
# - DB SIZE: Should be <100MB for new cluster
# - RAFT TERM: Should be low (<10) indicating stable leadership
# - RAFT INDEX: Increases with each transaction

Measure API Server Responsiveness

# Time a simple API call
time kubectl get nodes

# Expected output:
# real    0m0.052s  # <100ms is excellent
# user    0m0.031s
# sys     0m0.016s

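# Time a full pod lifecycle (create, run, delete). This needs a Ready node, so run it
# after installing the CNI in Part 3; until then the pod stays Pending
time kubectl run test --image=busybox --rm -it --restart=Never -- echo "test"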

# Should complete in <2 seconds
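A single `time` measurement is noisy, so a baseline is more useful as an average over several calls. A minimal sketch, assuming GNU `date` (for `%N` nanoseconds); `mean_ms` is a hypothetical helper name:

```shell
# Run a command N times and print the mean wall-clock time in whole milliseconds.
# Usage: mean_ms N COMMAND [ARGS...]
mean_ms() (
  runs="$1"; shift
  total=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)          # nanoseconds since the epoch (GNU date)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    total=$((total + (end - start) / 1000000))
    i=$((i + 1))
  done
  echo $((total / runs))
)

# Usage:
# mean_ms 10 kubectl get nodes
```

The figure includes kubectl start-up time, so compare it only against numbers taken the same way from the same workstation.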

Check Resource Usage

Note baseline resource consumption for capacity planning:

# Install metrics-server first (optional but recommended)
# Note: its pod cannot be scheduled until nodes are Ready, so collect these numbers after Part 3
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# If metrics fail to scrape, you may need to add flags:
# Edit the deployment and add these args to the container:
# --kubelet-preferred-address-types=InternalIP
# --kubelet-insecure-tls  # Only as last resort for self-signed certs

# Wait 60 seconds for metrics to collect
sleep 60

# Check node resource usage
kubectl top nodes

# Expected baselines for idle cluster:
# NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# talos-cp-01   250m         6%     2Gi             12%
# talos-cp-02   240m         6%     2Gi             12%
# talos-cp-03   235m         5%     2Gi             12%
# talos-wrk-01  150m         3%     1Gi             6%
# talos-wrk-02  145m         3%     1Gi             6%
# talos-wrk-03  148m         3%     1Gi             6%
# talos-wrk-04  152m         3%     1Gi             6%

# Note: On very small labs, reduce metrics-server overhead by scraping less often.
# Add this under spec.template.spec.containers[0].args:
#   --metric-resolution=30s
kubectl -n kube-system edit deploy metrics-server

Document Your Baselines

Save these for future reference:

# Create a baseline report
# Note: this heredoc is unquoted so $(...) expands now
cat > cluster-baseline.txt <<EOF
Cluster Baseline - $(date)
===========================
Nodes: 7 (3 control plane, 4 workers)
Kubernetes: $(kubectl version | grep Server)

etcd Performance:
$(talosctl --nodes 192.168.0.11 etcd status)

API Response Time:
$( { time kubectl get nodes >/dev/null; } 2>&1 )

Resource Usage:
$(kubectl top nodes 2>&1)

Network Connectivity:
Control Plane VIP: 192.168.0.200
Pod Network: 10.244.0.0/16
Service Network: 10.96.0.0/12
EOF

echo "Baseline saved to cluster-baseline.txt"

Expected Performance Characteristics

Your cluster should exhibit:

etcd latency: <10ms for local cluster
API response: <100ms for simple queries
Control plane memory: ~2GB per node (increases with workloads)
Control plane CPU: <10% at idle
Worker memory: ~1GB base (before workloads)
Worker CPU: <5% at idle
Network latency: <1ms between nodes on same network
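The idle CPU thresholds in this list can be checked mechanically against `kubectl top nodes`. The following is a sketch, assuming the column layout shown in the baseline output earlier (node name first, CPU% third); `flag_busy_nodes` is a hypothetical helper name:

```shell
# Read `kubectl top nodes` output on stdin and print every node whose CPU%
# is above the given threshold. The header row is skipped.
flag_busy_nodes() {
  awk -v max="$1" 'NR > 1 {
    pct = $3
    sub(/%/, "", pct)            # "6%" -> "6"
    if (pct + 0 > max) print $1, $3
  }'
}

# Usage (workers should idle below 5%, control planes below 10%):
# kubectl top nodes | grep wrk | flag_busy_nodes 5
# kubectl top nodes | grep cp  | flag_busy_nodes 10
```

Note that the `grep` in the usage lines removes the header row, so in that form the first node is skipped; pipe the full output through the function when you want every node checked against one threshold.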

What's Next

Your cluster is bootstrapped but not yet functional - nodes are in NotReady state awaiting a CNI. This is completely normal! In Part 3, we'll:

Install Cilium for advanced eBPF networking
Enable native L2 load balancing without MetalLB
Configure network policies for security
Fix the DNS issues we discussed
Get all nodes to Ready state

The cluster will then be ready to run actual workloads.

Key Takeaways

Through this bootstrapping process, I learned several critical lessons:

VIP configuration on ALL control planes is mandatory - Without it on all three, failover won't work
forwardKubeDNSToHost: false is critical for Cilium - This single setting cost me 6 hours of debugging
Extra kubelet mounts with rshared are required for Ceph - Pods can't access storage without this
Node order matters - Bootstrap the first control plane completely before adding others
"NotReady" is normal without CNI - Don't panic, it's expected behavior
Baseline metrics immediately - You'll need these to identify performance regressions later
Secure configs from day one - Much harder to retrofit security later

Common Gotchas Summary

To save you the pain I experienced:

Wrong network interface name: Check with talosctl get links before configuring
Disk device mismatch: Verify with talosctl get disks (Talos has no shell, so lsblk is not available) - /dev/sda vs /dev/nvme0n1 matters
Missing kernel modules: Pre-build ISO with required extensions at Talos Factory
Time sync issues: Certificates fail if clocks are off by >5 minutes
VIP in wrong subnet: Must be in same network as control plane IPs
DNS loops with Cilium: Always set forwardKubeDNSToHost: false

References

Official Documentation

Talos Configuration Reference v1.11 - Complete config options
Talos Troubleshooting Guide - Common issues and solutions
etcd Operations Guide - Understanding etcd clustering
Kubernetes Components - How control plane works

Specific Issues I Encountered

Cilium DNS forwarding issue - The 6-hour debugging story
Ceph CSI mount propagation - Why rshared is required
Pod Security Standards - Security from the start

Continue to Part 3: Cilium CNI - Advanced Networking and Load Balancing →