Part 1: Foundation - Why Talos Linux and Initial Planning

After spending months running various Kubernetes distributions in my homelab, I finally found the perfect foundation: Talos Linux. This article explains why Talos stands out from alternatives like k3s, microk8s, or kubeadm, and how to properly plan your cluster before touching any hardware.

Prerequisites You'll Need

Before we begin, ensure you have:

Software Tools:

  • talosctl v1.11.0+ - Command-line tool for managing Talos nodes (installation guide)
  • kubectl matching your cluster's minor version (kubectl supports ±1 minor from the API server) (installation guide)
  • A text editor for YAML files (VS Code, nano, vim, etc.)
  • wget or curl for downloading files

Version Compatibility:

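# Check your current versions before starting
# --client avoids contacting a cluster that does not exist yet
talosctl version --client --short
kubectl version --client

# Expected versions (exact output format varies by release):
# talosctl: v1.11.0
# kubectl: v1.34.0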
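The ±1 minor version rule for kubectl can be checked mechanically. Below is a minimal Python sketch of that check; the function names are my own, and it assumes both version strings share the same major version (as all current Kubernetes releases do).

```python
import re


def minor_version(version: str) -> int:
    """Extract the minor number from a version string such as 'v1.34.0'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(2))


def kubectl_skew_ok(kubectl_version: str, apiserver_version: str) -> bool:
    """Return True when kubectl is within one minor version of the API server."""
    skew = abs(minor_version(kubectl_version) - minor_version(apiserver_version))
    return skew <= 1


print(kubectl_skew_ok("v1.34.0", "v1.33.2"))  # one minor version apart
print(kubectl_skew_ok("v1.34.0", "v1.31.0"))  # three minor versions apart
```

The first call reports a supported combination and the second an unsupported one, which is the signal to install a closer kubectl before continuing.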

Hardware Requirements:

  • Minimum 3 machines for control plane (physical or virtual)
  • Minimum 2 machines for worker nodes (physical or virtual)
  • Each machine needs at least 4GB RAM and 2 CPU cores
  • Network connectivity between all machines
  • DHCP server on your network OR static IP configuration knowledge

The Problem with Traditional Kubernetes Distributions

I started my Kubernetes journey like many others - with k3s on Ubuntu VMs. It worked, but I quickly discovered several pain points:

  1. Configuration drift: Each node slowly diverged as packages updated at different times
  2. Security surface: Full Linux distributions meant hundreds of packages to patch
  3. SSH access: Convenient but a massive security hole in production
  4. Manual maintenance: OS updates, kernel patches, and configuration management consumed hours weekly
  5. Reproducibility: Rebuilding a failed node meant hoping my Ansible playbooks still worked

These issues compound in a homelab where you're both the platform team and the application developer. You need infrastructure that just works.

Enter Talos Linux: Kubernetes-Native Operating System

Talos Linux (https://talos.dev) takes a radically different approach. Instead of retrofitting Kubernetes onto a general-purpose Linux distribution like Ubuntu or CentOS, Talos is built specifically to run Kubernetes and nothing else.

What Makes Talos Different

Immutable and Atomic

Let me explain what "immutable" means with a practical example:

# Traditional Linux: You can modify any system file
# This command would add a line allowing root SSH login (security risk!)
$ ssh node1 "echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config"
# Result: File is modified, system is now less secure

# Talos: The root filesystem is read-only, even from a privileged debug pod
$ kubectl -n kube-system debug -it --profile sysadmin --image alpine:3.20 node/talos-cp-01
# Inside the debug container, the host filesystem is mounted at /host:
/ # touch /host/usr/bin/backdoor
# touch: /host/usr/bin/backdoor: Read-only file system
# Result: System files cannot be changed, maintaining security

The entire OS is read-only except for designated data directories. Configuration changes require generating new machine configs and applying them atomically - either the whole configuration applies successfully, or nothing changes. This prevents partial updates that could break your system.

Why This Matters: You can't accidentally break the OS by modifying the wrong file. Every change is intentional and tracked.

API-Driven Everything

Talos has no SSH daemon. No shell. No package manager. Every interaction happens through the Talos API:

# View logs from a service (kubelet manages pods on the node)
talosctl -n 192.168.0.11 logs kubelet

# Read files from the node
talosctl -n 192.168.0.11 read /var/log/pods/...

# List a directory on the node (there is intentionally no general-purpose exec)
talosctl -n 192.168.0.11 list /var/lib/kubelet

This seems restrictive until you realize it eliminates entire categories of security vulnerabilities:

  • No SSH keys to steal
  • No shell escapes to exploit
  • No privilege escalation through sudo misconfigurations
  • No package vulnerabilities beyond the minimal OS

Minimal Attack Surface

Talos is tiny: the read-only SquashFS rootfs is < 80 MB. Fewer moving parts → fewer CVEs and less patching.

Why This Matters: Fewer components = fewer things that can break or be exploited. It's security through simplicity.

Real-World Benefits I've Experienced

  1. Fast Recovery: Lost a node? Boot from Talos ISO, apply the configuration, rejoin cluster. Total time: ~5 minutes.
  2. Security by Default:
    • Secure boot support (cryptographically verify the OS hasn't been tampered with)
    • Rootfs verification (ensure filesystem integrity)
    • Automatic certificate rotation (security certificates refresh automatically)
    • No unnecessary services (only runs what Kubernetes needs)

  3. Configuration as Code: All node configuration lives in version control:

machine:
  type: controlplane        # This node will run the control plane
  token: ${MACHINE_TOKEN}   # Used during secure discovery/TLS bootstrap to mint node certs
  network:
    hostname: talos-cp-01   # Human-readable name for this node
    interfaces:
      - interface: eth0     # Interface name; confirm with `talosctl get links` (often eth0 or a predictable name like enp1s0)
        dhcp: false         # We'll use static IPs for predictability
        addresses:
          - 192.168.0.11/24 # Static IP in CIDR notation
                            # /24 means subnet mask 255.255.255.0

  4. Predictable Updates: Upgrading Talos is a single command that atomically updates the OS:

# Upgrade a node to a new Talos version
# --nodes = which node to upgrade
# --image = the new version to install
talosctl upgrade --nodes 192.168.0.11 --image ghcr.io/siderolabs/installer:v1.11.1

# What happens:
# 1. New OS downloads to a separate partition
# 2. Node reboots into new version
# 3. If boot fails, automatically rolls back to previous version

See Talos upgrade documentation for detailed upgrade procedures.

Planning Your Cluster Architecture

Before installing anything, invest time in proper planning. Here's the architecture I've refined over multiple iterations:

Node Planning

Control Plane Nodes (3 minimum)

  • Why exactly 3? Kubernetes uses etcd (distributed key-value store) to store all cluster data. Etcd requires a majority (quorum) to function. With 3 nodes:
    • All 3 running = fully operational
    • 2 running = still operational (2 out of 3 is majority)
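    • 1 running = etcd loses quorum: API writes fail and new scheduling halts (existing pods usually keep running)
    • This is why you want odd numbers: a 2-node cluster still needs both members for a majority, so it tolerates zero failures, just like 1 node, while doubling the number of things that can fail
  • Resources: 4 CPU cores, 8GB RAM recommended (4GB RAM is the practical minimum)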
    • Why: etcd is surprisingly resource-intensive, especially during leader elections
    • Control plane also runs API server, scheduler, and controller manager
  • Disk: 50GB SSD for OS and etcd
    • Why: etcd performance directly impacts cluster responsiveness
    • Every Kubernetes operation goes through etcd
    • Slow disk = slow cluster
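The quorum arithmetic above is easy to tabulate. Here is a small Python sketch (the function names are mine) that applies the majority rule to cluster sizes from 1 to 5:

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for members in range(1, 6):
    print(
        f"{members} members: quorum={quorum(members)}, "
        f"tolerated failures={fault_tolerance(members)}"
    )
```

Running it shows that 2 members tolerate no failures (the same as 1) and 4 members tolerate only one (the same as 3), which is why even-sized control planes add cost without adding resilience.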

Worker Nodes (2 minimum, 3+ recommended)

  • Why separate workers?
    • Control plane stability: A runaway pod consuming all CPU won't affect cluster management
    • Security: User workloads are isolated from control plane
    • Scaling: Can add/remove workers without affecting control plane
  • Resources: 8+ CPU cores, 16GB+ RAM
    • Why: These run your actual applications
    • Need headroom for system pods (CNI, DNS, monitoring)
  • Disk: 100GB+ SSD for OS, container images, and Ceph storage
    • Container images can consume significant space
    • Ceph (storage system) needs dedicated space

An example setup:

Control Plane:
  talos-cp-01: 192.168.0.11 (4 CPU, 8GB RAM, 50GB SSD)
  talos-cp-02: 192.168.0.12 (4 CPU, 8GB RAM, 50GB SSD)
  talos-cp-03: 192.168.0.13 (4 CPU, 8GB RAM, 50GB SSD)
  API Endpoint: 192.168.0.200 (HA access - see below for options)

Workers:
  talos-wrk-01: 192.168.0.14 (8 CPU, 32GB RAM, 500GB SSD)
  talos-wrk-02: 192.168.0.15 (8 CPU, 32GB RAM, 500GB SSD)
  talos-wrk-03: 192.168.0.16 (8 CPU, 16GB RAM, 250GB SSD)
  talos-wrk-04: 192.168.0.17 (8 CPU, 16GB RAM, 250GB SSD)

High Availability API Access

Understanding Control Plane HA:
The Kubernetes API needs a stable endpoint that remains accessible even when control plane nodes fail. This is NOT about load balancing services - it's about ensuring kubectl and nodes can always reach the API.

Option 1: Talos Built-in VIP

Talos includes a VIP (Virtual IP) feature that provides a single, stable endpoint for the Kubernetes API:

# In your Talos machine config for control plane nodes:
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.0.11/24  # Node's primary IP
        vip:
          ip: 192.168.0.200  # Shared VIP for API endpoint

# Talos recommends against using the VIP as a talosctl endpoint: the VIP depends on etcd,
# so it disappears when quorum is lost. Point the endpoints in `talosconfig` at control
# plane node IPs directly so you keep break-glass access.

# How it works:
# - One control plane node holds the VIP at a time
# - Control plane nodes elect the holder through etcd
# - If the VIP holder fails, another control plane node takes over
# - kubectl uses the VIP address; talosctl uses the node IPs

Important: This VIP is ONLY for Kubernetes API access, NOT for load balancing your applications. Application load balancing is handled by Cilium (covered in Part 3).

Break-Glass Access Tip: In a complete VIP failure, you can always connect directly to any control plane node's actual IP address using https://<cp-node-ip>:6443 (port 6443 is the Kubernetes API server default). This bypasses the VIP entirely, which is crucial for disaster recovery when the VIP mechanism itself fails.

Option 2: Talos KubePrism (Excellent Production Option)

KubePrism is Talos' built-in API proxy that provides client-side load balancing. KubePrism listens on localhost (127.0.0.1:7445) on each node and is enabled by default since Talos 1.6:

# In your Talos machine config:
machine:
  features:
    kubePrism:
      enabled: true  # Enable KubePrism
      port: 7445    # Local proxy port

# How it works:
# - Each node runs a local proxy on port 7445
# - Proxy knows all control plane endpoints
# - Automatically fails over to healthy nodes
# - Workers use localhost:7445 to reach API

Usage:

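# On-node components reach the API via the local proxy.
# Illustrative only: Talos nodes ship no kubectl, so this works only from a
# host-network pod that has kubectl and credentials available
kubectl --server=https://127.0.0.1:7445 get nodes

# External access still needs a load balancer or VIP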

Pros: Automatic failover, no single point of failure, load distribution
Cons: Only helps clients running on the nodes (via 127.0.0.1:7445); off-cluster admins still need a VIP or an external LB

Option 3: External Load Balancer (Enterprise Approach)

For production environments, use a dedicated load balancer:

# HAProxy example configuration:
frontend k8s_api
  bind *:6443
  mode tcp
  option tcplog
  default_backend k8s_control_plane

backend k8s_control_plane
  mode tcp
  balance roundrobin
  option tcp-check
  server cp1 192.168.0.11:6443 check
  server cp2 192.168.0.12:6443 check
  server cp3 192.168.0.13:6443 check

Pros: Professional-grade HA, advanced health checks, proper load balancing
Cons: Additional infrastructure, more complexity

Option 4: kube-vip

kube-vip provides a more advanced VIP with load balancing, running as a static pod or a DaemonSet:

# Generated with: kube-vip manifest daemonset ...
# Provides both VIP and load balancing
# Supports BGP, ARP, and cloud providers
# More complex but very flexible

For this series, we'll use Option 1 (Talos VIP) for API endpoint stability.

Service Load Balancing (Different from API HA!)

Critical Distinction:

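Type                            | Purpose                      | Port          | Managed By
--------------------------------|------------------------------|---------------|-------------------------
Control Plane Endpoint (VIP/LB) | Stable Kubernetes API access | 6443          | Talos VIP or external LB
Service Load Balancer           | Application traffic ingress  | 80, 443, etc. | Cilium L2 announcements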

Service load balancing is handled by Cilium L2 announcements with LB IPAM (Part 3), which provides a pool of IPs (192.168.0.201-210) for your applications. This is completely separate from the control plane VIP and configured through Cilium's IPAM pools and CRDs.

Network Architecture

Proper network planning prevents hours of debugging later. Here's what you need to consider:

High-Level Network Flow

Internet
   │
   ├─ Router/Gateway (192.168.0.1)
   │
   └─ Home Network Switch
       │
       ├─ DNS Servers (192.168.0.2-3)
       ├─ Kubernetes API VIP (192.168.0.200) ←── Control plane HA
       │   │
       │   ├─ talos-cp-01 (192.168.0.11)
       │   ├─ talos-cp-02 (192.168.0.12)
       │   └─ talos-cp-03 (192.168.0.13)
       │
       ├─ Worker Nodes
       │   ├─ talos-wrk-01 (192.168.0.14)
       │   ├─ talos-wrk-02 (192.168.0.15)
       │   ├─ talos-wrk-03 (192.168.0.16)
       │   └─ talos-wrk-04 (192.168.0.17)
       │
       └─ Service Load Balancer Pool (192.168.0.201-210)
           └─ Application traffic (Cilium L2 - Part 3)

IP Address Planning

Management Network: 192.168.0.0/24
  # /24 means 254 usable IPs (192.168.0.1 to 192.168.0.254)
  # This is your physical network where nodes communicate

  Router/Gateway: 192.168.0.1      # Your router's IP
  DNS Servers: 192.168.0.2, 192.168.0.3  # Your DNS servers (Pi-hole, etc.)
  Kubernetes API VIP: 192.168.0.200  # Control plane endpoint (port 6443)
  Service LB Pool: 192.168.0.201-210  # Cilium LB IPAM for apps (ports 80, 443, etc.)
  Node IPs: 192.168.0.11-20       # Static IPs for your nodes

Pod Network: 10.20.0.0/16 (example)
  # /16 means 65,534 usable IPs for pods
  # Choose a pod CIDR that does not overlap any LAN/VPN ranges
  # Cilium doesn't require a specific default; set this explicitly in Talos
  # Each node gets a subset (like 10.20.1.0/24, 10.20.2.0/24)

Service Network: 10.96.0.0/12 (Kubernetes default)
  # /12 means over 1 million IPs for services
  # Virtual IPs that load balance to pods

Why These Specific Ranges?

  • 192.168.x.x: Private (RFC 1918) range that is never routed on the public internet
  • 10.x.x.x: Another private range, keeping pod traffic separate from management
  • Different ranges prevent routing confusion
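Before committing to an address plan, it is worth checking it mechanically. The following Python sketch uses the standard `ipaddress` module with the example ranges from this section; the helper names are my own, and you would substitute your real ranges.

```python
import ipaddress

# Ranges from the example plan in this section
lan = ipaddress.ip_network("192.168.0.0/24")
pods = ipaddress.ip_network("10.20.0.0/16")
services = ipaddress.ip_network("10.96.0.0/12")


def usable_hosts(network: ipaddress.IPv4Network) -> int:
    """Addresses left after removing the network and broadcast addresses."""
    return network.num_addresses - 2


def address_range(first: str, last: str) -> list:
    """Expand an inclusive first-last range into individual addresses."""
    start = int(ipaddress.ip_address(first))
    end = int(ipaddress.ip_address(last))
    return [ipaddress.ip_address(value) for value in range(start, end + 1)]


node_ips = address_range("192.168.0.11", "192.168.0.20")
lb_pool = address_range("192.168.0.201", "192.168.0.210")
api_vip = ipaddress.ip_address("192.168.0.200")

# No two networks may overlap
networks = [lan, pods, services]
overlaps = [
    (a, b)
    for i, a in enumerate(networks)
    for b in networks[i + 1:]
    if a.overlaps(b)
]

# Every reserved address must sit inside the LAN and be used only once
reserved = node_ips + lb_pool + [api_vip]
all_in_lan = all(address in lan for address in reserved)
no_duplicates = len(reserved) == len(set(reserved))

print("usable LAN hosts:", usable_hosts(lan))
print("usable pod addresses:", usable_hosts(pods))
print("service addresses:", services.num_addresses)
print("overlapping networks:", overlaps)
print("reserved addresses valid:", all_in_lan and no_duplicates)
```

An empty overlap list and a valid reservation check mean the plan is internally consistent. The sketch cannot see addresses already handed out by your DHCP server, so exclude the reserved ranges there as well.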

VLAN Considerations

If your switch supports VLANs (Virtual LANs - network segmentation), consider:

  • Management VLAN: Node-to-node communication
  • Storage VLAN: Ceph replication traffic (high bandwidth)
  • Service VLAN: Load balancer IPs accessible to your home network

Why VLANs? They isolate traffic types, improving security and performance. Not required for starting out.

DNS Strategy

You'll need both internal and external DNS:

  • Internal: CoreDNS (included with Kubernetes) handles *.cluster.local
    • Example: nginx.default.svc.cluster.local resolves to service IP
  • External: Your domain pointing to load balancer IPs
    • Example: app.homelab.example → 192.168.0.201
  • Split-horizon: Pi-hole or similar for internal resolution
    • Resolves *.homelab.example internally without going to internet

Storage Planning

Storage deserves special attention because it's hard to change later:

Boot/OS Storage

  • 50GB minimum per node
  • SSD strongly recommended
    • Why: etcd does many small synchronous writes, and HDDs are orders of magnitude slower at them
  • Can be smaller for worker nodes (30GB) if using separate data disks

Data Storage for Ceph (Distributed Storage System)

  • Dedicated disks preferred
    • Why: OS and Ceph competing for I/O hurts performance
  • Minimum 3 OSDs (Object Storage Daemons) across different nodes
    • Why: Data replicated 3 times for redundancy
  • 100GB+ per OSD
    • Why: Ceph has ~10% overhead for metadata
  • Same-size disks simplify management
    • Why: Ceph balances better with uniform disk sizes
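To see what the replication factor and overhead mean for usable space, here is a rough Python estimate. It is a back-of-the-envelope sketch, not Ceph's actual accounting: it assumes 3 replicas and the ~10% overhead figure quoted above, and it ignores placement constraints and full-ratio safety margins.

```python
def usable_capacity_gb(osd_sizes_gb, replicas=3, overhead_fraction=0.10):
    """Estimate usable space: raw capacity / replicas, minus metadata overhead."""
    if len(osd_sizes_gb) < replicas:
        raise ValueError("need at least one OSD per replica")
    raw = sum(osd_sizes_gb)
    return round(raw / replicas * (1 - overhead_fraction), 1)


# Three 100GB OSDs (the minimum described above)
print(usable_capacity_gb([100, 100, 100]))

# Hypothetical layout: one OSD per worker, sized like the example worker disks
print(usable_capacity_gb([500, 500, 250, 250]))
```

The minimum layout yields roughly 90GB of usable space from 300GB of raw disk, which is worth knowing before you size your persistent volumes.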

Performance Considerations

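# Test disk performance before committing
# Talos has no package manager, so run fio from a privileged debug pod on the node
# (this needs a running cluster; on bare hardware, run the same fio command from any live Linux)
kubectl -n kube-system debug -it --profile sysadmin --image alpine:3.20 node/talos-cp-01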

# Inside the debug container:
apk add fio
fio --name=randwrite --ioengine=libaio --iodepth=32 \
    --rw=randwrite --bs=4k --direct=1 --size=1G \
    --numjobs=8 --runtime=60 --group_reporting

# Understanding the parameters:
# --rw=randwrite: Random write test (hardest for disks)
# --bs=4k: 4KB blocks (typical database/etcd size)
# --direct=1: Bypass cache for real performance
# --numjobs=8: Simulate 8 parallel workers

# Use low-latency SSDs; etcd stability is sensitive to disk latency and fsync
# Benchmark with fio for latency, not just IOPS
# Good SSD: Low latency, consistent fsync times
# HDD: High latency, unsuitable for etcd

Security Planning

Security isn't something you add later - build it in from the start:

Certificate Management

  • Talos generates its own PKI (Public Key Infrastructure) for node communication
    • Why: Nodes need to prove their identity to each other
  • You'll need to plan for application certificates (HTTPS for your apps)
    • Let's Encrypt via cert-manager (free SSL certificates)
  • Consider a wildcard certificate for *.homelab.example
    • Why: One certificate covers all subdomains

Secret Management

  • Never hardcode secrets in your configurations
    • Why: Anyone with repo access sees passwords
  • Use Infisical, Vault, or Sealed Secrets from day one
    • These encrypt secrets, only decrypt in cluster
  • Plan your secret paths and access policies
    • Example: Database passwords in /secrets/databases/

Network Security

  • Default deny all traffic between namespaces
    • Why: Compromised app can't access other apps
  • Explicitly allow only required communication
    • Example: Allow frontend → backend, deny backend → frontend
  • Plan your NetworkPolicy strategy before deploying apps

Access Control

  • No SSH means you need to secure talosctl access
  • Store Talos config file (talosconfig) securely
    • This file grants full cluster admin access
  • Consider hardware tokens for additional security
    • YubiKey or similar for two-factor authentication

Common Pitfalls to Avoid

Through painful experience, here are mistakes to avoid:

1. Undersizing etcd Nodes

I initially gave control plane nodes 2GB RAM. Etcd started having leader elections under load, causing cluster instability.

Why this happened: During high load, etcd couldn't respond fast enough, triggering new leader elections, creating more load - a death spiral.

Solution: Minimum 4GB RAM, 8GB recommended for production workloads.

2. Forgetting Time Synchronization

Kubernetes certificates and etcd consensus require accurate time. Without NTP (Network Time Protocol), nodes drift apart.

Why this matters: Significant clock skew breaks TLS and consensus; configure reliable NTP on all nodes.

Configure NTP:

machine:
  time:
    servers:
      - time.cloudflare.com  # Primary time server
      - pool.ntp.org        # Fallback pool

3. Single DNS Server

When my AdGuard Home went down, nodes couldn't resolve anything, including internal Kubernetes services.

Why this matters: Kubernetes relies heavily on DNS. No DNS = no service discovery = broken applications.

Always configure multiple DNS servers:

machine:
  network:
    nameservers:
      - 192.168.0.2   # Primary DNS (Pi-hole, AdGuard, etc.)
      - 192.168.0.3   # Secondary DNS (backup)
      - 1.1.1.1       # Cloudflare public DNS (internet fallback)
      # The resolver falls back to the next server when one does not respond

4. No Capacity Planning

That cool new observability stack? It might need 14 CPU cores and 42GB RAM.

Plan for overhead (in my experience):

  • OS overhead: ~500MB RAM per node
    • Kernel, system services, etc.
  • Kubernetes system pods: ~2GB RAM per node
    • kubelet, containerd, system controllers
  • CNI (Cilium): ~500MB-1GB RAM per node
    • Network stack, eBPF programs
  • Storage (Ceph): ~4GB RAM, significant CPU
    • Distributed storage overhead
  • Monitoring: Can easily consume 4+ CPU, 8GB+ RAM
    • Metrics, logs, traces add up quickly

Rule of thumb: Plan for 2-3x your application needs for comfortable operation.
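To make the overhead concrete, the sketch below subtracts the figures from the list above from the RAM of the four example workers. The numbers are my rough estimates from this article, and treating the Ceph figure as a per-node cost is an assumption; adjust both to match your own measurements.

```python
# Per-node overhead figures from the list above, in GB of RAM
PER_NODE_OVERHEAD_GB = {
    "os": 0.5,
    "kubernetes_system": 2.0,
    "cilium": 1.0,  # upper end of the 500MB-1GB range
    "ceph": 4.0,    # assumption: counted once per node
}
MONITORING_GB = 8.0  # cluster-wide, lower bound of "8GB+"


def ram_left_for_apps(worker_ram_gb):
    """RAM remaining for applications after platform overhead is subtracted."""
    per_node = sum(PER_NODE_OVERHEAD_GB.values())
    total = sum(worker_ram_gb)
    return total - per_node * len(worker_ram_gb) - MONITORING_GB


# Worker RAM from the example setup: two 32GB nodes and two 16GB nodes
workers = [32, 32, 16, 16]
print(f"Total worker RAM: {sum(workers)} GB")
print(f"Left for applications: {ram_left_for_apps(workers)} GB")
```

With these figures, 96GB of worker RAM leaves about 58GB for applications, so well over a third of the cluster is spent on the platform itself.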

5. Ignoring Backup Strategy

Before storing any data, know how you'll back it up:

  • Etcd snapshots (schedule talosctl etcd snapshot yourself or via automation)
  • Persistent volume backups (Velero backs up to S3/MinIO)
  • Application-level backups (database dumps, etc.)

Why multiple layers? Different restore scenarios need different backups. Corrupted database? You need the application backup. Failed cluster? You need the etcd backup.

Pre-Flight Checklist

Before proceeding to Part 2, ensure you have:

  • [ ] Hardware
    • [ ] 3+ machines for control plane (4 CPU, 8GB RAM recommended each)
    • [ ] 2+ machines for workers (8 CPU, 16GB RAM recommended)
    • [ ] SSDs for all nodes (check with fio benchmark)
    • [ ] Gigabit ethernet minimum (10Gb for storage network if possible)
  • [ ] Network
    • [ ] IP addresses assigned for all nodes (static or DHCP reservations)
    • [ ] Control plane VIP address reserved (not used by any device)
    • [ ] Load balancer IP range reserved (10 IPs recommended)
    • [ ] DNS servers accessible from all nodes
    • [ ] Internet access for pulling container images
  • [ ] Planning Documents
    • [ ] Network diagram with IPs and VLANs
    • [ ] Storage allocation plan (which disks for what)
    • [ ] Secret management strategy (what tool, what structure)
    • [ ] Backup and recovery plan (what to backup, where to store)
  • [ ] Software
    • [ ] Talos Linux ISO downloaded
    • [ ] talosctl installed on your workstation
    • [ ] kubectl installed on your workstation
    • [ ] Text editor ready for YAML editing

What's Next

In Part 2: Bootstrapping Your Talos Cluster, we'll bootstrap your Talos cluster, configure the control plane with high availability, and join worker nodes. You'll learn about critical Talos configurations, including the forwardKubeDNSToHost setting that caused me 6 hours of debugging.

Note: Talos forwards cluster DNS to the host resolver by default (forwardKubeDNSToHost: true), which can surprise you if your host uses Pi-hole/AdGuard. This is covered in detail in Part 2.

Key Takeaways

  1. Talos Linux provides unmatched security and operational simplicity for Kubernetes - immutability prevents configuration drift
  2. Proper planning prevents painful migrations - get your network and storage architecture right initially
  3. Security must be built-in from day one - retrofitting security is exponentially harder
  4. Resource requirements add up quickly - plan for 2-3x your application needs for overhead
  5. High availability requires odd numbers - 3 control plane nodes minimum for etcd quorum

Continue to Part 2: Bootstrapping Your Talos Cluster →