After spending months running various Kubernetes distributions in my homelab, I finally found the perfect foundation: Talos Linux. This article explains why Talos stands out from alternatives like k3s, microk8s, or kubeadm, and how to properly plan your cluster before touching any hardware.
Prerequisites You'll Need
Before we begin, ensure you have:
Software Tools:
- talosctl v1.11.0+ - Command-line tool for managing Talos nodes (installation guide)
- kubectl matching your cluster's minor version (kubectl supports ±1 minor from the API server) (installation guide)
- A text editor for YAML files (VS Code, nano, vim, etc.)
- wget or curl for downloading files
Version Compatibility:
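# Check your current versions before starting
# --client reports only the local tool version, so no cluster is needed yet
talosctl version --client --short
# kubectl removed the --short flag in v1.28; the default output is already short
kubectl version --client
# Expected versions (exact output format differs between tools):
# talosctl: v1.11.0
# kubectl: v1.34.0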
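The ±1 minor-version rule for kubectl can be checked mechanically once you know both versions. The following is a minimal Python sketch, assuming version strings in the `vMAJOR.MINOR.PATCH` form shown above; the function names are illustrative and not part of any Talos or Kubernetes tool.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a string such as 'v1.34.0'."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def kubectl_within_skew(kubectl_version: str, server_version: str) -> bool:
    """True if kubectl is within one minor version of the API server."""
    k_major, k_minor = parse_minor(kubectl_version)
    s_major, s_minor = parse_minor(server_version)
    return k_major == s_major and abs(k_minor - s_minor) <= 1


print(kubectl_within_skew("v1.34.0", "v1.33.2"))  # True
print(kubectl_within_skew("v1.34.0", "v1.31.0"))  # False
```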
Hardware Requirements:
- Minimum 3 machines for control plane (physical or virtual)
- Minimum 2 machines for worker nodes (physical or virtual)
- Each machine needs at least 4GB RAM and 2 CPU cores
- Network connectivity between all machines
- DHCP server on your network OR static IP configuration knowledge
The Problem with Traditional Kubernetes Distributions
I started my Kubernetes journey like many others - with k3s on Ubuntu VMs. It worked, but I quickly discovered several pain points:
- Configuration drift: Each node slowly diverged as packages updated at different times
- Security surface: Full Linux distributions meant hundreds of packages to patch
- SSH access: Convenient but a massive security hole in production
- Manual maintenance: OS updates, kernel patches, and configuration management consumed hours weekly
- Reproducibility: Rebuilding a failed node meant hoping my Ansible playbooks still worked
These issues compound in a homelab where you're both the platform team and the application developer. You need infrastructure that just works.
Enter Talos Linux: Kubernetes-Native Operating System
Talos Linux (https://talos.dev) takes a radically different approach. Instead of retrofitting Kubernetes onto a general-purpose Linux distribution like Ubuntu or CentOS, Talos is built specifically to run Kubernetes and nothing else.
What Makes Talos Different
Immutable and Atomic
Let me explain what "immutable" means with a practical example:
# Traditional Linux: You can modify any system file
# This command would add a line allowing root SSH login (security risk!)
$ ssh node1 "echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config"
# Result: File is modified, system is now less secure
# Talos: The filesystem is read-only
$ talosctl -n 192.168.0.11 debug -- sh -c "echo 'PermitRootLogin yes' >> /rootfs/etc/ssh/sshd_config"
# sh: can't create /rootfs/etc/ssh/sshd_config: Read-only file system
# Result: System files cannot be changed, maintaining security
The entire OS is read-only except for designated data directories. Configuration changes require generating new machine configs and applying them atomically - either the whole configuration applies successfully, or nothing changes. This prevents partial updates that could break your system.
Why This Matters: You can't accidentally break the OS by modifying the wrong file. Every change is intentional and tracked.
API-Driven Everything
Talos has no SSH daemon. No shell. No package manager. Every interaction happens through the Talos API:
# View logs from a service (kubelet manages pods on the node)
talosctl -n 192.168.0.11 logs kubelet
# Read files from the node
talosctl -n 192.168.0.11 read /var/log/pods/...
# Execute commands for debugging (very limited, intentionally)
talosctl -n 192.168.0.11 debug ls /var/lib/kubelet
This seems restrictive until you realize it eliminates entire categories of security vulnerabilities:
- No SSH keys to steal
- No shell escapes to exploit
- No privilege escalation through sudo misconfigurations
- No package vulnerabilities beyond the minimal OS
Minimal Attack Surface
Talos is tiny: the read-only SquashFS rootfs is < 80 MB. Fewer moving parts → fewer CVEs and less patching.
Why This Matters: Fewer components = fewer things that can break or be exploited. It's security through simplicity.
Real-World Benefits I've Experienced
- Fast Recovery: Lost a node? Boot from Talos ISO, apply the configuration, rejoin cluster. Total time: ~5 minutes.
- Security by Default:
- Secure boot support (cryptographically verify the OS hasn't been tampered with)
- Rootfs verification (ensure filesystem integrity)
- Automatic certificate rotation (security certificates refresh automatically)
- No unnecessary services (only runs what Kubernetes needs)
Configuration as Code: All node configuration lives in version control:
machine:
  type: controlplane          # This node will run the control plane
  token: ${MACHINE_TOKEN}     # Used during secure discovery/TLS bootstrap to mint node certs
  network:
    hostname: talos-cp-01     # Human-readable name for this node
    interfaces:
      - interface: eth0       # Network interface name (verify with `talosctl get links`)
        dhcp: false           # We'll use static IPs for predictability
        addresses:
          - 192.168.0.11/24   # Static IP in CIDR notation
                              # /24 means subnet mask 255.255.255.0
Predictable Updates: Upgrading Talos is a single command that atomically updates the OS:
# Upgrade a node to a new Talos version
# --nodes = which node to upgrade
# --image = the new version to install
talosctl upgrade --nodes 192.168.0.11 --image ghcr.io/siderolabs/installer:v1.11.1
# What happens:
# 1. New OS downloads to a separate partition
# 2. Node reboots into new version
# 3. If boot fails, automatically rolls back to previous version
See Talos upgrade documentation for detailed upgrade procedures.
Planning Your Cluster Architecture
Before installing anything, invest time in proper planning. Here's the architecture I've refined over multiple iterations:
Node Planning
Control Plane Nodes (3 minimum)
- Why exactly 3? Kubernetes uses etcd (distributed key-value store) to store all cluster data. Etcd requires a majority (quorum) to function. With 3 nodes:
- All 3 running = fully operational
- 2 running = still operational (2 out of 3 is majority)
- 1 running = etcd loses quorum—API writes fail and new scheduling halts (existing pods usually keep running)
- This is why you want odd numbers: with 2 nodes the quorum is still 2, so losing either node halts etcd. You double the chance of a failure without gaining any fault tolerance over a single node
- Resources: 4 CPU cores, 8GB RAM minimum
- Why: etcd is surprisingly resource-intensive, especially during leader elections
- Control plane also runs API server, scheduler, and controller manager
- Disk: 50GB SSD for OS and etcd
- Why: etcd performance directly impacts cluster responsiveness
- Every Kubernetes operation goes through etcd
- Slow disk = slow cluster
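The quorum reasoning above comes down to one line of arithmetic: a majority of n members is floor(n/2) + 1. A small Python sketch of that calculation (the function names are illustrative) shows why even member counts add no fault tolerance:

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting nodes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# 1 members: quorum=1, tolerates 0 failure(s)
# 2 members: quorum=2, tolerates 0 failure(s)
# 3 members: quorum=2, tolerates 1 failure(s)
# 4 members: quorum=3, tolerates 1 failure(s)
# 5 members: quorum=3, tolerates 2 failure(s)
```

Going from 3 to 4 members raises the quorum without raising the number of failures you can survive, which is why 3 and 5 are the usual choices.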
Worker Nodes (2 minimum, 3+ recommended)
- Why separate workers?
- Control plane stability: A runaway pod consuming all CPU won't affect cluster management
- Security: User workloads are isolated from control plane
- Scaling: Can add/remove workers without affecting control plane
- Resources: 8+ CPU cores, 16GB+ RAM
- Why: These run your actual applications
- Need headroom for system pods (CNI, DNS, monitoring)
- Disk: 100GB+ SSD for OS, container images, and Ceph storage
- Container images can consume significant space
- Ceph (storage system) needs dedicated space
An example setup:
Control Plane:
talos-cp-01: 192.168.0.11 (4 CPU, 8GB RAM, 50GB SSD)
talos-cp-02: 192.168.0.12 (4 CPU, 8GB RAM, 50GB SSD)
talos-cp-03: 192.168.0.13 (4 CPU, 8GB RAM, 50GB SSD)
API Endpoint: 192.168.0.200 (HA access - see below for options)
Workers:
talos-wrk-01: 192.168.0.14 (8 CPU, 32GB RAM, 500GB SSD)
talos-wrk-02: 192.168.0.15 (8 CPU, 32GB RAM, 500GB SSD)
talos-wrk-03: 192.168.0.16 (8 CPU, 16GB RAM, 250GB SSD)
talos-wrk-04: 192.168.0.17 (8 CPU, 16GB RAM, 250GB SSD)
High Availability API Access
Understanding Control Plane HA:
The Kubernetes API needs a stable endpoint that remains accessible even when control plane nodes fail. This is NOT about load balancing services - it's about ensuring kubectl and nodes can always reach the API.
Option 1: Talos Built-in VIP (Recommended for Homelab)
Talos includes a VIP (Virtual IP) that provides a single, stable endpoint for the Kubernetes API:
# In your Talos machine config for control plane nodes:
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.0.11/24   # Node's primary IP
        vip:
          ip: 192.168.0.200   # Shared VIP for API endpoint
# Talos explicitly recommends not using the VIP in talosconfig to avoid lock-out during loss of quorum.
# Keep at least one direct CP node IP in `talosconfig` for break-glass.
# How it works:
# - One control plane node holds the VIP at a time
# - Nodes elect a leader using etcd consensus
# - If VIP holder fails, another control plane takes over
# - kubectl uses the VIP address; talosctl should target node IPs directly (see the note above)
Important: This VIP is ONLY for Kubernetes API access, NOT for load balancing your applications. Application load balancing is handled by Cilium (covered in Part 3).
Break-Glass Access Tip: In a complete VIP failure, you can always directly connect to any control plane node's actual IP address using https://<cp-node-ip>:6443 (port 6443 is the Kubernetes API server default). This bypasses the VIP entirely—crucial for disaster recovery when the VIP mechanism itself fails.
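The break-glass idea can be scripted as "try the VIP first, then each control plane node". The sketch below is an illustration in Python using only the standard library. The helper name `first_reachable` is hypothetical, and a successful TCP connection only shows that port 6443 is open, not that the API server is healthy.

```python
import socket


def first_reachable(endpoints, timeout=2.0):
    """Return the first (host, port) that accepts a TCP connection, or None."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            # Refused, unreachable, or timed out: try the next candidate.
            continue
    return None


# Addresses from the example setup: VIP first, then each control plane node.
CANDIDATES = [
    ("192.168.0.200", 6443),
    ("192.168.0.11", 6443),
    ("192.168.0.12", 6443),
    ("192.168.0.13", 6443),
]
```

Calling `first_reachable(CANDIDATES)` from your workstation returns the endpoint to pass to `kubectl --server=https://<host>:6443`, or `None` if nothing answers.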
Option 2: Talos KubePrism (Excellent Production Option)
KubePrism is Talos' built-in API proxy that provides client-side load balancing. KubePrism listens on localhost (127.0.0.1:7445) on each node and is enabled by default since Talos 1.6:
# In your Talos machine config:
machine:
  features:
    kubePrism:
      enabled: true   # Enable KubePrism
      port: 7445      # Local proxy port
# How it works:
# - Each node runs a local proxy on port 7445
# - Proxy knows all control plane endpoints
# - Automatically fails over to healthy nodes
# - Workers use localhost:7445 to reach API
Usage:
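# Illustrative only: node components such as kubelet reach the API through this local endpoint.
# Talos nodes have no shell, so you would not run kubectl on a node yourself.
kubectl --server=https://127.0.0.1:7445 get nodes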
# External access still needs a load balancer or VIP
Pros: Automatic failover, no single point of failure, load distribution
Cons: Only reachable from the nodes themselves (127.0.0.1:7445); off-cluster clients such as your workstation still need the VIP or an external load balancer
Option 3: External Load Balancer (Enterprise Approach)
For production environments, use a dedicated load balancer:
# HAProxy example configuration:
frontend k8s_api
    bind *:6443
    mode tcp
    option tcplog
    default_backend k8s_control_plane
backend k8s_control_plane
    mode tcp
    balance roundrobin
    option tcp-check
    server cp1 192.168.0.11:6443 check
    server cp2 192.168.0.12:6443 check
    server cp3 192.168.0.13:6443 check
Pros: Professional-grade HA, advanced health checks, proper load balancing
Cons: Additional infrastructure, more complexity
Option 4: kube-vip (Popular Community Choice)
kube-vip provides a VIP with optional load balancing, running as either a static pod or a DaemonSet:
# Generated with: kube-vip manifest daemonset ...
# Provides both VIP and load balancing
# Supports BGP, ARP, and cloud providers
# More complex but very flexible
For this series, we'll use Option 1 (Talos VIP) for API endpoint stability.
Service Load Balancing (Different from API HA!)
Critical Distinction:
| Type | Purpose | Port | Managed By |
|---|---|---|---|
| Control Plane Endpoint (VIP/LB) | Stable Kubernetes API access | 6443 | Talos VIP or external LB |
| Service Load Balancer | Application traffic ingress | 80, 443, etc. | Cilium L2 announcements |
Service load balancing is handled by Cilium L2 announcements with LB IPAM (Part 3), which provides a pool of IPs (192.168.0.201-210) for your applications. This is completely separate from the control plane VIP and configured through Cilium's IPAM pools and CRDs.
Network Architecture
Proper network planning prevents hours of debugging later. Here's what you need to consider:
High-Level Network Flow
Internet
│
├─ Router/Gateway (192.168.0.1)
│
└─ Home Network Switch
│
├─ DNS Servers (192.168.0.2-3)
├─ Kubernetes API VIP (192.168.0.200) ←── Control plane HA
│ │
│ ├─ talos-cp-01 (192.168.0.11)
│ ├─ talos-cp-02 (192.168.0.12)
│ └─ talos-cp-03 (192.168.0.13)
│
├─ Worker Nodes
│ ├─ talos-wrk-01 (192.168.0.14)
│ ├─ talos-wrk-02 (192.168.0.15)
│ ├─ talos-wrk-03 (192.168.0.16)
│ └─ talos-wrk-04 (192.168.0.17)
│
└─ Service Load Balancer Pool (192.168.0.201-210)
└─ Application traffic (Cilium L2 - Part 3)
IP Address Planning
Management Network: 192.168.0.0/24
# /24 means 254 usable IPs (192.168.0.1 to 192.168.0.254)
# This is your physical network where nodes communicate
Router/Gateway: 192.168.0.1 # Your router's IP
DNS Servers: 192.168.0.2, 192.168.0.3 # Your DNS servers (Pi-hole, etc.)
Kubernetes API VIP: 192.168.0.200 # Control plane endpoint (port 6443)
Service LB Pool: 192.168.0.201-210 # Cilium LB IPAM for apps (ports 80, 443, etc.)
Node IPs: 192.168.0.11-20 # Static IPs for your nodes
Pod Network: 10.20.0.0/16 (example)
# /16 means 65,534 usable IPs for pods
# Choose a pod CIDR that does not overlap any LAN/VPN ranges
# Cilium doesn't require a specific default; set this explicitly in Talos
# Each node gets a subset (like 10.20.1.0/24, 10.20.2.0/24)
Service Network: 10.96.0.0/12 (Kubernetes default)
# /12 means over 1 million IPs for services
# Virtual IPs that load balance to pods
Why These Specific Ranges?
- 192.168.x.x: RFC 1918 private range that is not routed on the public internet
- 10.x.x.x: Another private range, keeping pod traffic separate from management
- Different ranges prevent routing confusion
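Before committing to an address plan, it is worth checking it for overlaps and collisions. The following Python sketch uses the standard `ipaddress` module with the example ranges from this section; the helper names are illustrative, and you would substitute your own ranges.

```python
import ipaddress

lan = ipaddress.ip_network("192.168.0.0/24")
pods = ipaddress.ip_network("10.20.0.0/16")
services = ipaddress.ip_network("10.96.0.0/12")

node_ips = [ipaddress.ip_address(f"192.168.0.{i}") for i in range(11, 21)]
api_vip = ipaddress.ip_address("192.168.0.200")
lb_pool = [ipaddress.ip_address(f"192.168.0.{i}") for i in range(201, 211)]


def overlapping_pairs(networks):
    """Return every pair of named networks whose address ranges overlap."""
    names = list(networks)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if networks[a].overlaps(networks[b])
    ]


def plan_problems():
    """Collect human-readable problems with the address plan."""
    problems = [
        f"{a} overlaps {b}"
        for a, b in overlapping_pairs({"lan": lan, "pods": pods, "services": services})
    ]
    reserved = node_ips + [api_vip] + lb_pool
    if len(set(reserved)) != len(reserved):
        problems.append("duplicate address among nodes, VIP, and LB pool")
    problems += [f"{ip} is outside {lan}" for ip in reserved if ip not in lan]
    return problems


print(plan_problems() or "address plan is consistent")  # address plan is consistent
print(lan.num_addresses - 2, "usable LAN hosts")        # 254 usable LAN hosts
```

The same check catches the common mistake of picking a pod CIDR that collides with a VPN or the service range.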
VLAN Considerations
If your switch supports VLANs (Virtual LANs - network segmentation), consider:
- Management VLAN: Node-to-node communication
- Storage VLAN: Ceph replication traffic (high bandwidth)
- Service VLAN: Load balancer IPs accessible to your home network
Why VLANs? They isolate traffic types, improving security and performance. Not required for starting out.
DNS Strategy
You'll need both internal and external DNS:
- Internal: CoreDNS (included with Kubernetes) handles *.cluster.local
- Example: nginx.default.svc.cluster.local resolves to service IP
- External: Your domain pointing to load balancer IPs
- Example: app.homelab.example → 192.168.0.201
- Split-horizon: Pi-hole or similar for internal resolution
- Resolves *.homelab.example internally without going to internet
Storage Planning
Storage deserves special attention because it's hard to change later:
Boot/OS Storage
- 50GB minimum per node
- SSD strongly recommended
- Why: etcd issues many small synchronous writes, and HDD latency for that pattern is far higher than SSD latency
- Can be smaller for worker nodes (30GB) if using separate data disks
Data Storage for Ceph (Distributed Storage System)
- Dedicated disks preferred
- Why: OS and Ceph competing for I/O hurts performance
- Minimum 3 OSDs (Object Storage Daemons) across different nodes
- Why: Data replicated 3 times for redundancy
- 100GB+ per OSD
- Why: Ceph has ~10% overhead for metadata
- Same-size disks simplify management
- Why: Ceph balances better with uniform disk sizes
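To size the data disks, a rough usable-capacity estimate helps. This Python sketch assumes a replicated pool with 3 copies and applies the ~10% metadata overhead figure quoted above. Both are planning approximations rather than values reported by Ceph, and the function name is illustrative.

```python
def usable_capacity_gb(osd_sizes_gb, replicas=3, metadata_overhead=0.10):
    """Approximate usable capacity of a replicated Ceph pool, in GB."""
    raw = sum(osd_sizes_gb)
    return raw / replicas * (1 - metadata_overhead)


print(f"{usable_capacity_gb([100, 100, 100]):.0f} GB usable from 300 GB raw")       # 90 GB
print(f"{usable_capacity_gb([200, 200, 200, 200]):.0f} GB usable from 800 GB raw")  # 240 GB
```

Three 100GB OSDs therefore give you roughly 90GB of usable space, which is worth knowing before you pick disk sizes.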
Performance Considerations
# Test disk performance before committing
# On Talos, run fio from a debug container (no package manager available)
talosctl -n 192.168.0.11 debug --image alpine:3.20 -- sh
# Inside the debug container:
apk add fio
fio --name=randwrite --ioengine=libaio --iodepth=32 \
--rw=randwrite --bs=4k --direct=1 --size=1G \
--numjobs=8 --runtime=60 --group_reporting
# Understanding the parameters:
# --rw=randwrite: Random write test (hardest for disks)
# --bs=4k: 4KB blocks (typical database/etcd size)
# --direct=1: Bypass cache for real performance
# --numjobs=8: Simulate 8 parallel workers
# Use low-latency SSDs; etcd stability is sensitive to disk latency and fsync
# Benchmark with fio for latency, not just IOPS
# Good SSD: Low latency, consistent fsync times
# HDD: High latency, unsuitable for etcd
Security Planning
Security isn't something you add later - build it in from the start:
Certificate Management
- Talos generates its own PKI (Public Key Infrastructure) for node communication
- Why: Nodes need to prove their identity to each other
- You'll need to plan for application certificates (HTTPS for your apps)
- Let's Encrypt via cert-manager (free SSL certificates)
- Consider a wildcard certificate for *.homelab.example
- Why: One certificate covers all subdomains
Secret Management
- Never hardcode secrets in your configurations
- Why: Anyone with repo access sees passwords
- Use Infisical, Vault, or Sealed Secrets from day one
- These encrypt secrets, only decrypt in cluster
- Plan your secret paths and access policies
- Example: Database passwords in /secrets/databases/
Network Security
- Default deny all traffic between namespaces
- Why: Compromised app can't access other apps
- Explicitly allow only required communication
- Example: Allow frontend → backend, deny backend → frontend
- Plan your NetworkPolicy strategy before deploying apps
Access Control
- No SSH means you need to secure talosctl access
- Store Talos config file (talosconfig) securely
- This file grants full cluster admin access
- Consider hardware tokens for additional security
- YubiKey or similar for two-factor authentication
Common Pitfalls to Avoid
Through painful experience, here are mistakes to avoid:
1. Undersizing etcd Nodes
I initially gave control plane nodes 2GB RAM. Etcd started having leader elections under load, causing cluster instability.
Why this happened: During high load, etcd couldn't respond fast enough, triggering new leader elections, creating more load - a death spiral.
Solution: Minimum 4GB RAM, 8GB recommended for production workloads.
2. Forgetting Time Synchronization
Kubernetes certificates and etcd consensus require accurate time. Without NTP (Network Time Protocol), nodes drift apart.
Why this matters: Significant clock skew breaks TLS and consensus; configure reliable NTP on all nodes.
Configure NTP:
machine:
  time:
    servers:
      - time.cloudflare.com   # Primary time server
      - pool.ntp.org          # Fallback pool
3. Single DNS Server
When my AdGuard Home went down, nodes couldn't resolve anything, including internal Kubernetes services.
Why this matters: Kubernetes relies heavily on DNS. No DNS = no service discovery = broken applications.
Always configure multiple DNS servers:
machine:
  network:
    nameservers:
      - 192.168.0.2   # Primary DNS (Pi-hole, AdGuard, etc.)
      - 192.168.0.3   # Secondary DNS (backup)
      - 1.1.1.1       # Cloudflare public DNS (internet fallback)
# The node's resolver, not Kubernetes itself, falls back to the next server when one does not respond
4. No Capacity Planning
That cool new observability stack? It might need 14 CPU cores and 42GB RAM.
Plan for overhead (in my experience):
- OS overhead: ~500MB RAM per node
- Kernel, system services, etc.
- Kubernetes system pods: ~2GB RAM per node
- kubelet, containerd, system controllers
- CNI (Cilium): ~500MB-1GB RAM per node
- Network stack, eBPF programs
- Storage (Ceph): ~4GB RAM, significant CPU
- Distributed storage overhead
- Monitoring: Can easily consume 4+ CPU, 8GB+ RAM
- Metrics, logs, traces add up quickly
Rule of thumb: Plan for 2-3x your application needs for comfortable operation.
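The overhead figures above can be turned into a quick per-node budget. This Python sketch uses the article's estimates and the worker sizes from the example setup. Treating the Ceph figure as a per-node cost is an assumption, and the names are illustrative.

```python
# Per-node overhead estimates in GB, taken from the list above.
OS_GB = 0.5
K8S_SYSTEM_GB = 2.0
CNI_GB = 1.0    # upper end of the 500MB-1GB range
CEPH_GB = 4.0   # assumption: applied to every node that hosts Ceph


def usable_ram_gb(total_gb: float, runs_ceph: bool = True) -> float:
    """RAM left for applications on one node after the fixed overheads."""
    overhead = OS_GB + K8S_SYSTEM_GB + CNI_GB + (CEPH_GB if runs_ceph else 0.0)
    return max(total_gb - overhead, 0.0)


# Worker sizes from the example setup earlier in this article.
workers_gb = {"talos-wrk-01": 32, "talos-wrk-02": 32, "talos-wrk-03": 16, "talos-wrk-04": 16}

for name, total in workers_gb.items():
    print(f"{name}: {usable_ram_gb(total):.1f} GB left of {total} GB")

cluster_usable = sum(usable_ram_gb(t) for t in workers_gb.values())
print(f"cluster: {cluster_usable:.1f} GB for applications and monitoring")  # cluster: 66.0 GB ...
```

On the 16GB workers, fixed overhead takes almost half the memory before any application or monitoring pod is scheduled, which is where the 2-3x rule of thumb comes from.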
5. Ignoring Backup Strategy
Before storing any data, know how you'll back it up:
- Etcd snapshots (schedule talosctl etcd snapshot yourself or via automation)
- Persistent volume backups (Velero backs up to S3/MinIO)
- Application-level backups (database dumps, etc.)
Why multiple layers? Different restore scenarios need different backups. A corrupted database needs an application-level backup; a failed cluster needs an etcd snapshot.
Pre-Flight Checklist
Before proceeding to Part 2, ensure you have:
- [ ] Hardware
- [ ] 3+ machines for control plane (4 CPU, 8GB RAM minimum each)
- [ ] 2+ machines for workers (8 CPU, 16GB RAM recommended)
- [ ] SSDs for all nodes (check with fio benchmark)
- [ ] Gigabit ethernet minimum (10Gb for storage network if possible)
- [ ] Network
- [ ] IP addresses assigned for all nodes (static or DHCP reservations)
- [ ] Control plane VIP address reserved (not used by any device)
- [ ] Load balancer IP range reserved (10 IPs recommended)
- [ ] DNS servers accessible from all nodes
- [ ] Internet access for pulling container images
- [ ] Planning Documents
- [ ] Network diagram with IPs and VLANs
- [ ] Storage allocation plan (which disks for what)
- [ ] Secret management strategy (what tool, what structure)
- [ ] Backup and recovery plan (what to backup, where to store)
- [ ] Software
- [ ] Talos Linux ISO downloaded
- [ ] talosctl installed on your workstation
- [ ] kubectl installed on your workstation
- [ ] Text editor ready for YAML editing
What's Next
In Part 2: Bootstrapping Your Talos Cluster, we'll bring up the control plane with high availability and join worker nodes. You'll learn about critical Talos configurations, including the forwardKubeDNSToHost setting that caused me 6 hours of debugging.
Note: Talos forwards cluster DNS to the host resolver by default (forwardKubeDNSToHost: true), which can surprise you if your host uses Pi-hole/AdGuard—covered in detail in Part 2.
Related Reading:
- Part 3: Cilium CNI Configuration - Why the DNS configuration in Part 2 is critical for Cilium
- Part 4: Ceph Storage Requirements - Storage planning details mentioned in this article
Key Takeaways
- Talos Linux provides unmatched security and operational simplicity for Kubernetes - immutability prevents configuration drift
- Proper planning prevents painful migrations - get your network and storage architecture right initially
- Security must be built-in from day one - retrofitting security is exponentially harder
- Resource requirements add up quickly - plan for 2-3x your application needs for overhead
- High availability requires odd numbers - 3 control plane nodes minimum for etcd quorum
References
- Talos Linux Documentation: https://www.talos.dev/latest/
- Kubernetes The Hard Way: https://github.com/kelseyhightower/kubernetes-the-hard-way
- etcd Operations Guide: https://etcd.io/docs/v3.5/op-guide/
- Ceph Hardware Recommendations: https://docs.ceph.com/en/reef/start/hardware-recommendations/
Continue to Part 2: Bootstrapping Your Talos Cluster →