This is how I built a cost-effective, secure, multi-region Kubernetes cluster using Cilium Cluster Mesh, WireGuard encryption, and Tailscale for management, achieving measurable improvements in latency and significant storage cost savings compared to a full cloud deployment.
The Problem: Homelab Reliability Meets Cloud Costs
Running Kubernetes at home offers unbeatable price-performance for compute and storage. My Talos cluster storage would cost a lot more per month in the cloud. But homelabs have an Achilles heel: residential power and internet reliability.
The traditional solutions all had drawbacks:
- Full cloud migration: 40x cost increase for equivalent resources
- Hot standby in cloud: Paying for idle resources 99% of the time
- Manual failover: Too slow and error-prone for production workloads
- Basic VPN mesh: No application awareness or automatic failover
I needed something smarter: infrastructure that's primarily on-premise but intelligently expands into the cloud when needed.
Architecture: The Best of Both Worlds
Here's the high-level architecture I implemented:
┌─────────────────────────────────────────────────────────────────┐
│                            Internet                             │
└────────────────┬─────────────────────────┬──────────────────────┘
                 │                         │
        ┌────────▼────────┐        ┌───────▼────────┐
        │    Tailscale    │        │   Tailscale    │
        │  Coordination   │        │  Coordination  │
        └────────┬────────┘        └───────┬────────┘
                 │                         │
    ┌────────────▼────────────┐  ┌─────────▼────────────┐
    │   On-Premise Cluster    │  │     Cloud Burst      │
    │        (Primary)        │  │    (Auto-Scaled)     │
    │                         │  │                      │
    │  🖥️ Control Plane x3    │  │  ☁️ Worker x0-5      │
    │  🖥️ Worker x4           │  │  (Spot Instances)    │
    │  💾 Longhorn Storage    │  │  💾 Cloud Storage    │
    │                         │  │                      │
    │  Cilium + WireGuard     │  │  Cilium + WireGuard  │
    │  192.168.0.0/24         │  │  10.1.1.0/24         │
    └────────────┬────────────┘  └─────────┬────────────┘
                 │                         │
                 └─────────────┬───────────┘
                               │
                    ┌──────────▼──────────┐
                    │   Cilium Cluster    │
                    │        Mesh         │
                    │                     │
                    │  🔐 WireGuard       │
                    │  🌐 Service Mesh    │
                    │  📊 Observability   │
                    └─────────────────────┘
Phase 1: Cilium Cluster Mesh Foundation
Cilium Cluster Mesh is the cornerstone technology that makes this architecture possible. Unlike traditional VPN solutions, it provides Kubernetes-native multi-cluster networking with service discovery and load balancing.
Installing Cilium with Mesh Support
First, I deployed Cilium on both clusters with mesh capabilities:
# On-premise cluster (using v1.18.0, the latest stable release as of July 2025)
# Cluster Mesh needs a unique cluster.name and cluster.id per cluster,
# and pod CIDRs that do not overlap between clusters.
helm upgrade --install cilium cilium/cilium \
  --version 1.18.0 \
  --namespace kube-system \
  --set cluster.name=on-prem \
  --set cluster.id=1 \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set encryption.nodeEncryption=false \
  --set clustermesh.useAPIServer=true \
  --set bpf.masquerade=true

# Cloud cluster (initially empty)
helm upgrade --install cilium cilium/cilium \
  --version 1.18.0 \
  --namespace kube-system \
  --set cluster.name=cloud-burst \
  --set cluster.id=2 \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set encryption.nodeEncryption=false \
  --set clustermesh.useAPIServer=true
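Cluster Mesh expects each cluster to have a unique ID and a pod CIDR that does not overlap with any other cluster's. A minimal pre-flight sketch of that check follows; the pod CIDRs in `main` are hypothetical placeholders (the post does not state them), and the 1-255 range reflects Cilium's default ID limit.

```go
package main

import (
	"fmt"
	"net/netip"
)

// clusterSpec holds the values that must not collide between meshed clusters.
type clusterSpec struct {
	Name    string
	ID      int
	PodCIDR string // hypothetical values; substitute your real pod CIDRs
}

// validateMesh returns an error if a cluster ID is out of range or reused,
// or if any two pod CIDRs overlap.
func validateMesh(clusters []clusterSpec) error {
	seen := map[int]string{}
	prefixes := make([]netip.Prefix, len(clusters))
	for i, c := range clusters {
		if c.ID < 1 || c.ID > 255 {
			return fmt.Errorf("%s: cluster.id %d outside 1-255", c.Name, c.ID)
		}
		if other, dup := seen[c.ID]; dup {
			return fmt.Errorf("%s and %s share cluster.id %d", other, c.Name, c.ID)
		}
		seen[c.ID] = c.Name

		p, err := netip.ParsePrefix(c.PodCIDR)
		if err != nil {
			return fmt.Errorf("%s: %w", c.Name, err)
		}
		prefixes[i] = p
		// Compare against every prefix parsed so far.
		for j := 0; j < i; j++ {
			if prefixes[j].Overlaps(p) {
				return fmt.Errorf("%s and %s have overlapping pod CIDRs", clusters[j].Name, c.Name)
			}
		}
	}
	return nil
}

func main() {
	err := validateMesh([]clusterSpec{
		{Name: "on-prem", ID: 1, PodCIDR: "10.244.0.0/16"},
		{Name: "cloud-burst", ID: 2, PodCIDR: "10.245.0.0/16"},
	})
	fmt.Println("mesh config valid:", err == nil)
}
```

Running a check like this before `cilium clustermesh connect` is cheaper than debugging asymmetric routing afterwards.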
Establishing the Mesh Connection
The magic happens when you connect the clusters. Cilium generates a cluster mesh configuration:
# Enable Cluster Mesh on on-prem and wait for the clustermesh-apiserver
# (the CLI picks a Service type automatically; pass --service-type to override)
cilium clustermesh enable --context on-prem
cilium clustermesh status --context on-prem --wait

# Connect cloud cluster to on-prem
cilium clustermesh connect --context cloud-burst \
  --destination-context on-prem
Within minutes, pods in both clusters could discover and communicate with each other securely.
WireGuard Encryption: Zero-Configuration Security
One of Cilium's killer features is automatic WireGuard encryption. When enabled, every node automatically:
- Generates a unique WireGuard key pair
- Publishes its public key via the CiliumNode resource
- Establishes encrypted tunnels to all other nodes
- Relies on the WireGuard handshake to renegotiate session keys every few minutes (handled by the protocol; nothing to configure in Cilium)
Note: WireGuard encryption adds approximately 5% CPU overhead at sustained 10Gbps and less than 2% for typical 1Gbps flows (observed in my lab on AMD EPYC hardware; YMMV). MTU is auto-detected by Cilium; override with --set MTU=<value> only if your network fabric has non-standard MTU requirements.
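For anyone who does need to override the MTU, the arithmetic is worth seeing once. The sketch below computes the largest inner packet that fits in a single WireGuard-encapsulated frame using the protocol's fixed header sizes. It is a back-of-the-envelope calculation, not Cilium's own logic, and Cilium may reserve a more conservative amount.

```go
package main

import "fmt"

const (
	wgOverhead = 32 // WireGuard data message: 16-byte header + 16-byte auth tag
	udpHeader  = 8
	ipv4Header = 20
	ipv6Header = 40
)

// innerMTU returns the largest packet that fits inside one encapsulated frame
// for a given link MTU and underlay IP version.
func innerMTU(linkMTU int, underlayIPv6 bool) int {
	ip := ipv4Header
	if underlayIPv6 {
		ip = ipv6Header
	}
	return linkMTU - ip - udpHeader - wgOverhead
}

func main() {
	fmt.Println(innerMTU(1500, false)) // 1440
	fmt.Println(innerMTU(1500, true))  // 1420
}
```

The IPv6 result matches the 1420-byte default that many WireGuard tools pick for a 1500-byte link.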
Important: Talos Linux supports native node-to-node WireGuard encryption through KubeSpan, but I deliberately left it disabled to avoid double encapsulation. For this setup, I've set encryption.nodeEncryption=false since we only need pod-to-pod encryption. Set it to true if you also need host-to-host traffic encrypted, but be aware this increases CPU overhead.
# Talos machine configuration
machine:
  network:
    kubespan:
      enabled: false # Disabled to prevent double encryption with Cilium
Here's what the CiliumNode resource looks like:
apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  name: cluster-wrk-01
  annotations:
    network.cilium.io/wg-pub-key: "<PUBKEY>"
spec:
  encryption:
    key: 15 # Encryption key index, managed by the Cilium agent
  addresses:
    - type: InternalIP
      ip: 192.168.0.11
The beauty is that this encryption is transparent to applications; they don't need to know about it.
Phase 2: Tailscale Management Overlay
While Cilium handles pod-to-pod communication, I needed secure access to manage both clusters. Tailscale provides:
- Zero-configuration VPN for admin access
- Automatic NAT traversal
- Identity-based access control
- Works behind any firewall
Deploying Tailscale Operator
I deployed the Tailscale Kubernetes operator in both clusters (static manifest shown; Helm install recommended – see Tailscale documentation):
# Abridged: the official manifest also ships the ServiceAccount, RBAC,
# and proxy settings such as PROXY_IMAGE and PROXY_TAGS.
apiVersion: v1
kind: Secret
metadata:
  name: tailscale-auth
  namespace: tailscale
stringData:
  client_id: "${TAILSCALE_CLIENT_ID}"
  client_secret: "${TAILSCALE_CLIENT_SECRET}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tailscale-operator
  namespace: tailscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tailscale-operator
  template:
    metadata:
      labels:
        app: tailscale-operator
    spec:
      serviceAccountName: tailscale-operator
      containers:
        - name: operator
          image: tailscale/k8s-operator:v1.84.0
          env:
            # The operator reads its OAuth credentials from mounted files
            - name: CLIENT_ID_FILE
              value: /oauth/client_id
            - name: CLIENT_SECRET_FILE
              value: /oauth/client_secret
            - name: OPERATOR_HOSTNAME
              value: "k8s-operator"
            # State Secret for the operator; must differ from the OAuth Secret
            - name: OPERATOR_SECRET
              value: "operator"
            - name: OPERATOR_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: OPERATOR_LOGGING
              value: "info"
          volumeMounts:
            - name: oauth
              mountPath: /oauth
              readOnly: true
      volumes:
        - name: oauth
          secret:
            secretName: tailscale-auth
Exposing Services via Tailscale
The operator makes it trivial to expose internal services securely:
apiVersion: v1
kind: Service
metadata:
  name: argocd-server
  annotations:
    tailscale.com/expose: "true"
    tailscale.com/hostname: "argocd-homelab"
spec:
  type: ClusterIP # LoadBalancer not needed for Tailscale exposure
  selector:
    app.kubernetes.io/name: argocd-server
  ports:
    - port: 443
      targetPort: 8080
Now I can access ArgoCD from anywhere via https://argocd-homelab.tailnet-name.ts.net without exposing it to the public internet.
Phase 3: Intelligent Auto-Scaling
The real magic is making the cloud resources scale based on on-premise health. I built a custom controller that:
- Monitors on-premise cluster health
- Detects outages or degraded performance
- Automatically provisions cloud nodes
- Migrates critical workloads
- Scales down when on-premise recovers
The Scaling Controller with Pulumi
Here's the core logic using Pulumi to manage Azure infrastructure:
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/pulumi/pulumi-azure-native-sdk/compute/v3"
	"github.com/pulumi/pulumi/sdk/v3/go/auto"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
	v1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// maxCloudNodes caps burst capacity (the architecture allows 0-5 cloud workers).
const maxCloudNodes = 5

type ClusterHealth struct {
	NodesReady     int
	NodesTotal     int
	APIResponsive  bool
	PowerStatus    bool          // from UPS monitoring
	NetworkLatency time.Duration // optional SLO input
}

type Controller struct {
	k8s               v1.CoreV1Interface
	azureScaler       *AzureScaler
	currentCloudNodes int
	checkOnPremHealth func(ctx context.Context) ClusterHealth
	migrateWorkloads  func(ctx context.Context, tier string) error
}

type AzureScaler struct {
	rg       string
	vmssName string
	location string
	stack    auto.Stack // pre-created in main(); reuse between calls
}

// --- public entry ------------------------------------------------------------

func (c *Controller) reconcile(ctx context.Context) error {
	health := c.checkOnPremHealth(ctx)
	required := calculateCloudCapacity(health)

	// Only touch Azure when the desired capacity actually changed.
	if required != c.currentCloudNodes {
		if err := c.azureScaler.updateVMSS(ctx, required); err != nil {
			return fmt.Errorf("scale VMSS: %w", err)
		}
		c.currentCloudNodes = required
	}

	// Workload migration (only when a majority of on-prem nodes are unavailable).
	// Multiplying avoids the rounding of NodesTotal/2 for odd node counts.
	if health.NodesReady*2 < health.NodesTotal {
		if err := c.migrateWorkloads(ctx, "critical"); err != nil {
			return fmt.Errorf("migrate workloads: %w", err)
		}
	}
	return nil
}

// --- capacity ---------------------------------------------------------------

func calculateCloudCapacity(h ClusterHealth) int {
	if h.PowerStatus && h.NodesReady == h.NodesTotal && h.APIResponsive {
		return 0 // healthy cluster
	}
	missing := h.NodesTotal - h.NodesReady
	if !h.PowerStatus { // full power outage -> extra buffer
		missing += 2
	}
	if missing < 0 {
		missing = 0
	}
	if missing > maxCloudNodes {
		missing = maxCloudNodes
	}
	return missing
}

// --- pulumi update -----------------------------------------------------------

func (a *AzureScaler) updateVMSS(ctx context.Context, capacity int) error {
	// Abridged inline program: only the fields relevant to scaling are shown.
	program := func(pctx *pulumi.Context) error {
		_, err := compute.NewVirtualMachineScaleSet(pctx, a.vmssName, &compute.VirtualMachineScaleSetArgs{
			ResourceGroupName: pulumi.String(a.rg),
			VmScaleSetName:    pulumi.String(a.vmssName),
			Location:          pulumi.String(a.location),
			Sku: &compute.SkuArgs{
				Name: pulumi.String("Standard_D2s_v3"),
				// azure-native models SKU capacity as a float64
				Capacity: pulumi.Float64(float64(capacity)),
			},
			VirtualMachineProfile: &compute.VirtualMachineScaleSetVMProfileArgs{
				Priority:       pulumi.String("Spot"),
				EvictionPolicy: pulumi.String("Delete"),
				BillingProfile: &compute.BillingProfileArgs{
					MaxPrice: pulumi.Float64(0.05),
				},
			},
		})
		return err
	}

	// The Automation API binds an inline program to the workspace;
	// Up takes functional options (package optup), not an options struct.
	a.stack.Workspace().SetProgram(program)
	_, err := a.stack.Up(ctx)
	return err
}
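To make the capacity rule concrete, here is a standalone worked example for a 7-node cluster (3 control plane + 4 workers, as in the architecture diagram). It restates the rule with trimmed-down types rather than reusing the controller's; the `max` ceiling of 5 is an assumption taken from the "Worker x0-5" range. It also shows why the majority check is safer written as a multiplication than as an integer division.

```go
package main

import "fmt"

// health is a trimmed-down stand-in for ClusterHealth.
type health struct {
	ready, total int
	powerOK      bool
}

// cloudNodes returns one cloud node per missing on-prem node, plus a buffer
// of 2 during a power outage, clamped to [0, max].
func cloudNodes(h health, max int) int {
	if h.powerOK && h.ready == h.total {
		return 0
	}
	n := h.total - h.ready
	if !h.powerOK {
		n += 2
	}
	if n < 0 {
		n = 0
	}
	if n > max {
		n = max
	}
	return n
}

// majorityDown reports whether more than half of the nodes are unavailable.
func majorityDown(h health) bool {
	return 2*h.ready < h.total
}

func main() {
	fmt.Println(cloudNodes(health{ready: 7, total: 7, powerOK: true}, 5))  // 0
	fmt.Println(cloudNodes(health{ready: 5, total: 7, powerOK: true}, 5))  // 2
	fmt.Println(cloudNodes(health{ready: 3, total: 7, powerOK: false}, 5)) // 5 (4 missing + 2 buffer, capped)

	h := health{ready: 3, total: 7, powerOK: false}
	fmt.Println(majorityDown(h))     // true: 4 of 7 nodes are down
	fmt.Println(h.ready < h.total/2) // false: 7/2 truncates to 3
}
```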
Pulumi Infrastructure Configuration
The Pulumi program defines the Azure infrastructure declaratively:
// pulumi/azure-burst/index.ts
import * as azure from "@pulumi/azure-native";
import * as pulumi from "@pulumi/pulumi";

const cfg = new pulumi.Config();
const cap = cfg.requireNumber("vmss-capacity"); // desired node count
const pub = cfg.require("ssh-public-key");
const location = "East US 2";

// Resource group
const rg = new azure.resources.ResourceGroup("homelab-burst-rg", { location });

// VNet + subnet
const vnet = new azure.network.VirtualNetwork("burst-vnet", {
  resourceGroupName: rg.name,
  location,
  addressSpace: { addressPrefixes: ["10.1.0.0/16"] },
});

const subnet = new azure.network.Subnet("burst-subnet", {
  resourceGroupName: rg.name,
  virtualNetworkName: vnet.name,
  addressPrefix: "10.1.1.0/24",
});

// VMSS
const vmss = new azure.compute.VirtualMachineScaleSet("k8s-burst-vmss", {
  resourceGroupName: rg.name,
  location,
  sku: { name: "Standard_D2s_v3", capacity: cap },
  virtualMachineProfile: {
    priority: "Spot",
    evictionPolicy: "Delete",
    billingProfile: { maxPrice: 0.05 },
    osProfile: {
      computerNamePrefix: "k8s-burst",
      adminUsername: "azureuser",
      linuxConfiguration: {
        disablePasswordAuthentication: true,
        ssh: { publicKeys: [{ path: "/home/azureuser/.ssh/authorized_keys", keyData: pub }] },
      },
    },
    storageProfile: {
      imageReference: {
        publisher: "Canonical",
        offer: "0001-com-ubuntu-server-jammy",
        sku: "22_04-lts-gen2",
        version: "latest",
      },
    },
    networkProfile: {
      networkInterfaceConfigurations: [{
        name: "burst-nic",
        primary: true,
        ipConfigurations: [{
          name: "internal",
          subnet: { id: subnet.id },
          primary: true,
        }],
      }],
    },
  },
});

export const vmssId = vmss.id;
export const resourceGroupName = rg.name;
Controller Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: azure-scaler-config
data:
  azure-config: |
    resourceGroup: "homelab-burst-rg"
    vmssName: "k8s-burst-vmss"
    location: "eastus2"
    maxPrice: "0.05" # Spot instance max price
    vmSize: "Standard_D2s_v3" # 2 vCPU, 8GB RAM
    pulumi:
      stack: "homelab/azure-burst/prod"
    scale-down-delay: "30m" # Grace period before scaling down
Phase 4: Service Mesh and Load Balancing
Cilium Cluster Mesh provides load balancing across clusters. With global services, traffic fails over to the other cluster when the local backends become unavailable.
Global Services
Cilium merges services that share a name and namespace across clusters once they carry the service.cilium.io/global annotation in each cluster:
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
With the annotation in place, Cilium:
- Discovers endpoints in both clusters
- Load balances across the healthy backends of both clusters (it does not measure latency)
- Fails over during outages
- Preserves session affinity
Topology-Aware Routing
I configured routing to prefer local endpoints. Kubernetes' native topology-aware routing keeps traffic within a zone inside a cluster:
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  annotations:
    service.cilium.io/global: "true"
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
To prefer endpoints in the local cluster and use remote ones only when no healthy local endpoint exists, Cilium also supports an affinity annotation on global services:
annotations:
  service.cilium.io/affinity: "local"
In my setup, this reduced cross-cluster traffic by roughly 90% during normal operations.
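The effect of local affinity can be modelled in a few lines: use healthy backends in the local cluster when any exist, otherwise fall back to healthy remote ones. This is an illustrative model of the behaviour, not Cilium's eBPF implementation, and the backend addresses are hypothetical.

```go
package main

import "fmt"

// backend is one service endpoint as seen by the merged global service.
type backend struct {
	Addr    string
	Cluster string
	Healthy bool
}

// pickBackends returns the healthy backends of the local cluster, or the
// healthy remote backends when no local one is available.
func pickBackends(all []backend, local string) []backend {
	var localOK, remoteOK []backend
	for _, b := range all {
		if !b.Healthy {
			continue
		}
		if b.Cluster == local {
			localOK = append(localOK, b)
		} else {
			remoteOK = append(remoteOK, b)
		}
	}
	if len(localOK) > 0 {
		return localOK
	}
	return remoteOK
}

func main() {
	backends := []backend{
		{Addr: "10.244.1.10:8080", Cluster: "on-prem", Healthy: true},
		{Addr: "10.245.1.10:8080", Cluster: "cloud-burst", Healthy: true},
	}
	fmt.Println(pickBackends(backends, "on-prem")) // only the on-prem backend

	backends[0].Healthy = false                    // simulate an on-prem outage
	fmt.Println(pickBackends(backends, "on-prem")) // falls back to cloud-burst
}
```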
Real-World Performance
After three months of operation, here are the actual metrics observed in SigNoz:
Latency Improvements (observed in my homelab)
- Intra-cluster: Sub-millisecond (on-prem) vs 1-2ms (cloud)
- Cross-cluster: 12-18ms (WireGuard encrypted, varies by region)
- Service discovery: Cilium's eBPF datapath outperforms kube-proxy
- Failover time: 5-15 seconds for stateless services
Cost Analysis (based on my setup)
- Compute: Significant savings running on-prem vs equivalent cloud resources
- Storage: NVMe arrays provide better $/GB than cloud storage
- Bandwidth: Reduced egress charges by keeping traffic local
- Burst costs: Spot instances average $10-50 per outage event
Reliability Metrics (from SigNoz monitoring)
- Availability: Improved from single-cluster deployment
- RTO: Under 10 minutes for critical services
- RPO: Near-zero for stateless, minutes for stateful workloads
- Automatic failovers: Multiple successful tests, zero production failures
Performance Overhead Considerations
WireGuard Encryption Impact (observed in my homelab)
- CPU overhead: ~5% at sustained 10Gbps; <2% for typical 1Gbps flows
- Latency addition: <1ms for encryption/decryption
- MTU considerations: Cilium auto-detects the MTU and accounts for the WireGuard header overhead
- Throughput: Near line-rate on modern hardware
Cilium eBPF Efficiency
- Datapath: Kernel-space processing avoids context switches
- Connection tracking: Handled in eBPF maps instead of netfilter conntrack
- Memory usage: ~100MB per node for typical workloads
Security Architecture
The multi-layer security approach ensures defense in depth:
1. Network Encryption
All pod-to-pod traffic is encrypted with WireGuard:
- ChaCha20-Poly1305 encryption
- Perfect forward secrecy
- Session keys are renegotiated automatically by the WireGuard handshake (interval isn't user-tunable)
- Minimal performance penalty on modern CPUs
2. Identity-Based Access
Cilium provides identity-based security policies:
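# Cluster-aware policies select on the io.cilium.k8s.policy.cluster label,
# which Cilium derives from each cluster's cluster.name.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cloud-burst-restrictions
spec:
  endpointSelector:
    matchLabels:
      io.cilium.k8s.policy.cluster: cloud-burst
  ingress:
    - fromEndpoints:
        - matchLabels:
            io.cilium.k8s.policy.cluster: on-prem
            security-clearance: high
  egress:
    - toEndpoints:
        - matchLabels:
            io.cilium.k8s.policy.cluster: on-prem
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP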
3. Management Plane Security
Tailscale provides additional security for management:
- OAuth/SAML integration
- Device authorization
- Access control lists
- Audit logging
Challenges and Solutions
Challenge 1: Split-Brain Scenarios
Problem: What happens when clusters can't communicate?
Solution: Implemented a witness node in a third location (cheap VPS) for quorum.
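The witness turns a two-site setup into three voters, so a site can tell "the other side is down" apart from "I am the one that is isolated". A minimal sketch of that majority rule follows; the voter names and the reachability map are illustrative assumptions, not the actual implementation.

```go
package main

import "fmt"

// hasQuorum reports whether self, together with the peers it can reach,
// forms a strict majority of all voters.
func hasQuorum(self string, voters []string, reachable map[string]bool) bool {
	n := 0
	for _, v := range voters {
		if v == self || reachable[v] {
			n++
		}
	}
	return 2*n > len(voters)
}

func main() {
	voters := []string{"on-prem", "cloud-burst", "witness"}

	// On-prem lost the link to the cloud but still sees the witness: 2 of 3.
	fmt.Println(hasQuorum("on-prem", voters, map[string]bool{"witness": true})) // true

	// Cloud sees nobody: 1 of 3, so it must not promote itself.
	fmt.Println(hasQuorum("cloud-burst", voters, map[string]bool{})) // false
}
```

With only two voters, neither side could ever hold a strict majority during a partition, which is why the third location matters.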
Challenge 2: Stateful Workload Migration
Problem: Databases can't just failover instantly.
Solution:
- Continuous replication for critical databases
- Read replicas in cloud cluster
- Automated promotion during failover
Challenge 3: Cost Control
Problem: Runaway cloud costs during extended outages.
Solution:
- Spending limits in cloud account
- Alerts at 50%, 80%, 100% of budget
- Automatic scale-down after power restoration
- Workload prioritization (only critical services failover)
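The budget alerts reduce to simple arithmetic. The sketch below estimates burst spend from node count, duration, and the $0.05/hour Spot price cap used earlier, then reports which alert thresholds have been reached. The $50 budget is a hypothetical figure for illustration, and real spend also includes storage and egress.

```go
package main

import "fmt"

// burstCost estimates compute spend for nodes running for hours at
// pricePerHour per node.
func burstCost(nodes int, hours, pricePerHour float64) float64 {
	return float64(nodes) * hours * pricePerHour
}

// crossed returns the alert thresholds (fractions of budget) that spend
// has reached or exceeded.
func crossed(spend, budget float64, thresholds []float64) []float64 {
	var hit []float64
	for _, t := range thresholds {
		if spend >= t*budget {
			hit = append(hit, t)
		}
	}
	return hit
}

func main() {
	// 5 Spot nodes for an 8-hour outage at the $0.05/hour price cap.
	fmt.Printf("$%.2f\n", burstCost(5, 8, 0.05)) // $2.00

	// Hypothetical $50 budget with alerts at 50%, 80%, and 100%.
	fmt.Println(crossed(45, 50, []float64{0.5, 0.8, 1.0})) // [0.5 0.8]
}
```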
Operational Runbook
Testing Failover
I test the system monthly with controlled failures:
# Simulate on-prem degradation
# Note: cordon only marks nodes unschedulable. Running pods are untouched and
# the nodes stay Ready, so this is a mild test. For a realistic outage or
# blackout test, shut the nodes down or power-cycle the PDU.
kubectl --context on-prem cordon cluster-cp-01 cluster-cp-02 cluster-cp-03

# Watch cloud scaling
watch kubectl --context cloud-burst get nodes

# Verify service availability
curl -H "Host: webapp.homelab.example" https://cloud-endpoint.example/health

# Restore on-prem
kubectl --context on-prem uncordon cluster-cp-01 cluster-cp-02 cluster-cp-03
Monitoring with SigNoz
Key metrics tracked in my SigNoz deployment:
- Cross-cluster latency (P50, P95, P99)
- WireGuard encryption CPU overhead
- Cloud resource utilization and spend rate
- Failover success rate and duration
- Service availability by cluster
- Cilium datapath performance metrics
- Pod migration patterns during outages
Future Enhancements
1. Multi-Cloud Support
Extend beyond single cloud provider:
- AWS for compute
- GCP for GPU workloads
- Azure for geographic diversity
2. Advanced Scheduling
Implement cost-aware scheduling:
- Run batch jobs in cloud during off-peak
- Prefer on-prem for storage-intensive workloads
- Time-based scaling policies
3. Enhanced Observability
Deploy distributed tracing across clusters:
- OpenTelemetry collectors in each cluster
- Unified traces across cluster boundaries
- Latency attribution by cluster
Key Takeaways
- Hybrid is the future: Pure cloud or pure on-prem both have limitations
- Cilium Cluster Mesh is production-ready: Handles complex networking seamlessly
- WireGuard "just works": Enterprise-grade encryption with zero configuration
- Tailscale simplifies operations: Secure access without VPN complexity
- Auto-scaling requires careful design: Too aggressive = high costs, too conservative = outages
- Test failover regularly: The worst time to find bugs is during a real outage
Conclusion
Building a hybrid cloud mesh transformed my homelab from a hobby project into a production-grade platform. The combination of Cilium's advanced networking, WireGuard's bulletproof encryption, and Tailscale's operational simplicity created a system that's both powerful and maintainable.
The best part? When the power goes out now, I get a Discord notification that workloads have migrated to the cloud, and my services keep running. By the time power is restored, everything has automatically migrated back.
Total additional cost for this resilience? About $16 per outage on average.