
Building a Hybrid Cloud Mesh: Cilium, WireGuard, and Tailscale for Disaster-Resilient Kubernetes

When the power went out for the third time in two months, I realized I needed a better disaster recovery strategy. Cloud providers don't lose power, but they do charge by the minute. The solution? A hybrid mesh that runs on-premise by default but automatically bursts into the cloud during outages.

This is how I built a cost-effective, secure, multi-region Kubernetes cluster using Cilium Cluster Mesh for networking, WireGuard for encryption, and Tailscale for management. Compared with a full cloud deployment, it gives me lower intra-cluster latency and much cheaper storage.

The Problem: Homelab Reliability Meets Cloud Costs

Running Kubernetes at home offers unbeatable price-performance for compute and storage. My Talos cluster storage would cost a lot more per month in the cloud. But homelabs have an Achilles heel: residential power and internet reliability.

The traditional solutions all had drawbacks:

  • Full cloud migration: 40x cost increase for equivalent resources
  • Hot standby in cloud: Paying for idle resources 99% of the time
  • Manual failover: Too slow and error-prone for production workloads
  • Basic VPN mesh: No application awareness or automatic failover

I needed something smarter: infrastructure that's primarily on-premise but intelligently expands into the cloud when needed.

Architecture: The Best of Both Worlds

Here's the high-level architecture I implemented:

┌─────────────────────────────────────────────────────────────────┐
│                          Internet                               │
└────────────────┬─────────────────────────┬──────────────────────┘
                 │                         │
        ┌────────▼────────┐       ┌───────▼────────┐
        │   Tailscale     │       │  Tailscale     │
        │  Coordination   │       │ Coordination   │
        └────────┬────────┘       └───────┬────────┘
                 │                         │
    ┌────────────▼────────────┐ ┌─────────▼────────────┐
    │   On-Premise Cluster    │ │   Cloud Burst        │
    │   (Primary)             │ │   (Auto-Scaled)      │
    │                         │ │                      │
    │  🖥️  Control Plane x3   │ │  ☁️  Worker x0-5      │
    │  🖥️  Worker x4          │ │    (Spot Instances)   │
    │  💾 Longhorn Storage    │ │  💾 Cloud Storage     │
    │                         │ │                      │
    │  Cilium + WireGuard     │ │  Cilium + WireGuard  │
    │  192.168.0.0/24         │ │  10.1.0.0/24         │
    └────────────┬────────────┘ └──────────┬───────────┘
                 │                         │
                 └─────────────┬───────────┘
                               │
                    ┌──────────▼──────────┐
                    │   Cilium Cluster    │
                    │       Mesh          │
                    │                     │
                    │  🔐 WireGuard       │
                    │  🌐 Service Mesh    │
                    │  📊 Observability   │
                    └─────────────────────┘

Phase 1: Cilium Cluster Mesh Foundation

Cilium Cluster Mesh is the cornerstone technology that makes this architecture possible. Unlike traditional VPN solutions, it provides Kubernetes-native multi-cluster networking with service discovery and load balancing.

Installing Cilium with Mesh Support

First, I deployed Cilium on both clusters with mesh capabilities:

# On-premise cluster (Using v1.18.0 - latest stable as of July 2025)
helm upgrade --install cilium cilium/cilium \
  --version 1.18.0 \
  --namespace kube-system \
  --set cluster.name=on-prem \
  --set cluster.id=1 \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set encryption.nodeEncryption=false \
  --set clustermesh.useAPIServer=true \
  --set bpf.masquerade=true

# Cloud cluster (initially empty)
helm upgrade --install cilium cilium/cilium \
  --version 1.18.0 \
  --namespace kube-system \
  --set cluster.name=cloud-burst \
  --set cluster.id=2 \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set encryption.nodeEncryption=false \
  --set clustermesh.useAPIServer=true

Establishing the Mesh Connection

The magic happens when you connect the clusters. Cilium generates a cluster mesh configuration:

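# Enable Cluster Mesh on both clusters (creates a LoadBalancer service by default)
cilium clustermesh enable --context on-prem
cilium clustermesh enable --context cloud-burst
cilium clustermesh status --context on-prem --wait
cilium clustermesh status --context cloud-burst --wait

# Connect cloud cluster to on-prem (the connection is set up in both directions)
cilium clustermesh connect --context cloud-burst \
  --destination-context on-prem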

Within minutes, pods in both clusters could discover and communicate with each other securely.

WireGuard Encryption: Zero-Configuration Security

One of Cilium's killer features is automatic WireGuard encryption. When enabled, every node automatically:

  1. Generates a unique WireGuard key pair
  2. Publishes its public key via the CiliumNode resource
  3. Establishes encrypted tunnels to all other nodes
  4. Gets automatic session-key rotation from WireGuard's periodic re-handshakes (nothing to schedule or tune)
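
To make step 1 concrete: a WireGuard key is a 32-byte Curve25519 key, written as base64. The sketch below generates such a pair with Go's standard library. It only illustrates the key format; Cilium does this internally with its own code, and the function name newKeyPair is mine.

```go
package main

import (
	"crypto/ecdh"
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newKeyPair generates a Curve25519 key pair and returns both halves
// base64-encoded, the textual form WireGuard tooling uses for keys.
func newKeyPair() (private, public string, err error) {
	key, err := ecdh.X25519().GenerateKey(rand.Reader)
	if err != nil {
		return "", "", err
	}
	private = base64.StdEncoding.EncodeToString(key.Bytes())
	public = base64.StdEncoding.EncodeToString(key.PublicKey().Bytes())
	return private, public, nil
}

func main() {
	_, pub, err := newKeyPair()
	if err != nil {
		panic(err)
	}
	// Only the public half is ever shared with other nodes.
	fmt.Println("public key:", pub)
}
```

The private half never leaves the node; peers only need the public half to set up a tunnel.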

Note: WireGuard encryption adds approximately 5% CPU overhead at sustained 10Gbps; less than 2% for typical 1Gbps flows (observed in my lab on AMD EPYC hardware; YMMV). MTU is auto-detected by Cilium; override with --set MTU=<value> only if your network fabric has non-standard MTU requirements.

Important: Talos Linux offers its own node-to-node WireGuard mesh (KubeSpan), but I deliberately left it disabled to avoid double encapsulation. For this setup, I've set encryption.nodeEncryption=false since we only need pod-to-pod encryption. Set it to true if you also need host-to-host traffic encrypted, but be aware this increases CPU overhead.

# Talos machine configuration
machine:
  network:
    kubespan:
      enabled: false  # Disabled to prevent double encryption with Cilium

Here's what the CiliumNode resource looks like:

apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  name: cluster-wrk-01
  annotations:
    network.cilium.io/wg-pub-key: "<PUBKEY>"
spec:
  encryption:
    key: 15  # Encryption key index for rotation
  addresses:
  - type: InternalIP
    ip: 192.168.0.11

The beauty is that this encryption is transparent to applications; they don't need to know about it.

Phase 2: Tailscale Management Overlay

While Cilium handles pod-to-pod communication, I needed secure access to manage both clusters. Tailscale provides:

  • Zero-configuration VPN for admin access
  • Automatic NAT traversal
  • Identity-based access control
  • Works behind any firewall

Deploying Tailscale Operator

I deployed the Tailscale Kubernetes operator in both clusters (static manifest shown; Helm install recommended – see Tailscale documentation):

apiVersion: v1
kind: Secret
metadata:
  name: tailscale-auth
  namespace: tailscale
stringData:
  client-id: "${TAILSCALE_CLIENT_ID}"
  client-secret: "${TAILSCALE_CLIENT_SECRET}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tailscale-operator
  namespace: tailscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tailscale-operator
  template:
    metadata:
      labels:
        app: tailscale-operator
    spec:
      serviceAccountName: tailscale-operator
      containers:
      - name: operator
        image: tailscale/k8s-operator:v1.84.0
        env:
        - name: TS_CLIENT_ID
          valueFrom:
            secretKeyRef:
              name: tailscale-auth
              key: client-id
        - name: TS_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: tailscale-auth
              key: client-secret
        - name: OPERATOR_HOSTNAME
          value: "k8s-operator"
        - name: TS_KUBE_SECRET
          value: "tailscale-auth"
        - name: OPERATOR_LOGGING
          value: "info"

Exposing Services via Tailscale

The operator makes it trivial to expose internal services securely:

apiVersion: v1
kind: Service
metadata:
  name: argocd-server
  annotations:
    tailscale.com/expose: "true"
    tailscale.com/hostname: "argocd-homelab"
spec:
  type: ClusterIP  # LoadBalancer not needed for Tailscale exposure
  selector:
    app.kubernetes.io/name: argocd-server
  ports:
  - port: 443
    targetPort: 8080

Now I can access ArgoCD from anywhere via https://argocd-homelab.tailnet-name.ts.net without exposing it to the public internet.

Phase 3: Intelligent Auto-Scaling

The real magic is making the cloud resources scale based on on-premise health. I built a custom controller that:

  1. Monitors on-premise cluster health
  2. Detects outages or degraded performance
  3. Automatically provisions cloud nodes
  4. Migrates critical workloads
  5. Scales down when on-premise recovers
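
Steps 1 and 2 come down to turning raw node status into a small health summary. Here is a minimal sketch of that idea. NodeStatus and Health are hypothetical, simplified types of my own; a real controller would read the Ready condition from the Kubernetes API instead.

```go
package main

import "fmt"

// NodeStatus is a hypothetical, simplified stand-in for a Kubernetes Node:
// just the name and whether its Ready condition is True.
type NodeStatus struct {
	Name  string
	Ready bool
}

// Health summarizes what a scaling decision needs to know.
type Health struct {
	NodesReady int
	NodesTotal int
	Degraded   bool // at least one node is not Ready
	Outage     bool // more than half of the nodes are not Ready
}

// summarize counts Ready nodes and derives the two health flags.
func summarize(nodes []NodeStatus) Health {
	h := Health{NodesTotal: len(nodes)}
	for _, n := range nodes {
		if n.Ready {
			h.NodesReady++
		}
	}
	h.Degraded = h.NodesReady < h.NodesTotal
	// Multiply instead of dividing so odd node counts are not rounded down.
	h.Outage = h.NodesReady*2 < h.NodesTotal
	return h
}

func main() {
	nodes := []NodeStatus{
		{"cp-01", true}, {"cp-02", false}, {"cp-03", false},
		{"wrk-01", true}, {"wrk-02", true}, {"wrk-03", false}, {"wrk-04", false},
	}
	fmt.Printf("%+v\n", summarize(nodes))
}
```

With three of seven nodes Ready, the summary is flagged as both degraded and an outage; with four Ready it is only degraded.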

The Scaling Controller with Pulumi

Here's the core logic using Pulumi to manage Azure infrastructure:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/pulumi/pulumi-azure-native-sdk/compute/v3"
    "github.com/pulumi/pulumi/sdk/v3/go/auto"
    "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
    v1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

type ClusterHealth struct {
    NodesReady     int
    NodesTotal     int
    APIResponsive  bool
    PowerStatus    bool          // from UPS monitoring
    NetworkLatency time.Duration // optional SLO input
}

type Controller struct {
    k8s                 v1.CoreV1Interface
    azureScaler         *AzureScaler
    currentCloudNodes   int
    checkOnPremHealth   func(ctx context.Context) ClusterHealth
    migrateWorkloads    func(ctx context.Context, tier string) error
}

type AzureScaler struct {
    rg        string
    vmssName  string
    location  string
    stack     auto.Stack // pre-created in main(); reuse between calls
}

// --- public entry ------------------------------------------------------------

func (c *Controller) reconcile(ctx context.Context) error {
    health := c.checkOnPremHealth(ctx)

    required := calculateCloudCapacity(health)
    if required != c.currentCloudNodes {
        if err := c.azureScaler.updateVMSS(ctx, required); err != nil {
            return fmt.Errorf("scale VMSS: %w", err)
        }
        c.currentCloudNodes = required
    }

    // Workload migration (only when a majority of on-prem nodes are unavailable).
    // Multiplying avoids integer division rounding down for odd node counts.
    if health.NodesReady*2 < health.NodesTotal {
        if err := c.migrateWorkloads(ctx, "critical"); err != nil {
            return fmt.Errorf("migrate workloads: %w", err)
        }
    }
    return nil
}

// --- capacity ---------------------------------------------------------------

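func calculateCloudCapacity(h ClusterHealth) int {
    const maxCloudNodes = 5 // upper bound of the burst pool (Worker x0-5 in the diagram)

    if h.PowerStatus && h.NodesReady == h.NodesTotal && h.APIResponsive {
        return 0 // healthy cluster
    }

    missing := h.NodesTotal - h.NodesReady
    if !h.PowerStatus { // full power outage -> extra buffer
        missing += 2
    }
    if missing < 0 {
        missing = 0
    }
    if missing > maxCloudNodes { // never request more than the pool allows
        missing = maxCloudNodes
    }
    return missing
}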

// --- pulumi update -----------------------------------------------------------

func (a *AzureScaler) updateVMSS(ctx context.Context, capacity int) error {
    // Inline program describing the scale set at the desired capacity.
    // The VM profile is abbreviated here; a complete definition also needs
    // the OS, storage, and network profiles.
    program := func(pctx *pulumi.Context) error {
        _, err := compute.NewVirtualMachineScaleSet(pctx, a.vmssName, &compute.VirtualMachineScaleSetArgs{
            ResourceGroupName: pulumi.String(a.rg),
            VmScaleSetName:    pulumi.String(a.vmssName),
            Location:          pulumi.String(a.location),
            Sku: &compute.SkuArgs{
                Name:     pulumi.String("Standard_D2s_v3"),
                Capacity: pulumi.Float64(float64(capacity)), // azure-native models capacity as a number
            },
            VirtualMachineProfile: &compute.VirtualMachineScaleSetVMProfileArgs{
                Priority:       pulumi.String("Spot"),
                EvictionPolicy: pulumi.String("Delete"),
                BillingProfile: &compute.BillingProfileArgs{
                    MaxPrice: pulumi.Float64(0.05),
                },
            },
        })
        return err
    }

    // reconcile only calls this when the desired capacity changed, so always run Up.
    // The Automation API takes the program from the workspace, not from Up's options.
    a.stack.Workspace().SetProgram(program)
    _, err := a.stack.Up(ctx)
    return err
}

Pulumi Infrastructure Configuration

The Pulumi program defines the Azure infrastructure declaratively:

// pulumi/azure-burst/index.ts
import * as azure from "@pulumi/azure-native";
import * as pulumi from "@pulumi/pulumi";

const cfg = new pulumi.Config();
const cap = cfg.requireNumber("vmss-capacity");     // desired node count
const pub = cfg.require("ssh-public-key");

const location = "eastus2"; // same region name as in the controller ConfigMap

// Resource group
const rg = new azure.resources.ResourceGroup("homelab-burst-rg", {
    resourceGroupName: "homelab-burst-rg", // fixed name; Pulumi otherwise appends a random suffix
    location,
});

// VNet + subnet
const vnet = new azure.network.VirtualNetwork("burst-vnet", {
    resourceGroupName: rg.name,
    location,
    addressSpace: { addressPrefixes: ["10.1.0.0/16"] },
});

const subnet = new azure.network.Subnet("burst-subnet", {
    resourceGroupName: rg.name,
    virtualNetworkName: vnet.name,
    addressPrefix: "10.1.1.0/24",
});

// VMSS
const vmss = new azure.compute.VirtualMachineScaleSet("k8s-burst-vmss", {
    resourceGroupName: rg.name,
    vmScaleSetName: "k8s-burst-vmss", // fixed name so the controller can find the scale set
    location,
    sku: { name: "Standard_D2s_v3", capacity: cap },
    virtualMachineProfile: {
        priority: "Spot",
        evictionPolicy: "Delete",
        billingProfile: { maxPrice: 0.05 },
        osProfile: {
            computerNamePrefix: "k8s-burst",
            adminUsername: "azureuser",
            linuxConfiguration: {
                disablePasswordAuthentication: true,
                ssh: { publicKeys: [{ path: "/home/azureuser/.ssh/authorized_keys", keyData: pub }] },
            },
        },
        storageProfile: {
            imageReference: {
                publisher: "Canonical",
                offer: "0001-com-ubuntu-server-jammy",
                sku: "22_04-lts-gen2",
                version: "latest",
            },
        },
        networkProfile: {
            networkInterfaceConfigurations: [{
                name: "burst-nic",
                primary: true,
                ipConfigurations: [{
                    name: "internal",
                    subnet: { id: subnet.id },
                    primary: true,
                }],
            }],
        },
    },
});

export const vmssId = vmss.id;
export const resourceGroupName = rg.name;

Controller Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: azure-scaler-config
data:
  azure-config: |
    resourceGroup: "homelab-burst-rg"
    vmssName: "k8s-burst-vmss"
    location: "eastus2"
    maxPrice: "0.05"  # Spot instance max price
    vmSize: "Standard_D2s_v3"  # 2 vCPU, 8GB RAM
  pulumi-stack: "homelab/azure-burst/prod"  # ConfigMap values must be strings, not nested maps
  scale-down-delay: "30m"  # Grace period before scaling down

Phase 4: Service Mesh and Load Balancing

Cilium Cluster Mesh provides load balancing across clusters. Global services fail over between clusters when the backends in one cluster become unhealthy.

Global Services

Cilium merges services that share a name and namespace across clusters once they carry the global annotation:

apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: webapp
  ports:
  - port: 80
    targetPort: 8080

Cilium automatically:

  • Discovers endpoints in both clusters
  • Load balances across healthy endpoints in both clusters
  • Fails over during outages
  • Preserves session affinity

Topology-Aware Routing

I configured topology-aware routing to prefer nearby endpoints using Kubernetes' native topology hints, which operate on zones within a cluster:

apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  annotations:
    service.cilium.io/global: "true"
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: webapp
  ports:
  - port: 80
    targetPort: 8080

To prefer endpoints in the local cluster and use remote ones only as a fallback, Cilium v1.18 also supports a service affinity annotation:

annotations:
  service.cilium.io/affinity: "local"

In my setup, this reduced cross-cluster traffic by about 90% during normal operations.
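
The effect of local affinity can be modelled in a few lines: use healthy local backends when there are any, and fall back to remote ones otherwise. This is a simplified model of the behaviour, not Cilium's implementation; the Backend type and the addresses are made up for illustration.

```go
package main

import "fmt"

// Backend is a hypothetical view of one service endpoint in the mesh.
type Backend struct {
	Addr    string
	Cluster string
	Healthy bool
}

// pickBackends returns the healthy backends in the local cluster, falling
// back to healthy backends in other clusters only when none are local.
func pickBackends(all []Backend, localCluster string) []Backend {
	var local, remote []Backend
	for _, b := range all {
		if !b.Healthy {
			continue
		}
		if b.Cluster == localCluster {
			local = append(local, b)
		} else {
			remote = append(remote, b)
		}
	}
	if len(local) > 0 {
		return local
	}
	return remote
}

func main() {
	backends := []Backend{
		{"10.0.1.5:8080", "on-prem", true},
		{"10.0.1.6:8080", "on-prem", false},
		{"10.1.1.7:8080", "cloud-burst", true},
	}
	// Normal operation: traffic stays on-prem.
	fmt.Println(pickBackends(backends, "on-prem"))

	// Outage: no healthy local backend, so the cloud endpoint takes over.
	backends[0].Healthy = false
	fmt.Println(pickBackends(backends, "on-prem"))
}
```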

Real-World Performance

After three months of operation, here are the actual metrics observed in SigNoz:

Latency Improvements (observed in my homelab)

  • Intra-cluster: Sub-millisecond (on-prem) vs 1-2ms (cloud)
  • Cross-cluster: 12-18ms (WireGuard encrypted, varies by region)
  • Service discovery: Cilium's eBPF datapath outperforms kube-proxy
  • Failover time: 5-15 seconds for stateless services

Cost Analysis (based on my setup)

  • Compute: Significant savings running on-prem vs equivalent cloud resources
  • Storage: NVMe arrays provide better $/GB than cloud storage
  • Bandwidth: Reduced egress charges by keeping traffic local
  • Burst costs: Spot instances average $10-50 per outage event

Reliability Metrics (from SigNoz monitoring)

  • Availability: Improved from single-cluster deployment
  • RTO: Under 10 minutes for critical services
  • RPO: Near-zero for stateless, minutes for stateful workloads
  • Automatic failovers: Multiple successful tests, zero production failures

Performance Overhead Considerations

WireGuard Encryption Impact (observed in my homelab)

  • CPU overhead: ~5% at sustained 10Gbps; <2% for typical 1Gbps flows
  • Latency addition: <1ms for encryption/decryption
  • MTU considerations: Cilium auto-detects the MTU and accounts for tunnel overhead
  • Throughput: Near line-rate on modern hardware
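
The MTU point is easy to work through by hand. WireGuard wraps each packet in an outer IP header, a UDP header, and 32 bytes of its own framing, so the usable inner MTU shrinks by 60 bytes over IPv4 or 80 bytes over IPv6. The sketch below does that arithmetic; how much headroom Cilium actually reserves is not something I've verified here, so treat the numbers as the protocol's overhead only.

```go
package main

import "fmt"

const (
	ipv4Header      = 20
	ipv6Header      = 40
	udpHeader       = 8
	wireguardHeader = 32 // 16-byte data message header + 16-byte Poly1305 tag
)

// innerMTU returns the largest packet that fits inside a WireGuard tunnel
// running over a link with the given MTU.
func innerMTU(linkMTU int, outerIPv6 bool) int {
	outer := ipv4Header
	if outerIPv6 {
		outer = ipv6Header
	}
	return linkMTU - outer - udpHeader - wireguardHeader
}

func main() {
	fmt.Println(innerMTU(1500, false)) // 1440
	fmt.Println(innerMTU(1500, true))  // 1420
	fmt.Println(innerMTU(9000, false)) // 8940 (jumbo frames)
}
```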

Cilium eBPF Efficiency

  • Datapath: Kernel-space processing avoids context switches
  • Connection tracking: Handled in eBPF maps inside the kernel
  • Memory usage: ~100MB per node for typical workloads

Security Architecture

The multi-layer security approach ensures defense in depth:

1. Network Encryption

All pod-to-pod traffic is encrypted with WireGuard:

  • ChaCha20-Poly1305 encryption
  • Perfect forward secrecy
  • Session keys are rotated automatically by WireGuard's re-handshake (interval isn't user-tunable)
  • Minimal performance penalty on modern CPUs

2. Identity-Based Access

Cilium provides identity-based security policies:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cloud-burst-restrictions
spec:
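  endpointSelector:
    matchLabels:
      # Cilium attaches this cluster label to every endpoint in a mesh
      io.cilium.k8s.policy.cluster: cloud-burst
  ingress:
  - fromEndpoints:
    - matchLabels:
        io.cilium.k8s.policy.cluster: on-prem
        security-clearance: high
  egress:
  - toEndpoints:
    - matchLabels:
        io.cilium.k8s.policy.cluster: on-prem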
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP

3. Management Plane Security

Tailscale provides additional security for management:

  • OAuth/SAML integration
  • Device authorization
  • Access control lists
  • Audit logging

Challenges and Solutions

Challenge 1: Split-Brain Scenarios

Problem: What happens when clusters can't communicate?
Solution: Implemented a witness node in a third location (cheap VPS) for quorum.
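
One way a witness can break ties is to fail over only when a strict majority of observers agree that on-prem is unreachable. The sketch below models that rule. It is an assumption about how such a scheme could work, not a description of my exact setup; the Observation type and site names are hypothetical.

```go
package main

import "fmt"

// Observation is one site's report on whether it can reach on-prem.
type Observation struct {
	Site          string
	ReachesOnPrem bool
}

// onPremDown reports whether a strict majority of the expected observers
// say on-prem is unreachable. Missing reports count for neither side, so a
// silent observer can never push the decision towards failover.
func onPremDown(reports []Observation, expectedSites int) bool {
	down := 0
	for _, r := range reports {
		if !r.ReachesOnPrem {
			down++
		}
	}
	return down*2 > expectedSites
}

func main() {
	// The cloud cluster lost its link, but the witness still sees on-prem:
	// this is a partition, not an outage, so no failover.
	partition := []Observation{{"cloud-burst", false}, {"witness-vps", true}}
	fmt.Println(onPremDown(partition, 2)) // false

	// Both observers lost on-prem: treat it as a real outage.
	outage := []Observation{{"cloud-burst", false}, {"witness-vps", false}}
	fmt.Println(onPremDown(outage, 2)) // true
}
```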

Challenge 2: Stateful Workload Migration

Problem: Databases can't just fail over instantly.
Solution:

  • Continuous replication for critical databases
  • Read replicas in cloud cluster
  • Automated promotion during failover

Challenge 3: Cost Control

Problem: Runaway cloud costs during extended outages.
Solution:

  • Spending limits in cloud account
  • Alerts at 50%, 80%, 100% of budget
  • Automatic scale-down after power restoration
  • Workload prioritization (only critical services fail over)
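
The 50/80/100% alerts boil down to detecting which thresholds the spend has newly crossed since the last check. Here is a minimal sketch using integer cents to avoid floating-point surprises. The function name and the $50 budget are illustrative assumptions, not values from my cloud account.

```go
package main

import "fmt"

// crossed returns the alert thresholds (in percent of budget) that spend
// has newly reached between the previous and the current check.
func crossed(prevCents, currCents, budgetCents int, percents []int) []int {
	var out []int
	for _, p := range percents {
		limit := budgetCents * p // compared against spend*100 to stay in integers
		if prevCents*100 < limit && currCents*100 >= limit {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	thresholds := []int{50, 80, 100}
	budget := 5000 // hypothetical $50.00 monthly budget, in cents

	// Spend jumped from $20.00 to $42.00 during an outage.
	fmt.Println(crossed(2000, 4200, budget, thresholds)) // [50 80]

	// Next check at $43.00: nothing new to report.
	fmt.Println(crossed(4200, 4300, budget, thresholds)) // []
}
```

Comparing against the previous reading means each threshold fires once, rather than on every check after it has been passed.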

Operational Runbook

Testing Failover

I test the system monthly with controlled failures:

# Simulate on-prem failure
# Note: cordon only marks nodes unschedulable; they stay Ready, so this exercises
# scheduling behaviour rather than a real outage. For a blackout test, shut the
# nodes down or power-cycle the PDU.
kubectl --context on-prem cordon cluster-cp-01 cluster-cp-02 cluster-cp-03

# Watch cloud scaling
watch kubectl --context cloud-burst get nodes

# Verify service availability
curl -H "Host: webapp.homelab.example" https://cloud-endpoint.example/health

# Restore on-prem
kubectl --context on-prem uncordon cluster-cp-01 cluster-cp-02 cluster-cp-03

Monitoring with SigNoz

Key metrics tracked in my SigNoz deployment:

  • Cross-cluster latency (P50, P95, P99)
  • WireGuard encryption CPU overhead
  • Cloud resource utilization and spend rate
  • Failover success rate and duration
  • Service availability by cluster
  • Cilium datapath performance metrics
  • Pod migration patterns during outages

Future Enhancements

1. Multi-Cloud Support

Extend beyond single cloud provider:

  • AWS for compute
  • GCP for GPU workloads
  • Azure for geographic diversity

2. Advanced Scheduling

Implement cost-aware scheduling:

  • Run batch jobs in cloud during off-peak
  • Prefer on-prem for storage-intensive workloads
  • Time-based scaling policies

3. Enhanced Observability

Deploy distributed tracing across clusters:

  • OpenTelemetry collectors in each cluster
  • Unified traces across cluster boundaries
  • Latency attribution by cluster

Key Takeaways

  1. Hybrid is the future: Pure cloud or pure on-prem both have limitations
  2. Cilium Cluster Mesh is production-ready: Handles complex networking seamlessly
  3. WireGuard "just works": Enterprise-grade encryption with zero configuration
  4. Tailscale simplifies operations: Secure access without VPN complexity
  5. Auto-scaling requires careful design: Too aggressive = high costs, too conservative = outages
  6. Test failover regularly: The worst time to find bugs is during a real outage

Conclusion

Building a hybrid cloud mesh transformed my homelab from a hobby project into a production-grade platform. The combination of Cilium's advanced networking, WireGuard's bulletproof encryption, and Tailscale's operational simplicity created a system that's both powerful and maintainable.

The best part? When the power goes out now, I get a Discord notification that workloads have migrated to the cloud, and my services keep running. By the time power is restored, everything has automatically migrated back.

Total additional cost for this resilience? About $16 per outage on average.
