Part 8: Security Hardening and Production Readiness

Kubernetes is not secure by default. This final article implements defense in depth and explains the security concepts behind each layer: Kyverno policies, Tetragon runtime security, CrowdSec threat detection, and Velero disaster recovery. These layers saved my cluster from real attacks and can help protect yours too.

Understanding Kubernetes Security

What is Defense in Depth?

Defense in depth is like having multiple locks on your door, a security system, and a guard dog. If one security layer fails, others are there to protect you. In Kubernetes, no single security tool can protect against all threats.

Why Default Kubernetes Security Isn't Enough:

  • Containers run as root unless the image or pod spec sets another user
  • No network traffic restrictions: every pod can reach every other pod
  • No runtime threat detection
  • Minimal admission controls
  • No built-in backup or recovery

Defense in Depth Strategy

Security isn't a single tool - it's layers of protection:

  1. Admission Control (Kyverno) - Prevent bad configurations before pods start
  2. Runtime Security (Tetragon) - Detect and block threats while pods are running
  3. Threat Intelligence (CrowdSec) - Community-driven protection against known attacks
  4. Network Policies - Zero-trust networking (deny all, allow only what's needed)
  5. Disaster Recovery (Velero) - When all else fails, restore from backups

Kyverno Policy Engine

What is Kyverno?
Kyverno is a policy engine that validates, mutates, and generates Kubernetes resources. It acts as a gatekeeper - every time someone tries to create or modify a resource, Kyverno checks if it's allowed by your policies.

Key Kyverno Concepts:

  • Validate: Check if resources meet security requirements (reject if they don't)
  • Mutate: Automatically fix resources (add missing security settings)
  • Generate: Create additional resources (like network policies)
  • Background: Check existing resources, not just new ones

Installing Kyverno

# kyverno-values.yaml
# Webhook configuration - Kyverno defaults to Fail for strict security

# Admission controller (validates and mutates resources)
admissionController:
  replicas: 3  # Run 3 copies for high availability (admission control is critical)
  metricsService:
    create: true
    type: ClusterIP
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  container:
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      readOnlyRootFilesystem: true

# Background controller (scans existing resources for policy violations)
backgroundController:
  enabled: true  # Check existing resources, not just new ones
  replicas: 2
  metricsService:
    create: true
    type: ClusterIP
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

# Cleanup controller (removes old resources)
cleanupController:
  replicas: 2
  metricsService:
    create: true
    type: ClusterIP

# Reports controller (generates policy reports)
reportsController:
  replicas: 2
  metricsService:
    create: true
    type: ClusterIP

# Policy reports cleanup (chart-specific feature)
# Note: policyReportsCleanup exists in some downstream charts but not upstream
# For portable cleanup, use Kyverno CleanupPolicy CRDs instead

Install Kyverno:

# Create dedicated namespace for Kyverno
kubectl create namespace kyverno

# Add Kyverno Helm repository
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update

# Install Kyverno with our configuration
KYVERNO_CHART_VERSION=$(helm search repo kyverno/kyverno --versions | awk 'NR==2 {print $2}')
helm show chart kyverno/kyverno --version "$KYVERNO_CHART_VERSION" | grep appVersion
# appVersion: 1.15.1  ← confirm you're installing the latest Kyverno release
helm install kyverno kyverno/kyverno \
  --namespace kyverno \
  --version "$KYVERNO_CHART_VERSION" \
  --values kyverno-values.yaml \
  --wait                         # Wait for deployment to complete

# Expected output:
# NAME: kyverno
# NAMESPACE: kyverno
# STATUS: deployed
# REVISION: 1

# Verify admission webhook is registered
kubectl get validatingwebhookconfigurations | grep kyverno

# Expected output:
# kyverno-policy-validating-webhook-cfg
# kyverno-resource-validating-webhook-cfg

# Check Kyverno pods are running
kubectl -n kyverno get pods

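# Expected output (name suffixes will differ; each controller is its own Deployment):
# NAME                                     READY   STATUS    RESTARTS   AGE
# kyverno-admission-controller-<hash>      1/1     Running   0          2m
# kyverno-background-controller-<hash>     1/1     Running   0          2m
# kyverno-cleanup-controller-<hash>        1/1     Running   0          2m
# kyverno-reports-controller-<hash>        1/1     Running   0          2m
# With the values above: 3 admission-controller pods and 2 of each other controller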

Day-2 Operations: Managing Kyverno

# Check policy violation reports
kubectl get polr -A  # Policy Reports across all namespaces
# or use the full resource name for portability:
kubectl get policyreports.wgpolicyk8s.io -A

# If OpenReports is enabled (Kyverno 1.15+):
kubectl get policyreports.openreports.io -A

# View specific policy violations
kubectl get polr -n default -o yaml

# Update Kyverno safely
KYVERNO_CHART_VERSION=$(helm search repo kyverno/kyverno --versions | awk 'NR==2 {print $2}')
helm upgrade kyverno kyverno/kyverno \
  --namespace kyverno \
  --values kyverno-values.yaml \
  --version "$KYVERNO_CHART_VERSION"

# Monitor Kyverno performance
kubectl top pods -n kyverno

# Troubleshoot webhook issues
kubectl describe validatingwebhookconfigurations kyverno-policy-validating-webhook-cfg

# Emergency recovery: if fail-closed webhooks brick the API, delete webhook configs
# kubectl delete validatingwebhookconfigurations kyverno-policy-validating-webhook-cfg
# kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg
# kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg
# Then fix/exclude problematic namespaces and redeploy Kyverno

Essential Security Policies

Why These Policies Matter:
By default, Kubernetes allows dangerous configurations. These policies enforce security best practices automatically.

# require-non-root.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy  # Applies cluster-wide
metadata:
  name: require-non-root
spec:
  validationFailureAction: Enforce  # Block pods that violate this policy
  background: true                   # Also check existing pods
  rules:
    - name: check-non-root
      match:  # Apply to these resources
        any:
        - resources:
            kinds:
            - Pod  # Check all pods
      exclude:  # Skip these namespaces (system components need root)
        any:
        - resources:
            namespaces:
            - kube-system  # Kubernetes system components
            - kyverno      # Kyverno itself
            - rook-ceph    # Ceph storage system
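      validate:
        message: "Containers must run as non-root user"
        # Note: =(field) is a conditional anchor. The check only applies when the
        # field is present, so a pod that omits securityContext entirely passes.
        # The add-security-context mutation policy later in this article closes that gap.
        pattern:  # Required configuration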
          spec:
            =(securityContext):      # Pod-level security context
              runAsNonRoot: true   # Don't run as root (user ID 0)
            =(initContainers):
            - =(securityContext):    # Init container security context
                runAsNonRoot: true
            =(ephemeralContainers):
            - =(securityContext):    # Ephemeral container security context
                runAsNonRoot: true
            containers:
            - =(securityContext):    # Container-level security context
                runAsNonRoot: true
---
# disallow-privileged.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: check-privileged
      match:
        any:
        - resources:
            kinds:
            - Pod
      validate:
        message: "Privileged containers are not allowed"
        pattern:
          spec:
            =(securityContext):
              =(privileged): "!true"  # "!true" means "must not be true"
            =(initContainers):
            - =(securityContext):
                =(privileged): "!true"  # Block privileged init containers
            =(ephemeralContainers):
            - =(securityContext):
                =(privileged): "!true"  # Block privileged ephemeral containers
            containers:
            - =(securityContext):
                =(privileged): "!true"  # Block privileged containers

# Why this matters: Privileged containers can access the host kernel
# and potentially break out of the container to attack the host
---
# require-resource-limits.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pod-resources
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-resources
      match:
        any:
        - resources:
            kinds:
            - Pod
      exclude:
        any:
        - resources:
            namespaces:
            - kube-system  # System pods have different requirements
            - kyverno
      validate:
        message: "Resource requests and limits are required"
        pattern:
          spec:
            =(initContainers):
            - name: "*"
              resources:
                requests: { memory: "?*", cpu: "?*" }
                limits:   { memory: "?*", cpu: "?*" }
            =(ephemeralContainers):
            - name: "*"
              resources:
                requests: { memory: "?*", cpu: "?*" }
                limits:   { memory: "?*", cpu: "?*" }
            containers:
            - name: "*"
              resources:
                requests:    # Minimum guaranteed resources
                  memory: "?*"  # Must specify memory request
                  cpu: "?*"     # Must specify CPU request
                limits:      # Maximum allowed resources
                  memory: "?*"  # Must specify memory limit
                  cpu: "?*"     # Must specify CPU limit

# Why this matters: Without limits, one pod can consume all resources
# and crash other pods ("noisy neighbor" problem)
---
# restrict-image-registries.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: false  # Only check new pods (existing ones might be from old registries)
  rules:
    - name: validate-registries
      match:
        any:
        - resources:
            kinds:
            - Pod
      validate:
        message: "Images must come from approved registries"
        pattern:
          spec:
            containers:  # Regular containers
            - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.gitlab.com/your-org/*"
            =(initContainers):  # Init containers (run before main containers)
            - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.gitlab.com/your-org/*"
            =(ephemeralContainers):  # Ephemeral containers (debugging)
            - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.gitlab.com/your-org/*"

# Why this matters: Restricts images to trusted registries only
# Prevents using images from unknown sources that might contain malware
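
Before trusting these policies, it helps to confirm that the API server actually rejects a bad pod. A minimal sketch, assuming the policies above are applied and using a server-side dry run so nothing is persisted; expect_denied is a hypothetical helper and example.com/untrusted/app is a placeholder image:

```shell
# expect_denied runs a command and succeeds only if that command fails,
# which is how an admission rejection shows up in kubectl's exit status.
expect_denied() {
  if out=$("$@" 2>&1); then
    echo "NOT BLOCKED: $*" >&2
    return 1
  fi
  echo "blocked as expected: $out"
}

# Example calls (server-side dry run still goes through admission webhooks):
# expect_denied kubectl -n default run priv-test \
#   --image=docker.io/library/nginx --privileged --dry-run=server
# expect_denied kubectl -n default run registry-test \
#   --image=example.com/untrusted/app --dry-run=server
```

Any failure counts as "blocked" here, including a missing kubeconfig, so read the printed message to confirm which policy produced the rejection.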

Mutation Policies

What are Mutation Policies?
Mutation policies automatically add security settings to pods if they're missing. Instead of rejecting pods, they fix them automatically.

# add-security-context.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-security-context
spec:
  background: false          # Only apply to new pods
  rules:
    - name: add-security-context
      match:
        any:
        - resources:
            kinds:
            - Pod
      mutate:  # Automatically add these security settings
        patchStrategicMerge:
          spec:
            securityContext:  # Pod-level security settings
              runAsNonRoot: true           # Don't run as root
              runAsUser: 1000             # Run as user ID 1000
              fsGroup: 2000               # File system group ID
              seccompProfile:             # Secure computing mode
                type: RuntimeDefault      # Use default seccomp profile
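            containers:
            - (name): "*"  # Apply to all containers
              securityContext:
                runAsNonRoot: true               # Needed so require-non-root accepts the mutated container context
                allowPrivilegeEscalation: false  # Block setuid binaries and similar privilege gains
                readOnlyRootFilesystem: true     # Root filesystem is read-only; mount volumes for writable paths
                capabilities:                    # Linux capabilities
                  drop:
                  - ALL  # Drop every capability; add back only what the app needs

# This policy hardens every new pod. Plain keys overwrite values the pod already sets;
# use the +(key) anchor, for example +(runAsUser): 1000, to add a setting only when it is missing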
---
# enforce-security-context.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: enforce-security-context
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: enforce-container-context
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Containers must follow hardened security context"
        pattern:
          spec:
            securityContext:
              seccompProfile:
                type: RuntimeDefault
            =(initContainers):
              - =(securityContext):
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
                  capabilities:
                    drop: ["ALL"]
            =(ephemeralContainers):
              - =(securityContext):
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
                  capabilities:
                    drop: ["ALL"]
            containers:
              - =(securityContext):
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
                  capabilities:
                    drop: ["ALL"]

# This validation policy ensures security context can't be bypassed
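
One way to see what the mutation does is to ask the API server for the object it would store. A minimal sketch, assuming the policies above are installed and the default namespace is not excluded; mutated_run_as_non_root is a hypothetical helper and the pod is never created:

```shell
# Submit a bare pod as a server-side dry run and print the pod-level
# runAsNonRoot value after admission. The manifest sets no securityContext,
# so any value printed here was added by a mutation policy.
mutated_run_as_non_root() {
  kubectl apply --dry-run=server \
    -o jsonpath='{.spec.securityContext.runAsNonRoot}' -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mutate-test
  namespace: default
spec:
  containers:
  - name: app
    image: docker.io/library/busybox
    command: ["sleep", "3600"]
    resources:
      requests: { cpu: 10m, memory: 16Mi }
      limits:   { cpu: 50m, memory: 32Mi }
EOF
}

# Usage: mutated_run_as_non_root
```

If a validation policy rejects the mutated pod, kubectl prints the denial message instead, which is also useful feedback on how the policies interact.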

Tetragon Runtime Security

What is Tetragon?
Tetragon is a runtime security tool that uses eBPF (extended Berkeley Packet Filter) to monitor system calls and network activity in real-time. It can detect and automatically respond to threats as they happen.

How eBPF Works:
eBPF runs sandboxed programs directly in the Linux kernel, giving deep visibility into system activity with low overhead. Think of it as having a security camera that watches every action in your system.

Why Tetragon is Powerful:

  • Zero-day protection: Detects unknown threats by behavior
  • Real-time response: Can kill malicious processes immediately
  • No signature updates: Uses behavior analysis, not signatures
  • Kernel-level visibility: Observes syscalls and kernel functions, so it is hard to evade from inside a container

Deploy Tetragon

# tetragon-values.yaml
tetragon:
  enabled: true
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m     # Max 2 CPU cores
      memory: 1Gi    # Max 1 GiB RAM

  # eBPF program settings
  btf: /sys/kernel/btf/vmlinux  # Kernel BTF (BPF Type Format) for type info
  debug: false                 # Set to true temporarily when investigating policies
  exportRateLimit: 1000        # Events per minute; -1 means unlimited

  # Prometheus metrics (Tetragon runs on hostNetwork for full visibility)
  prometheus:
    enabled: true
    port: 2112

# Export settings (top-level in chart, not under tetragon:)
export:
  mode: "stdout"        # Valid options: "stdout" or use defaults
  filenames: ["tetragon.log"]

tetragonOperator:  # Note: chart uses "tetragonOperator" not "operator"
  enabled: true
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

# Note: TracingPolicies are applied as separate manifests, not via Helm values
# The chart doesn't support a tracingPolicies: values key
# Apply TracingPolicy CRs separately or GitOps them with ArgoCD

Install Tetragon:

# Create dedicated namespace for Tetragon
kubectl create namespace tetragon

# Add Cilium Helm repository (Tetragon is part of Cilium project)
helm repo add cilium https://helm.cilium.io
helm repo update

# Install Tetragon with our security policies
TETRAGON_CHART_VERSION=$(helm search repo cilium/tetragon --versions | awk 'NR==2 {print $2}')
helm show chart cilium/tetragon --version "$TETRAGON_CHART_VERSION" | grep appVersion
# appVersion: 1.5.0  ← confirm appVersion matches the chart you install
# Install into the tetragon namespace, pinned to the chart version above, with our values file.
# (Comments cannot follow a trailing backslash, so they live up here.)
helm install tetragon cilium/tetragon \
  --namespace tetragon \
  --version "$TETRAGON_CHART_VERSION" \
  --values tetragon-values.yaml

# Verify Tetragon pods are running
kubectl -n tetragon get pods
# Should see tetragon pods running on each node

# Check if eBPF programs are loaded
kubectl -n tetragon logs -l app.kubernetes.io/name=tetragon --tail=10
# Look for messages about eBPF programs being loaded

Apply TracingPolicies Separately:

TracingPolicies define what Tetragon monitors and must be applied as separate manifests:

# Apply the runtime security policies (shown in next section)
kubectl apply -f runtime-security-policies.yaml

Runtime Security Policies

Real-Time Threat Response:
These policies detect and automatically respond to threats as they happen. No waiting for signature updates or manual intervention.

# runtime-security-policies.yaml
# File monitoring policy
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: file-monitoring
spec:
  kprobes:
  - call: "sys_openat"
    syscall: true
    args:
    - index: 1
      type: "string"
    selectors:
    - matchArgs:
      - index: 1
        operator: "Prefix"
        values:
        - "/etc/"
        - "/root/"
        - "/home/"
---
# Network monitoring policy
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: network-monitoring
spec:
  kprobes:
  - call: "tcp_connect"
    syscall: false
    args:
    - index: 0
      type: "sock"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotDAddr"
        values:
        - "10.0.0.0/8"
        - "172.16.0.0/12"
        - "192.168.0.0/16"
        - "127.0.0.0/8"
        - "fc00::/7"
---
# Privilege escalation detection
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-privilege-escalation
spec:
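  kprobes:
    # selectors are defined per kprobe entry, so each call carries its own copy
    - call: "__x64_sys_setuid"
      syscall: true
      args: [{ index: 0, type: "int" }]
      selectors:
        - matchArgs: [{ index: 0, operator: "Equal", values: ["0"] }]
          matchActions:
            - action: Override
              argError: -1   # EPERM
            - action: Sigkill
    - call: "__x64_sys_setreuid"
      syscall: true
      args: [{ index: 0, type: "int" }]
      selectors:
        - matchArgs: [{ index: 0, operator: "Equal", values: ["0"] }]
          matchActions:
            - action: Override
              argError: -1   # EPERM
            - action: Sigkill
    - call: "__x64_sys_setresuid"
      syscall: true
      args: [{ index: 0, type: "int" }]
      selectors:
        - matchArgs: [{ index: 0, operator: "Equal", values: ["0"] }]
          matchActions:
            - action: Override
              argError: -1   # EPERM
            - action: Sigkill

# This detects and blocks any process that calls these syscalls with UID 0.
# Override needs a kernel built with CONFIG_BPF_KPROBE_OVERRIDE, and a cluster-wide
# TracingPolicy also covers host processes such as sudo, so test it outside production first.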
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-crypto-mining
spec:
  kprobes:
  - call: "__x64_sys_execve"  # Monitor process execution
    syscall: true
    args:
    - index: 0               # Path of the program being executed
      type: "string"
    selectors:
    - matchArgs:
      - index: 0             # Check program path
        operator: "Postfix"   # Match the end of the path (e.g. /usr/bin/xmrig)
        values:              # Known crypto mining programs
        - "xmrig"           # Popular Monero miner
        - "minerd"          # CPU miner
        - "minergate"       # Mining pool software
      matchActions:          # Belongs inside the selector, next to matchArgs
      - action: Sigkill      # Kill crypto miners immediately

# This stops crypto mining attacks that steal your compute resources
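
After applying the policies, it is worth watching the event stream while you trigger one of them on purpose. A minimal sketch, assuming Tetragon was installed into the tetragon namespace as above and that the DaemonSet and its container are both named tetragon (chart defaults); tetragon_events is a hypothetical helper:

```shell
# Stream Tetragon events in compact form from the agent DaemonSet.
# Extra arguments are passed straight through to "tetra getevents".
tetragon_events() {
  kubectl -n tetragon exec ds/tetragon -c tetragon -- \
    tetra getevents -o compact "$@"
}

# Usage:
#   kubectl get tracingpolicies       # confirm the policies were accepted
#   tetragon_events --pods my-app     # my-app is a placeholder pod name prefix
#   # then, in another shell, read a file under /etc/ inside that pod
```

Note that ds/tetragon resolves to a single agent pod, so events from workloads on other nodes will not appear in this stream.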

CrowdSec Threat Detection

What is CrowdSec?
CrowdSec is a community-driven security platform that shares threat intelligence globally. When it detects an attack on your cluster, it shares the attacker's IP with other CrowdSec users, creating a collaborative defense network.

How CrowdSec Works:

  1. Agents analyze logs for attack patterns
  2. Scenarios define what constitutes an attack
  3. Decisions are made to block attackers
  4. Bouncers enforce decisions (block IPs, etc.)
  5. Community shares threat intelligence globally
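
To make the scenario-to-decision step concrete, here is a toy version in shell. It is an illustration only, not how CrowdSec is implemented: real scenarios use leaky buckets with time windows, and the input format, the threshold, and the name toy_decisions are all assumptions made for this sketch:

```shell
# Toy "scenario": read lines of "<ip> <http-status>" on stdin and print a
# ban decision for every IP with at least $1 responses of status 401 or 403.
toy_decisions() {
  threshold="$1"
  awk -v t="$threshold" '
    $2 == 401 || $2 == 403 { hits[$1]++ }
    END { for (ip in hits) if (hits[ip] >= t) print "ban", ip, hits[ip] }
  ' | sort
}

# Three failed requests from one address, one normal request from another:
printf '%s\n' \
  "203.0.113.7 401" "203.0.113.7 401" "203.0.113.7 403" "198.51.100.2 200" |
  toy_decisions 3
# prints: ban 203.0.113.7 3
```

In CrowdSec itself, the parsing is done by parsers from the installed collections, and the resulting decision is stored in the Local API for bouncers to pull.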

Deploy CrowdSec

# crowdsec-values.yaml
agent:  # Analyzes logs for threats
  acquisition:  # Where to get log data
    - source: kinesis     # AWS Kinesis stream (if using)
      stream_name: kubernetes-logs
      labels:
        type: nginx       # Web server logs
    - source: file        # Read log files directly
      filenames:
        - /var/log/pods/*/*/*.log  # All pod logs
      labels:
        type: syslog      # System logs

  # Parsers and scenarios (attack detection rules)
  collections:  # Pre-built detection rules from CrowdSec community
    - crowdsecurity/nginx      # Web server attack detection
    - crowdsecurity/sshd       # SSH brute force detection
    - crowdsecurity/linux      # Linux system attack detection
    - crowdsecurity/http-cve   # HTTP vulnerability exploitation

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

lapi:  # Local API (processes decisions and shares with community)
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # Decisions stream (what to do with detected threats)
  bouncer:
    enabled: true
    # Don't use ${VAR} - pass key via Helm: --set lapi.bouncer.key="$YOUR_BOUNCER_KEY"
    # Or consider mTLS between Traefik and LAPI to avoid shared keys

# Metrics
metrics:
  enabled: true
  serviceMonitor:
    enabled: true

Deploy CrowdSec with bouncers:

# Add CrowdSec Helm repository
helm repo add crowdsec https://crowdsecurity.github.io/helm-charts
helm repo update

# Install CrowdSec with threat detection rules into its own namespace
# (created if missing). The bouncer key is passed at install time rather than
# stored in the values file. Comments cannot follow a trailing backslash.
helm install crowdsec crowdsec/crowdsec \
  --namespace crowdsec \
  --create-namespace \
  --values crowdsec-values.yaml \
  --set lapi.bouncer.enabled=true \
  --set lapi.bouncer.key="$YOUR_BOUNCER_KEY"

# Install Traefik bouncer (blocks IPs at ingress level)
# Note: Don't use Helm templating ({{ .Values... }}) in kubectl apply - it won't be rendered
# Option A: Use literal key in kubectl (less secure)
kubectl apply -f - <<EOF
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: crowdsec-bouncer
  namespace: traefik-system
spec:
  plugin:
    bouncer:  # Must match your Traefik static plugin ID (e.g., --experimental.plugins.bouncer.moduleName=...)
      crowdsecLapiScheme: http
      crowdsecLapiHost: crowdsec-lapi.crowdsec.svc.cluster.local:8080
      crowdsecLapiKey: "YOUR_LITERAL_BOUNCER_KEY_HERE"  # Replace with actual key
EOF

# Option B: Manage via Helm chart so templating works (recommended)
# Add to your Traefik values and upgrade the chart:
# middleware:
#   crowdsecBouncer:
#     crowdsecLapiKey: "{{ .Values.crowdsec.bouncerKey }}"

# Verify CrowdSec is running
kubectl -n crowdsec get pods
# Should see crowdsec agent and lapi pods running
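
Running pods do not prove that logs are being parsed or that the bouncer is connected. A minimal sketch for checking both with cscli, assuming the chart created a Deployment named crowdsec-lapi in the crowdsec namespace (adjust if your release uses different names); cs is a hypothetical wrapper:

```shell
# Run a cscli subcommand inside the LAPI pod.
cs() {
  kubectl -n crowdsec exec deploy/crowdsec-lapi -- cscli "$@"
}

# Usage:
#   cs bouncers list     # is the Traefik bouncer registered?
#   cs decisions list    # current bans
#   cs metrics           # are log lines being read and parsed?
```

If the bouncer list stays empty, the Traefik middleware is probably using a key the Local API does not know about.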

Cilium Network Policies

Zero-Trust Networking with Cilium

Since we're running Cilium as our CNI, we'll use CiliumNetworkPolicy (CNP), which offers finer-grained controls than the standard NetworkPolicy API, including L7 filtering, DNS-aware policies, and better observability.

# default-deny-all.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  endpointSelector: {}
  ingress:
  - {}          # empty rule: puts all endpoints in ingress default-deny, allows nothing
  egress:
  - {}          # empty rule: same for egress
---
# allow-dns.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  endpointSelector: {}
  egress:
  - toEndpoints:
    - matchLabels:
        k8s-app: kube-dns
        k8s:io.kubernetes.pod.namespace: kube-system
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      - port: "53"
        protocol: TCP
      rules:
        dns:
        - matchPattern: "*"  # Or restrict to specific domains
---
# allow-from-ingress.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: web
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: traefik
        k8s:io.kubernetes.pod.namespace: traefik-system
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/"
        - method: "POST"
          path: "/api/.*"
        # Note: This will deny all other paths/methods (healthz, static assets, OPTIONS, etc)
        # Add more rules or a catch-all if needed for your application

Cilium-Specific Security Features:

# advanced-l7-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-security
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "443"  # HTTP rules can only inspect plaintext; they will not match TLS-encrypted traffic on this port
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/users"
          headerMatches:
          - name: "Authorization"
            regex: "^Bearer\\s+.+$"
        - method: "POST"
          path: "/api/v1/users"
          headerMatches:
          - name: "Content-Type"
            value: "application/json"
---
# dns-based-egress.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-external-apis
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  egress:
  - toFQDNs:
    - matchName: "api.github.com"
    - matchPattern: "*.googleapis.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP

Important Notes for Operations:

  • L7 HTTP Rules: The policies above will deny any paths/methods not explicitly listed. For example, /healthz, static assets, or OPTIONS requests will be blocked unless added to the rules. Consider adding a catch-all rule if your application needs broader access.
  • hostNetwork Pods: CiliumNetworkPolicies (like all CNI policies) do not affect hostNetwork pods. If you need to quarantine or restrict hostNetwork pods (like some monitoring agents), you'll need node-level controls such as:
      • Node taints and tolerations for isolation
      • Host firewall rules (iptables/nftables)
      • Separate node pools for hostNetwork workloads
      • Runtime security policies via Tetragon at the syscall level
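
When a policy blocks more than intended, the dropped flows usually show which rule is missing. A minimal sketch, assuming Hubble is enabled in your Cilium install and reachable by the hubble CLI (for example through a port-forward to Hubble Relay); dropped_flows is a hypothetical helper:

```shell
# Show the most recent dropped flows for one namespace.
# $1 = namespace (default: production), $2 = number of flows (default: 20)
dropped_flows() {
  hubble observe --namespace "${1:-production}" --verdict DROPPED --last "${2:-20}"
}

# Usage: dropped_flows production 50
```

Running this right after applying default-deny is a quick way to find legitimate traffic, such as health checks, that still needs an allow rule.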

Velero Backup and Disaster Recovery

Install Velero

# velero-values.yaml
configuration:
  # Recent chart versions expect lists here, with the provider set per location
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: homelab-velero-backups
      config:
        region: us-east-1
        s3ForcePathStyle: true
        s3Url: https://s3.homelab.example  # MinIO or S3

  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: us-east-1

# Note: Don't use ${VAR} expansion in Kubernetes YAML - use one of these approaches:
# A) Pre-create the secret and reference it with existingSecret (as configured here):
#    kubectl -n velero create secret generic velero-credentials \
#      --from-file=cloud=./credentials-velero
# B) Or let the chart create the secret from a local file at install time:
#    --set-file credentials.secretContents.cloud=./credentials-velero
credentials:
  useSecret: true
  existingSecret: velero-credentials

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.13.0
    volumeMounts:
    - mountPath: /target
      name: plugins

# Note: CSI plugin removed - integrated into Velero core since v1.14

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi

# Node Agent for file backup (Restic deprecated in v1.15+)
deployNodeAgent: true
nodeAgent:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

Install Velero:

# Create namespace
kubectl create namespace velero

# Install with Helm
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

VELERO_CHART_VERSION=$(helm search repo vmware-tanzu/velero --versions | awk 'NR==2 {print $2}')
helm show chart vmware-tanzu/velero --version "$VELERO_CHART_VERSION" | grep appVersion
# appVersion: v1.17.0 ← confirm before install
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --version "$VELERO_CHART_VERSION" \
  --values velero-values.yaml

Backup Schedules

# backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - production
    - databases
    - argocd
    - signoz
    excludedResources:
    - events
    - events.events.k8s.io
    labelSelector:
      matchLabels:
        backup: "true"
    ttl: 720h  # Keep for 30 days
    storageLocation: default
    volumeSnapshotLocations:
    - default
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-database-backup
  namespace: velero
spec:
  schedule: "0 * * * *"  # Every hour
  template:
    includedNamespaces:
    - databases
    ttl: 168h  # Keep for 7 days
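
The schedule and TTL together decide how many backups pile up in the bucket, which is worth estimating before the first bill or a full disk. A small sketch of that arithmetic; retained_backups is a hypothetical helper and assumes whole-hour intervals and TTLs:

```shell
# Estimate how many backups a schedule keeps at steady state:
# TTL divided by the interval between runs (both in whole hours).
retained_backups() {
  interval_hours="$1"
  ttl_hours="${2%h}"      # strip the trailing "h" from a TTL like "720h"
  echo $(( ttl_hours / interval_hours ))
}

retained_backups 24 720h   # daily-backup: prints 30
retained_backups 1 168h    # hourly-database-backup: prints 168
```

So the hourly database schedule keeps far more objects than the daily one, even though its retention window is shorter.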

Disaster Recovery Testing

Why Test Disaster Recovery?
Backups are worthless if you can't restore from them. Regular testing ensures your backup strategy actually works when disaster strikes.

# Create test backup of production namespace and block until it finishes
velero backup create test-backup --include-namespaces production --wait

# Inspect the result
velero backup describe test-backup --details
# Phase should show "Completed" with no errors or warnings

# Simulate disaster - delete entire namespace
echo "DANGER: This will delete the production namespace!"
kubectl delete namespace production

# Verify everything is gone
kubectl get all -n production
# Should show "No resources found in production namespace"

# Restore from backup
velero restore create test-restore --from-backup test-backup

# Monitor restore progress
velero restore describe test-restore

# Verify restoration worked
kubectl get all -n production
# Should see all your applications back and running

# Check application functionality
kubectl -n production get pods
# All pods should be Running

# Test application endpoints
curl https://your-app.homelab.example
# Should respond normally
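
Deleting the live namespace is the most realistic test, but a less risky drill is to restore the backup into a scratch namespace next to production. A minimal sketch, assuming the test-backup created above; restore_drill and the production-restore-test namespace name are hypothetical:

```shell
# Restore a backup of "production" into a separate namespace so the live
# workloads are untouched. $1 = backup name, $2 = target namespace.
restore_drill() {
  backup="$1"
  target="${2:-production-restore-test}"
  velero restore create "drill-$backup-$(date +%Y%m%d%H%M%S)" \
    --from-backup "$backup" \
    --namespace-mappings "production:$target" \
    --wait
}

# Usage:
#   restore_drill test-backup
#   kubectl -n production-restore-test get pods
#   kubectl delete namespace production-restore-test   # clean up afterwards
```

Resources that reference fixed hostnames or cluster-scoped objects may still collide with the originals, so check ingress routes before pointing traffic at the restored copy.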

Pod Security Standards

Configure Pod Security

What are Pod Security Standards?
Kubernetes built-in security policies that enforce different security levels:

  • Privileged: No restrictions (dangerous)
  • Baseline: Minimal restrictions (better than nothing)
  • Restricted: Strict security (recommended for production)

# pod-security-standards.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted  # Block non-compliant pods
    pod-security.kubernetes.io/enforce-version: v1.31  # Pin to your cluster minor version or 'latest'
    pod-security.kubernetes.io/audit: restricted    # Log policy violations
    pod-security.kubernetes.io/audit-version: v1.31    # Pin audit version
    pod-security.kubernetes.io/warn: restricted     # Show warnings to users
    pod-security.kubernetes.io/warn-version: v1.31     # Pin warn version
---
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    pod-security.kubernetes.io/enforce: baseline    # More relaxed for dev
    pod-security.kubernetes.io/enforce-version: v1.31
    pod-security.kubernetes.io/audit: restricted    # But still audit violations
    pod-security.kubernetes.io/audit-version: v1.31
    pod-security.kubernetes.io/warn: restricted     # And warn developers
    pod-security.kubernetes.io/warn-version: v1.31
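
Before switching a namespace to a stricter level, you can ask the API server which existing pods would violate it. A minimal sketch using a server-side dry run of the label change, so nothing is modified; pss_preview is a hypothetical helper:

```shell
# Preview the effect of enforcing a Pod Security level on a namespace.
# $1 = namespace, $2 = level (default: restricted)
pss_preview() {
  ns="$1"
  level="${2:-restricted}"
  kubectl label --dry-run=server --overwrite namespace "$ns" \
    "pod-security.kubernetes.io/enforce=$level"
}

# Usage: pss_preview production restricted
```

kubectl prints a warning for each existing pod that would not be allowed under the new level, which gives you a fix list before the label goes live.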

Secrets Scanning

Prevent Secret Leaks

# secrets-not-from-env-vars.yaml (official Kyverno policy)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: secrets-not-from-env-vars
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: secrets-not-from-env-vars
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Secrets must be mounted as volumes, not as environment variables."
      pattern:
        spec:
          containers:
          - name: "*"
            =(env):
            - =(valueFrom):
                X(secretKeyRef): "null"
  - name: secrets-not-from-envfrom
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Secrets must not come from envFrom statements."
      pattern:
        spec:
          containers:
          - name: "*"
            =(envFrom):
            - X(secretRef): "null"
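
For contrast, here is a sketch of a Pod the policy above accepts: the Secret is mounted as a read-only volume instead of being injected through env or envFrom. The pod name, image, Secret name, and mount path are placeholders. No namespace is set, so it lands in your current one, and other policies in your cluster (Pod Security Standards, resource limits) may still reject it.

```shell
#!/bin/sh
# Emit a Pod manifest that reads a Secret from a mounted volume.
# "app", "registry.example/app:1.0.0" and "db-credentials" are placeholders.
compliant_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example/app:1.0.0
    volumeMounts:
    - name: db-credentials
      mountPath: /etc/secrets/db   # each Secret key becomes a file here
      readOnly: true
  volumes:
  - name: db-credentials
    secret:
      secretName: db-credentials
EOF
}

# Usage (needs cluster access); a server-side dry run goes through admission
# without creating the Pod:
#   compliant_pod | kubectl apply --dry-run=server -f -
```

The application then reads files such as /etc/secrets/db/password rather than environment variables, which keeps secret values out of the process environment.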

Security Monitoring

Key Metrics

Security Metrics to Monitor:
These metrics help you understand your security posture and detect issues.

# Kyverno admission latency (histogram)
histogram_quantile(0.95, sum by(le) (rate(kyverno_admission_review_duration_seconds_bucket[5m])))

# Count of failed rule results (enforced + background)
sum(rate(kyverno_policy_results_total{rule_result="fail"}[5m]))
# Alert if this spikes - indicates many policy violations

# Tetragon total events exported
sum(rate(tetragon_events_exported_total[5m]))
# Monitor overall event volume

# Tetragon BPF missed events (health indicator)
sum(rate(tetragon_bpf_missed_events_total[5m]))
# Alert if > 0 - indicates system overload

# CrowdSec active decisions (current active bans)
sum(cs_active_decisions{action="ban"})
# Monitor current blocked threats
# Note: cs_active_decisions only appears when there are active decisions

# Velero backup success rate (disaster recovery readiness)
sum(rate(velero_backup_success_total[24h])) / sum(rate(velero_backup_attempt_total[24h]))
# Alert if < 0.95 (95% success rate)

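# Failed authentication attempts (potential brute force)
# kube-apiserver exposes this counter as authentication_attempts (no _total suffix)
sum(rate(authentication_attempts{result="failure"}[5m]))
# Alert on a sustained spike - it may indicate a brute-force attack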
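
Two of the thresholds above translate directly into alerting rules. This is a minimal sketch that assumes a Prometheus-compatible rule evaluator; the alert names, "for" durations, and severities are my own illustrative choices, so adapt them to whatever alerting backend you run.

```shell
#!/bin/sh
# Emit a Prometheus-style rule file for two of the queries above.
# Alert names, durations and severities are illustrative assumptions.
security_alert_rules() {
  cat <<'EOF'
groups:
- name: security-posture
  rules:
  - alert: VeleroBackupSuccessRateLow
    # Fewer than 95% of backup attempts succeeded over the last 24h
    expr: sum(rate(velero_backup_success_total[24h])) / sum(rate(velero_backup_attempt_total[24h])) < 0.95
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "Velero backup success rate is below 95%"
  - alert: TetragonMissedEvents
    # Any missed BPF events mean Tetragon is dropping data
    expr: sum(rate(tetragon_bpf_missed_events_total[5m])) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Tetragon is missing BPF events"
EOF
}

# Usage:
#   security_alert_rules > security-alerts.yaml
#   promtool check rules security-alerts.yaml   # syntax check, if promtool is installed
```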

Incident Response Plan

Automated Response

First, create the RBAC for the incident response Job:

# incident-response-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: incident-responder
  namespace: security
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: incident-responder
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get","list","watch","patch"]    # 'patch' lets the Job label the suspect pod
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get","create"]                  # used by 'kubectl exec' and 'kubectl cp'
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["create","get","list","watch","delete","patch"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: incident-responder
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: incident-responder
subjects:
  - kind: ServiceAccount
    name: incident-responder
    namespace: security
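
It is worth confirming the binding works before an incident forces the question. The sketch below impersonates the ServiceAccount with kubectl auth can-i and checks the three permissions the Job depends on. The function names are hypothetical, and running it requires that your own user is allowed to impersonate ServiceAccounts.

```shell
#!/bin/sh
# Check that the incident-responder ServiceAccount can do what the Job needs.
SA="system:serviceaccount:security:incident-responder"

# $1 = namespace; remaining arguments are passed to 'kubectl auth can-i'.
# Succeeds only if the API server answers "yes".
can_i() (
  namespace="$1"; shift
  kubectl auth can-i "$@" --namespace "$namespace" --as "$SA" --quiet >/dev/null 2>&1
)

# Print every missing permission; succeed only if none is missing.
verify_incident_rbac() (
  namespace="$1"
  failed=0
  can_i "$namespace" patch pods || { echo "missing: patch pods"; failed=1; }
  can_i "$namespace" create pods --subresource=exec || { echo "missing: create pods/exec"; failed=1; }
  can_i "$namespace" create networkpolicies.networking.k8s.io || { echo "missing: create networkpolicies"; failed=1; }
  [ "$failed" -eq 0 ]
)

# Usage (needs cluster access and impersonation rights):
#   verify_incident_rbac production && echo "RBAC looks sufficient"
```

Note the --subresource=exec form: writing pods/exec as an argument would be read as a pod named "exec" rather than the exec subresource.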

Then deploy the incident response Job:

# incident-response.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: incident-response-{{ .Timestamp }}   # template placeholder: substitute a real value before submitting, kubectl does not expand it
  namespace: security
spec:
  template:
    spec:
      serviceAccountName: incident-responder
      restartPolicy: Never
      containers:
      - name: responder
        image: incident-responder:latest   # placeholder: build your own image with bash, kubectl, and curl, and pin a tag or digest
        env:
          - name: NAMESPACE
            value: "production"     # Override at submit time if needed
          - name: POD_NAME
            value: "suspect-pod"    # Override at submit time
        command:
        - /bin/bash
        - -c
        - |
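          set -euo pipefail

          # Isolate affected pod
          kubectl -n "$NAMESPACE" label pod "$POD_NAME" quarantine=true --overwrite

          # Create network policy to isolate
          kubectl apply -f - <<EOF
          apiVersion: networking.k8s.io/v1
          kind: NetworkPolicy
          metadata:
            name: quarantine-$POD_NAME
            namespace: $NAMESPACE
          spec:
            podSelector:
              matchLabels:
                quarantine: "true"
            policyTypes: ["Ingress","Egress"]
            ingress: []   # explicit deny
            egress: []    # explicit deny
          EOF

          # Capture forensics. /proc/self would describe the tar process itself, so
          # inspect PID 1 (the container's main process) and read /proc files directly.
          kubectl -n "$NAMESPACE" exec "$POD_NAME" -- sh -c 'cat /proc/1/cmdline; echo; cat /proc/1/maps; ls -l /proc/1/fd' > "./proc-$POD_NAME.txt" || true
          kubectl -n "$NAMESPACE" exec "$POD_NAME" -- tar czf /tmp/forensics.tar.gz /var/log /etc || true
          kubectl -n "$NAMESPACE" cp "$POD_NAME:/tmp/forensics.tar.gz" "./forensics-$POD_NAME.tar.gz" || echo "forensics copy failed" >&2
          # These files land on the Job pod's ephemeral filesystem; upload them before the Job exits

          # Alert security team (replace the URL with your own alerting endpoint)
          curl -fsS -X POST https://webhook.site/alert \
            -H "Content-Type: application/json" \
            -d '{"incident": "Security breach detected", "pod": "'"$POD_NAME"'"}'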
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
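
The manifest still has to be rendered before it can be submitted: the {{ .Timestamp }} placeholder needs a real value, and NAMESPACE and POD_NAME need to point at the actual suspect pod. Here is one way to do that with sed, as a sketch. The function name render_incident_job is my own, and it assumes the file still contains the exact default values "production" and "suspect-pod" shown above.

```shell
#!/bin/sh
# Render incident-response.yaml for a specific pod.
# Assumes the template contains the literal placeholder and default values
# from the manifest above; render_incident_job is a hypothetical helper.
render_incident_job() (
  template="$1"; namespace="$2"; pod="$3"
  timestamp="$(date +%Y%m%d%H%M%S)"   # digits only, valid in a Job name
  sed -e "s/{{ \.Timestamp }}/${timestamp}/" \
      -e "s/value: \"production\"/value: \"${namespace}\"/" \
      -e "s/value: \"suspect-pod\"/value: \"${pod}\"/" \
      "$template"
)

# Usage (needs cluster access):
#   render_incident_job incident-response.yaml staging api-7d9f | kubectl create -f -
```

A templating tool such as Helm or Kustomize would be more robust for anything beyond a homelab, since sed matches on exact text and silently does nothing if the defaults change.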

Security Audit

Regular security audits are critical:

# Run Kubernetes CIS benchmark
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
# Once the Job completes, read the results with: kubectl logs job/kube-bench

# Check RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:default:default

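# Scan for vulnerabilities
# (recent Trivy releases may no longer accept the trailing 'cluster' argument; if it fails, run 'trivy k8s --report summary')
trivy k8s --report summary cluster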

# Check network policies
kubectl get networkpolicies -A

# Verify pod security standards
kubectl get namespaces -o json | jq '.items[] | select(.metadata.labels."pod-security.kubernetes.io/enforce" != null) | {name: .metadata.name, enforce: .metadata.labels."pod-security.kubernetes.io/enforce"}'
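
The last two checks list what exists, but during an audit the gaps matter more. The sketch below inverts them and reports namespaces with no enforce label and namespaces with no NetworkPolicy at all. The function names are hypothetical. Expect system namespaces such as kube-system to show up, and decide case by case whether that is acceptable.

```shell
#!/bin/sh
# Namespaces that carry no Pod Security "enforce" label.
# The '!key' selector matches objects where the label is absent.
namespaces_without_pss() {
  kubectl get namespaces -l '!pod-security.kubernetes.io/enforce' -o name \
    | sed 's|^namespace/||'
}

# Namespaces that contain no NetworkPolicy, so all pod traffic is allowed.
namespaces_without_netpol() (
  for ns in $(kubectl get namespaces -o name | sed 's|^namespace/||'); do
    if [ -z "$(kubectl get networkpolicies -n "$ns" -o name 2>/dev/null)" ]; then
      echo "$ns"
    fi
  done
)

# Usage (needs cluster access):
#   namespaces_without_pss
#   namespaces_without_netpol
```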

Production Readiness Checklist

  • [ ] Admission Control
    • [ ] Kyverno policies enforced
    • [ ] Resource limits required
    • [ ] Image scanning enabled (consider Kyverno verifyImages or ImageValidatingPolicy for supply-chain security)
  • [ ] Runtime Security
    • [ ] Tetragon monitoring active
    • [ ] Suspicious activity alerts configured
    • [ ] Automatic remediation enabled
  • [ ] Network Security
    • [ ] Default deny policies
    • [ ] Ingress/egress restrictions
    • [ ] Service mesh encryption
  • [ ] Backup & Recovery
    • [ ] Automated daily backups
    • [ ] Tested restore procedures
    • [ ] Off-site backup storage
  • [ ] Monitoring
    • [ ] Security dashboards created
    • [ ] Alert rules configured
    • [ ] Incident response tested
  • [ ] Compliance
    • [ ] CIS benchmarks passing
    • [ ] Pod security standards enforced
    • [ ] Audit logging enabled

Conclusion

Congratulations! You've built a production-grade Kubernetes homelab that rivals enterprise deployments. Your cluster now has:

  • Immutable infrastructure with Talos Linux
  • Advanced networking with Cilium eBPF
  • Distributed storage with Rook-Ceph
  • GitOps workflows with ArgoCD
  • Automated TLS with cert-manager
  • Complete observability with SigNoz
  • Defense in depth security

This isn't just a homelab - it's a platform for learning, experimentation, and running real workloads with confidence.

Key Takeaways

  1. Security is layers, not a single tool - Each layer catches what others miss
  2. Policy enforcement prevents problems - Better than fixing after breach
  3. Runtime security catches what admission control cannot - eBPF gives kernel-level visibility into process, file, and network activity
  4. Backups are worthless without testing - Regular restore drills are essential
  5. Resource allocation matters - Over-provisioning wastes capacity and money, while under-provisioning causes performance issues

Final Thoughts

Building this cluster taught me more about Kubernetes than any course or certification. Every failure was a learning opportunity, every optimization a small victory. Your journey will be different, but the fundamentals remain: plan carefully, implement security from the start, monitor everything, and always have backups.

Welcome to the world of production Kubernetes. May your clusters be stable and your pods always running.

Series Complete! 🎉

You've now built a production-grade Kubernetes homelab with enterprise-level security, observability, and operational practices.

Next Steps:

  • Start deploying your applications using the GitOps patterns from Part 5
  • Monitor everything with the observability stack from Part 7
  • Review security policies regularly and update as threats evolve
  • Consider expanding to multi-cluster setups for production workloads

Thank you for following this series. Your feedback and experiences are welcome - may they help future builders avoid our mistakes and build upon our successes.