Kubernetes is not secure by default. This final article covers defense in depth: the security concepts behind it, Kyverno policies, Tetragon runtime security, CrowdSec threat detection, Cilium network policies, and Velero disaster recovery. These layers have stopped real attacks on my cluster, and the same approach should harden yours.
Understanding Kubernetes Security
What is Defense in Depth?
Defense in depth is like having multiple locks on your door, a security system, and a guard dog. If one security layer fails, others are there to protect you. In Kubernetes, no single security tool can protect against all threats.
Why Default Kubernetes Security Isn't Enough:
- Containers run as root unless the image or a securityContext says otherwise
- No network traffic restrictions
- No runtime threat detection
- Minimal admission controls
- No backup/recovery built-in
Defense in Depth Strategy
Security isn't a single tool - it's layers of protection:
- Admission Control (Kyverno) - Prevent bad configurations before pods start
- Runtime Security (Tetragon) - Detect and block threats while pods are running
- Threat Intelligence (CrowdSec) - Community-driven protection against known attacks
- Network Policies - Zero-trust networking (deny all, allow only what's needed)
- Disaster Recovery (Velero) - When all else fails, restore from backups
Kyverno Policy Engine
What is Kyverno?
Kyverno is a policy engine that validates, mutates, and generates Kubernetes resources. It acts as a gatekeeper - every time someone tries to create or modify a resource, Kyverno checks if it's allowed by your policies.
Key Kyverno Concepts:
- Validate: Check if resources meet security requirements (reject if they don't)
- Mutate: Automatically fix resources (add missing security settings)
- Generate: Create additional resources (like network policies)
- Background: Check existing resources, not just new ones
Installing Kyverno
# kyverno-values.yaml
# Webhook configuration - Kyverno defaults to Fail for strict security
# Admission controller (validates and mutates resources)
admissionController:
  replicas: 3 # Run 3 copies for high availability (admission control is critical)
  metricsService:
    create: true
    type: ClusterIP
  container:
    resources: # Chart v3 expects admission controller resources under container.resources
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 1Gi
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      readOnlyRootFilesystem: true
# Background controller (scans existing resources for policy violations)
backgroundController:
  enabled: true # Check existing resources, not just new ones
  replicas: 2
  metricsService:
    create: true
    type: ClusterIP
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
# Cleanup controller (removes old resources)
cleanupController:
  replicas: 2
  metricsService:
    create: true
    type: ClusterIP
# Reports controller (generates policy reports)
reportsController:
  replicas: 2
  metricsService:
    create: true
    type: ClusterIP
# Policy reports cleanup (chart-specific feature)
# Note: policyReportsCleanup exists in some downstream charts but not upstream
# For portable cleanup, use Kyverno CleanupPolicy CRDs instead
Install Kyverno:
# Create dedicated namespace for Kyverno
kubectl create namespace kyverno
# Add Kyverno Helm repository
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
# Install Kyverno with our configuration
KYVERNO_CHART_VERSION=$(helm search repo kyverno/kyverno --versions | awk 'NR==2 {print $2}')
helm show chart kyverno/kyverno --version "$KYVERNO_CHART_VERSION" | grep appVersion
# appVersion: 1.15.1 ← confirm you're installing the latest Kyverno release
helm install kyverno kyverno/kyverno \
--namespace kyverno \
--version "$KYVERNO_CHART_VERSION" \
--values kyverno-values.yaml \
--wait # Wait for deployment to complete
# Expected output:
# NAME: kyverno
# NAMESPACE: kyverno
# STATUS: deployed
# REVISION: 1
# Verify admission webhook is registered
kubectl get validatingwebhookconfigurations | grep kyverno
# Expected output:
# kyverno-policy-validating-webhook-cfg
# kyverno-resource-validating-webhook-cfg
# Check Kyverno pods are running
kubectl -n kyverno get pods
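# Expected: every pod Running and Ready. With the values above that means
# 3 admission-controller pods plus 2 each for the background, cleanup, and
# reports controllers (pod names start with kyverno-admission-controller-,
# kyverno-background-controller-, kyverno-cleanup-controller-, and
# kyverno-reports-controller-).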
Day-2 Operations: Managing Kyverno
# Check policy violation reports
kubectl get polr -A # Policy Reports across all namespaces
# or use the full resource name for portability:
kubectl get policyreports.wgpolicyk8s.io -A
# If OpenReports is enabled (Kyverno 1.15+):
kubectl get policyreports.openreports.io -A
# View specific policy violations
kubectl get polr -n default -o yaml
# Update Kyverno safely
KYVERNO_CHART_VERSION=$(helm search repo kyverno/kyverno --versions | awk 'NR==2 {print $2}')
helm upgrade kyverno kyverno/kyverno \
--namespace kyverno \
--values kyverno-values.yaml \
--version "$KYVERNO_CHART_VERSION"
# Monitor Kyverno performance
kubectl top pods -n kyverno
# Troubleshoot webhook issues
kubectl describe validatingwebhookconfigurations kyverno-policy-validating-webhook-cfg
# Emergency recovery: if fail-closed webhooks brick the API, delete webhook configs
# kubectl delete validatingwebhookconfigurations kyverno-policy-validating-webhook-cfg
# kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg
# kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg
# Then fix/exclude problematic namespaces and redeploy Kyverno
Essential Security Policies
Why These Policies Matter:
By default, Kubernetes allows dangerous configurations. These policies enforce security best practices automatically.
# require-non-root.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy # Applies cluster-wide
metadata:
  name: require-non-root
spec:
  validationFailureAction: Enforce # Block pods that violate this policy
  background: true # Also check existing pods
  rules:
    - name: check-non-root
      match: # Apply to these resources
        any:
          - resources:
              kinds:
                - Pod # Check all pods
      exclude: # Skip these namespaces (system components need root)
        any:
          - resources:
              namespaces:
                - kube-system # Kubernetes system components
                - kyverno # Kyverno itself
                - rook-ceph # Ceph storage system
      validate:
        message: "Containers must run as non-root user"
        pattern: # Required configuration
          spec:
            =(securityContext): # Pod-level security context
              runAsNonRoot: true # Don't run as root (user ID 0)
            =(initContainers):
              - =(securityContext): # Init container security context
                  runAsNonRoot: true
            =(ephemeralContainers):
              - =(securityContext): # Ephemeral container security context
                  runAsNonRoot: true
            containers:
              - =(securityContext): # Container-level security context
                  runAsNonRoot: true
---
# disallow-privileged.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: check-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed"
        pattern:
          spec:
            =(securityContext):
              =(privileged): "!true" # "!true" means "must not be true"
            =(initContainers):
              - =(securityContext):
                  =(privileged): "!true" # Block privileged init containers
            =(ephemeralContainers):
              - =(securityContext):
                  =(privileged): "!true" # Block privileged ephemeral containers
            containers:
              - =(securityContext):
                  =(privileged): "!true" # Block privileged containers
# Why this matters: Privileged containers can access the host kernel
# and potentially break out of the container to attack the host
---
# require-resource-limits.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pod-resources
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system # System pods have different requirements
                - kyverno
      validate:
        message: "Resource requests and limits are required"
        pattern:
          spec:
            =(initContainers):
              - name: "*"
                resources:
                  requests: { memory: "?*", cpu: "?*" }
                  limits: { memory: "?*", cpu: "?*" }
            =(ephemeralContainers):
              - name: "*"
                resources:
                  requests: { memory: "?*", cpu: "?*" }
                  limits: { memory: "?*", cpu: "?*" }
            containers:
              - name: "*"
                resources:
                  requests: # Minimum guaranteed resources
                    memory: "?*" # Must specify memory request
                    cpu: "?*" # Must specify CPU request
                  limits: # Maximum allowed resources
                    memory: "?*" # Must specify memory limit
                    cpu: "?*" # Must specify CPU limit
# Why this matters: Without limits, one pod can consume all resources
# and crash other pods ("noisy neighbor" problem)
---
# restrict-image-registries.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: false # Only check new pods (existing ones might be from old registries)
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must come from approved registries"
        pattern:
          spec:
            containers: # Regular containers
              - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.gitlab.com/your-org/*"
            =(initContainers): # Init containers (run before main containers)
              - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.gitlab.com/your-org/*"
            =(ephemeralContainers): # Ephemeral containers (debugging)
              - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.gitlab.com/your-org/*"
# Why this matters: Restricts images to trusted registries only
# Prevents using images from unknown sources that might contain malware
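To get a feel for which image references the allow-list above accepts, here is a minimal Python sketch. It assumes the pattern is compared against the literal image string exactly as written in the pod spec, and it uses `fnmatchcase` as a stand-in for Kyverno's wildcard matching; `image_allowed` is a hypothetical helper, not part of Kyverno.

```python
from fnmatch import fnmatchcase

# Allow-list copied from the restrict-image-registries policy above.
ALLOWED_PATTERNS = [
    "docker.io/*",
    "ghcr.io/*",
    "quay.io/*",
    "registry.gitlab.com/your-org/*",
]


def image_allowed(image: str) -> bool:
    """Return True if the literal image reference matches any allowed glob."""
    return any(fnmatchcase(image, pattern) for pattern in ALLOWED_PATTERNS)


if __name__ == "__main__":
    for image in (
        "docker.io/library/nginx:1.27",
        "ghcr.io/example/app:v1",
        "nginx:1.27",  # short name, no registry prefix
        "evil.example/miner:latest",
    ):
        print(f"{image}: {'allowed' if image_allowed(image) else 'rejected'}")
```

Under that assumption a short name such as `nginx:1.27` is rejected even though the runtime would pull it from Docker Hub, so manifests need fully qualified image names once the policy is enforced.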
Mutation Policies
What are Mutation Policies?
Mutation policies automatically add security settings to pods if they're missing. Instead of rejecting pods, they fix them automatically.
# add-security-context.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-security-context
spec:
  background: false # Only apply to new pods
  rules:
    - name: add-security-context
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate: # Add these security settings where they are missing
        patchStrategicMerge:
          spec:
            securityContext: # Pod-level security settings
              +(runAsNonRoot): true # Don't run as root
              +(runAsUser): 1000 # Run as user ID 1000
              +(fsGroup): 2000 # File system group ID
              +(seccompProfile): # Secure computing mode
                type: RuntimeDefault # Use default seccomp profile
            containers:
              - (name): "*" # Apply to all containers
                securityContext:
                  +(allowPrivilegeEscalation): false # No privilege gain via setuid binaries
                  +(readOnlyRootFilesystem): true # Root filesystem is read-only
                  capabilities: # Linux capabilities
                    drop:
                      - ALL # Drop every capability
# The +(key) anchor adds a field only when it is absent, so pods that already
# set these values keep them; without it the patch would overwrite them
---
# enforce-security-context.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: enforce-security-context
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: enforce-container-context
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Containers must follow hardened security context"
        pattern:
          spec:
            securityContext:
              seccompProfile:
                type: RuntimeDefault
            =(initContainers):
              - =(securityContext):
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
                  capabilities:
                    drop: ["ALL"]
            =(ephemeralContainers):
              - =(securityContext):
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
                  capabilities:
                    drop: ["ALL"]
            containers:
              - =(securityContext):
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
                  capabilities:
                    drop: ["ALL"]
# This validation policy ensures security context can't be bypassed
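The "add settings only if they're missing" behaviour of a mutation policy can be modelled in a few lines of Python. This is a conceptual sketch of add-if-absent merging (what Kyverno's `+(key)` anchor expresses), not Kyverno's implementation; `add_if_absent` and the sample values are illustrative assumptions.

```python
import copy


def add_if_absent(resource: dict, defaults: dict) -> dict:
    """Return a copy of resource with defaults filled in where keys are missing.

    Existing values are never overwritten; nested dicts are merged recursively.
    """
    result = copy.deepcopy(resource)
    for key, value in defaults.items():
        if key not in result:
            result[key] = copy.deepcopy(value)
        elif isinstance(value, dict) and isinstance(result[key], dict):
            result[key] = add_if_absent(result[key], value)
    return result


if __name__ == "__main__":
    # Hypothetical pod-level securityContext that already picks its own UID.
    submitted = {"runAsUser": 65532}
    hardened_defaults = {
        "runAsNonRoot": True,
        "runAsUser": 1000,
        "fsGroup": 2000,
        "seccompProfile": {"type": "RuntimeDefault"},
    }
    print(add_if_absent(submitted, hardened_defaults))
```

The pod keeps its own `runAsUser` and only gains the fields it left out, which is usually what you want from a default: it hardens careless manifests without fighting deliberate ones.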
Tetragon Runtime Security
What is Tetragon?
Tetragon is a runtime security tool that uses eBPF (extended Berkeley Packet Filter) to monitor system calls and network activity in real-time. It can detect and automatically respond to threats as they happen.
How eBPF Works:
eBPF runs sandboxed programs inside the Linux kernel, giving deep visibility into system activity with low overhead. Think of it as a security camera pointed at the kernel events you choose to watch.
Why Tetragon is Powerful:
- Zero-day protection: Can flag unknown threats by their behavior
- Real-time response: Can kill malicious processes immediately
- No signature updates: Relies on behavior rules rather than signature feeds
- Kernel-level visibility: Observes activity in the kernel, where it is much harder for a process inside a container to hide
Deploy Tetragon
# tetragon-values.yaml
tetragon:
  enabled: true
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m # Max 2 CPU cores
      memory: 1Gi # Max 1 GiB RAM
  # eBPF program settings
  btf: /sys/kernel/btf/vmlinux # Kernel BTF (BPF Type Format) for type info
  debug: false # Set to true temporarily when investigating policies
  exportRateLimit: 1000 # Events per minute; -1 means unlimited
  # Prometheus metrics (Tetragon runs on hostNetwork for full visibility)
  prometheus:
    enabled: true
    port: 2112
# Export settings (top-level in chart, not under tetragon:)
export:
  mode: "stdout" # Valid options: "stdout" or use defaults
  filenames: ["tetragon.log"]
tetragonOperator: # Note: chart uses "tetragonOperator" not "operator"
  enabled: true
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
# Note: TracingPolicies are applied as separate manifests, not via Helm values
# The chart doesn't support a tracingPolicies: values key
# Apply TracingPolicy CRs separately or GitOps them with ArgoCD
Install Tetragon:
# Create dedicated namespace for Tetragon
kubectl create namespace tetragon
# Add Cilium Helm repository (Tetragon is part of Cilium project)
helm repo add cilium https://helm.cilium.io
helm repo update
# Install Tetragon with our security policies
TETRAGON_CHART_VERSION=$(helm search repo cilium/tetragon --versions | awk 'NR==2 {print $2}')
helm show chart cilium/tetragon --version "$TETRAGON_CHART_VERSION" | grep appVersion
# appVersion: 1.5.0 ← confirm appVersion matches the chart you install
# Install into the tetragon namespace, pinned to the chart version above, with our custom values
# (no comments after the trailing backslashes: they would break the line continuation)
helm install tetragon cilium/tetragon \
  --namespace tetragon \
  --version "$TETRAGON_CHART_VERSION" \
  --values tetragon-values.yaml
# Verify Tetragon pods are running
kubectl -n tetragon get pods
# Should see tetragon pods running on each node
# Check if eBPF programs are loaded
kubectl -n tetragon logs -l app.kubernetes.io/name=tetragon --tail=10
# Look for messages about eBPF programs being loaded
Apply TracingPolicies Separately:
TracingPolicies define what Tetragon monitors and must be applied as separate manifests:
# Apply the runtime security policies (shown in next section)
kubectl apply -f runtime-security-policies.yaml
Runtime Security Policies
Real-Time Threat Response:
These policies detect and automatically respond to threats as they happen. No waiting for signature updates or manual intervention.
# runtime-security-policies.yaml
# File monitoring policy
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: file-monitoring
spec:
  kprobes:
    - call: "sys_openat"
      syscall: true
      args:
        - index: 1
          type: "string"
      selectors:
        - matchArgs:
            - index: 1
              operator: "Prefix"
              values:
                - "/etc/"
                - "/root/"
                - "/home/"
---
# Network monitoring policy
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: network-monitoring
spec:
  kprobes:
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchArgs:
            - index: 0
              operator: "NotDAddr"
              values:
                - "10.0.0.0/8"
                - "172.16.0.0/12"
                - "192.168.0.0/16"
                - "127.0.0.0/8"
                - "fc00::/7"
---
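# Privilege escalation detection
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-privilege-escalation
spec:
  kprobes:
    # With syscall: true, Tetragon adds the architecture prefix (such as __x64_) itself
    # Selectors apply per kprobe, so each hook carries its own copy
    - call: "sys_setuid"
      syscall: true
      args: [{ index: 0, type: "int" }]
      selectors:
        - matchArgs:
            - index: 0
              operator: "Equal"
              values: ["0"]
          matchActions:
            - action: Override
              argError: -1 # -EPERM
            - action: Sigkill
    - call: "sys_setreuid"
      syscall: true
      args: [{ index: 0, type: "int" }]
      selectors:
        - matchArgs:
            - index: 0
              operator: "Equal"
              values: ["0"]
          matchActions:
            - action: Override
              argError: -1 # -EPERM
            - action: Sigkill
    - call: "sys_setresuid"
      syscall: true
      args: [{ index: 0, type: "int" }]
      selectors:
        - matchArgs:
            - index: 0
              operator: "Equal"
              values: ["0"]
          matchActions:
            - action: Override
              argError: -1 # -EPERM
            - action: Sigkill
# This detects and blocks processes that request UID 0 through these calls
# Override needs a kernel built with CONFIG_BPF_KPROBE_OVERRIDE, and legitimate
# tools such as su or sudo will be killed too, so scope the policy before enforcing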
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-crypto-mining
spec:
  kprobes:
    - call: "sys_execve" # Monitor process execution
      syscall: true
      args:
        - index: 0 # Path of the program being executed
          type: "string"
      selectors:
        - matchArgs:
            - index: 0 # Check program path
              operator: "Postfix" # execve receives a path, so match on how it ends
              values: # Known crypto mining programs
                - "/xmrig" # Popular Monero miner
                - "/minerd" # CPU miner
                - "/minergate" # Mining pool software
          matchActions:
            - action: Sigkill # Kill crypto miners immediately
# This stops crypto mining attacks that steal your compute resources
# A renamed binary slips past a name match, so treat this as one signal among several
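The network-monitoring policy above reports connections whose destination falls outside the listed private ranges. The same membership test can be sketched with Python's standard `ipaddress` module to check which destinations would count as external; `is_external` is a hypothetical helper written for this illustration.

```python
import ipaddress

# CIDR ranges copied from the NotDAddr selector in the network-monitoring policy.
INTERNAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "127.0.0.0/8",
        "fc00::/7",
    )
]


def is_external(address: str) -> bool:
    """Return True if the address is outside every internal range."""
    ip = ipaddress.ip_address(address)
    return not any(ip in network for network in INTERNAL_NETWORKS)


if __name__ == "__main__":
    for address in ("10.1.2.3", "172.31.255.1", "172.32.0.1", "8.8.8.8", "fd00::1"):
        label = "external (would be reported)" if is_external(address) else "internal"
        print(f"{address}: {label}")
```

Note that `172.16.0.0/12` ends at `172.31.255.255`, so `172.32.0.1` is already external. If your pod or service CIDRs sit outside these ranges, add them to the policy or every in-cluster connection will generate an event.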
CrowdSec Threat Detection
What is CrowdSec?
CrowdSec is a community-driven security platform that shares threat intelligence globally. When it detects an attack on your cluster, it shares the attacker's IP with other CrowdSec users, creating a collaborative defense network.
How CrowdSec Works:
- Agents analyze logs for attack patterns
- Scenarios define what constitutes an attack
- Decisions are made to block attackers
- Bouncers enforce decisions (block IPs, etc.)
- Community shares threat intelligence globally
Deploy CrowdSec
# crowdsec-values.yaml
agent: # Analyzes logs for threats
  acquisition: # Where to get log data
    - source: kinesis # AWS Kinesis stream (if using)
      stream_name: kubernetes-logs
      labels:
        type: nginx # Web server logs
    - source: file # Read log files directly
      filenames:
        - /var/log/pods/*/*/*.log # All pod logs
      labels:
        type: syslog # System logs
  # Parsers and scenarios (attack detection rules)
  collections: # Pre-built detection rules from CrowdSec community
    - crowdsecurity/nginx # Web server attack detection
    - crowdsecurity/sshd # SSH brute force detection
    - crowdsecurity/linux # Linux system attack detection
    - crowdsecurity/http-cve # HTTP vulnerability exploitation
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
lapi: # Local API (processes decisions and shares with community)
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  # Decisions stream (what to do with detected threats)
  bouncer:
    enabled: true
    # Don't use ${VAR} - pass key via Helm: --set lapi.bouncer.key="$YOUR_BOUNCER_KEY"
    # Or consider mTLS between Traefik and LAPI to avoid shared keys
  # Metrics
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
Deploy CrowdSec with bouncers:
# Add CrowdSec Helm repository
helm repo add crowdsec https://crowdsecurity.github.io/helm-charts
helm repo update
# Install CrowdSec with threat detection rules
# --create-namespace creates the crowdsec namespace if it doesn't exist
# The bouncer key is passed on the command line so it stays out of the values file
# (no comments after the trailing backslashes: they would break the line continuation)
helm install crowdsec crowdsec/crowdsec \
  --namespace crowdsec \
  --create-namespace \
  --values crowdsec-values.yaml \
  --set lapi.bouncer.enabled=true \
  --set lapi.bouncer.key="$YOUR_BOUNCER_KEY"
# Install Traefik bouncer (blocks IPs at ingress level)
# Note: Don't use Helm templating ({{ .Values... }}) in kubectl apply - it won't be rendered
# Option A: Use literal key in kubectl (less secure)
kubectl apply -f - <<EOF
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: crowdsec-bouncer
  namespace: traefik-system
spec:
  plugin:
    bouncer: # Must match your Traefik static plugin ID (e.g., --experimental.plugins.bouncer.moduleName=...)
      crowdsecLapiScheme: http
      # Confirm the LAPI Service name first: kubectl -n crowdsec get svc
      crowdsecLapiHost: crowdsec-lapi.crowdsec.svc.cluster.local:8080
      crowdsecLapiKey: "YOUR_LITERAL_BOUNCER_KEY_HERE" # Replace with actual key
EOF
# Option B: Manage via Helm chart so templating works (recommended)
# Add to your Traefik values and upgrade the chart:
# middleware:
# crowdsecBouncer:
# crowdsecLapiKey: "{{ .Values.crowdsec.bouncerKey }}"
# Verify CrowdSec is running
kubectl -n crowdsec get pods
# Should see crowdsec agent and lapi pods running
Cilium Network Policies
Zero-Trust Networking with Cilium
Since we're running Cilium as our CNI, we'll use CiliumNetworkPolicy (CNP) for superior security controls including L7 filtering, DNS policies, and better observability.
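# default-deny-all.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  endpointSelector: {}
  # An empty list is treated like an omitted field and enforces nothing.
  # A single empty rule allows no peers, which switches the pods to default-deny.
  ingress:
    - {} # default-deny ingress
  egress:
    - {} # default-deny egress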
---
# allow-dns.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  endpointSelector: {}
  egress:
    - toEndpoints:
        - matchLabels:
            k8s-app: kube-dns
            k8s:io.kubernetes.pod.namespace: kube-system
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
            - port: "53"
              protocol: TCP
          rules:
            dns:
              - matchPattern: "*" # Or restrict to specific domains
---
# allow-from-ingress.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: web
  ingress:
    - fromEndpoints:
        - matchLabels:
            app.kubernetes.io/name: traefik
            k8s:io.kubernetes.pod.namespace: traefik-system
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/"
              - method: "POST"
                path: "/api/.*"
# Note: This will deny all other paths/methods (healthz, static assets, OPTIONS, etc)
# Add more rules or a catch-all if needed for your application
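To see why `/healthz` or an OPTIONS preflight gets blocked by the allow-from-ingress rules, it helps to replay a few requests against the rule list. The sketch below assumes each rule's method and path are regular expressions that must match the whole value; that matches how I read Cilium's L7 rules, but verify it against your Cilium version. `request_allowed` is a hypothetical helper.

```python
import re

# (method regex, path regex) pairs copied from the allow-from-ingress policy.
HTTP_RULES = [
    ("GET", "/"),
    ("POST", "/api/.*"),
]


def request_allowed(method: str, path: str) -> bool:
    """Return True if any rule matches both the method and the full path."""
    return any(
        re.fullmatch(rule_method, method) is not None
        and re.fullmatch(rule_path, path) is not None
        for rule_method, rule_path in HTTP_RULES
    )


if __name__ == "__main__":
    for method, path in (
        ("GET", "/"),
        ("POST", "/api/users"),
        ("GET", "/healthz"),
        ("OPTIONS", "/"),
        ("GET", "/api/users"),
    ):
        verdict = "allowed" if request_allowed(method, path) else "denied"
        print(f"{method} {path}: {verdict}")
```

Running a list of the requests your application really makes through a check like this before enforcing the policy is a cheap way to find the health checks and asset paths you forgot.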
Cilium-Specific Security Features:
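# advanced-l7-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-security
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            # HTTP rules need traffic Cilium can parse: if this port carries TLS,
            # method, path, and headers are invisible unless Cilium terminates TLS
            - port: "443"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/v1/users"
                headerMatches:
                  # Check that your Cilium version accepts `regex` here; the
                  # documented headerMatches fields are name, value, mismatch, and secret
                  - name: "Authorization"
                    regex: "^Bearer\\s+.+$"
              - method: "POST"
                path: "/api/v1/users"
                headerMatches:
                  - name: "Content-Type"
                    value: "application/json"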
---
# dns-based-egress.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-external-apis
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  egress:
    - toFQDNs:
        - matchName: "api.github.com"
        - matchPattern: "*.googleapis.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
Important Notes for Operations:
- L7 HTTP Rules: The policies above will deny any paths/methods not explicitly listed. For example, /healthz, static assets, or OPTIONS requests will be blocked unless added to the rules. Consider adding a catch-all rule if your application needs broader access.
- hostNetwork Pods: CiliumNetworkPolicies (like all CNI policies) do not affect hostNetwork pods. If you need to quarantine or restrict hostNetwork pods (like some monitoring agents), you'll need node-level controls such as:
  - Node taints and tolerations for isolation
  - Host firewall rules (iptables/nftables)
  - Separate node pools for hostNetwork workloads
  - Runtime security policies via Tetragon at the syscall level
Velero Backup and Disaster Recovery
Install Velero
# velero-values.yaml
configuration:
  provider: aws
  backupStorageLocation:
    name: default
    bucket: homelab-velero-backups
    config:
      region: us-east-1
      s3ForcePathStyle: true
      s3Url: https://s3.homelab.example # MinIO or S3
  volumeSnapshotLocation:
    name: default
    config:
      region: us-east-1
# Note: Don't use ${VAR} expansion in Kubernetes YAML - use one of these approaches:
# A) Pre-create the secret:
#    kubectl -n velero create secret generic velero-credentials \
#      --from-file=cloud=./credentials-velero
# B) Or use Helm templating:
#    --set-file credentials.secretContents.cloud=./credentials-velero
credentials:
  useSecret: true
  name: velero-credentials
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.13.0
    volumeMounts:
      - mountPath: /target
        name: plugins
# Note: CSI plugin removed - integrated into Velero core since v1.14
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi
# Node Agent for file backup (Restic deprecated in v1.15+)
deployNodeAgent: true
nodeAgent:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
Install Velero:
# Create namespace
kubectl create namespace velero
# Install with Helm
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
VELERO_CHART_VERSION=$(helm search repo vmware-tanzu/velero --versions | awk 'NR==2 {print $2}')
helm show chart vmware-tanzu/velero --version "$VELERO_CHART_VERSION" | grep appVersion
# appVersion: v1.17.0 ← confirm before install
helm install velero vmware-tanzu/velero \
--namespace velero \
--version "$VELERO_CHART_VERSION" \
--values velero-values.yaml
Backup Schedules
# backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2 AM daily
  template:
    includedNamespaces:
      - production
      - databases
      - argocd
      - signoz
    excludedResources:
      - events
      - events.events.k8s.io
    labelSelector:
      matchLabels:
        backup: "true"
    ttl: 720h # Keep for 30 days
    storageLocation: default
    volumeSnapshotLocations:
      - default
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-database-backup
  namespace: velero
spec:
  schedule: "0 * * * *" # Every hour
  template:
    includedNamespaces:
      - databases
    ttl: 168h # Keep for 7 days
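The `ttl` values in the schedules are plain hour counts, so it is worth checking the arithmetic behind the "30 days" and "7 days" comments, and how many backups each schedule keeps around at steady state. A small sketch, assuming every scheduled run succeeds and handling only the `<N>h` form used above (`parse_ttl` is a hypothetical helper):

```python
from datetime import timedelta


def parse_ttl(ttl: str) -> timedelta:
    """Parse an hours-only TTL such as "720h" into a timedelta."""
    if not ttl.endswith("h"):
        raise ValueError(f"only hour-based TTLs are handled here: {ttl!r}")
    return timedelta(hours=int(ttl[:-1]))


def retained_backups(ttl: str, interval: timedelta) -> int:
    """Number of backups alive at once if one is taken every interval."""
    return parse_ttl(ttl) // interval


if __name__ == "__main__":
    daily_ttl = parse_ttl("720h")
    hourly_ttl = parse_ttl("168h")
    print(f"daily-backup ttl: {daily_ttl.days} days")
    print(f"hourly-database-backup ttl: {hourly_ttl.days} days")
    print("daily backups retained:", retained_backups("720h", timedelta(days=1)))
    print("hourly backups retained:", retained_backups("168h", timedelta(hours=1)))
```

The hourly database schedule keeps far more backup objects than the daily one despite its shorter TTL, which is the number to plan bucket capacity around.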
Disaster Recovery Testing
Why Test Disaster Recovery?
Backups are worthless if you can't restore from them. Regular testing ensures your backup strategy actually works when disaster strikes.
# Create test backup of production namespace
velero backup create test-backup --include-namespaces production
# Wait for backup to complete
velero backup describe test-backup
# Status should show "Completed" with no errors
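# Simulate disaster - delete the entire namespace
# Only continue once the backup above shows "Completed"; rehearse on a non-critical namespace first
echo "DANGER: This will delete the production namespace!"
kubectl delete namespace production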
# Verify everything is gone
kubectl get all -n production
# Should show "No resources found in production namespace"
# Restore from backup
velero restore create test-restore --from-backup test-backup
# Monitor restore progress
velero restore describe test-restore
# Verify restoration worked
kubectl get all -n production
# Should see all your applications back and running
# Check application functionality
kubectl -n production get pods
# All pods should be Running
# Test application endpoints
curl https://your-app.homelab.example
# Should respond normally
Pod Security Standards
Configure Pod Security
What are Pod Security Standards?
Kubernetes built-in security policies that enforce different security levels:
- Privileged: No restrictions (dangerous)
- Baseline: Minimal restrictions (better than nothing)
- Restricted: Strict security (recommended for production)
# pod-security-standards.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted # Block non-compliant pods
    pod-security.kubernetes.io/enforce-version: v1.31 # Pin to your cluster minor version or 'latest'
    pod-security.kubernetes.io/audit: restricted # Log policy violations
    pod-security.kubernetes.io/audit-version: v1.31 # Pin audit version
    pod-security.kubernetes.io/warn: restricted # Show warnings to users
    pod-security.kubernetes.io/warn-version: v1.31 # Pin warn version
---
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    pod-security.kubernetes.io/enforce: baseline # More relaxed for dev
    pod-security.kubernetes.io/enforce-version: v1.31
    pod-security.kubernetes.io/audit: restricted # But still audit violations
    pod-security.kubernetes.io/audit-version: v1.31
    pod-security.kubernetes.io/warn: restricted # And warn developers
    pod-security.kubernetes.io/warn-version: v1.31
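Each namespace carries the same six labels with different values, which makes them easy to mistype. One way to keep them consistent is to generate them; the following sketch builds the label set from three levels and a version. `pod_security_labels` is a hypothetical helper, and the label keys are the ones used in the manifests above.

```python
MODES = ("enforce", "audit", "warn")
LEVELS = ("privileged", "baseline", "restricted")
PREFIX = "pod-security.kubernetes.io"


def pod_security_labels(enforce: str, audit: str, warn: str, version: str) -> dict:
    """Build the Pod Security Admission labels for one namespace."""
    labels = {}
    for mode, level in zip(MODES, (enforce, audit, warn)):
        if level not in LEVELS:
            raise ValueError(f"unknown Pod Security level: {level!r}")
        labels[f"{PREFIX}/{mode}"] = level
        labels[f"{PREFIX}/{mode}-version"] = version
    return labels


if __name__ == "__main__":
    # Same settings as the development namespace above.
    for key, value in pod_security_labels(
        "baseline", "restricted", "restricted", "v1.31"
    ).items():
        print(f"{key}: {value}")
```

Enforcing `baseline` while auditing and warning on `restricted`, as the development namespace does, is a low-risk way to see what would break before tightening enforcement.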
Secrets Scanning
Prevent Secret Leaks
# secrets-not-from-env-vars.yaml (official Kyverno policy)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: secrets-not-from-env-vars
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: secrets-not-from-env-vars
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Secrets must be mounted as volumes, not as environment variables."
        pattern:
          spec:
            containers:
              - name: "*"
                =(env):
                  - =(valueFrom):
                      X(secretKeyRef): "null"
    - name: secrets-not-from-envfrom
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Secrets must not come from envFrom statements."
        pattern:
          spec:
            containers:
              - name: "*"
                =(envFrom):
                  - X(secretRef): "null"
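The two rules above boil down to "no container may have `env[].valueFrom.secretKeyRef` or `envFrom[].secretRef`". A rough Python equivalent, useful for linting manifests in CI before they ever reach the cluster, might look like this. It only inspects `spec.containers`, as the policy does, and `secret_env_violations` is a hypothetical helper.

```python
def secret_env_violations(pod: dict) -> list:
    """List the places where a pod injects Secrets through environment variables."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        for env in container.get("env", []):
            if "secretKeyRef" in env.get("valueFrom", {}):
                violations.append(f"{name}: env {env.get('name')} uses secretKeyRef")
        for source in container.get("envFrom", []):
            if "secretRef" in source:
                violations.append(f"{name}: envFrom uses secretRef")
    return violations


if __name__ == "__main__":
    # Hypothetical pod that breaks both rules.
    pod = {
        "spec": {
            "containers": [
                {
                    "name": "app",
                    "env": [
                        {"name": "LOG_LEVEL", "value": "info"},
                        {
                            "name": "DB_PASSWORD",
                            "valueFrom": {"secretKeyRef": {"name": "db", "key": "password"}},
                        },
                    ],
                    "envFrom": [{"secretRef": {"name": "api-keys"}}],
                }
            ]
        }
    }
    for violation in secret_env_violations(pod):
        print(violation)
```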
Security Monitoring
Key Metrics
Security Metrics to Monitor:
These metrics help you understand your security posture and detect issues.
# Kyverno admission latency (histogram)
histogram_quantile(0.95, sum by(le) (rate(kyverno_admission_review_duration_seconds_bucket[5m])))
# Count of failed rule results (enforced + background)
sum(rate(kyverno_policy_results_total{rule_result="fail"}[5m]))
# Alert if this spikes - indicates many policy violations
# Tetragon total events exported
sum(rate(tetragon_events_exported_total[5m]))
# Monitor overall event volume
# Tetragon BPF missed events (health indicator)
sum(rate(tetragon_bpf_missed_events_total[5m]))
# Alert if > 0 - indicates system overload
# CrowdSec active decisions (current active bans)
sum(cs_active_decisions{action="ban"})
# Monitor current blocked threats
# Note: cs_active_decisions only appears when there are active decisions
# Velero backup success rate (disaster recovery readiness)
sum(rate(velero_backup_success_total[24h])) / sum(rate(velero_backup_attempt_total[24h]))
# Alert if < 0.95 (95% success rate)
# Failed API server authentication attempts (potential brute force)
sum(rate(authentication_attempts{result="failure"}[5m]))
# Alert if spike indicates brute force attack
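The admission latency query relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea in Python for a single set of buckets; the bucket values are made up for illustration and the function is a simplification of what Prometheus does.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    bound and ending with the +Inf bucket, like Prometheus *_bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Illustrative admission review durations in seconds (made-up counts).
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print("p50:", histogram_quantile(0.50, sample))
    print("p95:", histogram_quantile(0.95, sample))
```

Because the result is interpolated, it can never be more precise than the bucket boundaries: a p95 that lands in the 0.5s to 1s bucket is an estimate somewhere in that range, not a measurement.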
Incident Response Plan
Automated Response
First, create the RBAC for the incident response Job:
# incident-response-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: incident-responder
  namespace: security
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: incident-responder
rules:
  # pods/exec with "create" also covers the server side of 'kubectl cp'
  - apiGroups: [""]
    resources: ["pods","pods/exec","pods/log"]
    verbs: ["get","list","watch","create","update","patch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["create","get","list","watch","delete","patch"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: incident-responder
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: incident-responder
subjects:
  - kind: ServiceAccount
    name: incident-responder
    namespace: security
Then deploy the incident response Job:
# incident-response.yaml
# Submit with `kubectl create -f incident-response.yaml` (generateName does not work with `kubectl apply`)
apiVersion: batch/v1
kind: Job
metadata:
  generateName: incident-response- # The API server appends a unique suffix
  namespace: security
spec:
  template:
    spec:
      serviceAccountName: incident-responder
      restartPolicy: Never
      containers:
        - name: responder
          image: incident-responder:latest
          env:
            - name: NAMESPACE
              value: "production" # Override at submit time if needed
            - name: POD_NAME
              value: "suspect-pod" # Override at submit time
          command:
            - /bin/bash
            - -c
            - |
              # Isolate affected pod
              kubectl -n "$NAMESPACE" label pod "$POD_NAME" quarantine=true --overwrite
              # Create network policy to isolate
              kubectl apply -f - <<EOF
              apiVersion: networking.k8s.io/v1
              kind: NetworkPolicy
              metadata:
                name: quarantine-$POD_NAME
                namespace: $NAMESPACE
              spec:
                podSelector:
                  matchLabels:
                    quarantine: "true"
                policyTypes: ["Ingress","Egress"]
                ingress: [] # explicit deny
                egress: [] # explicit deny
              EOF
              # Capture forensics (targeted collection to avoid failures)
              # /proc/1 is the container's main process; /proc/self would describe tar itself
              kubectl -n "$NAMESPACE" exec "$POD_NAME" -- tar czf /tmp/forensics.tar.gz /var/log /etc /proc/1/fd /proc/1/maps /proc/1/cmdline
              kubectl -n "$NAMESPACE" cp "$POD_NAME":/tmp/forensics.tar.gz "./forensics-$POD_NAME.tar.gz"
              # Alert security team
              curl -X POST https://webhook.site/alert \
                -H "Content-Type: application/json" \
                -d '{"incident": "Security breach detected", "pod": "'"$POD_NAME"'"}'
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
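The alert step splices the pod name into a JSON string by hand, which breaks as soon as a value contains a quote or backslash. If the responder image includes Python, building the payload with the standard `json` module is a safer option. This is a sketch of the payload construction only; `build_alert_payload` is a hypothetical helper and the field names mirror the curl call above.

```python
import json


def build_alert_payload(pod_name: str) -> str:
    """Serialize the alert body, escaping any special characters in pod_name."""
    return json.dumps(
        {"incident": "Security breach detected", "pod": pod_name}
    )


if __name__ == "__main__":
    print(build_alert_payload("suspect-pod"))
    # A value with quotes still produces valid JSON.
    print(build_alert_payload('weird"name'))
```

The resulting string can be passed to curl with `-d "$PAYLOAD"` or sent directly from Python, so the script no longer depends on shell quoting being exactly right during an incident.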
Security Audit
Regular security audits are critical:
# Run Kubernetes CIS benchmark
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
# Check RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:default:default
# Scan for vulnerabilities
trivy k8s --report summary cluster
# Check network policies
kubectl get networkpolicies -A
# Verify pod security standards
kubectl get namespaces -o json | jq '.items[] | select(.metadata.labels."pod-security.kubernetes.io/enforce" != null) | {name: .metadata.name, enforce: .metadata.labels."pod-security.kubernetes.io/enforce"}'
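The jq filter in the last command lists namespaces that have an enforce label, but for an audit the more interesting question is often which namespaces have none. A Python sketch that answers both from the output of `kubectl get namespaces -o json` follows; the function names and the sample document are illustrative assumptions.

```python
import json

ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def split_by_enforcement(namespace_list_json: str):
    """Return (enforced, unenforced) from a `kubectl get ns -o json` document.

    enforced is a list of {"name", "enforce"} dicts, mirroring the jq filter;
    unenforced is a list of namespace names without the enforce label.
    """
    enforced, unenforced = [], []
    for item in json.loads(namespace_list_json).get("items", []):
        metadata = item.get("metadata", {})
        labels = metadata.get("labels") or {}
        level = labels.get(ENFORCE_LABEL)
        if level is not None:
            enforced.append({"name": metadata.get("name"), "enforce": level})
        else:
            unenforced.append(metadata.get("name"))
    return enforced, unenforced


if __name__ == "__main__":
    # Trimmed, made-up sample of what kubectl returns.
    sample = json.dumps(
        {
            "items": [
                {"metadata": {"name": "production", "labels": {ENFORCE_LABEL: "restricted"}}},
                {"metadata": {"name": "development", "labels": {ENFORCE_LABEL: "baseline"}}},
                {"metadata": {"name": "scratch"}},
            ]
        }
    )
    enforced, unenforced = split_by_enforcement(sample)
    print("enforced:", enforced)
    print("no enforce label:", unenforced)
```

In practice you would pipe the kubectl output into the script and treat a non-empty second list as an audit finding.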
Production Readiness Checklist
- [ ] Admission Control
- [ ] Kyverno policies enforced
- [ ] Resource limits required
- [ ] Image scanning enabled (consider Kyverno verifyImages or ImageValidatingPolicy for supply-chain security)
- [ ] Runtime Security
- [ ] Tetragon monitoring active
- [ ] Suspicious activity alerts configured
- [ ] Automatic remediation enabled
- [ ] Network Security
- [ ] Default deny policies
- [ ] Ingress/egress restrictions
- [ ] Service mesh encryption
- [ ] Backup & Recovery
- [ ] Automated daily backups
- [ ] Tested restore procedures
- [ ] Off-site backup storage
- [ ] Monitoring
- [ ] Security dashboards created
- [ ] Alert rules configured
- [ ] Incident response tested
- [ ] Compliance
- [ ] CIS benchmarks passing
- [ ] Pod security standards enforced
- [ ] Audit logging enabled
Conclusion
Congratulations! You've built a production-grade Kubernetes homelab that rivals enterprise deployments. Your cluster now has:
- Immutable infrastructure with Talos Linux
- Advanced networking with Cilium eBPF
- Distributed storage with Rook-Ceph
- GitOps workflows with ArgoCD
- Automated TLS with cert-manager
- Complete observability with SigNoz
- Defense in depth security
This isn't just a homelab - it's a platform for learning, experimentation, and running real workloads with confidence.
Key Takeaways
- Security is layers, not a single tool - Each layer catches what others miss
- Policy enforcement prevents problems - Better than fixing after breach
- Runtime security can catch what signatures miss - eBPF gives kernel-level visibility into process behavior
- Backups are worthless without testing - Regular restore drills are essential
- Resource allocation matters - Over-provisioning resources can lead to wasted costs and performance issues
Final Thoughts
Building this cluster taught me more about Kubernetes than any course or certification. Every failure was a learning opportunity, every optimization a small victory. Your journey will be different, but the fundamentals remain: plan carefully, implement security from the start, monitor everything, and always have backups.
Welcome to the world of production Kubernetes. May your clusters be stable and your pods always running.
References
- Kyverno Documentation: https://kyverno.io/docs/
- Tetragon Documentation: https://tetragon.io/docs/
- CrowdSec Documentation: https://docs.crowdsec.net/
- Velero Documentation: https://velero.io/docs/
- Pod Security Standards: https://kubernetes.io/docs/concepts/security/pod-security-standards/
- CIS Kubernetes Benchmark: https://www.cisecurity.org/benchmark/kubernetes
Series Complete! 🎉
You've now built a production-grade Kubernetes homelab with enterprise-level security, observability, and operational practices.
What You've Accomplished:
- Part 1: Foundation Planning - Architecture and planning
- Part 2: Talos Cluster Bootstrap - Immutable infrastructure
- Part 3: Cilium Networking - eBPF-powered networking
- Part 4: Ceph Storage - Distributed storage
- Part 5: GitOps with ArgoCD - Automated deployments
- Part 6: Ingress & Certificates - External access
- Part 7: Observability Stack - Complete monitoring
- Part 8: Security Hardening - Defense in depth (you are here)
Next Steps:
- Start deploying your applications using the GitOps patterns from Part 5
- Monitor everything with the observability stack from Part 7
- Review security policies regularly and update as threats evolve
- Consider expanding to multi-cluster setups for production workloads
Thank you for following this series. Your feedback and experiences are welcome - may they help future builders avoid our mistakes and build upon our successes.