After migrating from Longhorn to Ceph, I discovered why enterprise environments choose Ceph despite its complexity. Yes, Ceph requires 15 CPU cores and 24GB RAM in my cluster, but it delivers performance and reliability that simpler solutions can't match. This article covers deploying Rook-Ceph on Talos Linux with detailed explanations of storage concepts, proper configuration for homelab scale, and critical lessons from my storage journey.
Understanding Storage in Kubernetes
What is Distributed Storage?
Think of distributed storage like a RAID array spread across multiple servers. Instead of having all your data on one machine (single point of failure), distributed storage systems replicate your data across multiple nodes. If one server fails, your data remains accessible from the surviving nodes.
Key Storage Concepts:
- Block Storage: Raw disk-like storage (like a virtual hard drive) - used for databases
- Object Storage: Stores files as objects with metadata (like S3) - used for backups, media
- File Storage: Traditional filesystem that multiple pods can mount simultaneously
- PVC (Persistent Volume Claim): How pods request storage in Kubernetes
- Storage Class: Defines the type and properties of storage available
Why Ceph Over Longhorn
I started with Longhorn - it's simpler, lighter, and "cloud-native." After six months, here's why I switched:
Performance at Scale (My Lab Results)
- Longhorn: 45MB/s sequential writes, 2000 IOPS random
- Ceph: 280MB/s sequential writes, 15000 IOPS random
- Roughly 6x the sequential write throughput and 7.5x the random IOPS, which is what database workloads feel most
- Test conditions: 4-node cluster, 1GbE network, replicas=3, 4K block random I/O
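The ratios behind those numbers are quick to check. This is a minimal sketch using integer arithmetic only; the `speedup` helper is a name made up for this example, and it truncates rather than rounds.

```shell
# speedup BASELINE IMPROVED: print IMPROVED/BASELINE with one decimal place.
# POSIX sh has no floating point, so work in tenths and truncate.
speedup() {
  tenths=$(( $2 * 10 / $1 ))
  echo "$(( tenths / 10 )).$(( tenths % 10 ))x"
}

speedup 45 280       # sequential writes in MB/s, prints 6.2x
speedup 2000 15000   # random IOPS, prints 7.5x
```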
Reliability
- Longhorn had "split-brain" issues during network partitions (when nodes can't communicate, they disagree on data state)
- Ceph's CRUSH algorithm (Controlled Replication Under Scalable Hashing) handles failures predictably
- Better data consistency guarantees through quorum-based decisions
Features
- Native block, object, and file storage from one system
- Erasure coding for efficient storage utilization (like RAID 5/6 but distributed)
- Advanced data placement and balancing across storage tiers
The Trade-off
- Ceph needs serious resources (minimum 4GB RAM per OSD)
- Complex to troubleshoot
- Steep learning curve
Ceph Architecture Explained
Core Components:
- OSD (Object Storage Daemon): One per disk, stores actual data
- MON (Monitor): Maintains cluster maps and state (need odd number for quorum)
- MGR (Manager): Provides monitoring, orchestration, and interfaces
- MDS (Metadata Server): Only needed for CephFS (file storage)
- CRUSH Map: The cluster topology and placement rules that the CRUSH algorithm uses to decide where data is stored
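Ceph places data in two steps: an object's name is hashed to a placement group (PG), and CRUSH then maps that PG to a set of OSDs. The sketch below illustrates only the first step, and only as a toy: it uses the POSIX `cksum` CRC in place of Ceph's real hash function, and `pg_for_object` is a hypothetical helper, not a Ceph command. The point is that placement is computed, not looked up, so every client arrives at the same answer without asking a central table.

```shell
# pg_for_object NAME PG_NUM: map an object name to a placement group index.
# Toy model: hashes with the POSIX cksum CRC, not Ceph's actual hash.
pg_for_object() {
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % $2 ))
}

# The same name always lands in the same PG; different names spread out.
for obj in db-page-1 db-page-2 photo.jpg; do
  echo "$obj -> pg $(pg_for_object "$obj" 32)"
done
```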
Why These Matter:
- OSDs handle all data I/O - more OSDs = more performance
- MONs maintain consensus - lose quorum and cluster stops accepting writes
- MGR provides the dashboard and metrics we need for monitoring
- CRUSH ensures data is spread evenly and survives failures
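The "odd number for quorum" rule follows from majority voting. A small sketch of the arithmetic (helper names are made up for this example):

```shell
# quorum_size N: smallest majority of N monitors.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# tolerated_failures N: monitors that can fail while a majority remains.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "mons=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Four monitors tolerate one failure, the same as three, so the extra even-numbered monitor adds load without adding resilience.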
Pre-Installation Requirements
Hardware Verification
Time Sync Requirement: Ensure Talos NTP is properly configured and all nodes are time-synced. Clock skew causes MON quorum issues and OSD flapping.
Network Reality for Homelab: If you have ≥10GbE, consider a separate VLAN/NIC for Ceph traffic with jumbo frames. On 1GbE, compression and conservative recovery settings (shown later) are essential.
# Check available disks on each worker node
# We need dedicated disks for Ceph - it shouldn't share with the OS disk
for node in 192.168.0.14 192.168.0.15 192.168.0.16 192.168.0.17; do
  echo "=== Node $node ==="
  talosctl -n "$node" disks # Lists all disks on the node
done
# Expected output:
DEV        MODEL        SIZE     TYPE   UUID   WWID   MODALIAS
/dev/sda   Virtual HD   32 GB    HDD    -      -      scsi:t-0x00   # OS disk - don't use
/dev/sdb   Virtual HD   500 GB   HDD    -      -      scsi:t-0x00   # ← This is what Ceph will use
/dev/sdc   Virtual HD   500 GB   HDD    -      -      scsi:t-0x00   # ← Additional disk if available
# Verify memory (need 4GB+ available per OSD)
# Note: kubectl top shows current usage, not what's available
kubectl top nodes
# Check allocatable memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE:.status.allocatable.memory
# Formula: (allocatable - used) >= 4GiB * (# of OSDs on that node)
# Example: 32GiB allocatable - 2GiB used = 30GiB available = supports 7 OSDs
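The memory formula can be wrapped in a small helper for quick per-node checks. This is a sketch: `max_osds` is a made-up name, inputs are whole GiB, and the 4 GiB default mirrors the per-OSD figure used in this article.

```shell
# max_osds ALLOCATABLE_GIB USED_GIB [PER_OSD_GIB]: OSDs a node can host by RAM.
# PER_OSD_GIB defaults to 4; the result is rounded down.
max_osds() {
  per_osd=${3:-4}
  echo $(( ($1 - $2) / per_osd ))
}

max_osds 32 2   # the example above: 30 GiB free, prints 7
max_osds 16 4   # a smaller node: 12 GiB free, prints 3
```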
Troubleshooting Disk Issues:
# If no secondary disks are available:
# Check if you need to add virtual disks to VMs
# For physical servers, you'll need to install additional drives
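# If disks are already formatted:
# Ceph can't use disks with existing filesystems
# `talosctl disks` doesn't list filesystem types (see the columns above), so inspect
# discovered volumes instead (Talos 1.8+); formatted devices show ext4, xfs, etc.
talosctl -n 192.168.0.14 get discoveredvolumes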
# If you see formatted disks, you'll need to wipe them:
# WARNING: This destroys all data on the disk!
# Only run this on dedicated Ceph disks, NEVER the OS disk!
# Use the bare device name (e.g., sdb, not /dev/sdb):
talosctl -n 192.168.0.14 wipe disk sdb --insecure
# Verify disk performance (optional but recommended):
# For disk performance baselines, rely on rados bench after Ceph is deployed
# (Talos doesn't provide shell access for raw disk tests)
Why These Requirements:
- Dedicated Disk: Ceph performs best with dedicated disks - sharing with OS causes I/O contention and performance degradation
- 4GB RAM per OSD: Ceph uses RAM for BlueStore cache, metadata, and operations - insufficient RAM causes 90% performance loss
- CPU: While not as critical, each OSD needs ~1 CPU core for good performance under load
Talos System Extensions
What are System Extensions?
Talos Linux is immutable - you can't install packages like on Ubuntu. System extensions are pre-built packages that add functionality to Talos. Think of them like drivers that need to be baked into the OS image.
# Check for extensions on a node
talosctl -n 192.168.0.14 get extensions
# Required extensions for Ceph:
# - siderolabs/util-linux-tools # Provides mount, lsblk, blkid, and filesystem utilities
If missing, rebuild Talos with extensions:
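# Create custom Talos image with extensions
# Use imager version matching your Talos release
# A comment can't follow a trailing backslash (it ends the command), so the arguments are explained here:
#   iso                        output type, written to _out/
#   --arch amd64               CPU architecture
#   --system-extension-image   extension image to bake in (mount utilities);
#                              replace <tag> with the util-linux-tools tag published for your Talos release
docker run --rm -t -v $PWD/_out:/out \
  ghcr.io/siderolabs/imager:v1.11.0 iso \
  --arch amd64 \
  --system-extension-image ghcr.io/siderolabs/util-linux-tools:<tag>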
# This creates a new Talos ISO/image in _out/ directory
# You'll need to reinstall Talos nodes with this image
Why These Extensions:
- util-linux-tools: Required for mounting Ceph filesystems and disk utilities
- Note: Ceph RBD uses the kernel RBD client (krbd) or librbd, NOT iSCSI. The iSCSI protocol is only involved when you deploy the optional Ceph iSCSI Gateway to export RBD volumes to non-Ceph clients
Kernel Modules
What are Kernel Modules?
Kernel modules are drivers that run in the Linux kernel. Ceph needs specific modules to communicate with storage devices and manage the filesystem.
Apply this patch to all worker nodes:
# ceph-kernel-modules.yaml
machine:
  kernel:
    modules:
      - name: dm_mod # Device mapper (required for encryption)
      - name: dm_crypt # LUKS encryption support for encrypted RBD volumes
      - name: libceph # Core network client (some kernels don't autoload via ceph)
      - name: rbd # RADOS Block Device - enables block storage via kernel RBD client
      - name: ceph # Core Ceph filesystem support (needed for CephFS kernel client)
      - name: nbd # Network Block Device - optional for rbd-nbd fallback
  sysctls:
    vm.swappiness: 0 # Disable swap - Ceph needs real RAM, not swap
    kernel.pid_max: 4194304 # Increase max processes - Ceph spawns many
    fs.aio-max-nr: 1048576 # Increase async I/O operations - improves performance
# Apply the patch to all worker nodes simultaneously
talosctl patch machineconfig \
--nodes 192.168.0.14,192.168.0.15,192.168.0.16,192.168.0.17 \
--patch @ceph-kernel-modules.yaml
# Nodes will reboot to apply kernel changes
# Wait for them to come back:
kubectl get nodes -w # -w watches for changes
Why These Settings:
- vm.swappiness=0: Ceph performs terribly with swap - it needs physical RAM for caching
- kernel.pid_max: Ceph creates many processes for parallel I/O operations
- fs.aio-max-nr: Allows more concurrent disk operations for better performance
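After the nodes reboot, it is worth confirming the sysctls took effect. The helper below is a sketch that only compares values; `check_sysctl` is a made-up name, and the commented `talosctl read` call assumes the node lets you read that `/proc/sys` path.

```shell
# check_sysctl NAME EXPECTED ACTUAL: report whether a sysctl has the patched value.
# Returns 1 on a mismatch so it can gate later steps in a script.
check_sysctl() {
  if [ "$2" = "$3" ]; then
    echo "OK   $1=$3"
  else
    echo "DIFF $1=$3 (expected $2)"
    return 1
  fi
}

# Against a real node (assumption: this path is readable through talosctl):
#   check_sysctl vm.swappiness 0 "$(talosctl -n 192.168.0.14 read /proc/sys/vm/swappiness)"
check_sysctl vm.swappiness 0 0
check_sysctl fs.aio-max-nr 1048576 65536 || echo "patch not applied yet"
```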
Installing Rook Operator
What is Rook?
Rook is a Kubernetes operator that automates Ceph deployment and management. Think of it as the bridge between Kubernetes and Ceph - it translates Kubernetes resources into Ceph configurations.
# Add Rook Helm repository
helm repo add rook-release https://charts.rook.io/release
helm repo update
# 1) CRDs: the rook-ceph chart installs and upgrades the Rook CRDs itself
#    (chart value crds.enabled, on by default), so no separate CRD release is needed
# 2) Install Rook operator with specific settings
#    A comment can't follow a trailing backslash (it ends the command), so the flags are explained here:
#    --create-namespace                 creates namespace if not exists
#    --version                          pin to tested version
#    enableDiscoveryDaemon=true         auto-discover new disks
#    csi.enableCephfsSnapshotter=true   enable filesystem snapshots
#    csi.enableRBDSnapshotter=true      enable block device snapshots
helm install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph \
  --create-namespace \
  --version 1.18.2 \
  --set enableDiscoveryDaemon=true \
  --set csi.enableCephfsSnapshotter=true \
  --set csi.enableRBDSnapshotter=true
# When upgrading Rook later:
# helm upgrade rook-ceph rook-release/rook-ceph --namespace rook-ceph --version NEW_VERSION
What These Settings Do:
- enableDiscoveryDaemon: Automatically finds new disks added to nodes
- CSI snapshotters: Allow backing up volumes through Kubernetes snapshots
Wait for operator readiness:
# Wait up to 5 minutes for the operator to be ready
kubectl -n rook-ceph wait --for=condition=ready pod -l app=rook-ceph-operator --timeout=300s
# Verify operator is running
kubectl -n rook-ceph get pods
NAME READY STATUS
rook-ceph-operator-7b9c5f8d8c-xxxxx 1/1 Running # Should see this
Installing the Ceph Toolbox
Important: Before running any ceph commands, we need to deploy the toolbox pod:
# Deploy the Ceph toolbox for CLI management (using versioned URL)
# Use toolbox version matching your Rook operator version
kubectl create -f https://raw.githubusercontent.com/rook/rook/v1.18.2/deploy/examples/toolbox.yaml
# Wait for toolbox to be ready
kubectl -n rook-ceph rollout status deploy/rook-ceph-tools
# Verify toolbox is running
kubectl -n rook-ceph get pods -l app=rook-ceph-tools
NAME READY STATUS
rook-ceph-tools-7b9c5f8d8c-xxxxx 1/1 Running
All subsequent ceph commands will be run through this toolbox pod.
Ceph Cluster Configuration
Here's my production configuration optimized for homelab:
# ceph-cluster.yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # ⚠️ UPGRADE ORDER: Rook CRDs → Rook operator → Ceph image. Wait for HEALTH_OK between each.
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2 # Pin to tested Reef version
    # Note: Keep Rook and Ceph within supported matrix
    # Always wait for HEALTH_OK between upgrade steps - never skip versions
  dataDirHostPath: /var/lib/rook # Where Rook stores config on host
  mon:
    count: 3 # Must be odd number for quorum (consensus voting)
    allowMultiplePerNode: false # Spread monitors across nodes for HA
    # Note: Most clusters use dataDirHostPath for monitors instead of PVCs
    # Only use volumeClaimTemplate if you have a local-path StorageClass configured
  mgr:
    count: 2 # One active, one standby for failover
    allowMultiplePerNode: false
    modules:
      - name: prometheus # Expose metrics for Prometheus monitoring
        enabled: true
      - name: devicehealth # Monitor disk SMART data for early failure detection
        enabled: true
      - name: pg_autoscaler # Automatically adjusts placement groups (usually on by default)
        enabled: true
      - name: balancer # Evenly distributes data across OSDs (usually on by default)
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  network:
    connections:
      encryption:
        enabled: false # Enables msgr2 payload encryption (CPU cost)
        # Note: Ceph always uses cephx auth; this toggles wire encryption
        # Most homelabs run compression on, encryption off for performance
      compression:
        enabled: true # Compresses data in transit (saves bandwidth)
    ipFamily: IPv4
    dualStack: false # We're not using IPv6
  crashCollector:
    disable: false
  cleanupPolicy:
    confirmation: ""
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-role.kubernetes.io/control-plane
                  operator: DoesNotExist # Don't run on control plane nodes
      tolerations:
        - key: storage-node
          operator: Exists # Can run on nodes tainted as storage-node
  resources:
    mon:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2000m
        memory: 2Gi
    mgr:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 2000m
        memory: 2Gi
    osd:
      requests:
        cpu: 1000m
        memory: 4Gi
      limits:
        cpu: 4000m
        memory: 8Gi
  storage:
    useAllNodes: false # Don't automatically use all nodes
    useAllDevices: false # Don't automatically use all disks (safer)
    nodes:
      - name: talos-wrk-01
        devices:
          - name: /dev/sdb # Second disk dedicated to Ceph (not OS disk)
            config:
              osdsPerDevice: "1" # One OSD per disk (standard practice)
      - name: talos-wrk-02
        devices:
          - name: /dev/sdb
            config:
              osdsPerDevice: "1"
      - name: talos-wrk-03
        devices:
          - name: /dev/sdb
            config:
              osdsPerDevice: "1"
      - name: talos-wrk-04
        devices:
          - name: /dev/sdb
            config:
              osdsPerDevice: "1"
      # Optional: HDD OSDs with SSD DB/WAL for big performance win
      # If you have a small SSD per node, use it for BlueStore DB/WAL:
      # - name: talos-wrk-05
      #   config:
      #     metadataDevice: /dev/nvme0n1 # SSD for DB/WAL for all OSDs on this node
      #     # databaseSizeMB: 2048 # Tune if SSD is small (default: 5% of OSD)
      #     # walSizeMB: 1024 # Tune if SSD is small (default: 1GB)
      #   devices:
      #     - name: /dev/sdb # HDD data devices
      #     - name: /dev/sdc
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
Deploy the cluster:
# Optional: Taint storage nodes to dedicate them to Ceph
kubectl taint nodes talos-wrk-01 storage-node=true:NoSchedule
kubectl taint nodes talos-wrk-02 storage-node=true:NoSchedule
kubectl taint nodes talos-wrk-03 storage-node=true:NoSchedule
kubectl taint nodes talos-wrk-04 storage-node=true:NoSchedule
# Your Ceph pods already tolerate key=storage-node
kubectl apply -f ceph-cluster.yaml
# Watch deployment progress (updates every 2 seconds)
watch kubectl -n rook-ceph get cephcluster
# Expected progression:
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE
rook-ceph /var/lib/rook 3 1m Progressing Configuring Ceph Mons
rook-ceph /var/lib/rook 3 5m Progressing Configuring Ceph OSDs
rook-ceph /var/lib/rook 3 10m Ready Cluster created successfully
# Press Ctrl+C to exit watch
Monitoring Deployment Progress
Ceph deployment takes 10-15 minutes. Monitor progress:
# Watch OSD (storage daemon) creation
# -w flag watches for changes in real-time
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -w
# Expected output:
NAME READY STATUS
rook-ceph-osd-0-xxxxx 0/1 Init:0/2 # Initializing
rook-ceph-osd-0-xxxxx 0/1 Init:1/2 # Still initializing
rook-ceph-osd-0-xxxxx 1/1 Running # Ready!
# ⚠️ CRITICAL: Verify device classes BEFORE creating pools
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# Check that SSDs show class 'ssd' and HDDs show 'hdd'
# If wrong, fix device classes then:
# ceph osd crush rule rm replicapool-fast_crush_rule || true
# Re-apply ceph-pools.yaml so Rook regenerates CRUSH rules
# Check overall Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Understanding the output:
cluster:
  id: abc123
  health: HEALTH_OK # Green - everything working
  # HEALTH_WARN # Yellow - something needs attention
  # HEALTH_ERR # Red - critical issue
services:
  mon: 3 daemons, quorum a,b,c # 3 monitors maintaining quorum
  mgr: a(active), standbys: b # Manager with standby ready
  osd: 4 osds: 4 up, 4 in # All 4 storage daemons running
data:
  pools: 0 pools, 0 pgs # No storage pools yet (we'll create them)
  objects: 0 objects, 0 B # No data stored yet
  usage: 4.0 GiB used, 1.96 TiB / 2 TiB avail # Overhead vs available
  pgs: # Placement groups (data shards)
What to Look For:
- HEALTH_OK: Cluster is fully operational
- All OSDs "up": All storage daemons are running
- Quorum established: Monitors can make decisions
- Manager active: Dashboard and metrics available
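Since several later steps (and every upgrade) depend on reaching HEALTH_OK, a polling helper saves re-running the status command by hand. This is a sketch: `wait_for_health_ok` is a made-up function, and it takes the health command as arguments so it can be tried without a cluster.

```shell
# wait_for_health_ok ATTEMPTS DELAY_SECONDS COMMAND...: re-run COMMAND until its
# output starts with HEALTH_OK. Returns 1 if that never happens.
wait_for_health_ok() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    status=$("$@" 2>/dev/null) || status=""
    case $status in
      HEALTH_OK*) echo "healthy after $i check(s)"; return 0 ;;
    esac
    echo "check $i/$attempts: ${status:-no response}"
    i=$(( i + 1 ))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Real use (toolbox invocation as above, without -it because no TTY is needed):
#   wait_for_health_ok 60 10 kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health
wait_for_health_ok 3 0 echo HEALTH_OK
```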
Storage Classes
What are Storage Classes?
Storage Classes define different tiers of storage with specific properties. Think of them like different types of cloud storage - standard, premium, archive. Applications request storage through PVCs (Persistent Volume Claims) that reference these classes.
Note on TRIM/discard for RBD volumes: While the discard mount option enables continuous TRIM, it can hurt performance on RBD volumes. Consider running fstrim periodically instead. This is configured at the application level, not in the StorageClass. Note that fstrim requires filesystem support (ext4 and xfs have it) and root inside the container, which the CronJob below sets via runAsUser: 0. The underlying FITRIM ioctl also needs the CAP_SYS_ADMIN capability, so the container must be granted it or fstrim fails with a permission error.
Optional: Periodic TRIM CronJob pattern for applications
# fstrim-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fstrim-weekly
spec:
  schedule: "0 3 * * 0" # Sundays 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: fstrim
              image: alpine:3.20
              securityContext:
                runAsUser: 0
                capabilities:
                  add: ["SYS_ADMIN"] # FITRIM ioctl requires CAP_SYS_ADMIN
              command: ["sh","-c","apk add --no-cache util-linux && fstrim -v /data"]
              volumeMounts:
                - name: v
                  mountPath: /data
          volumes:
            - name: v
              persistentVolumeClaim:
                claimName: your-app-pvc # Replace with your PVC
About clusterID: The clusterID parameter refers to the Rook namespace (e.g., rook-ceph), not the Ceph FSID. The CSI driver uses this to locate the Ceph cluster.
Create storage classes for different workload types:
# storage-classes.yaml
---
# Fast NVMe pool for databases (high IOPS needed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-fast
  annotations:
    storageclass.kubernetes.io/is-default-class: "true" # Default if PVC doesn't specify
    # ⚠️ WARNING: ALL unspecified PVCs will use expensive fast storage!
    # For cost-sensitive homelabs: Make 'slow' the default instead to avoid surprise IO costs
provisioner: rook-ceph.rbd.csi.ceph.com # Which driver handles provisioning
parameters:
  clusterID: rook-ceph # Rook namespace, not Ceph FSID
  pool: replicapool-fast # Which Ceph pool to use
  imageFormat: "2" # RBD format (v2 is current)
  imageFeatures: layering # Start conservative; add features after kernel validation
  # When ready (kernels ≥5.4 with Reef), enable faster snapshot/clone features:
  # imageFeatures: layering,exclusive-lock,object-map,fast-diff
  # Note: object-map and fast-diff depend on exclusive-lock
  # mounter: rbd-nbd # Optional: Use userspace mounter instead of krbd if kernel issues
  # mapOptions: "queue_depth=128" # For krbd only: More parallelism for DBs (omit for rbd-nbd)
  # The following are credentials for Ceph operations
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4 # Safest on Talos (xfs requires xfsprogs system extension)
allowVolumeExpansion: true # Can grow volumes without recreating
reclaimPolicy: Delete # Delete data when PVC deleted
volumeBindingMode: Immediate # Provision as soon as the PVC is created
# Note: WaitForFirstConsumer (used by ceph-block-slow below) delays provisioning until a Pod
# is scheduled. Both modes work for network storage; a WaitForFirstConsumer PVC stays
# Pending until a Pod uses it, which is expected behavior rather than a fault
---
# Standard HDD pool for media/backups (where capacity matters more than speed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-slow
  # annotations:
  #   storageclass.kubernetes.io/is-default-class: "true" # Safer default for homelabs
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool-slow # Different pool with different settings
  imageFormat: "2"
  imageFeatures: layering # Core feature for thin provisioning
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Retain # Keep data even after PVC deleted (safer for backups)
volumeBindingMode: WaitForFirstConsumer # Wait until pod schedules to provision
---
# Shared filesystem for multi-attach (multiple pods can mount simultaneously)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-filesystem
provisioner: rook-ceph.cephfs.csi.ceph.com # Different provisioner for CephFS
parameters:
  clusterID: rook-ceph
  fsName: ceph-filesystem
  # Note: 'pool' parameter is optional - CSI will use the default data pool
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
allowVolumeExpansion: true
# Note: 'discard' mount option is for block devices, not applicable to CephFS
Creating Ceph Pools
What are Ceph Pools?
Pools are logical partitions in Ceph storage. Each pool can have different replication levels, performance settings, and quotas. Think of them like different RAID configurations for different purposes.
Want erasure coding for cold data? Use a replicated metadata pool + EC data pool for RBD. See the Rook storageclass-ec.yaml example for configuration details.
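To see what replication and erasure coding cost in capacity, compare usable space for the same raw disks. This sketch ignores BlueStore overhead and the headroom Ceph needs before it reports OSDs as full; the helper names and the 2+1 erasure-coding profile are illustrative choices, not settings from this cluster.

```shell
# usable_replicated RAW_GB SIZE: usable capacity when SIZE full copies are kept.
usable_replicated() { echo $(( $1 / $2 )); }

# usable_ec RAW_GB K M: usable capacity with K data chunks + M coding chunks.
usable_ec() { echo $(( $1 * $2 / ($2 + $3) )); }

raw=2000   # 4 OSDs x 500 GB, as in this cluster
echo "replicated size=3: $(usable_replicated "$raw" 3) GB"
echo "replicated size=2: $(usable_replicated "$raw" 2) GB"
echo "erasure coded 2+1: $(usable_ec "$raw" 2 1) GB"
```

A 2+1 profile survives one host failure like size 2 does, but yields about a third more usable space, at the cost of extra CPU and slower small writes.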
Define pools with appropriate replication:
Important: If your OSDs aren't classified yet, either run the 'Verify Device Classes' step first or remove deviceClass from the pool spec and add it later.
# ceph-pools.yaml
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool-fast
  namespace: rook-ceph
spec:
  failureDomain: host # Replicas on different hosts (not just different disks)
  deviceClass: ssd # Only use SSD/NVMe OSDs for fast pool
  replicated:
    size: 3 # Keep 3 copies of all data
    requireSafeReplicaSize: true # Refuse unsafe settings such as size: 1 (write availability is governed by min_size)
  parameters:
    compression_mode: aggressive # Compress unless the client hints data is incompressible (use 'passive' if CPU-constrained)
    compression_algorithm: lz4 # Fast compression (alternatives: snappy, zlib, zstd)
    # DBs: If CPU tight or latency critical, use 'passive' and enable DB-level compression instead
  quotas:
    maxSize: 500Gi # Prevent single pool from using all storage
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool-slow
  namespace: rook-ceph
spec:
  failureDomain: host
  deviceClass: hdd # Only use HDD OSDs for slow/bulk pool
  replicated:
    size: 2 # Only 2 copies for bulk storage (increases data-loss risk during second failure; only for cold/non-critical data)
    # ⚠️ Use size:2 only if you have ≥3 hosts; with two hosts you risk data unavailability during maintenance
    requireSafeReplicaSize: true # Refuse unsafe settings such as size: 1
  parameters:
    compression_mode: passive
---
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: ceph-filesystem
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3 # Metadata is critical - keep 3 copies
  dataPools:
    - name: data0
      failureDomain: host
      replicated:
        size: 3 # File data also replicated 3x
  metadataServer:
    activeCount: 1 # One active MDS (metadata server)
    activeStandby: true # Enables standby-replay mode for near-instant failover
    # Costs more RAM/CPU but gives <5s MDS switchover vs 30s+ cold standby
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2000m
        memory: 4Gi
Apply the pools:
# Create the pools first
kubectl apply -f ceph-pools.yaml
# Wait for pools to be ready (about 30 seconds)
sleep 30
# Then create storage classes that reference them
kubectl apply -f storage-classes.yaml
# Verify pools were created
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls
# Expected output:
replicapool-fast
replicapool-slow
ceph-filesystem-metadata
ceph-filesystem-data0
# Ensure RBD application is enabled (Rook usually handles this)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
ceph osd pool application get replicapool-fast || \
ceph osd pool application enable replicapool-fast rbd
ceph osd pool application get replicapool-slow || \
ceph osd pool application enable replicapool-slow rbd
# For size:2 pool, set min_size 2 to prevent single-replica writes
# (trade-off: I/O to affected PGs pauses whenever one replica is down)
ceph osd pool set replicapool-slow min_size 2
# Add hard object quotas to prevent runaway apps from eating the cluster
ceph osd pool set-quota replicapool-fast max_objects 20000000 # 20M objects
ceph osd pool set-quota replicapool-slow max_objects 50000000 # 50M objects
# (or use max_bytes if that is easier for your capacity planning;
# avoid apostrophes in these comments, they would end the quoted bash -c script)
'
# Verify CephFS is created correctly
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs ls
# Should show: name: ceph-filesystem, metadata pool: ceph-filesystem-metadata, data pools: [ceph-filesystem-data0]
# Check storage classes
kubectl get storageclass
NAME PROVISIONER AGE
ceph-block-fast (default) rook-ceph.rbd.csi.ceph.com 1m
ceph-block-slow rook-ceph.rbd.csi.ceph.com 1m
ceph-filesystem rook-ceph.cephfs.csi.ceph.com 1m
# Verify PG autoscaler is doing the right thing
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool autoscale-status | column -t
# Aim for ~50-200 PGs per OSD total across all pools
# Look for warn/ok status - "warn" may need target_size_ratio adjustment
# Check PG distribution per OSD to spot outliers
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
# Look for wildly uneven PG counts relative to weight
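The 50-200 target counts PG replicas, so each pool contributes pg_num multiplied by its replica size. A sketch of that arithmetic follows; `pgs_per_osd` is a made-up helper, the pg_num values are hypothetical (read the real ones from autoscale-status), and it averages over all OSDs even though device-class pools only land on a subset.

```shell
# pgs_per_osd NUM_OSDS PG_NUM:SIZE...: average PG replicas carried by each OSD.
pgs_per_osd() {
  osds=$1; shift
  total=0
  for pool in "$@"; do
    pg_num=${pool%%:*}   # text before the colon
    size=${pool##*:}     # text after the colon
    total=$(( total + pg_num * size ))
  done
  echo $(( total / osds ))
}

# fast (32 PGs x3), slow (32 x2), CephFS metadata (16 x3), CephFS data (32 x3) on 4 OSDs
pgs_per_osd 4 32:3 32:2 16:3 32:3   # prints 76
```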
Verify Device Classes (Important for VMs)
Virtual machines sometimes misreport disk types. Verify OSDs have correct device classes:
# Check how OSDs are classified (ssd vs hdd)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# Sample output:
# ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT
# -1         4.00000  root default
# -3         1.00000      host talos-wrk-01
#  0    hdd  1.00000          osd.0              up   1.00000
# If an SSD shows as 'hdd', manually correct it:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd crush rm-device-class osd.0
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd crush set-device-class ssd osd.0
# This ensures your fast/slow pools use the correct OSDs
# Verify CRUSH rules target the correct device classes:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd crush rule ls
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd crush rule dump replicapool-fast_crush_rule
# Expect to see: "device_class": "ssd" in the rule definition
# If missing, the rule targets all device classes - fix and recreate
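With more than a handful of OSDs, scanning the tree by eye gets error-prone. The filter below is a sketch that lists OSDs in a given device class from `ceph osd tree` text on stdin; `osds_in_class` is a made-up name, and it assumes the column order shown in the sample output (ID, CLASS, WEIGHT, NAME).

```shell
# osds_in_class CLASS: read `ceph osd tree` text on stdin, print OSD names in CLASS.
# Root and host rows have no class column, so their second field never matches.
osds_in_class() {
  awk -v want="$1" '$2 == want && $4 ~ /^osd\./ { print $4 }'
}

# Sample input in the format shown above; pipe real `ceph osd tree` output instead.
osds_in_class hdd <<'EOF'
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT
-1         4.00000  root default
-3         1.00000      host talos-wrk-01
 0    hdd  1.00000          osd.0              up   1.00000
-5         1.00000      host talos-wrk-02
 1    ssd  1.00000          osd.1              up   1.00000
EOF
```

If every SSD-backed OSD turns up under hdd, the fast pool's CRUSH rule has nothing to place data on until the classes are corrected.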
Testing Storage
Always Test Before Production Use!
Before deploying real applications, verify storage works correctly. This test creates a PVC, mounts it in a pod, and tests read/write performance.
# test-storage.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-block-pvc
spec:
  accessModes:
    - ReadWriteOnce # Single pod can mount read-write
  storageClassName: ceph-block-fast # Use our fast storage class
  resources:
    requests:
      storage: 10Gi # Request 10GB volume
---
apiVersion: v1
kind: Pod
metadata:
  name: test-block-pod
spec:
  containers:
    - name: test
      image: alpine:3.20
      # Keep the container alive for kubectl exec; alpine's default shell exits immediately without a TTY
      command: ["sh","-c","sleep 3600"]
      volumeMounts:
        - name: test-volume
          mountPath: /data
  volumes:
    - name: test-volume
      persistentVolumeClaim:
        claimName: test-block-pvc
# Create test PVC and pod
kubectl apply -f test-storage.yaml
# Watch PVC status change from Pending to Bound
kubectl get pvc test-block-pvc -w
NAME STATUS VOLUME CAPACITY
test-block-pvc Pending # Ceph creating volume
test-block-pvc Bound pvc-xxx 10Gi # Volume ready!
# Verify pod is running
kubectl get pod test-block-pod
NAME READY STATUS
test-block-pod 1/1 Running
# Test with less compressible data (zeros compress too well with LZ4)
# Using fio for realistic benchmarks:
# Random write test (database inserts)
# Note: Use io_uring for kernels ≥5.10, fallback to libaio if unsupported
kubectl exec -it test-block-pod -- sh -c \
'apk add --no-cache fio && \
fio -name=randwrite -filename=/data/test.fio \
-ioengine=io_uring -direct=1 -bs=4k -iodepth=32 -rw=randwrite \
-numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0 || \
fio -name=randwrite -filename=/data/test.fio \
-ioengine=libaio -direct=1 -bs=4k -iodepth=32 -rw=randwrite \
-numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0'
# Random read test (database queries)
kubectl exec -it test-block-pod -- sh -c \
'fio -name=randread -filename=/data/test.fio \
-ioengine=io_uring -direct=1 -bs=4k -iodepth=32 -rw=randread \
-numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0 || \
fio -name=randread -filename=/data/test.fio \
-ioengine=libaio -direct=1 -bs=4k -iodepth=32 -rw=randread \
-numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0'
# Mixed 70/30 read/write (typical DB workload)
kubectl exec -it test-block-pod -- sh -c \
'fio -name=randrw -filename=/data/test.fio \
-ioengine=io_uring -direct=1 -bs=4k -iodepth=32 -rw=randrw -rwmixread=70 \
-numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0 || \
fio -name=randrw -filename=/data/test.fio \
-ioengine=libaio -direct=1 -bs=4k -iodepth=32 -rw=randrw -rwmixread=70 \
-numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0'
# Output shows real IOPS and bandwidth without compression artifacts
# Optional: Direct RBD benchmark to isolate CSI/RBD path
# This tests raw RBD performance without Kubernetes CSI overhead
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
# Create a temporary test image (not the PVC image)
rbd pool init replicapool-fast || true
rbd create -p replicapool-fast benchtest --size 10240 # 10GB test image
# Run direct RBD benchmark (4K random writes)
rbd bench -p replicapool-fast benchtest --io-type write \
--io-size 4K --io-threads 16 --io-total 1G --io-pattern rand
# Clean up test image
rbd rm -p replicapool-fast benchtest
'
# Compare these results with FIO to identify CSI overhead
# Cleanup test resources
kubectl delete -f test-storage.yaml
# The PVC and its data will be deleted
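fio reports both IOPS and bandwidth, and at a fixed block size one follows from the other. The conversion below is a sketch for sanity-checking results; `iops_to_mibs` is a made-up helper and truncates to whole MiB/s.

```shell
# iops_to_mibs IOPS BLOCK_KIB: throughput in whole MiB/s at the given block size.
iops_to_mibs() { echo $(( $1 * $2 / 1024 )); }

iops_to_mibs 15000 4   # 15000 IOPS at 4K, prints 58
iops_to_mibs 2000 4    # 2000 IOPS at 4K, prints 7
```

Small-block random I/O moves little data per second even at high IOPS, which is why the random and sequential figures cannot be compared directly.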
Test CephFS (shared filesystem):
# test-cephfs.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs-pvc
spec:
  accessModes: [ ReadWriteMany ] # Multiple pods can mount
  storageClassName: ceph-filesystem
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-cephfs-pod
spec:
  containers:
    - name: app
      image: alpine:3.20
      command: ["sh","-c","apk add --no-cache coreutils && mkdir -p /shared && dd if=/dev/zero of=/shared/test bs=1M count=256 && ls -lh /shared && sleep 3600"]
      volumeMounts:
        - name: shared
          mountPath: /shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: test-cephfs-pvc
# Deploy CephFS test
kubectl apply -f test-cephfs.yaml
# Verify the pod created the test file
kubectl logs test-cephfs-pod
# Should show: -rw-r--r-- 1 root root 256M ... /shared/test
# Note: Throughput may look high due to compression; use fio for realistic tests
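# Clean up the test pod; keep the PVC if you plan to run the RWX test below, which reuses it
kubectl delete pod test-cephfs-pod
# Skipping the RWX test? Remove everything instead:
# kubectl delete -f test-cephfs.yaml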
Test CephFS RWX concurrency (proving multiple pods share the filesystem):
# cephfs-rwx-two-pods.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-writer-a
spec:
  containers:
    - name: a
      image: alpine:3.20
      command: ["sh","-c","apk add --no-cache coreutils && while true; do echo A-$(date +%s) >> /shared/log; sleep 1; done"]
      volumeMounts:
        - name: shared
          mountPath: /shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: test-cephfs-pvc # Reuses the PVC from above
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-reader-b
spec:
  containers:
    - name: b
      image: alpine:3.20
      command: ["sh","-c","tail -f /shared/log"]
      volumeMounts:
        - name: shared
          mountPath: /shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: test-cephfs-pvc # Same PVC - truly shared filesystem
kubectl apply -f cephfs-rwx-two-pods.yaml
# Watch reader-b stream lines written by writer-a
kubectl logs -f cephfs-reader-b
# Should see: A-1734567890, A-1734567891, ... proving concurrent access
# Verify both pods see the same inode (truly shared filesystem)
kubectl exec cephfs-writer-a -- stat -c '%i' /shared/log
kubectl exec cephfs-reader-b -- stat -c '%i' /shared/log
# Both should return the same inode number
# Clean up (explicit pod deletion in case of ctrl-c)
kubectl delete pod cephfs-writer-a cephfs-reader-b --ignore-not-found
kubectl delete pvc test-cephfs-pvc
Creating Snapshot Classes (Optional)
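Prerequisites: Ensure the snapshot.storage.k8s.io CRDs and the external snapshot-controller are installed for your cluster/distro. These are not provided by Rook.
Optional: Install snapshot controller if not present
# Install external snapshot CRDs + controller (cluster-wide, do once)
# raw.githubusercontent.com cannot serve a directory, so apply each CRD file individually
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml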
If you enabled CSI snapshotters in the Rook operator, create VolumeSnapshotClasses:
# snapshot-classes.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
driver: rook-ceph.rbd.csi.ceph.com
deletionPolicy: Delete # Or 'Retain' for production to prevent accidental deletion
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-cephfsplugin-snapclass
driver: rook-ceph.cephfs.csi.ceph.com
deletionPolicy: Delete # Consider 'Retain' for production snapshots
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
# Note: CephFS snapshots work via CSI but restoring RWX volumes requires apps
# to handle shared filesystem semantics (file locks, cache coherency)
Apply if you plan to use CSI snapshots:
kubectl apply -f snapshot-classes.yaml
# Sanity check: ensure snapshot CRDs exist
kubectl get crd | grep snapshot
# Should show:
# volumesnapshotclasses.snapshot.storage.k8s.io
# volumesnapshotcontents.snapshot.storage.k8s.io
# volumesnapshots.snapshot.storage.k8s.io
Test RBD snapshots:
# test-rbd-snapshot.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-block-snap
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: test-block-pvc # From earlier RBD test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: ceph-block-fast
  dataSource:
    name: test-block-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: restore-pod
spec:
  containers:
    - name: app
      image: alpine:3.20
      command: ["sh","-c","ls -lh /data && cat /data/test 2>/dev/null | head -c 100 && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: restore-pvc
# Test snapshot and restore
kubectl apply -f test-rbd-snapshot.yaml
# Check snapshot created
kubectl get volumesnapshot
NAME READYTOUSE SOURCEPVC AGE
test-block-snap true test-block-pvc 30s
# Verify restored data
kubectl logs restore-pod
# Should show the test file from original PVC
# Clean up
kubectl delete -f test-rbd-snapshot.yaml
Object Storage (S3) - Optional
If you need S3-compatible object storage:
# ceph-object.yaml
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
    deviceClass: ssd # Metadata benefits from fast storage
  dataPool:
    replicated:
      size: 3
    deviceClass: hdd # Object data can use slower storage
    # RGW pools honor CRUSH device_class; verify classes with 'ceph osd tree' if RGW targets wrong media
  gateway:
    port: 80
    instances: 1
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: s3-user
  namespace: rook-ceph
spec:
  store: my-store
  displayName: "homelab-s3"
# Deploy object storage
kubectl apply -f ceph-object.yaml
# Wait for RGW pod to be ready (the default 30s wait timeout is often too short for a first start)
kubectl -n rook-ceph wait --for=condition=ready pod -l app=rook-ceph-rgw --timeout=300s
# Expose RGW service locally (for testing only)
kubectl -n rook-ceph port-forward svc/rook-ceph-rgw-my-store 8080:80 &
# Note: Port-forward is for testing; use Ingress/LoadBalancer for real clients
# Get S3 credentials
ACCESS_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-my-store-s3-user \
-o jsonpath='{.data.AccessKey}' | base64 -d)
SECRET_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-my-store-s3-user \
-o jsonpath='{.data.SecretKey}' | base64 -d)
echo "Access Key: $ACCESS_KEY"
echo "Secret Key: $SECRET_KEY"
echo "Endpoint: http://localhost:8080"
# Quick smoke test with AWS CLI
apt-get update && apt-get install -y awscli # Or use a container with awscli
aws configure set aws_access_key_id "$ACCESS_KEY"
aws configure set aws_secret_access_key "$SECRET_KEY"
aws configure set default.region us-east-1 # Required even for local S3
aws --endpoint-url http://localhost:8080 s3 mb s3://test-bucket
aws --endpoint-url http://localhost:8080 s3 ls
aws --endpoint-url http://localhost:8080 s3 cp /etc/hosts s3://test-bucket/
aws --endpoint-url http://localhost:8080 s3 ls s3://test-bucket/
Performance Tuning
OSD Optimization
Why Tune Performance?
Default Ceph settings are conservative for stability. In a homelab, we can be more aggressive for better performance.
# Set OSD memory target (default is ~4GB per OSD in modern Ceph)
# Increasing to 6GB provides more cache for better performance
# Calculate: 6GB × #OSDs per node - ensure you leave headroom for kubelet/system
# Note: With bluestore_cache_autotune enabled (the default), BlueStore resizes its caches
# to keep each OSD near this value; it is a target, not a hard limit
# Re-check node headroom during initial rebalance and backups
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph config set osd osd_memory_target 6442450944 # 6GB in bytes
# Verify the setting took effect
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph config get osd osd_memory_target
# Should show: 6442450944
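# Enable bluestore compression on the pool
# LZ4 provides fast compression with good ratio
# Alternative algorithms: snappy, zlib, zstd (zstd can be CPU-heavy for small writes)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool set replicapool-fast compression_algorithm lz4
# The algorithm alone has no effect while compression_mode is "none" (the default);
# set a mode here unless your pool definition already does
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool set replicapool-fast compression_mode aggressive
# Compression is transparent to clients
# Verify compression settings:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool get replicapool-fast compression_algorithm
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool get replicapool-fast compression_mode
# Also check: compression_min_blob_size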
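# Tune recovery settings for homelab
# These control how aggressively Ceph repairs/rebalances
# Note: with the mClock scheduler (the default in recent releases), Ceph may ignore manual
# changes to these options unless osd_mclock_override_recovery_settings is true.
# Check the active scheduler with: ceph config get osd osd_op_queue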
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c "
ceph config set osd osd_recovery_max_active 3 # Max concurrent recovery ops
ceph config set osd osd_max_backfills 1 # Max backfill ops per OSD
ceph config set osd osd_recovery_sleep_hdd 0.1 # HDD recovery sleep (seconds)
ceph config set osd osd_recovery_sleep_ssd 0 # No sleep for SSD recovery
"
# Why these values:
# - Lower values = less impact on client I/O during recovery
# - Higher values = faster recovery but more performance impact
# - These settings balance recovery speed with usability
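The memory-target arithmetic from earlier in this section (target × OSDs per node, with headroom left over) is easy to get wrong by a factor of 1024, so here is a minimal sketch of it. The function names are made up for illustration, and the inputs are whole GiB values you supply for your own nodes.

```shell
#!/usr/bin/env bash
# Sketch: convert an OSD memory target to bytes and estimate what is left
# on a node for kubelet, the OS, and other pods. Names are illustrative.

gib_to_bytes() {
  # usage: gib_to_bytes <gib>   (1 GiB = 1024^3 bytes)
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

node_headroom_gib() {
  # usage: node_headroom_gib <node_ram_gib> <osds_on_node> <target_gib>
  # A negative result means the targets alone exceed the node's RAM.
  local node_ram=$1 osds=$2 target=$3
  echo $(( node_ram - osds * target ))
}
```

`gib_to_bytes 6` gives the value passed to `osd_memory_target` above. Because the target is not a hard cap, OSDs can exceed it during recovery, so treat the headroom figure as optimistic.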
PG Autoscaler Pool Weighting
Optionally weight your pools based on expected usage:
# Example: allocate ~60% of PGs to fast pool, ~40% to slow pool
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool set replicapool-fast target_size_ratio 0.6
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool set replicapool-slow target_size_ratio 0.4
# The PG autoscaler will adjust placement groups based on these hints
# Aim for 50-200 PGs per OSD; avoid creating many tiny pools
# Also set explicit min_size for safety (matches requireSafeReplicaSize):
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool set replicapool-fast min_size 2 # For size=3 pool: I/O continues with one replica down
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool set replicapool-slow min_size 2 # For size=2 pool: safest, but I/O pauses whenever one replica is down
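To sanity-check what the autoscaler settles on against the per-OSD guideline above, a rough estimate helps. This sketch uses the common rule of thumb (OSDs × target PGs per OSD × pool share ÷ replica size, rounded up to a power of two). It is an approximation, not the autoscaler's actual algorithm, and `pg_estimate` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: rule-of-thumb PG count for one pool. Not the autoscaler's logic.

pg_estimate() {
  # usage: pg_estimate <num_osds> <replica_size> <pool_share_percent> [pgs_per_osd]
  local osds=$1 size=$2 pct=$3 per_osd=${4:-100}
  local raw=$(( osds * per_osd * pct / 100 / size ))
  local pg=1
  # Round up to the next power of two
  while [ "$pg" -lt "$raw" ]; do
    pg=$(( pg * 2 ))
  done
  echo "$pg"
}
```

Compare the result with the PG_NUM column of `ceph osd pool autoscale-status`. A large mismatch usually means the ratios or the OSD count assumed here differ from what the cluster sees.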
Network Optimization
# Set proper heartbeat and recovery settings through Ceph config
# (Note: environment variables on DaemonSets won't work for these)
# If running VMs on noisy NICs, test smaller increments first (15/30) before 30/60
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c "
ceph config set osd osd_heartbeat_interval 30 && # Check peer health every 30s (default: ~6s)
ceph config set global osd_heartbeat_grace 60 # Wait 60s before marking OSD down (default: ~20s)
"
# osd_heartbeat_grace is set at the global level because the monitors read it as well as the OSDs
# Why these matter:
# - These affect OSD peer detection, not client I/O timeouts
# - Home networks have more variable latency than datacenter
# - Trade-off: higher values mean fewer false downs on noisy homelab links but slower detection of real failures
# - Document these values for on-call expectations (60s grace = ~1 minute to detect real OSD failure)
# - Don't push too high (e.g., minutes) or you'll mask real failures
# - Sweet spot for homelab: 30s/60s balances stability vs detection speed
Advanced Network Isolation (Optional):
For production clusters, consider using Multus CNI to separate storage traffic onto a dedicated network interface. This prevents storage replication from competing with pod traffic:
# With Multus, you can attach OSDs to a dedicated storage VLAN
# Example: 10GbE for storage, 1GbE for pod traffic
# See: https://github.com/rook/rook/blob/master/design/ceph/multus-network.md
Maintenance Operations
Preventing rebalancing during maintenance:
# Before shutting down a storage node for maintenance:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout
# This tells Ceph not to mark OSDs as permanently out during maintenance
# Do your maintenance (node reboot, disk replacement, etc.)
# After maintenance is complete:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout
Maintenance flags reference:
| Flag | Purpose | When to Use | Warning |
|---|---|---|---|
| noout | Prevents OSD removal | Node reboots, short maintenance | Unset ASAP |
| norebalance | Stops data redistribution | Adding multiple OSDs | Can affect redundancy |
| nobackfill | Pauses backfill operations | Performance-critical periods | Delays recovery |
| norecover | Stops recovery operations | Investigating issues | High risk - use briefly |
| pause | Stops all I/O | Emergency debugging | Blocks all client I/O |
# Example: Full maintenance mode for major work
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
ceph osd set noout &&
ceph osd set norebalance &&
ceph osd set nobackfill
'
# ALWAYS unset all flags after maintenance:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
ceph osd unset noout &&
ceph osd unset norebalance &&
ceph osd unset nobackfill
'
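Forgetting the unset step is the easy mistake here, so one option is a wrapper that clears the flags automatically, even when the maintenance command fails. This is a sketch under assumptions: `with_maintenance`, `set_flags`, `unset_flags`, and the `CEPH` variable are names invented for this example, and the default `CEPH` prefix assumes the toolbox deployment used throughout this article.

```shell
#!/usr/bin/env bash
# Sketch: set maintenance flags, run a command, always unset the flags.
# CEPH is the command prefix for reaching the ceph CLI; override it
# (for example CEPH=echo) to dry-run without a cluster.
CEPH=${CEPH:-"kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph"}
MAINT_FLAGS="noout norebalance nobackfill"

# $CEPH and $MAINT_FLAGS are left unquoted on purpose so they split into words
set_flags()   { local f; for f in $MAINT_FLAGS; do $CEPH osd set "$f"   || return 1; done; }
unset_flags() { local f; for f in $MAINT_FLAGS; do $CEPH osd unset "$f"; done; }

with_maintenance() {
  # usage: with_maintenance <command> [args...]
  # Runs in a subshell so the EXIT trap fires when the wrapped command
  # finishes or fails, without replacing the caller's own traps.
  (
    trap unset_flags EXIT
    set_flags || exit 1
    "$@"
    exit $?
  )
}
```

For example, `with_maintenance talosctl -n 192.168.0.14 reboot` would set the three flags, issue the reboot, and unset them when the command returns. The command returning is not the same as the node being healthy again, so for a real reboot you would wrap a script that also waits for the OSDs to come back up.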
Schedule scrubs for off-peak hours:
# Configure scrubs to run during quiet hours (1 AM - 6 AM)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
ceph config set osd osd_scrub_begin_hour 1 &&
ceph config set osd osd_scrub_end_hour 6 &&
ceph config set osd osd_deep_scrub_interval 1209600 # Deep scrub every 14 days
'
# Scrubs verify data integrity but can impact performance
# Scheduling them for quiet hours minimizes user impact
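A scrub window that crosses midnight (say 22 to 6) is easy to misread, so this small sketch checks whether a given hour falls inside a begin/end pair. It assumes the end hour is exclusive and that equal begin and end values mean "all day", which is my reading of the Ceph documentation. `in_scrub_window` is a hypothetical helper, not a Ceph command.

```shell
#!/usr/bin/env bash
# Sketch: is <hour> inside the scrub window [begin, end)? Hours are 0-23.

in_scrub_window() {
  # usage: in_scrub_window <hour> <begin_hour> <end_hour>
  local h=$1 b=$2 e=$3
  if [ "$b" -eq "$e" ]; then
    return 0                                   # assumed: begin == end allows the whole day
  elif [ "$b" -lt "$e" ]; then
    [ "$h" -ge "$b" ] && [ "$h" -lt "$e" ]     # same-day window, e.g. 1 -> 6
  else
    [ "$h" -ge "$b" ] || [ "$h" -lt "$e" ]     # wraps past midnight, e.g. 22 -> 6
  fi
}
```

The hours are evaluated in the OSD's own timezone. Containers often run in UTC, so it is worth checking before assuming 1 AM means 1 AM local time.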
Monitoring Ceph Health
Prometheus Integration
If you're running kube-prometheus-stack, add ServiceMonitor and alerts:
# First check what port name your mgr service uses
kubectl -n rook-ceph get svc rook-ceph-mgr -o jsonpath='{.spec.ports[*].name}{"\n"}'
# If it shows 'metrics' instead of 'http-metrics', update the ServiceMonitor below
# servicemonitor-ceph-mgr.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rook-ceph-mgr
  namespace: rook-ceph
  labels:
    release: prometheus # Match your Prometheus Operator selector
spec:
  selector:
    matchExpressions:
      - key: app
        operator: In
        values: [rook-ceph-mgr]
  namespaceSelector:
    matchNames: ["rook-ceph"]
  endpoints:
    - port: http-metrics # Or 'metrics' depending on your version
      interval: 30s
---
# prometheusrule-ceph.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-rules
  namespace: rook-ceph
  labels:
    release: prometheus
spec:
  groups:
    - name: ceph.health
      rules:
        # Recording rule to detect maintenance flags (prevents false alarms)
        # The mgr prometheus module exposes flags as ceph_osd_flag_<name>;
        # "or vector(0)" keeps the series present (value 0) when no flag is set
        - record: ceph:maintenance_active
          expr: |
            max(
              (ceph_osd_flag_noout == 1) or (ceph_osd_flag_norebalance == 1) or (ceph_osd_flag_nobackfill == 1)
            ) or vector(0)
        - alert: CephHealthError
          expr: ceph_health_status == 2
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Ceph cluster health is ERROR"
            description: "Ceph cluster is in ERROR state for >2 minutes"
        - alert: CephOSDNearFull
          expr: ceph_osd_stat_bytes_used / ceph_osd_stat_bytes >= 0.85 # Mirrors the default nearfull ratio
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "OSD {{ $labels.ceph_daemon }} is nearfull"
            description: "OSD is nearfull (default ~85%, backfillfull ~90%, full ~95%)"
        - alert: CephMonOutOfQuorum
          expr: min(ceph_mon_quorum_status) == 0 # Fires when any single MON drops out of quorum
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Ceph MON out of quorum"
            description: "At least one monitor is out of quorum - if a majority drops out, writes will fail"
        - alert: CephPGDegraded
          expr: ceph_pg_degraded > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} degraded PGs present"
            description: "Data is under-replicated - recovery in progress"
        - alert: CephOSDDown
          # count(), not sum(): the filtered series all have value 0, so their sum is always 0
          expr: count(ceph_osd_up == 0) > 0 and ceph:maintenance_active == 0 # Inhibit during maintenance
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "{{ $value }} OSD(s) down"
            description: "One or more OSDs are down - data at risk"
kubectl apply -f servicemonitor-ceph-mgr.yaml
kubectl apply -f prometheusrule-ceph.yaml
# Note: Metric names can vary across exporters - adjust expressions to match your exporter
# Optional: Add inhibition rule to silence alerts during maintenance
# In your alertmanager config, when noout/nobackfill/norebalance are set:
# - source_matchers: [severity="info", alertname="CephMaintenanceMode"]
# target_matchers: [severity=~"warning|critical"]
# equal: ['cluster']
# Verify metrics are being scraped
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr 9283:9283 &
curl -s localhost:9283/metrics | grep ceph_health
Dashboard Access
Ceph includes a web dashboard for monitoring and management:
# Get the auto-generated admin password
kubectl -n rook-ceph get secret rook-ceph-dashboard-password \
-o jsonpath="{['data']['password']}" | base64 --decode
# Save this password!
# Create a tunnel to access the dashboard
# This forwards local port 8443 to the dashboard service
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 8443:8443
# Now access in your browser:
# URL: https://localhost:8443
# Username: admin
# Password: (from above command)
# Note: You'll get a certificate warning - this is expected
CLI Monitoring
Key commands for monitoring Ceph health:
Capacity Thresholds in Ceph (ratios, not fixed bytes):
- ~85% (nearfull ratio): Warning alerts start
- ~90% (backfillfull ratio): New backfills throttled
- ~95% (full ratio): Writes blocked completely
Keep your cluster below 80-85% for safe operation headroom. These are configurable ratios that change cluster behavior at each threshold.
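These thresholds translate directly into a simple classification, plus a rough "how much can I actually store" figure. The sketch below uses the default ratios quoted above. The function names are invented for illustration, and the usable-capacity estimate ignores compression and uneven data distribution across OSDs.

```shell
#!/usr/bin/env bash
# Sketch: map a %USED value to the Ceph capacity state it would trigger,
# and estimate usable space before the nearfull warning. Defaults: 85/90/95.

capacity_state() {
  # usage: capacity_state <percent_used> [nearfull] [backfillfull] [full]
  local used=$1 nearfull=${2:-85} backfillfull=${3:-90} full=${4:-95}
  if   [ "$used" -ge "$full" ];         then echo full
  elif [ "$used" -ge "$backfillfull" ]; then echo backfillfull
  elif [ "$used" -ge "$nearfull" ];     then echo nearfull
  else                                       echo ok
  fi
}

safe_usable_gib() {
  # usage: safe_usable_gib <raw_gib> <replica_size> [nearfull_percent]
  # Raw capacity up to the nearfull ratio, divided by the replica count.
  local raw=$1 size=$2 nearfull=${3:-85}
  echo $(( raw * nearfull / 100 / size ))
}
```

For instance, `safe_usable_gib 4000 3` shows that 4000 GiB of raw disk with 3 replicas leaves roughly 1133 GiB of application data before warnings start.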
# Overall cluster health with details
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
# HEALTH_OK = everything good
# HEALTH_WARN = check warnings (often non-critical)
# HEALTH_ERR = immediate attention needed
# Storage usage by pool
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
# Shows:
# - RAW STORAGE: Total physical capacity
# - USED: Actual bytes used (includes replicas)
# - AVAIL: Space available for new data
# - %USED: Percentage full (keep under 85%)
# OSD performance metrics
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
# Shows latency for each OSD
# commit_latency: Write confirmation time
# apply_latency: Write to disk time
# High latency (>100ms) indicates problems
# Benchmark with rados bench (preferred over dd for Ceph testing)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
rados -p replicapool-fast bench 10 write --no-cleanup
# Runs 10-second write test
# Shows bandwidth and IOPS achieved
# Clean up test objects
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
rados -p replicapool-fast cleanup
# Note: rados bench is preferred for Ceph performance testing
# because it measures the entire cluster path (including replication).
# Its payload may compress well, so on a compression-enabled pool
# treat the numbers as an upper bound
Backup & Disaster Recovery
Always backup critical Ceph secrets before major changes:
# Save admin and CSI keys
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
ceph auth export client.admin > ceph-client-admin.keyring
# Backup CSI secrets
for secret in rook-csi-rbd-node rook-csi-rbd-provisioner \
rook-csi-cephfs-node rook-csi-cephfs-provisioner; do
kubectl -n rook-ceph get secret $secret -o yaml > ${secret}.secret.yaml
done
# Save monitor map and keyring
kubectl -n rook-ceph get secret rook-ceph-mon -o yaml > rook-ceph-mon.secret.yaml
# Save version info and config for post-incident diffing
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions > ceph-versions.txt
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph config dump > ceph-config-dump.txt
# Store these backups OFF-CLUSTER
# They're critical for disaster recovery if you lose the k8s control plane
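A redirect like `> file` happily creates an empty file when the command fails, so it is worth checking the backup set before trusting it. This sketch looks for the file names produced by the commands above. `verify_backups` is a hypothetical helper, and it checks only presence and non-zero size, not content.

```shell
#!/usr/bin/env bash
# Sketch: confirm every expected backup file exists and is non-empty.

verify_backups() {
  # usage: verify_backups [dir]
  # Prints one line per problem; returns non-zero if anything is missing or empty.
  local dir=${1:-.} rc=0 f
  for f in ceph-client-admin.keyring \
           rook-csi-rbd-node.secret.yaml rook-csi-rbd-provisioner.secret.yaml \
           rook-csi-cephfs-node.secret.yaml rook-csi-cephfs-provisioner.secret.yaml \
           rook-ceph-mon.secret.yaml ceph-versions.txt ceph-config-dump.txt; do
    if [ ! -e "$dir/$f" ]; then
      echo "MISSING $f"; rc=1
    elif [ ! -s "$dir/$f" ]; then
      echo "EMPTY $f"; rc=1
    fi
  done
  return $rc
}
```

Run it as `verify_backups . && echo "backup set looks complete"` in the directory where you saved the files, before copying them off-cluster.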
Migration Story: Longhorn to Ceph
My migration wasn't smooth, but here's what I learned:
The Problems with Longhorn
- Performance degradation: As volumes grew, performance tanked
- Network storms: Replica sync during rebuilds saturated the network
- Instance manager crashes: Lost all volumes on a node twice
- No compression: Storage efficiency was poor
Migration Process
Here's the step-by-step process I used:
# Step 1: Deploy Ceph alongside Longhorn
# Both storage systems can coexist during migration
# Step 2: Create new PVCs on Ceph
# For each Longhorn PVC, create equivalent Ceph PVC
# Step 3: Application-level data migration
# This is safer than trying to copy volumes directly
# Step 4: Switch applications to Ceph PVCs
# Update deployments to use new PVC names
# Step 5: Delete Longhorn volumes
# After verifying data integrity
# Step 6: Remove Longhorn
# Uninstall once all data migrated
# Example: Migrating PostgreSQL database
# 1. Backup from Longhorn volume (no -t: a TTY would add carriage returns to the dump)
kubectl exec postgres-pod -- pg_dumpall > backup.sql
# 2. Scale down PostgreSQL
kubectl scale deployment postgres --replicas=0
# 3. Update deployment to use Ceph PVC
kubectl edit deployment postgres
# Change: claimName: postgres-longhorn
# To: claimName: postgres-ceph
# 4. Scale back up
kubectl scale deployment postgres --replicas=1
# 5. Restore data (-i without -t, so stdin can be redirected from the file)
kubectl exec -i postgres-pod -- psql < backup.sql
# 6. Verify data integrity
kubectl exec -it postgres-pod -- psql -c "SELECT COUNT(*) FROM your_table;"
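For file-based workloads (media, config directories) there is no row count to compare, so a content digest of the whole tree is one way to do the "verify data integrity" step. This is a minimal sketch: `tree_digest` is a made-up name, and it assumes `sha256sum` and `find` are available where you run it.

```shell
#!/usr/bin/env bash
# Sketch: produce one digest for a directory tree so the old (Longhorn)
# and new (Ceph) copies can be compared with a single string.

tree_digest() {
  # usage: tree_digest <dir>
  # Hash each regular file (relative path + content), sort by path for a
  # stable order, then hash that listing.
  (
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 | sha256sum | cut -d' ' -f1
  )
}
```

Run it against the mount path of the old volume and again against the new one, then compare the two strings. It ignores ownership, permissions, and empty directories, so check those separately if your application depends on them.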
Lessons Learned
- Plan for 2x storage during migration
- You need both systems running simultaneously
- Migration can take days for large datasets
- Test restore procedures first
- Practice on test databases/non-critical data
- Verify backup/restore commands work
- Monitor resource usage closely
- Ceph uses lots of RAM during initial data distribution
- Watch for OOM kills during migration
- Use application-level migration
- Database dumps, rsync, application-specific tools
- Don't try to copy PV contents directly (won't work)
- Keep Longhorn for a week
- Don't rush to delete old storage
- You might discover missing data days later
Troubleshooting Guide
OSD Won't Start
# Check OSD pod logs for errors
kubectl -n rook-ceph logs -l app=rook-ceph-osd --tail=50
# Common issues and solutions:
# 1. ERROR: Disk has existing filesystem
# Solution: Wipe the disk completely (use bare device name)
talosctl -n 192.168.0.14 wipe disk sdb --insecure
# Then delete and recreate the OSD pod
# 2. ERROR: OOM killed (out of memory)
# Solution: Reduce memory target
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph config set osd osd_memory_target 2147483648 # 2GB
# 3. ERROR: Permission denied on /dev/sdb
# Solution: Check Talos kernel modules loaded
talosctl -n 192.168.0.14 dmesg | grep -i ceph
# Should see: "rbd: loaded" and "libceph: loaded"
# 4. ERROR: Clock skew detected
# Solution: Check Ceph's view of time skew
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail | grep clock
# If clock skew persists, check Talos NTP configuration in machine config
Slow Performance
# 1. Check if recovery/rebalancing is running
# This severely impacts performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Look for: "recovery: 123 MB/s, 31 objects/s"
# If present, wait for it to complete
# 2. Check individual OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
# Look for outliers with high latency (>50ms watch, >100ms investigate)
# osd.0: commit_latency: 12ms, apply_latency: 15ms # Good
# osd.1: commit_latency: 250ms, apply_latency: 300ms # Bad - investigate this OSD
# 3. Check network latency between nodes
kubectl run netshoot --rm -i --tty --image nicolaka/netshoot -- \
ping -c 10 192.168.0.15
# Latency should be <1ms on local network
# 4. Check if pools are healthy
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd pool ls detail
# Look for: "flags" field - should be empty or "hashpspool"
# "full" or "nearfull" = immediate problem
# 5. Check disk I/O on host (talosctl has no 'top' subcommand; the dashboard shows the CPU breakdown)
talosctl -n 192.168.0.14 dashboard
# Look for high iowait% (>20% indicates disk bottleneck)
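Scanning `ceph osd perf` by eye gets tedious with more than a few OSDs. The filter below is a sketch that prints only the OSDs whose commit or apply latency exceeds a threshold. It assumes the usual three-column output (osd id, commit latency, apply latency in ms), and `slow_osds` is a name chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch: filter `ceph osd perf` output down to OSDs above a latency limit.

slow_osds() {
  # usage: ceph osd perf | slow_osds [threshold_ms]   (default 100)
  # Rows whose first column is not a number (the header) are skipped.
  awk -v limit="${1:-100}" '
    $1 ~ /^[0-9]+$/ {
      worst = ($2 + 0 > $3 + 0) ? $2 : $3
      if (worst + 0 > limit + 0) printf "osd.%s %dms\n", $1, worst
    }'
}
```

A typical call is `kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd perf | slow_osds 50`, using the 50ms "watch" threshold from step 2. Note the missing `-it`: a TTY is not wanted when piping.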
PVC Stuck Pending
When a PVC won't bind to a volume:
# 1. Check PVC events for errors
kubectl describe pvc <pvc-name>
# Look at Events section for errors like:
# - "no capacity" = out of space
# - "pool not found" = pool doesn't exist
# - "failed to provision" = CSI driver issue
# 2. Verify the pool exists
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls
# Should see your pool name (e.g., replicapool-fast)
# 3. Check CSI driver pods are running
kubectl get pods -n rook-ceph | grep csi
# Should see:
# csi-cephfsplugin-xxxxx 2/2 Running
# csi-rbdplugin-xxxxx 2/2 Running
# csi-cephfsplugin-provisioner-xxxxx 5/5 Running
# csi-rbdplugin-provisioner-xxxxx 5/5 Running
# 4. Check if storage class exists
kubectl get storageclass
# Verify the storage class name matches your PVC
# 5. Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health
# HEALTH_ERR would prevent provisioning
# 6. Enable temporary debug logging (if needed)
# 1/5 is already the Ceph default for these subsystems, so raise it to get extra detail
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
ceph tell "osd.*" config set debug_osd 10/10
ceph tell "osd.*" config set debug_bluestore 10/10
'
# Check logs, then revert to the defaults:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
ceph tell "osd.*" config set debug_osd 1/5
ceph tell "osd.*" config set debug_bluestore 1/5
'
# 7. Check for capacity
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
# At ~85% (nearfull) you get warnings
# Writes stop at ~95% (full) and backfill throttles around 90% (backfillfull)
Resource Management
After running Ceph for months, here's actual resource usage:
# Current allocation (observed in my cluster)
Ceph Components:
MON (3x): 1.5 CPU, 3GB RAM total
MGR (2x): 1 CPU, 1GB RAM total
MDS (2x): 1 CPU, 2GB RAM total
OSD (4x): 4 CPU, 16GB RAM total
Total: ~7.5 CPU, 22GB RAM
# Actual usage
MON: 100m CPU, 400MB RAM each
MGR: 150m CPU, 300MB RAM each
MDS: 80m CPU, 250MB RAM each
OSD: 200m CPU, 3.8GB RAM each
# Over-provisioned by ~5x for CPU, but only ~1.3x for RAM (OSD memory is nearly fully used)
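The over-provisioning ratios follow from the two tables above: total allocation divided by (count × observed usage per instance). Here is that arithmetic as a small sketch. `overprovision_ratio` is a hypothetical helper, CPU is in millicores, and RAM is in MB with 1GB treated as 1000MB for simplicity.

```shell
#!/usr/bin/env bash
# Sketch: allocation vs. observed usage, from the figures listed above.
# stdin columns: name count alloc_cpu_m_total alloc_ram_mb_total used_cpu_m_each used_ram_mb_each

overprovision_ratio() {
  awk '
    NF == 6 {
      alloc_cpu += $3; alloc_ram += $4
      used_cpu  += $2 * $5; used_ram += $2 * $6
    }
    END {
      if (used_cpu > 0 && used_ram > 0)
        printf "cpu: %.1fx ram: %.1fx\n", alloc_cpu / used_cpu, alloc_ram / used_ram
    }'
}

overprovision_ratio <<'EOF'
MON 3 1500  3000 100  400
MGR 2 1000  1000 150  300
MDS 2 1000  2000  80  250
OSD 4 4000 16000 200 3800
EOF
```

With these inputs it prints `cpu: 4.8x ram: 1.3x`: CPU requests have plenty of room to shrink, while the OSD memory allocation is close to what the daemons really use.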
Security Considerations
Encryption at Rest
For sensitive data, enable encryption:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-encrypted
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool-fast
  encrypted: "true" # Without KMS: stores keys in k8s secrets (homelab OK)
  # encryptionKMSID: vault-kms # Production: use external KMS like Vault
For production with proper key management (covered in Part 6/7):
# ceph-block-encrypted-vault.yaml
parameters:
  encrypted: "true"
  encryptionKMSID: vault-kms # References KMS config in rook-ceph-csi-kms-config
Network Isolation
Warning: NetworkPolicies can break Ceph operator/CSI functionality. The Rook operator needs Kubernetes API access, and some CSI pods use hostNetwork. Leave commented out unless you know you need it.
# network-policy.yaml (OPTIONAL - uncomment only if required)
# apiVersion: networking.k8s.io/v1
# kind: NetworkPolicy
# metadata:
#   name: ceph-network-isolation
#   namespace: rook-ceph
# spec:
#   podSelector: {} # Apply to all pods in namespace
#   policyTypes:
#     - Ingress
#     - Egress
#   ingress:
#     - from:
#         - namespaceSelector:
#             matchLabels:
#               kubernetes.io/metadata.name: rook-ceph # Well-known label
#         - podSelector: {} # Allow all pods in the namespace to communicate
#   egress:
#     - to:
#         - namespaceSelector:
#             matchLabels:
#               kubernetes.io/metadata.name: rook-ceph # Well-known label
#         - podSelector: {} # Allow egress to all pods in the namespace
#     - to: # Allow DNS
#         - namespaceSelector:
#             matchLabels:
#               kubernetes.io/metadata.name: kube-system
#       ports:
#         - protocol: UDP
#           port: 53
#         - protocol: TCP
#           port: 53
#     - to: # Allow Kubernetes API access
#         - ipBlock:
#             cidr: 10.96.0.1/32 # Replace with: kubectl get svc kubernetes -n default -o jsonpath='{.spec.clusterIP}'
#       ports:
#         - protocol: TCP
#           port: 443
Pre-Flight Checklist
Before considering Ceph production-ready, verify:
- [ ] Health Check: ceph status shows HEALTH_OK with all OSDs up/in
- [ ] Performance Baseline: rados bench completes without triggering recovery
- [ ] Dashboard Access: svc/rook-ceph-mgr-dashboard reachable at https://localhost:8443 (via port-forward)
- [ ] Metrics Flow: Prometheus scraping ceph_* metrics (if configured)
- [ ] Time Sync: No clock-skew warning in ceph health detail (the monitors' default tolerance is 0.05s, far tighter than 1 second)
- [ ] Test Workloads: Both RBD and CephFS PVCs mount successfully
- [ ] Maintenance Flags Clear: No unexpected cluster-wide OSD flags
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd dump | grep flags
# Should NOT show: noout, norebalance, nobackfill, norecover
- [ ] Storage Classes: Default SC set, all expected classes present
kubectl get sc
- [ ] CRUSH Rules: Verify pools use correct device classes
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd crush rule ls
What's Next
With Ceph providing resilient storage, your cluster is ready for stateful applications. In Part 5, we'll implement GitOps with ArgoCD, enabling declarative deployments and automatic synchronization from Git repositories.
Key Takeaways
- Ceph requires significant resources but delivers enterprise-grade storage
- Proper disk selection is critical - dedicated disks perform much better
- Migration requires careful planning - always test restore procedures
- Compression saves significant space with minimal performance impact
- Monitor resource usage closely - Ceph can consume all available resources
Quality of Life Tips
Post-Deployment Cleanup
# Option 1: Scale down the toolbox when not needed (saves ~500MB memory)
kubectl -n rook-ceph scale deploy rook-ceph-tools --replicas=0
# Scale back up when needed: --replicas=1
# Option 2: Delete entirely and recreate later
kubectl -n rook-ceph delete deploy/rook-ceph-tools
# Recreate later with: kubectl apply -f toolbox.yaml
# Note: The toolbox carries the admin keyring; deleting it doesn't delete creds,
# it just removes the Pod
# Verify CRUSH rules match your intent
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
ceph osd crush rule ls
# Each pool should have its appropriate _crush_rule
# Check actual memory headroom on storage nodes
kubectl top nodes | grep wrk
# Ensure at least 2GB free after accounting for OSDs + kubelet
Quick Health Checks
# One-liner cluster health
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
ceph health | head -1
# Storage utilization at a glance
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
ceph df | grep -E "TOTAL|POOL"
# Handy aliases for frequent checks
alias cephs='kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s'
alias cephh='kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail'
References
- Rook Documentation: https://rook.io/docs/rook/v1.18/
- Ceph Architecture: https://docs.ceph.com/en/reef/architecture/
- Ceph on Kubernetes: https://docs.ceph.com/en/reef/rbd/rbd-kubernetes/
- Storage Benchmarking: https://github.com/ceph/cbt
- Talos Storage Configuration: https://www.talos.dev/v1.11/kubernetes-guides/configuration/storage/
Continue to Part 5: GitOps with ArgoCD →