Part 4: Distributed Storage with Rook-Ceph

After migrating from Longhorn to Ceph, I discovered why enterprise environments choose Ceph despite its complexity. Yes, Ceph requires 15 CPU cores and 24GB RAM in my cluster, but it delivers performance and reliability that simpler solutions can't match. This article covers deploying Rook-Ceph on Talos Linux with detailed explanations of storage concepts, proper configuration for homelab scale, and critical lessons from my storage journey.

Understanding Storage in Kubernetes

What is Distributed Storage?

Think of distributed storage like a RAID array spread across multiple servers. Instead of having all your data on one machine (single point of failure), distributed storage systems replicate your data across multiple nodes. If one server fails, your data remains accessible from the surviving nodes.

Key Storage Concepts:

Block Storage: Raw disk-like storage (like a virtual hard drive) - used for databases
Object Storage: Stores files as objects with metadata (like S3) - used for backups, media
File Storage: Traditional filesystem that multiple pods can mount simultaneously
PVC (Persistent Volume Claim): How pods request storage in Kubernetes
Storage Class: Defines the type and properties of storage available

Why Ceph Over Longhorn

I started with Longhorn - it's simpler, lighter, and "cloud-native." After six months, here's why I switched:

Performance at Scale (My Lab Results)

Longhorn: 45MB/s sequential writes, 2000 IOPS random
Ceph: 280MB/s sequential writes, 15000 IOPS random
Roughly 6x the sequential write throughput and 7.5x the random IOPS, which is what database workloads feel most
Test conditions: 4-node cluster, 1GbE network, replicas=3, 4K block random I/O

Reliability

Longhorn had "split-brain" issues during network partitions (when nodes can't communicate, they disagree on data state)
Ceph's CRUSH algorithm (Controlled Replication Under Scalable Hashing) handles failures predictably
Better data consistency guarantees through quorum-based decisions

Features

Native block, object, and file storage from one system
Erasure coding for efficient storage utilization (like RAID 5/6 but distributed)
Advanced data placement and balancing across storage tiers
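
To put a number on "efficient storage utilization", here is a minimal sketch comparing how much raw capacity stays usable under replication versus erasure coding. The function names and the 4+2 profile are illustrative assumptions, not settings from my cluster. Keep in mind that a k+m profile needs at least k+m failure domains, so 4+2 would not fit on four hosts.

```shell
# Usable capacity as a whole-number percentage of raw capacity.
# Helper names and example profiles are illustrative assumptions.

# Replication keeps `size` full copies, so usable = 100 / size.
# $1 = replica count
replica_efficiency() {
  echo $(( 100 / $1 ))
}

# Erasure coding stores k data chunks plus m coding chunks,
# so usable = k / (k + m).
# $1 = k (data chunks), $2 = m (coding chunks)
ec_efficiency() {
  echo $(( 100 * $1 / ($1 + $2) ))
}

echo "replica size=3: $(replica_efficiency 3)% usable"
echo "EC k=4 m=2:     $(ec_efficiency 4 2)% usable"
```

With these inputs, three-way replication leaves 33% of raw space usable, while a 4+2 profile leaves 66% and still survives two lost chunks.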

The Trade-off

Ceph needs serious resources (minimum 4GB RAM per OSD)
Complex to troubleshoot
Steep learning curve

Ceph Architecture Explained

Core Components:

OSD (Object Storage Daemon): One per disk, stores actual data
MON (Monitor): Maintains cluster maps and state (need odd number for quorum)
MGR (Manager): Provides monitoring, orchestration, and interfaces
MDS (Metadata Server): Only needed for CephFS (file storage)
CRUSH Map: Describes the cluster topology and placement rules that the CRUSH algorithm uses to decide where data is stored

Why These Matter:

OSDs handle all data I/O - more OSDs = more performance
MONs maintain consensus - lose quorum and cluster stops accepting writes
MGR provides the dashboard and metrics we need for monitoring
CRUSH ensures data is spread evenly and survives failures
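
The "odd number for quorum" advice comes down to simple majority arithmetic. A small sketch (function names are mine, purely illustrative) shows why a fourth MON buys no extra fault tolerance over three:

```shell
# Majority quorum arithmetic for Ceph monitors.
# Function names are illustrative.

# Smallest number of MONs that forms a strict majority.
# $1 = total MON count
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many MONs can fail while a majority survives.
# $1 = total MON count
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "$n MONs: quorum=$(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Three and four MONs both tolerate a single failure; you need five before the cluster survives two.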

Pre-Installation Requirements

Hardware Verification

Time Sync Requirement: Ensure Talos NTP is properly configured and all nodes are time-synced. Clock skew causes MON quorum issues and OSD flapping.

Network Reality for Homelab: If you have ≥10GbE, consider a separate VLAN/NIC for Ceph traffic with jumbo frames. On 1GbE, compression and conservative recovery settings (shown later) are essential.

# Check available disks on each worker node
# We need dedicated disks for Ceph - it shouldn't share with the OS disk
for node in 192.168.0.14 192.168.0.15 192.168.0.16 192.168.0.17; do
  echo "=== Node $node ==="
  talosctl -n $node disks  # Lists all disks on the node
done

# Expected output:
DEV        MODEL       SIZE     TYPE   UUID   WWID   MODALIAS
/dev/sda   Virtual HD  32 GB    HDD    -      -      scsi:t-0x00    # OS disk - don't use
/dev/sdb   Virtual HD  500 GB   HDD    -      -      scsi:t-0x00    # ← This is what Ceph will use
/dev/sdc   Virtual HD  500 GB   HDD    -      -      scsi:t-0x00    # ← Additional disk if available

# Verify memory (need 4GB+ available per OSD)
# Note: kubectl top shows current usage, not what's available
kubectl top nodes

# Check allocatable memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE:.status.allocatable.memory

# Formula: (allocatable - used) >= 4GiB * (# of OSDs on that node)
# Example: 32GiB allocatable - 2GiB used = 30GiB available = supports 7 OSDs
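
The memory formula above is easy to script if you want to check several nodes. This is a minimal sketch: the function name is hypothetical, and you feed it whole-GiB numbers read from the kubectl output yourself.

```shell
# How many OSDs fit in a node's free memory.
# $1 = allocatable GiB, $2 = used GiB, $3 = GiB per OSD (optional, default 4).
max_osds() {
  free_gib=$(( $1 - $2 ))
  # Guard against nodes that already use more than they advertise
  [ "$free_gib" -lt 0 ] && free_gib=0
  echo $(( free_gib / ${3:-4} ))
}

max_osds 32 2   # 30 GiB free, prints 7
max_osds 8 5    # 3 GiB free, prints 0
```

Integer division rounds down on purpose: a node with 30 GiB free supports 7 OSDs, not 7.5.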

Troubleshooting Disk Issues:

# If no secondary disks are available:
# Check if you need to add virtual disks to VMs
# For physical servers, you'll need to install additional drives

# If disks are already formatted:
# Ceph can't use disks with existing filesystems
talosctl -n 192.168.0.14 disks | grep -E "ext4|xfs|ntfs"

# If you see formatted disks, you'll need to wipe them:
# WARNING: This destroys all data on the disk!
# Only run this on dedicated Ceph disks, NEVER the OS disk!
# Use the bare device name (e.g., sdb, not /dev/sdb):
talosctl -n 192.168.0.14 wipe disk sdb --insecure

# Verify disk performance (optional but recommended):
# For disk performance baselines, rely on rados bench after Ceph is deployed
# (Talos doesn't provide shell access for raw disk tests)

Why These Requirements:

Dedicated Disk: Ceph performs best with dedicated disks - sharing with OS causes I/O contention and performance degradation
4GB RAM per OSD: Ceph uses RAM for BlueStore cache, metadata, and operations - insufficient RAM causes 90% performance loss
CPU: While not as critical, each OSD needs ~1 CPU core for good performance under load

Talos System Extensions

What are System Extensions?
Talos Linux is immutable - you can't install packages like on Ubuntu. System extensions are pre-built packages that add functionality to Talos. Think of them like drivers that need to be baked into the OS image.

# Check for extensions on a node
talosctl -n 192.168.0.14 get extensions

# Required extensions for Ceph:
# - siderolabs/util-linux-tools  # Provides mount, lsblk, blkid, and filesystem utilities

If missing, rebuild Talos with extensions:

# Create custom Talos image with extensions
# Use imager version matching your Talos release
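# "iso" selects the output type; --arch sets the CPU architecture.
# Replace <tag> with the util-linux-tools extension tag published for your Talos release.
# (Comments cannot follow a trailing backslash, so they live up here.)
docker run --rm -t -v $PWD/_out:/out \
  ghcr.io/siderolabs/imager:v1.11.0 iso \
  --arch amd64 \
  --system-extension-image "ghcr.io/siderolabs/util-linux-tools:<tag>"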

# This creates a new Talos ISO/image in _out/ directory
# You'll need to reinstall Talos nodes with this image

Why These Extensions:

util-linux-tools: Required for mounting Ceph filesystems and disk utilities
Note: Ceph RBD uses the kernel RBD client (krbd) or librbd, NOT iSCSI. The iSCSI protocol is only involved when you deploy the optional Ceph iSCSI Gateway to export RBD volumes to non-Ceph clients

Kernel Modules

What are Kernel Modules?
Kernel modules are drivers that run in the Linux kernel. Ceph needs specific modules to communicate with storage devices and manage the filesystem.

Apply this patch to all worker nodes:

# ceph-kernel-modules.yaml
machine:
  kernel:
    modules:
      - name: dm_mod   # Device mapper (required for encryption)
      - name: dm_crypt # LUKS encryption support for encrypted RBD volumes
      - name: libceph  # Core network client (some kernels don't autoload via ceph)
      - name: rbd      # RADOS Block Device - enables block storage via kernel RBD client
      - name: ceph     # Core Ceph filesystem support (needed for CephFS kernel client)
      - name: nbd      # Network Block Device - optional for rbd-nbd fallback
  sysctls:
    vm.swappiness: 0         # Minimize swapping - Ceph needs real RAM, not swap
    kernel.pid_max: 4194304  # Increase max processes - Ceph spawns many
    fs.aio-max-nr: 1048576   # Increase async I/O operations - improves performance

# Apply the patch to all worker nodes simultaneously
talosctl patch machineconfig \
  --nodes 192.168.0.14,192.168.0.15,192.168.0.16,192.168.0.17 \
  --patch @ceph-kernel-modules.yaml

# Nodes will reboot to apply kernel changes
# Wait for them to come back:
kubectl get nodes -w  # -w watches for changes

Why These Settings:

vm.swappiness=0: Ceph performs terribly with swap - it needs physical RAM for caching
kernel.pid_max: Ceph creates many processes for parallel I/O operations
fs.aio-max-nr: Allows more concurrent disk operations for better performance

Installing Rook Operator

What is Rook?
Rook is a Kubernetes operator that automates Ceph deployment and management. Think of it as the bridge between Kubernetes and Ceph - it translates Kubernetes resources into Ceph configurations.

# Add Rook Helm repository
helm repo add rook-release https://charts.rook.io/release
helm repo update

# 1) Install CRDs first (required once per cluster, separate from operator)
helm install rook-ceph-crds rook-release/rook-ceph-crds --version 1.18.2

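# 2) Install Rook operator with specific settings:
#    --create-namespace             creates the namespace if it does not exist
#    --version                      pins the chart to a tested version
#    enableDiscoveryDaemon          auto-discovers new disks
#    csi.enableCephfsSnapshotter    enables filesystem snapshots
#    csi.enableRBDSnapshotter       enables block device snapshots
# (Comments after a trailing backslash break the command, so they are listed here.)
helm install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph \
  --create-namespace \
  --version 1.18.2 \
  --set enableDiscoveryDaemon=true \
  --set csi.enableCephfsSnapshotter=true \
  --set csi.enableRBDSnapshotter=true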

# When upgrading Rook later:
# helm upgrade rook-ceph-crds rook-release/rook-ceph-crds --version NEW_VERSION
# helm upgrade rook-ceph rook-release/rook-ceph --namespace rook-ceph --version NEW_VERSION

What These Settings Do:

enableDiscoveryDaemon: Automatically finds new disks added to nodes
CSI snapshotters: Allow backing up volumes through Kubernetes snapshots

Wait for operator readiness:

# Wait up to 5 minutes for the operator to be ready
kubectl -n rook-ceph wait --for=condition=ready pod -l app=rook-ceph-operator --timeout=300s

# Verify operator is running
kubectl -n rook-ceph get pods
NAME                                  READY   STATUS
rook-ceph-operator-7b9c5f8d8c-xxxxx  1/1     Running   # Should see this

Installing the Ceph Toolbox

Important: Before running any ceph commands, we need to deploy the toolbox pod:

# Deploy the Ceph toolbox for CLI management (using versioned URL)
# Use toolbox version matching your Rook operator version
kubectl create -f https://raw.githubusercontent.com/rook/rook/v1.18.2/deploy/examples/toolbox.yaml

# Wait for toolbox to be ready
kubectl -n rook-ceph rollout status deploy/rook-ceph-tools

# Verify toolbox is running
kubectl -n rook-ceph get pods -l app=rook-ceph-tools
NAME                              READY   STATUS
rook-ceph-tools-7b9c5f8d8c-xxxxx  1/1     Running

All subsequent ceph commands will be run through this toolbox pod.

Ceph Cluster Configuration

Here's my production configuration optimized for homelab:

# ceph-cluster.yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # ⚠️ UPGRADE ORDER: Rook CRDs → Rook operator → Ceph image. Wait for HEALTH_OK between each.
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2  # Pin to tested Reef version
    # Note: Keep Rook and Ceph within supported matrix
    # Always wait for HEALTH_OK between upgrade steps - never skip versions

  dataDirHostPath: /var/lib/rook  # Where Rook stores config on host

  mon:
    count: 3  # Must be odd number for quorum (consensus voting)
    allowMultiplePerNode: false  # Spread monitors across nodes for HA
    # Note: Most clusters use dataDirHostPath for monitors instead of PVCs
    # Only use volumeClaimTemplate if you have a local-path StorageClass configured

  mgr:
    count: 2  # One active, one standby for failover
    allowMultiplePerNode: false
    modules:
      - name: prometheus        # Expose metrics for Prometheus monitoring
        enabled: true
      - name: devicehealth      # Monitor disk SMART data for early failure detection
        enabled: true
      - name: pg_autoscaler     # Automatically adjusts placement groups (usually on by default)
        enabled: true
      - name: balancer          # Evenly distributes data across OSDs (usually on by default)
        enabled: true

  dashboard:
    enabled: true
    ssl: true

  network:
    connections:
      encryption:
        enabled: false  # Enables msgr2 payload encryption (CPU cost)
        # Note: Ceph always uses cephx auth; this toggles wire encryption
        # Most homelabs run compression on, encryption off for performance
      compression:
        enabled: true   # Compresses data in transit (saves bandwidth)
    ipFamily: IPv4
    dualStack: false    # We're not using IPv6

  crashCollector:
    disable: false

  cleanupPolicy:
    confirmation: ""
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false

  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist  # Don't run on control plane nodes
      tolerations:
      - key: storage-node
        operator: Exists  # Can run on nodes tainted as storage-node

  resources:
    mon:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2000m
        memory: 2Gi
    mgr:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 2000m
        memory: 2Gi
    osd:
      requests:
        cpu: 1000m
        memory: 4Gi
      limits:
        cpu: 4000m
        memory: 8Gi

  storage:
    useAllNodes: false     # Don't automatically use all nodes
    useAllDevices: false   # Don't automatically use all disks (safer)
    nodes:
    - name: talos-wrk-01
      devices:
      - name: /dev/sdb     # Second disk dedicated to Ceph (not OS disk)
        config:
          osdsPerDevice: "1"  # One OSD per disk (standard practice)
    - name: talos-wrk-02
      devices:
      - name: /dev/sdb
        config:
          osdsPerDevice: "1"
    - name: talos-wrk-03
      devices:
      - name: /dev/sdb
        config:
          osdsPerDevice: "1"
    - name: talos-wrk-04
      devices:
      - name: /dev/sdb
        config:
          osdsPerDevice: "1"

    # Optional: HDD OSDs with SSD DB/WAL for big performance win
    # If you have a small SSD per node, use it for BlueStore DB/WAL:
    # - name: talos-wrk-05
    #   config:
    #     metadataDevice: /dev/nvme0n1  # SSD for DB/WAL for all OSDs on this node
    #     # databaseSizeMB: 2048        # Tune if SSD is small (default: 5% of OSD)
    #     # walSizeMB: 1024              # Tune if SSD is small (default: 1GB)
    #   devices:
    #     - name: /dev/sdb               # HDD data devices
    #     - name: /dev/sdc

  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical

  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
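
Before applying this, it helps to add up what the spec reserves. The sketch below sums only the MON, MGR, and OSD requests from the config above (500m/1Gi, 500m/512Mi, and 1000m/4Gi respectively); the function name is hypothetical, and MDS, CSI, and other Rook pods come on top.

```shell
# Sum the resource requests from the CephCluster spec.
# $1 = MON count, $2 = MGR count, $3 = OSD count
# Prints "<cpu millicores> <memory MiB>".
total_requests() {
  cpu_m=$(( $1 * 500 + $2 * 500 + $3 * 1000 ))
  mem_mi=$(( $1 * 1024 + $2 * 512 + $3 * 4096 ))
  echo "$cpu_m $mem_mi"
}

# 3 MONs, 2 MGRs, 4 OSDs (one per worker disk)
total_requests 3 2 4   # prints "6500 20480"
```

That is 6.5 CPU cores and 20 GiB of memory requested before any pool, filesystem, or CSI component exists, so check it against your allocatable numbers first.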

Deploy the cluster:

# Optional: Taint storage nodes to dedicate them to Ceph.
# Only do this if untainted workers remain for your applications:
# a NoSchedule taint blocks every pod that lacks a matching toleration.
kubectl taint nodes talos-wrk-01 storage-node=true:NoSchedule
kubectl taint nodes talos-wrk-02 storage-node=true:NoSchedule
kubectl taint nodes talos-wrk-03 storage-node=true:NoSchedule
kubectl taint nodes talos-wrk-04 storage-node=true:NoSchedule
# The Ceph pods already tolerate key=storage-node (see placement.all.tolerations)

kubectl apply -f ceph-cluster.yaml

# Watch deployment progress (updates every 2 seconds)
watch kubectl -n rook-ceph get cephcluster

# Expected progression:
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE         MESSAGE
rook-ceph   /var/lib/rook     3          1m    Progressing   Configuring Ceph Mons
rook-ceph   /var/lib/rook     3          5m    Progressing   Configuring Ceph OSDs
rook-ceph   /var/lib/rook     3          10m   Ready         Cluster created successfully

# Press Ctrl+C to exit watch

Monitoring Deployment Progress

Ceph deployment takes 10-15 minutes. Monitor progress:

# Watch OSD (storage daemon) creation
# -w flag watches for changes in real-time
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -w

# Expected output:
NAME                      READY   STATUS
rook-ceph-osd-0-xxxxx     0/1     Init:0/2  # Initializing
rook-ceph-osd-0-xxxxx     0/1     Init:1/2  # Still initializing
rook-ceph-osd-0-xxxxx     1/1     Running   # Ready!

# ⚠️ CRITICAL: Verify device classes BEFORE creating pools
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# Check that SSDs show class 'ssd' and HDDs show 'hdd'
# If wrong, fix device classes then:
#   ceph osd crush rule rm replicapool-fast_crush_rule || true
#   Re-apply ceph-pools.yaml so Rook regenerates CRUSH rules

# Check overall Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

# Understanding the output:
  cluster:
    id:     abc123
    health: HEALTH_OK        # Green - everything working
            # HEALTH_WARN    # Yellow - something needs attention
            # HEALTH_ERR     # Red - critical issue

  services:
    mon: 3 daemons, quorum a,b,c  # 3 monitors maintaining quorum
    mgr: a(active), standbys: b   # Manager with standby ready
    osd: 4 osds: 4 up, 4 in       # All 4 storage daemons running

  data:
    pools:   0 pools, 0 pgs        # No storage pools yet (we'll create them)
    objects: 0 objects, 0 B        # No data stored yet
    usage:   4.0 GiB used, 1.96 TiB / 2 TiB avail  # Overhead vs available
    pgs:                           # Placement groups (data shards)

What to Look For:

HEALTH_OK: Cluster is fully operational
All OSDs "up": All storage daemons are running
Quorum established: Monitors can make decisions
Manager active: Dashboard and metrics available
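
Since so many steps in this article say "wait for HEALTH_OK", here is a minimal polling sketch. The helper names are my own, and the function takes the health command as arguments so it can wrap the toolbox call from earlier.

```shell
# Succeeds only when the text on stdin reports HEALTH_OK.
is_health_ok() {
  grep -q 'HEALTH_OK'
}

# Poll a health command until it reports HEALTH_OK or the attempts run out.
# $1 = max attempts, $2 = seconds between attempts, remaining args = command to run.
wait_for_health_ok() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@" | is_health_ok; then
      return 0
    fi
    attempts=$(( attempts - 1 ))
    sleep "$delay"
  done
  return 1
}

# Example (needs the toolbox deployment), waiting up to 10 minutes:
# wait_for_health_ok 60 10 kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health
```

The function returns non-zero on timeout, so it can gate the next step of a script with a plain `&&`.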

Storage Classes

What are Storage Classes?
Storage Classes define different tiers of storage with specific properties. Think of them like different types of cloud storage - standard, premium, archive. Applications request storage through PVCs (Persistent Volume Claims) that reference these classes.

Note on TRIM/discard for RBD volumes: The discard mount option enables continuous TRIM, but it can hurt performance on RBD volumes, so consider running fstrim periodically instead. This is configured at the application level, not in the StorageClass. fstrim requires filesystem support (ext4 and xfs have it) and a container that runs as root with the CAP_SYS_ADMIN capability, because the underlying FITRIM ioctl is privileged. The CronJob below runs as root via runAsUser: 0. Because RBD volumes are ReadWriteOnce, the trim pod also has to land on the same node as the application pod that mounts the PVC.

Optional: Periodic TRIM CronJob pattern for applications

# fstrim-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fstrim-weekly
spec:
  schedule: "0 3 * * 0"  # Sundays 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: fstrim
            image: alpine:3.20
            securityContext:
              runAsUser: 0
              capabilities:
                add: ["SYS_ADMIN"]  # The FITRIM ioctl behind fstrim requires CAP_SYS_ADMIN
            command: ["sh","-c","apk add --no-cache util-linux && fstrim -v /data"]
            volumeMounts:
            - name: v
              mountPath: /data
          volumes:
          - name: v
            persistentVolumeClaim:
              claimName: your-app-pvc  # Replace with your PVC

About clusterID: The clusterID parameter refers to the Rook namespace (e.g., rook-ceph), not the Ceph FSID. The CSI driver uses this to locate the Ceph cluster.

Create storage classes for different workload types:

# storage-classes.yaml
---
# Fast NVMe pool for databases (high IOPS needed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-fast
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # Default if PVC doesn't specify
    # ⚠️ WARNING: every PVC that omits storageClassName will land on the fast pool.
    # If fast capacity is scarce, make ceph-block-slow the default instead to avoid filling it by accident.
provisioner: rook-ceph.rbd.csi.ceph.com  # Which driver handles provisioning
parameters:
  clusterID: rook-ceph              # Rook namespace, not Ceph FSID
  pool: replicapool-fast           # Which Ceph pool to use
  imageFormat: "2"                 # RBD format (v2 is current)
  imageFeatures: layering  # Start conservative; add features after kernel validation
  # When ready (kernels ≥5.4 with Reef), enable faster snapshot/clone features:
  # imageFeatures: layering,exclusive-lock,object-map,fast-diff
  # Note: object-map and fast-diff depend on exclusive-lock
  # mounter: rbd-nbd  # Optional: Use userspace mounter instead of krbd if kernel issues
  # mapOptions: "queue_depth=128"  # For krbd only: More parallelism for DBs (omit for rbd-nbd)
  # The following are credentials for Ceph operations
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4  # Safest on Talos (xfs requires xfsprogs system extension)
allowVolumeExpansion: true          # Can grow volumes without recreating
reclaimPolicy: Delete               # Delete data when PVC deleted
volumeBindingMode: Immediate       # Provision as soon as the PVC is created
# Note: WaitForFirstConsumer delays provisioning until a pod that uses the PVC is scheduled.
# PVCs in that mode sit in Pending until then, which is expected; use Immediate if you
# want volumes created up front.
---
# Standard HDD pool for media/backups (where capacity matters more than speed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-slow
  # annotations:
  #   storageclass.kubernetes.io/is-default-class: "true"  # Safer default for homelabs
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool-slow    # Different pool with different settings
  imageFormat: "2"
  imageFeatures: layering  # Enables copy-on-write clones (used for snapshots and clones); RBD images are thin-provisioned regardless
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Retain  # Keep data even after PVC deleted (safer for backups)
volumeBindingMode: WaitForFirstConsumer  # Wait until pod schedules to provision
---
# Shared filesystem for multi-attach (multiple pods can mount simultaneously)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-filesystem
provisioner: rook-ceph.cephfs.csi.ceph.com  # Different provisioner for CephFS
parameters:
  clusterID: rook-ceph
  fsName: ceph-filesystem
  # Note: 'pool' parameter is optional - CSI will use the default data pool
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
allowVolumeExpansion: true
# Note: 'discard' mount option is for block devices, not applicable to CephFS

Creating Ceph Pools

What are Ceph Pools?
Pools are logical partitions in Ceph storage. Each pool can have different replication levels, performance settings, and quotas. Think of them like different RAID configurations for different purposes.

Want erasure coding for cold data? Use a replicated metadata pool + EC data pool for RBD. See the Rook storageclass-ec.yaml example for configuration details.

Define pools with appropriate replication:

Important: If your OSDs aren't classified yet, either run the 'Verify Device Classes' step first or remove deviceClass from the pool spec and add it later.

# ceph-pools.yaml
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool-fast
  namespace: rook-ceph
spec:
  failureDomain: host      # Replicas on different hosts (not just different disks)
  deviceClass: ssd         # Only use SSD/NVMe OSDs for fast pool
  replicated:
    size: 3                # Keep 3 copies of all data
    requireSafeReplicaSize: true  # Refuse to create the pool with size 1 (guards against accidental single-copy pools)
  parameters:
    compression_mode: aggressive  # Compress unless the client hints the data is incompressible (use 'passive' if CPU-constrained)
    compression_algorithm: lz4    # Fast compression (alternatives: snappy, zlib, zstd)
    # DBs: If CPU tight or latency critical, use 'passive' and enable DB-level compression instead
  quotas:
    maxSize: 500Gi  # Prevent single pool from using all storage
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool-slow
  namespace: rook-ceph
spec:
  failureDomain: host
  deviceClass: hdd         # Only use HDD OSDs for slow/bulk pool
  replicated:
    size: 2  # Only 2 copies for bulk storage (increases data-loss risk during second failure; only for cold/non-critical data)
    # ⚠️ Use size:2 only if you have ≥3 hosts; with two hosts you risk data unavailability during maintenance
    requireSafeReplicaSize: true  # Refuse a size-1 pool; whether writes continue with fewer replicas is governed by min_size
  parameters:
    compression_mode: passive
---
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: ceph-filesystem
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3           # Metadata is critical - keep 3 copies
  dataPools:
  - name: data0
    failureDomain: host
    replicated:
      size: 3           # File data also replicated 3x
  metadataServer:
    activeCount: 1      # One active MDS (metadata server)
    activeStandby: true # Enables standby-replay mode for near-instant failover
    # Costs more RAM/CPU but gives <5s MDS switchover vs 30s+ cold standby
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2000m
        memory: 4Gi
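
One thing worth checking in these definitions is how a pool quota translates into raw disk usage. The sketch below assumes the quota counts logical data before replication and that all four 500 GB OSDs (about 2048 GiB raw) back the pool; both are assumptions to verify on your own cluster, and the function names are hypothetical.

```shell
# Raw capacity consumed when a pool fills its quota.
# $1 = quota in GiB (logical data), $2 = replica count (pool size)
raw_for_quota() {
  echo $(( $1 * $2 ))
}

# Share of raw cluster capacity a full pool would occupy, in percent.
# $1 = quota GiB, $2 = replica count, $3 = raw cluster capacity in GiB
quota_share_pct() {
  echo $(( 100 * $1 * $2 / $3 ))
}

raw_for_quota 500 3          # prints 1500
quota_share_pct 500 3 2048   # prints 73
```

Under those assumptions, a 500Gi quota on a size-3 pool can claim roughly three quarters of the raw space, which leaves little room for the other pools.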

Apply the pools:

# Create the pools first
kubectl apply -f ceph-pools.yaml

# Wait for pools to be ready (about 30 seconds)
sleep 30

# Then create storage classes that reference them
kubectl apply -f storage-classes.yaml

# Verify pools were created
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls

# Expected output:
replicapool-fast
replicapool-slow
ceph-filesystem-metadata
ceph-filesystem-data0

# Ensure RBD application is enabled (Rook usually handles this)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
  ceph osd pool application get replicapool-fast || \
  ceph osd pool application enable replicapool-fast rbd

  ceph osd pool application get replicapool-slow || \
  ceph osd pool application enable replicapool-slow rbd

  # For the size:2 pool, set min_size 2 to prevent single-replica writes.
  # Trade-off: I/O to a PG pauses whenever one of its two replicas is down.
  ceph osd pool set replicapool-slow min_size 2

  # Add hard object quotas to prevent runaway apps from eating the cluster
  ceph osd pool set-quota replicapool-fast max_objects 20000000  # 20M objects
  ceph osd pool set-quota replicapool-slow max_objects 50000000  # 50M objects
  # (or use max_bytes if that is easier for your capacity planning)
'

# Verify CephFS is created correctly
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs ls
# Should show: name: ceph-filesystem, metadata pool: ceph-filesystem-metadata, data pools: [ceph-filesystem-data0]

# Check storage classes
kubectl get storageclass
NAME                    PROVISIONER                        AGE
ceph-block-fast (default)  rook-ceph.rbd.csi.ceph.com     1m
ceph-block-slow            rook-ceph.rbd.csi.ceph.com     1m
ceph-filesystem            rook-ceph.cephfs.csi.ceph.com  1m

# Verify PG autoscaler is doing the right thing
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool autoscale-status | column -t
# Aim for ~50-200 PGs per OSD total across all pools
# Look for warn/ok status - "warn" may need target_size_ratio adjustment

# Check PG distribution per OSD to spot outliers
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd df tree
# Look for wildly uneven PG counts relative to weight
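
The 50-200 PGs per OSD guideline counts every replica of every PG, which is easy to get wrong by hand. Here is a rough sketch of the arithmetic; the function name and the pg_num values in the example are hypothetical, so plug in the numbers from your own autoscale-status output.

```shell
# Average PG replicas per OSD.
# $1 = number of OSDs; remaining args are pg_num:size pairs, one per pool.
pgs_per_osd() {
  osds=$1
  shift
  total=0
  for pool in "$@"; do
    pg_num=${pool%%:*}   # text before the colon
    size=${pool##*:}     # text after the colon
    total=$(( total + pg_num * size ))
  done
  echo $(( total / osds ))
}

# Hypothetical example: four pools spread over 4 OSDs
pgs_per_osd 4 32:3 32:2 16:3 32:3   # prints 76
```

This is only an average. `ceph osd df tree` still tells you whether individual OSDs are far from it.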

Verify Device Classes (Important for VMs)

Virtual machines sometimes misreport disk types. Verify OSDs have correct device classes:

# Check how OSDs are classified (ssd vs hdd)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree

# Sample output:
# ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT
# -1         4.00000  root default
# -3         1.00000      host talos-wrk-01
#  0    hdd  1.00000          osd.0          up   1.00000

# If an SSD shows as 'hdd', manually correct it:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd crush rm-device-class osd.0
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd crush set-device-class ssd osd.0

# This ensures your fast/slow pools use the correct OSDs

# Verify CRUSH rules target the correct device classes:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd crush rule ls
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd crush rule dump replicapool-fast_crush_rule
# Expect a "take" step whose item_name ends in ~ssd (for example "default~ssd")
# If the class suffix is missing, the rule targets all device classes - fix and recreate
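
A quick way to see whether enough OSDs carry each class is to count them. This sketch parses the plain-text `ceph osd tree` output on stdin; it assumes the default column layout shown above (ID, CLASS, WEIGHT, then the osd.N name in the fourth column), and the function name is my own.

```shell
# Count OSDs per device class from `ceph osd tree` plain-text output on stdin.
# OSD rows are the ones whose fourth column looks like osd.N.
count_osds_by_class() {
  awk '$4 ~ /^osd\.[0-9]+$/ { n[$2]++ } END { for (c in n) print c, n[c] }' | sort
}

# Example with a captured sample; in practice pipe the toolbox output in:
# kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree | count_osds_by_class
printf '%s\n' \
  'ID  CLASS  WEIGHT   TYPE NAME  STATUS  REWEIGHT  PRI-AFF' \
  '-1         2.00000  root default' \
  '-3         1.00000      host talos-wrk-01' \
  ' 0    hdd  1.00000          osd.0  up  1.00000  1.00000' \
  '-5         1.00000      host talos-wrk-02' \
  ' 1    ssd  1.00000          osd.1  up  1.00000  1.00000' \
  | count_osds_by_class
```

A size-3 pool with failureDomain: host and deviceClass: ssd needs at least three ssd OSDs on three different hosts, so a count below three for a class is an early warning.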

Testing Storage

Always Test Before Production Use!
Before deploying real applications, verify storage works correctly. This test creates a PVC, mounts it in a pod, and tests read/write performance.

# test-storage.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-block-pvc
spec:
  accessModes:
    - ReadWriteOnce         # Single pod can mount read-write
  storageClassName: ceph-block-fast  # Use our fast storage class
  resources:
    requests:
      storage: 10Gi         # Request a 10 GiB volume
---
apiVersion: v1
kind: Pod
metadata:
  name: test-block-pod
spec:
  containers:
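  - name: test
    image: alpine:3.20
    command: ["sleep", "3600"]  # Keep the container alive; alpine's default shell exits immediately without a TTY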
    volumeMounts:
    - name: test-volume
      mountPath: /data
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: test-block-pvc

# Create test PVC and pod
kubectl apply -f test-storage.yaml

# Watch PVC status change from Pending to Bound
kubectl get pvc test-block-pvc -w
NAME             STATUS    VOLUME   CAPACITY
test-block-pvc   Pending                       # Ceph creating volume
test-block-pvc   Bound     pvc-xxx  10Gi      # Volume ready!

# Verify pod is running
kubectl get pod test-block-pod
NAME             READY   STATUS
test-block-pod   1/1     Running

# Test with less compressible data (zeros compress too well with LZ4)
# Using fio for realistic benchmarks:

# Random write test (database inserts)
# Note: Use io_uring for kernels ≥5.10, fallback to libaio if unsupported
kubectl exec -it test-block-pod -- sh -c \
  'apk add --no-cache fio && \
   fio -name=randwrite -filename=/data/test.fio \
   -ioengine=io_uring -direct=1 -bs=4k -iodepth=32 -rw=randwrite \
   -numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0 || \
   fio -name=randwrite -filename=/data/test.fio \
   -ioengine=libaio -direct=1 -bs=4k -iodepth=32 -rw=randwrite \
   -numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0'

# Random read test (database queries)
kubectl exec -it test-block-pod -- sh -c \
  'fio -name=randread -filename=/data/test.fio \
   -ioengine=io_uring -direct=1 -bs=4k -iodepth=32 -rw=randread \
   -numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0 || \
   fio -name=randread -filename=/data/test.fio \
   -ioengine=libaio -direct=1 -bs=4k -iodepth=32 -rw=randread \
   -numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0'

# Mixed 70/30 read/write (typical DB workload)
kubectl exec -it test-block-pod -- sh -c \
  'fio -name=randrw -filename=/data/test.fio \
   -ioengine=io_uring -direct=1 -bs=4k -iodepth=32 -rw=randrw -rwmixread=70 \
   -numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0 || \
   fio -name=randrw -filename=/data/test.fio \
   -ioengine=libaio -direct=1 -bs=4k -iodepth=32 -rw=randrw -rwmixread=70 \
   -numjobs=1 -size=1G -refill_buffers=1 -randrepeat=0'

# Output shows real IOPS and bandwidth without compression artifacts
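The three fio runs above differ only in the --rw mode and one extra flag, so they can be generated from one helper. This is a minimal sketch, not part of the original setup: the function names fio_cmd and run_fio are hypothetical, it uses fio's double-dash option spelling, and it assumes test-block-pod is running with fio already installed (the first command above installs it).

```shell
# Build one fio command line. Everything except the I/O engine and the
# rw mode is fixed to the values used in the tests above.
# usage: fio_cmd <ioengine> <rw-mode> [extra fio flag]
fio_cmd() {
  printf 'fio --name=%s --filename=/data/test.fio --ioengine=%s --direct=1 --bs=4k --iodepth=32 --rw=%s --numjobs=1 --size=1G --refill_buffers=1 --randrepeat=0%s' \
    "$2" "$1" "$2" "${3:+ $3}"
}

# Try io_uring first and fall back to libaio, as in the commands above.
# usage: run_fio <rw-mode> [extra fio flag]
run_fio() {
  kubectl exec test-block-pod -- sh -c \
    "$(fio_cmd io_uring "$1" "${2:-}") || $(fio_cmd libaio "$1" "${2:-}")"
}

# Examples (not executed here):
#   run_fio randwrite
#   run_fio randread
#   run_fio randrw --rwmixread=70
```

Keeping the parameters in one place makes it harder for the read, write, and mixed runs to drift apart between benchmark sessions.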

# Optional: Direct RBD benchmark to isolate CSI/RBD path
# This tests raw RBD performance without Kubernetes CSI overhead
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
  # Create a temporary test image (not the PVC image)
  rbd pool init replicapool-fast || true
  rbd create -p replicapool-fast benchtest --size 10240   # 10GB test image

  # Run direct RBD benchmark (4K random writes)
  rbd bench -p replicapool-fast benchtest --io-type write \
    --io-size 4K --io-threads 16 --io-total 1G --io-pattern rand

  # Clean up test image
  rbd rm -p replicapool-fast benchtest
'
# Compare these results with FIO to identify CSI overhead

# Cleanup test resources
# (skip this for now if you plan to run the RBD snapshot test later; it needs test-block-pvc)
kubectl delete -f test-storage.yaml
# The PVC and its data will be deleted

Test CephFS (shared filesystem):

# test-cephfs.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs-pvc
spec:
  accessModes: [ ReadWriteMany ]  # Multiple pods can mount
  storageClassName: ceph-filesystem
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-cephfs-pod
spec:
  containers:
  - name: app
    image: alpine:3.20
    command: ["sh","-c","apk add --no-cache coreutils && mkdir -p /shared && dd if=/dev/zero of=/shared/test bs=1M count=256 && ls -lh /shared && sleep 3600"]
    volumeMounts:
    - name: shared
      mountPath: /shared
  volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: test-cephfs-pvc

# Deploy CephFS test
kubectl apply -f test-cephfs.yaml

# Verify the pod created the test file
kubectl logs test-cephfs-pod
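# Should show: -rw-r--r-- 1 root root 256M ... test
# Note: Throughput may look high because zeros compress well; use fio for realistic tests

# Clean up
# (the RWX test below reuses test-cephfs-pvc; to keep it, run
#  'kubectl delete pod test-cephfs-pod' instead)
kubectl delete -f test-cephfs.yaml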

Test CephFS RWX concurrency (proving multiple pods share the filesystem):

# cephfs-rwx-two-pods.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-writer-a
spec:
  containers:
  - name: a
    image: alpine:3.20
    command: ["sh","-c","apk add --no-cache coreutils && while true; do echo A-$(date +%s) >> /shared/log; sleep 1; done"]
    volumeMounts:
    - name: shared
      mountPath: /shared
  volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: test-cephfs-pvc  # Reuses the PVC from above
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-reader-b
spec:
  containers:
  - name: b
    image: alpine:3.20
    command: ["sh","-c","touch /shared/log && tail -f /shared/log"]  # touch avoids a crash if the writer has not created the file yet
    volumeMounts:
    - name: shared
      mountPath: /shared
  volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: test-cephfs-pvc  # Same PVC - truly shared filesystem

kubectl apply -f cephfs-rwx-two-pods.yaml

# Watch reader-b stream lines written by writer-a
kubectl logs -f cephfs-reader-b
# Should see: A-1734567890, A-1734567891, ... proving concurrent access

# Verify both pods see the same inode (truly shared filesystem)
kubectl exec cephfs-writer-a -- stat -c '%i' /shared/log
kubectl exec cephfs-reader-b -- stat -c '%i' /shared/log
# Both should return the same inode number

# Clean up (explicit pod deletion in case of ctrl-c)
kubectl delete pod cephfs-writer-a cephfs-reader-b --ignore-not-found
kubectl delete pvc test-cephfs-pvc

Creating Snapshot Classes (Optional)

Prerequisites: Ensure the snapshot.storage.k8s.io CRDs and the external snapshot-controller are installed for your cluster/distro. These are not provided by Rook.

Optional: Install snapshot controller if not present

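# Install external snapshot CRDs + controller (cluster-wide, do once)
# kubectl cannot apply a raw.githubusercontent.com directory URL, so list each CRD file
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml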
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.3.1/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml

If you enabled CSI snapshotters in the Rook operator, create VolumeSnapshotClasses:

# snapshot-classes.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
driver: rook-ceph.rbd.csi.ceph.com
deletionPolicy: Delete  # Or 'Retain' for production to prevent accidental deletion
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-cephfsplugin-snapclass
driver: rook-ceph.cephfs.csi.ceph.com
deletionPolicy: Delete  # Consider 'Retain' for production snapshots
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
# Note: CephFS snapshots work via CSI but restoring RWX volumes requires apps
# to handle shared filesystem semantics (file locks, cache coherency)

Apply if you plan to use CSI snapshots:

kubectl apply -f snapshot-classes.yaml

# Sanity check: ensure snapshot CRDs exist
kubectl get crd | grep snapshot
# Should show:
# volumesnapshotclasses.snapshot.storage.k8s.io
# volumesnapshotcontents.snapshot.storage.k8s.io
# volumesnapshots.snapshot.storage.k8s.io

Test RBD snapshots:

# test-rbd-snapshot.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-block-snap
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: test-block-pvc  # From earlier RBD test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: ceph-block-fast
  dataSource:
    name: test-block-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: restore-pod
spec:
  containers:
  - name: app
    image: alpine:3.20
    command: ["sh","-c","ls -lh /data && cat /data/test 2>/dev/null | head -c 100 && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: restore-pvc

# Test snapshot and restore
kubectl apply -f test-rbd-snapshot.yaml

# Check snapshot created
kubectl get volumesnapshot
NAME              READYTOUSE   SOURCEPVC        AGE
test-block-snap   true         test-block-pvc   30s

# Verify restored data
kubectl logs restore-pod
# Should list the files written to the original PVC (for example test.fio from the fio runs)

# Clean up
kubectl delete -f test-rbd-snapshot.yaml
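The snapshot, the restored PVC, and the pod come up asynchronously, so reading the restore pod's logs right after the apply can race. A minimal sketch of a wait helper follows; the function name wait_for_restore is hypothetical, and it assumes a kubectl new enough to support `kubectl wait --for=jsonpath` (1.23 or later).

```shell
# Block until the snapshot reports readyToUse=true, then until the
# restore pod is Ready.
# usage: wait_for_restore <snapshot-name> <pod-name> [timeout]
wait_for_restore() {
  kubectl wait --for=jsonpath='{.status.readyToUse}'=true \
    "volumesnapshot/$1" --timeout="${3:-120s}" &&
  kubectl wait --for=condition=Ready "pod/$2" --timeout="${3:-120s}"
}

# Example (not executed here):
#   wait_for_restore test-block-snap restore-pod 180s && kubectl logs restore-pod
```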

Object Storage (S3) - Optional

If you need S3-compatible object storage:

# ceph-object.yaml
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
    deviceClass: ssd  # Metadata benefits from fast storage
  dataPool:
    replicated:
      size: 3
    deviceClass: hdd  # Object data can use slower storage
  # RGW pools honor CRUSH device_class; verify classes with 'ceph osd tree' if RGW targets wrong media
  gateway:
    port: 80
    instances: 1
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: s3-user
  namespace: rook-ceph
spec:
  store: my-store
  displayName: "homelab-s3"

# Deploy object storage
kubectl apply -f ceph-object.yaml

# Wait for RGW pod to be ready
kubectl -n rook-ceph wait --for=condition=ready pod -l app=rook-ceph-rgw --timeout=300s  # Default timeout is only 30s

# Expose RGW service locally (for testing only)
kubectl -n rook-ceph port-forward svc/rook-ceph-rgw-my-store 8080:80 &
# Note: Port-forward is for testing; use Ingress/LoadBalancer for real clients

# Get S3 credentials
ACCESS_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-my-store-s3-user \
  -o jsonpath='{.data.AccessKey}' | base64 -d)
SECRET_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-my-store-s3-user \
  -o jsonpath='{.data.SecretKey}' | base64 -d)

echo "Access Key: $ACCESS_KEY"
echo "Secret Key: $SECRET_KEY"
echo "Endpoint: http://localhost:8080"

# Quick smoke test with AWS CLI
apt-get update && apt-get install -y awscli  # Or use a container with awscli
aws configure set aws_access_key_id $ACCESS_KEY
aws configure set aws_secret_access_key $SECRET_KEY
aws configure set default.region us-east-1  # Required even for local S3
aws --endpoint-url http://localhost:8080 s3 mb s3://test-bucket
aws --endpoint-url http://localhost:8080 s3 ls
aws --endpoint-url http://localhost:8080 s3 cp /etc/hosts s3://test-bucket/
aws --endpoint-url http://localhost:8080 s3 ls s3://test-bucket/

Performance Tuning

OSD Optimization

Why Tune Performance?
Default Ceph settings are conservative for stability. In a homelab, we can be more aggressive for better performance.

# Set OSD memory target (default is ~4GB per OSD in modern Ceph)
# Raising it to 6GB gives BlueStore more cache
# Budget: 6GB × number of OSDs per node, plus headroom for kubelet/system
# Note: bluestore_cache_autotune (on by default) sizes the caches to fit this target;
# it is a best-effort target, not a hard limit, so OSDs can briefly exceed it
# Re-check node headroom during initial rebalance and backups
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph config set osd osd_memory_target 6442450944  # 6GB in bytes

# Verify the setting took effect
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph config get osd osd_memory_target
# Should show: 6442450944
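The byte value and the per-node budget are simple arithmetic, sketched below. The helper names are hypothetical, and the 16 GiB node with one OSD is an illustrative assumption rather than a measurement from this cluster.

```shell
# Convert GiB to bytes for osd_memory_target.
# usage: gib_to_bytes <gib>
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

# RAM left on a node after reserving the OSD memory targets, in GiB.
# usage: node_headroom_gib <node_ram_gib> <osds_on_node> <target_gib>
node_headroom_gib() { echo $(( $1 - $2 * $3 )); }

gib_to_bytes 6             # prints 6442450944
node_headroom_gib 16 1 6   # hypothetical 16 GiB node with one OSD: prints 10
```

Whatever is left after the OSD targets still has to cover the kubelet, the system, and every other pod scheduled on that node.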

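# Enable BlueStore compression on the pool
# LZ4 is fast with a reasonable compression ratio
# Alternative algorithms: snappy, zlib, zstd (zstd can be CPU-heavy for small writes)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool set replicapool-fast compression_algorithm lz4
# The algorithm alone does nothing while compression_mode is "none" (the default);
# set a mode unless your CephBlockPool spec already does
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool set replicapool-fast compression_mode aggressive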

# Compression is per-object and transparent to clients
# Verify compression settings:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool get replicapool-fast compression_algorithm
# Also check: compression_mode, compression_min_blob_size

# Tune recovery settings for homelab
# These control how aggressively Ceph repairs/rebalances
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c "
  ceph config set osd osd_recovery_max_active 3        # Max concurrent recovery ops
  ceph config set osd osd_max_backfills 1              # Max backfill ops per OSD
  ceph config set osd osd_recovery_sleep_hdd 0.1      # HDD recovery sleep (seconds)
  ceph config set osd osd_recovery_sleep_ssd 0        # No sleep for SSD recovery
"

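# Why these values:
# - Lower values = less impact on client I/O during recovery
# - Higher values = faster recovery but more impact on client I/O
# - These settings trade some recovery speed for a responsive cluster
# - With the mClock scheduler (the default in recent Ceph releases), recovery limits are
#   managed by mClock profiles; the manual values above may be ignored unless
#   osd_mclock_override_recovery_settings is enabled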

PG Autoscaler Pool Weighting

Optionally weight your pools based on expected usage:

# Example: tell the autoscaler to expect ~60% of stored data in the fast pool and ~40% in the slow pool
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool set replicapool-fast target_size_ratio 0.6
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool set replicapool-slow target_size_ratio 0.4

# The PG autoscaler will adjust placement groups based on these hints
# Aim for 50-200 PGs per OSD; avoid creating many tiny pools
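To check a layout against the 50-200 PGs-per-OSD guideline, count PG replicas rather than PGs: each pool contributes pg_num × replica size, spread over all OSDs. The sketch below uses a hypothetical helper name and made-up pool sizes; read the real pg_num values from `ceph osd pool autoscale-status`.

```shell
# PG replicas per OSD = sum(pg_num * replica_size) / number of OSDs.
# usage: pgs_per_osd <osd_count> <pg_num:size> [<pg_num:size> ...]
pgs_per_osd() {
  osds=$1; shift
  total=0
  for pool in "$@"; do
    # "64:3" -> pg_num 64, replica size 3
    total=$(( total + ${pool%%:*} * ${pool##*:} ))
  done
  echo $(( total / osds ))
}

# Hypothetical numbers: 4 OSDs, fast pool 64 PGs x3, slow pool 32 PGs x2
pgs_per_osd 4 64:3 32:2   # prints 64
```

The PGS column of `ceph osd df` shows the actual per-OSD count, which will differ from this average when device classes split the pools across different OSDs.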

# Also set explicit min_size for safety (matches requireSafeReplicaSize):
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool set replicapool-fast min_size 2  # For size=3 pool
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool set replicapool-slow min_size 2  # For size=2 pool: I/O pauses while either replica is down

Network Optimization

# Set proper heartbeat and recovery settings through Ceph config
# (Note: environment variables on DaemonSets won't work for these)
# If running VMs on noisy NICs, test smaller increments first (15/30) before 30/60
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c "
  ceph config set osd osd_heartbeat_interval 30 &&     # Check peer health every 30s (default: 6s)
  ceph config set global osd_heartbeat_grace 60        # Wait 60s before marking OSD down (default: 20s); global because MONs read it too
"

# Why these matter:
# - These affect OSD peer detection, not client I/O timeouts
# - Home networks have more variable latency than datacenter networks
# - Trade-off: higher values mean fewer false "down" reports on noisy links but slower detection of real failures
# - Document these values for on-call expectations (60s grace = ~1 minute to detect a real OSD failure)
# - Don't push too high (e.g., minutes) or you'll mask real failures
# - Sweet spot for homelab: 30s/60s balances stability vs detection speed

Advanced Network Isolation (Optional):
For production clusters, consider using Multus CNI to separate storage traffic onto a dedicated network interface. This prevents storage replication from competing with pod traffic:

# With Multus, you can attach OSDs to a dedicated storage VLAN
# Example: 10GbE for storage, 1GbE for pod traffic
# See: https://github.com/rook/rook/blob/master/design/ceph/multus-network.md

Maintenance Operations

Preventing rebalancing during maintenance:

# Before shutting down a storage node for maintenance:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout
# This tells Ceph not to mark down OSDs as "out", which would trigger rebalancing

# Do your maintenance (node reboot, disk replacement, etc.)

# After maintenance is complete:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout

Maintenance flags reference:

Flag	Purpose	When to Use	Warning
`noout`	Prevents down OSDs from being marked out	Node reboots, short maintenance	Unset ASAP
`norebalance`	Stops data redistribution	Adding multiple OSDs	Can affect redundancy
`nobackfill`	Pauses backfill operations	Performance-critical periods	Delays recovery
`norecover`	Stops recovery operations	Investigating issues	High risk - use briefly
`pause`	Stops all I/O	Emergency debugging	Blocks all client I/O

# Example: Full maintenance mode for major work
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
  ceph osd set noout &&
  ceph osd set norebalance &&
  ceph osd set nobackfill
'

# ALWAYS unset all flags after maintenance:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
  ceph osd unset noout &&
  ceph osd unset norebalance &&
  ceph osd unset nobackfill
'
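Forgetting to unset the flags is the easy mistake here, so it can help to wrap the maintenance step in a function that always clears them afterwards. This is a sketch under assumptions: ceph_tools and with_maintenance_flags are hypothetical names, and the wrapper does not handle Ctrl-C or a closed terminal, so check the flags manually after an interrupted run.

```shell
# Run a command inside the Rook toolbox.
ceph_tools() { kubectl -n rook-ceph exec deploy/rook-ceph-tools -- "$@"; }

# Set noout/norebalance/nobackfill, run the given command, then unset the
# flags again whether or not the command succeeded.
# usage: with_maintenance_flags <command> [args...]
with_maintenance_flags() {
  rc=0
  for flag in noout norebalance nobackfill; do
    ceph_tools ceph osd set "$flag" || rc=1
  done
  # Only run the maintenance command if every flag was set.
  if [ "$rc" -eq 0 ]; then
    "$@"
    rc=$?
  fi
  for flag in nobackfill norebalance noout; do
    ceph_tools ceph osd unset "$flag"
  done
  return "$rc"
}

# Example (not executed here):
#   with_maintenance_flags talosctl -n 192.168.0.14 reboot
```

The function returns the wrapped command's exit code. In practice you would also wait for the node's OSDs to report up before the flags are cleared.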

Schedule scrubs for off-peak hours:

# Configure scrubs to run during quiet hours (1 AM - 6 AM)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
  ceph config set osd osd_scrub_begin_hour 1 &&
  ceph config set osd osd_scrub_end_hour 6 &&
  ceph config set osd osd_deep_scrub_interval 1209600  # Deep scrub every 14 days
'

# Scrubs verify data integrity but can impact performance
# Scheduling them for quiet hours minimizes user impact

Monitoring Ceph Health

Prometheus Integration

If you're running kube-prometheus-stack, add ServiceMonitor and alerts:

# First check what port name your mgr service uses
kubectl -n rook-ceph get svc rook-ceph-mgr -o jsonpath='{.spec.ports[*].name}{"\n"}'
# If it shows 'metrics' instead of 'http-metrics', update the ServiceMonitor below

# servicemonitor-ceph-mgr.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rook-ceph-mgr
  namespace: rook-ceph
  labels:
    release: prometheus  # Match your Prometheus Operator selector
spec:
  selector:
    matchExpressions:
    - key: app
      operator: In
      values: [rook-ceph-mgr]
  namespaceSelector:
    matchNames: ["rook-ceph"]
  endpoints:
  - port: http-metrics  # Or 'metrics' depending on your version
    interval: 30s
---
# prometheusrule-ceph.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-rules
  namespace: rook-ceph
  labels:
    release: prometheus
spec:
  groups:
  - name: ceph.health
    rules:
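    # Recording rule to detect maintenance flags (prevents false alarms)
    # Evaluates to 1 while any flag is set and 0 otherwise, so alerts can compare against it
    # (the mgr prometheus module exports the flags as ceph_osd_flag_*)
    - record: ceph:maintenance_active
      expr: |
        max({__name__=~"ceph_osd_flag_(noout|norebalance|nobackfill)"})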
    - alert: CephHealthError
      expr: ceph_health_status == 2
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Ceph cluster health is ERROR"
        description: "Ceph cluster is in ERROR state for >2 minutes"
    - alert: CephOSDNearFull
      expr: ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > 0.85  # Matches the default nearfull ratio
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "OSD {{ $labels.ceph_daemon }} is nearfull"
        description: "OSD is above the nearfull ratio (default ~85%; backfillfull ~90%, full ~95%)"
    - alert: CephMonQuorumLost
      expr: min(ceph_mon_quorum_status) == 0  # Fires when any single MON is out of quorum
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Ceph MON out of quorum"
        description: "At least one monitor has left quorum - losing the majority blocks cluster operations"
    - alert: CephPGDegraded
      expr: ceph_pg_degraded > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{{ $value }} degraded PGs present"
        description: "Data is under-replicated - recovery in progress"
    - alert: CephOSDDown
      expr: count(ceph_osd_up == 0) > 0 unless ceph:maintenance_active == 1  # count(), not sum(): down OSDs have value 0; suppressed during maintenance
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "{{ $value }} OSD(s) down"
        description: "One or more OSDs are down - data at risk"

kubectl apply -f servicemonitor-ceph-mgr.yaml
kubectl apply -f prometheusrule-ceph.yaml

# Note: Metric names can vary across exporters - adjust expressions to match your exporter

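# Optional: Add an inhibition rule to silence alerts during maintenance
# This assumes you also define an info-level CephMaintenanceMode alert (not shown above)
# that fires while noout/nobackfill/norebalance are set. In your Alertmanager config: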
#   - source_matchers: [severity="info", alertname="CephMaintenanceMode"]
#     target_matchers: [severity=~"warning|critical"]
#     equal: ['cluster']

# Verify metrics are being scraped
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr 9283:9283 &
curl -s localhost:9283/metrics | grep ceph_health

Dashboard Access

Ceph includes a web dashboard for monitoring and management:

# Get the auto-generated admin password
kubectl -n rook-ceph get secret rook-ceph-dashboard-password \
  -o jsonpath="{['data']['password']}" | base64 --decode
# Save this password!

# Create a tunnel to access the dashboard
# This forwards local port 8443 to the dashboard service
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 8443:8443

# Now access in your browser:
# URL: https://localhost:8443
# Username: admin
# Password: (from above command)
# Note: You'll get a certificate warning - this is expected

CLI Monitoring

Key commands for monitoring Ceph health:

Capacity Thresholds in Ceph (ratios, not fixed bytes):

~85% (nearfull ratio): Warning alerts start
~90% (backfillfull ratio): New backfills throttled
~95% (full ratio): Writes blocked completely

Keep your cluster below 80-85% for safe operation headroom. These are configurable ratios that change cluster behavior at each threshold.
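As a rough illustration of how those ratios map to states, the sketch below classifies a used/total pair against the default values. The function name capacity_state is hypothetical. Ceph evaluates the ratios per OSD, so one unevenly filled OSD can cross a threshold while the cluster-wide average still looks fine.

```shell
# Classify usage against Ceph's default ratios (0.85 / 0.90 / 0.95).
# usage: capacity_state <used_bytes> <total_bytes>
capacity_state() {
  pct=$(( $1 * 100 / $2 ))   # integer percentage, rounded down
  if   [ "$pct" -ge 95 ]; then echo full
  elif [ "$pct" -ge 90 ]; then echo backfillfull
  elif [ "$pct" -ge 85 ]; then echo nearfull
  else echo ok
  fi
}

capacity_state 850 1000   # prints nearfull
capacity_state 500 1000   # prints ok
```

The ratios your cluster actually uses appear in the output of `ceph osd dump` as full_ratio, backfillfull_ratio, and nearfull_ratio.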

# Overall cluster health with details
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
# HEALTH_OK = everything good
# HEALTH_WARN = check warnings (often non-critical)
# HEALTH_ERR = immediate attention needed

# Storage usage by pool
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
# Shows:
# - RAW STORAGE: Total physical capacity
# - USED: Actual bytes used (includes replicas)
# - AVAIL: Space available for new data
# - %USED: Percentage full (keep under 85%)

# OSD performance metrics
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
# Shows latency for each OSD
# commit_latency: Write confirmation time
# apply_latency: Write to disk time
# High latency (>100ms) indicates problems

# Benchmark with rados bench (preferred over dd for Ceph testing)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  rados -p replicapool-fast bench 10 write --no-cleanup
# Runs 10-second write test
# Shows bandwidth and IOPS achieved

# Clean up test objects
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  rados -p replicapool-fast cleanup

# Note: rados bench is preferred for Ceph performance testing because it
# exercises the whole cluster path (including replication) without CSI or
# filesystem layers. Its payload is not random data, so treat results on a
# compression-enabled pool as optimistic.

Backup & Disaster Recovery

Always backup critical Ceph secrets before major changes:

# Save admin and CSI keys
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph auth export client.admin > ceph-client-admin.keyring

# Backup CSI secrets
for secret in rook-csi-rbd-node rook-csi-rbd-provisioner \
              rook-csi-cephfs-node rook-csi-cephfs-provisioner; do
  kubectl -n rook-ceph get secret $secret -o yaml > ${secret}.secret.yaml
done

# Save the mon secret (cluster fsid and mon/admin keys)
kubectl -n rook-ceph get secret rook-ceph-mon -o yaml > rook-ceph-mon.secret.yaml

# Save version info and config for post-incident diffing
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions > ceph-versions.txt
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph config dump > ceph-config-dump.txt

# Store these backups OFF-CLUSTER
# They're critical for disaster recovery if you lose the k8s control plane

Migration Story: Longhorn to Ceph

My migration wasn't smooth, but here's what I learned:

The Problems with Longhorn

Performance degradation: As volumes grew, performance tanked
Network storms: Replica sync during rebuilds saturated the network
Instance manager crashes: Lost all volumes on a node twice
No compression: Storage efficiency was poor

Migration Process

Here's the step-by-step process I used:

# Step 1: Deploy Ceph alongside Longhorn
# Both storage systems can coexist during migration

# Step 2: Create new PVCs on Ceph
# For each Longhorn PVC, create equivalent Ceph PVC

# Step 3: Application-level data migration
# This is safer than trying to copy volumes directly

# Step 4: Switch applications to Ceph PVCs
# Update deployments to use new PVC names

# Step 5: Delete Longhorn volumes
# After verifying data integrity

# Step 6: Remove Longhorn
# Uninstall once all data migrated

# Example: Migrating PostgreSQL database
# 1. Backup from Longhorn volume
# (no -t here: a TTY would add carriage returns to the dump; adjust -U to your superuser)
kubectl exec postgres-pod -- pg_dumpall -U postgres > backup.sql

# 2. Scale down PostgreSQL
kubectl scale deployment postgres --replicas=0

# 3. Update deployment to use Ceph PVC
kubectl edit deployment postgres
# Change: claimName: postgres-longhorn
# To: claimName: postgres-ceph

# 4. Scale back up
kubectl scale deployment postgres --replicas=1

# 5. Restore data (-i keeps stdin open for the redirect; the pod name changes after the rollout)
kubectl exec -i postgres-pod -- psql -U postgres < backup.sql

# 6. Verify data integrity
kubectl exec -it postgres-pod -- psql -c "SELECT COUNT(*) FROM your_table;"

Lessons Learned

Plan for 2x storage during migration
- You need both systems running simultaneously
- Migration can take days for large datasets
Test restore procedures first
- Practice on test databases/non-critical data
- Verify backup/restore commands work
Monitor resource usage closely
- Ceph uses lots of RAM during initial data distribution
- Watch for OOM kills during migration
Use application-level migration
- Database dumps, rsync, application-specific tools
- Don't try to copy raw PV contents between backends while the app is running (risk of inconsistent data)
Keep Longhorn for a week
- Don't rush to delete old storage
- You might discover missing data days later

Troubleshooting Guide

OSD Won't Start

# Check OSD pod logs for errors
kubectl -n rook-ceph logs -l app=rook-ceph-osd --tail=50

# Common issues and solutions:

# 1. ERROR: Disk has existing filesystem
# Solution: Wipe the disk completely (use bare device name)
talosctl -n 192.168.0.14 wipe disk sdb
# Then delete and recreate the OSD pod

# 2. ERROR: OOM killed (out of memory)
# Solution: Reduce memory target
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph config set osd osd_memory_target 2147483648  # 2GB

# 3. ERROR: Permission denied on /dev/sdb
# Solution: Check Talos kernel modules loaded
talosctl -n 192.168.0.14 dmesg | grep -i ceph
# Should see: "rbd: loaded" and "libceph: loaded"

# 4. ERROR: Clock skew detected
# Solution: Check Ceph's view of time skew
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail | grep clock
# If clock skew persists, check Talos NTP configuration in machine config

Slow Performance

# 1. Check if recovery/rebalancing is running
# This severely impacts performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Look for: "recovery: 123 MB/s, 31 objects/s"
# If present, wait for it to complete

# 2. Check individual OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
# Look for outliers with high latency (>50ms watch, >100ms investigate)
# osd.0: commit_latency: 12ms, apply_latency: 15ms  # Good
# osd.1: commit_latency: 250ms, apply_latency: 300ms # Bad - investigate this OSD

# 3. Check network latency between nodes
kubectl run netshoot --rm -i --tty --image nicolaka/netshoot -- \
  ping -c 10 192.168.0.15
# Latency should be <1ms on local network

# 4. Check if pools are healthy
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd pool ls detail
# Look for: "flags" field - should be empty or "hashpspool"
# "full" or "nearfull" = immediate problem

# 5. Check disk I/O on host
talosctl -n 192.168.0.14 dashboard
# Look for high iowait in the CPU panel (>20% indicates disk bottleneck)

PVC Stuck Pending

When a PVC won't bind to a volume:

# 1. Check PVC events for errors
kubectl describe pvc <pvc-name>
# Look at Events section for errors like:
# - "no capacity" = out of space
# - "pool not found" = pool doesn't exist
# - "failed to provision" = CSI driver issue

# 2. Verify the pool exists
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls
# Should see your pool name (e.g., replicapool-fast)

# 3. Check CSI driver pods are running
kubectl get pods -n rook-ceph | grep csi
# Should see:
# csi-cephfsplugin-xxxxx     2/2     Running
# csi-rbdplugin-xxxxx        2/2     Running
# csi-cephfsplugin-provisioner-xxxxx  5/5  Running
# csi-rbdplugin-provisioner-xxxxx     5/5  Running

# 4. Check if storage class exists
kubectl get storageclass
# Verify the storage class name matches your PVC

# 5. Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health
# HEALTH_ERR would prevent provisioning

# 6. Raise OSD debug logging temporarily (if needed)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
  ceph tell "osd.*" config set debug_osd 10/10
  ceph tell "osd.*" config set debug_bluestore 10/10
'
# Check logs, then revert to the defaults (1/5):
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash -c '
  ceph tell "osd.*" config set debug_osd 1/5
  ceph tell "osd.*" config set debug_bluestore 1/5
'

# 7. Check for capacity
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
# At ~85% (nearfull) you get warnings
# Writes stop at ~95% (full) and backfill throttles around 90% (backfillfull)

Resource Management

After running Ceph for months, here's actual resource usage:

# Current allocation (observed in my cluster)
Ceph Components:
  MON (3x): 1.5 CPU, 3GB RAM total
  MGR (2x): 1 CPU, 1GB RAM total
  MDS (2x): 1 CPU, 2GB RAM total
  OSD (4x): 4 CPU, 16GB RAM total

Total: ~7.5 CPU, 22GB RAM

# Actual usage
MON: 100m CPU, 400MB RAM each
MGR: 150m CPU, 300MB RAM each
MDS: 80m CPU, 250MB RAM each
OSD: 200m CPU, 3.8GB RAM each

# Over-provisioned by ~5x for CPU, ~1.3x for RAM (22GB allocated vs ~17.5GB used)
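The over-provisioning factors can be checked directly from the allocation and usage figures listed above. A small sketch follows; the helper name ratio is hypothetical and the numbers are the ones from the two listings.

```shell
# ratio <allocated> <used> -> allocated/used, one decimal place
ratio() { awk -v a="$1" -v u="$2" 'BEGIN { printf "%.1f\n", a / u }'; }

# Actual CPU in millicores: 3 MON + 2 MGR + 2 MDS + 4 OSD
cpu_used=$(( 3*100 + 2*150 + 2*80 + 4*200 ))     # 1560m
# Actual RAM in MB for the same daemons
ram_used=$(( 3*400 + 2*300 + 2*250 + 4*3800 ))   # 17500 MB

ratio 7500 "$cpu_used"    # CPU: 7.5 cores allocated, prints 4.8
ratio 22000 "$ram_used"   # RAM: 22GB allocated, prints 1.3
```

The OSDs account for most of the memory, which is why RAM has far less slack than CPU.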

Security Considerations

Encryption at Rest

For sensitive data, enable encryption:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-encrypted
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool-fast
  encrypted: "true"  # Without KMS: stores keys in k8s secrets (homelab OK)
  # encryptionKMSID: vault-kms  # Production: use external KMS like Vault
  # Also copy the csi.storage.k8s.io/*-secret-* parameters from your existing RBD StorageClass

For production with proper key management (covered in Part 6/7):

# ceph-block-encrypted-vault.yaml
parameters:
  encrypted: "true"
  encryptionKMSID: vault-kms  # References KMS config in rook-ceph-csi-kms-config

Network Isolation

Warning: NetworkPolicies can break Ceph operator/CSI functionality. The Rook operator needs Kubernetes API access, and some CSI pods use hostNetwork. Leave commented out unless you know you need it.

# network-policy.yaml (OPTIONAL - uncomment only if required)
# apiVersion: networking.k8s.io/v1
# kind: NetworkPolicy
# metadata:
#   name: ceph-network-isolation
#   namespace: rook-ceph
# spec:
#   podSelector: {}  # Apply to all pods in namespace
#   policyTypes:
#   - Ingress
#   - Egress
#   ingress:
#   - from:
#     - namespaceSelector:
#         matchLabels:
#           kubernetes.io/metadata.name: rook-ceph  # Well-known label
#     - podSelector: {}  # Allow all pods in the namespace to communicate
#   egress:
#   - to:
#     - namespaceSelector:
#         matchLabels:
#           kubernetes.io/metadata.name: rook-ceph  # Well-known label
#     - podSelector: {}  # Allow egress to all pods in the namespace
#   - to:  # Allow DNS
#     - namespaceSelector:
#         matchLabels:
#           kubernetes.io/metadata.name: kube-system
#     ports:
#     - protocol: UDP
#       port: 53
#     - protocol: TCP
#       port: 53
#   - to:  # Allow Kubernetes API access
#     - ipBlock:
#         cidr: 10.96.0.1/32  # Replace with: kubectl get svc kubernetes -n default -o jsonpath='{.spec.clusterIP}'
#     ports:
#     - protocol: TCP
#       port: 443

Pre-Flight Checklist

Before considering Ceph production-ready, verify:

[ ] Health Check: ceph status shows HEALTH_OK with all OSDs up/in
[ ] Performance Baseline: rados bench completes without triggering recovery
[ ] Dashboard Access: svc/rook-ceph-mgr-dashboard reachable at https://localhost:8443 (via port-forward)
[ ] Metrics Flow: Prometheus scraping ceph_* metrics (if configured)
[ ] Time Sync: All nodes within 50 ms of each other (Ceph warns at its default mon_clock_drift_allowed of 0.05 s)
[ ] Test Workloads: Both RBD and CephFS PVCs mount successfully

[ ] Maintenance Flags Clear: No unexpected cluster-wide OSD flags

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd dump | grep flags
# Should NOT show: noout, norebalance, nobackfill, norecover
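Reading the flags line by eye gets tedious because Ceph always lists a few internal flags there as well. A minimal sketch of a filter that prints only the maintenance-related ones follows; the function name is hypothetical and the sample line is illustrative rather than captured from this cluster.

```shell
# Print any maintenance flags found in a "flags ..." line from `ceph osd dump`.
# Prints nothing when the cluster is clear.
# usage: maintenance_flags_set "<flags line>"
maintenance_flags_set() {
  # Split on spaces and commas, then keep only exact flag names.
  echo "$1" | tr ' ,' '\n\n' |
    grep -E -x 'noout|norebalance|nobackfill|norecover|pause|pauserd|pausewr' || true
}

# Illustrative sample line
line='flags noout,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit'
maintenance_flags_set "$line"   # prints noout

# Live usage (not executed here):
#   maintenance_flags_set "$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd dump | grep '^flags')"
```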

[ ] Storage Classes: Default SC set, all expected classes present

kubectl get sc

[ ] CRUSH Rules: Verify pools use correct device classes

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd crush rule ls

What's Next

With Ceph providing resilient storage, your cluster is ready for stateful applications. In Part 5, we'll implement GitOps with ArgoCD, enabling declarative deployments and automatic synchronization from Git repositories.

Key Takeaways

Ceph requires significant resources but delivers enterprise-grade storage
Proper disk selection is critical - dedicated disks perform much better
Migration requires careful planning - always test restore procedures
Compression saves significant space with minimal performance impact
Monitor resource usage closely - Ceph can consume all available resources

Quality of Life Tips

Post-Deployment Cleanup

# Option 1: Scale down the toolbox when not needed (saves ~500MB memory)
kubectl -n rook-ceph scale deploy rook-ceph-tools --replicas=0
# Scale back up when needed: --replicas=1

# Option 2: Delete entirely and recreate later
kubectl -n rook-ceph delete deploy/rook-ceph-tools
# Recreate later with: kubectl apply -f toolbox.yaml
# Note: The toolbox carries the admin keyring; deleting it doesn't delete creds,
# it just removes the Pod

# Verify CRUSH rules match your intent
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph osd crush rule ls
# Confirm each pool's rule with: ceph osd pool get <pool> crush_rule

# Check actual memory headroom on storage nodes
kubectl top nodes | grep wrk
# Ensure at least 2GB free after accounting for OSDs + kubelet

Quick Health Checks

# One-liner cluster health
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph health | head -1

# Storage utilization at a glance
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph df | grep -E "TOTAL|POOL"

# Handy aliases for frequent checks
alias cephs='kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s'
alias cephh='kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail'

References

Rook Documentation: https://rook.io/docs/rook/v1.18/
Ceph Architecture: https://docs.ceph.com/en/reef/architecture/
Ceph on Kubernetes: https://docs.ceph.com/en/reef/rbd/rbd-kubernetes/
Storage Benchmarking: https://github.com/ceph/cbt
Talos Storage Configuration: https://www.talos.dev/v1.11/kubernetes-guides/configuration/storage/

Continue to Part 5: GitOps with ArgoCD →