"In production you don't rise to the level of your architecture... you fall to the level of your observability."
My SigNoz dashboards went blood-red: Longhorn's instance-manager pods were thrashing in a tight restart loop, chewing through crash-backoff timers like popcorn. Within minutes I was staring at a storm of VolumeDegraded alerts and a gut-level certainty that sleep was cancelled for the next few nights.
What followed was a 72-hour gauntlet of log spelunking, rollbacks, and caffeine-fuelled decision-making that culminated in a live migration from Longhorn v1.9.0 → Ceph 19.2.3 Squid via Rook v1.17.0.
This write-up is not a victory lap. It's the post-incident narrative I wish I'd found when I first dipped my toes into self-hosted block storage. My goal is to walk you through every decision point, detour, and forehead slap so you can learn from (and hopefully laugh at) my scars instead of earning your own.
Homelab Landscape (Setting the Stage)
Before we autopsy the failure, you need to understand the sandbox:
| Layer | Details |
|---|---|
| Hardware | 7-node cluster: 3 control plane nodes + 4 worker nodes with dedicated NVMe storage |
| OS | Talos Linux v1.10.5 (immutable, API-only)[1] |
| Network | Cilium v1.18.0 CNI with native network policies[2] |
| Platform | Kubernetes v1.33.3 (upgraded from Talos default v1.33.2), managed via ArgoCD and GitOps |
| Stateful Apps | PostgreSQL (CloudNativePG), Redis, MariaDB (Ghost blog), SigNoz observability, Marqo vector database |
| Security | VaultWarden, Infisical secrets management, Tetragon, CrowdSec |
| Observability | SigNoz for metrics/traces/logs, custom exporters |
| Backups | Velero 1.13 with restic backend to S3-compatible storage |
Sidebar: Why Talos? Because when you're debugging storage you don't want an SSH session to become a footgun. Talos forces you to interact via an API, eliminating the temptation to "just edit a file in /etc" at 3 a.m.
Act I — The Night Everything Caught Fire
22:04 | Alert flood
The first SigNoz alert was innocuous enough: Longhorn Replica Manager restart count > 3 in 10 min. By the time I alt-tabbed, the count was 27. A quick kubectl get pods -n longhorn-system confirmed that every instance-manager pod was crash-looping.
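The alert condition itself (more than 3 restarts in 10 minutes) is a sliding-window count. Here is a minimal Python sketch of that logic, not SigNoz's actual rule engine; the function name and the timestamps-in-seconds input are assumptions made for illustration.

```python
from collections import deque


def make_restart_alert(threshold=3, window_s=600):
    """Return a recorder that fires once more than `threshold` restarts
    fall inside a sliding window of `window_s` seconds."""
    events = deque()

    def record(ts):
        # ts: restart time in seconds (hypothetical input format).
        events.append(ts)
        # Drop restarts that have aged out of the window.
        while events and ts - events[0] > window_s:
            events.popleft()
        return len(events) > threshold

    return record


alert = make_restart_alert()
fired = [alert(t) for t in (0, 60, 120, 180)]  # fourth restart trips the alert
```

With 27 restarts inside the same window, a rule like this fires early and then stays red, which matches what the dashboard showed.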
22:11 | Smoke test
I exec'd into the Ghost pod and tried a simple touch /var/lib/ghost/content/.liveness. Instant I/O hang. That was the moment I realized the issue was block-level, not app-level.
22:18 | Log spelunking
The canonical error:
WARN engine-controller: failed to start engine for volume ghost-content: cannot connect to engine: timed out waiting for engine.ready==true
ERROR instance-manager-r: engine API version mismatch – engine reports 6-1, CRD expects 0
I'd read about the magic version bug months earlier but dismissed it as edge-case folklore.[3] Seeing it live was equal parts validation and dread.
23:00 | First attempt: "Quick" patch
helm repo update && \
helm upgrade longhorn longhorn/longhorn \
--namespace longhorn-system \
--version 1.9.1
Helm returned deployed. The cluster disagreed. The manager Deployment rolled but the longhorn-engine-image DaemonSet didn't — node selectors had drifted since the last upgrade. Five minutes later the mismatch errors were back, now with a shiny new version combo (manager 1.9.1 ↔ engine 1.9.0).
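A check I wish I'd had in the pipeline: compare the versions the components actually report, rather than trusting Helm's release status. The sketch below is plain Python over a dictionary; the component names are illustrative, and how the versions get collected (image tags, CR status fields) is left out on purpose.

```python
def find_version_skew(components):
    """Group component names by reported version.

    Returns an empty dict when everything agrees, otherwise a mapping of
    version -> component names so the odd one out is obvious.
    """
    by_version = {}
    for name, version in components.items():
        by_version.setdefault(version, []).append(name)
    return by_version if len(by_version) > 1 else {}


# Hypothetical snapshot matching the state described above.
skew = find_version_skew({"manager": "1.9.1", "engine-image": "1.9.0"})
```

An empty result means the rollout converged; anything else is worth a page before the next upgrade step.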
The git log tells the story:
# The version dance of despair
d012a80 re-upgrade to v1.9.1 to fix version mismatch
8a6f5ef downgrade to v1.9.0 to resolve instance manager cycling
e25addf revert to v1.9.1 - downgrade not supported
4cc9661 revert to v1.9.0 due to instance manager API version bug
By 02:00 the cluster was teetering: half the volumes Degraded, Ghost in read-only mode, and critical services timing out.
That's when I pulled the ripcord.
Decision fork:
A. Keep fighting Longhorn and risk data loss with no clean backup.
B. Stand up Ceph in parallel and copy what data I could while volumes were still readable.
I chose B.
Act II — Crash-Course in Ceph (Why & How)
If you've never run Ceph, imagine a cohesive distributed storage project with dozens of cooperating daemons that handle everything from object storage to block devices to filesystems. It's intimidating, but once you grasp the mental model, it's kind of beautiful.
3.1 Bootstrapping Rook
The actual Ceph deployment was surprisingly smooth (using Rook v1.17.0[4]):
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3 # Squid
  mon:
    count: 3 # one per control-plane node
  storage:
    nodes:
      - name: cluster-wrk-01
        devices:
          - name: /dev/sdb
      - name: cluster-wrk-02
        devices:
          - name: /dev/sdb
      - name: cluster-wrk-03
        devices:
          - name: /dev/sdb
Timeline:
- 03:07 - Applied CRDs
- 03:10 - Rook operator running (~150 MiB RAM)
- 03:18 - MON pods scheduled (RocksDB default 512 MiB, peak ~1-2 GiB)[5]
- 03:29 - OSD pods provisioning (need ~3-5 GiB RAM each)[6]
- 04:04 - ceph -s = HEALTH_OK (no deep scrub yet)
Rook reached HEALTH_OK in 57 minutes. PG autoscaler was still creating placement groups; first deep scrub ran later. Larger clusters may spend hours in HEALTH_WARN waiting for initial scrub completion.
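The numbers in the timeline are easy to sanity-check. This is a small Python sketch that assumes same-day HH:MM timestamps and uses the rough per-daemon RAM ranges quoted above (MON 1-2 GiB at peak, OSD 3-5 GiB); the helper names are mine.

```python
from datetime import datetime


def minutes_between(start, end, fmt="%H:%M"):
    """Whole minutes between two same-day clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def ram_budget_gib(mons, osds, mon_gib=(1, 2), osd_gib=(3, 5)):
    """(low, high) RAM estimate in GiB from per-daemon ranges."""
    low = mons * mon_gib[0] + osds * osd_gib[0]
    high = mons * mon_gib[1] + osds * osd_gib[1]
    return low, high


bootstrap_minutes = minutes_between("03:07", "04:04")  # CRDs applied -> HEALTH_OK
budget = ram_budget_gib(mons=3, osds=3)                # three MONs, one OSD per listed node
```

For three MONs and three OSDs that works out to roughly 12-21 GiB across the cluster, before counting the operator, managers, or any client workloads.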
Act III — The Great Migration
4.1 Redis (Warm-up)
Redis was my canary - simple state, easy rollback. Scaled to 0, created new PVC on Ceph, scaled back up. All 160 keys intact.
Total downtime: ~30 s.
4.2 PostgreSQL (Heart Surgery)
CloudNativePG made this elegant:[7]
kubectl patch cluster postgres-cluster -n database-system \
--type merge -p '{"spec":{"storage":{"storageClass":"ceph-rbd-fast"}}}'
CNPG initiated a rolling switchover with zero client disconnects. Beautiful.
4.3 Ghost Blog
Ghost keeps content in /var/lib/ghost/content. Used a scratch pod to rsync 2.7 GB between PVCs:
rsync -a --info=progress2 /old/content/ /new/content/
Maintenance mode for 90 seconds, flipped the mount. Done.
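Before flipping a mount, it is worth proving the copy is byte-identical. Here is a minimal Python sketch that hashes both trees and lists any path that differs or exists on only one side; the /old/content and /new/content mount points in the usage note mirror the rsync command, and everything else is an illustrative helper rather than part of the original procedure.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in 1 MiB chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digest(root):
    """Map each regular file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def diff_trees(old, new):
    """Sorted relative paths that are missing on one side or differ."""
    a, b = tree_digest(old), tree_digest(new)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))
```

Running `diff_trees("/old/content", "/new/content")` in the scratch pod should return an empty list before the old PVC is released. This checks content only, not ownership or permissions.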
4.4 MariaDB (The Boss Fight)
This is where things got interesting. MariaDB's entrypoint has a bug ([MDEV-25670][8]) where it treats missing mariadb_upgrade_info as an upgrade scenario, even on fresh volumes.
Fix #1: MARIADB_DISABLE_UPGRADE_BACKUP=1 environment variable.
But another challenge was networking. My standard Kubernetes NetworkPolicy was blocking the operator's restore Job:
# The old way (broken for Jobs)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mariadb-ghost-blog
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mariadb
Job pods get generated names and their own set of labels, so the selectors in my policy never matched the restore pod. The solution? Cilium's superior NetworkPolicy:
# The Cilium way (Jobs work!)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: mariadb-ghost-blog
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: mariadb
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:job-name: restore-mariadb # Cilium can match job labels!
This is why Cilium NetworkPolicies are superior to vanilla Kubernetes NetworkPolicies:
- Can select by Job labels (k8s:job-name)
- Support namespace label projection
- Offer cluster-wide identity matching
- Provide DNS-aware policies
After switching to CiliumNetworkPolicy, the restore finally worked. Four blog posts and one user account successfully recovered.
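Policy selectors compare label key/value pairs; the pod name is never consulted. Here is a minimal Python sketch of matchLabels semantics. The two label sets are hypothetical (the real restore pod's labels aren't shown in this post), and Cilium's k8s: source prefix is left out for brevity.

```python
def matches(selector, pod_labels):
    """True when every key/value pair in the selector is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical label sets for illustration only.
mariadb_pod = {"app.kubernetes.io/name": "mariadb"}
restore_pod = {"job-name": "restore-mariadb"}  # label the Job controller stamps on its pods

app_selector = {"app.kubernetes.io/name": "mariadb"}
job_selector = {"job-name": "restore-mariadb"}

results = (
    matches(app_selector, mariadb_pod),  # database pod is selected
    matches(app_selector, restore_pod),  # restore pod is not
    matches(job_selector, restore_pod),  # selecting on the job label works
)
```

The practical takeaway is to check which labels a pod actually carries (kubectl get pod --show-labels) before writing the selector.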
Elapsed time: 36 h (mostly debugging the MariaDB issue).
Performance Aftermath
Even with data moved, PostgreSQL showed ~800ms p95 latencies (was ~150ms on Longhorn). The culprit: WAL writes paying for network + three-replica overhead.
5.1 Ceph Tuning
| Setting | Default | Tuned | Why & Caveats |
|---|---|---|---|
| bluestore_min_alloc_size | 4 KB | 64 KB | May improve large sequential writes; expect space-amp and random-write latency penalty[9] |
| osd_mclock_profile | balanced | high_client_ops | Prioritize client I/O over recovery (default changed from Pacific)[10] |
| client.rbd_cache | false | true (256 MiB, policy=writeback) | ⚠️ Safe only with flush-capable guests (kernel ≥5.16, proper QEMU)[11] |
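The space-amp caveat on bluestore_min_alloc_size is easiest to see with arithmetic. The Python sketch below uses a deliberately simplified model: it assumes every object is rounded up to a whole number of allocation units and ignores compression, metadata, and anything else BlueStore does internally.

```python
KIB = 1024


def allocated_bytes(object_size, min_alloc_size):
    """Bytes consumed when an object is rounded up to whole allocation units."""
    units = -(-object_size // min_alloc_size)  # ceiling division
    return units * min_alloc_size


def space_amplification(object_size, min_alloc_size):
    """Allocated bytes divided by logical bytes (object_size must be > 0)."""
    return allocated_bytes(object_size, min_alloc_size) / object_size


small_default = space_amplification(4 * KIB, 4 * KIB)   # 4 KiB object, 4 KiB units
small_tuned = space_amplification(4 * KIB, 64 * KIB)    # same object, 64 KiB units
```

Under this model a 4 KiB object costs 16 times its size at 64 KiB units, which is why the larger setting only pays off for workloads dominated by big sequential writes.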
5.2 PostgreSQL Tuning
postgresql:
  parameters:
    wal_buffers: "256MB"
    synchronous_commit: "local"
    random_page_cost: "6.0"
    effective_io_concurrency: "200"
    checkpoint_completion_target: "0.95"
Added missing index on Infisical's pgboss queue:
CREATE INDEX CONCURRENTLY idx_pgboss_version_cron_on
ON pgboss.version(cron_on);
Result: Latencies dropped from ~800 ms to ~120 ms in my fio-nvme benchmark (iodepth=32, 4K random read). Your mileage will vary based on workload patterns.
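A p95 figure depends on how it is computed, so it helps to pin the method down when comparing before and after. This is a minimal Python sketch of the nearest-rank percentile; it is one common definition, not necessarily the one SigNoz or fio use, and the sample values are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the data at or below it. `pct` is a number in the range (0, 100]."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: mostly fast, one slow outlier.
latencies_ms = [120] * 19 + [800]
p95 = percentile(latencies_ms, 95)
p100 = percentile(latencies_ms, 100)
```

With twenty samples, a single 800 ms outlier does not move the p95 but does set the maximum, so it is worth quoting both.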
Hardening
- Automated Restore Testing: Weekly CronJob that restores Velero backups to a test namespace
- Storage Monitoring: SigNoz dashboards tracking OSD latency, PG health, disk usage
- Chaos Engineering: Monthly random worker shutdown to validate replication
- Scrub Schedule: Set osd_scrub_begin_hour=2 and osd_scrub_end_hour=5 for controlled deep-scrub windows
- GitOps Guards: ArgoCD hooks preventing storage upgrades during business hours
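To reason about whether a given hour falls inside the scrub window, here is a small Python sketch. It models the window as half-open, [begin, end), wraps past midnight, and treats equal bounds as "no restriction"; those three behaviours are my assumptions about the semantics, not a statement of how the OSD implements them.

```python
def in_scrub_window(hour, begin_hour=2, end_hour=5):
    """True if `hour` (0-23) falls inside [begin_hour, end_hour)."""
    if begin_hour == end_hour:
        return True  # assumption: equal bounds mean scrubbing is allowed all day
    if begin_hour < end_hour:
        return begin_hour <= hour < end_hour
    # Window wraps past midnight, e.g. 22 -> 4.
    return hour >= begin_hour or hour < end_hour


allowed_hours = [h for h in range(24) if in_scrub_window(h)]
```

The same helper is handy in a GitOps guard: refuse a storage change if the current hour is inside the window the cluster has reserved for scrubbing.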
Key Metrics (Measured on My Cluster)
| KPI | Longhorn-era | Ceph-era (post-tuning) |
|---|---|---|
| Detection time | 6 min | — |
| Full migration | — | 72 hours |
| PostgreSQL p95 latency | ~150 ms | ~120 ms |
| Workloads migrated | — | 8 |
| Data integrity | — | No data loss observed during post-migration checks |
| Coffee consumed | — | Excessive |
What I'd Do Differently
- Monitor backup success - Velero was failing silently for days
- Start with Cilium policies - They offer Job label selectors, namespace projections, and cluster-wide identities that vanilla NetworkPolicies lack
- Document as you go - Git commits help but aren't enough
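The first item is cheap to automate. Below is a minimal Python sketch of a staleness check over backup records. The dictionary shape (phase, completed_at) and the 26-hour threshold are assumptions for illustration; "Completed" and "Failed" are phase names Velero reports, but fetching the records from the cluster is left out.

```python
from datetime import datetime, timedelta, timezone


def backups_are_stale(backups, now, max_age=timedelta(hours=26)):
    """True when no backup with phase 'Completed' finished within max_age."""
    completed = [
        record["completed_at"]
        for record in backups
        if record.get("phase") == "Completed" and record.get("completed_at")
    ]
    return not completed or now - max(completed) > max_age


# Hypothetical records: a recent failure hiding a four-day-old success.
now = datetime(2025, 8, 1, 12, 0, tzinfo=timezone.utc)
history = [
    {"phase": "Failed", "completed_at": now - timedelta(hours=3)},
    {"phase": "Completed", "completed_at": now - timedelta(days=4)},
]
stale = backups_are_stale(history, now)
```

Alerting on the age of the last success, rather than on the presence of failures, catches the silent case where the schedule stops running at all.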
Closing Thoughts
Storage failures are brutal because they sit below every abstraction layer. This incident taught me that complexity isn't inherently bad; hidden complexity (Longhorn) is worse than explicit complexity (Ceph).
Longhorn is excellent for simple clusters with ample resources. But when it breaks, the reconciliation logic becomes opaque. Ceph demands you learn its vocabulary (MONs, OSDs, placement groups) but rewards you with observable, predictable behaviour.
The cluster is stronger now. Not because Ceph is "better" than Longhorn, but because I understand it deeply. Every 3 AM debugging session is an investment in that understanding.
The meta-lesson: If your storage system has a two-phase upgrade path, test the rollback. And when choosing between storage systems, optimize for observability over simplicity.
Have storage war stories? Drop a comment. I'll trade caffeine recommendations for tales of survival.
References
- [1] Talos Linux v1.10.5 Release
- [2] Cilium v1.18.0 Release
- [3] Longhorn issue #8289 - Magic version/engine-replica API version mismatching
- [4] Rook v1.17.0 Release - April 16, 2025
- [5] Ceph Monitor memory - RocksDB default 512 MiB
- [6] Ceph OSD memory requirements - 3-5 GiB minimum
- [7] CloudNativePG storage - mutable storageClass
- [8] MariaDB MDEV-25670 - mysql_upgrade at startup bug
- [9] Ceph BlueStore configuration - allocation size trade-offs
- [10] Ceph mClock default profiles - balanced in Reef/Squid
- [11] Ceph RBD cache safety requirements
- [12] CloudNativePG WAL volume separation