Longhorn → Ceph Migration Post-mortem: When Storage Systems Collide

"In production you don't rise to the level of your architecture... you fall to the level of your observability."

My SigNoz dashboards went blood-red: Longhorn's instance-manager pods were thrashing in a tight restart loop, chewing through crash-backoff timers like popcorn. Within minutes I was staring at a storm of VolumeDegraded alerts and a gut-level certainty that sleep was cancelled for the next few nights.

What followed was a 72-hour gauntlet of log spelunking, rollbacks, and caffeine-fuelled decision-making that culminated in a live migration from Longhorn v1.9.0 → Ceph 19.2.3 Squid via Rook v1.17.0.

This write-up is not a victory lap. It's the post-incident narrative I wish I'd found when I first dipped my toes into self-hosted block storage. My goal is to walk you through every decision point, detour, and forehead slap so you can learn from (and hopefully laugh at) my scars instead of earning your own.

Homelab Landscape (Setting the Stage)

Before we autopsy the failure, you need to understand the sandbox:

Layer	Details
Hardware	7-node cluster: 3 control plane nodes + 4 worker nodes with dedicated NVMe storage
OS	Talos Linux v1.10.5 (immutable, API-only)^[1]
Network	Cilium v1.18.0 CNI with native network policies^[2]
Platform	Kubernetes v1.33.3 (upgraded from Talos default v1.33.2), managed via ArgoCD and GitOps
Stateful Apps	PostgreSQL (CloudNativePG), Redis, MariaDB (Ghost blog), SigNoz observability, Marqo vector database
Security	VaultWarden, Infisical secrets management, Tetragon, CrowdSec
Observability	SigNoz for metrics/traces/logs, custom exporters
Backups	Velero 1.13 with restic backend to S3-compatible storage

Sidebar: Why Talos? Because when you're debugging storage you don't want an SSH session to become a footgun. Talos forces you to interact via an API, eliminating the temptation to "just edit a file in /etc" at 3 a.m.

2 · Act I — The Night Everything Caught Fire

22:04 | Alert flood

The first SigNoz alert was innocuous enough: Longhorn Replica Manager restart count > 3 in 10 min. By the time I alt-tabbed, the count was 27. A quick kubectl get pods -n longhorn-system confirmed that every instance-manager pod was crash-looping.
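
The same threshold check can be reproduced by hand while the dashboards are busy being red. A minimal sketch, assuming the default kubectl column layout (NAME, READY, STATUS, RESTARTS, AGE); the function name and the threshold of 3 are illustrative and not part of my alerting setup:

```shell
# Print "pod-name restart-count" for every pod in a namespace whose
# restart count exceeds the given threshold.
# Assumes RESTARTS is the 4th column of `kubectl get pods --no-headers`.
flag_restarts() {
  namespace="$1"
  threshold="$2"
  kubectl get pods -n "$namespace" --no-headers |
    awk -v max="$threshold" '$4 + 0 > max { print $1, $4 }'
}

# Example: flag_restarts longhorn-system 3
```

Newer kubectl versions append something like "(30s ago)" to the RESTARTS column; the count itself is still the fourth field, so the awk filter is unaffected.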

22:11 | Smoke test

I exec'd into the Ghost pod and tried a simple touch /var/lib/ghost/content/.liveness. Instant I/O hang. That was the moment I realized the issue was block-level, not app-level.
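
A bare touch on a wedged volume can hang the shell that runs it. One way to make the probe safer is to run the write in the background and poll for its result. This is a sketch, not what I ran that night; it assumes mktemp and a sleep that accepts fractional seconds are available in the container, and the probe file name is made up:

```shell
# Try to create a file under a directory without blocking the calling shell.
# Prints "ok", "error" (the write failed outright) or "hung" (no result in time).
probe_write() {
  dir="$1"
  secs="${2:-5}"
  result=$(mktemp)
  # Run the write in the background so a stuck syscall cannot block this shell.
  ( if touch "$dir/.liveness-probe" 2>/dev/null; then echo ok; else echo error; fi >"$result" ) \
    >/dev/null 2>&1 &
  tries=$((secs * 10))
  while [ "$tries" -gt 0 ]; do
    if [ -s "$result" ]; then
      cat "$result"
      rm -f "$result"
      return 0
    fi
    sleep 0.1
    tries=$((tries - 1))
  done
  rm -f "$result"
  echo hung
  return 1
}

# Example: probe_write /var/lib/ghost/content 5
```

The result file lives in the temp directory rather than on the volume under test, so reading the verdict does not depend on the storage being probed.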

22:18 | Log spelunking

The canonical error:

WARN engine-controller: failed to start engine for volume ghost-content: cannot connect to engine: timed out waiting for engine.ready==true
ERROR instance-manager-r: engine API version mismatch – engine reports 6-1, CRD expects 0

I'd read about the magic version bug months earlier but dismissed it as edge-case folklore.^[3] Seeing it live was equal parts validation and dread.

23:00 | First attempt: "Quick" patch

helm repo update && \
helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --version 1.9.1

Helm returned deployed. The cluster disagreed. The manager Deployment rolled but the longhorn-engine-image DaemonSet didn't — node selectors had drifted since the last upgrade. Five minutes later the mismatch errors were back, now with a shiny new version combo (manager 1.9.1 ↔ engine 1.9.0).
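
Helm's "deployed" only says the release was accepted, so it is worth asking the pods what they are actually running. A rough sketch, assuming the images are named longhorn-manager and longhorn-engine and are referenced by tag rather than digest; both function names are mine:

```shell
# Print the distinct "component:tag" pairs for the Longhorn manager and
# engine images currently running in a namespace (default: longhorn-system).
longhorn_versions() {
  kubectl get pods -n "${1:-longhorn-system}" \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' |
    grep -E 'longhorn-(manager|engine):' |
    sed 's/.*longhorn-//' |
    sort -u
}

# Succeeds when manager and engine pods are running more than one distinct tag.
has_version_skew() {
  [ "$(longhorn_versions "$@" | sed 's/.*://' | sort -u | wc -l)" -gt 1 ]
}

# Example: has_version_skew && longhorn_versions
```

Run right after an upgrade, a check like this would have shown the manager and engine tags side by side instead of leaving me to infer the skew from error logs.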

The git log tells the story:

# The version dance of despair
d012a80 re-upgrade to v1.9.1 to fix version mismatch
8a6f5ef downgrade to v1.9.0 to resolve instance manager cycling  
e25addf revert to v1.9.1 - downgrade not supported
4cc9661 revert to v1.9.0 due to instance manager API version bug

By 02:00 the cluster was teetering: half the volumes Degraded, Ghost in read-only mode, and critical services timing out.

That's when I pulled the ripcord.

Decision fork:
A. Keep fighting Longhorn and risk data loss with no clean backup.
B. Stand up Ceph in parallel and copy what data I could while volumes were still readable.

I chose B.

Act II — Crash-Course in Ceph (Why & How)

If you've never run Ceph, imagine a cohesive distributed storage project with dozens of cooperating daemons that handle everything from object storage to block devices to filesystems. It's intimidating, but once you grasp the mental model, it's kind of beautiful.

3.1 Bootstrapping Rook

The actual Ceph deployment was surprisingly smooth (using Rook v1.17.0, GA May 2, 2025^[4]):

apiVersion: ceph.rook.io/v1
kind: CephCluster
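metadata:
  name: rook-ceph
  namespace: rook-ceph                 # assumed: Rook's default namespace
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3   # Squid
  dataDirHostPath: /var/lib/rook       # assumed: Rook's usual path for mon/OSD state
  mon:
    count: 3                           # one per control-plane node
  storage:
    useAllNodes: false                 # assumed: only the nodes listed below
    useAllDevices: false               # assumed: only the devices listed below
    nodes: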
    - name: cluster-wrk-01
      devices:
      - name: /dev/sdb
    - name: cluster-wrk-02
      devices:
      - name: /dev/sdb
    - name: cluster-wrk-03
      devices:
      - name: /dev/sdb

Timeline:

03:07 - Applied CRDs
03:10 - Rook operator running (~150 MiB RAM)
03:18 - MON pods scheduled (RocksDB default 512 MiB, peak ~1-2 GiB)^[5]
03:29 - OSD pods provisioning (need ~3-5 GiB RAM each)^[6]
04:04 - ceph -s = HEALTH_OK (no deep scrub yet)

Rook reached HEALTH_OK 57 minutes after the CRDs were applied (03:07 to 04:04). The PG autoscaler was still creating placement groups at that point, and the first deep scrub ran later. Larger clusters may spend hours in HEALTH_WARN waiting for initial scrub completion.
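
Rather than re-running ceph -s by hand for an hour, the wait can be scripted. A minimal sketch, assuming the optional Rook toolbox is deployed as deploy/rook-ceph-tools in the rook-ceph namespace; the attempt count and interval are arbitrary defaults:

```shell
# Poll `ceph health` through the Rook toolbox until it reports HEALTH_OK.
# Usage: wait_for_health_ok [attempts] [seconds-between-attempts]
wait_for_health_ok() {
  attempts="${1:-120}"
  interval="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    status=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health 2>/dev/null)
    case "$status" in
      HEALTH_OK*)
        echo "$status"
        return 0
        ;;
    esac
    echo "still waiting: ${status:-no response}" >&2
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}

# Example: wait_for_health_ok 120 30 && echo "cluster is healthy"
```

HEALTH_OK here only means Ceph has no active warnings; it says nothing about whether a deep scrub has run yet.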

Act III — The Great Migration

4.1 Redis (Warm-up)

Redis was my canary - simple state, easy rollback. Scaled to 0, created new PVC on Ceph, scaled back up. All 160 keys intact.

Total downtime: ~30 s.
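
The "keys intact" check boils down to comparing DBSIZE before and after the swap. A small sketch, assuming redis-cli is available in the pod and no AUTH is required; the namespace and workload names in the example are hypothetical:

```shell
# Print the number of keys in the currently selected Redis database.
# Usage: redis_keys <namespace> <pod-or-workload>
redis_keys() {
  kubectl -n "$1" exec "$2" -- redis-cli DBSIZE | tr -dc '0-9'
}

# Compare two key counts and print a one-line verdict.
compare_counts() {
  if [ "$1" = "$2" ]; then
    echo "match ($1 keys)"
  else
    echo "MISMATCH: before=$1 after=$2"
    return 1
  fi
}

# Example:
#   before=$(redis_keys cache deploy/redis)
#   ...scale down, swap the PVC, scale back up...
#   after=$(redis_keys cache deploy/redis)
#   compare_counts "$before" "$after"
```

DBSIZE only counts keys in the selected database (db 0 by default), so an instance that uses several databases needs one check per database.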

4.2 PostgreSQL (Heart Surgery)

CloudNativePG made this elegant:^[7]

kubectl patch cluster postgres-cluster -n database-system \
  --type merge -p '{"spec":{"storage":{"storageClass":"ceph-rbd-fast"}}}'

CNPG initiated a rolling switchover with zero client disconnects. Beautiful.
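
After a change like this it is worth confirming that every PVC really landed on the new StorageClass. A minimal sketch; the function name is mine, and the namespace and class are the ones used in the patch above:

```shell
# Print "pvc-name storage-class" for every PVC in a namespace that is NOT
# on the expected StorageClass. No output means everything has moved.
pvcs_not_on() {
  namespace="$1"
  expected="$2"
  kubectl get pvc -n "$namespace" --no-headers \
    -o custom-columns=NAME:.metadata.name,CLASS:.spec.storageClassName |
    awk -v want="$expected" '$2 != want { print $1, $2 }'
}

# Example: pvcs_not_on database-system ceph-rbd-fast
```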

4.3 Ghost Blog

Ghost keeps content in /var/lib/ghost/content. Used a scratch pod to rsync 2.7 GB between PVCs:

rsync -a --info=progress2 /old/content/ /new/content/

Maintenance mode for 90 seconds, flipped the mount. Done.
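
Before flipping the mount, a content-level comparison of the two trees is cheap insurance. This sketch uses sha256sum manifests rather than rsync itself; it assumes GNU find and coreutils in the scratch pod and compares file contents only, not ownership or permissions:

```shell
# Build a sorted "checksum  ./relative/path" manifest for every file under a directory.
manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Succeeds only if both trees contain the same files with the same contents.
same_tree() {
  [ "$(manifest "$1")" = "$(manifest "$2")" ]
}

# Example: same_tree /old/content /new/content && echo "content verified"
```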

4.4 MariaDB (The Boss Fight)

This is where things got interesting. MariaDB's entrypoint has a bug ([MDEV-25670]^[8]) where it treats missing mariadb_upgrade_info as an upgrade scenario, even on fresh volumes.

Fix #1: set the MARIADB_DISABLE_UPGRADE_BACKUP=1 environment variable.

The second obstacle was networking. My standard Kubernetes NetworkPolicy was blocking the operator's restore Job:

# The old way (broken for Jobs)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mariadb-ghost-blog
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mariadb

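Pod selectors match labels, not names, and the restore Job's pods evidently carried none of the labels my policies allowed, so their traffic was dropped. What got me unstuck was a CiliumNetworkPolicy that admits the Job's pods explicitly: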

# The Cilium way (Jobs work!)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: mariadb-ghost-blog
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: mariadb
  ingress:
  - fromEndpoints:
    - matchLabels:
        k8s:job-name: restore-mariadb  # Cilium can match job labels!

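For the record, a vanilla NetworkPolicy can also select on the job-name label that the Job controller adds to its pods. What Cilium policies add on top:

Identity-based selectors with an explicit label source (the k8s: prefix above)
Namespace labels projected onto endpoints, so selectors can match on them
Cluster-wide policies (CiliumClusterwideNetworkPolicy)
DNS-aware egress rules (toFQDNs)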

After switching to CiliumNetworkPolicy, the restore finally worked. Four blog posts and one user account successfully recovered.
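
When a policy refuses to match a Job's pods, looking at the labels those pods actually carry is a quick sanity check. A sketch, relying on the job-name label that the Job controller sets on the pods it creates; the ghost namespace in the example is hypothetical:

```shell
# Show the pods created by a Job together with their labels.
# Usage: job_pod_labels <namespace> <job-name>
job_pod_labels() {
  kubectl get pods -n "$1" -l "job-name=$2" --show-labels --no-headers
}

# Succeeds if at least one of the Job's pods carries the given key=value label.
# Usage: job_pods_have_label <namespace> <job-name> <key=value>
job_pods_have_label() {
  job_pod_labels "$1" "$2" | awk '{ print $NF }' | tr ',' '\n' | grep -qxF "$3"
}

# Example:
#   job_pods_have_label ghost restore-mariadb app.kubernetes.io/name=mariadb \
#     || echo "restore pods do not carry the label the policy selects on"
```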

Elapsed time: 36 h (mostly debugging the MariaDB issue).

Performance Aftermath

Even with the data moved, PostgreSQL showed ~800 ms p95 latencies (up from ~150 ms on Longhorn). The culprit: every WAL write now paid for a network round trip plus three-way replication.

5.1 Ceph Tuning

Setting	Default	Tuned	Why & Caveats
`bluestore_min_alloc_size`	4 KB	64 KB	May improve large sequential writes; expect space-amp and random-write latency penalty. Only takes effect when an OSD is created, so existing OSDs must be redeployed^[9]
`osd_mclock_profile`	`balanced`	`high_client_ops`	Prioritize client I/O over recovery (default changed from Pacific)^[10]
`client.rbd_cache`	`true` (32 MiB, policy=writearound)	`true` (256 MiB, policy=writeback)	⚠️ Safe only with flush-capable guests (kernel ≥5.16, proper QEMU); applies to librbd clients only, the kernel RBD client ignores it^[11]
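
The runtime-changeable settings from the table can be applied with ceph config set. A sketch, again assuming the Rook toolbox runs as deploy/rook-ceph-tools in the rook-ceph namespace; the wrapper function names are mine. bluestore_min_alloc_size is left out on purpose because it is fixed when an OSD is created:

```shell
# Run a ceph CLI command inside the Rook toolbox.
ceph_cli() {
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
}

# Apply the client-facing tuning from the table above.
apply_tuning() {
  ceph_cli config set osd osd_mclock_profile high_client_ops
  ceph_cli config set client rbd_cache true
  ceph_cli config set client rbd_cache_size 268435456   # 256 MiB, in bytes
  ceph_cli config set client rbd_cache_policy writeback
}

# Read the values back to confirm they were stored.
show_tuning() {
  ceph_cli config get osd osd_mclock_profile
  ceph_cli config get client rbd_cache_size
  ceph_cli config get client rbd_cache_policy
}

# Example: apply_tuning && show_tuning
```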

5.2 PostgreSQL Tuning

postgresql:
  parameters:
    wal_buffers: "256MB"
    synchronous_commit: "local"
    random_page_cost: "6.0"
    effective_io_concurrency: "200"
    checkpoint_completion_target: "0.95"

Added missing index on Infisical's pgboss queue:

CREATE INDEX CONCURRENTLY idx_pgboss_version_cron_on 
ON pgboss.version(cron_on);

Result: Latencies dropped from 800 ms to 120 ms in my fio-nvme benchmark (iodepth=32, 4K random read). Your mileage will vary based on workload patterns.
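
For anyone wanting a comparable number, a fio invocation matching the stated parameters might look like the sketch below. Only iodepth=32 and 4K random read come from my benchmark description; the libaio engine, file size, runtime and job name are assumptions:

```shell
# 4K random-read test at queue depth 32 against a file on the volume under test.
# Usage: run_randread <path-to-test-file>
run_randread() {
  fio --name=randread-4k --filename="$1" --rw=randread --bs=4k --iodepth=32 \
      --ioengine=libaio --direct=1 --size=1G --runtime=60 --time_based \
      --group_reporting
}

# Example: run_randread /mnt/ceph-test/fio.dat
```

fio reports completion latency percentiles in its output, which is where a p95 figure can be read off. A block-level benchmark is only a proxy for database latency, so it is best paired with query-level measurements.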

Hardening

Automated Restore Testing: Weekly CronJob that restores Velero backups to a test namespace
Storage Monitoring: SigNoz dashboards tracking OSD latency, PG health, disk usage
Chaos Engineering: Monthly random worker shutdown to validate replication
Scrub Schedule: Set osd_scrub_begin_hour=2 and osd_scrub_end_hour=5 for controlled deep-scrub windows
GitOps Guards: ArgoCD hooks preventing storage upgrades during business hours
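
The scrub schedule from the list above can be set through the toolbox. A sketch under the same toolbox assumption as before; the window check reflects my reading of the documented semantics (begin inclusive, end exclusive, equal values meaning all day) and is only there to sanity-check a schedule:

```shell
# Run a ceph CLI command inside the Rook toolbox (assumed: deploy/rook-ceph-tools).
ceph_cli() {
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
}

# Restrict scheduled scrubs to the window [begin, end), hours in 0-23.
set_scrub_window() {
  ceph_cli config set osd osd_scrub_begin_hour "$1"
  ceph_cli config set osd osd_scrub_end_hour "$2"
}

# Succeeds if the given hour falls inside the window, including windows that
# wrap past midnight (for example 22 to 4).
in_scrub_window() {
  hour="$1"
  begin="$2"
  end="$3"
  if [ "$begin" -eq "$end" ]; then
    return 0
  elif [ "$begin" -lt "$end" ]; then
    [ "$hour" -ge "$begin" ] && [ "$hour" -lt "$end" ]
  else
    [ "$hour" -ge "$begin" ] || [ "$hour" -lt "$end" ]
  fi
}

# Example: set_scrub_window 2 5
```

Ceph still scrubs outside the window once a placement group's scrub interval exceeds osd_scrub_max_interval, so a window that is too narrow delays scrubs rather than preventing them.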

Key Metrics (Measured on My Cluster)

KPI	Longhorn-era	Ceph-era (post-tuning)
Detection time	6 min	—
Full migration	—	72 hours
PostgreSQL p95 latency	~150 ms	~120 ms
Workloads migrated	—	8
Data integrity	—	No data loss observed during post-migration checks
Coffee consumed	—	Excessive

What I'd Do Differently

Monitor backup success - Velero was failing silently for days
Start with Cilium policies - They offer DNS-aware rules and cluster-wide policies that vanilla NetworkPolicies lack
Document as you go - Git commits help but aren't enough
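
Catching silent backup failures can start with something as small as listing every backup that did not finish cleanly. A sketch, assuming Velero is installed in the velero namespace; the function name is mine, and in-progress backups are flagged too:

```shell
# Print "backup-name: phase" for every Velero backup whose phase is not Completed.
# Usage: failed_backups [namespace]   (default: velero)
failed_backups() {
  kubectl get backups.velero.io -n "${1:-velero}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
    awk -F '\t' '$2 != "Completed" { print $1 ": " ($2 == "" ? "Unknown" : $2) }'
}

# Example: run from a CronJob or CI step and alert on any output
#   [ -z "$(failed_backups)" ] || echo "Velero has unhealthy backups"
```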

Closing Thoughts

Storage failures are brutal because they sit below every abstraction layer. This incident taught me that complexity isn't inherently bad; hidden complexity (Longhorn) is worse than explicit complexity (Ceph).

Longhorn is excellent for simple clusters with ample resources. But when it breaks, the reconciliation logic becomes opaque. Ceph demands that you learn its vocabulary (MONs, OSDs, placement groups) but rewards you with observable, predictable behaviour.

The cluster is stronger now. Not because Ceph is "better" than Longhorn, but because I understand it deeply. Every 3 AM debugging session is an investment in that understanding.

The meta-lesson: If your storage system has a two-phase upgrade path, test the rollback. And when choosing between storage systems, optimize for observability over simplicity.

Have storage war stories? Drop a comment. I'll trade caffeine recommendations for tales of survival.