The Mathematics of Maglev: An Analysis of Consistent Hashing in eBPF Load Balancers

Load balancing at scale requires more than engineering intuition. It demands mathematical rigour. When Google introduced Maglev in 2016, they presented an elegant solution to a fundamental distributed systems problem: how to consistently map connections to backends with minimal memory overhead, constant-time lookups, and provable bounds on disruption during backend changes. Today, this algorithm powers critical infrastructure through implementations like Cilium's eBPF-based load balancer.

This analysis provides a comprehensive mathematical treatment of Maglev consistent hashing, from its theoretical foundations through its implementation in eBPF. We'll establish formal proofs for its key properties: uniform load distribution, minimal disruption guarantees, and performance bounds. Parts I–IV develop the mathematical theory, while Parts V–XI focus on Cilium's production implementation and operational considerations.

Part I: Mathematical Foundations

1.1 Problem Formulation

Let us formally define the consistent hashing problem:

Definition 1.1 (Consistent Hashing Problem). Given a set of backends $B = {b_1, b_2, ..., b_n}$ and a key space $K$, find a hash function $h: K \rightarrow B$ such that:

Load Balance: $\forall, b_i, b_j \in B:\ \left|\ |{k \in K : h(k) = b_i}|\ -\ |{k \in K : h(k) = b_j}|\ \right| \le \epsilon$ for small $\epsilon$ relative to total keys
Minimal Disruption: When $B$ changes to $B'$, minimize $|{k \in K : h(k) \neq h'(k)}|$
Efficiency (Maglev design goal): Achieve $O(1)$ lookup using a precomputed table of size $M$ (i.e., $O(M)$ state)

Notation. We use keys and flows interchangeably for 5-tuples; $|K|$ denotes the number of live flows. We take the 5-tuple as $(srcIP, dstIP, srcPort, dstPort, L4proto)$.

Traditional ring-based consistent hashing with virtual nodes [12] achieves properties 1 and 2 but requires $O(\log(nv))$ lookup (binary search over the ring). O(1) schemes like Jump Consistent Hash (JCH) [13] exist and achieve the optimal $1/(n{+}1)$ remap on additions, but they compute the mapping per-lookup (no precomputed table) and thus differ operationally from Maglev's table-driven design. Maglev achieves all three with a precomputed table and $O(1)$ lookups; compared to Jump [13] (also $O(1)$ but no table), Maglev trades a one-time table build for the fastest possible per-packet path and simple eBPF lookups.

1.2 The Maglev Construction

Definition 1.2 (Maglev Lookup Table). The Maglev lookup table is an array $T[0..M-1]$ of backend identifiers. For each backend $b_i$, define a preference list $\pi_i$ which is a permutation of ${0, \ldots, M-1}$. Algorithm 1 fills $T$ by iterating backends round-robin and placing each $b_i$ at the first unfilled slot in $\pi_i$.

The construction algorithm operates as follows:

Algorithm 1: Maglev Table Generation

Input: Backends B = {b₁, ..., bₙ}, table size M (prime)
Output: Lookup table T[0..M-1]

1. For each backend bᵢ, generate two hash values:
   - offset[i] = hash₁(bᵢ) mod M
   - skip[i] = (hash₂(bᵢ) mod (M-1)) + 1  // Ensures gcd(skip[i], M) = 1

2. Generate permutation πᵢ for each backend:
   πᵢ[j] = (offset[i] + j × skip[i]) mod M, for j ∈ [0, M-1]

3. Initialize: T[j] = -1 for all j, next[i] = 0 for all i

4. For c = 0 to M-1:
   a. n' = c mod n
   b. While T[πₙ'[next[n']]] ≠ -1:
      next[n'] = next[n'] + 1
   c. T[πₙ'[next[n']]] = n'
   d. next[n'] = next[n'] + 1

5. Return T

Note: The skip value is computed as (hash₂(bᵢ) mod (M-1)) + 1 to ensure $\gcd(\text{skip}[i], M) = 1$ when $M$ is prime, guaranteeing complete permutations as proven in Theorem 1.1.

Because each $\pi_i$ is a full permutation of ${0,\dots,M-1}$, the while must encounter any still-empty index before $j$ reaches $M$; thus each iteration assigns exactly one slot and the loop terminates after exactly $M$ iterations, yielding the fair counts in §2.1.

1.3 Mathematical Properties of Permutations

Theorem 1.1 (Permutation Completeness). For prime $M$ and $\gcd(\text{skip}[i], M) = 1$, the sequence $\pi_i[j] = (\text{offset}[i] + j \cdot \text{skip}[i]) \bmod M$ generates a complete permutation of ${0, 1, ..., M-1}$.

Proof. We must show that $\pi_i$ is bijective.

Suppose $\pi_i[j_1] = \pi_i[j_2]$ for some $j_1, j_2 \in [0, M-1]$. Then:
$$(\text{offset}[i] + j_1 \cdot \text{skip}[i]) \equiv (\text{offset}[i] + j_2 \cdot \text{skip}[i]) \pmod{M}$$

This implies:
$$j_1 \cdot \text{skip}[i] \equiv j_2 \cdot \text{skip}[i] \pmod{M}$$
$$(j_1 - j_2) \cdot \text{skip}[i] \equiv 0 \pmod{M}$$

Since $\gcd(\text{skip}[i], M) = 1$ and $M$ is prime, $\text{skip}[i]$ has a multiplicative inverse modulo $M$. Therefore:
$$j_1 - j_2 \equiv 0 \pmod{M}$$

Since $j_1, j_2 \in [0, M-1]$, we have $j_1 = j_2$. Thus $\pi_i$ is injective. Since $|{0, ..., M-1}| = M$ and $\pi_i$ maps to the same set, injectivity implies bijectivity.

Equivalently, $\pi_i$ is a complete cycle whenever $\text{skip}[i] \in (\mathbb{Z}/M\mathbb{Z})^*$; taking $M$ prime makes this automatic for all $1 \leq \text{skip}[i] \leq M-1$ (see §6.3 for composite $M$). □

Plain-language intuition: This proof shows that each backend's "preference list" visits every single slot in the table exactly once—like having a unique dance where you visit every spot on the dance floor without repeating. The primality of $M$ is crucial: it ensures no backend gets stuck in a loop visiting only some positions.

Kubernetes Impact: In a Kubernetes cluster, this means when Cilium creates its load balancing table for a Service with multiple pod endpoints, each pod gets fair consideration for every possible hash slot. No pod can be structurally disadvantaged by the mathematics.

Corollary 1.1. Each backend $b_i$ has equal opportunity to claim any position in the lookup table during the round-robin filling process.

Collision note. Two backends can, in theory, produce identical preference lists if both offset and skip collide. Assuming independent uniform draws (a good model with a keyed hash), the per-pair probability is $\frac{1}{M}\cdot\frac{1}{M-1}$, which is negligible for typical $M$. A keyed hash keeps it unpredictable.

Part II: Load Distribution Analysis

2.1 Exact Load Distribution

Algorithm 1 Invariant: After $c$ iterations, exactly $c$ entries of $T$ are filled and each backend $b_i$ has either $\lfloor c/n \rfloor$ or $\lceil c/n \rceil$ assignments.

Let $X_i$ denote the number of table slots in $T$ assigned to backend $b_i$.

Proposition 2.1 (Exact Assignment Counts). For Algorithm 1, exactly $r = M \bmod n$ backends receive $\lceil M/n \rceil$ entries and the remaining $n - r$ backends receive $\lfloor M/n \rfloor$ entries. In particular, the average is $M/n$.

Operational rule: never run with $M<n$; some backends will receive 0 slots. Use $M\ge n$ (ideally $M\ge 100n$). Many implementations guard against configurations that would yield empty tables; verify behavior for your version.

Proof. By the invariant, backend $b_i$ is selected either $\lfloor M/n \rfloor$ or $\lceil M/n \rceil$ times; each selection produces exactly one placement.

After $M$ iterations:

Exactly $r = M \bmod n$ backends get $\lceil M/n \rceil$ slots
The remaining $(n - r)$ backends get $\lfloor M/n \rfloor$ slots

Verification: $r \cdot \lceil M/n \rceil + (n-r) \cdot \lfloor M/n \rfloor = r + n \cdot \lfloor M/n \rfloor = M$. □

Plain-language intuition: Imagine dealing cards to players in order—everyone gets roughly the same number. The round-robin approach ensures fairness: if you have 16,381 slots and 100 backends, each backend gets approximately 164 slots. This prevents any single pod from being overwhelmed.

Theoretical Impact in Kubernetes: Consider a Service with 50 pod replicas receiving λ = 10⁶ requests/second. By Proposition 2.1, each pod handles λ/n = 20,000 requests/second. Without uniform distribution, some pods might receive 2-3× the mean load, triggering cascading failures through Little's Law when queue lengths exceed buffer capacity.

2.2 Load Imbalance Bounds

Proposition 2.2 (Maximum Load Imbalance). For any $M$ and $n$, the maximum load imbalance between any two backends satisfies:
$$\max_{i,j} |X_i - X_j| \leq \lceil M/n \rceil - \lfloor M/n \rfloor \leq 1$$

Proof. The round-robin filling ensures each backend gets at most one more assignment than any other. Specifically:

Let $q = \lfloor M/n \rfloor$ and $r = M \bmod n$. Then:

$r$ backends get exactly $q + 1$ slots
$(n - r)$ backends get exactly $q$ slots

Therefore:
$$\max_{i,j} |X_i - X_j| = (q + 1) - q = 1$$

This bound is tight and achieved when $M \not\equiv 0 \pmod{n}$. □

Plain-language intuition: The worst-case scenario is that some backends get exactly one more slot than others—like dividing 17 cookies among 5 kids, where 2 kids get 4 cookies and 3 kids get 3 cookies. The difference is never more than 1 cookie (or slot).

Theoretical Demonstration: Consider n = 73 pods with M = 16381. Proposition 2.2 guarantees max imbalance = 1 slot. Expected slots per pod = 16381/73 ≈ 224.4. Therefore:

Maximum load: ⌈224.4⌉ = 225 slots
Minimum load: ⌊224.4⌋ = 224 slots
Relative difference: 1/224.4 ≈ 0.45%

This mathematical bound ensures no pod becomes a hotspot, maintaining system stability under queueing theory constraints.

Corollary 2.1 (Relative Load Imbalance). The relative load imbalance is bounded by:
$$\frac{\max_i X_i - \min_i X_i}{\mathbb{E}[X_i]} \leq \frac{n}{M}$$

For $M > 100n$, this gives less than 1% relative imbalance.

Note this bound is for table slots. With a finite number of live flows |K|, sampling variance from hashing adds ≈O(1/√|K|) relative fluctuations on top of the slot imbalance; for large |K| this is negligible in practice and vanishes as |K| grows. In practice, per-backend flow counts concentrate tightly around $X_i$ with $\operatorname{Var}!\left[\text{flows}_i\right]\approx |K|\cdot (X_i/M)\cdot (1 - X_i/M)$, i.e., binomial sampling around the slot share.

Flow-size skew. Maglev equalizes connections (slots), not necessarily bytes. If your workload has heavy-tailed flow sizes, byte-level load can skew even when slot counts are equal. If weights are required, consider Weighted Rendezvous Hashing (WRH) or shard at the app layer until weighted Maglev is available.

Part III: Disruption Analysis

3.1 Backend Addition

Proposition 3.1 (Share claimed by a new backend). After adding backend $b_{n+1}$, $b_{n+1}$ occupies exactly $\lfloor M/(n+1) \rfloor$ or $\lceil M/(n+1) \rceil$ entries of $T$. Hence at least that fraction of keys must be remapped.

Remark. Because Maglev fills $T$ greedily in round-robin across per-backend permutations, some keys not mapped to $b_{n+1}$ can still change owners among the original backends. In practice this "extra churn" is small when $M \gg n$ (vanishing as $n/M \to 0$), but it is not guaranteed to be zero. Rendezvous (HRW) hashing does guarantee that only keys moving to the new backend change.

Plain-language intuition: When you add a new server to a pool of 100 servers, at least ~1/101 (~1%) of connections need to move, and it's typically close to this when $M \gg n$. It's like adding a new cashier at a store—a small fraction of customers shift to the new register.

Theoretical Autoscaling Analysis: Consider an HPA event scaling from n₀ = 80 to n₁ = 100 pods. By the proposition:

Lower bound on disruption: (n₁ - n₀)/n₁ = 20% (exact under Rendezvous (HRW); typically ≈20% for Maglev when $M \gg n$)
So ≈80% of connections remain pinned (exact under Rendezvous (HRW))

Under an M/M/n queue with arrival rate λ and service rate μ per backend, the per-pod utilization is ρ = λ/(nμ). If autoscaling tracks λ so that λ grows roughly ∝ n, then ρ stays approximately constant during the transition.

3.2 Backend Removal

Proposition 3.2 (Removal). Removing backend $b_j$ remaps at least the fraction $X_j/M$ of keys (with $X_j \approx M/n$, i.e., ≈$1/n$ of keys). (At minimum, all keys in $X_j$ must move; Maglev's table construction can additionally reshuffle a small number of other keys.) Rendezvous (HRW) avoids this additional reshuffling.

Plain-language intuition: When one server fails in a pool of 100, only the 1% of traffic it was handling needs to find new homes. The other 99% of connections remain untouched—crucial for maintaining established TCP sessions.

3.3 Churn Stability

Definition 3.1 (Churn Stability). A hashing scheme is $(\alpha, \beta)$-churn stable if removing $\alpha$ fraction of backends causes at most $\beta$ fraction of keys to be remapped.

Proposition 3.3 (Required remaps under churn). Removing $\alpha n$ backends eliminates $\alpha M$ entries, so at least an $\alpha$ fraction of keys must move.

Remark. Maglev's table construction (round-robin over per-backend permutations) can cause additional reshuffling among surviving backends; Rendezvous (HRW) achieves exactly $\alpha$.

Plain-language intuition: If 30% of your servers fail, about 30% of traffic gets redistributed. This linear relationship means the impact is roughly proportional to the failure, preventing cascade effects.

Part IV: Complexity Analysis

Maglev table generation runs in userspace (cilium-agent), not in the packet path; rebuild cost does not impact per-packet latency.

4.1 Expected Behavior and Worst-Case Bounds

Proposition 4.1 (Expected Behavior and Worst-Case Bounds). The Maglev table construction empirically runs near-linear for $M \gg n$; under an independence model (a standard coupon-collector approximation), analyses in [11] give $O(M\log M)$ expected operations. (The per-backend sequences are arithmetic progressions modulo M; independence is an approximation but empirically sufficient for the occupancy analysis.) The adversarial worst case is $O(M^2/n)$ if many permutations overlap so heavily that each placement scans $\Theta(M/n)$ filled entries.

Proof.
Empirical Behavior: The Maglev paper's analysis [11] models the fill process as an occupancy problem with independent permutations, deriving approximately $M/(M-n)$ tries on the $n$-th insertion, leading to $O(M \log M)$ total operations. When $M \gg n$ (e.g., $M = 100n$), collisions are rare and practical performance approaches $O(M)$.

Worst Case: Consider the pathological case where all backends have similar permutations. Backend $b_i$ might need to scan through $O(M/n)$ positions to find an empty slot. With $M$ total assignments:

Total: $O(M \cdot M/n) = O(M^2/n)$

For typical parameters ($M = 16381$, $n = 100$), the algorithm completes in milliseconds (ballpark). □

Plain-language intuition: Building the lookup table is like filling a parking lot where each car (backend) has preferred spots. Usually takes time proportional to the number of spots, but if everyone wants the same spots, it takes longer as cars circle looking for alternatives.

Empirical cost. Modeling the fill as independent occupancy yields an $O(M \log M)$ expectation; for $M \gg n$ the constant is small and implementations run close to linear in $M$. Back-of-the-envelope for $M=16{,}381$ is on the order of a few $10^5$ primitive steps—milliseconds on modern CPUs. (Exact cycles depend on your code path and cache locality.)

4.2 Space Complexity

Proposition 4.2 (Space Complexity). Maglev uses O(M + n) memory: the lookup table $T$ has $M$ entries and the per-backend metadata is $O(n)$. In practice one chooses $M = c\cdot n$ (e.g., $c\in[100,1000]$) to bound relative imbalance by $n/M \le 1/c$, so total memory is linear in $n$. This is Maglev's explicit time–space trade: O(1) lookups in exchange for O(M) state.

Comment. There isn't a commonly accepted formal lower bound that forbids sublinear state for O(1) lookups under all CH desiderata; Maglev makes a pragmatic, well-understood trade that performs excellently in the packet path.

Part V: eBPF Implementation Analysis

5.1 BPF Map Structure

In Cilium's eBPF implementation, Maglev uses a specialized map-in-map structure:

// Schematic representation - actual Cilium implementation differs
// Source concept: cilium/pkg/bpf/maps/maglev.go
// (actual map layouts/field sizes depend on build options and Cilium version)
struct bpf_map_def maglev_outer = {
    .type = BPF_MAP_TYPE_HASH_OF_MAPS,
    .key_size = sizeof(__u16),        // Service ID
    .value_size = sizeof(__u32),      // Inner map ID
    .max_entries = MAX_SERVICES,
};

struct bpf_map_def maglev_inner = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(__u32),        // Table index
    .value_size = sizeof(__u16),      // Backend ID (implementation-dependent; commonly 32-bit, occasionally 16-bit)
    .max_entries = MAGLEV_TABLE_SIZE, // M (prime)
};

Proposition 5.1 (Memory Scaling). The eBPF implementation requires $O(s \cdot M)$ kernel memory where $s$ is the number of services.

Proof. Each service maintains its own Maglev table of size $M$. With $s$ services:
$$\text{Memory} = s \cdot M \cdot \text{sizeof(backend\_id)} = O(s \cdot M)$$

This allows independent backend pools per service while maintaining O(1) lookup. □

Theoretical Memory Calculation: For s = 847 services with M = 16,381 and 2–4 byte backend IDs (depending on build):

Memory = 847 × 16,381 × 2–4 bytes = 27.7–55.5 MB

(Kernel may pad map values to alignment; treat 2–4 bytes as an effective range in practice (varies by kernel and architecture).)

For 10,000 services, this would require ~312–655 MB—still manageable on modern systems.

5.2 Lookup Performance in eBPF

The packet processing path in eBPF:

// Schematic pseudocode - not actual BPF C
static __always_inline int maglev_lookup(struct __sk_buff *skb, __u16 svc_id) {
    // Get flow hash - either via helper or custom 5-tuple mixing
    __u32 hash = bpf_get_hash_recalc(skb);  // TC context
    // For XDP: manually compute 5-tuple hash from packet headers
    // In XDP, compute a keyed 5-tuple hash in BPF (bpf_get_hash_recalc is not available) to keep selection secret and stable.
    __u32 idx = hash % MAGLEV_TABLE_SIZE;

    // Two map lookups for O(1) backend selection
    void *inner = bpf_map_lookup_elem(&maglev_outer, &svc_id);
    if (!inner) return -1;

    __u16 *backend = bpf_map_lookup_elem(inner, &idx);
    return backend ? *backend : -1;
}

Proposition 5.2 (Lookup Performance). The eBPF Maglev lookup executes in constant time.

Proof. The lookup consists of:

Hash computation: a constant number of CPU cycles
Two helper lookups (outer + inner map) plus dereference

Total: Constant time; typically on the order of one or two cache misses. □

5.3 Verifier Constraints

Proposition 5.3 (Verifier Compliance). The Maglev eBPF implementation satisfies all verifier constraints with complexity $O(1)$ per packet.

Proof. The verifier checks:

Bounded loops: No loops in lookup path—only array indexing ✓
Memory safety: All map accesses are bounds-checked by BPF helpers ✓
Complexity limit: Single lookup path with O(1) operations ✓
Stack usage: Minimal stack variables (< 512 bytes) ✓

The implementation is thus verifier-compliant. □

Part VI: Optimality and Lower Bounds

6.1 Information-Theoretic Lower Bound

Theorem 6.1 (Lower Bound on Disruption). Any consistent hashing scheme must remap at least $1/(n+1)$ fraction of keys when adding a backend to $n$ existing backends.

Proof. By the pigeonhole principle, the new backend must receive at least $1/(n+1)$ fraction of the total load for uniform distribution. These keys must come from existing backends, requiring remapping.

This is the information-theoretic lower bound. Rendezvous (HRW) attains it exactly; Maglev approaches it for $M \gg n$ but does not guarantee equality. □

Corollary 6.1a (Adding $m$ backends). Adding $m$ backends to $n$ existing backends requires remapping at least $\dfrac{m}{n+m}$ of keys. Rendezvous (HRW) attains this bound exactly.

Plain-language intuition: This bound is the information-theoretic limit—no scheme can do better. Rendezvous (HRW) achieves it exactly; Maglev approaches it when $M \gg n$. It's like proving you need at least N cuts to divide a cake into N+1 equal pieces.

6.2 Table Size Requirements

Theorem 6.2 (Minimum Table Size). To guarantee maximum relative load imbalance below $\epsilon$, the table size must satisfy:
$$M \geq \frac{n}{\epsilon}$$

Proof. From Corollary 2.1, the relative load imbalance is bounded by $n/M$. For this to be less than $\epsilon$:
$$\frac{n}{M} < \epsilon \implies M > \frac{n}{\epsilon}$$

For $\epsilon = 0.01$ (1% imbalance), we need $M \ge 100n$. □

6.3 Prime Size—Simple Sufficient Condition

Theorem 6.3 (Prime size—simple sufficient condition). With $\text{skip}[i] = (\text{hash}_2(b_i) \bmod (M-1)) + 1$ and prime $M$, we have $\gcd(\text{skip}[i],M)=1$ for all backends, so each $\pi_i$ is a full permutation (Theorem 1.1).

Proof. For prime $M$ and any $\text{skip}[i] \in {1, 2, ..., M-1}$:
$$\gcd(\text{skip}[i], M) = 1$$

This holds because $M$ is prime and $0 < \text{skip}[i] < M$. By Theorem 1.1, each backend's permutation covers all $M$ positions. □

Note on composite $M$. If $M$ is composite and $\gcd(\text{skip}[i],M)\neq 1$, $\pi_i$ decomposes into cycles on a proper subset of ${0,\dots,M-1}$. That restricts which slots a backend can ever claim and can bias occupancy. To avoid this, either (a) use prime $M$, or (b) draw $\text{skip}$ uniformly from the unit group $\mathbb{Z}_M^{*}$ (values coprime to $M$). If drawing $\text{skip}$ from the unit group $\mathbb{Z}_M^{*}$, every $\pi_i$ is a cycle whose length divides $\varphi(M)$. Choosing prime $M$ makes $\varphi(M)=M-1$, which guarantees full-length cycles and simplifies analysis.

Part VII: Comparative Analysis

7.1 Versus Ring Hashing

Comparison 7.1 (Maglev vs Ring Hashing). Compared to ring-based consistent hashing with $v$ virtual nodes per backend:

Interpretation note. For comparable balance, ring hashing typically sets $v \approx M/n$, which makes both methods $O(M)$ in space. The operational trade is binary search on a ring vs direct table indexing in Maglev.

Property	Ring Hashing	Maglev
Lookup Time	$O(\log(nv))$	$O(1)$
Space	$O(nv)$	$O(M)$ (often $M \approx c \cdot n$)
Relative imbalance (ring: typical CV; maglev: deterministic bound)	~ $\Theta(1/\sqrt{v})$	$\le n/M$ (≈ 1/c when M = c·n)
Update Time (per change)	$O(v \log(nv))$	$O(M)$

CV = coefficient of variation (stddev/mean) of per-backend load.

Notes. Ring hashing requires binary search over $nv$ points on the ring, while Maglev uses direct array indexing. For equivalent load balance ($v \approx M/n$), ring hashing uses $O(M)$ space but with $O(\log M)$ lookup time.

Maglev trades update time for lookup performance, optimal for read-heavy workloads.

7.2 Versus Rendezvous Hashing

Comparison 7.2 (Maglev vs Rendezvous). Compared to Rendezvous (HRW) hashing:

Property	Rendezvous	Maglev
Lookup Time	$O(n)$	$O(1)$
Space	$O(n)$	$O(M)$ (often $M \approx c \cdot n$)
Disruption	Optimal	Near-optimal (for $M \gg n$)
Update Time (per change)	$O(1)$	$O(M)$

Rendezvous achieves optimal disruption; Maglev approaches it for large M. Maglev's precomputed table enables constant-time lookup at the cost of slower updates.

Rendezvous is suitable for small $n$ or scenarios where updates are frequent. Maglev excels when lookups are frequent and $n$ is large (hundreds or thousands of backends).

Part VIII: Production Considerations

8.1 Practical Table Sizes (Well-Tested Primes)

Guideline 8.1 (Table size selection). Pick a prime $M$ large enough that $n/M \le \epsilon$. Typical choices by scale:

Small (n < 100): $M = 16{,}381$
Medium (n < 600): $M = 65{,}521$
Large (n $\lesssim$ 1,300): $M = 131{,}071$

Note. These sizes are commonly supported in Cilium, but exact allowed values and defaults are version-dependent. Check your chart/agent version and confirm with helm show values or cilium-agent --help. Cilium validates table sizes against a supported set; unsupported primes will be rejected even if prime.

Note: On modern compilers, constant-modulus division is lowered to reciprocal-multiply; being "near $2^k$" offers no meaningful advantage, but the listed primes are convenient and well-tested.

Rationale. These values are chosen to satisfy:

Primality for complete permutations
$M > 100n$ for < 1% imbalance
Fit within CPU cache levels for performance

The specific values are well-tested primes from production deployments.

Plain-language intuition: These "magic numbers" are carefully chosen primes that have been battle-tested in production while being large enough to ensure good balance.

8.2 Connection Affinity

Theorem 8.2 (Connection Affinity Preservation). For a stable backend set, Maglev guarantees that:
$$P(\text{connection maintains backend | no changes}) = 1$$

Proof. The lookup table $T$ is deterministic given the backend set $B$. For unchanged $B$:

Same hash function $h$
Same permutations $\pi_i$
Same table $T$

Therefore, $h(k) = T[hash(k) \bmod M]$ remains constant for all connections. □

Caveat. Affinity assumes the hashing inputs (typically the 5-tuple) are stable; mid-flow NAT/policy changes or L7 proxies that re-originate connections will change the hash and therefore the selected backend.

Plain-language intuition: Once established, a connection always goes to the same backend (assuming no backend changes)—crucial for stateful services. Your video stream doesn't suddenly jump to a different server mid-movie.

8.3 Failure Detection Integration

Proposition 8.3 (Graceful Degradation Lower Bound). With health checking removing failed backends:
$$\text{Keys affected} \ge \frac{f}{n}$$

where $f$ is the number of failed backends. Rendezvous (HRW) achieves exactly $f/n$; Maglev is typically close for large $M$, but its table construction (round-robin over per-backend permutations) can introduce small additional reshuffling.

Proof. At minimum, connections previously mapped to failed backends need remapping. This fraction is at least $f/n$ by uniform distribution. Maglev's table construction may cause additional reshuffling. □

Part IX: Security Considerations

9.1 Hash Collision Attacks

Proposition 9.1 (Adversarial key steering bound). With a keyed, independent hash on the 5-tuple and a Maglev table whose per-backend share is $X_i/M \in {\lfloor M/n\rfloor/M,\ \lceil M/n\rceil/M}$, the probability that $k$ chosen keys all map to the same specific backend, conditioning on the first key's backend, is

$$P \ \le\ \left(\frac{\lceil M/n\rceil}{M}\right)^{k-1}.$$

This assumes independence of per-flow hashes and a secret seed, using the worst-case backend share $\max_i X_i/M = \lceil M/n\rceil/M$. In practice, keep the seed confidential and combine with rate-limits to mitigate probing.

9.2 Seed Randomization

Using a random seed $s$ unknown to attackers, the work to find collisions is $\Omega(2^{|s|/2})$ via birthday attacks. For example, with a 96-bit seed, birthday attacks need $\approx 2^{48}$ trials to find colliding inputs—well beyond practical limits for DDoS scenarios.

Cilium Implementation: Cilium's maglev.hashSeed (or --bpf-lb-maglev-hash-seed) seeds the userspace permutation generation so every agent builds identical tables. The per-packet flow hash in BPF uses bpf_get_hash_recalc() or an equivalent 5-tuple mix [1,2,3].

Part X: Cilium Implementation Details

10.1 eBPF/XDP Architecture

Cilium implements Maglev using a sophisticated eBPF map-in-map structure for efficient packet processing:

Map Structure:

// Schematic representation of map-in-map structure
// Outer map: Service ID → Inner Maglev table
struct bpf_map_def maglev_outer = {
    .type = BPF_MAP_TYPE_HASH_OF_MAPS,
    .key_size = sizeof(__u16),        // Service ID
    .value_size = sizeof(__u32),      // Inner map ID
    .max_entries = MAX_SERVICES,
};

// Inner map: Maglev lookup table
struct bpf_map_def maglev_inner = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(__u32),        // Table index [0, M-1]
    .value_size = sizeof(__u16),      // Backend ID (implementation-dependent; commonly 32-bit, occasionally 16-bit)
    .max_entries = MAGLEV_TABLE_SIZE, // M (prime)
};

Performance Characteristics:

Packet rate: Up to ~14.88 Mpps demonstrated in XDP L4LB benchmarks (10 GbE line rate at 64-byte frames) [6,9]
Lookup complexity: O(1) with two map lookups plus array access
Memory usage: 2–4 bytes × M × services (value size is 2–4 bytes depending on build; e.g., 27.7–55.5 MB for 847 services with M=16381)

10.2 Configuration and Tuning

Helm Configuration:

# Example (verify keys for your chart version)
helm upgrade --install cilium cilium/cilium \
  --set loadBalancer.algorithm=maglev \
  --set maglev.tableSize=16381 \
  --set maglev.hashSeed="$(head -c12 /dev/urandom | base64 | tr -d '\n')"
# Some options (e.g., kubeProxyReplacement) are version-dependent.
# Always verify: helm show values cilium/cilium --version <x.y.z>

Agent Flags (alternative):

cilium-agent \
  --bpf-lb-algorithm=maglev \
  --bpf-lb-maglev-table-size=16381 \
  --bpf-lb-maglev-hash-seed="$(head -c12 /dev/urandom | base64 | tr -d '\n')"

Examples of available table sizes in Cilium (as of the cited Cilium release): 251, 509, 1021, 2039, 4093, 8191, 16381 (default), 32749, 65521, 131071 [1,4].

The hash seed prevents attackers from predicting backend selection, crucial for DDoS protection.

10.3 Production Deployment Patterns

Traffic paths. Cilium commonly uses Maglev for north–south L4LB (ingress/external → service). East–west (pod→service) on the same node often bypasses Maglev via socket-level load-balancing at connect(2) (sockops/sockmap), while inter-node paths may still traverse TC/XDP depending on configuration. The net effect: Maglev is critical for external fan-in; intra-cluster connections frequently take an even faster path.

Session Affinity:
The consistent hashing property naturally provides session affinity without additional state:

Same 5-tuple always maps to same backend
Survives node failures (minimal redistribution)
No session table required

10.4 Real-World Performance Impact

Seznam.cz Production Case Study:
Seznam successfully migrated to Cilium for their large-scale infrastructure, reporting significant improvements in visibility and cost savings through more efficient load balancing (see citation [10]). The migration eliminated iptables rule explosion (from O(n×m) to O(1) lookups) and enabled handling of massive connection volumes.

Key Optimizations in Cilium:

XDP acceleration: Process packets before SKB allocation
Direct server return (DSR): Bypass return path for reduced latency
Socket-level LB: Redirect at connect() for container-to-container traffic

10.5 Known Limitations and Workarounds

Current Limitations:

Weighted backends: Not yet supported (equal weight only) as of this writing; track project roadmap/issues (weights are active design work; semantics may change)
Dynamic table resize: May require an agent restart depending on version
Memory trade-off: Each service requires M × sizeof(backend_id) bytes

Platform-Specific Considerations:

Kernel requirements: Use kernel ≥5.10 LTS for production; specific features vary by NIC/driver
NIC compatibility: Native XDP requires driver support (check with ethtool -i)
MTU constraints: With AWS ENA driver, XDP max MTU is ~3498 bytes due to single-page buffers; other drivers differ [1,5]

Configuration Best Practices:

# For high-throughput services
loadBalancer:
  algorithm: "maglev"
  acceleration: "native"  # Force native XDP
  mode: "dsr"            # Direct server return

# For large backend pools (>100 backends)
maglev:
  tableSize: 65521      # Larger prime for better distribution

10.6 Observability and Debugging

Debugging Commands:

# List Maglev tables
cilium-dbg bpf lb maglev list

# Get the lookup table for a specific service
cilium-dbg bpf lb maglev get <service-id>

# List all services using Maglev
cilium-dbg bpf lb list --output=json | jq '.[] | select(.algorithm=="maglev")'

# Check overall load balancer state
cilium-dbg bpf lb list

These commands provide direct insight into the Maglev tables stored in eBPF maps, allowing operators to verify distribution quality and debug load balancing issues [7,8].

Part XI: Practical Implications

11.1 When to Use Maglev

Ideal Scenarios:

High-throughput services (>1M requests/sec)
Large backend pools (>50 instances)
Read-heavy workloads (lookups >> updates)
Session affinity requirements

Poor Fit:

Highly dynamic backend pools (constant churn)
Small services (<10 backends)
Weighted load balancing requirements
Memory-constrained environments

11.2 Migration Considerations

From IPVS/iptables:

Often yields substantial CPU reductions at scale (commonly reported 50-70% in practice, workload-dependent)
Connection table memory shifts to Maglev tables
Session affinity naturally preserved
No connection state synchronization needed

Performance Comparison at Scale:

Method	1K backends	10K backends	50K backends
iptables	O(n) rules	Unusable	Crash
IPVS	15% CPU	45% CPU	78% CPU
Maglev	3% CPU	5% CPU	7% CPU

Illustrative only (not measured) — conveys trend lines; actual CPU varies with workload, kernel/NIC, and tuning.

11.3 Future Developments

Upcoming Features in Cilium:

Weighted Maglev: Extension for non-uniform backend weights
Adaptive table sizing: Dynamic M based on backend count
NUMA-aware optimization: Per-CPU Maglev tables for cache locality
Hardware offload: SmartNIC Maglev implementation

Research Directions:

Hierarchical Maglev for multi-tier load balancing
Integration with service mesh circuit breaking
Formal verification of eBPF implementation correctness

Conclusion

The Maglev algorithm represents a masterful balance of theoretical elegance and practical efficiency. Through our rigorous analysis, we've established:

Optimality: Absolute load imbalance is ≤ 1 slot; relative imbalance ≤ $n/M$ (choose $M \ge 100n$ for <1%). On membership change, the lower bound is $1/(n+1)$ for additions and $1/n$ for removals. Rendezvous (HRW) attains this exactly; Maglev approaches it when $M \gg n$
Efficiency: Constant-time lookups with precomputed state $O(M)$. In practice choose $M\approx c\cdot n$ (e.g., $c\in[100,1000]$) for linear memory in $n$—a deliberate time–space tradeoff that performs excellently in the packet path
Robustness: Mathematical guarantees hold under failures, attacks, and dynamic changes
Practicality: The eBPF implementation maintains all theoretical properties while satisfying kernel verifier constraints

Cilium's production implementation validates these theoretical guarantees:

Real-world scale: Seznam's case study reports substantial efficiency and visibility gains after adopting Cilium [10]
eBPF efficiency: Map-in-map architecture achieves true O(1) with two map lookups
XDP acceleration: ~14.88 Mpps throughput demonstrated in L4LB benchmarks, limited by NIC not software [6,9]
Operational simplicity: No connection state synchronization, natural session affinity

The algorithm's beauty lies not just in achieving these properties individually, but in achieving them simultaneously through a simple permutation-based construction. The round-robin filling strategy, combined with prime table sizes and independent permutations, creates a near-perfect hash distribution that's both theoretically sound and practically efficient.

For production deployments with Cilium, the key insights are:

Choose $M > 100n$ and prime for < 1% load imbalance (16381 default handles ~160 backends)
Use cryptographic hashing with random seeds for DDoS protection
Leverage eBPF's map-in-map structure for per-service isolation
Accept $O(M \log M)$ rebuild time for $O(1)$ lookup performance
Enable XDP acceleration for maximum throughput (kernel ≥5.10 LTS recommended)
Monitor Cilium LB/Maglev metrics exposed by cilium-agent (names vary by version; search for 'maglev'/'lb' in /metrics) and validate distribution with the cilium-dbg commands above

As we push the boundaries of network performance with technologies like XDP and eBPF, algorithms like Maglev become critical infrastructure. They demonstrate that distributed systems problems often have beautiful mathematical solutions waiting to be discovered—solutions that are not just theoretically optimal but practically transformative. The transition from iptables' O(n×m) rule explosion to Maglev's O(1) lookups represents not just an incremental improvement, but a fundamental paradigm shift in how we approach load balancing at scale.

References

[1] Cilium Documentation. "Kubernetes Without kube-proxy." Available at: https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/

[2] Linux Kernel Patchwork. "bpf: allow for map-in-map with dynamic inner." Available at: https://patchwork.ozlabs.org/comment/2550297/

[3] eBPF Documentation. "Helper Function bpf_get_hash_recalc." Available at: https://docs.ebpf.io/linux/helper-function/bpf_get_hash_recalc/

[4] Cilium Documentation. "cilium-agent hive." Available at: https://docs.cilium.io/en/stable/cmdref/cilium-agent_hive.html

[5] Cloudflare Blog. "High Availability Load Balancers with Maglev." Available at: https://blog.cloudflare.com/high-availability-load-balancers-with-maglev/

[6] Cilium Blog. "Cilium Standalone Layer 4 Load Balancer XDP." Available at: https://cilium.io/blog/2022/04/12/cilium-standalone-L4LB-XDP/

[7] GitHub Issue. "Maglev weight annotations for EndpointSlices." Available at: https://github.com/cilium/cilium/issues/40190

[8] Cilium Documentation. "cilium-dbg bpf lb maglev." Available at: https://docs.cilium.io/en/stable/cmdref/cilium-dbg_bpf_lb_maglev.html

[9] eBPF.top. "告别IPVS，拥抱Cilium/XDP" (Goodbye IPVS, Hello Cilium/XDP). Available at: https://www.ebpf.top/post/cilium-standalone-L4LB-XDP-zh/

[10] CNCF Case Study. "Seznam's move to Cilium saves money and boosts visibility." Available at: https://www.cncf.io/case-studies/seznam/

[11] Eisenbud, D. E., et al. "Maglev: A Fast and Reliable Software Network Load Balancer." NSDI 2016. Available at: https://research.google/pubs/pub44824/

[12] Karger, D., et al. "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web." STOC 1997.

[13] Lamping, J., and Veach, E. "A Fast, Minimal Memory, Consistent Hash Algorithm." arXiv:1406.2294, 2014. Available at: https://arxiv.org/abs/1406.2294