Load balancing at scale requires more than engineering intuition. It demands mathematical rigour. When Google introduced Maglev in 2016, they presented an elegant solution to a fundamental distributed systems problem: how to consistently map connections to backends with minimal memory overhead, constant-time lookups, and provable bounds on disruption during backend changes. Today, this algorithm powers critical infrastructure through implementations like Cilium's eBPF-based load balancer.
This analysis provides a comprehensive mathematical treatment of Maglev consistent hashing, from its theoretical foundations through its implementation in eBPF. We'll establish formal proofs for its key properties: uniform load distribution, minimal disruption guarantees, and performance bounds. Parts I–IV develop the mathematical theory, while Parts V–XI focus on Cilium's production implementation and operational considerations.
Part I: Mathematical Foundations
1.1 Problem Formulation
Let us formally define the consistent hashing problem:
Definition 1.1 (Consistent Hashing Problem). Given a set of backends $B = {b_1, b_2, ..., b_n}$ and a key space $K$, find a hash function $h: K \rightarrow B$ such that:
- Load Balance: $\forall, b_i, b_j \in B:\ \left|\ |{k \in K : h(k) = b_i}|\ -\ |{k \in K : h(k) = b_j}|\ \right| \le \epsilon$ for small $\epsilon$ relative to total keys
- Minimal Disruption: When $B$ changes to $B'$, minimize $|{k \in K : h(k) \neq h'(k)}|$
- Efficiency (Maglev design goal): Achieve $O(1)$ lookup using a precomputed table of size $M$ (i.e., $O(M)$ state)
Notation. We use keys and flows interchangeably for 5-tuples; $|K|$ denotes the number of live flows. We take the 5-tuple as $(srcIP, dstIP, srcPort, dstPort, L4proto)$.
Traditional ring-based consistent hashing with virtual nodes [12] achieves properties 1 and 2 but requires $O(\log(nv))$ lookup (binary search over the ring). O(1) schemes like Jump Consistent Hash (JCH) [13] exist and achieve the optimal $1/(n{+}1)$ remap on additions, but they compute the mapping per-lookup (no precomputed table) and thus differ operationally from Maglev's table-driven design. Maglev achieves all three with a precomputed table and $O(1)$ lookups; compared to Jump [13] (also $O(1)$ but no table), Maglev trades a one-time table build for the fastest possible per-packet path and simple eBPF lookups.
1.2 The Maglev Construction
Definition 1.2 (Maglev Lookup Table). The Maglev lookup table is an array $T[0..M-1]$ of backend identifiers. For each backend $b_i$, define a preference list $\pi_i$ which is a permutation of ${0, \ldots, M-1}$. Algorithm 1 fills $T$ by iterating backends round-robin and placing each $b_i$ at the first unfilled slot in $\pi_i$.
The construction algorithm operates as follows:
Algorithm 1: Maglev Table Generation
Input: Backends B = {b₁, ..., bₙ}, table size M (prime)
Output: Lookup table T[0..M-1]
1. For each backend bᵢ, generate two hash values:
- offset[i] = hash₁(bᵢ) mod M
- skip[i] = (hash₂(bᵢ) mod (M-1)) + 1 // Ensures gcd(skip[i], M) = 1
2. Generate permutation πᵢ for each backend:
πᵢ[j] = (offset[i] + j × skip[i]) mod M, for j ∈ [0, M-1]
3. Initialize: T[j] = -1 for all j, next[i] = 0 for all i
4. For c = 0 to M-1:
a. n' = c mod n
b. While T[πₙ'[next[n']]] ≠ -1:
next[n'] = next[n'] + 1
c. T[πₙ'[next[n']]] = n'
d. next[n'] = next[n'] + 1
5. Return T
Note: The skip value is computed as (hash₂(bᵢ) mod (M-1)) + 1 to ensure $\gcd(\text{skip}[i], M) = 1$ when $M$ is prime, guaranteeing complete permutations as proven in Theorem 1.1.
Because each $\pi_i$ is a full permutation of ${0,\dots,M-1}$, the while must encounter any still-empty index before $j$ reaches $M$; thus each iteration assigns exactly one slot and the loop terminates after exactly $M$ iterations, yielding the fair counts in §2.1.
1.3 Mathematical Properties of Permutations
Theorem 1.1 (Permutation Completeness). For prime $M$ and $\gcd(\text{skip}[i], M) = 1$, the sequence $\pi_i[j] = (\text{offset}[i] + j \cdot \text{skip}[i]) \bmod M$ generates a complete permutation of ${0, 1, ..., M-1}$.
Proof. We must show that $\pi_i$ is bijective.
Suppose $\pi_i[j_1] = \pi_i[j_2]$ for some $j_1, j_2 \in [0, M-1]$. Then:
$$(\text{offset}[i] + j_1 \cdot \text{skip}[i]) \equiv (\text{offset}[i] + j_2 \cdot \text{skip}[i]) \pmod{M}$$
This implies:
$$j_1 \cdot \text{skip}[i] \equiv j_2 \cdot \text{skip}[i] \pmod{M}$$
$$(j_1 - j_2) \cdot \text{skip}[i] \equiv 0 \pmod{M}$$
Since $\gcd(\text{skip}[i], M) = 1$ and $M$ is prime, $\text{skip}[i]$ has a multiplicative inverse modulo $M$. Therefore:
$$j_1 - j_2 \equiv 0 \pmod{M}$$
Since $j_1, j_2 \in [0, M-1]$, we have $j_1 = j_2$. Thus $\pi_i$ is injective. Since $|{0, ..., M-1}| = M$ and $\pi_i$ maps to the same set, injectivity implies bijectivity.
Equivalently, $\pi_i$ is a complete cycle whenever $\text{skip}[i] \in (\mathbb{Z}/M\mathbb{Z})^*$; taking $M$ prime makes this automatic for all $1 \leq \text{skip}[i] \leq M-1$ (see §6.3 for composite $M$). □
Plain-language intuition: This proof shows that each backend's "preference list" visits every single slot in the table exactly once—like having a unique dance where you visit every spot on the dance floor without repeating. The primality of $M$ is crucial: it ensures no backend gets stuck in a loop visiting only some positions.
Kubernetes Impact: In a Kubernetes cluster, this means when Cilium creates its load balancing table for a Service with multiple pod endpoints, each pod gets fair consideration for every possible hash slot. No pod can be structurally disadvantaged by the mathematics.
Corollary 1.1. Each backend $b_i$ has equal opportunity to claim any position in the lookup table during the round-robin filling process.
Collision note. Two backends can, in theory, produce identical preference lists if both offset and skip collide. Assuming independent uniform draws (a good model with a keyed hash), the per-pair probability is $\frac{1}{M}\cdot\frac{1}{M-1}$, which is negligible for typical $M$. A keyed hash keeps it unpredictable.
Part II: Load Distribution Analysis
2.1 Exact Load Distribution
Algorithm 1 Invariant: After $c$ iterations, exactly $c$ entries of $T$ are filled and each backend $b_i$ has either $\lfloor c/n \rfloor$ or $\lceil c/n \rceil$ assignments.
Let $X_i$ denote the number of table slots in $T$ assigned to backend $b_i$.
Proposition 2.1 (Exact Assignment Counts). For Algorithm 1, exactly $r = M \bmod n$ backends receive $\lceil M/n \rceil$ entries and the remaining $n - r$ backends receive $\lfloor M/n \rfloor$ entries. In particular, the average is $M/n$.
Operational rule: never run with $M<n$; some backends will receive 0 slots. Use $M\ge n$ (ideally $M\ge 100n$). Many implementations guard against configurations that would yield empty tables; verify behavior for your version.
Proof. By the invariant, backend $b_i$ is selected either $\lfloor M/n \rfloor$ or $\lceil M/n \rceil$ times; each selection produces exactly one placement.
After $M$ iterations:
- Exactly $r = M \bmod n$ backends get $\lceil M/n \rceil$ slots
- The remaining $(n - r)$ backends get $\lfloor M/n \rfloor$ slots
Verification: $r \cdot \lceil M/n \rceil + (n-r) \cdot \lfloor M/n \rfloor = r + n \cdot \lfloor M/n \rfloor = M$. □
Plain-language intuition: Imagine dealing cards to players in order—everyone gets roughly the same number. The round-robin approach ensures fairness: if you have 16,381 slots and 100 backends, each backend gets approximately 164 slots. This prevents any single pod from being overwhelmed.
Theoretical Impact in Kubernetes: Consider a Service with 50 pod replicas receiving λ = 10⁶ requests/second. By Proposition 2.1, each pod handles λ/n = 20,000 requests/second. Without uniform distribution, some pods might receive 2-3× the mean load, triggering cascading failures through Little's Law when queue lengths exceed buffer capacity.
2.2 Load Imbalance Bounds
Proposition 2.2 (Maximum Load Imbalance). For any $M$ and $n$, the maximum load imbalance between any two backends satisfies:
$$\max_{i,j} |X_i - X_j| \leq \lceil M/n \rceil - \lfloor M/n \rfloor \leq 1$$
Proof. The round-robin filling ensures each backend gets at most one more assignment than any other. Specifically:
Let $q = \lfloor M/n \rfloor$ and $r = M \bmod n$. Then:
- $r$ backends get exactly $q + 1$ slots
- $(n - r)$ backends get exactly $q$ slots
Therefore:
$$\max_{i,j} |X_i - X_j| = (q + 1) - q = 1$$
This bound is tight and achieved when $M \not\equiv 0 \pmod{n}$. □
Plain-language intuition: The worst-case scenario is that some backends get exactly one more slot than others—like dividing 17 cookies among 5 kids, where 2 kids get 4 cookies and 3 kids get 3 cookies. The difference is never more than 1 cookie (or slot).
Theoretical Demonstration: Consider n = 73 pods with M = 16381. Proposition 2.2 guarantees max imbalance = 1 slot. Expected slots per pod = 16381/73 ≈ 224.4. Therefore:
- Maximum load: ⌈224.4⌉ = 225 slots
- Minimum load: ⌊224.4⌋ = 224 slots
- Relative difference: 1/224.4 ≈ 0.45%
This mathematical bound ensures no pod becomes a hotspot, maintaining system stability under queueing theory constraints.
Corollary 2.1 (Relative Load Imbalance). The relative load imbalance is bounded by:
$$\frac{\max_i X_i - \min_i X_i}{\mathbb{E}[X_i]} \leq \frac{n}{M}$$
For $M > 100n$, this gives less than 1% relative imbalance.
Note this bound is for table slots. With a finite number of live flows |K|, sampling variance from hashing adds ≈O(1/√|K|) relative fluctuations on top of the slot imbalance; for large |K| this is negligible in practice and vanishes as |K| grows. In practice, per-backend flow counts concentrate tightly around $X_i$ with $\operatorname{Var}!\left[\text{flows}_i\right]\approx |K|\cdot (X_i/M)\cdot (1 - X_i/M)$, i.e., binomial sampling around the slot share.
Flow-size skew. Maglev equalizes connections (slots), not necessarily bytes. If your workload has heavy-tailed flow sizes, byte-level load can skew even when slot counts are equal. If weights are required, consider Weighted Rendezvous Hashing (WRH) or shard at the app layer until weighted Maglev is available.
Part III: Disruption Analysis
3.1 Backend Addition
Proposition 3.1 (Share claimed by a new backend). After adding backend $b_{n+1}$, $b_{n+1}$ occupies exactly $\lfloor M/(n+1) \rfloor$ or $\lceil M/(n+1) \rceil$ entries of $T$. Hence at least that fraction of keys must be remapped.
Remark. Because Maglev fills $T$ greedily in round-robin across per-backend permutations, some keys not mapped to $b_{n+1}$ can still change owners among the original backends. In practice this "extra churn" is small when $M \gg n$ (vanishing as $n/M \to 0$), but it is not guaranteed to be zero. Rendezvous (HRW) hashing does guarantee that only keys moving to the new backend change.
Plain-language intuition: When you add a new server to a pool of 100 servers, at least ~1/101 (~1%) of connections need to move, and it's typically close to this when $M \gg n$. It's like adding a new cashier at a store—a small fraction of customers shift to the new register.
Theoretical Autoscaling Analysis: Consider an HPA event scaling from n₀ = 80 to n₁ = 100 pods. By the proposition:
- Lower bound on disruption: (n₁ - n₀)/n₁ = 20% (exact under Rendezvous (HRW); typically ≈20% for Maglev when $M \gg n$)
- So ≈80% of connections remain pinned (exact under Rendezvous (HRW))
Under an M/M/n queue with arrival rate λ and service rate μ per backend, the per-pod utilization is ρ = λ/(nμ). If autoscaling tracks λ so that λ grows roughly ∝ n, then ρ stays approximately constant during the transition.
3.2 Backend Removal
Proposition 3.2 (Removal). Removing backend $b_j$ remaps at least the fraction $X_j/M$ of keys (with $X_j \approx M/n$, i.e., ≈$1/n$ of keys). (At minimum, all keys in $X_j$ must move; Maglev's table construction can additionally reshuffle a small number of other keys.) Rendezvous (HRW) avoids this additional reshuffling.
Plain-language intuition: When one server fails in a pool of 100, only the 1% of traffic it was handling needs to find new homes. The other 99% of connections remain untouched—crucial for maintaining established TCP sessions.
3.3 Churn Stability
Definition 3.1 (Churn Stability). A hashing scheme is $(\alpha, \beta)$-churn stable if removing $\alpha$ fraction of backends causes at most $\beta$ fraction of keys to be remapped.
Proposition 3.3 (Required remaps under churn). Removing $\alpha n$ backends eliminates $\alpha M$ entries, so at least an $\alpha$ fraction of keys must move.
Remark. Maglev's table construction (round-robin over per-backend permutations) can cause additional reshuffling among surviving backends; Rendezvous (HRW) achieves exactly $\alpha$.
Plain-language intuition: If 30% of your servers fail, about 30% of traffic gets redistributed. This linear relationship means the impact is roughly proportional to the failure, preventing cascade effects.
Part IV: Complexity Analysis
Maglev table generation runs in userspace (cilium-agent), not in the packet path; rebuild cost does not impact per-packet latency.
4.1 Expected Behavior and Worst-Case Bounds
Proposition 4.1 (Expected Behavior and Worst-Case Bounds). The Maglev table construction empirically runs near-linear for $M \gg n$; under an independence model (a standard coupon-collector approximation), analyses in [11] give $O(M\log M)$ expected operations. (The per-backend sequences are arithmetic progressions modulo M; independence is an approximation but empirically sufficient for the occupancy analysis.) The adversarial worst case is $O(M^2/n)$ if many permutations overlap so heavily that each placement scans $\Theta(M/n)$ filled entries.
Proof.
Empirical Behavior: The Maglev paper's analysis [11] models the fill process as an occupancy problem with independent permutations, deriving approximately $M/(M-n)$ tries on the $n$-th insertion, leading to $O(M \log M)$ total operations. When $M \gg n$ (e.g., $M = 100n$), collisions are rare and practical performance approaches $O(M)$.
Worst Case: Consider the pathological case where all backends have similar permutations. Backend $b_i$ might need to scan through $O(M/n)$ positions to find an empty slot. With $M$ total assignments:
Total: $O(M \cdot M/n) = O(M^2/n)$
For typical parameters ($M = 16381$, $n = 100$), the algorithm completes in milliseconds (ballpark). □
Plain-language intuition: Building the lookup table is like filling a parking lot where each car (backend) has preferred spots. Usually takes time proportional to the number of spots, but if everyone wants the same spots, it takes longer as cars circle looking for alternatives.
Empirical cost. Modeling the fill as independent occupancy yields an $O(M \log M)$ expectation; for $M \gg n$ the constant is small and implementations run close to linear in $M$. Back-of-the-envelope for $M=16{,}381$ is on the order of a few $10^5$ primitive steps—milliseconds on modern CPUs. (Exact cycles depend on your code path and cache locality.)
4.2 Space Complexity
Proposition 4.2 (Space Complexity). Maglev uses O(M + n) memory: the lookup table $T$ has $M$ entries and the per-backend metadata is $O(n)$. In practice one chooses $M = c\cdot n$ (e.g., $c\in[100,1000]$) to bound relative imbalance by $n/M \le 1/c$, so total memory is linear in $n$. This is Maglev's explicit time–space trade: O(1) lookups in exchange for O(M) state.
Comment. There isn't a commonly accepted formal lower bound that forbids sublinear state for O(1) lookups under all CH desiderata; Maglev makes a pragmatic, well-understood trade that performs excellently in the packet path.
Part V: eBPF Implementation Analysis
5.1 BPF Map Structure
In Cilium's eBPF implementation, Maglev uses a specialized map-in-map structure:
// Schematic representation - actual Cilium implementation differs
// Source concept: cilium/pkg/bpf/maps/maglev.go
// (actual map layouts/field sizes depend on build options and Cilium version)
struct bpf_map_def maglev_outer = {
.type = BPF_MAP_TYPE_HASH_OF_MAPS,
.key_size = sizeof(__u16), // Service ID
.value_size = sizeof(__u32), // Inner map ID
.max_entries = MAX_SERVICES,
};
struct bpf_map_def maglev_inner = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(__u32), // Table index
.value_size = sizeof(__u16), // Backend ID (implementation-dependent; commonly 32-bit, occasionally 16-bit)
.max_entries = MAGLEV_TABLE_SIZE, // M (prime)
};
Proposition 5.1 (Memory Scaling). The eBPF implementation requires $O(s \cdot M)$ kernel memory where $s$ is the number of services.
Proof. Each service maintains its own Maglev table of size $M$. With $s$ services:
$$\text{Memory} = s \cdot M \cdot \text{sizeof(backend\_id)} = O(s \cdot M)$$
This allows independent backend pools per service while maintaining O(1) lookup. □
Theoretical Memory Calculation: For s = 847 services with M = 16,381 and 2–4 byte backend IDs (depending on build):
Memory = 847 × 16,381 × 2–4 bytes = 27.7–55.5 MB
(Kernel may pad map values to alignment; treat 2–4 bytes as an effective range in practice (varies by kernel and architecture).)
For 10,000 services, this would require ~312–655 MB—still manageable on modern systems.
5.2 Lookup Performance in eBPF
The packet processing path in eBPF:
// Schematic pseudocode - not actual BPF C
static __always_inline int maglev_lookup(struct __sk_buff *skb, __u16 svc_id) {
// Get flow hash - either via helper or custom 5-tuple mixing
__u32 hash = bpf_get_hash_recalc(skb); // TC context
// For XDP: manually compute 5-tuple hash from packet headers
// In XDP, compute a keyed 5-tuple hash in BPF (bpf_get_hash_recalc is not available) to keep selection secret and stable.
__u32 idx = hash % MAGLEV_TABLE_SIZE;
// Two map lookups for O(1) backend selection
void *inner = bpf_map_lookup_elem(&maglev_outer, &svc_id);
if (!inner) return -1;
__u16 *backend = bpf_map_lookup_elem(inner, &idx);
return backend ? *backend : -1;
}
Proposition 5.2 (Lookup Performance). The eBPF Maglev lookup executes in constant time.
Proof. The lookup consists of:
- Hash computation: a constant number of CPU cycles
- Two helper lookups (outer + inner map) plus dereference
Total: Constant time; typically on the order of one or two cache misses. □
5.3 Verifier Constraints
Proposition 5.3 (Verifier Compliance). The Maglev eBPF implementation satisfies all verifier constraints with complexity $O(1)$ per packet.
Proof. The verifier checks:
- Bounded loops: No loops in lookup path—only array indexing ✓
- Memory safety: All map accesses are bounds-checked by BPF helpers ✓
- Complexity limit: Single lookup path with O(1) operations ✓
- Stack usage: Minimal stack variables (< 512 bytes) ✓
The implementation is thus verifier-compliant. □
Part VI: Optimality and Lower Bounds
6.1 Information-Theoretic Lower Bound
Theorem 6.1 (Lower Bound on Disruption). Any consistent hashing scheme must remap at least $1/(n+1)$ fraction of keys when adding a backend to $n$ existing backends.
Proof. By the pigeonhole principle, the new backend must receive at least $1/(n+1)$ fraction of the total load for uniform distribution. These keys must come from existing backends, requiring remapping.
This is the information-theoretic lower bound. Rendezvous (HRW) attains it exactly; Maglev approaches it for $M \gg n$ but does not guarantee equality. □
Corollary 6.1a (Adding $m$ backends). Adding $m$ backends to $n$ existing backends requires remapping at least $\dfrac{m}{n+m}$ of keys. Rendezvous (HRW) attains this bound exactly.
Plain-language intuition: This bound is the information-theoretic limit—no scheme can do better. Rendezvous (HRW) achieves it exactly; Maglev approaches it when $M \gg n$. It's like proving you need at least N cuts to divide a cake into N+1 equal pieces.
6.2 Table Size Requirements
Theorem 6.2 (Minimum Table Size). To guarantee maximum relative load imbalance below $\epsilon$, the table size must satisfy:
$$M \geq \frac{n}{\epsilon}$$
Proof. From Corollary 2.1, the relative load imbalance is bounded by $n/M$. For this to be less than $\epsilon$:
$$\frac{n}{M} < \epsilon \implies M > \frac{n}{\epsilon}$$
For $\epsilon = 0.01$ (1% imbalance), we need $M \ge 100n$. □
6.3 Prime Size—Simple Sufficient Condition
Theorem 6.3 (Prime size—simple sufficient condition). With $\text{skip}[i] = (\text{hash}_2(b_i) \bmod (M-1)) + 1$ and prime $M$, we have $\gcd(\text{skip}[i],M)=1$ for all backends, so each $\pi_i$ is a full permutation (Theorem 1.1).
Proof. For prime $M$ and any $\text{skip}[i] \in {1, 2, ..., M-1}$:
$$\gcd(\text{skip}[i], M) = 1$$
This holds because $M$ is prime and $0 < \text{skip}[i] < M$. By Theorem 1.1, each backend's permutation covers all $M$ positions. □
Note on composite $M$. If $M$ is composite and $\gcd(\text{skip}[i],M)\neq 1$, $\pi_i$ decomposes into cycles on a proper subset of ${0,\dots,M-1}$. That restricts which slots a backend can ever claim and can bias occupancy. To avoid this, either (a) use prime $M$, or (b) draw $\text{skip}$ uniformly from the unit group $\mathbb{Z}_M^{*}$ (values coprime to $M$). If drawing $\text{skip}$ from the unit group $\mathbb{Z}_M^{*}$, every $\pi_i$ is a cycle whose length divides $\varphi(M)$. Choosing prime $M$ makes $\varphi(M)=M-1$, which guarantees full-length cycles and simplifies analysis.
Part VII: Comparative Analysis
7.1 Versus Ring Hashing
Comparison 7.1 (Maglev vs Ring Hashing). Compared to ring-based consistent hashing with $v$ virtual nodes per backend:
Interpretation note. For comparable balance, ring hashing typically sets $v \approx M/n$, which makes both methods $O(M)$ in space. The operational trade is binary search on a ring vs direct table indexing in Maglev.
| Property | Ring Hashing | Maglev |
|---|---|---|
| Lookup Time | $O(\log(nv))$ | $O(1)$ |
| Space | $O(nv)$ | $O(M)$ (often $M \approx c \cdot n$) |
| Relative imbalance (ring: typical CV; maglev: deterministic bound) | ~ $\Theta(1/\sqrt{v})$ | $\le n/M$ (≈ 1/c when M = c·n) |
| Update Time (per change) | $O(v \log(nv))$ | $O(M)$ |
CV = coefficient of variation (stddev/mean) of per-backend load.
Notes. Ring hashing requires binary search over $nv$ points on the ring, while Maglev uses direct array indexing. For equivalent load balance ($v \approx M/n$), ring hashing uses $O(M)$ space but with $O(\log M)$ lookup time.
Maglev trades update time for lookup performance, optimal for read-heavy workloads.
7.2 Versus Rendezvous Hashing
Comparison 7.2 (Maglev vs Rendezvous). Compared to Rendezvous (HRW) hashing:
| Property | Rendezvous | Maglev |
|---|---|---|
| Lookup Time | $O(n)$ | $O(1)$ |
| Space | $O(n)$ | $O(M)$ (often $M \approx c \cdot n$) |
| Disruption | Optimal | Near-optimal (for $M \gg n$) |
| Update Time (per change) | $O(1)$ | $O(M)$ |
Rendezvous achieves optimal disruption; Maglev approaches it for large M. Maglev's precomputed table enables constant-time lookup at the cost of slower updates.
Rendezvous is suitable for small $n$ or scenarios where updates are frequent. Maglev excels when lookups are frequent and $n$ is large (hundreds or thousands of backends).
Part VIII: Production Considerations
8.1 Practical Table Sizes (Well-Tested Primes)
Guideline 8.1 (Table size selection). Pick a prime $M$ large enough that $n/M \le \epsilon$. Typical choices by scale:
- Small (n < 100): $M = 16{,}381$
- Medium (n < 600): $M = 65{,}521$
- Large (n $\lesssim$ 1,300): $M = 131{,}071$
Note. These sizes are commonly supported in Cilium, but exact allowed values and defaults are version-dependent. Check your chart/agent version and confirm with helm show values or cilium-agent --help. Cilium validates table sizes against a supported set; unsupported primes will be rejected even if prime.
Note: On modern compilers, constant-modulus division is lowered to reciprocal-multiply; being "near $2^k$" offers no meaningful advantage, but the listed primes are convenient and well-tested.
Rationale. These values are chosen to satisfy:
- Primality for complete permutations
- $M > 100n$ for < 1% imbalance
- Fit within CPU cache levels for performance
The specific values are well-tested primes from production deployments.
Plain-language intuition: These "magic numbers" are carefully chosen primes that have been battle-tested in production while being large enough to ensure good balance.
8.2 Connection Affinity
Theorem 8.2 (Connection Affinity Preservation). For a stable backend set, Maglev guarantees that:
$$P(\text{connection maintains backend | no changes}) = 1$$
Proof. The lookup table $T$ is deterministic given the backend set $B$. For unchanged $B$:
- Same hash function $h$
- Same permutations $\pi_i$
- Same table $T$
Therefore, $h(k) = T[hash(k) \bmod M]$ remains constant for all connections. □
Caveat. Affinity assumes the hashing inputs (typically the 5-tuple) are stable; mid-flow NAT/policy changes or L7 proxies that re-originate connections will change the hash and therefore the selected backend.
Plain-language intuition: Once established, a connection always goes to the same backend (assuming no backend changes)—crucial for stateful services. Your video stream doesn't suddenly jump to a different server mid-movie.
8.3 Failure Detection Integration
Proposition 8.3 (Graceful Degradation Lower Bound). With health checking removing failed backends:
$$\text{Keys affected} \ge \frac{f}{n}$$
where $f$ is the number of failed backends. Rendezvous (HRW) achieves exactly $f/n$; Maglev is typically close for large $M$, but its table construction (round-robin over per-backend permutations) can introduce small additional reshuffling.
Proof. At minimum, connections previously mapped to failed backends need remapping. This fraction is at least $f/n$ by uniform distribution. Maglev's table construction may cause additional reshuffling. □
Part IX: Security Considerations
9.1 Hash Collision Attacks
Proposition 9.1 (Adversarial key steering bound). With a keyed, independent hash on the 5-tuple and a Maglev table whose per-backend share is $X_i/M \in {\lfloor M/n\rfloor/M,\ \lceil M/n\rceil/M}$, the probability that $k$ chosen keys all map to the same specific backend, conditioning on the first key's backend, is
$$P \ \le\ \left(\frac{\lceil M/n\rceil}{M}\right)^{k-1}.$$
This assumes independence of per-flow hashes and a secret seed, using the worst-case backend share $\max_i X_i/M = \lceil M/n\rceil/M$. In practice, keep the seed confidential and combine with rate-limits to mitigate probing.
9.2 Seed Randomization
Using a random seed $s$ unknown to attackers, the work to find collisions is $\Omega(2^{|s|/2})$ via birthday attacks. For example, with a 96-bit seed, birthday attacks need $\approx 2^{48}$ trials to find colliding inputs—well beyond practical limits for DDoS scenarios.
Cilium Implementation: Cilium's maglev.hashSeed (or --bpf-lb-maglev-hash-seed) seeds the userspace permutation generation so every agent builds identical tables. The per-packet flow hash in BPF uses bpf_get_hash_recalc() or an equivalent 5-tuple mix [1,2,3].
Part X: Cilium Implementation Details
10.1 eBPF/XDP Architecture
Cilium implements Maglev using a sophisticated eBPF map-in-map structure for efficient packet processing:
Map Structure:
// Schematic representation of map-in-map structure
// Outer map: Service ID → Inner Maglev table
struct bpf_map_def maglev_outer = {
.type = BPF_MAP_TYPE_HASH_OF_MAPS,
.key_size = sizeof(__u16), // Service ID
.value_size = sizeof(__u32), // Inner map ID
.max_entries = MAX_SERVICES,
};
// Inner map: Maglev lookup table
struct bpf_map_def maglev_inner = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(__u32), // Table index [0, M-1]
.value_size = sizeof(__u16), // Backend ID (implementation-dependent; commonly 32-bit, occasionally 16-bit)
.max_entries = MAGLEV_TABLE_SIZE, // M (prime)
};
Performance Characteristics:
- Packet rate: Up to ~14.88 Mpps demonstrated in XDP L4LB benchmarks (10 GbE line rate at 64-byte frames) [6,9]
- Lookup complexity: O(1) with two map lookups plus array access
- Memory usage: 2–4 bytes × M × services (value size is 2–4 bytes depending on build; e.g., 27.7–55.5 MB for 847 services with M=16381)
10.2 Configuration and Tuning
Helm Configuration:
# Example (verify keys for your chart version)
helm upgrade --install cilium cilium/cilium \
--set loadBalancer.algorithm=maglev \
--set maglev.tableSize=16381 \
--set maglev.hashSeed="$(head -c12 /dev/urandom | base64 | tr -d '\n')"
# Some options (e.g., kubeProxyReplacement) are version-dependent.
# Always verify: helm show values cilium/cilium --version <x.y.z>
Agent Flags (alternative):
cilium-agent \
--bpf-lb-algorithm=maglev \
--bpf-lb-maglev-table-size=16381 \
--bpf-lb-maglev-hash-seed="$(head -c12 /dev/urandom | base64 | tr -d '\n')"
Examples of available table sizes in Cilium (as of the cited Cilium release): 251, 509, 1021, 2039, 4093, 8191, 16381 (default), 32749, 65521, 131071 [1,4].
The hash seed prevents attackers from predicting backend selection, crucial for DDoS protection.
10.3 Production Deployment Patterns
Traffic paths. Cilium commonly uses Maglev for north–south L4LB (ingress/external → service). East–west (pod→service) on the same node often bypasses Maglev via socket-level load-balancing at connect(2) (sockops/sockmap), while inter-node paths may still traverse TC/XDP depending on configuration. The net effect: Maglev is critical for external fan-in; intra-cluster connections frequently take an even faster path.
Session Affinity:
The consistent hashing property naturally provides session affinity without additional state:
- Same 5-tuple always maps to same backend
- Survives node failures (minimal redistribution)
- No session table required
10.4 Real-World Performance Impact
Seznam.cz Production Case Study:
Seznam successfully migrated to Cilium for their large-scale infrastructure, reporting significant improvements in visibility and cost savings through more efficient load balancing (see citation [10]). The migration eliminated iptables rule explosion (from O(n×m) to O(1) lookups) and enabled handling of massive connection volumes.
Key Optimizations in Cilium:
- XDP acceleration: Process packets before SKB allocation
- Direct server return (DSR): Bypass return path for reduced latency
- Socket-level LB: Redirect at connect() for container-to-container traffic
10.5 Known Limitations and Workarounds
Current Limitations:
- Weighted backends: Not yet supported (equal weight only) as of this writing; track project roadmap/issues (weights are active design work; semantics may change)
- Dynamic table resize: May require an agent restart depending on version
- Memory trade-off: Each service requires M × sizeof(backend_id) bytes
Platform-Specific Considerations:
- Kernel requirements: Use kernel ≥5.10 LTS for production; specific features vary by NIC/driver
- NIC compatibility: Native XDP requires driver support (check with
ethtool -i) - MTU constraints: With AWS ENA driver, XDP max MTU is ~3498 bytes due to single-page buffers; other drivers differ [1,5]
Configuration Best Practices:
# For high-throughput services
loadBalancer:
algorithm: "maglev"
acceleration: "native" # Force native XDP
mode: "dsr" # Direct server return
# For large backend pools (>100 backends)
maglev:
tableSize: 65521 # Larger prime for better distribution
10.6 Observability and Debugging
Debugging Commands:
# List Maglev tables
cilium-dbg bpf lb maglev list
# Get the lookup table for a specific service
cilium-dbg bpf lb maglev get <service-id>
# List all services using Maglev
cilium-dbg bpf lb list --output=json | jq '.[] | select(.algorithm=="maglev")'
# Check overall load balancer state
cilium-dbg bpf lb list
These commands provide direct insight into the Maglev tables stored in eBPF maps, allowing operators to verify distribution quality and debug load balancing issues [7,8].
Part XI: Practical Implications
11.1 When to Use Maglev
Ideal Scenarios:
- High-throughput services (>1M requests/sec)
- Large backend pools (>50 instances)
- Read-heavy workloads (lookups >> updates)
- Session affinity requirements
Poor Fit:
- Highly dynamic backend pools (constant churn)
- Small services (<10 backends)
- Weighted load balancing requirements
- Memory-constrained environments
11.2 Migration Considerations
From IPVS/iptables:
- Often yields substantial CPU reductions at scale (commonly reported 50-70% in practice, workload-dependent)
- Connection table memory shifts to Maglev tables
- Session affinity naturally preserved
- No connection state synchronization needed
Performance Comparison at Scale:
| Method | 1K backends | 10K backends | 50K backends |
|---|---|---|---|
| iptables | O(n) rules | Unusable | Crash |
| IPVS | 15% CPU | 45% CPU | 78% CPU |
| Maglev | 3% CPU | 5% CPU | 7% CPU |
Illustrative only (not measured) — conveys trend lines; actual CPU varies with workload, kernel/NIC, and tuning.
11.3 Future Developments
Upcoming Features in Cilium:
- Weighted Maglev: Extension for non-uniform backend weights
- Adaptive table sizing: Dynamic M based on backend count
- NUMA-aware optimization: Per-CPU Maglev tables for cache locality
- Hardware offload: SmartNIC Maglev implementation
Research Directions:
- Hierarchical Maglev for multi-tier load balancing
- Integration with service mesh circuit breaking
- Formal verification of eBPF implementation correctness
Conclusion
The Maglev algorithm represents a masterful balance of theoretical elegance and practical efficiency. Through our rigorous analysis, we've established:
- Optimality: Absolute load imbalance is ≤ 1 slot; relative imbalance ≤ $n/M$ (choose $M \ge 100n$ for <1%). On membership change, the lower bound is $1/(n+1)$ for additions and $1/n$ for removals. Rendezvous (HRW) attains this exactly; Maglev approaches it when $M \gg n$
- Efficiency: Constant-time lookups with precomputed state $O(M)$. In practice choose $M\approx c\cdot n$ (e.g., $c\in[100,1000]$) for linear memory in $n$—a deliberate time–space tradeoff that performs excellently in the packet path
- Robustness: Mathematical guarantees hold under failures, attacks, and dynamic changes
- Practicality: The eBPF implementation maintains all theoretical properties while satisfying kernel verifier constraints
Cilium's production implementation validates these theoretical guarantees:
- Real-world scale: Seznam's case study reports substantial efficiency and visibility gains after adopting Cilium [10]
- eBPF efficiency: Map-in-map architecture achieves true O(1) with two map lookups
- XDP acceleration: ~14.88 Mpps throughput demonstrated in L4LB benchmarks, limited by NIC not software [6,9]
- Operational simplicity: No connection state synchronization, natural session affinity
The algorithm's beauty lies not just in achieving these properties individually, but in achieving them simultaneously through a simple permutation-based construction. The round-robin filling strategy, combined with prime table sizes and independent permutations, creates a near-perfect hash distribution that's both theoretically sound and practically efficient.
For production deployments with Cilium, the key insights are:
- Choose $M > 100n$ and prime for < 1% load imbalance (16381 default handles ~160 backends)
- Use cryptographic hashing with random seeds for DDoS protection
- Leverage eBPF's map-in-map structure for per-service isolation
- Accept $O(M \log M)$ rebuild time for $O(1)$ lookup performance
- Enable XDP acceleration for maximum throughput (kernel ≥5.10 LTS recommended)
- Monitor Cilium LB/Maglev metrics exposed by cilium-agent (names vary by version; search for 'maglev'/'lb' in
/metrics) and validate distribution with thecilium-dbgcommands above
As we push the boundaries of network performance with technologies like XDP and eBPF, algorithms like Maglev become critical infrastructure. They demonstrate that distributed systems problems often have beautiful mathematical solutions waiting to be discovered—solutions that are not just theoretically optimal but practically transformative. The transition from iptables' O(n×m) rule explosion to Maglev's O(1) lookups represents not just an incremental improvement, but a fundamental paradigm shift in how we approach load balancing at scale.
References
[1] Cilium Documentation. "Kubernetes Without kube-proxy." Available at: https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/
[2] Linux Kernel Patchwork. "bpf: allow for map-in-map with dynamic inner." Available at: https://patchwork.ozlabs.org/comment/2550297/
[3] eBPF Documentation. "Helper Function bpf_get_hash_recalc." Available at: https://docs.ebpf.io/linux/helper-function/bpf_get_hash_recalc/
[4] Cilium Documentation. "cilium-agent hive." Available at: https://docs.cilium.io/en/stable/cmdref/cilium-agent_hive.html
[5] Cloudflare Blog. "High Availability Load Balancers with Maglev." Available at: https://blog.cloudflare.com/high-availability-load-balancers-with-maglev/
[6] Cilium Blog. "Cilium Standalone Layer 4 Load Balancer XDP." Available at: https://cilium.io/blog/2022/04/12/cilium-standalone-L4LB-XDP/
[7] GitHub Issue. "Maglev weight annotations for EndpointSlices." Available at: https://github.com/cilium/cilium/issues/40190
[8] Cilium Documentation. "cilium-dbg bpf lb maglev." Available at: https://docs.cilium.io/en/stable/cmdref/cilium-dbg_bpf_lb_maglev.html
[9] eBPF.top. "告别IPVS,拥抱Cilium/XDP" (Goodbye IPVS, Hello Cilium/XDP). Available at: https://www.ebpf.top/post/cilium-standalone-L4LB-XDP-zh/
[10] CNCF Case Study. "Seznam's move to Cilium saves money and boosts visibility." Available at: https://www.cncf.io/case-studies/seznam/
[11] Eisenbud, D. E., et al. "Maglev: A Fast and Reliable Software Network Load Balancer." NSDI 2016. Available at: https://research.google/pubs/pub44824/
[12] Karger, D., et al. "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web." STOC 1997.
[13] Lamping, J., and Veach, E. "A Fast, Minimal Memory, Consistent Hash Algorithm." arXiv:1406.2294, 2014. Available at: https://arxiv.org/abs/1406.2294