Reservoir Sampling

The Hetch Hetchy Reservoir in California. (Image Source: National Geographic. “Reservoir | National Geographic Society.” Education.nationalgeographic.org, 21 June 2024, education.nationalgeographic.org/resource/reservoir/. Accessed 15 Feb. 2026.)

Suppose we’re running a raffle in which $n$ people enter and $k$ winners are selected. How would we select the winners once we have all the entrants? Well, we’d just do something like drawing $k$ names from a hat. Here, the fundamental operation is uniformly sampling $k$ items from a population of size $n$, and the way we’re doing it is by storing all the items and then randomly choosing from them at the very end (the batch approach).
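In the batch case this is a one-liner in most languages; a minimal Python sketch (the entrant names are hypothetical):

```python
import random

# Hypothetical raffle entrants; in the batch approach we have the full list up front.
entrants = ["Ada", "Grace", "Alan", "Edsger", "Barbara", "Donald"]

# Uniformly sample k = 2 winners without replacement, i.e. "draw 2 names from a hat".
winners = random.sample(entrants, k=2)
print(winners)
```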

Now imagine that either $n$ is unknown or is so large (or even infinite) that we cannot simply record the possible winners and then select them at the end (e.g., if we were randomly sampling online data streams). Reservoir sampling is an algorithm for sampling $k$ items without replacement from a population of (potentially unknown) size $n$ in a single pass. The idea is to maintain a running sample of size $k$ and then, for each newly encountered item, randomly decide which of the $k+1$ items (the current sample plus the new item) to discard forever (the streaming approach). More precisely, we:

  1. Initialize our running sample to be the first $k$ items $x_1, x_2, \ldots, x_k$.
  2. When encountering a new item $x_i$, uniformly generate an index $j \in \{1, 2, \ldots, i\}$. If $1 \leq j \leq k$, then replace the $j^{th}$ item in the running sample with $x_i$ (otherwise do nothing).
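The two steps above can be sketched as a short function (a minimal sketch, not any particular library’s API):

```python
import random

def reservoir_sample(stream, k):
    """One-pass reservoir sampling with one uniform index draw per item."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            # Step 1: fill the reservoir with the first k items.
            reservoir.append(x)
        else:
            # Step 2: draw j uniformly from {1, ..., i}; replace only if j lands
            # inside the reservoir.
            j = random.randint(1, i)
            if j <= k:
                reservoir[j - 1] = x
    return reservoir
```

Each item ends up in the final sample with probability $k/n$, which is easy to check empirically by counting how often a fixed item appears over many runs.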

One problem with the algorithm above is that you have to uniformly sample an index from a growing range. One way to fix this is to take advantage of the following fact: if you independently sample $n$ values $u_1, u_2, \ldots, u_n \sim U[0,1]$, then the indices of the smallest $k$ values form a uniform sample from the $k$-sized subsets of $\{1, 2, \ldots, n\}$. In other words, sampling $k$ items without replacement from a population of size $n$ is equivalent to sampling $n$ independent values from $U[0,1]$ and then checking which $k$ are the smallest (and mapping indices back to items). This means we can run essentially the same process as above, but determine eviction by comparing values uniformly sampled from the $[0,1]$ interval rather than indices from the ever-widening $\{1, 2, \ldots, i\}$.
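One way to realize this, assuming we keep the $k$ keys in a max-heap so the largest key (the eviction candidate) is always on top:

```python
import heapq
import random

def reservoir_sample_by_keys(stream, k):
    """Keep the items whose uniform keys are the k smallest seen so far."""
    # Python's heapq is a min-heap, so store negated keys to pop the largest first.
    heap = []  # entries are (-key, item)
    for x in stream:
        key = random.random()  # one independent U[0,1] draw per item
        if len(heap) < k:
            heapq.heappush(heap, (-key, x))
        elif key < -heap[0][0]:
            # New key beats the current largest key: evict and replace in O(log k).
            heapq.heapreplace(heap, (-key, x))
    return [x for _, x in heap]
```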

This solves the issue of size, but the algorithm can be optimized further so that we don’t even have to generate a random number for each item: if the $k$ current smallest uniform values are $u_1' \leq u_2' \leq \ldots \leq u_k'$, then the probability that a new item will be included is $u_k'$. This means that the probability that the next included item is the $i^{th}$ subsequent one is $(1 - u_k')^{i-1} u_k'$, which is a geometric distribution. Thus, rather than testing each item one by one, we can simply sample $i \sim \text{Geom}(u_k')$, accept the $i^{th}$ next item, evict the item corresponding to the largest value $u_k'$ (and update $u_k'$), and repeat.
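A sketch of this geometric-jump variant (close in spirit to Li’s “Algorithm L”; the function name and the explicit list of keys are my own choices, not from the post):

```python
import math
import random

def reservoir_sample_geometric(stream, k):
    """Reservoir sampling that skips ahead by a geometric jump between acceptances."""
    it = iter(stream)
    reservoir, keys = [], []
    # Fill the reservoir and give each item an independent U[0,1] key.
    for x in it:
        reservoir.append(x)
        keys.append(random.random())
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir  # stream had fewer than k items
    w = max(keys)  # u_k': the largest key currently in the sample
    while True:
        # Number of rejected items before the next acceptance ~ Geom(w),
        # sampled by inversion: skip = floor(log(U) / log(1 - w)), U ~ (0, 1].
        skip = math.floor(math.log(1.0 - random.random()) / math.log(1.0 - w))
        for _ in range(skip + 1):
            x = next(it, None)
            if x is None:
                return reservoir  # stream exhausted
        # Accept x: given that its key beat the threshold, it is uniform on [0, w).
        j = keys.index(w)             # evict the item with the largest key
        keys[j] = random.random() * w
        reservoir[j] = x
        w = max(keys)                 # update u_k'
```

The payoff is that the number of random draws scales with the number of acceptances (roughly $k \log(n/k)$) rather than with the stream length $n$.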
