Reservoir Sampling

The Hetch Hetchy Reservoir in California. (Image Source: National Geographic. “Reservoir | National Geographic Society.” Education.nationalgeographic.org, 21 June 2024, education.nationalgeographic.org/resource/reservoir/. Accessed 15 Feb. 2026.)

Suppose we’re running a raffle in which $n$ people enter and $k$ winners are selected. How would we select the winners once we have all the entrants? Well, we’d just do something like drawing $k$ names from a hat. Here, the fundamental operation is uniformly sampling $k$ items from a population of size $n$, and the way we’re doing it is by storing all the items and then randomly choosing from them at the very end (the batch approach).
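In the batch case this is a one-liner in most languages; a minimal Python sketch (the entrant names are hypothetical):

```python
import random

# Hypothetical raffle entrants; in the batch approach we have the full list up front.
entrants = ["Ada", "Grace", "Alan", "Edsger", "Barbara", "Donald"]

# Uniformly sample k = 2 winners without replacement, i.e. "draw 2 names from a hat".
winners = random.sample(entrants, k=2)
print(winners)
```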

Now imagine that either $n$ is unknown or is so large (or even infinite) that we cannot simply record the possible winners and then select them at the end (e.g., if we were randomly sampling online data streams). Reservoir sampling is an algorithm for sampling $k$ items without replacement from a population of (potentially unknown) size $n$ in a single pass. The idea is to maintain a running sample of size $k$ and then, for each newly encountered item, randomly decide which of the $k+1$ items (the current sample plus the new item) to discard forever (the streaming approach). More precisely, we:

  1. Initialize our running sample to be the first $k$ items $x_1, x_2, \ldots, x_k$.
  2. When encountering a new item $x_i$, uniformly generate an index $j \in \{1, 2, \ldots, i\}$. If $1 \leq j \leq k$, then replace the $j^{th}$ item in the running sample with $x_i$ (otherwise do nothing).
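The two steps above can be sketched as a short function (a minimal sketch, not any particular library’s API):

```python
import random

def reservoir_sample(stream, k):
    """One-pass reservoir sampling with one uniform index draw per item."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            # Step 1: fill the reservoir with the first k items.
            reservoir.append(x)
        else:
            # Step 2: draw j uniformly from {1, ..., i}; replace only if j lands
            # inside the reservoir.
            j = random.randint(1, i)
            if j <= k:
                reservoir[j - 1] = x
    return reservoir
```

Each item ends up in the final sample with probability $k/n$, which is easy to check empirically by counting how often a fixed item appears over many runs.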

One problem with the algorithm above is that you have to uniformly sample an index from a growing range. One way to fix this is to take advantage of the following fact: if you independently sample $n$ values $u_1, u_2, \ldots, u_n \sim U[0,1]$, then the indices of the smallest $k$ values form a uniform sample from the $k$-sized subsets of $\{1, 2, \ldots, n\}$. In other words, sampling $k$ items without replacement from a population of size $n$ is equivalent to sampling $n$ independent values from $U[0,1]$ and then checking which $k$ are the smallest (and mapping indices back to items). This means we can run essentially the same process as above, but determine eviction by comparing values uniformly sampled from the $[0,1]$ interval rather than indices from the ever-widening $\{1, 2, \ldots, i\}$.
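One way to realize this, assuming we keep the $k$ keys in a max-heap so the largest key (the eviction candidate) is always on top:

```python
import heapq
import random

def reservoir_sample_by_keys(stream, k):
    """Keep the items whose uniform keys are the k smallest seen so far."""
    # Python's heapq is a min-heap, so store negated keys to pop the largest first.
    heap = []  # entries are (-key, item)
    for x in stream:
        key = random.random()  # one independent U[0,1] draw per item
        if len(heap) < k:
            heapq.heappush(heap, (-key, x))
        elif key < -heap[0][0]:
            # New key beats the current largest key: evict and replace in O(log k).
            heapq.heapreplace(heap, (-key, x))
    return [x for _, x in heap]
```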

This solves the issue of size, but the algorithm can be optimized further so that we don’t even have to generate a random number for each item: if the $k$ current smallest uniform values are $u_1' \leq u_2' \leq \ldots \leq u_k'$, then the probability that a new item will be included is $u_k'$. This means that the probability that the next included item is the $i^{th}$ subsequent one is $(1 - u_k')^{i-1} u_k'$, which is a geometric distribution. Thus, rather than testing each item one by one, we can simply sample $i \sim \text{Geom}(u_k')$, accept the $i^{th}$ next item, evict the item corresponding to the largest value $u_k'$ (and update $u_k'$), and repeat.
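A sketch of this geometric-jump variant (close in spirit to Li’s “Algorithm L”; the function name and the explicit list of keys are my own choices, not from the post):

```python
import math
import random

def reservoir_sample_geometric(stream, k):
    """Reservoir sampling that skips ahead by a geometric jump between acceptances."""
    it = iter(stream)
    reservoir, keys = [], []
    # Fill the reservoir and give each item an independent U[0,1] key.
    for x in it:
        reservoir.append(x)
        keys.append(random.random())
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir  # stream had fewer than k items
    w = max(keys)  # u_k': the largest key currently in the sample
    while True:
        # Number of rejected items before the next acceptance ~ Geom(w),
        # sampled by inversion: skip = floor(log(U) / log(1 - w)), U ~ (0, 1].
        skip = math.floor(math.log(1.0 - random.random()) / math.log(1.0 - w))
        for _ in range(skip + 1):
            x = next(it, None)
            if x is None:
                return reservoir  # stream exhausted
        # Accept x: given that its key beat the threshold, it is uniform on [0, w).
        j = keys.index(w)             # evict the item with the largest key
        keys[j] = random.random() * w
        reservoir[j] = x
        w = max(keys)                 # update u_k'
```

The payoff is that the number of random draws scales with the number of acceptances (roughly $k \log(n/k)$) rather than with the stream length $n$.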
