
Suppose we’re running a raffle in which people enter and winners are selected. How would we select the winners once we have all the entrants? Well, we’d just do something like drawing names from a hat. Here, the fundamental operation is uniformly sampling $k$ items from a population of size $n$, and the way we’re doing it is by storing all the items and then randomly choosing from them at the very end (the batch approach).
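As a minimal sketch of the batch approach, we can collect every entrant and draw winners at the end (the entrant names here are just illustration data):

```python
import random

# Batch approach: store all entrants, then sample k winners at the very end.
entrants = ["Ada", "Grace", "Alan", "Edsger", "Barbara", "Donald"]
k = 2

# random.sample draws k distinct items uniformly, without replacement.
winners = random.sample(entrants, k)
print(winners)
```

This only works because the whole population fits in memory and we know when it ends.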
Now imagine that either $n$ is unknown or $n$ is so large (or even infinite) that we cannot simply record the possible winners and then select them at the end (e.g., if we were randomly sampling online data streams). Reservoir sampling is an algorithm for sampling $k$ items without replacement from a population of (potentially unknown) size $n$ in a single pass. The idea is to maintain a running sample of size $k$ and then, for each newly encountered item, randomly decide which of the $k + 1$ items (the current sample and the new item) to discard forever (the streaming approach). More precisely, we:
- Initialize our running sample to be the first $k$ items
- When encountering a new item $x_i$ (the $i$-th item overall), uniformly generate an index $j \in \{1, \dots, i\}$. If $j \le k$, then replace the $j$-th item in the running sample with $x_i$ (otherwise do nothing).
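The two steps above can be sketched as follows (function and variable names are my own):

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items without replacement from an iterable
    of unknown length, in a single streaming pass."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Initialize the running sample with the first k items.
            sample.append(item)
        else:
            # Uniformly pick an index j in {1, ..., i}; if it lands
            # inside the sample (j <= k), the new item replaces the
            # j-th item; otherwise the new item is discarded forever.
            j = random.randint(1, i)
            if j <= k:
                sample[j - 1] = item
    return sample
```

For example, `reservoir_sample(range(1000), 10)` returns 10 distinct values, each subset of size 10 being equally likely.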
One problem with the algorithm above is that you have to uniformly sample an index from a growing range. One way to fix this is to take advantage of the following fact: if you independently sample values $u_1, \dots, u_n \sim \mathrm{Uniform}(0, 1)$, then the indices of the $k$ smallest values form a uniform sample of the $k$-sized subsets of $\{1, \dots, n\}$. In other words, sampling $k$ items without replacement from a population of size $n$ is equivalent to sampling $n$ independent values from $\mathrm{Uniform}(0, 1)$ and then checking which $k$ are the smallest (and mapping indices back to items). This means that we can essentially run the same process as above, but compare values uniformly sampled from the fixed interval $[0, 1]$ to determine eviction, rather than indices from the ever-widening $\{1, \dots, i\}$.
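One way to sketch this variant is to tag each item with an independent uniform key and keep the $k$ smallest keys in a max-heap (names here are my own; `heapq` is a min-heap, so keys are negated):

```python
import heapq
import random

def reservoir_sample_keys(stream, k):
    """Single-pass sample of k items: assign each item an independent
    Uniform(0, 1) key and keep the items with the k smallest keys."""
    heap = []  # entries are (-key, item); heap[0] holds the LARGEST kept key
    for item in stream:
        u = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (-u, item))
        elif u < -heap[0][0]:
            # New key beats the current largest kept key:
            # evict that item and keep the new one.
            heapq.heapreplace(heap, (-u, item))
    return [item for _, item in heap]
```

Note that every random draw now comes from the same fixed interval, regardless of how many items have streamed past.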
This solves the issue of size, but the algorithm can actually be further optimized so that we don’t even have to generate a random number for each item: if the current $k$ smallest uniform values are $u_{(1)} < \cdots < u_{(k)}$, then the probability that a new item will be included is $w = u_{(k)}$ (its uniform value must fall below the largest value currently kept). This means that the probability that the next included item is the $m$-th upcoming item is $(1 - w)^{m - 1} w$, which follows a geometric distribution. Thus, rather than testing each item one by one, we can simply sample $m \sim \mathrm{Geometric}(w)$, skip ahead and accept the $m$-th item, evict the item corresponding to the largest value (and update $w$), and repeat.
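A sketch of this skip-based optimization, under the assumptions above (the geometric skip count is drawn via the standard inverse-transform trick $\lfloor \log U / \log(1 - w) \rfloor$, and an accepted item's key is uniform on $[0, w)$ given that it was accepted; names are my own):

```python
import heapq
import math
import random

def reservoir_sample_geometric(stream, k):
    """Skip-based streaming sample: instead of drawing one uniform key
    per item, draw how many items to skip before the next acceptance
    from a geometric distribution with success probability w."""
    it = iter(stream)
    # Seed the reservoir: the first k items with independent Uniform(0,1)
    # keys (assumes the stream has at least k items).
    heap = []  # (-key, item); heap[0] carries the largest kept key
    for _ in range(k):
        heapq.heappush(heap, (-random.random(), next(it)))
    while True:
        w = -heap[0][0]  # acceptance probability = largest kept key
        # Number of rejected items before the next acceptance ~ Geometric(w),
        # sampled by inverse transform; 1 - random() lies in (0, 1].
        skips = math.floor(math.log(1.0 - random.random()) / math.log(1.0 - w))
        try:
            for _ in range(skips):
                next(it)       # these items are skipped without any RNG work
            item = next(it)    # this item is accepted
        except StopIteration:
            break              # stream exhausted
        # Given acceptance, the new item's key is uniform on [0, w);
        # it replaces the item holding the current largest key.
        heapq.heapreplace(heap, (-random.random() * w, item))
    return [item for _, item in heap]
```

The payoff is that the amount of random-number generation scales with the number of accepted items (roughly $k \log(n/k)$) rather than with the stream length $n$.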
