水塘采样(reservoir sampling)算法是一种抽样算法,用于从大小为(非常大或未知)的流中随机采样个数据,并且要保证每个数据被抽样的概率相等。
算法的伪代码为:
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i := 1 to k
R[i] := S[i]
// replace elements with gradually decreasing probability
for i := k+1 to n
(* randomInteger(a, b) generates a uniform integer from the inclusive range {a, ..., b} *)
j := randomInteger(1, i)
if j <= k
R[j] := S[i]
对应的C++实现示例为:
vector<int> reservoir_sample(vector<int> &s, int k)
{
int n = s.size();
vector<int> res(k);
for (int i = 0; i < k; i++)
{
res[i] = s[i];
}
srand(time(NULL));
for(int i = k;i<n;i++)
{
int j = rand() % (i + 1);
if (j < k)
res[j] = s[i];
}
}
推导推导证明详见维基百科