水塘采样算法

水塘采样(reservoir sampling)算法是一种抽样算法,用于从大小为mathmath非常大或未知)的流中随机采样math个数据,并且要保证每个数据被抽样的概率相等。

算法的伪代码为:

ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i := 1 to k
      R[i] := S[i]

  // replace elements with gradually decreasing probability
  for i := k+1 to n
    (* randomInteger(a, b) generates a uniform integer from the inclusive range {a, ..., b} *)
    j := randomInteger(1, i)
    if j <= k
        R[j] := S[i]

对应的C++实现示例为:

vector<int> reservoir_sample(vector<int> &s, int k)
{
    int n = s.size();
    vector<int> res(k);
    for (int i = 0; i < k; i++)
    {
        res[i] = s[i];
    }
    srand(time(NULL));
    for(int i = k;i<n;i++)
    {
        int j = rand() % (i + 1);
        if (j < k)
            res[j] = s[i];
    }
}

推导推导证明详见维基百科

c++·algorithm
134 views
Comments
登录后评论
Sign In