改进此页

Shuffle and Sampling - 随机抽样和洗牌

洗牌算法

Given an array, write a program to generate a random permutation of array elements. This question is also asked as “shuffle a deck of cards” or “randomize a given array”.

题解

这里以 Fisher–Yates shuffle 算法为例,伪代码如下:

To shuffle an array a of n elements (indices 0..n-1):
  for i from 0 downto i do
       j ← random integer such that 0 ≤ j ≤ i
       exchange a[j] and a[i]

转化为代码为:

    /*
     * shuffle cards
     */
    public static void shuffleCard(int[] cards) {
        if (cards == null || cards.length == 0) return;

        Random rand = new Random();
        for (int i = 0; i < cards.length; i++) {
            int k = rand.nextInt(i + 1); // 0~i (inclusive)
            int temp = cards[i];
            cards[i] = cards[k];
            cards[k] = temp;
        }
    }

看了算法和代码后让我们来使用归纳法简单证明下这个洗牌算法的正确性。我们要证明的问题是:数组中每个元素在每个索引处出现的概率均相等。

对于单个元素来讲,以上算法显然正确,因为交换后仍然只有一个元素。接下来我们不妨假设其遍历到数组索引为i-1时满足随机排列特性,那么当遍历到数组索引为i时,随机数ki的概率为1/i, 为0~i-1的概率为(i-1)/i. 接下来与索引为i的值交换,可以得知card[i]出现在索引i的位置的概率为1/i, 在其他索引位置的概率也为1/i; 而对于card[i]之前的元素,以索引j处的元素card[j]为例进行分析可知其在位置j的概率为1/(i-1) * (i-1)/i = 1/i, 具体含义为遍历到索引i-1card[j]位于索引j的概率(1/(i-1))乘以遍历到索引i时随机数未选择与索引j的数进行交换的概率((i-1)/i).

需要注意的是前面的j <= i-1, 那么card[j]位于索引i的概率又是多少呢?要位于索引i,则随机数k须为i, 这种概率为1/i.

综上,以上算法可以实现完美洗牌(等概率)。

Random sampling - 随机抽样

随机抽样也称为水池抽样,Randomly choosing a sample of k items from a list S containing n items. 大意是从大小为 n 的数组中随机选出 m 个整数,要求每个元素被选中的概率相同。

题解

比较简洁的有算法 Algorithm R, 伪代码如下:

/*
  S has items to sample, R will contain the result
*/
ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]

  // replace elements with gradually decreasing probability
  for i = k+1 to n
    j := random(1, i)   // important: inclusive range
    if j <= k
        R[j] := S[i]

转化为代码为:

    /*
     * random sample
     */
    public static int[] randomSample(int[] nums, int m) {
        if (nums == null || nums.length == 0 || m <= 0) return new int[]{};

        int[] sample = new int[m];
        for (int i = 0; i < m; i++) {
            sample[i] = nums[i];
        }

        Random random = new Random();
        for (int i = m; i < nums.length; i++) {
            int k = random.nextInt(i + 1); // 0~i(inclusive)
            if (k < m) {
                sample[k] = nums[i];
            }
        }

        return sample;
    }

和洗牌算法类似,我们要证明的问题是:数组中每个元素在最终采样的数组中出现的概率均相等且为m/n. 洗牌算法中是排列,而对于随机抽样则为组合。

维基百科上的证明相对容易懂一些,这里我稍微复述下。首先将数组前 m 个元素填充进新数组sample, 然后从m开始遍历直至数组最后一个索引。随机数k的范围为0~i, 如果k < m,新数组的索引为 k 的元素则和原数组索引为i的元素交换;如果k >= m, 则不进行交换。i == m时,以原数组中第j个元素为例,它被nums[m]替换的概率为1/(m+1), 也就是说保留在sample数组中的概率为m/(m+1). 对与第m+1个元素nums[m]来说,其位于sample数组中的概率则为m*1/(m+1)(可替换 m 个不同的元素).

接下来仍然使用数学归纳法证明,若i遍历到r时,其之前的元素出现的概率为m/(r-1), 那么其之前的元素中任一元素nums[j]被替换的概率为m/r * 1/m = 1/r, 不被替换的概率则为(r-1)/r. 故元素nums[j]i遍历完r后仍然保留的概率为m/(r-1) * (r-1)/r = m/r. 而对于元素nums[r]来说,其要被替换至sample数组中的概率则为m/r(随机数小于m 的个数为 m).

综上,以上算法在遍历完长度为 n 的数组后每个数出现在最终sample数组中的概率都为m/n.

Implementation and Test case

Talk is cheap, show me the code!

Java

import java.util.*;
import java.util.Random;

public class Probability {
    public static void main(String[] args) {
        int[] cards = new int[10];
        for (int i = 0; i < 10; i++) {
            cards[i] = i;
        }
        // 100000 times test
        final int times = 100000;
        final int m = 5;
        int[][] count = new int[cards.length][cards.length];
        int[][] count2 = new int[cards.length][m];
        for (int i = 0; i < times; i++) {
            shuffleCard(cards);
            shuffleTest(cards, count);
            int[] sample = randomSample(cards, m);
            shuffleTest(sample, count2);
        }
        System.out.println("Shuffle cards");
        shufflePrint(count);
        System.out.println();
        System.out.println("Random sample");
        shufflePrint(count2);
    }

    /*
     * shuffle cards
     */
    public static void shuffleCard(int[] cards) {
        if (cards == null || cards.length == 0) return;

        Random rand = new Random();
        for (int i = 0; i < cards.length; i++) {
            int k = rand.nextInt(i + 1);
            int temp = cards[i];
            cards[i] = cards[k];
            cards[k] = temp;
        }
    }

    /*
     * random sample
     */
    public static int[] randomSample(int[] nums, int m) {
        if (nums == null || nums.length == 0 || m <= 0) return new int[]{};

        m = Math.min(m, nums.length);
        int[] sample = new int[m];
        for (int i = 0; i < m; i++) {
            sample[i] = nums[i];
        }

        Random random = new Random();
        for (int i = m; i < nums.length; i++) {
            int k = random.nextInt(i + 1);
            if (k < m) {
                sample[k] = nums[i];
            }
        }

        return sample;
    }

    /*
     * nums[i] = j, num j appear in index i ==> count[j][i]
     */
    public static void shuffleTest(int[] nums, int[][] count) {
        if (nums == null || nums.length == 0) return;

        for (int i = 0; i < nums.length; i++) {
            count[nums[i]][i]++;
        }
    }

    /*
     * print shuffle test
     */
    public static void shufflePrint(int[][] count) {
        if (count == null || count.length == 0) return;

        // print index
        System.out.print("   ");
        for (int i = 0; i < count[0].length; i++) {
            System.out.printf("%-7d", i);
        }
        System.out.println();
        // print num appear in index i in total
        for (int i = 0; i < count.length; i++) {
            System.out.print(i + ": ");
            for (int j = 0; j < count[i].length; j++) {
                System.out.printf("%-7d", count[i][j]);
            }
            System.out.println();
        }
    }
}

以十万次试验为例,左侧是元素i, 列代表在相应索引位置出现的次数。可以看出分布还是比较随机的。

Shuffle cards
   0      1      2      3      4      5      6      7      8      9
0: 10033  9963   10043  9845   9932   10020  9964   10114  10043  10043
1: 9907   9951   9989   10071  10059  9966   10054  10023  10015  9965
2: 10042  10046  9893   10080  10050  9994   10024  9852   10098  9921
3: 10039  10023  10039  10024  9919   10057  10188  9916   9907   9888
4: 9944   9913   10196  10059  9838   10205  9899   9945   9850   10151
5: 10094  9971   10054  9958   10022  9922   10047  9978   9965   9989
6: 9995   10147  9824   10015  10023  9804   10050  10192  9939   10011
7: 9941   10131  9902   9920   10040  10121  10010  9928   9984   10023
8: 10010  9926   9883   10098  10083  10028  9801   9936   10200  10035
9: 9995   9929   10177  9930   10034  9883   9963   10116  9999   9974

Random sample
   0      1      2      3      4
0: 9966   10026  10078  9966   9891
1: 9958   9806   10066  10022  10039
2: 9923   9936   9964   10051  10083
3: 10165  10088  10184  9928   9916
4: 9998   9990   9973   9931   9832
5: 10026  9932   9873   10085  10035
6: 9942   9972   9990   10030  10026
7: 9903   10153  9997   10051  10044
8: 10082  10066  9804   9899   10147
9: 10037  10031  10071  10037  9987

Reference