A random selection algorithm

Suppose you want to select k things randomly from a list of n items, with no duplicates. Of course, if k > n, then it’s hopeless, and if k = n it’s trivial, so the interesting case is when k < n. A naive approach might pick an item randomly, and try again if it’s a duplicate.

var selected = new List();
while( len(selected) < k ) {
    var newIndex = floor(random() * n);
    if ( items[newIndex] not in selected ) {
        selected.append( items[newIndex] );
    }
}

Some DiceThis may work fine if k is much smaller than n, but when k is large and close to n, this can be disastrously slow. Near the end, most of the items are already selected, so we keep picking items and having to “throw them back”. The alternative of selecting which items to omit (instead of selecting the ones to keep) fails in the same way for small values of k.

Instead, there’s a simple trick that is guaranteed to never require more than n random numbers. As we get to each item, we select it or not with the “correct” probability. Here is the pseudocode:

var selected = new List();
var needed = k;
var available = n;
var position = 0;
while( len(selected) < k ) {
    if( random() < needed / available ) {
        selected.append( items[position] )
        needed -= 1;
    }
    available -= 1;
}

Handy trick… I’ve needed this algorithm quite a few times.

Comments

2 Responses to “A random selection algorithm”

  1. Shawn Poulson on 2008-06-28 12:15 pm

    Hmm, it looks like you’re not incrementing position. This code will retrieve the first item in each iteration.

    I’d recommend replacing items[position] with items[available-1] and ditch position altogether.

    Very good code snippet. I think I’ll keep it.

  2. Jeff on 2008-12-08 5:51 pm

    I agree with Shawn. This looks like, here is something wrong with the position variable. But items[available-1] does not fit too, because then you would pick the same element twice if your random condition is true twice in a row:
    if( random() < needed / available )

    So you would need something like this:
    items[available-k+needed]

    But way of selection is not uniform distributed: the probability would increase to the end of the while-loop, so the elements at the end have allways better chance to get into the array than the first element.

    Also this approach has a very simple problem: the items are only picked randomly, but the order stays the same. I mean, if i use this to take 3 elements from:
    1 4 5 8 9
    i could get:
    5 4 1
    or
    9 5 1
    but never: 1 9 5

    I would suggest to generate a random number for every element from you list, and the sort the array by this number. After that you can just take the first k elements from your list. If you use round(256*rand()) as your random number (or any other constant number * rand()) and radixsort as your sortalgorithm, you can arrange this in linear time, and it stays deterministic.

    Of course, it is not that easy to implement as your snippet, so i would use my suggestion only if you really need an uniform distributed random numbers array. (=you work for a lottery) ;)

Leave a Reply