A random selection algorithm

Suppose you want to select k things randomly from a list of n items, with no duplicates. Of course, if k > n, then it’s hopeless, and if k = n it’s trivial, so the interesting case is when k < n. A naive approach might pick an item randomly, and try again if it’s a duplicate.

var selected = new List();
while( len(selected) < k ) {
    var newIndex = floor(random() * n);
    if ( items[newIndex] not in selected ) {
        selected.append( items[newIndex] );
    }
}

Some DiceThis may work fine if k is much smaller than n, but when k is large and close to n, this can be disastrously slow. Near the end, most of the items are already selected, so we keep picking items and having to “throw them back”. The alternative of selecting which items to omit (instead of selecting the ones to keep) fails in the same way for small values of k.

Instead, there’s a simple trick that is guaranteed to never require more than n random numbers. As we get to each item, we select it or not with the “correct” probability. Here is the pseudocode:

var selected = new List();
var needed = k;
var available = n;
var position = 0;
while( len(selected) < k ) {
    if( random() < needed / available ) {
        selected.append( items[position] )
        needed -= 1;
    }
    available -= 1;
}

Handy trick… I’ve needed this algorithm quite a few times.

Comments

6 Responses to “A random selection algorithm”

  1. Shawn Poulson on 2008-06-28 12:15 pm

    Hmm, it looks like you’re not incrementing position. This code will retrieve the first item in each iteration.

    I’d recommend replacing items[position] with items[available-1] and ditch position altogether.

    Very good code snippet. I think I’ll keep it.

  2. Jeff on 2008-12-08 5:51 pm

    I agree with Shawn. This looks like, here is something wrong with the position variable. But items[available-1] does not fit too, because then you would pick the same element twice if your random condition is true twice in a row:
    if( random() < needed / available )

    So you would need something like this:
    items[available-k+needed]

    But way of selection is not uniform distributed: the probability would increase to the end of the while-loop, so the elements at the end have allways better chance to get into the array than the first element.

    Also this approach has a very simple problem: the items are only picked randomly, but the order stays the same. I mean, if i use this to take 3 elements from:
    1 4 5 8 9
    i could get:
    5 4 1
    or
    9 5 1
    but never: 1 9 5

    I would suggest to generate a random number for every element from you list, and the sort the array by this number. After that you can just take the first k elements from your list. If you use round(256*rand()) as your random number (or any other constant number * rand()) and radixsort as your sortalgorithm, you can arrange this in linear time, and it stays deterministic.

    Of course, it is not that easy to implement as your snippet, so i would use my suggestion only if you really need an uniform distributed random numbers array. (=you work for a lottery) ;)

  3. Niklas Rosenstein on 2014-08-30 8:51 am

    Is the algorithm guaranteed to yield k items? Thanks!

  4. mcherm on 2014-08-30 7:34 pm

    No, in fact (as the comments above point out) the algorithm doesn’t even work at all! Let me see if I can find a fix…

  5. mcherm on 2014-08-30 9:45 pm

    Okay, I found a new algorithm that does what was originally advertised. And this time I tested it properly first.

    I had been getting confused between an algorithm for selecting k random items from a list of n items, and an algorithm for selecting a random item from a list with equal probability of selecting each item and WITHOUT having random access to the list.

    First, an algorithm for selecting some number of random items from a list of items IN A RANDOM ORDER is as follows (in Python, since I tested it):

    import math
    import random

    orig_list = [‘alpha’, ‘beta’, ‘gamma’, ‘delta’, ‘epsilon’, ‘zeta’]

    def selectRandom(values, needed):
    ”’This returns a new list of ‘needed’ random values from
    the list ‘values’. It ALSO shuffles the list values.

    This executes in time proportional to needed, assuming that
    swapping 2 positions in the list happens in constant time.”’
    def swap(i,j):
    ”’Swaps the values at locations i and j in values.”’
    temp = values[i]
    values[i] = values[j]
    values[j] = temp
    available_to_choose = len(values)
    assert needed <= available_to_choose
    already_picked = 0
    while ( already_picked < needed ):
    # pick a random number from [0..available-1]
    random_choice = math.floor( random.random() * available_to_choose )
    # swap the cell at ‘already_picked’ with the one ‘random_choice’ after it
    swap(already_picked, already_picked + random_choice)
    available_to_choose -= 1
    already_picked += 1
    return values[0:needed]

  6. mcherm on 2014-08-30 10:12 pm

    Okay, and here is what I was ACTUALLY trying to do originally. Again, this is written in Python so I could verify that it actually worked before publishing it.

    The algorithm is just the last part:

    class OneWayCollection:
    ”’This is a collection where you can find the length, but
    you can only walk through it once; you do not have random
    access to the values.”’
    def __init__(self, listOfValues):
    self.length = len(listOfValues)
    self.iterator = iter(listOfValues)
    def length(self):
    return self.length
    def getNext(self):
    return self.iterator.__next__()

    orig_list = OneWayCollection(
    [‘alpha’, ‘beta’, ‘gamma’, ‘delta’, ‘epsilon’, ‘zeta’])

    def selectRandom(values, needed):
    ”’This returns ‘needed’ values selected uniformly at random from
    ‘values’, which can be a OneWayCollection.”’
    result = []
    available = values.length
    assert needed < = available
    while needed > 0:
    item = values.getNext()
    rand = random.random()
    if rand < needed / available:
    needed -= 1
    result.append( item )
    else:
    available -= 1
    return result

Leave a Reply