A weighted version of random.choice

244

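I need to write a weighted version of random.choice (each element in the list has a different probability of being selected). This is what I came up with: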

def weightedChoice(choices):
    """Like random.choice, but each element can have a different chance of
    being selected.

    choices can be any iterable containing iterables with two items each.
    Technically, they can have more than two items, the rest will just be
    ignored.  The first item is the thing being chosen, the second item is
    its weight.  The weights can be any numeric values, what matters is the
    relative differences between them.
    """
    space = {}
    current = 0
    for choice, weight in choices:
        if weight > 0:
            space[current] = choice
            current += weight
    rand = random.uniform(0, current)
    for key in sorted(space.keys() + [current]):
        if rand < key:
            return choice
        choice = space[key]
    return None

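This function seems overly complex to me, and ugly. I'm hoping everyone here can offer some suggestions for improving it, or alternative ways of doing this. Efficiency isn't as important to me as code cleanliness and readability.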

python optimization

— Colin

297

Since version 1.7.0, NumPy has a choice function that supports probability distributions.

from numpy.random import choice
draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)

Note that probability_distribution is a sequence in the same order as list_of_candidates. You can also use the keyword replace=False to change the behavior so that drawn items are not replaced.

— Ronan Paixão

11

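In my tests, this is an order of magnitude slower than random.choices for individual calls. If you need a lot of random results, it is really important to pick them all at once by adjusting number_of_items_to_pick. If you do so, it is an order of magnitude faster.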

— jpmc26

2

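This doesn't work with tuples etc. ("ValueError: a must be 1-dimensional"), so in that case you can ask numpy to pick an index into the list instead, i.e. pass len(list_of_candidates), and then do list_of_candidates[draw]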

— xjcl

217

Since Python 3.6 there is a choices method in the random module.

Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import random

In [2]: random.choices(
...:     population=[['a','b'], ['b','a'], ['c','b']],
...:     weights=[0.2, 0.2, 0.6],
...:     k=10
...: )

Out[2]:
[['c', 'b'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['c', 'b']]

Note that random.choices samples with replacement, per the docs:

Return a k sized list of elements chosen from the population with replacement.

If you need to sample without replacement, then as @ronan-paixão's brilliant answer states, you can use numpy.random.choice, whose replace argument controls that behaviour.
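To make "with replacement" concrete, here is a minimal sketch using only the standard library. The population and weights are made up for illustration; the point is that each draw is independent, so k may exceed the population size and the same element can come up repeatedly.

```python
import random

population = ["a", "b", "c"]   # illustrative data, not from the answer
weights = [0.2, 0.2, 0.6]

# k is larger than the population; this only works because
# every draw effectively puts the element back.
draws = random.choices(population, weights=weights, k=10)
print(draws)

# An element with weight 0 is never drawn.
only_c = random.choices(population, weights=[0, 0, 1], k=5)
print(only_c)
```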

— vishes_shell

4

This is much faster than numpy.random.choice. Picking from a list of 8 weighted items 10,000 times, numpy.random.choice took 0.3286 sec whereas random.choices took 0.0416 sec, about 8x faster.

— Anton Codes

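@AntonCodes This example is cherry-picked. numpy has some constant-time overhead that random.choices doesn't, so of course it is slower on a tiny list of 8 items, and if you are choosing 10k times from such a list, you're right. But for larger lists (depending on how you test, I'm seeing break points between 100 and 300 elements), np.random.choice begins outperforming random.choices by a fairly wide gap. For example, including the normalization step along with the numpy call, I get a nearly 4x speedup over random.choices for a list of 10k elements.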

— ggorlen

This should be the new accepted answer, based on the performance improvement @AntonCodes reported.

— Wayne Workman

132

import random

def weighted_choice(choices):
   total = sum(w for c, w in choices)
   r = random.uniform(0, total)
   upto = 0
   for c, w in choices:
      if upto + w >= r:
         return c
      upto += w
   assert False, "Shouldn't get here"

— Ned Batchelder

10

You can drop one operation and save a little time by reversing the statements inside the for loop: upto += w; if upto > r

— 一小段

5

Save a variable by deleting upto and just decrementing r by the weight each time. The comparison is then if r < 0

— JnBrymn

@JnBrymn You need to check r <= 0. Consider an input set of 1 item and a roll of 1.0. The assertion will then fail. I corrected that error in the answer.

— moooeeeep, 2015
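The variant discussed in these comments (drop upto, count r down, compare with r <= 0) might look like the sketch below. It is one reading of the comments, not the exact code edited into the answer; the final fallback return is an assumption added here to guard against floating-point rounding in the repeated subtraction.

```python
import random

def weighted_choice(choices):
    """Pick one item from (item, weight) pairs by counting r down to zero."""
    choices = list(choices)
    total = sum(w for _, w in choices)
    r = random.uniform(0, total)
    for c, w in choices:
        r -= w
        if r <= 0:  # "<=" so that a roll equal to the total still selects an item
            return c
    # Only reachable through rounding error in the subtractions above;
    # returning the last item is an assumption of this sketch.
    return choices[-1][0]

print(weighted_choice([("WHITE", 90), ("RED", 8), ("GREEN", 2)]))
```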

1

@Sardathrion you could use a pragma to mark the for loop as partial: # pragma: no branch

— Ned Batchelder

1

@mLstudent33 I don't use Udacity.

— Anton Codes

70

Arrange the weights into a cumulative distribution.
Use random.random() to pick a random float 0.0 <= x < total.
Search the distribution using bisect.bisect, as shown in the example at http://docs.python.org/dev/library/bisect.html#other-examples.

from random import random
from bisect import bisect

def weighted_choice(choices):
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    x = random() * total
    i = bisect(cum_weights, x)
    return values[i]

>>> weighted_choice([("WHITE",90), ("RED",8), ("GREEN",2)])
'WHITE'

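If you need to make more than one choice, split this into two functions, one to build the cumulative weights and another to bisect to a random point.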
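A minimal sketch of that split, assuming the same (value, weight) pairs as above. The function names build_cum_weights and pick are hypothetical and chosen only for this illustration.

```python
import bisect
import itertools
import random

def build_cum_weights(weights):
    """O(n) setup: running totals of the weights."""
    return list(itertools.accumulate(weights))

def pick(values, cum_weights):
    """O(log n) per draw: bisect to a random point below the total."""
    x = random.random() * cum_weights[-1]
    return values[bisect.bisect(cum_weights, x)]

values, weights = zip(*[("WHITE", 90), ("RED", 8), ("GREEN", 2)])
cum_weights = build_cum_weights(weights)             # built once
sample = [pick(values, cum_weights) for _ in range(5)]  # reused for every draw
print(sample)
```

The setup cost is paid once, so repeated draws from the same choices only pay for the binary search.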

— Raymond Hettinger

5

This is more efficient than Ned's answer. Basically, instead of doing a linear (O(n)) search through the choices, he's doing a binary search (O(log n)). +1!

— NHDaly, 2014

Tuple index out of range if random() happens to return 1.0

— Jon Vaughan, 2014

10

This still runs in O(n) because of the cumulative distribution calculation.

— Lev Levitsky

6

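This solution is better when weighted_choice needs to be called many times for the same set of choices. In that case you can build the cumulative sum once and do a binary search on each call.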

— Amos

1

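@JonVaughan random() can't return 1.0. Per the docs, it returns a result in the half-open interval [0.0, 1.0), which is to say that it can return exactly 0.0, but can't return exactly 1.0. The largest value it can return is 0.99999999999999988897769753748434595763683319091796875 (which Python prints as 0.9999999999999999, and which is the largest 64-bit float less than 1).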

— Mark Amery
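The figure quoted in that comment can be checked directly. This small sketch assumes the usual IEEE 754 double-precision floats, where the largest value below 1.0 is 1 - 2**-53.

```python
from decimal import Decimal

largest_below_one = 1 - 2**-53       # exactly representable as a double

print(largest_below_one)             # shortest repr that round-trips
print(Decimal(largest_below_one))    # the exact stored value, digit for digit
print(largest_below_one < 1.0)
```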

21

If you don't mind using numpy, you can use numpy.random.choice.

For example:

import numpy

items  = [["item1", 0.2], ["item2", 0.3], ["item3", 0.45], ["item4", 0.05]]
elems = [i[0] for i in items]
probs = [i[1] for i in items]

trials = 1000
results = [0] * len(items)
for i in range(trials):
    # This is where the item is selected! choice needs the 1-D list of names.
    res = numpy.random.choice(elems, p=probs)
    results[elems.index(res)] += 1
results = [r / float(trials) for r in results]
print "item\texpected\tactual"
for i in range(len(probs)):
    print "%s\t%0.4f\t%0.4f" % (elems[i], probs[i], results[i])

If you know in advance how many selections you need to make, you can do it without a loop like this:

numpy.random.choice(elems, trials, p=probs)

— 威兹曼

15

Crude, but may be sufficient:

import random
weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[]))

Does it work?

# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]

# initialize tally dict
tally = dict.fromkeys([v for v, wt in choices], 0)

# tally up 1000 weighted choices
for i in xrange(1000):
    tally[weighted_choice(choices)] += 1

print tally.items()

Prints:

[('WHITE', 904), ('GREEN', 22), ('RED', 74)]

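This assumes that all weights are integers. They don't have to add up to 100; I just did that to make the test results easier to interpret. (If the weights are floating-point numbers, multiply them all by 10 repeatedly until all weights are >= 1.)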

weights = [.6, .2, .001, .199]
while any(w < 1.0 for w in weights):
    weights = [w*10 for w in weights]
weights = map(int, weights)

— PaulMcG

1

Nice, but I'm not sure I can assume all weights are integers.

— Colin, 2010

1

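It seems like your objects would be duplicated in this example. That would be inefficient (and so is the function for converting weights to integers). Nevertheless, this solution is a good one-liner if the integer weights are small.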

— wei2912

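Primitives will be duplicated, but objects will only have references duplicated, not the objects themselves. (This is why you can't create a list of lists using [[]]*10: all the elements in the outer list point to the same list.)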

— PaulMcG

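@PaulMcG No; nothing but references will ever be duplicated. Python's type system has no concept of primitives. You can confirm that even with an int you still get lots of references to the same object by doing something like [id(x) for x in ([99**99] * 100)] and observing that id returns the same memory address on every call.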

— Mark Amery
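Both observations from this exchange are easy to reproduce. The sketch below only uses built-ins; the variable names are illustrative.

```python
# List repetition copies references, not objects.
rows = [[]] * 3
rows[0].append("x")
print(rows)            # all three entries are the same inner list

# The same holds for an int: 100 references to one object.
ids = {id(x) for x in [99**99] * 100}
print(len(ids))
```

So repeating a value by its weight costs one reference per repetition, regardless of how large the object itself is.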

14

If you have a weighted dictionary instead of a list, you can write this:

items = { "a": 10, "b": 5, "c": 1 } 
random.choice([k for k in items for dummy in range(items[k])])

Note that [k for k in items for dummy in range(items[k])] produces this list: ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'b', 'b', 'b', 'b', 'b']

— 马克西姆

10

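This works for small total population values, but not for large datasets (e.g. US population by state would end up creating a working list with 300 million items in it).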

— Ryan

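@Ryan Indeed. It also doesn't work for non-integer weights, which are another realistic scenario (e.g. if your weights are expressed as probabilities of selection).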

— Mark Amery

12

As of Python v3.6, random.choices can be used to return a list of elements of a specified size from the given population, with optional weights.

random.choices(population, weights=None, *, cum_weights=None, k=1)

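population: list containing unique observations. (If empty, raises IndexError)
weights: the relative weights used to make selections.
cum_weights: the cumulative weights used to make selections.
k: size (len) of the list to be output. (Default len()=1)

A few caveats:

1) It uses weighted sampling with replacement, so drawn items are put back and can be drawn again. The values in the weights sequence do not matter in themselves; only their relative ratios do.

Unlike np.random.choice, which only accepts probabilities as weights and requires that the individual probabilities sum to 1, there is no such restriction here. As long as the weights are of a numeric type (int/float/fraction, but not Decimal), it will still work.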

>>> import random
# weights being integers
>>> random.choices(["white", "green", "red"], [12, 12, 4], k=10)
['green', 'red', 'green', 'white', 'white', 'white', 'green', 'white', 'red', 'white']
# weights being floats
>>> random.choices(["white", "green", "red"], [.12, .12, .04], k=10)
['white', 'white', 'green', 'green', 'red', 'red', 'white', 'green', 'white', 'green']
# weights being fractions
>>> random.choices(["white", "green", "red"], [12/100, 12/100, 4/100], k=10)
['green', 'green', 'white', 'red', 'green', 'red', 'white', 'green', 'green', 'green']

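2) If neither weights nor cum_weights is specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence.

Specifying both weights and cum_weights raises a TypeError.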

>>> random.choices(["white", "green", "red"], k=10)
['white', 'white', 'green', 'red', 'red', 'red', 'white', 'white', 'white', 'green']

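3) cum_weights is typically the result of the itertools.accumulate function, which is really handy in such situations.

From the linked documentation:

Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.

So supplying either weights=[12, 12, 4] or cum_weights=[12, 24, 28] for our contrived case produces the same outcome, and the latter seems to be faster and more efficient.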
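A quick way to see the equivalence is to run both forms from identically seeded generators. This is a sketch that relies on CPython turning weights into cumulative weights internally, as the quoted documentation describes; the seed value is arbitrary.

```python
import itertools
import random

population = ["white", "green", "red"]
weights = [12, 12, 4]
cum_weights = list(itertools.accumulate(weights))   # [12, 24, 28]

# Two generators with the same seed consume the same random numbers.
rng_a = random.Random(2024)
rng_b = random.Random(2024)
from_weights = rng_a.choices(population, weights=weights, k=10)
from_cum = rng_b.choices(population, cum_weights=cum_weights, k=10)
print(from_weights == from_cum)

# Passing both arguments at once is rejected.
try:
    random.choices(population, weights=weights, cum_weights=cum_weights)
except TypeError as exc:
    print("TypeError:", exc)
```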

— Nickil Maveli

11

This is the version included in the standard library for Python 3.6:

import random
import itertools as _itertools
import bisect as _bisect

class Random36(random.Random):
    "Show the code included in the Python 3.6 version of the Random class"

    def choices(self, population, weights=None, *, cum_weights=None, k=1):
        """Return a k sized list of population elements chosen with replacement.

        If the relative weights or cumulative weights are not specified,
        the selections are made with equal probability.

        """
        random = self.random
        if cum_weights is None:
            if weights is None:
                _int = int
                total = len(population)
                return [population[_int(random() * total)] for i in range(k)]
            cum_weights = list(_itertools.accumulate(weights))
        elif weights is not None:
            raise TypeError('Cannot specify both weights and cumulative weights')
        if len(cum_weights) != len(population):
            raise ValueError('The number of weights does not match the population')
        bisect = _bisect.bisect
        total = cum_weights[-1]
        return [population[bisect(cum_weights, random() * total)] for i in range(k)]

Source: https://hg.python.org/cpython/file/tip/Lib/random.py#l340

— Raymond Hettinger

2

import numpy as np
w=np.array([ 0.4,  0.8,  1.6,  0.8,  0.4])
np.random.choice(w, p=w/sum(w))

— WHI

2

I'm probably too late to contribute anything useful, but here's a simple, short, and very efficient snippet:

def choose_index(probabilies):
    cmf = probabilies[0]
    choice = random.random()
    for k in xrange(len(probabilies)):
        if choice <= cmf:
            return k
        else:
            cmf += probabilies[k+1]

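There is no need to sort your probabilities or create a vector with your cmf, and it terminates as soon as it finds its choice. Memory: O(1), time: O(N), with an average running time of about N/2.

If you have weights, simply add one line: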

def choose_index(weights):
    probabilities = weights / sum(weights)
    cmf = probabilies[0]
    choice = random.random()
    for k in xrange(len(probabilies)):
        if choice <= cmf:
            return k
        else:
            cmf += probabilies[k+1]

— 阿图尔

1

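There are several things wrong with this. Superficially, there are some typoed variable names and there's no rationale given for using this over, say, np.random.choice. More interestingly, though, there's a failure mode where this raises an exception. Doing probabilities = weights / sum(weights) doesn't guarantee that probabilities will sum to 1; for instance, if weights is [1,1,1,1,1,1,1] then probabilities will only sum to 0.9999999999999998, which is smaller than the largest possible return value of random.random (0.9999999999999999). In that case choice <= cmf is never satisfied.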

— Mark Amery
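The rounding effect described in this comment can be reproduced with plain floats. The sketch below accumulates the seven probabilities one at a time with +=, the way the answer's loop builds cmf, and compares the result with the largest value random.random() can return.

```python
probabilities = [1 / 7] * 7      # what weights / sum(weights) gives for seven equal weights

cmf = 0.0
for p in probabilities:
    cmf += p                     # same left-to-right accumulation as the loop in the answer

largest_random = 1 - 2**-53      # largest possible return value of random.random()
print(cmf)
print(largest_random)
print(cmf < largest_random)      # a draw above cmf would never satisfy choice <= cmf
```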

2

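If your list of weighted choices is relatively static and you want to sample frequently, you can do one O(N) preprocessing step and then do each selection in O(1), using the functions in this related answer.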

# run only when `choices` changes.
preprocessed_data = prep(weight for _,weight in choices)

# O(1) selection
value = choices[sample(preprocessed_data)][0]
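The prep and sample functions are defined in the linked answer and are not reproduced here. As an illustration only, one well-known technique with O(N) setup and O(1) draws is the alias method; whether the linked answer uses exactly this is an assumption. The names build_alias and sample_alias are hypothetical stand-ins for prep and sample.

```python
import random

def build_alias(weights):
    """O(N) setup: build the probability and alias tables."""
    weights = list(weights)
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # average scaled weight is 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]          # column s keeps its own share...
        alias[s] = l                 # ...and is topped up from column l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers are full columns (up to rounding)
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias):
    """O(1) draw: pick a column, then either the column or its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

choices = [("WHITE", 90), ("RED", 8), ("GREEN", 2)]
prob, alias = build_alias(weight for _, weight in choices)  # run only when choices changes
value = choices[sample_alias(prob, alias)][0]               # O(1) selection
print(value)
```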

— 贝壳

1

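I looked at the other thread that was pointed to and came up with this variation in my own coding style. It returns the index of the choice for tallying purposes, but returning the string instead is simple (see the commented-out return alternative):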

import random
import bisect

try:
    range = xrange
except:
    pass

def weighted_choice(choices):
    total, cumulative = 0, []
    for c,w in choices:
        total += w
        cumulative.append((total, c))
    r = random.uniform(0, total)
    # return index
    return bisect.bisect(cumulative, (r,))
    # return item string
    #return choices[bisect.bisect(cumulative, (r,))][0]

# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]

tally = [0 for item in choices]

n = 100000
# tally up n weighted choices
for i in range(n):
    tally[weighted_choice(choices)] += 1

print([t/sum(tally)*100 for t in tally])

— 托尼·韦嘉兰宁

1

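It depends on how many times you want to sample the distribution.

Suppose you want to sample the distribution K times. Then the time complexity of calling np.random.choice() each time is O(K(n + log(n))), where n is the number of items in the distribution.

In my case, I needed to sample the same distribution many times, on the order of 10^3 times, with n on the order of 10^6. I used the code below, which precomputes the cumulative distribution and samples it in O(log(n)). The overall time complexity is O(n+K*log(n)).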

import numpy as np

n,k = 10**6,10**3

# Create dummy distribution
a = np.array([i+1 for i in range(n)])
p = np.array([1.0/n]*n)

cfd = p.cumsum()
for _ in range(k):
    x = np.random.uniform()
    idx = cfd.searchsorted(x, side='right')
    sampled_element = a[idx]

— Uppinder Chugh

0

A general solution:

import random
def weighted_choice(choices, weights):
    total = sum(weights)
    threshold = random.uniform(0, total)
    for k, weight in enumerate(weights):
        total -= weight
        if total < threshold:
            return choices[k]

— 标记

0

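Here is another version of weighted_choice that uses numpy. Pass in the weights vector and it returns an array of 0s with a single 1 indicating which bin was chosen. The code defaults to making a single draw, but you can pass in the number of draws to make, and the counts per bin will be returned.

If the weights vector does not sum to 1, it will be normalized so that it does.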

import numpy as np

def weighted_choice(weights, n=1):
    if np.sum(weights)!=1:
        weights = weights/np.sum(weights)

    draws = np.random.random_sample(size=n)

    weights = np.cumsum(weights)
    weights = np.insert(weights,0,0.0)

    counts = np.histogram(draws, bins=weights)
    return(counts[0])

— murphsp1

0

Here is another way of doing it, assuming the weights are stored at the same indices as the elements in the element array.

import numpy as np
weights = [0.1, 0.3, 0.5] #weights for the item at index 0,1,2
# sum of weights should be <=1, you can also divide each weight by sum of all weights to standardise it to <=1 constraint.
trials = 1 #number of trials
num_item = 1 #number of items that can be picked in each trial
selected_item_arr = np.random.multinomial(num_item, weights, trials)
# gives number of times an item was selected at a particular index
# this assumes selection with replacement
# one possible output
# selected_item_arr
# array([[0, 0, 1]])
# say if trials = 5, then the possible output could be
# selected_item_arr
# array([[1, 0, 0],
#   [0, 0, 1],
#   [0, 0, 1],
#   [0, 1, 0],
#   [0, 0, 1]])

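Now suppose we have to draw 3 items in 1 trial. You can imagine three kinds of balls, R, G and B, present in large quantities in the ratio given by the weights array; the following could be a possible outcome: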

num_item = 3
trials = 1
selected_item_arr = np.random.multinomial(num_item, weights, trials)
# selected_item_arr can give output like :
# array([[1, 0, 2]])

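You can also think of the number of items to be selected as the number of binomial/multinomial trials within a set. So the above example can still work as follows: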

num_binomial_trial = 5
weights = [0.1,0.9] #say an unfair coin weights for H/T
num_experiment_set = 1
selected_item_arr = np.random.multinomial(num_binomial_trial, weights, num_experiment_set)
# possible output
# selected_item_arr
# array([[1, 4]])
# i.e. H came 1 time and T came 4 times in 5 binomial trials. And one set contains 5 binomial trials.

— Nsquare

0

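Sebastian Thrun gives a lecture about this in the free Udacity course AI for Robotics. Basically, he makes a circular array of the indexed weights using the mod operator %, sets a variable beta to 0, randomly chooses an index, and loops N times, where N is the number of indices. Inside the for loop he first increments beta by the formula:

beta = beta + uniform sample from {0 ... 2 * Weight_max}

Then, nested inside the for loop, comes the while loop below: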

while w[index] < beta:
    beta = beta - w[index]
    index = index + 1

select p[index]

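Then it moves on to the next index, resampling based on the probabilities (or the normalized probabilities in the case presented in the course).

Lecture link: https://classroom.udacity.com/courses/cs373/lessons/48704330/concepts/487480820923

I am logged into Udacity with my school account, so if the link does not work, it's Lesson 8, video number 21 of Artificial Intelligence for Robotics, where he is lecturing on particle filters.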
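Put together, the steps described above might look like the following sketch. It is a reconstruction from the description in this answer rather than the course's own code; the function name resampling_wheel and the sample data are assumptions.

```python
import random

def resampling_wheel(p, w):
    """Draw len(p) items from p, with replacement, in proportion to weights w."""
    n = len(p)
    index = random.randrange(n)      # random starting position on the wheel
    beta = 0.0
    w_max = max(w)
    resampled = []
    for _ in range(n):
        beta += random.uniform(0, 2 * w_max)
        while w[index] < beta:       # step around the wheel until beta fits
            beta -= w[index]
            index = (index + 1) % n  # % makes the array circular
        resampled.append(p[index])
    return resampled

particles = ["a", "b", "c", "d"]     # illustrative data
weights = [0.1, 0.2, 0.6, 0.1]
print(resampling_wheel(particles, weights))
```

Note that this produces a whole resampled population in one pass rather than a single pick, which is what a particle filter needs.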

— mLstudent33

0

If you happen to have Python 3, and are afraid of installing numpy or writing your own loops, you could do:

import itertools, bisect, random

def weighted_choice(choices):
   weights = list(zip(*choices))[1]
   return choices[bisect.bisect(list(itertools.accumulate(weights)),
                                random.uniform(0, sum(weights)))][0]

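Because you can build anything out of a bag of plumbing adaptors! Although... I must admit that Ned's answer, while slightly longer, is easier to understand.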

— personal_cloud

-1

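One way is to draw a random number in the range of the total of all the weights and then use the cumulative values as the limit points for each item. Here is a crude implementation as a generator.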

def rand_weighted(weights):
    """
    Generator which uses the weights to generate a
    weighted random values
    """
    sum_weights = sum(weights.values())
    cum_weights = {}
    current_weight = 0
    for key, value in sorted(weights.iteritems()):
        current_weight += value
        cum_weights[key] = current_weight
    while True:
        sel = int(random.uniform(0, 1) * sum_weights)
        for key, value in sorted(cum_weights.iteritems()):
            if sel < value:
                break
        yield key

— 多年生

-1

Using numpy

def choice(items, weights):
    return items[np.argmin((np.cumsum(weights) / sum(weights)) < np.random.rand())]

— blue_note

NumPy already has np.random.choice, as mentioned in the accepted answer, which has been here since 2014. What's the point of rolling your own?

— Mark Amery

-1

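I needed to do something like this quickly and simply; after searching for ideas I finally built this template. The idea is to receive the weighted values as JSON from an API, which is simulated here by the dict.

Then translate it into a list in which each value is repeated in proportion to its weight, and just use random.choice to select a value from the list.

I tried running it with 10, 100 and 1000 iterations. The distribution seems pretty solid.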

def weighted_choice(weighted_dict):
    """Input example: dict(apples=60, oranges=30, pineapples=10)"""
    weight_list = []
    for key in weighted_dict.keys():
        weight_list += [key] * weighted_dict[key]
    return random.choice(weight_list)

— Stas Baskin

-1

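I didn't love the syntax of any of those. I really wanted to just specify what the items were and what the weighting of each was. I realize I could have used random.choices, but instead I quickly wrote the class below.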

import random, string
from numpy import cumsum

class randomChoiceWithProportions:
    '''
    Accepts a dictionary of choices as keys and weights as values. Example if you want a unfair dice:


    choiceWeightDic = {"1":0.16666666666666666, "2": 0.16666666666666666, "3": 0.16666666666666666
    , "4": 0.16666666666666666, "5": .06666666666666666, "6": 0.26666666666666666}
    dice = randomChoiceWithProportions(choiceWeightDic)

    samples = []
    for i in range(100000):
        samples.append(dice.sample())

    # Should be close to .26666
    samples.count("6")/len(samples)

    # Should be close to .16666
    samples.count("1")/len(samples)
    '''
    def __init__(self, choiceWeightDic):
        self.choiceWeightDic = choiceWeightDic
        weightSum = sum(self.choiceWeightDic.values())
        assert abs(weightSum - 1) < 1e-9, 'Weights sum to ' + str(weightSum) + ', not 1.'
        self.valWeightDict = self._compute_valWeights()

    def _compute_valWeights(self):
        valWeights = list(cumsum(list(self.choiceWeightDic.values())))
        valWeightDict = dict(zip(list(self.choiceWeightDic.keys()), valWeights))
        return valWeightDict

    def sample(self):
        num = random.uniform(0,1)
        for key, val in self.valWeightDict.items():
            if val >= num:
                return key

— ML_Dev

-1

Provide random.choice() with a pre-weighted list:

Solution and test:

import random

options = ['a', 'b', 'c', 'd']
weights = [1, 2, 5, 2]

weighted_options = [[opt]*wgt for opt, wgt in zip(options, weights)]
weighted_options = [opt for sublist in weighted_options for opt in sublist]
print(weighted_options)

# test

counts = {c: 0 for c in options}
for x in range(10000):
    counts[random.choice(weighted_options)] += 1

for opt, wgt in zip(options, weights):
    wgt_r = counts[opt] / 10000 * sum(weights)
    print(opt, counts[opt], wgt, wgt_r)

Output:

['a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'd', 'd']
a 1025 1 1.025
b 1948 2 1.948
c 5019 5 5.019
d 2008 2 2.008

— DocOc