144

我在numpy中有一个1D数组，我想在值超过numpy数组中的值的位置找到索引的位置。

例如

aa = range(-10,10)

查找超出aa值的位置5。

python numpy

— 用户名
source

2

应该清楚地知道是否有解决方案（例如，因为argmax答案在那种情况下将不起作用（最大值（0,0,0,0）= 0），正如埋伏评论所言

— seanv507 2015年

199

这有点快（看起来更好）

np.argmax(aa>5)

因为argmax将在第一个位置停止True（“如果出现多次最大值，则返回对应于第一次出现的索引。”），并且不会保存其他列表。

In [2]: N = 10000

In [3]: aa = np.arange(-N,N)

In [4]: timeit np.argmax(aa>N/2)
100000 loops, best of 3: 52.3 us per loop

In [5]: timeit np.where(aa>N/2)[0][0]
10000 loops, best of 3: 141 us per loop

In [6]: timeit np.nonzero(aa>N/2)[0][0]
10000 loops, best of 3: 142 us per loop

— Askewchan
source

103

请注意：如果输入数组中没有True值，则np.argmax会很高兴返回0（在这种情况下这不是您想要的）。

— 埋伏

8

结果是正确的，但我发现解释有些可疑。argmax似乎并没有一开始就停下来True。（可以通过True在不同位置创建单个布尔数组来测试这一点。）可以通过argmax不需要创建输出列表这一事实来解释速度。

— DrV 2014年

1

我认为你是对的，@ DrV。我的解释是要解释为什么尽管最初的意图并没有真正寻求最大值，但为什么它仍给出正确的结果，而不是为什么它更快，因为我无法声称理解的内部细节argmax。

— askewchan 2014年

1

@乔治，我恐怕不知道为什么。我只能说在我展示的特定示例中它更快，所以如果没有（i）不知道为什么这样做（请参阅@DrV的评论）或（ii）测试更多的案例（例如，是否aa排序，就像@Michael的回答）。

— askewchan '17

3

@DrV，我只是使用NumPy 1.11.2在具有不同位置argmax的单个元素上运行了1000万个元素的布尔数组True，并且问题的位置True。因此1.11.2 argmax似乎在布尔数组上“短路”。

— Ulrich Stern

96

给定数组的排序内容，还有一个更快的方法：searchsorted。

import time
N = 10000
aa = np.arange(-N,N)
%timeit np.searchsorted(aa, N/2)+1
%timeit np.argmax(aa>N/2)
%timeit np.where(aa>N/2)[0][0]
%timeit np.nonzero(aa>N/2)[0][0]

# Output
100000 loops, best of 3: 5.97 µs per loop
10000 loops, best of 3: 46.3 µs per loop
10000 loops, best of 3: 154 µs per loop
10000 loops, best of 3: 154 µs per loop

— 迈克尔·凯泽斯
source

19

假设数组已排序（实际上未在问题中指定），这实际上是最佳答案。您可以+1使用np.searchsorted(..., side='right')

— askewchan

3

我认为，side只有在排序数组中存在重复的值时，该参数才有所不同。它不会改变返回索引的含义，该索引始终是您可以在其中插入查询值的索引，将以下所有条目向右移动，并保持已排序的数组。

— 2015年

@Gus，在排序数组和插入数组中的side值相同时，不管其中哪个值重复，都会起作用。排序数组中的重复值只会夸大效果（两边之间的差是要插入的值出现在排序数组中的次数）。side 确实会更改返回索引的含义，尽管它不会更改将值插入这些索引处的已排序数组中所得到的数组。微妙但重要的区别；区别实际上，如果N/2不在中，此答案将给出错误的索引aa。

— askewchan '17

如以上评论中所暗示的，如果N/2不在，则此答案将减一aa。正确的格式为np.searchsorted(aa, N/2, side='right')（不带+1）。否则，两种形式都给出相同的索引。考虑N奇数的测试用例（N/2.0如果使用python 2，则强制浮动）。

— askewchan '17

21

我对此也很感兴趣，并且将所有建议的答案与perfplot进行了比较。（免责声明：我是perfplot的作者。）

如果您知道要查找的数组已经排序，则

numpy.searchsorted(a, alpha)

是给你的。这是一个固定时间操作，即，速度也不能依赖于数组的大小。您无法获得比这更快的速度。

如果您对阵列一无所知，那么您不会错

numpy.argmax(a > alpha)

已排序：

未分类：

复制剧情的代码：

import numpy
import perfplot


alpha = 0.5

def argmax(data):
    return numpy.argmax(data > alpha)

def where(data):
    return numpy.where(data > alpha)[0][0]

def nonzero(data):
    return numpy.nonzero(data > alpha)[0][0]

def searchsorted(data):
    return numpy.searchsorted(data, alpha)

out = perfplot.show(
    # setup=numpy.random.rand,
    setup=lambda n: numpy.sort(numpy.random.rand(n)),
    kernels=[
        argmax, where,
        nonzero,
        searchsorted
        ],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    xlabel='len(array)'
    )

— 尼科·施洛默（NicoSchlömer）
source

4

np.searchsorted不是固定时间的。实际上是O(log(n))。但是，你的测试用例实际基准的最好的情况searchsorted（这是O(1)）。

— MSeifert

@MSeifert您需要哪种输入数组/ alpha才能看到O（log（n））？

— NicoSchlömer，

1

使该项目的索引为sqrt（length）确实导致了非常糟糕的性能。我还在这里写了一个答案，包括该基准。

— MSeifert

我怀疑searchsorted（或任何算法）能否击败O(log(n))二进制搜索的排序均匀分布的数据。编辑：searchsorted 是二进制搜索。

— Mateen Ulhaq

16

In [34]: a=np.arange(-10,10)

In [35]: a
Out[35]:
array([-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,
         3,   4,   5,   6,   7,   8,   9])

In [36]: np.where(a>5)
Out[36]: (array([16, 17, 18, 19]),)

In [37]: np.where(a>5)[0][0]
Out[37]: 16

— 莫吉
source

8

元素之间步长恒定的数组

如果是range数组或任何其他线性增加的数组，则可以简单地以编程方式计算索引，而根本不需要实际遍历数组：

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('no value greater than {}'.format(val))
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    # For linearly decreasing arrays or constant arrays we only need to check
    # the first element, because if that does not satisfy the condition
    # no other element will.
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

一个人可能可以改善这一点。我已经确保它可以正确地用于一些示例数组和值，但这并不意味着在那里不会有错误，特别是考虑到它使用浮点数...

>>> import numpy as np
>>> first_index_calculate_range_like(5, np.arange(-10, 10))
16
>>> np.arange(-10, 10)[16]  # double check
6

>>> first_index_calculate_range_like(4.8, np.arange(-10, 10))
15

假设它无需任何迭代就可以计算位置，那么它将是恒定时间（O(1)），并且可能击败所有其他提到的方法。但是，它需要数组中的恒定步长，否则将产生错误的结果。

使用numba的一般解决方案

更通用的方法是使用numba函数：

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

这将适用于任何数组，但必须迭代该数组，因此在一般情况下将是O(n)：

>>> first_index_numba(4.8, np.arange(-10, 10))
15
>>> first_index_numba(5, np.arange(-10, 10))
16

基准测试

尽管NicoSchlömer已经提供了一些基准，但我认为包括我的新解决方案并测试不同的“值”可能很有用。

测试设置：

import numpy as np
import math
import numba as nb

def first_index_using_argmax(val, arr):
    return np.argmax(arr > val)

def first_index_using_where(val, arr):
    return np.where(arr > val)[0][0]

def first_index_using_nonzero(val, arr):
    return np.nonzero(arr > val)[0][0]

def first_index_using_searchsorted(val, arr):
    return np.searchsorted(arr, val) + 1

def first_index_using_min(val, arr):
    return np.min(np.where(arr > val))

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('empty array')
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

funcs = [
    first_index_using_argmax, 
    first_index_using_min, 
    first_index_using_nonzero,
    first_index_calculate_range_like, 
    first_index_numba, 
    first_index_using_searchsorted, 
    first_index_using_where
]

from simple_benchmark import benchmark, MultiArgument

并使用以下方式生成图：

%matplotlib notebook
b.plot()

项目在开始

b = benchmark(
    funcs,
    {2**i: MultiArgument([0, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

numba函数执行效果最佳，其次是calculate-function和searchsorted函数。其他解决方案的性能要差得多。

项目在末尾

b = benchmark(
    funcs,
    {2**i: MultiArgument([2**i-2, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

对于较小的数组，numba函数的执行速度非常快，但是对于较大的数组，它的性能要优于calculate-function和searchsorted函数。

项目在sqrt（len）

b = benchmark(
    funcs,
    {2**i: MultiArgument([np.sqrt(2**i), np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

这更有趣。numba和calculate函数再次表现出色，但这实际上触发了最糟糕的搜索排序情况，在这种情况下确实不能很好地工作。

没有值满足条件时的功能比较

另一个有趣的一点是，如果没有值应返回其索引，则这些函数的行为：

arr = np.ones(100)
value = 2

for func in funcs:
    print(func.__name__)
    try:
        print('-->', func(value, arr))
    except Exception as e:
        print('-->', e)

结果如下：

first_index_using_argmax
--> 0
first_index_using_min
--> zero-size array to reduction operation minimum which has no identity
first_index_using_nonzero
--> index 0 is out of bounds for axis 0 with size 0
first_index_calculate_range_like
--> no value greater than 2
first_index_numba
--> -1
first_index_using_searchsorted
--> 101
first_index_using_where
--> index 0 is out of bounds for axis 0 with size 0

Searchsorted，argmax和numba只会返回错误的值。但是searchsorted，numba返回的索引不是该数组的有效索引。

功能where，min，nonzero并calculate抛出一个异常。但是，只有例外calculate实际上可以说明任何帮助。

这意味着实际上必须将这些调用包装在一个适当的包装函数中，该函数可以捕获异常或无效的返回值并适当地进行处理，至少在不确定该值是否可以在数组中的情况下。

注意：计算和searchsorted选项仅在特殊条件下起作用。“计算”功能需要一个恒定的步骤，而searchsorted需要对数组进行排序。因此，这些方法在适当的情况下可能很有用，但不是解决此问题的一般方法。如果要处理排序的 Python列表，则可能需要查看bisect模块，而不是使用Numpys searchsorted。

— 塞弗特
source

3

我想提出

np.min(np.append(np.where(aa>5)[0],np.inf))

这将返回满足条件的最小索引，如果从不满足条件，则返回无穷大（并where返回一个空数组）。

— 姆费尔特
source

1

我会去

i = np.min(np.where(V >= x))

其中Vvector是向量（一维数组），x是值，i是结果索引。

— 西维奇
source

首次出现大于当前值的值

元素之间步长恒定的数组

使用numba的一般解决方案

基准测试

项目在开始

项目在末尾

项目在sqrt（len）

没有值满足条件时的功能比较