Find element's index in pandas Series

154

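I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in python pandas? (The first occurrence would suffice.)

I.e., I'd like something like: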

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # should output 3

Certainly, it is possible to define such a method with a loop:

def find(s, el):
    for i in s.index:
        if s[i] == el: 
            return i
    return None

print(find(myseries, 7))

But I assume there should be a better way. Is there?

python pandas

— sashkello

199

>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

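I admit there should be a better way to do this, but it at least avoids iterating and looping over the object in Python and moves the work to the C level.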

— Viktor Kerkez

12

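The problem here is that it assumes the element being searched for is actually in the Series. It's a bummer that pandas doesn't seem to have a built-in find operation.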
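One way to guard against a missing value is sketched below. The helper name `first_index_of` is hypothetical (it is not a pandas function); it reuses the boolean-mask idea from the answer and returns `None` when nothing matches.

```python
import pandas as pd

def first_index_of(series, value):
    """Hypothetical helper: index label of the first match, or None if absent."""
    matches = series.index[series == value]  # labels of all matching elements
    return matches[0] if len(matches) > 0 else None

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(first_index_of(myseries, 7))   # 3
print(first_index_of(myseries, 10))  # None
```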

— jxramos

7

This solution works only if your series has a sequential integer index. If your series index is by datetime, this doesn't work.

— Andrew Medlin

42

Converting to an Index, you can use get_loc

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: pd.Index(myseries).get_loc(7)
Out[3]: 3

In [4]: pd.Index(myseries).get_loc(10)
KeyError: 10

Duplicate handling

In [5]: pd.Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

It returns a boolean array if the matches are non-contiguous

In [6]: pd.Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)
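Because `get_loc` can return an integer, a slice, or a boolean mask depending on the data, a caller may want to normalize the result. The sketch below is one way to do that; `positions_of` is a hypothetical helper name, and it still raises `KeyError` when the value is absent.

```python
import numpy as np
import pandas as pd

def positions_of(values, target):
    """Hypothetical helper: list of integer positions where target occurs."""
    loc = pd.Index(values).get_loc(target)  # raises KeyError if target is missing
    if isinstance(loc, slice):
        # contiguous run of duplicates
        return list(range(*loc.indices(len(values))))
    if isinstance(loc, np.ndarray):
        # boolean mask for non-contiguous duplicates
        return np.flatnonzero(loc).tolist()
    return [int(loc)]  # single match

print(positions_of([1, 4, 0, 7, 5], 7))        # [3]
print(positions_of([1, 1, 2, 2, 3, 4], 2))     # [2, 3]
print(positions_of([1, 1, 2, 1, 3, 2, 4], 2))  # [2, 5]
```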

It uses a hashtable internally, so it is fast

In [7]: s = Series(randint(0,10,10000))

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

As Viktor points out, there is a one-time creation overhead to creating an index (it is incurred when you actually do something with the index, e.g. is_unique)

In [2]: s = Series(randint(0,10,10000))

In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop

In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop

— Jeff

1

@Jeff if you have a more interesting index it's not quite so easy... but I guess you can just do s.index[_]
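A minimal sketch of what this comment describes, assuming a Series with unique values and a string index: `get_loc` on the values gives a position, and indexing `s.index` with that position gives the label.

```python
import pandas as pd

s = pd.Series([1, 4, 0, 7, 5], index=list("abcde"))

pos = pd.Index(s).get_loc(7)  # integer position, because the values are unique
label = s.index[pos]          # translate the position into the index label
print(pos, label)             # 3 d
```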

— Andy Hayden

11

In [92]: (myseries==7).argmax()
Out[92]: 3

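This works if you know in advance that 7 is there. You can check this with (myseries == 7).any()

Another approach (very similar to the first answer) that also accounts for multiple 7s (or none) is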

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']

— Alon

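The point about knowing in advance that 7 is an element is right. However, using an any check is not ideal, since it requires two passes over the data. There is a neat post-operation check that reveals all the False conditions, which you can see here.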

— jxramos

1

Careful: if no element matches this condition, argmax will still return 0 (instead of raising an error).
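The caveat can be handled with a single pass plus a check on the result, roughly as sketched here. `first_label_via_argmax` is a hypothetical helper name, and the mask is converted to a NumPy array so that `argmax` is positional regardless of the pandas version.

```python
import pandas as pd

def first_label_via_argmax(series, value):
    """Hypothetical helper: label of the first match, or None if absent."""
    mask = (series == value).to_numpy()
    if mask.size == 0:
        return None
    pos = mask.argmax()   # position of the first True, or 0 if all are False
    if not mask[pos]:     # post-check: position 0 may just mean "no match"
        return None
    return series.index[pos]

myseries = pd.Series([1, 4, 0, 7, 5], index=["a", "b", "c", "d", "e"])
print(first_label_via_argmax(myseries, 7))   # d
print(first_label_via_argmax(myseries, 99))  # None
print(first_label_via_argmax(myseries, 1))   # a
```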

— cs95

7

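I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case, where the index could contain any values and you want the index value corresponding to a search value that is towards the end of the series.

Here are the speed tests on a 2013 MacBook Pro in Python 3.7 with pandas version 0.25.3.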

In [1]: import pandas as pd                                                

In [2]: import numpy as np                                                 

In [3]: data = [406400, 203200, 101600,  76100,  50800,  25400,  19050,  12700, 
   ...:          9500,   6700,   4750,   3350,   2360,   1700,   1180,    850, 
   ...:           600,    425,    300,    212,    150,    106,     75,     53, 
   ...:            38]                                                                               

In [4]: myseries = pd.Series(data, index=range(1,26))                                                

In [5]: myseries[21]                                                                                 
Out[5]: 150

In [7]: %timeit myseries[myseries == 150].index[0]                                                   
416 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries[myseries == 150].first_valid_index()                                        
585 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.where(myseries == 150).first_valid_index()                                  
652 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]                                     
195 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit pd.Series(myseries.index, index=myseries)[150]                 
178 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]                                    
77.4 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit myseries.index[list(myseries).index(150)]
12.7 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit myseries.index[myseries.tolist().index(150)]                   
9.46 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

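@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.

Correction: Sorry, I missed one. @Alex Spangher's solution using the list index method is by far the fastest.

Update: Added @EliadL's answer.

Hope this helps.

It is amazing that such a simple operation requires such convoluted solutions, and that many of them are so slow: over half a millisecond in some cases to find a value in a series of 25.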

— Bill

1

Thanks. But shouldn't you be measuring after myindex is created, since it only needs to be created once?

— EliadL

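You could argue that, but it depends on how many lookups like this are required. It's only worth creating the myindex series if you are going to do the lookup many times. For this test I assumed it was needed only once and that total execution time was what mattered.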

— Bill

1

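I just ran into the need for this tonight, and using .get_loc() on the same Index object across multiple lookups seems like it should be the fastest. I think an improvement to the answer would be to provide timings for both cases: one including the Index creation, and another covering only the lookup after the Index has been created.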

— Rick

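Yes, good point. @EliadL also said that. It depends on how many applications there are in which the series is static. If any values in the series change, you need to rebuild pd.Index(myseries). To be fair to the other methods, I assumed the original series might have changed since the last lookup.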
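The trade-off discussed in these comments can be sketched as follows, assuming the series does not change between lookups: the `pd.Index` over the values is built once and reused. The shortened data and the name `value_index` are illustrative only.

```python
import pandas as pd

data = [406400, 203200, 101600, 76100, 50800]
myseries = pd.Series(data, index=range(1, 6))

# Build the lookup structure once; rebuild it if myseries changes.
value_index = pd.Index(myseries)

# Each later lookup reuses the same Index object.
labels = [myseries.index[value_index.get_loc(v)] for v in (76100, 406400, 50800)]
print(labels)
```

Timing the construction and the lookups separately (for example with `%timeit` in IPython) would show how many lookups are needed before building the index pays off.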

— Bill

5

Another way to do this, although equally unsatisfying, is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

returns: 3

Timing tests using a current dataset I'm working with (consider it random):

In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop

— Alex Spangher

4

If you use numpy, you can get an array of the positions at which your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one-element tuple containing an array of the positions where 7 is the value in myseries:

(array([3], dtype=int64),)
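Note that `np.where` reports integer positions, not index labels. A short sketch of mapping those positions back to labels, assuming a string index with a repeated value:

```python
import numpy as np
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

positions = np.where(myseries.to_numpy() == 7)[0]  # integer positions of matches
labels = myseries.index[positions]                 # corresponding index labels

print(positions.tolist())  # [1, 3]
print(labels.tolist())     # ['b', 'd']
```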

— Alex

3

You can use Series.idxmax()

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>

— Raki Gade

5

This appears to return only the index where the max element is found, not the index of a certain element as the question asked.
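The difference is easy to see on a series where the value being searched for is not the maximum. As a sketch, `idxmax` only answers the original question when it is applied to a boolean mask, and even then it needs an `any()` guard for the no-match case:

```python
import pandas as pd

s = pd.Series([1, 4, 0, 9, 7], index=[0, 1, 2, 3, 4])

print(s.idxmax())  # 3, the label of the maximum (9), not of 7

mask = s == 7
found = mask.idxmax() if mask.any() else None  # label of the first True, if any
print(found)       # 4
```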

— jxramos

1

Another way to do it that hasn't been mentioned yet is the tolist method:

myseries.tolist().index(7)

This should return the correct index, assuming the value exists in the Series.
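Two details are worth spelling out: `list.index` returns a position rather than an index label, and it raises `ValueError` when the value is absent. A minimal sketch that handles both, using the hypothetical helper name `label_of_first`:

```python
import pandas as pd

def label_of_first(series, value):
    """Hypothetical helper: label of the first match via list.index, or None."""
    try:
        pos = series.tolist().index(value)  # position of the first occurrence
    except ValueError:                      # value not present
        return None
    return series.index[pos]                # translate position into label

myseries = pd.Series([1, 4, 0, 7, 5], index=["a", "b", "c", "d", "e"])
print(label_of_first(myseries, 7))   # d
print(label_of_first(myseries, 10))  # None
```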

— Rutarik

1

@Alex Spangher suggested something similar on September 17, 2014. See his answer. I have now added both versions to the test results.

— Bill

0

Often your value occurs at multiple indices:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

— Ulf Aslak

0

This is the most native and scalable approach I could find:

>>> myindex = pd.Series(myseries.index, index=myseries)

>>> myindex[7]
3

>>> myindex[[7, 5, 7]]
7    3
5    4
7    3
dtype: int64
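When the original series contains repeated values, the reversed series has a non-unique index, so a lookup may return several labels. A small sketch of that behaviour, assuming a string index and using `.loc` to make the label-based lookup explicit:

```python
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

# Reverse mapping: values become the index, labels become the data.
myindex = pd.Series(myseries.index, index=myseries)

print(myindex.loc[0])           # c
print(myindex.loc[7].tolist())  # ['b', 'd']
```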

— EliadL