Does pandas iterrows have performance issues?

Question 1

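I have noticed very poor performance when using iterrows from pandas.

Is this something that others experience? Is it specific to iterrows, and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?

A discussion on GitHub led me to believe it is caused by mixing dtypes in the dataframe, but the simple example below shows the problem is there even with a single dtype (float64). This takes 36 seconds on my machine: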

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i=0
for rowindex, row in dfa.iterrows():
    i+=1
end = time.time()
print(end - start)

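Why are vectorized operations like apply so much quicker? I imagine there must be some row-by-row iteration going on there too.

I cannot figure out how to avoid iterrows in my case (I'll save that for a future question). So I would appreciate hearing whether you have consistently been able to avoid this kind of iteration. I'm making calculations based on data in separate dataframes. Thank you!

---Edit: a simplified version of what I want to run has been added below---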

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=[0])

#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():   
    t2info = table2[table2.letter == row['letter']].reset_index()
    table3.ix[row_index,] = optimize(t2info,row['number1'])

#%% Define optimization
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2']*t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.ix[maxrow]

Question 2

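Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for performance of various operations: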

1) vectorization
2) using a custom cython routine
3) apply
    a) reductions that can be performed in cython
    b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)

Using a custom Cython routine is usually too complicated, so let's skip that for now.

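1) Vectorization is ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) that cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.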

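3) apply can usually be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly, though of course df.sum(1) is even better. However, something like df.apply(lambda x: x['b'] + 1, axis=1) will be executed in Python space, and is consequently much slower.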
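To make the difference concrete, here is a minimal sketch comparing a row-wise apply with its vectorized equivalent. The frame and its column names 'a' and 'b' are made up for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; the column names 'a' and 'b' are assumptions.
df = pd.DataFrame({"a": np.arange(5.0), "b": np.arange(5.0) * 2})

# Row-wise apply: the lambda is called once per row, in Python space.
via_apply = df.apply(lambda row: row["b"] + 1, axis=1)

# Vectorized equivalent: a single operation over the whole column.
vectorized = df["b"] + 1

# Both approaches produce the same values.
print(np.array_equal(via_apply.to_numpy(), vectorized.to_numpy()))
```

On a frame this small the timing difference is negligible; it should only become noticeable as the number of rows grows.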

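4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.

5) iterrows DOES box the data into a Series. Unless you really need this, use another method.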
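The boxing difference between items 4 and 5 can be seen directly. This is a small sketch reusing the shape of table1 from the question; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})

# iterrows yields (index, Series) pairs: every row is boxed into a Series.
for idx, row in df.iterrows():
    assert isinstance(row, pd.Series)

# itertuples yields lightweight namedtuples; columns become attributes.
total = 0
for tup in df.itertuples(index=False):
    total += tup.number1

print(total)
```

Constructing a Series per row is where most of the iterrows overhead comes from, which is why itertuples sits one step higher in the list.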

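6) Updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. It is much better to create new structures and concat.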
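As a rough sketch of the alternative to item 6, assuming the rows can be computed independently: collect them in a plain Python list and build the frame once, or combine already-built pieces with a single concat. The column names here are hypothetical.

```python
import pandas as pd

# Slow pattern, shown only for contrast: grow a frame one row at a time.
slow = pd.DataFrame(columns=["n", "square"])
for n in range(5):
    slow.loc[n] = [n, n * n]

# Preferred pattern: collect plain Python rows, then build the frame once.
rows = [(n, n * n) for n in range(5)]
fast = pd.DataFrame(rows, columns=["n", "square"])

# If the pieces are already DataFrames, combine them in one concat call.
pieces = [fast.iloc[:2], fast.iloc[2:]]
combined = pd.concat(pieces)
```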

Question 3

Vector operations in Numpy and pandas are much faster than scalar operations in vanilla Python for several reasons:

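Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, Numpy (and thus pandas) performs calculations in C (often via Cython). The type of the array is determined only once, at the start of the iteration; this saving alone is one of the biggest wins.
Better caching: Iterating over a C array is cache-friendly and thus very fast. A pandas DataFrame is a "column-oriented table", which means that each column is really just an array. So the native actions you can perform on a DataFrame (like summing all the elements in a column) will have few cache misses.
More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of Numpy enable SIMD, depending on your CPU and installation process. The benefits of parallelism won't be as dramatic as those of static typing and better caching, but they are still a solid win.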

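Moral of the story: use the vector operations in Numpy and pandas. They are faster than scalar operations in Python for the simple reason that these operations are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)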
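A quick way to see this on your own machine is to time a scalar loop against the vectorized call. This is only a sketch; the array size is arbitrary and the timings will vary with hardware and Numpy build.

```python
import time

import numpy as np

values = np.random.randn(100_000)

# Scalar loop: each element is unboxed and dispatched by the interpreter.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v
loop_seconds = time.perf_counter() - start

# Vectorized: the dtype is resolved once and the loop runs in compiled code.
start = time.perf_counter()
vector_total = values.sum()
vector_seconds = time.perf_counter() - start

# Same result up to floating-point rounding; compare the two timings.
print(np.isclose(loop_total, vector_total), loop_seconds, vector_seconds)
```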

Question 4

Here's the way to do your problem. This is all vectorized.

In [58]: df = table1.merge(table2,on='letter')

In [59]: df['calc'] = df['number1']*df['number2']

In [60]: df
Out[60]: 
  letter  number1  number2  calc
0      a       50      0.2    10
1      a       50      0.5    25
2      b      -10      0.1    -1
3      b      -10      0.4    -4

In [61]: df.groupby('letter')['calc'].max()
Out[61]: 
letter
a         25
b         -1
Name: calc, dtype: float64

In [62]: df.groupby('letter')['calc'].idxmax()
Out[62]: 
letter
a         1
b         2
Name: calc, dtype: int64

In [63]: df.loc[df.groupby('letter')['calc'].idxmax()]
Out[63]: 
  letter  number1  number2  calc
1      a       50      0.5    25
2      b      -10      0.1    -1
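For convenience, the session above can be collapsed into a standalone script. This is a sketch that reuses the tables from the question; the final column selection and reset_index are my additions, meant to match the two-column table3 the question was building.

```python
import pandas as pd

table1 = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})
table2 = pd.DataFrame({"letter": ["a", "a", "b", "b"],
                       "number2": [0.2, 0.5, 0.1, 0.4]})

# Pair every table2 row with its table1 row, then compute all products at once.
df = table1.merge(table2, on="letter")
df["calc"] = df["number1"] * df["number2"]

# idxmax gives the row label of the largest product within each letter group.
best = df.loc[df.groupby("letter")["calc"].idxmax(), ["letter", "number2"]]
table3 = best.reset_index(drop=True)
print(table3)
```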

Question 5

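Another option is to use to_records(), which is faster than both itertuples and iterrows.

But for your case, there is much room for other types of improvements.

Here's my final optimized version: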
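In case to_records() is unfamiliar, here is a minimal sketch of what it returns, using table1 from the question. Each record unpacks as (index, letter, number1), and the columns are also available as whole arrays.

```python
import pandas as pd

table1 = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})

# to_records() returns a NumPy record array; the index is the first field.
records = table1.to_records()
print(records.dtype.names)

# Each record unpacks like a tuple, without creating a Series per row.
letters = []
for index, letter, n1 in records:
    letters.append(letter)

# Fields can also be read as whole arrays via attribute access.
number1_total = records.number1.sum()
```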

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        # np.multiply is in general faster than "x * y"
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

Benchmark:

-- iterrows() --
100 loops, best of 3: 12.7 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

-- itertuple() --
100 loops, best of 3: 12.3 ms per loop

-- to_records() --
100 loops, best of 3: 7.29 ms per loop

-- Use group by --
100 loops, best of 3: 4.07 ms per loop
  letter  number2
1      a      0.5
2      b      0.1
4      c      5.0
5      d      4.0

-- Avoid multiplication --
1000 loops, best of 3: 1.39 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

Full code:

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b','c','d'],
      'number1':[50,-10,.5,3]}

t2 = {'letter':['a','a','b','b','c','d','c'],
      'number2':[0.2,0.5,0.1,0.4,5,4,1]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)


print('\n-- iterrows() --')

def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow_in_t2 = calculation.index(max(calculation))
    return t2info.loc[maxrow_in_t2]

#%% Iterate through filtering relevant data, optimizing, returning info
def iterthrough():
    for row_index, row in table1.iterrows():   
        t2info = table2[table2.letter == row['letter']].reset_index()
        table3.iloc[row_index,:] = optimize(t2info, row['number1'])

%timeit iterthrough()
print(table3)

print('\n-- itertuple() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.itertuples():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.itertuples():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()


print('\n-- to_records() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.to_records():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.to_records():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()

print('\n-- Use group by --')

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    for index, letter, n1 in table1.to_records():
        t2 = table2.iloc[grouped.groups[letter]]
        calculation = t2.number2 * n1
        maxrow = calculation.argsort().iloc[-1]
        ret.append(t2.iloc[maxrow])
    global table3
    table3 = pd.DataFrame(ret)

%timeit iterthrough()
print(table3)

print('\n-- Even Faster --')
def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

%timeit iterthrough()
print(table3)