16

Given a Pandas DataFrame like this:

import pandas as pd
import numpy as np
df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']], 
                   'yesterday': [['a', 'b'], ['a'], ['a']]})

                 today        yesterday
0      ['a', 'b', 'c']       ['a', 'b']
1           ['a', 'b']            ['a']
2                ['b']            ['a']                          
... etc

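With roughly 100,000 entries, I want to find the additions and removals between the lists in these two columns, row by row.

It is comparable to this question: "Pandas: How to Compare Columns of Lists Row-wise in a DataFrame with Pandas (not for loop)?", but I am interested in the differences, and the Pandas.apply method does not seem to be very fast for this many entries. This is the code I am currently using, Pandas.apply with numpy's setdiff1d method: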

additions = df.apply(lambda row: np.setdiff1d(row.today, row.yesterday), axis=1)
removals  = df.apply(lambda row: np.setdiff1d(row.yesterday, row.today), axis=1)

It works fine, but it takes around a minute for 120,000 entries. Is there a faster way to accomplish this?

— MegaCookie

What is the maximum number of items (in a single row) that one of these columns might hold?

— thushv89

2

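Have you tried the methods in the post you linked? In particular the ones that use set intersection: all you would have to do is use set difference instead, no?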

— gold_cy

1

@aws_apprentice that is essentially the solution the OP already has here.

— Quang Hoang

A Pandas DataFrame might not be the right data structure for this. Can you share more background on the program and the data?

— AMC

14

Not sure about performance, but in the absence of a better solution this might work:

temp = df[['today', 'yesterday']].applymap(set)
removals = temp.diff(periods=1, axis=1).dropna(axis=1)
additions = temp.diff(periods=-1, axis=1).dropna(axis=1)

Removals:

  yesterday
0        {}
1        {}
2       {a}

Additions:

  today
0   {c}
1   {b}
2   {b}
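
The two `diff` calls subtract one column from its neighbour, and `-` between two Python sets is set difference. One plausible reason this beats the `np.setdiff1d` version (an assumption, not something measured in this thread) is that on short lists the per-call cost of converting lists to arrays dominates, while building two small hash sets is cheap. A rough single-row timing sketch, with `numpy_seconds` and `set_seconds` as illustrative names:

```python
import timeit

import numpy as np

# One row of the example data.
today, yesterday = ['a', 'b', 'c'], ['a', 'b']

# Time 2000 calls of each approach on the same short lists.
numpy_seconds = timeit.timeit(lambda: np.setdiff1d(today, yesterday), number=2000)
set_seconds = timeit.timeit(lambda: set(today) - set(yesterday), number=2000)

print(f"np.setdiff1d: {numpy_seconds:.4f} s, set difference: {set_seconds:.4f} s")
```

The absolute numbers depend on the machine and library versions, so treat them only as an indication of per-row overhead.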

— r.ook

2

This is very fast.

— rpanai

2

This is indeed very fast. It takes about 2 seconds!

— MegaCookie

2

Wow, thanks. I am surprised by the performance of applymap as well, but glad it solved your problem!

— r.ook

2

Now that we know r.ook's solution is fast, can someone explain to me why it is faster?

— Grijesh Chauhan

7

df['today'].apply(set) - df['yesterday'].apply(set)
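
A minimal sketch of the same idea on the question's sample data, computing both directions. Subtracting two object Series applies `-` element by element, and for sets that is set difference. The names `today_sets` and `yesterday_sets` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']],
                   'yesterday': [['a', 'b'], ['a'], ['a']]})

# Convert each list to a set once, so both differences can reuse the result.
today_sets = df['today'].apply(set)
yesterday_sets = df['yesterday'].apply(set)

# Element-wise set difference, one row at a time.
additions = today_sets - yesterday_sets
removals = yesterday_sets - today_sets

print(additions.tolist())
print(removals.tolist())
```

Building the sets once and reusing them avoids converting each column twice.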

— Andreas K.

Thanks! I think this is the most readable solution, although r.ook's solution is slightly faster.

— MegaCookie

5

I would suggest computing additions and removals within the same apply.

Generate a bigger example

import pandas as pd
import numpy as np
df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']], 
                   'yesterday': [['a', 'b'], ['a'], ['a']]})
df = pd.concat([df for i in range(10_000)], ignore_index=True)

Your solution

%%time
additions = df.apply(lambda row: np.setdiff1d(row.today, row.yesterday), axis=1)
removals  = df.apply(lambda row: np.setdiff1d(row.yesterday, row.today), axis=1)

CPU times: user 10.9 s, sys: 29.8 ms, total: 11 s
Wall time: 11 s

Your solution with a single apply

%%time
df["out"] = df.apply(lambda row: [np.setdiff1d(row.today, row.yesterday),
                                  np.setdiff1d(row.yesterday, row.today)], axis=1)
df[['additions','removals']] = pd.DataFrame(df['out'].values.tolist(), columns=['additions','removals'])
df = df.drop("out", axis=1)

CPU times: user 4.97 s, sys: 16 ms, total: 4.99 s
Wall time: 4.99 s

Using `set`

Unless your lists are very large, you can avoid numpy

def fun(x):
    a = list(set(x["today"]).difference(set(x["yesterday"])))
    b = list((set(x["yesterday"])).difference(set(x["today"])))
    return [a,b]

%%time
df["out"] = df.apply(fun, axis=1)
df[['additions','removals']] = pd.DataFrame(df['out'].values.tolist(), columns=['additions','removals'])
df = df.drop("out", axis=1)

CPU times: user 1.56 s, sys: 0 ns, total: 1.56 s
Wall time: 1.56 s
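
As a variation that is not benchmarked in this thread, the same set logic can be run in a plain list comprehension over the two columns, which sidesteps the per-row overhead of `df.apply(..., axis=1)`. This is only a sketch; `row_diffs`, `pairs`, and `result` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']],
                   'yesterday': [['a', 'b'], ['a'], ['a']]})

def row_diffs(today, yesterday):
    """Return (additions, removals) for one row as sorted lists."""
    t, y = set(today), set(yesterday)
    return sorted(t - y), sorted(y - t)

# Iterate over the two columns directly instead of building a Series per row.
pairs = [row_diffs(t, y) for t, y in zip(df['today'], df['yesterday'])]

result = df.assign(
    additions=pd.Series([a for a, _ in pairs], index=df.index),
    removals=pd.Series([r for _, r in pairs], index=df.index),
)
print(result)
```

Sorting the differences makes the list order deterministic, since sets have no guaranteed order.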

@r.ook's solution

If you are happy to have sets instead of lists as output, you can use @r.ook's code

%%time
temp = df[['today', 'yesterday']].applymap(set)
removals = temp.diff(periods=1, axis=1).dropna(axis=1)
additions = temp.diff(periods=-1, axis=1).dropna(axis=1)

CPU times: user 93.1 ms, sys: 12 ms, total: 105 ms
Wall time: 104 ms

@Andreas K.'s solution

%%time
df['additions'] = (df['today'].apply(set) - df['yesterday'].apply(set))
df['removals'] = (df['yesterday'].apply(set) - df['today'].apply(set))

CPU times: user 161 ms, sys: 28.1 ms, total: 189 ms
Wall time: 187 ms

You can then add .apply(list) at the end to get the same output as a list
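
A small sketch of that last step, assuming a deterministic order is wanted: `.apply(sorted)` is used here instead of `.apply(list)` because `np.setdiff1d` returns sorted unique values, so sorted lists can be compared directly against the original approach. The names `reference` and `matches` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']],
                   'yesterday': [['a', 'b'], ['a'], ['a']]})

# Set difference per row, then back to (sorted) lists.
additions = (df['today'].apply(set) - df['yesterday'].apply(set)).apply(sorted)

# The question's np.setdiff1d result, row by row, for comparison.
reference = [np.setdiff1d(t, y).tolist()
             for t, y in zip(df['today'], df['yesterday'])]

matches = additions.tolist() == reference
print(additions.tolist(), matches)
```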

— rpanai

1

Nice comparison you did!

— MegaCookie

1

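Here is an idea that offloads the computation to vectorized NumPy tools. We gather all the data for each column into a single array, do all the required matching in NumPy, and finally slice the result back into per-row entries. For the heavy lifting in NumPy, we hash each element into an integer built from its group ID (the row) and its ID within the group (the value), using np.searchsorted to look the values up. We also work with integer codes rather than the original strings, since numbers are faster to process with NumPy. The implementation looks something like this: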

t = df['today']
y = df['yesterday']
tc = np.concatenate(t)
yc = np.concatenate(y)

tci,tcu = pd.factorize(tc)

tl = np.array(list(map(len,t)))
ty = np.array(list(map(len,y)))

grp_t = np.repeat(np.arange(len(tl)),tl)
grp_y = np.repeat(np.arange(len(ty)),ty)

sidx = tcu.argsort()
idx = sidx[np.searchsorted(tcu,yc,sorter=sidx)]

s = max(tci.max(), idx.max())+1
tID = grp_t*s+tci
yID = grp_y*s+idx

t_mask = np.isin(tID, yID, invert=True)
y_mask = np.isin(yID, tID, invert=True)

t_se = np.r_[0,np.bincount(grp_t,t_mask).astype(int).cumsum()]
y_se = np.r_[0,np.bincount(grp_y,y_mask).astype(int).cumsum()]

Y = yc[y_mask].tolist()
T = tc[t_mask].tolist()

A = pd.Series([T[i:j] for (i,j) in zip(t_se[:-1],t_se[1:])])
R = pd.Series([Y[i:j] for (i,j) in zip(y_se[:-1],y_se[1:])])

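Further optimization may be possible in the steps that compute t_mask and y_mask, where np.searchsorted could be used again.

We could also use simple array assignment as an alternative to the isin step for getting t_mask and y_mask, like so: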

M = max(tID.max(), yID.max())+1
mask = np.empty(M, dtype=bool)

mask[tID] = True
mask[yID] = False
t_mask = mask[tID]

mask[yID] = True
mask[tID] = False
y_mask = mask[yID]
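
To see why the assignment trick is equivalent to `np.isin(..., invert=True)`, here is a sketch on small made-up ID arrays (the values of `t_id` and `y_id` are hypothetical). Writing `True` at one set of positions and then `False` at the other leaves `True` only where an ID is absent from the other array.

```python
import numpy as np

# Hypothetical (row, value) IDs for the two columns.
t_id = np.array([0, 1, 2, 3, 4, 7])
y_id = np.array([0, 1, 3, 6])

size = max(t_id.max(), y_id.max()) + 1
mask = np.zeros(size, dtype=bool)

# Mark today's IDs, then clear the ones yesterday also has.
mask[t_id] = True
mask[y_id] = False
t_mask = mask[t_id]

# Same again in the other direction; every position read is rewritten first.
mask[y_id] = True
mask[t_id] = False
y_mask = mask[y_id]

# Cross-check against the isin formulation.
same_t = np.array_equal(t_mask, np.isin(t_id, y_id, invert=True))
same_y = np.array_equal(y_mask, np.isin(y_id, t_id, invert=True))
print(t_mask, y_mask, same_t, same_y)
```

The lookup array needs one boolean per possible ID, so memory grows with the number of rows times the number of distinct values.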

— Divakar

Efficiently comparing lists in two columns
