Using data in pandas dataframes to match columns together

18

I have two pandas dataframes, a and b:

a1   a2   a3   a4   a5   a6   a7
1    3    4    5    3    4    5
0    2    0    3    0    2    1
2    5    6    5    2    1    2

and

b1   b2   b3   b4   b5   b6   b7
3    5    4    5    1    4    3
0    1    2    3    0    0    2
2    2    1    5    2    6    5

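These two dataframes contain exactly the same data, but in a different order and with different column names. Based on the numbers in the two dataframes, I would like to be able to match each column name in a to each column name in b.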

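It is not as easy as simply comparing the first row of a with the first row of b, because there are duplicated values. For example, both a4 and a7 have the value 5, so it is not possible to immediately match them to either b2 or b4.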
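For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two frames from the tables above and shows the first-row ambiguity. The variable name `first_row_candidates` is only illustrative.

```python
import pandas as pd

# The two frames from the question, rebuilt row by row.
a = pd.DataFrame(
    [[1, 3, 4, 5, 3, 4, 5],
     [0, 2, 0, 3, 0, 2, 1],
     [2, 5, 6, 5, 2, 1, 2]],
    columns=["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
)
b = pd.DataFrame(
    [[3, 5, 4, 5, 1, 4, 3],
     [0, 1, 2, 3, 0, 0, 2],
     [2, 2, 1, 5, 2, 6, 5]],
    columns=["b1", "b2", "b3", "b4", "b5", "b6", "b7"],
)

# Columns of b whose first-row value equals the first-row value of a4.
mask = (b.iloc[0] == a.loc[0, "a4"]).to_numpy()
first_row_candidates = b.columns[mask].tolist()
print(first_row_candidates)  # ['b2', 'b4'] -> the first row alone is ambiguous
```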

What is the best way to do this?

python python-3.x pandas

— 1995年

16

Here's a way using sort_values:

m=df1.T.sort_values(by=[*df1.index]).index
n=df2.T.sort_values(by=[*df2.index]).index
d=dict(zip(m,n))
print(d)

{'a1': 'b5', 'a5': 'b1', 'a2': 'b7', 'a3': 'b6', 'a6': 'b3', 'a7': 'b2', 'a4': 'b4'}

— ky

Thanks for sharing this nice command, Anky. Could you please explain [*df1.index] a bit more? I'd be grateful, cheers.

— RavinderSingh13

1

@RavinderSingh13 Sure, sort_values(by=..) takes a list as a parameter, so I am unpacking the index into a list here; list(df1.index) would do the same as [*df1.index].

— 压缩
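To illustrate the comment above, here is a small sketch on a three-column subset of the question's data (the subset and the name `ordered` are my own choices, not part of the answer). It checks that the two spellings give the same list and shows how the row labels act as sort keys after the transpose.

```python
import pandas as pd

# A small subset of the question's columns, deliberately out of order.
df1 = pd.DataFrame({"a4": [5, 3, 5], "a7": [5, 1, 2], "a1": [1, 0, 2]})

# Both spellings build a plain list of the row labels.
assert [*df1.index] == list(df1.index) == [0, 1, 2]

# After transposing, the old row labels 0, 1, 2 become column labels,
# so they can be passed to sort_values as keys. Ties on key 0
# (a4 and a7 both start with 5) are broken by key 1, then key 2.
ordered = df1.T.sort_values(by=[*df1.index]).index.tolist()
print(ordered)  # ['a1', 'a7', 'a4']
```

Sorting both transposed frames this way puts identical columns in the same position, which is why zipping the two sorted indexes pairs them up.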

16

Here's one way leveraging numpy broadcasting:

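# .all(1) is indexed [column of b, column of a]; argmax(0) picks, for each
# column of a, the matching column of b (argmax(1) only works when the
# mapping happens to be symmetric, as it is for this data)
b_cols = b.columns[(a.values == b.T.values[...,None]).all(1).argmax(0)]
dict(zip(a, b_cols))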

{'a1': 'b5',
 'a2': 'b7',
 'a3': 'b6',
 'a4': 'b4',
 'a5': 'b1',
 'a6': 'b3',
 'a7': 'b2'}

Another similar way (by @piR):

import numpy as np

a_ = a.to_numpy()
b_ = b.to_numpy()
i, j = np.where((a_[:, None, :] == b_[:, :, None]).all(axis=0))
dict(zip(a.columns[j], b.columns[i]))

{'a1': 'b5',
 'a2': 'b7',
 'a3': 'b6',
 'a4': 'b4',
 'a5': 'b1',
 'a6': 'b3',
 'a7': 'b2'}
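For readers who find the broadcasting hard to follow, here is a sketch of the same idea with the intermediate shapes spelled out and a sanity check added. The names `eq` and `same` and the uniqueness assertion are my additions, not part of the answer above.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(
    [[1, 3, 4, 5, 3, 4, 5], [0, 2, 0, 3, 0, 2, 1], [2, 5, 6, 5, 2, 1, 2]],
    columns=["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
)
b = pd.DataFrame(
    [[3, 5, 4, 5, 1, 4, 3], [0, 1, 2, 3, 0, 0, 2], [2, 2, 1, 5, 2, 6, 5]],
    columns=["b1", "b2", "b3", "b4", "b5", "b6", "b7"],
)

a_ = a.to_numpy()  # shape (3, 7): rows x columns of a
b_ = b.to_numpy()  # shape (3, 7): rows x columns of b

# eq[r, i, j] is True when row r of b's column i equals row r of a's column j.
eq = a_[:, None, :] == b_[:, :, None]  # shape (3, 7, 7)

# same[i, j] is True when b's column i equals a's column j in every row.
same = eq.all(axis=0)  # shape (7, 7)

# Assumption: every column of a has exactly one twin in b.
assert (same.sum(axis=0) == 1).all()

i, j = np.nonzero(same)
mapping = dict(zip(a.columns[j], b.columns[i]))
print(mapping)
```

Note that the comparison builds a rows x columns x columns boolean array, so memory grows quadratically with the number of columns.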

— yatu

1

I stuck my nose in your post. Hope you don't mind. Please change it to your liking.

— piRSquared

On the contrary :) Nice approach, and checking on large dataframes it slightly improves performance @piRSquared

— yatu

12

One way using merge

s=df1.T.reset_index().merge(df2.T.assign(match=lambda x : x.index))
dict(zip(s['index'],s['match']))

{'a1': 'b5', 'a2': 'b7', 'a3': 'b6', 'a4': 'b4', 'a5': 'b1', 'a6': 'b3', 'a7': 'b2'}
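The merge above joins on whatever column labels the two transposed frames share, which here are the original row labels 0, 1, 2. A more explicit sketch of the same idea follows; the names `a_col`, `b_col` and `pairs`, and the use of `validate`, are my own choices rather than part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, 3, 4, 5, 3, 4, 5], [0, 2, 0, 3, 0, 2, 1], [2, 5, 6, 5, 2, 1, 2]],
    columns=["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
)
df2 = pd.DataFrame(
    [[3, 5, 4, 5, 1, 4, 3], [0, 1, 2, 3, 0, 0, 2], [2, 2, 1, 5, 2, 6, 5]],
    columns=["b1", "b2", "b3", "b4", "b5", "b6", "b7"],
)

# One row per original column, with the column name kept as a regular field.
left = df1.T.rename_axis("a_col").reset_index()
right = df2.T.rename_axis("b_col").reset_index()

# Join on the value columns 0, 1, 2 (the original row labels).
# validate="one_to_one" raises if either frame contains duplicated columns,
# in which case the mapping would be ambiguous.
pairs = left.merge(right, on=list(df1.index), how="inner", validate="one_to_one")

mapping = dict(zip(pairs["a_col"], pairs["b_col"]))
print(mapping)
```

With an inner join, columns of df1 that have no twin in df2 simply drop out of the result.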

— YOBEN_S

I thought I'd add another clever solution, only to see that it was the same as yours (-: whoops

— piRSquared

8

Dictionary comprehension

Use a tuple of the column values as the hashable key in a dictionary

d = {(*t,): c for c, t in df2.items()}
{c: d[(*t,)] for c, t in df1.items()}

{'a1': 'b5',
 'a2': 'b7',
 'a3': 'b6',
 'a4': 'b4',
 'a5': 'b1',
 'a6': 'b3',
 'a7': 'b2'}

In case the two frames don't correspond perfectly, I only build the dictionary for columns that have a match.

d2 = {(*t,): c for c, t in df2.items()}
d1 = {(*t,): c for c, t in df1.items()}

{d1[c]: d2[c] for c in {*d1} & {*d2}}

{'a5': 'b1',
 'a2': 'b7',
 'a7': 'b2',
 'a6': 'b3',
 'a3': 'b6',
 'a1': 'b5',
 'a4': 'b4'}
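As a small illustration of the imperfect case, here is a sketch on made-up toy frames (the data and the names `mapping` and `unmatched` are hypothetical, not from the question) where one column on each side has no twin.

```python
import pandas as pd

# Toy frames: a3 and b3 have no counterpart in the other frame.
df1 = pd.DataFrame({"a1": [1, 0, 2], "a2": [3, 2, 5], "a3": [9, 9, 9]})
df2 = pd.DataFrame({"b1": [3, 2, 5], "b2": [1, 0, 2], "b3": [7, 7, 7]})

# Index df2's columns by their values, stored as hashable tuples.
d2 = {tuple(t): c for c, t in df2.items()}

# Keep only the columns of df1 whose values also appear in df2.
mapping = {c: d2[tuple(t)] for c, t in df1.items() if tuple(t) in d2}
unmatched = [c for c in df1 if c not in mapping]

print(mapping)    # {'a1': 'b2', 'a2': 'b1'}
print(unmatched)  # ['a3']
```

One caveat: if a frame contains two identical columns, only the last one survives as a dictionary key, so duplicates are silently collapsed.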

`idxmax`

This borders on the absurd... Don't actually do this.

{c: df2.T.eq(df1[c]).sum(1).idxmax() for c in df1}

{'a1': 'b5',
 'a2': 'b7',
 'a3': 'b6',
 'a4': 'b4',
 'a5': 'b1',
 'a6': 'b3',
 'a7': 'b2'}

— 海盗

1

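I can understand every expression in these statements, yet I can't fully see in my head what is really going on. It's kind of like chess: I know how to move all the pieces on the board, but I can't see more than 2 moves ahead.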

— Scott Boston

Okay... I have digested it now, and it is absolutely simple and brilliant. +1

— Scott Boston