将列表中找到的ID添加到Pandas数据框中的新列

11

假设我有以下数据框（一列整数和一列整数列表）...

      ID                   Found_IDs
0  12345        [15443, 15533, 3433]
1  15533  [2234, 16608, 12002, 7654]
2   6789      [43322, 876544, 36789]

还有一个单独的ID列表...

bad_ids = [15533, 876544, 36789, 11111]

鉴于此，忽略df['ID']列和任何索引，我想看看bad_ids列中是否提到了列表中的任何ID df['Found_IDs']。到目前为止，我的代码是：

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

这是有效的，但仅当bad_ids列表比数据框长，并且对于实际数据集，bad_ids列表将比数据框短得多时。如果我将bad_ids列表设置为仅两个元素...

bad_ids = [15533, 876544]

我遇到了一个非常流行的错误（我读过很多有相同错误的问题）...

ValueError: Length of values does not match length of index

我尝试将列表转换为系列（错误无变化）。我还尝试添加新列并将所有值设置为，False然后再执行理解行（同样，错误中也不会发生变化）。

两个问题：

如何使我的代码（如下）对小于数据框的列表起作用？
我将如何获取将找到的实际ID写回到df['bad_id']列中的代码（比True / False有用）？

预期输出bad_ids = [15533, 876544]：

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

bad_ids = [15533, 876544]（ID的理想输出将写入一个或多个新列）：

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    876544

码：

import pandas as pd

result_list = [[12345,[15443,15533,3433]],
        [15533,[2234,16608,12002,7654]],
        [6789,[43322,876544,36789]]]

df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])

# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]

# fails if list has two elements (less elements than the dataframe)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]

# coverting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))

# setting up a new column of false values doesn't change things
# df['bad_id'] = False

print(df)

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

print(bad_ids)

print(df)

— 耐多药
source

7

使用np.intersect1d得到两个列表的交叉：

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

或仅使用香草python使用相交的sets：

bad_ids_set = set(bad_ids)
df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))

— 二凡
source

3

如果要按使用的所有值测试Found_IDs列中列表的所有值bad_ids：

bad_ids = [15533, 876544]

df['bad_id'] = [any(c in l for c in bad_ids) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

如果要全部匹配：

df['bad_id'] = [[c for c in bad_ids if c in l] for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

对于第一次匹配，如果设置了空列表False，则可能是解决方案，但不建议将布尔值和数字混合使用：

df['bad_id'] = [next(iter([c for c in bad_ids if c in l]), False) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]   15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]  876544

成套解决方案：

df['bad_id'] = df['Found_IDs'].map(set(bad_ids).intersection)
print (df)

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   {15533}
1  15533  [2234, 16608, 12002, 7654]        {}
2   6789      [43322, 876544, 36789]  {876544}

并且与列表理解类似：

df['bad_id'] = [list(set(bad_ids).intersection(l)) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

— 耶斯雷尔
source

1

您可以申请并使用np.any：

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.any([c in x for c in bad_ids]))

如果要检索此bad_id，则如果在Found_ID中存在bad_id，则返回布尔值：

df['bad_id'] = df['Found_IDs'].apply(lambda x: [*filter(lambda x: c in x, bad_ids)])

这将返回found_ids的bad_ids列表，如果为0，则返回[]

— 布鲁诺·梅洛
source

1

使用merge和concat同时按索引分组以返回所有匹配项。

bad_ids = [15533, 876544, 36789, 11111]

df2 = pd.concat(
    [
        df,
        pd.merge(
            df["Found_IDs"].explode().reset_index(),
            pd.Series(bad_ids, name="bad_ids"),
            left_on="Found_IDs",
            right_on="bad_ids",
            how="inner",
        )
        .groupby("index")
        .agg(bad_ids=("bad_ids", list)),
    ],
    axis=1,
).fillna(False)
print(df2)


      ID                   Found_IDs          bad_ids
0  12345        [15443, 15533, 3433]          [15533]
1  15533  [2234, 16608, 12002, 7654]            False
2   6789      [43322, 876544, 36789]  [876544, 36789]

— 数据创新
source

0

使用爆炸和分组汇总

s = df['Found_IDs'].explode()
df['bad_ids'] = s.isin(bad_ids).groupby(s.index).any()

对于 bad_ids = [15533, 876544]

>>> df
      ID                   Found_IDs  bad_ids
0  12345        [15443, 15533, 3433]     True
1  15533  [2234, 16608, 12002, 7654]    False
2   6789      [43322, 876544, 36789]     True

要么

为了获得匹配的价值

s = df['Found_IDs'].explode()
s.where(s.isin(bad_ids)).groupby(s.index).agg(lambda x: list(x.dropna()))

对于 bad_ids = [15533, 876544]

      ID                   Found_IDs   bad_ids
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

— 维什努杰夫
source