Suppose I have the following dataframe (one column of integers and one column of lists of integers)...
      ID                   Found_IDs
0  12345        [15443, 15533, 3433]
1  15533  [2234, 16608, 12002, 7654]
2   6789      [43322, 876544, 36789]
And a separate list of IDs...
bad_ids = [15533, 876544, 36789, 11111]
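Given this, and ignoring the df['ID'] column and any index, I want to see whether any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is: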
df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
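This works, but only when the bad_ids list is longer than the dataframe, and for the real dataset the bad_ids list will be much shorter than the dataframe. If I set the bad_ids list to just two elements...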
bad_ids = [15533, 876544]
I get a very common error (I have read many questions about the same error)...
ValueError: Length of values does not match length of index
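I tried converting the list to a Series (no change in the error). I also tried adding the new column and setting all of its values to False before running the comprehension line (again, no change in the error).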
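As far as I can tell, the length mismatch can be reproduced without pandas at all: zip() stops at the shorter of its inputs, so a two-element bad_ids produces only two results for a three-row column. A minimal sketch using the same values as above (found_ids is a hypothetical plain-list stand-in for df['Found_IDs']):

```python
# Plain-list stand-in for df['Found_IDs'] (hypothetical name, same values as above)
found_ids = [[15443, 15533, 3433],
             [2234, 16608, 12002, 7654],
             [43322, 876544, 36789]]
bad_ids = [15533, 876544]

# zip() stops at the shorter input, so only two (c, l) pairs are produced
flags = [c in l for c, l in zip(bad_ids, found_ids)]

print(len(flags), len(found_ids))  # 2 3
print(flags)                       # [True, False]
```

This also suggests that zip only compares the bad ID at position i with the row at position i, rather than checking every bad ID against every row.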
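Two questions:
- How do I make my code (below) work for a list that is shorter than the dataframe?
- How would I get the code to write the actual ID(s) found back to the df['bad_id'] column (more useful than True/False)?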
Expected output for bad_ids = [15533, 876544]:
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True
Ideal output for bad_ids = [15533, 876544] (the IDs would be written to one or more new columns):
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]   15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]  876544
Code:
import pandas as pd
result_list = [[12345,[15443,15533,3433]],
[15533,[2234,16608,12002,7654]],
[6789,[43322,876544,36789]]]
df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])
# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]
# fails if list has two elements (fewer elements than the dataframe has rows)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]
# converting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))
# setting up a new column of false values doesn't change things
# df['bad_id'] = False
print(df)
df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
print(bad_ids)
print(df)
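
One possible way to cover both questions, offered as a sketch rather than a definitive answer: check each row's list against the whole of bad_ids instead of zipping the two together, so the length of bad_ids no longer matters. The names bad_set and bad_ids_found are made up for illustration and are not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [12345, 15533, 6789],
    'Found_IDs': [[15443, 15533, 3433],
                  [2234, 16608, 12002, 7654],
                  [43322, 876544, 36789]],
})
bad_ids = [15533, 876544]
bad_set = set(bad_ids)  # hypothetical helper: set lookups stay fast for long ID lists

# Question 1: True if any element of the row's list is a bad ID
df['bad_id'] = [any(i in bad_set for i in ids) for ids in df['Found_IDs']]

# Question 2: the matching IDs themselves, as a (possibly empty) list per row
df['bad_ids_found'] = [[i for i in ids if i in bad_set] for ids in df['Found_IDs']]

print(df)
```

Here the new column holds a list per row because a row could contain more than one bad ID. If a single value or False is preferred, as in the ideal output above, something like `matches[0] if matches else False` on each row's list of matches would keep only the first one.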