How to explode a list inside a DataFrame cell into separate rows


93

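I'm looking to turn a pandas cell containing a list into one row per value.

So, take this:

(image: enter image description here)

If I'd like to unpack and stack the values in the nearest_neighbors column so that each value becomes a row within each opponent index, how would I best go about this? Are there pandas methods that are meant for operations like this?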


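Could you give an example of your desired output, and what you've attempted so far? It's easiest for others to help you if you also provide some sample data that can be cut and pasted.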
dagrha

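You can use pd.DataFrame(df.nearest_neighbors.values.tolist()) to unpack this column and then pd.merge to glue it back onto the other columns.
hellpanderr 2015

@hellpanderr I don't think values.tolist() does anything here; the column is already a list
maxymoo 2015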


1
Related, but contains more detail: stackoverflow.com/questions/53218931/...
BEN_YO

Answers:


54

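In the code below, I first reset the index to make the row iteration easier.

I create a list of lists where each element of the outer list is a row of the target DataFrame and each element of the inner list is one of its columns. This nested list is ultimately used to build the desired DataFrame.

I use a lambda function together with list iteration to create a row for each element of nearest_neighbors, paired with the relevant name and opponent.

Finally, I create a new DataFrame from this list (using the original column names and setting the index back to name and opponent).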

import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

>>> df
                                                    nearest_neighbors
name       opponent                                                  
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]

df.reset_index(inplace=True)
rows = []
_ = df.apply(lambda row: [rows.append([row['name'], row['opponent'], nn]) 
                         for nn in row.nearest_neighbors], axis=1)
df_new = pd.DataFrame(rows, columns=df.columns).set_index(['name', 'opponent'])

>>> df_new
                    nearest_neighbors
name       opponent                  
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia
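The same row-building idea can also be written without a lambda that appends to an outer list as a side effect. The following is only a sketch under that assumption, not the answer's own code; it uses itertuples and a nested list comprehension, and the names flat and rows are illustrative.

```python
import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

# Move the index levels back into ordinary columns: name, opponent, nearest_neighbors
flat = df.reset_index()

# One output tuple per (name, opponent, neighbor) combination
rows = [
    (name, opponent, nn)
    for name, opponent, neighbors in flat.itertuples(index=False, name=None)
    for nn in neighbors
]

df_new = pd.DataFrame(rows, columns=flat.columns).set_index(['name', 'opponent'])
print(df_new)
```

This keeps the original row order and works when the lists differ in length, because each list is iterated on its own.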

EDIT JUNE 2017

An alternative method is as follows:

>>> (pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(), 
             id_vars=['name', 'opponent'],
             value_name='nearest_neighbors')
     .set_index(['name', 'opponent'])
     .drop('variable', axis=1)
     .dropna()
     .sort_index()
     )

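apply(pd.Series) is fine on the smallest of frames, but for any reasonably sized frame you should consider a more performant solution. See "When should I use pandas apply() in my code?" (A better solution is to listify the column first.)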
cs95

2
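Exploding a list-like column was simplified significantly in pandas 0.25 with the addition of the explode() method. I added an answer with an example using the same df setup as here.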
joelostblom

@joelostblom Good to hear. Thanks for adding an example with the current usage.
Alexander

34

Use apply(pd.Series) and stack, then reset_index and to_frame:

In [1803]: (df.nearest_neighbors.apply(pd.Series)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame('nearest_neighbors'))
Out[1803]:
                    nearest_neighbors
name       opponent
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia

Details

In [1804]: df
Out[1804]:
                                                   nearest_neighbors
name       opponent
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]

1
Love the elegance of your solution! Have you benchmarked it against other approaches, by any chance?
rpyzh

1
The result of df.nearest_neighbors.apply(pd.Series) is very surprising to me.
Calum You

1
@rpyzh Yes, it's quite elegant, but it is slow.
cs95

33
df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

df.explode('nearest_neighbors')

Out:

                    nearest_neighbors
name       opponent                  
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia

2
Note that this only works for a single column (as of 0.25). See here and here for more generic solutions.
cs95
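As a hedged illustration of the multi-column case mentioned in this comment: newer pandas releases (1.3 and later, to my knowledge) let explode() take a list of columns, provided the lists in each row have matching lengths. The frame below is made-up sample data, not from the question.

```python
import pandas as pd

# Hypothetical sample frame with two list columns whose per-row lengths match
df = pd.DataFrame({
    'id': [1, 2],
    'letters': [['a', 'b'], ['c']],
    'numbers': [[1, 2], [3]],
})

# Assumes pandas >= 1.3; pandas 0.25 to 1.2 accept only a single column name here
out = df.explode(['letters', 'numbers'])
print(out)
```

If the lists in a row differ in length between the two columns, explode() raises a ValueError rather than padding them.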

16

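I think this is a really good question. In Hive you would use EXPLODE, and I think there is a case to be made that pandas should include this functionality by default. I would probably explode the list column with a nested generator comprehension like this: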

pd.DataFrame({
    "name": i[0],
    "opponent": i[1],
    "nearest_neighbor": neighbour
    }
    for i, row in df.iterrows() for neighbour in row.nearest_neighbors
    ).set_index(["name", "opponent"])

I like how this solution allows the number of list items to differ from row to row.
user1718097
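To make the point about differing list lengths concrete, here is a small sketch with invented ragged data (the names A, B, C and x, y, z are placeholders). It uses a list comprehension instead of a bare generator, purely to stay compatible with older pandas versions.

```python
import pandas as pd

# Made-up data: lists of length 3, 1 and 0
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'opponent': ['x', 'y', 'z'],
    'nearest_neighbors': [['p', 'q', 'r'], ['s'], []],
}).set_index(['name', 'opponent'])

exploded = pd.DataFrame([
    {'name': idx[0], 'opponent': idx[1], 'nearest_neighbor': nn}
    for idx, row in df.iterrows()
    for nn in row['nearest_neighbors']
]).set_index(['name', 'opponent'])
print(exploded)

# For comparison: DataFrame.explode keeps the empty-list row, filled with NaN
print(df.explode('nearest_neighbors'))
```

Note that a row with an empty list contributes no output rows in the comprehension version, whereas explode() keeps it as a NaN row.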

Is there a way to keep the original index with this method?
SummerEla
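One possible way to keep the original index with a comprehension-based approach, sketched under the assumption that the frame has a MultiIndex as in the question: collect the index label next to each value and rebuild the index from those labels. The variable names pairs and index are illustrative.

```python
import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 2,
                    'opponent': ['76ers', 'blazers'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin'], ['Nate Robinson']]})
      .set_index(['name', 'opponent']))

# Pair every list element with the index label of the row it came from
pairs = [(idx, nn) for idx, row in df.iterrows() for nn in row['nearest_neighbors']]

# Rebuild a MultiIndex with the original level names
index = pd.MultiIndex.from_tuples([idx for idx, _ in pairs], names=df.index.names)
out = pd.DataFrame({'nearest_neighbors': [nn for _, nn in pairs]}, index=index)
print(out)
```

For a frame with a single-level index, pd.Index would take the place of pd.MultiIndex.from_tuples.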

2
@SummerEla lol, this was a really old answer; I have updated it to show how I would do it now
maxymoo

1
@maxymoo It's still a great question, though. Thanks for updating!
SummerEla

I found this useful and turned it into a package
Oren

11

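The fastest method I have found so far is extending the DataFrame with .iloc and assigning back the flattened target column.

Given the usual input (replicated a bit):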

import numpy as np
import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))
df = pd.concat([df]*10)

df
Out[3]: 
                                                   nearest_neighbors
name       opponent                                                 
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
...

Given the following suggested alternatives:

col_target = 'nearest_neighbors'

def extend_iloc():
    # Flatten columns of lists
    col_flat = [item for sublist in df[col_target] for item in sublist] 
    # Row numbers to repeat 
    lens = df[col_target].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    cols = [i for i,c in enumerate(df.columns) if c != col_target]
    new_df = df.iloc[ilocations, cols].copy()
    new_df[col_target] = col_flat
    return new_df

def melt():
    return (pd.melt(df[col_target].apply(pd.Series).reset_index(), 
             id_vars=['name', 'opponent'],
             value_name=col_target)
            .set_index(['name', 'opponent'])
            .drop('variable', axis=1)
            .dropna()
            .sort_index())

def stack_unstack():
    return (df[col_target].apply(pd.Series)
            .stack()
            .reset_index(level=2, drop=True)
            .to_frame(col_target))

I find that extend_iloc() is the fastest:

%timeit extend_iloc()
3.11 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit melt()
22.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit stack_unstack()
11.5 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
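For readers unfamiliar with the trick, here is a small walkthrough of how np.repeat turns list lengths into row positions for .iloc. It is only a sketch on invented data (the columns team and players are made up), not the benchmarked function itself.

```python
import numpy as np
import pandas as pd

# Made-up frame with ragged lists
df = pd.DataFrame({
    'team': ['a', 'b', 'c'],
    'players': [['p1', 'p2'], ['p3'], ['p4', 'p5', 'p6']],
})

# Length of each list: 2, 1, 3
lens = df['players'].apply(len).to_numpy()

# Repeat each row position as many times as its list is long
positions = np.repeat(np.arange(len(df)), lens)
print(positions)  # [0 0 1 2 2 2]

# Flatten the lists in the same row order
flat = [item for sublist in df['players'] for item in sublist]

# Select the repeated rows (column position 0 is 'team'), then attach the flat values
out = df.iloc[positions, [0]].copy()
out['players'] = flat
print(out)
```

The flattened list and the repeated positions always have the same length, which is why the list can be assigned back positionally.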

Nice evaluation
javadba

2
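Thanks, this really helped me. I used the extend_iloc solution and found that cols = [c for c in df.columns if c != col_target] should be: cols = [i for i,c in enumerate(df.columns) if c != col_target]. df.iloc[ilocations, cols].copy() raises an error if it is not given integer column positions.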
jdungan

Thanks again for the iloc suggestion. I wrote a detailed explanation of how it works here: medium.com/@johnadungan/…. I hope it helps anyone facing a similar challenge.
jdungan

7

A nicer alternative solution using apply(pd.Series):

df = pd.DataFrame({'listcol':[[1,2,3],[4,5,6]]})

# expand df.listcol into its own dataframe
tags = df['listcol'].apply(pd.Series)

# prefix each new column name with 'listcol_'
tags = tags.rename(columns = lambda x : 'listcol_' + str(x))

# join the tags dataframe back to the original dataframe
df = pd.concat([df[:], tags[:]], axis=1)

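This one expands into columns rather than rows.
Oleg

@Oleg Right, but you can always transpose the DataFrame and then apply pd.Series - way simpler than most of the other suggestions
Philipp Schwarz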

7

Similar to Hive's EXPLODE function:

import copy

import pandas

def pandas_explode(df, column_to_explode):
    """
    Similar to Hive's EXPLODE function, take a column with iterable elements, and flatten the iterable to one element 
    per observation in the output table

    :param df: A dataframe to explode
    :type df: pandas.DataFrame
    :param column_to_explode: 
    :type column_to_explode: str
    :return: An exploded data frame
    :rtype: pandas.DataFrame
    """

    # Create a list of new observations
    new_observations = list()

    # Iterate through existing observations
    for row in df.to_dict(orient='records'):

        # Take out the exploding iterable
        explode_values = row[column_to_explode]
        del row[column_to_explode]

        # Create a new observation for every entry in the exploding iterable & add all of the other columns
        for explode_value in explode_values:

            # Deep copy existing observation
            new_observation = copy.deepcopy(row)

            # Add one (newly flattened) value from exploding iterable
            new_observation[column_to_explode] = explode_value

            # Add to the list of new observations
            new_observations.append(new_observation)

    # Create a DataFrame
    return_df = pandas.DataFrame(new_observations)

    # Return
    return return_df
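A usage sketch for the function above, with a condensed restatement so the block runs on its own. The sample frame (user, tags) is invented for illustration, and the condensed body is my paraphrase of the answer's logic rather than its exact code.

```python
import copy

import pandas as pd


def pandas_explode(df, column_to_explode):
    """Condensed restatement: one output row per element of the iterable column."""
    new_observations = []
    for row in df.to_dict(orient='records'):
        # Take the iterable out of the record, leaving the other columns
        explode_values = row.pop(column_to_explode)
        for explode_value in explode_values:
            new_observation = copy.deepcopy(row)
            new_observation[column_to_explode] = explode_value
            new_observations.append(new_observation)
    return pd.DataFrame(new_observations)


# Hypothetical sample data
df = pd.DataFrame({'user': ['u1', 'u2'], 'tags': [['a', 'b'], ['c']]})
result = pandas_explode(df, 'tags')
print(result)
```

Because to_dict(orient='records') does not carry the index, the result gets a fresh 0..n-1 index; call reset_index() first if the index values need to survive.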

1
When I run this, I get the following error: NameError: global name 'copy' is not defined
frmsaul

4

So all of these answers are good, but I wanted something really simple, so here's my contribution:

def explode(series):
    return pd.Series([x for _list in series for x in _list])                               

That's it. Just use this when you want a new series in which the lists are "exploded". Here's an example where we do value_counts() on taco choices :)

In [1]: my_df = pd.DataFrame(pd.Series([['a','b','c'],['b','c'],['c']]), columns=['tacos'])      
In [2]: my_df.head()                                                                               
Out[2]: 
   tacos
0  [a, b, c]
1     [b, c]
2        [c]

In [3]: explode(my_df['tacos']).value_counts()                                                     
Out[3]: 
c    3
b    2
a    1
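On pandas 0.25 or later, the built-in Series.explode() should give the same counts without a helper function. A short sketch, assuming that version is available:

```python
import pandas as pd

my_df = pd.DataFrame({'tacos': [['a', 'b', 'c'], ['b', 'c'], ['c']]})

# Series.explode() (pandas >= 0.25) yields one element per row, then count them
counts = my_df['tacos'].explode().value_counts()
print(counts)
```

Unlike the helper above, explode() keeps the original index labels, which does not matter for value_counts() but can for later joins.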

2

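Here is a potential optimization for larger dataframes. It runs faster when the "exploding" field contains many equal values. (The larger the dataframe is compared to the number of unique values in the field, the better this code will perform.)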

def lateral_explode(dataframe, fieldname): 
    temp_fieldname = fieldname + '_made_tuple_' 
    dataframe[temp_fieldname] = dataframe[fieldname].apply(tuple)       
    list_of_dataframes = []
    for values in dataframe[temp_fieldname].unique().tolist(): 
        list_of_dataframes.append(pd.DataFrame({
            temp_fieldname: [values] * len(values), 
            fieldname: list(values), 
        }))
    # Parentheses replace the backslash continuation, which breaks if followed by a space
    dataframe = (dataframe[list(set(dataframe.columns) - set([fieldname]))]
                 .merge(pd.concat(list_of_dataframes), how='left', on=temp_fieldname))
    del dataframe[temp_fieldname]

    return dataframe

1

Extending Oleg's .iloc answer to automatically flatten all list columns:

def extend_iloc(df):
    cols_to_flatten = [colname for colname in df.columns if 
    isinstance(df.iloc[0][colname], list)]
    # Row numbers to repeat 
    lens = df[cols_to_flatten[0]].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    with_idxs = [(i, c) for (i, c) in enumerate(df.columns) if c not in cols_to_flatten]
    # zip() returns an iterator in Python 3 and cannot be subscripted
    col_idxs = [i for (i, c) in with_idxs]
    new_df = df.iloc[ilocations, col_idxs].copy()

    # Flatten columns of lists
    for col_target in cols_to_flatten:
        col_flat = [item for sublist in df[col_target] for item in sublist]
        new_df[col_target] = col_flat

    return new_df

This assumes that, within each row, every list column holds a list of the same length.
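If that assumption is in doubt, a quick check can be run before flattening. The helper below is a hypothetical sketch (list_lengths_match is not part of the answer): it compares the list lengths across the list columns row by row.

```python
import pandas as pd


def list_lengths_match(df, list_cols):
    """Return True if, in every row, all list columns hold lists of equal length."""
    # One column of lengths per list column
    lens = pd.DataFrame({c: df[c].map(len) for c in list_cols})
    # A row is consistent when it has exactly one distinct length
    return bool((lens.nunique(axis=1) == 1).all())


# Made-up examples
good = pd.DataFrame({'a': [[1, 2], [3]], 'b': [['x', 'y'], ['z']]})
bad = pd.DataFrame({'a': [[1, 2], [3]], 'b': [['x'], ['z']]})

print(list_lengths_match(good, ['a', 'b']))  # True
print(list_lengths_match(bad, ['a', 'b']))   # False
```

When the check fails, assigning the flattened column back would raise a length mismatch error or, worse, silently misalign values if the totals happen to agree.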


1

Instead of using apply(pd.Series), you can flatten the column. This improves performance.

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                'opponent': ['76ers', 'blazers', 'bobcats'], 
                'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
  .set_index(['name', 'opponent']))



%timeit (pd.DataFrame(df['nearest_neighbors'].values.tolist(), index = df.index)
           .stack()
           .reset_index(level = 2, drop=True).to_frame('nearest_neighbors'))

1.87 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%timeit (df.nearest_neighbors.apply(pd.Series)
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))

2.73 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

IndexError: Too many levels: Index has only 2 levels, not 3. I get this when I try it with my example.
vinsent paramanantham

1
You have to change the level argument in reset_index to match your example
suleep kumar
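One way to avoid hard-coding the level number, sketched here as an assumption rather than as the answer's method: stack() always appends the new level last, so dropping level -1 should work whatever the number of original index levels. The sample frame has a single-level index, the case that triggers the error above.

```python
import pandas as pd

# Made-up frame with a single-level index
df = pd.DataFrame({'id': [10, 20], 'vals': [['a', 'b'], ['c', 'd']]}).set_index('id')

# Expand the lists into columns, keeping the original index
wide = pd.DataFrame(df['vals'].tolist(), index=df.index)

# stack() adds the column labels as the innermost level; drop that level by position -1
long_df = wide.stack().droplevel(-1).to_frame('vals')
print(long_df)
```

With lists of unequal length the wide frame contains missing values; whether stack() drops them depends on the pandas version, so an explicit dropna() after stack() is the safer choice in that case.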
Licensed under cc by-sa 3.0 with attribution required.