Split a large pandas DataFrame

86

I have a large DataFrame with 423244 rows. I want to split it into 4. I tried the following code, which gave an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this DataFrame into 4 groups?

python pandas

— Nilani Algiriyage

What we want is an np.split(df, N) function.

— 索伦

182

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8), 'D' : np.random.randn(8)})

In [3]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
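
The difference between the two functions can be checked on a plain NumPy array. This is a minimal sketch with made-up data: for a length l split into n sections, np.array_split returns l % n sections of size l // n + 1 and the rest of size l // n, while np.split raises the ValueError from the question.

```python
import numpy as np

arr = np.arange(11)  # 11 elements cannot be divided evenly into 3

# np.split insists on an equal division and raises ValueError otherwise
try:
    np.split(arr, 3)
    split_failed = False
except ValueError:
    split_failed = True

# np.array_split allows uneven sections: the first 11 % 3 = 2 sections get
# 11 // 3 + 1 = 4 elements, the remaining one gets 11 // 3 = 3
sizes = [len(part) for part in np.array_split(arr, 3)]
print(split_failed, sizes)  # True [4, 4, 3]
```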

— root

Thanks a lot! Beyond that, I want to apply a function to each group. How do I access the groups one by one?

— Nilani Algiriyage, 2013

7

@NilaniAlgiriyage - array_split returns a list of DataFrames, so you can just loop through the list...

— root
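
A minimal sketch of processing the pieces one at a time. The data and the summarize function are hypothetical placeholders. It splits the row positions rather than the DataFrame itself and selects rows with iloc, which is an assumption made here to avoid depending on how a particular NumPy/pandas combination handles a DataFrame passed to array_split.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(10)})

def summarize(chunk):
    # Hypothetical per-chunk function; replace with your own processing
    return int(chunk["value"].sum())

# Split the row positions (a plain NumPy array) into 4 groups
position_groups = np.array_split(np.arange(len(df)), 4)

results = []
for positions in position_groups:
    chunk = df.iloc[positions]        # one sub-DataFrame at a time
    results.append(summarize(chunk))

print(results)  # [3, 12, 13, 17]
```

Only one chunk is referenced inside the loop at any time, so the function is applied to the first group, then the second, and so on.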

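I am splitting the DataFrame because it is too large. I want to take the first group and apply the function, then take the second group and apply the function, and so on. So how do I access each group?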

— Nilani Algiriyage, 2013

1

Since a DataFrame has no 'size', how do you not get an AttributeError?

— Boosted_d16

2

This answer is outdated: AttributeError: 'DataFrame' object has no attribute 'size'

— Tjorriemorrie, 2015

33

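I wanted to do the same thing. I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a small function that works well. Hope it helps!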

# input   - df: a DataFrame, chunk_size: the number of rows per chunk
# output  - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = []
    # Ceiling division, so no empty trailing chunk is produced when
    # len(df) is an exact multiple of chunk_size
    num_chunks = -(-len(df) // chunk_size)
    for i in range(num_chunks):
        chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
    return chunks

— elixir

5

Much faster than using np.array_split()

— jgaw

4

The correct way to calculate numberChunks: import math, then numberChunks = math.ceil(len(df) / chunkSize)

— Sergey Leyko
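
A quick arithmetic check of the two ways to count chunks, floor division plus one versus math.ceil. The helper names are made up for illustration. The counts agree except when the number of rows is an exact multiple of the chunk size, where floor division plus one yields one extra (empty) chunk.

```python
import math

def chunks_floor_plus_one(n_rows, chunk_size):
    # Floor division plus one: overshoots on exact multiples
    return n_rows // chunk_size + 1

def chunks_ceil(n_rows, chunk_size):
    # Ceiling of the true quotient
    return math.ceil(n_rows / chunk_size)

print(chunks_floor_plus_one(11, 3), chunks_ceil(11, 3))  # 4 4
print(chunks_floor_plus_one(12, 3), chunks_ceil(12, 3))  # 5 4
```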

21

I guess we can now simply use plain iloc with range for this.

chunk_size = df.shape[0] // 4  # rows per chunk; a shorter extra chunk follows if the row count is not a multiple of 4
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)
    ....
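
If the number of pieces must not exceed 4, one option is to round the chunk size up instead of down. The following is a sketch, not part of the original answer: iter_chunks is a hypothetical helper, written as a generator so that only one slice is produced at a time.

```python
import math
import numpy as np
import pandas as pd

def iter_chunks(df, n_chunks):
    """Yield at most n_chunks consecutive row slices of df (hypothetical helper)."""
    # Round up so the last, possibly shorter, slice absorbs the remainder
    chunk_size = max(1, math.ceil(len(df) / n_chunks))
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

df = pd.DataFrame({"x": np.arange(10)})
sizes = [len(chunk) for chunk in iter_chunks(df, 4)]
print(sizes)  # [3, 3, 3, 1]
```

With the chunk size rounded down (10 // 4 = 2), the same 10 rows would come out as 5 slices of 2 rows each.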

— pratpor

1

Simple and intuitive

— rmstmppr

13

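Note that np.array_split(df, 3) splits the DataFrame into 3 sub-DataFrames, while the split_dataframe function defined in @elixir's answer, when called as split_dataframe(df, chunk_size=3), splits the DataFrame every chunk_size rows.

Example:

With np.array_split: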

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)

...you get 3 sub-DataFrames:

df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11

With split_dataframe:

df_split2 = split_dataframe(df, chunk_size=3)

...you get 4 sub-DataFrames:

df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11

I hope I'm right and that this is useful.

— Gilberto

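Is there an easy way to randomize this process? I can only think of adding a random column, splitting, and then dropping the random column, but there may be an easier way.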

— Rutger Hofste
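
One possible way to randomize without a helper column is to shuffle the rows first and split afterwards. This is a sketch with made-up data: df.sample(frac=1) returns all rows in random order, and random_state is fixed here only so the run is repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10)})

# Shuffle all rows; drop random_state for a different order on every run
shuffled = df.sample(frac=1, random_state=0)

# Split the shuffled rows into 4 position-based chunks
position_groups = np.array_split(np.arange(len(shuffled)), 4)
chunks = [shuffled.iloc[positions] for positions in position_groups]

print([len(chunk) for chunk in chunks])  # [3, 3, 2, 2]
```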

Do the chunks have to be of equal size?

— InquilineKea

8

Warning:

np.array_split does not work with numpy-1.9.0. I checked, and it does work with 1.8.1.

Error:

DataFrame has no 'size' attribute

— yemu

6

I filed a bug on the pandas GitHub: github.com/pydata/pandas/issues/8846. It seems to have been fixed for pandas 0.15.2.

— yemu, 2014

4

You can use groupby, assuming you have an integer enumerated index:

import math
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)

subframes = [i[1] for i in df.groupby(np.arange(len(df))//rows_per_subframe)]

Note: groupby returns tuples whose second element is the DataFrame, hence the slightly convoluted extraction.

>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
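
Since the grouping key above is built from np.arange(len(df)) rather than from the index itself, the same pattern also seems to carry over to a non-integer index. A small sketch with a made-up string index:

```python
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({"sample": np.arange(10)}, index=list("abcdefghij"))
rows_per_subframe = math.ceil(len(df) / 4)

# Positional key: 0,0,0,1,1,1,2,2,2,3 regardless of the index labels
keys = np.arange(len(df)) // rows_per_subframe
subframes = [frame for _, frame in df.groupby(keys)]

print([len(frame) for frame in subframes])  # [3, 3, 3, 1]
```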

— 谣言
source

1

I also ran into np.array_split not working with a pandas DataFrame. My solution was to split only the DataFrame's index and then introduce a new column with a "group" label:

indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i

This makes groupby operations very convenient, for example computing the mean of each group:

df.groupby(by='group').mean()
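
A self-contained sketch of the same idea on made-up data, assuming the index labels are unique (otherwise df.loc would assign to more rows than intended). The value of N and the column names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
N = 3  # number of groups

# Split the index labels (as a NumPy array) rather than the DataFrame itself
label_groups = np.array_split(df.index.to_numpy(), N)

df["group"] = -1  # placeholder label, overwritten below
for i, labels in enumerate(label_groups):
    df.loc[labels, "group"] = i

group_means = df.groupby("group")["value"].mean()
print(group_means.tolist())  # [2.0, 4.5, 6.5]
```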

— Martin Alexandersson

0

You can do this in one line with a list comprehension

n = 4  # rows per chunk, not the number of chunks
chunks = [df[i:i+n] for i in range(0,df.shape[0],n)]

— Rishabh Vij