I routinely use tens of gigabytes of data in just this fashion; e.g. I have tables on disk that I read via queries, create data from, and append back to.
It's worth reading the docs and the later answers in this thread for several suggestions on how to store your data.
将影响您存储数据方式的详细信息,例如:
尽可能多地提供详细信息;我可以帮助您建立结构。
- Size of the data: number of rows and columns, column types; are you appending rows, or just columns?
- What typical operations will look like, e.g. a query on columns to select a bunch of rows and specific columns, then an operation (in memory) that creates new columns, which you then save.
  (Giving a toy example could enable us to offer more specific recommendations.)
- After that processing, what do you do? Is step 2 ad hoc, or repeatable?
- Input flat files: rough total size in GB. How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields present in each file?
- Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, and C for all of the records (and then do something)?
- Do you "work on" all of your columns (in groups), or is there a good proportion that you only use for reports (i.e. you want to keep the data around, but don't need to pull those columns in explicitly until final-results time)?
Solution
Ensure you have at least pandas 0.10.1 installed.
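A quick programmatic check, if you want one (just a sketch; it only compares the major.minor part of the version string):

import pandas as pd

# the HDFStore features used below need at least pandas 0.10.1;
# this coarse check ignores the patch-level part of the version
major, minor = pd.__version__.split('.')[:2]
assert (int(major), int(minor)) >= (0, 10), 'pandas is too old for this workflow'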
Read the files in iteratively, chunk-by-chunk, and use multiple-table queries.
Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (this would work with one big table, but it's more efficient to do it this way; I think I may be able to fix this limitation in the future, and it's more intuitive anyhow):
(The following is pseudocode.)
import numpy as np
import pandas as pd
# create a store
store = pd.HDFStore('mystore.h5')
# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([(f, g) for f in v['fields']]))
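As the comments suggest, you might wrap this map in a small class so the forward and inverted lookups stay in sync; here is a minimal sketch (the FieldMap name and its methods are my own, not pandas API):

class FieldMap(object):
    """Keeps the group -> fields map and its inversion together."""
    def __init__(self, group_map):
        self.group_map = group_map
        self.inverted = {f: g
                         for g, v in group_map.items()
                         for f in v['fields']}

    def group_of(self, field):
        # which sub-table does this field live in?
        return self.inverted[field]

    def fields_of(self, group):
        return self.group_map[group]['fields']

    def data_columns_of(self, group):
        return self.group_map[group]['dc']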
Read in the files and create the storage (essentially doing what append_to_multiple does):
for f in files:
    # read in the file; additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory, in which case just eliminate this part of the loop
    # (you can also change the chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns=v['fields'], copy=False)

            # append it
            store.append(g, frame, index=False, data_columns=v['dc'])
Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that's probably not necessary).
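At this point you can sanity-check what landed on disk, e.g. (a small sketch using HDFStore introspection):

# list the tables we just created
print(store.keys())   # e.g. ['/A', '/B', ..., '/REPORTING_ONLY']

# every group was appended from the same chunks, so the row counts
# should agree across groups
for g in group_map:
    print(g, store.get_storer(g).nrows)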
This is how you get columns and create new ones:
frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
# select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows
# do calculations on this frame
new_frame = cool_function_on_frame(frame)
# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones,
# e.g. so you don't overlap with the other tables);
# add this info to the group_map
store.append(new_group,
             new_frame.reindex(columns=new_columns_created, copy=False),
             data_columns=new_columns_created)
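To make that step concrete, a filled-in toy version might look like this (the group names 'A' and 'DERIVED' and the derived column are placeholders of my own):

# pull one sub-table, restricted to a couple of columns and a row subset
frame = store.select('A', columns=['field_1', 'field_2'], where=['field_1>0'])

# derive a new column in memory
frame['field_1_log'] = np.log(frame['field_1'])

# persist only the NEW column under its own group, and record it
# in group_map so it stays discoverable
store.append('DERIVED', frame[['field_1_log']], data_columns=['field_1_log'])
group_map['DERIVED'] = dict(fields=['field_1_log'], dc=['field_1_log'])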
When you are ready for post-processing:
# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([group_1, group_2, .....],
                                       where=['field_1>0', 'field_1000=foo'],
                                       selector=group_1)
Regarding data_columns: you don't actually need to define any data_columns; they allow you to sub-select rows based on a column. E.g. something like:
store.select(group, where = ['field_1000=foo', 'field_1001>0'])
They may be most interesting to you in the final report-generation stage (essentially a data column is segregated from the other columns, which might impact efficiency somewhat if you define a lot of them).
You also might want to:
- Create a function that takes a list of fields, looks up the groups in group_map, then selects them and concatenates the results, so that you get the resulting frame (this is essentially what select_as_multiple does); that way the structure would be pretty transparent to you (see the sketch after this list).
- Build indexes on certain data columns (this makes row-subsetting much faster).
- Enable compression.
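For the first bullet, here is a minimal sketch of such a helper (select_fields is my own name; it relies on the group_map_inverted built earlier, and on every group having been appended from the same chunks in the same row order):

def select_fields(store, fields):
    # map each requested field back to the group (sub-table) holding it
    by_group = {}
    for f in fields:
        by_group.setdefault(group_map_inverted[f], []).append(f)

    # select just those columns from each group and glue the pieces
    # back together column-wise; the row indexes line up because the
    # groups share the same append order
    pieces = [store.select(g, columns=cols) for g, cols in by_group.items()]
    return pd.concat(pieces, axis=1)

# e.g. df = select_fields(store, ['field_1', 'field_10'])

# for the indexing bullet: build an index on a data column once loading is done
store.create_table_index('A', columns=['field_1'], optlevel=9, kind='full')

# for the compression bullet: compression is chosen when the store is opened, e.g.
# store = pd.HDFStore('mystore.h5', complevel=9, complib='blosc')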
Let me know if you have questions!