Answers:
There is no option to filter the rows before loading the CSV file into a pandas object.
You can load the file and then filter with df[df['field'] > constant], or, if the file is very large and you are worried about running out of memory, you can use an iterator and apply the filter as you concatenate the chunks of the file, e.g.:
import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
You can vary the chunksize to suit your available memory. See here for more details.
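A minimal, self-contained sketch of the chunked approach above (the CSV content, the column name field, and the threshold are made up for illustration; in practice you would pass a file path instead of a StringIO buffer):

```python
import pandas as pd
from io import StringIO

# Synthetic CSV standing in for 'file.csv' (hypothetical data)
csv_text = "field\n5\n1\n10\n7\n6\n2\n"

# Read two rows at a time and keep only rows where field > 4,
# so only the filtered rows are ever accumulated in memory
reader = pd.read_csv(StringIO(csv_text), chunksize=2)
df = pd.concat(chunk[chunk['field'] > 4] for chunk in reader)

print(df['field'].tolist())  # [5, 10, 7, 6]
```

Only the surviving rows of each chunk are held before the final concat, which is what keeps peak memory bounded by roughly one chunk plus the filtered result.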
Comments:
To filter between two values, combine two conditions: chunk[(chunk['field'] > constant2) & (chunk['field'] < constant1)]
Shouldn't this use .loc? i.e. chunk.loc[chunk['field'] > constant]
I don't think .loc existed back in 2012, but I suppose these days using .loc is more explicit.
If you are on Linux, you can use grep.
# works on either Python 2 or Python 3
import subprocess
import pandas as pd
from time import time # not needed, just for timing
try:
    from StringIO import StringIO # Python 2
except ImportError:
    from io import StringIO # Python 3

def zgrep_data(f, string):
    '''grep for multiple items; f is the filepath, string is what you are filtering for'''
    grep = 'grep' # change to zgrep for gzipped files
    print('{} for {} from {}'.format(grep, string, f))
    start_time = time()
    if string == '':
        # an empty pattern matches every line, so the header row is kept
        out = subprocess.check_output([grep, string, f]).decode()
        grep_data = StringIO(out)
        data = pd.read_csv(grep_data, sep=',', header=0)
    else:
        # read only the first row to get the columns; may need to change depending on
        # how the data is stored
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]
        out = subprocess.check_output([grep, string, f]).decode()
        grep_data = StringIO(out)
        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)
    print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
    return data
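A self-contained sketch of the same grep idea, assuming grep is available on the system (Linux/macOS); the file contents and the pattern 'cat' are made up for illustration:

```python
import os
import subprocess
import tempfile
import pandas as pd
from io import StringIO

# Write a small throwaway CSV (made-up data) to grep against
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("animal,count\ncat,3\ndog,5\ncatfish,2\n")
    path = tmp.name

# Filter the rows with grep before pandas ever sees them
out = subprocess.check_output(['grep', 'cat', path]).decode()

# The header row does not match the pattern, so recover the column
# names separately from the first line of the file
columns = pd.read_csv(path, nrows=1, header=None).values.tolist()[0]
data = pd.read_csv(StringIO(out), names=columns, header=None)

print(data['animal'].tolist())  # ['cat', 'catfish']
os.remove(path)
```

Note that grep matches substrings anywhere on the line ('cat' also matches 'catfish'), so this is a coarse pre-filter; exact filtering still belongs in pandas.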
Regarding chunk['field'] > constant: can I clamp it between two constant values? For example: constant1 > chunk['field'] > constant2. Or is there something like an "in range" check I can use?
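Python's chained comparison constant1 > chunk['field'] > constant2 does not work on pandas Series; you combine two boolean conditions with &, or use Series.between (inclusive='neither' requires pandas >= 1.3). A small sketch with made-up data and bounds:

```python
import pandas as pd

chunk = pd.DataFrame({'field': [1, 3, 5, 7, 9]})
constant1, constant2 = 8, 2  # hypothetical bounds, constant1 > constant2

# Explicit boolean AND (the parentheses around each comparison are required)
a = chunk[(chunk['field'] > constant2) & (chunk['field'] < constant1)]

# Equivalent "in range" form; inclusive='neither' excludes both endpoints
b = chunk[chunk['field'].between(constant2, constant1, inclusive='neither')]

print(a['field'].tolist())  # [3, 5, 7]
```

The & form generalizes to any combination of conditions, while between reads more like the "in range" check you describe.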