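Read a small random sample from a big CSV file into a Python data frame

Question 1

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

Question 2

Assuming no header in the CSV file: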

import pandas
import random

n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
df = pandas.read_csv(filename, skiprows=skip, header=None) #header=None so the first kept row is read as data

It would be better if read_csv had a keeprows parameter, or if skiprows took a callback function instead of a list.

With a header and unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

Question 3

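@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as an argument.

If you can specify what percentage of lines you want, rather than how many lines, you don't even need to get the file size, and you only need to read through the file once. Assuming a header on the first row: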

import pandas as pd
import random
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)
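
Note that this approach returns a random number of rows rather than exactly 1%. As a rough worked example, assuming the file holds n = 1,000,000 records (the figure used in an earlier answer) and each row is kept independently with probability p, the sample size S follows a binomial distribution:

$$
S \sim \mathrm{Binomial}(n, p), \qquad
\mathbb{E}[S] = np = 10^{6} \times 0.01 = 10\,000, \qquad
\sigma_S = \sqrt{np(1-p)} = \sqrt{9900} \approx 99.5
$$

So the sample would typically land within a few hundred rows of 10,000.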

Or, if you want to take every nth line:

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
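
If an exact sample size is needed, the callable form can also be combined with a pre-drawn set of rows to keep, which gives roughly the keeprows behaviour wished for in an earlier answer. The following is a minimal sketch, assuming the file has a header, contains no embedded line breaks, and can be read once up front to count its records. The name sample_csv_exact and its seed parameter are hypothetical, not part of pandas:

```python
import random

import pandas as pd


def sample_csv_exact(filename, s, seed=None):
    """Read exactly s random data rows (plus the header) from a CSV file."""
    # Count data records; the first line is assumed to be the header.
    with open(filename) as f:
        n = sum(1 for _ in f) - 1
    rng = random.Random(seed)
    # Row numbers 1..n are data rows; row 0 is the header.
    keep = set(rng.sample(range(1, n + 1), min(s, n)))
    # Skip every data row that was not drawn; set lookup is O(1) per row.
    return pd.read_csv(filename, header=0,
                       skiprows=lambda i: i > 0 and i not in keep)
```

Drawing the s rows to keep, rather than the n - s rows to skip, keeps the helper's memory use proportional to the sample size instead of the file size.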

Question 4

This is not in Pandas, but it achieves the same result much faster through bash, without reading the entire file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

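The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.

Relevant question: https://unix.stackexchange.com/q/108581

Benchmark on a 7M-line CSV (the 2008.csv file used below):

Top answer: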

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Pandas timing:

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

Using shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12x faster and, importantly, does not read the whole file into memory.

Question 5

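Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m/i it uses that sample to replace a randomly chosen one of the samples already selected.

By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.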
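
A short induction sketch of why this holds, writing P_i for the probability that any given one of the first i samples is in the kept set after i samples have been seen (the notation P_i is introduced here for illustration):

$$
P_m = 1 = \frac{m}{m}, \qquad
P_{i+1} = P_i \left(1 - \frac{m}{i+1} \cdot \frac{1}{m}\right) = \frac{m}{i} \cdot \frac{i}{i+1} = \frac{m}{i+1}
$$

The new sample i+1 itself enters with probability m/(i+1), so every one of the first i+1 samples ends up in the kept set with the same probability.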

See the code below:

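import random

n_samples = 10
samples = []

with open("data.txt") as f:  # file to sample from (name assumed here)
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples * 1. / (i+1):
            # replace a random kept line with probability n_samples / (i+1)
            samples[random.randint(0, n_samples-1)] = line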
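
To get from the sampled lines to a data frame, the same idea can be wrapped in a small helper. This is a sketch under a few assumptions: the file has a header line, contains no embedded line breaks inside quoted fields, and the function name reservoir_sample_csv is made up for illustration. It uses the single-draw variant commonly called Algorithm R, which picks one random index per line instead of drawing twice:

```python
import io
import random

import pandas as pd


def reservoir_sample_csv(filename, m, seed=None):
    """Return a DataFrame with up to m uniformly sampled rows, in one pass."""
    rng = random.Random(seed)
    with open(filename, newline="") as f:
        header = f.readline().rstrip("\r\n")
        reservoir = []
        for i, line in enumerate(f):
            line = line.rstrip("\r\n")
            if i < m:
                # Fill the reservoir with the first m data lines.
                reservoir.append(line)
            else:
                # j is uniform over 0..i, so j < m happens with probability m/(i+1).
                j = rng.randint(0, i)
                if j < m:
                    reservoir[j] = line
    # Parse the header plus the kept lines as a small in-memory CSV.
    return pd.read_csv(io.StringIO("\n".join([header] + reservoir)))
```

Only the m kept lines are ever held in memory, so the file size does not matter beyond the time needed to scan it.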

Question 6

The following code first reads the header, and then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

Question 7

from random import randint

class magic_checker:
    """Sentinel for iter(): compares equal once target_count lines were read."""
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)  # how many consecutive lines to read
seek_target = randint(min_target, max_target)  # random offset to start from
with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this line, which is probably partial
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

I think something like this should work.

Question 8

No pandas!

import random
from os import fstat
from sys import exit

# Binary mode: Python 3 text files do not support relative seeks
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline().decode())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline().decode())

print(sampled_lines)

You'll get a sampled_lines list. What kind of statistics do you mean?
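
As one possible answer to the statistics part, a list of raw lines like this can be parsed into a data frame and summarised. This is only a sketch: it assumes the lines are comma-separated records without a header, and the helper lines_to_frame, the column names, and the example lines are all made up for illustration:

```python
import io

import pandas as pd


def lines_to_frame(sampled_lines, column_names):
    """Parse raw CSV lines (no header) into a DataFrame."""
    # Normalise line endings and drop blank lines before parsing.
    text = "\n".join(line.rstrip("\r\n") for line in sampled_lines if line.strip())
    return pd.read_csv(io.StringIO(text), header=None, names=column_names)


# Hypothetical sample of raw lines, standing in for sampled_lines above.
example_lines = ["1,2.5\n", "3,4.5\n", "5,6.5\n"]
df = lines_to_frame(example_lines, ["id", "value"])
print(df.describe())       # count, mean, std, min, quartiles, max per column
print(df["value"].mean())  # 4.5
```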

Question 9

Use subsample:

pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv

Question 10

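You can also create a sample with the 10000 records before bringing it into the Python environment.

Using Git Bash (Windows 10), I just ran the following command to produce the sample: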

shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv

Note: if your CSV has a header, this is not the best solution.
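
One way to work around the header problem on the Python side is sketched below. It assumes the original file (BIGFILE.csv above) does have a header and that the shuffled sample (SAMPLEFILE.csv) may or may not contain that header line somewhere among its rows. The helper name load_shuffled_sample is hypothetical, and all columns are read as strings, so numeric columns would still need converting afterwards:

```python
import pandas as pd


def load_shuffled_sample(original_csv, sample_csv):
    """Load a shuffled sample, borrowing column names from the original file."""
    # Read only the header row of the original file to get the column names.
    columns = pd.read_csv(original_csv, nrows=0).columns
    # The sample has no reliable header, so read every line as data.
    sample = pd.read_csv(sample_csv, header=None, names=columns, dtype=str)
    # If the header line was picked as one of the sampled rows, drop it.
    is_header = sample.apply(lambda row: list(row) == list(columns), axis=1)
    return sample[~is_header].reset_index(drop=True)
```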

Question 11

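TL;DR

If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code: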

import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 10000
batch_size = 200

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk

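Explanation

It's not always trivial to know the size of the input CSV file.

If there are embedded line breaks, tools like wc or shuf will either give you the wrong answer or just make a mess of your data.

So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.

To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter: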

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

First, we get the initial sample:

sample = sample_reader.get_chunk(sample_size)

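Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the chunk itself, where each integer lies in the range of the initial sample's index (which happens to be the same as range(sample_size)):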

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))

And use this reindexed chunk to replace (some of the) lines in the sample:

sample.loc[chunk.index] = chunk

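After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.

To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).

Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size, because the last chunk in the loop could be smaller.

On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be fewer lines in the file than sample_size. This is because in that case there won't be any chunks left to loop over, so it will never be a problem.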

Question 12

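For example, if you have loan.csv, you can use this script to easily load a specified number of random items. Note that it reads the whole file into memory before sampling.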

data = pd.read_csv('loan.csv').sample(10000, random_state=44)

Question 13

Read the data file

import pandas as pd
df = pd.read_csv('data.csv')

First check the shape of df

df.shape

Create a small sample of 1000 raw rows from df

sample_data = df.sample(n=1000, replace=False)

Check the shape of sample_data

sample_data.shape

Question 14

Say you want to load a 20% sample of the dataset:

    import pandas as pd
    df = pd.read_csv(filepath).sample(frac = 0.20)
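
Since the one-liner above loads the entire file before sampling, it does not help when the file is larger than memory. A chunked variant along the following lines might be used instead. It is a sketch only: the function name sample_fraction_chunked, the default chunksize, and the per-chunk seeding scheme are assumptions made for illustration:

```python
import pandas as pd


def sample_fraction_chunked(filepath, frac, chunksize=100_000, seed=None):
    """Sample roughly `frac` of the rows without loading the whole file."""
    reader = pd.read_csv(filepath, chunksize=chunksize)
    parts = []
    for k, chunk in enumerate(reader):
        # Derive a per-chunk seed so a fixed seed gives a reproducible sample.
        state = None if seed is None else seed + k
        # Only the sampled rows of each chunk are kept in memory.
        parts.append(chunk.sample(frac=frac, random_state=state))
    return pd.concat(parts, ignore_index=True)
```

Each chunk contributes its own 20% (or whatever frac is), so peak memory is about one chunk plus the accumulated sample.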