A faster way to sort data

I need to randomly sort a bed file 10000 times and take the top 1000 lines each time. Currently, I am using the following code:

for i in {1..100}; do
    for j in {1..100}; do
        sort -R myfile.bed_sorted | tail -n 1000 > myfile.bed.$i.$j.bed
    done
done

This takes about 6 hours per file, and I have about 150 files to work through. Is there a faster solution?

Here is a sample of the data (myfile.bed_sorted):

    chr1    111763899   111766405   peak1424    1000    .   3224.030    -1  -1
    chr1    144533459   144534584   peak1537    998 .   3219.260    -1  -1
    chr8    42149384    42151246    peak30658   998 .   3217.620    -1  -1
    chr2    70369299    70370655    peak16886   996 .   3211.600    -1  -1
    chr8    11348914    11352994    peak30334   990 .   3194.180    -1  -1
    chr21   26828820    26830352    peak19503   988 .   3187.820    -1  -1
    chr16   68789901    68791150    peak11894   988 .   3187.360    -1  -1
    chr6    11458964    11462245    peak26362   983 .   3169.750    -1  -1
    chr1    235113793   235117308   peak2894    982 .   3166.000    -1  -1
    chr6    16419968    16422194    peak26522   979 .   3158.520    -1  -1
    chr6    315344  321339  peak26159   978 .   3156.320    -1  -1
    chr1    111756584   111759633   peak1421    964 .   3110.520    -1  -1
    chrX    12995098    12997685    peak33121   961 .   3100.000    -1  -1
    chr9    37408601    37410262    peak32066   961 .   3100.000    -1  -1
    chr9    132648603   132651523   peak32810   961 .   3100.000    -1  -1
    chr8    146103178   146104943   peak31706   961 .   3100.000    -1  -1
    chr8    135611963   135614649   peak31592   961 .   3100.000    -1  -1
    chr8    128312253   128315935   peak31469   961 .   3100.000    -1  -1
    chr8    128221486   128223644   peak31465   961 .   3100.000    -1  -1
    chr8    101510621   101514237   peak31185   961 .   3100.000    -1  -1
    chr8    101504210   101508005   peak31184   961 .   3100.000    -1  -1
    chr7    8173062 8174642 peak28743   961 .   3100.000    -1  -1
    chr7    5563424 5570618 peak28669   961 .   3100.000    -1  -1
    chr7    55600455    55603724    peak29192   961 .   3100.000    -1  -1
    chr7    35767878    35770820    peak28976   961 .   3100.000    -1  -1
    chr7    28518260    28519837    peak28923   961 .   3100.000    -1  -1
    chr7    104652502   104654747   peak29684   961 .   3100.000    -1  -1
    chr6    6586316 6590136 peak26279   961 .   3100.000    -1  -1
    chr6    52362185    52364270    peak27366   961 .   3100.000    -1  -1
    chr6    407805  413348  peak26180   961 .   3100.000    -1  -1
    chr6    32936987    32941352    peak26978   961 .   3100.000    -1  -1
    chr6    226477  229964  peak26144   961 .   3100.000    -1  -1
    chr6    157017923   157020836   peak28371   961 .   3100.000    -1  -1
    chr6    137422769   137425128   peak28064   961 .   3100.000    -1  -1
    chr5    149789084   149793727   peak25705   961 .   3100.000    -1  -1
    chr5    149778033   149783125   peak25702   961 .   3100.000    -1  -1
    chr5    149183766   149185906   peak25695   961 .   3100.000    -1  -1

sort

— biobudhan

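How big is your file, and how strict is your notion of "random"? split can break each file into 1000-line pieces, so you would get more files out of a single call to sort. Also, have you checked whether head is slightly faster than tail, since it does not need to read through the whole file?

— Ulrich Schwarz, 2014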

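@UlrichSchwarz: The sample file I pasted above contains about 33000 lines. In general, all my bed files will have more or less the same number of lines. Also, from a 33000-line file I do not want to get 33 subsets (1000 lines each) in a single run. I only want to take the top 1000 lines from each run. I will also be doing the tail of the same file. Just for the sake of the example, I used head here.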

— biobudhan

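According to the man page, sort -R uses "a random hash of keys". Creating the hash is a waste of time and probably takes longer than anything else. It would be better to read the lines into an array and then shuffle that using indexes. Personally, I would use perl for that. You could do it with bash, but you would need a function to generate random numbers.

— goldilocks, 2014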

@goldilocks: I am not a perl person! Could you please help me?

— biobudhan, 2014

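Try shuf instead of sort -R; it is considerably faster. Of course, doing it in memory (see the Perl answer) will beat anything that has to re-read the whole file in the shell.

— frostschutz, 2014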

Answers:

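Assuming you have enough memory to slurp the file, you could try

perl -e 'use List::Util qw(shuffle); @k=shuffle(<>); print @k[0..999]' file.bed

Since you want to do this 10000 times, I would recommend integrating the repetition into the script and shuffling the indexes rather than the array itself, to speed things up:

$ time perl -e 'use List::Util qw(shuffle);
            @l=<>; for $i (1..10000){
               open(my $fh, ">", "file.$i.bed") or die "file.$i.bed: $!";
               @r=shuffle(0..$#l);
               print $fh @l[@r[0..999]]
            }' file.bed

real    1m12.444s
user    1m8.536s
sys     0m3.244s

The above created 10000 files of 1000 lines each from a file containing 37000 lines (your example file repeated 1000 times). As you can see, it took a little over one minute on my system.

Explanation

use List::Util qw(shuffle);: imports the Perl module that provides the shuffle() function, which randomizes a list. Writing the import as qw(shuffle) avoids nesting single quotes inside the single-quoted shell argument.
@l=<>;: loads the input file (<>) into the array @l.
for $i (1..10000){}: runs the body 10000 times.
open(my $fh, ">", "file.$i.bed") or die ...;: opens a file called file.$i.bed for writing and aborts with the system error if that fails. $i takes values from 1 to 10000.
@r=shuffle(0..$#l);: $#l is the last index of @l (the number of elements minus one), so @r is now a randomized list of the index numbers of @l (the lines of the input file).
print $fh @l[@r[0..999]]: takes the first 1000 indexes from the shuffled list and prints the corresponding lines (elements of @l).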
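
For readers who do not use Perl, here is a rough Python sketch of the same idea (pick random indexes, then print the lines they point to). The function names and the prefix.N.bed naming are my own choices for illustration, and I have not benchmarked it against the Perl version.

```python
import random

def sample_lines(lines, k, rng):
    """Return k distinct lines in random order, chosen via random indexes."""
    # rng.sample draws k distinct indexes; this is equivalent to taking the
    # first k entries of a fully shuffled index list.
    picked = rng.sample(range(len(lines)), k)
    return [lines[j] for j in picked]

def write_samples(lines, n_samples, k, prefix, rng=None):
    """Write n_samples files named prefix.1.bed, prefix.2.bed, ... (hypothetical naming)."""
    rng = rng or random.Random()
    paths = []
    for i in range(1, n_samples + 1):
        path = f"{prefix}.{i}.bed"
        with open(path, "w") as fh:
            fh.writelines(sample_lines(lines, k, rng))
        paths.append(path)
    return paths
```

Reading file.bed once with readlines() and calling write_samples(lines, 10000, 1000, "file") would mirror the loop above; like the Perl version, it assumes the whole file fits in memory and has at least 1000 lines.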

Another approach is to use shuf (thanks @frostschutz):

$ time for i in {1..10000}; do shuf -n 1000 file.bed > file.$i.abed; done

real    1m9.743s
user    0m23.732s
sys     0m31.764s

— terdon

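Wow!! That is amazing!! It finished in 2 minutes :-) I have one more question. What about also retrieving the last 1000 lines of the file? Don't we need to know the length of the file (the number of lines) to do that? Please help!

— biobudhan, 2014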

@biobudhan do consider frostschutz's shuf suggestion: for i in {1..10000}; do shuf -n 1000 file.bed > file.$i.bed; done. That took about 1 minute on my system. As for the last 1000 lines, all you need is tail -n 1000.

— terdon

@biobudhan also see the updated answer for a perl version that is 3 times faster.

— terdon

Yes, I tried it and it works much faster now!! Thank you very much!!! :-)

— biobudhan, 2014

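Did you double-check the output files from the perl version? It seems odd to me that it has so little sys time, which is where the file I/O would show up; that should not be so totally different from shuf, which has about 30s of sys. So I tested the perl one here (cut n' paste) and O_O: it created 1000 files, but all of them were empty...

— goldilocks, 2014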

If you want a benchmark to see how fast this can be done, copy and paste the code below into 10kshuffle.cpp and compile it with g++ 10kshuffle.cpp -o 10kshuffle. You can then run it:

10kshuffle filename < inputfile

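Here filename is the base path to use for the output files; they will be named filename.0, filename.1, and so on, and each contains the first 1000 lines of a shuffle. The program prints the name of each file as it goes.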

#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <fstream>
#include <iostream>
#include <string>
#include <cstdio>
#include <unistd.h>
#include <vector>

using namespace std;

unsigned int randomSeed () {
    int in = open("/dev/urandom", O_RDONLY);
    if (in < 0) {  // open() returns -1 on failure, never 0
        cerr << strerror(errno);
        exit(1);
    }
    unsigned int x;
    read(in, &x, sizeof(x));
    close(in);
    return x;
}

int main (int argc, const char *argv[]) {
    char basepath[1024];
    strcpy(basepath,argv[1]);
    char *pathend = &basepath[strlen(basepath)];
// Read in.
    vector<char*> data;
    data.reserve(1<<16);
    while (true) {
        char *buf = new char[1024];
        if (!cin.getline(buf,1023)) { delete[] buf; break; }  // stop at EOF or on an over-long line
        data.push_back(buf);
    }

    srand(randomSeed());
    for (int n = 0; n < 10000; n++) {
        vector<char*> copy(data);
    // Fisher-Yates shuffle.
        int last = copy.size() - 1;
        for (int i = last; i > 0; i--) {
            int r = rand() % (i + 1);  // r in [0, i] (ignoring modulo bias); r == i keeps the element in place
            if (r == i) continue;
            char *t = copy[i];
            copy[i] = copy[r];
            copy[r] = t;
        }
    // Write out.
        sprintf(pathend, ".%d", n);
        ofstream file(basepath);
        for (int j = 0; j < 1000; j++) file << copy[j] << endl;
        cout << basepath << endl;
        file.close();
    }

    return 0;
}

On a single 3.5 GHz core, this runs in about 20 seconds:

   time ./10kshuffle tmp/test < data.txt
   tmp/test.0
   [...]
   tmp/test.9999
   real 19.95, user 9.46, sys 9.86, RSS 39408

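data.txt was 37000 lines made by repeating the sample from the question. If you want the entire shuffle in each output file instead of only the first 1000 lines, change line 54 (the loop that writes the output) to: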

for (int j = 0; j < copy.size(); j++) file << copy[j] << endl;
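
When only the first 1000 lines of each shuffle are written, it is not strictly necessary to shuffle all 37000 entries. A sketch of that shortcut in Python, assuming a forward variant of Fisher-Yates that stops after k swaps; partial_shuffle is a name I made up for illustration, and I have not timed it against the C++ program.

```python
import random

def partial_shuffle(items, k, rng=None):
    """Return k items chosen uniformly at random, in random order."""
    rng = rng or random.Random()
    pool = list(items)   # work on a copy so the caller's list is untouched
    n = len(pool)
    k = min(k, n)        # never ask for more items than exist
    for i in range(k):
        # Pick r uniformly from [i, n-1]; r == i is allowed and simply
        # leaves the element where it is.
        r = rng.randrange(i, n)
        pool[i], pool[r] = pool[r], pool[i]
    return pool[:k]
```

After k iterations the first k slots hold a uniformly random ordered selection, so the cost per output file drops from one swap per input line to one swap per output line, plus the copy.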

— goldilocks

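There is a Unix side to your question, but it is worth solving the underlying problem first and then looking for a Unix-y way to implement that solution.

You need to create 10,000 samples, each of size 1,000, from a file with a large, unknown number of rows. If you can hold 10,000 x 1,000 rows in memory, you can do this in a single pass over the file. If you cannot hold that many rows in memory, you can still do it in a single pass, provided you know how many rows the file contains. If you do not know how many rows the file contains, you need one additional pass to count them.

In the more difficult case, when you do not know the number of rows, the algorithm does the following for each sample (in parallel, keeping the samples in memory):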

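include the first 1,000 rows in the sample
for the n-th row (where n > 1000), include it with probability 1000 / n and discard a random row from those already selected. (Because rows may be discarded later, the sample has to be kept in memory until the end of the input.)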

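An elegant way to implement the second step is to generate a random integer k in [1, n]. If k <= 1000, include the row and replace the existing k-th row of the sample with it. Here is a more standard description of the algorithm: http://en.wikipedia.org/wiki/Reservoir_sampling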
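
The replacement rule can be sketched in a few lines of Python for a single sample. This is only an illustration under my own naming (reservoir_sample, rows, k); for 10,000 samples you would keep 10,000 such lists and update each one for every input row.

```python
import random

def reservoir_sample(rows, k, rng=None):
    """Return k rows chosen uniformly from an iterable of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, row in enumerate(rows, start=1):
        if n <= k:
            # The first k rows always enter the sample.
            sample.append(row)
        else:
            # j is uniform on [1, n], so j <= k happens with probability k / n.
            j = rng.randint(1, n)
            if j <= k:
                sample[j - 1] = row   # replace the j-th row of the sample
    return sample
```

If the input has fewer than k rows, the function simply returns all of them. The positions within the result are not uniformly shuffled, so shuffle the list afterwards if the output order matters.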

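If you know the number of rows, R, then:

start with a sample size s of 0
include the n-th row with probability (1000 - s) / (R - n + 1) and output it immediately (incrementing the sample size s)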
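
A minimal Python sketch of this known-length case, again for one sample. The names selection_sample, total and k are mine, and the sketch assumes that total really is the exact row count and that k <= total.

```python
import random

def selection_sample(rows, total, k, rng=None):
    """Yield exactly k of the `total` rows, each subset being equally likely."""
    rng = rng or random.Random()
    s = 0                                   # rows selected so far
    for n, row in enumerate(rows, start=1):
        if s >= k:
            break                           # sample is complete
        # Select row n with probability (k - s) / (total - n + 1).
        if rng.random() * (total - n + 1) < (k - s):
            s += 1
            yield row                       # can be written out immediately
```

Because each row is emitted as soon as it is chosen, nothing has to be held in memory, but the rows come out in their original file order rather than shuffled. When the number of rows left equals the number still needed, the probability becomes 1, which is what guarantees exactly k rows.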

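How do you do this on Unix? awk seems to be the answer, according to this post on the Internet (I cannot vouch for its correctness, but the code is there): https://news.ycombinator.com/item?id=4840043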

— necromancer