Shuffle two parallel text files

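I have two sentence-aligned parallel corpora (text files) with about 50 million words each (from the Europarl corpus, i.e. parallel translations of legal documents). I now want to shuffle the lines of both files, but both in the same way. I tried to solve this with gshuf (I'm on a Mac), using one and the same random source for both calls: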

gshuf --random-source /path/to/some/random/data file1
gshuf --random-source /path/to/some/random/data file2

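But I got the error message end of file, apparently because the random source has to contain as many words as the file to be shuffled. Is that true? If so, how should I create a random source that fits my needs? If not, how else could I shuffle the files in parallel? I thought about pasting them together, shuffling, and then splitting them again. However, that seems ugly, because I would first have to find a delimiter that does not occur in the files.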

text-processing osx random

— conipo

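You get that error because your random_file does not contain enough bytes... see Random sources. As for paste, you could use as the delimiter some low ASCII character that is unlikely to occur in your files (like \x02, \x03...).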

— don_crissti, 2015

OK, so no matter what I want to randomize, I'll be fine if I use /dev/urandom, right? The paste delimiter is a nice trick, thanks!

— conipo
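
The paste approach from the comments above could look roughly like the sketch below. It assumes that the control character \002 occurs in neither file; the function name shuffle_pair and the .shuf output suffix are made up for illustration. On a Mac with Homebrew coreutils, shuf would be gshuf.

```shell
# Shuffle two line-aligned files identically by pasting them side by side.
# Assumption: the control character \002 (STX) occurs in neither input file.
shuffle_pair() {
  a=$1
  b=$2
  delim=$(printf '\002')
  joined=$(mktemp) || return 1
  # One joined line per sentence pair, so each pair is shuffled as a unit.
  paste -d "$delim" "$a" "$b" | shuf > "$joined"
  # Split the shuffled pairs back into two aligned files.
  cut -d "$delim" -f 1 "$joined" > "$a.shuf"
  cut -d "$delim" -f 2 "$joined" > "$b.shuf"
  rm -f "$joined"
}
```

Calling shuffle_pair file1 file2 would write file1.shuf and file2.shuf. Because every pair travels as one line, no shared random source is needed here.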

I don't know whether there is a more elegant method, but this works for me:

mkfifo onerandom tworandom threerandom
tee onerandom tworandom threerandom < /dev/urandom > /dev/null &
shuf --random-source=onerandom onefile > onefile.shuf &
shuf --random-source=tworandom twofile > twofile.shuf &
shuf --random-source=threerandom threefile > threefile.shuf &
wait

Result:

$ head -n 3 *.shuf
==> onefile.shuf <==
24532 one
47259 one
58678 one

==> threefile.shuf <==
24532 three
47259 three
58678 three

==> twofile.shuf <==
24532 two
47259 two
58678 two

However, the files must have exactly the same number of lines.
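
Since the result is only aligned when the line counts match, it may be worth checking that before shuffling. A minimal sketch; the helper name same_line_count is hypothetical:

```shell
# Succeed only if all given files have the same number of lines.
same_line_count() {
  expected=$(wc -l < "$1")
  for f in "$@"; do
    actual=$(wc -l < "$f")
    if [ "$actual" -ne "$expected" ]; then
      echo "line count mismatch: $f has $actual lines, $1 has $expected" >&2
      return 1
    fi
  done
  return 0
}
```

For example, same_line_count onefile twofile threefile && echo "safe to shuffle".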

The GNU Coreutils documentation also provides a nice solution for repeatable randomness, using openssl as a seeded random generator:

https://www.gnu.org/software/coreutils/manual/html_node/Random-sources.html#Random-sources
get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)

However, consider using a better seed than "42", unless you want others to be able to reproduce "your" random result too.

— frostschutz

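This works like a charm. Would you mind explaining the steps you took? The tee command makes sure the same random numbers end up in all three pipes, right? Why is the redirect to /dev/null also needed? And is it automatically guaranteed that there are enough bytes, so that the end of file error cannot occur?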

— conipo

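The /dev/null is there because tee also prints to stdout. You could use > threerandom instead, but that is harder to script. The named pipes will deliver as much random data as needed, so you don't have to know in advance how much you need.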

— frostschutz

OK, and why can't a single pipe serve as the random source for all three shuffles?

— conipo

You can't read the same data three times from one pipe. You have to multiplex it somehow, and that is what tee does...

— frostschutz
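
The named-pipe approach discussed in this thread could be wrapped for any number of files roughly as follows. This is only a sketch: the function name shuffle_in_parallel and the temporary pipe names are made up, and the files are assumed to have the same number of lines. It waits for the shuf processes only and then stops tee explicitly, in case tee has not already exited on its own.

```shell
# Shuffle any number of equal-length files identically by feeding every shuf
# the same bytes from /dev/urandom through one named pipe per file.
shuffle_in_parallel() {
  tmpdir=$(mktemp -d) || return 1
  i=0
  for f in "$@"; do
    i=$((i + 1))
    mkfifo "$tmpdir/rand$i" || return 1
  done

  # tee copies the single random stream into every pipe.
  tee "$tmpdir"/rand* < /dev/urandom > /dev/null &
  tee_pid=$!

  i=0
  pids=
  for f in "$@"; do
    i=$((i + 1))
    shuf --random-source="$tmpdir/rand$i" "$f" > "$f.shuf" &
    pids="$pids $!"
  done

  # Wait for the shuffles only, then stop tee if it is still running.
  wait $pids
  kill "$tee_pid" 2>/dev/null || true
  rm -rf "$tmpdir"
}
```

With the files from the answer, shuffle_in_parallel onefile twofile threefile would produce the three .shuf files shown there, though with a different random order on every run.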