How can I split a large text file into smaller files with an equal number of lines?

514

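I have a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it into 10 files of 200k lines each, or 100 files of 20k lines each (plus one file with the remainder; being evenly divisible doesn't matter).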

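I could do this fairly easily in Python, but I'm wondering if there's some kind of ninja way to do it using bash and unix utils (as opposed to manually looping and counting / partitioning lines).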

bash file unix

— danben

2

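Out of curiosity, after they're "split", how does one "combine" them? Something like "cat part2 >> part1"? Or is there another ninja utility? Mind updating your question?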

— dlamotte, 2010

7

To put them back together: cat part* > original

— Mark Byers
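As a small illustration of the round trip discussed in these comments, the sketch below splits a sample file and rebuilds it with cat. The file names, the `part_` prefix, and the scratch directory are made up for the example.

```shell
#!/usr/bin/env bash
# Round trip: split a file into pieces, then rebuild it with cat.
set -eu

workdir=$(mktemp -d)              # scratch directory for the demo
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

seq 1 25 > original.txt           # sample input: 25 numbered lines
split -l 10 original.txt part_    # creates part_aa, part_ab, part_ac

# The shell expands part_* in sorted order, which matches split's suffix order.
cat part_* > rebuilt.txt

if cmp -s original.txt rebuilt.txt; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
```

Because the pieces are concatenated in glob order, this only works as long as the suffixes sort correctly, which is the case for the fixed-length suffixes split generates.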

9

Yes, cat is short for concatenate. In general, apropos is useful for finding appropriate commands. E.g. see the output of: apropos split

— pixelbeat, 2010

@pixelbeat That's pretty cool, thanks

— danben, 2010

3

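By the way, OS X users should make sure their files contain Linux- or Unix-style line endings (LF) rather than classic Mac-style line endings (CR): the split and csplit commands will not work if your line breaks are carriage returns instead of line feeds. If you're on Mac OS, TextWrangler from BareBones Software can help with this. You can choose which line break characters to use when you save (or Save As...) your text files.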
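If no editor is at hand, the same conversion can be sketched on the command line with tr. This is only an illustration on a tiny made-up file; it assumes the input uses CR alone as the line terminator.

```shell
#!/usr/bin/env bash
# Convert CR-only line endings to LF so that line-based tools see the lines.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'one\rtwo\rthree\r' > cr_file.txt       # sample with CR-only line endings
before=$(( $(wc -l < cr_file.txt) ))           # wc -l counts LF characters

tr '\r' '\n' < cr_file.txt > lf_file.txt       # turn every CR into an LF
after=$(( $(wc -l < lf_file.txt) ))

echo "lines before: $before, after: $after"
split -l 2 lf_file.txt piece_                  # now splits on real line boundaries
```

For files with Windows-style CRLF endings, deleting the CR (`tr -d '\r'`) is the usual variant, since replacing it would double every line break.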

855

Have you looked at the split command?

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:

split -l 200000 filename

This will create files of 200000 lines each, named xaa xab xac ...

Another option is to split by the size of the output files (still splitting on line breaks):

 split -C 20m --numeric-suffixes input_filename output_prefix

This creates files named output_prefix00 output_prefix01 output_prefix02 ..., each with a maximum size of 20 MB.

— Mark Byers
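To see the line-based variant in action on something small, here is a minimal sketch. The sample file, the `chunk_` prefix, and the chunk size of 200 are arbitrary choices for the demo.

```shell
#!/usr/bin/env bash
# Split a sample file into pieces of 200 lines and report each piece's size.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

seq 1 1050 > big.txt               # sample input: 1050 lines
split -l 200 big.txt chunk_        # 200 lines per piece, prefix "chunk_"

# Print the line count of every piece; the last one holds the remainder.
for f in chunk_*; do
    printf '%s %s\n' "$f" "$(( $(wc -l < "$f") ))"
done

set -- chunk_*                     # count the pieces via the glob
num_chunks=$#
echo "pieces: $num_chunks"
```

With 1050 input lines this gives five full pieces plus one shorter piece for the remaining 50 lines, which is the "plus one file with the remainder" behaviour asked for in the question.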

16

You can also split a file by size: split -b 200m filename (m for megabytes, k for kilobytes, or no suffix for bytes)

— Abhi Beckert

136

Split by size and make sure files are split on line breaks: split -C 200m filename

— Clayton Stanley
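The difference between -b and -C is easiest to see on a small file. The sketch below assumes GNU split and uses an arbitrary 256-byte limit with a made-up `bysize_` prefix; it checks that no piece exceeds the limit and that every piece ends on a line boundary.

```shell
#!/usr/bin/env bash
# Split by size with -C: pieces stay under the limit and never cut a line.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# 100 lines of 10 bytes each (9 characters plus the newline).
for i in $(seq 1 100); do printf 'line %04d\n' "$i"; done > input.txt

split -C 256 input.txt bysize_     # at most 256 bytes of whole lines per piece

max_bytes=0
broken=0
for f in bysize_*; do
    size=$(( $(wc -c < "$f") ))
    if [ "$size" -gt "$max_bytes" ]; then max_bytes=$size; fi
    # $(...) strips a trailing newline, so the result is empty only when
    # the last byte of the piece is a newline.
    if [ -n "$(tail -c 1 "$f")" ]; then broken=$((broken + 1)); fi
done
echo "largest piece: $max_bytes bytes, pieces ending mid-line: $broken"
```

With `-b 256` instead, the pieces would be exactly 256 bytes and most of them would end in the middle of a line.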

2

split produces garbled output with Unicode (UTF-16) input. At least on Windows, with the version I have.

— 眩晕

4

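@geotheory, be sure to follow LeberMac's advice earlier in the thread about first converting CR (Mac) line endings to LF (Linux) line endings using TextWrangler or BBEdit. I had exactly the same problem as you until I found that advice.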

— sstringer, 2013

6

The -d option is not available on OS X; use gsplit instead. Hope this is useful for Mac users.

— user5698801
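A script that has to run on both Linux and macOS could pick the binary at run time, roughly as sketched below. This assumes gsplit is GNU split installed under a g-prefixed name (as coreutils packages on macOS commonly do) and that the plain split found otherwise is the GNU version.

```shell
#!/usr/bin/env bash
# Prefer gsplit when it exists, otherwise fall back to split.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

if command -v gsplit >/dev/null 2>&1; then
    SPLIT=gsplit      # GNU split under its g-prefixed name
else
    SPLIT=split       # assumed to be GNU split here, e.g. on Linux
fi

seq 1 30 > data.txt
"$SPLIT" -l 10 -d data.txt data_part_    # numeric suffixes: 00, 01, 02
ls data_part_*
```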

80

How about the split command?

split -l 200000 mybigfile.txt

— Robert Christie

39

Yes, there is a split command. It will split a file by lines or by bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

— Dave Kirby

I tried georgec@ATGIS25 ~ $ split -l 100000 /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/trc_longlands.txt but there are no split files in the directory. Where is the output?

— GeorgeC, 2012

1

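It should be in the directory where you ran the command. For example, if I want to split into files of one million lines each, I run: split -l 1000000 train_file train_file. and in that directory I get train_file.aa with the first million lines, then train_file.ab with the next million, and so on.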

— Will

1

@GeorgeC, you can get a custom output directory by using it as the prefix: split input my/dir/

— Ciro Santilli 郝海东冠状病六四事件法轮功, 2016
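A quick sketch of the prefix trick from these comments, using a made-up directory name (`out/pieces`). Note that the directory has to exist before split runs.

```shell
#!/usr/bin/env bash
# Send the pieces to another directory by putting the path in the prefix.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

seq 1 12 > input.txt
mkdir -p out/pieces                 # split will not create the directory itself
split -l 5 input.txt out/pieces/    # prefix ends in "/", so files are aa, ab, ac

ls out/pieces
set -- out/pieces/*
num_pieces=$#
```

Without a prefix, the same command would write xaa, xab, ... into the current working directory, regardless of where the input file lives.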

15

Use split

Split a file into fixed-size pieces; creates output files containing consecutive sections of INPUT (standard input if none is given or INPUT is '-').

Syntax split [options] [INPUT [PREFIX]]

http://ss64.com/bash/split.html

— Zmbush

13

Use:

sed -n '1,100p' filename > output.txt

Here, 1 and 100 are the first and last line numbers of the range that is written to output.txt.

— 哈什沃德汉

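This only gets the first 100 lines; you would need a loop around it to split the file successively into lines 101..200 and so on. Or just use split, like all the top answers here already tell you.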

— tripleee
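For completeness, the loop this comment alludes to could look roughly like the sketch below. The output naming (`output_N.txt`) and the chunk size are invented for the example, and split remains the simpler tool.

```shell
#!/usr/bin/env bash
# Emulate split with a sed loop: extract one range of lines per iteration.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

seq 1 250 > filename
lines_per_file=100
total=$(( $(wc -l < filename) ))     # total number of lines in the input

start=1
n=1
while [ "$start" -le "$total" ]; do
    end=$((start + lines_per_file - 1))
    # sed simply stops at end of file if the range runs past the last line.
    sed -n "${start},${end}p" filename > "output_${n}.txt"
    start=$((end + 1))
    n=$((n + 1))
done
files_written=$((n - 1))
echo "files written: $files_written"
```

Each iteration makes sed read the input from the top again, so the total work grows with the number of chunks, whereas split reads the file only once.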

10

Split the file "file.txt" into files of 10000 lines each:

split -l 10000 file.txt

— 里亚克韦兹

9

split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, split -n 4 input output. will generate four files (output.a{a,b,c,d}) with the same number of bytes, but lines may be broken in the middle.

If we want to preserve full lines (i.e. split on line boundaries), then this should work:

split -n l/4 input output.

Related answer: https://stackoverflow.com/a/19031247

— Denilson Sá Maia
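One detail worth checking on a small file is that l/N balances the pieces by bytes, not by line count. The sketch below assumes GNU split 8.8 or later and uses a sample file whose lines have different lengths.

```shell
#!/usr/bin/env bash
# split -n l/4: four pieces of similar byte size, never cutting a line.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

seq 1 1000 > input                 # lines are 2 to 5 bytes long
split -n l/4 input output.         # output.aa ... output.ad

total_lines=0
broken=0
for f in output.a?; do
    lines=$(( $(wc -l < "$f") ))
    bytes=$(( $(wc -c < "$f") ))
    printf '%s: %s lines, %s bytes\n' "$f" "$lines" "$bytes"
    total_lines=$((total_lines + lines))
    # Empty result means the last byte is a newline (a complete final line).
    if [ -n "$(tail -c 1 "$f")" ]; then broken=$((broken + 1)); fi
done
echo "total lines: $total_lines, pieces ending mid-line: $broken"
```

Because the early lines of this sample are shorter than the later ones, the first piece ends up with more lines than the last, even though the byte sizes are close.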

9

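If you just want to split by x lines per file, the answers given about split are fine. But I am curious why no one paid attention to the requirements:

"without having to count them" -> use wc
"with the remainder in an extra file" -> split does this by default

I can't do it without wc, but I'm using this:

split -l "$(( $(wc -l < "$filename") / chunks ))" "$filename"

This can easily be turned into a function in your .bashrc, so you can call it by passing the file name and the number of chunks:

split -l "$(( $(wc -l < "$1") / $2 ))" "$1"

If you want at most x chunks, with no extra file for the remainder, adapt the formula to add (chunks - 1) to the line count before dividing, so that the result is rounded up. I do use this approach, because usually I just want x files rather than x lines per file:

split -l "$(( ($(wc -l < "$1") + $2 - 1) / $2 ))" "$1"

You can add it to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)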

— m3nda
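The idea from this answer can be wrapped in a function along the following lines. The name `split_into_chunks` and its optional prefix argument are hypothetical; the lines-per-file value is computed with ceiling division so that no extra remainder file appears.

```shell
#!/usr/bin/env bash
# Split a file into at most CHUNKS pieces by computing lines per piece.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# split_into_chunks FILE CHUNKS [PREFIX]   (hypothetical helper)
split_into_chunks() {
    local file=$1 chunks=$2 prefix=${3:-x}
    local total per_file
    total=$(( $(wc -l < "$file") ))
    per_file=$(( (total + chunks - 1) / chunks ))   # ceiling of total / chunks
    split -l "$per_file" "$file" "$prefix"
}

seq 1 1003 > sample.txt
split_into_chunks sample.txt 10 sample_part_

set -- sample_part_*
num_parts=$#
echo "parts: $num_parts"
```

Here 1003 lines in 10 chunks gives 101 lines per piece, so nine pieces are full and the tenth holds the remaining 94 lines. For very small inputs the result can be fewer pieces than requested, and an empty file would make the computed size zero, which split rejects.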

Or, just use the -n option of split.

— Amit Naidu

8

You can also use awk:

awk -v c=1 '{print > (c ".txt")} NR%200000==0{close(c ".txt"); c++}' largefile

— ghostdog74

3

awk -v lines=200000 -v fmt="%d.txt" '{print>sprintf(fmt,1+int((NR-1)/lines))}'

— Mark Edgar
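Here is a small runnable variation on the sprintf approach from this comment. The chunk size of 10, the `chunk_%03d.txt` naming, and the explicit close() call are choices made for this sketch rather than part of the original one-liner.

```shell
#!/usr/bin/env bash
# Split with awk, naming each piece via sprintf and closing finished files.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

seq 1 25 > largefile

awk -v lines=10 -v fmt="chunk_%03d.txt" '
    {
        out = sprintf(fmt, 1 + int((NR - 1) / lines))   # file for this line
        if (out != prev) {                # first line of a new chunk
            if (prev != "") close(prev)   # release the previous file handle
            prev = out
        }
        print > out
    }
' largefile

ls chunk_*.txt
```

Closing each finished file keeps the number of simultaneously open files at one, which matters with awk implementations that limit open file handles.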

0

Use HDFS getmerge to merge small files, then split the result into pieces of a suitable size.

This method breaks lines in the middle:

split -b 125m compact.file -d -a 3 compact_prefix

Instead, I run getmerge and then split the merged file into pieces of about 128 MB each.

# Split into pieces of about 128 MB. Assumes the size unit reported by hdfs is M or G; please test before use.

beginsize=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $1 }')
sizeunit=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $2 }')
if [ "$sizeunit" = "G" ]; then
    # 1 GB is 8 pieces of 128 MB
    res=$(printf "%.f" "$(echo "scale=5; $beginsize * 8" | bc)")
else
    res=$(printf "%.f" "$(echo "scale=5; $beginsize / 128" | bc)")  # rounding, ref http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo "$res"
# Split into $res files with numeric suffixes. ref http://blog.csdn.net/microzone/article/details/52839598
compact_file_name="${compact_file}_"
echo "compact_file_name: $compact_file_name"
split -n l/"$res" -d -a 3 "$basedir/$compact_file" "$basedir/$compact_file_name"

— 马蒂吉66
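The piece-count arithmetic can be tried without a Hadoop cluster. The sketch below is a local stand-in: it takes the size from `wc -c` on a small made-up file instead of `hdfs dfs -du`, uses a 300-byte target instead of 128 MB, and rounds up with integer ceiling division. It assumes GNU split.

```shell
#!/usr/bin/env bash
# Derive the number of pieces from the file size, then split on line boundaries.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Sample input: 100 lines of 10 bytes each, 1000 bytes in total.
for i in $(seq 1 100); do printf 'line %04d\n' "$i"; done > compact.file

target_bytes=300                                   # stand-in for the 128 MB target
size_bytes=$(( $(wc -c < compact.file) ))
pieces=$(( (size_bytes + target_bytes - 1) / target_bytes ))   # round up

split -n l/"$pieces" -d -a 3 compact.file compact_prefix_
echo "size: $size_bytes bytes, pieces: $pieces"
ls compact_prefix_*
```

Rounding up keeps the average piece at or below the target; individual pieces can still differ slightly because l/N only cuts at line boundaries.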