86

I have a file with -| as a delimiter after each section. I need to create a separate file for each section using Unix.

Example of the input file

wertretr
ewretrtret
1212132323
000232
-|
ereteertetet
232434234
erewesdfsfsfs
0234342343
-|
jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

Expected result in file 1

wertretr
ewretrtret
1212132323
000232
-|

Expected result in file 2

ereteertetet
232434234
erewesdfsfsfs
0234342343
-|

Expected result in file 3

jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

— username

1

Are you writing a program, or do you want to do this with command-line utilities?

— rkyser, 2012

1

Using command-line utilities would be the better option.


You can use awk; a 3- or 4-line program would do this easily. Unfortunately, I am out of practice.

— ctrl-alt-delor, 2012

97

A one-liner, no programming. (Except for the regular expression, etc.)

csplit --digits=2  --quiet --prefix=outfile infile "/-|/+1" "{*}"

Tested on: csplit (GNU coreutils) 8.30
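For readers who do not have csplit available, the effect of the command above can be approximated in Python. This is only a sketch under stated assumptions: `iter_sections` and `write_sections` are hypothetical names, and unlike csplit's regular-expression match, the sketch treats only a line that is exactly `-|` as a delimiter.

```python
import os
import tempfile


def iter_sections(lines, delimiter="-|"):
    """Yield lists of lines; each complete list ends with the delimiter line."""
    section = []
    for line in lines:
        section.append(line)
        if line.rstrip("\n") == delimiter:
            yield section
            section = []
    if section:
        # Trailing lines with no closing delimiter still form a section.
        yield section


def write_sections(lines, out_dir, prefix="outfile", digits=2):
    """Write section N to <prefix><N zero-padded to `digits`>, numbering from 0."""
    paths = []
    for number, section in enumerate(iter_sections(lines)):
        path = os.path.join(out_dir, f"{prefix}{number:0{digits}d}")
        with open(path, "w") as out:
            out.writelines(section)
        paths.append(path)
    return paths


if __name__ == "__main__":
    sample = ["wertretr\n", "000232\n", "-|\n", "ereteertetet\n", "-|\n"]
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_sections(sample, tmp):
            print(os.path.basename(path))
```

Because a section is only written once it contains at least one line, no empty file is produced after the final delimiter.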

Notes on usage on Apple Mac

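"For OS X users, note that the version of csplit that comes with the OS doesn't work. You'll want the version in coreutils (installable via Homebrew), which is called gcsplit." — @Daniel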

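"Just to add, you can get the version for OS X to work (at least with High Sierra). You just need to tweak the args a bit: csplit -k -f=outfile infile "/-\|/+1" "{3}". The feature that doesn't seem to work is "{*}"; I had to be specific about the number of separators, and needed to add -k to avoid it deleting all output files if it can't find a final separator. Also, if you want --digits, you need to use -n instead." — @Pebbl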

— ctrl-alt-delor

31

@zb226 I wrote it out with the long option names, so that no explanation was needed.

— ctrl-alt-delor, 2014

5

I suggest adding --elide-empty-files, otherwise there will be an empty file at the end.

— luator, 2014

8

For OS X users, note that the version of csplit that comes with the OS doesn't work. You'll want the version in coreutils (installable via Homebrew), which is called gcsplit.

— Daniel

10

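Just for those wondering what the parameters mean: --digits=2 controls the number of digits used to number the output files (2 is the default for me, so it is not necessary). --quiet suppresses output (also not really necessary or asked for here). --prefix specifies the prefix of the output files (the default is xx). So you can skip all the parameters and will get output files like xx12.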

— Christopher K.

3

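Just to add, you can get the version for OS X to work (at least with High Sierra). You just need to tweak the args a bit: csplit -k -f=outfile infile "/-\|/+1" "{3}". The feature that doesn't seem to work is "{*}"; I had to be specific about the number of separators, and needed to add -k to avoid it deleting all output files if it can't find a final separator. Also, if you want --digits, you need to use -n instead.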

— Pebbl

38

awk '{f="file" NR; print $0 " -|"> f}' RS='-\\|'  input-file

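Explanation (edited):

RS is the record separator, and this solution uses a GNU awk extension which allows it to be more than one character. NR is the record number.

The print statement prints a record followed by " -|" into a file that contains the record number in its name.

— William Pursell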

1

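RS is the record separator, and this solution uses a GNU awk extension which allows it to be more than one character. NR is the record number. The print statement prints a record followed by " -|" into a file that contains the record number in its name.

— William Pursell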

1

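@rzetterbeg This should work well for large files. awk processes the file one record at a time, so it only reads as much as it needs to. If the first occurrence of the record separator shows up very late in the file, memory may run short, since one whole record must fit into memory. Also, note that using more than one character in RS is not standard awk, but it works in GNU awk.

— William Pursell, 2014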
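The record-at-a-time behaviour described in this comment can be illustrated with a small Python sketch. It is not how awk is implemented; `read_records` and `chunk_size` are hypothetical names chosen for the illustration. The generator reads the input in fixed-size chunks and keeps only the unfinished record in memory, so, as noted above, memory use is bounded by the longest single record.

```python
import io


def read_records(stream, separator="-|", chunk_size=65536):
    """Yield the text between separators, reading `stream` in chunks."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last separator seen so far is complete;
        # the final piece may still be growing, so keep it in the buffer.
        *complete, buffer = buffer.split(separator)
        yield from complete
    if buffer.strip():
        # Non-blank text after the final separator is the last record.
        yield buffer


if __name__ == "__main__":
    data = io.StringIO("wertretr\n000232\n-|\nereteertetet\n-|\n")
    for number, record in enumerate(read_records(data), start=1):
        print(f"file{number}: {record!r}")
```

Keeping the unfinished tail in the buffer also handles a separator that happens to straddle two chunks. One deliberate difference from the awk command: whitespace-only text after the last separator is dropped rather than emitted as a record.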

4

For me, it split 3.3 GB in 31.728s.

— Cleankod

3

@ccf The filename is just the string on the right-hand side of the >, so you can construct it however you like. For example, print $0 "-|" > "file" NR ".txt"

— William Pursell '16

1

@AGrush That depends on the version. You can do awk '{f="file" NR; print $0 " -|" > f}'

— William Pursell

7

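Debian has csplit, but I don't know whether that is common to all/most/other distributions. If not, though, it shouldn't be too hard to track down the source and compile it...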

— 沃尔伯格

1

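I agree. My Debian box says that csplit is part of GNU coreutils. So any GNU operating system, such as all the GNU/Linux distributions, will have it. Wikipedia also mentions "The Single UNIX® Specification, Issue 7" on the csplit page, so I suspect you have it.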

— ctrl-alt-delor

3

Since csplit is in POSIX, I would expect it to be available on essentially all Unix-like systems.

— Jonathan Leffler

1

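Although csplit is POSIX, the problem (judging by a test on the Ubuntu system sitting in front of me) is that there is no obvious way to make it use a more modern regex syntax. Compare: csplit --prefix gold-data - "/^==*$/ with csplit --prefix gold-data - "/^=+$/. At least GNU grep has -e.

— 2013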

5

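I solved a slightly different problem, where the file contains a line with the name of the file that the following text should go into. This Perl code does the trick for me: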

#!/path/to/perl -w

# Uncomment the line below on Windows systems only
# use Win32::Clipboard;

use Getopt::Long;

# Show usage and exit when no arguments are given
if (@ARGV == 0) {
    print STDERR "usage: ncsplit.pl [--mff=MARKER] filename.txt [...]\n\nThe mff is found on a line followed by a filename.  All of the contents of filename.txt are written to that file until another mff is found.\n";
    exit;
}

# Get command line flags; --mff takes a string value
my $mff = "";
GetOptions('mff=s' => \$mff);

# set a default $mff variable
if ($mff eq "") { $mff = "-#-" }
print("using file switch=", $mff, "\n\n");

# Keep only the arguments that name existing files
while ($_ = shift @ARGV) {
    if (-f "$_") {
        push @filelist, $_;
    }
}

# Could be more than one file name on the command line,
# but this version throws away the subsequent ones.
$readfile = $filelist[0];

open SOURCEFILE, "<$readfile" or die "File not found...\n\n";

while (<SOURCEFILE>) {
    if (/^$mff (.*)$/o) {
        # A marker line: the rest of the line names the next output file
        $outname = $1;
        open OUTFILE, ">$outname" or die "Cannot open $outname\n";
        print "opened $outname\n";
    }
    else {
        print OUTFILE "$_";
    }
}

— John David Smith

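Could you please explain why this code works? My situation is similar to the one you describe here: the required output file names are embedded inside the file. But I am not a regular Perl user, so I can't quite make sense of this code.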

— shiri

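The real beef is in the final while loop. If it finds the mff regex at the beginning of a line, it uses the rest of the line as the name of the file to open and start writing to. It never closes anything, so it will run out of file handles after a few dozen files.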

— Tripleee '18

The script could be simplified by removing most of the code before the final while loop and switching to while (<>).

— Tripleee '18
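For readers who do not use Perl, the loop described in these comments can be sketched in Python. This is an illustration rather than a translation of the script above: `split_by_named_markers` is a hypothetical name, the default marker `-#-` is taken from the Perl code, and the sketch closes each output file before opening the next one, which is the concern raised in the comment. As an added assumption, only the base name of the embedded file name is used, so that a marker line cannot point outside the output directory.

```python
import os
import tempfile


def split_by_named_markers(lines, out_dir, marker="-#-"):
    """Send each line to the file named on the most recent marker line."""
    out = None
    written = []
    try:
        for line in lines:
            if line.startswith(marker + " "):
                if out is not None:
                    out.close()  # release the previous handle first
                # The rest of the marker line is the output file name.
                name = os.path.basename(line[len(marker) + 1:].strip())
                path = os.path.join(out_dir, name)
                out = open(path, "w")
                written.append(path)
            elif out is not None:
                out.write(line)
            # Lines before the first marker are ignored.
    finally:
        if out is not None:
            out.close()
    return written


if __name__ == "__main__":
    sample = ["-#- a.txt\n", "one\n", "-#- b.txt\n", "two\n"]
    with tempfile.TemporaryDirectory() as tmp:
        for path in split_by_named_markers(sample, tmp):
            print(os.path.basename(path))
```

At most one output handle is open at any time, so the number of sections is not limited by the per-process file handle limit.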

4

The following command works for me. I hope it helps.

awk 'BEGIN{file = 0; filename = "output_" file ".txt"}
    /^-\|/ {getline; file ++; filename = "output_" file ".txt"}
    {print $0 > filename}' input

— han

1

This will typically run out of file handles after a few dozen files. The fix is to explicitly close the old file when you start a new one.

— tripleee

@tripleee How do you close it (beginner awk question)? Can you provide an updated example?

— Jesper Rønn-Jensen, '18

1

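@JesperRønn-Jensen This box is probably too small for any useful example, but basically: if (file) close(filename); before assigning a new filename value.

— tripleee

Aah, found out how to close it: ; close(filename). Really simple, but it does fix the example above.

— Jesper Rønn-Jensen, 2018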

1

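@JesperRønn-Jensen I rolled back your edit because you provided a broken script. Significant edits to other people's answers should be avoided. If you think a separate answer is merited, feel free to post a new answer of your own (perhaps as a community wiki).

— tripleee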

2

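You could also use awk. I'm not very familiar with awk, but the following did seem to work for me. It generated part1.txt, part2.txt, part3.txt, and part4.txt. Note that the last partn.txt file it generates is empty. I'm not sure how to fix that, but I'm sure it could be done with a little tweaking. Any suggestions?

awk_pattern file: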

BEGIN{ fn = "part1.txt"; n = 1 }
{
   print > fn
   if (substr($0,1,2) == "-|") {
       close (fn)
       n++
       fn = "part" n ".txt"
   }
}

bash command:

awk -f awk_pattern input.file
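One possible way to avoid the empty last file, sketched here in Python rather than awk, is to postpone opening the next output file until there is actually a line to write into it. `split_without_empty_tail` is a hypothetical name; the part1.txt, part2.txt naming and the "line starts with -|" test mirror the awk script above.

```python
import os
import tempfile


def split_without_empty_tail(lines, out_dir, delimiter="-|"):
    """Write part1.txt, part2.txt, ... and return how many parts were made."""
    n = 0
    out = None
    for line in lines:
        if out is None:
            # Choose and open the next file only now that a line needs it.
            n += 1
            out = open(os.path.join(out_dir, f"part{n}.txt"), "w")
        out.write(line)
        if line.startswith(delimiter):
            # The delimiter line ends this part; close it straight away.
            out.close()
            out = None
    if out is not None:
        out.close()
    return n


if __name__ == "__main__":
    sample = ["wertretr\n", "-|\n", "ereteertetet\n", "-|\n", "jdhg3875jdfsgfd\n", "-|\n"]
    with tempfile.TemporaryDirectory() as tmp:
        print(split_without_empty_tail(sample, tmp), sorted(os.listdir(tmp)))
```

Since nothing is opened after the final delimiter, an input with three delimited sections yields exactly three files.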

— 克塞尔

2

Here is a Python 3 script that splits one file into multiple files, using the file names provided by the delimiters. Example input file:

# Ignored

######## FILTER BEGIN foo.conf
This goes in foo.conf.
######## FILTER END

# Ignored

######## FILTER BEGIN bar.conf
This goes in bar.conf.
######## FILTER END

Here is the script:

#!/usr/bin/env python3

import os
import argparse

# global settings
start_delimiter = '######## FILTER BEGIN'
end_delimiter = '######## FILTER END'

# parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input filename")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")

args = parser.parse_args()

# read the input file
with open(args.input_file, 'r') as input_file:
    input_data = input_file.read()

# iterate through the input data by line
input_lines = input_data.splitlines()
while input_lines:
    # discard lines until the next start delimiter
    while input_lines and not input_lines[0].startswith(start_delimiter):
        input_lines.pop(0)

    # corner case: no delimiter found and no more lines left
    if not input_lines:
        break

    # extract the output filename from the start delimiter
    output_filename = input_lines.pop(0).replace(start_delimiter, "").strip()
    output_path = os.path.join(args.output_dir, output_filename)

    # open the output file
    print("extracting file: {0}".format(output_path))
    with open(output_path, 'w') as output_file:
        # while we have lines left and they don't match the end delimiter
        while input_lines and not input_lines[0].startswith(end_delimiter):
            output_file.write("{0}\n".format(input_lines.pop(0)))

        # remove end delimiter if present
        if input_lines:
            input_lines.pop(0)

Finally, here is how you run it:

$ python3 script.py -i input-file.txt -o ./output-folder/

— ctrlc-root

2

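Use csplit if you have it.

If you don't, but you have Python... don't use Perl.

Lazy reading of the file

Your file may be too large to hold in memory all at once, so reading it line by line may be preferable. Assume the input file is named "samplein":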

$ python3 -c "from itertools import count
with open('samplein') as file:
    for i in count():
        firstline = next(file, None)
        if firstline is None:
            break
        with open(f'out{i}', 'w') as out:
            out.write(firstline)
            for line in file:
                out.write(line)
                if line == '-|\n':
                    break"

— Aaron Hall

This reads the entire file into memory, which means it will be inefficient or even fail for large files.

— Tripleee '18

1

@tripleee I have updated the answer to handle very large files.

— Aaron Hall

0

cat file| ( I=0; echo -n "">file0; while read line; do echo $line >> file$I; if [ "$line" == '-|' ]; then I=$[I+1]; echo -n "" > file$I; fi; done )

And the formatted version:

#!/bin/bash
cat FILE | (
  I=0;
  echo -n "" > file0;
  while read line; 
  do
    echo $line >> file$I;
    if [ "$line" == '-|' ];
    then I=$[I+1];
      echo -n "" > file$I;
    fi;
  done;
)

— 姆本宁

4

As always, the cat is useless.

— 2013

1

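@Reishin The linked page explains in more detail how you can avoid cat on a single file in every situation. There is also a Stack Overflow question with more discussion (though the accepted answer is, IMHO, off); stackoverflow.com/questions/11710552/useless-use-of-cat

— tripleee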

1

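The shell is typically very inefficient at this sort of thing anyway. If you can't use csplit, an Awk solution is probably preferable to this one (even if you were to fix the problems reported by shellcheck.net etc.; note that it currently doesn't find all the bugs in this).

— tripleee

@tripleee But what if the task is to do this without awk, csplit, etc., using only bash?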

— Reishin

1

Then the cat is still useless, and the rest of the script could be simplified and corrected a good deal. But it will still be slow. See e.g. stackoverflow.com/questions/13762625/…

— tripleee

0

This is the sort of problem I wrote context-split for: http://stromberg.dnsalias.org/~strombrg/context-split.html

$ ./context-split -h
usage:
./context-split [-s separator] [-n name] [-z length]
        -s specifies what regex should separate output files
        -n specifies how output files are named (default: numeric
        -z specifies how long numbered filenames (if any) should be
        -i include line containing separator in output files
        operations are always performed on stdin

— username

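Hmm, this looks like basically a duplicate of the standard csplit utility. See @richard's answer.

— tripleee

This is actually the best solution, IMO. I had to split a 98G MySQL dump, and csplit for some reason ate up all my RAM and was killed, even though it should only need to match one line at a time. It makes no sense. This Python script works much better and doesn't eat up all the RAM.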

— Stefan Midjich

0

Here is some Perl code that will do this:

#!/usr/bin/perl
open(FI,"file.txt") or die "Input file not found";
$cur=0;
open(FO,">res.$cur.txt") or die "Cannot open output file $cur";
while(<FI>)
{
    print FO $_;
    if(/^-\|/)
    {
        close(FO);
        $cur++;
        open(FO,">res.$cur.txt") or die "Cannot open output file $cur"
    }
}
close(FO);

— amaksr

Split one file into multiple files based on a delimiter
