A tool in UNIX to subtract text files?

16

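I have a big file made up of text fields separated by semicolons, laid out as a large table. It has been sorted. I also have a smaller file made up of the same text fields. At some point, someone concatenated this file with others and then sorted the result to form the big file described above. I'd like to subtract the lines of the small file from the big one (i.e. for each line in the small file, if a matching string exists in the big file, delete that line from the big file).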

The file looks roughly like this:

GenericClass1; 1; 2; NA; 3; 4;
GenericClass1; 5; 6; NA; 7; 8;
GenericClass2; 1; 5; NA; 3; 8;
GenericClass2; 2; 6; NA; 4; 1;

and so on.

Is there a quick, classy way to do this, or do I have to use awk?

files text-processing diff

— Escher

28

You can use grep. Give it the small file as the list of patterns and tell it to print the lines that don't match:

grep -vxFf file.txt bigfile.txt > newbigfile.txt

The options used are:

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)

   -v, --invert-match
          Invert the sense of matching, to select non-matching lines.  (-v
          is specified by POSIX.)
   -x, --line-regexp
          Select only those matches that exactly match the whole line.  
          (-x is specified by POSIX.)
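
As a minimal sketch of how these options combine, the following creates two small files in a scratch directory and runs the command with and without -x. The file contents (including the line ending in "extra") are made up for illustration and are not from the question's real data.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data in the question's format.
printf '%s\n' \
  'GenericClass1; 1; 2; NA; 3; 4;' \
  'GenericClass1; 5; 6; NA; 7; 8;' \
  'GenericClass2; 1; 5; NA; 3; 8;' \
  'GenericClass2; 1; 5; NA; 3; 8; extra' \
  > bigfile.txt
printf '%s\n' 'GenericClass2; 1; 5; NA; 3; 8;' > file.txt

# With -x, only lines identical to a line of file.txt are dropped.
grep -vxFf file.txt bigfile.txt > newbigfile.txt

# Without -x, a line that merely contains a pattern is dropped as well,
# so the line ending in "extra" disappears too.
grep -vFf file.txt bigfile.txt > substring_removed.txt
```

Here newbigfile.txt keeps the line ending in "extra", while substring_removed.txt loses it, which is why -x matters when one line can be a prefix or substring of another.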

— Terdon

Great, that works nicely. Thanks very much.

— Escher, 2014

1

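It works, but it seems to me that it would be better to use the -x option as well, in case a line in the smaller file happens to be a substring of another line in the main file. Also, @UlrichSchwarz's answer is quite likely to be faster.

— rici, 2014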

18

comm is your friend:

NAME comm - compare two sorted files line by line

SYNOPSIS comm [OPTION]... FILE1 FILE2

DESCRIPTION Compare sorted files FILE1 and FILE2 line by line.

   With  no  options, produce three-column output.  Column one contains lines unique to FILE1, column two contains
   lines unique to FILE2, and column three contains lines common to both files.

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

(Because it takes advantage of the sorting, comm may have a performance edge over grep.)

For example:

comm -1 -3 file.txt bigfile.txt > newbigfile.txt
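
A small runnable sketch of the same command, using made-up sample lines in a scratch directory. Setting LC_ALL=C is my own assumption here: comm expects both inputs to be sorted in the collation order it uses itself, and forcing the C locale is one way to keep that consistent.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data, already sorted in C-locale order.
printf '%s\n' \
  'GenericClass1; 1; 2; NA; 3; 4;' \
  'GenericClass1; 5; 6; NA; 7; 8;' \
  'GenericClass2; 1; 5; NA; 3; 8;' \
  'GenericClass2; 2; 6; NA; 4; 1;' \
  > bigfile.txt
printf '%s\n' \
  'GenericClass1; 5; 6; NA; 7; 8;' \
  'GenericClass2; 2; 6; NA; 4; 1;' \
  > file.txt

# -1 drops lines found only in file.txt, -3 drops lines found in both,
# leaving column two: the lines found only in bigfile.txt.
LC_ALL=C comm -1 -3 file.txt bigfile.txt > newbigfile.txt
cat newbigfile.txt
```

One difference from the grep approach: comm pairs lines one to one, so if a line occurs twice in bigfile.txt and once in file.txt, one copy is left in the output, whereas grep -vxFf removes every copy.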

— Ulrich Schwarz

2

Good point about using comm over grep. It would be nice if you gave a specific command-line example, such as comm -1 -3 file.txt bigfile.txt > newbigfile.txt

— Steve Midgley, 2014

I can confirm that I tried the grep command given above with files of about 100MB and got a "Killed" error. Trying it with comm completed successfully.

— Gianluca Casati

Command redirection (process substitution) is useful for unsorted files, or when you need more than two files: comm -1 -3 <(sort BAD.txt GOOD.txt) <(sort FILES.txt)

— odinho-Velmont