Comparing two files in the Linux terminal

168

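There are two files called "a.txt" and "b.txt"; both contain a list of words. I want to check which words are extra in "a.txt", that is, present in "a.txt" but not in "b.txt".

I need an efficient algorithm, as I need to compare two dictionaries.

— Ali Imran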

27

Isn't diff a.txt b.txt enough?

— ThanksForAllTheFish

Can a word occur several times in each file? Can you sort the files?

— Basile Starynkevitch

I only need the words that are present in "a.txt" and not present in "b.txt".

— Ali Imran

343

If you have vim installed, try this:

vimdiff file1 file2

or

vim -d file1 file2

You will find it fantastic.

— Fengya Li

9

Absolutely awesome: well designed, and the differences are easy to spot. Oh my god.

— Zen

1

Your answer is great, but my teacher requires me not to use any library functions :P

— Ali Imran, 2015

1

What a great tool! This is immensely helpful.

— user1205577

1

What do these colors mean?

— zygimantus

1

The colored text marks the parts that differ between the two files. @zygimantus

— Fengya Li

73

Sort them and use comm:

comm -23 <(sort a.txt) <(sort b.txt)

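comm compares (sorted) input files and by default outputs three columns: lines unique to a, lines unique to b, and lines present in both. Specifying -1, -2 and/or -3 suppresses the corresponding column. Therefore comm -23 a b lists only the entries unique to a. I use the <(...) syntax to sort the files on the fly; if the files are already sorted, you don't need it.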
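To see what each flag combination keeps, here is a minimal sketch on made-up data. The scratch directory and the fruit words are assumptions for illustration, not part of the question. It sorts into temporary files instead of using <(...), so it should also run in a plain POSIX sh:

```shell
# Hypothetical sample data in a scratch directory (not from the question).
tmp=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmp/a.txt"
printf '%s\n' fig kiwi apple > "$tmp/b.txt"

# comm needs sorted input; sorting into files avoids the bash-only <(...) syntax.
sort "$tmp/a.txt" > "$tmp/a.sorted"
sort "$tmp/b.txt" > "$tmp/b.sorted"

only_a=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted")   # hide columns 2 and 3: lines unique to a
only_b=$(comm -13 "$tmp/a.sorted" "$tmp/b.sorted")   # hide columns 1 and 3: lines unique to b
both=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted")     # hide columns 1 and 2: lines in both

echo "only in a:"; echo "$only_a"
echo "only in b:"; echo "$only_b"
echo "in both:";   echo "$both"
```

With this data, "pear" should be the only word unique to a.txt and "kiwi" the only word unique to b.txt.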

— Anders Johansson

I have added my own answer using only grep commands. Can you tell me whether it is more efficient?

— Ali Imran, 2013

3

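@AliImran comm is more efficient because it does the job in a single pass, without storing the entire file in memory. Since the dictionaries you are using are most likely already sorted, you don't even need to sort them. Using grep -f file1 file2, on the other hand, loads the whole of file1 into memory and compares each line of file2 against all of those entries, which is much less efficient. It is mostly useful when the -f file1 is small and unsorted.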

— Anders Johansson

1

Thanks @AndersJohansson for sharing the "comm" command. It's nifty indeed. I frequently have to do outer joins between files, and this does the trick.

— blispr

Watch out for the newline character... I just found that \n is also included in the comparison.
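One common way line endings trip up such a comparison is when one file uses Windows-style CRLF endings. That is an assumption about the kind of problem meant here, and the file contents below are hypothetical. A minimal sketch of the effect and of stripping the carriage returns with tr:

```shell
# Hypothetical data: a.txt has CRLF line endings, b.txt has plain LF endings.
tmp=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmp/a.txt"
printf 'one\ntwo\n'     > "$tmp/b.txt"

sort "$tmp/a.txt" > "$tmp/a.sorted"
sort "$tmp/b.txt" > "$tmp/b.sorted"

# Without cleanup every line of a.txt looks unique, because "one\r" is not "one".
raw_count=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted" | wc -l)

# Strip the carriage returns before sorting, then compare again.
tr -d '\r' < "$tmp/a.txt" | sort > "$tmp/a.clean"
clean_count=$(comm -23 "$tmp/a.clean" "$tmp/b.sorted" | wc -l)

echo "unique before cleanup: $raw_count"
echo "unique after cleanup:  $clean_count"
```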

— Bin

31

Try sdiff (man sdiff)

sdiff -s file1 file2

— 马德里

28

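You can use the diff tool in Linux to compare two files. The --changed-group-format and --unchanged-group-format options let you filter the output down to the data you need.

Each of these options accepts one of the following three values to select the relevant group:

'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) remove the lines from both files

For example: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt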

[root@vmoracle11 tmp]# cat file1.txt 
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt 
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt 
test two
test four
test eight
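
The opposite direction works the same way. As a minimal sketch, assuming GNU diff and the same two sample files recreated in a scratch directory (the directory is an assumption for illustration), '%>' keeps the lines that come from file2.txt instead:

```shell
# Recreate the sample files from the session above in a scratch directory.
tmp=$(mktemp -d)
printf 'test %s\n' one two three four eight > "$tmp/file1.txt"
printf 'test %s\n' one three nine           > "$tmp/file2.txt"

# '%>' prints the file2 side of every differing group; unchanged groups print nothing.
# diff exits with status 1 when the files differ, hence the "|| true".
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$tmp/file1.txt" "$tmp/file2.txt" || true)

echo "$only_in_file2"
```

Here only "test nine" should be printed, since it is the one line that file2.txt contributes.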

— Manjula

27

If you prefer the diff output style of git diff, you can use it with the --no-index flag to compare files that are not in a git repository:

git diff --no-index a.txt b.txt

Using a couple of files with around 200k file name strings in each, I benchmarked this approach (with the built-in time command) against some of the other answers here:

git diff --no-index a.txt b.txt
# ~1.2s

comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s

diff a.txt b.txt
# ~2.6s

sdiff a.txt b.txt
# ~2.7s

vimdiff a.txt b.txt
# ~3.2s

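comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.

Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:

This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.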

— 黄el

9

You can also use colordiff, which displays the output of diff with colors.

About vimdiff: it lets you compare files over SSH, for example:

vimdiff /var/log/secure scp://192.168.1.25/var/log/secure

Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

— Findlinux一个

6

Also, do not forget about mcdiff, the internal diff viewer of GNU Midnight Commander.

For example:

mcdiff file1 file2

Enjoy!

— Iurii Golskyi

4

Use comm -13 (requires sorted files):

$ cat file1
one
two
three

$ cat file2
one
two
three
four

$ comm -13 <(sort file1) <(sort file2)
four

— Chris Seymour

1

Here is my solution:

# Create the working directories under $HOME, where the commands below expect them
mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words unique to each dictionary
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
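
The three grep steps can be tried on a toy pair of lists before running them on full dictionaries. In these commands -F treats the patterns as fixed strings, -x requires a whole-line match, -v inverts the match, and -f reads the patterns from a file. A minimal sketch, assuming hypothetical three-word files as stand-ins for the dictionaries:

```shell
# Hypothetical three-word lists standing in for the two dictionaries.
tmp=$(mktemp -d)
printf '%s\n' cat color center  > "$tmp/american"
printf '%s\n' cat colour centre > "$tmp/british"

# Lines of british that exactly equal some line of american.
grep -Fxf "$tmp/american" "$tmp/british" > "$tmp/common"

# Lines of each list that are NOT among the common words.
only_american=$(grep -Fxvf "$tmp/common" "$tmp/american")
only_british=$(grep -Fxvf "$tmp/common" "$tmp/british")
common=$(cat "$tmp/common")

echo "common:";        echo "$common"
echo "american only:"; echo "$only_american"
echo "british only:";  echo "$only_british"
```

Without -x, "color" would also match inside a longer line such as "colorful", so the whole-line flag matters for word lists.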

— Ali Imran

2

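Have you tried any of the other solutions? Did one of them work for you? Your question is generic enough to attract many users, but your answer is too specific for my taste... In my particular case, sdiff -s file1 file2 was useful.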

— Metafaniel

@Metafaniel my solution does not use the sdiff command. It uses only built-in Linux commands to solve the problem.

— Ali Imran, 2015

-1

Using awk for it. Test files:

$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one

awk：

$ awk '
NR==FNR {                    # process b.txt  or the first file
    seen[$0]                 # hash words to hash seen
    next                     # next word in b.txt
}                            # process a.txt  or all files after the first
!($0 in seen)' b.txt a.txt   # if word is not hashed to seen, output it

Duplicates are output:

four
four

To avoid duplicates, add each newly met word in a.txt to the seen hash:

$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {              # if word is not hashed to seen
    seen[$0]                 # hash unseen a.txt words to seen to avoid duplicates 
    print                    # and output it
}' b.txt a.txt

Output:

four
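
The deduplicating program above can also be written more compactly with the common `!seen[$0]++` idiom. This is a sketch of an alternative, not the form used in this answer; note that it has to store an explicit 1 for the words of b.txt, because a bare seen[$0] creates an empty entry that the idiom would treat as "not seen yet". The scratch directory is an assumption for illustration:

```shell
# Recreate the test files from above in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' one two three four four > "$tmp/a.txt"
printf '%s\n' three two one           > "$tmp/b.txt"

# First file (b.txt): mark every word with 1.
# Second file (a.txt): print a word only the first time its counter is still 0.
unique_extra=$(awk 'NR==FNR { seen[$0] = 1; next }
                    !seen[$0]++' "$tmp/b.txt" "$tmp/a.txt")

echo "$unique_extra"
```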

If the word lists are comma-separated, like:

$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three

you have to do a couple of extra laps (for loops):

awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
             seen[$i]        # this time we buffer output (below):
             buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output unempty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt

and the output this time:

four
five,six
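
If keeping the per-record grouping does not matter, a different technique is to split the fields onto separate lines with tr and hand them to comm. This is a sketch of an alternative, not the awk approach above: it prints one word per line in sorted order, and the scratch directory is an assumption for illustration:

```shell
# Recreate the comma-separated test files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'four,four,three,three,two,one' 'five,six' > "$tmp/a.txt"
printf '%s\n' 'one,two,three'                            > "$tmp/b.txt"

# Turn every comma into a newline, then sort and drop duplicates.
tr ',' '\n' < "$tmp/a.txt" | sort -u > "$tmp/a.words"
tr ',' '\n' < "$tmp/b.txt" | sort -u > "$tmp/b.words"

# Words that appear in a.txt but not in b.txt, one per line.
extra_words=$(comm -23 "$tmp/a.words" "$tmp/b.words")

echo "$extra_words"
```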

— James Brown