How do I remove words from a txt file that exist in another txt file?

8

File a.txt has about 100k words, each word on its own line:

july.cpp
windows.exe
ttm.rar
document.zip

File b.txt has 150k words, one word per line. Some of the words come from file a.txt, but some are new:

july.cpp    
NOVEMBER.txt    
windows.exe    
ttm.rar    
document.zip    
diary.txt

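How can I merge these files into one, remove all duplicated lines, and keep only the lines that are new (lines that exist in a.txt but not in b.txt, and vice versa)?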

text-processing

— Kate-Kasia

Would you be willing to use Python?

— Tim

2

@MikołajBartnicki Unix.SE would probably be a better place to ask.

— Glutanimate

1

Kasia, I made a mistake in my answer, which is why I deleted it. I am working on a new one.

2

@Glutanimate This question is fine here.

— Seth

1

@Glutanimate Ah, sorry, I somehow missed that comment.

— Seth, 2014

13

There is a command that does this: comm. As stated in man comm, it is simple:

   comm -3 file1 file2
          Print lines in file1 not in file2, and vice versa.

Note that comm expects the file contents to be sorted, so you have to sort the files before calling comm on them, like this:

sort unsorted-file.txt > sorted-file.txt

So, to summarize:

sort a.txt > as.txt

sort b.txt > bs.txt

comm -3 as.txt bs.txt > result.txt

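After running the commands above, result.txt will contain the expected lines. Lines that are unique to bs.txt are indented with a tab, because comm prints them in its second column.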
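
For comparison, roughly the same result can be sketched in Python with a set symmetric difference. This is only an illustration and not part of the answer above: the helper names `symmetric_difference_lines` and `read_lines` are made up here, the sample lists stand in for a.txt and b.txt, and unlike `comm -3` the sketch collapses duplicate lines and returns a single list with no tab indentation.

```python
#!/usr/bin/env python3
"""Sketch: lines that appear in exactly one of two inputs (roughly `comm -3`)."""


def symmetric_difference_lines(lines_a, lines_b):
    """Return the sorted lines found in exactly one of the two inputs.

    Hypothetical helper for illustration. Duplicates are collapsed because
    both inputs are converted to sets.
    """
    return sorted(set(lines_a) ^ set(lines_b))


def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, 'r') as f:
        return [line.rstrip('\n') for line in f]


# Sample data from the question; use read_lines('a.txt') and
# read_lines('b.txt') instead to process the real files.
a_lines = ["july.cpp", "windows.exe", "ttm.rar", "document.zip"]
b_lines = ["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
           "document.zip", "diary.txt"]

for line in symmetric_difference_lines(a_lines, b_lines):
    print(line)
```

Python's `sorted` compares strings by code point, so uppercase names such as NOVEMBER.txt come before lowercase ones. A locale-aware `sort` may order them differently.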

Thanks, it works like a charm. PS. That photo with the mallet on your profile is cool ;-)

— Kate-Kasia

2

Here is a short python3 script, based on Germar's answer, that should do this while preserving the original (unsorted) order of b.txt.

#!/usr/bin/python3

with open('a.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)

with open('b.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            # Uncomment the following if you also want to remove duplicates:
            # a.add(line)
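
The question also asks for lines that are in a.txt but not in b.txt. Below is a minimal sketch of how the script above could be extended to cover both directions while keeping each file's original order. The helper name `unique_to_each` is hypothetical, the sample lists stand in for the two files, and the `old.log` entry is invented purely to show the a.txt-only case.

```python
#!/usr/bin/env python3
"""Sketch: order-preserving lines unique to either of two inputs."""


def unique_to_each(lines_a, lines_b):
    """Return lines found in only one input, keeping their original order.

    Lines unique to lines_a come first, followed by lines unique to lines_b.
    A line that repeats within one input is reported only once.
    """
    set_a, set_b = set(lines_a), set(lines_b)
    seen = set()
    result = []
    for line in lines_a:
        if line not in set_b and line not in seen:
            seen.add(line)
            result.append(line)
    for line in lines_b:
        if line not in set_a and line not in seen:
            seen.add(line)
            result.append(line)
    return result


# Sample data from the question, plus one made-up line ("old.log")
# that exists only in the first list.
a_lines = ["july.cpp", "windows.exe", "old.log", "ttm.rar", "document.zip"]
b_lines = ["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
           "document.zip", "diary.txt"]

print("\n".join(unique_to_each(a_lines, b_lines)))
```

Both inputs are held in memory as sets, which should be fine for files of 100k to 150k short lines.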

— 钟莉莉

1

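#!/usr/bin/env python3

# Load a.txt into a set so that each membership test is fast,
# even with around 100k entries.
with open('a.txt', 'r') as f:
    a = set(f.read().split('\n'))

# Print every non-empty line of b.txt that does not appear in a.txt.
# Iterating over the file avoids stopping early at a blank line.
with open('b.txt', 'r') as f:
    for line in f:
        b = line.strip('\n ')
        if b and b not in a:
            print(b)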

— Germar

2

Man, you are shooting a mosquito with a naval cannon!

:-) You are right. I missed the "k" in 100k.

— Germar, 2014

1

Take a look at the coreutils comm command (man comm):

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

       With  no  options,  produce  three-column  output.  Column one contains
       lines unique to FILE1, column two contains lines unique to  FILE2,  and
       column three contains lines common to both files.

       -1     suppress column 1 (lines unique to FILE1)

       -2     suppress column 2 (lines unique to FILE2)

       -3     suppress column 3 (lines that appear in both files)

So, for example, you can do

$ comm -13 <(sort a.txt) <(sort b.txt)
diary.txt
NOVEMBER.txt

(the lines unique to b.txt)
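
To illustrate why comm needs sorted input, here is a rough Python sketch of a line-by-line merge over two sorted lists that sorts each line into the three columns described in the man page excerpt. It shows the general technique under that assumption and is not a description of how coreutils actually implements comm. The name `comm_columns` is hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: classify lines of two sorted lists into comm-style columns."""


def comm_columns(sorted_a, sorted_b):
    """Return (only_a, only_b, both) for two lists sorted the same way.

    Each step compares only the current line of each list, which is why
    the result is only meaningful when both inputs are sorted.
    """
    only_a, only_b, both = [], [], []
    i = j = 0
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            both.append(sorted_a[i])
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            only_a.append(sorted_a[i])
            i += 1
        else:
            only_b.append(sorted_b[j])
            j += 1
    # Whatever remains in either list has no counterpart in the other.
    only_a.extend(sorted_a[i:])
    only_b.extend(sorted_b[j:])
    return only_a, only_b, both


# Sample data from the question, sorted by code point before comparing.
a_sorted = sorted(["july.cpp", "windows.exe", "ttm.rar", "document.zip"])
b_sorted = sorted(["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
                   "document.zip", "diary.txt"])

only_a, only_b, both = comm_columns(a_sorted, b_sorted)
print("\n".join(only_b))  # corresponds to `comm -13`
```

Because the merge never looks back, a line that is out of order in either input can be reported as unique even though it exists in both files.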

— steeldriver