Identify duplicate lines in a file without deleting them?

11

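I keep my references in a text file with a long list of entries, each with two (or more) fields.

The first column is the reference's URL; the second column is the title, which may vary a bit depending on how the entry was made. The same goes for the third field, which may or may not be present.

I want to identify, but not remove, entries that have an identical first field (the reference URL). I know about sort -k1,1 -u, but that automatically (non-interactively) removes all but the first hit. Is there a way to just let me know, so I can choose which one to keep?

In the extract below, three lines share the same first field (http://unix.stackexchange.com/questions/49569/). I would like to keep line 2 because it has additional tags (sort, CLI), and delete lines #1 and #3: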

http://unix.stackexchange.com/questions/49569/  unique-lines-based-on-the-first-field
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field   sort, CLI
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field

Is there a program that helps identify such "duplicates"? I could then clean up manually by deleting lines #1 and #3 myself.

command-line sort

— DK Bose

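I don't quite understand your example... Could you give a simplified version of the input and the expected output?

— Oli

Please see if it is clearer now?

— DK Bose, 2014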

9

If I understand your question, I think you need something like:

for dup in $(sort -k1,1 -u file.txt | cut -d' ' -f1); do grep -n -- "$dup" file.txt; done

or:

for dup in $(cut -d " " -f1 file.txt | uniq -d); do grep -n -- "$dup" file.txt; done

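where file.txt is the file containing the data you are interested in.

In the output you will see the line numbers and the lines in which the first field is found two or more times.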
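The loops above have two limitations worth knowing: uniq -d only sees duplicates that are adjacent (so the input must already be sorted), and grep treats the key as a regular expression that may match anywhere in the line. A minimal sketch of an alternative that compares the first field exactly and works on unsorted input is shown below; the function name show_dups is hypothetical and not part of the answer.

```shell
# Hypothetical helper (not from the answer): print "lineno:line" for every
# line of FILE whose first whitespace-separated field occurs more than once.
show_dups() {
    # FILE is read twice: the first pass counts each first field,
    # the second pass prints the lines whose key was counted more than once.
    awk 'NR == FNR { count[$1]++; next }
         NF && count[$1] > 1 { print FNR ":" $0 }' "$1" "$1"
}

# Usage: show_dups file.txt
```

Because the file is passed to awk twice, the lines come out in their original order, which keeps the reported line numbers easy to follow.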

— Radu Rădeanu

3

Thank you: even cut -d " " -f1 file.txt | uniq -d gives me nice output.

— DK Bose

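@DKBose There are probably more possibilities, but I wanted to use your command as well.

— Radu Rădeanu, 2014

Thanks. The second command is the one I like. You can remove the first one. It would also be nice if you explained the code :)

— DK Bose, 2014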

10

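This is a classic problem that can be solved with the uniq command. uniq detects duplicate consecutive lines and can print only the lines that are not repeated (-u, --unique) or only the repeated ones (-d, --repeated).

Since the ordering of duplicate lines is not important to you, you should sort the file first. Then use uniq to print only the unique lines: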

sort yourfile.txt | uniq -u

There is also a -c (--count) option that prints the number of occurrences; it can be combined with -d. See the manual page of uniq for details.
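To make the difference between these options concrete, here is a small sketch on made-up sample data; the helper name uniq_demo and the four sample lines are assumptions for illustration only.

```shell
# Hypothetical demo helper: sort four sample lines (b, a, b, c) and pass
# the given option(s) on to uniq.
uniq_demo() {
    printf '%s\n' b a b c | sort | uniq "$@"
}

uniq_demo -u   # prints the lines that occur only once: a and c
uniq_demo -d   # prints one copy of each repeated line: b
uniq_demo -c   # prefixes every distinct line with its number of occurrences
```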

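If you really do not care about the parts after the first field, you can use the following command to find duplicate keys and print the line number of each one (append another | sort -n to sort the output by line number):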

 cut -d ' ' -f1 .bash_history | nl | sort -k2 | uniq -s8 -D

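Since you want to see the duplicate lines (using the first field as the key), you cannot use uniq directly. What makes automation difficult is that the title parts vary, and a program cannot decide automatically which title should be considered the final one.

Here is an AWK script (save it as script.awk) that takes your text file as input and prints all duplicate lines so you can decide which ones to delete. (awk -f script.awk yourfile.txt)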

#!/usr/bin/awk -f
{
    # Store the line ($0) grouped per URL ($1) with line number (NR) as key
    lines[$1][NR] = $0;
}
END {
    for (url in lines) {
        # find lines that have the URL occur multiple times
        if (length(lines[url]) > 1) {
            for (lineno in lines[url]) {
                # Print duplicate line for decision purposes
                print lines[url][lineno];
                # Alternative: print line number and line
                #print lineno, lines[url][lineno];
            }
        }
    }
}
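Note that lines[$1][NR] is an array of arrays, which is a gawk extension (GNU awk 4.0 or later); other awk implementations such as mawk reject it with a syntax error. The following is a sketch of a variant that only uses one-dimensional arrays and should therefore run in any POSIX awk. The function name dups_by_url is hypothetical, and the "lineno:line" output format is an assumption chosen for illustration.

```shell
# Hypothetical portable variant: group lines by their first field (the URL)
# without arrays of arrays.
dups_by_url() {
    awk '{
             count[$1]++
             # Append "lineno:line" to the text collected for this URL
             lines[$1] = lines[$1] NR ":" $0 "\n"
         }
         END {
             # Print only the groups whose URL was seen more than once
             for (url in count)
                 if (count[url] > 1)
                     printf "%s", lines[url]
         }' "$1"
}

# Usage: dups_by_url yourfile.txt
```

The order in which awk walks the groups in the END block is unspecified, so the groups may appear in any order; lines within a group keep their original order.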

— Lekensteyn

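I think this comes close to what I want, but I need the opposite of -f, --skip-fields=N (avoid comparing the first N fields). In other words, I want only the first field, the URL, to be considered.

— DK Bose, 2014

@DKBose There is a -w (--check-chars) option that limits the comparison to a fixed number of characters, but judging by your example, your first field has a variable length. Since uniq does not support field selection, you have to use a workaround. I'll include an AWK example since that is easier.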

— Lekensteyn

Yes, I was just looking at -w, but the length of the first field is variable :(

— DK Bose, 2014

@DKBose Please see the latest edit

— Lekensteyn, 2014

1

I'm getting awk: script.awk: line 4: syntax error at or near [ awk: script.awk: line 10: syntax error at or near [ awk: script.awk: line 18: syntax error at or near }

— DK Bose, 2014

2

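If I understand you correctly, all you need is something like

awk '{print $1}' file | sort | uniq -c |
    while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done

This prints the line numbers of the lines that contain a duplicated first field, along with the lines themselves. For example, using this file: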

foo bar baz
http://unix.stackexchange.com/questions/49569/  unique-lines-based-on-the-first-field
bar foo baz
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field   sort, CLI
baz foo bar
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field

it will produce this output:

2:http://unix.stackexchange.com/questions/49569/  unique-lines-based-on-the-first-field
4:http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field   sort, CLI
6:http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field

To print only the line numbers, you could do

awk '{print $1}' file | sort | uniq -c |
    while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 1

And to print only the lines:

awk '{print $1}' file | sort | uniq -c |
    while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 2-

Explanation:

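The awk script just prints the first whitespace-separated field of each line. Use $N to print the Nth field. sort sorts the result and uniq -c counts the occurrences of each line.

This is then passed to the while loop, which saves the number of occurrences as $num and the line as $dupe. If $num is greater than one (so the line is duplicated at least once), it searches the file for that line, using -n to print the line number. The -- tells grep that what follows is not a command-line option, which is useful when $dupe can start with -.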
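Once the line numbers are known, the unwanted lines still have to be removed by hand. As a minimal sketch of that last step, the helper below prints a file without the listed line numbers; the name drop_lines and its calling convention are assumptions for illustration and are not part of the commands above.

```shell
# Hypothetical helper (not part of the answer): print FILE without the
# listed line numbers. Usage: drop_lines FILE N [N...]
drop_lines() {
    file=$1
    shift
    awk -v drop="$*" '
        BEGIN {
            # Turn the space-separated list of numbers into a lookup table
            n = split(drop, nums, " ")
            for (i = 1; i <= n; i++) skip[nums[i]] = 1
        }
        !(FNR in skip)   # print only the lines that are not listed
    ' "$file"
}

# Example: drop_lines file 2 6 > file.new
```

Writing to a new file rather than editing in place leaves the original untouched until the result has been checked.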

— terdon

1

No doubt the most verbose one in the list; it could probably be shorter:

#!/usr/bin/python3
import collections
file = "file.txt"

def find_duplicates(file):
    with open(file, "r") as sourcefile:
        data = sourcefile.readlines()
    splitlines = [
        (index, data[index].split("  ")) for index in range(0, len(data))
        ]
    lineheaders = [item[1][0] for item in splitlines]
    dups = [x for x, y in collections.Counter(lineheaders).items() if y > 1]
    dupsdata = []
    for item in dups:
        occurrences = [
            splitlines_item[0] for splitlines_item in splitlines\
                       if splitlines_item[1][0] == item
            ]
        corresponding_lines = [
            "["+str(index)+"] "+data[index] for index in occurrences
            ]
        dupsdata.append((occurrences, corresponding_lines))

    # printing output   
    print("found duplicates:\n"+"-"*17)
    for index in range(0, len(dups)):
        print(dups[index], dupsdata[index][0])
        lines = [item for item in dupsdata[index][1]]
        for line in lines:
            print(line, end = "")


find_duplicates(file)

Given a text file like this:

monkey  banana
dog  bone
monkey  banana peanut
cat  mice
dog  cowmeat

it gives output like:

found duplicates:
-----------------
dog [1, 4]
[1] dog  bone
[4] dog  cowmeat
monkey [0, 2]
[0] monkey  banana
[2] monkey  banana peanut

Once you have picked the lines to remove:

removelist = [2,1]

def remove_duplicates(file, removelist):
    removelist = sorted(removelist, reverse=True)
    with open(file, "r") as sourcefile:
        data = sourcefile.readlines()
    for index in removelist:
        data.pop(index)
    with open(file, "wt") as sourcefile:
        for line in data:
            sourcefile.write(line)

remove_duplicates(file, removelist)

— Jacob Vlijm

0

Please see the following sorted file.txt:

addons.mozilla.org/en-US/firefox/addon/click-to-play-per-element/ ::: C2P per-element
addons.mozilla.org/en-us/firefox/addon/prospector-oneLiner/ ::: OneLiner
askubuntu.com/q/21033 ::: What is the difference between gksudo and gksu?
askubuntu.com/q/21148 ::: openoffice calc sheet tabs (also askubuntu.com/q/138623)
askubuntu.com/q/50540 ::: What is Ubuntu's Definition of a "Registered Application"?
askubuntu.com/q/53762 ::: How to use lm-sensors?
askubuntu.com/q/53762 ::: how-to-use-to-use-lm-sensors
stackoverflow.com/q/4594319 ::: bash - shell replace cr\lf by comma
stackoverflow.com/q/4594319 ::: shell replace cr\lf by comma
wiki.ubuntu.com/ClipboardPersistence ::: ClipboardPersistence
wiki.ubuntu.com/ClipboardPersistence ::: ClipboardPersistence - Ubuntu Wiki
www.youtube.com/watch?v=1olY5Qzmbk8 ::: Create new mime types in Ubuntu
www.youtube.com/watch?v=2hu9JrdSXB8 ::: Change mouse cursor
www.youtube.com/watch?v=Yxfa2fXJ1Wc ::: Mouse cursor size

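Because the list is short, I can see (after sorting) that there are three sets of duplicates.

Then, for example, I can choose to keep:

askubuntu.com/q/53762 ::: How to use lm-sensors?

rather than

askubuntu.com/q/53762 ::: how-to-use-to-use-lm-sensors

But for a longer list this will be difficult. Based on the two answers, one suggesting uniq and the other suggesting cut, I find that this command gives me the output I want: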

$ cut -d " " -f1 file.txt | uniq -d
askubuntu.com/q/53762
stackoverflow.com/q/4594319
wiki.ubuntu.com/ClipboardPersistence
$

— DK Bose

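I have updated my answer with another variant that uses cut. If you are doing deduplication work, line numbers may be very helpful. To print all duplicates, use the -D option instead of -d.

— Lekensteyn, 2014

I think you are better off using: for dup in $(cut -d " " -f1 file.txt | uniq -d); do grep -n $dup file.txt; done as in my answer. It will give you a better preview of what you are interested in.

— Radu Rădeanu, 2014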

0

Here is how I solved it:

file_with_duplicates:

1,a,c
2,a,d
3,a,e <--duplicate
4,a,t
5,b,k <--duplicate
6,b,l
7,b,s
8,b,j
1,b,l
3,a,d <--duplicate
5,b,l <--duplicate

File sorted and deduplicated by columns 1 and 2:

sort -t',' -k1,1 -k2,2 -u file_with_duplicates

File sorted only by columns 1 and 2:

sort -t',' -k1,1 -k2,2 file_with_duplicates

Show only the difference:

diff <(sort -t',' -k1,1 -k2,2 -u file_with_duplicates) <(sort -t',' -k1,1 -k2,2 file_with_duplicates)

3a4
> 3,a,d
6a8
> 5,b,l

— Clint Smith