Check whether all lines of a file are unique

11

I have a text file with the following contents:

This is a thread  139737522087680
This is a thread  139737513694976
This is a thread  139737505302272
This is a thread  139737312270080
.
.
.
This is a thread  139737203164928
This is a thread  139737194772224
This is a thread  139737186379520

How can I determine whether every line is unique?

Note: the goal is to test the file, not to modify it if duplicate lines are present.

text-processing

— r

1

Link: unix.stackexchange.com/q/76049/117549

— Jeff Schaller

1

Do you want to check whether all lines are unique, or do you want to remove any duplicates?

— 8bittree '18

1

@8bittree - I want to ensure uniqueness

— snr

24

[ "$(wc -l < input)" -eq "$(sort -u input | wc -l)" ] && echo all unique

— Jeff Schaller

I would have said exactly the same, only with uniq instead of sort -u

— Nonny Moose

1

If the input is not already sorted, uniq would be a big mistake; it only removes duplicates that sit on adjacent lines!

— alexis
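
The point about adjacency can be seen on a tiny input. This is only an illustrative sketch: the sample lines and the temporary file are made up for the demonstration and are not taken from the question.

```shell
# Hypothetical sample file: "b" occurs twice, but never on adjacent lines.
tmp=$(mktemp)
printf '%s\n' b a b > "$tmp"

# uniq compares only neighbouring lines, so the repeat goes unnoticed:
unsorted_dupes=$(uniq -d "$tmp")
echo "uniq alone: '${unsorted_dupes}'"

# Sorting first makes equal lines adjacent, so uniq -d can report them:
sorted_dupes=$(sort "$tmp" | uniq -d)
echo "sort | uniq: '${sorted_dupes}'"

rm -f "$tmp"
```

The first echo shows an empty result and the second shows b, which is why uniq is normally fed from sort.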

1

If you're interested in the culprits, sort <file> | uniq -d will print the duplicates.

— Rolf '18

25

AWK solution:

awk 'a[$0]++{print "dupes"; exit(1)}' file && echo "no dupes"

— iruvar

4

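+1 The accepted answer reads the whole file twice, whereas this one stops as soon as it meets a duplicate line, within a single read. It also works on piped input, whereas the other needs a file it can re-read.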

— JoL

Couldn't you shove the echo into END?

— Ignacio Vazquez-Abrams

2

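@IgnacioVazquez-Abrams There's really no point to the echo. The && echo or || echo in answers is a convention to show that the command does the right thing with its exit status code. The important part is the exit(1). Ideally, you'd use it like if has_only_unique_lines file; then ..., not if [[ $(has_only_unique_lines file) = "no dupes" ]]; then ..., which would be silly.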

— JoL
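
A minimal sketch of the exit-status style described in that comment, assuming the awk test from the answer is wrapped in a shell function. has_only_unique_lines is the hypothetical name from the comment, and the array name seen and the sample input are likewise illustrative.

```shell
# has_only_unique_lines: hypothetical helper named in the comment above.
# Exit status 0: every line is unique; 1: at least one line repeats.
has_only_unique_lines() {
    # seen[$0]++ is 0 (false) the first time a line appears, so the action
    # runs only on a repeat. With no file argument awk reads standard input.
    awk 'seen[$0]++ { exit 1 }' "$@"
}

# The caller branches on the exit status rather than on printed text.
if printf '%s\n' A B C | has_only_unique_lines; then
    echo "all lines are unique"
else
    echo "at least one line is duplicated"
fi
```

Because the function passes its arguments straight to awk, it can be called either with a file name or at the end of a pipeline.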

2

Where the other answers read the file twice to save memory, this one reads the whole file into memory if there are no duplicates.

— Kusalananda

1

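@Kusalananda While this reads the whole file into memory when there are no duplicates, sort does that too, whether or not there are duplicates, right? How does that save memory?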

— JoL

21

Using sort/uniq:

sort input.txt | uniq

To check only for duplicate lines, use uniq's -d option. It shows only the lines that are duplicated, and nothing if there are none:

sort input.txt | uniq -d

— jesse_b

This is my go-to. Not sure what the other, higher-voted answers offer that this one doesn't.

— user1717828'7

1

This is a good option for removing duplicates.

— snr

1

This doesn't do what he wants. He wants to know whether there are duplicates, not remove them.

— Barmar

@Barmar: While it does seem that way, the question is still unclear, as is OP's comment attempting to clarify it.

— jesse_b

There's a pending edit that adds more clarification.

— Barmar

5

TLDR

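The original question was unclear, and read as though OP only wanted a unique version of the file's contents. That is shown below. In the since-updated form of the question, OP now states that he/she simply wants to know whether the file's contents are unique.

Test whether the file contents are unique

You can simply use sort to verify whether a file is unique or contains duplicates, e.g. with the command below. Note that -C only checks whether the input is already in sorted order (combined with -u, in strictly increasing order), so this test is reliable only when the file is already sorted; an unsorted file with no repeated lines is also reported as "duplicates". For unsorted input, sort it first: sort input.txt | sort -uC.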

$ sort -uC input.txt && echo "unique" || echo "duplicates"

Example

Say I have these two files:

Sample file with duplicates

$ cat dup_input.txt
This is a thread  139737522087680
This is a thread  139737513694976
This is a thread  139737505302272
This is a thread  139737312270080
This is a thread  139737203164928
This is a thread  139737194772224
This is a thread  139737186379520

Sample file with unique lines

$  cat uniq_input.txt
A
B
C
D

Now when we analyze these files, we can tell whether they are unique or contain duplicates:

Test the duplicates file

$ sort -uC dup_input.txt && echo "unique" || echo "duplicates"
duplicates

Test the unique file

$ sort -uC uniq_input.txt && echo "unique" || echo "duplicates"
unique
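
Since sort's -C option tests whether the input is already in order, the check above also fails for a file whose lines are all different but unsorted. The following is a hedged sketch of a variant for unsorted input; the file contents and the variable names direct and presorted are made up for illustration.

```shell
# Hypothetical input: four unique lines that are not in sorted order.
tmp=$(mktemp)
printf '%s\n' D B A C > "$tmp"

# Checked directly, the file fails only because it is out of order:
direct=$(sort -uC "$tmp" && echo "unique" || echo "duplicates")

# Sorting first leaves repeated lines as the only way to fail the check:
presorted=$(sort "$tmp" | sort -uC && echo "unique" || echo "duplicates")

echo "direct: $direct, presorted: $presorted"
rm -f "$tmp"
```

The direct check reports duplicates for this file while the pre-sorted check reports unique, so the two-stage pipeline is the safer choice when the ordering of the input is unknown.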

Original question (unique contents of a file)

This can be done with just sort:

$ sort -u input.txt
This is a thread  139737186379520
This is a thread  139737194772224
This is a thread  139737203164928
This is a thread  139737312270080
This is a thread  139737505302272
This is a thread  139737513694976
This is a thread  139737522087680

— slm

3

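I usually sort the file, then use uniq to count the number of duplicates, then sort once more to see the duplicates at the bottom of the list.

I added one duplicate to the example you provided: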

$ sort thread.file | uniq -c | sort
      1 This is a thread  139737186379520
      1 This is a thread  139737194772224
      1 This is a thread  139737203164928
      1 This is a thread  139737312270080
      1 This is a thread  139737513694976
      1 This is a thread  139737522087680
      2 This is a thread  139737505302272

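Since I hadn't read the man page for uniq in a while, I took a quick look for alternatives. If you only want to see the duplicates, the following removes the need for the second sort: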

$ sort thread.file | uniq -d
This is a thread  139737505302272

— Carlos Hanson
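
If the counts are still wanted but the unique lines are just noise, one possible middle ground is to filter the uniq -c output on its first field. This is only a sketch on made-up sample data, and it assumes awk is available alongside sort and uniq.

```shell
# Hypothetical sample in which only "b" is repeated.
tmp=$(mktemp)
printf '%s\n' b a c b > "$tmp"

# uniq -c prefixes each distinct line with its count in the first field;
# awk keeps the rows whose count exceeds 1, and sort -n orders them by count.
repeated=$(sort "$tmp" | uniq -c | awk '$1 > 1' | sort -n)
echo "$repeated"

rm -f "$tmp"
```

An empty result then means that every line occurred exactly once.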

Indeed, this is a good alternative. #rez

— snr

2

If there are no duplicates, all lines are unique:

[ "$(sort file | uniq -d)" ] && echo "some line(s) is(are) repeated"

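Description: sort the file's lines so that repeated lines become consecutive (sort),
extract all consecutive lines that are equal (uniq -d).
If there is any output from the command above ([...]), then (&&) print a message.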

— Isaac
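
One edge case worth knowing: command substitution strips trailing newlines, so if the only repeated line is an empty one, the [ "$(...)" ] test sees an empty string and reports nothing. A possible workaround, sketched here with a hypothetical helper name and made-up input, is to let grep decide whether uniq -d produced any line at all.

```shell
# has_duplicate_lines is a hypothetical helper name.
# Exit status 0: some line repeats; 1: every line is unique.
has_duplicate_lines() {
    # grep -q '^' succeeds if uniq -d emits any line at all, even an empty one.
    sort "$1" | uniq -d | grep -q '^'
}

tmp=$(mktemp)
printf '\n\nfoo\n' > "$tmp"      # the empty line occurs twice
if has_duplicate_lines "$tmp"; then
    echo "some line(s) are repeated"
else
    echo "all lines are unique"
fi
rm -f "$tmp"
```

The if/else form also covers both outcomes, whereas the one-liner prints a message only when a repeat is found.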

2

This wouldn't be complete without a Perl answer!

$ perl -ne 'print if ++$a{$_} == 2' yourfile

This prints each non-unique line once: so if it prints nothing, every line in the file is unique.

— 弗拉帕丁格

1

Using cmp and sort in bash:

cmp -s <( sort file ) <( sort -u file ) && echo 'All lines are unique'

or

if cmp -s <( sort file ) <( sort -u file )
then
    echo 'All lines are unique'
else
    echo 'At least one line is duplicated'
fi

But this sorts the file twice, just like the accepted answer.

— Kusalananda