Linux command or script to count duplicate lines in a text file?

116

Suppose I have a text file with the following contents:

red apple
green apple
green apple
orange
orange
orange

Is there a Linux command or script that would produce the following result?

1 red apple
2 green apple
3 orange

linux text duplicates

— Timeon
source

214

Send it through sort (to put identical lines next to each other), then through uniq -c to count them, i.e.:

sort filename | uniq -c

To get that list sorted by frequency, you can use:

sort filename | uniq -c | sort -nr
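As a rough sketch of how this plays out on the sample from the question: uniq -c pads each count with leading spaces, so a small sed step is added here to get the unpadded format the question asks for. The file name fruits_demo.txt and the function name count_by_frequency are made up for illustration.

```shell
# Recreate the sample file from the question (fruits_demo.txt is a hypothetical name).
printf '%s\n' 'red apple' 'green apple' 'green apple' 'orange' 'orange' 'orange' > fruits_demo.txt

# Count identical lines, most frequent first, then strip the leading
# padding that uniq -c puts in front of each count.
count_by_frequency() {
    sort -- "$1" | uniq -c | sort -nr | sed 's/^ *//'
}

count_by_frequency fruits_demo.txt
# 3 orange
# 2 green apple
# 1 red apple
```

Swapping sort -nr for sort -n would list the least frequent lines first instead.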

— 糟透了
source

48

Almost the same as the answer above, but if you add the -d option to uniq, it shows only the lines that occur more than once:

sort filename | uniq -cd | sort -nr
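To make the effect of -d concrete, here is a small sketch on the question's sample data, next to the complementary -u option, which keeps only lines that occur exactly once. The file name fruits_demo.txt is a hypothetical placeholder.

```shell
# Sample data from the question (fruits_demo.txt is a hypothetical name).
printf '%s\n' 'red apple' 'green apple' 'green apple' 'orange' 'orange' 'orange' > fruits_demo.txt

# -d: only lines that occur more than once, with their counts.
# Lists "orange" (3) and "green apple" (2); "red apple" is left out.
sort fruits_demo.txt | uniq -cd | sort -nr

# -u: only lines that occur exactly once.
# Prints just "red apple".
sort fruits_demo.txt | uniq -u
```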

— 贾贝里诺
source

1

Thumbs up for the little -d note.

— sepehr 2015

6

uniq -c file

And in case the file is not sorted already:

sort file | uniq -c
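The reason sorting matters is that uniq only compares each line with the one directly before it. A minimal sketch, using a made-up file name unsorted_demo.txt, shows the difference:

```shell
# A file where the two "green apple" lines are not adjacent
# (unsorted_demo.txt is a hypothetical name).
printf '%s\n' 'green apple' 'orange' 'green apple' > unsorted_demo.txt

# Without sorting, uniq sees three separate runs and reports each with count 1.
uniq -c unsorted_demo.txt

# After sorting, identical lines are adjacent, so there are two groups:
# "green apple" with count 2 and "orange" with count 1.
sort unsorted_demo.txt | uniq -c
```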

— 米弗里茨
source

3

Try this:

cat myfile.txt| sort| uniq

— 拉胡尔
source

Without the -c or -d flag, uniq doesn't distinguish duplicated lines from non-duplicated ones, or am I missing something?

— drevicko 2015

2

cat <filename> | sort | uniq -c

— 佩顿
source

2

Can you live with an alphabetically sorted list?

echo "red apple
> green apple
> green apple
> orange
> orange
> orange
> " | sort -u

This prints:

green apple
orange
red apple

or

sort -u FILE

-u stands for unique; here uniqueness is reached only by sorting.

A solution that preserves the order:

echo "red apple
green apple
green apple
orange
orange
orange
" | { old=""; while IFS= read -r line; do if [[ $line != "$old" ]]; then printf '%s\n' "$line"; old=$line; fi; done; }
red apple
green apple
orange

And, with a file:

cat file | { 
old=""
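while IFS= read -r line   # IFS= and -r keep whitespace and backslashes intact
do
  if [[ $line != "$old" ]]   # quoted, so $old is compared literally, not as a pattern
  then
    printf '%s\n' "$line"
    old=$line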
  fi
done }

The last two only remove duplicates that immediately follow one another, which fits your example.

echo "red apple
green apple
lila banana
green apple
" ...

would print both apples, separated by the banana.
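If non-adjacent duplicates like the two apples in the banana example should be dropped as well, while still keeping the order of first appearance, a common awk idiom can do it. This is only a sketch; the function name dedupe_keep_order is made up for illustration.

```shell
dedupe_keep_order() {
    # seen[$0]++ evaluates to 0 (false) the first time a line appears, so
    # !seen[$0]++ is true only then, and awk's default action prints the line.
    awk '!seen[$0]++' "$@"
}

printf '%s\n' 'red apple' 'green apple' 'lila banana' 'green apple' | dedupe_keep_order
# red apple
# green apple
# lila banana
```

Unlike the read loop, this keeps every distinct line in memory, which is the price for catching duplicates that are far apart.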

— 用户未知
source

0

To just get a count:

$> egrep -o '\w+' fruits.txt | sort | uniq -c

      3 apple
      2 green
      1 oragen
      2 orange
      1 red

To get a sorted count:

$> egrep -o '\w+' fruits.txt | sort | uniq -c | sort -nk1
      1 oragen
      1 red
      2 green
      2 orange
      3 apple

EDIT

Aha, this was NOT about word boundaries, my bad. Here is the command to use for whole lines:

$> cat fruits.txt | sort | uniq -c | sort -nk1
      1 oragen
      1 red apple
      2 green apple
      2 orange

— 克里斯·埃伯勒
source

0

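Here is a simple Python script using the Counter type. The benefit is that it does not need to sort the file, and its memory use grows only with the number of distinct lines rather than with the size of the file: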

import collections
import fileinput
import json

print(json.dumps(collections.Counter(map(str.strip, fileinput.input())), indent=2))

Output:

$ cat filename | python3 script.py
{
  "red apple": 1,
  "green apple": 2,
  "orange": 3
}

Or you can use a simple one-liner:

$ cat filename | python3 -c 'print(__import__("json").dumps(__import__("collections").Counter(map(str.strip, __import__("fileinput").input())), indent=2))'
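The same single-pass, no-sort idea can also be sketched in awk, for cases where Python is not at hand. This version remembers the order in which lines first appear, so on the question's sample it prints the counts in the requested format. The names count_in_order and fruits_demo.txt are hypothetical.

```shell
# Sample data from the question (fruits_demo.txt is a hypothetical name).
printf '%s\n' 'red apple' 'green apple' 'green apple' 'orange' 'orange' 'orange' > fruits_demo.txt

count_in_order() {
    awk '
        !($0 in count) { order[++n] = $0 }   # remember each line the first time it appears
        { count[$0]++ }                      # tally every occurrence
        END { for (i = 1; i <= n; i++) print count[order[i]], order[i] }
    ' "$1"
}

count_in_order fruits_demo.txt
# 1 red apple
# 2 green apple
# 3 orange
```

As with Counter, one entry per distinct line is held in memory, so this suits files with many repeats better than files where nearly every line is different.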

— 奥雷斯蒂夫
source