Search for duplicate file names within a folder hierarchy?

29

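I have a folder called img. It has many levels of subfolders, all of which contain images. I am going to import them into an image server.

Normally images (or any files) can have the same name as long as they are in a different directory path or have a different extension. However, the image server I am importing them into requires all image names to be unique (even if the extensions differ).

For example, the images background.png and background.gif would not be allowed, because even though they have different extensions they still have the same file name. Even if they are in separate subfolders, they still need to be unique.

So I am wondering whether I can do a recursive search in the img folder to find a list of files that have the same name (excluding the extension).

Is there a command that can do this?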

command-line bash search

— JD Isaacks

@DavidFoerster You're right! I don't know why I thought this might be a duplicate of "How to find (and delete) duplicate files", but clearly it isn't.

— Eliah Kagan

17

FSlint is a versatile duplicate finder that includes a function for finding duplicate names:

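The FSlint package for Ubuntu emphasizes the graphical interface, but as explained in the FSlint FAQ, a command-line interface is available via the programs in /usr/share/fslint/fslint/. Use the --help option for documentation, e.g.: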

$ /usr/share/fslint/fslint/fslint --help
File system lint.
A collection of utilities to find lint on a filesystem.
To get more info on each utility run 'util --help'.

findup -- find DUPlicate files
findnl -- find Name Lint (problems with filenames)
findu8 -- find filenames with invalid utf8 encoding
findbl -- find Bad Links (various problems with symlinks)
findsn -- find Same Name (problems with clashing names)
finded -- find Empty Directories
findid -- find files with dead user IDs
findns -- find Non Stripped executables
findrs -- find Redundant Whitespace in files
findtf -- find Temporary Files
findul -- find possibly Unused Libraries
zipdir -- Reclaim wasted space in ext2 directory entries
$ /usr/share/fslint/fslint/findsn --help
find (files) with duplicate or conflicting names.
Usage: findsn [-A -c -C] [[-r] [-f] paths(s) ...]

If no arguments are supplied the $PATH is searched for any redundant
or conflicting files.

-A reports all aliases (soft and hard links) to files.
If no path(s) specified then the $PATH is searched.

If only path(s) specified then they are checked for duplicate named
files. You can qualify this with -C to ignore case in this search.
Qualifying with -c is more restictive as only files (or directories)
in the same directory whose names differ only in case are reported.
I.E. -c will flag files & directories that will conflict if transfered
to a case insensitive file system. Note if -c or -C specified and
no path(s) specifed the current directory is assumed.

Example usage:

$ /usr/share/fslint/fslint/findsn /usr/share/icons/ > icons-with-duplicate-names.txt
$ head icons-with-duplicate-names.txt 
-rw-r--r-- 1 root root    683 2011-04-15 10:31 Humanity-Dark/AUTHORS
-rw-r--r-- 1 root root    683 2011-04-15 10:31 Humanity/AUTHORS
-rw-r--r-- 1 root root  17992 2011-04-15 10:31 Humanity-Dark/COPYING
-rw-r--r-- 1 root root  17992 2011-04-15 10:31 Humanity/COPYING
-rw-r--r-- 1 root root   4776 2011-03-29 08:57 Faenza/apps/16/DC++.xpm
-rw-r--r-- 1 root root   3816 2011-03-29 08:57 Faenza/apps/22/DC++.xpm
-rw-r--r-- 1 root root   4008 2011-03-29 08:57 Faenza/apps/24/DC++.xpm
-rw-r--r-- 1 root root   4456 2011-03-29 08:57 Faenza/apps/32/DC++.xpm
-rw-r--r-- 1 root root   7336 2011-03-29 08:57 Faenza/apps/48/DC++.xpm
-rw-r--r-- 1 root root    918 2011-03-29 09:03 Faenza/apps/16/Thunar.png

— ændrük

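Thanks, this worked. Some of the results are shown in purple and some in green. Do you know what the different colors mean?

— JD Isaacks, 2011

@John It looks like FSlint is using ls -l to format its output. This question should explain what the colors mean.

— ændrük

FSlint has a lot of dependencies.

— Navin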

31

find . -mindepth 1 -printf '%h %f\n' | sort -t ' ' -k 2,2 | uniq -f 1 --all-repeated=separate | tr ' ' '/'

As a comment noted, this will also find folders. Here is the command that restricts it to files:

find . -mindepth 1 -type f -printf '%p %f\n' | sort -t ' ' -k 2,2 | uniq -f 1 --all-repeated=separate | cut -d' ' -f1
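The same grouping can be sketched in Python, where each path stays a single value and is never split into fields, so spaces in names do no harm. This is only an illustration: the function name `same_name_groups` is made up here, and like the command above it compares full file names (extension included).

```python
import os
from itertools import groupby


def same_name_groups(top):
    """Return lists of paths under `top` whose file names are identical.

    Hypothetical helper: mirrors `sort | uniq --all-repeated`, but keeps
    each path as one value instead of a space-separated field.
    """
    pairs = []
    for root, _dirs, files in os.walk(top):
        for name in files:
            pairs.append((name, os.path.join(root, name)))

    pairs.sort()  # equal names become adjacent, as with `sort -k 2,2`
    groups = []
    for _name, members in groupby(pairs, key=lambda pair: pair[0]):
        paths = [path for _, path in members]
        if len(paths) > 1:  # keep only names that occur more than once
            groups.append(paths)
    return groups
```

Calling `same_name_groups(".")` from inside the img folder should return one list of paths per repeated name.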

— ojblass

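I changed the solution so that it returns the full (relative) paths of all duplicates. Unfortunately it assumes that path names don't contain spaces, because uniq doesn't provide a way to select a different field delimiter.

— David Foerster

@DavidFoerster, your rev 6 was an improvement, but regarding your comment there, since when is sed obsolete? Arcane? Sure. Obsolete? Not that I'm aware of. (And I just searched to check.)

— cp.engr

@cp.engr: sed isn't obsolete. Its invocation became obsolete after another change of mine.

— David Foerster

@DavidFoerster, then obsolete doesn't seem like the right word to me. I think "obviated" would be a better fit. Regardless, thanks for clarifying.

— cp.engr

@cp.engr: Thanks for the suggestion! I didn't know that word, but it seems to fit the situation better.

— David Foerster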

8

Save this to a file named duplicates.py

#!/usr/bin/env python

# Syntax: duplicates.py DIRECTORY

import os, sys

top = sys.argv[1]
d = {}

for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        fn = os.path.join(root, name)
        basename, extension = os.path.splitext(name)

        basename = basename.lower() # ignore case

        if basename in d:
            print(d[basename])
            print(fn)
        else:
            d[basename] = fn

Then make the file executable:

chmod +x duplicates.py

Run it like this:

./duplicates.py ~/images

It should output pairs of files that have the same basename (1). It is written in Python, so you should be able to modify it.
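What the script treats as the "same name" comes down to `os.path.splitext`, which strips only the last extension. A short sketch of that behavior, using made-up file names, may help when adapting the script:

```python
import os.path

# Only the final extension is split off.
print(os.path.splitext("background.png"))   # ('background', '.png')
print(os.path.splitext("archive.tar.gz"))   # ('archive.tar', '.gz')

# A leading dot is treated as part of the name, not as an extension.
print(os.path.splitext(".DS_Store"))        # ('.DS_Store', '')

# After lower(), names that differ only in case share one dictionary key.
key_a = os.path.splitext("Logo.PNG")[0].lower()
key_b = os.path.splitext("logo.gif")[0].lower()
print(key_a == key_b)                        # True
```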

— loevborg

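It doesn't seem to work properly. It detects P001.ORF and P001 (1).ORF as duplicates, and it also seems to think that 60% of my files are duplicates, which I'm fairly sure is wrong. fslint found a realistic number of duplicate file names, closer to 3%.

— Rolf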

3

I'm assuming you only need to see these "duplicates" and then handle them manually. If so, this bash4 code should do what you want, I think.

declare -A array=() dupes=()
while IFS= read -r -d '' file; do 
    base=${file##*/} base=${base%.*}
    if [[ ${array[$base]} ]]; then 
        dupes[$base]+=" $file"
    else
        array[$base]=$file
    fi
done < <(find /the/dir -type f -print0)

for key in "${!dupes[@]}"; do 
    echo "$key: ${array[$key]}${dupes[$key]}"
done

For help with the associative array syntax, see http://mywiki.wooledge.org/BashGuide/Arrays#Associative_Arrays and/or the bash manual.
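For comparison, roughly the same bookkeeping can be sketched in Python with a dictionary of lists. The function name `group_by_stem` is hypothetical, and `Path.stem` is not an exact match for `${base%.*}`: for a dotfile such as `.bashrc` the shell expansion yields an empty string, while `Path.stem` keeps `.bashrc`.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(top):
    """Map each file stem under `top` to the files sharing it (2+ only)."""
    groups = defaultdict(list)
    for path in Path(top).rglob("*"):
        if path.is_file():  # skip directories, roughly like `find -type f`
            groups[path.stem].append(path)
    # Drop stems that occur only once; they are not duplicates.
    return {stem: paths for stem, paths in groups.items() if len(paths) > 1}
```

Printing each stem followed by its paths gives output in the same shape as the bash loop.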

— geirha

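How do I execute a command like that in a terminal? Is this something I need to save to a file first and then execute the file?

— JD Isaacks, 2011

@John Isaacks You can copy and paste it into the terminal, or you can put it in a file and run it as a script. Either way achieves the same result.

— geirha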

1

Here is bname:

#!/bin/bash
#
# Find other jpg/png/gif files with the same basename.
# Usage: bname FILE EXTENSION
#
# echo "processing ($1) $2"
bname=$(basename "$1" ".$2")
find . -name "$bname.jpg" -or -name "$bname.png" -or -name "$bname.gif"

Make it executable:

chmod a+x bname

Invoke it:

for ext in jpg png jpeg gif tiff; do find -name "*.$ext" -exec ./bname "{}" $ext ";"  ; done

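Pros:

It is simple and straightforward, and therefore easy to extend.
It handles blanks, tabs, line breaks and page feeds in file names, afaik. (Assuming there is no such thing in the extension.)

Cons:

It always finds the file itself, and if it finds a.gif for a.jpg, it will also find a.jpg for a.gif. So for 10 files with the same basename, it ends up finding 100 matches.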
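The arithmetic behind that last point can be checked with a small simulation. It is only a sketch with invented file names, and it assumes every file has an extension that the lookup actually searches for: each of the n files triggers one search, and each search matches all n files, giving n × n lines.

```python
# Ten hypothetical files sharing the basename "a", in separate folders.
files = [f"dir{i}/a.jpg" for i in range(10)]


def basename_without_ext(path):
    """Strip the directory and the final extension."""
    name = path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[0]


# One lookup per file; every lookup reports every file with that basename.
matches = 0
for current in files:
    wanted = basename_without_ext(current)
    matches += sum(1 for other in files if basename_without_ext(other) == wanted)

print(matches)  # 100, i.e. 10 * 10
```

Grouping all files by basename in a single pass, as the dictionary-based answers in this thread do, reports each file once instead.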

— user unknown

0

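I improved loevborg's script for my needs (grouped output, a blacklist, and cleaner output while scanning). I was scanning a 10TB drive, so I needed somewhat cleaner output.

Usage: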

python duplicates.py DIRNAME

duplicates.py

    #!/usr/bin/env python

    # Syntax: duplicates.py DIRECTORY

    import os
    import sys

    top = sys.argv[1]
    d = {}

    file_count = 0

    BLACKLIST = [".DS_Store", ]

    for root, dirs, files in os.walk(top, topdown=False):
        for name in files:
            file_count += 1
            fn = os.path.join(root, name)
            basename, extension = os.path.splitext(name)

            # Enable this if you want to ignore case.
            # basename = basename.lower()

            if basename not in BLACKLIST:
                sys.stdout.write(
                    "Scanning... %s files scanned.  Currently looking at ...%s/\r" %
                    (file_count, root[-50:])
                )

                if basename in d:
                    d[basename].append(fn)
                else:
                    d[basename] = [fn, ]

    print("\nDone scanning. Here are the duplicates found: ")

    for k, v in d.items():
        if len(v) > 1:
            print("%s (%s):" % (k, len(v)))
            for f in v:
                print (f)

— skoczen