Is there a way to delete duplicates with finer control than fdupes -rdN?

22

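Recently I have needed to delete a lot of duplicates. I am merging three or four filesystems, and I want the space to be used economically. At first, fdupes seemed to be the best tool for the job, but I keep running into its limitations.

Consider the command fdupes -rdN somedirectory/. It hashes every file in the subdirectories of somedirectory, and whenever it encounters duplicates it deletes them, so that only one copy of everything remains.

But what if I want to keep somedirectory/subdirectory1/somefile, there are in fact four duplicates, and the program encounters one of the other duplicates first? Then it deletes somedirectory/subdirectory1/somefile, which is not what I want.

I want to be able to specify, somehow, which duplicates to keep. So far, none of the standard programs for dealing with duplicates (duff, FSLint) seem to allow this kind of behaviour to be automated. I would prefer not to roll my own, which is why I am asking this question.

I would like to be able to write something like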

killdupes -rdN --keep=filesin,somedirectories,separated,by,commas somedirectory/

files disk-usage fdupes

— ixtmixilix

I was looking for the same thing, and I found this: superuser.com/a/561207/218922

— Alexis

5

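While the functionality you are looking for is not available in stock fdupes, I forked fdupes (my fork is called jdupes) and added some features that can solve this problem under certain circumstances. For example, in the case you describe, where you want to keep somedirectory/subdirectory1/somefile when auto-deleting duplicates (the d and N switches together) and there are no separate files immediately under somedirectory, jdupes can be given each immediate subdirectory path, with subdirectory1 first, together with the -O switch (which sorts files by command-line parameter order first):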

jdupes -nrdNO somedirectory/subdirectory1 somedirectory/subdirectory2 somedirectory/subdirectory3

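This will auto-delete all but one file in each duplicate set, and it guarantees that if the set contains a file in somedirectory/subdirectory1, that file will be first and will therefore automatically become the preserved file in the set. There are still glaring limits to this approach, such as the fact that another duplicate in somedirectory/subdirectory1 might be preserved instead of the one you wanted to keep, but in a good number of cases like yours the jdupes parameter order option is good enough as a workaround.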
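The selection rule described above can be sketched in Python. This is only an illustration of "order each duplicate set by directory priority and keep the first entry", not how jdupes is implemented. The function names `parse_groups`, `rank`, and `plan` are hypothetical, and the input is assumed to be plain fdupes-style output: one path per line, with a blank line between sets.

```python
import os


def parse_groups(text):
    """Split fdupes-style output into lists of paths, one list per duplicate set."""
    groups, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def rank(path, priority_dirs):
    """Index of the first priority directory containing path; other paths sort last."""
    norm = os.path.normpath(path)
    for index, directory in enumerate(priority_dirs):
        base = os.path.normpath(directory)
        if norm == base or norm.startswith(base + os.sep):
            return index
    return len(priority_dirs)


def plan(groups, priority_dirs):
    """Return one (keep, delete_list) pair per duplicate set; nothing is deleted here."""
    result = []
    for group in groups:
        # sorted() is stable, so ties keep the order in which they were listed.
        ordered = sorted(group, key=lambda p: rank(p, priority_dirs))
        result.append((ordered[0], ordered[1:]))
    return result
```

The sketch only produces a plan, so the list can be reviewed before anything is removed. Like the parameter-order approach, it cannot tell two duplicates inside the same priority directory apart.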

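In the near future I plan to add a filtering system to jdupes that will give a great deal of control over the inclusion and exclusion of files, over what is preserved by -N actions, and over applying such "filter stacks" either globally or per parameter. This feature is sorely needed; I envision something like the following to "auto-delete non-zero duplicates recursively BUT always preserve somedirectory/subdirectory1/somefile as-is":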

jdupes -nrdN --filter=preserve:somedirectory/subdirectory1/somefile somedirectory/

— Jody Lee Bruchon

4

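What about hardlinking the duplicate files together? That way the space is used only once, but the files still exist under all of their paths. The catch is that hardlinked files should not be modified in place, because an in-place change shows up under every linked path; they should only be modified by deleting the file and recreating it with the new content. Another approach is to symlink the files together, although you then have the same problem of deciding which file is the "primary" one. Hardlinking can be done with the following script (note, however, that it does not handle filenames containing spaces).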

fdupes --quiet --recurse --sameline somedirectory/ | while read SOURCE DESTS; do
    for DEST in $DESTS; do
        ln -f $SOURCE $DEST
    done
done
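For filenames that contain spaces, the same idea can be sketched in Python. This is a minimal sketch and not a drop-in replacement for the script above: `hardlink_group` is a hypothetical helper that takes one duplicate set as a list of paths (for example, the lines of one block of plain `fdupes --recurse` output), and the `.hardlink-tmp` suffix is an arbitrary temporary name assumed not to exist already.

```python
import os


def hardlink_group(paths):
    """Replace every path after the first with a hard link to the first.

    Returns the list of paths that were relinked.
    """
    source, relinked = paths[0], []
    for dest in paths[1:]:
        if os.path.samefile(source, dest):
            continue  # already the same inode, nothing to do
        tmp = dest + ".hardlink-tmp"  # hypothetical temporary name next to dest
        os.link(source, tmp)          # create the new link first...
        os.replace(tmp, dest)         # ...then swap it into place in one step
        relinked.append(dest)
    return relinked
```

Creating the link under a temporary name and then renaming it over the destination avoids a window in which the destination is missing. `os.link` raises `OSError` when the two paths are on different filesystems, so this only works within a single filesystem.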

— 米高尔文

1

Using jdupes instead of fdupes, you can simply run jdupes -nrL somedirectory/, which is much faster.

— Jody Lee Bruchon '16

1

There is a typo in the link to jdupes. Convenience link: github.com/jbruchon/jdupes

— 罗伊·威廉姆斯

4

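I haven't seen this one anywhere else: say that what you want is this. You have /mnt/folder-tree-1 and /mnt/folder-tree-2. You don't want to remove every duplicate, but if a file exists in tree-2 and an identical file exists in tree-1 with exactly the same path and name, you want it removed from tree-2.

Warning: this is quite terse, so be careful if you try to copy and paste it with limited shell skills.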

fdupes -rn /mnt/folder-tree-1/ /mnt/folder-tree-2/ > dupes-all.txt

# For every duplicate listed under tree-1, look for the same path under tree-2
grep -F /mnt/folder-tree-1/ dupes-all.txt | while IFS= read -r line
do
    candidate=$(printf '%s\n' "$line" | sed -e 's|^/mnt/folder-tree-1/|/mnt/folder-tree-2/|')
    # -x and -F: match the whole line literally, not as a regular expression
    if grep -qxF -- "$candidate" dupes-all.txt
    then
        echo "rm \"$candidate\""
    fi
done > rm-v2-dupes.sh

Or all on one line:

fdupes -rn /mnt/folder-tree-1/ /mnt/folder-tree-2/ > dupes-all.txt; grep -F /mnt/folder-tree-1/ dupes-all.txt | while IFS= read -r line; do candidate=$(printf '%s\n' "$line" | sed -e 's|^/mnt/folder-tree-1/|/mnt/folder-tree-2/|'); if grep -qxF -- "$candidate" dupes-all.txt; then echo "rm \"$candidate\""; fi; done > rm-v2-dupes.sh

Afterwards, inspect and then execute rm-v2-dupes.sh
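The same check can be sketched in Python, which avoids the quoting pitfalls of the shell version. The name `mirrored_duplicates` is hypothetical, and the input is assumed to be the fdupes output already split into duplicate sets (a list of lists of paths). Unlike the shell pipeline, this sketch requires the tree-1 twin to be in the same duplicate set, not merely somewhere in the list.

```python
import os


def mirrored_duplicates(groups, keep_root, prune_root):
    """Return paths under prune_root whose twin at the same relative path
    under keep_root belongs to the same duplicate set."""
    keep_root = os.path.normpath(keep_root)
    prune_root = os.path.normpath(prune_root)
    removable = []
    for group in groups:
        members = {os.path.normpath(p) for p in group}
        for path in sorted(members):
            if path.startswith(prune_root + os.sep):
                # Map /prune_root/a/b to /keep_root/a/b and look it up in this set.
                twin = os.path.join(keep_root, os.path.relpath(path, prune_root))
                if twin in members:
                    removable.append(path)
    return removable
```

The function only returns a list, so the result can still be written to a script and inspected before anything is deleted.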

— 隆德

4

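I had the same question. If you have many duplicates, fdupes /my/directory/ -rdN keeps the file with the oldest modification date or, if several files have the same modification date, the one found first.

If the modification date is not important to you, you can touch the files in the directory you want to keep. If you touch them with the current date and time, then fdupes -rdNi will keep the ones with the current date. Alternatively, you can touch the files you want to keep with a date earlier than that of the files you want to delete, and use fdupes -rdN as normal.

If you need to keep the modification date, you will need to use one of the other methods.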
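The "touch the files you want to keep" step can be sketched in Python for a whole directory tree. This sketch only changes timestamps; whether fdupes then keeps those files depends on the ordering behaviour described above. The name `touch_tree` is hypothetical.

```python
import os


def touch_tree(root, timestamp):
    """Set the access and modification time of every regular file under root.

    timestamp is in seconds since the epoch. Returns the number of files touched.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that link targets outside root are left alone.
            if os.path.isfile(path) and not os.path.islink(path):
                os.utime(path, (timestamp, timestamp))
                count += 1
    return count
```

For example, `touch_tree("somedirectory/subdirectory1", 0)` would date every file in that directory to 1 January 1970, which is older than any realistic duplicate elsewhere.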

— on

3

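Just to add a twist to a previous answer. I have used the following code several times; it modifies a previous answer slightly by using a simple | grep to isolate the folder I want to delete from.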

`fdupes -r -n -S /directory | grep /delete-from-directory | sed -r "s/^/rm \"/" | sed -r "s/$/\"/" >remove-duplicate-files.sh`

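Again, this creates an sh file that deletes all of the files listed, with no commented-out lines. Of course, you can still edit the file to comment out specific lines/files that you want to keep.

Another tip for large directories is to run fdupes into a txt file, and then experiment with | grep and | sed until you get the result you want.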

`fdupes -r -n -S /directory > duplicate-files.txt`
`cat duplicate-files.txt | grep /delete-from-directory | sed -r "s/^/rm \"/" | sed -r "s/$/\"/" >remove-duplicate-files.sh`
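The grep and sed step can also be sketched in Python, which makes it easier to quote awkward filenames safely. `rm_lines` is a hypothetical helper. It differs from the pipeline above in that it matches the directory as a path prefix, where grep matches it anywhere in the line, and it quotes each path with `shlex.quote` in place of plain double quotes.

```python
import shlex


def rm_lines(fdupes_lines, delete_from):
    """Return one shell-quoted rm command for each listed path under delete_from."""
    prefix = delete_from.rstrip("/") + "/"
    commands = []
    for line in fdupes_lines:
        path = line.rstrip("\n")
        # Size headers from -S and blank separator lines never match the prefix.
        if path.startswith(prefix):
            commands.append("rm -- " + shlex.quote(path))
    return commands
```

A possible use is to open duplicate-files.txt, pass the file object to `rm_lines`, and write the returned lines to remove-duplicate-files.sh for review.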

— fl

2

Use sed to create a shell file that contains commented-out commands to delete each of your duplicate files:

fdupes -r -n -S /directory | sed -r "s/^/#rm \"/" | sed -r "s/$/\"/" >remove-duplicate-files.sh

The resulting remove-duplicate-files.sh file that we have just created will have every line commented out. Uncomment the files you want to delete. Then run sh remove-duplicate-files.sh. Voila!

UPDATE

Well, if you want to protect the files in certain directories from deletion, it is as simple as this:

fdupes -S /directory|sed '/^$/d' |sed -r "s/^[0-9]/#&/" > dupe_list

python exclude_duplicates.py -f /path/to/dupe_list --delimiter='#' --keep=/full/path/to/protected/directory1,/full/path/to/protected/directory2\ with\ spaces\ in\ path >remove-duplicate-files-keep-protected.sh

where exclude_duplicates.py is:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# exclude_duplicates.py
"""
THE SCRIPT DOESN'T DELETE ANYTHING, IT ONLY GENERATES TEXT OUTPUT.
Provided a list of duplicates, such as fdupes or fslint output,
generate a bash script that will have all duplicates in protected
directories commented out. If none of the protected duplicates are
found in a set of the same files, select a random unprotected
duplicate for preserving.
Each path to a file will be transformed to an `rm "path"` string which
will be printed to standard output.
"""

from optparse import OptionParser
parser = OptionParser()
parser.add_option("-k", "--keep", dest="keep",
    help="""List of directories which you want to keep, separated by commas. \
        EXAMPLE: exclude_duplicates.py --keep /path/to/directory1,/path/to/directory\\ with\\ space\\ in\\ path2""",
    metavar="keep"
)
parser.add_option("-d", "--delimiter", dest="delimiter",
    help="Delimiter of duplicate file groups", metavar="delimiter"
)
parser.add_option("-f", "--file", dest="file",
    help="List of duplicate file groups, separated by delimiter, for example, fdupes or fslint output.", metavar="file"
)

(options, args) = parser.parse_args()
directories_to_keep = options.keep.split(',')
file = options.file
delimiter = options.delimiter

pretty_line = '\n#' + '-' * 35
print('#!/bin/bash')
print('#I will protect files in these directories:\n')
for d in directories_to_keep:
    print('# ' + d)
print(pretty_line)

protected_set = set()
group_set = set()

def clean_set(group_set, protected_set, delimiter_line):
    """Print rm lines for one group; protected files and one keeper stay commented out."""
    not_protected_set = group_set - protected_set
    while not_protected_set:
        if len(not_protected_set) == 1 and len(protected_set) == 0:
            print('#randomly selected duplicate to keep:\n#rm "%s"' % not_protected_set.pop().strip('\n'))
        else:
            print('rm "%s"' % not_protected_set.pop().strip('\n'))
    for i in protected_set:
        print('#excluded file in protected directory:\n#rm "%s"' % i.strip('\n'))
    print('\n#%s' % delimiter_line)

line = ''  # stays empty if the input file has no lines
with open(file, 'r') as dupe_list:
    for line in dupe_list:
        if line.startswith(delimiter):
            clean_set(group_set, protected_set, line)
            group_set, protected_set = set(), set()
        else:
            group_set.add(line)
            for d in directories_to_keep:
                if line.startswith(d):
                    protected_set.add(line)

# Flush the last group, which is not followed by a delimiter line.
if line:
    clean_set(group_set, protected_set, line)

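The resulting remove-duplicate-files-keep-protected.sh file that we have just created will have all files from the protected directories commented out. Open this file in your favourite text editor and check that everything is OK. Then run it. Voila!

— Ivan Kharlamov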

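I thought of this, but it is not automated enough. Stupidly, I caused data loss with this method when dealing with duplicates spread across multiple filesystems... There is no way to assign priority, given the output of fdupes. Basically, I would have had to trawl through 10000 files by hand to prevent that data loss... so, no thanks. In fact, that data loss is the very reason I asked this question.

— ixtmixilix, 2012

@ixtmixilix, well, the manual method depends on the user's attentiveness; there is nothing new here. If you want something more automated, check out the updated answer above.

— Ivan Kharlamov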

2

What about something like this?

#!/bin/bash

DUPE_SEARCH_DIR=somedir/
PREFERRED_DIRS=("somedir/subdir1" "somedir/subdir2")
DUPE_FILE="/tmp/$(basename "$0")_found-duplicates"

delete_dupes() {
    while IFS= read -r line ; do
        if [ -n "$line" ] ; then
            matched=false
            for pdir in "${PREFERRED_DIRS[@]}" ; do
                if [[ $line == "$pdir"/* ]] ; then
                    matched=true
                    break
                fi
            done
            if ! $matched ; then
                rm -v "$line"
            fi
        fi
    done < "$DUPE_FILE"
}

cleanup() {
    rm -f "$DUPE_FILE"
}

trap cleanup EXIT

# get rid of normal dupes, preserve first & preserve preferred
fdupes -rf "$DUPE_SEARCH_DIR" > "$DUPE_FILE"
delete_dupes

# get rid of preserve dupes, preserve preferred
fdupes -r "$DUPE_SEARCH_DIR" > "$DUPE_FILE"
delete_dupes

— 伦乔登