How can I quickly compare two files using .NET?

136

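The typical approaches recommend reading the files as binary through FileStream and comparing them byte by byte.

Would a checksum comparison (such as CRC) be faster?
Are there any .NET libraries that can generate a checksum for a file?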

c# file compare checksum

— 闪光
source

Duplicate: stackoverflow.com/questions/211008/c-file-management

— Jon B

117

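A checksum comparison will most likely be slower than a byte-by-byte comparison.

To generate a checksum, you need to load every byte of the file and process it. You then have to do the same for the second file. The processing will almost certainly be slower than the comparison check.

As for generating a checksum: you can do this easily with the cryptography classes. Here is a short example of generating an MD5 checksum with C#.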
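A minimal sketch of such a checksum helper, assuming the standard System.Security.Cryptography classes. The class and method names (ChecksumExample, ComputeMd5) are made up for illustration and are not from the answer.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ChecksumExample
{
    // Computes the MD5 checksum of a file and returns it as a lowercase hex string.
    // ChecksumExample and ComputeMd5 are hypothetical names used only in this sketch.
    public static string ComputeMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            // ComputeHash(Stream) reads the stream to the end.
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Two files can then be compared with ComputeMd5(path1) == ComputeMd5(path2), at the cost of reading both files completely.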

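However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file and are checking whether a new file is identical to it, pre-computing the checksum of the "existing" file means the disk I/O only has to be done once, on the new file. This would likely be faster than a byte-by-byte comparison.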
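The pre-computed case could look roughly like the sketch below. It assumes SHA-256 as the hash and that the baseline hash was stored somewhere earlier. BaselineCheck, HashFile and MatchesBaseline are hypothetical names.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class BaselineCheck
{
    // Hashes a whole file; run once for the "existing" file and keep the result.
    public static byte[] HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return sha.ComputeHash(stream);
    }

    // Only the candidate file is read from disk here; the baseline hash
    // is assumed to have been computed and stored beforehand.
    public static bool MatchesBaseline(string candidatePath, byte[] baselineHash)
    {
        return HashFile(candidatePath).SequenceEqual(baselineHash);
    }
}
```

A match here means "equal with very high probability", not a byte-for-byte guarantee.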

— Reed Copsey
source

30

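Make sure to take into account where your files are located. If you are comparing local files to a backup halfway across the world (or over a network with horrible bandwidth), you may be better off hashing first and sending a checksum over the network instead of sending a stream of bytes to compare.

— Kim

@ReedCopsey: I have a similar problem, since I need to store input/output files produced by several elaborations that are expected to contain a lot of duplicates. I thought of using precomputed hashes, but do you think I can reasonably assume that if two (e.g. MD5) hashes are equal, the two files are equal, and skip the further byte-by-byte comparison? As far as I know, MD5/SHA1 etc. collisions are really unlikely...

— digEmAll 2014

1

@digEmAll The chance of a collision is small, but you can always use a stronger hash, e.g. SHA256 instead of SHA1, which further reduces the likelihood of a collision.

— Reed Copsey 2014

Thanks for your answer. I'm just getting into .NET. I'm assuming that if one uses the hash code/checksum technique, the hashes of the main folder will be stored persistently somewhere? Out of curiosity, how would you store them for a WPF application; what would you do? (I'm currently looking at XML, text files, or databases.)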

— BKSpurgeon '16

139

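The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, it uses a byte array the size of an Int64 and then compares the resulting numbers.

Here's what I came up with: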

    const int BYTES_TO_READ = sizeof(Int64);

    static bool FilesAreEqual(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] one = new byte[BYTES_TO_READ];
            byte[] two = new byte[BYTES_TO_READ];

            for (int i = 0; i < iterations; i++)
            {
                 fs1.Read(one, 0, BYTES_TO_READ);
                 fs2.Read(two, 0, BYTES_TO_READ);

                if (BitConverter.ToInt64(one,0) != BitConverter.ToInt64(two,0))
                    return false;
            }
        }

        return true;
    }

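In my testing, I saw this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, this method came in at 1063ms, and the method below (straightforward byte-by-byte comparison) at 3031ms. Hashing always came back sub-second, at an average of around 865ms. This testing was done with a ~100MB video file.

Here are the ReadByte and hashing methods I used, for comparison purposes: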

    static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            for (int i = 0; i < first.Length; i++)
            {
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
            }
        }

        return true;
    }

    static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
    {
        byte[] firstHash = MD5.Create().ComputeHash(first.OpenRead());
        byte[] secondHash = MD5.Create().ComputeHash(second.OpenRead());

        for (int i=0; i<firstHash.Length; i++)
        {
            if (firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }

— chsh
source

1

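You made my life easier. Thank you.

— anindis

2

@anindis: For completeness, you may want to read both @Lars's answer and @RandomInsano's answer. Glad it helped, so many years on! :)

— chsh 2015

1

The FilesAreEqual_Hash method should have a using on both file streams too, like the ReadByte method; otherwise it will hang on to both files.

— Ian Mercer

2

Note that FileStream.Read() may actually read fewer bytes than the number requested. You should use StreamReader.ReadBlock() instead.

— Palec

2

In the Int64 version, when the stream length is not a multiple of Int64, the last iteration compares the unfilled bytes using the previous iteration's fill (which should also be equal, so it's fine). Also, if the stream length is less than sizeof(Int64), the unfilled bytes are 0, since C# initializes arrays. IMO, the code should probably comment on these oddities.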

— crokusek

46

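If you do decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:

• For System.IO.FileInfo instances: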

public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
    fi1.Length == fi2.Length &&
    (fi1.Length == 0 || File.ReadAllBytes(fi1.FullName).SequenceEqual(
                        File.ReadAllBytes(fi2.FullName)));

• For System.String path names:

public static bool AreFileContentsEqual(String path1, String path2) =>
                   AreFileContentsEqual(new FileInfo(path1), new FileInfo(path2));

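Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc. But as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line endings, character encoding, media metadata, whitespace, padding, source code comments, etc.) will always be considered not equal.

This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (which is fundamentally optimized to keep small, short-lived allocations extremely cheap), and could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) implies maximally delegating file performance issues to the CLR, BCL, and JIT, benefiting from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.

Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk at all for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though SequenceEqual does in fact give us the "optimization" of abandoning on the first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary to confirm a match.

— Glenn Slayden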
source

3

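This one does not look good for big files. It is bad for memory usage, since it reads both files to the end before it starts comparing the byte arrays. That is why I would rather go for a stream reader with a buffer.

— Krypto_47 '17

3

@Krypto_47 I discussed these factors and the appropriate use in the text of my answer.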

— Glenn Slayden '17

33

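In addition to Reed Copsey's answer:

The worst case is where the two files are identical. In this case it's best to compare the files byte by byte.
If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.

For example, if the two files are of different length, then you know they cannot be identical, and you don't even have to compare their actual content.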

— dtb
source

10

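To be complete: the other big gain is stopping as soon as the bytes at one position are different.

— Henk Holterman

6

@Henk: I thought this was too obvious :-)

— dtb

1

Good point on adding this. It was obvious to me, so I didn't include it, but it's good to mention.

— Reed Copsey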

16

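It gets even faster if you don't read in small 8-byte chunks but put a loop around it, reading a larger chunk. I reduced the average comparison time to 1/4.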

    public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
    {
        bool result;

        if (fileInfo1.Length != fileInfo2.Length)
        {
            result = false;
        }
        else
        {
            using (var file1 = fileInfo1.OpenRead())
            {
                using (var file2 = fileInfo2.OpenRead())
                {
                    result = StreamsContentsAreEqual(file1, file2);
                }
            }
        }

        return result;
    }

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 1024 * sizeof(Int64);
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }

— Lars
source

13

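Generally speaking, the check count1 != count2 isn't correct. Stream.Read() can return fewer bytes than the count you provided, for various reasons.

— porges 2012

1

To ensure the buffer will hold a whole number of Int64 blocks, you may want to compute the size like this: const int bufferSize = 1024 * sizeof(Int64).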

— 杰克

14

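The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may, however, very well be eaten up by the added time of calculating the hash.

Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.

You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.

If the hash code is, for example, 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.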
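The 32-bit figure can be reproduced with a few lines of arithmetic. This is only a sketch that assumes hash values are uniformly distributed; HashCertainty and PercentCertain are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class HashCertainty
{
    // Percentage certainty that two files are identical when their n-bit hashes match,
    // assuming uniformly distributed hash values: 100 * (1 - 1 / 2^n).
    public static double PercentCertain(int bits)
    {
        return 100.0 * (1.0 - Math.Pow(2.0, -bits));
    }

    // Formats the percentage with eight decimal places, independent of the current culture.
    public static string Format(int bits)
    {
        return PercentCertain(bits).ToString("F8", CultureInfo.InvariantCulture);
    }
}
```

HashCertainty.Format(32) yields "99.99999998". For 64 bits and more, a double no longer has enough precision to distinguish the result from 100.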

— Guffa
source

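Use a larger hash and you can get the odds of a false positive to well below the odds that the computer erred while doing the test.

— Loren Pechtel

I disagree about hash time vs. seek time. You can do a lot of calculations during a single head seek. If the odds are high that the files match, I would use a hash with a lot of bits. If there's a reasonable chance of a match, I would compare them a block at a time, say in 1MB blocks. (Pick a block size that 4k divides evenly, to ensure you never split sectors.)

— Loren Pechtel 2015

1

To explain @Guffa's figure of 99.99999998%: it comes from computing 1 - (1 / (2^32)), where 1 / (2^32) is the probability that any single file will have some given 32-bit hash. The probability of two different files having the same hash is the same, because the first file provides the "given" hash value, and we only need to consider whether or not the other file matches that value. With 64- and 128-bit hashing, the certainty rises to 99.999999999999999994% and 99.9999999999999999999999999999999999997% respectively, as if that matters with such unfathomable numbers.

— Glenn Slayden '16

...In fact, these numbers are harder for most people to grasp than the putatively simple notion, albeit true, of "infinitely many files colliding into the same hash code", which may explain why people are unreasonably suspicious of accepting hash-as-equality.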

— Glenn Slayden

13

Edit: This method does not work for comparing binary files!

In .NET 4.0, the File class has the following two new methods:

public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)

Which means you could use:

bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));

— Sam Harwell
source

1

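@dtb: It doesn't work for binary files. You were probably already typing your comment when I realized that and added the edit at the top of my post. :o

— Sam Harwell

@280Z28: I didn't say anything ;-)

— dtb

Wouldn't you also need to store both files in memory?

— RandomInsano 2012

Note that File also has the function ReadAllBytes, which can be used with SequenceEqual as well, so use that instead, as it works on all files. And as @RandomInsano said, this is stored in memory, so while it's perfectly fine for small files, I would be careful using it with large files.

— DaedalusAlpha

1

@DaedalusAlpha It returns an enumerable, so the lines will be loaded on demand and not stored in memory the whole time. ReadAllBytes, on the other hand, does return the whole file as an array.

— IllidanS4 wants Monica back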

7

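Honestly, I think you need to prune your search tree down as much as possible.

Things to check before going byte by byte:

Are the sizes the same?
Is the last byte in file A different from the last byte in file B?

Also, reading large blocks at a time will be more efficient, since drives read sequential bytes more quickly. Going byte by byte not only causes far more system calls, but also makes the read head of a traditional hard drive seek back and forth more often if both files are on the same drive.

Read chunk A and chunk B into byte buffers and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade-off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.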
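The two cheap pre-checks listed above could be sketched as follows. QuickChecks and DefinitelyDifferent are hypothetical names, and a false result only means the full comparison is still needed.

```csharp
using System.IO;

public static class QuickChecks
{
    // Returns true when the files are certainly different, judging only by
    // their lengths and their final byte. False means "still undecided".
    public static bool DefinitelyDifferent(string pathA, string pathB)
    {
        long lengthA = new FileInfo(pathA).Length;
        long lengthB = new FileInfo(pathB).Length;
        if (lengthA != lengthB) return true;
        if (lengthA == 0) return false; // two empty files: nothing left to compare

        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            // Jump straight to the last byte of each file.
            a.Seek(-1, SeekOrigin.End);
            b.Seek(-1, SeekOrigin.End);
            return a.ReadByte() != b.ReadByte();
        }
    }
}
```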

— RandomInsano
source

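Using Array.Equals is a bad idea because it compares the whole array. It is likely that at least one block read will not fill the whole array.

— Doug Clutter 2015

Why is comparing the whole array a bad idea? Why would a block read not fill the array? There's definitely a good tuning point, but that's why you play with the sizes. Extra points for doing the comparison in a separate thread.

— RandomInsano 2015

When you define a byte array, it has a fixed length (e.g. var buffer = new byte[4096]). When you read a block from the file, it may or may not return the full 4096 bytes, for instance if the file is only 3000 bytes long.

— Doug Clutter 2015

Ah, now I understand! Good news is the read will return the number of bytes loaded into the array, so if the array can't be filled, there will be data. Since we're testing for equality, old buffer data won't matter. Docs: msdn.microsoft.com/en-us/library/9kstw824

— v

Also important: my recommendation to use the Equals() method was a bad idea. In Mono, they do a memory compare since the elements are contiguous in memory. Microsoft, however, doesn't override it and only does a reference comparison, which here would always be false.

— RandomInsano 2015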
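
One way to compare only the bytes that were actually read, rather than whole buffers, is a span comparison. This is a sketch that assumes ReadOnlySpan<T> is available (.NET Core 2.1 or later, or the System.Memory package); BufferCompare and PrefixEqual are hypothetical names.

```csharp
using System;

public static class BufferCompare
{
    // Compares only the first `count` bytes of each buffer by value.
    // MemoryExtensions.SequenceEqual does the element-wise comparison.
    public static bool PrefixEqual(byte[] a, byte[] b, int count)
    {
        return new ReadOnlySpan<byte>(a, 0, count)
            .SequenceEqual(new ReadOnlySpan<byte>(b, 0, count));
    }
}
```

For two distinct arrays x and y with the same contents, x.Equals(y) is a reference comparison on the Microsoft runtime, as noted in the comment above, while PrefixEqual(x, y, x.Length) compares the values.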

4

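My answer is a derivative of @lars's but fixes the bug in the call to Stream.Read. I also added some fast-path checks that other answers had, plus input validation. In short, this should be the answer: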

using System;
using System.IO;

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var fi1 = new FileInfo(args[0]);
            var fi2 = new FileInfo(args[1]);
            Console.WriteLine(FilesContentsAreEqual(fi1, fi2));
        }

        public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
        {
            if (fileInfo1 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo1));
            }

            if (fileInfo2 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo2));
            }

            if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }

            if (fileInfo1.Length != fileInfo2.Length)
            {
                return false;
            }
            else
            {
                using (var file1 = fileInfo1.OpenRead())
                {
                    using (var file2 = fileInfo2.OpenRead())
                    {
                        return StreamsContentsAreEqual(file1, file2);
                    }
                }
            }
        }

        private static int ReadFullBuffer(Stream stream, byte[] buffer)
        {
            int bytesRead = 0;
            while (bytesRead < buffer.Length)
            {
                int read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
                if (read == 0)
                {
                    // Reached end of stream.
                    return bytesRead;
                }

                bytesRead += read;
            }

            return bytesRead;
        }

        private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
        {
            const int bufferSize = 1024 * sizeof(Int64);
            var buffer1 = new byte[bufferSize];
            var buffer2 = new byte[bufferSize];

            while (true)
            {
                int count1 = ReadFullBuffer(stream1, buffer1);
                int count2 = ReadFullBuffer(stream2, buffer2);

                if (count1 != count2)
                {
                    return false;
                }

                if (count1 == 0)
                {
                    return true;
                }

                int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
                for (int i = 0; i < iterations; i++)
                {
                    if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                    {
                        return false;
                    }
                }
            }
        }
    }
}

Or, if you want to be super-awesome, use the async variant:

using System;
using System.IO;
using System.Threading.Tasks;

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var fi1 = new FileInfo(args[0]);
            var fi2 = new FileInfo(args[1]);
            Console.WriteLine(FilesContentsAreEqualAsync(fi1, fi2).GetAwaiter().GetResult());
        }

        public static async Task<bool> FilesContentsAreEqualAsync(FileInfo fileInfo1, FileInfo fileInfo2)
        {
            if (fileInfo1 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo1));
            }

            if (fileInfo2 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo2));
            }

            if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }

            if (fileInfo1.Length != fileInfo2.Length)
            {
                return false;
            }
            else
            {
                using (var file1 = fileInfo1.OpenRead())
                {
                    using (var file2 = fileInfo2.OpenRead())
                    {
                        return await StreamsContentsAreEqualAsync(file1, file2).ConfigureAwait(false);
                    }
                }
            }
        }

        private static async Task<int> ReadFullBufferAsync(Stream stream, byte[] buffer)
        {
            int bytesRead = 0;
            while (bytesRead < buffer.Length)
            {
                int read = await stream.ReadAsync(buffer, bytesRead, buffer.Length - bytesRead).ConfigureAwait(false);
                if (read == 0)
                {
                    // Reached end of stream.
                    return bytesRead;
                }

                bytesRead += read;
            }

            return bytesRead;
        }

        private static async Task<bool> StreamsContentsAreEqualAsync(Stream stream1, Stream stream2)
        {
            const int bufferSize = 1024 * sizeof(Int64);
            var buffer1 = new byte[bufferSize];
            var buffer2 = new byte[bufferSize];

            while (true)
            {
                int count1 = await ReadFullBufferAsync(stream1, buffer1).ConfigureAwait(false);
                int count2 = await ReadFullBufferAsync(stream2, buffer2).ConfigureAwait(false);

                if (count1 != count2)
                {
                    return false;
                }

                if (count1 == 0)
                {
                    return true;
                }

                int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
                for (int i = 0; i < iterations; i++)
                {
                    if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                    {
                        return false;
                    }
                }
            }
        }
    }
}

— 安德鲁·阿诺特
source

for (var i = 0; i < count; i += sizeof(long)) { if (BitConverter.ToInt64(buffer1, i) != BitConverter.ToInt64(buffer2, i)) { return false; } }

— 西蒙

2

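My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference compared with comparing the bytes in a byte array.

So it is possible to replace the "Math.Ceiling and iterations" loop in the comment above with the simplest one: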

            for (int i = 0; i < count1; i++)
            {
                if (buffer1[i] != buffer2[i])
                    return false;
            }

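I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check the arguments and then perform the bit shifting) before you compare, and that ends up being the same amount of work as comparing 8 bytes in two arrays.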

— 罗密欧克
source

1

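Array.Equals goes deeper into the system, so it will likely be a lot faster than going byte by byte in C#. I can't speak for Microsoft, but deep down, Mono uses C's memcpy() command for array equality. Can't get much faster than that.

— RandomInsano 2012

2

@RandomInsano I guess you mean memcmp(), not memcpy()

— SQL Police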

1

If the files are not too big, you can use:

public static byte[] ComputeFileHash(string fileName)
{
    using (var stream = File.OpenRead(fileName))
        return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}

Comparing hashes is only feasible if the hashes are useful to store.

(Edited the code to something much cleaner.)

— 塞西尔有一个名字
source

1

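Yet another improvement for large files of identical length might be to not read the files sequentially, but rather compare more or less random blocks.

You can use multiple threads, starting at different positions in the file and comparing either forward or backwards.

This way you can detect changes in the middle or at the end of the file faster than you would get there with a sequential approach.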
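
A rough sketch of that idea, using Parallel.For so that each worker compares one segment of the files through its own pair of read-only streams. The segment count, the 64 KB buffer size, and the names SegmentedCompare, RangeEqual and Fill are all assumptions made for illustration. Whether this is actually faster depends heavily on the storage device.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class SegmentedCompare
{
    // Splits two same-length files into `segments` ranges and compares the ranges in parallel.
    public static bool FilesEqual(string pathA, string pathB, int segments = 4)
    {
        if (segments < 1) throw new ArgumentOutOfRangeException(nameof(segments));
        long length = new FileInfo(pathA).Length;
        if (length != new FileInfo(pathB).Length) return false;
        if (length == 0) return true;

        long segmentSize = (length + segments - 1) / segments; // ceiling division
        int mismatch = 0; // set to 1 by the first worker that finds a difference

        Parallel.For(0, segments, (index, state) =>
        {
            long start = index * segmentSize;
            if (start >= length) return;
            long end = Math.Min(start + segmentSize, length);
            if (!RangeEqual(pathA, pathB, start, end, state))
            {
                Interlocked.Exchange(ref mismatch, 1);
                state.Stop(); // ask the other workers to finish early
            }
        });
        return mismatch == 0;
    }

    private static bool RangeEqual(string pathA, string pathB, long start, long end, ParallelLoopState state)
    {
        var bufA = new byte[64 * 1024];
        var bufB = new byte[64 * 1024];
        using (var a = new FileStream(pathA, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var b = new FileStream(pathB, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            a.Seek(start, SeekOrigin.Begin);
            b.Seek(start, SeekOrigin.Begin);
            long remaining = end - start;
            while (remaining > 0 && !state.IsStopped)
            {
                int want = (int)Math.Min(bufA.Length, remaining);
                if (!Fill(a, bufA, want) || !Fill(b, bufB, want)) return false;
                for (int i = 0; i < want; i++)
                    if (bufA[i] != bufB[i]) return false;
                remaining -= want;
            }
        }
        return true;
    }

    // Reads exactly `count` bytes, looping because Stream.Read may return fewer.
    private static bool Fill(Stream s, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = s.Read(buffer, total, count - total);
            if (n == 0) return false; // the file ended earlier than its reported length
            total += n;
        }
        return true;
    }
}
```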

— 托马斯·科恩斯
source

1

Would disk thrashing cause problems here?

— RandomInsano 2012

On physical disk drives, yes; SSDs would handle this.

— TheLegendaryCopyCoder

1

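If you only need to compare two files, I guess the fastest way would be (in C; I don't know whether it's applicable to .NET):

open both files f1, f2
get the respective file lengths l1, l2
if l1 != l2 the files are different; stop
mmap() both files
use memcmp() on the mmap()ed files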
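
In .NET, the steps above can be approximated with System.IO.MemoryMappedFiles. The sketch below is not a benchmarked implementation: it compares through a view accessor eight bytes at a time instead of calling memcmp(), and MappedCompare is a hypothetical name. It assumes the two paths refer to different files.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedCompare
{
    // Maps both files read-only and compares the mapped views.
    public static bool FilesEqual(string pathA, string pathB)
    {
        long length = new FileInfo(pathA).Length;
        if (length != new FileInfo(pathB).Length) return false;
        if (length == 0) return true; // an empty file cannot be mapped

        using (var mapA = MemoryMappedFile.CreateFromFile(pathA, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var mapB = MemoryMappedFile.CreateFromFile(pathB, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var viewA = mapA.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        using (var viewB = mapB.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            long pos = 0;
            // Compare eight bytes at a time while a full Int64 still fits.
            for (; pos + sizeof(long) <= length; pos += sizeof(long))
                if (viewA.ReadInt64(pos) != viewB.ReadInt64(pos)) return false;
            // Compare the remaining tail (fewer than eight bytes) one byte at a time.
            for (; pos < length; pos++)
                if (viewA.ReadByte(pos) != viewB.ReadByte(pos)) return false;
            return true;
        }
    }
}
```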

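OTOH, if you need to find whether there are duplicate files in a set of N files, then the fastest way is undoubtedly to use a hash, to avoid N-way bit-by-bit comparisons.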
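
For the N-files case, one possible shape is to group by length first and hash only the files that share a length. This is a sketch assuming SHA-256; DuplicateFinder, FindDuplicates and HashHex are hypothetical names. Files in the same group are equal only with very high probability, not with certainty.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // Groups files by length first (cheap), then by the SHA-256 of their content.
    // Returns only the groups that contain two or more paths.
    public static List<List<string>> FindDuplicates(IEnumerable<string> paths)
    {
        return paths
            .GroupBy(p => new FileInfo(p).Length)
            .Where(sameLength => sameLength.Count() > 1)
            .SelectMany(sameLength => sameLength.GroupBy(p => HashHex(p)))
            .Where(sameHash => sameHash.Count() > 1)
            .Select(sameHash => sameHash.ToList())
            .ToList();
    }

    // Hashes one file and returns the digest as a hex string usable as a grouping key.
    private static string HashHex(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(stream));
    }
}
```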

— CAFxX
source

1

Something (hopefully) reasonably efficient:

public class FileCompare
{
    public static bool FilesEqual(string fileName1, string fileName2)
    {
        return FilesEqual(new FileInfo(fileName1), new FileInfo(fileName2));
    }

    /// <summary>
    /// 
    /// </summary>
    /// <param name="file1"></param>
    /// <param name="file2"></param>
    /// <param name="bufferSize">8kb seemed like a good default</param>
    /// <returns></returns>
    public static bool FilesEqual(FileInfo file1, FileInfo file2, int bufferSize = 8192)
    {
        if (!file1.Exists || !file2.Exists || file1.Length != file2.Length) return false;

        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        using (var stream1 = file1.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            using (var stream2 = file2.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
            {

                while (true)
                {
                    var bytesRead1 = stream1.Read(buffer1, 0, bufferSize);
                    var bytesRead2 = stream2.Read(buffer2, 0, bufferSize);

                    if (bytesRead1 != bytesRead2) return false;
                    if (bytesRead1 == 0) return true;
                    if (!ArraysEqual(buffer1, buffer2, bytesRead1)) return false;
                }
            }
        }
    }

    /// <summary>
    /// 
    /// </summary>
    /// <param name="array1"></param>
    /// <param name="array2"></param>
    /// <param name="bytesToCompare"> 0 means compare entire arrays</param>
    /// <returns></returns>
    public static bool ArraysEqual(byte[] array1, byte[] array2, int bytesToCompare = 0)
    {
        if (array1.Length != array2.Length) return false;

        var length = (bytesToCompare == 0) ? array1.Length : bytesToCompare;
        var tailIdx = length - length % sizeof(Int64);

        //check in 8 byte chunks
        for (var i = 0; i < tailIdx; i += sizeof(Int64))
        {
            if (BitConverter.ToInt64(array1, i) != BitConverter.ToInt64(array2, i)) return false;
        }

        //check the remainder of the array, always shorter than 8 bytes
        for (var i = tailIdx; i < length; i++)
        {
            if (array1[i] != array2[i]) return false;
        }

        return true;
    }
}

— Zar Shardan
source

1

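Here are some utility functions that let you determine whether two files (or two streams) contain identical data.

I have provided a "fast" version that is multi-threaded, as it compares byte arrays (each buffer filled from what has been read from each file) in different threads using Tasks.

As expected, it's much faster (around 3x), but it consumes more CPU (because it's multi-threaded) and more memory (because it needs two byte-array buffers per comparison thread).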

    public static bool AreFilesIdenticalFast(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdenticalFast);
    }

    public static bool AreFilesIdentical(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdentical);
    }

    public static bool AreFilesIdentical(string path1, string path2, Func<Stream, Stream, bool> areStreamsIdentical)
    {
        if (path1 == null)
            throw new ArgumentNullException(nameof(path1));

        if (path2 == null)
            throw new ArgumentNullException(nameof(path2));

        if (areStreamsIdentical == null)
            throw new ArgumentNullException(nameof(areStreamsIdentical));

        if (!File.Exists(path1) || !File.Exists(path2))
            return false;

        using (var thisFile = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            using (var valueFile = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                if (valueFile.Length != thisFile.Length)
                    return false;

                if (!areStreamsIdentical(thisFile, valueFile))
                    return false;
            }
        }
        return true;
    }

    public static bool AreStreamsIdenticalFast(Stream stream1, Stream stream2)
    {
        if (stream1 == null)
            throw new ArgumentNullException(nameof(stream1));

        if (stream2 == null)
            throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)

        var tasks = new List<Task<bool>>();
        do
        {
            // consumes more memory (two buffers for each tasks)
            var buffer1 = new byte[bufsize];
            var buffer2 = new byte[bufsize];

            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
            {
                int read3 = stream2.Read(buffer2, 0, 1);
                if (read3 != 0) // not eof
                    return false;

                break;
            }

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            // consumes more cpu
            var task = Task.Run(() =>
            {
                return IsSame(buffer1, buffer2);
            });
            tasks.Add(task);
        }
        while (true);

        Task.WaitAll(tasks.ToArray());
        return !tasks.Any(t => !t.Result);
    }

    public static bool AreStreamsIdentical(Stream stream1, Stream stream2)
    {
        if (stream1 == null)
            throw new ArgumentNullException(nameof(stream1));

        if (stream2 == null)
            throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)
        var buffer1 = new byte[bufsize];
        var buffer2 = new byte[bufsize];

        var tasks = new List<Task<bool>>();
        do
        {
            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
                return stream2.Read(buffer2, 0, 1) == 0; // check not eof

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            if (!IsSame(buffer1, buffer2))
                return false;
        }
        while (true);
    }

    public static bool IsSame(byte[] bytes1, byte[] bytes2)
    {
        if (bytes1 == null)
            throw new ArgumentNullException(nameof(bytes1));

        if (bytes2 == null)
            throw new ArgumentNullException(nameof(bytes2));

        if (bytes1.Length != bytes2.Length)
            return false;

        for (int i = 0; i < bytes1.Length; i++)
        {
            if (bytes1[i] != bytes2[i])
                return false;
        }
        return true;
    }

— 西蒙·穆里尔
source

0

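I think there are applications where a "hash" is faster than comparing byte by byte, for example if you need to compare a file with many others, or have a thumbnail of a photo that can change. It depends on where and how it is used.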

private bool CompareFilesByte(string file1, string file2)
{
    using (var fs1 = new FileStream(file1, FileMode.Open))
    using (var fs2 = new FileStream(file2, FileMode.Open))
    {
        if (fs1.Length != fs2.Length) return false;
        int b1, b2;
        do
        {
            b1 = fs1.ReadByte();
            b2 = fs2.ReadByte();
            // ReadByte returns -1 at end of stream; equal-length files reach it together.
            if (b1 != b2) return false;
        }
        while (b1 >= 0);
    }
    return true;
}

private string HashFile(string file)
{
    using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
    using (var hash = new SHA512CryptoServiceProvider())
    {
        // ComputeHash(Stream) reads to the end of the stream, so the whole file is hashed.
        return Convert.ToBase64String(hash.ComputeHash(fs));
    }
}

private bool CompareFilesWithHash(string file1, string file2)
{
    var str1 = HashFile(file1);
    var str2 = HashFile(file2);
    return str1 == str2;
}

Here is how you can find out which is fastest.

var sw = new Stopwatch();
sw.Start();
var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));
sw.Reset();
sw.Start();
var compare2 = CompareFilesByte(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));

(Optionally, we can save the hash in a database.)

Hope this helps.

— 安东尼奥
source

0

Another answer, derived from @chsh's. MD5 with usings and shortcuts for the same file, files that do not exist, and differing lengths:

/// <summary>
/// Performs an md5 on the content of both files and returns true if
/// they match
/// </summary>
/// <param name="file1">first file</param>
/// <param name="file2">second file</param>
/// <returns>true if the contents of the two files is the same, false otherwise</returns>
public static bool IsSameContent(string file1, string file2)
{
    if (file1 == file2)
        return true;

    FileInfo file1Info = new FileInfo(file1);
    FileInfo file2Info = new FileInfo(file2);

    if (!file1Info.Exists && !file2Info.Exists)
       return true;
    if (!file1Info.Exists && file2Info.Exists)
        return false;
    if (file1Info.Exists && !file2Info.Exists)
        return false;
    if (file1Info.Length != file2Info.Length)
        return false;

    using (FileStream file1Stream = file1Info.OpenRead())
    using (FileStream file2Stream = file2Info.OpenRead())
    { 
        byte[] firstHash = MD5.Create().ComputeHash(file1Stream);
        byte[] secondHash = MD5.Create().ComputeHash(file2Stream);
        for (int i = 0; i < firstHash.Length; i++)
        {
            if (i>=secondHash.Length||firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }
}

— Andrew Taylor
source

You write if (i>=secondHash.Length ... Under what circumstances would two MD5 hashes have different lengths?

— frogpelt

-1

I found this works well: first compare the lengths without reading any data, then compare the byte sequences that were read.

private static bool IsFileIdentical(string a, string b)
{            
   if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
   return (File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b)));
}

— kernowcode
source