An effective way to find any file's encoding

115

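Yes, this is a very frequent question, and the matter is vague to me because I don't know much about it.

But I would like a very precise way to find a file's encoding. As precise as Notepad++ is.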

c# encoding

— Fábio Antunes

1

Java's

— Oded

Which encodings? UTF-8 vs. UTF-16, big-endian vs. little-endian? Or are you referring to the old MS-DOS code pages, such as Shift-JIS or Cyrillic?

— dthorpe

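Another possible duplicate: stackoverflow.com/questions/436220/...

— Oded

@Oded: Quote: "The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you."

— Fábio Antunes, 2010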

2

For some background reading, joelonsoftware.com/articles/Unicode.html is a good read. If there is one thing you should know about text, it is that there is no such thing as plain text.

— Martijn, 2015

155

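The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM). If the file does not have a BOM, this approach cannot determine the file's encoding.

*Updated 4/08/2020 to include UTF-32LE detection and to return the correct encoding for UTF-32BE.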

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true);  //UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}

— 2Toad

3

+1. This worked for me too (whereas detectEncodingFromByteOrderMarks did not). I used "new FileStream(filename, FileMode.Open, FileAccess.Read)" to avoid an IOException because the file is read-only.

— Polyfun, 2014

56

UTF-8 files can come without a BOM, in which case this will incorrectly return ASCII.

— user626528, 2014

3

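This answer is wrong. Looking at the reference source for StreamReader, that implementation is what more people will want. It creates new encoding objects rather than using the existing Encoding.Unicode objects, so equality checks will fail (which rarely matters anyway because, for instance, Encoding.UTF8 can return different objects), but it (1) does not use the really weird UTF-7 format, (2) defaults to UTF-8 if no BOM is found, and (3) can be overridden to use a different default encoding.

— hangar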

2

I've had better success with new StreamReader(filename, true).CurrentEncoding

— Benoit

4

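There is a fundamental error in the code: when the big-endian UTF-32 signature (00 00 FE FF) is detected, it returns the system-provided Encoding.UTF32, which is a little-endian encoding (as noted here). Also, as @Nyerguds pointed out, it still does not look for UTF-32LE, which has the signature FF FE 00 00 (according to en.wikipedia.org/wiki/Byte_order_mark). As that user noted, because this signature begins with the 2-byte UTF-16LE signature, the check must come before the 2-byte checks.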

— Glenn Slayden

44

The following code, which uses the StreamReader class, works fine for me:

  using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
  {
      reader.Peek(); // you need this!
      var encoding = reader.CurrentEncoding;
  }

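The trick is the Peek call; without it, .NET has not done anything yet (and has not read the preamble, i.e. the BOM). Of course, any other ReadXXX call made before checking the encoding works too.

If the file has no BOM, the defaultEncodingIfNoBom encoding is used. There is also a StreamReader constructor overload without this parameter (in that case UTF-8 acts as the default), but I recommend defining what you consider the default encoding in your context.

I have tested this successfully with files that have a BOM for UTF-8, UTF-16/Unicode (LE & BE) and UTF-32 (LE & BE). It does not work for UTF-7.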
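The Peek technique can be wrapped in a small helper and tried against throwaway files. This is only a sketch: the class name BomSniffer and the method name Detect are made up for illustration, and the temporary files are assumptions of the demo, not part of the answer.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding StreamReader settles on after
    // it has read the start of the file. 'fallback' is reported when no BOM exists.
    public static Encoding Detect(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // forces the first buffer read, which inspects the BOM
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Encoding.Unicode (UTF-16LE) writes its BOM at the start of the file.
            File.WriteAllText(path, "hello", Encoding.Unicode);
            Console.WriteLine(Detect(path, Encoding.ASCII).WebName);

            // Two plain bytes and no BOM: the fallback encoding is reported.
            File.WriteAllBytes(path, new byte[] { 0x68, 0x69 });
            Console.WriteLine(Detect(path, Encoding.ASCII).WebName);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Passing the fallback explicitly keeps the "no BOM" case visible to the caller, who still has to decide whether that fallback is plausible for the data.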

— Simon Mourier

I get back whatever I set as the default encoding. Could I be missing something?

— DRAM

1

@DRAM - this can happen if the file has no BOM

— Simon Mourier

Thanks @Simon Mourier. I don't expect my PDF / any file to have a BOM. This link, stackoverflow.com/questions/4520184/…, might help someone who is trying to detect the encoding without a BOM.

— DRAM

1

In PowerShell, I had to run $reader.Close(), otherwise the file stayed locked against writing.

foreach ($filename in $args) {
    $reader = [System.IO.StreamReader]::new($filename, [System.Text.Encoding]::Default, $true)
    $peek = $reader.Peek()
    $reader.CurrentEncoding | select BodyName, EncodingName
    $reader.Close()
}

— js2010

1

@SimonMourier This does not work if the file's encoding is UTF-8 without BOM

— Ozkan

11

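I'd try the following steps:

1) Check whether there is a byte order mark

2) Check whether the file is valid UTF-8

3) Use the local "ANSI" code page (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in code pages other than UTF-8 are not valid UTF-8.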
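The three steps can be sketched over a byte array as follows. This is an illustration under assumptions rather than a complete detector: the names EncodingGuesser and Guess are hypothetical, the caller supplies the "ANSI" fallback, and ISO-8859-1 merely stands in for a local code page in the demo. A UTF-16LE file whose first character is NUL is indistinguishable from a UTF-32LE BOM and is treated as UTF-32LE here.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingGuesser
{
    // Ordered so that the 4-byte UTF-32LE BOM (FF FE 00 00) is tested before
    // the 2-byte UTF-16LE BOM (FF FE) that it starts with.
    private static readonly Encoding[] BomCandidates =
    {
        new UTF32Encoding(bigEndian: false, byteOrderMark: true),
        new UTF32Encoding(bigEndian: true, byteOrderMark: true),
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: true),
        new UnicodeEncoding(bigEndian: false, byteOrderMark: true),
        new UnicodeEncoding(bigEndian: true, byteOrderMark: true),
    };

    public static Encoding Guess(byte[] data, Encoding ansiFallback)
    {
        // Step 1: look for a byte order mark.
        foreach (Encoding candidate in BomCandidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (data.Length >= bom.Length && data.Take(bom.Length).SequenceEqual(bom))
                return candidate;
        }

        // Step 2: strict UTF-8 decoding throws on any invalid byte sequence.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(data);
            return strictUtf8;
        }
        catch (DecoderFallbackException)
        {
            // Step 3: not valid UTF-8, so assume the caller's ANSI code page.
            return ansiFallback;
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(Guess(new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, latin1).WebName); // UTF-32LE BOM
        Console.WriteLine(Guess(new byte[] { 0xC3, 0xA9 }, latin1).WebName);             // "é" as UTF-8
        Console.WriteLine(Guess(new byte[] { 0xE9 }, latin1).WebName);                   // "é" as Latin-1
    }
}
```

Pure ASCII input passes step 2 and is reported as UTF-8, which is harmless because ASCII is a subset of UTF-8.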

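— CodesInChaos

This seems like the more correct answer, as the other answers did not work for me. One can do it with File.OpenRead and by reading the first few bytes of the file.

— user420667, 2013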

1

Step 2 is a whole bunch of programming work to check the bit patterns, though.

— Nyerguds

1

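I'm not sure whether decoding actually throws exceptions or just replaces the unrecognized sequences with '?'. I went ahead and wrote a bit-pattern checking class anyway.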

— Nyerguds

3

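When creating an instance of UTF8Encoding, you can pass in an extra parameter that determines whether an exception should be thrown or whether you prefer silent data corruption.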

— CodesInChaos

1

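I like this answer. Most encodings (probably 99% of your use cases) will be either UTF-8 or ANSI (Windows code page 1252). You can check whether the decoded string contains the replacement character (0xFFFD) to determine whether decoding failed.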
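A minimal sketch of the replacement-character check might look like this. The class and method names are hypothetical, and a file that legitimately contains U+FFFD would be misreported as a decoding failure.

```csharp
using System;
using System.Text;

public static class ReplacementCheck
{
    // True when lenient UTF-8 decoding produced no U+FFFD replacement characters.
    public static bool DecodesCleanlyAsUtf8(byte[] data)
    {
        // Encoding.UTF8 does not throw; it substitutes U+FFFD for invalid bytes.
        string text = Encoding.UTF8.GetString(data);
        return text.IndexOf('\uFFFD') < 0;
    }

    public static void Main()
    {
        Console.WriteLine(DecodesCleanlyAsUtf8(new byte[] { 0x63, 0x61, 0x66, 0xC3, 0xA9 })); // "café" as UTF-8
        Console.WriteLine(DecodesCleanlyAsUtf8(new byte[] { 0x63, 0x61, 0x66, 0xE9 }));       // "café" as code page 1252
    }
}
```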

— marsze

10

Check this out.

UDE

This is a port of the Mozilla Universal Charset Detector, and you can use it like this...

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}

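— Alexei Agüero Alba

You should know that UDE is GPL

— lindexi

OK, if you are worried about the license, then you can use this one. It is licensed under MIT, and you can use it in both open-source and closed-source software. nuget.org/packages/SimpleHelpers.FileEncoding

— Alexei

The license is MPL with a GPL option.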

The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").

— jbtule

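It appears this fork is currently the most active and has a NuGet package, UDE.Netstandard. github.com/yinyue200/ude

— jbtule

Very useful library, it copes with a lot of different and unusual encodings! Thanks!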

— mshakurov

6

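Providing the implementation details for the steps proposed by @CodesInChaos:

1) Check whether there is a byte order mark

2) Check whether the file is valid UTF-8

3) Use the local "ANSI" code page (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in code pages other than UTF-8 are not valid UTF-8. https://stackoverflow.com/a/4522251/867248 explains the tactic in more detail.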

using System; using System.IO; using System.Text;

// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage

public string DetectFileEncoding(Stream fileStream)
{
    var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
           detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
    {
        string detectedEncoding;
        try
        {
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
            }
            detectedEncoding = reader.CurrentEncoding.BodyName;
        }
        catch (DecoderFallbackException)
        {
            // Failed to decode the file using the BOM/UTF-8 encoding.
            // Assume it's local ANSI
            detectedEncoding = "ISO-8859-1";
        }
        // Rewind the stream
        fileStream.Seek(0, SeekOrigin.Begin);
        return detectedEncoding;
   }
}


[Test]
public void Test1()
{
    Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
    var detectedEncoding = DetectFileEncoding(fs);

    using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
    {
       // Consume your file
        var line = reader.ReadLine();
        ...

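— Berthier Lemieux

Thanks! This solved it for me. But I would prefer to use just reader.Peek() instead of while (!reader.EndOfStream) { var line = reader.ReadLine(); }

— Harison Silva

reader.Peek() does not read the whole stream. I found that with larger streams, Peek() was inadequate. I used reader.ReadToEndAsync() instead.

— Gary Pendlebury

And what is Utf8EncodingVerifier?

— Peter Moore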

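@PeterMoore It's an encoding for UTF-8. The object created by

var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());

is used in the try block while reading lines. If the decoder fails to parse the provided text (because the text is not encoded as UTF-8), Utf8EncodingVerifier throws an exception. The exception is caught, so we know the text is not UTF-8 and default to ISO-8859-1.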

— Berthier Lemieux

2

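The following is my PowerShell code to determine whether some cpp, h or ml files are encoded in ISO-8859-1 (Latin-1) or UTF-8 without BOM; if neither, GB18030 is assumed. I am a Chinese developer working in France, and MSVC saves files as Latin-1 on French computers and as GB on Chinese computers, so this helps avoid encoding problems when exchanging source files between my system and my colleagues'.

The method is simple. If all characters are within x00-x7E, then ASCII, UTF-8 and Latin-1 are all the same. If I read a non-ASCII file as UTF-8 and it is not UTF-8, the replacement character � shows up, so I then try reading it as Latin-1. In Latin-1, the range \x7F-\xAF is effectively unused in source files (\x7F-\x9F are control codes and \xA0-\xAF are rarely used symbols), whereas GB uses the full range x00-xFF, so if any value falls in that range, the file is not Latin-1.

The code is written in PowerShell, but it uses .NET, so it is easy to translate into C# or F#.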

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c=$regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c=$regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        } 
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
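As a rough illustration of how the classification part of this script might translate to C#, the sketch below applies the same two tests to a byte array. It only names the guessed encoding and does not rewrite any file; the class and method names are hypothetical. The heuristic is probabilistic: GB text whose bytes all fall outside 0x7F-0xAF would be classified as Latin-1.

```csharp
using System;
using System.Linq;
using System.Text;

public static class SourceFileClassifier
{
    // Returns "UTF-8", "ISO-8859-1" or "GB18030", in the answer's order of tests.
    public static string Classify(byte[] data)
    {
        // Lenient UTF-8 decoding marks undecodable bytes with U+FFFD.
        // Pure ASCII input also passes this test.
        if (Encoding.UTF8.GetString(data).IndexOf('\uFFFD') < 0)
            return "UTF-8";

        // Latin-1 maps every byte to the code point of the same value, so the
        // script's [\x7F-\xAF] regex is equivalent to this byte-range test.
        bool hasUnlikelyLatin1Byte = data.Any(b => b >= 0x7F && b <= 0xAF);
        return hasUnlikelyLatin1Byte ? "GB18030" : "ISO-8859-1";
    }

    public static void Main()
    {
        Console.WriteLine(Classify(new byte[] { 0x63, 0x61, 0x66, 0xC3, 0xA9 })); // "café" as UTF-8
        Console.WriteLine(Classify(new byte[] { 0x63, 0x61, 0x66, 0xE9 }));       // "café" as Latin-1
        Console.WriteLine(Classify(new byte[] { 0xD6, 0xD0, 0xA3, 0xAC }));       // "中，" as GB2312
    }
}
```

On .NET Core and later, actually decoding with GB18030 would additionally require registering System.Text.Encoding.CodePages, which is why this sketch stops at naming the encoding.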

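— Enzojz

2

.NET is not very helpful, but you can try the following algorithm:

1) Try to find the encoding by BOM (byte order mark); it will very likely not be found
2) Try parsing with different encodings

Here is the call: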

var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
    throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

Here is the code:

public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if none is found, by trying to parse the file with different encodings.
    /// Returns null when no candidate encoding decodes the file.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null;   // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)  
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        // The 4-byte UTF-32LE signature starts with the 2-byte UTF-16LE signature, so test it first
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe && byteOrderMark[2] == 0 && byteOrderMark[3] == 0) return Encoding.UTF32; //UTF-32LE
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE

        return null;    // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {            
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {                        
                    textReader.ReadLine();   // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception) { }   // decoding (or reading) failed with this encoding

        return null;    // the file could not be parsed with this encoding
    }
}

— Pacurar Stefan

1

Look here for C#

https://msdn.microsoft.com/zh-CN/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

string path = @"path\to\your\file.ext";

using (StreamReader sr = new StreamReader(path, true))
{
    while (sr.Peek() >= 0)
    {
        Console.Write((char)sr.Read());
    }

    //Test for the encoding after reading, or at least
    //after the first read.
    Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
    Console.ReadLine();
    Console.WriteLine();
}

— Cedric

0

This may be useful

string path = @"address/to/the/file.extension";

using (StreamReader sr = new StreamReader(path))
{
    sr.Peek(); // forces a read so that the BOM, if any, is detected
    Console.WriteLine(sr.CurrentEncoding);
}

— 劳山