How to detect the character encoding of a text file?


76

I am trying to detect which character encoding is used in a file.

I tried using this code to detect the standard encodings:

public static Encoding GetFileEncoding(string srcFile)
    {
      // *** Use Default of Encoding.Default (Ansi CodePage)
      Encoding enc = Encoding.Default;

      // *** Detect byte order mark if any - otherwise assume default
      byte[] buffer = new byte[5];
      FileStream file = new FileStream(srcFile, FileMode.Open);
      file.Read(buffer, 0, 5);
      file.Close();

      if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
      else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;
      else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = Encoding.UTF32;
      else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
      else if (buffer[0] == 0xFE && buffer[1] == 0xFF)      
        // 1201 unicodeFFFE Unicode (Big-Endian)
        enc = Encoding.GetEncoding(1201);      
      else if (buffer[0] == 0xFF && buffer[1] == 0xFE)      
        // 1200 utf-16 Unicode
        enc = Encoding.GetEncoding(1200);


      return enc;
    }

The first five bytes of my file are 60, 118, 56, 46, and 49.

Is there a chart that shows which encoding matches those first five bytes?


4
A byte order mark should not be used to detect the encoding. In some cases it is ambiguous which encoding is in use: UTF-16 LE and UTF-32 LE both start with the same two bytes. The BOM should only be used to detect the byte order (hence its name). Also, strictly speaking, UTF-8 shouldn't even have a byte order mark, and adding one can interfere with software that doesn't expect it.
Mark Byers '10

@Mark Byers, so is there a technique to detect which encoding is used in the file?
Cédric Boivin

4
@Mark Byers: UTF-32 LE starts with the same 2 bytes as UTF-16 LE. However, it is then followed by the bytes 00 00, which is very unlikely (I think) in UTF-16 LE. Also, while the BOM should in theory only indicate the byte order as you say, in practice it acts as a signature showing the encoding. See: unicode.org/faq/utf_bom.html#bom4
Dan W

Is the UTF-7 BOM actually a real thing? I tried creating a UTF7Encoding object and calling GetPreamble() on it, and it returned an empty array. And unlike UTF-8, there is no constructor parameter for it.
Nyerguds

Mark Byers: your comment is completely wrong. The BOM is a bulletproof way to detect the encoding. UTF-16 BE and UTF-32 BE are not ambiguous. You should study the subject before writing wrong comments. If a piece of software doesn't handle the UTF-8 BOM, it is either from the 1980s or badly programmed. Today every piece of software should handle and recognize BOMs.
Elmue '16

Answers:


84

You can't rely on the file having a BOM. UTF-8 doesn't require one. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

The BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code-point range is limited to U+10FFFF, so UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4 and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible because 00 bytes are so rare in byte-oriented encodings.
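
In C#, that pattern check could be sketched roughly as follows (the helper name is made up for illustration):

static bool LooksLikeUtf32(byte[] data, bool bigEndian)
{
    // BOM-less UTF-32 check: every 4-byte unit must match
    // 00 {00-10} xx xx (BE) or xx xx {00-10} 00 (LE).
    if (data.Length == 0 || data.Length % 4 != 0)
        return false;

    for (int i = 0; i < data.Length; i += 4)
    {
        byte zero  = bigEndian ? data[i]     : data[i + 3];   // must be 00
        byte plane = bigEndian ? data[i + 1] : data[i + 2];   // must be 00-10
        if (zero != 0x00 || plane > 0x10)
            return false;
    }
    return true;
}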

US-ASCII

There is no BOM, but you don't need one. ASCII is easily identified by the absence of any bytes in the 80-FF range.

UTF-8

The BOM is EF BB BF. But you can't rely on it. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, less than 0.1%. For a 24-byte sequence, less than 1 in a million.
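
One way to run such a validation in .NET (a sketch, not part of the original answer) is to decode with a strict UTF8Encoding that throws on invalid byte sequences:

using System.Text;

// Sketch: returns true if the sample decodes as well-formed UTF-8.
// Note that a sample cut off in the middle of a multi-byte character
// will be rejected, so trim the tail or tolerate that case when sampling.
static bool ValidatesAsUtf8(byte[] sample)
{
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(sample);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}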

UTF-16

The BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check for UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 is also a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are used too rarely to make this approach practical.
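
A rough sketch of the 00-byte heuristic described above (the thresholds here are arbitrary, not taken from the answer):

using System.Text;

// Sketch: for mostly Latin-1 text, UTF-16LE puts 00 in the odd byte
// positions and UTF-16BE puts it in the even ones.
static Encoding GuessUtf16ByNullBytes(byte[] data)
{
    if (data.Length < 2)
        return null;

    int evenNulls = 0, oddNulls = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] != 0) continue;
        if (i % 2 == 0) evenNulls++; else oddNulls++;
    }

    double half = data.Length / 2.0;
    if (oddNulls / half > 0.4 && evenNulls / half < 0.1)
        return Encoding.Unicode;            // UTF-16 LE
    if (evenNulls / half > 0.4 && oddNulls / half < 0.1)
        return Encoding.BigEndianUnicode;   // UTF-16 BE
    return null;                            // inconclusive
}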

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, use that encoding. If absent, assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if your file format contains an encoding declaration, look for that declaration rather than trying to guess the encoding.
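
A sketch of that sniffing for the ASCII-compatible case (it ignores the EBCDIC variant, and the 200-byte window is arbitrary):

using System;
using System.Text;
using System.Text.RegularExpressions;

// Sketch: if the data starts with "<?xml", pull out the encoding= pseudo-attribute;
// otherwise fall back to UTF-8, the XML default. Returns null for non-XML data.
static Encoding SniffXmlDeclaration(byte[] data)
{
    byte[] prefix = { 0x3C, 0x3F, 0x78, 0x6D, 0x6C };   // "<?xml"
    if (data.Length < prefix.Length)
        return null;
    for (int i = 0; i < prefix.Length; i++)
        if (data[i] != prefix[i])
            return null;

    string head = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 200));
    Match m = Regex.Match(head, @"encoding\s*=\s*[""']([^""']+)[""']");
    return m.Success ? Encoding.GetEncoding(m.Groups[1].Value) : Encoding.UTF8;
}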

None of the above

There are hundreds of other encodings, which take more effort to detect. I recommend trying Mozilla's charset detector or its .NET port.

A reasonable default

If you have ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an "ISO-8859-1" declaration to be interpreted as Windows-1252.) Being the default code page of Windows for English (and for other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.
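
In code, that fallback is just the Windows-1252 code page. As a side note (not from the answer), on .NET Core / .NET 5+ the code-page encodings live in the System.Text.Encoding.CodePages package and need a one-time registration:

using System.Text;

// Sketch of the "none of the above" fallback.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed on .NET Core / 5+ only
Encoding fallback = Encoding.GetEncoding(1252);                // Windows-1252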


1
Well, about what I expected. Could you address the problem of distinguishing UTF-8 from UTF-16? PS: Thanks for a very useful answer. +1
Ira Baxter

2
For a UTF-16 BE text file, if a certain percentage of the even bytes are zero (or check the odd bytes for UTF-16 LE), then the encoding is quite likely UTF-16. What do you think?
Dan W

1
UTF-8 validity can be detected quite well with a bit-pattern check; the bit pattern of the first byte tells you exactly how many bytes follow, and those following bytes have control bits to check as well. The patterns are all shown here: ianthehenry.com/2015/1/17/decoding-utf-8
Nyerguds
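
For illustration, the lead-byte check described in that comment could be sketched like this (not the linked article's code; unlike the regex validation discussed above, it does not reject overlong or surrogate encodings):

// Sketch of the UTF-8 lead-byte bit patterns:
//   0xxxxxxx                       -> 1 byte (ASCII)
//   110xxxxx 10xxxxxx              -> 2 bytes
//   1110xxxx 10xxxxxx 10xxxxxx     -> 3 bytes
//   11110xxx 10xxxxxx (x3)         -> 4 bytes
static bool HasValidUtf8BitPatterns(byte[] data)
{
    int i = 0;
    while (i < data.Length)
    {
        byte b = data[i];
        int continuations;
        if ((b & 0x80) == 0x00) continuations = 0;
        else if ((b & 0xE0) == 0xC0) continuations = 1;
        else if ((b & 0xF0) == 0xE0) continuations = 2;
        else if ((b & 0xF8) == 0xF0) continuations = 3;
        else return false;                                   // invalid lead byte

        if (i + continuations >= data.Length)
            return false;                                    // truncated sequence
        for (int j = 1; j <= continuations; j++)
            if ((data[i + j] & 0xC0) != 0x80)
                return false;                                // continuation must be 10xxxxxx
        i += continuations + 1;
    }
    return true;
}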

2
@marsze This isn't my answer, and it doesn't mention that, because this is about detection, and, as I mentioned, you can't really detect plain one-byte-per-symbol encodings. I did post an answer around here on that, though.
Nyerguds

2
@marsze: There, I added a section for Latin-1.
dan04

11

If you're looking for a "simple" solution, you might find this class I put together useful:

http://www.architectshack.com/TextFileEncodingDetector.ashx

It does automatic BOM detection first, and then tries to differentiate between BOM-less Unicode encodings and some other default encoding (usually Windows-1252, incorrectly labelled as Encoding.ASCII in .NET).

As mentioned above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page for this class, the best you can do is offer the user some form of interactivity wherever possible, because there simply is no 100% detection rate!

Snippet in case the site goes offline:

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace KlerksSoft
{
    public static class TextFileEncodingDetector
    {
        /*
         * Simple class to handle text file encoding woes (in a primarily English-speaking tech 
         *      world).
         * 
         *  - This code is fully managed, no shady calls to MLang (the unmanaged codepage
         *      detection library originally developed for Internet Explorer).
         * 
         *  - This class does NOT try to detect arbitrary codepages/charsets, it really only
         *      aims to differentiate between some of the most common variants of Unicode 
         *      encoding, and a "default" (western / ascii-based) encoding alternative provided
         *      by the caller.
         *      
         *  - As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and 
         *      Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a 
         *      heuristic - so the more of the file we can sample the better the guess. If you 
         *      are going to read the whole file into memory at some point, then best to pass 
         *      in the whole byte array directly. Otherwise, decide how to trade off 
         *      reliability against performance / memory usage.
         *      
         *  - The UTF-8 detection heuristic only works for western text, as it relies on 
         *      the presence of UTF-8 encoded accented and other characters found in the upper 
         *      ranges of the Latin-1 and (particularly) Windows-1252 codepages.
         *  
         *  - For more general detection routines, see existing projects / resources:
         *    - MLang - Microsoft library originally for IE6, available in Windows XP and later APIs now (I think?)
         *      - MLang .Net bindings: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
         *    - CharDet - Mozilla browser's detection routines
         *      - Ported to Java then .Net: http://www.conceptdevelopment.net/Localization/NCharDet/
         *      - Ported straight to .Net: http://code.google.com/p/chardetsharp/source/browse
         *  
         * Copyright Tao Klerks, 2010-2012, tao@klerks.biz
         * Licensed under the modified BSD license:
         * 
Redistribution and use in source and binary forms, with or without modification, are 
permitted provided that the following conditions are met:
 - Redistributions of source code must retain the above copyright notice, this list of 
conditions and the following disclaimer.
 - Redistributions in binary form must reproduce the above copyright notice, this list 
of conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
 - The name of the author may not be used to endorse or promote products derived from 
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, 
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY 
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY 
OF SUCH DAMAGE.
         * 
         * CHANGELOG:
         *  - 2012-02-03: 
         *    - Simpler methods, removing the silly "DefaultEncoding" parameter (with "??" operator, saves no typing)
         *    - More complete methods
         *      - Optionally return indication of whether BOM was found in "Detect" methods
         *      - Provide straight-to-string method for byte arrays (GetStringFromByteArray)
         */

        const long _defaultHeuristicSampleSize = 0x10000; //completely arbitrary - inappropriate for high numbers of files / high speed requirements

        public static Encoding DetectTextFileEncoding(string InputFilename)
        {
            using (FileStream textfileStream = File.OpenRead(InputFilename))
            {
                return DetectTextFileEncoding(textfileStream, _defaultHeuristicSampleSize);
            }
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize)
        {
            bool uselessBool = false;
            return DetectTextFileEncoding(InputFileStream, _defaultHeuristicSampleSize, out uselessBool);
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize, out bool HasBOM)
        {
            if (InputFileStream == null)
                throw new ArgumentNullException("Must provide a valid Filestream!", "InputFileStream");

            if (!InputFileStream.CanRead)
                throw new ArgumentException("Provided file stream is not readable!", "InputFileStream");

            if (!InputFileStream.CanSeek)
                throw new ArgumentException("Provided file stream cannot seek!", "InputFileStream");

            Encoding encodingFound = null;

            long originalPos = InputFileStream.Position;

            InputFileStream.Position = 0;


            //First read only what we need for BOM detection
            byte[] bomBytes = new byte[InputFileStream.Length > 4 ? 4 : InputFileStream.Length];
            InputFileStream.Read(bomBytes, 0, bomBytes.Length);

            encodingFound = DetectBOMBytes(bomBytes);

            if (encodingFound != null)
            {
                InputFileStream.Position = originalPos;
                HasBOM = true;
                return encodingFound;
            }


            //BOM Detection failed, going for heuristics now.
            //  create sample byte array and populate it
            byte[] sampleBytes = new byte[HeuristicSampleSize > InputFileStream.Length ? InputFileStream.Length : HeuristicSampleSize];
            Array.Copy(bomBytes, sampleBytes, bomBytes.Length);
            if (InputFileStream.Length > bomBytes.Length)
                InputFileStream.Read(sampleBytes, bomBytes.Length, sampleBytes.Length - bomBytes.Length);
            InputFileStream.Position = originalPos;

            //test byte array content
            encodingFound = DetectUnicodeInByteSampleByHeuristics(sampleBytes);

            HasBOM = false;
            return encodingFound;
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData)
        {
            bool uselessBool = false;
            return DetectTextByteArrayEncoding(TextData, out uselessBool);
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData, out bool HasBOM)
        {
            if (TextData == null)
                throw new ArgumentNullException("Must provide a valid text data byte array!", "TextData");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                HasBOM = true;
                return encodingFound;
            }
            else
            {
                //test byte array content
                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData);

                HasBOM = false;
                return encodingFound;
            }
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding)
        {
            return GetStringFromByteArray(TextData, DefaultEncoding, _defaultHeuristicSampleSize);
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding, long MaxHeuristicSampleSize)
        {
            if (TextData == null)
                throw new ArgumentNullException("Must provide a valid text data byte array!", "TextData");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                //For some reason, the default encodings don't detect/swallow their own preambles!!
                return encodingFound.GetString(TextData, encodingFound.GetPreamble().Length, TextData.Length - encodingFound.GetPreamble().Length);
            }
            else
            {
                byte[] heuristicSample = null;
                if (TextData.Length > MaxHeuristicSampleSize)
                {
                    heuristicSample = new byte[MaxHeuristicSampleSize];
                    Array.Copy(TextData, heuristicSample, MaxHeuristicSampleSize);
                }
                else
                {
                    heuristicSample = TextData;
                }

                encodingFound = DetectUnicodeInByteSampleByHeuristics(heuristicSample) ?? DefaultEncoding;
                return encodingFound.GetString(TextData);
            }
        }


        public static Encoding DetectBOMBytes(byte[] BOMBytes)
        {
            if (BOMBytes == null)
                throw new ArgumentNullException("Must provide a valid BOM byte array!", "BOMBytes");

            if (BOMBytes.Length < 2)
                return null;

            if (BOMBytes[0] == 0xff 
                && BOMBytes[1] == 0xfe 
                && (BOMBytes.Length < 4 
                    || BOMBytes[2] != 0 
                    || BOMBytes[3] != 0
                    )
                )
                return Encoding.Unicode;

            if (BOMBytes[0] == 0xfe 
                && BOMBytes[1] == 0xff
                )
                return Encoding.BigEndianUnicode;

            if (BOMBytes.Length < 3)
                return null;

            if (BOMBytes[0] == 0xef && BOMBytes[1] == 0xbb && BOMBytes[2] == 0xbf)
                return Encoding.UTF8;

            if (BOMBytes[0] == 0x2b && BOMBytes[1] == 0x2f && BOMBytes[2] == 0x76)
                return Encoding.UTF7;

            if (BOMBytes.Length < 4)
                return null;

            if (BOMBytes[0] == 0xff && BOMBytes[1] == 0xfe && BOMBytes[2] == 0 && BOMBytes[3] == 0)
                return Encoding.UTF32;

            if (BOMBytes[0] == 0 && BOMBytes[1] == 0 && BOMBytes[2] == 0xfe && BOMBytes[3] == 0xff)
                return Encoding.GetEncoding(12001);

            return null;
        }

        public static Encoding DetectUnicodeInByteSampleByHeuristics(byte[] SampleBytes)
        {
            long oddBinaryNullsInSample = 0;
            long evenBinaryNullsInSample = 0;
            long suspiciousUTF8SequenceCount = 0;
            long suspiciousUTF8BytesTotal = 0;
            long likelyUSASCIIBytesInSample = 0;

            //Cycle through, keeping count of binary null positions, possible UTF-8 
            //  sequences from upper ranges of Windows-1252, and probable US-ASCII 
            //  character counts.

            long currentPos = 0;
            int skipUTF8Bytes = 0;

            while (currentPos < SampleBytes.Length)
            {
                //binary null distribution
                if (SampleBytes[currentPos] == 0)
                {
                    if (currentPos % 2 == 0)
                        evenBinaryNullsInSample++;
                    else
                        oddBinaryNullsInSample++;
                }

                //likely US-ASCII characters
                if (IsCommonUSASCIIByte(SampleBytes[currentPos]))
                    likelyUSASCIIBytesInSample++;

                //suspicious sequences (look like UTF-8)
                if (skipUTF8Bytes == 0)
                {
                    int lengthFound = DetectSuspiciousUTF8SequenceLength(SampleBytes, currentPos);

                    if (lengthFound > 0)
                    {
                        suspiciousUTF8SequenceCount++;
                        suspiciousUTF8BytesTotal += lengthFound;
                        skipUTF8Bytes = lengthFound - 1;
                    }
                }
                else
                {
                    skipUTF8Bytes--;
                }

                currentPos++;
            }

            //1: UTF-16 LE - in english / european environments, this is usually characterized by a 
            //  high proportion of odd binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of even binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.Unicode;


            //2: UTF-16 BE - in english / european environments, this is usually characterized by a 
            //  high proportion of even binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of odd binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.BigEndianUnicode;


            //3: UTF-8 - Martin Dürst outlines a method for detecting whether something CAN be UTF-8 content 
            //  using regexp, in his w3c.org unicode FAQ entry: 
            //  http://www.w3.org/International/questions/qa-forms-utf-8
            //  adapted here for C#.
            string potentiallyMangledString = Encoding.ASCII.GetString(SampleBytes);
            Regex UTF8Validator = new Regex(@"\A(" 
                + @"[\x09\x0A\x0D\x20-\x7E]"
                + @"|[\xC2-\xDF][\x80-\xBF]"
                + @"|\xE0[\xA0-\xBF][\x80-\xBF]"
                + @"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
                + @"|\xED[\x80-\x9F][\x80-\xBF]"
                + @"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
                + @"|[\xF1-\xF3][\x80-\xBF]{3}"
                + @"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
                + @")*\z");
            if (UTF8Validator.IsMatch(potentiallyMangledString))
            {
                //Unfortunately, just the fact that it CAN be UTF-8 doesn't tell you much about probabilities.
                //If all the characters are in the 0-127 range, no harm done, most western charsets are same as UTF-8 in these ranges.
                //If some of the characters were in the upper range (western accented characters), however, they would likely be mangled to 2-byte by the UTF-8 encoding process.
                // So, we need to play stats.

                // The "Random" likelihood of any pair of randomly generated characters being one 
                //   of these "suspicious" character sequences is:
                //     128 / (256 * 256) = 0.2%.
                //
                // In western text data, that is SIGNIFICANTLY reduced - most text data stays in the <127 
                //   character range, so we assume that more than 1 in 500,000 of these character 
                //   sequences indicates UTF-8. The number 500,000 is completely arbitrary - so sue me.
                //
                // We can only assume these character sequences will be rare if we ALSO assume that this
                //   IS in fact western text - in which case the bulk of the UTF-8 encoded data (that is 
                //   not already suspicious sequences) should be plain US-ASCII bytes. This, I 
                //   arbitrarily decided, should be 80% (a random distribution, eg binary data, would yield 
                //   approx 40%, so the chances of hitting this threshold by accident in random data are 
                //   VERY low). 

                if ((suspiciousUTF8SequenceCount * 500000.0 / SampleBytes.Length >= 1) //suspicious sequences
                    && (
                           //all suspicious, so cannot evaluate proportion of US-Ascii
                           SampleBytes.Length - suspiciousUTF8BytesTotal == 0 
                           ||
                           likelyUSASCIIBytesInSample * 1.0 / (SampleBytes.Length - suspiciousUTF8BytesTotal) >= 0.8
                       )
                    )
                    return Encoding.UTF8;
            }

            return null;
        }

        private static bool IsCommonUSASCIIByte(byte testByte)
        {
            if (testByte == 0x0A //lf
                || testByte == 0x0D //cr
                || testByte == 0x09 //tab
                || (testByte >= 0x20 && testByte <= 0x2F) //common punctuation
                || (testByte >= 0x30 && testByte <= 0x39) //digits
                || (testByte >= 0x3A && testByte <= 0x40) //common punctuation
                || (testByte >= 0x41 && testByte <= 0x5A) //capital letters
                || (testByte >= 0x5B && testByte <= 0x60) //common punctuation
                || (testByte >= 0x61 && testByte <= 0x7A) //lowercase letters
                || (testByte >= 0x7B && testByte <= 0x7E) //common punctuation
                )
                return true;
            else
                return false;
        }

        private static int DetectSuspiciousUTF8SequenceLength(byte[] SampleBytes, long currentPos)
        {
            int lengthFound = 0;

            if (SampleBytes.Length >= currentPos + 2 
                && SampleBytes[currentPos] == 0xC2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x81 
                    || SampleBytes[currentPos + 1] == 0x8D 
                    || SampleBytes[currentPos + 1] == 0x8F
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0x90 
                    || SampleBytes[currentPos + 1] == 0x9D
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] >= 0xA0 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 2 
                && SampleBytes[currentPos] == 0xC3
                )
            {
                if (SampleBytes[currentPos + 1] >= 0x80 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 2 
                && SampleBytes[currentPos] == 0xC5
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92 
                    || SampleBytes[currentPos + 1] == 0x93
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xA0 
                    || SampleBytes[currentPos + 1] == 0xA1
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xB8 
                    || SampleBytes[currentPos + 1] == 0xBD 
                    || SampleBytes[currentPos + 1] == 0xBE
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 2 
                && SampleBytes[currentPos] == 0xC6
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92)
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 2 
                && SampleBytes[currentPos] == 0xCB
                )
            {
                if (SampleBytes[currentPos + 1] == 0x86 
                    || SampleBytes[currentPos + 1] == 0x9C
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 3 
                && SampleBytes[currentPos] == 0xE2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x80)
                {
                    if (SampleBytes[currentPos + 2] == 0x93 
                        || SampleBytes[currentPos + 2] == 0x94
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x98 
                        || SampleBytes[currentPos + 2] == 0x99 
                        || SampleBytes[currentPos + 2] == 0x9A
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x9C 
                        || SampleBytes[currentPos + 2] == 0x9D 
                        || SampleBytes[currentPos + 2] == 0x9E
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA0 
                        || SampleBytes[currentPos + 2] == 0xA1 
                        || SampleBytes[currentPos + 2] == 0xA2
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA6)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB0)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB9 
                        || SampleBytes[currentPos + 2] == 0xBA
                        )
                        lengthFound = 3;
                }
                else if (SampleBytes[currentPos + 1] == 0x82 
                    && SampleBytes[currentPos + 2] == 0xAC
                    )
                    lengthFound = 3;
                else if (SampleBytes[currentPos + 1] == 0x84 
                    && SampleBytes[currentPos + 2] == 0xA2
                    )
                    lengthFound = 3;
            }

            return lengthFound;
        }

    }
}

1
Actually, Encoding.GetEncoding("Windows-1252") gives an object of a different class than Encoding.ASCII. While debugging, Windows-1252 shows up as a System.Text.SBCSCodePageEncoding object, while ASCII is a System.Text.ASCIIEncoding object. I never use ASCII when I need Windows-1252.
Nyerguds

To match a regular expression against binary data (bytes), the correct way is: string data = Encoding.GetEncoding("iso-8859-1").GetString(bytes); because it is the only single-byte encoding with a one-to-one mapping to a string.
Amr Ali

6

Use a StreamReader and direct it to detect the encoding for you:

using (var reader = new System.IO.StreamReader(path, true))
{
    var currentEncoding = reader.CurrentEncoding;
}

And use the code page identifiers https://msdn.microsoft.com/zh-cn/library/windows/desktop/dd317756(v=vs.85).aspx to switch logic based on it.
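
As the comments below point out, the encoding is only detected on the first read, so a sketch along these lines is needed (path is a placeholder):

using System.IO;
using System.Text;

using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
{
    reader.Peek();   // any Peek/Read forces the BOM detection to run
    Encoding detected = reader.CurrentEncoding;

    switch (detected.CodePage)
    {
        case 65001: /* UTF-8 */             break;
        case 1200:  /* UTF-16 LE */         break;
        case 1201:  /* UTF-16 BE */         break;
        default:    /* other / fallback */  break;
    }
}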


3
That doesn't work; the StreamReader assumes your file is UTF-8.
Cédric Boivin

@Cedric: Check MSDN for this constructor. Do you have evidence that the constructor doesn't behave as documented? Of course, that's always possible with Microsoft's documentation :-)
Phil Hunt '10

4
This version also only checks for a BOM.
Daniel Bişar '12

2
Hmm, don't you have to call Read() before checking CurrentEncoding? MSDN for CurrentEncoding says: "The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method."
Carl Walsh

1
My testing showed that it cannot be used reliably, so it shouldn't be used at all.
Geoffrey McGrath

5

There are several answers here, but nobody has posted useful code.

Here is my code to detect all the encodings that Microsoft detects in Framework 4 in the StreamReader class.

Obviously you must call this function immediately after opening the stream, before reading anything else from it, because the BOM is the first thing in the stream.

This function requires a Stream that can seek (for example a FileStream). If you have a Stream that cannot seek, you have to write more complicated code that returns a byte buffer with the bytes that have already been read but are not a BOM.

/// <summary>
/// UTF8    : EF BB BF
/// UTF16 BE: FE FF
/// UTF16 LE: FF FE
/// UTF32 BE: 00 00 FE FF
/// UTF32 LE: FF FE 00 00
/// </summary>
public static Encoding DetectEncoding(Stream i_Stream)
{
    if (!i_Stream.CanSeek || !i_Stream.CanRead)
        throw new Exception("DetectEncoding() requires a seekable and readable Stream");

    // Try to read 4 bytes. If the stream is shorter, less bytes will be read.
    Byte[] u8_Buf = new Byte[4];
    int s32_Count = i_Stream.Read(u8_Buf, 0, 4);
    if (s32_Count >= 2)
    {
        if (u8_Buf[0] == 0xFE && u8_Buf[1] == 0xFF)
        {
            i_Stream.Position = 2;
            return new UnicodeEncoding(true, true);
        }

        if (u8_Buf[0] == 0xFF && u8_Buf[1] == 0xFE)
        {
            if (s32_Count >= 4 && u8_Buf[2] == 0 && u8_Buf[3] == 0)
            {
                i_Stream.Position = 4;
                return new UTF32Encoding(false, true);
            }
            else
            {
                i_Stream.Position = 2;
                return new UnicodeEncoding(false, true);
            }
        }

        if (s32_Count >= 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
        {
            i_Stream.Position = 3;
            return Encoding.UTF8;
        }

        if (s32_Count >= 4 && u8_Buf[0] == 0 && u8_Buf[1] == 0 && u8_Buf[2] == 0xFE && u8_Buf[3] == 0xFF)
        {
            i_Stream.Position = 4;
            return new UTF32Encoding(true, true);
        }
    }

    i_Stream.Position = 0;
    return Encoding.Default;
}
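
A possible usage sketch (the file path is a placeholder): open the stream, detect, then hand the already-positioned stream to a StreamReader with the detected encoding:

using (FileStream stream = File.OpenRead(@"C:\temp\sample.txt"))   // placeholder path
{
    Encoding encoding = DetectEncoding(stream);   // leaves the stream positioned after any BOM
    using (var reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false))
    {
        string text = reader.ReadToEnd();
    }
}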



1

If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without a BOM) or any of the single-byte encodings like ASCII, ANSI, ISO-8859-1, etc.


Hmm... so I need to test them all?
Cédric Boivin

That's just plain ASCII. UTF-8 without any special characters is identical to ASCII, and if there are special characters, they use specific detectable bit patterns.
Nyerguds

@Nyerguds Maybe not. I have a UTF-8 text file (with no "specific detectable bit patterns"; it's mostly English characters). If I read it as ASCII, it fails to read one particular "-" character.
Amit

Impossible. If the character isn't ASCII, it will be encoded with those specific detectable bit patterns; that's how UTF-8 works. Most likely your text is neither ASCII nor UTF-8, but simply an 8-bit encoding like Windows-1252.
Nyerguds

Licensed under cc by-sa 3.0 with attribution required.