Byte order mark breaks file reading in Java

107

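I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all of them. When it is present, the byte order mark is read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?

Thanks!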

java utf-8 byte-order-mark

— Tom
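To make the symptom concrete, here is a minimal sketch of what the question describes. The class name, helper names and sample CSV content are hypothetical; the only assumption is that the file is decoded as UTF-8, in which case Java keeps the BOM as the character U+FEFF at the start of the first line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomSymptomDemo {
    /** Reads the first line of the given bytes, decoding them as UTF-8. */
    static String firstLine(byte[] fileBytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a hypothetical file: a UTF-8 BOM (EF BB BF) followed by the text. */
    static byte[] withUtf8Bom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) {
        String header = firstLine(withUtf8Bom("id,name\n1,Tom\n"));
        // The decoder keeps the BOM as U+FEFF, so the comparison fails.
        System.out.println(header.equals("id,name")); // prints false
        System.out.println(header.length());          // prints 8, not 7
    }
}
```

The extra character is invisible when printed, which is why the failing comparison tends to be confusing.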

Maybe: rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

— Chris

114

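EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in SUN's bug database. Incorporate it in your code and you're fine.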

/* ____________________________________________________________________________
 * 
 * File:    UnicodeBOMInputStream.java
 * Author:  Gregory Pakosz.
 * Date:    02 - November - 2005    
 * ____________________________________________________________________________
 */
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * The <code>UnicodeBOMInputStream</code> class wraps any
 * <code>InputStream</code> and detects the presence of any Unicode BOM
 * (Byte Order Mark) at its beginning, as defined by
 * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
 * 
 * <p>The
 * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
 * defines 5 types of BOMs:<ul>
 * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
 * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
 * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
 * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
 * <li><pre>EF BB BF     = UTF-8</pre></li>
 * </ul></p>
 * 
 * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
 * or not.
 * </p>
 * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
 * wrapped <code>InputStream</code> object.</p>
 */
public class UnicodeBOMInputStream extends InputStream
{
  /**
   * Type safe enumeration class that describes the different types of Unicode
   * BOMs.
   */
  public static final class BOM
  {
    /**
     * NONE.
     */
    public static final BOM NONE = new BOM(new byte[]{},"NONE");

    /**
     * UTF-8 BOM (EF BB BF).
     */
    public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                       (byte)0xBB,
                                                       (byte)0xBF},
                                            "UTF-8");

    /**
     * UTF-16, little-endian (FF FE).
     */
    public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE},
                                                "UTF-16 little-endian");

    /**
     * UTF-16, big-endian (FE FF).
     */
    public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-16 big-endian");

    /**
     * UTF-32, little-endian (FF FE 00 00).
     */
    public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE,
                                                            (byte)0x00,
                                                            (byte)0x00},
                                                "UTF-32 little-endian");

    /**
     * UTF-32, big-endian (00 00 FE FF).
     */
    public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                            (byte)0x00,
                                                            (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-32 big-endian");

    /**
     * Returns a <code>String</code> representation of this <code>BOM</code>
     * value.
     */
    public final String toString()
    {
      return description;
    }

    /**
     * Returns the bytes corresponding to this <code>BOM</code> value.
     */
    public final byte[] getBytes()
    {
      final int     length = bytes.length;
      final byte[]  result = new byte[length];

      // Make a defensive copy
      System.arraycopy(bytes,0,result,0,length);

      return result;
    }

    private BOM(final byte bom[], final String description)
    {
      assert(bom != null)               : "invalid BOM: null is not allowed";
      assert(description != null)       : "invalid description: null is not allowed";
      assert(description.length() != 0) : "invalid description: empty string is not allowed";

      this.bytes          = bom;
      this.description  = description;
    }

            final byte    bytes[];
    private final String  description;

  } // BOM

  /**
   * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
   * specified <code>InputStream</code>.
   * 
   * @param inputStream an <code>InputStream</code>.
   * 
   * @throws NullPointerException when <code>inputStream</code> is
   * <code>null</code>.
   * @throws IOException on reading from the specified <code>InputStream</code>
   * when trying to detect the Unicode BOM.
   */
  public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                      IOException

  {
    if (inputStream == null)
      throw new NullPointerException("invalid input stream: null is not allowed");

    in = new PushbackInputStream(inputStream,4);

    final byte  bom[] = new byte[4];
    final int   read  = in.read(bom);

    switch(read)
    {
      case 4:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE) &&
            (bom[2] == (byte)0x00) &&
            (bom[3] == (byte)0x00))
        {
          this.bom = BOM.UTF_32_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0x00) &&
            (bom[1] == (byte)0x00) &&
            (bom[2] == (byte)0xFE) &&
            (bom[3] == (byte)0xFF))
        {
          this.bom = BOM.UTF_32_BE;
          break;
        }
        // Not a UTF-32 BOM: fall through to the shorter BOMs.

      case 3:
        if ((bom[0] == (byte)0xEF) &&
            (bom[1] == (byte)0xBB) &&
            (bom[2] == (byte)0xBF))
        {
          this.bom = BOM.UTF_8;
          break;
        }
        // Not a UTF-8 BOM: fall through to the UTF-16 BOMs.

      case 2:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE))
        {
          this.bom = BOM.UTF_16_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0xFE) &&
            (bom[1] == (byte)0xFF))
        {
          this.bom = BOM.UTF_16_BE;
          break;
        }
        // No BOM matched: fall through to the default.

      default:
        this.bom = BOM.NONE;
        break;
    }

    if (read > 0)
      in.unread(bom,0,read);
  }

  /**
   * Returns the <code>BOM</code> that was detected in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return a <code>BOM</code> value.
   */
  public final BOM getBOM()
  {
    // BOM type is immutable.
    return bom;
  }

  /**
   * Skips the <code>BOM</code> that was found in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return this <code>UnicodeBOMInputStream</code>.
   * 
   * @throws IOException when trying to skip the BOM from the wrapped
   * <code>InputStream</code> object.
   */
  public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
  {
    if (!skipped)
    {
      in.skip(bom.bytes.length);
      skipped = true;
    }
    return this;
  }

  /**
   * {@inheritDoc}
   */
  public int read() throws IOException
  {
    return in.read();
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[]) throws  IOException,
                                          NullPointerException
  {
    return in.read(b,0,b.length);
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[],
                  final int off,
                  final int len) throws IOException,
                                        NullPointerException
  {
    return in.read(b,off,len);
  }

  /**
   * {@inheritDoc}
   */
  public long skip(final long n) throws IOException
  {
    return in.skip(n);
  }

  /**
   * {@inheritDoc}
   */
  public int available() throws IOException
  {
    return in.available();
  }

  /**
   * {@inheritDoc}
   */
  public void close() throws IOException
  {
    in.close();
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void mark(final int readlimit)
  {
    in.mark(readlimit);
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void reset() throws IOException
  {
    in.reset();
  }

  /**
   * {@inheritDoc}
   */
  public boolean markSupported() 
  {
    return in.markSupported();
  }

  private final PushbackInputStream in;
  private final BOM                 bom;
  private       boolean             skipped = false;

} // UnicodeBOMInputStream

You use it like this:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
  public static void main(final String[] args) throws Exception
  {
    FileInputStream fis = new FileInputStream("test/offending_bom.txt");
    UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

    System.out.println("detected BOM: " + ubis.getBOM());

    System.out.print("Reading the content of the file without skipping the BOM: ");
    InputStreamReader isr = new InputStreamReader(ubis);
    BufferedReader br = new BufferedReader(isr);

    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();

    fis = new FileInputStream("test/offending_bom.txt");
    ubis = new UnicodeBOMInputStream(fis);
    isr = new InputStreamReader(ubis);
    br = new BufferedReader(isr);

    ubis.skipBOM();

    System.out.print("Reading the content of the file after skipping the BOM: ");
    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();
  }

} // UnicodeBOMInputStreamUsage

— Gregory Pakosz

2

Sorry for the long scroll area; too bad there is no attachment feature.

— Gregory Pakosz

Thanks Gregory, that's just what I was looking for.

— Tom

3

This should be in the core Java API.

— Denis Kniazhev

7

Ten years have passed and I'm still receiving karma for this :D I'm looking at you, Java!

— Gregory Pakosz

1

Upvoted because the answer provides history on why file input streams don't offer an option to discard the BOM by default.

— MxLDevs

94

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}

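If you also need to detect different encodings, it can distinguish among various byte order marks, e.g. UTF-8 vs. UTF-16 big and little endian; details are at the doc link above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality, maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect which encoding some bytes are in, but if the stream starts with a BOM, that can obviously help.

Edit: If you need to detect the BOM in UTF-16, UTF-32, etc., then the constructor call should be: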

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

— rescdsk

Just skips the BOM. Should be the perfect solution for 99% of the use cases.

— atamanroman

7

I used this answer successfully. However, I would respectfully add the boolean arg to specify whether to include or exclude the BOM. Example: BOMInputStream bomIn = new BOMInputStream(in, false); // don't include the BOM

— Kevin Meredith

19

I would also add that this only detects the UTF-8 BOM. If you want to detect all UTF-X BOMs, you need to pass them to the BOMInputStream constructor.

BOMInputStream bomIn = new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);

— Martin Charlesworth

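As for @KevinMeredith's comment, I want to point out that the constructor with the boolean is clearer, but the default constructor already gets rid of the UTF-8 BOM, as the JavaDoc suggests: BOMInputStream(InputStream delegate) Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.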

— WesternGun

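Skipping solves most of my problems. If my file starts with a UTF_16BE BOM, can I create an InputReader by skipping the BOM and reading the file as UTF_8? It works so far, but I would like to understand whether there are any edge cases. Thanks in advance.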

— Bhaskar
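One edge case behind the question above can be sketched with the standard library alone. This is a hedged illustration, not a statement about any particular file: the class name, helper and sample bytes are hypothetical, and it assumes the content after the BOM really is UTF-16BE. Even plain ASCII text then contains a 0x00 byte before every letter, which a UTF-8 decoder turns into NUL characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class WrongCharsetDemo {
    /** Drops the first bomLength bytes and decodes the remainder as UTF-8. */
    static String skipBomThenDecodeAsUtf8(byte[] fileBytes, int bomLength) {
        byte[] body = Arrays.copyOfRange(fileBytes, bomLength, fileBytes.length);
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical content: UTF-16BE BOM (FE FF) followed by "hi" in UTF-16BE.
        byte[] fileBytes = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x68, 0x00, 0x69};

        String decoded = skipBomThenDecodeAsUtf8(fileBytes, 2);
        // Each 0x00 byte becomes a NUL character, so the text is not "hi".
        System.out.println(decoded.length());     // prints 4
        System.out.println(decoded.equals("hi")); // prints false
    }
}
```

If skipping the BOM and decoding as UTF-8 appears to work, one possible reading is that the file body is not actually UTF-16BE; decoding with the charset the BOM announces is the safer route.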

31

An even simpler solution:

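import java.io.IOException;
import java.io.Reader;

public class BOMSkipper
{
    // Consumes a leading U+FEFF; the reader must support mark/reset (e.g. BufferedReader).
    public static void skip(Reader reader) throws IOException
    {
        reader.mark(1);
        char[] possibleBOM = new char[1];
        reader.read(possibleBOM);

        // Not a BOM: rewind so the first character is read again.
        if (possibleBOM[0] != '\ufeff')
        {
            reader.reset();
        }
    }
}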

Usage sample:

BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
BOMSkipper.skip(input);
//Now UTF prefix not present:
input.readLine();
...

It works with all 5 UTF encodings!

1

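Very nice, Andrei. But could you explain why it works? How does the pattern 0xFEFF successfully match UTF-8 files, which seem to have a different pattern and 3 bytes instead of 2? And how can that pattern match both endiannesses of UTF-16 and UTF-32?

— Vahid Pazirandeh, 2014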

1

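As you can see, I don't use a byte stream; I open a character stream with the expected charset. So if the first char from this stream is a BOM, I skip it. The BOM can have a different byte representation in each encoding, but it is a single character. Please read this article, it helped me: joelonsoftware.com/articles/Unicode.html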
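The point about "different bytes, one character" can be checked with a short sketch. The class and method names are hypothetical, and the demonstration is limited to the three charsets from StandardCharsets whose decoders are documented to keep a leading U+FEFF; UTF-32 is left out here because it is not part of StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomAsCharacterDemo {
    /** Number of bytes the single character U+FEFF occupies in the charset. */
    static int bomByteLength(Charset charset) {
        return "\uFEFF".getBytes(charset).length;
    }

    /** Encodes U+FEFF plus text, decodes it again, and returns the first char. */
    static char firstDecodedChar(String text, Charset charset) {
        byte[] encoded = ("\uFEFF" + text).getBytes(charset);
        return new String(encoded, charset).charAt(0);
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            // The byte length differs per charset; the decoded character does not.
            System.out.println(cs + ": " + bomByteLength(cs) + " bytes, U+FEFF first: "
                    + (firstDecodedChar("id,name", cs) == '\uFEFF'));
        }
    }
}
```

This is why comparing the first decoded character against '\ufeff' works regardless of how many bytes the BOM took on disk, provided the reader was opened with the matching charset.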

Nice solution; just make sure to check that the file is not empty before reading, to avoid an IOException in the skip method. You can do that by calling if (reader.ready()) { reader.read(possibleBOM) ... }

— Snow
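As an alternative to the ready() check, a variant can rely on the return value of read(), which is -1 at end of stream. The following is a sketch under that assumption; SafeBomSkipper and firstLineWithoutBom are hypothetical names, and the reader passed in must support mark/reset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SafeBomSkipper {
    /** Consumes a leading U+FEFF if present; otherwise leaves the reader as it was. */
    static void skip(Reader reader) throws IOException {
        if (!reader.markSupported()) {
            throw new IllegalArgumentException("reader must support mark/reset");
        }
        reader.mark(1);
        int first = reader.read(); // -1 at end of stream, so empty input is handled
        if (first != 0xFEFF) {
            reader.reset();        // not a BOM (or empty input): rewind
        }
    }

    /** Helper for the example: first line of the text after skipping a BOM. */
    static String firstLineWithoutBom(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            skip(reader);
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLineWithoutBom("\uFEFFid,name\n1,Tom")); // prints id,name
        System.out.println(firstLineWithoutBom(""));                     // prints null
    }
}
```

The explicit markSupported() check turns a silent misuse (for example passing a bare InputStreamReader) into an immediate, descriptive failure.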

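I see you have covered 0xFE 0xFF, which is the byte order mark for UTF-16BE. But what if the first 3 bytes are 0xEF 0xBB 0xBF (the byte order mark for UTF-8)? You claim that this works for all UTF formats. That might be true (I haven't tested your code), but how does it work?

— bvdb, 2016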

1

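See my answer to Vahid: I open a character stream, not a byte stream, and read one character from it. It doesn't matter which UTF encoding the file uses; the BOM prefix may be represented by a different number of bytes, but in terms of characters it is just one character.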

24

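The Google Data API has a UnicodeReader which automatically detects the encoding.

You can use it instead of InputStreamReader. Here is a slightly compacted extract of its source, which is pretty straightforward: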

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

— BalusC
It seems the link says the Google Data API is deprecated? Where should one look for the Google Data API now?

— SOUser, 2016

1

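@XichenLi: The GData API has been deprecated for its intended purpose. I didn't intend to suggest using the GData API directly (the OP isn't using any GData service); I intended the source code to serve as an example for your own implementation. That's also why I included it in my answer, ready for copy-paste.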

— BalusC

This has a bug. The UTF-32LE case is unreachable. For (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00) to be true, the UTF-16LE case ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) would already have matched.

— Joshua Taylor

Since this code is from the Google Data API, I posted issue 471 about it.

— Joshua Taylor
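One way to avoid the unreachable branch described above is to test the four-byte BOMs before the two-byte ones. The following is a minimal sketch of that ordering, not the Google Data API's own fix; BomDetectionOrder and detect are hypothetical names.

```java
final class BomDetectionOrder {
    /**
     * Returns the charset name implied by a BOM in the first n bytes of head,
     * or null when no BOM is recognized. Longer BOMs are tested first.
     */
    static String detect(byte[] head, int n) {
        if (n >= 4 && head[0] == (byte) 0x00 && head[1] == (byte) 0x00
                && head[2] == (byte) 0xFE && head[3] == (byte) 0xFF) {
            return "UTF-32BE";
        }
        if (n >= 4 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE
                && head[2] == (byte) 0x00 && head[3] == (byte) 0x00) {
            return "UTF-32LE"; // must precede UTF-16LE, which shares the FF FE prefix
        }
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            return "UTF-16BE";
        }
        if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(utf32le, 4)); // prints UTF-32LE
        System.out.println(detect(utf16le, 4)); // prints UTF-16LE
    }
}
```

The n parameter guards against short reads, so bytes left at their default zero value are never mistaken for BOM bytes. One caveat remains: FF FE 00 00 is also what UTF-16LE text beginning with U+0000 would look like, so the order is a heuristic rather than a proof.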

13

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see it mentioned how to get an InputStream without the BOM.

Here's how I did it in Scala.

 import java.io._
 import org.apache.commons.io.input.BOMInputStream

 val file = new File(path_to_xml_file_with_BOM)
 val fileInpStream = new FileInputStream(file)
 // false means don't include the BOM
 val bomIn = new BOMInputStream(fileInpStream, false)

— Kevin Meredith

The single-arg constructor does it: public BOMInputStream(InputStream delegate) { this(delegate, false, ByteOrderMark.UTF_8); }. It excludes the UTF-8 BOM by default.

— Vladimir Vagaytsev, 2016

Good point, Vladimir. I see that in its docs, commons.apache.org/proper/commons-io/javadocs/api-2.2/org/...: Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— Kevin Meredith

4

To simply remove the BOM characters from your file, I recommend using Apache Commons IO:

public BOMInputStream(InputStream delegate,
              boolean include)
Constructs a new BOM InputStream that detects a ByteOrderMark.UTF_8 and optionally includes it.
Parameters:
delegate - the InputStream to delegate to
include - true to include the UTF-8 BOM or false to exclude it

Set include to false and your BOM characters will be excluded.

— Andreas Baaserud

2

Regrettably not. You'll have to identify and skip it yourself. This page details what you have to watch for. Also see this SO question for more details.

— Brian Agnew

1

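I had the same problem, and because I wasn't reading in a lot of files I went for a simpler solution. I think my encoding was UTF-8, because when I printed out the offending character with the help of this page, "Get unicode value of a character", I found that it was \ufeff. I used the code System.out.println( "\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1) ); to print out the offending unicode value.

Once I had the offending unicode value, I replaced it in the first line of my file before I went on reading. The business logic of that section: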

String str = reader.readLine();
// Remove the BOM before trimming: trim() does not treat \ufeff as whitespace.
str = str.replace("\ufeff", "").trim();

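This fixed my problem, and I was able to go on processing the file with no issue. I added trim() just in case of leading or trailing whitespace; you can do that or not, depending on your specific needs.

— Amy B Higgins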
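A slightly narrower variant of the same idea removes U+FEFF only when it is the very first character, leaving any later occurrence in the line alone. This is a sketch with hypothetical names (LeadingBomStripper, stripLeadingBom), intended to be applied to the first line read from the file.

```java
final class LeadingBomStripper {
    /** Returns the line without a leading U+FEFF; other characters are untouched. */
    static String stripLeadingBom(String line) {
        if (line != null && line.startsWith("\uFEFF")) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        // Strip first, then trim: String.trim() only removes characters up to
        // U+0020, so a leading U+FEFF would otherwise shield the whitespace after it.
        System.out.println(stripLeadingBom("\uFEFF id,name").trim()); // prints id,name
    }
}
```

Checking only the start of the line keeps the intent explicit: the goal is to drop a file-level marker, not to rewrite the data.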

1

That didn't work for me, but I used .replaceFirst("\u00EF\u00BB\u00BF", "").

— StackUMan
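One plausible explanation for the difference between the two workarounds, offered here as an assumption rather than something the thread confirms, is the charset used to decode the file. Decoded as UTF-8, the BOM bytes EF BB BF become the single character U+FEFF; decoded as ISO-8859-1, each byte becomes its own character (U+00EF, U+00BB, U+00BF). The class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class MisdecodedBomDemo {
    /** Hypothetical file content: UTF-8 BOM followed by the letter 'a'. */
    static final byte[] UTF8_BOM_THEN_A = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};

    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // With the right charset the BOM is one character...
        System.out.println(decodeAsUtf8(UTF8_BOM_THEN_A).startsWith("\uFEFF"));
        // ...with a single-byte charset it shows up as three characters.
        System.out.println(decodeAsLatin1(UTF8_BOM_THEN_A).startsWith("\u00EF\u00BB\u00BF"));
    }
}
```

If the three-character form appears, the rest of the file is probably being decoded with the wrong charset as well, so non-ASCII data may be damaged even after the prefix is removed.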