Reading a UTF-8 BOM marker

71

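I'm reading a file through a FileReader. The file is UTF-8 encoded (with BOM). My problem: I read the file and output the string, but sadly the BOM marker is output too. Why does this happen?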

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line:

?<style>

java file encoding

— 奥尼贡

7

UTF-8 is not supposed to have a BOM! The Unicode Standard neither requires nor recommends it.

— tchrist, 2011

27

@tchrist: At Microsoft, they don't care about standards.

— Matti Virkkunen, Feb

11

@Matti "not recommended" != non-standard

— bacar, 2012

6

@tchrist Tell that to the people who put a BOM into UTF-8 files when saving them (= Microsoft).

— dstibbe, 2012

5

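@tchrist I wish it were that simple. You create an application for your users, not for yourself, and the users use (partly) Microsoft software to create their files.

— dstibbe, 2012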

80

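In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database (here and here). There will be no fix for now, because it would break existing tools such as JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.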

Take a look at this solution: Handle UTF8 file with BOM
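For readers who want to see what "consuming the BOM manually" could look like without any library, here is a minimal sketch. It is not the code from the linked solution: the class name `Utf8BomSkipper` and its methods are hypothetical, and it only handles the UTF-8 BOM (EF BB BF), assuming the rest of the stream really is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class Utf8BomSkipper {

    /** Returns a UTF-8 reader over the stream, discarding a leading EF BB BF if present. */
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream is shorter than three bytes
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    /** Convenience for in-memory data: decodes the bytes and returns the text without a BOM. */
    static String decode(byte[] bytes) {
        try (Reader reader = openUtf8(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 's', '>'};
        byte[] withoutBom = {'<', 's', '>'};
        System.out.println(decode(withBom));    // prints "<s>"
        System.out.println(decode(withoutBom)); // prints "<s>"
    }
}
```

For a file, you would pass `new FileInputStream(file)` to `openUtf8` and wrap the result in a `BufferedReader`. Only the first three bytes are inspected, so the cost does not grow with file size.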

— RealHowTo

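Very late to the game, but this seems to be really slow for big files. I tried using a buffer; with a buffer, it seems to leave some kind of trailing data behind.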

— rockNwaves

39

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");

Also see this Guava bug report
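A slightly more conservative variant of the same idea, sketched here as an assumption rather than as part of the original answer, is to strip U+FEFF only when it is the very first character of the first line. That leaves any U+FEFF elsewhere in the text alone and does nothing to files without a BOM. The class and method names below are hypothetical, and the reader is assumed to already decode the file as UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper for illustration only.
class LeadingBomStripper {
    private static final char BOM = '\uFEFF';

    /** Removes a single U+FEFF at index 0; any other string is returned unchanged. */
    static String stripLeadingBom(String line) {
        return (!line.isEmpty() && line.charAt(0) == BOM) ? line.substring(1) : line;
    }

    /** Reads all lines, stripping the BOM from the first line only. */
    static String readAll(BufferedReader br) throws IOException {
        StringBuilder content = new StringBuilder();
        String line;
        boolean first = true;
        while ((line = br.readLine()) != null) {
            if (first) {
                line = stripLeadingBom(line);
                first = false;
            }
            content.append(line).append(System.lineSeparator());
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a UTF-8 file with a BOM that has already been decoded as UTF-8.
        BufferedReader br = new BufferedReader(new StringReader("\uFEFF<style>\nbody {}\n"));
        System.out.print(readAll(br));
    }
}
```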

— Finnw

4

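The bad thing about "extremely unlikely" is that it shows up extremely rarely, so locating the bug is extremely difficult... :) So be extremely wary of this code if you believe your software will be successful and long-lived, because sooner or later every possible situation will occur.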

— Franz D.

4

FEFF is the UTF-16 BOM. The UTF-8 BOM is EFBBBF.

— Steve Pitchers

4

@StevePitchers But we have to match it after decoding, when it is part of a String (which is always UTF-16).

— finnw

What about \uFFFE (UTF-16, little endian)?

— Suzana
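The last few comments can be checked with a small experiment. The sketch below (class and method names are hypothetical) decodes BOM-prefixed byte sequences with the JDK's standard charsets and prints the resulting UTF-16 code units. With the matching charset, the BOM comes out as U+FEFF in both byte orders, which suggests that U+FFFE would only be seen if the bytes were decoded with the wrong byte order.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class DecodedBomDemo {

    /** Decodes the bytes and renders each resulting char as four hex digits. */
    static String decodeToHex(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (char c : new String(bytes, charset).toCharArray()) {
            sb.append(String.format("%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0, 'A'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'A', 0};

        // The UTF-8 decoder keeps the BOM as a character.
        System.out.println(decodeToHex(utf8, StandardCharsets.UTF_8));       // FEFF 0041
        // UTF-16BE and UTF-16LE treat the BOM as an ordinary U+FEFF character.
        System.out.println(decodeToHex(utf16be, StandardCharsets.UTF_16BE)); // FEFF 0041
        System.out.println(decodeToHex(utf16le, StandardCharsets.UTF_16LE)); // FEFF 0041
        // The plain "UTF-16" charset uses the BOM to pick the byte order and consumes it.
        System.out.println(decodeToHex(utf16be, StandardCharsets.UTF_16));   // 0041
    }
}
```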

@live-love That would just truncate the first line if the file has no BOM.

— Eric Duminil

31

Use the Apache Commons IO library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}

— 花生

commons.apache.org/proper/commons-io/apidocs/org/apache/commons/...

— BMOC

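This code only works for detecting and excluding the UTF-8 BOM. Check the implementation of BOMInputStream: `/** Constructs a new BOM InputStream that detects a {@link ByteOrderMark#UTF_8} and optionally includes it. @param delegate the InputStream to delegate to @param include true to include the UTF-8 BOM or false to exclude it */ public BOMInputStream(InputStream delegate, boolean include) { this(delegate, include, ByteOrderMark.UTF_8); }`

— czupe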

7

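Here is my way of using Apache's BOMInputStream; it uses a try-with-resources block. The "false" argument tells the object to exclude (skip) the following BOMs (we use "BOM-less" text files for safety reasons, haha):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
                false, ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here
} catch (Exception e) {
    // handle or log the exception
}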

— 蛇医生

1

I can never figure out how to post things on this site; it always ends up AFU.

— 蛇博士

5

Consider UnicodeReader from Google, which does all this work for you.

Charset utf8 = Charset.forName("UTF-8"); // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8)) {
    ....
}

Maven dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>

— Adrian Smith

Thanks. It works great with SuperCSV. This earned me some brownie points. :)

— 萨基·桑

Excellent. A very simple solution, and useful with OpenCSV.

— grizzasd

4

Use Apache Commons IO.

For example, take a look at my code below (used for reading a text file that contains both Latin and Cyrillic characters):

String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
List<String> ari = new ArrayList<>(); // collects the characters read from the file
int data = reader.read();
while (data != -1) {
    char theChar = (char) data;
    data = reader.read();
    ari.add(Character.toString(theChar));
}
reader.close();

As a result, we have an ArrayList named "ari" that contains all characters from the file "1.txt" except the BOM.

— 爪子

1

It's mentioned here that this is usually a problem with files on Windows.

One possible solution would be to run the file through a tool like dos2unix first.

— 德雷克·索巴尼亚

Yes, dos2unix (part of Cygwin) has options to add (--add-bom) and remove (--remove-bom) a BOM.

— Roman
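If an external tool is not an option, the same kind of pre-processing pass can be roughed out in plain Java. This is only a sketch under stated assumptions: the class `BomRemover` and its methods are hypothetical, it handles only the UTF-8 BOM, and it reads the whole file into memory, so it is meant for reasonably small files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration only.
class BomRemover {

    /** Returns the data without a leading EF BB BF; data without a BOM is returned as-is. */
    static byte[] removeUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    /** Rewrites the file in place, but only if a BOM was actually found. */
    static void removeUtf8Bom(Path file) throws IOException {
        byte[] original = Files.readAllBytes(file);
        byte[] cleaned = removeUtf8Bom(original);
        if (cleaned.length != original.length) {
            Files.write(file, cleaned);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        try {
            Files.write(tmp, new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
            removeUtf8Bom(tmp);
            System.out.println(Files.readAllBytes(tmp).length); // prints 2
        } finally {
            Files.delete(tmp);
        }
    }
}
```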

1

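In case somebody wants to do it with the standard library, this would be a way:

public static String cutBOM(String value) {
    // Text decoded with the matching charset: the BOM shows up as the single char U+FEFF
    if (value.startsWith("\uFEFF"))
        return value.substring(1);
    // Assumption: otherwise the text was decoded byte-for-byte (ISO-8859-1),
    // so the raw BOM bytes are still visible as individual chars.
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    if (value.startsWith("\u00EF\u00BB\u00BF"))
        // UTF-8
        return value.substring(3);
    else if (value.startsWith("\u00FE\u00FF") || value.startsWith("\u00FF\u00FE"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}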

— Markus

0

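The easiest way I found to bypass the BOM:

BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // Remove the UTF-8 BOM; it only appears as "ï»¿" when the bytes were
    // decoded with a single-byte charset such as ISO-8859-1 or windows-1252
    currentLine = currentLine.replace("ï»¿", "");
    // ... use currentLine
}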

— David