How to auto-detect text file encoding?

69

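There are many plain text files that were encoded in a variety of charsets.

I want to convert them all to UTF-8, but before running iconv I need to know their original encoding. Most browsers have an Auto Detect option for encodings, but there are too many files for me to check them one by one.

Only once I know the original encoding can I convert the text with iconv -f DETECTED_CHARSET -t utf-8.

Is there any utility that detects the encoding of plain text files? It does not have to be 100% perfect; I don't mind if 100 files out of 1,000,000 are misconverted.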

linux batch encoding

— Xiè Jìléi
source

57

Try the chardet Python module, which is available on PyPI:

pip install chardet

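Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided the input text is long enough for statistical analysis. Do read the project documentation.

As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.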

— grawity
source

3

Yes, and it is already packaged as python-chardet in the Ubuntu universe repository.

— Xiè Jìléi

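Even when the guess is not perfect, chardet still gives its most likely guess, e.g. ./a.txt: GB2312 (confidence: 0.99). Compare that with Enca, which just fails and reports "Unrecognized encoding". Sadly, though, chardet runs very slowly.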

— Xiè Jìléi

1

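@Xiè Jìléi: Let it run overnight. Charset detection is a complicated process. You could also try the Java-based jChardet or ... The original chardet is part of Mozilla, but only the C++ source is available, with no command-line tool.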

— grawity 2011

2

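About speed: running chardet <(head -c4000 filename.txt) was much faster and equally successful for my use case. (In case it is not clear, this bash syntax sends only the first 4000 bytes to chardet.)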
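The sampling idea in this comment can also be sketched in Python. This is only a minimal sketch under stated assumptions: read_head and guess_from_head are hypothetical helper names, and detect stands for any callable that accepts bytes, for example chardet.detect if that module is installed.

```python
from typing import Any, Callable


def read_head(path: str, limit: int = 4000) -> bytes:
    """Return at most `limit` bytes from the start of the file."""
    with open(path, "rb") as handle:
        return handle.read(limit)


def guess_from_head(path: str,
                    detect: Callable[[bytes], Any],
                    limit: int = 4000) -> Any:
    """Run a detector on a leading sample instead of the whole file.

    `detect` is supplied by the caller (an assumption of this sketch),
    e.g. chardet.detect, which takes bytes and returns a dict.
    """
    return detect(read_head(path, limit))


# Possible usage, assuming chardet is installed:
#   import chardet
#   print(guess_from_head("filename.txt", chardet.detect))
```

A cut at a fixed byte offset can split a multi-byte character, so a slightly different sample size may occasionally change the guess.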

— ndemou 2015

@ndemou I have chardet==3.0.4, and the actual executable name of the command-line tool is chardetect, not chardet.

— Devy

31

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or, if you want just the actual character set (such as utf-8):

encoding=$(file -b --mime-encoding myfile.txt)

— Humpparitari
source

4

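Unfortunately, file only detects encodings with specific properties, such as UTF-8 or UTF-16. The rest, the older ISO 8859 sets or their MS-DOS and Windows counterparts, are listed as "unknown-8bit" or something similar, even for files that chardet detects with 99% confidence.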
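To illustrate what "specific properties" can mean, here is a rough Python sketch that recognizes only encodings with a checkable signature: a byte order mark, or bytes that pass strict UTF-8 validation. It is an assumption-laden illustration, not how file is actually implemented; sniff_encoding and the labels it returns are hypothetical.

```python
import codecs

# Longer BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_encoding(data: bytes) -> str:
    """Guess an encoding from structural properties only (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")  # strict validation, no statistics involved
    except UnicodeDecodeError:
        # Legacy 8-bit charsets carry no signature, so nothing more can be said.
        return "unknown-8bit"
    return "us-ascii" if all(byte < 0x80 for byte in data) else "utf-8"
```

Every single-byte legacy charset ends up in the same "unknown-8bit" bucket here, which is why a statistical detector is needed to tell them apart.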

— grawity 2011

6

file showed me iso-8859-1

— cweiske 2012

What if the extension is lying?

— james.garriss 2014

2

@james.garriss: The file extension has nothing to do with the encoding of its (text) content.

— MestreLion

29

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. See the package description below:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

— Xavier
source

3

Thanks! From the project's homepage it was not obvious to me that a CLI was included. It is also available on OS X when uchardet is installed via Homebrew.

— Stefan Schmidt

1

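I was a bit confused at first because an ISO 8859-1 document was wrongly identified as Windows-1252, but in the printable range Windows-1252 is a superset of ISO 8859-1, so the conversion with iconv works fine.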
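The superset claim can be checked with Python's built-in codecs. The sketch below (differing_bytes is a hypothetical helper name) decodes every byte value under both charsets and lists the values where the results differ.

```python
def differing_bytes() -> list:
    """Return the byte values that latin-1 and cp1252 decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")  # ISO 8859-1 maps every byte
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None  # unassigned in Python's cp1252 codec
        if cp1252 != latin1:
            diffs.append(value)
    return diffs


print([hex(value) for value in differing_bytes()])
```

With Python's codecs the two only disagree on 0x80 to 0x9F, which ISO 8859-1 reserves for control characters, so printable ISO 8859-1 text decodes identically under either name.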

— Stefan Schmidt

16

For Linux, there is enca; for Solaris, you can use auto_ef.

— 皮利斯
source

Enca seems too strict for me: enca -d -L zh ./a.txt fails with the message ./a.txt: Unrecognized encoding Failure reason: No clear winner. As @grawity mentioned, chardet is more lax, but it is still too slow.

— Xiè Jìléi 2011

10

Enca completely fails the "actually does something" test.

— Michael Wolf

1

uchardet failed (it detected CP1252 instead of the actual CP1250), but enca worked fine. (A single example, hard to generalize...)

— Palo

2

Mozilla has a nice codebase for auto-detection in web pages: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/

A detailed description of the algorithm: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

— 马丁·亨宁斯
source

2

Going back to chardet (Python 2.?), this call might be enough:

python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())' < file
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

Though it is far from perfect...

echo "öasd" | iconv -t ISO-8859-1 | python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())'
{'confidence': 0.5, 'encoding': 'windows-1252'}

— 埃斯塔尼
source

2

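Those who regularly use Emacs might find the following useful (it allows you to inspect and validate the conversion manually).

Moreover, I often find that Emacs' charset auto-detection is much more effective than other charset auto-detection tools (such as chardet).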

(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

(dolist (path paths)
  (find-file path)
  (set-buffer-file-coding-system 'utf-8-unix)
  )

Then, a simple call to Emacs with this script as an argument (see the "-l" option) does the job.

— Yves Lhuillier
source

1

UTFCast is worth a try. It did not work for me (maybe because my files are terrible), but it looks good.

http://www.addictivetips.com/windows-tips/how-to-batch-convert-text-files-to-utf-8-encoding/

— 沙美尔
source

0

isutf8 (from the moreutils package) did the job

— 罗南
source

2

How? This answer isn't really helpful.

— 摩西

1

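It is not exactly what was asked, but it is a useful tool. If the file is valid UTF-8, the exit status is zero. If the file is not valid UTF-8, or some error occurs, the exit status is non-zero.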
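The exit-status behavior described here can be approximated in Python with a strict decode. This is a minimal sketch of the same idea, not the isutf8 implementation; is_valid_utf8 and check_files are hypothetical names.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_files(paths) -> int:
    """Return 0 if every file is valid UTF-8, 1 otherwise.

    An unreadable file counts as a failure, mirroring the
    "or some error occurs" case described above.
    """
    status = 0
    for path in paths:
        try:
            with open(path, "rb") as handle:
                ok = is_valid_utf8(handle.read())
        except OSError:
            ok = False
        if not ok:
            status = 1
    return status
```

In a script, passing the result of check_files(sys.argv[1:]) to sys.exit would give a shell-visible status that can be used to skip files that are already UTF-8 before running iconv.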

— 吨

0

Also, in case file -i gives you unknown, you can use this PHP command to guess the charset, as shown below.

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, with "mb_list_encodings":

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

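In the first example, you can see that I passed a list of encodings that might match (the order of the list is the detection order). To get a more accurate result, you can try all possible encodings via mb_list_encodings().

Note that the mb_* functions require php-mbstring: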

apt-get install php-mbstring

See the answer: https://stackoverflow.com/a/57010566/3382822
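Why the order of the candidate list matters can be shown with a small Python sketch. It is only analogous to the PHP call above, not equivalent to mb_detect_encoding: it returns the first candidate under which the bytes decode strictly. first_matching_encoding is a hypothetical name.

```python
def first_matching_encoding(data: bytes, candidates):
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not know.
            continue
        return name
    return None


sample = "\u4f60\u597d".encode("utf-8")  # two CJK characters stored as UTF-8

print(first_matching_encoding(sample, ["utf-8", "iso-8859-1"]))
print(first_matching_encoding(sample, ["iso-8859-1", "utf-8"]))
```

The first call reports utf-8, while the second reports iso-8859-1, because ISO 8859-1 accepts any byte sequence. Permissive single-byte charsets therefore belong at the end of the list.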

— 穆罕默德·加尔比
source