How to determine the encoding of text?

219

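I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? The related question "How can I detect the encoding/codepage of a text file" deals with C#.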

python encoding text-files

— Nope
source

225

Correctly detecting the encoding every time is impossible.

(From the chardet FAQ:)

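However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that this isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.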

There is the chardet library, which uses that study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods, in order:

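An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.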
UTF-8
Windows-1252
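
The order of fallbacks listed above can be sketched with the standard library alone. This is not Beautiful Soup's actual implementation: guess_encoding, _BOMS and the loose _DECLARED pattern are hypothetical names made up for illustration, the sniffing step only looks at BOMs, and the chardet step is left out.

```python
import codecs
import re

# BOMs to sniff, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Hypothetical, deliberately loose pattern for an XML or HTML encoding declaration.
_DECLARED = re.compile(rb"""(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9._-]+)""")


def guess_encoding(data):
    """Return (text, encoding) using a fixed order of fallbacks."""
    candidates = []
    # 1. Encoding declared inside the document itself.
    match = _DECLARED.search(data[:1024])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    # 2. Encoding sniffed from the first few bytes (BOM only in this sketch).
    for bom, name in _BOMS:
        if data.startswith(bom):
            candidates.append(name)
            break
    # 3. A statistical detector such as chardet would go here (omitted).
    # 4. and 5. The two final fallbacks from the list above.
    candidates += ["utf-8", "windows-1252"]
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue
    # windows-1252 leaves a few bytes undefined, so replace what cannot be decoded.
    return data.decode("windows-1252", errors="replace"), "windows-1252"
```

Calling guess_encoding(b"caf\xe9") falls through to the last candidate, because a lone 0xE9 byte is not valid UTF-8. The result is still only a guess: the bytes are equally valid in several other single-byte encodings.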

— nosklo
source

1

Thanks for the chardet reference. It seems good, although a bit slow.

— Craig McQueen, 2010

17

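@Geomorillo: There's no such thing as an "encoding standard". Text encoding is as old as computing; it grew organically with time and needs, it wasn't planned. "Unicode" is an attempt to fix this.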

— nosklo, 2013

1

And not a bad one, all things considered. What I'd like to know is: how do I find out which encoding an open text file was opened with?

— holdenweb, 2014

2

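@dumbledad What I said is that detecting it correctly all the time is impossible. All you can do is guess, and the guess can sometimes fail; it won't work every time, because encodings are not really detectable. To make the guess, you can use one of the tools I suggested in the answer.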

— nosklo

1

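@LasseKärkkäinen The point of the answer is to show that correctly detecting the encoding is impossible; the function you provide can guess right for your case, but it is wrong in many other cases.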

— nosklo

67

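Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of python bindings available.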

The python bindings that live in the file source tree are available as the python-magic (or python3-magic) debian package. It can determine the encoding of a file by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on pypi that also uses libmagic. It can also get the encoding, by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)

— Hamish Downer
source

5

libmagic is indeed a viable alternative to chardet. And great info on the distinct packages named python-magic! I'm sure this ambiguity will ...

1

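file is not particularly good at identifying the human language in text files. It is excellent for identifying various container formats, although you sometimes have to know what it means ("Microsoft Office document" could mean an Outlook message, etc.).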

— 2015

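Looking for a way to manage the file encoding mystery, I found this post. Unfortunately, using the example code, I can't get past open(): UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 169799: invalid start byte. The file encoding according to vim's :set fileencoding is latin1.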

— xtian

If I use the optional argument errors='ignore', the output of the example code is the less helpful binary.

— xtian

2

@xtian You need to open the file in binary mode, i.e. open("filename.txt", "rb").

— L. Kärkkäinen, 2019

31

Some encoding strategies, please uncomment to taste:

#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
# quote the variable so paths containing spaces survive
file -i "$tmpfile"
enca -g "$tmpfile"
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
#enca -x utf-8 "$tmpfile"
#enca -g "$tmpfile"
recode CP1250..UTF-8 "$tmpfile"

You might like to check the encoding by opening and reading the file in a loop... but you might need to check the file size first:

import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # ...etc
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()  # decoding happens here, so errors surface here
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break
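
The same trial-and-error idea can be written against bytes that were read once, which avoids reopening a large file for every candidate. This is only a minimal sketch: first_decodable and its default candidate list are hypothetical, not part of any library.

```python
def first_decodable(data, candidates=("utf-8", "windows-1250", "windows-1252")):
    """Return (encoding, text) for the first candidate that decodes data without error."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


# Bytes that are not valid UTF-8 fall through to the next candidate.
sample = "Zażółć gęślą jaźń".encode("windows-1250")
encoding, text = first_decodable(sample)
print(encoding)
```

Keep in mind that a successful decode only shows the bytes are valid in that encoding, not that the encoding is the right one. Most single-byte code pages accept almost any byte sequence, so the order of the candidates decides the outcome.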

— zzart
source

You can also use io, like io.open(filepath, 'r', encoding='utf-8'), which is more convenient, because codecs doesn't convert \n automatically on reading and writing.

— Searene

23

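Here is an example of reading a file and taking chardet's encoding prediction at face value, reading only n_lines from the file in case it is large.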

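chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven't looked at how they come up with that), which is returned together with the prediction from chardet.detect(), so you could work that in somehow if you like.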

def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''
    import chardet

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']
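
If the file may contain very long lines, one alternative is to cap the sample by bytes instead of by lines. This is a sketch only; read_sample and the 64 KiB default are assumptions, not something chardet prescribes.

```python
def read_sample(file_path, max_bytes=64 * 1024):
    """Return at most max_bytes raw bytes from the start of the file."""
    with open(file_path, "rb") as f:
        return f.read(max_bytes)
```

The result can be passed to chardet.detect() in place of rawdata above. The cut may land in the middle of a multi-byte character, so the tail of the sample should not be treated as evidence of a broken file.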

— ryanjdillon
source

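Looking at this again after getting an up-vote, I now see that this solution could slow down if there is a lot of data on the first line. In some cases it would be better to read the data in differently.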

— ryanjdillon '18

2

I have modified this function this way: def predict_encoding(file_path, n=20): ... skip ... and then rawdata = b''.join([f.read() for _ in range(n)]). I tried this function on Python 3.6, and it worked perfectly with "ascii", "cp1252", "utf-8" and "unicode" encodings. So this is definitely correct.

— n158

1

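This is very good for handling small datasets in a variety of formats. I tested it recursively on my root directory and it worked like a treat. Thanks, buddy.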

— Datanovice

4

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if sys.version_info[0] != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """

    with open(file, 'rb') as f:
        b = f.read(4)

    # "utf-8-sig" strips the BOM on read; plain "utf8" would leave it in the text
    if b[0:3] == b'\xef\xbb\xbf':
        return "utf-8-sig"

    # Check utf-32 before utf-16: the utf-32 little-endian BOM
    # starts with the same two bytes as the utf-16 little-endian BOM
    if b[0:4] in (b'\xff\xfe\x00\x00', b'\x00\x00\xfe\xff'):
        return "utf32"

    # Python automatically detects endianness if a utf-16 BOM is present;
    # write endianness is generally determined by the endianness of the CPU
    if b[0:2] in (b'\xfe\xff', b'\xff\xfe'):
        return "utf16"

    # If a BOM is not provided, then assume it's the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States it's: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

# "utf-8-sig" writes the BOM header that bomType() looks for
fout = open("myfile2.txt", "w", encoding="utf-8-sig")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline() 
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline() 
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
print(fin.read(20))
fin.close()
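
A short illustration of what errors='ignore' does to the badboy.txt bytes, compared with errors='replace'. It relies on Python's cp1252 codec leaving these four byte values undefined; the variable names are made up for the example.

```python
raw = b"hi there.  barf(\x81\x8D\x90\x9D)"

# The four bytes inside the parentheses are undefined in Python's cp1252 codec.
dropped = raw.decode("cp1252", errors="ignore")   # undefined bytes vanish silently
marked = raw.decode("cp1252", errors="replace")   # each one becomes U+FFFD instead
print(repr(dropped))
print(marked.count("\ufffd"))
```

With 'ignore' there is no trace that anything was lost, so 'replace' may be the safer choice when you want to notice bad input later.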

— Bill Moore
source

2

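Depending on your platform, I just opt to use the linux shell file command. This works for me since I am using it in a script that runs exclusively on one of our linux machines.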

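Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.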

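import subprocess

def is_utf8(path):  # is_utf8 is an illustrative name
    file_cmd = ['file', path]
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE)
    cmd_output = p.stdout.readlines()
    p.wait()
    # x will begin with the file type output as is observed using 'file' command
    # (the pipe yields bytes, so decode before splitting)
    x = cmd_output[0].decode().split(": ")[1]
    return x.startswith('UTF-8')

print(is_utf8('test.txt'))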

— Mike D
source

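There is no need to fork a new process. Python code already runs inside a process, and can call the proper system functions itself without the overhead of loading a new process.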

— vdboor

2

This might be helpful:

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
   content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'

— richinex
source

1

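It is, in principle, impossible to determine the encoding of a text file in the general case. So no, there is no standard Python library to do that for you.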

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.
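
A tiny demonstration of why the general case is undecidable: the same three bytes decode without error under several single-byte encodings, each time to different text. The byte values and the list of codecs are just an example.

```python
data = b"\xe4\xf6\xfc"

# Every codec below accepts these bytes, so decoding alone cannot rule any of
# them out; only knowledge of the expected text can pick a winner.
codecs_to_try = ("latin-1", "windows-1252", "iso-8859-7", "koi8-r", "cp437")
decoded = {name: data.decode(name) for name in codecs_to_try}
for name, text in decoded.items():
    print(name, text)
```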

— Martin v. Löwis
source

1

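If you know some of the file's content, you can try to decode it with several encodings and see which one is missing that content. In general there is no way, since a text file is a text file and those are stupid ;)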
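
One way to read that suggestion, as a rough sketch: decode with each candidate and keep only the encodings under which a string you expect to see actually appears. encodings_matching and the sample data are hypothetical.

```python
def encodings_matching(data, expected, candidates):
    """Return the candidate encodings under which data decodes and contains expected."""
    matches = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        if expected in text:
            matches.append(name)
    return matches


data = "Preis: 10 €".encode("windows-1252")
result = encodings_matching(data, "€", ["utf-8", "latin-1", "windows-1252"])
print(result)
```

Here latin-1 decodes the bytes without complaint but turns the euro sign into a control character, so only the expected-content check tells it apart from windows-1252.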

— Martin Thurau
source

1

This site has python code for recognizing ascii, encodings with BOMs, and utf8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. Read a file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example. I'm on osx.

#!/usr/bin/python

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt
True

— js2010
source

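A link to a solution is welcome, but please make sure your answer is useful without it: add context around the link so your fellow users will have some idea of what it is and why it's there, then quote the most relevant part of the page you're linking to in case the target page becomes unavailable. Answers that are little more than a link may be deleted.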

— double-beep