SQLite, Python, Unicode, and non-UTF data

Question 1

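I started by trying to store strings in SQLite using Python, and got this message:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

OK, I switched to Unicode strings. Then I started getting the message:

sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'

when trying to retrieve data from the database. After more research, I started encoding it in utf8, but then 'Sigur Rós' started looking like 'Sigur RÃ³s'.

Note: as @John Machin pointed out, my console was set to display in 'latin_1'.

What gives? After reading this, which describes exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.

I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple of hours, but I still don't know whether there is a way to correctly convert 'ó' from latin-1 to utf-8 without mangling it. If there isn't, why would sqlite "highly recommend" that I switch my application to unicode strings?

I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours, so that someone in my shoes can have an easier guide. If the information I post is wrong or misleading in any way, please tell me and I'll update it, or one of you senior guys can update it.

Summary of answers

Let me first state the goal as I understand it. If you are trying to convert between various encodings, the goal is to understand what your source encoding is, then convert the data to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base, and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because those characters aren't in the same place as they are in, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python, the process of getting to unicode and into another encoding looks like: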

str.decode('source_encoding').encode('desired_encoding')

or, if str is already in unicode:

str.encode('desired_encoding')
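
As a small sketch of the same decode-then-encode step in Python 3 terms: the snippets in this post are Python 2, where `str` is a byte string, so this assumes `bytes` as the Python 3 stand-in. The variable names are only for illustration.

```python
# Python 3 sketch: bytes play the role of the Python 2 `str` used in this post.
latin1_bytes = b"Sigur R\xf3s"          # 'ó' is the single byte 0xF3 in latin_1

text = latin1_bytes.decode("latin_1")    # bytes -> unicode text
utf8_bytes = text.encode("utf_8")        # unicode text -> bytes in the target encoding

# ascii() escapes non-ASCII characters, so the output does not depend on the console.
print(ascii(text))        # 'Sigur R\xf3s'
print(repr(utf8_bytes))   # b'Sigur R\xc3\xb3s'
```

The 'ó' goes from one byte (0xF3) to two bytes (0xC3 0xB3) without being mangled, as long as the source encoding named in `decode` is the right one.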

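For sqlite I didn't actually want to encode it again; I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python.

1. The encoding of the string you want to work with, and the encoding you want to get it to.
2. The system encoding.
3. The console encoding.
4. The encoding of the source file.

Elaboration:

(1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.

(2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary... well, here's the file I added: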

# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')

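This system encoding is the one that gets used when you call the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.

(3) If you're using IDLE or the command-line python, I think your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm that what I was doing was working.

(4) If you want to have some test strings, e.g.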

test_str = "ó"

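in your source code, then you will have to tell python what encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like this at the top of your source code file: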

# -*- coding: utf_8 -*-

If you don't have this declaration, python attempts to parse your code as ascii by default, and so:

SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

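Once your program is working correctly, or if you aren't using python's console or any other console to look at output, you will probably only care about #1 on the list. The system default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function, pasted at the bottom of this gigantic mess, that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: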

'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

Now I change the system and console encoding to latin_1, and I get this output for the same input:

'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

Notice that the 'original' character displays correctly and the builtin unicode() function works now.

Now I change my console output back to utf_8.

'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

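Here everything still works the same as last time, but the console can't display the output correctly. Etc. The function below also displays more information than this, and hopefully will help someone figure out where the gap in their understanding is. I know all this information is in other places and is dealt with more thoroughly there, but I hope this is a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great, but sometimes source code can save you a day or two of trying to figure out what functions do what.

Disclaimers: I'm no encoding expert; I put this together to help my own understanding. I kept building on it when I should probably have started passing functions as arguments to avoid so much redundant code, so if I can, I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes; they are just the two I was playing around with, because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.

One more thing: there are apparently crazy application developers making life difficult in Windows.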

#!/usr/bin/env python
# -*- coding: utf_8 -*-

import os
import sys

def encodingDemo(str):
    validStrings = ()
    try:        
        print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
        validStrings += ((str,""),)
    except UnicodeEncodeError as ude:
        print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print ude
    try:
        x = unicode(str)
        print "unicode(str) = ",x
        validStrings+= ((x, " decoded into unicode by the default system encoding"),)
    except UnicodeDecodeError as ude:
        print "ERROR.  unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
        print "\tThe system encoding is set to {0}.  See error:\n\t".format(sys.getdefaultencoding()),  
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print uee
    try:
        x = str.decode('latin_1')
        print "str.decode('latin_1') =",x
        validStrings+= ((x, " decoded with latin_1 into unicode"),)
        try:        
            print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
            validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
        except UnicodeDecodeError as ude:
            print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8.  See error:\n\t",
            print ude
    except UnicodeDecodeError as ude:
        print "Something didn't work, probably because the string wasn't latin_1 encoded.  See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print uee
    try:
        x = str.decode('utf_8')
        print "str.decode('utf_8') =",x
        validStrings+= ((x, " decoded with utf_8 into unicode"),)
        try:
            print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
            validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
        except UnicodeEncodeError as uee:
            print "str.decode('utf_8').encode('latin_1') didn't work.  The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1.  See error:\n\t",
            print uee
    except UnicodeDecodeError as ude:
        print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded.  See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",uee

    print
    print "Printing information about each character in the original string."
    for char in str:
        try:
            print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
        except UnicodeDecodeError as ude:
            print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
            print uee    

        try:
            x = unicode(char)        
            print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = unicode(char) ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = unicode(char)  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)

        try:
            x = char.decode('latin_1')
            print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('latin_1')  ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('latin_1')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)

        try:
            x = char.decode('utf_8')
            print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('utf_8')  ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('utf_8')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)

        print

x = 'ó'
encodingDemo(x)
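
For readers on Python 3, here is a much shorter sketch of the same idea: it loops over candidate encodings instead of repeating the try/except blocks. It is not a port of the demo above; `try_decodings` is a hypothetical helper name, and `bytes` stands in for the Python 2 `str`.

```python
def try_decodings(data, encodings=("ascii", "latin_1", "utf_8")):
    """Decode `data` (bytes) with each encoding; map encoding -> text or the error raised."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError as err:
            results[name] = err
    return results


results = try_decodings(b"\xf3")
for name, outcome in results.items():
    # ascii() escapes non-ASCII text so printing works on any console encoding.
    print(name, "->", ascii(outcome))
```

Add your own encodings to the `encodings` tuple to test other inputs.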

Many thanks for the answers below, and especially to @John Machin for answering so thoroughly.

Question 2

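"I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it"

repr() and unicodedata.name() are your friends when it comes to debugging such problems: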

>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>

If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
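
A latin1 terminal effectively decodes the incoming bytes as latin1, so the effect can be reproduced without a terminal. This is a Python 3 sketch (the session above is Python 2); `misread` is just an illustrative name.

```python
import unicodedata

oacute_utf8 = "\u00f3".encode("utf8")     # b'\xc3\xb3'
misread = oacute_utf8.decode("latin1")     # each byte becomes one character

for ch in misread:
    print(ascii(ch), unicodedata.name(ch))
# '\xc3' LATIN CAPITAL LETTER A WITH TILDE
# '\xb3' SUPERSCRIPT THREE
```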

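"I switched to Unicode strings."

What are you calling Unicode strings? UTF-16?

"What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all."

I can't imagine how it seems so to you. The story being conveyed was that unicode objects in Python and UTF-8 encoding in the database were the way to go. However, Martin answered the original question, giving a method ("text factory") for the OP to be able to use latin1. This did NOT constitute a recommendation!

Update in response to these further questions raised in a comment:

"I didn't understand that the unicode characters still contained an implicit encoding. Am I saying that right?"

No. An encoding is a mapping between Unicode and something else, and vice versa. A Unicode character doesn't have an encoding, implicit or otherwise.

"It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr()."

Say what? It doesn't look like it to me: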

>>> unicode("\xF3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal
not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>

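Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') ... this is certainly true.

It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) ... including blowing up when an inappropriate encoding is supplied.

"Is that a happy circumstance?"

That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, ANY 8-bit byte, ANY Python str object can be decoded into unicode without an exception being raised. This is as it should be.

However, there exist certain persons who confuse two quite separate concepts: "my script runs to completion without any exceptions being raised" and "my script is error-free". To them, latin1 is "a snare and a delusion".

In other words, if you have a file that's actually encoded in cp1252 or gbk or koi8-u or whatever, and you decode it using latin1, the resulting Unicode will be utter rubbish, and Python (or any other language) will not flag an error. It has no way of knowing that you have committed a silliness.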
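
A small sketch of that silent failure, written for Python 3 (where `bytes` plays the role of the Python 2 `str`). The sample text is made up: curly quotes encoded as cp1252 and then decoded with the wrong codec.

```python
cp1252_bytes = "\u201cquoted\u201d".encode("cp1252")   # b'\x93quoted\x94'

wrong = cp1252_bytes.decode("latin1")   # no exception, but the quotes become control characters
right = cp1252_bytes.decode("cp1252")

print(ascii(wrong))   # '\x93quoted\x94'
print(ascii(right))   # '\u201cquoted\u201d'
```

Both calls succeed; only the second returns the text that was actually written.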

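"or is unicode("str") going to always return the correct decoding?"

Just like that, with the default encoding being ascii, it will return the correct unicode if the file is actually encoded in ASCII. Otherwise, it'll blow up.

Similarly, if you specify the correct encoding, or one that's a superset of the correct encoding, you'll get the correct result. Otherwise you'll get gibberish or an exception.

In short: the answer is no.

"If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?"

If the str object is a valid XML document, the encoding will be specified up front. The default is UTF-8. If it's a properly constructed web page, it should be specified up front (look for "charset"). Unfortunately, many writers of web pages lie through their teeth (ISO-8859-1 aka latin1 should be Windows-1252 aka cp1252; don't waste resources trying to decode gb2312, use gbk instead). You can get clues from the nationality/language of the website.

UTF-8 is always worth trying. If the data is ascii, it'll work fine, because ascii is a subset of utf8. A string of text that was written using non-ascii characters and encoded in an encoding other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.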
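
One way to put "try UTF-8 first" into code is sketched below for Python 3. The helper name `decode_guess` and the fallback order are assumptions for illustration, not something from the answer, and the fallback result is still only a guess.

```python
def decode_guess(data, fallbacks=("cp1252", "latin_1")):
    """Try UTF-8 first, then each fallback; return (text, encoding_used)."""
    for name in ("utf_8",) + tuple(fallbacks):
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


for sample in (b"plain ascii", "Sigur R\u00f3s".encode("utf_8"), b"Sigur R\xf3s"):
    text, used = decode_guess(sample)
    print(ascii(text), "decoded as", used)
```

Because latin_1 accepts every byte, keeping it last in the list means the function always returns something. That is convenient, but it is exactly the trap described above when the data is really in some other encoding.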

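All of the above heuristics, and more, and a lot of statistics are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However, you can't make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence, e.g. 0.8. Always check the confidence part of the answer.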
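
A sketch of checking the confidence before trusting the guess. `accept_guess` and the 0.9 threshold are hypothetical choices for illustration; the only chardet call used is `chardet.detect`, which returns a dict with "encoding" and "confidence" keys. chardet is a third-party package, so the import is guarded.

```python
def accept_guess(result, min_confidence=0.9):
    """Return the guessed encoding only if the detector was confident enough.

    `result` is a dict shaped like the one chardet.detect() returns:
    {"encoding": ..., "confidence": ...}.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding


try:
    import chardet  # third-party; may not be installed
except ImportError:
    chardet = None

if chardet is not None:
    guess = chardet.detect("Sigur R\u00f3s".encode("utf_8"))
    print(guess, "->", accept_guess(guess))
```

When `accept_guess` returns None, fall back to whatever you know about where the data came from rather than decoding blindly.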

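If all else fails:

(1) Try asking here, with a small sample from the front of your data ... print repr(your_data[:400]) ... and whatever collateral information about its provenance you have.

(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.

Update 2: BTW, isn't it about time you opened up another question? ;-)

"One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren't the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot."

It's not Windows that's doing it; it's a bunch of crazy application developers. You might more understandably have not paraphrased but quoted the opening paragraph of the effbot article that you referred to:

"Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications."

Background:

The range U+0000 to U+001F inclusive is designated in Unicode as "C0 Control Characters". These also exist in ASCII and latin1, with the same meanings. They include such familiar things as carriage return, line feed, bell, backspace, tab, and others that are rarely used.

The range U+0080 to U+009F inclusive is designated in Unicode as "C1 Control Characters". These also exist in latin1, and include 32 characters that nobody outside unicode.org can imagine any possible use for.

Consequently, if you run a character frequency count on your unicode or latin1 data and you find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, and thus the effbot's solution will work. In another case that I've been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 and another encoding, which needed to be deduced from letter frequencies in the (human) language the files were written in.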
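
The frequency count and the "same position in cp1252" repair can be sketched as follows, in Python 3. This is not the effbot code, only an illustration of the same idea; both function names are made up here.

```python
from collections import Counter


def count_c1(text):
    """Count characters in the C1 control range U+0080..U+009F."""
    return Counter(ch for ch in text if "\x80" <= ch <= "\x9f")


def repair_c1_as_cp1252(text):
    """Reinterpret each C1 control character as the cp1252 character at the same byte value."""
    repaired = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few byte values are unassigned in cp1252; leave them alone
        repaired.append(ch)
    return "".join(repaired)


suspect = b"\x93quoted\x94".decode("latin1")    # cp1252 data decoded with the wrong codec
print(count_c1(suspect))
print(ascii(repair_c1_as_cp1252(suspect)))      # '\u201cquoted\u201d'
```

A non-empty count only says that something is wrong. Whether the cp1252 reinterpretation is the right repair depends on how the data was corrupted.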

Question 3

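UTF-8 is the default encoding of SQLite databases. This shows up in situations like "SELECT CAST(x'52C3B373' AS TEXT);". However, the SQLite C library doesn't actually check whether a string inserted into a database is valid UTF-8.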
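
Both points can be checked from Python 3's sqlite3 module with an in-memory database. This is a sketch: the table and column names are made up, and hex() is used to read back the stored bytes without decoding them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The bytes 52 C3 B3 73 are interpreted as UTF-8 text.
decoded = conn.execute("SELECT CAST(x'52C3B373' AS TEXT)").fetchone()[0]
print(ascii(decoded))   # 'R\xf3s'

# Invalid UTF-8 (a lone latin-1 0xF3) is accepted and stored as TEXT without complaint.
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES (CAST(x'52F373' AS TEXT))")
stored_type, stored_hex = conn.execute("SELECT typeof(name), hex(name) FROM t").fetchone()
print(stored_type, stored_hex)   # text 52F373
```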

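If you insert a Python unicode object (or a str object in 3.x), the Python sqlite3 library will automatically convert it to UTF-8. But if you insert a str object, it will just assume the string is UTF-8, because a Python 2.x "str" doesn't know its own encoding. This is one reason to prefer Unicode strings.

However, it doesn't help you if your data is broken to begin with.

To fix your data, do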

db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")

for every text column in your database.
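
The same repair, sketched end to end for Python 3 against an in-memory database. The broken row is simulated here by casting latin-1 bytes to TEXT; in Python 3 the user-defined function receives the BLOB as `bytes`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TheTable (TextColumn TEXT)")

# Simulate broken data: latin-1 bytes stored in a TEXT column.
conn.execute("INSERT INTO TheTable VALUES (CAST(? AS TEXT))", (b"Sigur R\xf3s",))

# Reinterpret the raw bytes as latin-1; sqlite3 stores the returned str as UTF-8.
conn.create_function("FIXENCODING", 1, lambda blob: bytes(blob).decode("latin-1"))
conn.execute("UPDATE TheTable SET TextColumn = FIXENCODING(CAST(TextColumn AS BLOB))")

fixed, fixed_hex = conn.execute("SELECT TextColumn, hex(TextColumn) FROM TheTable").fetchone()
print(ascii(fixed), fixed_hex)   # 'Sigur R\xf3s' 5369677572205200C3B373 is NOT expected; see hex below
```

The printed hex should show C3B3 where the single F3 byte used to be. Run the update only on rows that really hold latin-1 bytes: applied to rows that are already valid UTF-8, it would turn 'ó' into 'Ã³'.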

Question 4

I fixed this pysqlite problem by setting:

conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')

By default, text_factory is set to unicode(), which will use the current default encoding (ascii on my machine).
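
In Python 3 the text_factory callable receives `bytes`, so an equivalent setting can be sketched like this. Note what 'ignore' does to the data: the undecodable byte is silently dropped rather than repaired.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.text_factory = lambda raw: raw.decode("utf-8", "ignore")

# A TEXT value holding a latin-1 'ó' (0xF3), which is not valid UTF-8 here.
value = conn.execute("SELECT CAST(? AS TEXT)", (b"Sigur R\xf3s",)).fetchone()[0]
print(value)   # Sigur Rs
```

This makes the error go away, but the 'ó' is lost. If the bytes are known to be latin-1, decoding them as latin-1 keeps the character.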

Question 5

Of course there is. But your data is already broken in the database, so you'll need to fix it:

>>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
Sigur Rós
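
The same round trip as a small Python 3 helper that leaves the text alone when the repair does not apply. The function name is made up for illustration.

```python
def undo_latin1_mojibake(text):
    """Reverse 'UTF-8 bytes read as latin-1'; return text unchanged if that does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(ascii(undo_latin1_mojibake("Sigur R\u00c3\u00b3s")))   # 'Sigur R\xf3s'
print(ascii(undo_latin1_mojibake("Sigur R\u00f3s")))         # 'Sigur R\xf3s' (already correct)
```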

Question 6

This fixed my Unicode problems with Python 2.x (Python 2.7.6, to be specific):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

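It also solved the error you mention right at the beginning of the post:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless ...

EDIT

sys.setdefaultencoding is a dirty hack. Yes, it can solve UTF-8 issues, but everything comes with a price. For more details, refer to the following links: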