Convert bytes to a string

2300

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

How do I convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to work with Python 3.

python string python-3.x

— Tomas Sedovic

46

Why doesn't str(text_bytes) work? This seems strange to me.

— Charlie Parker

12

@CharlieParker Because str(text_bytes) can't specify the encoding. Depending on what's in text_bytes, text_bytes.decode('cp1250') might result in a very different string from text_bytes.decode('utf-8').

— Craig Anderson

6

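So the str function no longer converts to a real string. For some reason I have to state an encoding explicitly, and I'm too lazy to read up on why. Just convert it to utf-8 and see if your code works, e.g. var = var.decode('utf-8')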

— Charlie Parker

@CraigAnderson: unicode_text = str(bytestring, character_encoding) works as expected on Python 3. However, unicode_text = bytestring.decode(character_encoding) is preferable, to avoid confusion with plain str(bytes_obj), which produces a text representation of bytes_obj instead of decoding it to text: str(b'\xb6', 'cp1252') == b'\xb6'.decode('cp1252') == '¶' and str(b'\xb6') == "b'\\xb6'" == repr(b'\xb6') != '¶'

— jfs

3670

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'
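
To illustrate why the encoding has to match the data, here is a minimal sketch that decodes the same bytes with two different codecs. The sample bytes are made up for illustration and are not from the question.

```python
# Hypothetical sample: 'caf' followed by the UTF-8 encoding of U+00E9 (two bytes, C3 A9).
raw = b"caf\xc3\xa9"

# Correct codec for this data: the two bytes become one character.
as_utf8 = raw.decode("utf-8")        # 'caf\xe9' (4 characters)

# Wrong codec: no exception is raised, but each byte becomes its own character.
as_cp1252 = raw.decode("cp1252")     # 'caf\xc3\xa9' (5 characters)

print(len(as_utf8), len(as_cp1252))  # 4 5
```

The second call succeeds silently, which is why a wrong encoding is often harder to notice than a UnicodeDecodeError.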

— Aaron Maenpaa

57

Using "windows-1252" is not reliable either (e.g. for other language versions of Windows); wouldn't it be best to use sys.stdout.encoding?

— nikow, 2012

12

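Maybe this will help somebody further: sometimes you use a byte array, e.g. for TCP communication. If you want to convert the byte array to a string and cut off the trailing '\x00' characters, the answer above is not enough. In that case use b'example\x00\x00'.decode('utf-8').strip('\x00').

— Wookie88, 2013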

2

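I've filed a bug about documenting this at bugs.python.org/issue17860 - feel free to propose a patch. If it is hard to contribute, comments on how to improve that are welcome.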

— anatoly techtonik

44

This fails in Python 2.7.6: b"\x80\x02\x03".decode("utf-8") -> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte.

— martineau, 2014

9

If the content is random binary values, the utf-8 conversion is likely to fail. Instead, see @techtonik's answer (below): stackoverflow.com/a/27527728/198536

— wallyk, 2015

214

You need to decode the byte string to turn it into a character (Unicode) string.

On Python 2

encoding = 'utf-8'
'hello'.decode(encoding)

or

unicode('hello', encoding)

On Python 3

encoding = 'utf-8'
b'hello'.decode(encoding)

or

str(b'hello', encoding)

— dF.

2

On Python 3, what if the string is in a variable?

— Alaa M.

1

@AlaaM.: the same. If you have variable = b'hello', then unicode_text = variable.decode(character_encoding)

— jfs

182

I think this way is easy:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
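
Because chr() maps each byte value to the code point with the same number, this approach behaves like decoding as latin-1, and for non-ASCII data it generally differs from a UTF-8 decode. A small sketch on Python 3, with illustrative sample values:

```python
bytes_data = [112, 52, 52]
joined = "".join(map(chr, bytes_data))

# For values 0..255 the result matches a latin-1 decode of the same bytes.
assert joined == bytes(bytes_data).decode("latin-1") == "p44"

# For bytes above 0x7F the result is not what a UTF-8 decode would give.
utf8_bytes = "\u00e9".encode("utf-8")       # b'\xc3\xa9', the UTF-8 form of U+00E9
via_chr = "".join(map(chr, utf8_bytes))     # two characters: U+00C3, U+00A9
via_utf8 = utf8_bytes.decode("utf-8")       # one character: U+00E9

print(len(via_chr), len(via_utf8))          # 2 1
```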

— Sisso

6

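Thank you, your method worked for me. I had a non-encoded byte array that I needed turned into a string. I was trying to find a way to re-encode it so I could decode it into a string. This method works perfectly!

— leetNightshade, 2014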

5

@leetNightshade: yet it is terribly inefficient. If you have a byte array, you only need to decode it.

— Martijn Pieters

12

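@Martijn Pieters I just did a simple benchmark against the other answers, with multiple rounds of 10,000 runs (stackoverflow.com/a/3646405/353094), and the above solution was actually much faster every single time. For 10,000 runs in Python 2.7.7 it takes 8 ms, versus 12 ms and 18 ms for the others. Granted, there could be some variation depending on input, Python version, etc. It doesn't seem too slow to me.

— leetNightshade, 2014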

5

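@Martijn Pieters Yes. So on that point, this isn't the best answer for the body of the question that was asked. And the title is misleading, isn't it? He/she wants to convert a byte string to a regular string, not a byte array to a string. This answer works fine for the title of the question that was asked.

— leetNightshade, 2014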

5

For Python 3 this should be equivalent to bytes([112, 52, 52]). By the way, bytes is a bad name for a local variable, precisely because it is a builtin in Python 3.

— Mr_and_Mrs_D

91

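If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)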

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

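Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not translated, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this: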

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

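The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout - it is where Python chokes with the infamous ordinal not in range.

UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.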
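
A minimal sketch of such a [binary] -> [str] -> [binary] round trip on Python 3 might look like the following. It only checks correctness for one sample input and says nothing about performance.

```python
# Arbitrary bytes, including one (0xff) that is invalid as UTF-8.
original = b"\x00\x01\xffsd"

# With surrogateescape, each undecodable byte becomes a lone surrogate
# in the range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
text = original.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so no data is lost.
restored = text.encode("utf-8", "surrogateescape")

print(restored == original)  # True
```

Note that the intermediate string contains lone surrogates, so encoding it with the default strict handler (for example when printing) would raise an error.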

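UPDATE 20170116: Thanks to a comment - it is also possible to slash-escape all unknown bytes with the backslashreplace error handler. That works only on Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)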

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

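See Python's Unicode Support for details.

UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.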

# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # The failing span may cover more than one byte, so escape each byte
    # separately and zero-pad it to two hex digits.
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
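
For comparison, here is a short sketch of how the built-in error handlers treat the same input on Python 3. Decoding with backslashreplace assumes Python 3.5 or later.

```python
data = b"\x80abc"

# The default handler ('strict') raises on the invalid byte.
strict_failed = False
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("utf-8", "ignore")            # invalid byte dropped
replaced = data.decode("utf-8", "replace")          # invalid byte -> U+FFFD
escaped = data.decode("utf-8", "backslashreplace")  # invalid byte -> literal \x80

print(strict_failed, ascii(ignored), ascii(replaced), ascii(escaped))
```

Only the last variant keeps enough information to see which byte was invalid, and none of the three can be reversed to the original bytes in general.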

— anatoly techtonik

6

I really feel like Python should provide a mechanism to replace missing symbols and continue.

— anatoly techtonik

@techtonik: This won't work on an array the way it worked in Python 2.

— user2284570

@user2284570 Do you mean a list? And why should it work on arrays? Especially arrays of floats..

— anatoly techtonik

In Python 3 you can also simply ignore Unicode errors with b'\x00\x01\xffsd'.decode('utf-8', 'ignore').

— Antonis Kalou

3

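@anatolytechtonik It is possible to leave the escape sequence in the string and move on: b'\x80abc'.decode("utf-8", "backslashreplace") will result in '\\x80abc'. This information was taken from the Unicode documentation page, which seems to have been updated since this answer was written.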

— Nearoo

86

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

— lmiguelvargasf

41

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

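Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way you would know is to read the Windows documentation (or read it here).

— mcherm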

3

The open() function for text streams, or Popen() if you pass it universal_newlines=True, does magically decide the character encoding for you (locale.getpreferredencoding(False) in Python 3.3+).

— jfs, 2014

2

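'latin-1' is a verbatim encoding with all code points set, so you can use it to effectively read a byte string into whichever type of string your Python supports (so verbatim on Python 2, into Unicode for Python 3).

— tripleee, '17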

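@tripleee: 'latin-1' is a good way to get mojibake. There are also magical substitutions on Windows: it is surprisingly hard to pipe data from one process to another unmodified, e.g. dir: \xb6 -> \x14 (see the example at the end of my answer).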

— jfs

32

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

— ContextSwitch

5

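I've been using this method and it works. However, it just guesses the encoding based on user preferences on your system, so it's not as robust as some other options. This is what it does, quoting docs.python.org/3.4/library/subprocess.html: "If universal_newlines is True, [stdin, stdout and stderr] will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False)."

— twasbrillig, 2014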

On 3.7 you can (and should) use text=True instead of universal_newlines=True.

— Boris

23

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')

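— serv-inc

.decode(), which uses 'utf-8', may fail (the command's output may use a different character encoding, or even return an undecodable byte sequence). However, if the input is ASCII (a subset of utf-8), then .decode() works.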

— jfs

22

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except the slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

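Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding: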

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.
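
As a rough sketch, mojibake of this particular kind can sometimes be reversed by re-encoding with the codec that was wrongly used for decoding and then decoding with the right one. This assumes you know both codecs and that the wrong codec mapped every byte to some character, which is not guaranteed in general.

```python
good = "\u2014"  # the dash character used in the example above

# Simulate the failure: UTF-8 bytes decoded as cp1252.
mojibake = good.encode("utf-8").decode("cp1252")

# Undo it: go back to the original bytes, then decode correctly.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(len(mojibake), repaired == good)  # 3 True
```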

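In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, and therefore the chardet module exists, which can guess the character encoding. A single Python script may use multiple character encodings in different places.

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable file names (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):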

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
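
A small sketch of that round trip, assuming a Unix system (on Windows the filesystem codec uses a different error handler, so this sample may fail there). The file name is hypothetical and no file is actually created.

```python
import os

# Hypothetical file name containing a byte (0xff) that is not valid UTF-8.
raw_name = b"file-\xff.txt"

# On Unix, undecodable bytes are carried through as lone surrogates
# instead of raising an error.
name = os.fsdecode(raw_name)

# os.fsencode() reverses the mapping and yields the original bytes.
round_tripped = os.fsencode(name)

print(round_tripped == raw_name)  # True on Unix
```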

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode bytes, e.g. it can be cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() could be used: example.
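
One way this could look is sketched below. The child command is a stand-in (the current Python interpreter printing two lines) chosen so the snippet runs anywhere, and the utf-8 encoding with errors="replace" is an assumption about the child's output, not something the command guarantees.

```python
import io
import subprocess
import sys

# Stand-in for an arbitrary external command (assumption for portability).
cmd = [sys.executable, "-c", "print('line 1'); print('line 2')"]

with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
    # Wrap the binary pipe; bytes are decoded incrementally as they arrive.
    reader = io.TextIOWrapper(proc.stdout, encoding="utf-8", errors="replace")
    received = [line.rstrip("\n") for line in reader]

print(received)  # ['line 1', 'line 2']
```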

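Different commands may use different character encodings for their output, e.g. the dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):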

output = subprocess.check_output('dir', shell=True, encoding='cp437')

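The file names may differ from os.listdir() (which uses the Windows Unicode API), e.g. '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support file names with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".

— jfs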

16

Since this question is actually asking about subprocess output, you have a more direct approach available, because Popen accepts an encoding keyword (in Python 3.6+):

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
<class 'str'>
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer for other users is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

— wim

3

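Or with Python 3.7, you can pass text=True to decode stdin, stdout and stderr using the given encoding (if set) or the system default otherwise: Popen(['ls', '-l'], stdout=PIPE, text=True).

— Boris

Decoding ls output using the utf-8 encoding may fail (see the example in my answer from 2016).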

— jfs

1

@Boris: if the encoding parameter is given, then the text parameter is ignored.

— jfs

11

If you get the following when trying decode():

AttributeError: 'str' object has no attribute 'decode'

you can also specify the encoding type directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'

— Broper

6

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

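All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,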

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
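
A self-contained sketch of the same idea, using made-up sample bytes and a temporary directory instead of Input.txt. It also shows an alternative that keeps \r\n in the string and instead turns off newline translation when writing, via the newline="" argument of open().

```python
import os
import tempfile

data = b"first\r\nsecond\r\n"  # bytes as they might arrive from a Windows system

# Option 1 (as in the answer): normalize to '\n' after decoding.
text = data.decode("utf-8").replace("\r\n", "\n")

# Option 2: keep '\r\n' in the string and disable newline translation on write,
# so '\n' is never expanded to '\r\n' a second time.
path = os.path.join(tempfile.mkdtemp(), "Output.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(data.decode("utf-8"))

with open(path, "rb") as f:
    written = f.read()

print(written == data)  # True
```

Option 1 is usually more convenient when the string is processed further in Python, while option 2 preserves the input byte for byte.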

— bers

I was looking for the .replace("\r\n", "\n") addition for so long. This is the answer if you want to render HTML properly.

— mhlavacka

5

I made a function to clean a list

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista
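
A possible single-pass variant is sketched below. The name clean_list is hypothetical, and the sketch assumes Python 3 with a list of str, where the encode/decode round trip through UTF-8 leaves valid strings unchanged and is therefore left out.

```python
def clean_list(items):
    """Strip whitespace and remove newline and backspace characters in one pass."""
    return [x.strip().replace("\n", "").replace("\b", "") for x in items]


cleaned = clean_list(["  a\nb  ", "c\bd\n"])
print(cleaned)  # ['ab', 'cd']
```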

— eafloresf

6

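You can actually chain all of the .strip, .replace, .encode, etc. calls in one list comprehension and iterate over the list only once instead of iterating over it five times.

— Taylor Edmiston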

1

@TaylorEdmiston Maybe it saves on allocation, but the number of operations would remain the same.

— JulienD

5

For Python 3, this is a much safer and more Pythonic approach to convert from byte to string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2
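
A variant that returns the decoded value and raises an exception for unsupported input, rather than printing, might look like this. The name bytes_to_str and the utf-8 default are assumptions for illustration.

```python
def bytes_to_str(data, encoding="utf-8"):
    """Return data as str, decoding it first if it is bytes or bytearray."""
    if isinstance(data, str):
        return data
    if isinstance(data, (bytes, bytearray)):
        return data.decode(encoding)
    raise TypeError("expected bytes or str, got %s" % type(data).__name__)


listing = bytes_to_str(b"total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n")
print(listing, end="")
```

Returning the string leaves it to the caller to decide whether to print it, and a TypeError makes misuse visible instead of hiding it behind a message on stdout.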

— Inconnu

5

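1) As @bodangly said, type checking is not Pythonic at all. 2) The function you wrote is named "byte_to_str", which implies it will return a str, but it only prints the converted value, and it prints an error message if it fails (but doesn't raise an exception). This approach is also unpythonic and obfuscates the bytes.decode solution you provided.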

— cosmicFluke

3

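From sys — System-specific parameters and functions:

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

— Zhichang Yu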

3

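The pipe to the subprocess is already a binary buffer. Your answer fails to address how to get a string value from the resulting bytes value.

— Martijn Pieters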

1

def toString(string):
    try:
        # bytes objects have .decode(); str objects on Python 3 do not
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

— Leonardo Filipe

1

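While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Remember that you are answering the question for readers in the future, not just the person asking now! Please edit your answer to add an explanation and give an indication of what limitations and assumptions apply. It also doesn't hurt to mention why this answer is more appropriate than others.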

— Dev-iL

An explanation would be in order.

— Peter Mortensen

1

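For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7 you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output):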

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True.
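
A sketch of a call that should also run on Python 3.5 and 3.6: since capture_output was likewise added in 3.7, it uses stdout=subprocess.PIPE together with universal_newlines=True. The child command is a stand-in (the current interpreter printing one line) so that the snippet does not depend on ls being available.

```python
import subprocess
import sys

# Stand-in for ["ls", "-l"] (assumption for portability).
cmd = [sys.executable, "-c", "print('hello')"]

# stdout=PIPE captures the output; universal_newlines=True decodes it to str
# using the locale's preferred encoding and normalizes line endings to '\n'.
result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
output = result.stdout

print(repr(output))  # 'hello\n'
```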

— Boris

0

If you want to convert any bytes, not just strings converted to bytes:

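import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes made of ASCII characters, so decode them to str
    text_b85 = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    text_json = json.dumps(list(infile.read()))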

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.

— HCLivess

-1

Try this:

bytes.fromhex('c3a9').decode('utf-8')

— Victor Choy