如何gzip在Python中压缩字符串?


86

如何gzip在Python中压缩字符串?

gzip.GzipFile 存在,但这是针对文件对象的-纯字符串呢?


1
@KevinDTimm,该文档仅提及StringIO但并未真正解释如何执行。所以在这里问这个问题是完全有效的,恕我直言。不过,在询问并告诉我们有关它们的信息之前,还要进行一些试验。
Alfe

@Alfe-该问题在4年前被关闭,其原因与我的评论大致相同-OP并未进行任何先搜索的努力。
KevinDTimm 2015年

4
这离题如何?

2
这个问题现在gzip string in python是google上的热门话题,这是非常合理的IMO。应该重新打开它。
加勒特

2
如上所述,这个问题是google搜索中的最高结果,答案之一是正确的-似乎好像不应该关闭它。
darkdan21 '18

Answers:


156

如果您想产生一个完整的,gzip兼容的二进制字符串(带有标头等),则可以gzip.GzipFile与一起使用StringIO

try:
    from StringIO import StringIO  # Python 2.7
except ImportError:
    from io import StringIO  # Python 3.x
import gzip
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
  f.write("This is mike number one, isn't this a lot of fun?")
out.getvalue()

# returns '\x1f\x8b\x08\x00\xbd\xbe\xe8N\x02\xff\x0b\xc9\xc8,V\x00\xa2\xdc\xcc\xecT\x85\xbc\xd2\xdc\xa4\xd4"\x85\xfc\xbcT\x1d\xa0X\x9ez\x89B\tH:Q!\'\xbfD!?M!\xad4\xcf\x1e\x00w\xd4\xea\xf41\x00\x00\x00'

2
相反的是:`def gunzip_text(text):infile = StringIO.StringIO()infile.write(text)with gzip.GzipFile(fileobj = infile,mode =“ r”)as f:f.rewind()f .read()return out.getvalue()
fastmultiplication 14-4-24

3
@fastmultiplication:或更短:f = gzip.GzipFile(StringIO.StringIO(text)); result = f.read(); f.close(); return result
Alfe

2
不幸的是,这个问题已经接近了,所以我不能让一个新的答案,但在这里是如何在Python 3。做到这一点
加勒特

可能无关,首先在内存中压缩更快(使用本地磁盘)吗?
user3226167'9

1
在Python 3中:import zlib; my_string = "hello world"; my_bytes = zlib.compress(my_string.encode('utf-8')); my_hex = my_bytes.hex(); my_bytes2 = bytes.fromhex(my_hex); my_string2 = zlib.decompress(my_bytes); assert my_string == my_string2;
ostrokach

64

最简单的方法是zlib 编码

compressed_value = s.encode("zlib")

然后用以下命令解压缩:

plain_string_again = compressed_value.decode("zlib")

1
@Daniel:是的,s是类型为的Python 2.x对象str
Sven Marnach

2
请参阅标准编码以了解其来源(向下滚动至“编解码器”)。也可用:s.encode('rot13')s.encode( 'base64' )
bobobobo

8
请注意,此方法与gzip命令行实用程序不兼容,因为gzip包含标头和校验和,而此机制只是压缩内容。
tylerl 2013年

我知道这很旧,但是您要解压缩的代码行应该是: plain_string_again = compressed_value.decode("zlib")
minillinim 2014年

6
@BenjaminToueg:Python 3在Unicode字符串(strPython 3中的类型)和字节字符串(类型bytes)之间的区别上更加严格。 str对象具有encode()返回对象的方法bytes,而bytes对象具有decode()返回的方法str。该zlib编解码器的特殊之处在于它转换bytesbytes,所以它不适合这种结构。您可以使用codecs.encode(b, "zlib")codecs.decode(b, "slib")作为bytes对象b
Sven Marnach 2014年

22

Sven Marnach 2011年答案的Python3版本:

import gzip
exampleString = 'abcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijmortenpunnerudengelstadrocksklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuvabcdefghijklmnopqrstuv123'
compressed_value = gzip.compress(bytes(exampleString,'utf-8'))
plain_string_again  = gzip.decompress(compressed_value)

2
在Python 3zlib中仍在使用,gzip实际使用zlib,请参阅:docs.python.org/3/library/zlib.htmldocs.python.org/3/library/gzip.html#module-gzip
gitaarik

我最初的答案是使用zlib。更改为gzip,因为这是原始问题。在我的示例中,您可以轻松地将gzip替换为zlib(搜索并替换),它将起作用。
Punnerud

2

对于那些想要以JSON格式压缩Pandas数据帧的人:

经过Python 3.6和Pandas 0.23测试

import sys
import zlib, lzma, bz2
import math

def convert_size(size_bytes):
    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])

dataframe = pd.read_csv('...') # your CSV file
dataframe_json = dataframe.to_json(orient='split')
data = dataframe_json.encode()
compressed_data = bz2.compress(data)
decompressed_data = bz2.decompress(compressed_data).decode()
dataframe_aux = pd.read_json(decompressed_data, orient='split')

#Original data size:  10982455 10.47 MB
#Encoded data size:  10982439 10.47 MB
#Compressed data size:  1276457 1.22 MB (lzma, slow), 2087131 1.99 MB (zlib, fast), 1410908 1.35 MB (bz2, fast)
#Decompressed data size:  10982455 10.47 MB
print('Original data size: ', sys.getsizeof(dataframe_json), convert_size(sys.getsizeof(dataframe_json)))
print('Encoded data size: ', sys.getsizeof(data), convert_size(sys.getsizeof(data)))
print('Compressed data size: ', sys.getsizeof(compressed_data), convert_size(sys.getsizeof(compressed_data)))
print('Decompressed data size: ', sys.getsizeof(decompressed_data), convert_size(sys.getsizeof(decompressed_data)))

print(dataframe.head())
print(dataframe_aux.head())

-4
s = "a long string of characters"

g = gzip.open('gzipfilename.gz', 'w', 5) # ('filename', 'read/write mode', compression level)
g.write(s)
g.close()

4
我想问题是关于压缩内存中的字符串而不必在过程中将其写入磁盘。否则,您的答案是完全正确的。
Alfe 2015年
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.