How do I unpack a pkl file?

Question 1

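I have a pkl file from the MNIST dataset that contains handwritten digit images.

I'd like to look at each of the digit images, so I need to unpack the pkl file, but I can't figure out how.

Is there a way to unpack/unzip a pkl file?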

Question 2

In general

Your pkl file is, in fact, a serialized pickle file, which means it was dumped using Python's pickle module.

To un-pickle the data you can:

import pickle


with open('serialized.pkl', 'rb') as f:
    data = pickle.load(f)
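
If it helps to see both directions, here is a minimal round-trip sketch. The sample dictionary and the temporary file location are made up for illustration; only pickle.dump and pickle.load matter.

```python
import os
import pickle
import tempfile

# Hypothetical sample object standing in for whatever your .pkl file holds.
sample = {"digits": [0, 1, 2], "label": "demo"}

path = os.path.join(tempfile.mkdtemp(), "serialized.pkl")

# Serialize ("pickle") the object to disk.
with open(path, "wb") as f:
    pickle.dump(sample, f)

# Deserialize ("unpickle") it again; the result is an ordinary Python object.
with open(path, "rb") as f:
    data = pickle.load(f)

print(type(data).__name__)
print(data == sample)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files from sources you trust.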

For the MNIST dataset

Note that gzip is only needed if the file is compressed:

import gzip
import pickle


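with gzip.open('mnist.pkl.gz', 'rb') as f:
    # On Python 3, a file pickled under Python 2 may need
    # pickle.load(f, encoding='latin1') to avoid a UnicodeDecodeError.
    train_set, valid_set, test_set = pickle.load(f)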

Each set can then be split further (shown here for the training set):

train_x, train_y = train_set

Those are the inputs (digits) and outputs (labels) of the set.
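
If you are not sure whether a given file is compressed, one option is to peek at its first two bytes, since every gzip stream starts with 0x1f 0x8b. The sketch below is my own helper, not part of pickle or gzip: load_pickle and the fake data are hypothetical names used only for the demonstration.

```python
import gzip
import os
import pickle
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def load_pickle(path, **kwargs):
    """Load a pickle file, using gzip only if the file is gzip-compressed."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        # Extra keyword arguments (e.g. encoding="latin1") go to pickle.load.
        return pickle.load(f, **kwargs)


# Small demonstration with made-up data shaped like (inputs, labels) pairs.
fake_sets = (([[0.0, 1.0]], [7]), ([[0.5, 0.5]], [3]), ([[1.0, 0.0]], [1]))
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "demo.pkl")
gz_path = os.path.join(tmp_dir, "demo.pkl.gz")

with open(plain_path, "wb") as f:
    pickle.dump(fake_sets, f)
with gzip.open(gz_path, "wb") as f:
    pickle.dump(fake_sets, f)

train_set, valid_set, test_set = load_pickle(gz_path)
train_x, train_y = train_set
print(train_y)
```

The same call works for the uncompressed copy, so the calling code does not need to care which kind of file it was given.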

If you want to display the digits:

import matplotlib.cm as cm
import matplotlib.pyplot as plt


plt.imshow(train_x[0].reshape((28, 28)), cmap=cm.Greys_r)
plt.show()
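
For a quick look without opening a plot window, a digit can also be dumped to the terminal as text. This is only a rough sketch: ascii_digit is a hypothetical helper, and the 0.5 threshold assumes pixel values scaled to the range [0, 1].

```python
def ascii_digit(pixels, side=28, threshold=0.5):
    """Render a flat sequence of side*side pixel values as text rows."""
    if len(pixels) != side * side:
        raise ValueError("expected %d pixel values" % (side * side))
    rows = []
    for r in range(side):
        row = pixels[r * side:(r + 1) * side]
        # Bright pixels become '#', dark pixels become '.'.
        rows.append("".join("#" if v > threshold else "." for v in row))
    return "\n".join(rows)


# Tiny made-up 4x4 "image" with a vertical stroke, just to show the format.
demo = [0, 1, 0, 0,
        0, 1, 0, 0,
        0, 1, 0, 0,
        0, 1, 0, 0]
print(ascii_digit(demo, side=4))
```

With the real data you would call something like print(ascii_digit(train_x[0])), and loop over the indices to page through the digits one by one.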


The other alternative is to look at the original data:

http://yann.lecun.com/exdb/mnist/

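But that will be harder, as you'll need to write a program to read the binary data in those files. So I recommend using Python and loading the data with pickle. As you've seen, it's very easy. ;-)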

Question 3

Handy one-liner

pkl() (
  python -c 'import pickle,sys;d=pickle.load(open(sys.argv[1],"rb"));print(d)' "$1"
)
pkl my.pkl

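This prints the __str__ of the pickled object.

The general problem of visualizing an object is of course not well defined, so if __str__ is not enough, you will need a custom script.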
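
As one possible starting point for such a custom script, the sketch below summarizes the structure of an object instead of printing all of it. The describe function is hypothetical, and its rules (descend two levels, show at most three children, report .shape when present) are arbitrary choices of mine.

```python
def describe(obj, depth=0, max_depth=2):
    """Return indented lines summarizing the structure of an object."""
    pad = "  " * depth
    name = type(obj).__name__
    shape = getattr(obj, "shape", None)  # present on NumPy arrays
    if shape is not None:
        return ["%s%s shape=%s" % (pad, name, shape)]
    if isinstance(obj, dict):
        lines = ["%s%s len=%d" % (pad, name, len(obj))]
        children = list(obj.values())
    elif isinstance(obj, (list, tuple)):
        lines = ["%s%s len=%d" % (pad, name, len(obj))]
        children = list(obj)
    else:
        return ["%s%s" % (pad, name)]
    if depth < max_depth:
        # Only descend into the first few children to keep the output short.
        for child in children[:3]:
            lines.extend(describe(child, depth + 1, max_depth))
    return lines


# Made-up nested object standing in for the result of pickle.load.
demo = ([1, 2, 3], {"a": 1.5})
print("\n".join(describe(demo)))
```

In practice you would pass it the result of pickle.load rather than the demo tuple.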

Question 4

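If you want to work with the original MNIST files, here is how you can deserialize them.

If you haven't downloaded the files yet, do that first by running the following in the terminal: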

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Then save the following as deserialize.py and run it.

import numpy as np
import gzip

IMG_DIM = 28

def decode_image_file(fname):
    result = []
    n_bytes_per_img = IMG_DIM*IMG_DIM

    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
        data = bytes_[16:]

        if len(data) % n_bytes_per_img != 0:
            raise Exception('Something wrong with the file')

        result = np.frombuffer(data, dtype=np.uint8).reshape(
            len(data)//n_bytes_per_img, n_bytes_per_img)

    return result

def decode_label_file(fname):
    result = []

    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
        data = bytes_[8:]

        result = np.frombuffer(data, dtype=np.uint8)

    return result

train_images = decode_image_file('train-images-idx3-ubyte.gz')
train_labels = decode_label_file('train-labels-idx1-ubyte.gz')

test_images = decode_image_file('t10k-images-idx3-ubyte.gz')
test_labels = decode_label_file('t10k-labels-idx1-ubyte.gz')
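
The 16 and 8 bytes skipped above are the IDX headers. As I understand the layout described on the MNIST page, a file starts with two zero bytes, a data type code (0x08 for unsigned byte) and the number of dimensions, followed by one big-endian 32-bit size per dimension. The sketch below reads that header instead of hard-coding the offset; parse_idx_header is a hypothetical helper and the sample bytes are built by hand.

```python
import struct


def parse_idx_header(raw):
    """Return (dims, offset) for an IDX byte string holding unsigned bytes."""
    zero, dtype_code, n_dims = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an IDX file of unsigned bytes")
    offset = 4 + 4 * n_dims  # magic number plus one 32-bit size per dimension
    dims = struct.unpack(">" + "I" * n_dims, raw[4:offset])
    return dims, offset


# Made-up IDX payload: 2 images of 2x2 pixels, built by hand for illustration.
fake = (struct.pack(">HBB", 0, 0x08, 3)
        + struct.pack(">III", 2, 2, 2)
        + bytes(range(8)))
dims, offset = parse_idx_header(fake)
print(dims, offset)
```

With three dimensions the offset comes out as 16 and with one dimension as 8, which matches the slices used in deserialize.py, and the dimensions give you the image count and size without assuming 28.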

The script doesn't normalize the pixel values the way the pickled file does. To do that, all you have to do is

train_images = train_images/255
test_images = test_images/255

Question 5

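You need to use the pickle module (and gzip, if the file is compressed).

NOTE: These are already part of the Python standard library. There is no need to install anything new.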