Generate a heatmap in Matplotlib using a scatter data set

187

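I have a set of X, Y data points (about 10k) that are easy to plot as a scatter plot, but that I would like to represent as a heatmap.

I looked through the examples in Matplotlib, and they all seem to start from heatmap cell values when generating the image.

Is there a method that converts a bunch of x, y points, all different, to a heatmap (where zones with a higher frequency of x, y would be "warmer")?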

— 格雷耶

Also relevant: Efficient method of calculating density of irregularly spaced points

— ImportanceOfBeingErnest

182

If you don't want hexagons, you can use numpy's histogram2d function:

import numpy as np
import numpy.random
import matplotlib.pyplot as plt

# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)

heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.clf()
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()

This makes a 50x50 heatmap. If you want, say, 512x384, you can pass bins=(512, 384) in the call to histogram2d.
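
A small sketch of the non-square case, assuming the same kind of random test data as above (the seed and the bin counts are arbitrary choices for illustration). It also shows why the code above passes heatmap.T to imshow: histogram2d puts x along the first array axis, whereas imshow draws the first axis vertically.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.standard_normal(8873)
y = rng.standard_normal(8873)

# bins=(nx, ny): 64 bins along x and 48 bins along y
heatmap, xedges, yedges = np.histogram2d(x, y, bins=(64, 48))

print(heatmap.shape)       # (64, 48): first axis is x, second axis is y
print(heatmap.T.shape)     # (48, 64): rows are y, which is what imshow expects
print(int(heatmap.sum()))  # 8873: every point lands in exactly one bin
```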

Example: Matplotlib heatmap example

— ptomato

1

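I don't mean to be an idiot, but how do you actually save this output to a PNG/PDF file instead of only displaying it in an interactive IPython session? I'm trying to get this as some sort of normal axes instance, where I can add a title, axis labels, etc., and then call savefig() as I would for any other typical matplotlib plot.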

— gotgenes

3

@gotgenes: doesn't plt.savefig('filename.png') work? If you want to get an axes instance, use Matplotlib's object-oriented interface: fig = plt.figure(); ax = fig.gca(); ax.imshow(...); fig.savefig(...)

— ptomato, 2011
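
A minimal sketch of the object-oriented route described in this comment, written out as a script. The file name, labels, seed and the Agg backend are assumptions made for illustration; they are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8873)
y = rng.standard_normal(8873)

heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

# The heatmap lives on an ordinary Axes, so titles and labels work as usual
fig, ax = plt.subplots()
im = ax.imshow(heatmap.T, extent=extent, origin="lower")
ax.set_title("2D histogram of x and y")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.colorbar(im, ax=ax, label="count")

# Hypothetical output name; a ".pdf" suffix would write a PDF instead
fig.savefig("heatmap.png", dpi=150)
```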

1

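Indeed, thanks! I guess I didn't fully understand that imshow() is in the same category of functions as scatter(). I honestly don't understand why imshow() converts a 2d array of floats into blocks of appropriate color, whereas I do understand what scatter() is supposed to do with such an array.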

— gotgenes

14

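A warning about using imshow to plot a 2d histogram of x/y values like this: by default, imshow puts the origin in the upper left corner and transposes the image. To get the same orientation as a scatter plot, what I do is plt.imshow(heatmap.T, extent=extent, origin='lower')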

— Jamie

7

For those wanting a logarithmic colorbar, see this question stackoverflow.com/questions/17201172/… and simply do from matplotlib.colors import LogNorm; plt.imshow(heatmap, norm=LogNorm()); plt.colorbar()

— tommy.carstensen
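
A hedged sketch of the logarithmic color scale from this comment. Empty bins have a count of 0, which has no logarithm, so this version masks them explicitly before applying LogNorm; the masking step, seed and file name are assumptions added here, not part of the comment.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm

rng = np.random.default_rng(0)
x = rng.standard_normal(8873)
y = rng.standard_normal(8873)

heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

# Mask empty bins so that only positive counts reach the log scale
masked = np.ma.masked_equal(heatmap.T, 0)

fig, ax = plt.subplots()
im = ax.imshow(masked, extent=extent, origin="lower", norm=LogNorm())
fig.colorbar(im, ax=ax, label="count (log scale)")
fig.savefig("heatmap_log.png")  # hypothetical file name
```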

109

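In Matplotlib lexicon, I think you want a hexbin plot.

If you're not familiar with this type of plot, it's just a bivariate histogram in which the xy-plane is tessellated by a regular grid of hexagons.

So from a histogram, you can just count the number of points falling in each hexagon: discretize the plotting region as a set of windows and assign each point to one of these windows; finally, map the windows onto a color array, and you've got a hexbin diagram.

Though less commonly used than, e.g., circles or squares, it is intuitive that hexagons are a better choice for the geometry of the binning container:

hexagons have nearest-neighbor symmetry (square bins don't: e.g., the distance from a point on a square's border to a point inside that square is not equal everywhere), and
the hexagon is the highest n-polygon that gives a regular plane tessellation (i.e., you can safely re-tile your kitchen floor with hexagonal tiles because there will be no void space between the tiles when you are finished; this is not true for any higher-n polygon, n >= 7).

(Matplotlib uses the term hexbin plot; so do (AFAIK) all of the plotting libraries for R. I still don't know whether this is the generally accepted term for plots of this type, though I suspect it is, given that hexbin is short for hexagonal binning, which describes the essential step in preparing the data for display.)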

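from matplotlib import pyplot as PLT
from matplotlib import cm as CM
import numpy as NP


def bivariate_normal(X, Y, sigmax, sigmay, mux, muy):
    # Uncorrelated bivariate Gaussian density; stands in for
    # matplotlib.mlab.bivariate_normal, which was removed in Matplotlib 3.1
    z = ((X - mux) / sigmax) ** 2 + ((Y - muy) / sigmay) ** 2
    return NP.exp(-z / 2) / (2 * NP.pi * sigmax * sigmay)


x = y = NP.linspace(-5, 5, 100)
X, Y = NP.meshgrid(x, y)
Z1 = bivariate_normal(X, Y, 2, 2, 0, 0)
Z2 = bivariate_normal(X, Y, 4, 1, 1, 1)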
ZD = Z2 - Z1
x = X.ravel()
y = Y.ravel()
z = ZD.ravel()
gridsize=30
PLT.subplot(111)

# if 'bins=None', then color of each hexagon corresponds directly to its count
# 'C' is optional--it maps values to x-y coordinates; if 'C' is None (default) then 
# the result is a pure 2D histogram 

PLT.hexbin(x, y, C=z, gridsize=gridsize, cmap=CM.jet, bins=None)
PLT.axis([x.min(), x.max(), y.min(), y.max()])

cb = PLT.colorbar()
cb.set_label('mean value')
PLT.show()
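
The example above colors each hexagon by the mean of C. For the pure density case the question asks about, a minimal sketch might look as follows; the random data, seed, colormap and file name are assumptions chosen for illustration. With C left at its default, each hexagon is colored by the number of points it contains.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = rng.standard_normal(10_000)

fig, ax = plt.subplots()
# C is not given, so hexbin acts as a plain 2D histogram of point counts.
# gridsize is the number of hexagons in the x direction.
hb = ax.hexbin(x, y, gridsize=30, cmap="viridis")
fig.colorbar(hb, ax=ax, label="count")

counts = hb.get_array()  # one accumulated value per hexagon
fig.savefig("hexbin_counts.png")  # hypothetical file name
```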


— doug

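What do you mean by "hexagons have nearest-neighbor symmetry"? You say that "the distance from a point on a square's border to a point inside that square is not equal everywhere", but the distance to what?

— 2014

9

For a hexagon, the distance from the center to a vertex joining two sides is also longer than the distance from the center to the middle of a side; only the ratio is smaller (2/sqrt(3) ≈ 1.15 for a hexagon vs. sqrt(2) ≈ 1.41 for a square). The only shape where the distance from the center to every point on the border is equal is the circle.

— 2014

5

@Jaan For a hexagon, every neighbor is at the same distance. There is no 8-neighborhood versus 4-neighborhood problem. There are no diagonal neighbors, just one kind of neighbor.

— isarandi, 2015

@doug How do you choose the gridsize= parameter? I would like to choose it such that the hexagons just touch without overlapping. I noticed that gridsize=100 produces smaller hexagons, but how do I choose the proper value?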

— 亚历山大·斯卡

40

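Edit: For a better approximation of Alejandro's answer, see below.

I know this is an old question, but I wanted to add something to Alejandro's answer: if you want a nice smoothed image without using py-sphviewer, you can instead use np.histogram2d and apply a gaussian filter (from scipy.ndimage) to the heatmap: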

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.ndimage import gaussian_filter


def myplot(x, y, s, bins=1000):
    heatmap, xedges, yedges = np.histogram2d(x, y, bins=bins)
    heatmap = gaussian_filter(heatmap, sigma=s)

    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    return heatmap.T, extent


fig, axs = plt.subplots(2, 2)

# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)

sigmas = [0, 16, 32, 64]

for ax, s in zip(axs.flatten(), sigmas):
    if s == 0:
        ax.plot(x, y, 'k.', markersize=5)
        ax.set_title("Scatter plot")
    else:
        img, extent = myplot(x, y, s)
        ax.imshow(img, extent=extent, origin='lower', cmap=cm.jet)
        ax.set_title(r"Smoothing with $\sigma$ = %d" % s)

plt.show()

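Produces:

The scatter plot and s=16 plotted on top of each other for Agape Gal'lo (click for a better view):

One difference I noticed between my gaussian filter approach and Alejandro's approach is that his method shows local structures much better than mine. Therefore I implemented a simple nearest-neighbour method at pixel level. For each pixel, this method calculates the inverse of the sum of the distances to the n closest points in the data. At high resolution this method is computationally expensive, and I think there is a quicker way, so let me know if you have any improvements.

Update: As I suspected, there is a much faster method using SciPy's scipy.spatial.cKDTree. See Gabriel's answer for the implementation.

Anyway, here's my code: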

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm


def data_coord2view_coord(p, vlen, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * vlen
    return dv


def nearest_neighbours(xs, ys, reso, n_neighbours):
    im = np.zeros([reso, reso])
    extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]

    xv = data_coord2view_coord(xs, reso, extent[0], extent[1])
    yv = data_coord2view_coord(ys, reso, extent[2], extent[3])
    for x in range(reso):
        for y in range(reso):
            xp = (xv - x)
            yp = (yv - y)

            d = np.sqrt(xp**2 + yp**2)

            im[y][x] = 1 / np.sum(d[np.argpartition(d.ravel(), n_neighbours)[:n_neighbours]])

    return im, extent


n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)
resolution = 250

fig, axes = plt.subplots(2, 2)

for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 64]):
    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=2)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:
        im, extent = nearest_neighbours(xs, ys, resolution, neighbours)
        ax.imshow(im, origin='lower', extent=extent, cmap=cm.jet)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])
plt.show()

Result:

— Jurgy

1

Love this. The graph is as nice as in Alejandro's answer, but no new packages are required.

— 内森·克莱门特

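Very nice! But you generate an offset with this method. You can see this by comparing a normal scatter plot with the colored one. Could you add something to correct it? Or just shift the graph by some x and y values?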

— Agape Gal'lo

1

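Agape Gal'lo, what do you mean by offset? If you plot them on top of each other they do match (see the edit of my post). Perhaps you are put off because the width of the scatter plot doesn't match exactly with the other three.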

— Jurgy

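Thanks a lot for plotting the graph just for me! I understood my mistake: I had modified the "extent" to define the x and y limits. I now understand that this modified the origin of the graph. So I have one last question: how can I expand the limits of the graph, even over areas where there is no existing data? For example, between -5 and +5 for x and y.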

— Agape Gal'lo

1

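Say you want the x axis to go from -5 to 5 and the y axis from -3 to 4; in the myplot function, add the range parameter to np.histogram2d: np.histogram2d(x, y, bins=bins, range=[[-5, 5], [-3, 4]]), and in the for-loop set the x and y limits of the axes: ax.set_xlim([-5, 5]); ax.set_ylim([-3, 4]). Additionally, by default imshow keeps the aspect ratio identical to the ratio of your axes (so in my example a ratio of 10:7), but if you want it to match your plot window, add the parameter aspect='auto' to imshow.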

— Jurgy
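
The change described in this comment, sketched as one script. The parameter name hist_range, the bin count, sigma, seed and file name are assumptions introduced here for illustration; only the range=, set_xlim/set_ylim and aspect='auto' parts come from the comment.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage import gaussian_filter


def myplot(x, y, s, bins=200, hist_range=((-5, 5), (-3, 4))):
    # range= fixes the histogram limits, even where there is no data
    heatmap, xedges, yedges = np.histogram2d(x, y, bins=bins, range=hist_range)
    heatmap = gaussian_filter(heatmap, sigma=s)  # sigma is measured in bins
    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    return heatmap.T, extent


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

img, extent = myplot(x, y, s=8)

fig, ax = plt.subplots()
# aspect='auto' lets the image fill the axes instead of keeping 10:7
ax.imshow(img, extent=extent, origin="lower", aspect="auto")
ax.set_xlim(-5, 5)
ax.set_ylim(-3, 4)
fig.savefig("heatmap_fixed_range.png")  # hypothetical file name
```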

31

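Instead of using np.histogram2d, which in general produces quite ugly histograms, I would like to recycle py-sphviewer, a python package for rendering particle simulations using an adaptive smoothing kernel, which can be easily installed from pip (see the webpage documentation). Consider the following code, which is based on the example: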

import numpy as np
import numpy.random
import matplotlib.pyplot as plt
import sphviewer as sph

def myplot(x, y, nb=32, xsize=500, ysize=500):   
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)

    x0 = (xmin+xmax)/2.
    y0 = (ymin+ymax)/2.

    pos = np.zeros([3, len(x)])
    pos[0,:] = x
    pos[1,:] = y
    w = np.ones(len(x))

    P = sph.Particles(pos, w, nb=nb)
    S = sph.Scene(P)
    S.update_camera(r='infinity', x=x0, y=y0, z=0, 
                    xsize=xsize, ysize=ysize)
    R = sph.Render(S)
    R.set_logscale()
    img = R.get_image()
    extent = R.get_extent()
    for i, j in zip(range(4), [x0,x0,y0,y0]):
        extent[i] += j
    print(extent)
    return img, extent

fig = plt.figure(1, figsize=(10,10))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)


# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)

#Plotting a regular scatter plot
ax1.plot(x,y,'k.', markersize=5)
ax1.set_xlim(-3,3)
ax1.set_ylim(-3,3)

heatmap_16, extent_16 = myplot(x,y, nb=16)
heatmap_32, extent_32 = myplot(x,y, nb=32)
heatmap_64, extent_64 = myplot(x,y, nb=64)

ax2.imshow(heatmap_16, extent=extent_16, origin='lower', aspect='auto')
ax2.set_title("Smoothing over 16 neighbors")

ax3.imshow(heatmap_32, extent=extent_32, origin='lower', aspect='auto')
ax3.set_title("Smoothing over 32 neighbors")

#Make the heatmap using a smoothing over 64 neighbors
ax4.imshow(heatmap_64, extent=extent_64, origin='lower', aspect='auto')
ax4.set_title("Smoothing over 64 neighbors")

plt.show()

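which produces the following image:

As you can see, the images look pretty nice, and we are able to identify different substructures in them. These images are constructed by spreading a given weight for every point over a certain domain, defined by the smoothing length, which in turn is given by the distance to the nb-th closest neighbor (I've chosen 16, 32 and 64 for the examples). So, higher-density regions are typically spread over smaller regions compared to lower-density regions.

The function myplot is just a very simple function that I've written in order to hand the x, y data to py-sphviewer, which does the magic.

— Alejandro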

2

A comment for anyone trying to install py-sphviewer on OSX: I had quite a lot of difficulty, see: github.com/alejandrobll/py-sphviewer/issues/3

— Sam Finnigan

Too bad it doesn't work with python3. It installs, but then crashes when you try to use it...

— Fabio Dias

1

@Fabio Dias, the latest version (1.1.x) now works with Python 3.

— Alejandro

29

If you are using Matplotlib 1.2.x or newer:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(100000)
y = np.random.randn(100000)
plt.hist2d(x,y,bins=100)
plt.show()


— 皮蒂·翁蒙科库尔

17

Seaborn now has the jointplot function, which should work nicely here:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)

sns.jointplot(x=x, y=y, kind='hex')
plt.show()

— wordsforthewise

Simple, pretty and analytically useful.

— ryanjdillon

@wordsforthewise, how do you make 600k data points visually readable using this? (How do you resize?)

— nrmb

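I'm not quite sure what you mean; maybe it's best to ask a separate question and link it here. Do you mean resizing the whole figure? First make the figure with fig = plt.figure(figsize=(12, 12)), then get the current axes with ax=plt.gca(), then add the argument ax=ax to the jointplot function.

— wordsforthewise

@wordsforthewise, could you please answer this question: stackoverflow.com/questions/50997662/… Thanks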

— ebrahimi

4

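The very initial question was... how to convert scatter values to grid values, right? histogram2d does count the frequency per cell; however, if you have other data per cell than just the frequency, you need to do some additional work.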

x = data_x # between -10 and 4, log-gamma of an svc
y = data_y # between -4 and 11, log-C of an svc
z = data_z #between 0 and 0.78, f1-values from a difficult dataset

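So, I have a dataset with Z results for X and Y coordinates. However, I calculated only a few points outside the area of interest (large gaps), and heaps of points in a small area of interest.

Yes, here it becomes more difficult but also more fun. Some libraries (sorry):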

from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
from scipy.interpolate import griddata

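pyplot is my graphics engine today, and cm is a range of color maps with some interesting choices. numpy is for the calculations, and griddata is for attaching values to a fixed grid.

The last one is important, especially because the frequency of xy points is not equally distributed in my data. First, let's start with some boundaries fitting my data and an arbitrary grid size. The original data also has data points outside those x and y boundaries.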

#determine grid boundaries
gridsize = 500
x_min = -8
x_max = 2.5
y_min = -2
y_max = 7

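So we have defined a grid with 500 pixels between the min and max values of x and y.

In my data, there are many more than 500 values available in the area of high interest, whereas in the low-interest area there are not even 200 values in the total grid; between the graphic boundaries x_min and x_max there are even fewer.

So, to get a nice picture, the task is to average the values in the high-interest area and to fill the gaps elsewhere.

I define my grid now. For each xx-yy pair, I want to have a color.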

xx = np.linspace(x_min, x_max, gridsize) # array of x values
yy = np.linspace(y_min, y_max, gridsize) # array of y values
grid = np.array(np.meshgrid(xx, yy.T))
grid = grid.reshape(2, grid.shape[1]*grid.shape[2]).T

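Why the strange shape? scipy.interpolate.griddata wants a shape of (n, D).

Griddata calculates one value per point in the grid, using a predefined method. I choose "nearest": empty grid points will be filled with values from the nearest neighbor. This makes it look as if the areas with less information have bigger cells (even though that is not the case). One could instead choose to interpolate with "linear"; then areas with less information look less sharp. It's a matter of taste, really.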

points = np.array([x, y]).T # because griddata wants it that way
z_grid2 = griddata(points, z, grid, method='nearest')
# you get a 1D vector as result. Reshape to picture format!
z_grid2 = z_grid2.reshape(yy.shape[0], xx.shape[0])  # rows follow y, columns follow x

And hop, we hand over to matplotlib to display the plot:

fig = plt.figure(1, figsize=(10, 10))
ax1 = fig.add_subplot(111)
ax1.imshow(z_grid2, extent=[x_min, x_max,y_min, y_max,  ],
            origin='lower', cmap=cm.magma)
ax1.set_title("SVC: empty spots filled by nearest neighbours")
ax1.set_xlabel('log gamma')
ax1.set_ylabel('log C')
plt.show()

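Around the pointy part of the V shape, you can see that I did a lot of calculations during my search for the sweet spot, whereas the less interesting parts almost everywhere else have a lower resolution.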
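
The snippets above depend on the author's own SVC results, which are not part of the post. A self-contained sketch with synthetic stand-in data (an assumption, as are the seed, grid limits and grid size) shows the same griddata step and the practical difference between "nearest" and "linear": linear interpolation leaves NaN outside the convex hull of the samples, while nearest fills the whole grid.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Synthetic stand-in for (x, y, z): samples are dense near the origin only
x = rng.normal(0.0, 0.5, 300)
y = rng.normal(0.0, 0.5, 300)
z = np.exp(-(x**2 + y**2))

gridsize = 100
xx = np.linspace(-4, 4, gridsize)  # grid reaches far beyond the samples
yy = np.linspace(-4, 4, gridsize)
gx, gy = np.meshgrid(xx, yy)       # both have shape (len(yy), len(xx))

points = np.column_stack([x, y])   # shape (n, 2), as griddata expects

# Passing the meshgrid arrays directly returns results in image shape
z_nearest = griddata(points, z, (gx, gy), method="nearest")
z_linear = griddata(points, z, (gx, gy), method="linear")

print(z_nearest.shape)            # (100, 100)
print(np.isnan(z_nearest).any())  # False: every pixel gets a value
print(np.isnan(z_linear).any())   # True: NaN outside the convex hull
```

Either array can be passed to imshow with origin='lower' and extent=[-4, 4, -4, 4], as in the plotting code above.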

— Anderas

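Can you improve your answer to have complete and runnable code? This is an interesting method you have provided. I'm trying to better understand it at the moment. I also don't quite understand why there is a V shape. Thanks.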

— ldmtwo

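The V shape comes from my data. It is the f1 value for a trained SVM; this goes a bit into the theory of SVMs. If you have a high C, it includes all your points in the calculation, allowing a broader gamma range to work. Gamma is the stiffness of the curve separating good from bad. Those two values have to be given to the SVM (X and Y in my graphic); then you get a result (Z in my graphic). In the best area you hopefully reach meaningful heights.

— Anderas, '18

Second try: the V shape is in my data. It is the f1 value for an SVM: if you have a high C, it includes all your points in the calculation, allowing a broader gamma range to work, but making the calculation slow. Gamma is the stiffness of the curve separating good from bad. Those two values have to be given to the SVM (X and Y in my graphic); then you get a result (Z in my graphic). In the optimized area you get high values, elsewhere low values. What I showed here is usable if you have Z values for some (X, Y) and many gaps elsewhere. If you have (X, Y, Z) data points, you can use my code.

— Anderas, '18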

4

Here's Jurgy's great nearest-neighbour approach, but implemented using scipy.spatial.cKDTree. In my tests it's about 100x faster.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.spatial import cKDTree


def data_coord2view_coord(p, resolution, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * resolution
    return dv


n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)

resolution = 250

extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]
xv = data_coord2view_coord(xs, resolution, extent[0], extent[1])
yv = data_coord2view_coord(ys, resolution, extent[2], extent[3])


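def kNN2DDens(xv, yv, resolution, neighbours, dim=2):
    """
    Estimate a density image: for every pixel, the inverse of the summed
    distances to its `neighbours` nearest data points (in pixel coordinates).
    """
    # Create the tree
    tree = cKDTree(np.array([xv, yv]).T)
    # Pixel coordinates of every grid point, shape (resolution**2, dim)
    grid = np.mgrid[0:resolution, 0:resolution].T.reshape(resolution**2, dim)
    # Distances from each grid point to its `neighbours` nearest data points
    dists = tree.query(grid, neighbours)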
    # Inverse of the sum of distances to each grid point.
    inv_sum_dists = 1. / dists[0].sum(1)

    # Reshape
    im = inv_sum_dists.reshape(resolution, resolution)
    return im


fig, axes = plt.subplots(2, 2, figsize=(15, 15))
for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 63]):

    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=5)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:

        im = kNN2DDens(xv, yv, resolution, neighbours)

        ax.imshow(im, origin='lower', extent=extent, cmap=cm.Blues)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])

plt.savefig('new.png', dpi=150, bbox_inches='tight')
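
As a sanity check on the approach above, the sketch below compares the k-d tree result with a brute-force computation of the same quantity on a small problem. The sizes, seed and query positions are arbitrary assumptions; the point is only that both routes give the same inverse distance sums.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.standard_normal((1000, 2))     # data points
grid = rng.uniform(-3, 3, size=(50, 2))  # a few query positions ("pixels")
k = 16

# k-d tree: distances to the k nearest data points, shape (50, k)
dists, _ = cKDTree(pts).query(grid, k=k)
dens_tree = 1.0 / dists.sum(axis=1)

# Brute force: all pairwise distances, then the k smallest per query point
all_d = np.linalg.norm(grid[:, None, :] - pts[None, :, :], axis=2)
dens_brute = 1.0 / np.sort(all_d, axis=1)[:, :k].sum(axis=1)

print(np.allclose(dens_tree, dens_brute))  # True
```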

— Gabriel

1

I knew my implementation was very inefficient, but I didn't know about cKDTree. Well done! I'll reference you in my answer.

— Jurgy

2

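Make a 2-dimensional array that corresponds to the cells in your final image, called, say, heatmap_cells, and instantiate it as all zeroes.

Choose two scaling factors, say x_scale and y_scale, that define the difference between adjacent array elements in real units, one for each dimension. Choose these such that all your data points fall within the bounds of the heatmap array.

For each raw data point with x_value and y_value: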

heatmap_cells[floor(x_value/x_scale),floor(y_value/y_scale)]+=1
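
The recipe above can be sketched as runnable NumPy. The data ranges, grid size and seed are assumptions for illustration; like the pseudocode, the sketch assumes non-negative coordinates, and it clips the indices so that a value exactly on the upper bound cannot fall outside the array.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 10_000)  # assumed data range: 0 <= x < 10
y = rng.uniform(0.0, 5.0, 10_000)   # assumed data range: 0 <= y < 5

nx, ny = 100, 50                    # number of cells along x and y
x_scale = 10.0 / nx                 # real units per cell along x
y_scale = 5.0 / ny                  # real units per cell along y

heatmap_cells = np.zeros((nx, ny), dtype=int)

# Cell index of every point; clip guards against rounding at the upper edge
ix = np.clip(np.floor(x / x_scale).astype(int), 0, nx - 1)
iy = np.clip(np.floor(y / y_scale).astype(int), 0, ny - 1)

# np.add.at accumulates correctly when several points share the same cell
np.add.at(heatmap_cells, (ix, iy), 1)

print(heatmap_cells.sum())  # 10000
```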

— 潜意识

1

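Here's one I made on a 1 million point set with 3 categories (colored red, green, and blue). Here's a link to the repository if you'd like to try the function: Github Repo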

histplot(
    X,
    Y,
    labels,
    bins=2000,
    range=((-3,3),(-3,3)),
    normalize_each_label=True,
    colors = [
        [1,0,0],
        [0,1,0],
        [0,0,1]],
    gain=50)

— 乔尔·斯坦斯伯里

0

Very similar to @Piti's answer, but using 1 call instead of 2 to generate the points:

import numpy as np
import matplotlib.pyplot as plt

pts = 1000000
mean = [0.0, 0.0]
cov = [[1.0,0.0],[0.0,1.0]]

x,y = np.random.multivariate_normal(mean, cov, pts).T
plt.hist2d(x, y, bins=50, cmap=plt.cm.jet)
plt.show()

Output:

— Alaa M.

0

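I'm afraid I'm a little late to the party, but I had a similar question a while ago. The accepted answer (by @ptomato) helped me out, but I also wanted to post this in case it's of use to someone.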


''' I wanted to create a heatmap resembling a football pitch which would show the different actions performed '''

import numpy as np
import matplotlib.pyplot as plt
import random

#fixing random state for reproducibility
np.random.seed(1234324)

fig = plt.figure(12)
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

#Ratio of the pitch with respect to UEFA standards
#7 x 11 cells, so the integer parts of y < 6.5 and x < 10.5 are valid indices
hmap= np.full((7, 11), 0)
#print(hmap)

xlist = np.random.uniform(low=0.0, high=100.0, size=(20))
ylist = np.random.uniform(low=0.0, high =100.0, size =(20))

#UEFA Pitch Standards are 105m x 68m
xlist = (xlist/100)*10.5
ylist = (ylist/100)*6.5

ax1.scatter(xlist,ylist)

#int of the co-ordinates to populate the array
xlist_int = xlist.astype (int)
ylist_int = ylist.astype (int)

#print(xlist_int, ylist_int)

for i, j in zip(xlist_int, ylist_int):
    #this populates the array according to the x,y co-ordinate values it encounters 
    hmap[j][i]= hmap[j][i] + 1   

#Reversing the rows is necessary 
hmap = hmap[::-1]

#print(hmap)
im = ax2.imshow(hmap)
plt.show()

Here's the result

— 阿比舍克