Principal component analysis in Python

112

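I'd like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh?

I don't just want to use singular value decomposition (SVD), because my input data are quite high-dimensional (~460 dimensions), so I think SVD will be slower than computing the eigenvectors of the covariance matrix.

I was hoping to find a premade, debugged implementation that already makes the right decisions about when to use which method, and which maybe does other optimizations that I don't know about.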

— Vebjorn Ljosa

28

You could have a look at MDP.

I have not had the chance to test it myself, but I bookmarked it precisely for its PCA functionality.

— Christophe

8

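MDP has not been maintained since 2012, so it doesn't look like the best solution.

— Marc Garcia

The latest update is from March 9, 2016, but note that it is only a bug-fix release: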

Note that from this release MDP is in maintenance mode. 13 years after its first public release, MDP has reached full maturity and no new features are planned in the future.

— Gabriel

65

Months later, here's a small PCA class, and a picture:

#!/usr/bin/env python
""" a small class for Principal Component Analysis
Usage:
    p = PCA( A, fraction=0.90 )
In:
    A: an array of e.g. 1000 observations x 20 variables, 1000 rows x 20 columns
    fraction: use principal components that account for e.g.
        90 % of the total variance

Out:
    p.U, p.d, p.Vt: from numpy.linalg.svd, A = U . d . Vt
    p.dinv: 1/d or 0, see NR
    p.eigen: the eigenvalues of A.T . A, in decreasing order (p.d**2).
        eigen[j] / eigen.sum() is principal component j's fraction of the total variance;
        look at the first few eigen[] to see how many PCs get to 90 %, 95 % ...
    p.npc: number of principal components,
        e.g. 2 if the top 2 eigenvalues are >= `fraction` of the total.
        It's ok to change this; methods use the current value.

Methods:
    The methods of class PCA transform vectors or arrays of e.g.
    20 variables, 2 principal components and 1000 observations,
    using partial matrices U' d' Vt', parts of the full U d Vt:
    A ~ U' . d' . Vt' where e.g.
        U' is 1000 x 2
        d' is diag([ d0, d1 ]), the 2 largest singular values
        Vt' is 2 x 20.  Dropping the primes,

    d . Vt      2 principal vars = p.vars_pc( 20 vars )
    U           1000 obs = p.pc_obs( 2 principal vars )
    U . d . Vt  1000 obs, p.obs( 20 vars ) = pc_obs( vars_pc( vars ))
        fast approximate A . vars, using the `npc` principal components

    Ut              2 pcs = p.obs_pc( 1000 obs )
    V . dinv        20 vars = p.pc_vars( 2 principal vars )
    V . dinv . Ut   20 vars, p.vars( 1000 obs ) = pc_vars( obs_pc( obs )),
        fast approximate Ainverse . obs: vars that give ~ those obs.


Notes:
    PCA does not center or scale A; you usually want to first
        A -= A.mean(axis=0)
        A /= A.std(axis=0)
    with the little class Center or the like, below.

See also:
    http://en.wikipedia.org/wiki/Principal_component_analysis
    http://en.wikipedia.org/wiki/Singular_value_decomposition
    Press et al., Numerical Recipes (2 or 3 ed), SVD
    PCA micro-tutorial
    iris-pca .py .png

"""

from __future__ import division
import numpy as np
dot = np.dot
    # import bz.numpyutil as nu
    # dot = nu.pdot

__version__ = "2010-04-14 apr"
__author_email__ = "denis-bz-py at t-online dot de"

#...............................................................................
class PCA:
    def __init__( self, A, fraction=0.90 ):
        assert 0 <= fraction <= 1
            # A = U . diag(d) . Vt, O( m n^2 ), lapack_lite --
        self.U, self.d, self.Vt = np.linalg.svd( A, full_matrices=False )
        assert np.all( self.d[:-1] >= self.d[1:] )  # sorted
        self.eigen = self.d**2
        self.sumvariance = np.cumsum(self.eigen)
        self.sumvariance /= self.sumvariance[-1]
        self.npc = np.searchsorted( self.sumvariance, fraction ) + 1
        self.dinv = np.array([ 1/d if d > self.d[0] * 1e-6  else 0
                                for d in self.d ])

    def pc( self ):
        """ e.g. 1000 x 2 U[:, :npc] * d[:npc], to plot etc. """
        n = self.npc
        return self.U[:, :n] * self.d[:n]

    # These 1-line methods may not be worth the bother;
    # then use U d Vt directly --

    def vars_pc( self, x ):
        n = self.npc
        return self.d[:n] * dot( self.Vt[:n], x.T ).T  # 20 vars -> 2 principal

    def pc_vars( self, p ):
        n = self.npc
        return dot( self.Vt[:n].T, (self.dinv[:n] * p).T ) .T  # 2 PC -> 20 vars

    def pc_obs( self, p ):
        n = self.npc
        return dot( self.U[:, :n], p.T )  # 2 principal -> 1000 obs

    def obs_pc( self, obs ):
        n = self.npc
        return dot( self.U[:, :n].T, obs ) .T  # 1000 obs -> 2 principal

    def obs( self, x ):
        return self.pc_obs( self.vars_pc(x) )  # 20 vars -> 2 principal -> 1000 obs

    def vars( self, obs ):
        return self.pc_vars( self.obs_pc(obs) )  # 1000 obs -> 2 principal -> 20 vars


class Center:
    """ A -= A.mean() /= A.std(), inplace -- use A.copy() if need be
        uncenter(x) == original A . x
    """
        # mttiw
    def __init__( self, A, axis=0, scale=True, verbose=1 ):
        self.mean = A.mean(axis=axis)
        if verbose:
            print("Center -= A.mean:", self.mean)
        A -= self.mean
        if scale:
            std = A.std(axis=axis)
            self.std = np.where( std, std, 1. )
            if verbose:
                print("Center /= A.std:", self.std)
            A /= self.std
        else:
            self.std = np.ones( A.shape[-1] )
        self.A = A

    def uncenter( self, x ):
        return np.dot( self.A, x * self.std ) + np.dot( x, self.mean )


#...............................................................................
if __name__ == "__main__":
    import sys

    csv = "iris4.csv"  # wikipedia Iris_flower_data_set
        # 5.1,3.5,1.4,0.2  # ,Iris-setosa ...
    N = 1000
    K = 20
    fraction = .90
    seed = 1
    exec("\n".join( sys.argv[1:] ))  # N= ...
    np.random.seed(seed)
    np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
    try:
        A = np.genfromtxt( csv, delimiter="," )
        N, K = A.shape
    except IOError:
        A = np.random.normal( size=(N, K) )  # gen correlated ?

    print("csv: %s  N: %d  K: %d  fraction: %.2g" % (csv, N, K, fraction))
    Center(A)
    print("A:", A)

    print("PCA ...", end=" ")
    p = PCA( A, fraction=fraction )
    print("npc:", p.npc)
    print("% variance:", p.sumvariance * 100)

    print("Vt[0], weights that give PC 0:", p.Vt[0])
    print("A . Vt[0]:", dot( A, p.Vt[0] ))
    print("pc:", p.pc())

    print("\nobs <-> pc <-> x: with fraction=1, diffs should be ~ 0")
    x = np.ones(K)
    # x = np.ones(( 3, K ))
    print("x:", x)
    pc = p.vars_pc(x)  # d' Vt' x
    print("vars_pc(x):", pc)
    print("back to ~ x:", p.pc_vars(pc))

    Ax = dot( A, x.T )
    pcx = p.obs(x)  # U' d' Vt' x
    print("Ax:", Ax)
    print("A'x:", pcx)
    print("max |Ax - A'x|: %.2g" % np.linalg.norm( Ax - pcx, np.inf ))

    b = Ax  # ~ back to original x, Ainv A x
    back = p.vars(b)
    print("~ back again:", back)
    print("max |back - x|: %.2g" % np.linalg.norm( back - x, np.inf ))

# end pca.py

— denis

3

Fyinfo, C. Caramanis, 2011

— denis

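Will this code output that image (Iris PCA)? If not, could you post an alternative solution whose output is that image? I'm having some difficulty converting this code to C++ because I'm new to Python.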

44

Doing PCA with numpy.linalg.svd is very easy. Here's a simple demo:

import numpy as np
import matplotlib.pyplot as plt
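# scipy.misc.lena no longer ships with SciPy; ascent() is another
# 512 x 512 grayscale test image (scipy.datasets needs SciPy >= 1.10 and pooch)
from scipy.datasets import ascent

# the underlying signal is a sinusoidally modulated image
img = ascent()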
t = np.arange(100)
time = np.sin(0.1*t)
real = time[:,np.newaxis,np.newaxis] * img[np.newaxis,...]

# we add some noise
noisy = real + np.random.randn(*real.shape)*255

# (observations, features) matrix
M = noisy.reshape(noisy.shape[0],-1)

# singular value decomposition factorises your data matrix such that:
# 
#   M = U*S*V.T     (where '*' is matrix multiplication)
# 
# * U and V are the singular matrices, containing orthogonal vectors of
#   unit length in their rows and columns respectively.
#
# * S is a diagonal matrix containing the singular values of M - these 
#   values squared divided by the number of observations will give the 
#   variance explained by each PC.
#
# * if M is considered to be an (observations, features) matrix, the PCs
#   themselves would correspond to the rows of S^(1/2)*V.T. if M is 
#   (features, observations) then the PCs would be the columns of
#   U*S^(1/2).
#
# * since U and V both contain orthonormal vectors, U*V.T is equivalent 
#   to a whitened version of M.

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# PCs are already sorted by descending order 
# of the singular values (i.e. by the
# proportion of total variance they explain)

# if we use all of the PCs we can reconstruct the noisy signal perfectly
S = np.diag(s)
Mhat = np.dot(U, np.dot(S, V.T))
print("Using all PCs, MSE = %.6G" % (np.mean((M - Mhat)**2)))

# if we use only the first 20 PCs the reconstruction is less accurate
Mhat2 = np.dot(U[:, :20], np.dot(S[:20, :20], V[:,:20].T))
print("Using first 20 PCs, MSE = %.6G" % (np.mean((M - Mhat2)**2)))

fig, [ax1, ax2, ax3] = plt.subplots(1, 3)
ax1.imshow(img)
ax1.set_title('true image')
ax2.imshow(noisy.mean(0))
ax2.set_title('mean of noisy images')
ax3.imshow((s[0]**(1./2) * V[:,0]).reshape(img.shape))
ax3.set_title('first spatial PC')
plt.show()

— ali_m
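
For readers who only want the reduced data rather than the image demo, the same SVD recipe can be wrapped in a few lines. This is a minimal sketch, not part of the answer above: the helper name pca_svd and the synthetic data are assumptions made for illustration, and rows are taken to be observations.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of an (observations, features) array via a thin SVD (hypothetical helper)."""
    Xc = X - X.mean(axis=0)                            # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # s is sorted descending
    explained = s**2 / np.sum(s**2)                    # fraction of variance per PC
    scores = U[:, :n_components] * s[:n_components]    # data projected onto the PCs
    return scores, Vt[:n_components], explained

rng = np.random.default_rng(0)                         # synthetic data, illustration only
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
scores, components, explained = pca_svd(X, n_components=2)
print(scores.shape, components.shape)
print(explained.round(3))
```

The scores are the same as projecting the centered data onto the rows of components, so either form can be used downstream.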

2

I realize I'm a bit late here, but the OP specifically asked for a solution that avoids singular value decomposition.

— Alex A.

1

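@Alex I realize that, but I'm convinced that SVD is still the right approach. It should easily be fast enough for the OP's needs (my example above, with 262144 dimensions, takes only about 7.5 seconds on an ordinary laptop), and it is much more numerically stable than the eigendecomposition method (see dwf's comment below). I also note that the accepted answer uses SVD as well!

— ali_m, 2015

I don't disagree that SVD is the way to go; I was only saying that the answer doesn't address the question as it was asked. It's a good answer though, nice work.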

— Alex A.

5

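@Alex Fair enough. I think this is another variant of the XY problem: the OP said he didn't want an SVD-based solution because he thought SVD would be too slow, probably without having tried it. In cases like this I personally think it's more helpful to explain how you would tackle the broader problem than to answer the question exactly in its original, narrower form.

— ali_m, 2015

According to the documentation, svd already returns s sorted in descending order. (Perhaps that was not the case in 2012, but it is today.)

— Etienne Bruines, 2015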

34

You can use sklearn:

import sklearn.decomposition as deco
import numpy as np

x = (x - np.mean(x, 0)) / np.std(x, 0) # You need to normalize your data first
pca = deco.PCA(n_components) # n_components is the components number after reduction
x_r = pca.fit(x).transform(x)
print ('explained variance (first %d components): %.2f'%(n_components, sum(pca.explained_variance_ratio_)))

— Noam Peled

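Upvoted because this works nicely for me. I have more than 460 dimensions, and even though sklearn uses SVD and the question asked for a non-SVD approach, I think 460 dimensions is likely to be fine.

— Dan Stowell

You might also want to remove columns with a constant value (std = 0). For that you should use: remove_cols = np.where(np.all(x == np.mean(x, 0), 0))[0] and then x = np.delete(x, remove_cols, 1)

— Noam Peled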
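
The point about constant columns matters because dividing by a zero standard deviation produces NaNs. A minimal sketch of one way to drop such columns before standardizing, using a boolean mask instead of np.delete; the synthetic array and the names keep and x_clean are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 6))
x[:, 2] = 7.0                      # a constant column, so its std is 0

std = x.std(axis=0)
keep = std > 0                     # boolean mask of the non-constant columns
x_clean = (x[:, keep] - x[:, keep].mean(axis=0)) / std[keep]

print(x_clean.shape)
```

The cleaned array can then be passed to the PCA of your choice without the division-by-zero problem.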

31

matplotlib.mlab has a PCA implementation.

— tom10

5

The link to matplotlib's PCA has been updated.

— Developer

3

The matplotlib.mlab implementation of PCA uses SVD.

— Aman, 2012

3

Here is a more detailed description of what it does and how to use it.

— Dolan Antenucci, 2013

14

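SVD should work fine with 460 dimensions. It takes about 7 seconds on my Atom netbook. The eig() method takes more time (as it should, since it uses more floating point operations) and will almost always be less accurate.

If you have fewer than 460 examples, then what you want to do is diagonalize the scatter matrix (x - datamean)^T (x - mean), assuming your data points are columns, and then left-multiply by (x - datamean). That might be faster in the case where you have more dimensions than data.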

— dwf
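
The trick for the case of fewer examples than dimensions can be sketched as follows. This is an illustration under stated assumptions rather than dwf's own code: rows are observations here (the transpose of the layout described above), the sizes n and p and the variable names are made up, and the result is cross-checked against the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 460                        # fewer samples than dimensions (illustrative sizes)
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)               # rows are observations in this sketch

G = Xc @ Xc.T                         # small n x n matrix instead of p x p
evals, evecs = np.linalg.eigh(G)      # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

k = 5
# Map the small eigenvectors back to feature space and normalise to unit length.
directions = (Xc.T @ evecs[:, :k]) / np.sqrt(evals[:k])

# Cross-check against the SVD of the centered data (directions agree up to sign).
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
agreement = np.abs(np.sum(directions * Vt[:k].T, axis=0))
print(agreement)
```

Only the leading components are mapped back, since eigenvalues near zero would make the normalisation unstable.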

Could you describe this trick in more detail for the case where there are more dimensions than data points?

— mrgloom, 2014

1

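Basically, you assume that the eigenvectors are linear combinations of the data vectors. See Sirovich (1987), "Turbulence and the dynamics of coherent structures."

— 2014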

11

You can quite easily "roll" your own using scipy.linalg (assuming a pre-centered dataset data):

covmat = data.dot(data.T)
evs, evmat = scipy.linalg.eig(covmat)

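Then evs are your eigenvalues, and evmat is your projection matrix.

If you want to keep d dimensions, use the first d eigenvalues and the first d eigenvectors.

Given that scipy.linalg has the decomposition and numpy the matrix multiplications, what else do you need?

— Has QUIT--Anony-Mousse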
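
One detail worth spelling out is that the eigenvalues need to be sorted before "the first d" are taken. A minimal sketch, assuming rows are observations and swapping in numpy.linalg.eigh (which suits symmetric matrices and returns real values) for scipy.linalg.eig; the synthetic data and the choice d = 3 are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 8))          # synthetic, rows = observations
data = data - data.mean(axis=0)           # pre-center, as the answer assumes

covmat = data.T @ data / (len(data) - 1)  # features x features covariance
evs, evmat = np.linalg.eigh(covmat)       # symmetric input, real eigenvalues
order = np.argsort(evs)[::-1]             # eigh sorts ascending; PCA wants descending
evs, evmat = evs[order], evmat[:, order]

d = 3
projected = data @ evmat[:, :d]           # keep the d leading components
print(projected.shape)
```

The eigenvalues obtained this way should match the squared singular values of the centered data divided by (number of observations - 1).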

The covariance matrix is np.dot(data.T, data, out=covmat), where data must be a centered matrix.

— mrgloom, 2014

2

You should have a look at @dwf's comment on this answer for the dangers of using eig() on a covariance matrix.

— Alex A.

8

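I just finished reading the book Machine Learning: An Algorithmic Perspective. All the code examples in the book are written in Python (and almost all of them with NumPy). The code snippet in chapter 10.2, Principal Components Analysis, may be worth a read. It uses numpy.linalg.eig.
By the way, I think SVD can handle 460 * 460 dimensions very well. I have computed a 6500 * 6500 SVD with numpy/scipy.linalg.svd on a very old PC, a Pentium III 733 MHz. To be honest, the script needed a lot of memory (about 1.x GB) and a lot of time (about 30 minutes) to get the SVD result. But I think 460 * 460 on a modern PC will not be a big problem unless you need to run the SVD a huge number of times.

— sunqiang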

28

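You should never use eig() on a covariance matrix when you can simply use svd(). Depending on how many components you plan to use and the size of your data matrix, the numerical error introduced by the former (it does more floating point operations) can become significant. For the same reason, you should never explicitly invert a matrix with inv() if what you are really interested in is the inverse times a vector or matrix; you should use solve() instead.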

— dwf
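
The inv() versus solve() advice can be tried out directly. A minimal sketch on a random system (the matrix, its size, and the variable names are assumptions for illustration); it prints both residuals so the two approaches can be compared on your own data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200))
b = rng.normal(size=200)

x_solve = np.linalg.solve(A, b)       # factorises A and solves A x = b directly
x_inv = np.linalg.inv(A) @ b          # forms the explicit inverse first

# Residuals ||A x - b|| for each approach.
print(np.linalg.norm(A @ x_solve - b))
print(np.linalg.norm(A @ x_inv - b))
```

On a well-conditioned matrix the difference is small; it tends to grow as the matrix becomes closer to singular.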

5

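You do not need a full singular value decomposition (SVD), since it computes all eigenvalues and eigenvectors and can be prohibitive for large matrices. scipy and its sparse module provide generic linear algebra functions that work on both sparse and dense matrices, among them the eig* family of functions: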

http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations

Scikit-learn provides a Python PCA implementation which only supports dense matrices for now.

Timings:

In [1]: A = np.random.randn(1000, 1000)

In [2]: %timeit scipy.sparse.linalg.eigsh(A)
1 loops, best of 3: 802 ms per loop

In [3]: %timeit np.linalg.svd(A)
1 loops, best of 3: 5.91 s per loop

— Nicolas Barbey

1

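This is not really a fair comparison, since you still need to compute the covariance matrix. Also, it is probably only worth using the sparse linalg routines for very large matrices, since constructing sparse matrices from dense ones seems to be quite slow. For example, eigsh is actually about 4x slower than eigh for nonsparse matrices. The same holds for scipy.sparse.linalg.svds versus numpy.linalg.svd. I would always go with SVD over eigenvalue decomposition for the reasons @dwf mentioned, and perhaps use a sparse version of SVD if the matrices get really huge.

— ali_m, 2012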

2

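You do not need to compute sparse matrices from dense matrices. The algorithms provided in the sparse.linalg module rely only on the matrix-vector multiplication operation, through the matvec method of the Operator object. For dense matrices, this is just something like matvec = dot(A, x). For the same reason, you do not need to compute the covariance matrix; you only need to provide the operation dot(A.T, dot(A, x)) for A.

— Nicolas Barbey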
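
The matvec idea can be sketched with scipy.sparse.linalg.LinearOperator and eigsh. This is an illustration under assumptions, not code from the comment: the data are synthetic, the columns are scaled only so that the leading eigenvalues are well separated, and k=3 is arbitrary.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 60)) * np.linspace(3.0, 0.5, 60)  # synthetic data
A -= A.mean(axis=0)                       # centered (observations, features) data
p = A.shape[1]

# Operator for x -> A.T (A x); the p x p matrix A.T A is never formed.
def matvec(x):
    return A.T @ (A @ x)

op = LinearOperator((p, p), matvec=matvec, dtype=A.dtype)
evals, evecs = eigsh(op, k=3, which="LM")  # the 3 largest eigenvalues of A.T A
order = np.argsort(evals)[::-1]            # sort explicitly into descending order
evals, evecs = evals[order], evecs[:, order]

print(evals)
```

The eigenvalues returned should agree with the squares of the three largest singular values of A, which is an easy sanity check on small problems.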

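Ah, now I see that the relative speed of the sparse and nonsparse methods depends on the size of the matrix. If I use your example, where A is a 1000 * 1000 matrix, then eigsh and svds are faster than eigh and svd by a factor of ~3, but if A is smaller, say 100 * 100, then eigh and svd are quicker by factors of ~4 and ~1.5 respectively. I would still use sparse SVD over sparse eigenvalue decomposition, though.

— ali_m, 2012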

2

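Indeed, I think I am biased towards large matrices. To me, large matrices are more like 10⁶ * 10⁶ than 1000 * 1000. In those cases you often cannot even store the covariance matrix...

— Nicolas Barbey, 2012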

4

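Here is another implementation of a PCA module for Python using numpy, scipy and C extensions. The module carries out PCA using either an SVD or the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm, which is implemented in C.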

— rcs

0

If you're working with 3D vectors, you can apply SVD concisely using the toolbelt vg. It's a light layer on top of numpy.

import numpy as np
import vg

vg.principal_components(data)

There's also a convenient alias if you only want the first principal component:

vg.major_axis(data)

I created the library at my last startup, where it was motivated by uses like this: simple ideas which are verbose or opaque in NumPy.

— paulmelnikow