Principal component analysis and regression in Python



I'm trying to figure out how to reproduce in Python some work I've done in SAS. Using this dataset, which has multicollinearity problems, I would like to perform principal component analysis in Python. I've looked at scikit-learn and statsmodels, but I'm not sure how to take their output and convert it to the same results structure as SAS. For one thing, SAS appears to perform PCA on the correlation matrix when you use PROC PRINCOMP, but most (all?) of the Python libraries seem to use SVD.
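(Editorial aside, a minimal sketch with stand-in data rather than the asker's file: PCA on the correlation matrix is equivalent to PCA on z-scored columns, so standardizing the predictors first should bring scikit-learn's SVD-based results in line with PROC PRINCOMP's default.)

import numpy as np
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)                  # stand-in data, not the real predictors

corr = np.corrcoef(X, rowvar=False)         # correlation matrix, as PROC PRINCOMP uses
eigvals = np.linalg.eigvalsh(corr)[::-1]    # its eigenvalues, largest first

pca = PCA().fit(scale(X))                   # PCA on standardized columns
print(np.allclose(eigvals / eigvals.sum(),  # proportions of explained variance agree
                  pca.explained_variance_ratio_))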

In the dataset, the first column is the response variable and the next 5 are the predictor variables, called pred1-pred5.

In SAS, the general workflow is:

/* Get the PCs */
proc princomp data=indata out=pcdata;
    var pred1 pred2 pred3 pred4 pred5;
run;

/* Standardize the response variable */
proc standard data=pcdata mean=0 std=1 out=pcdata2;
    var response;
run;

/* Compare some models */
proc reg data=pcdata2;
    Reg:     model response = pred1 pred2 pred3 pred4 pred5 / vif;
    PCa:     model response = prin1-prin5 / vif;
    PCfinal: model response = prin1 prin2 / vif;
run;
quit;

/* Use Proc PLS to do PCR as a replacement - dropping pred5 */
/* This gets me my parameter estimates for the original data */
proc pls data=indata method=pcr nfac=2;
    model response = pred1 pred2 pred3 pred4 / solution;
run;
quit;

I know this last step only works because I happen to be picking PC1 and PC2, in order.

So in Python, this is about as far as I've gotten:

import pandas as pd
import numpy  as np
from sklearn.decomposition import PCA

source = pd.read_csv('C:/sourcedata.csv')

# Create a pandas DataFrame object
frame = pd.DataFrame(source)

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]
frame2 = frame[cols]

pca = PCA(n_components=5)
pca.fit(frame2)

How much of the variance is explained by each PC?

print pca.explained_variance_ratio_

Out[190]:
array([  9.99997603e-01,   2.01265023e-06,   2.70712663e-07,
         1.11512302e-07,   2.40310191e-09])
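(A hedged observation: the near-1.0 first ratio is driven by the raw scale of the columns; pred3 is on the order of 10^5 while the others are in the tens or hundreds, so the unscaled covariance is dominated by it. Standardizing first, as in the sketch below and in the first answer, spreads the explained variance out.)

from sklearn.preprocessing import scale

pca_std = PCA(n_components=5)
pca_std.fit(scale(frame2.values))          # z-score the columns before fitting
print pca_std.explained_variance_ratio_    # no longer dominated by a single component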

What are these? Eigenvectors?

print pca.components_

Out[179]:
array([[ -4.32840645e-04,  -7.18123771e-04,  -9.99989955e-01,
         -4.40303223e-03,  -2.46115129e-05],
       [  1.00991662e-01,   8.75383248e-02,  -4.46418880e-03,
          9.89353169e-01,   5.74291257e-02],
       [ -1.04223303e-02,   9.96159390e-01,  -3.28435046e-04,
         -8.68305757e-02,  -4.26467920e-03],
       [ -7.04377522e-03,   7.60168675e-04,  -2.30933755e-04,
          5.85966587e-02,  -9.98256573e-01],
       [ -9.94807648e-01,  -1.55477793e-03,  -1.30274879e-05,
          1.00934650e-01,   1.29430210e-02]])

Are these eigenvalues?

print pca.explained_variance_

Out[180]:
array([  8.07640319e+09,   1.62550137e+04,   2.18638986e+03,
         9.00620474e+02,   1.94084664e+01])

I'm a bit stuck on how to get from these Python results to actually performing principal component regression (in Python). Does any Python library fill in the blanks similarly to SAS?

Any tips are appreciated. I'm a little spoiled by the labels in SAS output, and I'm not very familiar with pandas, numpy, scipy, or scikit-learn.
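(A minimal sketch for the labels point, reusing the pca and frame2 objects from above: the component scores can be wrapped in a DataFrame with SAS-style names.)

scores = pca.transform(frame2)
score_frame = pd.DataFrame(scores,
                           index=frame2.index,
                           columns=['prin1', 'prin2', 'prin3', 'prin4', 'prin5'])
print score_frame.head()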


EDIT:

So it looks like sklearn won't operate directly on a pandas DataFrame. Let's say I convert it to a numpy array:

npa = frame2.values
npa

This is what I get:

Out[52]:
array([[  8.45300000e+01,   4.20730000e+02,   1.99443000e+05,
          7.94000000e+02,   1.21100000e+02],
       [  2.12500000e+01,   2.73810000e+02,   4.31180000e+04,
          1.69000000e+02,   6.28500000e+01],
       [  3.38200000e+01,   3.73870000e+02,   7.07290000e+04,
          2.79000000e+02,   3.53600000e+01],
       ..., 
       [  4.71400000e+01,   3.55890000e+02,   1.02597000e+05,
          4.07000000e+02,   3.25200000e+01],
       [  1.40100000e+01,   3.04970000e+02,   2.56270000e+04,
          9.90000000e+01,   7.32200000e+01],
       [  3.85300000e+01,   3.73230000e+02,   8.02200000e+04,
          3.17000000e+02,   4.32300000e+01]])

If I then change the copy parameter of sklearn's PCA to False and run it directly on the array, per the comments below:

pca = PCA(n_components=5,copy=False)
pca.fit(npa)

npa

For the output, it looks like it replaced all of the values in npa rather than appending anything to the array. What are the values in npa now? The principal component scores for the original array?

Out[64]:
array([[  3.91846649e+01,   5.32456568e+01,   1.03614689e+05,
          4.06726542e+02,   6.59830027e+01],
       [ -2.40953351e+01,  -9.36743432e+01,  -5.27103110e+04,
         -2.18273458e+02,   7.73300268e+00],
       [ -1.15253351e+01,   6.38565684e+00,  -2.50993110e+04,
         -1.08273458e+02,  -1.97569973e+01],
       ..., 
       [  1.79466488e+00,  -1.15943432e+01,   6.76868901e+03,
          1.97265416e+01,  -2.25969973e+01],
       [ -3.13353351e+01,  -6.25143432e+01,  -7.02013110e+04,
         -2.88273458e+02,   1.81030027e+01],
       [ -6.81533512e+00,   5.74565684e+00,  -1.56083110e+04,
         -7.02734584e+01,  -1.18869973e+01]])

In scikit-learn, each sample is stored as a row in the data matrix. The PCA class operates on the data matrix directly, i.e. it takes care of computing the covariance matrix and then its eigenvectors. Regarding your final 3 questions: yes, components_ are the eigenvectors of the covariance matrix, explained_variance_ratio_ is the variance each PC explains, and the explained variance should correspond to the eigenvalues.
lightalchemist, 2014
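(A sketch to check the comment above against numpy's own eigendecomposition, using the npa array and pca object from the edit; the eigenvalue normalisation, n versus n-1, differs between scikit-learn versions, so comparing proportions is the safer test.)

cov = np.cov(npa, rowvar=False)               # sample covariance of the predictors
eigvals = np.linalg.eigvalsh(cov)[::-1]       # eigenvalues, largest first
print np.allclose(eigvals / eigvals.sum(), pca.explained_variance_ratio_)
# rows of pca.components_ should match the covariance eigenvectors up to sign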

@lightalchemist Thank you for the clarification. With sklearn, is it proper to create a new dataframe before performing the PCA, or is it possible to send the "complete" pandas dataframe and have it not operate on the leftmost (response) column?
Clay, 2014

I added a bit more info. If I convert to a numpy array first and then run PCA with copy=False, I get new values. Are those the principal component scores?
Clay, 2014

I'm not too familiar with Pandas, so I don't have an answer for that part of your question. Regarding the second part, I don't think they are the principal components. I believe they are the original data samples, but with the mean subtracted. However, I can't be sure about that.
lightalchemist, 2014
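(A sketch to settle that uncertainty, re-reading the file so the comparison is not made against a possibly mutated view:)

orig = pd.read_csv('C:/sourcedata.csv')[cols].values    # untouched copy of the predictors
print np.allclose(npa, orig - orig.mean(axis=0))        # True => npa now holds mean-centered data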

Answers:



Scikit-learn does not have a combined implementation of PCA and regression like, for example, the pls package in R.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')

df = pd.read_csv('multicollinearity.csv')
X = df.iloc[:,1:6]
y = df.response

scikit-learn PCA

pca = PCA()

Scale the data and transform it to get the principal components

X_reduced = pca.fit_transform(scale(X))

Amount of variance explained by the principal components (cumulative %)

np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

array([  73.39,   93.1 ,   98.63,   99.89,  100.  ])

It looks like the first two components do indeed explain most of the variance in the data.

10-fold cross-validation, with shuffle

n = len(X_reduced)
kf_10 = cross_validation.KFold(n, n_folds=10, shuffle=True, random_state=2)

regr = LinearRegression()
mse = []

Do one CV to get the MSE for the intercept only (no principal components in the regression)

score = -1*cross_validation.cross_val_score(regr, np.ones((n,1)), y.ravel(), cv=kf_10, scoring='mean_squared_error').mean()    
mse.append(score) 

Do CV for the 5 principal components, adding one component to the regression at a time

for i in np.arange(1,6):
    score = -1*cross_validation.cross_val_score(regr, X_reduced[:,:i], y.ravel(), cv=kf_10, scoring='mean_squared_error').mean()
    mse.append(score)

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,5))
ax1.plot(mse, '-v')
ax2.plot([1,2,3,4,5], mse[1:6], '-v')
ax2.set_title('Intercept excluded from plot')

for ax in fig.axes:
    ax.set_xlabel('Number of principal components in regression')
    ax.set_ylabel('MSE')
    ax.set_xlim((-0.2,5.2))

[Figure: MSE vs. number of principal components in the regression; right panel excludes the intercept-only fit]
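One gap relative to the SAS workflow is the PROC PLS step that returns parameter estimates for the original predictors. A hedged sketch of recovering those from a two-component fit (keeping all five predictors here, and reusing X, y, pca and X_reduced from above):

regr2 = LinearRegression().fit(X_reduced[:, :2], y)    # regress y on the first two PCs

gamma = regr2.coef_                                    # coefficients on the PC scores
beta_scaled = pca.components_[:2].T.dot(gamma)         # map back to the scaled predictors
beta = beta_scaled / X.std(ddof=0)                     # undo the per-column scaling
intercept = regr2.intercept_ - X.mean().dot(beta)

print(beta)                                            # one estimate per original predictor
print(intercept)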

scikit-learn PLS regression

mse = []

kf_10 = cross_validation.KFold(n, n_folds=10, shuffle=True, random_state=2)

for i in np.arange(1, 6):
    pls = PLSRegression(n_components=i, scale=False)
    pls.fit(scale(X_reduced),y)
    score = cross_validation.cross_val_score(pls, X_reduced, y, cv=kf_10, scoring='mean_squared_error').mean()
    mse.append(-score)

plt.plot(np.arange(1, 6), np.array(mse), '-v')
plt.xlabel('Number of principal components in PLS regression')
plt.ylabel('MSE')
plt.xlim((-0.2, 5.2))

[Figure: MSE vs. number of principal components in the PLS regression]



This is just SVD in Python and NumPy (years later).
(It doesn't address your questions about SAS / sklearn / pandas at all, but may help a pythonist some day.)

#!/usr/bin/env python2
""" SVD straight up """
# geometry: see http://www.ams.org/samplings/feature-column/fcarc-svd

from __future__ import division
import sys
import numpy as np

__version__ = "2015-06-15 jun  denis-bz-py t-online de"

# from bz.etc import numpyutil as nu
def ints( x ):
    return np.round(x).astype(int)  # NaN Inf -> - maxint

def quantiles( x ):
    return "quantiles %s" % ints( np.percentile( x, [0, 25, 50, 75, 100] ))


#...........................................................................
csvin = "ccheaton-multicollinearity.csv"  # https://gist.github.com/ccheaton/8393329
plot = 0

    # to change these vars in sh or ipython, run this.py  csvin=\"...\"  plot=1  ...
for arg in sys.argv[1:]:
    exec( arg )

np.set_printoptions( threshold=10, edgeitems=10, linewidth=120,
    formatter = dict( float = lambda x: "%.2g" % x ))  # float arrays %.2g

#...........................................................................
yX = np.loadtxt( csvin, delimiter="," )
y = yX[:,0]
X = yX[:,1:]
print "read %s" % csvin
print "y %d  %s" % (len(y), quantiles(y))
print "X %s  %s" % (X.shape, quantiles(X))
print ""

#...........................................................................
U, sing, Vt = np.linalg.svd( X, full_matrices=False )
#...........................................................................

print "SVD: %s -> U %s . sing diagonal . Vt %s" % (
        X.shape, U.shape, Vt.shape )
print "singular values:", ints( sing )
    # % variance (sigma^2) explained != % sigma explained, e.g. 10 1 1 1 1

var = sing**2
var *= 100 / var.sum()
print "% variance ~ sing^2:", var

print "Vt, the right singular vectors  * 100:\n", ints( Vt * 100 )
    # multicollinear: near +- 100 in each row / col

yU = y.dot( U )
yU *= 100 / yU.sum()
print "y ~ these percentages of U, the left singular vectors:", yU


-> log:

# from: test-pca.py
# run: 15 Jun 2015 16:45  in ~bz/py/etc/data/etc  Denis-iMac 10.8.3
# versions: numpy 1.9.2  scipy 0.15.1   python 2.7.6   mac 10.8.3

read ccheaton-multicollinearity.csv
y 373  quantiles [  2823  60336  96392 147324 928560]
X (373, 5)  quantiles [     7     47    247    573 512055]

SVD: (373, 5) -> U (373, 5) . sing diagonal . Vt (5, 5)
singular values: [2537297    4132    2462     592      87]
% variance ~ sing^2: [1e+02 0.00027 9.4e-05 5.4e-06 1.2e-07]
Vt, the right singular vectors  * 100:
[[  0   0 100   0   0]
 [  1  98   0 -12  17]
 [-10 -11   0 -99  -6]
 [  1 -17   0  -4  98]
 [-99   2   0  10   2]]
y ~ these percentages of U, the left singular vectors: [1e+02 15 -18 0.88 -0.57]
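(An aside, not part of the original script: the SVD above is taken on the raw X, whereas PCA conventionally decomposes the column-centered matrix. A sketch of getting PCA-style scores from the same machinery, reusing X and the quantiles helper from the script:)

Xc = X - X.mean(axis=0)                               # column-center first
Uc, sc, Vct = np.linalg.svd( Xc, full_matrices=False )
scores = Uc * sc                                      # principal component scores, one column per PC
print( quantiles( scores[:,0] ))                      # spread of the first PC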

I'm a bit late to the party, but this is a great answer
plumbus_bouquet


Try using a pipeline to combine principal component analysis and linear regression:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Principal components regression
steps = [
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('estimator', LinearRegression())
]
pipe = Pipeline(steps)
pca = pipe.set_params(pca__n_components=3)
pca.fit(X, y)
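A short usage note (a sketch, assuming the same X and y as in the first answer): set_params returns the pipeline itself, so after fitting it predicts end-to-end and exposes the fitted steps:

y_hat = pca.predict(X)                                   # scale -> project to 3 PCs -> regress
print(pca.named_steps['pca'].explained_variance_ratio_)  # variance explained by the kept components
print(pca.named_steps['estimator'].coef_)                # coefficients on the 3 component scores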


My answer is coming almost five years late, and there's a good chance you no longer need help doing PCR in Python. We have developed a Python package called hoggorm that does exactly what you needed back then. Please have a look at the PCR examples here. There is also a complementary plotting package called hoggormplot for visualising results computed with hoggorm.
