PCA "backwards": how much variance of the data does a given linear combination of the variables describe?



I did a principal components analysis of six variables A, B, C, D, E and F. If I understand correctly, the unrotated PC1 tells me what linear combination of these variables describes/explains the most variance in the data, PC2 tells me what linear combination of these variables describes the second most variance in the data, and so on.

I am just curious -- is there any way of doing this backwards? Let's say I choose some linear combination of these variables -- e.g. $A+2B+5C$ -- could I work out how much variance in the data this describes?


Strictly speaking, PC2 is the linear combination, orthogonal to PC1, that describes the next largest amount of variance in the data.
Henry

Are you trying to estimate $\operatorname{Var}(A+2B+5C)$?
vqv

All good answers (+1 to all three). Since one or more latent variables can be thought of as a "linear combination of the variables", I am curious whether the question posed here could also be addressed by latent variable methods (SEM/LVM).
Aleksandr Blekh

@Aleksandr, my answer actually directly contradicts the other two. I have edited my answer to clarify the disagreement (and plan to edit it further to spell out the math). Imagine a dataset with two standardized identical variables $X=Y$. How much variance is described by $X$? The other two solutions give 50%. I argue that the correct answer is 100%.
amoeba says Reinstate Monica

@amoeba: although I am still working on fully understanding the material, I do realize that your answer is different. When I said "all good answers", I meant that I liked the level of the answers per se, not their correctness. I find them educationally valuable for people like me who are pursuing self-education across the rugged countryside called "statistics" :-). Hope that makes sense.
Aleksandr Blekh

Answers:



If we start from the premise that all the variables have been centered (standard practice in PCA), then the total variance in the data is just the sum of squares:

$$T=\sum_i\left(A_i^2+B_i^2+C_i^2+D_i^2+E_i^2+F_i^2\right)$$

This is equal to the trace of the covariance matrix of the variables, which in turn equals the sum of the eigenvalues of the covariance matrix. This is the same quantity that PCA speaks of in terms of "explaining the data" - i.e., you want your PCs to explain the greatest proportion of the diagonal elements of the covariance matrix. Now, if we make this an objective function for a set of predicted values, as follows:

$$S=\sum_i\left([A_i-\hat A_i]^2+\dots+[F_i-\hat F_i]^2\right)$$

Then the first principal component minimizes $S$ over all rank-1 fitted values $(\hat A_i,\dots,\hat F_i)$. So it would seem that the appropriate quantity you are after is

$$P=1-\frac{S}{T}$$
To use your example of $A+2B+5C$, we need to turn this equation into a rank-1 prediction. First, normalise the weights to have a sum of squares of 1, so we replace $(1,2,5,0,0,0)$ (sum of squares $30$) with $\left(\frac{1}{\sqrt{30}},\frac{2}{\sqrt{30}},\frac{5}{\sqrt{30}},0,0,0\right)$. Next, we "score" each observation according to the normalised weights:

$$Z_i=\frac{1}{\sqrt{30}}A_i+\frac{2}{\sqrt{30}}B_i+\frac{5}{\sqrt{30}}C_i$$

Then we multiply the scores by the weight vector to get the rank-1 prediction:

$$\left(\hat A_i,\hat B_i,\hat C_i,\hat D_i,\hat E_i,\hat F_i\right)=Z_i\times\left(\frac{1}{\sqrt{30}},\frac{2}{\sqrt{30}},\frac{5}{\sqrt{30}},0,0,0\right)$$

Then we plug these estimates into $S$ and calculate $P$. You can also put this into matrix norm notation, which may suggest a different generalisation. If we set $O$ as the $N\times q$ matrix of observed values of the variables ($q=6$ in your case) and $E$ as the corresponding matrix of predictions, we can define the proportion of variance explained as:

$$\frac{||O||_2^2-||O-E||_2^2}{||O||_2^2}$$

where $||\cdot||_2$ is the Frobenius matrix norm. So you could "generalise" this to some other kind of matrix norm, and you would get a different measure of "variation explained", although it won't be "variance" per se unless it is a sum of squares.
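
For concreteness, here is a minimal numpy sketch of the recipe above; the function name `variance_explained_rank1` and the random example data are illustrative assumptions, not part of the original answer. It normalises the weights, scores each observation, forms the rank-1 predictions, and returns $P=1-S/T$.

```python
import numpy as np

def variance_explained_rank1(O, weights):
    """Proportion of total variance described by a rank-1 fit along `weights`.
    Assumes the columns of O (the N x q data matrix) are already centered."""
    w = np.asarray(weights, dtype=float)
    w = w / np.linalg.norm(w)        # normalise weights to unit sum of squares
    Z = O @ w                        # score each observation
    E = np.outer(Z, w)               # rank-1 predictions (N x q)
    S = np.sum((O - E) ** 2)         # residual sum of squares
    T = np.sum(O ** 2)               # total sum of squares (total variance)
    return 1 - S / T

# Illustrative random data, with the weights from the question (A + 2B + 5C):
rng = np.random.default_rng(0)
O = rng.standard_normal((100, 6))
O -= O.mean(axis=0)                  # center the columns
print(variance_explained_rank1(O, [1, 2, 5, 0, 0, 0]))
```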


This is a reasonable approach, but your expression can be greatly simplified and shown to be equal to the sum of squares of $Z_i$ divided by the total sum of squares $T$. Also, I think this is not the best way to interpret the question; see my answer for an alternative approach that I argue makes more sense (in particular, see my example figure there).
amoeba says Reinstate Monica

Think about it like this. Imagine a dataset with two standardized identical variables $X=Y$. How much variance is described by $X$? Your calculation gives 50%. I argue that the correct answer is 100%.
amoeba says Reinstate Monica

@amoeba - if $X=Y$ then the first PC is $\left(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}\right)$ - this makes rank-1 scores of $z_i=\frac{x_i+y_i}{\sqrt{2}}=x_i\sqrt{2}$ (assuming $x_i=y_i$). This gives rank-1 predictions of $\hat x_i=x_i$, and similarly $\hat y_i=y_i$. Hence you get $O-E=0$ and $S=0$. Hence you get 100%, as your intuition suggests.
probabilityislogic

Hey, yes, sure, the 1st PC explains 100% of the variance, but that's not what I meant. What I meant is that $X=Y$, but the question is how much variance is described by $X$, i.e. by the $(1,0)$ vector? What does your formula say then?
amoeba says Reinstate Monica

@amoeba - this says 50%, but note that the $(1,0)$ vector says that the best rank-1 predictor for $(x_i,y_i)$ is given as $\hat x_i=x_i$ and $\hat y_i=0$ (noting that $z_i=x_i$ under your choice of vector). This is not an optimal prediction, which is why you don't get 100%. You need to predict both $X$ and $Y$ in this set-up.
probabilityislogic


Let's say I choose some linear combination of these variables -- e.g. $A+2B+5C$, could I work out how much variance in the data this describes?

This question can be understood in two different ways, leading to two different answers.

A linear combination corresponds to a vector, which in your example is $[1,2,5,0,0,0]$. This vector, in turn, defines an axis in the 6D space of the original variables. What you are asking is: how much variance does projection onto this axis "describe"? The answer is given via the notion of "reconstruction" of the original data from this projection, by measuring the reconstruction error (see Wikipedia on Fraction of variance unexplained). It turns out that this reconstruction can reasonably be done in two different ways, yielding two different answers.


Approach #1

Let $X$ be the centered dataset ($n$ rows correspond to samples, $d$ columns correspond to variables), let $\Sigma$ be its covariance matrix, and let $w$ be a unit vector from $\mathbb{R}^d$. The total variance of the dataset is the sum of all $d$ variances, i.e. the trace of the covariance matrix: $T=\operatorname{tr}(\Sigma)$. The question is: what proportion of $T$ does $w$ describe? The two answers given by @todddeluca and @probabilityislogic are both equivalent to the following: compute the projection $Xw$, compute its variance and divide by $T$:

$$R^2_\mathrm{first}=\frac{\operatorname{Var}(Xw)}{T}=\frac{w^\top\Sigma w}{\operatorname{tr}(\Sigma)}.$$

This might not be immediately obvious, because e.g. @probabilityislogic suggests considering the reconstruction $Xww^\top$ and then computing

$$\frac{\|X\|^2-\|X-Xww^\top\|^2}{\|X\|^2},$$
but with a little algebra this can be shown to be an equivalent expression.
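
To illustrate Approach #1, here is a small numpy sketch; the function names and the simulated data are illustrative assumptions. It computes $w^\top\Sigma w/\operatorname{tr}(\Sigma)$ for a unit vector $w$ and checks that it agrees with the reconstruction-based expression just mentioned.

```python
import numpy as np

def r2_first(X, w):
    """Approach #1: share of the total variance carried by the projection Xw.
    Assumes X is a centered n x d data matrix and w a unit vector."""
    Sigma = np.cov(X, rowvar=False)
    return (w @ Sigma @ w) / np.trace(Sigma)

def r2_first_via_reconstruction(X, w):
    """Equivalent computation through the rank-1 reconstruction X w w^T."""
    recon = X @ np.outer(w, w)
    return 1 - np.sum((X - recon) ** 2) / np.sum(X ** 2)

# Illustrative random data:
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
X -= X.mean(axis=0)
w = np.array([1.0, 2.0, 5.0]) / np.sqrt(30.0)   # normalised weight vector
print(r2_first(X, w), r2_first_via_reconstruction(X, w))  # the two values agree
```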

Approach #2

Now consider the following example: $X$ is a $d=2$ dataset with covariance matrix

$$\Sigma=\begin{pmatrix}1 & 0.99\\ 0.99 & 1\end{pmatrix}$$
and $w=\begin{pmatrix}1\\0\end{pmatrix}$ is simply the $x$ vector:

[Figure: "variance explained" - scatter plot of the two variables (blue dots) with their projections onto $w$ shown as red dots]

The total variance is $T=2$. The variance of the projection onto $w$ (shown as red dots) is equal to $1$. So according to the above logic, the explained variance is equal to $1/2$. And in some sense it is: the red dots ("reconstruction") are far away from the corresponding blue dots, so a lot of the variance is "lost".

On the other hand, the two variables have a $0.99$ correlation and so are almost identical; saying that one of them describes only 50% of the total variance is weird, because each of them contains "almost all the information" about the other one. We can formalize this as follows: given the projection $Xw$, find the best possible reconstruction $Xwv^\top$ with $v$ not necessarily the same as $w$, then compute the reconstruction error and plug it into the expression for the proportion of explained variance:

$$R^2_\mathrm{second}=\frac{\|X\|^2-\|X-Xwv^\top\|^2}{\|X\|^2},$$
where $v$ is chosen such that $\|X-Xwv^\top\|^2$ is minimal (i.e. $R^2$ is maximal). This is exactly equivalent to computing the $R^2$ of a multivariate regression predicting the original dataset $X$ from the 1-dimensional projection $Xw$.

It is a matter of straightforward algebra to use the regression solution for $v$ to find that the whole expression simplifies to

$$R^2_\mathrm{second}=\frac{\|\Sigma w\|^2}{w^\top\Sigma w\cdot\operatorname{tr}(\Sigma)}.$$
In the example above this is equal to 0.9901, which seems reasonable.
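
That number can be reproduced with a couple of lines of numpy; this is just a check of the formula under the covariance matrix and $w$ stated above, nothing more.

```python
import numpy as np

Sigma = np.array([[1.00, 0.99],
                  [0.99, 1.00]])   # covariance matrix of the example
w = np.array([1.0, 0.0])           # projection onto the first variable

r2_second = np.sum((Sigma @ w) ** 2) / ((w @ Sigma @ w) * np.trace(Sigma))
print(r2_second)   # ~0.99, consistent with the value quoted above
```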

Note that if (and only if) $w$ is one of the eigenvectors of $\Sigma$, i.e. one of the principal axes, with eigenvalue $\lambda$ (so that $\Sigma w=\lambda w$), then both approaches to computing $R^2$ coincide and reduce to the familiar PCA expression

$$R^2_\mathrm{PCA}=R^2_\mathrm{first}=R^2_\mathrm{second}=\lambda/\operatorname{tr}(\Sigma)=\lambda\Big/\sum\lambda_i.$$
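
And a quick numeric sanity check of this coincidence, reusing the example covariance matrix (purely illustrative):

```python
import numpy as np

Sigma = np.array([[1.00, 0.99],
                  [0.99, 1.00]])
eigvals, eigvecs = np.linalg.eigh(Sigma)
w = eigvecs[:, -1]                   # leading principal axis (eigenvalue 1.99)
lam = eigvals[-1]

r2_first = (w @ Sigma @ w) / np.trace(Sigma)
r2_second = np.sum((Sigma @ w) ** 2) / ((w @ Sigma @ w) * np.trace(Sigma))
print(r2_first, r2_second, lam / np.trace(Sigma))   # all three equal 0.995
```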

PS. See my answer here for an application of the derived formula to the special case of $w$ being one of the basis vectors: Variance of the data explained by a single variable.


Appendix. Derivation of the formula for $R^2_\mathrm{second}$

Finding the $v$ minimizing the reconstruction error $\|X-Xwv^\top\|^2$ is a regression problem (with $Xw$ as the univariate predictor and $X$ as the multivariate response). Its solution is given by

$$v^\top=\left((Xw)^\top(Xw)\right)^{-1}(Xw)^\top X=(w^\top\Sigma w)^{-1}w^\top\Sigma.$$

Next, the $R^2$ formula can be simplified as

$$R^2=\frac{\|X\|^2-\|X-Xwv^\top\|^2}{\|X\|^2}=\frac{\|Xwv^\top\|^2}{\|X\|^2}$$
due to the Pythagoras theorem, because the hat matrix in regression is an orthogonal projection (but it is also easy to show directly).

Plugging in the equation for $v$, we obtain for the numerator:

$$\|Xwv^\top\|^2=\operatorname{tr}\left(Xwv^\top(Xwv^\top)^\top\right)=\operatorname{tr}\left(Xww^\top\Sigma\,\Sigma ww^\top X^\top\right)/(w^\top\Sigma w)^2=\operatorname{tr}\left(w^\top\Sigma\Sigma w\right)/(w^\top\Sigma w)=\|\Sigma w\|^2/(w^\top\Sigma w).$$

The denominator is equal to $\|X\|^2=\operatorname{tr}(\Sigma)$, resulting in the formula given above.
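
If it helps, here is a small numpy sketch verifying the derivation numerically; the simulated data and variable names are illustrative assumptions, and $\Sigma$ is taken to be $X^\top X$ exactly as in the derivation above.

```python
import numpy as np

# Illustrative correlated data:
rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))
X -= X.mean(axis=0)
Sigma = X.T @ X                       # unnormalised covariance, as in the derivation
w = np.array([1.0, 2.0, 5.0, 0.0])
w /= np.linalg.norm(w)                # unit vector

v = (w @ Sigma) / (w @ Sigma @ w)     # regression solution for v
recon = np.outer(X @ w, v)            # the reconstruction X w v^T
r2_regression = 1 - np.sum((X - recon) ** 2) / np.sum(X ** 2)
r2_closed_form = np.sum((Sigma @ w) ** 2) / ((w @ Sigma @ w) * np.trace(Sigma))
print(r2_regression, r2_closed_form)  # the two expressions coincide
```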


I think this is an answer to a different question. For example, it is not the case that optimising your $R^2$ w.r.t. $w$ will give the first PC as the unique answer (in those cases where it is unique). The fact that $(1,0)$ and $\frac{1}{\sqrt{2}}(1,1)$ both give 100% when $X=Y$ is evidence enough. Your proposed method seems to assume that the "normalised" objective function for PCA will always understate the variance explained (yours isn't a normalised PCA objective function, as it normalises by the quantity being optimised in PCA).
probabilityislogic

I agree that our answers are to different questions, but it's not clear to me which one the OP had in mind. Also, note that my interpretation is not something very weird: it's a standard regression approach. When we say that $x$ explains so and so much variance in $y$, we compute the reconstruction error of $y-xb$ with an optimal $b$, not just of $y-x$. Here is another argument: if all $n$ variables are standardized, then in your approach each one explains $1/n$ of the total variance. This is not very informative: some variables can be much more predictive than others! My approach reflects that.
amoeba says Reinstate Monica

@amoeba (+1) Great answer, it's really helpful! Would you know any reference that tackles this issue? Thanks!
PierreE

@PierreE Thanks. No, I don't think I have any reference for that.
amoeba says Reinstate Monica


Let the total variance, $T$, in a data set of vectors be the sum of squared errors (SSE) between the vectors in the data set and the mean vector of the data set,

$$T=\sum_i (x_i-\bar x)\cdot(x_i-\bar x)$$
where $\bar x$ is the mean vector of the data set, $x_i$ is the $i$th vector in the data set, and $\cdot$ denotes the dot product of two vectors. Said another way, the total variance is the SSE between each $x_i$ and its predicted value, $f(x_i)$, when we set $f(x_i)=\bar x$.

Now let the predictor of $x_i$, $f(x_i)$, be the projection of the vector $x_i$ onto a unit vector $c$.

$$f_c(x_i)=(c\cdot x_i)\,c$$

Then the SSE for a given $c$ is

$$\mathrm{SSE}_c=\sum_i\left(x_i-f_c(x_i)\right)\cdot\left(x_i-f_c(x_i)\right)$$

I think that if you choose $c$ to minimize $\mathrm{SSE}_c$, then $c$ is the first principal component.

If instead you choose $c$ to be the normalized version of the vector $(1,2,5,\dots)$, then $T-\mathrm{SSE}_c$ is the variance in the data described by using $c$ as a predictor.
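
A minimal numpy sketch of this computation, assuming the data have been centered and with an illustrative function name and random example data (neither is part of the original answer):

```python
import numpy as np

def variance_described(X, c):
    """T - SSE_c: the variance described by using projection onto the unit
    vector c as the predictor. X holds one data vector per row."""
    c = np.asarray(c, dtype=float)
    c = c / np.linalg.norm(c)                # normalised version of c
    xbar = X.mean(axis=0)
    T = np.sum((X - xbar) ** 2)              # total variance (SSE about the mean)
    preds = np.outer(X @ c, c)               # f_c(x_i) = (c . x_i) c
    SSE_c = np.sum((X - preds) ** 2)
    return T - SSE_c

# Illustrative random data:
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
X -= X.mean(axis=0)                          # center so both terms share the same origin
print(variance_described(X, [1, 2, 5, 0, 0, 0]))
```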


This is a reasonable approach, but I think this is not the best way to interpret the question; see my answer for an alternative approach that I argue makes more sense (in particular, see my example figure there).
amoeba says Reinstate Monica

Think about it like this. Imagine a dataset with two standardized identical variables $X=Y$. How much variance is described by $X$? Your calculation gives 50%. I argue that the correct answer is 100%.
amoeba says Reinstate Monica