PCA on correlation or covariance: does PCA on correlation ever make sense? [closed]


32

In principal component analysis (PCA), one can use either the covariance matrix or the correlation matrix to find the components (from their respective eigenvectors). These give different results (PC loadings and scores), since the eigenvectors of the two matrices are not equal. My understanding is that this is caused by the fact that the raw data vector X and its standardized version Z cannot be related by an orthogonal transformation. Mathematically, similar matrices (i.e., matrices related by an orthogonal transformation) have the same eigenvalues, but not necessarily the same eigenvectors.

This raises some difficulties in my mind:

  1. Does PCA actually make sense if you can get two different answers for the same starting data set, both attempting to achieve the same goal (= finding directions of maximum variance)?

  2. When using the correlation matrix approach, each variable is standardized (scaled) by its own standard deviation before the PCs are computed. Does it still make sense to find directions of maximum variance if the data have already been scaled/compressed differently beforehand? I know that correlation-based PCA is very convenient (the standardized variables are dimensionless, so their linear combinations can be added; other advantages are also pragmatic), but is it correct?

It seems to me that covariance-based PCA is the only truly correct approach (even when the variances of the variables differ greatly), and that whenever this version cannot be used, correlation-based PCA should not be used either.

I know that this thread exists: PCA on correlation or covariance? It seems, however, to focus only on finding a pragmatic solution, which may not also be an algebraically correct one.


4
I'll be honest and tell you that at some point I stopped reading your question. PCA makes sense. Yes, the results can differ depending on whether you choose to use the correlation matrix or the variance/covariance matrix. Correlation-based PCA is preferred when your variables are measured on different scales, but you don't want the scales to dominate the result. Imagine you have a series of variables ranging from 0 to 1, and then some variables with very large values (relatively speaking, say 0 to 1000): the large variances associated with the second group of variables will dominate.
Patrick

4
But the same is true of many other techniques, and I think Patrick's point is reasonable. Also, it was just a comment; there's no need to get aggressive. In general, why would you assume that there should be one truly "algebraically" correct way to approach the problem?
Gala

5
Perhaps you are thinking about PCA in the wrong way: it is just a transformation, so there is no question of it being correct or incorrect, or of it relying on assumptions about a data model, unlike, say, regression or factor analysis.
Scortchi - Reinstate Monica

5
The crux of this question seems to be a misunderstanding of what standardization does and of how PCA works. That is understandable, because a good grasp of PCA requires visualizing higher-dimensional shapes. I would maintain that this question, like many others based on some misunderstanding, is a good one and should remain open, because its answers can reveal truths that many people may not previously have fully appreciated.
whuber

6
PCA does not "claim" anything. People make claims about PCA, and in fact its use differs greatly from field to field. Some of those uses may be silly or questionable, but assuming that one variant of the technique must be the "algebraically correct" one, without reference to the context or goals of the analysis, does not seem very enlightening.
Gala

Answers:


29

I hope these responses to your two questions will allay your concerns:

  1. The correlation matrix is the covariance matrix of the standardized data (i.e., data that are not only centered but also rescaled); that is, it is (as if) the covariance matrix of a different data set. So it is natural, and should not bother you, that the results differ.
  2. Yes, it is meaningful to find directions of maximum variance in the standardized data; they are directions of, so to speak, "correlatedness" rather than "covariatedness". That is, they describe the shape of the multivariate data cloud after the unequal influence of the original variables' variances has been removed.
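Point 1 can be checked numerically. The sketch below (with made-up data; all names are illustrative) shows that the correlation matrix of X is exactly the covariance matrix of the standardized data Z, and that the two leading eigenvectors therefore differ: the covariance PC1 is dominated by the large-scale variable, while the correlation PC1 is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.normal(size=n)
# Two correlated variables on very different scales (toy data).
X = np.column_stack([t + 0.5 * rng.normal(size=n),
                     100.0 * (t + 0.5 * rng.normal(size=n))])

# Standardizing (center, then divide by the standard deviation) turns the
# correlation matrix of X into the covariance matrix of the new data Z.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(np.corrcoef(X, rowvar=False), np.cov(Z, rowvar=False)))  # True

# Because Z is effectively a different data set, the eigenvectors differ:
# (np.linalg.eigh sorts eigenvalues ascending, so the last column is PC1).
evecs_cov = np.linalg.eigh(np.cov(X, rowvar=False))[1]
evecs_corr = np.linalg.eigh(np.corrcoef(X, rowvar=False))[1]
print(np.abs(evecs_cov[:, -1]))    # ~[0.01, 1.00]: dominated by the big variable
print(np.abs(evecs_corr[:, -1]))   # ~[0.71, 0.71]: scales removed
```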

@whuber added the following text and picture (I thank him for it; also see the comments below):

Here is a two-dimensional example showing why it still makes sense to locate the principal axes of standardized data (shown on the right). Note that in the right-hand plot the cloud still has a "shape", even though the variances along the coordinate axes are now exactly equal (to 1.0). Similarly, in higher dimensions the standardized point cloud will have a non-spherical shape even though the variances along all axes are exactly equal (to 1.0). The principal axes (and their corresponding eigenvalues) describe that shape. Another way to understand this is to note that all the rescaling and shifting done when standardizing the variables occurs along the directions of the coordinate axes, not along the principal directions themselves.

[Figure: scatterplots of the raw data (left) and the standardized data (right), each with its principal axes]

What is happening here is so intuitive and clear geometrically that it would be a stretch to characterize it as a "black-box operation": on the contrary, standardization and PCA are among the most basic and routine operations we perform on data in order to understand it.
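The geometric point in the figure can be reproduced numerically. In this sketch (made-up data), standardization forces the variance along each coordinate axis to be exactly 1, yet the eigenvalues of the correlation matrix remain unequal, which is precisely the remaining "shape" that the principal axes describe.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
t = rng.normal(size=n)
X = np.column_stack([t + 0.5 * rng.normal(size=n),
                     t + 0.5 * rng.normal(size=n)])

# After standardization, every coordinate has variance exactly 1...
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Z.var(axis=0, ddof=1))   # [1. 1.]

# ...yet the cloud is still elongated along the diagonal: the eigenvalues
# of its covariance (= correlation) matrix, here roughly 1.8 and 0.2,
# describe that non-spherical shape. They always sum to the trace, 2.
evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
print(evals)
```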


@ttnphns

When would one prefer to do PCA (or factor analysis, or another similar type of analysis) on correlations (i.e., on z-standardized variables) rather than on covariances (i.e., on centered variables)?

  1. When the variables have different units of measurement. That is clear.
  2. When one wants the analysis to reflect purely linear associations. Pearson r is not merely the covariance between unit-scaled variables (variance = 1); it becomes a measure of the strength of linear relationship, whereas the usual covariance coefficient admits both linear and monotonic relationships.
  3. When one wants the associations to reflect relative co-deviation (from the mean) rather than raw co-deviation. Correlation is based on distributions and their spreads, while covariance is based on the original measurement scale. If I were to factor-analyze patients' psychopathology profiles as rated by professionals on a clinical questionnaire composed of Likert-type items, I would prefer covariance, because professionals are not expected to distort the rating scale psychologically. If, on the other hand, I were to analyze patients' self-ratings on that same questionnaire, I would probably choose correlation, because a layperson's assessment is expected to be relative to "other people" or "the majority", who are in effect "allowed" to "shrink" or "stretch" the magnifying glass of the rating scale.
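Point 2 above can be verified directly. This small sketch (with made-up data) shows that Pearson r is exactly the covariance of the two variables once each has been z-scored to variance 1:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = 3.0 * a + rng.normal(size=1000)   # linearly related, different scale

# z-score both variables so each has variance 1.
za = (a - a.mean()) / a.std(ddof=1)
zb = (b - b.mean()) / b.std(ddof=1)

# Pearson r is simply their covariance once the scales are removed.
print(np.cov(za, zb)[0, 1], np.corrcoef(a, b)[0, 1])   # equal
```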

1
1. Sorry, but this bothers me a lot. To an outsider, standardization is a black-box operation that forms part of the data pre-processing for PCA (as it does in ICA too). He wants an answer in terms of his (raw) input data, and especially when physical (multidimensional) data are concerned, he also needs a physical interpretation of the PCA output (i.e., in terms of the non-standardized variables).
Lucozade

1
Your latest revision seems to boil down to reasserting that "covariance-based PCA is the only truly correct one". Since the essence of all the replies so far has been "No; that's the wrong way to think about the problem, and here is why", it is hard to know how you expect to steer the discussion in the face of such overwhelming disagreement.
Nick Cox

4
@Lucozade: I am confused by your description of your application: how can PCA recommend anything? How do you measure performance? And likewise for your last comment: optimal with respect to what?
Scortchi - Reinstate Monica

5
@Lucozade: Indeed, please listen to what Scortchi is saying; you seem to keep chasing a ghost. PCA is just a particular form of rotation of the data in space. It always does what it does optimally for the input data it is given. The cov-vs-corr dilemma is a pragmatic one, rooted in data pre-processing, and it is resolved at that level, not at the level of PCA.
ttnphns

1
@Lucozade: Based on your replies to me, it is my (non-expert) opinion that for your particular need you want cov-based PCA. Again, your variables are homogeneous in data/measurement type (the same kind of machine, and everything measured in volts). To me your example is clearly a case where cov-PCA is right, but note that this will not always be so, and I think that is the key point of this thread (the choice of cor vs. cov is case-specific and needs to be decided by the person who best knows the data and the application). Best wishes with your research!
Patrick

6

Speaking from a practical standpoint, which may be unpopular here: if you measure your data on different scales, use correlations ("UV scaling" if you are a chemometrician); but if the variables are on the same scale and their magnitude matters (e.g., with spectroscopic data), then covariance (centering the data only) makes more sense. PCA is a scale-dependent method, and log transformation can also help with highly skewed data.

In my humble opinion based on 20 years of practical application of chemometrics you have to experiment a bit and see what works best for your type of data. At the end of the day you need to be able to reproduce your results and try to prove the predictability of your conclusions. How you get there is often a case of trial and error but the thing that matters is that what you do is documented and reproducible.


4
The practical approach you seem to advocate here boils down, when both covariances and correlations are warranted, to "try both and see what works best". That purely empirical stance masks the fact that each choice comes with its own assumptions or paradigm about reality, which the researcher ought to be aware of in advance, even if he understands that he prefers one of them entirely arbitrarily. Selecting "what works best" is capitalizing on the feeling of pleasure: narcomania.
ttnphns

-2

I have no time to go into a fuller description of the detailed & technical aspects of the experiment I described, and clarifications on wordings (recommending, performance, optimum) would again divert us from the real issue, which is what type of input data PCA can(not) / should (not) be taking. PCA operates by taking linear combinations of numbers (values of variables). Mathematically, of course, one can add any two (real or complex) numbers. But if they have been re-scaled before the PCA transformation, is their linear combination (and hence the process of maximization) still meaningful to operate on? If each variable x_i has the same variance s^2, then clearly yes, because (x1/s) + (x2/s) = (x1+x2)/s is still proportional and comparable to the physical superposition of the data x1+x2 itself. But if s1 ≠ s2, then the linear combination of the standardized quantities distorts the input variables to different degrees, and there seems little point in maximizing the variance of that linear combination. In that case, PCA gives a solution for a different set of data, in which each variable is scaled differently. If you then un-standardize afterwards (when using corr_PCA), that may be OK and necessary; but if you just take the raw corr_PCA solution as-is and stop there, you obtain a mathematical solution, but not one related to the physical data. Since un-standardizing afterwards then seems mandatory as a minimum (i.e., "un-stretching" the axes by the inverse standard deviations), cov_PCA could have been used to begin with. If you are still reading by now, I am impressed! For now, I finish by quoting from Jolliffe's book, p. 42, which is the part that concerns me: "It must not be forgotten, however, that correlation matrix PCs, when re-expressed in terms of the original variables, are still linear functions of x that maximize variance with respect to the standardized variables and not with respect to the original variables."
If you think I am interpreting this or its implications wrongly, this excerpt may be a good focal point for further discussion.
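Jolliffe's quoted remark can be illustrated with a small sketch (made-up two-variable data; the names are illustrative). A corr-PC1 re-expressed as a linear function of the raw variables is still a valid direction in the raw space, but it captures far less raw variance than cov-PC1, because it was chosen to maximize variance of the *standardized* data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
t = rng.normal(size=n)
X = np.column_stack([t + 0.5 * rng.normal(size=n),
                     10.0 * t + rng.normal(size=n)])   # very different scales

S = np.cov(X, rowvar=False)
R = np.corrcoef(X, rowvar=False)
s = np.sqrt(np.diag(S))                 # per-variable standard deviations

# np.linalg.eigh sorts eigenvalues ascending; the last column is PC1.
u = np.linalg.eigh(S)[1][:, -1]         # cov-PC1: a direction in raw space
v = np.linalg.eigh(R)[1][:, -1]         # corr-PC1: a direction in z-scored space

# Re-express corr-PC1 in terms of the original variables: the score Z @ v
# equals (X - mean) @ (v / s), so w = v / s is its coefficient vector on
# the raw variables. Normalize to unit length for a fair comparison.
w = (v / s) / np.linalg.norm(v / s)

# corr-PC1, even written as a linear function of the raw x, captures far
# less raw variance than cov-PC1 does.
print((X @ u).var(ddof=1), (X @ w).var(ddof=1))
```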


3
It is somewhat amusing that your own answer, which is in tune with everything people here have been trying to convey to you, still leaves you unsettled. You still argue that "there seems little point" in PCA on correlations. Well, if you need to stay close to the raw data ("physical data", as you strangely call it), then you really shouldn't use correlations, since they correspond to a different ("distorted") data set.
ttnphns

2
(Cont.) Jolliffe's citation states that PCs obtained on correlations will always remain just that, and cannot be turned "back" into PCs on covariances, even though you can re-express them as linear combinations of the original variables. Thus Jolliffe stresses the idea that PCA results are fully dependent on the type of pre-processing used, and that there exist no "true", "genuine" or "universal" PCs...
ttnphns

2
(Cont.) And in fact, several lines further down Jolliffe speaks of yet another "form" of PCA: PCA on the X'X matrix. This form is even "closer" to the original data than cov-PCA, because no centering of the variables is done, and the results are usually utterly different. You could also do PCA on cosines. People do PCA on all versions of the SSCP matrix, although covariances or correlations are used most often.
ttnphns

3
Underlying this answer is an implicit assumption that the units in which data are measured have an intrinsic meaning. That is rarely the case: we may choose to measure length in Angstroms, parsecs, or anything else, and time in picoseconds or millennia, without altering the meaning of the data one iota. The changes made in going from covariance to correlation are merely changes of units (which, by the way, are particularly sensitive to outlying data). This suggests the issue is not covariance versus correlation, but rather to find fruitful ways to express the data for analysis.
whuber

3
@ttnphns I'll stick by the "merely," thanks. Whether or not the implications are "profound," the fact remains that standardization of a variable literally is an affine re-expression of its values: a change in its units of measure. The importance of this observation lies in its implications for some claims appearing in this thread, of which the most prominent is "covariance-based PCA is the only truly correct one." Any conception of correctness that ultimately depends on an essentially arbitrary aspect of the data--how we write them down--cannot be right.
whuber
Licensed under cc by-sa 3.0 with attribution required.