Theory behind partial least squares regression


33

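Can anyone recommend a good exposition of the theory behind partial least squares regression (available online) for someone who understands SVD and PCA? I have looked at many sources online and have not found anything with the right combination of rigor and accessibility.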

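I have looked into The Elements of Statistical Learning, which was suggested in a comment on a question asked on Cross Validated, What is partial least squares (PLS) regression and how is it different from OLS?, but I don't think this reference does the topic justice (it is too brief to do so, and does not provide much theory on the subject). From what I've read, PLS exploits linear combinations of the predictor variables, $z_i = X\varphi_i$, that maximize the covariance $y^\top z_i$ subject to the constraints $\|\varphi_i\| = 1$ and $z_i^\top z_j = 0$ if $i \neq j$, where the $\varphi_i$ are chosen iteratively, in the order in which they maximize the covariance. But even after all I've read, I'm still uncertain whether that is true, and if so, how the method is executed.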

Answers:


38

Section 3.5.2 in The Elements of Statistical Learning is useful because it puts PLS regression in the right context (of other regularization methods), but is indeed very brief, and leaves some important statements as exercises. In addition, it only considers a case of a univariate dependent variable y.

The literature on PLS is vast, but can be quite confusing because there are many different "flavours" of PLS: univariate versions with a single DV y (PLS1) and multivariate versions with several DVs Y (PLS2), symmetric versions treating X and Y equally and asymmetric versions ("PLS regression") treating X as independent and Y as dependent variables, versions that allow a global solution via SVD and versions that require iterative deflations to produce every next pair of PLS directions, etc. etc.

All of this has been developed in the field of chemometrics and stays somewhat disconnected from the "mainstream" statistical or machine learning literature.

The overview paper that I find most useful (and that contains many further references) is:

For a more theoretical discussion I can further recommend:


A short primer on PLS regression with univariate y (aka PLS1, aka SIMPLS)

The goal of regression is to estimate $\beta$ in a linear model $y = X\beta + \epsilon$. The OLS solution $\beta = (X^\top X)^{-1} X^\top y$ enjoys many optimality properties but can suffer from overfitting. Indeed, OLS looks for $\beta$ that yields the highest possible correlation of $X\beta$ with $y$. If there are many predictors, then it is always possible to find some linear combination that happens to have a high correlation with $y$. This will be a spurious correlation, and such $\beta$ will usually point in a direction explaining very little variance in $X$. Directions explaining very little variance are often very "noisy" directions. If so, then even though the OLS solution performs great on training data, it will perform much worse on test data.
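The spurious-correlation effect is easy to reproduce. The snippet below is only an illustrative sketch on synthetic data (the sample sizes and the seed are arbitrary assumptions, not taken from the answer): predictors and response are independent noise, yet OLS finds a linear combination that correlates strongly with the response on the training set and does not generalize.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 30, 1000, 25   # assumed sizes: many predictors, few samples

# Predictors and response are independent noise: there is nothing to learn.
X_train = rng.normal(size=(n_train, p))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, p))
y_test = rng.normal(size=n_test)

# OLS fit (least-squares solution of X beta = y, no intercept).
beta_ols, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Correlation of the fitted linear combination with the response.
corr_train = np.corrcoef(X_train @ beta_ols, y_train)[0, 1]
corr_test = np.corrcoef(X_test @ beta_ols, y_test)[0, 1]
print(f"train corr: {corr_train:.2f}, test corr: {corr_test:.2f}")
```

With almost as many predictors as samples, the training correlation is typically high while the test correlation hovers around zero.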

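To prevent overfitting, one uses regularization methods that essentially force $\beta$ to point into directions of high variance in $X$ (this is also called "shrinkage" of $\beta$). Principal component regression (PCR) does this by discarding low-variance directions, ridge regression by smoothly penalizing them. PLS1 is yet another such method.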

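PLS1 replaces the OLS goal of finding $\beta$ that maximizes the correlation $\operatorname{corr}(X\beta, y)$ with an alternative goal of finding $\beta$ with length $\|\beta\| = 1$ maximizing covariance

$$\operatorname{cov}(X\beta, y) \sim \operatorname{corr}(X\beta, y) \cdot \sqrt{\operatorname{var}(X\beta)},$$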
which again effectively penalizes directions of low variance.
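To spell out the relation between covariance and correlation (a sketch, assuming $X$ and $y$ are centered and sample covariances use the $1/(n-1)$ convention):

$$
\operatorname{cov}(X\beta, y) = \operatorname{corr}(X\beta, y)\,\sqrt{\operatorname{var}(X\beta)}\,\sqrt{\operatorname{var}(y)},
$$

where $\operatorname{var}(y)$ does not depend on $\beta$ and can therefore be dropped. With centered data,

$$
\operatorname{cov}(X\beta, y) = \frac{1}{n-1}\,\beta^\top X^\top y \;\le\; \frac{1}{n-1}\,\|\beta\|\,\|X^\top y\|
$$

by the Cauchy-Schwarz inequality, with equality when $\beta \propto X^\top y$. Under the constraint $\|\beta\| = 1$ the maximizer is therefore $X^\top y / \|X^\top y\|$.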

Finding such $\beta$ (let's call it $\beta_1$) yields the first PLS component $z_1 = X\beta_1$. One can further look for the second (and then third, etc.) PLS component that has the highest possible covariance with $y$ under the constraint of being uncorrelated with all the previous components. This has to be solved iteratively, as there is no closed-form solution for all components (the direction of the first component $\beta_1$ is simply given by $X^\top y$ normalized to unit length). When the desired number of components is extracted, PLS regression discards the original predictors and uses PLS components as new predictors; this yields some linear combination of them $\beta_z$ that can be combined with all $\beta_i$ to form the final $\beta_\mathrm{PLS}$.
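The iterative procedure can be sketched in NumPy. This is one common way to carry it out (NIPALS-style deflation of $X$), not necessarily the exact variant the answer has in mind. The function name `pls1` and the synthetic data are illustrative assumptions. The columns of `R` map the centered original predictors to the components, so they play the role of the $\beta_i$ up to scaling.

```python
import numpy as np

def pls1(X, y, n_components):
    """PLS1 by deflation (illustrative sketch). Returns (beta, T, R)."""
    X = X - X.mean(axis=0)            # center predictors
    y = y - y.mean()                  # center response
    n, p = X.shape
    Xk = X.copy()                     # deflated copy of X
    W = np.zeros((p, n_components))   # unit-length weights on deflated X
    P = np.zeros((p, n_components))   # loadings of X on each component
    T = np.zeros((n, n_components))   # components (scores)
    q = np.zeros(n_components)        # regression coefficient of y on each score
    for k in range(n_components):
        w = Xk.T @ y                  # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # next component
        tt = t @ t
        p_k = Xk.T @ t / tt
        q[k] = (y @ t) / tt
        Xk = Xk - np.outer(t, p_k)    # remove what this component explains
        W[:, k], P[:, k], T[:, k] = w, p_k, t
    R = W @ np.linalg.inv(P.T @ W)    # weights on the original X, so T = X R
    beta = R @ q                      # final coefficient vector
    return beta, T, R

# Synthetic, mildly correlated predictors (assumed data for illustration).
rng = np.random.default_rng(0)
Z = rng.normal(size=(60, 5))
X = Z + 0.8 * Z[:, :1]
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=60)

beta_2, T_2, R_2 = pls1(X, y, 2)           # regularized: two components
beta_full, T_full, R_full = pls1(X, y, 5)  # all components
```

Because the components are mutually uncorrelated, $y$ itself does not need to be deflated: regressing it on each component separately gives the same coefficients as a joint regression.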

Note that:

  1. If all PLS1 components are used, then PLS will be equivalent to OLS. So the number of components serves as a regularization parameter: the lower the number, the stronger the regularization.
  2. If the predictors X are uncorrelated and all have the same variance (i.e. X has been whitened), then there is only one PLS1 component and it is equivalent to OLS.
  3. Weight vectors $\beta_i$ and $\beta_j$ for $i \neq j$ are not going to be orthogonal, but will yield uncorrelated components $z_i = X\beta_i$ and $z_j = X\beta_j$.
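Note 2 can be checked directly. A sketch, assuming centered data and whitening in the sense $X^\top X = cI$ for some $c > 0$:

$$
\beta_\mathrm{OLS} = (X^\top X)^{-1} X^\top y = \tfrac{1}{c}\, X^\top y, \qquad \beta_1 = \frac{X^\top y}{\|X^\top y\|},
$$

so $\beta_1$ already points along the OLS solution. Regressing $y$ on $z_1 = X\beta_1$ gives the coefficient

$$
\frac{z_1^\top y}{z_1^\top z_1} = \frac{\beta_1^\top X^\top y}{c\,\beta_1^\top \beta_1} = \frac{\|X^\top y\|}{c},
$$

hence $\beta_\mathrm{PLS} = \frac{\|X^\top y\|}{c}\,\beta_1 = \frac{1}{c} X^\top y = \beta_\mathrm{OLS}$. A second component would need $z_2^\top z_1 = c\,\beta_2^\top \beta_1 = 0$, i.e. $\beta_2 \perp X^\top y$, and then $\operatorname{cov}(X\beta_2, y) \propto \beta_2^\top X^\top y = 0$, so nothing is left to extract.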

All that being said, I am not aware of any practical advantages of PLS1 regression over ridge regression (while the latter does have lots of advantages: its regularization parameter is continuous rather than discrete, it has an analytical solution, is much more standard, allows kernel extensions and analytical formulas for leave-one-out cross-validation errors, etc. etc.).
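Two of the ridge advantages listed here, the analytical solution and the analytical leave-one-out formula, can be illustrated in a few lines. This is a sketch under stated assumptions: no intercept (data treated as already centered and not re-centered per fold), a fixed penalty `lam`, and synthetic data. The function names are hypothetical. The shortcut uses the hat matrix $H = X (X^\top X + \lambda I)^{-1} X^\top$ and the identity that the leave-one-out residual equals $(y_i - \hat y_i)/(1 - H_{ii})$.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_loo_residuals(X, y, lam):
    """Leave-one-out residuals from a single fit via the hat matrix."""
    p = X.shape[1]
    A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # shape (p, n)
    H = X @ A                                            # hat matrix, shape (n, n)
    resid = y - H @ y
    return resid / (1.0 - np.diag(H))

def ridge_loo_brute(X, y, lam):
    """Same residuals by refitting n times, for comparison."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b = ridge_fit(X[keep], y[keep], lam)
        out[i] = y[i] - X[i] @ b
    return out

# Synthetic data and an arbitrary penalty (assumptions for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = X @ rng.normal(size=8) + rng.normal(size=40)
lam = 2.0

loo_fast = ridge_loo_residuals(X, y, lam)
loo_slow = ridge_loo_brute(X, y, lam)
print(np.allclose(loo_fast, loo_slow))
```

The single-fit version makes it cheap to scan a whole grid of `lam` values, which has no direct counterpart for the discrete number of PLS components.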


Quoting from Frank & Friedman:

RR, PCR, and PLS are seen in Section 3 to operate in a similar fashion. Their principal goal is to shrink the solution coefficient vector away from the OLS solution toward directions in the predictor-variable space of larger sample spread. PCR and PLS are seen to shrink more heavily away from the low spread directions than RR, which provides the optimal shrinkage (among linear estimators) for an equidirection prior. Thus PCR and PLS make the assumption that the truth is likely to have particular preferential alignments with the high spread directions of the predictor-variable (sample) distribution. A somewhat surprising result is that PLS (in addition) places increased probability mass on the true coefficient vector aligning with the Kth principal component direction, where K is the number of PLS components used, in fact expanding the OLS solution in that direction.

They also conduct an extensive simulation study and conclude (emphasis mine):

For the situations covered by this simulation study, one can conclude that all of the biased methods (RR, PCR, PLS, and VSS) provide substantial improvement over OLS. [...] In all situations, RR dominated all of the other methods studied. PLS usually did almost as well as RR and usually outperformed PCR, but not by very much.


Update: In the comments @cbeleites (who works in chemometrics) suggests two possible advantages of PLS over RR:

  1. An analyst can have an a priori guess as to how many latent components should be present in the data; this effectively allows one to set the regularization strength without doing cross-validation (and there might not be enough data to do a reliable CV). Such an a priori choice of λ might be more problematic in RR.

  2. RR yields one single linear combination $\beta_\mathrm{RR}$ as an optimal solution. In contrast, PLS with e.g. five components yields five linear combinations $\beta_i$ that are then combined to predict $y$. Original variables that are strongly inter-correlated are likely to be combined into a single PLS component (because combining them together will increase the explained variance term). So it might be possible to interpret the individual PLS components as some real latent factors driving $y$. The claim is that it is easier to interpret $\beta_1, \beta_2$, etc., as opposed to the joint $\beta_\mathrm{PLS}$. Compare this with PCR where one can also see as an advantage that individual principal components can potentially be interpreted and assigned some qualitative meaning.


1
That paper looks useful. I don't think it addresses how much overfitting can be caused by PLS.
Frank Harrell

3
That's right, @Frank, but honestly, as far as predictive performance is concerned, I don't see much sense in doing anything else than ridge regression (or perhaps an elastic net if sparsity is desired too). My own interest in PLS is in the dimensionality reduction aspect when both X and Y are multivariate; so I am not very interested in how PLS performs as a regularization technique (in comparison with other regularization methods). When I have a linear model that I need to regularize, I prefer to use ridge. I wonder what's your experience here?
amoeba says Reinstate Monica

3
My experience is that ridge (quadratic penalized maximum likelihood estimation) gives superior predictions. I think that some analysts feel that PLS is a dimensionality reduction technique in the sense of avoiding overfitting but I gather that's not the case.
Frank Harrell

2
b) if you are going for a, say, spectroscopic interpretation of what the model does, I find it easier to see from the PLS loadings what kind of substances are measured. You may find one or two substances/substance classes in there, whereas the coefficients, which include all latent variables, are harder to interpret because spectral contributions of more substances are combined. This is more prominent because not all of the usual spectral interpretation rules apply: a PLS model may pick some bands of a substance while ignoring others. "Normal" spectra interpretation uses a lot of this band could ...
cbeleites supports Monica

2
... come from this or that substance. If it is this substance, there must be this other band. As this latter possibility of verifying the substance is not possible with the latent variables/loadings/coefficients, interpreting things that vary together and therefore end up in the same latent variable is much easier than interpreting the coefficients that already summarize all kinds of possible "hints" that are known by the model.
cbeleites supports Monica

4

Yes. Herman Wold's book Theoretical Empiricism: A general rationale for scientific model-building is the single best exposition of PLS that I'm aware of, especially given that Wold is an originator of the approach. Not to mention that it's simply an interesting book to read and know about. In addition, based on a search on Amazon, the number of references to books on PLS written in German is astonishing, but it may be that the subtitle of Wold's book is part of the reason for that.


1
This amazon.com/Towards-Unified-Scientific-Models-Methods/dp/… is related but covers much more than PLS
kjetil b halvorsen

That is true but the primary focus of the book is Wold's development of the theory and application of PLS.
Mike Hunter
Licensed under cc by-sa 3.0 with attribution required.