Proof of the equivalence of the two ridge regression formulations



I have read the most popular books on statistical learning:

1- The Elements of Statistical Learning.

2- An Introduction to Statistical Learning.

Both mention that ridge regression has two equivalent formulations. Is there an understandable mathematical proof of this equivalence?

I have also searched Cross Validated, but could not find a definitive proof there.

Furthermore, does the LASSO admit the same type of proof?

[Image in the original question: the two formulations, equations (3.41) and (3.42) from The Elements of Statistical Learning:]

$$ \hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \tag{3.41} $$

$$ \hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq t \tag{3.42} $$



The lasso is not a form of ridge regression.
Xi'an

@jeza, could you explain what is missing from my answer? It actually derives everything about the connection.
Royi

@jeza, could you be more specific? Unless you are familiar with the concept of the Lagrangian for constrained problems, it is hard to give a concise answer.
Royi

@jeza, a constrained optimization problem can be converted into optimization of the Lagrangian function / KKT conditions (as explained in the current answers). This principle already has many different simple explanations all over the internet. In what direction is more explanation of the proof necessary? An explanation/proof of the Lagrange multiplier/function, an explanation/proof of how this problem is a case of optimization that relates to the method of Lagrange, the difference between KKT and Lagrange, an explanation of the principle of regularization, etc.?
Sextus Empiricus

Answers:



The classic Ridge Regression (Tikhonov Regularization) is given by:

$$ \arg\min_{x} \; \frac{1}{2} \left\| x - y \right\|_2^2 + \lambda \left\| x \right\|_2^2 $$

The claim above is that the following problem is equivalent:

$$ \arg\min_{x} \; \frac{1}{2} \left\| x - y \right\|_2^2 \quad \text{subject to} \quad \left\| x \right\|_2^2 \leq t $$

Let's define $\hat{x}$ as the optimal solution of the first problem and $\tilde{x}$ as the optimal solution of the second problem.

The claim of equivalence means that $\forall t, \; \exists \lambda \geq 0$ such that $\hat{x} = \tilde{x}$.
Namely, you can always find a pair of $t$ and $\lambda \geq 0$ such that the solutions of the two problems are the same.

How could we find a pair?
Well, by solving the problems and looking at the properties of the solution.
Both problems are convex and smooth, so it should make things simpler.

The solution of the first problem is given at the point where the gradient vanishes, which means:

$$ \hat{x} - y + 2 \lambda \hat{x} = 0 $$
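
For reference (a one-line rearrangement added here, not part of the original answer), this stationarity condition gives the penalized solution in closed form; the matrix expression further below is just this relation written with identity matrices:

$$ (1 + 2\lambda)\,\hat{x} = y \quad \Longrightarrow \quad \hat{x} = \left( I + 2 \lambda I \right)^{-1} y = \frac{y}{1 + 2\lambda}. $$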

The KKT conditions of the second problem state:

$$ \tilde{x} - y + 2 \mu \tilde{x} = 0 $$

and

$$ \mu \left( \left\| \tilde{x} \right\|_2^2 - t \right) = 0 $$

The last equation suggests that either $\mu = 0$ or $\left\| \tilde{x} \right\|_2^2 = t$.

Notice that the two stationarity equations are equivalent: namely, if $\hat{x} = \tilde{x}$ and $\mu = \lambda$, then both equations hold.

So it means that in case $\left\| y \right\|_2^2 \leq t$ one must set $\mu = 0$, which means that for $t$ large enough, in order for both to be equivalent, one must set $\lambda = 0$.

In the other case one should find $\mu$ such that:

$$ y^{T} \left( I + 2 \mu I \right)^{-1} \left( I + 2 \mu I \right)^{-1} y = t $$

This is basically the case when $\left\| \tilde{x} \right\|_2^2 = t$.

Once you find that $\mu$, the solutions coincide.
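
As a sanity check (not part of the original answer), here is a minimal numerical sketch of the correspondence in this denoising setting: solve the penalized problem in closed form for an arbitrary $\lambda$, set $t = \|\hat{x}\|_2^2$, solve the constrained problem by projecting $y$ onto the $L_2$ ball of radius $\sqrt{t}$, and check that the two solutions coincide. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)   # the observed vector y
lam = 0.7                 # an arbitrary ridge penalty (lambda)

# Penalized form: argmin_x 0.5*||x - y||_2^2 + lam*||x||_2^2
# Setting the gradient to zero gives (1 + 2*lam) * x = y.
x_hat = y / (1.0 + 2.0 * lam)

# Pick t so that the constraint is active at x_hat.
t = np.sum(x_hat ** 2)

# Constrained form: argmin_x 0.5*||x - y||_2^2  s.t.  ||x||_2^2 <= t
# Its minimizer is the Euclidean projection of y onto the ball of radius sqrt(t).
norm_y = np.linalg.norm(y)
x_tilde = y if norm_y ** 2 <= t else np.sqrt(t) * y / norm_y

print(np.allclose(x_hat, x_tilde))  # expected: True
```

Running the check in the other direction (fixing $t$ first and searching for the $\mu$ that makes $\|\tilde{x}\|_2^2 = t$) recovers the matching $\lambda$, which is exactly the data-dependent map described above.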

Regarding the $L_1$ (LASSO) case, it works with the same idea.
The only difference is that we don't have a closed-form solution, hence deriving the connection is trickier.
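
For a concrete picture (a standard fact about this denoising setting, added for illustration): the penalized $L_1$ problem $\arg\min_x \tfrac{1}{2}\|x - y\|_2^2 + \lambda \|x\|_1$ is solved coordinate-wise by soft-thresholding, while the constrained problem (projection onto the $L_1$ ball of radius $t$) is also a soft-thresholding, but with a data-dependent threshold:

$$ \hat{x}_i = \operatorname{sign}(y_i)\,\max\left(|y_i| - \lambda,\, 0\right), \qquad \tilde{x}_i = \operatorname{sign}(y_i)\,\max\left(|y_i| - \theta,\, 0\right), $$

where $\theta \geq 0$ is chosen so that $\|\tilde{x}\|_1 = t$ (assuming $\|y\|_1 > t$). So $\lambda$ plays the role of $\theta$, but the $\theta$ that matches a given $t$ depends on $y$, which is why the map between $\lambda$ and $t$ has no closed form.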

Have a look at my answer at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 - Significance of $\lambda$ in Basis Pursuit.

Remark
What's actually happening?
In both problems, $x$ tries to be as close as possible to $y$.
In the first case, $x = y$ makes the first term (the $L_2$ distance) vanish, and in the second case it makes the objective function vanish.
The difference is that in the first case one must balance this against the $L_2$ norm of $x$: as $\lambda$ gets higher, the balance means you should make $x$ smaller.
In the second case there is a wall: you bring $x$ closer and closer to $y$ until you hit the wall, which is the constraint on its norm (given by $t$).
If the wall is far enough away (a high value of $t$, where "far enough" depends on the norm of $y$), then it has no effect, just as $\lambda$ is relevant only once its value multiplied by the norm of $y$ starts to be meaningful.
The exact connection is given by the Lagrangian stated above.

Resources

I found this paper today (03/04/2019):


Does the equivalence mean that $\lambda$ and $t$ should be the same? Because I cannot see that in the proof. Thanks.
jeza

@jeza, as I wrote above, for any $t$ there is a $\lambda \geq 0$ (not necessarily equal to $t$, but a function of $t$ and the data $y$) such that the solutions of the two forms are the same.
Royi

@jeza, both $\lambda$ & $t$ are essentially free parameters here. Once you specify, say, $\lambda$, that yields a specific optimal solution. But $t$ remains a free parameter. So at this point the claim is that there can be some value of $t$ that would yield the same optimal solution. There are essentially no constraints on what that $t$ must be; it's not like it has to be some fixed function of $\lambda$, like $t = \lambda/2$ or something.
gung - Reinstate Monica

@Royi, I would like to know: 1- why does your formula have the (1/2), while the formulas in the question do not? 2- are you using KKT to show the equivalence of the two formulas? 3- if yes, I still can't see that equivalence. I am not sure, but what I expect to see is a proof showing that formula one = formula two.
jeza

1. It is just easier when you differentiate the LS term. You can move from my $\lambda$ to the OP's $\lambda$ by a factor of two. 2. I used KKT for the 2nd case. The first case has no constraints, hence you can just solve it. 3. There is no closed-form equation between them. I showed the logic and how you can create a graph connecting them. But as I wrote, it will change for each $y$ (it is data dependent).
Royi


A less mathematically rigorous, but possibly more intuitive, approach to understanding what is going on is to start with the constrained version (equation 3.42 in the question) and solve it using the method of Lagrange multipliers (https://en.wikipedia.org/wiki/Lagrange_multiplier or your favorite multivariable calculus text). Just remember that in calculus $x$ is the vector of variables, but in our case $x$ is constant and $\beta$ is the variable vector. Once you apply the Lagrange multiplier technique you end up with the first equation (3.41) (after throwing away the extra $\lambda t$, which is constant relative to the minimization and can be ignored).
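
Spelled out (a short sketch of that step, using the notation of equations (3.41)-(3.42)): the Lagrangian of the constrained problem is the penalized objective plus a term that is constant in $\beta$:

$$ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \Big( \sum_{j=1}^{p} \beta_j^2 - t \Big) \;=\; \underbrace{\sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2}_{\text{objective of (3.41)}} \;-\; \lambda t, $$

and dropping the constant $-\lambda t$ leaves exactly the objective of (3.41).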

This also shows that this works for lasso and other constraints.



It's perhaps worth reading about Lagrangian duality and a broader relation (at times equivalence) between:

  • optimization subject to hard (i.e. inviolable) constraints
  • optimization with penalties for violating constraints.

Quick intro to weak duality and strong duality

Assume we have some function $f(x,y)$ of two variables. For any $\hat{x}$ and $\hat{y}$, we have:

$$ \min_x f(x, \hat{y}) \;\leq\; f(\hat{x}, \hat{y}) \;\leq\; \max_y f(\hat{x}, y) $$

Since that holds for any $\hat{x}$ and $\hat{y}$, it also holds that:

$$ \max_y \min_x f(x, y) \;\leq\; \min_x \max_y f(x, y) $$

This is known as weak duality. In certain circumstances, you also have strong duality (also known as the saddle point property):

$$ \max_y \min_x f(x, y) \;=\; \min_x \max_y f(x, y) $$

When strong duality holds, solving the dual problem also solves the primal problem. They're in a sense the same problem!

Lagrangian for constrained Ridge Regression

Let me define the function $\mathcal{L}$ as:

$$ \mathcal{L}(b, \lambda) = \sum_{i=1}^{n} (y_i - x_i b)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right) $$

The min-max interpretation of the Lagrangian

The Ridge regression problem subject to hard constraints is:

$$ \min_b \max_{\lambda \geq 0} \mathcal{L}(b, \lambda) $$

You pick $b$ to minimize the objective, cognizant that after $b$ is picked, your opponent will set $\lambda$ to infinity if you chose $b$ such that $\sum_{j=1}^{p} b_j^2 > t$.
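
Concretely (a worked version of that sentence, added here), the inner maximization over $\lambda \geq 0$ turns the penalty into a hard wall:

$$ \max_{\lambda \geq 0} \mathcal{L}(b, \lambda) \;=\; \begin{cases} \displaystyle\sum_{i=1}^{n} (y_i - x_i b)^2, & \text{if } \displaystyle\sum_{j=1}^{p} b_j^2 \leq t,\\[1.5ex] +\infty, & \text{if } \displaystyle\sum_{j=1}^{p} b_j^2 > t, \end{cases} $$

so $\min_b \max_{\lambda \geq 0} \mathcal{L}(b, \lambda)$ is exactly the constrained Ridge problem.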

If strong duality holds (which it does here because Slater's condition is satisfied for $t > 0$), you then achieve the same result by reversing the order:

$$ \max_{\lambda \geq 0} \min_b \mathcal{L}(b, \lambda) $$

Here, your opponent chooses $\lambda$ first! You then choose $b$ to minimize the objective, already knowing their choice of $\lambda$. The $\min_b \mathcal{L}(b, \lambda)$ part (taking $\lambda$ as given) is equivalent to the 2nd form of your Ridge Regression problem.
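
In symbols (added for clarity): for a fixed $\lambda$,

$$ \min_b \mathcal{L}(b, \lambda) \;=\; \min_b \left\{ \sum_{i=1}^{n} (y_i - x_i b)^2 + \lambda \sum_{j=1}^{p} b_j^2 \right\} - \lambda t, $$

and since $\lambda t$ does not depend on $b$, the minimizer is the same as in the penalized form of Ridge regression.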

As you can see, this isn't a result particular to Ridge regression. It is a broader concept.

References

(I started this post following an exposition I read from Rockafellar.)

Rockafellar, R.T., Convex Analysis

You might also examine lectures 7 and 8 from Prof. Stephen Boyd's course on convex optimization.


Note that your answer can be extended to any convex function.
81235


They are not equivalent.

For a constrained minimization problem

$$ \min_b \sum_{i=1}^{n} (y_i - x_i b)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} b_j^2 \leq t, \qquad b = (b_1, \ldots, b_p) \tag{1} $$

we solve by minimizing over $b$ the corresponding Lagrangean

$$ \Lambda = \sum_{i=1}^{n} (y_i - x_i b)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right) \tag{2} $$

Here, $t$ is a bound given exogenously, $\lambda \geq 0$ is a Karush-Kuhn-Tucker non-negative multiplier, and both the beta vector and $\lambda$ are to be determined optimally through the minimization procedure given $t$.

Comparing (2) and eq (3.41) in the OP's post, it appears that the Ridge estimator can be obtained as the solution to

$$ \min_b \left\{ \Lambda + \lambda t \right\} \tag{3} $$

Since in (3) the function to be minimized appears to be the Lagrangean of the constrained minimization problem plus a term that does not involve b, it would appear that indeed the two approaches are equivalent...

But this is not correct, because in Ridge regression we minimize over $b$ given $\lambda > 0$. Yet, through the lens of the constrained minimization problem, assuming $\lambda > 0$ imposes the condition that the constraint is binding, i.e. that

$$ \sum_{j=1}^{p} \left( b_{j,\text{ridge}} \right)^2 = t $$

The general constrained minimization problem allows for $\lambda = 0$ also, and essentially it is a formulation that includes as special cases the basic least-squares estimator ($\lambda = 0$) and the Ridge estimator ($\lambda > 0$).
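
To make that dichotomy explicit (the feasibility and complementary-slackness parts of the Karush-Kuhn-Tucker conditions for problem (1), added for reference): at an optimum $(b^*, \lambda^*)$,

$$ \lambda^* \geq 0, \qquad \sum_{j=1}^{p} (b_j^*)^2 \leq t, \qquad \lambda^* \Big( \sum_{j=1}^{p} (b_j^*)^2 - t \Big) = 0, $$

so either $\lambda^* = 0$ and the constraint is slack (the least-squares case), or $\lambda^* > 0$ and the constraint binds, which is the Ridge case discussed above.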

So the two formulations are not equivalent. Nevertheless, Matthew Gunn's post shows in another and very intuitive way how the two are very closely connected. But duality is not equivalence.


@MartijnWeterings Thanks for the comment, I have reworked my answer.
Alecos Papadopoulos

@MartijnWeterings I do not see what is confusing since the expression written in your comment is exactly the expression I wrote in my reworked post.
Alecos Papadopoulos

This was the duplicate question I had in mind where the equivalence is explained very intuitively to me: math.stackexchange.com/a/336618/466748. The argument that you give for the two not being equivalent seems only secondary to me, and a matter of definition (the OP uses $\lambda \geq 0$ instead of $\lambda > 0$, and we could just as well add the constraint $t < \| \beta_{OLS} \|_2^2$ to exclude the cases where $\lambda = 0$).
Sextus Empiricus

@MartijnWeterings When A is a special case of B, A cannot be equivalent to B. And ridge regression is a special case of the general constrained minimization problem, namely a situation at which we arrive if we constrain the general problem further (as you do in your last comment).
Alecos Papadopoulos

Certainly you could define some constrained minimization problem that is more general than ridge regression (just as you can also define some regularization problem that is more general than ridge regression, e.g. negative ridge regression), but then the non-equivalence is due to the way you define the problem and not due to the transformation from the constrained representation to the Lagrangian representation. The two forms can be seen as equivalent within the constrained formulation/definition (non-general) that is useful for ridge regression.
Sextus Empiricus