I have read the most popular books in statistical learning:
1- The Elements of Statistical Learning.
2- An Introduction to Statistical Learning.
Both mention that ridge regression has two equivalent formulations. Is there an understandable mathematical proof of this?
I also went through Cross Validated, but could not find a definite proof there.
Furthermore, does LASSO enjoy the same type of proof?
Answers:
The classic Ridge Regression (Tikhonov Regularization) is given by:
$$ \arg \min_{x} \frac{1}{2} \left\| x - y \right\|_{2}^{2} + \lambda \left\| x \right\|_{2}^{2} $$
The claim above is that the following problem is equivalent:
$$ \arg \min_{x} \frac{1}{2} \left\| x - y \right\|_{2}^{2} \quad \text{subject to} \quad \left\| x \right\|_{2}^{2} \leq t $$
Let's define $\hat{x}$ as the optimal solution of the first problem and $\tilde{x}$ as the optimal solution of the second problem.
The claim of equivalence means that $\forall t, \; \exists \lambda \geq 0$ such that $\hat{x} = \tilde{x}$.
Namely, you can always find a pair of $t$ and $\lambda \geq 0$ for which the solutions of the two problems are the same.
How could we find a pair?
Well, by solving the problems and looking at the properties of the solution.
Both problems are convex and smooth, so it should make things simpler.
The solution of the first problem is found where the gradient vanishes, which means:
$$ \hat{x} - y + 2 \lambda \hat{x} = 0 $$
so $\hat{x} = \left( I + 2 \lambda I \right)^{-1} y = \frac{1}{1 + 2 \lambda} y$.
The KKT conditions of the second problem state:
$$ \tilde{x} - y + 2 \mu \tilde{x} = 0 $$
and
$$ \mu \left( \left\| \tilde{x} \right\|_{2}^{2} - t \right) = 0 $$
where $\mu \geq 0$ is the multiplier of the constraint.
The last equation says that either $\mu = 0$ or $\left\| \tilde{x} \right\|_{2}^{2} = t$ (complementary slackness).
Pay attention that the two stationarity equations have exactly the same form.
Namely, if $\hat{x} = \tilde{x}$ and $\lambda = \mu$, both equations hold.
So in the case $\left\| y \right\|_{2}^{2} \leq t$ the constraint is inactive (if $\mu > 0$ it would have to be active, i.e. $\left\| \tilde{x} \right\|_{2}^{2} = t$, yet the stationarity equation gives $\left\| \tilde{x} \right\|_{2}^{2} < \left\| y \right\|_{2}^{2} \leq t$), so one must take $\mu = 0$, and for the two problems to match one must likewise set $\lambda = 0$.
In the other case one should find the $\mu$ for which
$$ y^{T} \left( I + 2 \mu I \right)^{-1} \left( I + 2 \mu I \right)^{-1} y = t $$
This is simply the condition $\left\| \tilde{x} \right\|_{2}^{2} = t$ with $\tilde{x} = \left( I + 2 \mu I \right)^{-1} y$ substituted from the stationarity equation.
Once you find that $\mu$, setting $\lambda = \mu$ makes the solutions of the two problems coincide.
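As a quick numerical check of this recipe (my own sketch, not part of the original answer, assuming NumPy and the denoising objective written above): for a budget $t$ that makes the constraint bind, the constrained solution is simply the projection of $y$ onto the ball $\left\| x \right\|_{2}^{2} \leq t$, and bisecting for the $\mu$ with $\left\| \hat{x}(\mu) \right\|_{2}^{2} = t$ recovers exactly that point.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)
t = 0.5 * np.sum(y**2)                      # budget below ||y||^2, so the constraint binds

def x_pen(lam):
    # Minimizer of 0.5*||x - y||_2^2 + lam*||x||_2^2, i.e. x = y / (1 + 2*lam)
    return y / (1.0 + 2.0 * lam)

# Constrained solution: Euclidean projection of y onto {x : ||x||_2^2 <= t}
x_con = y * np.sqrt(t) / np.linalg.norm(y)

# Bisect for the mu with ||x_pen(mu)||_2^2 = t (the binding branch of the KKT conditions)
lo, hi = 0.0, 1e6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.sum(x_pen(mid)**2) > t else (lo, mid)
mu = 0.5 * (lo + hi)

print("mu =", mu)
print("max |x_pen(mu) - x_con| =", np.max(np.abs(x_pen(mu) - x_con)))   # ~ 0
```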
Regarding the $L_1$ (LASSO) case, the same idea applies.
The only difference is that we don't have a closed-form solution in general, hence deriving the connection is trickier.
Have a look at my answer at StackExchange Cross Validated Q291962 and at StackExchange Signal Processing Q21730 on the significance of $\lambda$.
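A rough sketch of the $L_1$ analogue (my own addition, assuming NumPy; it uses the simple denoising objective above, where, unlike the general regression case, both solutions happen to be explicit): the penalized solution is soft-thresholding, the constrained solution is the projection onto the $L_1$ ball, and bisection on $\lambda$ again links the two.

```python
import numpy as np

def soft(y, lam):
    # Minimizer of 0.5*||x - y||_2^2 + lam*||x||_1: elementwise soft-thresholding
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def project_l1(v, t):
    # Euclidean projection onto {x : ||x||_1 <= t} (standard sort-based algorithm)
    if np.sum(np.abs(v)) <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - t)[0][-1]
    theta = (css[rho] - t) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(1)
y = rng.normal(size=8)
t = 0.5 * np.sum(np.abs(y))                 # budget small enough that the constraint binds

# Bisect on lambda until ||soft(y, lambda)||_1 = t
lo, hi = 0.0, np.max(np.abs(y))
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.sum(np.abs(soft(y, mid))) > t else (lo, mid)
lam = 0.5 * (lo + hi)

print("lam =", lam)
print("max |soft(y, lam) - projection| =", np.max(np.abs(soft(y, lam) - project_l1(y, t))))
```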
Remark
What's actually happening?
In both problems, $x$ tries to be as close as possible to $y$.
In the first case, $x = y$ would make the first term (the $L_2$ distance) vanish; in the second case it would make the objective function vanish.
The difference is that in the first case one must also balance the $L_2$ norm of $x$: as $\lambda$ gets higher, the balance forces $x$ to be smaller.
In the second case there is a wall: you bring $x$ as close as possible to $y$ while the wall, the constraint on its norm, holds you back.
If the wall is far enough away (a high value of $t$ relative to the norm of $y$), it has no effect, just like the case $\lambda = 0$.
The exact connection between the two is given by the Lagrangian stated above.
I found this paper today (03/04/2019):
A less mathematically rigorous, but possibly more intuitive, approach to understanding what is going on is to start with the constraint version (equation 3.42 in the question) and solve it using the methods of "Lagrange Multiplier" (https://en.wikipedia.org/wiki/Lagrange_multiplier or your favorite multivariable calculus text). Just remember that the $x$ of the calculus textbook plays the role of the coefficient vector here; the regression data are held fixed.
This also shows that the approach works for the lasso and other constraints.
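As a toy illustration of that route (my own sketch, using SymPy; the numbers 3 and 4 and the unit budget are arbitrary), here are the Lagrange-multiplier equations solved symbolically for a two-parameter problem:

```python
# Minimize (b1-3)^2 + (b2-4)^2 subject to b1^2 + b2^2 = 1 via Lagrange multipliers.
import sympy as sp

b1, b2, lam = sp.symbols('b1 b2 lam', real=True)
objective = (b1 - 3)**2 + (b2 - 4)**2
constraint = b1**2 + b2**2 - 1

L = objective + lam * constraint                     # Lagrangean of the constrained problem
stationarity = [sp.diff(L, v) for v in (b1, b2, lam)]
solutions = sp.solve(stationarity, (b1, b2, lam), dict=True)
print(solutions)
# One root is b1 = 3/5, b2 = 4/5, lam = 4.  Plugging lam = 4 back into the *penalized*
# objective (b1-3)^2 + (b2-4)^2 + 4*(b1^2 + b2^2) and minimizing gives the same (3/5, 4/5),
# which is the constrained/penalized correspondence discussed in this thread.
```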
It's perhaps worth reading about Lagrangian duality and the broader relation (at times equivalence) between optimization subject to hard (inviolable) constraints and optimization with penalties for violating those constraints.
Assume we have some function $f(x, y)$ of two variables. For any $\hat{x}$ and $\hat{y}$, we have:
$$ \min_{x} f(x, \hat{y}) \;\leq\; f(\hat{x}, \hat{y}) \;\leq\; \max_{y} f(\hat{x}, y) $$
Since that holds for any $\hat{x}$ and $\hat{y}$, it also holds that:
$$ \max_{y} \min_{x} f(x, y) \;\leq\; \min_{x} \max_{y} f(x, y) $$
This is known as weak duality. In certain circumstances, you also have strong duality (also known as the saddle point property):
$$ \max_{y} \min_{x} f(x, y) \;=\; \min_{x} \max_{y} f(x, y) $$
When strong duality holds, solving the dual problem also solves the primal problem. They're in a sense the same problem!
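A tiny numeric illustration of weak versus strong duality (my own addition, assuming NumPy): on a finite grid, $\max_y \min_x f \leq \min_x \max_y f$ always holds, and for a function with no saddle point the inequality is strict.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)
ys = np.linspace(-2.0, 2.0, 201)
X, Y = np.meshgrid(xs, ys, indexing='ij')   # rows index x, columns index y

f = np.sin(X + Y)                           # a function with no saddle point on this square

max_min = np.max(np.min(f, axis=0))         # max over y of (min over x)
min_max = np.min(np.max(f, axis=1))         # min over x of (max over y)
print(max_min, "<=", min_max)               # weak duality: always true; here the gap is strict
# For a convex-concave function such as the Ridge Lagrangian below, the two values coincide
# (strong duality / saddle point property).
```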
Let me define the Lagrangian function $\mathcal{L}$ as:
$$ \mathcal{L}(b, \lambda) = \sum_{i=1}^{n} \left( y_i - x_i \cdot b \right)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right) $$
The Ridge regression problem subject to hard constraints is:
$$ \min_{b} \max_{\lambda \geq 0} \mathcal{L}(b, \lambda) $$
You pick $b$ to minimize the objective, cognizant that after your pick, your opponent will send $\lambda$ to infinity if you chose a $b$ with $\sum_{j=1}^{p} b_j^2 > t$.
If strong duality holds (which it does here because Slater's condition is satisfied for $t > 0$), you achieve the same result by reversing the order of the min and the max:
$$ \max_{\lambda \geq 0} \min_{b} \mathcal{L}(b, \lambda) $$
Here, your opponent chooses $\lambda$ first, and you then choose $b$ knowing their choice. The inner problem $\min_{b} \mathcal{L}(b, \lambda)$, with $\lambda$ taken as given, is exactly the penalized form of Ridge regression (up to the constant $-\lambda t$, which does not affect the minimizer).
As you can see, this isn't a result particular to Ridge regression. It is a broader concept.
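A short numeric sketch of this max-min reading (my own addition, assuming NumPy; the data are simulated): the opponent's optimal $\lambda$ is the one whose inner penalized minimizer exhausts the budget, and that inner minimizer is the usual closed-form Ridge estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def b_pen(lam):
    # Inner problem min_b L(b, lam): closed-form Ridge coefficients (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = b_pen(0.0)
t = 0.25 * np.sum(b_ols**2)                 # budget smaller than ||b_OLS||^2, so it binds

# Outer problem: raise lambda until the inner minimizer exactly meets the budget.
lo, hi = 0.0, 1e8
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.sum(b_pen(mid)**2) > t else (lo, mid)
lam_star = 0.5 * (lo + hi)

print("lambda* =", lam_star)
print("||b(lambda*)||^2 =", np.sum(b_pen(lam_star)**2), "vs t =", t)
```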
(I started this post following an exposition I read from Rockafellar.)
Rockafellar, R.T., Convex Analysis
You might also examine lectures 7 and 8 from Prof. Stephen Boyd's course on convex optimization.
They are not equivalent.
For a constrained minimization problem
$$ \min_{b} \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} b_j^2 \leq t, \qquad b = (b_1, \ldots, b_p) $$
we solve by minimizing over $b$ the corresponding Lagrangean
$$ \Lambda = \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right) \tag{2} $$
Here, $t$ is a bound given exogenously, and $\lambda \geq 0$ is a Karush-Kuhn-Tucker multiplier; both the multiplier and whether the constraint binds are determined endogenously by the optimization procedure, given the value of $t$.
Comparing (2) and eq (3.41) in the OP's post, it appears that the Ridge estimator can be obtained as the solution to
$$ \min_{b} \left\{ \Lambda + \lambda t \right\} \tag{3} $$
Since in (3) the function to be minimized appears to be the Lagrangean of the constrained minimization problem plus a term that does not involve $b$, it would appear that indeed the two approaches are equivalent...
But this is not correct, because in Ridge regression we minimize over $b$ given $\lambda > 0$. Viewed through the lens of the constrained minimization problem, assuming $\lambda > 0$ imposes the condition that the constraint is binding, i.e. that
$$ \sum_{j=1}^{p} \left( b^{*}_{j,\text{ridge}} \right)^2 = t $$
The general constrained minimization problem allows for $\lambda = 0$ also, and essentially it is a formulation that includes as special cases the basic least-squares estimator ($\lambda^* = 0$) and the Ridge estimator ($\lambda^* > 0$).
So the two formulations are not equivalent. Nevertheless, Matthew Gunn's post shows in another and very intuitive way how the two are very closely connected. But duality is not equivalence.
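A small numeric illustration of this last point (my own sketch, assuming NumPy and SciPy; the data are simulated): when the budget $t$ exceeds $\left\| b_{OLS} \right\|_2^2$ the constrained problem simply returns least squares (multiplier zero), while the Ridge estimator with any fixed $\lambda > 0$ still shrinks the coefficients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
t = 2.0 * np.sum(b_ols**2)                          # slack budget: constraint not binding

# Constrained problem: minimize the residual sum of squares subject to sum(b^2) <= t
res = minimize(lambda b: np.sum((y - X @ b)**2), np.zeros(p),
               constraints=[{'type': 'ineq', 'fun': lambda b: t - np.sum(b**2)}])
b_con = res.x                                        # equals OLS here (lambda* = 0)

b_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(p), X.T @ y)   # Ridge with lambda = 1 > 0

print("max |b_con - b_ols| =", np.max(np.abs(b_con - b_ols)))   # small, up to solver tolerance
print("||b_ridge|| < ||b_ols|| :", np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))
```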