I have read the most popular books in statistical learning:
1- The Elements of Statistical Learning.
2- An Introduction to Statistical Learning.
Both mention that ridge regression has two equivalent formulations. Is there an understandable mathematical proof of this?
I also went through Cross Validated, but could not find a definite proof there.
Furthermore, does LASSO enjoy the same type of proof?
Answers:
The classic Ridge Regression (Tikhonov Regularization) is given by:
$$ \arg \min_{x} \frac{1}{2} \left\| x - y \right\|_{2}^{2} + \lambda \left\| x \right\|_{2}^{2} $$
The claim above is that the following problem is equivalent:
$$ \arg \min_{x} \frac{1}{2} \left\| x - y \right\|_{2}^{2} \quad \text{subject to} \quad \left\| x \right\|_{2}^{2} \leq t $$
Let's define $\hat{x}$ as the optimal solution of the first problem and $\tilde{x}$ as the optimal solution of the second problem.
The claim of equivalence means that $\forall t, \; \exists \lambda \geq 0$ such that $\hat{x} = \tilde{x}$.
Namely, you can always find a pair of $t$ and $\lambda \geq 0$ for which the solutions of the two problems are the same.
How could we find a pair?
Well, by solving the problems and looking at the properties of the solution.
Both problems are convex and smooth, so it should make things simpler.
The solution of the first problem is found where the gradient vanishes, which means:
$$ \hat{x} - y + 2 \lambda \hat{x} = 0 $$
so $\hat{x} = \left( I + 2 \lambda I \right)^{-1} y = \frac{1}{1 + 2 \lambda} y$.
The KKT conditions of the second problem state:
$$ \tilde{x} - y + 2 \mu \tilde{x} = 0 $$
and
$$ \mu \left( \left\| \tilde{x} \right\|_{2}^{2} - t \right) = 0 $$
where $\mu \geq 0$ is the multiplier of the constraint.
The last equation says that either $\mu = 0$ or $\left\| \tilde{x} \right\|_{2}^{2} = t$ (complementary slackness).
Pay attention that the two stationarity equations have exactly the same form.
Namely, if $\hat{x} = \tilde{x}$ and $\lambda = \mu$, both equations hold.
So in the case $\left\| y \right\|_{2}^{2} \leq t$ the constraint is inactive (if $\mu > 0$ it would have to be active, i.e. $\left\| \tilde{x} \right\|_{2}^{2} = t$, yet the stationarity equation gives $\left\| \tilde{x} \right\|_{2}^{2} < \left\| y \right\|_{2}^{2} \leq t$), so one must take $\mu = 0$, and for the two problems to match one must likewise set $\lambda = 0$.
In the other case one should find the $\mu$ for which
$$ y^{T} \left( I + 2 \mu I \right)^{-1} \left( I + 2 \mu I \right)^{-1} y = t $$
This is simply the condition $\left\| \tilde{x} \right\|_{2}^{2} = t$ with $\tilde{x} = \left( I + 2 \mu I \right)^{-1} y$ substituted from the stationarity equation.
Once you find that $\mu$, setting $\lambda = \mu$ makes the solutions of the two problems coincide.
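As a quick numerical check of this recipe (my own sketch, not part of the original answer, assuming NumPy and the denoising objective written above): for a budget $t$ that makes the constraint bind, the constrained solution is simply the projection of $y$ onto the ball $\left\| x \right\|_{2}^{2} \leq t$, and bisecting for the $\mu$ with $\left\| \hat{x}(\mu) \right\|_{2}^{2} = t$ recovers exactly that point.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)
t = 0.5 * np.sum(y**2)                      # budget below ||y||^2, so the constraint binds

def x_pen(lam):
    # Minimizer of 0.5*||x - y||_2^2 + lam*||x||_2^2, i.e. x = y / (1 + 2*lam)
    return y / (1.0 + 2.0 * lam)

# Constrained solution: Euclidean projection of y onto {x : ||x||_2^2 <= t}
x_con = y * np.sqrt(t) / np.linalg.norm(y)

# Bisect for the mu with ||x_pen(mu)||_2^2 = t (the binding branch of the KKT conditions)
lo, hi = 0.0, 1e6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.sum(x_pen(mid)**2) > t else (lo, mid)
mu = 0.5 * (lo + hi)

print("mu =", mu)
print("max |x_pen(mu) - x_con| =", np.max(np.abs(x_pen(mu) - x_con)))   # ~ 0
```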
Regarding the $L_1$ (LASSO) case, the same idea applies.
The only difference is that we don't have a closed-form solution in general, hence deriving the connection is trickier.
Have a look at my answer at StackExchange Cross Validated Q291962 and at StackExchange Signal Processing Q21730 on the significance of $\lambda$.
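A rough sketch of the $L_1$ analogue (my own addition, assuming NumPy; it uses the simple denoising objective above, where, unlike the general regression case, both solutions happen to be explicit): the penalized solution is soft-thresholding, the constrained solution is the projection onto the $L_1$ ball, and bisection on $\lambda$ again links the two.

```python
import numpy as np

def soft(y, lam):
    # Minimizer of 0.5*||x - y||_2^2 + lam*||x||_1: elementwise soft-thresholding
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def project_l1(v, t):
    # Euclidean projection onto {x : ||x||_1 <= t} (standard sort-based algorithm)
    if np.sum(np.abs(v)) <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - t)[0][-1]
    theta = (css[rho] - t) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(1)
y = rng.normal(size=8)
t = 0.5 * np.sum(np.abs(y))                 # budget small enough that the constraint binds

# Bisect on lambda until ||soft(y, lambda)||_1 = t
lo, hi = 0.0, np.max(np.abs(y))
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.sum(np.abs(soft(y, mid))) > t else (lo, mid)
lam = 0.5 * (lo + hi)

print("lam =", lam)
print("max |soft(y, lam) - projection| =", np.max(np.abs(soft(y, lam) - project_l1(y, t))))
```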
Remark
What's actually happening?
In both problems, $x$ tries to be as close as possible to $y$.
In the first case, $x = y$ would make the first term (the $L_2$ distance) vanish; in the second case it would make the objective function vanish.
The difference is that in the first case one must also balance the $L_2$ norm of $x$: as $\lambda$ gets higher, the balance forces $x$ to be smaller.
In the second case there is a wall: you bring $x$ as close as possible to $y$ while the wall, the constraint on its norm, holds you back.
If the wall is far enough away (a high value of $t$ relative to the norm of $y$), it has no effect, just like the case $\lambda = 0$.
The exact connection between the two is given by the Lagrangian stated above.
I found this paper today (03/04/2019):
A less mathematically rigorous, but possibly more intuitive, approach to understanding what is going on is to start with the constraint version (equation 3.42 in the question) and solve it using the methods of "Lagrange Multiplier" (https://en.wikipedia.org/wiki/Lagrange_multiplier or your favorite multivariable calculus text). Just remember that the $x$ of the calculus textbook plays the role of the coefficient vector here; the regression data are held fixed.
This also shows that the approach works for the lasso and other constraints.
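As a toy illustration of that route (my own sketch, using SymPy; the numbers 3 and 4 and the unit budget are arbitrary), here are the Lagrange-multiplier equations solved symbolically for a two-parameter problem:

```python
# Minimize (b1-3)^2 + (b2-4)^2 subject to b1^2 + b2^2 = 1 via Lagrange multipliers.
import sympy as sp

b1, b2, lam = sp.symbols('b1 b2 lam', real=True)
objective = (b1 - 3)**2 + (b2 - 4)**2
constraint = b1**2 + b2**2 - 1

L = objective + lam * constraint                     # Lagrangean of the constrained problem
stationarity = [sp.diff(L, v) for v in (b1, b2, lam)]
solutions = sp.solve(stationarity, (b1, b2, lam), dict=True)
print(solutions)
# One root is b1 = 3/5, b2 = 4/5, lam = 4.  Plugging lam = 4 back into the *penalized*
# objective (b1-3)^2 + (b2-4)^2 + 4*(b1^2 + b2^2) and minimizing gives the same (3/5, 4/5),
# which is the constrained/penalized correspondence discussed in this thread.
```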
It's perhaps worth reading about Lagrangian duality and the broader relation (at times equivalence) between optimization subject to hard (inviolable) constraints and optimization with penalties for violating those constraints.
Assume we have some function $f(x, y)$ of two variables. For any $\hat{x}$ and $\hat{y}$, we have:
$$ \min_{x} f(x, \hat{y}) \;\leq\; f(\hat{x}, \hat{y}) \;\leq\; \max_{y} f(\hat{x}, y) $$
Since that holds for any $\hat{x}$ and $\hat{y}$, it also holds that:
$$ \max_{y} \min_{x} f(x, y) \;\leq\; \min_{x} \max_{y} f(x, y) $$
This is known as weak duality. In certain circumstances, you also have strong duality (also known as the saddle point property):
$$ \max_{y} \min_{x} f(x, y) \;=\; \min_{x} \max_{y} f(x, y) $$
When strong duality holds, solving the dual problem also solves the primal problem. They're in a sense the same problem!
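A tiny numeric illustration of weak versus strong duality (my own addition, assuming NumPy): on a finite grid, $\max_y \min_x f \leq \min_x \max_y f$ always holds, and for a function with no saddle point the inequality is strict.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)
ys = np.linspace(-2.0, 2.0, 201)
X, Y = np.meshgrid(xs, ys, indexing='ij')   # rows index x, columns index y

f = np.sin(X + Y)                           # a function with no saddle point on this square

max_min = np.max(np.min(f, axis=0))         # max over y of (min over x)
min_max = np.min(np.max(f, axis=1))         # min over x of (max over y)
print(max_min, "<=", min_max)               # weak duality: always true; here the gap is strict
# For a convex-concave function such as the Ridge Lagrangian below, the two values coincide
# (strong duality / saddle point property).
```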
Let me define the Lagrangian function $\mathcal{L}$ as:
$$ \mathcal{L}(b, \lambda) = \sum_{i=1}^{n} \left( y_i - x_i \cdot b \right)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right) $$
The Ridge regression problem subject to hard constraints is:
$$ \min_{b} \max_{\lambda \geq 0} \mathcal{L}(b, \lambda) $$
You pick $b$ to minimize the objective, cognizant that after your pick, your opponent will send $\lambda$ to infinity if you chose a $b$ with $\sum_{j=1}^{p} b_j^2 > t$.
If strong duality holds (which it does here because Slater's condition is satisfied for $t > 0$), you achieve the same result by reversing the order of the min and the max:
$$ \max_{\lambda \geq 0} \min_{b} \mathcal{L}(b, \lambda) $$
Here, your opponent chooses $\lambda$ first, and you then choose $b$ knowing their choice. The inner problem $\min_{b} \mathcal{L}(b, \lambda)$, with $\lambda$ taken as given, is exactly the penalized form of Ridge regression (up to the constant $-\lambda t$, which does not affect the minimizer).
As you can see, this isn't a result particular to Ridge regression. It is a broader concept.
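A short numeric sketch of this max-min reading (my own addition, assuming NumPy; the data are simulated): the opponent's optimal $\lambda$ is the one whose inner penalized minimizer exhausts the budget, and that inner minimizer is the usual closed-form Ridge estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def b_pen(lam):
    # Inner problem min_b L(b, lam): closed-form Ridge coefficients (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = b_pen(0.0)
t = 0.25 * np.sum(b_ols**2)                 # budget smaller than ||b_OLS||^2, so it binds

# Outer problem: raise lambda until the inner minimizer exactly meets the budget.
lo, hi = 0.0, 1e8
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.sum(b_pen(mid)**2) > t else (lo, mid)
lam_star = 0.5 * (lo + hi)

print("lambda* =", lam_star)
print("||b(lambda*)||^2 =", np.sum(b_pen(lam_star)**2), "vs t =", t)
```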
(I started this post following an exposition I read from Rockafellar.)
Rockafellar, R.T., Convex Analysis
You might also examine lectures 7 and 8 from Prof. Stephen Boyd's course on convex optimization.
They are not equivalent.
For a constrained minimization problem
$$ \min_{b} \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} b_j^2 \leq t, \qquad b = (b_1, \ldots, b_p) $$
we solve by minimizing over $b$ the corresponding Lagrangean
$$ \Lambda = \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right) \tag{2} $$
Here, $t$ is a bound given exogenously, and $\lambda \geq 0$ is a Karush-Kuhn-Tucker multiplier; both the multiplier and whether the constraint binds are determined endogenously by the optimization procedure, given the value of $t$.
Comparing (2) and eq (3.41) in the OP's post, it appears that the Ridge estimator can be obtained as the solution to
$$ \min_{b} \left\{ \Lambda + \lambda t \right\} \tag{3} $$
Since in (3) the function to be minimized appears to be the Lagrangean of the constrained minimization problem plus a term that does not involve $b$, it would appear that indeed the two approaches are equivalent...
But this is not correct, because in Ridge regression we minimize over $b$ given $\lambda > 0$. Viewed through the lens of the constrained minimization problem, assuming $\lambda > 0$ imposes the condition that the constraint is binding, i.e. that
$$ \sum_{j=1}^{p} \left( b^{*}_{j,\text{ridge}} \right)^2 = t $$
The general constrained minimization problem allows for $\lambda = 0$ also, and essentially it is a formulation that includes as special cases the basic least-squares estimator ($\lambda^* = 0$) and the Ridge estimator ($\lambda^* > 0$).
So the two formulations are not equivalent. Nevertheless, Matthew Gunn's post shows in another and very intuitive way how the two are very closely connected. But duality is not equivalence.
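A small numeric illustration of this last point (my own sketch, assuming NumPy and SciPy; the data are simulated): when the budget $t$ exceeds $\left\| b_{OLS} \right\|_2^2$ the constrained problem simply returns least squares (multiplier zero), while the Ridge estimator with any fixed $\lambda > 0$ still shrinks the coefficients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
t = 2.0 * np.sum(b_ols**2)                          # slack budget: constraint not binding

# Constrained problem: minimize the residual sum of squares subject to sum(b^2) <= t
res = minimize(lambda b: np.sum((y - X @ b)**2), np.zeros(p),
               constraints=[{'type': 'ineq', 'fun': lambda b: t - np.sum(b**2)}])
b_con = res.x                                        # equals OLS here (lambda* = 0)

b_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(p), X.T @ y)   # Ridge with lambda = 1 > 0

print("max |b_con - b_ols| =", np.max(np.abs(b_con - b_ols)))   # small, up to solver tolerance
print("||b_ridge|| < ||b_ols|| :", np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))
```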