LASSO and ridge from the Bayesian perspective: what about the tuning parameter?


17

Penalized regression estimators such as the LASSO and ridge are said to correspond to Bayesian estimators with certain priors. I guess (as I do not know enough about Bayesian statistics) that for a fixed tuning parameter, there exists a concrete corresponding prior.

Now, a frequentist would optimize the tuning parameter by cross validation. Is there a Bayesian equivalent of doing so, and is it used at all? Or does the Bayesian approach effectively fix the tuning parameter before seeing the data? (I guess the latter would be detrimental to predictive performance.)


3
I believe the fully Bayesian approach would start with the given prior and not modify it, yes. But there is also an empirical Bayes approach that optimizes over the hyperparameter values: see, for example, stats.stackexchange.com/questions/24799
amoeba says Reinstate Monica

Another question (could be part of the main one): is there some prior on the regularization parameter that would somehow substitute for the cross-validation process?
kjetil b halvorsen

1
Bayesians can put a prior on the tuning parameter, since it usually corresponds to a variance parameter. This is typically done to avoid CV in order to stay fully Bayesian. Alternatively, you can use REML to optimize the regularization parameter.
guy

2
PS: to those aiming for the bounty, please note my comment: I want to see an explicit answer that exhibits a prior which yields a MAP estimate equivalent to frequentist cross-validation.
statslearner2

1
@statslearner2 I think it does address Richard's question very well. Your bounty seems to be focused on a narrower aspect (about the hyperprior) than Richard's Q.
amoeba says Reinstate Monica

Answers:


18

Penalized regression estimators such as the LASSO and ridge are said to correspond to Bayesian estimators with certain priors.

Yes, that is correct. Whenever we have an optimisation problem involving maximisation of the log-likelihood function plus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the penalty function is taken to be the logarithm of a prior kernel. To see this, suppose we have a penalty function w with tuning parameter λ. The objective function in these cases can be written as:

$$H_x(\theta|\lambda) = \ell_x(\theta) - w(\theta|\lambda) = \ln \Big( L_x(\theta) \exp(-w(\theta|\lambda)) \Big) = \ln \Bigg( \frac{L_x(\theta) \, \pi(\theta|\lambda)}{\int L_x(\theta) \, \pi(\theta|\lambda) \, d\theta} \Bigg) + \text{const} = \ln \pi(\theta|x,\lambda) + \text{const},$$

where we use the prior π(θ|λ) ∝ exp(−w(θ|λ)). Observe here that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you are undertaking classical optimisation with a fixed tuning parameter, this is equivalent to undertaking Bayesian optimisation with a fixed hyperparameter. For LASSO and ridge regression, the penalty functions and corresponding prior equivalents are:

$$\begin{aligned}
\text{LASSO Regression:} & \quad \pi(\theta|\lambda) = \prod_{k=1}^m \text{Laplace}\Big(0, \tfrac{1}{\lambda}\Big) = \prod_{k=1}^m \frac{\lambda}{2} \exp(-\lambda |\theta_k|), \\[6pt]
\text{Ridge Regression:} & \quad \pi(\theta|\lambda) = \prod_{k=1}^m \text{Normal}\Big(0, \tfrac{1}{2\lambda}\Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \exp(-\lambda \theta_k^2).
\end{aligned}$$

The former method penalises the regression coefficients according to their absolute magnitude, which is equivalent to imposing a Laplace prior located at zero. The latter penalises the regression coefficients according to their squared magnitude, which is equivalent to imposing a normal prior located at zero.
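As a minimal numerical sketch of this correspondence (my own illustration, not part of the original answer; it assumes a Gaussian likelihood with unit noise variance and the common penalised-objective convention 0.5‖y − Xβ‖² + 0.5λ‖β‖², so the exact mapping between λ and the prior scale differs from the parametrisation above), the closed-form ridge solution can be checked against the numerically computed posterior mode under independent zero-mean normal priors:

```python
# Sketch: ridge estimate vs. posterior mode under a zero-mean Gaussian prior.
# Assumes y = X beta + noise with unit noise variance and prior beta_k ~ Normal(0, 1/lam).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m, lam = 50, 5, 2.0
X = rng.normal(size=(n, m))
y = X @ rng.normal(size=m) + rng.normal(size=n)

# Closed-form ridge estimate: argmin 0.5*||y - X b||^2 + 0.5*lam*||b||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Numerical MAP estimate: minimise the negative log-posterior (same objective)
neg_log_post = lambda b: 0.5 * np.sum((y - X @ b) ** 2) + 0.5 * lam * np.sum(b ** 2)
beta_map = minimize(neg_log_post, np.zeros(m)).x

print(np.allclose(beta_ridge, beta_map, atol=1e-4))  # True, up to optimiser tolerance
```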

Now, a frequentist would optimise the tuning parameter via cross-validation. Is there a Bayesian equivalent of doing so, and is it used at all?

So long as the frequentist method can be framed as a single optimisation problem (rather than, say, one involving a hypothesis test or something like that), there is a Bayesian analogy using an equivalent prior. Just as the frequentist may treat the tuning parameter λ as unknown and estimate it from the data, the Bayesian may similarly treat the hyperparameter λ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function:

$$H_x(\theta, \lambda) = \ell_x(\theta) - w(\theta|\lambda) - h(\lambda) = \ln \Big( L_x(\theta) \exp(-w(\theta|\lambda)) \exp(-h(\lambda)) \Big) = \ln \Bigg( \frac{L_x(\theta) \, \pi(\theta|\lambda) \, \pi(\lambda)}{\int \int L_x(\theta) \, \pi(\theta|\lambda) \, \pi(\lambda) \, d\theta \, d\lambda} \Bigg) + \text{const} = \ln \pi(\theta, \lambda|x) + \text{const}.$$

This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating it as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest θ.)
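For concreteness, here is a minimal sketch (my own; the exponential hyperprior, the unit noise variance, and the optimisation over log λ are hypothetical choices, not taken from the answer) of maximising this joint objective for the ridge/normal case:

```python
# Sketch: joint MAP over (theta, lambda), maximising
# H_x(theta, lambda) = l_x(theta) - w(theta|lambda) - h(lambda).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = X @ rng.normal(size=4) + rng.normal(size=80)
m = X.shape[1]

def neg_H(params):
    theta, log_lam = params[:m], params[m]
    lam = np.exp(log_lam)  # optimise over log(lambda) to keep lambda positive
                           # (the Jacobian of this reparametrisation is ignored in this sketch)
    log_lik = -0.5 * np.sum((y - X @ theta) ** 2)           # l_x(theta), unit noise variance
    # -log pi(theta | lambda) for the normal/ridge prior, including its normalising
    # constant, which matters once lambda is treated as unknown
    neg_log_prior = lam * np.sum(theta ** 2) - 0.5 * m * np.log(lam)
    neg_log_hyperprior = lam                                # h(lambda): lambda ~ Exponential(1), a hypothetical choice
    return -(log_lik - neg_log_prior - neg_log_hyperprior)

res = minimize(neg_H, np.zeros(m + 1))
theta_map, lam_map = res.x[:m], np.exp(res.x[m])
print(theta_map, lam_map)
```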

(From the comments by statslearner2 below) I am looking for numerically equivalent MAP estimates. For instance, for a fixed-penalty ridge there is a Gaussian prior that will give me a MAP estimate exactly equal to the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me a MAP estimate similar to the CV ridge estimate?

Before proceeding to look at K-fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter θ and the data x. If you are willing to allow improper priors then this scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.

In the above form of the model, involving a penalty function with a tuning parameter, K-fold cross-validation is commonly used to estimate the tuning parameter λ. For this method you partition the data vector x into K sub-vectors x_1, ..., x_K. For each sub-vector k = 1, ..., K you fit the model with the "training" data x_{-k} and then measure the fit of the model with the "testing" data x_k. In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared to the actual testing data to give a measure of "loss":

$$\text{Estimator: } \hat{\theta}(x_{-k}, \lambda), \qquad \text{Predictions: } \hat{x}_k(x_{-k}, \lambda), \qquad \text{Testing loss: } \mathcal{L}_k(\hat{x}_k, x_k | x_{-k}, \lambda).$$

The loss measures for each of the K "folds" can then be aggregated to obtain an overall loss measure for the cross-validation:

$$\mathcal{L}(x, \lambda) = \sum_k \mathcal{L}_k(\hat{x}_k, x_k | x_{-k}, \lambda).$$

The tuning parameter can then be estimated by minimising the overall loss measure:

$$\hat{\lambda} \equiv \hat{\lambda}(x) \equiv \underset{\lambda}{\operatorname{arg\,min}} \ \mathcal{L}(x, \lambda).$$
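As a concrete illustration of this two-step procedure (a minimal sketch of my own, using ridge regression, a squared-error testing loss, and a simple grid search, none of which is prescribed by the answer):

```python
# Sketch: K-fold cross-validation estimate of the ridge tuning parameter,
# i.e. lambda_hat = argmin_lambda L(x, lambda).
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator fitted on a training fold."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_loss(X, y, lam, K=5):
    """Aggregate squared-error testing loss over the K folds."""
    n = len(y)
    loss = 0.0
    for test in np.array_split(np.arange(n), K):
        train = np.setdiff1d(np.arange(n), test)
        beta_hat = ridge_fit(X[train], y[train], lam)   # estimator theta_hat(x_{-k}, lambda)
        preds = X[test] @ beta_hat                      # predictions x_hat_k(x_{-k}, lambda)
        loss += np.sum((y[test] - preds) ** 2)          # testing loss L_k
    return loss

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + rng.normal(size=100)

lambda_grid = np.logspace(-3, 3, 50)
lam_hat = lambda_grid[np.argmin([cv_loss(X, y, lam) for lam in lambda_grid])]
print(lam_hat)
```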

This estimate can then be substituted back into the penalised objective to fit the model parameter θ. Rather than carrying out these as two separate optimisations, we can fold the whole procedure into a single optimisation over θ and λ by forming the combined objective function:

$$H_x(\theta, \lambda) = \ell_x(\theta) - w(\theta|\lambda) - \delta \mathcal{L}(x, \lambda),$$

where δ>0 is a weighting value on the tuning-loss. As δ→∞ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from K-fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking δ=∞ screws up the optimisation problem, but if we take δ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.

From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and K-fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

$$H_x(\theta, \lambda) = \ell_x(\theta) - w(\theta|\lambda) - \delta \mathcal{L}(x, \lambda) = \ln \Bigg( \frac{L_x(\theta, \lambda) \, \pi(\theta, \lambda)}{\int \int L_x(\theta, \lambda) \, \pi(\theta, \lambda) \, d\theta \, d\lambda} \Bigg) + \text{const},$$

where L_x(θ,λ) ∝ exp(ℓ_x(θ) − δ𝓛(x,λ)) and π(θ,λ) ∝ exp(−w(θ|λ)), with a fixed (and very large) hyper-parameter δ.


This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.
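To make the limiting argument concrete, here is a minimal numerical sketch (my own, under the same ridge/Gaussian assumptions as the earlier snippets; the specific δ and the derivative-free optimiser are arbitrary choices) of maximising the combined objective with a large but finite δ:

```python
# Sketch: maximise H_x(theta, lambda) = l_x(theta) - w(theta|lambda) - delta * L(x, lambda)
# with a large but finite delta. Larger delta tightens the approximation to K-fold CV
# but makes the numerical optimisation harder, mirroring the delta -> infinity remark above.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, m, K = 100, 4, 5
X = rng.normal(size=(n, m))
y = X @ rng.normal(size=m) + rng.normal(size=n)

def cv_loss(lam):
    """Aggregate K-fold squared-error testing loss L(x, lambda) for ridge."""
    total = 0.0
    for test in np.array_split(np.arange(n), K):
        train = np.setdiff1d(np.arange(n), test)
        b = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(m),
                            X[train].T @ y[train])
        total += np.sum((y[test] - X[test] @ b) ** 2)
    return total

def neg_H(params, delta=1e3):
    theta, lam = params[:m], np.exp(params[m])      # lambda kept positive via exp
    log_lik = -0.5 * np.sum((y - X @ theta) ** 2)   # l_x(theta), unit noise variance
    penalty = lam * np.sum(theta ** 2)              # w(theta | lambda), ridge
    return -(log_lik - penalty - delta * cv_loss(lam))

res = minimize(neg_H, np.zeros(m + 1), method="Nelder-Mead",
               options={"maxiter": 50000, "maxfev": 50000})
theta_hat, lam_hat = res.x[:m], np.exp(res.x[m])
print(lam_hat)  # roughly the lambda that a direct grid search over cv_loss would pick
```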


2
Ok +1 already, but for the bounty I'm looking for these more precise answers.
statslearner2

4
1. I do not get how (since frequentists generally use classical hypothesis tests, etc., which have no Bayesian equivalent) connects to the rest of what I or you are saying; parameter tuning has nothing to do with hypothesis tests, or does it? 2. Do I understand you correctly that there is no Bayesian equivalent to frequentist regularized estimation when the tuning parameter is selected by cross validation? What about empirical Bayes that amoeba mentions in the comments to the OP?
Richard Hardy

3
3. Since regularization with cross validation seems to be quite effective for, say, prediction, doesn't point 2. suggest that the Bayesian approach is somehow inferior?
Richard Hardy

1
@Ben, thanks for your explicit answer and the subsequent clarifications. You have once again done a wonderful job! Regarding 3., yes, it was quite a jump; it certainly is not a strict logical conclusion. But looking at your points w.r.t. 2. (that a Bayesian method can approximate the frequentist penalized optimization with cross validation), I no longer think that Bayesian must be "inferior". The last quibble on my side is, could you perhaps explain how the last, complicated formula could arise in practice in the Bayesian paradigm? Is it something people would normally use or not?
Richard Hardy

2
@Ben (ctd) My problem is that I know little about Bayes. Once it gets technical, I may easily lose the perspective. So I wonder whether this complicated analogy (the last formula) is something that is just a technical possibility or rather something that people routinely use. In other words, I am interested in whether the idea behind cross validation (here in the context of penalized estimation) is resounding in the Bayesian world, whether its advantages are utilized there. Perhaps this could be a separate question, but a short description will suffice for this particular case.
Richard Hardy

6

Indeed most penalized regression methods correspond to placing a particular type of prior on the regression coefficients. For example, you get the LASSO using a Laplace prior, and the ridge using a normal prior. The tuning parameters are the "hyperparameters" under the Bayesian formulation, on which you can place an additional prior in order to estimate them; for example, in the case of the ridge it is often assumed that the inverse variance of the normal distribution has a χ2 prior. However, as one would expect, the resulting inferences can be sensitive to the choice of the prior distributions for these hyperparameters. For example, for the horseshoe prior there are theoretical results suggesting that the prior on the hyperparameters should reflect the number of non-zero coefficients you expect to have.
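As a small supplementary derivation (mine, not from the answer; a Gamma hyperprior is used here, of which the χ2 prior mentioned above is a special case): placing a normal prior on each coefficient with precision τ and a Gamma(a, b) hyperprior on τ, then integrating τ out, gives a Student-t marginal prior, which makes explicit how a hyperprior produces a heavier-tailed, more diffuse effective prior on the coefficients:

$$\pi(\theta_k) = \int_0^\infty \text{Normal}(\theta_k \,|\, 0, \tau^{-1}) \, \text{Gamma}(\tau \,|\, a, b) \, d\tau = \frac{\Gamma(a + \tfrac{1}{2})}{\Gamma(a)\sqrt{2\pi b}} \left( 1 + \frac{\theta_k^2}{2b} \right)^{-(a + \frac{1}{2})},$$

which is a scaled Student-t density with 2a degrees of freedom.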

A nice overview of the links between penalized regression and Bayesian priors is given, for example, by Mallick and Yi.


Thank you for your answer! The linked paper is quite readable, which is nice.
Richard Hardy

2
This does not answer the question; can you elaborate and explain how the hyper-prior relates to k-fold CV?
statslearner2