What makes an error surface convex? Is it determined by the covariance matrix or by the Hessian?



I am currently learning about least-squares estimation (and other methods) for regression, and in some of the adaptive-algorithm literature I keep running into the phrase "...and since the error surface is convex...", and any depth as to why it is convex to begin with is nowhere to be found.

...so what exactly makes it convex?

This repeated omission is a bit annoying, because I want to be able to design my own adaptive algorithms with my own cost functions, but if I cannot tell whether my cost function yields a convex error surface, I will not get very far applying something like gradient descent, since there will be no global minimum. Maybe I want to get creative - maybe I do not want to use least squares as my error criterion, for example.

After digging deeper (this is where my questions begin), I found that in order to tell whether you have a convex error surface, you must make sure your Hessian matrix is positive semi-definite. For a symmetric matrix this test is simple - just make sure all the eigenvalues of the Hessian are non-negative. (If your matrix is not symmetric, you can symmetrize it by adding it to its own transpose and then perform the same eigenvalue test, courtesy of the Gramian, but that is not important here.)
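For concreteness, here is a minimal numpy sketch of that eigenvalue test (the matrix `H` and the tolerance `tol` are made up for illustration):

```python
import numpy as np

def is_convex_hessian(H, tol=1e-10):
    """Check positive semi-definiteness of a Hessian via its eigenvalues."""
    # Symmetrize first: (H + H^T)/2 is harmless if H is already symmetric.
    H_sym = 0.5 * (H + H.T)
    eigvals = np.linalg.eigvalsh(H_sym)   # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -tol))  # all non-negative => PSD => convex

# Hypothetical 2x2 Hessian, just to exercise the function
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(is_convex_hessian(H))  # True
```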

What is a Hessian matrix? The Hessian matrix codifies all possible combinations of the partials of your cost function. How many partials are there? As many as the number of features in your feature vector. How do you compute the partials? Take the partial derivatives "by hand" from the original cost function.

So that is exactly what I did: I assume that we have an m x n data matrix, denoted by the matrix X, where m denotes the number of examples, and n denotes the number of features per example (which will also be the number of partials). I suppose we can say that we have m time samples and n spatial samples from sensors, but the physical application is not too important here.

Furthermore, we also have a vector y of size m x 1. (This is your 'label' vector, or your 'answer' corresponding to every row of X). For simplicity, I have assumed m=n=2 for this particular example. So 2 'examples' and 2 'features'.

So now suppose that you want to ascertain the 'line' or polynomial of best fit here. That is, you project your input data features against your polynomial coefficient vector θ such that your cost function is:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big[\theta_0 x_0[i] + \theta_1 x_1[i] - y[i]\big]^2$$

Now, let us take the first partial derivative w.r.t. θ0 (feature 0). Thus:

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big[\theta_0 x_0[i] + \theta_1 x_1[i] - y[i]\big]\,x_0[i]$$

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big[\theta_0 x_0^2[i] + \theta_1 x_1[i]\,x_0[i] - y[i]\,x_0[i]\big]$$
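As a sanity check on this expression, here is a small numpy sketch comparing the analytic gradient above against central finite differences (the data X, y and the point theta are made up for illustration):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum (X theta - y)^2, as defined above."""
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def grad_analytic(theta, X, y):
    """dJ/dtheta_j = 1/m * sum (X theta - y) * x_j, the expression derived above."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

def grad_numeric(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (cost(theta + e, X, y) - cost(theta - e, X, y)) / (2 * eps)
    return g

# Hypothetical m = n = 2 example
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.5, -0.3])
print(np.allclose(grad_analytic(theta, X, y), grad_numeric(theta, X, y)))  # True
```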

Now, let us compute all the second partials, so:

$$\frac{\partial^2 J(\theta)}{\partial \theta_0^2} = \frac{1}{m}\sum_{i=1}^{m} x_0^2[i]$$

$$\frac{\partial^2 J(\theta)}{\partial \theta_0\,\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m} x_0[i]\,x_1[i]$$

$$\frac{\partial^2 J(\theta)}{\partial \theta_1\,\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m} x_1[i]\,x_0[i]$$

$$\frac{\partial^2 J(\theta)}{\partial \theta_1^2} = \frac{1}{m}\sum_{i=1}^{m} x_1^2[i]$$

We know that the Hessian is nothing but:

$$H(J(\theta)) = \begin{bmatrix} \dfrac{\partial^2 J(\theta)}{\partial \theta_0^2} & \dfrac{\partial^2 J(\theta)}{\partial \theta_0\,\partial \theta_1} \\[2ex] \dfrac{\partial^2 J(\theta)}{\partial \theta_1\,\partial \theta_0} & \dfrac{\partial^2 J(\theta)}{\partial \theta_1^2} \end{bmatrix}$$

$$H(J(\theta)) = \begin{bmatrix} \frac{1}{m}\sum_{i=1}^{m} x_0^2[i] & \frac{1}{m}\sum_{i=1}^{m} x_0[i]\,x_1[i] \\[1ex] \frac{1}{m}\sum_{i=1}^{m} x_1[i]\,x_0[i] & \frac{1}{m}\sum_{i=1}^{m} x_1^2[i] \end{bmatrix}$$

Now, based on how I have constructed the data matrix X (my 'features' go by columns, and my examples go by rows), the Hessian appears to be:

$$H(J(\theta)) = \frac{1}{m}X^TX = \Sigma$$

...which is nothing but the sample covariance matrix!
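For what it's worth, here is a quick numerical check of that identity, using made-up numbers for X (only the shapes matter):

```python
import numpy as np

X = np.array([[1.0, 2.0],   # rows = examples, columns = features x_0, x_1
              [3.0, 4.0]])
m = X.shape[0]

# Hessian written element-wise from the second partials derived above
H_partials = np.array([
    [np.sum(X[:, 0] * X[:, 0]), np.sum(X[:, 0] * X[:, 1])],
    [np.sum(X[:, 1] * X[:, 0]), np.sum(X[:, 1] * X[:, 1])],
]) / m

# Hessian written compactly as (1/m) X^T X
H_matrix = X.T @ X / m

print(np.allclose(H_partials, H_matrix))  # True
```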

So I am not quite sure how to interpret this - or I should say, I am not quite sure how far I can generalize here. But I think I can say that:

  • Always true:

    • The Hessian matrix always controls whether or not your error/cost surface is convex.
    • If your Hessian matrix is positive semi-definite, you are convex (and can happily use algorithms like gradient descent to converge to the optimal solution).
  • True for LSE only:

    • The Hessian matrix for the LSE cost criterion is nothing but the original covariance matrix. (!).
    • To me this means that, if I use the LSE criterion, the data itself determines whether or not I have a convex surface? ...Which would then mean that the eigenvectors of my covariance matrix somehow have the capability to 'shape' the cost surface? Is this always true? Or did it just work out for the LSE criterion? It just doesn't sit right with me that the convexity of an error surface should be dependent on the data.

So, putting it back in the context of the original question, how does one determine whether or not an error surface (based on some cost function you select) is convex or not? Is this determination based on the data, or on the Hessian?

Thanks

TLDR: How, exactly, and practically do I go about determining whether my cost-function and/or data-set yield a convex or non-convex error surface?

Answers:



You can think of linear least squares in one dimension. The cost function is something like $a^2$. The first derivative (Jacobian) is then $2a$, hence linear in $a$. The second derivative (Hessian) is $2$, a constant.

Since the second derivative is positive, you are dealing with a convex cost function. In multivariate calculus this is equivalent to a positive definite Hessian matrix.
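A tiny symbolic check of that one-dimensional example (just a sympy sketch for illustration):

```python
import sympy as sp

a = sp.symbols('a')
cost = a**2                   # one-dimensional least-squares-style cost
print(sp.diff(cost, a))       # 2*a : first derivative (Jacobian), linear in a
print(sp.diff(cost, a, 2))    # 2   : second derivative (Hessian), a positive constant
```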

You are dealing with only two variables (θ1 and θ2), so the Hessian is particularly simple.

In practice, however, many variables are usually involved, so building and checking the Hessian is impractical.

A more efficient way is to work directly on the Jacobian matrix $J$ of the least-squares problem:

$$Jx = b$$

$J$ can be rank-deficient, singular, or near-singular. In such cases the quadratic surface of the cost function is almost flat and/or wildly stretched in some direction. You may also find that your matrix is solvable in theory, but the solution is numerically unstable. Preconditioning can be used to cope with such cases.

Some cheap algorithms simply attempt a Cholesky decomposition of $J$. If the algorithm fails, it means $J$ is singular (or ill-conditioned).
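As a rough numpy sketch of that check: np.linalg.cholesky needs a square symmetric positive-definite input, so I am assuming the factorization is attempted on the normal-equations matrix $J^TJ$; the example matrices are made up:

```python
import numpy as np

def has_cholesky(J):
    """Attempt a Cholesky factorization of the normal-equations matrix J^T J.

    np.linalg.cholesky requires a symmetric positive-definite input, so a
    failure indicates that J^T J is not positive definite, i.e. J is
    rank-deficient or badly conditioned.
    """
    try:
        np.linalg.cholesky(J.T @ J)
        return True
    except np.linalg.LinAlgError:
        return False

J_good = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
J_bad  = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])  # rank-deficient: second column is all zeros
print(has_cholesky(J_good), has_cholesky(J_bad))  # True False
```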

Numerically more stable, but more expensive, is a QR decomposition, which only exists if $J$ is regular (non-singular).

Finally, the state-of-the-art approach is the Singular Value Decomposition (SVD), which is the most expensive, can be performed on any matrix, reveals the numerical rank of $J$, and allows you to treat rank-deficient cases separately.
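And a corresponding SVD-based sketch (the tolerance and the example J are assumptions; the idea is just to look at the spread of the singular values):

```python
import numpy as np

def inspect_jacobian(J, tol=1e-12):
    """Inspect the conditioning of J via its singular values."""
    s = np.linalg.svd(J, compute_uv=False)    # singular values, descending order
    numerical_rank = int(np.sum(s > tol * s[0]))
    condition = s[0] / s[-1] if s[-1] > 0 else np.inf
    return s, numerical_rank, condition

# Nearly collinear columns -> nearly rank-deficient Jacobian
J = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-9],
              [2.0, 2.0]])
s, rank, cond = inspect_jacobian(J)
print(s)           # the last singular value is tiny compared to the first
print(rank, cond)  # numerical rank and condition number; a huge condition number signals trouble
```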

I wrote an article about linear and nonlinear least-squares solutions that covers these topics in detail:

Linear and Nonlinear Least-Squares with Math.NET

There are also references to books covering more advanced topics related to least squares (covariance of parameters/data points, preconditioning, scaling, orthogonal distance regression - total least squares, determining the precision and accuracy of the least-squares estimator, etc.).

I have created an example project to accompany the article, which is open source:

LeastSquaresDemo - binary

LeastSquaresDemo - source (C#)


Thanks Libor: 1) Tangential, but Cholesky is like a matrix square root, it seems, yes? 2) Not sure I understand your point about how the Hessian tells you about convexity at each point on the error surface - are you saying in general? Because from the LSE derivation above, the Hessian does not depend on the θ parameters at all, just on the data. Perhaps you mean in general? 3) Finally, in total, how does one then determine if an error surface is convex - just stick to making sure the Hessian is SPD? But you mentioned that it might depend on θ... so how can one know for sure? Thanks!
Spacey, 2012

2) Yes, I mean in general. In linear least squares, the whole error surface has a constant Hessian. The second derivative of a quadratic is a constant, and the same applies for the Hessian. 3) It depends on the conditioning of your data matrix. If the Hessian is SPD, there is a single closed-form solution and the error surface is convex in all directions. Otherwise, the data matrix is ill-conditioned or singular. I have never used the Hessian to probe that; rather, I inspect the singular values of the data matrix or check whether it has a Cholesky decomposition. Both ways will tell you whether there is a solution.
Libor

Libor - 1) If you can, please add how you would use the SVD of the X data matrix, or how you would use the Cholesky decomposition to check whether you have a closed-form solution; they seem very useful and good to know, and I would like to learn how to use them. 2) One last thing, just to make sure I understand the Hessian: so the Hessian is in general a function of θ and/or X. If it is SPD, we have a convex surface. (If the Hessian has θ in it, however, we have to evaluate it everywhere.) Thanks again.
Spacey, 2012

Mohammad: 1) I rewrote the answer and added a link to my article on least squares (it may still contain some errors, I have not officially published it yet), including the working example project. I hope it helps you get deeper insight into the problem... 2) In linear least squares the Hessian is constant and depends only on the data points. In general it depends on the model parameters as well, but that is only the case for nonlinear least squares.
Libor