I am currently studying least-squares estimation (and other methods) for regression, and in some of the adaptive-algorithm literature I keep running into the phrase "...and since the error surface is convex...", yet any depth as to *why* it is convex to begin with is nowhere to be found.
...so what exactly makes it convex?
I find this repeated omission mildly annoying, because I want to be able to design my own adaptive algorithms with my own cost functions, but if I cannot tell whether or not my cost function yields a convex error surface, I will not be able to get very far in applying something like gradient descent, because there will not be a global minimum. Maybe I want to get creative - maybe I do not want to use least squares as my error criterion, for example.
Upon digging deeper (and this is where my question begins), I found that in order to tell whether or not you have a convex error surface, you must make sure that your Hessian matrix is positive semi-definite. For symmetric matrices this test is simple: just make sure all the eigenvalues of the Hessian are non-negative. (If your matrix is not symmetric, you can make it symmetric by adding it to its own transpose and performing the same eigenvalue test, by virtue of the Gramian, but that is not important here.)
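For concreteness, here is a minimal sketch of that eigenvalue test in NumPy (the helper name `is_psd` and the tolerance are my own choices):

```python
# A minimal sketch of the eigenvalue test described above.
import numpy as np

def is_psd(H, tol=1e-10):
    """Return True if the square matrix H is positive semi-definite.

    If H is not symmetric, its symmetric part (H + H.T) / 2 is used;
    it has the same quadratic form x.T @ H @ x, so the test is unchanged.
    """
    H_sym = (H + H.T) / 2.0
    eigvals = np.linalg.eigvalsh(H_sym)     # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -tol))    # non-negative, up to round-off

# Example: the identity is PSD; flipping one sign makes it indefinite.
print(is_psd(np.eye(2)))                    # True
print(is_psd(np.array([[1.0, 0.0],
                       [0.0, -1.0]])))      # False
```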
What is a Hessian matrix? The Hessian matrix codifies all the possible combinations of the second partial derivatives of your cost function. How many partials are there? As many as the number of features in your feature vector. How do you compute the partials? Take the partial derivatives "by hand" from the original cost function.
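As a tiny illustration of what that matrix looks like (the toy function here is my own, purely for demonstration), SymPy can assemble the second partials automatically, which is also a handy cross-check on hand derivations:

```python
# Illustration only: SymPy builds the matrix of second partials for us.
import sympy as sp

t0, t1 = sp.symbols('theta0 theta1')
f = t0**2 + 3*t0*t1 + 2*t1**2        # a toy function of two variables

H = sp.hessian(f, (t0, t1))          # 2x2 matrix of all second partials
print(H)                             # Matrix([[2, 3], [3, 4]])
```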
So that is exactly what I did: I assume that we have an $m \times n$ data matrix, denoted by the matrix $X$, where $m$ denotes the number of examples, and $n$ denotes the number of features per example (which will also be the number of partials). I suppose we can say that we have $m$ time samples and $n$ spatial samples from sensors, but the physical application is not too important here.
Furthermore, we also have a vector $y$ of size $m \times 1$. (This is your 'label' vector, or your 'answer' corresponding to every row of $X$). For simplicity, I have assumed $m = n = 2$ for this particular example. So 2 'examples' and 2 'features'.
So now suppose that you want to ascertain the 'line' or polynomial of best fit here. That is, you project your input data features against your polynomial coefficient vector $\theta$ such that your cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(\theta_0 x_0(i) + \theta_1 x_1(i) - y(i)\big)^2$$

Now, let us take the first partial derivative w.r.t. $\theta_0$ (feature 0), thus:

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \big(\theta_0 x_0(i) + \theta_1 x_1(i) - y(i)\big)\, x_0(i)$$

Now, let us compute all the second partials, so:

$$\frac{\partial^2 J(\theta)}{\partial \theta_0^2} = \frac{1}{m} \sum_{i=1}^{m} x_0(i)\, x_0(i)$$

$$\frac{\partial^2 J(\theta)}{\partial \theta_0\, \partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} x_0(i)\, x_1(i)$$

$$\frac{\partial^2 J(\theta)}{\partial \theta_1\, \partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} x_1(i)\, x_0(i)$$

$$\frac{\partial^2 J(\theta)}{\partial \theta_1^2} = \frac{1}{m} \sum_{i=1}^{m} x_1(i)\, x_1(i)$$

We know that the Hessian is nothing but:

$$H = \begin{bmatrix} \dfrac{\partial^2 J(\theta)}{\partial \theta_0^2} & \dfrac{\partial^2 J(\theta)}{\partial \theta_0\, \partial \theta_1} \\[2ex] \dfrac{\partial^2 J(\theta)}{\partial \theta_1\, \partial \theta_0} & \dfrac{\partial^2 J(\theta)}{\partial \theta_1^2} \end{bmatrix}$$

Now, based on how I have constructed the data matrix $X$ (my 'features' go by columns, and my examples go by rows), the Hessian appears to be:

$$H = \frac{1}{m} X^{T} X$$
...which is nothing but the sample covariance matrix!
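As a sanity check on the algebra (my own sketch; the concrete numbers in $X$ are arbitrary), the entry-by-entry Hessian from the second partials above can be compared numerically against $\frac{1}{m} X^T X$:

```python
# Sanity check on the derivation: build the Hessian entry-by-entry from the
# second partials above and compare it with X^T X / m. Features go by
# columns and examples by rows, as stated above.
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])            # m = n = 2, as in the example
m = X.shape[0]

H = np.empty((2, 2))
for j in range(2):
    for k in range(2):
        # (1/m) * sum_i x_j(i) * x_k(i), straight from the formulas above
        H[j, k] = np.sum(X[:, j] * X[:, k]) / m

print(np.allclose(H, X.T @ X / m))    # True: both constructions agree
```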
So I am not quite sure how to interpret this - or I should say, I am not quite sure how far I should generalize here. But I think I can say that:
Always true:
- The Hessian matrix always controls whether or not your error/cost surface is convex.
- If your Hessian matrix is pos-semi-def, you are convex (and can happily use algorithms like gradient descent to converge to the optimal solution).
True for LSE only:
- The Hessian matrix for the LSE cost criterion is nothing but the original covariance matrix. (!).
- To me this means that, if I use the LSE criterion, the data itself determines whether or not I have a convex surface? ... Which would then mean that the eigenvectors of my covariance matrix somehow have the capability to 'shape' the cost surface? Is this always true? Or did it just work out for the LSE criterion? It just doesn't sit right with me that the convexity of an error surface should be dependent on the data. (See the quick experiment below.)
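Here is a quick experiment along those lines (my own sketch; the matrices are made up): compare the eigenvalues of $\frac{1}{m} X^T X$ for a generic $X$ and for one with collinear columns.

```python
# A quick experiment on the "does the data shape the surface?" question:
# eigenvalues of X^T X / m for a generic X versus a rank-deficient X.
import numpy as np

rng = np.random.default_rng(0)

X_generic = rng.standard_normal((100, 2))       # two independent features
v = rng.standard_normal(100)
X_collinear = np.column_stack([v, 2.0 * v])     # feature 1 = 2 * feature 0

for name, X in [("generic", X_generic), ("collinear", X_collinear)]:
    m = X.shape[0]
    eig = np.linalg.eigvalsh(X.T @ X / m)       # Hessian of the LSE cost
    print(name, eig)   # non-negative in both cases; ~0 appears when collinear
```

At least in these runs, the eigenvalues never go negative; collinear data just introduces a zero eigenvalue, i.e. a flat direction in the surface.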
So putting it back in the context of the original question: how does one determine whether or not an error surface (based on some cost function you select) is convex or not? Is this determination based on the data, or on the Hessian?
Thanks
TLDR: How, exactly, and practically do I go about determining whether my cost-function and/or data-set yield a convex or non-convex error surface?
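The best practical recipe I have come up with so far is a numerical spot check (a sketch of my own, with hypothetical example costs; finite differences stand in for hand-derived second partials): estimate the Hessian at a batch of probe points and test its eigenvalues. A single negative eigenvalue anywhere refutes convexity; passing every probe is merely consistent with convexity, not a proof.

```python
# Practical (though not airtight) recipe: numerically estimate the Hessian
# of an arbitrary scalar cost at many probe points and test its eigenvalues.
import numpy as np

def numerical_hessian(cost, theta, eps=1e-4):
    """Central-difference estimate of the Hessian of `cost` at `theta`."""
    n = theta.size
    H = np.empty((n, n))
    for j in range(n):
        for k in range(n):
            ej = np.zeros(n); ej[j] = eps
            ek = np.zeros(n); ek[k] = eps
            H[j, k] = (cost(theta + ej + ek) - cost(theta + ej - ek)
                       - cost(theta - ej + ek) + cost(theta - ej - ek)
                       ) / (4 * eps**2)
    return (H + H.T) / 2          # symmetrize away round-off asymmetry

def looks_convex(cost, n, n_probes=50, tol=1e-6, seed=0):
    """Spot check: a negative eigenvalue at any probe refutes convexity."""
    rng = np.random.default_rng(seed)
    for theta in rng.standard_normal((n_probes, n)):
        if np.linalg.eigvalsh(numerical_hessian(cost, theta)).min() < -tol:
            return False
    return True                   # consistent with convexity (not a proof)

# Hypothetical example costs: the LSE cost from above, and a clearly
# non-convex one.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = rng.standard_normal(50)
lse = lambda t: np.sum((X @ t - y) ** 2) / (2 * len(y))
bumpy = lambda t: np.sin(3.0 * t[0]) + np.cos(3.0 * t[1])

print(looks_convex(lse, 2))       # True  (PSD Hessian at every probe)
print(looks_convex(bumpy, 2))     # False (negative curvature found)
```

Is this spot-check approach even the right way to think about it, or is there a cleaner analytical route?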