The estimator $\hat\theta_N$ solves
$$\min_{\theta\in\Theta} N^{-1}\sum_{i=1}^N q(w_i,\theta).$$
If the solution $\hat\theta_N$ is an interior point of $\Theta$, the objective function is twice differentiable, and the gradient of the objective function is zero at $\hat\theta_N$, then the Hessian of the objective function (which is $\hat H$) is positive semi-definite.
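A one-dimensional illustration (my own, not from Wooldridge's book) shows that positive semi-definite is the most one can claim at an interior minimum:
$$q(\theta) = \theta^4, \quad \Theta = [-1,1]: \qquad q'(0) = 0, \quad q''(0) = 0,$$
so at the interior minimiser $\theta = 0$ the Hessian is positive semi-definite but not positive definite.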
Now what Wooldridge is saying is that for a given sample the empirical Hessian is not guaranteed to be positive definite or even positive semi-definite. This is true, since Wooldridge does not require that the objective function $N^{-1}\sum_{i=1}^N q(w_i,\theta)$ have nice properties; he only requires that there exists a unique solution $\theta_0$ of
$$\min_{\theta\in\Theta} E\,q(w,\theta).$$
So for a given sample the objective function $N^{-1}\sum_{i=1}^N q(w_i,\theta)$ may be minimized at a boundary point of $\Theta$, where the Hessian of the objective function need not be positive definite.
Further on in his book Wooldridge gives examples of estimates of the Hessian which are guaranteed to be numerically positive definite. In practice, a non-positive-definite Hessian should indicate that the solution is either at a boundary point or that the algorithm failed to find the solution, which is usually a further indication that the fitted model may be inappropriate for the given data.
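As a practical illustration of that check (only a sketch, not something from Wooldridge; it uses the fitted object opt and the Hessian function hesf defined in the example below), one might inspect both the optimiser's convergence code and the eigenvalues of the empirical Hessian:

## Minimal diagnostic sketch: did optim report convergence, and is the
## empirical Hessian positive definite at the reported solution?
check_fit <- function(opt, hessian_fun, tol = 1e-8) {
  ev <- eigen(hessian_fun(opt$par), symmetric = TRUE)$values
  list(converged = (opt$convergence == 0),  # 0 means optim reports success
       eigenvalues = ev,
       positive_definite = all(ev > tol))
}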
Here is a numerical example. I generate a non-linear least squares problem:
$$y_i = c_1 x_i^{c_2} + \varepsilon_i$$
I take $x$ uniformly distributed on the interval $[1,2]$ and $\varepsilon$ normal with zero mean and variance $\sigma^2$. I generated a sample of size 10 in R 2.11.1 using set.seed(3). Here is the link to the values of $x_i$ and $y_i$.
As the objective function I chose the square of the usual non-linear least squares objective function:
$$q(w,\theta) = (y - c_1 x^{c_2})^4$$
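A quick side calculation (my addition, not part of the original argument) explains the exact zeros in the first test below. Writing $r = y - c_1 x^{c_2}$ for the residual of one observation,
$$\frac{\partial q}{\partial c_1} = -4 r^3 x^{c_2}, \qquad \frac{\partial^2 q}{\partial c_1^2} = 12 r^2 x^{2c_2},$$
and the remaining derivatives carry similar powers of $r$, so both the gradient and the Hessian vanish whenever the fit is exact ($r = 0$).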
Here is the R code for the function being optimised, its gradient and Hessian.
##First set up the expressions for the function being optimised, its gradient and Hessian.
##I use R's symbolic derivation to guard against human error
mt <- expression((y-c1*x^c2)^4)
gradmt <- c(D(mt,"c1"),D(mt,"c2"))
hessmt <- lapply(gradmt,function(l)c(D(l,"c1"),D(l,"c2")))
##Evaluate the expressions on the data to get the empirical values.
##Note: there was a bug in a previous version of the answer; res should not be squared.
optf <- function(p) {
res <- eval(mt,list(y=y,x=x,c1=p[1],c2=p[2]))
mean(res)
}
gf <- function(p) {
evl <- list(y=y,x=x,c1=p[1],c2=p[2])
res <- sapply(gradmt,function(l)eval(l,evl))
apply(res,2,mean)
}
hesf <- function(p) {
evl <- list(y=y,x=x,c1=p[1],c2=p[2])
res1 <- lapply(hessmt,function(l)sapply(l,function(ll)eval(ll,evl)))
res <- sapply(res1,function(l)apply(l,2,mean))
res
}
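Before running the test, one could also cross-check the symbolic derivatives against numerical differentiation. This is an optional sanity check, not part of the original answer; it assumes the numDeriv package is installed and that x and y have already been defined.

## Optional cross-check of the symbolic gradient and Hessian against
## numerical derivatives from the numDeriv package (assumed installed)
p0 <- c(0.5, 0.5)
max(abs(gf(p0) - numDeriv::grad(optf, p0)))        # should be near zero
max(abs(hesf(p0) - numDeriv::hessian(optf, p0)))   # should be near zero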
First, test that the gradient and Hessian work as advertised.
set.seed(3)
x <- runif(10,1,2)
y <- 0.3*x^0.2
> optf(c(0.3,0.2))
[1] 0
> gf(c(0.3,0.2))
[1] 0 0
> hesf(c(0.3,0.2))
[,1] [,2]
[1,] 0 0
[2,] 0 0
> eigen(hesf(c(0.3,0.2)))$values
[1] 0 0
The Hessian is zero, so it is positive semi-definite. Now for the values of x and y given in the link we get
> df <- read.csv("badhessian.csv")
> df
x y
1 1.168042 0.3998378
2 1.807516 0.5939584
3 1.384942 3.6700205
4 1.327734 -3.3390724
5 1.602101 4.1317608
6 1.604394 -1.9045958
7 1.124633 -3.0865249
8 1.294601 -1.8331763
9 1.577610 1.0865977
10 1.630979 0.7869717
> x <- df$x
> y <- df$y
> opt <- optim(c(1,1),optf,gr=gf,method="BFGS")
> opt$par
[1] -114.91316 -32.54386
> gf(opt$par)
[1] -0.0005795979 -0.0002399711
> hesf(opt$par)
[,1] [,2]
[1,] 0.0002514806 -0.003670634
[2,] -0.0036706345 0.050998404
> eigen(hesf(opt$par))$values
[1] 5.126253e-02 -1.264959e-05
The gradient is (numerically) zero, but the Hessian is not positive semi-definite: it has a negative eigenvalue.
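If one wants to rule out that the negative eigenvalue is just rounding noise, a simple check (my addition, not part of the argument above) is to compare it with the largest eigenvalue:

## Relative size of the negative eigenvalue: small, but far larger in
## magnitude than machine precision, so the Hessian really is indefinite
## (and close to singular) at opt$par
ev <- eigen(hesf(opt$par))$values
ev[2] / ev[1]   # about -2.5e-4 for the data above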
Note: This is my third attempt to give an answer. I hope I finally managed to give precise mathematical statements, which eluded me in the previous versions.