Matrix notation for logistic regression



In linear regression (with squared loss), matrix notation gives us a very concise way to write the objective:

$$\underset{x}{\text{minimize}}\ \|Ax - b\|^2$$

where $A$ is the data matrix, $x$ is the coefficient vector, and $b$ is the response.

Is there a similar matrix notation for the logistic regression objective? All the notations I have seen cannot get rid of the sum over all data points (something like $\sum_{\text{data}} L_{\text{logistic}}(y, \beta^T x)$).


EDIT: Thanks to joceratops and AdamO for their great answers. Their answers made me realize that another reason linear regression has a more concise notation is that the definition of the norm encapsulates the sum of squares, i.e. $e^T e$. There is no such definition for the logistic loss, which makes the notation a bit more complicated.

Answers:



In linear regression, the maximum likelihood estimation (MLE) solution for $x$ has the following closed-form solution (assuming $A$ is a matrix with full column rank):

$$\hat{x}_{\text{lin}} = \underset{x}{\operatorname{argmin}}\ \|Ax - b\|_2^2 = (A^T A)^{-1} A^T b$$

This is read as "find the $x$ that minimizes the objective function $\|Ax - b\|_2^2$". The nice thing about representing the linear regression objective this way is that we can keep everything in matrix notation and solve for $\hat{x}_{\text{lin}}$ by hand. As Alex R. mentions, in practice we often do not compute $(A^T A)^{-1}$ directly, because it is computationally inefficient and $A$ often does not meet the full-rank criterion. Instead, we turn to the Moore-Penrose pseudoinverse $A^{+}$, giving $\hat{x}_{\text{lin}} = A^{+} b$. Computing the pseudoinverse in practice can involve the Cholesky decomposition or the singular value decomposition.
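As an aside (not part of the original answer), here is a minimal NumPy sketch of three routes to the same least-squares solution; the synthetic `A`, `b`, and the variable names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                        # data matrix (N x p)
b = A @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 1) Normal equations: solve (A^T A) x = A^T b directly
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# 2) Moore-Penrose pseudoinverse (SVD-based); also works without full column rank
x_pinv = np.linalg.pinv(A) @ b

# 3) np.linalg.lstsq, the usual numerically stable least-squares routine
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_pinv, x_lstsq)                     # all three agree here
```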

In contrast, the MLE solution for the coefficients in logistic regression is:

$$\hat{x}_{\log} = \underset{x}{\operatorname{argmin}}\ \sum_{i=1}^{N} y^{(i)}\log\bigl(1 + e^{-x^T a^{(i)}}\bigr) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 + e^{x^T a^{(i)}}\bigr)$$

where (assuming each data sample is stored row-wise):

$x$ is a vector representing the regression coefficients

$a^{(i)}$ is a vector representing the $i$th sample/row of the data matrix $A$

$y^{(i)}$ is a scalar in $\{0, 1\}$, the label corresponding to the $i$th sample

$N$ is the number of data samples (i.e. the number of rows of the data matrix $A$).

Again, this is read as "find the $x$ that minimizes the objective function".

If you wanted, you could take this one step further and write $\hat{x}_{\log}$ in matrix notation as follows:

$$\hat{x}_{\log} = \underset{x}{\operatorname{argmin}}\ \begin{bmatrix} y^{(1)} & \cdots & y^{(N)} & \bigl(1 - y^{(1)}\bigr) & \cdots & \bigl(1 - y^{(N)}\bigr) \end{bmatrix} \begin{bmatrix} \log\bigl(1 + e^{-x^T a^{(1)}}\bigr) \\ \vdots \\ \log\bigl(1 + e^{-x^T a^{(N)}}\bigr) \\ \log\bigl(1 + e^{x^T a^{(1)}}\bigr) \\ \vdots \\ \log\bigl(1 + e^{x^T a^{(N)}}\bigr) \end{bmatrix}$$

but you don't gain anything from doing this. Logistic regression does not have a closed-form solution and does not gain the same benefits from matrix notation that linear regression does. To solve for $\hat{x}_{\log}$, estimation techniques such as gradient descent and the Newton-Raphson method are used. Through some of these techniques (e.g. Newton-Raphson), $\hat{x}_{\log}$ is approximated iteratively, and the updates are represented in matrix notation (see the link provided by Alex R.).
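For concreteness, here is a minimal NumPy sketch (my own addition, not part of the original answer) of one such technique: a plain Newton-Raphson/IRLS iteration in the notation above, where `A` holds the samples row-wise, `y` holds the 0/1 labels, and `logistic_newton` is a made-up helper name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(A, y, n_iter=25, tol=1e-10):
    """Iterate x <- x + (A^T W A)^{-1} A^T (y - p), with W = diag(p(1 - p))."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ x)                    # fitted probabilities
        W = p * (1.0 - p)                     # diagonal of the weight matrix
        grad = A.T @ (y - p)                  # score (gradient of the log-likelihood)
        hess = A.T @ (A * W[:, None])         # A^T W A (negative Hessian)
        step = np.linalg.solve(hess, grad)
        x = x + step
        if np.max(np.abs(step)) < tol:
            break
    return x
```

In practice one would add safeguards (a line search, or regularization against separation), but even this bare iteration shows why matrix notation reappears inside the solver despite the objective having no closed form.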


Great, thanks. I think the reason we do not have something like solving $A^T A x = A^T b$ is the same reason we do not take that extra step to get a full matrix notation and avoid the sum symbol.
Haitao Du

We do gain some advantage from taking that one step further: turning it into a matrix multiplication makes the code simpler, and on many platforms such as MATLAB, a for loop summing over all the data is much slower than matrix operations.
Haitao Du
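To illustrate the point of this comment (a NumPy sketch in Python, though the comment mentions MATLAB; function names are made up), here is the logistic objective written once as an explicit loop over data points and once as a single vectorized expression:

```python
import numpy as np

def logistic_objective_loop(A, y, x):
    total = 0.0
    for i in range(A.shape[0]):                # one term per data point
        z = A[i] @ x
        total += y[i] * np.log1p(np.exp(-z)) + (1 - y[i]) * np.log1p(np.exp(z))
    return total

def logistic_objective_vectorized(A, y, x):
    z = A @ x                                  # all linear predictors at once
    return np.sum(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))
```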

@hxd1011: Just a small comment: reducing to matrix equations is not always wise. In the case of $A^T A x = A^T b$, you shouldn't actually try computing the matrix inverse $(A^T A)^{-1}$, but rather do something like a Cholesky decomposition, which will be much faster and more numerically stable. For logistic regression, there are a bunch of different iteration schemes which do indeed use matrix computations. For a great review see here: research.microsoft.com/en-us/um/people/minka/papers/logreg/…
Alex R.
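A minimal SciPy sketch of what this comment suggests (illustrative only; the helper name is made up): solve the normal equations with a Cholesky factorization rather than forming $(A^T A)^{-1}$:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_normal_equations_cholesky(A, b):
    G = A.T @ A                      # Gram matrix (requires full column rank)
    c, low = cho_factor(G)           # factor G = L L^T
    return cho_solve((c, low), A.T @ b)
```

As the follow-up comment notes, forming $A^T A$ still squares the condition number of $A$; a QR factorization of $A$ itself avoids that.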

@AlexR. Thank you very much. I learned that using the normal equations squares the condition number of the matrix, and that QR or Cholesky would be much better. Your link is great; a review of the numerical methods like that is always what I wanted.
Haitao Du


@joceratops' answer focuses on the optimization problem of maximum likelihood estimation. This is indeed a flexible approach that is amenable to many types of problems. For estimating most models, including linear and logistic regression models, there is another general approach based on method-of-moments estimation.

The linear regression estimator can also be formulated as the root of the estimating equation:

$$0 = X^T(Y - X\beta)$$

In this regard, $\beta$ is seen as the value which retrieves an average residual of 0. It needn't rely on any underlying probability model to have this interpretation. It is, however, interesting to go about deriving the score equations for a normal likelihood: you will see that they take exactly the form displayed above. Maximizing the likelihood of a regular exponential family for a linear model (e.g. linear or logistic regression) is equivalent to obtaining solutions to its score equations:

$$0 = \sum_{i=1}^n S_i(\alpha, \beta) = \frac{\partial}{\partial \beta} \log \mathcal{L}(\beta, \alpha, X, Y) = X^T\bigl(Y - g(X\beta)\bigr)$$

where $Y_i$ has expected value $g(X_i \beta)$. In GLM estimation, $g$ is said to be the inverse of a link function. In the normal likelihood equations, $g^{-1}$ is the identity function, and in logistic regression $g^{-1}$ is the logit function. A more general approach would be to require $0 = \sum_{i=1}^n Y_i - g(X_i \beta)$, which allows for model misspecification.
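For illustration (my own sketch, not part of the answer), the "root of an estimating equation" view translates directly into code: hand $X^T\bigl(Y - g(X\beta)\bigr)$ to a generic root-finder, here SciPy's `root`, with `fit_by_estimating_equation` a made-up name:

```python
import numpy as np
from scipy.optimize import root

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))            # inverse logit, the g of this answer

def fit_by_estimating_equation(X, Y):
    score = lambda beta: X.T @ (Y - expit(X @ beta))   # the estimating equation
    sol = root(score, x0=np.zeros(X.shape[1]))         # generic multivariate root-finder
    return sol.x
```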

Additionally, it is interesting to note that for regular exponential families, $\frac{\partial g(X\beta)}{\partial \beta} = V\bigl(g(X\beta)\bigr)$, which is called a mean-variance relationship. Indeed, for logistic regression the mean-variance relationship is such that the mean $p = g(X\beta)$ is related to the variance by $\operatorname{var}(Y_i) = p_i(1 - p_i)$. This suggests an interpretation of a model-misspecified GLM as being one which gives a 0 average Pearson residual. This further suggests a generalization to allow non-proportional functional mean derivatives and mean-variance relationships.

A generalized estimating equation approach would specify linear models in the following way:

$$0 = \frac{\partial g(X\beta)}{\partial \beta}\, V^{-1}\bigl(Y - g(X\beta)\bigr)$$

with $V$ a matrix of variances based on the fitted value (mean) given by $g(X\beta)$. This approach to estimation allows one to pick a link function and mean-variance relationship, as with GLMs.

In logistic regression $g$ would be the inverse logit, and $V_{ii}$ would be given by $g(X_i\beta)\bigl(1 - g(X_i\beta)\bigr)$. The solutions to this estimating equation, obtained by Newton-Raphson, will yield the $\beta$ obtained from logistic regression. However, a somewhat broader class of models is estimable under a similar framework. For instance, the link function can be taken to be the log of the linear predictor, so that the regression coefficients are relative risks and not odds ratios. Given the well-documented pitfalls of interpreting ORs as RRs, it behooves me to ask why anyone fits logistic regression models at all anymore.
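To make the connection to the earlier score equation explicit (this derivation is my own addition, using the notation above): with the logit link the fitted mean is $\mu = g(X\beta)$ and $\partial \mu_i / \partial \beta_j = \mu_i(1 - \mu_i) X_{ij}$, i.e. $\frac{\partial g(X\beta)}{\partial \beta} = X^T V$ with $V = \operatorname{diag}\bigl(\mu_i(1-\mu_i)\bigr)$, so the estimating equation collapses to

$$0 = X^T V V^{-1}\bigl(Y - g(X\beta)\bigr) = X^T\bigl(Y - g(X\beta)\bigr),$$

which is exactly the score equation given earlier; this is why Newton-Raphson on this estimating equation reproduces the logistic regression $\beta$.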


+1, great answer. Formulating it as root-finding on the derivative is really new to me, and the second equation is really concise.
Haitao Du