Question about how to normalize regression coefficients

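I am not sure whether "normalize" is the right word to use here, but I will try my best to illustrate what I am asking. The estimator used here is least squares.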

Suppose you have $y=\beta_0+\beta_1x_1$ . You can center it around the mean by writing $y=\beta_0'+\beta_1x_1'$ , where $\beta_0'=\beta_0+\beta_1\bar x_1$ and $x_1'=x_1-\bar x_1$ , so that $\beta_0'$ no longer has any influence on estimating $\beta_1$ .

By this I mean that $\hat\beta_1$ in $y=\beta_1x_1'$ is equivalent to $\hat\beta_1$ in $y=\beta_0+\beta_1x_1$ . We have reduced the equation to make the least squares calculation easier.

How do you apply this method in general? Now I have the model $y=\beta_1e^{x_1t}+\beta_2e^{x_2t}$ , and I am trying to reduce it to $y=\beta_1x'$ .

— Saber CN

What kind of data are you analyzing, and why do you want to remove a covariate, $e^{x_1t}$ , from your model? Also, is there a reason you are removing the intercept? If you mean-center the data the slope will be the same in the model with/without intercept, but the model with the intercept will fit your data better.

— caburke

@caburke I am not concerned about the fit of the model, because after I calculated $\beta_1$ and $\beta_2$ I can put them back into the model. The point for this exercise is to estimate $\beta_1$ . By reducing the original equation to only $y=\beta_1x'$ , the least square calculation will be easier ( $x'$ is part of what I am trying to find out, it may include $e^{x_1t}$ ). I am trying to learn the mechanisms, this is a question from a book by Tukey.

— Saber CN

@ca The observation at the end of your comment is puzzling. It cannot possibly apply to the nonlinear expressions--they don't contain anything that can reasonably be considered a "slope"--but it's not correct in the OLS setting: the fit for the mean-centered data is precisely as good as the fit with an intercept. Saber, your model is ambiguous: which of $\beta_1, \beta_2, x_1, x_2, t$ are variables and which are parameters? What is the intended error structure? (And which of Tukey's books is the question from?)

— whuber

@whuber This is from Tukey's book "Data analysis and regression: a second course in statistics" chapter 14A. $\beta_1,\beta_2$ are the parameters we are trying to estimate, $x_1,x_2$ are the variables each with n observations, $t$ I assume is the time variable associated with the observations, however it did not specify. The error should be normal and can be ignored for this question.

— Saber CN

@whuber I was mostly referring to the first part of the post, but this was not clear in my comment. What I meant was that if you only mean-center $x$ , and not $y$ , as it seemed was being suggested in the OP, and then remove the intercept then the fit will be worse, since it's not necessarily the case that $\bar{y}=0$ . Slope is obviously not a good term for the coefficient in the model mentioned in the last line of the OP.

— caburke
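The centering claims in the question and in the comments above can be checked numerically. The following is only a minimal sketch on made-up data (the variable names and the simulated values are assumptions, not part of the original posts). It compares the ordinary fit with an intercept, the no-intercept fit with only $x$ centered, and the no-intercept fit with both $x$ and $y$ centered.

```r
# Hypothetical simulated data; all names here are illustrative only.
set.seed(1)
n <- 30
x <- rnorm(n, mean = 5)
y <- 2 + 3 * x + rnorm(n)

xc <- x - mean(x)            # mean-centered x
yc <- y - mean(y)            # mean-centered y

fit.int  <- lm(y ~ x)        # intercept and slope
fit.xc   <- lm(y ~ xc - 1)   # only x centered, no intercept
fit.both <- lm(yc ~ xc - 1)  # x and y centered, no intercept

# The three slope estimates agree up to floating-point error.
slopes <- c(coef(fit.int)[["x"]], coef(fit.xc)[["xc"]], coef(fit.both)[["xc"]])
print(slopes)

# Residual sums of squares: the first and third agree, while the second is
# larger by n * mean(y)^2 because y itself was not centered.
rss <- function(fit) sum(resid(fit)^2)
print(c(with.intercept = rss(fit.int),
        x.centered     = rss(fit.xc),
        both.centered  = rss(fit.both)))
```

So the slope is unaffected in all three cases, but the fit is only "precisely as good" when $y$ is centered as well.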

Although I cannot do justice to the question here--that would require a small monograph--it may be helpful to recapitulate some key ideas.

The question

Let's begin by restating the question and using unambiguous terminology. The data consist of a list of ordered pairs $(t_i, y_i)$ . Known constants $\alpha_1$ and $\alpha_2$ determine values $x_{1,i} = \exp(\alpha_1 t_i)$ and $x_{2,i} = \exp(\alpha_2 t_i)$ . We posit a model in which

$y_i = \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i$

for constants $\beta_1$ and $\beta_2$ to be estimated, where the $\varepsilon_i$ are random and--to a good approximation anyway--independent with a common variance (whose estimation is also of interest).

Background: linear "matching"

Mosteller and Tukey refer to the variables $x_1$ = $(x_{1,1}, x_{1,2}, \ldots)$ and $x_2$ as "matchers." They will be used to "match" the values of $y = (y_1, y_2, \ldots)$ in a specific way, which I will illustrate. More generally, let $y$ and $x$ be any two vectors in the same Euclidean vector space, with $y$ playing the role of "target" and $x$ that of "matcher". We contemplate systematically varying a coefficient $\lambda$ in order to approximate $y$ by the multiple $\lambda x$ . The best approximation is obtained when $\lambda x$ is as close to $y$ as possible. Equivalently, the squared length of $y - \lambda x$ is minimized.

One way to visualize this matching process is to make a scatterplot of $x$ and $y$ on which is drawn the graph of $x \to \lambda x$ . The vertical distances between the scatterplot points and this graph are the components of the residual vector $y - \lambda x$ ; the sum of their squares is to be made as small as possible. Up to a constant of proportionality, these squares are the areas of circles centered at the points $(x_i, y_i)$ with radii equal to the residuals: we wish to minimize the sum of areas of all these circles.

Here is an example showing the optimal value of $\lambda$ in the middle panel:

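[Figure: panels showing scatterplots of $y$ against $x$ with the line $x \to \lambda x$ ; the middle panel shows the optimal $\lambda$ .]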

The points in the scatterplot are blue; the graph of $x \to \lambda x$ is a red line. This illustration emphasizes that the red line is constrained to pass through the origin $(0,0)$ : it is a very special case of line fitting.
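For a single matcher the minimizing coefficient has a closed form, $\lambda = (x \cdot y)/(x \cdot x)$ , which follows from setting the derivative of the squared length of $y - \lambda x$ to zero. The sketch below, on made-up data (the names and simulated values are assumptions), checks that formula against a no-intercept `lm` fit and against nearby values of $\lambda$ .

```r
# Hypothetical data for one target y and one matcher x.
set.seed(2)
x <- runif(20, 1, 3)
y <- 1.5 * x + rnorm(20, sd = 0.3)

# Closed-form least-squares coefficient for a line through the origin.
lambda.hat <- sum(x * y) / sum(x * x)

# The same coefficient from lm() with the intercept removed.
lambda.lm <- coef(lm(y ~ x - 1))[["x"]]

# Squared length of the residual vector y - lambda * x.
sq.len <- function(lambda) sum((y - lambda * x)^2)

print(c(closed.form = lambda.hat, lm = lambda.lm))
print(c(below = sq.len(lambda.hat - 0.1),
        best  = sq.len(lambda.hat),
        above = sq.len(lambda.hat + 0.1)))

# At the optimum the residuals are orthogonal to the matcher.
orth <- sum((y - lambda.hat * x) * x)
print(orth)
```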

Multiple regression can be obtained by sequential matching

Returning to the setting of the question, we have one target $y$ and two matchers $x_1$ and $x_2$ . We seek numbers $b_1$ and $b_2$ for which $y$ is approximated as closely as possible by $b_1 x_1 + b_2 x_2$ , again in the least-distance sense. Arbitrarily beginning with $x_1$ , Mosteller & Tukey match the remaining variables $x_2$ and $y$ to $x_1$ . Write the residuals for these matches as $x_{2\cdot 1}$ and $y_{\cdot 1}$ , respectively: the $_{\cdot 1}$ indicates that $x_1$ has been "taken out of" the variable.

We can write

$y = \lambda_1 x_1 + y_{\cdot 1}\text{ and }x_2 = \lambda_2 x_1 + x_{2\cdot 1}.$

Having taken $x_1$ out of $x_2$ and $y$ , we proceed to match the target residuals $y_{\cdot 1}$ to the matcher residuals $x_{2\cdot 1}$ . The final residuals are $y_{\cdot 12}$ . Algebraically, we have written

$\eqalign{ y_{\cdot 1} &= \lambda_3 x_{2\cdot 1} + y_{\cdot 12}; \text{ whence} \\ y &= \lambda_1 x_1 + y_{\cdot 1} = \lambda_1 x_1 + \lambda_3 x_{2\cdot 1} + y_{\cdot 12} =\lambda_1 x_1 + \lambda_3 \left(x_2 - \lambda_2 x_1\right) + y_{\cdot 12} \\ &=\left(\lambda_1 - \lambda_3 \lambda_2\right)x_1 + \lambda_3 x_2 + y_{\cdot 12}. }$

This shows that the $\lambda_3$ in the last step is the coefficient of $x_2$ in a matching of $x_1$ and $x_2$ to $y$ .

We could just as well have proceeded by first taking $x_2$ out of $x_1$ and $y$ , producing $x_{1\cdot 2}$ and $y_{\cdot 2}$ , and then taking $x_{1\cdot 2}$ out of $y_{\cdot 2}$ , yielding a different set of residuals $y_{\cdot 21}$ . This time, the coefficient of $x_1$ found in the last step--let's call it $\mu_3$ --is the coefficient of $x_1$ in a matching of $x_1$ and $x_2$ to $y$ .

Finally, for comparison, we might run a multiple (ordinary least squares) regression of $y$ against $x_1$ and $x_2$ . Let those residuals be $y_{\cdot lm}$ . It turns out that the coefficients in this multiple regression are precisely the coefficients $\mu_3$ and $\lambda_3$ found previously and that all three sets of residuals, $y_{\cdot 12}$ , $y_{\cdot 21}$ , and $y_{\cdot lm}$ , are identical.
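The coefficient identities just stated can be checked directly. This is a small sketch rather than the code used for the figures: it simulates data in the same way as the Code section, and `match1` is a hypothetical helper introduced only for this illustration. It recovers $\lambda_1, \lambda_2, \lambda_3$ and $\mu_3$ from one-variable matches and compares them with `lm`.

```r
# Data simulated as in the Code section (same seed and parameters).
set.seed(17)
t.var <- 1:50
x1 <- exp(-0.1 * t.var)
x2 <- exp(0.025 * t.var)
y  <- 5 * x1 - 1 * x2 + 0.5 * rnorm(length(t.var))

# Hypothetical helper: match a target to one matcher through the origin,
# returning the coefficient and the residuals.
match1 <- function(target, matcher) {
  lambda <- sum(matcher * target) / sum(matcher * matcher)
  list(coef = lambda, resid = target - lambda * matcher)
}

m.y1  <- match1(y,  x1)                    # lambda_1 and y.1
m.x21 <- match1(x2, x1)                    # lambda_2 and x2.1
m.y12 <- match1(m.y1$resid, m.x21$resid)   # lambda_3 and y.12

# Coefficients implied by the algebra: (lambda_1 - lambda_3 lambda_2, lambda_3).
b.seq <- c(x1 = m.y1$coef - m.y12$coef * m.x21$coef, x2 = m.y12$coef)

# The other order: take x2 out first, then match y.2 to x1.2 to get mu_3.
m.y2  <- match1(y,  x2)
m.x12 <- match1(x1, x2)
mu3   <- match1(m.y2$resid, m.x12$resid)$coef

fit  <- lm(y ~ x1 + x2 - 1)                # multiple regression for comparison
b.lm <- coef(fit)
print(rbind(sequential = b.seq, lm = b.lm))
print(c(mu3 = mu3, lm.x1 = b.lm[["x1"]]))
```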

Depicting the process

None of this is new: it's all in the text. I would like to offer a pictorial analysis, using a scatterplot matrix of everything we have obtained so far.

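[Figure: scatterplot matrix of the matchers, the residuals obtained so far, $y$ , and the true values of $y$ .]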

Because these data are simulated, we have the luxury of showing the underlying "true" values of $y$ on the last row and column: these are the values $\beta_1 x_1 + \beta_2 x_2$ without the error added in.

The scatterplots below the diagonal have been decorated with the graphs of the matchers, exactly as in the first figure. Graphs with zero slopes are drawn in red: these indicate situations where the matcher gives us nothing new; the residuals are the same as the target. Also, for reference, the origin (wherever it appears within a plot) is shown as an open red circle: recall that all possible matching lines have to pass through this point.

Much can be learned about regression through studying this plot. Some of the highlights are:

The matching of $x_2$ to $x_1$ (row 2, column 1) is poor. This is a good thing: it indicates that $x_1$ and $x_2$ are providing very different information; using both together will likely be a much better fit to $y$ than using either one alone.
Once a variable has been taken out of a target, it does no good to try to take that variable out again: the best matching line will be zero. See the scatterplots for $x_{2\cdot 1}$ versus $x_1$ or $y_{\cdot 1}$ versus $x_1$ , for instance.
The values $x_1$ , $x_2$ , $x_{1\cdot 2}$ , and $x_{2\cdot 1}$ have all been taken out of $y_{\cdot lm}$ .
Multiple regression of $y$ against $x_1$ and $x_2$ can be achieved first by computing $y_{\cdot 1}$ and $x_{2\cdot 1}$ . These scatterplots appear at (row, column) = $(8,1)$ and $(2,1)$ , respectively. With these residuals in hand, we look at their scatterplot at $(4,3)$ . These three one-variable regressions do the trick. As Mosteller & Tukey explain, the standard errors of the coefficients can be obtained almost as easily from these regressions, too--but that's not the topic of this question, so I will stop here.

Code

These data were (reproducibly) created in R with a simulation. The analyses, checks, and plots were also produced with R. This is the code.

#
# Simulate the data.
#
set.seed(17)
t.var <- 1:50                                    # The "times" t[i]
x <- exp(t.var %o% c(x1=-0.1, x2=0.025) )        # The two "matchers" x[1,] and x[2,]
beta <- c(5, -1)                                 # The (unknown) coefficients
sigma <- 1/2                                     # Standard deviation of the errors
error <- sigma * rnorm(length(t.var))            # Simulated errors
y <- (y.true <- as.vector(x %*% beta)) + error   # True and simulated y values
data <- data.frame(t.var, x, y, y.true)

par(col="Black", bty="o", lty=0, pch=1)
pairs(data)                                      # Get a close look at the data
#
# Take out the various matchers.
#
take.out <- function(y, x) {fit <- lm(y ~ x - 1); resid(fit)}
data <- transform(transform(data, 
  x2.1 = take.out(x2, x1),
  y.1 = take.out(y, x1),
  x1.2 = take.out(x1, x2),
  y.2 = take.out(y, x2)
), 
  y.21 = take.out(y.2, x1.2),
  y.12 = take.out(y.1, x2.1)
)
data$y.lm <- resid(lm(y ~ x - 1))               # Multiple regression for comparison
#
# Analysis.
#
# Reorder the dataframe (for presentation), putting y and y.true last
# so that the plot positions match those cited in the text:
data <- data[c(1:3, 6:12, 4:5)]

# Confirm that the three ways to obtain the fit are the same:
pairs(subset(data, select=c(y.12, y.21, y.lm)))

# Explore what happened:
panel.lm <- function (x, y, col=par("col"), bg=NA, pch=par("pch"),
   cex=1, col.smooth="red",  ...) {
  box(col="Gray", bty="o")
  ok <- is.finite(x) & is.finite(y)
  if (any(ok))  {
    b <- coef(lm(y[ok] ~ x[ok] - 1))
    col0 <- ifelse(abs(b) < 10^-8, "Red", "Blue")
    lwd0 <- ifelse(abs(b) < 10^-8, 3, 2)
    abline(c(0, b), col=col0, lwd=lwd0)
  }
  points(x, y, pch = pch, col="Black", bg = bg, cex = cex)    
  points(matrix(c(0,0), nrow=1), col="Red", pch=1)
}
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y,  ...)
}
par(lty=1, pch=19, col="Gray")
pairs(subset(data, select=c(-t.var, -y.12, -y.21)), col="Gray", cex=0.8, 
   lower.panel=panel.lm, diag.panel=panel.hist)

# Additional interesting plots:
par(col="Black", pch=1)
#pairs(subset(data, select=c(-t.var, -x1.2, -y.2, -y.21)))
#pairs(subset(data, select=c(-t.var, -x1, -x2)))
#pairs(subset(data, select=c(x2.1, y.1, y.12)))

# Details of the variances, showing how to obtain multiple regression
# standard errors from the OLS matches.
norm <- function(x) sqrt(sum(x * x))
lapply(data, norm)
s <- summary(lm(y ~ x1 + x2 - 1, data=data))
c(s$sigma, s$coefficients["x1", "Std. Error"] * norm(data$x1.2)) # Equal
c(s$sigma, s$coefficients["x2", "Std. Error"] * norm(data$x2.1)) # Equal
c(s$sigma, norm(data$y.12) / sqrt(length(data$y.12) - 2))        # Equal

— whuber

Could multiple regression of $y$ against $x_1$ and $x_2$ still be achieved by first computing $y_{\cdot 1}$ and $x_{2\cdot 1}$ if $x_1$ and $x_2$ were correlated? Wouldn't it then make a big difference whether we sequentially regressed $y$ on $x_1$ and $x_{2\cdot 1}$ or on $x_2$ and $x_{1\cdot 2}$ ? How does this relate to one regression equation with multiple explanatory variables?

— miura

@miura, One of the leitmotifs of that chapter in Mosteller & Tukey is that when the $x_i$ are correlated, the partials $x_{i\cdot j}$ have low variances; because their variances appear in the denominator of a formula for the estimation variance of their coefficients, this implies the corresponding coefficients will have relatively uncertain estimates. That's a fact of the data, M&T say, and you need to recognize that. It makes no difference whether you start the regression with $x_1$ or $x_2$ : compare y.21 to y.12 in my code.

— whuber
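Both points in the reply above can be illustrated with strongly correlated matchers. The sketch below uses made-up data (the correlation of about 0.95 and all simulated values are assumptions). It reuses the `take.out` idea from the answer's code to show that the two orders give the same final residuals, and that the standard error of the $x_2$ coefficient equals the residual standard deviation divided by the length of the partial $x_{2\cdot 1}$ .

```r
# Hypothetical correlated matchers.
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)   # strongly correlated with x1
y  <- 2 * x1 + 1 * x2 + rnorm(n)

take.out <- function(y, x) resid(lm(y ~ x - 1))  # residuals of a match through 0
len <- function(v) sqrt(sum(v * v))              # Euclidean length

x2.1 <- take.out(x2, x1); y.1 <- take.out(y, x1)
x1.2 <- take.out(x1, x2); y.2 <- take.out(y, x2)
y.12 <- take.out(y.1, x2.1)                      # x1 first, then x2
y.21 <- take.out(y.2, x1.2)                      # x2 first, then x1

# The order makes no difference to the final residuals.
max.diff <- max(abs(y.12 - y.21))
print(max.diff)

# The partial x2.1 is much shorter than x2 itself ...
print(c(x2 = len(x2), x2.1 = len(x2.1)))

# ... and its length is the denominator of the coefficient's standard error.
s <- summary(lm(y ~ x1 + x2 - 1))
se.x2 <- s$coefficients["x2", "Std. Error"]
print(c(se.from.lm = se.x2, sigma.over.length = s$sigma / len(x2.1)))
```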

I came across this today; here is what I think about the question by @miura. Think of a 2-dimensional space where y is to be projected as a combination of two vectors: y = ax1 + bx2 + res (with res = 0). Now think of y as a combination of 3 variables, y = ax1 + bx2 + cx3, where x3 = mx1 + nx2. Then the order in which you choose your variables is certainly going to affect the coefficients, because the minimum error can be attained by various combinations. In a few examples, however, the minimum error can be attained by only one combination, and that is where the order will not matter.

— Gaurav Singhal

@whuber Can you elaborate on how this equation might be used for a multivariate regression that also has a constant term, i.e. y = B1 * x1 + B2 * x2 + c? It is not clear to me how the constant term can be derived. Also, I understand in general what was done for the 2 variables, enough at least to replicate it in Excel. How can that be expanded to 3 variables x1, x2, x3? It seems clear that we would need to remove x3 first from y, x1, and x2, then remove x2 from x1 and y. But it is not clear to me how to then get the B3 term.

— Fairly Nerdy

I have answered some of the questions I had in the comment above. For a 3-variable regression, we would have 6 steps. Remove $x_1$ from $x_2$ , from $x_3$ , and from $y$ . Then remove $x_{2\cdot 1}$ from $x_{3\cdot 1}$ and from $y_{\cdot 1}$ . Then remove $x_{3\cdot 12}$ from $y_{\cdot 12}$ . That results in 6 equations, each of which is of the form variable = lambda * different variable + residual. One of those equations has a y as the first variable, and if you just keep substituting the other variables in, you get the equation you need.

— Fairly Nerdy
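The three-variable procedure described in the last comment, together with the earlier question about a constant term, can be sketched as follows. This rests on two assumptions that are not stated in the thread: the constant is handled by treating a column of ones as one more matcher, and each coefficient is read off by placing its variable last in the removal order instead of back-substituting. The helper `last.coef` and the simulated data are hypothetical.

```r
# Hypothetical data with a constant term and three explanatory variables.
set.seed(4)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n) + 0.5 * x1
x3 <- runif(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n, sd = 0.2)

# One-variable match through the origin: coefficient and residuals.
slope0   <- function(target, matcher) sum(matcher * target) / sum(matcher * matcher)
take.out <- function(target, matcher) target - slope0(target, matcher) * matcher

# Take each column of X, in order, out of all later columns and out of y;
# return the coefficient of the final one-variable match (the last column).
last.coef <- function(y, X) {
  p <- ncol(X)
  for (j in seq_len(p - 1)) {
    m <- X[, j]
    for (k in (j + 1):p) X[, k] <- take.out(X[, k], m)
    y <- take.out(y, m)
  }
  slope0(y, X[, p])
}

# Assumption: the constant is an extra matcher consisting of ones.
X <- cbind(const = 1, x1, x2, x3)

# Put each variable last in turn to read off its coefficient.
b.seq <- sapply(colnames(X), function(v) {
  others <- setdiff(colnames(X), v)
  last.coef(y, X[, c(others, v)])
})

b.lm <- coef(lm(y ~ x1 + x2 + x3))   # multiple regression with intercept
print(rbind(sequential = b.seq, lm = unname(b.lm)))
```

Taking the column of ones out of a variable is the same as subtracting its mean, which connects this back to the centering in the original question.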