Independent variable = random variable?


25

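I am slightly confused about whether an independent variable (also called a predictor or feature) in a statistical model, for example the $X$ in the linear regression $Y = \beta_0 + \beta_1 X$, is a random variable.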


12
The linear model is conditional on $X$, so whether it is random does not matter.
Xi'an

4
Check this out. Good question, by the way.
Antoni Parellada, 2016

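@Xi'an, in a fixed design the linear model assumptions are not conditioned on $X$; see my answer. So it does matter. This is why experiments are easier to interpret than the results of observational studies.
Aksakal, 2017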

Answers:


19

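There are two common formulations of linear regression. To focus on the concepts, I will abstract them somewhat. The mathematical description is a little more involved than the English description, so let's begin with the latter:

Linear regression is a model in which a response $Y$ is assumed to be random, with a distribution determined by regressors $X$ via a linear map $\beta(X)$ and, possibly, by other parameters $\theta$.

In most cases the set of possible distributions is a location family with parameters $\alpha$ and $\theta$, and $\beta(X)$ gives the parameter $\alpha$. The archetypical example is ordinary regression, in which the set of distributions is the Normal family $N(\mu, \sigma)$ and $\mu = \beta(X)$ is a linear function of the regressors.

Because I have not yet described this mathematically, it is still an open question what kinds of mathematical objects $X$, $Y$, $\beta$, and $\theta$ refer to, and I believe that is the main issue in this thread. Although one can make various (equivalent) choices, most will be equivalent to, or special cases of, the following description.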


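  1. Fixed regressors. The regressors are represented as real vectors $X \in \mathbb{R}^p$. The response is a random variable $Y: \Omega \to \mathbb{R}$ (where $\Omega$ is endowed with a sigma field and a probability). The model is a function $f: \mathbb{R} \times \Theta \to M^d$ (or, if you like, a set of functions $\mathbb{R} \to M^d$ parameterized by $\Theta$). $M^d$ is a finite-dimensional topological (usually twice differentiable) submanifold (or submanifold-with-boundary) of dimension $d$ of the space of probability distributions. $f$ is usually taken to be continuous (or sufficiently differentiable). $\Theta \subset \mathbb{R}^{d-1}$ are the "nuisance parameters." It is supposed that the distribution of $Y$ is $f(\beta(X), \theta)$ for some unknown dual vector $\beta \in (\mathbb{R}^p)^*$ (the "regression coefficients") and unknown $\theta \in \Theta$. We may write this

    $Y \sim f(\beta(X), \theta).$

  2. Random regressors. The regressors and response are a $(p+1)$-dimensional vector-valued random variable $Z = (X, Y): \Omega' \to \mathbb{R}^p \times \mathbb{R}$. The model $f$ is the same kind of object as before, but now it gives the conditional probability

    $Y \mid X \sim f(\beta(X), \theta).$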

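The mathematical description is useless without some prescription telling how it is intended to be applied to data. In the fixed regressor case we conceive of $X$ as being specified by the experimenter. Thus it might help to view $\Omega$ as a product $\mathbb{R}^p \times \Omega'$ endowed with a product sigma algebra. The experimenter determines $X$ and nature determines (some unknown, abstract) $\omega \in \Omega'$. In the random regressor case, nature determines $\omega \in \Omega'$, the $X$-component of the random variable, $\pi_X(Z(\omega))$, determines $X$ (which is "observed"), and we now have an ordered pair $(X(\omega), \omega) \in \Omega$, exactly as in the fixed regressor case.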


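The archetypical example of multiple linear regression (which I will express using standard notation for the objects rather than this more general one) is

$f(\beta(x), \sigma) = N(\beta(x), \sigma)$
for some constant $\sigma \in \Theta = \mathbb{R}^+$. As $x$ varies throughout $\mathbb{R}^p$, its image differentiably traces out a one-dimensional subset (a curve) in the two-dimensional manifold of Normal distributions.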

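When, in any fashion whatsoever, $\beta$ is estimated as $\hat\beta$ and $\sigma$ as $\hat\sigma$, the value of $\hat\beta(x)$ is the predicted value of $Y$ associated with $x$, whether $x$ is controlled by the experimenter (case 1) or is only observed (case 2). If we either set a value (case 1) or observe a realization (case 2) $x$ of $X$, then the response $Y$ associated with that $X$ is a random variable whose distribution is $N(\beta(x), \sigma)$, which is unknown but estimated to be $N(\hat\beta(x), \hat\sigma)$.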
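To make the two cases concrete, here is a minimal Python sketch; it is not part of the original answer. It assumes a single regressor, made-up values for the coefficients and for $\sigma$, and ordinary least squares as the estimator (the answer itself leaves the estimation method open). The helper names `simulate_response` and `fit` are hypothetical. The point is only that the same fitting and prediction code applies whether $x$ is set by the experimenter or merely observed.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([1.0, 2.0])  # assumed intercept and slope (illustrative)
sigma_true = 0.5                  # assumed error standard deviation (illustrative)
n = 200


def simulate_response(x, rng):
    """Draw one Y ~ N(beta(x), sigma) for each entry of x."""
    design = np.column_stack([np.ones_like(x), x])
    return design @ beta_true + rng.normal(0.0, sigma_true, size=x.shape)


def fit(x, y):
    """Return least-squares estimates (beta_hat, sigma_hat)."""
    design = np.column_stack([np.ones_like(x), x])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta_hat
    # Residual standard deviation with the usual degrees-of-freedom correction.
    sigma_hat = np.sqrt(residuals @ residuals / (len(y) - design.shape[1]))
    return beta_hat, sigma_hat


# Case 1: the experimenter sets x on a grid.
x_fixed = np.linspace(0.0, 4.0, n)
beta_fixed, sigma_fixed = fit(x_fixed, simulate_response(x_fixed, rng))

# Case 2: nature draws x and we only observe it.
x_random = rng.uniform(0.0, 4.0, size=n)
beta_random, sigma_random = fit(x_random, simulate_response(x_random, rng))

# Estimated distribution of Y at a new x is N(beta_hat(x), sigma_hat).
x_new = np.array([1.0, 3.0])  # leading 1.0 is the intercept term
mean_fixed = beta_fixed @ x_new
mean_random = beta_random @ x_new
print("case 1:", mean_fixed, sigma_fixed)
print("case 2:", mean_random, sigma_random)
```

With these assumed values the true mean at $x = 3$ is $1 + 2 \cdot 3 = 7$, and both cases should give estimates close to $N(7, 0.5)$.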


I'd just like to mention that this is a fantastic answer (but probably not for everyone).
l7ll7

2
P.S. Do you know of any book where these foundational questions are explained as precisely as you did here? As a mathematician, I find that all the books I have come across reflect the other answers here, which are much less precise from a mathematical point of view. (This doesn't make them bad, of course; it's just that those books are not for me. I would love a book that is more precise, like this answer.)
l7ll7

In the first sentence of the last paragraph, isn't β^(x) the predicted value for y (a realization of the random variable Y), not the predicted value for x? Or have I misunderstood your language, and "predicted value for x" means "predicted value when x is the set (observed) value of X"?
Chad

1
@Chad Thank you for pointing out the ambiguous language. I have edited that sentence to clarify the meaning, which is consistent with your understanding.
whuber

7

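First of all, @whuber gave an excellent answer. I'll give a different take on it, maybe simpler in some sense, and also with a reference to a text.

Motivation

In the regression formulation, $X$ can be random or fixed. This depends on your problem. For so-called observational studies it has to be random, and for experiments it is usually fixed.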

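Example one. I'm studying the impact of exposure to electron radiation on the hardness of a metal part. So I take a few samples of the metal part and expose them to varying levels of radiation. My exposure level is $X$, and it is fixed, because I set it to the levels that I chose. I fully control the conditions of the experiment, or at least try to. I can do the same with other parameters, such as temperature and humidity.

Example two. You're studying the impact of the economy on the frequency of fraud in credit card applications. So you regress the fraud event counts on GDP. You do not control GDP; you cannot set it to a desired level. Moreover, you probably want to look at multivariate regressions, so you have other variables such as unemployment, and now you have a combination of values in $X$ that you observe but do not control. In this case $X$ is random.

Example three. You are studying the efficacy of a new pesticide in the field, i.e. not in lab conditions but on an actual experimental farm. In this case you can control some things, e.g. the amount of pesticide applied. However, you do not control everything, e.g. weather or soil conditions. OK, you can control the soil to some extent, but not completely. This is an in-between case, where some conditions are observed and some are controlled. There is an entire field of study called experimental design that really focuses on this third case, and agricultural research is one of its biggest applications.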

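Math

Here is the mathematical part of the answer. A set of assumptions is usually presented when studying linear regression, called the Gauss-Markov conditions. They are very theoretical, and nobody bothers to prove that they hold in any practical setup. However, they are very useful for understanding the limitations of the ordinary least squares (OLS) method.

The set of assumptions is different for random and fixed $X$, which roughly correspond to observational and experimental studies. Roughly, because as I showed in the third example, sometimes we are really in between the extremes. I found the "Gauss-Markov" theorem section in the Encyclopedia of Research Design by Salkind to be a good place to start; it is available on Google Books.

The assumptions for the fixed design, for the usual regression model $Y = X\beta + \varepsilon$, are as follows: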

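  • Zero mean, $E[\varepsilon] = 0$
  • Homoscedasticity, $E[\varepsilon^2] = \sigma^2$
  • No serial correlation, $E[\varepsilon_i \varepsilon_j] = 0$ for $i \neq j$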

versus the same assumptions in the random design:

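  • Zero conditional mean, $E[\varepsilon \mid X] = 0$
  • Homoscedasticity, $E[\varepsilon^2 \mid X] = \sigma^2$
  • No serial correlation, $E[\varepsilon_i \varepsilon_j \mid X] = 0$ for $i \neq j$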

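As you can see, the difference is that for the random design the assumptions are conditioned on the design matrix. Conditioning makes these assumptions stronger. For instance, we are not just saying, as in the fixed design, that the errors have zero mean; in the random design we are also saying that their mean does not depend on the covariates $X$.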
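A small simulation can illustrate why the conditional version is the stronger requirement. The sketch below is an illustration added here, not something from the cited text, and every number in it is an assumption. In design A the errors are independent of $X$. In design B the regressor and the error share a common component `u`, so $E[\varepsilon] = 0$ still holds, but $E[\varepsilon \mid X] = X/2 \neq 0$, and the OLS slope drifts away from the true value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0, beta1 = 1.0, 2.0  # assumed true coefficients (illustrative)


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)


# Design A: errors independent of X, so E[eps | X] = 0.
x_a = rng.normal(size=n)
eps_a = rng.normal(size=n)
slope_a = ols_slope(x_a, beta0 + beta1 * x_a + eps_a)

# Design B: X = z + u and eps = u share the component u.
# E[eps] = 0 still holds, but E[eps | X] = X / 2.
z = rng.normal(size=n)
u = rng.normal(size=n)
x_b = z + u
eps_b = u
slope_b = ols_slope(x_b, beta0 + beta1 * x_b + eps_b)

print("mean of eps in design B:", eps_b.mean())
print("slope, design A:", slope_a)
print("slope, design B:", slope_b)
```

In design B the slope converges to $\beta_1 + \mathrm{Cov}(X, \varepsilon)/\mathrm{Var}(X) = 2 + 1/2$ rather than to $2$, even though the unconditional mean of the errors is zero.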


2

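In statistics, a random variable is a quantity that varies randomly in some way. You can find a good discussion in this excellent CV thread: What is meant by a "random variable"?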

In a regression model, the predictor variables (X-variables, explanatory variables, covariates, etc.) are assumed to be fixed and known. They are not assumed to be random. All of the randomness in the model is assumed to be in the error term. Consider a simple linear regression model as standardly formulated:

$Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$
The error term, ε, is a random variable and is the source of the randomness in the model. As a result of the error term, Y is a random variable as well. But X is not assumed to be a random variable. (Of course, it might be a random variable in reality, but that is not assumed or reflected in the model.)
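One way to read "X is fixed and all the randomness is in the error term" is as a repeated-sampling experiment: the same $x$ values are reused in every replication and only $\varepsilon$ is redrawn. The following sketch is an illustration under assumed values for $\beta_0$, $\beta_1$, $\sigma$ and the design points; none of these come from the answer.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 1.0, 2.0, 1.0  # assumed parameters (illustrative)
reps = 5000

# Fixed, known design: the same 20 x values in every replication.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0] * 4)
xc = x - x.mean()

# Only the error term is random; each row of y is one replication.
eps = rng.normal(0.0, sigma, size=(reps, x.size))
y = beta0 + beta1 * x + eps

# OLS slope per replication. Centering y is unnecessary because xc sums to zero.
slopes = (y @ xc) / (xc @ xc)

# Standard deviation of the slope estimator when x is held fixed.
theoretical_sd = sigma / np.sqrt(xc @ xc)

print("mean of slope estimates:", slopes.mean())
print("sd of slope estimates:  ", slopes.std())
print("theoretical sd:         ", theoretical_sd)
```

Here $Y$ and the slope estimate vary from replication to replication while `x` never changes, which is the sense in which $Y$ is random and $X$ is not in this formulation.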

So you mean X is a constant ? Because that is the only other way to make sense of X from a mathematical point of view, since ε is a random variable and addition is only defined between two random variables and not "something else" + random variable. Though one of the two random variables could be constant, which is the case I'm referring to.
l7ll7

P.S. I looked at all the explanations from said link and none is very illuminating. Why? Because none makes the connection between random variables as probabilists understand them vs. how statisticians understand them. So some answers restate the standard, precise probability theory definition, while others restate the (yet unclear to me) vague statistical definition. But none really explains the connection between these two concepts. (The only exception is the long ticket-in-a-box model answer, which may show some promise, but even so [...]
l7ll7

the difference wasn't fleshed out clearly enough to be strikingly illuminating; I'll have to meditate on this specific answer to see if there's any value to it)
l7ll7

@user10324, if you like, you can think of X as a set of constants. You could also think of it as a non-random variable.
gung - Reinstate Monica

No, the non-random variable way of thinking about it does not work, for two reasons: One, as I argued in the comments above, there is no such thing as a "variable" in mathematics, and two, even if it were, then addition in that case is not defined, as I argued in the comments above.
l7ll7

1

Not sure if I understand the question, but if you're just asking, "must an independent variable always be a random variable", then the answer is no.

An independent variable is a variable which is hypothesised to be correlated with the dependent variable. You then test whether this is the case through modelling (presumably regression analysis).

There are a lot of complications and "ifs, buts and maybes" here, so I would suggest getting a copy of a basic econometrics or statistics book covering regression analysis and reading it thoroughly, or else getting the class notes from a basic statistics/econometrics course online if possible.


Ok, but what is it, if it is not a random variable ? Just a (therefore deterministic) function ? I'm confused regarding the mathematical nature of the object "X". Actually, I found in the meantime a textbook, Probability and Statistics by Papoulis, where on page 149 he says "given two random variables X and Y [...]" and then goes on to explain how to regress X on Y. So he seems to understand X as a random variable ?
l7ll7

P.S. I want to add that there is no such thing as a "variable" in mathematics when you look at it as a "standalone" object (my background is maths). Variables in mathematics are just parts of standalone objects (e.g. arguments of a function), but have no standalone meaning. If I just write "x" in mathematics, it could mean the function $x \mapsto x$, or it could be a specific number, if x was assigned a value previously, but we don't have just x. And since log. regression is a mathematical model, I'm interested in the mathematical meaning of X.
l7ll7

It sounds as though you have a much greater understanding of maths than me. I'm just giving you the standard university undergraduate econometrics/statistics answer. I wonder if perhaps you might be overthinking it a bit, at least from the perspective of practical analysis. Regarding the quote from that book, my interpretation of that is that the specific x and y to which he is referring are random - but that doesn't mean that any x or any y are random.
Statsanalyst

e.g. the dependent variable in a model for voting trends in UK politics might be the number of votes received by the Conservative candidate in each constituency (Riding to Canadians, District to Americans), and the independent variable might be average house prices (a proxy for wealth/income in the UK). Neither of these is a "random" variable as I understand it, but this would be a perfectly reasonable thing to model.
Statsanalyst

Ok, it is good to know what kind of answers I can expect and what the standard is at econometrics/statistics departments, and I appreciate that feedback very much (I would upvote again, but I can't since I already did). The problem with mathematics is "once you go black you never go back": yearlong training in mathematical precision will induce a feeling of uneasiness if something is not fleshed out crystal clear until one achieves clarity [...]
l7ll7
Licensed under cc by-sa 3.0 with attribution required.