32

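There exist a number of robust estimators of scale. A notable example is the median absolute deviation, which relates to the standard deviation as $\sigma = \mathrm{MAD}\cdot1.4826$. In a Bayesian framework there are a number of ways to robustly estimate the location of a roughly normal distribution (say, a normal contaminated by outliers); for example, one could assume the data is distributed as a $t$ distribution or a Laplace distribution. Now my question:

What would a Bayesian model for measuring the scale of a roughly normal distribution in a robust way be, robust in the same sense as the MAD or similar robust estimators?

As is the case with the MAD, it would be neat if the Bayesian model could approach the SD of a normal distribution when the data actually is normally distributed.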
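To make the two properties asked for here concrete (agreement with the SD on normal data, and resistance to outliers), here is a small R sketch comparing `sd()` and `mad()`. R's `mad()` applies the factor 1.4826 by default through its `constant` argument. The seed, sample size and contamination scheme are arbitrary assumptions chosen only for illustration.

```r
# Illustration only: seed, sample size and contamination are arbitrary assumptions.
set.seed(1)
clean <- rnorm(10000, mean = 0, sd = 1)

# On clean normal data both estimators target sigma = 1.
sd_clean  <- sd(clean)
mad_clean <- mad(clean)   # mad() multiplies by constant = 1.4826 by default

# Replace 10% of the sample with gross outliers.
dirty <- clean
dirty[1:1000] <- 1000

sd_dirty  <- sd(dirty)    # inflated by the outliers
mad_dirty <- mad(dirty)   # only mildly affected

print(round(c(sd_clean = sd_clean, mad_clean = mad_clean,
              sd_dirty = sd_dirty, mad_dirty = mad_dirty), 3))
```

The MAD is not entirely unaffected by the contamination, but its change is bounded, whereas the SD grows with the size of the outliers.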

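Edit 1:

A typical example of a model that is robust against contamination/outliers, when the data $y_i$ is assumed to be roughly normal, uses a $t$ distribution:

$y_i \sim \mathrm{t}(m, s,\nu)$

where $m$ is the mean, $s$ is the scale, and $\nu$ is the degrees of freedom. With suitable priors on $m$, $s$ and $\nu$, $m$ will be an estimate of the mean of $y_i$ that is robust against outliers. However, $s$ will not be a consistent estimate of the SD of $y_i$, because $s$ depends on $\nu$. For example, if $\nu$ is fixed at 4.0 and the model above is fitted to a huge number of samples from a $\mathrm{Norm}(\mu=0,\sigma=1)$ distribution, then $s$ will be around 0.82. What I am looking for is a model that is robust, like the t model, but for the SD instead of (or in addition to) the mean.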

Edit 2:

Here follows a coded example in R and JAGS of how the t model mentioned above is more robust with respect to the mean.

library(rjags)  # must be loaded before the first call to jags.model()

# generating some contaminated data
y <- c( rnorm(100, mean=10, sd=10), 
        rnorm(10, mean=100, sd= 100))

#### A "standard" normal model ####
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dnorm(mu, inv_sigma2)
  }

  mu ~ dnorm(0, 0.00001)
  inv_sigma2 ~ dgamma(0.0001, 0.0001)
  sigma <- 1 / sqrt(inv_sigma2)
}"

model <- jags.model(textConnection(model_string), list(y = y))
mcmc_samples <- coda.samples(model, "mu", n.iter=10000)
summary(mcmc_samples)

### The quantiles of the posterior of mu
##  2.5%   25%   50%   75% 97.5% 
##   9.8  14.3  16.8  19.2  24.1 

#### A (more) robust t-model ####
library(rjags)
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dt(mu, inv_s2, nu)
  }

  mu ~ dnorm(0, 0.00001)
  inv_s2 ~ dgamma(0.0001,0.0001)
  s <- 1 / sqrt(inv_s2)
  nu ~ dexp(1/30) 
}"

model <- jags.model(textConnection(model_string), list(y = y))
mcmc_samples <- coda.samples(model, "mu", n.iter=1000)
summary(mcmc_samples)

### The quantiles of the posterior of mu
##  2.5%   25%   50%   75% 97.5% 
##  8.03  9.35  9.99 10.71 12.14

— Rasmus Bååth

It may not be robust enough, but the chi-squared distribution is the usual conjugate prior for the inverse of the variance.

— Mike Dunlavey, 2014

You might want to check whether the first answer to this question, stats.stackexchange.com/questions/6493/…, is sufficient for your needs; it may not be, but perhaps it is.

— jbowman

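What is your prior on the level of contamination? Will the contamination be systematic? Random? Will it be generated by a single distribution or by multiple distributions? Do we know anything about the noise distributions? If at least some of the things above are known, then we could fit some kind of mixture model. Otherwise, I'm not sure what your beliefs about this problem actually are, and if you have none, this seems like a very vague setting. You need to fix something, otherwise you could pick a point at random and declare it to be the only Gaussian-generated point.

— means-to-meaning

But in general, you could either use a t distribution, which is more resistant to outliers, or a mixture of t distributions. I'm sure there are many papers; here is one by Bishop, research.microsoft.com/zh-cn/um/people/cmbishop/downloads/..., and here is an R package for fitting mixtures: maths.uq.edu.au/~gjm/mix_soft/EMMIX_R/EMMIX-manual.pdf

— means-to-meaning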

1

$\sigma = \mathrm{MAD}\cdot1.4826$ holds for a normally distributed population, but not for most other distributions.

— Henry

10

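Bayesian inference in a T noise model with an appropriate prior will give a robust estimate of location and scale. The precise conditions that the likelihood and prior need to satisfy are given in the paper "Bayesian robustness modelling of location and scale parameters" by Andrade and O'Hagan (2011). The estimates are robust in the sense that a single observation cannot make the estimates arbitrarily large, as demonstrated in figure 2 of the paper.

When the data is normally distributed, the SD of the fitted T distribution (for fixed $\nu$) does not match the SD of the generating distribution. But this is easy to fix. Let $\sigma$ be the standard deviation of the generating distribution and let $s$ be the standard deviation of the fitted T distribution. If the data is scaled by 2, then from the form of the likelihood we know that $s$ must scale by 2. This implies that $s = \sigma f(\nu)$ for some fixed function $f$. This function can be computed numerically by simulation from a standard normal. Here is the code to do this: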

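library(stats)
library(stats4)  # provides mle()

# simulate a large sample from the standard normal (sigma = 1)
y = rnorm(100000, mean=0, sd=1)
nu = 4
# negative log-likelihood of the T scale parameter (location fixed at 0)
nLL = function(s) -sum(stats::dt(y/s, nu, log=TRUE) - log(s))
fit = mle(nLL, start=list(s=1), method="Brent", lower=0.5, upper=2)
# the variance of a standard T is nu/(nu-2), so this prints the SD of the
# fitted T distribution, which equals f(nu) because sigma = 1
print(coef(fit)*sqrt(nu/(nu-2)))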

For example, at $\nu=4$ I get $f(\nu)=1.18$. The desired estimator is then $\hat{\sigma} = s/f(\nu)$.
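As a rough sketch of how the correction could be tabulated for several values of $\nu$ and then applied, the R code below repeats the simulation on a grid. The helper name `fit_t_scale`, the seed, the grid of $\nu$ values and the fitted SD `s_fit` are all assumptions made for illustration; `optimize()` is used in place of `mle()` simply to keep the sketch short.

```r
# Sketch: approximate f(nu) = s / sigma by fitting the scale of a t distribution
# (location fixed at 0, nu fixed) to standard-normal draws. The helper name
# fit_t_scale, the seed and the nu grid are assumptions made for illustration.
set.seed(2014)
y <- rnorm(100000, mean = 0, sd = 1)

fit_t_scale <- function(y, nu) {
  # Negative log-likelihood of the scale parameter for a zero-centred t model.
  nll <- function(scale) -sum(dt(y / scale, df = nu, log = TRUE) - log(scale))
  optimize(nll, interval = c(0.1, 5))$minimum
}

nus <- c(3, 4, 10, 30)
scale_hat <- sapply(nus, function(nu) fit_t_scale(y, nu))

# SD implied by the fitted t distribution: scale * sqrt(nu / (nu - 2)).
# Because the data have sigma = 1, this is an estimate of f(nu).
f_hat <- scale_hat * sqrt(nus / (nus - 2))
print(data.frame(nu = nus, scale = round(scale_hat, 3), f = round(f_hat, 3)))

# Applying the correction: a fitted t SD s_fit maps to sigma_hat = s_fit / f(nu).
s_fit <- 2.36                          # hypothetical fitted SD at nu = 4
sigma_hat <- s_fit / f_hat[nus == 4]
print(sigma_hat)
```

In a Bayesian fit the same division could presumably be applied to each posterior draw of the t SD, provided $\nu$ is held fixed at the value used to compute $f(\nu)$.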

— Tom Minka

1

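Good answer (+1). "In the sense that a single observation cannot make the estimates arbitrarily large", so the breakdown point is 2/n (I was wondering about this) [...] n/2.

— user603, 2014

Wow, thanks! A fuzzy follow-up question: would it then actually make sense to "correct" the scale so that it is consistent with the SD in the normal case? The use case I have in mind is reporting a measure of spread. I have no problem with reporting the scale, but it would be nice to report something consistent with the SD, since that is the most common measure of spread (in psychology at least). Do you see a situation where this correction would lead to strange and inconsistent estimates?

— 2014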

6

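As you are asking about a very precise problem (robust estimation), I will offer you an equally precise answer. First, however, I will try to dispel an unwarranted assumption. It is not true that there is a robust Bayesian estimate of location (there are Bayesian estimators of location, but as I illustrate below they are not robust and, apparently, even the simplest robust estimator of location is not Bayesian). In my opinion, the reasons for the absence of overlap between the 'Bayesian' and 'robust' paradigms in the location case go a long way towards explaining why there are also no estimators of scatter that are both robust and Bayesian.

With suitable priors on $m, s$ and $\nu$, $m$ will be an estimate of the mean of $y_i$ that will be robust against outliers.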

Actually, no. The resulting estimates will only be robust in a very weak sense of the word robust. However, when we say that the median is robust to outliers we mean the word robust in a much stronger sense. That is, in robust statistics, the robustness of the median refers to the property that if you compute the median on a data-set of observations drawn from a uni-modal, continuous model and then replace less than half of these observations by arbitrary values, the value of the median computed on the contaminated data is close to the value you would have had had you computed it on the original (uncontaminated) data-set. Then, it is easy to show that the estimation strategy you propose in the paragraph I quoted above is definitely not robust in the sense of how the word is typically understood for the median.
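The strong sense of robustness described here can be illustrated with a few lines of R. This is only a sketch: the clean data, the 40% contamination rate and the contaminating value are arbitrary assumptions.

```r
# Illustration only: the data and the contaminating value are arbitrary assumptions.
set.seed(42)
x_clean <- rnorm(100, mean = 10, sd = 1)

# Replace 40% of the observations (less than half) by an arbitrary, huge value.
x_bad <- x_clean
x_bad[1:40] <- 1e6

med_clean <- median(x_clean)
med_bad   <- median(x_bad)    # still lies within the range of the clean data
mean_bad  <- mean(x_bad)      # dragged far away by the contamination

print(c(median_clean = med_clean, median_contaminated = med_bad,
        mean_contaminated = mean_bad))
```

However large the contaminating value is made, the median of the contaminated sample cannot leave the range of the uncontaminated observations as long as fewer than half of them are replaced.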

I'm wholly unfamiliar with Bayesian analysis. However, I was wondering what is wrong with the following strategy as it seems simple, effective and yet has not been considered in the other answers. The prior is that the good part of the data is drawn from a symmetric distribution $F$ and that the rate of contamination is less than half. Then, a simple strategy would be to:

1. Compute the median and MAD of your dataset. Then compute: $z_i=\frac{|x_i-\mbox{med}(x)|}{\mbox{mad}(x)}$
2. Exclude the observations for which $z_i>q_{\alpha}(z|x\sim F)$ (this is the $\alpha$ quantile of the distribution of $z$ when $x\sim F$). This quantity is available for many choices of $F$ and can be bootstrapped for the others.
3. Run a (usual, non-robust) Bayesian analysis on the non-rejected observations.
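The three steps above can be sketched in R as follows, assuming $F$ is Gaussian. In that case, for large samples $z$ behaves approximately like $|N(0,1)|$, so its $\alpha$ quantile is roughly `qnorm((1 + alpha) / 2)`. The value `alpha = 0.99`, the simulated data and the variable names are assumptions made for this illustration.

```r
# Sketch of the three steps, assuming F is Gaussian. alpha = 0.99 and the
# simulated data are illustrative assumptions.
set.seed(123)
x <- c(rnorm(70, mean = 1000, sd = 1),     # "good" part of the data
       rnorm(30, mean = -990, sd = 0.01))  # contaminants

# Step 1: robust z-scores based on the median and the MAD.
z <- abs(x - median(x)) / mad(x)

# Step 2: for Gaussian F and large n, z behaves like |N(0, 1)|, so its
# alpha quantile is approximately qnorm((1 + alpha) / 2).
alpha <- 0.99
cutoff <- qnorm((1 + alpha) / 2)
keep <- z <= cutoff

# Step 3: the retained points would then go into the usual (non-robust)
# analysis; here they are only summarised.
print(sum(keep))
print(c(mean = mean(x[keep]), sd = sd(x[keep])))
```

Because some good observations in the tails are also rejected, the SD of the retained points tends to be somewhat smaller than the SD of the good part of the data.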

EDIT:

Thanks to the OP for providing self-contained R code to conduct a bona fide Bayesian analysis of the problem.

The code below compares the Bayesian approach suggested by the OP to its alternative from the robust statistics literature (e.g. the fitting method proposed by Gauss for the case where the data may contain as many as $n/2-2$ outliers and the distribution of the good part of the data is Gaussian).

The central part of the data is $\mathcal{N}(1000,1)$ :

n<-100
set.seed(123)
y<-rnorm(n,1000,1)

Add some amount of contaminants:

y[1:30]<-y[1:30]/100-1000 
w<-rep(0,n)
w[1:30]<-1

the index w takes value 1 for the outliers. I begin with the approach suggested by the O.P.:

library("rjags")
model_string<-"model{
  for(i in 1:length(y)){
    y[i]~dt(mu,inv_s2,nu)
  }
  mu~dnorm(0,0.00001)
  inv_s2~dgamma(0.0001,0.0001)
  s<-1/sqrt(inv_s2)
  nu~dexp(1/30) 
}"

model<-jags.model(textConnection(model_string),list(y=y))
mcmc_samples<-coda.samples(model,"mu",n.iter=1000)
print(summary(mcmc_samples)$statistics[1:2])
summary(mcmc_samples)

I get:

     Mean        SD 
384.2283  97.0445

and:

2. Quantiles for each variable:

 2.5%   25%   50%   75% 97.5% 
184.6 324.3 384.7 448.4 577.7

(thus quite far from the target values)

For the robust method,

z<-abs(y-median(y))/mad(y)
th<-max(abs(rnorm(length(y))))
print(c(mean(y[which(z<=th)]),sd(y[which(z<=th)])))

one gets:

 1000.149 0.8827613

(very close to the target values)

The second result is much closer to the real values. But it gets worse. If we classify as outliers those observations for which the estimated $z$-score is larger than th (remember that the prior is that $F$ is Gaussian), then the Bayesian approach finds that all the observations are outliers (the robust procedure, in contrast, flags all and only the outliers as such). This also implies that if you were to run a usual (non-robust) Bayesian analysis on the data not classified as outliers by the robust procedure, you should do fine (e.g. fulfil the objectives stated in your question).
This is just an example, but it is actually fairly straightforward to show (and it can be done formally; see, for example, chapter 2 of [1]) that the parameters of a Student $t$ distribution fitted to contaminated data cannot be depended upon to reveal the outliers.

[1] Ricardo A. Maronna, Douglas R. Martin, Victor J. Yohai (2006). Robust Statistics: Theory and Methods (Wiley Series in Probability and Statistics).
Huber, P. J. (1981). Robust Statistics. New York: John Wiley and Sons.

— user603

1

Well, the t is often proposed as a robust alternative to the normal distribution. I don't know if this is in the weak sense or not. See for example: Lange, K. L., Little, R. J., & Taylor, J. M. (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408), 881-896. pdf

— Rasmus Bååth

1

This is the weak sense. If you have R code that implements the procedure you suggest, I'll be happy to illustrate my answer with an example. Otherwise you can get more explanation in chapter 2 of this textbook.

— user603

The procedure I suggest is basically described here: indiana.edu/~kruschke/BEST including R code. I will have to think about your solution! It does not, however, seem Bayesian in the sense that it does not model all the data, just the subset that "survives" step 2.

— Rasmus Bååth

I thank you for your interesting discussion! Your answer is not what I seek, however, because (1) you don't describe a Bayesian procedure, you describe more of a data preparation step for how to remove outliers, and (2) your procedure does not result in a consistent estimator of SD; that is, if you sample from a normal distribution and the number of datapoints $\rightarrow \infty$, you will not approach the "true" SD, rather your estimate will be a bit low. I also don't completely buy your definition of robust (your definition is not how I have seen it in most Bayesian literature I've come across).

— Rasmus Bååth

1

I have now done so!

— Rasmus Bååth

1

In Bayesian analysis, using the inverse Gamma distribution as a prior for the variance (equivalently, a Gamma prior for the precision, the inverse of the variance) is a common choice, as is the inverse Wishart distribution for the covariance matrix in multivariate models. Adding a prior on the variance improves robustness against outliers.

There is a nice paper by Andrew Gelman: "Prior distributions for variance parameters in hierarchical models" where he discusses what good choices for the priors on the variances can be.

— jpmuc

4

I'm sorry but I fail to see how this answers the question. I did not ask for a robust prior, but rather for a robust model.

— Rasmus Bååth

0

A robust estimator for the location parameter $\mu$ of some dataset of size $N$ is obtained when one assigns a Jeffreys prior to the variance $\sigma^2$ of the normal distribution, and computes the marginal for $\mu$ , yielding a $t$ distribution with $N$ degrees of freedom.

Similarly, if you want a robust estimator for the standard deviation $\sigma$ of some data $D$ , we can do the following:

First, we suppose that the data is normally distributed when its mean and standard deviation are known. Therefore, $\left.D\right|_{\mu,\sigma} \sim \mathcal{N}(\mu,\sigma^2)$, and if $D \equiv (d_1,\ldots,d_N)$ then

$p(D|\mu,\sigma^2) = \frac{1}{(\sqrt{2\pi}\sigma)^N} \exp\left(-\frac{N}{2\sigma^2}\left((m-\mu)^2+s^2\right)\right)$

where the sufficient statistics $m$ and $s^2$ are

$m=\frac{1}{N}\sum_{i=1}^N d_i, \qquad s^2 = \frac{1}{N}\sum_{i=1}^N d_i^2 - m^2$

In addition, using Bayes' theorem, we have

$p(\mu,\sigma^2|D) \propto p(D|\mu,\sigma^2)\, p(\mu,\sigma^2)$

A convenient prior for $(\mu,\sigma^2)$ is the normal-inverse-gamma family, which covers a wide range of shapes and is conjugate to this likelihood. This means that the posterior distribution $p(\mu,\sigma^2|D)$ still belongs to the normal-inverse-gamma family, and its marginal $p(\sigma^2|D)$ is an inverse gamma distribution parameterized as

$\left.\sigma^2\right|_{D} \sim \mathcal{IG}\left(\alpha+N/2,\,2\beta+Ns^2\right), \qquad \alpha,\beta>0$

From this distribution, we can take the mode, which will give us an estimator for $\sigma^2$. This estimator will be more or less tolerant to small deviations caused by model misspecification, depending on the choice of $\alpha$ and/or $\beta$. The variance of this distribution will then provide some indication of the fault tolerance of the estimate. Since the tails of the inverse gamma are semi-heavy, you get the kind of behaviour you would expect from the $t$ distribution estimate for $\mu$ that you mention.
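As a sanity check on the likelihood in terms of the sufficient statistics, the following R sketch verifies numerically that writing the exponent as $-\frac{N}{2\sigma^2}\left((m-\mu)^2+s^2\right)$ reproduces the Gaussian log-likelihood computed directly from the observations. The data and the values of `mu` and `sigma` are arbitrary assumptions used only for the check.

```r
# Numerical check of the sufficient-statistic form of the Gaussian likelihood.
# The data, mu and sigma below are arbitrary values chosen for illustration.
set.seed(7)
d <- rnorm(50, mean = 3, sd = 2)
N <- length(d)
mu <- 2.5
sigma <- 1.7

m  <- mean(d)               # m   = (1/N) * sum(d_i)
s2 <- mean(d^2) - m^2       # s^2 = (1/N) * sum(d_i^2) - m^2

# Log-likelihood written with the sufficient statistics (m, s^2) ...
loglik_suff <- -N * log(sqrt(2 * pi) * sigma) -
  N / (2 * sigma^2) * ((m - mu)^2 + s2)

# ... and computed directly from the individual observations.
loglik_direct <- sum(dnorm(d, mean = mu, sd = sigma, log = TRUE))

print(c(sufficient = loglik_suff, direct = loglik_direct))
```

The two values agree up to floating-point error, which follows from $\sum_i (d_i-\mu)^2 = N\left((m-\mu)^2 + s^2\right)$.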

— yannick

1

"A robust estimator for the location parameter μ of some dataset of size N is obtained when one assigns a Jeffreys prior to the variance

$\sigma^2$ of the normal distribution." Isn't this Normal model you describe a typical example of a non-robust model? That is, a single value that is off can have great influence on the parameters of the model. There is a big difference between the posterior over the mean being a t-distribution (as in your case) and the distribution for the data being a t-distribution (as is a common example of a robust Bayesian model for estimating the mean).

— Rasmus Bååth

1

It all depends on what you mean by robust. What you are saying right now is that you would like robustness wrt data. What I was proposing was robustness wrt model mis-specification. They are both different types of robustness.

— yannick

2

I would say that the examples I gave, MAD and using a t distribution as the distribution for the data are examples of robustness with respect to data.

— Rasmus Bååth

I would say Rasmus is right, and so would Gelman et al. in BDA3, as would a basic understanding that the t distribution has fatter tails than the normal for the same location parameter.

— Brash Equilibrium

0

I have followed the discussion from the original question. Rasmus when you say robustness I am sure you mean in the data (outliers, not miss-specification of distributions). I will take the distribution of the data to be Laplace distribution instead of a t-distribution, then as in normal regression where we model the mean, here we will model the median (very robust) aka median regression (we all know). Let the model be:

$Y=\beta X+\epsilon$, where $\epsilon \sim \mathrm{Laplace}(0, \sigma^2)$.

Of course our goal is to estimate the model parameters. We expect our priors to be vague in order to have an objective model. The model at hand has a posterior of the form $f(\beta,\sigma,Y,X)$. Giving $\beta$ a normal prior with large variance makes that prior vague, and $\sigma^2$ is given a chi-squared prior with small degrees of freedom to mimic a Jeffreys prior (a vague prior). What happens with a Gibbs sampler? Normal prior + Laplace likelihood = ??? We do not know. Chi-squared prior + Laplace likelihood = ??? Again, we do not know the distribution. Fortunately, there is a theorem in (Aslan, 2010) that transforms a Laplace likelihood into a scale mixture of normal distributions, which then enables us to enjoy the conjugate properties of our priors. I think the whole process described is fully robust in terms of outliers. In a multivariate setting the chi-squared becomes a Wishart distribution, and we use multivariate Laplace and normal distributions.
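To illustrate the kind of representation this relies on, here is a small R simulation of the standard exponential scale-mixture form of the Laplace distribution: if $W \sim \mathrm{Exp}(1)$ and $Z \sim N(0,1)$ are independent, then $b\sqrt{2W}\,Z$ is Laplace with location 0 and scale $b$. Whether this is exactly the form of the theorem cited above is an assumption on my part; the scale `b`, the seed and the sample size are also illustrative choices.

```r
# Sketch: a Laplace(0, b) variate as a scale mixture of normals.
# If W ~ Exp(1) and Z ~ N(0, 1) are independent, then b * sqrt(2 * W) * Z
# follows a Laplace distribution with location 0 and scale b.
# b, the seed and the sample size are illustrative assumptions.
set.seed(99)
n <- 100000
b <- 1.5

w <- rexp(n, rate = 1)                          # latent mixing variable
x <- rnorm(n, mean = 0, sd = b * sqrt(2 * w))   # normal given the latent scale

# Compare simulated moments with the Laplace(0, b) values.
print(c(mean_abs = mean(abs(x)), theory = b))       # E|X|   = b
print(c(variance = var(x), theory = 2 * b^2))       # Var(X) = 2 * b^2
```

Conditional on the latent `w`, each observation is normal, which is what allows conjugate updates for the remaining parameters inside a Gibbs sampler.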

— Chamberlain Foncha

2

Your solution seems to be focused on robust estimation of the location(mean/median). My question was rather about estimation of scale with the property of consistency with respect to retrieving the SD when the data generating distribution actually is normal.

— Rasmus Bååth

With a robust estimate of the location, the scale as function of the location immediately benefits from the robustness of the location. There is no other way of making the scale robust.

— Chamberlain Foncha

Anyway I must say I am eagerly waiting to see how this problem will be tackled most especially with a normal distribution as you emphasized.

— Chamberlain Foncha

0

Suppose that you have $K$ groups and you want to model the distribution of their sample variances, perhaps in relation to some covariates $\mathbf{x}$. That is, suppose that your data point for group $k \in \{1, \ldots, K\}$ is $\textrm{Var}(y_k) \in [0, \infty)$. The question here is, "What is a robust model for the likelihood of the sample variance?" One way to approach this is to model the transformed data $\textrm{ln}[\textrm{Var}(y_k)]$ as coming from a $t$ distribution, which as you have already mentioned is a robust version of the normal distribution. If you don't feel like assuming that the transformed variance is approximately normal as $n \rightarrow \infty$, then you could choose a probability distribution with positive real support that is known to have heavy tails compared to another distribution with the same location. For example, there is a recent answer to a question on Cross Validated about whether the lognormal or gamma distribution has heavier tails, and it turns out that the lognormal distribution does (thanks to @Glen_b for that contribution). In addition, you could explore the half-Cauchy family.
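The tail comparison mentioned above can be checked numerically. The R sketch below matches a gamma distribution to a lognormal on mean and variance and compares their upper-tail probabilities; the lognormal parameters and the evaluation points are arbitrary assumptions, and matching on mean and variance is only one possible way of putting the two distributions on a common footing.

```r
# Sketch: compare upper-tail probabilities of a lognormal and a gamma
# distribution matched on mean and variance. The lognormal parameters and the
# evaluation points are arbitrary choices made for illustration.
meanlog <- 0
sdlog <- 1

# Mean and variance of the lognormal.
ln_mean <- exp(meanlog + sdlog^2 / 2)
ln_var  <- (exp(sdlog^2) - 1) * exp(2 * meanlog + sdlog^2)

# Gamma distribution with the same mean and variance.
g_shape <- ln_mean^2 / ln_var
g_rate  <- ln_mean / ln_var

q <- c(10, 20, 50)
tail_lnorm <- plnorm(q, meanlog, sdlog, lower.tail = FALSE)
tail_gamma <- pgamma(q, shape = g_shape, rate = g_rate, lower.tail = FALSE)

print(data.frame(q = q, lognormal = tail_lnorm, gamma = tail_gamma))
```

Far enough into the tail the lognormal probabilities exceed the gamma ones by a growing factor, consistent with the lognormal having the heavier tail.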

Similar reasoning applies if instead you are assigning a prior distribution over a scale parameter for a normal distribution. Tangentially, the lognormal and inverse-gamma distributions are not advisable if you want to form a boundary-avoiding prior for the purposes of posterior mode approximation, because they peak sharply if you parameterize them so that the mode is near zero. See BDA3 chapter 13 for discussion. So in addition to identifying a robust model in terms of tail thickness, keep in mind that kurtosis may matter to your inference, too.

I hope this helps you as much as your answer to one of my recent questions helped me.

— Brash Equilibrium

1

My question was about the situation when you have one group and how to robustly estimate the scale of that group. In the case of outliers I don't believe the sample variance is considered robust.

— Rasmus Bååth

If you have one group, and you are estimating its normal distribution, then your question applies to the form of the prior over its scale parameter. As my answer implies, you can use a t distribution over its log transformation or choose a fat-tailed distribution with positive real support, being careful about other aspects of that distribution such as its kurtosis. Bottom line, if you want a robust model for a scale parameter, use a t distribution over its log transform or some other fat-tailed distribution.

— Brash Equilibrium

What would be a robust Bayesian model for estimating the scale of a roughly normal distribution?

EDIT: