When you ask a precise question about a precise concept (robust estimation), I will give you an equally precise answer. But first I will try to dispel an unfounded assumption: it is not true that there exist robust Bayesian estimates of location (there are Bayesian estimators of location, but, as I illustrate below, they are not robust and, apparently, even the simplest robust estimators of location are not Bayesian). In my opinion, the reasons for this absence of overlap between the 'Bayesian' and 'robust' paradigms in the location case go a long way toward explaining why there are also no estimators of scatter that are both robust and Bayesian.
> With suitable priors on $m$, $s$ and $\nu$, $m$ will be an estimate of the mean of $y_i$ that will be robust against outliers.
Actually, no. The resulting estimates will only be robust in a very weak sense of the word. When we say that the median is robust to outliers, we mean the word in a much stronger sense: in robust statistics, the robustness of the median refers to the property that if you compute the median on a data set of observations drawn from a unimodal, continuous model and then replace fewer than half of these observations by arbitrary values, the value of the median computed on the contaminated data stays close to the value you would have obtained had you computed it on the original (uncontaminated) data set.
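As a quick illustration of this property, here is a minimal R sketch (the sample size, seed and contaminating value are arbitrary choices for the example):

set.seed(1)
x <- rnorm(100)            # clean sample from a unimodal, continuous model
x_cont <- x
x_cont[1:40] <- 1e6        # replace fewer than half the observations by arbitrary values
median(x); median(x_cont)  # the median stays close to the bulk of the clean data
mean(x); mean(x_cont)      # the mean, by contrast, is dragged towards the contaminants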
It is then easy to show that the estimation strategy you propose in the paragraph quoted above is definitely not robust in the sense in which the word is typically understood for the median.
I'm wholly unfamiliar with Bayesian analysis. However, I was wondering what is wrong with the following strategy, as it seems simple and effective and yet has not been considered in the other answers. The prior is that the good part of the data is drawn from a symmetric distribution $F$ and that the rate of contamination is less than half. Then, a simple strategy would be to:
- compute the median/mad of your dataset. Then compute:
$z_i=\frac{|x_i-\text{med}(x)|}{\text{mad}(x)}$
- exclude the observations for which $z_i>q_{\alpha}(z|x\sim F)$ (this is the $\alpha$ quantile of the distribution of $z$ when $x\sim F$). This quantity is available for many choices of $F$ and can be bootstrapped for the others (see the sketch after this list).
- Run a (usual, non-robust) Bayesian analysis on the non-rejected observations.
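As an illustration of the cut-off step, here is a minimal R sketch, assuming Gaussian $F$ (in which case $z_i$ behaves approximately like the absolute value of a standard normal variable) and, as a non-Gaussian example, a Student $t_5$; the level $\alpha=0.99$, the seed and the simulation sizes are arbitrary choices of mine:

alpha <- 0.99
q_gauss <- qnorm((1 + alpha) / 2)   # cut-off when F is Gaussian: alpha-quantile of |N(0,1)|
# for other F, the distribution of z can be simulated/bootstrapped, e.g. F = t with 5 df:
set.seed(456)
z_sim <- replicate(1000, {
  x <- rt(100, df = 5)
  abs(x - median(x)) / mad(x)
})
q_t5 <- quantile(z_sim, alpha)      # simulated cut-off for F = t_5

Observations with $z_i$ above the cut-off are set aside, and the (usual, non-robust) Bayesian analysis is run on the rest.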
EDIT:
Thanks to the OP for providing self-contained R code to conduct a bona fide Bayesian analysis of the problem.
The code below compares the Bayesian approach suggested by the O.P. to its alternative from the robust statistics literature (e.g. the fitting method proposed by Gauss for the case where the data may contain as many as $n/2-2$ outliers and the distribution of the good part of the data is Gaussian).
The central part of the data is $N(1000,1)$:
n<-100
set.seed(123)
y<-rnorm(n,1000,1)
Add some amount of contaminants:
y[1:30]<-y[1:30]/100-1000
w<-rep(0,n)
w[1:30]<-1
The indicator w takes value 1 for the outliers. I begin with the approach
suggested by the O.P.:
library("rjags")
model_string<-"model{
for(i in 1:length(y)){
y[i]~dt(mu,inv_s2,nu)
}
mu~dnorm(0,0.00001)
inv_s2~dgamma(0.0001,0.0001)
s<-1/sqrt(inv_s2)
nu~dexp(1/30)
}"
model<-jags.model(textConnection(model_string),list(y=y))
mcmc_samples<-coda.samples(model,"mu",n.iter=1000)
print(summary(mcmc_samples)$statistics[1:2])
summary(mcmc_samples)
I get:
Mean SD
384.2283 97.0445
and:
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
184.6 324.3 384.7 448.4 577.7
(thus quite far from the target values)
For the robust method,
z<-abs(y-median(y))/mad(y)                            # robust z-scores based on median and MAD
th<-max(abs(rnorm(length(y))))                        # cut-off: largest |N(0,1)| in a sample of size n
print(c(mean(y[which(z<=th)]),sd(y[which(z<=th)])))   # mean and sd of the retained observations
one gets:
1000.149 0.8827613
(very close to the target values)
The second result is much closer to the real values.
But it gets worse. If we classify as outliers those observations for which the estimated z-score is larger than th (remember that the prior is that F is Gaussian), then the Bayesian approach finds that all the observations are outliers, whereas the robust procedure, in contrast, flags all and only the outliers as such (a sketch of this check is given below). This also implies that if you were to run a usual (non-robust) Bayesian analysis on the data not classified as outliers by the robust procedure, you should do fine (e.g. fulfil the objectives stated in your question).
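Here is a minimal sketch of how that classification can be checked in R. It assumes the JAGS chain above is run again monitoring both mu and s; the names post, mu_hat, s_hat and z_bayes are mine, not part of the original code:

post <- coda.samples(model, c("mu", "s"), n.iter = 1000)
mu_hat <- summary(post)$statistics["mu", "Mean"]       # posterior mean of the location
s_hat  <- summary(post)$statistics["s", "Mean"]        # posterior mean of the scale
z_bayes <- abs(y - mu_hat) / s_hat                     # z-scores implied by the Bayesian t fit
table(bayes_outlier = z_bayes > th, true_outlier = w)  # per the discussion above: flags all observations
table(robust_outlier = z > th, true_outlier = w)       # flags all and only the actual outliers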
This is just an example, but it is actually fairly straightforward to show (and it can be done formally; see, for example, chapter 2 of [1]) that the parameters of a Student $t$ distribution fitted to contaminated data cannot be depended upon to reveal the outliers.
- [1] Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. New York: John Wiley and Sons.
- [2] Huber, P. J. (1981). Robust Statistics. New York: John Wiley and Sons.