Exercise 2.2 of The Elements of Statistical Learning

10

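The textbook first generates some two-class data via:

[image]

which gives:

[image]

It then asks:

[image]

I tried to solve this by first modeling the setup with this graphical model:

[image]

where $c$ is the label, $h\,(1\le h \le 10)$ is the index of the selected mean $m_h^c$, and $x$ is the data point. This gives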

$$\begin{align*}
\Pr(x\mid m_h^c) =& \mathcal{N}(m_h^c,\mathbf{I}/5)\\
\Pr(m_h^c\mid h,c=\mathrm{blue}) =& \mathcal{N}((1,0)^T,\mathbf{I})\\
\Pr(m_h^c\mid h,c=\mathrm{orange}) =& \mathcal{N}((0,1)^T,\mathbf{I})\\
\Pr(h) =& \frac{1}{10}\\
\Pr(c) =& \frac{1}{2}
\end{align*}$$

On the other hand, the boundary is $\{x:\Pr(c=\mathrm{blue}\mid x)=\Pr(c=\mathrm{orange}\mid x)\}$. By Bayes' rule, we have

$$\begin{align*}
\Pr(c\mid x) =& \frac{\Pr(x\mid c)\Pr(c)}{\sum_c\Pr(x\mid c)\Pr(c)}\\
\Pr(x\mid c) =& \sum_h\int_{m_h^c}\Pr(h)\Pr(m_h^c\mid h,c)\Pr(x\mid m_h^c)
\end{align*}$$

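But then I noticed that the problem setup is symmetric, so this would probably just yield $x=y$ as the boundary. If instead the question asks for the boundary conditional on the $m_h^c$ being fixed, the equation would contain $40$ parameters (20 means with two coordinates each), which I think is unlikely to be the purpose of the exercise.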
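To spell out the symmetry argument, here is a sketch that assumes the means are integrated out under the priors stated earlier. I write $x=(x_1,x_2)^T$ and $\mu_{\mathrm{blue}}=(1,0)^T$, $\mu_{\mathrm{orange}}=(0,1)^T$ (my notation, not the book's), so "$x=y$" means $x_1=x_2$. The sum over $h$ is a sum of ten identical terms, and convolving the two Gaussians adds their covariances:

$$\begin{align*}
\Pr(x\mid c) =& \sum_h \Pr(h)\int \mathcal{N}(x;\,m,\,\mathbf{I}/5)\,\mathcal{N}(m;\,\mu_c,\,\mathbf{I})\,dm = \mathcal{N}\!\left(x;\,\mu_c,\,\tfrac{6}{5}\mathbf{I}\right)\\
\Pr(c=\mathrm{blue}\mid x)=\Pr(c=\mathrm{orange}\mid x) \iff& (x_1-1)^2+x_2^2 = x_1^2+(x_2-1)^2 \iff x_1=x_2
\end{align*}$$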

So am I misunderstanding something? Thanks.

self-study bayesian

— 紫苑
source

8

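I don't think you are supposed to find an analytic expression for the Bayes decision boundary for a given realisation of the $m_k$'s. Similarly, I doubt you are supposed to get the boundary over the distribution of the $m_k$, since, as you noted, that is just $x=y$ by symmetry.

I think what you need to show is a program that can compute the decision boundary for a given realisation of the $m_k$'s. This can be done by setting up a grid of $x$ and $y$ values, computing the class-conditional densities, and finding the points where they are equal.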
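In symbols, a sketch of what such a grid search evaluates, assuming equal class priors and the $\mathbf{I}/5$ covariance from the question (here $x$ is the two-dimensional point, and the superscript on $m_k$ marks the class):

$$\begin{align*}
p(x\mid c, m_1^c,\dots,m_{10}^c) =& \frac{1}{10}\sum_{k=1}^{10}\mathcal{N}(x;\,m_k^c,\,\mathbf{I}/5) = \frac{1}{10}\sum_{k=1}^{10}\frac{5}{2\pi}\exp\!\left(-\tfrac{5}{2}\lVert x-m_k^c\rVert^2\right)\\
\text{boundary} =& \left\{x:\ \sum_{k=1}^{10}\exp\!\left(-\tfrac{5}{2}\lVert x-m_k^{\mathrm{blue}}\rVert^2\right)=\sum_{k=1}^{10}\exp\!\left(-\tfrac{5}{2}\lVert x-m_k^{\mathrm{orange}}\rVert^2\right)\right\}
\end{align*}$$

The constant factors $\frac{1}{10}$ and $\frac{5}{2\pi}$ are the same for both classes, so they cancel in the comparison.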

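This code is a stab at it. IIRC there is actually code to compute the decision boundary in Modern Applied Statistics with S, but I don't have that handy right now.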

# for dmvnorm/rmvnorm: multivariate normal distribution
library(mvtnorm)

# class-conditional density given mixture centers
f <- function(x, m)
{
    out <- numeric(nrow(x))
    for(i in seq_len(nrow(m)))
        out <- out + dmvnorm(x, m[i, ], diag(0.2, 2))
    out
}

# generate the class mixture centers
m1 <- rmvnorm(10, c(1,0), diag(2))
m2 <- rmvnorm(10, c(0,1), diag(2))
# and plot them
plot(m1, xlim=c(-2, 3), ylim=c(-2, 3), col="blue")
points(m2, col="red")

# display contours of the class-conditional densities
dens <- local({
    x <- y <- seq(-3, 4, len=701)
    f1 <- outer(x, y, function(x, y) f(cbind(x, y), m1))
    f2 <- outer(x, y, function(x, y) f(cbind(x, y), m2))
    list(x=x, y=y, f1=f1, f2=f2)
})

contour(dens$x, dens$y, dens$f1, col="lightblue", lty=2, levels=seq(.3, 3, len=10),
        labels="", add=TRUE)

contour(dens$x, dens$y, dens$f2, col="pink", lty=2, levels=seq(.3, 3, len=10),
        labels="", add=TRUE)

# find which points are on the Bayes decision boundary
eq <- local({
    f1 <- dens$f1
    f2 <- dens$f2
    # grid indices where the relative difference of the two densities is near zero
    idx <- which(abs((f1 - f2)/(f1 + f2)) < 5e-3, arr.ind=TRUE)
    # map the row/column indices back to grid coordinates
    cbind(dens$x[idx[,1]], dens$y[idx[,2]])
})
points(eq, pch=16, cex=0.5, col="grey")

Result:

[image]
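As an alternative to thresholding grid points, the boundary can be traced as the zero-level contour of the difference between the two class densities. The following is a minimal base-R sketch under the same assumptions (equal priors, component covariance $\mathbf{I}/5$). The helper `mix_density`, the seed, and the grid resolution are hypothetical choices for illustration, not part of the code above.

```r
# Hypothetical helper: equal-weight mixture of isotropic Gaussians with variance s2,
# evaluated at the points (x, y) for the given matrix of centers (one per row).
mix_density <- function(x, y, centers, s2 = 0.2) {
    out <- numeric(length(x))
    for (i in seq_len(nrow(centers))) {
        d2 <- (x - centers[i, 1])^2 + (y - centers[i, 2])^2
        out <- out + exp(-d2 / (2 * s2)) / (2 * pi * s2)
    }
    out / nrow(centers)
}

set.seed(42)  # arbitrary seed, for reproducibility only

# mixture centers: N((1,0)^T, I) for class 1 and N((0,1)^T, I) for class 2
m1 <- cbind(rnorm(10, mean = 1), rnorm(10, mean = 0))
m2 <- cbind(rnorm(10, mean = 0), rnorm(10, mean = 1))

# difference of the class-conditional densities on a grid
gx <- gy <- seq(-3, 4, length.out = 141)
delta <- outer(gx, gy, function(x, y) mix_density(x, y, m1) - mix_density(x, y, m2))

# the Bayes boundary is where the difference changes sign, i.e. the level-0 contour
boundary <- contourLines(gx, gy, delta, levels = 0)

plot(m1, xlim = c(-3, 4), ylim = c(-3, 4), col = "blue", xlab = "x", ylab = "y")
points(m2, col = "red")
for (seg in boundary) lines(seg$x, seg$y, col = "grey40")
```

Tracing the contour gives connected line segments rather than a cloud of near-equal grid points, and it avoids choosing a tolerance such as `5e-3`.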

— 洪大井
source

3

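Actually, the book does ask for an analytical solution to this problem. And yes, you have to condition the boundary on something, but not on the 40 means: you never get to know them precisely. Instead, you have to condition on the 200 data points that you do get to see. So you will need 200 parameters, but because the answer is written as a sum, it does not look too complicated.

I would never have been able to derive this formula myself, so I only take credit for realizing that the analytical solution does not have to be ugly, and then searching for it on Google. Fortunately, it has been written up by some kind people; see pages 6-7.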

— 最高
source

2

I wish I had stumbled on the code above earlier; I had just finished writing some alternative code, shown below, for what it's worth.

set.seed(1)
library(MASS)

#create original 10 center points/means for each class 
I.mat=diag(2)
mu1=c(1,0);mu2=c(0,1)
mv.dist1=mvrnorm(n = 10, mu1, I.mat)
mv.dist2=mvrnorm(n = 10, mu2, I.mat)

values1=NULL;values2=NULL

#create 100 observations for each class: 10 times, pick a center point at random and draw 10 points from a bivariate normal (covariance I/5) around it
for(i in 1:10){
  mv.values1=mv.dist1[sample(nrow(mv.dist1),size=1,replace=TRUE),]
  sub.mv.dist1=mvrnorm(n = 10, mv.values1, I.mat/5)
  values1=rbind(sub.mv.dist1,values1)
}
values1

#similar as per above, for second class
for(i in 1:10){
  mv.values2=mv.dist2[sample(nrow(mv.dist2),size=1,replace=TRUE),]
  sub.mv.dist2=mvrnorm(n = 10, mv.values2, I.mat/5)
  values2=rbind(sub.mv.dist2,values2)
}
values2

#did not find probability function in MASS, so used mnormt
library(mnormt)

#create grid of points
grid.vector1=seq(-2,2,0.1)
grid.vector2=seq(-2,2,0.1)
length(grid.vector1)*length(grid.vector2)
grid=expand.grid(grid.vector1,grid.vector2)



#calculate the density at each grid point under each of the 10 component distributions of the first class
prob.1=matrix(0:0,nrow=1681,ncol=10) #initialize grid
for (i in 1:1681){
  for (j in 1:10){
    prob.1[i,j]=dmnorm(grid[i,], mv.dist1[j,], I.mat/5)  
  }
}
prob.1
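#note: this takes the largest component density per grid point, not the sum over components, so the resulting regions approximate the mixture-based Bayes rule
prob1.max=apply(prob.1,1,max)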

#second class - as per above
prob.2=matrix(0:0,nrow=1681,ncol=10) #initialize grid
for (i in 1:1681){
  for (j in 1:10){
    prob.2[i,j]=dmnorm(grid[i,], mv.dist2[j,], I.mat/5)  
  }
}
prob.2
prob2.max=apply(prob.2,1,max)

#bind
prob.total=cbind(prob1.max,prob2.max)
class=rep(1,1681)
class[prob1.max<prob2.max]=2
cbind(prob.total,class)

#plot points
plot(grid[,1], grid[,2],pch=".", cex=3,col=ifelse(class==1, "coral", "cornflowerblue"))

points(values1,col="coral")
points(values2,col="cornflowerblue")

#check - original centers
# points(mv.dist1,col="coral")
# points(mv.dist2,col="cornflowerblue")
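
The sampling loops above draw ten points around each of ten sampled centers. The model in the question ($\Pr(h)=1/10$ for every data point) instead lets each observation pick its own center. A minimal base-R sketch of that variant, for one class only; the names `centers`, `h`, `noise`, and `obs` and the seed are hypothetical choices for illustration:

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# ten centers for one class, drawn from N((1,0)^T, I) as in the question
centers <- cbind(rnorm(10, mean = 1), rnorm(10, mean = 0))

# each of the 100 observations picks its own center with probability 1/10
h <- sample(nrow(centers), size = 100, replace = TRUE)

# add N(0, I/5) noise around the chosen center
noise <- matrix(rnorm(200, sd = sqrt(1/5)), ncol = 2)
obs <- centers[h, ] + noise
```

The second class would be generated the same way with means `0` and `1` swapped when drawing `centers`.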

— 用户名
source