There are several proper and strictly proper scoring rules for count data you can use. Scoring rules are penalties s(y,P) where P is the predictive distribution and y is the observed value. They have a number of desirable properties; first and foremost, a forecast that is closer to the true probability will always receive a smaller penalty, and there is a (unique) best forecast: the one whose predicted probability coincides with the true probability. Thus, minimizing the expected value of s(y,P) leads to reporting the true probabilities. See also Wikipedia.
Usually the score is averaged over all predicted values, as
S = (1/n) ∑i=1..n s(y(i), P(i))
Which rule to take depends on your goal, but I will give a rough description of when each can be used.
In the following, f(y) is the predictive probability mass function Pr(Y=y) and F(y) is the predictive cumulative distribution function; sums ∑k run over the support k = 0, 1, …, ∞; I is the indicator function; μ and σ are the mean and standard deviation of the predictive distribution.
Strictly proper scoring rules
- Brier score: s(y,P) = −2f(y) + ∑k f²(k) (stable for size imbalance in categorical predictors)
- Dawid-Sebastiani score: s(y,P) = ((y−μ)/σ)² + 2 log σ (good for general predictive model choice; stable for size imbalance in categorical predictors)
- Deviance score: s(y,P) = −2 log f(y) + g_y (g_y is a normalization term that only depends on y; in Poisson models it is usually taken as the saturated deviance; good for use with estimates from an ML framework)
- Logarithmic score: s(y,P)=−logf(y) (very easily calculated; stable for size imbalance in categorical predictors)
- Ranked probability score: s(y,P) = ∑k {F(k) − I(y≤k)}² (good for contrasting different predictions of very high counts; susceptible to size imbalance in categorical predictors)
- Spherical score: s(y,P) = f(y)/√(∑k f²(k)) (stable for size imbalance in categorical predictors)
Other scoring rules (not so proper but often used)
- Absolute error score: s(y,P)=|y−μ| (not proper)
- Squared error score: s(y,P) = (y−μ)² (not strictly proper; susceptible to outliers; susceptible to size imbalance in categorical predictors)
- Pearson normalized squared error score: s(y,P) = ((y−μ)/σ)² (not strictly proper; susceptible to outliers; can be used for model checking, by checking whether the averaged score is very different from 1; stable for size imbalance in categorical predictors)
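These improper scores only need the predictive mean μ (and, for the Pearson score, σ). A minimal sketch in base R, using made-up numbers for the observed count and the Poisson predictive mean (for a Poisson prediction, σ² = μ):

```r
# Hypothetical values: observed count x and Poisson predictive mean mu
mu <- 4.2
x  <- 7
# absolute error score
abs(x - mu)
# squared error score
(x - mu)^2
# Pearson normalized squared error score (sigma^2 = mu for Poisson)
(x - mu)^2 / mu
```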
Example R code for the strictly proper rules:
library(vcdExtra)
m1 <- glm(Freq ~ mental, family=poisson, data=Mental)
# scores for the first observation
mu <- predict(m1, type="response")[1]
x <- Mental$Freq[1]
# logarithmic (equivalent to deviance score up to a constant)
-log(dpois(x, lambda=mu))
# quadratic (Brier); sum over the support k = 0, 1, ... (truncated at 1000)
-2*dpois(x, lambda=mu) + sum(dpois(0:1000, lambda=mu)^2)
# spherical
-dpois(x, mu) / sqrt(sum(dpois(0:1000, lambda=mu)^2))
# ranked probability score (starting the first sum at -1 handles x = 0)
sum(ppois((-1):(x-1), mu)^2) + sum((ppois(x:10000, mu) - 1)^2)
# Dawid Sebastiani
(x-mu)^2/mu + log(mu)
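To compare models, the per-observation scores are averaged over the whole data set, as in the formula S = (1/n) ∑i s(y(i), P(i)) above. A minimal sketch with simulated data (base R only, intercept-only Poisson model; the data and model are made up for illustration):

```r
set.seed(1)
y  <- rpois(100, lambda = 5)          # simulated counts
m  <- glm(y ~ 1, family = poisson)    # intercept-only Poisson model
mu <- predict(m, type = "response")   # predictive means, one per observation
# mean logarithmic score (lower is better)
mean(-log(dpois(y, lambda = mu)))
# mean Dawid-Sebastiani score (sigma^2 = mu for Poisson)
mean((y - mu)^2 / mu + log(mu))
```

The model with the smaller mean score is preferred under that rule.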