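What is the best performance metric for a dataset balanced with SMOTE?

I oversampled my dataset with SMOTE and now have a balanced dataset. My problem is with the performance metrics: precision, recall, F1 score, and accuracy are all better on the imbalanced dataset than on the balanced one.

Which metric can I use to show that balancing the dataset can improve the model's performance?

Note: roc_auc_score is better on the balanced dataset than on the imbalanced one. Can it be considered a good performance measure? After the explanation below, I implemented the code and got these results: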

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from sklearn import metrics, preprocessing
from sklearn.svm import SVC, LinearSVC
# sklearn.cross_validation was removed from scikit-learn; these helpers now live in model_selection
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score

plt.rc("font", size=14)

# dataCAD is the asker's DataFrame; loading it is not shown in the post
X = dataCAD.iloc[:, 0:71]
y = dataCAD['Cardio1']
# Hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(y_test.value_counts())
model=SVC(C=0.001, kernel="rbf",gamma=0.01, probability=True)
t0 = time()
clf = model.fit(X_train,y_train)
y_pred = clf.predict(X_test)
t = time() - t0
print("=" * 52)
print("time cost: {}".format(t))
print()
print("confusion matrix\n", metrics.confusion_matrix( y_test, y_pred))
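cf = metrics.confusion_matrix(y_test, y_pred)
# Per-class accuracy averaged over both classes (row sums are the class counts, 50 and 14 here)
accuracy = (cf[0, 0] / cf[0].sum() + cf[1, 1] / cf[1].sum()) / 2
print("model accuracy \n", accuracy)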
print()
print("\t\tprecision_score: {}".format(metrics.precision_score( y_test, y_pred, average='macro')))
print()
print("\t\trecall_score: {}".format(metrics.recall_score(y_test, y_pred, average='macro')))
print()
print("\t\tf1_score: {}".format(metrics.f1_score(y_test, y_pred, average='macro')))
print()
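# Note: hard class labels are passed here; ROC AUC is normally computed from scores such as clf.predict_proba(X_test)[:, 1]
print("\t\troc_auc_score: {}".format(metrics.roc_auc_score( y_test, y_pred, average='macro')))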

Result:

Name: Cardio1, dtype: int64
====================================================
time cost: 0.012008905410766602

confusion matrix
 [[50  0]
 [14  0]]
model accuracy 
 0.5

        precision_score: 0.390625

        recall_score: 0.5

        f1_score: 0.43859649122807015

        roc_auc_score: 0.5

For the balanced dataset:

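# Assumption: sm is a SMOTE instance; its definition (and any random_state) is not shown in the original post
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=0)
# Newer imbalanced-learn releases use fit_resample in place of fit_sample
X_train1, y_train1 = sm.fit_resample(X_train, y_train)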
df= pd.DataFrame({'Cardio1': y_train1})
df.groupby('Cardio1').Cardio1.count().plot.bar(ylim=0)
plt.show()
print(X_train1.shape)
print(y_train1.shape)
#model=SVC(C=0.001, kernel="rbf",gamma=0.01, probability=True)
model=SVC(C=10, kernel="sigmoid",gamma=0.001, probability=True)
t0 = time()
clf = model.fit(X_train1,y_train1)
y_pred = clf.predict(X_test)
t = time() - t0
print("=" * 52)
print("time cost: {}".format(t))
print()
print("confusion matrix\n", metrics.confusion_matrix(y_test, y_pred))
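cf = metrics.confusion_matrix(y_test, y_pred)
# Per-class accuracy averaged over both classes (row sums are the class counts, 50 and 14 here)
accuracy = (cf[0, 0] / cf[0].sum() + cf[1, 1] / cf[1].sum()) / 2
print("model accuracy \n", accuracy)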
print()
#print("\t\taccuracy: {}".format(metrics.accuracy_score( y_test, y_pred)))
print()
print("\t\tprecision_score: {}".format(metrics.precision_score( y_test, y_pred, average='macro')))
print()
print("\t\trecall_score: {}".format(metrics.recall_score(y_test, y_pred, average='macro')))
print()
print("\t\tf1_score: {}".format(metrics.f1_score(y_test, y_pred, average='macro')))
print()
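# Note: hard class labels are passed here; ROC AUC is normally computed from scores such as clf.predict_proba(X_test)[:, 1]
print("\t\troc_auc_score: {}".format(metrics.roc_auc_score( y_test, y_pred, average='macro')))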

Result:

(246, 71)
(246,)
====================================================
time cost: 0.05353999137878418

confusion matrix
 [[ 0 50]
 [ 0 14]]
model accuracy 
 0.5


        precision_score: 0.109375

        recall_score: 0.5

        f1_score: 0.1794871794871795

        roc_auc_score: 0.5

I did not get any useful results. Should I use cross-validation to evaluate the model?

performance

— Rawia Sammout

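First, to be clear: you should not evaluate your models on the balanced dataset. Instead, split the dataset into a training set and a test set that, ideally, have the same degree of imbalance. Balance only the training set, and run the evaluation only on the untouched test set.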
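A minimal sketch of that workflow, under stated assumptions: the data is synthetic (`make_classification`), class 1 is taken to be the minority, and plain random oversampling via `sklearn.utils.resample` stands in for SMOTE so that the example needs only scikit-learn. With imbalanced-learn installed, `SMOTE().fit_resample(X_train, y_train)` would take the place of the resampling step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: roughly 80% class 0, 20% class 1).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# stratify=y keeps the class ratio (nearly) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Balance the TRAINING set only. Random oversampling is a stand-in for SMOTE:
# draw extra minority rows with replacement until both classes are equal.
majority = np.flatnonzero(y_train == 0)
minority = np.flatnonzero(y_train == 1)
extra = resample(minority, replace=True,
                 n_samples=len(majority) - len(minority), random_state=0)
idx = np.concatenate([majority, minority, extra])
X_train_bal, y_train_bal = X_train[idx], y_train[idx]

print("train, balanced :", np.bincount(y_train_bal))
print("test, untouched :", np.bincount(y_test))
```

The model would then be fitted on `X_train_bal, y_train_bal` and scored on `X_test, y_test`, which keeps its original class ratio.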

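As for your question, any macro-averaged metric should be enough to show that your balancing technique works. To compute such a metric (say accuracy, for simplicity), you compute the accuracy on each class separately and then average the results.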

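Example:
We trained two models, m1 and m2: the first on the imbalanced dataset, the second after balancing the dataset with SMOTE.

Actual values: 0, 0, 0, 0, 0, 0, 0, 0, 1, 1
Predicted m1:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0 <- predicts only the majority class
Predicted m2:  1, 0, 0, 1, 0, 1, 0, 0, 1, 1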

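How would we normally compute accuracy?

$acc = \frac{\text{correct predictions}}{\text{total predictions}}$

How do our two models do on this metric?

$acc_1 = \frac{8}{10} = 80\%$
$acc_2 = \frac{7}{10} = 70\%$

According to this metric, m1 is better than m2. That is not necessarily true, though, because m1 simply predicts the majority class every time! To show that m2 is better than m1, we need a metric that treats the two classes as equals.

Now let's compute a macro-averaged accuracy. How? First we compute the accuracy on each class separately, then we average them: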

For m1:
$acc_1^0 = \frac{8}{8} = 100\%$ <- m1's accuracy on class 0
$acc_1^1 = \frac{0}{2} = 0\%$ <- m1's accuracy on class 1
$macro\_acc_1 = \frac{acc_1^0 + acc_1^1}{2} = \frac{100\% + 0\%}{2} = 50\%$

For m2:
$acc_2^0 = \frac{5}{8} = 62.5\%$ <- m2's accuracy on class 0
$acc_2^1 = \frac{2}{2} = 100\%$ <- m2's accuracy on class 1
$macro\_acc_2 = \frac{acc_2^0 + acc_2^1}{2} = \frac{62.5\% + 100\%}{2} = 81.25\%$
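The numbers above can be checked with a short script. This is only a sketch: `per_class_accuracy` is a hypothetical helper written for this example, not a library function. The accuracy on a class is the same quantity as that class's recall, so the macro-averaged accuracy matches what scikit-learn calls `balanced_accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
pred_m1 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # majority class only
pred_m2 = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])


def per_class_accuracy(y_true, y_pred):
    """Hypothetical helper: share of correct predictions within each true class."""
    return {int(c): float(np.mean(y_pred[y_true == c] == c))
            for c in np.unique(y_true)}


results = {}
for name, pred in [("m1", pred_m1), ("m2", pred_m2)]:
    per_class = per_class_accuracy(y_true, pred)
    macro = sum(per_class.values()) / len(per_class)
    results[name] = (accuracy_score(y_true, pred), per_class, macro)
    print(name, "accuracy:", results[name][0],
          "per class:", per_class, "macro accuracy:", macro)
    # The library function gives the same macro-averaged value.
    print(name, "balanced_accuracy_score:", balanced_accuracy_score(y_true, pred))
```

Plain accuracy ranks m1 first (80% against 70%), while the macro average ranks m2 first (81.25% against 50%).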

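Notes:

Macro averaging can be applied to any metric you like, but it is most common with confusion-matrix metrics (e.g. precision, recall, F1).
You don't need to implement it yourself; many libraries already provide it (e.g. sklearn's f1_score has a parameter named average, which can be set to "macro").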
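As an illustration of the `average="macro"` option, the sketch below scores the two toy prediction vectors from the example. The `zero_division=0` argument is my addition: it silences the warning raised when a class is never predicted (as with m1) and counts that class's precision as 0.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = {
    "m1": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # majority class only
    "m2": [1, 0, 0, 1, 0, 1, 0, 0, 1, 1],
}

results = {}
for name, y_pred in preds.items():
    # average="macro": compute the metric per class, then take the unweighted mean.
    results[name] = {
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
    print(name, results[name])
```

On these vectors all three macro-averaged scores come out higher for m2 than for m1.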

— Djib2011

Thank you very much for the great explanation; it is clear and concise. Could you suggest some scientific articles?

— Rawia Sammout

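A few articles on the matter: 1, 2, 3. These articles essentially survey methods for dealing with class imbalance (over-/undersampling, class weights, etc.) and the metrics that can be used in these situations (ROC, G-mean, quadratic kappa, etc.).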

— Djib2011

Could you take a look at the shared code? I got a confusing result when using SMOTE.

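Judging from the confusion matrices, your first model (without balancing) predicts only the majority class, while the second (with SMOTE) predicts only the other class. I suggest trying another classifier, because SVMs need a lot of hyperparameter tuning (i.e. running the model again and again to find the best C, gamma, kernel type, etc.).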
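For reference, the tuning loop can be automated and scored with a macro-averaged metric instead of plain accuracy. The sketch below rests on assumptions: the data is synthetic, the parameter grid is arbitrary, and the `StandardScaler` step is my addition rather than something from the posted code. It omits SMOTE; imbalanced-learn's own `Pipeline` class is the usual way to resample inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption: roughly 80% / 20%).
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scaling plus SVC; make_pipeline names the steps "standardscaler" and "svc".
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],
    "svc__kernel": ["rbf", "sigmoid"],
}

# Stratified folds keep the class ratio in every fold; the scorer is
# balanced accuracy, i.e. the macro-averaged per-class accuracy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=cv)
search.fit(X_train, y_train)

test_score = balanced_accuracy_score(y_test, search.predict(X_test))
print("best parameters:", search.best_params_)
print("cross-validated balanced accuracy:", search.best_score_)
print("test balanced accuracy:", test_score)
```

Because the search optimizes balanced accuracy, a parameter setting that collapses to a single class scores only 0.5 and is unlikely to be selected.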

— Djib2011 '18

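Thank you. I think changing the classifier is the better option, because I already tuned the parameters with grid search and trained both models with the best hyperparameters it found.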

— Rawia Sammout