LogisticRegression: Unknown label type: 'continuous' using sklearn in python


73

I have the following code to test some of the most popular ML algorithms of the sklearn python library:

import numpy as np
from sklearn                        import metrics, svm
from sklearn.linear_model           import LinearRegression
from sklearn.linear_model           import LogisticRegression
from sklearn.tree                   import DecisionTreeClassifier
from sklearn.neighbors              import KNeighborsClassifier
from sklearn.discriminant_analysis  import LinearDiscriminantAnalysis
from sklearn.naive_bayes            import GaussianNB
from sklearn.svm                    import SVC

trainingData    = np.array([ [2.3, 4.3, 2.5],  [1.3, 5.2, 5.2],  [3.3, 2.9, 0.8],  [3.1, 4.3, 4.0]  ])
trainingScores  = np.array( [3.4, 7.5, 4.5, 1.6] )
predictionData  = np.array([ [2.5, 2.4, 2.7],  [2.7, 3.2, 1.2] ])

clf = LinearRegression()
clf.fit(trainingData, trainingScores)
print("LinearRegression")
print(clf.predict(predictionData))

clf = svm.SVR()
clf.fit(trainingData, trainingScores)
print("SVR")
print(clf.predict(predictionData))

clf = LogisticRegression()
clf.fit(trainingData, trainingScores)
print("LogisticRegression")
print(clf.predict(predictionData))

clf = DecisionTreeClassifier()
clf.fit(trainingData, trainingScores)
print("DecisionTreeClassifier")
print(clf.predict(predictionData))

clf = KNeighborsClassifier()
clf.fit(trainingData, trainingScores)
print("KNeighborsClassifier")
print(clf.predict(predictionData))

clf = LinearDiscriminantAnalysis()
clf.fit(trainingData, trainingScores)
print("LinearDiscriminantAnalysis")
print(clf.predict(predictionData))

clf = GaussianNB()
clf.fit(trainingData, trainingScores)
print("GaussianNB")
print(clf.predict(predictionData))

clf = SVC()
clf.fit(trainingData, trainingScores)
print("SVC")
print(clf.predict(predictionData))

The first two work fine, but I get the following error in the LogisticRegression call:

root@ubupc1:/home/ouhma# python stack.py 
LinearRegression
[ 15.72023529   6.46666667]
SVR
[ 3.95570063  4.23426243]
Traceback (most recent call last):
  File "stack.py", line 28, in <module>
    clf.fit(trainingData, trainingScores)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/logistic.py", line 1174, in fit
    check_classification_targets(y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

The input data is the same as in the previous calls, so what is going on here?

By the way, why is there such a huge difference in the first prediction between the LinearRegression() and SVR() algorithms (15.72 vs 3.95)?

Answers:


82

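You are passing floats to a classifier, which expects categorical values as the target vector. If you convert the targets to int, they will be accepted as input (although it is questionable whether that is the right way to do it).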

It would be better to convert your training scores using scikit's LabelEncoder.

The same is true for your DecisionTree and KNeighbors classifiers.

from sklearn import preprocessing
from sklearn import utils

lab_enc = preprocessing.LabelEncoder()
encoded = lab_enc.fit_transform(trainingScores)
# encoded -> array([1, 3, 2, 0], dtype=int64)

print(utils.multiclass.type_of_target(trainingScores))
# prints: continuous

print(utils.multiclass.type_of_target(trainingScores.astype('int')))
# prints: multiclass

print(utils.multiclass.type_of_target(encoded))
# prints: multiclass
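
To show how the encoded labels fit into the question's workflow, here is a minimal sketch that trains LogisticRegression on the encoded targets and maps the predictions back to the original scores with inverse_transform. The data arrays are copied from the question; the names predicted_classes and predicted_scores are introduced here for illustration only. Note that this treats every distinct score as its own class, so it only makes sense if the scores really are categories.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Data copied from the question.
trainingData = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
trainingScores = np.array([3.4, 7.5, 4.5, 1.6])
predictionData = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])

# Each distinct score becomes one integer class label.
lab_enc = LabelEncoder()
encoded = lab_enc.fit_transform(trainingScores)

# The classifier now receives discrete targets and no longer raises.
clf = LogisticRegression()
clf.fit(trainingData, encoded)

# Predictions are class indices; map them back to the original score values.
predicted_classes = clf.predict(predictionData)
predicted_scores = lab_enc.inverse_transform(predicted_classes)
print(predicted_scores)
```

Every predicted value is necessarily one of the four training scores, which is the main limitation of forcing a regression target through a classifier.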

1
Thanks! So I have to convert 2.3 to 23 and so on, right? Is there an elegant way to do this conversion using numpy or pandas?
harrison4

3
But in this example, which uses the LogisticRegression function: machinelearningmastery.com/… ..., the input data has floats and it works fine. Why?
harrison4

2
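The input can be floats, but the output needs to be categorical, i.e. int. In that example, column 8 is only 0 or 1. Usually it is the other way round with categorical labels, e.g. ['red', 'big', 'sick'], and you need to convert them to numerical values. Try scikit-learn.org/stable/modules/... or scikit-learn.org/stable/modules/generated/...
Maximilian Peters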

Is 2.3 the same as 23?
Ajay Kulkarni

24

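I ran into the same issue when trying to feed floats to the classifiers. I wanted to keep floats rather than integers for accuracy. Try using regressor algorithms instead. For example: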

import numpy as np
from sklearn import linear_model
from sklearn import svm

classifiers = [
    svm.SVR(),
    linear_model.SGDRegressor(),
    linear_model.BayesianRidge(),
    linear_model.LassoLars(),
    linear_model.ARDRegression(),
    linear_model.PassiveAggressiveRegressor(),
    linear_model.TheilSenRegressor(),
    linear_model.LinearRegression()]

trainingData    = np.array([ [2.3, 4.3, 2.5],  [1.3, 5.2, 5.2],  [3.3, 2.9, 0.8],  [3.1, 4.3, 4.0]  ])
trainingScores  = np.array( [3.4, 7.5, 4.5, 1.6] )
predictionData  = np.array([ [2.5, 2.4, 2.7],  [2.7, 3.2, 1.2] ])

for item in classifiers:
    print(item)
    clf = item
    clf.fit(trainingData, trainingScores)
    print(clf.predict(predictionData),'\n')
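
The same idea applies to the tree and nearest-neighbour models from the question: scikit-learn ships regressor counterparts of those classifiers. The following is a sketch under some assumptions: n_neighbors=2 is chosen here only because the toy data set has four samples (fewer than the default of five neighbours), and random_state=0 is set just to make the tree reproducible.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Data copied from the question.
trainingData = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
trainingScores = np.array([3.4, 7.5, 4.5, 1.6])
predictionData = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])

# Regressor counterparts of DecisionTreeClassifier and KNeighborsClassifier.
regressors = [
    DecisionTreeRegressor(random_state=0),
    KNeighborsRegressor(n_neighbors=2),  # assumption: must not exceed the 4 training samples
]

for reg in regressors:
    # Continuous targets are accepted because these are regressors.
    reg.fit(trainingData, trainingScores)
    print(type(reg).__name__, reg.predict(predictionData))
```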

19

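LogisticRegression is not for regression but for classification!

The Y variable must be a class label (for example 0 or 1), not a continuous variable.

A continuous target would make it a regression problem.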
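
As a small illustration of this point, the sketch below checks both kinds of target with type_of_target and fits LogisticRegression on a 0/1 target. The threshold of 4.0 used to derive binary labels is purely hypothetical and chosen only to produce two classes from the question's scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import type_of_target

# Data copied from the question.
trainingData = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
continuous_y = np.array([3.4, 7.5, 4.5, 1.6])

# Hypothetical binary labels: 1 if the score exceeds 4.0, otherwise 0.
binary_y = (continuous_y > 4.0).astype(int)

print(type_of_target(continuous_y))  # continuous
print(type_of_target(binary_y))      # binary

# A continuous target is rejected with a ValueError.
try:
    LogisticRegression().fit(trainingData, continuous_y)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# A class-label target is accepted.
clf = LogisticRegression()
clf.fit(trainingData, binary_y)
print(clf.predict(trainingData))
```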


I hope this isn't spam, but I have ended up here many times, and the error message is not very intuitive.
Thomas

This should be the accepted answer. LogisticRegression is indeed a classifier, hence the error.
导航
Licensed under cc by-sa 3.0 with attribution required.