Collinear variables in multi-class LDA training



I am training a multi-class LDA classifier on data with 8 classes.

While training, I get the following warning: "Variables are collinear".

My training accuracy is over 90%.

I am using the scikits-learn library in Python to train and test the multi-class data.

I also get decent test accuracy (around 85%–95%).

I don't understand what this error/warning means. Please help me.

Answers:



Multicollinearity means that your predictor variables are correlated. Why is that bad?

Because LDA, like regression techniques, involves computing a matrix inverse, and that inversion is inaccurate when the determinant is close to 0 (i.e. two or more variables are almost a linear combination of each other).

More importantly, it makes the estimated coefficients impossible to interpret. If an increase in X1 is associated with a decrease in X2, and both of them increase the variable Y, then every change in X1 will be compensated by a change in X2, and you will underestimate the effect of X1 on Y. In LDA, you would underestimate the effect of X1 on the classification.

If all you care about is the classification itself, and after training your model on half of the data and testing it on the other half you get 85–95% accuracy, I would say it is fine.


So can I interpret it this way: a feature X1 in the feature vector is not a good choice if the test accuracy is low?
garak

I guess that if the test accuracy is low, no choice of features is a good one.
gui11aume

Interestingly, this problem shows up with LDA but not when I use QDA. I wonder what is different there?
garak

+1 for the answer, but "computing a matrix inverse" may not be accurate. We never compute the inverse explicitly; we use direct methods such as LU or QR, or iterative methods, instead.
Haitao Du

@hxd1011 Correct! For the record, could you say a word about what happens in LU / QR etc. when the matrix is nearly "singular", or point to a document that explains it?
gui11aume


I think gui11aume has given you a fine answer, but I want to give an example from a slightly different angle that might be illuminating. Suppose a covariate in the discriminant function is collinear as follows:

X1 = 5X2 + 3X3 - X4

Suppose the best LDA has the following linear boundary:

X1 + 2X2 + X3 - 2X4 = 5

Then we can substitute 5X2 + 3X3 - X4 for X1, and the boundary equation becomes:

5X2 + 3X3 - X4 + 2X2 + X3 - 2X4 = 5

or

7X2 + 4X3 - 3X4 = 5

These two boundaries are identical, but the first one has coefficients 1, 2, 1, -2 for X1, X2, X3, and X4 respectively, while the other has coefficients 0, 7, 4, -3.

So the coefficients are quite different, but the two equations give the same boundary and an identical prediction rule. If one form is good, so is the other. But now you can see why gui11aume says the coefficients are uninterpretable.

There are several other ways to express this boundary as well, by substituting for X2 to give it the 0 coefficient, and the same could be done for X3 or X4. In practice, though, the collinearity is only approximate. That makes things worse, because the noise now allows for a unique answer: very slight perturbations of the data will cause the coefficients to change drastically. But for prediction you are fine, because each equation defines almost the same boundary, and so LDA will produce nearly identical predictions.
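The algebra above can be checked numerically (a small sketch of my own; X2, X3, X4 are sampled at random and X1 is constructed to be exactly collinear with them):

```python
import numpy as np

rng = np.random.RandomState(0)
# Sample X2, X3, X4 freely and construct X1 = 5*X2 + 3*X3 - X4 exactly.
Z = rng.randn(1000, 3)              # columns: X2, X3, X4
X1 = 5 * Z[:, 0] + 3 * Z[:, 1] - Z[:, 2]

# Boundary 1: X1 + 2*X2 + X3 - 2*X4 - 5  (coefficients 1, 2, 1, -2)
side1 = X1 + 2 * Z[:, 0] + Z[:, 1] - 2 * Z[:, 2] - 5
# Boundary 2: 7*X2 + 4*X3 - 3*X4 - 5     (coefficients 0, 7, 4, -3)
side2 = 7 * Z[:, 0] + 4 * Z[:, 1] - 3 * Z[:, 2] - 5

# The decision values agree at every point, so the prediction rule is identical.
print(np.allclose(side1, side2))  # → True
```

Very different coefficient vectors, same boundary: exactly why the individual coefficients cannot be interpreted under collinearity.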



While the answer marked here is correct, I think you were looking for a different explanation of what happened in your code. I had the exact same issue when running a model.

Here's what's going on: you're training your model with the predicted variable as part of your data set. Here's an example of what was happening to me without my even noticing it:

import pandas as pd

df = pd.read_csv('file.csv')
df.columns = ['COL1','COL2','COL3','COL4']
train_Y = df['COL3']
train_X = df[df.columns[:-1]]   # every column except the last: COL1, COL2 and COL3

In this code, I want to predict the value of 'COL3'... but if you look at train_X, I'm telling it to retrieve every column except the last one, so it is taking COL1, COL2 and COL3 as input (not COL4) and trying to predict COL3, which is itself part of train_X.

I corrected this by moving the columns: I manually moved COL3 in Excel so that it became the last column in my data set (now taking the place of COL4), and then:

df = pd.read_csv('file.csv')
df.columns = ['COL1','COL2','COL3','COL4']
train_Y = df['COL4']
train_X = df[df.columns[:-1]]

If you don't want to move it in Excel, and want to just do it by code then:

df = pd.read_csv('file.csv')
df.columns = ['COL1','COL2','COL3','COL4']
train_Y = df['COL3']
train_X = df[['COL1','COL2','COL4']]

Note how train_X is now declared to include all columns except COL3, which goes into train_Y.

I hope that helps.
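A slightly more robust variant of the last snippet (my own sketch; the in-memory DataFrame stands in for the hypothetical file.csv above) is to drop the target column by name with `DataFrame.drop`, so the selection keeps working even if columns are added or reordered:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('file.csv'):
df = pd.DataFrame({'COL1': [1, 2, 3], 'COL2': [4, 5, 6],
                   'COL3': [0, 1, 0], 'COL4': [7, 8, 9]})

train_Y = df['COL3']
train_X = df.drop(columns=['COL3'])   # everything except the target

print(list(train_X.columns))  # → ['COL1', 'COL2', 'COL4']
```

With this pattern the target can never leak into the features by position, which is exactly the bug described above.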

Licensed under cc by-sa 3.0 with attribution required.