Answers:
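Multicollinearity means that your predictors are correlated. Why is this bad?
Because LDA, like regression techniques, involves computing a matrix inversion, and that inversion is inaccurate if the determinant is close to 0 (i.e. two or more variables are almost a linear combination of each other).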
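To make the determinant point concrete, here is a minimal sketch in plain Python. It assumes two standardized predictors with correlation `rho`, so their covariance matrix is `[[1, rho], [rho, 1]]`. The helper names `inverse_2x2` and `cov` are mine and purely illustrative; they are not part of any LDA library.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix [[a, b], [c, d]]; also return its determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return inv, det


def cov(rho):
    """Covariance matrix of two unit-variance predictors with correlation rho (assumed setup)."""
    return [[1.0, rho], [rho, 1.0]]


# As rho approaches 1 the determinant (1 - rho**2) shrinks towards 0
# and the entries of the inverse grow without bound.
for rho in (0.5, 0.99, 0.999999):
    inv, det = inverse_2x2(cov(rho))
    print(f"rho={rho}: det={det:.3e}, inverse[0][0]={inv[0][0]:.3e}")
```

With entries that large, tiny rounding or sampling errors in the covariance matrix are magnified in the inverse, which is where the inaccuracy comes from.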
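More importantly, it makes the estimated coefficients impossible to interpret. If an increase in $X_1$, say, is associated with a decrease in $X_2$ and both of them increase the variable $Y$, every change in $X_1$ will be compensated by a change in $X_2$, and you will underestimate the effect of $X_1$ on $Y$. In LDA, you would underestimate the effect of $X_1$ on the classification.
If all you care about is the classification itself, and you get 85-95% accuracy after training your model on half of the data and testing it on the other half, I'd say it is fine.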
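I think gui11aume has given you a great answer, and I want to give an example from a slightly different angle that might be illuminating. Consider the case where the covariates in the discriminant function are related as follows:
Suppose the best LDA has the following linear boundary:
or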
。
These two boundaries are identical but the first one has coefficients for , , , and respectively, while the other has coefficients .
So the coefficients are quite different, but the two equations give the same boundary and an identical prediction rule. If one form is good, so is the other. But now you can see why gui11aume says the coefficients are uninterpretable.
There are several other ways to express this boundary as well by substituting for to give it the coefficient and the same could be done for or . But in practice the collinearity is approximate. This makes things worse because the noise allows for a unique answer. Very slight perturbations of the data will cause the coefficients to change drastically. But for prediction you are okay because each equation defines almost the same boundary and so LDA will result in nearly identical predictions.
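The substitution argument can be checked numerically. The sketch below uses made-up numbers of my own (they are not the ones from the answer): it assumes an exact collinearity `x1 = 2*x2 + x3` and a hypothetical boundary `x1 + x2 + x3 = 4`, which becomes `3*x2 + 2*x3 = 4` after substitution. Integer inputs keep the arithmetic exact.

```python
def score_full(x1, x2, x3):
    # Hypothetical boundary written with all three covariates: x1 + x2 + x3 = 4
    return x1 + x2 + x3 - 4


def score_reduced(x2, x3):
    # Same boundary after substituting the assumed relation x1 = 2*x2 + x3
    return 3 * x2 + 2 * x3 - 4


# Build points that satisfy the collinearity exactly.
points = []
for x2 in range(-10, 11):
    for x3 in range(-10, 11):
        x1 = 2 * x2 + x3
        points.append((x1, x2, x3))

# Coefficients (1, 1, 1) versus (0, 3, 2): very different,
# yet every point falls on the same side of both boundaries.
same_side = all(
    (score_full(x1, x2, x3) > 0) == (score_reduced(x2, x3) > 0)
    for x1, x2, x3 in points
)
print("identical predictions:", same_side)
```

The two coefficient vectors look nothing alike, but the classification rule they define is the same, which is why the predictions survive even when the coefficients cannot be interpreted.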
While the answer that was marked here is correct, I think you were looking for a different explanation of what happened in your code. I had the exact same issue when running a model.
Here's what's going on: you're training your model with the predicted variable as part of your data set. Here's an example of what was happening to me without my even noticing it:
import pandas as pd

df = pd.read_csv('file.csv')
df.columns = ['COL1', 'COL2', 'COL3', 'COL4']
train_Y = df['COL3']            # target
train_X = df[df.columns[:-1]]   # bug: keeps COL1, COL2 and COL3, so the target leaks into the features
In this code, I want to predict the value of 'COL3'. But if you look at train_X, I'm telling it to retrieve every column except the last one, so it's taking COL1, COL2 and COL3 (not COL4) as inputs and trying to predict COL3, which is itself part of train_X.
I corrected this by reordering the columns: I manually moved COL3 in Excel to be the last column in my data set (so it now takes the place of COL4), and then:
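import pandas as pd

df = pd.read_csv('file.csv')
df.columns = ['COL1', 'COL2', 'COL3', 'COL4']
train_Y = df['COL4']            # target is now the last column
train_X = df[df.columns[:-1]]   # every column except the last, so the target is excluded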
If you'd rather not move it in Excel and want to do it in code instead:
import pandas as pd

df = pd.read_csv('file.csv')
df.columns = ['COL1', 'COL2', 'COL3', 'COL4']
train_Y = df['COL3']                      # target
train_X = df[['COL1', 'COL2', 'COL4']]    # select the feature columns by name
Note how train_X is now declared to include every column except COL3, which is the target held in train_Y.
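A variant that avoids listing the feature columns by hand is to drop the target by name. This is only a sketch: the small in-memory DataFrame stands in for file.csv, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the contents of file.csv (values are made up).
df = pd.DataFrame({
    "COL1": [1, 2, 3],
    "COL2": [4, 5, 6],
    "COL3": [0, 1, 0],
    "COL4": [7, 8, 9],
})

target = "COL3"
train_Y = df[target]                  # the column to predict
train_X = df.drop(columns=[target])   # every other column, regardless of position

# Guard against the leak described above.
assert target not in train_X.columns
```

Because the target is excluded by name rather than by position, reordering the columns in the file no longer changes which variables end up in train_X.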
I hope that helps.