I am looking at this tutorial: https://www.dataquest.io/mission/75/improving-your-submission
In section 8, "Finding the best features", it shows the following code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
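What is k=5 doing, since it is never used (the graph still lists all of the features, whether I use k=1 or k="all")? How does it determine the best features, and are they independent of the method one wants to use (logistic regression, random forests, or something else)?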
Select features according to the k highest scores.
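To make the role of k concrete, here is a minimal sketch. It assumes a synthetic dataset from make_classification as a stand-in for the tutorial's titanic DataFrame, so the numbers are illustrative only. It suggests that pvalues_ and scores_ always cover every input feature, which is why the plot does not change, while k only decides what get_support() and transform() keep. It also checks that the scores match a direct call to f_classif, a per-feature ANOVA F-test that is computed without reference to any downstream model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the tutorial's titanic data (an assumption of this sketch).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectKBest(f_classif, k=5).fit(X, y)

# pvalues_ and scores_ hold one entry per input feature, regardless of k.
print(selector.pvalues_.shape)

# k only takes effect here: the boolean mask of kept features
# and the reduced matrix returned by transform().
mask = selector.get_support()
X_reduced = selector.transform(X)
print(mask.sum(), X_reduced.shape)

# The scores come from f_classif alone; no classifier is involved.
f_values, p_values = f_classif(X, y)
print(np.allclose(f_values, selector.scores_))
```

In the tutorial's code, the equivalent step would be something like selector.transform(titanic[predictors]), or indexing predictors with selector.get_support(), to obtain only the five selected columns.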
—
Srini