I'm going to run through the whole Naive Bayes process from the beginning, since it's not totally clear to me where you're getting hung up.
We want to find P(class | feature1, feature2, ..., featuren): the probability that an example belongs to each class, given its feature values.

Bayes' theorem says

P(A|B) = P(B|A) · P(A) / P(B)

so, applied to our problem,

P(class|features) = P(features|class) · P(class) / P(features)

The denominator P(features) is the same for every class, and we only need to compare P(class|features) across classes, so we can drop it and just work with

P(class|features) ∝ P(features|class) · P(class)

P(class) is the prior: the probability of each class before looking at any features, estimated from the class frequencies in the training data.

P(features|class), i.e. P(feature1, feature2, ..., featuren | class), is where the "naive" part comes in: we assume the features are conditionally independent given the class, so the joint likelihood factors into per-feature terms:

P(feature1, feature2, ..., featuren | class) = P(feature1|class) · P(feature2|class) · ... · P(featuren|class)
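In code, the proportional score is just a product. A minimal Python sketch (the naive_bayes_score name and argument layout are illustrative, not a standard API):

```python
import math

def naive_bayes_score(prior, likelihoods):
    """P(class) times the product of the per-feature P(feature_i = value_i | class).

    The result is only proportional to P(class | features), but that's
    enough: the class with the largest score is the prediction.
    """
    return prior * math.prod(likelihoods)
```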
Example: training the classifier
To train the classifier, we count up various subsets of the data points and use those counts to compute the prior and conditional probabilities.
The training data contain 60 points, 40 green and 20 red, so the priors are P(class=green) = 40/60 = 2/3 and P(class=red) = 20/60 = 1/3
Next, we need the conditional probability of each feature value given each class. There are two features here, feature1 and feature2, so we need:
- P(feature1=A|class=red)
- P(feature1=B|class=red)
- P(feature1=A|class=green)
- P(feature1=B|class=green)
- P(feature2=X|class=red)
- P(feature2=Y|class=red)
- P(feature2=X|class=green)
- P(feature2=Y|class=green)
- (just in case it isn't clear: that's every possible pair of feature value and class)
These are also computed by counting. For example, to get P(feature1=A|class=red), we count the red points that have feature1 = A and divide by the total number of red points. All 20 red points happen to have feature1 = A, so P(feature1=A|class=red) = 20/20 = 1 and therefore P(feature1=B|class=red) = 0/20 = 0. Among the 40 green points, 5 have feature1 = A and 35 have feature1 = B, giving P(feature1=A|class=green) = 5/40 = 1/8 and P(feature1=B|class=green) = 35/40 = 7/8. Doing the same counts for feature2 completes the list:
- P(feature1=A|class=red)=1
- P(feature1=B|class=red)=0
- P(feature1=A|class=green)=1/8
- P(feature1=B|class=green)=7/8
- P(feature2=X|class=red)=3/10
- P(feature2=Y|class=red)=7/10
- P(feature2=X|class=green)=8/10
- P(feature2=Y|class=green)=2/10
Those ten probabilities (the two priors plus the eight conditionals) are our model.
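Here's a short Python sketch of that training-by-counting step, which recovers those ten numbers. The data points below are a reconstruction: the per-class and per-feature counts match the figures above, but how the feature1 and feature2 values pair up within each class is my assumption (only the per-feature marginal counts matter to Naive Bayes):

```python
from collections import Counter, defaultdict

# Reconstructed training set of (feature1, feature2, class) points.
# Counts match the text: 20 red (all A; 6 X, 14 Y) and
# 40 green (5 A, 35 B; 32 X, 8 Y). The within-class pairing of
# feature values is assumed for illustration.
points = (
    [("A", "X", "red")] * 6 + [("A", "Y", "red")] * 14 +
    [("A", "X", "green")] * 4 + [("A", "Y", "green")] * 1 +
    [("B", "X", "green")] * 28 + [("B", "Y", "green")] * 7
)

class_counts = Counter(cls for _, _, cls in points)
priors = {cls: n / len(points) for cls, n in class_counts.items()}

# conditionals[(feature, value, cls)] = P(feature = value | class)
cond_counts = defaultdict(int)
for f1, f2, cls in points:
    cond_counts[("feature1", f1, cls)] += 1
    cond_counts[("feature2", f2, cls)] += 1
conditionals = {k: n / class_counts[k[2]] for k, n in cond_counts.items()}

print(priors)                                    # {'red': 0.333..., 'green': 0.666...}
print(conditionals[("feature1", "A", "red")])    # 1.0
print(conditionals[("feature2", "Y", "green")])  # 0.2
```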
Classifying a new example
Suppose the new example to classify has feature1 = A and feature2 = Y. For each class, we multiply the prior by the relevant conditionals:

P(class=red|example) ∝ P(class=red) · P(feature1=A|class=red) · P(feature2=Y|class=red)

P(class=red|example) ∝ 1/3 · 1 · 7/10 = 7/30

P(class=green|example) ∝ P(class=green) · P(feature1=A|class=green) · P(feature2=Y|class=green)

P(class=green|example) ∝ 2/3 · 1/8 · 2/10 = 1/60

Since 7/30 > 1/60, we classify the example as red. (These values aren't true probabilities, because we dropped P(features), but comparing them is all we need to pick the winning class.)
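The same classification in code, with the ten model probabilities hard-coded (the dict layout and the classify helper are just one way to organize it):

```python
from math import prod

priors = {"red": 1/3, "green": 2/3}
conditionals = {
    "red":   {"feature1": {"A": 1,   "B": 0},
              "feature2": {"X": 3/10, "Y": 7/10}},
    "green": {"feature1": {"A": 1/8, "B": 7/8},
              "feature2": {"X": 8/10, "Y": 2/10}},
}

def classify(example):
    """Pick the class with the largest P(class) * prod of P(feature=value | class)."""
    scores = {
        cls: priors[cls] * prod(conditionals[cls][f][v] for f, v in example.items())
        for cls in priors
    }
    return max(scores, key=scores.get), scores

label, scores = classify({"feature1": "A", "feature2": "Y"})
print(scores)  # {'red': 0.2333... (= 7/30), 'green': 0.0166... (= 1/60)}
print(label)   # red
```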
Notes
The example above uses categorical features. If your features are continuous, one common approach is to assume each feature is normally distributed within each class: estimate the feature's mean and variance separately for each class from the training data, then evaluate P(feature=value|class) by plugging the value, along with the appropriate class's mean and variance, into the normal density. Other distributions may be more appropriate depending on the details of your data, but a Gaussian is a decent starting point.
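A minimal sketch of that Gaussian variant in plain Python (the helper names and the sample values are illustrative):

```python
import math
from statistics import fmean, pvariance

def fit_gaussian(values):
    """Estimate one feature's mean and variance within one class."""
    return fmean(values), pvariance(values)

def gaussian_likelihood(value, mean, var):
    """Normal density, standing in for a counted P(feature = value | class)."""
    return math.exp(-(value - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative: if the red points had these values for some continuous feature...
mean, var = fit_gaussian([2.0, 2.4, 1.8, 2.2])
print(gaussian_likelihood(2.1, mean, var))  # density used as the likelihood term
```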
I'm not too familiar with the DARPA dataset, but you'd do essentially the same thing. You'd probably end up computing things like P(attack=TRUE | service=finger), P(attack=FALSE | service=finger), P(attack=TRUE | service=ftp), and so on, then combining them the same way as in the example. As a side note, part of the trick here is coming up with good features. Source IP, for example, is probably going to be hopelessly sparse: you may only have one or two examples for a given IP. You might do better if you geolocated the IPs and used something like "source_in_same_building_as_dest (true/false)" as a feature instead.
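As a sketch of that counting, with made-up placeholder records rather than real DARPA data:

```python
from collections import Counter

# Made-up (service, attack) records standing in for the real dataset.
records = [("finger", True), ("finger", False), ("ftp", True),
           ("ftp", False), ("ftp", False), ("http", False)]

service_totals = Counter(service for service, _ in records)
pair_counts = Counter(records)

def p_attack_given(service, attack=True):
    """Empirical P(attack | service) from the counts above."""
    return pair_counts[(service, attack)] / service_totals[service]

print(p_attack_given("finger"))  # 0.5 on this toy data
```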
Hopefully that helps. If anything needs clarification, I'd be happy to try again!