Does Breiman's random forest use information gain or the Gini index?



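I would like to know whether Breiman's random forest (the random forest in the R randomForest package) uses information gain or the Gini index as its splitting criterion (attribute selection criterion). I tried to find this at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm and in the documentation of the randomForest package in R. The only thing I found is that the Gini index can be used for computing variable importance.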


I would also like to know whether the trees of the random forest in the randomForest package are binary.

Answers:



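The randomForest package in R by A. Liaw is a port of the original code: a mix of C code (translated), some remaining Fortran code, and R wrapper code. To decide the overall best split across break points and across the mtry variables, the code uses a scoring function similar to Gini gain: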

$$GiniGain(N, X) = Gini(N) - \frac{|N_1|}{|N|} Gini(N_1) - \frac{|N_2|}{|N|} Gini(N_2)$$

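where $X$ is a given feature, $N$ is the node on which the split is to be made, and $N_1$ and $N_2$ are the two child nodes created by splitting $N$. $|\cdot|$ is the number of elements in a node.

And $Gini(N) = 1 - \sum_{k=1}^{K} p_k^2$, where $K$ is the number of classes in the node and $p_k$ is the proportion of samples in the node that belong to class $k$.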
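To make the two definitions concrete, here is a minimal Python sketch that evaluates them directly from lists of class labels. The function names `gini` and `gini_gain` are made up for this illustration and are not taken from the randomForest sources.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Gini(N) minus the size-weighted Gini of the two child nodes."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

# Toy node with two balanced classes.
parent = ["a", "a", "a", "b", "b", "b"]
# A split that separates the classes perfectly.
perfect = gini_gain(parent, ["a", "a", "a"], ["b", "b", "b"])
# A split whose children have the same class mix as the parent.
useless = gini_gain(parent, ["a", "b"], ["a", "a", "b", "b"])
```

A split that isolates the classes recovers the whole parent impurity as gain, while a split that leaves the class mix unchanged gains nothing.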

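However, the scoring function actually applied is not exactly this one, but an equivalent version that is cheaper to compute. $Gini(N)$ and $|N|$ are constant for all compared splits and are therefore omitted.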

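Let us also inspect the child-node part. Up to the constant factor $1/|N|$, the term for child node 2 is

$$\frac{|N_2|}{|N|} Gini(N_2) \propto |N_2| \, Gini(N_2) = |N_2| \left(1 - \sum_{k=1}^{K} p_{2,k}^2\right) = |N_2| - |N_2| \sum_{k=1}^{K} \frac{nclass_{2,k}^2}{|N_2|^2}$$

where $nclass_{2,k}$ is the count of class $k$ in child node 2, so that $p_{2,k} = nclass_{2,k} / |N_2|$. Notice that $|N_2|$ appears in both the numerator and the denominator of the last term. The same holds for child node 1 with $nclass_{1,k}$ and $|N_1|$.

The leading terms add up to $|N_1| + |N_2| = |N|$, which is the same for every split, so the trivial constant $1$ can be removed. The best split is then the one that maximizes the node-size-weighted sum of squared class prevalences:

$$|N_1| \sum_{k=1}^{K} p_{1,k}^2 + |N_2| \sum_{k=1}^{K} p_{2,k}^2 = |N_1| \sum_{k=1}^{K} \frac{nclass_{1,k}^2}{|N_1|^2} + |N_2| \sum_{k=1}^{K} \frac{nclass_{2,k}^2}{|N_2|^2}$$

$$= \sum_{k=1}^{K} nclass_{1,k}^2 \cdot \frac{1}{|N_1|} + \sum_{k=1}^{K} nclass_{2,k}^2 \cdot \frac{1}{|N_2|} = \frac{numerator_1}{denominator_1} + \frac{numerator_2}{denominator_2}$$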
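The claim that the simplified score (sum of squared class counts divided by node size, added over both children) ranks splits the same way as the Gini gain can be checked numerically. The sketch below does this on a made-up label vector that is assumed to be already sorted by the feature under test; `fast_score` is a hypothetical name, not a function from the package.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(left, right):
    """Textbook Gini gain of splitting left + right into left and right."""
    n = len(left) + len(right)
    return (gini(left + right)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

def fast_score(left, right):
    """Sum over children of (sum of squared class counts) / (child size)."""
    return sum(
        sum(count * count for count in Counter(child).values()) / len(child)
        for child in (left, right)
    )

# Toy labels, assumed already ordered by the candidate feature.
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]
# Break point i sends labels[:i] to child 1 and labels[i:] to child 2.
splits = range(1, len(labels))

best_by_gain = max(splits, key=lambda i: gini_gain(labels[:i], labels[i:]))
best_by_score = max(splits, key=lambda i: fast_score(labels[:i], labels[i:]))
```

For every break point the two quantities differ only by a rescaling and a shift that depend on the parent node alone (gain = score / |N| minus the parent's sum of squared prevalences), so both select the same break point on this toy vector.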

The implementation also allows for class-wise up/down weighting of samples. Also very important: when the implementation updates this modified Gini gain, moving a single sample from one node to the other is very efficient. The sample can be subtracted from the numerator and denominator of one node and added to those of the other. I wrote a prototype RF some months ago, ignorantly recomputing the Gini gain from scratch for every break point, and that was slower :)
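The incremental update can be sketched as follows. This is an illustration of the idea only, not a transcription of classTree.c: the function name and the `class_weight` dictionary are hypothetical, weights are assumed to be positive, the labels are assumed to be pre-sorted by the feature, and the sketch ignores that real break points are only valid between distinct feature values.

```python
from collections import Counter

def best_break_incremental(labels, class_weight=None):
    """Sweep break points left to right with O(1) work per moved sample.

    labels: class labels ordered by the feature being tested.
    class_weight: optional dict label -> positive weight (hypothetical).
    Returns (best break point i, best score), where labels[:i] go left.
    """
    w = class_weight or {}
    # Start with every sample in the right child.
    right = Counter()
    for y in labels:
        right[y] += w.get(y, 1.0)
    left = Counter()
    num_l, den_l = 0.0, 0.0
    num_r = sum(c * c for c in right.values())  # sum of squared class counts
    den_r = sum(right.values())                 # weighted node size
    best_score, best_i = float("-inf"), None
    # The last sample never moves, so the right child is never empty.
    for i, y in enumerate(labels[:-1], start=1):
        wy = w.get(y, 1.0)
        # (c + w)^2 - c^2 = w * (2c + w): the left numerator grows.
        num_l += wy * (2 * left[y] + wy)
        # (c - w)^2 - c^2 = -w * (2c - w): the right numerator shrinks.
        num_r -= wy * (2 * right[y] - wy)
        left[y] += wy
        right[y] -= wy
        den_l += wy
        den_r -= wy
        score = num_l / den_l + num_r / den_r
        if score > best_score:
            best_score, best_i = score, i
    return best_i, best_score

best_i, best_score = best_break_incremental([0, 0, 0, 1, 0, 1, 1, 1, 1])
```

Each step touches only the class of the sample being moved, so a full sweep over n sorted samples costs O(n) instead of the O(n^2) of recomputing every candidate from scratch.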

If several splits tie for the best score, a random winner is picked.
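One way to pick a uniformly random winner among tied scores in a single pass is a reservoir-style replacement, sketched below. Whether classTree.c does it exactly this way is not stated here, so treat the approach and the function name `pick_best_with_random_ties` as assumptions made for illustration.

```python
import random

def pick_best_with_random_ties(scores, rng=random):
    """Return the index of the maximum score, breaking ties uniformly."""
    best_score, best_index, n_ties = float("-inf"), None, 0
    for i, s in enumerate(scores):
        if s > best_score:
            # Strictly better: new sole leader.
            best_score, best_index, n_ties = s, i, 1
        elif s == best_score:
            n_ties += 1
            # Replace the current winner with probability 1 / n_ties, which
            # leaves every tied candidate equally likely at the end.
            if rng.random() < 1.0 / n_ties:
                best_index = i
    return best_index

rng = random.Random(0)  # seeded only to make the example repeatable
winner = pick_best_with_random_ties([1.0, 3.0, 2.0, 3.0, 3.0], rng)
```

The appeal of this pattern is that it needs no second pass and no list of tied candidates, which suits a loop that scores break points one at a time.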

This answer is based on inspecting the source file "randomForest.x.x.tar.gz/src/classTree.c", lines 209-250.
