XGBoost loss function approximation with Taylor expansion



For example, take the objective function of the XGBoost model at the $t$-th iteration:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\!\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where $\ell$ is the loss function, $f_t$ is the output of the $t$-th tree and $\Omega$ is the regularization. The approximation is one of the (many) key steps for fast computation:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[\ell\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss function.
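For concreteness, with the squared-error loss $\ell(y_i, \hat{y}_i) = \frac{1}{2}\left(y_i - \hat{y}_i\right)^2$ these are simply

$$g_i = \frac{\partial \ell\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \hat{y}_i^{(t-1)}} = \hat{y}_i^{(t-1)} - y_i, \qquad h_i = \frac{\partial^2 \ell\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2} = 1.$$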

What I am asking for is a convincing argument to demystify why the above approximation works:

1) How does XGBoost with the above approximation compare to XGBoost with the full objective function? What potentially interesting higher-order behaviour is lost in the approximation?

2) It is a bit hard to visualize (and depends on the loss function), but if the loss function has a large cubic component, then the approximation could fail. How does this not cause problems for XGBoost?

Answers:



This is a very interesting question. To fully understand what is going on, I had to go through what XGBoost is trying to do, and what other methods we have in our toolbox to deal with it. My answer goes over the traditional methods and how/why XGBoost is an improvement. If you only want the takeaways, there is a summary at the end.

Traditional Gradient Boosting

Consider the traditional gradient boosting algorithm (Wikipedia):

  • Compute a base model $H_0$
  • For $m = 1:M$
    • Compute the pseudo-residuals $r_{im} = -\dfrac{\partial \ell\left(y_i, H_{m-1}(x_i)\right)}{\partial H_{m-1}(x_i)}$
    • Fit a base learner $h_m(x)$ to the pseudo-residuals
    • Compute the multiplier $\gamma$ that minimizes the cost, $\gamma = \arg\min_{\gamma} \sum_{i=1}^{N} \ell\left(y_i, H_{m-1}(x_i) + \gamma h_m(x_i)\right)$ (using line search)
    • Update the model $H_m(x) = H_{m-1}(x) + \gamma h_m(x)$
  • You get your boosted model $H_M(x)$

For what follows, the important step is the function approximation,

Fit a base learner $h_m(x)$ to the pseudo-residuals.

Imagine a case where you were to naïvely build a gradient boosting algorithm. You would build the algorithm above using existing regression trees as weak learners. Assume you are not allowed to tweak the existing implementation of the weak learner. In Matlab, the default split criterion is the mean squared error. The same holds for scikit-learn.

You are trying to find the best model $h_m(x)$ that minimizes the cost $\ell\left(y_i, H_{m-1}(x_i) + h_m(x_i)\right)$. But to do so, you fit a simple regression model to the residuals using MSE as the objective function. Note that you are not directly minimizing what you want, but using the residuals and MSE as a proxy. The bad part is that it does not necessarily yield the optimal solution. The good part is that it works.
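To make this concrete, here is a minimal sketch of that naïve loop under my own assumptions (squared-error loss, scikit-learn's DecisionTreeRegressor as the weak learner, and scipy's scalar minimizer standing in for the line search); it is an illustration, not a reference implementation:

```python
# Minimal sketch of naive gradient boosting with regression-tree weak learners.
# Assumptions (mine, not from the text): squared-error loss, scikit-learn trees,
# scipy's minimize_scalar as the line search.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize_scalar

def squared_loss(y, pred):
    return np.mean((y - pred) ** 2)

def naive_gradient_boosting(X, y, M=50, max_depth=3):
    pred = np.full(len(y), y.mean())       # base model H_0: just a constant
    trees, gammas = [], []
    for _ in range(M):
        residuals = y - pred               # pseudo-residuals for squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # fit weak learner with its own MSE criterion (the proxy)
        h = tree.predict(X)
        # line search for the multiplier gamma
        gamma = minimize_scalar(lambda g: squared_loss(y, pred + g * h)).x
        pred = pred + gamma * h            # H_m = H_{m-1} + gamma * h_m
        trees.append(tree)
        gammas.append(gamma)
    return trees, gammas
```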

Traditional Gradient Descent

This is similar to traditional gradient descent (Wikipedia), where you try to minimize a cost function $f(x)$ by following the (negative of the) gradient of $f(x)$ at every step:

$$x^{(i+1)} = x^{(i)} - \nabla f\!\left(x^{(i)}\right)$$

It does not allow you to find the exact minimum after one step, but every step gets you closer to the minimum (if the function is convex). It is an approximation, but it works very well, and it is, for instance, the algorithm we traditionally use to do logistic regression.
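As a toy illustration (mine, not part of the original argument), gradient descent on a simple convex function; the step size 0.1 is an arbitrary choice:

```python
# Toy gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
def grad_f(x):
    return 2 * (x - 3)   # derivative of (x - 3)^2

x = 0.0
for _ in range(100):
    x = x - 0.1 * grad_f(x)   # x_{i+1} = x_i - step * grad f(x_i)
print(x)  # approaches 3
```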

Interlude

At this point, the thing to understand is that the general gradient boosting algorithm does not compute the cost function for every possible split; it uses the cost function of the regression weak learner to fit the residuals.

Your question seems to imply that the "true XGBoost" should compute the cost function for every split, and that the "approximated XGBoost" is using a heuristic to approximate it. You can see it that way, but historically we have had the general gradient boosting algorithm, which uses no information about the cost function except the derivative at the current point. XGBoost is an extension of gradient boosting that tries to be smarter about growing the weak regression trees by using a more precise approximation than just the gradient.

Other ways of choosing the best model $h_m(x)$

If we look at AdaBoost as a special case of gradient boosting, it does not select regressors but classifiers as weak learners. If we set $h_m(x) \in \{-1, 1\}$, the way AdaBoost selects the best model is by finding

$$h_m = \arg\max_{h_m} \sum_{i=1}^{N} w_i h_m(x_i)$$

where $w_i$ are the residuals (source, starting at slide 20). The reason for using this objective function is that if $w_i$ and $h_m(x_i)$ go in the same direction / have the same sign, the point is moving in the right direction, and you are trying to maximize the total amount of movement in the right direction.

But once again, this is not directly measuring which $h_m$ minimizes $\ell\left(y_i, H_{m-1}(x_i) + h_m(x_i)\right)$. It measures how good the move $h_m$ is with respect to the overall direction you should be going in, as measured with the residuals $w_i$, which are themselves an approximation. The residuals tell you in which direction you should move by their sign, and roughly by how much by their magnitude, but they do not tell you where you should stop.
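As a small sketch of that selection rule (my own illustration; the candidate classifiers and the weight vector are hypothetical placeholders):

```python
import numpy as np

def select_weak_learner(candidates, X, w):
    """Pick the classifier h (mapping X to -1/+1) that maximizes sum_i w_i * h(x_i)."""
    scores = [np.sum(w * h(X)) for h in candidates]
    return candidates[int(np.argmax(scores))]
```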

Better Gradient Descent

The next three examples are not essential to the explanation; they are just there to present some methods that do better than vanilla gradient descent, to support the idea that what XGBoost does is just another way of improving on gradient descent. In a traditional gradient descent setting, when trying to minimize $f(x)$, it is sometimes possible to do better than simply following the gradient. Many extensions have been proposed (Wikipedia). Here are some of them, to show that it is possible to do better, given more computation time or more properties of the function $f$.

  • Line search / backtracking: in gradient descent, once the gradient $\nabla f\!\left(x^{(i)}\right)$ is computed, the next point should be

    $$x^{(i+1)} = x^{(i)} - \nabla f\!\left(x^{(i)}\right)$$

    But the gradient gives only a direction, not "by how much", so another method can be used to find the best $c > 0$ such that

    $$x_c^{(i+1)} = x^{(i)} - c \nabla f\!\left(x^{(i)}\right)$$

    minimizes the cost function. This is done by evaluating $f\!\left(x_c^{(i+1)}\right)$ for some values of $c$, and since the function $f$ should be convex, it is relatively easy to do through Line Search (Wikipedia) or Backtracking Line Search (Wikipedia). Here, the main cost is evaluating $f(x)$, so this extension works best if $f$ is easy to compute. Note that the general algorithm for gradient boosting uses line search, as shown at the beginning of my answer. (A small code sketch of backtracking appears after this list.)

  • Fast proximal gradient method: if the function to minimize is strongly convex and its gradient is smooth (Lipschitz (Wikipedia)), then there is a trick using these properties that speeds up convergence.

  • Stochastic gradient descent and the momentum method: in stochastic gradient descent, you do not evaluate the gradient on all points, but only on a subset of them. You take a step, then compute the gradient on another batch, and continue. Stochastic gradient descent may be used because computing on all points is very expensive, or because all those points do not even fit into memory. It allows you to take more steps, more quickly, but less accurately.

    When doing so, the direction of the gradient can vary depending on which points are sampled. To counteract this effect, momentum methods keep a moving average of the direction for each dimension, reducing the variance of each move.
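Here is the small backtracking sketch promised in the first bullet (my own illustration; the shrink factor 0.5 and the Armijo constant 1e-4 are conventional choices, not from any particular implementation):

```python
import numpy as np

def backtracking_step(f, grad, x, shrink=0.5, c=1e-4):
    """One gradient step whose length is chosen by backtracking (Armijo condition)."""
    g = grad(x)
    step = 1.0
    # shrink the step until f decreases by at least c * step * ||g||^2
    while f(x - step * g) > f(x) - c * step * np.dot(g, g):
        step *= shrink
    return x - step * g
```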

In our discussion of XGBoost, the most relevant extension to gradient descent is Newton's method (Wikipedia). Instead of simply computing the gradient and following it, it uses the second-order derivative to gather more information about the direction it should go in. If we use gradient descent, then at each iteration we update our point $x^{(i)}$ as follows:

$$x^{(i+1)} = x^{(i)} - \nabla f\!\left(x^{(i)}\right)$$

Since the gradient $\nabla f\!\left(x^{(i)}\right)$ points in the direction of the highest increase of $f$, its negative points in the direction of the highest decrease, and we hope that $f\!\left(x^{(i+1)}\right) < f\!\left(x^{(i)}\right)$. Newton's method starts from $x^{(i)}$ but also uses second-order information, and the update becomes

$$x^{(i+1)} = x^{(i)} - \frac{\nabla f\!\left(x^{(i)}\right)}{\mathrm{Hess}\, f\!\left(x^{(i)}\right)}$$

where $\mathrm{Hess}\, f(x)$ is the Hessian of $f$ at $x$. This update takes second-order information into account, so the direction is no longer the direction of highest decrease, but should point more precisely towards the $x^{(i+1)}$ such that $\nabla f\!\left(x^{(i+1)}\right) = 0$ (or the point where $f$ is minimal, if there is no zero). If $f$ is a second-order polynomial, then Newton's method coupled with a line search should be able to find the minimum in one step.
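As a one-dimensional sketch of that update (my own toy example; the quartic test function is an arbitrary choice):

```python
# Newton's method in one dimension: x_{i+1} = x_i - f'(x_i) / f''(x_i).
# Applied to f(x) = (x - 2)^4 + (x - 2)^2, whose minimum is at x = 2.
def f_prime(x):
    return 4 * (x - 2) ** 3 + 2 * (x - 2)

def f_second(x):
    return 12 * (x - 2) ** 2 + 2

x = 10.0
for _ in range(20):
    x = x - f_prime(x) / f_second(x)
print(x)  # converges towards 2
```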

Newton's method contrasts with stochastic gradient descent. In stochastic gradient descent, we use fewer points so that computing the direction takes less time, allowing us to take more steps, in the hope that we get there quicker. In Newton's method, we take more time to compute the direction we want to go in, in the hope that we have to take fewer steps to get there.

Now, the reason why Newton's method works is the same as the reason why the XGBoost approximation works, and it relies on Taylor's expansion (Wikipedia) and Taylor's theorem (Wikipedia). The Taylor expansion (or Taylor series) of a function $f(x+a)$ around the point $x$ is

$$f(x) + \frac{\partial f(x)}{\partial x} a + \frac{1}{2}\frac{\partial^2 f(x)}{\partial x^2} a^2 + \dots = \sum_{n=0}^{\infty} \frac{1}{n!} \frac{\partial^n f(x)}{\partial x^n} a^n.$$

Note the similarity between this expression and the approximation XGBoost is using. Taylor's theorem states that if you stop the expansion at order $k$, then the error, i.e. the difference between $f(x+a)$ and $\sum_{n=0}^{k} \frac{1}{n!} \frac{\partial^n f(x)}{\partial x^n} a^n$, is at most $h_k(x) a^k$, where $h_k$ is a function with the nice property that it goes to zero as $a$ goes to zero.

If you want some visualisation of how well it approximates some functions, take a look at the Wikipedia pages; they have some graphs for the approximation of non-polynomial functions such as $e^x$ and $\log(x)$.
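A tiny numerical check along the same lines (my own, not from the linked pages): the first- and second-order Taylor approximations of $e^x$ around $x = 0$, evaluated at a small step and a larger one:

```python
# First- and second-order Taylor approximations of exp around 0,
# evaluated at a small step a and a larger one, to show how the error grows with a.
import math

for a in (0.1, 1.0):
    exact = math.exp(a)
    order1 = 1 + a                  # e^0 + e^0 * a
    order2 = 1 + a + a ** 2 / 2     # ... + (1/2) e^0 * a^2
    print(a, exact - order1, exact - order2)
# For a = 0.1 the second-order error is ~1.7e-4; for a = 1.0 it is ~0.22.
```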

The thing to note is that the approximation works very well if you want to compute the value of $f$ in the neighbourhood of $x$, that is, for very small changes $a$. This is what we want to do in boosting. Of course, we would like to find the tree that makes the biggest change. If the weak learners we build are very good and want to make a very big change, then we can arbitrarily hinder them by applying only 0.1 or 0.01 of their effect. This is the step size, or the learning rate, of the gradient descent. This is acceptable, because if our weak learners are getting very good solutions, it means that either the problem is easy, in which case we are going to end up with a good solution anyway, or we are overfitting, so going a little or very much in this bad direction does not change the underlying problem.

So what is XGBoost doing, and why does it work?

XGBoost is a gradient boosting algorithm that builds regression trees as weak learners. The traditional gradient boosting algorithm is very similar to a gradient descent with a line search, where the direction in which to go is drawn from the available weak learners. The naïve implementation of gradient boosting would use the cost function of the weak learner to fit it to the residuals. This is a proxy for minimizing the cost of the new model, which is expensive to compute. What XGBoost does is build a custom cost function to fit the trees, using the Taylor series of order two as an approximation of the true cost function, so that it can be more confident that the tree it picks is a good one. In this respect, and as a simplification, XGBoost is to gradient boosting what Newton's method is to gradient descent.

Why did they build it that way

Your question as to why this approximation is used comes down to a cost/performance trade-off. This cost function is used to compare potential splits for regression trees, so if our points have, say, 50 features, with an average of 10 different values each, each node has 500 potential splits, and therefore 500 evaluations of the function. If you throw in a continuous feature, the number of splits explodes, and the split evaluation is called more and more often (XGBoost has another trick to deal with continuous features, but that is out of scope). As the algorithm will spend most of its time evaluating splits, the way to speed up the algorithm is to speed up tree evaluation.

If you evaluated the tree with the full cost function $\ell$, it would be a new computation for every new split. In order to optimize the computation of the cost function, you would need to have information about the cost function, and the whole point of gradient boosting is that it should work for every cost function.

The second-order approximation is computationally convenient, because most terms are the same within a given iteration. For a given iteration, most of the expression can be computed once and reused as a constant for all splits:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[\underbrace{\ell\!\left(y_i, \hat{y}_i^{(t-1)}\right)}_{\text{constant}} + \underbrace{g_i}_{\text{constant}} f_t(x_i) + \frac{1}{2} \underbrace{h_i}_{\text{constant}} f_t^2(x_i)\Big] + \Omega(f_t),$$

So the only things you have to compute are $f_t(x_i)$ and $\Omega(f_t)$, and then what is left is mostly additions and some multiplications. Moreover, if you take a look at the XGBoost paper (arXiv), you will see that they use the fact that they are building a tree to simplify the expression further, down to a bunch of summations over indices, which is very, very quick.
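As a rough sketch of how those per-iteration constants are reused (my own illustration following the leaf-score/gain structure from the paper; the squared-error gradients and all variable names here are my own choices):

```python
# Sketch of the per-split evaluation: g_i and h_i are computed once per iteration
# and only re-summed over the points falling in each candidate child.
import numpy as np

def grad_hess_squared_error(y, pred):
    g = pred - y            # first derivative of 1/2 (y - pred)^2 w.r.t. pred
    h = np.ones_like(y)     # second derivative
    return g, h

def split_gain(g, h, left_mask, lam=1.0):
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    score = lambda G, H: G ** 2 / (H + lam)
    # Gain of splitting one node into (left, right), up to the constant complexity term.
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR))

# g and h are computed once per boosting iteration...
y = np.array([1.0, 2.0, 3.0, 10.0])
pred = np.zeros_like(y)
g, h = grad_hess_squared_error(y, pred)
# ...then each candidate split only needs sums of g and h:
print(split_gain(g, h, left_mask=np.array([True, True, True, False])))
```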

Summary

You can see XGBoost (with approximation) as a regression from the exact solution, an approximation of the "true XGBoost", with exact evaluation. But since the exact evaluation is so costly, another way to see it is that on huge datasets, the approximation is all we can realistically do, and this approximation is more accurate than the first-order approximation a "naïve" gradient boosting algorithm would do.

The approximation in use is similar to Newton's method, and is justified by the Taylor series (Wikipedia) and Taylor's theorem (Wikipedia).

Higher order information is indeed not completely used, but it is not necessary, because we want a good approximation in the neighbourhood of our starting point.

For visualisation, check the Wikipedia pages for the Taylor series and Taylor's theorem, Khan Academy on Taylor series approximations, or the MathDemo page on polynomial approximation of non-polynomials.


+1. I must confess that I haven't read this answer (yet?) and cannot judge on it anyway because it's outside of my expertise, but it looks so impressive that I am happy to upvote. Well done [it seems]!
amoeba says Reinstate Monica

That was an excellent answer. I have one question though. The gradient boosting algorithm fits a regression tree to the negative gradient with MSE as the split criterion. How is the tree structure determined in XGBoost?
gnikol

You've nailed the answer, good job!
Marcin Zablocki