I am trying to understand the intuition behind kernel SVMs. Now, I understand how linear SVMs work: a decision line is found that separates the data as well as possible. I also understand the principle behind porting the data to a higher-dimensional space, and how this can make it easier to find a linear decision line in the new space. What I do not understand is how the kernel is used to project the data points into this new space.
What I know about a kernel is that it effectively represents the "similarity" between two data points. But how does that relate to the projection?
Answers:
Let $h(x)$ denote the projection to the high-dimensional space. Basically, the kernel function is $K(x_1, x_2) = \langle h(x_1), h(x_2) \rangle$, which is the inner product. So it is not used to project data points; rather, it is an outcome of the projection. It can be considered a measure of similarity, but in an SVM it is more than that.
The optimization for finding the best separating hyperplane involves $h(x)$ only through the inner product form. That is to say, if you know $K(\cdot, \cdot)$, you do not need to know the exact form of $h(x)$, which makes the optimization easier.
Each kernel $K(\cdot, \cdot)$ has a corresponding $h(x)$ as well. So if you are using an SVM with that kernel, then you are implicitly finding the linear decision line in the space that $h(x)$ maps to.
Chapter 12 of The Elements of Statistical Learning gives a brief introduction to SVM and provides more detail about the connection between kernels and feature mappings: http://statweb.stanford.edu/~tibs/ElemStatLearn/
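To make the identity $K(x_1, x_2) = \langle h(x_1), h(x_2) \rangle$ concrete, here is a small numerical sketch of my own (not from the answer above). The quadratic kernel $K(x, z) = (x^\top z)^2$ on $\mathbb{R}^2$ is a kernel whose $h$ can be written down in closed form, namely $h(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the code checks that the kernel value equals the explicit inner product.

```python
import numpy as np

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, a polynomial kernel of degree 2."""
    return np.dot(x, z) ** 2

def h(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(quadratic_kernel(x, z))   # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(h(x), h(z)))       # same value: 1.0
```

For the Gaussian kernel the corresponding $h$ is infinite-dimensional, which is exactly why you only ever work with $K$.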
The useful properties of kernel SVM are not universal; they depend on the choice of kernel. To get some intuition, it is helpful to look at one of the most commonly used kernels, the Gaussian kernel. Remarkably, this kernel turns the SVM into something very much like a k-nearest-neighbor classifier.
This answer explains the following: why perfect separation of the training data is always possible with a Gaussian kernel of sufficiently small bandwidth, how this separation can be interpreted as linear separation in a feature space, and how the kernel is used to construct the mapping from the data space into that feature space.
Perfect separation is always possible with a Gaussian kernel because of the kernel's locality, which leads to an arbitrarily flexible decision boundary. For sufficiently small kernel bandwidth, the decision boundary looks like you just drew little circles around the points whenever they are needed to separate the positive and negative examples:
(Credit: Andrew Ng's online machine learning course.)
So why does this happen from a mathematical point of view?
Consider the standard setup: you have a Gaussian kernel $K(x, z) = \exp(-\|x - z\|^2 / \sigma^2)$ and training data $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$, where the $y^{(i)}$ values are $\pm 1$. We want to learn the classifier function
$$\hat{y}(x) = \sum_i w_i y^{(i)} K(x^{(i)}, x).$$
Now how will we assign the weights $w_i$? Do we need an infinite-dimensional space and a quadratic programming algorithm? No, because I only want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $\|x^{(i)} - x^{(j)}\|$ between any two training examples, and I just set $w_i = 1$. This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have
$$\hat{y}(x^{(k)}) = \sum_{i=1}^n y^{(i)} K(x^{(i)}, x^{(k)}) = y^{(k)} K(x^{(k)}, x^{(k)}) + \sum_{i \neq k} y^{(i)} K(x^{(i)}, x^{(k)}) = y^{(k)} + \epsilon,$$
where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $x^{(k)}$ is a billion sigmas away from any other point, so for all $i \neq k$ we have
$$K(x^{(i)}, x^{(k)}) = \exp(-\|x^{(i)} - x^{(k)}\|^2 / \sigma^2) \approx 0.$$
Since $\epsilon$ is so small, $\hat{y}(x^{(k)})$ definitely has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data. In practice this would be terribly overfitting, but it shows the tremendous flexibility of the Gaussian kernel SVM, and how it can act very much like a nearest-neighbor classifier.
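Here is a short numerical check of this argument (a sketch of my own, not part of the original reasoning): with all $w_i = 1$ and $\sigma$ vastly smaller than the minimum pairwise distance, $\hat{y}(x^{(k)})$ has the same sign as $y^{(k)}$ on every training point.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 training points in R^2
y = rng.choice([-1, 1], size=20)      # arbitrary +-1 labels

# Make sigma a billion times smaller than the smallest pairwise distance.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
min_dist = dists[dists > 0].min()
sigma = min_dist / 1e9

def y_hat(x):
    """Kernel classifier with all weights w_i = 1."""
    k = np.exp(-np.sum((X - x) ** 2, axis=1) / sigma ** 2)
    return np.sum(y * k)

preds = np.sign([y_hat(x_k) for x_k in X])
print(np.all(preds == y))             # True: the training data is separated perfectly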
The fact that this can be interpreted as "perfect linear separation in an infinite-dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an abstract inner product in some new feature space:
$$K(x^{(i)}, x^{(j)}) = \langle \Phi(x^{(i)}), \Phi(x^{(j)}) \rangle,$$
where $\Phi(x)$ is the mapping from the data space into the feature space. It follows immediately that the function $\hat{y}(x)$ is a linear function in the feature space:
$$\hat{y}(x) = \sum_i w_i y^{(i)} \langle \Phi(x^{(i)}), \Phi(x) \rangle = L(\Phi(x)),$$
where the linear function $L(v)$ is defined on feature space vectors $v$ as
$$L(v) = \sum_i w_i y^{(i)} \langle \Phi(x^{(i)}), v \rangle.$$
This function is linear in $v$ because it is just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(x) = 0$ is just $L(v) = 0$, the level set of a linear function. This is the very definition of a hyperplane in the feature space.
Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$. It is possible to write down a formula for $\Phi$, but the feature space it maps to is quite abstract and is only really used for proving theoretical results about SVM. If you are still interested, here is how it works.
Basically, we define an abstract vector space $V$ where each vector is a function from the data space to $\mathbb{R}$. A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:
$$f(x) = \sum_{i=1}^n \alpha_i K(x^{(i)}, x),$$
which can be written more compactly as $f = \sum_{i} \alpha_i K_{x^{(i)}}$, where $K_{x}(y) = K(x, y)$ is a "slice" of the kernel at $x$.
The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:
$$\Big\langle \sum_i \alpha_i K_{x^{(i)}}, \sum_j \beta_j K_{x^{(j)}} \Big\rangle = \sum_{i,j} \alpha_i \beta_j K(x^{(i)}, x^{(j)}).$$
This definition is very deliberate: its construction ensures the identity we need for linear separation, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$.
With the feature space defined in this way, $\Phi$ is a mapping from the data space into $V$, taking each point $x$ to the "kernel slice" at that point:
$$\Phi(x) = K_x, \quad \text{where } K_x(y) = K(x, y).$$
You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details.
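To make the abstract construction slightly more tangible, here is a tiny sketch of my own (using an arbitrary Gaussian kernel) that represents vectors of $V$ by their coefficient vectors $\alpha$ and $\beta$, computes the abstract inner product as $\alpha^\top K \beta$ with $K$ the Gram matrix of the anchor points, and checks the identity $\langle \Phi(x^{(i)}), \Phi(x^{(j)}) \rangle = K(x^{(i)}, x^{(j)})$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                     # anchor points x^(1), ..., x^(5)

def kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

# Gram matrix K_ij = K(x^(i), x^(j))
K = np.array([[kernel(a, b) for b in X] for a in X])

def inner(alpha, beta):
    """Abstract inner product of f = sum_i alpha_i K_{x^(i)} and g = sum_j beta_j K_{x^(j)}."""
    return alpha @ K @ beta

# Phi(x^(i)) is the kernel slice K_{x^(i)}, i.e. the coefficient vector e_i.
e = np.eye(5)
print(np.isclose(inner(e[0], e[3]), K[0, 3]))   # True: <Phi(x^(1)), Phi(x^(4))> = K(x^(1), x^(4))
```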
For the background and the notations I refer to How to calculate decision boundary from support vectors?.
So the features in the 'original' space are the vectors $x_i$, the binary outcomes are $y_i \in \{-1, +1\}$, and the Lagrange multipliers are $\alpha_i$.
As said by @Lii (+1), the kernel can be written as $K(x, y) = h(x) \cdot h(y)$ ('$\cdot$' represents the inner product).
I will try to give some 'intuitive' explanation of what this $h$ looks like, so this answer is no formal proof; it just wants to give some feeling of how I think this works. Do not hesitate to correct me if I am wrong.
I have to 'transform' my feature space (so my $x_i$) into some 'new' feature space in which the linear separation will be solved.
For each observation $x_i$, I define the function $\phi_i(x) = K(x_i, x)$, so I have a function $\phi_i$ for each element of my training sample. These functions $\phi_i$ span a vector space; note it $V = \operatorname{span}(\phi_i,\; i = 1, 2, \ldots, N)$.
I will try to argue that $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in $V$ can be written as a linear combination of the $\phi_i$, i.e. $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.
$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x) = K(x_i, x)$ (see supra, we defined $\phi$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.
The transformation that maps my original feature space to $V$ is defined as
$$\Phi: x_i \mapsto \phi_i(x) = K(x_i, x).$$
This map $\Phi$ maps my original feature space onto a vector space that can have a dimension that goes up to the size of my training sample.
Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, (c) can, depending on my kernel, have a dimension that goes up to the size of my training sample, and (d) produces vectors of $V$ that look like $\sum_{i=1}^N \gamma_i \phi_i(x)$, where the $\gamma_i$ are real numbers.
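One way to see this transformation in action (a sketch of my own, and not exactly what the SVM optimizes, since the regularizer differs) is to map each point to its vector of kernel evaluations $(K(x_1, x), \ldots, K(x_N, x))$, i.e. its coordinates with respect to the $\phi_i$, and fit an ordinary linear classifier in that $N$-dimensional space. Data that is hopeless for a linear boundary in the original space becomes separable there.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import rbf_kernel

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

print(LinearSVC(max_iter=100000).fit(X, y).score(X, y))       # near chance level

# Map each point to (K(x_1, x), ..., K(x_N, x)): an N-dimensional representation.
K = rbf_kernel(X, X, gamma=1.0)

print(LinearSVC(C=10.0, max_iter=100000).fit(K, y).score(K, y))   # close to 1.0
```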
Looking at the function $f(x)$ in How to calculate decision boundary from support vectors?, it can be seen that $f(x) = \sum_{i=1}^N y_i \alpha_i \phi_i(x) + b$.
In other words, $f(x)$ is a linear combination of the $\phi_i$, and this is a linear separator in the $V$-space: it is a particular choice of the $\gamma_i$, namely $\gamma_i = \alpha_i y_i$!
The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.
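This can be checked directly on a fitted kernel SVM. In scikit-learn, `SVC.dual_coef_` holds exactly the products $\alpha_i y_i$ for the support vectors, so $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ can be rebuilt by hand and compared with `decision_function` (a sketch of my own; gamma is fixed explicitly so the hand-written kernel matches the fitted one):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# f(x) = sum_i (alpha_i * y_i) * K(x_i, x) + b, summed over the support vectors.
X_test = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
K = rbf_kernel(clf.support_vectors_, X_test, gamma=gamma)   # shape (n_SV, n_test)
f_manual = clf.dual_coef_ @ K + clf.intercept_              # dual_coef_ stores alpha_i * y_i

print(np.allclose(f_manual.ravel(), clf.decision_function(X_test)))   # True
```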
This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$ with a different dimension. This dimension depends on the kernel you use, and for the RBF kernel it can go up to the size of the training sample.
So kernels are a technique that allows the SVM to transform your feature space; see also What makes the Gaussian kernel so magical for PCA, and also in general?
Transform predictors (input data) to a high-dimensional feature space. It is sufficient to just specify the kernel for this step and the data is never explicitly transformed to the feature space. This process is commonly known as the kernel trick.
Let me explain it. The kernel trick is the key here. Consider the case of a Radial Basis Function (RBF) kernel, which transforms the input into an infinite-dimensional space. For a one-dimensional input $x$, the transformation to $\phi(x)$ can be represented as (taken from http://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf)
$$\phi(x) = e^{-x^2}\left[1,\; \sqrt{\tfrac{2}{1!}}\,x,\; \sqrt{\tfrac{2^2}{2!}}\,x^2,\; \sqrt{\tfrac{2^3}{3!}}\,x^3,\; \ldots\right]^T, \qquad \text{so that } K(x_i, x_j) = e^{-(x_i - x_j)^2} = \phi(x_i)^T \phi(x_j).$$
The input space is finite-dimensional, but the transformed space is infinite-dimensional. Transforming the input into an infinite-dimensional space is something that happens as a result of the kernel trick. Here $x$ is the input and $\phi(x)$ is the transformed input. But $\phi(x)$ is never computed as such; instead the product $\phi(x_i)^T \phi(x)$ is computed, which is just the exponential of the negative squared distance between $x_i$ and $x$.
There is a related question, Feature map for the Gaussian kernel, to which there is a nice answer: https://stats.stackexchange.com/a/69767/86202.
The output or decision function is a function of the kernel matrix $K(x_i, x) = \phi(x_i)^T \phi(x)$, and not of the input $x$ or the transformed input $\phi(x)$ directly.
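To get a concrete feel for that infinite-dimensional $\phi$, the series above can be truncated and evaluated numerically. The sketch below (my own, using $\gamma = 1$ and scalar inputs, following the expansion shown above) checks that a truncated explicit feature map reproduces the kernel value.

```python
import numpy as np
from math import factorial

def rbf(x, z):
    """Gaussian kernel with gamma = 1 for scalar inputs: exp(-(x - z)^2)."""
    return np.exp(-(x - z) ** 2)

def phi(x, n_terms=30):
    """Truncated explicit feature map: phi_n(x) = exp(-x^2) * sqrt(2^n / n!) * x^n."""
    return np.array([np.exp(-x ** 2) * np.sqrt(2.0 ** n / factorial(n)) * x ** n
                     for n in range(n_terms)])

x, z = 0.7, -0.4
print(rbf(x, z))                 # exact kernel value
print(np.dot(phi(x), phi(z)))    # truncated phi(x) . phi(z), essentially the same number
```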
Mapping to a higher dimension is merely a trick to solve a problem that is defined in the original dimension; concerns such as overfitting your data by going into a dimension with too many degrees of freedom are therefore not a byproduct of the mapping process, but are inherent in the problem definition.
Basically, all the mapping does is convert the conditional classification in the original dimension into the definition of a plane in the higher dimension, and because there is a one-to-one relationship between the plane in the higher dimension and the conditions in the lower dimension, you can always move between the two.
Clearly, for the problem of overfitting, you can overfit any set of observations by defining enough conditions to isolate each observation into its own class, which is equivalent to mapping your data to (n-1)D, where n is the number of observations.
Take the simplest problem as an example, where your observations are [[1,-1], [0,0], [1,1]] [[feature, value]]: by moving into the 2D dimension and separating your data with a line, you simply convert the conditional classification feature < 1 && feature > -1 : 0 into the definition of a line that passes through (-1 + epsilon, 1 - epsilon). If you have more data points and need more conditions, you just add one more degree of freedom to the higher dimension for each new condition you define.
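To make the general idea concrete, here is a small sketch of my own, using the map x -> (x, x^2) rather than whatever mapping the author has in mind: the one-dimensional condition feature < 1 && feature > -1 : 0 becomes a straight line, x^2 = 1, once each point carries x^2 as a second coordinate.

```python
import numpy as np

x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
labels = ((x > -1) & (x < 1)).astype(int)        # 1 inside (-1, 1), 0 outside

# No single threshold on x alone separates the two classes,
# but after mapping x -> (x, x^2) the horizontal line x^2 = 1 does.
mapped = np.column_stack([x, x ** 2])
pred = (mapped[:, 1] < 1).astype(int)            # "below the line x^2 = 1"

print(np.all(pred == labels))                    # True
```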
You can replace the process of mapping to a higher dimension with any process that provides you with a 1 to 1 relationship between the conditions and the degrees of freedom of your new problem. Kernel tricks simply do that.
Take, for example, data points of the form [x, floor(sin(x))]. Mapping your problem into a 2D dimension is not helpful here at all; in fact, mapping to any plane will not be helpful, because defining the problem as a set of conditions x < a && x > b : z does not work in this case. The simplest mapping in this case is a mapping into polar coordinates, or into the imaginary plane.
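Reading the example as a binary problem where the class is floor(sin(x)), so -1 when sin(x) is negative and 0 otherwise (this interpretation is my own), here is a sketch showing that the classes alternate along x, while the single mapped coordinate sin(x) separates them with one threshold:

```python
import numpy as np

x = np.linspace(0.0, 20.0, 200)
labels = np.floor(np.sin(x))                 # -1 where sin(x) < 0, otherwise 0

# The classes alternate along x, so no condition of the form a < x < b isolates one class.
# After mapping x to sin(x), a single threshold at 0 separates them exactly.
pred = np.where(np.sin(x) < 0, -1.0, 0.0)

print(np.all(pred == labels))                # True
```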