What is the intuition behind the fact that an SVM with a Gaussian kernel has an infinite-dimensional feature space?
Answers:
This answer explains the following:
Because of the kernel's locality, which leads to arbitrarily flexible decision boundaries, perfect separation is always possible with a Gaussian kernel (provided no two points from different classes are ever exactly the same). For a sufficiently small kernel bandwidth, the decision boundary looks as if you just drew little circles around the points wherever they are needed to separate the positive and negative examples:
(Image credit: Andrew Ng's online machine learning course.)
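Since the figure is not reproduced here, the same effect can be checked numerically. Below is a minimal sketch (assuming scikit-learn's SVC and NumPy, which the answer itself does not reference): with a tiny bandwidth, i.e. a huge gamma, an RBF-kernel SVM separates even arbitrarily labeled training points perfectly.

```python
# A minimal sketch (not from the original answer): with a very small bandwidth
# (large gamma), an RBF-kernel SVM separates any set of distinct points perfectly,
# no matter how the labels are assigned.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))           # 40 random 2-D points
y = rng.choice([-1, 1], size=40)       # arbitrary labels, no structure at all

# gamma plays the role of 1 / sigma^2; a huge gamma means a tiny kernel bandwidth
clf = SVC(kernel="rbf", gamma=1e6, C=1e6).fit(X, y)
print("training accuracy:", clf.score(X, y))   # should print 1.0
```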
So why does this work from a mathematical perspective?
Consider the standard setup: you have a Gaussian kernel $K(x, z) = \exp(-\|x - z\|^2 / \sigma^2)$ and training data $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$, where the $y^{(i)}$ values are $\pm 1$. We want to learn the classifier function

$$\hat{y}(x) = \sum_i w_i \, y^{(i)} K\big(x^{(i)}, x\big).$$
Now how will we assign the weights $w_i$? Do we need infinite-dimensional spaces and a quadratic programming algorithm? No, because I just want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $\|x^{(i)} - x^{(j)}\|$ between any two training examples, and I simply set $w_i = 1$. This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have

$$\hat{y}\big(x^{(k)}\big) = \sum_{i=1}^n y^{(i)} K\big(x^{(i)}, x^{(k)}\big) = y^{(k)} K\big(x^{(k)}, x^{(k)}\big) + \sum_{i \neq k} y^{(i)} K\big(x^{(i)}, x^{(k)}\big) = y^{(k)} + \epsilon,$$
where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $x^{(k)}$ is a billion sigmas away from every other point, so for all $i \neq k$ we have

$$K\big(x^{(i)}, x^{(k)}\big) = \exp\big(-\|x^{(i)} - x^{(k)}\|^2 / \sigma^2\big) \approx 0.$$
Since $\epsilon$ is so small, $\hat{y}(x^{(k)})$ definitely has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data.
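Here is a small numeric sketch of this construction (assuming NumPy; the answer itself gives no code): with $w_i = 1$ and $\sigma$ far smaller than the smallest pairwise distance, $\hat{y}(x^{(k)})$ has the same sign as $y^{(k)}$ for every training point.

```python
# Numeric check of the construction above (a sketch, not from the original answer):
# with w_i = 1 and sigma much smaller than the smallest pairwise distance,
# y_hat(x^(k)) = y^(k) + epsilon with epsilon ~ 0, so the signs all match.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = rng.choice([-1.0, 1.0], size=30)

dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
min_dist = dists[dists > 0].min()
sigma = min_dist / 1e3                  # "a billion" is overkill numerically; 1e3 suffices

def K(a, b, sigma=sigma):
    return np.exp(-np.linalg.norm(a - b) ** 2 / sigma ** 2)

y_hat = np.array([sum(y[i] * K(X[i], X[k]) for i in range(len(X))) for k in range(len(X))])
print(np.all(np.sign(y_hat) == y))      # should print True
```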
The fact that this can be interpreted as "perfect linear separation in an infinite-dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an inner product in a (potentially infinite-dimensional) feature space:

$$K\big(x^{(i)}, x^{(j)}\big) = \big\langle \Phi(x^{(i)}), \Phi(x^{(j)}) \big\rangle,$$
where $\Phi(x)$ is the mapping from the data space into the feature space. It follows immediately that the function $\hat{y}(x)$ is a linear function in the feature space:

$$\hat{y}(x) = \sum_i w_i \, y^{(i)} \big\langle \Phi(x^{(i)}), \Phi(x) \big\rangle,$$
and the linear function $L(v)$ defined on feature-space vectors $v$ is

$$L(v) = \sum_i w_i \, y^{(i)} \big\langle \Phi(x^{(i)}), v \big\rangle.$$
This function is linear in $v$ because it is just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(x) = 0$ is simply $L(v) = 0$, the level set of a linear function. This is the very definition of a hyperplane in the feature space.
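As a finite-dimensional analogue of this "linear in feature space" reading (a sketch using a polynomial kernel rather than the Gaussian kernel discussed above), the kernel $K(x, z) = (x \cdot z)^2$ on $\mathbb{R}^2$ has an explicit feature map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ satisfying $K(x, z) = \langle \Phi(x), \Phi(z) \rangle$:

```python
# A finite-dimensional analogue of the kernel trick (an illustration, not part of the
# original answer): for K(x, z) = (x . z)^2 in 2-D, the explicit feature map
# Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) satisfies K(x, z) = <Phi(x), Phi(z)>,
# so "linear in feature space" is literal here.
import numpy as np

def K(x, z):
    return np.dot(x, z) ** 2

def Phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(K(x, z), np.dot(Phi(x), Phi(z)))   # both print 30.25
```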
Note: in this section the notation $x^{(i)}$ refers to an arbitrary collection of $n$ points, not the training data. This is pure math; the training data does not enter this section at all!
Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as the SVM do not need them in order to work; they only need the kernel function $K$.
That said, it is possible to write down a formula for $\Phi$. The feature space that $\Phi$ maps into is rather abstract (and potentially infinite-dimensional), but essentially the mapping just uses the kernel to do some simple feature engineering. In terms of the final result, the model you end up learning with kernels is no different from the traditional feature engineering commonly applied in linear regression and GLM modeling, such as taking the log of a positive predictor variable before feeding it into the regression formula. The math here is mostly there to make sure the kernel plays well with the SVM algorithm, which has the advantages of sparsity and of scaling well to large datasets.
If you're still interested, here is how it works. Essentially, we take the identity we want to hold, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$, and construct a space and inner product such that it holds by definition. To do this, we define an abstract vector space $V$ in which each vector is a function from the space the data lives in, $\mathcal{X}$, to the real numbers $\mathbb{R}$. A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices: $f(x) = \sum_{i=1}^n \alpha_i K\big(x^{(i)}, x\big)$. It is convenient to write $f$ more compactly as $f = \sum_{i=1}^n \alpha_i K_{x^{(i)}}$, where $K_{x}(y) = K(x, y)$ is the function giving a "slice" of the kernel at $x$.
The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

$$\Big\langle \sum_{i=1}^n \alpha_i K_{x^{(i)}},\ \sum_{j=1}^n \beta_j K_{x^{(j)}} \Big\rangle = \sum_{i,j} \alpha_i \beta_j \, K\big(x^{(i)}, x^{(j)}\big).$$
With the feature space defined in this way, $\Phi$ is the mapping $\mathcal{X} \to V$ that takes each point $x$ to the "kernel slice" at that point:

$$\Phi(x) = K_{x}, \quad \text{where } K_{x}(y) = K(x, y).$$
You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details. (Kudos to f coppens for pointing this out!)
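A small numeric sanity check of this construction (a sketch assuming NumPy, not part of the original answer): representing $f = \sum_i \alpha_i K_{x^{(i)}}$ and $g = \sum_j \beta_j K_{x^{(j)}}$ by their coefficient vectors, the abstract inner product is $\alpha^\top G \beta$ with $G$ the Gram matrix, and in particular $\langle K_x, K_y \rangle = K(x, y)$ holds by construction.

```python
# Sketch: the abstract inner product defined via the kernel reduces to alpha^T G beta,
# and the special case f = K_{x_0}, g = K_{x_1} gives exactly K(x_0, x_1).
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / sigma ** 2)

rng = np.random.default_rng(2)
pts = rng.normal(size=(5, 2))
alpha, beta = rng.normal(size=5), rng.normal(size=5)

G = np.array([[gauss_kernel(a, b) for b in pts] for a in pts])
print(alpha @ G @ beta)                           # <f, g> under the abstract inner product

e0, e1 = np.eye(5)[0], np.eye(5)[1]
print(e0 @ G @ e1, gauss_kernel(pts[0], pts[1]))  # identical: <K_{x_0}, K_{x_1}> = K(x_0, x_1)
```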
That answer gives a great linear-algebra explanation, but here is a geometric perspective, with both intuition and proof.
For any fixed point $z$, we have a kernel slice function $K_z(x) = K(z, x)$. The graph of $K_z$ is just a Gaussian bump centered at $z$. Now, if the feature space were only finite-dimensional, that would mean we could take a finite set of bumps at a fixed set of points and form any Gaussian bump anywhere else. But clearly there's no way we can do this; you can't make a new bump out of old bumps, because the new bump could be really far away from the old ones. So, no matter how many feature vectors (bumps) we have, we can always add new bumps, and in the feature space these are new independent vectors. So the feature space can't be finite-dimensional; it has to be infinite.
We use induction. Suppose you have an arbitrary set of points $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ such that the vectors $\Phi(x^{(i)})$ are linearly independent in the feature space. Now find a point $x^{(n+1)}$ distinct from these $n$ points, in fact a billion sigmas away from all of them. We claim that $\Phi(x^{(n+1)})$ is linearly independent from the first $n$ feature vectors $\Phi(x^{(i)})$.
Proof by contradiction. Suppose to the contrary that

$$\Phi\big(x^{(n+1)}\big) = \sum_{i=1}^n \alpha_i \, \Phi\big(x^{(i)}\big).$$
Now take the inner product on both sides with an arbitrary $\Phi(x)$. By the identity $\langle \Phi(z), \Phi(x) \rangle = K(z, x)$, we obtain

$$K\big(x^{(n+1)}, x\big) = \sum_{i=1}^n \alpha_i \, K\big(x^{(i)}, x\big).$$
Here $x$ is a free variable, so this equation is an identity stating that two functions are the same. In particular, it says that a Gaussian centered at $x^{(n+1)}$ can be represented as a linear combination of Gaussians centered at the other points $x^{(i)}$. It is obvious geometrically that one cannot create a Gaussian bump centered at one point from a finite combination of Gaussian bumps centered at other points, especially when all those other Gaussian bumps are a billion sigmas away. So our assumption of linear dependence has led to a contradiction, as we set out to show.
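A numeric way to see the induction step (a sketch assuming NumPy, not part of the original answer): the Gram matrix of the kernel slices keeps full rank as far-away points are added, which is exactly the linear independence claimed above.

```python
# Sketch: the rank of the Gaussian Gram matrix grows by one each time a new point is
# added, so the feature vectors Phi(x^(i)) stay linearly independent; the "new bump"
# is not a combination of the old ones.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / sigma ** 2)

rng = np.random.default_rng(3)
pts = list(rng.normal(size=(3, 2)))
for step in range(5):
    G = np.array([[gauss_kernel(a, b) for b in pts] for a in pts])
    print(len(pts), np.linalg.matrix_rank(G))          # rank == number of points
    pts.append(rng.normal(size=2) + 100.0 * (step + 1))  # add a far-away new point
```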
The kernel matrix of the Gaussian kernel always has full rank for distinct $x_1, \ldots, x_m$. This means that each time you add a new example, the rank increases by $1$. The easiest way to see this is to set $\sigma$ very small; then the kernel matrix is almost diagonal.
The fact that the rank always increases by one means that all projections $\Phi(x_i)$ in feature space are linearly independent (not orthogonal, but independent). Therefore, each example adds a new dimension to the span of the projections $\Phi(x_1), \ldots, \Phi(x_m)$. Since you can add uncountably infinitely many examples, the feature space must have infinite dimension. Interestingly, all projections of the input space into the feature space lie on a sphere, since $\|\Phi(x)\|^2 = K(x, x) = 1$. Nevertheless, the geometry of the sphere is flat. You can read more on that in
Burges, C. J. C. (1999). Geometry and Invariance in Kernel Based Methods. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning (pp. 89–116). MIT Press.
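Two quick numeric checks of the claims above (a sketch assuming NumPy, not taken from the cited chapter): with a tiny $\sigma$ the Gram matrix is nearly the identity and therefore full rank, and every projection satisfies $\|\Phi(x)\|^2 = K(x, x) = 1$, so the projections lie on the unit sphere of the feature space.

```python
# Sketch: (1) for very small sigma the Gaussian Gram matrix is almost the identity,
# hence full rank; (2) the diagonal entries K(x, x) = 1 show that every projection
# Phi(x) has unit norm, i.e. lies on the unit sphere.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))
sigma = 1e-3
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq_dists / sigma ** 2)

print(np.round(G, 3))               # approximately the identity matrix
print(np.linalg.matrix_rank(G))     # 6, i.e. full rank
print(np.diag(G))                   # all ones: K(x, x) = 1 for every x
```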
For the background and the notations I refer to the answer How to calculate decision boundary from support vectors?.
So the features in the 'original' space are the vectors $x_i$, the binary outcomes are $y_i \in \{-1, +1\}$, and the Lagrange multipliers are $\alpha_i$.
It is known that the kernel can be written as $K(x, y) = \Phi(x) \cdot \Phi(y)$, where '$\cdot$' represents the inner product and $\Phi$ is an (implicit and unknown) transformation to a new feature space.
I will try to give some 'intuitive' explanation of what this $\Phi$ looks like, so this answer is no formal proof; it just tries to give some feeling of how I think this works. Do not hesitate to correct me if I am wrong. The basis for my explanation is section 2.2.1 of this pdf.
I have to 'transform' my feature space (so my $x_i$) into some 'new' feature space in which the linear separation will be solved.
For each observation $x_i$, I define functions $\phi_i(x) = K(x_i, x)$, so I have a function for each element of my training sample. These functions span a vector space; call the vector space spanned by the $\phi_i$, $i = 1, 2, \ldots, N$, $V$. ($N$ is the size of the training sample.)
I will try to argue that this vector space $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in $V$ can be written as a linear combination of the $\phi_i$, i.e. $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers. So, in fact,

$$V = \Big\{ v = \sum_{i=1}^N \gamma_i \phi_i \ \Big|\ (\gamma_1, \gamma_2, \ldots, \gamma_N) \in \mathbb{R}^N \Big\}.$$
Note that $(\gamma_1, \gamma_2, \ldots, \gamma_N)$ are the coordinates of the vector $v$ in the vector space $V$.
$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x) = K(x_i, x)$ (see supra, we defined $\phi_i$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.
If the kernel is 'complex enough', then the $\phi_i$ will all be independent, and then the dimension of $V$ will be $N$, the size of the training sample.
The transformation that maps my original feature space to $V$ is defined as

$$\Phi : x_i \mapsto \phi_i(x) = K(x_i, x).$$
This map $\Phi$ maps my original feature space onto a vector space that can have a dimension up to the size of my training sample. So $\Phi$ maps each observation in my training sample into a vector space in which the vectors are functions. The $i$-th vector $x_i$ from my training sample is 'mapped' to a vector in $V$, namely the vector $\phi_i$, whose coordinates are all equal to zero except the $i$-th coordinate, which is $1$.
Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, (c) can, depending on my kernel, have a dimension that goes up to the size of my training sample, and (d) the vectors of $V$ look like $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.
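To make this concrete, here is a small sketch (assuming NumPy; not part of the original answer) that represents an element $v = \sum_i \gamma_i \phi_i$ of $V$ by its coordinate vector $\gamma$ and evaluates it at a new point using only kernel values; the training point $x_i$ itself corresponds to $\gamma = e_i$.

```python
# Sketch: work with elements of V through their coefficient vectors gamma.
# Evaluating v = sum_i gamma_i * K(x_i, .) at a point x only needs kernel values,
# and the image of the first training point is the element with gamma = e_1.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / sigma ** 2)

rng = np.random.default_rng(5)
X_train = rng.normal(size=(4, 2))          # N = 4 training points

def evaluate(gamma, x):
    """Evaluate v = sum_i gamma_i * K(x_i, .) at the point x."""
    return sum(g * gauss_kernel(xi, x) for g, xi in zip(gamma, X_train))

gamma = np.array([1.0, 0.0, 0.0, 0.0])     # coordinates of phi_1, the image of x_1
x_new = rng.normal(size=2)
print(evaluate(gamma, x_new), gauss_kernel(X_train[0], x_new))  # identical
```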
Looking at the function $f(x)$ in How to calculate decision boundary from support vectors?, it can be seen that $f(x) = \sum_i \alpha_i y_i \phi_i(x) + b$. The decision boundary found by the SVM is $f(x) = 0$.
In other words, $f(x)$ is a linear combination of the $\phi_i$, and $f(x) = 0$ is a linear separating hyperplane in the $V$-space: it is a particular choice of the $\gamma_i$, namely $\gamma_i = \alpha_i y_i$!
The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.
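This can be checked directly with scikit-learn (a sketch; the answer itself does not reference scikit-learn): a fitted SVC exposes the products $\alpha_i y_i$ as dual_coef_, so $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ can be rebuilt by hand and compared with decision_function.

```python
# Sketch: rebuild f(x) = sum_i alpha_i y_i K(x_i, x) + b from a fitted SVC and compare
# it with the SVC's own decision_function. dual_coef_ holds alpha_i * y_i for the
# support vectors, intercept_ holds b.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x_test = rng.normal(size=(3, 2))
sq_dists = ((x_test[:, None, :] - clf.support_vectors_[None, :, :]) ** 2).sum(-1)
f_manual = np.exp(-gamma * sq_dists) @ clf.dual_coef_.ravel() + clf.intercept_

print(f_manual)
print(clf.decision_function(x_test))   # matches f_manual
```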
This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$, with a different dimension. This dimension depends on the kernel you use, and for the RBF kernel it can go up to the size of the training sample. Since training samples can be of any size, this dimension can go up to 'infinity'. Obviously, in very high-dimensional spaces the risk of overfitting increases.
So kernels are a technique that allows the SVM to transform your feature space; see also What makes the Gaussian kernel so magical for PCA, and also in general?
Unfortunately, fcop's explanation is incorrect. He starts by saying: "It is known that the kernel can be written as ..., where ... is an (implicit and unknown) transformation to a new feature space." It is NOT unknown. In fact, this is the space the features are mapped into, and it is this space that can be infinite-dimensional in the RBF case. All the kernel does is take the inner product of the transformed feature vector with the transformed feature vector of a training example and apply some function to the result. Thus it implicitly represents this higher-dimensional feature vector. Think of writing $(x + y)^2$ instead of $x^2 + 2xy + y^2$, for example. Now think of what infinite series is implicitly represented by the exponential function... there you have your infinite feature space.
The right way to think about the SVM is that the features are mapped into a possibly infinite-dimensional feature space, which just happens to be implicitly representable in another, finite-dimensional 'kernel' feature space whose dimension can be as large as the training set size.