I am trying to understand the intuition behind kernel SVMs. Now, I understand how linear SVMs work: a decision line is found that separates the data as well as possible. I also understand the principle behind porting the data to a higher-dimensional space, and how this can make it easier to find a linear decision line in the new space. What I do not understand is how the kernel is used to project the data points into this new space.
What I know about a kernel is that it effectively represents the "similarity" between two data points. But how does that relate to the projection?
Answers:
Let $h(x)$ denote the projection to the high-dimensional space. Basically, the kernel function is $K(x_1, x_2) = \langle h(x_1), h(x_2) \rangle$, which is the inner product. So it is not used to project data points; rather, it is an outcome of the projection. It can be considered a measure of similarity, but in an SVM it is more than that.
The optimization for finding the best separating hyperplane involves $h(x)$ only through the inner product form. That is to say, if you know $K(\cdot, \cdot)$, you do not need to know the exact form of $h(x)$, which makes the optimization easier.
Each kernel $K(\cdot, \cdot)$ has a corresponding $h(x)$ as well. So if you are using an SVM with that kernel, then you are implicitly finding the linear decision line in the space that $h(x)$ maps to.
Chapter 12 of The Elements of Statistical Learning gives a brief introduction to SVM and provides more detail about the connection between kernels and feature mappings: http://statweb.stanford.edu/~tibs/ElemStatLearn/
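To make the identity $K(x_1, x_2) = \langle h(x_1), h(x_2) \rangle$ concrete, here is a small numerical sketch of my own (not from the answer above). The quadratic kernel $K(x, z) = (x^\top z)^2$ on $\mathbb{R}^2$ is a kernel whose $h$ can be written down in closed form, namely $h(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the code checks that the kernel value equals the explicit inner product.

```python
import numpy as np

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, a polynomial kernel of degree 2."""
    return np.dot(x, z) ** 2

def h(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(quadratic_kernel(x, z))   # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(h(x), h(z)))       # same value: 1.0
```

For the Gaussian kernel the corresponding $h$ is infinite-dimensional, which is exactly why you only ever work with $K$.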
The useful properties of kernel SVM are not universal; they depend on the choice of kernel. To get some intuition, it is helpful to look at one of the most commonly used kernels, the Gaussian kernel. Remarkably, this kernel turns the SVM into something very much like a k-nearest-neighbor classifier.
This answer explains the following: why perfect separation of the training data is always possible with a Gaussian kernel of sufficiently small bandwidth, how this separation can be interpreted as linear separation in a feature space, and how the kernel is used to construct the mapping from the data space into that feature space.
Perfect separation is always possible with a Gaussian kernel because of the kernel's locality, which leads to an arbitrarily flexible decision boundary. For sufficiently small kernel bandwidth, the decision boundary looks like you just drew little circles around the points whenever they are needed to separate the positive and negative examples:
(Credit: Andrew Ng's online machine learning course.)
So why does this happen from a mathematical point of view?
Consider the standard setup: you have a Gaussian kernel $K(x, z) = \exp(-\|x - z\|^2 / \sigma^2)$ and training data $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$, where the $y^{(i)}$ values are $\pm 1$. We want to learn the classifier function
$$\hat{y}(x) = \sum_i w_i y^{(i)} K(x^{(i)}, x).$$
Now how will we assign the weights $w_i$? Do we need an infinite-dimensional space and a quadratic programming algorithm? No, because I only want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $\|x^{(i)} - x^{(j)}\|$ between any two training examples, and I just set $w_i = 1$. This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have
$$\hat{y}(x^{(k)}) = \sum_{i=1}^n y^{(i)} K(x^{(i)}, x^{(k)}) = y^{(k)} K(x^{(k)}, x^{(k)}) + \sum_{i \neq k} y^{(i)} K(x^{(i)}, x^{(k)}) = y^{(k)} + \epsilon,$$
where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $x^{(k)}$ is a billion sigmas away from any other point, so for all $i \neq k$ we have
$$K(x^{(i)}, x^{(k)}) = \exp(-\|x^{(i)} - x^{(k)}\|^2 / \sigma^2) \approx 0.$$
Since $\epsilon$ is so small, $\hat{y}(x^{(k)})$ definitely has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data. In practice this would be terribly overfitting, but it shows the tremendous flexibility of the Gaussian kernel SVM, and how it can act very much like a nearest-neighbor classifier.
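Here is a short numerical check of this argument (a sketch of my own, not part of the original reasoning): with all $w_i = 1$ and $\sigma$ vastly smaller than the minimum pairwise distance, $\hat{y}(x^{(k)})$ has the same sign as $y^{(k)}$ on every training point.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 training points in R^2
y = rng.choice([-1, 1], size=20)      # arbitrary +-1 labels

# Make sigma a billion times smaller than the smallest pairwise distance.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
min_dist = dists[dists > 0].min()
sigma = min_dist / 1e9

def y_hat(x):
    """Kernel classifier with all weights w_i = 1."""
    k = np.exp(-np.sum((X - x) ** 2, axis=1) / sigma ** 2)
    return np.sum(y * k)

preds = np.sign([y_hat(x_k) for x_k in X])
print(np.all(preds == y))             # True: the training data is separated perfectly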
The fact that this can be interpreted as "perfect linear separation in an infinite-dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an abstract inner product in some new feature space:
$$K(x^{(i)}, x^{(j)}) = \langle \Phi(x^{(i)}), \Phi(x^{(j)}) \rangle,$$
where $\Phi(x)$ is the mapping from the data space into the feature space. It follows immediately that the function $\hat{y}(x)$ is a linear function in the feature space:
$$\hat{y}(x) = \sum_i w_i y^{(i)} \langle \Phi(x^{(i)}), \Phi(x) \rangle = L(\Phi(x)),$$
where the linear function $L(v)$ is defined on feature space vectors $v$ as
$$L(v) = \sum_i w_i y^{(i)} \langle \Phi(x^{(i)}), v \rangle.$$
This function is linear in $v$ because it is just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(x) = 0$ is just $L(v) = 0$, the level set of a linear function. This is the very definition of a hyperplane in the feature space.
Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$. It is possible to write down a formula for $\Phi$, but the feature space it maps to is quite abstract and is only really used for proving theoretical results about SVM. If you are still interested, here is how it works.
Basically, we define an abstract vector space $V$ where each vector is a function from the data space to $\mathbb{R}$. A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:
$$f(x) = \sum_{i=1}^n \alpha_i K(x^{(i)}, x),$$
which can be written more compactly as $f = \sum_{i} \alpha_i K_{x^{(i)}}$, where $K_{x}(y) = K(x, y)$ is a "slice" of the kernel at $x$.
The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:
$$\Big\langle \sum_i \alpha_i K_{x^{(i)}}, \sum_j \beta_j K_{x^{(j)}} \Big\rangle = \sum_{i,j} \alpha_i \beta_j K(x^{(i)}, x^{(j)}).$$
This definition is very deliberate: its construction ensures the identity we need for linear separation, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$.
With the feature space defined in this way, $\Phi$ is a mapping from the data space into $V$, taking each point $x$ to the "kernel slice" at that point:
$$\Phi(x) = K_x, \quad \text{where } K_x(y) = K(x, y).$$
You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details.
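To make the abstract construction slightly more tangible, here is a tiny sketch of my own (using an arbitrary Gaussian kernel) that represents vectors of $V$ by their coefficient vectors $\alpha$ and $\beta$, computes the abstract inner product as $\alpha^\top K \beta$ with $K$ the Gram matrix of the anchor points, and checks the identity $\langle \Phi(x^{(i)}), \Phi(x^{(j)}) \rangle = K(x^{(i)}, x^{(j)})$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                     # anchor points x^(1), ..., x^(5)

def kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

# Gram matrix K_ij = K(x^(i), x^(j))
K = np.array([[kernel(a, b) for b in X] for a in X])

def inner(alpha, beta):
    """Abstract inner product of f = sum_i alpha_i K_{x^(i)} and g = sum_j beta_j K_{x^(j)}."""
    return alpha @ K @ beta

# Phi(x^(i)) is the kernel slice K_{x^(i)}, i.e. the coefficient vector e_i.
e = np.eye(5)
print(np.isclose(inner(e[0], e[3]), K[0, 3]))   # True: <Phi(x^(1)), Phi(x^(4))> = K(x^(1), x^(4))
```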
For the background and the notations I refer to How to calculate decision boundary from support vectors?.
So the features in the 'original' space are the vectors $x_i$, the binary outcomes are $y_i \in \{-1, +1\}$, and the Lagrange multipliers are $\alpha_i$.
As said by @Lii (+1), the kernel can be written as $K(x, y) = h(x) \cdot h(y)$ ('$\cdot$' represents the inner product).
I will try to give some 'intuitive' explanation of what this $h$ looks like, so this answer is no formal proof; it just wants to give some feeling of how I think this works. Do not hesitate to correct me if I am wrong.
I have to 'transform' my feature space (so my $x_i$) into some 'new' feature space in which the linear separation will be solved.
For each observation $x_i$, I define the function $\phi_i(x) = K(x_i, x)$, so I have a function $\phi_i$ for each element of my training sample. These functions $\phi_i$ span a vector space; note it $V = \operatorname{span}(\phi_i,\; i = 1, 2, \ldots, N)$.
I will try to argue that $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in $V$ can be written as a linear combination of the $\phi_i$, i.e. $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.
$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x) = K(x_i, x)$ (see supra, we defined $\phi$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.
The transformation that maps my original feature space to $V$ is defined as
$$\Phi: x_i \mapsto \phi_i(x) = K(x_i, x).$$
This map $\Phi$ maps my original feature space onto a vector space that can have a dimension that goes up to the size of my training sample.
Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, (c) can, depending on my kernel, have a dimension that goes up to the size of my training sample, and (d) produces vectors of $V$ that look like $\sum_{i=1}^N \gamma_i \phi_i(x)$, where the $\gamma_i$ are real numbers.
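One way to see this transformation in action (a sketch of my own, and not exactly what the SVM optimizes, since the regularizer differs) is to map each point to its vector of kernel evaluations $(K(x_1, x), \ldots, K(x_N, x))$, i.e. its coordinates with respect to the $\phi_i$, and fit an ordinary linear classifier in that $N$-dimensional space. Data that is hopeless for a linear boundary in the original space becomes separable there.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import rbf_kernel

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

print(LinearSVC(max_iter=100000).fit(X, y).score(X, y))       # near chance level

# Map each point to (K(x_1, x), ..., K(x_N, x)): an N-dimensional representation.
K = rbf_kernel(X, X, gamma=1.0)

print(LinearSVC(C=10.0, max_iter=100000).fit(K, y).score(K, y))   # close to 1.0
```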
Looking at the function $f(x)$ in How to calculate decision boundary from support vectors?, it can be seen that $f(x) = \sum_{i=1}^N y_i \alpha_i \phi_i(x) + b$.
In other words, $f(x)$ is a linear combination of the $\phi_i$, and this is a linear separator in the $V$-space: it is a particular choice of the $\gamma_i$, namely $\gamma_i = \alpha_i y_i$!
The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.
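This can be checked directly on a fitted kernel SVM. In scikit-learn, `SVC.dual_coef_` holds exactly the products $\alpha_i y_i$ for the support vectors, so $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ can be rebuilt by hand and compared with `decision_function` (a sketch of my own; gamma is fixed explicitly so the hand-written kernel matches the fitted one):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# f(x) = sum_i (alpha_i * y_i) * K(x_i, x) + b, summed over the support vectors.
X_test = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
K = rbf_kernel(clf.support_vectors_, X_test, gamma=gamma)   # shape (n_SV, n_test)
f_manual = clf.dual_coef_ @ K + clf.intercept_              # dual_coef_ stores alpha_i * y_i

print(np.allclose(f_manual.ravel(), clf.decision_function(X_test)))   # True
```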
This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$ with a different dimension. This dimension depends on the kernel you use, and for the RBF kernel it can go up to the size of the training sample.
So kernels are a technique that allows the SVM to transform your feature space; see also What makes the Gaussian kernel so magical for PCA, and also in general?
Transform predictors (input data) to a high-dimensional feature space. It is sufficient to just specify the kernel for this step and the data is never explicitly transformed to the feature space. This process is commonly known as the kernel trick.
Let me explain it. The kernel trick is the key here. Consider the case of a Radial Basis Function (RBF) kernel, which transforms the input into an infinite-dimensional space. For a one-dimensional input $x$, the transformation to $\phi(x)$ can be represented as (taken from http://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf)
$$\phi(x) = e^{-x^2}\left[1,\; \sqrt{\tfrac{2}{1!}}\,x,\; \sqrt{\tfrac{2^2}{2!}}\,x^2,\; \sqrt{\tfrac{2^3}{3!}}\,x^3,\; \ldots\right]^T, \qquad \text{so that } K(x_i, x_j) = e^{-(x_i - x_j)^2} = \phi(x_i)^T \phi(x_j).$$
The input space is finite-dimensional, but the transformed space is infinite-dimensional. Transforming the input into an infinite-dimensional space is something that happens as a result of the kernel trick. Here $x$ is the input and $\phi(x)$ is the transformed input. But $\phi(x)$ is never computed as such; instead the product $\phi(x_i)^T \phi(x)$ is computed, which is just the exponential of the negative squared distance between $x_i$ and $x$.
There is a related question, Feature map for the Gaussian kernel, to which there is a nice answer: https://stats.stackexchange.com/a/69767/86202.
The output or decision function is a function of the kernel matrix $K(x_i, x) = \phi(x_i)^T \phi(x)$, and not of the input $x$ or the transformed input $\phi(x)$ directly.
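To get a concrete feel for that infinite-dimensional $\phi$, the series above can be truncated and evaluated numerically. The sketch below (my own, using $\gamma = 1$ and scalar inputs, following the expansion shown above) checks that a truncated explicit feature map reproduces the kernel value.

```python
import numpy as np
from math import factorial

def rbf(x, z):
    """Gaussian kernel with gamma = 1 for scalar inputs: exp(-(x - z)^2)."""
    return np.exp(-(x - z) ** 2)

def phi(x, n_terms=30):
    """Truncated explicit feature map: phi_n(x) = exp(-x^2) * sqrt(2^n / n!) * x^n."""
    return np.array([np.exp(-x ** 2) * np.sqrt(2.0 ** n / factorial(n)) * x ** n
                     for n in range(n_terms)])

x, z = 0.7, -0.4
print(rbf(x, z))                 # exact kernel value
print(np.dot(phi(x), phi(z)))    # truncated phi(x) . phi(z), essentially the same number
```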
Mapping to a higher dimension is merely a trick to solve a problem that is defined in the original dimension; concerns such as overfitting your data by going into a dimension with too many degrees of freedom are therefore not a byproduct of the mapping process, but are inherent in the problem definition.
Basically, all the mapping does is convert the conditional classification in the original dimension into the definition of a plane in the higher dimension, and because there is a one-to-one relationship between the plane in the higher dimension and the conditions in the lower dimension, you can always move between the two.
Clearly, for the problem of overfitting, you can overfit any set of observations by defining enough conditions to isolate each observation into its own class, which is equivalent to mapping your data to (n-1)D, where n is the number of observations.
Take the simplest problem as an example, where your observations are [[1,-1], [0,0], [1,1]] [[feature, value]]: by moving into the 2D dimension and separating your data with a line, you simply convert the conditional classification feature < 1 && feature > -1 : 0 into the definition of a line that passes through (-1 + epsilon, 1 - epsilon). If you have more data points and need more conditions, you just add one more degree of freedom to the higher dimension for each new condition you define.
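To make the general idea concrete, here is a small sketch of my own, using the map x -> (x, x^2) rather than whatever mapping the author has in mind: the one-dimensional condition feature < 1 && feature > -1 : 0 becomes a straight line, x^2 = 1, once each point carries x^2 as a second coordinate.

```python
import numpy as np

x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
labels = ((x > -1) & (x < 1)).astype(int)        # 1 inside (-1, 1), 0 outside

# No single threshold on x alone separates the two classes,
# but after mapping x -> (x, x^2) the horizontal line x^2 = 1 does.
mapped = np.column_stack([x, x ** 2])
pred = (mapped[:, 1] < 1).astype(int)            # "below the line x^2 = 1"

print(np.all(pred == labels))                    # True
```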
You can replace the process of mapping to a higher dimension with any process that provides you with a 1 to 1 relationship between the conditions and the degrees of freedom of your new problem. Kernel tricks simply do that.
Take, for example, data points of the form [x, floor(sin(x))]. Mapping your problem into a 2D dimension is not helpful here at all; in fact, mapping to any plane will not be helpful, because defining the problem as a set of conditions x < a && x > b : z does not work in this case. The simplest mapping in this case is a mapping into polar coordinates, or into the imaginary plane.
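Reading the example as a binary problem where the class is floor(sin(x)), so -1 when sin(x) is negative and 0 otherwise (this interpretation is my own), here is a sketch showing that the classes alternate along x, while the single mapped coordinate sin(x) separates them with one threshold:

```python
import numpy as np

x = np.linspace(0.0, 20.0, 200)
labels = np.floor(np.sin(x))                 # -1 where sin(x) < 0, otherwise 0

# The classes alternate along x, so no condition of the form a < x < b isolates one class.
# After mapping x to sin(x), a single threshold at 0 separates them exactly.
pred = np.where(np.sin(x) < 0, -1.0, 0.0)

print(np.all(pred == labels))                # True
```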