Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow
Chapter 5: Support Vector Machines
A Support Vector Machine (SVM) is capable of performing linear and nonlinear classification, regression, and even outlier detection.
SVMs are among the most popular ML models.
SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.
5.1 Linear SVM Classification
Requirement: the dataset is linearly separable.
Principle: large margin classification. An SVM classifier can be thought of as fitting the widest possible margin between the classes.
The instances on the margin are called support vectors.
SVMs are sensitive to feature scales, so it is better to apply `StandardScaler` before training.
5.1.1 Soft Margin Classification
Hard margin classification: all instances are required to be off the margin and on the right side. This has two drawbacks: first, it works only if the data is linearly separable; second, it is quite sensitive to outliers.
Soft margin classification: find a good balance between keeping the margin as wide as possible and limiting margin violations, i.e., instances that end up inside the margin or even on the wrong side.
Note that an instance inside the margin is not necessarily misclassified: it may still lie on the right side of the decision boundary.
In Scikit-Learn's SVM classes, the `C` hyperparameter controls this balance. A smaller `C` leads to a wider margin but more margin violations. If your SVM model is overfitting, you can try regularizing it by reducing `C`.
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1.0 for Iris-Virginica, else 0.0
# Scale the features, then fit a linear soft margin SVM
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])  # array([1.])
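The effect of `C` on the margin can be checked with a small experiment. The sketch below is not from the book: it assumes `SVC(kernel="linear")` as the solver (slower than `LinearSVC`, but convenient for reading off the weights), and the helper name `margin_stats` is made up for illustration. The margin width is computed as $2/\|\textbf{w}\|$ in the scaled feature space.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 for Iris-Virginica, else 0
t = 2 * y - 1                          # same labels encoded as -1 / +1


def margin_stats(C):
    """Hypothetical helper: fit a linear SVM, return (margin width, violations)."""
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("svc", SVC(kernel="linear", C=C)),
    ])
    model.fit(X, y)
    w = model.named_steps["svc"].coef_[0]
    width = 2 / np.linalg.norm(w)  # distance between the two margin boundaries
    # An instance violates the margin when t * decision_function < 1
    # (a small tolerance absorbs the solver's numerical error).
    violations = int(np.sum(t * model.decision_function(X) < 1 - 1e-3))
    return width, violations


width_small_C, violations_small_C = margin_stats(C=0.01)
width_large_C, violations_large_C = margin_stats(C=100)
print("C=0.01:", width_small_C, violations_small_C)
print("C=100: ", width_large_C, violations_large_C)
```

The smaller `C` should report the wider margin; the violation counts are printed only for inspection.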
Alternatives:
- `SVC(kernel="linear", C=1)`: slower, not recommended.
- `SGDClassifier(loss="hinge", alpha=1/(m*C))`: does not converge as fast as the `LinearSVC` class, but is useful for huge datasets that do not fit in memory (out-of-core training) or for online classification tasks.
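As a minimal sketch of the `SGDClassifier` alternative, the snippet below sets `alpha=1/(m*C)` as described above, where `m` is the number of training instances. The values of `max_iter`, `tol`, and `random_state` are assumptions chosen for reproducibility, not recommendations from the book.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 for Iris-Virginica, else 0

m = len(X)  # number of training instances
C = 1       # same C as in the LinearSVC example

sgd_clf = Pipeline([
    ("scaler", StandardScaler()),
    # Hinge loss + L2 penalty (the default) gives a linear SVM trained by SGD
    ("sgd", SGDClassifier(loss="hinge", alpha=1 / (m * C),
                          max_iter=1000, tol=1e-3, random_state=42)),
])
sgd_clf.fit(X, y)
train_accuracy = sgd_clf.score(X, y)
prediction = sgd_clf.predict([[5.5, 1.7]])
print(train_accuracy, prediction)
```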
5.2 Nonlinear SVM Classification
For datasets that are not linearly separable, add new features such as polynomial features.
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC
X, y = make_moons()
# Add polynomial features up to degree 3, scale them, then fit a linear SVM
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)
5.2.1 Polynomial Kernel
Kernel trick: get the same result as if you had added many polynomial features, without actually having to add them.
from sklearn.svm import SVC
# Third-degree polynomial kernel; no explicit polynomial features are created
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)
The hyperparameter `coef0` controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
A common way to find the right hyperparameter values is to use grid search (the `GridSearchCV` class).
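A grid search over the polynomial kernel's hyperparameters might look like the sketch below. The grid values, the 3-fold cross-validation, and the `make_moons` settings (`noise=0.15`, `random_state=42`) are illustrative assumptions, not values prescribed by these notes.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "svm_clf__degree": [2, 3],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [0.1, 5],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)
```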
5.2.2 Adding Similarity Features
Another way to introduce new features is to use a similarity function that measures how much each instance resembles a particular landmark.
For example, use the Gaussian Radial Basis Function (RBF) as the similarity function, with $\gamma=0.3$.
Consider a one-dimensional dataset with two landmarks at $x_1=-2$ and $x_1=1$, where $x_1$ is the dataset's single feature.
Equation 5-1. Gaussian RBF
$$
\phi_\gamma(\textbf{x},\ell)=\exp\left(-\gamma\|\textbf{x}-\ell\|^2\right)
$$
It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark).
For the instance $x_1=-1$, located at a distance of $1$ from the first landmark and $2$ from the second, the new features are $x_2=\exp(-0.3\times 1^2)\approx 0.74$ and $x_3=\exp(-0.3\times 2^2)\approx 0.30$.
How to select landmarks? A simple way is to use every instance as a landmark. If the dataset contains $m$ instances with $n$ features, this results in a training set with $m$ instances and $m$ features (assuming you drop the original features).
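The worked example above can be reproduced in a few lines. This is only a sketch; the function name `gaussian_rbf` is chosen here for illustration.

```python
import numpy as np


def gaussian_rbf(x, landmark, gamma):
    """Equation 5-1: exp(-gamma * ||x - landmark||^2)."""
    diff = np.atleast_1d(x) - np.atleast_1d(landmark)
    return np.exp(-gamma * np.dot(diff, diff))


gamma = 0.3
x1 = -1.0                # the instance from the text
landmarks = [-2.0, 1.0]  # the two landmarks from the text

x2, x3 = (gaussian_rbf(x1, landmark, gamma) for landmark in landmarks)
print(round(x2, 2), round(x3, 2))  # 0.74 0.3
```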
5.2.3 Gaussian RBF Kernel
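Again, the kernel trick applies: the RBF kernel gives the same result as adding RBF similarity features, without actually having to add them.
# Gaussian RBF kernel; gamma and C both act as regularization hyperparameters
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)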
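Figure 5-9 shows that increasing `gamma` makes the bell-shaped curve narrower, so each instance's range of influence is smaller and the decision boundary becomes more irregular, which can lead to overfitting. If the model is overfitting, reduce `gamma`; if it is underfitting, increase it.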
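The link between the RBF kernel and the similarity features of Section 5.2.2 can be checked numerically. The sketch below uses every instance as a landmark, computes the similarity features by hand, and compares them with Scikit-Learn's `rbf_kernel` helper. The dataset size and `random_state` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=20, noise=0.15, random_state=42)
gamma = 5

# Similarity features with all m instances as landmarks:
# entry (i, j) is exp(-gamma * ||x_i - x_j||^2), giving an (m, m) matrix.
diffs = X[:, np.newaxis, :] - X[np.newaxis, :, :]
similarity_features = np.exp(-gamma * np.sum(diffs ** 2, axis=2))

# The kernel matrix that a kernelized SVM works with
kernel_matrix = rbf_kernel(X, X, gamma=gamma)

print(similarity_features.shape)
print(np.allclose(similarity_features, kernel_matrix))
```

The kernelized SVM never materializes these features as an input matrix for a linear model; it only evaluates the kernel where the optimizer needs it.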
The usage of other kernels is rare.
As a rule of thumb, always try the linear kernel first (`LinearSVC` is much faster than `SVC(kernel="linear")`), especially if the training set is very large or has plenty of features. If the training set is not too large, try the Gaussian RBF kernel as well.
5.2.4 Computational Complexity
The `LinearSVC` class is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: roughly $O(m\times n)$.
The tolerance hyperparameter $\epsilon$ (`tol` in Scikit-Learn) controls the precision.
The `SVC` class is based on the libsvm library, which implements SMO (Sequential Minimal Optimization), an algorithm that supports the kernel trick. The training time complexity is usually between $O(m^2\times n)$ and $O(m^3\times n)$.
Table 5-1. Comparison of Scikit-Learn classes for SVM classification
| Class | Time complexity | Out-of-core support | Scaling required | Kernel trick |
|---|---|---|---|---|
| LinearSVC | $O(m\times n)$ | No | Yes | No |
| SGDClassifier | $O(m\times n)$ | Yes | Yes | No |
| SVC | $O(m^2\times n)$ to $O(m^3\times n)$ | No | Yes | Yes |
5.3 SVM Regression
SVM Regression reverses the objective: instead of fitting the widest possible street between two classes, it tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by the hyperparameter $\epsilon$.
from sklearn.svm import LinearSVR
# Linear SVM regression
linear_svr_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_reg", LinearSVR(epsilon=1.5)),
])
linear_svr_reg.fit(X, y)
from sklearn.svm import SVR
# Nonlinear SVM regression with a second-degree polynomial kernel
nonlinear_svr_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_reg", SVR(kernel="poly", degree=2, C=100, epsilon=0.1)),
])
nonlinear_svr_reg.fit(X, y)
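To see what `epsilon` does, the sketch below fits `LinearSVR` on synthetic linear data and counts the instances that end up off the street, i.e. whose absolute residual exceeds `epsilon`. The data-generating process (a noisy line) and the helper name `off_street_count` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

rng = np.random.default_rng(42)
m = 50
X_reg = 2 * rng.random((m, 1))                    # one feature in [0, 2)
y_reg = 4 + 3 * X_reg[:, 0] + rng.normal(size=m)  # noisy linear target


def off_street_count(epsilon):
    """Hypothetical helper: number of training instances off the street."""
    reg = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_reg", LinearSVR(epsilon=epsilon, random_state=42)),
    ])
    reg.fit(X_reg, y_reg)
    residuals = np.abs(y_reg - reg.predict(X_reg))
    return int(np.sum(residuals > epsilon))


narrow_street = off_street_count(epsilon=0.5)
wide_street = off_street_count(epsilon=1.5)
print(narrow_street, wide_street)
```

On this data the wider street is expected to leave fewer instances outside it.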
5.4 Under the Hood
5.4.1 Decision Function and Predictions
Equation 5-2. Linear SVM classifier prediction
$$
\hat y=\begin{cases}
0 & \textrm{if }\textbf{w}^T\cdot \textbf{x}+b<0\\
1 & \textrm{if }\textbf{w}^T\cdot \textbf{x}+b\ge 0
\end{cases}
$$
The decision function has the form $\textbf{w}^T\cdot \textbf{x}+b=w_1x_1+\cdots+w_nx_n+b$.
The graph of $h=\textbf{w}^T\cdot \textbf{x}+b$ is an $n$-dimensional hyperplane in the $(n+1)$-dimensional space of $(\textbf{x},h)$, whereas the decision boundary $\textbf{w}^T\cdot \textbf{x}+b=0$ is the intersection of that hyperplane with $h=0$, which is an $(n-1)$-dimensional hyperplane.
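Equation 5-2 can be verified against a fitted model. The sketch below assumes `SVC(kernel="linear")`, whose `coef_` and `intercept_` attributes expose $\textbf{w}$ and $b$, and recomputes the decision function and the predictions by hand.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, (2, 3)])
y = (iris["target"] == 2).astype(int)

clf = SVC(kernel="linear", C=1).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

decision = X @ w + b                  # w^T . x + b for every instance
y_pred = (decision >= 0).astype(int)  # Equation 5-2

print(np.allclose(decision, clf.decision_function(X)))
print(np.array_equal(y_pred, clf.predict(X)))
```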
5.4.2 Training Objective
Hard margin SVM:
Objective: make the margin as wide as possible.
Analysis: the margin boundaries are the points where the decision function equals $\pm 1$, so the margin width is $2/\|\textbf{w}\|$: the smaller the norm $\|\textbf{w}\|$ of the weight vector $\textbf{w}$, the wider the margin. Minimizing $\|\textbf{w}\|$ is equivalent to minimizing $\frac{1}{2}\|\textbf{w}\|^2$, which is equal to $\frac{1}{2}\textbf{w}^T\textbf{w}$.
Constraints: the decision function must be at least $1$ for every positive training instance and at most $-1$ for every negative one. Defining $t^{(i)}=1$ for positive instances ($y^{(i)}=1$) and $t^{(i)}=-1$ for negative instances ($y^{(i)}=0$), both cases can be expressed as the single constraint $t^{(i)}(\textbf{w}^T\cdot\textbf{x}^{(i)}+b)\ge 1$.
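Equation 5-3. Hard margin linear SVM classifier objective
$$
\begin{aligned}
&\mathop{\textrm{minimize}}\limits_{\textbf w,b}\quad \frac{1}{2}\textbf w^T\cdot \textbf w\\
&\textrm{subject to}\quad t^{(i)}\left(\textbf w^T\cdot \textbf x^{(i)}+b\right)\ge 1 \textrm{ for } i=1,2,\cdots,m
\end{aligned}
$$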
Soft margin SVM:
Introduce a slack variable $\zeta^{(i)}\ge 0$ for each instance $\textbf{x}^{(i)}$, measuring how much that instance is allowed to violate the margin. The `C` hyperparameter introduced in 5.1.1 then controls the balance between making the margin as wide as possible and limiting margin violations.
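Equation 5-4. Soft margin linear SVM classifier objective
$$
\begin{aligned}
&\mathop{\textrm{minimize}}\limits_{\textbf w,b,\zeta}\quad \frac{1}{2}\textbf w^T\cdot \textbf w+C\sum_{i=1}^m \zeta^{(i)}\\
&\textrm{subject to}\quad t^{(i)}\left(\textbf w^T\cdot \textbf x^{(i)}+b\right)\ge 1-\zeta^{(i)} \textrm{ and } \zeta^{(i)}\ge 0 \textrm{ for } i=1,2,\cdots,m
\end{aligned}
$$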
5.4.3 Quadratic Programming
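Equation 5-5. Quadratic Programming problem
$$
\begin{aligned}
&\mathop{\textrm{minimize}}\limits_{\textbf p}\quad \frac{1}{2}\textbf p^T\cdot \textbf H\cdot \textbf p+\textbf f^T\cdot \textbf p\\
&\textrm{subject to}\quad \textbf A\cdot \textbf p\le \textbf b
\end{aligned}
$$
where $\textbf p$ is an $n_p$-dimensional vector ($n_p$ is the number of parameters), $\textbf H$ is an $n_p\times n_p$ matrix, $\textbf f$ is an $n_p$-dimensional vector, $\textbf A$ is an $n_c\times n_p$ matrix ($n_c$ is the number of constraints), and $\textbf b$ is an $n_c$-dimensional vector.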
The expression $\textbf{A}\cdot \textbf{p}\le \textbf{b}$ actually defines $n_c$ constraints: $\textbf{p}^T\cdot \textbf{a}^{(i)}\le b^{(i)}$ for $i=1,2,\cdots,n_c$, where $\textbf{a}^{(i)}$ is the vector containing the elements of the $i$-th row of $\textbf{A}$ and $b^{(i)}$ is the $i$-th element of $\textbf{b}$.
The hard margin linear SVM classifier objective corresponds to the following QP parameters:
- $n_p=n+1$, where $n$ is the number of features (the $+1$ is for the bias term).
- $n_c=m$, where $m$ is the number of training instances.
- $\textbf{H}$ is the $n_p\times n_p$ identity matrix, except with a zero in the top-left cell (to ignore the bias term).
- $\textbf{f} = \textbf{0}$, an $n_p$-dimensional vector full of 0s.
- $\textbf{b} = -\textbf{1}$, an $n_c$-dimensional vector full of $-1$s, because $t^{(i)}(\textbf{w}^T\cdot\textbf{x}^{(i)}+b)\ge 1$ is equivalent to $-t^{(i)}(\textbf{w}^T\cdot\textbf{x}^{(i)}+b)\le -1$.
- $\textbf{a}^{(i)} = -t^{(i)} \dot{\textbf{x}}^{(i)}$, where $\dot{\textbf{x}}^{(i)}$ is equal to $\textbf{x}^{(i)}$ with an extra bias feature $\dot{x}_0=1$.
So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf QP solver by passing it the preceding parameters.
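As a rough sketch of this approach, the code below builds $\textbf{H}$, $\textbf{f}$, $\textbf{A}$, and $\textbf{b}$ for a tiny separable dataset and hands them to SciPy. Two assumptions to note: SciPy's `minimize` with `method="SLSQP"` is a general constrained optimizer standing in for a dedicated QP solver, and the right-hand side is taken as a vector of $-1$s so that $\textbf{A}\cdot\textbf{p}\le\textbf{b}$ matches $t^{(i)}(\textbf{w}^T\cdot\textbf{x}^{(i)}+b)\ge 1$. The toy dataset is made up so that the answer is known: $\textbf{w}=(1,0)$ and $b=-1$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable dataset: negatives at x1 = 0, positives at x1 = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
m, n = X.shape
n_p = n + 1  # parameters: p = (b, w_1, ..., w_n)

H = np.eye(n_p)
H[0, 0] = 0.0                  # do not penalize the bias term
f = np.zeros(n_p)
X_dot = np.c_[np.ones(m), X]   # add the bias feature x_0 = 1
A = -t[:, np.newaxis] * X_dot  # row i is a^(i) = -t^(i) * x_dot^(i)
b_vec = -np.ones(m)            # right-hand side of A p <= b

result = minimize(
    fun=lambda p: 0.5 * p @ H @ p + f @ p,
    x0=np.zeros(n_p),
    jac=lambda p: H @ p + f,
    # SLSQP expects inequality constraints in the form g(p) >= 0
    constraints=[{"type": "ineq",
                  "fun": lambda p: b_vec - A @ p,
                  "jac": lambda p: -A}],
    method="SLSQP",
)
bias, w = result.x[0], result.x[1:]
print(bias, w)
```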
The kernel trick cannot be applied to this primal formulation. To use it, a different constrained optimization problem has to be solved.
5.4.4 The Dual Problem
Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem. The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it has the same solution as the primal problem. The conditions are as follows:
- the objective function is convex;
- the inequality constraints are continuously differentiable and convex functions.
Equation 5-6. Dual form of the linear SVM objective
$$
\begin{aligned}
&\mathop{\textrm{minimize}}\limits_\alpha\quad \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha^{(i)}\alpha^{(j)}t^{(i)}t^{(j)}\textbf{x}^{(i)^T}\textbf{x}^{(j)}-\sum_{i=1}^m\alpha^{(i)}\\
&\textrm{subject to}\quad \alpha^{(i)}\ge 0 \textrm{ for }i=1,2,\cdots,m
\end{aligned}
$$
Once you find the vector $\hat\alpha$ that minimizes Equation 5-6 (using a QP solver), you can compute the $\hat{\textbf w}$ and $\hat b$ that minimize the primal problem by using Equation 5-7.
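Equation 5-7. From the dual solution to the primal solution
$$
\begin{aligned}
\hat{\textbf w}&=\sum_{i=1}^m \hat \alpha^{(i)}t^{(i)}\textbf x^{(i)}\\
\hat b&=\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\hat{\textbf{w}}^T\cdot \textbf x^{(i)}\right)
\end{aligned}
$$
Here $n_s$ is the number of support vectors (the instances with $\hat\alpha^{(i)}>0$). The bias formula follows from the fact that a support vector satisfies $t^{(i)}(\hat{\textbf w}^T\cdot\textbf x^{(i)}+\hat b)=1$; since $t^{(i)}=\pm 1$, this gives $\hat b=t^{(i)}-\hat{\textbf w}^T\cdot\textbf x^{(i)}$, averaged over all support vectors.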
The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features.
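Scikit-Learn's `SVC` solves the dual problem, so the step from the dual solution back to $\hat{\textbf w}$ and $\hat b$ can be illustrated with its fitted attributes. In this sketch, `dual_coef_` holds $\hat\alpha^{(i)}t^{(i)}$ for each support vector, a very large `C` is assumed to approximate a hard margin, and the bias is averaged as $t^{(i)}-\hat{\textbf w}^T\cdot\textbf x^{(i)}$ over the support vectors. The toy dataset is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable dataset: negatives at x1 = 0, positives at x1 = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])
t = 2 * y - 1  # labels as -1 / +1

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

alpha_t = clf.dual_coef_[0]        # alpha^(i) * t^(i), support vectors only
support_vectors = clf.support_vectors_
t_sv = t[clf.support_]             # targets of the support vectors

w_hat = alpha_t @ support_vectors  # sum_i alpha^(i) t^(i) x^(i)
b_hat = np.mean(t_sv - support_vectors @ w_hat)

print(w_hat, clf.coef_[0])
print(b_hat, clf.intercept_[0])
```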
5.4.5 Kernelized SVM
Equation 5-8. Second-degree polynomial mapping
$$
\phi(\textbf x)=\phi\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right)=\begin{pmatrix} x_1^2\\\sqrt{2}x_1x_2\\x_2^2 \end{pmatrix}
$$
Equation 5-9. Kernel trick for a 2nd-degree polynomial mapping
$$
\begin{aligned}
\phi(\textbf a)^T\cdot\phi(\textbf b)&=\begin{pmatrix} a_1^2\\\sqrt{2}a_1a_2\\a_2^2 \end{pmatrix}^T\cdot\begin{pmatrix} b_1^2\\\sqrt{2}b_1b_2\\b_2^2 \end{pmatrix}=a_1^2b_1^2+2a_1b_1a_2b_2+a_2^2b_2^2\\
&=(a_1b_1+a_2b_2)^2=\left(\begin{pmatrix} a_1\\a_2 \end{pmatrix}^T\cdot\begin{pmatrix} b_1\\b_2 \end{pmatrix}\right)^2=\left(\textbf a^T\cdot \textbf b\right)^2
\end{aligned}
$$
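The identity in Equation 5-9 is easy to check numerically. The vectors `a` and `b` below are arbitrary examples, and the function names are chosen here for illustration.

```python
import numpy as np


def phi(x):
    """Second-degree polynomial mapping from Equation 5-8."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly2_kernel(a, b):
    """Same dot product computed in the original 2D space."""
    return (a @ b) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = phi(a) @ phi(b)        # dot product after the transformation
via_kernel = poly2_kernel(a, b)   # (1*3 + 2*4)^2 = 121
print(explicit, via_kernel)
```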
Now here is the key insight: if the data is not linearly separable, you may transform it into a higher-dimensional space. If you apply the transformation $\phi$ to all training instances, then the dual problem (see Equation 5-6) will contain the dot product $\phi(\textbf x^{(i)})^T\cdot\phi(\textbf x^{(j)})$. But if $\phi$ is the 2nd-degree polynomial transformation defined in Equation 5-8, then you can replace this dot product of transformed vectors simply by $\left(\textbf x^{(i)^T}\cdot \textbf x^{(j)}\right)^2$.
So you don’t actually need to transform the training instances at all: just replace the dot product by its square in Equation 5-6. The result will be strictly the same as if you went through the trouble of actually transforming the training set then fitting a linear SVM algorithm, but this trick makes the whole process much more computationally efficient. This is the essence of the kernel trick.
The function $K(\textbf a, \textbf b) = (\textbf a^T \cdot \textbf b)^2$ is called a 2nd-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product $\phi (\textbf a)^T \cdot \phi (\textbf b)$ based only on the original vectors $\textbf a$ and $\textbf b$, without having to compute (or even to know about) the transformation $\phi$. Equation 5-10 lists some of the most commonly used kernels.
Equation 5-10. Common kernels
$$
\begin{array}{rl}\textrm{Linear:}& K(\textbf a, \textbf b) = \textbf a^T\cdot \textbf b\\
\textrm{Polynomial:}& K (\textbf a, \textbf b) = \left(\gamma\textbf a^T \cdot \textbf b + r\right)^ d\\
\textrm{Gaussian RBF:}& K (\textbf a, \textbf b) = \exp\left(-\gamma\|\textbf a - \textbf b\|^2\right)\\
\textrm{Sigmoid:}& K (\textbf a, \textbf b) = \tanh \left(\gamma\textbf a^T \cdot \textbf b + r\right)\end{array}
$$
Mercer's Theorem: if a function $K(\textbf a, \textbf b)$ respects Mercer's conditions ($K$ must be continuous, symmetric in its arguments so that $K(\textbf a, \textbf b)=K(\textbf b, \textbf a)$, etc.), then there exists a function $\phi$ that maps $\textbf a$ and $\textbf b$ into another space (possibly with much higher dimensions) such that $K(\textbf a, \textbf b)=\phi(\textbf a)^T\cdot \phi(\textbf b)$. So you can use $K$ as a kernel since you know $\phi$ exists, even if you don't know what $\phi$ is. In the case of the Gaussian RBF kernel, it can be shown that $\phi$ actually maps each training instance to an infinite-dimensional space, so it's a good thing you don't need to actually perform the mapping!
With the kernel trick, $\hat{\textbf w}$ is hard to obtain explicitly: it must have the same number of dimensions as $\phi(\textbf x^{(i)})$, which may be huge or even infinite. However, you can skip computing it: plugging the expression for $\hat{\textbf w}$ from Equation 5-7 into the decision function gives a prediction formula that involves only kernel evaluations.
Equation 5-11. Making predictions with a kernelized SVM
$$
\begin{aligned}
h_{\hat{\textbf w},\hat b}\left (\phi\left(\textbf x^{(n)}\right) \right)
&=\hat{\textbf w}^T\cdot \phi\left(\textbf x^{(n)}\right) +\hat b
=\left(\sum_{i=1}^m \hat\alpha^{(i)}t^{(i)}\phi\left(\textbf x^{(i)}\right)\right)^T\cdot \phi\left(\textbf x^{(n)}\right) +\hat b\\
&=\sum_{i=1}^m \hat\alpha^{(i)}t^{(i)}\left(\phi\left(\textbf x^{(i)}\right)^T\cdot \phi\left(\textbf x^{(n)}\right)\right) +\hat b\\
&=\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0} \hat\alpha^{(i)}t^{(i)}K\left(\textbf x^{(i)},\textbf x^{(n)}\right)+\hat b
\end{aligned}
$$
where $\textbf x^{(n)}$ is a new instance.
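Equation 5-11 can be reproduced from a fitted `SVC`: the decision function is a kernel-weighted sum over the support vectors plus the bias. The sketch below assumes an RBF kernel with an explicit `gamma` (so the same value can be passed to `rbf_kernel`); the new instances in `X_new` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1).fit(X, y)

X_new = np.array([[0.5, 0.0], [1.5, -0.5]])  # new instances x^(n)

# K[i, k] = K(support vector i, new instance k)
K = rbf_kernel(clf.support_vectors_, X_new, gamma=gamma)

# dual_coef_[0] holds alpha^(i) * t^(i) for the support vectors only
manual_decision = clf.dual_coef_[0] @ K + clf.intercept_[0]

print(manual_decision)
print(clf.decision_function(X_new))
```

Only the support vectors appear in the sum, which is why prediction cost grows with their number rather than with the full training set.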
Equation 5-12. Computing the bias term using the kernel trick
$$
\begin{aligned}
\hat b&=\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\hat{\textbf{w}}^T\cdot \phi\left(\textbf x^{(i)}\right) \right)\\
&=\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\left(\sum_{j=1}^m \hat\alpha^{(j)}t^{(j)}\phi\left(\textbf x^{(j)}\right)\right)^T\cdot \phi\left(\textbf x^{(i)}\right) \right)\\
&=\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\mathop{\sum_{j=1}^m}\limits_{\hat \alpha^{(j)}>0} \hat\alpha^{(j)}t^{(j)}K\left(\textbf x^{(i)}, \textbf x^{(j)}\right) \right)
\end{aligned}
$$
As in the linear case, each term uses the fact that a support vector satisfies $t^{(i)}\left(\hat{\textbf w}^T\cdot\phi(\textbf x^{(i)})+\hat b\right)=1$ with $t^{(i)}=\pm 1$.
5.4.6 Online SVMs
For training online linear SVM classifiers, one way is to use Gradient Descent to minimize the cost function in Equation 5-13.
Equation 5-13. Linear SVM classifier cost function
$$
J(\textbf w,b)=\frac{1}{2}\textbf w^T \textbf w +C\sum_{i=1}^m \max\left(0,1-t^{(i)}\left(\textbf w^T\cdot \textbf x^{(i)}+b\right)\right)
$$
The first sum in the cost function will push the model to have a small weight vector w, leading to a larger margin. The second sum computes the total of all margin violations. An instance’s margin violation is equal to 0 if it is located off the street and on the correct side, or else it is proportional to the distance to the correct side of the street.
Hinge loss: the function $\max(0,1-t)$ is called the hinge loss. It is equal to $0$ when $t\ge 1$ and to $1-t$ when $t<1$. Its derivative (slope) is equal to $-1$ if $t<1$ and $0$ if $t>1$. It is not differentiable at $t=1$, but just like for Lasso Regression you can still use Gradient Descent using any subderivative at $t=1$ (i.e., any value between $-1$ and $0$).
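A minimal sketch of this idea is batch subgradient descent on Equation 5-13. Everything beyond the cost function itself is an assumption made for illustration: the constant learning rate `eta`, the number of iterations, the zero initialization, and the choice of subderivative $0$ at the kink.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, (2, 3)])
t = np.where(iris["target"] == 2, 1.0, -1.0)  # +1 Iris-Virginica, -1 otherwise


def cost(w, b, C):
    """Equation 5-13: regularization term plus C times the total hinge loss."""
    hinge = np.maximum(0, 1 - t * (X @ w + b))
    return 0.5 * w @ w + C * hinge.sum()


def train(C=1.0, eta=0.001, n_iterations=2000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iterations):
        # Only margin violators contribute to the hinge subgradient;
        # instances exactly at the kink use the subderivative 0.
        violators = t * (X @ w + b) < 1
        grad_w = w - C * (t[violators] @ X[violators])
        grad_b = -C * t[violators].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


C = 1.0
initial_cost = cost(np.zeros(X.shape[1]), 0.0, C)
w, b = train(C=C)
final_cost = cost(w, b, C)
accuracy = np.mean(np.sign(X @ w + b) == t)
print(initial_cost, final_cost, accuracy)
```

For true online learning the same update would be applied to one instance or a mini-batch at a time instead of the full training set.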
To sum up, the SVM is one of the most popular ML models. It serves as a linear and nonlinear classifier, and also performs linear and nonlinear regression.
- Linear SVM Classification
  - Preprocessing: `StandardScaler`
  - The balance between keeping the margin as wide as possible and limiting margin violations is controlled by the `C` hyperparameter in Scikit-Learn's SVM classes. A smaller `C` value leads to a wider margin but more margin violations.
  - Three ways to perform linear SVM classification:
    - `LinearSVC(loss="hinge", C=1)`: fast
    - `SVC(kernel="linear", C=1)`: slow
    - `SGDClassifier(loss="hinge", alpha=1/(m*C))`: fast, supports out-of-core training
- Nonlinear SVM Classification
  - Four ways to perform nonlinear SVM classification:
    - polynomial features followed by a linear SVM: `PolynomialFeatures(degree=3)`, then `LinearSVC(loss="hinge", C=10)`
    - polynomial kernel trick: `SVC(kernel="poly", degree=3, coef0=1, C=5)`, where `coef0` controls the influence of high-degree versus low-degree polynomials
    - adding similarity features by using a similarity function, such as the Gaussian RBF
    - Gaussian RBF kernel: `SVC(kernel="rbf", gamma=5, C=0.001)`
- SVM Regression
  - `LinearSVR(epsilon=1.5)`, where `epsilon` controls the width of the street
  - `SVR(kernel="poly", degree=2, C=100, epsilon=0.1)`
- Under the Hood
  - decision function: $\textbf w^T\cdot \textbf x+b$
  - decision boundary: $\textbf w^T\cdot \textbf x+b=0$
  - training objective and constraints:
    - hard margin classification: $\mathop{\textrm{minimize}}\limits_{\textbf w,b}\ \frac{1}{2}\textbf w^T\cdot \textbf w$ subject to $t^{(i)}\left( \textbf w^T\cdot \textbf x^{(i)} +b\right)\ge 1$ for $i=1,2,\cdots,m$
    - soft margin classification: $\mathop{\textrm{minimize}}\limits_{\textbf w,b,\zeta}\ \frac{1}{2}\textbf w^T\cdot \textbf w+C\sum_{i=1}^m \zeta^{(i)}$ subject to $t^{(i)}\left( \textbf w^T\cdot \textbf x^{(i)} +b\right)\ge 1-\zeta^{(i)}$ and $\zeta^{(i)}\ge 0$ for $i=1,2,\cdots,m$
  - quadratic programming: both objectives can be solved by an off-the-shelf QP solver.
  - dual problem: under some conditions, it has the same solution as the primal problem.
  - kernelized SVM: kernel functions make the dot product in a high-dimensional space easy to compute.
  - online SVM: minimize the cost function $J(\textbf w,b)=\frac{1}{2}\textbf w^T \textbf w +C\sum_{i=1}^m \max\left(0,1-t^{(i)}\left(\textbf w^T\cdot \textbf x^{(i)}+b\right)\right)$ with Gradient Descent.