Chapter 5: Support Vector Machines

This content is licensed under CC 4.0 BY-SA.

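This article explains the principles and applications of Support Vector Machines (SVMs), covering linear and nonlinear classification, regression, and the inner workings of SVMs. It introduces several ways to implement SVMs, including when to use the linear, polynomial, and Gaussian RBF kernels.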

Reading notes on Hands-On Machine Learning with Scikit-Learn and TensorFlow (O'Reilly)

A Support Vector Machine (SVM) is capable of performing linear and nonlinear classification, regression, and even outlier detection.

SVM used to be the most popular ML model.

SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.

5.1 Linear SVM Classification

Requirement: the dataset is linearly separable.
Principle: large margin classification. An SVM classifier can be thought of as fitting the widest possible margin between the classes.

The instances on the margin are called support vectors.

SVMs are sensitive to feature scales, so it is better to apply $\verb+StandardScaler+$ before training.

5.1.1 Soft Margin Classification

Hard margin classification: all instances are required to be off the street and on the right side. It has two problems. First, it works only if the data is linearly separable. Second, it is quite sensitive to outliers.

Soft margin classification: find a good balance between keeping the street as wide as possible and limiting margin violations, i.e., instances that end up in the middle of the street or even on the wrong side.

Note that an instance in the middle of the street is not necessarily misclassified: it may still lie on the right side of the decision boundary.

In Scikit-Learn’s SVM classes, the $\verb+C+$ hyperparameter controls this balance. A smaller $\verb+C+$ leads to a wider street but more margin violations. If your SVM model is overfitting, you can try regularizing it by reducing $\verb+C+$.

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris=datasets.load_iris()
X=iris['data'][:,(2,3)]#petal length, petal width
y=(iris['target']==2).astype(np.float64) #iris-Virginica

svm_clf=Pipeline([
    ("scaler",StandardScaler()),
    ("linear_svc",LinearSVC(C=1,loss="hinge"))
])

svm_clf.fit(X,y)

svm_clf.predict([[5.5,1.7]]) #array([1.])
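
To see the effect of $\verb+C+$ in numbers, the sketch below fits the same iris data with a small and a large $\verb+C+$ and reports the street width $2/||\textbf{w}||$ and the number of instances inside the street. It is only an illustration: it swaps in $\verb+SVC(kernel="linear")+$ for $\verb+LinearSVC+$ so that the run is deterministic, and the helper name `margin_stats` and the two values of $\verb+C+$ are my own choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, [2, 3]]             # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)   # 1 for Iris-Virginica, else 0

def margin_stats(C):
    """Fit a linear SVM; return (street width, instances strictly inside the street)."""
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svc", SVC(kernel="linear", C=C)),
    ])
    clf.fit(X_iris, y_iris)
    w = clf.named_steps["svc"].coef_[0]
    width = 2 / np.linalg.norm(w)            # distance between the two street edges
    scores = clf.decision_function(X_iris)   # w.x + b in the scaled feature space
    inside = int(np.sum(np.abs(scores) < 1))
    return width, inside

width_small_C, inside_small_C = margin_stats(C=0.01)
width_large_C, inside_large_C = margin_stats(C=100)
print("C=0.01:", width_small_C, inside_small_C)
print("C=100: ", width_large_C, inside_large_C)
```

The smaller $\verb+C+$ should print a wider street with more instances inside it.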

Alternatives:

- $\verb+SVC(kernel="linear",C=1)+$: slower, not recommended.
- $\verb+SGDClassifier(loss="hinge",alpha=1/(m*C))+$: does not converge as fast as the $\verb+LinearSVC+$ class, but is useful for huge datasets that do not fit in memory (out-of-core training) or for online classification tasks.

5.2 Nonlinear SVM Classification

For datasets that are not linearly separable, add new features such as polynomial features.

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
X,y=make_moons()
polynomial_svm_clf=Pipeline([
    ('poly_features',PolynomialFeatures(degree=3)),
    ("scaler",StandardScaler()),
    ("svm_clf",LinearSVC(C=10,loss="hinge"))
])
polynomial_svm_clf.fit(X,y)

5.2.1 Polynomial Kernel

Kernel trick: get the same result as if you had added many polynomial features, without actually adding them.

from sklearn.svm import SVC
poly_kernel_svm_clf=Pipeline([
    ("scaler",StandardScaler()),
    ("svm_clf",SVC(kernel="poly",degree=3,coef0=1,C=5))
])
poly_kernel_svm_clf.fit(X,y)

The hyperparameter $\verb+coef0+$ controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.

A common way to find the right hyperparameter values is to use grid search (the $\verb+GridSearchCV+$ class).
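
A grid search over the polynomial-kernel hyperparameters might look like the sketch below. The grid values, the `n_samples`/`noise`/`random_state` settings and the 3-fold split are arbitrary choices of mine, not values from the book; the `svm_clf__` prefix routes each parameter to the pipeline step of that name.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Hypothetical search grid: every combination is evaluated by cross-validation.
param_grid = {
    "svm_clf__degree": [2, 3],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [0.1, 5],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_moons, y_moons)
print(grid_search.best_params_)
print(grid_search.best_score_)   # mean cross-validated accuracy of the best combination
```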

5.2.2 Adding Similarity Features

Another way to introduce new features is to use a similarity function that measures how much each instance resembles a particular landmark.

For example, use the Gaussian Radial Basis Function (RBF) as the similarity function, with $\gamma=0.3$.

Take a one-dimensional dataset with two landmarks at $x_1=-2$ and $x_1=1$, where $x_1$ is the single feature of the dataset.

Equation 5-1. Gaussian RBF
$\phi_\gamma(\textbf{x},\ell)=\exp\left(-\gamma||\textbf{x}-\ell||^2\right)$
It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark).

For the instance $x_1=-1$, located at a distance of $1$ from the first landmark and $2$ from the second, the new features are $x_2=\exp(-0.3\times 1^2)\approx 0.74$ and $x_3=\exp(-0.3\times 2^2)\approx 0.30$.
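
The arithmetic above can be checked with a few lines of NumPy. The function name `gaussian_rbf` is my own; it simply evaluates Equation 5-1.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Equation 5-1: exp(-gamma * squared distance between x and the landmark)."""
    diff = np.atleast_1d(x) - np.atleast_1d(landmark)
    return float(np.exp(-gamma * np.dot(diff, diff)))

gamma = 0.3
x1 = -1.0
x2 = gaussian_rbf(x1, -2.0, gamma)   # distance 1 to the first landmark
x3 = gaussian_rbf(x1, 1.0, gamma)    # distance 2 to the second landmark
print(round(x2, 2), round(x3, 2))    # 0.74 0.3
```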

How to select landmarks? A simple way is to use all the instances as landmarks. If the dataset contains $m$ instances with $n$ features, this results in a training set with $m$ instances and $m$ features if you drop the original features.
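
A minimal sketch of this "every instance is a landmark" transformation, on a random toy dataset of my own (the helper name `rbf_similarity_features` is hypothetical): the result has one column per landmark, so $m$ instances produce an $m\times m$ matrix.

```python
import numpy as np

def rbf_similarity_features(X, landmarks, gamma):
    """Entry (i, j) is the Gaussian RBF similarity between X[i] and landmarks[j]."""
    diffs = X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 2))   # m = 6 instances, n = 2 features
X_new = rbf_similarity_features(X_toy, X_toy, gamma=0.3)
print(X_new.shape)                # (6, 6)
```

Each instance has similarity 1 with itself, so the diagonal of the new feature matrix is all ones.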

5.2.3 Gaussian RBF Kernel

Again, the RBF kernel trick gets the same result as adding RBF similarity features, without actually adding them.

rbf_kernel_svm_clf=Pipeline([
    ("scaler",StandardScaler()),
    ("svm_clf",SVC(kernel="rbf",gamma=5,C=0.001))
])
rbf_kernel_svm_clf.fit(X,y)

Figure 5-9 shows that increasing $\verb+gamma+$ makes the bell-shaped curve narrower, so each instance’s range of influence becomes smaller and the decision boundary more irregular, which can lead to overfitting. If the model is overfitting, reduce $\verb+gamma+$; if it is underfitting, increase it.

Other kernels are rarely used.

As a rule of thumb, always try the linear kernel first ($\verb+LinearSVC+$ is much faster than $\verb+SVC(kernel="linear")+$), especially if the training set is very large or has plenty of features. If the training set is not too large, also try the Gaussian RBF kernel.

5.2.4 Computational Complexity

The $\verb+LinearSVC+$ class is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly $O(m\times n)$.

The tolerance hyperparameter $\epsilon$ ($\verb+tol+$ in Scikit-Learn) controls the precision: the algorithm takes longer if you require a very high precision.

The $\verb+SVC+$ class is based on the libsvm library, which implements SMO (Sequential Minimal Optimization), an algorithm that supports the kernel trick. The training time complexity is usually between $O(m^2\times n)$ and $O(m^3\times n)$, so it becomes very slow when the number of training instances is large.

Table 5-1. Comparison of Scikit-Learn classes for SVM classification

Class	Time complexity	Out-of-core support	Scaling required	Kernel trick
LinearSVC	$O(m\times n)$	No	Yes	No
SGDClassifier	$O(m\times n)$	Yes	Yes	No
SVC	$O(m^2\times n)$ to $O(m^3\times n)$	No	Yes	Yes

5.3 SVM Regression

SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by the hyperparameter $\epsilon$.

from sklearn.svm import LinearSVR
linear_svr_reg=Pipeline([
    ('scaler',StandardScaler()),
    ('svm_reg',LinearSVR(epsilon=1.5))
])
linear_svr_reg.fit(X,y)

from sklearn.svm import SVR
nonlinear_svr_reg=Pipeline([
    ('scaler',StandardScaler()),
    ('svm_reg',SVR(kernel="poly",degree=2,C=100,epsilon=0.1))
])
nonlinear_svr_reg.fit(X,y)
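
To make the role of $\epsilon$ concrete, the sketch below counts the instances that fall off the street for a narrow and a wide street. The data points and the fitted line $\hat y=2x+1$ are made up for the illustration, and `off_street` is a hypothetical helper, not a Scikit-Learn function.

```python
import numpy as np

def off_street(y_true, y_pred, epsilon):
    """Boolean mask of instances farther than epsilon from the prediction."""
    return np.abs(y_true - y_pred) > epsilon

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_true = np.array([1.2, 2.0, 5.9, 7.1, 9.0])
y_pred = 2 * x + 1                 # hypothetical fitted line: 1, 3, 5, 7, 9

narrow = off_street(y_true, y_pred, epsilon=0.5)
wide = off_street(y_true, y_pred, epsilon=1.5)
print(narrow.sum(), wide.sum())    # 2 0
```

With $\epsilon=0.5$ two instances are margin violations; widening the street to $\epsilon=1.5$ brings every instance back onto it.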

5.4 Under the Hood

5.4.1 Decision Function and Predictions

Equation 5-2. Linear SVM classifier prediction
$\hat y=\left\{\begin{array}{ll} 0 & \textrm{if }\textbf{w}^T\cdot \textbf{x}+b<0\\ 1 & \textrm{if }\textbf{w}^T\cdot \textbf{x}+b\ge0 \end{array}\right.$
The decision function is of the form $\textbf{w}^T\cdot \textbf{x}+b=w_1x_1+\cdots+w_nx_n+b$.

For a dataset with $n$ features, the graph of the decision function $h=\textbf{w}^T\cdot \textbf{x}+b$ is an $n$-dimensional hyperplane, whereas the decision boundary $\textbf{w}^T\cdot \textbf{x}+b=0$ is the intersection of that hyperplane with $h = 0$, which is an $(n - 1)$-dimensional hyperplane.
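
Equation 5-2 translates directly into NumPy. The weights, bias and instances below are arbitrary values chosen for the illustration.

```python
import numpy as np

def decision_function(X, w, b):
    """Linear decision function w.x + b for every row of X."""
    return X @ w + b

def predict(X, w, b):
    """Equation 5-2: class 1 when the decision function is >= 0, else class 0."""
    return (decision_function(X, w, b) >= 0).astype(int)

w = np.array([1.0, -2.0])   # hypothetical trained weights
b = 0.5                     # hypothetical trained bias
X_new = np.array([[3.0, 1.0], [0.0, 1.0], [1.0, 0.75]])

scores = decision_function(X_new, w, b)
predictions = predict(X_new, w, b)
print(scores)        # [ 1.5 -1.5  0. ]
print(predictions)   # [1 0 1]
```

The third instance lies exactly on the decision boundary and is assigned to the positive class, as the $\ge$ in Equation 5-2 requires.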

5.4.2 Training Objective

Hard margin SVM:

Objective: make the street as wide as possible.

Analysis: the edges of the street lie where the decision function equals $\pm 1$, at a distance of $1/||\textbf{w}||$ from the decision boundary, so the smaller the norm $||\textbf{w}||$ of the weight vector $\textbf{w}$, the wider the street. Minimizing $||\textbf{w}||$ is equivalent to minimizing $\frac{1}{2}||\textbf{w}||^2$, which is equal to $\frac{1}{2}\textbf{w}^T\textbf{w}$ and has a simpler derivative.

Constraints: to avoid any margin violation, the decision function must be at least $1$ for every positive training instance and at most $-1$ for every negative one. Defining $t^{(i)}=1$ for positive instances ($y^{(i)}=1$) and $t^{(i)}=-1$ for negative instances ($y^{(i)}=0$), we can express both cases as $t^{(i)}(\textbf{w}^T\cdot\textbf{x}^{(i)}+b)\ge 1$ for all instances.

Equation 5-3. Hard margin linear SVM classifier objective
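$\mathop{\textrm{minimize}}\limits_{\textbf w,b}\ \frac{1}{2}\textbf w^T\cdot \textbf w\\ \textrm{subject to } t^{(i)}\left( \textbf w^T\cdot \textbf x^{(i)} +b\right)\ge 1 \textrm{ for }i=1,2,\cdots,m$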
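
As a quick sanity check of the hard margin formulation, the sketch below converts 0/1 labels to $t^{(i)}=\pm 1$ and verifies the constraints for a candidate $(\textbf w, b)$. The one-dimensional dataset and the candidate parameters are invented for the example.

```python
import numpy as np

X_toy = np.array([[2.0], [3.0], [0.0], [-1.0]])
y_toy = np.array([1, 1, 0, 0])
t_toy = 2 * y_toy - 1            # y = 1 becomes t = +1, y = 0 becomes t = -1

w = np.array([1.0])              # candidate weight vector (assumed, not trained)
b = -1.0                         # candidate bias

margins = t_toy * (X_toy @ w + b)    # left-hand side of the constraints
objective = 0.5 * w @ w              # (1/2) w^T w
print(margins)                       # [1. 2. 1. 2.]
print(objective)                     # 0.5
```

Every value is at least 1, so this candidate is feasible; the two instances with value exactly 1 sit on the edges of the street.
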
Soft margin SVM:

Introduce a slack variable $\zeta^{(i)}\ge 0$ for each instance $\textbf{x}^{(i)}$ to measure how much $\textbf{x}^{(i)}$ is allowed to violate the margin, and use the hyperparameter $\verb+C+$ introduced in 5.1.1 to control the balance between making the street as wide as possible and limiting margin violations.

Equation 5-4. Soft margin linear SVM classifier objective
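$\mathop{\textrm{minimize}}\limits_{\textbf w,b,\zeta}\ \frac{1}{2}\textbf w^T\cdot \textbf w+C\sum_{i=1}^m \zeta^{(i)}\\ \textrm{subject to } t^{(i)}\left( \textbf w^T\cdot \textbf x^{(i)} +b\right)\ge 1-\zeta^{(i)} \textrm{ and } \zeta^{(i)}\ge 0 \textrm{ for }i=1,2,\cdots,m$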
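
For fixed $\textbf w$ and $b$, the smallest slack that satisfies both constraints is $\zeta^{(i)}=\max\left(0,1-t^{(i)}(\textbf w^T\cdot\textbf x^{(i)}+b)\right)$. The sketch below uses that to evaluate the soft margin objective on a toy dataset of my own, in which the last instance is a negative outlier sitting on the positive side; `soft_margin_objective` is a hypothetical helper name.

```python
import numpy as np

def soft_margin_objective(X, t, w, b, C):
    """Return the soft margin objective and the minimal slack of each instance."""
    slack = np.maximum(0.0, 1.0 - t * (X @ w + b))
    return 0.5 * w @ w + C * slack.sum(), slack

X_toy = np.array([[2.0], [3.0], [0.0], [-1.0], [1.5]])
t_toy = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # last instance is the outlier
w = np.array([1.0])                               # assumed parameters, not trained
b = -1.0

cost_C1, slack = soft_margin_objective(X_toy, t_toy, w, b, C=1.0)
cost_C10, _ = soft_margin_objective(X_toy, t_toy, w, b, C=10.0)
print(slack)               # [0.  0.  0.  0.  1.5]
print(cost_C1, cost_C10)   # 2.0 15.5
```

Only the outlier needs slack, and a larger $\verb+C+$ makes that same violation weigh much more in the objective.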

5.4.3 Quadratic Programming

Equation 5-5. Quadratic Programming problem
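$\mathop{\textrm{minimize}}\limits_{\textbf p}\ \frac{1}{2}\textbf{p}^T\cdot\textbf{H}\cdot\textbf{p}+\textbf{f}^T\cdot\textbf{p}\\ \textrm{subject to } \textbf{A}\cdot\textbf{p}\le\textbf{b}$
Here $\textbf p$ is an $n_p$-dimensional vector ($n_p$ is the number of parameters), $\textbf H$ is an $n_p\times n_p$ matrix, $\textbf f$ is an $n_p$-dimensional vector, $\textbf A$ is an $n_c\times n_p$ matrix ($n_c$ is the number of constraints), and $\textbf b$ is an $n_c$-dimensional vector.
The expression $\textbf{A}\cdot \textbf{p}\le \textbf{b}$ actually defines $n_c$ constraints: $\textbf{p}^T\cdot \textbf{a}^{(i)}\le b^{(i)}$ for $i=1,2,\cdots ,n_c$, where $\textbf a^{(i)}$ is the vector containing the elements of the $i$-th row of $\textbf{A}$ and $b^{(i)}$ is the $i$-th element of $\textbf{b}$.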

For the hard margin linear SVM classifier objective, the QP parameters are set as follows:

- $n_p=n+1$, where $n$ is the number of features (the +1 is for the bias term)
- $n_c=m$, where $m$ is the number of training instances
- $\textbf{H}$ is the $n_p\times n_p$ identity matrix, except with a zero in the top-left cell (to ignore the bias term)
- $\textbf{f} = \textbf 0$, an $n_p$-dimensional vector full of 0s
- $\textbf b = -\textbf 1$, an $n_c$-dimensional vector full of $-1$s (each constraint is the hard margin constraint multiplied by $-1$)
- $\textbf a^{(i)} = -t^{(i)} \dot{\textbf{x}}^{(i)}$, where $\dot{\textbf{x}}^{(i)}$ is equal to $\textbf{x}^{(i)}$ with an extra bias feature $\dot{x}_0=1$

So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf QP solver by passing it the preceding parameters.
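
The sketch below only builds these QP parameters with NumPy and checks them on a toy dataset; it does not call a solver. It assumes the parameter vector is ordered as $\textbf p=(b, w_1,\cdots,w_n)$ and that the right-hand side of the constraints is $-1$, so that $\textbf A\cdot\textbf p\le\textbf b$ is the hard margin constraint multiplied by $-1$. The helper name `hard_margin_qp_params` and the data are my own.

```python
import numpy as np

def hard_margin_qp_params(X, t):
    """Build H, f, A and the right-hand side for the hard margin QP, with p = (b, w)."""
    m, n = X.shape
    n_p = n + 1
    H = np.eye(n_p)
    H[0, 0] = 0.0                        # the bias term is not penalized
    f = np.zeros(n_p)
    X_dot = np.c_[np.ones(m), X]         # prepend the bias feature x0 = 1
    A = -t[:, np.newaxis] * X_dot        # row i is a^(i) = -t^(i) * x_dot^(i)
    b_vec = -np.ones(m)                  # A p <= -1  <=>  t (w.x + b) >= 1
    return H, f, A, b_vec

X_toy = np.array([[2.0], [3.0], [0.0], [-1.0]])
t_toy = np.array([1.0, 1.0, -1.0, -1.0])
H, f, A, b_vec = hard_margin_qp_params(X_toy, t_toy)

p = np.array([-1.0, 1.0])                # candidate solution: b = -1, w = 1
print(A @ p)                             # [-1. -2. -1. -2.]
print(0.5 * p @ H @ p + f @ p)           # 0.5
```

The candidate satisfies every constraint, and the QP objective equals $\frac{1}{2}\textbf w^T\textbf w$ because the zero in the top-left cell of $\textbf H$ removes the bias.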

A QP solver works on this primal form, but to use the kernel trick a different constrained optimization problem has to be solved: the dual problem.

5.4.4 The Dual Problem

Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem. The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can even have the same solution as the primal problem. The conditions are as follows:

- the objective function is convex
- the inequality constraints are continuously differentiable and convex functions

The SVM problem happens to meet these conditions, so you can choose to solve either the primal or the dual problem.

Equation 5-6. Dual form of the linear SVM objective
$\mathop{\textrm{minimize}}\limits_\alpha \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha^{(i)}\alpha^{(j)}t^{(i)}t^{(j)}\textbf{x}^{(i)^T}\textbf{x}^{(j)}-\sum_{i=1}^m\alpha^{(i)}\\ \textrm{subject to } \alpha^{(i)}\ge 0 \textrm{ for }i=1,2,\cdots,m$
Once you find the vector $\hat \alpha$ that minimizes Equation 5-6 (using a QP solver), you can compute $\hat {\textbf w}$ and $\hat b$ that minimize the primal problem by using Equation 5-7.

Equation 5-7. From the dual solution to the primal solution
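$\hat{\textbf w}=\sum_{i=1}^m \hat \alpha^{(i)}t^{(i)}\textbf x^{(i)}\\ \hat b=\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\hat{\textbf{w}}^T\cdot \textbf x^{(i)}\right)$
Here $n_s$ is the number of support vectors (the instances with $\hat \alpha^{(i)}>0$). Each of them lies on the edge of the street, so $t^{(i)}\left(\hat{\textbf w}^T\cdot \textbf x^{(i)}+\hat b\right)=1$, which gives $\hat b=t^{(i)}-\hat{\textbf w}^T\cdot \textbf x^{(i)}$ because $t^{(i)}=\pm 1$; the formula averages this value over all support vectors.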
The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features. More importantly, it makes the kernel trick possible, while the primal does not.
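
Going from a dual solution back to the primal parameters can be sketched on a two-instance toy problem. The values of $\hat\alpha$ below were worked out by hand for this dataset (they are an assumption of the example, not the output of a solver). The bias is recovered from the fact that every support vector lies exactly on the edge of the street, so $t^{(i)}(\hat{\textbf w}^T\cdot\textbf x^{(i)}+\hat b)=1$, i.e. $\hat b=t^{(i)}-\hat{\textbf w}^T\cdot\textbf x^{(i)}$, averaged over the support vectors.

```python
import numpy as np

X_toy = np.array([[2.0], [0.0]])     # one positive and one negative instance
t_toy = np.array([1.0, -1.0])
alpha_hat = np.array([0.5, 0.5])     # dual solution, derived by hand for this toy set

# w_hat = sum_i alpha_i * t_i * x_i
w_hat = (alpha_hat * t_toy) @ X_toy

# b_hat = average over the support vectors of (t_i - w_hat . x_i)
support = alpha_hat > 0
b_hat = np.mean(t_toy[support] - X_toy[support] @ w_hat)

print(w_hat, b_hat)                      # [1.] -1.0
print(t_toy * (X_toy @ w_hat + b_hat))   # [1. 1.]
```

Both instances end up exactly on the edges of the street, as expected for support vectors.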

5.4.5 Kernelized SVM

Equation 5-8. Second-degree polynomial mapping
$\phi(\textbf x)=\phi\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right)=\begin{pmatrix} x_1^2\\\sqrt{2}x_1x_2\\x_2^2 \end{pmatrix}$
Equation 5-9. Kernel trick for a 2nd-degree polynomial mapping
$\phi(\textbf a)^T\phi(\textbf b)=\begin{pmatrix} a_1^2\\\sqrt{2}a_1a_2\\a_2^2 \end{pmatrix}^T\cdot\begin{pmatrix} b_1^2\\\sqrt{2}b_1b_2\\b_2^2 \end{pmatrix}=a_1^2b_1^2+2a_1b_1a_2b_2+a_2^2b_2^2\\ =(a_1b_1+a_2b_2)^2=\left(\begin{pmatrix} a_1\\a_2 \end{pmatrix}^T\cdot\begin{pmatrix} b_1\\b_2 \end{pmatrix}\right)^2=\left(\textbf a^T\cdot \textbf b\right)^2$
Now here is the key insight: if the data is not linearly separable, you may transform it into a higher-dimensional space. If you apply the transformation $\phi$ to all training instances, then the dual problem (see Equation 5-6) will contain the dot product $\phi (\textbf x^{(i)})^T\cdot \phi (\textbf x^{(j)})$. But if $\phi$ is the 2nd-degree polynomial transformation defined in Equation 5-8, then you can replace this dot product of transformed vectors simply by $(\textbf x^{(i)^T}\cdot \textbf x^{(j)})^2$.

So you don’t actually need to transform the training instances at all: just replace the dot product by its square in Equation 5-6. The result will be strictly the same as if you went through the trouble of actually transforming the training set then fitting a linear SVM algorithm, but this trick makes the whole process much more computationally efficient. This is the essence of the kernel trick.
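
The identity of Equation 5-9 is easy to verify numerically. The two vectors below are arbitrary; `phi` implements the mapping of Equation 5-8 and `poly2_kernel` the squared dot product.

```python
import numpy as np

def phi(x):
    """Equation 5-8: explicit 2nd-degree polynomial mapping of a 2D vector."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Same dot product, computed in the original 2D space."""
    return (a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

lhs = phi(a) @ phi(b)      # dot product after the explicit transformation
rhs = poly2_kernel(a, b)   # kernel evaluated on the original vectors
print(lhs, rhs)
```

Both values should agree up to floating-point rounding (here $\textbf a^T\cdot\textbf b=1$, so both are 1).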

The function $K(\textbf a, \textbf b) = (\textbf a^T \cdot \textbf b)^2$ is called a 2nd-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product $\phi (\textbf a)^T \cdot \phi (\textbf b)$ based only on the original vectors $\textbf a$ and $\textbf b$, without having to compute (or even to know about) the transformation $\phi$. Equation 5-10 lists some of the most commonly used kernels.
Equation 5-10. Common kernels
$\begin{array}{rl}\textrm{Linear:}& K(\textbf a, \textbf b) = \textbf a^T\cdot \textbf b\\ \textrm{Polynomial:}& K (\textbf a, \textbf b) = \left(\gamma\textbf a^T \cdot \textbf b + r\right)^ d\\ \textrm{Gaussian RBF:}& K (\textbf a, \textbf b) = \exp\left(-\gamma||\textbf a - \textbf b||^2\right)\\ \textrm{Sigmoid:}& K (\textbf a, \textbf b) = \tanh \left(\gamma\textbf a^T \cdot \textbf b + r\right)\end{array}$
Mercer’s Theorem: if a function $K(\textbf a, \textbf b)$ respects Mercer’s conditions ($K$ must be continuous, symmetric in its arguments so $K(\textbf a, \textbf b)=K(\textbf b, \textbf a)$, etc.), then there exists a function $\phi$ that maps $\textbf a$ and $\textbf{b}$ into another space (possibly with much higher dimensions) such that $K(\textbf a, \textbf b)=\phi(\textbf a)^T\cdot \phi(\textbf b)$. So you can use $K$ as a kernel since you know $\phi$ exists, even if you don’t know what $\phi$ is. In the case of the Gaussian RBF kernel, it can be shown that $\phi$ actually maps each training instance to an infinite-dimensional space, so it’s a good thing you don’t need to actually perform the mapping!
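
The four kernels of Equation 5-10 can be written as plain NumPy functions, as in the sketch below. The function names and default values are my own; in Scikit-Learn's $\verb+SVC+$ the parameters $\gamma$, $r$ and $d$ correspond to $\verb+gamma+$, $\verb+coef0+$ and $\verb+degree+$.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, gamma=1.0, r=0.0, d=3):
    return (gamma * (a @ b) + r) ** d

def gaussian_rbf_kernel(a, b, gamma=1.0):
    diff = a - b
    return np.exp(-gamma * (diff @ diff))

def sigmoid_kernel(a, b, gamma=1.0, r=0.0):
    return np.tanh(gamma * (a @ b) + r)

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
print(linear_kernel(a, b))                            # 1.0
print(polynomial_kernel(a, b, gamma=1.0, r=1.0, d=2)) # 4.0
print(gaussian_rbf_kernel(a, b, gamma=0.1))           # exp(-0.1 * 13)
print(sigmoid_kernel(a, b))                           # tanh(1)
```

All four are symmetric in their arguments, as Mercer's conditions require.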

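With the kernel trick, Equation 5-7 gives $\hat{\textbf w}$ as a sum of the $\phi(\textbf x^{(i)})$, so $\hat{\textbf w}$ has as many dimensions as $\phi(\textbf x^{(i)})$, which may be huge or even infinite, and it cannot be computed explicitly. However, you can skip computing the model parameters and jump directly to the prediction step: plugging the formula for $\hat{\textbf w}$ into the decision function leaves only dot products between input vectors, which the kernel can evaluate.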

Equation 5-11. Making predictions with a kernelized SVM
$h_{\widehat{\textbf w},\hat b}\left (\phi\left(\textbf x^{(n)}\right) \right) =\widehat{\textbf w}^T\cdot \phi\left(\textbf x^{(n)}\right) +\hat b =\left(\sum_{i=1}^m \hat\alpha^{(i)}t^{(i)}\phi\left(\textbf x^{(i)}\right)\right)^T\cdot \phi\left(\textbf x^{(n)}\right) +\hat b\\ =\sum_{i=1}^m \hat\alpha^{(i)}t^{(i)}\left(\phi\left(\textbf x^{(i)}\right)^T\cdot \phi\left(\textbf x^{(n)}\right)\right) +\hat b\\ =\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0} \hat\alpha^{(i)}t^{(i)}K\left(\textbf x^{(i)},\textbf x^{(n)}\right)+\hat b$
where $\textbf x ^{(n)}$ is a new instance.

Equation 5-12. Computing the bias term using the kernel trick
$\hat b=\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\widehat{\textbf{w}}^T\cdot \phi\left(\textbf x^{(i)}\right) \right)\\ =\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\left(\sum_{j=1}^m \hat\alpha^{(j)}t^{(j)}\phi\left(\textbf x^{(j)}\right)\right)^T\cdot \phi\left(\textbf x^{(i)}\right) \right)\\ =\frac{1}{n_s}\mathop{\sum_{i=1}^m}\limits_{\hat \alpha^{(i)}>0}\left(t^{(i)}-\mathop{\sum_{j=1}^m}\limits_{\hat \alpha^{(j)}>0} \hat\alpha^{(j)}t^{(j)}K\left(\textbf x^{(j)}, \textbf x^{(i)}\right) \right)$
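
The sketch below computes the bias term and a prediction using only kernel evaluations, never $\hat{\textbf w}$ itself. It uses a two-instance toy set with a linear kernel and hand-derived $\hat\alpha$ values (assumptions of the example), and it takes the bias as the average of $t^{(i)}-\sum_j\hat\alpha^{(j)}t^{(j)}K(\textbf x^{(j)},\textbf x^{(i)})$ over the support vectors, which follows from each support vector lying on the edge of the street. The function names are hypothetical.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def kernel_bias(X_sv, t_sv, alpha_sv, kernel):
    """Average of t_i - sum_j alpha_j t_j K(x_j, x_i) over the support vectors."""
    terms = []
    for x_i, t_i in zip(X_sv, t_sv):
        s = sum(a_j * t_j * kernel(x_j, x_i)
                for x_j, t_j, a_j in zip(X_sv, t_sv, alpha_sv))
        terms.append(t_i - s)
    return float(np.mean(terms))

def kernel_decision(x_new, X_sv, t_sv, alpha_sv, b_hat, kernel):
    """Equation 5-11: sum_i alpha_i t_i K(x_i, x_new) + b_hat."""
    return sum(a_i * t_i * kernel(x_i, x_new)
               for x_i, t_i, a_i in zip(X_sv, t_sv, alpha_sv)) + b_hat

X_sv = np.array([[2.0], [0.0]])      # support vectors of the toy problem
t_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.5, 0.5])      # hand-derived dual solution (assumed)

b_hat = kernel_bias(X_sv, t_sv, alpha_sv, linear_kernel)
print(b_hat)                                                                        # -1.0
print(kernel_decision(np.array([3.0]), X_sv, t_sv, alpha_sv, b_hat, linear_kernel))  # 2.0
```

With the linear kernel this reproduces the decision function $x-1$ of the explicit solution; swapping in another kernel function changes nothing else in the code.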

5.4.6 Online SVMs

To train online linear SVM classifiers, one way is to use Gradient Descent to minimize the cost function in Equation 5-13, which is derived from the primal problem.

Equation 5-13. Linear SVM classifier cost function
$J(\textbf w,b)=\frac{1}{2}\textbf w^T \textbf w +C\sum_{i=1}^m \max\left(0,1-t^{(i)}\left(\textbf w^T\cdot \textbf x^{(i)}+b\right)\right)$
The first term in the cost function pushes the model to have a small weight vector $\textbf w$, leading to a larger margin. The second term sums all margin violations. An instance’s margin violation is equal to 0 if it is located off the street and on the correct side, or else it is proportional to the distance to the correct side of the street.

Hinge loss: the function $\max(0,1-t)$ is called the hinge loss. It is equal to 0 when $t\ge 1$ and to $1 - t$ when $t < 1$. Its derivative (slope) is equal to $-1$ if $t < 1$ and 0 if $t > 1$. It is not differentiable at $t = 1$, but just like for Lasso Regression you can still use Gradient Descent using any sub-derivative at $t = 1$ (i.e., any value between $-1$ and 0).
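
A minimal batch (sub)gradient descent on the cost function of Equation 5-13 can be sketched as follows. The toy dataset, the learning rate `eta`, the number of steps and the choice of sub-derivative 0 at the kink are all assumptions made for the illustration; a truly online version would feed one instance or mini-batch at a time.

```python
import numpy as np

def hinge(s):
    """Hinge loss max(0, 1 - s), applied elementwise."""
    return np.maximum(0.0, 1.0 - s)

def svm_cost(w, b, X, t, C):
    """Equation 5-13: half the squared norm of w plus C times the total hinge loss."""
    return 0.5 * w @ w + C * hinge(t * (X @ w + b)).sum()

def svm_subgradient_step(w, b, X, t, C, eta):
    """One batch step: hinge slope is -1 where t (w.x + b) < 1 and 0 elsewhere."""
    active = t * (X @ w + b) < 1
    grad_w = w - C * (t[active, np.newaxis] * X[active]).sum(axis=0)
    grad_b = -C * t[active].sum()
    return w - eta * grad_w, b - eta * grad_b

X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
                  [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
t_toy = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, b = np.zeros(2), 0.0
initial_cost = svm_cost(w, b, X_toy, t_toy, C=1.0)
for _ in range(500):
    w, b = svm_subgradient_step(w, b, X_toy, t_toy, C=1.0, eta=0.01)
final_cost = svm_cost(w, b, X_toy, t_toy, C=1.0)

print(initial_cost, final_cost)
print(np.sign(X_toy @ w + b))   # predicted side of the boundary for each instance
```

The cost starts at 6 (every instance violates the margin when $\textbf w=\textbf 0$) and should end far lower, with every training instance on its correct side.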

To sum up, SVM is one of the most popular ML models. It serves as a linear and nonlinear classifier, and also performs linear and nonlinear regression.

Linear SVM Classification

preprocessing: $\verb+StandardScaler+$

The balance between keeping the street as wide as possible and limiting margin violations is controlled by the $\verb+C+$ hyperparameter in Scikit-Learn’s SVM classes. A smaller $\verb+C+$ value leads to a wider street but more margin violations.

Three ways to perform linear SVM classification:
- $\verb+LinearSVC(loss="hinge",C=1)+$ fast
- $\verb+SVC(kernel="linear",C=1)+$ slow
- $\verb+SGDClassifier(loss="hinge",alpha=1/(m*C))+$ slower to converge than $\verb+LinearSVC+$, but supports out-of-core and online training

Nonlinear SVM Classification

Four ways to perform nonlinear SVM classification:
- $\verb+PolynomialFeatures(degree=3)+$ followed by $\verb+LinearSVC(loss="hinge",C=10)+$
- polynomial kernel trick: $\verb+SVC(kernel="poly",degree=3,coef0=1,C=5)+$
  
  $\verb+coef0+$ controls the influence of high-degree versus low-degree polynomials
- adding similarity features by using similarity functions, such as the Gaussian RBF
- Gaussian RBF kernel: $\verb+SVC(kernel="rbf",gamma=5,C=0.001)+$

SVM Regression

- $\verb+LinearSVR(epsilon=1.5)+$
  
  $\verb+epsilon+$ controls the width of the street
- $\verb+SVR(kernel="poly",degree=2,C=100,epsilon=0.1)+$

Under the Hood

decision function: $\textbf w^T\cdot \textbf x+b$

decision boundary: $\textbf w^T\cdot \textbf x+b=0$

training objective and constraints:
- hard margin classification
  
  $\mathop{\textrm{minimize}}\limits_{\textbf w,b}$ $\frac{1}{2}\textbf w^T\cdot \textbf w$
  
  subject to $t^{(i)}\left( \textbf w^T\cdot \textbf x^{(i)} +b\right)\ge 1$ for $i=1,2,\cdots,m$
- soft margin classification
  
  $\mathop{\textrm{minimize}}\limits_{\textbf w,b,\zeta}$ $\frac{1}{2}\textbf w^T\cdot \textbf w+C\sum_{i=1}^m \zeta^{(i)}$
  
  subject to $t^{(i)}\left( \textbf w^T\cdot \textbf x^{(i)} +b\right)\ge 1-\zeta^{(i)}$ and $\zeta^{(i)}\ge 0$ for $i=1,2,\cdots,m$

quadratic programming: both objectives are QP problems that can be solved by an off-the-shelf QP solver.

dual problem: under some conditions, has the same solution as the primal problem.

kernelized SVM: kernel functions make the dot product in a high-dimensional space easy to compute.

online SVM: minimize the cost function of Equation 5-13 (regularization term plus hinge loss) with Gradient Descent.