Support Vector Machines (SVM)


A Support Vector Machine (SVM) classifier is a powerful supervised machine learning algorithm used for classification and regression tasks. It is particularly well suited to binary classification problems but can be extended to multi-class classification as well, and it can handle multiple continuous and categorical variables. SVM constructs a hyperplane in a multidimensional space to separate the different classes, refining this hyperplane iteratively so as to minimize the classification error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.

Support Vectors
Support vectors are the data points closest to the hyperplane. Because the margin is computed from these points, they define the separating boundary and are the most relevant to constructing the classifier.

Hyperplane
A hyperplane is a decision surface that separates a set of objects having different class memberships.

Margin
The margin is the gap between the two lines passing through the closest points of each class, calculated as the perpendicular distance from the separating line to the support vectors (the closest points). A larger margin between the classes is considered a good margin; a smaller margin is a bad one.

How SVM Works

At its core, SVM aims to find the optimal hyperplane that best separates the data points of different classes in feature space. The hyperplane is chosen in such a way that the margin between the closest data points of the classes, known as support vectors, is maximized. These support vectors are crucial as they define the decision boundary and influence the model's generalization.

SVM searches for the maximum marginal hyperplane in the following steps:

Generate hyperplanes that segregate the classes as well as possible. The left-hand figure shows three hyperplanes: black, blue and orange. The blue and orange hyperplanes have higher classification error, while the black one separates the two classes correctly.

Select the hyperplane with the maximum separation from the nearest data points of either class, as shown in the right-hand figure.


Dealing with non-linear and inseparable planes

For linearly separable data, the decision boundary is a hyperplane, but if the data is not linearly separable, SVM can still find a separating hyperplane by mapping the data to a higher-dimensional feature space using a kernel function. The most commonly used kernels are linear, polynomial, and radial basis function (RBF). This process is called the "kernel trick," and it allows SVM to handle complex non-linear decision boundaries.

Some problems can’t be solved using linear hyperplane, as shown in the figure below (left-hand side).

In such situations, SVM uses the kernel trick to transform the input space into a higher-dimensional space, as shown on the right. The data points are plotted on the x-axis and z-axis, where z is the squared sum of x and y: $z = x^2 + y^2$. Now you can easily segregate these points using linear separation.
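As a rough illustration of this mapping (a minimal sketch using sklearn's make_circles as an assumed stand-in for the data in the figure), a linear SVM fails in the original 2-D space but separates the classes easily once the extra feature $z = x^2 + y^2$ is appended:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original (x, y) space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM in 2-D cannot capture the circular boundary well.
linear_2d = SVC(kernel='linear').fit(X, y)
print("training accuracy in 2-D:", linear_2d.score(X, y))

# Append the extra feature z = x^2 + y^2 and fit a linear SVM in 3-D.
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X3 = np.hstack([X, z])
linear_3d = SVC(kernel='linear').fit(X3, y)
print("training accuracy with z = x^2 + y^2:", linear_3d.score(X3, y))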





In Support Vector Machine (SVM), the concepts of hard margin and soft margin refer to how the algorithm handles data points that are not perfectly separable by a hyperplane.
The Role of Margin in SVMs
Let’s start with a set of data points that we want to classify into two groups. We can consider two cases for these data: either they are linearly separable, or the separating hyperplane is non-linear. When the data is linearly separable, and we don’t want to have any misclassifications, we use SVM with a hard margin. However, when a linear boundary is not feasible, or we want to allow some misclassifications in the hope of achieving better generality, we can opt for a soft margin for our classifier.

Hard Margin SVM

A hard margin SVM is designed to find the optimal hyperplane that perfectly separates the data points of different classes. It assumes that the data is linearly separable, meaning a hyperplane exists that can separate all data points of one class from those of the other class without any misclassifications.

In hard margin SVM, the optimization objective is to maximize the margin between the two classes while ensuring that all data points are correctly classified. The margin is the perpendicular distance between the hyperplane and the closest data points of the classes (support vectors).


Let’s assume that the hyperplane separating our two classes is defined as $w^Tx+b=0$.
Then, we can define the margin by two parallel hyperplanes:
$w^Tx+\alpha=0$
$w^Tx+\beta=0$
They are the green and purple lines in the above figure. Without allowing any misclassifications in the hard margin SVM, we want to maximize the distance between the two hyperplanes. To find this distance, we can use the formula for the distance of a point from a plane. So the distance of the blue points and the red point from the black line would respectively be:
$\frac{|w^Tx+\alpha|}{||w||}$
$\frac{|w^Tx+\beta|}{||w||}$
As a result, the total margin would become:
$\frac{|\alpha-\beta|}{||w||}$
We want to maximize this margin. Without loss of generality, we can set $\alpha=b+1$ and $\beta=b-1$. The problem then becomes maximizing $\frac{2}{||w||}$, or equivalently minimizing $\frac{||w||}{2}$. To make the gradients easier to take, we work instead with its squared form:
$\min \frac{1}{2}||w||^2=\min \frac{1}{2}w^Tw$

This optimization comes with some constraints. Let’s assume that the labels for our classes are $\{-1, +1\}$. When classifying the data points, we want the points belonging to the positive class to satisfy $w^Tx+b \ge 1$ and the points belonging to the negative class to satisfy $w^Tx+b \le -1$.
We can combine these two constraints and express them as $y_i(w^Tx_i+b) \ge 1$. Therefore our optimization problem becomes:

Minimize: $\frac{1}{2}||w||^2$ Subject to: $y_i(w^Tx_i + b) \ge 1$ for all data points $(x_i, y_i)$
Here, $w$ is the weight vector, $b$ is the bias term, $y_i$ is the class label (+1 or -1) for data point $x_i,$ and $||w||$ represents the norm of the weight vector.
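As a quick sanity check of this formulation (a minimal sketch on a made-up, clearly separable toy dataset), we can fit a linear SVC with a very large $C$ to approximate the hard margin and verify the constraints and the margin width $\frac{2}{||w||}$:

import numpy as np
from sklearn.svm import SVC

# A small, clearly separable toy dataset (illustrative values only).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel='linear', C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Every point should satisfy y_i (w^T x_i + b) >= 1 (up to numerical tolerance).
print("constraint values:", y * (X @ w + b))
# The distance between the two margin hyperplanes is 2 / ||w||.
print("margin width:", 2 / np.linalg.norm(w))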

This constrained minimization is called the primal problem and is guaranteed to have a global minimum. We can solve it by introducing Lagrange multipliers ($\alpha_i$) and converting it to the dual problem:
$L(w,b,\alpha)=\frac{1}{2}w^Tw-\sum_{i=1}^n \alpha_i(y_i(w^Tx_i+b)-1)$

This is called the Lagrangian function of the SVM which is differentiable with respect to $w$  and $b$.
$\nabla_w L(w,b,\alpha)=0\Rightarrow w=\sum_{i=1}^n \alpha_i y_i x_i$
$\nabla_b L(w,b,\alpha)=0\Rightarrow 0=\sum_{i=1}^n \alpha_i y_i$

By substituting them in the second term of the Lagrangian function, we’ll get the dual problem of SVM:
$\max_\alpha -\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n\alpha_i \alpha_j y_iy_jx_i^Tx_j +\sum_{i=1}^n \alpha_i$
subject to $\sum_{i=1}^n \alpha_iy_i=0$

The dual problem is easier to solve since it has only the Lagrange multipliers. Also, the fact that the dual problem depends on the inner products of the training data comes in handy when extending linear SVM to learn non-linear boundaries.
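The link between the primal and dual solutions can be checked numerically. The sketch below (same made-up toy data as above) rebuilds $w=\sum_i \alpha_i y_i x_i$ from sklearn's dual_coef_ attribute, which stores $\alpha_i y_i$ for the support vectors; the two printed vectors should match up to numerical precision.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only, so
# w = sum_i alpha_i y_i x_i can be rebuilt from it.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("w from dual coefficients:", w_from_dual)
print("w reported by sklearn:   ", clf.coef_)

# The dual constraint sum_i alpha_i y_i = 0 means dual_coef_ sums to zero.
print("sum of alpha_i y_i:", clf.dual_coef_.sum())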

Hard margin SVM has some limitations:
It may not be suitable for datasets with noisy or overlapping data points.
It is sensitive to outliers, as even a single outlier can make the problem infeasible.

Soft Margin SVM

A soft margin SVM is an extension of the hard margin SVM that allows for misclassifications and overlapping data points. It introduces the concept of a "soft margin," which allows some data points to be inside the margin or even misclassified. This makes the algorithm more robust to noisy or non-linearly separable data.
In this scenario, we allow misclassifications to happen, so we need to minimize the misclassification error, which means dealing with one more constraint. To minimize this error we also need to define a loss function; a common choice for the soft margin is the hinge loss:
$max\{0,1-y_i(w^Tx_i+b)\}$

The loss of a misclassified point is called a slack variable and is added to the primal problem that we had for hard margin SVM. In soft margin SVM, the optimization objective is to maximize the margin while penalizing misclassifications and data points that are inside the margin. The amount of penalty is controlled by a hyperparameter $C$. A larger $C$ allows fewer misclassifications and a smaller margin, while a smaller $C$ allows more misclassifications and a larger margin.

The soft margin SVM optimization problem is formulated as follows:

Minimize: $\frac{1}{2}||w||^2 + C \sum_i \xi_i$ Subject to: $y_i(w^Tx_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all data points $(x_i, y_i)$, $i = 1, \ldots, n$.

Here, $ξ_i$ represents the slack variable associated with data point $x_i$, which quantifies the error or misclassification of that data point. $C$ is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error.
As you can see, the difference from the hard-margin primal problem is the addition of the slack variables ($\xi_i$ in the figure below), which give the model the flexibility to tolerate misclassifications:
Finally, we can also compare the dual problems:
$max_\alpha -\frac{1}{2}\sum _{i=1}^n \sum _{j=1}^n\alpha_i \alpha_j y_iy_jx_i^Tx_j +\sum_{i=1}^n \alpha_i$
subject to $\sum_{i=1}^n \alpha_iy_i=0,\; 0 \le \alpha_i \le C$
As you can see, in the dual form, the difference is only the upper bound applied to the Lagrange multipliers.
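A minimal sketch of the slack variables, assuming overlapping blob data as a stand-in for noisy classes: after fitting a soft-margin SVC, each $\xi_i = \max(0,\, 1 - y_i f(x_i))$ can be read directly off the decision function.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs as a stand-in for noisy, non-separable data.
X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y = np.where(y01 == 0, -1, 1)          # relabel classes to {-1, +1}

clf = SVC(kernel='linear', C=1.0).fit(X, y)

# Slack for each point: xi_i = max(0, 1 - y_i * f(x_i)).
xi = np.maximum(0, 1 - y * clf.decision_function(X))
print("points with non-zero slack:", int((xi > 0).sum()))
print("total slack (sum of xi_i): ", xi.sum())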

Soft margin SVM has several advantages:
It can handle noisy data and overlapping classes.
It is more flexible and applicable to a wider range of datasets.
It is less sensitive to outliers compared to the hard margin SVM.

However, setting the value of the regularization parameter $C$ is crucial. A small $C$ may result in a larger margin but more misclassifications, while a large $C$ may lead to a smaller margin but fewer misclassifications. The value of $C$ needs to be tuned using techniques like cross-validation to achieve the best performance on the validation set.

Note: The difference between a hard margin and a soft margin in SVMs lies in the separability of the data. If the data is linearly separable, we can use a hard margin. If it is not, no linear classifier can separate all the points, so we have to be more lenient and let some of the data points be misclassified; in this case a soft margin SVM is appropriate.

Sometimes, the data is linearly separable, but the margin is so small that the model becomes prone to overfitting or being too sensitive to outliers. Also, in this case, we can opt for a larger margin by using soft margin SVM in order to help the model generalize better.

Loss Function: Hinge Loss

Hinge loss is a loss function commonly used in Support Vector Machines (SVMs) for training binary classification models. It is designed to quantify the classification error and encourage the SVM to find a decision boundary (hyperplane) that maximizes the margin between the two classes.

In the context of SVM, hinge loss is defined as follows for a single training example:

For a sample with true label $y$ and a prediction $f(x)$ (where $f(x)$ is the decision function of the SVM, often represented as $w \cdot x + b$, where $w$ is the weight vector and $b$ is the bias term), the hinge loss is calculated as:

$\text{Hinge loss} = \max(0,\, 1 - y \cdot f(x))$

Here's how the hinge loss works:

If the prediction is on the correct side of the decision boundary and outside the margin, i.e. $y \cdot f(x) \ge 1$, the hinge loss is zero.

If $y \cdot f(x) < 1$, the point is either inside the margin or on the wrong side of the decision boundary (misclassified), and the hinge loss grows linearly with the size of the margin violation: the larger the violation, the larger the loss.
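A tiny numeric illustration of these two cases (the values are made up):

import numpy as np

def hinge_loss(y, fx):
    return np.maximum(0, 1 - y * fx)

print(hinge_loss(+1, 2.3))   # correct side, outside the margin -> 0.0
print(hinge_loss(+1, 0.4))   # correct side, but inside the margin -> 0.6
print(hinge_loss(-1, 1.5))   # wrong side of the boundary -> 2.5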

The hinge loss has several important characteristics:

Margin Maximization: Hinge loss encourages the SVM to find a decision boundary that maximizes the margin between the two classes. This is because the loss increases as data points cross the margin boundary, penalizing points that are closer to the decision boundary.

Sparsity of Support Vectors: The hinge loss makes SVM robust by promoting sparsity in support vectors. Support vectors are the data points closest to the decision boundary, and hinge loss ensures that only the most relevant data points contribute to the loss function. This property makes SVMs less sensitive to outliers.

Soft Margin SVM: The hinge loss can be extended to soft margin SVM by introducing slack variables. In the soft margin case, the loss allows for some margin violations to account for noisy data or non-linearly separable data.

The overall SVM objective is to minimize the hinge loss (i.e., minimize classification errors) while maximizing the margin. This leads to a convex optimization problem where the goal is to find the optimal weight vector $w$ and bias term $b$ that minimize the sum of hinge losses across all training examples, subject to the margin constraints.

Mathematically, the objective of the SVM can be formulated as:

Minimize: $\frac{1}{2}||w||^2 + C \sum_i \max(0,\, 1 - y_i(w \cdot x_i + b))$
Here, $||w||$ is the norm of the weight vector, $C$ is a regularization parameter that controls the trade-off between maximizing the margin (keeping $w$ small) and minimizing the hinge loss, and the summation runs over all training examples $(x_i, y_i)$. A small value of $C$ corresponds to a large margin and a high tolerance for misclassification, while a large value of $C$ corresponds to a narrow margin and a low tolerance for misclassification.

If $C$ is set to a very large value, then the SVM will try to minimize the hinge loss function at all costs, even if it means overfitting the data. Conversely, if $C$ is set to a very small value, then the SVM will prioritize having a large margin, even if it means misclassifying some data points. The regularization parameter is typically set using cross-validation techniques to find the optimal value that balances the trade-off.
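A minimal sketch of this trade-off, assuming noisy blob data: fitting a linear SVC for a few values of $C$ and comparing the number of support vectors and the margin width $\frac{2}{||w||}$.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Noisy, partially overlapping blobs.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.8, random_state=1)

for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)
    n_sv = int(clf.n_support_.sum())
    print(f"C={C:<6} support vectors={n_sv:3d} margin width={margin:.3f}")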

Solving the Optimization Problem

Solving the SVM optimization problem involves finding the optimal values for the weight vector $w$ and the bias term $b$ that satisfy the margin constraints and minimize the objective function. This is a convex optimization problem, and various algorithms like the Sequential Minimal Optimization (SMO) or the gradient descent method can be used to find the optimal solution.
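As an illustration only (not the solver sklearn actually uses), the sketch below minimizes the regularized hinge-loss objective for a linear SVM by plain subgradient descent on made-up blob data:

import numpy as np
from sklearn.datasets import make_blobs

X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=0)
y = np.where(y01 == 0, -1, 1)

C, lr, epochs = 1.0, 0.001, 500
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                              # margin-violating points
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy of the hand-rolled linear SVM:",
      float(((X @ w + b) * y > 0).mean()))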

Once the optimization problem is solved, the support vectors are determined, and they define the decision boundary of the SVM model.

In summary, SVM finds the optimal hyperplane to separate data points of different classes by maximizing the margin between the hyperplane and the support vectors. The kernel trick allows SVM to handle non-linearly separable data efficiently. SVM's mathematical foundation lies in convex optimization, and its decision boundary is defined by the support vectors obtained during the training process.

Kernel Tricks for Non-Linearity in SVM

The Need for Non-Linearity: In many real-world classification problems, data is not linearly separable. This means that a single straight line or hyperplane cannot effectively separate the data points of different classes. SVMs, by default, use a linear kernel, which can only model linear relationships.

Mapping to a Higher-Dimensional Space: The idea behind kernel tricks is to map the original data from its original feature space into a higher-dimensional feature space where it might become linearly separable. This is done through a function $\phi$ (phi) that maps each data point $x$ from the original space to the higher-dimensional space.

Mathematically, $\phi: x → \phi(x)$

For example, in a 2D space, you might map data points $(x_1, x_2)$ to a higher-dimensional space $(x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$, effectively transforming a 2D space into a 3D space.
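A quick check of this particular mapping with made-up points: the dot product of $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$ equals the degree-2 polynomial kernel $(x \cdot x')^2$ evaluated in the original space.

import numpy as np

def phi(p):
    x1, x2 = p
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(a) @ phi(b))        # dot product after explicit mapping
print((a @ b) ** 2)           # degree-2 polynomial kernel in the original space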

The Kernel Function: The key to kernel tricks is the kernel function $K(x, x')$ that calculates the dot product between two data points in the higher-dimensional space without explicitly performing the transformation:

$K(x, x') = \phi(x) · \phi(x')$

This kernel function provides a measure of similarity or inner product between the transformed data points $\phi(x)$ and $\phi(x')$ in the higher-dimensional space.

Common Kernel Functions: Several common kernel functions are used in SVMs:
Linear Kernel (default): $K(x, x') = x · x'$
Polynomial Kernel: $K(x, x') = (γ x · x' + r)^d$
Radial Basis Function (RBF) Kernel (Gaussian Kernel): $K(x, x') = exp(-γ ||x - x'||^2)$

Here, $\gamma$, $r$, and $d$ are hyperparameters that control the shape and behavior of the kernel.
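These kernels can be evaluated directly with sklearn's pairwise helpers; the sketch below uses random points and arbitrary hyperparameter values purely for illustration.

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

print(linear_kernel(X))                                  # 5x5 matrix of x_i . x_j
print(polynomial_kernel(X, degree=2, gamma=1, coef0=1))  # (x_i . x_j + 1)^2
print(rbf_kernel(X, gamma=0.5)[0, 0])                    # K(x, x) = exp(0) = 1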

Training the SVM: When training an SVM with a kernel, you don't need to explicitly compute the transformation $\phi(x)$. Instead, you work with the kernel function directly. The SVM optimization problem is formulated in terms of the kernel function and the data points' inner products.

Predictions: After training, when making predictions for new data points, you apply the same kernel function to the new data points and the support vectors (a subset of training data points), effectively mapping the new data into the same higher-dimensional space. The decision boundary is defined in this space.

Benefits: Kernel tricks allow SVMs to capture complex non-linear decision boundaries and are particularly effective when dealing with data that cannot be easily separated by a linear hyperplane. Different kernels can be chosen based on the specific characteristics of the data.

Hyperparameter Tuning: Choosing the right kernel and its hyperparameters (e.g., $γ$ in the RBF kernel) is important and often requires experimentation and cross-validation to achieve the best performance for a given dataset.

Kernel Functions
The kernel function computes the dot product between the transformed data points in the higher-dimensional space without explicitly computing the transformation itself. This is achieved through a kernel function, which calculates the similarity between two data points in the original feature space.

Linear Kernel: 
$K(x_i, x_j) = x_i^T x_j$
The linear kernel is the simplest kernel and is used for linearly separable data. It corresponds to the standard dot product between the two data points. It is one of the most common kernels and is mostly used when there is a large number of features in the dataset; it is often used for text classification.

Training with a linear kernel is usually faster, because we only need to optimize the $C$ regularization parameter. When training with other kernels, we also need to optimize the $γ$ parameter. So, performing a grid search will usually take more time.

Polynomial Kernel: 
$K(x_i, x_j) = (\gamma\, x_i^T x_j + r)^d$

The polynomial kernel maps the data into a higher-dimensional feature space using a polynomial function. The parameter $γ$ is a scaling factor, $r$ is an optional constant term, and $d$ is the degree of the polynomial. It is useful for capturing non-linear relationships when data is not linearly separable.

The polynomial kernel is very popular in natural language processing. The most common degree is $d = 2$ (quadratic), since larger degrees tend to overfit on NLP problems. It can be visualized with the following diagram.



Radial Basis Function (RBF) Kernel:
The radial basis function kernel is a general-purpose kernel, used when we have no prior knowledge about the data. The RBF kernel on two samples $x$ and $y$ is defined by the following equation:

$K(x, y) = \exp(-\gamma\, ||x - y||^2), \quad \text{with } \gamma = \frac{1}{2\sigma^2}$

The RBF kernel implicitly maps the data into an infinite-dimensional feature space. The parameter $\gamma$ controls the spread of the kernel: as $\gamma$ becomes larger, the influence of each data point becomes more local and the decision boundary becomes more complex, while a smaller $\gamma$ produces a smoother boundary. The RBF kernel is widely used and effective in many real-world applications.
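A minimal sketch of the effect of $\gamma$, assuming concentric-circle toy data: a larger $\gamma$ gives a more local, more complex fit (higher training accuracy here), while a smaller $\gamma$ gives a smoother, simpler boundary.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=0)

for gamma in (0.1, 1, 100):
    clf = SVC(kernel='rbf', C=1, gamma=gamma).fit(X, y)
    print(f"gamma={gamma:<5} support vectors={int(clf.n_support_.sum()):3d} "
          f"train accuracy={clf.score(X, y):.2f}")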

Sigmoid kernel

The sigmoid kernel has its origin in neural networks and can be used as a proxy for them. It is given by the following equation:

Sigmoid kernel: $K(x, y) = \tanh(\alpha\, x^T y + c)$

Sigmoid kernel can be visualized with the following diagram



Other kernel functions, such as Laplacian kernel, are also available in some SVM implementations.

The choice of the kernel function and its hyperparameters can significantly impact the performance of the SVM model. In practice, the choice of kernel and its parameters is determined through hyperparameter tuning, typically using techniques like cross-validation.
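A minimal cross-validated search over the kernel and its hyperparameters might look as follows; the grid values and the toy dataset are arbitrary illustrative choices.

from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=0)

param_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", round(search.best_score_, 3))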

The kernel trick allows SVM to capture complex, non-linear decision boundaries, making it applicable to a wide range of data distributions. By combining the kernel trick with the soft margin concept, SVM becomes a powerful and versatile classifier, capable of handling both linearly separable and non-linearly separable data.

Advantages of SVM

Effective in High-Dimensional Spaces: SVM performs well in high-dimensional feature spaces, making it suitable for problems with many features or complex relationships.

Robust to Overfitting: SVM is less prone to overfitting, especially in cases where the number of features is greater than the number of samples.

Optimal Margin: SVM aims to maximize the margin between classes, promoting better generalization to unseen data.

Kernel Trick for Non-Linearity: The kernel trick allows SVM to handle non-linear data, transforming it into a higher-dimensional space where it may become linearly separable.

Global Optimal Solution: SVM optimization problem has a unique global optimal solution, so it avoids getting stuck in local optima.

Small Memory Footprint: SVM uses only a subset of training data (support vectors) for decision-making, which results in a small memory footprint.

Effective for Small Datasets: SVM works well even with small datasets, as it relies on a few critical support vectors.

Regularization Control: SVM has a regularization parameter (C) that controls the trade-off between maximizing the margin and minimizing the classification error. This parameter helps to fine-tune the model and prevent overfitting.

Flexibility in Kernels: SVM supports various kernel functions, allowing it to adapt to different data distributions.

Limitations of SVM

Computationally Intensive: Training SVM can be computationally expensive, especially for large datasets, as it involves solving a convex optimization problem.

Sensitivity to Noise: SVM can be sensitive to noisy data, which might lead to suboptimal performance.

Kernel Selection: The choice of the kernel function and its parameters can significantly impact the model's performance. Selecting an appropriate kernel is essential for good results.

Interpretability: SVM doesn't provide direct probabilities of class membership, and its decision boundaries may not be as interpretable as other models like decision trees.

In summary, SVM is a powerful and versatile classifier that can handle both linearly separable and non-linearly separable data by using the kernel trick. Its characteristics include effectiveness in high-dimensional spaces, robustness to overfitting, and optimal margin properties. Despite its computational cost, SVM remains a popular choice for various machine learning tasks due to its solid theoretical foundation and strong generalization abilities.

Applications of SVM

Support Vector Machines (SVMs) are a versatile machine learning algorithm with various applications across different domains. Here are some common applications of SVMs:

Classification: SVMs are widely used for binary and multiclass classification tasks. They excel in situations where there is a clear margin of separation between classes. Applications include:
Email spam detection
Image classification
Handwritten digit recognition (e.g., MNIST dataset)
Medical diagnosis (e.g., cancer classification)

Regression: SVMs can also be used for regression tasks, where the goal is to predict a continuous numeric value. This is known as Support Vector Regression (SVR); a minimal SVR sketch appears at the end of this section. Applications include:
Stock price prediction
House price prediction
Demand forecasting

Anomaly Detection: SVMs can identify outliers or anomalies in data, which is useful for fraud detection and quality control in manufacturing.

Text and Document Classification: SVMs are effective in natural language processing tasks such as sentiment analysis, document categorization, and spam detection.

Image Segmentation: SVMs can be used for image segmentation, separating objects or regions of interest from the background in medical imaging or computer vision applications.

Bioinformatics: SVMs are applied in bioinformatics for tasks like protein structure prediction, gene expression classification, and disease prediction.

Face Detection and Recognition: SVMs have been used in facial detection and recognition systems, including in security and biometric applications.

Handwriting Recognition: SVMs have been used for recognizing handwritten characters and converting them into machine-readable text.

Credit Scoring: SVMs are employed in the finance industry to assess credit risk and make lending decisions.

Quality Control in Manufacturing: SVMs can help identify defects or faults in manufacturing processes by analyzing sensor data.

Speech Recognition: SVMs can be used in speech recognition systems to classify spoken words or phrases.

Chemoinformatics: In drug discovery and chemistry, SVMs are used for tasks such as compound classification and toxicity prediction.

Geospatial Data Analysis: SVMs are applied in geospatial analysis for tasks like land cover classification and remote sensing.

Recommendation Systems: SVMs can be used in recommendation systems to suggest products, movies, or content to users based on their preferences and behavior.

Network Security: SVMs can help detect network intrusions and cyberattacks by analyzing network traffic patterns.

SVMs are favored for their ability to handle high-dimensional data, robustness in the face of noisy data, and their potential to provide good generalization performance. However, they may require careful tuning of hyperparameters and can be computationally intensive, especially with large datasets.
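As referenced in the regression item above, a minimal Support Vector Regression sketch (with an assumed noisy sine curve as the toy target) looks like this:

import numpy as np
from sklearn.svm import SVR

# Noisy samples of a sine curve as a toy regression target.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# epsilon defines a tube around the prediction inside which errors are ignored.
svr = SVR(kernel='rbf', C=10, epsilon=0.1).fit(X, y)
print("R^2 on the training data:", round(svr.score(X, y), 3))
print("prediction at x = 2.0:", svr.predict([[2.0]])[0])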

Binary SVM classifier in Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Generating synthetic data
np.random.seed(42)
X = np.random.randn(20, 2)
y = np.array([0] * 10 + [1] * 10)

# Create an SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear')

# Fit the classifier to the data
svm_classifier.fit(X, y)

# Plot the data and decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = svm_classifier.decision_function(xy).reshape(XX.shape)

# Plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])

# Plot support vectors
ax.scatter(svm_classifier.support_vectors_[:, 0], svm_classifier.support_vectors_[:, 1], s=100,
           linewidth=1, facecolors='none', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM Classifier with Linear Kernel')
plt.show()

# Predict the class of a new data point
new_data_point = np.array([[0.5, 0.5]])
predicted_class = svm_classifier.predict(new_data_point)

print("Predicted class for the new data point:", predicted_class[0])

Predicted class for the new data point: 0

In this example, we generated a synthetic dataset with two features and two classes. The SVM classifier with a linear kernel is trained on this dataset. We visualize the data points, decision boundary, and support vectors. Finally, we predict the class of a new data point (0.5, 0.5) and print the result. Keep in mind that as this is a toy example, the data points are randomly generated and may not be perfectly separable.


Find the SVM classifier for the following data: $x_1 = (2, 4, 7)$, $x_2 = (2, 5, 4)$, class $= (0, 1, 1)$.

To find the Support Vector Machine (SVM) classifier for the given data, we first need to build a model using a suitable SVM library.
import numpy as np
from sklearn.svm import SVC

# Given data points and their corresponding classes
X = np.array([[2, 2], [4, 5], [7, 4]])
y = np.array([0, 1, 1])

# Create an SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear')

# Fit the classifier to the data
svm_classifier.fit(X, y)
print(svm_classifier.coef_)
print(svm_classifier.intercept_)
print(svm_classifier.support_vectors_[0])
print(svm_classifier.support_vectors_[1])
# The classifier is now trained and can make predictions on new data
[[0.30769231 0.46153846]]
[-2.5384]
[2. 2.] 
[4. 5.]
The above code uses the linear kernel for simplicity, but depending on the data, other kernel functions like polynomial or radial basis function (RBF) might provide better results. You can change the kernel by setting the kernel parameter to 'poly' or 'rbf' in the SVC constructor.

Once the classifier is trained, you can use it to predict the class labels for new data points or evaluate its performance on test data.

Please note that this implementation assumes that your data points have two features each (dimensionality 2). If they have more features, you should adjust the input accordingly. Additionally, keep in mind that the class labels should be numeric (e.g., 0 and 1), and not strings.

Python example
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Create toy dataset
X = np.array([[1, 2], [2, 3], [3, 3], [2, 1], [3, 2]])
y = np.array([1, 1, 1, -1, -1])

# Create SVM classifier
clf = svm.SVC(kernel='linear')

# Fit the classifier to the data
clf.fit(X, y)

# Plot the decision boundary and support vectors
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

# Plot the decision boundary
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
                     np.linspace(ylim[0], ylim[1], 50))

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
            linestyles=['--', '-', '--'])

# Highlight support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=100, facecolors='none', edgecolors='k')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM Classifier with Linear Kernel')
plt.show()




import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt
# Given data points and their corresponding classes
X = np.array([[2, 2], [4, 5], [7, 4]])
y = np.array([-1, 1, 1])

# Create an SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear')

# Fit the classifier to the data
svm_classifier.fit(X, y)
print(svm_classifier.coef_)
print(svm_classifier.intercept_)
print(svm_classifier.support_vectors_[0])
print(svm_classifier.support_vectors_[1])
# Plot the decision boundary and support vectors
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

# Plot the decision boundary
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
                     np.linspace(ylim[0], ylim[1], 50))

Z = svm_classifier.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
            linestyles=['--', '-', '--'])

# Highlight support vectors
plt.scatter(svm_classifier.support_vectors_[:, 0], svm_classifier.support_vectors_[:, 1],
            s=100, facecolors='none', edgecolors='k')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM Classifier with Linear Kernel')
plt.show()


[[0.30769231 0.46153846]] [-2.53846154] [2. 2.] [4. 5.]

Working with Random Data
Hard Margin SVM

import numpy as np
import matplotlib.pyplot as plt
# generating binary classification data
from sklearn.datasets import make_blobs
X,y=make_blobs(n_samples=100,centers=2,random_state=2,cluster_std=0.65)
plt.scatter(X[:,0],X[:,1],c=y,s=60,cmap='autumn')
def plot_svm(model,ax=None,plot_support=True):
    if ax is None:
        ax=plt.gca()
    xlim=ax.get_xlim()
    ylim=ax.get_ylim()
    x=np.linspace(xlim[0],xlim[1],30)
    y=np.linspace(ylim[0],ylim[1],30)
    X,Y=np.meshgrid(x,y)
    xy=np.vstack([X.ravel(),Y.ravel()]).T
    P=model.decision_function(xy).reshape(X.shape)
    ax.contour(X,Y,P,colors='k',levels=[-1,0,1],alpha=0.5,linestyles=['--','-','--'])
    if plot_support:
        ax.scatter(model.support_vectors_[:,0],model.support_vectors_[:,1],s=300,linewidth=1,facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)   
from sklearn.svm import SVC
model=SVC(kernel='linear',C=10000000)
model.fit(X,y)
print(model.support_vectors_)
plot_svm(model)


[[-0.92165683 -7.99154016]
 [-0.01208894 -2.64727591]]





Soft Margin SVM

import numpy as np
import matplotlib.pyplot as plt
# generating binary classification data
from sklearn.datasets import make_blobs
X,y=make_blobs(n_samples=100,centers=2,random_state=3,cluster_std=0.95)
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap='autumn')
def plot_svm(model,ax=None,plot_support=True):
    if ax is None:
        ax=plt.gca()
    xlim=ax.get_xlim()
    ylim=ax.get_ylim()
    x=np.linspace(xlim[0],xlim[1],30)
    y=np.linspace(ylim[0],ylim[1],30)
    X,Y=np.meshgrid(x,y)
    xy=np.vstack([X.ravel(),Y.ravel()]).T
    P=model.decision_function(xy).reshape(X.shape)
    ax.contour(X,Y,P,colors='k',levels=[-1,0,1],alpha=0.5,linestyles=['--','-','--'])
    if plot_support:
        ax.scatter(model.support_vectors_[:,0],model.support_vectors_[:,1],s=50,linewidth=1,facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)   
from sklearn.svm import SVC
model=SVC(kernel='linear',C=0.1)
model.fit(X,y)
print(model.support_vectors_)
plot_svm(model)

[[ 0.4432172   2.34420801]
 [ 0.25238026  1.86482744]
 [-1.75399281  3.23970797]
 [-1.63959838  0.14889403]
 [-2.53895866  0.87590176]
 [-2.29284003  1.35185752]]



RBF Kernel

import numpy as np
import matplotlib.pyplot as plt
# generating binary classification data
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap='autumn')
def plot_svm(model,ax=None,plot_support=True):
    if ax is None:
        ax=plt.gca()
    xlim=ax.get_xlim()
    ylim=ax.get_ylim()
    x=np.linspace(xlim[0],xlim[1],30)
    y=np.linspace(ylim[0],ylim[1],30)
    X,Y=np.meshgrid(x,y)
    xy=np.vstack([X.ravel(),Y.ravel()]).T
    P=model.decision_function(xy).reshape(X.shape)
    ax.contour(X,Y,P,colors='k',levels=[-1,0,1],alpha=0.5,linestyles=['--','-','--'])
    if plot_support:
        ax.scatter(model.support_vectors_[:,0],model.support_vectors_[:,1],s=50,linewidth=1,facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)   
from sklearn.svm import SVC
model=SVC(kernel='rbf',C=1)
model.fit(X,y)
#print(model.support_vectors_)
plot_svm(model)









