SVM
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
Classification¶
Support Vector Machine (SVM)¶
Let us copy the snippet from previous tutorial for reference:
ML Algorithms General Form:¶
We can write about every machine learning algorithm that exists in the following canonical form: the goal of any machine learning algorithm is to find the parameters that minimize the average of losses on the data. This problem is written formally as:
$$ minimize_{\theta} \ \frac{1}{m} \sum_{1}^{m} l(h_{\theta}(x^{(i)}, y^i) = minimize_{\theta} \ E(\theta)$$In the next tutorial, we'll study classification algorithms logistic regression, support vector machine etc and you'll to define them (or any ML algorithm in general) we'll specify Hypothesis function, Loss function and the way to optimize loss function just like we did here.
Loss functions in Classification¶
For classification, loss functions are the one element that is substantially different from the ones in regression.
We can naively use least-squares loss function for classification as we did in regression however when applied to classification it aims at predicting exactly +1 or −1 on each data point which is wrong in the sense that for a positive class we're also okay with a number much greater than +1. Similarly, for a negative class we're also okay with a number much less than -1. You've already seen logistic loss, another commonly used loss in ML is hinge loss.
Hinge Loss for SVM¶
Support vector machines (SVMs) result from choosing hinge loss
$$l_{hinge}(h_{\theta}(x),y)= max(1 − h_{\theta}(x)⋅y, 0))$$as the loss function to minimize.
Weston and Watkins extension of hinge loss for multiclass classification is also common:
$$l_{hinge}(h_{\theta}(x),y)= \sum_{y \neq t} max(0, \Delta + w_y x - w_t x))$$The (extension) SVM loss is set up so that the SVM “wants” the correct class for each image to a have a score higher than the incorrect classes by some fixed margin Δ.
More on Hinge Loss.
Hypothesis Functions (Linear vs Non-Linear SVMs)¶
Hypothesis function could either be linear (as in linear regression, and logistic regression) or it could be kernel (non-linear).
The terminologies “linear SVM” or “kernel SVM”, are just the designations of which type of hypothesis function they are using. That is, linear SVMs use the hypothesis function:
$$ h_{\theta}(x) = \sum_{j=1}^{n} \theta_j x_j = \theta^T x $$exactly as we have seen before in the linear regression case. While a kernel SVM uses the hypothesis function:
$$ h_{\theta}(x) = \sum_{i=1}^{n} \theta_i K (x, x^{(i)}) $$Where K is a kernel function. More on this here.
import numpy as np
import matplotlib.pyplot as plt
hy = np.linspace(-3,3,1000)
plt.plot(hy,(hy<=0))
plt.plot(hy, np.log(1+np.exp(-hy)))
plt.plot(hy, np.maximum(1 - hy, 0))
plt.plot(hy, np.exp(-hy))
plt.xlim([-3,3])
plt.ylim([-0.05, 5])
plt.ylabel("Loss")
plt.xlabel("$h_θ(x) \cdot y$")
plt.legend(['Zero-one', 'Logistic', 'Hinge'])
<matplotlib.legend.Legend at 0x7fb711ecb780>
This is an excellent demo for playing with SVM (which uses the hinge loss- proposed by Weston and Watkins)
Implemetation of linear svm using sklearn¶
Iris Dataset¶
Let's load iris dataset form sklearn. There are 3 classes (labels) in this dataset. See this guide for more information on dataset.
This is the same dataset we used in the prvious tutorial.
iris = load_iris()
features = iris.data
labels = iris.target
labels_names = iris.feature_names
labels_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Train/Test Split¶
Use sklearn as in previous tutorial.
x_train, x_test, y_train, y_test = train_test_split(features, labels,\
test_size=.3, random_state=32)
svm_lin_clf = LinearSVC(max_iter=50000)
svm_lin_clf.fit(x_train, y_train)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=50000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)
Notice the parameter max_iter=140- default value is 100 we increased it to 140 as the optimization method 'lbgfs' does not converge within 100 iterations (Try decreasing it from 140). Read the documentation for more information.
Exercise¶
Find the least number of iterations to ensure convergence. Currently, it's set 50000 which is more than the minimum number required.
y_pred = svm_lin_clf.predict(x_test) # predict on test data
y_pred
array([1, 0, 0, 1, 2, 2, 0, 0, 1, 0, 1, 2, 1, 1, 2, 2, 1, 2, 1, 0, 0, 2, 2, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 1, 0, 0, 2, 2, 1, 0, 2, 0, 2, 0])
svm_lin_clf.score(x_test, y_test) # R2 score.
1.0
accuracy_score(y_test, y_pred) # accuracy
1.0
Exercise¶
Implement non-linear (kernel) SVM with RBF kernel.
Decision Trees¶
Decision trees are some of the most ubiquitous algorithms in ML. They are not as powerful from a pure ML standpoint, yet are still one of the canonical examples of an ML algorithm. The basic concept of a DT is make classification/regression predictions by tracing through rules in a tree, with a constant prediction at each leaf node. Observe the figure below for a concrete example:
We use greedy methods to incrementally build the tree (i.e., minimize the loss function) one node at a time. More on DTs.
Implement DT using Sklearn¶
dt_clf = DecisionTreeClassifier()
dt_clf.fit(x_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best')
y_pred = svm_lin_clf.predict(x_test) # Predict on test data
y_pred
array([1, 0, 0, 1, 2, 2, 0, 0, 1, 0, 1, 2, 1, 1, 2, 2, 1, 2, 1, 0, 0, 2, 2, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 1, 0, 0, 2, 2, 1, 0, 2, 0, 2, 0])
svm_lin_clf.score(x_test, y_test) # R2 score.
1.0
accuracy_score(y_test, y_pred) # accuracy
1.0
Cross-validation.¶
In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called cross-validation. Working with our previous example, the idea is that instead of arbitrarily picking the first 1000 datapoints to be the validation set and rest training set, you can get a better and less noisy estimate of how well a certain value works by iterating over different validation sets and averaging the performance across these. For example, in 5-fold cross-validation, we would split the training data into 5 equal folds, use 4 of them for training, and 1 for validation. We would then iterate over which fold is the validation fold, evaluate the performance, and finally average the performance across the different folds. Summed up in the image below:
Model Selection¶
In this simple dataset (iris) all models gave sufficiently good accuracy/r2 score. But keep in mind: in real-world ML models, we'll interested in a:
- Model that have low generalizatoin error (good performance on test set)
- Model that represents good tradeoff between bais and variance . We're interested in models with moderate bias/variance (not too simple (high underfitting, not too complex (high overfitting). See these notes on learning theory for more information.
- Maintainability and available resources are factors that play role in model selction.
- Model that meets requirements and constraints of project stakeholders.
- Model that is killful relative to other tested models and navie models.
In depth information on Model selection and regularization is here.