Advanced Experiments¶
from sklearn.datasets import make_classification
from collections import Counter
# To install imbalanced-learn, if it is not already installed:
# !pip install -U imbalanced-learn
import imblearn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
Imbalanced Data¶
We've alluded to imbalanced data before, in tutorial 4,
in a different context, where we said that a useless model that does nothing except always return class 1
can get 99% accuracy on a highly imbalanced dataset in which 99% of the examples indeed belong to class 1. There we addressed the problem by using a metric other than accuracy. In fact, even a sophisticated (low-bias) classification model trained on imbalanced data is likely to produce unsatisfactory predictions on test data. In this tutorial, we'll try to balance the difference in sample sizes between classes to ultimately improve the generalization of the model.
Undersampling¶
Undersampling eliminates a subset of the data points in the majority class. There are two general approaches:
- Random undersampling (randomly discard majority-class examples) - efficient.
- Cluster undersampling (cluster the majority-class examples so that the number of centroids equals the number of minority-class examples, thereby balancing the classes) - expensive; this technique can be slow when the majority class is large. A sketch appears after the random undersampling example below.
A general drawback of undersampling is that it can discard a large portion of the majority class's data, causing critical information to be lost.
Generate a dataset with ~80% examples in class 1¶
X, y = make_classification(n_classes=2,
                           class_sep=2, weights=[0.2, 0.8])
Counter(y)
Counter({0: 20, 1: 80})
pp(X)
pp(y)
(first five rows of X: 20 features per example, e.g. -1.43228, 0.146628, 0.801869, ..., 0.98123)
(y: 100 labels, roughly 80% of them class 1, e.g. 0 1 1 1 1 0 1 1 1 1 ...)
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
Counter(y_res)
Counter({0: 20, 1: 20})
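The cluster-based alternative mentioned above can be sketched with imblearn's ClusterCentroids, which replaces the majority class with K-means centroids so that both classes end up the same size (shown here on the same X, y; note it can be slow for a large majority class):
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=42)
X_cc, y_cc = cc.fit_resample(X, y)   # majority class replaced by 20 centroids
Counter(y_cc)                        # both classes should now have 20 examples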
Oversampling¶
Oversampling increases the number of data points in the minority class to balance it with the majority class. Random oversampling achieves this by randomly duplicating data points in the minority class until the two classes are equal in size.
Oversampling can lead to over-fitting, and it usually requires longer training time than undersampling.
Oversample¶
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
Check Ratio¶
Counter(y_res)
Counter({0: 80, 1: 80})
Both classes contain 80 examples now.
Which technique is better?¶
So, what should we do? Oversampling, undersampling, or no resampling at all? There is also a hybrid approach (both oversampling and undersampling), as well as other techniques such as the Synthetic Minority Over-sampling Technique (SMOTE). It turns out there is no easy answer to which technique you should use. It depends on your metric (whether you care about accuracy, type I error, etc., explained below), on your classifier (logistic regression, neural net, etc.), on how imbalanced your data is (the imbalance ratio), and so on, as argued by this recently published paper. You're highly encouraged to go through this easy-to-read paper, especially its graphs.
Message:
The take-away message is that in the real world, with imbalanced data, you should usually consider various combinations of resampling techniques (e.g., whether to undersample) and base classification methods (e.g., whether to use an SVM). Don't assume undersampling is always the way to go, for example. A sketch of SMOTE and a hybrid approach follows below.
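For completeness, here is a brief sketch (reusing the X, y generated above) of two of the alternatives mentioned: SMOTE, which synthesizes new minority examples by interpolating between neighbouring minority points, and SMOTEENN, a hybrid that oversamples with SMOTE and then cleans the result with Edited Nearest Neighbours:
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
# Synthetic oversampling of the minority class.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
# Hybrid: SMOTE oversampling followed by ENN cleaning.
X_hy, y_hy = SMOTEENN(random_state=42).fit_resample(X, y)
print(Counter(y_sm), Counter(y_hy))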
Evaluation¶
F1 (Recall & Precision)¶
Recall the confusion matrix from tutorial 4:
Precision:¶
The number of True Positives $(TP)$ divided by the number of True Positives and False Positives $(FP)$. A low precision indicates a large number of False Positives. $$ \frac{TP}{(TP+FP)}$$
Recall:¶
The number of True Positives divided by the number of True Positives and the number of False Negatives $(FN)$. A low recall indicates many False Negatives. $$ \frac{TP}{(TP+FN)}$$
F1 Score:¶
$$\frac{2 \times (precision \times recall)}{(precision+recall)}$$.
F1 score conveys the balance between the precision and the recall.
Read the sklearn documentation for more information.
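As a small worked example (on hypothetical labels, not tutorial data), the formulas above can be checked directly against sklearn.metrics:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # TP=2, FN=2, FP=1, TN=5
print(precision_score(y_true, y_pred))    # TP/(TP+FP) = 2/3 ≈ 0.67
print(recall_score(y_true, y_pred))       # TP/(TP+FN) = 2/4 = 0.50
print(f1_score(y_true, y_pred))           # 2PR/(P+R) ≈ 0.57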
ROC¶
Let's define Type I and Type II errors first:
Type I error: $$ \frac{FP}{(TN+FP)}$$¶
Type II error: $$ \frac{FN}{(FN+TP)}$$¶
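These two rates can be computed from sklearn's confusion matrix, whose flattened binary output is ordered TN, FP, FN, TP (a small illustration on the same hypothetical labels as in the F1 example above):
from sklearn.metrics import confusion_matrix
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp / (tn + fp))   # Type I error (false positive rate) = 1/6 ≈ 0.17
print(fn / (fn + tp))   # Type II error (false negative rate) = 2/4 = 0.50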
The Receiver Operating Characteristic (ROC) curve is a graphical illustration of the classification model's performance in separating the positive from the negative class as the discrimination threshold is varied (see Figure 2 below). Additionally, it reflects the trade-off between Type I and Type II errors.
Area Under the ROC Curve:¶
The area under the ROC curve $(AUC)$ is a single-number summary that quantifies the model's performance in discriminating the positive from the negative class. It falls between 0 and 1, and the larger the value, the better the model's performance. A value of 0.5 is no better than random guessing.
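A self-contained sketch of computing the ROC curve points and the AUC with sklearn (the fresh synthetic dataset and the logistic regression model are assumptions for illustration, not taken from the tutorial):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
# Imbalanced toy problem and a simple probabilistic classifier.
X2, y2 = make_classification(n_classes=2, weights=[0.2, 0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X2, y2, stratify=y2, random_state=0)
clf_lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf_lr.predict_proba(X_te)[:, 1]        # probability of the positive class
# One (FPR, TPR) point per discrimination threshold; AUC summarizes the curve.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(roc_auc_score(y_te, scores))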
GridSearch (Hyperparameters)¶
Grid search is an exhaustive search over specified hyperparameter values for a model.
Consider a Support Vector Classifier (SVC). Let's find the best value of the regularization parameter C and whether a linear or RBF kernel performs better (see the arguments SVC takes in its docs for more information).
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
# hyperparameters possible values.
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None,
                           shrinking=True, tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
Best Hyperparameters Found:¶
clf.best_params_
{'C': 1, 'kernel': 'linear'}
Hence, the best regularization value is C=1 and the best kernel is linear on the Iris dataset.
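Beyond best_params_, the fitted GridSearchCV object also exposes the best cross-validated score and a refit best estimator that can be used directly for prediction (a short follow-up sketch):
print(clf.best_score_)              # mean cross-validated accuracy of the best combination
print(clf.best_estimator_)          # SVC refit on the full data with the best hyperparameters
print(clf.predict(iris.data[:5]))   # predictions from the refit best estimator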