Test 8
Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
Ensemble Learning
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
import nose.tools as test_  # For testing
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Useful in beautifying numpy arrays.
from IPython.display import HTML, display
import tabulate
def pp(a, show_head=True):
    '''
    args: a -> array to display.
          show_head -> if True, print only the first 5 rows.
    return: None
    '''
    if a.ndim < 2:
        a = [a]
    if show_head:
        display(HTML(tabulate.tabulate(a[:5], tablefmt='html')))
        return
    display(HTML(tabulate.tabulate(a, tablefmt='html')))
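For instance, a quick illustrative check of this helper (the small array below is made up for the example and is not part of the assignment data):
# Render a small array as an HTML table; by default only the first 5 rows are shown.
pp(np.arange(12).reshape(3, 4))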
In this test, we'll use SVC, bagging-style ensembles (voting, stacking, etc.) and boosting, and compare their performances. To this end, let's use a difficult-to-learn dataset with 16 classes and 7 features (5 informative and 2 redundant).
Question 1
3 Points
Generate a dataset using make_classification, given the number of samples to generate, the number of classes, and the number of features.
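For reference, a direct call to make_classification with the parameters described above might look like the sketch below (X_demo and y_demo are illustrative names, not part of the required solution; note that the default n_clusters_per_class=2 just satisfies make_classification's requirement n_classes * n_clusters_per_class <= 2**n_informative, since 32 <= 32):
from sklearn.datasets import make_classification

# Illustrative call only -- not the required generate_dataset implementation.
X_demo, y_demo = make_classification(n_samples=10000, n_classes=16, n_features=7,
                                     n_informative=5, n_redundant=2,
                                     random_state=42, shuffle=True)
print(X_demo.shape, y_demo.shape)  # expected: (10000, 7) (10000,)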
def generate_dataset(n_samples_, n_classes_, n_features_,
                     n_informative_, n_redundant_, random_state_, shuffle_):
    '''
    args: n_samples_ -> int => number of samples to generate
          n_classes_ -> int => number of classes in your dataset
          n_features_ -> int => total number of features (informative + redundant)
          n_informative_ -> int => number of informative features
          n_redundant_ -> int => number of redundant features
          random_state_ -> int => random state (for reproducible results)
          shuffle_ -> Bool => whether to shuffle the data.
    return: tuple (X, y) => X is ndarray of features (m, 7)
                         => y is ndarray of labels (m,)
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
X_for_test, y_for_test = generate_dataset(10000, 16, 7, 5, 2, 42, True)
test_.eq_(X_for_test.shape, (10000, 7))
test_.eq_(y_for_test.shape, (10000,))
X, y = generate_dataset(10000, 16, 7, 5, 2, 42, True)
print('Dataset with 5 informative features and 2 redundant features:')
pp(X)
print('Labels (Total Classes-16):')
pp(y)
Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, \
test_size=0.33, random_state=42)
Train SVC
svc = SVC()
svc.fit(X_train, y_train)
Predict labels
y_pred = svc.predict(X_test)
svc.score(X_test, y_test)
accuracy_score(y_test, y_pred)
Both the score method (which reports mean accuracy for classifiers) and accuracy_score are low (~56%) on this difficult-to-learn multiclass dataset.
Think about how many decision boundaries a classifier needs to separate a dataset with 16 classes (SVC's one-vs-one scheme trains $\binom{16}{2} = 120$ pairwise classifiers). This is indeed a very difficult task, especially with so few informative features, compared to binary classification.
Bagging
Let's try to improve on this accuracy with bagging, if possible.
Question 2
4 Points
Implement a voting classifier using 4 classifiers:
- LogisticRegression
- RandomForestClassifier (with 100 estimators)
- Gaussian Naive Bayes (GaussianNB)
- Support Vector Classifier (SVC).
Import any sklearn libraries/functions you need below.
Note that SVC generally performs better when the data is standardized; we're intentionally passing the data as-is in this question (a sketch of one possible instantiation appears below).
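As a minimal sketch under the assumptions above (the helper name _sketch_voting_estimators is purely illustrative and not the required solution), the instantiation step could look like this:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def _sketch_voting_estimators(rs=42, n_estimators_=100):
    # Same order the docstring below asks for: LogisticRegression, RandomForestClassifier, GaussianNB, SVC.
    clf1 = LogisticRegression(multi_class='multinomial', random_state=rs)
    clf2 = RandomForestClassifier(n_estimators=n_estimators_, random_state=rs)
    clf3 = GaussianNB()  # GaussianNB takes no random_state
    clf4 = SVC(random_state=rs)
    return clf1, clf2, clf3, clf4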
# Set up any library imports here
# YOUR CODE HERE
raise NotImplementedError()
def instantiate_classifiers_for_voting(rs, n_estimators_):
    '''
    Return 4 instantiated (not fitted) classifiers.
    args: rs -> int => random state for all classifiers which accept it.
          n_estimators_ -> int => number of estimators for RandomForestClassifier
    return: tuple of length 4 -> (clf1, clf2, clf3, clf4) => Each element is
            an instantiated classifier, in this order: LogisticRegression,
            RandomForestClassifier, GaussianNB, SVC.
    Other:
        > In LogisticRegression use 'multinomial' for the 'multi_class' arg.
        > For all classifiers, leave other parameters at their default values.
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
test_.eq_(len(instantiate_classifiers_for_voting(34, 10)), 4)
logr = instantiate_classifiers_for_voting(34, 10)[0]
test_.eq_(logr.random_state, 34)
_classifiers_voting = instantiate_classifiers_for_voting(42, 100)
_classifiers_voting
Question 3
4 Points
Fit a voting classifier using the classifiers you instantiated above, using sklearn. Look up which sklearn library/function you'll need and import it below.
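A minimal sketch of this step, assuming the four estimators from Question 2 (vot_demo is an illustrative name; this is not the required fit_voting_classifier implementation):
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# VotingClassifier takes (name, estimator) pairs; fitting the ensemble also fits each estimator.
vot_demo = VotingClassifier(
    estimators=[('lr', LogisticRegression(multi_class='multinomial', random_state=42)),
                ('rf', RandomForestClassifier(random_state=42)),
                ('gnb', GaussianNB()),
                ('svc', SVC(random_state=42))],
    voting='hard')
vot_demo.fit(X_train, y_train)  # uses the train/test split defined earlier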
# YOUR CODE HERE
raise NotImplementedError()
def fit_voting_classifier(classifiers_, voting_):
    '''
    Fit a voting classifier which uses classifiers_ as estimators. Note that in sklearn,
    fitting a voting classifier automatically fits its estimators. Read the relevant
    documentation for more information.
    args: classifiers_ -> tuple of length 4 -> (clf1, clf2, clf3, clf4) => Each element is
                          an instantiated classifier.
          voting_ -> string -> 'hard' or 'soft'. We'll only do 'hard' voting here. Do not
                     worry about soft voting at this point. Read the relevant documentation
                     for more information.
    return: fitted voting classifier.
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
_vot_clf = fit_voting_classifier(_classifiers_voting, 'hard')
estimators_list = _vot_clf.estimators_
test_.eq_(
    [str(est_name) for est_name in estimators_list],
    [
        "LogisticRegression(multi_class='multinomial', random_state=42)",
        'RandomForestClassifier(random_state=42)',
        'GaussianNB()',
        'SVC(random_state=42)'
    ]
)
Train
_vot_clf = fit_voting_classifier(_classifiers_voting, 'hard')
Predict
y_pred_vot_clf = _vot_clf.predict(X_test)
Accuracy
print('Mean accuracy:', _vot_clf.score(X_test, y_test))
print('Accuracy:', accuracy_score(y_test, y_pred_vot_clf))
You should see an accuracy of ~62%, which is a substantial improvement over using SVC alone. Voting improved performance.
Aside: In 2009, Netflix awarded a $1 million prize to a developer team for an algorithm that improved the accuracy of the company's recommendation engine by 10 percent.
Stacking
Let's now try stacking and observe how it behaves on our dataset. Recall what stacking is (reread the tutorial if needed).
Question 4
4 Points
Implement a stacking classifier, using 4 base classifiers:
- Random Forest Classifier (with 100 estimators)
- SVC. This time with the data standardized using StandardScaler.
- Logistic Regression
- Gaussian Naive Bayes (GaussianNB)
In addition to these four base classifiers, the stack uses a LogisticRegression as its final estimator (see the docstring below). Import any sklearn libraries/functions you need below.
Note that SVC generally performs better when the data is standardized; we'll do so in this question. You may find make_pipeline useful (a sketch of one possible setup appears below).
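A minimal sketch under the assumptions above (stack_demo is an illustrative name, not the required instantiate/fit functions): wrap the SVC in a StandardScaler pipeline and pass a LogisticRegression as the final estimator of a StackingClassifier.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Base estimators in the order described above; the SVC is standardized via a pipeline.
stack_demo = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('svc', make_pipeline(StandardScaler(), SVC(max_iter=100000, random_state=42))),
                ('lr', LogisticRegression(max_iter=100000, multi_class='multinomial', random_state=42)),
                ('gnb', GaussianNB())],
    final_estimator=LogisticRegression(max_iter=100000, multi_class='multinomial', random_state=42))
# stack_demo.fit(X_train, y_train)  # fitting the stack also fits the base estimators and the final estimator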
# Set up any library imports here
# YOUR CODE HERE
raise NotImplementedError()
def instantiate_classifiers_for_stacking(rs, n_estimators_, max_iter_):
    '''
    Return 5 instantiated (not fitted) classifiers.
    args: rs -> int => random state for all classifiers which accept it.
          n_estimators_ -> int => number of estimators for RandomForestClassifier
          max_iter_ -> int => pass this to SVC and to both LogisticRegressions to silence
                              warnings. Read the relevant docs for more information.
    return: tuple of length 5 -> (clf1, clf2, clf3, clf4, final_clf) => Each element is
            an instantiated classifier, in this order: RandomForestClassifier,
            SVC (wrapped in a StandardScaler pipeline, as the tests below expect),
            LogisticRegression, GaussianNB. The final_clf is also a LogisticRegression.
    Other:
        > In both LogisticRegressions use 'multinomial' for the 'multi_class' arg.
        > For all classifiers, leave other parameters at their default values.
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
test_.eq_(len(instantiate_classifiers_for_stacking(34, 10, 10000)), 5)
rf = instantiate_classifiers_for_stacking(34, 10, 10000)[0]
test_.eq_(rf.random_state, 34)
classifiers_stacking = instantiate_classifiers_for_stacking(42, 100, 100000)
# Set up any library imports here
# YOUR CODE HERE
raise NotImplementedError()
def fit_stacking_classifier(classifiers_):
    '''
    Fit a stacking classifier which uses classifiers_ as estimators. Note that in sklearn,
    fitting a stacking classifier automatically fits its estimators. Read the relevant
    documentation for more information.
    args: classifiers_ -> tuple of length 5 -> (clf1, clf2, clf3, clf4, final_clf) => Each
                          element is an instantiated classifier; final_clf is the final estimator.
    return: fitted stacking classifier.
    The solution should be very similar to that of fit_voting_classifier.
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
_stack_clf = fit_stacking_classifier(classifiers_stacking)
estimators_list = _stack_clf.estimators_
test_.eq_(
    ["".join(str(est_name).split()) for est_name in estimators_list],  # remove white spaces then compare
    [
        'RandomForestClassifier(random_state=42)',
        "Pipeline(steps=[('standardscaler',StandardScaler()),('svc',SVC(max_iter=100000,random_state=42))])",
        "LogisticRegression(max_iter=100000,multi_class='multinomial',random_state=42)",
        'GaussianNB()'
    ]
)
# YOUR CODE HERE
raise NotImplementedError()
Observe the performance.
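A minimal evaluation sketch, assuming the fitted stack from above is available as _stack_clf (illustrative only; your own evaluation code may differ):
# Score the fitted stacking classifier on the held-out test split.
y_pred_stack = _stack_clf.predict(X_test)
print('Mean accuracy:', accuracy_score(y_test, y_pred_stack))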
Question 5
7 Points
This question is open-ended and will test you on several things, including how well you can read documentation. Train an XGBoost classifier on the given data and tune hyperparameters if you wish. To earn credit for this question, your accuracy on the test data should be $\ge 56\%$.
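A minimal sketch of the native xgboost training API, which returns an xgboost.core.Booster (bst_demo and the parameter values are illustrative assumptions, not a tuned configuration, so clearing the 56% bar is not guaranteed):
import xgboost as xgb

# DMatrix is xgboost's own data structure; the native API uses it for both training and prediction.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

params = {'objective': 'multi:softmax',  # predict class labels directly
          'num_class': 16,
          'max_depth': 6,
          'eta': 0.3}
bst_demo = xgb.train(params, dtrain, num_boost_round=100)  # returns an xgboost.core.Booster

y_pred_demo = bst_demo.predict(dtest).astype(int)  # softmax predictions come back as floats
print('Accuracy:', accuracy_score(y_test, y_pred_demo))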
# Set up any library imports here
# YOUR CODE HERE
raise NotImplementedError()
# You may use this cell (or create other cells)
# for scratch work, e.g. GridSearchCV for hyperparameter tuning.
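For example, a sketch of such scratch work using the sklearn wrapper xgb.XGBClassifier (one possible approach, not a requirement), whose get_booster() method exposes the underlying Booster; the grid here is illustrative only:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Small illustrative grid; expand or change it as you see fit.
param_grid = {'max_depth': [4, 6], 'learning_rate': [0.1, 0.3]}
search = GridSearchCV(xgb.XGBClassifier(), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
booster = search.best_estimator_.get_booster()  # an xgboost.core.Booster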
def train_XGBoost(X_train_, y_train_):
    '''
    Train a booster object on the given training data.
    args: X_train_ -> ndarray -> shape (m, 7)
          y_train_ -> ndarray -> shape (m,)
    return: xgboost.core.Booster object
    The returned Booster should be trained (hyperparameters tuned, etc.) so that it
    reaches at least 56% accuracy on the test data to earn credit.
    '''
    bst = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return bst  # xgboost.core.Booster object
bst = train_XGBoost(X_train, y_train)
test_.eq_(str(type(bst)), "<class 'xgboost.core.Booster'>")
Other Insights From XGBoost
xgb.plot_importance(bst)  # feature importance
# !pip install graphviz
xgb.to_graphviz(bst, num_trees=2)  # tree with index 2; change this number to display a different tree