Test 2
Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE", as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
Feel free to go online. In fact, we encourage you to read documentation where needed. However, you may not collaborate with anybody. To certify that you didn't collaborate with anyone, write 'Nobody' in COLLABORATORS above.
# Set up library imports. These imports also give you pointers on how to approach a question.
from sklearn.datasets import load_boston  # note: load_boston was deprecated and later removed in newer scikit-learn releases; this notebook assumes a version that still provides it
import pandas as pd
from sklearn.svm import SVR
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import numpy as np
def minimum_features_to_classify_dataset():
'''
arg: None
return: list (containing strings)
Features are called x1, x2, x1^2, x2^2, x1x2. Return your answer as a list of strings.
For example, if your answer is x1 and x2, return ['x1', 'x2']
'''
# YOUR CODE HERE
raise NotImplementedError()
## This test is only checking whether your list contains a string. Other tests are hidden
assert isinstance(minimum_features_to_classify_dataset()[0], str)
# YOUR CODE HERE
raise NotImplementedError()
all_data_boston = get_boston_data()
assert list(all_data_boston.keys()) == ['data', 'target', 'feature_names', 'DESCR', 'filename']
Raw Features¶
The data as it is, before any feature engineering.
raw_features = get_boston_data().data # ndarray of shape (n_samples, n_features)
raw_features.shape
Question 3¶
(2 points)
Store the data in a pandas DataFrame with the correct column names.
Hint: all_data_boston
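For intuition, here is a minimal sketch (on a tiny made-up array, not the Boston data) of how a numpy array and a list of column names combine into a DataFrame:
# Illustrative sketch only (toy data): an ndarray plus column names gives a labelled DataFrame.
import numpy as np
import pandas as pd
toy_data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
toy_df = pd.DataFrame(toy_data, columns=['a', 'b', 'c'])
toy_df.head()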
def get_df(all_data_boston):
'''
args: all_data_boston -> dict containing the Boston dataset returned by get_boston_data()
return: pandas DataFrame
Put the data into a DataFrame with the correct column names.
'''
# YOUR CODE HERE
raise NotImplementedError()
assert list(get_df(all_data_boston).columns) == ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT']
df_head = get_df(all_data_boston).head()
df_head
What does each column name mean?
Answer
labels = all_data_boston.target
Labels¶
MEDV Median value of owner-occupied homes in $1000’s
# First 10 labels
labels[:10]
Question 4¶
(2 points)
(A question on Sklearn)
We'll observe the performance of our ML model on a chunk of data it has not previously seen. To this end, divide the raw features into two parts: call one the training set and the other the test set. We'll see more of the train-test split in the future as well. Read the documentation on sklearn's train_test_split function for more information.
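As an illustration of the interface (a minimal sketch on made-up data, separate from the graded cell below), train_test_split divides features and labels consistently:
# Illustrative sketch on toy data: 10 samples, 2 features.
import numpy as np
from sklearn.model_selection import train_test_split
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)
# 30% of the rows go to the test split; random_state makes the shuffle reproducible.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)
toy_X_train.shape, toy_X_test.shape  # (7, 2) and (3, 2)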
def split(raw_features, labels, _test_size, _random_state=42):
'''
args: raw_features -> numpy.ndarray
labels -> pandas.core.series.Series
_test_size -> float (between 0 and 1)
_random_state -> int (to reproduce the same results across multiple runs)
return:
X_train -> numpy.ndarray
X_test -> numpy.ndarray
y_train -> pandas.core.series.Series
y_test -> pandas.core.series.Series
'''
# YOUR CODE HERE
raise NotImplementedError()
x_train_raw_feats, x_test_raw_feats, y_train_raw_feats, y_test_raw_feats = split(\
raw_features, labels, _test_size=.3)
assert x_train_raw_feats.shape == (354, 13)
assert x_test_raw_feats.shape == (152, 13)
assert y_train_raw_feats.shape[0] == x_train_raw_feats.shape[0]
assert y_test_raw_feats.shape[0] == x_test_raw_feats.shape[0]
x_train_raw_feats, x_test_raw_feats, y_train_raw_feats, y_test_raw_feats = split(\
raw_features, labels, _test_size=.3)
Importance of Feature Engineering¶
We'll use an algorithm called support vector regression (SVR). For now, you may assume SVR is just another ML algorithm. You'll observe how SVR performs better when features are engineered well.
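As a quick illustration (a toy sketch on made-up numbers, independent of the graded cells below), an SVR model exposes the same fit/predict interface as other sklearn estimators:
# Illustrative sketch: fit an SVR on five one-dimensional toy points and predict on a new one.
import numpy as np
from sklearn.svm import SVR
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
toy_y = np.array([0.0, 0.9, 2.1, 2.9, 4.2])
toy_svr = SVR().fit(toy_X, toy_y)   # fit returns the estimator itself
toy_svr.predict(np.array([[2.5]]))  # predict on unseen input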
Question 5¶
(2 points)
(A question on Sklearn)
As mentioned above, you don't need to know anything about SVR to answer this question. This question tests how well you can read documentation (sklearn's, in this case), just like Question 4.
In this question you'll write a function that returns an object of type sklearn.svm._classes.SVR
which is fitted (trained) on x_train (raw features) and y_train (labels). We can use the SVR returned by this function to make predictions.
You may also find the documentation on the fit method helpful.
# YOUR CODE HERE
raise NotImplementedError()
Fit¶
Fitting on unscaled x_train_raw_feats and y_train_raw_feats by calling the function you implemented above.
svr_model = fit_model(x_train_raw_feats, y_train_raw_feats)
assert str(type(svr_model)) == "<class 'sklearn.svm._classes.SVR'>"
To see how well SVR performed on the unseen chunk of data (x_test_raw_feats, y_test_raw_feats), we'll use the coefficient of determination as a measure, called score (its best possible value is 1). The higher this score, the better the performance.
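Concretely, score here is the $R^2$ statistic, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. A small sketch of computing it by hand on made-up values (toy numbers, not the Boston data):
# Illustrative sketch: the same R^2 that .score() reports, computed directly with numpy.
import numpy as np
toy_y_true = np.array([3.0, 2.0, 7.0, 4.0])
toy_y_pred = np.array([2.5, 2.0, 8.0, 4.5])
ss_res = np.sum((toy_y_true - toy_y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((toy_y_true - toy_y_true.mean()) ** 2)  # total sum of squares
1.0 - ss_res / ss_tot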
svr_model.score(x_test_raw_feats, y_test_raw_feats)
You should see a score of ~0.28. Let's try to improve this score.
Question 6¶
(5 points)
Zero-mean and unit-variance scaling using numpy only¶
Given x_train (raw features in an ndarray), you'll standard-scale it such that each feature (column) in x_train has mean 0 and variance 1. Specifically, for each feature subtract its mean and divide by its standard deviation. Note that you may only use numpy functions for this one; you may not use any sklearn functions. Fill in the two functions below in the class.
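For intuition, here is a minimal numpy-only sketch of column-wise standardization on a tiny made-up matrix (illustrative, separate from the graded class below):
# Illustrative sketch: standardize each column of a toy (3, 2) array.
import numpy as np
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
col_mean = toy.mean(axis=0)   # per-feature (column) mean
col_std = toy.std(axis=0)     # per-feature standard deviation
toy_scaled = (toy - col_mean) / col_std
toy_scaled.mean(axis=0), toy_scaled.std(axis=0)   # approximately 0 and 1 per column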
class _Standard_Scaler(object):
def __init__(self, x):
self.mean_each_feat = 0.0 # mean of each feature- will become an ndarray
self.std_dev_each_feat = 0.0 # standard deviation of each feature- will become an ndarray
self.x = x
def compute_mean_and_std_dev(self):
'''
args: None (uses self.x -> ndarray of raw features, of shape (m,n))
return: None (compute the mean and standard deviation of each feature and
store them in self.mean_each_feat and self.std_dev_each_feat respectively)
You may only use numpy functions.
'''
# YOUR CODE HERE
raise NotImplementedError()
def get_params(self):
return self.mean_each_feat, self.std_dev_each_feat
def get_scaled_input(self, _mean=None, _std=None):
'''
args: None
return: scaled x such that each feature has 0 mean and unit var (also unit std)
'''
if _mean is None:
_mean = self.mean_each_feat
if _std is None:
_std = self.std_dev_each_feat
# YOUR CODE HERE
raise NotImplementedError()
_scaler = _Standard_Scaler(x_train_raw_feats)
_scaler.compute_mean_and_std_dev()
_scaler.get_scaled_input()
assert _scaler.get_scaled_input().shape == x_train_raw_feats.shape
scaled_feats = _scaler.get_scaled_input()
assert np.isclose(scaled_feats[:2], np.array([[-0.41425879, -0.50512499, -1.29214218, -0.28154625, -0.85108479,
0.14526384, -0.365584 , 1.08162833, -0.74617905, -1.11279004,
0.18727079, 0.39651419, -1.01531611],
[-0.40200818, -0.50512499, -0.16208345, -0.28154625, -0.08796708,
-0.20840082, 0.13394078, -0.48787608, -0.39846419, 0.15008778,
-0.21208981, 0.3870674 , -0.05366252]])).all()
_scaler = _Standard_Scaler(x_train_raw_feats)
_scaler.compute_mean_and_std_dev()
scaled_feats_x_train = _scaler.get_scaled_input()
_scaler.get_params()
Ungraded Question 7¶
Can you see the importance of creating a class to implement a _Standard_Scaler?
Answer¶
To scale features in the test set we'd like to use the same mean and std as the train set, since we assume the train and test sets are drawn from the same distribution. Not computing the mean and std on the test set is especially useful if your test set is very small, or if you want to predict the label of only one data point, for example. We can simply call get_params() to get the mean and std and use them to scale the test set as well.
_scaler.get_params()
(array([3.46988686e+00, 1.14039548e+01, 1.11330508e+01, 7.34463277e-02,
5.57259322e-01, 6.32567232e+00, 6.87997175e+01, 3.76587401e+00,
9.43785311e+00, 4.07042373e+02, 1.82779661e+01, 3.59701808e+02,
1.24211299e+01]),
array([8.30407703e+00, 2.25765011e+01, 6.92884344e+00, 2.60867715e-01,
1.16626831e-01, 7.18194456e-01, 2.76262572e+01, 2.12302684e+00,
8.62775916e+00, 1.66286870e+02, 2.25360235e+00, 8.68019175e+01,
7.10234960e+00]))
Along these lines, also think about the other design decisions we made while writing this class, especially the arguments of get_scaled_input(self, _mean=None, _std=None).
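For example (a sketch that assumes your class above is implemented), a previously unseen row could be scaled with the training-set statistics like this:
# Illustrative sketch: reuse the training-set mean and std on new, unseen data.
train_mean, train_std = _scaler.get_params()
x_new = x_test_raw_feats[:1]                       # pretend this single row just arrived
x_new_scaled = (x_new - train_mean) / train_std    # scaled with *training* statistics only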
def standard_scale_with_sklearn(x):
'''
args: x -> ndarray (here, containing raw features) of shape (m,n)
return: standard-scaled x of shape (m,n)
'''
# YOUR CODE HERE
raise NotImplementedError()
scaled_feats = standard_scale_with_sklearn(x_train_raw_feats)
assert scaled_feats.shape == x_train_raw_feats.shape
assert np.isclose(scaled_feats[:2], np.array([[-0.41425879, -0.50512499, -1.29214218, -0.28154625, -0.85108479,
0.14526384, -0.365584 , 1.08162833, -0.74617905, -1.11279004,
0.18727079, 0.39651419, -1.01531611],
[-0.40200818, -0.50512499, -0.16208345, -0.28154625, -0.08796708,
-0.20840082, 0.13394078, -0.48787608, -0.39846419, 0.15008778,
-0.21208981, 0.3870674 , -0.05366252]])).all()
Remarks on Question 7¶
Odds are you called a fit method and a transform method. This interface is uniform across sklearn and you'll see it a lot. Did you notice the similarities in design between our standard scaler class and sklearn's StandardScaler? How does sklearn's StandardScaler resolve the problem of scaling the test set?
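For reference, a short sketch of that uniform interface with sklearn's own StandardScaler, using the variables already defined in this notebook: the statistics are learned once on the training data and then reused on the test data.
# Illustrative sketch: fit the scaler on training features only, then reuse it on the test features.
from sklearn.preprocessing import StandardScaler
sk_scaler = StandardScaler().fit(x_train_raw_feats)  # learns per-feature mean and std
x_train_sk = sk_scaler.transform(x_train_raw_feats)
x_test_sk = sk_scaler.transform(x_test_raw_feats)    # no refitting on the test set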
Question 9¶
(2 points)
As a quick sanity check, verify that each scaled feature indeed has mean 0 and unit variance, or at least values very close to 0 and 1.
def mean_and_std_of_each_feature(x_scaled):
''' args: x_scaled -> ndarray of shape (m,n)
return: tuple (mean, std), each of shape (n,)
'''
# YOUR CODE HERE
raise NotImplementedError()
x_scaled_feats = standard_scale_with_sklearn(x_train_raw_feats)
assert np.isclose(mean_and_std_of_each_feature(x_scaled_feats)[0], np.array([-1.26232985e-16, -4.82978378e-17, 3.72473552e-15, -6.68015549e-17,
-5.44322904e-15, -1.59406386e-15, -6.96241558e-17, -2.24459497e-15,
-8.18554264e-17, -1.89035855e-16, 1.72849807e-14, 8.11654573e-15,
-7.53320821e-16])).all()
assert np.isclose(mean_and_std_of_each_feature(x_scaled_feats)[1], np.array(([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))).all()
x_train_scaled_feats = standard_scale_with_sklearn(x_train_raw_feats)
Train SVR on scaled input¶
svr_model_for_scaled_input = fit_model(x_train_scaled_feats, y_train_raw_feats)
Compute score on scaled test set¶
## test features scaled with the same scaling function
x_test_scaled = standard_scale_with_sklearn(x_test_raw_feats)
svr_model_for_scaled_input.score(x_test_scaled, y_test_raw_feats)
You should see a considerably higher score of ~0.65.
Question 10¶
(5 points)
Curse of Dimensionality¶
Extract Best Features¶
Extract the best $k$ features, where $k < n$ and $n$ is the original number of features (i.e. the number of features in x_train), such that the score improves over 0.6516 (the score will improve only very slightly in this case, though). For this question, import the relevant sklearn function in the cell below. We've intentionally not imported it for you at the top.
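One common approach in sklearn (a sketch under the assumption that you use univariate feature selection; other selectors are equally valid, and k=8 is purely illustrative):
# Illustrative sketch: keep the k columns with the highest univariate f_regression scores.
from sklearn.feature_selection import SelectKBest, f_regression
toy_selector = SelectKBest(f_regression, k=8).fit(x_train_scaled_feats, y_train_raw_feats)
x_train_small = toy_selector.transform(x_train_scaled_feats)  # shape (n_train_samples, 8)
x_test_small = toy_selector.transform(x_test_scaled)          # the same 8 columns are kept for the test set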
# Import necessary sklearn packages here.
# YOUR CODE HERE
raise NotImplementedError()
def select_k_best_feats(_data_dict, _K):
'''args: _data_dict -> dict with the following keys: x_train_scaled_feats,
y_train_raw_feats, x_test_scaled, y_test_raw_feats
_K -> int (number of best features to keep)
return: tuple -> (score, k) -> (float, int)
where k is the number of best features.
'''
# unpack train and test data from _data_dict
x_train_scaled_feats, y_train_raw_feats, x_test_scaled, y_test_raw_feats = _data_dict['x_train_scaled_feats'],\
_data_dict['y_train_raw_feats'], _data_dict['x_test_scaled'], _data_dict['y_test_raw_feats'],
# truncated features of x_test (currently set to None; assigned and used at the end)
x_test_truncated = None
# Try different values of k and report which one gives the highest score.
##########################
# Once you determine it, overwrite the value of '_K' below with the one which gives the highest score, e.g. replace _K by 8.
_K = _K
##########################
# YOUR CODE HERE
raise NotImplementedError()
svr_model_for_k_best_feats = svr_model_for_scaled_input.fit(x_train_truncated, y_train_raw_feats)
# score on x_test_truncated and y_test_raw_feats
_score = svr_model_for_k_best_feats.score(x_test_truncated, y_test_raw_feats)
return _score, _K
# Use this cell for your scratch work. For example, you can call
# the function above inside a for loop to print scores and the corresponding values of k, thereby noting the best k.
_data_dict = {'x_train_scaled_feats' : x_train_scaled_feats, 'y_train_raw_feats' : y_train_raw_feats, \
'x_test_scaled' : x_test_scaled, 'y_test_raw_feats' : y_test_raw_feats}
assert len(select_k_best_feats(_data_dict, 4)) == 2
assert isinstance(select_k_best_feats(_data_dict, 4)[1], int)
assert isinstance(select_k_best_feats(_data_dict, 4)[0], float)
# The tests above are superficial. Other tests are hidden as in every other question.
We used fewer features, reduced computational expense, and got a higher score. Win-win.