Test 2 Solution
Feel free to go online. In fact, we encourage you to read the documentation where needed. However, you may not collaborate with anybody. To certify that you didn't collaborate with anyone, write 'Nobody' in 'collaborators' above.
# Set up library imports. These imports also give you pointers on how to approach a question.
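# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this notebook needs an older release.
from sklearn.datasets import load_boston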
import pandas as pd
from sklearn.svm import SVR
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import numpy as np
def minimum_features_to_classify_dataset():
    '''
    arg: None
    return: list (containing strings)
    Features are called x1, x2, x1^2, x2^2, x1x2. Return your answer as a list of strings.
    For example, if your answer is x1 and x2, return ['x1', 'x2']
    '''
    ### BEGIN SOLUTION
    return ['x1x2']
    ### END SOLUTION
## This test is only checking whether your list contains a string. Other tests are hidden
assert isinstance(minimum_features_to_classify_dataset()[0], str)
### BEGIN HIDDEN TESTS
assert minimum_features_to_classify_dataset() == ['x1x2']
### END HIDDEN TESTS
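The dataset for this question is not shown in this sheet. As a rough illustration only, assume it is XOR-like: points are labelled by whether x1 and x2 share a sign. Under that assumption (the data, labels and variable names below are all hypothetical), the single product feature x1x2 is enough to separate the classes, while x1 on its own is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical XOR-style data (an assumption, not the graded dataset):
# label 1 when x1 and x2 share a sign, label 0 otherwise.
x = rng.uniform(-1.0, 1.0, size=(200, 2))
x1, x2 = x[:, 0], x[:, 1]
y = (x1 * x2 > 0).astype(int)

# Thresholding x1 alone at zero (in either direction) stays close to chance.
acc_x1 = max(np.mean((x1 > 0).astype(int) == y),
             np.mean((x1 <= 0).astype(int) == y))

# Thresholding the single engineered feature x1x2 at zero recovers every label.
acc_x1x2 = np.mean(((x1 * x2) > 0).astype(int) == y)

print(acc_x1, acc_x1x2)
```

The perfect accuracy for x1x2 follows directly from how the toy labels were built, so this only shows why a product feature suits data of this shape.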
def get_boston_data():
    '''
    args: None
    return: dict containing boston data.
    '''
    all_data_boston = load_boston()
    return all_data_boston
all_data_boston = get_boston_data()
assert list(all_data_boston.keys()) == ['data', 'target', 'feature_names', 'DESCR', 'filename']
### BEGIN HIDDEN TESTS
assert all_data_boston.data.shape == (506, 13)
### END HIDDEN TESTS
Raw Features
Data as is.
raw_features = get_boston_data().data # ndarray of shape (506, 13)
raw_features.shape
(506, 13)
Question 3
(2 points)
Store data into a pandas dataframe with correct column names.
Hint: all_data_boston
def get_df(all_data_boston):
    '''
    args: all_data_boston -> dict-like object containing the boston dataset (as returned by get_boston_data)
    return: pd dataframe
    Get the data into a dataframe whose columns are the feature names.
    '''
    ### BEGIN SOLUTION
    data = all_data_boston.data
    _columns = all_data_boston.feature_names
    return pd.DataFrame(data, columns=_columns)
    ### END SOLUTION
assert list(get_df(all_data_boston).columns) == ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
                                                 'PTRATIO', 'B', 'LSTAT']
### BEGIN HIDDEN TESTS
assert get_df(all_data_boston).values.shape == (506, 13)
### END HIDDEN TESTS
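As a small sketch of the same idea on toy data, the block below builds a dataframe from a 2-D array plus a matching array of column names. The `toy` object is a hypothetical stand-in for what get_boston_data() returns, and its values are made up.

```python
import numpy as np
import pandas as pd
from types import SimpleNamespace

# Hypothetical stand-in for the loaded dataset: a `data` array of shape (m, n)
# and `feature_names` of length n. The numbers are made up for illustration.
toy = SimpleNamespace(
    data=np.array([[0.1, 18.0, 6.5],
                   [0.2, 0.0, 6.4]]),
    feature_names=np.array(["CRIM", "ZN", "RM"]),
)

# One column name per column of the array; rows keep their order.
toy_df = pd.DataFrame(toy.data, columns=toy.feature_names)
print(toy_df)
```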
df_head = get_df(all_data_boston).head()
df_head
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
What does each column name mean?
Answer
labels = all_data_boston.target
Labels
MEDV Median value of owner-occupied homes in $1000’s
# First 10 labels
labels[:10]
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])
Question 4
(2 points)
(A question on Sklearn)
We'll observe the performance of our ML model on a chunk of data which it has not previously seen. To this end, divide the raw features into two parts: call one the training set and the other the test set. We'll see more of the train-test split in the future as well. Read the documentation on this sklearn function for more information.
def split(raw_features, labels, _test_size, _random_state=42):
    '''
    args: raw_features -> numpy.ndarray
          labels -> numpy.ndarray
          _test_size -> float (between 0 and 1) or int (absolute number of test samples)
          _random_state -> int (to reproduce the same results across multiple runs)
    return:
          X_train -> numpy.ndarray
          X_test -> numpy.ndarray
          y_train -> numpy.ndarray
          y_test -> numpy.ndarray
    '''
    ### BEGIN SOLUTION
    X_train, X_test, y_train, y_test = train_test_split(raw_features, labels,
                                                        test_size=_test_size, random_state=_random_state)
    return X_train, X_test, y_train, y_test
    ### END SOLUTION
x_train_raw_feats, x_test_raw_feats, y_train_raw_feats, y_test_raw_feats = split(\
raw_features, labels, _test_size=.3)
assert x_train_raw_feats.shape == (354, 13)
assert x_test_raw_feats.shape == (152, 13)
assert y_train_raw_feats.shape[0] == x_train_raw_feats.shape[0]
assert y_test_raw_feats.shape[0] == x_test_raw_feats.shape[0]
### BEGIN HIDDEN TESTS
x_train_raw_feats, x_test_raw_feats, y_train_raw_feats, y_test_raw_feats = split(\
raw_features, labels, _test_size=1)
assert x_train_raw_feats.shape == (505, 13)
assert x_test_raw_feats.shape == (1, 13)
assert y_train_raw_feats.shape[0] == x_train_raw_feats.shape[0]
assert y_test_raw_feats.shape[0] == x_test_raw_feats.shape[0]
### END HIDDEN TESTS
x_train_raw_feats, x_test_raw_feats, y_train_raw_feats, y_test_raw_feats = split(\
raw_features, labels, _test_size=.3)
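The calls above pass `_test_size` once as a fraction (.3) and once as the integer 1. A minimal sketch on made-up arrays shows how train_test_split treats the two cases, and that a fixed random_state reproduces the split. The toy arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X is [2*i, 2*i + 1] and its label is i,
# so it is easy to check that features and labels stay aligned.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# A float test_size is a fraction of the samples: 0.25 * 20 = 5 test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# An int test_size is an absolute number of test samples.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=1, random_state=42)

# The same random_state gives the same split again.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.25, random_state=42)

print(X_tr.shape, X_te.shape, X_tr1.shape, X_te1.shape)
```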
Importance of Feature engineering.
We'll use an algorithm called support vector regression (SVR). For now, you may assume SVR is just another ML algorithm. You'll observe how SVR performs better if features are engineered well.
Question 5
(2 points)
(A question on Sklearn)
As mentioned above, you don't need to know anything about SVR to answer this question. Like question 4, this question tests how well you can read documentation (of sklearn in this case).
In this question you'll return an object of type sklearn.svm._classes.SVR that is fitted (trained) on x_train (raw features) and y_train (labels). We can use the SVR returned by this function to make predictions.
You may also find the documentation on the fit method helpful.
def fit_model(x_train, y_train):
    '''
    args: x_train-> ndarray (m, n)
          y_train-> ndarray (m, )
    return: a fitted sklearn object which can predict.
    For simplicity, use the default values of the sklearn function/class you use.
    '''
    regr_svr = SVR()
    return regr_svr.fit(x_train, y_train)
Fit
Fitting on unscaled x_train_raw_feats and y_train_raw_feats by calling the function you implemented above.
svr_model = fit_model(x_train_raw_feats, y_train_raw_feats)
assert str(type(svr_model)) == "<class 'sklearn.svm._classes.SVR'>"
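To see how well the SVR performed on the unseen chunk of data (x_test_raw_feats, y_test_raw_feats), we'll use the coefficient of determination ($R^2$), which sklearn calls score, as a measure. Its best possible value is 1, and it can be negative for a model that does worse than always predicting the mean of the labels. The higher the score, the better the performance.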
svr_model.score(x_test_raw_feats, y_test_raw_feats)
0.28195563021839387
You should see a score of ~0.28. Let's try to improve this score.
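For reference, the coefficient of determination that a regressor's score method reports is $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it with numpy on made-up numbers. The helper name `r_squared` is hypothetical, not an sklearn function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean label
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)             # 1.0: every prediction is exact
baseline = r_squared(y_true, np.full(4, 2.5))   # 0.0: always predict the mean label
worse = r_squared(y_true, y_true[::-1])         # -3.0: worse than predicting the mean

print(perfect, baseline, worse)
```

A score of ~0.28 therefore means the model is only modestly better than always predicting the mean label.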
Question 6
(5 points)
Zero mean and unit variance scaling using numpy only.
Given an x_train (raw features in an ndarray), you'll standard scale it so that each feature (column) in x_train has mean 0 and variance 1. Specifically, for each feature, subtract its mean and divide by its standard deviation. Note that you may only use numpy functions for this one. You may not use any sklearn functions. Fill in the two functions below in the class.
class _Standard_Scaler(object):
    def __init__(self, x):
        self.mean_each_feat = 0.0  # mean of each feature (ndarray once computed)
        self.std_dev_each_feat = 0.0  # std dev of each feature (ndarray once computed)
        self.x = x

    def compute_mean_and_std_dev(self):
        '''
        args: None (uses self.x -> ndarray containing raw features, of shape (m,n))
        return: None (compute mean and standard deviation of each feature and
        store them in self.mean_each_feat and self.std_dev_each_feat respectively)
        You may only use numpy functions.
        '''
        ### BEGIN SOLUTION
        self.mean_each_feat = self.x.mean(axis=0)  # mean of each feature
        self.std_dev_each_feat = self.x.std(axis=0)  # std dev of each feature
        ### END SOLUTION

    def get_params(self):
        return self.mean_each_feat, self.std_dev_each_feat

    def get_scaled_input(self, _mean=None, _std=None):
        '''
        args: _mean, _std -> optional ndarrays of shape (n,); default to the values stored on this object
        return: scaled x such that each feature has 0 mean and unit var (also unit std)
        '''
        # Use `is None`: `== None` on an ndarray compares elementwise,
        # and using that result in an `if` raises a ValueError.
        if _mean is None:
            _mean = self.mean_each_feat
        if _std is None:
            _std = self.std_dev_each_feat
        ### BEGIN SOLUTION
        scaled_input = (self.x - _mean) / _std
        return scaled_input
        ### END SOLUTION
_scaler = _Standard_Scaler(x_train_raw_feats)
_scaler.compute_mean_and_std_dev()
_scaler.get_scaled_input()
assert _scaler.get_scaled_input().shape == x_train_raw_feats.shape
scaled_feats = _scaler.get_scaled_input()
assert np.isclose(scaled_feats[:2], np.array([[-0.41425879, -0.50512499, -1.29214218, -0.28154625, -0.85108479,
0.14526384, -0.365584 , 1.08162833, -0.74617905, -1.11279004,
0.18727079, 0.39651419, -1.01531611],
[-0.40200818, -0.50512499, -0.16208345, -0.28154625, -0.08796708,
-0.20840082, 0.13394078, -0.48787608, -0.39846419, 0.15008778,
-0.21208981, 0.3870674 , -0.05366252]])).all()
### BEGIN HIDDEN TESTS
assert np.isclose(scaled_feats[-2:], np.array([[ 0.92611293, -0.50512499, 1.00549958, -0.28154625, 1.56688368,
0.42234757, 0.93390438, -0.77303498, 1.68782492, 1.5572945 ,
0.8528718 , -2.87841346, 1.52750437],
[-0.39030549, -0.50512499, -0.37135358, -0.28154625, -0.3194747 ,
0.11045432, 0.60088786, -0.49512987, -0.51436915, -0.13857001,
1.16348561, -3.32828832, -0.25218837]])).all()
### END HIDDEN TESTS
_scaler = _Standard_Scaler(x_train_raw_feats)
_scaler.compute_mean_and_std_dev()
scaled_feats_x_train = _scaler.get_scaled_input()
_scaler.get_params()
(array([3.46988686e+00, 1.14039548e+01, 1.11330508e+01, 7.34463277e-02, 5.57259322e-01, 6.32567232e+00, 6.87997175e+01, 3.76587401e+00, 9.43785311e+00, 4.07042373e+02, 1.82779661e+01, 3.59701808e+02, 1.24211299e+01]), array([8.30407703e+00, 2.25765011e+01, 6.92884344e+00, 2.60867715e-01, 1.16626831e-01, 7.18194456e-01, 2.76262572e+01, 2.12302684e+00, 8.62775916e+00, 1.66286870e+02, 2.25360235e+00, 8.68019175e+01, 7.10234960e+00]))
Ungraded Question 7
Can you see the importance of creating a class to implement a _Standard_Scaler?
Answer
To scale features in the test set we'd like to use the mean and std of the train set, since we assume the train and test sets are drawn from the same distribution. Not computing the mean and std on the test set is especially useful if your test set is very small or if you want to predict the label of only one datapoint, for example. We can simply call 'get_params()' to get the train mean and std and use them to scale the test set as well.
_scaler.get_params()
(array([3.46988686e+00, 1.14039548e+01, 1.11330508e+01, 7.34463277e-02,
5.57259322e-01, 6.32567232e+00, 6.87997175e+01, 3.76587401e+00,
9.43785311e+00, 4.07042373e+02, 1.82779661e+01, 3.59701808e+02,
1.24211299e+01]),
array([8.30407703e+00, 2.25765011e+01, 6.92884344e+00, 2.60867715e-01,
1.16626831e-01, 7.18194456e-01, 2.76262572e+01, 2.12302684e+00,
8.62775916e+00, 1.66286870e+02, 2.25360235e+00, 8.68019175e+01,
7.10234960e+00]))
To this end, also think about the other design decisions we made while writing this class, especially the arguments of get_scaled_input(self, _mean=None, _std=None).
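A minimal numpy-only sketch of the point made in the answer: the statistics come from the training data and are reused for a single unseen datapoint. The arrays are synthetic and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a training set and one unseen datapoint.
x_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
x_one = rng.normal(loc=5.0, scale=2.0, size=(1, 3))

# Statistics are computed on the training set only.
train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0)

x_train_scaled = (x_train - train_mean) / train_std
x_one_scaled = (x_one - train_mean) / train_std   # reuse the train statistics

# A single row has zero standard deviation per feature, so scaling it with
# its own statistics would divide by zero.
own_std = x_one.std(axis=0)

print(x_one_scaled, own_std)
```

This is why get_scaled_input accepts `_mean` and `_std`: a scaler that holds the test features can be handed the parameters computed on the training set.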
def standard_scale_with_sklearn(x):
    '''
    args: x -> ndarray (here, containing raw features) of shape (m,n)
    return: standard scaled x of shape (m,n)
    '''
    ### BEGIN SOLUTION
    scaler = preprocessing.StandardScaler()
    scaler.fit(x)
    x_scaled = scaler.transform(x)
    return x_scaled
    ### END SOLUTION
assert standard_scale_with_sklearn(x_train_raw_feats).shape == x_train_raw_feats.shape
scaled_feats = standard_scale_with_sklearn(x_train_raw_feats)
assert np.isclose(scaled_feats[:2], np.array([[-0.41425879, -0.50512499, -1.29214218, -0.28154625, -0.85108479,
0.14526384, -0.365584 , 1.08162833, -0.74617905, -1.11279004,
0.18727079, 0.39651419, -1.01531611],
[-0.40200818, -0.50512499, -0.16208345, -0.28154625, -0.08796708,
-0.20840082, 0.13394078, -0.48787608, -0.39846419, 0.15008778,
-0.21208981, 0.3870674 , -0.05366252]])).all()
### BEGIN HIDDEN TESTS
assert np.isclose(scaled_feats[-2:], np.array([[ 0.92611293, -0.50512499, 1.00549958, -0.28154625, 1.56688368,
0.42234757, 0.93390438, -0.77303498, 1.68782492, 1.5572945 ,
0.8528718 , -2.87841346, 1.52750437],
[-0.39030549, -0.50512499, -0.37135358, -0.28154625, -0.3194747 ,
0.11045432, 0.60088786, -0.49512987, -0.51436915, -0.13857001,
1.16348561, -3.32828832, -0.25218837]])).all()
### END HIDDEN TESTS
Remarks on Question 7
Odds are you called a fit method and a transform method. This interface is uniform across sklearn and you'll see it a lot. Did you notice the similarities in design between our standard scaler class and sklearn's standard scaler? How does sklearn's standard scaler resolve the problem of scaling the test set?
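One way to see the answer to the last question: a fitted StandardScaler keeps the training statistics in its `mean_` and `scale_` attributes, and transform applies them to whatever array it is given. The sketch below uses synthetic arrays, and the variable names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-ins for train and test features.
x_train = rng.normal(10.0, 3.0, size=(50, 4))
x_test = rng.normal(10.0, 3.0, size=(5, 4))

scaler = StandardScaler()
scaler.fit(x_train)                       # learns mean_ and scale_ from the training set
x_test_scaled = scaler.transform(x_test)  # applies the stored training statistics

# Same result as scaling by hand with the training mean and (population) std.
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

print(np.allclose(x_test_scaled, manual))
```

Separating fit from transform plays the same role as passing `_mean` and `_std` into get_scaled_input in our own class.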
Question 9
(2 points)
As a quick check, verify if each scaled feature indeed has 0 mean and unit variance- or at least very close to 0 and 1.
def mean_and_std_of_each_feature(x_scaled):
    ''' args: x_scaled-> ndarray of shape (m,n)
        return: tuple (mean, std), each of shape (n,)
    '''
    ### BEGIN SOLUTION
    return x_scaled.mean(axis=0), x_scaled.std(axis=0)
    ### END SOLUTION
x_scaled_feats = standard_scale_with_sklearn(x_train_raw_feats)
assert np.isclose(mean_and_std_of_each_feature(x_scaled_feats)[0], np.array([-1.26232985e-16, -4.82978378e-17, 3.72473552e-15, -6.68015549e-17,
-5.44322904e-15, -1.59406386e-15, -6.96241558e-17, -2.24459497e-15,
-8.18554264e-17, -1.89035855e-16, 1.72849807e-14, 8.11654573e-15,
-7.53320821e-16])).all()
assert np.isclose(mean_and_std_of_each_feature(x_scaled_feats)[1], np.array(([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))).all()
x_train_scaled_feats = standard_scale_with_sklearn(x_train_raw_feats)
Train SVR on scaled input
svr_model_for_scaled_input = fit_model(x_train_scaled_feats, y_train_raw_feats)
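Compute score on scaled test set
## Note: standard_scale_with_sklearn fits a new StandardScaler on whatever it is given,
## so the test features here are scaled with the test set's own mean and std, not the training set's.
x_test_scaled = standard_scale_with_sklearn(x_test_raw_feats)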
svr_model_for_scaled_input.score(x_test_scaled, y_test_raw_feats)
0.6512492711851576
You should see a considerably higher score of ~0.65.
Question 10
(5 points)
Curse of Dimensionality
Extract Best Features
Extract the best k features, where $k < n$ and n is the previous number of features (i.e. the number of features in x_train), such that the score improves on the 0.6512 obtained above (the improvement is small in this case). For this question, import the relevant sklearn function in the cell below. We've intentionally not imported it for you at the top.
# import Necessary sklearn packages here.
### BEGIN SOLUTION
from sklearn.feature_selection import SelectKBest, f_regression
### END SOLUTION
def select_k_best_feats(_data_dict, _K):
    '''args: _data_dict-> dict -> with following keys: x_train_scaled_feats,
    y_train_raw_feats, x_test_scaled, y_test_raw_feats
    _K -> int -> number of features to keep
    return: tuple -> (score, k) -> (float, int)
    where k is the number of best features.
    '''
    # unpack train and test data from _data_dict
    x_train_scaled_feats = _data_dict['x_train_scaled_feats']
    y_train_raw_feats = _data_dict['y_train_raw_feats']
    x_test_scaled = _data_dict['x_test_scaled']
    y_test_raw_feats = _data_dict['y_test_raw_feats']
    # truncated features of x_test, currently set to None. (Used at the end)
    x_test_trucated = None
    # Try different values of k and report which gives the highest score.
    ##########################
    # Once you determine it, overwrite the value of '_K' below with the one
    # which gives the highest score, e.g. replace _K by 8
    _K = _K
    ##########################
    ### BEGIN SOLUTION
    extractor = SelectKBest(f_regression, k=_K)
    x_train_trucated = extractor.fit_transform(x_train_scaled_feats, y_train_raw_feats)
    x_test_trucated = extractor.transform(x_test_scaled)
    ### END SOLUTION
    # Fit a fresh default SVR so the global svr_model_for_scaled_input is not refit in place.
    svr_model_for_k_best_feats = fit_model(x_train_trucated, y_train_raw_feats)
    # score on x_test_trucated and y_test_raw_feats
    _score = svr_model_for_k_best_feats.score(x_test_trucated, y_test_raw_feats)
    return _score, _K
# Use this cell to do your scratch work. For example, you can call
# the above function inside a for loop to print scores and corresponding values of k- thereby note the best k.
_data_dict = {'x_train_scaled_feats' : x_train_scaled_feats, 'y_train_raw_feats' : y_train_raw_feats, \
'x_test_scaled' : x_test_scaled, 'y_test_raw_feats' : y_test_raw_feats}
assert len(select_k_best_feats(_data_dict, 4)) == 2
assert isinstance(select_k_best_feats(_data_dict, 4)[1], int)
assert isinstance(select_k_best_feats(_data_dict, 4)[0], float)
# The tests above are superficial. Other tests are hidden as in every other question.
### BEGIN HIDDEN TESTS
assert select_k_best_feats(_data_dict, 4)[1] == 7
assert np.isclose(select_k_best_feats(_data_dict, 6)[0], 0.7016995059984922)
### END HIDDEN TESTS
We used fewer features, which reduces computational expense, and got a higher score. Win-win.
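The scratch-work loop suggested above can be sketched as follows. The block uses synthetic data from make_regression, not the Boston data, so the scores it prints say nothing about the values expected in this test. All variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem: 8 features, only 3 of them informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale with statistics learned from the training set only.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

scores = {}
for k in range(1, X.shape[1] + 1):
    selector = SelectKBest(f_regression, k=k)
    X_tr_k = selector.fit_transform(X_tr_s, y_tr)   # pick features using train data
    X_te_k = selector.transform(X_te_s)             # keep the same columns in test
    scores[k] = SVR().fit(X_tr_k, y_tr).score(X_te_k, y_te)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Choosing k by the test score, as this exercise does, is fine for exploration. A separate validation split is the usual choice when the test score must stay an unbiased estimate.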