Introduction
Introduction to Tools¶
This tutorial covers setting things up and getting hands-on experience with the tools you'll use for the rest of the semester. Programmers are more productive when they use tools they are comfortable with.
Therefore, it's indispensable to get as much practice with these tools now as possible, especially if you haven't used them before. In this course, you'll use:
A Note:¶
What you'll learn in a tutorial will be reinforced in a test. Think of the tests as another opportunity to learn. Each test is autograded for you.
Python¶
Let's write some Python code in our notebook. This tutorial is not an exhaustive treatment of Python.
print('hello world')
hello world
Python Code is Readable/Succinct¶
Python is a high-level, dynamically typed, multiparadigm programming language. Python code is often said to be almost like pseudocode, since it lets you express very powerful ideas in very few lines of code while staying very readable. For example, consider the three implementations of summing an array in the following cell:
# Find sum of elements in an array. Non-recursive implementation.
def find_sum_of_array_1(array):
    _sum = 0
    for elm in array:
        _sum = elm + _sum
    return _sum
# Find sum of elements in an array. Recursive implementation (assumes a non-empty array).
def find_sum_of_array_2(array):
    if len(array) == 1:
        return array[0]
    else:
        return array[0] + find_sum_of_array_2(array[1:]) # notice array slicing
# sum is also available as a built-in function in Python.
def find_sum_of_array_3(array):
    return sum(array)
"All 3 variants are relatively readable and concise"
'''
The following is another illustration of brevity of Python.
'''
# Check if x is in array
def has_elm(array, x):
    return x in array
print('sum is:', find_sum_of_array_3([4,2,1]))
print(has_elm([4,2,1], 1))
sum is: 7
True
Also, notice code is commented in 3 different ways in the cell above.
Basic data types¶
Python has a number of basic types including integers, floats, booleans, and strings. These data types behave in ways that are familiar from other programming languages.
x = 5 # variable definition. No type declaration, unlike C for example.
y = 3
z = x + y # addition
f = 3.0
print(z) # 8 -> sum of two vars
print('type of x is', type(x)) # int
print('type of f is', type(f)) # float
print('product =', x*z) # 40 -> multiplication of two vars
print('cube =', x ** 3 ) # 125 -> x cubed
print('3/2 =', 3 / 2) # 1.5 -> true division. Different from C.
print('3//2 =', 3 // 2) # 1 -> floor (integer) division.
print(True and False) # bool and op
8
type of x is <class 'int'>
type of f is <class 'float'>
product = 40
cube = 125
3/2 = 1.5
3//2 = 1
False
Strings¶
Python provides a lot of builtin functionality for working with strings.
str1 = 'machine'
str2 = 'learning'
spc = " " # ' ' and " " are both fine around a string.
course_str = str1 + spc + str2 # concatenation
course_str
'machine learning'
How would you capitalize the first letter of a string?
str2.capitalize()
'Learning'
The documentation lists many other useful string functions.
Exercise¶
As an exercise, given a string s in which words are separated by spaces, return a list of its words with the first letter of each word capitalized.
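To give a flavour of those other string functions, here is a small sketch using a few methods of the built-in str type. The sample string and variable names are only illustrative and are not part of the tutorial.

```python
# A few built-in string methods applied to a sample string.
s = 'machine learning'

upper = s.upper()                         # all characters upper-cased
words = s.split()                         # split on whitespace into a list
joined = '-'.join(words)                  # join list items with a separator
replaced = s.replace('machine', 'deep')   # substitute a substring
starts = s.startswith('mach')             # prefix test, returns a bool
count_n = s.count('n')                    # number of occurrences of 'n'

print(upper, words, joined, replaced, starts, count_n)
```

Strings are immutable, so each of these methods returns a new value and leaves s unchanged.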
def cap_first_word(s):
    ### BEGIN SOLUTION
    return [elm.capitalize() for elm in s.split()]
    ### END SOLUTION
assert cap_first_word(course_str) == ['Machine', 'Learning']
# a list can be constructed in the following ways:
l = [] # an empty list
print(type(l), l)
l = list() # a type constructor
print(type(l), l)
_iterable = (4,2,4) # define an iterable (here a tuple)
l = list(_iterable) # list construction from an iterable
print(type(l), l)
l = [4, 5, 4] # using square brackets around comma separated elms
print(type(l), l)
l = [x for x in range(1,5)] # using list comprehension.
print(type(l), l)
<class 'list'> []
<class 'list'> []
<class 'list'> [4, 2, 4]
<class 'list'> [4, 5, 4]
<class 'list'> [1, 2, 3, 4]
- More on range
- More on List Comprehension
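As a quick illustration of range and list comprehensions, the sketch below shows the step argument of range and a comprehension with a filter and with two loops. The variable names are only illustrative.

```python
# range(start, stop, step): stop is exclusive, step may be negative.
evens = list(range(0, 10, 2))        # 0, 2, 4, 6, 8
countdown = list(range(5, 0, -1))    # 5, 4, 3, 2, 1

# A comprehension can transform and filter in one expression.
squares = [x ** 2 for x in range(1, 6)]
odd_squares = [x ** 2 for x in range(1, 6) if x % 2 == 1]

# Two for-clauses behave like nested loops (outer loop first).
pairs = [(x, y) for x in range(2) for y in range(2)]

print(evens, countdown, squares, odd_squares, pairs)
```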
As an exercise, square each element of list x, cube each element of list y, and return the total sum of all the results. Do this using list comprehensions in one line of Python code.
def total_sum(x, y):
    '''
    args: x a list,
          y a list
    return: int
    '''
    ### BEGIN SOLUTION
    return sum([elmx**2 for elmx in x]) + sum([elmy**3 for elmy in y])
    ### END SOLUTION
assert total_sum([5,3,1], [4,5,2]) == 232
assert total_sum([0], [0]) == 0
assert total_sum([1110], [0]) == 1110**2
# that's your hash table in one line with keys one, two, three and vals 1, 2, 3
a = dict(one=1, two=2, three=3)
# the following are other ways to construct the same dict
b = {'one': 1, 'two': 2, 'three': 3}
c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d = dict([('two', 2), ('one', 1), ('three', 3)])
e = dict({'three': 3, 'one': 1, 'two': 2})
a == b == c == d == e
print(e['one']) # key is one, prints val of 1
print(b['three']) # key is three, prints val of 3
1
3
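Beyond construction and lookup, a handful of dict operations come up constantly. The following is a minimal sketch; the dict counts and the other names are illustrative only.

```python
counts = {'one': 1, 'two': 2, 'three': 3}

missing = counts.get('four', 0)   # .get returns a default instead of raising KeyError
counts['four'] = 4                # insert (or overwrite) a key
has_two = 'two' in counts         # membership test checks keys

# Iterate over key/value pairs.
total = 0
for key, val in counts.items():
    total += val

# Dict comprehension: build a new dict from an existing one.
doubled = {key: val * 2 for key, val in counts.items()}

print(missing, has_two, total, doubled)
```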
Exercise¶
You have $n$ friends; for simplicity, their names are the integers $1, 2, \dots, n$. Your friend named $i$ has $i^2$ candies and $(n-i)^2$ chocolates, $\forall i \in \{1, \dots, n\}$. Construct a mapping that takes a friend's name and returns the pair (candies, chocolates) he/she has. Do this in one line of code (in a Pythonic way).
def create_mapping(n):
    ''' args: int
    return: dict
    '''
    ### BEGIN SOLUTION
    return {i: (i**2, (n-i)**2) for i in range(1, n+1)}
    ### END SOLUTION
assert create_mapping(5) == {1: (1, 16), 2: (4, 9), 3: (9, 4), 4: (16, 1), 5: (25, 0)}
assert create_mapping(15) == {1: (1, 196),
2: (4, 169),
3: (9, 144),
4: (16, 121),
5: (25, 100),
6: (36, 81),
7: (49, 64),
8: (64, 49),
9: (81, 36),
10: (100, 25),
11: (121, 16),
12: (144, 9),
13: (169, 4),
14: (196, 1),
15: (225, 0)}
assert create_mapping(1) == {1: (1, 0)}
assert create_mapping(0) == {}
Functions¶
Functions are defined using the def keyword. Look at the function below, which doubles the input x.
def double_it(x):
    '''
    args: int
    return: int
    '''
    return x*2
Let's define a couple more functions.
# swap values of args. x gets val of y, and y gets val of x.
def swap(x, y):
    tmp = x
    x = y
    y = tmp
    return x, y # note we are returning more than one var (as a tuple).
It turns out the function above is not a truly Pythonic way to swap variables. We've been stressing the Pythonic way of coding. When programmers come to Python from C, for example, the habits and style they developed in C naturally appear in their Python code. For example, observe the following two snippets, both of which print the elements of an array.
import numpy as np
array = np.random.randn(5) # random array
# way 1 -> C like code in Python.
for i in range(len(array)):
    print(array[i])
# way 2 -> Python like
for elm in array:
    print(elm)
Both print the elements of an array, yet a Python novice who is an expert in C might print using way 1. Similarly, the swap above is not the best way to swap variables. Swap two variables in a more Pythonic way, in one line of code.
def swap(x, y):
    ### Begin Solution
    y, x = x, y
    ### End Solution
    return x, y
assert swap(4,3) == (3,4)
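In the same spirit, a few other idioms often replace index-based loops. This is a minimal sketch; the lists names and values are made up for illustration.

```python
names = ['x', 'y', 'z']
values = [4, 3, 7]

# enumerate yields (index, element) pairs, so no range(len(...)) is needed.
indexed = [(i, v) for i, v in enumerate(values)]

# zip walks several sequences in lockstep.
paired = dict(zip(names, values))

# Tuple unpacking swaps without a temporary variable.
a, b = 4, 3
a, b = b, a

print(indexed, paired, a, b)
```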
Anonymous functions¶
You'll occasionally see anonymous functions (lambda functions).
An anonymous function is a function declared with no name. Although they look different syntactically, lambda functions behave in the same way as regular functions declared using the def keyword. The body of a lambda function is a single expression.
For example, double_it() can be written as:
double_it = lambda x : x * 2
That's it.
Now, say you want to double every element of an array using double_it. You can use map in conjunction with lambda.
map(anonymous_func, array)
See the cell below for an example.
array = [1,2,3,4]
list(map((lambda x: x * 2), array)) # map returns an iterator. list(iterator) will give you a readable list
[2, 4, 6, 8]
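For comparison, the same doubling can be written as a list comprehension, and lambdas are also commonly passed to filter and as the key argument of sorted. This is only a sketch; the list words is made up for illustration.

```python
array = [1, 2, 3, 4]

doubled_map = list(map(lambda x: x * 2, array))      # map + lambda
doubled_comp = [x * 2 for x in array]                # equivalent comprehension

evens = list(filter(lambda x: x % 2 == 0, array))    # keep items where the lambda is truthy

words = ['pandas', 'numpy', 'sklearn']
by_length = sorted(words, key=lambda w: len(w))      # sort by a computed key

print(doubled_map == doubled_comp, evens, by_length)
```

Many Python programmers prefer the comprehension for simple cases because it reads left to right without a nested call.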
def _print1(array):
    ### BEGIN SOLUTION
    already_printed = []
    for elm in array:
        if elm not in already_printed:
            print(elm)
            already_printed.append(elm)
    ### END SOLUTION
def _print2(array):
    ### BEGIN SOLUTION
    distinct_arr = set(array)
    for elm in distinct_arr:
        print(elm)
    ### END SOLUTION
def _print3(array):
    ### BEGIN SOLUTION
    print(set(array))
    ### END SOLUTION
# test_arr = [4,2,4,4,3,5,4,3,464,3,42,4]
# _print1(test_arr) == _print2(test_arr) == _print3(test_arr) # not a good test
Classes¶
Classes are ubiquitous in this course. The following excerpt from Code Complete emphasizes writing high quality classes (in general):
In the twenty-first century, programmers think about programming in terms of classes.
A class is a collection of data and routines that share a cohesive, well-defined responsibility... A key to being an effective programmer is maximizing the portion of a program that you can safely ignore while working on any one section of code. Classes are the primary tool for accomplishing that objective.
Python syntax for class definition is the following:
class Welcome_to_ML(object):
    # Constructor
    def __init__(self, student_name):
        # collection of data
        self.name = student_name
    # collection of routines/methods
    def say_hello_to_student(self):
        print('hello', self.name, 'welcome to your Machine Learning course.')
wc_ml = Welcome_to_ML('Ertugrul') # instantiate an object called wc_ml
wc_ml.say_hello_to_student() # call a method on the object
hello Ertugrul welcome to your Machine Learning course.
Remark:¶
We'll end our tutorial on Python here. You may find this exhaustive official tutorial on Python helpful.
Numpy¶
Matrices, n-dimensional arrays, or tensors are everywhere in Machine Learning/Deep Learning. For example, a self-driving car can recognize a dog on the road and slow down. But essentially, what it recognizes is a pattern in an n-dim array (3 dims for color (RGB) images and 2 dims for grayscale images); more on that later in the course.
To this end, Numpy provides ndarray, a homogeneous n-dimensional array object, with methods to operate on it efficiently (a lot more efficiently than on a native Python list).
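To make "homogeneous" and "no explicit loop" concrete, here is a small sketch that computes the same squares with a Python list and with an ndarray. The sizes and names are arbitrary; it checks only that the results agree and makes no timing claim.

```python
import numpy as np

values = list(range(1000))
squares_list = [v * v for v in values]   # Python-level loop over list items

arr = np.arange(1000)
squares_arr = arr * arr                  # one vectorized operation on the whole array

print(squares_arr.tolist() == squares_list)

# Homogeneous: mixing ints and floats stores every element with one dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```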
import numpy¶
Numpy is already installed in your current virtual environment.
If you want to use a package or library in your code, you first need to make it accessible by importing it. In general:
import package_name as alias
Let's import numpy now:
import numpy as np
a = np.array([[1,2],[3,5]]) # instantiate a 2x2 array.
type(a)
print('shape of a', a.shape) # Shape is (2,2). Rank of this array is 2 (total number of dims)
# datatype of each elm in array a
print('Each elm in a has type', a.dtype)
# Array slicing in np. Similar to native list in Python.
# For each dimension, specify a slice.
print('at pos (0,0) is', a[0,0]) # elm at position 0 in 1st dim and position 0 in 2nd dim, i.e. 1
print('first col', a[:, 0]) # elms at any position in 1st dim and position 0 in 2nd dim, i.e. first col
print('first row', a[0,:]) # elms at any position in 2nd dim and position 0 in 1st dim, i.e. first row
shape of a (2, 2)
Each elm in a has type int64
at pos (0,0) is 1
first col [1 3]
first row [1 2]
# Let's create an array of all zeros
z = np.zeros((4,2))
print('shape', z.shape)
z = np.zeros((4,2,2)) # rank 3
print('shape', z.shape)
# similarly, an array of ones with dtype int16 (16-bit int)
o = np.ones((3,3), dtype=np.int16)
print('\n array of 1s: \n', o)
shape (4, 2)
shape (4, 2, 2)
 array of 1s:
 [[1 1 1]
 [1 1 1]
 [1 1 1]]
Other ways of Indexing¶
Exercise¶
Write your own examples to practice with Integer and boolean array indexing by reading the documentation linked above. Don't use the examples already there in the documentation.
## TO DO
# Write your examples here
# A sequence of numbers
np.arange(5,50, 5) # start val: 5, end val 50 (not inclusive), difference: 5
array([ 5, 10, 15, 20, 25, 30, 35, 40, 45])
def evenly_spaced_numbers(n, x, y):
    ### BEGIN SOLUTION
    return np.linspace(x, y, n)
    ### END SOLUTION
assert (evenly_spaced_numbers(3, 2, 3) == np.array([2. , 2.5, 3. ])).all()
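The two sequence builders used above differ in what you specify: np.arange takes a step and excludes the end value, while np.linspace takes a count and includes the end value by default. A small sketch with arbitrary values:

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)                    # step given, stop excluded
by_count = np.linspace(0, 1, 5)                    # count given, stop included
half_open = np.linspace(0, 1, 4, endpoint=False)   # count given, stop excluded

print(by_step)
print(by_count)
print(half_open)
```

With a non-integer step, np.linspace is usually the safer choice because the number of elements does not depend on floating-point rounding.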
Basic Operations¶
Basic array operations (addition, multiplication, etc.) are done elementwise. In particular, * is not matrix multiplication and / has nothing to do with matrix inversion. These operations are available both as overloaded operators and as functions.
a_rank3 = np.array([[4,5,6],
                    [10,11,12],
                    [4,2,5]])
b_rank3 = np.array([[44,1,6],
                    [0,1,12],
                    [49,23,5]])
_mul_op_overload = (a_rank3 * b_rank3) # elem wise mult. Overloaded operator *
_mul_func = np.multiply(a_rank3, b_rank3) # elem wise mult. Func.
print(_mul_func)
print('Each elm equality\n', _mul_op_overload == _mul_func) # elm wise equality check
print('\nAre all elms equal?', (_mul_op_overload == _mul_func).all()) # check if all elms equal. .all() is an ndarray method
[[176   5  36]
 [  0  11 144]
 [196  46  25]]
Each elm equality
 [[ True  True  True]
 [ True  True  True]
 [ True  True  True]]
Are all elms equal? True
a_rank3 + b_rank3 # elem wise addition
array([[48, 6, 12], [10, 12, 24], [53, 25, 10]])
a_rank3 - b_rank3 # elem wise sub
array([[-40, 4, 0], [ 10, 10, 0], [-45, -21, 0]])
b_rank3 / a_rank3 # elem wise div.
# try a_rank3 / b_rank3
array([[11. , 0.2 , 1. ], [ 0. , 0.09090909, 1. ], [12.25 , 11.5 , 1. ]])
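To see that elementwise multiplication really differs from the matrix product, compare the two on a small pair of matrices. This is a sketch with made-up 2x2 arrays P and Q.

```python
import numpy as np

P = np.array([[1, 2],
              [3, 4]])
Q = np.array([[0, 1],
              [1, 0]])

elementwise = P * Q   # multiplies matching entries: [[0, 2], [3, 0]]
matrix_prod = P @ Q   # rows of P times columns of Q: [[2, 1], [4, 3]]

print(elementwise)
print(matrix_prod)
```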
Linear Algebra Operations¶
# mat-mat multiply
A = np.random.randn(5,4)
B = np.random.randn(4,3)
Recall that for a matrix $A$ of shape $n \times m$ and a matrix $B$ of shape $m \times p$, their product $AB$ (not elementwise multiplication) has shape $n \times p$. Matrix multiplication is not commutative in general; here the product $BA$ is not even defined, because the shapes $4 \times 3$ and $5 \times 4$ do not align. Let's multiply $A$ and $B$ in numpy.
A.dot(B) # AB prod
array([[-1.1281805 , 1.45925032, -2.41200943], [-1.71955545, -4.912645 , 1.10673145], [ 1.27780871, 2.82185852, 0.69053135], [-2.59719716, -1.65524365, -0.50383199], [-0.20735248, -5.8440586 , 4.13832884]])
try:
    B.dot(A) # BA prod
except ValueError: # raised because the inner dimensions (3 and 5) differ
    print('shapes not aligned for matrix mul')
shapes not aligned for matrix mul
As you'll come to know in future tutorials/exams, a dimensionality check is a very useful debugging tool in ML.
The operator $@$ is also available for matrix multiplication.
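One way to build the dimensionality check into your code is to verify shapes before multiplying and fail with a readable message. The helper checked_matmul below is hypothetical (it is not a numpy function) and is only a sketch of the idea.

```python
import numpy as np

def checked_matmul(A, B):
    """Hypothetical helper: multiply two 2-D arrays after checking that shapes align."""
    if A.ndim != 2 or B.ndim != 2:
        raise ValueError(f'expected 2-D arrays, got ndim {A.ndim} and {B.ndim}')
    if A.shape[1] != B.shape[0]:
        raise ValueError(f'inner dimensions differ: {A.shape} vs {B.shape}')
    return A @ B

product = checked_matmul(np.ones((5, 4)), np.ones((4, 3)))
print(product.shape)   # shape is (5, 3): outer dimensions of the two inputs
```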
A@B # AB prod
array([[-1.1281805 , 1.45925032, -2.41200943], [-1.71955545, -4.912645 , 1.10673145], [ 1.27780871, 2.82185852, 0.69053135], [-2.59719716, -1.65524365, -0.50383199], [-0.20735248, -5.8440586 , 4.13832884]])
np.matmul(A,B) # yet another way to get AB prod
array([[-1.1281805 , 1.45925032, -2.41200943], [-1.71955545, -4.912645 , 1.10673145], [ 1.27780871, 2.82185852, 0.69053135], [-2.59719716, -1.65524365, -0.50383199], [-0.20735248, -5.8440586 , 4.13832884]])
Documentation on Numpy LA
Exercise¶
We multiplied $A$ and $B$ in three different ways above, but there is a subtle difference in how they work behind the scenes. Find that difference. (No code is required.)
Exercise¶
Sum all the columns of matrix $A$ defined above and return an np array whose ith element is the sum of the entries of the ith column.
Exercise¶
Sum all the rows of matrix $A$ defined above and return an np array whose ith element is the sum of the entries of the ith row.
Hint: np.sum
def sum_cols(a):
    '''
    args: ndarray of shape (mxn)
    return: np array of shape n
    '''
    ### BEGIN SOLUTION
    return np.sum(a, axis=0)
    ### END SOLUTION
def sum_rows(a):
    '''
    args: ndarray of shape (mxn)
    return: np array of shape m
    '''
    ### BEGIN SOLUTION
    return np.sum(a, axis=1)
    ### END SOLUTION
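The axis argument names the dimension that is collapsed by the sum: axis=0 collapses the rows (leaving one value per column), and axis=1 collapses the columns (leaving one value per row). A small sketch on a made-up 2x3 array X:

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = X.sum(axis=0)                       # shape (3,): one sum per column
row_sums = X.sum(axis=1)                       # shape (2,): one sum per row
row_sums_col = X.sum(axis=1, keepdims=True)    # shape (2, 1): keeps the collapsed axis
total = X.sum()                                # no axis: sum of every element

print(col_sums, row_sums, row_sums_col.shape, total)
```

keepdims=True is handy when the result is later combined with the original array, since the shape (2, 1) broadcasts against (2, 3).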
Broadcasting¶
For concreteness, take two ndarrays: $M$ (defined below, of shape $4\times3$) and a column vector $v$ of shape $4\times1$.
M = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[10,11,12]])
v = np.random.rand(4,1)
Exercise 1¶
Add a scalar s to a vector v using a for loop.
Exercise 2¶
Add a vector $v$ of shape $(4 \times 1)$, i.e. a column, to each column of $M$ using a for loop.
Things start to get very fun when you add/subtract/multiply/divide arrays of different sizes. Rather than throw an error, Numpy will try to make sense of your operation using the Numpy broadcasting rules. This is an advanced topic, which often really throws off newcomers to Numpy, but with a bit of practice the rules become quite intuitive. While writing ML code, you may often run into broadcasting bugs/errors.
M = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[10,11,12]])
v = np.array([1,1,1,1])
v = np.ones((1,4))
v.shape
(1, 4)
# Ex 1. Use for loop
def add_sacalar_to_vect(v, s):
    ### BEGIN SOLUTION
    sum_ = np.empty_like(v)
    for i in range(len(v)):
        sum_[i] = v[i] + s
    return sum_
    ### END SOLUTION
# Ex 2. Use for loop
def sum_vec_to_matrix(M, v):
    ### BEGIN SOLUTION (intentionally bogus)
    M_tranp = M.T # shape (3,4)
    len_m = len(M_tranp)
    sum_ = np.empty((3,4))
    for i in range(len_m):
        sum_[i] = M_tranp[i] + v # M_tranp[i].reshape(-1) + M_tranp[i].reshape(-1)
    return sum_.T
    ### END SOLUTION
sum_vec_to_matrix(M, v)
array([[ 2., 3., 4.], [ 5., 6., 7.], [ 8., 9., 10.], [11., 12., 13.]])
When the matrix $M$ is very large, an explicit loop in Python can be slow. Broadcasting allows numpy to work with arrays of different shapes when performing arithmetic operations, avoiding explicit loops. For example, for exercise 1, you simply do:
sum_ = v + s # v has shape (4,1) and s is a scalar
And for exercise 2, you simply do:
sum_ = M + v # M has shape(4,3) and v has shape (4,1)
Read the broadcasting rules below before doing the debugging exercises.
General Broadcasting Rules:¶
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when
- they are equal, or
- one of them is 1
The size of the resulting array is the size that is not 1 along each axis of the inputs.
(More information in documentation)
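The rules above can be sketched as a few lines of plain Python and compared against what numpy actually does. The helper broadcast_shape is hypothetical, written only to illustrate the rules; it is not part of numpy.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: result shape under the broadcasting rules, or ValueError."""
    ndim = max(len(shape_a), len(shape_b))
    # Align trailing dimensions by padding the shorter shape with leading 1s.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f'incompatible dimensions {da} and {db}')
    return tuple(result)

for sa, sb in [((4, 3), (4, 1)), ((5, 1), (1, 6)), ((256, 256, 3), (3,))]:
    predicted = broadcast_shape(sa, sb)
    actual = (np.empty(sa) + np.empty(sb)).shape   # what numpy really produces
    print(sa, sb, predicted, predicted == actual)
```

A pair such as (4, 3) and (4,) is rejected, because the trailing dimensions 3 and 4 differ and neither is 1.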
Debugging Test Exam¶
When you declare a vector v as follows:
v = np.ones((4,1)) # instead of v = np.ones((1,4))
and call the following function with the staff's provided solution:
sum_vec_to_matrix(M, v)
do you see any errors? If so, report them. What is causing the error?
# debug this
v = np.ones((4,1))
sum_vec_to_matrix(M, v)
array([[ 2., 3., 4.], [ 5., 6., 7.], [ 8., 9., 10.], [11., 12., 13.]])
Answer of debugging exercise¶
The shape of M_tranp[i] is (4,) and the shape of v is (4,1), so by broadcasting their sum has shape (4,4): a matrix, not a vector, which can't be assigned to sum_[i] (of shape (4,)). To make sure the sum is a vector, reshape both M_tranp[i] and v to shape (4,).
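The shape mismatch described above can be reproduced in isolation. In this sketch, row stands in for one row of the transposed matrix; the names row and fixed are illustrative.

```python
import numpy as np

row = np.array([1, 4, 7, 10])   # shape (4,), like M.T[0]
v = np.ones((4, 1))             # column vector, shape (4, 1)

bad = row + v                   # (4,) with (4, 1) broadcasts to (4, 4)
fixed = row + v.reshape(-1)     # flatten v to (4,), so the sum stays (4,)

print(bad.shape, fixed.shape, fixed)
```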
sum_vec_to_matrix(M, v)
array([[ 2., 3., 4.], [ 5., 6., 7.], [ 8., 9., 10.], [11., 12., 13.]])
M + v
array([[ 1.79851075, 2.79851075, 3.79851075], [ 4.21489588, 5.21489588, 6.21489588], [ 7.55142288, 8.55142288, 9.55142288], [10.49551768, 11.49551768, 12.49551768]])
Best Practices:¶
Note that the hints give you direct links to the documentation of the function you need to solve a particular problem. In practice, the right way is to google what you're trying to accomplish with a particular tool, e.g. sum cols of np array, and it'll lead you to the official documentation. Make it a habit to refer to documentation as much as possible. Don't memorize function names or their prototypes, e.g. which arguments they take (functions often get deprecated too), as there are tons of functions and tools. We'll use numpy, pandas dataframe, and sklearn now, and later in the course use advanced Deep Learning tools such as PyTorch/TensorFlow, which are the backbone of much of what you see in the AI world these days, e.g. self-driving cars and Apple Siri.
test question on broadcast¶
Arrays do not need to have the same number of dimensions. For example, if you have a 256x256x3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:
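The shape alignment for this case is (256, 256, 3) against (3,), giving (256, 256, 3). A minimal sketch, using an all-ones array as a stand-in image and arbitrary scale factors:

```python
import numpy as np

image = np.ones((256, 256, 3))       # stand-in for an RGB image
scale = np.array([0.5, 1.0, 2.0])    # one factor per color channel

scaled = image * scale               # (256, 256, 3) * (3,) -> (256, 256, 3)

print(scaled.shape)
print(scaled[0, 0])                  # each pixel's channels scaled separately
```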
This is far from a complete tutorial on numpy. You'll find this official tutorial on numpy useful.
A = np.random.randn(1,4)
B = np.random.randn(4,)
# what is the shape of A+B?
A = np.random.randn(4,1)
# what is the shape of A+B? Should be different from before?
Pandas¶
Pandas is one of the most useful Python libraries for data science. Pandas is not a relational database library, but instead a “data frame” library. You can think of a data frame as being essentially like a 2D array, except that entries in the data frame can be any type of Python object (and have mixed types within the array), and the rows/columns can have “labels” instead of just integer indices like in a standard array. Let's create a dataframe:
import pandas as pd
df = pd.DataFrame([(1, 'ayesha', 'Riyadh College of Technology'),
(2, 'Haifa', 'King Saud University'),
(3, 'Saleh', 'Imam Muhammad ibn Saud Islamic University'),
(4, 'Lucas', 'Princess Nora bint Abdul Rahman University'),
(5, 'Ali', 'Prince Sultan University'),
(6, 'Asma', 'Al-Yamamah University')],
columns=["id", "name", "university name"])
df
 | id | name | university name |
---|---|---|---|
0 | 1 | ayesha | Riyadh College of Technology |
1 | 2 | Haifa | King Saud University |
2 | 3 | Saleh | Imam Muhammad ibn Saud Islamic University |
3 | 4 | Lucas | Princess Nora bint Abdul Rahman University |
4 | 5 | Ali | Prince Sultan University |
5 | 6 | Asma | Al-Yamamah University |
# set index to id
df.set_index('id')
id | name | university name |
---|---|---|
1 | ayesha | Riyadh College of Technology |
2 | Haifa | King Saud University |
3 | Saleh | Imam Muhammad ibn Saud Islamic University |
4 | Lucas | Princess Nora bint Abdul Rahman University |
5 | Ali | Prince Sultan University |
6 | Asma | Al-Yamamah University |
# let's see our df again
df
 | id | name | university name |
---|---|---|---|
0 | 1 | ayesha | Riyadh College of Technology |
1 | 2 | Haifa | King Saud University |
2 | 3 | Saleh | Imam Muhammad ibn Saud Islamic University |
3 | 4 | Lucas | Princess Nora bint Abdul Rahman University |
4 | 5 | Ali | Prince Sultan University |
5 | 6 | Asma | Al-Yamamah University |
Oops, the original index is back. By default, most Pandas operations, like .set_index() and many others, are not done in place. That is, while the df.set_index("id") call above returns a copy of the df dataframe with the index set to the id column (remember that Jupyter notebook displays the return value of the last line in a cell), the original df object is actually unchanged. Let's try again, this time in place.
df.set_index("id", inplace=True)
df.head() # display top 5 rows only. Useful especially if df contains too many rows.
id | name | university name |
---|---|---|
1 | ayesha | Riyadh College of Technology |
2 | Haifa | King Saud University |
3 | Saleh | Imam Muhammad ibn Saud Islamic University |
4 | Lucas | Princess Nora bint Abdul Rahman University |
5 | Ali | Prince Sultan University |
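An alternative to inplace=True that you will often see is to reassign the returned copy to the same name. The sketch below uses a small made-up frame (frame, with only two rows) rather than the tutorial's df.

```python
import pandas as pd

frame = pd.DataFrame([(1, 'ayesha'), (2, 'Haifa')], columns=['id', 'name'])

indexed = frame.set_index('id')          # returns a new DataFrame
columns_before = list(frame.columns)     # original still has both columns

frame = frame.set_index('id')            # reassign to keep the change
columns_after = list(frame.columns)      # 'id' is now the index, not a column

print(columns_before, columns_after, frame.index.name)
```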
Accessing Data¶
Let’s consider a few of the common ways to access or set data in a Pandas DataFrame. You can access individual elements using the .loc[row, column] notation, where row denotes the index you are searching for and column denotes the column name. For example, to get name with id 4:
df.loc[4, 'name']
'Lucas'
# all names in df
df.loc[:, 'name']
id 1 ayesha 2 Haifa 3 Saleh 4 Lucas 5 Ali 6 Asma Name: name, dtype: object
print(type(df.loc[:, 'name']))
<class 'pandas.core.series.Series'>
The return type is a Series.
# add another row
df.loc[7,:] = ('Noor', 'College of Dentistry - BUC')
df
id | name | university name |
---|---|---|
1 | ayesha | Riyadh College of Technology |
2 | Haifa | King Saud University |
3 | Saleh | Imam Muhammad ibn Saud Islamic University |
4 | Lucas | Princess Nora bint Abdul Rahman University |
5 | Ali | Prince Sultan University |
6 | Asma | Al-Yamamah University |
7 | Noor | College of Dentistry - BUC |
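A few more access patterns are worth knowing alongside .loc[row, column]: selecting a whole row, selecting several rows and columns at once, positional access with .iloc, and filtering with a boolean condition. The frame below, including its city column and values, is made up purely for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {'name': ['ayesha', 'Haifa', 'Saleh'],
     'city': ['Riyadh', 'Riyadh', 'Jeddah']},    # illustrative column and values
    index=pd.Index([1, 2, 3], name='id'),
)

one_row = frame.loc[2]                     # a single row comes back as a Series
sub_frame = frame.loc[[1, 3], ['name']]    # lists of labels give a DataFrame
by_position = frame.iloc[0, 0]             # .iloc uses integer positions, not labels
riyadh_names = frame.loc[frame['city'] == 'Riyadh', 'name'].tolist()  # boolean mask on rows

frame.loc[3, 'city'] = 'Mecca'             # .loc can also set a single cell

print(type(one_row), sub_frame.shape, by_position, riyadh_names)
```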
Official Getting started¶