NLP Continued
Natural Language Processing (NLP)¶
Tutorial 2 of 2.
One downside of term frequency models (including plain bag-of-words (BOW) representations) is that they carry no notion of similarity between words. In reality, though, we do associate different words with one another. For example, consider the following corpus:
documents = [
    "king will reward dwarf",
    "queen is angry",
    "apple is worth more than 2 trillion dollars now"
]
Semantically, we'll agree that the first two documents are similar while the third is different. A bag-of-words approach, however, only sees that documents 2 and 3 share the word 'is' while documents 1 and 2 share no word at all, so a BOW model will, wrongly, say documents 2 and 3 are the more similar pair.
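To see this concretely, here is a minimal sketch (assuming scikit-learn is installed, which we use later anyway) that builds plain count vectors for the three documents and compares them with cosine similarity: documents 1 and 2 share no word, so their similarity is zero, while documents 2 and 3 get a small nonzero similarity from the shared 'is'.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Plain bag-of-words counts for the three toy documents above.
bow = CountVectorizer().fit_transform(documents)

# Pairwise cosine similarities between the count vectors:
# docs 1 & 2 share no word -> 0; docs 2 & 3 share only 'is' -> small but nonzero.
cosine_similarity(bow)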
To get started with word embeddings, note that words in the term frequency model can also be thought of as vectors; in particular, each word is represented by a "one-hot" vector that is zero in every coordinate except for a one in the coordinate assigned to that word. Do you remember 'one-hot' encoding? You've implemented it before without knowing it was called 'one-hot encoding'.
Aside: Do you recall Question 8 from Test 1? Let's jog your memory by pasting it below along with its solution.
Question 8¶
Given an ndarray of shape $(1, m)$, essentially a row vector, containing $m$ integer elements $\{a_0, a_1, \ldots, a_{m-1}\}$, construct an output ndarray such that the $i$th row of the output contains all 0's except a 1 at index $a_i$, where $i \in \{0, 1, 2, \ldots, m-1\}$. See the example below for clarification:
arr1 = np.array([1, 3, 2]).reshape(1,-1) # input array of shape (1,3)
encode(arr1) # Function called
array([[0., 1., 0., 0.],   # output array
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]])
"The input array contains '3' at index 1, hence the 1st row [0., 0., 0., 1.] in \
the output array contains all 0's but 1 at index '3'. Similarly, for others."
Solution:
def encode(arr):
    '''args: ndarray of shape (1, m)
    returns: ndarray of shape (m, y). Figuring out y is part of this problem.
    Staff's solution has 7-8 lines of code.
    '''
    ### BEGIN SOLUTION
    _, ncols_in_arr = arr.shape
    # desired array has shape (ncols_in_arr, max(arr) + 1)
    nrows_in_encod = ncols_in_arr
    ncols_in_encod = np.max(arr) + 1
    # initialize to all zeros first
    encoding_ = np.zeros((nrows_in_encod, ncols_in_encod))
    # place 1's at the appropriate positions, using integer (fancy) indexing
    row_idx = np.arange(nrows_in_encod)  # e.g. [0, 1, 2, ..., 4] if nrows_in_encod is 5
    col_idx = arr  # shape (1, m); broadcasts against row_idx so each row gets its label's column
    encoding_[row_idx, col_idx] = 1
    return encoding_
    ### END SOLUTION
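An equivalent, more compact NumPy idiom (not the staff solution, just a sketch) builds an identity matrix and picks out its rows by label: row $a_i$ of the identity matrix is exactly the one-hot vector for label $a_i$.
def encode_oneliner(arr):
    # np.eye builds the identity; indexing its rows with the labels gives the one-hot rows.
    return np.eye(np.max(arr) + 1)[arr.ravel()]

encode_oneliner(arr1)  # same output as encode(arr1) above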
Aside: If you think of the input array as an array of labels (cat (1), dog (0), chair (2), etc.), then the output of the encode() function you implemented above is a very useful encoding, ubiquitous in machine learning and commonly used to represent categorical variables. More on that in future tutorials.
In our corpus, the word 'king', for example, could be represented as
[0, 0, 0, 1, 0, 0, ...]  # only one 1, all other entries 0. Arbitrary encoding to illustrate the point.
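As a quick illustrative sketch (the vocabulary order here is arbitrary, just like the encoding above), we could assign every distinct word in our toy corpus an index and build such a one-hot vector for 'king':
# Build an (arbitrary, sorted) vocabulary index from the toy corpus.
vocab = sorted({w for doc in documents for w in doc.split()})
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot vector for 'king': all zeros except a 1 at its vocabulary index.
one_hot_king = np.zeros(len(vocab))
one_hot_king[word_to_idx["king"]] = 1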
Word Embedding¶
Like the one-hot encoding of words, a word embedding is a vector representation of words: an embedding represents each word with a $k$-dimensional vector, where $k$ can be much smaller than the vocabulary size. Importantly, Euclidean distances in the word embedding (attempt to) correspond to some notion of similarity between the words.
Creating Embeddings¶
Creating word embeddings falls in the arena of deep learning, which we'll talk about a lot in future tutorials. A common approach that has ignited much of the recent interest in word embeddings is the word2vec algorithm. Simply put, given a large body of text, word2vec trains a model that can "predict" the context around a word. The well-known word2vec model was trained on roughly 100 billion words of Google News text and contains 3 million unique words, with embeddings of size k=300. Training such a model on such a large dataset requires enormous computational power. Instead, we'll use a smaller version: a word2vec model pretrained on English Wikipedia, available here on Google Drive.
import gensim as gs
import numpy as np
word2vec_model = gs.models.KeyedVectors.load_word2vec_format('data/deps.words.bin', binary=True)
word2vec_model["king"][:5] # a 300 dim vector. Get first 5 entries
array([ 0.03303813, 0.06656987, 0.02628002, -0.05732338, 0.01353508], dtype=float32)
word2vec_model["queen"][:5] # a 300 dim vector. Get first 5 entries
array([0.01998595, 0.15262055, 0.00061866, 0.01659017, 0.07441706], dtype=float32)
The smaller version of word2vec covers a lot of common words you would use in English.
Similarity Measures¶
Since, as we mentioned earlier, word embeddings come with a notion of similarity, we can ask for the words closest to 'king'; 'queen' is one of them.
word2vec_model.similar_by_word("king", topn=5)
[('norodom', 0.6755779981613159), ('songtsän', 0.6748666167259216), ('queen', 0.6625942587852478), ('bhumibol', 0.6613788604736328), ('monarch', 0.6593648195266724)]
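As a quick sanity check (a sketch; the exact numbers depend on the particular pretrained vectors), gensim's KeyedVectors also exposes a pairwise cosine similarity, and we would expect 'king' and 'queen' to score noticeably higher than 'king' and 'apple':
# Cosine similarity between pairs of word vectors (higher means more similar).
print(word2vec_model.similarity("king", "queen"))  # expected to be relatively high
print(word2vec_model.similarity("king", "apple"))  # expected to be much lower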
Let's create our own vectorizer that returns one (averaged) embedding vector per document.
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec_model):
        self.word2vec = word2vec_model
        self.dim = 300  # default 300-dim vectors returned by our word2vec

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # For each document, average the embeddings of its in-vocabulary words;
        # if none of the words are in the vocabulary, fall back to a vector of zeros.
        return np.array([
            np.mean([self.word2vec[w] for w in words.split() if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
embd_vectorizer = MeanEmbeddingVectorizer(word2vec_model)
documents[1]
'queen is angry'
for doc in documents:
    words = doc.split()
    print(words)
['king', 'will', 'reward', 'dwarf']
['queen', 'is', 'angry']
['apple', 'is', 'worth', 'more', 'than', '2', 'trillion', 'dollars', 'now']
embd_vectorizer.transform([documents[1]]).shape  # transform expects a list of documents
(1, 300)
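Circling back to the motivating example, here is a short sketch (again, the exact numbers depend on the pretrained vectors) that compares the averaged embeddings of the three toy documents. Unlike the bag-of-words counts, the first two documents should now come out as the most similar pair.
from sklearn.metrics.pairwise import cosine_similarity

# Average-embedding vectors for the three toy documents, shape (3, 300).
doc_vecs = embd_vectorizer.transform(documents)

# Pairwise cosine similarities; the (king, queen) pair should now beat the (queen, apple) pair.
cosine_similarity(doc_vecs)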
Text classification Using Embeddings¶
Let's try text classification, though we shouldn't expect very good results, as our miniature word2vec was trained on only a few thousand words. Also, our feature vectors are of size 300, whereas with TF-IDF the size was ~32,000.
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_sklearn = tfidf_vectorizer.fit_transform(documents)
tfidf_sklearn
<3x14 sparse matrix of type '<class 'numpy.float64'>' with 15 stored elements in Compressed Sparse Row format>
n_samples = 1000
# tf_news = vectorizer.fit_transform(newsgroups_train.data)
raw_text = newsgroups_train.data[:n_samples]
embd_feats = embd_vectorizer.transform(raw_text)
labels = newsgroups_train.target[:n_samples]
from sklearn.svm import SVC
svc_clf = SVC()
train_size = int(n_samples * .8) # 80% data for training
x_train = embd_feats[:train_size]
y_train = labels[:train_size]
svc_clf.fit(embd_feats, labels)  # note: fits on all 1,000 samples, not just the 80% training split
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
x_test = embd_feats[train_size:]
y_test = labels[train_size:]
svc_clf.score(x_train, y_train)  # accuracy on the training portion, which the model has already seen
0.2275
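For a cleaner read on generalization (a sketch only; the exact number will differ from the training-set score above), we could fit on the 80% split only and score on the held-out 20%:
# Fit on the training split only and evaluate on the held-out split.
heldout_clf = SVC()
heldout_clf.fit(x_train, y_train)
heldout_clf.score(x_test, y_test)  # held-out accuracy; expect it to be low as well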
You will see a very low score, as our word2vec was trained on only a few thousand words instead of roughly 100 billion, and the embeddings it generates are not very good. You can try playing with the more recently proposed GloVe embeddings, available here. The link has various versions of GloVe based on the number of tokens they were trained on, i.e. 6B, 27B, 42B, or 840B.
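If you want to experiment further, one convenient way to try GloVe without downloading the files by hand is gensim's downloader API (a sketch; 'glove-wiki-gigaword-100' is one of the pretrained sets the downloader provides, and the first call fetches a fairly large file). Note these vectors are 100-dimensional, so the hard-coded self.dim = 300 in MeanEmbeddingVectorizer would need to change accordingly.
import gensim.downloader as api

# Download (once) and load 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove_model = api.load("glove-wiki-gigaword-100")
glove_model.similar_by_word("king", topn=5)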