Natural Language Processing (NLP)¶
Tutorial 1 of 2
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
# useful for printing numpy arrays.
from IPython.display import HTML, display
import tabulate
def pp(a):
    if a.ndim < 2:
        a = [a]
    display(HTML(tabulate.tabulate(a, tablefmt='html')))
Here, we'll deal with free text (natural language), which is a widely available form of data.
NLP seeks to understand the structure behind free text, e.g. to perform tasks like parsing sentences grammatically, describing the entities and properties that the text refers to, etc.
Most basic tasks involving free text can be carried out using relatively simple methods that we can easily understand without any reference to complex machine learning: methods like bag-of-words models, TFIDF vectors, and (simple n-gram) language models. Deep learning methods may well render these methods obsolete in the future, but we are not quite there yet.
A truly intelligent agent is not really intelligent if it cannot understand natural language; yet, unfortunately, we're not there yet. State-of-the-art systems still fall well short of genuine language understanding, and you can easily come up with examples that trip them up.
All of the ML algorithms we have seen so far deal with numerical values. Natural language, on the other hand, is unstructured text, so we first need to turn it into numerical features.
A Note on NLP Terminology
"Words" or "terms" refer to individual tokens separated by whitespace; punctuation marks also count as tokens.
"Document" refers to an individual piece of free text, i.e. a document consists of words.
"Corpus" is a collection of documents.
"Vocabulary" is the set of unique words across all documents (i.e. in the corpus).
Feature Extraction¶
We'll transform textual features into numerical features usable for machine learning algorithms.
documents = ["machine learning is such an amazing field",
             "i enjoy writing machine learning programs",
             "i think of tests as learning instruments yet they also provide the course staff"
             " a guage of my understanding"]
Bag of Words (BOW)¶
As the name suggests, BOW is simply the set of words that make up a document, irrespective of their relative positions.
Despite the obvious information we throw away with this representation, it tends to work surprisingly well in practice. The general "gist" of many documents can be obtained by looking only at the presence/absence of words in the text.
- We'll tokenize the text (giving an integer id to each possible token, using space/comma as separators).
- We'll count the occurrences of tokens (words) in each document: TF (explained below).
- We'll normalize and weight, with diminishing importance, tokens that occur in the majority of samples/documents: TFIDF (explained below).
This specific strategy (tokenization, counting and normalization) is called the Bag of Words.
More in sklearn docs.
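Before handing this over to sklearn, here is a minimal hand-rolled sketch of a bag-of-words count using collections.Counter on the documents defined above:
from collections import Counter

# Each document becomes an unordered bag (multiset) of its tokens; word order is discarded.
bags = [Counter(doc.split()) for doc in documents]
print(bags[0])  # Counter({'machine': 1, 'learning': 1, 'is': 1, 'such': 1, 'an': 1, 'amazing': 1, 'field': 1})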
N-grams¶
Although the bag-of-words representation does work remarkably well in many cases, it is of course still an incredibly simple representation of documents and will miss basic properties. For example, the documents
- Go, do not stop.
- Stop, do not go.
have the exact same bag of words representation (as they contain the same words but different ordering) yet have the exact opposite semantics (meaning).
Resolution of the above drawback: the n-gram model.
A probabilistic language model attempts to predict the probability of each word in the document given all the words that preceded it. Our objective here is to build a rudimentary language model that captures some basic context from the preceding words when predicting the next one. The idea of the n-gram model is to put a cap on the size of the context used to predict the next word, so that under an n-gram model we have $$ P(word_i \mid word_1, \ldots, word_{i-1}) \approx P(word_i \mid word_{i-n+1}, \ldots, word_{i-1}) $$
For example, for a 3-gram model (n=3), we'll look at the previous two words.
N-gram models are typically built upon some fixed corpus, and when we refer to probabilities in an n-gram model, we are just referring to counts in this corpus. Specifically, we count the number of times we have seen the entire sequence of n words and divide by the number of times we have seen the preceding n-1 words. For example, for a 4-gram model, we simply count in our corpus: $$ \frac{\#(walking \ on \ the \ road)}{\#(walking \ on \ the)} $$
This simple n-gram model has amazing applications, such as automatically generating text.
This can easily be implemented in a few lines of code.
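As an illustration, here is a minimal sketch of a bigram (n=2) text generator trained on our tiny documents corpus; the corpus size, lack of smoothing, and sampling scheme are all simplifications:
import random
from collections import defaultdict

# Build bigram counts: for each word, count which words follow it in the corpus.
following = defaultdict(lambda: defaultdict(int))
for doc in documents:
    words = doc.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def generate(start, length=8):
    # Weighted random walk through the bigram counts.
    out = [start]
    for _ in range(length):
        candidates = following[out[-1]]
        if not candidates:          # dead end: no observed continuation
            break
        words, counts = zip(*candidates.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("machine"))  # e.g. "machine learning is such an amazing field"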
Sklearn on ngrams
n-gram extraction using sklearn¶
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means extract bigrams only
bigrams = ngram_vectorizer.fit_transform(documents)
ngram_vectorizer.get_feature_names()
['also provide', 'amazing field', 'an amazing', 'as learning', 'course staff', 'enjoy writing', 'guage of', 'instruments yet', 'is such', 'learning instruments', 'learning is', 'learning programs', 'machine learning', 'my understanding', 'of my', 'of tests', 'provide the', 'staff guage', 'such an', 'tests as', 'the course', 'they also', 'think of', 'writing machine', 'yet they']
bigrams # shown here for shape comparison (next)
<3x25 sparse matrix of type '<class 'numpy.int64'>' with 26 stored elements in Compressed Sparse Row format>
Read docs for more info on n-grams.
Exercise¶
Try 3-gram/4-gram feature extraction on our corpus.
Features Transformation¶
Hashing¶
Vectorizers (such as CountVectorizer and the TF/TFIDF vectorizers explained later) consume a lot of memory. For example, the feature names were kept in memory:
ngram_vectorizer.get_feature_names() # in countvectorizer above
Think about a corpus containing thousands/millions of documents. Vectorizers keep an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute), which causes several problems when dealing with large datasets.
To mitigate this problem, the hashing trick is used: instead of building a hash table of the features encountered in training, as the vectorizers do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample matrices directly. More in docs.
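The core idea can be sketched in a few lines (a simplified illustration; sklearn uses a dedicated hash function and a signed variant to reduce the effect of collisions, details we skip here):
n_features = 10  # chosen up front; no vocabulary is ever stored

def hash_vectorize(doc, n_features=n_features):
    # Map each token to a column via a hash; no token -> index dictionary needed.
    vec = np.zeros(n_features)
    for token in doc.split():
        vec[hash(token) % n_features] += 1  # built-in hash() as a stand-in hash function
    return vec

hashed = np.vstack([hash_vectorize(doc) for doc in documents])
print(hashed.shape)  # (3, 10): the number of columns is fixed regardless of vocabulary size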
Sklearn HashingVectorizer¶
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=10)
hv.transform(documents)
<3x10 sparse matrix of type '<class 'numpy.float64'>' with 19 stored elements in Compressed Sparse Row format>
Compare the number of columns (10) in sparse matrix returned by HashingVectorizer with those returned by CountVectorizer (25).
Notice that HashingVectorizer is stateless: there is nothing to fit, so transform can be called directly.
TF-IDF¶
A Matrix
Before discussing TFIDF, let's talk about TF and IDF; TFIDF is simply their product.
TF (Term Frequency)¶
We can represent the documents using a term frequency matrix, an $m \times n$ matrix where $m$ denotes the number of documents and $n$ denotes the vocabulary size. Specifically, the entry in row $i$ and column $j$ is the number of times word $j$ occurs in document $i$, where 'word $j$' is the $j$-th word in the vocabulary.
documents
['machine learning is such an amazing field', 'i enjoy writing machine learning programs', 'i think of tests as learning instruments yet they also provide the course staff a guage of my understanding']
document_words = [doc.split() for doc in documents]
vocab = sorted(set(sum(document_words, [])))
print(vocab, "\n") # list containing unique words in corpus.
['a', 'also', 'amazing', 'an', 'as', 'course', 'enjoy', 'field', 'guage', 'i', 'instruments', 'is', 'learning', 'machine', 'my', 'of', 'programs', 'provide', 'staff', 'such', 'tests', 'the', 'they', 'think', 'understanding', 'writing', 'yet']
vocab_dict = {k:i for i,k in enumerate(vocab)} # word-index mapping. Useful to have.
print(vocab_dict, "\n")
{'a': 0, 'also': 1, 'amazing': 2, 'an': 3, 'as': 4, 'course': 5, 'enjoy': 6, 'field': 7, 'guage': 8, 'i': 9, 'instruments': 10, 'is': 11, 'learning': 12, 'machine': 13, 'my': 14, 'of': 15, 'programs': 16, 'provide': 17, 'staff': 18, 'such': 19, 'tests': 20, 'the': 21, 'they': 22, 'think': 23, 'understanding': 24, 'writing': 25, 'yet': 26}
TF = np.zeros((len(documents), len(vocab)), dtype=int)
for i, doc in enumerate(document_words):
for word in doc:
TF[i, vocab_dict[word]] += 1
print('shape of TF matrix', TF.shape)
print('TF Matrix:')
pp(TF)
shape of TF matrix (3, 27)
TF Matrix:
0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
There are 3 documents and therefore the TF matrix has 3 rows. Similarly, there are 27 words in the vocabulary and hence 27 columns.
TF in sklearn¶
Let's compute TF using count vectorizer of sklearn.
vectorizer = CountVectorizer()
TF_sklearn = vectorizer.fit_transform(documents,)
print(TF_sklearn.shape) # (3,25) but it should be (3,27).
pp(TF_sklearn.toarray())
(3, 25)
0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
Are they the same TF matrices? It appears sklearn skipped two terms.
TF_sklearn.shape == TF.shape
False
print((vectorizer.get_feature_names()))
['also', 'amazing', 'an', 'as', 'course', 'enjoy', 'field', 'guage', 'instruments', 'is', 'learning', 'machine', 'my', 'of', 'programs', 'provide', 'staff', 'such', 'tests', 'the', 'they', 'think', 'understanding', 'writing', 'yet']
Exercise¶
Write code to find out which two terms are missing in sklearn TF yet present in our implementation.
Answer¶
It turns out that by default sklearn only includes words with at least 2 characters. We can change this default behavior by overriding the constructor argument token_pattern, which accepts a regex. Regular expressions are really powerful, but we won't discuss them in depth here. You're encouraged to read the docs.
vectorizer_inclcudes_single_char = CountVectorizer(token_pattern=u"(?u)\\b\\w+\\b")
TF_sklearn_inclcudes_single_char = vectorizer_inclcudes_single_char.fit_transform(documents,)
print(TF_sklearn_inclcudes_single_char.shape)
pp(TF_sklearn_inclcudes_single_char.toarray())
(3, 27)
0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
Now the shape (3, 27) matches the one from our implementation.
TF_sklearn_inclcudes_single_char.shape == TF.shape
True
Drawback of TF¶
The issue with the TF representation of documents is that common words (such as is, a, to) that occur frequently in most documents will dominate the corresponding document vector. These common words do not convey much information about the document. Consider the task of finding how similar two documents are: the similarity score will be influenced heavily by these common words, e.g. two highly dissimilar documents with a lot of 'a's and 'to's will be scored as similar. We somehow want to diminish the entries for these common words in a document vector.
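To see this concretely, here is a small sketch (a made-up pair of sentences, raw term counts, cosine similarity) showing how shared common words alone can make two unrelated sentences look similar:
# Two unrelated sentences that share only common words ('a', 'is', 'the', 'to').
pair = ["a trip to the beach is a joy to remember",
        "a bug in the parser is a pain to debug"]

cv = CountVectorizer(token_pattern=u"(?u)\\b\\w+\\b")  # keep 1-char words like 'a'
X = cv.fit_transform(pair).toarray().astype(float)

# Cosine similarity of the raw term-frequency vectors.
cos = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(round(cos, 2))  # fairly high, even though the sentences are about different things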
IDF (Inverse Document Frequency)¶
A Vector
The above drawback can be resolved via the inverse document frequency weight for words in vocabulary. IDF for $word_j$ is commonly defined as: $$ IDF_j = \log \frac{\# documents}{\# documents \ with \ word_j}$$
Sklearn's definition of IDF is a little different.
count_docs_with_word_j = TF.astype(bool).sum(axis=0)
total_docs = TF.shape[0]
IDF = np.log(total_docs/count_docs_with_word_j)
pp(IDF)
1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 0.405465 | 1.09861 | 1.09861 | 0 | 0.405465 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 |
Exercise¶
Notice the 0 in IDF; it means that word occurs in all documents ($\log(1) = 0$). Which word is it? (Warning: the solution is provided in the next cell.)
idx_arr = np.where(IDF == 0) # (array([12]),)
idx = int(idx_arr[0][0]) # extract int 12
vocab[idx] # 'learning' which indeed occurs in all three docs (verify).
'learning'
TFIDF (Term Frequency Inverse Document Frequency)¶
A matrix
TFIDF scales the columns of the term frequency matrix by their inverse document frequency. In doing so, we still have an effective bag-of-words representation of each document, but we discount words that occur very frequently and increase the weight of less frequent terms.
TFIDF = TF * IDF
print(TFIDF.shape) # (3,27)
pp(TFIDF)
(3, 27)
0 | 0 | 1.09861 | 1.09861 | 0 | 0 | 0 | 1.09861 | 0 | 0 | 0 | 1.09861 | 0 | 0.405465 | 0 | 0 | 0 | 0 | 0 | 1.09861 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1.09861 | 0 | 0 | 0.405465 | 0 | 0 | 0 | 0.405465 | 0 | 0 | 1.09861 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.09861 | 0 |
1.09861 | 1.09861 | 0 | 0 | 1.09861 | 1.09861 | 0 | 0 | 1.09861 | 0.405465 | 1.09861 | 0 | 0 | 0 | 1.09861 | 2.19722 | 0 | 1.09861 | 1.09861 | 0 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 1.09861 | 0 | 1.09861 |
TFIDF in Sklearn¶
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_sklearn = tfidf_vectorizer.fit_transform(documents)
tfidf_sklearn
<3x25 sparse matrix of type '<class 'numpy.float64'>' with 28 stored elements in Compressed Sparse Row format>
A Couple of Remarks:
Notice that by default the TFIDF is returned as a sparse matrix (as was TF_sklearn), which is a very good design decision and works really well in practice. Why? A TFIDF matrix will mostly contain 0's, since each document contains only a small fraction of the vocabulary (which for real-world data is a very large list); see the quick check after these remarks.
Also, notice the shape mismatch again, as with TF. Again two words are missing, which is expected since under the hood TfidfVectorizer calls CountVectorizer. According to the documentation, TfidfVectorizer is "equivalent to CountVectorizer followed by TfidfTransformer".
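As a quick check on that sparsity claim (using only the sparse matrix's nnz and shape attributes):
n_rows, n_cols = tfidf_sklearn.shape
sparsity = 1 - tfidf_sklearn.nnz / (n_rows * n_cols)   # fraction of entries that are zero
print(f"{sparsity:.1%} of the entries are zero")       # 28 non-zeros out of 3*25 = 75 entries here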
pp(tfidf_sklearn.toarray()) # readable numpy array
0 | 0.410747 | 0.410747 | 0 | 0 | 0 | 0.410747 | 0 | 0 | 0.410747 | 0.242594 | 0.312384 | 0 | 0 | 0 | 0 | 0 | 0.410747 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0.504611 | 0 | 0 | 0 | 0 | 0.298032 | 0.38377 | 0 | 0 | 0.504611 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.504611 | 0 |
0.233451 | 0 | 0 | 0.233451 | 0.233451 | 0 | 0 | 0.233451 | 0.233451 | 0 | 0.13788 | 0 | 0.233451 | 0.466902 | 0 | 0.233451 | 0.233451 | 0 | 0.233451 | 0.233451 | 0.233451 | 0.233451 | 0.233451 | 0 | 0.233451 |
Notice that our TFIDF is noticeably different from the sklearn implementation. It appears that sklearn's TFIDF returns normalized rows (which is indeed the case). There are other differences, such as the smoothing 1's sklearn adds when defining IDF. See the docs for in-depth information on the sklearn implementation. We'll end our discussion of TFIDF here.
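To make the connection explicit, here is a sketch that reproduces sklearn's numbers from our count matrix, assuming the default settings (smooth_idf=True, norm='l2'); the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 comes from the sklearn docs:
counts = TF_sklearn.toarray().astype(float)       # the (3, 25) count matrix from CountVectorizer
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                     # document frequency of each term

idf_smooth = np.log((1 + n_docs) / (1 + df)) + 1  # sklearn's smoothed IDF
tfidf_manual = counts * idf_smooth

# L2-normalize each row, as sklearn does with norm='l2' (the default).
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

print(np.allclose(tfidf_manual, tfidf_sklearn.toarray()))  # True if the assumptions above hold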
Text classification Using TFIDF Features¶
Support Vector Classifier
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
The dataset is large and it'll take some time to train. To train quickly you can use fewer samples (not recommended for real-world projects; to state the obvious, more training data will decrease generalization error).
n_samples = 1000
# tf_news = vectorizer.fit_transform(newsgroups_train.data)
raw_text = newsgroups_train.data[:n_samples]
tfidf_feats = tfidf_vectorizer.fit_transform(raw_text)
labels = newsgroups_train.target[:n_samples]
from sklearn.svm import SVC
svc_clf = SVC()
tfidf_feats
<1000x32190 sparse matrix of type '<class 'numpy.float64'>' with 157775 stored elements in Compressed Sparse Row format>
train_size = int(n_samples * .8) # 80% data for training
x_train = tfidf_feats[:train_size]
y_train = labels[:train_size]
svc_clf.fit(x_train, y_train)
SVC()
x_test = tfidf_feats[train_size:]
y_test = labels[train_size:]
svc_clf.score(x_test, y_test)
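A more robust version of this pipeline (a sketch using sklearn's train_test_split and make_pipeline utilities) splits the raw text before vectorizing, so the TFIDF vocabulary and IDF weights are fit on the training documents only:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Split the raw text, not the already-vectorized features.
text_train, text_test, y_tr, y_te = train_test_split(
    raw_text, labels, test_size=0.2, random_state=0)

# The pipeline fits the TFIDF vocabulary/IDF on the training text only.
clf = make_pipeline(TfidfVectorizer(), SVC())
clf.fit(text_train, y_tr)
print(clf.score(text_test, y_te))  # held-out accuracy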