Natural Language Processing (NLP)

Zero to Mastery Part I: Foundations
Unlocking Ten Key Concepts for NLP Proficiency

ChenDataBytes
10 min read · Dec 28, 2023

The articles in this series cover the following topics:

Part 1 (this article): Presents the fundamental principles of Natural
Language Processing (NLP).

Part 2: Explores the common applications of NLP.

Natural Language Processing (NLP) is a field of study within computer
science and artificial intelligence that focuses on the interaction between
computers and human languages. Its objective is to enable computers to
understand, interpret, and generate human language, thereby facilitating
communication and interaction between humans and machines.

NLP practitioners commonly rely on libraries such as NLTK and spaCy. NLTK
provides a comprehensive set of tools and resources for NLP tasks, while
spaCy offers efficient processing capabilities. However, although spaCy is
efficient, it may not be suitable for certain applications such as sentiment
analysis, which may require more specialized libraries or approaches. For
sequence modelling in NLP, the TensorFlow deep learning framework can be
used.

Regarding the foundational aspects of NLP, we will delve into ten essential
topics: lemmatization, stemming, part-of-speech tagging, stop words,
pattern matching, sentence segmentation, named entity recognition,
tokenization, word embedding, and bag-of-words.

Linguistic Basics

1. Stemming
Stemming is a linguistic method utilized to obtain the base or root form of
words by removing letters from the word’s end. Its objective is to simplify
words by disregarding tense, pluralization, and other grammatical
variations. The Porter stemming algorithm employs a collection of
predetermined rules and heuristics to eliminate common English suffixes,
thereby converting words into their corresponding stems. spaCy doesn’t
have a built-in implementation of the Porter stemmer, so we use NLTK for this:

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ["runner", "running", "ran"]
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))
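To make the idea of rule-based suffix removal concrete, here is a minimal sketch in plain Python. It is not the Porter algorithm: the suffix list and the minimum stem length are assumptions chosen only for illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of
# rule-based suffix removal. The suffix list is a made-up assumption.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "jumped", "boxes", "cats", "ran"]:
    print(w, "-->", naive_stem(w))
```

Note that "running" becomes "runn" and the irregular form "ran" is left untouched. Purely suffix-based rules can produce non-words and cannot relate irregular forms, which is part of the motivation for lemmatization.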
2. Lemmatization
In contrast to stemming, lemmatization is a more sophisticated linguistic
process that aims to reduce a word to its base or dictionary form, known as a
lemma. Lemmatization takes into account factors such as part-of-speech
(POS) tags and contextual understanding to ensure accurate and meaningful
results.

import spacy
nlp = spacy.load('en_core_web_sm')

doc1 = nlp(u"The dedicated runner, after running for hours, finally ran across t

for token in doc1:
    # token.lemma is the integer hash; token.lemma_ is the readable lemma string
    print(token.text, '\t', token.lemma, '\t', token.lemma_)
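The role of the POS tag in lemmatization can be sketched with a small lookup table. The table entries and the function name `toy_lemmatize` are hypothetical and do not reflect how spaCy stores its data; the point is only that the same surface form can map to different lemmas depending on its tag.

```python
# Toy lemma table keyed by (lowercased word, coarse POS tag).
# The entries are illustrative assumptions, not spaCy's data.
LEMMA_TABLE = {
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the table lemma, falling back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("meeting", "NOUN"))
print(toy_lemmatize("Ran", "VERB"))
```

Unlike a suffix stripper, a dictionary-style approach can map the irregular form "ran" to "run", at the cost of needing a vocabulary and a tag for each word.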
3. Part of speech
Part of speech refers to the grammatical category of a word in a sentence,
such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, or
interjection. Part of speech tagging can be used for various purposes,
including identifying named entities and speech recognition. The
probabilities of part-of-speech tags occurring near one another can be used
to select the most plausible output.

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The dedicated runner, after running for hours, finally ran across th
# Print the POS information for the token "after" (index 4)
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))
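The remark about tags occurring near one another can be illustrated by estimating tag-to-tag transition probabilities from counts. The tiny hand-tagged corpus below is an assumption made up for this sketch; real taggers are trained on far larger annotated corpora.

```python
from collections import Counter, defaultdict

# Hand-tagged toy corpus (an assumption for illustration only).
tagged_sentences = [
    [("the", "DET"), ("runner", "NOUN"), ("ran", "VERB")],
    [("the", "DET"), ("dedicated", "ADJ"), ("runner", "NOUN"), ("smiled", "VERB")],
    [("a", "DET"), ("runner", "NOUN"), ("ran", "VERB"), ("fast", "ADV")],
]

# Count how often each tag follows another.
transitions = defaultdict(Counter)
for sentence in tagged_sentences:
    tags = [tag for _, tag in sentence]
    for prev_tag, next_tag in zip(tags, tags[1:]):
        transitions[prev_tag][next_tag] += 1

def transition_prob(prev_tag: str, next_tag: str) -> float:
    """Estimate P(next_tag | prev_tag) by relative frequency."""
    total = sum(transitions[prev_tag].values())
    return transitions[prev_tag][next_tag] / total if total else 0.0

print(transition_prob("DET", "NOUN"))
print(transition_prob("DET", "ADJ"))
```

In this toy corpus a determiner is followed by a noun twice as often as by an adjective, so a system choosing between candidate outputs could prefer the one with the more probable tag sequence.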

Text Identification/Extraction

4. Stop words
Common words such as “a” and “the” occur so frequently in text that they
often do not carry significant meaning compared to nouns, verbs, and
modifiers. These commonly occurring words are referred to as stop words
and can be excluded or filtered out during text processing. spaCy provides a
built-in list of roughly 300 English stop words (the exact count depends on
the spaCy version) that can be readily used.

import spacy
nlp = spacy.load('en_core_web_sm')

sentences = "The dedicated runner, after running for hours, finally ran across t

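def remove_stopwords(sentence):
    # Lowercase, split on whitespace, and drop tokens flagged as stop words
    sentence = sentence.lower()
    words = sentence.split()
    sentence = " ".join([w for w in words if not nlp.vocab[w].is_stop])
    return sentence

print(remove_stopwords(sentences))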
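One detail worth noting is that whitespace splitting leaves punctuation attached to words, so a stop word followed by a comma would not be recognized. The sketch below shows the same filtering idea in plain Python with a regular-expression tokenizer; the stop list is a tiny hand-picked assumption, not spaCy's list.

```python
import re

# Tiny hand-picked stop list (an assumption; spaCy's list is much longer).
STOP_WORDS = {"a", "an", "the", "after", "for", "and", "of", "to", "in"}

def filter_stopwords(text: str) -> list:
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(filter_stopwords("The runner, after running for hours, took a break."))
```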


5. Pattern Matching
Pattern matching entails the identification and extraction of linguistic
patterns or structural information from text. This process involves searching
for specific sequences of words, phrases, or syntactic structures that
conform to predefined patterns or rules.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'The dedicated runner, after running for hours, finally ran across th

matcher = PhraseMatcher(nlp.vocab)

phrase_list = ['runner', 'smile']
phrase_patterns = [nlp(text) for text in phrase_list]

# spaCy v3 signature: add(key, patterns); spaCy v2 used add(key, None, *patterns)
matcher.add('newproduct', phrase_patterns)

matches = matcher(doc)

# Print the matches found in the text.
# Each match is a tuple containing the match ID, start index, and end index.
print(matches)
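The core of phrase matching, searching a token list for a fixed token sequence, can be sketched without any library. The helper `find_phrase` is hypothetical; like the spans in the spaCy output, it reports a start index and an exclusive end index.

```python
def find_phrase(tokens, phrase):
    """Return (start, end) token spans where `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase
    ]

tokens = "the runner gave a big smile to another runner".split()
print(find_phrase(tokens, ["runner"]))
print(find_phrase(tokens, ["big", "smile"]))
```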
6. Sentence Segmentation
Sentence segmentation refers to the process of dividing a document or a
textual piece into individual sentences. In natural language processing,
accurately identifying sentence boundaries is crucial for various text
analysis and language understanding tasks.

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This is the first sentence. This is another sentence. This is the la
for sent in doc.sents:
    print(sent.text)
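To see why sentence boundaries are harder than they look, consider a naive rule that splits after sentence-final punctuation. This is a simplified sketch, not how spaCy segments text, and the sample text is made up for the illustration.

```python
import re

def split_sentences(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "This is the first sentence. Is this another one? Dr. Smith agrees!"
for sentence in split_sentences(text):
    print(sentence)
```

The rule handles the first two sentences but wrongly breaks after the abbreviation "Dr.", which is the kind of case that trained or rule-rich segmenters are designed to handle.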

7. Named Entity Recognition (NER)

Named Entity Recognition (NER) entails the identification and classification
of named entities present in the given text. Named entities typically
encompass distinct types of words or phrases that represent recognizable
entities, including but not limited to names of individuals, organizations,
locations, dates, numerical expressions, and others.

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'The dedicated runner, after running for hours, finally ran across th
for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_

Text Representation

8. Tokenization
Tokenization involves breaking down the original text into smaller
components known as tokens. These tokens can be created based on
contiguous sequences of characters or words. The example below
demonstrates tokenization with spaCy.

import spacy
nlp = spacy.load('en_core_web_sm')
mystring = '"The dedicated runner, after running for hours, finally ran across t
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

# Counting tokens
print("\n Token Counts:", len(doc))

# Counting vocab entries
print("\n Vocab Entries: " + str(len(doc.vocab)))

An n-gram is a consecutive sequence of n items, where an item can be a
character or a word.

from nltk import bigrams

text = """The dedicated runner, after running for hours, finally ran across the
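# Split the text into lines, then each line into whitespace-separated words
lines = map(str.split, text.split('\n'))
for line in lines:
    # Print each bigram of the line on its own row
    print("\n".join([" ".join(bi) for bi in bigrams(line)]))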
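The same idea generalizes to any n and to characters as well as words. The helper below is a small plain-Python sketch (the name `ngrams` is chosen for this example) that slides a window of size n over a sequence.

```python
def ngrams(items, n):
    """Return all consecutive n-item tuples from a sequence."""
    return list(zip(*(items[i:] for i in range(n))))

words = "the dedicated runner ran".split()
print(ngrams(words, 2))   # word bigrams
print(ngrams(words, 3))   # word trigrams
print(ngrams("run", 2))   # character bigrams
```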
The following example illustrates how to build a word index, convert
sentences into integer sequences, and pad them using the TensorFlow
framework. In NLP model training, it is also common to create input-output
pairs, where the input consists of a sequence of words or characters, and
the output is the subsequent word or character.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define your input texts
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index

# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)

# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)

# Print the result
print("\nPadded Sequences:")
print(padded)
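The input-output pairs mentioned before the TensorFlow example are not shown in that code, so here is a minimal plain-Python sketch of how they might be built. The helper `pad_pre` is hypothetical; it pads on the left, which is also the default side used by Keras `pad_sequences`.

```python
# Build a word index by hand (index 0 is reserved for padding).
sentence = "i love my dog".split()
word_index = {word: i + 1 for i, word in enumerate(dict.fromkeys(sentence))}
encoded = [word_index[w] for w in sentence]

def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a sequence to exactly `maxlen` items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# Each prefix of the sentence is an input; the word after it is the label.
maxlen = len(encoded) - 1
pairs = [(pad_pre(encoded[:i], maxlen), encoded[i]) for i in range(1, len(encoded))]

for inputs, label in pairs:
    print(inputs, "->", label)
```

Each row pairs a padded prefix of the sentence with the index of the word that follows it, which is the shape of data a next-word prediction model would train on.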

Subword tokenization is a text tokenization technique that breaks down
words into smaller units, known as subwords or subword units. Unlike
traditional word-based tokenization, where each word is considered a single
token, subword tokenization allows the representation of words as a
sequence of subword units.

import tensorflow_datasets as tfds

# Download the subword encoded pretokenized dataset

imdb_subwords, info_subwords = tfds.load("imdb_reviews/subwords8k", with_info=Tr
train_data, test_data = imdb_subwords['train'], imdb_subwords['test']

# Get the encoder
tokenizer_subwords = info_subwords.features['text'].encoder

# Define sample sentence
sample_string = 'TensorFlow, from basics to mastery'

# Encode using the subword text encoder
tokenized_string = tokenizer_subwords.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

# Decode and print the results
original_string = tokenizer_subwords.decode(tokenized_string)
print('The original string: {}'.format(original_string))
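How a word might be split into subword units can be sketched with a greedy longest-match rule over a fixed vocabulary. This is a simplified illustration, not the algorithm used by the TensorFlow Datasets encoder; the vocabulary below is a made-up assumption, whereas real subword vocabularies are learned from a corpus.

```python
# Toy subword vocabulary (an assumption; real vocabularies are learned from data).
VOCAB = {"token", "ization", "un", "break", "able", "s"}

def greedy_subwords(word, vocab=VOCAB):
    """Split `word` by repeatedly taking the longest vocabulary prefix.

    Characters not covered by the vocabulary are emitted one at a time.
    """
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])  # fallback: single character
            start += 1
    return pieces

print(greedy_subwords("tokenization"))
print(greedy_subwords("unbreakable"))
print(greedy_subwords("tokens"))
```

Because unseen words can still be assembled from known pieces or single characters, a subword scheme of this kind avoids mapping every unknown word to a single out-of-vocabulary token.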

9. Word vectors/word embeddings

Basic word representations can be classified into three categories:
integers, one-hot vectors and word embeddings. Word embedding is a
technique that represents words as dense, low-dimensional vectors in a
continuous vector space. The main objective of word embeddings is to
capture the semantic and contextual relationships between words. For
instance, when visualizing word embeddings in 2D, similar words tend to be
located close to each other.
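The notion of similar words being close together is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors, which are assumed values for illustration and not trained embeddings, and contrasts them with one-hot vectors.

```python
import math

# Hand-made 3-dimensional vectors (assumed values, not trained embeddings).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# One-hot vectors: one dimension per word, a single 1 in each.
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0]}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(embeddings["cat"], embeddings["dog"]), 3))
print(round(cosine(embeddings["cat"], embeddings["car"]), 3))
print(cosine(one_hot["cat"], one_hot["dog"]))
```

With these assumed vectors, "cat" scores much closer to "dog" than to "car", while any two distinct one-hot vectors are orthogonal and therefore carry no similarity information.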

Word Embedding Methods:

Continuous bag-of-words (CBOW): the model learns to predict the center
word given some context words.

Continuous skip-gram / Skip-gram with negative sampling (SGNS): the model
learns to predict the words surrounding a given input word.

word2vec (Google, 2013): overcomes the limitations of BoW and TF-IDF
by preserving contextual information and representing words in a dense
vector space. It does not handle out-of-vocabulary (OOV) words well.

Global Vectors (GloVe) (Stanford, 2014): factorizes the logarithm of the
corpus’s word co-occurrence matrix, which counts how often pairs of words
appear together.

Deep learning-based contextual embeddings include BERT and GPT.
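The difference between the CBOW and skip-gram objectives listed above is easiest to see in the training examples each one produces. The following sketch only generates those examples from a token list; it does not train a model, and the function names and window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs: the skip-gram view of a sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=1):
    """Return (context_words, center) examples: the CBOW view."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "runner", "ran"]
print(skipgram_pairs(tokens))
print(cbow_examples(tokens))
```

Skip-gram turns each center word into several single-word prediction targets, whereas CBOW groups the surrounding words into one input and predicts the center word from them.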

The code provided demonstrates two ways of adding the embedding layer in
a TensorFlow model. The first method is to use the Embedding layer. The
second method is the use of TensorFlow Hub to build a neural network
model using the Universal Sentence Encoder as a pre-trained embedding
layer. By setting trainable=True, the layer parameters can be fine-tuned
during training.

import tensorflow_hub as hub

import tensorflow as tf

# Embedding method 1: a trainable Embedding layer
model = tf.keras.Sequential([
    # Embedding layer parameters:
    #   input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
    #   output_dim: Integer. Dimension of the dense embedding.
    #   input_length: Length of input sequences, when it is constant.
    # Input shape: 2D tensor with shape (batch_size, input_length).
    # Output shape: 3D tensor with shape (batch_size, input_length, output_dim).
    tf.keras.layers.Embedding(input_dim=num_words, output_dim=embedding_dim, inp
    tf.keras.layers.Dense(265, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    # A single output unit needs sigmoid; softmax over one unit always returns 1
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Embedding method 2: a pre-trained TensorFlow Hub layer
# NOTE: the opening hub.KerasLayer(...) line is missing from the source;
# use_handle is a placeholder for the Universal Sentence Encoder handle.
model = tf.keras.Sequential([
    hub.KerasLayer(use_handle,
                   trainable=True, dtype=tf.string, input_shape=[]),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

10. Bag-of-Words/TF-IDF
The bag-of-words approach represents text as an unordered collection, or
“bag”, of individual words or tokens. It creates a numerical representation
of a document or corpus by tallying the occurrences of each word in the
text. Because the original word order is discarded, the representation
captures only how often each word occurs.
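A minimal sketch of this counting step in plain Python follows. The two example documents are made up to show the consequence of discarding order: they contain the same words in a different arrangement.

```python
from collections import Counter

docs = ["the dog chased the cat", "the cat chased the dog"]

# Shared vocabulary, sorted so every document uses the same column order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count how many times each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Both sentences receive the same vector even though they describe opposite events, which illustrates the information lost when word order is ignored.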

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting
scheme applied to the bag-of-words representation. Its purpose is to assign
weights to words that reflect their importance within a document in the
context of a larger collection of documents, known as a corpus.

TF-IDF takes into account two key factors:

1. Term Frequency (TF): This measures how frequently a term (word)
appears in a document. It assigns a higher weight to words that occur more
frequently within the document. The formula for TF is:

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

2. Inverse Document Frequency (IDF): This part measures how unique or
rare a term is across all documents. It assigns a higher weight to words that
appear less frequently across the corpus but provide more unique or
informative content.

$$\mathrm{IDF}(t, D) = \log\left(\frac{N}{1 + \text{number of documents containing term } t}\right)$$

where N is the total number of documents in the corpus D.

3. TF-IDF Calculation: The score is calculated by multiplying TF and IDF:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

If the TF-IDF value is high, it means the term is both common in the
document and rare across the entire corpus, making it a distinctive feature
of that document.
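The definitions above can be computed directly in a few lines. This is a sketch of the formulas as stated in this article, with the +1 in the IDF denominator, on a made-up three-document corpus. It is not a reimplementation of scikit-learn, whose `TfidfVectorizer` by default uses a different smoothed IDF and normalizes each document vector, so its numbers will differ.

```python
import math

docs = [
    "the runner ran".split(),
    "the runner smiled".split(),
    "the crowd cheered".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency with +1 in the denominator."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (containing + 1))

def tf_idf(term, doc, docs):
    """TF-IDF score of `term` in `doc`, relative to the corpus `docs`."""
    return tf(term, doc) * idf(term, docs)

for term in ["the", "runner", "ran"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In this toy corpus "ran" appears in only one document and scores highest, while "the" appears in every document and scores lowest. With this particular IDF variant a term found in all documents gets a negative weight, which is one reason libraries often use smoothed alternatives.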

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The dedicated runner, after running for hours, finally ran across the finis
    "I enjoy running",
    "I like to run in the morning"
]

vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(documents)

feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix

for i in range(len(documents)):
    print("Document:", i + 1)
    for j, feature in enumerate(feature_names):
        tfidf_value = tfidf_vectors[i, j]
        # Only show terms that actually occur in this document
        if tfidf_value != 0:
            print(feature, ":", tfidf_value)


End note:
In summary, this NLP primer covered ten basic concepts, from tokenization
and word embeddings to part-of-speech tagging and named entity
recognition. This foundational understanding sets the stage for Part II,
where we’ll explore practical NLP applications across diverse domains.

Written by ChenDataBytes

Data Scientist & Machine Learning Engineer in London.