
Classifying E-procurement Contracts using Classical Machine Learning,
Bidirectional Long/Short-Term Memory and Transformer Pretrained Embeddings
Frank Mitchell
CS958: Project
Computer Science Department
University of Strathclyde, Glasgow

August 16, 2020


This dissertation is submitted in part fulfilment of the requirements for
the degree of MSc of the University of Strathclyde.

I declare that this dissertation embodies the results of my own work and
that it has been composed by myself.

Following normal academic conventions, I have made due acknowledgement to the work of others.

No ethics approval was required.

I give permission to the University of Strathclyde, Department of Computer and Information Sciences, to provide copies of the dissertation, at cost, to those who may in the future request a copy of the dissertation for private study or research.

I give permission to the University of Strathclyde, Department of Computer and Information Sciences, to place a copy of the dissertation in a publicly available archive.

I declare that the word count for this dissertation (excluding title page,
declaration, abstract, acknowledgements, table of contents, list of
illustrations, references and appendices) is 8785.

Signed: Frank Mitchell

Date: 18th August 2020

Abstract

Here we present our work classifying e-procurement contracts for BiP Solutions. The
work here is intended to enhance the client search engine within the company by ac-
curately predicting the Nature of Contract. By improving this metadata for BiP we
aim to increase the value of service for clients, stakeholders and taxpayers. Our work
follows from the research of Adhikari et al in 2019, which compared a simple, one-layer
Bidirectional Long/Short-Term Memory (BiLSTM) network with more complex neu-
ral structures [2]. Here we aim to add to this research by conducting a comparative
study of BiLSTM with current state-of-the-art Transformer Learning systems (which utilise
powerful pretrained word embeddings). This analysis uses the Macro F1 score as our
evaluation metric. We also explored the main classical machine learning methods used
for text classification and found that the Linear SVM algorithm could match the perfor-
mance of both the complex Transformer architectures, as well as the simpler BiLSTM
structure. Our recommendation for the company - on the basis of interpretability, scal-
ability and accuracy - is to implement a classical model for the enhancement of their
e-procurement search system.

Contents

Abstract ii

List of Figures v

List of Tables vii

Acknowledgements ix

1 Introduction 1

2 Related Work 4
2.1 Fundamental NLP Processes . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Pre-processing: Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Pre-processing: Tokenisation . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Preprocessing: Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 Benchmark Document Classification: Classical Approaches . . . . . . . 9
2.6.1 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6.2 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6.3 Linear Support Vector Machines . . . . . . . . . . . . . . . . . . 10
2.7 Benchmark Document Classification: Neural Networks . . . . . . . . . . 10
2.7.1 Convolutions for Document Classification . . . . . . . . . . . . . 11
2.7.2 Recurrence for Document Classification . . . . . . . . . . . . . . 11
2.8 Hierarchical Attention Networks . . . . . . . . . . . . . . . . . . . . . . 13
2.9 Attention Is All You Need . . . . . . . . . . . . . . . . . . . . . . . . . . 14


2.10 State of Art in Document Classification . . . . . . . . . . . . . . . . . . 15


2.10.1 Bidirectional Encoder Representations from Transformers (BERT) 15
2.10.2 Bidirectional Long-Short Term Memory Network (BiLSTM) . . . 17

3 Methodology 19
3.1 Technical Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Target Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.4 Target Variable Distribution . . . . . . . . . . . . . . . . . . . . 21
3.2.5 A True Test Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.6 Independent Variables . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.7 Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Pre-processing the documents . . . . . . . . . . . . . . . . . . . . 23
3.3.2 Embedding Technique: TF-IDF Representation . . . . . . . . . . 24
3.3.3 Neural Embedding Technique: Continuous Bag of Words . . . . . 25
3.4 Models & Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 Building A Baseline With Classical Models . . . . . . . . . . . . 26
3.4.2 Improving the baseline with BiLSTM . . . . . . . . . . . . . . . 27
3.4.3 Improving the baseline with BERT . . . . . . . . . . . . . . . . . 28

4 Project Analysis 29
4.1 Correlation with Target . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Classical Model Baseline: Description . . . . . . . . . . . . . . . . . . . 30
4.2.1 Naive Bayes Analysis . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.2 XGBoost Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3 Linear SVM Analysis . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 TF-IDF Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Improving Classical Baseline: Title . . . . . . . . . . . . . . . . . . . . . 36


4.5 Masked Class Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4.6 Combining Features Analysis . . . . . . . . . . . . . . . . . . . . . . . . 39
4.7 BiLSTM Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.7.1 BiLSTM: Raw Title . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.7.2 BiLSTM: Raw Description . . . . . . . . . . . . . . . . . . . . . 41
4.7.3 BiLSTM: Cleaned Title . . . . . . . . . . . . . . . . . . . . . . . 41
4.7.4 BiLSTM: Cleaned Description . . . . . . . . . . . . . . . . . . . 42
4.7.5 BiLSTM: Train, Validation Test Splits . . . . . . . . . . . . . . 43
4.8 BERT Classifier Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.8.1 BERT Classifier: Title . . . . . . . . . . . . . . . . . . . . . . . . 44
4.8.2 BERT Classifier: Description . . . . . . . . . . . . . . . . . . . . 45
4.9 RoBERTA Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.9.1 RoBERTA Classifier: Title . . . . . . . . . . . . . . . . . . . . . 47
4.9.2 RoBERTA Classifier: Description . . . . . . . . . . . . . . . . . . 48
4.9.3 Results Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 Conclusions 50

List of Figures

2.1 Full NLP Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


2.2 Example of tokenisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 TF-IDF Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Example TF-IDF Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Recall Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Precision Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Macro F1 formula for binary problem . . . . . . . . . . . . . . . . . . . 8
2.8 Convolutional Neural Network for Document Classification (Quispe et al., 2018) . . . . 11
2.9 RNN Multi-task Learning: Uniform-layer Architecture . . . . . . . . . . 12
2.10 RNN Multi-task Learning: Coupled-layer Architecture . . . . . . . . . . 12
2.11 RNN Multi-task Learning: Shared-layer Architecture . . . . . . . . . . . 12
2.12 Hierarchical Attention Network . . . . . . . . . . . . . . . . . . . . . . . 13
2.13 Attention is all you need . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.14 Bidirectional Encoder Representations from Transformers (BERT) ar-
chitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.15 Bidirectional Long-Short Term Memory Architecture . . . . . . . . . . . 17

3.1 Target Label Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.1 Cramer’s V. Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


4.2 Naive Bayes with Description Classification Report . . . . . . . . . . . . 31
4.3 Naive Bayes with Description Confusion Matrix . . . . . . . . . . . . . . 31


4.4 XGBoost with Description Classification Report . . . . . . . . . . . . . 32


4.5 XGBoost with Description Confusion Matrix . . . . . . . . . . . . . . . 32
4.6 Linear SVM with Description Classification Report . . . . . . . . . . . . 33
4.7 Linear SVM with Description Confusion Matrix . . . . . . . . . . . . . . 33
4.8 TF-IDF Embeddings on Description with Unigrams . . . . . . . . . . . 34
4.9 TF-IDF Embeddings on Description with Bigrams . . . . . . . . . . . . 35
4.10 Linear SVM with Description Bigrams Classification Report . . . . . . . 36
4.11 Linear SVM with Description Bigrams Confusion Matrix . . . . . . . . . 36
4.12 TF-IDF Embedding using Title and Bigrams . . . . . . . . . . . . . . . 37
4.13 Linear SVM with Title Bigrams Classification Report . . . . . . . . . . 38
4.14 Linear SVM with Title Bigrams Confusion Matrix . . . . . . . . . . . . 38
4.15 Linear SVM with Title & Masked Classes . . . . . . . . . . . . . . . . . 39
4.16 BiLSTM on Raw Title, Classification Report . . . . . . . . . . . . . . . 40
4.17 BiLSTM on Raw Description, Confusion Matrix . . . . . . . . . . . . . 41
4.18 BiLSTM on Clean Title, Classification Report . . . . . . . . . . . . . . . 42
4.19 BiLSTM on Clean Title, Confusion Matrix . . . . . . . . . . . . . . . . 42
4.20 BiLSTM on Clean Description, Classification Report . . . . . . . . . . . 43
4.21 BiLSTM on Clean Description, Confusion Matrix . . . . . . . . . . . . . 43
4.22 BERT on Title, Classification Report . . . . . . . . . . . . . . . . . . . . 44
4.23 BERT on Title, Confusion Matrix . . . . . . . . . . . . . . . . . . . . . 45
4.24 BERT on Description, Classification Report . . . . . . . . . . . . . . . . 46
4.25 BERT on Description, Confusion Matrix . . . . . . . . . . . . . . . . . . 46
4.26 RoBERTA on Title, Classification Report . . . . . . . . . . . . . . . . . 47
4.27 RoBERTA on Title, Confusion Matrix . . . . . . . . . . . . . . . . . . . 47
4.28 RoBERTA on Description, Classification Report . . . . . . . . . . . . . 48

List of Tables

4.1 Table of Correlation with Target Value. . . . . . . . . . . . . . . . . . . 30


4.2 Main Results Table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Acknowledgements

The following people were integral to my journey into Data Science and I’d like to
acknowledge their contribution to my life.
First, Mr. Paul M. Coker, whose tireless dedication to education and life-long
commitment to technology has gifted me with a deep and lasting love for computer
programming.
Also, Mr. Tom Shering, whose towering intellect, burning passion and inexhaustible
patience opened my eyes to the mesmerising beauty of mathematics.
Finally to Neil and Eileen Mitchell, without whom I wouldn’t have the drive to
succeed and the courage to meet change with open arms. Thank you Mum and Dad.

Chapter 1

Introduction

In the following project we have been tasked with using artificial intelligence techniques
to accurately predict the nature of e-procurement contracts for BiP Solutions, a well-
established e-procurement firm based in Glasgow.
The tendering of public procurement contracts is a worldwide, multi-million pound
business [24] that has a direct impact on local and national economies, making the process
of facilitating this industry of great importance to taxpayers and stakeholders alike.
BiP Solutions hold over 35 years’ experience in the industry, acting as an interface
between public authorities and private vendors. Every day a multitude of geographic
and domain specific procurement portals tender thousands of contracts, resulting in
the need for a contracts aggregation system that collates this procurement data into a
user-friendly, searchable system.
BiP operate one of industry’s market-leading search-engines, serving 120,000 con-
tract notices and awards every month and aggregating contract information from over
1500 sources [24].
Our deep learning model aims to improve this system by predicting the Nature of
Contract, increasing metadata and enhancing professional search subscriptions for BiP’s
users. The importance of this project is reflected in the global ramifications of improved
procurement systems for local and national governmental authorities; delivering better
value for all.
The challenge of classifying contracts falls under the domain of professional search.


A search problem is considered professional if it is carried out for business benefit, within a specific business domain, yielding a high monetary value and performed while under some constraint (time, money, etc.).
Companies pursuing business opportunities through “professional search” will often
limit the query to domain specific keywords which often appear as a label on each
document. Our research aims to build a machine learning system to accurately predict
these keywords – known as Nature of Contract – to increase search relevancy for BiP
Solution’s clients.
There have been countless research papers in the field of document classification in
the past twenty years, with deep learning taking centre stage in recent times. A number
of complex architectures have achieved outstanding results on popular benchmark tests,
however, a recent paper from 2019 indicates that high-parameter, complex structures
are not always the best choice [2].
Adhikari et al demonstrate that a simple Bidirectional Long Short-Term Memory
(BiLSTM) network, with a number of regularisation and optimisation techniques, has
the ability to outperform numerous complex architectures on a variety of popular text
classification datasets.
Their work compares BiLSTM to a series of other architectures, including hier-
archical modelling [31], XML-CNN [22], KimCNN [20] and SGM [35], but leaves the
question of BiLSTM vs. Transformer Learning models (using pretrained embeddings
such as BERT or RoBERTa) as an open challenge.
Our paper aims to compare the simple BiLSTM architecture outlined by Adhikari
et al, with a variety of Transformer Learning systems (such as ELMo, BERT and
RoBERTa) on the task of classifying e-procurement contracts.
The model will be evaluated according to the company requirements: high accuracy (approx. 95% or above), efficiency (able to process and predict roughly 3,000-4,000 contracts per day) and scalability (able to grow as BiP Solutions expand their market po-
sition). We aim to achieve this by exploring both classical machine learning approaches
and modern neural network solutions, evaluating the effectiveness of each solution in
meeting the company requirements.


The purpose of conducting this research is to achieve the best possible model for a
deployment scenario. We hope that by opting for a simpler solution - one that’s easier
to train and less sensitive to hyperparameters - we have a greater chance of achieving
our goal of producing an efficient, accurate and scalable predictive system.
Following our introduction, we examine previous work on the subject of text classi-
fication, before outlining our approach, methodology and analysis. Finally, we present
our results across a number of different settings and make a recommendation for im-
plementation.

Chapter 2

Related Work

2.1 Fundamental NLP Processes

Figure 2.1: Full NLP Pipeline

Here we have the most fundamental steps in any Natural Language Processing
Pipeline, outlined in figure 2.1.
Over the years, document classification has been tackled by countless researchers,
resulting in a predefined set of methods and practices which standardise parts of the
process. A few common machine learning models have been shown to excel at this task,
and conducting processing steps like tokenisation, stopword removal and lemmatisation
has been shown to improve predictive capability [14].
After our initial text processing, we then represent the data in numerical format.
This process is known as Embedding, and the techniques used to achieve this result are
outlined in Related Work section 2.4 below.

2.2 Pre-processing: Pipeline

Throughout our research we follow the steps of the pipeline outlined in figure 2.1, mak-
ing a slight deviation from this pattern when moving from classical machine learning to
deep learning neural networks. The powerful Tensorflow library allows us to bake the
processing into the model, so that the Continuous Bag of Words (CBoW) embeddings
(see Section 3.3.3 ) are created at runtime. Although the number of steps remain the
same, by imbuing the model with the ability to embed text we reduce the computational
time required to process contracts, which could allow for easier deployment.
However, the number of processing steps that we can use in this way is very limited.
The inbuilt Tensorflow text pre-processing will only make data lower case and remove
all punctuation. As such, an examination of each approach on the model’s performance
balanced with company requirements will need to be considered.

2.3 Pre-processing: Tokenisation

Figure 2.2: Example of tokenisation


The first step is cleaning and processing the text; converting to lower-case, conduct-
ing stopword removal to remove common words, then stemming and lemmatisation to
convert words to their root form. This whole process is sometimes referred to as To-
kenisation and results in our document represented as a sequence of individual tokens
(strings) as shown in Figure 2.2. This type of processing cannot be baked into the
model and must be performed by third-party functions (found in our functions.py file).
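To make these steps concrete, the following is a minimal sketch of such a tokenisation routine using NLTK; the function name and example sentence are illustrative, and the NLTK stopword and WordNet corpora are assumed to have been downloaded already.

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def tokenise(text):
    # Lower-case, split into tokens, drop punctuation/stopwords, then lemmatise.
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(tokenise("Supply and installation of LED street lighting."))
# e.g. ['supply', 'installation', 'led', 'street', 'lighting']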

2.4 Preprocessing: Embeddings

The next preprocessing step is representing the documents in numerical form. There are
a number of common techniques used to embed textual information for machine learning
systems, but the approach that has yielded the best results – for document classification
– is Term-Frequency Inverse-Document-Frequency (TF-IDF) representation ([15], [27],
[1]).
TF-IDF evaluates the importance of each word and increases this importance pro-
portional to the number of times the word appears in the document. This is then offset
by the number of times that word appears in the whole corpus.

Figure 2.3: TF-IDF Formula

The formula in Figure 2.3 calculates the TF-IDF weight W_{i,j} for word j in document i by first calculating the term frequency of word j in document i, then multiplying it by the inverse document frequency, log(N/df_j), where N is the total number of documents and df_j is the number of documents containing word j. By calculating the TF-IDF weight for each word we can create a sparse n ∗ m matrix with
n unique words and m documents to represent our data (Figure 2.4 below).
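Written out (a reconstruction consistent with the description above; Figure 2.3 shows the original notation), the weight is

W_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_j}\right)

where tf_{i,j} is the frequency of word j in document i, N is the total number of documents in the corpus, and df_j is the number of documents that contain word j.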


Figure 2.4: Example TF-IDF Matrix

As demonstrated by the work of Goldberg, Peng and Adhikari, this embedding approach has shown promise in both classical and neural architectures, suggesting that semantic meaning is arguably less important when classifying documents.

2.5 Evaluation Metrics

Historically, document classification models have used ROC AUC (Receiver Operating
Characteristic – Area Under the Curve) as an evaluation metric in a binary scenario.
An ROC Curve plots the True Positive rate as a function of the False Positive rate
and provides a representation of the model’s performance in two-dimensional space.
The AUC then compresses this 2d plot to a single evaluation metric, with a perfect
model scoring 100% AUC and a model that makes random decisions scoring approx.
50% AUC [33].
In terms of our multiclass problem, we have a number of evaluation metrics to
consider. Some researchers choose to focus on recall (how many samples within the
positive class were predicted correctly), aiming to minimise False Negatives [3]. Others
also include precision (how many of the model’s positive predictions were correct) as
a means of evaluating a classifier’s performance [7].


Figure 2.5: Recall Formula

Figure 2.6: Precision Formula

A better single metric for a multi-class problem is the Macro Average, more specifically the Macro F1 score: the F1 score (the harmonic mean of precision and recall) is calculated per label and then averaged across all labels. This value is more robust than precision or recall alone, because our classifier has to score both high precision (a large proportion of its positive predictions are correct) and high recall (few positive instances are missed) in order to achieve a high F1. The Macro Average is simply the average of the F1 scores across all classes.

Figure 2.7: Macro F1 formula for binary problem

Macro averages are widely used to evaluate the effectiveness of multi-class document
classification systems ([4], [30], [10], [29]) and will be used to evaluate our e-procurement
contract classification system. Accuracy will hence refer to Macro F1 unless otherwise
stated.
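As an illustration of how this metric is computed in practice, the following minimal sketch uses SKLearn; the labels shown are toy examples rather than real contract data.

from sklearn.metrics import classification_report, f1_score

y_true = ["Services", "Works", "Supplies", "Services", "Works"]
y_pred = ["Services", "Works", "Services", "Services", "Supplies"]

# Macro F1: the F1 score is computed per class, then averaged with equal weight.
print(f1_score(y_true, y_pred, average="macro"))

# The per-class breakdown mirrors the classification reports shown in Chapter 4.
print(classification_report(y_true, y_pred))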


2.6 Benchmark Document Classification: Classical Approaches

2.6.1 Naïve Bayes

There are various benchmark datasets and models that have been used extensively for
developing document classification systems. The first we chose to explore was Naive
Bayes, an algorithm derived from Bayesian Decision Theory that seeks to calculate a
posterior probability given prior information.
The presence or absence of each word is considered a feature, and the model treats
each feature as statistically independent (hence the moniker Naive) [16]. In our case
we are calculating the posterior probability that a given document belongs to a specific
class given the set of words that make up that document’s features.
Kibriya et al used the algorithm on the benchmark 20newsgroups data in 2004 (a
collection of 18,000 newsgroup posts spanning 20 topics). The team achieved what was then an impressive 88.36% accuracy on this popular multi-class problem [19].

2.6.2 XGBoost

When exploring classical machine learning models, it was important that we cover the
best known algorithms for text classification. Together with the probabilistic approach
outlined above, we thought it pertinent to also include a tree-based model.
In recent years, the tree-based solution which has consistently yielded the most promising results is XGBoost, a highly scalable and efficient tree boosting system that has dominated Kaggle leader boards and machine learning competitions since its release [5].
Chen & Guestrin used XGBoost on the Allstate Insurance Claim data (a record of
over 580,000 insurance claims) in 2017, achieving a mean Area Under the Curve (AUC)
score of 83% [8]. Further support for XGBoost comes from the work of Torlay et al
in 2017, who used the algorithm on the binary task of predicting epilepsy diagnosis from various language representation tasks, achieving a mean AUC of 91% [33].


2.6.3 Linear Support Vector Machines

Our final classical approach uses Linear Support Vector Machines (SVM), a very pop-
ular method for document classification that’s been shown to produce a high degree of
accuracy in a number of research papers.
The Support Vector Machine is an optimal margin classifier, meaning it attempts to find the largest gap between instances that lie on a set of margin borders. Non-linear variants make use of various kernel tricks which allow manipulation of the feature space, learning representations in high-dimensional space from low-dimensional data.
Linear SVM is traditionally a binary classifier; however, the SKLearn implementation of this model handles multi-class problems using a One-vs-The-Rest scheme, meaning we have the same number of classifiers as we have classes. Each classifier is assigned a class and labels every instance as either belonging to that class (a positive sample) or not belonging to that class (a negative sample).
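A minimal sketch of this One-vs-The-Rest behaviour in SKLearn is shown below; the documents and labels are toy stand-ins, and LinearSVC fits one binary separator per class by default.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["construction of a new school building",
        "supply of office furniture",
        "cleaning and maintenance services"]
labels = ["Works", "Supplies", "Services"]

X = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC(multi_class="ovr")      # the default One-vs-The-Rest scheme
clf.fit(X, labels)

# One decision function per class: shape (n_documents, n_classes).
print(clf.decision_function(X).shape)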
Work by Anne et al from 2017 attempted to use Linear SVM to classify NASA
patent documents. Using a poly-kernel based implementation the team were able to
achieve 64.6% accuracy when predicting on their validation set [3]. Our work supports
the findings of Anne et al, in that our SVM classifier, using a One-vs-The-Rest scheme,
well outperformed Naïve Bayes and XGBoost when classifying e-procurement contracts.

2.7 Benchmark Document Classification: Neural Networks

The success of Naïve Bayes, XGBoost and Linear SVM has been surpassed in recent
years through the use of Neural Networks. These powerful digital structures have the
ability to learn complex patterns from high-dimensional data. They achieve this by
incorporating ideas from neuroscience that allow researchers to develop systems that
mirror aspects of the human brain.
Here we present the achievements in document classification using a variety of
interesting neural network solutions.


2.7.1 Convolutions for Document Classification

Typically used for Computer Vision tasks, Convolutional Neural Networks (CNN) aim
to mimic the architecture of the visual cortex. This is achieved using a number of filters
that expose the model to limited regions of the input, allowing it to learn from different
parts of the image.

Figure 2.8: Convolutional Neural Network for Document Classification (Quispe et al., 2018)

Convolution can also be used for document classification, as shown in figure 2.8.
The input (an embedded document) is scanned using a number of filters and convo-
lutional layers, outputting a final hidden state vector. This vector is then fed to a
fully-connected feed-forward neural network for classification.
This approach has seen some success on benchmark datasets in the field of document
classification, such as the CNN from Quispe et al in 2018. The team used convolutional
layers as an automatic feature extractor to capture meaningful information, achieving
88.8% and 93.8% accuracy on the AG News and Sogou News datasets [28].

2.7.2 Recurrence for Document Classification

A more common neural architecture for document classification comes in variations of Recurrent Neural Networks (RNN). This type of network imbues the system with a
modicum of memory by, “allowing the network’s hidden units to see its own previous
output, so that the subsequent behaviour can be shaped by previous responses. These
recurrent connections are what give the network memory” [13].


Figure 2.9: RNN Multi-task Learning: Uniform-layer Architecture

Figure 2.10: RNN Multi-task Learning: Coupled-layer Architecture

Figure 2.11: RNN Multi-task Learning: Shared-layer Architecture

An interesting take on RNNs in NLP is applying multi-task learning to the problem, using the correlation between related tasks to improve classification by learning them
in parallel. Liu et al propose the three models above to leverage supervised data from
many related tasks to make task specific classifications [23]. Tested against another
popular dataset (IMDB Movie Reviews, containing over 50,000 film reviews from the
IMDB website) the team scored 92.7% accuracy.


2.8 Hierarchical Attention Networks

In terms of recent advances in the field, we can find a notable improvement in the work of
Seo et al, where they outline a new architecture called Hierarchical Attention Networks
(HAN). This work has a hierarchical structure which mirrors the hierarchical nature of
documents, and has two levels of attention mechanism applied (one at the word-level
and another at the sentence-level [31]. Attention allows the model to calculate the
most important words in each sentence and manifests as a weight matrix assigned to
the hidden output state from the HAN.
The basic structure of this network encompasses a word sequence encoder and word-
level attention layer, together with a sentence encoder and a sentence-level attention
layer.

Figure 2.12: Hierarchical Attention Network


Using the above implementation, Seo et al achieved mixed results on a number of benchmark datasets. The model outperforms all classical methods, as well as LSTM,
CNN and LSTM-GRNN [32], scoring an average accuracy of 69.9% on the Yelp 2013-
2015 data, and 75.8% on another common benchmark dataset, Yahoo Answers.

2.9 Attention Is All You Need

Figure 2.13: Attention is all you need

A year later, a paper from Vaswani et al demonstrated that a high level of accuracy
could be achieved using only the attention mechanism, dispensing with convolutions
and recurrence entirely (Figure 2.13 ).


This work was carried out on a machine translation task, and as such used the BLEU
score (Bilingual Evaluation Understudy score) to assess the relevancy of responses [34].
This is yet another instance where a simpler approach has proved more effective than complex neural structures, outperforming all other models when translating from English to German and from English to French.

2.10 State of Art in Document Classification

2.10.1 Bidirectional Encoder Representations from Transformers (BERT)

Current state of the art uses pretrained embeddings with Transformer Learning archi-
tectures. Leading the charge in the world of pre-trained embeddings is BERT (Bidi-
rectional Encoder Representations from Transformers), a system that pre-trains deep
bidirectional representations from unlabelled text, using Masked Language Modelling
(MLM) to condition the system on both left and right contexts [11].

Figure 2.14: Bidirectional Encoder Representations from Transformers (BERT) architecture

BERT brought massive improvements to the field of Natural Language Processing for downstream tasks. The model is trained on two pretraining tasks (MLM and next sentence prediction), using a corpus of over 800 million words.


These powerful pre-trained embeddings can then be utilised for downstream tasks
like classification or machine translation, by fine-tuning with labelled, task-specific data
[11].
The model comes with two main drawbacks, the first being that it does not have the
ability to handle long sequences of text. The maximum number of tokens the system
can manage is 512. Although this hasn’t shown any significant issues in the past, it is
certainly to be considered in our contract classification problem, where the description
of the contract could often exceed this value.
The second drawback is that the model is computationally expensive, with hundreds of millions of parameters. A distilled version of the original BERT model (DistilBERT) is available and reduces this computational cost, while RoBERTa offers a more robustly optimised variant; however, both still train with a large number of parameters.
Despite its shortcomings, BERT has become synonymous with successfully assisting
downstream Natural Language Processing tasks. The use of these powerful pretrained
embeddings has facilitated advances in a number of domains:

i VGCN-BERT: Combining BERT with Graph Convolutional Networks to improve global information capturing [18].

ii Bio-BERT : Using BERT for Biomedical Text Mining [21].

iii BERT Search: Applying BERT to improve search query results [26].

In the context of document classification, we can see BERT being applied by Ad-
hikari et al in 2019, with their DocBERT Document Classification system, marking the
first time the model has been used for this task [1]. The author notes the arguable
unimportance of syntactic structure when classifying documents (a point supported by
our own research which shows Naïve Bayes and Linear SVM performing exception-
ally well using TF-IDF embeddings). Nevertheless, BERT manages to outperform the
classical approaches across all settings.
Using two variations on BERT (the large and computationally expensive version
and the smaller, distilled version), the researchers achieved +91% accuracy on the
popular Reuters dataset. Furthermore, they show that by distilling the knowledge we
can save on computational complexity and still achieve +90% accuracy.

2.10.2 Bidirectional Long-Short Term Memory Network (BiLSTM)

Even with knowledge distillation, BERT still has hundreds of millions of parameters,
making the network massively complex and computationally expensive. Also, its ability to cope only with sequences of up to 512 tokens means it has trouble tracking long-term
dependencies. Both these issues can be mitigated using a simpler architecture - Bidi-
rectional Long-Short Term Memory Networks (BiLSTM) - which we can find mention
of all the way back in 1998 [17].
LSTM networks increase the effectiveness of vanilla RNN’s by adding an additional
hidden state that uses a number of gates to decide what to remember and what to
forget. Combining these LSTM layers into a bidirectional architecture allows the input
to be read in both directions (left to right, and right to left), further enhancing the
network’s ability to track long-term dependencies.

Figure 2.15: Bidirectional Long-Short Term Memory Architecture

As we mentioned earlier in the Introduction, a recent paper from 2019 demonstrated that a simple Bidirectional Long-Short-Term Memory (BiLSTM) network can match, and outperform, much more complex architectures in the task of document classification.
Adhikari et al’s research demonstrates that a simple, one-layer BiLSTM can achieve
excellent results (approx. 89% on the Reuters dataset), outperforming more complex
neural networks by approx. 7% [2]. The team leave the comparison of BiLSTM with Transformer Learning as an open question, one which we have chosen to explore
in our research below.

Chapter 3

Methodology

3.1 Technical Environment

We conducted our research using the following technology stack:

i Python Programming Language v.3.6.6

ii Keras 3.3.6

iii Tensorflow v.2.2.2

iv Ubuntu v.18.04

v NVIDIA RTX 2080Ti Graphics Processing Unit

Full details on package versions can be found in the requirements.txt file in the
project documentation.
Our custom functions (pre-processing, graph display, data splits etc) are all held
in a separate file (functions.py) and imported at the top of our notebooks. Likewise,
our custom classes (used to build model testing frameworks) can be imported from the
custom class.py file.
The main libraries used for pre-processing are NLTK and Spacy. Implementation
details can be found below in Methodology section 3.3.1.


3.2 The Data

3.2.1 Data Source

The data for our research was provided by BiP Solutions and consists of 350,014 ex-
ample e-procurement contracts. There are 29 features that describe the data, split
between a range of datatypes, with the main focus of our research concerning the Title
and Description of the contract.
The majority of contracts (348,940) are sourced from European vendors and include
our target label, Nature of Contract. A small number of contracts (1,074) are sourced from U.S. vendors and do not include our target label. More information on this sample can be found in Methodology section 3.2.5.

A very small number of contracts (7) are labelled combined. These instances have

3.2.2 Cross-Validation

In an effort to ensure a robust classifier we will test our model by splitting our data into
different groups. Our training data will be used (in conjunction with our validation
data) to train and validate the model. Our testing data will remain completely unseen
by the model and acts as a further checkpoint to ensure reliable predictions.
We explore three different Train, Test and Validation splits given the size of the
dataset. Here we present the details of the three most common ratios that we intend
to explore [12].

i 70% Training, 15% Validation & 15% Testing

ii 80% Training, 10% Validation & 10% Testing

iii 60% Training, 20% Validation & 20% Testing

It is common practice to test models with varying proportions of the data and we
present our findings in this regard in the Project Analysis Section 4.7.5.
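A minimal sketch of how the first of these splits (70/15/15) can be produced with SKLearn is shown below; the DataFrame and its column name are toy stand-ins for the contracts data, and stratification keeps the class proportions consistent across splits.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"description": ["contract %d" % i for i in range(100)],
                   "nature_of_contract": ["Services"] * 60 + ["Supplies"] * 25
                                         + ["Works"] * 15})

# First carve off 30% as a hold-out set, then split that hold-out in half.
train_df, holdout_df = train_test_split(df, test_size=0.30, random_state=42,
                                        stratify=df["nature_of_contract"])
val_df, test_df = train_test_split(holdout_df, test_size=0.50, random_state=42,
                                   stratify=holdout_df["nature_of_contract"])
print(len(train_df), len(val_df), len(test_df))   # 70 15 15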


3.2.3 Target Value

Our target variable - Nature of Contract - can fall into one of three categories:

i Services

ii Works

iii Supplies

As BiP Solutions expands into an American or Asian marketplace we cannot always be certain that the Nature of Contract will be properly labelled (or labelled at all), hence the motivation behind this research.
Furthermore, traditional indicators like CPV codes (Common Procurement Vocabulary codes) have not been standardised across the industry and cannot be guaranteed. The CPV code operates as a rudimentary classification system for European procurement firms; however, these processes can change from continent to continent, and as such make this an unstable variable for prediction.
In both our classical and deep learning models the target variable was one-hot
encoded using SKLearn’s OneHotLabelEncoder.

3.2.4 Target Variable Distribution

As figure 3.1 below shows, we have a large imbalance in our target distribution. This
is a common issue in machine learning, and we intend to mitigate
this issue using two well-known techniques:

i Random Up-Sampling: Randomly copying instances within the minority class(es) to balance with the majority class.

ii Random Down-Sampling: Randomly removing instances within the majority class(es) to balance with the minority class.

We prefer to retain as much data as possible, so would always opt for Random Up-Sampling in the first instance. However, with so much data, it is certainly
worth also exploring the effect of Random Down-Sampling on model performance.
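A minimal sketch of Random Up-Sampling using SKLearn's resample utility is given below; the DataFrame is a toy stand-in for the training data, and each minority class is resampled with replacement up to the size of the majority class.

import pandas as pd
from sklearn.utils import resample

train_df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "nature_of_contract": ["Services", "Services", "Services",
                           "Works", "Supplies", "Supplies"]})

majority_size = train_df["nature_of_contract"].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=majority_size, random_state=42)
    for _, group in train_df.groupby("nature_of_contract")])

print(balanced["nature_of_contract"].value_counts())   # all classes equal in size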


Figure 3.1: Target Label Distribution

3.2.5 A True Test Set

The data also includes 1074 instances with an N/A target value (figure 3.1). These
instances reflect international contracts (from America or Asia) and comprise an ex-
ample of genuinely unlabelled data.
Following the company guidelines, it would be good to manually label 200-300
documents to enhance our testing capabilities beyond the train, test and validation
split using the European examples. Through creation of this genuine set we would
hope to make our model more robust to market changes in the future. Unfortunately,
due to resource constraints within BiP Solutions this was not possible for our research.

3.2.6 Independent Variables

As mentioned in Methodology section 3.2.1, the features that describe contract data
have not been standardised across the industry. We use the features of the data to train
a predictive model, so missing values in fields that are important for an accurate prediction are
also a major concern.


The minimum required data for a human being to classify procurement contracts
would be Title or Description, as these fields will contain the information to allow
categorisation of the document. On this basis our research focuses on the exploration of
the above two features in terms of predicting the nature of contract. Some rudimentary
analysis of other variables was conducted, but these were found to have very little correlation to the target variable (details found in Project Analysis Section 4.1).

3.2.7 Correlations

Before conducting any major pre-processing steps, we wanted to check the correlation
of independent variables with the target variable. This was to save time exploring
features which have little effect on our predictions. Our method here consisted of the
following steps:

i Check the correlation of the main independent variables (Title and Description).

ii Examine the correlation of a number of other potential indicators.

iii Use the strongest features to develop baseline performance using classical machine
learning methods.

The results of our study using Naïve Bayes, XGBoost and Linear SVM can be
found in our Project Analysis section. Each of the other features was text based and
went through the necessary preprocessing steps outlined in the Background Research.
Our method of embedding differed between classical and neural models and is outlined
below in Methodology section 3.3.2.

3.3 Data Processing

3.3.1 Pre-processing the documents

Our initial preprocessing was undertaken using the Natural Language Tool Kit (NLTK). This widely used library is popular with researchers and comes
with the ability to conduct stopword removal, lemmatisation and stemming. Our cus-
tom function takes a data frame as input and iterates through it to create a new cleaned
text column.
We used another popular library, Spacy, in an attempt to improve our processing.
This library works better in a deployment environment as it’s built on top of Cython
(a C optimised version of Python) so allows for faster manipulation of data. The only
prerequisite is making sure the function that utilises the library is vectorised (it accepts
and returns NumPy arrays), which is not possible when dealing with text. The issue
of speeding up Spacy to process large amounts of text is an open research question and
one we leave in the hands of the BiP Solutions development team.
In our deep learning model, we have also created the option to bake the preprocess-
ing into the model. Rather than create a TF-IDF matrix, the deep learning system
conducts Continuous Bag-of-Words (CBoW) embedding at runtime, allowing for much
faster preprocessing (as the text does not require parsing by a third-party function
before training the model).
As mentioned in Related Work section 2.2 however, this comes with a limited num-
ber of processing functions. The accuracy of a deep learning model fed with minimally
processed text would need to weighted against the processing time required by our
custom functions. A comparative examination of pre-baked processing and third-party
parsing can be found in Project Analysis section 4.7.

3.3.2 Embedding Technique: TF-IDF Representation

When building our classical models, we used the TF-IDF embedding technique explained in the Background Research section 2.4. As demonstrated, this form of
converting documents into numerical form is highly effective when using classical models
to classify text.
The TF-IDF embedding was conducted using SKLearn, a very popular python li-
brary with built-in classes and methods for processing text. First, we use the CountVectorizer object to convert a collection of texts to an n ∗ m matrix of token counts (figure 2.4), with n unique words by m documents.


The following settings were found to be optimal:

i stop words = [‘works’, ‘services’, ‘supplies’] : This feeds a custom stopword list to the CountVectorizer, so these words are removed from the documents. This was
to ensure the class labels were not bleeding into the independent variables.

ii ngram range = (1, 2): This tells our vectoriser to take account of both unigrams
and bigrams. This will increase the number of features and make our matrix
sparser, but the benefit of using connecting words can be found in our comparative
analysis of a model trained using unigrams and one using both unigrams and
bigrams (Project Analysis section 4.3 )

To produce our normalised sparse TF-IDF matrix (as in figure 2.4 ) we then need
to feed the output of our CountVectorizer to a TfidfTransformer. As per the SKLearn
documentation,“The goal of using tf-idf instead of the raw frequencies of occurrence
of a token in a given document is to scale down the impact of tokens that occur very
frequently in a given corpus and that are hence empirically less informative than features
that occur in a small fraction of the training corpus.”
The TfidfTransformer object was initialised as follows:

i smooth idf=True: This smooths the IDF weights by adding one to the document
frequencies. This setting avoids division by zero errors.

ii use idf=True: This enables inverse-document-frequency reweighting.

iii sublinear tf=True: This applies term-frequency sublinear scaling, which replaces
term-frequency with 1 + log(term frequency).

The sparse TF-IDF matrix representation is used as our feature set in the classical
machine learning models.
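Putting the above settings together, a minimal sketch of the embedding step is shown below; the documents are toy stand-ins for contract descriptions.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["supply of office furniture",
        "construction works for a new bridge",
        "facility management services"]

# Token counts with the class labels masked and unigram/bigram features.
counts = CountVectorizer(stop_words=["works", "services", "supplies"],
                         ngram_range=(1, 2)).fit_transform(docs)

# Normalised, sublinearly scaled TF-IDF weights.
tfidf = TfidfTransformer(smooth_idf=True, use_idf=True,
                         sublinear_tf=True).fit_transform(counts)

print(tfidf.shape)   # (number of documents, number of unigram/bigram features)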

3.3.3 Neural Embedding Technique: Continuous Bag of Words

The deep learning model allows us to bake the processing into the system at runtime,
creating a Continuous Bag-of-Words (CBoW) embedding representation. This tech-
nique gives each word a unique index, and uses these indexes to transform a document
into a sequence of numbers. This technique does not retain any contextual meaning in
the documents, but as has been shown, semantic meaning will have less influence on
our prediction.
Our data is held as Tensors for the machine learning model (a native data structure
to Tensorflow that allows us to use low-level operations to build, train and make predic-
tions). As such, to ensure embedding at runtime, we need to use a Tensorflow wrapper
function to map the conversion of documents to a sequence of data. Implementation
details can be found in our code documentation.
These functions use a TensorFlow Tokenizer and the built-in python Counter object
to create a vocabulary of unique words assigned to a unique index. The Tensorflow
Tokenizer requires two placeholder values:

i 0 : The Tokenizer object saves the 0 value as a padding placeholder (please see
below for padding justification)

ii n + 1 : n is the number of unique words and n+1 is reserved for unknown words.

After wrapping our data in the processing function, we lastly split the data into
batches. Sequences within the same batch need to be the same length to allow the model
to train, so we used TensorFlow’s padded batch function. Please refer to functions.py
for details on our batching functions.
Our batched data includes both features and labels so can be fed directly to our
deep learning systems for training.
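A minimal sketch of this runtime encoding and batching is given below; the tiny vocabulary, the texts and the labels are illustrative, index 0 is left free for padding, and len(vocab) + 1 stands in for unknown words.

import tensorflow as tf

vocab = {"supply": 1, "of": 2, "office": 3, "furniture": 4}
unk_index = len(vocab) + 1      # 0 is reserved for padding

def encode(text, label):
    # Map each lower-cased whitespace token to its vocabulary index.
    tokens = text.numpy().decode("utf-8").lower().split()
    ids = [vocab.get(t, unk_index) for t in tokens]
    return ids, int(label.numpy())

def tf_encode(text, label):
    # Wrap the Python function so it runs inside the tf.data pipeline.
    ids, lab = tf.py_function(encode, [text, label], [tf.int64, tf.int64])
    ids.set_shape([None])
    lab.set_shape([])
    return ids, lab

texts = ["supply of office furniture", "construction of a new bridge"]
labels = [2, 1]
dataset = tf.data.Dataset.from_tensor_slices((texts, labels))
dataset = dataset.map(tf_encode).padded_batch(2, padded_shapes=([None], []))

for batch_ids, batch_labels in dataset.take(1):
    print(batch_ids.numpy())    # zero-padded sequences of equal length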

3.4 Models & Methods

3.4.1 Building A Baseline With Classical Models

Our approach after ensuring the features had been properly processed was to develop
a baseline accuracy for our predictive system. We conducted this by testing the three
main classical text classification models: Naïve Bayes, XGBoost and Linear SVM. These
models were tested using their default parameters and only the description of the
contract as a feature.


The thinking here is to test the above three models and use the best of them to
further explore the feature space (testing various features and combinations of features
against the model with default parameters). The aim is to build up to the best possible
baseline, using a combination of optimum features that can then be used as a starting
point for our neural architectures.
Within this exploration of the feature space was an examination of the effect of ngram usage on model accuracy. A discussion of these effects with accompanying
diagrams can be found in Project Analysis section 4.3.
Once an optimum baseline accuracy has been achieved, we then conduct both ran-
dom up-sampling and down-sampling to assess the effects on model accuracy. Again,
the results of these tests will inform our approach with the Bidirectional LSTM and
Transformer Learning architectures.
Finally, given an optimal set of features, suitable re-sampling and highest-performing
model we can arrive at an optimal accuracy. This value is then used to comparatively
measure performance with state-of-the-art architectures.

3.4.2 Improving the baseline with BiLSTM

Moving forward from our classical baseline we then use our optimum features to gauge
the performance of our two state-of-the-art approaches. Each model is reconstructed
according to their original paper (details below) using our imported custom classes. Our
Project Analysis section also explores variations on the architecture and parameters
outlined below.
The one-layer BiLSTM is initialised exactly as Adhikari et al propose in 2019:

i Optimiser : Adam

ii Learning Rate: 0.001

iii Batch Size: 64

iv BiLSTM Hidden Units: 512

v Dropout Rate: 0.1


vi Weight Dropping: 0.2

vii Optimisation Objective: Categorical Crossentropy
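A minimal Keras sketch of this configuration is shown below. The vocabulary size, embedding dimension and number of classes are placeholders, and the weight-dropping regularisation from Adhikari et al has no direct built-in Keras equivalent, so it is approximated here with recurrent dropout.

import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 50000, 300, 3   # placeholders

model = tf.keras.Sequential([
    # +2 leaves room for the padding (0) and unknown-word indices.
    tf.keras.layers.Embedding(VOCAB_SIZE + 2, EMBED_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(512, recurrent_dropout=0.2)),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_batches, validation_data=val_batches, epochs=10)  # batches of 64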

3.4.3 Improving the baseline with BERT

BERT and other Transfer Learning systems will be constructed using the popular Hug-
ging Face library, accessed through the third-party module Ktrain. This is a tool that
puts the power of pretrained embeddings into the hands of machine learning practi-
tioners, and allows companies to leverage their predictive capability on downstream
tasks.
Using this library, we have constructed a test framework that allows researchers to
initialise our models and further experiment with different parameters and architec-
tures.
For our initial investigation we aim to replicate the architecture of DocBERT, as
it was shown to provide good accuracy for document classification and will serve as a
comparable model to the simpler BiLSTM. Adhikari et al take the BERT model and
build a fully-connected classifier on top and then fine-tune with the task-specific data
for document classification. The Ktrain library allows us to do this easily.
Our BERT initialisation is as follows:

i Optimiser : Adam

ii Learning Rate: 0.00002

iii Batch Size: 16

iv Maximum Sequence Length: 512

v Optimisation Objective: Categorical Crossentropy

As with our BiLSTM, the framework we have constructed will allow us to experi-
ment with different architectures, parameters and pretrained models.
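A minimal sketch of this setup using Ktrain is shown below; the example texts and labels are toy stand-ins for the contract data.

import ktrain
from ktrain import text

train_texts = ["supply of office furniture", "construction of a new bridge"]
train_labels = [1, 2]                       # 0=Services, 1=Supplies, 2=Works
val_texts, val_labels = ["cleaning services"], [0]

(x_tr, y_tr), (x_val, y_val), preproc = text.texts_from_array(
    x_train=train_texts, y_train=train_labels,
    x_test=val_texts, y_test=val_labels,
    class_names=["Services", "Supplies", "Works"],
    preprocess_mode="bert", maxlen=512)

# A BERT encoder with a classification head, fine-tuned on the labelled data.
model = text.text_classifier("bert", train_data=(x_tr, y_tr), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_tr, y_tr),
                             val_data=(x_val, y_val), batch_size=16)

# Adam with a learning rate of 2e-5, minimising categorical cross-entropy.
learner.fit_onecycle(2e-5, 1)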

Chapter 4

Project Analysis

Here we present our findings across all settings, beginning with our analysis of the data
and classical models. We then work towards an optimum feature set before outlining
our analysis of each modern neural architecture. We begin this discussion with a look
at Correlation between variables.

4.1 Correlation with Target

The majority of our data was categorical, so we utilised a variation on Pearson’s Chi-Square Correlation, Cramér’s V. The Chi-Square Test of Independence indicates if there is a significant relationship between variables, with Cramér’s V indicating the strength of this correlation. A score closer to 0 indicates little relationship and a score closer to
1 indicates a strong relationship between two variables.

Figure 4.1: Cramer’s V. Formula


In figure 4.1 we see the Cramér’s V formula, where χ² is the Pearson chi-square statistic, N is the sample size and k is the lesser of the two variables’ numbers of categories (the denominator uses k − 1).
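A minimal sketch of this calculation is shown below, using scipy's chi-square test on a contingency table of two categorical columns; the toy DataFrame and its column names are illustrative.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Cramér's V between two categorical pandas Series.
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.values.sum()
    k = min(table.shape)                 # lesser number of categories
    return np.sqrt(chi2 / (n * (k - 1)))

df = pd.DataFrame({"country": ["UK", "UK", "FR", "FR", "DE", "DE"],
                   "nature":  ["Works", "Services", "Works",
                               "Works", "Supplies", "Services"]})
print(cramers_v(df["country"], df["nature"]))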
We felt that by using this method we could determine a good starting point for our
features. Our intention to use human-level requirements to classify contracts (the title,
the description) could be supplemented by a variety of other features in the data. As
such it was pertinent to explore the correlation of all variables in terms of predicting
our target value, Nature of Contract.
Using the results demonstrated in table 4.1 we can cut down on the time spent
exploring data that has no impact on our prediction, making for a faster analysis and
a better performing predictive system.

Contracts Data Attribute Cramér’s V score


Country 0.2479
Description 0.4078
CPV Text 0.6348
CPV45 0.6348
Title 0.6274

Table 4.1: Table of Correlation with Target Value.

4.2 Classical Model Baseline: Description

Moving forward from our examination of correlation, we then endeavoured to build a baseline accuracy using the three classical machine learning models outlined in our
Methodology. We use default parameters for each of our models, with a default TF-IDF
matrix (using only unigrams). First we use the Description parameter to establish an
appropriate baseline, then use the best of our three classical models to further explore
the parameter space.


4.2.1 Naive Bayes Analysis

Naïve Bayes was chosen as our first model arbitrarily; it was simply coincidental that this model achieved the lowest accuracy.

Figure 4.2: Naive Bayes with Description Classification Report

Figure 4.3: Naive Bayes with Description Confusion Matrix

Our initial model looks promising, performing quite well on the fully cleaned De-
scription data. Remember, from our Methodology section 3.4, the classical
models do not have the ability to process the data, so each of the three models uses
fully cleaned data with the Spacy preprocessing functions.


As figure 4.3 shows, our Naïve Bayes classifier represents a promising starting
point, achieving 79.78% accuracy.

4.2.2 XGBoost Analysis

The next classical algorithm in our initial exploration was XGBoost. Again, as per our
Methodology section 3.4, here we are aiming to gauge the effectiveness of each classical
model with a single feature (the Description).

Figure 4.4: XGBoost with Description Classification Report

Figure 4.5: XGBoost with Description Confusion Matrix


There is a clear improvement across all classes by switching to the tree-based algo-
rithm, going from 79.78% to 81.48%.

4.2.3 Linear SVM Analysis

The final classical model we use to try and improve our previous baseline is Linear
SVM. As demonstrated in our Related Work section, this algorithm has also shown
success in the field of document classification.

Figure 4.6: Linear SVM with Description Classification Report

Figure 4.7: Linear SVM with Description Confusion Matrix


As expected, Linear SVM manages to outperform all other classical methods by a considerable margin, bringing the two minority classes up to an acceptable baseline of +90%. Due to its considerable success using just the Description feature we chose to
use Linear SVM to further explore the parameter space.

4.3 TF-IDF Analysis

With an optimal model that can predict the Nature of Contract based on the single De-
scription feature, we felt it also important to aim for some level of domain knowledge.
As we know from Related Work section 2.10.1, embedding techniques that involve the
representation of contextual meaning add little predictive power to document classifi-
cation systems. We felt that another way to represent a modicum of domain knowledge
within our embeddings was to enable the TF-IDF object to use both unigrams and
bigrams when creating the TF-IDF matrix.
In essence this doubled the number of features our model could use to predict
the Nature of Contract by including conjoining words, as well as unique words in our
words ∗ documents feature matrix.

Figure 4.8: TF-IDF Embeddings on Description with Unigrams


Figure 4.8 is a visual representation of our document TF-IDF embeddings using only unigrams on the description feature. This was achieved by using Singular Value Decomposition to reduce our embeddings to three dimensions, which were then plotted in the above scatter plot. We can see the embeddings forcing our data down three distinct funnels. Our classes are heavily distorted within each of these directions, but the graph does reflect the multi-class nature of our problem.
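A minimal sketch of how such a plot can be produced is shown below; the documents and labels are toy stand-ins for the contract descriptions and Nature of Contract labels.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3-D projection)
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["construction of a new bridge", "supply of office furniture",
        "facility management services", "road maintenance and repair",
        "supply of laboratory equipment", "architectural design services"]
labels = ["Works", "Supplies", "Services", "Works", "Supplies", "Services"]

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
coords = TruncatedSVD(n_components=3).fit_transform(tfidf)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
for cls in sorted(set(labels)):
    idx = [i for i, lab in enumerate(labels) if lab == cls]
    ax.scatter(coords[idx, 0], coords[idx, 1], coords[idx, 2], label=cls)
ax.legend()
plt.show()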

Figure 4.9: TF-IDF Embeddings on Description with Bigrams

By enabling the TF-IDF object to create features using both unigrams and bigrams
we can see a marked improvement in our visualisation. We still have three distinct data
funnels down which our embeddings are being forced, but now the classes are much
more distinct within each direction. This improved visualisation is reflected in the
accuracy with which our Linear SVM could then predict Nature of Contract using
these new Description embeddings.
This improvement can be seen from the confusion matrix below (figure 4.11 ). En-
abling the use of unigrams and bigrams has nudged the accuracy of our classifier up
across all classes.


Figure 4.10: Linear SVM with Description Bigrams Classification Report

Figure 4.11: Linear SVM with Description Bigrams Confusion Matrix

4.4 Improving Classical Baseline: Title

Exploring Title as a single feature had a number of benefits. Firstly, in our previous
work using Natural Language Processing to classify disaster tweets [25], we found that
Linear SVM performed well on short input sequences. Secondly, the smaller number of unique
words produces a denser TF-IDF matrix, which we surmise allows Linear SVM to
better fit a separating hyper-plane to the data.
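A minimal sketch of the kind of pipeline this implies is shown below. The DataFrame df and
its column names (title, nature_of_contract) are hypothetical placeholders rather than the
project's actual variable names.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # df is assumed to hold the contract notices, with hypothetical column names.
    X_train, X_test, y_train, y_test = train_test_split(
        df["title"], df["nature_of_contract"],
        test_size=0.2, stratify=df["nature_of_contract"], random_state=42)

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams, as above
        ("svm", LinearSVC()),                            # fits a linear separating hyper-plane
    ])

    pipeline.fit(X_train, y_train)
    print(classification_report(y_test, pipeline.predict(X_test), digits=2))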


Before exploring the feature we felt it suitable to visualise the embeddings in a simi-
lar way to Description, by reducing the dimensions using Singular Value Decomposition
and plotting on a three dimensional scatter plot.

Figure 4.12: TF-IDF Embedding using Title and Bigrams

Our embeddings now look extremely distinct. By shortening the input sequence and
creating a denser TF-IDF matrix, while still allowing both unigrams and bigrams, our
Vectoriser has created three completely distinct groupings of data. The size of each
class in the above plot roughly reflects our target distribution outlined in figure 3.1.
This evidence strongly supports our decision to use shorter input sequences and to
allow bigrams in our TF-IDF Vectoriser.


The evidence is further supplemented with the results from our Linear SVM using
the above embedding representations.

Figure 4.13: Linear SVM with Title Bigrams Classification Report

Figure 4.14: Linear SVM with Title Bigrams Confusion Matrix

Our accuracy is now reaching a suitable level to meet company requirements. However,
given the high degree of predictive accuracy, we began to examine possible sources of
data leakage. This led to the discovery that our target class labels may be contained
within some of the documents themselves. This effect was mitigated using the technique below.


4.5 Masked Class Analysis

To avoid data leakage from our independent variables (the text of the document), we
created a new TF-IDF Vectoriser with a custom stopword list containing each of our
class names. By passing this parameter to the Vectoriser, these words are removed from
the documents and cannot be used by the model to inform prediction.
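A minimal sketch of this masking step is given below; the class names listed are assumptions
for illustration (the real list is taken from the target column).

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical Nature of Contract class names, lower-cased to match the
    # vectoriser's default lowercasing of the documents.
    masked_classes = ["works", "services", "supplies"]

    # Passing the class names as stop words removes them from every document,
    # so the model cannot simply read the target label out of the text.
    vectoriser = TfidfVectorizer(ngram_range=(1, 2), stop_words=masked_classes)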

Figure 4.15: Linear SVM with Title & Masked Classes

Our experiment with masked classes showed a minor reduction in accuracy across
all settings, but the reduction is small enough to be negligible. Nevertheless, to be
safe, all TF-IDF matrices from this point onwards are created using masked classes
and the unigram/bigram combination.

4.6 Combining Features Analysis

From here we felt it important to explore combinations of features in an attempt to
establish the optimum attributes to feed to our Neural Network models. This produced
mixed results, outlined in our main results table 4.2. We can conclude that Title,
Description and CPV Text appear to be our strongest predictors. However, due to the
volatile nature of the CPV Text attribute (it is sometimes missing from the data),
we felt it less suitable as the main predictor of Nature of Contract.
With our baseline accuracy established, our TF-IDF Vectoriser optimised and our
feature space fully explored, we considered the input features to the neural architectures
settled. This allowed us to begin experimenting with different hyper-parameters and
architectures in a quick and efficient manner.

4.7 BiLSTM Analysis

Having completed the above preliminary steps, we felt we had exhausted our options
for classifying procurement contracts with classical machine learning methods. To
preserve the incremental nature of our analysis, we then took a step up in complexity,
using our custom BiLSTM Builder class to replicate the neural network outlined by
Adhikari et al. [2].

4.7.1 BiLSTM: Raw Title

Title performed so well with our classical models that we felt it a suitable feature with
which to begin our analysis of more complex neural architectures. Utilising built-in
TensorFlow operations, we are able to train and test on the raw data. This method, as
mentioned in our Methodology section, allows us to bake the pre-processing into the model.
This analysis is conducted using the 60 / 20 / 20 Training, Validation and Test split.
The raw data, as shown in figure 4.16, performed well even though the TensorFlow
preprocessing provides limited capability, achieving an excellent 96% accuracy.
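A minimal sketch of this kind of end-to-end model is given below, assuming TensorFlow 2 and
a variable train_titles holding the raw Title strings; the layer sizes are illustrative rather
than the exact values used in our experiments. (In older TensorFlow 2 releases the
TextVectorization layer lives under tf.keras.layers.experimental.preprocessing.)

    import tensorflow as tf

    VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 20000, 40, 128   # illustrative sizes
    NUM_CLASSES = 3                                    # assumed number of classes

    # The vectorisation layer is part of the model, so raw Title strings can be
    # fed straight in: the pre-processing is "baked into" the implementation.
    vectorise = tf.keras.layers.TextVectorization(
        max_tokens=VOCAB_SIZE, output_sequence_length=SEQ_LEN)
    vectorise.adapt(train_titles)   # train_titles: array or dataset of raw strings

    model = tf.keras.Sequential([
        vectorise,
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # single BiLSTM layer
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])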

Figure 4.16: BiLSTM on Raw Title, Classification Report


4.7.2 BiLSTM: Raw Description

Description of contract is our other main feature, and as such we felt it important to
explore it in the BiLSTM as a predictive variable. Sequence length is our main concern
with Description, which has no upper bound on its size. A BiLSTM will handle larger
sequences than BERT, with an upper threshold of approximately 1,000 characters, but our
Descriptions may exceed even this amount. The effect of longer sequences is reflected
in the model accuracy, which, at 72%, is considerably lower than the other models.

Figure 4.17: BiLSTM on Raw Description, Confusion Matrix

4.7.3 BiLSTM: Cleaned Title

The reason for exploring TensorFlow's in-built processing is to increase computational
efficiency by feeding the model raw text. As a means of comparison, we now use a set
of fully cleaned data to make predictions. These features have been processed using our
set of Spacy preprocessing functions (as these functions performed well with our classical
models). We retained all other parameters to ensure that any significant difference in
accuracy can be attributed to the cleaning of the text.
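We do not reproduce the exact project functions here, but a sketch of the general shape of
such a spaCy cleaning step is shown below; the specific operations (lower-casing,
lemmatisation, stop-word and punctuation removal) are assumptions about what a typical
cleaning function of this kind does.

    import spacy

    # Small English pipeline; the parser and NER are disabled for speed since
    # only tokenisation and lemmatisation are needed for cleaning.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def clean_text(text: str) -> str:
        """Illustrative cleaning: lower-cased lemmas, no stop words or punctuation."""
        doc = nlp(text.lower())
        tokens = [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct]
        return " ".join(tokens)

    print(clean_text("Provision of IT support services for the council."))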


Figure 4.18: BiLSTM on Clean Title, Classification Report

Figure 4.19: BiLSTM on Clean Title, Confusion Matrix

Cleaning the Title made a negligible difference to our model, indicating that, for this
feature, we can bake the processing into the implementation and avoid lengthy
third-party preprocessing, with minimal effect on accuracy.

4.7.4 BiLSTM: Cleaned Description

Where the third-party cleaning function made a noticeable difference is when we used
Description as our only predictor. We see a greater than 10% increase in accuracy by
first running our Description through the Spacy preprocessing steps. This could be
down to shorter input sequences better informing our prediction, or to less noise in the
signal distorting the decision-making process. Regardless, we see a significant jump
from 72% to 86% accuracy.

Figure 4.20: BiLSTM on Clean Description, Classification Report

Figure 4.21: BiLSTM on Clean Description, Confusion Matrix

4.7.5 BiLSTM: Train, Validation and Test Splits

We further explored the BiLSTM architecture by adjusting the ratio of the Training,
Validation and Test splits. This analysis produced no significant improvement using our
best feature, Title.
Due to our lack of significant findings in this regard, we have chosen to omit the
classification report and confusion matrix; however, these tables can be found in the
Jupyter notebook, BiP Solutions – Tensorflow End-To-End BiLSTM.


4.8 BERT Classifier Analysis

As with our BiLSTM, we have already drilled through our data to find the optimum
text-based features, and so can get started with our Transformer Learning system straight
away. As mentioned in our Methodology section, we have tried to retain the hyper-parameters
from the original DocBERT classification system. Some changes had to be made due to Out Of
Memory (OOM) errors on our GPU (to settings such as sequence length and batch size), but
these changes are made apparent as and when they occur.
Also note that the data does not need to be processed by our third-party Spacy
functions, as the Ktrain library has built-in preprocessing which conducts the main
steps outlined in the NLP Pipeline in figure 2.1. Finally, it must be noted that, due to
time and resource constraints, we opted to use the DistilBERT pretrained model (a distilled
version of the full BERT system with a smaller number of trainable parameters).
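A minimal sketch of the kind of Ktrain setup described is given below. The variable names,
class names and data splits are assumptions for illustration, and the hyper-parameter values
simply mirror those reported in the text rather than being prescriptive.

    import ktrain
    from ktrain import text

    # x_train / x_test are assumed to be lists of raw strings (e.g. Titles),
    # y_train / y_test the corresponding Nature of Contract labels.
    t = text.Transformer("distilbert-base-uncased", maxlen=500,
                         class_names=["Works", "Services", "Supplies"])  # assumed classes
    trn = t.preprocess_train(x_train, y_train)
    val = t.preprocess_test(x_test, y_test)

    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
    learner.fit_onecycle(5e-5, 1)                  # a single epoch, as in our experiments
    learner.validate(class_names=t.get_classes())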

4.8.1 BERT Classifier: Title

As ever, we begin our analysis with our “top” feature, Title. This feature is a great fit
for a BERT Classifier: the sequence length is short, so there is a greater likelihood of
the classifier being exposed to the complete Title.
Also, following our finding that different Train, Validation and Test splits had very
little impact, we have chosen to proceed with the 60 / 20 / 20 split. It should also be
noted that we had to use the popular Seaborn library to produce our confusion matrix.
This decision was down to a dependency clash (Ktrain requires scikit-learn version
0.3.1, which does not include the plot_confusion_matrix function).

Figure 4.22: BERT on Title, Classification Report


Figure 4.23: BERT on Title, Confusion Matrix

As figures 4.22 and 4.23 show, BERT achieves the highest overall accuracy, at 98%, using
Title to predict the Nature of Contract. This is further enhanced by the extremely low
number of epochs required to achieve this accuracy (we opted for a single epoch due to
computational cost). The high accuracy is offset by the extremely long training time
(roughly 90 minutes per epoch) and relatively long inference time (approximately 6-7
minutes for around 50,000 contracts).

4.8.2 BERT Classifier: Description

Following our methodology, we then used our Description feature with the BERT classifier.
As expected, this did not perform as well as the Title feature; however, it did
outperform all other models that used the same independent variable. We had to make
two changes to this model to ensure training: decreasing the maximum sequence length to
400 (as opposed to 500 with Title) and reducing the batch size to 8 (as opposed to 16 with Title).


Figure 4.24: BERT on Description, Classification Report

Figure 4.25: BERT on Description, Confusion Matrix

Figure 4.25 demonstrates the power of pretrained embeddings for downstream classification
tasks. We see a significant increase in prediction accuracy using the Description of the
contract (up to an overall accuracy of 92%, the first time any model has exceeded the 90%
mark using this feature). Again, this is offset by longer training and inference times,
caused by an extremely large set of trainable parameters.

4.9 RoBERTa Analysis

The Ktrain library allows us to very easily swap one set of pretrained embeddings for
another. Due to limits on time and computational resources, we have chosen to explore
only one other model, RoBERTa. This is a robustly optimised set of pretrained embeddings
which is said to improve upon the work of BERT. It should be noted, however, that
training and inference times increased considerably.
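Under the Ktrain interface, swapping to RoBERTa is largely a matter of changing the
pretrained model name and the memory-related settings, roughly as in the (assumed)
fragment below.

    from ktrain import text

    # Same pipeline as the DistilBERT sketch above, with the pretrained weights
    # swapped out and the memory-related settings reduced to avoid OOM errors.
    t = text.Transformer("roberta-base", maxlen=400,
                         class_names=["Works", "Services", "Supplies"])
    # ... then preprocess, get_classifier() and get_learner(..., batch_size=8) as before.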

4.9.1 RoBERTa Classifier: Title

Continuing to follow our methodology, we first apply the Title feature to the RoBERTa
model. As discussed, the Ktrain library allows us to preprocess the text using built-in
methods.

Figure 4.26: RoBERTA on Title, Classification Report

Figure 4.27: RoBERTA on Title, Confusion Matrix

RoBERTa did not perform as well as the original BERT classifier; however, with
further training we would expect the accuracy to increase, and possibly surpass that of
the original BERT model on this task. Also note that with RoBERTa we were required
to decrease our maximum sequence length to 400 and our batch size to 8 to avoid an
OOM error.

4.9.2 RoBERTa Classifier: Description

Finally, we used our Description feature to make predictions on Nature of Contract
with the RoBERTa classifier. As above, we had to reduce the maximum sequence
length to 400 and the batch size to 8. Mirroring the performance on Title, we saw a
negligible decrease in accuracy compared to BERT, but a significant increase in training
and inference time (roughly 2.5 hours to train RoBERTa, compared to 90 minutes with
BERT).

Figure 4.28: RoBERTA on Description, Classification Report


4.9.3 Results Table

Here we present an overall view of our results, highlighting our top-scoring classical,
BiLSTM and Transformer Learning implementations.

Model                      Feature                          Macro F1 Score

Multinomial Naive Bayes    Description                      0.79
XGBoost                    Description                      0.81
Linear SVM                 Description                      0.92
Linear SVM                 Title                            0.96
Linear SVM                 Title (with masked classes)      0.97
BiLSTM                     Raw Title                        0.96
BiLSTM                     Raw Description                  0.72
BiLSTM                     Cleaned Title                    0.95
BiLSTM                     Cleaned Description              0.86
BERT Classifier            Title                            0.98
BERT Classifier            Description                      0.92
RoBERTa Classifier         Title                            0.97
RoBERTa Classifier         Description                      0.90

Table 4.2: Main Results Table.
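For reference, the Macro F1 figures above are the unweighted mean of the per-class F1 scores,
which gives each class equal importance regardless of its frequency. A minimal scikit-learn
sketch, assuming arrays y_true and y_pred of labels, is:

    from sklearn.metrics import classification_report, f1_score

    # Macro averaging weights every class equally, so minority classes count
    # as much as the majority class in the headline score.
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    print(macro_f1)
    print(classification_report(y_true, y_pred, digits=2))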

Chapter 5

Conclusions

In the work above we have presented our findings when comparing a simple one-layer
BiLSTM with Transformer Learning pretrained embeddings. We also included a full
analysis of classical machine learning methods for classifying e-procurement contracts.
Our full set of results across all settings is presented in table 4.2.
As demonstrated in table 4.2, BERT outperforms all other models on the classification
task, but the significant training time and prohibitive size of the model offset this
accuracy. The BiLSTM compares well with the pretrained embeddings, under-performing
only slightly by comparison, and as an option for deployment would be more suitable due
to its short training time and the ability to embed the preprocessing in the model. If
the company required a state-of-the-art neural classification system in which accuracy
was of the utmost importance, we feel the BiLSTM would better meet these requirements
while retaining a relatively simple neural architecture.
However, our recommendation for implementation is to use the classical Linear
SVM method. Although this requires our text to be preprocessed by the third-party
function, it is the most efficient and scalable solution. Using this algorithm lets us
balance the accuracy and interpretability of the model, allowing company researchers
to better track the decision-making process and to scale the system as required.
The research above also indicates that, in all settings, opting for a shorter input
sequence (using the Title attribute) allows for greater predictive capability. This
reflects our previous work classifying disaster tweets and makes intuitive sense when
we consider the significantly smaller unique vocabulary and the consequently denser
TF-IDF matrix.
denser TF-IDF matrix. For this reason, we recommend a Linear SVM classifier, trained
to predict on the Title, as a means of meeting all company requirements. This imple-
mentation will help to enhance user experience on the BiP e-procurement contract
search engine.
Further research could be conducted exploring very recent pretrained embeddings
(such as ELECTRA [9] from Google, or GPT-3 [6] from OpenAI) as a means of classifying
e-procurement contracts. However, with each new pretrained model the number of
trainable parameters undergoes a significant increase (GPT-3 has billions of trainable
parameters). Considering the need for a rapidly trainable and scalable solution, we
do not feel that any significant increase in performance can be found by exploring these
new state-of-the-art language models for classifying e-procurement contracts.

Bibliography

[1] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. DocBERT: BERT
for document classification. 2019.

[2] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. Rethinking
complex neural network architectures for document classification. NAACL HLT
2019 - 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies - Proceedings of the
Conference, 1:4046–4051, 2019.

[3] Chaitanya Anne, Avdesh Mishra, Md Tamjidul Hoque, and Shengru Tu. Multiclass
patent document classification. Artificial Intelligence Research, 7(1):1, 2017.

[4] Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. Multilingual
multi-class sentiment classification using convolutional neural networks.
LREC 2018 - 11th International Conference on Language Resources and Evaluation,
pages 635–640, 2019.

[5] James Bennett and Stan Lanning. The Netflix Prize. 2007.

[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020.

[7] Che-Wen Chen, Shih-Pang Tseng, Ta-Wen Kuan, and Jhing-Fa Wang. Outpatient
text classification using attention-based bidirectional LSTM for robot-assisted
servicing in hospital.

[8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. 2016.

[9] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning.
ELECTRA: Pre-training text encoders as discriminators rather than generators. pages
1–18, 2020.

[10] Fabrice Colas and Pavel Brazdil. On the behavior of SVM and some older
algorithms in binary text classification tasks. Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 4188 LNCS:45–52, 2006.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
Pre-training of deep bidirectional transformers for language understanding. pages
4171–4186, 2018.

[12] R. Draelos. Best use of train/val/test splits, with tips for medical data.
Glass Box Medicine, https://glassboxmedicine.com/2019/09/15/best-use-of-train-
val-test-splits-with-tips-for-medical-data/, 2019.

[13] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211,
1990.

[14] Wael Etaiwi and Ghazi Naymat. The impact of applying different preprocessing
steps on review spam detection. Procedia Computer Science, 113:273–279, 2017.

[15] Yoav Goldberg. Neural network methods for natural language processing. Neural
Network Methods for Natural Language Processing, pages 77–78, 2017.

[16] Peter Harrington. Machine learning in action. Machine Learning In Action,
page 65, 2012.

[17] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural
nets and problem solutions. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems, 6(2):107–116, 1998.

[18] Joemon M Jose, Emine Yilmaz, João Magalhães, Pablo Castells, Nicola Ferro,
Mário J Silva, and Flávio Martins. Advances in Information Retrieval: 42nd
European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17,
2020, Proceedings, Part I. Springer International Publishing, 2020.

[19] A. M. Kibriya, E. Frank, B. Pfahringer, and G. Holmes. Multinomial naive bayes
for text categorization revisited. Lecture Notes in Artificial Intelligence, subseries
of Lecture Notes in Computer Science, pages 488–499, 2004.

[20] Yoon Kim. Convolutional neural networks for sentence classification. EMNLP
2014 - 2014 Conference on Empirical Methods in Natural Language Processing,
Proceedings of the Conference, pages 1746–1751, 2014.

[21] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim,
Chan Ho So, and Jaewoo Kang. BioBERT: A pre-trained biomedical language
representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240,
2020.

[22] Jingzhou Liu, Wei Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for
extreme multi-label text classification. SIGIR 2017 - Proceedings of the 40th
International ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 115–124, 2017.

[23] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text
classification with multi-task learning. IJCAI International Joint Conference on
Artificial Intelligence, pages 2873–2879, 2016.

[24] Stuart Mackie, David MacDonald, Leif Azzopardi, and Yashar Moshfeghi. Looking
for opportunities: Challenges in professional procurement search. SIGIR 2019 -
Proceedings of the 42nd International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 1397–1398, 2019.

[25] F. Mitchell. Disaster tweets classification challenge. Kaggle,
https://www.kaggle.com/fmitchell259/disaster-tweets-naive-bayes-svm-rnn, 2020.

[26] P. Nayak. Understanding searches better than ever before. The Keyword,
https://blog.google/products/search/search-language-understanding-bert/, 2019.

[27] Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu
Song, and Qiang Yang. Large-scale hierarchical text classification with recursively
regularized deep graph-CNN. The Web Conference 2018 - Proceedings of the World
Wide Web Conference, WWW 2018, pages 1063–1072, 2018.

[28] Oscar Quispe, Alexander Ocsa, and Ricardo Coronado. Latent semantic indexing
and convolutional neural network for multi-label and multi-class text classification.
2017 IEEE Latin American Conference on Computational Intelligence, LA-CCI
2017 - Proceedings, pages 1–6, 2018.

[29] Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers.
Statistical topic models for multi-label document classification. Machine Learning,
88(1-2):157–208, 2012.

[30] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT,
a distilled version of BERT: smaller, faster, cheaper and lighter. pages 2–6, 2019.

[31] Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han.
Hierarchical attention networks. ArXiv, pages 1480–1489, 2016.

[32] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent
neural network for sentiment classification. Conference Proceedings - EMNLP 2015:
Conference on Empirical Methods in Natural Language Processing, pages 1422–1432,
2015.

[33] L. Torlay, M. Perrone-Bertolotti, E. Thomas, and M. Baciu. Machine
learning–XGBoost analysis of language networks to classify patients with epilepsy.
Brain Informatics, 4(3):159–169, 2017.

[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
Advances in Neural Information Processing Systems, pages 5999–6009, 2017.

[35] Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang.
SGM: Sequence generation model for multi-label classification. 2018.
