CS985 Project FrankMitchell BiP Solutions
I declare that this dissertation embodies the results of my own work and
that it has been composed by myself.
I declare that the word count for this dissertation (excluding title page,
declaration, abstract, acknowledgements, table of contents, list of
illustrations, references and appendices) is 8785.
Abstract
Here we present our work classifying e-procurement contracts for BiP Solutions. The
work here is intended to enhance the client search engine within the company by ac-
curately predicting the Nature of Contract. By improving this metadata for BiP we
aim to increase the value of service for clients, stakeholders and taxpayers. Our work
follows from the research of Adhikari et al in 2019, which compared a simple, one-layer
Bidirectional Long/Short-Term Memory (BiLSTM) network with more complex neu-
ral structures [2]. Here we aim to add to this research by conducting a comparative
study of BiLSTM with current state-of-the-art Transformer Learning systems (which
utilise powerful pretrained word embeddings). This analysis uses the Macro F1 score
as our evaluation metric. We also explored the main classical machine learning methods
used for text classification and found that the Linear SVM algorithm could match the
performance of both the complex Transformer architectures and the simpler BiLSTM
structure. Our recommendation for the company - on the basis of interpretability, scal-
ability and accuracy - is to implement a classical model for the enhancement of their
e-procurement search system.
Contents

Abstract
List of Figures
Acknowledgements
1 Introduction
2 Related Work
2.1 Fundamental NLP Processes
2.2 Pre-processing: Pipeline
2.3 Pre-processing: Tokenisation
2.4 Pre-processing: Embeddings
2.5 Evaluation Metrics
2.6 Benchmark Document Classification: Classical Approaches
2.6.1 Naïve Bayes
2.6.2 XGBoost
2.6.3 Linear Support Vector Machines
2.7 Benchmark Document Classification: Neural Networks
2.7.1 Convolutions for Document Classification
2.7.2 Recurrence for Document Classification
2.8 Hierarchical Attention Networks
2.9 Attention Is All You Need
3 Methodology
3.1 Technical Environment
3.2 The Data
3.2.1 Data Source
3.2.2 Cross-Validation
3.2.3 Target Value
3.2.4 Target Variable Distribution
3.2.5 A True Test Set
3.2.6 Independent Variables
3.2.7 Correlations
3.3 Data Processing
3.3.1 Pre-processing the documents
3.3.2 Embedding Technique: TF-IDF Representation
3.3.3 Neural Embedding Technique: Continuous Bag of Words
3.4 Models & Methods
3.4.1 Building A Baseline With Classical Models
3.4.2 Improving the baseline with BiLSTM
3.4.3 Improving the baseline with BERT
4 Project Analysis
4.1 Correlation with Target
4.2 Classical Model Baseline: Description
4.2.1 Naive Bayes Analysis
4.2.2 XGBoost Analysis
4.2.3 Linear SVM Analysis
4.3 TF-IDF Analysis
4.4 Improving Classical Baseline: Title
5 Conclusions

List of Figures

List of Tables
Acknowledgements
The following people were integral to my journey into Data Science and I’d like to
acknowledge their contribution to my life.
First, Mr. Paul M. Coker, whose tireless dedication to education and life-long
commitment to technology has gifted me with a deep and lasting love for computer
programming.
Also, Mr. Tom Shering, whose towering intellect, burning passion and inexhaustible
patience opened my eyes to the mesmerising beauty of mathematics.
Finally, to Neil and Eileen Mitchell, without whom I wouldn't have the drive to
succeed and the courage to meet change with open arms. Thank you Mum and Dad.
Chapter 1
Introduction
In the following project we have been tasked with using artificial intelligence techniques
to accurately predict the nature of e-procurement contracts for BiP Solutions, a well-
established e-procurement firm based in Glasgow.
The tendering of public procurement contracts is a worldwide, multi-million pound
business [24] that has direct impact on local and national economy, making the process
of facilitating this industry of great importance to taxpayers and stakeholders alike.
BiP Solutions hold over 35 years’ experience in the industry, acting as an interface
between public authorities and private vendors. Every day a multitude of geographic
and domain specific procurement portals tender thousands of contracts, resulting in
the need for a contracts aggregation system that collates this procurement data into a
user-friendly, searchable system.
BiP operate one of the industry's market-leading search engines, serving 120,000
contract notices and awards every month and aggregating contract information from over
1500 sources [24].
Our deep learning model aims to improve this system by predicting the Nature of
Contract, increasing metadata and enhancing professional search subscriptions for BiP’s
users. The importance of this project is reflected in the global ramifications of improved
procurement systems for local and national governmental authorities; delivering better
value for all.
The challenge of classifying contracts falls under the domain of professional search.
The purpose of conducting this research is to achieve the best possible model for a
deployment scenario. We hope that by opting for a simpler solution - one that’s easier
to train and less sensitive to hyperparameters - we have a greater chance of achieving
our goal of producing an efficient, accurate and scalable predictive system.
Following our introduction, we examine previous work on the subject of text classi-
fication, before outlining our approach, methodology and analysis. Finally, we present
our results across a number of different settings and make a recommendation for im-
plementation.
Chapter 2
Related Work
Here we outline the most fundamental steps in any Natural Language Processing
pipeline, as shown in figure 2.1.
Over the years, document classification has been tackled by countless researchers,
resulting in a predefined set of methods and practices which standardise parts of the
process. A few common machine learning models have been shown to excel at this task,
and conducting processing steps like tokenisation, stopword removal and lemmatisation
has been shown to improve predictive capability [14].
After our initial text processing, we then represent the data in numerical format.
This process is known as Embedding, and the techniques used to achieve this result are
outlined in section 2.4 below.
Throughout our research we follow the steps of the pipeline outlined in figure 2.1, mak-
ing a slight deviation from this pattern when moving from classical machine learning to
deep learning neural networks. The powerful Tensorflow library allows us to bake the
processing into the model, so that the Continuous Bag of Words (CBoW) embeddings
(see Section 3.3.3) are created at runtime. Although the number of steps remains the
same, by imbuing the model with the ability to embed text we reduce the computational
time required to process contracts, which could allow for easier deployment.
However, the number of processing steps that we can use in this way is very limited.
The inbuilt TensorFlow text pre-processing will only lower-case the data and remove
punctuation. As such, the effect of each approach on the model's performance will
need to be weighed against company requirements.
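As a minimal sketch of this built-in behaviour (using the stock TensorFlow TextVectorization layer; the vocabulary cap and dataset name are illustrative assumptions, not our exact notebook code):

    import tensorflow as tf

    # The default standardisation lower-cases text and strips punctuation - the
    # only cleaning available when the processing is baked into the model.
    vectorise_layer = tf.keras.layers.TextVectorization(
        standardize="lower_and_strip_punctuation",
        max_tokens=20000,   # assumed vocabulary cap
        output_mode="int",
    )
    vectorise_layer.adapt(train_titles)  # train_titles: assumed array of raw strings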
The first step is cleaning and processing the text; converting to lower-case, conduct-
ing stopword removal to remove common words, then stemming and lemmatisation to
convert words to their root form. This whole process is sometimes referred to as To-
kenisation and results in our document represented as a sequence of individual tokens
(strings) as shown in Figure 2.2. This type of processing cannot be baked into the
model and must be performed by third-party functions (found in our functions.py file).
The next preprocessing step is representing the documents in numerical form. There are
a number of common techniques used to embed textual information for machine learning
systems, but the approach that has yielded the best results – for document classification
– is Term-Frequency Inverse-Document-Frequency (TF-IDF) representation ([15], [27],
[1]).
TF-IDF evaluates the importance of each word and increases this importance pro-
portional to the number of times the word appears in the document. This is then offset
by the number of times that word appears in the whole corpus.
The TF-IDF weight W_{i,j} for word j in document i is calculated by multiplying the
term frequency of word j in document i by the inverse document frequency:

    W_{i,j} = tf_{i,j} × log(N / df_j)

where N is the total number of training documents and df_j is the number of documents
that contain word j. By calculating the TF-IDF weight for each word we can create a
sparse n × m matrix with n unique words and m documents to represent our data
(Figure 2.4 below).
Historically, document classification models have used ROC AUC (Receiver Operating
Characteristic – Area Under the Curve) as an evaluation metric in a binary scenario.
An ROC Curve plots the True Positive rate as a function of the False Positive rate
and provides a representation of the model’s performance in two-dimensional space.
The AUC then compresses this 2d plot to a single evaluation metric, with a perfect
model scoring 100% AUC and a model that makes random decisions scoring approx.
50% AUC [33].
In terms of our multiclass problem, we have a number of evaluation metrics to
consider. Some researchers choose to focus on recall (how many samples within the
positive class were predicted correctly), aiming to minimise False Negatives [3]. Others
also include precision (how many of the model’s positive predictions were correct) as
a means of evaluating a classifier’s performance [7].
A better single metric for multi-class problems is the Macro Average, more specifically
the Macro F1 score: the F1 score, the harmonic mean of precision and recall, is
calculated per label and then averaged across all labels. This value is
more robust than precision or recall alone because our classifier has to score both high
precision (a lot of correct positive predictions) and high recall (more correct predictions
overall) in order to achieve a high F1. The Macro Average is simply the Average F1
scores across all classes.
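For each class the F1 score is F1 = 2 × (precision × recall) / (precision + recall), and the Macro F1 is the unweighted mean of the per-class scores. A minimal sketch of the metric in scikit-learn (y_val and predictions are assumed label arrays):

    from sklearn.metrics import f1_score

    # average="macro" computes F1 per class then takes the unweighted mean,
    # so minority classes count as much as the majority class.
    macro_f1 = f1_score(y_val, predictions, average="macro")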
Macro averages are widely used to evaluate the effectiveness of multi-class document
classification systems ([4], [30], [10], [29]) and will be used to evaluate our e-procurement
contract classification system. Accuracy will hence refer to Macro F1 unless otherwise
stated.
There are various benchmark datasets and models that have been used extensively for
developing document classification systems. The first we chose to explore was Naive
Bayes, an algorithm derived from Bayesian Decision Theory that seeks to calculate a
posterior probability given prior information.
The presence or absence of each word is considered a feature, and the model treats
each feature as statistically independent (hence the moniker Naive) [16]. In our case
we are calculating the posterior probability that a given document belongs to a specific
class given the set of words that make up that document's features.
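Formally, for a document d made up of words w_1, ..., w_n, the classifier assigns the class c that maximises the posterior (this is the standard multinomial Naive Bayes formulation):

    P(c | d) ∝ P(c) × Π_{i=1}^{n} P(w_i | c)

where the product over individual word likelihoods is exactly the independence assumption described above.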
Kibriya et al used the algorithm on the benchmark 20newsgroups data in 2004 (a
collection of 18,000 newsgroup posts spanning 20 topics). The team had incredible
success for the time, achieving 88.36% accuracy on this popular multi-class problem
[19].
2.6.2 XGBoost
When exploring classical machine learning models, it was important that we cover the
best known algorithms for text classification. Together with the probabilistic approach
outlined above, we thought it pertinent to also include a tree-based model.
In recent years, the tree-based solution which has consistently yielded the most
promising results is XGBoost: a highly scalable and efficient tree boosting model that
continues to dominate the leader boards of machine learning competitions, a tradition
popularised by contests such as the Netflix Prize launched back in 2007 [5].
Chen & Guestrin used XGBoost on the Allstate Insurance Claim data (a record of
over 580,000 insurance claims) in 2016, achieving a mean Area Under the Curve (AUC)
score of 83% [8]. Further support for XGBoost comes from the work of Torlay et al
in 2017, applying the algorithm to the binary task of predicting epilepsy diagnosis from
various language representation features. Torlay achieved a mean AUC of 91% [33].
Our final classical approach uses Linear Support Vector Machines (SVM), a very pop-
ular method for document classification that’s been shown to produce a high degree of
accuracy in a number of research papers.
The Support Vector Machine is an optimal margin classifier, meaning it attempts to find
the largest gap between the instances that lie on the margin borders of each class. The
algorithm can make use of various kernel tricks which allow manipulation of the feature
space, learning representations in high-dimensional space from low-dimensional data.
Linear SVM is traditionally a binary classifier; however, the SKLearn implementation
of this model handles multi-class problems using a One-vs-The-Rest scheme, meaning we
have the same number of classifiers as we have classes. Each classifier is assigned a
class and classifies each instance as belonging to that class (a positive sample)
or not belonging to that class (a negative sample).
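A minimal sketch of this behaviour with scikit-learn's LinearSVC (the feature matrix and labels are assumed to be the TF-IDF representations described later):

    from sklearn.svm import LinearSVC

    # LinearSVC defaults to multi_class="ovr": one binary classifier per class,
    # so coef_ holds one learned weight vector per class.
    clf = LinearSVC()
    clf.fit(X_train_tfidf, y_train)
    print(clf.coef_.shape)  # (n_classes, n_features) for a multi-class problem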
Work by Anne et al from 2017 attempted to use Linear SVM to classify NASA
patent documents. Using a poly-kernel based implementation the team were able to
achieve 64.6% accuracy when predicting on their validation set [3]. Our work supports
the findings of Anne et al, in that our SVM classifier, using a One-vs-The-Rest scheme,
well outperformed Naïve Bayes and XGBoost when classifying e-procurement contracts.
The successes of Naïve Bayes, XGBoost and Linear SVM have been surpassed in recent
years through the use of Neural Networks. These powerful digital structures have the
ability to learn complex patterns from high-dimensional data. They achieve this by
incorporating ideas from neuroscience that allow researchers to develop systems that
mirror aspects of the human brain.
Here we present the achievements in document classification using a variety of
interesting neural network solutions.
Typically used for Computer Vision tasks, Convolutional Neural Networks (CNN) aim
to mimic the architecture of the visual cortex. This is achieved using a number of filters
that expose the model to limited regions of the input, allowing it to learn from different
parts of the image.
Figure 2.8: Convolutional Neural Network for Document Classification (Quispe et al,
2018)
Convolution can also be used for document classification, as shown in figure 2.8.
The input (an embedded document) is scanned using a number of filters and convo-
lutional layers, outputting a final hidden state vector. This vector is then fed to a
fully-connected feed-forward neural network for classification.
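A sketch of this style of network in Keras; the vocabulary size, filter settings and three output classes are illustrative assumptions rather than the exact configuration of Quispe et al:

    import tensorflow as tf

    # Embedded tokens are scanned by 1D convolutional filters, pooled to a single
    # hidden vector, then classified by a fully-connected softmax layer.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=20000, output_dim=128),
        tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])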
This approach has seen some success on benchmark datasets in the field of document
classification, such as the CNN from Quispe et al in 2018. The team used convolutional
layers as an automatic feature extractor to capture meaningful information, achieving
88.8% and 93.8% accuracy on the AG News and Soguo News datasets [28].
In terms of recent advances in the field, we can find a notable improvement in the work of
Seo et al, where they outline a new architecture called Hierarchical Attention Networks
(HAN). This work has a hierarchical structure which mirrors the hierarchical nature of
documents, and has two levels of attention mechanism applied (one at the word-level
and another at the sentence-level) [31]. Attention allows the model to calculate the
most important words in each sentence and manifests as a weight matrix assigned to
the hidden output state from the HAN.
The basic structure of this network encompasses a word sequence encoder and word-
level attention layer, together with a sentence encoder and a sentence-level attention
layer.
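In the standard formulation of these papers, the word-level attention over hidden states h_it is computed as:

    u_it = tanh(W_w h_it + b_w)
    α_it = exp(u_it^T u_w) / Σ_t exp(u_it^T u_w)
    s_i  = Σ_t α_it h_it

where u_w is a learned word-level context vector and s_i is the resulting sentence vector; the sentence-level attention is defined analogously.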
A year later, a paper from Vaswani et al demonstrated that a high level of accuracy
could be achieved using only the attention mechanism, dispensing with convolutions
and recurrence entirely (Figure 2.13 ).
This work was carried out on a machine translation task, and as such used the BLEU
score (Bilingual Evaluation Understudy) to assess the quality of the translated output [34].
This is yet another instance where a simpler approach has proved more effective than
complex neural structures, outperforming all other models when translating from English
to German and from English to French.
Current state of the art uses pretrained embeddings with Transformer Learning archi-
tectures. Leading the charge in the world of pre-trained embeddings is BERT (Bidi-
rectional Encoder Representations from Transformers), a system that pre-trains deep
bidirectional representations from unlabelled text, using Masked Language Modelling
(MLM) to condition the system on both left and right contexts [11].
These powerful pre-trained embeddings can then be utilised for downstream tasks
like classification or machine translation, by fine-tuning with labelled, task-specific data
[11].
The model comes with two main drawbacks, the first being that it cannot handle long
sequences of text: the maximum number of tokens the system can manage is 512. Although
this hasn't caused significant issues in past work, it must certainly be considered in
our contract classification problem, where the description of the contract could often
exceed this value.
The second drawback is that the model is computationally expensive with hundreds
of millions of parameters. A distilled version of the original BERT model is available,
as well as the alternative RoBERTa model; the distilled version better manages this
large computational cost, however it still trains using a large number of parameters.
Despite its shortcomings, BERT has become synonymous with successfully assisting
downstream Natural Language Processing tasks. The use of these powerful pretrained
embeddings has facilitated advances in a number of domains:
iii BERT Search: Applying BERT to improve search query results [26].
In the context of document classification, we can see BERT being applied by Ad-
hikari et al in 2019, with their DocBERT Document Classification system, marking the
first time the model has been used for this task [1]. The author notes the arguable
unimportance of syntactic structure when classifying documents (a point supported by
our own research which shows Naïve Bayes and Linear SVM performing exceptionally
well using TF-IDF embeddings). Nevertheless, BERT manages to outperform the
classical approaches across all settings.
Using two variations on BERT (the large and computationally expensive version
and the smaller, distilled version), the researchers achieved +91% accuracy on the
popular Reuters dataset. Furthermore, they show that by distilling the knowledge we
can save on computational complexity and still achieve +90% accuracy.
Even with knowledge distillation, BERT still has hundreds of millions of parameters,
making the network massively complex and computationally expensive. Also, its ability
to cope with sequences of only 512 tokens means it has trouble tracking long-term
dependencies. Both these issues can be mitigated using a simpler architecture - Bidi-
rectional Long-Short Term Memory Networks (BiLSTM) - which we can find mention
of all the way back in 1998 [17].
LSTM networks increase the effectiveness of vanilla RNNs by adding an additional
hidden state that uses a number of gates to decide what to remember and what to
forget. Combining these LSTM layers into a bidirectional architecture allows the input
to be read in both directions (left to right, and right to left), further enhancing the
network’s ability to track long-term dependencies.
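For reference, the standard LSTM gating equations are:

    f_t = σ(W_f [h_{t-1}, x_t] + b_f)     (forget gate)
    i_t = σ(W_i [h_{t-1}, x_t] + b_i)     (input gate)
    o_t = σ(W_o [h_{t-1}, x_t] + b_o)     (output gate)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c [h_{t-1}, x_t] + b_c)
    h_t = o_t ⊙ tanh(c_t)

where c_t is the additional cell state referred to above and ⊙ denotes element-wise multiplication. The bidirectional variant runs two such layers over the sequence in opposite directions and concatenates their hidden outputs.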
neural networks by approx. 7% [2]. The team leave the question of comparing BiLSTM
with Transformer Learning as an open question, one which we have chosen to explore
in our research below.
Chapter 3
Methodology
ii Keras 3.3.6
iv Ubuntu v.18.04
Full details on package versions can be found in the requirements.txt file in the
project documentation.
Our custom functions (pre-processing, graph display, data splits etc) are all held
in a separate file (functions.py) and imported at the top of our notebooks. Likewise,
our custom classes (used to build model testing frameworks) can be imported from the
custom_class.py file.
The main libraries used for pre-processing are NLTK and Spacy. Implementation
details can be found below in Methodology section 3.3.1.
The data for our research was provided by BiP Solutions and consists of 350,014 ex-
ample e-procurement contracts. There are 29 features that describe the data, split
between a range of datatypes, with the main focus of our research concerning the Title
and Description of the contract.
The majority of contracts (348,940) are sourced from European vendors and include
our target label, Nature of Contract. A small number of contracts (1074) are sourced
from U.S. vendors and do not include our target label. More information on this sample
can be found in Methodology section 3.2.5.
A very small number of contracts (7) are labelled combined. These instances have
been dropped from our analysis.
3.2.2 Cross-Validation
In an effort to ensure a robust classifier we will test our model by splitting our data into
different groups. Our training data will be used (in conjunction with our validation
data) to train and validate the model. Our testing data will remain completely unseen
by the model and acts as a further checkpoint to ensure reliable predictions.
We explore three different Train, Test and Validation splits given the size of the
dataset. Here we present the details of the three most common ratios that we intend
to explore [12].
It is common practice to test models with varying proportions of the data and we
present our findings in this regard in the Project Analysis Section 4.7.5.
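As a minimal sketch of the 60 / 20 / 20 split used later in our analysis (the data frame and target column name here are assumptions for illustration):

    from sklearn.model_selection import train_test_split

    # First carve off 40% of the data, then split that half-and-half into
    # validation and test, stratifying on the target throughout.
    train, rest = train_test_split(df, test_size=0.4, random_state=42,
                                   stratify=df["nature_of_contract"])
    val, test = train_test_split(rest, test_size=0.5, random_state=42,
                                 stratify=rest["nature_of_contract"])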
Our target variable - Nature of Contract - can fall into one of three categories:
i Services
ii Works
iii Supplies
As figure 3.1 below shows, we have a large imbalance in our target distribution. This
is a common problem in machine learning, and we intend to mitigate this issue using
two well-known techniques:
i Random Up-Sampling
ii Random Down-Sampling
Where possible we try to retain as much data as we can, so would always opt for
Random Up-Sampling in the first instance. However, with so much data, it is certainly
worth also exploring the effect of Random Down-Sampling on model performance.
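A minimal sketch of Random Up-Sampling with scikit-learn (the column name is an assumption); Random Down-Sampling is the same call with replace=False and n_samples set to the smallest class size:

    import pandas as pd
    from sklearn.utils import resample

    # Up-sample every class to the size of the largest class.
    largest = train["nature_of_contract"].value_counts().max()
    balanced = pd.concat([
        resample(group, replace=True, n_samples=largest, random_state=42)
        for _, group in train.groupby("nature_of_contract")
    ])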
The data also includes 1074 instances with an N/A target value (figure 3.1). These
instances reflect international contracts (from America or Asia) and comprise an
example of genuinely unlabelled data.
Following the company guidelines, it would be good to manually label 200-300
documents to enhance our testing capabilities beyond the train, test and validation
split using the European examples. Through creation of this genuine set we would
hope to make our model more robust to market changes in the future. Unfortunately,
due to resource constraints within BiP Solutions this was not possible for our research.
As mentioned in Methodology section 3.2.1, the features that describe contract data
have not been standardised across the industry. We use the features of the data to train
a predictive model, so missing values in features that are important for an accurate
prediction are a significant concern.
3.2.7 Correlations
Before conducting any major pre-processing steps, we wanted to check the correlation
of independent variables with the target variable. This was to avoid spending time on
features which have little effect on our predictions. Our method here consisted of the
following steps:
i Check the correlation of the main independent variables (Title and Description).
iii Use the strongest features to develop baseline performance using classical machine
learning methods.
The results of our study using Naïve Bayes, XGBoost and Linear SVM can be
found in our Project Analysis section. Each of the other features was text based and
went through the necessary preprocessing steps outlined in the Background Research.
Our method of embedding differed between classical and neural models and is outlined
below in Methodology section 3.3.2.
Our initial preprocessing was undertaken using the Natural Language Tool Kit
(NLTK). This library is popular with researchers and comes
with the ability to conduct stopword removal, lemmatisation and stemming. Our cus-
tom function takes a data frame as input and iterates through it to create a new cleaned
text column.
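A minimal sketch of this style of cleaning function (illustrative only; the actual implementation lives in functions.py and may differ):

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    STOPWORDS = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def clean_text(text):
        # Lower-case, tokenise, drop stopwords/non-words, reduce to root forms.
        tokens = word_tokenize(text.lower())
        tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
        return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)

    def add_cleaned_column(df, column="description"):
        # Iterate over the data frame to create the new cleaned-text column.
        df["cleaned_text"] = df[column].apply(clean_text)
        return df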
We used another popular library, Spacy, in an attempt to improve our processing.
This library works better in a deployment environment as it is built on top of Cython
(a C-optimised version of Python), allowing for faster manipulation of data. The only
prerequisite is making sure the function that utilises the library is vectorised (it
accepts and returns NumPy arrays), which is not possible when dealing with raw text.
The issue
of speeding up Spacy to process large amounts of text is an open research question and
one we leave in the hands of the BiP Solutions development team.
In our deep learning model, we have also created the option to bake the preprocess-
ing into the model. Rather than create a TF-IDF matrix, the deep learning system
conducts Continuous Bag-of-Words (CBoW) embedding at runtime, allowing for much
faster preprocessing (as the text does not require parsing by a third-party function
before training the model).
As mentioned in Related Work section 2.2 however, this comes with a limited num-
ber of processing functions. The accuracy of a deep learning model fed with minimally
processed text would need to be weighed against the processing time required by our
custom functions. A comparative examination of pre-baked processing and third-party
parsing can be found in Project Analysis section 4.7.
i stop_words = ['works', 'services', 'supplies']: This feeds a custom stopword list to
the CountVectorizer, so these words are removed from the documents. This was
to ensure the class labels were not bleeding into the independent variables.
ii ngram_range = (1, 2): This tells our vectoriser to take account of both unigrams
and bigrams. This will increase the number of features and make our matrix
sparser, but the benefit of using connecting words can be found in our comparative
analysis of a model trained using unigrams and one using both unigrams and
bigrams (Project Analysis section 4.3 )
To produce our normalised sparse TF-IDF matrix (as in figure 2.4 ) we then need
to feed the output of our CountVectorizer to a TfidfTransformer. As per the SKLearn
documentation,“The goal of using tf-idf instead of the raw frequencies of occurrence
of a token in a given document is to scale down the impact of tokens that occur very
frequently in a given corpus and that are hence empirically less informative than features
that occur in a small fraction of the training corpus.”
The TfidfTransformer object was initialised as follows:
i smooth_idf=True: This smooths the IDF weights by adding one to the document
frequencies. This setting avoids division-by-zero errors.
iii sublinear_tf=True: This applies sublinear term-frequency scaling, which replaces
tf with 1 + log(tf).
The sparse TF-IDF matrix representation is used as our feature set in the classical
machine learning models.
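Putting the pieces together, a minimal sketch of the vectoriser settings described above (train_descriptions is an assumed list of cleaned documents):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # Mask the class labels, count unigrams and bigrams, then re-weight with TF-IDF.
    count_vec = CountVectorizer(stop_words=["works", "services", "supplies"],
                                ngram_range=(1, 2))
    counts = count_vec.fit_transform(train_descriptions)
    transformer = TfidfTransformer(smooth_idf=True, sublinear_tf=True)
    X_train_tfidf = transformer.fit_transform(counts)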
The deep learning model allows us to bake the processing into the system at runtime,
creating a Continuous Bag-of-Words (CBoW) embedding representation. This tech-
nique gives each word a unique index, and uses these indexes to transform a document
into a sequence of numbers. This technique does not retain any contextual meaning in
the documents, but as has been shown, semantic meaning will have less influence on
our prediction.
Our data is held as Tensors for the machine learning model (the native TensorFlow
data structure that allows us to use low-level operations to build, train and make
predictions). As such, to ensure embedding at runtime, we need to use a TensorFlow wrapper
function to map the conversion of documents to a sequence of data. Implementation
details can be found in our code documentation.
These functions use a TensorFlow Tokenizer and the built-in Python Counter object
to create a vocabulary of unique words assigned to a unique index. The Tensorflow
Tokenizer requires two placeholder values:
i 0 : The Tokenizer object saves the 0 value as a padding placeholder (please see
below for padding justification)
ii n + 1 : n is the number of unique words and n+1 is reserved for unknown words.
After wrapping our data in the processing function, we lastly split the data into
batches. Sequences within the same batch need to be the same length to allow the model
to train, so we used TensorFlow’s padded batch function. Please refer to functions.py
for details on our batching functions.
Our batched data includes both features and labels so can be fed directly to our
deep learning systems for training.
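A minimal sketch of the encoding and batching steps described above (the column names, batch size and use of the Keras Tokenizer here are illustrative assumptions; our actual helpers live in functions.py):

    import tensorflow as tf

    # Build the vocabulary and convert each title to a sequence of indexes;
    # index 0 is reserved as the padding placeholder.
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(train["title"])
    sequences = tokenizer.texts_to_sequences(train["title"])

    dataset = tf.data.Dataset.from_generator(
        lambda: zip(sequences, train["label"]),
        output_types=(tf.int64, tf.int64),
        output_shapes=([None], []),
    )
    # Pad every sequence in a batch (with the 0 placeholder) to a common length.
    batched = dataset.padded_batch(32, padded_shapes=([None], []))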
Our approach, after ensuring the features had been properly processed, was to develop
a baseline accuracy for our predictive system. We conducted this by testing the three
main classical text classification models: Naïve Bayes, XGBoost and Linear SVM. These
models were tested using their default parameters and only the description of the
contract as a feature.
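A minimal sketch of this baseline comparison, assuming the TF-IDF matrices from section 3.3.2 and Macro F1 as the metric (variable names are illustrative):

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from xgboost import XGBClassifier
    from sklearn.metrics import f1_score

    # Fit each model with default parameters on the Description TF-IDF features
    # and report the Macro F1 on the validation split.
    for model in (MultinomialNB(), XGBClassifier(), LinearSVC()):
        model.fit(X_train_tfidf, y_train)
        preds = model.predict(X_val_tfidf)
        print(type(model).__name__, f1_score(y_val, preds, average="macro"))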
The thinking here is to test the above three models and use the best of them to
further explore the feature space (testing various features and combinations of features
against the model with default parameters). The aim is to build up to the best possible
baseline, using a combination of optimum features that can then be used as a starting
point for our neural architectures.
Within this exploration of the feature space was an examination of the effect of
ngram usage on model accuracy. A discussion of these effects with accompanying
diagrams can be found in Project Analysis section 4.3.
Once an optimum baseline accuracy has been achieved, we then conduct both ran-
dom up-sampling and down-sampling to assess the effects on model accuracy. Again,
the results of these tests will inform our approach with the Bidirectional LSTM and
Transformer Learning architectures.
Finally, given an optimal set of features, suitable re-sampling and highest-performing
model we can arrive at an optimal accuracy. This value is then used to comparatively
measure performance with state-of-the-art architectures.
Moving forward from our classical baseline we then use our optimum features to gauge
the performance of our two state-of-the-art approaches. Each model is reconstructed
according to their original paper (details below) using our imported custom classes. Our
Project Analysis section also explores variations on the architecture and parameters
outlined below.
The one-layer BiLSTM is initialised exactly as Adhikari et al propose in 2019:
i Optimiser: Adam
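A minimal sketch of this architecture in Keras (the embedding width, LSTM units and vocabulary size are assumptions; only the single BiLSTM layer and the Adam optimiser are taken from the description above):

    import tensorflow as tf

    vocab_size = 20000  # assumed; in practice n unique words + padding + unknown

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=128,
                                  mask_zero=True),  # 0 is the padding index
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(3, activation="softmax"),  # Services / Works / Supplies
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])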
BERT and other Transfer Learning systems will be constructed using the popular Hug-
ging Face library, accessed through the third-party module Ktrain. This is a tool that
puts the power of pretrained embeddings into the hands of machine learning practi-
tioners, and allows companies to leverage their predictive capability on downstream
tasks.
Using this library, we have constructed a test framework that allows researchers to
initialise our models and further experiment with different parameters and architec-
tures.
For our initial investigation we aim to replicate the architecture of DocBERT, as
it was shown to provide good accuracy for document classification and will serve as a
comparable model to the simpler BiLSTM. Adhikari et al take the BERT model and
build a fully-connected classifier on top and then fine-tune with the task-specific data
for document classification. The Ktrain library allows us to do this easily.
Our BERT initialisation is as follows:
i Optimiser: Adam
As with our BiLSTM, the framework we have constructed will allow us to experi-
ment with different architectures, parameters and pretrained models.
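A minimal sketch of this fine-tuning workflow with Ktrain (the learning rate, sequence length and batch size below are assumptions based on the settings discussed in our analysis, not a definitive configuration):

    import ktrain
    from ktrain import text

    # Wrap the pretrained model, preprocess the raw text, then fine-tune the
    # classifier head on our task-specific labels.
    t = text.Transformer("distilbert-base-uncased", maxlen=500,
                         class_names=["Services", "Works", "Supplies"])
    trn = t.preprocess_train(X_train, y_train)
    val = t.preprocess_test(X_val, y_val)
    learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                                 val_data=val, batch_size=16)
    learner.fit_onecycle(2e-5, 1)  # one epoch, as in our experiments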
Chapter 4
Project Analysis
Here we present our findings across all settings, beginning with our analysis of the data
and classical models. We then work towards an optimum feature set before outlining
our analysis of each modern neural architecture. We begin this discussion with a look
at Correlation between variables.
The majority of our data was categorical, so we utilised a variation on Pearson's
Chi-Square correlation known as Cramér's V. The Chi-Square Test of Independence
indicates whether there is a significant relationship between two variables, with
Cramér's V indicating the strength of that relationship. A score closer to 0 indicates
little relationship and a score closer to 1 indicates a strong relationship between
the two variables.
Cramér's V is defined as V = √(χ² / (N(k − 1))), where χ² is the Pearson Chi-Square
statistic, N is the sample size and k is the lesser number of categories of either variable.
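A minimal sketch of the statistic in Python (illustrative only; our actual helper lives in functions.py):

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def cramers_v(x, y):
        # Build the contingency table, take the chi-square statistic, then scale
        # by the sample size and the lesser category count minus one.
        confusion = pd.crosstab(x, y)
        chi2 = chi2_contingency(confusion)[0]
        n = confusion.to_numpy().sum()
        k = min(confusion.shape) - 1
        return np.sqrt(chi2 / (n * k))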
We felt that by using this method we could determine a good starting point for our
features. Our intention to use human-level requirements to classify contracts (the title,
the description) could be supplemented by a variety of other features in the data. As
such it was pertinent to explore the correlation of all variables in terms of predicting
our target value, Nature of Contract.
Using the results demonstrated in table 4.1 we can cut down on the time spent
exploring data that has no impact on our prediction, making for a faster analysis and
a better performing predictive system.
Naïve Bayes was chosen as our first model in an arbitrary way; it was simply fortuitous
that this model achieved the lowest accuracy.
Our initial model looks promising, performing quite well on the fully cleaned De-
scription data. Remember, from our Methodology section 3.4, the classical
models do not have the ability to process the data, so each of the three models uses
fully cleaned data with the Spacy preprocessing functions.
As figure 4.3 shows, our Naı̈ve Bayes classifier represents an opportune starting
point, achieving 79.78% accuracy.
The next classical algorithm in our initial exploration was XGBoost. Again, as per our
Methodology section 3.4, here we are aiming to gauge the effectiveness of each classical
model with a single feature (the Description).
There is a clear improvement across all classes by switching to the tree-based algo-
rithm, going from 79.78% to 81.48%.
The final classical model we use to try and improve our previous baseline is Linear
SVM. As demonstrated in our Related Work section, this algorithm has also shown
success in the field of document classification.
With an optimal model that can predict the Nature of Contract based on the single De-
scription feature, we felt it also important to aim for some level of domain knowledge.
As we know from Related Work section 2.10.1, embedding techniques that involve the
representation of contextual meaning add little predictive power to document classifi-
cation systems. We felt that another way to represent a modicum of domain knowledge
within our embeddings was to enable the TF-IDF object to use both unigrams and
bigrams when creating the TF-IDF matrix.
In essence this doubled the number of features our model could use to predict
the Nature of Contract by including conjoining words, as well as unique words in our
words × documents feature matrix.
By enabling the TF-IDF object to create features using both unigrams and bigrams
we can see a marked improvement in our visualisation. We still have three distinct data
funnels down which our embeddings are being forced, but now the classes are much
more distinct within each direction. This improved visualisation is reflected in the
accuracy with which our Linear SVM could then predict Nature of Contract using
these new Description embeddings.
This improvement can be seen from the confusion matrix below (figure 4.11). En-
abling the use of unigrams and bigrams has nudged the accuracy of our classifier up
across all classes.
Exploring Title as a singular feature had a number of benefits. Firstly, in our previous
work using Natural Language Processing to classify disaster tweets [25], we found that
Linear SVM performed well on short input sequences. Secondly, due to fewer unique
words we produce a denser TF-IDF matrix, which we surmise allows Linear SVM to
better fit a separating hyper-plane to the data.
Before exploring the feature we felt it suitable to visualise the embeddings in a simi-
lar way to Description, by reducing the dimensions using Singular Value Decomposition
and plotting on a three dimensional scatter plot.
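A minimal sketch of this visualisation (the TF-IDF matrix and class colour codes are assumed variables):

    import matplotlib.pyplot as plt
    from sklearn.decomposition import TruncatedSVD

    # Reduce the sparse TF-IDF matrix to three components and colour by class.
    svd = TruncatedSVD(n_components=3, random_state=42)
    reduced = svd.fit_transform(X_title_tfidf)

    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(reduced[:, 0], reduced[:, 1], reduced[:, 2], c=class_codes, s=2)
    plt.show()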
Our embeddings are now looking extremely distinct. By shortening the input se-
quence and creating a denser TF-IDF matrix - but still allowing both unigrams and
bigrams - our Vectoriser has created three completely distinct groupings of data. The
size of each class in the above plot roughly reflects our target distribution outlined in
figure 3.1. This evidence heavily supports our decision to use shorter input sequences
and allow for bigrams in our TF-IDF Vectoriser.
The evidence is further supplemented with the results from our Linear SVM using
the above embedding representations.
Our accuracy is now reaching a suitable level to meet company requirements. How-
ever, due to the high degree of predictive accuracy we began to examine possible sources
of data leakage. This led to the discovery that our target labels may be contained within
some of the documents. This effect was mitigated using the technique below.
In order to avoid data leakage from our independent variables (the text of the doc-
ument) we created a new TF-IDF Vectoriser, inclusive of a custom stopword list that
contained each of our classes. By passing this parameter to the Vectoriser, these words
would be removed from the documents and could not be used by the model to inform
prediction.
Our experiment with masked classes showed a minor reduction in accuracy across
all settings, but the reduction is so small as to have a negligible effect on results.
Nevertheless, as a precaution, all TF-IDF matrices from this point on are created using
masked classes and the unigram / bigram combination.
fully optimised. This approach allowed us to begin experimenting with different hyper-
parameters and architectures in a quick and efficient manner.
Having completed the above preliminary steps we felt we had exhausted our options
when classifying procurement contracts with classical machine learning methods. To
retain the sequential nature of our analysis we then proceeded a step further in com-
plexity by utilising our custom BiLSTM Builder class to replicate the Neural Network
outlined by Adhikari et al.
Title performed so well with our classical models that we felt it a suitable feature to begin our
analysis of more complex neural architectures. Utilising built-in Tensorflow operations
we are able to train and test on the raw data. This method, as mentioned in our
Methodology section, allows us to bake the pre-processing into the model. This analysis
is conducted using the 60 / 20 / 20 Training, Validation and Test split.
The raw data - as shown in figure 4.17 - performed well even though the Tensorflow
preprocessing provides limited capability, achieving an excellent 96% accuracy.
Description of contract is our other main feature, and as such we felt it important to
explore this feature in the BiLSTM as a predictive variable. Sequence length is our main
concern with Description, having no upper bound on its size. A BiLSTM will handle
larger sequences than BERT, with an upper threshold of approx. 1000 characters, but
our Description may also exceed this amount. The effect of longer sequences is reflected
in the model accuracy: at 72%, this model scored considerably lower than the others.
Cleaning the Title made a negligible difference to our model, indicating that – when
concerning Title – we can bake the processing into the implementation avoiding lengthy
third-party preprocessing functions, with minimal effect on accuracy.
The third-party cleaning function did make a noticeable difference when we used
Description as our only predictor. We see a greater than 10% increase in accuracy by
first running our Description through the Spacy preprocessing steps. This could be
down to shorter input sequences better informing our prediction, or to less noise in
the input data.
We further explored the BiLSTM architecture by adjusting the ratio of Train, Valida-
tion and Testing splits. This analysis produced no significant improvement using our
best feature, Title.
Due to our lack of significant findings in this regard we have chosen to omit the
classification report and confusion matrix, however these tables can be found in the
jupyter notebook, BiP Solutions – Tensorflow End-To-End BiLSTM.
As with our BiLSTM, we have drilled through our data to find the optimum, text-based
features and so can get started with our Transformer Learning system straight away.
As mentioned in our Methodology section we have tried to retain the hyperparameters
from the original DocBERT classification system. Some changes had to be
made due to Out Of Memory (OOM) errors with our GPU (things like sequence length
and batch size), but these changes will be made apparent as and when they occur.
Also note that the data has no need to be processed by our third-party Spacy
functions as the Ktrain library has built-in preprocessing which conducts the main
steps outlined in the NLP Pipeline in figure 2.1. Finally it must be noted that, due to
time and resources, we have opted to use the DistilBert pretrained model (a distilled
version of the full BERT system with a smaller number of trainable parameters).
As ever, we begin our analysis with our “top” feature, Title. This feature is a great fit
for a BERT Classifier as we have a short sequence length, meaning there is a greater
likelihood of the classifier being exposed to the complete Title.
Also, following from our discovery of very little impact from different Train, Validate
and Test splits we have chosen to proceed with the 60 / 20 / 20 split. It should also be
noted that we had to use the popular Seaborn library to produce our confusion matrix.
This decision was down to a dependency clash (Ktrain requires Sci-Kit Learn version
0.3.1, which does not include the Plot Confusion Matrix object).
As figure 4.25 shows, BERT achieves the highest overall accuracy - at 98% - using
Title to predict the Nature of Contract. This is further enhanced by the extremely low
number of epochs required to achieve this accuracy (we had to opt for 1 epoch due to
computational complexity). This high accuracy is offset by the extremely long training
times (roughly 90 minutes per epoch) and relatively long inference time (approx. 6-7
minutes for approx. 50,000 contracts).
Following our methodology, we then used our Description feature with the BERT clas-
sifier. As expected, this did not perform as well as the Title feature, however it did
outperform all other models that used the same independent variable. We had to make
two changes to this model to ensure training; decreasing maximum sequence length to
400 (as opposed to 500 with Title) and a batch size of 8 (as opposed to 16 with Title).
The Ktrain library allows us to very easily swap out one set of pretrained embeddings for
another. Due to time and computational resources we have chosen to explore only one
alternative set of pretrained embeddings, RoBERTa.
Continuing to follow our methodology, we first apply the Title feature to the RoBERTa
model. As discussed, the Ktrain library allows us to preprocess the text using built-in
methods.
RoBERTa did not perform as well as the original BERT classifier, however with
further training we would expect the accuracy to increase, and possibly surpass that of
the original BERT model on this task. Also note that with RoBERTa we were required
to decrease our maximum sequence length to 400 and our batch size to 8 to avoid an
OOM error.
Here we present an overall view of our results, highlighting our top scoring classical,
BiLSTM and Transformer Learning implementation.
Chapter 5
Conclusions
In the work above we have presented our findings when comparing a simple one-layer
BiLSTM with Transformer Learning pretrained embeddings. We also included a full
analysis of classical machine learning methods for classifying e-procurement contracts.
Below we present our full table of results across all settings.
As demonstrated in table 4.2, BERT outperforms all other models in the task
of classification, but the significant training time and prohibitive size of the model
offset this accuracy. The BiLSTM compares well with the pretrained embeddings,
slightly under-performing by comparison, but as an option for deployment would be
more suitable due to the short training time and the ability to embed the preprocessing
in the model. If the company required a state-of-the-art neural classification system in
which accuracy was of the utmost importance, we feel the BiLSTM would better meet
these requirements than BERT while retaining a relatively simple architecture.
However, our recommendation for implementation is to use the classical Linear
SVM method. Although this requires our text to be preprocessed by the third-party
function, it is the most efficient and scalable solution. Using this algorithm allows us
to effectively balance the accuracy and interpretability of the model, allowing company
researchers to better track the decision-making process and scale as required.
The research above also indicates that – in all settings – opting for a shorter input
sequence (using the Title attribute) allows for greater predictive capability. This
reflects our previous work classifying disaster tweets and makes intuitive
sense when we consider the significantly smaller unique vocabulary and subsequently
denser TF-IDF matrix. For this reason, we recommend a Linear SVM classifier, trained
to predict on the Title, as a means of meeting all company requirements. This imple-
mentation will help to enhance user experience on the BiP e-procurement contract
search engine.
Further research could be conducted exploring very recent pretrained embeddings
(such as ELECTRA [9] from Google, or GPT-3 [6] from OpenAI) as a means of
classifying e-procurement contracts. However, with each new pretrained model the
number of trainable parameters undergoes a significant increase (GPT-3 has billions of
trainable parameters). Considering the need for a rapidly trainable and scalable
solution, we do not feel that any significant increase in performance would be found by
exploring these new state-of-the-art language models for classifying e-procurement
contracts.
Bibliography
[1] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. Docbert: Bert
for document classification. 2019.
[2] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. Rethinking
complex neural network architectures for document classification. NAACL HLT
2019 - 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies - Proceedings of the
Conference, 1:4046–4051, 2019.
[3] Chaitanya Anne, Avdesh Mishra, Md Tamjidul Hoque, and Shengru Tu. Multiclass
patent document classification. Artificial Intelligence Research, 7(1):1, 2017.
[4] Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. Multilingual
multi-class sentiment classification using convolutional neural networks. LREC
2018 - 11th International Conference on Language Resources and Evaluation, pages
635–640, 2019.
[5] James Bennett and Stan Lanning. The Netflix Prize. 2007.
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. Language models are few-shot learners. 2020.
[7] Che-Wen Chen, Shih-Pang Tseng, Ta-Wen Kuan, and Jhing-Fa Wang. Outpatient
text classification using attention-based bidirectional lstm for robot-assisted
servicing in hospital.
[8] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system.
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 785–794, 2016.
[9] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning.
Electra: Pre-training text encoders as discriminators rather than generators. pages
1–18, 2020.
[10] Fabrice Colas and Pavel Brazdil. On the behavior of svm and some older algorithms
in binary text classification tasks. Lecture Notes in Computer Science (including
subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 4188 LNCS:45–52, 2006.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. pages
4171–4186, 2018.
[12] R. Draelos. Best use of train/val/test splits, with tips for medical data.
Glass Box Medicine, https://glassboxmedicine.com/2019/09/15/best-use-of-train-
val-test-splits-with-tips-for-medical-data/, 2019.
[14] Wael Etaiwi and Ghazi Naymat. The impact of applying different preprocessing
steps on review spam detection. Procedia Computer Science, 113:273–279, 2017.
[15] Yoav Goldberg. Neural network methods for natural language processing. Neural
Network Methods for Natural Language Processing, pages 77–78, 2017.
[17] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural
nets and problem solutions. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems, 6(2):107–116, 1998.
[18] Joemon M Jose, Emine Yilmaz, João Magalhães, Pablo Castells, Nicola Ferro,
Mário J Silva, and Flávio Martins. Advances in Information Retrieval: 42nd
European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17,
2020, Proceedings, Part I. Springer International Publishing, 2020.
[19] Ashraf M. Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes.
Multinomial naive bayes for text categorization revisited. Australasian Joint
Conference on Artificial Intelligence, pages 488–499, 2004.
[20] Yoon Kim. Convolutional neural networks for sentence classification. EMNLP
2014 - 2014 Conference on Empirical Methods in Natural Language Processing,
Proceedings of the Conference, pages 1746–1751, 2014.
[21] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim,
Chan Ho So, and Jaewoo Kang. Biobert: A pre-trained biomedical language
representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240,
2020.
[22] Jingzhou Liu, Wei Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for
extreme multi-label text classification. SIGIR 2017 - Proceedings of the 40th
International ACM SIGIR Conference on Research and Development in Information
Retrieval, 2017.
[23] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text
classification with multi-task learning. IJCAI International Joint Conference on
Artificial Intelligence, pages 2873–2879, 2016.
[24] Stuart Mackie, David MacDonald, Leif Azzopardi, and Yashar Moshfeghi. Looking
for opportunities: Challenges in professional procurement search. SIGIR 2019 -
Proceedings of the 42nd International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 1397–1398, 2019.
[26] P. Nayak. Understanding searches better than ever before. The Keyword,
https://blog.google/products/search/search-language-understanding-bert/, 2019.
[27] Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu
Song, and Qiang Yang. Large-scale hierarchical text classification with recursively
regularized deep graph-cnn. The Web Conference 2018 - Proceedings of the World
Wide Web Conference, WWW 2018, pages 1063–1072, 2018.
[28] Oscar Quispe, Alexander Ocsa, and Ricardo Coronado. Latent semantic indexing
and convolutional neural network for multi-label and multi-class text classification.
2017 IEEE Latin American Conference on Computational Intelligence, LA-CCI
2017 - Proceedings, pages 1–6, 2018.
[29] Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers.
Statistical topic models for multi-label document classification. Machine Learning,
88(1-2):157–208, 2012.
[30] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert,
a distilled version of bert: smaller, faster, cheaper and lighter. pages 2–6, 2019.
[31] Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han.
Hierarchical attention networks. ArXiv, pages 1480–1489, 2016.
[32] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent
neural network for sentiment classification. Conference Proceedings - EMNLP 2015:
Conference on Empirical Methods in Natural Language Processing, pages 1422–1432,
2015.
[33] L. Torlay, M. Perrone-Bertolotti, E. Thomas, and M. Baciu. Machine learning-
XGBoost analysis of language networks to classify patients with epilepsy. Brain
Informatics, 4(3):159–169, 2017.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
Advances in Neural Information Processing Systems, pages 5999–6009, 2017.
[35] Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang.
Sgm: Sequence generation model for multi-label classification. 2018.