Privacy Preserving Machine Learning
Faculdade de Tecnologia
Departamento de Engenharia Elétrica
Advisor
Prof. Dr. Daniel Guerreiro e Silva
Co-advisor
Prof. Dr. Anderson Clayton Alves do Nascimento
Brasília
2022
Signatures
Universidade de Brasília
To God, author and meaning of all existence, be all the glory, honor and praise.
Acknowledgements
To all those who supported me in this harsh journey, especially to my beloved Talita.
Abstract
Machine learning (ML) applications have become increasingly frequent and pervasive in
many areas of our lives. We enjoy customized services based on predictive models built
with our private data. There are, however, growing concerns about privacy. This is evidenced by the enactment of the General Data Protection Law in Brazil and by similar legislative initiatives in the European Union and in several other countries.
This trade-off between privacy and the benefits of ML applications can be mitigated with the use of techniques that allow the construction and operation of these computational models with formal (mathematical) guarantees of privacy preservation. These techniques need to respond adequately to challenges posed at every stage of the typical ML application life cycle, from data discovery, through feature extraction, model training and validation, up to its effective use.
This work presents a framework of techniques for Privacy-Preserving Machine Learning (PPML) and Natural Language Processing (NLP), built on homomorphic cryptography primitives and Secure Multi-party Computation (MPC) protocols, which allow the adequate treatment of data and the efficient application of ML algorithms with robust guarantees of privacy preservation in text classification. This work also brings a practical application of privacy-preserving text classification to the detection of fake news.
Contents
1 Introduction
1.1 Research subject
1.2 Motivation
1.3 Research objectives
1.4 Methodology
1.4.1 Preliminary results
1.4.2 Limitations
1.5 Organization

3 Text classification
3.1 Natural Language Processing
3.2 Classic NLP preprocessing techniques
3.2.1 Tokenization
3.2.2 Stopword removal (SwR)
3.2.3 Stemming
3.2.4 Lemmatization
3.2.5 Part-of-Speech (PoS) tagging
3.2.6 Bag-of-Words (BoW)
3.2.7 Term Frequency – Inverse Document Frequency (TF-IDF)
3.2.8 Continuous bag-of-words (CBoW)
3.3 The State-of-the-Art: Transformers
3.4 Trade-offs and applications

4 Fake news detection
4.1 Detection approaches
4.1.1 Source based detection
4.1.2 Fact checking
4.1.3 Natural Language Processing (NLP)
4.2 Experimental results
4.2.1 Clear-text setting
4.2.2 Privacy-preserving setting
4.2.3 Privacy-preserving inference

Bibliography

Appendix
I MPC Protocols
List of Figures
List of Tables
Acronyms
HE Homomorphic Encryption.
ML Machine Learning.
1 Introduction
— Blaise Pascal, Pensées
Natural Language Processing (NLP), the group of techniques that includes text classification, is one of the earliest and also one of the most advanced areas of Machine Learning (ML). Conversational agents and generative models still leave people in awe, especially when large language understanding models, such as GPT-3, make it to the headlines in the general media, which advertise their fascinating performance on tasks as diverse as producing working computer code, coherently responding to e-mails, or even writing a novel [1].
In fact, astonishing headlines aside, ML applications have become ever more present and have an increasing impact on everyday life. Predictive models built with ML algorithms are found in trivial tasks, from ranking and ordering algorithms in search engines and social media to recommendation systems in streaming platforms and e-commerce marketplaces. There are also applications in areas as sensitive as medical imaging diagnosis and the detection of tax fraud and crimes against the financial system.
In most cases, we benefit from personalized services based on inference models built with our private data. Recent disclosures of numerous cases of abuse arising from the possession of such data, in addition to frequent security breaches that expose millions of users, have fueled a growing concern about privacy. The General Data Protection Law (LGPD) in Brazil, and similar legislative initiatives such as the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act of 2018 (CCPA), are evidence of this move to limit and regulate the use of private information by large service providers [2, 3].
This trade-off between privacy and the benefits of ML applications can be mitigated with the use of techniques that allow the training and application of predictive models while preserving user privacy. These techniques need to respond adequately to the challenges presented at all stages in the life-cycle of a typical ML application: from data discovery, through the data wrangling stage (feature selection, feature extraction, combination, normalization and imputation in the sample space), to the training and validation of the models, until their effective use in inference.
The main challenge, in all these lines of research, is the distance between the logical or theoretical design of the proposed solutions and their intended practical applications. The most cited works in the area generally do not show how, or how effectively, the proposed system or protocol deals with problems of scale, response time, and usability, among others, which drastically affect its ability to support a real decision process in real use cases.
Furthermore, most works in the literature focus on just one phase of the Machine Learning life-cycle: either the training-related steps or the application of the trained model in inference (both for regression and classification models). Little or no attention is given to the initial steps, such as data fitting and feature extraction, activities that are extremely relevant for the overall performance of the predictive models. Works that bring any analysis of the statistical robustness or the quality of the models produced are also rare.
Therefore, the selected subject for the current research is not limited to the study of PPML techniques for privacy-preserving text classification. The complete knowledge production cycle is considered, from the basic theory to the assessment of the impact of its applications. Attention is given to how those techniques can be used with, or interact with, well-established NLP procedures for textual data wrangling and preprocessing. The study also looks into statistical tests to check the predictive power and quality of the produced models. Practical implementation details, such as realistic computational cost analysis, are also examined.
For the sake of experimental reproducibility, and as a way to apply and demonstrate
the produced knowledge in publicly available databases, this study also dives into the use
case of text classification for fake news detection.
1.2 Motivation
This work results from both the author's research as a doctoral student at the Graduate Program in Electrical Engineering (PPGEE) at Universidade de Brasília, and his work as a Data Scientist at the Administrative Council for Economic Defense (Cade). It also builds on experience with homomorphic cryptography gathered during the author's master's research [17].
The scientific investigation proposed in this project is primarily based on the need to complete the knowledge production cycle, as discussed above, with the effective development, application and validation of a technology based on the state of the art in the fields of Cryptography, Machine Learning and Natural Language Processing. These are somewhat disparate concepts: while ML and NLP are focused on extracting information, Cryptography is focused on concealing it. Therefore, there is a contribution from the theoretical point of view, considering the exposition on how to couple and harmonize such different groups of techniques. Another contribution is the discussion of adequate security models for the specific task of text classification. There is also a contribution from a technical point of view, with the carrying out of experiments that may serve as a reference implementation of the proposed models. The most relevant contribution, nevertheless, is the practical application, with relevant impact on the institutions involved - in the present case, the specific contribution to Cade, the Brazilian competition authority.
The intelligence unit at Cade deals with an enormous quantity of data, from large public procurement databases to open intelligence sources such as news articles, online marketplaces, and company websites. This unit also performs dawn raids, collecting documents from investigated companies - on paper, on computers, hard drives, executives' smartphones, etc. Some operations are carried out in cooperation with prosecutors and police at the Federal or State level. Some, involving multinationals, are coordinated with competition authorities in other territories.
All this intelligence activity, nonetheless, is bound by the data protection laws in force in Brazil, which require formal guarantees of privacy protection [19]. While searching for evidence of cartel or other anticompetitive conduct, Cade needs to protect the privacy of the individuals involved - whether executives, legal representatives or other persons somehow related to the investigated economic agents. A considerable portion of this data lies in textual formats; hence the importance of privacy in text classification.
• To identify possible limitations and flaws in PPML solutions presented in the literature;
• To propose and test improvements, new techniques and implementation details that may correct and overcome major flaws and limitations pointed out in previous solutions;
• To build a reference implementation of a framework in order to demonstrate the feasibility and correctness of the proposed techniques;
• To propose a work process that adequately integrates said framework with the processes in force at an organization that makes massive use of private or legally confidential data.
1.4 Methodology
Experience shows, as reported by Trauth [20], that when a study in the area of technology
development seeks to understand the impact of a given solution, there is a need for the
application of qualitative methodologies.
Also, according to Deb, Dey & Balas [21], engineering research must effectively combine the conceptualization of a research question with practical problems, from equipment to the algorithms and mathematical concepts used to solve the proposed problem. Furthermore, engineering research should advance knowledge in three broad, and somewhat overlapping, areas: observational data, that is, knowledge of the phenomena; functional modeling of the observed phenomena; and the design of processes (algorithms, procedures, arrangements) that contribute to the desired output. This brings a strong descriptive character to engineering research, as one must clearly communicate the preconditions and environmental dependencies, observed data, processes, inputs and outputs of their experimental results.
Kaplan and Duchon state that the eminently applied nature of research in the field of information systems requires a combination of qualitative and quantitative methods [22]. They assert that quantitative investigations, usually performed through statistical hypothesis testing, are extremely limited when the expected results or the intended applications are highly dependent on context.

They refer to the work of Yin [23], which deals with methods for Case Study research, to show that quantitative research must be preceded by a qualitative investigation, in which the problem is better defined based on the observation of the context, habits and needs of the stakeholders. This is even more relevant in exploratory research, in which new hypotheses are raised. These initial hypotheses need contextualization, through qualitative investigation, to create more refined models and hypotheses, which can then be tested using quantitative methods.
Taking into account the research subject and the proposed objectives, this work is exploratory in nature. It draws from different areas of study to propose a new technological arrangement. It is also of an applied nature, as the method used for scientific investigation is centered on the design of a solution for a specific problem. Therefore, it must combine descriptive, qualitative and quantitative approaches to the study of the selected problem. Overall, this work can be described as the combination of the following initiatives:
7. Complexity and cost analysis and comparison between HE and MPC protocols;
9. Experimentation on fake news detection over ciphered text and MPC protocols;
1.4.2 Limitations
Because it spans research on several different topics (ML, NLP, HE and MPC), this document is not meant as a deep exposition of any of these areas. It provides, nonetheless, a good set of references for those willing to further investigate relevant results in each area.
This work also does not set out to be a reference for complete security proofs of all
the cryptographic primitives and protocols used. There is, however, sufficient discussion
regarding composable protocols and the security of protocol compositions.
There are projects underway to apply the knowledge gathered in this study, as well as the proposed techniques, to document classification and evidence search at Cade. There are critical use cases, especially those involving cooperation and information sharing with other government agencies and with competition authorities in other countries. However, due to confidentiality requirements, it is not possible to expose results or present reproducible experiments performed over such data.
All the experimental details exposed are limited to the fake news detection task. There is also the choice to treat this task as a binary output problem: documents are classified as either fake or true. This choice results from the fact that most public fake news datasets are annotated that way. The same techniques can be generalized to multi-class problems using the one-vs-all approach, which is carried out by training one classifier per class.
1.5 Organization
The next chapter brings a more detailed exposition of the concept of Privacy-Preserving Machine Learning (PPML), with selected results from the literature. Chapter 3 presents an extensive literature review on Natural Language Processing, from the classic preprocessing techniques developed over the last four decades to the present state of the art with 'transformers' and other complex Natural Language Understanding models. The fourth chapter discusses the concept of fake news, offers a short review of fake news detection, and introduces a few preliminary results on fake news detection with the use of privacy-preserving ML algorithms.
The last chapter summarizes our results and conclusions, pointing out the best of our knowledge in privacy-preserving text classification solutions and their application to the specific problem of fake news detection.
2 Privacy-Preserving Machine Learning
example, the CryptDB from MIT [15]. In this line there are also many practical solutions, including areas as sensitive as Electronic Health Records [14, 17].
Note that the use of encrypted databases usually requires the prompt availability of the entire database, plus the computing power needed for encryption and for the training of inference models - which is not always the case. Seeking to overcome this limitation, there is also a line of work focused on online, distributed, interactive and reinforcement learning techniques. These applications require the development of Secure Multi-party Computation (MPC) and the composition of several protocols as the basis for learning algorithms [18].
Since then, privacy-preserving computation has grown in importance and attention, and many Privacy-Preserving Machine Learning (PPML) and Privacy-Preserving Function Evaluation (PPFE) frameworks have been developed. The protocols that form the basic building blocks of these frameworks are usually based on homomorphic cryptography primitives or Secure Multi-Party Computation protocols. Some of the first frameworks to appear in the literature, for instance, used MPC protocols based on Secret Sharing. Among those preceding results are FairPlayMP [26] and Sharemind [27].
Recent developments include frameworks like PySyft [28], which uses HE protocols, and Chameleon [29], which uses MPC for linear operations and Garbled Circuits for non-linear evaluations. Other MPC frameworks in the literature include CrypTen [30], PICCO [31], TinyGarble [32], ABY3 [33] and SecureML [34].
Protocol: πADD
Input: secret shares ⟦x1⟧q, ..., ⟦xn⟧q
Output: ⟦z⟧q = Σ_{j=1}^{n} ⟦xj⟧q
Execution:
1. Each party Pi ∈ P locally computes its output share zi = Σ_{j=1}^{n} xj,i mod q, where xj,i denotes Pi's share of xj
Additive secret sharing is one way to implement MPC. Protocol parties have additive shares of their secret values and perform joint computations over those shares. For example, to create n additive shares of a secret value x ∈ Zq, a participant can draw (x1, ..., xn) uniformly from {0, ..., q − 1} such that x = Σ_{i=1}^{n} xi mod q. We denote this sharing of x by ⟦x⟧q.
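As a concrete illustration, the sharing and reconstruction steps just described, together with πADD, can be sketched in a few lines of Python. This is a local simulation for clarity: the modulus Q and the list-based representation of the parties' shares are illustrative choices, not taken from the text.

```python
import random

Q = 2**31 - 1  # an illustrative prime modulus q

def share(x, n):
    """Split secret x into n additive shares that sum to x mod Q."""
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Opening step: z = sum of all local shares mod Q."""
    return sum(shares) % Q

def pi_add(shared_secrets):
    """pi_ADD: party i locally sums its share of every secret x_j."""
    n = len(shared_secrets[0])
    return [sum(sh[i] for sh in shared_secrets) % Q for i in range(n)]
```

Note that no single share reveals anything about x: each party only learns the secret if all local values are combined in the final opening step.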
Protocol: πMUL
Setup:
1. The Trusted Initializer draws u, v, w uniformly from Zq, such that w = uv, and distributes shares ⟦u⟧q, ⟦v⟧q and ⟦w⟧q to the protocol parties
2. The TI draws ι uniformly from {1, ..., n}, and sends the asymmetric bit bι = 1 to party Pι and bi = 0 to the parties Pi≠ι
Input: ⟦x⟧q and ⟦y⟧q
Output: ⟦z⟧q = ⟦xy⟧q
Execution:
1. Each party Pi locally computes di = xi − ui and ei = yi − vi
2. Parties broadcast di, ei
3. Each party computes d ← Σ_{i=1}^{n} di and e ← Σ_{i=1}^{n} ei
4. Each party Pi computes its output share zi = wi + e·ui + d·vi + bi·d·e mod q
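The multiplication protocol can also be sketched in Python. The following is a single-process simulation in which the Trusted Initializer's setup is folded into the same function for brevity; the variable names mirror the protocol description, but the local structure (no real network broadcast) is an illustrative simplification.

```python
import random

Q = 2**31 - 1  # an illustrative prime modulus q

def share(x, n):
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

def pi_mul(x_sh, y_sh):
    """pi_MUL for n parties, with the TI setup simulated locally."""
    n = len(x_sh)
    # Setup: the TI distributes shares of a triple w = u * v,
    # and gives the asymmetric bit to one random party iota
    u, v = random.randrange(Q), random.randrange(Q)
    u_sh, v_sh = share(u, n), share(v, n)
    w_sh = share((u * v) % Q, n)
    iota = random.randrange(n)
    # Step 1 (local): d_i = x_i - u_i, e_i = y_i - v_i
    d_sh = [(x_sh[i] - u_sh[i]) % Q for i in range(n)]
    e_sh = [(y_sh[i] - v_sh[i]) % Q for i in range(n)]
    # Steps 2-3: broadcast and sum, opening d = x - u and e = y - v
    d, e = sum(d_sh) % Q, sum(e_sh) % Q
    # Step 4 (local): z_i = w_i + e*u_i + d*v_i, plus d*e at party iota
    return [(w_sh[i] + e * u_sh[i] + d * v_sh[i]
             + (d * e if i == iota else 0)) % Q for i in range(n)]
```

The correctness follows from xy = (u + d)(v + e) = w + e·u + d·v + d·e, with the asymmetric bit ensuring the public term d·e is added exactly once.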
Given two sets of shares ⟦x⟧q, ⟦y⟧q and a constant α, it is trivial to implement a protocol like πADD or πMUL in order to locally compute linear functions over the respective shares and broadcast local results to securely compute values such as ⟦z⟧q = α(⟦x⟧q ± ⟦y⟧q), ⟦z⟧q = α ± (⟦x⟧q ± ⟦y⟧q), and ⟦z⟧q = α1(α2 ± (⟦x⟧q ± ⟦y⟧q)). In all those operations, ⟦z⟧q is the shared secret result of the protocol. The real value of z can only be obtained if one has knowledge of the local zi values held by all computing parties, and performs a last step of computation to obtain z = Σ_{i=1}^{n} zi mod q.
With these two basic building blocks, addition and multiplication, it is possible to compose protocols to perform virtually any computation. For instance, Protocol πEq uses πMUL to provide Secure Distributed Equality computation. Appendix I provides definitions for other additive secret sharing MPC protocols, such as the Secure Multi-party Inner Product Protocol πIP, Secure Multi-party Bitwise OR/XOR πOR|XOR, Secure Multi-party Argmax πargmax and the Secure Multi-party Order Comparison Protocol πDC.
Protocol: πEq
Setup: the setup procedure for πMUL
Input: ⟦x⟧q and ⟦y⟧q
Output: ⟦z⟧q = ⟦0⟧q if x = y; ⟦z⟧q ≠ ⟦0⟧q otherwise
Execution:
3. Output ⟦z⟧q
All of these protocols are built on the commodity-based model [36]. In this approach, there is a costly offline phase, led by a Trusted Initializer (TI), that pre-distributes correlated random numbers. This role can be performed by an independent agent or by one of the computing parties, without loss of generality or of the security guarantees of the online phase.
In order to learn from data, a Machine Learning algorithm will usually update a set β of internal model parameters by iterating a computation over the training set, comparing the result of a function F(X) = y′ over the properties of each element in the sample with the expected inference output. The difference y′ − y enters a 'loss' function, which is used to update the internal parameters with an intensity defined by a pre-set learning rate.
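The iterative update just described can be sketched as plain gradient descent over a squared loss for a linear model. The function names and the choice of optimizer are illustrative assumptions, since the text does not fix a particular update rule.

```python
def predict(beta, x):
    # linear model: F(x) = sum_j beta_j * x_j
    return sum(b * xj for b, xj in zip(beta, x))

def train_step(beta, X, y, lr):
    """One iteration: compare predictions y' = F(X) with the expected
    outputs y, and move each parameter against the gradient of the
    mean squared loss, scaled by the learning rate lr."""
    n = len(X)
    grads = [
        2 / n * sum((predict(beta, x) - yi) * x[j] for x, yi in zip(X, y))
        for j in range(len(beta))
    ]
    return [b - lr * g for b, g in zip(beta, grads)]
```

Repeating `train_step` drives β toward the minimizer of the mean squared error; in a privacy-preserving setting, every multiplication and addition inside it would be replaced by the corresponding MPC protocol.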
One of the most commonly used models, Linear Regression, consists in the multiplication of the model parameters β with a matrix X ∈ Zq^{n×k} representing the feature values of all the elements in the training set. The learning goal is to find a coefficient vector β = (β0 β1 ... βk) that minimizes the mean squared error

(1/n) Σ_{i=1}^{n} (βxi − yi)²    (2.1)
The coefficient vector that minimizes (2.1) can be computed in closed form as

β = (XᵀX)⁻¹ Xᵀ y    (2.2)
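As a minimal numerical sketch of equation (2.2), the following uses NumPy and a hypothetical toy dataset generated from y = 1 + 2x (the data and the column-of-ones intercept trick are illustrative, not from the text):

```python
import numpy as np

# Toy design matrix: first column of ones for the intercept beta_0,
# second column holding the single feature x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Equation (2.2): beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
# beta ≈ [1., 2.], recovering the intercept and slope exactly
```

In a privacy-preserving setting, the matrix products and the inversion in this one-liner are exactly the operations that protocols such as πMMUL below must provide over shared data.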
Protocol: πMMUL
Setup: the TI chooses uniformly random Ax, Aw ∈ Zq^{n1×n2}, Bx, Bw ∈ Zq^{n2×n3} and T ∈ Zq^{n1×n3}, and distributes the values Ax, Bx and T to party Pi, and the values Aw, Bw and C = (AxBw + AwBx − T) to Pj≠i
Input: ⟦X⟧q and ⟦W⟧q
Output: ⟦XW⟧q
Execution:
2. Pj sends (X − Aw) and (W − Bw) to Pi
3. Output ⟦XW⟧q
∀m ∈ M, fC(Enc(m)) ≡ Enc(fM(m))
Fully Homomorphic Encryption (FHE) refers to a class of cryptosystems for which the homomorphism is valid for every function defined on M. That is:

∀fM : M → M, ∃fC : C → C such that fC(Enc(m)) ≡ Enc(fM(m)) ∀m ∈ M
The most commonly used homomorphic cryptosystems, however, are only partially homomorphic. There are additively homomorphic systems, multiplicatively homomorphic systems, and systems that combine a few homomorphic features. Paillier's cryptosystem, for example, offers an additive homomorphism between ciphertexts, along with multiplication of a ciphertext by a plaintext constant, features that can be used to delegate a limited set of computations over a dataset without compromising its confidentiality [37, 16].
The underlying primitive in Paillier's system is the Decisional Composite Residuosity Problem (DCRP). This problem deals with the intractability of deciding, given n = pq, where p and q are two unknown large primes, and an arbitrary integer g coprime to n, whether g is an n-th residue modulo n²; in other words, whether there exists y ∈ Z*_{n²} such that g ≡ yⁿ mod n². In his work, Paillier defines the DCRP and demonstrates its equivalence (in terms of computing cost) with the Quadratic Residuosity Problem, which is the foundation of well-known cryptosystems such as Goldwasser-Micali's [38].
Paillier's system can be defined by three algorithms:

Paillier.KeyGen: the key generation algorithm selects two large primes p, q; computes their product n = pq; uniformly draws g coprime to n; computes λ = lcm(p − 1, q − 1); and, finally, computes µ = (L(g^λ mod n²))⁻¹ mod n, where L(u) = (u − 1)/n. The public key is ⟨n, g⟩, and the private key is ⟨µ, λ⟩;

Paillier.Enc: given the public key ⟨n, g⟩ and a message m ∈ Zn, the encryption algorithm consists in uniformly drawing r from {1, ..., n − 1} and computing the ciphertext c = g^m · r^n mod n²;

Paillier.Dec: the decryption algorithm, in turn, receives the private key ⟨µ, λ⟩ and a ciphertext c, and recovers the corresponding message as m = L(c^λ mod n²) · µ mod n.
This construction harnesses the homomorphic relationship between the rings Zn and Zn² to render the following features:

Asymmetric cryptography: it is possible to perform homomorphic computations over the encrypted data using only the public key. Learning the result of the computation, nevertheless, requires access to the private key;

Additive homomorphism: the multiplication of two ciphertexts yields a ciphertext of the sum of their respective messages. That is:

Enc(m1) · Enc(m2) mod n² ≡ Enc(m1 + m2 mod n)
Again, it is straightforward to see that, with the basic building blocks of addition and multiplication, it is possible to compose arbitrarily complex protocols for privacy-preserving computation using homomorphic encryption.
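A toy Python sketch of the three Paillier algorithms makes the additive homomorphism concrete. The tiny primes and the common simple choice g = n + 1 are illustrative simplifications only; real deployments use keys of 2048 bits or more.

```python
import math
import random

def keygen(p, q):
    """Paillier.KeyGen with toy primes (NOT secure key sizes)."""
    n = p * q
    g = n + 1                      # a standard simple choice of g coprime to n
    lam = math.lcm(p - 1, q - 1)   # lambda = lcm(p - 1, q - 1)
    L = lambda u: (u - 1) // n     # the function L(u) = (u - 1) / n
    mu = pow(L(pow(g, lam, n * n)), -1, n)
    return (n, g), (lam, mu)

def enc(pub, m):
    """Paillier.Enc: c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def dec(pub, priv, c):
    """Paillier.Dec: m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pub
    lam, mu = priv
    L = lambda u: (u - 1) // n
    return (L(pow(c, lam, n * n)) * mu) % n
```

Multiplying two ciphertexts modulo n² then decrypts to the sum of the messages, and raising a ciphertext to a plaintext constant decrypts to the product with that constant.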
Very few works on PPML publish their results with computing times observed against benchmark datasets/tasks (e.g. classification on the ImageNet or the Iris dataset). When any general estimate is present, it is usually the complexity order O(g(n)), where g(n) is a function that asymptotically upper-bounds the cost of the algorithm.

The order function g(n) represents the general behavior or shape of a class of functions. So, if t(n) is the function defining the computational cost of a given algorithm over its input size n, then stating that t(n) ∈ O(g(n)) means that there is a constant c and a large enough value of n after which t(n) is always less than c · g(n).

A protocol is usually regarded as efficient if its order function is at most polynomial. Sub-exponential or exponential orders, on the other hand, are usually deemed prohibitively high.
For example, the authors of ABY3 [33] assert that their Linear Regression protocol is the most efficient in the literature, with cost O(B + D) per round, where B is the training batch size and D is the feature matrix dimension [33]. That may seem like a very well behaved linear function over n, which would lead us to conclude they devised an exceptionally efficient protocol of order O(n).

Nevertheless, this order expression only bounds the number of operations to be performed by the algorithm; it does not give an accurate estimate of execution time. More importantly, the order function only bounds the actual cost function for extremely large input sizes. Recall that t(n) = Kn is in O(n), regardless of how arbitrarily large the constant K may be.
Thus, for small input sizes, the actual cost may be many orders of magnitude higher than the asymptotic bound suggests. The addition protocol in [40] is also of order O(n), but one would never assume that a protocol with many matrix multiplications and inversions can run as fast as one with a few simple additions.
Probabilistic and ML models are commonly used to estimate cost, execution time and
other statistics over complex systems and algorithms, especially in control or real-time
systems engineering [41, 42].
We propose the use of Monte Carlo methods to estimate execution times for privacy-preserving computations, considering different protocols and input sizes. The remainder of this chapter briefly reviews the cost of privacy-preserving computation, describes the Monte Carlo methods used, discusses implementation details and results of the various Monte Carlo experiments we performed, and points out relevant questions open for further investigation.
2.3.1 Monte Carlo methods for integration
There is another way to estimate execution times. All examples of privacy-preserving computation found in the literature have one thing in common: their protocols depend heavily on pseudo-randomly generated numbers, used to mask or encrypt private data.

Those numbers are assumed to be drawn independently according to a specific probability density function. That is, the algorithms use at least one random variable as input. Although the observed execution time does not depend directly on the values of the random inputs, it is directly affected by their magnitude or, more specifically, their average bit-size.
Also, the size of the dataset has a direct impact on execution time. If we consider the size of the dataset as a random variable, then the magnitude of the random numbers and the numerical representation of the dataset are, unequivocally, random variables. So it is safe to assume that protocol runtimes, which are a function of the previous two variables, are also random variables.
Monte Carlo methods are a class of algorithms based on repeated random sampling that render numerical approximations, or estimations, of a wide range of statistics, as well as the associated standard error for the empirical average of any function of the parameters of interest. We know, for example, that if X is a random variable with density f(x), then the mathematical expectation of the random variable T = t(X) is:

E[t(X)] = ∫_{−∞}^{∞} t(x) f(x) dx.    (2.3)
And, if the analytic solution for the integral is hard or impossible to obtain, we can use a Monte Carlo estimation for the expected value. Drawing M samples x1, ..., xM from f, it can be obtained with:

θ̂ = (1/M) Σ_{i=1}^{M} t(xi)    (2.4)
In other words, if the probability density function f(x) has support on a set X (that is, f(x) ≥ 0 ∀x ∈ X and ∫_X f(x) dx = 1), we can estimate the integral

θ = ∫_X t(x) f(x) dx    (2.5)
with

θ̂ = (1/M) Σ_{i=1}^{M} t(xi)    (2.6)
and the associated variance estimate

V̂ar(θ̂) = (1/M²) Σ_{i=1}^{M} (t(xi) − θ̂)²    (2.7)
In order to improve the accuracy of our estimation, we can always increase M, the divisor in the variance expression. That comes, however, at increased computational cost. We explore this trade-off in our experiments by performing the simulations with different values of M and then examining the impact of M on the observed sample variance and on the execution time of the experiment.
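Equations (2.6) and (2.7) can be sketched in a few lines of Python. The target t(x) = x² with X uniform on (0, 1) is an illustrative choice (the exact value of the integral is 1/3), standing in for the protocol runtimes of interest.

```python
import random

def mc_estimate(t, sampler, M):
    """Estimate E[t(X)] from M draws of X (Eq. 2.6) together with the
    plug-in variance of the estimator (Eq. 2.7)."""
    samples = [t(sampler()) for _ in range(M)]
    theta = sum(samples) / M
    var = sum((s - theta) ** 2 for s in samples) / M ** 2
    return theta, var

random.seed(0)
# t(x) = x^2 with X ~ Uniform(0, 1): theta should approach 1/3
theta, var = mc_estimate(lambda x: x * x, random.random, 100_000)
```

Increasing M shrinks the variance estimate (the 1/M² divisor) while linearly increasing the simulation cost, which is exactly the trade-off discussed above.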
3 Text classification
— René Descartes, Discourse on Method
3.2 Classic NLP preprocessing techniques
3.2.1 Tokenization
Tokenization is a preprocessing technique commonly understood as the first step of any kind of natural language processing. It is used to identify the atomic units of text processing. The text, represented as a single sequence of characters, is transformed into a collection of tokens: words, punctuation marks, emojis, etc. Most NLP software libraries (e.g. nltk, gensim and CoreNLP) provide multiple tokenization strategies, such as character, sub-word, word or n-gram tokenization. The best granularity or tokenization strategy usually depends on the application [44].
A common practice is to combine tokenization with sentence splitting. The gensim li-
brary, for instance, will perform tokenization by processing a sentence at a time. CoreNLP,
in turn, adds flags to the tokens that represent the limits of each sentence. Sentence split-
ting is also very important to other NLP methods, such as Part-Of-Speech (PoS) tagging
and Named Entity Recognition (NER).
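A minimal sketch of word-level tokenization combined with naive sentence splitting, in the spirit of what the libraries above do internally. The regular expressions are illustrative simplifications, not the actual rules used by nltk, gensim or CoreNLP:

```python
import re

def sentence_split(text):
    # Naive sentence splitter: break on '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # Word-level tokens: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

doc = "Fake news spread fast. Can models detect them?"
tokens = [tokenize(s) for s in sentence_split(doc)]
```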
3.2.2 Stop word removal
Stopword removal eliminates very frequent terms that carry little discriminative infor-
mation for text classification. Different classes of strategies are used to build the removal
lists, including:
• Pre-compiled dictionary: manually curated stopword lists. The lists may be crafted
for specific contexts, jargons or document corpora;
• Frequency based: use frequency based rules, such as TF-High (removal of terms
with high frequency), TF-1 (removal of terms with a single occurrence), IDF-Low
(removal of terms with low inverse document frequency, i.e. terms that are present
in most documents);
• Term Based Random Sampling (TBRS): uses the Kullback-Leibler divergence be-
tween term frequencies measured on the whole corpus and those measured on ran-
domly sampled text chunks to identify words with low divergence and, consequently,
low information for any given text class.
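The frequency-based rules above can be sketched as follows; the quantile cutoff and the toy corpus are illustrative assumptions, since the appropriate thresholds are corpus-dependent:

```python
from collections import Counter

def frequency_stopwords(docs, high_quantile=0.9):
    """TF-1 and TF-High rules: flag terms occurring exactly once in the
    corpus and terms at or above a high-frequency cutoff (here a quantile
    of the observed counts)."""
    counts = Counter(token for doc in docs for token in doc)
    tf1 = {t for t, c in counts.items() if c == 1}
    cutoff = sorted(counts.values())[int(high_quantile * (len(counts) - 1))]
    tf_high = {t for t, c in counts.items() if c >= cutoff}
    return tf1 | tf_high

docs = [["the", "vaccine", "claim", "is", "false", "xyzq"],
        ["the", "vaccine", "claim", "is", "true"],
        ["the", "report", "is", "accurate"]]
stop = frequency_stopwords(docs)
filtered = [[t for t in d if t not in stop] for d in docs]
```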
3.2.3 Stemming
Stemming is the reduction of variant forms of a word, eliminating inflectional morphemes
such as verbal tense or plural suffixes, in order to provide a common representation, the
root or stem. The intuition is to perform a dimensionality reduction on the dataset,
removing rare morphological word variants, and to reduce the risk of bias in the word
statistics measured on the documents [47].
Most stemming algorithms only truncate suffixes and do not return the appropriate
term stem or even a valid word in the language of the text. There are different classes of
stemming algorithms, including:
• Dictionary based algorithms: lookup tables with terms and corresponding stems.
Usually restricted to a specific corpus, jargon or knowledge area;
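A minimal truncation-based stemmer illustrates why the output is often not a valid word in the language. The suffix list below is an illustrative assumption, not a published algorithm such as Porter's:

```python
# Suffixes are tried longest-first; a minimum stem length avoids
# over-truncating short words.
SUFFIXES = ["ations", "ation", "ing", "ed", "es", "s"]

def stem(word, min_stem=3):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["classifications", "classified", "classifying", "classify"]
stems = [stem(w) for w in words]
```

Note that "classific" and "classifi" are not valid English words, yet the variants still collapse toward a common prefix, which is all the downstream term statistics require.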
3.2.4 Lemmatization
Lemmatization consists of the reduction of each token to a linguistically valid root, or
lemma. The goal, from the statistical perspective, is exactly the same as in stemming:
reduce variance in term frequency. It is sometimes compared to the normalization of the
word sample, and aims to provide more accurate transformations than stemming, from
the linguistic perspective [48].
The impact on predictive models, however, will depend on characteristics of the lan-
guage or the document corpus being processed. In highly inflectional languages, such as
Latin and the romance languages, lemmatization is expected to produce better results
than stemming [49].
The typical lemmatizer implementation requires the creation of a lexicon (dictionary
or wordbook) of valid words and their corresponding lemma [50]. Yet, there are different
classes of algorithms, designed to deal with distinct problems in word normalization and
different languages. Recent works in literature, for instance, use deep neural networks to
produce ‘neuro lemmatizers’ trained for specific tasks [51, 52].
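The typical lexicon-based lemmatizer can be sketched as a lookup with a fallback; the toy dictionary below is illustrative, standing in for a real lexicon with hundreds of thousands of (form, lemma) pairs:

```python
# Toy lexicon mapping inflected forms to lemmas; real lexicons are built
# from curated wordbooks for a specific language.
LEXICON = {
    "was": "be", "were": "be", "is": "be",
    "better": "good", "ran": "run", "running": "run",
}

def lemmatize(token):
    # Fall back to the lowercased token for out-of-vocabulary words.
    return LEXICON.get(token.lower(), token.lower())

sentence = ["The", "news", "was", "better", "than", "expected"]
lemmas = [lemmatize(t) for t in sentence]
```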
3.2.5 Part-of-Speech (PoS) tagging
Part-of-Speech tagging is a processing technique that flags each token with a grammatical
class, taking into account the sentence or even the context of the sentence in which they are
found [53, 52]. Most implementations will return multiple tags per token, with syntactic,
lexical, phrasal and other categories.
PoS tagging helps to differentiate homonyms, words with the same spelling but differ-
ent meanings, and to capture part of the semantic relations between words. Therefore,
many works in the fake news detection literature use PoS tags to engineer new features
(e.g. "noun count", "adjective count", "mean adjectives per noun") to capture concepts
such as ‘style’ or ‘quality’ of the text and improve model accuracy [54, 55].
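Such stylometric features can be engineered from tagged tokens as sketched below. The tags are written by hand for the example, where a real pipeline would obtain them from a PoS tagger:

```python
def pos_style_features(tagged_tokens):
    """Engineer simple stylometric features from (token, PoS-tag) pairs."""
    nouns = sum(1 for _, tag in tagged_tokens if tag == "NOUN")
    adjectives = sum(1 for _, tag in tagged_tokens if tag == "ADJ")
    return {
        "noun_count": nouns,
        "adjective_count": adjectives,
        "adjectives_per_noun": adjectives / nouns if nouns else 0.0,
    }

# Hand-tagged example; a real pipeline uses a trained tagger.
tagged = [("shocking", "ADJ"), ("secret", "ADJ"), ("report", "NOUN"),
          ("reveals", "VERB"), ("truth", "NOUN")]
features = pos_style_features(tagged)
```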
of occurrences of that token in that document. This algorithm produces an unordered set
that does not retain any information on word order or proximity in the document [58].
In order to deal with this loss of information on word order or word-word relationship,
many techniques were proposed and tested in various NLP tasks. There are, for instance,
Bag-of-N-grams algorithms, where the basic unit of count is not a single word but a set
of words of size n [59].
\[ \mathrm{tf}(t_i, d_j) = 1 + \log \frac{f_{t_i, d_j}}{\sum_{t \in d_j} f_{t, d_j}} \qquad \mathrm{idf}(t_i, D) = 1 + \log \frac{|D| + 1}{|\{d \in D : t_i \in d\}| + 1} \tag{3.1} \]
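The expressions in Equation (3.1) translate directly into code; the toy corpus below is an illustrative assumption:

```python
import math
from collections import Counter

def tf(term, doc_counts):
    # doc_counts: Counter of raw term frequencies f_{t,d} in one document.
    total = sum(doc_counts.values())
    return 1 + math.log(doc_counts[term] / total)

def idf(term, corpus):
    # corpus: list of documents, each represented as a Counter.
    df = sum(1 for doc in corpus if term in doc)
    return 1 + math.log((len(corpus) + 1) / (df + 1))

corpus = [Counter(["fake", "news", "news"]),
          Counter(["real", "news"]),
          Counter(["fake", "claim"])]
weight = tf("news", corpus[0]) * idf("news", corpus)
```

The +1 terms in the idf expression smooth the statistic so that terms present in every document still receive a finite, nonzero weight.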
reverse step, gives the conditional probability for a set of preceding and following words,
given a specific term.
One of the advantages of word embeddings, such as CBoW, is the fixed-length, dense
vector representation that usually allows for more efficient computations. It also captures
some of the semantic relationships of words, based on their co-occurrence probability.
CBoW has also been used to achieve good results in fake news detection [64, 65] and
is possibly the most advanced preprocessing or feature engineering technique that can
be used on top of MPC protocols in order to produce a privacy-preserving fake news
classification model.
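A minimal sketch of the CBoW forward pass: the context word embeddings are averaged, every vocabulary word is scored against the average, and a softmax yields the prediction probabilities for the center word. The two-dimensional embeddings and the tied output weights are illustrative simplifications of what a library such as gensim actually trains:

```python
import math

VOCAB = ["fake", "news", "spreads", "fast"]
EMB = {                                   # input embeddings (dimension 2)
    "fake": [0.1, 0.9], "news": [0.8, 0.2],
    "spreads": [0.4, 0.4], "fast": [0.9, 0.1],
}
OUT = dict(EMB)  # tied output embeddings, a common simplification

def cbow_probabilities(context):
    d = len(next(iter(EMB.values())))
    # Average the context embeddings, score the vocabulary, softmax.
    avg = [sum(EMB[w][k] for w in context) / len(context) for k in range(d)]
    scores = {w: sum(OUT[w][k] * avg[k] for k in range(d)) for w in VOCAB}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

probs = cbow_probabilities(["fake", "spreads"])  # predict the center word
```

Training adjusts the embedding matrices so that the probability of the true center word is maximized over the corpus; here only the forward computation is shown.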
3.3 The State-of-the-Art: Transformers
4 Fake news detection
Fake news are texts, possibly distributed alongside or in other media formats, that present
false, incorrect or inaccurate information and are shared over digital platforms, such as
social networks, messaging apps or news web sites [67]. The main characteristics that dif-
ferentiate fake news from concepts such as gossip, hoaxes and other forms of misinformation
are:
1. They are formatted and presented as legitimate news, usually as a form of ‘self-
validation’, with the intent to manipulate the audience’s cognitive processes;
2. They have faster and broader propagation patterns, partly due to the context of
instant and pervasive communication of digital platforms;
3. They have greater impact on the audience’s social behavior, also partly due to the
business model of the digital platforms based on engagement or “attention reten-
tion”.
These platforms are designed to retain their audience with algorithms that will filter
and sort the content displayed to each user based on their preferences and attention
patterns. The algorithms are so effective in retaining users’ attention that a growing
number of people are now suffering from addiction to social media [68].
With users spending an ever increasing amount of time on their favorite platforms,
social media have amassed huge databases on user profiling and segmentation and be-
came, arguably, some of the most effective mass communication tools. They monetize
their databases serving targeted advertising and charging companies that consume their
Application Programming Interfaces (APIs) to interact with their users [24].
This business model is threatened by the misuse of the platforms with the spread
of illegitimate and false content. The malicious use of digital platforms to spread fake
news has already been applied, for example, to manipulate opinions on extremely relevant
issues, such as presidential elections in France and the United States [69].
Identifying and clearly flagging fake news may help users to assert better judgment
on the content they consume and lessen its negative effects [70]. Research in the area,
nevertheless, faces a few particular challenges: first, the difficulty, from the technological
perspective, to delimit fake news and distinguish them from other forms of propaganda.
Moreover, the relevance of the topic in the political arena elevates the risk of bias and
partisan interference. For instance, a dataset with over 200 citations yields an accuracy of
99% when the presence of a single term is used to label a text as true news [71]. Finally,
there are very few good public datasets and even fewer NLP resources (dictionaries, word
embedding models, language models, etc.) in languages other than English [55].
Another important issue in the area is how to balance the need to detect and appro-
priately handle fake news and the equally important need to guarantee end user’s privacy.
This concern with users’ privacy has led to the development of many Privacy-Preserving
Machine Learning (PPML) techniques [29, 33, 34]. There are already many classic Machine
Learning (ML) algorithms, such as Logistic Regression, Decision Trees and Support Vector
Machines, implemented on top of Secure Multi-party Computation (MPC) protocols [7].
4.1.2 Fact checking
A few works propose solutions that are based on complex conversational models that
query the topics identified in the text against a database of checked facts [75, 76, 77].
Conversational models are Deep Neural Networks trained to react to a text, or a spe-
cific query, according to a database of pairs ⟨input, response⟩. First, the model classifies
the input with multiple labels, each indicating a topic or knowledge area. Then, the
model selects the responses that have higher probability of appropriately responding to
that query. Some models use a fixed knowledge-base, or even a fixed list of responses.
Others will search the web, performing a second round of classification to select the prob-
able correct response [78].
Research in this area has led to the creation of curated databases of checked facts.
Some are maintained by multidisciplinary research groups. Most of these databases,
however, are created and curated by fact checking agencies and news companies [79].
Here, there is a risk that heightened partisanship might interfere with the quality of
these databases. This is especially true in a politically polarized environment, in which
agencies aligned with the different political forces are mutually accused of publishing
fake news [80]. The dataset mentioned in the introduction, for example, flags every text
published by a single news outlet as true and all the others as fake [71].
4.2 Experimental results
4.2.1 Clear-text setting
In order to establish benchmark performance measures in the ‘clear-text’ setting, we
ran the pipeline detailed in [?] for model tuning, selection and testing, with different
combinations of NLP preprocessing techniques, as shown in Table 4.1.
The pipeline uses k-fold validation and random search for hyper-parameter tuning on
Naive Bayes, Decision Tree, K-Nearest Neighbors, Logistic Regression, Support Vector
Machines, Random Forest and XGBoost GBDT classifiers. For the DistilBERT and
Sentence-BERT experiments, we have used the pre-built multilingual models from [?] in
order to encode our datasets with the corresponding embeddings, before submitting them
to the pipeline.
For hyper-parameter search, we decided to use the ROC AUC metric to compare and
select the best models, as it gives better information on model performance in the presence
of class imbalance. After model selection, we recorded ROC AUC, F1-score and accuracy metrics
on the test set for the model selected at the end of each experiment. The best combination
of preprocessing techniques and classifier algorithm, measured by the accuracy on the test
set for each dataset, is presented on Table 4.2. The runtime is in seconds.
We have also trained a few convolutional (CNN) and deep feed-forward (FNN) net-
works in the clear-text setting. We selected the best network of each architecture for the
experiments in the privacy-preserving setting, in order to have a benchmark for comparison,
both on accuracy and runtime. We compare training runtimes for these networks on
Table 4.3.
Note that the choice of these simple architectures is due to CrypTen’s limited imple-
mentation of PyTorch modules. It does not implement modules such as RNN or LSTM,
which have been proven in the NLP literature to provide better results than simple
convolutional or feed-forward networks.
We also trained the convolutional neural network from [?] as a benchmark for our
results. Our neural networks outperformed their model in accuracy, F1 score and ROC
AUC for all datasets.
The results show that in most of our experiments, word embeddings or sentence em-
beddings from large language models did not outperform traditional NLP preprocessing
techniques. Also, the classic ML models outperformed the deep learning models. Par-
ticularly, tree-based models, Random Forest and GBDT, presented the best performance
for most datasets.
We observed that the models trained with the short text datasets had lower perfor-
mance in all metrics, over all experiments. It indicates that the models required a larger
sample of words in order to appropriately approximate the underlying statistics repre-
sented by the trained parameters. Notice, also, the high training runtime for the Random
Forest model trained over Sentence-BERT embeddings for the liar dataset. It indicates
that large language models may provide better results, but may also introduce higher
computing cost.
4.2.2 Privacy-preserving setting
Our experiments cover two basic scenarios of application for PPML techniques in privacy-
preserving fake news detection. The first scenario is the privacy-preserving model training.
It consists of a first party interested in training a model for fake news classification and
a second party, or even a group of parties, that can provide annotated datasets for model
training, but do not want or do not trust the first party to have full knowledge of the
dataset.
This scenario is relevant, for example, in the cases when a social media platform wants
to train a model and another party, such as a fact-checking agency, or a group of scholars
and specialists, will provide the dataset with annotated news articles. This group may
fear the social platform to have political or economical incentives to meddle with the
specialists’ classification. The privacy-preserving model training solution guarantees the
model owner has no knowledge of the provided texts or the classification labels, and thus,
can not negatively impact the quality or fairness of the dataset and, consequently, of the
trained model.
The second scenario is related to privacy-preserving inference. It addresses the cases
in which a user wants to know the inferred classification for a text, but does not want the
model owner to know the content of the text submitted for classification. This scenario
applies, for example, to the messaging app users that want to have a feedback on a message
shared on a family group, but are not comfortable to have a fake news classification agency
being able to track and record the exact content shared on that private chat. Privacy-
preserving inference allows the users to receive a probable classification for the texts they
read, without exposing their private conversation or the people interacting with them.
For privacy-preserving model training and inference we set up three virtual machines
on the cloud. We tested our code using Amazon AWS EC2 and Google Cloud Compute
Engine instances, with similar results. Table 4.4 presents a comparison of the training cost
of the same neural networks in the clear-text and privacy-preserving settings.
On the first machine, named ‘alice’, we store the trained model. The training set,
embeddings and corresponding labels, are stored on the second machine, named ‘bob’. The
third participant, ‘charlie’, holds the validation set. At the end of the MPC computation,
the accuracy score on the validation set is known to all computing parties, but only alice
has knowledge of the trained model’s weights.
The CrypTen framework extends the PyTorch library API with tensor based im-
plementations of secret sharing protocols [30]. We decided to use this framework to
facilitate our experiments, since it extends a well known library, and allows us to use
peer-reviewed neural network architectures found in literature with very few changes to
the code. CrypTen also allows for private inference with an encrypted PyTorch model
trained on clear-text data.
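The additive secret sharing at the core of such frameworks can be illustrated with a minimal pure-Python sketch. It shows only the sharing principle, not CrypTen's actual fixed-point encoding or its communication layer:

```python
import random

Q = 2 ** 64  # ring size; shares are integers modulo Q

def share(secret, n_parties=3):
    """Split an integer into n additive shares that sum to it modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Secure addition: each party adds its local shares; no single party
# ever sees x or y in the clear.
x_shares = share(123)
y_shares = share(456)
sum_shares = [(a + b) % Q for a, b in zip(x_shares, y_shares)]
result = reconstruct(sum_shares)  # 579
```

Addition is local to each party; multiplication requires extra machinery such as Beaver triples [36], which frameworks like CrypTen provide.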
Nevertheless, as stated above, we used only simple feed-forward and convolutional
neural network architectures. This choice is due to CrypTen’s limited implementation of
PyTorch modules. It does not implement modules such as RNN or LSTM, which have been
proven in the NLP literature to provide better results than simpler network architectures [?].
The results in Table 4.4 show that training times are, on average, one order of magni-
tude higher in the privacy-preserving setting. That is a reasonable cost, considering the
advantage of preserving both the privacy of participants’ input texts and of the service
provider’s trained model.
Table 4.5: Best accuracy on Privacy-Preserving setting

Dataset     Embedding      Model  Accuracy  F1-Score  ROC-AUC  Runtime (s)
liar        DistilBERT     FNN    58.73     48.73     57.24    9.06
liar        Sentence-BERT  CNN    61.54     51.71     59.99    1414.53
sbnc        DistilBERT     FNN    67.82     74.10     65.61    1.66
sbnc        Sentence-BERT  FNN    72.52     77.11     71.44    1.60
factck.br   DistilBERT     FNN    78.32     87.84     50.00    1.62
factck.br   Sentence-BERT  CNN    78.32     87.84     60.00    175.28
fake.br     DistilBERT     FNN    80.83     80.72     80.83    5.28
fake.br     Sentence-BERT  CNN    81.04     79.11     81.04    793.78
5 Results & conclusion
5.1 Experiments
Our experiments covered the two basic scenarios of application for PPML techniques
described in Section 4.2.2: privacy-preserving model training and privacy-preserving
inference. The experiments used the following datasets:
• Liar Dataset (liar): curated by the UC Santa Barbara NLP Group, contains 12791
claims by North-American politicians and celebrities, classified as ‘true’, ‘mostly-
true’, ‘half-true’, ‘barely-true’, ‘false’ and ‘pants-on-fire’ [90];
• Source Based Fake News Classification (sbnc): 2020 full-length news articles,
manually labeled as real or fake [91];
• Fake.br: 7200 full-length news articles, with text and metadata, manually labeled
as real or fake news [55];
of preserving both the privacy of participants’ input texts and of the service provider’s
trained model. The accuracy was measured over the validation set.
5.2 Conclusion
We have presented relevant fake news detection approaches and pointed out a few ad-
vantages of NLP applications in a privacy-preserving oriented solution. We have also
discussed the use of different NLP techniques in text classification, and how large lan-
guage models can be used as a preprocessing step to generate embeddings that convey
semantic information from the encoded text. Then, we showed how those embeddings are
used for training and querying fake news detection inference models.
Our experiments also demonstrate how a neural network can be trained to detect fake
news using Secure Multi-party Computation protocols and how those MPC protocols
allow users to perform news classification in a privacy-preserving way.
The relevant finding is that the performance of the privacy-preserving fake news clas-
sification model, measured in terms of runtime, accuracy and other classification metrics,
is very close to that of a model trained and queried in the clear-text setting. This indicates
that the introduction of MPC protocols does not reduce the predictive power or usability
of fake news detection models.
Acknowledgments
This work has been funded in part by the Graduate Deanship of Universidade de Brasília,
under the “EDITAL DPG Nº 0004/2021” grants program.
Bibliography
[1] Dale, Robert: GPT-3: What’s it good for? Natural Language Engineering,
27(1):113–118, 2021. 1
[2] BRASIL: Lei nº 13.709, de 14 de agosto de 2018., 2018. http://www.planalto.gov.br/ccivil_03/_Ato2015-2018/2018/Lei/L13709.htm. 1
[3] European Commission: Regulation EU n. 2016/679., 2016. https://ec.europa.eu/
info/law/law-topic/data-protection_en. 1
[4] Al-Rubaie, Mohammad and J. Morris Chang: Privacy-Preserving Machine Learning:
Threats and Solutions. IEEE Security Privacy, 17(2):49–58, 2019. 2, 8
[5] Graepel, T., K Lauter, and M Naehrig: ML Confidential: Machine Learning on
Encrypted Data. Cryptology ePrint Archive, Report 2012/323, 2012. https://
eprint.iacr.org/2012/323. 2, 8
[6] Canetti, R.: Universally Composable Security: A New Paradigm for Cryptographic
Protocols. In Proceedings of the 42Nd IEEE Symposium on Foundations of Computer
Science, FOCS ’01, pages 136–, Washington, DC, USA, 2001. IEEE Computer Soci-
ety, ISBN 0-7695-1390-5. http://dl.acm.org/citation.cfm?id=874063.875553.
2, 8
[7] De Cock, Martine, Rafael Dowsley, Caleb Horst, Raj Katti, Anderson Nascimento,
Wing Sea Poon, and Stacey Truex: Efficient and Private Scoring of Decision Trees,
Support Vector Machines and Logistic Regression Models based on Pre-Computation.
IEEE Transactions on Dependable and Secure Computing, PP(99), 2017. 2, 8, 26,
27
[8] Rivest, R., L. Adleman, and M. Dertouzos: On data banks and privacy homo-
morphisms. Foundations of Secure Computation, pages 169–177, 1978. 2, 8
[9] Gentry, C.: A fully homomorphic encryption scheme. PhD thesis, Stanford Univer-
sity, 2009. crypto.stanford.edu/craig. 2, 8
[10] Lopez-Alt, A., E. Tromer, and V. Vaikuntanathan: On-the-Fly Multiparty Computa-
tion on the Cloud via Multikey Fully Homomorphic Encryption. Cryptology ePrint
Archive, Report 2013/094, 2013. 2, 8
[11] Souza, Stefano M P C and Ricardo S Puttini: Client-side encryption for privacy-
sensitive applications on the cloud. Procedia Computer Science, 97:126–130, 2016.
2, 12
[12] Damgård, Ivan and Mats Jurik: A Generalisation, a Simplification and Some Ap-
plications of Paillier’s Probabilistic Public-Key System. In Proceedings of the 4th
International Workshop on Practice and Theory in Public Key Cryptography: Public
Key Cryptography, PKC ’01, pages 119–136, London, UK, UK, 2001. Springer-Verlag,
ISBN 3-540-41658-7. http://dl.acm.org/citation.cfm?id=648118.746742. 2, 8
[13] Nikolaenko, V., U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft:
Privacy-Preserving Ridge Regression on Hundreds of Millions of Records. In 2013
IEEE Symposium on Security and Privacy. IEEE, 2013. 2, 8
[14] Bos, J. W., K. Lauter, and M. Naehrig: Private Predictive Analysis on Encrypted
Medical Data. Cryptology ePrint Archive, Report 2014/336, 2014. 2, 9
[17] Souza, Stefano M. P. C.: Safe-Record: segurança e privacidade para registros eletrô-
nicos em saúde na nuvem. Master’s thesis, PPGEE/FT - Universidade de Brasília,
2016. 2, 3, 9
[20] Trauth, E. M.: Achieving the Research Goal with Qualitative Methods: Lessons Lear-
ned along the Way. In Proceedings of the IFIP TC8 WG 8.2 International Conference
on Information Systems and Qualitative Research, page 225–245, GBR, 1997. Chap-
man & Hall, Ltd., ISBN 0412823608. 5
[21] Deb, Dipankar, Rajeeb Dey, and Valentina E. Balas: [Intelligent Systems Refe-
rence Library - Vol. 153] Engineering Research Methodology: A Practical Insight
for Researchers, volume 10.1007/978-981-13-2947-0, chapter 1, pages 1–7. Sprin-
ger, 2019, ISBN 978-981-13-2946-3,978-981-13-2947-0. http://gen.lib.rus.ec/
scimag/index.php?s=10.1007/978-981-13-2947-0. 5
[22] Kaplan, Bonnie and Dennis Duchon: Combining Qualitative and Quantitative
Methods in Information Systems Research: A Case Study. MIS Q., 12(4):571–586,
December 1988, ISSN 0276-7783. http://dx.doi.org/10.2307/249133. 5
[23] Yin, R. K.: Case Study Research: Design and Methods. SAGE, Beverly Hills, 1984.
5
[24] Souza, S. M. P. C., T. B. Rezende, J. Nascimento, L. G. Chaves, D. H. P. Soto,
and S. Salavati: Tuning machine learning models to detect bots on Twitter. In 2020
Workshop on Communication Networks and Power Systems (WCNPS), pages 1–6,
2020. 6, 25, 26
[25] Souza, Stefano M. P. C. and Daniel G. Silva: Monte Carlo execution time estimation
for Privacy-preserving Distributed Function Evaluation protocols, 2021. 6
[26] Ben-David, Assaf, Noam Nisan, and Benny Pinkas: FairplayMP: a system for secure
multi-party computation. In Ning, Peng, Paul F. Syverson, and Somesh Jha (editors):
Proceedings of the 2008 ACM Conference on Computer and Communications Secu-
rity, CCS 2008, Alexandria, Virginia, USA, October 27-31, 2008, pages 257–266.
ACM, 2008. 9
[27] Bogdanov, Dan, Sven Laur, and Jan Willemson: Sharemind: A Framework for Fast
Privacy-Preserving Computations. In Proc. of the 13th European Symposium on
Research in Computer Security, pages 192–206, 2008. 9
[28] Ryffel, Theo, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Daniel
Rueckert, and Jonathan Passerat-Palmbach: A generic framework for privacy pre-
serving deep learning. CoRR, abs/1811.04017, 2018. 9
[29] Sadegh Riazi, M., C. Weinert, O. Tkachenko, E. M. Songhori, T. Schneider, and
F. Koushanfar: Chameleon: A Hybrid Secure Computation Framework for Machine
Learning Applications. ArXiv e-prints, 2018. 9, 26
[30] Knott, B., S. Venkataraman, A.Y. Hannun, S. Sengupta, M. Ibrahim, and L.J.P.
van der Maaten: CrypTen: Secure Multi-Party Computation Meets Machine Lear-
ning. In Proceedings of the NeurIPS Workshop on Privacy-Preserving Machine Le-
arning, 2020. 9, 14, 31, 34
[31] Zhang, Yihua, Aaron Steele, and Marina Blanton: PICCO: A General-purpose
Compiler for Private Distributed Computation. In Proceedings of the 2013
ACM SIGSAC Conference on Computer Communications Security. ACM, 2013,
ISBN 978-1-4503-2477-9. 9
[32] Songhori, E. M., S. U. Hussain, A. Sadeghi, T. Schneider, and F. Koushanfar: Tiny-
Garble: Highly Compressed and Scalable Sequential Garbled Circuits. In 2015 IEEE
Symposium on Security and Privacy, pages 411–428, May 2015. 9
[33] Demmler, Daniel, Thomas Schneider, and Michael Zohner: ABY - A Framework
for Efficient Mixed-Protocol Secure Two-Party Computation. In 22nd Network and
Distributed System Security Symposium, 2015. 9, 15, 26
[34] Mohassel, P. and Y. Zhang: SecureML: A System for Scalable Privacy-Preserving
Machine Learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages
19–38, May 2017. 9, 26
[35] Yao, Andrew C.: Protocols for Secure Computations. In Proceedings of the 23rd
Annual Symposium on Foundations of Computer Science, SFCS ’82. IEEE Computer
Society, 1982. 9
[36] Beaver, Donald: One-time tables for two-party computation. In Computing and Com-
binatorics, pages 361–370. Springer, 1998. 11
[37] Paillier, Pascal: Public-key cryptosystems based on composite degree residuosity clas-
ses. In IN ADVANCES IN CRYPTOLOGY — EUROCRYPT 1999, pages 223–238.
Springer-Verlag, 1999. 13
[38] Goldwasser, Shafi and Silvio Micali: Probabilistic encryption. Journal of Computer
and System Sciences, 28(2):270–299, 1984, ISSN 0022-0000. 13
[39] Naor, Moni and Kobbi Nissim: Communication Complexity and Secure Function Eva-
luation. Electronic Colloquium on Computational Complexity (ECCC), 8, 2001. 14
[40] Agarwal, Anisha, Rafael Dowsley, Nicholas D. McKinney, Dongrui Wu, Chin Teng
Lin, Martine De Cock, and Anderson Nascimento: Privacy-Preserving Linear Regres-
sion for Brain-Computer Interface Applications. In Proc. of 2018 IEEE International
Conference on Big Data, 2018. 15
[41] Silva, D. G, M. Jino, and B. de Abreu: A Simple Approach for Estimation of Execution
Effort of Functional Test Cases. In IEEE Sixth International Conference on Software
Testing, Verification and Validation. IEEE Computer Society, Apr 2009. 15
[42] Iqbal, N., M. A. Siddique, and J. Henkel: DAGS: Distribution agnostic sequential
Monte Carlo scheme for task execution time estimation. In 2010 Design, Automation
Test in Europe Conference Exhibition (DATE 2010), pages 1645–1648, 2010. 15
[43] Cunha, Washington, Vítor Mangaravite, Christian Gomes, Sérgio Canuto, Elaine Re-
sende, Cecilia Nascimento, Felipe Viegas, Celso França, Wellington Santos Martins,
Jussara M. Almeida, Thierson Rosa, Leonardo Rocha, and Marcos André Gonçalves:
On the cost-effectiveness of neural and non-neural approaches and representations
for text classification: A comprehensive comparative study. Information Processing
& Management, 58(3):102481, 2021, ISSN 0306-4573. 18
[44] Habert, Benoit, Gilles Adda, Martine Adda-Decker, P Boula de Marëuil, Serge Fer-
rari, Olivier Ferret, Gabriel Illouz, and Patrick Paroubek: Towards tokenization eva-
luation. In Proceedings of LREC, volume 98, pages 427–431, 1998. 19
[45] Kaur, Jashanjot and P Kaur Buttar: A systematic review on stopword removal algo-
rithms. Int. J. Futur. Revolut. Comput. Sci. Commun. Eng, 4(4), 2018. 19
[46] Gerlach, Martin, Hanyu Shi, and Luís A Nunes Amaral: A universal information
theoretic approach to the identification of stopwords. Nature Machine Intelligence,
1(12):606–612, 2019. 19
[47] Singh, Jasmeet and Vishal Gupta: Text stemming: Approaches, applications, and
challenges. ACM Computing Surveys (CSUR), 49(3):1–46, 2016. 20
[48] Dereza, Oksana: Lemmatization for Ancient Languages: Rules or Neural Networks?
In Conference on Artificial Intelligence and Natural Language, pages 35–47. Springer,
2018. 20
[49] Jongejan, Bart and Hercules Dalianis: Automatic training of lemmatization rules that
handle morphological changes in pre-, in-and suffixes alike. In Proceedings of the Joint
Conference of the 47th Annual Meeting of the ACL and the 4th International Joint
Conference on Natural Language Processing of the AFNLP, pages 145–153, 2009. 20
[50] Plisson, Joël, Nada Lavrac, Dunja Mladenic, et al.: A rule based approach to word
lemmatization. In Proceedings of IS, volume 3, pages 83–86, 2004. 20
[51] Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell: A simple joint model for im-
proved contextual neural lemmatization. arXiv preprint arXiv:1904.02306, 2019. 20
[52] Kondratyuk, Daniel, Tomáš Gavenčiak, Milan Straka, and Jan Hajič: LemmaTag:
Jointly tagging and lemmatizing for morphologically-rich languages with BRNNs. ar-
Xiv preprint arXiv:1808.03703, 2018. 20, 21
[53] Schmid, Helmut and Florian Laws: Estimation of conditional probabilities with de-
cision trees and an application to fine-grained POS tagging. In Proceedings of the
22nd International Conference on Computational Linguistics (Coling 2008), pages
777–784, 2008. 21
[54] Potthast, Martin, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno
Stein: A Stylometric Inquiry into Hyperpartisan and Fake News. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 231–240, Melbourne, Australia, July 2018. Association for
Computational Linguistics. https://www.aclweb.org/anthology/P18-1022. 21,
27
[55] Monteiro, Rafael A., Roney L. S. Santos, Thiago A. S. Pardo, Tiago A. de Almeida,
Evandro E. S. Ruiz, and Oto A. Vale: Contributions to the Study of Fake News in
Portuguese: New Corpus and Automatic Detection Results. In Computational Proces-
sing of the Portuguese Language, pages 324–334. Springer International Publishing,
2018, ISBN 978-3-319-99722-3. 21, 26, 34
[56] Davis, R. and C. Proctor: Fake News, Real Consequences: Recruiting Neural
Networks for the Fight Against Fake News. Technical report, Stanford University,
2017. 21
[57] Barde, B. V. and A. M. Bainwad: An overview of topic modeling methods and to-
ols. In 2017 International Conference on Intelligent Computing and Control Systems
(ICICCS), pages 745–750, 2017. 21
[58] El-Din, Doaa Mohey: Enhancement bag-of-words model for solving the challenges
of sentiment analysis. International Journal of Advanced Computer Science and
Applications, 7(1), 2016. 22
[59] Li, Bofang, Zhe Zhao, Tao Liu, Puwei Wang, and Xiaoyong Du: Weighted neural bag-
of-n-grams model: New baselines for text classification. In Proceedings of COLING
2016, the 26th International Conference on Computational Linguistics: Technical
Papers, pages 1591–1600, 2016. 22
[60] Yun-tao, Zhang, Gong Ling, and Wang Yong-cheng: An improved TF-IDF approach
for text classification. Journal of Zhejiang University-Science A, 6(1):49–55, 2005. 22
[61] Ahmed, Hadeer, Issa Traore, and Sherif Saad: Detection of online fake news using
n-gram analysis and machine learning techniques. In International conference on
intelligent, secure, and dependable systems in distributed and cloud environments,
pages 127–138. Springer, 2017. 22
[62] Dyson, Lauren and Alden Golab: Fake News Detection Exploring the Application of
NLP Methods to Machine Identification of Misleading News Sources. CAPP 30255
Adv. Mach. Learn. Public Policy, 2017. 22, 27
[63] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean: Efficient Estimation of
Word Representations in Vector Space, 2013. 22
[64] Yang, Kai Chou, Timothy Niven, and Hung Yu Kao: Fake News Detection as Natural
Language Inference. arXiv preprint arXiv:1907.07347, 2019. 23, 27
[66] Reimers, Nils and Iryna Gurevych: Sentence-BERT: Sentence Embeddings using Sia-
mese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing. Association for Computational Linguistics, Novem-
ber 2019. https://arxiv.org/abs/1908.10084. 24
[67] Gelfert, Axel: Fake News: A Definition. Informal Logic, 37(0):83–117, 2017. 25
[68] D’Arienzo, Maria Chiara, Valentina Boursier, and Mark D. Griffiths: Addiction to
Social Media and Attachment Styles: A Systematic Literature Review. International
Journal of Mental Health and Addiction, 17:1094 – 1118, 2019. 25
[69] Ferrara, Emilio: Disinformation and Social Bot Operations in the Run Up to the 2017
French Presidential Election. First Monday, 22, June 2017. 26
[70] Lee, Sangwon and Michael Xenos: Social distraction? Social media use and political
knowledge in two U.S. Presidential elections. Computers in Human Behavior, 90:18
– 25, 2019, ISSN 0747-5632. 26
[71] Nascimento, Josué: Only one word 99.2%, Aug 2020. https://www.kaggle.com/
josutk/only-one-word-99-2. 26, 27
[72] Gangireddy, Siva Charan Reddy, Deepak P, Cheng Long, and Tanmoy Chakraborty:
Unsupervised Fake News Detection: A Graph-Based Approach. In Proceedings of the
31st ACM Conference on Hypertext and Social Media, HT ’20, page 75–83, New
York, NY, USA, 2020. Association for Computing Machinery, ISBN 9781450370981.
https://doi.org/10.1145/3372923.3404783. 26
[73] Shu, Kai, Xinyi Zhou, Suhang Wang, Reza Zafarani, and Huan Liu: The Role of
User Profiles for Fake News Detection. In ASONAM ’19: International Conference
on Advances in Social Networks Analysis and Mining, page 436–439, New York, NY,
USA, 2019. Association for Computing Machinery, ISBN 9781450368681. 26
[74] Pinnaparaju, Nikhil, Vijaysaradhi Indurthi, and Vasudeva Varma: Identifying Fake
News Spreaders in Social Media. In CLEF, 2020. 26
[75] Nadeem, Moin, Wei Fang, Brian Xu, Mitra Mohtarami, and James Glass: FAKTA:
An Automatic End-to-End Fact Checking System, 2019. 27
[76] Moreno, João and Graça Bressan: FACTCK.BR: A New Dataset to Study
Fake News. In Proceedings of the 25th Brazilian Symposium on Multimedia and
the Web, WebMedia ’19, page 525–527, New York, NY, USA, 2019. Associa-
tion for Computing Machinery, ISBN 9781450367639. https://doi.org/10.1145/
3323503.3361698. 27, 34
[77] Gupta, Ankur, Yash Varun, Prarthana Das, Nithya Muttineni, Parth Srivastava,
Hamim Zafar, Tanmoy Chakraborty, and Swaprava Nath: TruthBot: An Automated
Conversational Tool for Intent Learning, Curated Information Presenting, and Fake
News Alerting. CoRR, abs/2102.00509, 2021. https://arxiv.org/abs/2102.00509.
27
[78] Lee, Sungjin: Nudging Neural Conversational Model with Domain Knowledge. CoRR,
abs/1811.06630, 2018. http://arxiv.org/abs/1811.06630. 27
[79] Graves, Lucas: Anatomy of a Fact Check: Objective Practice and the Contested Epis-
temology of Fact Checking. Communication, Culture and Critique, 10(3):518–537,
October 2017, ISSN 1753-9129. https://doi.org/10.1111/cccr.12163. 27
[80] Marietta, Morgan, David C Barker, and Todd Bowser: Fact-checking polarized poli-
tics: Does the fact-check industry provide consistent guidance on disputed realities?
In The Forum, volume 13, pages 577–596. De Gruyter, 2015. 27
[81] Horne, Benjamin D. and Sibel Adali: This Just In: Fake News Packs a Lot in Title,
Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real
News. CoRR, abs/1703.09398, 2017. http://arxiv.org/abs/1703.09398. 27
[82] Young, T., D. Hazarika, S. Poria, and E. Cambria: Recent Trends in Deep Learning
Based Natural Language Processing [Review Article]. IEEE Computational Intelli-
gence Magazine, 13(3):55–75, 2018. 27
[83] Devlin, Jacob, Ming Wei Chang, Kenton Lee, and Kristina Toutanova: Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018. 27
[84] Baruah, Arup, K Das, F Barbhuiya, and Kuntal Dey: Automatic Detection of Fake
News Spreaders Using BERT. In CLEF, 2020. 27
[85] Zhang, T., D. Wang, H. Chen, Z. Zeng, W. Guo, C. Miao, and L. Cui: BDANN:
BERT-Based Domain Adaptation Neural Network for Multi-Modal Fake News De-
tection. In 2020 International Joint Conference on Neural Networks (IJCNN), pages
1–8, 2020. 27
[86] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amo-
dei: Language Models are Few-Shot Learners. In Larochelle, H., M. Ranzato, R. Had-
sell, M. F. Balcan, and H. Lin (editors): Advances in Neural Information Processing
Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. 27
[87] Tan, Reuben, Bryan A. Plummer, and Kate Saenko: Detecting Cross-Modal Incon-
sistency to Defend Against Neural Fake News, 2020. 27
[88] Mosallanezhad, Ahmadreza, Kai Shu, and Huan Liu: Topic-Preserving Synthetic
News Generation: An Adversarial Deep Reinforcement Learning Approach, 2020.
27
[89] Zellers, Rowan, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Fran-
ziska Roesner, and Yejin Choi: Defending Against Neural Fake News. In Wallach,
H., H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (editors):
Advances in Neural Information Processing Systems, volume 32, pages 9054–9065.
Curran Associates, Inc., 2019. https://proceedings.neurips.cc/paper/2019/
file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf. 27
[90] Wang, William Yang: "liar, liar pants on fire": A new benchmark dataset for fake
news detection. arXiv preprint arXiv:1705.00648, 2017. 34
[91] Bhatia, Ruchi: Source based Fake News Classification, Aug 2020. https://www.
kaggle.com/ruchi798/source-based-news-classification. 34
I MPC Protocols
Protocol: πADD
Input: Secret shares Jx1Kq , . . . , JxnKq
Output: JzKq = Jx1Kq + · · · + JxnKq
Execution:
1. Each party Pi ∈ P locally computes its share of the output as zi = x1,i + · · · + xn,i , where xj,i denotes Pi's share of xj
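Because addition of additive shares is purely local, πADD needs no interaction at all: each party adds its own shares of every input, and the resulting shares reconstruct to the sum. A minimal sketch (the modulus, party count and function names are illustrative choices, not fixed by the protocol):

```python
import random

q = 2**61 - 1  # illustrative prime modulus; any modulus works for addition

def share(x, n):
    """Split x into n additive shares modulo q."""
    parts = [random.randrange(q) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % q)
    return parts

def reconstruct(parts):
    return sum(parts) % q

def pi_add(shared_inputs):
    """pi_ADD: party i locally adds its shares of every input x_j."""
    n_parties = len(shared_inputs[0])
    return [sum(sh[i] for sh in shared_inputs) % q for i in range(n_parties)]

# three values, each shared among three parties
z = pi_add([share(x, 3) for x in (10, 20, 30)])
assert reconstruct(z) == 60
```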
Protocol: πMUL
Setup:
1. The Trusted Initializer (TI) draws u, v uniformly from Zq , sets w = uv, and distributes the shares JuKq , JvKq and JwKq to the protocol parties
2. The TI draws ι uniformly from {1, . . . , n} and sends the asymmetric bit 1 to party Pι and the bit 0 to the parties Pi̸=ι
Input: JxKq and JyKq
Output: JzKq = JxyKq
Execution:
1. Each party Pi computes di ← xi − ui and ei ← yi − vi
2. Parties broadcast di , ei
3. Each party computes d ← d1 + · · · + dn and e ← e1 + · · · + en
4. Each party Pi computes zi ← wi + d·vi + e·ui ; the party Pι that holds the asymmetric bit additionally adds d·e to its share
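The correctness of multiplication with a predistributed triple rests on the identity xy = w + dv + eu + de with d = x − u and e = y − v. A two-party sketch with the TI simulated locally (modulus and names are illustrative assumptions):

```python
import random

q = 2**61 - 1  # illustrative prime modulus

def share2(x):
    a = random.randrange(q)
    return [a, (x - a) % q]

def reconstruct(sh):
    return sum(sh) % q

# Trusted Initializer: multiplication triple w = u*v, secret-shared between the parties
u, v = random.randrange(q), random.randrange(q)
u_sh, v_sh, w_sh = share2(u), share2(v), share2((u * v) % q)

def pi_mul(x_sh, y_sh):
    # each party broadcasts d_i = x_i - u_i and e_i = y_i - v_i; here we open d, e directly
    d = (x_sh[0] + x_sh[1] - u) % q   # d = x - u
    e = (y_sh[0] + y_sh[1] - v) % q   # e = y - v
    # z_i = w_i + d*v_i + e*u_i; the asymmetric-bit holder (say the first party) adds d*e
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(2)]
    z[0] = (z[0] + d * e) % q
    return z

assert reconstruct(pi_mul(share2(7), share2(9))) == 63
```

Since d and e are blinded by the uniformly random u and v, opening them reveals nothing about x and y.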
Protocol: πIP
Setup: The setup procedure for πMUL
Input: J⃗xKq , J⃗y Kq , and l (length of ⃗x and ⃗y )
Output: JzKq = J⃗x · ⃗y Kq
Execution:
1. Run l parallel instances of πMUL in order to compute Jzk Kq = Jxk Kq · Jyk Kq for k ∈ {1, . . . , l}
2. Locally compute JzKq ← Jz1Kq + · · · + JzlKq
3. Output JzKq
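The inner product thus costs l multiplications plus a purely local sum of the product shares. A self-contained two-party sketch, with each πMUL instance consuming a fresh triple simulated inline (all names and the modulus are illustrative):

```python
import random

q = 2**61 - 1  # illustrative prime modulus

def share2(x):
    a = random.randrange(q)
    return [a, (x - a) % q]

def reconstruct(sh):
    return sum(sh) % q

def beaver_mul(x_sh, y_sh):
    """One pi_MUL instance; the TI's fresh multiplication triple is simulated inline."""
    u, v = random.randrange(q), random.randrange(q)
    u_sh, v_sh, w_sh = share2(u), share2(v), share2((u * v) % q)
    d = (reconstruct(x_sh) - u) % q   # opened d = x - u
    e = (reconstruct(y_sh) - v) % q   # opened e = y - v
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(2)]
    z[0] = (z[0] + d * e) % q         # asymmetric-bit party adds d*e
    return z

def pi_ip(xs_sh, ys_sh):
    """pi_IP: l parallel multiplications, then a purely local sum of product shares."""
    prods = [beaver_mul(x, y) for x, y in zip(xs_sh, ys_sh)]
    return [sum(p[i] for p in prods) % q for i in range(2)]

z = pi_ip([share2(x) for x in (1, 2, 3)], [share2(y) for y in (4, 5, 6)])
assert reconstruct(z) == 32   # 1*4 + 2*5 + 3*6
```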
Protocol: πOIS
Setup: Let l be the bitlength of the inputs to be shared and n the dimension of the input
vector. The trusted initializer pre-distributes all the correlated randomness necessary for the
execution of πMUL over Z2l
Input: Alice inputs the vector ⃗x = (x1 , . . . , xn ), and Bob has input k, the index of the desired
output value
Output: xk
Execution:
1. Bob defines the selection vector ⃗y = (y1 , . . . , yn ), with yk = 1 and yj = 0 for all j ̸= k
2. For j ∈ {1, . . . , n} and i ∈ {1, . . . , l}, let xj,i denote the i-th bit of xj
3. Define Jyj K2 as the pair of shares (0, yj ) and Jxj,i K2 as (xj,i , 0)
4. For j ∈ {1, . . . , n} and i ∈ {1, . . . , l}, compute Jzj,i K2 ← Jxj,i K2 · Jyj K2 using πMUL
5. For i ∈ {1, . . . , l}, locally compute Jwi K2 ← Jz1,i K2 + · · · + Jzn,i K2 and open wi to Bob, who recovers xk from the bits (w1 , . . . , wl )
Protocol: πEq
Setup: The setup procedure for πMUL
Input: JxKq and JyKq
Output: J0Kq if x = y. Any non-zero number otherwise.
Execution:
1. Locally compute JdKq ← JxKq − JyKq
2. Compute JzKq ← JdKq · JrKq using πMUL, where JrKq is a sharing of a uniformly random r predistributed by the TI
3. Output JzKq
Protocol: F2toq
Input: JxK2
Output: JxKq
Execution:
1. Let ⃗xa denote Alice's share of JxK2 and ⃗xb Bob's share, so that x = ⃗xa ⊕ ⃗xb
2. Alice and Bob perform a secure bitwise XOR using πOR|XOR, with Alice's inputs being (⃗xa , 0), Bob's inputs being (0, ⃗xb ), and modulus q > 2
3. Output the resulting sharing JxKq
Protocol: πOR|XOR
Setup: The setup procedure for πMUL over Z2
Input: JxK2 , JyK2 and k, where k = 1 to compute OR and k = 2 to compute XOR between
the numbers.
Output: Jx ∨ yK2 if k = 1, Jx ⊻ yK2 if k = 2.
Execution:
1. Compute JwK2 ← JxK2 · JyK2 using πMUL
2. Locally compute JzK2 ← JxK2 + JyK2 − kJwK2
3. Output JzK2
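The role of the parameter k suggests the single-multiplication instantiation z = x + y − k·xy: the identities x OR y = x + y − xy and x XOR y = x + y − 2xy hold over the integers, hence modulo any q. A plaintext check over all bit pairs (a sketch of the arithmetic only, not the secure protocol):

```python
# x OR y  = x + y - 1*x*y   and   x XOR y = x + y - 2*x*y
def gate(x, y, k):
    return x + y - k * x * y

for x in (0, 1):
    for y in (0, 1):
        assert gate(x, y, 1) == (x | y)
        assert gate(x, y, 2) == (x ^ y)
```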
Protocol: πTrunc
Setup: Let λ be a statistical security parameter. The protocol is parametrized by the size
q > 2^(k+f+λ+1) of the field and the dimensions ℓ1 , ℓ2 of the input matrix. The trusted initializer
picks a matrix R′ ∈ F_q^(ℓ1×ℓ2) with elements uniformly drawn from {0, . . . , 2^f − 1} and a matrix
R′′ ∈ F_q^(ℓ1×ℓ2) with elements uniformly drawn from {0, . . . , 2^(k+λ) − 1}. Then, the TI computes
R = 2^f R′′ + R′ and creates secret shares JRKq and JR′Kq to distribute to the parties.
Input: The parties' input is JWKq such that every element w of W satisfies w ∈
{0, 1, . . . , 2^(k+f−1) − 1} ∪ {q − 2^(k+f−1) + 1, . . . , q − 1}.
Execution:
1. The parties locally compute JZKq ← 2^(k+f−1) + JWKq + JRKq (elementwise) and open Z
2. Let Z′ = Z mod 2^f (elementwise); the parties locally compute JSKq ← JWKq + JR′Kq − Z′
3. Let i = ((q + 1)/2)^f ; locally compute JTKq ← iJSKq and output JTKq
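The constant in the last step works because (q + 1)/2 is the multiplicative inverse of 2 modulo an odd prime q, so ((q + 1)/2)^f inverts 2^f. A quick check with an illustrative prime (the concrete q, f and t below are arbitrary choices for the demonstration):

```python
q = 2**31 - 1   # illustrative Mersenne prime; the protocol only requires q > 2^(k+f+lambda+1)
f = 10
i = pow((q + 1) // 2, f, q)             # i = ((q+1)/2)^f mod q
assert (i * pow(2, f, q)) % q == 1      # i is the inverse of 2^f modulo q
t = 123456
assert (i * ((t * 2**f) % q)) % q == t  # multiplying by i divides an exact multiple of 2^f
```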
Protocol: πBD
Setup: Let l be the bitlength of the value x to be bit-decomposed. For each distributed
multiplication, the TI draws U, V uniformly from Z2 , sets W := U V , and predistributes the
shares JU K2 , JV K2 and JW K2
Input: JxKq , for q ≤ 2^l
Output: JxK2
Execution:
1. Let a denote Alice's share of x, which corresponds to the bit string {a1 , . . . , al }. Sim-
ilarly, let b denote Bob's share of x, which corresponds to the bit string {b1 , . . . , bl }.
Define the secret sharing Jyi K2 as the pair of shares (ai , bi ) for yi = ai + bi mod 2, Jai K2
as (ai , 0) and Jbi K2 as (0, bi )
2. Compute Jc1 K2 ← Ja1 K2 · Jb1 K2 using distributed multiplication, and locally set Jx1 K2 ←
Jy1 K2
3. For i = 2, . . . , l: locally set Jxi K2 ← Jyi K2 + Jci−1 K2 and, for i < l, compute the carry
Jci K2 ← Jai K2 · Jbi K2 + Jci−1 K2 · (Jai K2 + Jbi K2 ) using distributed multiplication
Protocol: πDC
Setup: For each required multiplication, the trusted initializer selects blinding values
U, V, W uniformly from Z2 with W := U V and distributes their shares to the parties
Input: Each party gets the shares Jxi K2 and Jyi K2 for each bit of the l-bit integers x and y
Output: J1K2 if x ≥ y, and J0K2 otherwise
Execution:
1. For i ∈ {1, . . . , l}, compute in parallel Jdi K2 ← Jyi K2 · (J1K2 − Jxi K2 ) using the multipli-
cation protocol
2. For i ∈ {1, . . . , l}, locally compute the equality bits Jei K2 ← Jxi K2 + Jyi K2 + J1K2
3. For i ∈ {1, . . . , l}, compute Jci K2 ← Jdi K2 · Jei+1 K2 · · · Jel K2 using the multiplication
protocol
4. Compute JwK2 ← J1K2 + Jc1 K2 + · · · + Jcl K2 and output JwK2
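Taking ei = xi + yi + 1 mod 2 as the bitwise-equality indicator (an assumption consistent with the surrounding formulas), the comparison logic can be verified in the clear: ci = 1 exactly when the most significant differing bit favors y, so w = 1 + Σci mod 2 equals 1 iff x ≥ y. A plaintext sketch (bit index 1 is the LSB, index l the MSB):

```python
def geq(x, y, l):
    """Plaintext evaluation of the pi_DC logic on l-bit integers."""
    xb = [(x >> i) & 1 for i in range(l)]
    yb = [(y >> i) & 1 for i in range(l)]
    d = [yb[i] * (1 - xb[i]) for i in range(l)]          # d_i = y_i * (1 - x_i)
    e = [(xb[i] + yb[i] + 1) % 2 for i in range(l)]      # assumed equality bit
    c = [d[i] * int(all(e[j] for j in range(i + 1, l))) for i in range(l)]
    return (1 + sum(c)) % 2                              # 1 iff x >= y

for x in range(8):
    for y in range(8):
        assert geq(x, y, 3) == int(x >= y)
```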
Protocol: πargmax
Setup: Let l be the bitlength and k the number of values to be compared. For each required
multiplication, the trusted initializer selects blinding values U, V, W uniformly from Zq with
W := U V and distributes their shares to the parties
Input: Each party has as inputs the shares Jvj,i Kq for all j ∈ {1, . . . , k} and i ∈ {1, . . . , l}
Output: Value m computed by party P1
Execution:
1. For j ∈ {1, . . . , k} and n ∈ {1, . . . , k}, the parties compute in parallel the distributed
comparison protocol with inputs Jvj,i K2 and Jvn,i K2 (i ∈ {1, . . . , l}). Let Jwj,n K2
denote the obtained output
2. For j ∈ {1, . . . , k}, compute in parallel Jwj K2 = Jwj,1 K2 · · · Jwj,k K2 using the multipli-
cation protocol
3. For j ∈ {1, . . . , k}, the parties open Jwj K2 toward P1 , which outputs as m an index j
such that wj = 1