Privacy Preserving Machine Learning
Faculdade de Tecnologia
Departamento de Engenharia Elétrica
Advisor
Prof. Dr. Daniel Guerreiro e Silva
Co-advisor
Prof. Dr. Anderson Clayton Alves do Nascimento
Brasília
2022
Signatures
Universidade de Brasília
To God, author and meaning of all existence, be all the glory, honor and praise.
Acknowledgements
To all those who supported me in this harsh journey, especially to my beloved Talita.
Abstract
Machine learning (ML) applications have become increasingly frequent and pervasive in
many areas of our lives. We enjoy customized services based on predictive models built
with our private data. There are, however, growing concerns about privacy. This is evidenced by the enactment of the General Data Protection Law in Brazil and by similar legislative initiatives in the European Union and in several other countries.
This trade-off between privacy and the benefits of ML applications can be mitigated with the use of techniques that allow the construction and operation of these computational models with formal (mathematical) guarantees of privacy preservation. These techniques need to respond adequately to challenges posed at every stage of the typical ML application life cycle, from data discovery, through feature extraction, model training and validation, up to its effective use.
This work presents a framework of techniques for Privacy-Preserving Machine Learning (PPML) and Natural Language Processing (NLP), built on homomorphic cryptography primitives and Secure Multi-party Computation (MPC) protocols, which allow the adequate treatment of data and the efficient application of ML algorithms with robust guarantees of privacy preservation in text classification. This work also brings a practical application of privacy-preserving text classification to the detection of fake news.
Contents
1 Introduction
1.1 Research subject
1.2 Motivation
1.3 Research objectives
1.4 Methodology
1.4.1 Preliminary results
1.4.2 Limitations
1.5 Organization

3 Text classification
3.1 Natural Language Processing
3.2 Classic NLP preprocessing techniques
3.2.1 Tokenization
3.2.2 Stopword removal (SwR)
3.2.3 Stemming
3.2.4 Lemmatization
3.2.5 Part-of-Speech (PoS) tagging
3.2.6 Bag-of-Words (BoW)
3.2.7 Term Frequency – Inverse Document Frequency (TF-IDF)
3.2.8 Continuous bag-of-words (CBoW)
3.3 The State-of-the-Art: Transformers
3.4 Trade-offs and applications

4 Fake news detection
4.1 Detection approaches
4.1.1 Source based detection
4.1.2 Fact checking
4.1.3 Natural Language Processing (NLP)
4.2 Experimental results
4.2.1 Clear-text setting
4.2.2 Privacy-preserving setting
4.2.3 Privacy-preserving inference

Bibliography

Appendix
I MPC Protocols
List of Figures
List of Tables
Acronyms
HE Homomorphic Encryption.
ML Machine Learning.
1 Introduction
— Blaise Pascal, Pensées
Natural Language Processing (NLP), the group of techniques that includes text classification, is one of the earliest and also one of the most advanced areas of Machine Learning (ML). Conversational agents and generative models still leave people in awe, especially when large language understanding models, such as GPT-3, make it to the headlines in the general media, which advertise their fascinating performance on tasks as diverse as producing working computer code, coherently responding to e-mails, or even writing a novel [1].
In fact, astonishing headlines aside, ML applications have become ever more present and have an increasing impact on everyday life. Predictive models built with ML algorithms are found in trivial tasks, from ranking and ordering algorithms in search engines and social media to recommendation systems in streaming platforms and e-commerce marketplaces. There are also applications in areas as sensitive as medical imaging diagnosis and the detection of tax fraud and crimes against the financial system.
In most cases, we benefit from personalized services based on inference models built with our private data. Recent disclosures of numerous cases of abuse arising from the possession of such data, in addition to frequent security breaches that expose millions of users, have fueled a growing concern about privacy. The General Data Protection Law (LGPD) in Brazil, and similar legislative initiatives such as the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act of 2018 (CCPA), are evidence of this move to limit and regulate the use of private information by large service providers [2, 3].
This trade-off between privacy and the benefits of ML applications can be mitigated with the use of techniques that allow the training and application of predictive models while preserving user privacy. These techniques need to respond adequately to the challenges presented at all stages in the life-cycle of a typical ML application: from data discovery, through the data wrangling stage (feature selection, feature extraction, combination, normalization and imputation in the sample space), to the training and validation of the models, until their effective use in inference.
The main challenge, in all these lines of research, is the distance between the logical or theoretical design of the proposed solutions and their intended practical applications. The most cited works in the area generally do not show how, or how effectively, the proposed system or protocol deals with problems of scale, response time, and usability, among others, which drastically affect its ability to support a real decision process in real use cases.
Furthermore, most works in the literature focus on just one phase of the Machine Learning life-cycle: either the training-related steps or the application of the trained model in inference (both for regression and classification models). Little or no attention is given to the initial steps, such as data fitting and feature extraction, activities that are extremely relevant for the overall performance of the predictive models. Works that bring any analysis of the statistical robustness or the quality of the models produced are also rare.
Therefore, the selected subject for the current research is not limited to the study of PPML techniques for privacy-preserving text classification. The complete knowledge production cycle is considered, from the basic theory to the assessment of the impact of its applications. Attention is given to how those techniques can be used with, or interact with, well-established NLP procedures for textual data wrangling and preprocessing. The study also looks into statistical tests to check the predictive power and quality of the produced models. Practical implementation details, such as realistic computational cost analysis, are also examined.
For the sake of experimental reproducibility, and as a way to apply and demonstrate
the produced knowledge in publicly available databases, this study also dives into the use
case of text classification for fake news detection.
1.2 Motivation
This work results from both the author's research as a doctoral student at the Graduate Program in Electrical Engineering (PPGEE) at Universidade de Brasília, and his work as a Data Scientist at the Administrative Council for Economic Defense (Cade). It also builds on experience with homomorphic cryptography gathered during the author's master's research [17].
The scientific investigation proposed in this project is primarily based on the need to complete the knowledge production cycle, as discussed above, with the effective development, application and validation of a technology based on the state of the art in the fields of Cryptography, Machine Learning and Natural Language Processing. These are somewhat disparate concepts: while ML and NLP are focused on extracting information, Cryptography is focused on concealing it. Therefore, there is a contribution from the theoretical point of view, considering the exposition on how to couple and harmonize such different groups of techniques. Another contribution is the discussion of adequate security models for the specific task of text classification. There is also a contribution from a technical point of view, with the carrying out of experiments that may serve as a reference implementation of the proposed models. The most relevant contribution, nevertheless, is the practical application, with relevant impact on the institutions involved - in the present case, the specific contribution to Cade, the Brazilian competition authority.
The intelligence unit at Cade deals with an enormous quantity of data, from large public procurement databases to open intelligence sources such as news articles, online marketplaces, and company websites. This unit also performs dawn raids, collecting documents from investigated companies - on paper, on computers, hard drives, executives' smartphones, etc. Some operations are carried out in cooperation with prosecutors and police at the Federal or State level. Some, involving multinationals, are coordinated with competition authorities in other territories.
All this intelligence activity, nonetheless, is bound by the data protection laws in force in Brazil, which require formal guarantees of privacy protection [19]. While searching for evidence of cartel or other anticompetitive conduct, Cade needs to protect the privacy of the individuals involved - whether executives, legal representatives or other persons somehow related to the investigated economic agents. A considerable portion of this data lies in textual formats; hence the importance of privacy in text classification.
• To identify possible limitations and flaws in PPML solutions presented in the literature;
• To propose and test improvements, new techniques and implementation details that may correct and overcome major flaws and limitations pointed out in previous solutions;
• To build a reference implementation of a framework in order to demonstrate the feasibility and correctness of the proposed techniques;
• To propose a work process that adequately integrates said framework with the processes in force at an organization that makes massive use of private or legally confidential data.
1.4 Methodology
Experience shows, as reported by Trauth [20], that when a study in the area of technology
development seeks to understand the impact of a given solution, there is a need for the
application of qualitative methodologies.
Also, according to Deb, Dey & Balas [21], engineering research must effectively combine the conceptualization of a research question with practical problems, from equipment to the algorithms and mathematical concepts used to solve the proposed problem. Furthermore, engineering research should advance knowledge in three broad, and somewhat overlapping, areas: observational data, that is, knowledge of the phenomena; functional modeling of the observed phenomena; and the design of processes (algorithms, procedures, arrangements) that contribute to the desired output. This brings a strong descriptive character to engineering research, as one must clearly communicate the preconditions and environmental dependencies, observed data, processes, inputs and outputs of their experimental results.
Kaplan and Duchon state that the eminently applied nature of research in the field of information systems requires a combination of qualitative and quantitative methods [22]. They assert that quantitative investigations, usually performed through statistical hypothesis testing, are extremely limited when the expected results or the intended applications are highly dependent on context.

They refer to the work of Yin [23], which deals with methods for Case Study research, to show that quantitative research must be preceded by a qualitative investigation, in which the problem is better defined based on the observation of the context, habits and needs of the stakeholders. This is even more relevant in exploratory research, in which new hypotheses are raised. These initial hypotheses need contextualization, through qualitative investigation, to create more refined models and hypotheses, which can then be tested using quantitative methods.
Taking into account the research subject and the proposed objectives, this work is exploratory in nature. It draws from different areas of study to propose a new technological arrangement. It is also of an applied nature, as the method used for scientific investigation is centered on the design of a solution for a specific problem. Therefore, it must combine descriptive, qualitative and quantitative approaches to the study of the selected problem. Overall, this work can be described as the combination of the following initiatives:
7. Complexity and cost analysis and comparison between HE and MPC protocols;
9. Experimentation on fake news detection over ciphered text and MPC protocols;
1.4.2 Limitations
Because it spans research on several different topics (ML, NLP, HE and MPC), this document is not meant as a deep exposition of any of these areas. It provides, nonetheless, a good set of references for those willing to further investigate relevant results in each area.
This work also does not set out to be a reference for complete security proofs of all
the cryptographic primitives and protocols used. There is, however, sufficient discussion
regarding composable protocols and the security of protocol compositions.
There are projects underway to apply the knowledge gathered in this study, as well as the proposed techniques, to document classification and evidence search at Cade. There are critical use cases, especially those involving cooperation and information sharing with other government agencies and with competition authorities in other countries. However, due to confidentiality requirements, it is not possible to expose results or present reproducible experiments performed over such data.
All the experimental details exposed are limited to the fake news detection task. There is also the choice to treat this task as a binary output problem: documents are classified as either fake or true. This choice results from the fact that most public fake news datasets are annotated that way. The same techniques can be generalized to multi-class problems using the one-vs-all approach, which is carried out by training one classifier per class.
1.5 Organization
The next chapter brings a more detailed exposition of the concept of Privacy-Preserving Machine Learning (PPML), with selected results from the literature. Chapter 3 presents an extensive literature review on Natural Language Processing, from the classic preprocessing techniques developed over the last four decades to the present state of the art with 'transformers' and other complex Natural Language Understanding models. The fourth chapter discusses the concept of fake news, offers a short review of fake news detection, and introduces a few preliminary results on fake news detection with the use of privacy-preserving ML algorithms.
The last chapter summarizes our results and conclusions, pointing out the best of our knowledge in privacy-preserving text classification solutions and their application to the specific problem of fake news detection.
2 Privacy-Preserving Machine Learning
example, the CryptDB from MIT [15]. In this line there are also many practical solutions, including areas as sensitive as Electronic Health Records [14, 17].
Note that the use of encrypted databases usually requires the prompt availability of the entire database, plus the computing power needed for encryption and for the training of inference models - which is not always the case. Seeking to overcome this limitation, there is also a line of work focused on online, distributed, interactive and reinforcement learning techniques. These applications require the development of Secure Multi-party Computation (MPC) and the composition of several protocols as the basis for learning algorithms [18].
Since then, privacy-preserving computation has grown in importance and attention, and many Privacy-Preserving Machine Learning (PPML) and Privacy-Preserving Function Evaluation (PPFE) frameworks have been developed. The protocols that form the basic building blocks of these frameworks are usually based on homomorphic cryptography primitives or Secure Multi-Party Computation protocols. Some of the first frameworks to appear in the literature, for instance, used MPC protocols based on Secret Sharing. Among those preceding results are FairPlayMP [26] and Sharemind [27].
Recent developments include frameworks like PySyft [28], which uses HE protocols, and Chameleon [29], which uses MPC for linear operations and Garbled Circuits for non-linear evaluations. Other MPC frameworks in the literature include CrypTen [30], PICCO [31], TinyGarble [32], ABY3 [33] and SecureML [34].
Protocol: πADD
Input: secret shares ⟦x1⟧q, ..., ⟦xn⟧q
Output: ⟦z⟧q = Σ_{j=1}^{n} ⟦xj⟧q
Execution:
1. Each party Pi ∈ P locally computes its output share zi = Σ_{j=1}^{n} xj,i mod q, where xj,i denotes Pi's share of xj
Additive secret sharing is one way to implement MPC. Protocol parties have additive shares of their secret values and perform joint computations over those shares. For example, to create n additive shares of a secret value x ∈ Zq, a participant can draw (x1, ..., xn) uniformly from {0, ..., q − 1} such that x = Σ_{i=1}^{n} xi mod q. We denote this sharing of x by ⟦x⟧q.
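As a concrete illustration, the sharing and reconstruction steps just described, together with πADD, can be sketched in a few lines of Python. This is a local simulation for clarity: the modulus Q and the list-based representation of the parties' shares are illustrative choices, not taken from the text.

```python
import random

Q = 2**31 - 1  # an illustrative prime modulus q

def share(x, n):
    """Split secret x into n additive shares that sum to x mod Q."""
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Opening step: z = sum of all local shares mod Q."""
    return sum(shares) % Q

def pi_add(shared_secrets):
    """pi_ADD: party i locally sums its share of every secret x_j."""
    n = len(shared_secrets[0])
    return [sum(sh[i] for sh in shared_secrets) % Q for i in range(n)]
```

Note that no single share reveals anything about x: each party only learns the secret if all local values are combined in the final opening step.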
Protocol: πMUL
Setup:
1. The Trusted Initializer draws u, v, w uniformly from Zq, such that w = uv, and distributes shares ⟦u⟧q, ⟦v⟧q and ⟦w⟧q to the protocol parties
2. The TI draws ι uniformly from {1, ..., n}, and sends the asymmetric bit bι = 1 to party Pι and bi = 0 to the parties Pi≠ι
Input: ⟦x⟧q and ⟦y⟧q
Output: ⟦z⟧q = ⟦xy⟧q
Execution:
1. Each party Pi locally computes di = xi − ui and ei = yi − vi
2. Parties broadcast di, ei
3. Each party computes d ← Σ_{i=1}^{n} di and e ← Σ_{i=1}^{n} ei
4. Each party Pi computes its output share zi = wi + e·ui + d·vi + bi·d·e mod q
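The multiplication protocol can also be sketched in Python. The following is a single-process simulation in which the Trusted Initializer's setup is folded into the same function for brevity; the variable names mirror the protocol description, but the local structure (no real network broadcast) is an illustrative simplification.

```python
import random

Q = 2**31 - 1  # an illustrative prime modulus q

def share(x, n):
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

def pi_mul(x_sh, y_sh):
    """pi_MUL for n parties, with the TI setup simulated locally."""
    n = len(x_sh)
    # Setup: the TI distributes shares of a triple w = u * v,
    # and gives the asymmetric bit to one random party iota
    u, v = random.randrange(Q), random.randrange(Q)
    u_sh, v_sh = share(u, n), share(v, n)
    w_sh = share((u * v) % Q, n)
    iota = random.randrange(n)
    # Step 1 (local): d_i = x_i - u_i, e_i = y_i - v_i
    d_sh = [(x_sh[i] - u_sh[i]) % Q for i in range(n)]
    e_sh = [(y_sh[i] - v_sh[i]) % Q for i in range(n)]
    # Steps 2-3: broadcast and sum, opening d = x - u and e = y - v
    d, e = sum(d_sh) % Q, sum(e_sh) % Q
    # Step 4 (local): z_i = w_i + e*u_i + d*v_i, plus d*e at party iota
    return [(w_sh[i] + e * u_sh[i] + d * v_sh[i]
             + (d * e if i == iota else 0)) % Q for i in range(n)]
```

The correctness follows from xy = (u + d)(v + e) = w + e·u + d·v + d·e, with the asymmetric bit ensuring the public term d·e is added exactly once.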
Given two sets of shares ⟦x⟧q, ⟦y⟧q and a constant α, it is trivial to implement a protocol like πADD or πMUL in order to locally compute linear functions over the respective shares and broadcast local results to securely compute values such as ⟦z⟧q = α(⟦x⟧q ± ⟦y⟧q), ⟦z⟧q = α ± (⟦x⟧q ± ⟦y⟧q), and ⟦z⟧q = α1(α2 ± (⟦x⟧q ± ⟦y⟧q)). In all those operations, ⟦z⟧q is the shared secret result of the protocol. The real value of z can only be obtained if one has knowledge of the local zi values held by all computing parties, and performs a last step of computation to obtain z = Σ_{i=1}^{n} zi mod q.
With these two basic building blocks, addition and multiplication, it is possible to compose protocols to perform virtually any computation. For instance, Protocol πEq uses πMUL to provide Secure Distributed Equality computation. Appendix I provides definitions for other additive secret sharing MPC protocols, such as the Secure Multi-party Inner Product Protocol πIP, Secure Multi-party Bitwise OR/XOR πOR|XOR, Secure Multi-party Argmax πargmax and the Secure Multi-party Order Comparison Protocol πDC.
Protocol: πEq
Setup: the setup procedure for πMUL
Input: ⟦x⟧q and ⟦y⟧q
Output: ⟦z⟧q = ⟦0⟧q if x = y; ⟦z⟧q ≠ ⟦0⟧q otherwise
Execution:
3. Output ⟦z⟧q
All of these protocols are built on the commodity-based model [36]. In this approach, there is a costly offline phase, led by a Trusted Initializer (TI), that pre-distributes correlated random numbers. This role can be performed by an independent agent or by one of the computing parties, without loss of generality or of the security guarantees of the online phase.
In order to learn from data, a Machine Learning algorithm will usually update a set β of internal model parameters by iterating a computation over the training set, comparing the result of a function F(X) = y′ over the properties of each element in the sample with the expected inference output. The difference y′ − y enters a 'loss' function, which is used to update the internal parameters with an intensity defined by a pre-set learning rate.
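The iterative update just described can be sketched as plain gradient descent over a squared loss for a linear model. The function names and the choice of optimizer are illustrative assumptions, since the text does not fix a particular update rule.

```python
def predict(beta, x):
    # linear model: F(x) = sum_j beta_j * x_j
    return sum(b * xj for b, xj in zip(beta, x))

def train_step(beta, X, y, lr):
    """One iteration: compare predictions y' = F(X) with the expected
    outputs y, and move each parameter against the gradient of the
    mean squared loss, scaled by the learning rate lr."""
    n = len(X)
    grads = [
        2 / n * sum((predict(beta, x) - yi) * x[j] for x, yi in zip(X, y))
        for j in range(len(beta))
    ]
    return [b - lr * g for b, g in zip(beta, grads)]
```

Repeating `train_step` drives β toward the minimizer of the mean squared error; in a privacy-preserving setting, every multiplication and addition inside it would be replaced by the corresponding MPC protocol.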
One of the most commonly used models, Linear Regression, consists in the multiplication of the model parameters β with a matrix X ∈ Zq^{n×k} representing the feature values of all the elements in the training set. The learning goal is to find a coefficient vector β = (β0 β1 ... βk) that minimizes the mean squared error

(1/n) Σ_{i=1}^{n} (βxi − yi)²    (2.1)
The coefficient vector that minimizes (2.1) can be computed in closed form as

β = (XᵀX)⁻¹ Xᵀ y    (2.2)
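As a minimal numerical sketch of equation (2.2), the following uses NumPy and a hypothetical toy dataset generated from y = 1 + 2x (the data and the column-of-ones intercept trick are illustrative, not from the text):

```python
import numpy as np

# Toy design matrix: first column of ones for the intercept beta_0,
# second column holding the single feature x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Equation (2.2): beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
# beta ≈ [1., 2.], recovering the intercept and slope exactly
```

In a privacy-preserving setting, the matrix products and the inversion in this one-liner are exactly the operations that protocols such as πMMUL below must provide over shared data.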
Protocol: πMMUL
Setup: the TI chooses uniformly random Ax, Aw ∈ Zq^{n1×n2}, Bx, Bw ∈ Zq^{n2×n3} and T ∈ Zq^{n1×n3}, and distributes the values Ax, Bx and T to party Pi, and the values Aw, Bw and C = (AxBw + AwBx − T) to Pj≠i
Input: ⟦X⟧q and ⟦W⟧q
Output: ⟦XW⟧q
Execution:
2. Pj sends (X − Aw) and (W − Bw) to Pi
3. Output ⟦XW⟧q
∀m ∈ M, fC(Enc(m)) ≡ Enc(fM(m))
Fully Homomorphic Encryption (FHE) refers to a class of cryptosystems for which the homomorphism is valid for every function defined on M. That is:

∀fM : M → M, ∃fC : C → C such that fC(Enc(m)) ≡ Enc(fM(m)) ∀m ∈ M
The most commonly used homomorphic cryptosystems, however, are only partially homomorphic. There are additively homomorphic systems, multiplicatively homomorphic systems, and systems that combine a few homomorphic features. Paillier's cryptosystem, for example, offers an additive homomorphism between ciphertexts, along with multiplication of a ciphertext by a plaintext constant, features that can be used to delegate a limited set of computations over a dataset without compromising its confidentiality [37, 16].
The underlying primitive in Paillier's system is the Decisional Composite Residuosity Problem (DCRP). This problem deals with the intractability of deciding, given n = pq, where p and q are two unknown large primes, and an arbitrary integer g coprime to n, whether g is an n-th residue modulo n²; in other words, whether there exists y ∈ Z*_{n²} such that g ≡ yⁿ mod n². In his work, Paillier defines the DCRP and demonstrates its equivalence (in terms of computing cost) with the Quadratic Residuosity Problem, which is the foundation of well-known cryptosystems such as Goldwasser-Micali's [38].
Paillier's system can be defined by three algorithms:

Paillier.KeyGen: the key generation algorithm selects two large primes p, q; computes their product n = pq; uniformly draws g coprime to n; computes λ = lcm(p − 1, q − 1); and, finally, computes µ = (L(g^λ mod n²))⁻¹ mod n, where L(u) = (u − 1)/n. The public key is ⟨n, g⟩, and the private key is ⟨µ, λ⟩;

Paillier.Enc: given the public key ⟨n, g⟩ and a message m ∈ Zn, the encryption algorithm consists in uniformly drawing r from {1, ..., n − 1} and computing the ciphertext c = g^m · r^n mod n²;

Paillier.Dec: the decryption algorithm, in turn, receives the private key ⟨µ, λ⟩ and a ciphertext c, and recovers the corresponding message as m = L(c^λ mod n²) · µ mod n.
This construction harnesses the homomorphic relationship between the rings Zn and Zn² to render the following features:

Asymmetric cryptography: it is possible to perform homomorphic computations over the encrypted data using only the public key. Learning the result of the computation, nevertheless, requires access to the private key;

Additive homomorphism: the multiplication of two ciphertexts yields a ciphertext of the sum of their respective messages. That is:

Enc(m1) · Enc(m2) mod n² ≡ Enc(m1 + m2 mod n)
Again, it is straightforward to see that, with the basic building blocks of addition and multiplication, it is possible to compose arbitrarily complex protocols for privacy-preserving computation using homomorphic encryption.
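A toy Python sketch of the three Paillier algorithms makes the additive homomorphism concrete. The tiny primes and the common simple choice g = n + 1 are illustrative simplifications only; real deployments use keys of 2048 bits or more.

```python
import math
import random

def keygen(p, q):
    """Paillier.KeyGen with toy primes (NOT secure key sizes)."""
    n = p * q
    g = n + 1                      # a standard simple choice of g coprime to n
    lam = math.lcm(p - 1, q - 1)   # lambda = lcm(p - 1, q - 1)
    L = lambda u: (u - 1) // n     # the function L(u) = (u - 1) / n
    mu = pow(L(pow(g, lam, n * n)), -1, n)
    return (n, g), (lam, mu)

def enc(pub, m):
    """Paillier.Enc: c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def dec(pub, priv, c):
    """Paillier.Dec: m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pub
    lam, mu = priv
    L = lambda u: (u - 1) // n
    return (L(pow(c, lam, n * n)) * mu) % n
```

Multiplying two ciphertexts modulo n² then decrypts to the sum of the messages, and raising a ciphertext to a plaintext constant decrypts to the product with that constant.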
Very few works on PPML publish their results with computing times observed against benchmark datasets/tasks (e.g. classification on the ImageNet or the Iris dataset). When any general estimate is present, it is usually the complexity order O(g(n)), where g(n) is a function that asymptotically upper-bounds the cost of the algorithm.

The order function g(n) represents the general behavior or shape of a class of functions. So, if t(n) is the function defining the computational cost of a given algorithm over its input size n, then stating that t(n) ∈ O(g(n)) means that there is a constant c and a large enough value of n after which t(n) is always less than c · g(n).

A protocol is usually regarded as efficient if its order function is at most polynomial. Sub-exponential or exponential orders, on the other hand, are usually deemed prohibitively high.
For example, the authors of ABY3 [33] assert that their Linear Regression protocol is the most efficient in the literature, with cost O(B + D) per round, where B is the training batch size and D is the feature matrix dimension [33]. That may seem like a very well behaved linear function over n, which would lead us to conclude they devised an exceptionally efficient protocol of order O(n).

Nevertheless, this order expression only bounds the number of operations to be performed by the algorithm; it does not give an accurate estimate of execution time. More importantly, the order function only bounds the actual cost function for extremely large input sizes. Recall that t(n) = Kn is in O(n), regardless of how arbitrarily large the constant K may be.
Thus, for small input sizes, the actual cost may be many orders of magnitude higher than the asymptotic bound suggests. The addition protocol in [40] is also of order O(n), but one would never assume that a protocol with many matrix multiplications and inversions can run as fast as one with a few simple additions.
Probabilistic and ML models are commonly used to estimate cost, execution time and
other statistics over complex systems and algorithms, especially in control or real-time
systems engineering [41, 42].
We propose the use of Monte Carlo methods to estimate execution times for privacy-preserving computations, considering different protocols and input sizes. The remainder of this chapter briefly reviews the cost of privacy-preserving computation, describes the Monte Carlo methods used, discusses implementation details and results of the various Monte Carlo experiments we performed, and points out relevant questions open for further investigation.
2.3.1 Monte Carlo methods for integration
There is another way to estimate execution times. All examples of privacy-preserving computation found in the literature have one thing in common: their protocols depend heavily on pseudo-randomly generated numbers, used to mask or encrypt private data.

Those numbers are assumed to be drawn independently according to a specific probability density function. That is, the algorithms use at least one random variable as input. Although the observed execution time does not depend directly on the values of the random inputs, it is directly affected by their magnitude or, more specifically, their average bit-size.
Also, the size of the dataset has a direct impact on execution time. If we consider the size of the dataset as a random variable, then the magnitude of the random numbers and the numerical representation of the dataset are, unequivocally, random variables. So it is safe to assume that protocol runtimes, which are a function of the previous two variables, are also random variables.
Monte Carlo methods are a class of algorithms based on repeated random sampling that render numerical approximations, or estimations, of a wide range of statistics, as well as the associated standard error for the empirical average of any function of the parameters of interest. We know, for example, that if X is a random variable with density f(x), then the mathematical expectation of the random variable T = t(X) is:

E[t(X)] = ∫_{−∞}^{∞} t(x) f(x) dx.    (2.3)
And, if the analytic solution for the integral is hard or impossible to obtain, we can use a Monte Carlo estimation for the expected value. Drawing M samples x1, ..., xM from f, it can be obtained with:

θ̂ = (1/M) Σ_{i=1}^{M} t(xi)    (2.4)
In other words, if the probability density function f(x) has support on a set X (that is, f(x) ≥ 0 ∀x ∈ X and ∫_X f(x) dx = 1), we can estimate the integral

θ = ∫_X t(x) f(x) dx    (2.5)
with

θ̂ = (1/M) Σ_{i=1}^{M} t(xi)    (2.6)
and the associated variance estimate

V̂ar(θ̂) = (1/M²) Σ_{i=1}^{M} (t(xi) − θ̂)²    (2.7)
In order to improve the accuracy of our estimation, we can always increase M, the divisor in the variance expression. That comes, however, at increased computational cost. We explore this trade-off in our experiments by performing the simulations with different values of M and then examining the impact of M on the observed sample variance and on the execution time of the experiment.
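Equations (2.6) and (2.7) can be sketched in a few lines of Python. The target t(x) = x² with X uniform on (0, 1) is an illustrative choice (the exact value of the integral is 1/3), standing in for the protocol runtimes of interest.

```python
import random

def mc_estimate(t, sampler, M):
    """Estimate E[t(X)] from M draws of X (Eq. 2.6) together with the
    plug-in variance of the estimator (Eq. 2.7)."""
    samples = [t(sampler()) for _ in range(M)]
    theta = sum(samples) / M
    var = sum((s - theta) ** 2 for s in samples) / M ** 2
    return theta, var

random.seed(0)
# t(x) = x^2 with X ~ Uniform(0, 1): theta should approach 1/3
theta, var = mc_estimate(lambda x: x * x, random.random, 100_000)
```

Increasing M shrinks the variance estimate (the 1/M² divisor) while linearly increasing the simulation cost, which is exactly the trade-off discussed above.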
3 Text classification
— René Descartes, Discourse on Method
3.2 Classic NLP preprocessing techniques
3.2.1 Tokenization
Tokenization is a preprocessing technique commonly understood as the first step of any kind of natural language processing. It is used to identify the atomic units of text processing. The text, represented as a single sequence of characters, is transformed into a collection of tokens: words, punctuation marks, emojis, etc. Most NLP software libraries (e.g. nltk, gensim and CoreNLP) provide multiple tokenization strategies, such as character, sub-word, word or n-gram tokenization. The best granularity or tokenization strategy usually depends on the application [44].
A common practice is to combine tokenization with sentence splitting. The gensim li-
brary, for instance, will perform tokenization by processing a sentence at a time. CoreNLP,
in turn, adds flags to the tokens that represent the limits of each sentence. Sentence split-
ting is also very important to other NLP methods, such as Part-Of-Speech (PoS) tagging
and Named Entity Recognition (NER).
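A minimal sketch of word-level tokenization combined with naive sentence splitting, in the spirit of what the libraries above do internally. The regular expressions are illustrative simplifications, not the actual rules used by nltk, gensim or CoreNLP:

```python
import re

def sentence_split(text):
    # Naive sentence splitter: break on '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # Word-level tokens: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

doc = "Fake news spread fast. Can models detect them?"
tokens = [tokenize(s) for s in sentence_split(doc)]
```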
3.2.2 Stop word removal
Stopword removal eliminates very frequent terms that carry little discriminative infor-
mation for text classification. Different classes of strategies are used to build the removal
lists, including:
• Pre-compiled dictionary: manually curated stopword lists. The lists may be crafted
for specific contexts, jargons or document corpora;
• Frequency based: use frequency based rules, such as TF-High (removal of terms
with high frequency), TF-1 (removal of terms with a single occurrence), IDF-Low
(removal of terms with low inverse document frequency, i.e. terms that are present
in most documents);
• Term Based Random Sampling (TBRS): uses the Kullback-Leibler divergence be-
tween term frequencies measured on the whole corpus and those measured on ran-
domly sampled text chunks to identify words with low divergence and, consequently,
low information for any given text class.
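The frequency-based rules above can be sketched as follows; the quantile cutoff and the toy corpus are illustrative assumptions, since the appropriate thresholds are corpus-dependent:

```python
from collections import Counter

def frequency_stopwords(docs, high_quantile=0.9):
    """TF-1 and TF-High rules: flag terms occurring exactly once in the
    corpus and terms at or above a high-frequency cutoff (here a quantile
    of the observed counts)."""
    counts = Counter(token for doc in docs for token in doc)
    tf1 = {t for t, c in counts.items() if c == 1}
    cutoff = sorted(counts.values())[int(high_quantile * (len(counts) - 1))]
    tf_high = {t for t, c in counts.items() if c >= cutoff}
    return tf1 | tf_high

docs = [["the", "vaccine", "claim", "is", "false", "xyzq"],
        ["the", "vaccine", "claim", "is", "true"],
        ["the", "report", "is", "accurate"]]
stop = frequency_stopwords(docs)
filtered = [[t for t in d if t not in stop] for d in docs]
```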
3.2.3 Stemming
Stemming is the reduction of variant forms of a word, eliminating inflectional morphemes
such as verbal tense or plural suffixes, in order to provide a common representation, the
root or stem. The intuition is to perform a dimensionality reduction on the dataset,
removing rare morphological word variants, and to reduce the risk of bias in the word
statistics measured on the documents [47].
Most stemming algorithms only truncate suffixes and do not return the appropriate
term stem or even a valid word in the language of the text. There are different classes of
stemming algorithms, including:
• Dictionary based algorithms: lookup tables with terms and corresponding stems.
Usually restricted to a specific corpus, jargon or knowledge area;
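A minimal truncation-based stemmer illustrates why the output is often not a valid word in the language. The suffix list below is an illustrative assumption, not a published algorithm such as Porter's:

```python
# Suffixes are tried longest-first; a minimum stem length avoids
# over-truncating short words.
SUFFIXES = ["ations", "ation", "ing", "ed", "es", "s"]

def stem(word, min_stem=3):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["classifications", "classified", "classifying", "classify"]
stems = [stem(w) for w in words]
```

Note that "classific" and "classifi" are not valid English words, yet the variants still collapse toward a common prefix, which is all the downstream term statistics require.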
3.2.4 Lemmatization
Lemmatization consists of the reduction of each token to a linguistically valid root, or
lemma. The goal, from the statistical perspective, is exactly the same as in stemming:
reduce variance in term frequency. It is sometimes compared to the normalization of the
word sample, and aims to provide more accurate transformations than stemming, from
the linguistic perspective [48].
The impact on predictive models, however, will depend on characteristics of the lan-
guage or the document corpus being processed. In highly inflectional languages, such as
Latin and the romance languages, lemmatization is expected to produce better results
than stemming [49].
The typical lemmatizer implementation requires the creation of a lexicon (dictionary
or wordbook) of valid words and their corresponding lemma [50]. Yet, there are different
classes of algorithms, designed to deal with distinct problems in word normalization and
different languages. Recent works in literature, for instance, use deep neural networks to
produce ‘neuro lemmatizers’ trained for specific tasks [51, 52].
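The typical lexicon-based lemmatizer can be sketched as a lookup with a fallback; the toy dictionary below is illustrative, standing in for a real lexicon with hundreds of thousands of (form, lemma) pairs:

```python
# Toy lexicon mapping inflected forms to lemmas; real lexicons are built
# from curated wordbooks for a specific language.
LEXICON = {
    "was": "be", "were": "be", "is": "be",
    "better": "good", "ran": "run", "running": "run",
}

def lemmatize(token):
    # Fall back to the lowercased token for out-of-vocabulary words.
    return LEXICON.get(token.lower(), token.lower())

sentence = ["The", "news", "was", "better", "than", "expected"]
lemmas = [lemmatize(t) for t in sentence]
```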
3.2.5 Part-of-Speech (PoS) tagging
Part-of-Speech tagging is a processing technique that flags each token with a grammatical
class, taking into account the sentence or even the context of the sentence in which they are
found [53, 52]. Most implementations will return multiple tags per token, with syntactic,
lexical, phrasal and other categories.
PoS tagging helps to differentiate homonyms, words with the same spelling but differ-
ent meanings, and to capture part of the semantic relations between words. Therefore,
many works in the fake news detection literature use PoS tags to engineer new features
(e.g. "noun count", "adjective count", "mean adjectives per noun") to capture concepts
such as ‘style’ or ‘quality’ of the text and improve model accuracy [54, 55].
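Such stylometric features can be engineered from tagged tokens as sketched below. The tags are written by hand for the example, where a real pipeline would obtain them from a PoS tagger:

```python
def pos_style_features(tagged_tokens):
    """Engineer simple stylometric features from (token, PoS-tag) pairs."""
    nouns = sum(1 for _, tag in tagged_tokens if tag == "NOUN")
    adjectives = sum(1 for _, tag in tagged_tokens if tag == "ADJ")
    return {
        "noun_count": nouns,
        "adjective_count": adjectives,
        "adjectives_per_noun": adjectives / nouns if nouns else 0.0,
    }

# Hand-tagged example; a real pipeline uses a trained tagger.
tagged = [("shocking", "ADJ"), ("secret", "ADJ"), ("report", "NOUN"),
          ("reveals", "VERB"), ("truth", "NOUN")]
features = pos_style_features(tagged)
```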
of occurrences of that token in that document. This algorithm produces an unordered set
that does not retain any information on word order or proximity in the document [58].
In order to deal with this loss of information on word order or word-word relationship,
many techniques were proposed and tested in various NLP tasks. There are, for instance,
Bag-of-N-grams algorithms, where the basic unit of count is not a single word but a set
of words of size n [59].
\[ \mathrm{tf}(t_i, d_j) = 1 + \log \frac{f_{t_i, d_j}}{\sum_{t \in d_j} f_{t, d_j}} \qquad \mathrm{idf}(t_i, D) = 1 + \log \frac{|D| + 1}{|\{d \in D : t_i \in d\}| + 1} \tag{3.1} \]
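The expressions in Equation (3.1) translate directly into code; the toy corpus below is an illustrative assumption:

```python
import math
from collections import Counter

def tf(term, doc_counts):
    # doc_counts: Counter of raw term frequencies f_{t,d} in one document.
    total = sum(doc_counts.values())
    return 1 + math.log(doc_counts[term] / total)

def idf(term, corpus):
    # corpus: list of documents, each represented as a Counter.
    df = sum(1 for doc in corpus if term in doc)
    return 1 + math.log((len(corpus) + 1) / (df + 1))

corpus = [Counter(["fake", "news", "news"]),
          Counter(["real", "news"]),
          Counter(["fake", "claim"])]
weight = tf("news", corpus[0]) * idf("news", corpus)
```

The +1 terms in the idf expression smooth the statistic so that terms present in every document still receive a finite, nonzero weight.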
reverse step, gives the conditional probability for a set of preceding and following words,
given a specific term.
One of the advantages of word embeddings, such as CBoW, is the fixed-length, dense
vector representation that usually allows for more efficient computations. It also captures
some of the semantic relationships of words, based on their co-occurrence probability.
CBoW has also been used to achieve good results in fake news detection [64, 65] and
is possibly the most advanced preprocessing or feature engineering technique that can
be used on top of MPC protocols in order to produce a privacy-preserving fake news
classification model.
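A minimal sketch of the CBoW forward pass: the context word embeddings are averaged, every vocabulary word is scored against the average, and a softmax yields the prediction probabilities for the center word. The two-dimensional embeddings and the tied output weights are illustrative simplifications of what a library such as gensim actually trains:

```python
import math

VOCAB = ["fake", "news", "spreads", "fast"]
EMB = {                                   # input embeddings (dimension 2)
    "fake": [0.1, 0.9], "news": [0.8, 0.2],
    "spreads": [0.4, 0.4], "fast": [0.9, 0.1],
}
OUT = dict(EMB)  # tied output embeddings, a common simplification

def cbow_probabilities(context):
    d = len(next(iter(EMB.values())))
    # Average the context embeddings, score the vocabulary, softmax.
    avg = [sum(EMB[w][k] for w in context) / len(context) for k in range(d)]
    scores = {w: sum(OUT[w][k] * avg[k] for k in range(d)) for w in VOCAB}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

probs = cbow_probabilities(["fake", "spreads"])  # predict the center word
```

Training adjusts the embedding matrices so that the probability of the true center word is maximized over the corpus; here only the forward computation is shown.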
3.3 The State-of-the-Art: Transformers
4 Fake news detection
Fake news are texts, possibly distributed alongside or in other media formats, that present
false, incorrect or inaccurate information and are shared over digital platforms, such as
social networks, messaging apps or news web sites [67]. The main characteristics that dif-
ferentiate fake news from concepts such as gossip, hoaxes and other forms of misinformation
are:
1. They are formatted and presented as legitimate news, usually as a form of ‘self-
validation’, with the intent to manipulate the audience’s cognitive processes;
2. They have faster and broader propagation patterns, partly due to the context of
instant and pervasive communication of digital platforms;
3. They have greater impact on the audience’s social behavior, also partly due to the
business model of the digital platforms based on engagement or “attention reten-
tion”.
These platforms are designed to retain their audience with algorithms that will filter
and sort the content displayed to each user based on their preferences and attention
patterns. The algorithms are so effective in retaining users’ attention that a growing
number of people are now suffering from addiction to social media [68].
With users spending an ever increasing amount of time on their favorite platforms,
social media have amassed huge databases on user profiling and segmentation and be-
came, arguably, some of the most effective mass communication tools. They monetize
their databases serving targeted advertising and charging companies that consume their
Application Programming Interfaces (APIs) to interact with their users [24].
This business model is threatened by the misuse of the platforms with the spread
of illegitimate and false content. The malicious use of digital platforms to spread fake
news has already been applied, for example, to manipulate opinions on extremely relevant
issues, such as presidential elections in France and the United States [69].
Identifying and clearly flagging fake news may help users to assert better judgment
on the content they consume and lessen its negative effects [70]. Research in the area,
nevertheless, faces a few particular challenges: first, the difficulty, from the technological
perspective, to delimit fake news and distinguish them from other forms of propaganda.
Moreover, the relevance of the topic in the political arena elevates the risk of bias and
partisan interference. For instance, a dataset with over 200 citations yields an accuracy of
99% when the presence of a single term is used to label a text as true news [71]. Finally,
there are very few good public datasets and even fewer NLP resources (dictionaries, word
embedding models, language models, etc.) in languages other than English [55].
Another important issue in the area is how to balance the need to detect and appro-
priately handle fake news and the equally important need to guarantee end user’s privacy.
This concern with users’ privacy has led to the development of many Privacy-Preserving
Machine Learning (PPML) techniques [29, 33, 34]. There are already many classic Machine
Learning (ML) algorithms, such as Logistic Regression, Decision Trees and Support Vector
Machines, implemented on top of Secure Multi-party Computation (MPC) protocols [7].
4.1.2 Fact checking
A few works propose solutions that are based on complex conversational models that
query the topics identified in the text against a database of checked facts [75, 76, 77].
Conversational models are Deep Neural Networks trained to react to a text, or a spe-
cific query, according to a database of pairs ⟨input, response⟩. First, the model classifies
the input with multiple labels, each indicating a topic or knowledge area. Then, the
model selects the responses that have higher probability of appropriately responding to
that query. Some models use a fixed knowledge-base, or even a fixed list of responses.
Others will search the web, performing a second round of classification to select the prob-
able correct response [78].
Research in this area has led to the creation of curated databases of checked facts.
Some are maintained by multidisciplinary research groups. Most of these databases,
however, are created and curated by fact checking agencies and news companies [79].
Here, there is a risk that heightened partisanship might interfere with the quality of
these databases. This is especially true in a politically polarized environment, in which
agencies aligned with the different political forces are mutually accused of publishing
fake news [80]. The dataset mentioned in the introduction, for example, flags every text
published by a single news outlet as true and all the others as fake [71].
4.2 Experimental results
4.2.1 Clear-text setting
In order to establish benchmark performance measures in the ‘clear-text’ setting, we
ran the pipeline detailed in [?] for model tuning, selection and testing, with different
combinations of NLP preprocessing techniques, as shown in Table 4.1.
The pipeline uses k-fold validation and random search for hyper-parameter tuning on
Naive Bayes, Decision Tree, K-Nearest Neighbors, Logistic Regression, Support Vector
Machines, Random Forest and XGBoost GBDT classifiers. For the DistilBERT and
Sentence-BERT experiments, we have used the pre-built multilingual models from [?] in
order to encode our datasets with the corresponding embeddings, before submitting them
to the pipeline.
For hyper-parameter search, we decided to use the ROC AUC metric to compare and
select the best models, as it gives better information on model performance in the presence
of class imbalance. After model selection, we recorded ROC AUC, F1-score and accuracy metrics
on the test set for the model selected at the end of each experiment. The best combination
of preprocessing techniques and classifier algorithm, measured by the accuracy on the test
set for each dataset, is presented on Table 4.2. The runtime is in seconds.
We have also trained a few convolutional (CNN) and deep feed-forward (FNN) net-
works in the clear-text setting. We selected the best network of each architecture for the
experiments in the privacy-preserving setting, in order to have a benchmark for comparison,
both on accuracy and runtime. We compare training runtimes for these networks on
Table 4.3.
Note that the choice of these simple architectures is due to CrypTen’s limited imple-
mentation of PyTorch modules. It does not implement modules such as RNN or LSTM,
which have been proven in the NLP literature to provide better results than simple
convolutional or feed-forward networks.
We also trained the convolutional neural network from [?] as a benchmark for our
results. Our neural networks outperformed their model in accuracy, F1 score and ROC
AUC for all datasets.
The results show that in most of our experiments, word embeddings or sentence em-
beddings from large language models did not outperform traditional NLP preprocessing
techniques. Also, the classic ML models outperformed the deep learning models. Par-
ticularly, tree-based models, Random Forest and GBDT, presented the best performance
for most datasets.
We observed that the models trained with the short text datasets had lower perfor-
mance in all metrics, over all experiments. It indicates that the models required a larger
sample of words in order to appropriately approximate the underlying statistics repre-
sented by the trained parameters. Notice, also, the high training runtime for the Random
Forest model trained over Sentence-BERT embeddings for the liar dataset. It indicates
that large language models may provide better results, but may also introduce higher
computing cost.
4.2.2 Privacy-preserving setting
Our experiments cover two basic scenarios of application for PPML techniques in privacy-
preserving fake news detection. The first scenario is the privacy-preserving model training.
It consists of a first party interested in training a model for fake news classification and
a second party, or even a group of parties, that can provide annotated datasets for model
training, but do not want or do not trust the first party to have full knowledge of the
dataset.
This scenario is relevant, for example, in the cases when a social media platform wants
to train a model and another party, such as a fact-checking agency, or a group of scholars
and specialists, will provide the dataset with annotated news articles. This group may
fear the social platform to have political or economical incentives to meddle with the
specialists’ classification. The privacy-preserving model training solution guarantees the
model owner has no knowledge of the provided texts or the classification labels, and thus,
can not negatively impact the quality or fairness of the dataset and, consequently, of the
trained model.
The second scenario is related to privacy-preserving inference. It addresses the cases
in which a user wants to know the inferred classification for a text, but does not want the
model owner to know the content of the text submitted for classification. This scenario
applies, for example, to the messaging app users that want to have a feedback on a message
shared on a family group, but are not comfortable to have a fake news classification agency
being able to track and record the exact content shared on that private chat. Privacy-
preserving inference allows the users to receive a probable classification for the texts they
read, without exposing their private conversation or the people interacting with them.
For privacy-preserving model training and inference we set up three virtual machines
on the cloud. We tested our code using Amazon AWS EC2 and Google Cloud Compute
Engine instances, with similar results. Table 4.4 presents a comparison of the training cost
of the same neural networks in the clear-text and privacy-preserving settings.
On the first machine, named ‘alice’, we store the trained model. The training set,
embeddings and corresponding labels, are stored on the second machine, named ‘bob’. The
third participant, ‘charlie’, holds the validation set. At the end of the MPC computation,
the accuracy score on the validation set is known to all computing parties, but only alice
has knowledge of the trained model’s weights.
The CrypTen framework extends the PyTorch library API with tensor based im-
plementations of secret sharing protocols [30]. We decided to use this framework to
facilitate our experiments, since it extends a well known library, and allows us to use
peer-reviewed neural network architectures found in literature with very few changes to
the code. CrypTen also allows for private inference with an encrypted PyTorch model
trained on clear-text data.
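The additive secret sharing at the core of such frameworks can be illustrated with a minimal pure-Python sketch. It shows only the sharing principle, not CrypTen's actual fixed-point encoding or its communication layer:

```python
import random

Q = 2 ** 64  # ring size; shares are integers modulo Q

def share(secret, n_parties=3):
    """Split an integer into n additive shares that sum to it modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Secure addition: each party adds its local shares; no single party
# ever sees x or y in the clear.
x_shares = share(123)
y_shares = share(456)
sum_shares = [(a + b) % Q for a, b in zip(x_shares, y_shares)]
result = reconstruct(sum_shares)  # 579
```

Addition is local to each party; multiplication requires extra machinery such as Beaver triples [36], which frameworks like CrypTen provide.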
Nevertheless, as stated above, we used only simple feed-forward and convolutional
neural network architectures. This choice is due to CrypTen’s limited implementation of
PyTorch modules. It does not implement modules such as RNN or LSTM, which have been
proven in the NLP literature to provide better results than simpler network architectures [?].
The results in Table 4.4 show that training times are, on average, one order of magni-
tude higher in the privacy-preserving setting. That is a reasonable cost, considering the
advantage of preserving both the privacy of participants’ input texts and of the service
provider’s trained model.
Table 4.5: Best accuracy on Privacy-Preserving setting

Dataset     Embedding      Model  Accuracy  F1-Score  ROC-AUC  Runtime (s)
liar        DistilBERT     FNN    58.73     48.73     57.24    9.06
liar        Sentence-BERT  CNN    61.54     51.71     59.99    1414.53
sbnc        DistilBERT     FNN    67.82     74.10     65.61    1.66
sbnc        Sentence-BERT  FNN    72.52     77.11     71.44    1.60
factck.br   DistilBERT     FNN    78.32     87.84     50.00    1.62
factck.br   Sentence-BERT  CNN    78.32     87.84     60.00    175.28
fake.br     DistilBERT     FNN    80.83     80.72     80.83    5.28
fake.br     Sentence-BERT  CNN    81.04     79.11     81.04    793.78
5 Results & conclusion
5.1 Experiments
Our experiments covered the two basic scenarios of application for PPML techniques
described in Section 4.2.2: privacy-preserving model training and privacy-preserving
inference. The experiments used the following datasets:
• Liar Dataset (liar): curated by the UC Santa Barbara NLP Group, contains 12791
claims by North-American politicians and celebrities, classified as ‘true’, ‘mostly-
true’, ‘half-true’, ‘barely-true’, ‘false’ and ‘pants-on-fire’ [90];
• Source Based Fake News Classification (sbnc): 2020 full-length news articles,
manually labeled as real or fake [91];
• Fake.br: 7200 full-length news articles, with text and metadata, manually labeled
as real or fake news [55];
of preserving both the privacy of participants’ input texts and of the service provider’s
trained model. The accuracy was measured over the validation set.
5.2 Conclusion
We have presented relevant fake news detection approaches and pointed out a few ad-
vantages of NLP applications in a privacy-preserving oriented solution. We have also
discussed the use of different NLP techniques in text classification, and how large lan-
guage models can be used as a preprocessing step to generate embeddings that convey
semantic information from the encoded text. Then, we showed how those embeddings are
used for training and querying fake news detection inference models.
Our experiments also demonstrate how a neural network can be trained to detect fake
news using Secure Multi-party Computation protocols and how those MPC protocols
allow users to perform news classification in a privacy-preserving way.
The relevant finding is that the performance of the privacy-preserving fake news clas-
sification model, measured in terms of runtime, accuracy and other classification metrics,
is very close to that of a model trained and queried in the clear-text setting. This indicates
that the introduction of MPC protocols does not reduce the predictive power or usability
of fake news detection models.
Acknowledgments
This work has been funded in part by the Graduate Deanship of Universidade de Brasília,
under the “EDITAL DPG Nº 0004/2021” grants program.
Bibliography
[1] Dale, Robert: GPT-3: What’s it good for? Natural Language Engineering,
27(1):113–118, 2021. 1
[2] BRASIL: Lei nº 13.709, de 14 de agosto de 2018., 2018. http://www.planalto.gov.br/ccivil_03/_Ato2015-2018/2018/Lei/L13709.htm. 1
[3] European Commission: Regulation EU n. 2016/679., 2016. https://ec.europa.eu/
info/law/law-topic/data-protection_en. 1
[4] Al-Rubaie, Mohammad and J. Morris Chang: Privacy-Preserving Machine Learning:
Threats and Solutions. IEEE Security Privacy, 17(2):49–58, 2019. 2, 8
[5] Graepel, T., K Lauter, and M Naehrig: ML Confidential: Machine Learning on
Encrypted Data. Cryptology ePrint Archive, Report 2012/323, 2012. https://
eprint.iacr.org/2012/323. 2, 8
[6] Canetti, R.: Universally Composable Security: A New Paradigm for Cryptographic
Protocols. In Proceedings of the 42Nd IEEE Symposium on Foundations of Computer
Science, FOCS ’01, pages 136–, Washington, DC, USA, 2001. IEEE Computer Soci-
ety, ISBN 0-7695-1390-5. http://dl.acm.org/citation.cfm?id=874063.875553.
2, 8
[7] De Cock, Martine, Rafael Dowsley, Caleb Horst, Raj Katti, Anderson Nascimento,
Wing Sea Poon, and Stacey Truex: Efficient and Private Scoring of Decision Trees,
Support Vector Machines and Logistic Regression Models based on Pre-Computation.
IEEE Transactions on Dependable and Secure Computing, PP(99), 2017. 2, 8, 26,
27
[8] Rivest, R., L. Adleman, and M. Dertouzos: On data banks and privacy homo-
morphisms. Foundations of Secure Computation, pages 169–177, 1978. 2, 8
[9] Gentry, C.: A fully homomorphic encryption scheme. PhD thesis, Stanford Univer-
sity, 2009. crypto.stanford.edu/craig. 2, 8
[10] Lopez-Alt, A., E. Tromer, and V. Vaikuntanathan: On-the-Fly Multiparty Computa-
tion on the Cloud via Multikey Fully Homomorphic Encryption. Cryptology ePrint
Archive, Report 2013/094, 2013. 2, 8
[11] Souza, Stefano M P C and Ricardo S Puttini: Client-side encryption for privacy-
sensitive applications on the cloud. Procedia Computer Science, 97:126–130, 2016.
2, 12
[12] Damgård, Ivan and Mats Jurik: A Generalisation, a Simplification and Some Ap-
plications of Paillier’s Probabilistic Public-Key System. In Proceedings of the 4th
International Workshop on Practice and Theory in Public Key Cryptography: Public
Key Cryptography, PKC ’01, pages 119–136, London, UK, UK, 2001. Springer-Verlag,
ISBN 3-540-41658-7. http://dl.acm.org/citation.cfm?id=648118.746742. 2, 8
[13] Nikolaenko, V., U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft:
Privacy-Preserving Ridge Regression on Hundreds of Millions of Records. In 2013
IEEE Symposium on Security and Privacy. IEEE, 2013. 2, 8
[14] Bos, J. W., K. Lauter, and M. Naehrig: Private Predictive Analysis on Encrypted
Medical Data. Cryptology ePrint Archive, Report 2014/336, 2014. 2, 9
[17] Souza, Stefano M. P. C.: Safe-Record: segurança e privacidade para registros eletrô-
nicos em saúde na nuvem. Master’s thesis, PPGEE/FT - Universidade de Brasília,
2016. 2, 3, 9
[20] Trauth, E. M.: Achieving the Research Goal with Qualitative Methods: Lessons Lear-
ned along the Way. In Proceedings of the IFIP TC8 WG 8.2 International Conference
on Information Systems and Qualitative Research, page 225–245, GBR, 1997. Chap-
man & Hall, Ltd., ISBN 0412823608. 5
[21] Deb, Dipankar, Rajeeb Dey, and Valentina E. Balas: [Intelligent Systems Refe-
rence Library - Vol. 153] Engineering Research Methodology: A Practical Insight
for Researchers, volume 10.1007/978-981-13-2947-0, chapter 1, pages 1–7. Sprin-
ger, 2019, ISBN 978-981-13-2946-3,978-981-13-2947-0. http://gen.lib.rus.ec/
scimag/index.php?s=10.1007/978-981-13-2947-0. 5
[22] Kaplan, Bonnie and Dennis Duchon: Combining Qualitative and Quantitative
Methods in Information Systems Research: A Case Study. MIS Q., 12(4):571–586,
December 1988, ISSN 0276-7783. http://dx.doi.org/10.2307/249133. 5
[23] Yin, R. K.: Case Study Research: Design and Methods. SAGE, Beverly Hills, 1984.
5
[24] Souza, S. M. P. C., T. B. Rezende, J. Nascimento, L. G. Chaves, D. H. P. Soto,
and S. Salavati: Tuning machine learning models to detect bots on Twitter. In 2020
Workshop on Communication Networks and Power Systems (WCNPS), pages 1–6,
2020. 6, 25, 26
[25] Souza, Stefano M. P. C. and Daniel G. Silva: Monte Carlo execution time estimation
for Privacy-preserving Distributed Function Evaluation protocols, 2021. 6
[26] Ben-David, Assaf, Noam Nisan, and Benny Pinkas: FairplayMP: a system for secure
multi-party computation. In Ning, Peng, Paul F. Syverson, and Somesh Jha (editors):
Proceedings of the 2008 ACM Conference on Computer and Communications Secu-
rity, CCS 2008, Alexandria, Virginia, USA, October 27-31, 2008, pages 257–266.
ACM, 2008. 9
[27] Bogdanov, Dan, Sven Laur, and Jan Willemson: Sharemind: A Framework for Fast
Privacy-Preserving Computations. In Proc. of the 13th European Symposium on
Research in Computer Security, pages 192–206, 2008. 9
[28] Ryffel, Theo, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Daniel
Rueckert, and Jonathan Passerat-Palmbach: A generic framework for privacy pre-
serving deep learning. CoRR, abs/1811.04017, 2018. 9
[29] Sadegh Riazi, M., C. Weinert, O. Tkachenko, E. M. Songhori, T. Schneider, and
F. Koushanfar: Chameleon: A Hybrid Secure Computation Framework for Machine
Learning Applications. ArXiv e-prints, 2018. 9, 26
[30] Knott, B., S. Venkataraman, A.Y. Hannun, S. Sengupta, M. Ibrahim, and L.J.P.
van der Maaten: CrypTen: Secure Multi-Party Computation Meets Machine Lear-
ning. In Proceedings of the NeurIPS Workshop on Privacy-Preserving Machine Le-
arning, 2020. 9, 14, 31, 34
[31] Zhang, Yihua, Aaron Steele, and Marina Blanton: PICCO: A General-purpose
Compiler for Private Distributed Computation. In Proceedings of the 2013
ACM SIGSAC Conference on Computer Communications Security. ACM, 2013,
ISBN 978-1-4503-2477-9. 9
[32] Songhori, E. M., S. U. Hussain, A. Sadeghi, T. Schneider, and F. Koushanfar: Tiny-
Garble: Highly Compressed and Scalable Sequential Garbled Circuits. In 2015 IEEE
Symposium on Security and Privacy, pages 411–428, May 2015. 9
[33] Demmler, Daniel, Thomas Schneider, and Michael Zohner: ABY - A Framework
for Efficient Mixed-Protocol Secure Two-Party Computation. In 22nd Network and
Distributed System Security Symposium, 2015. 9, 15, 26
[34] Mohassel, P. and Y. Zhang: SecureML: A System for Scalable Privacy-Preserving
Machine Learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages
19–38, May 2017. 9, 26
[35] Yao, Andrew C.: Protocols for Secure Computations. In Proceedings of the 23rd
Annual Symposium on Foundations of Computer Science, SFCS ’82. IEEE Computer
Society, 1982. 9
[36] Beaver, Donald: One-time tables for two-party computation. In Computing and Com-
binatorics, pages 361–370. Springer, 1998. 11
[37] Paillier, Pascal: Public-key cryptosystems based on composite degree residuosity clas-
ses. In IN ADVANCES IN CRYPTOLOGY — EUROCRYPT 1999, pages 223–238.
Springer-Verlag, 1999. 13
[38] Goldwasser, Shafi and Silvio Micali: Probabilistic encryption. Journal of Computer
and System Sciences, 28(2):270–299, 1984, ISSN 0022-0000. 13
[39] Naor, Moni and Kobbi Nissim: Communication Complexity and Secure Function Eva-
luation. Electronic Colloquium on Computational Complexity (ECCC), 8, 2001. 14
[40] Agarwal, Anisha, Rafael Dowsley, Nicholas D. McKinney, Dongrui Wu, Chin Teng
Lin, Martine De Cock, and Anderson Nascimento: Privacy-Preserving Linear Regres-
sion for Brain-Computer Interface Applications. In Proc. of 2018 IEEE International
Conference on Big Data, 2018. 15
[41] Silva, D. G, M. Jino, and B. de Abreu: A Simple Approach for Estimation of Execution
Effort of Functional Test Cases. In IEEE Sixth International Conference on Software
Testing, Verification and Validation. IEEE Computer Society, Apr 2009. 15
[42] Iqbal, N., M. A. Siddique, and J. Henkel: DAGS: Distribution agnostic sequential
Monte Carlo scheme for task execution time estimation. In 2010 Design, Automation
Test in Europe Conference Exhibition (DATE 2010), pages 1645–1648, 2010. 15
[43] Cunha, Washington, Vítor Mangaravite, Christian Gomes, Sérgio Canuto, Elaine Re-
sende, Cecilia Nascimento, Felipe Viegas, Celso França, Wellington Santos Martins,
Jussara M. Almeida, Thierson Rosa, Leonardo Rocha, and Marcos André Gonçalves:
On the cost-effectiveness of neural and non-neural approaches and representations
for text classification: A comprehensive comparative study. Information Processing
& Management, 58(3):102481, 2021, ISSN 0306-4573. 18
[44] Habert, Benoit, Gilles Adda, Martine Adda-Decker, P Boula de Marëuil, Serge Fer-
rari, Olivier Ferret, Gabriel Illouz, and Patrick Paroubek: Towards tokenization eva-
luation. In Proceedings of LREC, volume 98, pages 427–431, 1998. 19
[45] Kaur, Jashanjot and P Kaur Buttar: A systematic review on stopword removal algo-
rithms. Int. J. Futur. Revolut. Comput. Sci. Commun. Eng, 4(4), 2018. 19
[46] Gerlach, Martin, Hanyu Shi, and Luís A Nunes Amaral: A universal information
theoretic approach to the identification of stopwords. Nature Machine Intelligence,
1(12):606–612, 2019. 19
[47] Singh, Jasmeet and Vishal Gupta: Text stemming: Approaches, applications, and
challenges. ACM Computing Surveys (CSUR), 49(3):1–46, 2016. 20
[48] Dereza, Oksana: Lemmatization for Ancient Languages: Rules or Neural Networks?
In Conference on Artificial Intelligence and Natural Language, pages 35–47. Springer,
2018. 20
[49] Jongejan, Bart and Hercules Dalianis: Automatic training of lemmatization rules that
handle morphological changes in pre-, in-and suffixes alike. In Proceedings of the Joint
Conference of the 47th Annual Meeting of the ACL and the 4th International Joint
Conference on Natural Language Processing of the AFNLP, pages 145–153, 2009. 20
[50] Plisson, Joël, Nada Lavrac, Dunja Mladenic, et al.: A rule based approach to word
lemmatization. In Proceedings of IS, volume 3, pages 83–86, 2004. 20
[51] Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell: A simple joint model for im-
proved contextual neural lemmatization. arXiv preprint arXiv:1904.02306, 2019. 20
[52] Kondratyuk, Daniel, Tomáš Gavenčiak, Milan Straka, and Jan Hajič: LemmaTag:
Jointly tagging and lemmatizing for morphologically-rich languages with BRNNs. ar-
Xiv preprint arXiv:1808.03703, 2018. 20, 21
[53] Schmid, Helmut and Florian Laws: Estimation of conditional probabilities with de-
cision trees and an application to fine-grained POS tagging. In Proceedings of the
22nd International Conference on Computational Linguistics (Coling 2008), pages
777–784, 2008. 21
[54] Potthast, Martin, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno
Stein: A Stylometric Inquiry into Hyperpartisan and Fake News. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 231–240, Melbourne, Australia, July 2018. Association for
Computational Linguistics. https://www.aclweb.org/anthology/P18-1022. 21,
27
[55] Monteiro, Rafael A., Roney L. S. Santos, Thiago A. S. Pardo, Tiago A. de Almeida,
Evandro E. S. Ruiz, and Oto A. Vale: Contributions to the Study of Fake News in
Portuguese: New Corpus and Automatic Detection Results. In Computational Proces-
sing of the Portuguese Language, pages 324–334. Springer International Publishing,
2018, ISBN 978-3-319-99722-3. 21, 26, 34
[56] Davis, R. and C. Proctor: Fake News, Real Consequences: Recruiting Neural
Networks for the Fight Against Fake News. Technical report, Stanford University,
2017. 21
[57] Barde, B. V. and A. M. Bainwad: An overview of topic modeling methods and to-
ols. In 2017 International Conference on Intelligent Computing and Control Systems
(ICICCS), pages 745–750, 2017. 21
[58] El-Din, Doaa Mohey: Enhancement bag-of-words model for solving the challenges
of sentiment analysis. International Journal of Advanced Computer Science and
Applications, 7(1), 2016. 22
[59] Li, Bofang, Zhe Zhao, Tao Liu, Puwei Wang, and Xiaoyong Du: Weighted neural bag-
of-n-grams model: New baselines for text classification. In Proceedings of COLING
2016, the 26th International Conference on Computational Linguistics: Technical
Papers, pages 1591–1600, 2016. 22
[60] Yun-tao, Zhang, Gong Ling, and Wang Yong-cheng: An improved TF-IDF approach
for text classification. Journal of Zhejiang University-Science A, 6(1):49–55, 2005. 22
[61] Ahmed, Hadeer, Issa Traore, and Sherif Saad: Detection of online fake news using
n-gram analysis and machine learning techniques. In International conference on
intelligent, secure, and dependable systems in distributed and cloud environments,
pages 127–138. Springer, 2017. 22
[62] Dyson, Lauren and Alden Golab: Fake News Detection Exploring the Application of
NLP Methods to Machine Identification of Misleading News Sources. CAPP 30255
Adv. Mach. Learn. Public Policy, 2017. 22, 27
[63] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean: Efficient Estimation of
Word Representations in Vector Space, 2013. 22
[64] Yang, Kai Chou, Timothy Niven, and Hung Yu Kao: Fake News Detection as Natural
Language Inference. arXiv preprint arXiv:1907.07347, 2019. 23, 27
[66] Reimers, Nils and Iryna Gurevych: Sentence-BERT: Sentence Embeddings using Sia-
mese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing. Association for Computational Linguistics, Novem-
ber 2019. https://arxiv.org/abs/1908.10084. 24
[67] Gelfert, Axel: Fake News: A Definition. Informal Logic, 37(0):83–117, 2017. 25
[68] D’Arienzo, Maria Chiara, Valentina Boursier, and Mark D. Griffiths: Addiction to
Social Media and Attachment Styles: A Systematic Literature Review. International
Journal of Mental Health and Addiction, 17:1094 – 1118, 2019. 25
[69] Ferrara, Emilio: Disinformation and Social Bot Operations in the Run Up to the 2017
French Presidential Election. First Monday, 22, June 2017. 26
[70] Lee, Sangwon and Michael Xenos: Social distraction? Social media use and political
knowledge in two U.S. Presidential elections. Computers in Human Behavior, 90:18
– 25, 2019, ISSN 0747-5632. 26
[71] Nascimento, Josué: Only one word 99.2%, Aug 2020. https://www.kaggle.com/
josutk/only-one-word-99-2. 26, 27
[72] Gangireddy, Siva Charan Reddy, Deepak P, Cheng Long, and Tanmoy Chakraborty:
Unsupervised Fake News Detection: A Graph-Based Approach. In Proceedings of the
31st ACM Conference on Hypertext and Social Media, HT ’20, page 75–83, New
York, NY, USA, 2020. Association for Computing Machinery, ISBN 9781450370981.
https://doi.org/10.1145/3372923.3404783. 26
[73] Shu, Kai, Xinyi Zhou, Suhang Wang, Reza Zafarani, and Huan Liu: The Role of
User Profiles for Fake News Detection. In ASONAM ’19: International Conference
on Advances in Social Networks Analysis and Mining, page 436–439, New York, NY,
USA, 2019. Association for Computing Machinery, ISBN 9781450368681. 26
[74] Pinnaparaju, Nikhil, Vijaysaradhi Indurthi, and Vasudeva Varma: Identifying Fake
News Spreaders in Social Media. In CLEF, 2020. 26
[75] Nadeem, Moin, Wei Fang, Brian Xu, Mitra Mohtarami, and James Glass: FAKTA:
An Automatic End-to-End Fact Checking System, 2019. 27
[76] Moreno, João and Graça Bressan: FACTCK.BR: A New Dataset to Study
Fake News. In Proceedings of the 25th Brazilian Symposium on Multimedia and
the Web, WebMedia ’19, page 525–527, New York, NY, USA, 2019. Associa-
tion for Computing Machinery, ISBN 9781450367639. https://doi.org/10.1145/
3323503.3361698. 27, 34
[77] Gupta, Ankur, Yash Varun, Prarthana Das, Nithya Muttineni, Parth Srivastava,
Hamim Zafar, Tanmoy Chakraborty, and Swaprava Nath: TruthBot: An Automated
Conversational Tool for Intent Learning, Curated Information Presenting, and Fake
News Alerting. CoRR, abs/2102.00509, 2021. https://arxiv.org/abs/2102.00509.
27
[78] Lee, Sungjin: Nudging Neural Conversational Model with Domain Knowledge. CoRR,
abs/1811.06630, 2018. http://arxiv.org/abs/1811.06630. 27
[79] Graves, Lucas: Anatomy of a Fact Check: Objective Practice and the Contested Epis-
temology of Fact Checking. Communication, Culture and Critique, 10(3):518–537,
October 2017, ISSN 1753-9129. https://doi.org/10.1111/cccr.12163. 27
[80] Marietta, Morgan, David C Barker, and Todd Bowser: Fact-checking polarized poli-
tics: Does the fact-check industry provide consistent guidance on disputed realities?
In The Forum, volume 13, pages 577–596. De Gruyter, 2015. 27
[81] Horne, Benjamin D. and Sibel Adali: This Just In: Fake News Packs a Lot in Title,
Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real
News. CoRR, abs/1703.09398, 2017. http://arxiv.org/abs/1703.09398. 27
[82] Young, T., D. Hazarika, S. Poria, and E. Cambria: Recent Trends in Deep Learning
Based Natural Language Processing [Review Article]. IEEE Computational Intelli-
gence Magazine, 13(3):55–75, 2018. 27
[83] Devlin, Jacob, Ming Wei Chang, Kenton Lee, and Kristina Toutanova: Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018. 27
[84] Baruah, Arup, K Das, F Barbhuiya, and Kuntal Dey: Automatic Detection of Fake
News Spreaders Using BERT. In CLEF, 2020. 27
[85] Zhang, T., D. Wang, H. Chen, Z. Zeng, W. Guo, C. Miao, and L. Cui: BDANN:
BERT-Based Domain Adaptation Neural Network for Multi-Modal Fake News De-
tection. In 2020 International Joint Conference on Neural Networks (IJCNN), pages
1–8, 2020. 27
[86] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amo-
dei: Language Models are Few-Shot Learners. In Larochelle, H., M. Ranzato, R. Had-
sell, M. F. Balcan, and H. Lin (editors): Advances in Neural Information Processing
Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. 27
[87] Tan, Reuben, Bryan A. Plummer, and Kate Saenko: Detecting Cross-Modal Incon-
sistency to Defend Against Neural Fake News, 2020. 27
[88] Mosallanezhad, Ahmadreza, Kai Shu, and Huan Liu: Topic-Preserving Synthetic
News Generation: An Adversarial Deep Reinforcement Learning Approach, 2020.
27
[89] Zellers, Rowan, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Fran-
ziska Roesner, and Yejin Choi: Defending Against Neural Fake News. In Wallach,
H., H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (editors):
Advances in Neural Information Processing Systems, volume 32, pages 9054–9065.
Curran Associates, Inc., 2019. https://proceedings.neurips.cc/paper/2019/
file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf. 27
[90] Wang, William Yang: "liar, liar pants on fire": A new benchmark dataset for fake
news detection. arXiv preprint arXiv:1705.00648, 2017. 34
[91] Bhatia, Ruchi: Source based Fake News Classification, Aug 2020. https://www.
kaggle.com/ruchi798/source-based-news-classification. 34
I MPC Protocols
Protocol: πADD
Input: Secret shares Jx1Kq , . . . , JxnKq
Output: JzKq = Jx1Kq + · · · + JxnKq
Execution:
1. Each party Pi ∈ P locally computes its share of the output as zi = x1,i + · · · + xn,i , where xj,i denotes Pi's share of xj
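Because addition of additive shares is purely local, πADD needs no interaction at all: each party adds its own shares of every input, and the resulting shares reconstruct to the sum. A minimal sketch (the modulus, party count and function names are illustrative choices, not fixed by the protocol):

```python
import random

q = 2**61 - 1  # illustrative prime modulus; any modulus works for addition

def share(x, n):
    """Split x into n additive shares modulo q."""
    parts = [random.randrange(q) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % q)
    return parts

def reconstruct(parts):
    return sum(parts) % q

def pi_add(shared_inputs):
    """pi_ADD: party i locally adds its shares of every input x_j."""
    n_parties = len(shared_inputs[0])
    return [sum(sh[i] for sh in shared_inputs) % q for i in range(n_parties)]

# three values, each shared among three parties
z = pi_add([share(x, 3) for x in (10, 20, 30)])
assert reconstruct(z) == 60
```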
Protocol: πMUL
Setup:
1. The Trusted Initializer (TI) draws u, v uniformly from Zq , sets w = uv, and distributes the shares JuKq , JvKq and JwKq to the protocol parties
2. The TI draws ι uniformly from {1, . . . , n} and sends the asymmetric bit 1 to party Pι and the bit 0 to the parties Pi̸=ι
Input: JxKq and JyKq
Output: JzKq = JxyKq
Execution:
1. Each party Pi computes di ← xi − ui and ei ← yi − vi
2. Parties broadcast di , ei
3. Each party computes d ← d1 + · · · + dn and e ← e1 + · · · + en
4. Each party Pi computes zi ← wi + d·vi + e·ui ; the party Pι that holds the asymmetric bit additionally adds d·e to its share
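The correctness of multiplication with a predistributed triple rests on the identity xy = w + dv + eu + de with d = x − u and e = y − v. A two-party sketch with the TI simulated locally (modulus and names are illustrative assumptions):

```python
import random

q = 2**61 - 1  # illustrative prime modulus

def share2(x):
    a = random.randrange(q)
    return [a, (x - a) % q]

def reconstruct(sh):
    return sum(sh) % q

# Trusted Initializer: multiplication triple w = u*v, secret-shared between the parties
u, v = random.randrange(q), random.randrange(q)
u_sh, v_sh, w_sh = share2(u), share2(v), share2((u * v) % q)

def pi_mul(x_sh, y_sh):
    # each party broadcasts d_i = x_i - u_i and e_i = y_i - v_i; here we open d, e directly
    d = (x_sh[0] + x_sh[1] - u) % q   # d = x - u
    e = (y_sh[0] + y_sh[1] - v) % q   # e = y - v
    # z_i = w_i + d*v_i + e*u_i; the asymmetric-bit holder (say the first party) adds d*e
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(2)]
    z[0] = (z[0] + d * e) % q
    return z

assert reconstruct(pi_mul(share2(7), share2(9))) == 63
```

Since d and e are blinded by the uniformly random u and v, opening them reveals nothing about x and y.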
Protocol: πIP
Setup: The setup procedure for πMUL
Input: J⃗xKq , J⃗y Kq , and l (length of ⃗x and ⃗y )
Output: JzKq = J⃗x · ⃗y Kq
Execution:
1. Run l parallel instances of πMUL in order to compute Jzk Kq = Jxk Kq · Jyk Kq for k ∈ {1, . . . , l}
2. Locally compute JzKq ← Jz1Kq + · · · + JzlKq
3. Output JzKq
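The inner product thus costs l multiplications plus a purely local sum of the product shares. A self-contained two-party sketch, with each πMUL instance consuming a fresh triple simulated inline (all names and the modulus are illustrative):

```python
import random

q = 2**61 - 1  # illustrative prime modulus

def share2(x):
    a = random.randrange(q)
    return [a, (x - a) % q]

def reconstruct(sh):
    return sum(sh) % q

def beaver_mul(x_sh, y_sh):
    """One pi_MUL instance; the TI's fresh multiplication triple is simulated inline."""
    u, v = random.randrange(q), random.randrange(q)
    u_sh, v_sh, w_sh = share2(u), share2(v), share2((u * v) % q)
    d = (reconstruct(x_sh) - u) % q   # opened d = x - u
    e = (reconstruct(y_sh) - v) % q   # opened e = y - v
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(2)]
    z[0] = (z[0] + d * e) % q         # asymmetric-bit party adds d*e
    return z

def pi_ip(xs_sh, ys_sh):
    """pi_IP: l parallel multiplications, then a purely local sum of product shares."""
    prods = [beaver_mul(x, y) for x, y in zip(xs_sh, ys_sh)]
    return [sum(p[i] for p in prods) % q for i in range(2)]

z = pi_ip([share2(x) for x in (1, 2, 3)], [share2(y) for y in (4, 5, 6)])
assert reconstruct(z) == 32   # 1*4 + 2*5 + 3*6
```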
Protocol: πOIS
Setup: Let l be the bitlength of the inputs to be shared and n the dimension of the input
vector. The trusted initializer pre-distributes all the correlated randomness necessary for the
execution of πMUL over Z2l
Input: Alice inputs the vector ⃗x = (x1 , . . . , xn ), and Bob has input k, the index of the desired
output value
Output: xk
Execution:
1. Bob defines the selection vector ⃗y = (y1 , . . . , yn ), with yk = 1 and yj = 0 for all j ̸= k
2. For j ∈ {1, . . . , n} and i ∈ {1, . . . , l}, let xj,i denote the i-th bit of xj
3. Define Jyj K2 as the pair of shares (0, yj ) and Jxj,i K2 as (xj,i , 0)
4. For j ∈ {1, . . . , n} and i ∈ {1, . . . , l}, compute Jzj,i K2 ← Jxj,i K2 · Jyj K2 using πMUL
5. For i ∈ {1, . . . , l}, locally compute Jwi K2 ← Jz1,i K2 + · · · + Jzn,i K2 and open wi to Bob, who recovers xk from the bits (w1 , . . . , wl )
Protocol: πEq
Setup: The setup procedure for πMUL
Input: JxKq and JyKq
Output: J0Kq if x = y. Any non-zero number otherwise.
Execution:
1. Locally compute JdKq ← JxKq − JyKq
2. Compute JzKq ← JdKq · JrKq using πMUL, where JrKq is a sharing of a uniformly random r predistributed by the TI
3. Output JzKq
Protocol: F2toq
Input: JxK2
Output: JxKq
Execution:
1. Let ⃗xa denote Alice's share of JxK2 and ⃗xb Bob's share, so that x = ⃗xa ⊕ ⃗xb
2. Alice and Bob perform a secure bitwise XOR using πOR|XOR, with Alice's inputs being (⃗xa , 0), Bob's inputs being (0, ⃗xb ), and modulus q > 2
3. Output the resulting sharing JxKq
Protocol: πOR|XOR
Setup: The setup procedure for πMUL over Z2
Input: JxK2 , JyK2 and k, where k = 1 to compute OR and k = 2 to compute XOR between
the numbers.
Output: Jx ∨ yK2 if k = 1, Jx ⊻ yK2 if k = 2.
Execution:
1. Compute JwK2 ← JxK2 · JyK2 using πMUL
2. Locally compute JzK2 ← JxK2 + JyK2 − kJwK2
3. Output JzK2
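The role of the parameter k suggests the single-multiplication instantiation z = x + y − k·xy: the identities x OR y = x + y − xy and x XOR y = x + y − 2xy hold over the integers, hence modulo any q. A plaintext check over all bit pairs (a sketch of the arithmetic only, not the secure protocol):

```python
# x OR y  = x + y - 1*x*y   and   x XOR y = x + y - 2*x*y
def gate(x, y, k):
    return x + y - k * x * y

for x in (0, 1):
    for y in (0, 1):
        assert gate(x, y, 1) == (x | y)
        assert gate(x, y, 2) == (x ^ y)
```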
Protocol: πTrunc
Setup: Let λ be a statistical security parameter. The protocol is parametrized by the size
q > 2^(k+f+λ+1) of the field and the dimensions ℓ1 , ℓ2 of the input matrix. The trusted initializer
picks a matrix R′ ∈ F_q^(ℓ1×ℓ2) with elements uniformly drawn from {0, . . . , 2^f − 1} and a matrix
R′′ ∈ F_q^(ℓ1×ℓ2) with elements uniformly drawn from {0, . . . , 2^(k+λ) − 1}. Then, the TI computes
R = 2^f R′′ + R′ and creates secret shares JRKq and JR′Kq to distribute to the parties.
Input: The parties' input is JWKq such that every element w of W satisfies w ∈
{0, 1, . . . , 2^(k+f−1) − 1} ∪ {q − 2^(k+f−1) + 1, . . . , q − 1}.
Execution:
1. The parties locally compute JZKq ← 2^(k+f−1) + JWKq + JRKq (elementwise) and open Z
2. Let Z′ = Z mod 2^f (elementwise); the parties locally compute JSKq ← JWKq + JR′Kq − Z′
3. Let i = ((q + 1)/2)^f ; locally compute JTKq ← iJSKq and output JTKq
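The constant in the last step works because (q + 1)/2 is the multiplicative inverse of 2 modulo an odd prime q, so ((q + 1)/2)^f inverts 2^f. A quick check with an illustrative prime (the concrete q, f and t below are arbitrary choices for the demonstration):

```python
q = 2**31 - 1   # illustrative Mersenne prime; the protocol only requires q > 2^(k+f+lambda+1)
f = 10
i = pow((q + 1) // 2, f, q)             # i = ((q+1)/2)^f mod q
assert (i * pow(2, f, q)) % q == 1      # i is the inverse of 2^f modulo q
t = 123456
assert (i * ((t * 2**f) % q)) % q == t  # multiplying by i divides an exact multiple of 2^f
```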
Protocol: πBD
Setup: Let l be the bitlength of the value x to be bit-decomposed. For each distributed
multiplication, the TI draws U, V uniformly from Z2 , sets W := U V , and predistributes the
shares JU K2 , JV K2 and JW K2
Input: JxKq , for q ≤ 2^l
Output: JxK2
Execution:
1. Let a denote Alice's share of x, which corresponds to the bit string {a1 , . . . , al }. Sim-
ilarly, let b denote Bob's share of x, which corresponds to the bit string {b1 , . . . , bl }.
Define the secret sharing Jyi K2 as the pair of shares (ai , bi ) for yi = ai + bi mod 2, Jai K2
as (ai , 0) and Jbi K2 as (0, bi )
2. Compute Jc1 K2 ← Ja1 K2 · Jb1 K2 using distributed multiplication, and locally set Jx1 K2 ←
Jy1 K2
3. For i = 2, . . . , l: locally set Jxi K2 ← Jyi K2 + Jci−1 K2 and, for i < l, compute the carry
Jci K2 ← Jai K2 · Jbi K2 + Jci−1 K2 · (Jai K2 + Jbi K2 ) using distributed multiplication
Protocol: πDC
Setup: For each required multiplication, the trusted initializer selects blinding values
U, V, W uniformly from Z2 with W := U V and distributes their shares to the parties
Input: Each party gets the shares Jxi K2 and Jyi K2 for each bit of the l-bit integers x and y
Output: J1K2 if x ≥ y, and J0K2 otherwise
Execution:
1. For i ∈ {1, . . . , l}, compute in parallel Jdi K2 ← Jyi K2 · (J1K2 − Jxi K2 ) using the multipli-
cation protocol
2. For i ∈ {1, . . . , l}, locally compute the equality bits Jei K2 ← Jxi K2 + Jyi K2 + J1K2
3. For i ∈ {1, . . . , l}, compute Jci K2 ← Jdi K2 · Jei+1 K2 · · · Jel K2 using the multiplication
protocol
4. Compute JwK2 ← J1K2 + Jc1 K2 + · · · + Jcl K2 and output JwK2
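Taking ei = xi + yi + 1 mod 2 as the bitwise-equality indicator (an assumption consistent with the surrounding formulas), the comparison logic can be verified in the clear: ci = 1 exactly when the most significant differing bit favors y, so w = 1 + Σci mod 2 equals 1 iff x ≥ y. A plaintext sketch (bit index 1 is the LSB, index l the MSB):

```python
def geq(x, y, l):
    """Plaintext evaluation of the pi_DC logic on l-bit integers."""
    xb = [(x >> i) & 1 for i in range(l)]
    yb = [(y >> i) & 1 for i in range(l)]
    d = [yb[i] * (1 - xb[i]) for i in range(l)]          # d_i = y_i * (1 - x_i)
    e = [(xb[i] + yb[i] + 1) % 2 for i in range(l)]      # assumed equality bit
    c = [d[i] * int(all(e[j] for j in range(i + 1, l))) for i in range(l)]
    return (1 + sum(c)) % 2                              # 1 iff x >= y

for x in range(8):
    for y in range(8):
        assert geq(x, y, 3) == int(x >= y)
```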
Protocol: πargmax
Setup: Let l be the bitlength and k the number of values to be compared. For each required
multiplication, the trusted initializer selects blinding values U, V, W uniformly from Zq with
W := U V and distributes their shares to the parties
Input: Each party has as inputs the shares Jvj,i Kq for all j ∈ {1, . . . , k} and i ∈ {1, . . . , l}
Output: Value m computed by party P1
Execution:
1. For j ∈ {1, . . . , k} and n ∈ {1, . . . , k}, the parties compute in parallel the distributed
comparison protocol with inputs Jvj,i K2 and Jvn,i K2 (i ∈ {1, . . . , l}). Let Jwj,n K2
denote the obtained output
2. For j ∈ {1, . . . , k}, compute in parallel Jwj K2 = Jwj,1 K2 · · · Jwj,k K2 using the multipli-
cation protocol
3. For j ∈ {1, . . . , k}, the parties open Jwj K2 toward P1 , which outputs as m an index j
such that wj = 1