
Eighteenth International Conference on
Artificial Intelligence and Law

São Paulo, Brasil, June 21-25, 2021
Law School, University of São Paulo

Proceedings of the Conference

Sponsored by:
The International Association for Artificial Intelligence and Law
Jusbrasil
Albert Einstein Israeli Hospital
Lawgorithm
LegalCode
Pires e Gonçalves Advogados
Opice Blum Advogados
Urbano Vitalino Advogados

In cooperation with:
Association for the Advancement of Artificial Intelligence (AAAI)
ACM SIGAI
The Association for Computing Machinery
1601 Broadway, 10th Floor
New York, New York 10019, USA

ACM COPYRIGHT NOTICE. Copyright © 2021 by the Association for Computing
Machinery, Inc. Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are not made
or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by
others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, to republish, to post on servers, or to redistribute to lists, requires prior
specific permission and/or a fee. Request permissions from Publications Dept., ACM,
Inc., fax +1 (212) 869-0481, or permissions@acm.org.

For other copying of articles that carry a code at the bottom of the first or last page,
copying is permitted provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923,
+1-978-750-8400, +1-978-750-4470 (fax).

ACM ISBN: 978-1-4503-8526-8

Conference Organization
Program Chair
Adam Zachary Wyner (Swansea University, United Kingdom)

Conference Chair
Juliano Maranhão (University of São Paulo, Brazil)

Tutorials/Workshops & Proceedings Chair


Adam Zachary Wyner (Swansea University, United Kingdom)

Secretary / Treasurer
Michał Araszkiewicz (Jagiellonian University, Poland)

Local Organising Committee


Renata Wassermann (University of São Paulo, Brazil)
Samuel Barbosa (University of São Paulo, Brazil)
Alexandre Zavaglia (FGV-SP, Brazil)
Aline Macohin (Federal University of Paraná, Brazil)
Marcelo Lopes (University of São Paulo, Brazil)
Roberto Bornhausen (University of São Paulo, Brazil)

Doctoral Consortium & Mentoring Programme Chair


Michał Araszkiewicz (Jagiellonian University)

Industry Chair
Fabio Cozman (University of São Paulo, Brazil)

Program committee
Tommaso Agnoloni (CNR, Italy)
Thomas Ågotnes (University of Bergen, Norway)
Laura Alonso Alemany (Universidad Nacional de Córdoba, Argentina)
Francisco Andrade (Universidade do Minho, Portugal)
Michał Araszkiewicz (Uniwersytet Jagiellonski w Krakowie, Poland)
Kevin Ashley (University of Pittsburgh, United States)
Katie Atkinson (University of Liverpool, United Kingdom)
Trevor Bench-Capon (University of Liverpool, United Kingdom)
Floris Bex (Utrecht University, Netherlands)
Luther Branting (The MITRE Corporation, United States)

Scott Brewer (Harvard Law School, United States)
Pompeu Casanovas (La Trobe, Australia)
Jack G. Conrad (Thomson Reuters, United States)
Claudia d’Amato (University of Bari, Italy)
Luigi Di Caro (University of Turin, Italy)
Rossana Ducato (University of Aberdeen, United Kingdom)
Jenny Eriksson Lundström (Uppsala University, Sweden)
Enrico Francesconi (European Parliament, Luxembourg)
Fernando Galindo Ayuda (Universidad de Zaragoza, Spain)
Kripabandhu Ghosh (Indian Institute of Science Education and Research, India)
Saptarshi Ghosh (Indian Institute of Technology Kharagpur, India)
Randy Goebel (University of Alberta, Canada)
Thomas Gordon (Germany)
Guido Governatori (CSIRO, Australia)
Matthias Grabmair (Technical University of Munich, Germany)
Davide Grossi (University of Groningen and University of Amsterdam, Netherlands)
Maura R. Grossman (University of Waterloo, Canada)
Mustafa Hashmi (Data61, CSIRO Australia)
Bruce Hedin (H5, United States)
Hans Henseler (University of Applied Sciences Leiden, Netherlands)
Rinke Hoekstra (Elsevier, Netherlands)
Joris Hulstijn (Tilburg University, Netherlands)
John Joergensen (Rutgers Law School, United States)
Yoshinobu Kano (Shizuoka University, Japan)
Daniel Katz (Illinois Tech & Bucerius Law School, United States)
Jeroen Keppens (King’s College London, United Kingdom)
Marc Lauritsen (Capstone Practice Systems, United States)
Rūta Liepiņa (Maastricht University, Netherlands)
Arno R. Lodder (Vrije Universiteit Amsterdam, Netherlands)
Prasenjit Majumder (DAIICT, India)
Juliano Maranhão (University of São Paulo, Brazil)
L. Thorne McCarty (Rutgers University, United States)
Parth Mehta (Parmonic AI, India)
Raquel Mochales (Cerence, Belgium)
Ashutosh Modi (Indian Institute of Technology Kanpur, India)
Katsumi Nitta (Tokyo Institute of Technology, Japan)
Merel Noorman (Tilburg University, Netherlands)
Paulo Novais (Universidade do Minho, Portugal)
Gordon Pace (University of Malta, Malta)
Ugo Pagallo (University of Turin, Italy)
Arindam Pal (Data61, CSIRO, Australia)
Monica Palmirani (University of Bologna, Italy)
Girish Palshikar (Tata Consultancy Services Ltd., India)
Sachin Pawar (Tata Consultancy Services, India)
Wim Peters (University of Aberdeen, United Kingdom)
Henry Prakken (Utrecht University & University of Groningen, Netherlands)

Paulo Quaresma (University of Evora, Portugal)
Edwina Rissland (University of Massachusetts/Amherst, United States)
Livio Robaldo (Swansea University, United Kingdom)
Anna Ronkainen (University of Helsinki, Finland)
Antonino Rotolo (University of Bologna, Italy)
Giovanni Sartor (University of Bologna, Italy)
Ken Satoh (NII, Japan)
Burkhard Schafer (University of Edinburgh, United Kingdom)
Fernando Schapachnik (Universidad de Buenos Aires, Argentina)
Uri Schild (Bar Ilan University, Israel)
Frank Schilder (Thomson Reuters, United States)
Marijn Schraagen (Utrecht University, Netherlands)
Erich Schweighofer (University of Vienna, Austria)
Giovanni Sileno (University of Amsterdam, Netherlands)
Munindar Singh (North Carolina State University, United States)
Clara Patricia Smith (Universidad Nacional de La Plata (UNLP), Argentina)
Katsuhiko Toyama (Nagoya University, Japan)
Thomas Vacek (Thomson Reuters, United States)
Leon van der Torre (Luxembourg)
Marc van Opijnen (Ministry of the Interior and Kingdom Relations of the Netherlands, Netherlands)
Bart Verheij (University of Groningen, Netherlands)
Serena Villata (Universite Cote d’Azur, CNRS, France)
Vern R. Walker (Hofstra University, United States)
Radboud Winkels (University of Amsterdam, Netherlands)
Masaharu Yoshioka (Hokkaido University, Japan)
Haozhen Zhao (Ankura, United States)
Tomasz Zurek (Maria Curie-Sklodowska University, Poland)

Preface
I am pleased to share with you the proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL 2021). Since 1987, the International Association for Artificial Intelligence and Law (IAAIL) has biennially organised ICAIL to present and discuss research and applications as well as to stimulate interdisciplinary and international collaboration. The ICAIL series can lay claim to substantial influence on the recent growth in AI in legal services. This year's ICAIL upholds and extends the mission of the IAAIL. ICAIL 2021 runs the week of June 21-25. For the first time, the conference and workshops are presented 100% online and cost free, leading to over 1500 registrations in 65 countries!

The conference had 89 submissions; 17 were selected for publication as full papers (~19%), 17 as short papers (~19%), 8 as extended research abstracts (~9%), 2 as demonstration papers (~2%), and 3 as COLIEE papers (~3%). ICAIL strives to maximise the opportunities for researchers to present their work. In addition, ICAIL will hold a Doctoral Consortium, helping emerging researchers to engage with the ICAIL community. There will be 11 collocated workshops on focused topics.

Research in AI & Law is highly interdisciplinary. A range of AI theories and techniques may apply to diverse legal information, processes, or topics. As well, there are important considerations about how the Law applies to AI. The relation between AI and Law is, then, many-to-many. While machine learning techniques perform well, it may be crucial to explain results in legal contexts. Moreover, as AI continues to make inroads into legal services, other success criteria must be addressed, such as accountability, accessibility, portability, linking, consistency, and resource sharing. These are matters for further research.

The interdisciplinarity of AI & Law shows in our invited speakers. Prof. Stuart Russell, an internationally recognised researcher and AI educator, will speak on “Provably Beneficial AI”, which will be discussed by a panel, deepening our understanding of the relation between AI and Law. Joe Cohen of the Dentons law firm will talk about advances in automation, highlighting the real-world impact of AI and Law. Finally, IAAIL president Enrico Francesconi will outline the evolution and the perspectives of AI research in relation to ICAIL.

Finally, many people worked hard over months to make ICAIL 2021 excellent. Many thanks to the following. Conference chair Juliano Maranhão and his team took on the task of putting ICAIL 2021 online, in very trying circumstances. The IAAIL secretary Michał Araszkiewicz, along with Anne Gardner, addressed administrative and management matters. The most substantive contributors were the authors who submitted papers and the reviewers who took the time to assess and discuss the submissions; they have shaped the content that advances the field. The organisers of and presenters at the workshops and Doctoral Consortium all extended the discussion and supported emerging researchers. Our sponsors provided essential recognition and support. And finally, we are most appreciative of the IAAIL Executive Committee, which promotes AI & Law research through the ICAIL conferences.

Adam Zachary Wyner, ICAIL 2021 Program Chair

ICAIL 2021 Program and Schedule of Events
All times in GMT. Events marked REC. are streamed events (replay with live discussion with pre-registered questions).

Monday, 21 June – Workshops


Various-13:30 Workshops
Schedule independently defined by the organizers.
6 hours - 1st half
LEGAL AIIA AI and Intelligent Assistance for Legal Professionals in the
Digital Workplace
JUL.IA Artificial Intelligence in Jurisdictional Logistics
COLIEE Competition on Legal Information Extraction/Entailment
3 hours
BEFAIR Bias, Ethics and Fairness in Artificial Intelligence: Representa-
tion and Reasoning
13:30-15:00 Great Opening and Industry sessions
Time
13:30-14:00 Minister Luís Roberto Barroso (Brazilian Supreme Court)
14:00-14:30 Fabio Cozman (Industry Chair)
14:30-15:00 Sponsors and discussion
15:00-15:30 Awards ceremony
15:30-Various Workshops
Schedule independently defined by the organizers.
6 hours - 2nd half
LEGAL AIIA AI and Intelligent Assistance for Legal Professionals in the
Digital Workplace
JUL.IA Artificial Intelligence in Jurisdictional Logistics
COLIEE Competition on Legal Information Extraction/Entailment
3 hours
COPYRIGHT Copyright Regulation of Inputs and Outputs of AI Systems
XAILA EXplainable & Responsible AI in Law

Tuesday, 22 June – Main conference


9:00-10:30 Session 1
Time Page
09:00 On Semantics-based Minimal Revision for Legal Reasoning 50
Fungwacharakorn, Wachara; Tsushima, Kanae; Satoh, Ken
09:30 Converting Copyright Legislation into Machine-Executable Code: In- 139
terpretation, Coding Validation and Legal Alignment
Witt, Alice; Huggins, Anna; Governatori, Guido; Buckley, Joshua
10:00 Unravel Legal References in Defeasible Deontic Logic 69
Governatori, Guido; Olivieri, Francesco

10:30-11:00 Short break / Networking space
11:00- Session 2
Time Page
11:00 Hardness of Case-Based Decisions: a Formal Theory 149
Zheng, Heng; Grossi, Davide; Verheij, Bart
11:30 Precedential Constraint: The Role of Issues 12
Bench-Capon, Trevor; Atkinson, Katie
12:00 Incorporating Domain Knowledge for Extractive Summarization of Le- 22
gal Case Documents
Bhattacharya, Paheli; Poddar, Soham; Rudra, Koustav; Ghosh, Kri-
pabandhu; Ghosh, Saptarshi
12:30-13:30 Break
13:30-14:30 Session 3
Time Page
13:30 A dynamic model for balancing values 89
Maranhão, Juliano; Souza, Edelcio; Sartor, Giovanni
14:00 On Semantics-based Minimal Revision for Legal Reasoning 50
REC. Fungwacharakorn, Wachara; Tsushima, Kanae; Satoh, Ken
14:30-15:30 Keynote Speaker – iRobot: how to use Robotic Process Automation to
automate certain legal work
15:30-16:00 Short break / Networking space
16:00-16:55 Session 4
Time Page
16:00 Incorporating Domain Knowledge for Extractive Summarization of Le- 22
gal Case Documents
REC. Bhattacharya, Paheli; Poddar, Soham; Rudra, Koustav; Ghosh, Kri-
pabandhu; Ghosh, Saptarshi
16:30 To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment 295
Rosa, Guilherme Moraes; Rodrigues, Ruan Chaves; Lotufo, Roberto;
Nogueira, Rodrigo
16:45 Interactive System for Arranging Issues based on PROLEG in Civil Lit- 273
igation
Satoh, Ken; Takahashi, Kazuko; Kawasaki, Tatsuki
16:50 Live Demonstration of a Working Collaborative e-Negotiation System 275
(Smartsettle Infinity)
Ross, Graham Laurence; Thiessen, Ernest
16:55-17:30 Break

18:30-20:00 Session 5

Time Page
18:30 Precedential Constraint: The Role of Issues 12
REC. Bench-Capon, Trevor; Atkinson, Katie
19:00 BERT-based Ensemble Methods with Data Augmentation for Legal Tex- 278
tual Entailment in COLIEE Statute Law Task
Yoshioka, Masaharu; Aoki, Yasuhiro; Suzuki, Youta
19:15 Legal Norm Retrieval with Variations of the BERT Model Combined 285
with TF-IDF Vectorization
Wehnert, Sabine Sarah; Sudhi, Viju; Dureja, Shipra; Kutty, Libin
Johnny; Shahania, Saijal; De Luca, Ernesto William
19:30 Toward Summarizing Case Decisions via Extracting Argument Issues, 250
Reasons, and Conclusions
Xu, Huihui; Savelka, Jaromir; Ashley, Kevin
19:45 Practical Tools from Formal Models: The ECHR as a Case Study 170
Atkinson, Katie; Collenette, Joe; Bench-Capon, Trevor; Dzehtsiarou,
Kanstantsin
20:30-21:45 Session 6
Time Page
20:30 Hardness of Case-Based Decisions: a Formal Theory 149
REC. Zheng, Heng; Grossi, Davide; Verheij, Bart
21:00 When Does Pretraining Help? Assessing Self-Supervised Learning for 159
Law and the CaseHOLD Dataset of 53,000+ Legal Holdings
Guha, Neel; Zheng, Lucia; Anderson, Brandon Ray; Henderson, Pe-
ter; Ho, Daniel En-Wenn
21:30 Modelling Legal Procedures 220
Rotolo, Antonino; Smith, Clara
21:45 Towards compliance checking in reified I/O logic via SHACL 215
Robaldo, Livio

Wednesday, 23 June – Main conference


09:00-10:30 Session 1
Time Page
09:00 A Combined Rule-Based and Machine Learning Approach for Auto- 40
mated GDPR Compliance Checking
El Hamdani, Rajaa; Mustapha, Majd; Restrepo Amariles, David;
Troussel, Aurore; Meeus, Sébastien; Krasnashchok, Katsiaryna
09:30 The Burden of Persuasion in Structured Argumentation 180
Calegari, Roberta; Riveret, Regis; Sartor, Giovanni
09:45 Discovering the Rationale of Decisions: Towards a Method for Aligning 235
Learning and Reasoning
Steging, Cornelis Cor; Renooij, Silja; Verheij, Bart
10:00 A dynamic model for balancing values 89
REC. Maranhão, Juliano; Souza, Edelcio; Sartor, Giovanni
10:30-11:00 Short break / Networking space

11:00-11:40 Session 2
Time Page
11:00 Applying Decision Tree Analysis to Family Court Decisions: Factors 258
Determining Child Custody in Taiwan
Huang, Sieh-Chuen; Shao, Hsuan-Lei; Leflar, Robert B
11:05 Constraint Answer Set Programming as a Tool to Improve Legislative 262
Drafting: A Rules as Code Experiment
Morris, Jason Patrick
11:10 CriminelBART: A French Canadian Legal Language Model Specialized 256
in Criminal Law
Garneau, Nicolas; Gaumond, Eve; Lamontagne, Luc; Déziel, Pierre-
Luc
11:15 Sentence Classification for Contract Law Cases: A Natural Language 260
Processing Approach
Mok, Wai Yin; Mok, Jonathan R.; Mok, Rachel V.
11:20 Labels distribution matters in performance achieved in legal judgment 268
prediction task
Salaün, Olivier; Langlais, Philippe; Benyekhlef, Karim
11:25 Pathways to Legal Dynamics in Robotics 266
Rotolo, Antonino; Tamargo, Luciano H.; Martínez, Diego C.
11:30 A simple mathematical model for the legal concept of balancing of in- 270
terests
Zufall, Frederike; Kimura, Rampei; Peng, Linyu
11:35 Predicting Legal Proceedings Status: Approaches Based on Sequential 264
Text Data
Polo, Felipe Maia; Ciochetti, Itamar; Bertolo, Emerson
11:40-13:00 Break

13:00-14:30 Session 3
Time Page
13:00 Automatic Extraction of Amendments from Polish Statutory Law 225
Smywiński-Pohl, Aleksander; Piech, Mateusz; Kaleta, Zbigniew;
Wróbel, Krzysztof
13:15 Enhancing a Recidivism Prediction Tool With Machine Learning: Ef- 210
fectiveness and Algorithmic Fairness
Karimi-Haghighi, Marzieh; Castillo, Carlos
13:30 Converting Copyright Legislation into Machine-Executable Code: In- 139
terpretation, Coding Validation and Legal Alignment
REC. Witt, Alice; Huggins, Anna; Governatori, Guido; Buckley, Joshua
14:00 Unravel Legal References in Defeasible Deontic Logic 69
REC. Governatori, Guido; Olivieri, Francesco
14:30-15:30 Keynote Speaker – Provably Beneficial Artificial Intelligence

15:30-16:00 Short break / Networking space

16:00-17:00 Session 4
Time Page
16:00 A Combined Rule-Based and Machine Learning Approach for Auto- 40
mated GDPR Compliance Checking
El Hamdani, Rajaa; Mustapha, Majd; Restrepo Amariles, David;
Troussel, Aurore; Meeus, Sébastien; Krasnashchok, Katsiaryna
16:30 When Does Pretraining Help? Assessing Self-Supervised Learning for 159
Law and the CaseHOLD Dataset of 53,000+ Legal Holdings
REC. Guha, Neel; Zheng, Lucia; Anderson, Brandon Ray; Henderson, Pe-
ter; Ho, Daniel En-Wenn
17:00-18:30 Break
18:30-20:00 Session 5
Time Page
18:30 Context-Aware Legal Citation Recommendation using Deep Learning 79
Huang, Zihan; Low, Charles; Teng, Mengqiu; Zhang, Hongyi; Ho,
Daniel E.; Krass, Mark; Grabmair, Matthias
19:00 From Data to Information: Automating Data Science to Explore the U.S. 119
Court System
Li Zhao, Andong L.; Pack, Harper; Servantez, Sergio; Adler, Rachel
F.; Sterbentz, Marko; Pah, Adam; Schwartz, David; Barrie, Cameron;
Einarsson, Alexander; Hammond, Kristian
19:30 Case-level Prediction of Motion Outcomes in Civil Litigation 99
McConnell, Devin J.; Zhu, James; Pandya, Sachin S.; Aguiar, Derek
Cole
20:00-20:30 Short break / Networking space
20:30-21:30 Session 6
Time Page
20:30 Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdic- 129
tions, and Legal Domains
Savelka, Jaromir; Westermann, Hannes; Benyekhlef, Karim; Alexan-
der, Charlotte S.; Grant, Jayla C.; Amariles, David Restrepo; El-
Hamdani, Rajaa; Meeus, Sebastien; Troussel, Aurore; Araszkiewicz,
Michal; Ashley, Kevin D.; Ashley, Alexandra; Branting, Karl L.; Fal-
duti, Mattia; Grabmair, Matthias; Harasta, Jakub; Novotna, Tereza;
Tippett, Elizabeth; Johnson, Shiwanni
21:00 Plum2Text: A French Plumitifs–Descriptions Data-to-Text Dataset for 200
Natural Language Generation
Garneau, Nicolas; Gaumond, Eve; Lamontagne, Luc; Déziel, Pierre-
Luc
21:15 Process Mining-Enabled Jurimetrics: Analysis of a Brazilian Court’s 240
Judicial Performance in the Business Law Processing
Unger, Adriana Jacoto; dos Santos Neto, Jose Francisco; Trecenti,
Julio; Hirota, Renata; Fantinato, Marcelo; Peres, Sarajane Marques

Thursday, 24 June – Main conference
09:00-10:30 Session 1
Time Page
09:00 Explainable Artificial Intelligence, lawyer’s perspective 60
Górski, Łukasz; Ramakrishna, Shashishekar
09:30 Evaluating Document Representations for Content-based Legal Litera- 109
ture Recommendations
Ostendorff, Malte; Ash, Elliott; Ruas, Terry; Gipp, Bela; Moreno-
Schneider, Julian; Rehm, Georg
10:00 Structural Text Segmentation of Legal Documents 2
Aumiller, Dennis; Almasian, Satya; Lackner, Sebastian; Gertz,
Michael
10:30-11:00 Short break / Networking space
11:00-12:00 Session 2
Time Page
11:00 AI systems and product liability 32
Borges, Georg
11:30 A Dataset for Evaluating Legal Question Answering on Private Inter- 230
national Law
Sovrano, Francesco; Palmirani, Monica; Distefano, Biagio; Sapienza,
Salvatore; Vitali, Fabio
11:45 Making Intelligent Online Dispute Resolution Tools available to Self- 195
Represented Litigants in the Public Justice System
Esteban de la Rosa, Fernando; Zeleznikow, John
12:00-12:30 Break
12:30-13:30 IAAIL General Meeting
13:30-14:30 Session 3
Time Page
13:30 Structural Text Segmentation of Legal Documents 2
REC. Aumiller, Dennis; Almasian, Satya; Lackner, Sebastian; Gertz,
Michael
14:00 Anonymization of German Legal Court Rulings 205
Glaser, Ingo; Schamberger, Tom; Matthes, Florian
14:15 Regulating Artificial Intelligence: A Technology Regulator’s Perspective 190
Ellul, Joshua; McCarthy, Stephen; Sammut, Trevor; Brockdorff,
Juanita; Scerri, Matthew; Pace, Gordon J.
14:30-15:30 Presidential Address – The Winter, The Summer and The Summer Dream
of AI in LAW

15:30-16:00 Short break / Networking space

16:00-17:15 Session 4
Time Page
16:00 Using Transformers to Improve Answer Retrieval for Legal Questions 245
Vold, Andrew; Conrad, Jack G
16:15 From Data to Information: Automating Data Science to Explore the U.S. 119
Court System
REC. Li Zhao, Andong L.; Pack, Harper; Servantez, Sergio; Adler, Rachel
F.; Sterbentz, Marko; Pah, Adam; Schwartz, David; Barrie, Cameron;
Einarsson, Alexander; Hammond, Kristian
16:45 Case-level Prediction of Motion Outcomes in Civil Litigation 99
REC. McConnell, Devin J.; Zhu, James; Pandya, Sachin S.; Aguiar, Derek
Cole
17:15-18:30 Break
18:30-20:00 Session 5
Time Page
18:30 Context-Aware Legal Citation Recommendation using Deep Learning 79
REC. Huang, Zihan; Low, Charles; Teng, Mengqiu; Zhang, Hongyi; Ho,
Daniel E.; Krass, Mark; Grabmair, Matthias
19:00 Explainable Artificial Intelligence, lawyer’s perspective 60
REC. Górski, Łukasz; Ramakrishna, Shashishekar
19:30 On the relevance of algorithmic decision predictors for judicial decision 175
making
Bex, Floris; Prakken, Henry
19:45 Prediction of monetary penalties for data protection cases in multiple 185
languages
Ceross, Aaron William Karl; Zhu, Tingting
20:00-20:30 Short break / Networking space
20:30-22:00 Session 6
Time Page
20:30 Evaluating Document Representations for Content-based Legal Litera- 109
ture Recommendations
REC. Ostendorff, Malte; Ash, Elliott; Ruas, Terry; Gipp, Bela; Moreno-
Schneider, Julian; Rehm, Georg
21:00 AI systems and product liability 32
REC. Borges, Georg
21:30 Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdic- 129
tions, and Legal Domains
REC. Savelka, Jaromir; Westermann, Hannes; Benyekhlef, Karim; Alexan-
der, Charlotte S.; Grant, Jayla C.; Amariles, David Restrepo; El-
Hamdani, Rajaa; Meeus, Sebastien; Troussel, Aurore; Araszkiewicz,
Michal; Ashley, Kevin D.; Ashley, Alexandra; Branting, Karl L.; Fal-
duti, Mattia; Grabmair, Matthias; Harasta, Jakub; Novotna, Tereza;
Tippett, Elizabeth; Johnson, Shiwanni
22:00-22:15 Closing comments conducted by the program chair

Friday, 25 June – Workshops, Doctoral Consortium, Closing
Various-14:30 Workshops
Time schedule to be defined by the organizer of each workshop, finishing activities
before 14:30.
6 hours
ASAIL Automated Detection, Extraction and Analysis of Semantic In-
formation in Legal Texts
MWAIL Multilingual Workshop on AI & Law
PATENTS Artificial Intelligence and Patents
3 hours
AILBIZ (Legal Business) International Workshop on A.I. for Understanding the Legal
Business
RELATED RELATED – Relations in the Legal Domain
07:30-09:15 1st Panel
07:30 Opening speeches of the Doctoral Consortium
07:45 Beyond persons and things: the legal status of artificial intel-
ligence
Diana Mocanu
08:15 The Digital Administrative Act
Alexander Stepanov
08:45 Constitutional limits to the use of artificial intelligence in court
proceedings
Elisabeth Paar
09:15-09:30 Short break / Networking space
09:30-11:00 2nd Panel
09:30 An African Perspective on Answering the Ethics Question:
Who Should Make the Rules on Self-Driving Cars?
Okechukwu Effoduh
10:00 Judged by Machines? How do algorithms, now used in crimi-
nal justice, impact on the legitimacy of the system?
Cari Hyde-Vaamonde
10:30 Transactions on Privacy and the Tools that Assist – An Inter-
disciplinary Analysis
Kartik Chawla
11:00-11:30 Long break / Networking space

11:30-13:00 3rd Panel
11:30 ALGORITHMS (DIS)SERVING JUSTICE: Risk Assessment
Tools in Pre-trial Process
Mina Ilhan
12:00 HONto: A Knowledge Base from Textbooks for Legal Text Re-
trieval and Recommendations
Sabine Wehnert
12:30 ESRA: An End-to-End System for Re-Identification and
Anonymization of Swiss Court Decisions
Joel Niklaus
13:00-13:15 Short break / Networking space
13:15-14:30 4th Panel
13:15 Measurement of Consistency in Judicial Decisions
Aline Macohin
13:45 Artifact design for design-science research on process mining
for legal compliance
Adriana Jacoto Unger
14:15 Doctoral Consortium Best Paper award & ending
14:30-15:30 Closing speeches

Invited Speakers
Provably Beneficial Artificial Intelligence
Professor Stuart Russell
University of California Berkeley, US

Topic Prof. Russell discusses an approach to ethical AI based on the idea that AI
systems should be beneficial to humans, with the key caveat that what counts as
"beneficial" is very unlikely to be fully specified. Issues include social aggregation
and social choice, laws, equity, sadism, pride, envy, and mental integrity.

About Stuart Russell is a Professor of Computer Science at the University of California at Berkeley, holder of the Smith-Zadeh Chair in Engineering, and Director of the Center for Human-Compatible AI. He is a recipient of the IJCAI Computers and Thought Award and from 2012 to 2014 held the Chaire Blaise Pascal in Paris. He is an Honorary Fellow of Wadham College, Oxford, an Andrew Carnegie Fellow, and a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the American Association for the Advancement of Science. His book "Artificial Intelligence: A Modern Approach" (with Peter Norvig) is the standard text in AI, used in 1500 universities in 135 countries. His research covers a wide range of topics in artificial intelligence, with an emphasis on the long-term future of artificial intelligence and its relation to humanity.

iRobot: how to use Robotic Process Automation to automate certain legal work
Joseph Cohen
Innovation Lead, Dentons UK & Middle East

Topic Robotic Process Automation (RPA) is usually thought of with respect to back-office functions such as Finance and HR. Recently, however, Dentons have been putting this technology into the hands of our junior lawyers and asking them to not only suggest some legal tasks for the software to automate, but to actually do the automation as well. In this talk Joe will cover: How RPA fits into the wider legaltech landscape; How RPA actually works; How to choose RPA use cases, and when not to use it; How best to empower lawyers with the right tools.

About Joe leads the Innovation team for the Dentons UK, Ireland and Middle East
region. This includes responsibility for legal technology pilots and implementations,
as well as other ’innovation culture’ initiatives such as innovation training and design
thinking. Prior to this, as a non-lawyer, Joe held innovation positions at Linklaters
and Slaughter and May, and before that was a technology consultant for Deloitte. Joe
was shortlisted for the recent Law.com Innovation Trailblazer of the Year award.

Contents

Conference Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

ICAIL 2021 Program and Schedule of Events . . . . . . . . . . . . . . . . . vii

Invited Speakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi

I Full Papers 1
Structural Text Segmentation of Legal Documents . . . . . . . . . . . . . 2
Dennis Aumiller, Satya Almasian, Sebastian Lackner and Michael Gertz

Precedential Constraint: The Role of Issues . . . . . . . . . . . . . . . . . 12


Trevor Bench-Capon and Katie Atkinson

Incorporating Domain Knowledge for Extractive Summarization of Le-


gal Case Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Paheli Bhattacharya, Soham Poddar, Koustav Rudra, Kripabandhu Ghosh and
Saptarshi Ghosh

AI Systems and Product Liability . . . . . . . . . . . . . . . . . . . . . . . . 32


Georg Borges

A Combined Rule-Based and Machine Learning Approach for Automated


GDPR Compliance Checking . . . . . . . . . . . . . . . . . . . . . . . . 40
Rajaa EL HAMDANI, Majd Mustapha, David Restrepo Amariles, Aurore Trous-
sel, Sébastien Meeùs and Katsiaryna Krasnashchok

On Semantics-based Minimal Revision for Legal Reasoning . . . . . . . 50


Wachara Fungwacharakorn, Kanae Tsushima and Ken Satoh

Explainable Artificial Intelligence, Lawyer’s Perspective . . . . . . . . . 60


Łukasz Górski and Shashishekar Ramakrishna

Unravel Legal References in Defeasible Deontic Logic . . . . . . . . . . . 69
Guido Governatori and Francesco Olivieri

Context-Aware Legal Citation Recommendation using Deep Learning . 79


Zihan Huang, Charles Low, Mengqiu Teng, Hongyi Zhang, Daniel E. Ho, Mark
S. Krass and Matthias Grabmair

A dynamic model for balancing values . . . . . . . . . . . . . . . . . . . . 89


Juliano Maranhão, Edelcio G. de Souza and Giovanni Sartor

Case-level Prediction of Motion Outcomes in Civil Litigation . . . . . . 99


Devin J. McConnell, James Zhu, Sachin Pandya and Derek Aguiar

Evaluating Document Representations for Content-based Legal Litera-


ture Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Malte Ostendorff, Elliott Ash, Terry Ruas, Bela Gipp, Julian Moreno-Schneider
and Georg Rehm

From Data to Information: Automating Data Science to Explore the U.S.


Court System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Andrew Paley, Andong L. Li Zhao, Harper Pack, Sergio Servantez, Rachel F.
Adler, Marko Sterbentz, Adam Pah, David Schwartz, Cameron Barrie, Alexan-
der Einarsson and Kristian Hammond

Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdic-


tions, and Legal Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Jaromir Savelka, Hannes Westermann, Karim Benyekhlef, Charlotte S. Alexan-
der, Jayla C. Grant, David Restrepo Amariles, Rajaa El Hamdani, Sébastien
Meeùs, Aurore Troussel, Michał Araszkiewicz, Kevin D. Ashley, Alexandra
Ashley, Karl Branting, Mattia Falduti, Matthias Grabmair, Jakub Harašta,
Tereza Novotná, Elizabeth Tippett and Shiwanni Johnson

Converting Copyright Legislation into Machine-Executable Code: In-


terpretation, Coding Validation and Legal Alignment . . . . . . . . . 139
Alice Witt, Anna Huggins, Guido Governatori and Joshua Buckley

Hardness of Case-Based Decisions: a Formal Theory . . . . . . . . . . . . 149


Heng Zheng, Davide Grossi and Bart Verheij

When Does Pretraining Help? Assessing Self-Supervised Learning for


Law and the CaseHOLD Dataset of 53,000+ Legal Holdings . . . . . . 159
Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson and Daniel E.
Ho

II Short Papers 169
Practical Tools from Formal Models: The ECHR as a Case Study . . . . . 170
Katie Atkinson, Joe Collenette, Trevor Bench-Capon and Kanstantsin Dzeht-
siarou

On the relevance of algorithmic decision predictors for judicial decision


making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Floris Bex and Henry Prakken

The Burden of Persuasion in Structured Argumentation . . . . . . . . . 180


Roberta Calegari, Régis Riveret and Giovanni Sartor

Prediction of monetary penalties for data protection cases in multiple


languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Aaron Ceross and Tingting Zhu

Regulating Artificial Intelligence: A Technology Regulator’s Perspective 190


Joshua Ellul, Gordon Pace, Stephen McCarthy, Trevor Sammut, Juanita Brock-
dorff and Matthew Scerri

Making Intelligent Online Dispute Resolution Tools available to Self-


Represented Litigants in the Public Justice System . . . . . . . . . . . 195
Fernando Esteban de la Rosa and John Zeleznikow

Plum2Text: A French Plumitifs–Descriptions Data-to-Text Dataset for


Natural Language Generation . . . . . . . . . . . . . . . . . . . . . . . . 200
Nicolas Garneau, Eve Gaumond, Luc Lamontagne and Pierre-Luc Déziel

Anonymization of German Legal Court Rulings . . . . . . . . . . . . . . . 205


Ingo Glaser, Tom Schamberger and Florian Matthes

Enhancing a Recidivism Prediction Tool With Machine Learning: Ef-


fectiveness and Algorithmic Fairness . . . . . . . . . . . . . . . . . . . 210
Marzieh Karimi-Haghighi and Carlos Castillo

Towards compliance checking in reified I/O logic via SHACL . . . . . . . 215


Livio Robaldo

Modelling Legal Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . 220


Antonino Rotolo and Clara Smith

Automatic Extraction of Amendments from Polish Statutory Law . . . 225


Aleksander Smywiński-Pohl, Mateusz Piech, Zbigniew Kaleta and Krzysztof
Wróbel

A Dataset for Evaluating Legal Question Answering on Private Interna-
tional Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Francesco Sovrano, Monica Palmirani, Biagio Distefano, Salvatore Sapienza
and Fabio Vitali

Discovering the Rationale of Decisions: Towards a Method for Aligning


Learning and Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Cornelis Cor Steging, Silja Renooij and Bart Verheij

Process Mining-Enabled Jurimetrics: Analysis of a Brazilian Court’s Ju-


dicial Performance in the Business Law Processing . . . . . . . . . . . 240
Adriana Jacoto Unger, Jose Francisco dos Santos Neto, Julio Trecenti, Renata
Hirota, Marcelo Fantinato and Sarajane Marques Peres

Using Transformers to Improve Answer Retrieval for Legal Questions . 245


Andrew Vold and Jack G. Conrad

Toward Summarizing Case Decisions via Extracting Argument Issues,


Reasons, and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Huihui Xu, Jaromir Savelka and Kevin D. Ashley

III Extended Abstracts 255


CriminelBART: A French Canadian Legal Language Model Specialized
in Criminal Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Nicolas Garneau, Eve Gaumond, Luc Lamontagne and Pierre-Luc Déziel

Applying Decision Tree Analysis to Family Court Decisions: Factors De-


termining Child Custody in Taiwan . . . . . . . . . . . . . . . . . . . . 258
Sieh-Chuen Huang, Hsuan-Lei Shao and Robert B Leflar

Sentence Classification for Contract Law Cases: A Natural Language


Processing Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Jonathan R. Mok, Wai Yin Mok and Rachel V. Mok

Constraint Answer Set Programming as a Tool to Improve Legislative


Drafting: A Rules as Code Experiment . . . . . . . . . . . . . . . . . . 262
Jason Morris

Predicting Legal Proceedings Status: Approaches Based on Sequential


Text Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Felipe Maia Polo, Itamar Ciochetti and Emerson Bertolo

Pathways to Legal Dynamics in Robotics . . . . . . . . . . . . . . . . . . . 266


Antonino Rotolo, Luciano H. Tamargo and Diego C. Martínez

Labels distribution matters in performance achieved in legal judgment
prediction task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Olivier Salaün, Philippe Langlais and Karim Benyekhlef

A simple mathematical model for the legal concept of balancing of in-


terests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Frederike Zufall, Rampei Kimura and Linyu Peng

IV Demonstrations 272
Interactive System for Arranging Issues based on PROLEG in Civil Lit-
igation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Ken Satoh, Kazuko Takahashi and Tatsuki Kawasaki

Live Demonstration of a Working Collaborative e-Negotiation System


(Smartsettle Infinity) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Ernest Thiessen and Graham Ross

V COLIEE Papers 277


BERT-based Ensemble Methods with Data Augmentation for Legal Tex-
tual Entailment in COLIEE Statute Law Task . . . . . . . . . . . . . . 278
Masaharu Yoshioka, Yasuhiro Aoki and Youta Suzuki

Legal Norm Retrieval with Variations of the BERT Model Combined


with TF-IDF Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Sabine Wehnert, Viju Sudhi, Shipra Dureja, Libin Kutty, Saijal Shahania and
Ernesto W. De Luca

To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment . 295
Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto de Alencar Lotufo
and Rodrigo Nogueira

Index of Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

Part I

Full Papers
Structural Text Segmentation of Legal Documents

Dennis Aumiller∗ (Institute of Computer Science, Heidelberg University, Heidelberg, Germany, aumiller@informatik.uni-heidelberg.de)
Satya Almasian∗ (Institute of Computer Science, Heidelberg University, Heidelberg, Germany, almasian@informatik.uni-heidelberg.de)
Sebastian Lackner (Institute of Computer Science, Heidelberg University, Heidelberg, Germany, lackner@informatik.uni-heidelberg.de)
Michael Gertz (Institute of Computer Science, Heidelberg University, Heidelberg, Germany, gertz@informatik.uni-heidelberg.de)

ABSTRACT
The growing complexity of legal cases has led to an increasing interest in legal information retrieval systems that can effectively satisfy user-specific information needs. However, such downstream systems typically require documents to be properly formatted and segmented, which is often done with relatively simple pre-processing steps, disregarding topical coherence of segments. Systems generally rely on representations of individual sentences or paragraphs, which may lack crucial context, or document-level representations, which are too long for meaningful search results. To address this issue, we propose a segmentation system that can predict topical coherence of sequential text segments spanning several paragraphs, effectively segmenting a document and providing a more balanced representation for downstream applications. We build our model on top of popular transformer networks and formulate structural text segmentation as topical change detection, by performing a series of independent classifications that allow for efficient fine-tuning on task-specific data. We crawl a novel dataset consisting of roughly 74,000 online Terms-of-Service documents, including hierarchical topic annotations, which we use for training. Results show that our proposed system significantly outperforms baselines, and adapts well to structural peculiarities of legal documents. We release both data and trained models to the research community for future work.1

CCS CONCEPTS
• Applied computing → Law; Annotation; • Information systems → Document structure.

KEYWORDS
Document Understanding, Outline Generation, Text Segmentation

∗ Both authors contributed equally to this research.
1 https://github.com/dennlinger/TopicalChange

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466085

ACM Reference Format:
Dennis Aumiller, Satya Almasian, Sebastian Lackner, and Michael Gertz. 2021. Structural Text Segmentation of Legal Documents. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466085

1 INTRODUCTION
Written texts are often a sequence of semantically coherent segments, designed to create a smooth transition between various subtopics discussed in a single document. Usually, the information needs of a user are satisfied by retrieving only the relevant subtopic, and retrieving the whole document is unwieldy and may result in information overload [29, 43]. However, the context of a single subtopic frequently spans multiple sentences and contains localized context, which is crucial for proper understanding. Despite the clear relevance of segmentation to downstream performance, many (legal) retrieval systems choose structurally rigid representations of only a single text element (generally either the full document [8, 28], or a single paragraph/sentence [34, 42]), disregarding the semantic coherence. Especially in legal documents, which can be extremely lengthy and contain domain-specific complexities in their topics, it is important to suitably represent entire topics in a single cohesive unit. Furthermore, aside from semi-structured legal texts, such as laws, other documents do not necessarily contain uniform and easily separable segments, to begin with. Especially input formats such as PDF or scans frequently lack any sort of meta descriptors for hierarchical information about the document contents, which makes this a challenging task. To find a fitting representation that captures the precise topical context in the text, a robust and flexible framework to obtain such a structural segmentation is required.

We therefore propose a new approach for the estimation of topic boundaries to generate more suitable document representations for the mentioned downstream applications, by considering the topical coherence of paragraphs. We define a coherent section in a document as a unit consisting of potentially one or multiple paragraphs, which together share a common topic. Section boundaries often coincide with a change in topic and can thus be assumed to generate candidates for the later segmentation.

Despite their importance, many previous works for structural text segmentation ignore the notion of paragraphs and focus only on the granularity of sentences. This is contrary to the nature of written text, where paragraphs represent a semantically cohesive unit,
which is already available and represents a coarser and more meaningful structure than sentences. In this work, we assume that topic boundaries generally do not appear in the middle of a paragraph, and, consequently, operating on paragraph level can reduce the risk of false-positive segmentations and lower the computation cost of per sentence prediction. Figure 1 shows how paragraphs group sentences and divide a text into coherent parts and how by overlooking this valuable information the structure in the text is lost to the model.

Figure 1: Visual cues such as paragraphs often give away a notion of semantic coherence, which is disregarded in sentence-level models.

Focusing on the field of text segmentation in the context of Natural Language Processing (NLP), we find an already large body of existing work. Because no large labeled dataset existed, early approaches to text segmentation were mainly unsupervised, using heuristics to identify whether two sentences belong to the same topic or not. Such approaches either exploit the fact that topically related words tend to appear in semantically coherent segments [6, 18, 22, 26, 39], or focus on the representation of text in terms of latent-topic vectors using topic modeling methods, such as Latent Dirichlet Allocation [2, 29, 30, 37]. Recently, with the availability of annotated data, text segmentation has also been formulated as a supervised learning problem. Most methods utilize a hierarchical neural model, where the lower-level network creates sentence representations, and a secondary network models the dependencies between the embedded sentences [14, 21]. These models use sentence dependencies to predict the potential segment boundaries. One drawback of these approaches is their sentence-level granularity, which disregards the paragraph coherence previously mentioned. This problem is partially solved by hierarchical neural models, where the dependency between sentences is modeled in a hierarchical structure, to combine sentences into bigger units. However, training such models is, due to the different document lengths, computationally expensive. Moreover, these models fail to take advantage of larger pre-trained language representations, such as BERT [9] and RoBERTa [23], which have recently proven to be valuable feature generators with a low cost of fine-tuning, advancing the state-of-the-art in several disciplines.

In this paper, we aim to tackle the task of structural text segmentation using transformer-based models, and introduce a novel dataset of Terms-of-Service documents, containing annotated paragraphs belonging to the same topic. We focus on the text segmentation alone and leave the use of segmentation for retrieval enhancement to future work. We formulate topical coherence as a special case of binary classification of Same Topic Prediction (STP) and fine-tune our transformer-based models to detect paragraphs belonging to the same sections with similar topics. We consider sections as the top-level hierarchy in our dataset and do not consider subsections. We assume topical independence between consecutive paragraphs and show that it does not lead to a deterioration of performance while avoiding the costly computation of hierarchical models. The hypothesis is that by fine-tuning the paragraph embedding for topic similarity we can generate segment features that detect coherent topical structures in a document. We evaluate our models against the traditional embedding baselines and compare them to supervised and unsupervised approaches for text segmentation and find significant improvements by our method.

Contributions. The contributions of this paper are as follows: (i) We present the task of structural text segmentation on coarser cohesive text units (paragraphs/sections). (ii) We investigate the performance of transformer-based models for topical change detection, and (iii) frame the task as a collection of independent binary predictions, reducing overhead for hierarchical training and simplifying training sample generation. (iv) We present a new dataset consisting of online Terms-of-Service documents partitioned into hierarchical sections, and make the data available for future research. (v) We evaluate our model against classical baselines for text segmentation, and (vi) show the effectiveness of our generated embeddings for structural segmentations to obtain superior performance to other text segmentation techniques.

2 RELATED WORK
Our work is closely tied to the broader fields of legal document understanding, topic analysis, text segmentation, and transformer language models, and we briefly review related work in each of these areas in this section.

2.1 Legal Document Understanding
The area of legal document understanding and legal information retrieval has a long-standing history. A great overview is presented by Moens [31], who details more of the applications that were mentioned in the introduction. While there is existing work on the topic of legal document segmentation [24, 25, 27], they are generally concerned with a specific information extraction interest. In the case of Mencía, they are concerned with metadata of French law documents [27]. Specifically, they also make use of an existing HTML/XML structure in their input documents, but do not generalize to arbitrary text inputs without structural features. Lu et al., on the other hand, are utilizing clustering techniques to identify sub-topics in legal documents. However, these topics are irrespective of actual section boundaries within the document [24]. They further include additional metadata, such as citations, headnotes, and key numbers, in their task setup, which are suitable to their specific application in case law. Similarly, Conrad et al. [8] have previously attempted to employ clustering for heterogeneous document collections, again focusing on hierarchical representations in the form of short topic descriptors and not focusing on the actual textual content of the documents themselves. Lyte and Branting [25] classify metadata labels based on a CRF (Conditional Random Field) model, building on prior work by Branting [3] in the same direction, but focus mostly on element classification. Slightly
longer segments in the form of entire sentences are both used by Poudyal et al. [34], who mine arguments from European case-law decisions, and Westermann et al. [42], where a system for efficient similarity search based on sentence embeddings is presented.

2.2 Topic Analysis
Detection and analysis of topical change are grounded in topic modeling approaches. Earlier work such as LDA [2] treats documents as bag-of-words, where each document is assigned to a topic distribution, and each topic is a distribution over all words. More recent work has adopted a more sophisticated representation than bag-of-words and generally models Markovian topic or state transitions to capture dependencies between words in a document [15, 41]. With the rise of distributed word representation, the focus has shifted to the combination of LDA and word embeddings [11, 32]. Since we are interested in a primary segmentation without necessarily predicting topics, we put a stronger focus on the related work of segmentation methods, as discussed in the following section.

2.3 Text Segmentation
Text segmentation is the task of dividing a document into multi-paragraph discourse units that are topically coherent, with the cut-off point usually indicating a change in topic [17, 39]. Although the task itself dates back to 1994 [17], most existing text segmentation datasets are small and limit their scope to sentences (predicting whether two sentences discuss the same topic or not). The most common dataset is by Choi [7], containing only 920 synthesized passages from the Brown corpus. Choi's method (C99) is a probabilistic algorithm measuring similarity via term overlap. GraphSeg [13] is an unsupervised graph method that segments documents using a semantic relatedness graph of a document. GraphSeg is also evaluated on a small set of 5 manually-segmented political manifestos from the Manifesto project (https://manifestoproject.wzb.eu). Another class of methods are topic-based document segmentations, which are statistical models that find latent topic assignments reflecting the underlying structure of the document [1, 4, 5, 10, 29, 38]. TopicTiling [38] performs best among this family of methods and uses LDA to detect topic shifts, computing similarities between adjacent blocks based on their term frequency vectors. Brants et al. [4] follow a similar approach but employ PLSA [19] to compute the estimated word distributions. Another noteworthy approach based on Bayesian topic models is by Chen et al. [5], where they constrain latent topic assignments to reflect the underlying organization of document topics. They also publish a test dataset with 218 Wikipedia articles about cities and chemical elements.

All mentioned methods are unsupervised learning approaches, and small annotated datasets are only used for evaluation and, hence, are not directly comparable to our approach. Instead, we focus on supervised learning of topics and introduce a new dataset with 43,056 automatically labeled documents. The only two comparable supervised approaches are from Koshorek et al. [21] and Glavas et al. [14]. Koshorek et al. [21] propose a hierarchical LSTM architecture for learning sentence representations and their dependencies. They train their hierarchical model on a dataset of cleaned Wikipedia articles, called Wiki-727k. Glavas et al. [14] introduce Coherence-Aware Text Segmentation, which encodes a sentence sequence using two hierarchically connected transformer networks. The two latter models are closest to our work in terms of data size and problem formulation. However, they rely solely on per sentence predictions, which is incomparable to our paragraph-based method. The model by Glavas et al. is similar to our approach in that it is based on a transformer architecture, yet they do not take advantage of transfer learning from pre-trained language models and learn all the features from scratch. Finally, Zhang et al. [45] extend text segmentation by outline generation and train an end-to-end LSTM model for identifying sections and generating corresponding headings for Wikipedia documents.

2.4 Transformer Language Models
The transformer architecture, much like recurrent neural networks, aims to solve sequence-to-sequence tasks, relying entirely on self-attention to compute representations of its input and output [40]. Transformers have made a significant step in bringing transfer learning to the NLP community, which allows the easy adaptation of a generically pre-trained model for specific tasks. Pre-trained models such as BERT, GPT-2, and RoBERTa [9, 23, 35] use language modeling for pre-training on large corpora. These models are powerful feature generators, which with minimal task-specific fine-tuning achieve state-of-the-art performance on a wide variety of tasks. Although at the core of all these models lies the idea of transformers and attention mechanisms, many have been modified and optimized to fit various downstream applications. One variation based on BERT is Sentence-BERT [36], which combines two BERT-based models in Siamese fashion to derive semantically meaningful sentence embeddings. By its design, Sentence-BERT also allows for longer input sequences for pairwise training tasks and outperforms BERT on semantic textual similarity tasks, making it a suitable choice for embedding paragraphs. Another notable variant of BERT is RoBERTa, a retraining of BERT with improved training methodology and more training data; it achieves slightly better results than BERT on some natural understanding tasks. Due to the advantages of RoBERTa, we chose RoBERTa and Sentence-RoBERTa from the Sentence-BERT variant for the setup in our approach.

3 SAME TOPIC PREDICTION
We formulate structural text segmentation as a supervised learning task of same topic prediction. Our model consists of two steps: (i) Independent and Identically Distributed Same Topic Prediction (IID STP) and (ii) Sequential inference over a full document. As mentioned previously, sections are the considered level of hierarchy in our model and the structure of sub-sections is ignored in this study. However, the model is easily adaptable to any granularity, and our dataset contains information for all the levels. In the first step, we fine-tune transformer-based models to detect topical change for both paragraphs and entire sections. Given two paragraphs or sections, the classifier should correctly identify if they discuss the same subject or not. We assume that the topic of each paragraph or section is independent of the text before and after, meaning that the topic of one paragraph does not affect the likelihood of the next paragraph belonging to the same topic. We later prove
that this assumption yields good performance without a costly training of hierarchical models. In the second step, we use the fine-tuned transformer-based classifiers for sequential inference on entire documents, where the segment boundaries are defined by topical change. In the following, we discuss these steps in more detail.

Figure 2: Demonstration of how the transformer classifier is used during inference, by comparing consecutive paragraphs to detect section boundaries.

3.1 IID Same Topic Prediction (STP)
A document d ∈ D is represented as a sequence of N sections S_d = (s_1, ..., s_N), where each section is assigned one of M topics T = (t_1, ..., t_M), and each section contains up to K paragraphs P_n = (p_1, ..., p_K). We assume topical consistency within a paragraph and argue that the results for classification do not change based on the position of the paragraph in the document, since the most relevant part for our inference is the intra-section information. Therefore, all paragraphs in a section belong to the same topic. If the topic assignment is defined by the function Topic, we have:

    s_n = (p_1, ..., p_k) ∧ Topic(s_n) = t_1  ⟹  Topic(p_1) = t_1 = ... = Topic(p_k) = t_1        (1)

If we define C as a chunk of text corresponding to either a section or paragraph, the topic prediction task is defined for section and paragraph granularity as follows: Given two chunks of text of the same type (both paragraphs or both sections) (c_1, c_2) and labels y ∈ {0, 1}, indicating whether the two chunks belong to the same topic, topical change detection can be formulated as a binary classification problem. The positive class indicates that both chunks have the same topic, whereas the negative class indicates a change in topic and potentially the beginning of a new segment in text. Note that we only consider chunks of the same type, namely, either only sections or only paragraphs, in each model. By formulating the problem as a binary classification, detecting the topic consistency between two chunks of text can now be solved with any type of classifier.
𝑇𝑜𝑝𝑖𝑐 (𝑝 1 ) = 𝑡 1 = ... = 𝑇𝑜𝑝𝑖𝑐 (𝑝𝑘 ) = 𝑡 1 (1) or not. To distinguish between two chunks in training a [CLS]
token is inserted at the beginning of the first chunk and a [SEP]
If we define 𝐶 as a chunk of text corresponding to either a section token is inserted at the end of both the first and second chunk. The
or paragraph, the topic prediction task is defined for section and embedding of the [CLS] is what is used for pre-training the next
paragraph granularity as follows: Given two chunks of text of the sentence prediction task and contains RoBERTa’s understanding
same type (both paragraphs or both sections) (𝑐 1, 𝑐 2 ) and labels 𝑦 ∈ at the sentence-level. This token is used by a simple classification
{0, 1}, indicating whether the two chunks belong to the same topic, layer, learned during fine-tuning, for the same topic prediction task.
topical change detection can be formulated as a binary classification Since the input size for both chunks combined is limited to a maxi-
problem. The positive class indicates that both chunks have the mum of 512 tokens, shorter than many sections and paragraphs in
same topic, whereas the negative class indicates a change in topic our dataset, any longer chunk of text has to be truncated to fit.
and potentially the beginning of a new segment in text. Note that we
only consider chunks of the same type, namely, either only sections 3.1.2 Sentence-Transformers (SRoBERTa) aims to enhance the sen-
or only paragraphs, in each model. By formulating the problem tence embeddings by modification of RoBERTa using a Siamese
as a binary classification, detecting the topic consistency between architecture to derive semantically meaningful sentence embed-
two chunks of text can now be solved with any type of classifier. dings [36]. Their method is available for several transformer models.
In this work, we train two types of transformer-based classifiers We choose a RoBERTa-based variant to make the results compara-
for this task, one from the pre-trained language models [23] and ble to the first approach. SRoBERTa enables RoBERTa to be used
another Siamese network [36] variation, which is more suitable for for certain new tasks, such as large-scale semantic similarity com-
encoding pairwise similarity. Subsequently, the two variations are parison. Their modifications result in faster inference and better
discussed. representation for sentence-pair tasks. Moreover, because of the
Siamese structure and coupling of two RoBERTa networks, the in-
3.1.1 RoBERTa is a replication study of BERT pre-training with
put size doubles, which allows for longer sequences and thus more
optimized hyper-parameters that applies minor adjustments to the
context. In this setup, each sentence is passed through a separate
BERT language model to achieve better performance [23]. BERT
RoBERTa network with an input limit of 512 tokens. The sentence
and RoBERTa both belong to the family of pre-trained transformer-
embeddings are derived from a pooling operation over the output
based language models. The transformer is an architecture for
of two models with tied weights. Sentence-Transformers introduce
shaping one sequence into another one with the help of the self-
several learning objectives, out of which we use the classification
attention mechanism, which helps the model to extract features
objective function with binary cross-entropy loss to classify the
from each word relative to all the other words in the sequence. The
chunks into the same topics.
encoder stacks in BERT and RoBERTa consist of one or multiple
self-attention blocks followed by a feed-forward network. During
pre-training, two sentences are taken as input, and models are 3.2 Sequential Inference
trained on two tasks of language modeling, by predicting masked For inference, we use the classifiers of the previous step as topic
words in the input and next sentence prediction, and by classi- change detectors for text segmentation. We read each paragraph
fying whether the two sentences are sequential. By these means, of the document sequentially and classify the adjacent paragraphs
the models learn task-independent features from a vast amount of for topical mutuality. More concretely, given a document 𝑑 ∈ 𝐷
unlabelled text that can then be used in a fine-tuning stage for vari- divided into consecutive paragraphs 𝑃 = (𝑝 1, ....𝑝𝑘 ), section breaks
ous natural language understanding tasks. Since the performance are marked as where the paragraph’s topic changes. Considering a
difference between most transformer-based language models is transformer 𝑇 𝐹 as our classifier and two consecutive paragraphs


as our input, the classifier outputs the probability of the two paragraphs belonging to the same topic, independent of their surrounding context, e.g., 𝑇𝐹(𝑝1, 𝑝2) = 𝑃(𝑇𝑜𝑝𝑖𝑐(𝑝1) = 𝑇𝑜𝑝𝑖𝑐(𝑝2)). Therefore, given a sequence of paragraphs 𝑝1, ..., 𝑝𝑘 and the corresponding predicted labels 𝑦 = (𝑦1, ..., 𝑦𝑘−1), a segmentation of the document is given by the 𝑘 − 1 predictions of 𝑇𝐹, where 𝑦𝑖 = 0 denotes the end of a segment at 𝑝𝑖. It is worth noting that regardless of the chunk type used during the training of the classifiers (section or paragraph inputs), the segmentation module operates on paragraphs only. Figure 2 shows the inference on a sample document with four paragraphs and two sections, where the paragraph colors show the topics. The 𝑇𝐹 classifier is applied to each paragraph pair and can ideally recognize the topic change from 𝑃3 to 𝑃4 and mark the beginning of the new section.

3.3 Legal Applications
To put the presented segmentation into a legal context, we focus on three main application areas: (𝑖) As mentioned, a section-based semantic segmentation can be used as a pre-processing step in a passage retrieval context. This, however, would require additional data with relevance annotations at both the sentence and the paragraph level to compare the specific benefits of our approach, which we leave to future work in this area. (𝑖𝑖) Semantically coherent sections can also be used as a basis for similarity search. This is especially helpful when looking for, e.g., related sections in existing contracts [42]. Here, we focus on Terms-of-Service documents, which are widely available and contain sections that follow a general pattern of similar topics. (𝑖𝑖𝑖) Lastly, the section separation can be used for generating outlines of documents, which has previously been shown to work well in other domains such as Wikipedia [45]. During our document crawl, we also encountered several documents without any sectional headings, which makes it especially hard for lay users to understand the legal context.

4 TERMS-OF-SERVICE (TOS) DATASET
Due to data governance policies in many countries, it is generally mandated that commercial websites contain the necessary legal information for site users. Specifically, this information must be easily reachable via the landing page, which makes it comparatively easy to crawl. For each Terms-of-Service document, we automatically extract the content divided into paragraphs and the respective hierarchical section headings. Further, ToS documents allow us to experiment with a large-scale dataset that comes with a shared set of topics, while still maintaining a heterogeneous set of topics due to the different types of websites. In the following, we discuss the detailed mining process and the limitations of this approach.

4.1 Crawling
As seeds for our crawler, we use the Alexa 1M URL dataset.3 For each URL in the dataset, we try to access the website both with and without the www prefix. First, the landing page is downloaded and parsed using the Beautiful Soup Python package. We then search for hyperlinks with the texts Terms of Service, Terms of Use, Terms and Conditions, and Conditions of Use, and follow them to get to the respective terms-of-service pages. Levenshtein distance with a threshold of 0.75 is used to allow for spelling mistakes and different wording. The raw Hypertext Markup Language (HTML) content of the Terms-of-Service page is downloaded and stored for further processing. In case of an error, e.g., if the website is temporarily unreachable, we retry the same website 2 additional times before skipping it. The unprocessed dataset contains HTML code for roughly 74,000 websites. Note that due to limitations of the current crawler implementation, websites that rely on JavaScript to display content are not supported.

3 Available at: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

4.2 Section Extraction
Despite the fact that HTML is a structured format, it is a non-trivial task to extract text and hierarchies. The main reasons are that Web pages often contain a lot of boilerplate (e.g., navigational elements, advertisements, etc.), generally have heterogeneous appearances and implementations, and simply do not always conform to the HTML standard. Here, only a rough outline of the pipeline is given. For further reference, please refer to the implementation in our repository.

Boilerplate Removal. For boilerplate removal, we use the boilerpipe package by Kohlschütter et al. [20], which is based on shallow text features for classifying the text elements on a Web page. The result is an HTML page with all navigational elements, advertisements, and template code removed. Importantly, relevant hierarchical information is retained past this step.

HTML Cleanup. To deal with websites that do not conform to HTML standards, we perform several cleanup steps. This includes, for example, fixing mistakes such as text appearing without a corresponding paragraph (<p> tag), or incorrectly nested tags (e.g., section headings within a <p> tag). We fix such mistakes by adding missing tags and adjusting nested tags, similar to how a web browser would interpret the code.

Language Detection. Since the Alexa dataset also contains many non-English websites, we reject extracted terms of service where the majority of the text most likely has a language other than English. We use the langid Python package for detecting the language of each individual paragraph (<p> tag).

Extracting Hierarchy. To obtain the hierarchy, we split the document into smaller chunks. Splits are done in the following order: first we split on each section heading (<h1>-<h6> tags), then on bold text (<b> tag) starting with an enumeration pattern, then on enumerations (<li> tags), then on underlined text (<u> tag) starting with an enumeration pattern, and lastly on regular text (<p> tag) starting with an enumeration pattern. To prevent spurious splits, each criterion is only used if there are at least 5 occurrences within the document. Each time the document is split, we save the corresponding headings, which then form the hierarchy. As enumeration patterns we recognize Latin numbers, roman numerals, and letters, optionally prefixed with Part, Section, or Article. The majority of documents contain at most two levels of section hierarchy.
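The enumeration-pattern check used for these splits can be illustrated with a short Python sketch. This is a minimal illustration only: the names (ENUM_PATTERN, starts_with_enumeration) and the exact regular expression are assumptions for exposition and do not reproduce the implementation in our repository.

import re

# Hypothetical sketch of the enumeration-pattern check described above.
# Recognizes Latin numbers, roman numerals, or single letters, optionally
# prefixed with "Part", "Section", or "Article", followed by "." or ")".
ENUM_PATTERN = re.compile(
    r"^\s*(?:(?:Part|Section|Article)\s+)?"      # optional prefix
    r"(?:\d+|[IVXLCDM]+|[A-Za-z])[\.\)]\s+",     # Latin number, roman numeral, or letter
    re.IGNORECASE,
)

def starts_with_enumeration(text):
    """Return True if a text chunk begins with an enumeration marker."""
    return ENUM_PATTERN.match(text) is not None

# Example usage:
# starts_with_enumeration("1. Limitation of Liability")  -> True
# starts_with_enumeration("Section IV) Termination")     -> True
# starts_with_enumeration("We may update these terms.")  -> False

A check of this kind can be applied to the text of <b>, <li>, <u>, and <p> elements to decide whether a split should be triggered at that element.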


Table 1: Top 10 section topics by document frequency. Additionally, the number of associated paragraphs is given.

Topic Label                Document Frequency   Paragraph Frequency
limitation of liability    21,317               68,517
indemnification            16,698               25,683
law and jurisdiction       15,113               29,790
links to other websites    13,752               24,727
termination                12,855               33,978
warranty                    9,926               41,403
privacy                     8,958               25,022
disclaimer                  8,575               29,265
general terms               7,936               54,693

4.3 Data Set Statistics
In addition to the full dataset, we provide a cleaned subset for which we manually grouped sections into distinct topics based on similar spelling or meaning. We manually merged 554 section titles, which corresponds to all titles with at least 250 distinct occurrences in the corpus. After merging, 82 topics were obtained, and only sections that have at least one of these aliases as a heading were kept. The dataset contains different levels of section hierarchy. For our work, we group document content into top-level sections only; any further hierarchies are discarded, but are present in the raw data and available for future work. After removing documents without any valid sections belonging to the predefined 82 topics, we are left with approximately 43,000 documents for the same section prediction task, and around 40,000 documents for our paragraph-level setup. We randomly split the data 80/10/10 into train, validation, and test sets. The average number of sections per document is 6.56, and each document consists of 22.32 paragraphs on average, which results in a mean of 2.92 paragraphs per section. Table 1 shows the top 10 section topic labels. The average number of paragraphs per section varies between different topics.

[Figure 3: schematic of two example documents (Doc 1, Doc 2) divided into sections and paragraphs, illustrating the Section (S), Random Paragraph (RP), and Consecutive Paragraph (CP) sampling setups.]
Figure 3: Visualization of three distinct setups for same topic classification, where the example chunk of text is depicted in dashed purple, a positive sample in solid green, and a negative sample in dotted red. Each vertical line is a sentence, and their grouping shows a paragraph; different sections in the document are shown in different colors, denoting the same topic across all paragraphs of the same section. From left to right: Same Section prediction, Random Paragraph, and Consecutive Paragraph sampling.

5 EVALUATION
We demonstrate the capabilities of transformer-based architectures for topical change detection using a dataset consisting of online Terms-of-Service (ToS) documents, which was discussed above. Results are compared for the introduced IID STP task (see Section 3.1) as well as in a downstream comparison of text segmentation results to a range of baselines and existing methods. Results show a large improvement in performance for all transformer-based models.

5.1 Evaluation of Models
We compare our methods against a range of baselines, including averaging over Global Vectors (GLV𝑎𝑣𝑔) [33], tf-idf vectors (tf-idf), and Bag of Words (BoW) [16]. For transformer language models, we evaluate the standard [CLS] sequence classification with roberta-base (Ro-CLS). For Sentence-Transformers [36] we use the Siamese transformer setup with a variant of roberta-base (ST-Ro) and an additional model that has been pre-trained on NLI sentence similarity tasks (ST-Ro-N) to investigate the performance of further pretraining.
Transformer models are trained using the popular HuggingFace transformers library [44] for the [CLS] models, as well as the sentence-transformers package from [36] for the Siamese variants. We use two Nvidia Titan RTX GPUs for training, and each model variant has been trained with five different random seeds. Details of the training parameters can be found in our public repository. Due to the length limitation of 512 tokens, we employ an iterative truncation strategy for two-sentence inputs. Due to the coupling of two transformers for the Sentence-Transformers, the input size doubles, accepting a total input of 1024 tokens.

5.2 Prediction Tasks
As previously introduced, we train models with an independent classification setup, which is generally much faster than more complicated hierarchical sequential models. Specifically, we highlight the differences in the setup for the same section prediction task compared to the two paragraph-based methods. We point out that the prediction accuracy results depicted in Table 2 are not directly comparable between sampling methods, as they generate different development and test sets based on the employed sampling strategy. We show in the subsequent section, however, that downstream performance for the text segmentation is in line with the results on topic prediction. Figure 3 visualizes the different


sampling strategies. In the following, we describe each strategy in detail and highlight their differences. Across all strategies, we added three positive and three negative samples for each individual section/paragraph.

Table 2: Prediction accuracy for the independent topic prediction tasks, Same Topic Prediction (STP), Random Paragraph (RP), Consecutive Paragraph (CP) with different sampling strategies. Standard deviation is reported over 5 runs and the best model on each respective set is depicted in bold.

            GLV𝑎𝑣𝑔       tf-idf       BoW          Ro-CLS       ST-Ro        ST-Ro-N
STP  Dev    89.70 ±.07   82.10 ±.05   50.94 ±.33   96.42 ±.52   96.38 ±.03   96.39 ±.03
STP  Test   90.01 ±.06   82.54 ±.07   51.05 ±.51   96.58 ±.52   96.45 ±.06   96.46 ±.02
RP   Dev    76.63 ±.04   70.94 ±.07   50.34 ±.04   57.63 ±10.4  87.50 ±.13   87.39 ±.08
RP   Test   76.16 ±.06   70.41 ±.09   50.31 ±.37   57.48 ±10.2  87.19 ±.64   86.88 ±.11
CP   Dev    77.64 ±6.6   74.94 ±.11   56.34 ±.83   89.63 ±.12   91.17 ±.05   91.12 ±.04
CP   Test   78.63 ±6.8   76.17 ±.07   56.58 ±1.1   90.34 ±.08   91.17 ±.04   91.69 ±.02

5.2.1 Section (S) Topic Prediction. In this setup, we use sections as input chunks to the transformer classifiers. The section task showcases how different levels of granularity can affect the prediction results. Specifically, the extremely long input sequences test the limits of what transformers can predict from partial observations, since the majority of inputs will be heavily truncated. To ensure an equal distribution of samples from within the same and different sections, we match each section with three samples from the same topic and three from different topics. The positive and negative sections can be sampled from a different document; the important point is that the positive samples come from the same topic and the negative samples from different ones. The first column of Figure 3 visualizes the section sampling, where the first section of 𝐷𝑜𝑐 1 is paired with the second section of 𝐷𝑜𝑐 2 to form a positive sample and with the first section of 𝐷𝑜𝑐 2 to form a negative sample, respectively. The same strategy is employed for the generation of the development and test sets.
Despite the constraints with respect to the input length, we find that all transformers perform at a near-perfect level (compare Table 2). Comparing these results to the already very well-performing baselines, we suspect that certain keywords give away similar sections, but highlight the fact that the explicit representation of different topics is not given during training in the binary classification task, which makes this a suitable method for dealing with imbalanced topics.

5.2.2 Random Paragraph (RP) Topic Prediction. In contrast to the section-level task, we revert to a more fine-grained distinction of paragraphs in a text. In the Random Paragraph setting, we still generate samples similarly, meaning we include three paragraphs from a random document with the same topic and three negative samples from random paragraphs with different topics. The main difference between the Section prediction and the Random Paragraph setting is in the level of granularity and not in how the samples are chosen. The second column of Figure 3 highlights this difference, where the samples are paragraphs inside the sections rather than entire sections. Paragraph-based sampling is closer to our inference setup, where each input document is considered one paragraph at a time. However, results show a sharp drop in performance, which can come from the much narrower context of the paragraphs, as well as a differing selection of test samples compared to the section task. Only the BoW model seems to be largely unaffected, which is simply due to its low performance in either setting.

5.2.3 Consecutive Paragraph (CP) Topic Prediction. To boost performance and account for the coherent structures in the text, we employ a sampling strategy inspired by Ein Dor et al. [12]. For their triplet loss, samples are generated inside the same document only, which can be translated into sampling from intra-document paragraphs. Note that this strategy also no longer requires any merging and annotation of topics across documents, as all relevant information is now contained within a single document. This fact opens up a much larger generation of training data, which we omit in our current work for the sake of comparability with the RP model. To generate samples, we look at all paragraphs of a section and pair them as positive samples. Negative samples are picked from paragraphs of different sections in the same document. The third column of Figure 3 depicts the consecutive paragraph setup, where the samples are limited to paragraphs of 𝐷𝑜𝑐 1. Note that despite their similar setup, the results of RP and CP runs in Table 2 are not evaluated on the same test set and thus are not comparable, since the test sets are each generated with the respective sampling strategy (RP or CP) as well. However, we are able to compare their downstream performance on the subsequently introduced text segmentation task (see Section 5.3 and Table 3).

The results of the different sampling strategies, along with the performance of the baselines, are shown in Table 2, where the transformer-based models all outperform the baselines by a significant margin. Among the baselines, BoW has the worst performance overall, with accuracy close to random, showing that distinct word occurrences are not a sufficient indicator. Average GloVe has the best performance of all baselines, but is still behind the transformers by a large margin. Although the NLI-pretrained SRoBERTa model (ST-Ro-N) achieves better scores than the base model (ST-Ro) for most setups, the difference is insignificant, indicating that the pre-training on sentence similarity tasks does not directly influence our topic prediction setup.
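As an illustration of the sampling strategies, the following minimal Python sketch generates labelled training pairs for the Consecutive Paragraph (CP) setup from a single document. It is a simplified illustration under stated assumptions: the function and parameter names are hypothetical and do not correspond to the exact implementation in our public repository.

import itertools
import random

def cp_pairs(document, n_neg_per_pos=1, seed=0):
    """Generate (chunk_a, chunk_b, label) pairs from one document.

    `document` is a list of sections, each section a list of paragraph strings.
    Label 1 = paragraphs from the same section (same topic), label 0 = different sections.
    A hypothetical simplification of the CP sampling described in Section 5.2.3.
    """
    rng = random.Random(seed)
    pairs = []
    for sec_idx, section in enumerate(document):
        # Positive samples: all paragraph pairs within the same section.
        for p_a, p_b in itertools.combinations(section, 2):
            pairs.append((p_a, p_b, 1))
            # Negative samples: pair the paragraph with paragraphs drawn
            # from other sections of the same document.
            other = [p for j, sec in enumerate(document) if j != sec_idx for p in sec]
            for p_neg in rng.sample(other, min(n_neg_per_pos, len(other))):
                pairs.append((p_a, p_neg, 0))
    return pairs

# Example usage with a toy document of two sections:
doc = [["Liability is limited.", "No warranty is given."],
       ["We may terminate accounts.", "Termination is effective immediately."]]
print(cp_pairs(doc)[:4])

The Random Paragraph (RP) setup differs only in that positive and negative paragraphs are drawn from other documents sharing (or not sharing) the annotated topic, rather than from within the same document.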


Table 3: Boundary error rate 𝑃𝑘 for compared models (lower is better), based on sampling strategies Random Paragraph (RP), Consecutive Paragraph (CP) and their Ensemble variants RP𝐸𝑛𝑠 and CP𝐸𝑛𝑠, respectively. Ensemble ("Ens") predictions are obtained by majority voting over model runs.

                 RP            CP            RP𝐸𝑛𝑠     CP𝐸𝑛𝑠
GLV𝑎𝑣𝑔          29.97 ±.09    26.23 ±6.2    29.55      23.06
tf-idf           39.87 ±.24    29.70 ±.28    39.36      28.60
BoW              45.76 ±.67    43.46 ±1.5    46.20      41.80
Random Oracle    35.08 ±.15    -             31.88      -
GraphSeg         -             32.48 ±.46    -          32.28
WikiSeg          -             48.29 ±.30    -          48.29
Ro-CLS           37.26 ±4.8    15.15 ±.00    41.15      15.15
ST-Ro            15.72 ±.11    14.06 ±.14    14.62      13.14
ST-Ro-N          15.97 ±.14    13.97 ±.19    14.81      12.95
Ens consec       -             -             -          12.50

Figure 4: Mistake rate of per-model ensembles, where the suffix CP indicates the consecutive paragraph sampling and RP the random paragraph sampling for each model. The baselines are Rand Oracle (Random Oracle), GLV𝑎𝑣𝑔 (average GloVe vectors), tf-idf, and BoW (Bag-of-Words). Ro-CLS is the fine-tuned CLS token for RoBERTa, and the Sentence-Transformer models are ST-Ro and ST-Ro-N, where the latter is pre-trained on the NLI task. The best performing model is Ens All (the ensemble of all models).

5.3 Text Segmentation
By generating a text segmentation over the paragraphs of a full document, the independent prediction results from the previous section can now be compared across several approaches. Specifically, we compare the paragraph-based training methods CP and RP. As an evaluation metric, we follow the related literature and adopt the 𝑃𝑘 metric introduced by Beeferman et al. [1], which is the error rate of two segments 𝑘 sentences apart being classified incorrectly. We use the default window size of half the document length for our evaluation, again following related work. Furthermore, we count the number of explicit misclassifications, and use the accuracy 𝑎𝑐𝑐𝑘 of "up to 𝑘 mistakes per document" as an evaluation metric. Due to the coarser nature of paragraphs and the lower number of predictions per document compared to sentence-level segmentation, this is a more illustrative metric. It also relates to the "exact match" metric EM𝑜𝑢𝑡𝑙𝑖𝑛𝑒 employed by Zhang et al. [45], where 𝑎𝑐𝑐0 = EM𝑜𝑢𝑡𝑙𝑖𝑛𝑒.
Here, we also include the performance of related works where public and up-to-date code repositories are available. Specifically, we compare to the unsupervised segmentation algorithm GraphSeg [13], and the supervised model by Koshorek et al. [21], which we dub "WikiSeg". Both approaches operate on the sentence level, though, and predictions have to be translated back to the paragraph level for the comparison of results. We train each model with the suggested parameters in their publicly available repositories. For an additional pseudo-sequential baseline, we use an informed random oracle that has a priori information on the number of topics in the document, and samples from a distribution with adjusted probability 𝑃("next section") = #𝑠𝑒𝑐𝑡𝑖𝑜𝑛𝑠/#𝑝𝑎𝑟𝑎𝑔𝑟𝑎𝑝ℎ𝑠. Note that no additional parameters are learned for any model, and predictions are binarized with a simple 0.5 threshold over the same topic predictions. We provide ensembling results for the majority voting decisions by the five seed runs of each model variant (Ens), which provides further improvements. Best results are obtained by ensembling all consecutive transformer-based methods (Ens consec).
Table 3 shows the results of the evaluation, where one can see that the results of the sequential segmentation are directly linked to the performance on the independent classification task seen in Table 2. To verify our initial assumption of cross-document comparability of content from similar sections, we make the following observations: (𝑖) Evaluation performance for the STP setup is consistent for both training strategies (RP and CP) when using Sentence-Transformer models (see Table 2). (𝑖𝑖) Similarly, both CP- and RP-trained Sentence-Transformer segmentations achieve results within 2 percentage points of the respective 𝑃𝑘 scores. (𝑖𝑖𝑖) In general, the CP training setup yields slightly better 𝑃𝑘 scores, likely because the intra-document dependencies are captured better with this sampling strategy, which is a more appropriate sampling for the segmentation task. (𝑖𝑣) We find convergence problems for RP training with the [CLS] models, as well as the tf-idf model. Due to the size of our general training corpus, we therefore conclude that it is realistic to expect topical similarity within a section, even across documents. However, due to the seemingly inconsistent convergence of RP models, we caution against blindly using this strategy, especially when dealing with more heterogeneous corpora. The oracle baseline performs unexpectedly better than both tf-idf and BoW, indicating that additional information about the sections of a document can greatly boost task performance, which might be relevant for future work. Additional pre-training of ST models (ST-Ro-N) does not


show any significant improvement over the standard ST-Ro models.

To our surprise, sentence-based implementations (GraphSeg and WikiSeg) show significantly lower performance, and fall even behind the simpler baselines. For GraphSeg, an unsupervised segmentation approach, the lack of explicit training on the different granularity seems to significantly prevent correct predictions on longer segments. WikiSeg heavily preprocesses the data and discards many samples, thus significantly shrinking the training set. Since performance on the reduced training set is still decent, this indicates that training a network from scratch is not suitable with the smaller training set of a reduced corpus and tends to overfit. We expect a significant increase in performance if the training were instead performed without such strict preprocessing criteria, or by continuing fine-tuning on pre-trained weights from a paragraph-level WikiSeg model. For either baseline model, it is also important to note that these models predict on the entirety of the sequence, which theoretically allows information sharing between different sections in the current sample. However, they show no improvement over our binary prediction setup, which does not share this information. It would be of interest to compare results to sequential transformer-based architectures, such as those used by Glavas et al. [14]. However, their model again requires training from scratch, which has proven to be inconsistent in our experiments with WikiSeg.

Lastly, the plots of 𝑎𝑐𝑐𝑘 for various models in Figure 4 indicate a correlation between the 𝑎𝑐𝑐𝑘 and 𝑃𝑘 measures, which does not apply to sentence-level segmentations. Overall, the best-performing ensembles classify around 25% of documents without any mistake (𝑎𝑐𝑐0), and around 70% with less than three mistakes (𝑎𝑐𝑐2) over the entire document. We therefore suggest 𝑎𝑐𝑐𝑘 as an interpretable addition to the classic evaluation of segmentation approaches when dealing with paragraph-level segmentations.

6 CONCLUSION AND FUTURE WORK
Despite a multitude of previous works, structural text segmentation methods have always focused on very finely segmented text chunks in the form of sentences. In this work, we have shown that a relaxation of this problem to coarser text structures reduces the complexity of the problem, while still allowing for semantic segmentation. Further, we reformulate the oftentimes expensive-to-train sequential setup of text segmentation as a supervised Same Topic Prediction task, which reduces training time while allowing for a near-trivial generation of samples from automatically crawled text documents. To show the applicability of our method, we present a new domain-specific and large corpus of online Terms-of-Service documents, and train transformer-based models that vastly outperform a number of text segmentation baselines.
We are currently investigating the setup for deeper hierarchical sections, which our dataset already contains annotations for, to see whether such notions can also be picked up by an independent classifier and benefit a legal retrieval system. Also, the findings from our Consecutive Paragraph model already indicate that training requires no further information than the ground truth segmentation, which can generally be inferred from structured input formats, such as HTML or XML, making this an attractive option for a larger-scale study of cross-domain document collections. Finally, an interface built on top of our framework, enabling users to judge the usefulness of segmentation for legal use cases, such as a collection of documents from mergers and acquisitions, could be used to determine the efficacy of our improved segmentation.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their insightful comments.

REFERENCES
[1] Doug Beeferman, Adam L. Berger, and John D. Lafferty. 1999. Statistical Models for Text Segmentation. Mach. Learn. 34, 1-3 (1999), 177–210.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.
[3] Luther Karl Branting. 2017. Automating Judicial Document Analysis. In Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 16th International Conference on Artificial Intelligence and Law (ICAIL 2017), London, UK, June 16, 2017 (CEUR Workshop Proceedings, Vol. 2143), Kevin D. Ashley, Katie Atkinson, Luther Karl Branting, Enrico Francesconi, Matthias Grabmair, Marc Lauritsen, Vern R. Walker, and Adam Zachary Wyner (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2143/paper2.pdf
[4] Thorsten Brants, Francine Chen, and Ioannis Tsochantaridis. 2002. Topic-based Document Segmentation with Probabilistic Latent Semantic Analysis. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA. ACM, 211–218.
[5] Harr Chen, S. R. K. Branavan, Regina Barzilay, and David R. Karger. 2009. Global Models of Document Structure using Latent Permutations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, 2009, Boulder, Colorado, USA. 371–379.
[6] Freddy Y. Y. Choi. 2000. Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (Seattle, Washington) (NAACL 2000). ACL, USA, 26–33.
[7] Freddy Y. Y. Choi. 2000. Advances in Domain Independent Linear Text Segmentation. In 6th Applied Natural Language Processing Conference, ANLP, Seattle, Washington, USA, 2000. ACL, 26–33.
[8] Jack G. Conrad, Khalid Al-Kofahi, Ying Zhao, and George Karypis. 2005. Effective Document Clustering for Large Heterogeneous Law Firm Collections. In The Tenth International Conference on Artificial Intelligence and Law, Proceedings of the Conference, June 6-11, 2005, Bologna, Italy, Giovanni Sartor (Ed.). ACM, 177–187. https://doi.org/10.1145/1165485.1165513
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Minneapolis, USA, 2019, Volume 1. 4171–4186.
[10] Satya Dharanipragada, Martin Franz, J. Scott McCarley, Salim Roukos, and Todd Ward. 1999. Story Segmentation and Topic Detection for Recognized Speech. In Sixth European Conference on Speech Communication and Technology, EUROSPEECH 1999, Budapest, Hungary. ISCA.
[11] Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. 2019. Topic Modeling in Embedding Spaces. CoRR abs/1907.04907 (2019). arXiv:1907.04907
[12] Liat Ein Dor, Yosi Mass, Alon Halfon, Elad Venezian, Ilya Shnayderman, Ranit Aharonov, and Noam Slonim. 2018. Learning Thematic Similarity Metric from Article Sections Using Triplet Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). ACL, Melbourne, Australia, 49–54.
[13] Goran Glavas, Federico Nanni, and Simone Paolo Ponzetto. 2016. Unsupervised Text Segmentation Using Semantic Relatedness Graphs. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, *SEM@ACL, Berlin, Germany, 2016. The *SEM 2016 Organizing Committee.
[14] Goran Glavas and Swapna Somasundaran. 2020. Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation. CoRR abs/2001.00891 (2020). arXiv:2001.00891
[15] Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. 2004. Integrating Topics and Syntax. In Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, 2004, British Columbia, Canada]. 537–544.
[16] Zellig S. Harris. 1954. Distributional Structure. WORD 10, 2-3 (1954), 146–162.
[17] Marti A. Hearst. 1994. Multi-Paragraph Segmentation of Expository Text. In 32nd Annual Meeting of the Association for Computational Linguistics, 1994, Las Cruces,


New Mexico, USA, Proceedings. ACL, 9–16.
[18] Marti A. Hearst. 1997. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Comput. Linguist. 23, 1 (March 1997), 33–64.
[19] Thomas Hofmann. 2017. Probabilistic Latent Semantic Indexing. SIGIR Forum 51, 2 (2017), 211–218.
[20] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate Detection Using Shallow Text Features. In Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM, New York, USA 2010. ACM, 441–450.
[21] Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant. 2018. Text Segmentation as a Supervised Learning Task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, 2018, Volume 2 (Short Papers). ACL, 469–473.
[22] Hideki Kozima. 1993. Text Segmentation Based on Similarity between Words. In 31st Annual Meeting of the Association for Computational Linguistics, 1993, Ohio State University, Columbus, Ohio, USA, Proceedings. ACL, 286–288.
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692
[24] Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, and William Keenan. 2011. Legal document clustering with built-in topic segmentation. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, Craig Macdonald, Iadh Ounis, and Ian Ruthven (Eds.). ACM, 383–392. https://doi.org/10.1145/2063576.2063636
[25] Alex Lyte and Karl Branting. 2019. Document Segmentation Labeling Techniques for Court Filings. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 17th International Conference on Artificial Intelligence and Law (ICAIL 2019), Montreal, QC, Canada, June 21, 2019 (CEUR Workshop Proceedings, Vol. 2385), Kevin D. Ashley, Katie Atkinson, Luther Karl Branting, Enrico Francesconi, Matthias Grabmair, Bernhard Waltl, Vern R. Walker, and Adam Zachary Wyner (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2385/paper5.pdf
[26] Igor Malioutov and Regina Barzilay. 2006. Minimum Cut Model for Spoken Lecture Segmentation. In ACL, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 2006. ACL.
[27] Eneldo Loza Mencía. 2009. Segmentation of legal documents. In The 12th International Conference on Artificial Intelligence and Law, Proceedings of the Conference, June 8-12, 2009, Barcelona, Spain. ACM, 88–97. https://doi.org/10.1145/1568234.1568245
[28] Nada Mimouni. 2013. Modeling Legal Documents as Typed Linked Data for Relational Querying. In Proceedings of the First JURIX Doctoral Consortium and Poster Sessions in conjunction with the 26th International Conference on Legal Knowledge and Information Systems, JURIX 2013, Bologna, Italy, December 11-13, 2013 (CEUR Workshop Proceedings, Vol. 1105), Monica Palmirani and Giovanni Sartor (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-1105/paper6.pdf
[29] Hemant Misra, François Yvon, Olivier Cappé, and Joemon M. Jose. 2011. Text Segmentation: A Topic Modeling Perspective. Inf. Process. Manag. 47, 4 (2011), 528–544.
[30] Hemant Misra, François Yvon, Joemon M. Jose, and Olivier Cappé. 2009. Text segmentation via Topic Modeling: an Analytical Study. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM, Hong Kong, China, 2009. 1553–1556.
[31] Marie-Francine Moens. 2001. Innovative techniques for legal text retrieval. Artif. Intell. Law 9, 1 (2001), 29–57.
[32] Christopher E. Moody. 2016. Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. CoRR abs/1605.02019 (2016). arXiv:1605.02019
[33] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. 1532–1543.
[34] Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2019. Using Clustering Techniques to Identify Arguments in Legal Documents. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 17th International Conference on Artificial Intelligence and Law (ICAIL 2019), Montreal, QC, Canada, June 21, 2019 (CEUR Workshop Proceedings, Vol. 2385), Kevin D. Ashley, Katie Atkinson, Luther Karl Branting, Enrico Francesconi, Matthias Grabmair, Bernhard Waltl, Vern R. Walker, and Adam Zachary Wyner (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2385/paper2.pdf
[35] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language Models are Unsupervised Multitask Learners. (2018).
[36] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, Hong Kong, China. ACL, 3980–3990.
[37] Martin Riedl and Chris Biemann. 2012. How Text Segmentation Algorithms Gain from Topic Models. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, 2012, Montréal, Canada. ACL, 553–557.
[38] Martin Riedl and Chris Biemann. 2012. TopicTiling: A Text Segmentation Algorithm based on LDA. In Proceedings of the Student Research Workshop of the 50th Meeting of the Association for Computational Linguistics. Republic of Korea, 37–42.
[39] Masao Utiyama and Hitoshi Isahara. 2001. A Statistical Model for Domain-Independent Text Segmentation. In Association for Computational Linguistic, 39th Annual Meeting and 10th Conference of the European Chapter, Proceedings of the Conference, 2001, Toulouse, France. ACL, 491–498.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA. 5998–6008.
[41] Hanna M. Wallach. 2006. Topic modeling: Beyond Bag-of-Words. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML), Pittsburgh, Pennsylvania, USA, 2006 (ACM International Conference Proceeding Series, Vol. 148). 977–984.
[42] Hannes Westermann, Jaromír Savelka, Vern R. Walker, Kevin D. Ashley, and Karim Benyekhlef. 2020. Sentence Embeddings and High-Speed Similarity Search for Fast Computer Assisted Annotation of Legal Documents. In Legal Knowledge and Information Systems - JURIX 2020: The Thirty-third Annual Conference, Brno, Czech Republic, December 9-11, 2020 (Frontiers in Artificial Intelligence and Applications, Vol. 334), Villata Serena, Jakub Harasta, and Petr Kremen (Eds.). IOS Press, 164–173. https://doi.org/10.3233/FAIA200860
[43] Ross Wilkinson. 1994. Effective Retrieval of Structured Documents. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 1994 (Special Issue of the SIGIR Forum). ACM/Springer, 311–317.
[44] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019).
[45] Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, and Xueqi Cheng. 2019. Outline Generation: Understanding the Inherent Content Structure of Documents. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, Paris, France, 2019. 745–754.

Precedential Constraint: The Role of Issues
Trevor Bench-Capon and Katie Atkinson
Department of Computer Science, University of Liverpool
Liverpool, United Kingdom
{tbc,katie}@liverpool.ac.uk

ABSTRACT
Horty, Rigoni and Prakken have developed formal characterisations of precedential constraint based on dimensions and factors as introduced in HYPO and CATO. We discuss the relation between dimensions and factors and also describe the current models of precedential constraint based on factors, along with some criticisms of them. We argue that problems arise from ignoring the structure of legal cases that is provided by the notion of issues, and that seeing precedential constraint in terms of issues rather than whole cases provides a more effective approach and better reflects legal practice. The advantages of the issue based approach are illustrated with a concrete example. We then discuss how dimensions should be accommodated, suggesting that this is best done by seeing reasoning with legal cases as a two stage process: first factors are ascribed to cases and then factor based reasoning can be used to arrive at a decision. Thus precedential constraint can be described in terms of factors, dimensions being handled at the first stage. Both stages are constrained, in different ways, by precedents. We identify three types of precedent: framework precedents, which structure cases into issues, preference precedents, which resolve conflicts between opposing sets of factors within these issues, and ascription precedents, which constrain the mapping from facts to factors.

CCS CONCEPTS
• Applied computing → Law.

KEYWORDS
reasoning with precedents, factors, dimensions, issues

ACM Reference Format:
Trevor Bench-Capon and Katie Atkinson. 2021. Precedential Constraint: The Role of Issues. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466062

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06...$15.00
https://doi.org/10.1145/3462757.3466062

1 These formal accounts consider that a decision is constrained if any other decision would be inconsistent with past decisions. In practice this constraint may not be respected in given judicial settings. For a jurisprudential discussion see [46].

1 INTRODUCTION
Reasoning with precedent cases has been a central concern of AI and Law since the very beginning. At least three questions can be posed in relation to reasoning with precedent cases: (1) how do people reason with precedents? (2) can we use precedents to predict the outcome of new cases? and (3) can we formally characterise how precedents constrain future cases?
An early project addressing precedential reasoning was the HYPO project of Rissland and Ashley, introduced at the first ICAIL [43] and most fully described in [6]. HYPO modelled reasoning with precedents in the US Trade Secrets domain. It influenced a great deal of research by a number of different researchers, as discussed in [10], including the CATO system of Ashley and Aleven, introduced in [4] and most fully described in [3]. CATO also addressed US Trade Secrets. Both HYPO and CATO were concerned with the first of our questions: their goal was to show how arguments concerning new cases can be constructed on the basis of precedent cases, and how such arguments can be challenged by distinguishing the cited precedents. These systems presented arguments for and against particular decisions, but did not attempt to choose between them: that was left to the judgement of the user.
In contrast, systems based on rules, whether based on expert knowledge [49], or on a formalisation of legislation [47], or a combination of the two [8], were able to predict the outcome of a new case entered into the system, answering our second question. It was therefore a natural development to adapt systems such as CATO to offer predictions based on reasoning with precedents. This was done in the Issue Based Prediction system (IBP) [21], in which arguments generated from CATO were organised and evaluated so as to predict an outcome. Subsequently Grabmair further developed this approach to accommodate his value judgement formalism [24]. Predictions based on precedent continue to be implemented in both symbolic systems [2] and machine learning (ML) systems such as [35], which base their predictions on large collections of case decisions. Factor based reasoning is acquiring an important new role in explaining the predictions of ML systems (e.g. [19] and [38]).
The reasoning in HYPO and CATO was embodied in algorithms rather than expressed declaratively and so was not readily amenable to formalisation to address the third question. This situation was changed when Prakken and Sartor provided a means of expressing a case base of precedents as a set of rules and priorities between them [39]. The resulting rule base could then be deployed to predict the outcome of a new case. Further, this laid the foundations for the provision of a formal account of precedential constraint1. The work was begun by Horty [26], using a factor based representation taken from [6] and [3]. His approach was developed in [30] and extended by Rigoni in [41]. However, it became recognised that factors were not sufficient to capture all the necessary nuances of precedents: some aspects of cases can favour a party to different extents. The need to address dimensions was argued in [15] and


addressed by Horty in [27], [28] and [29] and by Rigoni in [42]. A comparison of the approaches of Horty and Rigoni is given in [37].
In this paper we will address the question of how precedents constrain decisions in new cases, and in particular identify how domain knowledge can complement the purely formal characterisations. Section 2 reviews the use of dimensions and factors in HYPO and CATO to clarify their different roles: whereas dimensions identify the aspects of cases which must be considered, factors record their legal significance in the particular case by identifying the party favoured by that aspect. Section 3 gives an overview of the formalisations of precedential constraint using factors. In Section 4 we show how these approaches can be improved by exploiting the structure found in legal cases. Section 5 considers how to accommodate dimensions, by considering precedential reasoning as a two stage process. First factors are ascribed on the basis of dimensional facts and then these factors supply the reasons to resolve the issues, and hence constrain the overall decision. Different precedents are relevant at each stage: some constrain the ascription of factors while others constrain the preferences between sets of factors.
The contributions of the paper are: improvement in the formal characterisation of precedential constraint, both in terms of effectiveness and in reflecting actual decisions, by applying it to issues rather than whole cases; clarification of the role of dimensions by articulating the reasoning process into two distinct stages; and identifying the need to recognise that precedents operate differently at the two stages. Throughout the paper we use examples from US Trade Secrets cases, the most widely discussed domain for reasoning with precedents in AI and Law: as well as HYPO and CATO it has been used in [21], [22], [2], [24], [13], [36], [50] and [38], among many others.

2 DIMENSIONS AND FACTORS
To relate formal work on precedential constraint to actual legal cases, it is important to have a clear understanding of factors and dimensions and the relationship between them. The terms have been used in different ways, but we will consider dimensions as used in HYPO and factors as used in CATO, discussed by Rissland and Ashley in [44]. This is the most common use, and HYPO and CATO were explicitly identified by Horty in [26] and [30] as the source of the factors used in his formal account of precedential constraint, which is the starting point for subsequent discussions of this topic. Moreover both HYPO dimensions and CATO factors resulted from thorough domain analyses. Most of the many systems addressing US Trade Secrets have taken both the analysis of the domain and the ascription of factors to cases from CATO [3].
In HYPO cases are represented as collections of facts (see Appendix B of [6]). There are thirteen implemented dimensions (Appendix F of [6]) which may be applicable to a case on the basis of these facts. In general a dimension can take a range of values, but in fact ten of the thirteen were two-valued. A list of HYPO's dimensions, summarising Appendix F of [6], is given in Table 1.
Dimensions identify the aspects of cases which need to be considered to see if they are applicable:
    Each dimension has prerequisites that must be satisfied in order for the dimension to be applicable. For example, the dimension Secrets-Voluntarily-Disclosed has as one of its prerequisites that the plaintiff made disclosures of confidential information to outsiders. ([44], p 67).
Although disclosures to outsiders may be a reason to find for the defendant, the lack of disclosures was never found to be used as a reason to find for the plaintiff in the analysed cases [44]: because the plaintiff is expected to take measures to protect the secret, simply refraining from disclosure seems not to strengthen the plaintiff's case. Thus this dimension is not applicable if no disclosures were made. Typically only a few dimensions will be applicable in any given case: in HYPO four or five is typical.

2.1 From Dimensions to Factors
Even if applicable, the value on the dimension may be such that it does not favour either party; the dimension may be neutral in the particular case. Applicable dimensions must be assessed for their legal significance for the particular case, that is, whether they favour a party, and if so, which one. This significance is shown by ascribing a factor to the case. A factor is present if the case lies within a range on a dimension which favours a particular side. At one end the dimension will either be inapplicable because it does not affect the strength of a side's case, or it will favour a particular side. Moving along the dimension we may enter a neutral area favouring neither side, and then an area which favours the other side. In practice many dimensions have only two points and either favour a particular side or are inapplicable. For a many-valued dimension, such as D3d, if sufficient disclosures to provide a reason for the defendant were made, then the corresponding factor (F10d) applies. It may be, however, that too few disclosures were made to favour the defendant (e.g. Emery v Marcan: "Even though parts drawings may on occasion have been shown to a limited number of outsiders for a particular purpose, this did not in itself necessarily destroy the secrecy which protected them."). Here no factor will apply (although the dimension remains applicable if any disclosures were made, because whether the factor should be ascribed needs to be considered). The point about neutrality is made in [44]:
    Note that CATO does not automatically treat the fact that a factor does not apply to a case as a strength for the opponent. ([44], pp 68-9).
As can be seen from Table 1, only one dimension, D13b, Security Measures, was seen as capable of favouring both sides.
    [...] the Security-Measures dimension was broken into two factors: Security-Measures [F6p], favoring the plaintiff, and No-Security-Measures [F19d], favoring the defendant. This was done because judges explicitly said that the fact that plaintiff had taken no security measures was a positive strength for the opponent. By contrast, Ashley and Aleven did not create a "No-Secrets-Disclosed-Outsiders" factor because they found no cases where judges had said that the absence of any disclosures to outsiders was a positive strength for the plaintiff. ([44], p 69).
Thus the security measures dimension is always applicable, although it is possible that neither F6p nor F19d is present: the plaintiff may have taken sufficient measures to prevent the lack of concern

13
Precedential Constraint: The Role of Issues ICAIL’21, June 21–25, 2021, São Paulo, Brazil

Table 1: Dimensions in HYPO and their corresponding CATO factors. Dimension and Factor IDs are D or F for dimension or
factor (factor numbers are those in CATO) followed by p, d, or b to indicate whether it can favour plaintiff, defendant or both.

Plaintiff Defendant
ID Dimension Values Number of Values Factors Factors
in CATO in CATO
Computed from
D1p Competitive Advantage Gained Many F8p
development time and cost
D2d Vertical Knowledge Vertical or technical 2 F11d
D3d Secrets Voluntarily Disclosed Number of Disclosures Many F10d F27d
D4d Disclosures Subject to Restriction Yes or No 2 F12p
D5p Agreement Supported by Consideration Something or Nothing 2
D6p Common Employee Paid to Change Employers Something or Nothing 2 F2d
D7p Exists Express Noncompetition Agreement Yes or No 2 F13p
D8p Common Employee Transferred Product Tools Something or Nothing 2 F7p
D9p Non-Disclosure Agreement Re Defendant Access Yes or No 2 F4p
D10d Common Employee Sole Developer Yes or No 2 F3d
D11d Non-disclosure Agreement Specific Yes or No 2 F5d
D12d Disclosure in Negotiations with Defendant Yes or No 2 F1d
D13b Security Measures Range of possible measures 8 F6p F19d

being a strength for the defendant, but without sufficient rigour the opinion, it played no role in the decision, and hence can safely
to be a reason to find for the plaintiff. Thus, although it is always be considered absent. The absence of a base level factor does not
relevant to consider the security measures taken, in many cases provide a reason for the other side, and so its absence will be men-
there will be no legal significance. Indeed many cases in CATO tioned only when its presence was considered but rejected because
[3] do not have either F6p or F19d. Of the thirteen dimensions the case fell into a neutral area on an applicable dimension.
in HYPO, ten, the two-valued dimensions, are either inapplicable In addition to the fourteen factors derived from the HYPO di-
or favour a particular side. Of the three multi-valued dimensions, mensions in Table 1, CATO introduced another twelve factors. This
two are considered, if applicable, to be either neutral or capable of is because CATO analysed considerably more cases than HYPO
favouring only one party (defendant for disclosures, and plaintiff and seems to have included more cases questioning whether the
for competitive advantage). Only security measures is capable of information was a trade secret rather than whether there was a
favouring either side, or being neutral. confidential relationship. In Table 2 we have related these additional
Note, however, that in one case, disclosures to outsiders, there are factors to dimensions in the manner of Table 1. These additional
two pro-defendant factors associated with the dimension. As well factors can be accommodated in seven dimensions, only one of
as F10d, SecretsDisclosedOutsiders, we also have F27d DisclosureIn- which, D14b, has both plaintiff and defendant factors. Three have
PublicForum. This is because F27d provides a much stronger reason multiple factors for the same side. Four are two-valued. The mix is
for the defendant than F10d, so that it might be that a plaintiff similar to that found in HYPO, and so may be considered typical.
factor such as F12p OutsiderDisclosuresRestricted would defeat F10d That the additional cases analysed by CATO led to additional
but not F27d. Thus a dimension may give rise to multiple factors factors and dimensions is an indication of how precedent cases are
favouring the same side. the source of dimensions and factors. The opinions in precedents
This understanding of dimensions and factors shows why it is show what aspects of the cases judges thought relevant, and what,
a mistake to speak of the “negations” of base level factors, as in if any, significance they accorded to them in that case. Any given
some recent formally oriented approaches (e.g. [50], [37]). CATO case will only have a few applicable dimensions, and so will only
used two distinct factors for the rare case where a dimension could contain a small subset of possible factors. Therefore as we analyse
favour either side. Moreover, if a factor is absent, a different factor more cases we are likely to encounter more dimensions and more
favouring that side may be present, as with disclosures. Thus the distinctions and hence more factors.
absence of F10d might mean that no disclosures had been made, so
that the dimension was inapplicable; that too few disclosures had
been made, meaning that the dimension was not legally significant
in this case; or that disclosures had been made in a public forum, 2.2 Arguing with Factors
so that the stronger F27d was present. There seems little sense in HYPO and CATO were not concerned with determining or pre-
wrapping these three quite different notions under the “negation” dicting outcomes, but rather the identification of arguments for
of F10d. Nor is negation needed to distinguish cases where a base the two parties. These arguments were organised in the “three ply”
level factor is known absent from those where there is no infor- structure common in law (e.g. US Supreme Court Oral Argument
mation about that factor. If a base level factor is not mentioned in and witness testimony which follows the initial questions with a
cross examination and a redirect). In this structure an outcome is

14
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Bench-Capon and Atkinson

Table 2: Factors Introduced in CATO Organised into Dimensions. See Table 3 for factor names.

Plaintiff Defendant
ID Dimension Values Number of Values Factors Factors
in CATO in CATO
D14b Use Of Available Information Various types of use Many F14p F16d F25d F17d
D15p Similarity Of Products Degrees of similarity Many F15p F18p
D16d Availability Of Information Various forms of availability Many F20d F24d
D17p Invasive Techniques Yes or No 2 F22p
D18p Obtained by Deception Yes or No 2 F26p
D19d Confidentiality Waived Yes or No 2 F23d
D20p Knew Confidential Yes or No 2 F21p

proposed, a response made by the other side followed by a rebut- means of expressing precedent cases as a sets of rules [39]. Since
tal from the original side. For reasoning with precedents with the the factors for the plaintiff provide a reason to find for the plaintiff
proponent arguing for the plaintiff, these three plies in CATO are: and the factors favouring the defendant a reason to find for the
(1) Cite the precedent case with a decision for the desired side defendant, the decision in the case can be seen as expressing a pref-
which has the most factors in common and fewest distin- erence for one of these reasons. The conjunction of all the factors
guishing factors compared with the current case. The side for a side is the strongest reason for that side, so the precedent can
favoured by the factors does not matter. be modelled as a set of three rules expressing that the strongest
(2) The opponent may distinguish the cited case. Typically the reason for the winner was preferred to the strongest reason for the
new case will not contain exactly the same factors as the loser. Where the case comprises a set of factors 𝑃 ∪ 𝐷 where 𝑃 is
precedent. Some of these differences will make the case the set of plaintiff factors and 𝐷 the set of defendant factors, the
stronger for the plaintiff: plaintiff factors in the current case three rules are:
but not the precedent, and defendant factors in the precedent r1: 𝑃 → plaintiff; r2: 𝐷 → defendant;
but not the current case. The defence will be wise to remain r3: 𝑟 2 ≺ 𝑟 1 if the decision was for the plaintiff and 𝑟 1 ≺ 𝑟 2 if
silent as to these differences. If, however, the precedent con- the decision was for the defendant.
tains plaintiff factors not in the current case, or defendant
factors in the current case but not the precedent, the differ- If we represent all the precedents in the domain using this tech-
ences may be significant and so provide an argument not to nique we can build a logcal theory representing our case base of
follow the cited precedent2 . precedents. If we are given a new case, we can see whether the rules
(3) The proponent may now attempt a rebuttal: downplaying apply to it, and if so whether an outcome is determined by the cur-
distinctions by citing factors favouring the plaintiff (the dif- rent theory. A distinction will mean that the winner’s rule does not
ferences the defendant could not use in the second ply). apply or that the loser may have a stronger rule. This representation
Assuming that the opponent was able to make some distinctions was used in [9] in which the possible sets of plaintiff factors were
in the second ply, it is now up to the user to decide whether, given represented as a partial order, the possible sets of defendant factors
the rebuttal, the distinctions are of sufficient weight to merit an were represented as a partial order, and the precedents as ordering
outcome different from the precedent case. relations between these two partial orders. The nodes contain all
This method of arguing with precedents is the basis of the formal the possible antecedents for plaintiff and defendant rules and the
characterisations of precedential cases discussed in the next section. arcs show the priorities between particular rules. The example from
Precedents are converted into sets of rules with conjunctions of [9] is shown in Figure 1. Deciding a new case is now a matter of
factors as antecedents. These rules constrain a new case if there is a adding an arc between the two relevant nodes representing the
rule applicable to the new case which finds for a particular side (ply factors in the new case and deciding which way the arrow should
1) which cannot be distinguished (ply 2), and which is preferred to point. The constraint is that the arrow should not introduce a cycle
any applicable rule favouring the other side (ply 3). since this would introduce an inconsistency to the case base. Thus
a case which could introduce a cycle is constrained, but if no cycle
3 MODELS OF PRECEDENTIAL CONSTRAINT can result, the judge is free to decide either way.
This idea was refined and presented in a more rigorous way by
HYPO and CATO were realised as programs, with the knowledge
Horty in [26] and further refined in [30]. Horty was interested in
represented as particular data structures (e.g. case frames in HYPO),
modelling two different accounts of precedential constraint from
and the operation of the reasoning defined in terms of algorithms
the jurisprudence literature. One is a very strict version, for which
manipulating these structures (the algorithms for CATO are given
Horty cites [5]. Here any distinction between the precedent and the
in Appendix 3 of [3]). As such, reasoning with cases was not readily
current case is enough to allow the judge to come to a different deci-
amenable to logical analysis until Prakken and Sartor provided a
sion. This version, which corresponds to Figure 1, is now normally
2 In HYPO and CATO the opponent can also cite counter examples in this ply, but we termed the results model in AI and Law [37]. This model encodes
will not discuss counter examples in this paper. precedents as rules in the same way as [39] and [9]. Any weakening

15
Precedential Constraint: The Role of Issues ICAIL’21, June 21–25, 2021, São Paulo, Brazil

The first problematic sort of cases are those in which


the court makes a decision on the basis of multiple le-
gal rules, each of which would be sufficient for the de-
cision. These are sometimes known as cases with alter-
native holdings” [34] ... Consider Newport Yacht Basin
Ass’n ... the defendant prevailed and was awarded
attorney’s fees “based upon a prevailing party provi-
sion of a purchase and sale agreement, a contractual
indemnity provision, and principles of equitable in-
Figure 1: Plaintiff and Defendant lattices with two prece- demnity” (NYBA 2012, p. 75). According to the trial
dents from [9]. Precedent 1 was [A,B,C,D,E] found for court judge, each of these was sufficient to justify the
plaintiff and Precedent 2 was [A,D,E] found for defendant. awarding of attorney’s fees. ([41], p 141).
In such cases it is impossible to choose between the alternative
holdings to give the rule to represent the case in the theory. This,
however, is not an insurmountable problem: we can simply include
of the plaintiff’s case or strengthening of the defendant’s case is
all of these rules in our theory, each with a preference over the rule
enough to prevent a precedent for the plaintiff from being followed
for the other side in the case.
(and vice versa). Although this model provides an unchallengeable
Rigoni’s second observation is that not all precedents can be seen
constraint when the rule is applicable, the standard required for
as expressing a preference between reasons. Some cases instead
the rule to be applicable is rather high. Even with only 3 factors for
lay out a method for considering cases of a particular type. Rigoni
each side we have 64 possible comparisons, and if we have the 13
termed these framework precedents, and used Lemon v. Kurtzman
factors for each side from CATO we have 226 (67,108,864) compar-
(1971) as his example:
isons of sets of plaintiff factors with sets of defendant factors [1].
Of course, an example of every comparison is not needed, but still, In that case the US Supreme Court addressed the ques-
given the number of possible distinctions, expecting enough prece- tion of whether Pennsylvania’s and Rhode Island’s
dents to provide significant guidance is unrealistic. This suggests statutes that provided money to religious primary
that the result model does not offer us enough constraints from schools subject to state oversight violated the Estab-
the precedents: rarely will the match be exact enough to constrain lishment Clause of the First Amendment. The court
the decision. Horty therefore used an alternative model, which he introduced a three-pronged test and ultimately
termed the reason model and attributed to [31]. Horty noted that ruled that both programs did violate the Establish-
although the full set of factors for the plaintiff represented the best ment Clause. ([41], p 142. Emphasis ours).
reason to decide for the plaintiff, a subset of these factors may well In Lemon it is not the balance of factors that led to the decision that
have been sufficient to defeat the defendant reason without the are instructive, but the tests themselves. These tests will be applied
additional factors. Thus instead of representing the winner’s rule in future cases, providing a set of issues that give a framework for
by the full set of factors favouring the winner, the winner could deciding future cases. Framework precedents did not arise in CATO
be represented by a subset of these factors, which can be seen as because the Restatements of Torts4 provides just such a framework
the reason for the case. Thus in the first precedent of Figure 1, we of issues, which formed the basis of the factor hierarchy in CATO,
could see the reason for deciding the first case for the plaintiff rendering framework precedents unnecessary.
as 𝐵 → plaintiff 3 , so that future cases containing [𝐴, 𝐵, 𝐷, 𝐸] and Another problem arises from the possibility of there being several
[𝐵, 𝐶, 𝐷, 𝐸] will also be constrained for the plaintiff, without being factors for the same side with different strengths on a dimension.
distinguished by the absence of 𝐶 and 𝐴 respectively as was the Consider the following example:
case with the results model shown in Figure 1. Horty formalised Example 1. Consider a case in which the plaintiff had a unique
this reason model in [26]. For a formal comparison of the result product (F15p), but made disclosures to outsiders (F10d), which was
and reason models of factor based precedential constraint see [37]. found for the defendant. This gives the three rules:
r1: F15p → plaintiff; r2: F10d → defendant
3.1 Criticisms of the Reason Model of r3: r1 ≺ r2
Precedential Constraint with Factors Now consider a case where the plaintiff had made the disclosures
Horty’s formalisation of the reason model provides an effective way in a public forum, so that F27d applies rather than F10d. Now r2
of characterising precedential constraint using factors. But there does not match, and so the theory does not constrain the new case.
have been several questions raised about it. But since F27d is stronger on the dimension than F10d, the new
Rigoni advanced two criticisms in [41]. First, he pointed to cases case should, a fortiori, be decided for the defendant on the basis of
with multiple rationales: the precedent.
This problem has a simple pragmatic fix: where a factor is in-
3 Of cluded in a case, include also all weaker factors on that dimension.
course, identifying the reason in practice is not a straightforward task: it requires
close analysis of the decision and there my be different interpretations of the decision.
The problem is avoided by the more conservative results model, at the expense of 4 The relevant section, section 757, Liability for disclosure or use of another’s Trade
constraining far fewer cases Secret, can be found at https://www.lrdc.pitt.edu/ashley/restatem.htm.

16
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Bench-Capon and Atkinson

So the new case becomes [F15p, F10d, F27d] and r2 will match. This IRAC method was applied to the explanation of factor based rea-
will work from a logical perspective, although care must be taken soning in [11]. A key point about IRAC is that the rule (the reason
to use the factor actually present in any explanation. in the sense of the model above) relates to an issue, not to the case
The fourth criticism is simply that even the reason model does as a whole. The issues in US Trade Secrets Law, taken from the
not constrain enough cases. This is because no account is taken Restatement of Torts, were used to group related factors together
of whether the distinction is sufficient to overturn the rule. The for IBP [21] and VJAP [24]. These issues were also the basis for the
reason model as formalised above accepts any distinction, whereas factor hierarchy in [3] and the Abstract Dialectical Framework in
in CATO distinctions can be rejected through downplaying through [2]. A similar structure is given by Rigoni’s framework precedents,
substitution and cancellation [40] of factors. The point is clear in and we would argue that the role of framework precedents is to
the following example. identify the issues in a domain. Other systems, such as CABARET
Example 2. Here we have two cases, both found for the plaintiff. [48] derive their framework of issues from statutes. Unlike factors,
In case 1 the plaintiff took security measures (F6p) and, although issues can be seen as in a logical relation to the outcome. To find for
the defendant claimed the information was reverse engineerable the plaintiff, it must be shown both that the information was a trade
(F16d), the plaintiff won5 . In the second case we have the plaintiff secret and that it was misappropriated. A trade secret must both be
making disclosures (F10d), but restricting these disclosures (F12p). valuable and the information not generally known. To be misap-
Here the preference for the plaintiff is clear. We therefore have the propriated it must be either that improper means were used, or that
following rules: the information was used in breach of a confidential relationship.
r1: F6p → plaintiff r4 F12p → plaintiff We can express this as the following non-defeasible rules:
r2: F16d → defendant r5: F10d → defendant ROT1: TradeSecret ∧ and Misappropriated ↔ plaintiff
r3: r2 ≺ r1 r6: r5 ≺ r4 ROT2: InfoValuable ∧ SecrecyMaintained ↔ TradeSecret
Now consider a case with all four factors: [F6p, F12p. F10d, F16d]. ROT3: ImproperMeans ∨ (InfoUsed ∧ ConfidentialRelationship)
It seems this should clearly be found for the plaintiff: however, we ↔ Misappropriated
cannot apply r1 because of the distinction F10d and we cannot In [7] these issues are used to group CATO’s factors as shown
apply r4 because F16d distinguishes. Although neither distinction in Table 36 . Even though some factors appear under two issues,
would be significant in CATO, being cancelled for both precedents the issues contain only five to seven factors, greatly reducing the
by the other factors available, the reason model gives the case as possible combinations of relevant factors. That using issues rather
unconstrained. We will address this problem in the next section. than whole cases to constrain decisions will enable us to decide
more cases is evidenced by [21]. The issue based IBP was able
4 THE IMPORTANCE OF ISSUES to reach a prediction in 99.5% of cases, as opposed to the 73.1%
The fourth problem discussed in the previous section arises because achieved by a system considering cases as a whole.
the reason model considers cases as unstructured bundles of factors, This suggests that instead of describing cases simply as a set of
so that a difference which should not be considered significant factors, we should distribute these factors across the issues they
prevents us from applying the rule which should constrain the relate to. Note also that structuring into issues is an implicit feature
case. We can use knowledge of the domain structure to solve this of rule based systems such as [47] and [2]. We can now apply the
problem. That we do not exploit the full power of our precedents if methods of precedential constraint developed in [27] and [37] not
we consider whole cases was noticed by Branting in [17]: at the case level, but at the issue level. To see the difference this
makes, we will consider a set of cases7 , taken from [3] and used in
combining portions of multiple precedents can permit [22], shown in Table 4. We have not re-analysed the decisions: the
new cases to be resolved that would be indeterminate factors for each case are taken from Table II of [22].
if new cases could only be compared to entire prece- Notice that, in all these cases, some issues are uncontested. It
dents. ([17], Abstract). seems that we can regard the information as a trade secret, unless
When Brüninghaus and Ashley adapted CATO to predict cases in argued otherwise, and that there is a presumption that the informa-
IBP [21] they structured the cases around issues, as did Grabmair in tion was used. On the other hand the plaintiff needs to establish that
his prediction system VJAP [24]. Grabmair reports an improvement improper means were used or that a confidential relationship ex-
over IBP through the use of values, but values raise several addi- isted. To find the issue and rule in the case we look at the contested
tional questions, such as the extent to which they are promoted by issues, and which factors led to the outcome. This is the method
different factors and whether value preferences are global, or local used to identify the rule and resolve the issue when applying the
to issues. Since precedential constraint with values has not yet been IRAC methodology in [11].
given a formal characterisation, we will restrict our consideration in In the next sections we will illustrate the use of the standard
this paper to factors. Issues are a well known concept in law: many reason model followed by the use of the proposed issue based
law schools teach the Issue-Rule-Application-Conclusion (IRAC) 6 In [24] Grabmair associated factors to issues a little differently. This does not, however
method (or some variant) as a way of analysing legal cases. The affect any of the factors in our example below, and so we follow [7] here.
7 National Instrument Labs, Inc. v. Hycel, Inc., 478 F.Supp. 1179 (D.Del.1979), M. Bryce
5 This preference for F6p was used in Mason v. Jack Daniel Distillery, 518 So.2d 130 & Associates, Inc. v. Gladstone, 107 Wis.2d 241, 319 N.W.2d 907 (Wis.App.1982), K & G
(Ala.Civ.App.1987): “courts have protected information as a trade secret despite ev- Oil Tool & Service Co. v. G & G Fishing Tool Serv., 314 S.W.2d 782 (1958), Televation
idence that such information could be easily duplicated by others competent in the Telecommunication Systems, Inc. v. Saindon, 522 N.E.2d 1359 (Ill.App. 2 Dist. 1988),
given field. KFC Corp. v. Marion-Kay Co., 620 F. Supp. 1160 (S.D.Ind. 1985); Sperry Mason v. Jack Daniel Distillery, 518 So.2d 130 (Ala.Civ.App.1987) and The Boeing
Rand Corp. v. Rothlein, 241 F. Supp. 549 (D.Conn. 1964)”. Company v. Sierracin Corporation, 108 Wash.2d 38, 738 P.2d 665 (1987).

17
Precedential Constraint: The Role of Issues ICAIL’21, June 21–25, 2021, São Paulo, Brazil

Table 3: CATO factors grouped by Issues [7]

Issue Plaintiff Factors Defendant Factors


F16d Info Reverse Engineerable
F8p Competitive Advantage F20d Info Known to Competitors
InfoValuable
F15p Unique Product F24d Info Obtainable Elsewhere
F27d Disclosure In Public Forum
F4p Agreed Not To Disclose
F10d Secrets Disclosed Outsiders
SecrecyMaintained F6p Security Measures
F19d No Security Measures
F12p Outsider Disclosures Restricted
F2p Bribe Employee
F7p Brought Tools
F17d Info Independently Generated
ImproperMeans F14p Restricted Materials Used
F25d Info Reverse Engineered
F22p Invasive Techniques
F26p Deception
F7p Brought Tools
F8p Competitive Advantage F17d Info Independently Generated
InfoUsed
F14p Restricted Materials Used F25d Info Reverse Engineered
F18p Identical Products
F4p Agreed Not To Disclose
F1d Disclosure In Negotiations
ConfidentialRelationship F13p Noncompetition Agreement
F23d Waiver of Confidentiality
F21p Knew Info Confidential

Table 4: Cases in the Example with Factors Grouped By Issue

Case InfoValuable SecrecyMaintained ImproperMeans InfoUsed ConfidentialRelationship


National Instruments P F18p F1d F21p
Bryce P F6p F18p F1d F4p F21p
K and G P F15p F16d F6p F14p 25d F14p F18p F25d F21p
Televation P F15p F16d F6p F10d F12p F18p F21p
Mason P F15p F16d F6p F1d F21p
Boeing P F6p F10d F12p F14p F14p F1d F4p F21p

reason model. We will see that when using issues, more cases are Now suppose we are presented with K and G. Here the plain-
constrained because distinctions relating to issues unrelated to that tiff argues that improper means were used, because the defendant
governed by a rule no longer distinguish that rule and are relevant used restricted materials (F14p). The defendant counters this by a
only if they constrain that other issue so as to lead to a different claim to have reverse engineered the information (F25d). Moreover
outcome. the defendant argues that the information is not a trade secret be-
cause it was reverse engineerable (F16d)9 . This is in turn countered
4.1 Using the reason model by the claim that the uniqueness of the product (F15p) suggests
Suppose our first case8 is National Instruments. As can be seen that the information was not readily reverse engineerable. In the
from Table 4, the case turned on whether there was a confiden- judgement both the issues were decided in favour of the plaintiff,
tial relationship, given that the plaintiff had made disclosures in since the reverse engineering had made use of restricted materials.
negotiations (F1d). The defendant, however, did know that the infor- Note that there was no need to decide the breach of confidence
mation was confidential (F21p), and the court found for the plaintiff. issue: improper means suffice to establish misappropriation. Al-
The reason model then gives the three rules: though NatInstP applies, it cannot be used because the defendant
has stronger rules than NatInstD. Thus the reason in K and G must
NatInstP: F21p → plaintiff NatInstD: F1d → defendant
cover two different issues, InfoValuable and ImproperMeans, and
NatInstO: NatInstD ≺ NatInstP
so we get rules spanning both these issues:
If the next case is Bryce, we can see that it is constrained by
these rules: the additional factors are not distinctions because both KGP: F15p and F14p → plaintiff
favour the plaintiff, and so do not give the defendant anything KGD: F16d and F25d → defendant
better than NatInstD, which is defeated by NatInstP, which also 9 Both F25d and F16d were introduced in CATO and relate to the same dimension, D14b.
applies to Bryce. Bryce thus adds no new rules. However, F25d, that the information was actually reverse engineered, relates to the
issues of whether the whether the information was used and whether improper means
8 The sequencing of the cases used here is for the purposes of illustrating our approach, were used, whereas F16d, the possibility of reverse engineering, relates to whether the
and is not the actual sequence. information was valuable and hence a trade secret.

18
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Bench-Capon and Atkinson

KGO: KGD ≺ KGP KGIMP: F15p → ImproperMeans


Note, however, that this means that the reasons extracted from KGIMD: F25d → Not ImproperMeans
KGP conflate the two issues in dispute in the case. The next case is KGOIM: KGIMD ≺ KGIMP
Televation. The plaintiff had made disclosures (F10d), distinguishing Now when we come to Televation, K and G constrains Informa-
all the previous cases, although these were restricted (F12p). These tionValuable, but we get rules for the new issue, SecrecyMaintained.
restrictions were held sufficient to find that the plaintiff had made TelevationSMP: F12p → SecrecyMaintained
efforts to maintain secrecy. Again it was found that the uniqueness TelevationSMD: F10d → Not SecrecyMaintained
of the product argued against it being readily reverse engineerable. TelevationSM: TelevationSMD ≺ TelevationSMP
The misappropriation consisted of using the information, shown
By using issues, when we come to Mason, the decision is con-
by F18p, and the confidential relationship, shown by F21p. Because,
strained. We can use the rules from K and G to constrain Infor-
however, misappropriation was not contested, these do not form
mationValuable and those from National Instruments to constrain
part of the reason. The reason model rules from Televation also
ConfidentialRelationship. Under the result model above, F1d dis-
cover two issues, InfoValuable and SecrecyMaintained:
tinguished K and G, and F16d distinguished National Instruments.
TelevationP: F15p and F12p → plaintiff Similarly Boeing is constrained: by Televation with respect to Se-
TelevationD: F16d and F10d → defendant crecyMaintained and by National Instruments (like Bryce the F4p
TelevationO: TelevationD ≺ TelevationP factor is not needed) with respect to ConfidentialRelationship. F15p
We now reach Mason. With respect to the existence of a trade would have distinguished Televation and F10d National Instruments
secret Mason is identical to K and G. However, Mason disclosed under the results model. Thus by focussing on issues, we are able
the information in negotiations (F1d), which distinguishes it from to constrain cases which required new rules under the “whole case”
this case, and so KGP is not applicable. Similarly although it can reason model. We also argue that this better fits with legal prac-
instantiate NatInstP from National Instruments, that case is distin- tice: not only does it follow the IRAC methodology taught in law
guished by F16d, and so the defendant has a reason stronger than schools, but it also reflects how precedents are used in decisions.
NatInstD. So the reason for Mason must cover both trade secret and Consider this extract from Mason, finding that the information
confidential relationship, and when Mason is found for the plaintiff, could be regarded as a trade secret10 :
we get the rules: We note that absolute secrecy is not required ... “a
MasonP: F15p and F21p → plaintiff substantial element of secrecy is all that is necessary
MasonD: F16d and F1d → defendant to provide trade secret protection.” Drill Parts, 439
MasonO: MasonD ≺ MasonP So.2d at 49. ... we note that courts have protected
Finally consider Boeing. We have two contested issues: whether information as a trade secret despite evidence that
secrecy was maintained, with the same factors as Televation, and such information could be easily duplicated by others
whether there was a confidential relationship with the same factors competent in the given field. KFC Corp. v. Marion-
as Bryce. But we cannot use these to find for the plaintiff: we are not Kay Co., 620 F. Supp. 1160 (S.D.Ind. 1985); Sperry Rand
constrained by Televation without F15p, and we are not constrained Corp. v. Rothlein, 241 F. Supp. 549 (D.Conn. 1964).
by NatInstP as used in Bryce with F10d present. Three precedents are used to justify the finding in Mason: for none
In Bryce, Mason and Boeing, the decision seems clear given the of the three precedents was there any consideration of aspects of
preceding cases, but only Bryce is constrained under the reason these cases not germane to the specific issue being resolved. Thus
model. Now we will consider the difference made by exploiting our the focus provided by issues means that we are able to constrain
understanding of the domain and structuring the cases into issues. more cases, since we can ignore differences which do not relate to
the issue on which the case was decided. Issues would be even more
4.2 Using the Issue Based Reason Model useful in the results model, since they would allow us to ignore
In this section we look at how the sequence of cases in 4.1 works irrelevant pro-plaintiff as well as irrelevant pro-defendant factors.
out if we associate our reasons with issues, and rely on the frame- We now look at how we can handle dimensional features of cases.
work provided by the Restatement of Torts (rules ROT1-3 above)
to combine these partial findings. National Instruments concerned 5 RELATING TO DIMENSIONS
only one issue, Confidential Relationship, so our rules will be: In Bryce the Restatement of Torts was explicitly cited as the frame-
NatInstCRP: F21p → ConfidentialRelationship work for considering the case:
NatInstCRD: F1d → Not ConfidentialRelationship Some factors to be considered in determining whether
NatInstCR: NatInstCRD ≺ NatInstCRP given information is one’s trade secret are: (1) the
In Bryce, we have the same issue, and we can use these rules: the extent to which the information is known outside of
additional factor also supports a confidential relationship. K and his business; (2) the extent to which it is known by
G, however, concerns two different issues, and so gives us distinct employees and others involved in his business; (3) the
sets of rules for each of these two issues. 10 Actuallythis suggests that the court used F6p rather than F15p to defeat F16d. This
KGIVP: F15p → InformationValuable suggests that perhaps F6p should be added to the factors related to InfoValuable and
used as the reason in KGIVP. Alternatively, TradeSecret could be considered as a single
KGIVD: F16d → Not InformationValuable issue subsuming InfoValuable and MaintainSecrecy. This, however, is a matter of legal
KGOIV: KGIVP ≺ KGIVD analysis, and for the purposes of this paper we adopt the analysis of [7].

19
Precedential Constraint: The Role of Issues ICAIL’21, June 21–25, 2021, São Paulo, Brazil

extent of measures taken by him to guard the secrecy consequences. CATO has been explicitly identified
of the information; (4) the value of the information with the second of these steps (e.g. [20]). ([40], p 22).
to him and to his competitors; (5) the amount of This can be seen clearly in [7] where factors – the intermedi-
effort or money expended by him in developing the ate predicates – were ascribed to cases by the machine learning
information; (6) the ease or difficulty with which the program SMILE, before being passed to IBP to predict the legal con-
information could be properly acquired or duplicated sequences. More recently this two-stage approach has been used
by others. Emphasis ours. by Branting in [19] and [18]. Thus before we can consider whether
These points are all reflected in CATO’s factor hierarchy. But as a case is constrained, which can be done in terms of factors using
the emphasised terms indicate, ascribing these factors is not simple, the issue based reason model described above, we must first as-
but requires a judgement as to whether the extent is sufficient for sign the factors. For some factors, those derived from many-valued
the factor to apply. This point was addressed by Horty in [27] and dimensions, this will involve ascribing the factors on that dimen-
[28]. The issue was further discussed by Rigoni in [42]. Horty has sion respecting ranges identified in precedent cases. This can be
modified his approach in [29], and a formal comparison of Horty done using either the reason or the result model, or using Rigoni’s
and Rigoni’s approaches is given in [37]. switching points. For a discussion of mapping a dimensional fact
Horty’s main example in [27] is taken from [39] and concerns (age) into ranges through precedents see [25].
change of fiscal domicile, decided on the basis of several consid- Thus the conclusion is that attempting to model precedential
erations including length of absence and percentage of income constraint in terms of cases represented as sets of dimensions rather
earned abroad. In [39] absence was modelled as two factors, which than sets of factors as in [37] conflates two distinct steps in the
we will call shortStay and longStay 11 , favouring no change and process of reasoning with legal cases. Cases are not represented
change respectively. But this raises the question of how we deter- as sets of dimensions: cases are represented as facts in HYPO and
mine whether a particular length of absence, say 24 months, is a where they are represented as sets of points on dimensions, as in
shortStay, a longStay, or somewhere in between, and so neutral. [40] and [14], these are dimensional facts, the legal significance
Horty responds by introducing the notion of a factor with mag- of which is unknown until they are mapped into factors. If we
nitude, (i.e. a factor deriving from a dimension with more than represent cases as sets of dimensional facts (including dimensions
two values) based on the dimensional fact of length of stay. The with two values) as in [40], we can derive the factors applicable
ascription of factors on the basis of dimensional facts can also be to the case. Or we can get our factors through machine learning
found in [40]. Ascription of factors is constrained by precedents: as in [19]. We then organise these factors into issues and apply
in a previous case a judge may have found for change on the basis precedential constraint in terms of the factors associated with each
of an absence of 18 months, showing on the result model that any issue as described in section 4.2. While some precedents will supply
absence of at least 18 months must be considered a longStay. But the plaintiff, defendant and priority rules as described in section 3.1,
the judge may have spoken of an absence of greater than one year, others will supply rules to move from dimensional facts to factors.
so that on the reason model that any absence over 12 months is to To give an example from the fiscal domicile domain, such a rule
be considered longStay. Rigoni’s suggestion was to see precedents would be something like: longStay ← absence(A) ∧ A ≥ 12.
as fixing “switching points”, which determine which (if any) factor One issue in the ascription of factors is that in some cases they
applies for various values of the dimensional fact. Rigoni also notes do not seem to be independent. Thus in the fiscal domicile case it
that a dimension may encompass multiple factors for a given side is possible that there is a trade off between length of absence and
(as with disclosures (D3d) in CATO). amount of income earned, so that whether the percentage of income
Note here that the precedents which impose bounds on the is considered to be “substantial” is relative to the length of absence.
ranges occupied by factors are a different kind of precedent from The question of balancing factors has been discussed in [32] and
those which resolve factor conflicts as discussed above: they express [23], and an equation representing the trade off was used in [12].
no preferences. Thus what is required to accommodate dimensional In that paper a single factor (e.g. SufficientIncomeGivenAbsence) is
facts and factors with magnitude is not a different way of represent- ascribed on the basis of the two dimensional facts. How factors are
ing precedential constraint, but to recognise that we are looking ascribed on the basis of facts relates to the first stage and the focus
at a two stage process, with each stage using different types of of this paper is on the second stage, namely determining how the
precedents. The need for two stages was observed in [40]: precedents, when described in terms of factors, constrain the deci-
Once the facts of a case have been established - and sion. Therefore we will not discuss the important and interesting
this is rarely straightforward since the move from questions relating to balancing and trade-offs further in this paper.
evidence to facts is often itself the subject of debate -
legal reasoning can be seen, following Ross [45] and 6 CONCLUDING REMARKS
Lindhal and Odelstad [33], as a two stage process, first A number of conclusions can be drawn from the above discussion:
from the established facts to intermediate predicates, • Reasoning with cases is a two stage process: first factors are
and then from these intermediate predicates to legal ascribed on the basis of (often dimensional) facts, and then
the cases are compared with precedents using factors.
11 In[39] long duration and not long duration were used, but for reasons explained in • Precendential constraint should be considered in terms of
section 2.1, negating factors is problematic and we follow CATO and [44] and use
two distinct factors when the dimension can favour both sides. This also permits the factors, even if we wish to represent cases in terms of dimen-
possibility of a moderate duration being neutral. sional facts. Applicable dimensions show which aspects must

20
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Bench-Capon and Atkinson

be considered, while factors show which side is favoured in [15] Trevor Bench-Capon and Edwina L Rissland. 2001. Back to the future: Dimensions
the particular case. revisited. In Proceedings of JURIX 2001. IOS Press, 41–52.
[16] Trevor Bench-Capon and Giovanni Sartor. 2003. A model of legal reasoning with
• Comparison (for both the results and the reason models) cases incorporating theories and values. Artificial Intelligence 150, 1-2 (2003),
should be at the level of issues, to ignore irrelevant distinc- 97–143.
[17] L Karl Branting. 1991. Reasoning with portions of precedents. In Proceedings of
tions, and to reflect legal practice better. the 3rd International Conference on AI and Law. 145–154.
• Precedents do not always have the same role: [18] L Karl Branting. 2020. Explanation in Hybrid, Two-Stage Models of Legal Predic-
– Framework precedents (e.g. Lemon v. Kurtzman) identify tion. In The 3rd XAILA Workshop at JURIX 2020.
[19] L Karl Branting, Craig Pfeifer, Bradford Brown, Lisa Ferro, John Aberdeen, Brandy
the issues and set out the logical framework in which they Weiss, Mark Pfaff, and Bill Liao. 2020. Scalable and explainable legal prediction.
are considered; AI and Law (2020), 1–26.
– Preference precedents (the standard use in CATO) say how [20] Stefanie Brüninghaus and Kevin Ashley. 2003. A predictive role for intermediate
legal concepts. In Proceedings of Jurix 2003. 153–62.
conflicting factors within an issue should be resolved; [21] Stefanie Brüninghaus and Kevin D Ashley. 2003. Predicting outcomes of case
– Ascription precedents (e.g. National Instruments states its based legal arguments. In Proceedings of the 9th International Conference on AI
and Law. 233–242.
reasons for withholding F16d at some length), give reasons [22] Alison Chorley and Trevor Bench-Capon. 2005. An empirical investigation of
to determine if a factor should be ascribed to a case or not. reasoning with legal cases through theory construction and application. AI and
Law 13, 3 (2005), 323–371.
Here we have used the issues from IBP [7]. But we could have [23] Thomas F Gordon and Douglas Walton. 2016. Formalizing Balancing Arguments..
used coarser grained issues, perhaps merging the conjoined issues In Proceedings of COMMA 2016. 327–338.
[24] Matthias Grabmair. 2017. Predicting trade secret case outcomes using argument
in IBP, or finer grained issues, using the abstract factors of [3] as schemes and learned quantitative value effect tradeoffs. In Proceedings of the 16th
issues, or even the nodes of the 2-regular hierarchy of [1]. The finer International Conference on AI and Law. 89–98.
the granularity, the more decisions are constrained. Experiments [25] John Henderson and Trevor Bench-Capon. 2019. Describing the development
of case law. In Proceedings of the 17th International Conference on AI and Law.
to investigate the impact of different granularities on predictive ac- 32–41.
curacy would be interesting. It would also be interesting to explore [26] John F Horty. 2011. Reasons and precedent. In Proceedings of the 13th International
the use of values rather than factors as the elements over which Conference on AI and Law. 41–50.
[27] John F Horty. 2017. Reasoning with dimensions and magnitudes. In Proceedings
preferences are expressed as in [16] and [24]. The possibility of of the 16th the International Conference on Articial Intelligence and Law. 109–118.
using multiple granularities could also be explored, with some ar- [28] John F Horty. 2019. Reasoning with dimensions and magnitudes. AI and Law 27,
3 (2019), 309–345.
guments being in terms of issues, some in terms of abstract factors, [29] John F Horty. 2021. Modifying the Reason Model. AI and Law (2021), On Line.
others in terms of values, and others considering whole cases. [30] John F Horty and Trevor Bench-Capon. 2012. A factor-based definition of prece-
Perhaps the best way to deploy machine learning is for the first dential constraint. AI and Law 20, 2 (2012), 181–214.
[31] Grant Lamond. 2005. Do precedents create rules. Legal Theory 11 (2005), 1–26.
stage, factor ascription, as in [7] and [19]. Moreover, if we wish to [32] Marc Lauritsen. 2015. On balance. AI and Law 23, 1 (2015), 23–42.
address the second stage with machine learning, perhaps it would be [33] Lars Lindahl and Jan Odelstad. 2006. Open and closed intermediaries in normative
better to predict issues rather than whole cases, and then combine systems. In Proceedings of JURIX 2006. IOS Press, 91–99.
[34] Jo Desha Lucas. 1983. The direct and collateral estoppel effects of alternative
the results using a logical framework to get the overall decision. holdings. The University of Chicago Law Review 50, 2 (1983), 701–730.
[35] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2019. Using machine
learning to predict decisions of the European Court of Human Rights. AI and
REFERENCES Law (2019), 1–30.
[1] Latifa Al-Abdulkarim, Katie Atkinson, and Trevor Bench-Capon. 2015. Factors, [36] Henry Prakken. 2019. Modelling accrual of arguments in ASPIC+. In Proceedings
issues and values: Revisiting reasoning with cases. In Proceedings of the 15th of the 17th International Conference on AI and Law. 103–112.
International Conference on AI and Law. 3–12. [37] Henry Prakken. 2021. A A formal analysis of some factor- and precedent-based
[2] Latifa Al-Abdulkarim, Katie Atkinson, and Trevor Bench-Capon. 2016. A method- accounts of precedential constraint. AI and Law (2021), Available On–Line.
ology for designing systems to reason with legal cases using ADFs. AI and Law [38] Henry Prakken and Ratsma Rosa. 2021. A top-level model of case-based ar-
24, 1 (2016), 1–49. gumentation for explanation: formalisation and experiments. Argument and
[3] Vincent Aleven. 1997. Teaching case-based argumentation through a model and Computation (2021), Available On–line.
examples. Ph.D. thesis. University of Pittsburgh. [39] Henry Prakken and Giovanni Sartor. 1998. Modelling reasoning with precedents
[4] Vincent Aleven and Kevin D Ashley. 1995. Doing things with factors. In Proceed- in a formal dialogue game. AI and Law 6, 3-4 (1998), 231–87.
ings of the 5th International Conference on AI and Law. 31–41. [40] Henry Prakken, Adam Wyner, Trevor Bench-Capon, and Katie Atkinson. 2015.
[5] Larry Alexander. 1989. Constrained by precedent. Southern California Law Review A formalization of argumentation schemes for legal case-based reasoning in
63 (1989), 1–64. ASPIC+. Journal of Logic and Computation 25, 5 (2015), 1141–1166.
[6] Kevin D Ashley. 1990. Modeling legal arguments: Reasoning with cases and hypo- [41] Adam Rigoni. 2015. An improved factor based approach to precedential constraint.
theticals. MIT press, Cambridge, Mass. AI and Law 23, 2 (2015), 133–160.
[7] Kevin D Ashley and Stefanie Brüninghaus. 2009. Automatically classifying case [42] Adam Rigoni. 2018. Representing dimensions within the reason model of prece-
texts and predicting outcomes. AI and Law 17, 2 (2009), 125–165. dent. AI and Law 26, 1 (2018), 1–22.
[8] Trevor Bench-Capon. 1991. Practical legal expert systems: the relation between [43] Edwina L Rissland and Kevin D Ashley. 1987. A case-based system for Trade
a formalisation of legislation and expert knowledge. In Law, Computer Science Secrets law. In Proceedings of the 1st International Conference on AI and Law.
and Artificial Intelligence, M Bennun and A Narayanan (Eds.). Ablex, 191–201. 60–66.
[9] Trevor Bench-Capon. 1999. Some observations on modelling case based reasoning [44] Edwina L Rissland and Kevin D Ashley. 2002. A note on dimensions and factors.
with formal argument models. In Proceedings of the 7th International Conference AI and Law 10, 1-3 (2002), 65–77.
on AI and Law. 36–42. [45] Alf Ross. 1957. Tû-tû. Harvard Law Review (1957), 812–825.
[10] Trevor Bench-Capon. 2017. HYPO’s legacy: introduction to the virtual special [46] Frederick Schauer. 1987. Precedent. Stanford Law Review (1987), 571–605.
issue. AI and Law 25, 2 (2017), 205–250. [47] Marek Sergot, Fariba Sadri, Robert Kowalski, Frank Kriwaczek, Peter Hammond,
[11] Trevor Bench-Capon. 2020. Explaining Legal Decisions Using IRAC. In Proceed- and Therese H Cory. 1986. The British Nationality Act as a logic program.
ings of CMNA 2020. CEUR Workshop Proceedings 2669, 74–83. Commun. ACM 29, 5 (1986), 370–386.
[12] Trevor Bench-Capon and Katie Atkinson. 2017. Dimensions and Values for Legal [48] David B Skalak and Edwina L Rissland. 1992. Arguments and cases: An inevitable
CBR. In Proceeding of JURIX 2017. 27–32. intertwining. AI and Law 1, 1 (1992), 3–44.
[13] Trevor Bench-Capon and Katie Atkinson. 2018. Lessons from Implementing [49] Richard E Susskind. 1989. The Latent Damage system: A jurisprudential analysis.
Factors with Magnitude. In Proceedings of JURIX 2018. 11–20. In Proceedings of the 2nd International Conference on AI and Law. 23–32.
[14] Trevor Bench-Capon and Floris Bex. 2015. Cases and Stories, Dimensions and [50] Heng Zheng, Davide Grossi, and Bart Verheij. 2020. Case-Based Reasoning with
Scripts.. In Proceedings of JURIX 2015. 11–20. Precedent Models: Preliminary Report. In Proceedings of COMMA 2020. 443–450.

21
Incorporating Domain Knowledge for Extractive Summarization
of Legal Case Documents
Paheli Bhattacharya Soham Poddar Koustav Rudra
Department of CSE, IIT Kharagpur Department of CSE, IIT Kharagpur L3S Research Center, Leibniz
India India University, Hannover
Germany

Kripabandhu Ghosh Saptarshi Ghosh


Department of CDS, IISER, Kolkata Department of CSE, IIT Kharagpur
India India
ABSTRACT behaved in different legal situations. Case documents usually span
Automatic summarization of legal case documents is an important several tens to hundreds of pages and contain complex text, which
and practical challenge. Apart from many domain-independent text makes comprehending them difficult even for legal experts. Hence,
summarization algorithms that can be used for this purpose, sev- summaries of case documents often prove beneficial. Existing le-
eral algorithms have been developed specifically for summarizing gal IR systems employ legal attorneys or para-legals to write case
legal case documents. However, most of the existing algorithms summaries, which is an expensive process. Hence, automatic sum-
do not systematically incorporate domain knowledge that spec- marization of case documents is a practical and well-known chal-
ifies what information should ideally be present in a legal case document summary. To address this gap, we propose an unsupervised summarization algorithm DELSumm which is designed to systematically incorporate guidelines from legal experts into an optimization setup. We conduct detailed experiments over case documents from the Indian Supreme Court. The experiments show that our proposed unsupervised method outperforms several strong baselines in terms of ROUGE scores, including both general summarization algorithms and legal-specific ones. In fact, though our proposed algorithm is unsupervised, it outperforms several supervised summarization models that are trained over thousands of document-summary pairs.

KEYWORDS
Legal document summarization; Integer Linear Programming

ACM Reference Format:
Paheli Bhattacharya, Soham Poddar, Koustav Rudra, Kripabandhu Ghosh, and Saptarshi Ghosh. 2021. Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466092

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06…$15.00
https://doi.org/10.1145/3462757.3466092

1 INTRODUCTION
In a Common Law system (followed in most countries including USA, UK, India, Australia), law practitioners have to go through hundreds of legal case documents to understand how the Court […]lenge [8, 14, 18, 20, 25].

Document summarization is broadly of two types – (i) extractive, where important sentences are extracted from the source document and included in the summary [7, 14, 16], and (ii) abstractive, where the model attempts to generate a summary using suitable vocabulary [15, 17]. This paper focuses on extractive summarization since it is more common for legal case documents.

Law practitioners in various countries have a set of guidelines for how to summarize court cases, e.g., see [9, 11, 12] for guidelines on summarizing case documents of the USA and the UK. All these guidelines basically advise to consider different rhetorical/thematic segments that are usually present in case documents (e.g., facts of the case, legal issues being discussed, final judgment, etc.) and then to include certain important parts of each segment into the summary. We got similar advice from law experts in India – two senior law students and a faculty from the Rajiv Gandhi School of Intellectual Property Law, India – regarding Indian court case documents (see details in Section 5.1). Apart from focusing on summarizing the case document as a whole, it is also necessary that each of these segments is summarized well and has representations in the final summary. Another advantage of including representations of each segment in the summary is that, often law practitioners wish to read only the summary of particular segments (e.g., only the facts of a case, or the legal issues involved in a case).

There are several summarization algorithms specifically meant for legal court case documents [8, 14, 18, 20, 25] (details in Section 2). However, most of these algorithms do not take into account the guidelines from law practitioners. Though a few legal domain-specific algorithms recognize the presence of various rhetorical segments in case documents [8, 20], they also do not consider expert guidelines in deciding what to include in the summary from which segment. As a result, these algorithms cannot represent some of the important segments well in the summary. For instance, a recent work [2] showed that most existing algorithms do not represent the final judgement well in the summary, though this segment is considered very important by law practitioners.

The above limitations of existing methods motivate us to develop (i) Unsupervised domain-independent, (ii) Unsupervised domain-
a summarization algorithm that will consider various rhetorical seg- specific, (iii) Supervised domain-independent, and (iv) Supervised
ments in a case document, and then decide which parts from each domain-specific. We briefly describe some popular methods from
segment to include in the summary, based on guidelines from law each of the classes in this section.
practitioners. Supervised methods which are expected to learn these
guidelines automatically would require a large number of expert 2.1 Unsupervised domain-independent
written summaries for lengthy case judgements, which may not
be available for all jurisdictions. Hence we go for an unsupervised
methods
summarization algorithm in this work. Popular, unsupervised extractive summarization algorithms iden-
We propose DELSumm (Domain-adaptive Extractive Legal tify important sentences either by using Frequency-based methods
Summarizer ), an unsupervised extractive summarization algo- (e.g. Luhn [16]) or Graph-based methods (e.g. LexRank [7]). The
rithm for legal case documents. We formulate the task of summariz- summary is the top ranked sentences. There are algorithms from
ing legal case documents as maximizing an Integer Linear Program- the family of matrix-factorization, such as LSA [10].
ming (ILP) objective function that aims to maximize the inclusion of Summarization has also been treated as an Integer Linear Pro-
the most informative sentences in the summary, and has balanced gramming (ILP) optimization problem. Such methods have been
representations from all the thematic segments, while reducing applied for summarizing news documents [1] and social media
redundancy. We demonstrate how DELSumm can be tuned to oper- posts [19]. Our proposed approach in this work is also based on
ationalize summarization guidelines obtained from law experts over ILP-based optimization.
Indian Supreme Court case documents. Comparison with several
baseline methods – including five legal domain-specific methods 2.2 Unsupervised domain-specific methods
and two state-of-the-art deep learning models trained over thou-
There are several unsupervised methods for extractive summariza-
sands of document-summary pairs – suggest that our proposed
tion specifically designed for legal case documents. One of the ear-
approach outperforms most existing methods, especially in summa-
liest methods for unsupervised legal document summarization that
rizing the different rhetorical segments present in a case document.
takes into account the rhetorical structure of a case document was
To summarize, the contributions of this paper are:
LetSum [8]. They consider certain cue-phrases to assign rhetor-
(1) We propose DELSumm that incorporates domain knowledge pro-
ical/semantic roles to a sentence in the source document. They
vided by legal experts for summarizing lengthy case documents.1
specifically consider the roles – Introduction, Context, Juridical
(2) We perform extensive experiments on documents from the
Analysis and Conclusion. Sentences are ranked based on their TF-
Supreme Court of India and compare our proposed method with
IDF values for estimating their importance. The final summary is
eleven baseline approaches. We find that DELSumm outperforms
generated by taking 10% from the Introduction, 25% from Context,
a large number of legal-specific as well as general summarisation
60% from Juridical Analysis and 5% from the Conclusion segments.
methods, including supervised neural models trained over thou-
While LetSum considers TF-IDF to rank sentences, Saravanan
sands of document-summary pairs. Especially, DELSumm achieves
et.al. [20] use a K-Mixture Model to rank sentences for deciding
much better summarization of the individual rhetorical segments in
which sentences to include in the summary. We refer to this work
a case document, compared to most prior methods.
as KMM in the rest of the paper. Note that this work also identifies
(3) We also show that our proposed approach is robust to inaccurate
rhetorical roles of sentences (using a graphical model). However,
rhetorical labels generated algorithmically. There is a negligible
this is a post-summarization step that is mainly used for displaying
drop in performance when DELSumm uses rhetorical sentence
the summary in a structured way; the rhetorical roles are not used
labels generated by a rhetorical segmentation method (which is
for generating the summary.
more a practical setup), instead of using expert-annotated labels.
Another method CaseSummarizer [18] finds the importance of
To the best of our knowledge, this is the first systematic attempt
a sentence based on several factors including its TF-IDF value, the
to computationally model and incorporate legal domain knowledge
number of dates present in the sentence, number of named entities
for summarization of legal case documents. Through comparison
and whether the sentence is at the start of a section. Sentences are
with 11 methods, including 2 state-of-the-art deep learning methods,
then ranked in order to generate the summary comprising of the
we show that an unsupervised algorithm developed by intelligently
top-ranked sentences.
including domain expertise, can surpass the performance of su-
Zhong et.al. [25] creates a template based summary for Board
pervised learning models even when the latter are trained over
of Veteran Appeals (BVA) decisions from the U.S. Department of
large training data (which is anyway expensive to obtain in an
Veteran Affairs. The summary contains (i) one sentence from the
expert-driven domain such as Law).
procedural history (ii) one sentence from issue (iii) one sentence
from the service history of the veteran (iv) variable number of
2 RELATED WORK Reasoning & Evidential Support sentences selected using Maximum
Extractive text summarization aims to detect important sentences Margin Relevance (v) one sentence from the conclusion. We refer
from the full document and include them in the summary. Existing to this method as MMR in this paper.
methods for extractive summarization that can be applied for legal
document summarization, can be broadly classified into four classes: Limitations of the methods: CaseSummarizer does not assume
the presence of any rhetorical role. The other methods – LetSum,
1 Implementation publicly available at https://github.com/Law-AI/DELSumm KMM, and MMR – consider their presence but do not include

them while estimating the importance of sentences for generat- applied for obtaining the final summary. We refer to this method
ing the summary. Specifically, LetSum and KMM use generic term- as Gist [14].
distributional models (TF-IDF and K-mixture models respectively)
Limitations of the method: Although this method was developed
and MMR uses Maximum Margin Relevance. As evident, rhetorical
and applied for Chinese legal case documents, domain-specific
roles do not play any role in finally selecting the sentences for
attributes such as rhetorical labels were not considered.
inclusion in the summary.
We believe that it is more plausible to measure the importance
of sentences in each rhetorical segment separately using domain
knowledge. In fact, for each segment, the parameters measuring 2.5 Rhetorical roles in a legal case document
sentence importance can vary. As an example, consider the segment A legal case document can be structured into thematic segments,
‘Statute’. One can say that sentences from this segment that actually where each sentence can be labelled with a rhetorical role. Note that
contain a reference to a Statute/Act name are important. In contrast, case documents often do not implicitly specify these rhetorical roles;
this reasoning will not hold for the segment ‘Facts’ and one may there exist algorithms that assign rhetorical roles to the sentences [3,
derive a different measure based on the presence of certain Part-of- 21, 24]. Different prior works have considered different sets of
Speech tags e.g., Nouns (names of people, location, etc.). Hence, in rhetorical roles [3, 8, 20, 25]. It was shown in our prior work [2]
the present work, we especially include such domain knowledge that a mapping between the different sets of rhetorical roles is
along with segment representations in a more systematic way for possible.
generating summaries. In this work, we consider a set of eight rhetorical roles suggested
in our prior work [3]. Briefly, the rhetorical roles are as follows –
(i) Facts: the events that led to filing the case, (ii) Issue: legal ques-
2.3 Supervised domain-independent methods
tions/points being discussed, (iii) Ruling by Lower Court: case
Supervised neural (Deep Learning-based) methods for extractive documents from higher courts (e.g., Supreme Court) can contain de-
text summarization treat the task as a binary classification prob- cisions delivered at the lower courts (e.g. Tribunal), (iv) Precedent:
lem, where sentence representations are learnt using a hierarchical citations to relevant prior cases, (v) Statute: citations to statutory
encoder. A survey of the existing approaches can be found in [6]. laws that are applicable to the case (e.g., Dowry Prohibition Act,
Two popular extractive methods are NeuralSum [4] and Sum- Indian Penal Code), (vi) Arguments delivered by the contending
maRunner [17]. These methods use RNN (Recurrent Neural Net- parties, (vii) Ratio: rationale based on which the judgment is given,
work) encoders to learn the sentence representations from scratch. and (viii) Final judgement of the present court.
Sentence selection into summary is based on – content, salience, In this work, we attempt to generate a summary that includes
novelty and absolute and relative position importance. These pa- representations of all rhetorical segments in the source document
rameters are also learned in the end-to-end models. (full text). We assume that the rhetorical labels of every sentence
Recently, pretrained encoders (especially transformer-based mod- in the source document is already labeled either by legal experts or
els such as BERT [5]) have gained much popularity. These encoders by applying the method proposed in [3].
have already been pretrained using a large amount of open do-
main data. Given a sentence, they can directly output its sentence
representation. These models can be used in a supervised summa-
3 DATASET
rization by fine-tuning the last few layers on domain-specific data.
BERTSUM [15] is a BERT-based extractive summarization method. For evaluation of summarization algorithms, we need a set of source
Unlike SummaRuNNer, here sentence selection into summary is documents (full text) and their gold standard summaries. Addition-
based on trigram overlap with the currently generated summary. ally, since we propose to use the rhetorical labels of sentences in
a source document for generating the summary, we need rhetori-
Limitations of the neural methods: These Deep Learning archi- cal label annotations of the source documents. Additionally, since
tectures have been evaluated mostly in the news summarization the supervised methods Gist, SumaRuNNer and BERTSUM require
domain, and news documents are much shorter and contain simpler training over large number of document-summary pairs, we also
language than legal documents. Recent works [22, 23] show that need such a training set. In this section, we describe the training
these methods do not perform well in summarizing scientific arti- set and the evaluation set (that is actually used for performance
cles from PubMed and arxiv, which are longer sequences. However, evaluation of summarization algorithms).
it has not been explored how well these neural models would work
in summarizing legal case documents; we explore this question for Evaluation set: Our evaluation dataset consists of a set of 50 In-
two popular neural summarization models in this paper. dian Supreme Court case documents, where each sentence is tagged
with a rhetorical/semantic label by law experts (out of the rhetori-
cal roles described in Section 2.5). This dataset is made available
2.4 Supervised domain-specific methods by our prior work [3]. We asked two senior law students (from
Liu et.al. [14] have recently developed a supervised learning method the Rajiv Gandhi School of Intellectual Property Law, one of the
for extractive summarization of legal documents. Sentences are most reputed law schools in India) to write summaries for each of
represented using handcrafted features like number of words in these 50 document. They preferred to write extractive summaries of
a sentence, position of the sentence in a document etc. Machine approximately one-third of the length of the documents. We asked
Learning (ML) classifiers (e.g., Decision Tree, MLP, LSTM) are then the experts to summarize each rhetorical segment separately (so

that we can evaluate how well various models summarize the individual sections); only, they preferred to summarize the rhetorical segments 'Ratio' and 'Precedent' together.

All the summarization methods have been used to generate summaries for these 50 documents. These summaries were then uniformly evaluated against the two gold standard summaries written by the law students, using the standard ROUGE scores (details given in later sections). We report the average ROUGE scores over the two sets of gold standard summaries.
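For concreteness, the snippet below sketches this two-reference evaluation protocol with the pypi rouge package (the package version used for the reported numbers is stated in Section 6.2). The function name and the restriction to ROUGE-2/ROUGE-L are illustrative choices, not the exact evaluation script.

```python
# Sketch: score a generated summary against the two expert summaries and
# average the ROUGE-2 / ROUGE-L values, as done for every document.
from rouge import Rouge

def average_rouge(generated, gold_summaries):
    """generated: algorithm summary (str); gold_summaries: list of two expert summaries."""
    rouge = Rouge()
    metrics = ("rouge-2", "rouge-l")
    totals = {m: {"r": 0.0, "f": 0.0} for m in metrics}
    for gold in gold_summaries:
        scores = rouge.get_scores(generated, gold)[0]
        for m in metrics:
            totals[m]["r"] += scores[m]["r"] / len(gold_summaries)
            totals[m]["f"] += scores[m]["f"] / len(gold_summaries)
    return totals
```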
be included in summary, 0 otherwise)
Training set : For training supervised methods (Gist, SumaRuNNer
𝑦𝑗 Indicator variable for content word 𝑗 (1 if 𝑗 is to be
and BERTSUM), we need an additional training dataset consisting
included in summary, 0 otherwise)
of document-summary pairs. To this end, we crawled 7, 100 Indian
𝑝𝑖 Indicator variable for sentence 𝑖 citing a prior-case
Supreme Court case documents and their headnotes (short abstrac- (1 if there is a citation, 0 otherwise)
tive summaries) from http://www.liiofindia.org/in/cases/cen/INSC/
𝑎𝑖 Indicator variable for sentence 𝑖 citing a statute (1 if
which is an archive of Indian court cases. These document-summary there is a citation, 0 otherwise)
pairs were used to train the supervised summarization models (de- 𝐿 (𝑖) Number of words in sentence 𝑖
tails in later sections). We ensured that there was no overlap between
𝐼 (𝑖) Informativeness of sentence 𝑖
this training set and the evaluation set of 50 documents.
𝐶 (𝑖) Set of content words in sentence 𝑖
Note that the headnotes described above are not considered to
𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛 (𝑖) Position of sentence 𝑖 in the document
be summaries of sufficiently good quality by our law experts, and
𝑆𝑐𝑜𝑟𝑒 ( 𝑗) Score (a measure of importance) of content word 𝑗
this is why the experts preferred to write their own summaries
𝑇𝑗 Set of sentences where content word 𝑗 is present
as gold standard for the documents in the evaluation set. It can
𝑊 𝑒𝑖𝑔ℎ𝑡 (𝑘) Weight (a measure of importance) of segment 𝑘
be argued that it is unfair to train the supervised summarization
models using summaries that are known to be of poorer quality 𝑆𝑘 Set of sentences belonging to segment 𝑘
than the target summaries. However, the supervised summarization 𝑁 𝑂𝑆𝑘 Minimum number of sentences to be selected from
segment 𝑘 in the summary
models require thousands of document-summary pairs for training,
and it is practically impossible to obtain so many summaries of Table 1: Notations used in the DELSumm algorithm
the target quality. Hence our only option is to use the available
headnotes for training the supervised summarization models. This included in the summary. We assume that domain experts will de-
situation can be thought of as a trade-off between quantity and fine how to estimate the informativeness of a sentence.
quality of training data – if a method requires large amounts of (ii) Content words: Content words signify domain-specific vo-
training data, then that data may be of poorer quality than the cabulary (e.g., terms from a legal dictionary, or names of statutes,
target quality. etc) and noun-phrases. The importance of each content word 𝑗
is given by 𝑆𝑐𝑜𝑟𝑒 ( 𝑗). We assume that domain experts will define
what words/terms should be considered as ‘content words’, and
4 PROPOSED APPROACH: DELSUMM
how important the content words are.
In this section, we describe our proposed algorithm DELSumm . A summary of length 𝐿 words, consisting of the most informa-
Our algorithm uses an optimization framework to incorporate legal tive sentences and content words, is achieved by maximizing the
domain knowledge (similar to what is stated in [9, 11, 12]) into an following objective function:
objective function with constraints. The objective function is then
Õ𝑛 Õ
𝑚
maximized using Integer Linear Programming (ILP). The symbols 𝑚𝑎𝑥 ( 𝐼 (𝑖) · 𝑥𝑖 + 𝑆𝑐𝑜𝑟𝑒 ( 𝑗) · 𝑦 𝑗 ) (1)
used to explain the algorithm are stated in Table 1. 𝑖=1 𝑗=1
We consider that a case document has a set of 𝑔 rhetorical seg-
subject to constraints
ments (e.g., the 𝑔 = 8 rhetorical segments stated in Section 2.5),
and the summary is supposed to contain a representation of each Õ
𝑛
segment (as indicated in [9, 11, 12]). The algorithm takes as input 𝑥𝑖 · 𝐿(𝑖) ≤ 𝐿 (2)
– (i) a case document where each sentence has a label signifying
Õ
𝑖=1
its rhetorical segment, and (ii) the desired number of words 𝐿 in
𝑦 𝑗 >= |𝐶 (𝑖)| · 𝑥𝑖 , 𝑗 = [1 . . . 𝑚] (3)
the summary. The algorithm then outputs a summary of at most 𝐿
𝑗 ∈𝐶 (𝑖)
words containing a representation of each segment. Õ
𝑥𝑖 ≥ 𝑦 𝑗 , 𝑖 = [1 . . . 𝑛] (4)
The optimization framework: We formulate the summarization
Õ
𝑖 ∈𝑇 𝑗
problem using an optimization framework. The ILP formulation
maximizes the following factors: 𝑥𝑖 ≥ 𝑁𝑂𝑆𝑘 , 𝑘 = [1 . . . 𝑔] (5)
(i) Informativeness of a sentence: The informativeness 𝐼 (𝑖) of 𝑖 ∈𝑆𝑘
a sentence 𝑖 defines the importance of a sentence in terms of its The objective function in Eqn. 1 tries to maximize the inclusion
information content. More informative sentences are likely to be of informative sentences (through the 𝑥𝑖 indicator variables) and

the number of important content words (through the y_j indicator variables). Here x_i (respectively, y_j) is set to 1 if the algorithm decides that sentence i (respectively, content word j) should be included in the summary. Eqn. 2 constrains the summary length to be at most L words (note from Table 1 that L(i) is the number of words in sentence i). Eqn. 3 implies that if a particular sentence i is selected for inclusion in the summary (i.e., if x_i = 1), then all the content words contained in that sentence (C(i)) are also selected. Eqn. 4 suggests that if a content word j is selected for inclusion in the summary (i.e., if y_j = 1), then at least one sentence where that content word is present is also selected.

Eqn. 5 ensures that a minimum number of sentences NOS_k from each segment k is selected in the summary. This is a key step that ensures representation of all rhetorical segments in the generated summary. We assume that suitable values of NOS_k will be obtained from domain knowledge.

Any ILP solver can be used to solve the ILP; we specifically used the GUROBI optimizer (http://www.gurobi.com/). Finally, those sentences for which x_i is set to 1 will be included in the summary generated by DELSumm.
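As a concrete illustration, the sketch below encodes Eqns. 1–5 with the open-source PuLP modeller and its bundled CBC solver instead of GUROBI. The input structures (informativeness I, content-word scores, segment map and NOS values) are placeholders for the expert-derived quantities described in Section 5, so this is an assumed re-implementation rather than the released code.

```python
# Illustrative sketch of the DELSumm ILP (Eqns. 1-5) using PuLP.
import pulp

def delsumm_ilp(sents, I, score, length, content_words, segments, NOS, L):
    """sents: list of sentence ids; I[i]: informativeness; score[j]: content-word score;
    length[i]: #words in sentence i; content_words[i]: set of content words in sentence i;
    segments[k]: set of sentence ids labelled with segment k; NOS[k]: min #sentences from k;
    L: target summary length in words."""
    words = sorted({j for i in sents for j in content_words[i]})
    occurs_in = {j: [i for i in sents if j in content_words[i]] for j in words}  # T_j

    prob = pulp.LpProblem("DELSumm", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", sents, cat="Binary")   # sentence indicators
    y = pulp.LpVariable.dicts("y", words, cat="Binary")   # content-word indicators

    # Eqn. (1): informativeness of selected sentences + scores of selected content words
    prob += pulp.lpSum(I[i] * x[i] for i in sents) + pulp.lpSum(score[j] * y[j] for j in words)

    # Eqn. (2): summary length at most L words
    prob += pulp.lpSum(length[i] * x[i] for i in sents) <= L
    # Eqn. (3): if a sentence is selected, all its content words are selected
    for i in sents:
        prob += pulp.lpSum(y[j] for j in content_words[i]) >= len(content_words[i]) * x[i]
    # Eqn. (4): a content word is selected only if some sentence containing it is selected
    for j in words:
        prob += pulp.lpSum(x[i] for i in occurs_in[j]) >= y[j]
    # Eqn. (5): minimum representation of every rhetorical segment
    for k, sent_ids in segments.items():
        prob += pulp.lpSum(x[i] for i in sent_ids) >= NOS[k]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in sents if x[i].value() == 1]
```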
Handling Redundancy: DELSumm implicitly handles redundancy in the summary through the y_j variables that indicate inclusion of content words (refer to Eqn. 3). According to Eqn. 3, if a sentence is included in the summary, then the content words in that sentence are also selected. Consider that two sentences i and i′ are similar, so that including both in the summary would be redundant. The two sentences are expected to have the same content words. Hence, adding both i and i′ (which have the same content words) will not help in maximizing the objective function. Hence the ILP method is expected to refrain from adding both sentences, thus prohibiting redundancy in the summary.

Applying DELSumm to case documents of a specific jurisdiction: It can be noted that this section has given only a general description of DELSumm, where it has been assumed that many details will be derived from domain knowledge (e.g., the informativeness or importance of a sentence or content word, or values of NOS_k). In the next section, we specify how these details are derived from the guidelines given by domain experts from India.

5 APPLYING DELSUMM ON INDIAN CASE DOCUMENTS
In this section, we describe how DELSumm was adapted to summarize Indian case documents. We first describe the summarization guidelines stated by law experts from India, and then discuss how to adapt those guidelines into DELSumm. We then discuss comparative results of the proposed algorithm and the baselines on the India dataset.

5.1 Guidelines from Law experts
We consulted law experts – senior law students and faculty members from the Rajiv Gandhi School of Intellectual Property Law (a reputed law school in India) – to know how Indian case documents should be summarized. Based on the rhetorical segments described in Section 2.5, the law experts suggested the following guidelines.
• (G1) In general, the summary should contain representations from all segments of a case document, except the segment 'Ruling by Lower Court' which may be omitted. The relative importance of segments in a summary should be: Final judgment > Issue > Fact > (Statute, Precedent, Ratio) > Argument.
• (G2) The segments 'Final judgement' and 'Issue' are exceptionally important for the summary. These segments are usually very short in the documents, and so they can be included completely in the summary.
• (G3) The important sentences in various rhetorical segments (Fact, Statute, Precedent, Ratio) should be decided as follows – (a) Fact: sentences that appear at the beginning; (b) Statute: sentences that contain citations to an Act; (c) Precedent: sentences that contain citations to prior-cases; (d) Ratio: sentences appearing at the end of the document and sentences that contain a citation to an act/law/prior-case.
• (G4) The summary must contain sentences that give important details of the case, including the Acts and sections of Acts that were referred, names of people, places etc. that concern the case, and so on. Also, sentences containing specific legal keywords are usually important.

5.2 Operationalizing the guidelines
Table 2 shows how the above-mentioned guidelines from law experts in India have been operationalized in DELSumm.
• Operationalizing G1: We experiment with two ways of assigning weights to segments – linearly decreasing and exponentially decreasing – based on guideline G1 stated above. We finally decided to go with the exponentially decreasing weights. Specifically, we assign weight 2^7 to the 'Final Judgement' segment (that is judged to be most important by experts), followed by 2^6 for 'Issue', 2^5 for 'Fact', 2^3 for 'Statute', 'Ratio' & 'Precedent', and finally 2^1 for 'Argument'.
• Operationalizing G2: As per guideline G2, NOS_k (the minimum number of sentences from a segment k) is set to ensure a minimum representation of every segment in the summary – the 'final judgement' and 'issue' rhetorical segments are to be included fully, and at least 2 sentences from every other segment are to be included.
• Operationalizing G3: The informativeness of a sentence i defines the importance of a sentence in terms of its information content. More informative sentences are likely to be included in the summary. To find the informativeness I(i) of a sentence i, we use guideline G3 stated above. I(i) therefore depends on the rhetorical segment k (Fact, Issue, etc.) that contains a particular sentence i. For instance, G3(a) dictates that sentences appearing at the beginning of the rhetorical segment 'Fact' are important. We incorporate this guideline by weighing the sentences within 'Fact' by the inverse of their position in the document.
According to G3(b) and G3(c), sentences that contain mentions of Statutes/Act names (e.g., Section 302 of the Indian Penal Code, Article 15 of the Constitution, Dowry Prohibition Act 1961, etc.) and prior-cases/precedents are important. We incorporate this guideline through a Boolean/indicator variable a_i which is 1 if the sentence contains a mention of a Statute. We use regular expression patterns to detect Statute/Act name mentions. Similarly, for detecting if a sentence contains a reference to a prior case, we use the Boolean

Guideline : Operationalization in DELSumm

(G1) Segment weights:
    Highest Weight(k) assigned to Final judgement and Issue; weight decreasing exponentially / linearly for the other segments:
    Final Judgement > Issue > Fact > (Statute, Ratio, Precedent) > Argument

(G2) Segment representation:
    NOS_k = |S_k|            if k = Final judgement, Issue
    NOS_k = min(2, |S_k|)    otherwise

(G3) Sentence Informativeness:
    I(i) = Weight(k) · (1/position(i))                if k = Fact
    I(i) = Weight(k) · a_i                            if k = Statute
    I(i) = Weight(k) · p_i                            if k = Precedent
    I(i) = Weight(k) · position(i) · (p_i OR a_i)     if k = Ratio
    I(i) = Weight(k)                                  otherwise

(G4) Content Words & their weights:
    Score(j) = 5    if j = Act/sections of Acts (e.g., Indian Penal Code, 1860; Article 15, etc.)
    Score(j) = 3    if j = Keywords from a legal dictionary
    Score(j) = 1    if j = Noun Phrases

Table 2: Parameter settings in DELSumm for operationalizing the guidelines provided by law experts for Indian case documents (stated in Section 5.1). Parameter settings such as weights of various segments, how to compute importance of sentences in various segments, etc. are decided based on the segment k. The symbols are explained in Table 1.
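Read row by row, Table 2 translates into a handful of small functions. The sketch below is one possible encoding of the segment weights, NOS_k and I(i); the segment label strings and the 1-indexed position convention are assumptions made for illustration, and the Boolean statute/prior-case flags correspond to the a_i and p_i detectors discussed in the surrounding text.

```python
# Illustrative encoding of Table 2: segment weights (G1), NOS_k (G2) and I(i) (G3).

WEIGHT = {                      # (G1) exponentially decreasing segment weights
    "Final judgement": 2**7, "Issue": 2**6, "Fact": 2**5,
    "Statute": 2**3, "Ratio": 2**3, "Precedent": 2**3, "Argument": 2**1,
}

def nos(segment_label, segment_sentences):
    """(G2) minimum number of sentences to take from a segment."""
    if segment_label in ("Final judgement", "Issue"):
        return len(segment_sentences)          # include these segments fully
    return min(2, len(segment_sentences))      # at least 2 sentences elsewhere

def informativeness(label, position, cites_statute, cites_precedent):
    """(G3) I(i) for a sentence, given its rhetorical label, its position in the
    document (1-indexed), and the Boolean indicators a_i / p_i."""
    w = WEIGHT.get(label, 1)
    if label == "Fact":
        return w * (1.0 / position)            # early Fact sentences matter most
    if label == "Statute":
        return w if cites_statute else 0
    if label == "Precedent":
        return w if cites_precedent else 0
    if label == "Ratio":                       # late sentences citing an act/law/prior-case
        return w * position if (cites_statute or cites_precedent) else 0
    return w                                   # Final judgement, Issue, Argument, ...
```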

variable p_i. If a regular expression pattern of the form Party 1 vs. Party 2 is found in a sentence i, p_i is set to 1.

For understanding important Ratio sentences, we look up guideline G3(d), which mandates that sentences containing references to either a Statute/Act name or a prior-case are important to be included in the summary. We therefore use the two Boolean/indicator variables a_i and p_i for detecting the presence of a Statute/Act name or a prior-case. If any one of them occurs in the sentence, we consider that sentence from segment Ratio to be informative.
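The paper does not reproduce its exact regular expressions or the Westlaw-derived list of Act names, so the detectors below are only indicative of how a_i and p_i might be computed.

```python
# Illustrative detectors for the a_i (statute) and p_i (prior-case) indicators.
# The patterns are examples only; the actual expressions used are not published here.
import re

# e.g. "Section 302 of the Indian Penal Code", "Article 15", "Dowry Prohibition Act 1961"
STATUTE_PATTERN = re.compile(
    r"(Section\s+\d+[A-Z]?\s+of\s+the\s+[A-Z][\w\s]+)"
    r"|(Article\s+\d+[A-Z]?)"
    r"|([A-Z][\w\s]+Act,?\s*\d{4})"
)

# the "Party 1 vs. Party 2" citation form, e.g. "Some Appellant vs. Some Respondent"
PRECEDENT_PATTERN = re.compile(r"[A-Z][\w.&\s]+\s+v(s)?\.\s+[A-Z][\w.&\s]+")

def has_statute(sentence: str) -> bool:
    """a_i: True if the sentence mentions a Statute/Act name."""
    return bool(STATUTE_PATTERN.search(sentence))

def has_precedent(sentence: str) -> bool:
    """p_i: True if the sentence cites a prior case (Party 1 vs. Party 2)."""
    return bool(PRECEDENT_PATTERN.search(sentence))
```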
• Operationalizing G4: Based on guideline G4 stated above, we need to identify some important 'content words' whose presence would indicate the sentences that contain especially important details of the case. To this end, we consider three types of content words – mentions of Acts/sections of Acts, keywords from a legal dictionary, and noun phrases (since names of people, places, etc. are important). For identifying legal keywords, we use a legal dictionary from the website advocatekhoj.com. For identifying mentions of Acts and statutes, we use a comprehensive list of Acts in the Indian judiciary, obtained from Westlaw India (westlawindia.com). As content word scores (Score(j)), we assign a weight of 5 to statute mentions, 3 to legal phrases and 1 to noun phrases; these scores are based on suggestions by the legal experts on the relative importance of sentences containing different types of content words.
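A minimal sketch of this content-word extraction is given below; the use of spaCy noun chunks and simple substring matching against the dictionary and Act lists are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative computation of content words and their Score(j) values (guideline G4).
# legal_keywords and act_names stand in for the advocatekhoj.com dictionary and the
# Westlaw India list of Acts; noun phrases come from spaCy's noun_chunks.
import spacy

nlp = spacy.load("en_core_web_sm")   # any English pipeline with a parser would do

def content_word_scores(document_text, act_names, legal_keywords):
    """Return {content word/phrase: Score(j)} with weights 5 / 3 / 1 as in Table 2."""
    scores = {}
    for chunk in nlp(document_text).noun_chunks:     # names of people, places, etc.
        scores[chunk.text.lower()] = 1
    lowered = document_text.lower()
    for kw in legal_keywords:                        # keywords from a legal dictionary
        if kw.lower() in lowered:
            scores[kw.lower()] = 3
    for act in act_names:                            # Acts / sections of Acts
        if act.lower() in lowered:
            scores[act.lower()] = 5
    return scores
```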
6 BASELINES AND EXPERIMENTAL SETUP
This section describes the baseline summarization methods that we consider for comparison with the proposed method. Also, the experimental setup used to compare all methods is discussed.

6.1 Baseline summarization methods
As described in Section 2, there are four classes of methods that can be applied for extractive summarization of legal case documents. We consider some representative methods from each class as baselines.

Unsupervised domain-independent methods: We consider as baselines the following popular algorithms from this class: (1) Luhn [16], (2) LexRank [7], (3) Reduction [13], and (4) LSA [10] (see Section 2.1 for brief descriptions of these methods).

Unsupervised domain-specific methods: Section 2.2 gave brief descriptions of these methods. Among such methods, we consider the following four as baselines – (1) LetSum [8], (2) KMM [20], (3) CaseSummarizer [18] and (4) a simplified version of MMR [25] – in the absence of identical datasets, we adopt only the Maximum Margin Relevance module and use it to summarize a document.

Supervised domain-specific methods: From this class of algorithms, we consider Gist [14], a recent method that applies general ML algorithms to the task. The best performance was observed by using Gradient Boosted Decision Tree as the ML classifier, which we report (see Section 2.4 for details of the method).

Supervised domain-independent methods: Among these methods, we consider the neural method SummaRuNNer [17] (implementation available at https://github.com/hpzhao/SummaRuNNer). Similar to Gist, they consider the task of extractive summarization as a binary classification problem. The classifier returns a ranked list of sentences based on their prediction/confidence probabilities about their inclusion in the summary. We include sentences from this list in decreasing order of their predicted probabilities, until the desired summary length is reached.
We also apply a recent BERT-based summarization method BERTSUM [15] (implementation available at https://github.com/nlpyang/PreSumm). In BERTSUM, the sentence selection function is a simple binary classification task (whether or not to include a sentence in the summary).[2] Similar to SummaRuNNer, we use the ranked list of sentences based on the confidence probabilities of their inclusion in the summary. We include sentences one-by-one into the final summary, until the desired length is reached.

[2] The original BERTSUM model uses a post-processing step called Trigram Blocking that excludes a candidate sentence if it has a significant amount of trigram overlap with the already generated summary (to minimize redundancy in the summary). However, we observed that this step leads to summaries that are too short, as also observed in [22]. Hence we ignore this step.
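The selection loop described above (used for both SummaRuNNer and BERTSUM) amounts to a few lines. The word-based length cap matches the target-length setup of Section 6.2, but the exact tie-breaking and stopping behaviour are not specified, so the sketch below is only an approximation.

```python
# Minimal sketch of assembling a fixed-length extract from a ranked list of sentences.
def assemble_summary(ranked, L):
    """ranked: list of (position_in_document, sentence) pairs, sorted by decreasing
    model confidence; L: target summary length in words."""
    chosen, words = [], 0
    for pos, sent in ranked:
        n = len(sent.split())
        if words + n > L:
            break                        # stop once the next sentence would exceed L
        chosen.append((pos, sent))
        words += n
    return " ".join(sent for _, sent in sorted(chosen))   # restore document order
```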

Implementations of the baselines: For Luhn, LexRank, Reduc- ROUGE 2 ROUGE-L


Algorithm
tion and LSA, we use the implementations from the Python sumy R F R F
module. The implementations of SummaRuNNEr and BERTSUM Unsupervised Domain Independent
are also available publicly, at the URLs stated above. We imple- (i) LexRank 0.3662 0.3719 0.5068 0.5392
mented the legal domain-specific summarization methods – Cas- (ii) LSA 0.3644 0.3646 0.5632 0.5483
eSummarizer, LetSum, GraphicalModel, MMR and Gist – following
(iii) Luhn 0.3907 0.3873 0.5424 0.5521
the descriptions in the corresponding papers as closely as possible
(the same parameter values were used as specified in the papers). (iv) Reduction 0.3778 0.3814 0.5194 0.542
Unsupervised Domain Specific
Converting abstractive summaries to extractive for training
(v) LetSum 0.403 0.4137 0.5898 0.5846*
supervised summarization models: The supervised methods
(vi) KMM 0.354 0.3546 0.5407 0.5385
Gist, SumaRuNNer and BERTSUM require labelled data for train-
ing, where every sentence in the source document must be labeled (vii) CaseSummarizer 0.3699 0.3765 0.4925 0.5349
as 1 if this sentence is suitable for inclusion in the summary, and (viii) MMR 0.3733 0.3729 0.6064 0.568
labeled as 0 otherwise. The publicly available headnotes in our DELSumm (proposed) 0.4323 0.4217 0.6831 0.6017
training datasets (see Section 3) are abstractive in nature and can- Supervised Domain Independent
not be directly used to train such methods. Hence we convert the (ix) SummaRuNNer 0.4104* 0.4149* 0.5835 0.5821
abstractive summaries to extractive for training these methods. (x) BERTSUM 0.4044 0.4048 0.56 0.5529
To this end, we adopt the technique mentioned in [17]. Briefly,
Supervised Domain Specific
we assign label 1 (suitable for inclusion in the summary) to those
sentences from the source document (full text), that greedily max- (xi) Gist 0.3567 0.3593 0.5712 0.5501
imizes the ROUGE 2-overlap with the abstractive summary. The Table 3: Performance of the summarization methods as per
rest of the sentences in the source document are assigned label 0. ROUGE metrics (R = recall, F = Fscore). All values are aver-
aged over the 50 documents in the evaluation set. The best
6.2 Experimental setup value for each measure is in bold and the second best in
Now we discuss the setup for our experiments to compare the underlined. Entries with asterisk(*) indicates DELSumm is
performances of various summarization models. not statistically significant than baselines using 95% confi-
dence interval by Student’s T-Test.
Target length of summaries: Summarization models need to be
told the target length 𝐿 (in number of words) of the summaries that
they need to generated. For every document in the India dataset, we various methods in summarizing the Indian Supreme Court case
have two gold standard summaries written by two domain experts. documents. In both tables, ROUGE scores are averaged over the 50
For a particular document, we consider the target summary length documents in the India evaluation set.
𝐿 to be the average of the number of words in the summaries (for
that document) written by the two experts.
7.1 Evaluation of the full summaries
For each document, every summarization model is made to gen-
erate a summary of length at most 𝐿 words, which ensures a fair Table 3 shows the performances of summarizing the full documents
comparison among the different models. in terms of ROUGE-2 and ROUGE-L Recall and F-scores. Consider-
ing all the approaches, DELSumm achieves the best performance
Evaluation of summary quality: We use the standard ROUGE across all the measures. The second-best metric values are achieved
metrics for evaluation of summary quality (computed using https: by LetSum (For ROUGE-L F-score), MMR (for ROUGE-L Recall),
//pypi.org/project/rouge/ v. 1.0.0). For a particular document, we and SummaRuNNer (for ROUGE-2 Recall and F-score). To check
have gold standard summaries written by two legal experts (as if their performances are statistically significantly different from
was described in Section 3). We compute the ROUGE scores of those achieved by DELSumm , we perform Student’s T-Test at 95%
the algorithmic summary and each of the gold standard summary, (𝑝 < 0.05) with these three methods. Entries that are not statistically
and take the average of the two. We remove a common set of significant are marked with asterisk (*) in Table 3.
English stopwords from both the gold standard summaries and the It can be noted that, out of all the domain-specific methods in
algorithm-generated summaries before measuring ROUGE scores. Table 3 (both supervised and unsupervised), DELSumm performs
We report ROUGE-2 Recall and F-score, and ROUGE-L Recall and the best for all the measures. While LetSum is the only prior work
F-scores. The scores are averaged over all 50 documents in the that considers rhetorical segments while generating the summary
evaluation set. (see Section 2), none of these methods take into account domain-
specific guidelines on what to include in the summary. DELSumm
7 RESULTS AND ANALYSES on the other hand models the legal knowledge about what to include
This section compares the performances of the proposed DELSumm in a summary, thus being able to perform better.
and all the baseline algorithms. We compare the performances Importantly, DELSumm outperforms even supervised neural
broadly in two ways – (i) how good the summaries for the full models such as SummaRuNNer and BERTSUM (though the differ-
documents are, and (ii) how well each individual rhetorical segment ence in the performance of SummaRuNNer with respect to DEL-
is summarized. Table 3 and Table 4 show the performances of Summ is not statistically significant in terms of ROUGE-2 scores).

Final
Issue Facts Statute
Precedent
Argument Similar to [2], we also observe that many of the baseline meth-
Algorithm judgement
(1.2%) (23.9%) (7.1%)
+Ratio
(8.6%) ods could not represent the ‘Final judgement’ and ‘Issue’ well in
(2.9%) (53.1%)
their summaries. This flaw is especially critical since these two
Unsupervised, Domain Independent
segments are the most important in the summary according to
LexRank 0.0619 0.3469 0.4550 0.2661 0.3658 0.4284
the law experts (see Section 5.1). In contrast, DELSumm achieves
LSA 0.0275 0.2529 0.5217 0.2268 0.3527 0.3705
Luhn 0.0358 0.2754 0.5408 0.2662 0.2927 0.3781
a very high performance on these two important segments. This
Reduction 0.0352 0.3153 0.5064 0.2579 0.3059 0.4390
difference in performance is possibly because these two segments
Unsupervised, Domain Specific
are also the shortest segments (constitutes only 2.9% and 1.2% of
LetSum 0.0423 0.3926 0.6246 0.3469 0.3853 0.2830 the whole document, as stated in the first row of Table 4), and
KMM 0.3254 0.2979 0.4124 0.3415 0.4450 0.416 hence are missed by other methods which do not know of their
CaseSummarizer 0.2474 0.3537 0.4500 0.2255 0.4461 0.4184 domain-specific importance. This observation show the necessity
MMR 0.4378 0.3548 0.4442 0.2763 0.4647 0.3705 of an informed algorithm that can incorporate domain knowledge
DELSumm 0.7929 0.6635 0.5539 0.4030 0.4305 0.4370 from experts.
Supervised, Domain Independent DELSumm also represents the ‘Statute’ segment better than all
SummaRuNNer 0.4451 0.2990 0.5231 0.1636 0.5215 0.3090 other methods. For the ‘Statute’ segment, the algorithm was for-
BERTSUM 0.0662 0.3544 0.6376 0.2535 0.3121 0.3262 mulated in such a way (through the 𝑎𝑖 variable) that it is able to
Supervised, Domain Specific incorporate sentences that contain mention of an Act/law. Other
Gist 0.5844 0.3856 0.4621 0.2759 0.4537 0.2132 methods did not perform well in this aspect. The performance of
Table 4: Segment-wise performance (ROUGE-L F-scores) of DELSumm for the ‘Argument’ segment is second-best (0.4370) after
the methods. All values are averaged over the 50 documents that of Reduction (0.4390). These values are very close, and the
in the evaluation set. Values in the column headings are the difference is not statistically significant.
% of sentences in the full document that belong to a segment. Our method is unable to perform as well for the ‘Precedent +
Values < 0.3 highlighted in red-underlined. The best value Ratio’ and ‘Facts’ segments as some other methods. Note that, these
for each segment in green-bold. segments accounts for maximum number of sentences in a document
Note that DELSumm is a completely unsupervised method while and also forms a large part of the summary (see Table 4, first row).
SummaRuNNer and BERTSUM are deep-learning based supervised Hence neural methods (e.g., BERTSUM) and methods relying of
methods trained over 7, 100 document-summary pairs. The fact TF-IDF measures (e.g., LetSum) obtained relatively large amounts
that DELSumm still outperforms these supervised models is be- of training data for these segments, and hence performed well for
cause DELSumm intelligently utilizes domain knowledge while these segments.
generating the summaries. Finally, although LetSum is a segment-aware algorithm, its sum-
marization mechanism is not strong enough to understand the
importance of sentences in all segments; for instance, it performs
7.2 Evaluation of segment-wise summarization very poorly for the ‘Final judgement’ segment. DL methods fail to
Overall ROUGE scores are not the best metrics for evaluating case perform well for the smaller segments (e.g., Final judgement, Issue)
document summaries. Law experts opine that, even methods that for which lesser amounts of training data is available.
achieve high overall ROUGE scores may not represent every seg- Overall, it is to note that DELSumm achieves a much more
ment well in the summary [2]. A segment-wise performance evalu- balanced representation of segments in the summary, com-
ation is practically important since law practitioners often intend pared to all the baseline methods. Contrary to the baseline methods,
to read the summary of a particular segment, and not always the DELSumm shows decent performance for all the segments. This
full summary. Hence we perform a segment-wise performance has been taken care mainly through the constraint in Eqn. 5. Hence,
evaluation of the algorithms. even when the most important segments (e.g., Final judgement) are
To this end, we proceed as follows. For each rhetorical segment optimized in the summary, the less important ones (e.g., Arguments)
(e.g., Facts, Issues), we extract the portions of an algorithmic sum- are not missed out.
mary and the gold standard summary that represent the given
segment, using the gold standard rhetorical labels of individual
sentences (which are present in our evaluation set, as detailed in
Section 3). Then we compute the ROUGE scores on those specific
text portions only. For instance, to compute the ROUGE score on the 7.3 Discussion on the comparative results
‘Fact’ segment of a particular document, we only consider those sen- The results stated above suggest that our proposed method per-
tences in the gold standard summary and the algorithmic summary, forms better than many existing domain-independent as well as
which have the rhetorical label ‘Fact’ in the said document. We legal domain-specific algorithms. The primary reason for this su-
report the average ROUGE score for a particular segment, averaged perior performance of DELSumm is that it is able to incorporate
over all 50 documents in the evaluation set. legal domain knowledge much more efficiently through a theoret-
Table 4 shows the segment-wise ROUGE-L F-scores (averaged ically grounded approach. By doing this, it is able to surpass the
over all 50 documents in the evaluation set). The values < 0.3 are performance of deep learning and machine learning approaches
underlined and highlighted in red color, while the best value for such as SummaRuNNer, BERTSUM and Gist (that are trained over
each segment is highlighted in boldface and green color. 7, 100 document-summary pairs). These results show that if legal

domain knowledge can be used intelligently, even simple unsuper-


vised summarization algorithms can compete with state-of-the-art
supervised learning approaches.
It can be argued that 7, 100 document-summary pairs are too
less to train a deep learning summarization model; additionally,
as stated in Section 3, the summaries (headnotes) in the training
set may not be of the same quality as those in the evaluation set.
However, in domains such as legal, it is difficult and expensive to
obtain rich quality training data in huge amounts. In such scenarios,
DELSumm , being an unsupervised method, can work efficiently
by incorporating domain knowledge.
Figure 1: Effect on the performance of DELSumm with al-
Another advantage of DELSumm is that it is easily customiz-
gorithmic labels. There is minimal degradation of ROUGE
able – if the requirement is that a particular segment should have
scores when using algorithmic labels
more representation in the summary than the others, the segment
weights and representations can be tuned accordingly. If the sen- standard sentence labels, and again using the algorithmic sentence
tence importance measures are different, that too can be adjusted. labels. We perform three evaluations:
Other domain-specific and supervised learning methods do not (i) Accuracy of rhetorical labelling: we check what fraction of
exhibit such flexibility. the algorithmic labels match with the gold standard labels (over
the documents in 𝐸 𝑓 ). This measure gives an idea of the accuracy
8 USING DELSUMM WITH ALGORITHMIC of the rhetorical labeling model [3].
RHETORICAL LABELS (ii) ROUGE scores with Gold Std. Labels: We compute the ROUGE
DELSumm requires sentences of the source document (full text) to scores between the gold standard summaries and the summaries
be labelled with a rhetorical role prior to summarization. To this generated by DELSumm considering the gold standard labels. We
end, till now we have been using expert-annotated labels. However, average the ROUGE scores over all documents in the set 𝐸 𝑓 (this is
in a real scenario, it is not possible to get expert-annotations for very similar to what we did in Section 5).
rhetorical role labelling for every document. A more practical ap- (iii) ROUGE scores with Algorithmic Labels: We compute the
proach is to first computationally generate these rhetorical labels ROUGE scores between the gold standard summaries and sum-
through some algorithm that detects rhetorical roles of sentences maries generated by DELSumm with the algorithmic labels, i.e.,
in a legal document, and then consider the labelled document as labels inferred by the model trained over 𝑇𝑓 . Again we average the
input to DELSumm for summarization. ROUGE scores over all documents in the set 𝐸 𝑓 .
Note that algorithmically derived rhetorical labels are likely to The differences between these two sets of ROUGE scores can be
be noisy (i.e., not 100% accurate). In this section, we analyse how used to quantify the degradation in performance of DELSumm
the performance of DELSumm is affected by such noisy algorithmic while using noisy algorithmic labels instead of gold standard labels.
labels when compared to gold standard labels. To this end, we
Results: We perform the above experiment with the India dataset
perform the following experiment (which is inspired by the idea of 𝑘-
where 𝑛 = 50, and we consider 𝑘 = 5 folds. Hence in each fold,
fold cross validation that is popular in Machine Learning literature).
we use a training set of 40 documents and an evaluation set of 10
Experimental Design: Assume that we have a dataset 𝐷 consist- documents. Figure 1 shows the results of the above experiment.
ing of 𝑛 documents, each having sentence-wise rhetorical label The horizontal axis shows the rhetorical labeling accuracy across
annotations as well as gold-standard summaries (e.g., the Indian the 𝑘 = 5 folds; we see that the model [3] achieves accuracies in the
dataset we are using in this paper, with 𝑛 = 50). We divide the range [0.825, 0.851], i.e., about 82%–85% of the algorithmic labels
dataset into 𝑘 folds (e.g., 𝑘 = 5). In each fold 𝑓 , there are 𝑛/𝑘 docu- are correct. The figure shows the two types of ROUGE-L F-scores
ments for evaluation, which we call as the evaluation set 𝐸 𝑓 . The as described above – one set using the algorithmic labels (denoted
remaining 𝑛 − (𝑛/𝑘) documents are considered for training a model by red diamonds), and the other set using the gold-standard labels
for rhetorical role labeling, termed as training set 𝑇𝑓 . We also have (denoted by black circles) – for each of the 5 folds. Each value
the expert annotated labels for the documents in fold 𝑓 , and their is averaged over the 10 documents in the evaluation set of the
gold standard summaries. For each fold, we use 𝑇𝑓 for training the corresponding fold. We find that the two sets of ROUGE scores are
state-of-the-art rhetorical role identification model Hier-BiLSTM- very close. In other words, even when noisy/inaccurate labels are
CRF from our prior work [3].3 used in DELSumm , the drop in performance is very low, indicating
Using the trained model for that fold, we infer the rhetorical the robustness of DELSumm .
labels of the documents in the evaluation set 𝐸 𝑓 for that fold. Now, Finally, we compute the ROUGE scores averaged over all 𝑛 = 50
we have two sets of rhetorical labels for the sentences in the doc- documents across all 𝑘 = 5 folds. Table 5 shows the average ROUGE
uments in 𝐸 𝑓 – (i) gold standard labels given by law experts, and scores achieved by DELSumm using the algorithmic labels (denoted
(ii) algorithmic labels assigned by the model [3]. We next summarize ‘DELSumm AL’). For ease of comparison, we repeat some rows
every document in 𝐸 𝑓 with DELSumm twice, once using the gold from Table 3 – the performance of DELSumm with gold standard
rhetorical labels (denoted ‘DELSumm GSL’), and the performance
3 Implementation of the closest competitors LetSum, MMR and SummaRuNNer. We
available at https://github.com/Law-AI/semantic-segmentation.

ROUGE-2 ROUGE-L Analytics’. This work is also supported in part by the European
Algorithm
R F R F Union’s Horizon 2020 research and innovation programme under
LetSum 0.4030 0.4137 0.5898 0.5846 grant agreement No 832921. P. Bhattacharya is supported by a
MMR 0.3733 0.3729 0.6064 0.5680 Fellowship from Tata Consultancy Services.
SummaRuNNer 0.4104 0.4149 0.5835 0.5821
DELSumm (GSL) 0.4323 0.4217 0.6831 0.6017 REFERENCES
DELSumm (AL) 0.4193 0.4075 0.6645 0.5897 [1] Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Multi-
document abstractive summarization using ILP based multi-sentence compres-
Table 5: Performance of DELSumm with Gold Standard la- sion. In Proc. International Conference on Artificial Intelligence.
bels (GSL) and Algorithmic labels (AL) over the 50 docu- [2] Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripa-
bandhu Ghosh, and Saptarshi Ghosh. 2019. A comparative study of summariza-
ments. Apart from the last row, all other rows are repeated
tion algorithms applied to legal case judgments. In Proc. European Conference on
from Table 3. While DELSumm (AL) shows slightly degraded Information Retrieval.
performance than DELSumm (GSL), it still outperforms the [3] Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and
Adam Wyner. 2019. Identification of Rhetorical Roles of Sentences in Indian
other methods according to all ROUGE measures except Legal Judgments. Proc. Legal knowledge and information systems (JURIX) (2019).
ROUGE-2 F-score. [4] Jianpeng Cheng and Mirella Lapata. 2016. Neural Summarization by Extracting
Sentences and Words. In Proc. Annual Meeting of the Association for Computational
observe that, even when DELSumm is used with automatically Linguistics.
generated algorithmic labels (with about 15% noise), it still per- [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. In Proc. NAACL-HLT.
forms better than LetSum, MMR and SummaRuNNer (which were [6] Yue Dong. 2018. A Survey on Neural Network-Based Summarization Methods.
its closest competitors) according to all ROUGE measures except CoRR abs/1804.04589 (2018). arXiv:1804.04589 http://arxiv.org/abs/1804.04589
[7] Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based Lexical
ROUGE-2 Fscore. This robustness of DELSumm is because, through Centrality As Salience in Text Summarization. J. Artif. Int. Res. 22, 1 (2004).
the sentence informativeness and content words measures in the [8] Atefeh Farzindar and Guy Lapalme. 2004. Letsum, an automatic legal text sum-
ILP formulation, DELSumm can capture useful information even if marizing system. Proc. Legal knowledge and information systems (JURIX) (2004).
[9] Jessica Giles. 2015. Writing Case Notes and Case Comments.
the rhetorical labels may be a little inaccurate.

9 CONCLUSION
We propose DELSumm, an unsupervised algorithm that systematically incorporates domain knowledge for extractive summarization of legal case documents. Extensive experiments and comparison with as many as eleven baselines, including deep learning-based approaches as well as domain-specific approaches, show the utility of our approach. The strengths of our approach are: (i) DELSumm systematically encodes domain knowledge necessary for legal document summarization into a computational approach, (ii) although an unsupervised approach, it performs at par with supervised learning models trained over huge amounts of training data, (iii) it is able to provide a summary that has a balanced representation from all the rhetorical segments, which is highly lacking in prior approaches, (iv) inaccuracies in labels do not degrade the performance of DELSumm much, thus showing its robustness and rich information identification capabilities, and (v) the method is flexible and generalizable to summarize documents from other jurisdictions; all that is needed are the expert guidelines for what to include in the summary, and how to identify important content words. The objective function and constraints can be adjusted as per the requirements of different jurisdictions (e.g., giving more weight to certain segments). The implementation of DELSumm is publicly available at https://github.com/Law-AI/DELSumm.
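The conclusion above notes that the objective function and constraints can be re-weighted for different jurisdictions. As a purely illustrative sketch of that idea, and not the actual DELSumm formulation (which is available in the repository linked above), the following Python fragment selects sentences for an extractive summary with a small integer linear program in PuLP; the rhetorical segment labels, weights, content scores and word budget are hypothetical placeholders.

```python
# Illustrative sketch only: a generic segment-weighted sentence-selection ILP,
# not the DELSumm objective. All labels, weights, scores and the budget are
# hypothetical placeholders.
import pulp

# each candidate: (sentence text, rhetorical segment, content score, length in words)
candidates = [
    ("The appellant was convicted under Section 302.", "FACTS", 0.6, 8),
    ("The sole issue is whether the confession was voluntary.", "ISSUE", 0.9, 9),
    ("The appeal is allowed and the conviction is set aside.", "RULING", 0.8, 11),
]
segment_weight = {"FACTS": 1.0, "ISSUE": 2.0, "RULING": 2.5}  # e.g. give more weight to certain segments
word_budget = 20  # maximum summary length in words

prob = pulp.LpProblem("summary_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(candidates))]

# objective: prefer informative sentences from highly weighted segments
prob += pulp.lpSum(segment_weight[seg] * score * x[i]
                   for i, (_, seg, score, _) in enumerate(candidates))

# constraint: the selected sentences must fit within the word budget
prob += pulp.lpSum(length * x[i]
                   for i, (_, _, _, length) in enumerate(candidates)) <= word_budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
summary = [sent for (sent, _, _, _), var in zip(candidates, x) if var.value() == 1]
print(summary)
```

Changing the entries of segment_weight, or adding further constraints, is the kind of adjustment the conclusion refers to when adapting the method to another jurisdiction.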
In the future, we plan to apply DELSumm to documents from other jurisdictions as well as to generate different types of summaries (e.g., for different stakeholders) and analyse the performance.

Acknowledgements: The authors thank the Law experts from the Rajiv Gandhi School of Intellectual Property Law, India who helped in developing the gold standard data and provided the guidelines for summarization. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled 'Smart Legal Consultant: AI-based Legal

AI Systems and Product Liability

Prof. Dr. Georg Borges


Institute of Legal Informatics
Saarland University
Saarbrücken, Germany
georg.borges@uni-saarland.de

ABSTRACT
The article examines whether the current product liability law provides an appropriate regulation for AI systems. This question, which is discussed using the example of the European Product Liability Directive, is of great practical importance in the current legal policy discussion on liability for AI systems.

This article demonstrates that in principle the liability requirements are also applicable to AI systems. If the conduct of an AI system is carefully distinguished from its properties, excessive liability can be avoided. Reversing the burden of proof in favour of the injured party in the case of faulty behaviour enables a liability regime that is fair to the interests at stake.

However, product liability law only applies if AI systems lead directly to personal injury or damage to property. Product liability law is not applicable insofar as AI systems indirectly lead to considerable disadvantages for the person concerned, in particular through assessments of persons. Protection against discrimination or otherwise unfair assessments by AI systems shall be effected by other legal instruments.

CCS CONCEPTS
• Applied computing → Law

KEYWORDS
Product liability, AI systems, Product Liability Directive, strict liability, burden of proof

ACM Reference format:
Georg Borges. 2021. AI systems and product liability. In Proceedings of ICAIL 2021, 21-25 June 2021, São Paulo, Brazil, 8 pages. https://doi.org/10.1145/3462757.3466099

1 The role of liability within the legal framework for AI systems
High expectations are associated with artificial intelligence. At the same time, the risks related to this technology are also a highly debated topic.

This naturally applies in particular to the legal discussion, the aim of which is to reconcile the interest in raising the benefits associated with the introduction of new technologies with the interest in protecting against the associated risks.

Artificial intelligence, especially the production and use of AI-equipped systems (often referred to as autonomous systems or AI systems), has been a challenge not least for the law, which has the task of balancing the interest in enhancing the benefits associated with the introduction of new technologies with the interest in protecting against the risks associated with such technologies.

1.1 Liability for AI systems as a pillar of the legal framework for AI
It goes without saying that artificial intelligence and AI systems raise questions in almost every field of law. An aspect which has gained much attention recently is what is often referred to as "algorithmic fairness" [1]. In the US, the Federal Trade Commission recently published some guidance on the truthful, fair and equitable use of AI [2]. Another aspect of highest relevance certainly is liability for damage caused by AI systems.

The importance of a legal framework for AI systems and the role of liability can be seen clearly by looking at the current discussion in Europe.

In its 2018 strategy "Artificial Intelligence for Europe" [3], the EU Commission already explicitly focused on ethical and legal issues in addition to the technical and economic aspects. The EU Commission established several expert groups who support the commission in these tasks. In the field of ethical aspects of AI, the High Level Expert Group on AI [4] developed ethical guidelines for trustworthy AI which were published in June 2018 [5].

Regarding the regulation on liability and security, the Expert Group on liability and new technologies, consisting of a Product Liability Directive formation and a "New Technologies" formation, was established [6].

Recently, in April 2021, the Commission presented a proposal for an Artificial Intelligence Act ('AI Act') [7], which refers to the security of so-called AI systems. The concept of AI systems is defined very broadly in the draft and is likely to cover a large part of modern software and software-equipped products. The Act essentially distinguishes between three groups of AI systems. Firstly, the proposal provides a list of AI systems whose operation is to be prohibited (Art. 5 of the proposal). The second group of so-called high-risk AI systems will be subject to mandatory risk management (Art. 9 of the proposal), while a third group of certain AI systems will be subject to transparency requirements only (Art. 52 of the proposal).

As the second pillar of the legal framework for AI systems, civil liability, i.e. the duty to compensate for damage, is of great importance.

In its final report, the New Technologies formation of the EU Expert Group on liability for new technologies suggested amendments of the existing liability law regarding AI and autonomous systems [8].

In November 2020, the European Parliament issued a resolution on the civil liability regime for AI systems [9] which even contains a proposal for a new regulation on liability for systems equipped with artificial intelligence [10].

1.2 Risks Associated with the Use of AI
The risks associated with the use of AI systems are very different. Legal protection against these risks depends, among other things, on what protected interests are affected and whether the AI system in question can cause damage directly or whether the respective AI system only has an indirect effect through use of its activity by third parties. These differences are expressed in two groups of cases that are currently intensively discussed and investigated in the German "ExamAI" project, among others [11].

One case group concerns damage to persons and property directly caused by AI systems, also called cyber-physical systems [12]. The Uber vehicle accident of 2018, in which a highly automated vehicle hit and fatally injured a pedestrian [13], vividly demonstrated this danger, and the awareness of the risks induced by AI systems rose significantly.

Another group of cases concerns assessments of general information on individuals provided by an AI system and used by third parties for decisions which interfere with the rights of the affected individuals. In this group of cases, there is no direct interference with health or property and, most importantly, the interference is not directly caused by the AI system itself.

For the latter case group, connected with the keyword "bias in the data", the COMPAS system can be mentioned, as it illustrates the risks arising from 'biased' algorithms even though the related questions touch criminal law rather than liability law. COMPAS generates an evaluation of a human being (here: probability of recidivism of offenders) that has been used by third parties (here: judges) to make a decision that was highly relevant for human beings (here: offenders). The evaluation generated by the system was criticized for discriminating against certain groups of people [14], although this criticism is not without controversy [15]. In the current discussion, systems for pre-selecting job applicants are often mentioned [16].

An essential characteristic of damage caused directly or indirectly by AI systems is that the damage is not caused by a physical feature of the system, such as its material properties, but by its – own – behaviour, i.e., behaviour that is not directly controlled by a human being but by the AI immanent in the system. It is not the human driver who steers the car against the victim, but the automated vehicle itself. The assessment of the criminal or job seeker is not carried out by a human, but by an AI system.

This special feature of AI systems is of decisive importance for liability law, since the damage cannot be directly traced back to the behaviour of a particular natural person, which is why the justification of and the responsibility for the damage must be discussed.

The importance of liability law is based on the dual function of civil liability, which grants the injured party a claim for damages against the addressee of liability. The rules on civil liability traditionally aim to compensate the victim for damage suffered to protected legal assets [17, p. 259f.; 18, p. 166, 565]. At the same time, the threat of liability also serves to steer behaviour by providing an incentive to avoid damage [17, p. 187; 18, p. 169]. However, the risk of liability can also trigger undesirable strategies of liability avoidance. In the current debate, for example, a possible chilling effect of liability risks on the development and market launch of products equipped with artificial intelligence is often mentioned.

In the current legal policy discussion, a tendency is emerging to assign responsibility to the operator of the AI system, at least in the sense of civil liability for damages. In Europe, Art. 4 of the proposed regulation on liability for the operation of AI systems [10] imposes strict liability on the operator of AI systems classified as dangerous for damages resulting from the operation of the AI system. Art. 11 of the proposed regulation [10] states that manufacturers of AI systems are deliberately not covered by the directive.

The proposed regulation mirrors the current discussion in Europe regarding liability for highly automated vehicles. Some authors, e.g., Buck-Heeb and Dieckmann [19] or Kreutz [20], refer to the existing strict liability of the owner of a motor vehicle and, with regard to the manufacturer, to product liability law. The latter raises the question of whether product liability law provides an adequate liability regime for AI systems.

The term product liability refers to the non-contractual, strict liability of the producer for damage caused by a defective product.

Product liability was introduced in many countries in the second half of the 20th century, starting in the U.S., which is considered the "birthplace" of product liability [21]. The transition from fault-based liability to strict liability of the manufacturer for defective products was heavily influenced by the 1963 landmark decision of the California Supreme Court in Greenman v Yuba Power Products [22]. This approach found its way into § 402A of the Restatement Second of Torts from 1966 and was later accepted in most states of the US [23, p. 19; 24, p. 251].

In Europe, product liability was mainly introduced by way of statutory law based on the European Product Liability Directive of 1985 [25].

This issue of whether product liability is able to offer adequate compensation and incentives is discussed in this article on the basis of the European Product Liability Directive, which is of particular interest insofar as it is currently subject to evaluation and possible revision, especially with regard to AI systems [26].

2 The liability concept of the Product Liability Directive
The 1985 European Product Liability Directive introduced strict liability for defective products in European tort law [27]. Consumers who suffer damage caused by a product have a claim for damages against the producer of the defective product.

Unlike in general tort law, liability is strict, i.e., it requires neither intent nor negligence on the part of the producer. However, it is not strict liability in the sense of causal liability, since liability only applies when the product has a defect and this was the cause of the damage [28]. Since a defect exists if the product does not provide the safety that could reasonably be expected, the result is that the conditions for liability are closely approximated to negligence [29].

Liability not only applies to end products, but also to basic materials and component parts, and therefore addresses not only the manufacturer of the finished product, but also the manufacturers of parts and basic materials (infra 3.1). Thus, the directive deliberately covers all stages of the value chain [29, no. 13; 30, no. 28].

Liability is limited in several respects to avoid excessive liability risks for the manufacturer [31, p. 52f.]. First of all, according to Art. 9 lit. b) of the Product Liability Directive [25], only private users of the final product are beneficiaries, not, for example, commercial users of the product [30, no. 15].

Liability only applies to damage to health or property, in particular it does not apply to mere financial losses [32; 33, p. 468]. The list of protected rights in Art. 1 of the Product Liability Directive [25] is exhaustive [29, no. 19; 34, no. 2]. Negative evaluations of a person, which at most might infringe the persons' personality rights, are therefore not covered by product liability law.

Finally, liability is limited in amount to ensure insurability. It thus has the following prerequisites: Damage to a protected right (health or property) must have been caused by a product. The product must have a defect that was the cause of the damage. The burden of proof for all prerequisites is on the injured party. The claim for damages is directed against the producer of the finished product as well as the manufacturer of the defective component part or basic material.

This liability concept gives rise to specific challenges in the area of AI systems.

3 Questions of Application of Product Liability in the Case of AI
3.1 AI as a Product
An essential precondition of product liability is that the damage was caused by a product. According to the definition of Art. 2 of the Product Liability Directive, a product is any movable thing, even if it forms part of another movable thing or of an immovable thing, as well as electricity. Cyber-physical systems, such as highly automated vehicles, are without doubt a product [35, p. 549; 36, p. 316; 37, p. 3538], as is ultimately any system equipped with AI.

An essential question is whether the software that produces this AI is in itself a product. Above all, this question becomes relevant if the programme is distributed as such without being bound to a data carrier, or if it is part of the system, but the question of the software manufacturer's liability arises.

Art. 2 of the Product Liability Directive restricts the concept of product to things, thus defining it as a tangible object.

The programme as such is not a tangible object [16, p. 46f.] but, as the WIPO defined it in § 1 of the Model Law for the Protection of Computer Programs, a sequence of commands that can be executed by a machine. The same definition is used in a decision by the German Federal Supreme Court [38, p. 1047]. Despite this fact, there is unanimity in current German literature that software is subject to product liability [39, no. 830; 40, no. 172; 41, p. 706; 42, p. 714f.], and according to the prevailing view, even if it is delivered via the internet, for example by download from a website [40, no. 173f.; 43, no. 43]. The same seems to apply to the Product Liability Directive in general [44, p. 245]. In the upcoming revision of the Product Liability Directive, this result - the application to software - should be clarified.

Since AI is generated by computer software, product liability therefore applies to any form of AI.

3.2 Producers
According to Art. 1 of the Product Liability Directive, the opponent of the claim for damages and thus the addressee of liability is the producer of the product. According to Art. 3(1) of the Product Liability Directive, the producer in this sense is both the producer of the finished product and the producer of the component parts or raw material.

The manufacturer of the overall system is the primary liability addressee as the manufacturer of the finished product. Software that is integrated into the system is to be qualified as a component part, the supplier of this software is consequently subject to liability as the manufacturer of the component part.

Insofar as (artificial) neural networks are a component of an AI system, a differentiation must be made: The untrained neural network does not have functional properties but acquires them only through the process of machine learning. Considering this, the untrained network is not a component part but a basic material. Since the manufacturer of basic products is also included in the liability, this differentiation as such is not decisive in the result. However, it points the way to the classification of the training. Insofar as the trained network is integrated into an AI system as software, it is a component part, the manufacturer of the trained network is consequently a component part manufacturer.
The status of being a component part producer is thus acquired through the training of the network. This aspect can be generalised: whoever controls the learning process in the context of machine learning and thus determines the properties of the learning system is the producer of the respective system.

3.3 Limitation of Damage to Health and Property
Art. 9 of the Product Liability Directive [25] explicitly limits liability to damage to health and property. This is intended to exclude, in particular, mere pecuniary damages, which occur, for example, in the event of a business interruption.

There is thus a "liability gap" in relation to AI systems that generate assessments or evaluations of persons. Whether it is an assessment of the likelihood of recidivism, creditworthiness or performance for a job or a place at university: any damage caused by incorrect assessments is not covered by this concept. Even if one were to assume a violation of personality rights of the person concerned, these rights would not be subject to the scope of product liability law.

This is not to say that product liability law should be extended to cover such damages. However, it is important to note that these damages are currently not covered by product liability law and therefore the legislator must be encouraged to achieve sufficient protection against erroneous assessments by other means.

4 Faulty Decisions and Product Liability
4.1 Conduct as a Defect of a Product?
A major challenge of product liability law in relation to AI is the requirement of a defect of the product. The characteristic of an AI system is that it reacts to situations independently, i.e., without direct control by a human being, thus exhibiting an independent and not previously determined behaviour [45, p. 43f.; 46, p. 7; 47, p. 4]. However, the behaviour of a system in a particular situation as such is not a property of the system. Even in cases where the actor is human, it cannot be directly inferred that a certain behaviour equals a certain characteristic.

The bridge between behaviour and property can be closed by defining property as the ability to behave in a certain way in a certain situation or not to show a certain behaviour. The property of a highly automated vehicle is thus the suitability or ability to drive "correctly" in a certain situation.

Yet, this makes clear the error that exists in a simple equation of behaviour and property. If one were to conclude a (product) error, i.e., a certain (insufficient) property of this AI system, directly from a faulty behaviour, for example a driving error, the required ability would be directed at behaving correctly in every traffic situation.

This task can be very complex. For example, a traffic situation consists of a multitude of facts which must be converted into information by the vehicle through sensors and then be interpreted correctly in several steps. Then the vehicle must derive a correct decision on the driving strategy (trajectory planning) based on the perceived situation, i.e., the interpreted data. In doing so, all information available in the traffic situation must be taken into account.

The decisive factor is that situations occurring in traffic are unpredictably diverse. Accordingly, the concrete behaviour in such a unique, future traffic situation is also unpredictable. If one equates a faulty behaviour of the system with a product defect, liability arises for behaviour in an unforeseeable multitude of future traffic situations. This is a conceivable liability concept that can be achieved in particular by specific legislation providing for a causal liability of the producer, as some authors suggest in the case of highly automated vehicles [48, p. 277; 49, p. 574].

However, this is not the liability concept of the Product Liability Directive. Product liability law focusses, as can be derived from Art. 6 of the Product Liability Directive, on the existence of properties of the product when it is placed on the market, properties which must correspond to justified safety expectations. The expectation of being able to control unforeseeable future situations is certainly not a justified safety expectation.

If the legislator wanted to introduce such liability, this could be done e.g., by imposing causal liability on the producer, similar to the operator's liability under § 7 StVG (German Road Traffic Act). According to this norm, the owner of a motor vehicle is liable for accidents that are not due to force majeure. Another example of such causal liability in German law is the liability of the pet owner for so-called "luxury animals" (§ 833 BGB).

A similar result could be achieved by applying the construct of vicarious liability to AI systems and equating them to humans in this respect. This concept is not a part of tort law in Germany or in most legal systems. German law contains such an attribution of fault in § 278 of the German Civil Code (BGB) for agents. It is also argued that this provision should be applied analogously to AI systems [50, p. 211f.]. This liability only applies to contractual obligations and does not apply to non-contractual liability.

As an interim result, this shows that product liability law does not provide for liability for conduct, not even for the property of behaving correctly in future, unforeseeable situations. Therefore, only the ability to behave correctly in a spectrum of foreseeable situations can be described as a property of an AI system in the sense of product liability.

4.2 Reference Point of the Defectiveness of AI Systems
With the interim result that product liability only refers to properties of the AI system, but not to its behaviour in a specific situation, follow-up questions necessarily arise.

The starting point is that the ability of the AI system to behave according to a certain expectation in a certain situation is to be regarded as a property of the AI system and that the lack of this ability can trigger product liability.

First of all, it is important to establish a reference point for this ability, in particular to analyse whether and to what extent human abilities are to be taken into account.

An intuitively obvious reference point would be the human behaviour that is to be replaced by the activity of the AI system,
in the case of automated driving, for example, the behaviour of a human driver [48, p. 276; 51, p. 77]. For the determination of the defectiveness of an AI system, reference would therefore have to be made to human capabilities.

However, a simple reference to human capabilities as a yardstick for errors of AI systems would not be convincing. For example, a human driver is granted a certain reaction time; the ability to react immediately is not expected. The granting of a reaction time to be measured against human limitations, however, is obviously not necessary for AI systems; on the contrary, it would be highly counterproductive. The improved safety of highly automated vehicles, for example, is based not least on the fast reaction of the system compared to the human driver [42, p. 734f.]. Therefore, an independent standard must be developed, as is generally the case in product liability law [42, p. 733f.].

Admittedly, the reference to human capabilities is to some extent predetermined. Insofar as AI systems are used to provide "human performance", i.e., performance which was previously provided by humans, the expected behaviour of the machine is based on human behaviour. Thus, requirements for its behaviour in road traffic are entirely oriented towards the abilities of a human being. Consequently, insofar as AI systems are used, a behaviour comparable to that of a human being is required [48, p. 275f.].

This orientation towards human behaviour leads to an orientation towards human capabilities without being limited to them. Accordingly, the safety expectations within the framework of product liability law are also not to be limited by the abilities of humans [42, p. 735f.].

This leads to a huge challenge for product liability law: the requirements for the capabilities of AI systems are currently very unclear, and clear, legally secure requirements have not yet emerged in many areas.

Structurally, none of this is new. Requirements often have to be concretised in the respective case, binding ultimately by judges. It is therefore particularly relevant to define safety requirements for new technologies in the form of technical standards.

4.3 Requirements for Accuracy of AI Systems
In the context of concretising the safety requirements, it is of great importance whether the accuracy of the system with regard to its behaviour can be expected as a property of an AI system, or whether fault tolerance is permitted.

As already shown, an error-free behaviour for every future situation cannot be part of the justified safety expectation, so that a certain degree of error tolerance is required. In this regard, the special features of AI systems compared to traditional, human-controlled machines become particularly clear.

Insofar as the required behaviour is specified, negligence is assumed in the case of a human acting otherwise. If a human driver does not drive correctly, there is fault. However, the producer of the AI system is not its operator like a driver would be. If one wanted to make the producer liable for driving errors of the system itself, this would be a new type of causal liability, which is rightly demanded (supra 4.1).

Product liability is aimed at the justified expectations of safety, i.e. the safety that can reasonably be achieved. This standard also applies to the behaviour of the system: what is required is the degree of safety of correct behaviour that can reasonably be ensured by the manufacturer. In this respect, the technical possibilities at the time of placing the product on the market are decisive.

To avoid misunderstandings, it should be said that liability law must be distinguished from market approval rules which apply to numerous machines, such as vehicles, and that the requirements are not identical. The obligation may be stricter than in the approval procedure.

However, if the safety requirements are tied to what is reasonable, exceptions can be considered; product liability law contains a fault tolerance for the behaviour of AI systems.

The extent of this fault tolerance can ultimately only be clarified in individual cases. However, one can ask the fundamental question of whether it is tolerable for an AI system to behave worse than the "average human" in a certain situation. In principle, this question can be answered in the affirmative, since AI systems act completely differently from humans and show strong weaknesses especially in unusual situations. A basic problem of machine learning, as it has been practised until recently, is the lack of consideration dedicated to the confidence in the classification. In many cases, doubts about the classification (such as of an object perceived by sensors) were not or not sufficiently taken into account for planning the behaviour of the AI system.

In addition, AI systems are highly susceptible to manipulation, as has been demonstrated in numerous studies on IT security. It is therefore inherent in the liability concept of product liability law that machines make mistakes that would not be tolerated if made by a human.

This necessarily creates a liability gap in relation to the producer. This gap must be filled outside of product liability law, for example through specific causal liability on the part of the manufacturer, or, as is currently the trend, through the liability of other parties involved, such as the operator of the AI system.

However, product liability law offers another important connecting factor for the safety of AI systems. Part of the expectation of safety in a product is that the function of the product has been properly verified. This general aspect is crucial in the case of AI systems: since the behaviour of AI systems based on machine learning cannot be verified by code analysis but can at least be partially verified by testing, the tests on the behaviour of the AI system are a crucial aspect of the safety expectation. Proper testing of the behaviour of AI systems thus becomes the anchor of product liability: if there is no proper test, the product is defective. If the damage can be traced back to a behaviour which would have been detectable if proper tests had been carried out, this error is also causal for the damage.

This shows a major advantage of AI systems compared to human activity: the behaviour of AI systems is in principle reproducible and can therefore be checked by tests in many cases - specifically: in complex situations with a correspondingly large
amount of different input information - not fully but much better than with humans.

4.4 Burden of Proof
Another aspect of product liability that is often decisive in practice concerns the burden of proof. According to Art. 4 of the European Product Liability Directive, the injured party bears the burden of proof for the existence of the product defect, i.e. he must present the defectiveness of the product and prove it in the event of a dispute.

In this respect, the circumstances of the procedural law systems, which are subject to the law of the EU member states, have a decisive effect. In the United States in particular, the instrument of pretrial discovery provides the plaintiff with a strong instrument to strengthen the factual basis of his claim [52, p. 115f.], which is not known in most European legal systems.

In German law, the requirements for the presentation of the facts on which a claim is based are particularly strict. In principle, the plaintiff is required to provide a complete statement of the facts in such a way that a decision of the dispute can be made based on the plaintiff's submissions [29, no. 92].

These requirements also apply in product liability, there also with regard to the defect of the product. As a rule, the injured party cannot provide such evidence, as he does not know any details about the design or manufacture of the product. Normally, he will only be able to present circumstantial evidence from which it may be possible to conclude that the product is defective.

In view of this procedural starting position, presumptions of the existence of a defect, which lead to the reversal of the burden of proof, play a central role. In fact, case law in Germany, for example, has assumed a presumption in a number of cases: For example, the lack of compliance with statutory safety regulations is supposed to give rise to a presumption that products are defective.

Such presumptions are of crucial importance, especially for AI systems. Without them, it will hardly be possible for the injured party to prove a fault of the AI system, as he has no access to the relevant information. He does not know the design of the system, in machine learning cases he lacks any knowledge about the learning process, also the injured party does not know anything about the tests of the AI system [53, p. 285f.].

Against this background, the Expert Group on Liability and New Technologies - New Technologies Formation has recommended a shift of the burden of proof in certain cases regarding AI and other new technologies [8, p. 42]. The European Parliament has taken up this idea in its resolution for a civil liability regime for AI [9]. The proposal for a regulation on liability for AI systems [10] contains a different distribution of risk for so-called "AI systems with high risk", for which strict liability of the operator is to apply, and for "other AI systems" for which fault-based liability is to remain. However, according to Art. 8 Para. 2 of the proposed regulation, the operator must prove that personal injury or damage to property was caused through no fault of his own.

This raises the crucial question of the conditions under which a reversal of the burden of proof for the existence of a fault should apply to AI systems. In this respect, great caution is required, as a premature presumption, which may not be rebuttable, would transform product liability into a causal liability of the manufacturer, which is precisely what the legislator wanted to avoid.

In the cases of interest here, the starting point of the damage is a faulty behaviour of the AI system. This behaviour appears externally and is usually more recognisable for the injured party than for the producer. In this respect, the assignment of the burden of proof to the injured party is appropriate. Since the injured party's lack of evidence concerns the facts relating to the design and manufacture underlying the conduct, including the testing of the AI system, the burden of proof should start at precisely this point. Therefore, there should be a presumption of the existence of a defect in the AI system if it has behaved "incorrectly". It is then incumbent on the manufacturer to prove that the fault could not have been detected with proper design and testing. This proof can be provided by the manufacturer presenting and, if necessary, proving the existence of a sufficient test. Since he has access to the relevant data and since he can use external help if necessary, he can reasonably be expected to provide this evidence.

With this easing of the burden of proof, there is probably an appropriate balance of interests. It is often difficult to prove a misconduct. In the example of the highly automated vehicle - in the absence of causal liability of the operator - a driving error would have to be proven, which can be difficult if the course of the accident cannot be determined. In this respect, however, the burden of proof on the injured party is correct, since the manufacturer has no better information in this respect. In this respect, however, the black box, which is to become obligatory for highly and fully automated vehicles, should provide a remedy, for example in the case of automated driving. The inspection of the recorded data can be ordered by the court, e.g., within the framework of independent evidence proceedings prior to the filing of a lawsuit.

5 Conclusion
As an overall result, product liability law can be said to have a certain performance capacity also in relation to AI systems.

Product liability is comprehensively applicable to AI systems and their components, in particular to neural networks and any other software that creates AI. The manufacturers of the system as well as of the software as a partial product are the addressees of product liability law and can be held liable for damages.

A significant gap arises in the case of AI systems that cannot directly lead to personal injury or damage to property, but which set an essential prerequisite for decisions by third parties, as is the case with AI systems for the assessment of persons.

The central challenge of product liability law in relation to AI systems is the determination of a product defect. Since the concept of defect is based on properties of the product that must be present when the product is placed on the market, the behaviour of the AI system cannot be directly linked to the defect. However, one property of the AI system is the ability to behave in a
certain way. The more the behaviour depends on a situation and the more diverse the possible situations are - in the case of automated driving, for example, represented by the unforeseeable multitude of traffic situations - the less the behaviour of the AI system can be anticipated by the manufacturer and secured in advance. The justified expectation of safety as a measure of faultiness is therefore not directed towards fault-free behaviour in all situations, but only towards such behaviour in situations which can be reasonably foreseen.

The sufficient degree of safety is, as described, primarily concretised by the reasonableness of the measures. Technical standards will play a special role in this, as they emerge.

In the case of AI systems, an important starting point for product liability will probably be the testing of AI systems, since the existence of proper tests of AI systems is to be regarded as a component of the justified expectation of safety. If this is lacking, the AI system is defective.

In practice, the burden of proof will play a central role. In this respect, a reversal of the burden of proof for the existence of a fault is to be assumed for AI systems if the system behaves incorrectly. The burden of proof for the faulty behaviour lies with the injured party.

With these requirements, product liability law can contribute to an appropriate balancing of interests with regard to damage caused by AI systems. Nevertheless, the legal framework of AI systems needs to be supplemented, as otherwise liability risks and false incentives exist.

REFERENCES
[1] Virginia Foggo, John Villasenor, Pratyush Garg. 2021. Algorithms and Fairness, Ohio State Technology Law Journal 17(1), 123–188.
[2] Federal Trade Commission (FTC). 2021. Aiming for truth, fairness, and equity in your company's use of AI, retrieved from https://www.ftc.gov/news-events/blogs/business-blog/2021/04/aiming-truth-fairness-equity-your-companys-use-ai (accessed 17.5.2021).
[3] Communication from the Commission to the European Parliament, the European Council, the Council, the European Economic and Social Committee and the Committee of the Regions, Artificial Intelligence for Europe, COM(2018) 237 final, 25.4.2018, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52018DC0237 (accessed 17.5.2021).
[4] For detailed information on the group see https://ec.europa.eu/transparency/regexpert/index.cfm?do=groupDetail.groupDetail&groupID=3591&news=1 (accessed 17.5.2021).
[5] High-Level Expert Group on AI, Ethics Guidelines for Trustworthy Artificial Intelligence, retrieved from https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai (accessed 17.5.2021).
[6] For detailed information on the group see https://ec.europa.eu/transparency/regexpert/index.cfm?do=groupDetail.groupDetail&groupID=359 (accessed 17.5.2021).
[7] Proposal for a Regulation laying down harmonized rules on artificial intelligence (Artificial Intelligence Act), 21.4.2021, https://ec.europa.eu/newsroom/dae/redirection/document/75788 (accessed 17.5.2021).
[8] Report from the Expert Group on Liability and New Technologies – New Technologies Formation, Executive Summary, 3 f.; https://ec.europa.eu/transparency/regexpert/index.cfm?do=groupDetail.groupMeetingDoc&docid=36608 (accessed 17.5.2021).
[9] Civil liability regime for artificial intelligence, European Parliament resolution of 20 October 2020 with recommendations to the Commission on a civil liability regime for artificial intelligence (2020/2014(INL)), retrieved from https://www.europarl.europa.eu/doceo/document/TA-9-2020-0276_EN.pdf (accessed 17.5.2021).
[10] Proposal for a Regulation of the European Parliament and of the Council on liability for the operation of Artificial Intelligence-systems, (2020/2014(INL)), retrieved from https://www.europarl.europa.eu/doceo/document/TA-9-2020-0276_EN.pdf (accessed 17.5.2021).
[11] ExamAI – Testing and Auditing of AI systems; retrieved from https://testing-ai.gi.de (accessed 17.5.2021).
[12] Rasmus Adler, Jens Heidrich, Lisa Jöckel, Michael Kläs. 2020. Anwendungsszenarien: KI-Systeme in der Produktionsautomatisierung; retrieved from https://testing-ai.gi.de/fileadmin/GI/Projekte/KI_Testing_Auditing/ExamAI_Publikation_Anwendungsszenarien_KI_Industrie.pdf (accessed 17.5.2021).
[13] National Transportation Safety Board (NTSB). 2018. Collision Between Vehicle Controlled by Developmental Automated Driving System and Pedestrian, Tempe, Arizona, March 18, 2018, accident report, retrieved from https://www.ntsb.gov/investigations/AccidentReports/Reports/HAR1903.pdf (accessed 17.5.2021).
[14] Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner. 2016. Machine Bias, ProPublica, May 23, 2016, retrieved from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed 17.5.2021).
[15] Anthony W. Flores, Kristin Bechtel, Christopher T. Lowenkamp. 2016. False Positives, False Negatives and False Analyses: A Rejoinder to "Machine Bias: There's Software Used Across the Country to Predict Future Criminals. And It's Biased Against Blacks.", Federal Probation 80(2), 38–46.
[16] Katharina Zweig, Marc Hauer, Franziska Raudonat. 2020. Anwendungsszenarien: KI-Systeme im Personal- und Talentmanagement, retrieved from https://testing-ai.gi.de/fileadmin/PR/Testing-AI/ExamAI_Publikation_Anwendungsszenarien_KI_HR.pdf (accessed 17.5.2021).
[17] Steven Shavell. 2004. Foundations of Economic Analysis of Law. Cambridge, MA: Belknap Press.
[18] Hans-Bernd Schäfer, Claus Ott. 2012. Lehrbuch der ökonomischen Analyse des Zivilrechts, Springer.
[19] Petra Buck-Heeb, Andreas Dieckmann. 2020. in: Oppermann/Stender-Vorwachs (eds.), Autonomes Fahren, C.H. Beck, 2nd ed., Chap. 3.1.1.
[20] Peter Kreutz. 2020. in: Oppermann/Stender-Vorwachs (eds.), Autonomes Fahren, C.H. Beck, 2nd ed., Chap. 3.1.2.
[21] Geraint Howells. 1993. Comparative Product Liability, Dartmouth Publishing.
[22] Greenman v. Yuba Power Products, Inc. (1963), 59 Cal. 2d 57.
[23] Fairgrieve, Howells, Møgelvang-Hansen, Straetmans, Verhoevens, Machnikowski, Janssen, Schulze. 2016. Product Liability Directive. In Pjotr Machnikowski (Ed.), European Product Liability. An Analysis of the State of the Art in the Era of New Technologies, 17–111.
[24] Matthias Reimann. 2015. Product liability. In M. Bussani & A. Sebok (eds.), Comparative Tort Law, 250–279, Edward Elgar Publishing.
[25] Council Directive of 25 July 1985 on the approximation of the laws, regulations and administrative provisions of the Member States concerning liability for defective products (85/374/EEC), O.J. 1985 No. L 210/9.
[26] Astrid Seehafer, Joel Kohler. 2020. Künstliche Intelligenz: Updates für das Produkthaftungsrecht? Mögliche Anpassungen der europäischen Produkthaftungsrichtlinie für neue Technologien, EuZW 2020, 213–218.
[27] Thomas Riehm. 2010. 25 Jahre Produkthaftungsrichtlinie – Ein Lehrstück zur Vollharmonisierung, EuZW 2010, 567–571.
[28] Georg Borges. 2018. Rechtliche Rahmenbedingungen für autonome Systeme, NJW 2018, 977–982.
[29] Georg Borges. 2021. in: Borges/Hilber (eds.), Beck'scher Online-Kommentar IT-Recht, § 1 ProdHaftG.
[30] Gerhard Wagner. 2020. in: Münchener Kommentar zum BGB, 8th ed., § 4 ProdHaftG.
[31] Georg Borges. 2020. Kann ein Gegenstand nicht Sache und doch Sache sein? Computerprogramme im Privatrecht, in: Omlor (ed.), Weltbürgerliches Recht, Festschrift für Michael Martinek zum 70. Geburtstag, 45–58, C.H. Beck.
[32] OLG Hamm NJW-RR 2012, 355.
[33] Erwin Deutsch. 1989. Der Schutzbereich der Produzentenhaftung nach dem BGB und dem PHG, JZ 1989, 465–470.
[34] Gerhard Wagner. 2020. in: Münchener Kommentar zum BGB, 8th ed., § 1 ProdHaftG.
[35] Ulrich Berz, Eva Dedy, Claudia Granich. 2000. Haftungsfragen bei dem Einsatz von Telematik-Systemen im Straßenverkehr, DAR 2000, 545–554.
[36] Volker Jänich, Paul Schrader, Vivian Reck. 2015. Rechtsprobleme des autonomen Fahrens, NZV 2015, 313–318.
[37] Paul Schrader. 2015. Haftungsrechtlicher Begriff des Fahrzeugführers bei zunehmender Automatisierung von Kraftfahrzeugen, NJW 2015, 3537–3542.
[38] Bundesgerichtshof [German Federal Supreme Court], GRUR 1985, 1041.
[39] Helmut Redeker. 2017. IT-Recht, 6th ed., C.H. Beck.
[40] Philipp Reusch. 2020. In Kaulartz/Braegelmann (eds.), Rechtshandbuch Artificial Intelligence und Machine Learning, Chap. 4.1, C.H. Beck.
[41] Friedrich-Wilhelm Engel. 1986. Produzentenhaftung für Software, CR 1986, 702–708.
[42] Gerhard Wagner. 2017. Produkthaftung für autonome Systeme, AcP 2017(6), 707–765.
[43] Georg Borges. 2021. in: Borges/Hilber (eds.), Beck'scher Online-Kommentar IT-Recht, § 2 ProdHaftG.
[44] Ulrich Magnus. 2017. In Machnikowski (ed.), European Product Liability. Intersentia.
[45] Thomas Schulz. 2015. Verantwortlichkeit bei autonom agierenden Systemen. Nomos.
[46] Hans Steege. 2021. Auswirkungen von künstlicher Intelligenz auf die Produzentenhaftung in Verkehr und Mobilität. Zum Thema des Plenarvortrags auf dem 59. Deutschen Verkehrsgerichtstag, NZV 2021, 6–13.
[47] Chris Reed, Elizabeth Kennedy, Sara Silva. 2016. Responsibility, Autonomy and Accountability: Legal Liability for Machine Learning, Queen Mary University of London Legal Studies Research Paper No. 243/2016.
[48] Georg Borges. 2016. Haftung für selbstfahrende Autos. Warum eine Kausalhaftung für selbstfahrende Autos gesetzlich geregelt werden sollte, CR 2016, 272–280.
[49] Sabine Gless, Ruth Janal. 2016. Hochautomatisiertes und autonomes Autofahren – Risiko und rechtliche Verantwortung, JR 2016, 561–575.
[50] Herbert Zech. 2019. Künstliche Intelligenz und Haftungsfragen, ZfPW 2019, 198–219.
[51] Christian Gomille. 2016. Herstellerhaftung für automatisierte Fahrzeuge, JZ 2016, 76–82.
[52] Geoffrey Hazard, Michele Taruffo. 1993. American Civil Procedure. Yale University Press.
[53] Mario Martini. 2019. Blackbox Algorithmus – Grundfragen einer Regulierung Künstlicher Intelligenz. Springer.

A Combined Rule-Based and Machine Learning Approach for
Automated GDPR Compliance Checking
Rajaa EL HAMDANI, HEC Paris, France
Majd Mustapha, EURA NOVA, Belgium
David Restrepo Amariles, HEC Paris, France
Aurore Troussel, Steptoe & Johnson LLP, Belgium
Sébastien Meeùs, HEC Paris, France
Katsiaryna Krasnashchok, EURA NOVA, Belgium

ABSTRACT
The General Data Protection Regulation (GDPR) requires data controllers to implement end-to-end compliance. Controllers must therefore ensure that the terms agreed with the data subject and their own obligations under GDPR are respected in the data flows from data subject to controllers, processors and sub-processors (i.e. data supply chain). This paper seeks to contribute to bridge both ends of compliance checking through a two-pronged study. First, we conceptualize a framework to implement a document-centric approach to compliance checking in the data supply chain. Second, we develop specific methods to automate compliance checking of privacy policies. We test a two-modules system, where the first module relies on NLP to extract data practices from privacy policies. The second module encodes GDPR rules to check the presence of mandatory information. The results show that the text-to-text approach outperforms local classifiers and enables the extraction of both coarse-grained and fine-grained information with only one model. We implement a full evaluation of our system on a dataset of 30 privacy policies annotated by legal experts. We conclude that this approach could be generalized to other documents in the data supply chain as a means to improve end-to-end compliance.

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Information extraction; Machine learning; • Social and professional topics → Privacy policies; • Security and privacy → Usability in security and privacy.

ACM Reference Format:
Rajaa EL HAMDANI, Majd Mustapha, David Restrepo Amariles, Aurore Troussel, Sébastien Meeùs, and Katsiaryna Krasnashchok. 2021. A Combined Rule-Based and Machine Learning Approach for Automated GDPR Compliance Checking. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466081

1 INTRODUCTION
It has become widely acknowledged that complying with data protection laws, particularly the European General Data Protection Regulation (GDPR), is the most difficult compliance challenge organizations face today across industries [1]. Moreover, technology became the most important compliance cost for organizations, as they turned to specialized technologies to carry out compliance tasks such as document review, regulatory checks and operational audits [1]. Despite the increasing interest in technology as a compliance tool in data protection, there is not yet a comprehensive conceptual framework characterizing the tasks required to verify compliance in the entire data supply chain (i.e., end-to-end compliance), and how current methods can contribute to address them.

This paper intends to contribute to both of these issues. First, we lay down a framework to implement and monitor GDPR compliance in the data supply chain through a document-centric approach. We define three key tasks of compliance checking based on the function and content of the document: (1) document to regulation, (2) document to document, and (3) document to operations. We apply this framework to analyze the compliance function of privacy policies in the data supply chain and define the tasks required to determine if a privacy policy is compliant with the GDPR.

Second, we develop and test several methods to verify compliance of privacy policies to the GDPR by leveraging the advantages of both machine learning and rule-based approaches. In particular, we build a two-modules system to verify the completeness of privacy policies with regards to mandatory information. The first module automatically extracts coarse-grained and fine-grained data practices, while the second module analyzes extracted data practices and checks the presence of mandatory information according to the provisions of the GDPR. We make use of the OPP-115 dataset [48] for training and evaluation of our models. We treat the extraction of data practices as a Hierarchical Multi-label Classification (HMTC) task and experiment with two different approaches: local classifiers and text-to-text. Our proposed text-to-text method has several advantages over local classifiers, including extraction of additional information and better scalability.
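As a rough illustration of the text-to-text formulation described above, the snippet below frames a single privacy-policy segment as a generation query to a T5-style model via the Hugging Face transformers library. This is a minimal sketch under stated assumptions, not the authors' released pipeline: the task prefix, the expected label string, and the use of "t5-base" as a small stand-in for the fine-tuned T5-11B model are illustrative choices.

```python
# Minimal sketch (not the authors' exact setup): casting data-practice
# extraction as text-to-text generation. The task prefix and the label format
# mentioned in the comment below are hypothetical; "t5-base" stands in for the
# fine-tuned T5-11B model released by the authors.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

segment = ("We share your email address with advertising partners "
           "unless you opt out in your account settings.")
prompt = "classify privacy policy segment: " + segment  # hypothetical task prefix

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=48)

# a model fine-tuned on OPP-115 would be expected to emit a structured label
# string such as "category: Third Party Sharing/Collection | choice: opt-out"
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because coarse-grained and fine-grained practices can both be emitted as parts of a single generated string, one model of this kind can cover the whole label hierarchy, which reflects the scalability advantage claimed above for the text-to-text approach over separate local classifiers.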

• we develop a combined rule-based and machine learning approach to experiment with automated formal compliance checking of privacy policies with the GDPR;
• we propose a text-to-text approach for HMTC using transfer learning and multi-task learning;
• additionally, we extract the span of text from a privacy policy corresponding to the fine-grained data practices, which provides better explainability of the classification results;
• we further annotate a dataset of 30 privacy policies with the presence of mandatory information, in order to fully evaluate our two-module system of compliance checking;
• finally, we release a public repository of the dataset, implementations and the fine-tuned T5-11B model on OPP-115¹.
This paper is organized as follows. Section 2 provides an overview of the related work. Section 3 lays down a general compliance framework to automate GDPR compliance and applies it to privacy policies. Section 4 describes the development of data practices extraction algorithms and its evaluation results. Section 5 presents the rule-based system, which verifies the presence of mandatory information, and its evaluation on an annotated dataset.
¹ https://github.com/smartlawhub/Automated-GDPR-Compliance-Checking

2 RELATED WORK
The Information Commissioner's Office (ICO), the Data Protection Authority of the United Kingdom, observes that overall compliance with the GDPR is jeopardized by the "opaque nature" of the data supply chain, which is poorly documented, and generally fails to comply with the GDPR's accountability principle [2]. The ICO further points out that controllers and processors should not limit compliance to entering into contracts and producing legal documents. They must monitor processing activities, and conduct audits to ensure that appropriate technical and organizational measures are in place throughout the data supply chain [2].
Most research in AI and law studying GDPR compliance has focused on the relations between data subjects and data controllers, rather than on the compliance challenges in the data supply chain [3, 6, 11, 54]. More recently, the BPR4GDPR project started working on a compliance ontology specification that supports end-to-end compliance [37] and which can contribute to address some of the operational challenges raised by the ICO. A compliance framework is provided in [40] for specific documents in the supply chain such as the data protection impact assessment (DPIA).

2.1 Model-based Compliance Checking
The AI and law research community has developed model-based methods for automated compliance checking, such as legal ontologies that support legal reasoning via logic programming [9, 12, 13, 15, 16, 19, 30, 44, 45]. However, even though most legal rules can be described in logic programming, these methods face two main challenges when applied to real-life cases:
• Knowledge acquisition bottleneck [17]: Logic programming requires the encoding of facts into predicate form, but such an encoding would be very cumbersome for each data protection document in the data supply chain.
• Open texture of legal language [46]: Privacy concepts can be quite abstract, and their evaluation arduous, as it is impossible to define a finite set of rules for all possible applications. For example, the storage limitation principle states that personal data must not be kept "longer than is necessary for the purposes for which the personal data are processed" (Art. 5.1(e)). This principle does not define specific time limits that can be evaluated easily.
To solve the aforementioned problems, we employ natural language processing (NLP) techniques to automatically extract information from the data supply chain documents. Likewise, the authors of [44] suggest using NLP to extract information from legal documents according to their UML model of the GDPR, to automatically construct their model-based representation. A large and growing body of literature uses machine learning and NLP algorithms to extract privacy information from legal documents; however, it only considers privacy policies and does not yet deal with other documents of the data supply chain, such as DPAs and DPIAs.

2.2 Information extraction from privacy policies
Privacy policies are long documents that are difficult to read for data subjects. Empirical studies have been conducted to study and measure the ambiguity and vagueness of privacy policies [5, 20, 24, 36]. Efforts have been made to decrease these deficiencies [42, 51], by using classification techniques to extract data practices from the text and represent them in a user-friendly interface. Other approaches focus on the extraction of one specific type of data practice, such as "opt-out choice" in [38].
More and more methods are being developed to analyze the compliance of privacy policies with the GDPR. A variety of NLP tools such as word embeddings are used in [7, 43] to verify the completeness of privacy policies according to the rules set out by the GDPR. The CLAUDETTE project [6] extracts clauses that are problematic with respect to the GDPR. In a different project [23], privacy policies were analyzed in a large-scale setting to study the effect of the GDPR on their provisions. For instance, the comparison between pre-GDPR and post-GDPR versions of 6,278 English privacy policies showed that the GDPR caused textual changes of the privacy policies, such that their appearance improved, their length increased, and that they cover more data practice categories.
A great deal of previous research on privacy policies relies on supervised machine learning methods, which require datasets of annotated privacy policies. However, there are very few such publicly released datasets. In our work, we use OPP-115 [48] – a corpus of 115 privacy policies annotated with both coarse-grained and fine-grained data practices. Several works used OPP-115 to train machine learning algorithms on the task of extracting data practices [14, 25, 28, 29, 32]. PrivacyQA [35] is another publicly available dataset of 1,750 questions about the privacy policies of mobile applications, which is used to train question-answering algorithms. A serious limitation of the publicly available datasets is that their annotation schemes contain few concepts aligned with the GDPR. A connection between the data practices identified in OPP-115 and the GDPR was presented in [31], which revealed that the principle of accountability of Article 5 is absent from OPP-115 concepts.
2.3 Transfer learning in NLP
Until very recently, in computer science research, general NLP tasks such as text classification were commonly handled with architectures based on word embeddings [27], Convolutional Neural Networks (CNN) [21] and Recurrent Neural Networks (RNN) [18]. RNN-based solutions, which achieved state-of-the-art results at the time, are, however, limited when it comes to dealing with long text, due to their sequential nature. Moreover, this nature stands in the way of most parallel training methods, therefore limiting the ability to decrease training time. Encoder-decoder architectures – commonly used for sequence to sequence problems – have inspired the attention-based transformers [47], and these architectures managed to go beyond the limitations of sequential networks.
The use of transformers is gradually becoming the state of the art of NLP. Indeed, they are effective in a variety of tasks, including masked token prediction, next sentence prediction, question answering, machine translation, summarization, sentiment analysis and classification [10, 33]. The fact that transformers are trained in an unsupervised way, which reduces the reliance on labeled data and allows the use of a larger pool of text, explains the increased performance. Transformers are very effective in transfer learning, allowing researchers to pretrain them with large amounts of general-purpose texts and then to fine-tune them for their specific tasks with good results, less effort, and less labeled data. Several works explored the potential of fine-tuning transformers on OPP-115 to extract coarse-grained data practices [28, 29].
Similarly to [14], we extract both coarse-grained and fine-grained data practices from privacy policies. And similarly to [28, 29], we use variations of the transformer architecture as the base for fine-tuning on OPP-115. Furthermore, our best-performing model is able to extract the fine-grained data practices with corresponding spans of text from the policy, thus improving explainability of the results.

3 A FRAMEWORK FOR COMPLIANCE CHECKING IN THE DATA SUPPLY CHAIN
New data protection and privacy regulations around the world have empowered data subjects vis-à-vis data controllers (EU fundamental rights, California Privacy laws, Canada PIPEDA, Brazil, etc.). As mentioned earlier, the data supply chain (i.e., the way data flows from data subjects to controllers, processors and sub-processors) is opaque and remains hard to monitor, ultimately hindering the effectiveness of data-subject rights.
At the operation level, the data supply chain is characterized by the high volume of data flows across processors and jurisdictions for tasks as varied as storage, pre-processing, producing analytics, implementing AI methods, generating visualizations, etc. In addition to regulations, these flows are regulated by fragmented legal artifacts such as contracts, data protection addenda (DPA), technical and operational measures (TOMs), etc., which are often of restricted access, lack interoperability and have low operational efficiency, ultimately obstructing compliance with the GDPR.
To help data controllers use technological tools to improve compliance with the GDPR, we draw a general conceptual framework to automate compliance checking in the data supply chain based on a document-centric approach and lay down the tasks required to achieve end-to-end compliance.

3.1 Compliance Checking Framework: A Document-centric Approach
The starting point of our compliance framework is the set of documents establishing the conditions and processing activities that data controllers and processors are intended to and effectively carry out (hereafter compliance documents), such as privacy policies, contracts, DPAs, TOMs, DPIAs, etc. This is a reasonable starting point as the GDPR establishes a document-centric approach to compliance, entrusting documents with different functions in relation to the obligations of data controllers and processors. First, it requires data controllers to properly demonstrate compliance with the Regulation ("principle of accountability", Art. 5.2). Second, data controllers shall inform data subjects about the processing activities they carry out (Art. 12, 13, and 14) and document instructions they give to data processors (Art. 29). Third, sub-processors must obtain a written authorization from controllers to further subcontract a processing activity, and must also document their instructions (Art. 28.2 and 28.3). Lastly, controllers and processors must maintain a record of processing activities (ROPA) (Art. 30).
Hence, we define GDPR compliance checking as the assessment of the provisions of a compliance document in relation to: (1) a regulation, (2) another document in the supply chain, and (3) an operation. These dimensions define the three key tasks of compliance checking in the data supply chain under the GDPR.
(1) Document to Regulation Compliance: The provisions of a compliance document are assessed against the regulation. For example, according to the GDPR, privacy policies must provide certain mandatory information to data subjects (Art. 12, 13 and 14). We further divide this compliance task into two sub-tasks:
• Formal Compliance Checking: whether the document fulfils its informative function, i.e., is the mandatory information included in the document. For example, this sub-task could consist of the verification of the existence of information about data retention (Art. 13.2(a)).
• Substantive Compliance Checking: whether the document fulfils its accountability function, i.e., does the information provided comply with the GDPR. For example, this sub-task could consist of the verification that the data retention period is lawful, i.e., it does not exceed the necessary time as required by Article 5.1(e) (see the sketch after this list).
(2) Document Chain Compliance: The supply chain documents are assessed against the main contractual standards or against documents with a higher hierarchy in the chain. Using the previous example on data retention, two assessments could be done on (i) the contract between the data controllers and the data processors, to verify if the agreed data retention period corresponds to the one provided in the privacy policy, and on (ii) the ROPAs of both the data controllers and processors that contain the effective date of erasure (Art. 30.1).
(3) Operational Compliance: This task consists of assessing the adequacy between operations, documents, and regulations. For the previous example, this task would imply verifying if the data has been effectively deleted from the servers of data processors and controllers when the retention period ends.
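The distinction between the formal and the substantive sub-task can be made concrete with a small sketch. The following Python fragment is only illustrative: the RetentionStatement record and its fields are hypothetical, chosen for the data-retention example, and do not come from the system described in this paper.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RetentionStatement:
    """Data-retention information extracted from a compliance document (hypothetical structure)."""
    stated: bool                      # is a retention period mentioned at all?
    period_months: Optional[int]      # the declared retention period, if any
    necessary_months: Optional[int]   # retention actually needed for the purpose (expert input)

def formal_check(r: RetentionStatement) -> bool:
    # Formal compliance: the mandatory information is present in the document (Art. 13.2(a)).
    return r.stated

def substantive_check(r: RetentionStatement) -> bool:
    # Substantive compliance: the stated retention does not exceed what is necessary (Art. 5.1(e)).
    return (
        r.stated
        and r.period_months is not None
        and r.necessary_months is not None
        and r.period_months <= r.necessary_months
    )

policy = RetentionStatement(stated=True, period_months=36, necessary_months=24)
print(formal_check(policy))       # True: the information is there
print(substantive_check(policy))  # False: the declared period exceeds what is necessary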

3.2 The compliance of privacy policies
Privacy policies are the compliance documents that appear at the top of the data supply chain. Hence, we first apply our framework to analyze privacy policies' compliance in the data supply chain. This paper seeks to define the tasks required to determine a privacy policy's formal and substantive compliance with the GDPR.
The GDPR does not explicitly mention privacy policies, but data controllers widely use them. Their main function is to provide mandatory information to data subjects according to Articles 13 and 14. The absence of any part of this information renders the privacy policy non-compliant. Moreover, the GDPR requires privacy policies to use plain and clear language so individuals can understand how their personal data are processed, provide their consent and exercise their rights. Consequently, the tasks to verify the formal compliance of privacy policies are:
• Check the presence of each piece of mandatory information according to Articles 13 and 14.
• Check the readability and clarity of the language used in the privacy policy.
The substantive compliance checking of privacy policies consists of verifying that the data processing complies with the data protection rules (e.g., fair and transparent processing of Art. 5, and lawfulness of the processing of Art. 6). For example, a privacy policy must specify the legal basis of the processing according to Article 13.1(c), but it must also demonstrate that this legal basis complies with Article 6 requirements on the lawfulness of the processing (e.g., if the legal basis is consent, the controller must ensure that consent has been given for one or more specific purposes).
In this study, we automate the first task of formal compliance checking of privacy policies. In the following sections, we describe how we combined rules and machine learning to check the presence of mandatory information in privacy policies automatically. The end-users of such a system would be lawyers or data protection officers who review large numbers of privacy policies to check their compliance with the GDPR. Another type of end-user would be project managers in small companies who lack the legal knowledge to ensure privacy policies' compliance.

4 EXTRACTION OF DATA PRACTICES FROM PRIVACY POLICIES
Ensuring both the compliance of documents and data processing activities is becoming more burdensome to companies due to several challenges. We focus on the challenge posed by the large number of documents that data protection officers need to review to guarantee compliance. We suggest using natural language processing technologies to assist data protection officers in performing the compliance checking tasks. NLP could help to extract compliance information from unstructured compliance documents and save it into structured formats such as XML or RDF to unlock use cases such as automated compliance checking.
In this paper, we describe our experiment in automating formal compliance of privacy policies. We first train a machine learning algorithm to extract from privacy policies information describing the company's data practices. We then use the extracted information as input to a rule-based system that encodes Articles 13 and 14.

4.1 OPP-115: Training Dataset of Online Privacy Policies
For our task we make use of the Usable Privacy Policy Project's Online Privacy Policies (OPP-115) corpus, introduced by [48], which contains detailed annotations made by Subject Matter Experts (SMEs) for the data practices described in a set of 115 website privacy policies.
At a high level, annotations fall into one of ten data practice categories:
(1) 1st Party Collection/Use: What, why and how information is collected by the service provider.
(2) 3rd Party Sharing/Collection: What, why and how information is shared with or collected by third parties.
(3) User Choice/Control: Control options available to users.
(4) User Access, Edit, & Deletion: If/how users can access, edit or delete information.
(5) Data Retention: How long user information will be stored.
(6) Data Security: Protection measures for user information.
(7) Policy Change: Informing users if policy information has been changed.
(8) Do Not Track: If and how DNT signals for online tracking and advertising are honored.
(9) International & Specific Audiences: Practices pertaining to a specific group of users.
(10) Other: General text, contact information or practices not covered by the other categories.
According to the dataset creators, the best agreement between SMEs was achieved on the Do Not Track class, with Fleiss' Kappa equal to 91%, whereas the most controversial class was Other, with only 49% agreement [48]. We further decompose the latter category into its attributes – "Introductory/Generic", "Privacy Contact Information" and "Practice Not Covered" – resulting in 12 categories.
Figure 1 depicts a fragment of the OPP-115 taxonomy: for each class (grey shaded blocks), a set of lower-level privacy attributes is assigned (20 in total, dark blue shaded blocks), with specific values corresponding to each attribute. For example, the attribute "Personal Information Type" designates the different types of personal information mentioned in the text, as can be seen from the annotations in Figure 2 from the IMDb policy², annotated with the "1st Party Collection/Use" category.
² To retrieve the exact source used: <https://web.archive.org/web/20200526092253if_/https://www.imdb.com/privacy#auto> ("Automatic Information" sub-section).
OPP-115 comprises 3,792 segments, each segment labeled with one or more classes out of 12. The SMEs produced a total of 23K annotations of categories. In aggregate, these categories were associated with 128K values for attributes and 103K selected spans of policy text. To the extent of our knowledge, this is the first effort to leverage these spans to extract information from privacy policies.
We split the OPP-115 dataset on a policy-document level into 3 sets: 65 policies are used for training, 35 for validation and 30 policies are kept as a testing set.

4.2 Problem formulation
The taxonomy of data practices is organized in a class hierarchy that we model as a Directed Acyclic Graph (DAG) shown in Figure 3.
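Because the label hierarchy is a DAG rather than a tree, a plain nested structure is not enough: the same attribute may hang under several categories. A minimal way to represent this is an adjacency mapping, as in the sketch below. The fragment is only illustrative, covering a small hypothetical subset of the taxonomy with names taken from the examples in this section; it is not the released implementation.

# Illustrative subset of the OPP-115 label hierarchy as a DAG (category -> attributes).
taxonomy = {
    "1st Party Collection/Use": ["Personal Information Type", "Purpose", "Action First-Party"],
    "3rd Party Sharing/Collection": ["Personal Information Type", "Purpose"],
    "Data Retention": ["Retention Period", "Retention Purpose", "Personal Information Type"],
    "Data Security": ["Security Measure"],
}

def parents(attribute, dag):
    """Return every category under which an attribute appears (an attribute may have several parents)."""
    return [category for category, attrs in dag.items() if attribute in attrs]

print(parents("Personal Information Type", taxonomy))
# ['1st Party Collection/Use', '3rd Party Sharing/Collection', 'Data Retention']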

Figure 1: The privacy taxonomy of [48]. The top level of the hierarchy (grey shaded blocks) defines coarse-grained data practices or privacy categories. The lower level defines a set of privacy attributes (blue shaded blocks), each assuming a set of values. We show a subset of the taxonomy for clarity and space considerations.

Figure 2: Annotated excerpt from IMDb privacy notice

We choose a DAG structure instead of a tree structure because some attributes are associated with more than one category, i.e., have more than one parent. For example, the attribute "Personal information type" belongs to both "1st Party Collection/Use" and "3rd Party Collection/Use" categories.
The training dataset is a corpus C of N privacy policies P_i: C = {P_1, ..., P_N}. Each privacy policy P_i is a set of annotated segments: P_i = {(s_i1, y_i1), ..., (s_in, y_in)}, where y_i ⊆ Y = {λ_1, ..., λ_L} such that λ_i is a path in the DAG that starts from a category c and ends at a leaf node v.

Figure 3: The DAG structure of the OPP-115 taxonomy

We treat predicting the categories of data practices and the values of each attribute as an HMTC task. There are three methods to solve hierarchical text classification tasks: flat, local, or global methods [39]. The flat method behaves like traditional classification algorithms by ignoring the labels' hierarchy and predicting only classes at the leaf nodes. Local methods take into account the hierarchy by training independent local classifiers. Global methods train a single classifier for all classes. In this paper we conduct two experiments, where we first build a local multi-label classifier, and then cast the HMTC task to two text-to-text tasks.

4.3 Local classifiers approach
This approach is inspired by Polisis [14], where the authors build a local multi-label classifier for the higher-level categories, and one local multi-label classifier per attribute to predict their values. Predictions are made in a top-down order: once the categories of a segment are inferred, the second step predicts the values of attributes – children of the predicted categories. For example, if the first-level classifier predicts the "Data Retention" and "Data Security" categories, only the local classifiers corresponding to the attributes "Retention Period", "Retention Purpose", "Personal Information Type" and "Security Measure" are chosen in the second step.
The authors of [14] use the same base classifier for all the multi-label classifiers. In this paper we reproduce their work by using a CNN as the base classifier, with the same architecture and hyperparameters. The CNN classifier is composed of one convolutional layer with a ReLU activation, followed by a dense layer and a ReLU activation. The last layer is a dense layer with a sigmoid activation. We tokenize segments using Penn Treebank tokenization in NLTK [41]. Tokens are mapped into a 300-dimensional space via an embedding layer. We used FastText to train word embeddings on 130,326 privacy policies [54].
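Read literally, the CNN just described can be sketched as follows in Keras. This is not the released implementation: the vocabulary size, segment length, filter count, kernel size and the pooling step between the convolution and the dense layer are assumptions added to make the sketch self-contained and runnable; only the layer pattern (conv + ReLU, dense + ReLU, dense + sigmoid over 300-dimensional embeddings) comes from the text.

from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEGMENT_LEN, EMBED_DIM, NUM_LABELS = 20000, 200, 300, 12  # assumed sizes

model = keras.Sequential([
    keras.Input(shape=(SEGMENT_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),         # pretrained FastText vectors could be loaded here
    layers.Conv1D(filters=256, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),                     # pooling step assumed between the conv and dense layers
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_LABELS, activation="sigmoid"),  # one independent probability per label (multi-label)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.summary()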
Recently, the state-of-the-art results have been achieved by transformers. We reproduce the framework of Polisis, using XLNet [49] instead of CNN as a base classifier. XLNet is a transformer language model, which extends Transformer-XL [8]. It is an auto-regressive
language model, pretrained on all the permutations of the input sequence. We fine-tune the XLNet³ on 21 tasks – one task for predicting the categories, and the rest for each attribute's values.
³ We used the pretrained model available at the Hugging Face models hub.

Table 1: Results of categories prediction by the local classifiers approach

                                    CNN              XLNet
Category                        P    R   F1       P    R   F1
Introductory/Generic           74   40   52      76   54   63
Policy Change                  85   60   71      73   65   69
Specific audiences             90   77   83      85   80   82
Privacy Contact Info           87   52   65      84   75   79
1st Party Collection           67   87   76      84   81   83
Data Retention                 52   39   45      58   36   44
3rd party sharing/collection   71   85   78      76   87   81
User Choice/Control            45   79   58      66   69   67
Practice Not Covered           39   39   38      40   44   42
Data Security                  79   48   60      77   68   72
Access, Edit, Deletion         87   35   50      75   72   74
Do Not Track                  100   29   45      93  100   96
Macro-Average                  72   55   62      76   71   73

4.4 Text-to-Text approach
In this section we explain how we use T5 [33] to solve HMTC. T5 is a pretrained language model based on the transformer architecture. T5 has two main differences in comparison to XLNet. First, it is pretrained on a multi-task mixture of unsupervised and supervised tasks. Second, each task is converted into a text-to-text format. We adopt T5 both for its top results on NLP benchmarks and for its text-to-text nature.
The local classifiers approach has two main drawbacks. First, it trains the set of local classifiers independently. Second, the number of local classifiers grows linearly with the size of the label hierarchy. These limitations motivate this second approach, where we convert the HMTC task into two text-to-text tasks – one for each level of the label hierarchy – to better capture the dependencies of labels belonging to the same level. Moreover, by training one unique algorithm for each level, we ensure that the number of classifiers scales linearly with the hierarchy's depth.
Thanks to the text-to-text nature of T5, we can simplify HMTC into the two text-to-text tasks shown in Figure 4. To prepare the task of categories prediction, we prepend the "categories prediction: " prefix to the text of segments and generate one sequence of categories separated by "; ", as shown in Figure 5. The lists of categories were sorted in alphabetical order so that they have the same order across training examples, as advised by the authors of [33].
The second task's objective is to predict the values of the attributes of a category from an input segment, as well as to generate the spans of text related to the predicted values. This task is similar to a reading comprehension task [34], where the question is "what is the value of the attribute?", and the context paragraph is the pair (segment, category). So we format it into a text-to-text task, similar to how the authors of T5 formatted the reading comprehension dataset SQuAD (see Figure 6).
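As an illustration of this conversion, the helper below builds the two input/target pairs for a single annotated segment. It is only a sketch following the conventions described above (prefix, "; " separator, alphabetical ordering); the exact prompt wording for the second task is our assumption, since the paper describes it only as SQuAD-like.

def categories_example(segment: str, categories: list[str]) -> tuple[str, str]:
    """Task 1: segment text -> alphabetically sorted, '; '-separated category sequence."""
    source = "categories prediction: " + segment
    target = "; ".join(sorted(categories))
    return source, target

def attribute_example(segment: str, category: str, attribute: str,
                      value: str, span: str) -> tuple[str, str]:
    """Task 2 (SQuAD-style; prompt wording assumed): predict an attribute's value and its supporting span."""
    source = f"values prediction: category: {category} attribute: {attribute} context: {segment}"
    target = f"{value} | {span}"
    return source, target

seg = "We retain your account data for 24 months after account closure."
print(categories_example(seg, ["Data Retention"]))
print(attribute_example(seg, "Data Retention", "Retention Period",
                        "Stated Period", "for 24 months after account closure"))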
Once we format the tasks, we fine-tune the largest available T5 of 11B parameters on these tasks. We try two fine-tuning methods: the first method (advised by [33]) is to fine-tune on each task independently, and the second method is to fine-tune in a multi-task setting on a mixture of both tasks to capture the global label hierarchy. We fixed the input sequence length and output sequence length to 512 and the batch size to 16, and performed a grid-search over the learning rate. The model was fine-tuned each time for 25,000 steps. Interestingly, the best-performing learning rate is the same (4e-3) for all the models and tasks.

4.5 Evaluation measures
Evaluation of multi-label classification: We use precision, recall, and F1-score to evaluate the extraction of both coarse-grained and fine-grained data practices from privacy policy segments. Since we are in a multi-label classification setting, we adapt the traditional single-label metrics by using label-based metrics [52]: precision, recall, and F1-score for the j-th class label y_j are defined as follows:
P_j = TP_j / (TP_j + FP_j),   R_j = TP_j / (TP_j + FN_j),   F1_j = 2 · P_j · R_j / (P_j + R_j)
where TP_j, FP_j, TN_j and FN_j are the numbers of true positive, false positive, true negative and false negative test examples with respect to class label y_j. To measure the global performance over a set of labels we compute the macro-average of each metric by averaging over the set of labels⁴.
⁴ It is worth noting that we do not use the same precision, recall and F1-score as in [14], where they use the macro-average of each metric predicting the presence and absence of the label.
For the task of values prediction, we only report the results for attributes and values that we use to automate formal compliance checking of privacy policies, such that the reported metrics are the macro-average over the values necessary for compliance checking.
Evaluation of span extraction: We use the F1-score, as in the SQuAD dataset [34], to evaluate the extraction of spans associated with the values, by comparing the ground truth target to the generated target.
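The span-level F1 referred to above is the usual token-overlap score from SQuAD evaluation. A minimal version, with our own simplified normalization (lower-casing only, whereas SQuAD also strips punctuation and articles), looks like this:

from collections import Counter

def span_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a generated span and the ground-truth span (SQuAD-style)."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(span_f1("for 24 months after account closure", "24 months after closure"))  # 0.8, partial overlap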
4.6 Results and Discussion
Local classifiers approach: In Table 1, we report the results of the evaluation of the CNN and XLNet experiments on categories prediction. We present the results of CNN and XLNet for the task of values prediction in Table 2. XLNet has superior performance compared to CNN for the task of categories prediction. However, it has significantly lower performance than CNN for the second task, because there are enough examples to fine-tune XLNet for categories prediction but not enough for values prediction.
Text-to-text approach: We report the results of evaluating the two tasks on the test dataset in Table 3, Table 4, and Table 5. We observe that the individual fine-tuning and multi-task fine-tuning have a close recall for both tasks, but they differ significantly in precision. By fine-tuning each task separately, we obtain a 2.6% precision improvement for the task of categories prediction and
4.5% precision improvement for values prediction. This behavior, where separate models trained on each task outperform the multi-task model, is coherent with previous findings [4, 26, 33].

Figure 4: The hierarchical multi-label classification of OPP-115 is converted into two text-to-text tasks. The first task is to predict categories of data-practice from an input segment. We then retrieve the attributes of each predicted category to feed them and their category as input to the second task. The second task is to predict the values of the input attributes and to generate the corresponding span of text (highlighted).

Figure 5: Input example to train T5 for categories prediction

Figure 6: Example of an input to train T5 for prediction of attributes' values and the corresponding spans of text

Table 2: Results of values prediction by CNN and XLNet for attributes used in compliance checking. We report the macro-average over the values of each attribute.

                                    CNN              XLNet
Attribute                       P    R   F1       P    R   F1
action first-party             44   47   45      16   33   21
does/does not                  93   77   84      84   82   83
personal information type      73   58   64      56   49   52
purpose                        74   56   64      74   64   69
retention period               79   62   69      50   11   18
access type                    61   51   55      30   26   27
Macro-average                  72   60   65      51   44   47

Table 3: Results of categories prediction by T5

                              Multi-task Fine-tuning   Task 1 Fine-tuning
Category                          P    R   F1              P    R   F1
Introductory/Generic             68   61   64             69   58   63
Policy Change                    82   67   74             82   69   75
Specific audiences               94   84   88             91   85   87
Privacy Contact Info             81   82   82             79   69   74
1st Party Collection             88   87   87             90   85   87
Data Retention                   65   47   55             77   52   62
3rd party sharing/collection     83   86   85             85   85   85
User Choice/Control              65   60   62             66   64   65
Practice Not Covered             46   45   46             49   47   48
Data Security                    73   66   69             78   70   74
Access, Edit, Deletion           82   84   83             78   76   77
Do Not Track                    100   75   85            100   62   76
Macro-Average                    77   70   73             79   69   73

Table 4: Results of values prediction for attributes used in compliance checking by T5. We report the macro-average over the values of each attribute.

                              Multi-task Fine-tuning   Task 2 Fine-tuning
Attribute                         P    R   F1              P    R   F1
action first-party               55   56   55             62   58   60
does/does not                    90   78   83             91   83   87
personal information type        72   61   66             73   63   68
purpose                          72   65   69             74   67   70
retention period                 53   47   50             50   25   33
access type                      62   58   60             71   70   70
Macro-average                    67   61   63             70   61   65

The performance of span extraction, presented in Table 5, is low in comparison with the performance of transformer models
on similar tasks such as reading comprehension or named entity recognition, which might be due to the relatively small number of training examples given to T5.

Table 5: Results of evaluation of span extraction by T5.

   Multi-task Fine-tuning   Task 2 Fine-tuning
       P    R   F1              P    R   F1
      64   57   52             65   59   54

Local classifiers approach vs. text-to-text approach: We present in Table 6 the macro-average of the different experiments for both tasks – categories and values prediction. Fine-tuning separate models of T5 for each task achieves the highest F1 score. Both XLNet and T5 of the transformer family significantly outperform CNN on the first task. However, the performance of transformers on the task of values prediction is at its best close to CNN's performance.
The local classifiers approach requires fine-tuning separate models for each attribute. Consequently, it decreases the number of training examples seen by each model, explaining the significant performance gap between task 1 and task 2. This performance gap is more important for XLNet than CNN, which could result from the high number of parameters of XLNet we need to fine-tune in comparison to the CNN architecture. T5 has even more parameters than XLNet, but its performance does not drop as significantly as XLNet's. We can explain this difference by the nature of the text-to-text approach, where we fine-tune one model of T5 on the prediction of the values of all the attributes instead of individual fine-tuning for each attribute. Hence, T5 sees many more examples than XLNet.

Table 6: Macro-Average of precision, recall and F1-score for categories prediction (Task 1) and values prediction (Task 2) by the four models.

                                        Task 1           Task 2
Approach                             P    R   F1      P    R   F1
Local classifiers   CNN             72   55   62     72   60   65
                    XLNet           76   71   73     51   44   47
Text-to-Text: T5    Multi-task FT   77   70   73     67   61   63
                    Individual FT   79   69   73     70   61   65

5 RULE BASED COMPLIANCE CHECKING
5.1 From rules to code
This section describes our rule-based approach to automate the formal compliance checking of privacy policies. Privacy policies must comply with Articles 12, 13, and 14 of the GDPR. We limit our experiments to Articles 13 and 14, listing the mandatory information that privacy policies must contain, for which we can use the information extraction algorithms described in the previous section. To verify compliance with Article 12, we will need to develop other algorithms to assess how the mandatory information is communicated to data subjects (language complexity, length of sentences, etc.). Legal experts manually converted rules from Articles 13 and 14 into code using the OPP-115 taxonomy. As the OPP-115 taxonomy does not cover all the concepts of the GDPR, we only encoded the articles presented in the second column of Table 7.

Table 7: List of mandatory information from Articles 13 and 14 of the GDPR encoded by us using the OPP-115 taxonomy.

Mandatory information                                               Article Reference
Identity of the controller                                          13 1.a; 14 1.a
Contact details of the controller                                   13 1.a; 14 1.a
Purpose of the processing of personal data                          13 1.c; 14 1.c
Right to data portability                                           13 2.b; 14 2.c
Right to erasure                                                    13 2.b; 14 2.c
Right to rectification                                              13 2.b; 14 2.c
Right to access                                                     13 2.b; 14 2.c
Data retention period or the criteria of retention period           13 2.a; 14 2.a
The recipients or categories of recipients of the personal data     13 1.e; 14 1.e
Categories of personal data                                         14 1.d

We plan to build a GUI for legal experts to convert compliance rules into code. To do so, we choose JsonLogic to serialize the obtained rules as a JSON file. JsonLogic provides a simple mechanism to share rules between the front-end and back-end of a GUI. It comes with a parser that we use to build a first-order logic inference engine in Python. In Figure 7 we present an example of a rule encoded with JsonLogic and the OPP-115 taxonomy. We consider that the purpose of processing personal data is mentioned in the privacy policy if there is at least one data practice whose category is "1st Party Collection" and whose attribute "Purpose" has a value different from "Unspecified".

{"some":[{"var":"data_practices"},
  {"and":[{"==":[{"var":"category"},"first_party"]},
          {"!=":[{"var":"attributes.purpose"},"unspecified"]}]}]}

Figure 7: Example of encoding a GDPR rule with JsonLogic. The rule states the obligation of mentioning the purpose of the processing of personal data.

5.2 Evaluation
The OPP-115 taxonomy was created before the entry into force of the GDPR. To evaluate its capacity to capture GDPR concepts, we create a dataset of 30 privacy policies where legal experts indicate each mandatory information's presence. We use this dataset as ground truth of the mandatory information listed in Table 7.
The ground truth dataset contains two types of privacy policies: 15 privacy policies are from the OPP-115 dataset and 15 post-GDPR privacy policies are from the corpus released by [23]. We extract data practices from privacy policies and feed them to the inference engine to check for the presence of each mandatory information. For the first type of privacy policy, we use the ground truth data practices extracted manually by legal experts from [48]. For the second type, we use data practices predicted by T5. Therefore, any error on the first type of privacy policies will not be due to machine learning errors but due to the OPP-115 taxonomy used to encode mandatory information.
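Before turning to the results, here is a minimal sketch of how a rule such as the one in Figure 7 could be executed over the extractor's output. The evaluator below handles only the JsonLogic operators used in that figure ("some", "and", "==", "!=", "var"), and the policy structure shown is a hypothetical simplification of the extracted data practices; the actual inference engine described above is built on the JsonLogic parser.

def get_var(path, data):
    """Resolve a dotted 'var' path such as 'attributes.purpose' against a dict."""
    for key in path.split("."):
        data = data.get(key) if isinstance(data, dict) else None
    return data

def apply_rule(rule, data):
    """Evaluate the small subset of JsonLogic operators used in Figure 7."""
    if not isinstance(rule, dict):
        return rule                      # literals evaluate to themselves
    op, args = next(iter(rule.items()))  # each rule node has a single operator key
    if op == "var":
        return get_var(args, data)
    if op == "some":
        items, condition = args          # condition is evaluated with each element as the data context
        return any(apply_rule(condition, item) for item in (apply_rule(items, data) or []))
    if op == "and":
        return all(apply_rule(a, data) for a in args)
    if op == "==":
        return apply_rule(args[0], data) == apply_rule(args[1], data)
    if op == "!=":
        return apply_rule(args[0], data) != apply_rule(args[1], data)
    raise ValueError(f"unsupported operator: {op}")

purpose_rule = {"some": [{"var": "data_practices"},
                         {"and": [{"==": [{"var": "category"}, "first_party"]},
                                  {"!=": [{"var": "attributes.purpose"}, "unspecified"]}]}]}

# Hypothetical extractor output for one policy: a list of data practices with attribute values.
policy = {"data_practices": [
    {"category": "first_party", "attributes": {"purpose": "advertising"}},
    {"category": "data_retention", "attributes": {"retention_period": "unspecified"}},
]}

print(apply_rule(purpose_rule, policy))  # True: the processing purpose is mentioned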

We report metrics of both the absence and presence of information in Table 8. Our objective is to detect non-compliance and send the documents for review by experts, so we need to maximize the number of absent mandatory information items we can detect.

Table 8: Results of mandatory information detection from both OPP-115 and post-GDPR privacy policies.

Dataset                   P    R   F1
OPP-115: absence         93   88   90
OPP-115: presence        93   97   95
Post-GDPR: absence       78   78   78
Post-GDPR: presence      91   91   91

Most of the errors on the OPP-115 dataset (see Figure 8) are caused by the difficulty of aligning OPP-115 concepts with GDPR concepts. For example, to encode "Data retention period" we used the "Retention period" attribute. However, even when the annotators select a value for the "Retention period" in [48], it does not always concern all the collected personal data. In contrast, the GDPR requires that the retention period is stated for all of the data.
The majority of errors on the post-GDPR policies occur when the algorithm does not detect the right to data portability. This type of error is expected: data portability had not been widely adopted before the GDPR (Art. 20), and no privacy policy from the OPP-115 dataset mentions the right to data portability.
Although our rules did not capture the right to data portability, T5 correctly predicted the category "data portability" from this sentence: "Data portability, that is to say the possibility of receiving these data in a structured format that is readable by an automatic device and of sending them to another processing owner without any impediments." T5 could predict new categories of data practices due to its language understanding capabilities that enable few- and zero-shot inference [22, 50, 53]. Other new categories from the GDPR articles, predicted for post-GDPR policies, are: "data minimisation" (Art. 5.1(c)), "data accuracy" (Art. 5.1(d)), "legitimate interests" (Art. 6.1(f)), "lawful basis" (Art. 6), and "right to object" (Art. 21).

Figure 8: The distribution of errors over types of mandatory information for OPP-115 and post-GDPR privacy policies (bar chart; y-axis: normalized count of errors; series: OPP-115 and Post-GDPR).

6 CONCLUSION
This study designed a theoretical framework to implement and monitor GDPR compliance in the data supply chain through a document-based approach, for which we defined three key tasks. We proposed a formal and substantive approach to verify GDPR-compliance of privacy policies. It is worth highlighting that, as a potential next step, our framework could be adapted to other compliance documents like DPIAs and/or ROPAs. More broadly, research is also needed to implement this framework in a multi-document setting, where data processing activities are described in multiple documents.
Our second significant contribution is the experimentation on the automation of formal compliance checking of privacy policies, which could be generalized to other documents in the data supply chain as a means to improve end-to-end compliance. We built a system combining machine learning and rules to detect the presence of information required by the GDPR. We fine-tuned the T5 model in a multi-task setting and achieved good performance predicting both coarse-grained and fine-grained data practices with only one model. The T5 model also extracts the spans of text corresponding to the fine-grained data practices. These spans of text could be used to explain the predicted values.
We used the OPP-115 taxonomy to encode 10 GDPR rules from Articles 13 and 14 concerning the information a privacy policy should contain. We evaluated the system on a corpus of 30 privacy policies, where legal experts indicated the presence of mandatory information. Although the OPP-115 taxonomy is pre-GDPR, it proved capable of capturing some mandatory information in both pre-GDPR and post-GDPR policies. Currently, it is one of the most valuable resources in our research community. Still, it is not enough to encode both GDPR rules and data protection activities defined in compliance documents. Thus, there is a need for a new corpus of data protection documents from the data supply chain to automate compliance checking tasks, which we leave for future work.
Additionally, we pointed out that T5 was able to predict new categories such as data portability. This capacity of zero-shot prediction can be leveraged to assist law and privacy scholars in creating a GDPR taxonomy compatible with the variety of compliance documents in the data supply chain.

REFERENCES
[1] 2017. The True Cost of Compliance with Data Protection Regulations. Technical Report. Ponemon Institute LLC.
[2] 2019. ICO Guidance: Update report into adtech and real time bidding. Technical Report. Information Commissioner's Office. 19–21 pages.
[3] David Restrepo Amariles, Aurore Clément Troussel, and Rajaa El Hamdani. 2020. Compliance Generation for Privacy Documents under GDPR: A Roadmap for Implementing Automation and Machine Learning. arXiv preprint arXiv:2012.12718 (2020).
[4] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019 (2019).
[5] Jaspreet Bhatia, Travis D Breaux, Joel R Reidenberg, and Thomas B Norton. 2016. A theory of vagueness and privacy risk perception. In 2016 IEEE 24th International Requirements Engineering Conference (RE). IEEE, 26–35.
[6] Giuseppe Contissa, Koen Docter, Francesca Lagioia, Marco Lippi, Hans-W Micklitz, Przemysław Pałka, Giovanni Sartor, and Paolo Torroni. 2018. Claudette meets gdpr: Automating the evaluation of privacy policies using artificial intelligence. Available at SSRN 3208596 (2018).
[7] Elisa Costante, Yuanhao Sun, Milan Petković, and Jerry Den Hartog. 2012. A machine learning solution to assess privacy policy completeness: (short paper).

In Proceedings of the 2012 ACM Workshop on Privacy in the Electronic Society. [33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
91–96. Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the lim-
[8] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan its of transfer learning with a unified text-to-text transformer. arXiv preprint
Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed- arXiv:1910.10683 (2019).
length context. arXiv preprint arXiv:1901.02860 (2019). [34] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know:
[9] Marina De Vos, Sabrina Kirrane, Julian Padget, and Ken Satoh. 2019. ODRL policy Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018).
modelling and compliance checking. In International Joint Conference on Rules [35] Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and
and Reasoning. Springer, 36–51. Norman Sadeh. 2019. Question answering for privacy policies: Combining com-
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: putational and legal perspectives. arXiv preprint arXiv:1911.00841 (2019).
Pre-training of deep bidirectional transformers for language understanding. arXiv [36] Joel R Reidenberg, Jaspreet Bhatia, Travis D Breaux, and Thomas B Norton. 2016.
preprint arXiv:1810.04805 (2018). Ambiguity in privacy policies and the impact of regulation. The Journal of Legal
[11] Olha Drozd and Sabrina Kirrane. 2020. Privacy CURE: Consent Comprehension Studies 45, S2 (2016), S163–S190.
Made Easy. In IFIP International Conference on ICT Systems Security and Privacy [37] Community Research and Development Information Service. 2021. Business
Protection. Springer, 124–139. Process Re-engineering and functional toolkit for GDPR compliance. https:
[12] María Teresa Gómez-López, Luisa Parody, Rafael M Gasca, and Stefanie Rinderle- //cordis.europa.eu/project/id/787149/results. Accessed: 2021-02-28.
Ma. 2014. Prognosing the compliance of declarative business processes using [38] Kanthashree Mysore Sathyendra, Shomir Wilson, Florian Schaub, Sebastian
event trace robustness. In OTM Confederated International Conferences" On the Zimmeck, and Norman Sadeh. 2017. Identifying the provision of choices in
Move to Meaningful Internet Systems". Springer, 327–344. privacy policy text. In Proceedings of the 2017 Conference on Empirical Methods in
[13] Guido Governatori and Sidney Shek. 2012. Rule Based Business Process Compli- Natural Language Processing. 2774–2779.
ance.. In RuleML (2). Citeseer. [39] Carlos N Silla and Alex A Freitas. 2011. A survey of hierarchical classification
[14] Hamza Harkous, Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G Shin, across different application domains. Data Mining and Knowledge Discovery 22, 1
and Karl Aberer. 2018. Polisis: Automated analysis and presentation of privacy (2011), 31–72.
policies using deep learning. In 27th {USENIX } Security Symposium ( {USENIX } [40] Laurens Sion, Pierre Dewitte, Dimitri Van Landuyt, Kim Wuyts, Peggy Valcke,
Security 18). 531–548. and Wouter Joosen. 2020. DPMF: A Modeling Framework for Data Protection by
[15] Mustafa Hashmi, Guido Governatori, and Moe Thandar Wynn. 2012. Business Design. Enterprise Modelling and Information Systems Architectures (EMISAJ) 15
process data compliance. In International Workshop on Rules and Rule Markup (2020), 10–1.
Languages for the Semantic Web. Springer, 32–46. [41] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014.
[16] Mustafa Hashmi, Guido Governatori, and Moe Thandar Wynn. 2016. Norma- Learning sentiment-specific word embedding for twitter sentiment classification.
tive requirements for regulatory compliance: An abstract formal framework. In Proceedings of the 52nd Annual Meeting of the Association for Computational
Information Systems Frontiers 18, 3 (2016), 429–455. Linguistics (Volume 1: Long Papers). 1555–1565.
[17] Martin Hepp. 2008. Ontologies: State of the art, business potential, and grand [42] Welderufael B Tesfay, Peter Hofmann, Toru Nakamura, Shinsaku Kiyomoto,
challenges. Ontology Management (2008), 3–22. and Jetzabel Serna. 2018. PrivacyGuide: towards an implementation of the EU
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural GDPR on internet privacy policy evaluation. In Proceedings of the Fourth ACM
computation 9, 8 (1997), 1735–1780. International Workshop on Security and Privacy Analytics. 15–21.
[19] Katsiaryna Krasnashchok, Majd Mustapha, Anas Al Bassit, and Sabri Skhiri. 2020. [43] Damiano Torre, Sallam Abualhaija, Mehrdad Sabetzadeh, Lionel Briand, Katrien
Towards Privacy Policy Conceptual Modeling. In International Conference on Baetens, Peter Goes, and Sylvie Forastier. 2020. An ai-assisted approach for
Conceptual Modeling. Springer, 429–438. checking the completeness of privacy policies against gdpr. In 2020 IEEE 28th
[20] Logan Lebanoff and Fei Liu. 2018. Automatic detection of vague words and International Requirements Engineering Conference (RE). IEEE, 136–146.
sentences in privacy policies. arXiv preprint arXiv:1808.06219 (2018). [44] Damiano Torre, Ghanem Soltana, Mehrdad Sabetzadeh, Lionel C Briand, Yuri
[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient- Auffinger, and Peter Goes. 2019. Using models to enable compliance checking
based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278– against the GDPR: an experience report. In 2019 ACM/IEEE 22nd International
2324. Conference on Model Driven Engineering Languages and Systems (MODELS). IEEE,
[22] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot 1–11.
relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115 [45] Silvano Colombo Tosatto, Guido Governatori, Nick van Beest, and Francesco
(2017). Olivieri. 2019. Efficient Full Compliance Checking of Concurrent Components
[23] Thomas Linden, Rishabh Khandelwal, Hamza Harkous, and Kassem Fawaz. 2020. for business Process Models. FLAP 6, 5 (2019), 963–998.
The privacy policy landscape after the GDPR. Proceedings on Privacy Enhancing [46] Sebastian Urbina. 2002. Legal method and the rule of law. Vol. 59. Springer Science
Technologies 2020, 1 (2020), 47–64. & Business Media.
[24] Fei Liu, Nicole Lee Fella, and Kexin Liao. 2018. Modeling language vagueness [47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
in privacy policies using deep neural networks. arXiv preprint arXiv:1805.10393 Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
(2018). you need. arXiv preprint arXiv:1706.03762 (2017).
[25] Frederick Liu, Shomir Wilson, Peter Story, Sebastian Zimmeck, and Norman [48] Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain
Sadeh. 2018. Towards automatic classification of privacy policy text. School of Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck,
Computer Science Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU-ISR- Kanthashree Mysore Sathyendra, N Cameron Russell, et al. 2016. The creation
17-118R and CMULTI-17-010 (2018). and analysis of a website privacy policy corpus. In Proceedings of the 54th Annual
[26] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
The natural language decathlon: Multitask learning as question answering. arXiv 1330–1340.
preprint arXiv:1806.08730 (2018). [49] Z. Yang, Zihang Dai, Yiming Yang, J. Carbonell, R. Salakhutdinov, and Quoc V.
[27] Tomas Mikolov, Kai Chen, G. S. Corrado, and J. Dean. 2013. Efficient Estimation Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Under-
of Word Representations in Vector Space. In ICLR. standing. In NeurIPS.
[28] Majd Mustapha, Katsiaryna Krasnashchok, Anas Al Bassit, and Sabri Skhiri. [50] Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text
2020. Privacy Policy Classification with XLNet (Short Paper). In Data Privacy classification: Datasets, evaluation and entailment approach. arXiv preprint
Management, Cryptocurrencies and Blockchain Technology. Springer, 250–257. arXiv:1909.00161 (2019).
[29] Najmeh Mousavi Nejad, Pablo Jabat, Rostislav Nedelchev, Simon Scerri, and [51] Razieh Nokhbeh Zaeem, Rachel L German, and K Suzanne Barber. 2018. Privacy-
Damien Graux. 2020. Establishing a strong baseline for privacy policy classi- check: Automatic summarization of privacy policies using data mining. ACM
fication. In IFIP International Conference on ICT Systems Security and Privacy Transactions on Internet Technology (TOIT) 18, 4 (2018), 1–18.
Protection. Springer, 370–383. [52] Min-Ling Zhang and Zhi-Hua Zhou. 2013. A review on multi-label learning
[30] Monica Palmirani, Michele Martoni, Arianna Rossi, Cesare Bartolini, and Livio algorithms. IEEE transactions on knowledge and data engineering 26, 8 (2013),
Robaldo. 2018. PrOnto: Privacy ontology for legal reasoning. In International 1819–1837.
Conference on Electronic Government and the Information Systems Perspective. [53] Ben Zhou, Daniel Khashabi, Chen-Tse Tsai, and Dan Roth. 2019. Zero-shot open
Springer, 139–152. entity typing as type-compatible grounding. arXiv preprint arXiv:1907.03228
[31] Ellen Poplavska, Thomas B Norton, Shomir Wilson, and Norman Sadeh. 2020. (2019).
From Prescription to Description: Mapping the GDPR to a Privacy Policy Corpus [54] Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi
Annotation Scheme. In 33rd International Conference on Legal Knowledge and Wang, Joel Reidenberg, N Cameron Russell, and Norman Sadeh. 2019. Maps:
Information Systems, JURIX 2020. IOS Press BV, 243–246. Scaling privacy compliance analysis to a million apps. Proceedings on Privacy
[32] Wenjun Qiu and David Lie. 2020. Deep Active Learning with Crowdsourcing Enhancing Technologies 2019, 3 (2019), 66–86.
Data for Privacy Policy Classification. arXiv preprint arXiv:2008.02954 (2020).

On Semantics-based Minimal Revision for Legal Reasoning
Wachara Fungwacharakorn, Kanae Tsushima, Ken Satoh
National Institute of Informatics and SOKENDAI, Chiyoda, Tokyo, Japan
wacharaf@nii.ac.jp, k_tsushima@nii.ac.jp, ksatoh@nii.ac.jp

ABSTRACT
When literal interpretation of statutes leads to counterintuitive consequences, judges, especially in high courts, may identify counterintuitive consequences and revise interpretation of statutes. Researchers have studied revisions for computational legal representation. Generally, studies on revision usually consider minimal revision to reflect limitation of judges' legislative power. However, those studies tend to minimize the number of operations used for changing rules rather than minimize the changes of semantics (the set of conclusions obtained from the program), which vary among cases. In this paper, we consider minimizing the changes of semantics of a rule-base written in a normal logic program. We consider that each possible fact-base (the representation of a case) has its corresponding semantics and corresponding dominant rule-base, which is a set of Horn clauses obtained from the subset of rule-base that is specific to the considered fact-base. Hence, we present a new sub type of semantics-based minimal revision called a dominant-based minimal revision. Furthermore, we present one guidance to obtain one dominant-based minimal revision by using legal debugging and Closed World Specification. We also compare the dominant-based minimal revision with the syntax-based minimal revision in Theory Distance Metric. As the syntax-based minimal revision minimizes the number of operations used for changing rules, the comparison shows that the syntax-based minimal revision may cause extra semantics changes compared to the dominant-based minimal revision, especially when the rule-base contains multiple rules for the same consequence. We discuss that such extra semantics changes can be considered as unintentional changes caused by the syntax-based minimal revision. Hence, legal reasoning systems can check with the user such extra semantics changes to confirm the user intention of changes.

CCS CONCEPTS
• Computing methodologies → Nonmonotonic, default reasoning and belief revision; • Applied computing → Law.

KEYWORDS
legal reasoning, legal representation, theory revision

ACM Reference Format:
Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh. 2021. On Semantics-based Minimal Revision for Legal Reasoning. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466075

1 INTRODUCTION
When literal interpretation of the law leads to counterintuitive consequences, judges, especially in high courts, may have a legal judgement that also revises interpretation of statutes. Such a legal judgement is called a case law, which suggests more appropriate interpretation of statutes for a present case and also later cases. Such revisions are usually described as discoveries of essential conditions, which are revealed in a real-life case [27]. This is similar to a qualification problem [42] in artificial intelligence, which discusses the necessity of learning essential conditions in real-life since we cannot know them all in the first place.
Computational law researchers have been long interested in building legal representations for interpretations of statutes and modelling revisions for legal representation. Such legal representations are, for example, Institutional models [15], Defeasible logic [21], or PROLEG [37], with revision models in Institutional model [28], in Defeasible logic [22], or in PROLEG [18]. In addition, some studies focus on modelling revisions that are not specific for one legal representation such as [4]. One theme that some studies (e.g. [4, 28, 35]) have in common is that a revision should be a minimal revision of legal rules. This reflects a principle that judges usually limit themselves from legislative issues. However, since legal reasoning is a hybrid between reasoning by rules and reasoning by cases [10, 43], a revision should not concern only effects on rules but also effects on interpretations applying in each case. Although effects of changes on each case and its corresponding semantics has been considered in hybrid legal reasoning systems between rule-based and case-based [32, 46], effects of such changes on revisions of rules in pure rule-based legal reasoning systems have not been fully investigated.
Therefore, in this paper, we present one semantics-based minimal revision relying on a rule-based structure called a dominant rule-base. Firstly, we introduce semantics of a rule-base written in a normal logic program and the difference of semantics. Semantics of a rule-base is defined by a set of all possible fact-bases and corresponding interpretations of statutes in pairs whereas the difference of semantics is a set of all possible fact-bases with corresponding differences of interpretations. Then, we define the semantics-based minimal revision, which means a revision that causes the smallest semantics changes. However, since the definition of semantics-based minimal revision is considered to be too strong, we relax the
definition by defining a partial semantics-based minimal revision. [13]), recent studies become widely interested in revisions of such
Then, we describe a sub type of a partial semantics-based minimal theories. For example, Henderson and Bench-Capon [24] have en-
revision based on a dominant rule-base, which is a set of Horn couraged contrastive revision of a theory by not only supporting
clauses obtained from the subset of rule-base specific to the consid- accepted consequences but also attacking the objected ones. Rotolo
ered fact-base. Since the judicial revision of legal interpretation is and Roversi [35] have classified theory revision criteria and mini-
initiated by an exceptional case, we present one guidance to obtain mal change is one of such criteria. This criterion comes from the
the dominant-based minimal revision based on Legal Debugging standard revision theory, which intuitively states that we should
[17] and Closed World Specification [6] by putting all facts in an adjust theories as little as possible. To minimally revise rule-based
exceptional case as a condition of a new rule. We show that this theories, they present two strategies:
revision would affect only the dominant rule-base of the fact-bases (1) keep the set of rules as close as possible to the original one
that are larger or equal to the exceptional case. Then, to optimize (this strategy is independent of the facts of the case)
the revision, we remove redundant conditions in the body as long (2) minimize the changes of the set of conclusions obtained from
as the minimality of the revision remains. the theory (the strategy is dependent of the facts of the case
In spite of definitions for the dominant-based minimal revision since different facts give different set of conclusions)
and the guidance to obtain it, we also compare the dominant-based
minimal revision with the syntax-based minimal revision in The- The first strategy could be traced back to the early studies of syntax-
ory Distance Metric [47], which states that a distance of revision based (or formula-based) minimal revision e.g. [29, 47]. On the other
is the minimal number of program edit operations (i.e. deleting hand, the second strategy could be traced back to the early studies
a rule, adding a rule with an empty body, adding a condition to of semantics-based (or model-based) minimal revision e.g. [14, 36]
a rule or deleting a condition from a rule) that are used for re- (for an intensive survey on both types of minimal revision, see [16],
vising an original program to a new program. The definition has for an intensive survey specific to semantics-based revision, see
been applied in many rule-based revisions (e.g. [12, 28]) due to its [26]).
simplicity. From this comparison, we show that the syntax-based Explaining legal change can also be seen in recent extensions
minimal revision may cause extra semantics changes compared to of HYPO [7, 31]. For example, Horty Bench-Capon theory [25],
the dominant-based minimal revision, especially when the original one extension of HYPO, states that a new legal judgement could
rule-base contains multiple rules for the same consequence. We be made as long as it preserves the precedential constraint. This
discuss that such extra semantics changes can be considered as guides a judge to introduce new factors in a new case to preserve
changes unintentionally caused by the syntax-based minimal revi- the precedential constraint if the new case is exceptional. Horty
sion and the legal reasoning system can use such extra semantics Bench-Capon theory has been extended into Rigoni theory [32]
changes for check with the user the intention of changes. for supporting hybrid legal reasoning systems between rule-based
This paper is structured as follows. Section 2 reviews related and case-based. Moreover, recently presented Verheij’s case models
work. Section 3 provides preliminary definitions of a normal logic [44–46] have become new explanations of hybrid legal reasoning
program for legal reasoning and a revision of a normal logic pro- systems by using the preference of cases. In our study, we explain
gram based on Legal Debugging and Closed World Specification. the legal change using a dominant rule-base. Hence, our study is
Section 4 presents the definitions of semantics of a rule-base, the different from Rigoni theory and Verheij’s case models since their
difference of semantics, the semantics-based minimal revision and approaches focus on the effects of legal change on cases such as the
the dominant-based minimal revision. Section 5 describes one guid- precedential constraint or the preference of cases but our approach
ance to obtain a dominant-based minimal revision based on Legal focuses on the effects of legal change on revision of rules.
Debugging and Closed World Specification. Section 6 compares the
dominant-based minimal revision with the syntax-based minimal 3 PRELIMINARIES
revision in Theory Distance Metric with the example of the judicial In this paper, we consider a normal logic program (also known
revision of Japanese Civil Code Article 612. Section 7 discusses the as Prolog program) for a computational legal representation as in
results obtained from the comparison between the dominant-based [38, 39, 41]. We use notations in our paper as follows.
minimal revision and the syntax-based minimal revision. Finally,
Section 8 provides the conclusion and future works. Definition 3.1 (Normal Logic Program). A normal logic program
(hereafter, a program) is a set of rules of the form

2 RELATED WORK ℎ ← 𝑏 1, . . . , 𝑏𝑚 , 𝑛𝑜𝑡 𝑏𝑚+1, . . . , 𝑛𝑜𝑡 𝑏𝑛 . (1)


AI and Law researchers have been long interested in legal change
[2] especially in case-based legal reasoning systems since such legal where ℎ, 𝑏 1, . . . , 𝑏𝑛 are propositions. Let 𝑅 be a rule of the form
change can be detected through a temporal order of precedent cases (1), we have
[8, 34]. The role of legal context in legal change has been explored • ℎ as a head of a rule or a consequence of a rule denoted by
in [23] and classified into three aspects, which are teleological rela- ℎ𝑒𝑎𝑑 (𝑅),
tions, temporal relations, and procedural postures. As AI and Law • {𝑏 1, . . . , 𝑏𝑚 } as a positive body of a rule denoted by 𝑝𝑜𝑠 (𝑅)
researchers become interested in building case law theories (for (each element of a positive body is called a requisite),
instance, building case law theories [11] from a case-base in HYPO • {𝑏𝑚+1, . . . , 𝑏𝑛 } as a negative body of a rule denoted by 𝑛𝑒𝑔(𝑅)
[33] or building case law theories [5] from a case-base in AA-CBR (each element of a negative body is called an exception),

51
On Semantics-based Minimal Revision for Legal Reasoning ICAIL’21, June 21–25, 2021, São Paulo, Brazil

• {𝑏 1, . . . , 𝑏𝑚 , 𝑛𝑜𝑡 𝑏𝑚+1, . . . , 𝑛𝑜𝑡 𝑏𝑛 } as a body of a rule denoted the present case. Then, judges use the introduced factual concept
by 𝑏𝑜𝑑𝑦 (𝑅) (each element of a body is called a condition). to revise the statute so that the counterintuitive consequence is
Sometimes, we express the rule in the form ℎ ← 𝐵. where 𝐵 is a resolved. To formalize this procedure, we firstly define agreement
body of rule. We express ℎ. (called a fact) if the body of the rule is and disagreement as follows.
empty. A rule is called a Horn clause if the negative body of the rule Definition 3.4 (Agreement and Disagreement). Let 𝐹 𝐵 1, 𝐹 𝐵 2 be
is empty. fact-bases, 𝑅𝐵 1, 𝑅𝐵 2 be rule-bases, and 𝑝 be a proposition. We say
Since a program representing statutes is generally a non-recursive 𝐹 𝐵 1 ∪ 𝑅𝐵 1 agrees with 𝐹 𝐵 2 ∪ 𝑅𝐵 2 on 𝑝 if
and stratified program, we also hold this presumption in this paper. • 𝐹 𝐵 1 ∪ 𝑅𝐵 1 ⊢ 𝑝 and 𝐹 𝐵 2 ∪ 𝑅𝐵 2 ⊢ 𝑝 or
A definition of non-recursive and stratified program is adopted • 𝐹 𝐵 1 ∪ 𝑅𝐵 1 ⊬ 𝑝 and 𝐹 𝐵 2 ∪ 𝑅𝐵 2 ⊬ 𝑝
from [3] defined as follows. Otherwise, we say 𝐹 𝐵 1 ∪ 𝑅𝐵 1 disagrees with 𝐹 𝐵 2 ∪ 𝑅𝐵 2 on 𝑝.
Definition 3.2 (A non-recursive and stratified program). A pro- Formalization of the procedure goes as follows. Firstly, we have a
gram 𝑇 is non-recursive and stratified if there is a partition 𝑇 = proposition 𝑝 representing a counterintuitive consequence and 𝐹 𝐵𝑒
𝑇0 ∪ 𝑇1 ∪ . . . ∪ 𝑇𝑛 (𝑇𝑖 and 𝑇 𝑗 disjoint for all 𝑖 ≠ 𝑗) such that, if a representing an exceptional case, which contains at least one fact
proposition 𝑝 occurs in a body of rule in 𝑇𝑖 then a rule with 𝑝 in proposition not occurring in an original rule-base (hereafter, we
the head is only contained within 𝑇0 ∪ 𝑇1 ∪ . . . 𝑇 𝑗 where 𝑗 < 𝑖. also refer to 𝐹 𝐵𝑒 as an exceptional case). Then, we revise an original
In civil litigation, a judge would make correspondence between rule-base 𝑅𝐵 to a new rule-base 𝑅𝐵 ′ , Then 𝑅𝐵 ′ is a correct revision
factual situations in a case and factual concepts in statutes. Then, the of 𝑅𝐵 ′ with respect to 𝐹 𝐵𝑒 and 𝑝 if it can resolve the counterintuitive
judge would conclude a legal decision based on related statutes. To consequence 𝑝. We call such a task a counterintuitive consequence
reflect this civil litigation, we determine a proposition occurring in resolution task (CCR task) formally defined as follows.
a head of a rule as a rule proposition and a proposition not occurring Definition 3.5 (Counterintuitive Consequence Resolution (CCR)
in a head of a rule as a fact proposition. By this determination, we Task). A counterintuitive consequence resolution (CCR) task is a
denote a set of all fact propositions by F called a fact-domain and tuple ⟨𝑅𝐵, 𝐹 𝐵𝑒 , 𝑝⟩ where 𝑅𝐵 is a rule-base representing statutes,
we denote all fact propositions occurring in a program 𝑇 by 𝑓 (𝑇 ) 𝐹 𝐵𝑒 is a fact-base representing an exceptional case (𝑓 (𝐹 𝐵𝑒 ) ⊈
hence, 𝑓 (𝑇 ) ⊆ F . We call a program 𝑅𝐵 a rule-base if 𝑅𝐵 has no 𝑓 (𝑅𝐵)), and 𝑝 is a considered counterintuitive consequence. A rule-
propositions in F occurring in a head of a rule, and all propositions base 𝑅𝐵 ′ is a resolution to the CCR task ⟨𝑅𝐵, 𝐹 𝐵𝑒 , 𝑝⟩ if 𝐹 𝐵𝑒 ∪ 𝑅𝐵 ′
in 𝑅𝐵 that do not occurring in a head of a rule, are in F . disagrees with 𝐹 𝐵𝑒 ∪ 𝑅𝐵 on 𝑝 (hence, it implies that the considered
As a judge makes correspondence between factual situations in counterintuitive consequence is resolved).
a case and factual concepts, those factual concepts are represented
by fact propositions. Hence, we call a set of facts (rules with empty Example 3.6. Let a fact-domain F = {𝑎, 𝑏, 𝑐}, and a rule-base
bodies) constructed from a subset of F a fact-base. A fact-base then 𝑅𝐵 1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}. Suppose 𝑝 is a counterintuitive
represents a case. Then, the semantics is the set of propositions consequence from applying 𝑅𝐵 1 in an exception case represented by
(including rule propositions and fact propositions) which can be 𝐹 𝐵 1 = {𝑎. 𝑐.}. Then, 𝑅𝐵 2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑐.}
concluded when compiling a fact-base 𝐹 𝐵 with a rule-base 𝑅𝐵 and 𝑅𝐵 3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} are both
(denoted by 𝐹 𝐵 ∪ 𝑅𝐵). In this paper, we apply the stable model resolutions to the CCR task ⟨𝑅𝐵, 𝐹 𝐵 1, 𝑝⟩ since 𝐹 𝐵 1 ∪ 𝑅𝐵 1 disagrees
semantics [20] defined as follows. with 𝐹 𝐵 1 ∪ 𝑅𝐵 2 and 𝐹 𝐵 1 ∪ 𝑅𝐵 3 on 𝑝 which means 𝑝 is resolved in
both revisions.
Definition 3.3 (Stable Model Semantics). Let 𝑇 be a normal logic
program and 𝑀 be a set of propositions. Let 𝑡𝑟𝑖𝑚(𝑇 ) be a trimming From the previous example, we can see that there are possibly
function defined as follows: {ℎ𝑒𝑎𝑑 (𝑅) ← 𝑝𝑜𝑠 (𝑅)|𝑅 ∈ 𝑇 } and 𝑇 𝑀 = more than one rules that we can put an exception (𝑛𝑜𝑡 𝑟 ) so that
𝑡𝑟𝑖𝑚({𝑅|𝑅 ∈ 𝑇 and 𝑛𝑒𝑔(𝑅) ∩𝑀 = ∅}). 𝑀 is a stable model semantics the counterintuitive consequence 𝑝 is resolved. To let a user specify
of𝑇 if and only if 𝑀 is the minimum set (in the sense of set inclusion) which rule should an exception be put in, one can consider using
such that 𝑀 satisfies every rule 𝑅 ′ ∈ 𝑇 𝑀 , that is 𝑝𝑜𝑠 (𝑅 ′ ) ⊆ 𝑀 Legal Debugging [17], which extends from a common algorithmic
implies ℎ𝑒𝑎𝑑 (𝑅 ′ ) ∈ 𝑀. debugging [40] for legal reasoning. Legal Debugging considers a
user as an oracle query of an unknown set of intended interpretation
The semantics of 𝐹 𝐵 ∪ 𝑅𝐵 represents the literal interpretation and the counterintuitive consequence is the symmetric difference
of statute (represented by 𝑅𝐵) when applying in a particular case of the literal interpretation and the intended interpretation. Legal
(represented by 𝐹 𝐵). We denote a proposition 𝑝 is in the answer set debugging iterates to ask a user whether related consequences are
of 𝐹 𝐵 ∪ 𝑅𝐵 by 𝐹 𝐵 ∪ 𝑅𝐵 ⊢ 𝑝. Since we presume a non-recursive and counterintuitive until it can no longer find any counterintuitive
stratified program, 𝐹 𝐵 ∪ 𝑅𝐵 has a unique semantics [20]. This also consequences related. The last counterintuitive consequence found
reflects a constraint that judges need one unique judgement from is called a culprit, which is defined as follows.
legal rules.
When a judge applies the literal interpretation of statutes in Definition 3.7 (Culprit). A proposition 𝑝 is a culprit with respect
a particular case and it leads to counterintuitive consequences, to an intended interpretation 𝐼𝑀 and a program 𝑇 if
the judges may revise interpretation of statutes. We call such a • 𝑝 ∉ 𝐼𝑀 but there is a rule 𝑅 ∈ 𝑇 (called a supporting rule of 𝑝)
case an exceptional case. To distinguish the present case as an such that 𝑝𝑜𝑠 (𝑅) ⊆ 𝐼𝑀, 𝑛𝑒𝑔(𝑅) ∩ 𝐼𝑀 = ∅, and ℎ𝑒𝑎𝑑 (𝑅) = 𝑝
exceptional case, judges would introduce a new factual concept in (we call such 𝑝 an incorrect culprit) or

52
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh

• 𝑝 ∈ 𝐼𝑀 but there is no such supporting rule of 𝑝 in 𝑇 (we as follows: {⟨𝐹, 𝑀⟩|𝐹 ⊆ F and 𝑀 is the stable model semantics of
call such 𝑝 an incomplete culprit) 𝐹 𝐵 ∪ 𝑅𝐵 where 𝐹 𝐵 is a fact-base such that 𝑓 (𝐹 𝐵) = 𝐹 }. We denote
this set as 𝑠𝑒𝑚(𝑅𝐵).
We follow the culprit resolution from [19]. To resolve an incor-
rect culprit, we put exceptions in all supporting rules of the culprit We illustrate with the same setting in Example 3.6 throughout
in the same manner of Closed World Specification [6]. To resolve this section. Let a fact-domain, F be {𝑎, 𝑏, 𝑐}, and a rule-base 𝑅𝐵 1 =
an incomplete culprit, we just put a new rule with the culprit as a {𝑝 ← 𝑞. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}. Then 𝑠𝑒𝑚(𝑅𝐵 1 ) =
head. Following this instruction as illustrated in Algorithm 1, we {⟨∅, ∅⟩, ⟨{𝑎}, {𝑎, 𝑝, 𝑞}⟩, ⟨{𝑏}, {𝑏, 𝑝, 𝑞}⟩, ⟨{𝑐}, {𝑐}⟩,
can reduce a CCR task to a task finding appropriate conditions to ⟨{𝑎, 𝑏}, {𝑎, 𝑏, 𝑝, 𝑞}⟩, ⟨{𝑎, 𝑐}, {𝑎, 𝑐, 𝑝, 𝑞}⟩, ⟨{𝑏, 𝑐}, {𝑏, 𝑐, 𝑝, 𝑞}⟩,
a body of a new rule. Hence, we define a culprit resolution to CCR ⟨{𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑐, 𝑝, 𝑞}⟩}
task as follows.
Then, we define the difference of semantics as follows.
Definition 3.8 (Culprit Resolution to CCR task). Given a CCR task
Definition 4.2 (Difference of Semantics). Let F be a fact-domain
f 𝐻 obtained by Algorithm 1 where 𝑅𝐵
⟨𝑅𝐵, 𝐹 𝐵𝑒 , 𝑝⟩, 𝑅𝐵, f is a prelim-
and 𝑅𝐵 1, 𝑅𝐵 2 be two rule-bases. The difference of semantics be-
inary revision, and 𝐻 = {ℎ 1, . . . , ℎ𝑛 } as a set of rule propositions tween 𝑅𝐵 1 and 𝑅𝐵 2 denoted by 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) is a set defined as
that would become heads of new rules, and a set of Horn clauses follows: {⟨𝐹, 𝐷⟩|𝐹 ⊆ F and 𝐷 is a symmetric difference between
𝑅𝐵𝑛 = {ℎ 1 ← 𝐵 1 ., . . . , ℎ𝑛 ← 𝐵𝑛 .} where 𝐵𝑖 ⊆ 𝑓 (𝐹 𝐵𝑒 ) for all the stable model semantics of 𝐹 𝐵 ∪ 𝑅𝐵 1 and 𝐹 𝐵 ∪ 𝑅𝐵 2 where 𝐹 𝐵 is
1 ≤ 𝑖 ≤ 𝑛. A rule-base 𝑅𝐵 ′ = 𝑅𝐵 f ∪ {ℎ 1 ← 𝐵 1 ., . . . , ℎ𝑛 ← 𝐵𝑛 .} is
a fact-base such that 𝑓 (𝐹 𝐵) = 𝐹 }.
called a culprit resolution to the CCR task ⟨𝑅𝐵, 𝐹 𝐵𝑒 , 𝑝⟩.
For example, let 𝑅𝐵 2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑐.}.
We get that, 𝑠𝑒𝑚(𝑅𝐵 2 ) =
Algorithm 1 Preparation Phrase of Culprit Resolution
{⟨∅, ∅⟩, ⟨{𝑎}, {𝑎, 𝑝, 𝑞}⟩, ⟨{𝑏}, {𝑏, 𝑝, 𝑞}⟩, ⟨{𝑐}, {𝑐, 𝑟 }⟩,
Given A CCR task ⟨𝑅𝐵, 𝐹 𝐵𝑒 , 𝑝⟩ ⟨{𝑎, 𝑏}, {𝑎, 𝑏, 𝑝, 𝑞}⟩, ⟨{𝑎, 𝑐}, {𝑎, 𝑐, 𝑟 }⟩, ⟨{𝑏, 𝑐}, {𝑏, 𝑐, 𝑝, 𝑞, 𝑟 }⟩,
f = 𝑅𝐵 and 𝐻 = ∅
Let 𝑅𝐵 ⟨{𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑐, 𝑝, 𝑞, 𝑟 }⟩}
for all culprits 𝑝𝑐 detected from 𝑝 do and 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) =
if 𝑝𝑐 is an incorrect culprit then {⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, {𝑟 }⟩,
for all supporting rule 𝑅 of 𝑝𝑐 do ⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝, 𝑞, 𝑟 }⟩, ⟨{𝑏, 𝑐}, {𝑟 }⟩, ⟨{𝑎, 𝑏, 𝑐}, {𝑟 }⟩}
Let 𝑝𝑒 be a new rule proposition
Let 𝑅𝐵 3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}. We get that,
Add 𝑝𝑒 to 𝐻
f
𝑠𝑒𝑚(𝑅𝐵 3 ) =
Add 𝑛𝑜𝑡 𝑝𝑒 to the body of 𝑅 in 𝑅𝐵
{⟨∅, ∅⟩, ⟨{𝑎}, {𝑎, 𝑝, 𝑞}⟩, ⟨{𝑏}, {𝑏, 𝑝, 𝑞}⟩, ⟨{𝑐}, {𝑐, 𝑟 }⟩,
end for
⟨{𝑎, 𝑏}, {𝑎, 𝑏, 𝑝, 𝑞}⟩, ⟨{𝑎, 𝑐}, {𝑎, 𝑐, 𝑞, 𝑟 }⟩, ⟨{𝑏, 𝑐}, {𝑏, 𝑐, 𝑞, 𝑟 }⟩,
else ⊲ when 𝑝𝑐 is an incomplete culprit
⟨{𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑐, 𝑞, 𝑟 }⟩}
Add 𝑝𝑐 to 𝐻
end if and 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) =
end for {⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, {𝑟 }⟩,
⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝, 𝑟 }⟩, ⟨{𝑏, 𝑐}, {𝑝, 𝑟 }⟩, ⟨{𝑎, 𝑏, 𝑐}, {𝑝, 𝑟 }⟩}
Due to the property of culprits and Closed World Specification, The difference of semantics reflects changes on a consequence
we get that a culprit resolution to a CCR task can resolve the con- (both adding and removing) of two rule-bases. Now, we can define
sidered counterintuitive consequence in the CCR task. Hence, a a minimal revision of this framework as follows.
culprit resolution is also a kind of resolution. Definition 4.3 (Semantics-based Minimal Revision). Let 𝑅𝐵 1 , 𝑅𝐵 2 ,
Example 3.9. Continuing from Example 3.6, if a culprit is 𝑝 then 𝑅𝐵 3 be three rule-bases. We say that 𝑅𝐵 2 has a smaller change than
g1 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ←
we put 𝑛𝑜𝑡 𝑟 in the first rule then 𝑅𝐵 𝑅𝐵 3 from 𝑅𝐵 1 denoted as 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≤ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) if for
𝑏.} and 𝐻 = {𝑟 }. If a culprit is 𝑞 then we put 𝑛𝑜𝑡 𝑟 in the second every ⟨𝐹, 𝐷 2 ⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) and ⟨𝐹, 𝐷 3 ⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ),
g1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏.} and 𝐻 = {𝑟 }.
rule then 𝑅𝐵 then 𝐷 2 ⊆ 𝐷 3 . We define 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) < 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) if
We get that 𝑅𝐵 2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑐.} and 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≤ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) but 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) ≰ 𝐷𝐼 𝐹 𝐹
𝑅𝐵 3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} in Example 3.6 (𝑅𝐵 1, 𝑅𝐵 2 ). We call 𝑅𝐵 2 a semantics-based minimal revision of 𝑅𝐵 1
are also culprit resolutions to the CCR task ⟨𝑅𝐵, 𝐹 𝐵 1, 𝑝⟩ since both if 𝑅𝐵 2 is a revision of 𝑅𝐵 1 and there is no revision 𝑅𝐵 ′ of 𝑅𝐵 1 such
rule-bases include a new Horn clause 𝑟 ← 𝑐. and {𝑐} ⊆ 𝐹 𝐵 1 . that 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 ′ ) < 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ).
However, we consider Definition 4.3 is too strong, since it is hard
4 SEMANTICS-BASED MINIMAL REVISION for comparing between two revisions from different schemes. For
Since legal reasoning is a hybrid between reasoning by rules and example, 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) and 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) are incomparable
reasoning by cases [10, 43], a revision should consider semantics of since
a rule-base, which vary among cases. Hence, we define semantics ⟨{𝑎, 𝑐}, {𝑝, 𝑞, 𝑟 }⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 )
of a rule-base as follows. and ⟨{𝑎, 𝑐}, {𝑝, 𝑟 }⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 )
Definition 4.1 (Semantics of a rule-base). Let F be a fact-domain hence 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≰ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 )
and 𝑅𝐵 be a rule-base, The semantics of 𝑅𝐵 is a set of pairs defined and

53
On Semantics-based Minimal Revision for Legal Reasoning ICAIL’21, June 21–25, 2021, São Paulo, Brazil

⟨{𝑎, 𝑏, 𝑐}, {𝑝, 𝑟 }⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) semantics changes for {𝑎, 𝑐}. Therefore, to describe that 𝑅𝐵 2 is a
and ⟨{𝑎, 𝑏, 𝑐}, {𝑟 }⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) suitable semantics-based minimal revision, we say each fact-base
hence 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) ≰ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ). has its corresponding specific rule-base and dominant rule-base
Therefore, we relax Definition 4.3 by considering partial seman- with respect to the considered consequence. We adopt the definition
tics defined as follows. of a specific rule-base from [30] and define a dominant rule-base as
a trimmed version of a specific rule-base as follows.
Definition 4.4 (Partial Semantics-based Minimal Revision). Let
𝑅𝐵 1 , 𝑅𝐵 2 , 𝑅𝐵 3 be three rule-bases and 𝑆 be a set of propositions. We Definition 4.6 (Specific Rule-base and Dominant Rule-base). Let
say that 𝑅𝐵 2 has a smaller change than 𝑅𝐵 3 from 𝑅𝐵 1 with respect 𝑅𝐵 be a rule-base, 𝐹 𝐵 be a fact-base, and 𝑝 be a proposition. We say
to 𝑆 denoted as 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≤𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) if for every a rule-base 𝑆𝑅 ⊆ 𝑅𝐵 is specific to 𝐹 𝐵 with respect to 𝑅𝐵 and 𝑝 if 𝑆𝑅
⟨𝐹, 𝐷 2 ⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) and ⟨𝐹, 𝐷 3 ⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ), then is a minimal set of rules (in the sense of set inclusion) such that
𝐷 2 ∩ 𝑆 ⊆ 𝐷 3 ∩ 𝑆. We define 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) <𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) • 𝐹 𝐵 ∪ 𝑆𝑅 agrees with 𝐹 𝐵 ∪ 𝑅𝐵 on 𝑝, and
if 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≤𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) but 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) ≰𝑆 • no rule-base 𝑆𝑅 ′ such that 𝑆𝑅 ⊊ 𝑆𝑅 ′ ⊊ 𝑅𝐵 and 𝐹 𝐵 ∪ 𝑆𝑅 ′
𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ). We call 𝑅𝐵 2 a partial semantics-based minimal disagrees with 𝐹 𝐵 ∪ 𝑆𝑅 on 𝑝.
revision of 𝑅𝐵 1 with respect to 𝑆 if 𝑅𝐵 2 is a revision of 𝑅𝐵 1 and there We call 𝐷𝑅 = 𝑡𝑟𝑖𝑚(𝑆𝑅) a dominant rule-base of 𝐹 𝐵 with respect to
is no revision 𝑅𝐵 ′ of 𝑅𝐵 1 such that 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 ′ ) <𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 and 𝑝 (𝑡𝑟𝑖𝑚 is defined in Definition 3.3).
𝑅𝐵 2 ).
Now, we define dominant-based semantics as follows.
From the previous example, if 𝑆 = {𝑝} or 𝑆 = {𝑝, 𝑟 }, we get that
Definition 4.7 (Dominant-based Semantics). Let F be a fact-domain,
𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≤𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) but
𝑅𝐵 be a rule-base, and 𝑝 be a proposition. The dominant-based se-
𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) ≰𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 )
mantics of 𝑅𝐵 with respect to 𝑝 is a set of pairs defined as follows:
hence 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) <𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ).
{⟨𝐹 𝐵, 𝐴𝐷⟩|𝐹 𝐵 is a fact-base constructed from a subset of F and 𝐴𝐷
However, if 𝑆 = {𝑞}, we get that is a set of all dominant rule-bases of 𝐹 𝐵 with respect to 𝑅𝐵 and 𝑝}.
𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) ≤𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) but We denote this set as 𝑑𝑜𝑚𝑝 (𝑅𝐵).
𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≰𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 )
hence 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) <𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ). Then, we define the difference of dominant-based semantics as
follows.
We define a partial difference of semantics as follows.
Definition 4.8 (Difference of Dominant-based Semantics). Let F be
Definition 4.5 (Partial Difference of Semantics). Let F be a fact-
a fact-domain, 𝑅𝐵 1, 𝑅𝐵 2 be two rule-bases, and 𝑝 be a proposition.
domain, 𝑅𝐵 1, 𝑅𝐵 2 be two rule-bases, and 𝑆 be a set of propositions. A
The difference of dominant-based semantics between 𝑅𝐵 1 and 𝑅𝐵 2
partial difference of semantics between 𝑅𝐵 1 and 𝑅𝐵 2 with respect to
with respect to 𝑝 denoted by 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ) is a set defined
𝑆 denoted by 𝐷𝐼 𝐹 𝐹𝑆 (𝑅𝐵 1, 𝑅𝐵 2 ) is a set defined as follows: {⟨𝐹, 𝐷 ∩
as follows: {⟨𝐴𝐷 1, 𝐴𝐷 2 ⟩|⟨𝐹 𝐵, 𝐴𝐷 1 ⟩ ∈ 𝑑𝑜𝑚𝑝 (𝑅𝐵 1 ) and
𝑆⟩|⟨𝐵, 𝐷⟩ ∈ 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 )}.
⟨𝐹 𝐵, 𝐴𝐷 2 ⟩ ∈ 𝑑𝑜𝑚𝑝 (𝑅𝐵 2 ) and 𝐴𝐷 1 ≠ 𝐴𝐷 2 }.
By this way, given any three rule-bases 𝑅𝐵 1 , 𝑅𝐵 2 , 𝑅𝐵 3 ,
The difference in this definition represents patterns of fact-bases
𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 2 ) ≤𝑆 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 3 ) ≡ that move their dominant rule-bases after revising a program, re-
𝐷𝐼 𝐹 𝐹𝑆 (𝑅𝐵 1, 𝑅𝐵 2 ) ≤ 𝐷𝐼 𝐹 𝐹𝑆 (𝑅𝐵 1, 𝑅𝐵 3 ). gardless how many fact-bases move such patterns. From Table 1
Continuing from the previous example, 𝐷𝐼 𝐹 𝐹 {𝑝 } (𝑅𝐵 1, 𝑅𝐵 2 ) = continuing from the previous example, 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ) =
{⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, ∅⟩, {⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑐.}}⟩,
⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝}⟩, ⟨{𝑏, 𝑐}, ∅⟩, ⟨{𝑎, 𝑏, 𝑐}, ∅⟩} ⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑝 ← 𝑞. 𝑞 ← 𝑏.}}⟩}
When a fact domain is {𝑎, 𝑏, 𝑐}, we get that 𝑅𝐵 2 is a partial From Table 2, 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 3 ) =
semantics-based minimal revision with respect to {𝑝}. This is be- {⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑐.}}⟩,
cause to find a rule-base 𝑅𝐵 ′ such that 𝐷𝐼 𝐹 𝐹 (𝑅𝐵 1, 𝑅𝐵 ′ ) <𝑆 𝐷𝐼 𝐹 𝐹 ⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑟 ← 𝑐.}}⟩,
(𝑅𝐵 1, 𝑅𝐵 2 ), 𝐷𝐼 𝐹 𝐹 {𝑝 } (𝑅𝐵 1, 𝑅𝐵 ′ ) must be {⟨𝐹, ∅⟩|𝐹 ⊆ F }, then 𝑅𝐵 ′ ⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑟 ← 𝑐.}}⟩}
is not a resolution to the CCR task since the exceptional case is
The difference of dominant-based semantics does not change
𝐹 𝐵 1 = {𝑎. 𝑐.} but 𝐹 𝐵 1 ∪ 𝑅𝐵 agrees with 𝐹 𝐵 1 ∪ 𝑅𝐵 ′ on 𝑝 (the coun-
when we extend the fact domain. Due to the ordering of dominants,
terintuitive consequence is not resolved). However, if we extend a
we get the following lemma.
fact domain to F = {𝑎, 𝑏, 𝑐, 𝑑 }, 𝐷𝐼 𝐹 𝐹 {𝑝 } (𝑅𝐵 1, 𝑅𝐵 2 ) would become:
{⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, ∅⟩, ⟨{𝑑 }, ∅⟩, Lemma 4.9. Let 𝑅𝐵 1 , 𝑅𝐵 2 be two rule-bases, 𝑝 be a proposition
⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝}⟩, ⟨{𝑎, 𝑑 }, ∅⟩, ⟨{𝑏, 𝑐}, ∅⟩, and ⟨𝐹 𝐵 1, 𝐴𝐷 1 ⟩, ⟨𝐹 𝐵 3, 𝐴𝐷 3 ⟩ ∈ 𝑑𝑜𝑚𝑝 (𝑅𝐵 1 ) such that 𝐹 𝐵 1 ⊆ 𝐹 𝐵 3
. and 𝐴𝐷 1 ⊆ 𝐴𝐷 3 . If ⟨𝐴𝐷 1, 𝐴𝐷 2 ⟩ ∈ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ) then
⟨{𝑏, 𝑑 }, ∅⟩, ⟨{𝑐, 𝑑 }, ∅⟩, ⟨{𝑎, 𝑏, 𝑐}, ∅⟩, ⟨{𝑎, 𝑏, 𝑑 }, ∅⟩,
⟨{𝑎, 𝑐, 𝑑 }, {𝑝}⟩, ⟨{𝑏, 𝑐, 𝑑 }, ∅⟩, ⟨{𝑎, 𝑏, 𝑐, 𝑑 }, ∅⟩} ⟨𝐴𝐷 3, 𝐴𝐷 4 ⟩ ∈ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ).
Thus, 𝑅𝐵 2 is not a partial semantics-based minimal revision of This lemma intuitively means if a revision affects a set 𝐴𝐷 1 of
𝑅𝐵 1 with respect to {𝑝} when a fact-domain is {𝑎, 𝑏, 𝑐, 𝑑 } because all dominant rule-bases of 𝐹 𝐵 1 , the revision also affects every set
there exists a revision 𝑅𝐵 ′ such that 𝐷𝐼 𝐹 𝐹 {𝑝 } (𝑅𝐵 1, 𝑅𝐵 ′ ) = {⟨𝐹, 𝐷⟩| 𝐴𝐷 3 of all dominant rule-bases of 𝐹 𝐵 3 such that 𝐹 𝐵 3 is a super
𝐹 ⊆ F if 𝐹 = {𝑎, 𝑐} then 𝐷 = {𝑝} otherwise 𝐷 = ∅}. However, set of 𝐹 𝐵 1 and 𝐴𝐷 3 is a super set of 𝐴𝐷 1 . Now, we can define the
we consider a revision 𝑅𝐵 ′ is too specific since it requires only dominant-based minimal revision as follows.

54
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh

Table 1: Difference of dominant-based semantics between 𝑅𝐵 1 and 𝑅𝐵 2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑐.} in the example

Possible fact-base 𝐹 𝐵 Dominant rule-base(s) of 𝐹 𝐵 (before revision) Dominant rule-base(s) of 𝐹 𝐵 (after revision)
∅ ∅ ∅
{𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}
{𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑐.} ∅ ∅
{𝑎. 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑎. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {r ← c.}
{𝑏. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑎. 𝑏. 𝑐} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}

Table 2: Difference of dominant-based semantics between 𝑅𝐵 1 and 𝑅𝐵 3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} in the example

Possible fact-base 𝐹 𝐵 Dominant rule-base(s) of 𝐹 𝐵 (before revision) Dominant rule-base(s) of 𝐹 𝐵 (after revision)
∅ ∅ ∅
{𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}
{𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑐.} ∅ ∅
{𝑎. 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑎. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {r ← c.}
{𝑏. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {r ← c.}
{𝑎. 𝑏. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {r ← c.}

Definition 4.10 (Dominant-based Minimal Revision). Let 𝑅𝐵 1 , 𝑅𝐵 2 , counterintuitive consequence ? The answer to this question is yes.
𝑅𝐵 3 be three rule-bases and 𝑝 be a proposition. We say that 𝑅𝐵 2 One way is to follow a culprit resolution in Algorithm 1 and in-
has a smaller dominant change than 𝑅𝐵 3 from 𝑅𝐵 1 with respect troduce all fact propositions occurring in an exceptional case as
to 𝑝 denoted as 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ) ⪯ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 3 ) a body of each new rule so that all new rules are applicable only
when this following condition is satisfied. to the fact-bases that cover all facts in the considered exceptional
If ⟨𝐴𝐷 1, 𝐴𝐷 2 ⟩ ∈ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ), case.
then ⟨𝐴𝐷 1, 𝐴𝐷 3 ⟩ ∈ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 3 ). From the example in previous sections, let 𝑅𝐵 1 = {𝑝 ← 𝑞. 𝑞 ←
We define 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ) ≺ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 3 ) if 𝑎. 𝑞 ← 𝑏.}. If a culprit is 𝑝 and we get a preliminary revision
g1 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏.} and 𝐻 = {𝑟 }. If we add all
𝑅𝐵
𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ) ⪯ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 3 ) but
fact propositions in 𝐹 𝐵 1 = {𝑎. 𝑐.} to a body of a new rule, we get a
𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 3 ) ⪯̸ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ).
culprit resolution 𝑅𝐵 4 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑎, 𝑐.},
We call 𝑅𝐵 2 a dominant-based minimal revision of 𝑅𝐵 1 with respect which is a dominant-based minimal revision of 𝑅𝐵 1 with respect to
to 𝑝 if 𝑅𝐵 2 is a revision of 𝑅𝐵 1 and there is no revision 𝑅𝐵 ′ of 𝑅𝐵 1 𝑝 since 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 4 ) =
such that 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 ′ ) ≺ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ).
{⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑎, 𝑐.}}⟩,
From the previous example, 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ) ≺ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 ⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑟 ← 𝑎, 𝑐.}}⟩}
(𝑅𝐵 1, 𝑅𝐵 3 ) so 𝑅𝐵 3 is not the dominant-based minimal revision of and there is no revision 𝑅𝐵 ′ such that 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 ′ ) ≺
𝑅𝐵 1 respect to 𝑝. We also get that 𝑅𝐵 2 is the dominant-based 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 4 ) in the same manner of 𝑅𝐵 2 . This also ap-
minimal revision respect to 𝑝. This because to find a revision plies to a scenario when a culprit is 𝑞, 𝑅𝐵 5 = {𝑝 ← 𝑞. 𝑞 ←
𝑅𝐵 ′ such that 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 ′ ) ≺ 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 2 ), 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑎, 𝑐.} is also a dominant-based minimal revi-
⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, 𝐴𝐷 ′ ⟩ must be not in sion of 𝑅𝐵 1 since 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 5 ) =
𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 ′ ). However, since a revision 𝑅𝐵 ′ must affect
{⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑎, 𝑐.}}⟩,
{𝑝 ← 𝑞. 𝑞 ← 𝑎.} as it is the dominant rule-base of {𝑎. 𝑐.} (the
⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑝 ← 𝑞. 𝑞 ← 𝑏.}}⟩}.
exceptional case), ⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, 𝐴𝐷 ′ ⟩
must be in 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝑅𝐵 1, 𝑅𝐵 ′ ) according to Lemma 4.9. Hence, In addition, we can optimize 𝑅𝐵 5 into 𝑅𝐵 2 = {𝑝 ← 𝑞. 𝑞 ←
such revision 𝑅𝐵 ′ does not exist and 𝑅𝐵 2 is the dominant-based 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑐.} which is also a dominant-based minimal
minimal revision of 𝑅𝐵 1 . revision. To optimize the revision like we do from 𝑅𝐵 5 to 𝑅𝐵 2 , we
need to consider specific rule-bases with respect to the original rule-
5 OBTAINING MINIMAL REVISION base and the considering consequence as the following theorem.
One question is that, given a CCR task, can we always obtain a Theorem 5.1. Given a fact-domain F , a CCR task ⟨𝑅𝐵, 𝐹 𝐵𝑒 , 𝑝⟩,
dominant-based minimal revision with respect to the considered f 𝐻 obtained by Algorithm 1 where 𝑅𝐵
𝑅𝐵, f is a preliminary revision, and

55
On Semantics-based Minimal Revision for Legal Reasoning ICAIL’21, June 21–25, 2021, São Paulo, Brazil

𝐻 = {ℎ 1, . . . , ℎ𝑛 } as a set of rule propositions that would become heads 𝑏. 𝑟 ← 𝑐.}. Although 𝑅𝐵 3 is the syntax-based minimal revision
of new rules, and a set of Horn clauses 𝑅𝐵𝑛 = {ℎ 1 ← 𝐵 1 ., . . . , ℎ𝑛 ← when we require to add a condition to a new rule, 𝑅𝐵 3 is not a
𝐵𝑛 .} where 𝐵𝑖 ⊆ 𝑓 (𝐹 𝐵𝑒 ) for all 1 ≤ 𝑖 ≤ 𝑛. If 𝐵𝑖 satisfies the following dominant-based minimal revision as we illustrated in the previous
condition for all 1 ≤ 𝑖 ≤ 𝑛: section.
For every fact-base 𝐹 𝐵 constructed from a subset of F Generally, we get that a syntax-based minimal revision is not
such that 𝐵 ⊆ 𝐹 𝐵, if there is a rule 𝑅 in a specific rule- always a dominant-based minimal revision especially when the
base of 𝐹 𝐵 with respect to 𝑅𝐵 and 𝑝, and ℎ𝑖 occurs in 𝑅 rule-base contains multiple rules for the same consequence. To
then 𝐹 𝐵𝑒 ⊆ 𝐹 𝐵, illustrate more on this constraint, we use an example case adopted
then, a culprit resolution 𝑅𝐵 ′ = 𝑅𝐵 f ∪ {ℎ 1 ← 𝐵 1 ., . . . , ℎ𝑛 ← 𝐵𝑛 .} is from a Supreme Court Case as in [4, 17, 38] relating to Japanese
a dominant-based minimal revision of 𝑅𝐵 with respect to 𝑝. Civil Code Article 612, which states
Phrase 1: A Lessee may not assign the lessee’s rights
Theorem 5.1 intuitively implies a method for finding dominant-
or sublease a leased thing without obtaining the ap-
based minimal revision. Let 𝐹 𝐵𝑒 be an exceptional case, the method
proval of the lessor.
is to find a minimal set 𝐵 ⊆ 𝑓 (𝐹 𝐵𝑒 ) which affects specific rule-
Phrase 2: If the lessee allows any third party to make
bases only of 𝐹 𝐵 ⊇ 𝐹 𝐵𝑒 . Thus, when 𝐵 = 𝑓 (𝐹 𝐵𝑒 ), the revision is
use of or take profits from a leased thing in violation of
a dominant-based minimal revision. However, if 𝐵 ⊊ 𝑓 (𝐹 𝐵𝑒 ), a
the provisions of the preceding paragraph, the lessor
revision affects specific rule-bases only of 𝐹 𝐵 ⊇ 𝐹 𝐵𝑒 if there is no
may cancel the contract.
rule 𝑅 in a specific rule-base of 𝐹 𝐵 such that 𝐹 𝐵 ⊇ 𝐵, 𝐹 𝐵 ⊉ 𝐹 𝐵𝑒
and 𝑝𝑒 occurs in 𝑅. For example, regarding 𝑅𝐵 2 = {𝑝 ← 𝑞. 𝑞 ← Japanese Civil Code can be represented with the rule-base as
𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑐.}, there is no 𝐹 𝐵 such that 𝐹 𝐵 ⊇ {𝑐.}, follows1 . Hereafter, this rule-base is denoted by 𝐽 𝑅𝐵 1 .
𝐹 𝐵 ⊉ {𝑎. 𝑐.}, and 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟 . is in the specific rule of 𝐹 𝐵. However, 1 cancellation_due_to_sublease :-
regarding 𝑅𝐵 3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}, there is 2 effective_lease_contract,
𝐹 𝐵 = {𝑏. 𝑐.} such that 𝐹 𝐵 ⊇ {𝑐.}, 𝐹 𝐵 ⊉ {𝑎. 𝑐.}, and 𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . is 3 effective_sublease_contract,
in the specific rule of 𝐹 𝐵. Therefore, 𝑅𝐵 5 can be optimized to 𝑅𝐵 2 4 using_leased_thing,
but 𝑅𝐵 4 cannot be optimized to 𝑅𝐵 3 . 5 manifestation_cancellation,
6 not approval_of_sublease.
6 COMPARISON AND EXAMPLE 7 effective_lease_contract :-
In this section, we compare the dominant-based minimal revision 8 agreement_of_lease_contract,
with the syntax-based minimal revision in Theory Distance Metric 9 handover_to_lessee.
[47], which is one common minimal revision used for describing 10 effective_sublease_contract :-
minimal revision in legislation (e.g. [12, 28]). The definition is for- 11 agreement_of_sublease_contract,
mally described in our context as follows. 12 handover_to_sublessee.
13 effective_sublease_contract :-
Definition 6.1 (Syntax-based Minimal Revision). Let 𝑅𝐵 and 𝑅𝐵 ′
14 implicit_sublease.
be rule-bases. A revision transformation 𝑟 is such that 𝑟 (𝑅𝐵) =
15 approval_of_sublease :-
𝑅𝐵 ′ , and 𝑅𝐵 ′ is obtained from 𝑅𝐵 by program edit operations as
16 approval_before_the_day.
follows: deleting a rule, creating a rule with an empty body, adding
a condition to a rule in 𝑅𝐵 or deleting a condition from a rule in 𝑅𝐵. This representation illustrates that to prove the contract was
𝑅𝐵 ′ is a revision of 𝑅𝐵 with distance 𝑐 (𝑅𝐵, 𝑅𝐵 ′ ) = 𝑛 if and only if ended due to sublease (represented as cancellation_due_to_
𝑅𝐵 ′ = 𝑟 𝑛 (𝑅𝐵) and there is no 𝑚 < 𝑛 such that 𝑅𝐵 ′ = 𝑟 𝑛 (𝑅𝐵) [47]. sublease), we must prove four requisites (lines 2-5)
Consequently, it is very simple to find a culprit resolution that is (1) the lease contract was effective
also a syntax-based minimal revision. Actually, a culprit resolution (represented as effective_lease_contract)
with an empty condition for each new rule is a syntax-based mini- (2) the sublease contract was effective
mal revision since it requires no additional program edit operation, (represented as effective_sublease_contract)
but that kind of culprit resolution can be considered as too general. (3) the third party was using the leased thing
If we require to add some condition to a new rule, we can just add (represented as using_leased_thing)
an extra fact proposition in the exceptional case as a condition of a (4) the plaintiff manifested the intention of cancellation of the
new rule (an extra fact proposition means a fact proposition that contract (represented as manifestation_cancellation)
does not occur in the rule-base but occurs in the exceptional case). And there is one exception, approval_of_sublease in line 5,
This requires only one program edit operation for each new rule so which is explicitly stated in the Civil Code. To prove the exception,
the culprit resolution is definitely the syntax-based minimal revi- we must prove that the approval before the cancellation (repre-
sion (under the constraint that we require to add some condition sented as approval_before_the_day).
to each new rule). To prove that the lease contract was effective (effective_lease
From Example 3.9, we have 𝑅𝐵 1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}, _contract), we must prove two requisites (lines 8-9).
if a culprit is 𝑝 and we get a preliminary revision 𝑅𝐵 g1 = {𝑝 ←
𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏.}. If we add an extra fact proposition 𝑐 to 1 This representation is adopted for ease of exposition. The implicit sublease contract
a body of a new rule, we get 𝑅𝐵 3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← is a fictitious condition to illustrate multiple conditions. We use :- instead of ←.

56
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh

(1) the lease contract was established 16 approval_of_sublease :-


(represented as agreement_of_lease_contract) 17 approval_before_the_day.
(2) the leased thing was handed over to the lessee (represented 18 new_exception :-
as handover_to_lessee) 19 agreement_of_sublease_contract,
There are two ways to prove that sublease contract was effec- 20 handover_to_sublessee,
tive (effective_sublease_contract), one way is to prove two 21 non_destruction_of_confidence.
requisites (lines 11-12). On the other hand, the syntax-based minimal revision is 𝐽 𝑅𝐵 3 =
(1) the sublease contract was established 1 cancellation_due_to_sublease :-
(represented as agreement_of_sublease_contract) 2 effective_lease_contract,
(2) the leased thing was handed over to the sublessee (repre- 3 effective_sublease_contract,
sented as handover_to_sublessee) 4 using_leased_thing,
Another way is to show this sublease is implicit (represented as 5 manifestation_cancellation,
implicit_sublease in line 14). 6 not approval_of_sublease,
When an exceptional case related to Japanese Civil Code Article 7 not new_exception.
612 went to Japanese Supreme Court, the judges have once revised 8 effective_lease_contract :-
this article as stated [1, 4]. 9 agreement_of_lease_contract,
[Japanese Civil Code Article 612] Phrase 2 is not ap- 10 handover_to_lessee.
plicable in exceptional situations where the sublease 11 effective_sublease_contract :-
does not harm the confidence between a lessee and a 12 agreement_of_sublease_contract,
lessor, and therefore the lessor cannot cancel the con- 13 handover_to_sublessee.
tract unless the lessor proves the lessee’s destruction 14 effective_sublease_contract :-
of confidence. 15 implicit_sublease.
16 approval_of_sublease :-
In this court decision, the judge introduced a factual concept of
17 approval_before_the_day.
non-destruction of confidence (represented as non_destruction_
18 new_exception :-
of_confidence) as an exception of Phrase 2 to prevent the coun-
19 non_destruction_of_confidence.
terintuitive consequence from the literal interpretation of Japanese
Civil Code Article 612. The case that goes to the Supreme Court From this example, we get that 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝐽 𝑅𝐵 1, 𝐽 𝑅𝐵 2 ) ≺
can be represented as the following set of fact-base 𝐽 𝐹 𝐵 1 = 𝐷𝑂𝑀𝐷𝐼 𝐹 𝐹𝑝 (𝐽 𝑅𝐵 1, 𝐽 𝑅𝐵 3 ) where 𝑝 is cancellation_due_to_
sublease. We get that the syntax-based minimal revision affects
{agreement_of_lease_contract.
the specific rule-base of 𝐽 𝐹 𝐵 2 =
handover_to_lessee.
agreement_of_sublease_contract. {agreement_of_lease_contract,
handover_to_sublessee. handover_to_lessee,
using_leased_thing. implicit_sublease,
manifestation_cancellation. using_leased_thing,
non_destruction_of_confidence.} manifestation_cancellation,
non_destruction_of_confidence}
Suppose we revise the rule-base 𝐽 𝑅𝐵 1 with Legal Debugging
and Closed World Specification as described in Section 3 and it In this way, we can consider such effects may be unintentionally
concludes that cancellation_due_to_sublease is a culprit. Then, caused by the syntax-based minimal revision and the system may
a dominant-based minimal revision according to our method in confirm the intention of such effects with a user.
Section 5 is 𝐽 𝑅𝐵 2 =
1 cancellation_due_to_sublease :- 7 DISCUSSION
2 effective_lease_contract, In this paper, we compare our minimal revision to the syntax-based
3 effective_sublease_contract, minimal revision in Theory Distance Metric [47]. In their study, it
4 using_leased_thing, has been discussed that two programs may be close to one another
5 manifestation_cancellation, syntactically but have entirely different semantics. Our study also
6 not approval_of_sublease, shows in line with this discussion. In this paper, we consider a
7 not new_exception. revision task as a task to find appropriate conditions to a body of a
8 effective_lease_contract :- new rule after applying Legal Debugging [17] and Closed World
9 agreement_of_lease_contract, Specification [6] as described in Section 3. Figure 1 illustrates a
10 handover_to_lessee. comparison between each type of revision discussed in this paper.
11 effective_sublease_contract :- Let 𝐵 be a set of conditions to a body of a new rule and 𝐹 𝐵𝑒 be an
12 agreement_of_sublease_contract, exceptional case. Regarding Theorem 5.1, if 𝐵 = 𝑓 (𝐹 𝐵𝑒 ), that is all
13 handover_to_sublessee. fact propositions in the exceptional case are conditions to a body of
14 effective_sublease_contract :- a new rule, then such a revision has the smallest dominant changes
15 implicit_sublease. but it requires the largest number of program edit operation (Point

57
On Semantics-based Minimal Revision for Legal Reasoning ICAIL’21, June 21–25, 2021, São Paulo, Brazil

minimal revision. These definitions were adopted for the context


of legal reasoning in normal logic programs and legal change by
judges when the literal interpretation of statutes leads to counter-
intuitive consequences. Since we consider such definitions are too
strong, we have defined a partial semantics-based minimal revision
and a dominant-based minimal revision. The dominant-based mini-
mal revision intuitively means a revision that affects the specific
rule-bases of the fact-bases only that are larger or equal to the con-
sidered exceptional case. Then, we present one guidance to obtain a
dominant-based minimal revision and also optimize the number of
edit operations. The comparison shows that a syntax-based minimal
revision may affect more specific rule-bases than a dominant-based
minimal revision hence a syntax-based minimal revision is not a
dominant-based minimal revision in general. Since this paper only
Figure 1: A comparison between each type of revision dis-
gives preliminary definitions for semantics-based and dominant-
cussed in this paper
based minimal revision, future research might investigate, for ex-
ample, an algorithm effectively finding a dominant-based minimal
revision, an extension of definitions for other types of revisions or
(a) in the figure). In Section 5, we discuss briefly that we can find a other legal representations, or the application for cases where the
dominant-based minimal revision with an optimization of program revision impacts several rules, which may lead to impact analysis
edit operations by finding a minimal set (in the sense of size) that in legal formalisms and may contribute to improving coherence of
satisfies the condition in Theorem 5.1 (Point (b) in the figure, may laws.
be the same as Point (a) for some rule-bases). If 𝐵 goes smaller
than that point, the revision will have larger dominant changes but
ACKNOWLEDGMENTS
requires less program edit operations so line (b)-(c) in the figure
shows a decrement function. Regarding the syntax-based minimal We would like to thank anonymous reviewers for their extensive
revision, if it requires a new condition to a new rule, we can obtain comments and suggestions. This work was supported by JSPS KAK-
the syntax-based minimal revision by making 𝐵 a singleton set ENHI Grant Numbers, JP17H06103 and JP19H05470 and JST, AIP
of extra fact proposition in 𝐹 𝐵𝑒 (Point (c) in the figure) since it Trilateral AI Research, Grant Number JPMJCR20G4, Japan
requires only one program edit operation. Otherwise, we can do no
revision after applying the Closed World Specification algorithm REFERENCES
then it requires zero operation but it definitely has larger dominant [1] Tokyo High Court 1994 (O) 693. 1996. Case to seek removal of a building and
surrender of lands. http://www.courts.go.jp/app/hanrei_en/detail?id=273
changes than introducing an extra fact proposition (Point (d) in the [2] Latifa Al-Abdulkarim, Katie Atkinson, and Trevor Bench-Capon. 2016. Accom-
figure). modating change. Artificial Intelligence and Law 24, 4 (2016), 409–427.
Our paper follows the standard revision theory since we attempt [3] Krzysztof R Apt, Howard A Blair, and Adrian Walker. 1988. Towards a the-
ory of declarative knowledge. In Foundations of deductive databases and logic
to adjust rule-based theories as little as possible. As discussed in programming. Morgan Kaufmann, Burlington, MA,USA, 89–148.
[35], there are two strategies to minimally revise rule-based theo- [4] Ryuta Arisaka. 2015. A Belief Revision Technique to Model Civil Code Updates.
ries, one is keeping the revision as close as possible to the original In JSAI International Symposium on Artificial Intelligence. Springer International
Publishing, Cham, 204–216.
and another is minimizing the changes of semantics (the set of [5] Duangtida Athakravi, Ken Satoh, Mark Law, Krysia Broda, and Alessandra Russo.
conclusions obtained from the theory), which vary among cases. 2015. Automated inference of rules with exception from past legal cases using ASP.
In International Conference on Logic Programming and Nonmonotonic Reasoning.
Although the dominant-based minimal revision follows the second Springer International Publishing, Cham, 83–96.
strategy, we also present a guidance to obtain a dominant-based [6] Michael Bain and Stephen Muggleton. 1992. Non-monotonic learning. Inductive
minimal revision but keep the revision as close as possible to the logic programming 38 (1992), 145153.
[7] Trevor JM Bench-Capon. 2017. Hypo’s legacy: introduction to the virtual special
original so the guidance takes both strategies into account. How- issue. Artificial Intelligence and Law 25, 2 (2017), 205–250.
ever, judges may intentionally make a minimal change based on [8] Donald H. Berman and Carole D. Hafner. 1995. Understanding Precedents in
the first strategy rather than the second strategy. In that scenario, a Temporal Context of Evolving Legal Doctrine. In Proceedings of the 5th Inter-
national Conference on Artificial Intelligence and Law (College Park, Maryland,
we think that our study of the semantics-based revision is benefi- USA) (ICAIL ’95). Association for Computing Machinery, New York, NY, USA,
cial for reminding unintentional effects that may be missed when 42–51. https://doi.org/10.1145/222092.222116
[9] Bohner. 1996. Impact analysis in the software change process: a year 2000 per-
considering revisions by merely the first strategy. This may lead to spective. In 1996 Proceedings of International Conference on Software Maintenance.
Impact Analysis [9] (the study of analyzing the impacts of changes, IEEE, USA, 42–51.
typically ones not known in advance ) in legal formalisms and may [10] L Karl Branting. 1991. Building explanations from rules and structured cases.
International journal of man-machine studies 34, 6 (1991), 797–837.
contribute to improving coherence of laws [35]. [11] Alison Chorley and Trevor Bench-Capon. 2005. AGATHA: Using heuristic search
to automate the construction of case law theories. Artificial Intelligence and Law
13, 1 (2005), 9–51.
8 CONCLUSION AND FUTURE WORK [12] Domenico Corapi, Alessandra Russo, Marina De Vos, Julian Padget, and Ken
In this paper, we investigate definitions of semantics-based minimal Satoh. 2011. Normative design using inductive learning. Theory and Practice of
Logic Programming 11, 4-5 (2011), 783–799.
revision for legal reasoning. We present definitions of semantics [13] Kristijonas Cyras, Ken Satoh, and Francesca Toni. 2016. Abstract argumentation
of a rulebase, the difference of semantics and the semantics-based for case-based reasoning. In Fifteenth International Conference on the Principles of

58
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh

Knowledge Representation and Reasoning. AAAI Press, CA, USA, 243–254. [31] Henry Prakken. 2021. A formal analysis of some factor-and precedent-based
[14] Mukesh Dalal. 1988. Investigations into a theory of knowledge base revision: accounts of precedential constraint. Artificial Intelligence and Law (2021), 1–27.
preliminary report. In Proceedings of the Seventh National Conference on Artificial [32] Adam Rigoni. 2015. An improved factor based approach to precedential constraint.
Intelligence, Vol. 2. AAAI Press, CA, USA, 475–479. Artificial Intelligence and Law 23, 2 (2015), 133–160.
[15] Marina De Vos, Julian Padget, and Ken Satoh. 2010. Legal modelling and reason- [33] Edwina L Rissland and Kevin D Ashley. 1987. A case-based system for trade secrets
ing using institutions. In JSAI International Symposium on Artificial Intelligence. law. In Proceedings of the 1st international conference on Artificial intelligence and
Springer Berlin Heidelberg, Berlin, Heidelberg, 129–140. law. Association for Computing Machinery, New York, NY, USA, 60–66.
[16] Alvaro Del Val. 1997. Non monotonic reasoning and belief revision: syntactic, [34] Edwina L Rissland and M Timur Friedman. 1995. Detecting change in legal
semantic, foundational and coherence approaches. Journal of Applied Non- concepts. In Proceedings of the 5th international conference on Artificial intelligence
Classical Logics 7, 1-2 (1997), 213–240. and law. Association for Computing Machinery, New York, NY, USA, 127–136.
[17] Wachara Fungwacharakorn and Ken Satoh. 2018. Legal Debugging in Proposi- [35] Antonino Rotolo and Corrado Roversi. 2013. Constitutive rules and coherence in
tional Legal Representation. In JSAI International Symposium on Artificial Intelli- legal argumentation: The case of extensive and restrictive interpretation. In Legal
gence. Springer International Publishing, Cham, 146–159. Argumentation Theory: Cross-Disciplinary Perspectives. Springer Netherlands,
[18] Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh. 2020. On the legal Dordrecht, The Netherlands, 163–188.
revision in PROLEG program. In 34th Proceedings of the Annual Conference of JSAI, [36] Ken Satoh. 1988. Nonmonotonic reasoning by minimal belief revision. Institute for
Vol. 2020. Japan Society of Artificial Intelligence, Japan, 3G5ES104–3G5ES104. New Generation Computer Technology, Japan.
[19] Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh. 2021. Resolving [37] Ken Satoh, Kento Asai, Takamune Kogawa, Masahiro Kubota, Megumi Naka-
counterintuitive consequences in law using legal debugging. Artificial Intelligence mura, Yoshiaki Nishigai, Kei Shirakawa, and Chiaki Takano. 2011. PROLEG: An
and Law (2021), 1–17. Implementation of the Presupposed Ultimate Fact Theory of Japanese Civil Code
[20] Michael Gelfond and Vladimir Lifschitz. 1988. The stable model semantics for by PROLOG Technology. In New Frontiers in Artificial Intelligence (Lecture Notes
logic programming. In Proceedings of International Logic Programming Conference in Computer Science). Springer Berlin Heidelberg, Berlin, Heidelberg, 153–164.
and Symposium, Robert Kowalski, Bowen, and Kenneth (Eds.), Vol. 88. MIT Press, [38] Ken Satoh, Masahiro Kubota, Yoshiaki Nishigai, and Chiaki Takano. 2009. Trans-
Cambridge, MA, USA, 1070–1080. lating the Japanese Presupposed Ultimate Fact Theory into Logic Programming.
[21] Guido Governatori, Michael J Maher, Grigoris Antoniou, and David Billington. In Proceedings of the 2009 Conference on Legal Knowledge and Information Systems:
2004. Argumentation semantics for defeasible logic. Journal of Logic and Compu- JURIX 2009: The Twenty-Second Annual Conference. IOS Press, Amsterdam, The
tation 14, 5 (2004), 675–702. Netherlands, 162–171.
[22] Guido Governatori, Monica Palmirani, Regis Riveret, Antonio Rotolo, and Gio- [39] Marek J. Sergot, Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek, Peter Ham-
vanni Sartor. 2005. Norm Modifications in Defeasible Logic. In Proceedings of the mond, and H Terese Cory. 1986. The British Nationality Act as a logic program.
2005 Conference on Legal Knowledge and Information Systems: JURIX 2005: The Commun. ACM 29, 5 (April 1986), 370–386.
Eighteenth Annual Conference. IOS Press, Amsterdam, The Netherlands, 13–22. [40] Ehud Y. Shapiro. 1983. Algorithmic Program DeBugging. MIT Press, Cambridge,
[23] Carole D Hafner and Donald H Berman. 2002. The role of context in case-based MA, USA.
legal reasoning: teleological, temporal, and procedural. Artificial Intelligence and [41] David M Sherman. 1987. A Prolog model of the income tax act of Canada. In
Law 10, 1 (2002), 19–64. Proceedings of the 1st international conference on Artificial intelligence and law.
[24] John Henderson and Trevor Bench-Capon. 2019. Describing the development of Association for Computing Machinery, New York, NY, USA, 127–136.
case law. In Proceedings of the Seventeenth International Conference on Artificial [42] Michael Thielscher. 2001. The qualification problem: A solution to the problem
Intelligence and Law (ICAIL ’19). Association for Computing Machinery, New of anomalous models. Artificial Intelligence 131, 1-2 (2001), 1–37.
York, NY, USA, 32–41. [43] Bart Verheij. 2008. About the Logical Relations between Cases and Rules. In
[25] John F Horty and Trevor JM Bench-Capon. 2012. A factor-based definition of Proceedings of the 2008 Conference on Legal Knowledge and Information Systems:
precedential constraint. Artificial intelligence and Law 20, 2 (2012), 181–214. JURIX 2008: The Twenty-First Annual Conference. IOS Press, Amsterdam, The
[26] Hirofumi Katsuno and Alberto O Mendelzon. 1991. Propositional knowledge Netherlands, 21–32.
base revision and minimal change. Artificial Intelligence 52, 3 (1991), 263–294. [44] Bart Verheij. 2016. Correct grounded reasoning with presumptive arguments.
[27] Edward H Levi. 2013. An introduction to legal reasoning. University of Chicago In European Conference on Logics in Artificial Intelligence. Springer International
Press, Chicago, USA. Publishing, Cham, 481–496.
[28] Tingting Li, Tina Balke, Marina De Vos, Julian Padget, and Ken Satoh. 2013. A [45] Bart Verheij. 2016. Formalizing value-guided argumentation for ethical systems
model-based approach to the automatic revision of secondary legislation. In design. Artificial Intelligence and Law 24, 4 (2016), 387–407.
Proceedings of the Fourteenth International Conference on Artificial Intelligence [46] Bart Verheij. 2017. Formalizing Arguments, Rules and Cases. In Proceedings of
and Law (Rome, Italy) (ICAIL ’13). Association for Computing Machinery, New the 16th Edition of the International Conference on Articial Intelligence and Law
York, NY, USA, 202–206. (London, United Kingdom) (ICAIL ’17). Association for Computing Machinery,
[29] Bernhard Nebel et al. 1992. Syntax-based approaches to belief revision. Belief New York, NY, USA, 199–208.
revision 29 (1992), 52–88. [47] James Wogulis and Michael J Pazzani. 1993. A methodology for evaluating
[30] Henry Prakken. 1991. A tool in modelling disagreement in law: preferring the theory revision systems: Results with Audrey II. In Proceedings of the Thirteenth
most specific argument. In Proceedings of the 3rd international conference on International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San
Artificial intelligence and law. Association for Computing Machinery, New York, Francisco,CA,USA, 1128–1134.
NY, USA, 165–174.

59
Explainable Artificial Intelligence, Lawyer’s Perspective
Łukasz Górski Shashishekar Ramakrishna
lgorski@icm.edu.pl shashi792@gmail.com
Interdisciplinary Centre for Mathematical and Freie Universität Berlin
Computational Modelling Berlin, Germany
Warsaw, Poland
ABSTRACT
Explainable artificial intelligence (XAI) is a research direction that was already put under scrutiny, in particular in the AI&Law community. Whilst there were notable developments in the area of (general, not necessarily legal) XAI, user experience studies regarding such methods, as well as more general studies pertaining to the concept of explainability among the users, are still lagging behind. This paper firstly assesses the performance of different explainability methods (Grad-CAM, LIME, SHAP) in explaining the predictions for a legal text classification problem; those explanations were then judged by legal professionals according to their accuracy. Secondly, the same respondents were asked to give their opinion on the desired qualities of an (explainable) artificial intelligence (AI) legal decision system and to present their general understanding of the term XAI. This part was treated as a pilot study for a more pronounced one regarding the lawyer's position on AI, and XAI in particular.

CCS CONCEPTS
• Computing methodologies → Neural networks.

KEYWORDS
explainable artificial intelligence, SHAP, LIME, Grad-CAM, survey, XAI

ACM Reference Format:
Łukasz Górski and Shashishekar Ramakrishna. 2021. Explainable Artificial Intelligence, Lawyer's Perspective. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3462757.3466145

1 INTRODUCTION
Explainable artificial intelligence (XAI) is a research field that aims to make the results of AI systems more understandable to humans [1]. Other domains notwithstanding, law is a domain that needs focus regarding explainability. Due process of law and fair trial requirements make an explanation of AI-based decision models a must. However, the legal and technological discourses pertaining to the creation of explainable AI seem to be separated. In particular, the methods used to create explainable systems seem to favour the needs of AI-engineers. This work was thought as the first step towards the identification of requirements of explainable AI-based systems that would involve the legal perspective to a greater extent.

For the purpose of this paper, a two-way investigation was performed. Firstly, to assess the performance of different explainability methods (Grad-CAM, LIME, SHAP), we have used a convolutional neural network (CNN) for text classification and used those methods to explain the predictions; those explanations were then judged by legal professionals according to their accuracy. Secondly, the same respondents were asked to give their opinion on the desired qualities of an (explainable) artificial intelligence (AI) legal decision system and to present their general understanding of the term XAI. This part was treated as a pilot study for a more pronounced one regarding the lawyer's position on AI, and XAI in particular. Our results can be treated as a stepping stone towards a more pronounced survey-based research.

This work contributes by:
• Presenting a comparison of different explainability methods when applied to a legal classification neural network.
• Giving an assessment of explainable artificial intelligence as understood by lawyers.

2 RELATED WORK
XAI is a research direction that was already put under scrutiny, in particular in the AI&Law community. The general consensus is that the opening of black-box models is a conditio sine qua non for assuring their trustworthiness and compliance with the normative background [12]. Moreover, in [33] it was noted that the development of explainable technical solutions is necessarily connected with the regulatory frameworks. Importantly, AI and legal specialists need to engage in a dialogue when tackling explainability questions in legal AI [33]. However, currently, we are unaware of any study that aimed to investigate the understanding of XAI and elicit a system's requirements from the lawyers. Yet, obviously, the legal perspective was brought into the discourse with the advent of legislation such as the European GDPR or the envisioned ePrivacy or Digital Services Act, other areas of law, like tort, notwithstanding [9]. As far as the legal perspective is concerned, in [7] it was noted that the judiciary will play an important role in the production of different forms of XAI, especially because judicial reasoning, using case-by-case consideration of the facts to produce nuanced decisions, is a pragmatic way to develop rules for XAI [7]. Whilst this way of thinking sounds very attractive and reminds us how seminal works on case-based reasoning in AI originated in the AI&Law community [24], on the other hand, the cited passage clearly alludes to the common law judge's mode of thinking, and a civil law judge may be more concerned with general rules.

This train of thought should also remind us of historical developments regarding explainability in AI&Law. Historically, case-based reasoners and rule-based reasoners (symbolic AI) are inherently interpretable [3]. Yet, producing viable explanations for deep learning algorithms (sub-symbolic AI) remains a challenge. Atkinson et al. cite Robbins, who remarked that [a]ny decision requiring an explanation should not be made by machine learning (ML) algorithms. Automation is still an option; however, this should be restricted to the old-fashioned kind of automation whereby the considerations are hard-coded into the algorithm [3] [25]. Consequently, historical machine learning-based systems (like SMILE) were capable of achieving good performance, though introspection of their domain models repeatedly shows that those were often incomplete or faulty [3]. The development of twin systems, with a traditional rule-based or case-based reasoner explaining the results of a neural network, was also a subject of scrutiny. Research in this respect continues to test different newly developed methods to achieve good explanations. Usage of an Attention Network, accompanied with attention-weight-based text highlighting, was explored in [4], with a negative conclusion regarding this approach's feasibility. In general, the Competition on Legal Information Extraction and Entailment (COLIEE) is a venue for applying state-of-the-art research regarding domain-specific information selection and justification generation methods [21][20].

The work [4] is exemplary in including a user evaluation study. However, whilst there were notable developments in the area of (general, not necessarily legal) XAI, user experience studies regarding such methods, as well as more general studies pertaining to the concept of explainability among the users, are still lagging behind [31] (with, for example, <1% of papers in the area of case-based reasoning including user evaluations in the first place [12] [11]). The results presented herein can be considered as facilitating such dialogue.

On the other hand, surveys are commonly used to assess the usability of computer-aided methods for lawyers. Whilst the study of law is commonly associated rather with logical and linguistic studies of texts, interdisciplinary studies of law do include methods like survey-based study. Works that employed such methodology include [4] (the aforementioned attention-weight-based text highlighting) or [5] (for building the ontology of basic legal knowledge). As recognized in [5], usability studies are an important part of the human-centered design process, where they are used for, inter alia, requirements specification with the use of questionnaire-based methods. Such methodology was employed, for example, in [32], where an A/B study was conducted to grade the understandability of a student loan decision letter created by an automated decision system. However, such a system was rule-based, and explainability is mostly connected with deep-learning systems, which remain inherently non-interpretable.

Contemporary XAI models view human-machine interaction as a dynamic process, in which user understanding of the system is continuously impacted by the explanations they receive [12]. The user's mental model of the system can further be decomposed into three parts: user model of the domain - which pertains to their understanding of the area that the system is working in, for example, legal decision making; user model of the AI system - pertaining to the knowledge of how the system is implemented, for example using a CNN or a rule-based classifier; user model of the explanation - knowledge of how the explanation strategy works with the system [12]. This tripartite division offers a useful conceptual framework through which the results of this paper can be analysed. While our respondents have deep domain knowledge, their knowledge of AI-based systems or explanation techniques is much shallower, if any.

In regard to the concrete explanation methods we've tested (Grad-CAM, LIME, SHAP), they are of different origins. Grad-CAM is a method used for the detection of the most important input data for a CNN's prediction. While it was originally used with image data, it was already proven to be of use in the case of textual input, including legal texts [8]. However, the aforementioned work does not include an end-user feasibility study and can be supplemented with a comparison of Grad-CAM with other explainability methods. They were, to our knowledge, studied in isolation. For example, there are already first studies regarding the usability of LIME for text data [18] (however, with conclusions pertaining to decision trees and linear models). Available analyses for SHAP, on the other hand, used it with CNN-based text classification models [36]. Nevertheless, it should be noted that post-hoc explanation methods were already a subject of criticism, due to them representing correlations and not being faithful to the underlying model [30].

3 METHODOLOGY

3.1 Explanations generation
Herein, the effectiveness of different XAI methods when applied to the law is studied. By using a plug-in classification system based on a CNN [13][8], we created a number of predictions based on the well-known PTSD dataset [34]. Those were used as a basis for generating explanations for the CNN predictions. In general, our pipeline (based on previous work by other authors [29][13] with significant extensions by us) is composed of an embedder, a classification CNN, a visualizer, an XAI module, and a metric-based evaluator; we have also developed a simple pre-processing module that uses some industry de facto standard text processing libraries for spelling correction, sentence detection, irregular character removal, etc., enhanced with our own implementations which make them better suited for legal texts (though it was not used in this research). Our embedding module houses a plug-in system to handle different variants of embeddings, in particular BERT and word2vec. The classification module houses a simple 1D CNN which facilitates explainability methods. The XAI module integrates the SHAP, Grad-CAM and LIME models (based on library solutions, as developed, respectively, by [16][29][22]). It connects the output from the CNN to arrive at explanations based on the model used. For LIME, the following parameters were customized: kernel_width=20, num_features=150, num_samples=400.

For the purpose of this work, as an embedding layer, we use DistilBERT [26] (a small, fast, cheap and light Transformer model trained by distilling BERT base), encoding input sentences into vectors. Huggingface's DistilBERT implementation was found to work relatively well with the XAI libraries that we have used; unfortunately, some of them even rely on Huggingface's implementation details, which can be found a limiting factor when a plugin architecture is considered.
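As an illustration of the LIME configuration just described, the following sketch shows how a LimeTextExplainer could be instantiated with those parameters. It is only a sketch, not the pipeline code itself: predict_proba stands for an assumed wrapper that runs a list of raw sentences through the embedding and CNN modules and returns class probabilities, and the class names merely mirror the PTSD sentence types.

```python
# A minimal sketch. `predict_proba` is an assumed callable:
# list[str] -> np.ndarray of shape (n_sentences, n_classes).
from lime.lime_text import LimeTextExplainer

CLASS_NAMES = ["Finding", "Evidence", "Reasoning", "Legal-Rule", "Citation", "Other"]

explainer = LimeTextExplainer(class_names=CLASS_NAMES, kernel_width=20)

def explain_sentence(sentence, predict_proba):
    exp = explainer.explain_instance(
        sentence,
        predict_proba,
        top_labels=1,         # explain only the predicted class
        num_features=150,     # parameters customized in our pipeline
        num_samples=400,
    )
    label = exp.available_labels()[0]
    return exp.as_list(label=label)   # (token, weight) pairs used for highlighting
```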

The CNN used in the pipeline was trained for classification. We use the Post-Traumatic Stress Disorder (PTSD) [19] dataset [35] for CNN training (as well as testing). It annotates a set of sentences originating from 50 decisions issued by the Board of Veterans' Appeals ("BVA") of the U.S. Department of Veterans Affairs according to their function in the decision [35] [34] [27]. The classification consists of six elements: Finding Sentence, Evidence Sentence, (Evidence-based) Reasoning Sentence, Legal-Rule Sentence, Citation Sentence, Other Sentence. This was found to be a relatively good candidate for neural network training, as the classes distinguished in this dataset are of relatively similar sizes. The output from each of the three explainer models had different value ranges, including negative values. Each value inside the value range indicated the association of the token to a specific class. For this experiment of understanding the importance of a word in the downstream task of classification, we round off all negative explainer values to 0 and then normalize the remaining value ranges to fit a range between 0 and 1.

The software stack used for the development of this system was instrumented under Anaconda 4.8.3 (with Python 3.8.3). Tensorflow v. 2.2.0 was used for CNN instrumentation and Grad-CAM calculations (with the code itself expanding the prior implementation available at [29]). The DistilBERT implementation and supporting code were sourced from Huggingface libraries: transformers v. 3.1.0, tokenizers v. 0.8.1rc2, nlp v. 0.4.0. For the XAI methods' implementations, SHAP v. 0.37 and LIME v. 0.1.1.18 were facilitated. The code used for this paper is available on GitHub (https://github.com/lukeg/xai_lawyers_perspective). A GPU cluster was used for calculations (with 4x Nvidia Tesla V100 32GB GPUs).

3.2 Explainability methods
The system described hereinbefore, for the purpose of this work, was used to generate three types of explanations. Those can be broadly classified into three different types [37]:
(1) Gradient-based diagnostics, e.g. Grad-CAM
(2) Shapley values-based diagnostics, e.g. SHAP, and
(3) Surrogate model-based diagnostics, e.g. LIME
In general, gradient-based methods look inside the box to obtain the gradient and feature values for attributing contributions. The other two methods explain a given model from outside of the box by looking at what is fed into the model and what is produced. A more detailed description of the concrete XAI methods chosen for this paper is as follows:

3.2.1 Grad-CAM. Grad-CAM is an explainability method originating from computer vision [28]. It is a well established post-hoc explainability technique where CNNs are concerned. Moreover, the Grad-CAM method passed independent sanity checks [2]. Whilst it is mainly connected with the explanations of deep learning networks used with image data, it has already been adapted for other areas of application. In particular, a CNN architecture for text classification was described in [13], and there exists at least one implementation which extends this work with Grad-CAM support for explainability [29]. Grad-CAMs were already used in the NLP domain, for (non-legal) document retrieval [6]. With the Grad-CAM technique it is possible to produce a class activation map (heatmap) for a given input sentence and predicted class. Each element of the class activation map corresponds to one token and indicates its importance in terms of the score of the particular (usually the predicted) class. The class activation map gives information on how strongly the particular tokens present in the input sentence influence the prediction of the CNN.
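The following sketch illustrates the general Grad-CAM recipe for a 1D text CNN under TensorFlow 2; it is a generic reconstruction, not the implementation of [29]/[8] used in our pipeline, and conv_layer_name is an assumed name of the model's convolutional layer.

```python
# A sketch of Grad-CAM for a 1D text CNN, assuming a Keras model over token
# embeddings that contains a named Conv1D layer.
import tensorflow as tf

def grad_cam_text(model, conv_layer_name, embedded_sentence, class_idx):
    """Return one relevance score per convolved token position, scaled to [0, 1]."""
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(embedded_sentence)   # conv_out: (1, positions, filters)
        class_score = preds[:, class_idx]
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=1)               # pool gradients over positions
    cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, tf.newaxis, :], axis=-1))
    cam = cam.numpy()[0]
    return cam / (cam.max() + 1e-8)                       # normalise for visualisation
```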

3.2.2 SHAP. SHAP (SHapley Additive exPlanations) [17] is an explainability method based on the Shapley value concept. SHAP values are used for identifying the contribution of each concept/word/feature in a given sentence. The Shapley value is the average of the marginal contributions across all permutations, i.e. the average over all the permutations for each concept/feature, giving each entity's contribution. In terms of local interpretability, each observation gets its own set of SHAP values. For our problem here we consider global interpretability, where collective SHAP values can show how much each predictor contributes, either positively or negatively, towards influencing the predictions of the CNN. SHAP has been shown to provide good consistency in attributing importance scores to each feature [15].
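A sketch of how such SHAP explanations for a text classifier might be obtained is shown below; it assumes a recent shap release (v0.37 is used in this work) in which the generic Explainer accepts a text masker, and predict_proba and tokenizer again stand for the assumed CNN probability wrapper and the Huggingface DistilBERT tokenizer.

```python
# A sketch, assuming shap's Text masker and generic Explainer interface;
# `predict_proba` and `tokenizer` are assumed objects from the pipeline.
import shap

masker = shap.maskers.Text(tokenizer)              # perturbs tokens of the input sentence
explainer = shap.Explainer(predict_proba, masker)

shap_values = explainer(
    ["Service connection for PTSD requires medical evidence diagnosing the condition ..."]
)
# shap_values[0].values holds per-token contributions (per class); aggregated per
# token, they yield the heatmap presented to the respondents.
```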
3.2.3 LIME. LIME (Local Interpretable Model-agnostic Explanations) [10][23] is another explainability method that provides explanations for the predictions of any classifier by learning interpretable or simpler models locally around the prediction. While the Shapley values consider all possible permutations, a unified approach to provide global and local consistency, the LIME approach builds sparse linear models around individual predictions in their local vicinity.

3.3 Comparison metrics
The XAI methods described hereinbefore are used to visualize (highlight) the parts of the input texts that are of most importance for the network's predictions. Such generated heatmaps (saliency maps) are in turn presented to a legal professional for (qualitative) grading. On the other hand, heatmaps are not easily susceptible to quantitative analysis and can be supported by more formal metrics in this respect [14][8]. Quantitative analysis and comparison of heatmaps generated by different XAI methods are thus facilitated by recalling the following metrics [8]:
(1) Fraction of elements above relative threshold 𝑡 (F(𝑣, 𝑡))
(2) Intersection over union with relative thresholds 𝑡1 and 𝑡2 (I(𝑣1, 𝑣2, 𝑡1, 𝑡2))
The first metric, F(𝑣, 𝑡), aims to quantify what portion of the input data is taken into account by the CNN when making a prediction. It can be defined as
F(𝑣, 𝑡) = |{𝑥 ∈ 𝑣 | 𝑥 > 𝑡 × max(𝑣)}| / len(𝑣).
The second metric, I(𝑣1, 𝑣2, 𝑡1, 𝑡2), is used to compare two saliency maps, 𝑣1 and 𝑣2, and can be used to check whether the same parts of the input sentence affect the CNN's prediction. It takes as arguments two heatmaps (𝑣1 and 𝑣2), binarizes them using the relative thresholds (𝑡1 and 𝑡2) and finally calculates |𝑣1 ∩ 𝑣2| / |𝑣1 ∪ 𝑣2|. It quantifies the relative overlap of words considered important for the prediction by each of the two models.
In both cases, 0.15 was chosen as a threshold value. This was selected based on suggestions in prior work [28]. In addition to that, for the current study, a value of 0.5 was tested as well, to filter out the lower-ranking words.
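Both metrics are straightforward to implement; the sketch below is a minimal NumPy rendering of the definitions above, together with the clip-and-rescale step from Section 3.1, and is not the evaluator module itself.

```python
# A minimal NumPy sketch of the saliency-map metrics defined above.
import numpy as np

def normalise(raw_scores):
    """Round negative explainer scores to 0 and rescale to the [0, 1] range (Sec. 3.1)."""
    v = np.clip(np.asarray(raw_scores, dtype=float), 0.0, None)
    return v / v.max() if v.max() > 0 else v

def f_metric(v, t=0.15):
    """Fraction of tokens whose saliency exceeds t * max(v)."""
    v = np.asarray(v, dtype=float)
    return float(np.mean(v > t * v.max()))

def i_metric(v1, v2, t1=0.15, t2=0.15):
    """Intersection over union of the token sets kept by the relative thresholds."""
    b1 = np.asarray(v1, dtype=float) > t1 * np.max(v1)
    b2 = np.asarray(v2, dtype=float) > t2 * np.max(v2)
    union = np.logical_or(b1, b2).sum()
    return float(np.logical_and(b1, b2).sum() / union) if union else 0.0
```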

3.4 Pilot User Study
To study prospective users' opinions on explainability in AI, as well as on the particular methods employed in this study, a survey was prepared. It consisted of two parts. The generic one contained three questions, pertaining to the general knowledge of XAI by the respondents. Those were as follows:

(1) Have you ever encountered the term "explainable artificial intelligence" (explainable AI, XAI)?
(2) According to you, how can "explainable artificial intelligence"/explainability in decision systems be characterized?
  (a) Explains the decision-making process and why we arrived at this result.
  (b) Conclusions are explained without a need for you to understand the inner workings of the system.
  (c) It's more useful for a software developer rather than for a lawyer.
  (d) AI-based systems are of little use and their explainability is thus irrelevant.
  (e) Explanations given by a computer system usually are not sufficient and need to be supplemented with background legal knowledge.
(3) What are your expectations regarding the justifications given by artificial intelligence-based automated decision-making systems?

While the list of questions could have been non-exhaustive, the goal of the above questions was only to assess the depth of the respondents' knowledge of XAI. Answers to these questions help us to understand the respondents' degree of confidence in the domain of XAI in order to evaluate/score the results generated from the different XAI models.

The second part of the study was aimed to elicit the lawyers' assessment of the three XAI methods that can be used to explain the CNN's predictions. In total, six correctly classified sentences from the test set were chosen as a basis for the study. Words composing those sentences were highlighted with colors of different intensities, according to their importance in the final prediction. All the words were subject to highlighting, as even stopwords can be of importance in case of legal interpretation. Respondents were asked to grade each visualization on a scale from 0 (worst) to 10 (best). We have decided to use only the correctly classified sentences, as the visualizations for incorrect ones might have looked peculiar for the respondents and could have confused them when compared with the visualizations for the correct ones. Moreover, understanding this part of the study was already not easy for the respondents, with one commenting that he skipped this part as he did not understand what the colorful words mean.

In the end, from the classes distinguished in the dataset, three were chosen for the visualization: citation sentence, reasoning sentence (or evidence-based reasoning sentence) and legal rule sentence. The classes were chosen so that each class stands out when compared with the others. The original dataset includes two additional classes, finding sentence and evidence sentence. However, even its authors admit that it may be difficult to tell apart different sentences when all the classes are compared (e.g. fact-finding ones vs. legal rule ones are easy to conflate). As our respondents do not necessarily have experience with cases under the cognition of the Board of Veterans' Appeals, it was decided that using only a subset of sentences' classes was a good compromise.

The following sentences were chosen for grading:
(1) Evidence-based-reasoning sentences
  (a) Further, as discussed below, none of the medical evidence indicates that a psychiatric disorder had its onset during service, and psychiatric disorders are complex matters requiring medical evidence for diagnosis; they are not the kind of disorders that subject to lay observation.
  (b) Given the inconsistencies between the Veteran's reports and the objective evidence of record, the Veteran's credibility is diminished.
(2) Legal Rule Sentences
  (a) Service connection for PTSD requires medical evidence diagnosing the condition in accordance with 38 C.F.R. 4.125(a); a link, established by medical evidence, between current symptoms and an in-service stressor; and credible supporting evidence that the in-service stressor occurred.
  (b) There must be 1) medical evidence diagnosing PTSD; 2) a link, established by medical evidence, between current symptoms of PTSD and an in-service stressor; and 3) credible supporting evidence that the claimed in-service stressor occurred.
  (c) The Federal Circuit has held that 38 U.S.C.A. 105 and 1110 preclude compensation for primary alcohol abuse disabilities and secondary disabilities that result from primary alcohol abuse.
(3) Citation sentence
  (a) See also Mittleider v. West, 11 Vet. App. 181, 182 (1998) (in the absence of medical evidence that does so, VA is precluded from differentiating between symptomatology attributed to a nonservice-connected disability and symptomatology attributed to a service-connected disability).

4 RESULTS
4.1 Neural network training
Firstly, for classification, the CNN was trained for 10 epochs, with a batch size of 1000 elements and a learning rate of 0.001. 80% of the PTSD dataset was used for training, with the remaining 20% left out as a test set. This allowed us to achieve an accuracy of ca. 83% on the test set. The PTSD dataset was found to be a good candidate for training, as it is relatively balanced, with class sizes varying from 1941 elements (Evidence sentence) to 389 (Finding sentence) in the case of the training set. Relations between the sizes of the classes in the case of the test set were similar. For details on training effectiveness, Table 1 can be consulted.
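For orientation, a minimal Keras sketch of a Kim-style 1D CNN trained with the hyper-parameters reported above is shown below; the filter count, kernel size and sequence length are assumptions rather than the values of the actual model, and X_train/y_train stand for assumed pre-computed DistilBERT embeddings and integer-encoded sentence types.

```python
# A sketch, assuming DistilBERT token embeddings as input; layer sizes are illustrative.
import tensorflow as tf

NUM_CLASSES = 6                  # Finding, Evidence, Reasoning, Legal-Rule, Citation, Other
SEQ_LEN, EMB_DIM = 128, 768      # assumed sequence length / DistilBERT hidden size

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu",
                           input_shape=(SEQ_LEN, EMB_DIM)),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 80/20 split, 10 epochs, batch size 1000, as reported in Section 4.1:
# model.fit(X_train, y_train, epochs=10, batch_size=1000,
#           validation_data=(X_test, y_test))
```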
4.2 Study results. Lawyers' conceptualization of XAI
For our pilot study, we have collected 21 surveys. Respondents represented a variety of legal professions, including university professors, attorneys and junior legal advisors. Results for question 1 (prior exposure to the term "XAI") and question 2 (closed questions regarding the respondents' understanding of XAI) are summarised in Table 2, given as a percentage of respective answers. In Table 3 the answers to question 2 are further broken down by the status of prior exposure to XAI.

Table 1: Classification report for the CNN trained for the purpose of this paper

Class                 Precision  Recall  F1-score  Support
Other Sentence        0.96       0.57    0.71      83
Reasoning Sentence    0.59       0.41    0.48      148
Finding Sentence      0.63       0.51    0.57      101
Legal Rule Sentence   0.76       0.97    0.85      207
Citation Sentence     0.99       1.00    0.99      213
Evidence Sentence     0.88       0.95    0.92      479

Table 2: Respondents' answers to questions 1 and 2

              yes    no     no opinion
Question 1    62%    38%    –
Question 2
  a           90%    5%     5%
  b           33%    52%    14%
  c           14%    76%    10%
  d           0%     90%    10%
  e           71%    19%    10%

Table 3: Respondents' answers to question 2, broken down according to their prior exposure to XAI

     Prior exposure                  No prior exposure
     yes    no     no opinion        yes    no     no opinion
a    100%   0%     0%                75%    13%    13%
b    46%    46%    8%                13%    63%    25%
c    23%    69%    8%                0%     88%    13%
d    0%     85%    15%               0%     100%   0%
e    77%    15%    8%                63%    25%    13%

The majority of respondents had already encountered the term "XAI". Elicitation of XAI requirements through closed question 2 shows that almost all lawyers agree that XAI is used to explain the decision-making process and the way we arrived at the result (with one person having "no opinion" and one disagreeing; both persons were unfamiliar with the term XAI, as noted in their answer to q. 1). Therefore we can conclude that - even in case of no prior exposure to this term - lawyers intuitively conceptualize the meaning of this term. Yet, certain divisions can be seen when the results for question 2 are analyzed. 52% of respondents agreed that XAI systems give explanations and the user does not need to understand their inner workings. On the other hand, 33% of respondents concluded otherwise. Therefore, a significant minority pointed out that the user of such a system has to possess a deeper system's model. Despite this division, almost everyone agreed that XAI in legal decision-making systems would be of more importance to the lawyer than to computer scientists (76%); three persons disagreed, with only one of them seeing the need to understand the system's inner workings as a precondition for understanding its explanation (as evidenced by the previous question).

The artificial intelligence & law curriculum seems to have penetrated the ranks of lawyers, as 90% of respondents see the prospects of their use (and, by extension, of XAI). The remaining 10% had no opinion. 71% of respondents see the need of supplementing machine-based explanations with further background legal knowledge, 19% think otherwise.

In conclusion, lawyers seem to be affirmative regarding the usefulness of AI and XAI. Yet, more work should be devoted to preparing explanations that do not involve prior knowledge of how a given system works. If XAI was to be deployed and aimed at non-professionals, clarity and completeness of the explanations should be the focal point, so that the decisions could be understood even without deeper background legal knowledge.

When we asked each respondent for their list of expectations regarding the justifications given by an AI-based decision system (question 3), the majority of their expectations are in line with the general requirements necessary for such a system. Users' expectations can be divided into three broad categories. These categories are also in line with the user's mental model of an AI system (cf. Sec. 2). General requirements, coming from users' deep knowledge of the law and its principles, like due process, contain expectations such as: fairness, lack of bias, transparency in the decision process, consequence awareness. Secondly, the users' model of an AI-based system tells them how it can increase the effectiveness of their work. Here, respondents remarked that such a system should decrease their manual effort, lessen time consumption and introduce automation. Finally, the user's knowledge of XAI implementation is concerned with supporting the system's conclusions with citations and provision of facts relevant to the system's decision. It should be noted that those requirements apply not just to XAI systems, but also to any legal software system in general, AI-based or not. Thus, the mental model of an AI-based system in the case of prospective users is not fully developed. Fig. 1 depicts the key vocabulary set used by the respondents while providing their list of expectations.

Figure 1: User expectations regarding the justifications given by an AI-based decision-system

While a majority of the requirements described above were repeated by multiple respondents, a few respondents provided requirements that were interesting from an AI-driven automated legal decision support system's perspective.

They are interesting not because of their novelty, but because only a few domain experts/end users thought about them. Some such requirements are shown below:
(1) "It should have the option of forgetting the decisions taken in past. However, it should also have a certain standard form of code of conduct."
(2) "There should be scope for generating contrasting explanations based on various objectives/functions/factors."
(3) "Basis on which AI has neglected the other information."
(4) "Legally correct explanations."
(5) "Option for collaborative (Man-Machine) explanation of the law, context, and linguistic."
(6) "Understandable explanations and Better Interfaces."

4.3 Study results. XAI methods' assessment
Tables 4, 5 and 6 can be consulted for comparisons of the different XAI methods when used with our sample sentences, in terms of user score as well as the comparison metrics (please note that the matrix of I metric results is symmetric, therefore repeating results were removed from the tables). Below are some observations made based on these results:
• Users' expectations seem to have differed to a large extent, which can be seen from the division of respondents' grades of SHAP's explanation of the citation sentence (Table 6 can be consulted for the visualization). SHAP marked all the words as very important, while - for example - Grad-CAM marked only the reference to the source material and omitted the summary included with the citation. For the authors of this paper, Grad-CAM has proven superior here, yet many respondents judged SHAP highly. 9 respondents gave it the lowest score out of all three (with two giving 0), 8 graded it the highest (with one person giving 10).
• While the heatmaps for the citation sentence from Grad-CAM and SHAP seem to be very different in nature (also pointed out by their F metric), the respondents seem to provide scores that vary only by a small margin.
• Based on the F and I metric scores for all the sentence types, LIME in general seems to have higher scores compared to Grad-CAM and then SHAP.
• Both SHAP and LIME seem to be more sensitive to the change in threshold value in terms of their F metric when compared to Grad-CAM.
• Based on respondents' average scores, Grad-CAM and SHAP seem to perform consistently between sentence types when compared against LIME. But, based on the F metric, both LIME and SHAP seem to have consistent values between sentence types as compared to Grad-CAM.

Herein we have presented a software system, based on a CNN, capable of classifying legally relevant texts, and coupled the system with an explainability module. Furthermore, this explainability module was then subject to a user study. The use of ANOVA (because D'Agostino's and Pearson's test did not allow us to reject the hypothesis that the scores for each of the XAI methods have normal distributions, and the same conclusions were arrived at when using Levene's test for the equality of variances) confirms that the discrepancy between the various scores is not of practical importance. The same results were achieved using the Friedman test (𝛼 = 0.05 in all the cases). Therefore other factors should be taken into account when choosing a particular method. From a software engineering point of view, it should be noted that the implementation of certain methods is dependent not only on the user's voice but also on technical feasibility. In this respect, it is of importance that Grad-CAM is a method that is used together only with CNNs, and we are unaware of any works that managed to use it with other neural network architectures. LIME is dependent on a number of hyperparameters. In this work, we have chosen the largest values that allowed us to carry out calculations with our hardware resources and software implementation. Yet, many works suggest that one should fine-tune those values until the explanations received are close to one's expectations. However, this introduces a risk of overfitting explanatory procedures, and having such fine-tuned explanations would be counterproductive with regard to our study. As far as SHAP goes, the library solution we used is still under active development, with documentation not always up-to-date and with limited support of external libraries. Regarding some of the observations presented hereinbefore, there are a few for which different evaluation techniques (technical, empirical, or mixed) need to be performed for arriving at conclusions. Such conclusions need to be backed by domain experts. Those will form a basis for future work.
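The statistical checks mentioned above can be reproduced with SciPy along the following lines; grades is an assumed mapping from each XAI method to the list of respondents' scores, and the sketch is only meant to name the tests used, not to reproduce the analysis itself.

```python
# A sketch of the tests named above (D'Agostino-Pearson, Levene, ANOVA, Friedman).
from scipy import stats

gradcam = grades["Grad-CAM"]        # `grades` is an assumed dict of score lists
shap_scores = grades["SHAP"]
lime_scores = grades["LIME"]

normality_p = [stats.normaltest(g).pvalue for g in (gradcam, shap_scores, lime_scores)]
levene_p = stats.levene(gradcam, shap_scores, lime_scores).pvalue

anova = stats.f_oneway(gradcam, shap_scores, lime_scores)
friedman = stats.friedmanchisquare(gradcam, shap_scores, lime_scores)  # paired per respondent
print(normality_p, levene_p, anova.pvalue, friedman.pvalue)
```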
5 CONCLUSION
In this paper, we have presented a comparison of different explainability methods when applied to a legal classification neural network (CNN) and provided an assessment of explainable artificial intelligence as understood by lawyers. The different XAI methods were graded similarly by the users, though certain variances can be spotted. The metrics presented herein offer software engineers an option to quantify the explanations given by the system. When a more general point of view is taken, one which comes from prospective users, it should be noted that the lawyers are generally looking forward to the implementation of (explainable) artificial intelligence systems and solutions that allow them to be more efficient in their use of time. Users are thus waiting for the results of AI - and in particular XAI - research.

ACKNOWLEDGMENTS
This research was carried out with the support of the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, under grant no GR81-14.

DISCLAIMER
The views reflected in this article are the views of the author (SR) and do not necessarily reflect the views of his employer organisation or its member firms.

Table 4: Comparison of XAI methods on sample PTSD evidence-based-reasoning sentences. Entries give the average user score, followed by the F metric and the I metric (vs. SHAP, vs. LIME) at the relative thresholds t = t1 = t2 = 0.15 and t = t1 = t2 = 0.5; symmetric I metric entries are omitted, and the highlighted-sentence visualizations of the original "Sentence" column are not reproducible here.

Sentence 1:
  Grad-CAM: user score (avg.) 4.43; at t = 0.15: F 0.51, I vs. SHAP 0.46, I vs. LIME 0.48; at t = 0.5: F 0.14, I vs. SHAP 0.18, I vs. LIME 0.3
  SHAP: user score (avg.) 3.48; at t = 0.15: F 0.82, I vs. LIME 0.74; at t = 0.5: F 0.13, I vs. LIME 0.25
  LIME: user score (avg.) 6.66; at t = 0.15: F 0.85; at t = 0.5: F 0.45

Sentence 2:
  Grad-CAM: user score (avg.) 6.15; at t = 0.15: F 0.92, I vs. SHAP 0.89, I vs. LIME 0.81; at t = 0.5: F 0.78, I vs. SHAP 0.22, I vs. LIME 0.19
  SHAP: user score (avg.) 5.15; at t = 0.15: F 0.96, I vs. LIME 0.85; at t = 0.5: F 0.26, I vs. LIME 0.31
  LIME: user score (avg.) 5.8; at t = 0.15: F 0.85; at t = 0.5: F 0.4

Table 5: Comparison of XAI methods on sample PTSD legal rule sentences. Same layout as Table 4.

Sentence 1:
  Grad-CAM: user score (avg.) 5.2; at t = 0.15: F 0.8, I vs. SHAP 0.89, I vs. LIME 0.77; at t = 0.5: F 0.42, I vs. SHAP 0.41, I vs. LIME 0.34
  SHAP: user score (avg.) 6.05; at t = 0.15: F 0.9, I vs. LIME 0.87; at t = 0.5: F 0.52, I vs. LIME 0.41
  LIME: user score (avg.) 5.9; at t = 0.15: F 0.93; at t = 0.5: F 0.48

Sentence 2:
  Grad-CAM: user score (avg.) 5.45; at t = 0.15: F 0.73, I vs. SHAP 0.47, I vs. LIME 0.8; at t = 0.5: F 0.12, I vs. SHAP 0.14, I vs. LIME 0.18
  SHAP: user score (avg.) 5.75; at t = 0.15: F 0.67, I vs. LIME 0.56; at t = 0.5: F 0.36, I vs. LIME 0.08
  LIME: user score (avg.) 4.3; at t = 0.15: F 0.73; at t = 0.5: F 0.11

Sentence 3:
  Grad-CAM: user score (avg.) 5.9; at t = 0.15: F 0.71, I vs. SHAP 0.41, I vs. LIME 0.71; at t = 0.5: F 0.4, I vs. SHAP 0.13, I vs. LIME 0.06
  SHAP: user score (avg.) 4.2; at t = 0.15: F 0.63, I vs. LIME 0.63; at t = 0.5: F 0.29, I vs. LIME 0.25
  LIME: user score (avg.) 6.15; at t = 0.15: F 0.93; at t = 0.5: F 0.54

Table 6: Comparison of XAI methods on sample PTSD citation sentence. Same layout as Table 4.

Sentence 1:
  Grad-CAM: user score (avg.) 5.2; at t = 0.15: F 0.26, I vs. SHAP 0.26, I vs. LIME 0.24; at t = 0.5: F 0.13, I vs. SHAP 0.13, I vs. LIME 0.11
  SHAP: user score (avg.) 4.95; at t = 0.15: F 0.98, I vs. LIME 0.83; at t = 0.5: F 0.98, I vs. LIME 0.16
  LIME: user score (avg.) 4.65; at t = 0.15: F 0.73; at t = 0.5: F 0.17

REFERENCES
[1] Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6 (2018), 52138–52160.
[2] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity Checks for Saliency Maps. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc., 9505–9515. https://proceedings.neurips.cc/paper/2018/file/294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf
[3] Katie Atkinson, Trevor Bench-Capon, and Danushka Bollegala. 2020. Explanation in AI and law: Past, present and future. Artificial Intelligence (2020), 103387.
[4] L. Karl Branting, Craig Pfeifer, Bradford Brown, Lisa Ferro, John Aberdeen, Brandy Weiss, Mark Pfaff, and Bill Liao. 2020. Scalable and explainable legal prediction. Artificial Intelligence and Law (2020), 1–26.
[5] Núria Casellas. 2011. Legal ontology engineering: Methodologies, modelling trends, and the ontology of professional judicial knowledge. Vol. 3. Springer Science & Business Media.
[6] Jaekeol Choi, Jungin Choi, and Wonjong Rhee. 2020. Interpreting Neural Ranking Models using Grad-CAM. arXiv preprint arXiv:2005.05768 (2020).
[7] Ashley Deeks. 2019. The judicial demand for explainable artificial intelligence. Columbia Law Review 119, 7 (2019), 1829–1850.
[8] Lukasz Gorski, Shashishekar Ramakrishna, and Jedrzej M. Nowosielski. 2020. Towards Grad-CAM Based Explainability in a Legal Text Processing Pipeline. arXiv preprint arXiv:2012.09603 (2020).
[9] Philipp Hacker, Ralf Krestel, Stefan Grundmann, and Felix Naumann. 2020. Explainable AI under contract and tort law: legal incentives and technical challenges. Artificial Intelligence and Law (2020), 1–25.
[10] Linwei Hu, Jie Chen, Vijayan N. Nair, and Agus Sudjianto. 2020. Surrogate Locally-Interpretable Models with Supervised Machine Learning Algorithms. arXiv:2007.14528 [stat.ML]
[11] Mark T. Keane and Eoin M. Kenny. 2019. How case-based reasoning explains neural networks: A theoretical analysis of XAI using post-hoc explanation-by-example from a survey of ANN-CBR twin-systems. In International Conference on Case-Based Reasoning. Springer, 155–171.
[12] Eoin M. Kenny, Courtney Ford, Molly Quinn, and Mark T. Keane. 2021. Explaining black-box classifiers using post-hoc explanations-by-example: The effect of explanations and error-rates in XAI user studies. Artificial Intelligence 294 (2021), 103459. https://doi.org/10.1016/j.artint.2021.103459
[13] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1746–1751. https://doi.org/10.3115/v1/D14-1181
[14] David Krakov and Dror G. Feitelson. 2013. Comparing performance heatmaps. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 42–61.
[15] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2019. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv:1802.03888 [cs.LG]
[16] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
[17] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 4768–4777.
[18] Dina Mardaoui and Damien Garreau. 2020. An Analysis of LIME for Text Data. arXiv preprint arXiv:2010.12487 (2020).
[19] Victoria Hadfield Moshiashwili. 2015. The Downfall of Auer Deference: Veterans Law at the Federal Circuit in 2014.
[20] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. [n.d.]. COLIEE 2020: Methods for Legal Document Retrieval and Entailment. ([n.d.]).
[21] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. 2019. A Summary of the COLIEE 2019 Competition. In JSAI International Symposium on Artificial Intelligence. Springer, 34–49.
[22] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 1135–1144.
[23] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs.LG]
[24] Edwina L. Rissland, Kevin D. Ashley, and Ronald Prescott Loui. 2003. AI and Law: A fruitful synergy. Artificial Intelligence 150, 1-2 (2003), 1–15.
[25] Scott Robbins. 2019. A misdirected principle with a catch: explicability for AI. Minds and Machines 29, 4 (2019), 495–514.
[26] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs.CL]
[27] Jaromír Savelka, Vern R. Walker, Matthias Grabmair, and Kevin D. Ashley. 2017. Sentence Boundary Detection in Adjudicatory Decisions in the United States.
[28] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626.
[29] Haebin Shin. [n.d.]. Grad-CAM for Text. https://github.com/HaebinShin/grad-cam-text. Accessed: 2020-08-05.
[30] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 180–186.
[31] Jasper van der Waa, Elisabeth Nieuwburg, Anita Cremers, and Mark Neerincx. 2021. Evaluating XAI: A comparison of rule-based and example-based explanations. Artificial Intelligence 291 (2021), 103404.
[32] Tom M. van Engers and Dennis M. de Vries. 2019. Governmental Transparency in the Era of Artificial Intelligence. In JURIX. 33–42.
[33] Martijn van Otterlo and Martin Atzmueller. 2018. On Requirements and Design Criteria for Explainability in Legal AI. In XAILA@JURIX.
[34] Vern R. Walker, Ji Hae Han, Xiang Ni, and Kaneyasu Yoseda. 2017. Semantic Types for Computational Legal Reasoning: Propositional Connectives and Sentence Roles in the Veterans' Claims Dataset (ICAIL '17). Association for Computing Machinery, New York, NY, USA, 217–226. https://doi.org/10.1145/3086512.3086535
[35] Vern R. Walker, Krishnan Pillaipakkamnatt, Alexandra M. Davidson, Marysa Linares, and Domenick J. Pesce. 2019. Automatic Classification of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts, Montreal, QC, Canada, June 21, 2019 (CEUR Workshop Proceedings, Vol. 2385). http://ceur-ws.org/Vol-2385/paper1.pdf
[36] Wei Zhao, Tarun Joshi, Vijayan N. Nair, and Agus Sudjianto. 2020. SHAP values for Explaining CNN-based Text Classification Models. arXiv preprint arXiv:2008.11825 (2020).
[37] Wei Zhao, Tarun Joshi, Vijayan N. Nair, and Agus Sudjianto. 2020. SHAP values for Explaining CNN-based Text Classification Models. arXiv:2008.11825 [cs.CL]

Unravel Legal References in Defeasible Deontic Logic
Guido Governatori
Data61, CSIRO
Dutton Park, Queensland, Australia
guido.governatori@data61.csiro.au

Francesco Olivieri
Institute for Integrated and Intelligent Systems, Griffith University
Nathan, Queensland, Australia
f.olivieri@griffith.edu.au

ABSTRACT
Legal documents often contain references to either other documents, or other parts (of the same document). The use of references is meant to reduce the complexity of the documents; however, they pose serious concerns for the formal (logical) representation of the norms stipulated in the document itself. We propose an approach to directly model the references in a logic language and to resolve them during the computation of the legal effects in force in a case. The approach is proved to be computationally feasible and to have an efficient algorithmic implementation.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; • Applied computing → Law; • Theory of computation → Proof theory; Automated reasoning.

KEYWORDS
Defeasible Deontic Logic, legal references

ACM Reference Format:
Guido Governatori and Francesco Olivieri. 2021. Unravel Legal References in Defeasible Deontic Logic. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466080

1 INTRODUCTION
A typical characteristic of legal documents aiming at the reduction of the complexity of the documents themselves is the use of references, either to other sections of the same document (internal references), or to sections of other documents (external references). The key idea behind references is that they are used to import "content" from relevant provisions into the provision where the reference appears, without the need to repeat the content/text of the imported provision. Also, frequently, the content is not simply imported, as a reference may require a legal lens to unravel the actual (legal) intent of the content in the context where the reference appears.

The focus of this work is to examine the references in a legal document under the legal lens that assesses the legal status of the said references. Here, by the legal status we mean whether the provision corresponding to the reference is applicable, has been complied with, or has been violated. The examples below provide a wide set of instances of the types of references for which we are going to develop techniques to represent them in a logic designed to formalise legal reasoning.

Example 1.1 (Applicable). Section 9.8.2, Telecommunications Consumer Protections Code.
Termination by a Customer: A Supplier must ensure that, if so notified by the Customer who is exercising the applicable termination right in clause 9.9.1 h), if any, as a result of the move, the Supplier terminates the relevant Customer Contract relating to the Telecommunications Service within 5 Working Days of receiving the Customer's notice.

Example 1.2 (Complied with). Section 9.8.2, Telecommunications Consumer Protections Code.
Provided that a Supplier complies with the terms of this clause 9.9 in circumstances of a move to an alternate wholesale network provider, the Supplier is not required to comply with the other provisions of this Chapter in relation to such a move except for clauses 9.5, 9.6, and 9.7.

Example 1.3 (Violated). Section 26-105, Income Tax Assessment Act 1997.
(2) You cannot deduct under this Act a non-cash benefit if:
  (a) section 14–5 in Schedule 1 to the Taxation Administration Act 1953 requires you to pay an amount to the Commissioner before providing the benefit, because of any of the following provisions in that Schedule:
    (i) section 12–35 (about payments to employees);
    (ii) section 12–40 (about payments to directors);
    (iii) section 12–47 (about payments to religious practitioners);
    (iv) section 12–60 (about payments under labour hire and certain other arrangements);
    (v) in relation to a supply, other than a supply referred to in subsection (3) of this section, section 12-190 (about quoting of ABN); and
  (b) you fail to comply, or purportedly comply, with section 16–150 in that Schedule in relation to the amount.

Given a formal representation of a set of legal provisions as logical expressions, one of the aims of a logic for legal reasoning is to infer what conclusions are entailed by a set of given premises (e.g., corresponding to the facts of a case). Said conclusions are meant to represent: (a) the legal requirements (or effects) that hold, or are in force, based on the facts of the case, and (b) the set of norms (the given legal provisions).

In general, following [14], provisions can be represented by "IF . . . THEN . . . " rules, where the IF part contains the conditions under which a rule is able to produce its conclusion (or effect), which is encoded by the THEN part. Provisions are, in general, defeasible; this means that there are some provisions that give baseline conditions under which the effect of the norm holds (or enters in force). However, such baseline rules are subject to exceptions (or exclusions, or derogations). We also adopt the usual distinction between constitutive and prescriptive rules, where a constitutive rule defines an institutional fact in the underlying normative system, while a prescriptive rule stipulates the conditions under which a legal effect is in force and what the legal effect is (either an obligation, a prohibition, or a permission).

Based on the discussion so far, it should be clear that, to formalise provisions containing citations (as in the instances given in the examples), one has to understand what the conditions corresponding to the cited provisions are before such conditions are represented by logical expressions. One option is to formalise the cited provisions and then copy the formalised conditions into the formalisation of the citing provision. However, this is not enough, because we have to include the logical representation of the semantics of the citation.

Let us start by examining the case of "applicable". It can have multiple meanings. The first is that the IF part of the provision's formalisation holds (it is deemed to be true in the particular situation or case). The second meaning is that, in addition to the IF part, also the THEN part holds. For the first case, the referring provision should import the IF part, and for the second case the import consists of the conjunction of both the IF part and the THEN part. Suppose that section X is modelled by the rule

IF 𝐴1, . . . , 𝐴𝑛 THEN 𝐵 (1)

and that section Y, citing section X, requires conditions 𝐶1, . . . , 𝐶𝑚 to produce the effect 𝐷 when section X applies. According to the first reading, the rule for section Y is

IF 𝐴1, . . . , 𝐴𝑛, 𝐶1, . . . , 𝐶𝑚 THEN 𝐷. (2)

Based on the second interpretation, the rendering of section Y is

IF 𝐴1, . . . , 𝐴𝑛, 𝐵, 𝐶1, . . . , 𝐶𝑚 THEN 𝐷. (3)

However, in case there are multiple provisions to support the same effect or conclusion, such rules might not offer an accurate model of the normative system. To handle this situation, it is not enough to use the conjunction of the IF and THEN parts, but we have to include, in the formalisation, elements from the provisions that might provide exceptions to the cited provision. Moreover, it is possible to have exclusions of the exclusions, and this has a ripple effect on the conditions to be included in the citing provision. When we look at the examples above, specifically Example 1.3, and we consider real-life Acts, where exclusion sections can run for several pages with many rules encoding the exceptions (and the exceptions themselves have further exclusions), it is evident that unravelling citations based on the technique we just alluded to is untenable.

Consequently, the solution we advocate is to extend the logical language with a new class of unary predicates whose argument is the cited proposition; the new predicates correspond to the possible legal statuses of their argument (the cited provision). Hence, for section X in the modelling of section Y, we could use

IF 𝐶1, . . . , 𝐶𝑚, 𝐴𝑝𝑝𝑙𝑦(𝑆𝑒𝑐𝑡𝑖𝑜𝑛 X) THEN 𝐷 (4)

The issue now is to determine when 𝐴𝑝𝑝𝑙𝑦(𝑆𝑒𝑐𝑡𝑖𝑜𝑛 X) holds and can then be used to trigger the rule above. A solution would be to add the rule

IF 𝐴1, . . . , 𝐴𝑛 THEN 𝐴𝑝𝑝𝑙𝑦(𝑆𝑒𝑐𝑡𝑖𝑜𝑛 X). (5)

This approach works well for the first meaning of applicable, but it does not work for the second interpretation, where it faces the same issues discussed for the previous approach when exceptions exist.

Complied with and violated are types of references that apply to prescriptive rules only. In the simplest case, the semantics for complied with requires that if the provision is applicable, then the effect (i.e., an obligation or a prohibition) has been fulfilled. This amounts to saying that if the effect is the obligation O𝐴, then 𝐴 holds as well; if the effect is a prohibition (i.e., F𝐴, equivalent to O¬𝐴), then 𝐴 does not hold. The case of violated is similar: the rule is violated if the content of the obligation does not hold, or if it holds when the effect is a prohibition. Simple adaptations of the techniques exemplified above are available. All we have to do is to include O𝐴 and 𝐴 (for complied with), or O𝐴 and ¬𝐴 (for violated), in the IF part of the appropriate rules.

Things get more complicated when we take into account a key feature of normative systems: situations when it is possible to fail to comply with some provisions, but compensatory measures are contemplated. Hence, it is possible to fail in fulfilling the primary obligation and still be compliant by fulfilling the compensatory measures. Nevertheless, we might encounter scenarios where, typically, failing to comply with the primary obligation requires us to comply with the compensatory obligation, but in specific circumstances what constitutes the compensatory measure to recover from the violation of the primary obligation is forbidden, and then there is no effective way to compensate and restore compliance. According to this discussion, we comply with an applicable provision when we fulfil the primary obligation, or when at least one of the compensatory obligations is actually in force and we fulfil it. Conversely, a provision is not complied with when one of the obligations in force (either the primary obligation or one of the compensatory obligations) is not fulfilled and it does not admit a compensation. While it is possible to write rules importing the content of the related rules (in this case rules preventing a compensation to be in force), the process would require an oracle that simulates the underlying reasoning to determine what the rules are that contribute to the conditions of the imported citation.

The method we propose is to incorporate in the logic itself the conditions for the resolution of the citation predicates. This means that, when the logic determines that the conditions of a rule hold and it is possible to infer that the rule's conclusion holds as well, then we can assert: (i) that the rule applies (according to the second reading of applicable), and (ii) that the citations are resolved at the same time as the inferences that can be drawn from a set of rules and a set of facts are computed. The combination of (i) and (ii) avoids the need to write additional rules to capture the intended semantics (eventually using an oracle to solve the semantic conditions) of the cited provisions.
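To make the naive reading of rules (4) and (5) concrete, the toy sketch below encodes them as plain monotonic IF-THEN rules and derives Apply(Section X) by forward chaining. It is purely illustrative: it deliberately ignores exceptions, deontic effects and the defeasible machinery introduced in the next section, and all rule and atom names are assumptions.

```python
# A toy illustration of the classical reading of rules (4) and (5); this is NOT
# the Defeasible Deontic Logic developed in Section 2.
RULES = [
    ({"A1", "A2"}, "B"),                        # section X, cf. rule (1)
    ({"A1", "A2"}, "Apply(SectionX)"),          # rule (5): section X's conditions hold
    ({"C1", "C2", "Apply(SectionX)"}, "D"),     # section Y, cf. rule (4)
]

def closure(facts, rules):
    """Naive forward chaining: fire rules until no new conclusions are produced."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived

print(closure({"A1", "A2", "C1", "C2"}, RULES))   # includes 'Apply(SectionX)' and 'D'
```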

2 LOGIC head of the rule) which is a single literal in case 𝑋 = {C, P}, or an
In this section we introduce the logical apparatus, which is an ex- ⊗-expression in case 𝑋 = O.
tension of standard Defeasible Logic (DL) [1], and more specifically We use several abbreviations on sets of rules. 𝑅s , 𝑅d , 𝑅sd denote,
is based on the modal, deontic frameworks proposed in [5], as we respectively, the set of only strict, only defeasible, and both strict
shall need means to formalise prescriptive behaviours as well as to and defeasible rules; 𝑅𝑋 [𝑙] is the set of constitutive rules (𝑋 = C) or
determine which situations are compliant, and which ones are not. deontic rules (𝑋 ∈ {O, P}) whose head is 𝑙. Lastly, 𝑅 O [𝑙, 𝑖] denotes
the set of obligation rules, where 𝑙 is the 𝑖-th element in the ⊗-
expression.
2.1 Language of Defeasible Deontic Logic The meaning of an ⊗-expression 𝐶 (𝛼) = 𝑐 1 ⊗ 𝑐 2 ⊗ · · · ⊗ 𝑐𝑚 as
Let PROP be the set of propositional atoms, then the set of literals consequent of a rule 𝐴(𝛼) ↩→O 𝐶 (𝛼) is that: if the rule is allowed
is Lit = PROP ∪ {¬𝑝 | 𝑝 ∈ PROP}. The complementary of a literal to draw its conclusion, then 𝑐 1 is the obligation in force, and only
𝑝 is denoted with ∼𝑝: if 𝑝 is a positive literal 𝑞 then ∼𝑝 is ¬𝑞, if 𝑝 when 𝑐 1 is violated then 𝑐 2 becomes the new in force obligation,
is a negative literal ¬𝑞 then ∼𝑝 is 𝑞. The set of deontic literals is and so on for the rest of the elements in the chain. In this setting, 𝑐𝑚
ModLit = {𝑀𝑙, ¬𝑀𝑙 | 𝑙 ∈ Lit ∧ 𝑀 ∈ {O, P}}. Note that we will not represents the last chance to comply with the prescriptive behaviour
have specific rules nor modality for prohibitions, as we will treat enforced by 𝛼 and, in case 𝑐𝑚 is violated as well, then we will result
them according to the standard duality that something is forbidden in a non-compliant situation.
iff the opposite is mandatory (F𝑝 ≡ O∼𝑝). A conclusion of 𝐷 is either a tagged (deontic) literal, or a tagged
Lab is the set of labels, which are names for rules and will be label; as it will be clear in the reminder of the section, reference
denoted by small-capitals Greek letters. As we will be interested in literals are not conclusions of rules and they can be derived based
understanding (deriving) when a rule is violated, complied with, and on various conditions on the provability of particular literals in
active, we extend the set of literals with reference literals RefLit = the rule. For (deontic) literals, a conclusion can have one of the
{𝑌 (𝛼), ¬𝑌 (𝛼) | 𝛼 ∈ Lab ∧ 𝑌 ∈ {𝑎𝑐𝑡𝑖𝑣𝑒, 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑, 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑_𝑤𝑖𝑡ℎ}}. following forms: (i) ±Δ𝑋 𝑙 which means that 𝑙 is definitely prov-
A defeasible theory 𝐷 is a tuple (𝐹, 𝑅, >). 𝐹 ⊆ Lit is the set of able/refutable in 𝐷, and (ii) ±𝜕𝑋 𝑙 which means that 𝑙 is defeasibly
facts (indisputable, constitutive statements). (In this version of the provable/refutable in 𝐷. For labels, we use the ideas of ⊥ and ⊤
logic, we do not admit obligations and permissions to be facts of from [6] (that will provide a compact notation for compliance and
the theory.) We use two kinds of rules. Non-deontic (thus standard, violation), a conclusion can have one of the following forms: (iii)
or constitutive) rules 𝑅 C model constitutive statements (count-as ±𝜕A 𝛼 which means that 𝛼 is active/not active (various nuances
rules). Deontic rules represent prescriptive behaviours; deontic are possible, and some are given in Definitions 2.5–2.10), (iv) +⊤𝛼
rules are either obligation rules 𝑅 O which determine when and which means that 𝛼 is complied with, (v) −⊤𝛼 which means that
which obligations are in force, or permission rules 𝑅 P which repre- 𝛼 is not complied with - violated and not compensated -, (vi) +⊥𝛼
sent strong (or explicit) permissions. Finally, > is a binary relation which means that 𝛼 is violated, and lastly (vii) −⊥𝛼 which means
over 𝑅 to solve conflicts in case of potentially conflicting informa- that 𝛼 has never been violated.
tion. A proof of length 𝑛 in 𝐷 is a finite sequence 𝑃 (1), 𝑃 (2), . . . , 𝑃 (𝑛)
Following the ideas of [8], obligation rules gain more expressive- of the tagged literals and tagged reference literals just described
ness with the compensation operator ⊗ for obligation rules, which above and formally defined hereafter; 𝑃 (1..𝑛) denotes the first 𝑛
is to model reparative chains of obligations. Intuitively, 𝑎 ⊗ 𝑏 means steps of 𝑃. If, for instance, 𝐷 proves +𝜕𝑙 at proof step 𝑛, we write
that 𝑎 is the primary obligation, but if for some reason we fail to 𝑃 (𝑛) = +𝜕𝑙, and we also assume the conventional notation 𝐷 ⊢ +𝜕𝑙.
obtain, to comply with, 𝑎 (by either not being able to prove 𝑎, or Before defining when a rule is applicable/discarded, we provide
by proving ∼𝑎), then 𝑏 becomes the new obligation in force. This the two definitions of being a trigger, and being a blank. These two
operator is used to build chains of preferences, called ⊗-expressions. concepts are related to the elements in the set of antecedents for a
The formation rules for ⊗-expressions are: (i) every literal is given rule. Conceptually, an antecedent is a trigger for a rule when
an ⊗-expression, (ii) if 𝐴 is an ⊗-expression and 𝑏 is a literal then its provability allows such a rule to (potentially1 ) fire, whilst it is a
𝐴 ⊗ 𝑏 is an ⊗-expression. In addition, we stipulate that ⊗ obeys the blank when its refutability prevents such a rule from firing. ¬O𝑙 (resp.
following properties: (a) Associativity 𝑎 ⊗ (𝑏 ⊗ 𝑐) = (𝑎 ⊗ 𝑏) ⊗ 𝑐, (b) ¬P𝑙) means that the obligation (resp. permission) of 𝑙 is not in force.
Duplication and contraction on the right: ⊗_{𝑖=1}^{𝑚} 𝑐_𝑖 = (⊗_{𝑖=1}^{𝑘−1} 𝑐_𝑖 ) ⊗ (⊗_{𝑖=𝑘+1}^{𝑚} 𝑐_𝑖 ) for 𝑗 < 𝑘 and 𝑐_𝑗 = 𝑐_𝑘 .
The meaning of being active depends on the interpretation, which may vary from context to context. Notice that henceforth
We adopt the standard DL’s definitions of strict rules, defeasible we use active to refer to a rule that we describe as applicable in
rules and defeaters [1], where a rule is an expression 𝛼 : 𝐴(𝛼) ↩→𝑋 Section 1 to distinguish it from the Defeasible Logic notion of a rule
𝐶 (𝛼). 𝛼 ∈ Lab is a unique label for the name of the rule. ↩→∈ being applicable. Various different nuances for the notion of active
{→, ⇒, ;} is the type of rule: we use → for strict rules, ⇒ for are proposed afterwards, but we first start by formalising provabil-
defeasible rules, and ; for defeaters. 𝑋 = {C, O, P} is the kind of ity/refutability for constitutive statements, obligations, permissions,
rule: if 𝑋 = C then ↩→ has no subscript and the rule is used to derive and finally violated and complied with rules.
non-deontic literals (constitutive statements), whilst if X is O or
Definition 2.1 (Trigger). Given a deontic defeasible theory 𝐷, a
P then the rule is used to derive deontic conclusions (prescriptive
proof 𝑃, a rule 𝛼, we say that 𝑎 ∈ 𝐴(𝛼) is a trigger for 𝛼 at 𝑃 (𝑛 + 1)
statements). 𝐴(𝛼) = 𝑎 1, . . . , 𝑎𝑛 is the set of antecedents/premises
iff
where each 𝑎𝑖 is either a literal, a deontic literal, or a reference
literal. Lastly, 𝐶 (𝛼) is the conclusion of the rule (referred to as the 1 Potentially because the rule itself may be defeated.
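To fix intuitions about the rule language just defined, one possible in-memory encoding is sketched below in Python. The class names are assumptions of this sketch, not part of the formal definitions; the example rule η is the one used later in Example 2.12.

# Illustrative encoding of the language: literals, deontic literals,
# reference literals, and obligation rules whose head is an ⊗-chain
# c1 ⊗ ... ⊗ cm of increasingly "reparative" obligations.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Lit:
    atom: str
    positive: bool = True
    def neg(self):                      # complementary literal ∼p
        return Lit(self.atom, not self.positive)

@dataclass(frozen=True)
class Deontic:                          # O l / P l and their negations
    mode: str                           # "O" or "P"
    lit: Lit
    positive: bool = True

@dataclass(frozen=True)
class Ref:                              # reference literal on a rule label
    status: str                         # "active", "violated", "complied_with"
    label: str
    positive: bool = True

@dataclass
class Rule:
    label: str
    kind: str                           # "C" (constitutive), "O", "P"
    strength: str                       # "strict", "defeasible", "defeater"
    body: list = field(default_factory=list)   # Lit | Deontic | Ref
    head: list = field(default_factory=list)   # [c1, ..., cm]; len > 1 only for "O"

# η : a, active(α), violated(χ) ⇒O z ⊗ w ⊗ v   (cf. Example 2.12)
eta = Rule("eta", "O", "defeasible",
           body=[Lit("a"), Ref("active", "alpha"), Ref("violated", "chi")],
           head=[Lit("z"), Lit("w"), Lit("v")])
print(eta)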


(1) if 𝑎 ∈ Lit, then were in force obligations that have been violated. We are thus to
(a) if 𝛼 ∈ 𝑅𝑠 then +Δ𝑎 ∈ 𝑃 (1..𝑛); establish whether the element at the current index is an in force
(b) if 𝛼 ∈ 𝑅𝑑 then +𝜕𝑎 ∈ 𝑃 (1..𝑛); obligation or not. If it is, we have another chance to be compliant
(2) if 𝑎 = 𝑋𝑙 and 𝑋 ∈ {O, P}, then with the rule; if it is not, the whole norm cannot be complied with
(a) if 𝛼 ∈ 𝑅𝑠 then +Δ𝑋 𝑙 ∈ 𝑃 (1..𝑛); as the previous step was indeed our last possibility to be compli-
(b) if 𝛼 ∈ 𝑅𝑑 then +𝜕𝑋 𝑙 ∈ 𝑃 (1..𝑛); ant. These concepts will be formalised and further explained in the
(3) if 𝑎 = ¬𝑋𝑙, then −𝜕𝑋 𝑙 ∈ 𝑃 (1..𝑛) with 𝑋 ∈ {O, P}; following with Definition 2.4 and proof conditions ±⊤ and ±⊥.
(4) if 𝑎 is 𝑎𝑐𝑡𝑖𝑣𝑒 (𝛼), then +𝜕EA 𝛼 ∈ 𝑃 (1..𝑛); We are now ready to provide the strict and defeasible proof
(5) if 𝑎 is ¬𝑎𝑐𝑡𝑖𝑣𝑒 (𝛼), then −𝜕EA 𝛼 ∈ 𝑃 (1..𝑛); conditions for constitutive statements [1]. In the following, we
(6) if 𝑎 is 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑 (𝛼), then +⊤𝛼 ∈ 𝑃 (1..𝑛); shall omit the explanation of some negative proof conditions, as
(7) if 𝑎 is ¬𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑 (𝛼), then −⊤𝛼 ∈ 𝑃 (1..𝑛); they can be obtained via the strong negation principle.
(8) if 𝑎 is 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑 (𝛼), then +⊥𝛼 ∈ 𝑃 (1..𝑛); +Δ𝑙: If 𝑃 (𝑛 + 1) = +Δ𝑙 then
(9) if 𝑎 is ¬𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑 (𝛼), then −⊥𝛼 ∈ 𝑃 (1..𝑛); (1) 𝑙 ∈ 𝐹 , or
Definition 2.2 (Blank). Given a deontic defeasible theory 𝐷, a (2) ∃𝛼 ∈ 𝑅sC [𝑙] is applicable.
proof 𝑃, and a rule 𝛼, we say that a literal 𝑎 ∈ 𝐴(𝛼) is a blank for 𝛼 A constitutive statement is strictly proven if it is either a fact, or
at 𝑃 (𝑛 + 1) iff there exists an applicable, strict rule for it. Note that inconsistencies
(1) if 𝑎 ∈ Lit, then within a deontic defeasible theory can arise only if derived from
(a) if 𝛼 ∈ 𝑅𝑠 then −Δ𝑎 ∈ 𝑃 (1..𝑛); the strict part of the theory.
(b) if 𝛼 ∈ 𝑅𝑑 then −𝜕𝑎 ∈ 𝑃 (1..𝑛); −Δ𝑙: If 𝑃 (𝑛 + 1) = −Δ𝑙 then
(2) if 𝑎 = 𝑋𝑙 and 𝑋 ∈ {O, P}, then (1) 𝑙 ∉ 𝐹 and
(a) if 𝛼 ∈ 𝑅𝑠 then −Δ𝑋 𝑙 ∈ 𝑃 (1..𝑛); (2) ∀𝛼 ∈ 𝑅sC [𝑙] is discarded.
(b) if 𝛼 ∈ 𝑅𝑑 then −𝜕𝑋 𝑙 ∈ 𝑃 (1..𝑛); A constitutive statement is strictly refuted (disproven, or rejected)
(3) if 𝑎 = ¬𝑋𝑙, then +𝜕𝑋 𝑙 ∈ 𝑃 (1..𝑛) with 𝑋 ∈ {O, P}; if it is not a fact and all the strict rules for it are discarded.
(4) if 𝑎 is 𝑎𝑐𝑡𝑖𝑣𝑒 (𝛼), then −𝜕EA 𝛼 ∈ 𝑃 (1..𝑛);
+𝜕𝑙: If 𝑃 (𝑛 + 1) = +𝜕𝑙 then
(5) if 𝑎 is ¬𝑎𝑐𝑡𝑖𝑣𝑒 (𝛼), then +𝜕EA 𝛼 ∈ 𝑃 (1..𝑛);
(1) +Δ𝑙 ∈ 𝑃 (1..𝑛), or
(6) if 𝑎 is 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑 (𝛼), then +⊤𝛼 ∈ 𝑃 (1..𝑛);
(2) −Δ∼𝑙 ∈ 𝑃 (1..𝑛) and
(7) if 𝑎 is ¬𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑 (𝛼), then −⊤𝛼 ∈ 𝑃 (1..𝑛); C [𝑙] is applicable and
(8) if 𝑎 is 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑 (𝛼), then +⊥𝛼 ∈ 𝑃 (1..𝑛); (3) ∃𝛼 ∈ 𝑅sd
C
(4) ∀𝛽 ∈ 𝑅 [∼𝑙] either
(9) if 𝑎 is ¬𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑 (𝛼), then −⊥𝛼 ∈ 𝑃 (1..𝑛);
(1) 𝛽 is discarded, or
Based on these two definitions, we are now ready to define when (2) ∃𝜁 ∈ 𝑅 C [𝑙] s.t. 𝜁 is applicable and 𝜁 > 𝛽.
a rule is applicable/discarded. Note that the definition of being
A constitutive statement is defeasibly proven if either was already
discarded is obtained by applying the strong negation principle2 to
strictly proven, or the opposite is not and there exists an applicable
its positive counterpart.
rule for the conclusion itself such that every opposite rule is either
Definition 2.3 (Applicable & Discarded). Assume a deontic defea- discarded or defeated by an applicable supporting rule. Note that
sible theory 𝐷. whilst 𝛽 and 𝜁 can be defeaters, 𝛼 may not.
We say that a rule 𝛼 ∈ 𝑅 C ∪ 𝑅 P is applicable at 𝑃 (𝑛 + 1) iff for all −𝜕𝑙: If 𝑃 (𝑛 + 1) = −𝜕𝑙 then
𝑎 ∈ 𝐴(𝛼) then 𝑎 is a trigger for 𝛼 at 𝑃 (𝑛 + 1). (1) −Δ𝑙 ∈ 𝑃 (1..𝑛) and either
We say that 𝛼 is discarded at 𝑃 (𝑛 + 1) iff there exists 𝑎 ∈ 𝐴(𝛼) (2) +Δ∼𝑙 ∈ 𝑃 (1..𝑛) or
such that 𝑎 is a blank for 𝛼 at 𝑃 (𝑛 + 1). (3) ∀𝛼 ∈ 𝑅sdC [𝑙] is discarded, or
For obligation rules, we say that 𝛼 ∈ 𝑅 O is applicable at index 𝑖 (4) ∃𝛽 ∈ 𝑅 C [∼𝑙] s.t.
and 𝑃 (𝑛+1) iff (1) for all 𝑎 ∈ 𝐴(𝛼) then 𝑎 is a trigger for 𝛼 at 𝑃 (𝑛+1), (1) 𝛽 is applicable and
and (2) for all 𝑐 𝑗 ∈ 𝐶 (𝛼), 𝑗 < 𝑖, (2.1) if 𝛼 ∈ 𝑅s then +ΔO𝑐 𝑗 ∈ 𝑃 (1..𝑛),
(2) ∀𝜁 ∈ 𝑅 C [𝑙] either 𝜁 is discarded, or 𝜁 ≯ 𝛽.
if 𝛼 ∈ 𝑅d then +𝜕O𝑐 𝑗 ∈ 𝑃 (1..𝑛), and (2.2) +𝜕∼𝑐 𝑗 ∈ 𝑃 (1..𝑛).
A constitutive statement is defeasibly refuted if it was not strictly
We say that 𝛼 ∈ 𝑅 O is discarded at index 𝑖 and 𝑃 (𝑛 + 1) iff
proven, and either the opposite statement was strictly proven, or
either (1) there exists 𝑎 ∈ 𝐴(𝛼) such that 𝑎 is a blank for 𝛼 at
all the applicable rules are either discarded or there exists an ‘unde-
𝑃 (𝑛 + 1), or (2) there exists 𝑐 𝑗 ∈ 𝐶 (𝛼), 𝑗 < 𝑖, such that (2.1) if 𝛼 ∈ 𝑅s
feated’ opposite, applicable rule.
then −ΔO𝑐 𝑗 ∈ 𝑃 (1..𝑛), if 𝛼 ∈ 𝑅d then −𝜕O𝑐 𝑗 ∈ 𝑃 (1..𝑛), or (2.2)
Proof conditions for obligations and permissions are updated
+𝜕𝑐 𝑗 ∈ 𝑃 (1..𝑛).
versions of the ones proposed in [5, 7].
Intuitively, a rule is applicable when all its premises are proven. +ΔO𝑙: If 𝑃 (𝑛 + 1) = +ΔO𝑙 then
Moreover, for an obligation rule, we also need to take into consider- (1) ∃𝛼 ∈ 𝑅sO [𝑙, 𝑖] is applicable at index 𝑖.
ation its ⊗-chain. A rule being applicable at a certain index greater
than 1 reflects the idea of compensation: all the previous elements −ΔO𝑙: If 𝑃 (𝑛 + 1) = −ΔO𝑙 then
(1) ∀𝛼 ∈ 𝑅sO [𝑙, 𝑖] is discarded at index 𝑖.
2 The strong negation principle is closely related to the function that simplifies a
formula by moving all negations to the inner most position in the resulting formula, For the defeasible derivation, we need to consider attacks even from
and replaces its positive tags with negative ones, and the other way around [2]. permissive rules.
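The trigger/blank machinery of Definitions 2.1–2.3 can be pictured with a small sketch: given the tagged conclusions proved so far, a rule is applicable when every antecedent is a trigger and discarded when some antecedent is a blank. The Python fragment below illustrates only the defeasible case; the tag encoding and names are assumptions of the sketch, not the paper's notation.

# Illustrative check of "applicable"/"discarded" for a defeasible rule.
# 'proved' maps tagged conclusions (e.g. ("+d", "a"), ("-d", ("O", "q")))
# to membership: an antecedent is a trigger if +∂ holds for it, a blank
# if -∂ holds for it.

def is_trigger(antecedent, proved):
    return ("+d", antecedent) in proved

def is_blank(antecedent, proved):
    return ("-d", antecedent) in proved

def applicable(rule_body, proved):
    return all(is_trigger(a, proved) for a in rule_body)

def discarded(rule_body, proved):
    return any(is_blank(a, proved) for a in rule_body)

proved = {("+d", "a"), ("+d", ("O", "p")), ("-d", "b")}
print(applicable(["a", ("O", "p")], proved))  # True
print(discarded(["a", "b"], proved))          # True: b is a blank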


+𝜕O𝑙: If 𝑃 (𝑛 + 1) = +𝜕O𝑙 then of the content of the prescriptive behaviour is derivable. Therefore,
(1) +ΔO𝑙 ∈ 𝑃 (1..𝑛), or given 𝐷 ⊢ +𝜕O𝑙, a violation is 𝐷 ⊢ +𝜕∼𝑙. Naturally, when consider-
(2) −Δ𝑋 ∼𝑙 ∈ 𝑃 (1..𝑛) with 𝑋 ∈ {O, P} and ing chains of compensatory behaviours, we would like not just to
(3) ∃𝛼 ∈ 𝑅sd O [𝑙, 𝑖] is applicable at index 𝑖 and discern norms that have been complied with against norms that
(4) ∀𝛽 ∈ 𝑅 [∼𝑙, 𝑗] ∪ 𝑅 P [∼𝑙] either
O have been violated, but we also want to discriminate norms that
(1) 𝛽 is discarded (at index 𝑗), or have never been violated against norms that have been complied
(2) ∃𝜁 ∈ 𝑅 O [𝑙, 𝑘] s.t. with but where at least one compensation has occurred.
𝜁 is applicable at index 𝑘 and 𝜁 > 𝛽. +⊤𝛼: If 𝑃 (𝑛 + 1) = +⊤𝛼 then
Note that (i) in Condition (4) 𝛽 can be a permission rule as explicit, (1) ∃𝑖.𝛼 ∈ 𝑅 O [𝑙, 𝑖] s.t.
opposite permissions represent exceptions to obligations, whereas (1) If (1) 𝛼 is applicable at index 𝑖 and 𝑃 (1..𝑛), and
𝜁 must be an obligation rule as a permission rule cannot reinstate an (2) +𝜕O𝑙 ∈ 𝑃 (1..𝑛),
obligation, and that (ii) 𝑙 may appear at different positions (indices (2) Then +𝜕𝑙 ∈ 𝑃 (1..𝑛).
𝑖, 𝑗, and 𝑘) within the three ⊗-chains. A norm (obligation rule) is complied with if either it is not appli-
−𝜕O𝑙: If 𝑃 (𝑛 + 1) = −𝜕O𝑙 then cable, or there exists an element of its ⊗-chain that is an in force
(1) −ΔO𝑙 ∈ 𝑃 (1..𝑛), and either obligation and the content of the obligation holds.
(2) +Δ𝑋 ∼𝑙 ∈ 𝑃 (1..𝑛) with 𝑋 ∈ {O, P}, or −⊤𝛼: If 𝑃 (𝑛 + 1) = −⊤𝛼 then
(3) ∀𝛼 ∈ 𝑅sd O [𝑙, 𝑖] is discarded at index 𝑖, or
(1) 𝛼 is applicable at index 1 and 𝑃 (1..𝑛),
(4) ∃𝛽 ∈ 𝑅 O [∼𝑙, 𝑗] ∪ 𝑅 P [∼𝑙] (2) +𝜕O𝑐 1 ∈ 𝑃 (1..𝑛),
(1) 𝛽 is applicable (at index 𝑗), and (3) +𝜕∼𝑐 1 ∈ 𝑃 (1..𝑛), and
(2) ∀𝜁 ∈ 𝑅 O [𝑙, 𝑘] either (4) ∀𝑐𝑖 ∈ 𝐶 (𝛼), with 𝑖 ≥ 2
𝜁 is discarded at index 𝑘 or 𝜁 ≯ 𝛽. (1) If 𝛼 is applicable at index 𝑖 and 𝑃 (1..𝑛), and
Follow the proof conditions for permissions. +𝜕O𝑐𝑖 ∈ 𝑃 (1..𝑛),
+ΔP𝑙: If 𝑃 (𝑛 + 1) = +ΔP𝑙 then (2) Then +𝜕∼𝑐𝑖 ∈ 𝑃 (1..𝑛).
(1) +ΔO𝑙 ∈ 𝑃 (1..𝑛), or A norm is not complied with when it is applicable and all the in force
(2) ∃𝛼 ∈ 𝑅sP [𝑙, 𝑖] is applicable. elements of its ⊗-chain have been violated.
We can derive a permission if it was already proven as an obligation. +⊥𝛼: If 𝑃 (𝑛 + 1) = +⊥𝛼 then
−ΔP𝑙: If 𝑃 (𝑛 + 1) = −ΔP𝑙 then (1) 𝛼 is applicable at index 1 and 𝑃 (1..𝑛), and
(1) −ΔO𝑙 ∈ 𝑃 (1..𝑛) and (2) ∃𝑐𝑖 ∈ 𝐶 (𝛼), 𝑖 ≥ 2, 𝛼 is applicable at index 𝑖 and 𝑃 (1..𝑛).
(2) ∀𝛼 ∈ 𝑅sP [𝑙, 𝑖] is discarded. A violation occurs if at least one in force element of the norm’s ⊗-
Permissive defeasible derivations: chain has been violated. By Definition 2.3 and the fact that the norm
+𝜕P𝑙: If 𝑃 (𝑛 + 1) = +𝜕P𝑙 then is applicable both at index 1 by Condition (1) and at an index greater
(1) +ΔP𝑙 ∈ 𝑃 (1..𝑛), or than 1 by Condition (2), it follows that a(t least one) violation has
(2) −ΔO ∼𝑙 ∈ 𝑃 (1..𝑛), and either occurred. Note that, trivially, −⊤𝛼 implies +⊥𝛼, but not the other
(1) +𝜕O𝑙 ∈ 𝑃 (1..𝑛), or way around.
(2) ∃𝛼 ∈ 𝑅sd P [𝑙] is applicable and −⊥𝛼: If 𝑃 (𝑛 + 1) = −⊥𝛼 then either
O
(3) ∀𝛽 ∈ 𝑅 [∼𝑙, 𝑗] either (1) 𝛼 is discarded at index 1 and 𝑃 (1..𝑛), or
(1) 𝛽 is discarded at index 𝑗, or (2) 𝛼 ∈ 𝑅 O [𝑐, 2] is discarded at index 2 and 𝑃 (1..𝑛).
(2) ∃𝜁 ∈ 𝑅 P [𝑙]∪ ∈ 𝑅 O [𝑙, 𝑘] s.t. A norm has never been violated if either it was never applicable by
𝜁 is applicable (at index 𝑘) and 𝜁 > 𝛽. Condition (1), or the first element is an in-force, and complied with
Condition (1) states that if something is obligatory then is permit- obligation (guaranteed by Condition (2) as the norm was not ap-
ted. Condition (3) considers as possible counter-arguments only plicable at index 2). Symmetrically to what commented previously,
obligation rules as situations where both P𝑙 and P∼𝑙 hold are legal. −⊥𝛼 implies +⊤𝛼, but not the other way around.
−𝜕P𝑙: If 𝑃 (𝑛 + 1) = −𝜕P𝑙 then
Definition 2.4. Given a deontic defeasible theory 𝐷 and an appli-
(1) −ΔP𝑙 ∈ 𝑃 (1..𝑛), and either
cable at index 1, obligation rule 𝛼 : 𝐴(𝛼) ⇒O 𝑐 1 ⊗ · · · ⊗ 𝑐𝑚 , we say
(2) +ΔO ∼𝑙 ∈ 𝑃 (1..𝑛), or
that 𝛼 is
(1) −𝜕O𝑙 ∈ 𝑃 (1..𝑛), and
(2) ∀𝛼 ∈ 𝑅sd P [𝑙] is discarded or Strongly complied with: (i) −⊥𝛼, and (ii) +𝜕O𝑐 1 .
(3) ∃𝛽 ∈ 𝑅 O [∼𝑙, 𝑗] s.t. Weakly complied with: (i) +⊤𝛼 and (ii) +⊥𝛼.
(1) 𝛽 is applicable at index 𝑗 and Violated: +⊥𝛼.
Not complied with: −⊤𝛼.
(2) ∀𝜁 ∈ 𝑅 P [𝑙]∪ ∈ 𝑅 O [𝑙, 𝑘] either
𝜁 is discarded (at index 𝑘), or 𝜁 ≯ 𝛽. A norm to be complied with, violated, or not complied with has
We now have the formal tools to define compliance and violations. to be applicable: note that we could not consider such a requirement,
We adhere to the principle that a violation occurs when the logics but a discarded norm is always vacuously applicable.
proves both (i) that the prescription holds and (ii) that the opposite A norm is violated when at least one violation has occurred.
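The compliance reading of an ⊗-chain can be summarised operationally: walk the chain while each obligation is in force but violated; the norm is complied with as soon as some in-force element is satisfied, and it is violated whenever any in-force element was breached. The sketch below conveys this intuition only; the state encoding is an assumption of the illustration, not the proof conditions ±⊤ and ±⊥ given above.

# Hedged sketch of the compliance reading of an ⊗-chain c1 ⊗ ... ⊗ cm.
# 'state' maps each element to (in_force, holds).

def chain_status(chain, state):
    violated = False
    for c in chain:
        in_force, holds = state[c]
        if not in_force:              # no further obligation: last chance missed
            return {"complied": False, "violated": violated}
        if holds:                     # this element compensates earlier breaches
            return {"complied": True, "violated": violated}
        violated = True               # breached, move on to the compensation
    return {"complied": False, "violated": True}

# η : ... ⇒O z ⊗ w ⊗ v  with z violated and w fulfilled (cf. Example 2.12)
state = {"z": (True, False), "w": (True, True), "v": (True, False)}
print(chain_status(["z", "w", "v"], state))  # complied (weakly) and violated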


A norm is strongly complied with if it has never been violated: If 𝛽 is applicable at (1..𝑛), Then 𝛽 > 𝛼.
the theory thus achieves the very first literal of the ⊗-expression.
In many real-life contexts it is indeed important to identify such
A norm is weakly complied with when, on the contrary, it is com-
undefeated rules, but in some other contexts such conditions may
plied with but also violated; this means that all the first 𝑗 −1 elements
be too restrictive as they fail to include rules that are fundamental
of the ⊗-expression were violated obligations, whereas the 𝑗-th ele-
in proving a given claim. Consider indeed theory 𝐷 of Example 2.7:
ment is an in force obligation and the theory proves the content of
𝐷 ⊬ +𝜕PA 𝛼 2 according to Definition 2.8 as it is defeated by 𝛽 2 even
the obligation.
if without 𝛼 2 it is not possible to prove +𝜕𝑙, as no other rule defeats
This is exactly what marks the difference between being weak
𝛽 1 . Consequently, our next definition will identify which rules are
complied with and being not complied with . In fact, a norm is not
essential in proving a given claim, and we will call them effectively
complied with in two cases. The former, and more straightforward,
active rules.
case is when all elements (of the ⊗-expression) were in force, vio-
lated obligations. The latter case is when the first 𝑗 − 1 elements Definition 2.10 (Effectively Active).
were in force, violated obligations, whilst the 𝑗-th element is not +𝜕EA 𝛼: If 𝑃 (𝑛 + 1) = +𝜕EA 𝛼 then
proven as obligation (i.e. 𝐷 ⊢ −𝜕O𝑐 𝑗 ). Accordingly, it does not mat- (1) 𝛼 is applicable at 𝑃 (1..𝑛),
ter whether the theory achieves it or not, as the 𝑗 − 1-th element (2) +𝜕𝑋 𝐶 (𝛼) ∈ 𝑃 (1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}, and
was the last chance to comply with the norm. (3) 𝐷 \ {𝛼 } ⊢ −𝜕𝑋 𝐶 (𝛼).
We now enter in the final part of this section, and we shift
our attention from compliance to active rules. Next Definition 2.5 This definition includes the concept of sine qua non: as per Con-
identifies those applicable rules whose conclusion was actually dition (3), 𝛼 is indeed a requisite in proving its conclusion 𝑙 as the
proved. We name such rules provisionally active. theory without 𝛼 cannot prove 𝑙. Note that the contrapositive ver-
sion of Condition (3), (3 ′ ) 𝐷 \ {𝜁 ∈ 𝑅 [𝐶 (𝛼)] | 𝜁 ≠ 𝛼 } ⊢ +𝜕𝑙 would
Definition 2.5 (Provisionally Active). give us slightly different results3 .
+𝜕PA 𝛼: If 𝑃 (𝑛 + 1) = +𝜕PA 𝛼 then
Definition 2.11 (Effectively Inactive).
(1) 𝛼 is applicable at 𝑃 (1..𝑛), and
(2) +𝜕𝑋 𝐶 (𝛼) ∈ 𝑃 (1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}. −𝜕EA 𝛼: If 𝑃 (𝑛 + 1) = −𝜕EA 𝛼 then either
(1) 𝛼 is discarded at 𝑃 (1..𝑛), or
Definition 2.6 (Provisionally Inactive). (2) −𝜕𝑋 𝐶 (𝛼) ∈ 𝑃 (1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}, or
−𝜕PA 𝛼: If 𝑃 (𝑛 + 1) = −𝜕PA 𝛼 then either (3) 𝐷 \ {𝛼 } ⊢ +𝜕𝑋 𝐶 (𝛼).
(1) 𝛼 is discarded at 𝑃 (1..𝑛), or
(2) −𝜕𝐶 (𝛼) ∈ 𝑃 (1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}. We end this section by providing an example that illustrates a
proof in our logical apparatus. The explanation hereafter is not com-
A fairly straightforward limitation of Definition 2.5 is that it fails plete as the point is to explain every key notion without slavishly
to identify which rules ‘effectively contribute’ in proving a claim: plete as the point is to explain every key notion without slavishly
a rule may very well be applicable, but not really necessary as the describing the whole procedure.
following example shows. Example 2.12. Let 𝐷 = (𝐹 = {𝑎, 𝑏, 𝑐, 𝑑 }, 𝑅, >= {(𝛼, 𝛾), (𝜒, 𝜁 ), (𝜂, 𝜇)})
be the deontic defeasible theory such that
Example 2.7. Assume that, in a deontic defeasible theory 𝐷, the
𝑅 = {𝛼 : 𝑎 ⇒ 𝑒 𝛽 :𝑏 ⇒𝑒 𝛾 : 𝑐 ⇒ ∼𝑒 𝜑 :𝑑 ⇒𝑞
following rules are all applicable: 𝛼 1 , 𝛼 2 , and 𝛼 3 are for 𝑙, while 𝛽 1
and 𝛽 2 are for ∼𝑙. It also holds that 𝛽 1 > 𝛼 1 , 𝛼 2 > 𝛽 1 , 𝛽 2 > 𝛼 2 and 𝜈 : 𝑐 ⇒ ∼𝑧 𝜎 : 𝑎, 𝑏 ⇒ 𝑤 𝜌 : ∼𝑎 ⇒ 𝑓
𝛼 3 > 𝛽 2 . In this context, rule 𝛼 1 does not contribute in defeating any 𝜒 : 𝑒 ⇒O ∼𝑞 𝜇 : 𝑑, 𝑞 ⇒O ∼𝑤 ⊗ 𝑠 𝜁 : 𝑎, 𝑑 ⇒O 𝑞
𝛽-like rule (it is actually defeated by 𝛽 1 ), and thus its contribution 𝜂 : 𝑎, active(𝛼), violated( 𝜒) ⇒O 𝑧 ⊗ 𝑤 ⊗ 𝑣 }.
is possibly limited/superfluous. On the other hand, 𝛼 2 is indeed
defeated by 𝛽 2 , but its role is fundamental because it is the only As 𝑎 is a fact, we prove +Δ𝑎 and, in cascade, +𝜕𝑎 as well as −𝜕∼𝑎.
𝛼-like rule that defeats 𝛽 1 . (The same applies for the other three literals in the set of facts.)
This makes 𝛼 applicable as all its antecedents are proved, and 𝜌
We hence try to identify those rules that are undefeated (by any discarded as its sole antecedent is refuted. For the same reasoning,
applicable, opposite rule). 𝛽, 𝛾, 𝜑, 𝜈, 𝜎 are all applicable, while 𝜒, 𝜇, 𝜁 , and 𝜂 are applicable
Definition 2.8 (Provisionally Active v2). at index 1. Since 𝛼 > 𝛾 and since no other applicable rule for 𝑒 is
+𝜕PA 𝛼: If 𝑃 (𝑛 + 1) = +𝜕PA 𝛼 then stronger than 𝛾, we have that 𝛼 is effectively active (𝐷 ⊢ +𝜕EA 𝛼): in
(1) 𝛼 is applicable at 𝑃 (1..𝑛), fact, the theory without 𝛼 would not prove +𝜕𝑒. This is not the case
(2) +𝜕𝐶 (𝛼) ∈ 𝑃 (1..𝑛), and for 𝛽, but we can state that 𝛽 is provisionally active as no applicable
(3) ∀𝛽 ∈ 𝑅 [∼𝐶 (𝛼)]. rule for ∼𝑒 is stronger than 𝛽. Symmetrically, 𝛾 is provisionally
If 𝛽 is applicable at (1..𝑛), Then 𝛽 ≯ 𝛼. inactive. Both 𝜒 and 𝜁 are applicable at index 1, and, since 𝜒 > 𝜁 ,
we conclude 𝐷 ⊢ +𝜕O ∼𝑞. This makes 𝜁 vacuously complied with.
Definition 2.9 (Provisionally Inactive v2). Given that the theory also proves +𝜕𝑞 and 𝜒’s ⊗-chain consists of
−𝜕PA 𝛼: If 𝑃 (𝑛 + 1) = −𝜕PA 𝛼 then either only one element, we conclude that 𝜒 is not just violated but also
(1) 𝛼 is discarded at 𝑃 (1..𝑛), or 3 Inboth Conditions (3) and (3′ ), we used the notational simplification 𝐷 \ Γ to denote
(2) −𝜕𝐶 (𝛼) ∈ 𝑃 (1..𝑛), or the revision process of removing Γ from 𝑅 , and adjusting the superiority relation
(3) ∃𝛽 ∈ 𝑅 [∼𝐶 (𝛼)]. accordingly.
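The sine qua non test behind Definition 2.10 can be illustrated by comparing derivability with and without the candidate rule. The sketch below is deliberately naive: it ignores attacks and the superiority relation, so it only conveys the D \ {α} comparison, not the actual proof conditions; all names are assumptions of the illustration.

# Sketch of the "effectively active" (sine qua non) idea: a rule counts as
# effectively active if it is applicable, its conclusion is derived, and
# the theory without that rule no longer derives the conclusion.
# derive() is a naive monotonic closure used only for the comparison.

def derive(facts, rules):
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return known

def effectively_active(rule, facts, rules):
    body, head = rule
    with_rule = derive(facts, rules)
    without_rule = derive(facts, [r for r in rules if r is not rule])
    applicable = all(b in with_rule for b in body)
    return applicable and head in with_rule and head not in without_rule

alpha = (["a"], "e")
beta = (["b"], "e")
rules = [alpha, beta]
print(effectively_active(alpha, {"a"}, rules))        # True: only α yields e
print(effectively_active(alpha, {"a", "b"}, rules))   # False: β also yields e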


not complied with (𝐷 ⊢ +⊥𝜒 and 𝐷 ⊢ −⊤𝜒). This, in turn, makes Algorithm 1: Compliance
𝜂 applicable at index 1, and since there are no obligation rules for Input: A deontic defeasible theory 𝐷
∼𝑧, we conclude 𝐷 ⊢ +𝜕O𝑧. Again, as the theory proves +𝜕∼𝑧, then Output: The defeasible meta-extension 𝐸 (𝐷)
𝜂 is violated, 𝐷 ⊢ +⊥𝜂. This time the ⊗-chain has other elements; 1 ±𝜕2 ← ∅ with 2 ∈ {C, O, P};
consequently, 𝜂 is now applicable at index 2. The second element 2 ±⊤ ← ∅; ±⊥ ← ∅; ±𝜕A ← ∅; ±𝜕EA ← ∅;
of 𝜂’s ⊗-chain is 𝑤: 𝜇, the obligation rule for ∼𝑤, is applicable but 3 InitialiseHerbrandBase(𝐻 𝐵);
weaker than 𝜂. Lastly, as 𝐷 ⊢ +𝜕𝑤 through 𝜎, we can conclude that 4 for 2𝑙 ∈ Lit ∪ ModLit do 𝑅 2 [𝑙 ]𝑖𝑛𝑓 𝑑 ← ∅; // with 2 ∈ {C, O, P}
𝜂 is weakly complied with (𝐷 ⊢ +⊤𝜂). In this theory, no norm was 5 for 𝛼 ∈ 𝑅 O do Initialise 𝛼 [𝑘 ] [2] to null, 𝑘 = length(𝐶 (𝛼));
strongly complied with. 6 for 𝑙 ∈ 𝐹 do Prove(𝑙, C); Refute(∼𝑙, C);
7 repeat
3 ALGORITHMS 8
± ← ∅;
𝜕2
for 2𝑙 ∈ 𝐻 𝐵, 2 ∈ {C, O, P} do
The algorithms presented in this section, given a deontic defeasible 9
10 if 𝑅 2 [𝑙 ] = ∅ then Refute(𝑙, 2);
theory as input, compute: (1) the defeasible extension of the theory,
11 if ∃𝛼 ∈ +𝜕A ∩ 𝑅 C [𝑙 ] then // 𝑙 is a non-deontic
(2) which rules are applicable/discarded, (3) which rules are effec-
literal
tively active/inactive, (4) which norms are (not) complied with, and
12 𝑅 C [∼𝑙 ]𝑖𝑛𝑓 𝑑 ← 𝑅 C [∼𝑙 ]𝑖𝑛𝑓 𝑑 ∪ {𝛽 ∈ 𝑅 C [∼𝑙 ] | 𝛼 > 𝛽 };
lastly (5) which norms are (never) violated.
13 if {𝛽 ∈ 𝑅 C [∼𝑙 ] | 𝛽 > 𝛼 } = ∅ then
The extension of a defeasible theory is, in a sketch, all that 14 Refute(∼𝑙, C);
the theory can prove and refute (disprove). Typically, a defeasible 15 if 𝑅 [∼𝑙 ] \ 𝑅 [∼𝑙 ]𝑖𝑛𝑓 𝑑 = ∅ then
extension is limited to the (deontic) literals of the theory itself, 16 Prove(𝑙, C);
but as in this case we are interested in understanding which rules 17 Active;
are applicable and which norms are (effectively) active, complied 18 end
with, violated, we shall extend the standard definition to include 19 end
such nuances. In the algorithms, we will associate proof tag +𝜕A to 20 end
applicable rules, and −𝜕A to discarded rules (thus 𝐷 ⊢ ±𝜕A 𝛼 for a 21 if ∃𝛼 + 𝜕A ∩ ∈ 𝑅 O [𝑙, 𝑖 ] ∧ ∀𝑗 < 𝑖.(𝛼 [ 𝑗 ] [1] =
theory 𝐷 and a rule 𝛼). + ∧ 𝛼 [ 𝑗 ] [2] = −) then
Note that the algorithms do no compute proof conditions for 22 𝑅 O [∼𝑙 ]𝑖𝑛𝑓 𝑑 ← 𝑅 O [∼𝑙 ]𝑖𝑛𝑓 𝑑 ∪ {𝛽 ∈ 𝑅𝑋 [∼𝑙 ] | 𝛼 > 𝛽 };
Definitions 2.5-2.9 for space reasons, as such an addition is straight- // with 𝑋 = {O, P}
forward and does not result in any increase of the complexity of the 23 if {𝛽 ∈ 𝑅 [∼𝑙 ] | 𝛽 > 𝛼 } = ∅ then
algorithms themselves. Also, for space reasons, the algorithms pre- 24 Refute(∼𝑙, 2);
sented in this work do not compute the strict part of the extension: 25 if (𝑅 O [∼𝑙 ] ∪ 𝑅 P [∼𝑙 ]) \ 𝑅 [∼𝑙 ]𝑖𝑛𝑓 𝑑 = ∅ then
to include that is a mundane task as all the defeasibility checks are 26 Prove(𝑙, O);
not taken into consideration. 27 Prove(𝑙, P);
Given a deontic defeasible theory 𝐷, 𝐻 𝐵𝐷 is the set of literals 28 Refute(∼𝑙, P);
such that the literal or its complement appears in 𝐷, where ‘appears’ 29 Active;
means that is a sub-formula of a literal occurring in the theory. The 30 end
deontic Herbrand Base of 𝐷 is 𝐻 𝐵 = {𝑋𝑙 | 𝑙 ∈ 𝐻 𝐵𝐷 ∧ 𝑋 ∈ {O, P}}. 31 end
Note that we do not consider reference literals in the Herbrand 32 end
Base. Accordingly, the extension of a deontic defeasible theory is 33 if ∃𝛼 ∈ +𝜕A ∩ 𝑅 P [𝑙 ] then
defined as follows. 34 𝑅 P [∼𝑙 ]𝑖𝑛𝑓 𝑑 ← 𝑅 P [∼𝑙 ]𝑖𝑛𝑓 𝑑 ∪ {𝛽 ∈ 𝑅 O [∼𝑙 ] | 𝛼 > 𝛽 };
// with 𝑋 = {O, P}
Definition 3.1. Given a deontic defeasible theory 𝐷 = (𝐹, 𝐷, >), 35 if {𝛽 ∈ 𝑅 [∼𝑙 ] | 𝛽 > 𝛼 } = ∅ then
we say that the extension is 𝐸 (𝐷) = (±𝜕C, ±𝜕O, ±𝜕P, ±𝜕A, ±𝜕EA, ±⊤, 36 Refute(∼𝑙, P);
±⊥), where ±𝜕2 = {𝑙 ∈ 𝐻𝐵𝐷 : 𝐷 ⊢ ±𝜕2𝑙 } with 2 ∈ {C, O, P}, 37 Refute(∼𝑙, O);
±𝜕3 = {𝛼 ∈ 𝑅| 𝐷 ⊢ ±𝜕3 𝛼 } with 3 = {A, EA}, ±⊤ = {𝛼 ∈ 𝑅| 𝐷 ⊢ 38 if (𝑅 P [∼𝑙 ] ∪ 𝑅 O [∼𝑙 ]) \ 𝑅 [∼𝑙 ]𝑖𝑛𝑓 𝑑 = ∅ then
±⊤𝛼 }, and ±⊥ = {𝛼 ∈ 𝑅| 𝐷 ⊢ ±⊥𝛼 }. 39 Prove(𝑙, P);
We say that two theories 𝐷 and 𝐷 ′ are equivalent iff 𝐸 (𝐷)+𝐸 (𝐷 ′ ) 40 Active;
(i.e., they have the same extension). 41 end
42 end
The next definition extends the concept of complement presented 43 end
in Section 2, and its sole purpose is to ease the notation of the 44 end
algorithms by establishing the logical connection among proved 45 ±𝜕2 ← ±𝜕2 ∪ 𝜕2 ±;
and refuted literals. 46 until 𝜕2 = ∅ and 𝜕2 = ∅;
+ −

f of literal 𝑋𝑙 as
Definition 3.2. We define the complement 𝑋𝑙 47 return 𝐸 (𝐷) = (±𝜕C , ±𝜕O , ±𝜕P , ±⊤, ±⊥, ±𝜕A , ±𝜕EA )

f = {∼𝑙 }.
• Trivially if 𝑋 = C, then 𝑋𝑙
f = {O∼𝑙, ¬O𝑙, P∼𝑙 }.
• O𝑙 A few important comments for the reader before presenting the
e = {¬P𝑙, O∼𝑙 }.
• P𝑙 algorithms.


The algorithms presented here determine the extension of a whether there exists an applicable, opposite rule that is not defeated
deontic defeasible theory by computing, at each iteration step, a by any other applicable rule (for 2𝑙) than 𝛼.
simpler theory than the one at the previous step. By simpler, we
mean that, by proving and refuting literals and standard rules,
we can progressively simplify the rules of the theory itself: (i) by Procedure Prove
progressively eliminating elements from the antecedents of rules, Input: 𝑙 ∈ Lit, 2 ∈ {C, O, P}
and (ii) by eliminating rules that we know are either discarded, or 1
+ ← 𝜕 + ∪ {𝑙 };
𝜕2 2
defeated. 2 f
𝐻 𝐵 ← 𝐻 𝐵 \ ( {2𝑙 } ∪ 2𝑙);
Important for this goal is to note that, trivially, a rule with empty 3 −𝜕3 ← −𝜕3 ∪ {𝜁 ∈ 𝑅 | 2𝑙 f ∈ 𝐴(𝜁 ) }; // with 3 = {A, EA}
antecedent is vacuously applicable. We thus want to achieve, for 4 f ⊆ 𝐴(𝜁 ) };
>←> \{ (𝜁 ,𝜓 ), (𝜓, 𝜁 ) ∈> | 2𝑙
any rule, the status where the antecedent’s rule is empty (as this 5 switch 2 do
will simplify solving the superiorities), and we will do so by pro- 6 case 2 = C do
gressively eliminating elements from the antecedent as soon as they 7 +𝜕A ← +𝜕A ∪ {𝜁 ∈ 𝑅 | 𝐴(𝜁 ) \ {𝑙 } = ∅ };
satisfy the proper condition in Definitions 2.1 and 2.3. 8 𝑅 ← {𝐴(𝜁 ) \ {𝑙 } ↩→ 𝐶 (𝜁 ) | 𝜁 ∈ 𝑅 };
Symmetrically, when a rule is discarded, it can no longer supports 9 for 𝜁 ∈ 𝑅 O [𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 = length(𝐶 (𝜁 ))
its conclusion, nor rejects the opposite. Therefore, as soon as an 10 𝜁 [𝑛] [2] ← +;
element satisfies the proper condition of Definitions 2.2 and 2.3, we 11 if 𝑛 = 1 ∧ 𝜁 [1] [1] = + then
can eliminate the corresponding rule from the set of rules. 12 −⊥ ← −⊥ ∪ {𝜁 };
We begin by populating the Herbrand Base, for every (deontic) 13 𝑅 ← {𝐴(𝜙) \ {¬violated(𝜁 )} ↩→2 𝐶 (𝜙) | 𝜙 ∈
literal we create the support set 𝑅𝑖𝑛𝑓 𝑑 , and for every obligation rule 𝑅 } \ {𝜙 ∈ 𝑅 | violated(𝜁 )∈ 𝐴(𝜙) };
a 2-dimensional array which will simplify in checking conditions 14 else if 𝜁 ∈ +𝜕A ∧ 𝜁 [𝑛] [1] = + ∧ (∀𝑗 < 𝑛.𝜁 [ 𝑗 ] [1] =
of applicability and compliance. + ∧ 𝜁 [ 𝑗 ] [2] = −) then
Let us consider the theory proposed in Example 2.12. As 𝑎 is 15 +⊤ ← +⊤ ∪ {𝜁 }; +⊥ ← +⊥ ∪ {𝜁 };
in the set of facts, at the first iteration loop for at Line 6 invokes 16 𝑅 ← {𝐴(𝜙) \ {complied(𝜁 )} ↩→2 𝐶 (𝜙) | 𝜙 ∈
Procedure Prove to prove it as constitutive. There, it is added to the 𝑅 } \ {𝜙 ∈ 𝑅 | ¬complied(𝜁 )∈ 𝐴(𝜙) };
support set +𝜕C (Line 1). We then eliminate 𝑎 from the antecedent 17 𝑅 ← {𝐴(𝜙) \ {violated(𝜁 )} ↩→2 𝐶 (𝜙) | 𝜙 ∈
of 𝛼, which is now empty and so 𝛼 is applicable (Lines 7 and 8). As 𝑅 } \ {𝜙 ∈ 𝑅 | ¬violated(𝜁 )∈ 𝐴(𝜙) };
𝑎 does not appear in any ⊗-chain, Procedure Prove terminates, and 18 end
the main algorithm invokes Refute on ∼𝑎. 19 end
The set of defeasibly refuted constitutive literals is updated with 20 end
∼𝑎, and 𝜌 is discarded as its antecedent contains ∼𝑎; 𝜌 is also ef- 21 case 2 = O do
fectively inactive. The idea of these simplifications is taken from 22 +𝜕A ← +𝜕A ∪ {𝜁 ∈ 𝑅 | 𝐴(𝜁 ) \ {O𝑙, ¬O∼𝑙, ¬P∼𝑙 } = ∅ };
[5, 7]. 23 𝑅 ← {𝐴(𝜁 ) \ {O𝑙, ¬O∼𝑙, ¬P∼𝑙 } ↩→2 𝐶 (𝜁 ) | 𝜁 ∈
𝑅 } \ {𝜁 ∈ 𝑅 | O𝑙f ⊆ 𝐴(𝜁 ) };
The algorithm now enters the main cycle Repeat-Until at Lines
7–46. For every literal 𝑙 in HB, depending on which type of con- 24 for 𝜁 ∈ 𝑅 O [𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 = length(𝐶 (𝜁 ))
𝜁 [𝑛] [1] ← +;
clusion is (2), we first verify whether there is any rule supporting 25
if 𝑛 = 1𝜁 [1] [2] = + then
it and, if not, we refute it (Line 10). Otherwise, if there exists an 26
27 −⊥ ← −⊥ ∪ {𝜁 };
applicable rule 𝛼 supporting it (ifs at Lines 11 for C, 21 for O, and 33
28 𝑅 ← {𝐴(𝜙) \ {¬violated(𝜁 )} ↩→2 𝐶 (𝜙) | 𝜙 ∈
for P), we update the set of defeated rules supporting the opposite
𝑅 } \ {𝜙 ∈ 𝑅 | violated(𝜁 )∈ 𝐴(𝜙) };
conclusion 𝑅 2 [∼𝑙]𝑖𝑛𝑓 𝑑 (Lines 12, 22, 34): (i) in case of obligations,
29 else if 𝜁 ∈ +𝜕A ∧ 𝜁 [𝑛] [2] = + ∧ (∀𝑗 < 𝑛.𝜁 [ 𝑗 ] [1] =
both obligation and permission opposite rules (Condition (4) of
+ ∧ 𝜁 [ 𝑗 ] [2] = −) then
+𝜕O ), (ii) in case of permissions, only obligation rules (Condition
30 +⊤ ← +⊤ ∪ {𝜁 }; +⊥ ← +⊥ ∪ {𝜁 };
(3.2) of +𝜕P ). Given that 𝑅 [∼𝑙] contains all the opposite rules, and
31 𝑅 ← {𝐴(𝜙) \ {complied(𝜁 )} ↩→2 𝐶 (𝜙) | 𝜙 ∈
given that we have just verified that 𝛼 for 𝑙 is applicable, we store
𝑅 } \ {𝜙 ∈ 𝑅 | ¬complied(𝜁 )∈ 𝐴(𝜙) };
in 𝑅 [∼𝑙]𝑖𝑛𝑓 𝑑 all those rules defeated by 𝛼. The next step is to verify
32 𝑅 ← {𝐴(𝜙) \ {violated(𝜁 )} ↩→2 𝐶 (𝜙) | 𝜙 ∈
whether there actually exists any rule supporting ∼𝑙 stronger than
𝑅 } \ {𝜙 ∈ 𝑅 | ¬violated(𝜁 )∈ 𝐴(𝜙) };
𝛼: if not, ∼𝑙 can be refuted (Lines 13, 23, 35).
33 end
The idea behind the ifs at Lines 15–18, 25–30, and 38–41, is: if
34 end
𝐷 ⊢ +𝜕_2𝑙, eventually the repeat-until cycle will have added to
35 end
𝑅 2 [∼𝑙]𝑖𝑛𝑓 𝑑 enough rules to defeat all opposite supports. When
36 case 2 = P do
that is the case, we invoke Prove on 𝑙, 2, and Refute on ∼𝑙, 2 (but 37 +𝜕A ← +𝜕A ∪ {𝜁 ∈ 𝑅 | 𝐴(𝜁 ) \ {P𝑙, ¬O∼𝑙 } = ∅ };
not in case of permission as it is legal to have P𝑙 and P∼𝑙 at the e ⊆ 𝐴(𝜁 ) };
38 𝑅 ← {𝐴(𝜁 ) \ {P𝑙, ¬O∼𝑙 } | 𝜁 ∈ 𝑅 } \ {𝜁 ∈ 𝑅 | P𝑙
same time).
39 end
When something is proved, the algorithm verifies whether 𝛼 is
40 end
effectively active via procedure Active. Such a procedure controls
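The simplification performed by Prove and Refute can be illustrated in isolation: once a literal is proved, it is removed from the antecedents in which it occurs, rules whose antecedent contains its complement are deleted, and a rule whose antecedent becomes empty is vacuously applicable. The following sketch (constitutive literals only; names are assumptions) mirrors the treatment of 𝑎, 𝛼 and 𝜌 in the walk-through of Example 2.12 above.

# Illustrative simplification step on proving a constitutive literal.

def simplify_on_proved(lit, rules):
    """rules: list of (label, antecedent_set, head). Returns the reduced
    rule list and the labels that became (vacuously) applicable."""
    neg = lit[1:] if lit.startswith("~") else "~" + lit
    reduced, newly_applicable = [], []
    for label, body, head in rules:
        if neg in body:                 # complement in the body: discard rule
            continue
        body = body - {lit}             # proved literal no longer needed
        if not body:
            newly_applicable.append(label)
        reduced.append((label, body, head))
    return reduced, newly_applicable

rules = [("alpha", {"a"}, "e"), ("rho", {"~a"}, "f"), ("sigma", {"a", "b"}, "w")]
print(simplify_on_proved("a", rules))
# ([('alpha', set(), 'e'), ('sigma', {'b'}, 'w')], ['alpha'])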


Procedure Refute Prove will then save ‘+’ in 𝜂 [2] [2] at Line 10, but nothing can be
Input: 𝑙 ∈ Lit, 2 ∈ {C, O, P} determined yet on compliance or violation. Later on, it proves −𝜕C𝑧
− −
1 𝜕2 ← 𝜕2 ∪ {𝑙 }; (and 𝜂 [1] [2] = −, Line 8 of Refute), and finally +𝜕O𝑧. This implies
2 𝐻𝐵 ← 𝐻𝐵 \ {2𝑙 }; that 𝜂 [1] [1] = + at Line 25 of Prove. Once the algorithm proves
3 −𝜕3 ← −𝜕3 ∪ {𝜁 ∈ 𝑅| 2𝑙 ∈ 𝐴(𝜁 )}; // with 3 = {A, EA}
+𝜕O𝑤, the else if test at Line 29 will succeed, and the algorithm
4 𝑅 ← 𝑅 \ {𝜁 ∈ 𝑅| 2𝑙 ∈ 𝐴(𝜁 )};
correctly establishes that 𝜂 is complied with (since 𝜂 [2] [1, 2] = +),
but also is violated (since 𝜂 [1] [1] = + but 𝜂 [1] [2] = −).
5 >← > \{(𝜁 ,𝜓 ), (𝜓, 𝜁 ) ∈> | 2𝑙 ∈ 𝐴(𝜁 )};
6 if 2 = C then
3.1 Computational properties
7 for 𝜓 ∈ 𝑅 O [𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 =length(𝐶 (𝜓 ))
We discuss the computational properties of Algorithm 1 Compli-
8 𝜓 [𝑛] [2] ← −;
ance. Due to space reason, we only sketch the proofs by providing
9 if 𝜓 ∈ +𝜕A ∧ (∀𝑗 < 𝑛 =length(𝐶 (𝜓 ))𝜓 [ 𝑗] [1] =
to the reader the motivations of why our algorithms are sound,
+ ∧ 𝜓 [ 𝑗] [2] = −) then complete, and terminate (but we leave out all the technical details).
10 −⊤ ← −⊤ ∪ {𝜓 }; In order to discuss termination and computational complexity,
11 𝑅 ← {𝐴(𝜇) \ {¬complied(𝜓 )} ↩→2 𝐶 (𝜇)| 𝜇 ∈ we start by defining the size of a meta-theory 𝐷 as Σ(𝐷), as number
𝑅} \ {𝜇 ∈ 𝑅 | complied(𝜓 )∈ 𝐴(𝜇)}; of the occurrences of literals plus the number of occurrences of
12 end rules plus 1 for every tuple in the superiority relation.
13 end Note that, by implementing hash tables with pointers to rules
14 else if 2 = O then where a given literal occurs, each rule can be accessed in constant
15 for 𝜓 ∈ 𝑅 O [𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 =length(𝐶 (𝜓 )) time. We also implement hash tables for the tuples of the superiority
16 if 𝜓 ∈ +𝜕A ∧ 𝑛 = 1 then relation where a given rule appears as either of the two element,
17 𝑅 ← {𝐴(𝜇) \ {complied(𝜓 ), and thus even those can be accessed in constant time.
¬violated(𝜓 )} ↩→2 𝐶 (𝜇)| 𝜇 ∈ 𝑅} \ ({𝜙 ∈
Theorem 3.3. Algorithm 1 Compliance terminates and its com-
𝑅 | {¬complied(𝜓 ),
plexity is 𝑂 (Σ4 ).
violated(𝜓 )} ∈ 𝐴(𝜙)} ∪ {𝜓 });
18 +⊤ ← +⊤ ∪ {𝜓 }; −⊥ ← −⊥ ∪ {𝜓 }; Proof. Termination of Procedures Prove, Refute, and Active
19 else is straightforward, as the size of the input theory is finite, and, at
20 𝜓 [𝑛] [1] ← −; every step, we modify finite sets. The complexity of Prove is 𝑂 (Σ2 ),
21 end the complexity of Refute is 𝑂 (Σ3 ) (two inner for loops of 𝑂 (Σ)
22 end each), and, lastly, the complexity of Active is 𝑂 (Σ).
Termination of Algorithm 1 Compliance is bound to termination
23 end
of the repeat-until cycle at Lines 7–46, as all other cycles loop
24 if 2 ∈ {O, P} then
over finite sets of elements of the order of 𝑂 (Σ). Given that 𝐻 𝐵
25 𝑅 ← {𝐴(𝜁 ) \ {¬2𝑙 }| 𝜁 ∈ 𝑅};
and 𝑅 are finite, and since every time a literal is proved/refuted, it
26 end
is removed from the corresponding set, the algorithm eventually
empties such a set, and, at the next iteration, no modification to the
extension can be made. This proves the termination of Algorithm 1
Procedure Active Compliance.
Input: Regarding its complexity, note that: (1) all set modifications are
1 if {𝛽 ∈ 𝑅 [∼𝑙] 𝑖𝑛𝑓 𝑑 | 𝛼 > 𝛽} \ {𝜓 ∈ 𝑅 [∼𝑙] 𝑖𝑛𝑓 𝑑 | 𝜁 > 𝜓 ∧ 𝜁 ≠
made in linear time, and (ii) the aforementioned repeat-until cycle
𝛼 ∧ 𝜁 ∈ +𝜕A ∩ 𝑅 [𝑙]} ≠ ∅ then is iterated at most 𝑂 (Σ) times, and so are the two for loop at lines
2 +𝜕EA ← +𝜕EA ∪ {𝛼 }; 9–44. This would suggest that the repeat-until cycle runs in 𝑂 (Σ2 ).
3 𝑅 ← {𝐴(𝜁 ) \ {active(𝛼)} ↩→2 𝐶 (𝜁 )| 𝜁 ∈ 𝑅} \ {𝜁 ∈ A more discerning analysis shows that the complexity is actually
𝑅 | ¬active(𝛼)∈ 𝐴(𝜁 )}; 𝑂 (Σ): the complexity of the for cycle cannot be considered sepa-
4 else rately from the complexity of the external repeat-until loop, while
5 −𝜕EA ← −𝜕EA ∪ {𝛼 }; instead they are strictly dependent. Indeed, the overall number of
6 𝑅 ← {𝐴(𝜁 ) \ {¬active(𝛼)} ↩→2 𝐶 (𝜁 )| 𝜁 ∈ 𝑅} \ {𝜁 ∈ operations made by the sum of all loop iterations cannot outrun
𝑅 | active(𝛼)∈ 𝐴(𝜁 )}; the number of occurrences of the literals or rules (𝑂 (Σ) + 𝑂 (Σ)),
7 end
because the operations in the inner cycles directly decrease, iter-
ation after iteration, the number of the remaining repetitions of
the outmost loop, and the other way around. This sets the overall
complexity of Algorithm 1 Compliance to 𝑂 (Σ4 ). □
We conclude the analysis of the algorithms by seeing in more
details how compliance and violations are verified, and we do so Theorem 3.4. Algorithm 1 Compliance is sound and complete:
by continuing the analysis of Example 2.12. (1) 𝐷 ⊢ +𝜕𝑋 𝑝 iff 𝑝 ∈ +𝜕𝑋 of 𝐸 (𝐷), 𝑋 ∈ {C, O, P}, 𝑝 ∈ Lit
At a certain iteration, the algorithm will prove +𝜕C𝑤, but assume (2) 𝐷 ⊢ +𝜕𝑌 𝛼 iff 𝛼 ∈ +𝜕𝑌 of 𝐸 (𝐷), 𝑌 ∈ {A, EA}, 𝑝 ∈ Lit
that so far we have proven neither +𝜕O𝑧, nor +𝜕O𝑤. Procedure (3) 𝐷 ⊢ +⊤𝛼 iff 𝛼 ∈ +⊤ of 𝐸 (𝐷), 𝛼 ∈ Lab


(4) 𝐷 ⊢ +⊥𝛼 iff 𝛼 ∈ +⊥ of 𝐸 (𝐷), 𝛼 ∈ Lab first class citizens in our logic with an equal treatment. This has
(5) 𝐷 ⊢ −𝜕𝑋 𝑝 iff 𝑝 ∈ −𝜕𝑋 of 𝐸 (𝐷), 𝑋 ∈ {C, O, P}, 𝑝 ∈ Lit allowed us to provide an efficient computational treatment of the
(6) 𝐷 ⊢ −𝜕𝑌 𝛼 iff 𝛼 ∈ −𝜕𝑌 of 𝐸 (𝐷), 𝑌 ∈ {A, EA}, 𝑝 ∈ Lit references. The other major benefit of the approach we adopted is
(7) 𝐷 ⊢ −⊤𝛼 iff 𝛼 ∈ −⊤𝛼 of 𝐸 (𝐷), 𝛼 ∈ Lab that it enables encodings of pieces of legislation to strictly adhere to
(8) 𝐷 ⊢ −⊥𝛼 iff 𝛼 ∈ −⊥𝛼 of 𝐸 (𝐷), 𝛼 ∈ Lab. the legal isomorphism principle [3] that facilitates the translation
from provisions in natural language to their formal representation
Proof. The aim of Algorithm 1 Compliance is to compute a and their maintenance when the provisions are amended.
defeasible extension of the input theory through successive trans- References are a prominent feature in norms amending other
formations on the set of facts, rules and the superiority relation. norms. Norm change can be modelled by nested rules [4, 9] and
These transformations act in a way to obtain a simpler theory while [12] develops algorithms to compute the extension of a defeasible
retaining the same extension. By simpler theory we mean a theory theory with nested norms. We plan to investigate how to integrate
with less symbol in it. For instance, given a theory 𝐷 such that the techniques proposed in this work with the algorithms in [12]
𝐷 ⊢ +𝜕O 𝑝, we can remove O𝑝 from the antecedent of the rules to implement the logics of [4, 9].
since a such deontic literal no longer plays any role in the rule, and
we can delete all the rules where P¬𝑝 is in the antecedent, as such REFERENCES
rules can no longer conclude literals (and we also know that the [1] Grigoris Antoniou, David Billington, Guido Governatori, and Michael J. Maher.
rule is no longer applicable); analogously, when 𝐷 ⊢ +⊤𝛼, we can 2001. Representation results for defeasible logic. ACM Trans. Comput. Log. 2, 2
(2001), 255–287. https://doi.org/10.1145/371316.371517
remove the instances of 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑 (𝛼) from the rule (and remove the [2] Grigoris Antoniou, David Billington, Guido Governatori, Michael J. Maher, and
rules where 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑 (𝛼) is in the antecedent). The theory obtained Andrew Rock. 2000. A Family of Defeasible Reasoning Logics and its Implemen-
from 𝐷 from such operations is equivalent to 𝐷 ′ , with respect to tation. In ECAI 2000, Proceedings of the 14th European Conference on Artificial
Intelligence, Berlin, Germany, August 20-25, 2000, Werner Horn (Ed.). IOS Press,
the elements of the Herbrand base and labels still in the theory. 459–463.
The proof that the above transformation produces theories equiv- [3] Trevor J. M. Bench-Capon and Frans Coenen. 1992. Isomorphism and legal
alent to the original one is by induction on the length of derivations knowledge based systems. Artif. Intell. Law 1, 1 (1992), 65–86. https://doi.org/10
.1007/BF00118479
and contrapositive. □ [4] Matteo Cristani, Francesco Olivieri, and Antonino Rotolo. 2017. Changes to
temporary norms. In Proceedings of the 16th edition of the International Conference
4 CONCLUSIONS AND RELATED WORK on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12-
16, 2017, Jeroen Keppens and Guido Governatori (Eds.). ACM, 39–48. https:
References are widespread in legal documents, and they have been //doi.org/10.1145/3086512.3086517
[5] Guido Governatori, Francesco Olivieri, Antonino Rotolo, and Simone Scannapieco.
a topic of intensive research in the field of AI and Law (e.g., citations 2013. Computing Strong and Weak Permissions in Defeasible Logic. J. Philos.
networks, automated detection of citations, citation and reference Log. 42, 6 (2013), 799–829. https://doi.org/10.1007/s10992-013-9295-1
navigations, . . . ). Despite their pervasive presence, the study of how [6] Guido Governatori, Francesco Olivieri, Simone Scannapieco, and Matteo Cristani.
2011. Designing for Compliance: Norms and Goals. In RuleML 2011-America
to logically represent them with the aim of exploit them for the (LNCS, Vol. 7018). Springer, 282–297. https://doi.org/10.1007/978-3-642-24908-
digitalisation of legislation has been largely neglected. The OASIS 2_29
LegalRuleML standard [11] introduces the terms “comply” and [7] Guido Governatori, Francesco Olivieri, Simone Scannapieco, Antonino Rotolo,
and Matteo Cristani. 2016. The rationale behind the concept of goal. Theory Pract.
“violated” (accepting an argument pointing to a legal rule), but the Log. Program. 16, 3 (2016), 296–324. https://doi.org/10.1017/S1471068416000053
development of a method to solve such references is well beyond the [8] Guido Governatori and Antonino Rotolo. 2006. Logic of Violations: A Gentzen
System for Reasoning with Contrary-To-Duty Obligations. Australasian Journal
scope of the standard. The task of solving such references has been of Logic 4 (2006), 193–215. arXiv:9307/main.pdf http://ojs.victoria.ac.nz/ajl/artic
tackled by [10] adopting the technique of importing the content of le/view/1780
the citation (with the appropriate semantic layer), but the approach [9] Guido Governatori and Antonino Rotolo. 2010. Changing legal systems: legal
abrogations and annulments in Defeasible Logic. Logic Journal of IGPL 18, 1
is restricted to a shallow import (ignoring attacking rules). (2010), 157–194.
The idea of employing terms denoting the legal status of pro- [10] Ho-Pun Lam and Mustafa Hashmi. 2019. Enabling reasoning with LegalRuleML.
visions goes back, at least, to the seminal work by Sartor [13]. Theory Pract. Log. Program. 19, 1 (2019), 1–26. https://doi.org/10.1017/S1471068
418000339
However, when the approach is used, the information about such [11] OASIS. 2017. LegalRuleML core specification version 1.0. Standard Specification.
terms is either given as part of the input of a case (and it is not de- OASIS. http://docs.oasis-open.org/legalruleml/legalruleml-core-spec/v1.0/cspr
d02/legalruleml-core-spec-v1.0-csprd02.html
termined by the other “facts" of the case and the rules), or addressed [12] Francesco Olivieri, Guido Governatori, Matteo Cristani, and Abdul Sattar. [n.d.].
using the techniques exemplified by (5) and (4), and not dealt with Computing Defeasible Meta-logic. In Logics in Artificial Intelligence - 17th Euro-
at the logic level, and focusing, at best, on the first reading of “ap- pean Conference, JELIA (LNCS, Vol. 12678), Wolfgang Faber, Gerhard Friedrich,
Martin Gebser, and Michael Morak (Eds.). Springer, 69–84. https://doi.org/10.1
plicable”. Often, the use of “applicable” is to facilitate some form 007/978-3-030-75775-5_6
of non-monotonic reasoning, avoiding the adoption of a priority [13] Giovanni Sartor. 1991. The Structure of Norm Conditions and Nonmonotonic
relation over rules. Reasoning in Law. In Proceedings of the Third International Conference on Artificial
Intelligence and Law, ICAIL ’91, Richard E. Susskind (Ed.). ACM, 155–164. https:
In this paper, we started from the same idea of [13], but with //doi.org/10.1145/112646.112665
the direct focus on handling the references and not to facilitate [14] Giovanni. Sartor. 2005. Legal Reasoning: A Cognitive Approach to the Law.
Springer.
some form of defeasible reasoning. Accordingly, references are

Context-Aware Legal Citation Recommendation using Deep
Learning

Zihan Huang∗ Daniel E. Ho Matthias Grabmair†


Charles Low∗ Mark S. Krass Department of Informatics
Mengqiu Teng∗ Stanford University Technical University of Munich
SINC GmbH
Hongyi Zhang∗
Language Technologies Institute
Carnegie Mellon University

ABSTRACT Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/
Lawyers and judges spend a large amount of time researching the 3462757.3466066
proper legal authority to cite while drafting decisions. In this paper,
we develop a citation recommendation tool that can help improve 1 INTRODUCTION
efficiency in the process of opinion drafting. We train four types
Government agencies adjudicate large volumes of cases, posing
of machine learning models, including a citation-list based method
well-known challenges for the accuracy, consistency, and fairness of
(collaborative filtering) and three context-based methods (text sim-
decisions [2, 27]. One of the prototypical mass adjudicatory agencies
ilarity, BiLSTM and RoBERTa classifiers). Our experiments show
in the U.S. context is the Board of Veterans’ Appeals (BVA), which
that leveraging local textual context improves recommendation,
makes decisions on over fifty thousand appeals for disabled vet-
and that deep neural models achieve decent performance. We show
eran benefits annually. Due to these case volumes and constrained
that non-deep text-based methods benefit from access to struc-
resources, the BVA suffers from both a large backlog of cases and
tured case metadata, but deep models only benefit from such access
large error rates in decisions. Roughly 15% of (single-issue) cases
when predicting from context of insufficient length. We also find
are appealed and around 72% of appealed cases are reversed or
that, even after extensive training, RoBERTa does not outperform a
remanded by a higher court [14]. These challenges are typical for
recurrent neural model, despite its benefits of pretraining. Our be-
agencies like the Social Security Administration, the Office of Medi-
havior analysis of the RoBERTa model further shows that predictive
care Hearings and Appeals, and the immigration courts, which
performance is stable across time and citation classes.
adjudicate far more cases than all federal courts combined. Lawyers
and judges are hence in great need of tools that can help them
CCS CONCEPTS reduce the cost of legal research as they draft decisions to improve
• Applied computing → Law; Document analysis; • Information the quality and efficiency of the adjudication process.
systems → Data mining; Recommender systems; • Computing Advancing the application of machine learning to suggesting
methodologies → Natural language processing. legal citations is essential to the broader effort to use AI to assist
lawyers. Citations are a critical component of legal text in common-
KEYWORDS law countries. To show that a proposition is supported by law,
citation recommendation, citation normalization, legal text, legal writers cite to statutes passed by a legislature; to regulations writ-
opinion drafting, neural natural language processing ten by agencies implementing statutes; and to cases applying legal
authorities in a particular context. Such is the importance of cita-
ACM Reference Format:
Zihan Huang, Charles Low, Mengqiu Teng, Hongyi Zhang, Daniel E. Ho,
tions to legal writing that the traditional method of selecting law
Mark S. Krass, and Matthias Grabmair. 2021. Context-Aware Legal Citation students to edit law journals has been a gruelling test on the cor-
Recommendation using Deep Learning. In Eighteenth International Con- rect format of legal citations [30]. Achieving performance on more
ference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São difficult tasks, like text generation and summarization, depends on
a sophisticated treatment of citations.
∗ Authors contributed equally to the paper. This paper reports on experiments evaluating a series of machine
† Corresponding author (matthias.grabmair@tum.de). Current affiliation at TUM; work
learning tools for recommending legal citations in judicial opinions.
largely conducted while employed at SINC as part of adjunct affiliation with Carnegie
Mellon University, Language Technologies Institute. We show that deep learning models beat ordinary machine learning
tools at recommending legal citations on a variety of metrics, which
Permission to make digital or hard copies of part or all of this work for personal or suggests that the neural models have a stronger capability to exploit
classroom use is granted without fee provided that copies are not made or distributed semantics to understand which citation is the most appropriate.
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored. We also demonstrate the importance of context in predicting
For all other uses, contact the owner/author(s). legal citations. For ordinary text-based machine learning models
ICAIL’21, June 21–25, 2021, São Paulo, Brazil with limited capacity for detecting semantic meaning, structured
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06. contextual metadata improves virtually all predictions. For deep
https://doi.org/10.1145/3462757.3466066 learning models, the utility of structured metadata emerges only


in sufficiently difficult settings where there may be a weaker se- citation-list based approaches do not exploit the rich information
mantic link between the input and the target, and only for certain contained in the textual context of each citation.
models. Still, this result shows the potential importance of context
to citation predictions. Deep learning models that are able to bet- 2.1.2 Context-Based Methods. In this setting, the researcher inputs
ter incorporate contextual cues from semantic inputs are likely to a span of text (the query context), which can be a particular sentence
outperform methods without such capabilities. or paragraph, instead of a list of citations. The system recommends
Because the BVA corpus has never been made available to the local citations relevant to this query context.
research community, we are releasing the text for single-issue de- Traditional information retrieval approaches directly compare
cisions, with legal citation tokenization, case metadata, and our the words in the query context to the words in the title, abstract, or
source code upon publication at: https://github.com/TUMLegalTech/ full text of each cited document, and apply scoring models such as
bva-citation-prediction. We believe many other advances can be Okapi BM25 [1] or Indri [38] to arrive at a similarity score that is
built on this as a benchmark for natural language processing in used to rank documents. However, as [33] observes, the full text of
law. cited documents is often noisy and may not contain words similar
to those used to describe the document as a whole. This problem is
especially pertinent in law. Legal decisions and statutes sometimes
2 RELATED WORK lack informative titles, and the key legal implications are often
2.1 Citation Recommendation buried in a mountain of other factual or procedural details.
Citation recommendation is a well-studied problem in the domain Intuitively, we expect the span of text preceding or surrounding
of academic research paper recommendation, as researchers seek a citation (its citation context) to contain useful information pertain-
help to navigate vast literatures in their fields. Many of the ap- ing to the content of the cited document and the reason for citation.
proaches are transferable to the legal context. They can be broadly categorized into citation-list based methods, which characterize a query document by the incomplete set of citations it contains and provide a global recommendation of citations relevant to the entire document, and context-based methods, which take a particular segment of text from the query document and provide a local recommendation that is relevant to that specific context [26].

2.1.1 Citation-List Based Methods. In this setting, the researcher is drafting a paper and has an incomplete set of citations on hand, and seeks to find additional relevant papers.

An early approach in [28] applies collaborative filtering to this task. There, citing papers are "users" and citations are "items." Given a new user, the algorithm locates existing users with similar preferences to the new user, and recommends items popular among the existing users. Matrix factorization methods project the sparse, high-dimensional user-item adjacency matrix onto a low-dimensional latent space and compare similarity in this latent space. For example, [4] uses Singular Value Decomposition to find the latent space, and finds performance gains over ordinary collaborative filtering.

Graph-based approaches treat research papers as nodes and citations as edges (directed or undirected), and use graph-based measures of relevance to find nodes relevant to an input set corresponding to the researcher's incomplete set of citations. Examples include the Katz measure [23], PageRank [31] and SimRank [19]. [12] applies a topic-sensitive version of the PageRank algorithm by up-weighting papers in the incomplete set. [37] finds the Katz measure of node proximity to be a significant feature.

The citation-list based approach has its drawbacks. First, it puts the burden of creating a partial list of citations on the user. Attorneys who are new to veterans' law would face the well-known "cold-start" problem, where they have difficulty generating enough citations as input to receive quality recommendations. Second, attorneys drafting an opinion may be more interested in local recommendations relevant to their current section of work rather than global recommendations that are generally relevant to the entire case. Third,

This information can then be used for retrieval. [3, 33] demonstrate that indexing academic papers using words found in their citation contexts improves retrieval. He et al. [13] develop this idea further by representing each paper as a collection of citation contexts, and then using a non-parametric similarity measure between a query context and each paper for recommendation. Huang et al. [16] use a neural network to learn word and document representations to perform similarity comparison in that space. More recently, in a work most similar to our approach, Ebesu and Fang [10] directly train an encoder-decoder network with attention for context-aware citation prediction and find that adding embeddings representing the citing and cited authors improves predictions.

2.2 Legal Citation Prediction

Because of the importance of citations to legal writing [8], prior work has explored machine-generated recommendations for legal authorities relevant to a given legal question.

A number of commercial tools claim to assist users in legal research using citations. Zhang and Koppaka [43] describe a feature in LexisNexis that allows users to traverse a semantics-based citation network in which relevance is determined by textual similarity between citation contexts. Other commercial offerings include ROSS Intelligence [18], CaseText's CARA A.I. [17] and Parallel Search [6], as well as Quick Check by Thomson Reuters [40]. The methodology of such offerings is largely proprietary.

Winkels et al. [41] develop a prototype legal recommender system for Dutch immigration law, which allows legal professionals to search a corpus by clicking on articles of interest; the system returns cases with the highest betweenness centrality with the article. In [8], Dadgostari et al. consider the task of generating a bibliography for a citation-free legal text by modelling the search process as a Markov Decision Process in which an agent iteratively selects relevant documents. At each step, the agent can choose whether to explore a new topic in the original paper or to select a relevant paper from the current topic of focus. An optimal policy is learned using Q-learning.


They find this adaptive algorithm to outperform a simpler method, based on proximity to the original document, on the task of retrieving U.S. Supreme Court decisions.

Other works [11, 21] have tangentially analyzed properties of legal citation networks, exploring measures of authority and relevance of precedents, as well as macro characteristics of the network, such as degree distribution and shortest path lengths. Sadeghian et al. [35] develop a system to automatically identify citations in legal text, extract their context and predict the reason for the citation (e.g., legal basis, exception) based on a curated label set.

3 DATA

3.1 The BVA Corpus

The BVA corpus we use contains the full text of over 1 million appeal decisions from 1999 to 2017. Accompanying each decision is a set of metadata derived from the Veterans Appeals Control and Locator System (VACOLS), which includes fields such as the decision date, diagnostic codes indicating the veteran's injuries, the case outcome, and an indicator for whether the case was subsequently re-appealed. Each case also contains one or more 'issue codes,' which are hand-coded by BVA attorneys and categorize the key legal or factual questions raised (e.g., "entitlement to a burial benefit"). This paper focuses on a subset of 324,309 cases that raise a single issue and have complete metadata, although our methods can be generalized to the full corpus.

We hypothesized that three metadata features would contribute to model performance. First, we included the year of the decision, to reflect changes in citation patterns as new legal precedents emerge over time. Second, we constructed an issue area feature to reflect the substantive issues presented in each case, which we hypothesize to provide strong priors for the type of citations contained within as well. The BVA has a hierarchical coding system comprising program codes, issue codes, and diagnostic codes to categorize each issue. For simplicity and class balancing, we curated a composite issue area variable with 17 classes (see Figure 1). Third, we included a feature referring to the Veterans' Law Judge (VLJ) who handled the case. This corresponds to the hypothesis that citation patterns vary with the idiosyncrasies of individual judges, inspired in part by [10]. Judge names were anonymized and judges with 5 cases or fewer were collapsed into a single unknown judge category. Summary statistics for these metadata are included in Table 1.

Variable   | # Values | Most Frequent Class (# cases)                         | Least Frequent Class (# cases)
Year       | 19       | 2009 (22,801)                                         | 2017 (3,651)
Issue Area | 17       | Service Connection for Bodily Injury Claims (38,956)  | Increased Rating for Nerve Damage (2,921)
VLJ        | 289      | Anonymized (6,159)                                    | Anonymized (6)
Table 1: Summary Statistics of Corpus Metadata Variables.

[Figure 1: Issue area categories. A decision tree over the questions "Compensation Claim?", "Service Connection Issue?", and "Increased Rating Issue?" assigns each case to one of 17 classes: 0: No, 1: Other, 2: Dependents, 3: Effective Date, 4: Total Disability, 5: Accrued/Dental/New, 6: Bodily Injury, 7: Eye/Ear/Respiratory, 8: Organ Damage, 9: Nerve, 10: Psychological, 11: Not Schedular, 12: Body Injury, 13: Eye/Ear/Respiratory, 14: Organ Damage, 15: Nerve, 16: Psych.]

3.2 Decision Text Preprocessing

American legal citations follow a predictable format governed by [7]. Case citations, for instance, identify the parties to the case; the reporter containing the case; and finally the page in the reporter where the case begins. Thus, a citation to Brown v. Board of Education of Topeka would begin as follows: Brown v. Board of Education, 347 U.S. 483. This indicates that the first page of Brown is found on the 483rd page of the 347th volume of the United States Reports. The volume-reporter-page citation is usually a unique identifier for each case.1 Citations to statutory law follow a similar three-part pattern: "18 U.S.C. § 46" means the 46th section of the 18th title of the United States Code.

1 Summary dispositions of a case are sometimes reported in a table, such that multiple cases appear on a single physical page.

These three-part citation patterns form the basis for our text preprocessing pipeline. We first use a series of regular expressions to identify, clean, and classify citations from opinions. We then build a vocabulary of legal authority using publicly-available lists of valid cases and statutes. We use this vocabulary to extract all citations from case texts and represent them using standardized indices. We describe this process in greater detail below.

3.3 Citation Preprocessing

The large raw citation vocabulary obtained from running regular expression extractors on every case is normalized into classes of case, statute, regulation, and unknown citations.

For cases, this normalization involves matching the volume, reporter, and first/last page interval derived from the citation string with an authoritative list of cases found in the CaseLawAccess (CLA) metadata.2 If an extracted citation can be matched to a CLA metadata entry, it is replaced with a reference to that entry in the citation vocabulary during tokenization. For example, the extraction 'Degmetich v. Brown, 8 Vet. App. 208 (1995)' is resolved to the normalized 'Degmetich v. Brown, 8 Vet. App. 208, CLA#6456776' (i.e. CLA metadata entry 6456776), which becomes an entry in the citation vocabulary that is used for all identifiable references to the same case. Citations to the U.S. Code and to the Code of Federal Regulations are extracted using patterns based on the '<chapter> U.S.C. <tail>' and '<chapter> C.F.R. <tail>' anchors. The tail typically consists of one or more section elements, which we break into individual elements that each become their own normalized citation with the same anchor and chapter (e.g., '18 U.S.C. §§ 46(a), 46(b)' becomes the two entries '18 U.S.C. § 46(a)' and '18 U.S.C. § 46(b)'). All citations that cannot be normalized into either case, code, or regulation classes form the 'unknown' class. Once normalized, the vocabulary is further reduced by removing all citation entries which occur fewer than 20 times in the training cases and resolving them to an 'unknown citation' token. This threshold was manually chosen as a suitable tradeoff between extensive coverage of citations and baseline frequency to enable the model to learn.

2 CLA is a public-access project that has digitized the contents of major case reporters [5]. We include the Vet. App. and F.3d reporters, which contain veterans' law cases and cases from the Federal Courts of Appeal, as these account for the vast majority of cases cited in the corpus.

The training data contains about 5M extracted citation instances comprising roughly 97k unique strings. Our normalization procedure reduces this to a citation vocabulary of size 4287, of which 4050 (≈ 94.5%) are normalized (1286 cases, 870 statutes, 1894 regulations). The normalized entries cover about 98.5% of citations occurring in the tokenized decisions. This reduction effect is primarily due to (a) complex statutory citations breaking apart into a smaller set of atoms, (b) different page-specific citations to a case getting collapsed into a single CLA entry, and (c) different forms of variation reduction (e.g., removal of trailing parentheses with years, etc.).
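To make the extraction and normalization step concrete, the following sketch illustrates the kind of regular-expression processing described above. The patterns, the small reporter list, and the helper names are illustrative assumptions for this sketch only, not the extractors actually used for the BVA corpus, and the CLA matching step is only indicated in a comment.

```python
import re

# Simplified patterns; the real extractors cover many more reporters and edge cases.
CASE_RE = re.compile(r"(?P<vol>\d{1,4})\s+(?P<reporter>Vet\. App\.|F\.3d|U\.S\.)\s+(?P<page>\d{1,5})")
STATUTE_RE = re.compile(r"(?P<title>\d{1,3})\s+(?P<code>U\.S\.C\.|C\.F\.R\.)\s+§§?\s*(?P<tail>[\w().,\s]+)")

def normalize_statute(match):
    """Split a multi-section tail such as '46(a), 46(b)' into one entry per section."""
    title, code, tail = match.group("title"), match.group("code"), match.group("tail")
    sections = [s.strip(" .;") for s in tail.split(",") if s.strip(" .;")]
    return [f"{title} {code} § {s}" for s in sections]

def extract_citations(text):
    entries = []
    for m in CASE_RE.finditer(text):
        # A full pipeline would match (volume, reporter, page) against CLA metadata here.
        entries.append(f"{m.group('vol')} {m.group('reporter')} {m.group('page')}")
    for m in STATUTE_RE.finditer(text):
        entries.extend(normalize_statute(m))
    return entries

print(extract_citations("See Degmetich v. Brown, 8 Vet. App. 208 (1995); 18 U.S.C. §§ 46(a), 46(b)."))
# -> ['8 Vet. App. 208', '18 U.S.C. § 46(a)', '18 U.S.C. § 46(b)']
```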


The final vocabulary is then used to normalize all citations encountered in case texts. Citation strings are extracted and replaced with a placeholder. The case/code/regulation procedure outlined above is applied to each citation string to obtain a list of one or more corresponding normalized citations. Each of these is kept if it is contained in the final vocabulary, or replaced with the 'unknown citation' otherwise. The resulting sequence of vocabulary index tokens is re-inserted at the location of the general citation placeholder after the text has been tokenized. Note that only citations containing reporter and page references are extracted and regularized. Short form citations (e.g., 'id.') are treated as ordinary text and are excluded from the pool of prediction targets.3 We also do not treat quotations in the text in any special way and rely on the tokenizer to capture them as part of the context window.

3 While this choice may limit the pool of prediction targets, it does not threaten the integrity of predictions themselves. By convention, short-form citations always follow full-form citations, which we detect. Because the system only has access to left context, it cannot 'cheat' by reference to short-hand citations.

4 PROBLEM DEFINITION

We model the legal citation prediction problem as follows. Suppose a BVA staff attorney is drafting an opinion regarding an appeals proceeding. We refer to this document as d. The incomplete draft may already contain several citations to authority. We call this incomplete set c_d ⊂ C, where C represents the entire corpus of legal authorities, comprising possibly relevant cases, statutes, and regulations. The first task we consider is to predict the next citation c* ∈ C \ c_d that is globally relevant to the opinion, given the incomplete set c_d. This corresponds to the citation-list based approach. In our experiments, for a document that contains M citations, we model the incomplete list c_d by taking the first m citations (1 ≤ m < M) from the document. We then seek to predict the next citation, i.e., the (m + 1)-th citation c*, given c_d.

Alternatively, the attorney may be more interested in legal authority specific to the current segment of the opinion he/she is working on. We represent this text segment of interest by a sequence of tokens as the query context b_d = {b_1, ..., b_l}. The second task is thus to predict the next upcoming citation c* ∈ C that is locally relevant to context b_d. This corresponds to the context-aware approach. Specifically, in our experiments, given a query context b_d of length l in the document d, we seek to predict the first citation that occurs in the upcoming forecast window of length w. We vary length l and forecast window w depending on the method (see Section 5).

In addition, metadata describing characteristics of the draft decision may also aid in citation prediction. For instance, the relevance and validity of case citations can change over time as new precedents emerge and others are overruled. Since many of the relevant legal standards are specific to particular classes of claims, the issue code feature may help identify relevant citations. Finally, different VLJs may have different propensities to cite certain authorities.

5 METHODS

Our main metrics are recall at 1, recall at 5 and recall at 20, that is, the proportion of data instances where the correct next citation is among the model's top 1, 5, and 20 predictions. Precision would not be an informative metric as we are only seeking to predict the single correct next citation. Recall at 1 reflects a restrictive user that expects the system to predict a single citation only. Recall at 5 simulates what we think is the typical user, who benefits from a small number of recommendations that can quickly be examined for the most appropriate one. A longer list of 20 simulates users seeking to get a bigger picture of what could possibly be relevant. We split the 324,309 single-issue BVA cases into 233,506 (72%), 58,370 (18%) and 32,433 (10%) cases for the training, validation and test set, respectively. Each model is trained on the training set, tuned on the validation set, and tested on a 6-fold split of the test set to measure statistical uncertainty.

We implement four different methods on the task of legal citation prediction on the BVA corpus, and examine their comparative performance: a citation-only collaborative filtering system, a context-similarity based model, a BiLSTM recurrent neural classifier, and a RoBERTa-based classifier that has been pretrained on a language model objective. We note that our task is related to legal language generation (e.g., [32]). However, evaluating the citation prediction ability of a language generation model is significantly more difficult. Citations would need to be captured dynamically during a parameter-dependent generation process, validated, and resolved against the vocabulary. By contrast, the neural models in this project are implemented as conceptually straightforward classifiers, allowing us to test their ability to read the context well enough to forecast what will be cited next. We plan to tackle citation prediction as language generation in future work.
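As a concrete reference for the evaluation protocol above, the following sketch shows one way recall@k can be computed over a set of evaluation instances. The citation strings in the toy example are illustrative only and are not drawn from the corpus.

```python
def recall_at_k(ranked_predictions, gold_next_citation, k):
    """Fraction of instances whose correct next citation appears in the top-k list."""
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_next_citation)
               if gold in preds[:k])
    return hits / len(gold_next_citation)

# Toy example with three evaluation instances (illustrative citations only):
preds = [["38 C.F.R. § 3.159", "38 U.S.C. § 5103A", "38 C.F.R. § 3.303"],
         ["38 C.F.R. § 3.303", "38 U.S.C. § 1110"],
         ["38 U.S.C. § 5107"]]
gold = ["38 U.S.C. § 5103A", "38 C.F.R. § 3.102", "38 U.S.C. § 5107"]
print(recall_at_k(preds, gold, k=1))  # 0.333... (only the third instance is a top-1 hit)
print(recall_at_k(preds, gold, k=5))  # 0.666... (first and third hit within the top 5)
```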


5.1 Collaborative Filtering

Our first experimental model uses collaborative filtering, a common recommender system technique based on the assumption that similar users will like similar items. Transferred to our setting, each BVA decision document is treated as a user, and each citation is seen as an item. The prediction task then takes as input the citations that are already cited in a BVA draft opinion (which can be seen as the items that a user has liked), and returns other citations that similar documents have also cited.

Formally, assume that the corpus of BVA cases C has V authorities that can be cited. Then every document d' can be represented by a sparse vector v_d' ∈ R^V, each of whose dimensions v_d',c indicates an importance score of a citation c to the document. If citation c is cited in a document, possible scoring functions include a binary representation (v_d',c = 1), a term frequency (tf) score, and a tf-idf score that incorporates the inverse document frequency (idf). With such a representation, a set of document vectors {v_d' : d' ∈ D} can be constructed from a document collection D.

Given a draft of a BVA opinion d, its incomplete citation set c_d can also be summarized into a document vector v_d. We use a collaborative filtering approach known as the user-based top-K recommendation algorithm. The algorithm first identifies the K documents D_K(d) that are most similar to d from the collection, based on the cosine similarity of their vector representations v:

    sim(v_d, v_d') = (v_d · v_d') / (‖v_d‖_2 ‖v_d'‖_2).

The algorithm then finds candidate citations based on what these documents cite. An average of these document vectors weighted by their similarities gives the final recommendation. Specifically, the recommendation score of citation c for document d is given by

    score(d, c) = Σ_{d' ∈ D_K(d)} sim(v_d, v_d') · v_d',c  /  Σ_{d' ∈ D_K(d)} sim(v_d, v_d').

In our experiments, the document vectors are collected from the training set. The number of top similar documents K is a hyperparameter that can be tuned, and K = 50 is chosen for the results reported. From our trials with three different scoring functions for the document vectors, binary scoring proved to be the most effective choice and was used throughout the experiments.
a way similar to the Collaborative Filtering model (see Section 5.1).
In our experiments, the document vectors are collected from the
Each feature is assigned a score and an SVM model is trained to
training set. The number of top similar documents 𝐾 is a hyper-
learn feature weights to produce the final score.
parameter that can be tuned, and 𝐾 = 50 is chosen for the results
reported. From our trials with three different scoring functions
5.3 Bi-directional Long Short Term Memory
for the document vectors, binary scoring proved to be the most
effective choice and was used throughout the experiments. LSTMs [15] are a popular form of recurrent neural networks and
To incorporate metadata features, a score is assigned to each serve as a well-known baseline for deep neural network models.
categorical feature 𝑓𝑖 , namely the probability of citing the citation Variants using LSTM remain competitive in various NLP tasks
𝑐 after conditioning on that feature: [22, 25, 29]. BiLSTM (Bi-directional LSTM) improves on the original
LSTM by reading inputs in both forward and backward directions.
score(𝑓𝑖 , 𝑐) = 𝑃 (𝑐 | 𝑓𝑖 ). We adopted a two-layer BiLSTM on the BVA corpus for citation
We take a weighted average of these features and the output prediction. Just like the text similarity baseline, this approach per-
of the collaborative filtering algorithm. We adopt the commonly forms local citation recommendation. It takes a sequence of words
used svmRank algorithm of [20] to learn weights for each feature. within the draft opinion as the query context, and predicts which
We extract all citation occurrences in a random sample of 1000 citation is most likely to be cited next given the context. Going
documents from the training set, perform a pairwise transformation beyond the text similarity model, we predict the first citation that
on the data, apply min-max normalization on the pairwise data, and appears within a forecasting window of fixed length.
train a linear Support Vector Machine (SVM) on the normalized data. Formally, a sequence of tokens b𝑑 = {𝑏 1, ..., 𝑏𝑙 } is extracted from
The final score is a linear combination of individual feature scores each document 𝑑 as the query context and we seek to predict the
using the learned weights. Citations suggested by the recommender immediate next citation in the upcoming forecasting window of
system are reranked by their final scores and the top citations are length 𝑤. The query context is encoded using pre-trained byte-
chosen as final predictions. level Byte Pair Encoding (BPE) [36]. For comparability with the
RoBERTa model, we use the ‘roberta-base’ tokenizer provided by
5.2 Text Similarity Huggingface [42], which has a vocabulary of about 50k tokens.
The second model uses a context-aware bag-of-words approach to The citation vocabulary indices are re-inserted after encoding, re-
predict citations. Previous studies, such as [13, 34], have demon- placing the general citation token to generate the final encoded
strated that the local context of words surrounding each citation tokens as described in Section 3.3. The encoded tokens are fed into
occurrence can be used as a compact representation of the cited an embedding layer followed by two stacked bi-directional LSTM
document to improve retrieval effectiveness, much like how in-link 4 Notethat this means citations are always the very next word after the context. This
text is used to improve web retrieval. By contrast to collaborative contrasts with the neural models presented below, where citations may appear at some
filtering, this approach does not require the user to input an existing distance from the context.
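The context-similarity score above can be sketched with scikit-learn's TfidfVectorizer as follows. The function name, the dictionary layout, and the commented fitting step are assumptions made for illustration; they are not the authors' code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# contexts_by_citation maps each citation id to up to 100 stored training contexts
# (the 50 words preceding an occurrence of that citation); query is a draft snippet.
def rank_citations(query, contexts_by_citation, vectorizer):
    q = vectorizer.transform([query])                  # L2-normalised tf-idf row vector
    scores = {}
    for citation, contexts in contexts_by_citation.items():
        B = vectorizer.transform(contexts)             # (k_c, |vocab|) tf-idf matrix
        dots = (B @ q.T).toarray().ravel()             # b_d · b_j for each stored context
        scores[citation] = float(np.mean(dots ** 2))   # score(b_d, c) = (1/k_c) Σ (b_d · b_j)^2
    return sorted(scores, key=scores.get, reverse=True)

# Example setup (illustrative):
# vectorizer = TfidfVectorizer(max_features=25000, stop_words="english", norm="l2")
# vectorizer.fit(all_training_contexts)
```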


5.3 Bi-directional Long Short Term Memory

LSTMs [15] are a popular form of recurrent neural networks and serve as a well-known baseline for deep neural network models. Variants using LSTM remain competitive in various NLP tasks [22, 25, 29]. BiLSTM (Bi-directional LSTM) improves on the original LSTM by reading inputs in both forward and backward directions. We adopted a two-layer BiLSTM on the BVA corpus for citation prediction. Just like the text similarity baseline, this approach performs local citation recommendation. It takes a sequence of words within the draft opinion as the query context, and predicts which citation is most likely to be cited next given the context. Going beyond the text similarity model, we predict the first citation that appears within a forecasting window of fixed length.

Formally, a sequence of tokens b_d = {b_1, ..., b_l} is extracted from each document d as the query context and we seek to predict the immediate next citation in the upcoming forecasting window of length w. The query context is encoded using pre-trained byte-level Byte Pair Encoding (BPE) [36]. For comparability with the RoBERTa model, we use the 'roberta-base' tokenizer provided by Huggingface [42], which has a vocabulary of about 50k tokens. The citation vocabulary indices are re-inserted after encoding, replacing the general citation token to generate the final encoded tokens as described in Section 3.3. The encoded tokens are fed into an embedding layer followed by two stacked bi-directional LSTM layers to produce a sequence of hidden states. The hidden state corresponding to the last token is used as the aggregate representation of the query context and flows into the classification head, which consists of two dropout/linear combination layers separated by a tanh activation, followed by a softmax layer to produce output probabilities for each citation, indicating how likely they will be cited next. Figure 2 illustrates the detailed architecture.
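A minimal PyTorch sketch of a classifier matching this description is given below. Layer sizes follow the hyperparameters reported later in this section, the metadata hook is optional, and the class name is invented; this is an illustration of the architecture, not the authors' implementation (logits are returned, with softmax applied implicitly by the cross-entropy loss).

```python
import torch
import torch.nn as nn

class BiLSTMCitationClassifier(nn.Module):
    """Two-layer BiLSTM over the encoded context, with an optional metadata vector."""
    def __init__(self, vocab_size, num_citations, emb_dim=768, hidden_dim=3072,
                 num_meta_features=0, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                            batch_first=True, bidirectional=True)
        head_in = 2 * hidden_dim + num_meta_features    # last hidden state + metadata
        self.head = nn.Sequential(
            nn.Dropout(dropout), nn.Linear(head_in, hidden_dim), nn.Tanh(),
            nn.Dropout(dropout), nn.Linear(hidden_dim, num_citations))

    def forward(self, token_ids, meta=None):
        states, _ = self.lstm(self.embedding(token_ids))   # (batch, seq, 2*hidden)
        last = states[:, -1, :]                             # representation of the last token
        if meta is not None:
            last = torch.cat([last, meta], dim=-1)          # one-hot year/issue + VLJ embedding
        return self.head(last)                              # logits over the citation vocabulary
```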

[Figure 2: The BiLSTM model architecture.]

To incorporate the metadata information, processed metadata features are concatenated to the last hidden state before the classification layers as illustrated in Figure 2. The year and issue area features are one-hot encoded. The VLJ feature is projected into a three-dimensional vector space by feeding the VLJ ID into an embedding layer that can be inspected after training.

Our training setup follows common settings for language analysis experiments of this size. We use an embedding size of 768 and a hidden size of 3072. We compute CrossEntropy loss against a one-hot target vector of the same length as the vocabulary. To facilitate stable convergence, we use an effective batch size of 512 (implemented via gradient accumulation across 4 batches of 128 to fit onto Nvidia P100 GPUs). We use an Adam optimizer with a fixed learning rate of 1e-4.

5.4 Pretrained RoBERTa-based Classifier

Since the introduction of BERT [9], language model pre-training has gained immense popularity, leading to models with superior performance on many NLP tasks and reductions in the amount of task-specific training data required. Its core mechanism is to compute a layer-wise self-attention across all tokens in the text, which allows it to effectively capture long-distance interactions without the architectural restrictions imposed by sequential models. RoBERTa [24] further improved BERT by employing certain techniques, such as longer training and key hyperparameter adjustment. We apply this model to our task via transfer learning to test how a Transformer model pretrained on a language model objective performs against our BiLSTM model trained from scratch.

We fine-tuned a pre-trained RoBERTa model (HuggingFace's 'roberta-base' [42]) on the BVA corpus using the citation prediction task. The model uses 24 layers, a hidden size of 1024, and 16 self-attention heads, leading to 355M parameters overall. We apply a common sequence classification architecture and, similar to our BiLSTM model, feed the final hidden layer's output through two dropout/linear layers separated by a tanh activation to produce the final hidden vector of citation-vocabulary size.

[Figure 3: The RoBERTa model architecture.]

To fine-tune our RoBERTa model, we use the same data preprocessing and loading as in the BiLSTM experiment. We tokenize the BVA decisions with the pre-trained RoBERTa tokenizer provided by Huggingface [42] and apply our citation extraction and normalization procedure. Sequences are padded to the same length and an attention mask is generated to indicate whether the corresponding token is a padding token. Formally, a pair of tensors (b_d, a_d) is extracted from each document d, where b_d represents the token ids and a_d represents the attention mask. Label l_d is the index in the citation vocabulary of the first citation following the given context b_d. We compute cross-entropy loss between predictions p_d for (b_d, a_d) and label l_d.

Again, to allow training with relatively large batches on Nvidia P100 GPUs, we use a batch size of 192 and accumulate gradients for three steps before performing back-propagation, resulting in an effective batch size of 576. We use the AdamW optimizer with a learning rate of 1e-4.
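For orientation, the sketch below shows a generic HuggingFace fine-tuning loop consistent with the setup just described (sequence classification over the citation vocabulary, padding with an attention mask, AdamW, gradient accumulation). It omits the citation-token re-insertion step, uses placeholder names (NUM_CITATIONS, ACCUM_STEPS, loader), and is not the authors' training code.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_CITATIONS)          # NUM_CITATIONS: placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for step, batch in enumerate(loader):                   # loader yields context text + label id
    enc = tokenizer(batch["context"], padding=True, truncation=True,
                    max_length=256, return_tensors="pt")
    out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"],
                labels=batch["label"])
    (out.loss / ACCUM_STEPS).backward()                 # accumulate gradients over small batches
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```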


5.5 Sampling-based Data Loading

Each data instance for the BiLSTM and RoBERTa models consists of a context window and a forecasting window to the left and right side of an offset. During every training epoch, and during evaluation, we sample a random offset for each case from all offsets whose forecasting window contains a citation token. We designed data loading this way to mitigate the prohibitively large space of traversing all possible context/forecasting window combinations for all citations in all cases. Note that, because the target is always the first citation within the forecasting window, our data loading is biased against citations that rarely appear first in strings of successive citations. We plan to address this imbalance in future work.
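A minimal sketch of this offset-sampling scheme is given below, assuming the tokenized decision is available as a list of token ids together with the (sorted) positions of its citation tokens; the function and variable names are illustrative.

```python
import random

def sample_instance(token_ids, citation_positions, ctx_len=256, forecast_len=128):
    """Draw one (context, target) instance from a tokenized decision.

    citation_positions: indices of citation tokens in token_ids, in document order.
    """
    # Candidate offsets: positions whose forecast window [o, o + forecast_len) holds a citation.
    candidates = {o for p in citation_positions
                    for o in range(max(0, p - forecast_len + 1), p + 1)}
    offset = random.choice(sorted(candidates))              # one random offset per case per epoch
    context = token_ids[max(0, offset - ctx_len):offset]    # up to ctx_len tokens of left context
    target = next(p for p in citation_positions
                  if offset <= p < offset + forecast_len)   # first citation in the forecast window
    return context, token_ids[target]
```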

6 RESULTS AND DISCUSSION

6.1 General Performance

Table 2 shows the full citation prediction results of the four models. We add a naive majority vote baseline, which always recommends the 20 most popular citations in descending order of their number of occurrences in the training data.

Model                   | Setting                              | Recall@1      | Recall@5      | Recall@20
Majority Vote           | Original                             | 1.73% (.02%)  | 7.35% (.03%)  | 26.4% (.02%)
Collaborative Filtering | Original                             | 10.2% (.06%)  | 25.5% (.07%)  | 45.4% (.08%)
Collaborative Filtering | Original + Year                      | 9.68% (.05%)  | 24.8% (.08%)  | 45.2% (.09%)
Collaborative Filtering | Original + Year + IssueArea          | 9.64% (.05%)  | 24.7% (.08%)  | 45.2% (.08%)
Collaborative Filtering | Original + Year + IssueArea + VLJ    | 9.60% (.05%)  | 24.7% (.09%)  | 45.2% (.09%)
Text Similarity         | Original                             | 16.4% (.03%)  | 41.1% (.04%)  | 66.2% (.05%)
Text Similarity         | Original + Year                      | 20.4% (.05%)  | 48.2% (.05%)  | 79.5% (.03%)
Text Similarity         | Original + Year + Class              | 16.2% (.06%)  | 51.6% (.07%)  | 82.6% (.05%)
Text Similarity         | Original + Year + Class + VLJ        | 16.6% (.06%)  | 51.7% (.06%)  | 82.7% (.05%)
BiLSTM                  | no metadata (47 epochs)              | 65.2% (.33%)  | 81.8% (.14%)  | 91.1% (.11%)
BiLSTM                  | all metadata (50 epochs)             | 65.8% (.35%)  | 82.4% (.26%)  | 91.3% (.16%)
RoBERTa                 | no metadata (106 epochs)             | 65.6% (.33%)  | 82.8% (.31%)  | 91.7% (.21%)
RoBERTa                 | all metadata (126 epochs)            | 66.2% (.30%)  | 83.2% (.17%)  | 92.1% (.20%)
Table 2: Prediction results. Each model is evaluated on six folds of the test set and the numbers reported are the mean and the standard error (in parentheses) of recall at 1, 5, and 20. Neural models are trained using 256/128 context/forecast windows. All metadata includes year, issue area, and VLJ identifiers.

We first turn to our ordinary machine learning models. A comparison to their 'original' setting — without access to structured metadata — shows the importance of semantic context for citation prediction. The collaborative filtering model uses only the previous citations in a document as input. It returns the correct citation as its top-ranked recommendation 10.2% of the time; recall@5 is 25.5%. By contrast, the text similarity baseline achieves a recall@1 of 16.4% and a recall@5 of 41.1%, on average. This is strong evidence that the textual context preceding a citation is a critical signal. By contrast, the document-level statistical information on citation patterns leveraged by collaborative filtering is less informative.

For the text similarity model, adding metadata information generally gives a noticeable improvement over predictions based on text alone. For example, adding structured information on the year of a decision improves performance, which suggests that the model does not otherwise detect temporal information. But not all metadata is equally useful. Adding information on the identity of the judge produces little or no marginal gain. Further, we do not find evidence that metadata enhances the collaborative filtering model. Interestingly, the benefit in recall@1 of case year information is negated when class is added, although recall@5 and recall@20 improve at the same time. If one were to pursue the baseline further, this effect should be examined.

For purposes of this comparison experiment, we train our BiLSTM and RoBERTa models on a context window of 256 tokens and a forecast window of 128 tokens. They are trained until, in our assessment, validation metrics indicated convergence, at which point they dramatically outperform both baselines. Both predict the correct citation roughly 65-66% of the time and produce a recall@5 of around 81-83% using the textual context alone. The neural models' improvement over the text similarity baseline suggests that the ability to encode more complex semantic meanings—and track long-term dependencies across context windows of significant length—noticeably improves performance in citation recommendations.

We experimented with different metadata combinations for the neural models with 8 epochs of training time and observed no clear differences, and decided to only train all-meta and no-meta models until convergence. Giving the BiLSTM and RoBERTa models access to metadata improves predictive performance by around 0.2-0.6%. That delta, however, is mostly within two standard errors of the two models. Our two possible explanations are (a) that the neural models are capable of implicitly inferring some background features from the legal text itself, and thus they will not benefit much from us providing this information explicitly, and (b) that metadata may not carry much signal for this task.

The superior neural model performance is intuitive in legal text also because the text preceding a citation will typically paraphrase a legal principle or statement that is reflective of that source. We can assume that some portion of our context-forecast instances consist of relatively easy examples. To some degree, short-distance citation prediction can in fact be considered a sentence similarity task. Commercial search engines even use text encoding similarity to suggest cases to cite for a particular sentence (e.g., [6]). Similarly, literal quotations from the source preceding the citation can be certain indicators. However, a pure memorization approach will fail for longer forecast distances, as one can anticipate an upcoming cited source from the narrative progression in the text before it becomes lexically similar to the source closer to the citation. An exception to this consists of large spans of boilerplate text that contain citations and are reused across decisions. To investigate the capacity of our models to anticipate citations from further away, we experiment with different forecasting lengths (see Sections 6.2 and 6.4 below).

A final observation is the stability of predictive performance across the six test set folds, as evidenced by the low standard errors. The neural models have slightly more deviation than the baselines, and the BiLSTM and RoBERTa metrics are generally within the reach of ±2 standard errors within a given recall metric.
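For readers reproducing the evaluation, the per-fold aggregation behind Table 2 amounts to a mean and standard error over six fold-level recall values. The six numbers below are invented for illustration only and are not the actual per-fold results.

```python
import numpy as np

# Illustrative fold-level recall@1 values for one model (six test folds).
fold_recall = np.array([0.660, 0.664, 0.659, 0.663, 0.661, 0.665])
mean = fold_recall.mean()
sem = fold_recall.std(ddof=1) / np.sqrt(len(fold_recall))   # standard error of the mean
print(f"recall@1 = {mean:.1%} ({sem:.2%})")                  # -> "recall@1 = 66.2% (0.10%)"
```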


[Figure 4: Results of the ablation study for Recall at 1, 5, and 20. Three panels (Recall @ 1, Recall @ 5, Recall @ 20), each split into '64 Ahead' and '128 Ahead' rows and 'Text Alone' vs. 'Text & Metadata' columns, plot BiLSTM and RoBERTa performance against context windows of 64, 128, and 256 tokens. Within each panel, the most difficult tasks are in the bottom left corner and the easiest tasks are in the top right. The x-axis shows the context window. "64 ahead" and "128 ahead" refer to the maximum number of tokens between the context window and the target citation. Error bars are 95% confidence intervals.]

6.2 Context & Forecasting Window Sizes

To further explore the behavior of the deep neural models, we conducted an ablation study, in which we varied the size of the context and forecasting windows and varied the availability of structured metadata information. We tested 12 different settings for BiLSTM and RoBERTa, respectively. In this grid-search experiment, all models were trained for 8 epochs before test metrics were computed. The detailed settings and the results are illustrated in Figure 4.

As expected, increasing the forecasting window hurts performance by weakening the semantic link between input and target. Also unsurprising is the upward slope in each panel, which simply shows that providing the models with more context generally improves predictions. But the utility of added context changes with the difficulty of the forecasting task. When the target citation is nearer to the context ('64 ahead'), we observe diminishing returns to context: a 128-token context window is only slightly better than a 64-token context window. When the target is further away ('128 ahead'), more context helps. We hypothesize that adding additional context helps to compensate for the difficulty of the task: the network models are able to infer more clues for the citation given the extra information. As the size of the forecasting window increases, the potential for a weaker semantic relationship between the immediate context and the eventual citation makes it more helpful for the model to have access to additional context.

We observe a similar story with respect to structured metadata: the harder the task, the more helpful it is to add metadata. In the BiLSTM framework, metadata is most helpful when the model is given little context and when the targets are far away. But when the target is nearby ('64 ahead'), performance is statistically indistinguishable between models that do have access to metadata and those that do not. The findings align with our hypothesis that, when enough context is given, the neural network models are able to derive the clues for citations from text snippets, thus obviating the need for metadata information.

Although the ablation experiment was conducted with only 8 epochs of training, our experimental results on models trained until convergence are in line with the observation that the effect of metadata is only marginal. At the very least, however, our results indicate that a conclusive exploration of the effectiveness of individual metadata features in training neural citation recommenders may require considerable computational resources or the use of advanced techniques to reduce training time. Even after 126 and 50 epochs, respectively, our models showed no signs of overfitting, and the decision that the loss decrease had slowed down enough to stop training was a matter of judgment. Given the carbon footprint of neural model training [39], we believe such ablation research on large neural models should be conducted with care.

[Figure 5: Per-class recall@1 for RoBERTa all-meta model over time. Line plot of citation recall@1 (0.0-1.0) by decision year (1999-2017) for case, regulation, and statute citations.]

6.3 Pre-Training vs. Training From Scratch

Despite its general English language-model pretraining, RoBERTa models do not show noticeable superiority over BiLSTM in the ablation experiments, even when a more challenging task and a more complicated context is given. For the models trained until convergence in Table 2, RoBERTa performs better than the BiLSTM model, but only by at most 1% and often with overlapping ±2 standard errors. One possible explanation is that the pretraining of RoBERTa is performed on non-legal text, negating the pretraining benefit for this domain-specific task. Alternatively, the task may not require sophisticated language understanding and/or our supervised setup provides sufficient training to learn citation prediction from scratch. We leave an exploration of the effects of domain-specific pretraining (e.g. using [44]) in this task for future work.


[Figure 6: Per-citation recall@1 vs. number of instances in training data for RoBERTa all-meta model. Scatterplot over citations sorted by recall (x-axis, 0-2000), with citation count in training data (left y-axis, 0-200,000) and citation recall@1 (right y-axis, 0.0-1.0).]

Distance | N     | Recall@1 | Recall@5 | Recall@20
1-16     | 13609 | 78.7     | 91.9     | 97.1
17-32    | 11237 | 75.6     | 89.8     | 95.9
33-48    | 9125  | 68.8     | 85.2     | 93.2
49-64    | 7082  | 63.3     | 81.1     | 91.0
65-80    | 5452  | 55.6     | 75.4     | 87.6
81-96    | 4534  | 52.7     | 73.1     | 87.1
97-112   | 3918  | 47.9     | 69.6     | 84.0
113-128  | 3403  | 42.1     | 66.2     | 82.7
Table 3: RoBERTa all-meta performance binned by token distance from beginning of forecasting window to target citation, based on a single pass over the validation set.
6.4 Error Analysis

Figure 5 shows a relatively consistent recall at k = 1 performance across classes over time. We see a slight downward slope for the case and regulation metric towards the end of our analysis period. This may be due to opinions later in the time period potentially containing new citations and patterns occurring less frequently in the training data. The plot exhibits a single strong upward oscillation in 2002-2003. We believe this is likely due to litigation surrounding the Veterans Claims Assistance Act of 2000, which sparked mass remands by the BVA back to regional offices. This relative shape of the per-class recall graphs stays roughly the same for larger values of k, albeit shifted to higher absolute recall levels.

To assess the influence of the sampling distribution, the combined scatterplot in Figure 6 plots the recall at k = 1 achieved for each citation against its frequency as a prediction target in the training data. Of the 2037 different citations that were loaded in a single pass over the test data (of the total of 4287; see Section 5.5), only about 1200 citations are predicted with non-zero recall. At k = 20 this number increases to about 1700 and the red curve shifts right (not shown). The distribution of blue data points indicates that almost all zero-recall citations occur with very low, or zero, frequency. However, citations with high recall do not follow a recognizable frequency pattern. This is informative for the cold-start problem of new sources becoming available that have not been cited enough yet to be learned by models such as the ones presented here. We are aware of this limitation and leave it for future work.

Finally, we examined whether the number of decisions in the test data authored by a judge correlated with the model's performance in predicting citations from those decisions, but did not find clear patterns. The three-dimensional judge embeddings also did not reveal any clear separation with regard to the per-judge recall. We intend to investigate the relationship between attributes of individual VLJs and the behavior of trained models in future work.

To help characterize the underlying behavior of the models, we drew a sample of 200 erroneous predictions generated by a long-trained RoBERTa model similar to the one in Table 2.5 Two sets of observations indicate that the model has developed some conceptual mapping of citations. First, 16% of the erroneous predictions did appear in the forecast window, somewhere after the first citation. Idiosyncrasies in citation order might explain these errors, but there is no conceptual mismatch. Second, somewhere around 5% of the errors involve regulations that implement a particular statute. For example, one case cites 38 C.F.R. § 3.156(a), a regulation defining when veterans may present "new and material evidence" to reopen a claim. The model predicted a citation to 38 U.S.C. § 5108(a), which is precisely the statute commanding the BVA to reopen claims when veterans identify "new and material evidence." Again, the erroneous prediction is in exactly the right conceptual neighborhood.

5 After qualitative error analysis was completed, a pre-processing bug was corrected, leading to changes in recall values of less than 0.5%. Quantitative results and analyses of converged models reported here are from this slightly improved version.

Consistent with our ablation analysis, our review of the errors suggests the critical role that topical changes in long texts play in generating errors. Table 3 shows recall metrics for targets binned by the position of the target citation within the forecast window between minimum and maximum distances. Since legal analysis is often addressed in a single section of an opinion, close citations are more frequent than distant ones. Unsurprisingly, performance decreases with distance from the context window. From closest to farthest bin, recall@1 shrinks by a relative 47%, recall@5 by 28%, and recall@20 by 15%. This behavior is intuitive and indicates that the system may indeed memorize contexts immediately surrounding citations. Still, the gradual decline in performance, especially for recall@5, suggests that the model is learning some amount of longer-distance patterns. This forms evidence that effective citation recommendation benefits from both a sophisticated representation of context and supervised training on existing citation patterns.

7 CONCLUSION

In this paper, we have implemented and evaluated four models that can recommend citations to lawyers drafting legal opinions. BiLSTM and pretrained RoBERTa perform comparably and outperform the collaborative filtering and bag-of-words baselines. Our ablation experiments show that (a) adding metadata about case year, issue, and judge only leads to insignificant performance improvements for the neural models, and (b) predicting citations further away from the context is more difficult, which can be compensated to some degree by providing more context. Training for extended periods continuously improves up to a recall@5 of 83.2%. As such, we have shown that context-based citation recommendation systems can be implemented as classifiers for a largely normalized citation vocabulary with acceptable performance. Further, our error analysis shows that even incorrect predictions may still be useful.


Our work also points to the next steps for legal citation prediction. First, citation prediction can be conceived of more broadly as language generation. Research should hence explore whether neural models can go beyond pointing to an entry in the citation vocabulary and write valid citation strings appropriate for a given context, possibly as part of a continuation of the text. Second, as a practical matter, it will be important to evaluate the usefulness of the models trained here with expert users. Finally, we note that legal sources and institutions form dynamic systems. Constant adaptation, such as detecting and accounting for changes in precedent, will be key to the future utility of citation systems.

These future directions could rapidly improve legal citation, and our results here show that context-aware citation prediction can play a significant role in improving the accuracy, consistency, and speed of mass adjudication.

8 STATEMENT OF CONTRIBUTIONS

The project was conceived and planned by all authors. ZH, CL, MT, and HZ conducted all model development and experimental work under the mentorship of DEH, MK, and MG. MK and MG developed the citation preprocessing functionality, as well as produced the error analysis. All authors contributed to writing the paper.

9 ACKNOWLEDGMENTS

The authors thank CMU MCDS students Dahua Gan, Jiayuan Xu, and Lucen Zhao for creating the issue typology, Anne McDonough for supporting contributions around citation normalization, and Dave Ames, Eric Nyberg, Mansheej Paul, and RegLab meeting participants for helpful feedback.

REFERENCES
[1] Giambattista Amati. 2009. BM25. Springer US, Boston, MA, 257-260.
[2] David Ames, Cassandra Handan-Nader, Daniel E. Ho, and David Marcus. 2020. Due Process and Mass Adjudication: Crisis and Reform. Stanford Law Review 72 (2020), 1-78.
[3] Shannon Bradshaw. 2004. Reference Directed Indexing: Redeeming Relevance for Subject Search in Citation Indexes. In Research and Advanced Technology for Digital Libraries, Vol. 2769. 499-510.
[4] Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra, and C. Lee Giles. 2013. Can't See the Forest for the Trees? A Citation Recommendation System. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '13). 111-114.
[5] Caselaw Access Project. 2020. Caselaw Access Project. https://case.law.
[6] CaseText. 2020. The Machine Learning Technology Behind Parallel Search. https://casetext.com/blog/machine-learning-behind-parallel-search/. Accessed: 2020-12-18.
[7] Columbia Law Review Ass'n, Harvard Law Review Ass'n, and Yale Law Journal. 2015. The Bluebook: A Uniform System of Citation (21st ed.).
[8] Faraz Dadgostari, Mauricio Guim, P. Beling, Michael A. Livermore, and D. Rockmore. 2020. Modeling law search as prediction. Artif. Intell. Law 29 (2020), 3-34.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings NAACL-HLT '19. 4171-4186.
[10] Travis Ebesu and Yi Fang. 2017. Neural Citation Network for Context-Aware Citation Recommendation. In Proceedings SIGIR '17. 1093-1096.
[11] James Fowler, Timothy Johnson, James Spriggs, Sangick Jeon, and Paul Wahlbeck. 2007. Network Analysis and the Law: Measuring the Legal Importance of Precedents at the U.S. Supreme Court. Political Analysis 15 (06 2007).
[12] Marco Gori and Augusto Pucci. 2006. Research Paper Recommender Systems: A Random-Walk Based Approach. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI '06) (2006), 778-781.
[13] Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, and Lee Giles. 2010. Context-Aware Citation Recommendation. Proceedings of the 19th International Conference on World Wide Web (2010), 421-430. https://doi.org/10.1145/1772690.1772734
[14] Daniel E. Ho, Cassandra Handan-Nader, David Ames, and David Marcus. 2019. Quality Review of Mass Adjudication: A Randomized Natural Experiment at the Board of Veterans Appeals, 2003-16. The Journal of Law, Economics, and Organization 35, 2 (03 2019), 239-288. https://doi.org/10.1093/jleo/ewz001
[15] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735-1780.
[16] Wenyi Huang, Zhaohui Wu, Chen Liang, Prasenjit Mitra, and C. Lee Giles. 2015. A Neural Probabilistic Model for Context Based Citation Recommendation. In Proceedings AAAI '15. 2404-2410.
[17] Casetext Inc. 2020. CARA A.I. | Casetext. Retrieved December 17, 2020 from https://casetext.com/cara-ai
[18] ROSS Intelligence Inc. 2020. ROSS Intelligence. Retrieved December 17, 2020 from https://blog.rossintelligence.com
[19] Glen Jeh and Jennifer Widom. 2002. SimRank: A Measure of Structural-Context Similarity. In Proceedings KDD '02. 538-543.
[20] Thorsten Joachims. 2002. Optimizing Search Engines Using Clickthrough Data. Proceedings KDD '02 (2002), 133-142.
[21] Marios Koniaris, Ioannis Anagnostopoulos, and Yannis Vassiliou. 2017. Network analysis in the legal domain: a complex model for European Union legal sources. Journal of Complex Networks 6, 2 (08 2017), 243-268.
[22] Peng-Hsuan Li, Tsu-Jui Fu, and Wei-Yun Ma. 2020. Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER. In AAAI '20. 8236-8244.
[23] David Liben-Nowell and Jon Kleinberg. 2007. The Link-Prediction Problem for Social Networks. J. Am. Soc. Inf. Sci. Technol. 58, 7 (May 2007), 1019-1031.
[24] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692
[25] Ji Ma, Kuzman Ganchev, and David Weiss. 2018. State-of-the-art Chinese Word Segmentation with Bi-LSTMs. In Proceedings EMNLP '18. 4902-4908.
[26] Shutian Ma, Chengzhi Zhang, and Xiaozhong Liu. 2020. A review of citation recommendation: from textual content to enriched context. Scientometrics 122, 3 (2020), 1445-1472.
[27] Jerry L. Mashaw. 1985. Bureaucratic justice: Managing social security disability claims. Yale University Press.
[28] Sean M. McNee, Istvan Albert, Dan Cosley, Prateep Gopalkrishnan, Shyong K. Lam, Al Mamunur Rashid, Joseph A. Konstan, and John Riedl. 2002. On the Recommending of Citations for Research Papers. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work (CSCW '02). 116-125.
[29] Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589 (2017).
[30] J.C. Oleson. 2003. You Make Me Sic: Confessions of a Sadistic Law Review Editor. U.C. Davis Law Review 37 (2003).
[31] Larry Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford University (1998).
[32] Lazar Peric, Stefan Mijic, Dominik Stammbach, and Elliott Ash. 2020. Legal Language Modeling with Transformers. In Proceedings ASAIL 2020, Vol. 2764. CEUR-WS.
[33] Anna Ritchie. 2009. Citation context analysis for information retrieval. PhD thesis, University of Cambridge.
[34] Anna Ritchie, Stephen Robertson, and Simone Teufel. 2008. Comparing Citation Contexts for Information Retrieval. Proceedings CIKM '08 (2008), 213-222.
[35] Ali Sadeghian, Laksshman Sundaram, Daisy Zhe Wang, William F. Hamilton, Karl Branting, and Craig Pfeifer. 2018. Automatic Semantic Edge Labeling over Legal Citation Graphs. Artif. Intell. Law 26, 2 (2018), 127-144.
[36] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings ACL '16 (Volume 1: Long Papers). 1715-1725.
[37] Trevor Strohman, W. Bruce Croft, and David Jensen. 2007. Recommending citations for academic papers. In Proceedings SIGIR '07. 705-706.
[38] Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce Croft. 2005. Indri: a language-model based search engine for complex queries. Technical Report. In Proceedings of the International Conference on Intelligent Analysis.
[39] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings ACL '19. 3645-3650.
[40] Merine Thomas, Thomas Vacek, Xin Shuai, Wenhui Liao, George Sanchez, Paras Sethia, Don Teo, Kanika Madan, and Tonya Custis. 2020. Quick Check: A Legal Research Recommendation System. In Proceedings NLLP '20, Vol. 2645. CEUR-WS.
[41] Radboud Winkels, Alexander Boer, Bart Vredebregt, and Alexander von Someren. 2014. Towards a Legal Recommender System. In Proceedings JURIX '14.
[42] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings EMNLP '20: System Demonstrations. 38-45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
[43] Paul Zhang and Lavanya Koppaka. 2007. Semantics-Based Legal Citation Network. In Proceedings ICAIL '07. 123-130.
[44] Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings ICAIL '21. arXiv:2104.08671 (in press).

A dynamic model for balancing values

Juliano Maranhão* (julianomaranhao@usp.br), University of São Paulo Law School, São Paulo, SP, Brazil
Edelcio G. de Souza (edelcio.souza@usp.br), University of São Paulo - FFLCH, São Paulo, SP, Brazil
Giovanni Sartor (giovanni.sartor@eui.eu), European University Institute, Firenze, FI, Italy

ABSTRACT

We propose an additive model for balancing the impacts of actions on values, where factors intensify or attenuate impacts on values, and values are assigned degrees of relative importance (weights). The balancing model induces axiological rules, consisting in prohibitions or permissions that are justified according to the impacts of the prohibited or permitted action on the values at stake. We also propose eight different revision operators, which shift the balance - and thus induce different norms - by expanding or contracting either the set of factors or the set of values. We provide the construction and prove some success properties of those operators.

KEYWORDS

balancing values, change functions, teleological interpretation and argumentation

ACM Reference Format:
Juliano Maranhão, Edelcio G. de Souza, and Giovanni Sartor. 2021. A dynamic model for balancing values. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21-25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466143

1 INTRODUCTION

There is an apparent consensus in jurisprudence that legal decision-making cannot be fully driven by rules alone; it calls for value-based reasoning. Value-based reasoning is indeed relevant to the law in multiple regards.

First of all, ethical and political values (the so-called political morality) provide a critical framework for assessing the merit of positive laws. Existing laws, as resulting from legislative enactment, from the practice of legal officers or from custom, can be critically examined with regard to the extent that they meet or violate ideals of justice and fairness, or that they promote or demote particular human rights or social values. The divergence of laws from ideas of justice may justify citizens in disobeying certain laws (as in cases of civil disobedience) or even officers in refusing the application of such laws. Legal theorists have provided different interpretations of this phenomenon, depending on views on the relation between law and morality. Those endorsing a complete separation of law and morality argue that the laws departing from justice still count as perfectly valid laws [13]; those affirming the intertwining of law and morality argue, on the contrary, that such laws are legally defective and, in extreme cases, legally invalid [3].

Second, values play a key role in the context of legal interpretation. In determining the meaning of legal sources, a dominant role is played by teleological approaches, where the ascription of a meaning to a legal provision over other possible meanings is justified on the ground that the selected meaning better promotes desirable interests or goals, or better prevents undesired outcomes [5].

Values also play a key role in constitutional review, where assessments are often performed according to proportionality, i.e., by determining whether an infringement of constitutional rights is justified by non-inferior advantages with regard to other constitutional rights or values, provided that no less infringing choice delivers a better trade-off [2]. More generally, values play a key role in all instances of legal decision-making where there is a space for discretion. In such cases the decision maker has to consider the merit of alternative choices. This has to be done by taking into account all legally relevant values (which the decision maker is allowed to consider), and the extent to which each choice promotes or demotes such values, in the context provided by relevant features (factors) of the case.

Value-based reasoning has been given growing attention by research on AI & Law.

Following the seminal contribution by Berman and Hafner [8], AI & Law research has provided multiple models of the relation between cases (and the factors that such cases include or express) and the values at stake. Bench-Capon and Sartor [6] assign values to factors and consequently to rules embedding such factors. They explain precedents according to the applicable rules and the importance of the (sets of) values promoted by such rules. They compare alternative sets of rules in terms of their coherence with precedents and values. Bench-Capon et al. [19] formalise teleological reasoning using logics for defeasible argumentation, extended with the possibility to express arguments about values, supported by cases. Grabmair [12] defines functions representing the extent to which a factor contributes to make it so that a certain outcome promotes a certain value, and compares alternative outcomes accordingly. Sartor [21] explores the proportional balance of constitutional rights (as theorized by Alexy [2]), where a legal outcome is compared to alternative outcomes based on its impact on the promotion and demotion of values, and examines consistency between value-based decisions of cases, given the factors present in such cases [22].

AI & Law research on statutory interpretation has also explored the relations between legal rules and the values underlying the deliberation and moral/political justification of their enactment, based on dynamic approaches where values guide changes of the content of rules. For instance, Boella et al. [9] introduce values as coherence parameters guiding the change of conceptual rules.


Such models of statutory interpretation were also explored in the AGM style [15] and in the framework of i/o logics [10]. In this line Maranhão [16] proposed an architecture of i/o logics where values are represented as rules, and constitutive and regulative rules are the object of different revision functions. Walton, Macagno and Sartor [23] analyse multiple argument schemes for interpretive reasoning.

In this paper we propose an additive model of balancing values. In this model, each relevant factor contributes to intensify or attenuate the impact of the permissibility of the action at stake on a set of relevant values which are promoted or demoted by the permissibility of that action. These influences are then proportionally considered with respect to the relative importance of the values being impacted. The resulting assessments of the action’s impacts on single values are then aggregated to determine the action’s total impact on the set of values it promotes and on the set of values it demotes. The comparison of the action’s impacts on the promoted and on the demoted values enables us to determine whether the action is axiologically permitted or rather prohibited.

After presenting this model, we propose change functions that shift the axiological evaluation and consequently the axiological rule that applies to it. Such shifts are operated by additions or subtractions of factors or by additions or subtractions of values in the model. These operations have some resemblance to argument moves, where new features of the case or moral considerations are brought about to oppose previously justified conclusions.

To illustrate our model, we shall use some variations of the leading case Riley v. California, judged by the U.S. Supreme Court in 2014. In that precedent, the court concluded that a specific warrant was needed to access the digital content of a mobile phone of an arrestee, considering the significant amount of personal data usually stored in such a device. In its own words, such access “implicates substantially greater privacy interests than a brief physical search” of the items accessible to the arrestee, which is on the contrary allowed.

The paper is structured as follows. First, in Section 2, we introduce the additive model for balancing the impact of actions on values. In Section 3, we describe how to build systems of value-rules or axiological systems. Then, in Section 4, we introduce and discuss eight revision operators upon axiological systems and specify their success conditions. We conclude the paper discussing some limitations of our model and indicating possible paths for future research.

2 AN ADDITIVE MODEL FOR BALANCING VALUES
In this Section we shall first introduce the general idea and then provide a formal account of an action’s impact on relevant values, in given contexts. Then we show how such impact assessment induces value-rules and finally discuss some properties of systems of such rules and their relation with the deontological rules contained in the positive legal system.

2.1 Introducing the additive model
An axiological model of balancing presupposes a determination of the comparative moral merits of the choice of performing an action rather than abstaining from it. The action may consist in any behaviour, e.g., having an abortion rather than continuing the pregnancy, or accessing a text message stored in a mobile phone rather than respecting its confidentiality. The evaluation model basically compares, for each given action, its impact on the set of values it promotes against its impact on the set of values it demotes, given the constellation of factors, i.e., the context in which the action is performed. Two clarifications are of central importance to understand the model here proposed.

First, we only consider the assessment of the impact of a single action on values, and therefore we only compare the values promoted against the values demoted by that specific action, so that a decision takes place whether that action should or should not be performed on moral grounds. There is no room in this model to compare and decide among different and logically independent actions in terms of their impacts on values. Typically, a claim before a court questions the legality of a particular action, and the court must decide whether that action under evaluation should be performed or not (should be forbidden or permitted, should be punished or not be punished). So, we keep the same structure regarding its axiological evaluation. We acknowledge that there may be contexts where a judicial decision compares and chooses among alternative courses of action, for instance, between the consumer’s right to receive a new product or to have his money back. However, we shall leave this kind of value assessment to future work.

Second, we assume that the direction of impact of an action on a value – i.e., whether the action promotes or demotes the value – is invariant, although the extent of the promotion or demotion may be intensified or attenuated by the presence of factors in the context of performance. By saying that the direction of impact is invariant, we mean that irrespective of how many attenuating factors are taken into account, the impact of an action on the promotion of a particular value never shifts to its demotion. And vice-versa, the impact of the action at stake on the demotion of a value never shifts to its promotion.

Let us illustrate the model’s underlying rationality with an example. Suppose the rules of a condominium forbid people to take the elevator during the COVID-19 pandemic. Suppose now that one inhabitant has a medical emergency. Then one could evaluate whether following the rule would lead to immoral results. The factor “medical emergency” is an intensifier w.r.t. the promotion of the value of the patient’s health, which would lead to a permission to use the elevator. But now consider that the emergency does not hinder the patient’s ability to walk (for instance, it is a toothache) and that she lives on the second floor. So the proportional influence of the set of factors on the promotion of the patient’s health may become null or negative, but one would not say that the action of taking the elevator would now demote her health in that particular context. Actually, the action still promotes health even in the presence of those attenuating factors. But in such cases the proportional impact of the action is so low that it becomes morally irrelevant to legal considerations, that is, it will not play a role in a consideration whether to follow the rule or not. Hence, in the model here proposed, attenuating factors only affect the degree of moral impact of the action on a value.
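A small numeric sketch may help to fix ideas. The Python fragment below uses made-up figures (the paper assigns none to this informal example) to show that attenuating factors lower the proportional contribution to the promotion of health without ever reversing its direction.

```python
# A toy rendering of the elevator example. All figures are hypothetical;
# they only illustrate that attenuators lower the degree of promotion
# without ever turning it into a demotion.
HEALTH_WEIGHT = 0.6        # assumed weight of the value Health
BASELINE = 0.2             # assumed baseline impact of taking the elevator on Health

influence_on_health = {    # intensifiers are positive, attenuators negative
    "medical_emergency": 0.8,
    "can_still_walk": -0.6,
    "second_floor": -0.4,
}

def promotion_of_health(factors):
    """Aggregate proportional contribution of the factors to the promotion of Health."""
    return (BASELINE + sum(influence_on_health[f] for f in factors)) * HEALTH_WEIGHT

print(round(promotion_of_health({"medical_emergency"}), 2))                  # 0.6
print(round(promotion_of_health({"medical_emergency", "can_still_walk",
                                 "second_floor"}), 2))                       # 0.0
# Health remains a promoted value in both contexts; only the degree of the
# promotion changes, possibly to the point of moral irrelevance.
```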


2.2 The model for balancing
The model of balancing may be described by the structure

V = ⟨𝐴𝑐𝑡, 𝐹𝑎𝑐𝑡, 𝑉𝑎𝑙, 𝑃𝑟𝑜𝑚, 𝐷𝑒𝑚, {𝐼𝑥}𝑥∈𝐴𝑐𝑡, 𝑤⟩

whose elements are going to be detailed and discussed below.
We are going to work with two sorts of literals, actions and factors. The set of actions 𝐴𝑐𝑡 is the union 𝑃𝐴𝑐𝑡 ∪ 𝑁𝐴𝑐𝑡 of a set of atomic actions {𝑥1, 𝑥2, . . .} and of their negations {¬𝑥1, ¬𝑥2, . . .}. Similarly, the set of factors 𝐹𝑎𝑐𝑡 is the union 𝑃𝐹𝑎𝑐𝑡 ∪ 𝑁𝐹𝑎𝑐𝑡 of a set of atomic factors {𝑓1, 𝑓2, . . .} and of their negations {¬𝑓1, ¬𝑓2, . . .}. We write 𝑥̄ to denote the complement (negation) of action 𝑥. The set 𝑉𝑎𝑙 = {𝑉1, 𝑉2, ..., 𝑉𝑛} is a finite set whose elements are values.
Each action may also be subject to the usual deontic qualification. By combining 𝑥 and the conjunction Φ∧ of all factors in a set Φ we obtain dyadic deontic formulae 𝑂(𝑥|Φ∧) and 𝑃(𝑥|Φ∧) stating, respectively, that action 𝑥 is obligatory or permitted under condition Φ∧. We say that the formulas 𝑂(𝑥|Φ∧) and 𝑃(𝑥̄|Φ∧) are deontic opposites.
In the following, we distinguish two possible deontic evaluations of an action, under given factors. The first, which is indicated by the operators 𝑂𝑑 and 𝑃𝑑, is the deontological evaluation, given by a set of positive norms, which explicitly state obligations and permissions (e.g., the norms stated by a legislator). The second, indicated by the operators 𝑂𝑣 and 𝑃𝑣, corresponds to the axiological evaluation. This evaluation refers to the connection between actions and values: it considers to what extent an action promotes or demotes a value. Under a different reading, that may be more appropriate when engaging in the axiological evaluation of positive norms, the axiological evaluation considers whether making an action permissible (in a positive legal code) promotes or rather demotes the values at stake. Consider for instance the case of abortion. When considering the ethical merits of the legal permissibility of abortion, we are not engaging in the moral merit of a woman’s choice to have or not to have an abortion, but rather in the moral merit of making abortion permissible rather than forbidden. This assessment does not pertain to the morality of individuals, but rather to political morality, i.e., to the morality of making public choices, on what is to be imposed or not on citizens.
Thus if 𝑥 is an action, we define 𝑃𝑟𝑜𝑚(𝑥) ⊆ 𝑉𝑎𝑙 as the set of values promoted by (the legal permissibility of) action 𝑥 and 𝐷𝑒𝑚(𝑥) ⊆ 𝑉𝑎𝑙 the set of values demoted by (the permissibility of) the action 𝑥, where 𝑃𝑟𝑜𝑚(𝑥) ∩ 𝐷𝑒𝑚(𝑥) = ∅.
The comparison depends on the evaluations expressed by the quantitative assignments of weights to values and of the influence of factors on the impact of actions on values. For generality’s sake we assume that such indexes can take arbitrary numerical assignments within given ranges. These numbers can be restricted to any scales that may be convenient for the chosen domain of application. Here we shall use the positions (0, .2, .4, .6, .8, 1) in the examples. What matters is that the numerical assignments reflect some relative importance of the elements at stake, as part of a reasoning with dimensions and magnitudes, and how such assessment of relative importance affects the outputs of the system and its overall coherence.

Definition 2.1. (weight function) For the finite set of values 𝑉𝑎𝑙 we define the weight function 𝑤 : 𝑉𝑎𝑙 −→ [0, 1], where 𝑤(𝑉𝑖) is the weight of the value 𝑉𝑖.

Definition 2.2. (impact function) For each action 𝑥 we define 𝐼𝑥 : (𝐹𝑎𝑐𝑡 ∪ {∅}) × 𝑉𝑎𝑙 −→ [−1, 1], where 𝐼𝑥(𝑓, 𝑉𝑖) is the influence of factor 𝑓 on the impact of action 𝑥 on the value 𝑉𝑖.

Note that the function 𝐼𝑥 is fixed for a particular action 𝑥 and it regards the influence that each factor has on the impact of the action at stake on each value. The influence may be null, i.e., take the value 0, when the factor has no influence on the action’s impact on a particular value, that is, when it is a morally neutral factor regarding that action and value. When the influence takes a positive real number in the interval, we say that the factor is an intensifier of the impact of the action on the value. And if the function assigns the factor a negative real number in the interval, we say that the factor is an attenuator of such impact. The action at stake also has a baseline impact on each value, given when the impact function takes as argument the empty set.
Given these assignments of weight and impact we may define the proportional impact of a factor on a value for a given action.

Definition 2.3. (Proportional influence of a factor on a value) Let 𝑥 be an action, 𝑓 a factor and 𝑉𝑖 a value in V ⊆ 𝑉𝑎𝑙; we define:

Δ_x^f(V_i) = I_x(f, V_i) × w(V_i)

The value assessment will compare the influence of all relevant factors on all relevant values: on the one hand, the values promoted by the action in a given constellation of factors, and on the other hand the values demoted by that action in that context. So we take the sum of the proportional impacts on each set of promoted or demoted values.

Definition 2.4. (Proportional influence of a factor on a set of values) Let 𝑥 be an action, 𝑓 a factor and V ⊆ 𝑉𝑎𝑙 a set of values; then we define:

B_Prom^{f,x}(V) = Σ_{V_i ∈ Prom(x) ∩ V} Δ_x^f(V_i)

The same definition holds mutatis mutandis for the proportional demotion of values, denoted by B_Dem^{f,x}(V). When V = 𝑉𝑎𝑙 we write simply B_Prom/Dem^{f,x}.

The proportional influence of a factor on the promotion (demotion) of values may now be straightforwardly extended to a set of factors:

Definition 2.5. (Proportional influence of a set of factors on a set of values) Let Φ = {𝑓1, 𝑓2, ..., 𝑓𝑛} be a set of factors, then:

B_Prom^{x,Φ} = Σ_{f ∈ Φ} B_Prom^{x,f}

Again, the same definition holds mutatis mutandis for the proportional influence of a set of factors on the demotion of values, denoted by B_Dem^{x,Φ}.

2.3 Value-rules induced by balancing
We now define an entailment-like relation to extract from a given value assessment whether the action under analysis would be axiologically permitted, forbidden or indifferent. The action at stake is forbidden if the proportional aggregate demotion of values is


positive and higher than the proportional aggregate promotion of Example 2.8 (Riley vs California). Consider the action 𝑎𝑐𝑐, the set
values. Otherwise, if the number corresponding to the aggregate of factors Φ = {𝑎𝑟𝑟, 𝑝𝑟𝑜𝑝, 𝑚𝑜𝑏} and values V = {𝑆𝑎𝑓 , 𝑃𝑟𝑖𝑔ℎ𝑡, 𝑃𝑟𝑖𝑣 },
promotion of values is higher than the aggregate demotion, the where 𝑃𝑟𝑜𝑚(V) = {𝑆𝑎𝑓 } and 𝐷𝑒𝑚(V) = {𝑃𝑟𝑖𝑔ℎ𝑡, 𝑃𝑟𝑖𝑣 }, with the
action is permitted. If the proportional aggregate promotion and following weights 𝑤 (𝑃𝑟𝑖𝑣) = .6, 𝑤 (𝑃𝑟𝑖𝑔ℎ𝑡) = .4 and 𝑤 (𝑆𝑎𝑓 ) = .6.
demotion are equivalent or if the aggregate promotion or demo- Consider also that accessing items found with an individual has
tion is a non positive real number, then the action is not morally only a baseline impact on privacy, as it would not impact property
relevant, and therefore also permitted. if there are no property items neither promote safety if there is
no evidence that the individual is related to any criminal offence
Definition 2.6. Let 𝑥 ∈ 𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡, and Φ∧ be the conjunction (zero baseline impact). So, we have 𝐼𝑎𝑐𝑐 (∅, 𝑃𝑟𝑖𝑣) = .4. Now consider
of all factors in Φ. Then the value-rule induced by Φ and 𝑥 is the influence of the factors specified above on the impact of the
(i) 𝑂 𝑣 (𝑥 |Φ∧ ), if 𝐵𝑥,Φ
𝐷𝑒𝑚 > 𝐵𝑃𝑟𝑜𝑚 and 𝐵 𝐷𝑒𝑚 > 0,
𝑥,Φ 𝑥,Φ
action on the relevant values. A considerable impact on property
(ii) 𝑃 𝑣 (𝑥 |Φ∧ ), otherwise right by accessing property 𝐼𝑎𝑐𝑐 (𝑝𝑟𝑜𝑝, 𝑃𝑟𝑖𝑔ℎ𝑡) = .4, a significant
promotion of safety if there is sufficient evidence of criminal offence
We say that the balance shifts whenever the addition of a new
for an arrest 𝐼𝑎𝑐𝑐 (𝑎𝑟𝑟, 𝑆𝑎𝑓 ) = .8 and, as considered, by the court, an
set of factors to the original set of factors changes the value rule
extreme impact on privacy by accessing personal data stored in a
induced by the balancing. That is, if we have 𝑂 𝑣 (𝑥 |Φ∧ ) then the
mobile phone 𝐼𝑎𝑐𝑐 (𝑚𝑜𝑏, 𝑃𝑟𝑖𝑣) = 1.
addition of the set of factors Θ shifts the balance if it holds that
𝑃𝑟𝑜𝑚 ≥ 𝐵 𝐷𝑒𝑚 or 𝐵 𝐷𝑒𝑚 < 0. Likewise, if it holds that 𝑃 𝑣 (𝑥 |Φ ),
𝐵𝑥,Φ∪Θ 𝑥,Φ∪Θ 𝑥,Φ∪Θ ∧
In the example above we have the following balance in the base-
then the balance shifts if we obtain 0 < 𝐵𝑥,Φ∪Θ
𝐷𝑒𝑚 > 𝐵𝑃𝑟𝑜𝑚 .
𝑥,Φ∪Θ
line context, 𝐵𝑎𝑐𝑐,∅
𝑃𝑟𝑜𝑚 = 0 against 𝐵 𝐷𝑒𝑚 = .24, thus morally prohibit-
𝑎𝑐𝑐,∅
The dyadic deontic statements obtained by the balancing may
ing police officers to access items in the possession of any indi-
be taken as elements of a theory in a dyadic deontic logic, where
vidual (VR𝑎𝑐𝑐 = {𝑂¬𝑎𝑐𝑐}). In the context of an arrest, we would
further deontic statements may be derived. If Φ = ∅, we express ∅
𝑎𝑐𝑐,{𝑎𝑟𝑟 } 𝑎𝑐𝑐,{𝑎𝑟𝑟 }
the baseline evaluation of an action 𝑥 as 𝑂/𝑃 𝑣 (𝑥 |⊤), which we may have 𝐵𝑃𝑟𝑜𝑚 = .48 against 𝐵𝐷𝑒𝑚 = .24, thus rendering the
also be written as a monadic obligation/permission 𝑂/𝑃 𝑣 𝑥. access morally permitted (VR𝑎𝑐𝑐{𝑎𝑟𝑟 }
= {𝑃 (𝑎𝑐𝑐 |𝑎𝑟𝑟 )}. If the seizure
during the arrest includes property items the access would still be
Definition 2.7. Let 𝑥 be an action, Φ a set o factors and V ⊆ 𝑉 𝑎𝑙. 𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝 }
morally justified, given that we would have 𝐵𝑃𝑟𝑜𝑚 = .48
Then VR𝑥Θ (Φ)is the set of value-rules induced by the balancing 𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝 }
model for Θ ⊆ Φ and VR𝑥 (Φ) is the class of sets of axiological rules against 𝐵𝐷𝑒𝑚 = .40, thus reflecting the previous US case
VR𝑥Θ (Φ) such that Θ𝑖 ⊆ Φ. law (that is, VR𝑎𝑐𝑐
{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝 }
= {𝑃 𝑣 (𝑎𝑐𝑐 |𝑎𝑟𝑟 ∧ 𝑝𝑟𝑜𝑝)}). Finally, con-
𝑖
sidering the additional factor that the property item is a mobile
Let us illustrate the model by an example. We adopt here a phone, the finding of the court in Riley vs California would be
convention to omit the reference to the argument of an impact 𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝,𝑚𝑜𝑏 }
morally justified by the model, given that 𝐵𝑃𝑟𝑜𝑚 = .48
function when its value is zero. 𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝,𝑚𝑜𝑏 }
According to the US case law before the judgment of Riley vs against 𝐵𝐷𝑒𝑚 = 1 (thus leading to VR𝑎𝑐𝑐
{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝,𝑚𝑜𝑏 }
=
California (2014) a police officer was allowed to access the content 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑎𝑟𝑟 ∧ 𝑝𝑟𝑜𝑝 ∧ 𝑚𝑜𝑏).
of any items, including property, in the premises or surroundings
when arresting an individual due to any criminal offense. However, 2.4 Consistency, coherence and the Radbruch’s
in the Riley vs California case, the arrest included the seizure of a
mobile phone and, from the content accessed in that device, the
formula
officers found evidence of another crime, which led to a conviction. Based on the aggregate impacts on the demotion and promotion of
The U.S. Supreme Court concluded that a specific warrant was the values, which are triggered by an action, we define the propor-
needed to access the digital content of the mobile phone of the tional impact of an induced value-rule as the difference between
arrestee. It considered that the significant amount of personal data, the values promoted and demoted in the assessment.
usually stored in a mobile phone, would involve an inadmissible
impact on privacy. This rule could be explained by the following Definition 2.9. (Proportional impact of a rule) Consider an action
considerations on the underlying value impacts: accessing items 𝑥, a set of factors Φ and a set of values V.
(𝑎𝑐𝑐) in an arrest has a baseline impact on the promotion of public Then, 𝜎 (𝑥, Φ, V) = 𝐵𝑥,Φ
𝑃𝑟𝑜𝑚 (V)
− 𝐵𝑥,Φ
𝐷𝑒𝑚 (V)
is the proportional impact
safety (𝑆𝑎𝑓 ) and a baseline demotion of property rights (𝑃𝑟𝑖𝑔ℎ𝑡) and on values V of the axiological rules 𝑃 𝑣 (𝑥 |Φ∧ )/𝑂 𝑣 (𝑥 |Φ∧ ).
privacy (𝑃𝑟𝑖𝑣); the factor “arrest” (𝑎𝑟𝑟 ) intensifies the promotion
of public safety so as to outweigh the extent to which the factor Given the definition above, the polarity (positive or negative) of
“property” (𝑝𝑟𝑜𝑝) intensifies the demotion (through the same action) the proportional impact indicates the modality of the value-rule
of property rights and privacy respectively. However, as considered induced by the balance model. If 𝜎 (𝑥, Φ, V) > 0, then 𝑃 𝑣 (𝑥 |Φ). If
by the court, if the item collected is a mobile phone (𝑚𝑜𝑏), then 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥𝐷 𝑒𝑚 > 0 then 𝑂 𝑣 (𝑥 |Φ)
the negative impact on privacy is intensified to the extent that the Considering that we may obtain value-rules from the balancing
promotion of public safety is outweighed. This led the court to model, we may now evaluate the content of positive rules expressed
introduce an exception, forbidding access to the digital content in a dyadic deontic language with the operators 𝑂𝑑 (𝑥 |𝑎) for pos-
of mobile phones collected during an arrest without an specific itive obligation to do 𝑥 in context 𝑎 and the operator 𝑃𝑑 (𝑥 |𝑎) for
warrant. permission of action 𝑥 in context 𝑎.
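The figures of Example 2.8 can be reproduced mechanically. The following self-contained Python sketch encodes the weight and impact assignments of the example (Definitions 2.1 and 2.2) as dictionaries, with None standing for the empty context, and applies Definitions 2.3–2.6 to recover the induced value-rules. The encoding is ours, not part of the paper's formalism; in particular, the baseline impact is included in every aggregate, which is how the figures stated in the example come out.

```python
# Reproducing the figures of Example 2.8 (Riley v. California).
weights = {"Priv": 0.6, "Pright": 0.4, "Saf": 0.6}   # w(V_i), Definition 2.1
prom = {"Saf"}                                        # Prom(acc)
dem = {"Priv", "Pright"}                              # Dem(acc)
influence = {                                         # I_acc(f, V_i), Definition 2.2
    (None, "Priv"): 0.4,        # baseline impact on privacy
    ("prop", "Pright"): 0.4,    # property items engage property rights
    ("arr", "Saf"): 0.8,        # an arrest intensifies the promotion of safety
    ("mob", "Priv"): 1.0,       # a mobile phone maximally intensifies the privacy impact
}

def aggregate(factors, side):
    """Definitions 2.3-2.5, with the baseline context (None) included in the sum."""
    vals = prom if side == "prom" else dem
    return sum(influence.get((f, v), 0.0) * weights[v]
               for f in set(factors) | {None} for v in vals)

def induced_rule(factors):
    """Definition 2.6: O_v(¬acc | Φ) iff B_Dem > B_Prom and B_Dem > 0; else P_v(acc | Φ)."""
    b_prom, b_dem = aggregate(factors, "prom"), aggregate(factors, "dem")
    rule = "O_v(¬acc)" if b_dem > b_prom and b_dem > 0 else "P_v(acc)"
    return round(b_prom, 2), round(b_dem, 2), rule

for phi in [set(), {"arr"}, {"arr", "prop"}, {"arr", "prop", "mob"}]:
    print(sorted(phi), induced_rule(phi))
# []                       (0.0, 0.24, 'O_v(¬acc)')
# ['arr']                  (0.48, 0.24, 'P_v(acc)')
# ['arr', 'prop']          (0.48, 0.4, 'P_v(acc)')
# ['arr', 'mob', 'prop']   (0.48, 1.0, 'O_v(¬acc)')
```

Read in increasing order of factor sets, the printout retraces the step-by-step shifts that Section 3 compiles into an axiological system.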


We may assume a dyadic deontic logic consequence relation ⊢ In the Example 2.8, suppose we settle the Radbruch’s threshold at
satisfying inclusion and factual detachment and to derive deontic .2. Then only the rule permitting police officers to access property
sentences from two different sets of rules. items would be legally enforceable based on the evaluation, while
all other rules indicated in the example – the permission to access
Definition 2.10. (consistency) Let 𝑥 ∈ 𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡, and Φ∧ be
items in the possession of any individual, the prohibition to access
the conjunction of all factors in Φ. Then a set of rules R is consistent
items with the arrestee and the permission to access the digital
iff
content of the arrestee’s mobile phone in an arrest – would be
• R ⊬ 𝑂𝑑 (𝑥 |Φ∧ ) ∧ 𝑂𝑑 (𝑥 |Φ∧ ) and unbearably unjust and therefore not legally enforceable, according
• R ⊬ 𝑂𝑑 (𝑥 |Φ∧ ) ∧ 𝑃𝑑 (𝑥 |Φ∧ ). to Radbruch’s theory.
Thus, the postulate of consistency is satisfied iff it is not the case To illustrate that in the model, consider a set of positively en-
that an action 𝑥 is both forbidden and obligatory or both forbidden acted rules LR = {𝑃𝑑 (𝑎𝑐𝑐 |⊤), 𝑃𝑑 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝), 𝑂𝑑 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ),
and permitted by the normative system. The set of rules may be a 𝑃𝑑 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏)}. Now let us compare it with the rules
set of value-rules induced by the balancing model VR or a set of extracted in the evaluation model, indicating the proportional im-
positively enacted rules represented as dyadic deontic sentences pact of each in parenthesis:
LR.
AS𝑎𝑐𝑐 = {𝑂 𝑣 (¬𝑎𝑐𝑐) (−.24), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝) (−.4), 𝑃 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧
Definition 2.11. (coherence) Let 𝑥 ∈ 𝐴𝑐𝑡 and Φ ⊆ 𝐹𝑎𝑐𝑡. Then 𝑎𝑟𝑟 ) (.08), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏) (−.52)}.
(LR, VR) is coherent iff,
• LR and VR are both consistent and Hence, according to the value assessment, all positively enacted
• it is not the case that rules are morally unjustified. Nevertheless, the only the rule which
– LR ⊢ 𝑂𝑑 (𝑥 |Φ∧ ) and VR ⊢ 𝑂 𝑣 (𝑥 |Φ∧ ) or would still be valid according to a threshold of .2 would be the pro-
– LR ⊢ 𝑃𝑑 (𝑥 |Φ∧ ) and VR ⊢ 𝑂 𝑣 (𝑥 |Φ∧ ) or hibition to search property items during an arrest. Following Alexy,
– LR ⊢ 𝑂𝑑 (𝑥 |Φ∧ ) and VR ⊢ 𝑃 𝑣 (𝑥 |Φ∧ ) we would say that such prohibition would be morally defective
but still legally valid, while all others are both morally and legally
An interesting application of this model to legal theory consists defective ([3], [4]).
in the interpretation of a formula about the relation between law
and morality originally proposed by Gustav Radbruch [20] to deter-
mine the (in)validity of Nazi’s laws: laws enacted by proper authority
and power are legally valid unless they reach an unbearable degree 3 SYSTEMS OF AXIOLOGICAL RULES
of immorality or injustice. In Definition 2.7 we have specified singular sets of deontic sentences
In our model, we may define an inference relation upon the basic VR𝑥Θ for each subset Θ of the set of relevant factors Φ. Clearly,
dyadic deontic logic, considering the proportional impact of the the union of these singular sets would lead to an inconsistency
value-rule corresponding to the legal rule. If the positively enacted if the underlying dyadic deontic logic is monotonic and the prin-
legal rule 𝑂𝑑 (𝑥 |Φ∧ ) /𝑃𝑑 (𝑥 |Φ∧ ) is inconsistent with the value-rule ciple of deontic non-contradiction holds. For instance, VR𝑎𝑐𝑐 ∪

induced by the value assessment 𝑃 𝑣 (𝑥 |Φ∧ ) /𝑂 𝑣 (𝑥 |Φ∧ ) and the value- VR𝑎𝑐𝑐 = {𝑂 𝑣 (¬𝑎𝑐𝑐 |⊤), 𝑃 𝑣 𝑥 (𝑎𝑐𝑐 |𝑎𝑟𝑟 )}, from which one may infer
{𝑎𝑟𝑟 }
rule has a proportional impact that exceeds a given threshold, then both 𝑂 (¬𝑎𝑐𝑐 |𝑎𝑟𝑟 ) and 𝑃 (𝑎𝑐𝑐 |𝑎𝑟𝑟 ).
we say that the positively enacted rule is unbearably unjustified or Therefore we are going to define a construction of a system of
immoral and therefore is invalid, according to the Radbruch’s legal value-rules, which we shall call axiological system AS𝑥 , in which
theory. The inference relation consists in a restriction of the set of all steps of factor additions are consistently compiled.
positively enacted rules available to derive a normative solution Note that each step of adding a factor to Θ𝑖 may or may not
for a given action. There may be different interpretations of the shift the balance of values to the effect of changing the normative
Radbruch’s formula leading to different associated legal theories, solution to the action at stake in that specific context.
for instance, a generative formula that not only censors positively Thus there are two different situations to be covered. In the first
enacted rules but also generate valid legal content based on moral one, the addition of a new factor shifts the normative solution to the
considerations (see [17]). opposite one. For instance, police officers are generally forbidden
Definition 2.12. (Radbruch’s Formula) Let 𝑥 be an action, Φ a set to access items held by an individual. But in the context of an arrest,
of factors and V a valuation and 𝑟 a threshold index. Then: the normative solution shifts to a permission to perform that action.
• ⊢𝑟𝑎𝑑 𝑃𝑑 (𝑥 |Φ∧ ) iff Were it not the case, that is, if the factor “arrest” were absent, then
– LR ⊢ 𝑃𝑑 (𝑥 |Φ∧ ) and the former solution would prevail. In its turn, if now we consider
– it is not the case that VR ⊢ 𝑂 𝑣 (𝑥 |Φ∧ ) and that the item accessed is a property item, it is still permitted to
– |𝜎 (𝑥, Φ, V)| ⩾ 𝑟 1 ; perform it if there is an arrest.
• ⊢𝑟𝑎𝑑 𝑂𝑑 (𝑥 |Φ∧ ) iff So, gathering the set of deontic sentences may reveal this “step
– LR ⊢ 𝑂𝑑 (𝑥 |Φ∧ ) and by step” feature of an argumentation process where new factors are
– it is not the case that VR ⊢ 𝑃 𝑣 (𝑥 |Φ∧ ) and called into question or the presence of a factor or its relevance or
– |𝜎 (𝑥, Φ, V)| ⩾ 𝑟 intensity regarding the values at stake is contested. We are going to
represent such argument moves by change functions of expansion
1 We consider |n| the absolute value of the (positive or negative) integer n. and contraction that impact the resulting normative solution.
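Definition 2.12 and the threshold example above can likewise be sketched in a few lines. The fragment below takes the proportional impacts σ reported for AS_acc and applies the threshold test; treating every enacted rule as conflicting with its induced value-rule follows the text's own observation that all four enacted rules are morally unjustified. It is an illustration of the filter, not a full deontic-logic implementation.

```python
# Sketch of the Radbruch filter (Definition 2.12) on the worked example.
r = 0.2                                   # threshold assumed in the text

sigma = {                                 # σ of the induced value-rule per context
    "⊤": -0.24,
    "prop": -0.40,
    "prop ∧ arr": 0.08,
    "prop ∧ arr ∧ mob": -0.52,
}

for context, s in sigma.items():
    # Every enacted rule is opposed by the induced value-rule, so an enacted
    # rule remains enforceable only when the opposing impact stays below r.
    enforceable = abs(s) < r
    print(f"{context:20} |σ| = {abs(s):.2f}  enforceable: {enforceable}")
# Only the rule for the context prop ∧ arr survives (|σ| = 0.08 < 0.2):
# it is morally defective but, on Radbruch's formula, still legally valid.
```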


In order to do that, first we are going to define, by induction, an 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑚𝑜𝑏 ∧𝑝𝑟𝑜𝑝 ∧𝑎𝑟𝑟 )}. Nevertheless, each axiological system
axiological system for an action 𝑥 and a set of factors Φ. Notice that resulting from the balancing is consistent.
we have to consider two cases for the inductive step. The first case
Theorem 3.2. For every action 𝑥 and Φ ⊆ 𝐹𝑎𝑐𝑡, AS𝑥 is consistent.
covers the hypothesis where there is no shifts in the balance in the
course of addition of new factors and accordingly new value-rules Proof. (sketch) By construction, we notice that the basic step
to the axiological system. The second case covers the hypothesis 𝐴𝑆 0𝑥 contains a single axiological rule and also that each step 𝐴𝑆𝑛+1
𝑥
where there was at least one shift of balance (and the corresponding preserves consistency of the extracted rules, provided that the con-
induced value-rule) at some previous step of the construction of the ditions of opposed modalities 𝑂¬𝑥 and 𝑃𝑥 are, by construction,
axiological system. For simplicity we write the definition only for mutually exclusive. □
those cases where the antecedent step starts with an obligation, but
it is easy to adapt it, mutatis mutandis, for those cases where the Although different consistent axiological systems AS𝑥 may be
previous step in the construction of the axiological system delivers built upon the same model V, according to the order of sets of
a permission for the corresponding constellation of factors. factors used in the construction, all systems will deliver the same
result in the presence of all relevant factors at hand. This result
Definition 3.1. Let 𝑥 be an action, 𝐹𝑎𝑐𝑡 = {𝑓1, 𝑓2, ..., 𝑓𝑚 } and may happen if all the steps in the construction have one element,
Φ ⊆ 𝐹𝑎𝑐𝑡. We are going to define a set of axiological rules based or if there is any step where two elements are included in the
on an induction of an increasing sequence of subsets Φ0 = ∅, Φ1 = axiological system. The end result may be thought of as the result
{𝑓1 }, Φ2 = {𝑓1, 𝑓2 }, ..., Φ𝑛 = Φ. Then the Axiological Normative of an argumentation process, were in each step a new factor is
System for 𝑥, AS𝑥 is inductively defined as follows: brought about that may or may not change the solution within the
• Basic step (Φ = ∅) balance of values. This property of the axiological systems may be
– AS𝑥1 = {𝑃 𝑣 (𝑥 |𝑓1 ), 𝑂 𝑣 (𝑥 |¬𝑓1 )}, if AS𝑥0 = 𝑂 𝑣 (𝑥 |⊤) and the shown as a corollary of the following general theorem:
balance shifts
Theorem 3.3. Let Φ0 = ∅, Φ1 = {𝑓1 }, Φ2 = {𝑓1, 𝑓2 }, ..., Φ𝑛 = Φ be
– AS𝑥1 = {𝑂 𝑣 (𝑥 |𝑓1 )}, otherwise.
an increasing sequence of subsets of Φ ⊆ 𝐹𝑎𝑐𝑡. Then for every natural
• Inductive step with one element in AS𝑛𝑥 ; Φ = {𝑓1, 𝑓2, ..., 𝑓𝑛 }
number 0 ⩽ 𝑖 ⩽ 𝑛, it holds that:
– AS𝑛+1𝑥 = {𝑃 𝑣 (𝑥 |Φ∪{𝑓𝑛+1 }∧ ), 𝑂 𝑣 (𝑥 |Φ∪{¬𝑓𝑛+1 }∧ ), if AS𝑛𝑥 =
𝑂 𝑣 (𝑥 |Φ∧ ) and the balance shifts • if 0 < 𝐵𝑥,Φ
𝐷𝑒𝑚 (V)
> 𝐵𝑥,Φ
𝑃𝑟𝑜𝑚 (V)
then AS𝑥 ⊢ 𝑂 (𝑥 |Φ𝑖∧ )
– AS𝑛+1𝑥 = {𝑂 𝑣 (𝑥 |Φ ∪ {𝑓𝑛+1 })}∧ ), otherwise.
• Inductive step with two elements in AS𝑛𝑥 ; Φ = {𝑓1, 𝑓2, ..., 𝑓𝑛 } • otherwise, AS𝑥 ⊢ 𝑃 (𝑥 |Φ𝑖∧ )
– AS𝑛+1𝑥 = {𝑃 𝑣 (𝑥 |𝑓1 ∧ ... ∧ ¬𝑓𝑛 ∧ 𝑓𝑛+1 ), 𝑂 𝑣 (𝑥 |𝑓1 ∧ ... ∧ 𝑓𝑛 ∧
Proof. For 𝑛 = 0, the result is immediate. For the inductive
¬𝑓𝑛+1 )}, if AS𝑛𝑥 = {𝑂 𝑣 (𝑥 |𝑓1 ∧ ... ∧ 𝑓𝑛 ), 𝑃 (𝑥 |𝑓1 ∧ ... ∧ ¬𝑓𝑛 )}
step, suppose that we have AS𝑥 ⊢ 𝑂 (𝑥 |Φ𝑘∧ ) for Φ𝑘 = {𝑓1, 𝑓2, ..., 𝑓𝑘 }
and the balance shifts
and therefore AS𝑥 ⊢ 𝑂 (𝑥 |Φ𝑘∧ ). Then for Φ𝑘 ∪ {𝑓𝑘+1 }, it holds,
– AS𝑛+1𝑥 = {𝑂 𝑣 (𝑥 |𝑓1 ∧ ... ∧ 𝑓𝑛 ∧ 𝑓𝑛+1 ), 𝑃 𝑣 (𝑥 |𝑓1 ∧ ... ∧ ¬𝑓𝑛 ∧
𝑓𝑛+1 )}, otherwise. by construction that if 0 < 𝐵𝑥,Φ > 𝐵𝑥,Φ , then AS𝑥 ⊢
Ð
𝐷𝑒𝑚 (V) 𝑃𝑟𝑜𝑚 (V)
Then, AS𝑥 = AS𝑖𝑥 , for 0 ≤ 𝑖 ≤ 𝑛 𝑂 (𝑥 |Φ𝑖∧ ∧ 𝑓𝑘+1 ). Otherwise, it holds that AS𝑥 ⊢ 𝑃 (𝑥 |Φ𝑖∧ ∧ 𝑓𝑘+1 )
showing that new additions of factors in the axiological rules never
Notice that the choice of a particular set of factors, i.e., the selec- change the previous constellation of literals representing factors or
tion of a set Φ1 ⊆ 𝐹𝑎𝑐𝑡 rather than a different set Φ2 ⊆ 𝐹𝑎𝑐𝑡 may the previous order of increasing subsets of factors. The case where
deliver a different axiological system, since the assessment of the AS𝑥 ⊢ 𝑃 (𝑥 |Φ𝑘∧ ) follows the same steps. □
action’s impact on the values is determined by the factors being
considered in each set. Moreover, notice that the particular order As a corollary from theorem 3.3 it holds that the final evaluation
in which the new factors are introduced in a given set Φ ⊆ 𝐹𝑎𝑐𝑡 and corresponding induced value-rule does not depend on the order
determines a specific path leading to the axiological system based of subsets of relevant factors used in the construction of the axio-
on all factors in Φ ⊆ 𝐹𝑎𝑐𝑡. logical system, since it is obtained by summing all the differences
Indeed, the sequence AS𝑥1 , . . . 𝐴𝑆 𝑛𝑥 reflects a strategy of argu- that each factor makes autonomously.
mentation adopted by the parties, as far as this strategy consists
in the introduction of new factors (the removal of factors will be 4 SHIFTING THE BALANCE
considered in Section 3). For instance, that the items were collected The ascription of weights to values and influence on the action’s
in an arrest is an argument for the justification of a permission to impact on values are the key aspects of the balancing model V. The
the police to access the content of those items. In its turn arguing set of relevant factors Φ ⊆ 𝐹𝑎𝑐𝑡 and the set of relevant values V ⊆
that an item collected is a mobile phone favours the axiological 𝑉 𝑎𝑙 are the building blocks that determine whether the outcome of
prohibition to access its digital content. an evaluation is either an axiological prohibition or an axiological
In the precedent discussed in example 2.8, the sequence ⟨∅, permission.
{𝑝𝑟𝑜𝑝}, {𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟 }, {𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟, 𝑚𝑜𝑏}⟩ would result in AS𝑎𝑐𝑐 = Hence, provided an impact function and a weight assignment,
{𝑂 𝑣 (¬𝑎𝑐𝑐), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝), 𝑃 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝∧𝑎𝑟𝑟 ), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝∧¬𝑎𝑟𝑟 ), adding or excluding factors and adding or excluding values from
𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧𝑎𝑟𝑟 ∧𝑚𝑜𝑏), 𝑃 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧𝑎𝑟𝑟 ∧¬𝑚𝑜𝑏)}. In its turn, the respective sets considered in the evaluation may shift the bal-
the sequence ⟨∅, {𝑚𝑜𝑏}, {𝑚𝑜𝑏, 𝑝𝑟𝑜𝑝}, {𝑚𝑜𝑏, 𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟 }⟩ would re- ance and therefore change the induced value-rule. As mentioned
sult in A𝑆 𝑎𝑐𝑐 = {𝑂 𝑣 (¬𝑎𝑐𝑐), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑚𝑜𝑏), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑚𝑜𝑏 ∧ 𝑝𝑟𝑜𝑝), above, such moves may be thought of as the advancement of reasons or


arguments for the moral justification or disapproval of a permission are promoted by the action through intensifying factors. The ex-
or a prohibition to the action at stake. clusion of elements of V may also produce a shift in the previous
balance, changing the normative solution. For instance, a value-
undercutting contraction corresponds to the exclusion of values
4.1 Possible shifts in the value assessment of which are demoted by the action through intensifying factors. In
an action its turn, a value-rebutting contraction corresponds to the exclusion
To represent such reason-giving or argument dynamics we shall of values which are promoted by the action through attenuating
introduce different change functions that either modify the set factors.
of relevant factors or modify the set of values considered in the
assessment. In this paper we are not going to study combinations 4.2 Factor expansion and contraction operators
of changes of values with changes in the set of factors. We propose here eight change operators to capture the shifts of
We shall focus on those modifications that shift the balance balance described above. Four of such operators act upon factors,
between the value-impacts of the action being considered, so that, in and four upon values. In this subsection we consider the operators
virtue of the modification a prohibited action becomes permissible, on factor, two of which enlarge the set of available factors (factor ex-
or a permissible action becomes prohibited: pansion operators) and two restrict such a set (factors contractions
operators).
• Before the change the action is axiologically prohibited since
its impacts on the values it demotes prevail over its impacts Definition 4.1. (factor expansion function) We define an expansion
on the values it promotes: 0 < 𝐵𝑥,Φ > 𝐵𝑥,Φ ); after function 𝑒 : P (𝐹𝑎𝑐𝑡) −→ P (𝐹𝑎𝑐𝑡) − {∅} such that for all Φ ⊂ 𝐹𝑎𝑐𝑡,
𝐷𝑒𝑚 (V) 𝑃𝑟𝑜𝑚 (V) Φ ∩ 𝑒 (Φ) = ∅.
the change the action is axiologically permitted since its
impacts on the values it promotes prevail over its impacts The first two operators add new attenuators the “winning” side
on the values it demotes (0 < 𝐵𝑥,Φ ≥ 𝐵𝑥,Φ ) of the balance between values demoted and promoted by the action.
𝑃𝑟𝑜𝑚 (V) 𝐷𝑒𝑚 (V)
This expansion reduces the extent to which the action demotes or
• Before the change the action is axiologically permitted since
promotes the values at stake, to an extent that is sufficient to invert
its impacts on the values it demotes prevail over its impacts
the original balance.
on the values it promotes (0 < 𝐵𝑥,Φ ≥ 𝐵𝑥,Φ ); after
𝑃𝑟𝑜𝑚 (V) 𝐷𝑒𝑚 (V) This can happen in two cases. In the first case, before the revision,
the change the action is axiologically prohibited since its the action’s impact on the demoted values was greater than its
impacts on the values it demotes prevail over its impacts on impact over the promoted values. The additional factors reduce the
the values it promotes 0 < 𝐵𝑥,Φ
𝐷𝑒𝑚 (V)
> 𝐵𝑥,Φ
𝑃𝑟𝑜𝑚 (V)
). demotion of the demoted values, to such an extent that the action’s
impact on the demoted value becomes smaller than its impact on
For a given evaluation 0 < 𝐵𝑥,Φ 𝐷𝑒𝑚 (V)
> 𝐵𝑥,Φ
𝑃𝑟𝑜𝑚 (V)
from which the promoted values. This means that in the context of the extended
we induce 𝑂 𝑣 (𝑥/Φ), there are different ways to shift the balance. set of factors the former prohibited action becomes permissible.
First, it is possible to add new factors to Φ that together are going In the second case, before the revision, the action’s impact on
to attenuate the impact of the action on the demotion of the values the promoted values was greater than its impact over the demoted
considered. If this attenuation effect shifts the balance either by values. The additional factors reduce the promotion of the promoted
making the degree of the demotion negative, or by making it equal values to such an extent that the action’s impact on the promoted
or lower to the degree of the values promoted, we call this move a values becomes smaller than its impact on the demoted values. This
factor-undercutting expansion of the relevant factors. Another move means that in the context of the extended set of factors the former
to the same effect consists in adding a set of factors that together permissible action becomes prohibited.
intensify the promotion of the relevant values. We call this move a Definition 4.2. (factor undercutting expansion function) Let 𝑥 ∈
factor-rebutting expansion. 𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. An expansion function 𝑒 is an under-
Besides adding factors, one may also contest that a given factor cutting expansion function, denoted by 𝑢𝑒 (Φ), iff:
considered is present in the context, what could be reduced to the
• for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ >0
exclusion of factors from the set of relevant factors Φ. This might 𝐷𝑒𝑚 (V)
produce a shift of balance in two different ways. First it could be an (i) for all 𝑓𝑖 ∈ 𝑒 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V)
𝑥,Φ∪𝑒 (Φ)
exclusion of a set of factors that together intensify the demotion (ii) 𝜎 (𝑥, Φ ∪ 𝑒 (Φ), V) ⩾ 0 or 𝐵𝐷𝑒𝑚 (V) ⩽ 0
of the values by the action at stake. We call this move a factor- • for 𝜎 (𝑥, Φ, V) ⩾ 0
undercutting contraction. On the other hand the same effect could (i) for all 𝑓𝑖 ∈ 𝑒 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V)
be obtained if one deletes a set of factors that together attenuates (ii) 𝜎 (𝑥, Φ ∪ 𝑒 (Φ), V) < 0
the action’s promotion of the relevant values. We call this move a On this basis we define two factor contraction operators. The
factor-rebutting contraction. first
Similar moves may be described that modify the set of val-
ues V which are considered relevant to the evaluation. A value- Definition 4.3. (factor undercutting expansion operator) Let 𝑢𝑒 be
undercutting expansion is an addition of values to V, which are an undercutting expansion operation. Then we define an operator
demoted by the action through attenuating factors, given the set of + on sets of factors such that Φ+ = Φ ∪ 𝑢𝑒 (Φ).
factors Φ and both the impact and the weight function. In its turn, Now we turn to the rebutting expansion, which consists in
a value-rebutting expansion is an addition of values to V, which adding new intensifiers to the “losing” side in the balance between


the action’s impacts on promoted vs demoted values, resulting in a Definition 4.10. (factor rebutting contraction operator) Let 𝑟𝑐 be a
shift of such a balance. factor rebutting contraction function. Then we define an operator
⊖ on sets of factors such that Φ ⊖ = Φ − 𝑟𝑐 (Φ).
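The conditions defining these factor operators can be checked mechanically on the running example. The sketch below (our own encoding, reusing the figures of Example 2.8) verifies a factor rebutting expansion and a factor undercutting contraction; it is meant as an illustration of the definitions, not as the paper's formal machinery.

```python
# Checking two factor operators on the figures of Example 2.8 (encoding ours).
weights = {"Priv": 0.6, "Pright": 0.4, "Saf": 0.6}
prom, dem = {"Saf"}, {"Priv", "Pright"}
influence = {(None, "Priv"): 0.4, ("prop", "Pright"): 0.4,
             ("arr", "Saf"): 0.8, ("mob", "Priv"): 1.0}

def B(factors, side):
    vals = prom if side == "prom" else dem
    return sum(influence.get((f, v), 0.0) * weights[v]
               for f in set(factors) | {None} for v in vals)

def sigma(factors):
    return B(factors, "prom") - B(factors, "dem")

# Factor rebutting expansion (Definition 4.4, second case): in the permission
# context {arr, prop} (sigma = .08 >= 0), adding 'mob', an intensifier of the
# demoted value Priv (I_acc(mob, Priv) = 1 > 0), drives sigma below zero while
# keeping B_Dem positive, so the induced rule flips to a prohibition.
phi = {"arr", "prop"}
is_rebutting_expansion = (sigma(phi) >= 0
                          and sigma(phi | {"mob"}) < 0
                          and B(phi | {"mob"}, "dem") > 0)

# Factor undercutting contraction (Definition 4.7, first case): in the
# prohibition context {arr, prop, mob} (sigma < 0, B_Dem > 0), removing the
# same intensifier restores a non-negative sigma, flipping back to permission.
psi = {"arr", "prop", "mob"}
is_undercutting_contraction = (sigma(psi) < 0 and B(psi, "dem") > 0
                               and sigma(psi - {"mob"}) >= 0)

print(is_rebutting_expansion, is_undercutting_contraction)  # True True
```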
Definition 4.4. (factor rebutting expansion function) Let 𝑥 ∈ 𝐴𝑐𝑡,
Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. An expansion function 𝑒 is an rebutting Introduce some examples for each contraction operator.
expansion function, in this case denoted by, 𝑟𝑒 (Φ), iff:
4.3 Value expansion and contraction operators
• for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ >0 In the previous section we have seen how changes in set of the
𝐷𝑒𝑚 (V)
(i) for all 𝑓𝑖 ∈ 𝑒 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V) relevant factors may modify the evaluation of an action. This may
(ii) 𝜎 (𝑥, Φ ∪ 𝑒 (Φ), V) ⩾ 0 happen either when some additional factors are considered relevant
• for 𝜎 (𝑥, Φ, V) ⩾ 0 or 𝐵𝑥,Φ <0 or when some previously considered factors are rejected as being
𝐷𝑒𝑚 (V) irrelevant. Similarly, changes in the set of values, which have been
(i) for all 𝑓𝑖 ∈ 𝑒 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V)
𝑥,Φ∪𝑒 (Φ) taken into account, may modify the valuation of an action. This
(ii) 𝜎 (𝑥, Φ ∪ 𝑒 (Φ), V) < 0 and 𝐵𝐷𝑒𝑚 (V) > 0 may happen when additional values are considered contextually
Definition 4.5. (factor rebutting expansion operator) Let 𝑟𝑒 be a relevant, and thus added into the deliberation, or when some values
rebutting expansion function. Then we define an operator ⊕ previously assumed to be relevant are discarded in the given context.
on sets of factors such that Φ ⊕ = Φ ∪ 𝑟𝑒 (Φ). Hence, in the following, we define the value change operators
capturing these possible argument moves, which may provoke a
Let us now introduce the two contraction operators. These op- shift in the original balance of the proportional impact of actions
erators shift the balance either by weakening the “winning” side, on values. Let us begin with a value expansion function.
i.e., by deleting intensifying factors for that side (undercutting ex-
Definition 4.11. (value expansion function) We define a value
pansion), or by strengthening the “losing”, side, i.e., by deleting
expansion function 𝑣𝑒 : P (𝑉 𝑎𝑙) −→ P (𝑉 𝑎𝑙) − {∅} such that for
attenuating factors for that side (rebutting expansion).Let us begin
every V ⊆ 𝑉 𝑎𝑙 V ∩ 𝑣𝑒 (V) = ∅.
by introducing the contraction function.
The first operator weakens the “winner” in the balance, which
Definition 4.6. (factor contraction function) We define a factor compares promoted versus demoted values, thus provoking a shift
contraction function 𝑐 : P (𝐹𝑎𝑐𝑡) −→ P (𝐹𝑎𝑐𝑡) − {∅} such that for in the original balance.
every Φ ⊂ 𝐹𝑎𝑐𝑡, 𝑐 (Φ) ⊆ Φ.
Definition 4.12. (value undercutting expansion function) Let 𝑥 ∈
Now we introduce the undercutting contraction operator. 𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. An expansion function 𝑣𝑒 is a value
Definition 4.7. (factor undercutting contraction function) Let 𝑥 ∈ undercutting expansion function 𝑣𝑢𝑒 (V) iff:
𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. A contraction function 𝑐 is an under- • for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ 𝐷𝑒𝑚 (V)
>0
cutting contraction function, in this case denoted by 𝑢𝑐 (Φ) iff: (i) for all 𝑣𝑖 ∈ 𝑣𝑒 (V), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V)
(ii) 𝜎 (𝑥, Φ, V ∪ 𝑣𝑒 (V)) ⩾ 0 or 𝐵𝑥,Φ 𝐷𝑒𝑚 (V∪𝑣𝑒 (V))
⩽0
• for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ >0
𝐷𝑒𝑚 (V) • for 𝜎 (𝑥, Φ, V) ⩾ 0 or 𝐵𝑥,Φ <0
(i) for all 𝑓𝑖 ∈ 𝑐 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V) 𝐷𝑒𝑚 (V)
𝑥,Φ−𝑐 (Φ) (i) for all 𝑣𝑖 ∈ 𝑣𝑒 (V), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V)
(ii) 𝜎 (𝑥, Φ − 𝑐 (Φ), V) ⩾ 0 ou 𝐵𝐷𝑒𝑚 (V) ⩽ 0 (ii) 𝜎 (𝑥, Φ, V ∪ 𝑣𝑒 (V)) < 0
• for 𝜎 (𝑥, Φ, V) ⩾ 0
Definition 4.13. (value undercutting expansion operator) Let 𝑣𝑢𝑒
(i) for all 𝑓𝑖 ∈ 𝑐 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V)
be a value undercutting expansion function. Then we define and
(ii) 𝜎 (𝑥, Φ − 𝑐 (Φ), V) < 0
operator ⊳ on sets of values such that V⊳ = V ∪ 𝑣𝑢𝑒 (V).
Definition 4.8. (factor undercutting contraction operator) Let 𝑢𝑐 Now we define the operator which strengthen the “opponent”
be a factor undercutting contraction function. Then we define and in order to shift the balance.
operator − on sets of factors such that Φ− = Φ − 𝑢𝑐 (Φ).
Definition 4.14. (value rebutting expansion function) Let 𝑥 ∈ 𝐴𝑐𝑡,
We turn now to the construction of the rebutting contraction Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. A value expansion function 𝑣𝑒 is an rebut-
operator. ting expansion function 𝑣𝑟𝑒 (V) iff:
Definition 4.9. (factor rebutting contraction function) Let 𝑥 ∈ 𝐴𝑐𝑡,
Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. A contraction function 𝑐 is a factor rebut- • for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ 𝐷𝑒𝑚 (V)
>0
ting contraction function, in this case denoted by 𝑟𝑐 (Φ) iff: (i) for all 𝑣𝑖 ∈ 𝑣𝑒 ((V)), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V)
(ii) 𝜎 (𝑥, Φ, V ∪ 𝑣𝑒 (V)) ⩾ 0
• for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ >0 • for 𝜎 (𝑥, Φ, V) ⩾ 0 or 𝐵𝑥,Φ𝐷𝑒𝑚 (V)
<0
𝐷𝑒𝑚 (V)
(i) for all 𝑓𝑖 ∈ 𝑐 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V) (i) for all 𝑣𝑖 ∈ 𝑣𝑒 ((V)), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V)
(ii) 𝜎 (𝑥, Φ − 𝑐 (Φ), V) ⩾ 0 (ii) 𝜎 (𝑥, Φ, V ∪ 𝑣𝑒 (V)) < 0 and 𝐵𝑥,Φ 𝐷𝑒𝑚 (V∪𝑣𝑒 (V))
>0
• for 𝜎 (𝑥, Φ, V) ⩾ 0 or 𝐵𝑥,Φ <0
𝐷𝑒𝑚 (V) Definition 4.15. (value rebutting expansion operator) Let 𝑣𝑟𝑒 be a
(i) for all 𝑓𝑖 ∈ 𝑐 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V) value rebutting expansion function. Then we define an operator ⊲
𝑥,Φ−𝑐 (Φ)
(ii) 𝜎 (𝑥, Φ − 𝑐 (Φ), V) < 0 and 𝐵𝐷𝑒𝑚 (V) > 0 on sets of values such that V⊲ = V ∪ 𝑣𝑟𝑒 (V).
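A parallel illustration can be given for the value operators. In the sketch below the additional value Sec and its influence figure are assumptions made for the sake of the example (a similar national-security variant is discussed later, in Section 4.4); enlarging the set of relevant values with this strongly promoted value shifts the balance in the context {arr, prop, mob}. The encoding is again ours.

```python
# Value rebutting expansion (Definition 4.14): in the prohibition context
# {arr, prop, mob}, adding a further, strongly promoted value flips the balance.
# The extra value Sec and its influence are assumed for illustration only.
weights = {"Priv": 0.6, "Pright": 0.4, "Saf": 0.6, "Sec": 1.0}
prom, dem = {"Saf", "Sec"}, {"Priv", "Pright"}
influence = {(None, "Priv"): 0.4, ("prop", "Pright"): 0.4,
             ("arr", "Saf"): 0.8, ("mob", "Priv"): 1.0,
             ("arr", "Sec"): 0.6}          # assumed: the arrest strongly promotes Sec

def B(factors, vals, side):
    rel = (prom if side == "prom" else dem) & vals
    return sum(influence.get((f, v), 0.0) * weights[v]
               for f in set(factors) | {None} for v in rel)

def sigma(factors, vals):
    return B(factors, vals, "prom") - B(factors, vals, "dem")

phi = {"arr", "prop", "mob"}
V0 = {"Priv", "Pright", "Saf"}           # values originally considered
V1 = V0 | {"Sec"}                        # value rebutting expansion of V0

print(round(sigma(phi, V0), 2))   # -0.52 : O_v(¬acc | arr ∧ prop ∧ mob)
print(round(sigma(phi, V1), 2))   #  0.08 : balance shifts, P_v(acc | arr ∧ prop ∧ mob)
```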


Now we are going to introduce the contraction operators that model, a shift in the solution provided by the axiological system
subtract values from the original set of relevant values used in the AS𝑥 in the modified constellation of factors or values considered
balance. The balance shifts either by deleting values influenced by in the evaluation.
intensifying factors in the “winner” side of the balance between If those conditions are present and the functions apply, provided
demoted and promoted values (undercutting contraction) or by that the sequence of factors of the original set of factors is preserved,
strengthening the “opponent”, that is, by deleting values which then a success and an inclusion result may be shown.
are affected by attenuating factors in the side that “lost” the bal-
ance (rebutting contraction). Let us begin by introducing the value Theorem 4.21. Let AS𝑥Φ,V be an axiological system 2 based on
contraction function. model V, 𝑒 an expansion function and 𝑐 a contraction function, then:
i. AS𝑥Φ,V ⊂ AS𝑥Φ+ ,V and, if 𝑒 is an 𝑢𝑒 applicable in V, then
Definition 4.16. (value contraction function) Let V be a set of val- Ó Ó
either AS𝑥Φ+ ,V ⊢ 𝑂 𝑣 (𝑥 | Φ) ∧ 𝑃 𝑣 (𝑥 | Φ+ ) or AS𝑥Φ+ ,V ⊢
ues, then a value contraction function is a function 𝑣𝑐 : P (𝑉 𝑎𝑙) −→ Ó Ó +
P (𝑉 𝑎𝑙) − ∅ such that 𝑣𝑐 (V) ⊆ V. 𝑃 𝑣 (𝑥 | Φ) ∧ 𝑂 𝑣 (𝑥 | Φ )
ii. AS𝑥Φ,V ⊂ AS𝑥Φ⊕ ,V and if 𝑒 is an 𝑟𝑒 applicable in V, then either
First we introduce the value undercutting contraction operator. Ó Ó Ó
AS𝑥Φ+ ,V ⊢ 𝑂 𝑣 (𝑥 | Φ)∧𝑃 (𝑥 | Φ ⊕ ) or AS𝑥Φ⊕ ,V ⊢ 𝑃 𝑣 (𝑥 | Φ)∧
Ó ⊕
Definition 4.17. (value undercutting contraction function) Let 𝑥 ∈ 𝑂 𝑣 (𝑥 | Φ )
𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. A value contraction function 𝑣𝑐 is a iii. AS𝑥Φ− ,V ⊂ AS𝑥Φ,V and if 𝑐 is an 𝑢𝑐 applicable in V, then
value undercutting contraction function 𝑣𝑢𝑐 (V) iff: Ó Ó
if AS𝑥Φ,V ⊢ 𝑂 𝑣 (𝑥 | Φ), then AS𝑥Φ− ,V ⊢ 𝑃 𝑣 (𝑥 | Φ− ) or if
Ó Ó
• for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ 𝐷𝑒𝑚 (V)
>0 AS𝑥Φ,V ⊢ 𝑃 𝑣 (𝑥 | Φ), then AS𝑥Φ− ,V ⊢ 𝑂 𝑣 (𝑥 | Φ− )
(i) for all 𝑣𝑖 ∈ 𝑣𝑐 (V), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V) iv. AS𝑥Φ⊖ ,V ⊂ AS𝑥Φ,V and if 𝑐 is an 𝑢𝑐 applicable in V, then
Ó Ó
(ii) 𝜎 (𝑥, Φ, V − 𝑣𝑐 (V)) ⩾ 0 or 𝐵𝑥,Φ ⩽0 if AS𝑥Φ,V ⊢ 𝑂 𝑣 (𝑥 | Φ), then AS𝑥Φ⊖ ,V ⊢ 𝑃 𝑣 (𝑥 | Φ ⊖ ) or if
𝐷𝑒𝑚 (V−𝑣𝑐 (V))
Ó Ó
• for 𝜎 (𝑥, Φ, V) ⩾ 0 or 𝐵𝑥,Φ <0 AS𝑥Φ,V ⊢ 𝑃 𝑣 (𝑥 | Φ), then AS𝑥Φ⊖ ,V ⊢ 𝑂 𝑣 (𝑥 | Φ ⊖ )
𝐷𝑒𝑚 (V)
(i) for all 𝑣𝑖 ∈ 𝑣𝑐 (V), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V)
(ii) 𝜎 (𝑥, Φ, V − 𝑣𝑐 (V)) < 0 Proof. Straightforward from theorem 3.3. □

Definition 4.18. (value undercutting contraction operator) Let 𝑣𝑢𝑐 Changes in the set of values may determine changes in the rules
be a value undercutting contraction function. Then we define an of the axiological system. For instance, in the precedent discussed at
operator ÷ on sets of factors such that V÷ = V − 𝑣𝑢𝑐 (V). example 2.8, the sequence ⟨∅, {𝑝𝑟𝑜𝑝}, {𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟 }, {𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟, 𝑚𝑜𝑏}⟩
would deliver the following axiological system AS𝑎𝑐𝑐 :
Finally, we turn to the construction of the value rebutting con-
traction operator. {𝑂 𝑣 (¬𝑎𝑐𝑐), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝), 𝑃 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧
¬𝑎𝑟𝑟 ), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏), 𝑃 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ ¬𝑚𝑜𝑏)}
Definition 4.19. (value rebutting contraction function) Let 𝑥 ∈ 𝐴𝑐𝑡,
Suppose one argues that the context of an arrest also strongly
Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. A contraction function 𝑐 is a value rebutting
impacts the promotion of national security 𝑆𝑒𝑐 with a proportional
contraction function 𝑣𝑟𝑐 (Φ) iff:
impact 𝐵𝑎𝑐𝑐
{𝑝𝑟𝑜𝑝,𝑎𝑟𝑟,𝑚𝑜𝑏 },{𝑆𝑒𝑐 }
= .6, then we would have the follow-
• for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ 𝐷𝑒𝑚 (V)
>0 ing axiological system AS𝑎𝑐𝑐 :
(i) for all 𝑓𝑖 ∈ 𝑐 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V)
{𝑂 𝑣 (¬𝑎𝑐𝑐), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝), 𝑃 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ), 𝑂 𝑣 (¬𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧
(ii) 𝜎 (𝑥, Φ, V − 𝑣𝑐 (V)) ⩾ 0
¬𝑎𝑟𝑟 ), 𝑃 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏), 𝑂 𝑣 (𝑎𝑐𝑐 |𝑝𝑟𝑜𝑝 ∧ ¬𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏)}
• for 𝜎 (𝑥, Φ, V) ⩾ 0 or 𝐵𝑥,Φ <0
𝐷𝑒𝑚 (V)
Hence inclusion simpliciter does not hold. But considering that
(i) for all 𝑓𝑖 ∈ 𝑐 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) < 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V)
the modification is going to occur in a particular step of the evalua-
(ii) 𝜎 (𝑥, Φ, V − 𝑣𝑐 (V)) < 0
tion, we have a qualified form of inclusion where 𝐴𝑆𝑛−1,V
𝑥 𝑥
⊂ AS𝑛,V ∗,
Definition 4.20. (factor rebutting contraction operator) Let 𝑟𝑒 be a for some 𝑛 in the sequence of the subsets of factors considered, and
factor rebutting contraction function. Then we define an operator where ∗ may be any value-change operator here defined. For con-
⊗ on sets of factors such that V ⊗ = V − 𝑣𝑟𝑐 (V). venience, we have denoted AS𝑛𝑥 as AS𝑛,V 𝑥 in order to express the

Remark. The functions of undercutting expansion, rebutting ex- change of the set of relevant values.
pansion, undercutting contraction, rebutting contraction, value- Theorem 4.22. Let AS𝑥Φ,V be an axiological system 3 based on
undercutting expansion, value-rebutting expansion, value-undercut- model V, 𝑣𝑒 a value expansion function and 𝑣𝑐 a value contraction
ting contraction, value-rebutting contraction defined above may function, then:
not exist depending on the model V assumed. If the conditions to
i. if 𝑣𝑒 is an 𝑣𝑢𝑒 or a 𝑣𝑟𝑒 function applicable in V, then if
apply these functions hold in the assumed model V we say that Ó Ó
AS𝑥Φ+ ,V ⊢ 𝑂 𝑣 (𝑥 | Φ) then AS𝑥Φ+ ,V∗ ⊢ 𝑃 𝑣 (𝑥 | Φ) and if
the function is applicable in the model V. Ó Ó
AS𝑥Φ+ ,V ⊢ 𝑃 (𝑣 𝑥 | Φ), then AS𝑥Φ+ ,V∗ ⊢ 𝑂 𝑣 (𝑥 | Φ), for ∗ ∈
4.4 Change operations on axiological systems {⊳, ⊲}
The change operators we have developed so far modify the set of 2 We are including V in the notation of AS𝑥 for convenience. Also to avoid clutter
relevant factors or the set of relevant values, provoking, if the con- Φ Ó
with superscripts in the notation, we are going to substitute Φ for Φ∧ .
ditions for application of each function are present in a particular 3 We are including V in the notation of AS𝑥 for convenience.
Φ


ii. if 𝑣𝑒 is an 𝑣𝑢𝑐 or a 𝑣𝑟𝑐 applicable in V, then if AS𝑥Φ+ ,V ⊢ the model could be extended in order to compare different and in-
Ó Ó
𝑂 𝑣 (𝑥 | Φ) then AS𝑥Φ+ ,V∗ ⊢ 𝑃 𝑣 (𝑥 | Φ) and if AS𝑥Φ+ ,V ⊢ dependent actions, or evaluate if positively enacted rules regarding
Ó Ó different and independent actions are justified.
𝑃 𝑣 (𝑥 | Φ), then AS𝑥Φ+ ,V∗ ⊢ 𝑂 𝑣 (𝑥 | Φ), for ∗ ∈ {÷, ⊗}
Proof. Straighforward from theorem 3.3. □ ACKNOWLEDGMENTS
Juliano Maranhão acknowledges the support by the Fundação de
5 FINAL REMARKS Apoio à Pesquisa do Estado de São Paulo (FAPESP 2019/07665-4)
The additive model we propose makes strong assumptions about and the IBM Corporation to the Center for Artificial Intelligence
the behaviour of factors as reasons for the evaluation of the action (C4AI/USP. Giovanni Sartor has been supported by the H2020 Euro-
at stake. It assumes an atomist conception to the effect that a factor pean Research Council (ERC) Project “CompuLaw” (G.A. 833647).
that is a reason for the action in one case always remains a reason
when new factors (reasons) are considered. Also, a factor always REFERENCES
keeps the same polarity and its contribution in terms of impact [1] Carlos E Alchourrón, Paul Gärdenfors, and David Makinson. 1985. On the Logic
in each value considered. That is, not only is the action invariant with respect to each value 𝑉𝑖 in the assessment, but the factors also retain the same contribution with the same polarity and intensity (each intensifies or attenuates, to a given degree, the action's promotion or demotion of the value 𝑉𝑖) in the assessment of new evaluations considering different constellations of factors. So, the model assumes a strong atomist conception in the assessment of the moral evaluation of actions and, consequently, of the moral evaluation of positively enacted rules.

Different conceptions are possible, relaxing the atomist assumptions in different degrees. For instance, a multiplicative model may make the contribution of a factor void if a new factor with proportional impact 0 is considered ([14]). One may also assume a model where the intensity of the contributions of factors may vary or invert polarity, changing from an intensifier to an attenuator of the impact of the action on a value, or vice-versa. Or even the polarity of the impact of an action may change: for instance, an action that, given a constellation of factors, promotes a value may demote the same value given a new constellation of factors.

Such steps would move the model towards a holist conception where a feature which is a reason for or against the action in one case may not be a reason, or may be an opposite reason, in another case, and where each contribution to the evaluation is variant and contextual. Holist conceptions are usually, although not necessarily, connected to moral particularism, according to which moral evaluations of actions do not depend on the subsumption of moral principles, as opposed to moral universalism, according to which moral judgment is intrinsically connected to the instantiation of moral principles ([11]).

We shall explore such variations in future developments of the model.

In this paper we have limited ourselves to using the model to induce value-rules and build axiological systems, which may be assumed as premises in any inference system of dyadic deontic logic satisfying some conditions. Hence another relevant path to explore is to embed the model into an inferential system [7]. We see two interesting alternatives. One of them is to follow the statutory interpretation line of research in AI & Law and embed the model in a deontic logic with revision operators applied on the logical consequences of normative systems in the AGM style ([1]). Steps in this direction have been made by [18]. The other is to explore the resemblance of the revision operators and the axiological system with an argumentation structure and embed the model in, or explore its connections with, a logic of defeasible argumentation. Finally,

REFERENCES
[1] C. E. Alchourrón, P. Gärdenfors, and D. Makinson. 1985. On the Logic of Theory Change: Partial Meet Contraction and Revision Functions. Journal of Symbolic Logic 50 (1985), 510–530.
[2] R. Alexy. 2002. A Theory of Constitutional Rights. Oxford University Press.
[3] R. Alexy. 2003. The Argument from Injustice. A Reply to Legal Positivism. Oxford University Press.
[4] R. Alexy. 2010. The Dual Nature of Law. Ratio Juris 23 (2010), 167–182.
[5] A. Barak. 2005. Purposive Interpretation in Law. Princeton University Press.
[6] T. J. M. Bench-Capon and G. Sartor. 2003. A Model of Legal Reasoning with Cases Incorporating Theories and Values. Artificial Intelligence 150 (2003), 97–142.
[7] C. Benzmüller, D. Fuenmayor, and B. Lomfeld. 2021. Encoding Legal Balancing: Automating an Abstract Ethico-Legal Value Ontology in Preference Logic. arXiv:2006.12789 [cs.AI]
[8] D. H. Berman and C. D. Hafner. 1993. Representing Teleological Structure in Case-based Reasoning: The Missing Link. In Proceedings of the Fourth International Conference on Artificial Intelligence and Law (ICAIL). ACM, 50–59.
[9] G. Boella, G. Governatori, A. Rotolo, and L. van der Torre. 2010. Lex minus dixit quam voluit, lex magis dixit quam voluit: A formal study on legal compliance and interpretation. In AICOL-I/IVR-XXIV'09: Proceedings of the 2009 International Conference on AI Approaches to the Complexity of Legal Systems: Complex Systems, the Semantic Web, Ontologies, Argumentation, and Dialogue. Springer, 162–183.
[10] G. Boella, L. van der Torre, and G. Pigozzi. 2016. AGM Contraction and Revision of Rules. Journal of Logic, Language and Information 25 (2016), 273–297.
[11] J. Dancy. 2004. Ethics Without Principles. Oxford University Press.
[12] M. Grabmair. 2017. Predicting Trade Secret Case Outcomes using Argument Schemes and Learned Quantitative Value Effect Tradeoffs. In Proceedings of ICAIL 2017. ACM, 89–98.
[13] H. L. A. Hart. 1994. The Concept of Law (2nd ed.). Oxford University Press.
[14] S. Kagan. 1988. The Additive Fallacy. Ethics 99 (1988), 5–31.
[15] J. Maranhão. 2001. Refinement: a tool to deal with inconsistencies. In Proceedings of the Eighth International Conference on AI and Law (ICAIL 2001). ACM Press, 52–59.
[16] J. Maranhão. 2017. A logical architecture for dynamic legal interpretation. In Proceedings of the Sixteenth International Conference on AI and Law (ICAIL 2017). ACM Press, 129–138.
[17] J. S. A. Maranhão and G. Sartor. [n.d.]. Interpretive Normative Systems. In 15th International Conference on Deontic Logic and Normative Systems (DEON 2020/2021), F. Liu, A. Marra, P. Portner, and F. Van de Putte (Eds.). College Publications.
[18] J. Maranhão and G. Sartor. 2019. Value assessment and revision in legal interpretation. In Proceedings of the 17th International Conference on Artificial Intelligence and Law (ICAIL 2019). ACM, New York, NY, USA, 219–223. https://doi.org/10.1145/3322640.3326709
[19] H. Prakken, A. Wyner, T. Bench-Capon, and K. Atkinson. 2015. A formalisation of argumentation schemes for legal case-based reasoning in ASPIC+. Journal of Logic and Computation 25 (2015), 1141–1166.
[20] G. Radbruch. 2006. Statutory Lawlessness and Supra-Statutory Law. Oxford Journal of Legal Studies 6 (2006), 1–11. (1st ed. 1946.)
[21] G. Sartor. 2013. The Logic of Proportionality: Reasoning with Non-Numerical Magnitudes. German Law Journal (2013), 1419–1457.
[22] G. Sartor. 2018. Consistency in balancing: from value assessments to factor-based rules. In Proportionality in Law: An Analytical Perspective, D. Duarte and S. Sampaio (Eds.). Springer, 121–136.
[23] D. Walton, G. Sartor, and F. Macagno. 2018. Statutory Interpretation as Argumentation. In Handbook of Legal Reasoning and Argumentation, G. Bongiovanni, G. Postema, A. Rotolo, G. Sartor, C. Valentini, and D. Walton (Eds.). Springer, 519–560.

Case-level Prediction of Motion Outcomes in Civil Litigation
Devin J. McConnell (devin.mcconnell@uconn.edu), Department of Computer Science, University of Connecticut, Storrs, Connecticut, USA
James Zhu (james.zhu@uconn.edu), Department of Computer Science, University of Connecticut, Storrs, Connecticut, USA
Sachin Pandya (sachin.pandya@uconn.edu), School of Law, University of Connecticut, Hartford, Connecticut, USA
Derek Aguiar (derek.aguiar@uconn.edu), Department of Computer Science, University of Connecticut, Storrs, Connecticut, USA
ABSTRACT
Lawyers regularly predict court outcomes to make strategic decisions, including when, if at all, to sue or settle, what to argue, and how to reduce their clients' liability risk. Yet, lawyer predictions tend to be poorly calibrated and biased, which exacerbate unjustifiable disparities in civil case outcomes. Current machine learning (ML) approaches for predicting court outcomes are typically constrained to final dispositions or are based on features unavailable in real-time during litigation, like judicial opinions. Here, we present the first ML-based methods to support lawyer and client decision making in real-time for motion filings in civil proceedings. Using the State of Connecticut Judicial Branch administrative data and court case documents, we trained six classifiers to predict motion to strike outcomes in tort and vehicular cases between July 1, 2004 and February 18, 2019. Integrating dense word embeddings from complaint documents, which contain information specific to the claims alleged, with the Judicial Branch data improved classification accuracy across all models. Subsequent models defined using a novel attorney case-entropy feature, dense word embeddings using corpus specific TF-IDF weightings, and algorithmic classification rules yielded the best predictor, Adaboost, with a classification accuracy of 64.4%. An analysis of feature importance weights confirmed the usefulness of incorporating attorney case-entropy and natural language features from complaint documents. Since all features used in model training are available during litigation, these methods will help lawyers make better predictions than they otherwise could given disparities in lawyer and client resources. All ML models, training code, and evaluation scripts are available at https://github.com/aguiarlab/motionpredict.

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Machine learning approaches; Natural language processing.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466101

ACM Reference Format:
Devin J. McConnell, James Zhu, Sachin Pandya, and Derek Aguiar. 2021. Case-level Prediction of Motion Outcomes in Civil Litigation. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, Sao Paulo, Brazil. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3462757.3466101

1 INTRODUCTION
In the late nineteenth century, American jurist Oliver Wendell Holmes declared that law is "nothing more pretentious" than predictions of what courts in fact do [38]. Lawyers regularly predict what courts will do and, on that basis, make strategic decisions, including what to argue, when, if at all, to sue or prosecute, settle cases, and how to reduce their clients' liability risk. Indeed, because most civil proceedings settle before any trial occurs, litigation outcomes (civil settlements) vary on how attorneys estimate what would happen if the case went to trial and how a trial judge or jury would rule. Yet, despite a literature in litigation decision analysis [13], lawyers' actual predictions tend to be overconfident, not well calibrated [34, 39], and surprisingly vulnerable to bias associated with advocating for a particular view [55].

While lawyers have always made strategic decisions based on what they believed about outcomes in similar cases, the collection of large and diverse data from civil proceedings facilitates a new computational view of legal decision making. This computational perspective of court outcome prediction has been bolstered by the rise of machine learning (ML), which is a branch of artificial intelligence (AI) focused on the development of computational systems that learn from historical data to make predictions or infer patterns in big, noisy, or high-dimensional data sets [28, 74].

Combining ML with collections of dockets, judicial opinions, and court documents presents two significant opportunities. First, ML techniques can increase how accurately lawyers can predict legal outcomes to help support real-time decision making. Computational modeling in law has relied primarily on the text of formal judicial opinions, with or without annotation from domain experts, to classify the outcome reached or the type of legal reasoning used to reach that outcome [5, 35, 54, 62]. In some scenarios, ML methods can predict judicial decisions with an accuracy that exceeds legal scholars [41, 65]. However, court opinions are not available during litigation, and thus this retrospective process stands in contrast to predicting judge or jury decision-making itself [25, 57].


Second, since court outcomes vary with attorney quality and The most prevalent features used by ML methods to model court
client resources, predictions using ML may reduce the litigation decisions are extracted from legal documents. Legal documents, typ-
disadvantages faced by the poor, racial minorities, and other vulner- ically court records, judicial opinions, and legislation, are difficult
able groups [20]. These problems are exacerbated by overwhelming to model due to their high dimensionality. For instance, modelling
workloads faced by civil courts and lawyers. Increased workloads documents as bags-of-words, which treats a document as an un-
contribute to workplace strain, which has detrimental effects on ordered multiset of words, is commonly assumed in applications
the ability to function effectively [40] and, for lawyers, a decreased like text classification [43] and topic modelling [9]. With this simpli-
perception of ability to uphold the law [12]. However, to ensure fying assumption, the dimension of a document is proportional to
widespread acceptability and trustworthiness of algorithmic deci- the vocabulary size, |𝑉 |, which is prohibitively large for many ML
sions [60], models must be accurate and explainable to all parties. methods. When the ordering of words in a document is considered,
Methods for characterizing judicial decisions in previous work the dimensionality grows exponentially. Therefore, methods typi-
have focused on court opinions, ignoring the many important for- cally seek lower dimensional representations of legal documents
mal procedures that lead to a final judgment. In this work, we that preserve relevant structure of the underlying text.
present the first client and lawyer support methods that predict Early legal document representations focused on summary sta-
court outcomes at the level of individual motions. Motions are for- tistics, like word length [8] or other metadata including document
mal requests to judges for an official ruling on a contested issue. complexity [22], publication date, and amendment counts [47].
They can be submitted before, during, or after the trial and can Knowledge representations of the dyadic citation relationships
have a significant impact on the final disposition of the lawsuit. For between documents are typically modelled as citation networks,
example, motions to strike petition for the removal of all or a subset where vertices correspond to legal documents and a directed edge
of the opposing party’s pleading. While many court documents exists from document 𝐴 to document 𝐵 if 𝐴 cites 𝐵 [27]. Important
are filed over the course of legal proceedings that are relevant to features can be extracted from the connectivity structure of legal
motion outcomes, we focus on complaint documents because (a) citation networks, e.g., directed paths can be interpreted as chains
they contain the facts alleged and legal claims asserted and (b) are of legal precedent or network centrality and in-degree can indicate
available to all parties when a lawsuit begins and can therefore be case importance.
used to support decision making.

We present a general overview of computational prediction of litigation outcomes and our contributions in Section 2. Section 3 provides details of our methods, where we describe the court administrative data and legal documents as well as our approach to feature engineering and predictive modelling. We present results on approximately 15 years of Connecticut civil case data in Section 4, followed by a discussion and conclusions in Sections 5 and 6.

More recently, representations from natural language processing (NLP) have been used to compute richer representations of legal documents. Term frequency–inverse document frequency (TF-IDF) is a statistic computed for a word 𝓌 and document 𝒹:

    f_T(𝓌, 𝒹) = f_𝒹(𝓌) · log( |D| / Σ_{𝒹′∈D} 1(𝓌 ∈ 𝒹′) )

where 𝒹 ∈ D is a document in corpus D, f_𝒹(𝓌) is the frequency of word 𝓌 in document 𝒹, and 1(𝓌 ∈ 𝒹) is an indicator function that is 1 if 𝓌 appears in 𝒹 and 0 otherwise. TF-IDF is often used as a corpus-specific importance weighting for words [80].
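As a concrete illustration of this weighting, the statistic can be computed directly from raw term counts. This is a minimal sketch, not code from the paper; the toy corpus and the function name tfidf_weights are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """Compute f_T(w, d) = f_d(w) * log(|D| / df(w)) for every word in every document.

    `corpus` is a list of documents, each given as a list of word tokens.
    Returns one {word: weight} dict per document.
    """
    n_docs = len(corpus)
    doc_freq = Counter()                 # df(w): number of documents containing w
    for doc in corpus:
        doc_freq.update(set(doc))

    weights = []
    for doc in corpus:
        term_freq = Counter(doc)         # f_d(w): raw count of w in d
        weights.append({
            w: tf * math.log(n_docs / doc_freq[w])
            for w, tf in term_freq.items()
        })
    return weights

# Toy example with two hypothetical complaint snippets.
docs = [
    "the defendant negligently operated the vehicle".split(),
    "the plaintiff alleges breach of contract".split(),
]
print(tfidf_weights(docs)[0])
```

With only two documents, a word shared by both (for example "the") receives weight zero, which is exactly the corpus-specific down-weighting the statistic is meant to provide.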
2 BACKGROUND State-of-the-art language embedding models have seen recent
Predicting legal outcomes has traditionally been the purview, not success by providing lower 𝑑 dimensional embeddings of words
only of practicing lawyers, but also researchers of judicial behavior and documents, where 𝑑 ≈ 102 << |𝑉 |. These architectures are pre-
in law, political science, and recently, computer science [37]. Early trained on large general corpora and then either applied directly
efforts to computationally model legal decision making focused on to legal documents or fine-tuned using transfer learning to legal
representations of rules obtained from case law and legislation [68, specific applications. The law2vec model is a neural embedding
32, 64]. When legal decisions can be modelled as a deterministic architecture based on word2vec [56] that was pre-trained on a large
process, rule-based AI has achieved considerable success [16, 23]. legal corpus consisting of mostly legislative documents [14]. Taking
More recently, ML has made a large impact on the research and advantage of recent developments in NLP, some researchers have
practice of law in general, and in predictive litigation analysis used transfer learning techniques from pre-trained transformer
specifically [2, 3, 28, 63]. models [21] to classify U.S. Fourth Amendment cases [35].
Several efforts are focused on extracting standardized data sets to Machine learning methods for predicting court outcomes have,
support ML in law. The Supreme Court Database contains over two thus far, been mostly trained using judicial opinions. One study
hundred years of U.S. Supreme Court cases each containing hun- developed a random forest classifier to predict over 240,000 justice
dreds of variables [70]. The CASELAW4 data set contains 350,000 votes and about 28,000 case outcomes for the U.S. Supreme Court
common law judicial decisions extracted from US State appellate from 1816 through 2015 [41]. The method predicted court decisions
courts [62]. The University of Oxford is constructing a database with 70.2% accuracy and justice votes with 71.9% accuracy. By
of 100,000 US court case decisions with features that include the comparison, legal experts at best accurately predicted about 66%
facts of the case, judgements, location, timing, and judicial opin- of the outcomes in sixty-eight cases argued in the U.S. Supreme
ions [24]. These and other similar works [77] provide benchmarks Court’s 2002 Term [65]. It is a common practice to use a court’s
that will accelerate the use of ML in litigation in a similar manner past decisions to predict its future decisions, as was done with data
as CIFAR [44] and MNIST [49] for image classification. from the European Court of Human Rights [54]. French Supreme


Figure 1: Motion prediction pipeline overview. Two sets of features were computed from the State of Connecticut Judicial Branch court administrative database (minimal and subset, Table 1) and combined with natural language features extracted from complaint documents using word2vec, TF-IDF, and a rule-based algorithm. We optimized the hyperparameters of six ML models using grid search. Here, grid search is described in two dimensions where the circles denote parameter configurations and the curves on each axis denote the marginal classification accuracy. In this example, classification accuracy has higher variability across values of parameter 1, which primarily determines the choice of the best parameter setting (red point). A toy decision tree on the attorney specialization and major code features is shown under our six ML models.
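To make the attorney specialization (case-entropy) feature mentioned in the caption concrete, the sketch below computes the Dirichlet-smoothed entropy defined in Section 3.2 for a single attorney. It is one plausible reading of the formula rather than the authors' implementation; the paper additionally derives 𝑚 − 1 per-code features from the same posterior, which this sketch does not reproduce.

```python
import numpy as np

def attorney_specialization(case_counts, alpha=1.0):
    """Shannon entropy of the posterior-mean case-type distribution for one attorney.

    `case_counts` is w_a = (w_1, ..., w_m), the number of cases the attorney
    litigated in each major case code. The Dirichlet prior (all alphas set to 1
    in the paper) smooths the relative frequencies for small caseloads.
    """
    w = np.asarray(case_counts, dtype=float)
    theta = (w + alpha) / (w + alpha).sum()        # E[theta_a | w_a]
    return float(-(theta * np.log(theta)).sum())   # H(E[theta_a | w_a])

# An attorney with 40 tort and 2 vehicular cases is more specialized
# (lower entropy) than one with an even 20/20 split.
print(attorney_specialization([40, 2]))   # approx. 0.25
print(attorney_specialization([20, 20]))  # approx. 0.69
```

Lower entropy indicates a more specialized attorney, and the uniform prior keeps the estimate stable for attorneys with only a handful of cases.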

Court decisions have been modelled from historical rulings data In this work, we consider 𝑘 = 2 where 1 and 2 correspond to
using a linear support vector machine (SVM) classifier, assuming a motion denied and granted respectively. Let the observed data be
bag-of-words representation for the rulings documents [73]. (𝑥𝑥 1, . . . ,𝑥𝑥 𝑖 , . . . ,𝑥𝑥 𝑛 ) = 𝑋 ∈ R𝑛×𝑝 , an 𝑛 × 𝑝 matrix of 𝑛 civil court
Importantly, these methods require data from decisions and opin- cases each containing 𝑝 covariates. Note that 𝑋 can, in general,
ions, which distinguish them from other uses of ML to support contain real, nominal, ordinal, or integer valued variables. Given a
real-time litigation support in court cases [25, 57]. Other tools, like training set (𝑋 𝑋 ,𝑌 𝑌 ), the goal is to build a model that predicts class
MyOpenCourt, provide an AI platform directly to clients for an- labels 𝑌¯ 𝑡𝑒𝑠𝑡 from held-out test data 𝑋 𝑡𝑒𝑠𝑡 to maximize classifica-
swering legal questions [45]. While these tools do provide real-time tion accuracy 𝑇|𝑌𝑌𝑃 +𝑇 𝑁| where TP and TN are true positives and true
𝑡𝑒𝑠𝑡
support for decision making, the focus is on data mining and learn- negatives respectively.
ing legal recommendations to support self-represented litigants,
not predictive analytics. 3.1 Connecticut Civil Court Data
The values for 𝑋 and 𝑌 were collected from the State of Connecticut
2.1 Our Contributions Judicial Branch, which provides access to materials such as public
Existing approaches to predicting court case outcomes focus on records and court case documents, as well as researcher access to
final judgements of a trial and rely on retrospective court data from their civil court administrative data [71]. The court administrative
appellate or national court decisions, and thus are not amenable data is populated by courthouse staff and stored in a centralized
for informing motion-level decision making [4, 41, 54, 73]. In this relational database, which was rebuilt locally in mySQL. Court case
work, we contribute to the field of legal analytics in several ways: documents are scanned and made available through the Judicial
(1) we define new lower dimensional features to support predic- Branch Law Library API.
tive modelling at the case-level; We focus on predicting the outcomes for motions to strike. A
(2) we develop and benchmark the first computational pipeline motion to strike has significant influence on case outcomes and
to assist lawyer and client decision making through the pre- are therefore an important factor in legal decision making. In Con-
diction of motion outcomes in district court data (Figure 1); necticut, a motion to strike is a written petition typically from a
(3) we analyze the predictability of motions to strike using both defendant to a judge to remove part or all of a plaintiff’s complaint
court administrative data and natural language features ex- allegations based on legal insufficiency. While we restrict our at-
tracted from complaint documents; tention to civil cases filed in Connecticut, the methods apply more
(4) we provide this pipeline (code, trained models, evaluation generally to other states as long as minimal docket information and
scripts) freely available and open-source. complaint documents are available.

3 METHODS 3.2 Feature Engineering: Court Administrative


Consider 𝑛 motions filed in civil court whose outcomes are rep- Data
resented as a collection of random variables (𝑦1, . . . ,𝑦𝑖 , . . . ,𝑦𝑛 )𝑇 = Most features in the court administrative data were either not
𝑌 ∈ {1, . . . ,𝑘 }𝑛 where 𝑘 is the number of distinct judicial rulings. relevant to motion outcomes, had high missingness rates, or were


categorical variables with a number of levels proportional to the size of the data. Therefore, we developed custom SQL scripts to extract 3 informative features based on domain expertise and low missing data rates (< 0.6% missingness). In total, we considered four court administrative features: juris number, major code, case location, and attorney specialization (Table 1). The juris number is a unique identifier for the attorney or firm representing the defendant. The major code represents the case type encoded as a Bernoulli variable for tort or vehicular cases. The case location encodes a 15 dimensional categorical variable denoting the Connecticut superior court location for the case.

The attorney specialization is derived based on the entropy of the case type (i.e., major case code) distribution for each attorney. Formally, let the number of different major case codes (e.g. tort or vehicular) associated with an attorney be 𝑚 and the counts of cases litigated by an attorney in each major case code be 𝑤 = (𝑤_1, . . . , 𝑤_𝑚). Then, we model the case counts 𝑤 for attorney 𝑎 as a multinomial distribution with a Dirichlet prior,

    𝑤_𝑎 ∼ Multinomial(𝜃_𝑎),    𝜃_𝑎 ∼ Dirichlet(𝛼_1, . . . , 𝛼_𝑚).

In this work, we set (𝛼_1, . . . , 𝛼_𝑚) = 1. After observing 𝑤_𝑎, the posterior is a Dirichlet-multinomial distribution

    𝑃(𝜃_𝑎 | 𝑤_𝑎) = Dir(𝛼_1 + 𝑤_{𝑎1}, . . . , 𝛼_𝑚 + 𝑤_{𝑎𝑚}).

We compute the specialization for attorney 𝑎 and major case code 𝑗 as the entropy of the posterior expectation:

    H(E[𝜃_{𝑎𝑗} | 𝑤_𝑎]) = − Σ_{𝑗=1}^{𝑚} [ (𝑤_{𝑎𝑗} + 𝛼_𝑗) / Σ_{𝑘=1}^{𝑚} (𝑤_{𝑎𝑘} + 𝛼_𝑘) ] · log [ (𝑤_{𝑎𝑗} + 𝛼_𝑗) / Σ_{𝑘=1}^{𝑚} (𝑤_{𝑎𝑘} + 𝛼_𝑘) ]

where H is Shannon's entropy whose probability vector is the relative frequency of the 𝑗th major case code smoothed by taking the expectation of a Dirichlet-multinomial distribution with prior Dirichlet(𝛼_1, . . . , 𝛼_𝑚); this smoothing reduces the variance of entropy calculations for attorneys with small caseloads. Note that this generates 𝑚 − 1 features.

We consider two feature sets derived from these four court administrative features, minimal and subset (see Table 1).

Feature                  | Description                                                         | S
Juris Number             | Defendant attorney or firm identifier                               |
Attorney Specialization  | Case type specialization smoothed by a Dirichlet-multinomial prior  |
Major Code               | Classifies the case type (e.g. tort)                                | ✓
Case Location            | Superior court location for the case                                | ✓

Table 1: Features of the CT Civil Case Administrative Database. The column identified as S denotes whether or not the feature was only included in the subset feature set (and not in the minimal feature set).

3.3 Feature Engineering: Complaint Documents
We downloaded 7904 complaint documents associated with each case in our data as PDF files from the State of Connecticut Judicial Branch website using custom crawling scripts [71]. If a PDF contained text, we used pdftotext (version 0.26.5) to convert the PDF to a text file (see Supplemental Methods for additional details). If a PDF contained an image, we first converted the PDF to a TIFF file using ImageMagick (version 6.9.10-68 Q16) [72]. Then, we converted the TIFF to text using tesseract (version 4.1.1-rc2-20-g01fb) [69]. The tesseract optical character recognition (OCR) engine is based on LSTM neural networks and maintained by Google. It has been shown to have high accuracy on machine-written characters and black and white images, both of which characterize complaint documents [75].

3.3.1 Rule-based features for complaint documents. We also consider algorithmically generated natural language features based on a sequential covering rule generating algorithm [1]. A rule 𝑅 maps a condition (antecedent) to a class (consequent). Here, antecedents are a conjunction of Boolean conditions indicating the presence of a word in a complaint document and the consequent is motion granted or denied. For example, the rule (𝑐𝑎𝑟 ∈ 𝐷) AND (𝑎𝑐𝑐𝑖𝑑𝑒𝑛𝑡 ∈ 𝐷) AND (𝑛𝑒𝑔𝑙𝑖𝑔𝑒𝑛𝑐𝑒 ∈ 𝐷) ⇒ 1 would map a motion to strike associated with complaint 𝐷 to denied if the words "car", "accident", and "negligence" are contained within 𝐷.

The sequential covering algorithm proceeds by learning a rule that maximizes some function. Then, the rule is added to a list and all of the complaint documents covered by this rule are removed from the data. This process is repeated until all documents are removed or sufficient coverage of the data is reached. In this work, we learned the number of rules by cross-validation on the training set. We consider two functions (or criteria) to optimize: a simple criterion and the First Order Inductive Learner (FOIL) criterion.

The function which the simple sequential covering algorithm optimizes is

    (𝑛⁺ + 1) / (𝑛* + 𝑘)    (1)

where 𝑛⁺ is the frequency of a word that appears in the granted (GR) documents, 𝑛* is the total number of occurrences of the word across all documents, and 𝑘 is the number of classes (here, 𝑘 = 2). In other words, the simple algorithm greedily adds a term to the antecedent that increases the rule's accuracy the most.

A separate function is the FOIL criterion, which is optimized by the RIPPER algorithm [17, 18]. The FOIL criterion is less greedy than Equation 1, attempting to balance information gain with document coverage:

    𝑛₂⁺ · ( log₂( 𝑛₂⁺ / (𝑛₂⁺ + 𝑛₂⁻) ) − log₂( 𝑛₁⁺ / (𝑛₁⁺ + 𝑛₁⁻) ) )

where
• 𝑛₁⁺ (𝑛₁⁻) is the number of complaint documents associated with a motion to strike that were granted (denied) that the rule covers;
• 𝑛₂⁺ (𝑛₂⁻) is the number of complaint documents associated with a motion to strike that are changed to positive (negative) with the addition of a prospective word to the antecedent.

3.3.2 Word embeddings for complaint documents. We considered three architectures to construct complaint document features from neural embeddings: word2vec [56], doc2vec [48], and law2vec [14]. The word2vec model maximizes log 𝑃(𝑤_𝑂 | 𝑤_𝐼), or the log probability of a given word 𝑤_𝑂 given an input word 𝑤_𝐼. The doc2vec model


is similar but instead of conditioning on an input word 𝑤_𝐼, it conditions on a vector representing the document. We also consider the law2vec model which was trained on 123,066 documents, including 53,000 UK legislative documents, 62,000 European legislative documents, and thousands of other English legislative, U.S. code, and opinion documents. We computed 300 dimensional embeddings for each complaint document using doc2vec and word2vec models that were pre-trained on AP News, Google News, and Wiki articles. The law2vec repository provides pretrained models for 100 and 200 dimensional vectors, thus we selected 200 dimensional vectors. To compute a document representation using word2vec or law2vec, we computed an average word2vec vector weighted by term frequency–inverse document frequency (TF-IDF) [53].

We also produce models trained on a combination of word embedding and rule classifiers. A simple word2vec or simple law2vec TF-IDF model computes an average word2vec vector for each document that is weighted by TF-IDF and including only those words identified by the simple rule-based classifier. Likewise, a FOIL word2vec or FOIL law2vec TF-IDF model computes an average word2vec vector for each document that is weighted by TF-IDF and including only those words identified by the FOIL rule-based classifier.

3.4 Modelling Motion Outcomes
We considered six classification models to predict motion to strike outcomes: adaBoost [29], decision trees [66], gradient boosting [31], random forests [11], SVM [10, 19], and XGBoost [15]. These methods were selected based on their relatively high explainability [81], applicability in small-data settings, and ability to model mixed data types (e.g. real and categorical-valued variables).

3.4.1 Tree Classifiers. Tree classifiers aim to construct a hierarchy of decision rules that recursively split the data until a leaf is reached that denotes the inferred class label. Formally, the hierarchy is represented as a tree where internal nodes denote a bifurcation of a subset of samples based on maximum information gain splits (although splitting based on other criteria like Gini index is also common). Information gain is the difference between the information entropy of the samples at an internal node in the decision tree and the conditional entropy of the samples split on a feature. Intuitively, after an internal node split, the bifurcation should yield subsets that have higher purity. Since finding an optimal decision tree is NP-hard [36, 46], trees are typically constructed in a greedy fashion with successive maximum entropy gain splits and leaves that define a sample's classification. In this work we build decision tree classifiers [66], which produce a single decision tree, and random forest classifiers [11], which are ensembles of decision trees built from a bootstrap subset of the training data.

3.4.2 Boosting Methods. Boosting is a general ML technique based on building an accurate learner from an ensemble of weak learners [30]. Boosting classifiers learn an ensemble of weak learners (e.g. shallow decision trees) sequentially. In this work, we train Adaboost, gradient boosting trees, and XGBoost classifiers. Adaboost, the first practical boosting algorithm [67], builds new decision trees with successive boosting iterations re-weighting training instances such that newly built decision trees focus more on the samples that were previously misclassified. Gradient boosting trees generalize Adaboost, optimizing an approximate negative gradient of the binomial deviance loss function [31]. XGBoost, or Extreme Gradient Boosting, is a highly efficient implementation of gradient boosting trees with an adjusted loss function to control the complexity of the decision trees [7]:

    L_xgb = Σ_{i=1}^{n} L(y_i, F(x_i)) + Σ_{m=1}^{M} Ω(h_m),    Ω(h_m) = 𝛾T_m + (1/2) 𝜆‖s‖²

where L(y_i, F(x_i)) is a loss function computed from observed motion outcome y_i and computed outcome F(x_i), h_m is the m-th weak learning model with T_m leaves and s leaf output scores.

3.4.3 Support Vector Machines. Support vector machines classify data by solving a convex optimization problem yielding a hyperplane that optimally separates two classes [10, 19]. Optimality is measured with respect to the size of the margin between granted and denied motions. They can efficiently compute non-linear decision boundaries with kernel functions, representing similarities in an inner product space.

4 RESULTS
The court administrative data from the State of Connecticut Judicial Branch contained 916,805 observations from 184,125 unique cases. First, we extracted 7904 motions to strike filed by a defendant against a plaintiff between July 1, 2004 and February 18, 2019. In the results that follow, we focus on predicting the outcome of a motion to strike in civil tort and vehicular cases based on court administrative data and complaint documents.

Classifier performance can greatly depend on model selection [26]. For example, when using default parameters, the performance of XGBoost is significantly worse than models with optimized hyperparameters [7]. For model selection, we performed a grid search for each method, with 10 fold cross validation on 70% of the data used

Figure 2: Performance comparison by feature set using court administrative database features. We compared the motion to strike classification accuracy on minimal and subset feature sets for six ML models (adaBoost (ab), decision trees (dt), gradient boosting (gb), random forests (rf), support vector machines (svm), and XGBoost (xgb)). Box plots show the distribution of 100 bootstrapped samples with Tukey whiskers (median ± 1.5 times interquartile range).
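A hedged sketch of the model-selection and evaluation loop summarized above and in Figure 2: grid search with 10-fold cross-validation on the training split, followed by bootstrapped accuracy on held-out data, using scikit-learn and XGBoost. The feature matrix, labels, and parameter grids are placeholders rather than the published configuration, and the bootstrap here resamples only the test set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
from xgboost import XGBClassifier

# Placeholder data: rows are motions, columns are engineered features,
# y is 0 (denied) / 1 (granted).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

models = {
    "ab":  (AdaBoostClassifier(),          {"n_estimators": [50, 200]}),
    "dt":  (DecisionTreeClassifier(),      {"max_depth": [3, 5, None]}),
    "gb":  (GradientBoostingClassifier(),  {"learning_rate": [0.05, 0.1]}),
    "rf":  (RandomForestClassifier(),      {"n_estimators": [100, 300]}),
    "svm": (SVC(),                         {"C": [0.1, 1, 10]}),
    "xgb": (XGBClassifier(eval_metric="logloss"), {"max_depth": [3, 6]}),
}

for name, (clf, grid) in models.items():
    # Hyperparameter optimization with 10-fold cross-validation on the training split.
    search = GridSearchCV(clf, grid, cv=10, scoring="accuracy").fit(X_tr, y_tr)
    # 100 bootstrap resamples of the held-out test set estimate accuracy variability.
    accs = []
    for _ in range(100):
        Xb, yb = resample(X_te, y_te)
        accs.append(accuracy_score(yb, search.predict(Xb)))
    print(f"{name}: mean bootstrap accuracy = {np.mean(accs):.3f}")
```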


Figure 3: Predicting motion to strike outcomes across court administrative data and complaint documents. Distinct classifiers were trained on court administrative data and dense features computed from complaint documents. Document embedding features based on doc2vec [48] (black) and word2vec [56, 53] (blue) largely improved the classification accuracy of motion outcomes versus court administrative database features alone (red) for six classifiers [61]: AdaBoost (ab), decision trees (dt), gradient boosting (gb), random forests (rf), support vector machines (svm), Xgboost (xgb). Box plots are drawn with Tukey whiskers (median ± 1.5 times interquartile range).
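The document features compared in Figure 3 combine word embeddings with TF-IDF weights (and, in the combined models, a rule-selected vocabulary). The sketch below shows one plausible way to build such a representation with gensim's word2vec interface; the toy corpus, weights, and the keep vocabulary are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from gensim.models import Word2Vec

def doc_embedding(tokens, w2v, tfidf, keep=None):
    """Average the word vectors of `tokens`, weighted by per-word TF-IDF scores.

    `w2v` is a trained gensim Word2Vec model, `tfidf` maps word -> weight for
    this document, and `keep` optionally restricts the vocabulary to words
    selected by a rule-based classifier (simple or FOIL).
    """
    vecs, weights = [], []
    for w in tokens:
        if w in w2v.wv and (keep is None or w in keep):
            vecs.append(w2v.wv[w])
            weights.append(tfidf.get(w, 0.0))
    if not vecs or sum(weights) == 0:
        return np.zeros(w2v.vector_size)
    return np.average(np.array(vecs), axis=0, weights=weights)

# Toy usage: train a tiny word2vec model and embed one "complaint".
corpus = [["car", "accident", "negligence"], ["contract", "breach", "damages"]]
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=20)
doc = ["car", "accident", "negligence"]
weights = {"car": 0.3, "accident": 0.5, "negligence": 0.9}  # e.g. from TF-IDF
print(doc_embedding(doc, model, weights, keep={"car", "accident", "negligence"}).shape)
```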

for training and validation. To estimate variability in classification improved classification accuracy for all models. These results high-
accuracy, we computed 100 bootstrapped samples for each model light the utility of incorporating natural language features in the
selected from our grid search. prediction of motion outcomes.
First, we evaluated the minimal and subset feature sets (Table 1) Interestingly, the difference between minimal and subset fea-
associated with legal cases to determine feature relevancy for mo- turization was diminished when including complaint document
tion outcome prediction. We varied the feature composition for the features into the model. We observed this behavior when consider-
court administrative data and compared classifier accuracy (Fig. 2). ing both doc2vec (Supp. Fig. 1) and word2vec (Supp. Fig. 2) features,
Adding in the case location and major code features improved although word2vec features continued to yield better performing
median accuracy for all methods besides random forests. Overall, models. These findings suggest that complaint document embed-
decisions trees exhibited the highest motion to strike classification dings effectively capture major code and case location features.
accuracy (mean: 0.583, median: 0.583) with slightly higher perfor- We next investigated feature significance for motion to strike
mance than boosting methods. Given these results, we primarily outcome prediction for decision trees built on word2vec features.
focus our analysis on the subset featurization. We chose word2vec features because models built on word2vec
To assist interpretation of model performance, we compared embeddings produced more accurate results than database-only or
our ML models with a naive baseline. The naive classifier predicts database and doc2vec features; we selected decisions trees since
motion outcomes using the empirical frequency of the training these are the most explainable of the six models tested with similar
set; with 52% of the motions being granted, we observed a naive performance to the boosting models. We found that features derived
baseline accuracy of 0.501. During model selection, we observed from the complaint documents were universally important across
a maximum classification accuracy of 0.644 using Adaboost with all decision tree classifiers (Fig. 4). The case type (Major Code) was
dense word embeddings, corpus specific TF-IDF weightings, and also an important feature for a subset of models, as was the attorney
FOIL algorithmic rules. This same model had a mean accuracy of specialization (entropy).
0.605 over 100 bootstraps. However, the highest mean accuracy Next, we quantified whether tort or vehicular cases had subtle
score from the same group of features was found in decision tree differences that made prediction easier by stratifying classification
classifiers with 0.606. accuracy by case type in our best performing feature configura-
Next, we evaluated whether court administrative data alone tion (database features, FOIL and TF-IDF weighted word2vec). All
was sufficient for learning accurate motion to strike classification methods, excluding SVM, predicted motion to strike outcomes in
models. Using only database features, all methods produced classifi- vehicular cases with a significantly higher accuracy than tort cases
cation accuracies less than 0.60 (Fig. 3). Subsequently, we evaluated (one-way paired t-test, 𝑝 ≤ 2.2 × 10−16 ) (Fig. 5). This is likely due
these same methods, but including dense natural language fea- to inherent properties of vehicular cases and not class imbalance
tures extracted from complaint documents. While concatenating since vehicular cases encompassed approximately 47% of the total
doc2vec features to the database feature vectors improved classifi- cases.
cation accuracy for most methods, a more careful model defined Lastly, we investigated if legal domain-specific word2vec mod-
over word2vec features using corpus specific weighting (TF-IDF) els would improve classifier performance compared to word2vec


Figure 4: Decision tree feature importance for word2vec models. The vertical axis denotes the model configuration from TF-IDF weighted, simple (simp) or FOIL algorithmic rules, and minimum (min) or subset (sub) feature sets. The x-axis gives features with at least one model yielding a non-zero importance score. The case major code, attorney specialization (entropy), and word2vec (W2V) features exhibited the highest average feature importance across models. The word2vec feature importance weights were summed across all word2vec features.

Figure 5: Classifier accuracy by case type. We stratified classifier performance on motion to strike prediction by vehicular or tort case types (major code). Here, we include classifiers trained on court administrative and word2vec features. Box plots show the distribution of 100 bootstrapped samples with Tukey whiskers (median ± 1.5 times interquartile range).

Figure 6: Precision-Recall curves for law2vec classifier on subset and TF-IDF weighted simple features. We trained and tested with 100 bootstrapped samples the law2vec model using TF-IDF weighted simple and subset features. Precision and recall were computed for several thresholds for six ML models (adaBoost (ab), decision trees (dt), gradient boosting (gb), random forests (rf), support vector machines (svm), and XGBoost (xgb)). Average precision (AP) for each method is given in the legend (dt 0.6274, xgb 0.6273, ab 0.6186, svm 0.6042, gb 0.5734, rf 0.5661).

trained on general corpora [42]. Overall, we did not find that law2vec improved classifier performance. Using court administrative data, FOIL, and TF-IDF weighted features, word2vec classifier median performance was greater than or approximately equal across all methods besides XGBoost (Supp. Fig. 3); the behavior was similar when removing FOIL features (Supp. Fig. 4). While tree and boosting classifiers outperformed our SVM model overall, when we compared the performance of each classifier by varying the classification threshold, all methods had similar average precision when including TF-IDF weighted features from the simple (Fig. 6) and FOIL (Supp. Fig. 5) rule based classifiers or when including (Supp. Fig. 6) and excluding (Supp. Fig. 7) FOIL features without TF-IDF weighting. However, SVMs have noticeably higher average precision than their classification accuracy would suggest, providing strong justification for SVMs when multiple classification thresholds are considered.

5 DISCUSSION
In this work we focused on motions to strike, but the prediction of other high impact motions may be of interest to lawyers and their clients. For example, outcomes of interest may include the file dates of a disposition, the cost or activity associated with the discovery period (which is typically a function of complexity of motions during discovery), or motions for default judgement – which can happen when a party does not respond to a court filing. A motion for summary judgement is another high-impact motion which occurs with high frequency; this motion is a written petition which argues that, based on the evidence gathered thus far, no reasonable jury could find that the plaintiff has proved one or


more genuinely-disputed facts that matter to one or more of the ensure ethical and legal compliance in situations where the ethics
plaintiff’s legal claim. However, careful consideration must be taken or legal implications are ill-defined or cannot be mathematically
when deciding which features or court documents are relevant. For modelled. Model-based interpretability focuses on restricting the set
example, a motion for summary judgement depends on the quality of models such that a trained model directly informs relationships
of the evidence collected by the parties thus far, which may not be among model variables [58]. The methods developed in this work
well captured by the text of court documents. were selected, in part, due to high model-based interpretability (e.g.
Our comparisons between doc2vec and law2vec suggested there decision trees), but, incorporating causal reasoning and explicit
are potential benefits to incorporating neural embeddings from Bayesian modelling of relevant variables in the judicial decision
models trained specifically on legal corpora. In modern transformer making process would only increase explainability, interpretability,
architectures, domain specific pre-trained models, e.g. BioBERT [51], and ultimately, trustworthiness of the model. One possibility is to
have been shown to outperform transfer learning fine-tuning ap- consider the recent work in causal frameworks for decision trees [33,
proaches from general corpora like Wikipedia. While some of these 52, 78].
models are in the legal domain, e.g. patentBERT [50], no such model
exists for court documents. However, one significant challenge 6 CONCLUSIONS
to applying such models in the legal domain is that the memory By developing ML workflows with feature engineering rooted in
requirements for transformers scale quadratically with sequence legal domain expertise, we developed methods to help researchers
length, prohibiting them from being applied directly to the longer better understand the predictability of trial motions and practition-
texts that are common in complaints and judicial opinions [6]. Re- ers the ability to make more informed decisions. We developed and
cent work on extending the range of transformers provides some benchmarked the first ML methods to predict motion outcomes
evidence that this issue will be addressed [79, 76]. using only data that is available to all parties at trial. Our work
A limitation of the data in our analysis is that courts may occa- demonstrated that motion to strike outcomes are predictable with
sionally grant a motion to strike in part. This can occur, for example, high accuracy when new features like attorney specialization are
if two or more legal claims levied against a defendant were formally combined with complaint document embeddings.
challenged by a motion to strike. The court may grant the motion We expect these methods will be a valuable resource for lawyers
to strike with respect to a single legal claim and deny it for others. and their clients by enabling the estimation of case strength. For
In these cases, the motion to strike order code in the Law Library example, our methods can be used to predict if the case will sur-
data is noisy since there is only a single value that is provided. vive a motion to strike after a complaint document is filed. Based
Furthermore, the procedure with which courts interpret this order on predicted motion outcomes, both parties can make more in-
code is heterogeneous. Some courts interpret a granted motion as formed settlement decisions and lawyers representing the plaintiff
any motion granted in part or in full; other courts only use the can revise the language in their complaint documents. Fitted ML
granted order code when the motion is granted in full. Addressing models, training code, and benchmarking code can be accessed at
this issue either requires reforming and unifying the data process- https://github.com/aguiarlab/motionpredict.
ing procedures across Connecticut courts or developing methods
to parse out distinct legal claims from complaint documents and REFERENCES
then matching them to judicial order documents. [1] Charu C Aggarwal. 2018. Machine learning for text. Springer.
The relevant features for the motion prediction problem can [2] Sharan Agrawal et al. 2017. Affirm or reverse? using machine
also likely be improved. For example, an attorney is defined based learning to help judges write opinions. NBER Working Paper,
on their juris number and a derived attorney specialization fea- 29.
ture. There are other features that are likely relevant for predicting [3] Benjamin Alarie et al. 2016. Using machine learning to pre-
motion outcomes, e.g., attorney experience, record, or case load. dict outcomes in tax law. Can. Bus. LJ, 58, 231.
Similar feature engineering can be implemented for judges. Case [4] Nikolaos Aletras et al. 2016. Predicting judicial decisions
location and other high dimensional categorical features can be of the european court of human rights: a natural language
one-hot encoded, but may also benefit from a descriptive, lower processing perspective. PeerJ Computer Science, 2, e93.
dimensional set of features based on, e.g., court culture. [5] Katie Atkinson et al. 2020. Explanation in AI and law: past,
With the rise of ML in the legal domain, governments across the present and future. Artificial Intelligence, 103387.
globe are placing new emphases on ensuring AI-assisted decision [6] Iz Beltagy et al. 2020. Longformer: the long-document trans-
making is done in an ethical, transparent, and nondiscriminatory former. arXiv preprint arXiv:2004.05150.
manner. The European Commission for the Efficiency of Justice [7] Candice Bentejac et al. 2020. A comparative analysis of gra-
adopted 5 principles in the European Ethical Charter on the use of dient boosting algorithms. Artificial Intelligence Review, 1–
AI in judicial systems [82]. These principles guarantee that judi- 31.
cial AI is compatible with fundamental rights, nondiscriminatory, [8] Ryan C Black and James F Spriggs. 2008. An empirical anal-
transparent, impartial, fair, and explainable. In the U.S., the Na- ysis of the length of US Supreme Court opinions. Hous. L.
tional Center for State Courts has identified data transparency Rev., 45, 621.
and investigating how AI transforms judicial processes as national [9] David M Blei et al. 2003. Latent dirichlet allocation. the Jour-
priorities [59]. nal of Machine Learning Research, 3, 993–1022.
Fundamental to upholding these ideals is developing methods
that are interpretable. Interpretability provides a mechanism to


[10] Bernhard E Boser et al. 1992. A training algorithm for optimal Social Science, 16, 1, 39–57. https://doi.org/10.1146/annurev-
margin classifiers. In Proceedings of the fifth annual workshop lawsocsci-052720-121843.
on Computational learning theory, 144–152. [29] Yoav Freund and Robert E Schapire. 1997. A decision-theoretic
[11] Leo Breiman. 2001. Random forests. Machine learning, 45, 1, generalization of on-line learning and an application to
5–32. boosting. Journal of computer and system sciences, 55, 1, 119–
[12] Shelagh MR Campbell. 2017. Exercising discretion in the 139.
context of dependent employment: assessing the impact of [30] Yoav Freund et al. 1999. A short introduction to boosting.
workload on the rule of law. Legal Studies, 37, 2, 305–323. Journal-Japanese Society For Artificial Intelligence, 14, 771-
[13] John Celona. 2016. Winning at Litigation through Decision 780, 1612.
Analysis: Creating and Executing Winning Strategies in any [31] Jerome H Friedman. 2002. Stochastic gradient boosting. Com-
Litigation or Dispute. Springer Series in Operations Research putational statistics & data analysis, 38, 4, 367–378.
and Financial Engineering. Springer. [32] Anne von der Lieth Gardner. 1984. Artificial intelligence
[14] Ilias Chalkidis. 2018. Law2Vec: Legal Word Embeddings. approach to legal reasoning. Technical report. Stanford Univ.
(2018). https://archive.org/details/Law2Vec. [33] Tim Genewein et al. 2020. Algorithms for Causal Reasoning
[15] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: a scalable in Probability Trees. en. arXiv:2010.12237 [cs], (November
tree boosting system. In Proceedings of the 22nd acm sigkdd 2020). arXiv: 2010.12237. Retrieved 12/10/2020 from http :
international conference on knowledge discovery and data //arxiv.org/abs/2010.12237.
mining, 785–794. [34] Jane Goodman-Delahunty et al. 2010. Insightful or wishful:
[16] Cary Coglianese. 2004. E-rulemaking: information technol- lawyers’ ability to predict case outcomes. Psychology, Public
ogy and the regulatory process. Admin. L. Rev., 56, 353. Policy, and Law, 16, 2, 133–157.
[17] William W Cohen et al. 1996. Learning rules that classify [35] Evan Gretok et al. 2020. Transformers for classifying fourth
e-mail. In AAAI spring symposium on machine learning in amendment elements and factors tests. In Legal Knowledge
information access. Volume 18. Stanford, CA, 25. and Information Systems: JURIX 2020: The Thirty-third An-
[18] William W Cohen and Yoram Singer. 1999. Context-sensitive nual Conference, Brno, Czech Republic, December 9-11, 2020.
learning methods for text categorization. ACM Transactions Volume 334. IOS Press, 63–72.
on Information Systems (TOIS), 17, 2, 141–173. [36] Thomas Hancock et al. 1996. Lower bounds on learning
[19] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector decision lists and trees. Information and Computation, 126, 2,
networks. Machine learning, 20, 3, 273–297. 114–122.
[20] Lindsey Devers. 2011. Plea and charge bargaining. Research [37] Allison P. Harris and Maya Sen. 2019. Bias and judging. An-
summary for Bureau of Justice Assistance, U.S. Department of nual Review of Political Science, 22, 1, 241–259. https://doi.
Justice, 1. org/10.1146/annurev-polisci-051617-090650.
[21] Jacob Devlin et al. 2018. Bert: pre-training of deep bidi- [38] Oliver Wendell Holmes. 1897. The path of the law. Harvard
rectional transformers for language understanding. arXiv Law Review, 10, 8, 457–478.
preprint arXiv:1810.04805. [39] Jonas Jacobson et al. 2011. Predicting civil jury verdicts: how
[22] Michael Evans et al. 2007. Recounting the courts? apply- attorneys use (and misuse) a second opinion. Journal of Em-
ing automated content analysis to enhance empirical legal pirical Legal Studies, 8, S1, 99–119. http://dx.doi.org/10.1111/
research. Journal of Empirical Legal Studies, 4, 4, 1007–1039. j.1740-1461.2011.01229.x.
[23] Frank Fagan and Saul Levmore. 2019. The impact of artificial [40] Robert A Karasek Jr. 1979. Job demands, job decision latitude,
intelligence on rules, standards, and judicial discretion. S. and mental strain: implications for job redesign. Administra-
Cal. L. Rev., 93, 1. tive science quarterly, 285–308.
[24] Felix Steffek. 2021. Law and Autonomous Systems Series: [41] Daniel Martin Katz et al. 2017. A general approach for pre-
Paving the Way for Legal Artificial Intelligence – A Common dicting the behavior of the Supreme Court of the United
Dataset for Case Outcome Predictions. University of Oxford. States. PLOS ONE, 12, 4, (April 2017), 1–18. https://doi.org/
(2021). https://www.law.ox.ac.uk/business-law-blog/blog/ 10.1371/journal.pone.0174698.
2018 / 05 / law - and - autonomous - systems - series - paving - [42] Nari Kim and Hyoung Joong Kim. 2017. A study on the
way-legal-artificial. law2vec model for searching related law. Journal of Digital
[25] Norman Fenton et al. 2016. Bayes and the law. Annual Review Contents Society, 18, 7, 1419–1425.
of Statistics and Its Application, 3, 1, 51–77. https://doi.org/ [43] Sang-Bum Kim et al. 2006. Some effective techniques for
10.1146/annurev-statistics-041715-033428. naive Bayes text classification. IEEE transactions on knowl-
[26] Matthias Feurer and Frank Hutter. 2019. Hyperparameter op- edge and data engineering, 18, 11, 1457–1466.
timization. In Automated Machine Learning. Springer, Cham, [44] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning mul-
3–33. tiple layers of features from tiny images.
[27] James H Fowler et al. 2007. Network analysis and the law: [45] Jason T Lam et al. 2020. The gap between deep learning and
measuring the legal importance of precedents at the us law: predicting employment notice. NLLP KDD, 7, 10.
supreme court. Political Analysis, 324–346. [46] Hyafil Laurent and Ronald L Rivest. 1976. Constructing opti-
[28] Jens Frankenreiter and Michael A. Livermore. 2020. Compu- mal binary decision trees is NP-complete. Information pro-
tational methods in legal analysis. Annual Review of Law and cessing letters, 5, 1, 15–17.


[47] David S Law and David Zaring. 2009. Law Versus Ideology: [66] S. R. Safavian and D. Landgrebe. 1991. A survey of decision
The Supreme Court and the Use of Legislative History. Wm. tree classifier methodology. IEEE Transactions on Systems,
& Mary L. Rev., 51, 1653. Man, and Cybernetics, 21, 3, 660–674.
[48] Quoc Le and Tomas Mikolov. 2014. Distributed representa- [67] Robert E Schapire. 2013. Explaining adaboost. In Empirical
tions of sentences and documents. In International conference inference. Springer, 37–52.
on machine learning, 1188–1196. [68] Marek J. Sergot et al. 1986. The British Nationality Act as a
[49] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten logic program. Communications of the ACM, 29, 5, 370–386.
digit database. http://yann.lecun.com/exdb/mnist/. [69] Ray Smith. 2007. An overview of the tesseract ocr engine.
[50] Jieh-Sheng Lee and Jieh Hsiang. 2019. Patentbert: patent clas- In Ninth international conference on document analysis and
sification with fine-tuning a pre-trained bert model. arXiv recognition (ICDAR 2007). Volume 2. IEEE, 629–633.
preprint arXiv:1906.02124. [70] Harold Spaeth et al. 2014. Supreme court database code book.
[51] Jinhyuk Lee et al. 2020. Biobert: a pre-trained biomedical (2014).
language representation model for biomedical text mining. [71] State of Connecticut Judicial Branch. 2021. Public Records
Bioinformatics, 36, 4, 1234–1240. Online. Accessed on 2021-01-01. (2021). https://jud.ct.gov/
[52] Jiuyong Li et al. 2016. Causal decision trees. IEEE Transactions lawlib/publicrecords.htm.
on Knowledge and Data Engineering, 29, 2, 257–271. [72] Michael Still. 2006. The definitive guide to ImageMagick. Apress.
[53] Joseph Lilleberg et al. 2015. Support vector machines and [73] Octavia-Maria Şulea et al. 2017. Predicting the law area and
word2vec for text classification with semantic features. In decisions of French Supreme Court cases. In Proceedings
2015 IEEE 14th International Conference on Cognitive Infor- of the International Conference Recent Advances in Natural
matics & Cognitive Computing (ICCI* CC). IEEE, 136–140. Language Processing, RANLP 2017. INCOMA Ltd., Varna, Bul-
[54] Masha Medvedeva et al. 2020. Using machine learning to garia, (September 2017), 716–722. https://doi.org/10.26615/
predict decisions of the European Court of Human Rights. 978-954-452-049-6_092.
Artificial Intelligence and Law, 28, 2, 237–266. [74] Harry Surden. 2014. Machine learning and law. Wash. L. Rev.,
[55] David E. Melnikoff and Nina Strohminger. 2020. The auto- 89, 87.
matic influence of advocacy on lawyers and novices. Nature [75] Ahmad P Tafti et al. 2016. OCR as a service: an experi-
Human Behaviour, (September 7, 2020), 1–7. mental evaluation of Google Docs OCR, Tesseract, ABBYY
[56] Tomas Mikolov et al. 2013. Efficient estimation of word rep- FineReader, and Transym. In International Symposium on
resentations in vector space. (2013). http://arxiv.org/abs/ Visual Computing. Springer, 735–746.
1301.3781. [76] Yi Tay et al. 2020. Long Range Arena: A Benchmark for
[57] Jane Mitchell et al. 2020. Machine learning for determining Efficient Transformers. en. arXiv:2011.04006 [cs], (November
accurate outcomes in criminal trials. Law, Probability and 2020). arXiv: 2011.04006. Retrieved 12/12/2020 from http :
Risk, 19, 1, (March 2020), 43–65. //arxiv.org/abs/2011.04006.
[58] W James Murdoch et al. 2019. Definitions, methods, and [77] Thomas Vacek et al. 2019. Litigation Analytics: Case out-
applications in interpretable machine learning. Proceedings comes extracted from US federal court dockets. In Proceed-
of the National Academy of Sciences, 116, 44, 22071–22080. ings of the Natural Legal Language Processing Workshop 2019,
[59] National Center for State Courts. 2021. Joint technology com- 45–54.
mittee priority topics. Accessed on 2021-03-01. (2021). https: [78] Stefan Wager and Susan Athey. 2018. Estimation and in-
//www.ncsc.org/about-us/committees/joint-technology- ference of heterogeneous treatment effects using random
committee/priority-topics-old-page. forests. Journal of the American Statistical Association, 113,
[60] Patrick W Nutter. 2018. Machine learning evidence: admissi- 523, 1228–1242.
bility and weight. U. Pa. J. Const. L., 21, 919. [79] Sinong Wang et al. 2020. Linformer: Self-Attention with
[61] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Linear Complexity. en. arXiv:2006.04768 [cs, stat], (June 2020).
Python. Journal of Machine Learning Research, 12, 2825–2830. arXiv: 2006.04768. Retrieved 12/12/2020 from http://arxiv.
[62] Alina Petrova et al. 2020. Extracting Outcomes from Ap- org/abs/2006.04768.
pellate Decisions in US State Courts. In Legal Knowledge [80] Ho Chung Wu et al. 2008. Interpreting tf-idf term weights
and Information Systems: JURIX 2020: The Thirty-third An- as making relevance decisions. ACM Transactions on Infor-
nual Conference, Brno, Czech Republic, December 9-11, 2020. mation Systems (TOIS), 26, 3, 1–37.
Volume 334. IOS Press, 133–142. [81] Feiyu Xu et al. 2019. Explainable ai: a brief survey on history,
[63] Arti K Rai. 2018. Machine learning at the patent office: lessons research areas, approaches and challenges. In CCF interna-
for patents and administrative law. Iowa L. Rev., 104, 2617. tional conference on natural language processing and Chinese
[64] Edwina L Rissland. 1990. Artificial intelligence and law: step- computing. Springer, 563–574.
ping stones to a model of legal reasoning. The Yale Law [82] Irina Moroianu Zlatescu and Petru Emanuel Zlatescu. 2019.
Journal, 99, 8, 1957–1981. Implementation of the European ethical charter on the use
[65] Theodore W Ruger et al. 2004. The supreme court forecasting of artificial intelligence in judicial systems and their envi-
project: legal and political science approaches to predicting ronment. Current Issues of the EU Political-Legal Space, 237.
supreme court decision making. Columbia Law Review, 1150–
1210.

108
Evaluating Document Representations for Content-based Legal Literature Recommendations

Malte Ostendorff (Open Legal Data, Germany, mo@openlegaldata.io)
Elliott Ash (ETH Zurich, Switzerland, ashe@ethz.ch)
Terry Ruas (University of Wuppertal, Germany, ruas@uni-wuppertal.de)
Bela Gipp (University of Wuppertal, Germany, gipp@uni-wuppertal.de)
Julian Moreno-Schneider (DFKI GmbH, Germany, julian.moreno_schneider@dfki.de)
Georg Rehm (DFKI GmbH, Germany, georg.rehm@dfki.de)

ABSTRACT

Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. Simultaneously, legal recommender systems are typically evaluated in small-scale user studies without any publicly available benchmark datasets. Thus, these studies have limited reproducibility. To address the gap between research and practice, we explore a set of state-of-the-art document representation methods for the task of retrieving semantically related US case law. We evaluate text-based (e.g., fastText, Transformers), citation-based (e.g., DeepWalk, Poincaré), and hybrid methods. We compare in total 27 methods using two silver standards with annotations for 2,964 documents. The silver standards are newly created from Open Case Book and Wikisource and can be reused under an open license facilitating reproducibility. Our experiments show that document representations from averaged fastText word vectors (trained on legal corpora) yield the best results, closely followed by Poincaré citation embeddings. Combining fastText and Poincaré in a hybrid manner further improves the overall result. Besides the overall performance, we analyze the methods depending on document length, citation count, and the coverage of their recommendations.

CCS CONCEPTS
• Information systems → Recommender systems; Similarity measures; Clustering and classification; • Applied computing → Law.

KEYWORDS
Legal literature, document embeddings, document similarity, recommender systems, Transformers, WikiSource, Open Case Book

ACM Reference Format:
Malte Ostendorff, Elliott Ash, Terry Ruas, Bela Gipp, Julian Moreno-Schneider, and Georg Rehm. 2021. Evaluating Document Representations for Content-based Legal Literature Recommendations. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466073

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. $15.00
https://doi.org/10.1145/3462757.3466073

1 INTRODUCTION

Legal professionals, e.g., lawyers and judges, frequently invest considerable time to find relevant literature [23]. More so than most other domains, in law there are high stakes for finding the most relevant information (documents) as that can drastically affect the outcome of a dispute. A case can be won or lost depending on whether or not a supporting decision can be found. Recommender systems assist in the search for relevant information. However, research and development of recommender systems for legal corpora poses several challenges. Recommender system research is known to be domain-specific, i.e., minor changes may lead to unpredictable variations in the recommendation effectiveness [4]. Likewise, legal English is a peculiarly obscure and convoluted variety of English with a widespread use of common words with uncommon meanings [30]. Recent language models like BERT [15] may not be equipped to handle legal English since they are pretrained on generic corpora like Wikipedia or cannot process lengthy legal documents due to their limited input length. This raises the question of whether the recent advances in recommender system research and underlying techniques are also applicable to law.

In this paper, we empirically evaluate 27 document representation methods and analyze the results with respect to the aforementioned possible issues. In particular, we evaluate for each method the quality of the document representations in a literature recommender use case. The methods are distinguished in three categories: (1) word vector-based, (2) Transformer-based, and (3) citation-based methods. Moreover, we test additional hybrid variations of the aforementioned methods. Our primary evaluation metric comes from two silver standards on US case law that we extract from Open Case Book and Wikisource. The relevance annotations from the silver standards are provided for 2,964 documents.

In summary, our contributions are: (1) We propose and make available two silver standards as benchmarks for legal recommender systems that currently do not exist. (2) We evaluate 27 methods of


which the majority have never been investigated in the legal context with a quantitative study and validate our results qualitatively. (3) We show that the hybrid combination of text-based and citation-based methods can further improve the experimental results.

2 RELATED WORK

Recommender systems are a well-established research field [3] but relatively few publications focus on law as the application domain. Winkels et al. [52] are among the first to present a content-based approach to recommend legislation and case law. Their system uses the citation graph of Dutch Immigration Law and is evaluated with a user study conducted with three participants. Boer and Winkels [9] propose and evaluate Latent Dirichlet Allocation (LDA) [7] as a solution to the cold start problem in collaborative filtering approaches. In an experiment with 28 users, they find the user-based approach outperforms LDA. Wiggers and Verberne [49] study citations for legal information retrieval and suggest citations should be combined with other techniques to improve the performance.

Kumar et al. [22] compare four methods to measure the similarity of Indian Supreme Court decisions: TF-IDF [43] on all document terms, TF-IDF on only specific terms from a legal dictionary, Co-Citation, and Bibliographic Coupling. They evaluate the similarity measure on 50 document pairs with five legal domain experts. In their experiment, Bibliographic Coupling and TF-IDF on legal terms yield the best results. Mandal et al. [28] extend this work by evaluating LDA and document embeddings (Paragraph Vectors [25]) on the same dataset, whereby Paragraph Vectors was found to correlate the most with the expert annotations. Indian Supreme Court decisions are also used as evaluation by Wagh and Anand [47], where they use document similarity based on concepts instead of full-text. They extract concepts (groups of words) from the decisions and compute the similarity between documents based on these concepts. Their vector representation, an average of word embeddings and TF-IDF, shows that using IDF to weight word2vec embeddings improves results. Also, Bhattacharya et al. [6] compare citation similarity methods, i.e., Bibliographic Coupling, Co-citation, Dispersion [32] and Node2Vec [17], and text similarity methods like Paragraph Vectors. They evaluate the algorithms and their combinations using a gold standard of 47 document pairs. A combination of Bibliographic Coupling and Paragraph Vectors achieves the best results.

With Eunomos, Boella et al. [8] present a legal document and knowledge management system for searching legal documents. The document similarity problem is handled using TF-IDF and cosine similarity. Other experiments using embeddings for document similarity include Nanda et al. [33] or Ash and Chen [2].

Even though different methods have been evaluated in the legal domain, most results are not coherent and rely on small-scale user studies. This finding emphasizes the need for a standard benchmark to enable reproducibility and comparability [4]. Moreover, the recent Transformer models [46] or novel citation embeddings have not been evaluated in legal recommendation research.

3 METHODOLOGY

In this section, we describe our quantitative evaluation of 27 document recommendation methods. We define the recommendation scenario as follows: The user, a legal professional, needs to research a particular decision, e.g., to prepare a litigation strategy. Based on the decision at hand, the system recommends other decisions to its users such that the research task is easy to accomplish. The recommendation is relevant when it covers the same topic or provides essential information, e.g., it overruled the seed decision [45].

3.1 Case Corpus and Silver Standard

Most of the previous works (Section 2) evaluate recommendation relevance by asking domain experts to provide subjective annotations [9, 22, 28, 52]. Especially in the legal domain, these expert annotations are costly to collect and, therefore, their quantity is limited. For the same reason, expert annotations are rarely published. Consequently, the research is difficult to reproduce [4]. In the case of US court decisions, such expert annotations between documents are also not publicly available. We construct two ground truth datasets from publicly available resources allowing the evaluation of more recommendations to mitigate the mentioned problems of cost, quantity, and reproducibility.

3.1.1 Open Case Book. With Open Case Book, the Harvard Law School Library offers a platform for making and sharing open-licensed casebooks¹. The corpus consists of 222 casebooks containing 3,023 cases from 87 authors. Each casebook contains a manually curated set of topically related court decisions, which we use as relevance annotations. The casebooks cover a range from broad topics (e.g., Constitutional law) to specific ones (e.g., Intermediary Liability and Platforms' Regulation). The decisions are mapped to full-texts and citations retrieved from the Caselaw Access Project (CAP)². After duplicate removal and the mapping procedure, relevance annotations for 1,601 decisions remain.

3.1.2 Wikisource. We use a collection of 2,939 US Supreme Court decisions from Wikisource as ground truth [50]. The collection is categorized in 67 topics like antitrust, civil rights, and amendments. We map the decisions listed in Wikisource to the corpus from CourtListener³. The discrepancy between the two corpora decreases the number of relevance annotations to 1,363 court decisions.

Table 1: Distribution of relevant annotations for Open Case Book and Wikisource.

                    Relevant annotations per document
                    Mean     Std.    Min.   25%    50%     75%     Max.
    Open Case Book  86.42    65.18   2.0    48.0   83.0    111.0   1590.0
    Wikisource      130.01   82.46   1.0    88.0   113.0   194.0   616.0

We derive a binary relevance classification from Open Case Book and Wikisource. When decisions A and B are in the same casebook or category, A is relevant for B and vice versa. Table 1 presents the distribution of relevance annotations. This relevance classification is limited since a recommendation might still be relevant despite not being assigned to the same topic as the seed decision. Thus, we consider the Open Case Book and Wikisource annotations as a silver standard rather than a gold one.

¹ https://opencasebook.org
² https://case.law
³ https://courtlistener.com
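The derivation of the binary relevance labels just described can be made concrete with a minimal sketch. The following is an illustration only, not the authors' actual pipeline; the `casebooks` mapping and the decision identifiers are hypothetical placeholders for the linked Open Case Book / Wikisource data.

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical input: each casebook (or Wikisource topic) maps to the
    # decision identifiers it contains after linking to CAP / CourtListener.
    casebooks = {
        "constitutional-law": ["mugler-v-kansas", "lochner-v-new-york", "munn-v-illinois"],
        "antitrust": ["standard-oil-v-us", "us-v-microsoft"],
    }

    # Binary relevance: decisions A and B are mutually relevant when they
    # appear in the same casebook or topic category.
    relevant = defaultdict(set)
    for decisions in casebooks.values():
        for a, b in combinations(sorted(set(decisions)), 2):
            relevant[a].add(b)
            relevant[b].add(a)

    print(sorted(relevant["mugler-v-kansas"]))
    # ['lochner-v-new-york', 'munn-v-illinois']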


3.2 Evaluated Methods

We evaluate 27 methods, each representing a legal document d as a numerical vector d ∈ R^s, with s denoting the vector size. To retrieve the recommendations, we first obtain the vector representations (or document embeddings). Next, we compute the cosine similarities of the vectors. Finally, we select the top k = 5 documents with the highest similarity through nearest neighbor search⁴. Mean Average Precision (MAP) is the primary and Mean Reciprocal Rank (MRR) is the second evaluation metric [29]. We compute MAP and MRR over a set of queries Q, whereby Q is equivalent to the seed decisions with |Q_WS| = 1363 available in Wikisource and |Q_OCB| = 1601 for Open Case Book. In addition to the accuracy-oriented metrics, we evaluate the coverage and Jaccard index of the recommendations. The coverage for the method a is defined as in Equation 1, where D denotes the set of all available documents in the corpus and D_a denotes the documents recommended by a [16].

    Cov(a) = |D_a| / |D|                          (1)

We define the Jaccard index [19] for the similarity and diversity of two recommendation sets R_a and R_b from methods a and b for the seed d_s in Equation 2:

    J(a, b) = |R_a ∩ R_b| / |R_a ∪ R_b|           (2)

We divide the evaluated methods into three categories: word vector-, Transformer-, and citation-based methods.

3.2.1 TF-IDF Baseline. As a baseline method, we use the sparse document vectors from TF-IDF [43], which are commonly used in related works [22, 33]⁵.

3.2.2 Word vector-based Methods. The following methods are derived from word vectors, i.e., context-free word representations. Paragraph Vectors [25] extend the idea of word2vec [31] to learning embeddings for word sequences of arbitrary length. Paragraph Vectors using distributed bag-of-words (dbow) performed well in text similarity tasks applied on legal documents [2, 28] and other domains [24]. We train Paragraph Vectors' dbow model to generate document vectors for each court decision. Like word2vec, GloVe [38] and fastText [10, 20] produce dense word vectors but they do not provide document vectors. To embed a court decision as a vector, we compute the weighted average over its word vectors w_i, whereby the number of occurrences of the word i in d defines the weight c_i. Averaging of word vectors is computationally effective and yields good results for representing even longer documents [1]. For our experiments, we use word vectors made available by the corresponding authors and custom word vectors. While GloVe vectors are pretrained on Wikipedia and Gigaword [38], fastText is pretrained on Wikipedia, the UMBC webbase corpus and the statmt.org news dataset [10]. Additionally, we use custom word vectors⁶ for both methods (namely fastTextLegal and GloVeLegal) pretrained on the joint court decision corpus extracted from Open Case Book and Wikisource (see Section 3.1). Using word vectors pretrained on different corpora allows the evaluation of the method's cross-domain applicability.

3.2.3 Transformer-based Methods. As the second method category, we employ language models for deep contextual text representations based on the Transformer architecture [46], namely BERT [15], RoBERTa [27], Sentence Transformers (Sentence-BERT and Sentence-RoBERTa) [41], Longformer [5] and variations of them. In contrast to Paragraph Vectors and average word vectors, which neglect the word order, the Transformers incorporate word positions making the text representations context-dependent. BERT significantly improved the state-of-the-art for many NLP tasks. In general, BERT models are pretrained on large text corpora in an unsupervised fashion to then be fine-tuned for specific tasks like document classification [36]. We use four variations of BERT: the original BERT [15] as base and large version (pretrained on Wikipedia and BookCorpus) and two BERT-base models pretrained on legal corpora. Legal-JHU-BERT-base from Holzenberger et al. [18] is a BERT base model fine-tuned on the CAP corpus. Similarly, Legal-AUEB-BERT-base from Chalkidis et al. [14] is also fine-tuned on the CAP corpus but additionally on other corpora (court cases and legislation from the US and EU, and US contracts). RoBERTa improves BERT with longer training, larger batches, and removal of the next sentence prediction task for pretraining. Sentence Transformers are fine-tuned BERT and RoBERTa models in a Siamese setting [12] to derive semantically meaningful sentence embeddings that can be compared using cosine similarity (Sentence-BERT and Sentence-RoBERTa). The provided Sentence Transformers variations are nli- or stsb-versions that are either fine-tuned on the SNLI and MNLI datasets [11, 51] or fine-tuned on the STS benchmark [13]. As the self-attention mechanism scales quadratically with the sequence length, the Transformer-based methods (BERT, RoBERTa and Sentence Transformers) bound their representation to 512 tokens. Longformer includes an attention mechanism that scales linearly with sequence length, which allows processing longer documents. We use pretrained Longformer models as provided by Beltagy et al. [5] and limited to 4096 tokens. All Transformer models apply mean-pooling to derive document vectors. We experimented with other pooling strategies but they yield significantly lower results. These findings agree with Reimers and Gurevych [41]. We investigate each Transformer in two variations depending on their availability and w.r.t. model size and document vector size (base with s = 768 and large with s = 1024).

3.2.4 Citation-based Methods. We explore citation-based graph methods, in which documents are nodes and edges correspond to citations, to generate document vectors. Like the text-based representations, citation graph embeddings have the vector size d ∈ R^300. With DeepWalk, Perozzi et al. [39] were the first to borrow word2vec's idea and apply it to graph network embeddings. DeepWalk performs truncated random walks on a graph and the node embeddings are learned through the node context information encoded in these short random walks, similar to the context sliding window in word2vec. Walklets [40] explicitly encodes multi-scale node relationships to capture community structures with the graph embedding. Walklets generates these multi-scale relationships by subsampling short random walks on the graph nodes.

⁴ We set k = 5 due to the UI of our legal recommender system [35].
⁵ We use the TF-IDF implementation from the scikit-learn framework [37].
⁶ The legal word vectors can be downloaded from our GitHub repository.
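As a rough sketch of the word vector-based representation (Section 3.2.2) and the top-k retrieval step (Section 3.2): document vectors are the term-count-weighted average of word vectors, and recommendations are the k nearest neighbours under cosine similarity. The `word_vectors` lookup stands in for pretrained fastText or GloVe vectors and is an assumption, not the released models.

    import numpy as np
    from collections import Counter

    def doc_vector(tokens, word_vectors, dim=300):
        """Count-weighted average of word vectors (cf. Section 3.2.2)."""
        counts = Counter(t for t in tokens if t in word_vectors)
        if not counts:
            return np.zeros(dim)
        vecs = np.array([word_vectors[t] for t in counts])
        weights = np.array(list(counts.values()), dtype=float)
        return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

    def top_k(seed_vec, doc_matrix, k=5):
        """Indices of the k most cosine-similar documents to the seed vector."""
        seed = seed_vec / (np.linalg.norm(seed_vec) + 1e-12)
        docs = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-12)
        sims = docs @ seed
        return np.argsort(-sims)[:k]  # in practice, the seed document itself is excluded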


BoostNE [26] is a matrix factorization-based embedding technique combined with gradient boosting. In [26], BoostNE is applied on a citation graph from scientific papers and outperforms other graph embeddings such as DeepWalk. Hence, we expect comparable results for the legal citation graph. Nickel and Kiela [34] introduced Poincaré embeddings for learning embeddings in the hyperbolic space of the Poincaré ball model rather than the Euclidean space used in the aforementioned methods. Embeddings produced in hyperbolic space are naturally equipped to model hierarchical structures [21]. Such structures can also be found in the legal citation graph in the form of different topics or jurisdictions. For DeepWalk, Walklets, and BoostNE, we use the Karate Club implementation [42].

3.2.5 Variations & Hybrid Methods. Given the conceptional differences in the evaluated methods, each method has its strengths and weaknesses. For further insights on these differences, we evaluate all methods with limited text, vector concatenation, and score summation. Unlike the Transformers, the word vector-based methods have no maximum number of input tokens. Whether an artificial limitation of the document length improves or decreases the results is unclear. Longer documents might add additional noise to the representation and could lead to worse results [44]. To make these two method categories comparable, we include additional variations of the word vector-based methods that are limited to the first 512 or 4096 tokens of the document. For instance, the method fastTextLegal (512) has only access to the first 512 tokens.

Additionally, we explore hybrid methods that utilize texts and citations. Each of the single methods above yields a vector representation d for a given document d. We combine methods by concatenating their vectors. For example, the vectors from fastText, d_fastText, and Poincaré, d_Poincaré, can be concatenated as in Equation 3:

    d = d_fastText ∥ d_Poincaré                   (3)

The resulting vector size is the sum of the concatenated vector sizes, e.g., s = 300 + 300 = 600. Recommendations based on the concatenated methods are retrieved in the same fashion as the other methods, with cosine similarity. Moreover, we combine methods by adding up their cosine similarities [48]. The combined score of two methods is the sum of the individual scores, e.g., for method X and method Y the similarity of two documents d_a and d_b is computed as in Equation 4. Methods with score summation are denoted with X + Y, e.g., Poincaré + fastTextLegal.

    sim(d_a, d_b) = sim(d_a^X, d_b^X) + sim(d_a^Y, d_b^Y)     (4)

Lastly, we integrate citation information into Sentence Transformers analogous to the fine-tuning procedure proposed by Reimers and Gurevych [41]. Based on the citation graph, we construct a dataset of positive and negative document pairs. Two documents d_a, d_b are considered as positive samples when they are connected through a citation. Negative pairs are randomly sampled and do not share any citation. Sentence-Legal-AUEB-BERT-base is the Sentence Transformer model with Legal-AUEB-BERT-base as base model and trained with this citation information.
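A minimal sketch of the two hybrid variants defined in Equations 3 and 4 above (vector concatenation and score summation). The dictionaries `emb_text` and `emb_cite` stand in for, e.g., fastTextLegal and Poincaré document vectors keyed by document id; they are assumptions for illustration only.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Equation 3: vector concatenation, e.g. 300 + 300 = 600 dimensions.
    def concat_sim(emb_text, emb_cite, d_a, d_b):
        v_a = np.concatenate([emb_text[d_a], emb_cite[d_a]])
        v_b = np.concatenate([emb_text[d_b], emb_cite[d_b]])
        return cosine(v_a, v_b)

    # Equation 4: score summation of the per-method cosine similarities.
    def summed_sim(emb_text, emb_cite, d_a, d_b):
        return cosine(emb_text[d_a], emb_text[d_b]) + cosine(emb_cite[d_a], emb_cite[d_b])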


4 RESULTS

For our evaluation, we obtain a list of recommendations for each input document and method and then compute the performance measures accordingly. We compute the average number of relevant recommendations, precision, recall, MRR, MAP, and coverage.

4.1 Quantitative Evaluation

4.1.1 Overall Results. Table 2 presents the overall evaluation metrics for the 27 methods and the two datasets. From the non-hybrid methods, fastTextLegal yields with 0.05 the highest MAP score on Open Case Book, whereas on Wikisource, fastTextLegal, Poincaré, and Walklets all achieve the highest MAP score of 0.031. The hybrid method of Poincaré ∥ fastTextLegal outperforms the non-hybrids for Wikisource with 0.035 MAP. For Open Case Book, the MAP of Poincaré + fastTextLegal and fastTextLegal are equally high.

Table 2: Overall scores for top k = 5 recommendations from Open Case Book and Wikisource as the number of relevant documents, precision, recall, MRR, MAP and coverage for the 27 methods and the vector sizes. The methods are divided into: baseline, word vector-based, Transformer-based, citation-based, and hybrid. In the original layout, the highest scores are underlined (bold for the category-wise best); ∗ values were rounded up.

                                               ----------- Open Case Book -----------   ------------- Wikisource -------------
    Methods                          Size      Rel.  Prec.  Recall MRR   MAP   Cov.     Rel.  Prec.  Recall MRR   MAP   Cov.
    TF-IDF                           500000    1.60  0.320  0.032  0.363 0.020 0.487    1.59  0.318  0.026  0.389 0.015 0.446
    Paragraph Vectors                300       2.78  0.555  0.056  0.729 0.049 0.892    2.39  0.477  0.036  0.629 0.030 0.841
    fastText                         300       2.66  0.532  0.053  0.713 0.045 0.811    2.11  0.422  0.031  0.581 0.025 0.772
    fastTextLegal                    300       2.87  0.574  0.059  0.739 0.050 0.851    2.39  0.478  0.037  0.631 0.031 0.815
    fastTextLegal (512)              300       1.97  0.394  0.037  0.591 0.028 0.835    2.16  0.433  0.034  0.587 0.027 0.809
    fastTextLegal (4096)             300       2.76  0.552  0.054  0.727 0.045 0.867    2.33  0.466  0.035  0.620 0.029 0.817
    GloVe                            300       2.68  0.536  0.054  0.702 0.046 0.814    2.06  0.412  0.033  0.577 0.026 0.789
    GloVeLegal                       300       2.82  0.564  0.057  0.724 0.048 0.834    2.31  0.461  0.037  0.621 0.030 0.804
    BERT-base                        768       1.26  0.253  0.021  0.428 0.015 0.815    1.62  0.323  0.021  0.485 0.015 0.784
    BERT-large                       1024      1.35  0.270  0.022  0.443 0.016 0.841    1.82  0.364  0.023  0.530 0.018 0.794
    Legal-JHU-BERT-base              768       1.47  0.295  0.025  0.482 0.018 0.848    1.85  0.371  0.027  0.537 0.020 0.796
    Legal-AUEB-BERT-base             768       1.66  0.331  0.028  0.506 0.021 0.884    2.01  0.401  0.027  0.573 0.022 0.813
    Longformer-base                  768       1.91  0.382  0.033  0.572 0.026 0.892    1.65  0.329  0.020  0.514 0.016 0.841
    Longformer-large                 1024      2.09  0.419  0.039  0.614 0.031 0.885    1.80  0.360  0.023  0.535 0.018 0.826
    RoBERTa-large                    1024      1.52  0.305  0.026  0.481 0.019 0.843    1.93  0.387  0.026  0.553 0.020 0.782
    Sentence-BERT-large-nli          1024      1.03  0.206  0.018  0.352 0.013 0.872    1.37  0.273  0.017  0.443 0.012 0.782
    Sentence-BERT-large-nli-stsb     1024      0.98  0.196  0.018  0.338 0.013 0.848    1.36  0.272  0.015  0.434 0.011 0.777
    Sentence-RoBERTa-large-nli       1024      0.92  0.183  0.016  0.321 0.011 0.884    1.18  0.236  0.013  0.409 0.009 0.795
    BoostNE                          300       1.29  0.258  0.022  0.442 0.016 0.800    1.24  0.248  0.016  0.398 0.013 0.832
    DeepWalk                         300       1.34  0.267  0.028  0.473 0.021 0.818    1.82  0.364  0.030  0.533 0.025 0.856
    Poincaré                         300       2.24  0.447  0.044  0.629 0.036 0.930    2.33  0.465  0.038  0.598 0.031 0.837
    Walklets                         300       2.24  0.448  0.043  0.636 0.035 0.816    2.35  0.470  0.038  0.611 0.031 0.826
    Poincaré ∥ fastTextLegal         600       2.36  0.473  0.048  0.656 0.041 0.737    2.52  0.505  0.041  0.638 0.035 0.818
    Longformer-large ∥ fastTextLegal 1324      2.26  0.451  0.043  0.642 0.035 0.876    1.91  0.383  0.025  0.547 0.020 0.829
    Poincaré + fastTextLegal         300+300   2.85  0.571  0.058  0.746 0.050 0.860    2.48  0.497  0.040  0.646 0.034 0.835
    Poincaré + Longformer-large      300+1024  2.09  0.419  0.039  0.630 0.033 0.885    1.80  0.360  0.023  0.548 0.019 0.826
    Sentence-Legal-AUEB-BERT-base    768       2.19  0.438  0.039  0.603 0.031 0.917    2.36  0.471  0.038  0.602 0.032 0.849

Due to space constraints, we remove 14 methods from Table 2 (the excluded methods are on GitHub⁹). From the word vector-based methods, we discard the 512 and 4096 token variations of Paragraph Vectors, GloVe and GloVeLegal, as they show a similar performance deterioration as fastTextLegal. The base versions of some Transformers are also excluded in favour of the better performing large versions. Similarly, the nli version always outperforms the stsb version of Sentence Transformers (sBERT and sRoBERTa). For the hybrid variations, we show only the best methods. We also tested Node2Vec [17] but exclude it given its low MAP scores.

Regarding the word vector-based methods, we see that the methods which are trained on the legal corpus (Paragraph Vectors, fastTextLegal, GloVeLegal) perform similarly well with a minor advantage by fastTextLegal. Moreover, there is a margin between the generic and legal word vectors even though the legal word vectors are trained on a small corpus compared to the generic vectors. The advantage of Paragraph Vectors over TF-IDF is consistent with the results from Mandal et al. [28]. Limiting the document length to 512 or 4096 tokens decreases the effectiveness of fastTextLegal. A limit of 512 tokens decreases the MAP score to 59% compared to all tokens on Open Case Book. With 4096 tokens, the performance decline is only minor (90% compared to all tokens). The token limitation effect is also larger on Open Case Book than Wikisource. The 4096 tokens version of fastTextLegal even outperforms all Transformer methods.

Longformer-large is the best Transformer for Open Case Book with 0.031 MAP. For Wikisource, Legal-AUEB-BERT achieves the highest MAP of 0.022, closely followed by Legal-JHU-BERT. The Longformer's theoretical advantage of processing 4096 instead of 512 tokens does not lead to better results for Wikisource, for which even BERT scores the same MAP of 0.018. We generally observe that large models outperform their base counterparts⁷. Likewise, RoBERTa has higher scores than BERT as Liu et al. [27] suggested. From the Transformers category, Sentence Transformers yield the worst results. We assume that fine-tuning on similarity datasets like NLI or STSB does not increase the performance since the models do not generalize well to other domains. However, the language model fine-tuning does improve the performance, whereby Legal-AUEB-BERT generally outperforms Legal-JHU-BERT. For Open Case Book, Legal-AUEB-BERT is the best model in the Transformer category in terms of MAP even though it is only used as base version.

Poincaré and Walklets are by far the best methods in the citation category. For Wikisource, the two citation-based methods score the same MAP of 0.031 as fastTextLegal. Compared to the word vector-based methods, the citation methods do better on Wikisource than on Open Case Book.

In the category of hybrid methods, the combination of text and citations improves the performance. For Open Case Book, the score summation Poincaré + fastTextLegal has the same MAP of 0.05 as fastTextLegal but a higher MRR of 0.746. The MRR of Poincaré + fastTextLegal is even higher than the MRR of its sub-methods Poincaré (0.629) and fastTextLegal (0.739) individually. The concatenation of Poincaré ∥ fastTextLegal is with 0.035 MAP the best method on Wikisource. Using citations as training signal as in Sentence-Legal-AUEB-BERT also improves the performance but not as much as concatenation or summation. When comparing the three hybrid variations, score summation achieves overall the best results. In the case of Wikisource, the concatenation's scores are below its sub-methods, while summation has at least the best sub-method's score. Moreover, combining two text-based methods such as Longformer-large and fastTextLegal never improves its sub-methods.

⁷ Legal-JHU-BERT and Legal-AUEB-BERT are only available as base version.
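For reference, a small sketch of how MRR and MAP over the top k = 5 recommendations can be computed from the binary relevance labels of the silver standards. This follows the standard definitions [29] and is not the authors' evaluation code; `recs_per_seed` and `relevant_per_seed` are hypothetical mappings from seed decision to its ranked recommendations and relevant set.

    def reciprocal_rank(recommendations, relevant):
        for rank, doc in enumerate(recommendations, start=1):
            if doc in relevant:
                return 1.0 / rank
        return 0.0

    def average_precision(recommendations, relevant):
        hits, score = 0, 0.0
        for rank, doc in enumerate(recommendations, start=1):
            if doc in relevant:
                hits += 1
                score += hits / rank
        denom = min(len(relevant), len(recommendations))
        return score / denom if denom else 0.0

    def mean_metric(metric, recs_per_seed, relevant_per_seed):
        scores = [metric(recs_per_seed[q], relevant_per_seed[q]) for q in recs_per_seed]
        return sum(scores) / len(scores)  # MRR or MAP over the query set Q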


4.1.2 Document Length. The effect of the document length on the performance in terms of MAP is displayed in Figure 1. We group the seed documents into eight equal-sized buckets (each bucket represents an equal number of documents) depending on the word count in the document text to make the two datasets comparable. Both datasets, Open Case Book and Wikisource, present a similar outcome. The MAP increases as the word count increases. Table 2 presents the average over all documents and, therefore, the overall best method is not equal to the best method in some subsets. For instance, Paragraph Vectors achieve the best results for several buckets, e.g., 4772-6172 words in Open Case Book or 6083-8659 words in Wikisource. The text limitation of fastTextLegal (4096 tokens) in comparison to fastText is also clearly visible. The performance difference between the two methods increases as the document length increases. For the first buckets with less than 4096 words, e.g., 187-2327 words in Open Case Book, one could expect no difference since the limitation does not affect the seed documents in these buckets. However, we observe a difference since target documents are not grouped into the same buckets. Remarkably, the performance difference for very long documents is less substantial. When comparing Longformer-large and Legal-AUEB-BERT, we also see an opposing performance shift with changing word count. While Legal-AUEB-BERT's scores are relatively stable throughout all buckets, Longformer depends more on the document length. On the one hand, Longformer performs worse than Legal-AUEB-BERT for short documents, i.e., 187-2327 words in Open Case Book, and 31-1777 words in Wikisource. On the other hand, for documents with more words, Longformer mostly outperforms Legal-AUEB-BERT by a large margin. The citation-based method Poincaré is also affected by the document length. However, this effect is due to a positive correlation between word count and citation count.

[Figure 1: bar charts of MAP per text-length bucket (word count of the seed document, 8 equal-sized buckets) for Open Case Book and Wikisource; plotted methods: Paragraph Vectors, fastText, fastTextLegal, fastTextLegal-4096, Legal-AUEB-BERT-base, Longformer-large, Poincaré, Poincaré + fastTextLegal.]
Figure 1: MAP wrt. words in the seed document of Open Case Book (top) and Wikisource (bottom). The more words, the better the results, with no peak at medium length. fastTextLegal outperforms Legal-BERT and Longformer for short documents.

4.1.3 Citation Count. Figure 2 shows the effect of the number of in- and out-citations (i.e., edges in the citation graph) on the MAP score. The citation analysis for Wikisource confirms the word count analysis. More data leads to better results. Instead, for Open Case Book, the performance of the citation-based methods peaks for 31-51 citations and even decreases at 67-89 citations. When comparing Poincaré and Walklets there is no superior method and no dependency pattern is visible. The performance effect on DeepWalk is more substantial. The number of citations must be above a certain threshold to allow DeepWalk to achieve competitive results. For Open Case Book, the threshold is at 51-67 citations, and for Wikisource, it is at 30-50 citations. Figure 2 also shows the on average higher MAP of Poincaré + fastTextLegal in comparison to the other approaches. Citation-based methods require citations to work, whereas text methods do not have this limitation. When no citations are available, citation-based methods cannot recommend any documents, whereas the text methods still work (see 0-14 citations for Open Case Book).

Our citation-based methods use only a fraction of the original citation data, 70,865 citations in Open Case Book, and 331,498 citations in Wikisource, because of the limitation to the documents available in the silver standards. For comparison, the most-cited decision from CourtListener (the underlying corpus of Wikisource) has 88,940 citations, whereas in the experimental data of Wikisource the maximum number of in- and out-citations is 386. As a result, we expect the citation-based methods, especially DeepWalk, to work even better when applied on the full corpus.


[Figure 2: bar charts of MAP per citation-count bucket (in- and out-citations, 8 equal-sized buckets) for Open Case Book and Wikisource; plotted methods: Paragraph Vectors, fastTextLegal, Longformer-large, BoostNE, DeepWalk, Poincaré, Walklets, Poincaré + fastTextLegal.]
Figure 2: MAP scores wrt. citation count for Open Case Book (top) and Wikisource (bottom). Among citation-based methods, Poincaré and Walklets perform on average the best, while DeepWalk outperforms them only for Wikisource and when more than 82 citations are available (rightmost bucket).

4.1.4 Coverage and Similarity of Recommendations. In addition to the accuracy-oriented metrics, Table 2 also reports the coverage of the recommendation methods. A recommender system for an expert audience should not focus on a small set of most-popular items but rather provide a high coverage of the whole item collection. However, coverage alone does not account for relevancy and, therefore, it must be contextualized with other metrics, e.g., MAP.

Overall, two citation-based methods yield the highest coverage for both datasets, i.e., Poincaré for Open Case Book and DeepWalk for Wikisource. In particular, Poincaré has not only a high coverage but also high MAP scores. Yet, the numbers do not indicate that citation-based methods have generally a higher coverage, since the text-based Paragraph Vectors or Longformer-base also achieve a considerably high coverage. The lowest coverage has by far the TF-IDF baseline. Notably, the hybrid methods with concatenation and summation have a different effect on the coverage than on the accuracy metrics. While the hybrid methods generally yield a higher MAP, their coverage is lower compared to their sub-methods. Only the Sentence-Legal-AUEB-BERT-base yields a higher coverage compared to Legal-AUEB-BERT-base.

Figure 3 (reconstructed as a table): Jaccard index for similarity or diversity of two recommendation sets (average over all seeds from the two datasets). Columns follow the same order as the rows.

                                      (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)   (10)  (11)
    (1)  TF-IDF                       1.00  0.17  0.15  0.16  0.10  0.04  0.06  0.06  0.06  0.11  0.13
    (2)  GloVeLegal                   0.17  1.00  0.40  0.67  0.27  0.08  0.09  0.12  0.11  0.23  0.52
    (3)  fastText                     0.15  0.40  1.00  0.41  0.21  0.07  0.07  0.10  0.09  0.18  0.33
    (4)  fastTextLegal                0.16  0.67  0.41  1.00  0.28  0.09  0.09  0.13  0.11  0.24  0.76
    (5)  Paragraph Vectors            0.10  0.27  0.21  0.28  1.00  0.09  0.09  0.13  0.12  0.19  0.24
    (6)  Legal-AUEB-BERT-base         0.04  0.08  0.07  0.09  0.09  1.00  0.04  0.06  0.05  0.07  0.08
    (7)  DeepWalk                     0.06  0.09  0.07  0.09  0.09  0.04  1.00  0.20  0.14  0.14  0.12
    (8)  Walklets                     0.06  0.12  0.10  0.13  0.13  0.06  0.20  1.00  0.32  0.27  0.18
    (9)  Poincaré                     0.06  0.11  0.09  0.11  0.12  0.05  0.14  0.32  1.00  0.39  0.32
    (10) Poincaré ∥ fastTextLegal     0.11  0.23  0.18  0.24  0.19  0.07  0.14  0.27  0.39  1.00  0.23
    (11) Poincaré + fastTextLegal     0.13  0.52  0.33  0.76  0.24  0.08  0.12  0.18  0.32  0.23  1.00

Besides the coverage, we also analyze the similarity or diversity of the recommendations between two methods. Figure 3 shows the similarity measured as Jaccard index for selected methods. Method pairs with J(a, b) = 1 have identical recommendations, whereas J(a, b) = 0 means no common recommendations. Generally speaking, the similarity of all method pairs is considerably low (J < 0.8). The highest similarity can be found between a hybrid method and one of its sub-methods, e.g., Poincaré + fastTextLegal and fastTextLegal with J = 0.76. Apart from that, substantial similarity can only be found between pairs from the same category. For example, the pair of the two text-based methods GloVeLegal and fastTextLegal yields J = 0.67. Citation-based methods tend to have a lower similarity compared to the text-based methods, whereby the highest Jaccard index between two citation-based methods is achieved for Walklets and Poincaré with J = 0.32. Like the coverage metric, the Jaccard index should be considered in relation to the accuracy results. GloVeLegal and fastTextLegal yield equally high MAP scores, while also having a high recommendation similarity. In contrast, the MAP for Wikisource from fastTextLegal and Poincaré is equally high, too. However, their recommendation similarity is low (J = 0.11). Consequently, fastTextLegal and Poincaré provide relevant recommendations that are diverse from each other. This explains the good performance of their hybrid combination.
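A sketch of the Jaccard comparison from Equation 2 and Figure 3, averaged over seeds. The mappings `recs_fasttext` and `recs_poincare` are hypothetical examples of two methods' top-5 recommendation sets per seed decision.

    def jaccard(set_a, set_b):
        a, b = set(set_a), set(set_b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def mean_jaccard(recs_a, recs_b):
        seeds = recs_a.keys() & recs_b.keys()
        return sum(jaccard(recs_a[s], recs_b[s]) for s in seeds) / len(seeds)

    # Two methods that agree on two of their five recommendations for one seed:
    recs_fasttext = {"seed-1": {"d1", "d2", "d3", "d4", "d5"}}
    recs_poincare = {"seed-1": {"d4", "d5", "d6", "d7", "d8"}}
    print(mean_jaccard(recs_fasttext, recs_poincare))  # 0.25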


4.2 Qualitative Evaluation

Due to the lack of openly available gold standards, we conduct our quantitative analysis using silver standards. Thus, we additionally conduct a qualitative evaluation with domain experts to estimate the quality of our silver standards.

Table 3 lists one of the randomly chosen seed decisions (Mugler v. Kansas⁸), and five recommended similar decisions each from fastTextLegal and Poincaré. In Mugler v. Kansas (1887), the court held that Kansas could constitutionally outlaw liquor sales, with constitutional issues raised on substantive due process (Fourteenth Amendment) and takings (Fifth Amendment). We provide a description of the cases and their relevance on GitHub⁹.

The sample verification indicates the usefulness of both text-based and citation-based methods and does not contradict our quantitative findings. Each of the recommendations has a legally important connection to the seed case (either the Fourteenth Amendment or the Fifth Amendment), although it is difficult to say whether the higher-ranked cases are more similar along an important topical dimension. The rankings do not appear to be driven by the facts presented in the case, as most of them have nothing to do with alcohol bans. Only Kidd v. Pearson (1888) is about liquor sales like the seed decision. The samples also do not reveal considerable differences between text- and citation-based similarity. With regards to the silver standards, the domain expert agrees in 14 of 20 cases (70%). In only two cases does the domain expert classify a recommendation as irrelevant despite it being classified as relevant in the silver standard.

Table 3: Examples from fastTextLegal and Poincaré (other methods are in the supplementary material) for Mugler v. Kansas with relevance annotations by the silver standards (S) and domain expert (D).

                     Open Case Book                                                   Wikisource
    #                Recommendations                              Year  S  D          Recommendations                        Year  S  D
    fastTextLegal
    1                Yick Wo v. Hopkins                           1886  N  N          Kidd v. Pearson                        1888  N  Y
    2                Munn v. Illinois                             1876  Y  Y          Lawton v. Steele                       1894  N  Y
    3                LS. Dealers' & Butchers' v. Crescent City LS. 1870 N  Y          Yick Wo v. Hopkins                     1886  N  N
    4                Butchers' Benevolent v. Crescent City LS.    1872  Y  Y          Geer v. Connecticut                    1896  N  Y
    5                Lochner v. New York                          1905  Y  Y          Groves v. Slaughter                    1841  Y  N
    Poincaré
    1                Yick Wo v. Hopkins                           1886  N  N          Rast v. Van Deman & Lewis Co.          1916  Y  N
    2                Allgeyer v. Louisiana                        1897  Y  Y          County of Mobile v. Kimball            1881  N  N
    3                Calder v. Wife                               1798  N  N          Brass v. North Dakota Ex Rel. Stoeser  1894  Y  Y
    4                Davidson v. New Orleans                      1877  Y  Y          Erie R. Co. v. Williams                1914  Y  Y
    5                Muller v. Oregon                             1908  Y  Y          Hall v. Geiger-Jones Co.               1917  Y  Y

⁸ https://www.courtlistener.com/opinion/92076/mugler-v-kansas/

5 DISCUSSION

Our experiments explore the applicability of the latest advances in research to the use case of legal literature recommendations. Existing studies on legal recommendations typically rely on small-scale user studies and are therefore limited in the number of approaches that they can evaluate (Section 2). For this study, we utilize relevance annotations from two publicly available sources, i.e., Open Case Book and Wikisource. These annotations not only enable us to evaluate the recommendations of 2,964 documents but also the comparison of in total 41 methods and their variations, of which 27 methods are presented in this paper.

Our extensive evaluation shows a large variance in the recommendation performance. Such a variance is known from other studies [4]. There is no single method that yields the highest scores across all metrics and all datasets. Despite that, fastTextLegal is on average the best of all 41 methods. fastTextLegal yields the highest MAP for Open Case Book, while for Wikisource only hybrid methods outperform fastTextLegal. Also, the coverage of fastTextLegal is considerably high for both datasets. Simultaneously, fastTextLegal is robust to corner cases since neither very short nor very long documents reduce fastTextLegal's performance substantially. These results confirm the findings from Arora et al. [1] that average word vectors are a "simple but tough-to-beat baseline". Regarding baselines, our TF-IDF baseline yields one of the worst results. In terms of accuracy metrics, only some Transformers are worse than TF-IDF, but especially TF-IDF's coverage is the lowest by a large margin. With a coverage below 50%, TF-IDF fails to provide diverse recommendations that are desirable for legal literature research.

The transfer of research advances to the legal domain is one aspect of our experiments. Thus, the performance of Transformers and citation embeddings is of particular interest. Despite the success of Transformers for many NLP tasks, Transformers yield on average the worst results for representing lengthy documents written in legal English. The other two method categories, word vector-based and citation-based methods, surpass Transformers.

The word vector-based methods achieve overall the best results among the non-hybrid methods. All word vectors with in-domain training, i.e., Paragraph Vectors, fastTextLegal, and GloVeLegal, perform similarly well with a minor advantage by fastTextLegal. Their similar performance aligns with the large overlap among their recommendations. Despite a small corpus of 65,635 documents, the in-domain training generally improves the performance as the gap between the out-of-domain fastText and fastTextLegal shows. Given that the training of custom word vectors is feasible on commodity hardware, in-domain training is advised. More significant than the gap between in- and out-of-domain word vectors is the effect of limited document lengths. For Open Case Book, the fastTextLegal variation limited to the first 512 tokens has only 52% of the MAP of the full-text method. For Wikisource, the performance decline exists as well but is less significant. This effect highlights the advantage of the word vector-based methods: they derive meaningful representations of documents with arbitrary length.


The evaluated Transformers cannot process documents of arbitrary length but are either limited to 512 or 4096 tokens. This limitation contributes to Transformers' low performance. For instance, Longformer-large's MAP is almost twice as high as BERT-large's MAP on Open Case Book. However, for Wikisource both models yield the same MAP scores. For Wikisource, the in-domain pretraining has a larger effect than the token limit since Legal-AUEB-BERT achieves the best results among the Transformers. Regarding the Transformer pretraining, the difference between Legal-JHU-BERT and Legal-AUEB-BERT shows the effect of two pretraining approaches. The corpora and the hyperparameter settings used during pretraining are crucial. Even though Legal-JHU-BERT was exclusively pretrained on the CAP corpus, which has a high overlap with Open Case Book, Legal-AUEB-BERT still outperforms Legal-JHU-BERT on Open Case Book. Given these findings, we expect the performance of Transformers could be improved by increasing the token limit beyond the 4096 tokens and by additional in-domain pretraining. Such improvements are technically possible but add significant computational effort. In contrast to word vectors, Transformers are not trained on commodity hardware but on GPUs. Especially long-sequence Transformers such as the Longformer require GPUs with large memory. Such hardware may not be available in production deployments. Moreover, the computational effort must be seen in relation to the other methods. Put differently, even fastTextLegal limited to 512 tokens outperforms all Transformers.

Concerning the citation embeddings, we consider Poincaré, closely followed by Walklets, as the best method. In particular, the two methods outperform the other citation methods even when only a few citations are available, which makes them attractive for legal research. Poincaré also provides the highest coverage for Open Case Book, emphasizing its quality for literature recommendations. For Wikisource, DeepWalk has the highest coverage despite yielding generally low accuracy scores. As Figure 2 shows, DeepWalk's MAP score improves substantially as the number of citations increases. Therefore, we expect that DeepWalk but also the other citation methods would perform even better when applied on a larger citation graph. The analysis of recommendation similarity also shows little overlap between the citation-based methods and the text-based methods (Figure 3). This indicates that the two approaches complement each other and motivates the use of hybrid methods.

Related work has already shown the benefit of hybrid methods for literature recommendations [6, 49]. Our experiments confirm these findings. The simple approaches of score summation or vector concatenation can improve the results. In particular, Poincaré + fastTextLegal never leads to a decline in performance. Instead, it increases the performance for corner cases in which one of the sub-methods performs poorly. Vector concatenation has mixed effects on the performance, e.g., a positive effect for Wikisource and a negative effect for Open Case Book. Using citations as training data in Sentence Transformers can also be considered as a hybrid method that improves the performance. However, this requires additional effort for training a new Sentence Transformer model.

As we discuss in Section 3.1, we consider Open Case Book and Wikisource more of silver than gold standards. With the qualitative evaluation, we mitigate the risk of misinterpreting the quantitative results, whereby we acknowledge our small sample size. The overall agreement with the domain expert is high. The expert tends to classify more recommendations as relevant than the silver standards, i.e., relevant recommendations are missed. This explains the relatively low recall from the quantitative evaluation. In a user study, we would expect only minor changes in the ranking of methods with similar scores, e.g., fastTextLegal and GloVeLegal. The category ranking would remain the same. The benefit of our silver standards is the number of available relevance annotations. The number of annotations in related user studies is, with up to 50 annotations, rather low. Instead, our silver standards provide a magnitude more relevance annotations. Almost 3,000 relevance annotations enable evaluations regarding text length, citation count, or other properties that would otherwise be magnitudes more difficult. Similarly, the user studies are difficult to reproduce as their data is mostly unavailable. This leads to reproducibility being an issue in recommender system research [4]. The open license of the silver standards allows the sharing of all evaluation data and, therefore, contributes to more reproducibility. In summary, the proposed datasets bring great value to the field, overcoming eventual shortcomings.

6 CONCLUSION

We present an extensive empirical evaluation of 27 document representation methods in the context of legal literature recommendations. In contrast to previous small-scale studies, we evaluate the methods over two document corpora containing 2,964 documents (1,601 from Open Case Book and 1,363 from Wikisource). We underpin our findings with a sample-based qualitative evaluation. Our analysis of the results reveals fastTextLegal (averaged fastText word vectors trained on our corpora) as the overall best performing method. Moreover, we find that all methods have a low overlap between their recommendations and are vulnerable to certain dataset characteristics like text length and number of citations available. To mitigate the weakness of single methods and to increase recommendation diversity, we propose hybrid methods like the score summation of fastTextLegal and Poincaré that outperforms all other methods on both datasets. Although there are limitations in the experimental evaluation due to the lack of openly available ground truth data, we are able to draw meaningful conclusions for the behavior of text-based and citation-based document embeddings in the context of legal document recommendation. Our source code, trained models, and datasets are openly available to encourage further research⁹.

ACKNOWLEDGMENTS

We would like to thank Christoph Alt, Till Blume, and the anonymous reviewers for their comments. The research presented in this article is funded by the German Federal Ministry of Education and Research (BMBF) through the project QURATOR (Unternehmen Region, Wachstumskern, no. 03WKDA1A) and by the project LYNX, which has received funding from the EU's Horizon 2020 research and innovation program under grant agreement no. 780602.

⁹ GitHub repository: https://github.com/malteos/legal-document-similarity

REFERENCES
[1] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In 5th International Conference on Learning Representations (ICLR 2017), Vol. 15. 416–424.


[2] Elliott Ash and Daniel L. Chen. 2018. Case Vectors: Spatial Representations of the Law Using Document Embeddings. SSRN Electronic Journal 11, 2017 (May 2018), 313–337. https://doi.org/10.2139/ssrn.3204926
[3] Xiaomei Bai, Mengyang Wang, Ivan Lee, Zhuo Yang, Xiangjie Kong, and Feng Xia. 2019. Scientific paper recommendation: A survey. IEEE Access 7 (2019), 9324–9339.
[4] Joeran Beel, Corinna Breitinger, Stefan Langer, Andreas Lommatzsch, and Bela Gipp. 2016. Towards reproducibility in recommender-systems research. User Modeling and User-Adapted Interaction (UMAI) 26 (2016).
[5] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. (2020). arXiv:2004.05150
[6] Paheli Bhattacharya, Kripabandhu Ghosh, Arindam Pal, and Saptarshi Ghosh. 2020. Methods for Computing Legal Document Similarity: A Comparative Study. (2020). arXiv:2004.12307
[7] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[8] Guido Boella, Luigi Di Caro, Llio Humphreys, Livio Robaldo, Piercarlo Rossi, and Leendert van der Torre. 2016. Eunomos, a legal document and knowledge management system for the Web to provide relevant, reliable and up-to-date information on the law. Artificial Intelligence and Law 24, 3 (2016), 245–283.
[9] Alexander Boer and Radboud Winkels. 2016. Making a cold start in legal recommendation: An experiment. Frontiers in Artificial Intelligence and Applications 294 (2016), 131–136. https://doi.org/10.3233/978-1-61499-726-9-131
[10] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[11] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. Proc. of EMNLP (2015), 632–642.
[12] Jane Bromley, J.W. Bentz, Leon Bottou, I. Guyon, Yann Lecun, C. Moore, Eduard Sackinger, and R. Shah. 1993. Signature verification using a Siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence 7, 4 (1993).
[28] Arpan Mandal, Raktim Chaki, Sarbajit Saha, Kripabandhu Ghosh, Arindam Pal, and Saptarshi Ghosh. 2017. Measuring Similarity among Legal Court Case Documents. In Proc. of Compute '17. 1–9.
[29] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to Information Retrieval. Vol. 16. Cambridge University Press, Cambridge. 100–103 pages. https://doi.org/10.1017/CBO9780511809071
[30] David Mellinkoff. 1963. The language of the law. Boston: Little Brown and Company (1963).
[31] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. (2013), 1–12. arXiv:1301.3781
[32] Akshay Minocha, Navjyoti Singh, and Arjit Srivastava. 2015. Finding Relevant Indian Judgments using Dispersion of Citation Network. In Proc. of WWW '15. ACM Press, New York, New York, USA, 1085–1088.
[33] Rohan Nanda, Giovanni Siragusa, Luigi Di Caro, Guido Boella, Lorenzo Grossio, Marco Gerbaudo, and Francesco Costamagna. 2019. Unsupervised and supervised text similarity systems for automated identification of national implementing measures of European directives. Artificial Intelligence and Law 27, 2 (2019), 199–225. https://doi.org/10.1007/s10506-018-9236-y
[34] Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems 2017-Decem, Nips (2017), 6339–6348. arXiv:1705.08039
[35] Malte Ostendorff, Till Blume, and Saskia Ostendorff. 2020. Towards an Open Platform for Legal Information. In Proc. of the ACM/IEEE Joint Conference on Digital Libraries in 2020. ACM, New York, NY, USA, 385–388.
[36] Malte Ostendorff, Peter Bourgonje, Maria Berger, Julian Moreno-Schneider, Georg Rehm, and Bela Gipp. 2019. Enriching BERT with Knowledge Graph Embeddings for Document Classification. In Proc. of the 15th Conference on Natural Language Processing (KONVENS 2019). GSCL, Erlangen, Germany, 305–312.
[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[38] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove:
[13] Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Global Vectors for Word Representation. In Proc. of the 2014 Conference on Em-
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual pirical Methods in Natural Language Processing (EMNLP). ACL, Stroudsburg, PA,
Focused Evaluation. In Proc. of the 11th International Workshop on Semantic USA, 1532–1543. https://doi.org/10.3115/v1/D14-1162
Evaluation (SemEval-2017). ACL, Vancouver, Canada, 1–14. [39] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: online learning
[14] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, of social representations. In Proc. of KDD ’14. ACM Press, New York, New York,
and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law USA, 701–710.
School. In Findings of the Association for Computational Linguistics: EMNLP 2020. [40] Bryan Perozzi, Vivek Kulkarni, Haochen Chen, and Steven Skiena. 2017. Don’t
ACL, Stroudsburg, PA, USA, 2898–2904. Walk, Skip!: Online Learning of Multi-scale Network Embeddings. In Proc. of the
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis
Pre-training of Deep Bidirectional Transformers for Language Understanding. In and Mining 2017. ACM, New York, NY, USA, 258–265.
Proc. of the 2019 Conf. of the NAACL. ACL, Minneapolis, Minnesota, 4171–4186. [41] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings
[16] Mouzhi Ge, Carla Delgado-Battenfeld, and Dietmar Jannach. 2010. Beyond using Siamese BERT-Networks. In The 2019 Conference on Empirical Methods in
accuracy: evaluating recommender systems by coverage and serendipity. In Proc. Natural Language Processing (EMNLP 2019). arXiv:1908.10084
of RecSys ’10. ACM Press, New York, New York, USA, 257. [42] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. 2020. An API Oriented
[17] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Open-source Python Framework for Unsupervised Learning on Graphs. (2020).
Networks. In Proc. of the 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery arXiv:2003.04819
and Data Mining - KDD ’16. ACM Press, New York, New York, USA, 855–864. [43] G. Salton, A. Wong, and C. S. Yang. 1975. Vector Space Model for Automatic
[18] Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2020. A Indexing. Information Retrieval and Language Processing. Commun. ACM 18, 11
dataset for statutory reasoning in tax law entailment and question answering. In (1975), 613–620.
Proc. of the 2020 Natural Legal Language Processing Workshop. 31–38. [44] Malte Schwarzer, Moritz Schubotz, Norman Meuschke, and Corinna Breitinger.
[19] Paul Jaccard. 1912. The Distribution of the Flora in the Alpine Zone. New 2016. Evaluating Link-based Recommendations for Wikipedia. Proc. of the 16th
Phytologist 11, 2 (feb 1912), 37–50. ACM/IEEE Joint Conference on Digital Libraries (JCDL‘16) (2016), 191–200.
[20] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag [45] Marc van Opijnen and Cristiana Santos. 2017. On the concept of relevance in
of Tricks for Efficient Text Classification. In Proc. of EACL 2017. ACL, Stroudsburg, legal information retrieval. Artificial Intelligence and Law 25, 1 (2017), 65–87.
PA, USA, 427–431. [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
[21] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and and I. Polosukhin. 2017. Attention Is All You Need. Advances in Neural Information
Marián Boguñá. 2010. Hyperbolic geometry of complex networks. Physical Processing Systems 30 (Jun 2017), 5998–6008.
Review E - Statistical, Nonlinear, and Soft Matter Physics 82, 3 (2010), 1–18. [47] Rupali S. Wagh and Deepa Anand. 2020. Legal document similarity: A multicri-
[22] Sushanta Kumar, P. Krishna Reddy, V. Balakista Reddy, and Aditya Singh. 2011. teria decision-making perspective. PeerJ Computer Science 2020, 3 (2020), 1–20.
Similarity analysis of legal judgments. Compute 2011 - 4th Annual ACM Bangalore https://doi.org/10.7717/peerj-cs.262
Conference (2011). https://doi.org/10.1145/1980422.1980439 [48] Lidan Wang, Ming Tan, and Jiawei Han. 2016. FastHybrid: A hybrid model for effi-
[23] Steven A. Lastres. 2013. Rebooting Legal Research in a Digital Age. https: cient answer selection. Proc. of the 26th International Conference on Computational
//www.lexisnexis.com/documents/pdf/20130806061418_large.pdf Linguistics (2016), 2378–2388.
[24] J. H. Lau and T. Baldwin. 2016. An Empirical Evaluation of doc2vec with Practical [49] Gineke Wiggers and Suzan Verberne. 2019. Citation Metrics for Legal Information
Insights into Document Embedding Generation. In Proc. Workshop on Representa- Retrieval Systems. In BIR@ECIR. 39–50.
tion Learning for NLP. https://doi.org/10.18653/v1/w16-1609 [50] Wikisource. 2020. United States Supreme Court decisions by topic.
[25] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences https://en.wikisource.org/wiki/Category:United_States_Supreme_Court_
and Documents. Int. Conf. on Machine Learning 32 (2014), 1188–1196. decisions_by_topic
[26] Jundong Li, Liang Wu, Ruocheng Guo, Chenghao Liu, and Huan Liu. 2019. Multi- [51] Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage
level network embedding with boosted low-rank matrix approximation. In Proc. Challenge Corpus for Sentence Understanding through Inference. (2018), 1112–
of the 2019 IEEE/ACM International Conference on Advances in Social Networks 1122. https://doi.org/10.18653/v1/n18-1101
Analysis and Mining. ACM, New York, NY, USA, 49–56. [52] Radboud Winkels, Alexander Boer, Bart Vredebregt, and Alexander Van Someren.
[27] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer 2014. Towards a Legal Recommender System. In Frontiers in Artificial Intelligence
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A and Applications, Vol. 271. 169–178.
Robustly Optimized BERT Pretraining Approach. (2019). arXiv:1907.11692

118
From Data to Information: Automating Data Science to Explore
the U.S. Court System
Andrew Paley Andong L. Li Zhao Harper Pack
andrewpaley@u.northwestern.edu andong@u.northwestern.edu harper.pack@northwestern.edu
Northwestern University Northwestern University Northwestern University

Sergio Servantez Rachel F. Adler Marko Sterbentz


servantez@u.northwestern.edu r-adler@neiu.edu marko.sterbentz@u.northwestern.edu
Northwestern University Northeastern Illinois University Northwestern University
Northwestern University

Adam Pah David Schwartz Cameron Barrie


a-pah@kellogg.northwestern.edu david.schwartz@law.northwestern.edu cameron.barrie@u.northwestern.edu
Northwestern University Northwestern University Northwestern University

Alexander Einarsson Kristian Hammond


aeinarsson@u.northwestern.edu Kristian.Hammond@northwestern.edu
Northwestern University Northwestern University

ABSTRACT CCS CONCEPTS


The U.S. court system is the nation’s arbiter of justice, tasked with • Information systems → Decision support systems; • Ap-
the responsibility of ensuring equal protection under the law. But plied computing → Law; • Computing methodologies → Nat-
hurdles to information access obscure the inner workings of the ural language processing; • Human-centered computing →
system, preventing stakeholders – from legal scholars to journalists Natural language interfaces.
and members of the public – from understanding the state of justice
in America at scale. There is an ongoing data access argument here: KEYWORDS
U.S. court records are public data and should be freely available. notebook interface, information extraction, data analytics, natural
But open data arguments represent a half-measure; what we really language processing, visualization
need is open information. This distinction marks the difference
between downloading a zip file containing a quarter-million case ACM Reference Format:
Andrew Paley, Andong L. Li Zhao, Harper Pack, Sergio Servantez, Rachel
dockets and getting the real-time answer to a question like “Are pro
F. Adler, Marko Sterbentz, Adam Pah, David Schwartz, Cameron Barrie,
se parties more or less likely to receive fee waivers?” To help bridge
Alexander Einarsson, and Kristian Hammond. 2021. From Data to Informa-
that gap, we introduce a novel platform and user experience that tion: Automating Data Science to Explore the U.S. Court System. In Eigh-
provides users with the tools necessary to explore data and drive teenth International Conference for Artificial Intelligence and Law (ICAIL’21),
analysis via natural language statements. Our approach leverages an June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages.
ontology configuration that adds domain-relevant data semantics https://doi.org/10.1145/3462757.3466100
to database schemas to provide support for user guidance and for
search and analysis without user-entered code or SQL. The system 1 INTRODUCTION
is embodied in a “natural-language notebook” user experience,
In the United States, the federal judicial system serves as a vital
and we apply this approach to the space of case docket data from
umpire, ideally ensuring equal protection and justice under the law.
the U.S. federal court system. Additionally, we provide detail on
Its mechanics are a tapestry of countless unique decisions made
the collection, ingestion and processing of the dockets themselves,
by individuals across 94 district courts, 13 circuit courts and the
including early experiments in the use of language modeling for
Supreme Court. While this system is at the core of the management
docket entry classification with an initial focus on motions.
of justice in the United States, its operation is essentially inaccessi-
ble. Distributed decision making, inaccessible data, and the general
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed public’s lack of technical skills sufficient to analyze data mean that
for profit or commercial advantage and that copies bear this notice and the full citation the actual mechanics of U.S. justice are largely obscured. As citi-
on the first page. Copyrights for components of this work owned by others than the zens, we trust our laws are being enforced equally but – absent the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission occasional headline-grabbing case – it’s all opaque to the majority
and/or a fee. Request permissions from permissions@acm.org. of us.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil One issue is that data access is prohibitively expensive. Public
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 Court records are available only through a paywall called the Public
https://doi.org/10.1145/3462757.3466100 Access to Court Electronic Records (PACER) system. Other research

119
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paley et al.

has explored such limitations on access [1, 35], as well as the ques- They felt they were limited by the tools they were currently using
tionable completeness of the data available [30, 32]. Legislation to and wanted to ask questions of the data that they weren’t able to.
eliminate the PACER fees is progressing through Congress – one To help bridge that gap, we introduce a novel platform and
step towards opening the courts to public scrutiny and understand- user experience that provides users with the tools necessary to ex-
ing. plore data and drive analysis via natural language statements. Our
But making court records free won’t eliminate all barriers to approach leverages an ontology configuration that adds domain-
access. Many open-government initiatives in the U.S. and abroad relevant data semantics to a database schema for the sake of sup-
have yielded a growing array of public datasets [2, 24], and work to- porting search and analysis without user-entered code or SQL. This
wards data transparency is an ongoing effort [14]. However, while configuration allows us to abstract away the underlying schema
access to data is necessary, it’s insufficient: the applied value of complexities from user concern, understand what filters and analy-
that data to the end goal of increased public understanding – of sis are possible and domain-relevant, infer relevant analytics from
access to information – remains stymied by the limited analytical the data semantics, and provide guided outcomes during both search
skills and resources of the majority of those afforded that data. A and analysis.
survey detailed in [47] found in part that while citizens acknowl- The associated notebook-style experience is an early embodi-
edge and appreciate moves towards open data, most don’t know ment of a new form of human-data, or human-information, interface
people in their social circles who take advantage of it. Further, the – a user experience imbued with a set of assistive capabilities where
authors note “most open data released by the government is avail- interactions happen in natural language rather than code. The sys-
able in the raw format, which restricts its understandability by all tem also generates responses in modalities intuitively appropriate
people” and that “this data is mostly usable by experts with some to the nature of the analysis results – from text to various types of
technical knowledge to interpret and develop applications” [47]. visualizations.
Separately, in a case study of Data.gov, [23] argue that open data
“generates its value when it is not only available and accessible but 1.2 The Data Scientist/Data Interaction
also made sense by its users to solve problems” and conclude that
The second set of requirements mirrors the data scientist/data inter-
“public agencies should invest in new technologies and craft new
action: the wrangling of data into coherent and controlled schemas
data management techniques to make data readily accessible to
through various modes of ETL (extract, transform, load), text extrac-
users...providing real-time analysis and updates.”
tion, data cleaning, and the more complicated arenas of machine
To date, the bridge between raw data and meaningful infor-
learning and language modeling.
mation has generally been built ad-hoc and on-demand by data
To support explorations of the U.S. court system, this includes
scientists, but that resource-intensive approach doesn’t scale when
the structuring and harmonization of court records, with the initial
considering the information needs of a broader subset of the public.
focus here on a snapshot of roughly 270,000 case dockets. This
And, in the space of the legal system, even questions as simple as,
involves consultation with domain experts; the definition of a com-
“Are there differences in how judges handle fee waiver requests?” or
plex schema (across 30 tables ranging in size from two to thirty-one
“Is there any correlation between a judge’s tenure and the length of
columns); a pipeline to extract, transform and harmonize the un-
cases they oversee?” are impossible to answer without significant
structured and semi-structured components of dockets; the integra-
data expertise or the resources to pay for it. Clearly, open data ac-
tion of additional datasets to expand the information space (starting
cess isn’t enough; we need a mechanism to access the information
with background information on federal judges); the creation of a
contained within.
novel dataset for training language models for classification tasks
To build that mechanism, in essence, is to automate work that
(initially for classifying various types of motions within the scope of
would be done by a data scientist to extract information. Thus, we
a case); and the model training/fine-tuning and validation process
endeavor to outline what the data scientist’s role entails and identify
in pursuit of proving the utility of framing motion type detection
those functions as requirement sets for building the platform.
as a classification task.

1.1 The Domain Expert/Data Scientist 1.3 Automating Data Science


Interaction In sum, those two tracks build to our end goal: to democratize access
One set of requirements mirrors the domain expert/data scientist to information associated with the U.S. court system, eliminating
interaction: the ability to understand the user’s intent and trans- barriers to access and understanding, and providing journalists, le-
late that into queries and analysis and to provide guidance and gal scholars, lawyers, government officials, social justice advocates
guardrails around what’s possible given a dataset – and then to and others with relevant information derived from data about the
translate the results of analysis back to users in a way that is intel- mechanics of the federal courts.
ligible to them. We first detail our primary novel contribution – the platform
To help us better understand and frame the set of potential for information exploration and analysis, and the natural language
users in the space of the U.S. court system, we conducted 28 sets of notebook frontend – and then provide detail about the ETL and data
interviews with a total of 38 people (25 male and 13 female). Some of enrichment processes that serve as a backdrop for the search and
those interviewed included faculty in law, sociology, and economics; analyses this instantiation supports. We engage in user testing and
lawyers; and journalists. Participants generally reported wanting report preliminary results as well as explore how our platform’s
to answer advanced questions beyond their analytical capabilities. capabilities map to existing data science approaches through a case

120
From Data to Information: Automating Data Science to Explore the U.S. Court System ICAIL’21, June 21–25, 2021, São Paulo, Brazil

study. Our discussion elaborates on the goals of our work, including forebear, IPython [36]. Automated visualization is a related area of
challenges to be addressed. research [31, 50] focusing primarily on presentation layers for a
Our approach to court docket search and analysis is one early given dataset rather than intent-driven question-answering.
step in the development of an open-source platform aimed at de-
mocratizing access to information. In discussion of future work, 3 THE NATURAL-LANGUAGE NOTEBOOK
we outline dual and distinct tracks: the first aimed at continuing to
Notebook-style interfaces are a standard part of the modern data
build and augment our U.S. court records database, and the second science toolkit, and for good reason: they support a logical process
focused on the ongoing development of the core platform. On the flow and marry exploratory and presentation layers in one cohesive
platform side, we point to a future in which additional data can experience. However, they are the tools of experts – users who
be brought in by technical users who manage data wrangling and know how to code, run analysis, interpret stack traces and explore
define data semantics – the steps we now think of as getting to complex results. They bring order in the form of scaffolding, but
“open data,” but with a newly imagined purpose – and our system remain largely agnostic about content or the specifics of a particular
scales to new domains, communities, and geographies. dataset or domain.
We borrow from that scaffolding, but our system leverages sim-
2 RELATED WORKS plified data filtering mechanisms and natural language statements.
And where other notebooks display a variety of outputs (defined by
Reducing the costs associated with PACER has been pursued as a the near-infinite space of possibility supported by arbitrary code),
way to achieve judicial transparency. However, studies have shown our system outputs natural language and annotated visualization
the limits of open data in providing greater transparency [44]. as a means of conveying information.
Notably, problems persist across many user personas, from citi- This approach maintains the intuitive flow of the notebook user
zens to data scientists, government agents, and even academics experience but brings its power to people unfamiliar with program-
[7, 18, 19, 23]. We aim to address a subset of these challenges – ming. Our notebooks are domain- and dataset-aware, and the user
pertaining to data utility and barriers to information access – by experience speaks the language not of the data scientist, but of a
applying automated analytical and visualization capabilities on top user reasonably fluent in the domain. Further, they provide assis-
of the data. tive mechanisms to surface what the system knows it is capable of,
Much research has focused on automating legal processes [28],
guiding even novice users to understand the range of capabilities
predicting outcomes [5], or assessing the value of AI for the two
available to them.
former areas [13, 45]. There have been some recent developments An exploration of the current iteration of interface mechanisms
in legal question-answering (QA) systems [11, 21]. However, these and output capabilities can be found in Figure 1 and Figure 2, and a
have had limited data analytics capabilities [22] and often rely deeper discussion of the paradigm follows.
on simple data retrieval for generating answers [33]. While some
commercial tools support exploration of court documents they are
prohibitively expensive, and limited in terms of scope and consis-
3.1 The UX Paradigm: “Search First, Then
tency of results [1]. Converse”
More broadly, general QA systems have been the subject of re- Our approach separates concerns between search (winnowing the
search for decades [16, 40, 46] and are some of the most prominent available dataset to a space of interest) and converse (the user in-
examples of AI systems [12]. There has been significant progress putting statements that drive analysis upon the filtered dataset and
in neural QA systems [10], with transformer-based models [27] the generation of responses). This approach embodies the strengths
achieving state-of-the-art results on benchmark tasks [39]. How- of the notebook format in focusing on one task at a time and pre-
ever, these QA systems are best suited for unstructured text data senting interstitial output as feedback. Further, this has the indi-
where the answer is plainly stated in the corpus itself, unlike our rect effect of separating concerns on the backend, supporting a
system which can infer or derive the answer through follow-on generalizable approach to the specification of filter and analysis
analysis. Other approaches aim to understand and decompose the configuration.
structure of complex questions into discrete parts as a plan for As depicted in Figure 1, each exploration in our notebook inter-
deriving an answer [49]; however, the representation is high-level face starts from a “search” (or “filter”) panel: a paginated view of the
and distinct from our approach which constructs runnable queries dataset that matches the current set of filters. The primary entities
against a given datasource. presented in this view are court cases in the Northern District of
Extensive research has parsed natural language queries into SQL Illinois. On initialization, the user is presented with the full space
queries [20], using techniques from deep learning [17], rules-based of available data absent any applied filters and can opt to apply
methods [42], or a mixture of both [43]. Instead, our approach filters or skip right to adding analyses of the full dataset.
automatically generates the space of possible analysis from an Below that is the partitioned “converse” step, where the user can
ontology configuration, and then translates the underlying analysis enter natural-language statements that drive analysis (Figure 2).
plans to natural language, drawing inspiration from prior work Of note, users can enter multiple analysis statements against one
[37, 41]. data view, stepping through a set of questions while maintaining a
Beyond current work in information retrieval via conversational thread of prior exploration.
systems, our approach utilizes a notebook-style interface, with This paradigm means our system does not have to manage state-
inspiration coming from Jupyter notebooks [38] as well as their ments like “Average case duration grouped by judge tenure for cases

121
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paley et al.

Figure 1: The primary interface for search/filtering within


our notebook format. Annotations: 1) Notebook title, 2) Ap- Figure 2: The primary interface for running analysis within
plied filter example, 3) Mechanism for adding filters, 4) Re- our notebook. Annotations: 1) Applied filter example, 2) Col-
sults interface with sortable columns, 5) Pagination and full lapsed results panel (seen in detail in Figure 1), 3) Analy-
results space, 6) Download button for raw data access, 7) sis statement, 4) Mechanism for removing previous analysis
Mechanism for adding analysis output, 5) Result of analysis (basic NLG type to deliver sin-
gle value), 6) Second analysis statement, 7) Result of analysis
(interactive line chart with rollover states to display change
over time, including legend with terms of art and associated
that occurred in the Northern District of Illinois since February definition), 8) Mechanism for adding additional analysis
2015 and involved property rights,” (or require the user to repeat
parts of this cumbersome statement for additional analysis), but
instead simple, widely applicable statements like the “average case
duration grouped by judge tenure” in the context of a previously
filtered set of data. Thus, given an ontology and available analyt-
ics, the space of analysis statements is finite, but is made virtually
infinite by the possibility of applying them to any slice of the data.

3.2 Mechanisms and Underlying Configuration


Necessary elements are abstracted out of the platform’s core search
and analysis engines as well as the user experience framework, API
mechanics, and proactive caching and pre-fetching mechanisms.
The system requires only a pointer to an SQL database, an object-
relational mapping (ORM) defined in the open-source SQLAlchemy
library [34], and an ontology configuration that references that Figure 3: A simplified architecture diagram to illustrate the
ORM in order to provide all functionality, from generating the relationship between the different components of the plat-
available filters and analysis statements to building queries and form and the dataset-specific Ontology Configuration and
running analytics based on user input. These capabilities mean ORM
that additions to the underlying data schema or the onboarding of
complementary datasets can be made available through the plat-
form with only a small addendum to the already-required work 3.2.1 Search/Filter. As depicted in Figure 1, filters are applied by 1)
of data management: the creation of or updates to the ontology adding them to the filter bar above a given data view, 2) selecting a
configuration. See Figure 3 for a high-level architecture; specifics filter type via the dropdown, and 3) entering values in the associated
about the configuration follow for both the search and analysis input. The set of additive filters that can be applied to winnow down
components. the list of case dockets includes: district, circuit, case name, cause

122
From Data to Information: Automating Data Science to Explore the U.S. Court System ICAIL’21, June 21–25, 2021, São Paulo, Brazil

of action, case status, filing date, nature of suit, party name, judge
name, attorney name, as well as free text search in the docket entries
associated with the case. Ultimately, users can make a few targeted
selections and fill in a few inputs to get to searches equivalent to
“all cases in the Northern District of Illinois between 2015 and 2017
where Kennelly served as judge” or “all cases with nature of suit
property rights where one of the parties is Apple” – the SQL query
versions of which only a fraction of those users could generate
themselves.
The focus on case dockets being the “primary” searchable unit (as
opposed to judges, parties, attorneys, etc.) and all associated filters
are entirely configuration-driven and distinct from complexities
of the underlying schema. The ontology config maps the machine
representation to a user-friendly set of names and attends to the
scaffolding of ids, foreign keys and joins. Thus, the filterable fields
are a subset of those that exist at the schema level on various Figure 5: Two examples of the primary interface for adding
tables that join against the case table, and in some cases (such as analysis statements. Annotations: 1) The user-entered state-
Judge Name, depicted in Figure 4) actually span multiple fields in ment, having been auto-completed progressively via gener-
the schema (first_name, middle_name, last_name). The key point ated statement candidates, 2) Candidate matches given the
here is that the user does not need to consider the schema but previously auto-completed statement (“Average Fee Waiver
simply makes decisions about domain-relevant ways to search with Grant Rate Year-Over-Year”) and the subsequently user-
guidance from the system about the relevant search space (and the appended “grouped by,” 3) A user-entered string that hasn’t
system then generates runnable queries of various types, including yet been auto-completed, demonstrating fuzzy string match-
string matching and range finding, such as with dates). For domain ing, 4) A set of fuzzily matched results
expert users, our approach is a significant convenience over having
to learn or write SQL, and, for less knowledgeable users, it also
serves as guidance about relevance in the domain. in Figure 5, this is realized on the UX as a fuzzy (i.e., approximate
string matching) search across a set of natural language statements,
1 {... , each of which is generated dynamically by the system through
2 " judgeName " : { inferring relevant analysis possibilities based on the underlying
3 " nicename " : " Judge Name " ontology configuration and a core model of analysis types. Because
4 " type ": " text " , the system is inferring and defining the analysis space based on the
5 " allowMultiple " : True , ontology components, each generated statement corresponds to an
6 " autocomplete " : acs . getJudges , underlying plan representation that is interpretable by the analysis
7 " model " : [ db . JudgeOnCase , db . Judge ] , engine for the sake of generating queries and running analytics on
8 " fromTargetModel " : [ " judges " , " judge " ] , any set of filters.
9 " fields " : [" first_name " , " middle_name " , "
As users select and add additional analysis statements to the
last_name " ] ,
10 },
notebook, the system responds with answers in the form of text
11 ... } and visualizations, as depicted in Figure 2. As per standard notebook
mechanics, each analysis statement is tied to the active filtered set
in the panel above such that changes to the filters (and thus the slice
Figure 4: The config entry for the “Judge Name” entity for of the data presented) will flow through and update each linked
search/filter capabilities and the results view. 1) “nicename” result.
is the user-facing name of this entity type, 2) “type” and "al- The core platform’s model of analysis includes a growing set
lowMultiple" inform the input style and query generation of available operations (e.g., average), as well as specifications on
mechanisms, 3) “autocomplete” maps to a method on the how the operation ought to be performed (e.g., can only be done
autocomplete class (can be default or a plugin) and powers on numeric fields, how many fields are needed). As seen in Figure
the autocomplete API endpoint, 4) “model” and “fromTar- 6, the ontology configuration then defines the fields relevant for
getModel” map the model join and relationship feature path analysis and their user-friendly names, as well as their attributes
from the db.Case table at the ORM level, 5) “fields” defines (e.g., semantic type, possible transformations into other data types,
the field(s) this entity’s name/id maps to (affording support relevant units, and – in the case of discrete entities delimited by id
for multi-field queries) – how to generate their user-friendly names) and their relationship
to the primary model.
To illustrate the generation of the analysis space, in the instanti-
3.2.2 Analysis Statements and Query Generation. Once the user ation referenced in our figures, an analysis configuration that lists
has arrived at filtered data they are interested in, they can add ten relevant features for analysis (e.g., Judge Tenure, Nature of Suit,
multiple analysis statements below the dataview panel. As seen Case Duration, Fee Waiver Grant Status) alongside the metadata

123
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paley et al.

1 {... , 4 THE DATA PROCESSING PIPELINE


2 " caseDuration " : {
PACER, the official source for federal judicial records, houses a
3 " model " : db . Case ,
4 " field " : " case_duration " , variety of document types and charges a per-page fee for access.
5 " type ": " float " , Following the recommendations set forth in [35], we focused on
6 " name ": [ " Case Duration " , " Case Durations " the docket reports, “essentially a lawsuit’s table of contents.” [35]
], identified efforts to improve accessibility of these docket reports as
7 " unit ": [ " day " , " days "] , the “most impactful” work to be done in building a more open justice
8 }, system. Thus, using docket reports as our primary data source, we
9 ... } designed a 30-table database schema (plus relevant join tables) to
represent them and all relevant entities (e.g., judges, attorneys,
defendants, and districts) as well as key fields (e.g., nature of suit,
Figure 6: The config entry for the “Case Duration” feature
date of filing, and docket entries).
as an analysis target. 1) “model” and “field” define where in
Our initial dataset captures samples of both depth (ten years
the schema the relevant field(s) exist, 2) “type” informs the
of docket reports from Northern Illinois district courts from 2007
available analyses for that field, 3) “name” and “unit” define
to 2016) and breadth (docket reports from every district court in
the singular and plural forms for user interaction/presenta-
2016). In total, our sample draws from more than a quarter-million
tion.
case dockets in HTML format acquired through purchase and batch
downloading from PACER. Taking advantage of their semi-regular
structure, we parsed the files into meaningful sections (e.g., the
docket header) and extracted information. While most informa-
associated with each is sufficient to generate 120 different possible tion we extracted was listed explicitly on the docket report (e.g.,
analyses, each of which can be applied to any filtered view of the case title), we also captured implicit (e.g., case duration) and in-
data. Augmenting this list of generated analysis statements requires terpreted (e.g., party acting as their own attorney) information.
simply adding new elements to the configuration or augmenting In this latter case, we relied on guidance from domain experts to
the core system’s analytics library and tying new analytics to data analyze both docket text (e.g., recognizing “Pro Se” designations
or semantic types. for attorneys) and docket contents (e.g., verifying instances of self-
To compute an analytics statement at runtime, our system steps representation where a party has only one attorney, who is also the
through the process of building an analysis chain and SQL queries party). Such interpretation often required triangulation between
based on the filters and the statement’s underlying plan. The steps multiple approaches (e.g., a “Pro Se” designation alone is insufficient
to do so are: 1) Do any necessary filtering (as specified by the for classifying self-representation; one must also count and check
search context the given analysis statement is run in); 2) Query the attorneys).
the necessary fields from the analytics statement; 3) If needed,
transform data to ensure data type compatibility; 4) Perform the 5 FURTHER DATA ENRICHMENT
necessary operations and grouping; and 5) Format the results based
To garner a more complete data-level representation of the mechan-
on the nature of the information to be conveyed.
ics of the judicial system, we sought to enrich the initial core docket
For instance, if we wanted to know the "Average Fee Waiver
dataset in two distinct ways: blending additional sources and the
Grant Rate Grouped by District" for cases where the Nature of Suit
use of language modeling for classification tasks.
is “Property Rights,” the steps would be: 1) We filter to get only
cases where "Property Rights" is listed in the nature of suit field
5.1 Additional Sources
(leveraging the ontology mapping to the schema); 2) We query the
fields associated with Fee Waiver Grant and Court District (again We blended additional data into the core database, serving both
leveraging the ontology mapping); 3) Since we are taking an average as supplemental fodder for search and analysis and in support of
rate and Fee Waiver Grant is stored as booleans, we convert it into initial forms of entity disambiguation. We leveraged the Federal
integers; 4) We compute the average rate of Fee Waiver Grants Judicial Center’s database of appointed federal judges [6], which
for each Court District; 5) We convert the internal database Court includes birthdate, gender, race/ethnicity, history of appointments,
District id into human-readable labels leveraging the name and appointing parties, education, and professional career. We normal-
units information from the ontology config. ized the data into a multi-table schema, linked it to the extracted
Finally, the system output – the result of running the analysis representations of judges, and then leveraged that join to expand
statements above – is delivered in a form best-suited to conveying the space of available analysis. For example, based on the judge’s
the nature of the information for each result (and includes a de- appointment date and the start of a given case, we derived how
scription of the filters applied for added clarity). This is keyed off of long a judge was on the bench prior to the start of that case, and
features of the results themselves. As depicted in Figure 2, running then used that “judge tenure” as a metric in subsequent analysis
“Average Case Duration” – a bit of analysis that will yield a single (e.g., to derive “fee waiver grant rate grouped by judge tenure”).
value – results in the system rendering the results via basic natural
language generation mechanics. However, when looking at some- 5.2 Classification Tasks
thing “Year-Over-Year,” the system pivots into change-over-time We identified a variety of information targets that we believe can
behavior, leading to the generation of an interactive line chart. be culled from the unstructured text components of dockets by

124
From Data to Information: Automating Data Science to Explore the U.S. Court System ICAIL’21, June 21–25, 2021, São Paulo, Brazil

reframing them as classification tasks. For context, the main body the transformer models, the training accuracy is slightly higher
of a docket is a series of time-stamped text entries, each marking than the validation/test accuracy for both models, but we believe
events in the arc of a given case. These text-snippet representations this margin is reasonable given the small size of the dataset. To
contain various sorts of useful information, including motions (ef- reduce the likelihood of overfitting in the future, we continue to
fectively discrete requests for a judicial decision), the outcome of grow the tagged motion dataset.
a given motion, changes of representation or venue or presiding
judge, references to evidence or testimony, eventual outcomes, and 6 EVALUATION
so on. Being able to identify and classify such information would We evaluated our system’s effectiveness in handling both search
prove highly valuable for both search and analysis. and analysis of data across two separate tracks: 1) usability testing
To explore approaches, we started with the classification of mo- in which target users completed tasks with the system and provided
tion types as our initial target. At first glance, it could be tempting survey feedback, and 2) a case-study comparative analysis to assess
to envision a solution to this classification task based on regular the system’s efficacy when benchmarked against a data scientist’s
expression where motions are explicitly identified by name. How- ad hoc analysis.
ever, as depicted in Table 1, a pure regex approach is far too rigid
to capture the many complexities found in the docket entry sample
6.1 Usability Testing
space, including multiple motions being named in a single entry,
non-motion events referencing motions by name (e.g., notices, or- We gave 15 subjects (14 legal professionals, one journalist) a set
ders), and obfuscation of the motion type through varying levels of of prompts (e.g., “For all cases in the ’N.D. IL’ district, which year
docket entry metadata. These complexities are further compounded had the highest average case duration?”) and assessed their experi-
by naming convention variations across districts and the trappings ences in: (1) using the search filter, (2) conducting analysis on all
of error-prone human data entry [1]. the records, and (3) conducting analyses based on specific search
Thus we pivoted to language modeling. As no training dataset criteria of varying complexity. In addition, we gave them time to
exists for such a task, we created a web application to view and tag test their own scenarios while “thinking aloud” so we could capture
the motions pulled from our docket dataset. For both the definition their intentions and strategies. Participants were then presented
of the space of possible motion types and for the sake of actually with a survey to complete at the end of the session consisting of the
tagging the motion entries, we solicited help from legal scholars and modified System Usability Scale (SUS), an evaluation framework
their law students. We implemented a voting mechanism in the app shown to be effective at quantifying the complexity and ease of use
such that each motion will be tagged three times by three distinct of interfaces [4]. The average SUS score for our participants’ overall
users as a means of ensuring accuracy. Our dataset continues to experience across (1), (2), and (3) was 72.83, which is considered
grow through use of the application, though the experiments that good usability [3, 26]. When answering the statement regarding
follow leverage a subset of this data. whether they would use our system frequently, all but two partic-
In order to effectively utilize this data for our classification ex- ipants (87%) agreed or strongly agreed with that statement (one
periments, we performed some preprocessing on the raw dataset. was neutral and one disagreed and wrote that docket sheets are not
First, the raw dataset contained several motion classes with few used in their research). These results represent a preliminary round
data points. To address these rare motion classes in these initial of user testing, and we intend to further analyze the associated
tests, we set a threshold of 25 data points and merged all classes feedback and conduct additional user tests targeting users with a
below this threshold into the “Other Motion” class. Second, we re- wider variety of backgrounds.
moved all duplicate docket entries that arose as a byproduct of the
voting mechanism from the dataset to ensure that the models were 6.2 System Evaluation: A Case Study
not training on some docket entries more than others. After this To further weigh the benefits of our approach, we compare it with
preprocessing, the smallest motion class contained 25 samples, the prior work done by data scientists examining how fee waiver grant
largest contained 951, and the median and average of the motion rates vary among judges using ad hoc data processing and analysis
classes were 50 and 152, respectively. For each of the models we [35]. We answer the same question using our system (see Figure
used a train/validation/test split of 80/10/10 per class. In total after 7 for an example of the output) and through observation compare
preprocessing, there were 2,064 training samples with 524 testing both approaches across three dimensions: speed to insight, flexibil-
samples across 17 distinct motion classes. ity of exploration, and barrier to entry.
Making use of two pretrained transformers, the 110M parameter The initial data processing pipeline looks similar for both. A
BERT-base [9] and 125M parameter RoBERTa [29] models, we fine- systematic analysis of this issue requires paying to download case
tuned each on this processed dataset. We made use of the AllenNLP documents, creating an ETL process to structure the data, and
framework [15] as a wrapper around the Huggingface Transformers identifying the fee waiver status of each case [35]. However, where
library [48] to fine-tune the models for 10 epochs, using a batch the ad hoc method attends to ETL, aggregation, and visualization
size of 8, and the AdamW optimizer. The RoBERTa model achieved for a single target task, our approach looks to leverage that upfront
training accuracy of 95.69%, validation accuracy of 91.22%, and data work to support a wide array of possible downstream analyses.
test accuracy of 90.08%. The BERT-base model achieved training Thus, when considering a one-off query or single data point, we
accuracy of 96.95%, validation accuracy of 89.69%, and test accuracy cannot definitively say that the ETL, schema and ontology work in
of 89.31%. These results exceeded the baseline bag of embeddings support of our system will require less time than a data scientist
classification model, which achieved a test accuracy of 80.50%. For taking the ad hoc approach. But one-offs aren’t the goal of our

125
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paley et al.

Motion Entry Motion Class


MOTION by Defendant [Name Omitted] for extension of time to file respon- Motion for Extension
se/reply as to motion for summary judgment 20 (Unopposed) ([Name Omitted])
(Entered: 05/18/2017)
Proposed Order re 13 MOTION to Stay Proceedings Pending Transfer by [Name Not a Motion
Omitted]. ([Name Omitted]) (Filed on 4/16/2012) [Transferred from California
Northern on 4/18/2012.] (Entered: 04/16/2012)
Motion by [Name Omitted], Cook County Board of Review, [Name Omitted] Motion for Leave
for Leave to Cite Additional Authority ([Name Omitted]) (Entered: 01/13/2011)
Plaintiff’s motion for leave to file a first amended complaint 54 is granted. Not a Motion
WRITTEN Opinion entered by the Honorable George W. Lindberg on 4/29/2011:
Signed by the Honorable George W. Lindberg on 4/29/2011:Mailed notice(pm, )
(Entered: 04/29/2011)
Table 1: Sample motion entries illustrating some complexities of classifying motion types.

platform and what matters to us is the speed to information for our


end users – and once the setup of our system is complete, speed
from question to information for end users will be much faster and
easier on a per-query basis. Thus, our system demonstrates a clear
advantage given that the cost of defining a configuration can be
amortized over every question answered.
On flexibility of exploration, we consider the state of things once
the fee waiver question is answered. For the ad hoc approach, we
have access to the information needed to answer the question at
hand, but any subsequent question or filter amendment requires
fresh code and additional work on the part of the data scientist. In
contrast, once a configuration has been defined, our system sup-
ports running new types of analysis or perhaps the same analysis on
different slices of the data (e.g., “How do fee waiver grant rates vary
among judges in the Northern District of Illinois vs the Southern
Figure 7: The system’s response to the analysis statement
District of Illinois?”) – without a data scientist in the loop.
“Average Fee Waiver Grant Rate Grouped by Judge” in a fil-
Last, and we believe most important, is the barrier to entry.
tered data context. Each bar represents a given judge’s fee
Regardless of time or effort, there will always be those who lack
waiver grant rate – judge names are available in rollover
the technical skills or resources necessary to convert data into
states (not depicted).
information. Since most are not data scientists and cannot afford to
hire one, the ad hoc approach doesn’t scale. In contrast, our system
provides a path to minimize the barrier to entry by abstracting
away these technical skills through a one-off, upfront setup. This To reach those goals, we must automate. We need systems that
decouples the data scientist from the exploration of data and by bridge the gap between questions and answers – routinely employ-
doing so democratizes access to the underlying information. ing individual data scientists to serve that function simply doesn’t
scale. The ultimate objective is repositioning as much of the bur-
den of that complexity on the machine as possible, and our work
7 DISCUSSION here is a step in that direction. And while our approach certainly
Though in its early stages, our work already demonstrates signif- won’t eliminate the role of data scientists, we believe a significant
icant promise. User testing among legal scholars, attorneys and amount of analysis can be standardized, systematized and auto-
journalists in the U.S. confirms both the value and usability of our mated, bringing access where previously there was none for lack
system. Beyond that, our approach has the potential to generalize of expertise or resources. And of note, where access does already
well to a wide array of data sources, providing a new platform for exist, our approach has the potential to free data scientists from
the democratization of information across communities, sectors some of the repetitive “query generation” aspects of their roles,
and geographies. We see broad opportunities for such an approach affording them more time to drive novel exploration. In some sense
in the space of open government data, a domain rife with available the approach detailed here could scale their expertise, allowing
datasets but with chronic challenges in terms of accessibility and them to teach our platform about their datasets and then offload
use [18, 23]. This is a push towards the realization of the true goal stakeholder questions to the system whenever possible.
of such initiatives: from code to content and from open data to open That said, we see challenges in this approach to the U.S. court
information. system, and anticipate them in scaling to new domains as well:

126
From Data to Information: Automating Data Science to Explore the U.S. Court System ICAIL’21, June 21–25, 2021, São Paulo, Brazil

• Issues of ethics and responsibility: One such example is privacy. Court documents are rife with personally identifiable information, and reliably de-identifying documents at scale is a non-trivial problem. Further, the tension between de-identification and information completeness (say, for the sake of mapping to geographies) adds another complication. The use of highly regulated medical records data in research and machine learning provides a promising precedent [8, 25] for reference as we move forward.
• Issues of information misuse: Protecting against misuse of analysis, especially when the barrier of expertise to arriving at such analysis has been lowered, is a significant issue in our increasingly fraught information landscape. In the realm of law, the politicization of judicial decision making or the use of judicial analytics as a means of influencing future outcomes are both potential issues.
• Issues of explainability and data quality: Our scalable approach to data analysis adds a new layer of importance to the explainability of results and also runs the risk of obscuring incomplete or deficient data. To fully realize the promise of data science automation, additional research will be focused on ensuring our system can explain itself and handle issues of data quality gracefully and transparently.
• Issues associated with novel analysis: Inarguably, data scientists can flexibly address novel questions or analysis requirements on the fly, and while our platform's library of analytics will grow, there will continue to be question types it can't answer. In future work, we will expand our nascent plugin framework to support custom analysis and continuously grow the built-in libraries.

8 FUTURE WORK
Going forward, various members of our team are pursuing in tandem the dual roadmap we laid out in the introduction.
One thread is aimed at making the raw data emitted from the U.S. court system increasingly machine-readable. This entails everything from the continued evolution of the ingestion pipeline (sourcing data from a wider variety of districts and tackling corner cases in the data) to improvements to the data already obtained through various forms of enrichment. In the near term, we intend to pursue entity disambiguation on parties and attorneys, as well as the creation of additional datasets to train language models for classification outside the scope of the motions described above (such that we can attempt to capture additional data points such as judicial rulings, charge severity, changes in representation, and various forms of case outcome).
The other thread is the work with the core platform itself. This will take a number of forms, including: 1) Expansions to the analytics capabilities and plugins (including the introduction of new response types and visualizations); 2) An evolution of the ontology configuration and support for ontology management through the user experience, allowing for user-driven updates as well as the introduction of new data sources; 3) Ontology-driven derived fields, providing support for adding new data points dynamically and introducing new possibilities for downstream explanations; 4) Support for localization such that the platform could be used by non-English speakers (of note, our ontology-driven approach means very little actual language is coded into the UI, making this an easier pursuit), opening up the possibility of legal documents and open data from other countries being made available through the platform; 5) UX improvements, including changes to analysis statement selection (with fuzzy semantic matching on colloquial terms against terms of art), support for more interactivity in visualizations, and additional explanations associated with analysis results; 6) Support for interactive machine learning by bringing the capabilities of our separate motion tagging application directly to the platform and augmenting them to cover both the extraction/creation of novel tagged datasets and in-platform model training/fine-tuning, validation and testing. This presents a significant opportunity for research in the space of making machine learning accessible to non-technical users.
While we believe deeply in the importance of bringing transparency to the U.S. court system and will continue the data work necessary to do so, we also see this data-information rift throughout the government and public sector in the United States and globally. Thus, we are excited by the prospect of platform improvements to support bringing a variety of new datasets to our application.

9 CONCLUSION
In this work we've detailed a novel platform and user experience to allow non-data scientists to drive exploration and analysis of data associated with the U.S. Court system. In support of that experience, we defined the process by which we ingested, extracted and structured the data from 270,000 case dockets. Given the results of usability testing presented in our evaluation, we believe we have early confirmation that this new natural language notebook approach marks a step in the direction of democratizing access to data analysis and could have significant impact not only in the space of the U.S. court system, but also more broadly across a variety of publicly available data. Subsequent work is already underway to further develop the capabilities, refine the UX mechanics, and stand up new components of the ecosystem. In tandem, the ingestion, structuring and enrichment of U.S. court records continues as we work towards a comprehensive database mirroring the federal court system.

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation Convergence Accelerator Program under grant no. 1937123 and grant no. 2033604.
REFERENCES
[1] Charlotte Alexander and Mohammed Javad Feizollahi. 2019. On Dragons, Caves, Teeth, and Claws: Legal Analytics and the Problem of Court Data Access. Computational Legal Studies: The Promise and Challenge of Data-Driven Legal Research (Ryan Whalen, ed., Edward Elgar, 2019, Forthcoming) (2019).
[2] Judie Attard, Fabrizio Orlandi, Simon Scerri, and Sören Auer. 2015. A systematic review of open government data initiatives. Government Information Quarterly 32, 4 (2015), 399–418.
[3] Aaron Bangor, Philip Kortum, and James Miller. 2009. Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies 4, 3 (2009), 114–123.
[4] Aaron Bangor, Philip T Kortum, and James T Miller. 2008. An empirical evaluation of the system usability scale. Intl. Journal of Human–Computer Interaction 24, 6 (2008), 574–594.
[5] Karl Branting, Brandy Weiss, Bradford Brown, Craig Pfeifer, A Chakraborty, Lisa Ferro, M Pfaff, and A Yeh. 2019. Semi-supervised methods for explainable legal prediction. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 22–31.
[6] Federal Judicial Center. 2011. Biographical directory of federal judges.
[7] Jonathan Crusoe, Anthony Simonofski, Antoine Clarinval, and Elisabeth Gebka. 2019. The impact of impediments on open government data use: insights from users. In 2019 13th International Conference on Research Challenges in Information Science (RCIS). IEEE, 1–12.
[8] Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association 24, 3 (2017), 596–606.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3558–3567.
[11] Biralatei Fawei, Jeff Z Pan, Martin Kollingbaum, and Adam Z Wyner. 2018. A methodology for a criminal law and procedure ontology for legal question answering. In Joint International Semantic Technology Conference. Springer, 198–214.
[12] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building Watson: An overview of the DeepQA project. AI Magazine 31, 3 (2010), 59–79.
[13] Anthony W Flores, Kristin Bechtel, and Christopher T Lowenkamp. 2016. False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks." Fed. Probation 80 (2016), 38.
[14] World Wide Web Foundation. 2018. Open Data Barometer - Leaders Edition. World Wide Web Foundation.
[15] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson H S Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2017. A Deep Semantic Natural Language Processing Platform.
[16] Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. 1961. Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference. 219–224.
[17] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760 (2017).
[18] Maxat Kassen. 2018. Adopting and managing open data: Stakeholder perspectives, challenges and policy recommendations. Aslib Journal of Information Management (2018).
[19] Muhammad Mahboob Khurshid, Nor Hidayati Zakaria, Ammar Rashid, and Muhammad Nouman Shafique. 2018. Examining the Factors of Open Government Data Usability From Academician's Perspective. International Journal of Information Technology Project Management (IJITPM) 9, 3 (2018), 72–85.
[20] Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee. 2020. Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment 13, 10 (2020), 1737–1750.
[21] Mi-Young Kim, Randy Goebel, and S Ken. 2015. COLIEE-2015: evaluation of legal question answering. In Ninth International Workshop on Juris-informatics (JURISIN 2015).
[22] Mi-Young Kim, Ying Xu, and Randy Goebel. 2014. Legal question answering using ranking SVM and syntactic/semantic similarity. In JSAI International Symposium on Artificial Intelligence. Springer, 244–258.
[23] Rashmi Krishnamurthy and Yukika Awazu. 2016. Liberating data for public value: The case of Data.gov. International Journal of Information Management 36, 4 (2016), 668–672.
[24] Karim R Lakhani, Robert D Austin, and Yumi Yi. 2002. Data.gov. Harvard Business School.
[25] Joffrey L Leevy, Taghi M Khoshgoftaar, and Flavio Villanustre. 2020. Survey on RNN and CRF models for de-identification of medical free text. Journal of Big Data 7, 1 (2020), 1–22.
[26] James R Lewis. 2018. The system usability scale: past, present, and future. International Journal of Human–Computer Interaction 34, 7 (2018), 577–590.
[27] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[28] Tomer Libal and Matteo Pascucci. 2019. Automated reasoning in normative detachment structures with ideal conditions. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 63–72.
[29] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[30] Lynn M LoPucki. 2001. Politics of Research Access to Federal Court Data. Tex. L. Rev. 80 (2001), 2161.
[31] Jock Mackinlay, Pat Hanrahan, and Chris Stolte. 2007. Show me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1137–1144.
[32] Peter W Martin. 2018. District Court Opinions That Remain Hidden Despite a Long-Standing Congressional Mandate of Transparency - the Result of Judicial Autonomy and Systemic Indifference. Law Libr. J. 110 (2018), 305.
[33] Gayle McElvain, George Sanchez, Sean Matthews, Don Teo, Filippo Pompili, and Tonya Custis. 2019. WestSearch Plus: A Non-factoid Question-Answering System for the Legal Domain. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1361–1364.
[34] Michael Bayer. [n.d.]. SQLAlchemy. https://www.sqlalchemy.org/
[35] Adam R Pah, David L Schwartz, Sarath Sanga, Zachary D Clopton, Peter DiCola, Rachel Davis Mersey, Charlotte S Alexander, Kristian J Hammond, and Luís A Nunes Amaral. 2020. How to build a more open justice system. Science 369, 6500 (2020), 134–136.
[36] Fernando Pérez and Brian E Granger. 2007. IPython: a system for interactive scientific computing. Computing in Science & Engineering 9, 3 (2007), 21–29.
[37] Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. 2008. Linking data to ontologies. In Journal on Data Semantics X. Springer, 133–173.
[38] Min Ragan-Kelley, F Perez, B Granger, T Kluyver, P Ivanov, J Frederic, and M Bussonnier. 2014. The Jupyter/IPython architecture: a unified view of computational research, from interactive exploration to communication and publication. AGUFM 2014 (2014), H44D–07.
[39] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2383–2392.
[40] Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 41–47.
[41] Mariano Rodriguez-Muro, Roman Kontchakov, and Michael Zakharyaschev. 2013. Ontology-based data access: Ontop of databases. In International Semantic Web Conference. Springer, 558–573.
[42] Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R Mittal, and Fatma Özcan. 2016. ATHENA: an ontology-driven system for natural language querying over relational data stores. Proceedings of the VLDB Endowment 9, 12 (2016), 1209–1220.
[43] Jaydeep Sen, Chuan Lei, Abdul Quamar, Fatma Özcan, Vasilis Efthymiou, Ayushi Dalmia, Greg Stager, Ashish Mittal, Diptikalyan Saha, and Karthik Sankaranarayanan. 2020. ATHENA++: natural language querying for complex nested SQL queries. Proceedings of the VLDB Endowment 13, 12 (2020), 2747–2759.
[44] Md Shamim Talukder, Liang Shen, Md Farid Hossain Talukder, and Yukun Bao. 2019. Determinants of user acceptance and use of open government data (OGD): An empirical investigation in Bangladesh. Technology in Society 56 (2019), 147–156.
[45] Songül Tolan, Marius Miron, Emilia Gómez, and Carlos Castillo. 2019. Why machine learning may lead to unfairness: Evidence from risk assessment for juvenile justice in Catalonia. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 83–92.
[46] Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, Vol. 99. 77–82.
[47] Vishanth Weerakkody, Zahir Irani, Kawal Kapoor, Uthayasankar Sivarajah, and Yogesh K Dwivedi. 2017. Open data and its usability: an empirical view from the citizen's perspective. Information Systems Frontiers 19, 2 (2017), 285–300.
[48] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019).
[49] Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics 8 (2020), 183–198.
[50] Kanit Wongsuphasawat, Zening Qu, Dominik Moritz, Riley Chang, Felix Ouk, Anushka Anand, Jock Mackinlay, Bill Howe, and Jeffrey Heer. 2017. Voyager 2: Augmenting visual analysis with partial view specifications. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2648–2659.
Lex Rosetta: Transfer of Predictive Models Across Languages,
Jurisdictions, and Legal Domains
Jaromir Savelka (jsavelka@cs.cmu.edu), Carnegie Mellon University, USA
Hannes Westermann, Karim Benyekhlef, Université de Montréal, Canada
Charlotte S. Alexander, Jayla C. Grant, Georgia State University, USA
David Restrepo Amariles, Rajaa El Hamdani, HEC Paris, France
Sébastien Meeùs, Aurore Troussel, HEC Paris, France
Michał Araszkiewicz, Uniwersytet Jagielloński, Poland
Kevin D. Ashley, Alexandra Ashley, University of Pittsburgh, USA
Karl Branting, MITRE Corporation, USA
Mattia Falduti, Libera Università di Bolzano, Italy
Matthias Grabmair, Technische Universität München, Germany
Jakub Harašta, Tereza Novotná, Masarykova univerzita, Czech Republic
Elizabeth Tippett, Shiwanni Johnson, University of Oregon, USA

ABSTRACT
In this paper, we examine the use of multi-lingual sentence embeddings to transfer predictive models for functional segmentation of adjudicatory decisions across jurisdictions, legal systems (common and civil law), languages, and domains (i.e. contexts). Mechanisms for utilizing linguistic resources outside of their original context have significant potential benefits in AI & Law because differences between legal systems, languages, or traditions often block wider adoption of research outcomes. We analyze the use of Language-Agnostic Sentence Representations in sequence labeling models using Gated Recurrent Units (GRUs) that are transferable across languages. To investigate transfer between different contexts we developed an annotation scheme for functional segmentation of adjudicatory decisions. We found that models generalize beyond the contexts on which they were trained (e.g., a model trained on administrative decisions from the US can be applied to criminal law decisions from Italy). Further, we found that training the models on multiple contexts increases robustness and improves overall performance when evaluating on previously unseen contexts. Finally, we found that pooling the training data from all the contexts enhances the models' in-context performance.

CCS CONCEPTS
• Applied computing → Law; Annotation; • Information systems → Document structure; Structure and multilingual text search; Data mining.

KEYWORDS
multi-lingual sentence embeddings, transfer learning, domain adaptation, adjudicatory decisions, document segmentation, annotation

ACM Reference Format:
Jaromir Savelka, Hannes Westermann, Karim Benyekhlef, Charlotte S. Alexander, Jayla C. Grant, David Restrepo Amariles, Rajaa El Hamdani, Sébastien Meeùs, Aurore Troussel, Michał Araszkiewicz, Kevin D. Ashley, Alexandra Ashley, Karl Branting, Mattia Falduti, Matthias Grabmair, Jakub Harašta, Tereza Novotná, Elizabeth Tippett, and Shiwanni Johnson. 2021. Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466149

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466149

1 INTRODUCTION
This paper explores the ability of multi-lingual sentence embeddings to enable training of predictive models that generalize beyond individual languages, legal systems, jurisdictions, and domains (i.e., contexts). We propose a new type schema for functional segmentation of adjudicatory decisions (i.e., decisions of trial and appellate court judges, arbitrators, administrative judges and boards) and use it to annotate legal cases across eight different contexts (7 countries, 6 languages). We release the newly created dataset (807 documents
with 89,661 annotated sentences) including the annotation schema to the public.1
1 https://github.com/lexrosetta/caselaw_functional_segmentation_multilingual
In the area of AI & Law, research typically focuses on a single context, such as decisions of a specific court on a specific issue within a specific time range. This is justified by the complexity of legal work and the need for nuanced solutions to particular problems. At the same time, this narrow focus can limit the applicability of the research outcomes, since a proposed solution might not be readily transferable to a different context. In text classification, for example, a model might simply memorize a particular vocabulary characteristic of a given context, rather than acquiring the semantics of a predicted type. Adaptation of such a model to a new context would then require the assembly of a completely new dataset. This may be both time-consuming and expensive, since the annotation of legal documents relies on legal expertise.
Certain tasks appear to be of interest to researchers from multiple countries with different legal traditions (e.g., deontic classification of legal norms embodied in statutory law, argument extraction from case law, summarization/simplification of legal documents, etc.). This suggests that there may be several core tasks in AI & Law that are of general interest in almost any context. One such task is a functional segmentation of adjudicatory decisions, which has been the subject of numerous studies in the past (see Section 2). In this paper, we show that for this particular task it is possible to leverage linguistic resources created in multiple contexts.
This has wide-reaching implications for AI & Law research. Since annotation of training data is expensive, models that are able to use existing data from other contexts might be instrumental in enabling real-world applications that can be applied across contexts. Such approaches may further enable international collaboration of researchers, each annotating their own part of a dataset to contribute to a common pool (as we do in this work) that could be used to train strong models able to generalize across contexts.

1.1 Functional Segmentation
We investigate the task of segmenting adjudicatory decisions based on the functional role played by their parts. While there are significant differences in how decisions are written in different contexts, we hypothesize that selected elements might be universal, such as, for example, sections:
(1) describing facts that give rise to a dispute;
(2) applying general legal rules to such facts; or
(3) stating an outcome of the case (i.e., how was it decided).
This conjecture is supported by the results of the comparative project titled Interpreting Precedents [23], which aimed to analyze (among other things) the structure in 11 different jurisdictions. The findings of this project suggest that the structure indicated above may be considered a general model followed in the investigated jurisdictions, although variations exist that are characteristic of particular legal systems and types of courts and their decisions.
The ability to segment cases automatically could be beneficial for many tasks. It could support reading and understanding of legal decisions by students, legal practitioners, researchers, and the public. It could facilitate empirical analyses of the discourse structure of decisions. It could enhance the performance, as well as the user experience, of legal search tools. For example, if a user searches for an application of a legal rule, they might restrict the search to the section where a judge applies the rule to a factual situation. Judges themselves might find the technique useful, within their own jurisdictions but also in transnational disputes involving the application of different legal standards. The same benefits apply to non-court settings, e.g., international arbitration, where many jurisdictions' laws and their interpretation matter. Further, the segmentation of decisions into meaningful sections could serve as an important step in many legal document processing pipelines.

1.2 Hypotheses
To investigate how well predictive models, based on multi-lingual sentence embeddings, learn to segment cases into functional parts across different contexts, we evaluated the following hypotheses:
(H1) A model trained on a single context can generalize when transferred to other, previously unseen, contexts.
(H2) A model trained on data pooled from multiple contexts is more robust and generalizes better to unseen contexts than a model trained on a single context.
(H3) A context-specific model benefits from pooling the in-domain data with data from other contexts.

1.3 Contributions
By carrying out this work, we provide the following contributions to the AI & Law research community:
• Detailed definition and analysis of a functional segmentation task that is widely applicable across different contexts.
• A new labeled dataset consisting of 807 documents (89,661 sentences) from seven countries in six different languages.
• Evidence of the effectiveness of multi-lingual embeddings on processing legal documents.
• Release of the code used for data preparation, analysis, and the experiments in this work.

2 RELATED WORK
Segmenting court decisions into smaller elements according to their function or role is an important task in legal text processing. Prior work utilizing supervised machine learning or expert crafted rules can roughly be distinguished into two categories. First, the task could be to segment the text into a small number of contiguous parts typically comprising multiple paragraphs (this work). Different variations of this task were applied to several legal domains from countries, such as Canada [15], the Czech Republic [17], France [8], or the U.S. [27]. Second, the task could instead be labeling smaller textual units, often sentences, according to some predefined type system (e.g., rhetorical roles, such as evidence, reasoning, conclusion). Examples from several domains and countries include administrative decisions from the U.S. [33, 41], multi-domain court decisions from India [6], international arbitration decisions [9], or even multi-{domain,country} adjudicatory decisions in English [28]. Identifying a section that states an outcome of the case has also received considerable attention separately [25, 38]. To the best of our knowledge, existing work on functional segmentation of court decisions is limited to a single
language—ours being the first paper exploring the task jointly on legal documents in multiple languages.
In NLP, the success of word embeddings was followed by an increasing interest in learning continuous vector representations of longer linguistic units, such as sentences (a trend that has been reflected in AI & Law research as well [34, 41]). Multi-lingual representations recently attracted ample attention. While most of the earlier work was limited to a few close languages or pairwise joint embeddings for English and one foreign language, several approaches to obtain general-purpose massively multi-lingual sentence representations were proposed [5, 11, 13]. Such representations were utilized in many downstream applications, such as document classification [21], machine translation [2], question answering [22], hate speech detection [4], or information retrieval (IR) in the legal domain [40]. Our work is one of the first such applications in the legal domain and to the best of our knowledge the first dealing with more than two languages.
Approaches other than language-agnostic sentence embeddings (this work) were used in AI & Law research focused on texts in multiple languages. A recent line of work mapped recitals to articles in EU directives and normative provisions in Member states' legislation [24]. There, mono-lingual models were used (i.e., one model per language). Other published applications in multi-lingual legal IR were based on thesauri [14, 29]. A common technique to bridge the language gap was the use of ontologies and knowledge graphs [1, 3, 7, 16]. The multi-lingual environments, such as EU or systems established by international treaties, attracted work on machine translation [20], meaning equivalence verification [32], and building of parallel corpora [30, 31].

3 DATASET
In creating the dataset, the first goal was to identify a task that would be useful across different contexts. After extensive literature review, we identified the task of functional segmentation of adjudicatory decisions as a viable candidate. To make the task generalizable, we decided to include only a small number of core types.
(1) Out of Scope – Parts outside of the main document body (e.g., metadata, editorial content, dissents, end notes, appendices).
(2) Heading – Typically an incomplete sentence or marker starting a section (e.g., "Discussion," "Analysis," "II.").
(3) Background – The part where the court describes procedural history, relevant facts, or the parties' claims.
(4) Analysis – The section containing reasoning of the court, issues, and application of law to the facts of the case.
(5) Introductory Summary – A brief summary of the case at the beginning of the decision.
(6) Outcome – A few sentences stating how the case was decided (i.e., the overall outcome of the case).
We created detailed annotation guidelines defining the individual types as well as describing the annotation workflow (tooling, steps taken during annotation). Eight teams of researchers from six different countries (14 persons) were trained in the annotation process through online meetings. After this, each annotator conducted a dry-run annotation on 10 cases and received detailed feedback. Then, each team was tasked with assembling approximately 100 adjudicatory decisions. Each team developed specifications for the decisions to be included in their part of the dataset.
Four of the contexts were double-annotated by two annotators (Canada, Czech R., France, U.S.A. I); the remaining four by just one. Each team had at least one member with a completed law degree. When a team had more than one member, law students were allowed to be included.
A high-level description of the resulting dataset is provided in Table 1. It consists of eight contexts from seven different countries (two parts are from the U.S.) with 807 documents in six languages (three parts are in English). Most of the contexts include judicial decisions, while U.S.A. II was the only context that consisted solely of administrative decisions. There are considerable variations in the length of the documents. While an average document in the U.S.A. I context comprises 530.6 sentences, an average document in the France context is about ten times shorter (59.0 sentences).
The four double-annotated parts enabled us to examine the inter-annotator agreement. Table 2 shows the raw agreement on a character level. While it appears that recognizing the Outcome was rather straightforward in the France and U.S.A. I contexts, it was more complicated in case of Canada and the Czech R. This might be due to a presence/absence of some structural clue. We also observe that in the Czech R. context it was presumably much easier to distinguish between the Background and Analysis than in case of the other three contexts.
In this paper, we focus on prediction of the Background, Analysis, and Outcome types. We decided to exclude the Introductory Summary type, since it is mainly present in the data from the United States. For the double-annotated datasets, we picked the annotations that appeared to be of higher quality (either by consensus between the annotators themselves or by a decision of a third unbiased expert).
We first removed all the spans of text annotated with either of the Out of Scope or Heading types. The removal of Out of Scope leaves the main body of a decision stripped of potential metadata or editorial content at the beginning of the document as well as dissents, or end notes at the end. The removal of the text spans annotated with the Heading type might appear counter-intuitive since headings often provide a clear clue as to the content of the following section (e.g., "Outcome", "Analysis" etc.). We remove these (potentially valuable) headings because we want to focus on the more interesting task of recognizing the sections purely by the semantics of their constitutive sentences. This task is more challenging, and more closely emulates generalization to domains where headings are not used or not present in all cases, or are not reliable indicators.
The transformed documents are separated into several segments based on the annotations of the three remaining types. Each segment is then split into sentences.2 A resulting document is a sequence of sentences labeled with one of the Background, Analysis, or Outcome types. The highlighted (green) part of Table 1 provides
basic descriptive statistics of the resulting dataset per the individual contexts. Our final dataset for analysis consists of 807 cases split into 74,539 annotated sentences.
2 We used the processing pipeline from https://spacy.io/ (large models). For the Czech language we used https://github.com/TakeLab/spacy-udpipe with the Czech model (PDT) from https://universaldependencies.org/. The output was further processed with several regular expressions. A different method was used for the French dataset, which consists of a few very long sections, internally separated by a semicolon. After consultation with an expert we decided to split the cases by the semicolon as well.
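For illustration, the segmentation step described in footnote 2 could be sketched roughly as follows. This is a minimal sketch, not the authors' released pipeline (see footnote 1): the spaCy model names, the UDPipe language code, and the regular-expression clean-up rule shown here are our assumptions.

# Illustrative sketch of the per-language sentence splitting described in footnote 2.
# Model names and the regex clean-up rule are assumptions, not the authors' exact setup.
import re
import spacy
import spacy_udpipe

SPACY_MODELS = {"en": "en_core_web_lg", "de": "de_core_news_lg",
                "it": "it_core_news_lg", "pl": "pl_core_news_lg"}

def split_sentences(text: str, lang: str):
    if lang == "fr":
        # French decisions here consist of long sections separated by semicolons.
        return [part.strip() for part in text.split(";") if part.strip()]
    if lang == "cs":
        spacy_udpipe.download("cs")          # Czech UDPipe model (PDT)
        nlp = spacy_udpipe.load("cs")
    else:
        nlp = spacy.load(SPACY_MODELS[lang])
    sentences = [s.text.strip() for s in nlp(text).sents]
    # Example of regex-based clean-up: re-attach bare enumeration markers
    # ("II.", "3.") to the sentence that follows them.
    cleaned = []
    for s in sentences:
        if cleaned and re.fullmatch(r"[IVXLC0-9]+\.", cleaned[-1]):
            cleaned[-1] = cleaned[-1] + " " + s
        else:
            cleaned.append(s)
    return cleaned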
Table 1: Descriptive statistics of the created dataset. Each entry provides information about the country, the language of the decisions (Lang), and the number of documents (Docs) in a specific context. The Sentence-Level Statistics columns report basic descriptive statistics focused on sentences as well as the number of sentences labeled with each type (OoS – Out of Scope, Head – Heading, Int.S. – Introductory Summary, Back – Background, Anl – Analysis, Out – Outcome). The part highlighted in green in the original table contains the counts of sentences labeled with the types we focus on in this work.

Country | Lang | Docs | Sentences (Count / Avg / Min / Max) | OoS | Head | Int.S. | Back | Anl | Out | Description
Canada | EN | 100 | 12168 / 121.7 / 8 / 888 | 873 (7.2%) | 438 (3.6%) | 20 (0.2%) | 3319 (27.3%) | 7190 (59.1%) | 328 (2.7%) | Random selection of cases retrieved from www.canlii.org from multiple provinces. The selection is not limited to any specific topic or court.
Czech R. | CS | 100 | 11283 / 112.8 / 10 / 701 | 945 (8.4%) | 1257 (11.1%) | 2 (0.0%) | 3379 (29.9%) | 5422 (48.1%) | 278 (2.5%) | A random selection of cases from Constitutional Court (30), Supreme Court (40), and Supreme Administrative Court (30). Temporal distribution was taken into account.
France | FR | 100 | 5507 / 55.1 / 8 / 583 | 3811 (69.2%) | 220 (4.0%) | 0 (0.0%) | 485 (8.8%) | 631 (11.4%) | 360 (6.5%) | A selection of cases decided by Cour de cassation between 2011 and 2019. A stratified sampling based on the year of publication of the decision was used to select the cases.
Germany | DE | 104 | 10724 / 103.1 / 12 / 806 | 406 (3.8%) | 333 (3.1%) | 38 (0.4%) | 2960 (27.6%) | 6697 (62.4%) | 290 (2.7%) | A stratified sample from the federal jurisprudence database spanning all federal courts (civil, criminal, labor, finance, patent, social, constitutional, and administrative).
Italy | IT | 100 | 4534 / 45.3 / 10 / 207 | 417 (9.2%) | 1098 (24.2%) | 0 (0.0%) | 986 (21.7%) | 1903 (42.0%) | 130 (2.9%) | The top 100 cases of the criminal courts stored between 2015 and 2020 mentioning "stalking" and keyed to the Article 612 bis of the Criminal Code.
Poland | PL | 101 | 9791 / 96.9 / 4 / 1232 | 796 (8.1%) | 303 (3.1%) | 0 (0.0%) | 2736 (27.9%) | 5820 (59.4%) | 136 (1.4%) | A stratified sample from trial-level, appellate, administrative courts, the Supreme Court, and the Constitutional tribunal. The cases mention "democratic country ruled by law."
U.S.A. I | EN | 102 | 24898 / 244.1 / 34 / 1121 | 574 (2.3%) | 1235 (5.0%) | 475 (1.9%) | 6042 (24.3%) | 16098 (64.7%) | 474 (1.9%) | Federal district court decisions in employment law mentioning "motion for summary judgment," "employee," and "independent contractor."
U.S.A. II | EN | 100 | 10756 / 107.6 / 24 / 397 | 1766 (16.4%) | 650 (6.0%) | 639 (5.9%) | 3075 (28.6%) | 4402 (40.9%) | 224 (2.1%) | Administrative decisions from the U.S. Department of Labor. The top 100 rulings, ordered in reverse chronological order starting in October 2020, were selected.
Overall | 6 | 807 | 89661 / 105.6 / 4 / 1232 | 9588 | 5534 | 1174 | 22982 | 48163 | 2220 |
Table 2: Raw agreement on a character level for the four datasets with two human annotators. The agreement is computed as a percentage of characters where both the annotators agree on a specific type over all the characters annotated by that type by any of the annotators. (NM = Not Marked)

         | OoS  | Head | Int.S. | Back | Anl  | Out  | NM
Canada   | 97.2 | 68.2 | 44.0   | 83.3 | 92.2 | 79.9 | 43.4
Czech R. | 80.3 | 54.6 | 0.0    | 92.6 | 94.5 | 46.9 | 10.0
France   | 93.5 | 92.5 | N/A    | 43.0 | 72.2 | 99.1 | 1.0
U.S.A. I | 90.8 | 71.0 | 74.2   | 78.4 | 93.7 | 91.1 | 18.4
Overall  | 91.8 | 70.4 | 72.1   | 82.1 | 92.6 | 77.3 | 3.8
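The raw character-level agreement reported in Table 2 can be computed in a few lines from span annotations. The sketch below is illustrative only; the (start, end, label) span representation and the variable names are our assumptions rather than the authors' tooling.

# Minimal sketch: raw character-level agreement for one label type.
# An annotation is assumed to be a (start, end, label) span over the document text.
def char_agreement(doc_length, spans_a, spans_b, label):
    """Share of characters labeled `label` by either annotator on which both agree."""
    marked_a = [False] * doc_length
    marked_b = [False] * doc_length
    for start, end, lab in spans_a:
        if lab == label:
            for i in range(start, end):
                marked_a[i] = True
    for start, end, lab in spans_b:
        if lab == label:
            for i in range(start, end):
                marked_b[i] = True
    union = sum(a or b for a, b in zip(marked_a, marked_b))
    both = sum(a and b for a, b in zip(marked_a, marked_b))
    return both / union if union else float("nan")

# Example: two annotators marking an "Outcome" span in a 100-character decision.
a = [(80, 100, "Outcome")]
b = [(85, 100, "Outcome")]
print(round(char_agreement(100, a, b, "Outcome"), 3))  # 0.75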
4 MODELS
In our experiments we use the Language-Agnostic Sentence Representations (LASER) model [5] to encode sentences from different languages into a shared semantic space. Each document becomes a series of vectors which represent the semantic content of a single sentence. We use these vectors to train a bidirectional Gated Recurrent Unit (GRU) model [10] for predicting sentence labels.
The LASER model is a language-agnostic bidirectional LSTM encoder coupled with an auxiliary decoder and trained on parallel corpora. The sentence embeddings are obtained by applying a max-pooling operation over the output of the encoder. The resulting sentence representations (after concatenating both directions) are 1024-dimensional. The released trained model,3 which we use in this work, supports 93 languages (including the six in our dataset) belonging to 30 different families and written in 28 different scripts. The model was trained on 223 million parallel sentences. The joint encoder itself has no information on the language or writing script of the tokenized text, while the tokenizer is language specific. It is even possible to mix multiple languages in one sentence. The focus of the LASER model is to produce vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task [5]. An interesting property of such universal multi-lingual sentence embeddings is the increased focus on the sentence semantics, as the syntax or other surface properties are unlikely to be shared among languages.
3 https://github.com/facebookresearch/LASER
4 The model was implemented using the Keras framework (https://keras.io/).
The GRU neural network [10] is an architecture based on a recurrent neural network (RNN) that is able to learn the mapping from a sequence of an arbitrary length to another sequence. GRUs are able to either score a pair of sequences or to generate a target sequence given a source sequence (this work). In a bidirectional GRU, two separate sequences are considered (one from right to left and the other from left to right). Traditional RNNs work well for shorter sequences but cannot be successfully applied to long sequences due to the well-known problem of vanishing gradients. Long Short-Term Memory (LSTM) networks [18] have been used as an effective solution to this problem (the forget gate, along with the additive property of the cell state gradients). GRUs have been proposed as an alternative to LSTMs with a reduced number of parameters. In GRUs there is no explicit memory unit, and the forget gate and the update gate are combined. The performance of GRUs was shown to be superior to that of LSTMs in the scenario of long texts and small datasets [39], which is the situation in this work. For these reasons, we chose to use GRUs over LSTMs.
The overall structure of the employed model is shown in Figure 1.4 Each case is transformed into a 1080 × 1024 matrix. The number 1080 represents the maximum length (in sentences) of any case in our dataset. Shorter cases are padded to be of uniform length. The vectors are passed to the model in batches of size 32. They first go through a masking layer, which masks the sentences used for
padding. The data is then passed to a bidirectional GRU model with 256 units and a dropout of 0.2. Finally, the model contains a time-distributed dense output layer with softmax activation, which outputs the predicted label (i.e., Background, Analysis, Outcome) for each sentence. As a loss function, we use categorical cross-entropy. As optimizer we use the Adam algorithm [19] with initial learning rate set to 4e-3 (reduced to 4e-4 once validation loss has stopped decreasing, with a patience of 50 epochs). We train the model for up to 1000 epochs. We halt the training once the validation accuracy has not increased in 80 epochs. For prediction, we use the best model as determined by validation accuracy.
Figure 1: The structure of the sequential model used for prediction. Each case m is split into n sentences, which are converted to language-independent LASER vector embeddings. These are fed through a bidirectional recurrent GRU-model, which predicts one of the three labels per input sentence.
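As an illustration of the architecture and training regime just described (and summarized in Figure 1), a minimal Keras sketch is given below. It mirrors the stated hyperparameters (masking, a bidirectional GRU with 256 units and dropout 0.2, a time-distributed softmax over the three labels, categorical cross-entropy, Adam at 4e-3 with learning-rate reduction and early stopping), but it is not the authors' released implementation; the helper names, callback monitors, and the commented mention of the laserembeddings package are assumptions.

# Sketch of the sequence-labeling model described above (cf. Figure 1 and footnote 4).
# Hyperparameters follow the text; the authors' actual training scripts may differ.
from tensorflow import keras
from tensorflow.keras import layers

MAX_SENTS, EMB_DIM, N_CLASSES = 1080, 1024, 3   # Background, Analysis, Outcome

def build_model():
    model = keras.Sequential([
        layers.Masking(mask_value=0.0, input_shape=(MAX_SENTS, EMB_DIM)),  # skip padding
        layers.Bidirectional(layers.GRU(256, dropout=0.2, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(N_CLASSES, activation="softmax")),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=4e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

callbacks = [
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=50),  # 4e-3 -> 4e-4
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=80,
                                  restore_best_weights=True),
]

# Inputs (assumed shapes): X is (n_cases, 1080, 1024) LASER sentence embeddings padded
# with zeros, e.g. produced with a package such as laserembeddings; y is (n_cases, 1080, 3)
# one-hot sentence labels.
# model = build_model()
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=32, epochs=1000, callbacks=callbacks)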
5 EXPERIMENTAL DESIGN
We performed three experiments to test the hypotheses (Section 1.2). The first experiment (H1) focused on model generalization across different contexts (Out-Context). The second experiment (H2) assessed model robustness when trained on multiple pooled contexts different from the one where the model was applied (Pooled Out-Context). Finally, the third experiment (H3) analyzed the effects of pooling the target context data with data from other contexts (Pooled with In-Context). The different pools of training data, and the baselines we compare them against, are summarized in Table 3 and described in Sections 5.1-5.3.
Figure 2: A visualization of the three experimental setups. The figure shows how training contexts are selected for H1, H2 and H3, and examples of how the folds are assigned to Train, Val and Test.
Table 3: Description of different training data selection and baselines used for H1, H2 and H3. The random (H1), the best single Out-Context models (H2), and the In-Context models (H3) baselines are highlighted.

Name | Trained on | Hyp. | Baseline
Random | Target Context | - | -
In-Context | Target Context | - | -
Out-Context | Non-Target Context | H1 | Random
Pooled Out-Context | Pooled Non-Target Contexts | H2 | Best performing Out-Context model per context
Pooled with In-Context | Non-Target and Target Contexts Pooled | H3 | In-Context

Since our dataset is limited, we performed a 10-fold cross-validation. The folds were kept consistent across all the experiments. The experiments were conducted in the following manner:
(1) Index i = 1 is set.
(2) Pool of training context(s) is selected (see Sections 5.1-5.3).
(3) A single test context is selected (see Sections 5.1-5.3).
(4) Eight of the folds from the training context(s) are used as training data (index different from i and (i + 1) mod 10).
(5) The folds with index (i + 1) mod 10 from the training context(s) are designated as validation data.
(6) The i-th fold from the test context is designated as test data.
(7) The models are trained and evaluated.
(8) Index i = i + 1 is set.
(9) If i ≤ 10 go to (4) else finish.
Note that from here on we highlight the baselines using the colors as shown in Table 3 to improve clarity.
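The fold bookkeeping in steps (1)-(9) can be expressed compactly. The following sketch is illustrative only (the folds data structure and helper names are our assumptions); it shows how the train, validation, and test folds rotate for a given pool of training contexts and a target context.

# Illustrative sketch of the fold rotation in steps (1)-(9); 0-based indices are used here.
# `folds[context]` is assumed to map a context name to a list of 10 document folds.
def iterate_splits(folds, training_contexts, target_context):
    for i in range(10):
        val_idx = (i + 1) % 10
        train_docs, val_docs = [], []
        for ctx in training_contexts:
            for idx, fold in enumerate(folds[ctx]):
                if idx == i:
                    continue                  # held out to mirror the test fold index
                elif idx == val_idx:
                    val_docs.extend(fold)     # fold (i + 1) mod 10 -> validation
                else:
                    train_docs.extend(fold)   # remaining eight folds -> training
        test_docs = folds[target_context][i]  # the i-th fold of the target context
        yield train_docs, val_docs, test_docs

# Example (Out-Context setup): train on Canada, evaluate on Italy.
# for train, val, test in iterate_splits(folds, ["Canada"], "Italy"):
#     train_and_evaluate(train, val, test)    # hypothetical helper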

133
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Savelka and Westermann, et al.

that the models are able to transfer (at least some) knowledge from
one context to another.
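The random baseline described above amounts to sampling labels from the empirical label distribution of the target context's train and validation folds. A minimal sketch, assuming scikit-learn's stratified dummy classifier as one possible implementation (not the authors' code):

# Sketch of the random baseline: predict labels by sampling from the label
# distribution observed in the target context's train+validation folds.
# Uses sklearn's DummyClassifier as one possible implementation (an assumption).
from sklearn.dummy import DummyClassifier

def random_baseline_predictions(train_val_labels, n_test_sentences, seed=0):
    clf = DummyClassifier(strategy="stratified", random_state=seed)
    # Features are irrelevant for this baseline; zeros act as placeholders.
    clf.fit([[0]] * len(train_val_labels), train_val_labels)
    return clf.predict([[0]] * n_test_sentences)

# Example: labels drawn roughly in proportion to Background/Analysis/Outcome frequency.
print(random_baseline_predictions(
    ["Analysis"] * 60 + ["Background"] * 30 + ["Outcome"] * 10, n_test_sentences=5))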

5.2 Pooled Out-Context Experiment (H2)
The second experiment focused on the ability of the models to gain robustness as they learn from more than one context, when applied to an unseen context. Therefore, the training pool for the experiments consisted of data from all the contexts except the current target context (see Figure 2). As a baseline for each context, we used the performance measurement taken from the best Out-Context model for that context (see Section 5.1). This was a very competitive baseline since the best Out-Context model likely stems from the context that is the most similar to the target context. If the Pooled Out-Context model performed better than the best Out-Context model, this would indicate that pooling datasets increases the robustness of the model, allowing it to perform better on previously unseen contexts.

5.3 Pooled With In-Context Experiment (H3)
The third experiment focused on pooling the target context's data with data from the other contexts. The training pool therefore incorporates all of the contexts, including the target one (see Figure 2). As a baseline, we used the In-Context models, trained on the target context. Again, this is a very strong baseline since the In-Context model should be able to learn the most accurate signal from the target context (including context specific peculiarities). If the model trained on the pooled contexts (including target) is able to outperform the In-Context model, this would indicate that pooling contexts is beneficial in terms of both robustness and absolute performance on the individual contexts.

6 EVALUATION METHOD
To evaluate the performance of the systems trained on data from different contexts, we used Precision (P), Recall (R), and F1-measure. We compute the evaluation metrics for each class (i.e., Background, Analysis, Outcome) per fold. We then report per class F1-scores as well as an overall F1-score (micro) including its standard deviation.
For statistical evaluation we used the Wilcoxon signed-rank test [37] to perform pair comparisons between methods and the baselines as suggested by [12]. We used the overall (micro) F1-score as the evaluation metric. The null-hypothesis states that the mean performance of the two compared methods (i.e., an assessed model and a baseline) is equivalent.
Since the number of samples is rather small (7 contexts in testing H1, and 8 contexts in testing H2 and H3), we are not able to reject the null hypotheses at α = 0.05 for any of H1–H3. This is as expected given the size of the dataset, and it is not unacceptable given the exploratory nature of our work at this stage. We will be in a position to formally evaluate the hypotheses once we extend the dataset (see Section 10). Instead, we report raw p-values produced by the testing as evidence of the assessed methods' effectiveness. The p-value is the probability of seeing the results at least as extreme as those we observed given the null hypothesis is correct (i.e., their mean performance is the same).
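As a sketch of this statistical comparison, the paired Wilcoxon signed-rank test can be run on per-context overall F1-scores with SciPy; the numbers below are placeholders for illustration only, not the reported results.

# Sketch of the statistical comparison: paired Wilcoxon signed-rank test over
# per-context (micro) F1-scores of an assessed model vs. a baseline.
from scipy.stats import wilcoxon

# Placeholder numbers only -- the actual values come from Table 4.
model_f1    = [0.82, 0.68, 0.64, 0.81, 0.73, 0.73, 0.87]
baseline_f1 = [0.54, 0.49, 0.33, 0.56, 0.50, 0.55, 0.58]

statistic, p_value = wilcoxon(model_f1, baseline_f1)
print(f"W = {statistic:.1f}, p = {p_value:.3f}")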

Figure 3: A description of performance metrics reported in Table 4.

7 RESULTS
The results of all three experiments are shown in Table 4. Each row reports a performance of a specific model across different contexts (columns). Each cell shows a bold number on the first row which corresponds to the (micro) average F1-score over the three predicted classes, with standard deviation across the 10 folds. The F1-scores of the three classes are reported in the second row of each cell ordered as Background, Analysis, Outcome. A visual explanation of cell contents can be found in Figure 3.

7.1 Out-Context Experiment (H1)
The performance of the models trained during the Out-Context experiment is reported by the eight rows of Table 4 starting with Canada and ending with U.S.A. II. Here, the models are trained on a single context (row), and then evaluated on all of the contexts (columns). It appears the models perform reasonably well under the Out-Context condition. The application of the trained models outperforms the random baseline across the board (in 54 out of 56 instances). This is further corroborated by the low p-values obtained for these methods when compared to the random baseline (reported in Table 4).
Several interesting patterns emerge from the results. First, it appears that models trained on contexts with the same language or a language from the same family perform better. For example, the models trained on Canada and the two U.S.A. contexts perform well among each other. A similar observation applies to the models trained on the Poland and the Czech R. contexts. Quite surprisingly, the models succeeded in identifying the Outcome sentences to some extent, despite a heavy under-representation of these sentences in our dataset. For example, a model trained on Canada predicts the Outcome sentences of both U.S.A. contexts with an average F1-score close to 0.7. At the same time the Outcome sentences are by far the most challenging ones. On multiple occasions the models completely fail to predict these (e.g., Canada→Germany, or Poland→France).
Overall, the results demonstrate the ability of the models to effectively transfer the knowledge learned on one context to another. At the same time, it is clear that the models trained on one context and applied to a different one do not perform as well as the In-Context models.
Table 4: Results of the Out-Context, Pooled Out-Context, and Pooled with In-Context experiments. Each row reports a perfor-
mance of a model across contexts. A bold number in each cell reports the (micro) average F1 -score over the predicted classes
and the standard deviation across the 10 folds. The F1 -scores of the three classes are reported below ordered as Background,
Analysis, Outcome. A visual explanation of cell contents can be found in Figure 3. The random (H1), the best single Out-Con-
text models (H2), and the In-Context (H3) baselines are highlighted. The p-values are color coded to match the baselines.

Canada Czech R. France Germany Italy Poland U.S.A. I U.S.A. II Avg (-test) Avg (+test)
Random .54 ± .04 .49 ± .02 .33 ± .05 .56 ± .04 .50 ± .04 .55 ± .04 .58 ± .03 .51 ± .02 .51 ± .07
(dist.) .31 .66 .06 .36 .59 .04 .32 .37 .25 .29 .68 .03 .31 .62 .04 .32 .67 .00 .26 .72 .02 .37 .61 .02 .32 .62 .10
Canada .82 ± .09 .68 ± .08 .64 ± .09 .81 ± .06 .73 ± .08 .73 ± .07 .87 ± .05 .88 ± .03 .76 ± .09 .77 ± .08
𝑝 = .016 .75 .87 .70 .53 .80 .39 .64 .66 .57 .71 .89 .00 .55 .82 .66 .50 .85 .00 .75 .92 .69 .87 .90 .70 .65 .83 .43 .66 .84 .46
Czech R. .76 ± .09 .91 ± .04 .47 ± .08 .82 ± .08 .82 ± .05 .83 ± .07 .84 ± .06 .86 ± .05 .77 ± .13 .79 ± .13
𝑝 = .016 .71 .80 .31 .90 .92 .64 .52 .36 .48 .75 .89 .01 .81 .84 .40 .73 .90 .00 .73 .90 .28 .86 .88 .41 .73 .80 .27 .75 .81 .32
France .71 ± .07 .61 ± .06 .86 ± .08 .66 ± .10 .73 ± .08 .65 ± .07 .72 ± .07 .68 ± .08 .68 ± .04 .70 ± .07
𝑝 = .016 .45 .83 .69 .37 .78 .45 .81 .83 .98 .37 .82 .00 .59 .81 .69 .33 .82 .00 .32 .86 .69 .48 .81 .73 .42 .82 .46 .47 .82 .53
Germany .72 ± .10 .64 ± .09 .29 ± .12 .88 ± .11 .69 ± .09 .77 ± .09 .73 ± .10 .83 ± .07 .67 ± .16 .69 ± .17
𝑝 = .031 .68 .76 .01 .42 .81 .01 .42 .32 .00 .82 .93 .66 .50 .84 .00 .54 .88 .54 .47 .85 .00 .82 .88 .00 .55 .76 .08 .58 .78 .15
Italy .55 ± .12 .76 ± .09 .63 ± .08 .78 ± .09 .95 ± .02 .73 ± .10 .53 ± .13 .74 ± .08 .67 ± .10 .71 ± .13
𝑝 = .047 .57 .55 .49 .74 .81 .12 .69 .63 .52 .73 .83 .00 .92 .96 .94 .66 .78 .00 .50 .54 .42 .74 .74 .63 .66 .70 .31 .69 .73 .39
Poland .76 ± .08 .83 ± .05 .38 ± .11 .85 ± .08 .83 ± .05 .93 ± .05 .73 ± .08 .83 ± .07 .74 ± .15 .77 ± .16
𝑝 = .016 .66 .84 .00 .82 .89 .01 .48 .52 .00 .73 .91 .44 .80 .91 .00 .89 .95 .88 .44 .86 .00 .82 .87 .00 .68 .83 .06 .71 .84 .17
U.S.A. I .83 ± .06 .65 ± .08 .47 ± .14 .81 ± .07 .65 ± .15 .67 ± .09 .91 ± .03 .89 ± .03 .71 ± .13 .74 ± .14
𝑝 = .016 .76 .87 .59 .45 .79 .49 .35 .61 .35 .71 .89 .00 .40 .80 .58 .38 .83 .00 .84 .94 .73 .87 .91 .68 .56 .81 .38 .60 .83 .43
U.S.A. II .81 ± .08 .67 ± .06 .53 ± .15 .84 ± .10 .75 ± .10 .70 ± .08 .86 ± .05 .94 ± .02 .74 ± .11 .76 ± .12
𝑝 = .016 .74 .86 .65 .49 .80 .31 .50 .63 .41 .76 .91 .00 .57 .84 .75 .43 .84 .00 .72 .92 .59 .93 .96 .82 .60 .83 .39 .64 .85 .44
Pooled .83 ± .06 .87 ± .03 .66 ± .08 .90 ± .04 .85 ± .04 .88 ± .05 .81 ± .10 .92 ± .03 .84 ± .08
𝑝 = .148 .77 .86 .66 .87 .91 .03 .59 .67 .71 .87 .95 .01 .81 .89 .65 .83 .94 .01 .64 .88 .73 .91 .93 .65 .79 .88 .43
Pooled+ .88 ± .05 .94 ± .03 .82 ± .09 .96 ± .02 .94 ± .04 .94 ± .04 .92 ± .03 .96 ± .02 .92 ± .04
𝑝 = .195 .83 .91 .77 .94 .95 .70 .76 .78 .96 .94 .98 .64 .91 .95 .90 .92 .97 .65 .86 .95 .84 .95 .96 .80 .89 .93 .78

7.2 Pooled Out-Context Experiment (H2)
The performance of Pooled Out-Context models is reported in the Pooled row of Table 4. The experiment concerns the resulting models' robustness, i.e., if a model trained on multiple contexts adapts well to unseen contexts. We are especially interested if such a model adapts better than the models trained on single contexts.
The results suggest that training on multiple contexts leads to models that are robust and perform better than the models trained on a single context. The multi-context models outperform the best single extra-context model baseline in 7 out of 8 cases. The p = 0.148 needs to be understood in terms of the small number of samples (contexts) and competitiveness of the baseline.
Interestingly, the Pooled Out-Context models even appear to be competitive with several In-Context models (Canada, Czech R., Germany, U.S.A. II). The overall average F1-scores are often quite high (over 0.80 or 0.90). This is a surprising outcome considering the fact that no data from the context on which a model is evaluated is used during training.

7.3 Pooled With In-Context Experiment (H3)
The performance of the Pooled with In-Context model is reported in the Pooled+ row of Table 4. This experiment models a scenario where a sample of labeled data from the target context is available. The question is whether combining In-Context data with data from other contexts leads to improved performance.
The results appear to suggest that pooling the target context with data from other contexts does lead to improved performance. In case of 3 out of the 8 contexts (Canada, Czech R., and U.S.A. I) the improvement is clear and substantial across all the three classes (Background, Analysis, Outcome). For three additional contexts (Germany, Poland, and U.S.A. II), the performance also improved in terms of overall (micro) F1-score, but took a slight hit for the challenging Outcome class. With respect to the two remaining contexts (France, Italy) the overall performance of the pooled models is lower than that of the In-Context models. As in the previous experiment, the p = 0.195 needs to be understood in terms of the small number of samples (contexts) and high competitiveness of the In-Context baseline.


Figure 5: Distribution of labels across datasets. The X-axis


corresponds to unique cases, whereas the Y-axis corresponds
to the normalized length of the cases. The colors correspond
to the different labels of the sentences (Background, Analy-
sis, Outcome).

Figure 4: Average LASER embeddings for each case document, projected to 2-dimensional space using Principal Component Analysis.

8 DISCUSSION
It appears that the multilingual sentence embeddings generated by the LASER model excel at capturing the semantics of a sentence. A model trained on a single context could in theory capture the specific vocabulary used in that context. This would almost certainly lead to poor generalization across different contexts. However, the performance statistics we observed when transferring the Out-Context models to other contexts suggest that the sentence embeddings provide a representation of the sentences that enable the model to learn aspects of the meaning of a sentence, rather than mere surface linguistic features.

The results clearly point to certain relationships where contexts within the same or related languages appear to work well together, e.g., {Canada, U.S.A. I, and U.S.A. II} or {Czech R. and Poland}. This could indicate that the multi-lingual embeddings work better when the language is the same or similar. It could also point to similarities in legal traditions, e.g. the use of somewhat similar sentences to indicate a transition from one type of section to the next. Note that we removed headings, which means that explicit cues could not be relied on by the models we trained on such transformed documents. Finally, the cause could also be topical (domain) similarity of the contexts (e.g., both the U.S.A. contexts deal with employment law). Also note that the above are just possible explanations. We did not perform feature importance analysis on the models.

To gain insight into this phenomenon, we visualized the relationships among the contexts on a document level. We first calculated the average sentence embedding for each document. This yielded 1024-dimensional vectors for 807 documents representing their semantics. We arranged the resulting vectors in a matrix (1024 × 807) and performed a Principal Component Analysis (PCA) reducing the dimensionality of the document vectors to 2. This operation enabled a convenient visualization shown in Figure 4.
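The document-level projection described above can be sketched as follows. This is an illustrative reconstruction only, not the authors' released code; random vectors stand in for the actual LASER sentence embeddings, and NumPy and scikit-learn are assumed to be available.

# Illustrative reconstruction of the document-level projection: average each
# document's sentence embeddings into a single vector, then reduce to 2-D
# with PCA. Random vectors stand in for the real LASER embeddings.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# e.g. 807 documents, each with a variable number of 1024-dimensional
# sentence embeddings (as a model such as LASER would produce).
documents = [rng.normal(size=(int(rng.integers(20, 200)), 1024))
             for _ in range(807)]

# One 1024-dimensional vector per document: the mean of its sentence vectors.
doc_vectors = np.stack([doc.mean(axis=0) for doc in documents])   # (807, 1024)

# Project to two dimensions for plotting (cf. Figure 4).
coords = PCA(n_components=2).fit_transform(doc_vectors)           # (807, 2)
print(coords.shape)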


Overall, the cases from the same contexts appear to cluster together. This is expected as the documents written in the same language, having similar topics, or sharing some similarities due to legal traditions, are likely to map to vectors that are closer. The documents from both the U.S.A. contexts occupy the same region in the embedding space. This is not surprising as they come from the same jurisdiction, are written in the same language, and deal with similar topics (employment law). The Canadian cases, which are also in English, occupy a nearby space. This could be linked to the language as well as likely similarities in legal traditions. French, German, and Italian documents occupy the middle space; they are closer to the English documents than those from the Czech R. and Poland. Interestingly, the Czech and Polish documents occupy almost the same space. The Polish context focuses on the rule of law while the Czech one is supposed to be more general. As the latter deals with the decisions of the top-tier courts (one of them Constitutional), it is possible that the topics substantially overlap. Moreover, Poland and the Czech R. share similar legal traditions and languages from the same family (Slavic), so the close proximity of the documents in the embedding space might not be unexpected. Finally, German, French, and Canadian cases occupy wider areas than documents from other contexts. This could be due to their lack of focus on specific legal domains.

We observed a peculiar phenomenon where the Out-Context models trained on the German and Polish contexts failed to detect Outcome sentences on the six remaining contexts and vice-versa. The cause is readily apparent from the visualization shown in Figure 5. Each segment of the figure depicts a spatial distribution of sentences color coded with their labels across documents for a particular context. As can be seen, the cases typically follow a pattern of a contiguous Background segment, followed by a long Analysis section. The several Outcome sentences are placed at the very end of the documents. In the Polish and German decisions, however, the Outcome sentences come first. The GRU models we use rely on the structure as well as semantics in making their predictions. As we can see, a model trained exclusively on cases that begin with a Background might therefore have difficulties correctly identifying outcome sections at the beginning, and vice-versa. However, as we will see below, a model trained with data featuring both structures can learn to correctly identify the correct structure based on the semantics of the sentences.

The model trained on the French context appears to perform better on detecting Outcome sentences than models trained on other contexts. This is somewhat surprising as the French model's overall performance is among the weakest (e.g. compare the Czech model's average F1 = 0.77 to the F1 = 0.68 of the French model). Again, Figure 5 provides an insight into why this happens. The French context is the only one where the count of Outcome sentences is comparable to those of the other two categories. For all the other contexts, the Outcome sentences are heavily underrepresented. This reveals an interesting direction for future work where the use of re-sampling may yield models with better sensitivity for identifying Outcome sentences.
In two instances models trained on a single context under-performed the random baseline. The model trained on the Germany context achieved the average F1 = 0.29 when applied to the French context (Random F1 = 0.33). As the model trained on Polish data also performed poorly on the French context (F1 = 0.38) the cause appears to be the inability of the two models to detect the Outcome sentences at the end (discussed above). As the Outcome sentences are heavily present in the French context, this problem manifests in a score lower than the random baseline. The second instance is the model trained on the Italy context applied to the U.S.A. I data (F1 = 0.53 versus F1 = 0.58 on Random). Here, the cause appears to be different. Note that the Italian context likely has a very specific notion of Outcome sentences (F1 = 0.94 on Italy→Italy). It appears that many Analysis sentences from the U.S.A. I context were labeled as Outcome by the model. Summary judgments often address multiple legal issues with their own conclusions which could have triggered the model to label such sentences as Outcome.
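Comparisons such as the ones above rest on per-class and overall (micro) F1 scores, including a random-guessing baseline. The following sketch shows how such scores can be computed with scikit-learn; the label distribution and predictions are made up for illustration and do not reproduce the paper's figures.

# Illustrative only: per-class and overall (micro) F1 for a uniformly random
# baseline on an imbalanced three-class problem; all labels are made up.
import random
from sklearn.metrics import f1_score

classes = ["Background", "Analysis", "Outcome"]
rng = random.Random(0)

y_true = [rng.choices(classes, weights=[4, 10, 1])[0] for _ in range(500)]
y_pred = [rng.choice(classes) for _ in range(500)]   # random-baseline guesses

per_class = f1_score(y_true, y_pred, labels=classes, average=None)
micro = f1_score(y_true, y_pred, average="micro")

for cls, score in zip(classes, per_class):
    print(f"{cls:<10} F1 = {score:.2f}")
print(f"micro F1 = {micro:.2f}")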
An important finding is the performance of the Pooled Out-Context model (H2) shown in the Pooled row of Table 4. The experiment simulates training of a model on several contexts, and then applying it to an unseen context. The Pooled Out-Context models, having no access to the data from a target context, reliably outperform the best single Out-Context models. They appear to be competitive with several In-Context models. These results are achieved with a fairly small dataset of 807 cases. We expect that expanding the dataset would lead to further improved performance.

The Pooled with In-Context experiment (H3) models the situation where data from a target context is available in addition to labeled data from other contexts. Our experiments indicate that the use of data from other contexts (if available) in addition to data from the target context is preferable to the use of the data from the target context only. This is evidenced by the improved performance of the models trained on the pooled contexts over the single In-Context models. The models have an interesting property of being able to identify the Outcome sentences with effectiveness comparable to (or higher than) the models trained on the same context only. This holds for all the contexts, except Poland where the Outcome performance is a bit lower (0.65 vs. 0.88). This indicates that the model is able to learn the two possible modes of the Outcome section placement. It successfully distinguishes cases where the section is at the beginning from the cases where the Outcome sentences are found toward the end.

The inclusion of the In-Context data in the pooled data leads to a remarkable improvement over only using the pooled Out-Context data. The magnitude of the improvement highlights the importance of including such data in the training. We envision that the models trained on different contexts used in combination with high-speed similarity annotation frameworks [35, 36] could enable highly cost efficient annotation in situations where resources are scarce. Perhaps, adapting a model to an entirely new context could be as simple as starting with a model trained on other contexts, and spending a few hours correcting the misconceptions of the model to teach it the particularities of the new context.

9 CONCLUSIONS
We analyzed the use of multi-lingual sentence embeddings in sequence labeling models to enable transfer across languages, jurisdictions, legal systems (common and civil law), and domains. We created a new type schema for functional segmentation of adjudicatory decisions and used it to annotate legal cases across eight different contexts. We found that models generalize beyond the contexts they were trained on and that training the models on multiple contexts increases their robustness and improves the overall performance when evaluating on previously unseen contexts. We also found that pooling the training data of a model with data from additional contexts enhances its performance on the target context. The results are promising in enabling re-use of annotated data across contexts and creating generalizable and robust models. We release the newly created dataset (807 documents with 89,661 annotated sentences), including the annotation schema and the code used in our experiments, to the public.

This work suggests a promising path for the future of international collaboration in the field of AI & Law. While previous annotation efforts have typically been limited to a single context, the experiments presented here suggest that researchers can work together by annotating cases from many different contexts at the same time. Such a combined effort could aid researchers in creating models that perform well on the data from the context they care about, while at the same time helping other groups train even better models for other contexts. We encourage these research directions and hope to form such collaborations under the Lex Rosetta project.

10 FUTURE WORK
The application of multi-lingual sentence embeddings to functional segmentation of case law across different contexts yielded promising results. At the same time, the work is subject to limitations and leaves much room for improvement. Hence, we suggest several directions for future work:


• Extension of the datasets from different contexts used in this work beyond ~100 documents per context.
• Annotation of data from contexts beyond the eight used here (multi-lingual models support close to 100 languages).
• Analysis of automatic detection of Introductory Summary, Headings, and Out of Scope.
• Identification and investigation of other tasks applicable across different contexts.
• Evaluation of the application of other multilingual models (e.g., those mentioned in Section 2).
• Exploring other transfer learning strategies beyond simple data pooling, such as the framework proposed in [26].
• Using multi-lingual models for annotation tasks with high-speed annotation frameworks, such as [35, 36].
• Performing the transfer across contexts with related (but different) tasks, such as in [28].
• Further exploring the differences in the distribution of the multilingual embeddings for purposes of comparing and analyzing domains, languages, and legal traditions.

ACKNOWLEDGMENTS
Hannes Westermann, Karim Benyekhlef, and Kevin D. Ashley would like to thank the Cyberjustice Laboratory at Université de Montréal, the LexUM Chair on Legal Information and the Autonomy through Cyberjustice Technologies (ACT) project for their support of this research. Kevin D. Ashley also thanks the Canadian Legal Information Institute for providing the corpus of legal cases. Matthias Grabmair thanks the SINC GmbH for supporting this research. Jakub Harašta and Tereza Novotná acknowledge the support of the ERDF project “Internal grant agency of Masaryk University” (No. CZ.02.2.69/0.0/0.0/19_073/0016943).

REFERENCES
[1] Tommaso Agnoloni, Lorenzo Bacci, Enrico Francesconi, P Spinosa, Daniela Tiscornia, Simonetta Montemagni, and Giulia Venturi. 2007. Building an ontological support for multilingual legislative drafting. Frontiers in Artificial Intelligence and Applications 165 (2007), 9.
[2] Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively Multilingual Neural Machine Translation. In NAACL-HLT, Vol. 1 (Long and Short Papers). 3874–3884.
[3] Gianmaria Ajani, Guido Boella, Luigi Di Caro, Livio Robaldo, Llio Humphreys, Sabrina Praduroux, Piercarlo Rossi, and Andrea Violato. 2016. The European Legal Taxonomy Syllabus: A multi-lingual, multi-level ontology framework to untangle the web of European legal terminology. Applied Ontology 11, 4 (2016).
[4] Sai Saket Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee. 2020. Deep learning models for multilingual hate speech detection. arXiv preprint arXiv:2004.06465 (2020).
[5] Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7 (2019), 597–610.
[6] Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and Adam Wyner. 2019. Identification of rhetorical roles of sentences in Indian legal judgments. In JURIX 2019, Vol. 322. IOS Press, 3.
[7] Guido Boella, Luigi Di Caro, Michele Graziadei, Loredana Cupi, Carlo Emilio Salaroglio, Llio Humphreys, Hristo Konstantinov, Kornel Marko, Livio Robaldo, Claudio Ruffini, et al. 2015. Linking legal open data: breaking the accessibility and language barrier in European legislation and case law. In ICAIL 2015. 171–175.
[8] Paul Boniol, George Panagopoulos, Christos Xypolopoulos, Rajaa El Hamdani, David Restrepo Amariles, and Michalis Vazirgiannis. 2020. Performance in the Courtroom: Automated Processing and Visualization of Appeal Court Decisions in France. In Proceedings of the Natural Legal Language Processing Workshop 2020.
[9] Karl Branting, Brandy Weiss, Bradford Brown, Craig Pfeifer, A Chakraborty, Lisa Ferro, M Pfaff, and A Yeh. 2019. Semi-supervised methods for explainable legal prediction. In ICAIL 2019. 22–31.
[10] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP 2014.
[11] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In ACL 2020. 8440–8451.
[12] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, Jan (2006), 1–30.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL 2019, Volume 1 (Long and Short Papers). 4171–4186.
[14] Luca Dini, Wim Peters, Doris Liebwald, Erich Schweighofer, Laurens Mommers, and Wim Voermans. 2005. Cross-lingual legal information retrieval using a WordNet architecture. In ICAIL 2005. 163–167.
[15] Atefeh Farzindar and Guy Lapalme. 2004. LetSum, an Automatic Text Summarization system in Law field. JURIX 2004.
[16] Jorge González-Conejero, Pompeu Casanovas, and Emma Teodoro. 2018. Business Requirements for Legal Knowledge Graph: the LYNX Platform. In TERECOM@JURIX 2018. 31–38.
[17] Jakub Harašta, Jaromír Šavelka, František Kasl, and Jakub Míšek. 2019. Automatic Segmentation of Czech Court Decisions into Multi-Paragraph Parts. Jusletter IT 4, M (2019).
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[19] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[20] Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 2009. 462 Machine Translation Systems for Europe. In Proceedings of the Twelfth Machine Translation Summit. Association for Machine Translation in the Americas, 65–72.
[21] Guokun Lai, Barlas Oguz, Yiming Yang, and Veselin Stoyanov. 2019. Bridging the domain gap in cross-lingual document classification. arXiv preprint arXiv:1909.07009 (2019).
[22] Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating Cross-lingual Extractive Question Answering. In ACL 2020. 7315–7330.
[23] D.N. MacCormick, R.S. Summers, and A.L. Goodhart. 2016. Interpreting Precedents: A Comparative Study. Taylor & Francis.
[24] Rohan Nanda, Llio Humphreys, Lorenzo Grossio, and Adebayo Kolawole John. 2020. Multilingual Legal Information Retrieval System for Mapping Recitals and Normative Provisions. In Proceedings of JURIX 2020. IOS Press, 123–132.
[25] Alina Petrova, John Armour, and Thomas Lukasiewicz. 2020. Extracting Outcomes from Appellate Decisions in US State Courts. In JURIX 2020. 133.
[26] Jaromír Šavelka and Kevin D Ashley. 2015. Transfer of predictive models for classification of statutory texts in multi-jurisdictional settings. In ICAIL 2015. 216–220.
[27] Jaromír Šavelka and Kevin D Ashley. 2018. Segmenting US Court Decisions into Functional and Issue Specific Parts. In JURIX 2018. 111–120.
[28] Jaromír Šavelka, Hannes Westermann, and Karim Benyekhlef. 2020. Cross-Domain Generalization and Knowledge Transfer in Transformers Trained on Legal Data. In ASAIL@JURIX 2020.
[29] Párai Sheridan, Martin Braschlert, and Peter Schauble. 1997. Cross-language information retrieval in a Multilingual Legal Domain. In International Conference on Theory and Practice of Digital Libraries. Springer, 253–268.
[30] Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manual Carrasco-Benitez, Patrick Schluter, Marek Przybyszewski, and Signe Gilbro. 2014. An overview of the European Union’s highly multilingual parallel corpora. Language Resources and Evaluation 48, 4 (2014), 679–707.
[31] Kyoko Sugisaki, Martin Volk, Rodrigo Polanco, Wolfgang Alschner, and Dmitriy Skougarevskiy. 2016. Building a Corpus of Multi-lingual and Multi-format International Investment Agreements. In JURIX 2016.
[32] Linyuan Tang and Kyo Kageura. 2019. Verifying Meaning Equivalence in Bilingual International Treaties. In JURIX 2019. 103–112.
[33] Vern R Walker, Krishnan Pillaipakkamnatt, Alexandra M Davidson, Marysa Linares, and Domenick J Pesce. 2019. Automatic Classification of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning. In ASAIL@ICAIL 2019.
[34] Hannes Westermann, Jaromír Šavelka, and Karim Benyekhlef. 2021. Paragraph Similarity Scoring and Fine-Tuned BERT for Legal Information Retrieval and Entailment. In New Frontiers in Artificial Intelligence (Lecture Notes in Computer Science). Springer International Publishing.
[35] Hannes Westermann, Jaromír Šavelka, Vern R Walker, Kevin D Ashley, and Karim Benyekhlef. 2019. Computer-Assisted Creation of Boolean Search Rules for Text Classification in the Legal Domain. In JURIX 2019, Vol. 322. IOS Press, 123.
[36] Hannes Westermann, Jaromír Šavelka, Vern R Walker, Kevin D Ashley, and Karim Benyekhlef. 2020. Sentence Embeddings and High-Speed Similarity Search for Fast Computer Assisted Annotation of Legal Documents. In JURIX 2020, Vol. 334. IOS Press, 164.
[37] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196–202.
[38] Huihui Xu, Jaromír Šavelka, and Kevin D Ashley. 2020. Using Argument Mining for Legal Text Summarization. In JURIX 2020, Vol. 334. IOS Press.
[39] Shudong Yang, Xueying Yu, and Ying Zhou. 2020. LSTM and GRU Neural Network Performance Comparison Study: Taking Yelp Review Dataset as an Example. In IWECAI 2020. IEEE, 98–101.
[40] Vladimir Zhebel, Denis Zubarev, and Ilya Sochenkov. 2020. Different Approaches in Cross-Language Similar Documents Retrieval in the Legal Domain. In International Conference on Speech and Computer. Springer, 679–686.
[41] Linwu Zhong, Ziyi Zhong, Zinian Zhao, Siyuan Wang, Kevin D Ashley, and Matthias Grabmair. 2019. Automatic summarization of legal decisions using iterative masking of predictive sentences. In ICAIL 2019. 163–172.

Converting Copyright Legislation into Machine-Executable
Code: Interpretation, Coding Validation and Legal Alignment
Alice Witt
Law School
Queensland University of Technology
Brisbane, Queensland, Australia
ae.witt@qut.edu.au

Anna Huggins
Law School
Queensland University of Technology
Brisbane, Queensland, Australia

Guido Governatori
Data61
CSIRO
Dutton Park, Queensland, Australia

Joshua Buckley
Data61
CSIRO
Dutton Park, Queensland, Australia

ABSTRACT
A critical challenge in “Rules as Code” (“RaC”) initiatives is enhancing legal accuracy. In this paper, we present the preliminary results of a two-week, first of its kind experiment that aims to shed light on how different legally trained people interpret and convert Australian Commonwealth legislation into machine-executable code. We find that coders collaboratively agreeing on key legal terms, or atoms, before commencing independent coding work can significantly increase the similarity of their encoded rules. Participants nonetheless made a range of divergent interpretive choices, which we argue are most likely due to: (1) the complexity of statutory interpretation, (2) encoded provisions having varying levels of granularity, and (3) the functionality of our coding language. Based on these findings, we draw an important distinction between processes for technical validation of encoded rules, which focus on ensuring rules adhere to select coding languages and conventions, and processes of legal alignment, which we conceptualise as enhancing congruence between the encoded provisions and the true meaning of the statutory text in line with the modern approach to statutory interpretation. We argue that these processes are distinct but both critically important in enhancing the accuracy of encoded rules. We conclude by underlining the need for multi-disciplinary expertise across specific legal subject matters, statutory interpretation and technical programming in RaC initiatives.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; • Applied computing → Law.

KEYWORDS
Rules as code, machine-executable code, statutory interpretation, legal alignment, technical processes

ACM Reference Format:
Alice Witt, Anna Huggins, Guido Governatori, and Joshua Buckley. 2021. Converting Copyright Legislation into Machine-Executable Code: Interpretation, Coding Validation and Legal Alignment. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466083

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06...$15.00
https://doi.org/10.1145/3462757.3466083

1 BACKGROUND
In recent years, there has been significant growth in the “Rules as Code” (“RaC”) movement, a label for diverse initiatives that re-evaluate how, when, and for whom government rules are made [24]. There are two main RaC approaches, the first of which involves converting existing regulation, including statutes, into machine-executable code: “a coded representation of the actual rules in the legislation, written in a computer language, so that computers can read it and then use it to carry out programs” [29, p 27]. An application of this approach is regulatory technology (“RegTech”) [31] that can help firms comply and stay up to date with the rules governing their commercial activities [24, 34]. The second approach involves “co-drafting” regulation in both natural language and machine-consumable format, one that can “enable computers to model the effect of the law” [3, p. 76], at the same time. A distinguishing feature of co-drafting is the potential for “digital first” rules [33], which not only represent an output, but also a “strategic and deliberate approach to rulemaking” [24, p. 81]. RaC is therefore related to, yet separate from, “computational law”, which investigates whether regulation can and should be represented in computer code, among other lines of inquiry [30], and “automated decision making”, which refers to decisions made by automated means with varying levels of human involvement [18].

While decades of research inform these overlapping areas, and governments including New Zealand, Australia, France and Canada have made significant inroads with the practical RaC movement [27, 30], challenges persist in ensuring that encoded rules are transparent, traceable, appealable and legally accurate. It can be particularly difficult to enhance the legal accuracy of digitised legislation


in a way that aligns with the correct legal interpretation of the se- (the text) taking into account their context and purpose [26]. We
lect statute(s). In the Australian legal system, this challenge is made conclude this paper, in Section 6, by highlighting the need in RaC
more complex by the strict separation of judicial, legislative and ex- initiatives for multi-disciplinary expertise across specific legal sub-
ecutive powers in the Australian Constitution, under which only the ject matters, statutory interpretation and technical programming.
judiciary can authoritatively interpret the meaning of statutes [18]. We also identify important areas for future research, such as testing
A result is that other interpreters, including individual coders and our hypothesis with a larger number of participants, and further
RegTech companies, are expected to mirror the courts’ approach to exploring what the separate yet interrelated processes for technical
interpreting statutes [5]. coding validation and legal alignment might entail in practice.
Against this backdrop, we commenced a multi-disciplinary and
collaborative research project that seeks to identify the legal and 2 COPYRIGHT LAW
coding challenges of converting Australian Commonwealth legisla-
Copyright is a body of intellectual property law that “confers rights
tion into a machine-executable format. Our research largely falls
in relation to the reproduction and dissemination of material that
within the first RaC approach given that we are focusing on legisla-
expresses ideas or information” [8, p. 7]. For material to be protected
tion that already exists. In this paper, as part of the broader research
under Australian copyright law, it must: (1) fall under one of the
project, we present a first of its kind experiment that aims to shed
established categories of subject matter, including literary, dramatic,
light on how different legally trained people (T = 3 participants
musical or artistic "works" (known as “Part III (of the CA) works"),
over a two-week period) interpret and convert select provisions of
which we focus on in this experiment; and (2) be sufficiently con-
the Copyright Act 1968 (Cth) (“the CA”, “the Act") into computer
nected to Australia (see ss 10 and 32 of the CA). Additionally, Part
code. By examining the choices that interpreters make when con-
III works must be: (3) recorded in “material form” and (4) “original"
verting existing legislation into machine-executable code, we aim
(s 32 of the CA) [8, ch. 6]. When the relevant criteria for subsistence
to provide critical insights into the coding, statutory interpretation
of copyright are satisfied, the copyright owner has certain exclusive
and other issues that arise in the encoding process. This can, in
rights (see, e.g., Divisions 1 and 2, Part III of the CA), which a party
turn, facilitate new understandings of how stakeholders can pro-
can directly or indirectly infringe [8, pp. 268, 285].
mote alignment between the languages and logics of statutes and
We focus on copyright law for several reasons, including our
encoded rules for the RaC movement more broadly.
research team having some copyright law expertise, and because
This paper proceeds in five sections. In Section 2, we provide a
this body of law is widely applicable to a range of stakeholders but
brief overview of Australian copyright law and, in Section 3, we
often difficult for lay audiences to understand [7]. In practical terms,
explain the Defeasible Deontic Logic and Turnip encoding soft-
our research could be of benefit to large technology companies,
ware that participants used to convert select provisions of the CA
galleries, libraries, educational institutions and archives that often
into computer code. Next, in Section 4, we explain our experiment
deal with copyright issues in bulk [28]. It could also potentially
design and methods. Then, in Section 5, we outline our prelimi-
benefit a large number of small content creators, who routinely
nary results that support our hypothesis that coders collaboratively
rely on existing material [2]. In research terms, by focusing on law
agreeing on key atoms, or legal terms, before commencing inde-
that already exists, we are generating new knowledge particularly
pendent coding work increases the similarity of their coded output.
relevant to the first RaC approach. This knowledge can in turn
Despite a significant increase in the average similarity of atoms
inform the RaC movement more broadly.
and, to a lesser extent, rules in Week 2, participants made a range
of divergent interpretive choices, most likely due to: (1) the com-
plexity of statutory interpretation, (2) encoded provisions having 3 FROM LEGISLATION TO ENCODED RULES
varying levels of granularity, and (3) the functionality of our coding To convert select provisions of the CA into machine-executable
language, Turnip. These differences underline the complexity of code, participants used a language and program (reasoner) called
attempting to reproduce the languages and logics of a statutory “Turnip”, which is based on Defeasible Deontic Logic (“DDL”) [14].
text in machine-executable code. The presentation of DDL and Turnip in this paper is a based on
Overall, we argue that processes for technical coding validation [12, 14]. DDL is an extension of Defeasible Logic [1], which refers to
and legal alignment are distinct but both critically important in an interest that can be defeated, and Deontic Logic, which pertains
enhancing the accuracy of RaC. On the one hand, processes for to “the study of those sentences in which only logical words and
technical validation can be automated and/or manual, and aim to normative expressions occur essentially. Normative expressions in-
ensure encoded rules adhere to select coding languages and conven- clude the words ‘obligation’, ‘duty’, ‘permission’, ‘right’, and related
tions. On the other hand, a process of legal alignment is concerned expressions”[10, p. 1]. Defeasible Deontic Logic therefore extends
with enhancing congruence between the encoded rules and the true defeasible logic “by adding deontic and other modal operators” [16,
meaning of the select legislation in line with the modern approach p. 47]. More specifically, DDL enables coders to integrate reasoning
to statutory interpretation, an undertaking that heavily relies on with exceptions; to model deontic concepts, such as obligations [O],
human judgment. This approach, as articulated and applied by the permissions [P], prohibitions [F], and exemptions [E]; and to rep-
High Court of Australia in Project Blue Sky v Australian Broadcast- resent both definitional norms (also known as “constitutive rules”)
ing Corporation (1998) 194 CLR 355, “requires a combined exercise and prescriptive norms [12, p. 178], all of which are present in the
involving analysis of the text, context and purpose (or policy) of the CA. Coders can also use a non-classical compensation operator to
statute in question”[20, p. 116]. The key task of statutory interpre- model obligations in force after a (potential) violation [14, 15]. De-
tation is therefore to find the legal meaning of the statutory words feasible Deontic Logic has been applied in several studies that aim


to convert different types of Australian regulation into computer the superiority relation, as follows:
code [12, 13, 19]. 𝑠 : 𝐵 1, . . . , 𝐵𝑚 ⇒ ¬𝐶
A rule in DDL takes the form of an IF . . . THEN . . . statement
in which “IF” represents the condition(s) of the rule and "THEN" The reasoning mechanism of DDL, which is based on an argumen-
models the effect of the norm [11, p. 284]. Coders can divide rules tation structure, extends the proof theory of Defeasible Logic [1].
into constitutive rules that, for instance, define important terms To prove a conclusion, there must be an applicable rule for the
in a normative document (e.g., different types of regulation) or said conclusion., and a rule is applicable if all the elements of the
outline condition(s) (i.e., the IF part) that might give rise to legal antecedent of the rule hold (i.e., have been proved). All counter-
requirements (i.e., the THEN part, such as obligations, permissions arguments must also be rebutted or defeated. A counter-argument
and prohibitions). Rules can be further classified according to their is a rule for a conflicting conclusion; that is, the negation of the
strength: specifically, as strict rules, defeasible rules and defeaters. A conclusion, or in case of deontic conclusions, conflicting deontic
strict rule is a rule in the classical sense. Defeasible rules are rules modalities. A counter-argument is rebutted if its premise(s) do not
subject to exceptions: the conclusion of the rule holds unless there hold, or a coder proves that the premise(s) do not hold, and the
are other (applicable) rules (for the same conclusion) that defeat counter-argument is defeated when the rule is weaker than an
the rule. Defeaters are a special kind of rule, they do not support applicable rule for the conclusion. Having outlined the basics of
conclusions, but prevent the conclusion to the opposite [1, p. 257]. DDL, as it applies to this experiment, we now turn to our encoding
For more information about these classifications, see [14]. software.
As previously noted, DDL enables coders to represent both defi-
nitional norms, also known as “constitutive rules”, and prescriptive 3.1 Turnip
norms. Constitutive rules are those in standard defeasible logic [12, Turnip1 is a modern (typed) functional programming implementa-
p. 179]. Normative rules can be prescriptive, such as rules estab- tion of Defeasible Deontic Logic that is written in the programming
lishing that something is obligatory or forbidden, and permissive, language Haskell2 . As previously noted, this software facilitates
including rules establishing that certain activities are explicitly the conversion of norms (e.g., different types of regulation) into
permitted, derogating rules for prohibitions or obligations to the computer code.
contrary. The standard form of normative rules follows: Turnip requires coders to define all terms before using them in a
set of rules. The basic structure for defining an individual term is:
𝑟 : 𝐴1, . . . , 𝐴𝑛 ↩→□ 𝐶 1 ⊙ · · · ⊙ 𝐶𝑚
Type Name description_string
In this rule, 𝐴1, . . . , 𝐴𝑛 are the condition(s) of the rule expressed
Type is defined in the following table:
as literals or deontic literals (e.g., an obligation [O] or permission
[P]), □ is a deontic modality, and the 𝐶𝑖 are literals (𝐶 1 ⊙ · · · ⊙ 𝐶𝑚 Type Keyword Sample Values
is a “reparation chain"). ↩→ is a placeholder for the type of rule, Boolean Atom True, False
and → stands for a strict rule, ⇒ for a defeasible rule, and { for String String "anything in double quotation marks"
a defeater. The mode of the rule □ determines the scope of the Numeric Numeric 123.456,-5,0
conclusion. When the mode is [O], the meaning of the right-hand Date Date 1995-02-01
side of the rule is that when the rule applies [O]𝐶 1 is in force (i.e., DateTime DateTime 1995-02-01T13:35
𝐶 1 is obligatory). If the rule is violated (i.e., ¬𝐶 1 holds), then [O]𝐶 2 Duration Duration 10w, 1d, 5h, 30m
is in force (𝐶 2 is obligatory, and 𝐶 2 compensates for the violation In the experiment that is the subject of this paper, participants
of [O]𝐶 1 ). We can repeat this reasoning when [O]𝐶 2 is potentially largely used “atoms”, which correspond to literals in DDL and rep-
violated [14, 15]. resent (atomic Boolean) statements that can be either true of false
DDL is a type of skeptical non-monotonic formalism, which (e.g., Atom person "is a person"). The description string, which is
means that when there are applicable rules with conflicting con- the optional text in double quotation marks (" "), defines the atom in
clusions (i.e., 𝐴 and ¬𝐴), the logic does not provide a conclusion. natural language. Coders can also use arithmetic operators (i.e., +,
To solve conflicts, DDL employs a so-called superiority relation: a -, *, /,) for numeric terms and values; comparison operators (i.e.,
binary relation over rules establishing the relative strength of rules. ==, !=, <, <=, >, >=) to create Boolean types from numeric and
For example, if we have an applicable rule 𝑟 for 𝐴 and a second duration terms; and conversion functions (e.g., interval, toDays,
applicable rule 𝑠 for ¬𝐴, we can use 𝑟 > 𝑠 to indicate that 𝑟 is after) that can operate on dates, times and duration terms. Con-
stronger than 𝑠. Accordingly, 𝑟 defeats 𝑠 when both apply, solves sider, for example, the interval function that takes two dates as
the conflict and allows a coder to conclude for 𝐴. input and returns a duration:
The superiority relation also provides a simple and effective publication . date := 1919 -09 -01
mechanism to encode exceptions. Consider the following defeasible usage . date := 2010 -12 -03
rule: interval ( usage . date , publication . date ) >= 70 y
𝑟 : 𝐴1, . . . , 𝐴𝑛 ⇒ 𝐶
Here the assignment operator := gives values to two terms of type
We can model an exception to this rule using a second rule (let us date. Then, we use the interval operator to compute the duration
say 𝑠), where the conclusion is the opposite of the conclusion of 1 An online environment to run Turnip rulesets, with samples of the features it offers
𝑟 , and the IF part of s contains the conditions when the exception is available at http://turnipbox.netlify.com/.
holds. We can formalise this second rule, with the instance 𝑠 > 𝑟 of 2 https://www.haskell.org.


(i.e., time elapsed between the two dates), and we compare it with than the project’s shared GitHub4 repository, and could not com-
a given duration (70 years). municate with each other, or members of the broader research team,
Rules also have a basic structure that generally includes a label, in any way over the course of Week 1. Participants could, however,
a condition list, and a conclusion list. For example: raise questions or concerns with the first author (the contact per-
son for the experiment) at any time. Additionally, we instructed
label : condition_list = > conclusion_list
participants to assume that the elements for determining whether
The arrow (=>) determines the type of rule (e.g., strict, defeasible copyright subsists in a work are satisfied (i.e., to assume copyright
or a defeater). It is important to note that rules are designed to subsistence), and to encode the select legislative provisions only.
represent norms: a norm prescribes multiple, simultaneous effects, This means that participants did not encode relevant case law.
and different norms can prescribe the same effect [12, p. 180]. To In Week/Phase 2, participants coded ss 31, 32 and 36 of the CA.
make the work of coding in DDL more efficient, a condition list These provisions outline the nature of copyright in original works,
can include a conjunction (&) or disjunction (|) of Boolean, and original works in which copyright subsists, and infringement by
a conclusion list can be either an assignment, a single Boolean, doing acts comprised in the copyright, respectively. Like in Week
a conjunction of assignments, or a conjunction of Boolean. The 1, participants encoded the select provisions only (i.e., they did not
following example illustrates the equivalence between rules with encode case law), and could raise questions or concerns with the
conjunctions (&) and a disjunction (|) in a condition list: first author at any time. Aside from participants encoding different
provisions in each phase, a choice that we made largely to avoid
A & B => C & D A | B => C
participants becoming overly familiar with the statutory text, the
A & B => C A => C
main difference between the two phases of this experiment is that
A & B => D B => C
Week 2 had an intervention. The intervention was a two-hour com-
Turnip syntax, including negation ~ and numeric and temporal pulsory meeting during which participants collaboratively drafted
expression, ultimately allows for deontic expression. A deontic key atoms for manual coding in a single remote file (“Agreed Atom
expression is based on the combination of one of four deontic File”). The Agreed Atom File, which set out and defined key atoms
modalities: namely, [O], [P], [F], [E] (i.e., Obliged, Permitted, for ss 31, 32 and 36 of the CA, was separate from documents outlin-
Forbidden, Exempt) and an atom. For the modalities, "notice that ing our established coding conventions. After this meeting, which
[F]A is equivalent to [O]~A (and ~[P]A) and [E]A is equivalent to the first author facilitated, participants independently coded the
[P]~A". Given two rule labels, label1 and label2, label1>>label2 select provisions like in Week 1. Participants could refer back to, but
denotes the superiority relation between the rules identified by the not edit, the Agreed Atom File. The purpose of the intervention was
labels. to test our hypothesis that coders collaboratively agreeing on key
Turnip is of course one of several languages and logics that atoms before commencing independent coding work increases the
coders can use to convert different types of regulation into machine- similarity of their coded output (H1). Our null hypothesis was that
executable code [4, 25]. We used this language because it is particu- coders collaboratively agreeing on key atoms before commencing
larly useful for coders attempting to express complex rule structures. independent coding work does not increase the similarity of their
Take, for example, a statutory provision that establishes multiple coded output (H0).
obligations, permissions and at least one prohibition. We argue that In Weeks 1 and 2, participants were allocated up to 7.5 hours
coders can more accurately represent the effects of this provision by to encode the select provisions. For each phase, participants had
using Turnip’s deontic expressions and disjunctive and/or conjunc- approximately one week to complete their 7.5 hours of work, which
tive conditions lists. Thus, although Turnip is not the only encoding they submitted via email in the form of a single coding file (i.e.,
software available, it has clear advantages for the purposes of this each participant had one file for Week 1 and another for Week
experiment. 2). We assigned each participant a pseudonym and deidentified
the participants’ files for both the technical validation and legal
4 EXPERIMENT DESIGN AND METHODS alignment processes, to which we now turn.
We conducted this experiment over a two-week period in late 2020.
The population for this experiment was the pool of legally trained 4.1 Coding Validation Processes
research assistants attached to the broader research project. While An important part of converting legislation into machine-executable
participation was voluntary, in line with our university ethics ap- code is measuring the apparent success of encoded rules in terms
proval3 , participants were paid for their time. We had three partici- of coding validation. From a technical perspective, we argue that
pants in total. an encoded provision is “validated” when it adheres, in a formal
In Week/Phase 1, participants independently coded sections (“ss”) sense, to our select Turnip language and other relevant coding
40, 41, 41A and 42 of the CA, which are among several fair dealing conventions. In practical terms, this means that the code runs, or
exceptions to copyright infringement. These exceptions are those works, and produces a definitive outcome. Coding validation is also
for the purpose of research or study, criticism or review, parody concerned with the degree of internal consistency between coders
or satire, and reporting the news, respectively. By “independent” to ensure that encoded rules for the same piece of legislation work
coding, we mean that participants worked in remote files, rather together.
3 Wereceived ethics approval for this research at the Queensland University of Tech-
nology (QUT Approval 2000000763). 4 https://github.com/about.


Our coding validation process involves automated and manual statutory text, and/or statutory presumptions (e.g., legisla-
analyses of encoded rules. For automated analysis, we created a tion not operating retrospectively), which relate to the scope
program that uses string manipulation to parse each participant’s and effect of legislation rather than the statutory language,
atoms into string arrays and, then, compare the participants’ en- should apply [26, p. 225].
coded rules to find similarities between their approaches. The mea- (6) Return to the provision. Interpreters assign a meaning to the
sure of apparent “similarity” between any two coders is the number provision, and its key words, based on findings from Steps 1
of shared atoms and rules, respectively, between the coders divided to 5. After assigning meaning, interpreters define atoms and
by the average number of total atoms between the two datasets. identify modalities (e.g., a permission [P], prohibition [F],
This automated analysis focused on syntax and not semantic mean- obligation [O] or exemption [E]) and key words/elements/-
ing. We also manually “cleaned” some aspects of the participants’ conditions for conversion into code.
coding files to attempt to enhance the comparability of encoded (7) Acceptance testing of coded provisions. Testing in this con-
rules. For example, we adjusted atom names, which used a mixture text involves coders developing a series of unit tests for
of periods ( . ) and camel case (e.g., camelCase), in line with our select provisions based on case law or examples in explana-
project-specific naming conventions. We also filtered out common tory memoranda. Encoded rules "passing" relevant tests is a
stop words, such as “the” and “a”, normalised the tense of atoms strong indicator of legal alignment.
and, where possible, corrected issues with the structure of certain
rules. At Stage 6, interpreters start to manually convert natural language
legislation into the machine-executable Turnip language. In prac-
4.2 Legal Alignment Processes tice, the Turnip language requires interpreters to convert a statutory
provision into a set of if-then statements or rules (rulesets). The
We draw an important distinction between coding validation, which
online runtime environment (the Turnip reasoner) can take a set of
is principally a technical process, and legal alignment, which we
rules and facts, respectively, and produce a set of results that are
define as the extent to which encoded provisions align with the
what the logic can infer from applying the facts to the rules.
languages and logics of the select statutory text [18]. At the heart
While acceptance testing is Step 7, we recommend that inter-
of legal alignment is the accurate representation of laws and other
preters start testing encoded rules as early as possible, ideally in
regulation in computer code. Our legal alignment processes are
parallel with Stage 6 to optimise legal alignment processes. Devel-
based on the modern approach to statutory interpretation and
oping and applying rigorous acceptance testing can, for example,
incorporate the Turnip coding language. As previously mentioned,
help interpreters to identify edge cases and subtle errors that might
this approach requires interpreters to examine the text, context and
be easily overlooked. While copyright has a doctrinally deep and
purpose of select legislation [20, 26], which we break down into a
rich body of case law[8], participants did not undertake acceptance
7-step process:
testing, principally due to time constraints. Rigorous acceptance
(1) Locate and read a specific statutory provision, and identify key testing for select copyright provisions therefore remains an impor-
words/elements/conditions (known in coding terms as “atoms”). tant topic for future research. We provide select examples of our
(2) Read the provision in the context of the Act as a whole. Inter- subsequent acceptance testing in Section 5.2.
preters refine the legal meaning of key words by interpreting
their meaning within the legislation’s full scope. For example,
is a key word defined in the legislation? Under what part of 4.3 Limitations
the legislation is the provision located? This step includes in- The findings in this paper should be interpreted with some limita-
terpreters reading intertextual legislation: for example, other tions in mind. First, there is a small number of participants (N =
statutes that are referenced in an Act. 3) who encoded different provisions in Week 1 and Week 2. This
(3) Consider Parliament’s purpose in enacting the legislation. In- means that we cannot directly compare atoms and rules across
terpreters further refine the legal meaning of key words by weeks, and the results cannot be used to make generalised find-
interpreting the provision in line with the legislation’s pur- ings about all coders that might apply our methodology. Second,
pose. For example, the legislation’s object clause, identified the measure of apparent “similarity” between atoms and rules is
at the start of every statute, can indicate a stated purpose based on our methodology, including Turnip syntax, which is a
or purposes. If a statute has multiple purposes, then inter- non-standardised benchmark. Finally, we are not able to reach a
preters read the legislation’s establishing documents, such definitive conclusion about the extent to which the encoded rules
as speeches in Parliament or the legislation’s explanatory align with the statutory text due to the authoritative interpretive
memorandum, to determine a hierarchy of purposes. role of the courts under Australia’s constitutional framework [18],
(4) Evaluate interpretive choices. Interpreters identify all possible a limitation that will apply to any attempt to encode legislation.
interpretations of key words and apply them to findings from Despite these limitations, which we expand upon in the Results
Steps 1 to 3 to evaluate which interpretation best aligns with and Discussion section below, this study makes significant inroads
the legislation’s context and purpose. with developing and applying a methodology that fuses the modern
(5) Consider the “canons of construction”. If Step 4 findings are approach to statutory interpretation with DDL. This paper also
still ambiguous, then interpreters consider whether and how provides valuable insights into sources of apparent coding differ-
syntactical presumptions (e.g., noscitur a sociis and ejusdem ences, legal alignment issues and potential solutions, as part of the
generis) [26, pp. 212, 215], which relate to the meaning of the broader RaC movement.


5 RESULTS AND DISCUSSION meaning of a statute”[20, p. 118], unless the courts have reached
Overall, the results of our automated syntactical analysis show definitive conclusions about, for instance, what different provisions
a significant increase in the similarity of atoms and shared rules mean and how they apply in certain cases. More specific challenges
drafted by participants after the Week 2 intervention, as illustrated include converting evidentiary burdens, the exercise of discretion,
in Table 1. More specifically, Table 2 shows that the similarity of and ambiguous and open-textured rules in computer code, intra-
atoms increased from an average of 4.27% in Phase 1 to 57.64% in and intertextual provisions and drafting errors, all of which under-
Phase 2. A corollary of this is that the number of unique atoms, line the potential margin of error in interpreting statutory language
or atoms that are used by one coder only, decreased from Phase [17]. Due to the nature of statutory interpretation, it is unsurprising
1 to Phase 2. The similarity of rules drafted by participants also that there were differences in coders’ interpretive choices.
increased, albeit marginally, from 0% in Phase 1 to 1.01% in Phase 2. We also observed that participants coded the select provisions
These results support our hypothesis that participants agreeing on with varying levels of granularity. By “granularity”, we mean the
key atoms before commencing individual coding work increases extent to which coders split, or broke down, the statutory text into
the similarity of their encoding choices (H1). Importantly, as the discrete parts. Take, for example, s 40 of the CA that establishes the
individual coders encoded different statutory provisions in Weeks fair dealing exception to copyright infringement for the purpose
1 and 2, this cannot simply be attributed to the coders’ increasing of research or study. A selection of key atoms originating from ss
familiarity with the statutory text. 40(1), (1A) and (1B) follow:

Week 1 Week 2 Participant One (P1):


Atoms Rules Atoms Rules purposeOf . researchOrStudy
Participant One (P1) 60 68 62 43
Participant Two (P2) 47 59 26 24
Participant Three (P3):
Participant Three (P3) 106 70 51 43
work . producedForPurposeOf . theCourseOfStudy
Table 1: Total Number of Rules and Atoms by Participant work . producedForThePurposeOf . theCourseOfResearch
and Week/Phase work . producedBy . personLecturing . inCourseOfStudy
work . producedBy . personLecturing . inCourseOfResearch
work . producedBy . personTeaching . inCourseOfStudy
work . producedBy . personTeaching . inCourseOfResearch
work . producedBy . personInConnectionWith . courseOfStudy
Week 1 Week 2 work . producedBy . personInConnectionWith . courseOfResearch
P1-P2 atom similarity 11.54% 58.06% Here P1 adopts a high-level approach that combines research and
P1-P3 atom similarity 1.26% 51.85% study into one atom, while P3 uses multiple fine-grained atoms to
P2-P3 atom similarity 0% 62.99% distinguish between, inter alia, research and study. In general, to
Average 4.27% 57.64% be as comprehensive as possible in the first instance, we advocate
Week 1 Week 2 for a fine-grained approach that coders can abstract up if neces-
P1-P2 rule similarity 0% 3.03% sary. This does not mean that this approach is always optimal: for
P1-P3 rule similarity 0% 0% example, some coders might excessively split atoms, or potential
P2-P3 rule similarity 0% 0% end-users might not require or want fine-grained encoded provi-
Average 0% 1.01% sions. Nonetheless, it is clear that differing levels of granularity
Table 2: Percentage Similarity of Atoms and Rules by Partic- contributed to the low average similarity of atoms in Phase 1, with
ipant and Week/Phase P1, P2 and P3 creating 60, 47 and 106 atoms, respectively. Part of this
could also be due to coders’ individual backgrounds, as some partic-
ipants had varying levels of experience in statutory interpretation
and in computer science. It is worth noting that we can potentially
Despite this notable increase in the similarity of coding ap- reconcile the coding differences explored in this paragraph with a
proaches in Week 2, the results show that participants continued to constituent rule like this:
make a range of divergent interpretive choices, for which there are
several potential explanations. The first is the relative difficulty of work . producedForPurposeOf . theCourseOfStudy |
work . producedForThePurposeOf . theCourseOfResearch |
applying the modern approach to statutory interpretation, which,
work . producedBy . personLecturing . inCourseOfStudy |
as previously explained, involves interpreters deriving the mean- work . producedBy . personLecturing . inCourseOfResearch |
ing of a statute from close examination of the text, context and work . producedBy . personTeaching . inCourseOfStudy |
purpose (policy) of select legislation [6, 26]. This approach, as the work . producedBy . personTeaching . inCourseOfResearch |
Honourable Justice Michael Kirby contends, is “an art and not a work . producedBy . personInConnectionWith . courseOfResearch |
work . producedBy . personInConnectionWith . courseOfStudy =>
science”[20, p. 113] and stands in stark contrast to outdated “literal purposeOf . researchOrStudy
positivist”, “plain English” or seemingly “objective” understandings
of statutory interpretation [20, p. 116]. A result is that “[t]he task of Such a rule is arguably concise, in line with P1’s encoded rules for
construing statutory language is notorious for generating opposing ss 40(1), (1A) and (1B), and fine-grained, much like P3’s rendering
answers, no one of which is indisputably correct construing the of conditions.

144
Converting Copyright Legislation into Machine-Executable Code ICAIL’21, June 21–25, 2021, São Paulo, Brazil

P1:
s31_1_c : literaryDramaticMusicalWork . copyrightSubsists & ~ computerProgram & copyright . inRelationTo . work = >
exclusiveRightTo . enterInto . commercialRentalArrangement . workReproducedInSoundRecording

Participant 2 (P2):
s_31_1c : prescribedWorkSection31_1c & copyright . inRelationTo . work = >
exclusiveRightTo . enterInto . commercialRentalArrangement . inSoundRecording

s_31_1c_exception : computerProgram = >


~ exclusiveRightTo . enterInto . commercialRentalArrangement . inSoundRecording

s_31_1c_exception >> s_31_1c

Table 3: Encodings for Section 31(1)(c) CA

Another possible explanation for the differences we observe is Week 1 Week 2


that coders can use the Turnip language in different ways to achieve P1-P2 atom similarity 56.07% 85.71%
the same outcome(s). Take, for instance, the basic statement if A or P1-P3 atom similarity 30.12% 81.16%
B, then C. There are two main coding options: P2-P3 atom similarity 32.68% 87.61%
Average 39.62% 84.86%
Option One (Using Disjunction ( | )):
Week 1 Week 2
A | B => C P1-P2 rule similarity 11.36% 35.82%
P1-P3 rule similarity 10.62% 53.49%
Option Two: P2-P3 rule similarity 28.57% 26.87%
A => C Average 16.85% 38.72%
B => C Table 4: Results of Manual Syntactical Analysis. Percentage
Similarity of Atoms and Rules
A more complex example comes from Week 2 of the experiment
for which participants encoded, inter alia, s 31(1)(c) of the CA. This
provision establishes that copyright in a literary work “(other than
a computer program)”, musical or dramatic work is the exclusive atoms from 4.27% to 39.62% in Week 1 and, in Week 2, from 57.64%
right “to enter into a commercial rental arrangement in respect of to 84.86%. Most notably, coding validation measures significantly
the work reproduced in a sound recording”. Select encodings for improved the average similarity of rules from 0% to 16.85% in Week
this provision are presented in Table 3. 1 and, in Week 2, from 1.01% to 38.72% (See Table 4 for the detailed
Of particular note is how P1 encoded the provision in one rule results of manual analysis). These results not only underline the im-
and dealt with the statutory text "other than a computer program” portance of shared vocabulary and conventions, but also teams hav-
using the ~ (not) operator. P2 adopted a more complex approach by ing processes in place for managing alternative technically sound
creating three rules, two of which establish that the exclusive right rule structures, with a view to reducing interpretative differences
does not apply to a computer program (i.e., s_31_1c_exception, and down the line. While it appears that coding validation improved the
s_31_1c_exception >>s_31_1c). While these approaches are both internal consistency of encoded provisions, it is important to note
technically sound, or valid when using Turnip syntax as the relevant that “playing with data”[22] can give rise to potential risks, includ-
benchmark, the former is arguably more straightforward. This ing biases and erroneous outcomes. These risks lead us to explore
example highlights that differences can arise even when coders the importance of interpreters achieving legal alignment between
undertake similar training, use the same language and work in the encoded provisions and the statutory text, which we discuss in the
same team. We suggest that it is likely that there will be comparable following subsection.
findings if participants used another coding language, but this When coding legislation, we argue that following best practice
would need to be investigated through further in-depth, empirical and standardised approaches to coding is critical in the long-term.
research. Transparency, a well-established value of the Anglo-American legal
ideal of the rule of law [32], “is concerned with the quality of being
5.1 Coding Validation clear, obvious and understandable without doubt or ambiguity”5 .
After undertaking automated syntactical analysis, we manually At the core of this value is the principle that individuals should be
“cleaned” the dataset to attempt to improve internal consistency, as able to access and understand the rules that impact them [18, 35].
part of the coding validation process explained above. We found 5 Belgium v Commission (C-110/03) [2005] ECR I-2801, [44] (Advocate General Ruiz-
that these measures improved the baseline average similarity of Jarabo Colomer).

145
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Alice Witt, Anna Huggins, Guido Governatori, and Joshua Buckley

Coders using common conventions and developing clean, well- purposeOf.researchOrStudy, risks deviating from the true meaning
documented practices is likely to significantly improve the ability of of the statutory text as articulated by the courts. P3’s fine-grained
future users to understand how the legislation has been interpreted approach to construing and encoding the statutory language ap-
in the encoding process. It would also greatly simplify the task pears to better align with the court’s interpretation of the statutory
of future developers who may be called upon to review, audit, or text.
amend the code. The divergent coding choices for Section 40(2) of the CA provide
another useful illustration of nuanced legal issues that interpreters
could overlook in formal coding validation processes. Section 40(2)
5.2 Enhancing Legal Alignment outlines five “. . . matters to which regard shall be had, in determin-
Legal alignment, as distinct from and in addition to coding valida- ing whether a dealing constitutes a fair dealing with the work or
tion, is vital for ensuring the accuracy and legitimacy of encoded adaptation for the purpose of research or study” (emphasis added).
legislation. As a further step in checking the accuracy of individual We direct our attention to the word “shall” because it confers a
coding choices, we adopted a process of legal alignment, based on mandatory or directory obligation [26, p. 333]. We noticed, how-
the methodology above, to attempt to reproduce the legal logic ever, that only two of the three participants encoded the five matters
of select copyright provisions in machine-executable code. Legal as obligations ([O]). Consider the rules in Table 5. Significantly, this
alignment focuses only on our interpretation of the provisions us- suggests that P1’s choice to encode the matters in s 42(2) as general
ing a judicially confirmed process as the basis for our interpretive atoms rather than obligations deviates from the meaning the legisla-
approach and, critically, is distinct from a claim of “legal validity” ture seeks to convey. Such a decision matters because the statutory
that encoded rules correctly reflect the law. It is impossible to guar- text “has specific legal authority”[21, p. 159] and its component
antee the validity of any encoded legal rules [18]. As noted above, parts — words, symbols and, at times, images — are there for a
under Australia’s constitutional framework, only the judiciary can reason [23]. It can also have significant flow on effects for poten-
conclusively interpret the legal meaning of a statute. Even if a body tial end users who might, for example, rely on RegTech to comply
of case law exists, as it does for copyright law, the nature of the and stay up to date with regulation governing their commercial
common law system means that while similar cases will generally activities.
be treated alike [5], future cases with slightly different facts may In Australia, like other societies that strive to uphold the legal
trigger a reinterpretation of the law. An added complication is that ideals of the rule of law and separation of powers [32, 35], public
the construction of statutes is a question of law and therefore open law provides a range of checks and balances to assess whether a
to appeal on the basis of errors in statutory interpretation [20]. decision-maker is exercising power in accordance with established
This underlines that interpreters cannot authoritatively determine rules and principles [6, 18]. We argue that governments, technology
the extent to which their encoding choices are “legally valid”[17]. companies and other key stakeholders must take steps to ensure
Instead, we contend that RaC stakeholders should aim for “legal that adequate measures are in place to promote alignment between
alignment” between encoded rules and the true construction of the encoded statutory provisions and the true meaning of the statute as
statutory text. interpreted by the courts. This underscores the importance of RaC
As noted above, coders were not asked to undertake acceptance initiatives having multi-disciplinary expertise across the fields of
testing in Weeks 1 and 2 due to time constraints. This process was law, computer science and public policy [17]. From our perspective
undertaken by the research team after the technical coding valida- as legally trained coders, we argue that it is vital to have both
tion process. The subsequent legal alignment process highlighted subject matter and statutory interpretation expertise as part of a
the complexity of attempting to evaluate the extent to which en- multi-disciplinary RaC team.
coded rules align with the languages and logics of a statue. To
illustrate this point, it is useful to return to the encoding choices
for section 40 of the CA, for which P1 adopted a high-level ap-
proach and P3 drafted several fine-grained atoms. This provision, 6 CONCLUSION AND FUTURE RESEARCH
which is one of several exceptions to copyright infringement in The results from our experiment illustrate the complexities of at-
Part III, Division 3 of the Act, establishes that copyright in a work tempting to reproduce the languages and logics of a statutory text
or an adaptation of a literary, dramatic or musical work is not in- in machine-executable format. While our analyses are preliminary,
fringed by a fair dealing for the purpose of research or study. The we have provided a first of its kind experiment that examines how
wording of this provision is significant; in particular, the legisla- different legally trained people interpret and convert legislation
ture’s use of “research or study” (emphasis added). This raises the into computer code in practice. After our intervention in Week 2, a
question of whether an interpreter should code these terms con- meeting during which participants collaboratively agreed on key
junctively (i.e., research and study) or disjunctively (i.e., research or atoms for manual coding, we identified a significant increase in the
study). In De Garis v Neville Jeffress Pidler Pty Ltd (1990) 37 FCR 99 average similarity of atoms — from 4.27% in Week 1 to 57.64% in
ALR 625; 18 IPR 292, Beaumont J stated that the terms “research” Week 2 — and, to a lesser extent, rules. This finding, among others,
and “study” take their dictionary meanings and, most critically, supports our hypothesis that coders collaboratively agreeing on
should be considered disjunctively[9]. Indeed, the court considered key atoms before commencing independent coding work increases
whether the activities at issue could be characterised as “research” the similarity of their coded output. Importantly, as the individ-
or “study” for the purposes of s 40 of the CA, separately. This sug- ual participants encoded different statutory provisions in Weeks 1
gests that P1’s decision to encode the terms in one atom; namely, and 2 of the experiment, the greater similarity we observed cannot

146
Converting Copyright Legislation into Machine-Executable Code ICAIL’21, June 21–25, 2021, São Paulo, Brazil

P1:
s40_2_aTOd : determining . fairDealing = > regardTo . purposeAndCharacter & regardTo . natureOfWorkOrAdaptation &
regardTo . possibilityOfObtainingWorkOrAdaptation . withinReasonableTime . atOrdinaryCommercialPrice &
regardTo . effectOfDealingOn . potentialMarketOrValue

P2:
s_40_2 : determiningPotentialFairDealing . researchStudy = > [O] regardWorkPurposeCharacter &
[O] regardWorkNature & [O] regardPossibilityOfPurchasing &
[O] regardMarketValueEffect & [O] regardSubstantialityOfCopiedPart

P3:
s_40_2_work_a_to_e : entity . determining . whetherDealingIsFair . forPurposesOf . copyrightAct &
work . isLiteraryOrDramaticOrMusicalOrArtistic & dealing . isReproduction = >
[O] entity . toConsider . dealing . purpose & [O] entity . toConsider . dealing . character &
[O] entity . toConsider . dealing . natureOfWorkOrAdaptation &
[O] entity . toConsider . effectOfDealing . onValueOfWorkOrAdaptation &
[O] entity . toConsider . possibilityOfObtainingWorkOrAdaptationWithinReasonableTimeAtOrdinaryCommercialPrice

Table 5: Encodings for Section 40(2) CA

simply be attributed to the coders’ increasing familiarity with the processes for technical coding validation and legal alignment might
statutory text. entail in practice is warranted. For example, in terms of coding
Notwithstanding these increases, participants made a range of validation, our experiment raises practical questions about how
divergent interpretive choices, which we argue are most likely due exactly coding teams should choose between different yet equally
to: (1) the complexity of statutory interpretation, (2) encoded provi- technically valid coding choices, and the most appropriate and effi-
sions having varying levels of granularity, and (3) the functionality cient syntactic and semantic conventions and methodologies for
of our coding language, Turnip. This underlines that interpretive identifying and representing terms, predicates and propositions in
differences can arise even when coders undertake similar training, legal texts. Finally, in terms of coding validation, further research
use the same language and work in the same team. We explained is needed to shed light on acceptance testing options, including en-
that interpreters can, to some extent and not without risks, improve coding case law, for copyright law and beyond. An important part
the internal consistency of encoded rules by automatically and/or of this work will be clarifying best practices from both a technical
manually cleaning the dataset. and legal perspective.
Overall, we contend that RaC initiatives should have processes
for technical coding validation and legal alignment, both of which REFERENCES
are critically important in enhancing the accuracy of digitising [1] Grigoris Antoniou, David Billington, Guido Governatori, and Michael J. Ma-
legislation. The former helps to ensure that encoded rules adhere her. 2001. Representation Results for Defeasible Logic. ACM Transactions on
Computational Logic 2, 2 (2001), 255–287.
to select coding languages and conventions. It is particularly im- [2] Patricia Aufderheide, Kylie Pappalardo, Nicolas Suzor, and Jessica Stevens. 2018.
portant that interpreters not only follow select coding languages Calculating the consequences of narrow Australian copyright exceptions: mea-
surable, hidden and incalculable costs to creators. Poetics 69 (2018), 15–26.
and conventions, but also develop clean, well-documented code to [3] Tom Barraclough, Hamish Fraser, and Curtis Barnes. 2021. Legislation as Code
enable future users to understand how the legislation has been inter- for New Zealand: Opportunities, Risks and Recommendations. Report. Brainbox
preted in the encoding process. A second critical step is to engage in and The New Zealand Law Foundation, New Zealand.
[4] Sotiris Batsakis, George Baryannis, Guido Governatori, Tachmazidis Ilias, and
a process of legal alignment, which we conceptualise as enhancing Grigoris Antoniou. 2018. Legal Representation and Reasoning in Practice: A Crit-
congruence between the encoded rules and the true meaning of the ical Comparison. In Legal Knowledge and Information Systems, Monica Palmirani
select legislation in line with the modern approach to statutory in- (Ed.). IOS Press, 31–40.
[5] Lisa Burton Crawford and Dan Meagher. 2020. Statutory Precedents under the
terpretation. The results of this experiment suggest that a rigorous “Modern Approach” to Statutory Interpretation. Sydney Law Review 42, 2 (2020),
assessment of legal alignment requires multi-disciplinary expertise 209–239.
[6] Lisa Burton Crawford, Maria O’Sullivan, Janina Boughey, and Melissa Castan.
across specific legal subject matters, statutory interpretation and 2017. Public Law and Statutory Interpretation: Principles and Practice. Federation
technical programming. Press, NSW, Australia.
There are a range of important opportunities for future research [7] Commonwealth of Australia. 2013. Copyright and the Digital Economy: Discussion
Paper. ALRC Discussion Paper 79. https://www.alrc.gov.au/wp-content/upload
that arise from this study. First, there is scope to expand this ex- s/2019/08/dp79_whole_pdf_.pdf
periment to test our hypothesis (H1) across different bodies of law [8] Mark Davidson, Ann Monotti, and Leanne Wiseman. 2012. Australian Intellectual
and types of expertise, and with a larger number of participants. Property Law. Cambridge University Press, Cambridge, U.K.
[9] Bronwen Claire Ewen. 2017. 240 – Intellectual Property, III COPYRIGHT, (8)
Secondly, further exploration of what the separate yet interrelated DEFENCES TO INFRINGEMENT (B) Fair Dealing – Defences to Infringement of
Copyright. In Halsbury’s Laws of Australia. LexisNexis Australia, [240–2357].

147
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Alice Witt, Anna Huggins, Guido Governatori, and Joshua Buckley

[10] Dagfinn Føllesdal and Risto Hilpinen. 1971. Deontic Logic: An Introduction. In [21] Michael Kirby. 2012. The Never-Ending Challenge of Drafting and Interpreting
Deontic Logic: Introductory and Systematic Readings, Risto Hilpinen (Ed.). North Statutes – A Meditation on the career of John Finemore QC. Melbourne University
Holland, 1–35. Law Review 36, 1 (2012), 140.
[11] Thomas F. Gordon, Guido Governatori, and Antonino Rotolo. 2009. Rules and [22] David Lehr and Paul Ohm. 2017. Playing with the Data: What Legal Scholars
Norms: Requirements for Rule Interchange Languages in the Legal Domain Should Learn About Machine Learning. University of California Davis Law Review
(LNCS, 5858), Guido Governatori, John Hall, and Adrian Paschke (Eds.). Springer, 51, 2 (2017), 653–717.
Heidelberg, 282–296. [23] John Middleton. 2016. Statutory Interpretation: Mostly Common Sense? Mel-
[12] Guido Governatori, Pompeu Casanovas, and Louis de Koker. 2020. On the Formal bourne University Law Review 40, 2 (2016), 626–656.
Representation of the Australian Spent Conviction Scheme. In Rules and Reasoning [24] James Mohun and Alex Roberts. 2020. Cracking the code: Rulemaking for humans
(LNCS, Vol. 12173), Víctor Gutiérrez Basulto, Tomáš Kliegr, Ahmet Soylu, Martin and machines. OECD Working Papers on Public Governance. OECD, Paris, France.
Giese, and Dumitru Roman (Eds.). Springer International, 177–185. https://doi.org/10.1787/3afe6ba5-en
[13] Guido Governatori, Mustafa Hashmi, Ho-Pun Lam, Serena Villata, and Mon- [25] Jason Morris. 2020. Spreadsheets for Legal Reasoning: The Continued Promise of
ica Palmirani. 2016. Semantic Business Process Compliance Checking Using Declarative Logic Programming in Law. Masters Thesis, University of Alberta.
LegalRuleML. In Knowledge Engineering and Knowledge Management (LNAI, [26] Michelle Sanson. 2016. Statutory Interpretation (2nd ed.). Oxford University Press„
10024), Eva Blomqbvist, Paolo Ciancarini, Francesco Poggi, and Fabio Vitali (Eds.). Oxford, U.K.
Springer International, 746–761. [27] Service Innovation Lab (LabPlus). 2018. Better Rules for Government Discovery
[14] Guido Governatori, Francesco Olivieri, Antonino Rotolo, and Simone Scannapieco. Report. Technical Report.
2013. Computing Strong and Weak Permissions in Defeasible Logic. Journal of [28] Nicolas Suzor. 2014. The only way to fix copyright is to make it fair. The
Philosophical Logic 42, 6 (2013), 799–829. Conversation (21 February 2014). https://theconversation.com/the-only-way-to-
[15] Guido Governatori and Antonino Rotolo. 2006. Logic of Violations: A Gentzen fix-copyright-is-to-make-it-fair-23402
System for Reasoning with Contrary-To-Duty Obligations. Australasian Journal [29] Matthew Waddington. 2019. Machine-Consumable Legislation: A Legislative
of Logic 4 (2006), 193–215. Drafter’s Perspective – Human v Artificial Intelligence. The Loophole 2 (2019),
[16] Guido Governatori, Antonino Rotolo, and Erica Calardo. 2012. Possible World 21–52.
Semantics for Defeasible Deontic Logic. In Deontic Logic in Computer Science [30] Matthew Waddington. 2020. Rules as Code. Law in Context 37, 1 (2020), 179–186.
(DEON 2012) (Lecture Notes in Computer Science, Vol. 7393), Thomas Ågotnes, Jan [31] Vicki Waye. 2019. Regtech: A New Frontier in Legal Scholarship. Adelaide Law
Broersen, and Dag Elgesem (Eds.). Springer, Heidelberg, 46–60. Review 40, 1 (2019), 363–386.
[17] Anna Huggings, Alice Witt, Nicholas Suzor, Mark Burdon, and Guido Governatori. [32] Alice Witt, Nicolas Suzor, and Anna Huggins. 2019. The Rule of Law on Instagram:
2020. Financial Technology and Regulatory Technology, Issues Paper Submission. An Evaluation of the Moderation of Images Depicting Women’s Bodies. UNSW
Submission 196. Select Senate Committee on Financial Technology and Regula- Law Journal, 42, 2 (2019), 557–596.
tory Technology. https://www.aph.gov.au/DocumentStore.ashx?id=30153f40- [33] Meng Weng (HUANG Mingrong) Wong. 2020. Rules as Code – Seven Levels of
c456-4398-99f2-0dd627f86401&subId=699554 Digitisation. Report. Singapore Management University, School of Law, Singa-
[18] Anna Huggins. 2020. Executive Power in the Digital Age: Automation, Statutory pore.
Interpretation and Administrative Law. In Interpreting Executive Power, Janina [34] World Government Summit. 2018. RegTech for Regulators: Re-Architect the System
Boughey and Lisa Burton Crawford (Ed.). Federation Press, 111–128. for Better Regulation. Technical Report. 2018. https://www.worldgovernmentsum
[19] Mohammad Badiul Islam and Guido Governatori. 2018. RuleRS: A rule-based ar- mit.org/api/publications/document?id=5ccf8ac4-e97c-6578-b2f8-ff0000a7ddb6
chitecture for decision support systems. Artificial Intelligence and Law 26, 4 (2018), [35] Monika Zalnieriute, Lyria Bennett Moses, and George Williams. 2019. The Rule
315–344. https://doi.org/10.1007/s10506-018-9218-0 arXiv:http://rdcu.be/HIvL of Law and Automation of government Decision-Making. Modern Law Review
[20] Michael Kirby. 2011. Statutory Interpretation: The Meaning of Meaning. Mel- 82, 3 (2019), 425–427.
bourne University Law Review 35, 1 (2011), 113–133.

148
Hardness of Case-Based Decisions: a Formal Theory
Heng Zheng Davide Grossi Bart Verheij
Artificial Intelligence, Bernoulli Artificial Intelligence, Bernoulli Artificial Intelligence, Bernoulli
Institute, University of Groningen Institute, University of Groningen Institute, University of Groningen
The Netherlands ILLC/ACLE, University of The Netherlands
h.zheng@rug.nl Amsterdam bart.verheij@rug.nl
The Netherlands
d.grossi@rug.nl
ABSTRACT of legal decision-making have been object of much work in AI and
Stare decisis is a fundamental principle of case-based reasoning. Yet Law (cf. also [23]).
its application varies in complexity and depends, in particular, on But some cases are easier to decide than others. For instance,
whether relevant past decisions agree, or exist at all. The contribution when all past cases agree on a given legally relevant fact situa-
of this paper is a formal treatment of types of the hardness of case- tion, decision-making using the principle of stare decisis can be
based decisions. The typology of hardness is defined in terms of straightforward. When relevant past cases disagree, things get harder.
the arguments for and against the issue to be decided, and their Sometimes such a conflict of precedents can be resolved, for instance
kind of validity (conclusive, presumptive, coherent, incoherent). We when one precedent is considered a landmark case overturning ex-
apply the typology of hardness to Berman and Hafner’s research on isting doctrine, or when a precedent comes from a higher level
the dynamics of case-based reasoning and show formally how the court. But not all conflicts can be resolved, making decision-making
hardness of decisions varies with time. harder. Also it can happen that a legally relevant fact situation has no
matching precedent, so the stare decisis principle gives no answer.
CCS CONCEPTS
Paper contribution. As the above examples show, the hardness
• Computing methodologies → Knowledge representation and of case-based decisions comes in different types. It is the topic
reasoning; • Applied computing → Law. of this paper to provide a formal theory of the hardness of cases
in case-based decision-making. Significant work has been devoted
KEYWORDS to the nature and dynamics of case-based reasoning (e.g., [1, 2, 5–
Case-based reasoning, computational argumentation, hard cases 8, 11, 13–17, 21]), and to the topic of hard cases (e.g., [4, 9, 12]). Yet,
ACM Reference Format: to the best of our knowledge, no formal theory has been proposed
Heng Zheng, Davide Grossi, and Bart Verheij. 2021. Hardness of Case- so far of what makes a current case harder, or less hard, than other
Based Decisions: a Formal Theory . In Eighteenth International Confer- cases. We provide such a theory here, by focusing on the following
ence for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São question: is there a typology of how hard it is to make a decision
Paulo, Brazil. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/ about an issue in case-based reasoning? To answer this question,
3462757.3466071 we propose a formal approach based on the case model formalism
[20–22, 25, 26]. We describe a decision-making issue as an argument
1 INTRODUCTION and its counterargument, and formalize its hardness with the validity
Legal decision-making can be hard, very hard. Its complexities, of these arguments. We also illustrate the approach with a case study
which have been well-recognized in AI and Law since its early in the dynamics of case-based reasoning (following an example by
days, are numerous. An early contribution to the discussion of the Berman and Hafner [6–8, 11] as formalized in [21]), and show how
hardness of legal decision-making in AI and Law is ‘An Artificial hardness varies over time.
Intelligence Approach to Legal Reasoning’, a book by Gardner [9].
In that book, Gardner addresses the distinction between hard and Paper outline. Section 2 introduces earlier work in the case model
easy cases using ideas from jurisprudence. Following Rissland’s formalism. Section 3 develops a formal theory of hardness. Section 4
review [18] of the landmark work by Gardner, legal decisions are shows an application of our approach to a series of concrete legal
guided rather than governed by existing law; legal terms are open cases highlighting the development of hardness over time. Section 5
textured; legal questions can have more than one answer, but a positions our theory within existing literature on case-based decision-
reasonable and timely answer must be given; and the answers to legal making in law. Section 6 concludes. Detailed proof sketches of
questions can change over time. These, and other, such complexities relevant formal properties are provided throughout the paper.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed 2 PRELIMINARIES: CASE MODELS
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored. Our approach uses case models [20], a formalism based on a propo-
For all other uses, contact the owner/author(s). sitional logic language L generated from a finite set of constants. We
ICAIL’21, June 21–25, 2021, São Paulo, Brazil write ¬ for negation, ∧ for conjunction, ∨ for disjunction, ↔ for
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06. equivalence, ⊤ for a tautology, and ⊥ for a contradiction. The asso-
https://doi.org/10.1145/3462757.3466071 ciated classical logical consequence relation is denoted |=. Cases can

149
ICAIL’21, June 21–25, 2021, São Paulo, Brazil H. Zheng, D. Grossi and B. Verheij

π2 All arguments
π0
π1
P∧Q ¬P
P ∧ ¬Q Coherent arguments

Figure 1: Example of a case model. Larger boxes denote cases Presumptive arguments
that are preferred over the cases denoted by smaller boxes.
Conclusive arguments

be compared through the preference relation between cases in case


models. A case model is a set of logically consistent, incompatible
cases forming a total preorder (i.e., a transitive, total binary relation) Figure 2: Validity of arguments in a Venn diagram.
representing a preference relation among the cases.
Definition 1 (Case models [20]) A case model is a pair C = (C, ≥)
with finite C ⊆ L, such that, for all π, π ′ and π ′′ ∈ C: (3) (χ, ρ) is conclusive w.r.t. C if and only if ∃π ∈ C : π |= χ ∧ ρ
and for all π ∈ C: if π |= χ, then π |= χ ∧ ρ. We then write
(1) ̸|= ¬π; C |= χ ⇒ ρ.
(2) If ̸|= π ↔ π ′ , then |= ¬(π ∧ π ′ );
Example 3 If we consider arguments (P, Q), (¬P, Q), and (P ∨ R, P)
(3) If |= π ↔ π ′ , then π = π ′ ;
with the case model C shown in Figure 1, we have:
(4) π ≥ π ′ or π ′ ≥ π;
(5) If π ≥ π ′ and π ′ ≥ π ′′ , then π ≥ π ′′ . (1) C |= P⊤Q, since π0 |= P ∧ Q;
(2) C |= ¬P⊥Q, since there is no case in C can imply ¬P ∧ Q;
As customary, the asymmetric part of ≥ is denoted >. The symmetric
(3) C |= P { Q, since π0 |= P ∧ Q, and π0 is the most preferred
part of ≥ is denoted ∼. Intuitively, ≥ means ‘at least as preferred
case in C that implies P;
as’. For instance, P ∧ Q ≥ P ∧ ¬Q means that the case expressed by
(4) C |= P ∨ R ⇒ P, since for all cases in C , if they imply P ∨ R,
P ∧ Q is higher than or equally high in the preference ordering than
they also imply its conclusion P.
the case expressed by P ∧ ¬Q.
The following proposition shows relations between validities.
Example 1 ([21]) Cases π0 = P ∧ Q, π1 = P ∧ ¬Q, and π2 = ¬P, and
preference relation π2 > π0 > π1 form a case model (Figure 1). Proposition 1 Let C be a case model, (χ, ρ) be an argument:
We move now to the definition of arguments and their validities. (1) If C |= χ ⇒ ρ, then C |= χ { ρ, but not in general vice versa;
(2) If C |= χ { ρ, then C |= χ⊤ρ, but not in general vice versa.
Definition 2 (Arguments [20]) An argument from χ to ρ is a pair
(3) For all χ, ρ ∈ L, either C |= χ⊤ρ or C |= χ⊥ρ.
(χ, ρ) with χ and ρ ∈ L. For λ ∈ L, if χ |= λ , λ is a premise of the
argument; if ρ |= λ , λ is a conclusion; if χ ∧ ρ |= λ , λ is a position P ROOF. We prove the first claim. The other cases are similar.
made by the argument. We say that χ expresses the full premise of Observe that since (χ, ρ) is conclusively valid, by Definition 3, for
the argument, ρ the full conclusion, and χ ∧ ρ its full position made all π ∈ C, if π |= χ, then π |= χ ∧ ρ. Furthermore, a case π |= χ ∧ ρ
by the argument. exists again by the definition of conclusive validity. There are two
Example 2 For instance, (P, Q) is an argument with P as its full cases. Either such a π is maximally preferred among the cases
premise, and Q as its full conclusion. Sentence P ∧ Q is the full logically implying χ, and we have π |= χ ∧ ρ as desired. Or it is not.
position by the argument (P, Q). There exists then a maximally preferred π ′ ∈ C such that π ′ |= χ and
Arguments have three kinds of validities. If the full position made π ′ ≥ π. But by the above observation we also have that π ′ |= ρ as
by an argument is logically implied by one of the precedents in a desired. We conclude that (χ, ρ) is presumptively valid. □
case model, then the argument is coherently valid in the case model. Figure 2 is an illustration of Proposition 1. Conclusive arguments
If an argument’s conclusion is logically implied by a precedent are presumptively valid, and presumptive arguments are coherently
which is weakly preferred over all precedents that logically imply valid. All arguments are either incoherent with respect to the case
the argument’s full premise, then the argument is presumptively model (outside the set of coherent arguments), or coherent with
valid in the case model. If all precedents that logically imply the full respect to the model (inside the set of coherent arguments).
premise of a coherently valid argument, also logically imply its full
conclusion, then the argument is conclusively valid in the model. 3 A FORMAL THEORY OF HARDNESS
Definition 3 (Validity of arguments [20]) Let C = (C, ≥) be a case We turn to the main theoretical contribution of the paper: a formal
model and (χ, ρ) an argument: approach to compare issues in case models by their hardness.
(1) (χ, ρ) is coherent w.r.t. C if and only if ∃π ∈ C : π |= χ ∧ρ. We We start by introducing the key definitions underpinning our
then write C |= χ⊤ρ. If C ̸|= χ⊤ρ, then (χ, ρ) is incoherent theory, in two steps. First we introduce a natural way of ordering
w.r.t. C . We then write C |= χ⊥ρ. arguments by their strength, in essence based on Proposition 1. We
(2) (χ, ρ) is presumptive w.r.t. C if and only if ∃π ∈ C s.t.: show then how this notion can be used also to develop ways in which
(a) π |= χ ∧ ρ; and issues can be compared. We think of issues essentially as pairs of
(b) for all π ′ ∈ C: if π ′ |= χ, then π ≥ π ′ . arguments (e.g., the plaintiff’s and the defendant’s arguments): the
We then write C |= χ { ρ. argument that points to the truth of the issue, an the one that points

150
Hardness of Case-Based Decisions: a Formal Theory ICAIL’21, June 21–25, 2021, São Paulo, Brazil

to its falsity. Here arguments are premise-conclusion pairs, without Example 4 In the case model of Figure 1, argument (P, Q) has type
considering a possible internal stepwise structure. Based on two pres, and argument (P, ¬Q) has type coh. Hence the validity distance
natural ways to compare issues, we will define and study an ordering between the validity labels of these two arguments is 1.
of issues representing their relative hardness.
3.2 Issues
3.1 Comparing arguments We introduce now the notion of issue, that is, the specific proposition
First we introduce labels for arguments based on their validity. a case model is supposed to decide about. Our theory concerns
Definition 4 (Validity labels for arguments) Let (χ, ρ) be an argu- precisely the hardness of deciding about the truth or falsity of an
ment and C a case model. Then the validity label of (χ, ρ) in C is issue given a situation in a case model.
denoted AC (χ, ρ) and is defined as follows: Issues and issue types. So an issue is a sentence whose truth or falsity
(1) AC (χ, ρ) = conc if C |= χ ⇒ ρ; we would like to establish, given a situation. Formally:
(2) AC (χ, ρ) = pres if C |= χ { ρ and C ̸|= χ ⇒ ρ; Definition 7 (Issues) A situation is a sentence σ ∈ L. A sentence
(3) AC (χ, ρ) = coh if C |= χ⊤ρ and C ̸|= χ { ρ; ι ∈ L is an issue given situation σ (denoted σ ± ι) if and only if
(4) AC (χ, ρ) = incoh if C ̸|= χ⊤ρ. σ ̸|= ι and σ ̸|= ¬ι.
For instance, pres is the label for arguments that are presumptively In other words, an issue for a given situation is a sentence whose
valid, but not conclusive. Using Figure 2 as an illustration, these truth or falsity is not logically settled by the situation. It is worth
arguments are in the presumptive arguments set, but not in the con- observing that σ ± ι is an issue if and only if σ ± ¬ι is an issue.
clusive arguments set. The label incoh is used for arguments that
are not coherent. In Figure 2, these arguments are in the set of all Example 5 For instance, P ± Q represents an issue Q with respect
arguments, but not in the set of coherent arguments.1 to a situation P.
Validity labels come with a natural ordering: Importantly, for every issue σ ± ι there are two naturally associated
Definition 5 (Validity ordering) The validity ordering is the total or- arguments: (σ , ι) and (σ , ¬ι). We will study hardness as a relation
der ≥ on set {conc, pres, coh, incoh} characterized by the following on the types of those pairs of arguments that correspond to issues.
property:2 conc ≥ pres ≥ coh ≥ incoh. Given the pair of arguments induced by an issue, we call a type
the multi-set (i.e., a set admitting multiple copies of an element)
Intuitively, based on their validity, arguments with label conc are consisting of the validity labels of the two arguments. Formally:
stronger than arguments with label pres. Similarly, arguments with
label pres are stronger than arguments with label coh, and arguments Definition 8 (Types of issues) Let C be a case model and σ ± ι an
with label coh are stronger than arguments with label incoh. issue. Then the type of the issue, denoted TC (σ ± ι), is the multiset:
The following proposition relates argument validities (Defini- {AC (σ , ι), AC (σ , ¬ι)}.
tion 3) to validity labels and validity ordering (Definitions 4 and 5).
Example 6 The type of issue P ± Q with respect to the case model
Proposition 2 Let (χ, ρ) be an argument and C a case model. Then in Figure 1 is {pres,coh}, as AC (P, Q) = pres and AC (P, ¬Q) = coh.
the following hold:
Not all types are logically possible. The following proposition shows
(1) (χ, ρ) is conclusive if and only if AC (χ, ρ) = conc; there are in total 7 possible issue types.
(2) (χ, ρ) is presumptive if and only if AC (χ, ρ) ≥ pres;
(3) (χ, ρ) is coherent if and only if AC (χ, ρ) ≥ coh; Proposition 3 Let C be a case model and σ ± ι an issue. Then
(4) (χ, ρ) is incoherent if and only if AC (χ, ρ) = incoh. TC (σ ± ι) equals one of the following:
(1) {conc, incoh};
P ROOF. Immediate using the definitions and Proposition 1. □ (2) {pres, pres};
(3) {pres, coh};
Using the ordering of arguments their validity label, we can quantify
(4) {pres, incoh};
how far apart they are in terms of strength, as follows:
(5) {coh, coh};
Definition 6 (Validity distance) Let v0 and v1 ∈ {conc, pres, coh, (6) {coh, incoh};
incoh}. Then we define the validity distance vd(v0 , v1 ) as the length (7) {incoh, incoh}.
of the shortest path from v0 to v1 in the validity ordering.
Intuitively, the validity distance between two arguments is the num- P ROOF. By Definition 4, an argument can have 4 possible labels
(42 −4)
ber of steps in the validity ordering ≥ that are needed to go from conc, pres, coh, incoh. There are therefore 2 +4 = 10 possible
the label of the weaker argument to the label of the stronger one multisets of size 2. Let a case model C = (C, ≥) and an issue σ ± ι
(of course avoiding loops). Clearly, as ≥ has length 3, the maximal be given. We split the proof in two parts. First, we reason by cases
validity distance is 3 and the minimal is 0. and for a multiset {v0 , v1 } (for which we can assume v0 ≥ v1 ) we
show what values among conc, pres, coh, incoh v1 can take once
1
These labels and their ordering in the following Definition 5 are related to the quantita- we fix v0 . Then we display examples showing the types listed in the
tive representation of [20, Section 3.3]. Arguments with label conc correspond to those
arguments with strength equal to 1 [20]. Arguments with label pres corresponds to those statement are possible.
arguments with strength above a given threshold but less than 1. Arguments with label v0 = conc Then v1 ∈ {conc, pres, coh, incoh}. If AC (σ , ι) =
coh corresponds to arguments with strength less than the given threshold but still above
0. And arguments with label incoh corresponds to arguments with strength equal to 0. conc, by Definition 3, for each π ∈ C with π |= σ , we have π |= σ ∧ ι.
2
Recall that a total order is a binary relation which is transitive, total and antisymmetric. Then since cases are consistent, π ̸|= σ ∧ ¬ι, so AC (σ , ¬ι) = incoh

151
ICAIL’21, June 21–25, 2021, São Paulo, Brazil H. Zheng, D. Grossi and B. Verheij

and the types {conc, conc}, {conc, pres}, and {conc, coh} are not 3 2 1 0
possible. Similarly, when AC (σ , ¬ι) = v0 . Therefore, TC (σ ± ι) is {conc, incoh} {pres, pres}
equal to {conc, incoh}.
v0 = pres Then v1 ∈ {pres, coh, incoh} and AC (σ , ¬ι) ∈ {pres, {pres, coh}

coh, incoh}. Similarly, when AC (σ , ¬ι) = v0 . Therefore, TC (σ ± ι) {pres, incoh} {coh, coh}
is equal to {pres, pres}, {pres, coh}, or {pres, incoh}.
v0 = coh Then v1 ∈ {coh, incoh}, and TC (σ ± ι) is equal to {coh, incoh}
{coh,coh} or {coh,incoh}.
{incoh, incoh}
v0 = incoh Then v1 = incoh and TC (σ ± ι) is {incoh,incoh}.
Now we give an example for each possible type. For {conc, incoh},
let C be a case model with only one case π0 = P ∧ Q, P ± Q an issue. Figure 3: Hasse diagram of ⪰v . The numbered columns depict
TC (P ± Q) = {conc, incoh}. the equivalence classes of ⪰d (one per distance from 0 to 3).
Let C be a case model with cases π0 = P ∧ Q and π1 = P ∧ ¬Q,
P ± Q an issue. For {pres, pres}, if C has a preference relation
π0 ∼ π1 , then TC (P ± Q) = {pres, pres}. For {pres, coh}, if C has Example 7 Continuing on Example 6, we have:
a preference relation π0 > π1 , then TC (P ± Q) = {pres, coh}. {pres,coh} ⪰v {incoh,incoh};
For {pres, incoh}, let C be a case model with cases π0 = P ∧ Q {pres,coh} ⪰d {incoh,incoh}.
and π1 = P∧R, P±Q an issue. If C has a preference relation π0 > π1 ,
We will see these relations at work in the definition of hardness
then TC (P ± Q) = {pres, incoh}.
ordering provided in the next subsection. For now it is important to
For {coh, coh}, let C be a case model with cases π0 = P, π1 =
observe that these relations are well-behaved:
P∧Q, and π2 = P∧¬Q, P±Q an issue. If C has a preference relation
π0 > π1 ∼ π2 , then TC (P ± Q) = {coh, coh}. Proposition 4 We have that:
For {coh, incoh}, let C be a case model with cases π0 = P and (1) ⪰v is a partial order, which is not total;
π1 = P ∧ Q, P ± Q an issue. If C has a preference relation π0 > π1 , (2) ⪰d is a total preorder, which is not antisymmetric.
then TC (P ± Q) = {coh, incoh}.
For {incoh, incoh}, let C be a case model only with case π0 = P, P ROOF. Claim 1. By Definition 9, ⪰v inherits the properties
P ± Q an issue. Then TC (P ± Q) = {incoh, incoh}. □ of ≥ (Definition 5) and is therefore reflexive, antisymmetric, and
transitive; hence a partial order. However, it is not total since some
Proposition 3 depends on the preference relation between cases types are incomparable, for instance {conc,incoh} and {pres,coh}
in C to be general. If restrictions are imposed on that preference can not be compared since conc≥pres while incoh≤coh .
relation, fewer types may be possible. In particular, if such a pref- Claim 2. By Definitions 9 and 6, an integer is associated to each
erence relation is trivial (in the sense that all cases are at least as type. Then ⪰d inherits the properties of the integer ≥ relation: reflex-
preferred as all other cases), like in the case models representing ivity, transitivity and totality. However, it is not antisymmetric, for in-
HYPO examples [25], then only 4 types are possible: {conc,incoh}, stance, {pres,coh} ⪰d {coh,incoh} and {coh,incoh} ⪰d {pres,coh}
{pres,incoh}, {pres,pres} and {incoh,incoh} since then a coherent (both have validity distance 1), but {pres,coh} ≠ {coh,incoh}. □
argument is always presumptive.
Relations ⪰v and ⪰d are depicted in Figure 3.
Comparing issues. There are two natural ways in which to compare
types. They can be compared by the relative strength of their validity
labels, or by the distance of their labels. Formally: 3.3 Hardness
Definition 9 (Type orderings) Let {v0 , v1 } and {v′0 , v′1 } be types. We We view hardness as a relation between issues: harder vs. easier
issues. The intuition that guides our definition is based on the re-
define two binary relations ⪰v and ⪰d ∈ {conc, pres, coh, incoh}2 as
lations ⪰d and ⪰v introduced in the previous section. An issue is
follows:
‘easy’ if the validity labels of the two arguments involved in the
(1) {v0 , v1 } ⪰v {v′0 , v′1 } if and only if: v0 ≥ v′0 and v1 ≥ v′1 , or issue are far apart by their validity distance: the stronger argument
v0 ≥ v′1 and v1 ≥ v′0 ; prevails. By means of illustration, an issue where the two arguments
(2) {v0 , v1 } ⪰d {v′0 , v′1 } if and only if vd(v0 , v1 ) ≥ vd(v′0 , v′1 ). are one of type conc and one of type incoh will be easy to decide.
The asymmetric parts of ⪰v , ⪰d are respectively denoted ≻v and It is ‘hard’ if, vice versa, the validity labels of the two arguments
≻d . Their symmetric parts are respectively denoted ∼v and ∼d . involved in the issue are close by validity distance. The prototypical
Intuitively, relation ⪰v orders types by the strength of their va- case consists of arguments with the same validity label. It is in such
lidity labels defined in Definition 5. So higher types in ⪰v pertain cases that one can then use relation ⪰v to distinguish among issues
issues involving stronger arguments, while lower types in ⪰v pertain whose arguments have same validity distance. Intuitively, if two
issues involving weaker arguments. Instead, relation ⪰d orders types issues both involve arguments with the same validity distance—like
by how far apart the labels within the type are from each other. So for instance {pres, coh} and {coh, incoh}—it is the issue involving
higher types in ⪰d pertain issues involving arguments containing a stronger arguments that is arguably ‘easier’.
strong and a weak argument, while lower types in ⪰d pertain issues These intuitions back the definition of hardness as an ordering
involving arguments of similar strength. over issue types, which is based on the lexicographic combination

152
Hardness of Case-Based Decisions: a Formal Theory ICAIL’21, June 21–25, 2021, São Paulo, Brazil

of ⪰d and ⪰v . In our definition of the hardness ordering, an issue {conc, incoh}


type that is higher in the ordering is ‘easier’.
Definition 10 (Hardness ordering) Let C , C ′ be two case models {pres, incoh}
and σ ± ι and σ ′ ± ι ′ two issues. The hardness ordering ⪰h is a easier
binary relation over types defined as follows. {pres, coh}
′ ′
TC (σ ± ι) ⪰h TC ′ (σ ± ι )
{coh, incoh}
if and only if:
(1) TC (σ ± ι) ≻d TC ′ (σ ′ ± ι ′ ), or {pres, pres}
(2) TC (σ ± ι) ∼d TC ′ (σ ′ ± ι ′ ) and TC (σ ± ι) ⪰v TC ′ (σ ′ ± ι ′ ). harder
The asymmetric (symmetric) part of ⪰h is denoted ≻h (∼h ). {coh, coh}
Example 8 In the case model shown in Figure 1, TC (P ± Q) =
{pres, coh}, TC (P ± R) = {incoh, incoh}. By the hardness order- {incoh, incoh}
ing, TC (P ± Q) ≻h TC (P ± R).
Figure 4: Hasse diagram of ⪰h .
We now show that our definition of hardness is well-behaved. In
particular, it is transitive and can compare any pair of issue types.
Theorem 1 The relation ⪰h is a total order. The issues to consider are as follows:
P ROOF. We need to prove that ⪰h is transitive, antisymmetric (1) Whether the current situation should have OUTCOME_1 or not,
and total. Let x, y, z be issue types. represented as SITUATION ± OUTCOME_1;
(2) Whether the current situation should have OUTCOME_2 or not,
Transitivity Assume x ⪰h y and y ⪰h z. There are 4 circum-
represented as SITUATION ± OUTCOME_2.
stances:
The situation is decided based on testimony, where the witness can
(1) If x ≻d y and y ≻d z, then by Proposition 4, x ≻d z, hence by
have expert knowledge, or not. Therefore, depending on the knowl-
Definition 10, x ⪰h z;
edge of the witness, the value of the witness varies. If the witness is
(2) If x ≻d y and y ∼d z, then by Definition 9 x ≻d z, hence by
an expert (represented as EXPERT), the outcome has stronger support
Definition 10, x ⪰h z;
than if the person is not an expert (represented as ¬EXPERT).
(3) If x ∼d y and y ≻d z, then by Definition 9, x ≻d z, hence by
Now we focus on issue SITUATION ± OUTCOME_1. Figure 5 shows
Definition 10 x ⪰h z;
examples of case models for each type of the issue listed in Proposi-
(4) If x ∼d y and y ∼d z, then by Definition 10, x ⪰v y and y ⪰v z.
tion 3, from which we can see that the cases with an expert witness
Then by Proposition 4, x ⪰v z, hence x ⪰h z;
(EXPERT) are more preferred than the cases with ¬EXPERT, as sug-
Totality For types x and y, if the validity distance between gested by the size of boxes.
the arguments in x is not equal to the validity distance in y, then In the {conc,incoh} model, there is only one case, namely, an ex-
by Definition 10 and Proposition 4, x and y are comparable via pert predicts that the situation should have OUTCOME_1. In this model,
⪰h because of the relation ⪰d . If the validity distance between argument (SITUATION, OUTCOME_1) in SITUATION ± OUTCOME_1
the arguments in x is equal to the validity distance in y, then by is conclusive, namely all decisions in this model are for having
Definition 10, x and y are comparable via ⪰h using ⪰v . OUTCOME_1, which makes OUTCOME_1 seem like a natural conse-
Antisymmetry If x ⪰h y and y ⪰h x, then by Definition 10, x ⪰d quence. And there is no support for other decisions.
y, y ⪰d x, hence x ∼d y and therefore x ⪰v y, y ⪰v x. Then using Comparing the {conc,incoh} model with the {pres,incoh} model,
Proposition 4, we find x = y. Hence ⪰h is antisymmetric. □ where there are decisions based on both an expert and a non-expert.
For issue SITUATION ± OUTCOME_1 in the {pres,incoh} model, ar-
The hardness ordering is depicted in Figure 4. The formal notion of gument (SITUATION, OUTCOME_1) has label pres, which makes
hardness captured by Definition 10 provides us with a systematic the decision of OUTCOME_1 stronger. However, comparing with the
way to categorize easy and hard decisions in case-based reasoning, same issue in the {conc,incoh} model, it becomes less strong, since
once represented within the case model formalism. The remaining there is also a case that implies another outcome (OUTCOME_2) in
of the paper is dedicated to putting this formal notion of hardness the {pres,incoh} model, even though the argument for OUTCOME_2
to the test by further discussing the intuitions underpinning it, and is only with coh. As the cases do not all point to OUTCOME_1, issue
putting it at work in concrete examples of case-based decisions. SITUATION ± OUTCOME_1 is harder in the {pres,incoh} model.
In the {pres,coh} model, the issue SITUATION ± OUTCOME_1 is
3.4 An illustration of the theory again harder than in the {pres,incoh} model. Argument (SITUATION,
We now give examples that are meant to illustrate all possible types OUTCOME_1) is still stronger (pres), however, the other argument
of an issue (summarized in Figure 5). In the next section, we will about SITUATION ± OUTCOME_1, i.e., (SITUATION, ¬OUTCOME_1), is
give a realistic example for discussing the hardness of issues. not incoherent anymore. There is a case in the model with ¬OUTCOME_1,
We assume there are two possible outcomes for the current situa- which indicates that it is possible that the situation should not have
tion to be considered. We represent the current situation as SITUATION, OUTCOME_1, even though it is from a non-expert source. The coherent
the two possible outcomes as OUTCOME_1 and OUTCOME_2. but opposite decisions make the issue become harder.

153
ICAIL’21, June 21–25, 2021, São Paulo, Brazil H. Zheng, D. Grossi and B. Verheij

{conc, incoh} {incoh, incoh}


EXPERT ∧ SITUATION ∧ OUTCOME_1 EXPERT ∧ SITUATION ∧ OUTCOME_2

{pres, incoh} {coh, incoh}


EXPERT ∧ SITUATION ∧ OUTCOME_1 ¬EXPERT ∧ SITUATION ∧ OUTCOME_2 EXPERT ∧ SITUATION ∧ OUTCOME_2 ¬EXPERT ∧ SITUATION ∧ OUTCOME_1

{pres, pres} {pres, coh}


EXPERT ∧ SITUATION ∧ OUTCOME_1 EXPERT ∧ SITUATION ∧ ¬OUTCOME_1 EXPERT ∧ SITUATION ∧ OUTCOME_1 ¬EXPERT ∧ SITUATION ∧ ¬OUTCOME_1

{coh, coh}
EXPERT ∧ SITUATION ∧ OUTCOME_2 ¬EXPERT ∧ SITUATION ∧ OUTCOME_1 ¬EXPERT ∧ SITUATION ∧ ¬OUTCOME_1

Figure 5: The hardness of SITUATION ± OUTCOME_1 in 7 case models characterized by the type of issues

In the {coh,incoh} model, SITUATION±OUTCOME_1 is harder than Kerfoot v. Kelley 294 N.Y. 288, 62 N.E.2d 74 (1945): The claim
in the {pres,coh} model, since for argument (SITUATION, OUTCOME_1), was in tort law (driver negligence). The territorial rule applies.
it is less preferable as the testimony for OUTCOME_1 is made by a non-expert. For (SITUATION, ¬OUTCOME_1), there is no support. Compared with the {pres,coh} model, there is no expert testimony about the issue, which makes the consideration of this issue harder.

In the {pres,pres} model, issue SITUATION ± OUTCOME_1 is harder than in the {coh,incoh} model, even though the testimony for OUTCOME_1 is from an expert, namely a more preferable source. This is because its counterargument (SITUATION, ¬OUTCOME_1) is also based on an expert, which gives both having OUTCOME_1 and not having OUTCOME_1 strong support. The issue is therefore harder to resolve than in the {coh,incoh} model, where no one testifies that the situation should not have OUTCOME_1.

In the {coh,coh} model, issue SITUATION ± OUTCOME_1 is harder than in the {pres,pres} model. Even though in both models the arguments for having OUTCOME_1 and for not having it are equally strong, type {coh,coh} indicates that the testimonies come from less preferable sources (non-experts), and because of this the consideration of the issue becomes harder.

Type {incoh,incoh} is the hardest one. As shown in Figure 5, there is no decision at all about issue SITUATION ± OUTCOME_1, hence the consideration of the issue is the hardest, as there is nothing that can be referred to.

4 HARDNESS OVER TIME
In this section, we apply our approach to model case-based decision-making in a real legal domain from the United States, and discuss the development of precedential value in a series of relevant cases by following the research developed by Berman and Hafner [8, 11] and Verheij [21].

The cases we show here are tort cases from New York, which are about car accidents, and which rule should be applied when different jurisdictions are relevant. For instance, when people drive from New York and have an accident in Ontario, which rule should be followed, Ontario's or New York's?

Smith v. Clute 277 N.Y. 407, 14 N.E.2d 455 (1938): The claim was in tort law (driver negligence). The territorial rule applies.

Auten v. Auten 308 N.Y. 155, 124 N.E.2d 99 (1954): The claim was in contract law (enforce a child support agreement). The center-of-gravity rule applies.

Kaufman v. American Youth Hostels 5 N.Y.2d 1016 (1959): The claim was in tort law (travel guide negligence). The territorial rule applies.

Haag v. Barnes 9 N.Y.2d 554, 175 N.E.2d 441, 216 N.Y.S.2d 65 (1961): The claim was in contract law (reopen a child support agreement). The center-of-gravity rule applies.

Kilberg v. Northeast Airlines 9 N.Y.2d 34, 172 N.E.2d 526, 211 N.Y.S.2d 133 (1961): The claim was in tort law (common carrier negligence). The territorial rule is partly applied, and there is an exception for the damages part of the case.

Babcock v. Jackson 12 N.Y.2d 473, 191 N.E.2d 279, 473 N.Y.S.2d 279 (1963): The claim was in tort law (driver negligence). The center-of-gravity rule applies.

Figure 6: The development of precedential values in cases [21]. The case model consists of the cases
SMITH ∧ 1938 ∧ TORT ∧ TERRITORY
KERFOOT ∧ 1945 ∧ TORT ∧ TERRITORY
AUTEN ∧ 1954 ∧ CONTRACT ∧ GRAVITY
KAUFMAN ∧ 1959 ∧ TORT ∧ TERRITORY
HAAG ∧ 1961 ∧ CONTRACT ∧ GRAVITY
KILBERG ∧ 1961 ∧ TORT ∧ EXCEPTION
BABCOCK ∧ 1963 ∧ TORT ∧ GRAVITY

Consider the case model constructed in [21], also shown in Figure 6, which consists of 7 cases. They are represented by factors for the plaintiff's name (SMITH, KERFOOT, etc.), the year of the decision (1938, 1945, etc.), the kind of case (TORT for a tort case, CONTRACT for a contract case), and the jurisdiction choice rule (TERRITORY for entirely applying the territorial rule, EXCEPTION for partly applying the territorial rule while making an exception for the damages part of the case, and GRAVITY for applying the center-of-gravity rule). The preference relation among these cases is denoted directly by the size of the boxes in the figure, namely, the Babcock case is more preferred than other cases, which are preferentially equivalent.
This is because the Babcock case is a landmark case overriding previous cases, by which the center-of-gravity approach is established for tort law [11, 21]. We also apply the background theory of all cases in the case model set in [21], namely, the plaintiff names exclude each other pairwise (¬(SMITH ∧ KERFOOT), etc.), and similarly for the decision years (¬(1938 ∧ 1945), etc.), the kinds of cases (¬(TORT ∧ CONTRACT)) and the choice rules (¬(TERRITORY ∧ EXCEPTION), etc.).

We analyze the development of the jurisdiction choice rule by restricting the case model to the cases up to and including a particular year. For instance, we write C(1954) for the set consisting of the three cases Smith, Kerfoot and Auten dating from 1954 or before [21].

The issues that we want to analyze in this series of cases concern the development of the jurisdiction choice rule in tort law cases. They are as follows:
(1) TORT ± TERRITORY, which is associated with arguments (TORT, TERRITORY) and (TORT, ¬TERRITORY);
(2) TORT ± GRAVITY, which is associated with arguments (TORT, GRAVITY) and (TORT, ¬GRAVITY);
(3) TORT ± EXCEPTION, which is associated with arguments (TORT, EXCEPTION) and (TORT, ¬EXCEPTION);
(4) ⊤ ± GRAVITY, which is associated with arguments (⊤, GRAVITY) and (⊤, ¬GRAVITY).

Issue TORT ± TERRITORY is about whether a tort law case should entirely apply the territorial rule or not. TORT ± GRAVITY is about whether a tort law case should apply the center-of-gravity rule or not. TORT ± EXCEPTION is about whether a tort law case should partly follow the territorial rule and make an exception for the damages part of the case. And ⊤ ± GRAVITY is about the applied status of the center-of-gravity rule in a general sense.

The validity of the arguments listed above has been discussed in [21]. As we show in Section 3, the hardness of an issue is determined by the validity of the arguments it is associated with. For instance, the hardness of issue TORT ± TERRITORY in 1938 with respect to case model C(1938) is

T_C(1938)(TORT ± TERRITORY) = {conc, incoh},

which is determined by the validity of the following arguments:

A_C(1938)(TORT, TERRITORY) = conc
A_C(1938)(TORT, ¬TERRITORY) = incoh

TORT ± TERRITORY becomes harder in 1961, since the hardness of the issue in C(1961) is {pres, pres}. Notice that, according to Definition 4, we consider the labels of arguments (TORT, TERRITORY) and (TORT, ¬TERRITORY) in C(1961) as pres rather than coh.

Based on the validity of the relevant arguments, we summarize the hardness of issues with respect to case models by year in Table 1. The trends in the hardness of issues are shown in Figure 7, from which we have the following observations about the hardness of issues:
(1) When the center-of-gravity rule is introduced in general (1954) and into the tort law domain (1963), the issues about the GRAVITY rule become harder.
(2) The issues related to the territorial rule become harder correspondingly when the court shows doubt about the rule by making an exception for the damages part of a tort case.
(3) When the center-of-gravity rule is introduced by a landmark case (with higher preference), which makes the rule more preferred, not only does the issue about this rule become harder, but other issues are also affected (they become easier).
(4) In general, after the GRAVITY rule finally becomes the primary one in 1963, the 4 tort law-relevant issues all have the same hardness, as in 1945 when the primary rule was the territorial one. However, more options make the consideration of which rule to apply harder.

The first observation can be illustrated by the introduction of the center-of-gravity rule. In 1954, the center-of-gravity rule starts to be considered by the New York courts in a general sense; even though this has no effect on the hardness of issues TORT ± TERRITORY and TORT ± GRAVITY, it does make issue ⊤ ± GRAVITY harder than in 1945:

T_C(1954)(⊤ ± GRAVITY) ≻h T_C(1945)(⊤ ± GRAVITY).

This is because, before 1954, it is clear that the GRAVITY rule is not considered by the court, as argument (⊤, GRAVITY) has label incoh and (⊤, ¬GRAVITY) has label conc. However, the Auten case introduces this rule into the series of case models and makes both of the arguments presumptive. The introduction not only makes (⊤, GRAVITY) stronger and (⊤, ¬GRAVITY) weaker, but also shortens the distance between the validity labels of the two opposite arguments in the issue. Because of the shorter distance, considering whether the center-of-gravity rule should be generally considered or not becomes harder than before. Similarly, after the introduction of the center-of-gravity rule into the tort law domain (by the Babcock case in 1963), we can see the same trend as in 1954.

The second observation is about the exception in a tort law case that applied the territorial rule, introduced by the Kilberg case in 1961. After this case is added to the model, it has no effect on the hardness of the issues related to the GRAVITY rule:

T_C(1961)(TORT ± GRAVITY) ∼h T_C(1959)(TORT ± GRAVITY);
T_C(1961)(⊤ ± GRAVITY) ∼h T_C(1959)(⊤ ± GRAVITY).

Both TORT ± TERRITORY and TORT ± EXCEPTION become harder:

T_C(1961)(TORT ± TERRITORY) ≻h T_C(1959)(TORT ± TERRITORY);
T_C(1961)(TORT ± EXCEPTION) ≻h T_C(1959)(TORT ± EXCEPTION).

This is because the exception makes the consideration of the territorial rule in the tort law domain more complex, as the courts now need to consider whether there will be an exception or not.

The third observation concerns the introduction of the landmark case (Babcock), introducing the center-of-gravity rule into the tort law domain. This makes issue TORT ± GRAVITY harder:

T_C(1963)(TORT ± GRAVITY) ≻h T_C(1961)(TORT ± GRAVITY),

and other relevant issues easier:

T_C(1963)(TORT ± TERRITORY) ≺h T_C(1961)(TORT ± TERRITORY);
T_C(1963)(TORT ± EXCEPTION) ≺h T_C(1961)(TORT ± EXCEPTION);
T_C(1963)(⊤ ± GRAVITY) ≺h T_C(1961)(⊤ ± GRAVITY).

These trends can be explained from an intuitive perspective. Since 1963, the GRAVITY rule has become primary; for the other options, the more preferable way is not to apply them, which makes the issues they are associated with easier. But TORT ± GRAVITY becomes harder, as we explained in the first observation above.

In the last observation, we find that after GRAVITY becomes primary in 1963, all the tort law-relevant issues have the same hardness, as they did before 1954.
Table 1: Hardness of issues in different years

TORT ± TERRITORY
  1938 - 1959: {conc,incoh}   (TORT, TERRITORY): conc    (TORT, ¬TERRITORY): incoh
  1961:        {pres,pres}    (TORT, TERRITORY): pres    (TORT, ¬TERRITORY): pres
  1963:        {pres,coh}     (TORT, TERRITORY): coh     (TORT, ¬TERRITORY): pres

TORT ± GRAVITY
  1938 - 1959: {conc,incoh}   (TORT, GRAVITY): incoh     (TORT, ¬GRAVITY): conc
  1961:        {conc,incoh}   (TORT, GRAVITY): incoh     (TORT, ¬GRAVITY): conc
  1963:        {pres,coh}     (TORT, GRAVITY): pres      (TORT, ¬GRAVITY): coh

TORT ± EXCEPTION
  1938 - 1959: {conc,incoh}   (TORT, EXCEPTION): incoh   (TORT, ¬EXCEPTION): conc
  1961:        {pres,pres}    (TORT, EXCEPTION): pres    (TORT, ¬EXCEPTION): pres
  1963:        {pres,coh}     (TORT, EXCEPTION): coh     (TORT, ¬EXCEPTION): pres

⊤ ± GRAVITY
  1938 - 1945: {conc,incoh}   (⊤, GRAVITY): incoh        (⊤, ¬GRAVITY): conc
  1954 - 1961: {pres,pres}    (⊤, GRAVITY): pres         (⊤, ¬GRAVITY): pres
  1963:        {pres,coh}     (⊤, GRAVITY): pres         (⊤, ¬GRAVITY): coh

Figure 7: Development of hardness over time for different issues in the series of tort cases. The vertical axis orders the hardness types from easier to harder ({conc,incoh}, {pres,incoh}, {pres,coh}, {coh,incoh}, {pres,pres}, {coh,coh}, {incoh,incoh}); the horizontal axis shows the years 1938, 1945, 1954, 1959, 1961 and 1963; separate lines trace the issues TORT±TERRITORY, TORT±GRAVITY, TORT±EXCEPTION and ⊤±GRAVITY.
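To make the orderings in Table 1 and Figure 7 concrete, the following sketch (our own illustration in Python, not part of the paper's formalism; the numeric ranks simply mirror the easier-to-harder axis of Figure 7, and only representative years at which a type changes are recorded) reproduces comparisons such as T_C(1963)(TORT ± GRAVITY) being harder than T_C(1961)(TORT ± GRAVITY):

# Hardness types ordered from easiest to hardest, following Figure 7.
HARDNESS_ORDER = [
    ("conc", "incoh"),
    ("pres", "incoh"),
    ("pres", "coh"),
    ("coh", "incoh"),
    ("pres", "pres"),
    ("coh", "coh"),
    ("incoh", "incoh"),
]
RANK = {t: i for i, t in enumerate(HARDNESS_ORDER)}

# Hardness of each issue per (representative) year, transcribed from Table 1.
HARDNESS = {
    "TORT±TERRITORY": {1959: ("conc", "incoh"), 1961: ("pres", "pres"), 1963: ("pres", "coh")},
    "TORT±GRAVITY":   {1959: ("conc", "incoh"), 1961: ("conc", "incoh"), 1963: ("pres", "coh")},
    "TORT±EXCEPTION": {1959: ("conc", "incoh"), 1961: ("pres", "pres"), 1963: ("pres", "coh")},
    "⊤±GRAVITY":      {1945: ("conc", "incoh"), 1961: ("pres", "pres"), 1963: ("pres", "coh")},
}

def harder(issue, year_a, year_b):
    """Return True if the issue is strictly harder in year_a than in year_b."""
    return RANK[HARDNESS[issue][year_a]] > RANK[HARDNESS[issue][year_b]]

# TORT±GRAVITY becomes harder after the Babcock case (1963 vs. 1961) ...
assert harder("TORT±GRAVITY", 1963, 1961)
# ... while TORT±TERRITORY becomes easier over the same step.
assert harder("TORT±TERRITORY", 1961, 1963)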

The only difference is that the more preferred rule shifts from the TERRITORY rule to the GRAVITY rule. Moreover, we can see that making the choice of which rule should be applied in 1963 is harder than before 1954. This can be connected to our first observation. Before 1954, there is only one jurisdiction choice rule to be considered, as the argument for applying the TERRITORY rule is conclusive (in C(1938) and C(1945)), and the argument for applying the GRAVITY rule is incoherent (in C(1938) and C(1945)). Therefore, if a current case is given in a year before 1954, then, following the precedents, applying the territorial rule in the current case is the more preferable choice. In 1963, even though the GRAVITY rule has already become more preferred than the other choices, the consideration of which rule should be applied still becomes more complex. This is because new cases introduce new choices about rule application during this period. More choices make the consideration of the application harder.

5 DISCUSSION
In this section, we position our formal theory of the hardness of case-based decisions with respect to related research.

The development of hardness over time. The dynamics of case-based reasoning has, for instance, been addressed in [11, 13, 14] in terms of rules, values, and reasons, and there, changes in these elements are associated with the handling of hard cases. As we show in Section 4, our approach is also relevant for the dynamics in case-based reasoning. We extend the analysis of a series of New York tort cases in [11, 21] with types of issues, from which we find that even though new cases can strengthen an argument's validity, the hardness of issues may also increase.

Our approach gives insight into the five temporal patterns listed in [11, Section 4.2] (also discussed in [21]), providing a different angle in terms of the hardness typology.

(1) A general shift in the relative priority of competing purposes. The Auten and Haag cases introduce the center-of-gravity rule into the contract cases, and let GRAVITY become a presumptive conclusion in general, whereas it was incoherent before. However, argument (TORT, GRAVITY) is not yet coherent. From the issue perspective, the Auten case makes the general consideration of the GRAVITY rule (⊤ ± GRAVITY) harder (from {conc,incoh} to {pres,pres}).
After 1954, the consideration of the GRAVITY rule becomes more complicated. However, the hardness of handling the GRAVITY rule in tort law cases has not yet changed, as TORT ± GRAVITY is still as hard as before.

(2) A shift in the relative priority of competing purposes by finding exceptions. The Kilberg case makes TERRITORY, representing the entire application of the territorial rule, no longer a conclusive consequence of tort cases, but only a presumptive consequence. From the issue perspective, after the Kilberg case is added to the model, the type of issue TORT ± TERRITORY shifts from {conc,incoh} to {pres,pres}, in the sense that EXCEPTION (partly applying the territorial rule) makes the territorial rule harder to handle.

(3) The ratio decidendi of an older case is overruled, although it is significantly different. The example of this pattern discussed in [11, 21] is that the Babcock case overrules the Kaufman case. The formal case model we use does not distinguish tort cases from passenger cases, so the pattern is not visible here. But as shown in Figure 7, we can still see that the landmark Babcock case makes the consideration of some issues in a sense harder than right after the Kaufman case.

(4) A case is implicitly overruled. The rule applied in tort cases has changed since 1961 because of the Kilberg case and the Babcock case. The territorial rule is no longer a presumptive conclusion, and the center-of-gravity rule becomes more preferred. If the Kerfoot case were decided after Kilberg or Babcock, it might have come to a different outcome. As shown in Figure 7, both issues have type {pres,coh} in 1963, but notice that the conclusions of the presumptive arguments in the issues are GRAVITY and ¬TERRITORY. Furthermore, both issues in 1963 are harder than in the period 1938 - 1959; if the Kerfoot case were decided after 1963, even though it would more likely apply the GRAVITY rule, the decision-making process would still be more complicated than before.

(5) A case is explicitly overruled. As discussed by [8, 11], this pattern occurs rarely, and is not shown in our case model.

Compared with other research developed recently, in particular Horty's reason model for precedential constraint [14] and Henderson and Bench-Capon's model [13] following Levi's idea to consider case-based reasoning as a "moving classification system" [15], the approach we apply here is based on a different mechanism in terms of an elementary propositional logical language. Moreover, their research focuses more on the development of cases and the involved rules, and the hardness of issues as they occur in the decision-making process (the main contribution of this paper) has not been formally analyzed there. It seems interesting to follow up on Levi's three-stage life cycle of rules (creation, refinement, replacement) as discussed in [13] using the hardness typology in this paper.

The focus of other research on case-based reasoning is often rather different from our paper, and the hardness approach presented here may supplement that research. For instance, [13, 14] emphasize the role of reasons and rules in legal cases, and how they favor the different parties in court, which may be connected with our approach, in which issues and their hardness are associated with arguments with equal premises, but opposite conclusions.

The approach we present here continues the discussion in [25], where we find that 'using an incoherent argument can make sense and break new ground. A decision based on such an argument can be considered as going beyond the current legal status modeled in the precedent model.' We can further interpret this idea with the results we get from the case study in Section 4. For instance, after the introduction of the center-of-gravity rule into the tort law domain in 1963 by the Babcock case, the validity of the argument for applying the rule in a tort law case, namely (TORT, GRAVITY), shifts from incoh to pres. Even though the validity of the argument becomes stronger, the associated issue TORT ± GRAVITY becomes harder, and the validity distance between the validity labels of the two arguments in the issue is shortened, namely, the other argument is made weaker.

It could be interesting to enrich the series of cases discussed in Section 4 to include what happened after the Babcock case. Other series of cases that are well-known in AI and Law are also interesting to examine using the hardness theory we developed, for instance, the cases about product liability and privity [3, 13, 15]. Also, since the case model we show in the case study does not exhibit all the possible types of a real legal domain, it can be interesting to investigate whether such a complete case study can be made, in order to better understand the hardness of issues in an actual decision-making process. Natural developments are also to connect our hardness typology in terms of kinds of validity to proof standards [10] and to consider the development of hardness over time in terms of argumentation schemes for case-based reasoning [24]. It would also be interesting to explore the hardness of issues under preference orderings other than significance, for instance, in terms of court levels.

From hardness of issues to easy and hard cases. As discussed by Gardner [9], hard cases are a main topic in law. Rissland summarizes that hard cases in law can arise in three ways [18]:
(1) there exist competing legal rules;
(2) there exist unresolved predicates; and
(3) there exist competing cases.
Our formalism has the potential for modeling the hard cases discussed by Gardner. Competing cases and legal rules can be associated with issues with types {pres,coh}, {pres,pres}, and {coh,coh}, since in these types the conclusions of the arguments involved are opposite to each other, hence forming a competing relation. Unresolved predicates can be associated with issues that do not contain conclusive arguments, namely, the same premise can lead to different conclusions. If we treat these predicates as the premises and their indicated meaning as the conclusions, we can analyze the meaning of unresolved predicates as leaving room for debate, hence leading to cases that are harder in the typology. It therefore seems interesting for future research to connect the hardness of issues to insights on easy and hard cases. For instance, there is a connection to Dworkin's famous idea (see e.g. [19, p. 488f.]) that for the perfect, Herculean judge, there is one right solution for all cases, including the hard ones. In our hardness typology, there is a variety of options. Sometimes there is exactly one solution, namely in the types {conc,incoh}, {pres,incoh} and {coh,incoh}. In {incoh,incoh} there is no stare decisis solution. In {pres,pres} and {coh,coh}, there are two equally preferred solutions, and in {pres,coh}, there are two, of which one is strictly preferred over the other. Also, consider the characterization of hard cases in [12] that they require an a-rational decision-making process.
In our typology, this characterization applies to {incoh,incoh}, where there is no solution, and also to {pres,pres} and {coh,coh}, where neither of the two choices is preferred. These situations require the construction of a new, persuasive theory of the case and its solution, as suggested by [16].

Also, our hardness theory uses a fixed preference ordering, and it seems relevant to consider a dynamic perspective as a topic of future research, in order to address legal and societal changes.

Hardness of issues in case-based reasoning with factors. The discussion about current situations with precedents in case-based reasoning with factors, such as HYPO/CATO, is also relevant to our approach to the hardness of issues. As discussed in [25], HYPO examples can be modeled in case models in which all cases are equally preferred. In Section 3.2, we show that this kind of case model constrains the possible types of issues. Therefore, issues that occur in HYPO-style reasoning will have a special hardness typology, a subset of the general typology. Further investigation seems to be in order. For instance, connecting the hardness of issues to argument moves, such as analogizing and distinguishing a current situation with precedents, could be an interesting line of further research.

6 CONCLUSION
In this paper, we model the hardness of case-based decisions in terms of arguments and their kind of validity. In the approach, we describe a decision-making problem as an issue in a situation in terms of an argument and a counterargument. The hardness of an issue is represented by the validity of the two associated arguments (conclusive, presumptive, coherent, incoherent). We also define an ordering that shows which issues are harder, and which easier. Building on work by Berman and Hafner, we apply our approach to discuss the hardness of the issues that arose in a series of legal cases. It turns out that we can formally show the varying hardness of issues in the temporal development of case-based reasoning.

The hardness approach is relevant to the understanding of case-based decision-making using stare decisis, as it formally describes the complexity of decision making in different circumstances. In the discussion, we further suggested that it seems interesting to connect our hardness approach to other research. Although the approach here has been applied to issues in hard cases in law, it could be interesting to consider whether our hardness typology, based on propositional logic, is relevant in other domains where example-based reasoning is relevant (such as medical diagnosis). As the analysis of argument validity we use is consistent with probabilistic methods [20], the connection between the hardness of issues and the validity of arguments may also lead to insights on the hardness of decision-making in hybrid AI systems involving both knowledge and data. In this way, the approach can be developed to support the relevance of AI & Law research for AI generally (cf. [23]).

ACKNOWLEDGMENTS
The authors would like to thank the reviewers and Wijnand van Woerkom for their valuable feedback on earlier versions of this paper. This research was partially funded by the Hybrid Intelligence Center, a 10-year programme funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organization for Scientific Research, https://hybrid-intelligence-centre.nl.

REFERENCES
[1] V. Aleven. 1997. Teaching Case-Based Argumentation Through a Model and Examples. Ph.D. Dissertation. University of Pittsburgh.
[2] K. D. Ashley. 1990. Modeling Legal Arguments: Reasoning with Cases and Hypotheticals. MIT Press, Cambridge.
[3] K. D. Ashley. 2018. Precedent and Legal Analogy. In Handbook of Legal Reasoning and Argumentation, G. Bongiovanni, G. Postema, A. Rotolo, G. Sartor, C. Valentini, and D. Walton (Eds.). Springer Netherlands, Dordrecht, 673–710.
[4] K. Atkinson and T. Bench-Capon. 2019. Reasoning with Legal Cases: Analogy or Rule Application? In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. ACM Press, New York, 12–21.
[5] T. Bench-Capon and G. Sartor. 2003. A Model of Legal Reasoning with Cases Incorporating Theories and Values. Artificial Intelligence 150, 1 (2003), 97–143.
[6] D. Berman and C. Hafner. 1991. Incorporating Procedural Context into a Model of Case-based Legal Reasoning. In Proceedings of the Third International Conference on Artificial Intelligence and Law. ACM Press, New York, 12–20.
[7] D. Berman and C. Hafner. 1993. Representing Teleological Structure in Case-Based Legal Reasoning: The Missing Link. In Proceedings of the 4th International Conference on Artificial Intelligence and Law. ACM Press, New York, 50–59.
[8] D. Berman and C. Hafner. 1995. Understanding Precedents in a Temporal Context of Evolving Legal Doctrine. In Proceedings of the Fifth International Conference on Artificial Intelligence and Law. ACM, New York, 42–51.
[9] A. Gardner. 1987. An Artificial Intelligence Approach to Legal Reasoning. MIT Press, Cambridge.
[10] T. F. Gordon and D. N. Walton. 2009. Proof Burdens and Standards. In Argumentation in Artificial Intelligence, I. Rahwan and G. R. Simari (Eds.). Springer, Berlin, 239–258.
[11] C. Hafner and D. Berman. 2002. The Role of Context in Case-Based Legal Reasoning: Teleological, Temporal, and Procedural. Artificial Intelligence and Law 10, 1 (2002), 19–64.
[12] J. Hage, R. Leenes, and A. R. Lodder. 1993. Hard Cases: A Procedural Approach. Artificial Intelligence and Law 2, 2 (1993), 113–167.
[13] J. Henderson and T. Bench-Capon. 2019. Describing the Development of Case Law. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. ACM Press, New York, 32–41.
[14] J. Horty and T. Bench-Capon. 2012. A Factor-Based Definition of Precedential Constraint. Artificial Intelligence and Law 20, 2 (2012), 181–214.
[15] E. Levi. 1948. An Introduction to Legal Reasoning. University of Chicago Law Review 15, 3 (1948), 501–574.
[16] L. T. McCarty. 1995. An Implementation of Eisner v. Macomber. In Proceedings of the Fifth International Conference on Artificial Intelligence and Law. ACM Press, New York, 276–286.
[17] H. Prakken. 2019. Comparing Alternative Factor- and Precedent-based Accounts of Precedential Constraint. In Legal Knowledge and Information Systems. JURIX 2019: The Thirty-Second Annual Conference, M. Araszkiewicz and V. Rodriguez-Doncel (Eds.). IOS Press, Amsterdam, 73–82.
[18] E. Rissland. 1988. Artificial Intelligence and Legal Reasoning: A Discussion of the Field and Gardner's Book. AI Magazine 9, 3 (Sep. 1988), 45–55.
[19] G. Sartor. 2005. Legal Reasoning: A Cognitive Approach to the Law. Vol. 5 of Treatise on Legal Philosophy and General Jurisprudence. Springer, Berlin.
[20] B. Verheij. 2016. Correct Grounded Reasoning with Presumptive Arguments. In 15th European Conference on Logics in Artificial Intelligence, JELIA 2016, Larnaca, Cyprus, November 9–11, 2016, Proceedings (LNAI 10021), L. Michael and A. Kakas (Eds.). Springer, Berlin, 481–496.
[21] B. Verheij. 2016. Formalizing Value-Guided Argumentation for Ethical Systems Design. Artificial Intelligence and Law 24, 4 (2016), 387–407.
[22] B. Verheij. 2017. Formalizing Arguments, Rules and Cases. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Law. ACM, New York, 199–208.
[23] B. Verheij. 2020. Artificial Intelligence as Law. Presidential Address to the Seventeenth International Conference on Artificial Intelligence and Law. Artificial Intelligence and Law 28, 2 (2020), 181–206.
[24] A. Wyner and T. J. M. Bench-Capon. 2007. Argument Schemes for Legal Case-based Reasoning. In Legal Knowledge and Information Systems: JURIX 2007: The Twentieth Annual Conference, A. R. Lodder and L. Mommers (Eds.). IOS Press, Amsterdam, 139–149.
[25] H. Zheng, D. Grossi, and B. Verheij. 2020. Case-Based Reasoning with Precedent Models: Preliminary Report. In Computational Models of Argument. Proceedings of COMMA 2020, H. Prakken, S. Bistarelli, F. Santini, and C. Taticchi (Eds.). Vol. 326. IOS Press, Amsterdam, 443–450.
[26] H. Zheng, D. Grossi, and B. Verheij. 2020. Precedent Comparison in the Precedent Model Formalism: A Technical Note. In Legal Knowledge and Information Systems. JURIX 2020: The Thirty-third Annual Conference, S. Villata, J. Harašta, and P. Křemen (Eds.). Vol. 334. IOS Press, Amsterdam, 259–262.

When Does Pretraining Help? Assessing Self-Supervised
Learning for Law and the CaseHOLD Dataset of 53,000+ Legal
Holdings
Lucia Zheng∗ Neel Guha∗ Brandon R. Anderson
zlucia@stanford.edu nguha@stanford.edu banderson@law.stanford.edu
Stanford University Stanford University Stanford University
Stanford, California, USA Stanford, California, USA Stanford, California, USA

Peter Henderson Daniel E. Ho


phend@stanford.edu dho@law.stanford.edu
Stanford University Stanford University
Stanford, California, USA Stanford, California, USA
ABSTRACT
While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains to domain pretraining in spite of the fact that legal language is widely seen to be unique. We hypothesize that these existing results stem from the fact that existing legal NLP tasks are too easy and fail to meet conditions for when domain pretraining can help. To address this, we first present CaseHOLD (Case Holdings On Legal Decisions), a new dataset comprising 53,000+ multiple choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second, we assess performance gains on CaseHOLD and existing legal NLP datasets. While a Transformer architecture (BERT) pretrained on a general corpus (Google Books and Wikipedia) improves performance, domain pretraining (on a corpus of ≈3.5M decisions across all courts in the U.S. that is larger than BERT's) with a custom legal vocabulary exhibits the most substantial performance gains with CaseHOLD (gain of 7.2% on F1, representing a 12% improvement on BERT) and consistent performance gains across two other legal tasks. Third, we show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the level of performance increase in three legal tasks was directly tied to the domain specificity of the task. Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.

∗ These authors contributed equally to this work.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466088

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Natural language processing; Neural networks.

KEYWORDS
law, natural language processing, pretraining, benchmark dataset

ACM Reference Format:
Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466088

1 INTRODUCTION
How can rapid advances in Transformer-based architectures be leveraged to address problems in law? One of the most significant advances in natural language processing (NLP) has been the advent of "pretrained" (or self-supervised) language models, starting with Google's BERT model [12]. Such models are pretrained on a large corpus of general texts (Google Books and Wikipedia articles), resulting in significant gains on a wide range of fine-tuning tasks with much smaller datasets, and they have inspired a wide range of applications and extensions [27, 38].

One of the emerging puzzles for law has been that while general pretraining (on the Google Books and Wikipedia corpus) boosts performance on a range of legal tasks, there do not appear to be any meaningful gains from domain-specific pretraining (domain pretraining) using a corpus of law. Numerous studies have attempted to apply comparable Transformer architectures to pretrain language models on law, but have found marginal or insignificant gains on a range of legal tasks [7, 14, 49, 50]. These results would seem to challenge a fundamental tenet of the legal profession: that legal language is distinct in vocabulary, semantics, and reasoning [28, 29, 44]. Indeed, a common refrain in the first year of U.S. legal education is that students should learn the "language of law": "Thinking like a lawyer turns out to depend in important ways on speaking (and reading, and writing) like a lawyer." [29].


We hypothesize that the puzzling failure to find substantial gains from domain pretraining in law stems from the fact that existing fine-tuning tasks may be too easy and/or fail to correspond to the domain of the pretraining corpus task. We show that existing legal NLP tasks, Overruling (whether a sentence overrules a prior case, see Section 4.1) and Terms of Service (classification of contractual terms of service, see Section 4.2), are simple enough for naive baselines (BiLSTM) or BERT (without domain-specific pretraining) to achieve high performance. Observed gains from domain pretraining are hence relatively small. Because U.S. law lacks any benchmark task that is comparable to the large, rich, and challenging datasets that have fueled the general field of NLP (e.g., SQuAD [36], GLUE [46], CoQA [37]), we present a new dataset that simulates a fundamental task for lawyers: identifying the legal holding of a case. Holdings are central to the common law system. They represent the governing legal rule when the law is applied to a particular set of facts. The holding is precedential and what litigants can rely on in subsequent cases. So central is the identification of holdings that it forms a canonical task for first-year law students to identify, state, and reformulate the holding.

This CaseHOLD dataset (Case Holdings On Legal Decisions) provides 53,000+ multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, that could be cited. We construct this dataset using the rules of case citation [9], which allow us to match a proposition to a source through a comprehensive corpus of U.S. case law from 1965 to the present. Intuitively, we extract all legal citations and use the "holding statement," often provided in parenthetical propositions accompanying U.S. legal citations, to match context to holding [2]. CaseHOLD extracts the context, legal citation, and holding statement and matches semantically similar, but inappropriate, holding propositions. This turns the identification of holding statements into a multiple choice task.

In Table 1, we show a citation example from the CaseHOLD dataset. The Citing Text (prompt) consists of the context and legal citation text, Holding Statement 0 is the correct corresponding holding statement, Holding Statements 1-4 are the four similar, but incorrect, holding statements matched with the given prompt, and the Label is the 0-indexed label of the correct holding statement answer. For simplicity, we use a fixed context window that may start mid-sentence.

Table 1: CaseHOLD example

Citing Text (prompt)
They also rely on Oswego Laborers' Local 214 Pension Fund v. Marine Midland Bank, 85 N.Y.2d 20, 623 N.Y.S.2d 529, 647 N.E.2d 741 (1996), which held that a plaintiff "must demonstrate that the acts or practices have a broader impact on consumers at large." Defs.' Mem. at 14 (quoting Oswego Laborers', 623 N.Y.S.2d 529, 647 N.E.2d at 744). As explained above, however, Plaintiffs have adequately alleged that Defendants' unauthorized use of the DEL MONICO's name in connection with non-Ocinomled restaurants and products caused consumer harm or injury to the public, and that they had a broad impact on consumers at large inasmuch as such use was likely to cause consumer confusion. See, e.g., CommScope, Inc. of N.C. v. CommScope (U.S.A) Int'l Grp. Co., 809 F. Supp.2d 33, 38 (N.D.N.Y 2011) (<HOLDING>); New York City Triathlon, LLC v. NYC Triathlon

Holding Statement 0 (correct answer)
holding that plaintiff stated a 349 claim where plaintiff alleged facts plausibly suggesting that defendant intentionally registered its corporate name to be confusingly similar to plaintiffs CommScope trademark

Holding Statement 1 (incorrect answer)
holding that plaintiff stated a claim for breach of contract when it alleged the government failed to purchase insurance for plaintiff as agreed by contract

Holding Statement 2 (incorrect answer)
holding that the plaintiff stated a claim for tortious interference

Holding Statement 3 (incorrect answer)
holding that the plaintiff had not stated a claim for inducement to breach a contract where she had not alleged facts sufficient to show the existence of an enforceable underlying contract

Holding Statement 4 (incorrect answer)
holding plaintiff stated claim in his individual capacity

We show that this task is difficult for conventional NLP approaches (BiLSTM F1 = 0.4 and BERT F1 = 0.6), even though law students and lawyers are able to solve the task at high accuracy. We then show that there are substantial and statistically significant performance gains from domain pretraining with a custom vocabulary (which we call Legal-BERT), using all available case law from 1965 to the present (a 7.2% gain in F1, representing a 12% relative boost from BERT). We then experimentally assess conditions for gains from domain pretraining with CaseHOLD and find that the size of the fine-tuning task is the principal other determinant of gains to domain-specific pretraining.

The code, the legal benchmark task datasets, and the Legal-BERT models presented here can be found at: https://github.com/reglab/casehold.

Our paper informs how researchers should decide when to engage in data and resource-intensive pretraining. Such decisions pose an important tradeoff, as cost estimates for fully pretraining BERT can be upward of $1M [41], with potential for social harm [4], but advances in legal NLP may also alleviate huge disparities in access to justice in the U.S. legal system [16, 34, 47]. Our findings suggest that there is indeed something unique to legal language when faced with sufficiently challenging forms of legal reasoning.

2 RELATED WORK
The Transformer-based language model, BERT [12], which leverages a two-step pretraining and fine-tuning framework, has achieved state-of-the-art performance on a diverse array of downstream NLP tasks. BERT, however, was trained on a general corpus of Google Books and Wikipedia, and much of the scientific literature has since focused on the question of whether the Transformer-based approach could be improved by domain-specific pretraining.

Outside of the law, for instance, Lee et al. [25] show that BioBERT, a BERT model pretrained on biomedicine domain-specific corpora (PubMed abstracts and full text articles), can significantly outperform BERT on domain-specific biomedical NLP tasks. For instance, it achieves gains of 6-9% in strict accuracy compared to BERT [25] for biomedical question answering tasks (BioASQ Task 5b and Task 5c) [45].
Similarly, Beltagy et al. show improvements from domain pretraining with SciBERT, using a multi-domain corpus of scientific publications [3]. On the ACL-ARC multiclass classification task [22], which contains example citations labeled with one of six classes, where each class is a citation function (e.g., background), SciBERT achieves gains of 7.07% in macro F1 [3]. It is worth noting that this task is constructed from citation text, making it comparable to the CaseHOLD task we introduce in Section 3.

Yet work adapting this framework for the legal domain has not yielded comparable returns. Elwany et al. [14] use a proprietary corpus of legal agreements to pretrain BERT and report "marginal" gains of 0.4 - 0.7% on F1. They note that in some settings, such gains could still be practically important. Zhong et al. [49] uses BERT pretrained on Chinese legal documents and finds no gains relative to non-pretrained NLP baseline models (e.g., LSTM). Similarly, [50] finds that the same pretrained model performs poorly on a legal question and answer dataset.

Hendrycks et al. [19] found that in zero-shot and few-shot settings, state-of-the-art models for question answering, GPT-3 and UnifiedQA, have lopsided performance across subjects, performing with near-random accuracy on subjects related to human values, such as law and morality, while performing up to 70% accuracy on other subjects. This result motivated their attempt to create a better model for the multistate bar exam by further pretraining RoBERTa [27], a variant of BERT, on 1.6M cases from the Harvard Law Library case law corpus. They found that RoBERTa fine-tuned on the bar exam task achieved 32.8% test accuracy without domain pretraining and 36.1% test accuracy with further domain pretraining. They conclude that while "additional pretraining on relevant high quality text can help, it may not be enough to substantially increase . . . performance." Hendrycks et al. [18] highlight that future research should especially aim to increase language model performance on tasks in subject areas such as law and moral reasoning, since aligning future systems with human values and understanding of human approval/disapproval necessitates high performance on such subject-specific tasks.

Chalkidis et al. [7] explored the effects of law pretraining using various strategies and evaluate on a broader range of legal NLP tasks. These strategies include (a) using BERT out of the box, which is trained on general domain corpora, (b) further pretraining BERT on legal corpora (referred to as LEGAL-BERT-FP), which is the method also used by Hendrycks et al. [19], and (c) pretraining BERT from scratch on legal corpora (referred to as LEGAL-BERT-SC). Each of these models is then fine-tuned on the downstream task. They report that a LEGAL-BERT variant, in comparison to tuned BERT, achieves a 0.8% improvement in F1 on a binary classification task derived from the ECHR-CASES dataset [5], a 2.5% improvement in F1 on the multi-label classification task derived from ECHR-CASES, and between a 1.1-1.8% improvement in F1 on multi-label classification tasks derived from subsets of the CONTRACTS-NER dataset [6, 8]. These gains are small when considering the substantial data and computational requirements of domain pretraining. Indeed, Hendrycks et al. [19] concluded that the documented marginal difference does not warrant domain pretraining.

This existing work raises important questions for law and artificial intelligence. First, these results might be seen to challenge the widespread belief in the legal profession that legal language is distinct [28, 29, 44]. Second, one of the core challenges in the field is that, unlike general NLP, which has thrived on large benchmark datasets (e.g., SQuAD [36], GLUE [46], CoQA [37]), there are few large and publicly available legal benchmark tasks for U.S. law. This is explained in part by the expense of labeling decisions and the challenges around compiling large sets of legal documents [32], leading the approaches above to rely on non-English datasets [49, 50] or proprietary datasets [14]. Indeed, there may be a kind of selection bias in available legal NLP datasets, as they tend to reflect tasks that have been solved by methods often pre-dating the rise of self-supervised learning. Third, assessment standards vary substantially, providing little guidance to researchers on whether domain pretraining is worth the cost. Studies vary, for instance, in whether BERT is retrained with custom vocabulary, which is particularly important in fields where terms of art can defy embeddings of general language models. Moreover, some comparisons are between (a) BERT pretrained at 1M iterations and (b) domain-specific pretraining on top of BERT (e.g., 2M iterations) [25]. Impressive gains might hence be confounded because the domain pretrained model simply has had more time to train. Fourth, legal language presents unique challenges in substantial part because of the extensive and complicated system of legal citation. Work has shown that conventional tokenization that fails to account for the structure of legal citations can improperly present the legal text [20]. For instance, sentence boundary detection (critical for BERT's next sentence prediction pretraining task) may fail with legal citations containing complicated punctuation [40]. Just as using an in-domain tokenizer helps in multilingual settings [39], using a custom tokenizer should improve performance consistently for the "language of law." Last, few have examined differences across the kinds of tasks where pretraining may be helpful.

We address these gaps for legal NLP by (a) contributing a new, large dataset with the task of identification of holding statements that comes directly from U.S. legal decisions, and (b) assessing the conditions under which domain pretraining can help.

3 THE CASEHOLD DATASET
We present the CaseHOLD dataset as a new benchmark dataset for U.S. law. Holdings are, of course, central to the common law system. They represent the governing legal rule when the law is applied to a particular set of facts. The holding is what is precedential and what litigants can rely on in subsequent cases. So central is the identification of holdings that it forms a canonical task for first-year law students to identify, state, and reformulate the holding. Thus, for a law student, the goal of this task is two-fold: (1) understand case names and their holdings; (2) understand how to re-frame the relevant holding of a case to back up the proceeding argument.

CaseHOLD is a multiple choice question answering task derived from legal citations in judicial rulings. The citing context from the judicial decision serves as the prompt for the question. The answer choices are holding statements derived from citations following text in a legal decision. There are five answer choices for each citing text. The correct answer is the holding statement that corresponds to the citing text. The four incorrect answers are other holding statements.

We construct this dataset from the Harvard Law Library case law corpus (in our analyses below, the dataset is constructed from the holdout dataset, so that no decision was used for pretraining Legal-BERT).

We extract the holding statement from citations (parenthetical text that begins with "holding") as the correct answer and take the text before it as the citing text prompt. We insert a <HOLDING> token in the position of the citing text prompt where the holding statement was extracted. To select four incorrect answers for a citing text, we compute the TF-IDF similarity between the correct answer and the pool of other holding statements extracted from the corpus and select the most similar holding statements, to make the task more difficult. We set an upper threshold for similarity to rule out indistinguishable holding statements (here 0.75), which would make the task impossible. One of the virtues of this task setup is that we can easily tune the difficulty of the task by varying the context window, the number of potential answers, and the similarity thresholds. In future work, we aim to explore how modifying the thresholds and task difficulty affects results. In a human evaluation, the benchmark by a law student was an accuracy of 0.94.¹

¹ This human benchmark was done on a pilot iteration of the benchmark dataset and may not correspond to the exact TF-IDF threshold presented here.

A full example of CaseHOLD consists of a citing text prompt, the correct holding statement answer, four incorrect holding statement answers, and a label 0-4 for the index of the correct answer. The ordering of the indices of the correct and incorrect answers is random for each example and, unlike in a multi-class classification task, the answer indices can be thought of as multiple choice letters (A, B, C, D, E), which do not represent classes with underlying meaning, but instead just enumerate the answer choices. We provide a full example from the CaseHOLD dataset in Table 1.
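The distractor-selection step just described can be sketched as follows. This is our own minimal illustration, not the released CaseHOLD pipeline; it assumes scikit-learn, cosine similarity over TF-IDF vectors as the similarity measure, a Python list holding_statements of extracted parenthetical holdings, and the 0.75 upper threshold mentioned above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_distractors(correct_idx, holding_statements, n_distractors=4, upper=0.75):
    """Pick the holdings most TF-IDF-similar to the correct answer,
    excluding near-duplicates above the `upper` similarity threshold."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(holding_statements)
    sims = cosine_similarity(tfidf[correct_idx], tfidf).ravel()
    # Rank candidates by similarity, skipping the correct answer itself
    # and anything so similar that the task would become impossible.
    candidates = [i for i in sims.argsort()[::-1]
                  if i != correct_idx and sims[i] < upper]
    return candidates[:n_distractors]   # indices of the four incorrect answers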
4 OTHER DATASETS
To provide a comparison on difficulty and domain specificity, we also rely on two other legal benchmark tasks. The three datasets are summarized in Table 2.

Table 2: Dataset overview
Dataset            Source              Task Type               Size
Overruling         Casetext            Binary classification   2,400
Terms of Service   Lippi et al. [26]   Binary classification   9,414
CaseHOLD           Authors             Multiple choice QA      53,137

In terms of size, publicly available legal tasks are small compared to mainstream NLP datasets (e.g., SQuAD has 100,000+ questions). The cost of obtaining high-fidelity labeled legal datasets is precisely why pretraining is appealing for law [15]. The Overruling dataset, for instance, required paying attorneys to label each individual sentence. Once a company has collected that information, it may not want to distribute it freely for the research community. In the U.S. system, much of this meta-data is hence retained behind proprietary walls (e.g., Lexis and Westlaw), and the lack of large-scale U.S. legal NLP datasets has likely impeded scientific progress. We now provide more detail on the two other benchmark datasets.

4.1 Overruling
The Overruling task is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences from the law. An overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved.

The Overruling task dataset was provided by Casetext, a company focused on legal research software. Casetext selected positive overruling samples through manual annotation by attorneys and negative samples by randomly sampling sentences from the Casetext law corpus. This procedure has a low false positive rate for negative samples because the prevalence of overruling sentences in the whole law is low. Less than 1% of cases overrule another case and, within those cases, usually only a single sentence contains overruling language. Casetext validates this procedure by estimating the rate of false positives on a subset of sentences randomly sampled from the corpus and extrapolating this rate to the whole set of randomly sampled sentences to determine the proportion of sampled sentences to be reviewed by human reviewers for quality assurance.

Overruling has moderate to high domain specificity because the positive and negative overruling examples are sampled from the Casetext law corpus, so the language in the examples is quite specific to the law. However, it is the easiest of the three legal benchmark tasks, since many overruling sentences are distinguishable from non-overruling sentences due to the specific and explicit language judges typically use when overruling. In his work on overruling language and speech act theory, Dunn cites several examples of judges employing an explicit performative form when overruling, using keywords such as "overrule", "disapprove", and "explicitly reject" in many cases [13]. Language models, non-neural machine learning models, and even heuristics generally detect such keyword patterns effectively, so the structure of this task makes it less difficult compared to other tasks. Previous work has shown that SVM classifiers achieve high performance on similar tasks; Sulea et al. [31] achieve a 96% F1 on predicting case rulings of cases judged by the French Supreme Court and Aletras et al. [1] achieve 79% accuracy on predicting judicial decisions of the European Court of Human Rights.

The Overruling task is important for lawyers because the process of verifying whether cases remain valid and have not been overruled is critical to ensuring the validity of legal arguments. This need has led to the broad adoption of proprietary systems, such as Shepard's (on Lexis Advance) and KeyCite (on Westlaw), which have become important legal research tools for most lawyers [11]. High language model performance on the Overruling task could enable further automation of the shepardizing process.

In Table 3, we show a positive example of an overruling sentence and a negative example of a non-overruling sentence from the Overruling task dataset. Positive examples have label 1 and negative examples have label 0.

Table 3: Overruling examples
Passage (Label 1): for the reasons that follow, we approve the first district in the instant case and disapprove the decisions of the fourth district.
Passage (Label 0): a subsequent search of the vehicle revealed the presence of an additional syringe that had been hidden inside a purse located on the passenger side of the vehicle.
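To illustrate why such explicit performative language makes Overruling comparatively easy for simple methods, a naive keyword heuristic already separates the two sentences in Table 3. This is our own sketch, not a baseline from the paper; the cue words are the ones quoted above from Dunn [13].

OVERRULING_CUES = ("overrule", "disapprove", "explicitly reject")

def heuristic_is_overruling(sentence: str) -> int:
    """Label 1 if the sentence contains an explicit overruling cue, else 0."""
    lowered = sentence.lower()
    return int(any(cue in lowered for cue in OVERRULING_CUES))

# The two examples from Table 3:
assert heuristic_is_overruling(
    "for the reasons that follow, we approve the first district in the "
    "instant case and disapprove the decisions of the fourth district.") == 1
assert heuristic_is_overruling(
    "a subsequent search of the vehicle revealed the presence of an "
    "additional syringe that had been hidden inside a purse located "
    "on the passenger side of the vehicle.") == 0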


4.2 Terms of Service
The Terms of Service task is a binary classification task, where positive examples are potentially unfair contractual terms (clauses) from the terms of service in contract documents. The Unfair Terms in Consumer Contracts Directive 93/13/EEC [17] defines an unfair contractual term as follows. A contractual term is unfair if: (1) it has not been individually negotiated; and (2) contrary to the requirement of good faith, it causes a significant imbalance in the parties' rights and obligations, to the detriment of the consumer.

The Terms of Service dataset comes from Lippi et al. [26], which studies machine learning and natural language approaches for automating the detection of potentially unfair clauses in online terms of service and implements a system called CLAUDETTE based on the results of the study. The dataset was constructed from a corpus of 50 online consumer contracts. Clauses were manually annotated as clearly fair, potentially unfair, and clearly unfair. Positive examples were taken to be potentially unfair or clearly unfair clauses and negative examples were taken to be clearly fair clauses, to dichotomize the task. Lippi et al. [26] also study a multi-class setting in which each clause is additionally labeled according to one of eight categories of clause unfairness (e.g., limitation of liability). We focus on the more general setting, where clauses are only labeled according to whether they encompass any type of unfairness.

Terms of Service has low domain specificity relative to the Overruling and CaseHOLD tasks because examples are drawn from the terms of service text in consumer contracts. Extensive contracting language may be less prevalent in the Casetext and Harvard case law corpora, although contracts cases of course are. The Terms of Service task is moderately difficult. Excluding ensemble methods, the classifier that achieves the highest F1 performance in the general setting of Lippi et al. [26] is a single SVM exploiting bag-of-words features, which achieves a 76.9% F1.

The Terms of Service task is useful for consumers, since automation of the detection of potentially unfair contractual terms could help consumers better understand the terms they agree to when signing a contract and make legal advice about unfair contracts more accessible and widely available for consumers seeking it. It could also help consumer protection organizations and agencies work more efficiently [26].

In Table 4, we show a positive example of a potentially unfair clause and a negative example of a fair clause from the Terms of Service dataset. Positive examples have label 1 and negative examples have label 0.

Table 4: Terms of Service examples
Passage (Label 1): occasionally we may, in our discretion, make changes to the agreements.
Passage (Label 0): this section contains service-specific terms that are in addition to the general terms.

5 METHODS
Our basic approach to understanding the conditions under which domain pretraining may help is to use a series of pretrained BERT models, but to carefully vary one key modeling decision at a time. This is computationally expensive, requiring approximately 16 TPU (64 GPU) core-days per 1M steps. First, we assess performance with base BERT. Second, we train BERT with twice the number of iterations to be able to compare the value of additional training. Third, we ingest the entire Harvard Law case corpus from 1965 to the present and pretrain Legal-BERT on the corpus. The size of this dataset (37GB) is substantial, representing 3,446,187 legal decisions across all federal and state courts, and is larger than the size of the BookCorpus/Wikipedia corpus originally used to train BERT (15GB). Fourth, we train a custom vocabulary variant of Legal-BERT. We provide a comparison to a BiLSTM baseline. We now provide details of these methods.

5.1 Baseline
Our baseline architecture is a one-layer BiLSTM, with 300D word2vec vectors [30]. For the single-sentence tasks, Overruling and Terms of Service, we encode the sentence and pass the resulting vector to a softmax classifier. For CaseHOLD, each citation prompt has five answer choices associated with it. We concatenate the prompt with each one of the five answers, separated by the <SEP> token, to get five prompt-answer pairs. We independently encode each prompt-answer pair and pass the resulting vector through a linear layer, then apply softmax over the concatenated outputs for the five pairs. We choose this architecture because it is comparable to the design suggested for fine-tuning BERT on multiple choice tasks in Radford et al. [35], where prompt-answer pairs are fed independently through BERT and a linear layer. In this architecture, we replace BERT with the BiLSTM.
5 METHODS We preprocess the case law corpus with the sentence segmen-
Our basic approach to understand the conditions for when domain tation procedure and use the pretraining procedure described in
pretraining may help is to use a series of pretrained BERT models, 2 Weuse this period because there is a significant change in the number of reporters
but to carefully vary one key modeling decision at a time. This is around this period and it corresponds to the modern post-Civil Rights Act era.
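The following is a minimal PyTorch sketch, not the authors' code, of the multiple-choice scoring scheme described for the BiLSTM baseline in Section 5.1: each of the five prompt-answer pairs is encoded independently and a softmax is taken over the five scores. Layer sizes, the mean-pooling step, and random embedding initialisation are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMMultipleChoice(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300):
        super().__init__()
        # The paper uses 300D word2vec vectors [30]; random init here for brevity.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                               batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden_dim, 1)  # one score per prompt-answer pair

    def forward(self, token_ids):
        # token_ids: (batch, 5, seq_len) -- five prompt-answer pairs per example
        batch, n_choices, seq_len = token_ids.shape
        flat = token_ids.view(batch * n_choices, seq_len)
        outputs, _ = self.encoder(self.embedding(flat))
        pooled = outputs.mean(dim=1)                 # simple pooling assumption
        scores = self.scorer(pooled).view(batch, n_choices)
        return scores                                # softmax applied by the loss

Training such a scorer would use a cross-entropy loss over the five scores, with the index of the correct holding as the label.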

We preprocess the case law corpus with the sentence segmentation procedure and use the pretraining procedure described in Devlin et al. [12]. One variant is initialized with the BERT base model and pretrained for an additional 1M steps using the case law corpus and the same vocabulary as BERT (uncased). The other variant, which we refer to as Custom Legal-BERT, is pretrained from scratch for 2M steps using the case law corpus and has a custom legal domain-specific vocabulary. The vocabulary set is constructed using SentencePiece [24] on a subsample (appx. 13M) of sentences from our pretraining corpus, with the number of tokens fixed to 32,000. We pretrain both variants with sequence length 128 for 90% and sequence length 512 for 10% over the 2M steps total.

Both Legal-BERT and Custom Legal-BERT are pretrained using the masked language model (MLM) pretraining objective, with whole word masking. Whole word masking and other knowledge masking strategies, like phrase-level and entity-level masking, have been shown to yield substantial improvements on various downstream NLP tasks for English and Chinese text, by making the MLM objective more challenging and enabling the model to learn more about prior knowledge through syntactic and semantic information extracted from these linguistically-informed language units [10, 21, 43]. More recently, Kang et al. [23] posit that whole-word masking may be most suitable for domain adaptation on emrQA [33], a corpus for question answering on electronic medical records, because most words in emrQA are tokenized to sub-word WordPiece tokens [48] in base BERT due to the high frequency of unique, domain-specific medical terminologies that appear in emrQA, but are not in the base BERT vocabulary. Because the case law corpus shares this property of containing many domain-specific terms relevant to the law, which are likely tokenized into sub-words in base BERT, we chose to use whole word masking for pretraining the Legal-BERT variants on the legal domain-specific case law corpus.

The second pretraining task is next sentence prediction. Here, we use regular expressions to ensure that legal citations are included as part of a segmented sentence according to the Bluebook system of legal citation [9]. Otherwise, the model could be poorly trained on improper sentence segmentation [40].³

³ Where the vagaries of legal citations create detectable errors in sentence segmentation (e.g., sentences with fewer than 3 words), we omit the sentence from the corpus.

6 RESULTS

6.1 Base Setup
After pretraining the models as described above in Section 5, we fine-tune on the legal benchmark target tasks and evaluate the performance of each model.

6.1.1 Hyperparameter Tuning. We provide details on our hyperparameter tuning process at https://github.com/reglab/casehold.

6.1.2 Fine-tuning and Evaluation. For the BERT-based models, we use the input transformations described in Radford et al. [35] for fine-tuning BERT on classification and multiple choice tasks, which convert the inputs for the legal benchmark tasks into token sequences that can be processed by the pretrained model, followed by a linear layer and a softmax. For the CaseHOLD task, we avoid making extensive changes to the architecture used for the two classification tasks by converting inputs consisting of a prompt and five answers into five prompt-answer pairs (where the prompt and answer are separated by a delimiter token) that are each passed independently through our pretrained models followed by a linear layer, then take a softmax over the five concatenated outputs. For Overruling and Terms of Service, we use a single NVIDIA V100 (16GB) GPU to fine-tune on each task. For CaseHOLD, we used eight NVIDIA V100 (32GB) GPUs to fine-tune on the task.

We use 10-fold cross-validation to evaluate our models on each task. We use F1 score as our performance metric for the Overruling and Terms of Service tasks and macro F1 score as our performance metric for CaseHOLD, reporting mean F1 scores over 10 folds. We report our model performance results in Table 5 and report statistical significance from (paired) t-tests with 10 folds of the test data to account for uncertainty.

From the results of the base setup, for the easiest Overruling task, the difference in F1 between BERT (double) and Legal-BERT is 0.5% and between BERT (double) and Custom Legal-BERT is 1.6%. Both of these differences are marginal. For the task with intermediate difficulty, Terms of Service, we find that BERT (double), with further pretraining of BERT on the general domain corpus, increases performance over base BERT by 5.1%, but the Legal-BERT variants with domain-specific pretraining do not outperform BERT (double) substantially. This is likely because Terms of Service has low domain-specificity, so pretraining on legal domain-specific text does not help the model learn information that is highly relevant to the task. We note that BERT (double), with 77.3% F1, and Custom Legal-BERT, with 78.7% F1, outperform the highest performing model from Lippi et al. [26] for the general setting of Terms of Service, by 0.4% and 1.8% respectively. For the most difficult and domain-specific task, CaseHOLD, we find that Legal-BERT and Custom Legal-BERT both substantially outperform BERT (double), with gains of 5.7% and 7.2% respectively. Custom Legal-BERT achieves the highest F1 performance for CaseHOLD, with a macro F1 of 69.5%.

We run paired t-tests to validate the statistical significance of model performance differences for a 95% confidence interval. The mean differences between F1 for paired folds of BERT (double) and base BERT are statistically significant for the Terms of Service task, with p-value < 0.001. Additionally, the mean differences between F1 for paired folds of Legal-BERT and BERT (double) (p-value < 0.001) and the mean differences between F1 for paired folds of Custom Legal-BERT and BERT (double) (p-value < 0.001) are statistically significant for the CaseHOLD task. The substantial performance gains from the Legal-BERT model variants were likely achieved because the CaseHOLD task is adequately difficult and highly domain-specific in terms of language.

6.1.3 Domain Specificity Score. Table 5 also provides a measure of domain specificity of each task, which we refer to as the domain specificity (DS) score. We define the DS score as the average difference in pretrain loss between Legal-BERT and BERT, evaluated on the downstream task of interest. For a specific example, we run prediction for the downstream task of interest on the example input using the Legal-BERT and BERT models after pretraining, but before fine-tuning, calculate loss on the task (i.e., binary cross entropy loss for Overruling and Terms of Service, categorical cross entropy loss for CaseHOLD), and take the difference between the loss of the two models. Intuitively, when the difference is large, the general corpus does not predict legal language very well.
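As a minimal sketch of the arithmetic behind the DS score in Section 6.1.3, the snippet below assumes per-example task losses have already been computed with each model after pretraining but before fine-tuning; the sign convention (positive when Legal-BERT assigns lower loss than BERT) is inferred from the text.

import numpy as np

def ds_score(bert_losses, legal_bert_losses):
    """Average per-example difference in task loss between BERT and Legal-BERT."""
    bert_losses = np.asarray(bert_losses, dtype=float)
    legal_bert_losses = np.asarray(legal_bert_losses, dtype=float)
    return float(np.mean(bert_losses - legal_bert_losses))

# Example with invented numbers: a positive value would suggest a relatively
# domain-specific task, as for CaseHOLD in Table 5.
print(ds_score(bert_losses=[1.71, 1.62], legal_bert_losses=[1.60, 1.56]))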

Table 5: Test performance, with ±1.96 × standard error, aggregated across 10 folds. Mean F1 scores are reported for Overruling and Terms of Service. Mean macro F1 scores are reported for CaseHOLD. The best scores are in bold.

Task / Model                     | Baseline       | BERT            | BERT (double)   | Legal-BERT      | Custom Legal-BERT
Overruling (DS = -0.028)         | 0.910 ± 0.012  | 0.958 ± 0.005   | 0.958 ± 0.005   | 0.963 ± 0.007   | 0.974 ± 0.005
Terms of Service (DS = -0.085)   | 0.712 ± 0.020  | 0.722 ± 0.015   | 0.773 ± 0.019   | 0.750 ± 0.018   | 0.787 ± 0.013
CaseHOLD (DS = 0.084)            | 0.399 ± 0.005  | 0.613 ± 0.005   | 0.623 ± 0.003   | 0.680 ± 0.003   | 0.695 ± 0.003
Number of Pretraining Steps      | -              | 1M              | 2M              | 2M              | 2M
Vocabulary Size (domain)         | -              | 30,522 (general)| 30,522 (general)| 30,522 (general)| 32,000 (legal)

DS scores serve as a heuristic for task domain specificity. A positive value conveys that on average, Legal-BERT is able to reason more accurately about the task compared to base BERT after the pretraining phase, but before fine-tuning, which implies the task has higher legal domain-specificity.

The rank order from least to most domain-specific is: Terms of Service, Overruling, and CaseHOLD. This relative ordering makes substantive sense. CaseHOLD has high domain specificity since a holding articulates a court's precise, legal statement of the holding of a decision. As noted earlier, the language of contractual terms-of-service may not be represented extensively in the case law corpus.

The results in Table 5 outline an increasing relationship between the legal domain specificity of a task, as measured by the DS score (compatible with our qualitative assessments of the tasks), and the degree to which prior legal knowledge captured by the model through unsupervised pretraining improves performance. Additionally, the Overruling results suggest that there exists an interplay between the legal domain specificity of a task and the difficulty of the task, as measured by baseline performance on non-attention based models. Gains from attention based models and domain pretraining may be limited for lower difficulty tasks, even those with intermediate DS scores, such as Overruling, likely because the task is easy enough provided local context that increased model domain awareness is only marginally beneficial.

6.2 Task Variants
To provide a further assessment on the conditions for pretraining, we evaluate the performance and sensitivity of our models on three task variants of CaseHOLD, the task for which we observe the most substantial gains from domain pretraining. We vary the task on three dimensions: the volume of training data available for fine-tuning (train volume), the difficulty of the prompt as controlled by the length of the prompt (prompt difficulty), and the level of domain specificity of the prompt (domain match). We hypothesize that these dimensions — data volume, prompt difficulty, and domain specificity — capture the considerations practitioners must account for in considering whether pretraining is beneficial for their use case. For the task variants, we split the CaseHOLD task dataset into three train and test set folds using an 80/20 split over three random seeds and evaluate on each fold. We report results as the mean F1 over the three folds' test sets.

6.2.1 Train Volume. For the train volume variant, keeping the test set constant, we vary the train set size to be of size 1, 10, 100, 500, 1,000, 5,000, 10,000, and the full train set. We find that the Legal-BERT gains compared to BERT (double) are strongest with low train volume and wash out with high train volume. As we expect, Legal-BERT gains are larger when the fine-tuning dataset is smaller. In settings with limited training data, the models must rely more on prior knowledge, and Legal-BERT's prior knowledge is more relevant to the highly domain-specific task due to pretraining on legal domain-specific text, so we see stronger gains from Legal-BERT compared to BERT (double). For a training set size of 1, the mean gain in Legal-BERT is 17.6% ± 3.73, the maximal gain across train set sizes.

Figure 1: Mean macro F1 scores over 3 folds, with ±1.96 × standard error, for train volume variant.

This particular variant is well-motivated because it has often been challenging to adapt NLP for law precisely because there is limited labeled training data available. Legal texts typically require specialized legal knowledge to annotate, so it can often be prohibitively expensive to construct large structured datasets for the legal domain [16].

6.2.2 Prompt Difficulty. For the difficulty variant, we vary the citing text prompt difficulty by shortening the length of the prompt to the first x words. The average length of a prompt in the CaseHOLD task dataset is 136 words, so we take the first x = 5, 10, 20, 40, 60, 80, 100 words of the prompt and the full prompt. We take the first x words instead of the last x words closest to the holding, as the latter could remove less relevant context further from the holding and thus make the task easier. We find that the prompt difficulty variant does not result in a clear pattern of increasing gains from Legal-BERT over BERT (double) above 20 words, though we would expect to see the gains grow as the prompt is altered more. However, a 2% drop in gain is seen in the 5 word prompt (the average F1 gap above 20 words is 0.062, while at 5 it is 0.0391).
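The prompt manipulation in Section 6.2.2 amounts to keeping the first x words of each citing-text prompt. The sketch below illustrates this under the assumption of simple whitespace tokenisation, which the paper does not specify.

def truncate_prompt(prompt: str, x: int) -> str:
    """Keep only the first x whitespace-delimited words of the prompt."""
    return " ".join(prompt.split()[:x])

# Each truncated copy of the dataset would then be used for fine-tuning and evaluation.
for x in (5, 10, 20, 40, 60, 80, 100):
    shortened = truncate_prompt("The district court held that the statute of limitations had run because ...", x)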

Figure 2: Mean macro F1 scores over 3 folds, with ±1.96 × standard error, for prompt difficulty variant.

One possible reason we do not observe a clear pattern may be that the baseline prompt length constrains the degree to which we can manipulate the prompt and vary this dimension; the expected relationship may be more clearly observed for a dataset with longer prompts. Additionally, BERT models are known to disregard word order [42]. It is possible that beyond 5 words, there is a high likelihood that a key word or phrase is encountered that Legal-BERT has seen in the pretraining data and can attend to.

6.2.3 Domain Match. For the domain match variant, we weight the predictions for the test set when calculating F1 by sorting the examples in ascending order by their DS score and weighting each example by its rank order. Intuitively, this means the weighted F1 score rewards correct predictions on examples with higher domain specificity more. This method allows us to keep train volume constant, to avoid changing the domain specificity distribution of train set examples (which would occur if the test set was restricted to a certain range of DS scores), and still observe the effects of domain specificity on performance in the test set. We expect that the gains in Legal-BERT compared to BERT are stronger for the weighted F1 than the unweighted F1. We find that the mean gain in Legal-BERT over three folds is greater for the weighted F1 compared to the unweighted F1, but only by a difference of 0.8% ± 0.154, as shown in Table 6.

Table 6: Mean gain in Legal-BERT over 3 folds, for domain match variant.

Mean macro F1   | BERT (double) | Legal-BERT | Mean Gain
Unweighted      | 0.620         | 0.679      | 0.059
Weighted        | 0.717         | 0.784      | 0.067

One possible reason this occurs is that the range of DS scores across examples in the CaseHOLD task is relatively small, so some similarly domain-specific examples may have fairly different rank-based weights. In Figure 3, we show histograms of the DS scores of examples for Terms of Service, CaseHOLD, and 5,000 examples sampled (without replacement) from each task. Notice that the Terms of Service examples are skewed towards negative DS scores and the CaseHOLD examples are skewed towards positive DS scores, so the range of DS scores within a task is limited, while the examples sampled from both tasks span a larger range, explaining the small gains from the domain match variant, but more substantial gains for CaseHOLD from Legal-BERT compared to Terms of Service. In other words, because the CaseHOLD task is already quite domain specific, variation within the corpus may be too range-restricted to provide a meaningful test of domain match.

Figure 3: Density histograms of DS scores of examples for Terms of Service, CaseHOLD, and both tasks.

Further work could instead examine domain match by pretraining on specific areas of law (e.g., civil law) and fine-tuning on other areas (e.g., criminal law), but the Harvard case law corpus does not currently have meaningful case / issue type features.

6.3 Error Analysis
We engage in a brief error analysis by comparing the breakdown of errors between Legal-BERT and BERT (double). In the test set, the breakdown was: 55% both correct, 13% Legal-BERT correct / BERT (double) incorrect, 7% BERT (double) correct / Legal-BERT incorrect, and 25% both incorrect. We read samples of instances where model predictions diverged, with a focus on the examples Legal-BERT predicted correctly and BERT (double) predicted incorrectly. While we noticed instances indicative of Legal-BERT attending to legal language (e.g., identifying a different holding because of the difference between a "may" and "must" and because the "but see" citation signal indicated a negation), we did not find that such simple phrases predicted differences in performance in a bivariate probit analysis. We believe there is much fruitful work to be done on further understanding what Legal-BERT uniquely attends to.

6.4 Limitations
While none of the CaseHOLD cases exist in the pretraining dataset, some of Legal-BERT's gains on the CaseHOLD task may be attributable to having seen key words tied to similar holding formulations in the pretraining data. As mentioned, this is part of the goal of the task: understanding the holdings of important cases in a minimally labeled way and determining how the preceding context may affect the holding. This would explain the varying results in the prompt difficulty variant of the CaseHOLD task: gains could be mainly coming from attending to only a key word (e.g., case name) in the context. This may also explain how Legal-BERT is able to achieve zero-shot gains in the train volume variant of the task. BERT may have also seen some of the cases and holdings in English Wikipedia,⁴ potentially explaining its zero-shot performance improvements over random in the train volume variant. Future work on the CaseHOLD dataset may wish to disentangle memorization of case names from the framing of the citing text, but we provide a strong baseline here.

⁴ See, e.g., https://en.wikipedia.org/wiki/List_of_landmark_court_decisions_in_the_United_States which contains a list of cases and their holdings.
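The rank-order weighting used for the domain match variant in Section 6.2.3 can be written compactly as a sample-weighted macro F1. The sketch below is an illustration rather than the authors' implementation; using scikit-learn's sample_weight argument is an assumption about how such a score could be computed.

import numpy as np
from sklearn.metrics import f1_score

def rank_weighted_macro_f1(y_true, y_pred, ds_scores):
    """Weight each test example by its ascending-DS rank, so correct predictions
    on more domain-specific examples count more."""
    order = np.argsort(ds_scores)                      # ascending DS score
    weights = np.empty(len(ds_scores))
    weights[order] = np.arange(1, len(ds_scores) + 1)  # rank 1 = least domain-specific
    return f1_score(y_true, y_pred, average="macro", sample_weight=weights)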

One possible mechanism for this is via a future variant of the CaseHOLD task where a case holding is paraphrased to indicate bias toward a different viewpoint from the contextual framing. This would reflect the first-year law student exercise of re-framing a holding to persuasively match their argument and isolate the two goals of the task.

7 DISCUSSION
Our results resolve an emerging puzzle in legal NLP: if legal language is so unique, why have we seen only marginal gains to domain pretraining in law? Our evidence suggests that these results can be explained by the fact that existing legal NLP benchmark tasks are either too easy or not domain matched to the pretraining corpus. Our paper shows the largest gains documented for any legal task from pretraining, comparable to the largest gains reported by SciBERT and BioBERT [3, 25]. Our paper also shows the highest performance documented for the general setting of the Terms of Service task [26], suggesting substantial gains from domain pretraining and tokenization.

Using a range of legal language tasks that vary in difficulty and domain-specificity, we find BERT already achieves high performance for easy tasks, so that further domain pretraining adds little value. For the intermediate difficulty task that is not highly domain-specific, domain pretraining can help, but the gain is most substantial for highly difficult and domain-specific tasks.

These results suggest important future research directions. First, we hope that the new CaseHOLD dataset will spark interest in solving the challenging environment of legal decisions. Not only are many available benchmark datasets small or unavailable, but they may also be biased toward solvable tasks. After all, a company would not invest in the Overruling task (baseline F1 with BiLSTM of 0.91) without assurance that there are significant gains to paying attorneys to label the data. Our results show that domain pretraining may enable a much wider range of legal tasks to be solved.

Second, while the creation of large legal NLP datasets is impeded by the sheer cost of attorney labeling, CaseHOLD also illustrates an advantage of leveraging domain knowledge for the construction of legal NLP datasets. Conventional segmentation would fail to take advantage of the complex system of legal citation, but investing in such preprocessing enables better representation and extraction of legal texts.

Third, our research provides guidance for researchers on when pretraining may be appropriate. Such guidance is sorely needed, given the significant costs of language models, with one estimate suggesting that full pretraining of BERT with a 15GB corpus can exceed $1M. Deciding whether to pretrain itself can hence have significant ethical, social, and environmental implications [4]. Our research suggests that many easy tasks in law may not require domain pretraining, but that gains are most likely when ground truth labels are scarce and the task is sufficiently in-domain. Because estimates of domain-specificity across tasks using the DS score match our qualitative understanding, this heuristic can also be deployed to determine whether pretraining is worth it. Our results suggest that for other high DS and adequately difficult legal tasks, experimentation with custom, task relevant approaches, such as leveraging corpora from task-specific domains and applying tokenization / sentence segmentation tailored to the characteristics of in-domain text, may yield substantial gains. Bender et al. [4] discuss the significant environmental costs associated in particular with transferring an existing large language model to a new task or developing new models, since these workflows require retraining to experiment with different model architectures and hyperparameters. DS scores provide a quick metric for future practitioners to evaluate when resource intensive model adaptation and experimentation may be warranted on other legal tasks. DS scores may also be readily extended to estimate the domain-specificity of tasks in other domains with existing pretrained models like SciBERT and BioBERT [3, 25].

In sum, we have shown that a new benchmark task, the CaseHOLD dataset, and a comprehensively pretrained Legal-BERT model illustrate the conditions for domain pretraining and suggest that language models, too, can embed what may be unique to legal language.

ACKNOWLEDGMENTS
We thank Devshi Mehrotra and Amit Seru for research assistance, Casetext for the Overruling dataset, Stanford's Institute for Human-Centered Artificial Intelligence (HAI) and Amazon Web Services (AWS) for cloud computing research credits, and Pablo Arredondo, Matthias Grabmair, Urvashi Khandelwal, Christopher Manning, and Javed Qadrud-Din for helpful comments.

REFERENCES
[1] Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios Lampos. 2016. Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective. PeerJ Computer Science 2 (2016), e93.
[2] Pablo D Arredondo. 2017. Harvesting and Utilizing Explanatory Parentheticals. SCL Rev. 69 (2017), 659.
[3] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3615–3620.
[4] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, New York, NY, USA, 610–623.
[5] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4317–4323. https://www.aclweb.org/anthology/P19-1424
[6] Ilias Chalkidis, Ion Androutsopoulos, and Achilleas Michos. 2017. Extracting Contract Elements. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law (London, United Kingdom) (ICAIL '17). Association for Computing Machinery, New York, NY, USA, 19–28.
[7] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 2898–2904. https://www.aclweb.org/anthology/2020.findings-emnlp.261
[8] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2019. Neural Contract Element Extraction Revisited. Workshop on Document Intelligence at NeurIPS 2019. https://openreview.net/forum?id=B1x6fa95UH
[9] Columbia Law Review Ass'n, Harvard Law Review Ass'n, and Yale Law Journal. 2015. The Bluebook: A Uniform System of Citation (21st ed.). The Columbia Law Review, The Harvard Law Review, The University of Pennsylvania Law Review, and The Yale Law Journal.
[10] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-Training with Whole Word Masking for Chinese BERT. arXiv:1906.08101 [cs.CL]
[11] Laura C. Dabney. 2008. Citators: Past, Present, and Future. Legal Reference Services Quarterly 27, 2-3 (2008), 165–190.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://www.aclweb.org/anthology/N19-1423
[13] Pintip Hompluem Dunn. 2003. How judges overrule: Speech act theory and the doctrine of stare decisis. Yale LJ 113 (2003), 493.
[14] Emad Elwany, Dave Moore, and Gaurav Oberoi. 2019. BERT Goes to Law School: Quantifying the Competitive Advantage of Access to Large Legal Corpora in Contract Understanding. arXiv:1911.00473 http://arxiv.org/abs/1911.00473
[15] David Freeman Engstrom and Daniel E Ho. 2020. Algorithmic accountability in the administrative state. Yale J. on Reg. 37 (2020), 800.
[16] David Freeman Engstrom, Daniel E. Ho, Catherine Sharkey, and Mariano-Florentino Cuéllar. 2020. Government by Algorithm: Artificial Intelligence in Federal Administrative Agencies. Administrative Conference of the United States, Washington DC, United States.
[17] European Union 1993. Council Directive 93/13/EEC of 5 April 1993 on unfair terms in consumer contracts. European Union.
[18] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021. Aligning AI With Shared Human Values. arXiv:2008.02275 [cs.CY]
[19] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY]
[20] Michael J. Bommarito II, Daniel Martin Katz, and Eric M. Detterman. 2018. LexNLP: Natural language processing and information extraction for legal and regulatory texts. arXiv:1806.03688 http://arxiv.org/abs/1806.03688
[21] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics 8 (2020), 64–77. https://www.aclweb.org/anthology/2020.tacl-1.5
[22] David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. 2018. Measuring the Evolution of a Scientific Field through Citation Frames. Transactions of the Association for Computational Linguistics 6 (2018), 391–406. https://www.aclweb.org/anthology/Q18-1028
[23] Minki Kang, Moonsu Han, and Sung Ju Hwang. 2020. Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6102–6120. https://www.aclweb.org/anthology/2020.emnlp-main.493
[24] Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226 [cs.CL]
[25] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2019), 1234–1240.
[26] Marco Lippi, Przemysław Pałka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni. 2019. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law 27, 2 (2019), 117–139.
[27] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
[28] David Mellinkoff. 2004. The language of the law. Wipf and Stock Publishers, Eugene, Oregon.
[29] Elizabeth Mertz. 2007. The Language of Law School: Learning to "Think Like a Lawyer". Oxford University Press, USA.
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. http://arxiv.org/abs/1301.3781
[31] Octavia-Maria, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, and Josef van Genabith. 2017. Exploring the Use of Text Classification in the Legal Domain. Proceedings of 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts (ASAIL).
[32] Adam R. Pah, David L. Schwartz, Sarath Sanga, Zachary D. Clopton, Peter DiCola, Rachel Davis Mersey, Charlotte S. Alexander, Kristian J. Hammond, and Luís A. Nunes Amaral. 2020. How to build a more open justice system. Science 369, 6500 (2020), 134–136.
[33] Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A Large Corpus for Question Answering on Electronic Medical Records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2357–2368. https://www.aclweb.org/anthology/D18-1258
[34] Marc Queudot, Éric Charton, and Marie-Jean Meurs. 2020. Improving Access to Justice with Legal Chatbots. Stats 3, 3 (2020), 356–375.
[35] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
[36] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://www.aclweb.org/anthology/D16-1264
[37] Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7 (2019), 249–266.
[38] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8 (2021), 842–866.
[39] Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2020. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. arXiv:2012.15613 [cs.CL]
[40] Jaromir Savelka, Vern R Walker, Matthias Grabmair, and Kevin D Ashley. 2017. Sentence boundary detection in adjudicatory decisions in the United States. Traitement automatique des langues 58 (2017), 21.
[41] Or Sharir, Barak Peleg, and Yoav Shoham. 2020. The Cost of Training NLP Models: A Concise Overview. arXiv:2004.08900 [cs.CL]
[42] Koustuv Sinha, Prasanna Parthasarathi, Joelle Pineau, and Adina Williams. 2020. Unnatural Language Inference. arXiv:2101.00010 [cs.CL]
[43] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced Representation through Knowledge Integration. arXiv:1904.09223 [cs.CL]
[44] P.M. Tiersma. 1999. Legal Language. University of Chicago Press, Chicago, Illinois. https://books.google.com/books?id=Sq8XXTo3A48C
[45] George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artiéres, Axel-Cyrille Ngonga Ngomo, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 1 (April 2015), 138.
[46] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Brussels, Belgium, 353–355. https://www.aclweb.org/anthology/W18-5446
[47] Jonah Wu. 2019. AI Goes to Court: The Growing Landscape of AI for Access to Justice. https://medium.com/legal-design-and-innovation/ai-goes-to-court-the-growing-landscape-of-ai-for-access-to-justice-3f58aca4306f
[48] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [cs.CL]
[49] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5218–5230. https://www.aclweb.org/anthology/2020.acl-main.466
[50] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. JEC-QA: A Legal-Domain Question Answering Dataset. 9701–9708.

Part II

Short Papers
Practical Tools from Formal Models: The ECHR as a Case Study
Katie Atkinson, Joe Collenette, Trevor Bench-Capon
Department of Computer Science, University of Liverpool, Liverpool, UK
{katie,j.m.collenette,tbc}@liverpool.ac.uk

Kanstantsin Dzehtsiarou
Department of Law, University of Liverpool, Liverpool, UK
dzeh@liverpool.ac.uk

ABSTRACT
One approach to building legal support systems is to run an executable model of the relevant knowledge through an interface designed to collect information from the user and provide explanations. The usability of such systems depends on the terms used in the law being represented: often only users familiar with the practice and application of the law will be able to provide the required information. Earlier work applied this approach to the European Convention on Human Rights (ECHR). Although the performance of the tool built for that domain was good, the questions posed to the user demanded a good deal of knowledge and experience of the ECHR. Here we use the knowledge of an expert with extensive experience of the ECHR to extend the model, through intermediate levels, to identify questions that are appropriate to the target user. We have undertaken a pilot evaluation in which a small number of lawyers have used the prototype program and provided very positive feedback, showing that they are receptive to AI solutions that give effective, explainable decision support.

CCS CONCEPTS
• Applied computing → Law.

KEYWORDS
ADFs, case-based reasoning, explainability, ECHR

ACM Reference Format:
Katie Atkinson, Joe Collenette, Trevor Bench-Capon, and Kanstantsin Dzehtsiarou. 2021. Practical Tools from Formal Models: The ECHR as a Case Study. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466095

1 INTRODUCTION
Since the pioneering formalisation of the British Nationality Act (BNA) [16], one popular approach to building legal support systems has been to formalise the law and then execute the formalisation through a program able to gather facts about particular cases from the user and provide explanations. In [16] the law in question was statute law, but the approach can also be applied to case law, e.g. [1]. Similar structures are found in hybrid systems based on reasoning with cases such as CABARET [17], IBP [7] and VJAP [12], which also represent the law as high level rules but, instead of directly querying the user, the leaf predicates are resolved according to factors found in precedent cases. In all these approaches, we begin with a high level question, such as 'is Peter a British Citizen?', and then unfold it through a series of intermediate concepts until a base level is reached. At this point, hybrid systems will launch their case based reasoning mechanism, but if we are using a rule based approach which expects the user to answer these base level questions, the terms must be readily understood by the user. For a successful system, two kinds of knowledge are required: knowledge as to the questions that should be asked, and knowledge as to how these questions should be answered. A formalisation of the legislation provides the first but not the second, and so if the law does not use terms familiar to the users, they will be unable to answer the questions. This is also true of hybrid systems; the users will need to ascribe the factors, which is itself a substantial and not entirely straightforward task. Moreover, the knowledge of how to answer the questions or ascribe the factors may vary according to the background and skills of the user answering the questions. This suggests that the formalisation of the law will need to be supplemented by a further set of intermediate concepts, one for each type of user, which will unfold into questions appropriate to the users.

With the BNA, the formalisation of the law resulted in questions such as 'where was Peter born?' and 'who is the father of Peter?', which appeared to be readily answerable by applicants, or by adjudicators on the basis of an application form. Much of the appeal of this system came from the fact that the questions that resulted directly from executing the formalisation were immediately and intuitively understandable in both these situations. However, those who followed in their footsteps found that this was not always the case. Problems became apparent in a follow up exercise to the BNA [5]. Two kinds of problems arose.

First, many of the questions were difficult to answer for, or even unintelligible to, the lay user, such as 'did Peter pay the qualifying level of contributions in the relevant tax year?'. Such questions might, however, be answerable by an adjudicator, especially if there was access to the contributions record database. With other questions the adjudicator would have the problem: lay users will know their birth details, but this may require investigation and verification by the adjudicator, who will therefore need to know what kind of evidence is required. So the questions may need further refinement

to enable the target users to answer them, and may need different refinements for different target users. For a lay person, the contributions condition will need to be expressed in terms of working in a particular year, while for the adjudicator, the question about date of birth will need to ask about acceptable forms of proof, such as birth certificates.

The second problem is that the question may have some particular legal interpretation. For example, UK Housing Benefit at one time had an addition if the house was "hard to heat" [4]. While a lay user may well have an opinion on this, in fact there was a very technical definition, spelled out in secondary legislation and case law, in terms of factors such as size and type of house, number of rooms and age of occupants. Clearly this information is needed both to enable lay users to answer the question, and to support the decision making of adjudicators, who may be unfamiliar with the relevant case law.

In [9] an executable model of Article 6 of the European Convention on Human Rights (ECHR) was presented. However the questions that resulted from that formal model are unlikely to be answered with confidence by a potential applicant, and some might even test an experienced lawyer. To move towards appropriate questions requires knowledge of the theory and practice of the ECHR. In this paper we will discuss what is needed for that model to become usable, in particular for those processing initial applications. Section 2 will briefly describe the ECHR. Section 3 will describe the formal model of [9]. Section 4 will describe the additional analysis we have carried out to move to a usable system, and Section 5 the resulting prototype. Section 6 gives an initial evaluation from members of the intended user group who have used the prototype. Finally Section 7 will offer some concluding remarks.

2 DOMAIN OVERVIEW
The European Convention on Human Rights (ECHR) is a regional human rights treaty that is now ratified by 47 European states and covers almost the whole of Europe. Since 1960, when it was established, the European Court of Human Rights (ECtHR) has delivered judgments in thousands of cases and created a significant body of legal precedents.

The ECHR has proved very popular for experimentation with machine learning techniques for legal judgment prediction tasks; for example, see [2], [14], [8], [15] and [13]. These studies all report success, with correct predictions being achieved in around 70-85% of cases. JURI Says [15] reports a success rate of 69% over the last year, although it fell to 60.9% for March 2021.¹

¹ JURI Says can be found at https://jurisays.com/ (accessed 2021/03/01).

3 ADFS FOR REASONING ABOUT ECHR CASES ON ARTICLE 6
The aim of the work described in [9] was to encapsulate Article 6 in an Abstract Dialectical Framework (ADF) [6]. A program based on the developed ADF predicted whether a particular case was admissible, and if so whether there was a violation of Article 6. One of the main strengths of the program was that it was able to justify its reasoning in terms of the ADF nodes and acceptance conditions, similar to the how? explanation of a rule based system. The ADF was created using the ANGELIC methodology [1], designed to capture case law in a manner that supports argumentation techniques and is maintainable. The ANGELIC ADF corresponds to the factor hierarchy of traditional case based systems such as CATO [3]. In an ADF the nodes are connected into a graph structure, and there are acceptance conditions that determine the acceptability of each parent in terms of its children. The ADFs produced by ANGELIC are always trees. In ANGELIC the acceptance conditions take the form of a set of prioritised sufficient conditions, together with a default to make the conditions collectively necessary. Maintainability comes from the modular nature of this structure.

Testing on a limited set of cases achieved a 100% success rate. Although the test set was small, the benefits of the approach were clear; the ADF was able to predict Article 6 cases with a high success rate, and fully justify its reasoning for the predictions. The developed ADF in [9] included a part relating to admissibility. The questions developed for admissibility were at a very high level. The model described in this paper, also developed using the ANGELIC methodology, focuses entirely on admissibility. This focus allows a much more detailed analysis to reduce the level of expertise needed to answer the questions with confidence. The decomposition of the high level questions obtained from the documentation used in [9] does, however, require guidance from someone expert in the relevant statute and case law. We have also replaced the command line prototype of [9] with a visual interface that uses standard interactions to make the program more accessible to non-computer specialists.

4 LEGAL FOUNDATIONS OF THE MODEL
The model described here considers whether an application is admissible or not, which is itself a substantial task. All applications submitted to the ECtHR need to be admissible in order to be considered on merits. In other words, the Court needs to establish that the application complies with a set of formal rules before it can examine the substance of this application [10]. Admissibility includes two types of rules: first, the ECtHR needs to establish that the application falls within its jurisdiction to confirm that the Court can deal with this application. Secondly, the ECHR has established a set of formal rules that the application itself needs to comply with, such as that the application was first submitted at the national level and was rejected by the national judicial bodies, and that it should be submitted within 6 months after the highest judicial body rejected the same application on the national level. This application should not be abusive, anonymous or trivial. This application also should not be clearly without merits or – in the ECHR terms – manifestly ill-founded. Again, if these conditions are not satisfied, the Court declares an application inadmissible. The Court's decision as to inadmissibility is final and cannot be appealed against.

The importance of admissibility is often underestimated. On average about 90% of all applications submitted to the ECtHR are declared inadmissible. For instance, in 2019, 44,500 applications were submitted to the Court and in the same year 38,480 applications were declared inadmissible. At the same time, in 2019, the Court delivered only 2,187 meritorious judgments [11]. A large number of applications is declared inadmissible every year, so our project has potential importance both for the applicants wanting to avoid inadmissibility and for the Court, for which the consideration of inadmissible applications takes a significant proportion of its time

and resources which could be re-allocated to the meritorious cases and so reduce the backlog (currently 50,000). Next we describe the implemented tool that we have produced to enable decision support for the important issue of admissibility of cases submitted to the ECtHR. The model used in the tool captures the factors discussed above that need to be examined to determine admissibility and is a result of close consultation with our expert on the ECtHR.

5 PROTOTYPE IMPLEMENTATION
The current prototype consists of two parts: the ADF model, executed by a JAVA program, and the JAVA front-end. The ADF model makes the predictions from a set of answers to the questions gathered from the user through the front-end, which then presents an explanation of the prediction to the user. The aim of the application is to allow people with minimal legal training to quickly identify whether a case that is being submitted to the ECtHR is likely to be admissible. The current ADF is more extensive in the modelling of admissibility when compared to the previous ADF developed in [9]. There are 61 questions which when answered allow the 26 factors to resolve to a prediction. In the previous ADF of [9] there were only 5 factors relating to admissibility and only 7 questions needed to be answered.

When developing the ADF we consulted a legal expert in order to decompose the high level nodes used in [9] and so capture the legal knowledge required to ensure that the list of questions is both as complete as possible, and appropriate to the target users. Table 1 shows the text and the acceptance conditions associated with the nodes. The root of the ADF is "V1", which indicates whether submitting the application to the ECHR is recommended. The program will recommend submission if both issues, nodes I1 and I2, are accepted. That is, the application is admissible (I2) and the application complies with rule 47 of the court (I1). In turn, node I1 is accepted if the base level factors, I1Q1 and I1Q2, are accepted and the abstract factors that represent that all necessary signatures (I1F1) and all documentation (I1F2) have been accepted. Abstract Factor I1F1 has a number of different base level factors, offering different possibilities for I1F1 to be accepted. These represent the different signatures that are needed depending on different situations.

The questions used as part of the ADF are not only based on the requirements of the ECHR but also take into account its case law. The case law makes the questions more nuanced and integrates those aspects of admissibility that do not obviously flow from the text of the Convention. In practice it is almost impossible to submit a successful application without familiarising oneself with the case law, which can be quite broad and diverse, or consulting with a professional lawyer specialising in the law of the ECtHR. This model presents the rules enshrined in the case law of the Court in a simplified form.

In Table 1, I2F2Q4 asks "Is the applicant a potential victim of a violation?", which affects whether the applicant has victim status. It is not obvious to a lay person that potential victims can apply to the Court, in certain limited circumstances. The scope of these circumstances was made clear in the case law: the question has arisen in the context of extradition cases, where the question was whether there was a risk of human rights violations in the receiving state if extradition were permitted, e.g. Soering v. the United Kingdom (1989). Identifying such questions requires a good knowledge of the legal practice.

Figure 1: The results screen of the JAVA prototype. The user is presented with a high level result and the reasoning behind the decision is shown in the text below.

When the JAVA program loads, it parses a text representation of the ADF, similar to Table 1. This supports maintenance by enabling any changes to the ADF to be automatically reflected in the program. The program asks the user a series of yes/no questions and once all the relevant questions have been answered the results are shown to the user. The results screen also displays the reasoning as to why that decision was made. Figure 1 shows the results for an example where the program informs the user that a signature is required for the form and explains that as the applicant does not have legal representation and they have not signed the application form, not all signatures have been provided, which in turn means that the application does not comply with rule 47 of the rules of the court and therefore the program recommends not to submit the application.

To give a better user experience than the previous work [9], only questions that are needed to generate a recommendation are asked: if a node can be resolved, the program moves on to the next. Different paths lead to different questions, e.g. the questions needed if the application is submitted by a company rather than as an individual are notably different.

Using the example in Figure 1 and referring to Table 1, we will show how the explanation in Figure 1 is generated. The last question to be answered is I1F1Q2. We start by printing all the base level factors that have the same parent as the last base level factor, as long as the user has answered the corresponding question. The parent of I1F1Q2 is I1F1, and the only other base level factor that has an answer is I1F1Q1. This gives us the first two lines of explanation. Next we print the abstract factors and issues in a hierarchical manner, adding the word "So" before each explanation. Finally we print the root and add the word "Therefore" at the start of the root's explanation.

This prototype has a number of benefits when compared to the system in [9]. Despite the two systems tackling two different aspects of legal reasoning within the ECHR domain, they are comparable as both systems use an underlying ADF that is presented to a user. The major benefit of the current approach is the ease of use when compared to the previous work.

Table 1: Nodes of the ADF that are referenced, along with their acceptance conditions.

ID      | Factor Text                                                                                           | Accepting Logic
V1      | Submission Recommendation                                                                             | ACCEPT IF I1 AND I2; REJECT OTHERWISE
I1      | Application complies with Rule 47                                                                     | ACCEPT IF I1Q1 AND I1Q2 AND I1F1 AND I1F2; REJECT OTHERWISE
I1Q1    | Has the applicant identified the state against which the application is brought to the Court (p2 of the application form)? | ACCEPT IF TRUE; REJECT OTHERWISE
I1Q2    | Has the applicant ticked an appropriate box on p2 of the application form?                           | ACCEPT IF TRUE; REJECT OTHERWISE
I1F1    | All Signatures Provided                                                                               | ACCEPT IF (I1F1Q1 AND I1F1Q4 AND (I1F1Q2 OR I1F1Q3)) OR (!I1F1Q1 AND I1F1Q3) OR (I1F1Q8 AND ((!I1F1Q1 AND I1F1Q7) OR (I1F1Q1 AND I1F1Q5 AND (I1F1Q6 OR I1F1Q7)))); REJECT OTHERWISE
I1F1Q1  | Does the applicant have a legal representative?                                                       | ACCEPT IF TRUE; REJECT OTHERWISE
I1F1Q2  | Was the application form signed (p13) by the applicant's legal representative?                       | ACCEPT IF TRUE; REJECT OTHERWISE
I1F2    | All Documentation                                                                                     | ACCEPT IF I1F2Q1 AND I1F2Q2; REJECT OTHERWISE
I2      | Application is Admissible                                                                             | ACCEPT IF I2F1 AND I2F2 AND I2F4 AND I2F5 AND (I2F3F1 AND I2F3F2 AND !I2F3F3 AND I2F3F4 AND I2F3F5); REJECT OTHERWISE
I2F2    | Victim Status                                                                                         | ACCEPT IF !I2F2Q3 AND (I2F2Q1 OR I2F2Q2 OR I2F2Q4); REJECT OTHERWISE
I2F2Q4  | Is the applicant a potential victim of a violation?                                                   | ACCEPT IF TRUE; REJECT OTHERWISE

be expected to answer . For example, in [9] there is a question “Was


No. of Responses

3
the person bringing the case the victim”. To answer the question
effectively the user needs to understand what constitutes a person, 2
a case, and a victim. The current ADF has decomposed the base 1
level factor into abstract factors such as I2F1 which describes what
constitutes a valid petitioner, and I2F2 which describes victim status 0
Q1
Q2
Q3
Q4
Q5
Q6
Q7
Q8
Q9

0
1
(see Table 1 for questions associated with these abstract factors). Q1
Q1
The questions associated with these abstract factors require less Positive Responses Question Number
specialist knowledge. Improvements have been made to the method Negative Responses
of interacting with the application, which is more in line with what
non-specialist computer users would expect from an application. Figure 2: Graph showing number of feedback responses to
The previous work used a text-only, command line interface [9]. each question, as positive or negative feedback
The new application uses a visual, mouse-driven, interface.

6 PILOT STUDY (1) (Functionality) Does the program have a reasonable response
Before embarking upon a full evaluation, we have conducted a pilot time?
study in which the prototype has been tested by a sample of our (2) (Functionality) Did the program run to completion without
target audience, which is a small group of lawyers who work within any interruptions?
the ECHR. Three independent lawyers (not, of course, including our (3) (Usability) How easy was the program to use?
domain expert) tested the prototype and completed a questionnaire (4) (Usability) How intuitive was the program to start using?
that covered five different aspects of the prototype. (5) (Explainability) How effective was the explanation given for
The following is the list of questions that the users of the pro- describing the program’s decisions?
totype were presented with, along with the question category in (6) (Explainability) How easy was the information to parse?
brackets. Each question had four possible responses, for different (7) (Usefulness) How useful would you find this program for
levels of agreement: assisting you in your work?

173
Practical Tools from Formal Models: The ECHR as a Case Study ICAIL’21, June 21–25, 2021, São Paulo, Brazil

(8) (Usefulness) Generally how useful would additional technol- solid grounding on the road to development of a tool for use in
ogy be for assisting with legal work? practice.
(9) (Questions about Questions) How clear were the questions The prototype tool has been tested by a small group of the tar-
that you answered within the program? Additionally the get audience of lawyers. Their response has been very positive,
users were able to say which questions were unclear. highlighting that lawyers are receptive to tools that are carefully
(10) (Questions about Questions) How much time would you designed and targeted at problems they encounter with their work,
save if you used a fully functional program for your work all of which paves the way for a fuller evaluation.
on deciding on the admissibility of cases? The continuation of the work will focus on expanding on the use
(11) (Questions about Questions) Does the program reflect how of the prototype. We are currently organising, with the assistance of
you decide on the admissibility of cases that you process? our ECHR expert, a field evaluation of our tool by users who work
within the ECtHR. This will provide a greater range of feedback
In Figure 2 the responses to the questionnaire have been con-
in order to adjust our tool as needed to best fit the audience’s
densed into positive or negative responses, where the top two
needs. Once we have a satisfactory tool for use by those assessing
answers to a question are positive and the bottom two are negative.
applications – which would provide a way of addressing the current
Though the results of the questionnaire come from a very small
significant backlog of unprocessed cases in the ECtHR – we will
sample, with only three lawyers completing the survey, we can con-
develop a set of questions for use by applicants themselves. Such
clude that the program developed worked well and was functional,
a facility should reduce the number of inadmissible applications
as all the responses received on functionality (Q1, Q2) and usability
by enabling applicants to gain a better understanding of what is
(Q3, Q4) were positive. Another positive outcome is that two of the
required to make an admissible application. In this way, access to
three ECHR lawyers responded that they trusted that the justifica-
to the ECtHR could be improved, both by assisting applicants, and
tions for the decisions made were sensible and understandable (Q5)
by speeding up the decision process for cases submitted.
and all three respondents agreed that the information was easy to
parse (Q6). All respondents saw the usefulness of our prototype
REFERENCES
(Q7), with two respondents stating they would use the program as it
[1] Latifa Al-Abdulkarim, Katie Atkinson, and Trevor Bench-Capon. 2016. A method-
currently stands, and the other affirming the usefulness but saying ology for designing systems to reason with legal cases using ADFs. AI and Law
that some (as opposed to many) changes are needed. Again all the 24, 1 (2016), 1–49.
[2] Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios
respondents agreed that technology has a role to play in the legal Lampos. 2016. Predicting judicial decisions of the European Court of Human
domain (Q8): one respondent said technology is needed rapidly, Rights: A natural language processing perspective. PeerJ Computer Science 2
while two cautioned that careful development will be needed. The (2016), e93.
[3] Vincent Aleven. 1997. Teaching case-based argumentation through a model and
positive responses to Q9 and Q10 were particularly pleasing, since examples. Ph.D. thesis. University of Pittsburgh.
these questions directly concerned the central aims of this exercise: [4] Trevor Bench-Capon. 1991. Practical legal expert systems: the relation between
the users all agreed that the questions were suitable for them and a formalisation of legislation and expert knowledge. In Law, Computer Science
and Artificial Intelligence, M Bennun and A Narayanan (Eds.). Ablex, 191–201.
that the program would save them time when assessing admis- [5] Trevor Bench-Capon, Gwen Robinson, Tom Routen, and Marek Sergot. 1987.
sibility. While the majority of the feedback has been positive, it Logic programming for large scale applications in law: A formalisation of sup-
plementary benefit legislation. In Proceedings of the 1st ICAIL. 190–198.
has also highlighted the need for domain experts to be a part of [6] Gerhard Brewka and Stefan Woltran. 2010. Abstract dialectical frameworks. In
the development process (Q11): although two respondents felt the Twelfth International Conference on the Principles of Knowledge Representation
program reflected all or part of their own process of dealing with and Reasoning. 102–111.
[7] Stefanie Brüninghaus and Kevin D Ashley. 2003. Predicting outcomes of case
admissibility, one felt that only some aspects had been covered. based legal arguments. In Proceedings of the 9th ICAIL. ACM, 233–242.
Overall the response to the program is very positive and indicates [8] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural legal
that it is a sound basis for further developments for our legal de- judgment prediction in English. arXiv preprint arXiv:1906.02059 (2019).
[9] Joe Collenette, Katie Atkinson, and Trevor Bench-Capon. 2020. An explainable
cision support tools, giving potential for collaborations between approach to deducing outcomes in European Court of HumanRights cases using
computer scientists and lawyers. Encouraged by these results, we ADFs. In Proceedings COMMA 2020. IOS Press, 21–32.
[10] Fiona De Londras and Kanstantsin Dzehtsiarou. 2018. Great Debates on the Euro-
will now extend the study to a larger group of potential users, to pean Convention on Human Rights. Macmillan International Higher Education.
ensure that this initial evaluation is reflected in the law community. [11] European Court of Human Rights. 2019. Analysis of statistics. (2019). https:
//www.echr.coe.int/Documents/Stats_analysis_2019_ENG.pdf.
[12] Matthias Grabmair. 2017. Predicting trade secret case outcomes using argument
7 SUMMARY AND NEXT STEPS schemes and learned quantitative value effect tradeoffs. In Proceedings of the 16th
ICAIL. 89–98.
Our aim in this paper has been to conduct a deep dive into the legal [13] Arshdeep Kaur and Bojan Bozic. 2019. Convolutional Neural Network-based
analysis required to provide a robust, executable model of Article 6 Automatic Prediction of Judgments of the European Court of Human Rights.. In
27th AIAI Irish Conference on AI and Cognitive Science. CEUR 2563, 458–469.
of the ECHR. We focused on the theory and practice of a particular [14] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2019. Using machine
issue within the the ECHR, namely admissibility of cases. In con- learning to predict decisions of the European Court of Human Rights. AI and
sultation with our legal expert, we defined an ADF that captures Law (2019), 1–30.
[15] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2020. URI SAYS: An
the domain knowledge relevant for the issue of admissibility and Automatic Judgement Prediction System for the European Court of Human
transformed our model into an implemented tool that is able to ask Rights.. In Proceedings of JURIX 2020. 277–280.
questions appropriate to the target user. This is a necessary step [16] Marek J. Sergot, Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek, Peter Ham-
mond, and H Terese Cory. 1986. The British Nationality Act as a logic program.
to enable academic exercises on legal case-based reasoning to be Commun. ACM 29, 5 (1986), 370–386.
transformed into usable tools. Our prototype tool implements the [17] David Skalak and Edwina L Rissland. 1992. Arguments and cases: An inevitable
intertwining. AI and Law 1, 1 (1992), 3–44.
back-end reasoning of the underpinning ADF and provides us with

174
On the relevance of algorithmic decision predictors for judicial
decision making
Floris Bex Henry Prakken
Utrecht University, The Netherlands Utrecht University, The Netherlands
Tilburg University, The Netherlands University of Groningen, The Netherlands
f.j.bex@uu.nl h.prakken@uu.nl

ABSTRACT not be judged any more on the legal merits of their individual case
In this article, we discuss case decision predictors, algorithms which, but on the basis of general statistics [19]. This is related to O’Neill’s
given some features of a legal case predict the outcome of the [18] criticism of ‘bucketing’, the practice of basing a decision about
case (i.e. the decision of the judge). We discuss whether, and if so an individual (e.g., about granting the person a loan) on the fact
how, such prediction algorithms can be used to support judges that the individual is member of a particular class of which a statis-
in their decision making process. We conclude that case decision tical frequency is known instead of on the particular situation of
predictors can only be useful in individual cases if they can give that individual. O’Neill [18, pp. 145–6] argues that, although this
legal justifications for their predictions, and that only these legal strategy might optimise the decision maker’s profit in the long run,
justifications are what should matter for a judge. it may lead to unjust decisions in individual cases.
To be able to evaluate this debate it is necessary to have a clear
CCS CONCEPTS picture on what information a prediction of a decision by an algo-
rithm in a particular case gives to the judge deciding the case. One
• Applied computing → Law; • Computing methodologies →
answer is given in [4]: “an AI system can be trained to accurately
Natural language processing; Machine learning.
forecast based on past behaviour what a user’s decision would be in
KEYWORDS a situation absent lapses in rationality.” So if an algorithm performs
well on a test set and if it predicts a particular decision in a new
legal prediction, legal decision making, application of algorithms case, then an arbitrary rationally-thinking judge would if assigned
ACM Reference Format: to the case, take the predicted decision. Of course, algorithms are
Floris Bex and Henry Prakken. 2021. On the relevance of algorithmic de- rarely 100% accurate, so we look at the probability that an arbitrary
cision predictors for judicial decision making. In Eighteenth International competent judge assigned to the case would take a predicted de-
Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, cision. We want to investigate to what extent an algorithmic case
São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.
prediction can yield such a decision probability: how, and under
1145/3462757.3466069
which assumptions, does a prediction in a particular case combined
1 INTRODUCTION with information about an algorithm’s performance on a test set
yield a decision probability for a new case?
The prediction of the decision of legal cases by means of machine- This last question immediately gives rise to a new question: why
learning algorithms has become a hot topic [1, 3, 4, 12, 16]. Such would judges be interested in probabilities at all when deciding a
algorithmic predictors can have various uses in the law. In this case? After all, we expect judges not to give probabilistic reasons
paper we discuss their application to support judges in individual for their decisions (except perhaps on matters of fact) but legal
cases, focusing on algorithmic decision predictors: algorithms that reasons. Still, judges have always looked at what their colleagues
predict the final decision of a legal case, a decision that would oth- decide in similar cases and there are good reasons for doing so,
erwise be made by the judge(s) (such as guilty/not guilty, rule for such as improving the consistency of intra-judicial decision making
plaintiff/defendant). Algorithmic decision predictors are sometimes [10, par. 8]. Underlying this is the assumption that if the great
claimed to improve the predictability and consistency of judicial majority of their colleagues would take the same decision, then
decision making, which is demanded by the principle of equality it presumably is the right decision. Of course, this assumption is
(cf. [10]). According to these claims, judges can use decision predic- at best defeasible and this leads to a second idea, namely, that if
tors in order to come to more consistent, more informed and less an algorithmic decision predictor performs well in the test phase,
biased judgments [4, 8, 17]. Others, however, fear that when judges’ then its predictions yield the ‘normal’ decision of the case, so that
decisions are informed by algorithmic case predictors, people will a judge could only deviate from a prediction if there are special
Permission to make digital or hard copies of all or part of this work for personal or circumstances in the case. We also want to investigate to what
classroom use is granted without fee provided that copies are not made or distributed extent such thinking is justified.
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the To address these questions, we first in Section 2 give a brief
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or overview of the main types of algorithmic case predictors for legal
republish, to post on servers or to redistribute to lists, requires prior specific permission cases. We then discuss in Section 3 the various senses in which
and/or a fee. Request permissions from permissions@acm.org.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil probabilities can be derived given an algorithm and its evaluation
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. with a test set. The heart of our paper is Section 4, in which we
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 discuss to which extent the probabilities derived from an algorithm
https://doi.org/10.1145/3462757.3466069

175
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Bex and Prakken

and its evaluation can be applied to a new case that is to be decided Predicting on the basis of the textual description of a case. Other
in court. Our main conclusion will be that in practice such an ap- algorithms predict decisions based on the text of case law, where
plication is almost never warranted. We then in Section 5 discuss statistical correlations are identified between, for example, word
what this means for the hope that the use of algorithmic case pre- combinations in the text and the case decision. Examples are algo-
dictors by judges in individual cases will improve the consistency rithms that predict whether the European Court of Human Rights
and predictability of judicial decision making. (ECHR) will for a specific article from the Convention with the
same name decide whether that article was violated, on the basis of
2 ALGORITHMIC DECISION PREDICTORS part of the text of the decision by the Court [1, 16] or the facts of
Algorithmic decision predictors come in, roughly, three types: pre- the case as communicated to the parties [15]. The performance of
dictors on the basis of legally relevant factors, predictors on the these different algorithms is largely comparable, with accuracy and
basis of features unrelated to the merits of a case and predictors on F-measures ranging between 75% and 80%. Although it would seem
the basis of the textual description of a case (for a recent overview that this kind of algorithm looks at the legal aspects of the case
see [3]). We focus on supervised classification algorithms that pre- (procedural history, facts), the identified statistical correlations do
dict a categorical outcome – one of multiple possible decisions, not say anything about the legally relevant reasons for the decision
such as affirm/reverse, guilty/not guilty – and not on algorithms of a case. Therefore these algorithms can also not explain their
that provide a continuous output – such as a regression algorithm predicted decisions in a legally meaningful way.
that predicts the length of a sentence or the amount of damages to
be paid. Furthermore, note that we only focus on predictors that, 3 FROM ALGORITHM PERFORMANCE TO
given some features of a case, predict the final decision of the case, PROBABILITIES
and that we do not include, for example, algorithms for estimating Recall that we want to investigate to which extent the performance
recidivism risk, as these do not provide a final case decision. of an algorithm on a test set justifies the idea that an arbitrary
competent judge assigned to a case will likely take the predicted
Predicting on the basis of legally relevant factors. One approach
decision. We call the probability at stake here the decision probabil-
predicts decisions on the basis of legally relevant factors in a case,
ity, the probability that an arbitrary competent and rational judge
by using either machine-learning techniques or a symbolic model
assigned to a particular case will take decision 𝑋 in that case, given
of legal reasoning. 1 This approach describes the facts of a case at a
that the algorithm predicts “𝑋 ”, that is, that the case will receive
higher level of abstraction than the concrete facts. The factors are
decision 𝑋 . In formulas this is 𝑃𝑟 (𝑋 |“𝑋 ”), where “𝑋 ” stands for the
assumed to be legally relevant for the case decision, so they can be
algorithm predicting decision 𝑋 . The precise way in which this
used for generating informative explanations of a prediction.
decision probability can be determined is for present purposes irrel-
The first studies into prediction on the basis of factors applied
evant, but the idea is that this probability can somehow be derived
general machine-learning techniques to encodings of cases in terms
from the algorithm’s performance on the test set. One candidate
of legally relevant factors. An early AI & law example is Mackaay
method is using Precision, the percentage of predictions “𝑋 ” on the
& Robillard [14], who studied the prediction of a type of Cana-
test set that are correct (i.e. the true positives divided by the total
dian tax case with the nearest-neighbor rule. In AI & Law, various
number of positively predicted cases). Interpreted as a frequency-
factor-based models for case-based reasoning have been used for
type probability, the precision is 𝑃𝑟 (𝑋 |“𝑋 ”), which looks like the
generating knowledge-based case decision predictions without the
decision probability we are after. However, we do not commit to ex-
use of machine learning techniques. Examples are the studies of
actly this way of determining the decision probability – for present
Ashley and his PhD students on the case law concerning misuse of
purposes, all that is relevant is that this decision probability will be
trade secrets in American law [2, 7]. Accuracy levels were obtained
defined in terms of an algorithm’s application to a test set, and the
of up to 88% [2] and 92% [7]. An advantage of this approach is that
crucial thing to note is that this makes the step to a probability of
the arguments generated about the predicted decision can be used
the same form for a new case that is not in the test set non-trivial.
as explanations of the prediction based on legal knowledge and in
a form not unlike the arguments of human judges or lawyers.
4 APPLYING GENERAL FREQUENCY-TYPE
Predicting on the basis of case metadata. Several authors have PROBABILITIES TO NEW CASES
used supervised machine learning based on case features that are For the answer to the question how the step from a probability
not related to the merits of the case. An example is the algorithm derived from performance on a test to a probability for a new
that predicts decisions of the American Supreme Court on the basis case can be made, we turn to the philosophy of probability theory.
of structured metadata such as the kind of case, the date at which it Philosophers distinguish frequency-type and belief-type probability.
was decided and which lower court decided the original case [12]. Probabilities are of the frequency type if they are based on relative
This algorithm, which correctly predicted 70% of the decisions, frequencies. Usually, the frequencies are relative to outcomes of
cannot explain the predicted decisions in a legally meaningful way, experiments that can be repeated indefinitely, such as tossing a coin
since the features on the basis of which it makes its predictions are or rolling a dice, but we consider the special case where they are
‘extra-legal’, that is, they are not related to the merits of the case. derived from a given finite set of test cases. Dawid [9] calls such
probabilities ‘statistical’ probabilities. In contrast, probabilities are
1 ‘Factors’
are here not just CATO-style boolean factors but any abstract fact pattern of the belief type if they are about the degree to which a proposition
that can have two or more values. is believable. Such probabilities can also be attached to propositions

176
On the relevance of algorithmic decision predictors for judicial decision making ICAIL’21, June 21–25, 2021, São Paulo, Brazil

that a single event occurs. The probabilities that can be defined more about it than its predicted decision. And the point is that if a
in terms of an algorithm’s performance on the test set are all of judge has more information than just membership of the ‘reference
the frequency type, since they are based on the relative number of class’ of the relative frequency (for instance, ‘80% of the cases with
true/false positives/negatives. However, what we want is a belief- predicted decision 𝑋 have decision 𝑋 ’), then it is irrational to rely
type probability, namely, the probability that a given new case will on the frequency-based probability concerning that class. Instead,
be decided as predicted by the algorithm. one should look at the probability of the decision conditional on the
So what we are interested in is what information a prediction of more specific reference class that corresponds to one’s knowledge
a decision gives to a judge in a particular case that the judge has to about the case. And this, of course, amounts to thinking about the
decide. The italicised words are crucial, since when a probability is particulars of case as judges are used to do.
interpreted as a frequency (or in Dawid’s [9] terms as a statistical Our argument is an instance of what philosophers call the prob-
probability), it does not by itself say anything about a particular case. lem of finding the right reference class when performing ‘direct
As is well known (e.g. [11, p. 137]), there is a logical gap between inference’. It is this reference-class argument that gives a philosoph-
frequencies and an individual probability: turning a frequency-type ical justification for O’Neill’s [18] criticism of ‘bucketing’ and more
probability into a probability about a particular case is a decision, generally for the fear of trial by statistics. In essence it means that
which has to be justified. Now how can this decision be justified? if nothing more is known of an algorithmic decision predictor than
It turns out that this requires a number of assumptions. its performance on the test set, then its predicted decisions cannot
be regarded as the decision that an arbitrary judge assigned to the
4.1 From the test set to the set of future cases case would likely take. So a judge who wants to know what his
Clearly, the move from the past to the future is only justified if the or her colleagues would likely decide in an individual case, should
set of future cases has the same proportions as the test set. However, not consult the algorithm since it does not provide the correct de-
this is not guaranteed (see also e.g. [5, 6]). First, the decisions of cision probability for the case. This in turn means that there is no
judges can change in that they start deciding on different grounds meaningful sense in which an algorithmically predicted decision is
or weighing reasons in different ways than they used to do. This can the ‘normal’ decision for the case, from which a judge could only
happen, for instance, when moral or political opinions in society deviate if he or she can point at special circumstances that make
change, or because different judges with different legal opinions this case different than a normal case of this kind.
To explain this further, imagine that cases are distributed in such
are assigned to the same type of case. Also, the distribution of types
a way that many cases are ‘clear’, for which a decision predictor
of cases can change because of changes in the world. Moreover, the
algorithm could be overfitted on inessential features of the training would always be correct, but many other cases are ‘hard’, for which
data (a well-known problem in statistics and machine learning). So a decision predictor would often be incorrect, but the algorithm
(as is well known in the literature on machine learning) in order to cannot explain to which type a new case belongs. Then only in the
accept a probability based on the test set as a probability for a future clear cases can the predicted decision be said to be the ’normal’ one.
set of cases, we have to make at least the following assumptions: But how can the judge know which case is easy and which case is
judges continue to decide cases on the same grounds; the frequency hard? To know this, the judge has to think about the particulars of
of the various types of cases remains the same; and the algorithm the case as judges always do. But then the judge can just as well
made its predictions on the test set for the right reasons. ignore the algorithmic prediction.

4.2 Yielding a decision probability for an 4.3 Objections to the reference class argument
individual case In the previous subsection we concluded that in practice it will be
This is not yet all. If the assumptions listed in the previous section impossible to rationally derive a case-specific decision probability
are justified, then all we know is that the frequency-type probability from frequency-type probabilities based on experiments with a test
derived from the test set can also be applied to a future set of cases set, so that judges who want to know what their colleagues would
(which can be open-ended). However, we are not after a probability likely decide in the case cannot obtain an answer to their question
of a kind of event (decisions predicted by this algorithm) but after by consulting a case decision predictor. We now discuss possible
the probability of a single event (this decision predicted by this objections to our reference-class argument for this conclusion.
algorithm). The former can be frequency-type but the latter must First, it might be argued that it is still rational to stick to a
be belief-type. We could apply the so-called frequency principle statistical decision probability for a new case, since there often are
[11, p. 137] and let the latter equate the former. However, if we no statistics on which a more specific frequency-type probability
do so, that is, if we base our probabilities concerning individual can be based. Yet this is a reasoning fallacy: if one wants to express
case decision predictions on frequencies, then we in fact make a a decision probability in such cases, one should take the additional
crucial assumption. This assumption is that the only ways in which information into account. If this cannot be done on the basis of
cases can relevantly differ is in the properties on which the relative known frequencies, one should form a probability based about one’s
frequencies are defined, that is, on their real and predicted decision, information about the specific case, on the penalty of making the
just as in familiar text book examples about urns with coloured balls unfounded assumption that this additional information is irrelevant
the only relevant way in which the balls can differ is in their colour. for the decision of the case (cf. [11, p. 137]).
While in the textbook examples this assumption is justified, for A variant of this argument is the argument that a belief-type
legal cases it is not. Judges who have to decide a case know much probability is always less well founded than a frequency-based

177
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Bex and Prakken

probability, so that a judge who wants to know what his or her tasks. In the medical example human and algorithm perform the
colleagues would likely decide can still look at what an algorithmic same task, namely, recognising cancer in images of, for instance,
decision predictor with a high precision predicts. However, this birthmarks. Moreover, the estimates of human and algorithm are
argument fails, since if one knows more about the case, then sticking compared to the same (objective) truth: by examining the cells un-
to the frequency is even less well-founded. Consider the analogy of der a microscope it can be determined with certainty whether there
an urn with 80% red balls and 20% blue balls. If this is all one knows is cancer. Thus a human expert and an algorithmic expert are com-
and one draws a ball from the top of the urn, then it is rational pared in terms of the same standard. In such a case a comparison
to assume that there is an 80% probability that it will be red. But between how humans and algorithms perform is meaningful and
suppose now that the person who filled the urn tells you that he the algorithm can be said to perform better than the human doctor,
first put all the red balls in and then all the blue balls, and that he namely, by recognising malign spots missed by the human doctor.
did not shake or stir, and that you take the ball from the top of the However, an algorithmic decision predictor performs a different
urn. It is now irrational to stick to the 80% probability that the ball task than the judge. A decision predictor predicts which decision a
will be red. In fact, the inverse probability (just 20% chance that the judge would take, which is a different task than the task the judge
ball will be red) seems more rational. performs, which is deciding the case. Then it is meaningless to
One may also consider technical solutions to the reference class compare the performance of the algorithm and the human judge.
problem. The first is to inspect the test set to check the algorithm’s What is more, even a correct prediction of a legally incorrect deci-
performance on subsets of test cases of particular types, as an at- sion would count as a success for the predictive algorithm. Such
tempt to make it more likely that the class memberships considered situations may arise, for instance, since the test set contains legally
for the algorithm’s performance coincide with the knowledge the incorrect decisions [5]. Correctly predicting a decision is not the
judge has about a particular case. This is a good idea in theory, same as predicting a correct decision.
but note that this approach in fact amounts to building a legal-
knowledge model of the reasons relevant for a decision. Moreover,
the created subclasses may be too small to yield reliable probabili-
5 CAN A DECISION PREDICTOR IMPROVE
ties, since in the law the collections of cases usually are not very PREDICTABILITY AND CONSISTENCY?
big [5]. Furthermore, a too fine-grained feature set may lead to an In Section 4 we concluded that a judge who has to decide a case
overfitted model that does not easily generalise [6, p. 9]. and who wants to know what an arbitrary rational judge assigned
A second technical solution is to obtain the probability for a to the case would probably decide, cannot rely on the statistics
single case directly from the algorithm, that is, the probability of a provided by (the evaluation of) an algorithmic decision predictor.
certain decision 𝑋 given the set of features 𝐹 that represents the However, this leaves the question what other benefits consulting
case, or 𝑃𝑟 (𝑋 |𝐹 ). Simpler predictive algorithms directly output such such an algorithm can have for a judge in an individual case. This
a prediction probability for a single case, and for e.g. neural net- we discuss below, focusing on the alleged benefit of improving the
works or support vector machines (SVMs) it is possible to estimate predictability and consistency of judicial decision making.
this probability based on the output of the model (cf. [20]). It can be First, we have to determine what the terms predictability and
argued that it is exactly this probability the judge needs: the algo- consistency mean in this context. Assuming that they mean the
rithm captures the behaviour of the judges in the training set cases same, there are two interpretations. One interpretation is that the
and then directly outputs the probability that these judges would same case is decided the same by different judges. Another interpre-
rule 𝑋 in a case like the current one with features 𝐹 . However, this tation is that similar cases are decided in the same way (or a similar
still does not yield the probability that an arbitrary judge would way) by the same or different judges. The second interpretation
rule in that way given the case, because there need not be a relation implies the first but not vice versa.
between the correctness of predictions and the prediction probabil- We can now ask how an algorithmic decision predictor can be
ity. For example, the algorithm can predict the wrong decision with used in order to improve predictability and consistency. If these
a high probability, or the algorithm may over- or underestimate terms mean that the same case is decided the same by different
individual probabilities simply because this leads to better classifi- judges, then a sure way to guarantee predictability and consistency
cation performance. Furthermore, using such advanced techniques is to give all judges the same algorithmic decision predictor and to
brings along even more assumptions and makes it even harder to require that they all follow its predictions in all cases. Then different
determine what exactly the given probability means, particularly judges would, when assigned to the same case, be guaranteed to
for a judge with no background in statistics or machine learning. take the same decision. However, this does not make sense, since
So instead of relying on this probability, the judge would be better as we argued in Section 4.3 we do not know whether all decisions
off thinking about the particulars of case as normal. in the training and test set were correct. If all judges blindly follow
A final objection is that an algorithm does not have to be perfect, the algorithm’s prediction, then both its accuracy and precision
as long as it performs better than human decision makers. Here will increase to 100%, and this would further lead to a tendency
sometimes the medical domain is mentioned, in which it is widely to make the predicted decision the legally correct one even if this
accepted that, for instance, a human oncologist has to consult a cannot be justified.
data-driven predictive algorithm for recognising skin cancer if this What if predictability and consistency mean that similar cases
algorithm has been proven to perform better than humans [21]. should be decided the same? Is this improved if we require judges
However, this analogy beaks down, since unlike in the medical to consult decision predictors as a source of information? Again, for
example, a legal predictive algorithm and a judge perform different mere decision predictors we cannot know. Suppose an algorithm

178
On the relevance of algorithmic decision predictors for judicial decision making ICAIL’21, June 21–25, 2021, São Paulo, Brazil

with 90% precision predicts decision X for case C. Does the judge like accuracy, precision and recall, and instead involves potential
then treat like cases alike if s/he follows the prediction? We cannot or actual users of the algorithms. More generally, we believe that
know, since the prediction in itself would not give any information it is important to inform the legal world in transparent language
about similarity with other cases. In fact, it might well be that about not only the potential benefits but also the limitations of
an algorithm treats cases that judges would regard as similar as algorithmic outcome predictors.
different or vice versa (likewise [6, p. 6]). For example, text-based Finally, we like to emphasise that our conclusions are confined
decision predictors like the ECHR predictor could fail to recognise to the use of algorithmic decision predictors for informing judges
that linguistically small differences are legally very relevant. on what they could decide in particular cases. Other uses of such
However, is this different if the prediction is combined with an algorithms may well have benefits, but this requires another paper.
explanation for it? The answer is negative if the explanation cannot
be given in terms of reasons related to the merit of the case. So a REFERENCES
SCOTUS-like predictor is ruled out. But this implies that an ECHR- [1] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, and V. Lampos. 2016. Predicting
judicial decisions of the European Court of Human Rights: A natural language
type predictor is also ruled out, since it cannot extract any legally processing perspective. PeerJ Computer Science 2 (2016), e93.
relevant information from the texts to which it is applied, so there is [2] V. Aleven. 2003. Using background knowledge in case-based legal reasoning: a
no way to identify whether its prediction is based on legal grounds computational model and an intelligent learning environment. Artificial Intelli-
gence 150 (2003), 183–237.
or on extraneous factors. Only decision predictors that base their [3] K. Ashley. 2019. A brief history of the changing roles of case prediction in AI
predictions on legally relevant factors could possibly yield legally and law. Law in Context. A Socio-legal Journal 36, 1 (2019), 93–112.
relevant information about similar cases to a judge. [4] B. Babic, D. Chen, T. Evgeniou, and A.-L. Fayard. 2021. The better way to onboard
AI. Harvard Business Review (2021). http://nber.org/~dlchen/papers/The_Better_
However, we believe that only these legal explanations are what Way_to_Onboard_AI.pdf To appear.
should matter for a judge, and that the judge should ignore the fact [5] T. Bench-Capon. 2020. The need for good-old fashioned AI and law. In Interna-
tional Trends in Legal Informatics: Festschrift for Erich Schweighofer, W. Hötzen-
that a decision was predicted by an algorithm with good statisti- dorfer, C. Tschol, and F. Kummer (Eds.). Editions Weblaw, Bern, 23–36.
cal performance on a test set. This use of such algorithms is not [6] R. Binns. 2020. Analogies and disanalogies between machine-driven and human-
much different from how judges currently use other information driven legal judgement. Journal of Cross-disciplinary Research in Computational
Law 1, 1 (2020).
sources, such as books, journals and peer consultation. Numeri- [7] S. Brueninghaus and K. Ashley. 2009. Automatically classifying case texts and
cal performance indicators like accuracy, precision and recall can predicting outcomes. Artificial Intelligence and Law 17 (2009), 125–165.
justify a degree of trust in algorithms in this general sense, but can- [8] I. Chalkidis, I. Androutsopoulos, and N. Aletras. 2019. Neural legal judgment
prediction in English. In Proceedings of the 57th Annual Meeting of the Association
not indicate the quality of individual predictions or explanations. for Computational Linguistics. 4317–4323.
Moreover, evaluating the quality of algorithmic explanations for [9] P. Dawid. 2005. Probaility and Proof. (2005). http://tinyurl.com/tz85o Appendix
to Analysis of Evidence, by T. J. Anderson, D. A. Schum and W. L. Twining.
individual predictions requires validation studies of a kind that goes [10] European Commission for the Efficiency of Justice (CEPEJ). 2018. European
far beyond the current trend to focus on numerical performance ethical Charter on the use of Artificial Intelligence in judicial systems and their
measures like accuracy, precision and recall and is more akin to an environment.
[11] I. Hacking. 2001. An Introduction to Probability and Inductive Logic. Cambridge
older AI tradition of carrying out empirical validation studies with University Press, Cambridge.
potential or actual users of the algorithm [13]. [12] D. Katz, M. Bommarito, and J. Blackman. 2017. A general approach for predicting
the behavior of the Supreme Court of the United States. PloS one 12, 4 (2017),
e0174698.
[13] R. O. Keefe. 1993. Issues in the verification and validation of knowledge-based
6 CONCLUSION systems. In Advances in Software Engineering and Knowledge Engineering, V. Am-
briola and G. Tortora (Eds.). Series on Software Engineering and Knowledge
In this paper we argued that a judge who has to decide a case Engineering, Vol. 2. World Scientific Publishing Co, 173–189.
and who wants to know what an arbitrary rational judge assigned [14] E. Mackaay and P. Robillard. 1974. predicting judicial decisions: The nearest
to the case would probably decide, cannot rely on the statistics neighbor rule and visual representation of case patterns. Datenverarbeitung im
Recht 3 (1974), 302–331.
provided by (the evaluation of) an algorithmic decision predictor. [15] M. Medvedeva, , X. Xu, M. Vols, and M. Wieling. 2020. JURI SAYS: an automatic
The idea that an algorithmic prediction that performed well on a judgement prediction system for the European Court of Human Rights. In
test set yields the ‘normal’ decision of the case, from which a judge Legal Knowledge and Information Systems. JURIX 2020: The Thirty-Third Annual
Conference, S. Villata, J. Harašta, and P. Křemen (Eds.). IOS Press, Amsterdam
could only deviate if there are special circumstances in the case, is etc., 277–280.
unfounded. Moreover, we argued that relying on the predictions of [16] M. Medvedeva, M. Vols, and M. Wieling. 2020. Using machine learning to predict
decisions of the European Court of Human Rights. Artificial Intelligence and Law
such algorithms cannot improve the predictability and consistency 28, 2 (2020), 237–266.
of judicial decision making in desirable ways. We believe that mere [17] F. Muhlenbach and I. Sayn. 2019. Artificial Intelligence and law: What do people
decision predictors, that is, predictors that cannot explain their really want?: Example of a French multidisciplinary working group. In Proceedings
of the 17th International Conference on Artificial Intelligence and Law. ACM Press,
predictions in legally meaningful terms, should not be used at New York, 224–228.
all by judges as decision-support tools for individual cases. Such [18] C. O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality
algorithms do not give any useful information to judges and may and Threatens Democracy. Crown.
[19] F. Pasquale and G. Cashwell. 2018. Prediction, persuasion, and the jurisprudence
in fact be misleading and cause intellectual laziness. of behaviourism. University of Toronto Law Journal 68, supplement 1 (2018),
If an algorithmic decision predictor gives any useful information 63–81.
[20] J. Platt et al. 1999. Probabilistic outputs for support vector machines and com-
to judges at all, it is not in its predictions but in its explanations parisons to regularized likelihood methods. Advances in large margin classifiers
for these predictions. However, we noted that whether algorithmic 10, 3 (1999), 61–74.
explanations can indeed improve the quality of judicial decision [21] J. Susskind. 2018. Future Politics: Living Together in a World Transformed by Tech.
Oxford University Press, Oxford.
making requires validation studies of a kind that goes far beyond
the current trend to focus on numerical performance measures

179
The Burden of Persuasion in Structured Argumentation
Roberta Calegari Régis Riveret Giovanni Sartor
Alma Mater Research Institute for Commonwealth Scientific and Alma Mater Research Institute for
Human-Centered Artificial Industrial Research Organisation Human-Centered Artificial
Intelligence Brisbane, Australia Intelligence
Bologna, Italy regis.riveret@data61.csiro.au Bologna, Italy
roberta.calegari@unibo.it giovanni.sartor@unibo.it

ABSTRACT is also recognised in civil law jurisdiction, possibly using a different


In this paper we provide an account of the burden of persuasion in terminology [10]. The focus of this paper is on the burden of persua-
the context of structured argumentation. A formal model for the sion. We will show how an allocation of the burden of persuasion
burden of persuasion is defined, discussed, and used to capture the may induce single outcomes in contexts in which the assessment
role of the burden of persuasion in adjudicating conflicts between of conflicting arguments would, without such an allocation, remain
conflicting arguments and in determining the dialectical status of undecided. Our model combines Prakken and Sartor’s [17] model
arguments. We consider how our model can also capture adversarial with the insight from Carneades’ [8], and takes into account the
burdens of proof, namely, those cases in which failure to establish fact that the persuasiveness of an argument, in a dialectical context,
an argument for a proposition burdened with persuasion entails is determined not only by the internal strength of the argument, as
establishing the complementary proposition. determined by the strength of the inference rules used for building
the argument (according, for instance, to the last link criterion),
KEYWORDS but also by the applicable counterarguments. Our model originates
from legal considerations and is applied to legal examples [4, 5].
Burden of persuasion, argumentation, legal reasoning
However, the issue of the burden of proof has a significance that
ACM Reference Format: goes beyond the legal domain involving other domains – such as
Roberta Calegari, Régis Riveret, and Giovanni Sartor. 2021. The Burden public discourse, risk management, etc. [3] – in which evidence
of Persuasion in Structured Argumentation. In Eighteenth International and arguments are needed, and corresponding responsibilities are
Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, allocated, according to types of dialogues and dialectical or organi-
São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.
sational roles [19, 20]. The novelty of this contribution consists of
1145/3462757.3466078
a new definition of defeat relations involving arguments burdened
with persuasion, and a corresponding definition of the criteria for
1 INTRODUCTION labelling such arguments.
The burden of proof is a central feature of many dialectical con-
texts. It is particularly relevant in those domains, such as legal 2 BURDENS OF PERSUASION
disputations or political debates, in which controversial issues are Let us illustrate how the burden of persuasion works through two
discussed in order to adopt a decision, see [20] on burdens of proof examples, one from criminal law and one from civil law.
in different dialogue types. Research in AI & law has devoted a
number of contributions to the formal analysis of burdens of proof: Burden of persuasion in criminal law. In criminal law, the burden
models of defeasible legal reasoning have been criticised for not of production is distributed between prosecution and defence, while
taking burdens of proof into account [11], the distinction between the burden of persuasion (in most legal systems) is always on pros-
different standards of proof has been addressed [7], formal accounts ecution. More exactly, in criminal law, the burden of production
of burdens of proof have been developed within models for formal falls on the prosecution relative to the two constitutive elements of
argumentation [9, 17]. However, it seems to us that a comprehensive crime, namely, the criminal act (actus reus) and the required mental
model of burdens of proof in legal reasoning is still missing. state (mens rea), be it intention/recklessness or negligence, while
In the legal domain, two types of burdens can be distinguished: it falls to the defendant relative to justifications or exculpatory
the burden of production (also called burden of providing evidence, defences (e.g., self-defence, state of necessity, etc.). In other words,
or ‘evidential’ burden), and the burden of persuasion [17]. This ter- if both actus reus and mens rea are established, but no exculpatory
minology is used in common law systems [21], but the distinction evidence is provided, the decision should be criminal conviction. On
the other hand, the burden of persuasion falls on the prosecution
Permission to make digital or hard copies of all or part of this work for personal or for all determinants of criminal responsibility, including not only
classroom use is granted without fee provided that copies are not made or distributed for the constitutive elements of a crime but also for the absence of
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM justifications of exculpatory defences.
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a Example 2.1. Let us consider a case in which a woman, Hellen,
fee. Request permissions from permissions@acm.org. has shot and killed an intruder in her home. The applicable law
ICAIL’21, June 21–25, 2021, São Paulo, Brazil consists of (a) the rule according to which intentional killing consti-
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 tutes murder, and (b) the exception according to which there is no
https://doi.org/10.1145/3462757.3466078 murder if the victim was killed in self-defence. Assume that it has

180
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Calegari, Riveret and Sartor

been established with certainty that Hellen shot the intruder and prosecution’s argument for murder and the doctor’s argument for
that she did so intentionally. However, it remains uncertain whether her diligence, respectively.
the intruder was threatening Hellen with a gun, as claimed by the
defence, or had turned back and was running away on having been 3 ARGUMENTATION FRAMEWORK
discovered, as claimed by prosecution. The burden of persuasion is We introduce a structured argumentation framework relying on a
on prosecution, who needs to provide a convincing argument for lightweight ASPIC+ -like argumentation system [14]. For the sake
murder. Since in this case it remains uncertain whether there was of simplicity, we assume that arguments only consist of defeasible
self-defence, prosecution has failed to provide such an argument. rules, to the exclusion of strict rules and of some constituents of a
Therefore, the legally correct solution is that there should be no knowledge base—such as axioms, ordinary premises, assumptions,
conviction: Hellen needs to be acquitted. □ and issues that can be found in the complete model [14]. A frame-
Burden of persuasion in civil law. In civil law, burdens of produc- work based on defeasible rules is sufficient for our purposes and
tion and burdens of persuasion may be allocated in different ways. can be extended as needed with further structures.
The general principle is that the plaintiff only has the burden of
proof (both of production and persuasion) relatively to the operative 3.1 Argumentation graphs
facts that ground its claim, while the defendant has the burden of Let any literal be an atomic proposition or its negation. Literals are
proof relative to those exceptions which may prevent the operative brought into relation through defeasible rules.
facts from delivering their usual outcomes, such as justifications
Notation 3.1. For any literal 𝜙, its complement is denoted by 𝜙, ¯
with regard to torts, or incapability and vices of consent in contracts.
However, derogations from this principle may be established by i.e., if 𝜙 is a proposition 𝑝, then 𝜙¯ is ¬𝑝, while if 𝜙 is ¬𝑝, then 𝜙¯ is 𝑝.
the law, in order to take into account various factors, such as the Definition 3.1. A defeasible rule 𝑟 is a construct of the form:
presumed ability of each party to provide evidence in favour of his 𝜌 : 𝜙 1, ..., 𝜙𝑛 , ∼ 𝜙 1′ , ..., ∼ 𝜙𝑚
′ ⇒𝜓
or her claim, the need to protect weaker parties against abuses, etc.
In matters of civil liability, for example, it is usually the case that with 0 ≤ 𝑛 and 0 ≤ 𝑚, and where
the plaintiff, who asks for compensation, has to prove both that the • 𝜌 is the unique identifier for 𝑟 , denoted by N(𝑟 );
defendant caused the harm, and that this was done intentionally • each 𝜙 1, . . . 𝜙𝑛 , 𝜙 1′ , ..., 𝜙𝑚 ′ ,𝜓 is a literal;

or negligently. However, in certain cases, the law establishes an • 𝜙 1, . . . 𝜙𝑛 , ∼ 𝜙 1, ..., ∼ 𝜙𝑚 are denoted by 𝐴𝑛𝑡𝑒𝑐𝑒𝑑𝑒𝑛𝑡(𝑟 ) and 𝜓 by
′ ′

inversion of the burden of proof for negligence. This means that in 𝐶𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡(𝑟 );
order to obtain compensation, the plaintiff only has to prove that • ∼ 𝜙 denotes the weak negation (negation by failure) of 𝜙: 𝜙 is
s/he was harmed by the defendant. This will be sufficient to win an exception that would block the application of the rule whose
the case unless the defendant provides a convincing argument that antecedent includes ∼ 𝜙.
s/he was diligent (not negligent). The identifier of a rule can be understood as the name of the rule.
Example 2.2. Let us consider a case in which a doctor caused It can be used as a literal to specify that the named rule is appli-
harm to a patient by misdiagnosing his case. Assume that there is cable, and its negation correspondingly to specify that the rule is
no doubt that the doctor harmed the patient: she failed to diagnose inapplicable [13].
cancer, which consequently spread and became incurable. However, A superiority relation ≻ is defined over rules: 𝑠 ≻ 𝑟 states that
it is uncertain whether or not the doctor followed the guidelines rule 𝑠 prevails over rule 𝑟 .
governing this case: it is unclear whether she prescribed all the
Definition 3.2. A superiority relation ≻ over a set of rules
tests that were required by the guidelines, or whether she failed to
𝑅𝑢𝑙𝑒𝑠 is an antireflexive and antisymmetric binary relation over
prescribe some tests that would have enabled cancer to be detected.
𝑅𝑢𝑙𝑒𝑠, i.e., ≻⊆ 𝑅𝑢𝑙𝑒𝑠 × 𝑅𝑢𝑙𝑒𝑠.
Assume that, under the applicable law, doctors are liable for any
harm suffered by their patients, but they can avoid liability if they A defeasible theory consists of a set of rules and a superiority
show that they were diligent (not negligent) in treating the patient, relation over the rules.
i.e., that they exercised due care. Thus, rather than the patient
having the burden of proving that doctors have been negligent (as Definition 3.3. A defeasible theory is a tuple ⟨𝑅𝑢𝑙𝑒𝑠, ≻⟩ where
it should be the case according to the general principles), doctors 𝑅𝑢𝑙𝑒𝑠 is a set of rules, and ≻ is a superiority relation over 𝑅𝑢𝑙𝑒𝑠.
have the burden of proving their diligence. Let us assume that We can construct arguments by chaining rules from the defeasi-
the law also says that doctors are considered to be diligent if they ble theory, as specified in the following definition; cf. [6, 13, 18].
followed the medical guidelines that govern the case. In this case,
given that the doctor has the burden of persuasion on her diligence, Definition 3.4. An argument 𝐴 constructed from a defeasible
and that she failed to provide a convincing argument for it, the theory ⟨𝑅𝑢𝑙𝑒𝑠, ≻⟩ is a finite construct of the form:
legally correct solution is that she should compensate the patient.□ 𝐴 : 𝐴1, . . . 𝐴𝑛 ⇒𝑟 𝜙
These two examples share a common feature. In both, uncertainty with 0 ≤ 𝑛, and where
remains concerning a decisive issue. However, this uncertainty does • 𝐴 is the argument’s unique identifier;
not preclude the law from prescribing a single legal outcome in • 𝐴1, . . . , 𝐴𝑛 are arguments constructed from the defeasible theory
each case. This outcome can be achieved by discarding the argu- ⟨𝑅𝑢𝑙𝑒𝑠, ≻⟩;
ments that fail to meet the required burden of persuasion, i.e., the • 𝜙 is the conclusion of the argument, denoted by Conc(𝐴);

• 𝑟 : Conc(𝐴1 ), . . . , Conc(𝐴𝑛 ) ⇒ 𝜙 is the top rule of 𝐴, denoted by The notions of bp-rebuttings and undercuttings can then be
TopRule(𝐴). used to define a defeat relation comprising bp-defeats and strict
Notation 3.2. Given an argument 𝐴 : 𝐴1, . . . 𝐴𝑛 ⇒𝑟 𝜙 as in bp-defeats between arguments.
definition 3.4, Sub(𝐴) denotes the set of subarguments of 𝐴, i.e., Definition 3.9 (bp-defeat). A defeat relation { over a set of
Sub(𝐴) = Sub(𝐴1 ) ∪ . . . ∪ Sub(𝐴𝑛 ) ∪ {𝐴}. arguments A is a binary relation over A, i.e. {⊆ A × A, such
Different types of inconsistencies can appear between arguments, that ∀𝐴, 𝐵 ∈ A, 𝐴 defeats 𝐵, i.e. 𝐴 { 𝐵, iff 𝐴 bp-defeats 𝐵 or 𝐴
causing them to attack each other. In the ASPIC family of argumen- strictly-bp-defeats 𝐵:
tation frameworks, attack is differentiated from defeat, with the (1) 𝐴 bp-defeats 𝐵 iff 𝐴 bp-rebuts 𝐵 or 𝐴 undercuts 𝐵
latter taking preferences between arguments into account. Prefer- (2) 𝐴 strictly-bp-defeats 𝐵 iff 𝐴 bp-defeats 𝐵 and 𝐵 does not
ences over arguments are defined in the work reported here via a bp-defeat 𝐴.
last-link ordering: an argument 𝐴 is preferred over another argu-
Example 3.10 (Civil law example: rules and arguments). To ex-
ment 𝐵 if the top rule of 𝐴 is stronger than the top rule of 𝐵.
emplify the notions just introduced, let us formalise Example 2.2
Definition 3.5. A preference relation ≻ is a binary relation through a set of rules. We assume that sufficient evidence is pro-
over a set of arguments A, such that an argument 𝐴 is preferred to vided to support (in the absence of evidence to the contrary) the
argument 𝐵, denoted by 𝐴 ≻ 𝐵, iff TopRule(𝐴) ≻ TopRule(𝐵). factual claims at issue (𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠, ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠, ℎ𝑎𝑟𝑚), i.e., that the
Before specifying the notion of defeat between arguments, let corresponding burdens of production are satisfied.
us first identify burdens of persuasion, i.e., those literals the proof f1 : ⇒ ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 r1 : ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 ⇒ ¬𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
of which requires a convincing argument. We assume that such f2 : ⇒ 𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 r2 : 𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 ⇒ 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
literals are consistent: it cannot be the case that there is a burden f3 : ⇒ ℎ𝑎𝑟𝑚 r3 : ℎ𝑎𝑟𝑚, ∼ 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒 ⇒ 𝑙𝑖𝑎𝑏𝑙𝑒
of persuasion both on 𝜙 and 𝜙̄. We can then build the following arguments:
Definition 3.6 (Burdens of persuasion). Let BurdPers, the set A1 : ⇒f1 ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 A2 : A1 ⇒r1 ¬𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
of burdens of persuasion, be a set of literals such that if 𝜙 ∈ B1 : ⇒f2 𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 B2 : B1 ⇒r2 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
BurdPers then 𝜙 ∉ BurdPers. We say that an argument 𝐴 is bur- C1 : ⇒f3 ℎ𝑎𝑟𝑚 C2 : C1 ⇒r3 𝑙𝑖𝑎𝑏𝑙𝑒
dened with persuasion if Conc(𝐴) ∈ BurdPers. If there were no burden of persuasion, the relations would be the
following: arguments A1 and B1 defeat one another, B1 defeats A2,
We now consider possible collisions between arguments, i.e., A1 defeats B2, A2 and B2 defeat one another, B2 strictly defeats C2.
those cases in which an argument 𝐴 challenges an argument 𝐵: (a) If, on the contrary, the burden of persuasion is on the doctor’s diligence
by contradicting the conclusion of a subargument of 𝐵 (rebutting), or (𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒 ∈ BurdPers), then B2 fails to defeat A2, so that A2
(b) by denying (the application of) the top rule of a subargument of 𝐵 strictly defeats B2. □
or by contradicting a weak negation in the body of the top rule of
a subargument of 𝐵 (undercutting). Note that our notion of rebutting Given a defeasible theory, arguments built from it and defeats be-
corresponds to the notion of successful rebutting in [14]. tween these arguments are gathered into an argumentation graph.
Definition 3.7 (bp-rebut). Argument 𝐴 bp-rebuts argument 𝐵 iff Definition 3.11. An argumentation graph constructed from a
∃𝐵 ′ ∈ 𝑆𝑢𝑏(𝐵) such that Conc(𝐴) is the complement of Conc(𝐵 ′ ), and defeasible theory 𝑇 is a tuple ⟨A, {⟩, where A is the set of all
(1) Conc(𝐴) ∉ BurdPers, and 𝐵 ′ ⊁ 𝐴, or arguments constructed from 𝑇 , and { is a defeat relation over A.
(2) Conc(𝐴) ∈ BurdPers and 𝐴 ≻ 𝐵 ′ . Notation 3.3. Given an argumentation graph 𝐺 = ⟨A, {⟩, we
According to Definition 3.7, for an unburdened argument 𝐴 to write A𝐺 , and {𝐺 to denote A and { respectively.
rebut 𝐵 by contradicting the latter’s subargument 𝐵 ′ , it is sufficient
that 𝐵 ′ is non-superior to 𝐴. For a burdened argument 𝐴 to rebut 𝐵 3.2 Labelling semantics
by contradicting 𝐵 ′ , it is necessary that 𝐴 is superior to 𝐵 ′ . Thus, Let us now introduce the notion of {IN, OUT, UND}-labellings of an
burdens of persuasion supplement priorities in deciding conflicts argumentation graph, so that each argument in the graph is labelled
between arguments having opposed conclusions. They dictate the IN, OUT or UND, depending on whether it is accepted, rejected, or
outcome of such conflicts when priorities do not already determine undecided, respectively.
which argument is to prevail: when two arguments contradict one
another, the one burdened with persuasion fails to bp-rebut the Definition 3.12. A {IN, OUT, UND}-labelling 𝐿 of an argumenta-
other, while the latter will succeed in bp-rebutting the first. tion graph 𝐺 is a total function 𝐿 : A𝐺 → {IN, OUT, UND}.
Undercutting is defined as usual, including both the case in Notation 3.4. Given a labelling 𝐿, we write IN(𝐿) for {𝐴 | 𝐿(𝐴) =
which the attacker excludes the application of the top rule of IN }, OUT(𝐿)
for {𝐴 | 𝐿(𝐴) = OUT} and UND(𝐿) for {𝐴 | 𝐿(𝐴) = UND}.
the attacked argument (by denying the rule’s name) and the
case in which it contradicts a weakly negated literal in the body of There are various ways to specify {IN, OUT, UND}-labelling func-
that rule. tions [1]. For example, they can be complete or grounded.

Definition 3.8 (bp-undercut). Argument 𝐴 undercuts argument Definition 3.13. A complete {IN, OUT, UND}-labelling of an argu-
𝐵 iff ∃𝐵 ′ ∈ 𝑆𝑢𝑏(𝐵) such that: 1) Conc(𝐴) = ¬N(𝑟 ) and TopRule(𝐵 ′ ) = mentation graph 𝐺 is a {IN, OUT, UND}-labelling such that ∀𝐴 ∈ A𝐺
𝑟 ; or 2) Conc(𝐴) = 𝜙 and ∼ 𝜙 ∈ 𝐴𝑛𝑡𝑒𝑐𝑒𝑑𝑒𝑛𝑡(TopRule(𝐵 ′ )). (1) 𝐴 is labelled IN iff all defeaters of 𝐴 are labelled OUT, and

(2) 𝐴 is labelled OUT iff 𝐴 has a defeater labelled IN. persuasion. Let us assume that (as under Italian law) we have
BurdPers = {𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒, 𝑙𝑖𝑎𝑏𝑙𝑒}, i.e., the doctor has to provide
Definition 3.14. A grounded {IN, OUT, UND}-labelling of an ar-
a convincing argument that she was diligent, and the patient has
gumentation graph 𝐺 is a complete {IN, OUT, UND}-labelling 𝐿 of 𝐺
to provide a convincing argument for the doctor’s liability. As the
such that IN(𝐿) is minimal.
burdened doctor’s argument for 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒 is labelled OUT, her li-
Note that any argument not labelled IN or OUT must be labelled ability can be established even though it remains uncertain whether
UND, since any { IN, OUT, UND }-labelling is a total function. the guidelines were followed. □
While common specifications of {IN, OUT, UND}-labellings define
reasonable positions [1], they do not cater for burdens of persuasion.
We now specify the notion of bp-labelling, namely, a labelling which A2 B2 C2

takes into account a set of burdens of persuasion. UND UND UND

Definition 3.15. A bp-labelling of an argumentation graph


𝐺, relative to a set of burdens of persuasion BurdPers, is a A1 B1 C1

{IN, OUT, UND}-labelling 𝐿 such that ∀𝐴 ∈ A𝐺 UND UND IN

(1) 𝐴 ∈ IN(𝐿) iff ∀𝐵 ∈ A𝐺 such that 𝐵 bp-defeats 𝐴 : 𝐵 ∈ OUT(𝐿)


A2 B2 C2
(2) 𝐴 ∈ OUT(𝐿) iff
(a) Conc(𝐴) ∈ BurdPers and ∃ 𝐵 ∈ A𝐺 such that UND OUT IN

• 𝐵 bp-defeats 𝐴, and 𝐵 ∈ IN(𝐿) or 𝐵 ∈ UND(𝐿)


(b) Conc(𝐴) ∉ BurdPers and ∃ 𝐵 ∈ A𝐺 such that A1 B1 C1
• 𝐵 bp-defeats 𝐴 and 𝐵 ∈ IN(𝐿) UND UND IN

Burdens of persuasion affect conditions for rejection, as speci-


fied in condition 3.15 (2) (a): the rejection (the OUT labelling) of an Figure 1: Grounded {IN, OUT, UND}-labelling of Example 2.2
argument burdened with persuasion may be determined by any de- in the absence of burdens of persuasion (top), and its bp-
feating counterargument 𝐵 that is accepted (IN) or uncertain labelling with BurdPers = {𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒, 𝑙𝑖𝑎𝑏𝑙𝑒} (bottom).
(UND). However, as specified in condition 3.15 (2) (b), the rejection
of an argument which is not burdened with persuasion requires a
defeating counterargument 𝐵 that is labelled IN. This example shows how the model presented here allows us
The semantics just described does not always deliver a unique to deal with the inversion of the burden of proof, i.e., a situation in
labelling. Multiple labellings may exist when arguments rebut each which one argument 𝐴 is presented for a claim 𝜙 being burdened
other, none of them being burdened with persuasion. If one of with persuasion, and 𝐴 (or a subargument of it) is attacked by a
these arguments is labelled IN the other is labelled OUT, and vice counterargument 𝐵, of which the conclusion 𝜓 is also burdened
versa. To address such a situation, we focus on IN-minimal labelling with persuasion. If no convincing argument for 𝜓 can be found, then
semantics, where for example both such arguments are labelled the attack fails, and the uncertainty on 𝜓 does not affect the status
UND. Let us call such a labelling a grounded bp-labelling. of 𝐴. In the example, the argument for the doctor’s due diligence
Definition 3.16. A bp-labelling 𝐿 of an argumentation graph 𝐺 is fails to meet its burden of persuasion. Consequently, it fails to defeat
a grounded bp-labelling iff UND(𝐿) is maximal. the argument for the doctor’s liability, which succeeds, meeting its
burden of persuasion.
Proposition 3.17. Let 𝐿1 be the grounded {IN, OUT, UND}-labelling
of an argumentation graph 𝐺, and 𝐿2 the grounded bp-labelling of 𝐺. Example 3.19 (Criminal law example: rules, graphs and bp-la-
If BurdPers = ∅ then IN(𝐿1 ) = IN(𝐿2 ). belling). Referring to Example 2.1, let us consider the following
rules (for simplicity’s sake, we do not specify pieces of evidence
Proof. It is easy to see that if condition 3.15(1) concerning argu- here, but we assume that all factual claims are supported by evi-
ments burdened with persuasion is removed from definition 3.15, dence):
we obtain the definition of grounded {IN, OUT, UND}-labellings. □
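For graphs as small as the one in Figure 1, Definitions 3.15 and 3.16 can be read operationally by enumerating all {IN, OUT, UND}-labellings, keeping those that satisfy the bp-labelling conditions, and selecting one with the largest UND set. The Python sketch below does this for the civil-law example; the defeat relation is hard-coded as derived in Example 3.10 (with the burden on dueDiligence already reflected in it), and UND-maximality is taken by cardinality for simplicity. This brute-force reading is ours and is intended purely as an illustration, not as the authors' algorithm.

# Brute-force sketch (not the authors' algorithm) of a grounded bp-labelling
# for the civil-law example, with BurdPers = {dueDiligence, liable}.
from itertools import product

args = ["A1", "A2", "B1", "B2", "C1", "C2"]
conclusion = {"A1": "~guidelines", "A2": "~dueDiligence", "B1": "guidelines",
              "B2": "dueDiligence", "C1": "harm", "C2": "liable"}
burd_pers = {"dueDiligence", "liable"}
# bp-defeaters of each argument, as derived in Example 3.10 (B2 does not defeat A2
# because it is burdened and not superior; B2 undercuts C2 via ~dueDiligence in r3).
defeaters = {"A1": {"B1"}, "B1": {"A1"}, "A2": {"B1", "B2"},
             "B2": {"A1", "A2"}, "C1": set(), "C2": {"B2"}}

def is_bp_labelling(lab):
    # Checks the 'iff' conditions of Definition 3.15 for every argument.
    for a in args:
        cond_in = all(lab[b] == "OUT" for b in defeaters[a])
        if conclusion[a] in burd_pers:
            cond_out = any(lab[b] in ("IN", "UND") for b in defeaters[a])
        else:
            cond_out = any(lab[b] == "IN" for b in defeaters[a])
        if (lab[a] == "IN") != cond_in or (lab[a] == "OUT") != cond_out:
            return False
    return True

valid = [dict(zip(args, combo))
         for combo in product(["IN", "OUT", "UND"], repeat=len(args))
         if is_bp_labelling(dict(zip(args, combo)))]
grounded_bp = max(valid, key=lambda lab: sum(v == "UND" for v in lab.values()))
print(grounded_bp)  # A1, A2, B1 -> UND; B2 -> OUT; C1, C2 -> IN, as in Figure 1 (bottom)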
f1: ⇒ 𝑘𝑖𝑙𝑙𝑒𝑑 f2: ⇒ 𝑖𝑛𝑡𝑒𝑛𝑡𝑖𝑜𝑛
Example 3.18 (Civil law example: graphs and bp-labelling). Let f3: ⇒ 𝑡ℎ𝑟𝑒𝑎𝑡𝑒𝑑 f4: ⇒ ¬𝑡ℎ𝑟𝑒𝑎𝑡𝑒𝑑
us consider again Example 2.2 and the corresponding rules and r1: 𝑡ℎ𝑟𝑒𝑎𝑡𝑒𝑑 ⇒ 𝑠𝑒𝑙 𝑓 𝐷𝑒 𝑓 𝑒𝑛𝑐𝑒 r2: 𝑘𝑖𝑙𝑙𝑒𝑑, 𝑖𝑛𝑡𝑒𝑛𝑡𝑖𝑜𝑛 ⇒ 𝑚𝑢𝑟𝑑𝑒𝑟
arguments built in Example 3.10. The argumentation graph and r3: 𝑠𝑒𝑙 𝑓 𝐷𝑒 𝑓 𝑒𝑛𝑐𝑒 ⇒ ¬𝑚𝑢𝑟𝑑𝑒𝑟 r4: ¬𝑡ℎ𝑟𝑒𝑎𝑡𝑒𝑑 ⇒ ¬𝑠𝑒𝑙 𝑓 𝐷𝑒 𝑓 𝑒𝑛𝑐𝑒
its grounded {IN, OUT, UND}-labelling are depicted in Figure 1 (top),
with r3 ≻ r2. We can build the following arguments:
in which all arguments are UND except arguments for undisputed
facts. The result is not satisfactory according to the law, since A1 : ⇒f1 𝑘𝑖𝑙𝑙𝑒𝑑 B1 : ⇒f3 𝑡ℎ𝑟𝑒𝑎𝑡𝑒𝑑
it does not take into account the applicable burdens of persua- A2 : ⇒f2 𝑖𝑛𝑡𝑒𝑛𝑡𝑖𝑜𝑛 B2 : B1 ⇒r1 𝑠𝑒𝑙 𝑓 𝐷𝑒 𝑓 𝑒𝑛𝑐𝑒
sion. The doctor should have lost the case – i.e., be found liable A3 : A1, A2 ⇒r2 𝑚𝑢𝑟𝑑𝑒𝑟 B3 : B2 ⇒r3 ¬𝑚𝑢𝑟𝑑𝑒𝑟
– since she failed to discharge her burden of proving that she C1 : ⇒f4 ¬𝑡ℎ𝑟𝑒𝑎𝑡𝑒𝑑
was diligent (non-negligent). The doctor’s failure results from the C2 : C1 ⇒r4 ¬𝑠𝑒𝑙 𝑓 𝐷𝑒 𝑓 𝑒𝑛𝑐𝑒
fact that it remains uncertain whether she followed the guide- In the grounded {IN, OUT, UND}-labelling of Figure 2 (top), all argu-
lines. To capture this aspect, we need to specify the burdens of ments are UND except for the undisputed facts. Thus, in the absence

of burdens of persuasion, we do not obtain the legally correct an- we plan to study the properties of our semantics, and its connection
swer, namely, acquittal. To obtain acquittal we need to introduce with other semantics for argumentation [1, 2].
burdens of persuasion. Prosecution has the burden of persuasion on
murder: it therefore falls to the prosecution to persuade the judge ACKNOWLEDGMENTS
that there was killing, that it was intentional, and that the killer did R. Calegari and G. Sartor have been supported by the H2020 ERC
not act in self-defence. Project “CompuLaw” (G.A. 833647).

A3 B3
REFERENCES
[1] Pietro Baroni, Martin Caminada, and Massimiliano Giacomin. 2011. An introduc-
UND UND
tion to argumentation semantics. The knowledge engineering review 26, 4 (2011),
365–410. https://doi.org/10.1017/S0269888911000166
[2] Pietro Baroni and Régis Riveret. 2019. Enhancing Statement Evaluation in Argu-
A2 B2 C2 mentation via Multi-labelling Systems. Journal Artificial Intelligence Research 66
IN UND UND (2019), 793–860. https://doi.org/10.1613/jair.1.11428
[3] Roberta Calegari, Andrea Omicini, and Giovanni Sartor. 2020. Argumentation
and Logic Programming for Explainable and Ethical AI. In XAI.it 2020 – Italian
A1 B1 C1 Workshop on Explainable Artificial Intelligence 2020 (CEUR Workshop Proceedings,
Vol. 2742). Sun SITE Central Europe, RWTH Aachen University, Italy, 55–68.
IN UND UND [4] Roberta Calegari and Giovanni Sartor. 2020. Burden of Persuasion in Ar-
gumentation. In Proceedings 36th International Conference on Logic Program-
ming (Technical Communications), ICLP 2020 (Electronic Proceedings in Theo-
A3 B3
retical Computer Science, Vol. 325). OPA, Rende (CS), Italy, 151–163. https:
OUT UND //doi.org/10.4204/EPTCS.325.21
[5] Roberta Calegari and Giovanni Sartor. 2020. A Model for the Burden of Persuasion
in Argumentation. In Legal Knowledge and Information Systems. JURIX 2020: The
A2 B2 C2 Thirty-third Annual Conference (Frontiers in Artificial Intelligence and Applications,
Vol. 334), Serena Villata, Jakub Harašta, and Petr Křemen (Eds.). IOS, Brno, Czech
IN UND UND Republic, 13–22. https://doi.org/10.3233/FAIA200845
[6] Martin Caminada and Leila Amgoud. 2007. On the Evaluation of Argumentation
Formalisms. Artificial Intelligence 171, 5—6 (2007), 286–310. https://doi.org/10.
A1 B1 C1 1016/j.artint.2007.02.003
IN UND UND [7] Arthur M. Farley and Kathleen Freeman. 1995. Burden of Proof in Legal Argumen-
tation. In Proceedings of the 5th International Conference on Artificial Intelligence
and Law. ACM, Maryland USA, 156—164. https://doi.org/10.1145/222092.222227
[8] Thomas F. Gordon, Henry Prakken, and Douglas Walton. 2007. The Carneades
Figure 2: Grounded {IN, OUT, UND}-labelling of Example model of argument and burden of proof. Artificial Intelligence 171, 10 (2007),
2.1 in the absence of burdens of persuasion (top), and 875–896. https://doi.org/10.1016/j.artint.2007.04.010
bp-labelling with the burden of persuasion BurdPers = [9] Thomas F. Gordon and Douglas N. Walton. 2009. Proof Burdens and Standards.
In Argumentation in Artificial Intelligence. Springer, Boston, MA, 239–258. https:
{𝑚𝑢𝑟𝑑𝑒𝑟 } (bottom). //doi.org/10.1007/978-0-387-98197-0_12
[10] Ulrike Hahn and Mike Oaksford. 2007. The Burden of Proof and Its Role in
Argumentation. Argumentation 21 (2007), 36–61. https://doi.org/10.1007/s10503-
The bp-labelling is depicted in Figure 2 (bottom). The prosecution 007-9022-6
failed to meet its burden of proving murder, i.e., its argument is [11] Ronald E. Leenes. 2001. Burden of Proof in Dialogue Games and Dutch Civil Pro-
cedure. In Proceedings of the 8th International Conference on Artificial Intelligence
not convincing, since it remains undetermined whether there was and Law. ACM, Missouri USA, 109–18. https://doi.org/10.1145/383535.383549
self-defence. Therefore, the argument supporting murder is labelled [12] Sanjay Modgil and Henry Prakken. 2010. Reasoning about Preferences in Struc-
tured Extended Argumentation Frameworks. In Proceedings of COMMA 2010,
OUT, and the presumed killer is to be acquitted. □ Computational Models of Argumentation. IOS, Italy, 347–58. https://doi.org/10.
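The same enumeration can be rerun on the criminal-law theory; only the data changes. The defeat relation below is our own derivation from rules f1-f4 and r1-r4 with r3 ≻ r2 (so the burdened argument A3, not being superior, fails to defeat B3, while B3 defeats A3), and BurdPers = {murder}.

# Self-contained rerun of the labelling enumeration for Example 3.19 (our derivation).
from itertools import product

args = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]
conclusion = {"A1": "killed", "A2": "intention", "A3": "murder",
              "B1": "threated", "B2": "selfDefence", "B3": "~murder",
              "C1": "~threated", "C2": "~selfDefence"}
burd_pers = {"murder"}
# Hand-derived bp-defeats: B1/C1 and B2/C2 attack each other over (~)threated and
# (~)selfDefence, C1 and C2 also defeat the arguments built on top of B1 and B2,
# and only B3 defeats A3 (A3 is burdened with murder and not superior under r3 > r2).
defeaters = {"A1": set(), "A2": set(), "A3": {"B3"},
             "B1": {"C1"}, "B2": {"C1", "C2"}, "B3": {"C1", "C2"},
             "C1": {"B1"}, "C2": {"B1", "B2"}}

def is_bp_labelling(lab):
    for a in args:
        cond_in = all(lab[b] == "OUT" for b in defeaters[a])
        if conclusion[a] in burd_pers:
            cond_out = any(lab[b] in ("IN", "UND") for b in defeaters[a])
        else:
            cond_out = any(lab[b] == "IN" for b in defeaters[a])
        if (lab[a] == "IN") != cond_in or (lab[a] == "OUT") != cond_out:
            return False
    return True

valid = [dict(zip(args, c)) for c in product(["IN", "OUT", "UND"], repeat=len(args))
         if is_bp_labelling(dict(zip(args, c)))]
print(max(valid, key=lambda lab: sum(v == "UND" for v in lab.values())))
# -> A1, A2 IN; A3 OUT; B1, B2, B3, C1, C2 UND, i.e. the acquittal of Figure 2 (bottom).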
3233/978-1-60750-619-5-347
4 CONCLUSION [13] Sanjay Modgil and Henry Prakken. 2014. The ASPIC + framework for structured
argumentation: a tutorial. Argument & Computation 5, 1 (2014), 31–62. https:
We have presented a formal model for the burden of persuasion. The //doi.org/10.1080/19462166.2013.869766
model is based on the idea that arguments burdened with persuasion [14] Henry Prakken. 2010. An Abstract Framework for Argumentation with Struc-
tured Arguments. Argument and Computation 1 (2010), 93–124. https://doi.org/
have to be rejected when there is uncertainty about them. We have 10.1080/19462160903564592
shown how an allocation of the burden of persuasion may lead to a [15] Henry Prakken, Chris Reed, and Douglas N. Walton. 2005. Dialogues about the
single outcome (IN arguments) in contexts in which the assessment Burden of Proof. In Proceedings of the 10th International Conference on Artificial
Intelligence and Law. ACM, Bologna, Italy, 115–124. https://doi.org/10.1145/
of conflicting arguments would otherwise remain undecided. We 1165485.1165503
have also shown how our model is able to address inversions of [16] Henry Prakken and Giovanni Sartor. 1996. Rules about Rules: Assessing Con-
flicting Arguments in Legal Reasoning. Artificial Intelligence and Law 4 (1996),
burdens of proof, namely, those cases in which the burden shifts 331–68. https://doi.org/10.1007/BF00118496
from one party to the other. In such cases, there is the burden of [17] Henry Prakken and Giovanni Sartor. 2010. A Logical Analysis of Burdens of
persuasion over the conclusion of a multistep argument, and at Proof. Legal Evidence and Proof: Statistics, Stories, Logic 1 (2010), 223–253.
[18] Gerard Vreeswijk. 1997. Abstract Argumentation Systems. Artificial Intelligence
the same time a burden of persuasion over the conclusion of an 90, 1–2 (1997), 225–279. https://doi.org/10.1016/S0004-3702(96)00041-0
attacker against a subargument of that multistep argument. The [19] Douglas Walton. 1996. Arguments from Ignorance. Pennsylvania State University
model can be expanded in various ways, to capture further aspects Press, Pennsylvania. https://doi.org/10.1007/978-3-319-15013-0_3
[20] Douglas Walton. 2014. Burden of proof, presumption and argumentation. Cam-
of legal reasoning. For instance, it can also be supplemented with bridge University Press, USA. https://doi.org/10.1017/CBO9781107110311
argumentation over burdens of persuasion [15], in a manner similar [21] C.R. Williams. 2003. Burdens and standards in civil litigation. Sydney Law Review
25 (2003), 165–188.
to the way in which argumentation systems can be expanded to
include argumentation about priorities, see [12, 16]. More generally

Prediction of monetary penalties for data protection cases in
multiple languages
Aaron Ceross Tingting Zhu
University of Oxford University of Oxford
Oxford, United Kingdom Oxford, United Kingdom
aaron.ceross@cs.ox.ac.uk tingting.zhu@eng.ox.ac.uk

ABSTRACT in the relationship between the data subject and the data controller.
As the use of personal data becomes further entrenched in the func- Correspondingly, data protection law has gained increased visibility
tion of societal interaction, the regulation of such data continues to in recent years, notably the enactment of the European Union’s
grow as an important area of law. Nevertheless, it is unfortunately General Data Protection Regulation (GDPR) [20].
the case that data protection authorities have limited resources to In general, there exists a disparity between regulators and the
address an increasing number of investigations. The leveraging of objects of regulation. This includes access to information, resources
appropriate data-driven models, coupled with the automation of to contest regulatory action, and technical expertise — with the bal-
decision making, has the potential to help in such circumstances. ance of power often favouring corporate entities over the regulating
In this paper, we evaluate machine learning models in the litera- authority. It has been argued that this leads to inefficient regulatory
ture (such as Support Vector Machine (SVM), Random Forest, and action for such corporate entities [4]. Constrained resources are
Multinomial Naive Bayes (MNB) classifiers) for natural language particularly acute in data protection regulation, with the effect of
processing in order to predict whether a monetary penalty was increasing pressure on the prioritisation of cases, therefore requir-
levied based on a description of case facts. We tested these models ing prudence when taking on an investigation in order to maximise
on a novel data set collected from the data protection authority of the effectiveness of regulatory action [5]. The widespread use of
Macao across the three languages (i.e., Chinese, English, and Por- personal data by innumerable entities has resulted in a need to be
tuguese). Our experimental results show that the machine learning selective in terms of determining which cases to take forward so
models provide the necessary predictability in order to automate the as to maximise effectiveness of regulation of personal data [14].
evaluation of data protection cases. In particular, SVM has consis- Further pressure on resources also comes from newer competences
tent performance across three languages, achieving an AUROC and regulatory expectation, such as mandatory breach notification.
of 0.725, 0.762, and 0.748 for Chinese, English, and Portuguese, re- In this paper we evaluate text classification methods that have the
spectively. We further evaluated the interpretability of the results potential to facilitate this aspect of the regulatory process. We are
independently for each of the languages and found that the salient unaware of any work in the available literature examining the use
texts that were identified are shared across the three languages. of machine learning methods for data protection regulatory actions.
ACM Reference Format:
Aaron Ceross and Tingting Zhu. 2021. Prediction of monetary penalties
2 BACKGROUND
for data protection cases in multiple languages. In Eighteenth International 2.1 Automation of data protection regulation
Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021,
Within the available literature we do not find examples using a
São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.
data-driven approach for regulatory actions in data protection. The
1145/3462757.3466097
wider literature regarding empirical analysis of data protection
1 INTRODUCTION judgements and regulatory actions has attracted only modest aca-
demic interest. There are few data sets which are readily available,
The capture, storage, and processing of personal data within in-
and this may contribute to the limited research in this area. For
formation systems has become a fundamental feature of societal
example, Ceross and Simpson [6] provided summary statistics on
interaction, including social relationships, commerce, government,
civil penalties provided by the United Kingdom’s data protection
and education. The increasing multiplicity and complexity of these
authority, however they did not provide a model in this work. Nev-
interactions across different entities necessitates the promulgation
ertheless, the authors found that regulatory actions were focused
of rules regarding the use and storage of personal data in order to
primarily on health and government-held data, with the causes of
prevent misuse of the data, as well as redress power asymmetries
breach being non-technological (e.g. improper disposal of records
Permission to make digital or hard copies of all or part of this work for personal or and unintended disclosure), which may suggest the priorities of the
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
data protection authority.
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, 2.2 Automated prediction of legal outcomes
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org. through textual analysis
ICAIL’21, June 21–25, 2021, São Paulo, Brazil There has been increasing interest in the research literature regard-
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 ing the prediction of the outcomes of legal cases through machine
https://doi.org/10.1145/3462757.3466097 learning and natural language processing. This may be due to the

greater availability of court data as well as widespread use of ma- 3.2 Data collection
chine learning and natural language processing libraries. Aletras et In this study, we extracted the text of completed cases-investigations
al. [1] constructed a dataset of 584 cases decided by the European by the GPDP from 2007 to 2019. As Chinese and Portuguese are
Court of Human Rights (ECtHR), focusing on those cases involving the two co-equal official languages of Macao, the GPDP provides
Articles 3, 6, and 8 of the European Convention on Human Rights. case-files on its website provided in these languages. Additionally,
The authors extracted n-grams and semantic topics as features, and the GPDP also provides English translations of the cases, although
classified the cases using an SVM with 10-fold cross validation. The this translation is unofficial. Overall, the case report itself is
approach established by Aletras et al. [1] has acted as a template divided into three parts: (i) the brief, which explains the facts of the
for other studies. For example, using a data-set of 5,990 German case; (ii) the analysis, which gives an explanation of the applicable
tax cases, Waltl et al. [23] employed a Naive Bayes classifier, using law and consideration of factors; and (iii) the case outcome, which
11 linguistic features and 10-fold cross validation. As another example, law and consideration of factors; and (iii) the case outcome, which
Virtucio et al. [22] examined 27,492 cases decided by the Philippine The GPDP provides reports for cases whether an infraction of
Supreme Court, achieving a 59% accuracy using a Random Forest the PDPA is found or not. This allows for classification of penalties
classifier using topics as a feature. More recently, Medvedeva et and closed cases, which represents the full scope of possibilities
al. [16] adopted the approach in [1], expanding the scope to anal- with regard to the assessments that data protection authorities make
yse ECtHR decisions across 14 articles. The authors in this work when deciding on cases. From the retrieved cases, there was a dis-
selected an equal number of decision outcomes in order to maintain crepancy between the number of translated cases available from the
a balanced dataset. website (316 in written Chinese; 309 Portuguese; and 292 English),
There exist multiple jurisdictions wherein there are multiple indicating that the website does not provide every case across the
official languages of legal action (e.g, Canada and Belgium). In the three languages. Using the unique case identification numbers, we
works examining the ECtHR data, the authors did not consider kept only those cases common to all three languages, resulting in a
whether the outcomes were similar for different languages. We add dataset of 281 cases, representing investigations between 2007 and
to the literature by comparing the same facts of a case in three 2019. Out of these 281 cases, 73 (26%) were given a penalty, with the
languages to determine whether predictors of the cases are the remaining 208 cases closed without one (75%). Most of the cases in
same across the language versions. the data set arise from complaints (185, 66%), followed by reports
(58, 21%), referrals (21, 8%), and active investigations (17, 6%).
3 DATA
3.1 Description 4 METHODS
Macao is a Special Administrative Region of the People’s Republic 4.1 Data processing and feature extraction
of China. It is a former colony of Portugal which was transferred to In addition to any of the regulatory actions the GPDP may take, the
Chinese rule in 1999. Macao’s designation as a special administra- case outcomes may result in the authority finding that no further
tive region has meant that Macao retains a degree of autonomy in its action is necessary. In this work, we only make use of the case briefs
public administration and legislation. As such, the city has adopted in order to predict case outcomes. Other sections within the case
an independent approach to data protection regulation which is documentation address the merits of the claims (the analysis) or
different from China. The Personal Data Protection Act (Act 8/2005) provide the case outcome which may strongly influence the results.
(PDPA) [17] regulates the processing and storage of personal data Table 1 describes the summary features of the linguistic properties
in Macao and establishes a regulator, the Office for Personal Data of a brief. Chinese text has the shortest mean length of the three
Protection (GPDP).1 The data protection regulation in Macao is languages, as well as having the least number of tokens. Despite
based on the approach in the European Union, which envisions this, it has a larger vocabulary than the Portuguese and English
more prominent roles for data protection authorities [12]. Accord- text (3,423 types). These smaller tokens and vocabulary may help in
ingly, the GPDP has a wide regulatory remit with regard to data reducing the feature set for the purposes of classification, thereby
protection, including education, advice, and consultation. As part allowing for more effective prediction of classes.
of its regulatory functions, Article 33 of the PDPA provides that the
GPDP may impose a monetary penalty ranging from MOP$4,000 Table 1: Descriptive statistics of case briefs.
to MOP$40,000 (approximately $500 – $5,000 USD) for acts which
(i) infringe on the rights of the data subject (Articles 5, 10, 11, 12,
13); (ii) contravene rules related to security and confidentiality of Language #tokens #types min max mean
processing (Articles 16, 17); or (iii) where the processing of personal length length length
data has not been appropriately publicised (Article 25, paragraph Chinese 19,287 3,423 20 459 68.64
3). In 2019, the GPDP reported an unusual increase in received en- English 26,002 2,256 30 240 92.53
quiries regarding data protection (2,940) — which was an increase Portuguese 26,157 2,856 26 249 93.09
of 60% from the previous year and the highest for five years [10].
Such an increase inevitably gives rise to issues of prioritisation and
resource allocation, as discussed in Section 1.
The case briefs are written in short sentences describing the
1 Acronym used in English is taken from the authority’s Portuguese name, Gabinete reasons the case has been brought to the GPDP. The brief does not
para a Protecção de Dados Pessoais.
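The feature extraction and evaluation protocol described in Section 4 (tf-idf over lowercased unigrams and bigrams with stop words removed, and SVM, Random Forest and Multinomial Naive Bayes classifiers under stratified 3-fold cross-validation) could be assembled in scikit-learn roughly as sketched below. This is not the authors' code: the dummy texts and labels exist only to make the snippet executable, the language-specific tokenisation, lemmatisation and hyperparameters actually used are not reproduced, and the stop-word list shown is English only.

# Illustrative pipeline (not the authors' implementation): tf-idf features fed to
# SVM, Random Forest and Multinomial Naive Bayes under stratified 3-fold CV.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, matthews_corrcoef, recall_score

# Dummy case briefs and labels (1 = monetary penalty), purely to make the sketch run;
# in the study these would be the 281 GPDP briefs in one of the three languages.
briefs = ["photo uploaded to social network without consent",
          "employer requested excessive identity documents",
          "recordings of clients kept after contract ended",
          "enquiry about installing cctv in a private shop",
          "report of lost paper files recovered the same day",
          "complaint withdrawn after data was corrected"]
penalty = [1, 1, 1, 0, 0, 0]

models = {
    "SVM": SVC(kernel="linear", class_weight="balanced"),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "MNB": MultinomialNB(),
}
scoring = {
    "f1": "f1",
    "auroc": "roc_auc",
    "recall": "recall",
    "specificity": make_scorer(recall_score, pos_label=0),
    "mcc": make_scorer(matthews_corrcoef),
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

for name, clf in models.items():
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), stop_words="english")),
        ("clf", clf),
    ])
    scores = cross_validate(pipe, briefs, penalty, cv=cv, scoring=scoring)
    print(name, {k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})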

make an evaluation as to the merits of the case nor do the briefs features of any particular prediction (i.e. which words were most
provide an indication as to the outcome. The input feature to our influential) are shared between the languages. We draw an illustra-
classification models can be defined as a document term frequency tive case example and compare the top ten features, assessing the
matrix which describes the frequency of terms that occur in a extent to which these translate into each language.
collection of documents. In order to generate this, we lowercase the
text and remove ‘stop words’, which are frequently occurring words 5 RESULTS
of no semantic significance (e.g. articles and prepositions). Both 5.1 Performance of classification models
unigram and bigram word features are extracted. While stemming
is a common approach to reduce the number of features in language Table 2 shows the mean and standard deviation of the performance
processing tasks, research has demonstrated that lemmatisation, metrics on the test data sets derived from the stratified 3-fold cross
using the infinitive root of a word may produce better outcomes validation. Given structural differences between languages, it is
for tasks [15]. Finally, term-frequency inverse document frequency expected that different models will be more effective for one lan-
(tf-idf) of each word was extracted to reflect how important the guage over another. For instance, SVM performs mostly well across
word is in a collection of case briefs. all metrics for Portuguese (AUROC of 0.748 and F1 of 0.605) and
Chinese (AUROC of 0.725 and F1 of 0.577). For English, MNB was
4.2 Evaluation of classification models considered to be a better option (AUROC of 0.774 and F1 of 0.649).
However, it is difficult to conclude if the results are significant due
We evaluated the case briefs using the classification models in the
to the limited number of case briefs available. It was further ex-
identified literature (see Section 2.2). This includes (i) a Support
pected that the specificity would be high across different models
Vector Machine (SVM) classifier, which is a non-probabilistic classi-
(a range of 0.875 to 0.981) due to the number of non-penalty briefs
fier that maps feature values into a higher dimension to maximise
available in comparison to those with penalty. In the case of recall,
the discrimination between two classes; (ii) a Random Forest (RF)
SVM performed the best mostly across three languages (0.576 for
classifier, which is an ensemble of decision trees via bootstrap-
Chinese, 0.630 for English, and 0.602 for Portuguese), where other
ping subsets of features; and (iii) Multinomial Naive Bayes (MNB),
models have variable results. The MCC values were also similar
a probabilistic classifier which combines Bayes’ theorem with a
across three languages with MNB providing the highest values
multinomial event model, to allow for explicit modelling of the
ranging from 0.48 to 0.534. We determine that the SVM performs
frequency/count of each of our features.
the best scores across all languages when considering all metrics.
In our experimental setup, each model was trained and tested
using a stratified 3-fold cross validation. Stratification was consid- 5.2 Comparison of predictions across languages
ered as the portion of penalty versus non-penalty case briefs was
highly imbalanced with a ratio of 73:281. The averaged performance Out of the 57 cases tested, the performance of the languages in-
across 3 folds is then computed for each model as well as for each Out of the 57 cases tested, performance varied only slightly across
language. In model performance evaluation, we considered com- the languages: English, 49 (86%), Portuguese, 50 (88%), and
mon metrics including: specificity, recall, the F1 score, and the Area to the imbalanced class. Complaints, which make up the majority
Under the Receiver Operating Characteristic (AUROC). We also in- of reasons initiating a case investigation, receive the most penalties
cluded the Matthews correlation coefficient (MCC), which provides when compared to other reasons. Across all languages, true predic-
a correlation coefficient between the observed and predicted binary tions of penalties for complaints were high, with the most being
classifications. MCC is robust and therefore particularly useful in identified by English (75%). The largest disparity in scores exists in
scenarios where classes exhibit a strong imbalance [18]. Active Investigation; only the Chinese language classifier was able
to positively identify one of the two penalties.
4.3 Interpretability of classification
5.3 Assessment of interpretability across
The interpretability of the resultant machine learning models is
a challenging but worthwhile endeavour [11]. As such we utilise
languages
the LIME methodology proposed by Ribeiro et al [21]. LIME allows From the above results in Section 5.2, there is some evidence for the
for the identification of areas which provided the most influence feasibility of utilising machine learning for penalty prediction. One
to a classification, which is an attractive feature for legal decision of the questions posed by this paper was whether these predictions
making. With the identification of predictive features, we qualita- converge on semantic meaning with regard to the most discrimina-
tively assess whether the identified features for correctly predicted tive features used to make these predictions. We make a qualitative
positive cases are shared across the three languages. From the com- check regarding the features by selecting a case which was correctly
parison of classification models, we select the most performant identified by the classifier in all languages. For this purpose we
model across all three languages by assessing the classifier with select Case No 0002/2014/IP [9], which concerned the uploading
the best value in by each metric for each language, and then check of a photo onto a social networking site without the consent of
for consensus among metrics. We re-run a fold from the model in the complainant. The complainant had family photographs profes-
order to examine the quality of predictions across the entire data sionally taken by a company. Without the knowledge or consent of
set. We utilise the predictions of the entire dataset to (i) compare the complainant, the company uploaded the photo to social media
the amount of true positives identified in each language and (ii) as promotional material. The complainant asked the company to
to determine whether the semantic meaning of the determinative remove the photo but did not receive any response. The complaint

Table 2: Test performance of classification models across three languages.

Language Model F1 AUROC Recall Specificity MCC


SVM 0.577 (± 0.107) 0.725 (± 0.075) 0.576 (± 0.204) 0.875 (± 0.056) 0.463 (± 0.1)
Chinese RF 0.397 (± 0.136) 0.627 (± 0.053) 0.274 (± 0.119) 0.981 (± 0.013) 0.398 (± 0.081)
MNB 0.555 (± 0.136) 0.711 (± 0.081) 0.493 (± 0.204) 0.928 (± 0.042) 0.48 (± 0.1)

SVM 0.620 (± 0.126) 0.762 (± 0.095) 0.630 (± 0.272) 0.894 (± 0.083) 0.564 (± 0.081)
English RF 0.480 (± 0.122) 0.666 (± 0.067) 0.385 (± 0.17) 0.947 (± 0.036) 0.423 (± 0.084)
MNB 0.649 (± 0.13) 0.774 (± 0.098) 0.631 (± 0.256) 0.918 (± 0.065) 0.588 (± 0.116)

SVM 0.605 (± 0.142) 0.748 (± 0.095) 0.602 (± 0.239) 0.894 (± 0.048) 0.514 (± 0.126)
Portuguese RF 0.561 (± 0.147) 0.709 (± 0.084) 0.453 (± 0.19) 0.966 (± 0.024) 0.528 (± 0.108)
MNB 0.602 (± 0.163) 0.747 (± 0.102) 0.575 (± 0.255) 0.918 (± 0.56) 0.534 (± 0.135)
Note: the mean and standard deviation are computed on the test sets from the 3-fold stratified cross validation.

[Figure 1 consists of three panels of horizontal bar charts, (a) Chinese, (b) English and (c) Portuguese, each plotting the coefficients (Coef) of the ten most influential word features; the English panel includes features such as 'photo', 'publish', 'message', 'networking' and 'take', and the Portuguese panel 'foto', 'publicar', 'mensagem', 'rede' and 'acto'.]

Figure 1: Influence of word features on the correct classification of Case No 0002/2014/IP.
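Feature attributions of the kind shown in Figure 1 can be obtained from LIME roughly as sketched below. The snippet assumes a fitted pipeline pipe exposing predict_proba (for instance, the tf-idf plus Multinomial Naive Bayes pipeline sketched earlier) and a placeholder brief text; it does not reproduce the actual wording of Case No 0002/2014/IP or the authors' exact settings.

# Sketch of a LIME text explanation for a single case brief (not the authors' code).
# `pipe` is assumed to be a fitted scikit-learn Pipeline exposing predict_proba.
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["no_penalty", "penalty"])
case_brief = "placeholder text of the case brief under examination"  # not the real brief

explanation = explainer.explain_instance(
    case_brief,
    pipe.predict_proba,   # maps a list of texts to class probabilities
    num_features=10,      # the ten most influential word features, as in Figure 1
)
for word, weight in explanation.as_list():  # by default this explains label 1 ('penalty')
    print(f"{word}: {weight:+.3f}")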

was submitted to the GPDP who found that the situation warranted the features have limited semantic utility such as ‘take’, ‘均’, and
a penalty. ‘acto’. This may have implications for the provision of explanations
In Figure 1 the most predictive features are shown for the out- in support of a prediction when used in a legal setting.
come of Case No 0002/2014/IP [9]. The top feature for English and
Portuguese is ‘photo’ and ‘foto’ respectively. In this instance we 6 DISCUSSION
have semantic convergence as these mean the same thing in both This work has demonstrated the effectiveness of text classification
languages. However, for Chinese it is ‘投訴’, which translates to ‘to been shown to be effective across multiple language translations of
complain’. In this example, most of the terms are shared between been shown to be effective across multiple language translations of
English and Portuguese, e.g., ‘publish’ / ‘publicar’, ‘message’ / ‘men- the same facts. Despite the small dataset, this paper’s experiment
sagem’. Portuguese and Chinese share the word for ‘company’: has, at the very least, indicated the strong possibility that machine
companhia / ‘公司’. It is noted that many of the Chinese terms do learning may be included in some manner to facilitate case prioriti-
not provide much semantic meaning. For example, the ‘甲’ and ‘乙’ sation for regulatory action. Data protection authorities may utilise
characters are related to the ordering of items and in legal practice this method in order to pre-screen complaints thus more effectively
these characters are often used to anonymise names in a text (e.g. prioritise the use of the authority’s resources.
Person A/甲 and Person B/乙). The character ‘均’ is used to qualify While an argument may be made to use text classification for pri-
other nouns and thus may have multiple meanings depending on oritising cases, there are considerations whether such an undertak-
context such as ‘all’ or ‘even’; by itself, it does not provide any ing impinges on the nature of regulation and the law. Bayamlıoğlu
meaning. In short, while there may be some semantic meaning and Leenes [2] regard the use of the technology in legal decisions
derived from the prediction (e.g., ‘photo’, ’foto’, ‘公司’), many of and data-driven law as degrading the “moral enterprise” of the

188
Prediction of monetary penalties for data protection cases in multiple languages ICAIL’21, June 21–25, 2021, São Paulo, Brazil

law, which has an impact on the human trust and value placed not REFERENCES
only with the law itself but by extension those institutions charged [1] Aletras, N., Tsarapatsanis, D., Preoţiuc-Pietro, D., and Lampos, V. Pre-
with its execution. This is echoed by Hildebrandt [13] who argues dicting judicial decisions of the European Court of Human Rights: A natural
language processing perspective. PeerJ Computer Science 2 (2016), e93.
that data-driven law creates a crisis for law in that data-driven [2] Bayamlioğlu, E., and Leenes, R. The ‘rule of law’ implications of data-driven
predictions may result in atrophy of the ability to make judgements decision-making: a techno-regulatory perspective. Law, Innovation and Technol-
ogy 10, 2 (2018), 295–313.
congruent with the lived experience of individuals; data-driven law [3] Binns, R., Van Kleek, M., Veale, M., Lyngs, U., Zhao, J., and Shadbolt, N.
is beholden to a type of logic that may not lend easily itself to the ’It’s reducing a human being to a percentage’: Perceptions of justice in algorith-
rule of law. mic decisions. In Proceedings of the 2018 CHI Conference on Human Factors in
Computing Systems, CHI ’18, p. 1–14.
Explanability of machine learning models is often suggested as [4] Braithwaite, J. Enforced self-regulation: A new strategy for corporate crime
a counter-balance to what may be perceived as the negative effects control. Michigan Law Review 80, 7 (1982), 1466–1507.
of automated decisions. Edwards and Veale [8] maintain that an [5] Ceross, A. Examining data protection enforcement actions through qualitative
interviews and data exploration. International Review of Law, Computers &
explanation of the logic of a model and its outcomes may have Technology 32, 1 (2018), 99–117.
no meaning for an affected individual, giving little recourse by [6] Ceross, A., and Simpson, A. C. The use of data protection regulatory actions as
a data source for privacy economics. In Computer Safety, Reliability, and Security
which to challenge the outcome of the automated decision. There (SAFECOMP) (2017), S. Tonetta, E. Schoitsch, and F. Bitsch, Eds., vol. 10489 of
is also the question as to whether such explanations are helpful or Lecture Notes in Computer Science (LNCS), Springer, pp. 350–360.
informative to individuals, as experiments on human perception [7] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019
of algorithmic decisions have found that even where explanations Conference of the North American Chapter of the Association for Computational
are provided to algorithmic decision-making, such explanations Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
may not contribute to a sense of ‘fairness’ in the decision made. (Minneapolis, Minnesota, June 2019), Association for Computational Linguistics,
pp. 4171–4186.
Binns et al. [3] investigated the human perception of algorithmic [8] Edwards, L., and Veale, M. Enslaving the algorithm: From a “Right to an
decisions and found that even where explanations are provided to Explanation” to a “Right to Better Decisions”? IEEE Security & Privacy 16, 3
(2018), 46–54.
algorithmic decision-making, such explanations may not contribute [9] Gabinete para a Protecção de Dados Pessoais. Case No: 0002/2014/IP:
to a sense of ‘fairness’ in the decision made. Uploaded clients’ photos by mistake. https://www.gpdp.gov.mo/index.php?m=
From the results, we would strongly caution against adopting content&c=index&a=show&catid=209&id=775, 2014. English version.
[10] Gabinete para a Protecção de Dados Pessoais. 個案調查. https://www.
the methods outlined in the experiment as the sole determination gpdp.gov.mo/uploadfile/2020/1009/20201009040422173.pdf, Oct. 2020.
of a penalty allocation. The results from our experiment show [11] Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., and Kagal, L.
that the models are able to indicate which cases may be of more Explaining explanations: An overview of interpretability of machine learning.
In 2018 IEEE 5th International Conference on data science and advanced analytics
interest to a regulatory authority than others. It is, however, ques- (DSAA) (2018), IEEE, pp. 80–89.
tionable whether the description of factors amounts to a sufficient [12] Greenleaf, G. Macao’s EU-influenced Personal Data Protection Act. Privacy
Laws & Business International Newsletter 96 (2008), 21–22.
explanation for legal purposes. This is due to the models utilising [13] Hildebrandt, M. Law as computation in the era of artificial legal intelligence:
text processed in a bag-of-words approach which accounts for the Speaking law to the power of statistics. University of Toronto Law Journal 68,
frequency of words, not the semantic meaning of those words. As Supplement 1 (2018), 12–35.
[14] Hustinx, P. The role of data protection authorities. In Reinventing Data Pro-
such, many predictive features, as those detailed in Section 5.3, may tection?, S. Gutwirth, Y. Poullet, P. De Hert, C. de Terwange, and S. Nouwt, Eds.
be nonsensical when shown in isolation. Springer, 2009, pp. 131–137.
[15] Jianqiang, Z., and Xiaolin, G. Comparison research on text pre-processing
methods on Twitter sentiment analysis. IEEE Access 5 (2017), 2870–2879.
7 CONCLUSIONS AND FUTURE WORK [16] Medvedeva, M., Vols, M., and Wieling, M. Using machine learning to predict
In this work, we introduced a novel dataset, cases from the data decisions of the European Court of Human Rights. Artificial Intelligence and Law
28, 2 (2020), 237–266.
protection authority of Macao, and evaluated multiple machine [17] Personal Data Protection Act. https://www.gpdp.gov.mo/uploadfile/2016/0302/
learning classifiers for binary text classification for text in three 20160302033801814.pdf, 2005.
[18] Powers, D. M. Evaluation: From precision, recall and F-measure to ROC, in-
language versions. Our results show that MNB and SVM performed formedness, markedness and correlation. arXiv preprint arXiv:2010.16061 (2020).
well across all metrics for all languages with SVM being considered [19] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. Stanza: A Python
the most performant. Assessing interpretability was difficult given natural language processing toolkit for many human languages. In Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics: System
the bag-of-words model used in the text preprocessing, although Demonstrations (2020).
there is some overlap between the semantic meaning of features [20] Regulation on the protection of natural persons with regard to the processing of
between the languages. The models evaluated are not without their personal data and on the free movement of such data, and repealing Directive
95/46/EC (General Data Protection Regulation). L119, 4/5/2016, p. 1–88, 2016.
limitations and the size of training datasets remains a challenge. In [21] Ribeiro, M. T., Singh, S., and Guestrin, C. "Why should I trust you?": Explain-
future work, we aim to employ language models such as BERT [7] ing the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD
to experiment with different language tasks in data protection International Conference on Knowledge Discovery and Data Mining, San Francisco,
CA, USA, August 13-17, 2016 (2016), pp. 1135–1144.
regulation. Another avenue of future work may include assessing [22] Virtucio, M. B. L., Aborot, J. A., Abonita, J. K. C., Avinante, R. S., Copino,
the utility and interpretability of the outputs of case prediction with R. J. B., Neverida, M. P., Osiana, V. O., Peramo, E. C., Syjuco, J. G., and Tan,
G. B. A. Predicting decisions of the Philippine Supreme Court using natural
case investigators themselves. language processing and machine learning. In 2018 IEEE 42nd Annual Computer
Software and Applications Conference (COMPSAC) (2018), vol. 2, IEEE, pp. 130–135.
ACKNOWLEDGEMENTS [23] Waltl, B., Bonczek, G., Scepankova, E., Landthaler, J., and Matthes, F.
Predicting the outcome of appeal decisions in Germany’s tax law. In International
The authors thank the anonymous reviewers for their feedback as Conference on Electronic Participation (2017), Springer, pp. 89–99.
well as Tasos Papastylianou and Andrew Simpson for comments
on earlier drafts of this work. Special thanks to Ethan Ceross.

Regulating Artificial Intelligence:
A Technology Regulator’s Perspective
Joshua Ellul Gordon Pace Stephen McCarthy
joshua.ellul@um.edu.mt gordon.pace@um.edu.mt stephen.mccarthy@mdia.gov.mt
Malta Digital Innovation Authority Department of Computer Science, Malta Digital Innovation Authority
& University of Malta University of Malta Malta
Malta Malta

Trevor Sammut Juanita Brockdorff Matthew Scerri


trevor.sammut@mdia.gov.mt JuanitaBrockdorff@kpmg.com.mt MatthewScerri@kpmg.com.mt
Malta Digital Innovation Authority KPMG KPMG
Malta Malta Malta
ABSTRACT discussions regarding how to regulate technology and the regu-
Artificial Intelligence (AI) and the regulation thereof is a topic that lation of computer systems, but reaches further due to the very na-
is increasingly being discussed and various proposals have been ture of the potential for AI. In fact, one can argue that a substantial
made in literature for defining regulatory bodies and/or related portion of the debate is due to this very potential, which brings to-
regulation. In this paper, we present a pragmatic approach for gether ethical issues, rights, perils and other aspects. Regardless of
providing a technology assurance regulatory framework. To the which school of thought one subscribes to, in a spectrum ranging
best of our knowledge, this work presents the first national AI from the requirement of generic principles [1], to specific laws1 , to
technology assurance legal and regulatory framework that has advocating that regulation of such technology should be avoided,
been implemented by a national authority empowered through law and focus should be on safety mechanisms [16], when the technol-
to do so. Aiming both to provide assurances where required and which school of thought one subscribes to, in a spectrum ranging
to support rather than stifle innovation, it is proposed that such from the requirement of generic principles [1], to specific laws1 , to
regulation is not to be mandated for all AI-based systems but rather should be investigated (whether applied directly or indirectly to the
should provide a voluntary framework and only be mandated in technology). At the same time whilst some argue for mandatory
sectors and activities as deemed necessary by other authorities or regulation, many warn that regulation could stifle innovation [11].
laws for regulated and critical areas. In this paper, we do not purport to present a contribution to
this philosophical debate, but rather our aim is a more pragmatic
CCS CONCEPTS one — that of outlining and explaining the rationale behind a legal
and regulatory framework addressing AI systems adopted by Malta.
• Social and professional topics → Governmental regulations.
Whilst other regulatory frameworks and bodies have been proposed
in literature (discussed in Section 5), it is to the best knowledge
KEYWORDS
of the authors that the framework being presented herein is the
artificial intelligence, regulation first AI technology assurance legal and regulatory framework that
ACM Reference Format: has been implemented by a national authority (the Malta Digital
Joshua Ellul, Gordon Pace, Stephen McCarthy, Trevor Sammut, Juanita Innovation Authority2 ) empowered through law [10] to do so.
Brockdorff, and Matthew Scerri. 2021. Regulating Artificial Intelligence: A Towards the end of the 2010’s Malta built a framework for ad-
Technology Regulator’s Perspective. In Eighteenth International Conference dressing the regulation of Innovative Technology Arrangements
for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, [9], in order to ensure better end user protection through the adop-
Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.
tion of appropriate due diligence on the underlying technologies.
3466093
Initially focusing on Blockchain, Smart Contracts and other Dis-
tributed Ledger Technologies (DLTs) [6], the legislation has since
1 INTRODUCTION been extended to cover critical systems (through a legal notice3 )
Issues concerning the design of legal and regulatory frameworks and regulatory guidelines have been issued by the Malta Digital
for Artificial Intelligence (AI) have been a topic of discussion and Innovation Authority (MDIA) for the regulation of arrangements
debate for the past few decades. Much of the debate inherits from which use an element of AI.
The aim of the paper is to provide a review of the regulatory
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed framework proposed and to put it in the context of the ongoing AI
for profit or commercial advantage and that copies bear this notice and the full citation regulation debate. One of the primary observations is that the need
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
1 https://www.europarl.europa.eu/doceo/document/E-9-2019-002411_EN.html
© 2021 Copyright held by the owner/author(s).
2 https://mdia.gov.mt
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466093 3 https://legislation.mt/eli/ln/2020/389/eng/pdf

190
for precise definitions and objective measures in the legal framework meant that the Maltese regulatory approach is founded on practical and auditable aspects, and is intended to address concerns with existing technology (as opposed to attempting to address possible issues arising from future developments of AI technology, for example Artificial General Intelligence). To implement an AI regulatory framework intended for modern day technology, and with the aim of not stifling innovation, the framework is primarily voluntary; however, it may be mandated based upon the sector and/or risk associated with the activity within which the AI system is used, or as deemed necessary by another lead authority or governing legislation. This sets the tone of much of the paper, but it is naturally endemic to any discussion of practical implementations of the regulation of technologies. The fast evolving nature of technology requires law-makers to address existing technology in a sound manner, but also in a way that is expected to be future-proof. A full version of this paper can be found in [7].

2 THE CASE FOR AI ASSURANCES
We start by highlighting issues related to AI-based systems which could result in systems operating incorrectly in relation to their intended functionality, and thereafter build the case for instilling assurances. We concentrate on Artificial Narrow Intelligence (ANI) given that the state-of-the-art has not yet reached levels of Artificial General Intelligence (AGI) [3]. We will use the term AI throughout the rest of the paper to refer to the AI that exists today, namely ANI.

Since the inception of software development, the fact that such systems occasionally fail has been accepted as the norm. Although much work has gone into developing techniques to reduce the frequency and severity of such occurrences, we continue to experience software malfunction on a daily basis. The impact of such failure is contained as long as the software functions in a closed system, i.e. it has no direct impact on the real world, but frequently software affects the real world in a direct or indirect manner. One finds reports of many catastrophic failures in the literature and in newspaper reports, with effects ranging from huge financial losses to critical infrastructure failure and even loss of human life. AI systems are no exception when it comes to incorrect behaviour, and even when the algorithms themselves are correctly implemented, incorrect behaviour might emerge. For instance, a correct implementation of a machine learning based algorithm may still learn the wrong behaviour due to insufficient training, biases and imbalances that may exist within datasets, etc.

Undoubtedly AI systems should undergo standard quality assurance processes, not only for functional correctness of the algorithms themselves but also with respect to the behaviour emergent following training. However, testing of AI systems is only as good as the coverage of training data, iterations, permutations and use cases which are undertaken. Once an AI system is deployed and it encounters an event that it was not trained to handle, it may well end up handling it incorrectly. More so, if it is continuously learning in a live environment it may be exposed to certain situations which could affect its behaviour negatively.

Part of the challenge is that many AI-based techniques function as black boxes, for which reason one finds extensive research towards explainable AI. The past decade has seen various infamous cases where unexpected behaviour emerged from AI systems, sometimes of a controversial or even safety-critical nature. The increasing concern is not only to do with cases that have emerged but is also based on the reality that more and more systems are becoming computerised and automated. One often-referenced cliché is that of automated scoring systems [13] in which discrimination is unacceptable, highlighting the need to ensure bias in datasets is removed and attempts made to remove discriminatory features during training.

The concerns highlighted above demonstrate the need to ensure that sufficient assurances are put in place to ensure that AI algorithms are implemented correctly, and that their behaviour is as expected and does not introduce any unwanted biases. Indeed, many are advocating for such regulatory frameworks to be developed and applied to AI systems, but whether such frameworks should be mandatory for all AI-based systems is debatable. We now follow with a case for why such frameworks should not always be mandated, and why they should not be focused on the technology but rather on the sector or activity that the AI is being used within or for.
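To make the above concern concrete, the following minimal sketch (ours, not part of the Maltese framework; the dataset and the single-threshold learner are entirely hypothetical) shows how a correctly implemented learning procedure can still acquire skewed behaviour purely because one group is under-represented in its training data:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical training set: 95% of examples come from group A, 5% from group B.
    n_a, n_b = 950, 50
    x_a = rng.normal(0.0, 1.0, n_a)
    x_b = rng.normal(0.0, 1.0, n_b)
    y_a = (x_a > 0.0).astype(int)   # group A: true decision boundary at 0.0
    y_b = (x_b > 1.0).astype(int)   # group B: true decision boundary at 1.0

    x = np.concatenate([x_a, x_b])
    y = np.concatenate([y_a, y_b])

    # A "correctly implemented" learner: pick the single threshold that minimises
    # overall training error. Nothing here is buggy; the bias comes from the data.
    candidates = np.sort(x)
    errors = [np.mean((x > t).astype(int) != y) for t in candidates]
    threshold = candidates[int(np.argmin(errors))]

    pred_a = (x_a > threshold).astype(int)
    pred_b = (x_b > threshold).astype(int)
    print(f"learned threshold ~ {threshold:.2f}")
    print(f"accuracy on majority group A: {np.mean(pred_a == y_a):.2%}")
    print(f"accuracy on minority group B: {np.mean(pred_b == y_b):.2%}")
    # Typical output: very high accuracy for group A, markedly lower for group B,
    # because the threshold is fitted almost entirely to group A's examples.

The point of the sketch is simply that functional correctness of the code is not the same as acceptable behaviour of the trained system, which is what motivates the assurance framework discussed below.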
3 THE CASE FOR VOLUNTARY ASSURANCES OF AI AND MANDATORY ASSURANCES OF REGULATED AND CRITICAL ACTIVITIES
Setting aside AGI, when it comes to ANI should such frameworks always be mandated? The same AI framework, for instance one identifying user preferences, can be the engine behind a wide range of applications, from a personal movie recommendation system, to a social network targeted advertising campaign to influence users in an upcoming election. The underlying infrastructure is application agnostic, but should such an underlying infrastructure be required to be regulated? More so, what difference does it make if an algorithm is AI-based or not and yet can be used for the same activity? Then, should we be talking about AI regulation at all? Or should we be focusing on software, or rather on the activity it is used for, irrespective of how it is implemented?

Regulating all forms of AI would result in shackling and stifling innovation [11]. The definition of AI itself is controversial, and even if a definition is chosen, is it going to be clear what software is AI and what software is not? There are some algorithms which we can ascertain are universally accepted as AI, and some systems which are universally considered not to have aspects of AI within them; however, what should be done about the rest? Could this approach not only stifle AI innovation, but also other software based innovation?

Looking back at the principles of regulation, though, we need to ask ourselves why regulation of AI is being proposed. Is it only because of end-of-the-world scenarios being painted which require AGI, which the state-of-the-art is currently not capable of? If so, then perhaps we should differentiate between any regulatory requirements for AGI and ANI. We propose that this should be done, at least in the interim until AGI is deemed to be upon us. We leave considerations for AGI as future work, and here will continue discussing aspects pertaining to ANI.

If AI is regulated even when applied to unregulated and non-critical activities, given that the line between AI and software in general is blurred, and given that non-AI based techniques and processes may yield the same sort of undesirable outcomes, then why should not all software be regulated? We propose that mandatory regulation should be sector/activity-based and not technology-based.

The question arises of what constitutes high risk, or what defines whether a sector or activity should mandate this framework. This is to be left up to other lead authorities and the laws of the land to decide. For financial affairs, a financial services authority (a separate body) may impose when a sector or activity should be mandated to undertake a technology audit (as proposed herein), or whether any level of enhanced due diligence is required. Therefore, based on the above we make the argument that mandatory regulatory frameworks should not be technology-specific (or AI-specific), but should be activity- or sector-specific as defined and required per activity/sector.

AI technology-based assurances may not only be required for regulated activities; various AI-based products and services may also see benefit in providing assurances to various stakeholders. Therefore, the regulatory approach enables technology-based assurances to also be offered on a voluntary basis (besides being mandated by lead authorities of the respective sectors/activities). We now present the AI technology assurance framework implemented by the Malta Digital Innovation Authority4, which offers certification of AI systems on a voluntary basis where sought, or on a mandatory basis where other lead authorities or laws require it.

4 https://mdia.gov.mt

4 AN AI TECHNOLOGY ASSURANCE REGULATORY FRAMEWORK
We now present the AI Innovative Technology Arrangement (AI-ITA) regulatory technology assurance framework. Approaches for providing software assurances will invariably have a degree of commonality irrespective of the technology domain and the application domain within which the solution is categorised. As such, this framework builds on the Innovative Technology Arrangement (ITA) [6] regulatory assurance framework overseen by the Malta Digital Innovation Authority (MDIA). Rather than mandating compliance and certification of all AI based systems, the regulatory framework is a voluntary one, unless a lead authority deems that such technology assurances are required. It is in this manner we believe innovation can still flourish, by only requiring mandatory oversight of sectors and activities that should require such oversight.

AI Innovative Technology Arrangement. The challenge with Artificial Intelligence ITAs (AI-ITAs) primarily revolves around identifying what constitutes AI. Rather than define what an AI-ITA is as a hard and fast rule, the guidelines take the approach of defining qualities and criteria that qualify software as an AI-ITA: (a) the ability to use knowledge acquired in a flexible manner in order to perform specific tasks and/or reach specific goals; (b) evolution, adaptation and/or production of results based on interpreting and processing data; (c) systems logic based on the process of knowledge acquisition, learning, reasoning, problem solving, and/or planning; (d) prediction, forecast and/or approximation of results for inputs that were not previously encountered.

The above ensures that techniques and algorithms commonly associated with the wider AI field are captured, including anything from Deep Learning to Natural Language Processing and Optimisation Algorithms. The MDIA will also continue to monitor developments and update guidelines as required to include (and potentially exclude) defining features of what is or is not classified as an AI-ITA.

System Audits and Subject Matter Experts. The framework provides a structure for the Authority and applicant to work with independent (and approved) system auditors who are able to scrutinise, to a fairly high level of detail, the software itself as well as the manner in which it is being operated, under the ISAE 3000 [12] standard for assurance. The audit of the software system itself is primarily conducted via a code review, whose aim is to ensure that the manner in which the AI-ITA is implemented accurately reflects what the organisation behind the AI-ITA is claiming in its technology blueprint. The rationale behind this is to ensure that any claims being made are truly reflected in the code, which enables the general public, who may not know what AI really is, to gain trust in the system given that it stood up to scrutiny prior to the certificate being issued. Beyond the software, the certification also mandates, depending on the type of audit being undertaken and the associated controls, that the general public be given assurances that the AI-ITA creator and operator are running the organisation in a manner that meets the standards set out by the MDIA. The certification therefore enables the general public to trust the creator in the manner in which they build, maintain and run the AI system. Two main types of audits are required throughout an AI-ITA's lifetime: (i) a 'Type 1 Systems Audit', typically undertaken as an AI-ITA's first audit, which focuses on providing assurances with respect to functional correctness; and (ii) a 'Type 2 Systems Audit', which focuses on renewing assurances provided through a previous audit and which factors in live data and operations associated with the system, to assure that the system assurances are still in place within the period under audit.

The audit process begins with the applicant submitting a request (in the form of an application) to the Authority, upon which the Authority will assess the applicant by reviewing the provided documentation around the AI-ITA and conducting its due diligence. Following this, the MDIA issues a Letter of Intent, upon which the applicant will be able to appoint an MDIA approved Systems Auditor and notify the MDIA of the appointment, so that the MDIA can verify that the Systems Auditor has the required competencies (which the Authority has tested the system auditor for). The Systems Auditor will then conduct the audit as per the Authority's guidelines5 and compile a report with their findings, which is issued to the MDIA for review and a subsequent decision on whether the certificate is to be issued. Once issued, a further follow-up audit must be conducted every time there is a material change in the AI-ITA (and on renewal after every two years).

5 https://mdia.gov.mt/wp-content/uploads/2019/10/AI-ITA-Guidelines-03OCT19.pdf
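As an illustration only (the state and trigger names below are our own paraphrase of the process just described, not terminology defined by the MDIA), the certification lifecycle can be summarised as a small state machine:

    from enum import Enum, auto

    class Stage(Enum):
        APPLICATION_SUBMITTED = auto()
        LETTER_OF_INTENT = auto()
        AUDITOR_APPOINTED = auto()
        AUDIT_REPORT_WITH_MDIA = auto()
        CERTIFIED = auto()

    def next_stage(stage: Stage) -> Stage:
        """Happy-path progression through the stages described above."""
        order = list(Stage)
        return order[min(order.index(stage) + 1, len(order) - 1)]

    def reaudit_required(material_change: bool, months_since_certification: int) -> bool:
        # Per the description above: a follow-up audit on any material change,
        # and renewal every two years.
        return material_change or months_since_certification >= 24

    assert next_stage(Stage.APPLICATION_SUBMITTED) is Stage.LETTER_OF_INTENT
    assert reaudit_required(material_change=False, months_since_certification=25)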
Systems Audits are an integral part of the certification process as they provide the MDIA with an independent report on the particularities of the AI-ITA, specifically the code (and data) and whether it accurately reflects what is being disclosed in the blueprint, and the ongoing operations of the AI-ITA. Systems Audits are conducted by Systems Auditors, who must be independent from the AI-ITA and its operator, are subject to approval by the Authority, and need to meet a set of requirements (defined in the Systems Auditor guidelines) through their combined complement of Subject Matter Experts (SMEs) in the fields of IT audit, cybersecurity and technology with specialisation, in this case, in AI. The SMEs will be the primary individuals responsible for conducting systems audits, and must adhere to a set of requirements, such as ensuring that they meet a level of continuous professional education in the AI field6.

6 https://mdia.gov.mt/sa-guidelines/

This section describes the requirements that an AI-ITA must meet in order to qualify for certification.

ITA Blueprint. The Blueprint document is an essential document in the certification process as it is meant to provide a detailed description to the Authority of what the system does and how it is designed and operated. Other than allowing the MDIA to evaluate whether AI-ITA certification is applicable, it is further intended to be used by the Systems Auditors as the document against which aspects such as the code are reviewed. The blueprint also defines a minimum set of disclosures that must be made to direct users (in English) in a non-technical manner, to communicate the features and functionalities of the system, how it respects the ethical AI framework7, limitations to the prevention of bias, and the expected accuracy of the AI-ITA.

7 https://malta.ai/wp-content/uploads/2019/08/Malta_Towards_Ethical_and_Trustworthy_AI.pdf

In a general (AI agnostic) sense, the detailed description must cover the functional capabilities of the AI-ITA, how the system is to be verified and tested to ensure the results meet expectations, and what the operational limitations of the system are. More specifically, for an AI solution the blueprint must include a disclosure of the AI techniques used, justify why certification is being sought, and explain how specific risks are being managed and mitigated, e.g. what is being done to ensure that the underlying dataset is unbiased. In a broader sense, the Blueprint must highlight the safety mechanisms in place and alignment with Malta's Ethical AI Framework.

ITA Harness. A crucial element that the AI-ITA framework proposes, and which needs to be highlighted clearly in the Blueprint, is the ITA Harness. The ITA Harness provides a safety net for the process by monitoring activity inputs and outputs to ensure that the boundaries (which must also be disclosed in the Blueprint) are respected. Furthermore, the ITA Harness must also be able to handle any anomalies it detects (such as outputs outside expected boundaries) in a manner which is also disclosed. The AI-ITA harness must also communicate with the Forensic Node (discussed next) to ensure that any anomalies are appropriately logged and can be investigated and rectified. While the harness may not apply to all AI-ITAs, the Authority requires that when it does not apply this must be justified adequately in the blueprint and accompanied with alternative plans of how the behaviour of the AI-ITA will be monitored and contained8.

8 https://mdia.gov.mt/wp-content/uploads/2019/10/AI-ITA-Blueprint-Guidelines-03OCT19.pdf

Forensic Node. The Forensic Node is another requirement mandated by the MDIA, whose implementation and operation are also subject to the audit. The purpose of the Forensic Node is to "store all relevant information on the runtime behaviour of the AI-ITA in real-time such as recording of inputs and outputs, and supporting data related to potential explainability of how an output was derived from a given input wherever applicable". This means that any inputs and outputs, as well as data that supports how the system achieved the results it did, must be stored in a secure data store in real time. This highlights that the Forensic Node is not only used to support the assessment of (some of) the operating effectiveness of the controls during an audit, but may also be used to support legal compliance by the MDIA (or other authorities), and also enables a further layer of monitoring to be done (manually or automated) by a Technical Administrator (discussed next). It is important to note that the Forensic Node must be separate from the ITA Harness, in that the Forensic Node is more concerned with data logging, as opposed to monitoring in relation to boundaries.

Technical Administrator. A Technical Administrator, a form of service provider appointed by the AI-ITA to act as the final safeguard for the system, must be appointed and in place at all times. The Technical Administrator must be able to intervene, if required to do so by the MDIA, another authority or legally (such as in the event of a breach of law by the AI-ITA), to limit further impact to the users and, where necessary, limit or reverse losses. For example, consider an AI system that utilises reinforcement learning and which, after a period of time, starts to exhibit discriminatory bias that goes against the principles laid down in the ethical AI framework and/or against the requirements of any laws or rules it must abide by. In this case the Technical Administrator must be able to halt the operation of the system to prevent further damage and revert to an older model (as may be mandated by a legal judgement). As such, this also imposes an indirect requirement for the AI-ITA to provide mechanisms to enable the Technical Administrator to conduct their actions as may be necessary (e.g. by ensuring regular snapshots of the machine learning models are kept to revert back to).
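Purely as an illustration of the kind of mechanism the Harness, Forensic Node and Technical Administrator requirements point at (the class and method names below are ours, and the boundary check is deliberately simplistic), a wrapper around an AI component might look as follows:

    import json
    import time

    class ForensicNode:
        """Append-only store of runtime behaviour: inputs, outputs, anomalies."""
        def __init__(self, path="forensic_log.jsonl"):
            self.path = path

        def record(self, entry: dict):
            entry["timestamp"] = time.time()
            with open(self.path, "a") as log:
                log.write(json.dumps(entry) + "\n")

    class Harness:
        """Checks disclosed output boundaries and logs every interaction."""
        def __init__(self, model, lower: float, upper: float, node: ForensicNode):
            self.model, self.lower, self.upper, self.node = model, lower, upper, node
            self.halted = False  # the Technical Administrator's kill switch

        def predict(self, x):
            if self.halted:
                raise RuntimeError("system halted by Technical Administrator")
            y = self.model(x)
            anomaly = not (self.lower <= y <= self.upper)
            self.node.record({"input": x, "output": y, "anomaly": anomaly})
            if anomaly:
                # Disclosed anomaly-handling policy: clamp the output and flag it.
                y = min(max(y, self.lower), self.upper)
            return y

        def halt(self):
            self.halted = True

    # Hypothetical usage: a scoring model disclosed to produce values in [0, 100].
    harness = Harness(model=lambda x: 42.0, lower=0.0, upper=100.0,
                      node=ForensicNode())
    print(harness.predict({"applicant_id": "example"}))

In the same spirit, reverting to an older model snapshot could be exposed as a further operation available only to the Technical Administrator; the framework itself prescribes the roles and obligations rather than any particular implementation.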
English Description and Consumer Protection. The system being certified is checked by the systems auditors who, amongst other things, ensure that its functionality matches that described in the blueprint in human-readable form (in English). If, post-deployment, the system exhibits behaviour contrary to this description against which it was certified, the Innovative Technology Arrangements and Services Act specifies that the English version prevails legally.

Auditing of Design and Development Processes. Systems Audits include oversight of the design and development process of the system-under-audit. Not only does such oversight cover traditional software engineering principles, but for systems including an element of AI it also includes assurances that certain foundational principles have been taken into consideration in the process.

Build on a human-centric approach. The systems auditors ensure that the AI system was designed in a manner to support and assist humans without overriding the user or pushing them into taking unwanted decisions, and that the manner in which it operates is equitable and inclusive across different segments of society.

Adherence to applicable laws and regulations. It is crucial that behaviour induced by the system, including parts driven by AI, is designed in a manner that adheres to the law.

Maximise benefits of AI systems while preventing and minimising their risks. It is crucial that any risks induced through the use of AI are identified and mitigated accordingly, including the setting up of controls to ensure fairness, transparency and resiliency to new AI-specific attack vectors.

Aligned with emerging international standards and norms around AI ethics. As the world is increasingly becoming globalised through technology, which may be further amplified through the proliferation of AI systems, this objective was laid down to ensure that
Malta's ethical framework is aligned with similar ethical guidelines by the EU Commission9 and the OECD10.

9 https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai
10 https://www.oecd.org/going-digital/ai/principles/

The framework further builds on these principles by delineating a number of principles (such as Human Autonomy, Fairness, Prevention of Harm and Explicability) and proposes 63 controls of how these can be tested. While not all of these controls apply to all AI-ITAs, the AI-ITA must show that it has taken them into consideration and justify in the Blueprint (and ultimately to the users) those controls which do not apply.

5 RELATED WORK
The work presented herein is complementary and orthogonal to a number of different areas, of which we will now provide an overview.

Regulatory Bodies. The European Parliament had proposed the setting up of a European Agency for robotics and artificial intelligence to address technical, ethical and regulatory aspects [5], mostly driven by the need for transparency of automated systems handling personal data and the frequent impossibility of achieving it due to trade secret protection [15]. A solution proposed was to allow a trusted third party to undertake an audit of the system in question. On similar lines, to avoid differing domestic approaches, the need for an International Artificial Intelligence Organisation was highlighted [8]. Indeed, this would be a step in the right direction; however, it is the opinion of the authors that the provision of regulation should not wait for such an organisation to emerge. Rather, national authorities (such as the MDIA) could work together towards harmonisation and adapt to eventual international standards and guidance as they emerge.

International Standards. Whilst global software regulatory bodies do not (yet) exist, global standards do. The International Organisation for Standardization (ISO) has developed a number of different standards for use within the software domain11. Whilst such software focused standards can be useful for global recognition, within the framework described herein local national standards were required to be developed for the following reasons: (i) standards available to date do not provide guidelines or comprehensive control objectives specific to the artificial intelligence domain; (ii) mechanisms and roles for ensuring continuous monitoring and intervention are not defined [2, 4]; and (iii) the authority is ultimately responsible and empowered through the MDIA Act to ensure audit integrity and quality, whilst at the same time being able to propose changes to legislation and guidelines. Once international standards adequate to adopt are developed, national guidelines may be updated to make use of them (if deemed to meet the national requirements). That said, the authority has adopted and requires that audits are undertaken following the ISAE 3000 [12] standard, which specifies generic (i.e. not software nor AI related) principles for quality management, ethical behaviour and performance for use in non-financial areas.

11 https://www.iso.org/ics/35.080/x/

Other Non-technology Assurance Related Aspects. There is a large body of work looking at regulating the application of technology which is orthogonal to the approach presented in this paper, including how to handle issues of liability, intellectual property and copyright, sector specific regulation (e.g. health, autonomous vehicles, finance, etc.), privacy and data protection, fundamental rights, profiling and anti-discrimination issues, competition law, and legal personality. Quite a number of ethical frameworks have been proposed; it suffices to note that "at least 63 public-private initiatives have produced statements describing high-level principles, values and other tenets to guide the ethical development, deployment and governance of AI" [14]. Within the framework proposed herein an ethical framework is referenced; however, the scope of such a discussion would warrant a paper of its own.

6 CONCLUSIONS
Moving towards a more efficient digital world has its benefits; however, it brings various risks that need to be mitigated through adequate levels of technology assurance. In this paper we have highlighted the need for an AI technology assurance regulatory framework which is implemented in a manner that both promotes technology and does not stifle innovation, yet at the same time enforces assurances where required, and we have presented an implementation of a national AI regulatory framework that is overseen by a technology regulator. The guiding principle was that mandatory regulation of AI should be avoided, but other national regulators (e.g. finance, health, communications, etc.) can then work together with the technology-centric regulator to identify whether mandatory assurances are required.

REFERENCES
[1] Isaac Asimov. 1950. I, Robot. Gnome Press.
[2] Olivier Boiral. 2011. Managing with ISO systems: lessons from practice. Long Range Planning 44, 3 (2011), 197–220.
[3] Nick Bostrom and Eliezer Yudkowsky. 2014. The ethics of artificial intelligence. The Cambridge Handbook of Artificial Intelligence 1 (2014), 316–334.
[4] Francois Coallier. 1994. How ISO 9001 fits into the software world. IEEE Software 11, 1 (1994), 98–100.
[5] M. Delvaux. 2016. Motion for a European Parliament resolution: with recommendations to the Commission on civil law rules on robotics. Technical Report (2015/2103 (INL)), European Commission.
[6] Joshua Ellul, Jonathan Galea, Max Ganado, Stephen McCarthy, and Gordon J. Pace. 2020. Regulating Blockchain, DLT and Smart Contracts: a technology regulator's perspective. In ERA Forum. Springer, 1–12.
[7] Joshua Ellul, Stephen McCarthy, Trevor Sammut, Juanita Brockdorff, Matthew Scerri, and Gordon J. Pace. 2021. A Pragmatic Approach to Regulating Artificial Intelligence: A Technology Regulator's Perspective. CoRR abs/2105.06267 (2021). arXiv:2105.06267 http://arxiv.org/abs/2105.06267
[8] Olivia J. Erdélyi and Judy Goldsmith. 2018. Regulating artificial intelligence: Proposal for a global solution. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 95–101.
[9] Government of Malta. 2018. The Innovative Technology Arrangements and Services Act (Chapter 592 of the Laws of Malta). https://legislation.mt/eli/cap/592/eng/pdf
[10] Government of Malta. 2018. The Malta Digital Innovation Act (Chapter 591 of the Laws of Malta). https://legislation.mt/eli/cap/591/eng/pdf
[11] Gonenc Gurkaynak, Ilay Yilmaz, and Gunes Haksever. 2016. Stifling artificial intelligence: Human perils. Computer Law & Security Review 32, 5 (2016), 749–758.
[12] IAASB. 2013. ISAE 3000 (revised), assurance engagements other than audits or reviews of historical financial information. (2013).
[13] Keith Kirkpatrick. 2016. Battling algorithmic bias: How do we ensure algorithms treat us fairly?
[14] Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 33–44.
[15] Sandra Wachter, Brent Mittelstadt, and Luciano Floridi. 2017. Why a right to explanation of automated decision-making does not exist in the general data protection regulation. International Data Privacy Law 7, 2 (2017), 76–99.
[16] Roman V. Yampolskiy. 2013. Artificial intelligence safety engineering: Why machine ethics is a wrong approach. In Philosophy and Theory of Artificial Intelligence. Springer, 389–396.
Making Intelligent Online Dispute Resolution Tools available to Self-Represented Litigants in the Public Justice System∗
Towards an Ethical Use of AI Technology in the Administration of Justice

Fernando Esteban de la Rosa†
University of Granada
Granada, Spain
festeban@ugr.es

John Zeleznikow†
La Trobe University Law School
Melbourne, Australia
J.Zeleznikow@latrobe.edu.au

ABSTRACT
Over the last decade online dispute resolution (ODR) has moved from merely e-commerce litigation to widespread use in court systems. Two phenomena have led to this situation: the rise of Self-Represented Litigants and Courts moving beyond their traditional focus, allowing parties, for instance, to file a claim, formulate their arguments, obtain legal information or even receive a forecast about the resolution of the case. AI tools have mainly been used to enable legal professionals (lawyers, mediators) to better perform their tasks. Today some jurisdictions have begun to provide justice users with truly useful intelligent, user-centric ODR systems incorporating assessment and diagnosis AI tools. These tools may provide information about a possible outcome. This paper analyses the use being made by some jurisdictions of combined Online Dispute Resolution and Artificial Intelligence tools and aims to promote the debate on the ethical governance of making these tools available to unrepresented litigants. The evaluation follows a European perspective on the ethical governance of the use of AI in the Justice System.

KEYWORDS
Justice Systems, Intelligent user-centric Online Dispute Resolution Systems, Self-Represented Litigants, Ethics

ACM Reference Format:
Fernando Esteban de la Rosa and John Zeleznikow. 2021. Making Intelligent Online Dispute Resolution Tools available to Self-Represented Litigants in the Public Justice System: Towards an Ethical Use of AI Technology in the Administration of Justice. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466077

∗ As Lodder and Zeleznikow [8] indicate, whilst there is no generally accepted definition of ODR, we can think of it as using the Internet to perform Alternative Dispute Resolution (ADR).
† Both authors contributed equally to this research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466077

1 INTRODUCTION: THE NEW COMBINED USE OF ONLINE DISPUTE RESOLUTION AND ARTIFICIAL INTELLIGENCE TOOLS IN THE PUBLIC JUSTICE SYSTEM
Advances in Computing and Information and Communication Technology have opened new possibilities for implementing traditional models of justice systems. The development of the COVID-19 pandemic has further enhanced the rise of Online Dispute Resolution (ODR) and led to the incorporation of a wide range of technological tools into dispute resolution. Developments in Artificial Intelligence (AI) hold promise for improving efficiency and quality in the provision of access to justice, leading to improved transparency and standardisation of case-law.

Tania Sourdin [12] has suggested that there are three primary ways in which technology has already restructured the justice system. First, and at the most rudimentary level, are "supportive" technologies: these aim to inform, support and advise individuals involved in the justice system and include, for example, online legal applications (apps). At the second level are "replacement" technologies: these replace the roles and activities traditionally conducted by humans and include, inter alia, e-filing processes and online mediation services. Finally, and at the most advanced level, are "disruptive" technologies: these fundamentally alter the way in which legal professionals work and include, for example, AI judges or other algorithm-based decision-making programs that may reshape the judicial role.

Whilst there are numerous e-courts (for instance in the United Kingdom, USA and Australia), not many jurisdictions have taken the step forward of proposing a combined use of ODR tools designed to provide information in such a way that the parties may obtain a diagnosis or a prediction of the outcome, so that disputants can be better prepared to deal with direct negotiations conducted online. Manifestations of this combination of tools are found today in the Dutch platform Rechtwijzer, the new Internet Courts in China and the Civil Resolution Tribunal (CRT) in British Columbia (Canada). There are also projects with the same aim in Singapore and in Estonia; in the latter case this includes the creation of a "robot judge".

This paper analyses the combined use of ODR and AI tools in courts, with a special focus on cases where AI tools aim to support Self-Represented Litigants (SRLs) in reaching settlements. This combination of tools makes use of the notion of Bargaining in the Shadow of the Law developed by Mnookin and Kornhauser [9]. The
paper describes the underlying reasons that have triggered the rise of SRLs and includes proposals to ensure their fair treatment.

2 EXAMPLES INCORPORATING ARTIFICIAL INTELLIGENCE INTO ONLINE DISPUTE RESOLUTION SYSTEMS

2.1 The Dutch platform Rechtwijzer
Rechtwijzer1 (Roadmap to Justice) was designed for couples with children who are separating. The aim of Rechtwijzer was 'to empower citizens to solve their problems by themselves or together with his or her partner. If necessary, it refers people to the assistance of experts.' Couples pay €100 for access to Rechtwijzer, which starts by asking each partner for information such as their age, income, education, and whether they want the children to live with only one parent or part time with each, and then guides them through questions about their preferences.

The platform had a diagnosis phase; an intake phase for the initiating party; and then invited the other party to join and undertake the same intake process. Once intake is completed, the parties start working on agreements. The dispute resolution model is that of integrative (principled) negotiation [5]. The parties are informed of rules such as those for dividing property, child support and standard arrangements for visiting rights, so that they could agree on the basis of informed consent. Agreed agreements are reviewed by a neutral lawyer. If the proposed solutions are not accepted, then couples can employ the system to request a mediator for an additional €360, or a binding decision by an adjudicator. Rechtwijzer is voluntary and non-binding up until the point where the parties seek adjudication. Rechtwijzer had aimed to be self-financing through user contributions. This has not occurred.

1 https://rechtwijzer.nl/ last viewed 5 February 2021.

2.2 The British Columbia Civil Resolution Tribunal
The British Columbia Civil Resolution Tribunal [11] is the most significant current widely available ODR system that comes closest to providing a full suite of dispute resolution services. It commences by diagnosing the dispute, by providing a decision tree, and provides legal information and tools such as customized letter templates. If this action does not resolve the dispute, one can then apply to the Civil Resolution Tribunal for dispute resolution. The system directs the user to the appropriate application forms. Once the application is accepted, the user enters a secure and confidential negotiation platform, where the disputants can attempt to resolve their dispute. If the parties cannot resolve the dispute, a facilitator will assist. Agreements can be turned into enforceable orders. If negotiation or facilitation does not lead to a resolution, an independent member will make a determination about the dispute.

Currently, the Civil Resolution Tribunal deals with motor vehicle injury disputes, small claims disputes, strata property disputes, societies and cooperative associations disputes, and shared accommodation and some housing disputes. For some of these domains potential litigants can only use the Civil Resolution Tribunal.

To assist digitally disadvantaged litigants, technical support is provided in accessing the Internet. One of the major reasons that the Civil Resolution Tribunal has been so successful is that British Columbia residents are mandated to use the system when dealing with the issues listed above. Whilst such an approach may be seen as novel and discriminatory, it does ensure that the system is used, with relative ease, quickly and at minimal cost. In most cases parties are to represent themselves, even if representation and legal assistance is allowed.

2.3 The Internet Courts in China
Between 2017 and 2018 China created three new courts: the Hangzhou Internet Court, the Beijing Internet Court and the Guangzhou Internet Court. These courts only have material jurisdiction over internet-related cases. The online platform makes an intelligent litigation risk assessment system available to the user and can provide a report synthesising the litigant's case and the corresponding risk based on the analysis of court data and similar cases. Litigation risk assessment aims to help the party without legal knowledge to identify and exclude common litigation risks, thereby reducing unnecessary losses. Meanwhile, the assessment can make the party aware that litigation is risky and costly and guide the parties to choose ADR or diversified dispute resolution. The system can automatically generate a complaint letter by simply selecting the suitable response options [4] (a toy illustration of such template-based drafting is given at the end of this section).

2.4 Projects in Estonia and Singapore
In July 2019 the Estonian Ministry of Justice launched a project developing AI software to hear and resolve small economic disputes by eliminating human intervention [10]. The "robot judge" is configured to decide disputes of up to 7,000 euros. According to the project, the disputing parties would have to upload their documents and relevant information to a judicial platform. The AI machine renders a decision that can be appealed to a human judge. The project limits its scope to contractual disputes.

Singapore has been committed to digital justice since 2000. In recent years it has been developing a more ambitious online system, initially only for injuries arising from motor vehicle accidents. An outcome simulator will provide guidance to potential claimants, prior to the commencement of proceedings, helping them to decide on offers from insurance companies. The aim is for parties to first use the technology to reach amicable settlements without professional legal advisors [14].
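As a toy illustration only (this is our own sketch; it is not the Chinese Internet Courts' software, and the field names are invented), "generating a complaint letter from selected options" can be as simple as filling a fixed template from a handful of structured selections:

    def draft_complaint(claimant: str, respondent: str, dispute: str,
                        amount: float, remedy: str) -> str:
        """Turn a few structured selections into a draft complaint letter."""
        template = (
            "COMPLAINT\n"
            "Claimant: {claimant}\n"
            "Respondent: {respondent}\n"
            "Nature of dispute: {dispute}\n"
            "Amount claimed: {amount:.2f}\n"
            "Remedy sought: {remedy}\n"
        )
        return template.format(claimant=claimant, respondent=respondent,
                               dispute=dispute, amount=amount, remedy=remedy)

    print(draft_complaint("A. Buyer", "B. Seller Ltd",
                          "goods purchased online not delivered",
                          499.00, "refund of the purchase price"))

The real systems are of course far richer, but the sketch makes clear that the value to a lay user lies less in the generation step itself than in the guidance about which options to select.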
3 A FRAMEWORK FOR BUILDING ONLINE DISPUTE RESOLUTION TOOLS FOR SELF-REPRESENTED LITIGANTS
An increasing phenomenon in Common Law countries is the growing number of pro se (or self-represented) litigants. Landsman [7] argues that pro se cases pose inherent problems: they can cause delays, increase administrative costs, undermine the judges' ability to maintain impartiality and can leave the often-unsuccessful litigant feeling as though she has been treated unfairly.

Research conducted in the Family Court of Australia shows that there are a range of reasons why people represent themselves, such as funding cuts and changes in eligibility for legal aid [3]. Other contributing factors include changes in technology, cultural shifts towards self-help and self-representation, and changes in legislation. Experience of self-representation in Australian law has led to a general acceptance that SRLs are at a disadvantage in legal proceedings and that their experience of the legal system may indeed be negative. The lack of knowledge or skills of SRLs means that some are not able to access fair and equal justice in a system often geared towards legal representation.

In England litigants can go to court without legal aid; in practice, the technical and formal nature of proceedings, with the exception of the small claims procedure (for claims up to £10,000), makes legal aid necessary. Its lack has led to public dissatisfaction but also to frustration among judges, faced with the need to inform lay litigants about the technicalities of the process without being able to cross the line between information and legal aid. This situation led to a considerable increase in the time and cost spent on each judicial decision, even doubling it2. While some SRLs can present their case competently, most research suggests that SRLs struggle with substantive law and procedure [6].

2 JUSTICE, "Delivering Justice in an Age of Austerity" (April 2015). Available at https://justice.org.uk/justice-age-austerity-2/ last viewed 19 April 2021.

Recent experiences, such as the online court established in Utah, are demonstrating that ODR has the potential to transform the way the American legal system deals with pro se litigants and access to justice issues at large. Although it may seem counterintuitive to bridge the justice gap by precluding people from appearing in court, requiring certain types of claims to begin online will actually provide quicker and more accessible legal solutions. As long as the programming and administration of ODR technology are conducted with attention to legal and ethical concerns, pro se litigants will benefit from having their claims resolved online [2]. For this aim, access to justice is helped by the use of intelligent user-centric ODR systems incorporating assessment and diagnosis AI tools [15].

The approach of Stranieri et al. [13] for providing advice about the distribution of marital property following divorce in Australia was to use machine learning to provide advice about BATNAs (a BATNA is used to inform disputants of the likely outcome if the dispute were to be decided by a decision-maker, e.g. a judge, arbitrator or ombudsman). Despite using machine learning, it involved the development of 94 Toulmin argument structures [16] to model the domain as it existed in 1995. Twenty-five years later, the theoretical principles behind machine learning software have not changed, but computer hardware is now much cheaper and data can be much more easily stored. This has led to the development of 'quicker' systems, which the community has seen as 'more intelligent'3.

3 See for example amica.gov.au which uses machine learning to advise upon property distribution amongst separating couples in Australia.

Whilst the Split-Up system provides advice about BATNAs, the Family Winner system [1] provided advice to disputing parents on how they could best negotiate trade-offs. The disputing parties were asked to indicate how much they valued each item in dispute. Using logrolling, parties obtained what they most desired.

Zeleznikow [20] discusses how it is possible to build ODR systems that support self-represented litigants and what skills self-represented litigants require to use such systems. Zeleznikow [21] considers how we can construct such systems with user-centric computing. So, what are the various types of ODR systems and how can self-represented litigants use them? Having regard to the vulnerable position of the unrepresented litigant, a truly helpful ODR system should provide the following facilities (a simplified, illustrative sketch of the triaging and advisory components is given at the end of this section):
(1) Case management: the system should allow users to enter information, ask them for appropriate data and provide templates to initiate the dispute. Self-represented litigants should be able to initiate the dispute, enter their pertinent data and also track what is happening during the dispute, as well as being aware of what documents are required at specific times;
(2) Triaging: the system should provide information on how important it is to act in a timely manner and where to send the dispute. This may be particularly important in cases of domestic abuse or where there is a potential for children to be kidnapped. Triaging systems are vital for expediting action in high risk cases;
(3) Advisory tools: the system should provide tools for reality testing: these could include books, articles, reports of cases, copies of legislation and videos; there would also be calculators (such as to advise upon child support) and BATNA advisory systems (to inform disputants of the likely outcome if the dispute were to be decided by a decision-maker, e.g. a judge, arbitrator or ombudsman). Advisory tools, as suggested by Zeleznikow [19], are a vital cog in supporting self-represented litigants. An important associated question is how we can design advisory tools that self-represented litigants can gainfully use. Are the legal concepts behind these tools too difficult for amateurs to understand? How do we construct suitable user interfaces?
(4) Communication tools: for negotiation, mediation, conciliation or facilitation. This could involve shuttle mediation if required. For many ODR providers, the provision of communication tools is their main goal;
(5) Decision support tools: if the disputants cannot resolve their conflict, software using game theory or AI can be used to facilitate trade-offs. Professionals (such as lawyers) can provide useful advice regarding trade-offs. In their absence, suitable decision support tools are vital;
(6) Drafting software: if and once a negotiated agreement is reached, software can be used to draft suitable agreements. Drafting plans (such as parenting plans) once there is an in-principle agreement for a resolution of a dispute is a non-trivial task.

No single dispute is likely to require all six processes. However, the development of such a hybrid ODR system would be very significant. A total system would require us to construct the appropriate systems 1 to 6, and the ultimate solution is to make sure that all the systems are capable of communicating with each other.
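By way of illustration only (the rules, thresholds and figures below are invented for the example and do not come from any of the systems described above), the triaging and BATNA-advisory facilities could be as simple as a small rule base of the following kind:

    from dataclasses import dataclass

    @dataclass
    class Intake:
        dispute_type: str          # e.g. "parenting", "small_claim"
        amount_in_dispute: float   # monetary value, if any
        safety_concern: bool       # e.g. domestic abuse or risk to children

    def triage(case: Intake) -> str:
        """Rule-based triaging: decide urgency and a suggested track."""
        if case.safety_concern:
            return "URGENT: refer immediately to the court and support services"
        if case.dispute_type == "small_claim" and case.amount_in_dispute <= 10_000:
            return "Suggest online small-claims track; negotiation module first"
        return "Standard track: provide legal information and negotiation tools"

    def batna_range(amount_claimed: float, similar_case_outcomes: list[float]) -> tuple[float, float]:
        """A naive BATNA indication: the middle 50% of outcomes in similar cases,
        capped by the amount claimed. Real advisory tools would use far richer
        models of the relevant law and case data."""
        outcomes = sorted(min(o, amount_claimed) for o in similar_case_outcomes)
        lo = outcomes[len(outcomes) // 4]
        hi = outcomes[(3 * len(outcomes)) // 4]
        return lo, hi

    case = Intake("small_claim", 4_000.0, safety_concern=False)
    print(triage(case))
    print(batna_range(4_000.0, [1_500, 2_000, 2_500, 3_000, 3_800, 4_000, 4_500, 5_200]))

Even a sketch at this level makes visible where the design choices with ethical weight lie: which rules fire, which comparison cases are used, and how the resulting indication is communicated to a lay user. These are precisely the questions taken up in the next section.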
tems that support self-represented litigants and what skills do self-
represented litigants require to use such systems. Zeleznikow [21] 4 ETHICAL ISSUES RELATING TO THE
considers how we can construct such systems with user centric PROVISION OF ARTIFICIAL
computing. So, what are the various types of ODR systems and INTELLIGENCE-BASED TOOLS TO
how can self-represented litigants use them? Having regard to the SELF-REPRESENTED LITIGANTS BY THE
PUBLIC JUSTICE SYSTEM: A EUROPEAN
2 JUSTICE, “Delivering Justice in an Age of Austerity” (April 2015). Available in https:
PERSPECTIVE
//justice.org.uk/justice-age-austerity-2/ last viewed 19 April 2021
3 See for example amica.gov.au which uses machine learning to advise upon property Neither the recent official documents of the European Union deter-
distribution amongst separating couples in Australia. mining how AI should be used in the field of the administration of

197
ICAIL’21, June 21–25, 2021, São Paulo, Brazil F. Esteban de la Rosa and J. Zeleznikow

justice4 nor the European Ethical Charter (EEC) on the use of AI origin, religion or belief, disability, age or sexual orientation) are
in judicial Systems and their environment adopted in 2018 by the respected and rule of law and due process principles upheld.
European Commission for the Efficiency of Justice of the Council In order to understand the European position it is also relevant to
of Europe deal directly with the admission of AI tools aimed at know the criterion followed by the new proposal for a Regulation of
enabling the parties to assess their legal position. Because SRLs April 2021. AI systems intended for the administration of justice are
generally lack legal skills and in view of the objective to encour- not listed among the prohibited practices (art. 5) but among the high-
age negotiation we submit that this use of technology for these risk AI systems (point 40 of the preamble). The new proposal for a
purposes should be considered high-risk. Regulation separates two kinds of judicial activities: it is considered
The EEC points out the inherent risks in these technologies may as high-risk the systems intended to assist judicial authorities in
even transcend the act of judging and affect essential functioning researching and interpreting facts and the law and in applying the
elements of the rule of law and judicial systems. These include law to a concrete set of facts. Such qualification is not extended to AI
principles such as the primacy of law. These tools could create systems intended for purely ancillary administrative activities that
a new form of normativity, which could supplement the law by do not affect the actual administration of justice in individual cases.
regulating the sovereign discretion of the judge, and potentially The proposed Regulation does not establish the definitive answer
leading, in the long term, to a standardisation of judicial decision as any use of AI must continue to occur solely in accordance with
based no longer on case-by-case reasoning by the courts, but on the applicable requirements resulting from the European Charter
a pure statistical calculation linked to the average compensation of Fundamental Rights, the rest of European Law and the national
previously awarded by other courts. That is why the report submits law.
a need to consider whether these solutions are compatible with We submit that in view of the beneficial impact it may have on
the individual rights enshrined in the European Convention on the functioning of the judicial system, it is necessary to identify the
Human Rights (ECHR). These would include the rights to a fair real possibilities, technical limits and safeguards to be met by the
trial (particularly the right to a natural judge established by law, the machines offered by the public justice system to SRLs.
right to an independent and impartial tribunal and equality of arms For specific areas of administrative law it is possible to develop
in judicial proceedings) and, where insufficient care has been taken legal rules as code providing useful information and support for
to protect data communicated in open data, the right to respect for SRLs. The use of code as rules in combination with User Centric
private and family life. Thus the EEC considers that applications ODR Tools using decision trees, may have success promoting ac-
of predictive justice should be assigned to the field of research and cess to justice for SRLs. The CRT in the British Columbia is an
further development in order to ensure that they fully tie in with example of success. The design of AI rule-based systems does not
actual needs before contemplating use on a significant scale in the exhibit the difficulties arising from the lack of transparency and
public sphere. the creation of biases that may arise employing ML induction al-
The European Commission (EC) recognises that the use of AI gorithms. Deductive AI tools (the so called Experts Systems) allow
applications can bring many benefits, such as making use of infor- transparency and the monitoring of the machine output is facili-
mation in new and highly efficient ways, and improve access to tated to be able to rectify what is necessary in case any errors in the
justice, including by reducing the duration of judicial proceedings. programming are discovered. Programming is, however, a delicate
At the same time it is aware that the opacity or biases embedded in process and if not done well can lead to unfair treatment when the
certain AI applications can also lead to risks and challenges for the algorithm doesn’t match reality. This can occur when a one-size-
respect of and effective enforcement of fundamental rights, includ- fits-all rule is implemented in a complex environment. A recent
ing in particular the right to an effective remedy and a fair trial. example is Australia’s Centrelink “robodebt” debacle5 . In that case,
The EC recognises as a possible high-risk a use case using the tech- welfare payments made on the basis of self-reported fortnightly
nology as part of decision-making processes with significant effects income were cross-referenced against an estimated fortnightly in-
on the rights of people. However, it also considers that the pro- come, taken as a simple average of annual earnings reported to
posed requirements in the White Paper on increased transparency, the Australian Tax Office, and used to auto-generate debt notices
human oversight, accuracy and robustness of these systems aim without any further human scrutiny or explanation. This assump-
to facilitate their beneficial use, while ensuring that fundamental tion is at odds with how Australia’s highly casualised workforce is
rights including non-discrimination based on sex, racial or ethnic actually paid. For example, a graphic designer who was unable to
find work for nine months of the financial year but earned A$12,000
in the three months before June would have had an automated debt
4 Among the last official documents are the Proposal for a Regulation Laying down raised against her. This is despite no fraud having occurred, and
Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending this scenario constituting exactly the kind of hardship Centrelink
Certain Union Legislative Act of 21.4.2021 COM (2021) 206 final; the Communication is designed to address.
from the Commission to the European Parliament, the Council, the European Economic
and Social Committee and the Committee of the regions called “Digitalisation of justice Rules as codes requires alterations to be introduced in case of
in the European Union. A toolbox of opportunities”, COM (2020) 710 final, of 2.12.2020; legislative changes. Although it will not be possible to attain the
the European Parliament Resolution of 20 October 2020 with recommendations to
the Commission on a framework of ethical aspects of artificial intelligence, robotics
quality of advice offered by a legal expert, we submit that the infor-
and related technologies (2020/2012 (INL); White Paper on Artificial Intelligence - A mation provided to SRLs through machines makes a contribution to
European approach to excellence and trust, COM(2020) 65 final of Brussels, 19.2.2020;
the European e-Justice Strategy 2019-2023 of 13 March 2019 (2019/C 96/04) Council
2019-2023; the Digital Revolution in view of Citizens’ Needs and Rights. Opinion of
the European Economic and Social Committee of 20.02.2019. 5 See https://tinyurl.com/y3dqe6mg last viewed 19 April 2019

198
Making Intelligent Online Dispute Resolution Tools available to Self-Represented Litigants in the Public Justice System ICAIL’21, June 21–25, 2021, São Paulo, Brazil

improving access to justice for those who cannot afford legal assistance. Regarding the quality of advice provided by these machines, it seems reasonable to adopt the proposals of the European Commission about requirements concerning possible testing of applications and the need to provide relevant documentation on their purposes and functionalities. It also seems reasonable to require maintaining the possibility to correct errors and providing information to the user that the answer given by the machine may not necessarily match the answer that would be given by a judge hearing the case.

Two of the disadvantages of the use of Machine Learning systems are that they are not transparent, and the data and the software on which they are based may be manipulated. There is also a concern that the use of Machine Learning in the legal system will worsen biases against minorities or deepen the divide between those who can afford quality legal assistance and those who cannot [17]. Algorithms will continue to reproduce existing biases against vulnerable groups because the algorithms are largely copying and amplifying the decision-making trends embedded in the legal system. There is already a class divide in legal access: those who can afford high-quality legal professionals will always have an advantage. The development of intelligent support systems can partially redress this power imbalance by providing users with important legal information that was previously unavailable to them. Difficulties may stem from biases. One example is COMPAS, a decision support system designed to help parole boards in the United States [18] decide which prisoners to release early, by providing a probability score of their likelihood of reoffending. Rather than rely on a simple decision rule, the algorithm used a range of inputs, including demographic and survey information, to derive a score. The algorithm did not use race as an explicit variable, but it did embed systemic racism by using variables that were shaped by police and judicial biases.

What can be done is to ensure the traceability and cleanliness of the data with which the machine operates, and to introduce elements of weighting. But as Richard Susskind considers, ethical programming is not feasible. It is not at all clear, either technically or philosophically, what is meant when it is proposed that ethics should be embedded in Machine Learning. Nor is it clear what is meant when it is demanded that software engineers program Machine Learning systems to provide intelligent explanations. To think so is to misunderstand the difference between the inductive processes inherent in Machine Learning and the kind of argument we expect when we ask for an explanation [14].

A different issue is the use of AI tools by judges to decide a case. We share the European Commission's view that it is important that judgments are delivered by judges who fully understand the AI applications and all information taken into account therein that they might use in their work (AI not to replace but as Augmented Intelligence), on the understanding that the use of AI applications must not prevent any public body from giving explanations for its decisions. As for the machine being able to decide the case on its own, as the Estonian project poses, this should not be completely ruled out. However, we are not at that stage yet! In the current state of the art, machines can neither motivate nor explain the decisions and predictions they make [14]. Legal arguments require persuasion that does not depend on predictable variables.

5 CONCLUSION
One of the latest trends in the incorporation of technology in the administration of justice is the provision of support to SRLs by public justice systems through a combination of AI and ODR tools. These allow SRLs to obtain a diagnosis of the case, which influences the parties either to determine a dismissal of the action or how to negotiate. This combination of tools shows great potential in reducing the level and duration of litigation. The paper submits that this use of the technology must be considered as high risk as it may function as a replacement of judicial activities. However, it is still possible to obtain positive results from this technology by inserting some safeguards, as is beginning to emerge from the European legal sphere. The debate is now about what safeguards are necessary to ensure that the use of high-risk artificial intelligence tools in the field of justice is fully compatible with the rule of law. The implementation and use of this technology should be preceded by the detection and diagnosis of the functioning of justice in specific sectors, so that the efforts are made in the areas with most pressing needs.

REFERENCES
[1] Emilia Bellucci and John Zeleznikow. 2006. Developing Negotiation Decision Support Systems that support mediators: a case study of the Family_Winner system. Journal of Artificial Intelligence and Law 13, 2 (2006), 233–271.
[2] Julianne Dardanes. 2021. When Accessing Justice Requires Absence from the Courthouse: Utah's Online Dispute Resolution Program and the Impact it Will Have on Pro Se Litigants. Pepperdine Dispute Resolution Law Journal 21, 1 (2021).
[3] John Dewar, Barry W. Smith, and Cate Banks. 2000. Litigants in Person in the Family Court of Australia. Research Report No 20. Family Court of Australia, Canberra.
[4] Xuhui Fang. 2018. Recent Development of Internet Courts in China. International Journal on Online Dispute Resolution 5, 1–2 (2018), 49–55.
[5] Roger Fisher and William Ury. 1981. Getting to Yes. Penguin Group, New York.
[6] Hazel Genn and Yvette Genn. 1989. The effectiveness of representation at tribunals. Lord Chancellor's Department.
[7] Stephan Landsman. 2009. The growing challenge of pro se litigation. Lewis & Clark Law Review 13 (2009), 439.
[8] Arno Lodder and John Zeleznikow. 2010. Enhanced dispute resolution through the use of information technology. Cambridge University Press.
[9] Robert H. Mnookin and Lewis Kornhauser. 1979. Bargaining in the shadow of the law: The case of divorce. The Yale Law Journal 88, 5 (1979), 950.
[10] Eric Niiler. 2019. Can AI be a Fair Judge in Court? Estonia Thinks So. https://www.wired.com/story/can-ai-be-fair-judge-court-estonia-thinks-so/
[11] Shannon Salter and Darin Thompson. 2016. Public-Centred Civil Justice Redesign: A Case Study of the British Columbia Civil Resolution Tribunal. McGill Journal of Dispute Resolution 3 (2016), 113.
[12] Tania Sourdin. 2018. Judge v. Robot: Artificial Intelligence and Judicial Decision-Making. UNSWLJ 41 (2018), 1114.
[13] Andrew Stranieri, John Zeleznikow, Mark Gawler, and Bryn Lewis. 1999. A hybrid rule–neural approach for the automation of legal reasoning in the discretionary domain of family law in Australia. Artificial Intelligence and Law 7, 2–3 (1999).
[14] Richard E. Susskind. 2019. Online courts and the future of justice. Oxford University Press.
[15] Darin Thompson. 2015. Creating New Pathways to Justice Using Simple Artificial Intelligence and Online Dispute Resolution. International Journal of Online Dispute Resolution 4 (2015).
[16] S. Toulmin. 1958. The Uses of Argument. Cambridge University Press, Cambridge.
[17] Peter K. Yu. 2020. The Algorithmic Divide and Equality in the Age of Artificial Intelligence. Florida Law Review 72 (2020), 331.
[18] Monika Zalnieriute, Lyria Bennett Moses, and George Williams. 2019. The rule of law and automation of government decision-making. The Modern Law Review 82, 3 (2019), 425–455.
[19] J. Zeleznikow. 2002. Using Web-based Legal Decision Support Systems to Improve Access to Justice. Information and Communications Technology Law 11, 1 (2002).
[20] John Zeleznikow. 2020. The challenges of using Online Dispute Resolution to support Self Represented Litigants. Journal of Internet Law 23, 7 (2020).
[21] John Zeleznikow. 2021. Using Artificial Intelligence to provide Intelligent Dispute Resolution Support. Group Decision and Negotiation (2021). https://doi.org/10.1007/s10726-021-09734-1

Plum2Text: A French Plumitifs–Descriptions Data-to-Text
Dataset for Natural Language Generation
Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel
Laval University, Computer Science Department and Faculty of Law
Québec, Canada
nicolas.garneau@ift.ulaval.ca,eve.gaumond@observatoire-ia.ulaval.ca
luc.lamontagne@ift.ulaval.ca,pierre-luc.deziel@fd.ulaval.ca

ABSTRACT
In this paper, we introduce a new French Data-to-Text (D2T) dataset in the legal domain: Plum2Text¹. It is made out of plumitif (docket file)–description pairs that are derived from publicly available documents issued by Canadian criminal courts. The development of Plum2Text is primarily intended to train statistical natural language generation algorithms, in order to make the plumitifs more easily understandable for Canadian citizens. The inputs and outputs of the dataset are unique: on the data side, the values of the table contain long pieces of textual utterance, and on the text side (or reference), it most often consists of a paraphrase of the table values. We describe how we curated the plumitif–description associations by introducing an annotation tool and a methodology specific to the D2T natural language generation task. We do so by using simple yet efficient text classifiers to help the annotator leverage annotated examples during the annotation process. As a matter of privacy, we also illustrate how we are decontextualizing the descriptions.

CCS CONCEPTS
• Applied computing → Law; Annotation; • Computing methodologies → Language resources; Natural language generation;

KEYWORDS
Legal Language Resource, Natural Language Generation, Annotation Methodology

ACM Reference Format:
Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel. 2021. Plum2Text: A French Plumitifs–Descriptions Data-to-Text Dataset for Natural Language Generation. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466148

1 INTRODUCTION
D2T generation [12, 18] is a specialized task of natural language generation (NLG) where a model takes as input semi-structured data (e.g. a table 𝑡) and generates a textual utterance (namely the hypothesis ℎ) that is syntactically and semantically faithful to both the structured input and one or possibly several textual references 𝑟. In this paper, we propose Plum2Text, a new French D2T dataset in the legal domain, rather different from those previously introduced in the literature. The dataset was built from Quebec's plumitifs, which are legal documents that lie in the same family as court dockets. The plumitifs are textual summaries, written in a structured format, which present all the steps of a judicial case. They also provide information about the parties' identity, the jurisdiction in charge of administering the case, and some information relating to the nature and the course of the proceeding, as illustrated in Figure 1.²

Figure 1: A Plumitif example illustrating the accused and plaintiff personal information along with charges and associated pleas, decisions and penalty.

It has been shown that the plumitifs lack intelligibility [1, 19]. Indeed, these files are written by clerks and contain several abbreviations and references to the Criminal Code's provisions, making it difficult for litigants to understand. Beauchemin et al. [1] attempted to generate plumitifs descriptions using a rule-based system. They found out however that their model could hardly generalize to plumitifs from other judicial districts or having slight differences in the way they were organized.

¹ https://nicolas.nlp.quebec/files/icail_2021_plum2text.jsonl
² All examples (except screenshots) in this paper have been translated from French to English to facilitate understanding.


But having only the plumitifs in hand, they could not train a statistical model for natural language generation in order to alleviate this problem. We wish to solve this issue by working with court judgments. One way to describe judgments is that they are in a sense a longer version of the plumitif; they thoroughly explain what is written in the plumitif. By pairing plumitifs with their court judgments, we created Plum2Text, a D2T dataset that can be used to train a statistical language generation model. It also allowed us to reframe the problem into one that can be solved by D2T natural language solutions. The input and output components of Plum2Text are unique on their own; the values of the table contain long pieces of text (e.g. paragraphs from the law) and the references are mostly paraphrases of the table content. In the next sections, we compare Plum2Text with other standard D2T datasets known to the community. We also position our contribution amongst other datasets in the legal field. We provide an at-length explanation of the annotation process we used to create Plum2Text and present the tools we designed for this task. We hope that our methodology will encourage the creation of many other datasets not only in the legal field but also in different D2T domains.

Table 1: Statistics about the different D2T datasets introduced in Section 2.1 (WikiBio, WebNLG, RotoWire, RotoWire augmented and purified by Wang [21], E2E, ToTTo) and our dataset, Plum2Text. For each dataset, we present the number of examples (NE), the average number of input attributes (NA), along with the average number of tokens for the inputs (A-Avg) and the references (R-Avg).

Datasets      NE     NA     A-Avg   R-Avg
WikiBio       728K   20     3       26
WebNLG        22K    1-7    5       25
RotoWire      4.9K   630    1       340
RotoWireW+    7.5K   630    1       206
E2E           50K    3-8    15      20
ToTTo         120K   4      2       20
Plum2Text     2.5K   2-9    61      50

2 RELATED WORK
Since we are introducing a new D2T dataset in the legal domain, we first introduce standard D2T datasets on general purpose domains (e.g. Wikipedia) known to the community and further describe what has been proposed in the legal field regarding different natural language processing tasks.

2.1 Data-to-Text Datasets
There is a handful of D2T datasets and each of them has its specifics that we further expose in this section. Lebret et al. [13] introduced WikiBio, a large-scale fact table of biographical sentences extracted from Wikipedia. This dataset has 728K examples, each containing on average 20 records and 26 description tokens. In a similar vein, Gardent et al. [9] introduced the WebNLG dataset consisting of 22K RDF–text pairs extracted from DBPedia.³ An input may contain up to seven RDF triples and the average text length is 25 tokens. Wiseman et al. [23] introduced RotoWire, basketball game summaries paired with their corresponding statistic tables. The dataset provides 4.9K examples containing on average 630 records with longer associated text of 340 tokens, on average. Wang [21] extended RotoWire, namely RotoWire-FG, adding 50% more data and enriched input tables. Wang [21] observed that only 60% of the summary contents can be grounded to the table records and thus proposed a purified version of RotoWire. Thomson et al. [20] refined Rotowire-FG even further by providing more attributes across multiple dimensions, increasing the content overlap between statistic tables and reference texts. Dušek et al. [8] proposed the E2E challenge dataset, a crowdsourced dataset of 50K examples in the restaurant domain, as well as a cleaned version of it [7]. Inputs, using the Meaning Representation format, contain 3-8 attributes and an average of 20 tokens per reference text. Finally, Parikh et al. [15] proposed an open-domain D2T dataset, ToTTo, covering a wide range of topics. It contains over 120K examples of Wikipedia tables along with one-sentence descriptions. They extend their dataset with highlighted cells offering better control for generation. On average, a given input contains 4 attributes and the reference text has 20 tokens. While this list of D2T datasets is not exhaustive, it illustrates the diverse nature of the data which is tightly coupled to the task of D2T NLG. We aggregated the datasets' statistics in Table 1.

2.2 Datasets in the Legal Domain
Legal NLP is an emerging field and so is the creation of datasets supporting the recent advances. While the following list of resources is not exhaustive, we selected the ones most related to our contribution. Kano et al. [11] introduced in 2018 the COLIEE legal case retrieval task along with a dataset drawn from an existing collection of case law primarily from the Federal Court of Canada. In 2019, Rabelo et al. [17] introduced the statute law competition data corpus where the questions (for the question answering task) were drawn from Japanese Legal Bar exams. They also proposed three new tasks: legal case entailment, statute law retrieval, and legal question answering.

Xiao et al. [24] introduced the Chinese AI and Law challenge dataset (CAIL2018), the first large-scale Chinese legal dataset for judgment prediction. Still in Chinese, Duan et al. [6] introduced the Chinese judicial reading comprehension (CJRC) dataset. More recently, Chalkidis et al. [3] proposed a large-scale multi-label text classification dataset on the European Union legislation. In the French Canadian spectrum, Westermann et al. [22] introduced two datasets drawn from 1 million written judgments from the Régie du logement du Québec, where they used factors to predict and analyze landlord–tenant judgments and to create a chatbot out of it. Cumyn et al. [4] proposed an annotated set of 2,500 judgments using a faceted scheme with the objective of improving the performance of legal search engines. Closer to the NLG task, Bhattacharya et al. [2] studied several summarization techniques on a large set of Indian Supreme Court judgments. Ye et al. [25] proposed a dataset where a generative model learns to generate court views from fact descriptions in Chinese. They framed the generation problem as text summarization. From that point of view, Plum2Text is a rather unique new dataset in the legal field since it tackles a new task (i.e. D2T generation) and is in French.

³ https://wiki.dbpedia.org/
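To make the shape of such a data-to-text pair concrete before turning to the dataset itself, the following is a purely illustrative sketch of what one Plum2Text-style record could look like when serialized as a JSON line; every field name and value here is a hypothetical example for illustration, not the actual schema of the released file.

```python
import json

# Hypothetical record: the table values are long text spans (e.g. the provision's
# wording) and the reference is a paraphrasing sentence drawn from the judgment.
record = {
    "table": {
        "provision": "465 (1) c Criminal Code - conspiracy to commit an indictable offence ...",
        "verdict": "guilty",
        "plea": "not guilty",
    },
    "references": [
        "The accused pleaded not guilty but was found guilty of conspiring to commit an indictable offence.",
    ],
}

print(json.dumps(record, ensure_ascii=False))  # one line of a .jsonl file
```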


3 THE PLUM2TEXT DATASET
In this section, we first introduce the two source datasets used to create Plum2Text. We further describe the methodology and annotation tools we developed to rapidly obtain such a dataset. We then explain how we deidentified⁴ the resulting dataset for a matter of privacy, and then conclude with a set of statistics.

3.1 Source Datasets
In this section, we introduce the two source datasets, namely Quebec's plumitifs and their associated court judgments.

3.1.1 Quebec's plumitifs. Quebec's plumitifs are essentially used as summaries presenting all the steps of a court case. Information concerning the parties' identity, the jurisdiction in charge of administering the case, and some information regarding the nature and the course of the proceeding are available through the plumitifs. In the context of criminal proceedings, they contain relevant information for litigants about the plaintiff, the defendant, and the various charges he's convicted of along with associated penalties (if applicable), as illustrated in Figure 1. Beauchemin et al. [1], who previously introduced the plumitif corpus for NLG, showed that there is a need to make the plumitifs more understandable. They proposed a template-based generation architecture which, given a raw plumitif text, generates its description in an unsupervised way due to the lack of annotated pairs. They concluded that a generalization challenge stems from the usage rules to generate a plumitif's description.

3.1.2 Court Judgments. We found out recently that a large proportion of plumitifs have an associated court judgment (document) that describes at length the content of a given criminal case, hence being a textual description of the plumitif. We selected the same criminal plumitifs introduced in [1], added some plumitifs from 21 other districts, and fetched their associated judgment (in French, if available) heard by the Criminal and penal division of the Court of Québec. With that in mind, we deemed it possible to create such a D2T dataset to train a statistical NLG architecture, hoping to solve the generalization problem raised by Beauchemin et al. [1]. With the plumitifs and judgments both gathered from the SOQUIJ website⁵, we created a total of 1,289 plumitif–judgment pairs from 29 different districts. Judgments, often spanning over several pages, are composed of multiple paragraphs containing detailed contextual information about the case⁶. However, most of the document's content is irrelevant to the generation of the plumitif's description (e.g. non-essential factual details). To align the plumitif's components with its corresponding judgment's sections, we mainly focus on paragraphs relating to the charges, verdict, and plea. We explain in the next section how we retrieve these paragraphs from the judgment in order to obtain plumitifs' descriptions, and we provide in Figure 2 an example of the kind of passage we are looking for.

Figure 2: An excerpt listing the paragraph (1) containing the accusations associated to the plumitif in Figure 1, an accusation of conspiracy (art. 465 (1) c from the Criminal Code of Canada).

3.2 Annotation Methodology
In this section, we present our methodology and the main annotation interface. We illustrate the benefits of using simple yet efficient text classifiers in the annotation pipeline in order to rapidly create a D2T dataset. On average, the process of annotating pairs of plumitif–judgment with the help of classifiers takes approximately two to five minutes per pair.

3.2.1 The Plumitif Setting. The plumitif part is decomposed into 3 sections: the defendant, the plaintiff, and the list of charges. In this work, we focus on generating the charges section of a plumitif since it conveys an interesting fact verification challenge. More precisely, each charge is composed of three components: the provision, the verdict, and the plea. Similarly, a judgment associated with the plumitif can be broken down into a set of relevant paragraphs. The main goal of the annotator is to find the mapping between the set of charges' components and their respective relevant paragraphs. This association step refines the context table 𝑡, in a similar way Dhingra et al. [5] did by highlighting cells in Wikipedia tables.

3.2.2 Annotation Application. The annotation application is composed of one main interface: the plumitif–judgment association interface, as depicted in Figure 3. The interface contains on the left-hand side the set of accusations of a given plumitif, and on the right-hand side, the list of paragraphs that compose the judgment. The annotator can then create new plumitif–description annotations by selecting which paragraph is referring to which accusation. Once the annotator has skimmed through the whole decision document, he can move onto the next plumitif.

However, documents can be very long, with many paragraphs useless to the task of plumitif description generation. To this end, we train a Relevant Paragraph Classifier (RPC) that learns whether a paragraph is relevant or not. This classifier facilitates the work of the annotator by presenting beforehand paragraphs that potentially refer to the plumitif. Given the list of relevant paragraphs and some parts of the plumitif that are left without associations, the annotator must nevertheless skim through the whole judgment to make sure the classifier did not miss something.

⁴ What we mean by deidentifying is that the released information cannot reasonably be expected, either by itself or when combined with other information available, to enable the reidentification of individuals, as Ohm [14] explained on page 1744.
⁵ https://soquij.qc.ca/
⁶ On average, a judgment has 160 paragraphs.
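A minimal sketch of the kind of binary relevant-paragraph classifier described here (whose actual construction is detailed in the next section), written against the spaCy 2.x text categorizer API; the training pairs, label names and hyper-parameters below are illustrative assumptions rather than the authors' setup.

```python
import random
import spacy

# Illustrative training pairs: (paragraph, is_relevant). The real classifier used
# 1,000 relevant and 1,000 irrelevant paragraphs sampled from the judgments.
TRAIN = [
    ("Le tribunal déclare l'accusé coupable du premier chef d'accusation.", True),
    ("L'audience est ajournée à la demande des parties.", False),
]

nlp = spacy.blank("fr")
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
nlp.add_pipe(textcat)
textcat.add_label("RELEVANT")
textcat.add_label("IRRELEVANT")

optimizer = nlp.begin_training()
for _ in range(10):
    random.shuffle(TRAIN)
    for text, relevant in TRAIN:
        cats = {"RELEVANT": float(relevant), "IRRELEVANT": float(not relevant)}
        nlp.update([text], [{"cats": cats}], sgd=optimizer)

# Score a new paragraph; a high RELEVANT probability surfaces it to the annotator.
doc = nlp("L'accusé a enregistré un plaidoyer de culpabilité sur le chef 2.")
print(doc.cats)
```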


Figure 3: The accusation–paragraphs association interface. The set of accusations is displayed on the left and the whole decision on the right. The relevant paragraph classifier scores every paragraph and those with a high probability of being relevant are displayed at the top of the interface, in the Relevant section.

To construct the RPC, we trained a standard binary text classifier from the spaCy library [10]. We used 1,000 relevant paragraphs and 1,000 irrelevant ones randomly sampled from the judgments. On the test set, the paragraph classifier obtains 85% accuracy.

3.3 Decontextualization of the Paragraphs
In this section, we present how we decontextualize the paragraphs (also known as delexicalization, see [18]). Paul Panenghat et al. [16] showed that delexicalizing a dataset not only removes bias but also improves out-of-domain portability, for instance in our case from criminal to civil law. Our decontextualization process is motivated by the underlying task Plum2Text is designed for, i.e. generating an intelligible description of a plumitif. With that in mind, we remove not only personal information but also any information in the paragraph that cannot be found in the plumitif.

3.3.1 Automatic Depersonalization. Following the argumentation of Beauchemin et al. [1], we deemed it essential to remove any personal information that the judgment's paragraphs may hold, especially in the case of releasing this dataset. While the main concern here is privacy, it also greatly reduces the vocabulary size by removing rare tokens, and hopefully will improve the performance of NLG models. The first step of our decontextualisation process is to automatically remove all names as well as all information describing places or organisations (e.g. police department). A pre-trained named entity recognition (NER) model [10] is used for this purpose. We also replace dates using regular expressions.

3.3.2 Information Specific to the Case. To further decontextualize the paragraphs, we manually remove information specific to the case. More concretely, we went through every paragraph to make sure that people's privacy is well protected:

(1) Replace names, locations, and organizations that the NER model may have missed. For example, to preserve the privacy of certain parties according to the order restricting publication (486.4 (1) C.Cr.), names are often elided as such: J... D... for John Doe. A pre-trained NER model does not catch this kind of elision and leaves unnecessary noise within the dataset. It had to be done manually.
(2) Remove contextual information specific to the crime that was perpetrated. For example, one accused may have assaulted his neighbor, which is very specific to the case. Furthermore, we remove information that is not supported by the table 𝑡, such as "after a trial of five days".
(3) Remove gender and numbers⁷ related to the accused. As such, we replace the French feminine version of "accusée" (accused) with "accusé". One example of an accusation involving several defendants is conspiracy, where "Person 1 and Person 2 conspired...". We thus replace "Person 1 and Person 2" with "The accused".
(4) Remove information describing the victims, such as "Person X, only 9 years old" or "Person X, a woman". We also normalize amounts, such as X$ or X kilos of cocaine, regarding crimes committed against the controlled drugs and substances act.

Decontextualization reduced the vocabulary size from 2,144 to 1,464 token types. We should also mention that we retrieve the sections text from their corresponding law. Compared to the D2T datasets introduced in Section 2.1, Plum2Text is particularly interesting as several values from the table are sentences or even text paragraphs (the laws), as illustrated in Figure 4.

⁷ One case may be regarding several defendants.
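A minimal sketch of the automatic depersonalization step of Section 3.3.1, assuming spaCy's pre-trained French model fr_core_news_md and a deliberately simple date pattern; the placeholder strings and the regular expression are illustrative assumptions, not the authors' exact pipeline.

```python
import re
import spacy

nlp = spacy.load("fr_core_news_md")  # pre-trained French NER (labels PER, LOC, ORG, MISC)
DATE_RE = re.compile(r"\b\d{1,2}\s+\w+\s+\d{4}\b|\b\d{2}/\d{2}/\d{4}\b")

def depersonalize(paragraph: str) -> str:
    doc = nlp(paragraph)
    out = paragraph
    # Replace entities from right to left so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in {"PER", "LOC", "ORG"}:
            out = out[:ent.start_char] + "[" + ent.label_ + "]" + out[ent.end_char:]
    return DATE_RE.sub("[DATE]", out)

# Hypothetical sentence; the expected output is roughly
# "[PER] a comparu à [LOC] le [DATE]."
print(depersonalize("Jean Tremblay a comparu à Québec le 3 mars 2019."))
```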


Also, the resulting annotated dataset contains on average 61 tokens per input value, whereas a typical D2T dataset usually has 1 to 5 tokens, as depicted in Table 1. Furthermore, a table has on average 5 associated references with overlapping table values. These characteristics pose some challenges for both the generation of a description and its evaluation.

Figure 4: A translated example of a Plumitif–Paragraph pair from our dataset. The highlighted information in the boxes illustrates the presence of paraphrasing within Plum2Text.

4 CONCLUSION
In this paper, we introduced a new French D2T dataset in the legal field, Plum2Text. We thoroughly present how we created and annotated this dataset, by introducing a methodology that we believe will help the research community. The creation of the dataset presented in this paper is a stepping stone in the development of a web application aiming at making plumitifs – a legal document providing a summary of a given judicial case – more easily understandable. As explained by [1], enhancing the intelligibility of plumitifs fosters the right to access judicial information, which is a hallmark of Canadian democracy (and that of many other countries as well). In future works, we plan to train a statistical NLG model on Plum2Text and enable intelligible descriptions of plumitifs at scale.

Acknowledgements
We thank the reviewers for their insightful comments. This research was funded by the Natural Sciences and Engineering Research Council of Canada and the Social Sciences and Humanities Research Council of Canada.

REFERENCES
[1] David Beauchemin, Nicolas Garneau, Eve Gaumond, Pierre-Luc Déziel, Richard Khoury, and Luc Lamontagne. 2020. Generating Intelligible Plumitifs Descriptions: Use Case Application with Ethical Considerations. In Proceedings of the 13th International Conference on Natural Language Generation. Association for Computational Linguistics, Dublin, Ireland, 15–21. https://www.aclweb.org/anthology/2020.inlg-1.3
[2] P. Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and S. Ghosh. 2019. A Comparative Study of Summarization Algorithms Applied to Legal Case Judgments. In ECIR.
[3] Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2019. Large-Scale Multi-Label Text Classification on EU Legislation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6314–6322. https://doi.org/10.18653/v1/P19-1636
[4] Michelle Cumyn, Günter Reiner, S. Mas, and David Lesieur. 2019. Legal Knowledge Representation Using a Faceted Scheme. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law (2019).
[5] Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4884–4895. https://doi.org/10.18653/v1/P19-1483
[6] X. Duan, Baoxin Wang, Ziyue Wang, Wentao Ma, Yiming Cui, D. Wu, S. Wang, T. Liu, Tianxiang Huo, Z. Hu, Heng Wang, and Z. Liu. 2019. CJRC: A Reliable Human-Annotated Benchmark Dataset for Chinese Judicial Reading Comprehension. ArXiv abs/1912.09156 (2019).
[7] Ondřej Dušek, David M. Howcroft, and Verena Rieser. 2019. Semantic Noise Matters for Neural Natural Language Generation. In Proc. of the 12th International Conference on Natural Language Generation. Association for Computational Linguistics, Tokyo, Japan, 421–426. https://doi.org/10.18653/v1/W19-8652 arXiv:1911.03905
[8] Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge. Computer Speech & Language 59 (Jan. 2020), 123–156. https://doi.org/10.1016/j.csl.2019.06.009 arXiv:1901.11528
[9] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 179–188. https://doi.org/10.18653/v1/P17-1017
[10] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. (2017). To appear.
[11] Yoshinobu Kano, Miyoung Kim, M. Yoshioka, Yao Lu, J. Rabelo, Naoki Kiyota, R. Goebel, and K. Satoh. 2018. COLIEE-2018: Evaluation of the Competition on Legal Information Extraction and Entailment. In JSAI-isAI Workshops.
[12] K. Kukich. 1983. Design of a Knowledge-Based Report Generator. In ACL.
[13] Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 1203–1213. https://doi.org/10.18653/v1/D16-1128
[14] Paul Ohm. 2009. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Review 57 (2009), 1701.
[15] Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A Controlled Table-To-Text Generation Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 1173–1186. https://doi.org/10.18653/v1/2020.emnlp-main.89
[16] Mithun Paul Panenghat, Sandeep Suntwal, Faiz Rafique, Rebecca Sharp, and Mihai Surdeanu. 2020. Towards the Necessity for Debiasing Natural Language Inference Datasets. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 6883–6888. https://www.aclweb.org/anthology/2020.lrec-1.850
[17] J. Rabelo, Miyoung Kim, R. Goebel, M. Yoshioka, Yoshinobu Kano, and K. Satoh. 2019. A Summary of the COLIEE 2019 Competition. In JSAI-isAI Workshops.
[18] Ehud Reiter and R. Dale. 1997. Building applied natural language generation systems. Natural Language Engineering 3 (1997), 57–87.
[19] Sandrine Prom Tep, Florence Millerand, Alexandra Parada, Alexandra Bahary, Pierre Noreau, and Anne-Marie Santorineos. 2019. Legal Information in Digital Form: the Challenge of Accessing Computerized Court Records. IJR 8 (2019).
[20] Craig Thomson, Ehud Reiter, and Somayajulu Sripada. 2020. SportSett: Basketball - A robust and maintainable dataset for Natural Language Generation. In IntelLanG: Intelligent Information Processing and Natural Language Generation. https://intellang.github.io/
[21] Hongmin Wang. 2019. Revisiting Challenges in Data-to-Text Generation with Fact Grounding. In Proceedings of the 12th International Conference on Natural Language Generation. Association for Computational Linguistics, Tokyo, Japan, 311–322. https://doi.org/10.18653/v1/W19-8639
[22] Hannes Westermann, V. Walker, K. Ashley, and Karim Benyekhlef. 2019. Using Factors to Predict and Analyze Landlord-Tenant Decisions to Increase Access to Justice. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law (2019).
[23] Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in Data-to-Document Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 2253–2263. https://doi.org/10.18653/v1/D17-1239
[24] Chaojun Xiao, Haoxi Zhong, Z. Guo, Cunchao Tu, Zhiyuan Liu, M. Sun, Yansong Feng, Xianpei Han, Z. Hu, Heng Wang, and J. Xu. 2018. CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. ArXiv abs/1807.02478 (2018).
[25] Hai Ye, Xin Jiang, Zhunchen Luo, and W. Chao. 2018. Interpretable Charge Predictions for Criminal Cases: Learning to Generate Court Views from Fact Descriptions. ArXiv abs/1802.08504 (2018).

Anonymization of German Legal Court Rulings
Ingo Glaser Tom Schamberger Florian Matthes
Technical University of Munich Technical University of Munich Technical University of Munich
Garching bei München, Germany Garching bei München, Germany Garching bei München, Germany
ingo.glaser@tum.de tom.schamberger@tum.de florian.matthes@tum.de

ABSTRACT
In the legal domain, many legal documents such as court decisions and contracts are regularly anonymized. This process requires text sequences with high sensitivity to be identified and neutralized to secure sensitive information from third parties. Usually, this process is performed manually by trained employees. Therefore, anonymization is generally considered an expensive and inefficient process. This work proposes a machine learning approach for the automatic identification of sensitive text elements in German legal court decisions and provides an implementation. For this task, different deep neural network architectures based on generally pre-trained contextual embeddings as well as trained word embeddings are evaluated. Because of the lack of non-anonymized data sets, an approach to create pseudonymized data sets is proposed as well.

CCS CONCEPTS
• Applied computing → Law; • Information systems → Expert systems; • Computing methodologies → Neural networks.

ACM Reference Format:
Ingo Glaser, Tom Schamberger, and Florian Matthes. 2021. Anonymization of German Legal Court Rulings. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466087

1 INTRODUCTION
In the age of digitization, an increasing amount of documents is created in a digital manner, but more importantly also made available online. As a result, personal information must be removed from such documents. This applies particularly to the legal domain. The rise of legal technology is highlighted by the increasing number of digitized legal documents, in particular legal contracts and court decisions [9]. Legal documents such as contracts or court decisions are regularly being anonymized, in order to be published or handed out to third parties. Usually, such documents are manually anonymized by skilled employees. This is crucial, as the detection of personal information depends on the context and cannot be considered as trivial. For this reason, many organizations consider anonymization an expensive and inefficient process.

In recent years, this has led to a scarcity of publicly available legal datasets. On the other side, court decisions in particular represent vital base material for legal professions, academic researchers, journalists and private companies [12]. Especially lawyers and law firms require those documents for case research, because cases are usually evaluated by means of related cases. Furthermore, the lack of publicly available datasets inhibits legal innovation utilizing machine learning or big data techniques.

As a result, in this paper, after describing the process of anonymization in the German legal domain, we propose an approach to automatically anonymize German legal court rulings.

2 RELATED WORK
Classical anonymization approaches, such as [11] and [10], make use of rule-based named entity recognition (NER) systems that rely on dictionary look-ups and regular expressions to classify individual words. On the one hand, rule-based methods usually suffer from weak robustness against rare occurrences and outliers such as spelling mistakes as well as words that lie outside of the vocabulary, because manually defined rules usually are too simple to match the high variety of different named entity representations. On the other hand, less data is necessary compared to ML-based NER systems [2].

Modern approaches make use of ML-based named entity recognition for anonymization [2]. Current ML-based NER systems use recurrent neural networks (RNNs) like stacked Long Short-Term Memory (LSTM) RNNs [3] in combination with conditional random fields (CRFs), in order to classify tokens in the IOB (Inside, Outside, Beginning) tagging scheme [6]. This tagging scheme is used since named entities may span over multiple tokens. Thereby, the CRF guarantees compliance with the scheme guidelines. Due to the small amounts of publicly available NER training datasets, state-of-the-art NER architectures make use of pre-trained word embeddings such as GloVe [6, 8].

Applied to the legal domain, the downside of all related methods presented is that all detected entities are considered to be sensitive. As a result, insensitive information is unnecessarily neutralized as long as it contains any entity. In legal documents, entities such as dates and locations must also be considered, because they may reveal sensitive information if combined with additional information from the document. Because many dates and locations are insensitive, but yet essential to understand the meaning of the text, the above methods would discard vital information.

3 ANONYMIZATION
Anonymization is defined as the task of identifying and neutralizing sensitive references within a given document or a set of documents [7]. Sensitivity is a binary measure determining whether or not a particular reference, if publicly disclosed, might potentially


cause harm or engender undesirable personal or legal repercussions [7].

In order to anonymize documents, sensitive information has to be identified first, before the detected entities can be pseudonymized or neutralized. According to the definition of sensitivity, this information must contain at least one direct reference to an object or a juristic person outside the context of the document. In order to preserve the meaning of the text, it is necessary to replace references referring to the same real object or person in such a way that the connection within the text stays intact.

The most challenging aspect of anonymization is the identification step. In this work, the anonymization problem is reduced to a text sequence classification task referred to as contextual sensitivity prediction. It is not sufficient to only detect references, because in legal documents such as court decisions, many references are insensitive. Instead, the sensitivity of references in legal documents is assumed to mainly depend on the textual context. For instance, court names reference real-world objects, but they are insensitive and contain useful information for reviewers. On the other hand, references to expert witnesses must not be exposed as they are highly sensitive.

In order to underline the importance of supporting the anonymization of German legal court rulings, we conducted expert interviews. In knowledge-intensive and thus highly individual domains such as the legal domain, it is indispensable to understand the current, non-digital, process first. Therefore, we also used the interviews to capture the current state of the anonymization.

We contacted 18 courts with a request for a respective expert interview. Courts from various jurisdictions were deliberately included. Eight courts agreed to support our research and provided appropriate contacts. These were one financial court, one social court, one administrative court, and five courts from the ordinary jurisdiction. The latter ones were distributed across all instances (district court to higher regional court). Furthermore, we also interviewed the Bavarian Ministry of Justice.

One of the major conclusions which emerged during the interviews was the fact that only two courts actually utilize standardized guidelines for the anonymization process. It is important to note that the other courts do not, of course, anonymize indiscriminately, but rely much more on subject matter expertise. Therefore, it was important in these cases to understand the exact processes involved in anonymization, particularly the crucial entity types.

Another interesting aspect, which builds upon the fact of non-existing guidelines, is that these entities vary not only between the different jurisdictions, but also within the ordinary courts. While it seems obvious that decisions of different court types require other entity types to be anonymized, it may not be obvious at first glance why this applies even within ordinary courts. However, such courts consist of multiple senates handling different topics. In general, it is important that a judgment is still understandable after anonymization and, above all, that the judicial decision is comprehensible. We therefore hypothesized that an ML model must be specifically trained for the jurisdiction at hand and cannot be utilized as a general purpose model.

The replacement of sensitive information is a crucial part of the anonymization method. Therefore, it was important to understand the methods used in practice. However, it turned out that, while the applied methods differ, the actual usage is flexible. While the courts use different software to manage trials and store the actual documents, including court rulings, all courts utilize Microsoft Word for the actual anonymization. In fact, the "tool" at hand is the search-and-replace functionality. This makes us believe that an approach as proposed in this paper will be highly beneficial.

4 METHODOLOGY
4.1 Anonymization of German Legal Court Rulings
We introduce an anonymization method for court rulings solely trained on anonymized data. This approach is based on the idea that the sensitivity of an entity does not depend on the specific entity itself, but only on its type and context. In this paper, we developed a machine learning approach which classifies the sensitivity of entities in legal documents based on their context. These entities have been pre-selected by a general purpose NER.

The dataset used for model training consists of 1,400 German anonymized court rulings of the state court in Munich (Landgericht Munich) that have been published in recent years.¹ In each ruling, each sensitive reference has previously been removed and replaced with a placeholder for publication at the courts. So, the anonymized entities, i.e. placeholders, could be detected using a rule-based algorithm, as discussed below.

Table 1 summarizes the most important information about the pre-processed training corpus without the test set of 180 documents. In sum, those documents contain about 35,000 anonymization placeholders that are positively labeled for model training. However, due to the large variety of different reference types and different token lengths, the amount of data is considerably small. This especially impacts rare references like authorizations and bank accounts. Anonymization mistakes in documents aggravate this issue even further.

Document count            1,220
Word count                4,181,266
Placeholder count         33,779
Average tokens per word   1.8
Table 1: Information summary of the pre-processed training corpus

The placeholder detection is a rule-based classification algorithm which takes paragraphs of anonymized legal documents and labels the anonymization placeholders within the paragraph. The algorithm scans words using a sliding window of 3 consecutive text elements. Each triplet of text elements consists of a predecessor, an anonymization candidate and a successor. Two different types of placeholders are distinguished using regular expressions: obvious and potential placeholders. Obvious placeholders, e.g. "Xxxx", meet strong criteria and represent placeholders that are specially marked by the author, e.g. '"E."'. Potential placeholders may be interpreted as placeholders if viewed outside the context, but may alternatively possess one of the following meanings: (1) omission within cites (e.g. testimonies), (2) abbreviation (e.g. "i.d.R."), (3) reference to pages or appendices (e.g. "siehe Anhang A 34"), and (4) reference to laws or regulations (e.g. "§ 8 Abs. 3 Ziffer 2 UWG").

¹ Source: https://www.gesetze-bayern.de
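A minimal sketch of such a windowed, regex-based placeholder detector; the two patterns and the context cues below are illustrative assumptions and considerably simpler than the rules described above.

```python
import re

OBVIOUS = re.compile(r"^[Xx]{3,}$")        # e.g. "Xxxx": clearly an anonymization placeholder
POTENTIAL = re.compile(r"^[A-ZÄÖÜ]\.$")    # e.g. "E.": placeholder, abbreviation or citation
LEGAL_CUES = {"§", "Abs.", "Ziffer", "Anhang", "siehe"}  # crude signal for law/appendix references

def label_placeholders(tokens):
    """Return one boolean per token: True if the token is taken to be a placeholder."""
    labels = [False] * len(tokens)
    for i, candidate in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        if OBVIOUS.match(candidate):
            labels[i] = True
        elif POTENTIAL.match(candidate):
            # Only label the candidate when the (predecessor, successor) window
            # does not look like a legal reference or appendix citation.
            if prev_tok not in LEGAL_CUES and next_tok not in LEGAL_CUES:
                labels[i] = True
    return labels

tokens = ["Der", "Zeuge", "E.", "wohnt", "in", "Xxxx", "."]
print(list(zip(tokens, label_placeholders(tokens))))
```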


Figure 1: Anonymization Model Architecture. The example sentence "Benutzer der klägerischen Internetseite ANONYM können" is masked to "Benutzer [MASK] klägerischen Internetseite [MASK] können", fed through BERT and a Bi-LSTM layer, and trained with a weighted binary cross entropy loss to yield a per-token prediction (A = sensitive, nA = insensitive).

In order to overcome the limitation that anonymized document entities are latent to the model during training, we chose a state-of-the-art pre-trained masked language model called Bidirectional Encoder Representations from Transformers (BERT) [1] as the core model architecture. Tokenization has been done using the SentencePiece tokenizer², and the texts are partitioned into paragraphs of at most 512 tokens. Then, the previously detected anonymized entities together with randomly chosen alternative passages are masked using special tokens. The masked paragraphs are fed into a pre-trained BERT model [1]. This yields high-dimensional embeddings. Subsequently, a bidirectional LSTM layer is used to classify the corresponding embedding vectors, i.e. whether they belong to the anonymized passages or the randomized ones. For that, a fully connected layer is applied to each output vector of the LSTM layer [4], which yields a final single output per input token. Finally, we use a weighted binary cross entropy loss function to transform the one-dimensional output into a binary classification per input token. Figure 1 visualizes this architecture using a simple example. For each document part (in this case each word), the model classifies its sensitivity (A = sensitive, nA = insensitive).

In the end, only classifications for masked tokens are considered, since non-masked tokens are never sensitive. Therefore, in order to apply the model to original data, candidate tokens must be chosen in the documents to be masked before being fed into the model. Otherwise, no tokens in the document would have been masked, so that each token would be classified insensitive. Thereby, the number of candidates is restricted, because masked tokens hide original information so that the model has less data available upon which a classification can be made.

Table 2 shows the different variants of this architecture which were evaluated, along with the number of layers and weights. The numbers denote the parameter count for each layer; 'd', 'c' and 'r' mean 'dense', 'convolutional' and 'biLSTM' layers accordingly.

Variant  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5
RNN1     r256     r256     d128     d64      -
RNN2     r128     r128     r128     d128     d64
RNN3     r512     r512     d256     d128     d64
Table 2: Model architecture variants on top of BERT embeddings

The model was implemented using the TensorFlow framework with Python and the NumPy library. For model training, we used the Adam optimizer [5] with a decaying learning rate. The training dataset was split into a training set of 1,220 documents and a test set of the remaining 180 documents. Then, the model was trained for 4 epochs, i.e. the whole training dataset has been completely iterated 4 times. Beyond the fourth epoch, the model's performance on the test set decreased, hence reaching over-fitting.

For application to original documents, we used the general purpose named entity recognizer from the spaCy framework, which detects the entity types person, organization, miscellaneous and location in documents so that they can be masked for classification. Because the entity types date, website and number were not supported, we extended the general named entity recognizer with a rule-based algorithm that supports the additional named entities. The performance of this algorithm has been manually evaluated using randomly extracted text passages from the dataset, which contain different instances of each entity type. Using the extended named entity recognizer, entities in original documents are marked as candidates and masked before being classified by the model.

The automated anonymization process needs to be accessible to courts through a simple user interface, in order to assess the practicality and quality of the approach. Therefore, we developed a web interface which allows the direct upload of court rulings (in Microsoft Word DocX format) by incorporated staff at the courts to a locally deployed server. On the server side, the extended named entity recognition labels and masks the candidate entities, before the document is applied to the model and the resulting classification is sent to the user. A screenshot of the classification result shown to the user is available on GitHub³. Finally, the document can be downloaded as a Word document with sensitively classified text passages replaced with customisable placeholders.

4.2 Pseudonymization of German Legal Court Rulings
The performance of anonymization of legal court rulings has been limited by the availability of original legal datasets. One way to overcome this limitation is to utilize model training methods which do not rely on original data. Another way is to provide courts with a convenient way to locally pre-process original data at the courts, so that privacy and data protection issues do not play a role anymore.

Therefore, next to our anonymization method purely based on anonymized data, we introduce an automatic pseudonymization method that can be applied locally at courts to pre-process pairs of original and anonymized court rulings. This method produces pairs of pseudonymized/anonymized documents, which contain no sensitive information and can be used to train anonymization models that usually require original and anonymized document pairs. The main idea behind this process is that the difference between manually anonymized court rulings and their original counterpart is exactly the set of sensitive entities that has to be excluded from the original document in order to leave the court. If the type of those entities is already determined at the courts and included with the published documents, the removed entities can later be replaced by randomized instances of the same entity type.

² SentencePiece tokenization: https://github.com/google/sentencepiece
³ https://github.com/sebischair/verlyze-anonymization
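Returning to the masked-token sensitivity classifier of Section 4.1 (Figure 1), the following is a minimal sketch written with TensorFlow/Keras and the Hugging Face transformers library as a stand-in for whichever pre-trained German BERT checkpoint was actually used; the checkpoint name, layer sizes, positive-class weight and fixed learning rate are illustrative assumptions, not the authors' configuration.

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 512      # paragraphs of at most 512 tokens
POS_WEIGHT = 5.0   # assumed up-weighting of the rare "sensitive" class

bert = TFBertModel.from_pretrained("bert-base-german-cased")  # stand-in checkpoint

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

embeddings = bert(input_ids, attention_mask=attention_mask)[0]   # (batch, MAX_LEN, hidden)
hidden = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))(embeddings)
hidden = tf.keras.layers.Dense(128, activation="relu")(hidden)
hidden = tf.keras.layers.Dense(64, activation="relu")(hidden)
logits = tf.squeeze(tf.keras.layers.Dense(1)(hidden), axis=-1)    # one logit per token

def weighted_bce(y_true, y_pred_logits):
    # Weighted binary cross entropy over tokens; labels are a (batch, MAX_LEN)
    # 0/1 array where 1 marks a sensitive (anonymized) token.
    return tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(
        labels=tf.cast(y_true, tf.float32),
        logits=y_pred_logits,
        pos_weight=POS_WEIGHT))

model = tf.keras.Model([input_ids, attention_mask], logits)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), loss=weighted_bce)
model.summary()
```

At prediction time, only the positions that were masked as candidates would be read off the per-token outputs, as described in Section 4.1.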


This results in pseudonymized legal datasets that closely resemble the original data without sacrificing data protection.

As a first step, the original and anonymized documents pair is divided into tokens. Then a NER extracts entities in both documents. If an entity is present only in the original document, it is assumed to be sensitive and is replaced by a placeholder with a type hint. After all sensitive entities have been replaced, the remaining tokens of the original document are compared to the anonymized document tokens in order to find the entities undetected by the named entity recognizer. Hence, the remaining entities are replaced by placeholders with the type annotation unknown. Because those unknown entities can usually be considered to occur rarely for an advanced NER, those exceptional types can still be manually inferred from their context.

In order to infer the type of the entities detected in the difference between anonymized and original documents, a NER is used to find named entities in both documents. Because general purpose NER only supports very limited entity types such as person and location, we extended the NER similar to the extension described in 4.1, but with different and more supported types. Because the types are not always distinct, the type of an entity is chosen according to the priorities as listed in Table 3.

Entity type     Example         Method
date            15.09.2012      rule-based
website         www.max.de      rule-based
email           max@email.de    rule-based
street          Birketweg 12    rule-based
district        80000 München   rule-based
number          DE 0123 4567    rule-based
amount          15.6 m          rule-based
person          Max Müller      SpaCy NER
location        München         SpaCy NER
organization    Futter GmbH     SpaCy NER
miscellaneous   Haribo          SpaCy NER
unknown         undetected      -
Table 3: Extended named entity types sorted by priority (descending)

The pseudonymization process is provided as a web interface⁴. Compressed directories can be uploaded with document pairs assigned to each other by naming convention. Then the document

Model   Evaluation set          Precision   Recall
RNN1    Munich district court   68.9%       79.1%
RNN2    Munich district court   68.2%       72.4%
RNN3    Munich district court   68.4%       74.4%
RNN1    Munich financial court  64.7%       54.6%
RNN2    Munich financial court  62.8%       52.3%
RNN3    Munich financial court  63.1%       53.9%
Table 4: Evaluation results of the anonymization procedure

were returned to the court for inspection before being utilized in our evaluation. Because the anonymization decision is assumed to be consistent for different entities of the same type, the resulting dataset of 123 unique anonymized entities (358 in total) is considered to closely resemble original court ruling document bodies for evaluation purposes. Nevertheless, because the documents originated from the published format, original headers and footers of documents were missing and the format has been changed slightly by the publication process.

The second evaluation dataset (Munich financial court) consists of 33 original court rulings of the financial court in Munich. Each original document includes the original header and footer and has been provided with its anonymized counterpart, together containing 122 unique anonymized entities (329 in total).

5.2 Anonymization Based on Contextual Information
The anonymization approach has been evaluated on both evaluation datasets. The precision and recall metrics presented in Table 4 refer to the classification of entities as (in-)sensitive. Thereby, the recall metric is considered more important than the precision metric, since, in practice, the anonymization of an insensitive entity yields lesser harm than the revelation of a sensitive entity.

It is important to underline the fact that the type II error is only partially introduced by the model, but also by the named entity recognizer pre-selecting the entities to be classified. The NER misses 19.8% of sensitive entities in the Munich district court dataset, while missing 28.5% in the financial court dataset. This strictly limits the model performance, since detection during pre-processing is a precondition to be masked and possibly classified.
pairs are processed and the final list of pseudonymized documents While we have trained and evaluated many different approaches,
is shown. Court employees can either randomly review the docu- we report here only on the best performing models. Particularly,
ments or show each of the resulting documents to ensure that no the reported models with a NER-based candidate selection out-
sensitive information is still present. Finally, the pseudonymized performed other approaches. Without NER-based candidates, the
documents paired with the corresponding anonymized documents models have to overcome the "validation-test gap". In the train-
can be downloaded. ing dataset, the anonymization placeholders are masked during
training, because the real references behind the placeholder are
5 EVALUATION unknown. In this paper, this issue is referred to as the "validation-
5.1 Evaluation Data test gap", because the results on the validation dataset (masked
placeholders) during training have partly been much better than
Based on court rulings of the district court of Bavaria Landgericht
on the test set (manually replaced placeholders) during evaluation.
München published on the Bavarian legal platform ’Recht.Bayern’,
This is mitigated by our NER-based methods.
we created an evaluation set of 13 document pairs by manual re-
As shown in Table 4, the variant RNN1 outperformed the larger
placement of anonymized document parts with randomized named
models by a small margin. However, all models yield much more
entities of matching type (Munich District Court). These documents
promising results on the court rulings from the district court than
4 https://github.com/sebischair/verlyze-pseudonymization on the decisions of the financial court. This can be explained by

208
Anonymization of German Legal Court Rulings ICAIL’21, June 21–25, 2021, São Paulo, Brazil

the high variety of different anonymization standards practiced by tool in order to increase the quantity of German legal data sets
different types of courts. Entities such as financials or process dates being publicly available by means of expert interviews.
are often classified insensitive by district court, but frequently sen- For this purpose, multiple different deep learning architectures
sitive by financial court. Due to lack of published financial rulings, were trained using state-of-the-art generally pre-trained contextual
the training corpus has been mainly composed of more general embeddings. Furthermore, a rule-based placeholder detection algo-
rulings, which explains the drop in evaluation performance ( recall). rithm was developed and validated, in order to label anonymization
During interviews at the courts, the anonymization web tool placeholders in anonymized legal documents. Due to the difference
was demonstrated and results from randomly sampled court rulings between training and evaluation data, the "validation-test gap" issue
were presented to court employees, being specialized in the manual has been introduced, which has caused a drop in model performance
anonymization of court documents. The feedback has been consis- on the non-anonymized test set. This issue was resolved using reg-
tently positive, since most employees considered the anonymization ularization methods such as input masking and dropout layers. The
of legal documents as an unpleasant task. Nevertheless, most in- models were evaluated on a manually created test document cor-
terviewees criticized that the error rates were not high enough pus. Thereby, we found out that purely contextual classification
to enable unsupervised automation. A small portion of the inter- cannot distinguish between named entities and entities that refer to
viewed workers also faulted that no manual configuration of the named entities within the document. No model reached both, high
algorithm could be done, after the model had been trained. How- recall and high precision, metrics on the direct sequence classifi-
ever, this seems to be a rather less important concern as denoted cation task. Nevertheless, in combination with a generally trained
by the court director of the same court. NER model, the feature-based BERT tuning approach using stacked
One important module of the anonymization approach presented biLSTM-RNN delivered promising results, but a specialized NER
is the detection of placeholders in the anonymized training data. model supporting more reference types is required.
Poor performance of this rule-based algorithm may lead to low In order to achieve high performances on the task of anonymiza-
model performance. The algorithm has been evaluated using the tion, it is inevitable to utilize original and non-anonymized training
Munich district court dataset as used for the model evaluation. The data. Hence, we also proposed a system to automatically pseudonymize
placeholder detection module performed with a precision of 99.9 legal court ruling datasets that produces pairs of original and
%, a recall of 98.0 % and an accuracy of 99.9 %. These results show anonymized court rulings.
that rule-based algorithms as described in Section 4.1 are capa- To conclude, the anonymization of German legal documents re-
ble of delivering sufficient performance to be used to pre-process mains a complex problem and more data is required in order to build
anonymized legal data for anonymization models. fully autonomous anonymization systems. Nonetheless, contextual
sensitivity classification represents an important foundation for
5.3 Pseudonymization future anonymization systems.
The pseudonymization procedure was evaluated using the Munich
REFERENCES
financial court rulings, since the contained documents have been
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
in the original format including headers and footers. The evalua- Pre-training of Deep Bidirectional Transformers for Language Understanding.
tion revealed that all sensitive entities were filtered as intended https://arxiv.org/abs/1810.04805
[2] Francisco Manuel Carvalho Dias. 2016. Multilingual automated text anonymiza-
(100 %). Thereby, the specialized named entity recognizer correctly tion. Instituto Superior Técnico of Lisboa (2016).
recognized 72.8 % of the types of sensitive entities, which enables [3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
automatic pseudonymization for a majority of entities. computation 9, 8 (1997), 1735–1780.
[4] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-term Memory. https:
Considering that the remaining entity types may be manually re- //www.researchgate.net/publication/13853244_Long_Short-term_Memory
covered using the anonymized documents, the proposed pseudonymiza- [5] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimiza-
tion. https://arxiv.org/pdf/1412.6980.pdf
tion approach provides a considerable basis for the creation of large [6] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami,
datasets, resembling original legal data without the necessity for and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv
privacy considerations. However, the ratio of correctly identified preprint arXiv:1603.01360 (2016).
[7] Ben Medlock. 2006. An Introduction to NLP-based Textual Anonymisation.. In
entity types can still be further improved by more advanced models. LREC. Citeseer, 1051–1056.
The pseudonymization web tool has been demonstrated dur- [8] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove:
ing interviews at the courts with samples of court rulings from Global vectors for word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP). 1532–1543.
the financial court in Munich. The samples had manually been [9] Manavalan Saravanan, Balaraman Ravindran, and Shivani Raman. 2009. Im-
pseudonymized already, because the rulings contained original proving legal information retrieval using an ontological framework. Artificial
Intelligence and Law 17, 2 (2009), 101–124.
data with us. The feedback was generally very positive and courts [10] Latanya Sweeney. 1996. Replacing personally-identifying information in medical
finally agreed to share data pseudonymized using this tool. Most records, the Scrub system.. In Proceedings of the AMIA annual fall symposium.
court employees found the tool to be simple to work with and ap- American Medical Informatics Association, 333.
[11] Amund Tveit, Ole Edsberg, TB Rost, Arild Faxvaag, O Nytro, T Nordgard, Mar-
preciated the feature to upload whole archives instead of favoring tin Thorsen Ranang, and Anders Grimsmo. 2004. Anonymization of general
individual document uploads, which requires more manual work. practioner medical records. In Proceedings of the second HelsIT Conference.
[12] Marc van Opijnen, Ginevra Peruginelli, Eleni Kefali, and Monica Palmirani. 2017.
On-line publication of court decisions in the eu: Report of the policy group of the
6 CONCLUSION & OUTLOOK project ‘building on the european case law identifier’. Available at SSRN 3088495
(2017).
After identifying the problem of anonymization of legal documents,
we verified that automatic legal anonymization is a highly desirable

209
Enhancing a Recidivism Prediction Tool With Machine Learning:
Effectiveness and Algorithmic Fairness
Marzieh Karimi-Haghighi Carlos Castillo
Universitat Pompeu Fabra Universitat Pompeu Fabra
marzieh.karimihaghighi@upf.edu carlos.castillo@upf.edu

ABSTRACT auditing, and criminal justice. Since the 1920s, violence risk as-
This paper addresses a key application of Machine Learning (ML) sessment tools have been progressively used in criminal justice by
in the legal domain, studying how ML may be used to increase the probation and parole officers, police, and psychologists to assess
effectiveness of a criminal recidivism risk assessment tool named the risk of harm, sexual, criminal, and violent offending in more
RisCanvi, without introducing undue biases. The two key dimen- than 44 countries [22, 32]. In comparison to traditional prediction
sions of this analysis are predictive accuracy and algorithmic fair- methods and unstructured clinical judgments, risk assessment tools
ness. ML-based prediction models obtained in this study are more ac- offer superior accuracy and performance [18]. In this regard, factors
curate at predicting criminal recidivism than the manually-created such as the availability of large databases, inexpensive computing
formula used in RisCanvi, achieving an AUC of 0.76 and 0.73 in pre- power, and developments in statistics and computer science have
dicting violent and general recidivism respectively. However, the brought an increase in the accuracy and applicability of these struc-
improvements are small, and it is noticed that algorithmic discrimi- tured tools [3]. Such advances have effectively increased the use
nation can easily be introduced between groups such as national vs of tools based on Machine Learning (ML) in criminal justice deci-
foreigner, or young vs old. It is described how effectiveness and al- sions for risk forecasting [4, 7, 8]. Today, various semi-structured
gorithmic fairness objectives can be balanced, applying a method in protocols for assessing risk of recidivism can be found in different
which a single error disparity in terms of generalized false positive countries including the U.S. [16], the U.K. [21], Canada [24], Aus-
rate is minimized, while calibration is maintained across groups. tria [30], and Germany [13]. In Spain, among current violence risk
Obtained results show that this bias mitigation procedure can sub- assessment tools including SAVRY, PCL-R, HCR-20, SVR-20, and
stantially reduce generalized false positive rate disparities across SARA, RisCanvi is a relatively new tool for risk assessment of recidi-
multiple groups. Based on these results, it is proposed that ML- vism. It was originally developed in 2009 in response to concerns of
based criminal recidivism risk prediction should not be introduced Catalan prison system officials regarding violent recidivism among
without applying algorithmic bias mitigation procedures. offenders after their sentences.
Research contribution. In this study, the effectiveness and algo-
CCS CONCEPTS rithmic fairness of RisCanvi risk assessment tool are evaluated in
comparison to ML models such as logistic regression, perceptron,
• Computing methodologies → Supervised learning by clas-
and support-vector machines, in violent and general recidivism
sification.
prediction. The effectiveness of the ML models are evaluated and
compared to RisCanvi in terms of various metrics including AUC,
KEYWORDS Generalized False Positive (GFPR), and Generalized False Negative
criminal recidivism, risk assessment, algorithmic fairness (GFNR). Also, potential algorithmic bias introduced by the ML meth-
ACM Reference Format: ods is evaluated in both violent and general recidivism prediction.
Marzieh Karimi-Haghighi and Carlos Castillo. 2021. Enhancing a Recidivism Given that model learning may lead to unfairness [11, 12, 34], the
Prediction Tool With Machine Learning: Effectiveness and Algorithmic impact of the obtained ML models is compared along nationality
Fairness. In Eighteenth International Conference for Artificial Intelligence and (national origin vs foreign origin) and age (young vs old). Then
Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, some differences are addressed through a mitigation procedure [29],
USA, 5 pages. https://doi.org/10.1145/3462757.3466150 which try to equalize GFPR across nationality and age groups while
preserving the calibration in each group.
1 INTRODUCTION The rest of this paper is organized as follows. Section 2 outlines
Risk assessment is a necessary process in many important decisions related work. In Section 3, the RisCanvi risk assessment tool and the
such as public health, information security, project management, dataset used in this study are described. The methodology including
the ML models and algorithmic fairness analysis are presented in
Permission to make digital or hard copies of all or part of this work for personal or Section 4. Results are given in Section 5, and a procedure to mitigate
classroom use is granted without fee provided that copies are not made or distributed algorithmic discrimination is used in Section 6. Finally, the results
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
are discussed and the paper is concluded in Section 7.
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org. 2 RELATED WORK
ICAIL’21, June 21–25, 2021, São Paulo, Brazil The introduction of algorithms for risk assessment in criminal
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 justice is a controversial topic, and perhaps one that has motivated
https://doi.org/10.1145/3462757.3466150 a great deal of research on algorithmic fairness.

210
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Marzieh Karimi-Haghighi and Carlos Castillo

In seminal research done by investigative journalism organiza- facilities, committing further violent offenses, and breaking prison
tion ProPublica [2, 25] it was concluded that a widely-used program permits. A fifth risk score was introduced more recently for general
named Correctional Offender Management Profiles for Alternative recidivism [31].
Sanctions (COMPAS) is biased against African American defendants. Two versions of the RisCanvi protocol were created, an abbre-
A follow-up study [19] found that COMPAS outcomes systemat- viated one of 10 items for screening (RisCanvi-S), and a complete
ically over-predict risk for women, thereby indicating systemic one of 43 items (RisCanvi-C). Risk items can be categorized into
gender bias. However, the findings of the ProPublica study were five different categories: Criminal/Penitentiary, Biographical, Fam-
rejected by Northpointe (COMPAS developer), claiming their al- ily/Social, Clinical, Attitudes/ Personality. These items can also
gorithm is fair because it is well calibrated [17]. Moreover, in this be divided into static factors (such as “criminal history of family”
report it is shown that the COMPAS risk scales exhibit accuracy and “age of starting violent activity”) and dynamic factors (such
equity and predictive parity. as “member of socially vulnerable groups” and “pro-criminal or
In contrast to the case of COMPAS, other studies have shown antisocial attitudes”).
that other risk assessment tools such as the Post Conviction Risk
Assessment (PCRA), the Structured Assessment of Violence Risk in 3.2 Dataset
Youth (SAVRY) and the Youth Level of Service/Case Management
The anonymized dataset used on this research comprises 7,239 of-
Inventory (YLS/CMI) do not exhibit racial bias in the recidivism
fenders who first entered the prison between 1989 and 2012 and who
prediction [28, 33]. In a more recent study focused on SAVRY [26,
were evaluated with the RisCanvi protocol between 2010 and 2013.
34], it is shown that although machine learning models could be
Only offenders for which nationality information was recorded
more accurate than the simple summation used to compute SAVRY
were kept that comprises 2,634 offenders. The result population
scores, they would introduce discrimination against some groups
was filtered in terms of their violent/general recidivism, freedom
of defendants.
and last RisCanvi evaluation dates considering the following con-
There are many different definitions of algorithmic fairness [27],
ditions: inmates who were released at most 9 months after their
some of which are incompatible with one another. It is impossible to
last RisCanvi evaluation, and for which violent/general recidivism
satisfy all of them simultaneously except in pathological cases (such
(or its absence) was recorded at most two years after their release.
as a perfect classifier), and in general it is impossible to maximize
Finally, samples with the size of 2,027 (out of 2,634) were reached.
algorithmic fairness and accuracy at the same time [5, 6]. Hence,
Among this population, 146 committed a violent offence (violent
there are necessary trade-offs between different metrics [6, 10, 23].
recidivism) and 310 committed a violent or non-violent offence
In this regard, some studies [20, 36, 37] try to mitigate potential
(general recidivism) after being released. The data includes all of
algorithmic discrimination by satisfying equalized odds or in other
the information for the two RisCanvi versions (RisCanvi-S and
words avoiding disparate mistreatment along different sensitive
RisCanvi-C). This study is focused on the RisCanvi-C protocol
groups. In addition, due to the importance of the calibration in
which is the complete version done after RisCanvi-S and it consists
risk assessment tools [6, 17], some previous work has also tried to
of more risk factors which results in three risk levels (low, medium,
minimize error disparity across groups while maintaining calibrated
and high).
probability estimates [29].
The most closely related previous work is Pleiss et al. [29], where
algorithmic bias in a machine learned risk assessment (COMPAS) 3.3 Violent and General Recidivism
is minimized by equalizing generalized false positive rates along This work is focused on RisCanvi protocol to assess Violent Re-
different races, finding this equalization to be incompatible with cidivism (“REVI” in the RisCanvi manual) and General Recidivism
calibration. In contrast, in the work presented on this paper, we (“REGE” in the RisCanvi manual) risks in sentenced inmates. REVI
start from an expert-based risk assessment method, which is not and REGE risks are outcomes predicted using two different sub-sets
machine learned, and propose a new machine learning model to of risk factors. REVI risk is obtained using 23 items out of the 43 risk
replace it, describing the effects of algorithmic bias mitigation on factors of the RisCanvi-C version plus two demographic features
both the original and the machine learned model. Additionally, we (gender and nationality) and to compute REGE risk, 14 items (out of
find that in RisCanvi equalization along nationality and age groups 43 risk factors of the RisCanvi-C version) are used. In RisCanvi-C,
is not entirely incompatible with calibration. each of the REVI and REGE scores has been computed by applying
the summation of their related features in a hand-crafted formula,
3 RISCANVI DATASET then using two cut-offs, obtaining three risk levels (details in [1]).
The distribution of REVI and REGE risk scores in the last RisCanvi
3.1 The RisCanvi Risk Assessment Tool evaluation is compared by nationality and age groups. Grouping
RisCanvi was introduced as a multi-level risk assessment protocol by gender is not considered as the number of women in the sample
for violence prevention in the Catalan prison system in 2009 [1]. is too small to draw robust conclusions. The comparison shows
This protocol is applied multiple times during an inmate’s period in that recidivism risk scores have approximately similar distributions
prison; the official recommendation is to do so every six months or along nationality and age group except for the REVI score in nation-
at the discretion of the case manager. RisCanvi is not a questionnaire. ality group which shows that foreigners tend to have lower REVI
Instead, each inmate is interviewed by professionals. In the original risk scores compared to Spaniards (Figures are omitted for brevity).
RisCanvi protocol, risk is determined for each inmate relative to For age groups, 30 years old is used as a cut-off, as criminology
four possible outcomes: self-directed violence, violence in the prison research suggests that the types of offense and context are different

211
Enhancing a Recidivism Prediction Tool With Machine Learning: Effectiveness and Algorithmic Fairness ICAIL’21, June 21–25, 2021, São Paulo, Brazil

for people under 30 and over 30 (see, e.g., [35]). This age is also used Table 1: Effectiveness of models in violent and general re-
as a cut-off for young and old people in the design of the RisCanvi cidivism prediction
protocol. In the present dataset, the majority of the population are
Spanish nationals (70%) and older than 30 years old (74%). Risk Violent Recidivism General Recidivism
According to the average violent and general recidivism rates for
Model AUC GFNR GFPR AUC GFNR GFPR
nationality and age groups, it can be seen that in general, foreigners
and older offenders have a lower recidivism rate. LR 0.76 0.82 0.06 0.73 0.75 0.14
RisCanvi_score 0.72 0.87 0.07 0.70 0.79 0.14
4 METHODOLOGY
The goal of this study is to compare the effectiveness and fairness
of Machine Learning (ML) models and the RisCanvi risk assessment
follows [29]: the GFPR of classifier ℎ𝑡 for group 𝐺𝑡 is 𝑐 𝑓 𝑝 (ℎ𝑡 ) =
tool in the prediction of violent and general recidivism.
E (𝑥,𝑦)∼𝐺𝑡 [ℎ𝑡 (𝑥) | 𝑦 = 0]. GFPR is the average probability of being
recidivist that the classifier estimates for people who actually do
4.1 ML-based Models not recidivate. Conversely, the GFNR of classifier ℎ𝑡 is 𝑐 𝑓 𝑛 (ℎ𝑡 ) =
Different ML methods, such as logistic regression, multi-layer per- E (𝑥,𝑦)∼𝐺𝑡 [(1−ℎ𝑡 (𝑥)) | 𝑦 = 1]. So the two classifiers ℎ 1 and ℎ 2 show
ceptron (MLP), and support vector machines (SVM) are used. The probabilistic equalized odds across groups 𝐺 1 and 𝐺 2 if 𝑐 𝑓 𝑝 (ℎ 1 ) =
ground truth is the violent/general recidivism, which is recorded at 𝑐 𝑓 𝑝 (ℎ 2 ) and 𝑐 𝑓 𝑛 (ℎ 1 ) = 𝑐 𝑓 𝑛 (ℎ 2 ).
most two years after the inmate’s release. Classifier ℎ𝑡 is said to be well-calibrated if ∀𝑝 ∈ [0, 1], P (𝑥,𝑦)∼𝐺𝑡
Different sub-sets of features are tested as input to the ML mod- [𝑦 = 1 | ℎ𝑡 (𝑥) = 𝑝] = 𝑝. To prevent the probability scores from
els, such as 43 RisCanvi-C items, Violent Recidivism (REVI)/General carrying group-specific information, both classifiers ℎ 1 and ℎ 2 are
Recidivism (REGE) risk items, and a set of features selected from 43 calibrated with respect to groups 𝐺 1 and 𝐺 2 [6, 17].
risk items using a feature selection method. In addition, three demo-
graphic features (gender, nationality, and age) are used as general 5 RESULTS
input features. Finally, the average of REVI/REGE risk scores over
all of the RisCanvi evaluations from the first to the last evaluation 5.1 Effectiveness Evaluation
is added. Among logistic regression (LR), multi-layer perceptron (MLP) and
The split of the two sets is done k times using stratified k-fold support vector machines, the best results were obtained using LR
cross-validation, reporting average results. for both violent and general recidivism predictions. Hence, the non-
LR based models are omitted for brevity. The final set of features
4.2 Algorithmic Fairness used for the model consists of a sub-set of the 43 risk items of
Algorithmic fairness is evaluated by comparing the impact of the the RisCanvi evaluation selected using a feature selection method
risk prediction method across nationality and age groups. (based on a linear model with L1-based penalization to yield sparse
As it is known, model calibration is a necessary condition, espe- coefficients), the average Violent Recidivism (REVI)/General Recidi-
cially in criminal justice risk assessments [6, 17]. If the risk tool is vism (REGE) score (from the first to the last RisCanvi evaluation),
not calibrated with respect to different groups, then the same risk es- gender, nationality, and age at the time of the last evaluation.
timate carries different meanings and cannot be interpreted equally Results in terms of AUC-ROC, GFNR, and GFPR are presented
for different groups. Furthermore, creating parity in the error rates and compared with the existing RisCanvi protocol in Table 1 for
of different groups (“equalized odds”) is a well-established method both violent and general recidivism prediction. These results are
to mitigate algorithmic discrimination in automatic classification. compared against RisCanvi_score, which is a number resulting
Previous work has also emphasized the importance of this algo- from the application of the RisCanvi formula.
rithmic fairness metric for this particular application [20, 36, 37]. In both violent and general recidivism prediction, LR yields better
Hence, to mitigate potential algorithmic discrimination, a relaxation results than RisCanvi in terms of all metrics. However, the results
method [29] is used in this paper which seeks to satisfy equalized are close to RisCanvi. In general, the LR model is more accurate
odds or parity in the error rates (generalized false positive rate and than RisCanvi, although by a small amount, which is surprising
generalized false negative rate) while preserving calibration in each considering that RisCanvi was not computationally optimized for
sub-group of nationality and age. In most cases, calibration and predictive accuracy.
equalized odds are mutually incompatible goals [10, 23], so in this
method it is sought to minimize only a single error disparity across 5.2 Algorithmic Fairness Evaluation
groups while maintaining calibration probability estimates. The results for the analysis of algorithmic fairness in all metrics
Generalized False Positive Rate (GFPR) and Generalized False along nationality (national and foreigner), and age groups (young
Negative Rate (GFNR) are the standard notions of false-positive and and old inmates) are shown in Table 2 for violent and general
false-negative rates that are generalized for use with probabilistic recidivism prediction. In the LR_calibrated model, the predictions
classifiers [29]. If variable 𝑥 represent an inmate’s features vector, have been calibrated with respect to each of the two sub-groups in
𝑦 indicates whether or not the inmate recidivists, 𝐺 1 , 𝐺 2 are the nationality and age.
two different groups, and ℎ 1 , ℎ 2 are binary classifiers which classify For violent recidivism, all models show a bias against nationals
samples from 𝐺 1 , 𝐺 2 respectively, GFPR and GFNR are defined as in terms of GFPR. The difference is less noticeable in RisCanvi. In

212
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Marzieh Karimi-Haghighi and Carlos Castillo

Table 2: Effectiveness of models in violent and general recidivism prediction per group

Risk Violent Recidivism General Recidivism


Model LR LR_Calibrated RisCanvi LR LR_Calibrated RisCanvi
Group/Metrics AUC GFNR GFPR AUC GFNR GFPR AUC GFNR GFPR AUC GFNR GFPR AUC GFNR GFPR AUC GFNR GFPR
National 0.81 0.77 0.07 0.81 0.81 0.06 0.76 0.85 0.08 0.78 0.70 0.15 0.77 0.73 0.13 0.72 0.78 0.14
Foreigner 0.85 0.87 0.05 0.85 0.85 0.04 0.72 0.91 0.05 0.68 0.80 0.11 0.72 0.77 0.11 0.59 0.83 0.13
National
Foreigner (Ratio) (0.95) (0.88) (1.64) (0.95) (0.95) (1.50) (1.05) (0.93) (1.44) (1.14) (0.87) (1.30) (1.07) (0.95) (1.20) (1.22) (0.94) (1.08)
Young 0.84 0.78 0.08 0.84 0.83 0.06 0.79 0.86 0.07 0.67 0.74 0.17 0.72 0.75 0.15 0.58 0.82 0.14
Old 0.83 0.78 0.06 0.83 0.83 0.06 0.76 0.85 0.07 0.78 0.71 0.12 0.75 0.74 0.11 0.75 0.78 0.14
Young
Old (Ratio) (1.02) (1.00) (1.26) (1.02) (1.01) (1.11) (1.04) (1.00) (1.03) (0.85) (1.04) (1.38) (0.96) (1.01) (1.37) (0.77) (1.06) (1.03)

LR model, we can also observe higher GFPR for young inmates recidivism prediction, the decline in GFPR bias is obtained at the
compared to old offenders. In general, LR_calibrated and RisCanvi expense of further inequalities in other metrics.
models lead to more algorithmically fair results along both nation-
ality and age in terms of all metrics, except for the metrics in which
all the models show discrimination.
7 DISCUSSION AND CONCLUSIONS
The results for general recidivism prediction show higher AUC The effectiveness and fairness of Machine Learning (ML) models
for nationals compared to foreigners in RisCanvi. In terms of GFPR, in violent and general recidivism prediction were compared to the
the LR and LR_calibrated models show discrimination against na- RisCanvi risk assessment tool, an in-use model created by experts.
tional group. In age group, LR and LR_calibrated models show ML models were generated with AUC of 0.76 and 0.73 in violent and
higher GFPR along young compared to old group. In terms of AUC, general recidivism prediction respectively which shows slightly
we can see more discrimination against young inmates in RisCanvi better results compared to the AUC of RisCanvi protocol which is
compared to other models. As a result, LR_calibrated model shows 0.72 and 0.70. It is noteworthy that in this type of task, predictions
better algorithmic fairness properties across nationality and more are not very accurate in general (existing recidivism prediction
balanced values can be observed along age group in RisCanvi. tools typically have AUC in the range of 0.57-0.74 [9, 14, 15]), and
it is found that a hand-crafted formula created by experts is quite
comparable to a machine-learned one. Although the improvement
in accuracy by ML would be insufficient on its own to support its
6 EQUALIZED ODDS AND CALIBRATION introduction as a risk assessment tool, a key element of ML models
In this section, it is tried to achieve parity along nationality and age is their flexibility. An ML model can be re-trained with newer data,
groups in terms of two fairness metrics simultaneously. For this and incorporate new factors as the population of inmates changes
purpose, the method introduced by Pleiss et al. [29] is used that and more data on recidivism becomes available.
seeks parity in Generalized False Positive Rate (GFPR) or Gener- By studying differential treatment of RisCanvi and ML models
alized False Negative Rate (GFNR) while preserving calibration in across different groups, it can be stated that depending on the
each sub-group of nationality and age. The conclusion from the desired metric and groups, machine learning and human expert
previous section based on the results obtained per group in Table 2, can lead to different but comparable results. An advantage of ML
is that in both violent and general recidivism predictions, machine models is that the emphasis on different metrics can be changed
learning models show inequality in terms of GFPR along nationality during the modeling as legal or policy changes are introduced.
and age. RisCanvi also shows an imbalance in GFPR values along In this study, results in Table 2 showed that in both violent and
nationality groups in violent recidivism prediction. general recidivism predictions, there is an inequality in terms of
Hence, it is tried to create parity in this metric while preserving Generalized False Positive Rate (GFPR) metric along nationality
calibration in each group. The results after bias mitigation is pre- and age groups. So using a relaxation method [29], it was tried to
sented in Table 3 for violent and general recidivism prediction. The set parity in GFPR while preserving calibration in each sub-group
obtained models are referred to in the following as LR-Equalized, of nationality and age. The results after bias mitigation (in Table 3)
LR_Calibrated-Equalized, and RisCanvi-Equalized. showed that GFPR disparity in violent and general recidivism has
By comparing the results before and after this bias mitigation been respectively decreased at most 0.26 and 0.04 along nationality
(Table 2 and Table 3 respectively) in violent recidivism, it can be and 0.09 and 0.19 along age, however, in exchange for inequalities
seen that the discrimination in GFPR has decreased in the order in some other metrics.
of 0.08-0.26 and 0.06-0.09 along nationality and age groups respec- A robust conclusion from this work is that in a context in which
tively. Also, comparing the results before and after bias mitigation predictive factors neither determine nor yield a clear signal of
in general recidivism shows that there are reductions in GFPR dis- low/medium/high recidivism risk, ML cannot be considered a silver
parity in the orders of 0.03-0.04 and 0.16-0.19 along nationality bullet. At the very least, improvements in accuracy need to be
and age groups respectively. However, in both violent and general carefully contrasted with potential issues of algorithmic fairness

213
Enhancing a Recidivism Prediction Tool With Machine Learning: Effectiveness and Algorithmic Fairness ICAIL’21, June 21–25, 2021, São Paulo, Brazil

Table 3: Equalized GFPR while preserving calibration in violent and general recidivism prediction

Risk Violent Recidivism General Recidivism


Model LR-Equalized LR_Calib-Equalized RisCanvi-Equalized LR-Equalized LR_Calib-Equalized
Group/Metrics AUC GFNR GFPR AUC GFNR GFPR AUC GFNR GFPR AUC GFNR GFPR AUC GFNR GFPR
National 0.81 0.77 0.07 0.81 0.81 0.06 0.76 0.85 0.08 0.78 0.70 0.15 0.67 0.78 0.14
Foreigner 0.64 0.92 0.05 0.61 0.91 0.05 0.62 0.92 0.06 0.61 0.81 0.12 0.53 0.88 0.12
National
Foreigner (Ratio) (1.27) (0.83) (1.38) (1.32) (0.89) (1.42) (1.23) (0.93) (1.28) (1.28) (0.86) (1.27) (1.26) (0.89) (1.16)
Young 0.84 0.78 0.08 0.71 0.86 0.06 - - - 0.67 0.74 0.17 0.72 0.75 0.15
Old 0.62 0.88 0.07 0.60 0.89 0.06 - - - 0.63 0.78 0.14 0.53 0.86 0.13
Young
Old (Ratio) (1.36) (0.89) (1.17) (1.19) (0.97) (1.05) - - - (1.06) (0.95) (1.22) (1.35) (0.88) (1.18)

when introducing ML, and calibration and some bias mitigation [17] W. Dieterich, C. Mendoza, and T. Brennan. 2016. COMPAS risk scales: Demon-
method (such as equalized odds in this study) needs to be used. strating accuracy equity and predictive parity. Northpoint Inc (2016).
[18] William M Grove, David H Zald, Boyd S Lebow, Beth E Snitz, and Chad Nelson.
2000. Clinical versus mechanical prediction: a meta-analysis. Psychological
ACKNOWLEDGMENTS assessment 12, 1 (2000), 19.
[19] Melissa Hamilton. 2019. The sexist algorithm. Behavioral sciences & the law 37, 2
This work has been partially supported by the HUMAINT pro- (2019), 145–157.
gramme (Human Behaviour and Machine Intelligence), Centre for [20] M. Hardt, E. Price, and N. Srebro. 2016. Equality of opportunity in supervised
learning. In Advances in neural information processing systems. 3315–3323.
Advanced Studies, Joint Research Centre, and European Commis- [21] Philip D Howard and Louise Dixon. 2012. The construction and validation of the
sion. The project leading to these results has received funding OASys Violence Predictor: Advancing violence risk assessment in the English and
from “la Caixa” Foundation (ID 100010434), under the agreement Welsh correctional services. Criminal Justice and Behavior 39, 3 (2012), 287–307.
[22] Danielle Leah Kehl and Samuel Ari Kessler. 2017. Algorithms in the criminal
LCF/PR/PR16/51110009. justice system: Assessing the use of risk assessments in sentencing. (2017).
[23] J. Kleinberg, S. Mullainathan, and M. Raghavan. 2016. Inherent trade-offs in the
REFERENCES fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).
[24] Carolin Kröner, Cornelis Stadtland, Matthias Eidt, and Norbert Nedopil. 2007.
[1] Antonio Andrés-Pueyo, Karin Arbach-Lucioni, and Santiago Redondo. 2018. The The validity of the Violence Risk Appraisal Guide (VRAG) in predicting criminal
RisCanvi: a new tool for assessing risk for violence in prison and recidivism. recidivism. Criminal Behaviour and Mental Health 17, 2 (2007), 89–100.
Recidivism Risk Assessment: A Handbook for Practitioners (2018), 255–268. [25] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. 2016. How we
[2] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. analyzed the COMPAS recidivism algorithm. ProPublica (5 2016) 9 (2016).
ProPublica, May 23 (2016), 2016. [26] Marius Miron, Songül Tolan, Emilia Gómez, and Carlos Castillo. 2020. Evaluating
[3] Richard Berk. 2012. Criminal justice forecasts of risk: A machine learning approach. causes of algorithmic bias in juvenile criminal recidivism. Artificial Intelligence
Springer Science & Business Media. and Law (2020), 1–37.
[4] Richard Berk. 2017. An impact assessment of machine learning risk forecasts [27] Arvind Narayanan. 2018. 21 fairness definitions and their politics. presenterad
on parole board decisions and recidivism. J. of Experimental Criminology 13, 2 på konferens om Fairness, Accountability, and Transparency 23 (2018).
(2017), 193–216. [28] Rachael T Perrault, Gina M Vincent, and Laura S Guy. 2017. Are risk assess-
[5] Richard Berk. 2019. Accuracy and fairness for juvenile justice risk assessments. ments racially biased?: Field study of the SAVRY and YLS/CMI in probation.
J. of Empirical Legal Studies 16, 1 (2019), 175–194. Psychological assessment 29, 6 (2017), 664.
[6] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. [29] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger.
2018. Fairness in criminal justice risk assessments: The state of the art. Sociological 2017. On fairness and calibration. In Advances in Neural Information Processing
Methods & Research (2018), 0049124118782533. Systems. 5680–5689.
[7] Richard Berk and Jordan Hyatt. 2015. Machine learning forecasts of risk to inform [30] M. Rettenberger, M. Mönichweger, E. Buchelle, F. Schilling, and R. Eher. 2010. The
sentencing decisions. Federal Sentencing Reporter 27, 4 (2015), 222–228. development of a screening scale for the prediction of violent offender recidivism.
[8] Richard A Berk, Susan B Sorenson, and Geoffrey Barnes. 2016. Forecasting Monatsschrift für Kriminologie und Strafrechtsreform 93, 5 (2010), 346–360.
domestic violence: A machine learning approach to help inform arraignment [31] J.P. Singh, D.G. Kroner, J.S. Wormith, S.L. Desmarais, and Z. Hamilton. 2018.
decisions. J. of Empirical Legal Studies 13, 1 (2016), 94–115. Handbook of recidivism risk/needs assessment tools. John Wiley & Sons.
[9] Tim Brennan, William Dieterich, and Beate Ehret. 2009. Evaluating the predictive [32] Jay P Singh, Sarah L Desmarais, Cristina Hurducas, Karin Arbach-Lucioni, Car-
validity of the COMPAS risk and needs assessment system. Criminal Justice and olina Condemarin, Kimberlie Dean, Michael Doyle, Jorge O Folino, Verónica
Behavior 36, 1 (2009), 21–40. Godoy-Cervera, Martin Grann, et al. 2014. International perspectives on the
[10] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study practical application of violence risk assessment: A global survey of 44 countries.
of bias in recidivism prediction instruments. Big data 5, 2 (2017), 153–163. Int. J. of Forensic Mental Health 13, 3 (2014), 193–206.
[11] Alexandra Chouldechova and Aaron Roth. 2018. The frontiers of fairness in [33] Jennifer L Skeem and Christopher T Lowenkamp. 2016. Risk, race, and recidivism:
machine learning. arXiv preprint arXiv:1810.08810 (2018). Predictive bias and disparate impact. Criminology 54, 4 (2016), 680–712.
[12] S. Corbett-Davies and S. Goel. 2018. The measure and mismeasure of fairness: A [34] Songül Tolan, Marius Miron, Emilia Gómez, and Carlos Castillo. [n.d.]. Why
critical review of fair machine learning. arXiv preprint arXiv:1808.00023 (2018). Machine Learning May Lead to Unfairness: Evidence from Risk Assessment for
[13] K.P. Dahle, J. Biedermann, R.J. Lehmann, and F. Gallasch-Nemitz. 2014. The devel- Juvenile Justice in Catalonia. In Proc. of ICAIL’19.
opment of the Crime Scene Behavior Risk measure for sexual offense recidivism. [35] Jeffrey Todd Ulmer and Darrell J Steffensmeier. 2014. The age and crime rela-
Law and human behavior 38, 6 (2014), 569. tionship: Social variation, social explanations. In The nurture versus biosocial
[14] Matthew DeMichele, Peter Baumgartner, Michael Wenger, Kelle Barrick, Megan debate in criminology: On the origins of criminal behavior and criminality. SAGE
Comfort, and Shilpi Misra. 2018. The public safety assessment: A re-validation Publications Inc., 377–396.
and assessment of predictive utility and differential prediction by race and gender [36] B. Woodworth, S. Gunasekar, M.I. Ohannessian, and N. Srebro. 2017. Learning
in kentucky. Available at SSRN 3168452 (2018). non-discriminatory predictors. arXiv preprint arXiv:1702.06081 (2017).
[15] S.L. Desmarais, K.L. Johnson, and J.P. Singh. 2016. Performance of recidivism [37] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P
risk assessment instruments in US correctional settings. Psychological Services Gummadi. 2017. Fairness beyond disparate treatment & disparate impact: Learn-
13, 3 (2016), 206. ing classification without disparate mistreatment. In Proc. of the 26th Int. Conf.
[16] Sarah Desmarais and Jay Singh. 2013. Risk assessment instruments validated
and implemented in correctional settings in the United States. (2013). on WWW. 1171–1180.

214
Towards compliance checking in reified I/O logic via SHACL
Livio Robaldo
livio.robaldo@swansea.ac.uk
Legal Innovation Lab Wales - Swansea University
Swansea, Wales, UK
ABSTRACT As shown in [28], the formal simplicity and the modular structure
Reified Input/Output logic [29] has been recently proposed to han- of reified I/O logic facilitate the implementation of user-friendly
dle natural language meaning in Input/Output logic [17]. So far, interfaces to encode large knowledge bases of norms in reasonable
the research in reified I/O logic has focused only on KR issues, time. [28] presents the DAPRECO knowledge base (D-KB), a repos-
specifically on how to use the formalism for representing contex- itory of 966 formulae in reified I/O formulae that translates norms
tual meaning of norms (see [28]). This paper is the first attempt to from the GDPR. The D-KB was built in four months via a special
investigate reasoning in reified I/O logic, specifically compliance JavaScript editor implemented to this purpose.
checking. This paper investigates how to model reified I/O logic for- While past research in reified I/O logic has focused on how
mulae in Shapes Constraint Language (SHACL) [2], a recent W3C building formulae associated with norms in natural language, this
recommendation for validating and reasoning with RDFs/OWL. paper represents the first attempt to investigate how these formulae
can be implemented and used for compliance checking, i.e., to infer
KEYWORDS which obligations have been violated in a given state of affairs and
with respect to a given set of norms.
reified I/O logic, SHACL, RDFs/OWL
Compliance checking has never been really studied in I/O logic.
ACM Reference Format: Most past literature in I/O logic has focused on deontic reasoning,
Livio Robaldo. 2021. Towards compliance checking in reified I/O logic via and, recently, normative reasoning [15].
SHACL. In Eighteenth International Conference for Artificial Intelligence and Deontic reasoning is to reason about what is obligatory and
Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, permitted, while dealing with contrary-to-duty reasoning, deontic
USA, 5 pages. https://doi.org/10.1145/3462757.3466065 paradoxes, ethical/moral conflicts, etc. Reasoning about obligations
and permissions is of course orthogonal to what agents really do,
1 INTRODUCTION i.e., whether they did or did not violate their obligations or whether
Reified Input/Output logic [29] is Input/Output logic [17] enriched they did or did not perform what they were permitted to do.
with reification. The introduction of reification in I/O logic enhances Compliance checking does not involve deontic reasoning. Still,
the expressivity of the I/O logic formulae without substantially compliance checking could not be so simple to handle, e.g., because
affecting the I/O logic constructs that implement deontic reasoning. norms might include exceptions that lead to defeasible reasoning.
Reification is a formal mechanism that associates instantiations This paper proposes a formalization of non-deontic inferences
of high-order predicates and operators with FOL terms [13], [27], in reified I/O logic via SHACL [2]. While recent literature offered
[26]. The latter can be then directly inserted as arguments of other solutions for compliance checking implemented in RDFs/OWL, e.g.,
FOL predicates, which may be in turn reified again into new FOL [6], only preliminary works use SHACL to this end, e.g., [21].
terms. In other words, reified I/O logic associates norms with ex-
plicit terms, e.g., constants or variables, and not only with truth-
conditional symbols such as predicates or (second-order) deontic 2 BACKGROUND - REIFIED I/O LOGIC
operators. These terms can be then inserted as parameter of sepa- 2.1 Input/Output logic
rated meta-properties.
I/O logic was originally introduced in [17]. I/O logic is a family of
Reified I/O logic is grounded on a specific reification-based ap-
logics, just like modal logic is a family of systems K, S4, S5, etc.
proach for Natural Language Semantics: the framework in [12].
However, while modal logic uses possible world semantics, I/O
The main insight of [12] is to massively use reification in order to
logic uses norm-based semantics, in the sense of [11]: I/O systems
transform every second-order operator, including boolean connec-
are families of if-then rules (𝑎, 𝑏), such that when 𝑎 is given in input,
tives, into a FOL predicate applied to FOL terms. The final resulting
𝑏 is returned in output. 𝑎 and 𝑏 are formulae in another logic, called
formulae are then flat conjunctions of atomic FOL predicates.
“the object logic”. It has been argued that norm-based reasoning
features some advantages over reasoning based on possible world
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed semantics, first of all a lower computational complexity [30].
for profit or commercial advantage and that copies bear this notice and the full citation I/O logic neatly decouples deontic and non-deontic inferences.
on the first page. Copyrights for components of this work owned by others than ACM I/O logic is indeed a meta-logic wrapped around another logic (e.g.,
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a [12], in case of reified I/O logic) called “the object logic”. The meta-
fee. Request permissions from permissions@acm.org. logic implements deontic inferences while the object logic imple-
ICAIL’21, June 21–25, 2021, São Paulo, Brazil ments the non-deontic ones. In I/O systems for legal reasoning,
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 rules (𝑎, 𝑏) can be obligations, permissions, and constitutive rules.
https://doi.org/10.1145/3462757.3466065 These are clustered within three distinct sets 𝑂, 𝑃, and 𝐶 such that

215
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Robaldo et al.

∀(𝑎, 𝑏)∈𝑂 reads as “given 𝑎, 𝑏 is obligatory”, ∀(𝑎, 𝑏)∈𝑃 reads as “given belongs to the set 𝑂 (note “∈ 𝑂” in (2)): it is an obligation requiring
𝑎, 𝑏 is permitted”, and ∀(𝑎, 𝑏)∈𝐶 reads as “given 𝑎, 𝑏 holds”. each personal data processing to be lawful.
Most past research on I/O logic has focused on theoretical inves- (2) ∀𝑒𝑝 ( ∃𝑡1 ,𝑧,𝑤,𝑦,𝑥 [ (𝑅𝑒𝑥𝑖𝑠𝑡𝐴𝑡𝑇𝑖𝑚𝑒 𝑒𝑝 𝑡 1 ) ∧
tigations in the meta-logic, for modeling deontic reasoning. Since (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎 𝑧 𝑤) ∧ (𝐷𝑎𝑡𝑎𝑆𝑢𝑏 𝑗𝑒𝑐𝑡 𝑤) ∧
the focus was on studying the meta-logic, the object logic was al-
(Controller y z) ∧ (Processor x) ∧ (nominates y x) ∧ (PersonalDataProcessing' ep x z) ],
(isLawful ep) ) ∈ O

ways kept as simple as possible, i.e., it was always propositional logic. Reified I/O logic is perhaps the most relevant proposal so far in the I/O logic literature that employs an alternative (first-order) object logic: the logical framework in [12].

In I/O logic, inferences in the meta-logic are achieved by imposing axioms and constraints on the sets of if-then rules. Different combinations of axioms and constraints trigger different inferences. For instance, [17] defines the basic axioms in (1), where the symbol '⊢' is the entailment relation of the object logic. Variants of these axioms have been further investigated in [23] and [22].

(1) • SI: from (a, x) to (b, x) whenever b ⊢ a.
    • OR: from (a, x) and (b, x) to (a ∨ b, x).
    • WO: from (a, x) to (a, y) whenever x ⊢ y.
    • AND: from (a, x) and (a, y) to (a, x ∧ y).
    • CT: from (a, x) and (a ∧ x, y) to (a, y).

By imposing axioms SI, WO, and AND, we obtain a specific derivation system called deriv1. Adding OR to deriv1 gives deriv2. Adding CT to deriv1 gives deriv3. The five axioms together give deriv4. For instance, in deriv1, from the rules (a, x) and (b, y) one can derive (a ∧ b, x ∧ y): since a ∧ b ⊢ a and a ∧ b ⊢ b, SI yields (a ∧ b, x) and (a ∧ b, y), and AND then combines the two. Each derivation system is sound and complete with respect to a different (norm-based) semantics and can therefore trigger different inferences (see [17] for further discussion and details).

Given a derivation system, we may further constrain its sets of if-then rules by considering only subsets that do not yield outputs conflicting with given inputs. This is needed to handle contrary-to-duty reasoning, i.e., to determine which obligations are detached in a situation that already violates some of them [16].

This paper is not concerned with the meta-level of I/O logic. Rather, it will focus on the object logic and non-deontic inferences, including defeasible ones to handle exceptions in legal reasoning.

2.2 Adding reification to I/O logic
Reification is a well-known technique used in linguistics and computer science for representing abstract concepts. These are associated with explicit objects, e.g., FOL terms (see below in this section) or RDF resources (see §3 below), on which we can assert properties. These assertions can be recursively reified again into new terms.

Both [12] and RDFs/OWL recursively reify assertions until the knowledge is represented in terms of a flat list of atomic predicates applied to terms. In RDFs/OWL, these flat lists are made of triples "(subject, predicate, object)", while [12] also allows predicates with higher arity; however, any n-ary predicate can be transformed into an equivalent conjunction of RDF triples.

In [12] and in reified I/O logic, both the antecedent and the consequent of the if-then rule are conjunctions of predicates. Universal and existential quantifiers are added to bind the free variables occurring in the formulae. Universals that outscope the whole if-then rule are used to "carry" individuals from the antecedent to the consequent. Formal details and definitions are available in [29]. A simple example from the D-KB [28] is shown in (2). (2) encodes in reified I/O logic part of Art. 5(1)(a) of the GDPR. The if-then rule

Formulae in reified I/O logic employ two kinds of predicates: primed predicates such as PersonalDataProcessing' and non-primed predicates such as DataSubject. The former are obtained by reifying the latter; the first argument of primed predicates is the reification of the non-primed counterpart, i.e., a FOL term.

We should not reify all predicates, but only those we need. For instance, we do need to reify (PersonalDataProcessing x z) into (PersonalDataProcessing' ep x z), where ep explicitly refers to the action of processing, because we need to assert a property on this action: in the consequent of the obligation, we require it to be lawful, i.e., to satisfy the isLawful predicate. Note that in (2), in order to "carry" the variable ep from the antecedent to the consequent, a universal quantifier outscoping the if-then rule has been inserted. All other variables are existentially quantified within the antecedent.

The other predicate that ep is required to satisfy is RexistAtTime. This is a special predicate used to assert which reifications "really exist" at a certain time. RexistAtTime parallels the well-known predicate HoldsAt used in Event Calculus [14].

Thus, formula (2) reads: "for every personal data processing ep of some personal data z, owned by a data subject w, controlled by a controller y, and processed by a processor x (nominated by y), it is obligatory for ep to be lawful".

2.3 Adding defeasibility to reified I/O logic
It is common in legislation that some rules override others in restricted contexts. These more specific rules are seen as exceptions to the general rules, as penguins may be seen as exceptions to birds with respect to the ability to fly.

In line with the literature, e.g., [10], reified I/O logic models exceptions via special predicates "Ex" that are false by default. This is achieved via negation-as-failure (naf). "naf(Ex)" is true if either "Ex" is false or it is unknown. On the other hand, when "Ex" holds, "naf(Ex)" is false, and the general rule is blocked. An example, taken from [28], is given by the following rules:

(a) If the data subject has given consent to processing, then the processing is lawful.
(b) If the age of the data subject is lower than the minimal age for consent of his member state, (a) is not valid.
(c) In case of (b), if the holder of parental responsibility has given consent to processing, then the processing is lawful.

(a)-(c) are formalized as the following constitutive rules:

(3) ∀ep ( ∃t,z,w,y,x [ (RexistAtTime ep t) ∧ (DataSubject w) ∧ (PersonalDataProcessing' ep x z) ∧ (Controller y z) ∧ (Processor x) ∧ (nominates y x) ∧ (PersonalData z w) ∧ (GiveConsentTo w ep) ∧ naf((exceptionAgeDS ep)) ], (isLawful ep) ) ∈ C

(4) ∀ep ( ∃t,z,w,y,x,s [ (RexistAtTime ep t) ∧ (DataSubject w) ∧ (PersonalDataProcessing' ep x z) ∧ (Controller y z) ∧ (Processor x) ∧ (nominates y x) ∧ (PersonalData z w) ∧ (StateOf s w) ∧ (< ageOf(w) minConsentAgeOf(s)) ], (exceptionAgeDS ep) ) ∈ C

(5) ∀ep ( ∃t,z,w,y,x,s,h [ (RexistAtTime ep t) ∧ (DataSubject w) ∧ (PersonalDataProcessing' ep x z) ∧ (Controller y z) ∧ (Processor x) ∧ (nominates y x) ∧ (PersonalData z w) ∧ (StateOf s w) ∧ (< ageOf(w) minConsentAgeOf(s)) ∧ (hasHolderOfPr h w) ∧ (GiveConsentTo h ep) ], (isLawful ep) ) ∈ C
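To make the closed-world behaviour of naf and of the exception predicate concrete, the following is a minimal Python sketch (not part of the D-KB or of the reified I/O logic formalism; all names and the fact representation are illustrative assumptions) that mimics rules (a)-(c): exceptionAgeDS holds only when it can be derived from the facts, and naf succeeds exactly when it cannot.

```python
# Illustrative closed-world sketch of rules (a)-(c); names and fact layout
# are hypothetical, not taken from the paper or the D-KB.
def exception_age_ds(processing, facts):
    """Mimics (4): the data subject's age is below the minimal consent age of her state."""
    subject = facts["data_subject"][processing]
    state = facts["state_of"][subject]
    return facts["age"][subject] < facts["min_consent_age"][state]

def naf(value):
    """Negation as failure: succeeds when the exception cannot be derived."""
    return not value

def is_lawful(processing, facts):
    subject = facts["data_subject"][processing]
    # Mimics (3): consent of the data subject, unless the age exception applies.
    if subject in facts["gave_consent"][processing] and naf(exception_age_ds(processing, facts)):
        return True
    # Mimics (5): if the age exception applies, consent of the holder of
    # parental responsibility makes the processing lawful.
    if exception_age_ds(processing, facts):
        holder = facts["holder_of_pr"].get(subject)
        return holder is not None and holder in facts["gave_consent"][processing]
    return False

facts = {
    "data_subject": {"ep1": "alice"},
    "state_of": {"alice": "IT"},
    "age": {"alice": 15},
    "min_consent_age": {"IT": 16},
    "holder_of_pr": {"alice": "bob"},
    "gave_consent": {"ep1": {"bob"}},
}
print(is_lawful("ep1", facts))  # True: the exception applies, but parental consent was given
```

The sketch hard-codes the closed-world assumption: an exception that is not derivable counts as false, which is the same reading that SHACL's sh:not exploits later in the paper.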

3 COMPLIANCE CHECKING IN RDFS/OWL
RDFs/OWL is nowadays the W3C standard language for the Semantic Web [1]. RDFs/OWL represents knowledge via flat sets of triples "(subject, predicate, object)", in which the predicate is an rdf:Property while the subject and the object can be any rdfs:Resource, including other rdf:Property(s). In other words, RDFs/OWL allows one to treat rdf:Property(s) as first-order terms on which other (meta-)properties can be separately asserted.

It is then evident that reification is, in essence, the very same mechanism used to represent knowledge in RDFs/OWL; hence the idea of implementing reified I/O logic in the W3C standard.

Some proposals have been made to implement compliance checking in RDFs/OWL, e.g., [9] and [6]. In these approaches compliance checking is achieved by enriching the ontology with classes referring to sets of individuals compliant with the norms and by enforcing "is-a" inferences on these classes.

For instance, the OWL ontology used in [9] includes a class Supplier including individuals that supply consumers with some goods. Since suppliers are obliged to communicate their contractual conditions to their consumers (rule R1), the corresponding class includes a boolean datatype property hasCommunicatedConditions which is true for those suppliers that have complied with their obligation and false otherwise. The ontology then includes a class SupplierR1compliant defined so as to include only individuals in Supplier for which hasCommunicatedConditions is true. Compliance checking is enforced by applying simple "is-a" inferences.

In the same spirit, [6] encodes in a fragment of OWL2 selected norms from Artt. 6, 7, 15, 23, and 30 of the GDPR, which concern data usage policies. Compliance with these policies is again implemented via "is-a" inferences.

While [9] and [6] are of course important contributions towards the same direction of research advocated here, it is not clear how to model exceptions in those frameworks. Furthermore, adding explicit classes specifically devoted to "collect" the individuals compliant with the norms, as well as introducing new ones to properly handle exceptions, does not appear to be an easy and intuitive solution.

The rest of the paper proposes to use SHACL as an alternative to the accounts in [9] and [6].

4 COMPLIANCE CHECKING IN SHACL
This paper proposes and makes initial investigations to encode legal rules in a formal language different from RDFs/OWL. This formal language is SHACL [2], proposed by W3C precisely for validation and inferences on RDFs/OWL graphs. The use of SHACL is currently a matter of ongoing research in the Semantic Web community (see [7], [24], among others).

SHACL appears to be the right formal language for modelling compliance checking, although so far it has been scarcely investigated to this end, preliminary works being [20], [21], and [8].

SHACL was originally proposed to define special conditions on RDFs/OWL graphs, called "SHACL shapes", more expressive than standard OWL cardinality and quantifier restrictions. RDFs/OWL graphs can then be validated against a set of such SHACL shapes. However, SHACL "may be used for a variety of purposes beside validation, including user interface building, code generation and data integration" (cit. [2]). This paper adds a new use case for SHACL in that it proposes to use it for serializing reified I/O logic formulae fit to check compliance.

In order to enhance the expressivity and the flexibility of the standard, a current W3C Working Group Note proposes to enrich SHACL shapes with advanced features such as "SHACL rules" to derive inferred triples from asserted ones, prior to validation (see https://www.w3.org/TR/shacl-af).

As explained in [25], SHACL rules can trigger ontological or non-ontological inferences. Ontological inferences derive facts that can be added to the model. On the other hand, non-ontological inferences have the sole purpose of aggregating data, without necessarily asserting them in the model, in order to facilitate validation.

5 SERIALIZING REIFIED I/O LOGIC IN SHACL
This paper represents the first attempt to investigate how to serialize reified I/O logic formulae modelling obligations as SHACL shapes and reified I/O logic formulae modelling constitutive rules as SHACL rules. (6) shows the SHACL shape that serializes (2) above. Both require every personal data processing to be lawful.

(6) CheckLawfulness
      rdf:type sh:NodeShape;
      sh:targetClass PersonalDataProcessing;
      sh:property [ sh:path is-lawful;
                    sh:hasValue "true"^^xsd:boolean; ];

In (6), "sh:" is the SHACL namespace prefix. (6) is a sh:NodeShape requiring each individual of the sh:targetClass to satisfy the sh:property. The latter constrains the individuals reached from the sh:targetClass through the sh:path to satisfy sh:hasValue.

On the other hand, PersonalDataProcessing, is-lawful, and all other RDFs/OWL resources used in this paper are associated 1:1 with the predicates used in the reified I/O logic formulae such as (2), in the same way as the predicates occurring in the D-KB [28] are associated with RDFs/OWL resources from the PrOnto ontology [19], an OWL ontology proposed to conceptualize the data protection domain. Space constraints prevent us from providing further details about the 1:1 mapping between reified I/O logic predicates and RDFs/OWL resources.

SHACL shapes refer to constraints, a solution that appears to be more intuitive and economical than overpopulating the ontology with extra classes as suggested in [9] and [6].

The validation facts, as well as new individuals, derived through SHACL are not mandatorily inserted in the ontology.

The SHACL rules to model the reified I/O logic formulae in (3), (4), and (5) represent non-ontological inferences, in the sense explained in [25]: these rules are only functional to infer the truth value of is-lawful before the SHACL shape in (6) is validated.

(3), (4), and (5) are serialized in the SHACL rules in (7), (8), (9), and, below, (10).

(7) sh:rule [rdf:type sh:TripleRule; sh:order 0;
      sh:subject sh:this;
      sh:predicate has-min-consent-age;
      sh:object [sh:path
          (has-theme has-personal-data
           is-personal-data-of has-member-state
           has-min-consent-age);]; ];

(8) sh:rule [rdf:type sh:TripleRule; sh:order 1;
      sh:condition [
          sh:property [sh:path has-min-consent-age;
                       sh:minCount 1;];
          sh:property [
              sh:path (has-agent has-age);
              sh:lessThan has-min-consent-age;]; ];
      sh:subject [sh:path has-theme;];
      sh:predicate rdf:type;
      sh:object exceptionAgeDS; ];

(9) sh:rule [rdf:type sh:TripleRule; sh:order 2;
      sh:condition [
          sh:not [sh:property [sh:path has-theme;
                               sh:class exceptionAgeDS;]; ]; ];
      sh:subject [sh:path has-theme;];
      sh:predicate is-lawful;
      sh:object "true"^^xsd:boolean; ];

The sh:targetClass of all these SHACL rules is GiveConsent. Rules are executed according to the sh:order, from the lowest to the highest value. Each rule in (7)-(9) makes a new assertion: the rdf:Property specified in the sh:predicate of the rule is asserted between the two RDFs/OWL resources in the sh:subject and the sh:object. The sh:subject and the sh:object may be the sh:targetClass itself (keyword "sh:this"), a resource reachable from the sh:targetClass through a path specified in sh:path, any other resource in the ontology, or a literal.

(7) is executed first because its sh:order is "0". This rule sets the value of the property has-min-consent-age for each individual in the class GiveConsent. This value is set to the integer value reachable from the sh:path defined on sh:object in (7). Specifically, this is the minimal consent age (has-min-consent-age) of the Member State (has-member-state) of the data subject owning the personal data (has-personal-data is-personal-data-of) involved in the personal data processing occurring as the theme of the GiveConsent instances (has-theme).

It is important to understand that has-min-consent-age will not be asserted on the individuals of GiveConsent also in the reference ontology, but only in the derived one. In other words, (7) is a non-ontological inference rule that collects/aggregates this value in GiveConsent for validation purposes only. After the validation, these values will be discarded.

Rule (8) compares the minimal consent age of the agents' Member State, just asserted by (7) on GiveConsent's instances, with the agents' age. The two rules are then executed in a pipeline, thanks to the SHACL command sh:order. Mirroring these inferences in native RDFs/OWL seems to be more difficult in that the formalism does not allow one to specify a priority between the inference rules.

When the agents' age has been specified (sh:minCount 1) and it is lower than (sh:lessThan) the minimal consent age of the Member State previously asserted by (7), rule (8) asserts the individual of PersonalDataProcessing in the has-theme property of the individual of GiveConsent as a member of the class exceptionAgeDS (see rdf:type in sh:predicate).

Finally, (9) sets to true the property is-lawful of the instances of PersonalDataProcessing that do not (sh:not) belong to the class exceptionAgeDS. (9) implements the reified I/O logic formula shown above in (3), and the SHACL operator sh:not the negation-as-failure (predicate naf) occurring therein. sh:not is in fact true when the ontology does not include any specific assertion of the personal data processing as a member of the class exceptionAgeDS. In other words, since the closed-world assumption holds for both RDFs/OWL and SHACL, sh:not is true when it is either false or unknown whether the personal data processing belongs to this class.

Finally, (10) implements the reified I/O logic formula (5) above:

(10) sh:rule [rdf:type sh:TripleRule; sh:order 2;
      sh:condition [
          sh:property [sh:path
              (has-theme has-personal-data
               is-personal-data-of has-age);
              sh:lessThan has-min-consent-age; ];
          sh:property [sh:path (has-theme
                                has-personal-data
                                is-personal-data-of
                                has-holder-of-pr);
              sh:equals has-agent;]; ];
      sh:subject [sh:path has-theme;];
      sh:predicate is-lawful;
      sh:object "true"^^xsd:boolean; ];

If the age of the data subject (has-age) who owns the personal data of the processing (has-personal-data is-personal-data-of) that is the theme of a GiveConsent individual (has-theme) is lower than (sh:lessThan) the minimal consent age of his/her Member State, and the agent of this GiveConsent individual is the holder of the data subject's parental responsibility (has-holder-of-pr), then the boolean is-lawful is again set to true.
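As an illustration of how such a shapes graph could be executed in practice, the following is a minimal sketch using the pySHACL library; this is our choice for illustration, not an implementation prescribed by the paper. The Turtle file names are hypothetical: data.ttl stands for the RDFs/OWL description of a state of affairs and shapes.ttl for shapes and rules such as (6)-(10); advanced=True asks pySHACL to apply SHACL-AF rules (sh:rule) before validating.

```python
# Minimal compliance-checking sketch with pySHACL; file names and graph
# contents are hypothetical, not artefacts of the paper.
from pyshacl import validate
from rdflib import Graph

data_graph = Graph().parse("data.ttl", format="turtle")      # state of affairs
shapes_graph = Graph().parse("shapes.ttl", format="turtle")  # shapes + SHACL rules

conforms, report_graph, report_text = validate(
    data_graph,
    shacl_graph=shapes_graph,
    advanced=True,    # apply SHACL-AF rules (sh:rule) before validation
    inference="rdfs"  # optional RDFS entailment on the data graph
)

print("compliant" if conforms else "non-compliant")
print(report_text)
```

By default pySHACL applies rule-derived triples to a working copy of the data graph used for validation rather than to the input graph itself, which fits the paper's reading of rules such as (7)-(9) as non-ontological inferences whose results are discarded after validation.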

6 CONCLUSIONS
Reified I/O logic is a recent deontic logical framework explicitly designed to handle natural language semantics, i.e., to represent norms occurring in existing legislation such as the GDPR.

So far, the research in reified I/O logic has focused only on knowledge representation issues, specifically on how to use the formalism for representing the contextual meaning of norms [3].

On the other hand, this paper is the first attempt to investigate computational issues in reified I/O logic, specifically how to represent the reified I/O logic if-then rules in a computable machine-readable format fit to enforce compliance checking.

This paper proposed to model regulative rules as SHACL shapes and constitutive rules as SHACL rules. SHACL shapes and rules are applied to RDFs/OWL models that describe states of affairs.

The solution proposed here is an alternative to some recent approaches that model compliance checking on RDFs/OWL ontologies, e.g., [9] and [6].

On the other hand, the present work only represents the first step of a research endeavour aiming at developing a full inference engine for reified I/O logic that implements and integrates all components involved in normative reasoning. Much further work needs to be done in order to obtain a formally well-defined framework, tested on existing industrial use cases.

Further directions of research include the automatic or semi-automatic generation of RDFs/OWL or SHACL assertions from legal texts, possibly via NLP (cf. [4], [5], [18]).

ACKNOWLEDGMENTS
This research has been supported by the Legal Innovation Lab Wales operation within Swansea University's Hillary Rodham Clinton School of Law. The operation has been part-funded by the European Regional Development Fund through the Welsh Government.

REFERENCES
[1] 2012. Web Ontology Language (OWL). Technical Report. W3C. https://www.w3.org/OWL
[2] 2017. Shapes Constraint Language (SHACL). Technical Report. W3C. https://www.w3.org/TR/shacl
[3] Cesare Bartolini, Andra Giurgiu, Gabriele Lenzini, and Livio Robaldo. 2016. Towards Legal Compliance by Correlating Standards and Laws with a Semi-automated Methodology. In BNCAI (Communications in Computer and Information Science, Vol. 765). Springer, 47–62.
[4] G. Boella, L. di Caro, L. Humphreys, L. Robaldo, and L. van der Torre. 2012. NLP Challenges for Eunomos, a Tool to Build and Manage Legal Knowledge. In Proceedings of the International Conference on Language Resources and Evaluation.
[5] Guido Boella, Luigi Di Caro, Daniele Rispoli, and Livio Robaldo. 2013. A System for Classifying Multi-label Text into EuroVoc. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law (Rome, Italy) (ICAIL '13). ACM, New York, NY, USA, 239–240.
[6] Piero A. Bonatti, Luca Ioffredo, Iliana M. Petrova, Luigi Sauro, and Ida Sri Rejeki Siahaan. 2020. Real-time reasoning in OWL2 for GDPR compliance. Artificial Intelligence 289 (2020).
[7] Julien Corman, Juan L. Reutter, and Ognjen Savkovic. 2018. Semantics and Validation of Recursive SHACL. In The Semantic Web – ISWC 2018 – 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11136), Denny Vrandecic, Kalina Bontcheva, Mari Carmen Suárez-Figueroa, Valentina Presutti, Irene Celino, Marta Sabou, Lucie-Aimée Kaffee, and Elena Simperl (Eds.). Springer, 318–336.
[8] Christophe Debruyne, Harshvardhan J. Pandit, Dave Lewis, and Declan O'Sullivan. 2019. Towards Generating Policy-Compliant Datasets. In 13th IEEE International Conference on Semantic Computing, ICSC 2019, Newport Beach, CA, USA, January 30 – February 1, 2019. IEEE, 199–203.
[9] Enrico Francesconi and Guido Governatori. 2019. Legal Compliance in a Linked Open Data Framework. In Legal Knowledge and Information Systems – JURIX 2019: The Thirty-second Annual Conference, Madrid, Spain, December 11-13, 2019 (Frontiers in Artificial Intelligence and Applications, Vol. 322), Michal Araszkiewicz and Víctor Rodríguez-Doncel (Eds.). IOS Press, 175–180.
[10] G. Governatori, F. Olivieri, A. Rotolo, and S. Scannapieco. 2013. Computing Strong and Weak Permissions in Defeasible Logic. Journal of Philosophical Logic 6, 42 (2013), 799–829.
[11] Jörg Hansen. 2014. Reasoning about permission and obligation. In David Makinson on Classical Methods for Non-Classical Problems, S. O. Hansson (Ed.). Outstanding Contributions to Logic, Vol. 3. Springer, 287–333.
[12] J. R. Hobbs and A. S. Gordon. 2017. A Formal Theory of Commonsense Psychology: How People Think People Think. Cambridge University Press.
[13] J. R. Hobbs. 2008. Deep Lexical Semantics. In Proc. of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2008). Haifa, Israel.
[14] R. Kowalski and M. Sergot. 1986. A Logic-based Calculus of Events. New Generation Computing 4, 1 (1986), 67–95.
[15] Tomer Libal and Alexander Steen. 2020. Towards an Executable Methodology for the Formalization of Legal Texts. In Logic and Argumentation – Third International Conference, CLAR 2020, Hangzhou, China, April 6-9, 2020, Proceedings (Lecture Notes in Computer Science, Vol. 12061), Mehdi Dastani, Huimin Dong, and Leon van der Torre (Eds.). Springer, 151–165.
[16] David Makinson and Leendert van der Torre. 2001. Constraints for input/output logics. Journal of Philosophical Logic 30, 2 (2001), 155–185.
[17] David Makinson and Leendert W. N. van der Torre. 2000. Input/Output Logics. Journal of Philosophical Logic 29, 4 (2000), 383–408.
[18] Rohan Nanda, Luigi Di Caro, Guido Boella, Hristo Konstantinov, Tenyo Tyankov, Daniel Traykov, Hristo Hristov, Francesco Costamagna, Llio Humphreys, Livio Robaldo, and Michele Romano. 2017. A Unifying Similarity Measure for Automated Identification of National Implementations of European Union Directives. In Proceedings of the 16th International Conference on Artificial Intelligence and Law (ICAIL 2017). Association for Computing Machinery.
[19] Monica Palmirani, Michele Martoni, Arianna Rossi, Cesare Bartolini, and Livio Robaldo. 2018. PrOnto: Privacy Ontology for Legal Compliance. In Proceedings of the 18th European Conference on Digital Government (ECEG).
[20] Harshvardhan Jitendra Pandit, Declan O'Sullivan, and Dave Lewis. 2018. Exploring GDPR Compliance Over Provenance Graphs Using SHACL. In Proc. of the Posters and Demos Track of the 14th International Conference on Semantic Systems (SEMANTiCS 2018), Vienna, Austria, September 10-13, 2018 (CEUR Workshop Proceedings, Vol. 2198), Ali Khalili and Maria Koutraki (Eds.).
[21] Harshvardhan J. Pandit, Declan O'Sullivan, and Dave Lewis. 2019. Test-Driven Approach Towards GDPR Compliance. In Semantic Systems. The Power of AI and Knowledge Graphs, Maribel Acosta, Philippe Cudré-Mauroux, Maria Maleshkova, Tassilo Pellegrini, Harald Sack, and York Sure-Vetter (Eds.). Springer International Publishing, 19–33.
[22] Xavier Parent and Leon van der Torre. 2014. Aggregative Deontic Detachment for Normative Reasoning. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference, KR 2014, Vienna, Austria, July 20-24, 2014.
[23] Xavier Parent and Leendert van der Torre. 2014. "Sing and Dance!". In Deontic Logic and Normative Systems, Fabrizio Cariani, Davide Grossi, Joke Meheus, and Xavier Parent (Eds.). Springer International Publishing, 149–165.
[24] Paolo Pareti, George Konstantinidis, Fabio Mogavero, and Timothy J. Norman. 2020. SHACL Satisfiability and Containment. In The Semantic Web – ISWC 2020 – 19th International Semantic Web Conference, Athens, Greece, November 2-6, 2020, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12506), Jeff Z. Pan, Valentina A. M. Tamma, Claudia d'Amato, Krzysztof Janowicz, Bo Fu, Axel Polleres, Oshani Seneviratne, and Lalana Kagal (Eds.). Springer, 474–493.
[25] Paolo Pareti, George Konstantinidis, Timothy J. Norman, and Murat Sensoy. 2019. SHACL Constraints with Inference Rules. In The Semantic Web – ISWC 2019 – 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11778), Chiara Ghidini, Olaf Hartig, Maria Maleshkova, Vojtech Svátek, Isabel F. Cruz, Aidan Hogan, Jie Song, Maxime Lefrançois, and Fabien Gandon (Eds.). Springer, 539–557.
[26] L. Robaldo. 2010. Independent Set readings and Generalized Quantifiers. The Journal of Philosophical Logic 39(1) (2010), 23–58.
[27] L. Robaldo. 2011. Distributivity, Collectivity, and Cumulativity in terms of (In)dependence and Maximality. The Journal of Logic, Language, and Information 20(2) (2011), 233–271.
[28] L. Robaldo, C. Bartolini, M. Palmirani, A. Rossi, M. Martoni, and G. Lenzini. 2020. Formalizing GDPR provisions in reified I/O logic: the DAPRECO knowledge base. The Journal of Logic, Language, and Information 29, 4 (2020).
[29] L. Robaldo and X. Sun. 2017. Reified Input/Output logic: Combining Input/Output logic and Reification to represent norms coming from existing legislation. The Journal of Logic and Computation 7, 8 (2017).
[30] X. Sun and L. Robaldo. 2017. On the complexity of Input/Output logic. The Journal of Applied Logic 25 (2017), 69–88.

Modelling Legal Procedures
Antonino Rotolo Clara Smith
Alma AI, University of Bologna Law Faculty, University of La Plata
Bologna, Italy La Plata, Argentina
antonino.rotolo@unibo.it claritasmith@gmail.com

ABSTRACT Although we may identify several technical differences between


A legal procedure in court proceedings is a sequence of actions a process and a procedure1 , in this work we will generically refer
in which the last action is (the creation of) a(n individual) norm, to a procedure as a part of the process.
where the court settles that it is obligatory in the interest of some One peculiar aspect of civil proceedings is that some types of
agents that other agents bring about a certain state of affairs. This procedure in the process are prioritised. Priorities indeed derive
paper models legal procedures by using a variant of Propositional from individual preferences of the parties in the process, but they
Dynamic Logic (PDL) enriched with a preference operator for pri- can also follow from objective ordering requirements from civil
oritising procedural actions. The key reason towards the usage of procedures. For example, in the Italian procedure [4], we have the
PDL is that, in procedural law, claims and resolutions resemble following procedural headings:
programs to be executed. Requests are organised in a preference • in via preliminare, as a preliminary procedural matter – As a
order and resolutions have their own dynamics of execution (either preliminary procedural matter, the petitioner tries to raise
spontaneously by the one obliged and/or by force of law). new issues in her reply and advances affidavits which were
not included in the petition;
CCS CONCEPTS • in via istruttoria, as an interlocutory matter;
• in via pregiudiziale di merito, plea on the merits;
• Applied computing → Law, social and behavioral sciences;
• in via subordinata di merito, alternatively/in the alternative;
• Theory of computation → Logic.
• in via esecutiva, as an enforcement matter;
• in estremo subordine, as a further alternative.
KEYWORDS
For capturing procedural priorities, the operator ⊗ is introduced
Legal procedures, PDL, preferences
here, which is to be understood as a preference operator [1] that
ACM Reference Format: allows for a compact representation of the execution of alternative
Antonino Rotolo and Clara Smith. 2021. Modelling Legal Procedures. In Eigh- (or supplementary or backup) actions in social domains. The in-
teenth International Conference for Artificial Intelligence and Law (ICAIL’21), tended reading of an expression of the form a ⊗ b ⊗ c is that a is
June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. preferred, but if a is not carried out then b is to be the case, and if b
https://doi.org/10.1145/3462757.3466089 is not carried out then c is to be applied.
Technically, in this paper we combine this preference logic with
a propositional dynamic logic (PDL) to put procedures, preferences,
1 INTRODUCTION courses of actions, and obligations to work all together.
A legal procedure in civil court proceedings is the formal way in The layout of the paper is a follows. Section 2 offers some dis-
which civil proceedings are conducted [6]. The practice and proce- cussion on the components of legal procedure. Section 3 illustrates
dures of a court are regulated by rules usually stated in a procedure syntax and semantics of the formal framework. Section 4 discusses
rule corpus (or made by a committee) possibly supplemented, e.g., some axioms and non-logical principles for modelling procedures.
by practice directions and pre-action protocols. A legal procedure is Section 5 sketches some patterns for reasoning and computational
usually defined as a chain of consecutive actions which has as (final) methods for handling procedures. A summary concludes the paper.
goal the decision/solution of a conflict. Thus, a legal procedure is a
(finite) sequence of actions in which the last action is (the creation 2 THE LEGAL PROCEDURE
of) a(n individual) norm, the judgment, where the court settles that A legal procedure is guided by procedural principles which give
it is obligatory in the interest of some agents that other agents bring to procedures their particular structure and logic. This implies
about a certain state of affairs. that procedures have a particular way of moving forward from its
beginning towards the decision.
Permission to make digital or hard copies of all or part of this work for personal or A procedure starts with the very first claim of the plaintiff, who
classroom use is granted without fee provided that copies are not made or distributed usually sues some agent(s). In a regular procedure, the plaintiff
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the exposes a state of affairs, and presents her/his request to the court.
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or Intuitively, we can imagine the ideal structure of a procedure as a
republish, to post on servers or to redistribute to lists, requires prior specific permission kind of directed graph, with one entry node (the plaintiff’s claim)
and/or a fee. Request permissions from permissions@acm.org.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil and one final node, the decision, i.e., a specific obligation imposed
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
1 Forexample, any procedure typically concerns the enforcement of individuals’ rights
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466089 during the process [6].

to the parties by the authority of the court. In between, there are “consumed” in a technical sense w.r.t. that tort.) But, if I do not pay,
multiple possible and intermediate claims proposed by the plaintiff by the force of law I may be compelled to do so through a brief
and multiple possible defences presented by the defendant, and, secondary process known as judgement execution.
of course, many intermediate decisions taken by the court2 . Some
of these court decisions have a preclusive nature, i.e. they do not 3 FORMALISING LEGAL PROCEDURES
allow plaintiff and/or defendant to come back to previous points in We will use for representing legal procedures a multi-agent variant
the procedural flow. Given this ideal structure, the real procedure, of basic core of Propositional Dynamic Logic (PDL) [3] enriched
i.e. the actions effectively ordered by the authority, corresponds to with the preference operator ⊗ for denoting preferences among
a path (a subgraph) in the bigger, ideal graph. procedural actions [1]. The key reason towards the usage of PDL is
that, in the procedural law domain, claims and resolutions indeed
2.1 Structure of Claims resemble programs to be executed. Requests or proposed actions
Plaintiff and defendant are both parties in the procedure. A claim are organised in a preference order. Resolutions have their own
is the only available tool a party has for communicating with the dynamics of execution (either spontaneously by the one obliged
court within a procedure, asking for an action or a fact to be con- and/or by force of law).
sidered. Petitions rarely consist in one single request. They are
usually organised in the form of one main (preferred) request, plus 3.1 Syntax
subsidiary requests. This structure constitutes a common guideline. Let Ag be a set of agents. The language L consists of a set PROP =
The main reason for this preferred/subsidiary organisation of ac- {A, B, C, . . . } of countably many proposition symbols, a set P =
tions resides in the fact that plaintiff and defendant know that the {α i |i ∈ Ag} of countably many atomic programs which we call
court may not decide in their favour regarding their main request. atomic procedural actions or atomic procedures, the usual boolean
Therefore, each party presents to the court a menu of wills for the operators, the program constructors ;, ∪, and ⊗ = {⊗i |i ∈ Ag} and
court’s consideration, altogether and at the same instant. the modality [Π] for any procedure Π. A procedure Π is ⊗-free iff
Petitions can be seen, therefore, as prioritised goals. This prefer- ⊗ does not occur in it.
ence structure outlines a strategic move: I ask the court to order to So, formally, expressions of language are defined as follows:
bring about my preferred state of affairs, if this would or could not
be the case, I ask the court to order to bring about this other state p ::= A|¬p|p1 ∧ p2 |⟨Π⟩p with Π ::= α i |Π1 ; Π 2 |Π 1 ∪ Π 2 |
of affairs, and so on. Indeed, the goal pursued by the parties is that Πi 1 ⊗i · · · ⊗i Πi n
her/his main request becomes the content of the decision of the court where A ∈ PROP, α i ∈ P, and Πi 1 , . . . , Πi n are ⊗-free.
(if not possible, then one of the subsidiaries, as given). We usually deal with at least three types of agents, let’s denote
them p, d, k for representing the plaintiff, the defendant, and the
2.2 Resolutions court. Propositional letters denote as usual states of affairs. Complex
When the court receives a claim of a party it analyses it, and chooses formulas are built using classical boolean connectives as expected.
among the menu of proposed options.The court’s chosen option As usual, we also have an infinite collection of [Π] operators where
implies that the court discards those options appearing before the Π is a (lawful) procedure. In the simplest case, we may have [α i ],
one chosen, and that the options after the one chosen are not taken an atomic procedure for the agent i; hence, [α i ] A is a formula that
into account, at least at the time. The normal course of the procedure reads “every execution of α by i from the present state leads to a
usually indicates that the plaintiff presents the claim as described, state where A is true”. The dual assertion ⟨α i ⟩A—such that ⟨α i ⟩A ≡
then the court resolves, next the defendant defends himself with his ¬[α i ]¬A—states that “the execution of α by i from the present state
own claim, then the court resolves. To each request of the parties leads to a state where A is true”. Complex procedures are intuitively
the court produces an answer. We call this response a resolution. defined from fixed basic atomic procedures as follows:
The claim/resolution chain repeats from the very first claim of the (Sequence) if Π 1 and Π2 are procedures then Π1 ; Π 2 (“do Π 1
plaintiff to the judgment. followed by Π2 ”) is a procedure,
A resolution always has a performative nature even if it is a (Choice) if Π 1 and Π2 are procedures then Π1 ∪ Π2 (“do Π 1 or
low impact decision (e.g. “Take this fact into account for later in Π 2 , non-deterministically”) is a procedure,
this procedure”); it is an order of the authority. So far, in a sense, (Preference) if Π1 and Π 2 are procedures, so are (Π1 ⊗p Π 2 ),
resolutions have an executable nature: the court declares through (Π 1 ⊗d Π2 ), and (Π 1 ⊗k Π2 ), meaning “agent p prefers doing
a resolution which actions are to be done. The resolution, thus, α, but if not then p prefers doing β” (resp. for d, k).
has to be executed. Suppose I am sued and I defend myself with
Notice that we do not use the usual program constructor ∗, which
the claim of paying with no interest, and, subsidiary, to pay with
models in PDL the execution for a program of a nondeterministically
minimum interest. (My claim operates as the input to the court’s
chosen finite number of times. Although it is smooth to use it here,
decision.) Suppose next that the court sentences me to pay with
we can ignore this constructor for our specific purpose.
the minimum interest. Then I, the obliged agent, have to comply
The crucial variation w.r.t. the original use of ⊗ in [1] is that, in
with the judgment by effectively and spontaneously paying. With
their work, the authors interpret an expression [a ⊗ b]A as a being
my payment the judgment is considered to be “executed” (and
the most preferred state of affairs, and if a is not the case then b
2 We will restrict ourselves to procedures including only these agents: plaintiff, defen- is preferred. In this present work we interpret ⊗ as a preference
dant, and the decision-maker (e.g. the judge, the court, a mediator). operator among procedural actions.

Remark 3.1. We read α p as “α as proposed by p”. For example, the Definition 3.7. Let M a procedural model M = ⟨W , R Π , R O, ≺
formula [α p ⊗k β d ]A is to be read: “the court prefers that p’s pro- , V ⟩ and Ri (w) := {v ∈ W | wRi v}, ∥A∥V := {w ∈ W | |=V w A}. The
posal is done, if that is not procedurally possible then d’s proposal valuation function for M is a follows:
must be carried out (leading to a state-of-affairs where A holds)”. • usual for atoms and boolean conditions,
• w |= OA iff ∃R ∈ R O such that R(w) = ∥A∥V [2];
Remark 3.2. Resolutions are expressions in which the procedure
• w |= [Π]A iff ∀v ∈ W if wR Πv then v ∈ ∥A∥V ;
subexpression is one relativised to the court (⊗k ). Within a (lawful)
• w |= [Π 1 ⊗i · · · ⊗i Πn ] A iff w |= [Π 1 ∪ · · · ∪ Πn ] A iff there
procedure, obligations are always imposed by the agent that holds
exists a preference path R Π1 ≺i · · · ≺i R Πn .
the legal power (e.g. the court, the third neutral.) When the court
speaks, it speaks in the form of an obligation, which is added as a To sum-up, this semantics combines the one for classical modal
milestone to the procedure. A resolution has the force of law. E.g.: logics proposed in [2] and the standard one for PDL, plus ranking
[α p ⊗k β p ]A implies the deontic expression O[α p ⊗k β p ] A3 . accessibility relations for procedures.
Note that the “ideal” graph intuitively described in the first sec-
In this work we state that obligations such as OA follow from tion is a multigraph. Vertices are states of affairs and arcs are pro-
resolutions. Later in this paper we address the detachment of O- cedures, relativised to agents. For example, suppose that starting
expressions from court resolutions. from the state-of-affairs v, procedure α p leads us to state w 1 , proce-
Example 3.3. The formulas [α p ⊗p β p ]A and [α p ⊗k β p ]A are read: dure µ d leads us to state w 2 , procedure β p leads us to state w 3 and
“The plaintiff proposes α, subsidiary he asks for β” and “The court procedure γ p to state w 4 . We may e.g. write the plaintiff’s request
as [α p ⊗p β p ⊗p γ p ]A, the defendant’s defence as µ d ; and a court’s
resolves that α as proposed by p is to be done, subsidiary β as

proposed by p is to be done” (leading to a state of affairs where A decision as e.g. [α p ⊗k µ p ]A (or α p ⊗k µ p A).
holds), respectively.
αp / w1 A
The formula [(α p ∪ β p ) ⊗k γ p ]A has in its turn the intuitive read-
ing: “the court decides that either α p or β p are to be performed,
being γ p subsidiary (leading to a state of affairs where A holds)”4 . µd
v / w2 A
3.2 Semantics βp
Let us now present an adequate semantics for our logic. The idea is '
to extend standard semantics for PDL with a relational version of w3 A
the one for ⊗. Multi-relational frames for ⊗-logics are based on the γp
idea of directly ranking relations.
, w4 A
Definition 3.4 (Procedural frame). A procedural frame is a struc-
ture F = ⟨W , R Π , R O, ≺⟩, where We assume that ≺ is a collection of strict partial orders, i.e., which
are irreflexive, transitive and asymmetric: one cannot validate a
• W is a non empty set of  possible worlds,
formula such as [Π1 ⊗i Π1 ] A. Transitivity and antisymmetry are
• R Π is a countable set R α i |α i ∈ P of binary relations over
adopted as expected, and this it ensures the validity of
W ; we inductively extend R Π , for each non-atomic procedure
Π, as follows: [Π1 ⊗i · · · ⊗i Πn ] A ≡ (Contraction)
– wR Π1 ;Π2 v iff there exists a world z such that wR Π1 z and [Π1 ⊗i · · · ⊗i Πk−1 ⊗i Πk+1 · · · ⊗i Πn ] A
zR Π2 v; where Π j = Πk , j < k
– wR Π1 ∪Π2 v iff wR Π1 v or wR Π2 v;
• R O is countable set of binary relations over W , Lemma 3.8. The axiom (Contraction) is valid in the class of proce-
• ≺= {≺i |i ∈ Ag} is a collection of a strict partial orders over dural frames.
RΠ.
4 AXIOMS AND PRINCIPLES
Definition 3.5 (Procedural model). A procedural model is a struc-
4.1 Consistency
ture M = ⟨F , V ⟩, where
First of all, the fact that the court is the authority and that their
• F is a procedural frame, and resolutions have the force of law can be formalised as follows:
• V is a valuation function, V : Prop → 2W .

Definition 3.6 (Preference path). Let R Π1 , . . . , R Πn ⊆ R Π . We [Π1 ⊗k · · · ⊗k Π j ⊗k · · · ⊗k Πm ⊗k · · · ⊗k Πn ] A →
write R Π1 ≺i · · · ≺i R Πn to express that, for each j where 1 ≤ j < n, (Consistency)
R Π j ≺i R Π j+1 . We call R Π1 ≺i · · · ≺i R Πn a i-preference path from → ¬[Π1 ⊗x · · · ⊗x Πm ⊗x · · · ⊗x Π j ⊗x · · · ⊗x Πn ] A
R Π1 to R Πn of length n.
with x ∈ Ag
3 “O” is minimal non-normal deontic operator for representing obligations [2]. In the simplest case,
4 We can informally refer to procedures either by mentioning or not the state-of-affairs
they lead us to.
[Π 1 ⊗k Π2 ]A → ¬([Π2 ⊗x Π 1 ]A).

This axiom settles procedural consistency with respect to prefer- seeking or willing the same state-of-affairs (and e.g. the situation
ences, i.e., that a resolution always prevails over any preferences may be used by the court as a basis to analyse a call for agreement.)
and cannot imply a preference on the contrary by any other agent in Suppose now that we have the following model M2 :
the process, not even by means of a court’s contradictory resolution.
For example, if the court resolves action α i is to be done (and, w 1A,C
?
subsidiary, βk ) then no other agent should prefer them in a converse αp
order, not even the court itself.
Although this axiom stands for a consistency axiom, its structure / w A,D
cannot be imposed between plaintiff’s and defendant’s respective v 2
βd
strategies, as we will see next.
M1 and M2 are indeed different models (they are indeed different
4.2 Procedural Bilaterality graphs). Even both models are different, if α p ≺k β d we have

Generally speaking, bilaterality in court procedures implies the M2, v |= [α p ⊗k β d ]A


necessary existence of two confronted parties in a procedure.Two as it holds for M1 . But we have that
confronting parties is the essence of a tort.
A lawful procedure implies a duality of arguing agents and there- M2, v |= [α p ]C M2, v |= [β d ]D
fore a corresponding duality of confronting goals and confronting M2, v |= [α p ]A M2, v |= [β d ]A.
strategies. All these hold, at least in the very beginning, when the
Now assume that the court is interested in that plaintiff and
procedure starts (it should be clear that, during the course of the
defendant reach an agreement based on the fact that A holds. If
process, parties can engage in an agreement). Such a duality does
our model is M1 , the possibility of reaching an agreement while
not mean, in a legal sense, irreconcilable positions.
in v seems quite straightforward; it is easy to automatically detect
That is why formulas such as the following can be true:
that in a graph like M1 both parties propose actions that lead to
[Π1 ⊗p Π 2 ]A ∧ ([Π 2 ⊗d Π1 ]A) (1) the same world where the same state-of-affairs hold. But if our

model is M2 reaching an agreement from v is unclear; this because
Indeed, it is legally admissible that both α ⊗p β A and ⟨β ⊗d α⟩ A when trying to reach A we may also be conducted to other (possibly
hold at a given time during the process, even when both formulas unwanted) states-of-affairs, e.g. C or D.
reflect opposite preferences. Parties are needed as opponents during However, if the court considers C and D as irrelevant, we can
the process, and the final word would be the court’s. argue that both models M1 and M2 are equivalent.
A different discussion raises when we focus on choosing be-
4.3 Discussion tween α and β. Such a choice may be either important or useless,
A court’s analysis of particular “instantiations” of the contradiction according to what actions α and β actually are. There is quite an
principle during a process may lead the court members to the extra amount of procedural reasoning to be performed when mak-
detection of possible points of agreement between parties. ing the decision on whether to choose α or β. One well known
For the sake of simplicity, assume to work with atomic proce- definition of equivalence between programs in PDL says that two
dures and suppose that we have the following model M1 where, at programs are equivalent if, provided the same input, they always
world v, the proposals α and β (which are respectively plaintiff’s produce the same output. This definition helps to solve the question
and defendant’s) lead us both to world w, where A holds on the procedural equivalence of the two graphs above, they are
equivalent because they e.g. lead us to A. But if we focus on the
M1 : αp
2 wM A second model, suppose that d wants to solve the tort by paying
with one cow (β), and p wants cash (α). Even when both actions
help to bring about the intended outcome (A), they are not the same
v βd actions (for example, there may be a later need of a collateral or
incidental action to the process, for example, to sell out the cow.)
If α p ≺p β d , α p ≺k β d , and β d ≺d α p , it is easy to see that
M1, v |= [α p ⊗p β d ]A M1, v |= [β d ⊗d α p ]A 4.4 Principle of Procedural Economy
We mentioned that procedures are finite chains of actions. The
M1, v |= [α p ⊗k β d ]A
principle of economy says that if procedures are shorter, it is better
We may state that, semantically speaking, the three expressions for all the agents involved. This because there is, e.g., a less amount
are somehow equivalent because they all lead us to the same state of resources used, especially, and possibly most important, less
and give us the same result in M1 . Nonetheless, the three expres- procedural steps involved and therefore less time (and possibly less
sions are indeed different and highly expressive: they allow us to effort) spent.
write exactly what we want to express in three different expres- A form of procedural thinking involving the principle of econ-
sions: plaintiff’s and defendant’s strategies (as distinct or opposite omy can be carried out somehow easily through techniques
one from the other), and the court’s resolution. for reducing the complexity of graphs and comparing them us-
Moreover, note that M1 depicts a peaceful transition within the ing

bi-simulation [3, 5].

For example,
suppose that we have
process. This fact may help the court to see that both parties are (α p ; β p ) ⊗p γ p A and θ d ⊗d γ p A with α p , β p , γ p and θ d single

actions. The court should decide [θd ⊗k γp]A as it leads to the expected state-of-affairs A, but with less court activity.

A variant of the Principle of Procedural Economy is the Concentration Principle, which means that agents ought to present together all that can be done in one step. The presentation of parties' proposals as goal preferences, i.e. as a main goal and subsidiary goals (as in [αp ⊗p βp ⊗p γp]A), is an example of the application of this principle.

5 SPECIFIC RULES
5.1 Detachment of Obligations
From a resolution we can derive an obligation. Such an inference rule reflects the lawful reading of expressions such as [Π1 ⊗k Π2]A. If we have α ⊗k β, then α is obligatory in achieving A and β is to be done in case α is not possible, because the authority imposes so. Let us use O as a non-normal minimal modal deontic operator. Technically then, from a ⊗k-expression we can derive a formula which is in the scope of O. The following general form for this high-level reasoning rule is:

[Π′1 ⊗k · · · ⊗k Π′n ⊗k Π′ ⊗k Π″1 ⊗k · · · ⊗k Π″m] A ∧ (⟨Π′1⟩¬A ∧ · · · ∧ ⟨Π′n⟩¬A) → O[Π′] A   (O-detachment)

which should be intuitively understood as: "The court's preference that holds is obligatory", reflecting the intuitive reading of the ⊗k.

5.2 Consistency Check w.r.t. Court Resolutions
Suppose that we have the following court resolution: [α ⊗k β ⊗k γ]A. Assume that, later, the court also resolves that [β ⊗k ϕ]A.

By application of the O-detachment rule we get O[α]A from the first resolution, and we also get O[β]A from the second resolution. Both detachments give us O[α]A ∧ O[β]A. From the procedural point of view this is not what is expected because, according to the first resolution given, we should get O[α]A, but not O[β]A (unless β is to be done because α cannot). This consistency conflict arises when we analyse the resolutions in the framework of a lawful procedure. The intuition behind the solution to this (lawful) inconsistency is that there is a form of procedural reasoning that consists in the temporal ordering of court resolutions and, following, the alignment of the forthcoming resolutions with respect to the first resolution given. Suppose the given resolutions are:

[α ⊗k β ⊗k γ]A    [β ⊗k ϕ]A    [ϕ ⊗k ψ]A.

Note that all three lead to the same state of affairs. We set the first one as the conductor resolution, then align the rest of them according to the preferred option in each resolution, as follows:

(Conductor) Resolution 1: [α ⊗k β ⊗k γ]A
            Resolution 2:      [β ⊗k ϕ]A
            Resolution 3:           [ϕ ⊗k ψ]A
 Resultant  Resolution 4: [α ⊗k β ⊗k (γ ∪ ϕ) ⊗k ψ]A

We call this rule preference alignment. The intuition behind it is the following: court resolutions have a temporal ordering. According to this ordering, and starting with the first given resolution, the remaining resolutions are aligned to the conductor resolution with respect to their respective preferred option. That is, we put the preferred option of every resolution in the column where the same option appears in the conductor resolution. Then all the given resolutions are rejoined in one resulting main resolution, column by column. In this course, component options in a column are gathered together with the ∪ (choice) PDL operator.

A preference alignment algorithm is as follows:

procedure preference_alignment(list_of_resolutions): result
begin
  align resol_1 as the conductor resolution
  for i := 2 to n do
    if (preferred action in resol_i is in resol_1) then
      position := position in resol_1 of that preferred action
      align resol_i below resol_1 starting from position
  end;
  i := 1
  result := false
  while (there are actions in column i) do
  begin
    union := connect with ∪ all actions from resol_1 to resol_n in column i
    result := result ⊗k union
    i := i + 1
  end;
end;
6 SUMMARY
A legal procedure in court proceedings is the formal way in which civil proceedings are conducted. A legal procedure was defined as a chain of consecutive actions which has as (final) goal the decision/solution of a conflict, i.e., as a finite sequence of actions in which the last action is (the creation of) a(n individual) norm, usually an obligation.

We argued that one peculiar aspect of proceedings is that some types of procedure in the process are prioritised. Priorities indeed derive from individual preferences of the parties in the process, or they can also follow from objective ordering requirements from procedures. In order to model legal procedures, in this paper we technically added obligations and a preference operator for procedural actions to a multi-agent version of PDL.

This paper is preliminary research. For example, complexity features and a full investigation of the effective application of the proposed machinery are a matter of future work.

REFERENCES
[1] Erica Calardo, Guido Governatori, and Antonino Rotolo. Sequence semantics for modelling reason-based preferences. Fundam. Inform., 158(1-3):217–238, 2018.
[2] Erica Calardo and Antonino Rotolo. Variants of multi-relational semantics for propositional non-normal modal logics. J. Appl. Non Class. Logics, 24(4):293–320, 2014.
[3] David Harel, Jerzy Tiuryn, and Dexter Kozen. Dynamic Logic. MIT Press, Cambridge, MA, USA, 2000.
[4] P. Luiso. Diritto processuale civile. Giuffrè, 2019.
[5] Johan van Benthem. Program constructions that are safe for bisimulation. Studia Logica, 60(2):311–330, 1998.
[6] Joachim Zekoll. Comparative civil procedure. In Mathias Reimann and Reinhard Zimmermann, editors, The Oxford Handbook of Comparative Law. Oxford University Press, 2012.
Automatic Extraction of Amendments from Polish Statutory Law
Aleksander Smywiński-Pohl Krzysztof Wróbel
Mateusz Piech krzysztof@wrobel.pro
Zbigniew Kaleta Jagiellonian University
{apohllo,mpiech,zkaleta}@agh.edu.pl Kraków, Poland
AGH University of Science and Technology
Kraków, Poland
ABSTRACT
The article discusses the problem of automatic detection of amendments found in the Polish statutory law. We treat the problem as a token-classification task and we introduce a scheme constructed by analysis of more than 200 amending bills. We apply recent neural architectures such as BERT and BiRNN to the task of token classification. The achieved results of all models are very high as micro average F1 score ranges from 96.3% to 98.2% for BiRNN. The presented solution is a first step towards fully automatic structuring and application of amendments in the Polish statutory law.

Table 1: The types of entities annotated in the amendments.
  Amendment type: add_content, remove_content, change_content, add_unit, remove_unit, change_unit, change_id
  Identifier: new_id, amended_id, preceding_id
  Content: new_content, old_content, preceding_content

CCS CONCEPTS the user tries to use that website to track the changes of any law,
• Computing methodologies → Information extraction; • Ap- which underwent a number of amendments, the website is not very
plied computing → Law. useful.
Taking into account the fact that the source texts of the amend-
KEYWORDS ments are weakly structured (i.e. PDF files) and the fact that the
amendment extraction, information extraction, named entity recog- number of amendments is tremendous, we investigate the possi-
nition, legal information system, Polish statutory law bility to apply machine learning approach to the problem of the
ACM Reference Format: automatic structuring of the text that would allow for converting
Aleksander Smywiński-Pohl, Mateusz Piech, Zbigniew Kaleta, and Krzysztof amending bills into structured data.
Wróbel. 2021. Automatic Extraction of Amendments from Polish Statutory The contribution of the article is as follows. We start by casting
Law. In Eighteenth International Conference for Artificial Intelligence and the problem of amendment extraction as token classification. We
Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, introduce a scheme devised for detecting the amendments based
USA, 5 pages. https://doi.org/10.1145/3462757.3466141 on an analysis of a large number of Polish amending laws. Then we
present three approaches to the problem of amendment extraction:
1 INTRODUCTION one based on rules and two others based on machine learning. We
The legal system in Poland is based on statutory law passed by pay special attention to the pre-processing of the data since it has
the Polish parliament. The laws are published in the Journal of a huge impact on the obtained results. We discuss the related work
Laws of the Republic of Poland and are available via ISAP 1 website in Section 5. We conclude the article with prospects for the future
which distributes the laws as PDF files with metadata available research.
as HTML pages. Although the primary law is linked with all its
amendments, the user can only list all laws (but not amendments 2 ANNOTATION CORPUS
specific to that law) that modified the given document. In the case For this research we have created a corpus of 242 bills of Polish
of the most important laws such as civil and criminal codes, the statutory law from the years 1993-2018. Table 1 shows the types
consolidated text is also published yearly in the form of PDF files, of the textual units identified by analyzing the sample of bills. The
with specific regulations marked as recently amended. All in all, if primary element used to define the amendment is its type. The set
1 http://isap.sejm.gov.pl of required elements and in some cases their exact meaning are
dependant on the type of amendment.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed The amendments related to units (with unit suffix) are the basic
for profit or commercial advantage and that copies bear this notice and the full citation ones. Thus they require the id of the amended unit, the new content
on the first page. Copyrights for components of this work owned by others than the (in the case of addition and change), and the id of the preceding
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission unit (in the case of addition).
and/or a fee. Request permissions from permissions@acm.org. The amendments related to the change in content (with content
ICAIL’21, June 21–25, 2021, São Paulo, Brazil suffix) are much more sophisticated, since they change the content
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 of the unit only partly. The most common case is addition, sub-
https://doi.org/10.1145/3462757.3466141 stitution or removal of a short phrase or a punctuation mark. But
The last type of amendment – change_id – is related to the specific case when a part of the text receives a new identifier: mainly when an article containing one, unnumbered paragraph is extended with a new paragraph, and thus the existing paragraph receives an id.

Regarding the identifiers, we distinguish between the id of the element when it is changed or removed (amended_id) and when it is added, since in the second case the id of the preceding unit has to be indicated and we have to distinguish between them. It was assumed that only the most specific part of the identifier is annotated.

The annotation was performed by two independent annotators and then reviewed by a super-annotator with the help of the Inforex system [11]. The corpus contains roughly four hundred thousand tokens and its size is comparable with the manually annotated part of the National Corpus of Polish (1.2 million) [14]. Table 2 summarizes the counts for different types of tokens. The dominating class are the tokens without annotation (230 thousand tokens). But still almost half of the corpus contains amendment-related annotations. This is not a surprise, since the majority of the enacted bills are amendments to the existing laws. The second most popular annotation types are related to content (more than 170 thousand tokens). This is expected, since these tokens constitute the actual contents of the new and changed regulations. The total number of individual amendments – partly reflected by the first group, since the amendment type is usually indicated by two tokens – is roughly 2.5 thousand (this is an estimation, since there are cases when one amendment type relates to changes of more than one element).

Table 2: The counts of tokens for different types of annotations.

Annotation type | Count
Amendment type  | 5 377
Identifier      | 9 747
Content         | 171 937
No annotation   | 230 300
Total           | 417 361

We have divided the annotated documents into three subsets: the training (~80%), the development (~10%) and the test (~10%) set. We imposed two requirements when splitting the documents. First of all, since the structure of the bill is crucial for the extraction of amendments, we have split whole documents rather than their fragments, such as individual provisions. As a second requirement, we have paid special attention to keeping a similar distribution of tags in each of the subsets. This was not straightforward, due to the first requirement, but we managed to keep each type of tag in the 62%–85%, 8%–23% and 6%–23% range for the training, the development and the test subsets respectively.

3 AMENDMENT DETECTION
The detection of the amendments – treated as a NER-like problem – was tested using the following approaches. The first one was a rule-based approach employing regular expressions (a baseline, henceforth called Rules), the second one was an approach based on the transformer architecture [19] with bi-directional encoding [5] (henceforth called BERT), and the third one was an approach based on a bidirectional recurrent neural network (BiRNN) [16] with Long short-term memory cells [8] (henceforth called BiRNN). We have tested three transformer models pre-trained on different corpora: HerBERT on a large Polish corpus [15], RoBERTa on an English corpus and XLMR on a multilingual corpus [4]. The HuggingFace Transformers library [21] was used in that setting. The bidirectional RNN employed a character-level language model (cLM) and was pre-trained on Polish texts from the law domain. We have chosen the FLAIR framework [2] for the pre-training and fine-tuning of that model.

3.1 Rules
In order to fairly compare the system with the other approaches, we have created the rules looking only at the training and development datasets. The size of the "model" compared to the size of both the RNN and HerBERT is very favorable for that approach. We were unable to define a regular expression for the preceding_content tag (lack of common phrases in the examples), thus it is not detected by this approach. A tag associated with a regex is assigned to the text span matched by that regex and the O tag is assigned to the rest of the text. In the case of multiple regular expressions matching overlapping text spans, the more popular class, according to the training and development sets, is assigned. The matching was run from left to right, thus it is continued right after the phrase matching the assigned tag.
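To make the matching procedure concrete, the following is a minimal sketch of such a regex-based tagger. The patterns, tag names and frequency counts are illustrative placeholders, not the actual rule set used in the experiments; overlaps are resolved greedily from left to right, preferring the more frequent class on ties.

```python
import re

# Hypothetical patterns; the real rule set was derived from the training
# and development data and is far more extensive.
PATTERNS = {
    "amendment_type": re.compile(r"(otrzymuje brzmienie|dodaje się|skreśla się|uchyla się)"),
    "amended_id": re.compile(r"(art\.\s*\d+\w*|ust\.\s*\d+\w*|pkt\s*\d+\w*)"),
}

# Hypothetical class frequencies from the training/development sets,
# used to break ties between overlapping matches.
CLASS_FREQUENCY = {"amendment_type": 5377, "amended_id": 9747}


def tag_text(text):
    """Assign a tag to every span matched by a regex; everything else is O.

    Matching proceeds left to right: once a span is accepted, matching
    continues right after it, and ties between matches starting at the same
    position are resolved in favour of the more frequent class.
    """
    spans = []
    for tag, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), tag))
    spans.sort(key=lambda s: (s[0], -CLASS_FREQUENCY[s[2]]))
    accepted, last_end = [], 0
    for start, end, tag in spans:
        if start < last_end:      # overlaps an already accepted span
            continue
        accepted.append((start, end, tag))
        last_end = end
    return accepted               # tokens outside these spans receive the O tag


if __name__ == "__main__":
    print(tag_text("W art. 5 ust. 2 dodaje się pkt 3 w brzmieniu: ..."))
```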
3.2 BERT
The second approach uses a transformer model for Polish called HerBERT [15]. The model was trained on a large number of Polish texts and it uses the RoBERTa pre-training optimizations [10]. The model achieves SOTA results on the KLEJ benchmark [15]2, a collection of Polish Natural Language Processing (NLP) tasks resembling the GLUE benchmark [20]. The results on that benchmark are very similar to those of the XLMR model [4], yet HerBERT is better suited for Polish, since XLMR was trained on a very large number of languages (100), while HerBERT was trained only on one language. As a result, HerBERT is much smaller and the fine-tuning is faster compared to XLMR. In contrast to the previous approach, we have not used any domain-specific texts to reduce the perplexity of the model on texts coming from the law domain. In the reported experiments we have used the large variant of the model.

2 Klej in Polish means glue in English.

To measure the impact of language-specific pre-training, we have included two other models: RoBERTa (large) [10] and the largest multilingual model, i.e. XLMR [4]. Both of these models were trained using the same approach, but they were trained for English and for 100 languages respectively.

3.3 BiRNN
The last approach reflects a fairly recent SOTA model for NER. Bidirectional RNNs with LSTM cells were very popular until transformer models came to dominate the NLP landscape.
Also, the RNNs have one important advantage compared to BERT – they accept inputs of arbitrary length. BERT architectures are limited to 512 subtokens due to memory and time complexity which is quadratic in the length of the input. This is important in the context of amendment detection, since we would prefer a model not requiring a sophisticated text segmentation (at least in the stage of amendment detection), but rather accepting the complete text of the bill. The Flair architecture assumes usage of a character-level language model pre-trained on texts from a given domain. The model was pre-trained on two corpora – the full National Corpus of Polish [13, 14] (approx. 2 billion tokens) and a corpus containing the Polish statutory law and the judgments of Polish courts (approx. 4 billion tokens).

4 EXPERIMENTS
We start the discussion of the experiments by presenting the results for the Rules model. For most of the types of elements the rules work perfectly, yet there is a group of types (4 – new_id, 5 – amended_id, 7 – preceding_id) that are confused for each other. This is the reason why the rules approach (cf. Table 3) gives low scores in terms of the micro, macro and weighted average F1 measures. This shows that the identification of the ids is problematic for this approach. We have to stress that we have excluded from the results the false positives related to the O tag, since that would yield a very large number of incorrectly determined identifiers, as they appear in a large number of contexts. Moreover, we were unable to come up with a rule for detecting the 6th category (i.e. preceding_content). We argue that the approach based on rules is not universal enough to easily differentiate between the various types of elements appearing in the amendments.

As a second approach, we tested the family of BERT models. These models have a hard limit3 on the length of the input, which amounts to 512 subtokens. It is obvious that the length of bills is longer than that limit, but it might be the case that during training it is not necessary to perform any pre-processing, since all phenomena are present at the beginning of the document4. We have run two preliminary experiments using the HerBERT model, in order to determine if the pre-processing is needed. In the first, we have submitted the full document as the input. This approach results in a large amount of truncated text (more than 60% in the training set). In the second approach we took individual sentences as input (obtained from the Inforex system, which uses the MorphoDiTa library [18] for sentence boundary detection). This approach yielded complete coverage of the annotated tokens, since the detected sentences were always shorter than the 512 limit.

3 The limit is hard in the sense that the quadratic memory complexity makes longer inputs prohibitively slow to process with the currently available hardware.
4 That would also be a waste of a large number of annotations, yet we believe that it is good to test even the simplest approach.

The first approach resulted in a 91.58% micro-average F1 score and a 76.03% macro-average F1 score. The low macro-average score is a result of complete ignorance of the tags preceding_content and remove_content, which were very rare both in the training and the testing corpora. This outcome – as expected – falsifies the assumption that the pre-processing is not needed.

Regarding the second approach, we observed that it resulted in a larger number of examples (a 3-fold increase in the case of the training set). Thanks to that it gave a much better macro-average F1 score (94%), but the micro-average F1 score of 84.36% was far below expectation. That was primarily due to the low outcome for the new_content class: a 74.84% F1 score. This result is easy to explain, since the new content usually contains a number of sentences which, classified individually, lack the context required to determine whether they are part of the amending or the amended bill (they are quoted in the second case, but the quotation spans a large number of sentences). In fact, the 75% score seems to be very high if we take that phenomenon into account.

As a result, we have decided to perform a more elaborate pre-processing. The algorithm iterated through the lines in the CoNLL document. If a new sentence (i.e. an empty line) was detected, it broke the sentence only if the preceding tag was O (i.e. the sentence break was not inside an annotated span). In the other case, it broke the sentence when the first O tag was detected. Such an approach does not ensure that the context is always preserved, but it plays well with the new_content type, since these tokens always form long spans of consecutive tokens.
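A compact sketch of this splitting heuristic, assuming a two-column CoNLL-style input (token and tag separated by a tab, with empty lines marking the original sentence boundaries), is shown below; the function and variable names are ours, not the authors' implementation.

```python
def split_conll(lines):
    """Re-segment a CoNLL file so that annotated spans are not cut.

    An empty line (original sentence boundary) only closes the current
    segment if the previously seen tag was O; inside an annotated span the
    segment is instead closed at the first subsequent O token.
    """
    segments, current, prev_tag, pending_break = [], [], "O", False
    for line in lines:
        line = line.rstrip("\n")
        if not line:                              # original sentence boundary
            if prev_tag == "O" and current:
                segments.append(current)
                current = []
            else:
                pending_break = True              # wait for the first O tag
            continue
        token, tag = line.split("\t")
        if pending_break and tag == "O" and current:
            segments.append(current)
            current, pending_break = [], False
        current.append((token, tag))
        prev_tag = tag
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    # Illustrative tags only (BIO prefixes plus tag names from this paper).
    demo = ["W\tO", "art.\tB-amended_id", "5\tI-amended_id", "",
            "dodaje\tO", "się\tO"]
    print(split_conll(demo))
```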
Yet, during inference such a procedure is impossible to apply, because we do not have the tags we are going to predict. To overcome that problem we implemented a procedure based on the structure of the bill. We treated the textual content of provisions at any level as independent inputs for the algorithm. In case the provision of the lowest level was longer than 512 subtokens, it was truncated and the tag assigned to the last token was assigned to the remaining tokens besides the last one (which in most cases is a quotation).

Table 3: The results of the detection of amendments.

Model   | Micro F1 | Macro F1 | Weighted F1 | Support
Rules   | 69.58    | 82.03    | 72.07       | 1335
HerBERT | 97.96    | 90.81    | 97.92       | 1174
RoBERTa | 97.69    | 97.68    | 97.70       | 1148
XLMR    | 96.72    | 84.62    | 96.73       | 1174
BiRNN 1 | 90.81    | 80.72    | 90.60       | 1773
BiRNN 2 | 98.20    | 98.90    | 98.19       | 1174

Table 3 summarizes the results of an experiment conducted using the models belonging to the BERT family with the more sophisticated pre-processing. All of them were trained with the same set of hyper-parameters (batch size: 8, epochs: 10, learning rate: 5e-6, eps: 1e-8, maximum gradient: 1.0, weight decay: 0.0, max sequence length: 512, seed: 0, and the F1 score on the development set used for early stopping). The results are reported for the test set.
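For reference, the listed hyper-parameters map onto a HuggingFace Transformers configuration roughly as in the sketch below. The model identifier, label count and dataset objects are placeholders, and the authors' actual training script may differ.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "allegro/herbert-large-cased"   # placeholder; RoBERTa/XLM-R analogous
NUM_LABELS = 17                              # placeholder: size of the tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

args = TrainingArguments(
    output_dir="amendment-tagger",
    per_device_train_batch_size=8,   # batch size: 8
    num_train_epochs=10,             # epochs: 10
    learning_rate=5e-6,              # learning rate: 5e-6
    adam_epsilon=1e-8,               # eps: 1e-8
    max_grad_norm=1.0,               # maximum gradient: 1.0
    weight_decay=0.0,                # weight decay: 0.0
    seed=0,                          # seed: 0
    evaluation_strategy="epoch",     # F1 on the development set is used
    save_strategy="epoch",           # to pick the best checkpoint,
    load_best_model_at_end=True,     # mimicking early stopping
    metric_for_best_model="f1",
)

# Inputs are truncated to 512 subtokens at tokenization time.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=dev_ds, compute_metrics=compute_f1)
# trainer.train()
```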
The first observation relates to the number of testing examples (support) available for each approach. As was explained, there were cases when the input had to be truncated. The different number of examples stems from the fact that the models yield a different number of subtokens5, but the differences are small.

5 HerBERT uses a tokenizer trained on Polish texts, RoBERTa – on English texts and XLMR on texts from 100 languages. Since the vocabulary size is limited, the Polish tokenizer model may yield the lowest number of tokens, since HerBERT and RoBERTa use the same vocabulary size. XLMR uses a dictionary which is 4 times larger.

The second observation is that the performance of the models is very good, especially if we look at the micro average scores. XLMR – which is the worst according to that metric – is still almost perfect, yielding a 96.72% micro average F1 score.
According to that metric, the best results are obtained by HerBERT, which yields a result better by 1.2 percentage points. Yet its macro average result (90.81% F1 score) is pretty low compared to the result given by RoBERTa (97.78% F1 score). That result is a bit surprising, especially if we recall that this model is pre-trained for English, while we analyze texts in Polish. A closer inspection of the results shows that the low outcome of HerBERT was due to lower scores for the detection of the content-related tags ({add,remove,change}_content). If we take into account the fact that RoBERTa gives results only 0.27 pp. lower than HerBERT in terms of micro average F1 score, we may conclude that for this problem it seems to be the best option among the family of BERT models, even though it is not pre-trained on Polish texts. This result is particularly interesting since it shows that we could leverage a more recent model pre-trained only for English, especially since the pre-training is a very costly procedure.

As the last type of approach, we present the results for the BiRNN model. Although RNNs do not have a hard limit on the input length, to fairly compare their performance with the previous approach we have used the same pre-processing strategy. We have tested the approach when the sentences are provided by MorphoDiTa (version 1) and when they are provided by the optimized version of the input pre-processing (version 2). We have trained the model with the following hyper-parameters: hidden layer size: 256, max epochs: 150, learning rate: 0.1, mini-batch size: 32, word embeddings: pl-wiki-fasttext-300d-1M [12]. The results of the experiments are given in Table 3. The comparison of the results produced by HerBERT and BiRNN on the input provided by MorphoDiTa shows that – as expected – the number of testing (and training) examples is much higher for BiRNN, since the input is not truncated. The recurrent architecture receives almost 3 times more examples. Interestingly, the model achieves a low value (80.72%) for the macro average F1 score (much lower than HerBERT). Inspection of the individual classes showed that this is caused by 0.0% scores for the new_id and change_id tags, which are rare in the training set. The results for new_content and old_content were also lowering the result, since both of them were below 90%. Yet the results are surprising when compared to the BERT model, since the recurrent model had a much larger number of training instances.
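A corresponding FLAIR setup under the listed hyper-parameters could look roughly like the following sketch (written against the FLAIR 0.x API). The corpus paths and embedding identifiers are placeholders; in the paper the character-level language model was pre-trained on the legal-domain corpora described in Section 3.3 rather than taken off the shelf.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Placeholder paths: two-column CoNLL files produced by the pre-processing step.
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

embeddings = StackedEmbeddings([
    WordEmbeddings("pl"),            # Polish fastText word embeddings (placeholder)
    FlairEmbeddings("pl-forward"),   # placeholders for the character-level LM;
    FlairEmbeddings("pl-backward"),  # the paper uses a legal-domain pre-trained LM
])

tagger = SequenceTagger(hidden_size=256,       # hidden layer size: 256
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner")

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/amendments",
              learning_rate=0.1,               # learning rate: 0.1
              mini_batch_size=32,              # mini-batch size: 32
              max_epochs=150)                  # max epochs: 150
```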
The most interesting result is for the second setting, where the customized pre-processing was applied. Since the input was truncated, the number of training and testing examples was lowered by approx. 33%. Yet this approach gave the best results overall, compared both to the simplified pre-processing and to the family of BERT models. Both the micro (98.20%) and macro (98.90%) average F1 scores were almost perfect, meaning that the model was able to learn all classes. That result shows that even though the approach based on BERT has dominated the landscape of NLP problems, for the amendment extraction problem – at least for the Polish language – RNNs still might be a good alternative.

5 RELATED WORK
The approach used in this article – BIO tagging – is often used for named entity recognition. One of the most commonly used methods for NER is the bidirectional LSTM, often used with word embeddings and similar models. For example, the authors of [2] propose the use of a character-level language model to generate contextual string embeddings. A bidirectional neural model is trained for the task of predicting the next character in the sequence. The hidden states of the forward and backward parts of the model, respectively after and at the beginning of the analyzed word, are concatenated to create a context-aware string embedding. In later work – [1] – they extend this vector with a second part, which is a function (e.g. element-wise minimum, maximum, or mean) of all embeddings for the same string, including the newest one. One of our solutions (BiRNN) is directly based on that work.

The authors of [22] have created LUKE – a language model based on RoBERTa, where they train the embeddings for entities alongside the embeddings for words. The input for training LUKE is the concatenation of a tokenized sentence and the list of all entities present in this sentence. The training process is similar to BERT and other masked language models – randomly selected parts of the input (in this case words and entities alike) are masked and the model is trained to predict those masked fragments. BERT training is fully unsupervised, so it requires just a plain, unannotated corpus, but LUKE needs an entity-annotated corpus. The authors used the corpus from Wikipedia with good results. LUKE also extends the self-attention mechanism normally used in transformers so that it is entity-aware. Although that approach yields SOTA results for the NER task, it does not apply to our problem, since the entities in our approach are much different from typical entities in NER: they are either short, constant phrases (e.g. is removed, is added), identifiers (e.g. art. 5a, letter b), or long spans of text (the new content of the amended provisions). Besides the first type, which is very easy to detect, they belong to an open set of text phrases and cannot have their own – learned – representations.

The work on automatic processing of legal amendments dates back as far as the work of Timothy Arnold-Moore in 1995 [3]. There are two main approaches to the problem of amendment processing. One is to take two versions of the same legal act and compare them in a diff-like manner. The second approach, presented herein, is to generate an amended version of the legal act by using its previous text and the text of the amending act.

The most common approach is to use syntactic and semantic parsing with a rule-based system. E.g. the authors of [17] use shallow syntactic parsing (chunking) with a battery of finite state automata and a semantic analysis using a compiler based on a specialized grammar. The system also contains an automatic classifier that recognizes three kinds of amending provisions and discards non-amending provisions from further processing. In our work, we expand the set of provisions by providing more fine-grained distinctions and introducing a new type related to the introduction of an identifier.

The authors of [9] treat the task as a slot filling problem, where the correct frame is chosen based on the verb and its dependents using IF-THEN rules. In the case of multiple solutions, a heuristic is used to pick the best one. They also address the problem of idioms: some complex phrases common in legal documents are rewritten using hand-crafted rules into a form that is easier to process by further stages of the system.

In [7] the authors extend the number of types of modificatory provisions and the article focuses on temporal modifications – changes to either force or efficacy time.
Unlike previous works in this field, they do not process all sentences of the amending act, but instead filter them using regular expressions to increase both accuracy and performance. They also introduce a sliding window to better handle long and complex sentences.

All of the aforementioned systems were devised for Italian legal acts (and as a result the Italian language) and use the NIR (Norme in Rete) XML representation of the data.

The authors of [6] describe an attempt to make an automated system for the consolidation of Greek legal acts, based on officially published PDFs. This system uses regular expressions at multiple stages of the process, including recognizing the amendment type (addition, substitution, or deletion of a text portion) and extracting the required data (filling the slots). If the amendment concerns whole structural units, such as paragraphs, the change is applied to the structure of the XML file containing the act. On the subparagraph level, the Python NLTK module is used to break down the paragraph into units of an appropriate level.

6 CONCLUSIONS
We have presented a novel algorithm for detecting amendments found in Polish statutory law. The primary difference between this article and the previous work is the application of neural models to the detection of amendment constituents. By treating it as a token classification problem it is possible to use the most recent SOTA models from the BERT family and the somewhat older BiRNNs. It turns out that using such models provides very accurate results. This observation is supported by experiments conducted with the most recent neural models: HerBERT, RoBERTa, XLMR, and BiRNN. All of them yielded at least 96% for the micro average measure. Yet, contrary to our expectations, the best performance among the family of BERT models was achieved by RoBERTa (a model trained for English), while the best results overall were achieved by BiRNN, whose score was above 98% for each weighting scheme. This opens the possibility for the automation of amendment extraction in Polish.

Still, there are problems that have to be addressed to complete that goal. First of all, our solution is the first step in the automation pipeline, since the detected tokens have to be converted into meaningful amendment representations. The solution thus requires a proper structuring of the bill as well as automatic detection of the references to the provisions. Both of these problems may be resolved following the same approach, but they were out of the scope of this article. A connected problem is the detection and processing of temporal expressions that determine the application date of a specific amendment. We will address these issues in forthcoming research.

7 ACKNOWLEDGMENTS
This work was supported by the Polish National Centre for Research and Development – LIDER Program under Grant LIDER/27/0164/L-8/16/NCBR/2017 titled "Lemkin – intelligent legal information system" and in part by the PLGrid Infrastructure.

REFERENCES
[1] Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled Contextualized Embeddings for Named Entity Recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 724–728. https://doi.org/10.18653/v1/N19-1078
[2] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. 1638–1649.
[3] Timothy Arnold-Moore. 1995. Automatically processing amendments to legislation. In Proceedings of the 5th international conference on Artificial intelligence and law. 297–306.
[4] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:cs.CL/1911.02116
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] John Garofalakis, Konstantinos Plessas, and Athanasios Plessas. 2016. A Semi-Automatic System for the Consolidation of Greek Legislative Texts. In Proceedings of the 20th Pan-Hellenic Conference on Informatics (Patras, Greece) (PCI '16). Association for Computing Machinery, New York, NY, USA, Article 1, 6 pages. https://doi.org/10.1145/3003733.3003735
[7] Davide Gianfelice, Leonardo Lesmo, Monica Palmirani, Daniele Perlo, and Daniele P Radicioni. 2013. Modificatory provisions detection: a hybrid NLP approach. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law. 43–52.
[8] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[9] Leonardo Lesmo, Alessandro Mazzei, Monica Palmirani, and Daniele Radicioni. 2013. TULSI: an NLP system for extracting legal modificatory provisions. Artificial Intelligence and Law 21 (05 2013). https://doi.org/10.1007/s10506-012-9127-6
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:cs.CL/1907.11692
[11] Michał Marcińczuk, Marcin Oleksy, and Jan Kocoń. 2017. Inforex—A collaborative system for text corpora annotation and analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP. INCOMA Shoumen, 473–482.
[12] Agnieszka Mykowiecka, Małgorzata Marciniak, and Piotr Rychlik. 2017. Testing word embeddings for Polish. Cognitive Studies 17 (2017).
[13] Piotr Pęzik. 2012. Wyszukiwarka PELCRA dla danych NKJP. In Narodowy Korpus Języka Polskiego, Adam Przepiórkowski, Mirosław Bańko, Rafał Górski, and Barbara Lewandowska-Tomaszczyk (Eds.). Wydawnictwo Naukowe PWN, 253–279.
[14] Adam Przepiórkowski, Mirosław Bańko, Rafał Górski, and Barbara Lewandowska-Tomaszczyk. 2012. Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN.
[15] Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and Ireneusz Gawlik. 2020. KLEJ: Comprehensive Benchmark for Polish Language Understanding. arXiv:cs.CL/2005.00630
[16] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[17] PierLuigi Spinosa, Gerardo Giardiello, Manola Cherubini, Simone Marchi, Giulia Venturi, and Simonetta Montemagni. 2009. NLP-Based Metadata Extraction for Legal Text Consolidation. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (Barcelona, Spain) (ICAIL '09). Association for Computing Machinery, New York, NY, USA, 40–49. https://doi.org/10.1145/1568234.1568240
[18] Milan Straka and Jana Straková. 2014. MorphoDiTa: Morphological Dictionary and Tagger. (2014).
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:cs.CL/1706.03762
[20] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[21] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. CoRR abs/1910.03771 (2019). arXiv:1910.03771 http://arxiv.org/abs/1910.03771
[22] Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. arXiv:cs.CL/2010.01057
A Dataset for Evaluating Legal Question Answering on Private International Law

Francesco Sovrano (francesco.sovrano2@unibo.it), University of Bologna - DISI, Italy
Monica Palmirani (monica.palmirani@unibo.it), University of Bologna - CIRSFID, Italy
Biagio Distefano (biagio.distefano@univie.ac.at), Universität Wien, Austria
Salvatore Sapienza (salvatore.sapienza@unibo.it), University of Bologna - CIRSFID, Italy
Fabio Vitali (fabio.vitali@unibo.it), University of Bologna - DISI, Italy
ABSTRACT
International Private Law (PIL) is a complex legal domain that presents frequent conflicting norms between the hierarchy of legal sources, legal domains, and the adopted procedures. Scientific research on PIL reveals the need to create a bridge between European and national laws. In this context, legal experts have to access heterogeneous sources, being able to recall all the norms and to combine them using case-law and following the principles of interpretation theory. This clearly poses a daunting challenge to humans whenever Regulations change frequently or are big enough in size. Automated reasoning over legal texts is not a trivial task, because legal language is very specific and in many ways different from commonly used natural language. When applying state-of-the-art language models to legalese understanding, one of the challenges is always to figure out how to optimally use the available amount of data. This makes it hard to apply state-of-the-art sub-symbolic question answering algorithms on legislative texts, especially the PIL ones, because of data scarcity. In this paper we try to expand previous works on legal question answering, publishing a larger and more curated dataset for the evaluation of automated question answering on PIL.

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Reasoning about belief and knowledge; Information extraction.

KEYWORDS
Legal Question Answering, Private International Law, Knowledge Graph Extraction

ACM Reference Format:
Francesco Sovrano, Monica Palmirani, Biagio Distefano, Salvatore Sapienza, and Fabio Vitali. 2021. A Dataset for Evaluating Legal Question Answering on Private International Law. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466094

1 INTRODUCTION AND BACKGROUND
International Private Law (PIL) is a complex legal domain that presents frequent conflicting norms between the hierarchy of legal sources (e.g., national vs. European level), between legal domains (e.g., consumer law vs. labour law), and between the adopted procedures (e.g., alternative dispute resolution vs. litigation). Scientific research on PIL reveals the need to create a bridge between European and national laws on this domain by accessing heterogeneous legal sources. The European project Interlex1 intended to investigate this domain and to use technology to fill the gap between different legal sources. This need to rely on technology is due to the complexity of the PIL domain. In fact, in this context, legal experts have to access heterogeneous sources, being able to recall all the norms and to analyse them using case-law2 and following the principles of interpretation theory. This poses a daunting challenge to humans whenever Regulations change frequently or are big enough in size. In fact, searching within thousands and thousands of pages of legal documents from different sources and jurisdictions is undoubtedly a task requiring human effort and specialised expertise. This is probably one of the reasons why researchers, governments and industry have for long looked for a way to build "intelligent" machines capable of helping humans in detecting the relevant legal provisions over such complex corpora [11]. In the literature we may find at least two distinct main approaches to reasoning and artificial intelligence. The first approach is more symbolic and formal, capable of modelling legal knowledge into a formal representation. For example, the legal ontology modelling method [4, 8] is a relevant instrument for defining the legal concepts and relationships included in legal texts (e.g., hard law, judgement, soft law, etc.), but it is extremely expensive, it depends on the hermeneutic approach adopted by each scholar or community (e.g., common law vs. civil law), it is influenced by a strong localisation due to the local jurisdiction (e.g., domestic regulation and local court action) and by cultural and social norms (e.g., the concept of gender) and, furthermore, modifications in the legal framework (e.g., new legislation) require a refinement or (even worse) a whole extension of the ontology.

1 http://www.interlexproject.eu/index.html
2 http://www.interlexproject.eu/del/Deliverable2dot3.pdf
The second approach is the most recent and in many ways the most versatile, but sub-symbolic and opaque. A sub-symbolic approach is said to be more data-oriented and it follows the recent success of Deep Neural Networks (DNNs) in natural language processing and understanding. The current state of the art in natural language understanding is heavily based on this data-centred approach, and many models specifically applied to legalese have already been published. For example, in 2018 [3] published a framework for natural language processing and information extraction for legal and regulatory texts. In 2019 [5] proposed one of the first models for legal word embeddings. Earlier, in 2015, Kim et al. [10] presented one of the very first algorithms based on DNNs for Legal Question Answering (reasoning), applied to a dataset of Boolean questions from Japanese legal bar exams, then followed up by [7] and others [9, 12]. In 2020, Sovrano et al. [13] proposed a novel and hybrid approach for legal question answering on PIL, using a legal ontology based on Ontology Design Patterns (like agent, role, event, temporal parameter, action) in order to mirror the legal significance of the relationships within and among the provisions. More generally, automated reasoning over legal texts (not just the PIL ones) is not a trivial task, due to the fact that the legal jargon (legalese) is less frequent and more ambiguous than commonly-used natural language. This is probably the reason why some works have decided to focus on corpora, such as privacy policies [12], with a legal language that is more similar to its natural counterpart, or to focus on more argumentative texts (e.g. sentences, procedural documents, cross-examinations, parliamentary court reports) instead of legislative texts or contracts. Anyway, this challenge makes it hard to apply state-of-the-art sub-symbolic question answering algorithms on legislative texts, especially the PIL ones, because of data scarcity or novel topics introduced for the first time in the legal system (e.g., no historical series).

With this work we are interested in advancing automated answering to questions written in legalese and concerning PIL legislative texts. Our goal is to be able to properly evaluate canonical question answering techniques for PIL. This is why we try to expand the work presented by Sovrano et al. in [13], publishing a larger and more curated dataset extracted from Regulations such as: Rome I Regulation EC 593/2008; Rome II Regulation EC 864/2007; and Brussels I bis Regulation EU 1215/2012.

In Section 2 we describe our dataset and the methodology we followed to design it, while in Section 3 we analyse the results obtained by re-running the experiment of [13] on the new dataset, pointing to future work in Section 4.

2 A DATASET FOR EVALUATING LEGAL QUESTION ANSWERING ON PIL
In this Section, we explain how we expanded the dataset presented in [13], doubling its size. We improved over [13], publishing a larger and more curated dataset for the evaluation of automated question answering on PIL.

Both the old and the new dataset were extracted from the following Regulations, in English:
• Rome I Regulation EC 593/2008;
• Rome II Regulation EC 864/2007;
• and Brussels I bis Regulation EU 1215/2012.

These regulations are, respectively, on the law applicable to contractual obligations; on the law applicable to non-contractual obligations; and on jurisdiction and the recognition and enforcement of judgements in civil and commercial matters. These Regulations aim to provide a tool for identifying the applicable law and the jurisdiction in cases when two or more legal systems connect and generate complex relationships (e.g. a sale of goods contract between an Italian and a German citizen regarding commodities situated in Spain).

It is important to highlight the fact that, for the construction of the new dataset, we decided to inherit some methodological choices from [13], considering PIL as a subject simply from the point of view of these three EU Regulations, as a self-contained environment, i.e., excluding references to other international conventions and general principles. In this way it is possible to evaluate Q&A techniques with respect to their ability to handle the general principles in the recitals, the scope of application in the initial articles, and the specific cases (e.g. exceptions) in the other articles. The methodological choices we kept raised some issues with regard to the formulation of the questions and their relevance. Conceptual questions (e.g. "What is a non-contractual obligation?") cannot be fully answered by relying solely on these 3 Regulations, as the goal of this legislation - when considered atomistically - is limited to disciplining conflict of law and conflict of jurisdiction cases. While the Regulations, as with any other piece of legislation, rely somewhat on external definitions and legal concepts, including those derived from jurisprudence and opinions from commentators, they also define, intrinsically and specifically for their own purposes, key concepts (e.g. "judgement" in Art. 2 of Reg. Brussels I-bis). Therefore we decided to exclude any conceptual question but those involving key concepts defined within the Regulations.

The legal question answering tools we are interested in evaluating are meant to be used by practising lawyers, with reasonable - yet not expert - knowledge of PIL, to:
• explore the contents of the Regulations;
• get support in the reasoning concerning large Regulations.

The dataset for evaluating such tools shall comprise a set of questions, for each of which there is also a set of expected answers in the form of Articles, Recitals or Commission Statements3. Recitals are considered beside Articles because the user persona could be interested in prima facie interpretive tools emerging from the text itself, let alone the debated bindingness of Recitals. The dataset published in [13] was designed following a methodology that is similar to the one we are going to use for this extension. For the selection of the questions and the identification of the expected answers we adapted to our case a specific methodology encoded by Ashley and others in their works [1, 2, 6] during the last years. This methodology is common to other works in the field and it is meant to validate the experiment also from a legal perspective. In our case, the questions were selected by two legal experts, while two other independent legal experts matching our intended user persona were responsible for identifying the expected answers by relying solely on the verbatim information that can be found in the Regulations.

3 Rome II Regulation contains three Commission Statements meant to bind the EU Commission to publish studies on selected topics.
Therefore, legal experts were instructed to prevent case-law, general principles or scholarly opinions from influencing their answers, and were requested to avoid interaction with each other. As stated above, the research wants to model only the neutral legislative information from the three Regulations, without any interpretation other than the literal one. The inclusion of other knowledge will be left to further research. First, the experts read the three Regulations and answered the questions without any assistance from auxiliary sources, including tools and previous knowledge. Then, they were allowed to compare their answers with those provided by the tool for legal question answering, selecting tool-assisted correct answers and missing replies to be used to calculate performance scores in the later stages. Despite the efforts to draft interpretation-neutral questions, each independent expert has a certain margin of appreciation both when providing her/his answers and when assessing the correctness of the tool-provided answers. Therefore, another intervention was necessary when divergences in their evaluation occurred. When identifying the expected answers, the aggregation took into account only theoretical replies that were common between the two independent experts. This aggregation was conducted by one legal expert who disposes of a higher level of expertise in comparison to the independent evaluators, yet relying on the same criterion, i.e. literal interpretation only.

At the end of the process we got the 9 new questions shown in Table 2.

The questions were chosen with the following criteria: they had to be sufficiently specific to find an adequate answer in the Regulations (we avoided too broad or excessively conceptual questions); the questions needed not to be focused on specific cases but to have a reasonable level of abstraction (e.g., instead of "Where can an employee that carries out their work in Spain sue an employer located in Spain, if they had not agreed on the jurisdiction?", a question such as "Where can an employee sue their employer?"); the questions needed to be sufficiently different from one another (i.e., not asking repetitive questions such as "What is the applicable law in contracts of carriage?" and "What is the applicable law in insurance contracts?"). Some of the questions in the dataset are relatively similar to one another, with some of them being a more precise specification of another, such as "Which parties of a contract should be protected by conflict-of-law rules?" vs "What is the applicable rule to protect the weaker party of a contract?".

Questions in the dataset are not speculative or de iure condendo and are agnostic to elements that are placed outside the Regulations (e.g. jurisprudence, general principles, etc.). As such, they are not meant to nudge towards forms of interpretation other than the literal one (e.g. analogy, principle-based reasoning, lex specialis, etc.).

Furthermore, in order to be able to further analyse the results of any evaluation based upon our dataset, we decided to pick a heuristic for classifying questions, that is the context specificity (Low, Normal, High), and we applied it also to the old dataset we extended. Context specificity is a subjective concept and it is highly dependent on each jurist. For this reason, we opted to use a criterion that would ensure an acceptable level of objectivity. Thus, specific questions whose answer is exactly in the domain of the Regulations were labelled as Highly specific (e.g., "Can the parties choose a different applicable law for different parts of the contract?"); questions whose answer falls in part within the scope of the Regulations, but somewhat relies on external concepts, were labelled as Normally specific (e.g., "Which parties of a contract should be protected by conflict-of-law rules?"); finally, broad questions whose answer requires the significant use of external legal concepts and resources and whose answer is found through an articulate combination of articles and recitals were labelled as having Low specificity (e.g., "How should a contract be interpreted according to this regulation?").

Of the 17 questions that compose the new extended dataset, 29.41% have a Low specificity, 35.29% have a Normal specificity, and 35.29% have a High specificity.

3 DATASET ANALYSIS
In order to understand the behaviour of existing question answering tools on the new dataset, we repeated on it the experiment described in [13], changing the metrics used for the evaluation. Considering that we are not interested in the order in which answers are ranked, as metrics for estimating the performance of the algorithm we chose top5-recall, top5-precision and top5-F1, defined as follows. Let m be the number of strictly-correct answers that are produced as output by the algorithm, let |E| be the number of expected answers for a question, and let |A| be the number of given answers to a question. Then the top5-recall is given by m / min(|E|, 5), while the top5-precision is given by m / min(|A|, 5). The top5-recall is a measure of how many relevant answers are selected by the algorithm in the top five answers, while the top5-precision is a measure of how many selected answers in the top five are relevant. Knowing the top5-recall and the top5-precision, the top5-F1 score is computed as their harmonic mean, i.e. 2 · precision · recall / (precision + recall).
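As an illustration of these definitions (our own sketch, not the authors' evaluation code), the three scores for a single question can be computed as follows:

```python
def top5_scores(expected, given, correct):
    """Compute top5-recall, top5-precision and top5-F1 for one question.

    expected: set of expected answers E; given: list of returned answers A
    (at most the top five are considered); correct: number m of strictly
    correct answers among the returned ones.
    """
    recall = correct / min(len(expected), 5) if expected else 0.0
    precision = correct / min(len(given), 5) if given else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Example: first row of Table 1; its scores imply one strictly-correct answer.
expected = {"B Art. 7.1", "B Art. 8.3", "B Art. 8.4", "B Art. 17"}
given = ["RI Rec. 12", "B Art. 17.2", "RI Rec. 24"]
print(top5_scores(expected, given, correct=1))
# ≈ (0.25, 0.33, 0.29), close to Table 1's R: 25%, P: 33%, F1: 28.44%
# (the small difference in F1 comes from rounding).
```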
After running the experiment we computed the average top5-F1 for all the questions in the dataset presented in Section 2 (that is, the old dataset of [13] plus our new extension). The results on the whole dataset are a Top5-Recall of 37.58%, a Top5-Precision of 45.17% and a Top5-F1 of 38.05%.

We also performed an error analysis taking into consideration how top5-F1 scores vary when the context specificity changes, expecting that questions with low context specificity are harder to answer correctly.

The results partly confirmed our expectations. In fact, we can observe a trend where top5-F1 scores increase proportionally to the context specificity. Our expectations were based on the fact that:
• the specificity of a question is low when it asks something that is not closely related to the Regulations;
• multi-hop reasoning is usually required to answer questions with a low specificity, but the baseline is not equipped for that kind of reasoning (yet).

For example, the question "How should a contract be interpreted according to this regulation?" has a very low specificity and it would probably require pinpointing both recitals and articles for a proper answer, therefore more distinct and distant paragraphs. Probably, most of the speculative questions would require a broader view on the subject matter, having a low specificity to the Regulations, therefore requiring multi-hop reasoning.
Table 1: First block of answers (ordered by the pertinence to the question estimated by the tool) given by the baseline to the questions in [13]. "B" stands for Brussels, "RI" for Rome I and "RII" for Rome II. "Rec." stands for Recital, "Art." for Article, and "Stat." for Commission Statement. For each answer, the top5 scores (precision, recall, F1) are shown. In the "Scores" column, "P" stands for Precision and "R" stands for Recall. In the "Specificity" column, "L" stands for Low, "N" stands for Normal and "H" stands for High.

Question | Specificity | Expected Answers | Baseline's Top5 | Baseline's Scores
Who determines disputes under a contract? | L | B Art. 7.1, B Art. 8.3, B Art. 8.4, B Art. 17 | RI Rec. 12, B Art. 17.2, RI Rec. 24 | R: 25%, P: 33%, F1: 28.44%
What factors should be taken into account for conferring the jurisdiction to determine disputes under a contract? | N | B Art. 7.1, B Art. 17, B Art. 20, B Art. 25 | RI Rec. 12, B Art. 25, B Art. 25.5, B Rec. 15, RI Rec. 21 | R: 25%, P: 40%, F1: 30.76%
Which parties of a contract should be protected by conflict-of-law rules? | N | RI Rec. 23, RI Art. 6, RI Art. 8, RI Art. 13 | RI Rec. 23, B Rec. 18, RI Rec. 24, RI Art. 25.1, RI Rec. 27 | R: 25%, P: 20%, F1: 22.22%
In which case claims are so closely connected that it would be better to treat them together in order to avoid irreconcilable judgments? | H | B Art. 8, B Art. 30, B Art. 34 | B Art. 8.1 | R: 33%, P: 100%, F1: 49.62%
What kind of agreement between parties are regulated by these Regulations? | L | B Rec. 6, B Rec. 10, B Rec. 12, B Art. 1, RI Rec. 7, RI Art. 1 | B Art. 73.3, B Rec. 12, B Rec. 36, B Art. 71.2, B Art. 71.1 | R: 20%, P: 20%, F1: 20%
In which court is celebrated the trial in case the employer is domiciled in a Member State? | H | B Art. 21, B Art. 22, B Art. 23 | B Art. 21.1, B Art. 22.1, B Art. 21.2, B Art. 20.1, B Art. 20.2 | R: 66%, P: 60%, F1: 62.85%
How should a contract be interpreted according to this regulation? | L | RI Rec. 22, RI Rec. 12, RI Rec. 26, RI Rec. 29, RI Art. 12 | RI Art. 10.1, RI Rec. 17 | R: 0%, P: 0%, F1: 0%
Which law is applicable to a non-contractual obligation? | N | RII Rec. 17, RII Rec. 18, RII Rec. 26, RII Rec. 27, RII Rec. 31, RII Art. 4-20 | RI Art. 8.1, RII Art. 15, RII Art. 16, RII Art. 8.1, RII Rec. 22 | R: 60%, P: 60%, F1: 60%
4 CONCLUSIONS
With this paper we extended the work presented by Sovrano et al. in [13], proposing a larger and more curated dataset for the evaluation of automated question answering on PIL. In the future we will use these datasets for evaluating new algorithms for question answering, exploiting Akoma Ntoso XML4 models of the Regulations, for better capturing the relationships between different portions of the legal hierarchy (e.g. recitals connected via metadata to articles) and also for reusing as much as possible other legal metadata such as: i) temporal legal information concerning modifications that occurred over time, ii) life-cycle information concerning the history of the regulations, iii) normative references (citations). We also intend to make the question answering tool "aware" of the LegalRuleML ontology5 for better handling obligations, permissions, exceptions, derogations, and prohibitions.

4 http://docs.oasis-open.org/legaldocml/akn-core/v1.0/akn-core-v1.0-part1-vocabulary.html
5 http://docs.oasis-open.org/legalruleml/legalruleml-core-spec/v1.0/cs02/rdfs/

ACKNOWLEDGEMENTS
This work was conducted with the contribution of CIRSFID-Alma AI and DISI, University of Bologna (Interlex Project Grant Agreement Number 800839 and LAILA PRIN2017). The questions were selected by Biagio Distefano and Salvatore Sapienza, while Pier Giorgio Chiara and Noemi Conditi picked the expected answers, separately. All of them are PhD candidates at the "Law, Science and Technology" International PhD program.
Table 2: Second block of expected answers and answers given by the baseline. See the caption of Table 1 for more details about how to read this table.

Question | Specificity | Expected Answers | Baseline's Top5 | Baseline's Scores
Can the parties choose the applicable law in consumer contracts? | H | RI Rec. 11, RI Rec. 25, RI Rec. 27, RI Art. 6 | B Art. 18.2, B Art. 18.1, RI Rec. 28, RI Art. 5.2, RI Art. 6.2 | R: 25%, P: 20%, F1: 22.22%
What factors should be taken into account for conferring the jurisdiction to determine disputes under a consumer contract? | N | B Rec. 18, B Art. 17, B Art. 18, B Art. 19, B Art. 26 | RI Rec. 12, RI Rec. 24, B Art. 19, B Art. 17.1, B Art. 25.5 | R: 40%, P: 40%, F1: 40%
Can the parties choose a different applicable law for different parts of the contract? | L | RI Rec. 11, RI Art. 3.1 | RI Art. 3.1, RI Art. 5.2, RI Art. 7.3, RII Art. 25.2, RI Art. 22.2 | R: 50%, P: 20%, F1: 28.57%
What non-contractual obligations fall into the scope of Regulation Rome II? | H | RII Rec. 10, RII Rec. 11, RII Art. 1, RII Art. 2 | RII Stat. 1, RI Rec. 7 | R: 0%, P: 0%, F1: 0%
What is the applicable rule to protect the weaker party of a contract? | N | RI Rec. 23, B Rec. 18 | RI Rec. 23, B Rec. 18 | R: 100%, P: 100%, F1: 100%
What is the applicable law to determine the validity of consent? | L | RI Art. 3.5, RI Art. 10, RI Art. 11, RI Art. 13 | RI Art. 3.5, RI Art. 10.2, RI Art. 10.1, B Rec. 20 | R: 50%, P: 75%, F1: 60%
When are two actions to be considered related according to the Regulation Brussels I Bis? | N | B Rec. 21, B Art. 30.3 | (none) | R: 0%, P: 0%, F1: 0%
What court has jurisdiction in case of a counter-claim? | N | B Art. 8.3, B Art. 14.2, B Art. 18.3, B Art. 22.2 | B Art. 18.3, B Art. 14.2, B Art. 22.2, B Art. 8, B Art. 24 | R: 100%, P: 80%, F1: 88.88%
Where can an employee sue their employer? | H | B Rec. 14, B Rec. 18, B Art. 21.1, B Art. 22.1, B Art. 23 | B Art. 21.1 | R: 20%, P: 100%, F1: 33.33%

Figure 1: Average top5-F1 scores for each class of context specificity: Low, Normal, High. Scores are respectively: 27.40%, 42.16%, 42.81%. [Bar chart of Top5-F1 by Specificity (L, N, H).]

REFERENCES
[1] Kevin D Ashley. 2017. Artificial intelligence and legal analytics: new tools for law practice in the digital age. Cambridge University Press.
[2] Trevor Bench-Capon, Michał Araszkiewicz, Kevin Ashley, Katie Atkinson, Floris Bex, Filipe Borges, Daniele Bourcier, Paul Bourgine, Jack G Conrad, Enrico Francesconi, et al. 2012. A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law. Artificial Intelligence and Law 20, 3 (2012), 215–319.
[3] Michael J Bommarito II, Daniel Martin Katz, and Eric M Detterman. 2018. LexNLP: Natural language processing and information extraction for legal and regulatory texts. arXiv preprint arXiv:1806.03688 (2018).
[4] Pompeu Casanovas, Monica Palmirani, Silvio Peroni, Tom Van Engers, and Fabio Vitali. 2016. Semantic web for the legal domain: the next step. Semantic Web 7, 3 (2016), 213–227.
[5] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27, 2 (2019), 171–198.
[6] Jack G Conrad and John Zeleznikow. 2013. The significance of evaluation in AI and law: a case study re-examining ICAIL proceedings. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law. 186–191.
[7] Phong-Khac Do, Huy-Tien Nguyen, Chien-Xuan Tran, Minh-Tien Nguyen, and Minh-Le Nguyen. 2017. Legal question answering using ranking SVM and deep convolutional neural network. arXiv preprint arXiv:1703.05320 (2017).
[8] Meritxell Fernández-Barrera and Giovanni Sartor. 2011. The legal theory perspective: doctrinal conceptual systems vs. computational ontologies. In Approaches to Legal Ontologies. Springer, 15–47.
[9] Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2020. A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering. arXiv preprint arXiv:2005.05257 (2020).
[10] Mi-Young Kim, Ying Xu, and Randy Goebel. 2015. A convolutional neural network in legal question answering. In JURISIN Workshop.
[11] Friedrich V Kratochwil. 1991. Rules, norms, and decisions: on the conditions of practical and legal reasoning in international relations and domestic affairs. Number 2. Cambridge University Press.
[12] Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and Norman Sadeh. 2019. Question answering for privacy policies: Combining computational and legal perspectives. arXiv preprint arXiv:1911.00841 (2019).
[13] Francesco Sovrano, Monica Palmirani, and Fabio Vitali. 2020. Legal Knowledge Extraction for Knowledge Graph Based Question-Answering. In Legal Knowledge and Information Systems: JURIX 2020. The Thirty-third Annual Conference, Vol. 334. IOS Press, 143–153.
Discovering the Rationale of Decisions: Towards a Method for Aligning Learning and Reasoning

Cor Steging (c.c.steging@rug.nl), Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence, University of Groningen
Silja Renooij (s.renooij@uu.nl), Department of Information and Computing Sciences, Utrecht University
Bart Verheij (bart.verheij@rug.nl), Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence, University of Groningen
ABSTRACT
In AI and law, systems that are designed for decision support should be explainable when pursuing justice. In order for these systems to be fair and responsible, they should make correct decisions and make them using a sound and transparent rationale. In this paper, we introduce a knowledge-driven method for model-agnostic rationale evaluation using dedicated test cases, similar to unit-testing in professional software development. We apply this new quantitative human-in-the-loop method in a machine learning experiment aimed at extracting known knowledge structures from artificial datasets from a real-life legal setting. We show that our method allows us to analyze the rationale of black box machine learning systems by assessing which rationale elements are learned or not. Furthermore, we show that the rationale can be adjusted using tailor-made training data based on the results of the rationale evaluation.

CCS CONCEPTS
• Applied computing → Law.

KEYWORDS
Learning knowledge from data, Explainable AI, Responsible AI, Machine Learning

ACM Reference Format:
Cor Steging, Silja Renooij, and Bart Verheij. 2021. Discovering the Rationale of Decisions: Towards a Method for Aligning Learning and Reasoning. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21-25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466059

1 INTRODUCTION
In AI and Law, explainability is a key requirement in system design, due to the need for the justification of decisions. For machine-supported decisions, this is encoded in the GDPR's right to explanation. Four types of explanations can be distinguished, all of which have been applied to AI and Law [Atkinson et al. 2020a]: contrastive explanations [Ashley 1990; Rissland and Ashley 1987; Verheij 2003a], selective explanations [Atkinson et al. 2020b; Verheij 2003b], probabilistic explanations [Vlek et al. 2016] and social explanations [Atkinson et al. 2020b; Gordon 1995; Hage et al. 1993].
This requirement of explainability is problematic for the application of central machine learning techniques in law. Neural networks, for example, are known to perform well, but behave like a black box algorithm. Hence, explanation techniques have been developed to 'open the black box' (cf. LIME [Ribeiro et al. 2016], SHAP [Lundberg and Lee 2017]). Even in the domain of vision (where the successes of deep learning are especially significant), the necessity of such methods is underpinned by studies regarding adversarial attacks that show that slight perturbations of images, invisible to the human observer, can radically change the outcome of a classifier [Goodfellow et al. 2015].
In this paper, we therefore evaluate black box machine learning methods with a focus on proper explainability, and not only in terms of accuracy as in the standard machine learning protocol. We are in particular interested in evaluating the discovered rationale underlying decisions, where the rationale is the knowledge structure that can justify a decision, such as the rule applied. We aim to measure the quality of rationale discovery, with an eye on the possibility of improving rationale discovery.
Our work builds on a study investigating whether neural networks are able to tackle open texture problems [Bench-Capon 1993] (also investigated in [Možina et al. 2005; Wardeh et al. 2009]). To measure and possibly improve rationale discovery, we create dedicated test datasets, on which a machine learning system can only perform well if it has learned a particular component of the knowledge structure that defined the data. The idea is similar to how unit testing works in professional software development: we define a set of cases, targeting a specific component, in which we know what the answer should be, and compare that to the output that the system gives.
In order to focus on what is methodologically feasible, we do not use natural language corpora (such as conceptual retrieval [Grabmair et al. 2015], argument mining [Mochales Palau and Moens 2009; Wyner et al. 2010] or case prediction [Ashley 2019; Brüninghaus and Ashley 2003; Medvedeva et al. 2019]). Instead we work with datasets of artificial decisions with known underlying generating rationale. Other earlier discussions of neural networks in law are [Hunter 1999; Philipps and Sartor 1999; Stranieri et al. 1999].

2 REPLICATION EXPERIMENT
The first step towards developing our method for rationale evaluation was replicating the study by Bench-Capon [1993]. This was done using modern, widely-used neural network methods and with


significantly larger datasets, in order to reaffirm that the claims


made in 1993 still hold today. The study introduces a fictional legal
domain, where the eligibility for a welfare benefit for elderly citi-
zens is determined by the conjunction of six independent conditions.
Artificial datasets were generated specifying personal information
of elderly citizens with their eligibility for the welfare benefit. Mul-
tilayer perceptrons were trained and tested on these datasets, and
managed to achieve high accuracy scores (above 98%).
Using special test datasets, it was shown that the neural net-
works were unable to properly learn the first and the last of the
six conditions. Furthermore, the networks performed significantly
worse when ineligibility was caused by the failure of only a sin-
gle condition. The training data was therefore altered such that
ineligible people only failed on a single condition, rather than on
multiple conditions as in the original training dataset. By making
these adjustments to the training dataset, the neural networks were
able to learn conditions more adequately, while maintaining similar
accuracy scores. However, even after adjustment, the conditions
that defined the data were not learned perfectly.
In our replication, we discovered that even with more data and modern, commonly used neural networks, the nets are still unable to learn all six conditions that define eligibility, despite high accuracies (99%). Using two dedicated datasets (as also defined in [Bench-Capon 1993]), it was shown that the nets still did not learn the first and the last condition. The rationale of the nets is therefore not sound, despite high accuracies. Just as in the original study, adjusting the training data based on expert knowledge of the domain significantly improves the rationale of the net without seriously impacting the accuracy.
Additionally, we created a simplified version of the domain, containing only the first and last condition, to see how well the networks are able to extract a simplified rationale. In this domain, the networks were able to learn both conditions.
The methods and results of the replication experiment, as well as its variations, can be found in detail in [Steging et al. 2021].

3 TORT LAW: DOMAIN AND DATASETS
Following up on the fictional welfare benefit domain, we study a non-fictional legal setting, namely Dutch tort law. This domain uses only Boolean variables, but allows for exceptions to underlying rules. This section describes the underlying knowledge structure of the Tort Law domain using logic, from which we will generate datasets to train a series of neural networks. These networks will subsequently be analysed using a method we propose for assessing the quality of their rationale discovery. To this end we need two types of datasets for the purpose of testing. The first are standard test sets sampled from the complete domain to evaluate the accuracy of the networks. The second type is a dedicated test set designed to target a specific aspect of the domain knowledge. This section describes all datasets we use.

3.1 Domain
Our domain concerns Dutch tort law: articles 6:162 and 6:163 of the Dutch civil code that describe when a wrongful act is committed and resulting damages must be repaired. This 'duty to repair' (dut) can be formalised as follows:

dut(x) ⟺ c1(x) ∧ c2(x) ∧ c3(x) ∧ c4(x) ∧ c5(x)
c1(x) ⟺ cau(x)
c2(x) ⟺ ico(x) ∨ ila(x) ∨ ift(x)
c3(x) ⟺ vun(x) ∨ (vst(x) ∧ ¬jus(x)) ∨ (vrt(x) ∧ ¬jus(x))
c4(x) ⟺ dmg(x)
c5(x) ⟺ ¬(vst(x) ∧ ¬prp(x))

where the elementary propositions are provided alongside an argumentative model of the law in Figure 1 [Verheij 2017], and conditions c2 and c3 capture the legal notions of unlawfulness (unl) and imputability (imp) respectively.

Figure 1: Arguments and attacks (A) and their elementary propositions (B) in Dutch tort law [Verheij 2017].

Compared to the fictional welfare domain in [Bench-Capon 1993] and our replication variations [Steging et al. 2021], the Dutch tort law domain is captured in 5 conditions for duty to repair (dut), based upon 10 Boolean features. Each condition is a disjunction of one or more features, possibly with exceptions. The feature capturing a violation of a statutory duty (vst) is present in both condition c3 and c5, rendering these dependent.

3.2 Datasets
We generate four different types of datasets, each for different purposes.¹ For most types of datasets, the generating process is at least partly stochastic and repeated for every repetition of an experiment. Using the same type of dataset, for example in training and testing a neural network, does therefore not mean that the exact same dataset was used in both training and testing. Table 1 shows an overview of the datasets of the tort law domain.
With 10 Boolean features there are 2^10 = 1024 possible unique cases that can be generated from the argumentation structure of the tort law domain in Figure 1. Each case has a corresponding outcome for dut, indicating whether or not there is a duty to repair someone's damages.

¹ The Jupyter notebooks used for generating the data can be found in the following Github repository: https://github.com/CorSteging/DiscoveringTheRationaleOfDecisions
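As an illustration (not the authors' published notebooks; the variable and helper names below are our own assumptions), the formalisation above can be encoded directly and the 2^10 = 1024 unique cases enumerated. With this encoding, 112 of the 1024 cases receive a positive dut label, which matches the distribution reported for the unique dataset:

```python
# A minimal sketch of the tort law domain above: encode conditions c1-c5 over the
# 10 Boolean features and enumerate all unique cases. Names are illustrative only.
from itertools import product
import random

FEATURES = ["cau", "ico", "ila", "ift", "vun", "vst", "vrt", "jus", "dmg", "prp"]

def duty_to_repair(case):
    c1 = case["cau"]
    c2 = case["ico"] or case["ila"] or case["ift"]
    c3 = case["vun"] or (case["vst"] and not case["jus"]) or (case["vrt"] and not case["jus"])
    c4 = case["dmg"]
    c5 = not (case["vst"] and not case["prp"])
    return c1 and c2 and c3 and c4 and c5

# The 'unique' dataset: all 2**10 = 1024 feature combinations with their dut label.
unique = []
for values in product([False, True], repeat=len(FEATURES)):
    case = dict(zip(FEATURES, values))
    case["dut"] = duty_to_repair(case)
    unique.append(case)

positives = [c for c in unique if c["dut"]]
negatives = [c for c in unique if not c["dut"]]
print(len(unique), len(positives))  # 1024 cases, 112 of which have dut = True (10.94%)

# A 'regular'-style training set with a balanced 50%/50% label distribution, drawn
# uniformly from the unique cases (a sketch of the idea, not the exact procedure).
def regular_dataset(n, rng=random.Random(0)):
    half = n // 2
    return rng.choices(positives, k=half) + rng.choices(negatives, k=half)

train = regular_dataset(5000)
```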


Table 1: An overview of the tort law datasets. Datasets marked with an asterisk are used for testing purposes only. For each type of dataset, the size and label distribution is given.

Dataset         Size        T/F label distribution
Regular         5,000/500   50%/50%
Unique*         1024        10.94%/89.06%
Unlawfulness*   168         66.67%/33.33%
Imputability*   128         87.5%/12.5%

The unique dataset contains these 1024 unique instances for all the 10 features plus the label. In this dataset, there are 912 instances where dut is false and 112 instances where dut is true (10.94%).
The regular type datasets are generated such that dut is true in exactly half of the instances. The sets are regular in the sense that balanced label distributions are common in machine learning problems. These regular datasets are generated by sampling uniformly from the subset of cases from the unique dataset, such that each possible case is represented equally within the 50/50 label distribution. In a typical machine learning experiment, only a subset of the possible cases is typically available and presented to a network, upon which the network will have to learn to generalize to all possible cases. In addition to generating regular type datasets with 5,000 cases, we therefore also generate smaller regular type datasets with only 500 instances; the latter contains 35.35% of the unique instances.
In the tort law domain we focus on the notions of unlawfulness (c2) and imputability (c3) to assess whether the networks are able to discover conditions in the data. For each of the two conditions, we create a dedicated dataset.
The Unlawfulness dataset is the subset of the unique dataset in which the features for the unlawfulness condition c2 can take on any of their values, while the other features have values that are guaranteed to satisfy the remaining conditions. Whether or not there is a duty to repair is therefore solely determined by whether or not condition c2 is satisfied. All combinations of values of the other features are considered. The Unlawfulness dataset therefore consists of 168 unique instances, of which 66.67% have a positive dut value.
The Imputability dataset is a similar subset of the unique dataset, but now the features for the imputability condition (c3) can take on any value, except that the value of vst must be such that condition c5 is satisfied. The value of dut(x) is now completely dependent on whether or not condition c3 evaluates to true. Due to the interdependency of conditions c3 and c5, the Imputability dataset only has 128 unique instances, 87.5% of which have a positive dut value.

4 EXPERIMENTAL SETUP AND RESULTS
In this section we describe and motivate the experiments we performed for the tort law domain and report on their results.

4.1 Experiments
We decide to use neural networks like in [Bench-Capon 1993]. The method is model-agnostic, however, meaning that it can be applied to any other machine learning model as well. We assume that assessing and improving rationale discovery is relevant only for models that perform well on their respective task. Our first step, after training the above mentioned neural networks, is therefore to evaluate their performance on typical test sets in terms of the standard accuracy measure. Subsequently we will evaluate the performance of the networks on the dedicated, knowledge-driven test sets that were specifically designed for assessing the networks' quality of rationale discovery.

4.1.1 Neural network architectures. Similar to the original experiments, three multilayer perceptrons were used with one, two and three hidden layers, respectively [Bench-Capon 1993]. The nets have 10 input nodes, corresponding to the number of features, and a single output node, representing duty to repair. The node configuration (i.e. number of nodes per layer) of each network is as follows:
• One hidden layer network: 10-12-1
• Two hidden layer network: 10-24-6-1
• Three hidden layer network: 10-24-10-3-1
We use the MLPClassifier of the scikit-learn package [Pedregosa et al. 2011], the sigmoid function as the activation function, the Adam stochastic gradient-based optimizer [Kingma and Ba 2015], with a constant learning rate of 0.001. A total of 50,000 training iterations are used with a batch size of 50. Recall that the focus of this study is not on creating the best possible classifier, but to assess rationale discovery.

4.1.2 Training and performance testing. The three types of neural networks are trained and tested on all combinations of different datasets from Table 1. Every combination of training dataset and testing dataset is evaluated in terms of the accuracy of the resulting network on the test data. Because some of the datasets are stochastic (each generated dataset is slightly different), the whole process of data generation, training and testing is repeated 50 times. The mean classification accuracies along with their standard deviations are reported. To assess the rationale discovery capabilities of all the trained networks, we study their performance on the dedicated test sets for unlawfulness and imputability conditions. Performance is measured both quantitatively, using standard accuracy, and qualitatively by a more detailed comparison of actual and expected outcomes.

4.2 Results
Table 2 shows the mean classification accuracies over 50 runs, together with their standard deviations, for the different combinations of training and testing sets in the tort law domain. The table includes the quantitatively measured performance on the two dedicated test sets.
We can evaluate how well conditions c2 (unlawfulness) and c3 (imputability) are learned. For these conditions, the network should output 1 in cases from the Unlawfulness dataset where the case is unlawful (c2), or in the Imputability dataset where the case can be imputed to a person (c3); otherwise the output should be 0. The mean output of the 3 layer network over 50 runs for the two training sets on the Unlawfulness and Imputability datasets is presented in Table 3.
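A sketch of how the three networks described above can be instantiated with scikit-learn's MLPClassifier is shown below. The hyperparameter values follow the text; how the 50,000 training iterations map onto MLPClassifier's epoch-based max_iter is an assumption, and the snippet reuses FEATURES, unique and train from the earlier data-generation sketch. It is an illustration, not the authors' training script.

```python
# Sketch of the three multilayer perceptrons described above using scikit-learn.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import numpy as np

architectures = {
    "1 hidden layer": (12,),
    "2 hidden layers": (24, 6),
    "3 hidden layers": (24, 10, 3),
}

def make_net(hidden_layers):
    return MLPClassifier(
        hidden_layer_sizes=hidden_layers,
        activation="logistic",     # sigmoid activation
        solver="adam",             # Adam optimizer
        learning_rate_init=0.001,  # constant learning rate of 0.001
        batch_size=50,
        max_iter=500,              # assumption: how the 50,000-iteration budget is expressed
        random_state=0,
    )

def to_xy(cases):
    X = np.array([[case[f] for f in FEATURES] for case in cases], dtype=float)
    y = np.array([case["dut"] for case in cases], dtype=int)
    return X, y

X_train, y_train = to_xy(train)      # 'train', 'unique' and FEATURES as in the earlier sketch
net = make_net(architectures["3 hidden layers"]).fit(X_train, y_train)

X_unique, y_unique = to_xy(unique)
print("accuracy on unique cases:", accuracy_score(y_unique, net.predict(X_unique)))
```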


Table 2: The accuracies obtained by the neural networks in the tort law domain.

                  Trained on all instances                              Trained on smaller dataset
                  General      Unique       Unlawfulness  Imputability  General      Unique       Unlawfulness  Imputability
1 hidden layer    100±0        100±0        100±0         100±0         98.45±0.5    97.24±0.89   92.8±3.47     91.22±4.04
2 hidden layers   100±0        100±0        100±0         100±0         99.03±0.44   98.27±0.78   95.71±3.1     94.38±3.84
3 hidden layers   99.86±0.37   99.76±0.66   99.67±1.83    99.5±1.56     98.23±0.72   96.83±1.28   92.96±5.33    91.45±3.51

Table 3: Mean network output on the Unlawfulness and Imputability datasets versus the logical evaluation of the unlawfulness resp. imputability conditions.

                  Trained on all instances   Trained on smaller dataset
Unlawfulness      Output                     Output
False             0                          0.018
True              1                          1
Imputability      Output                     Output
False             0                          0.875
True              1                          1

5 DISCUSSION
5.1 Standard Accuracy
Standard accuracy is measured to see whether the learned models are able to solve the classification problem, regardless of whether or not they discovered the rationale underlying the data. We find accuracies of 100% or near 100% for networks trained on all instances (see Table 2). When presented with all unique instances, the networks with one and two hidden layers are able to perfectly predict the outcome from Dutch tort law, and the network with three hidden layers can create a very close approximation.
Presenting a neural network with all available cases is in practice often infeasible. If it is possible, then a simple lookup table rather than a neural network would most likely suffice. For this reason, we also trained the networks on a subset of only around 35% of the unique instances (see Table 2). As expected, the accuracies of the networks on the general test sets drop, but only slightly (to 98-99%). Even on the unique test set, accuracies remain around 96%. This suggests that it is possible for the models to approximate tort law with a small subset of the unique cases.

5.2 Rationale Discovery
Looking at the performance of the networks on the dedicated test sets partially exposes how well the rationale is captured by the network. We designed these test sets such that each one targets a single condition from the domain. In addition to considering the accuracy on these dedicated test sets, we qualitatively evaluate the rationale discovery capabilities of the networks by comparing their outputs with the actual outputs we would ideally expect for the different domains.
Recall that on the Imputability dataset, networks should output 1 if the act is imputable to the person, and 0 otherwise; on the Unlawfulness dataset, the networks should output 1 if the case is unlawful, and 0 otherwise. Table 3 shows how well the networks were able to internalize the notions of unlawfulness and imputability. When trained on all instances, the mean output of the networks is 0 if the logical evaluation of unlawfulness is false, and 1 if it is true, which is exactly what it should do. Networks trained on all instances attain a perfect score on the Imputability dataset as well. This can also be seen in Table 2, where the networks score 100% accuracy on the Unlawfulness and Imputability datasets after training on all instances.
With less data, however, accuracies drop to around 92-95% for the Unlawfulness dataset and 91-94% for the Imputability dataset. This accuracy may still seem high, but we should take into account the label distributions (66.67-33.33% and 87.5-12.5%, respectively). Table 3 shows that networks still perform perfectly on cases in which the unlawfulness and imputability conditions evaluate to true. When the conditions are false, however, mistakes are made. The average output of networks on the Unlawfulness dataset increases to 0.018, which should be 0, meaning that networks classify some lawful cases as unlawful. In the Imputability dataset, the mean output increased more drastically to 0.875 when imputability is false, meaning that in 87.5% of the instances in which the act is not imputable to a person, the network incorrectly decided that it should be. This means that despite high accuracy on the general test set, the networks largely ignored the concept of imputability.

5.3 A Method for Rationale Evaluation
Although our experiments and discussion focused on specific example domains and neural networks, our approach for rationale evaluation can be interpreted as a general method independent of the machine learning algorithm applied. Building on the results of this paper, we therefore propose a knowledge-driven method for model-agnostic rationale evaluation, consisting of three distinct steps:
(1) Measure the accuracy of a trained system, and proceed if the accuracy is sufficiently high;
(2) Design dedicated test sets for rationale evaluation targeting selected rationale elements based on expert knowledge of the domain;
(3) Evaluate the rationale through the performance of the trained system on these dedicated test sets.
The first step is based on the assumption that efforts for assessing and possibly improving the rationale discovery capabilities of a learned model are only taken if the general performance of the model is already considered good enough. Here we assume performance is measured using accuracy, but other measures can be employed as well and the threshold of what is considered good enough may vary per domain and application.
The second step in our method depends on domain knowledge. Hence the method effectively is a quantitative human-in-the-loop solution for rationale evaluation.
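Read as a whole, the three steps amount to a unit-test-style harness around any trained classifier. The sketch below is our illustration of that idea; the function name, report structure and accuracy threshold are assumptions, not part of the paper.

```python
# Sketch of the three-step rationale evaluation as a unit-test-style harness.
import numpy as np

def evaluate_rationale(model, X_general, y_general, dedicated_sets, threshold=0.95):
    # Step 1: only proceed if the model performs well on a general test set.
    general_acc = np.mean(model.predict(X_general) == y_general)
    if general_acc < threshold:
        return {"general_accuracy": general_acc, "rationale": "not evaluated"}

    # Steps 2-3: each dedicated test set targets one rationale element (e.g. one
    # condition of the domain); compare predictions with the expected outputs.
    report = {"general_accuracy": general_acc}
    for name, (X_ded, y_ded) in dedicated_sets.items():
        pred = model.predict(X_ded)
        report[name] = {
            "accuracy": np.mean(pred == y_ded),
            "mean_output_when_condition_true": float(pred[y_ded == 1].mean()),
            "mean_output_when_condition_false": float(pred[y_ded == 0].mean()),
        }
    return report

# Hypothetical usage with the tort law sketches above (X_unl, y_unl, X_imp, y_imp
# would be the dedicated Unlawfulness and Imputability sets):
# report = evaluate_rationale(net, X_unique, y_unique,
#                             {"unlawfulness": (X_unl, y_unl),
#                              "imputability": (X_imp, y_imp)})
```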


In the third step, performance is again evaluated, by now not only considering accuracy but also examining model output and expected output in terms of the dedicated test sets.
The method does not currently specify how the dedicated test sets are constructed. We aim to further operationalize the rationale evaluation method by using information about the knowledge in the domain, and the distribution of examples, for instance building on Bayesian networks. Subsequently, the information gained by using this rationale evaluation method can be used to improve the rationale of the system by adjusting the training data accordingly, such as in [Bench-Capon 1993] and our replication variants [Steging et al. 2021], effectively allowing us to impose sound rationale discovery.

6 CONCLUSION
The work in this paper was inspired by Bench-Capon's 1993 paper that investigated whether neural networks are able to tackle open texture problems. The conclusions were that trained networks can perform very well in terms of accuracy, even though some conditions from the domain are not learned [Bench-Capon 1993]. Similar results were found when we repeated the experiments with larger training datasets, in order to ensure that the original conclusions about conditions that were not learned are not due to a lack of data.
The idea of constructing test cases to test specific conditions inspired us to propose a method for assessing rationale discovery capabilities by designing dedicated test datasets and to evaluate performance on these knowledge-driven test sets, combining quantitative and qualitative evaluation elements in a hybrid way. Adjusting the training dataset based on this evaluation method demonstrates that the rationale can be improved using knowledge-driven tailor-made training sets [Bench-Capon 1993; Steging et al. 2021].
In the real life tort law domain, with a non-fictional knowledge structure and different characteristics, a similar pattern can be observed as before: the networks failed to learn the independent condition that defines imputability, despite high accuracies on the general test set.
This study therefore reaffirms the conclusions from previous work, while simultaneously introducing a model-agnostic method for assessing rationale discovery capabilities of machine-learned black box models, using dedicated test datasets designed with expert knowledge of the domain. In future research, we aim to further detail and extend our method such that by employing it, the soundness of the rationale underlying system decisions becomes tangible, and its quality can be asserted. Based on this evaluation, the training data of the black-box systems can be altered to improve their rationale. Further expanding upon this design method will bring us closer to AI that is both explainable and responsible.

ACKNOWLEDGMENTS
This research was funded by the Hybrid Intelligence Center, a 10-year programme funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research, https://hybrid-intelligence-centre.nl.

REFERENCES
K. D. Ashley. 1990. Modeling Legal Arguments: Reasoning with Cases and Hypotheticals. The MIT Press, Cambridge (Massachusetts).
K. D. Ashley. 2019. A brief history of the changing roles of case prediction in AI and law. Law in Context 36, 1 (2019), 93-112.
K. Atkinson, T. Bench-Capon, F. Bex, T. F. Gordon, H. Prakken, G. Sartor, and B. Verheij. 2020b. In memoriam Douglas N. Walton: the influence of Doug Walton on AI and law. Artificial Intelligence and Law (2020), 1-46.
K. Atkinson, T. Bench-Capon, and D. Bollegala. 2020a. Explanation in AI and law: Past, present and future. Artificial Intelligence 289 (2020), 103387.
T. Bench-Capon. 1993. Neural networks and open texture. In Proceedings of the 4th International Conference on Artificial Intelligence and Law (ICAIL '93). ACM, New York, 292-297.
S. Brüninghaus and K. D. Ashley. 2003. Predicting outcomes of case based legal arguments. In Proceedings of the 9th International Conference on Artificial Intelligence and Law (ICAIL 2003). ACM, New York (New York), 233-242.
I. J. Goodfellow, J. Shlens, and C. Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of International Conference on Learning Representations.
T. F. Gordon. 1995. The Pleadings Game: An Artificial Intelligence Model of Procedural Justice. Kluwer, Dordrecht.
M. Grabmair, K. D. Ashley, R. Chen, P. Sureshkumar, C. Wang, E. Nyberg, and V. R. Walker. 2015. Introducing LUIMA: an experiment in legal conceptual retrieval of vaccine injury decisions using a UIMA type system and tools. In Proceedings of the 15th International Conference on Artificial Intelligence and Law. ACM, New York (New York), 69-78.
J. C. Hage, R. Leenes, and A. R. Lodder. 1993. Hard cases: a procedural approach. Artificial Intelligence and Law 2, 2 (1993), 113-167.
D. Hunter. 1999. Out of their minds: Legal theory in neural networks. Artificial Intelligence and Law 7, 2 (1999), 129-151.
D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of 3rd International Conference on Learning Representations.
S. M. Lundberg and S. Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 4765-4774.
M. Medvedeva, M. Vols, and M. Wieling. 2019. Using machine learning to predict decisions of the European Court of Human Rights. Artificial Intelligence and Law (2019), 1-30.
R. Mochales Palau and M. F. Moens. 2009. Argumentation mining: the detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL 2009). ACM Press, New York (New York), 98-107.
M. Možina, J. Žabkar, T. Bench-Capon, and I. Bratko. 2005. Argument based machine learning applied to law. Artificial Intelligence and Law 13, 1 (2005), 53-73.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
L. Philipps and G. Sartor. 1999. Introduction: from legal theories to neural networks and fuzzy reasoning. Artificial Intelligence and Law 7, 2 (1999), 115-128.
M. T. Ribeiro, S. Singh, and C. Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. 1135-1144.
E. L. Rissland and K. D. Ashley. 1987. A case-based system for trade secrets law. In Proceedings of the 1st International Conference on Artificial Intelligence and Law (ICAIL '87). ACM, New York, NY, USA, 60-66.
C. Steging, S. Renooij, and B. Verheij. 2021. Discovering the Rationale of Decisions: Experiments on Aligning Learning and Reasoning. arXiv:2105.06758 [cs.AI]
A. Stranieri, J. Zeleznikow, M. Gawler, and B. Lewis. 1999. A hybrid rule-neural approach for the automation of legal reasoning in the discretionary domain of family law in Australia. Artificial Intelligence and Law 7, 2-3 (1999), 153-183.
B. Verheij. 2003a. Artificial Argument Assistants for Defeasible Argumentation. Artificial Intelligence 150, 1-2 (2003), 291-324.
B. Verheij. 2003b. Dialectical argumentation with argumentation schemes: An approach to legal logic. Artificial Intelligence and Law 11, 2-3 (2003), 167-195.
B. Verheij. 2017. Formalizing arguments, rules and cases. In Proceedings of the 16th International Conference on Artificial Intelligence and Law (ICAIL '17). ACM, New York, 199-208.
C. S. Vlek, H. Prakken, S. Renooij, and B. Verheij. 2016. A method for explaining Bayesian networks for legal evidence with scenarios. Artificial Intelligence and Law 24, 3 (2016), 285-324.
M. Wardeh, T. Bench-Capon, and F. Coenen. 2009. Padua: a protocol for argumentation dialogue using association rules. Artificial Intelligence and Law 17, 3 (2009), 183-215.
A. Wyner, R. Mochales-Palau, M. F. Moens, and D. Milward. 2010. Approaches to text mining arguments from legal cases. In Semantic Processing of Legal Texts. Springer, Berlin, 60-79.

Process Mining-Enabled Jurimetrics
Analysis of a Brazilian Court’s Judicial Performance in the Business Law Processing

Adriana Jacoto Unger, José Francisco dos Santos Neto, Marcelo Fantinato, Sarajane Marques Peres ({ajacoto,jose.francisco.neto,m.fantinato,sarajane}@usp.br), University of Sao Paulo, Sao Paulo, Brazil
Julio Trecenti, Renata Hirota ({jtrecenti,rhirota}@abj.org.br), Brazilian Jurimetrics Association, Sao Paulo, Brazil
ABSTRACT
Improving judicial performance has become increasingly relevant to guarantee access to justice for all, worldwide. In this context, technology-enabled tools to support lawsuit processing emerge as powerful allies to enhance the justice efficiency. Using electronic lawsuit management systems within the courts of justice is a widespread practice, which also leverages production of big data within judicial operation. Some jurimetrics techniques have arisen to evaluate efficiency based on statistical analysis and data mining of data produced by judicial information systems. In this sense, the process mining area offers an innovative approach to analyze judicial data from a process-oriented perspective. This paper presents the application of process mining in an event log derived from a dataset containing business lawsuits from the Court of Justice of the State of Sao Paulo, Brazil, the largest court in the world, in order to analyze judicial performance. Although the results show these lawsuits have an ad hoc sequence flow, process mining analysis has allowed us to identify the most frequent activities and process bottlenecks, providing insights into the root causes of inefficiencies.

CCS CONCEPTS
• Applied computing → Law; Business process management; Business intelligence; • Information systems → Data mining.

KEYWORDS
Process mining, jurimetrics, judicial performance, administration of justice, legal informatics, business process management, procedural law, business law.

ACM Reference Format:
Adriana Jacoto Unger, José Francisco dos Santos Neto, Marcelo Fantinato, Sarajane Marques Peres, Julio Trecenti, and Renata Hirota. 2021. Process Mining-Enabled Jurimetrics: Analysis of a Brazilian Court's Judicial Performance in the Business Law Processing. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21-25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466137

1 INTRODUCTION
Within the administration of justice, electronic lawsuits and management information systems emerged to control lawsuit processing in the courts of justice. These systems allow the practice of procedural acts by magistrates and other participants in lawsuits in a fully digital environment. Since 2015, the Court of Justice of the State of Sao Paulo (TJSP¹), Brazil, is 100% digital, i.e., all new lawsuits are born digital. TJSP is considered the largest court in the world in terms of volume of lawsuits², with over 19 million ongoing lawsuits, managed by the Justice Automation System (e-SAJ³). Data-centric analysis techniques, such as jurimetrics [11] and machine learning [14], take advantage of the increasing availability of judicial data. Jurimetrics supports statistics-based analysis on justice big data. In this sense, process mining [20] emerges as an approach to bridge data mining to Business Process Management (BPM), providing a process-oriented perspective which turns out to be more valuable when analyzing phenomena distinguished by procedural behaviour, such as in lawsuit processing ruled by procedural law.
This paper presents an innovative application of process mining to analyze judicial performance based on a lawsuit dataset, as a proof of concept for process mining-enabled jurimetrics. The experiment was restricted to business law, as it is an area of law that commonly presents poor judicial performance and difficulties in diagnosis due to its high heterogeneity. This study raises and highlights contributions of process mining showing the benefits of a process-oriented approach to analyzing legal data for performance diagnostics. The paper is divided as follows: background, presenting the theoretical framework of this work; related work, pointing out prior research related to this topic; research method, describing the stages of this research study; results, showing results of the process mining application; analysis of the results, presenting process mining-enabled analysis of lawsuit processing; and conclusion, which highlights the research findings.

2 BACKGROUND
This section introduces the main theoretical concepts related to this work, contextualized in the Brazilian setting.

¹ Tribunal de Justiça de São Paulo (in Portuguese)
² http://www.tjsp.jus.br/QuemSomos
³ Sistema de Automação da Justiça (in Portuguese)


2.1 Judicial Performance Table 1: Process mining in procedural law


Judicial performance is a field of study dedicated to evaluate per-
formance of judicial systems and courts of justice [8]. Some inter- Process mining Procedural law
national organizations conducted studies to measure and compare Activity Name of the lawsuit procedural movement.
judicial performance worldwide [10, 17]. In Brazil, the National Event Occurrence of a procedural movement, i.e., oc-
Council of Justice provides the Justice in Numbers annual report currence of a lawsuit’s procedural progress.
[5], offering an overview of the judiciary productivity. Timestamp Date of occurrence of the lawsuit’s procedural
Current models for judicial performance evaluation contributed movement.
substantially to the continuous measurement, monitoring and com- Case Lawsuit.
parison of judicial performance. However, such evaluation models Case ID Lawsuit identifier.
have done little to guide improvement actions. Identifying the root Event attributes General attributes related to the movement.
causes of inefficiencies is often limited due to the asymmetry of Case attributes General attributes related to the lawsuit.
information among the courts’ information systems, as well as the
lack of detailed data on the activities of lawsuit processing [1].

2.2 Jurimetrics end in itself, so that process mining can enhance the way the judi-
ciary treats its digital data through the application of algorithms for
Jurimetrics [11] is defined as ‘statistics applied to the law’. Although
process discovery, compliance and predictive analysis [9]. A tech-
it emerged decades ago, recent advances in computing and data
nical report with suggested actions to improve judicial efficiency
storage capabilities have enabled alternative ways of observing
highlights the use of process mining as one of these actions [3].
patterns in data-based and hence statistics-based court decisions.
Nevertheless, few studies on this topic were found in the liter-
In the USA, the application of statistics to law has been developed
ature. Empirical studies were performed using lawsuit data from
under alternative nomenclature, such as Empirical Legal Studies
Brazilian courts[19, 22] though focused on comparison of lawsuit
[7] and, more recently, Judicial Analytics [4]. In Brazil, jurimetrics
throughput times. Attempts to apply data mining to extract infor-
has received growing interest [12].
mation from lawsuit data were made [13, 18], but these studies do
When analyzing alternative ways of managing the lawsuit pro-
not directly address process mining on that data.
cessing, the analysis of the lawsuit throughput time using jurimet-
rics techniques may present quality issues, as it eventually considers
inadequate time intervals for the object of analysis [3]. Procedural 4 RESEARCH METHOD
viscosity [16], defined as “a set of structural characteristics of a This study applied the Process Mining Project Methodology [21] to
lawsuit that is able to affect the speed of its processing”, may apply. guide the application of process mining to analyze judicial perfor-
Specifically in business law, processing of lawsuits may require mance in a specific context. Only the first five stages of the method
about twice as much effort as a common lawsuit [2]. The same were carried out: planning, extraction, data processing, mining and
authors suggest “future research focused on process flow analysis, analysis, and evaluation. The project was finished with the insights
i.e., the study of the stochastic process that generates all events and generated by the evaluation stage. Besides restricting the scope to
timestamps of lawsuit processing”. business law, a period of analysis was defined to consider only the
TJSP’s lawsuits distributed between January 1, 2018 and July 21,
2.3 Process Mining 2020. All lawsuits with a procedural movement published in that
time interval were considered. Full progress data for each lawsuit
Process mining [20] emerged as a set of techniques for mining was retrieved until July 31, 2020, including lawsuits opened before
business process-related information from event data logged by and lawsuits not yet closed.
information systems. A business process is a chain of activities that The data were extracted in two steps. First, the identifiers of
produces an outcome that adds value to an organization and its lawsuits of interest were obtained. For this, all issues of the TJSP’s
customers [6]. Business process models play a dominant role during Electronic Journals of Justice (DJE4 ) were downloaded from the DJE
BPM life cycle, leading in achieving organizational improvement website5 , considering the defined analysis period. DJE publishes in-
goals, including reducing costs, lead times, and error rates. By using formation on provisional or final decisions for all ongoing lawsuits
real event data to discover process models, process mining leverages at TJSP, daily. An automated scraping of these files was carried out
data mining to understand operational processes in organizations. using keywords associated with business law litigation. Second,
Table 1 shows the mapping of the basic elements of event logs the lawsuit identifiers obtained were used to retrieve data from
from the process mining perspective to their counterparts in the the e-SAJ website6 , where information on lawsuits is published. A
procedural law domain. web scraping was carried out to retrieve information on lawsuits
attributes and progress events including their respective dates. In
3 RELATED WORK TJSP, there are four filing court departments dedicated to business
Process mining should be seen as an analytical tool naturally suit-
able for lawsuits due to their inherently procedural nature. It is 4 Diário da Justiça Eletrônico (in Portuguese)
cited as a promising approach to suggest improvements for lawsuit 5 http://www.dje.tjsp.jus.br

processing [15]. Lawsuit digitization is presented as not being an 6 http://esaj.tjsp.jus.br


law; as a result, the lawsuits not filed at these four court depart-
ments were discarded. Data from both DJE and e-SAJ websites are Figure 2: Histogram of procedural movements by lawsuit
publicly available7 .
The process knowledge transfer with domain experts was carried
out, resulting in a mapping between the elements of the dataset
and the concepts of event log used by process mining, presented
in Table 2. Event data from lawsuits were used to create the event
log to be used in process mining. The lawsuit dataset was filtered
to remove columns with missing values or data not relevant to the
scope of this study. The judge column was made anonymous for
protecting personal data. The additional column order was added
to the movement database, and hence to the event log, to allow the
process mining discovery algorithm to identify the correct order of
activities within a case occurring on the same date.

5 RESULTS
The resulting event log contains data on lawsuits referring to 4,795 Figure 3: Process map based on average duration metrics
cases and 266,834 events, with procedural movements dating back
to 2008, and 10 case attributes, as described in Table 2. The event log
file8 was imported using the EverFlow9 process mining tool, which
produced the business process maps and the main process metrics,
such as number of cases, number of events, and average duration,
as presented in Figure 1 and Figure 3. Process map views are user-
interactive, so that activities, transitions, and the time interval can
be selected and filtered for drill-down analysis. Detailed views on
specific metrics are presented on dedicated dashboards and panels,
as shown in Figure 2, Figure 4, Figure 5 and Figure 6.

Figure 1: Process map of the lawsuit processing

Figure 4: Slow transition analysis panel

6 ANALYSIS OF THE RESULTS


As a proof of concept for process mining-enabled jurimetrics, the
results were analyzed considering each of the following process
mining perspectives: control-flow, time, resource, and case [20].

6.1 Control-flow Perspective analysis of the event log, as shown in Figure 1. The complexity
The control flow is the main perspective considered in process and procedural viscosity of lawsuits in business law can be verified
mining discovery, adding process-oriented value to the data mining by the process metrics, i.e., average rate of 55.6 events per case
and average case duration of 334 days. As shown in Figure 2, over
7 The right of access to judicial data in Brazil is guaranteed by the constitutional 10% of lawsuits have over 100 events per case, which corresponds
principle of judicial publicity, except in lawsuits protected by the secrecy of justice.
8 The event log file is available in the repository: https://doi.org/10.4121/14593857 to procedural movements during lawsuit processing. In addition,
9 http://everflow.ai one can verify that lawsuit processing in business law has an ad


Table 2: Judicial data mapped to process mining event log

Dataset column Data description Event log role


lawsuit_id Lawsuit identifier. Case ID.
movement Name of the lawsuit procedural movement. Event activity.
date Date of occurrence of the lawsuit’s procedural movement. Event timestamp.
order Sequential order of occurrence of the lawsuit’s procedural movement. Event attribute.
area Law area to which the lawsuit refers. Case attribute.
claim_amount Lawsuit claim amount. Case attribute.
class Procedural class (refers to specific law or claim reason). Case attribute.
control Internal control number. Case attribute.
court_department Filing court department. Case attribute.
digital It shows whether the process is digital (born digital or scanned) or on paper. Case attribute.
distribution_date Date and time when the lawsuit was distributed. Case attribute.
judge Name of the ruling judge for the lawsuit. Case attribute.
status Lawsuit status. Case attribute.
subject_matter Main topic of the lawsuit. Case attribute.
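As an illustration of how such an event log can be assembled and mined outside a dedicated tool, the sketch below (ours, not the authors' pipeline; the CSV file name is a placeholder, column names follow the mapping in Table 2) orders the procedural movements per lawsuit with pandas and computes events per case, case durations and per-transition waiting times of the kind discussed in the time perspective analysis below.

```python
# Sketch: build an event log from the lawsuit dataset described in Table 2 and
# compute basic process metrics with pandas. The file name is a placeholder.
import pandas as pd

events = pd.read_csv("lawsuit_movements.csv", parse_dates=["date"])

# Order events within each case (lawsuit); 'order' breaks ties for same-day movements.
log = events.sort_values(["lawsuit_id", "date", "order"])

# Basic metrics: events per case and case duration (first to last movement).
per_case = log.groupby("lawsuit_id")["date"].agg(["count", "min", "max"])
per_case["duration_days"] = (per_case["max"] - per_case["min"]).dt.days
print("cases:", len(per_case))
print("average events per case:", per_case["count"].mean())
print("average case duration (days):", per_case["duration_days"].mean())

# Transition analysis: waiting time between consecutive movements, grouped by the
# pair of activities, to surface slow transitions and bottlenecks.
log["next_movement"] = log.groupby("lawsuit_id")["movement"].shift(-1)
log["next_date"] = log.groupby("lawsuit_id")["date"].shift(-1)
log["waiting_days"] = (log["next_date"] - log["date"]).dt.days
slow = (log.dropna(subset=["next_movement"])
           .groupby(["movement", "next_movement"])["waiting_days"]
           .agg(["mean", "count"])
           .sort_values("mean", ascending=False))
print(slow.head(10))
```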

6.2 Time Perspective


Figure 5: Lawsuits by judge dashboard The analysis of the lawsuit throughput time using process mining
techniques is favored by the easy handling of incomplete cases,
considering the activities carried out in the sequence flow. Since
the full progress data for each lawsuit was retrieved, there is no
issue related to the beginning of the process (i.e., lawsuit opening).
However, the lawsuit processing time evaluation can be affected by
cases that did not ended (i.e., lawsuits not yet closed). The Definitely
Archived activity is the most frequent final activity, which confirms
the natural behaviour of lawsuit processing within the court. It is
the final activity in 20% of the cases, which represents the number of
closed lawsuits in the dataset. By filtering all cases that go through
this activity, the lawsuit throughput time for closed lawsuits can
be calculated as 312 days.
Furthermore, based on the process map in Figure 3, one can
identify highlighted slow transitions and activity bottlenecks. These
Figure 6: Comparative analysis between on-paper and digi- transitions are also listed in the slow transition analysis panel
tal lawsuits in Figure 4, which offers a detailed diagnostics on each of them.
Comments on some slow transitions of the lawsuit processing-
related business process are presented in the following:

• Transition between Issued Publication Certificate and Issued


Judicial Office Certificate activities, with duration of 18 days
and effort of 121 years and 209 days, affecting 38% of cases:
this is a clear bottleneck within the court department. After
the publication certificate is issued, the judicial office publi-
cation certificate must be issued. However, there is a long
waiting time due to the lack of human resources on the court.
This bottleneck has a high impact on lawsuit processing time
and the use of court resource, suggesting that this activity is
a good candidate for automation solutions.
hoc sequence flow, characterized by the large number of process • Transition between Issued Publication Certificate and Suspen-
variants. The most frequent control-flow is shared among 155 cases, sion of the Term activities, with duration of 21 days and effort
which represents only 3% of all lawsuits. The extended process map of 164 years and 258 days, affecting 42% of cases: bottleneck
in Figure 1 shows the process control-flow considering only 4.19% possibly due to litigant issues, such as lack of required doc-
of cases. Figure 1 also highlights the most frequent activities and umentation. Figure 4 shows that this transition causes the
event counting for each transition. average duration of the case to increase by 77%.


• Transition between Suspension of the Term and Conclusions REFERENCES


for Decision activities, with duration of 13 days and effort of [1] Bruna Armonas Colombo, Pedro Buck, and Vinicius Miana Bezerra. 2017. Chal-
19 years and 298 days, affecting 10% of cases: this bottleneck lenges When Using Jurimetrics in Brazil—A Survey of Courts. Future Internet 9,
4 (10 2017), 68. https://doi.org/10.3390/fi9040068
is related to the previous one, possibly due to the judge being [2] Associação Brasileira de Jurimetria. 2016. Estudo sobre varas empresariais na
unable to evaluate the case for proper decision making. Comarca de São Paulo. Technical Report. Associação Brasileira de Jurimetria. 20
pages. https://abj.org.br/pdf/ABJ_varas_empresariais_tjsp.pdf (in Portuguese).
[3] Associação Brasileira de Jurimetria. 2020. Justiça Pesquisa - Formas Al-
6.3 Resource Perspective ternativas de Gestão Processual: a especialização de varas e a unificação de
serventias. Technical Report. Conselho Nacional de Justiça, Brasília. 126
The resource perspective analysis allows process maps and metrics pages. https://www.cnj.jus.br/wp-content/uploads/2020/08/Justica-Pesquisa_
to be evaluated considering the resources (people or devices) that Relatorio_ABJ_2020-08-21_1.pdf (in Portuguese).
[4] Daniel L. Chen. 2019. Judicial analytics and the great transformation of American
execute the process activities. In addition, the values of the process Law. Artificial Intelligence and Law 27, 1 (3 2019), 15–42. https://doi.org/10.1007/
execution effort can be evaluated based on the average duration s10506-018-9237-x
and event counting for each activity. Performance and comparative [5] Conselho Nacional de Justiça. 2020. Justiça em Números 2020: ano-
base 2019. Technical Report. Conselho Nacional de Justiça, Brasília. 267
analysis of lawsuit processing from the resource perspective were pages. https://www.cnj.jus.br/wp-content/uploads/2020/08/WEB-V3-Justiça-
carried out based on the judge attribute, as shown in Figure 5. em-Números-2020-atualizado-em-25-08-2020.pdf (in Portuguese).
[6] Marlon Dumas, Marcello La Rosa, Jan Mendling, and Hajo A. Reijers. 2018. Fun-
damentals of Business Process Management (2 ed.). Springer, Berlin. 527 pages.
6.4 Case Perspective https://doi.org/10.1007/978-3-662-56509-4
[7] Theodore Eisenberg. 2011. The origins, nature, and promise of empirical legal
The case perspective allows performance diagnostics to be carried studies and a response to concerns. University of Illinois Law Review 2011, 5
out considering specific case attributes. Considering the digital case (2011), 1713–1738. https://doi.org/10.2139/ssrn.1727538
[8] Adalmir de Oliveira Gomes and Tomás de Aquino Guimarães. 2013. Desempenho
attribute, a side-by-side comparison was carried out to analyze how no judiciário. Conceituação, estado da arte e agenda de pesquisa. Revista de
this attribute impacts process flow and metrics. Surprisingly, Figure Administracao Publica 47, 2 (3 2013), 379–401. https://doi.org/10.1590/S0034-
6 shows that digital lawsuits are 6% slower than the on-paper ones. 76122013000200005 (in Portuguese).
[9] Bráulio Gusmão. 2021. Mineração de processos e gestão de casos no judiciário.
Some considerations on these findings are as follows: In Inteligência Artificial e Direito Processual: os impactos da virada tecnológica no
direito processual (2 ed.), Dierle Nunes, Paulo Henrique dos Santos Lucon, and
• The definition of the digital attribute includes both the born Erik Navarro Workart (Eds.). JusPodivm, Salvador, 589–594. (in Portuguese).
digital lawsuits and the scanned ones, although the scanned [10] International Consortium for Court and Excellence. 2020. Global Measures of
ones usually require more time and effort to be processed. Court Performance. Technical Report. Secretariat for the International Consortium
for Court Excellence, Sydney, Australia. 112 pages. http://www.courtexcellence.
It is not possible to distinguish them by the value of this com/resources/global-measures
attribute, but further investigation on procedural movements [11] Lee Loevinger. 1948. Jurimetrics–The Next Step Forward. Minnesota Law Review
may confirm this assumption, if scan activities are logged. 33, 5 (1948), 455–493.
[12] Marcos Maia and Cicero Aparecido Bezerra. 2020. Bibliometric analysis of scien-
• Procedural viscosity in business law might hide inefficiencies tific articles on jurimetry published in Brazil. Revista Digital de Biblioteconomia e
usually associated with on-paper lawsuits, which means Ciência da Informação 18 (2020). https://doi.org/10.20396/RDBCI.V18I0.8658889
[13] Oleg Metsker, Egor Trofimov, Sergey Sikorsky, and Sergey Kovalchuk. 2019. Text
that the main process bottlenecks in business law might and data mining techniques in judgment open data analysis for administrative
be related to internal activities that are independent of the practice control. In Communications in Computer and Information Science, Vol. 947.
digital nature of lawsuits. However, this statement requires Springer, Cham, 169–180. https://doi.org/10.1007/978-3-030-13283-5_13
[14] Tom M. Mitchell. 1997. Machine Learning. McGraw-Hill Sci-
further investigation, since digital lawsuit management is ence/Engineering/Math. 432 pages.
often associated with performance gains. [15] Dierle Nunes. 2021. A technological shift in procedural law (from automation
• A data selection bias may have occurred due to the analysis to transformation): Can legal procedure be adapted through technology? In
Inteligência Artificial e Direito Processual: os impactos da virada tecnológica no
period, as it is not uncommon for dedicated task forces to direito processual (2 ed.), Dierle Nunes, Paulo Henrique dos Santos Lucon, and
occur within court departments to reduce the backlog of Erik Navarro Workart (Eds.). JusPodivm, Salvador, 55–78.
[16] Marcelo Guedes Nunes. 2019. Jurimetria: como a estatística pode reinventar o
on-paper lawsuits. Direito (2 ed.). Revista dos Tribunais, São Paulo. 192 pages. (in Portuguese).
[17] Giuliana Palumbo, Giulia Giupponi, Luca Nunziata, and Juan Mora-Sanguinetti.
2013. Judicial performance and its determinants: a cross-country perspec-
7 CONCLUSION tive. OECD Economic Policy Papers 5 (2013), 1–38. https://doi.org/10.1787/
The analysis of the results revealed a comprehensive set of diag- 5k44x00md5g8-en
[18] Shahmin Sharafat, Zara Nasar, and Syed Waqar Jaffry. 2019. Data mining for
nostic metrics, insights into the root causes of inefficiencies, and smart legal systems. Computers and Electrical Engineering 78 (9 2019), 328–342.
ideas for improvement, which would hardly be discovered without https://doi.org/10.1016/j.compeleceng.2019.07.017
a process-oriented analysis approach. The prospects for using pro- [19] Universidade de São Paulo. 2019. Justiça Pesquisa - Mediações e Conciliações
Avaliadas Empiricamente: jurimetria para proposição de ações eficientes. Technical
cess mining in the Brazilian judiciary include the use of process Report. Conselho Nacional de Justiça, Brasília. 193 pages. https://www.cnj.
mining tools to provide online dashboards for judicial performance jus.br/wp-content/uploads/2011/02/e1d2138e482686bc5b66d18f0b0f4b16.pdf (in
monitoring by the State Internal Affairs Divisions of Justice, which Portuguese).
[20] Wil M. P. van der Aalst. 2016. Process mining: Data science in Action (2 ed.).
monitors the performance on the provision of jurisdictional ser- Springer, Berlin. 467 pages. https://doi.org/10.1007/978-3-662-49851-4
vices. These dashboards can be used to identify inefficiencies in [21] Maikel L. van Eck, Xixi Lu, Sander J. J. Leemans, and Wil M. P. van der Aalst.
2015. PM2: A process mining project methodology. In Lecture Notes in Computer
near real time and define targets for resolving them. Science, Vol. 9097. Springer, Cham, 297–313. https://doi.org/10.1007/978-3-319-
19069-3_19
[22] Caio Castelliano de Vasconcelos, Eduardo Watanabe de Oliveira, Henrique Pais da
ACKNOWLEDGMENTS Costa, and Tomas de Aquino Guimaraes. 2018. Tempo de Processos Judiciais
The authors thank the EverFlow company for kindly supporting na Justiça Federal do Brasil. In XLII Encontro da ANPAD. Curitiba, 1–16. http:
//www.anpad.org.br/abrir_pdf.php?e=MjQ1NzA= (in Portuguese).
this research.

244
Using Transformers to Improve Answer Retrieval
for Legal Questions
Andrew Vold Jack G. Conrad
Thomson Reuters Thomson Reuters
TR Labs Research TR Labs Research
andrew.vold@thomsonreuters.com jack.g.conrad@thomsonreuters.com

ABSTRACT consuming and laborious effort. Over time, we began to see an in-
Transformer architectures such as BERT, XLNet, and others are terest in more focused question answering systems taking the place
frequently used in the field of natural language processing. Trans- of traditional information retrieval systems. In the field of AI and
formers have achieved state-of-the-art performance in tasks such as Law, Quaresma and Rodrigues were among the first to implement
text classification, passage summarization, machine translation, and a question answering system for legal documents [13], one that
question answering. Efficient hosting of transformer models, how- focused on Portuguese legal decisions. More recently, however, de-
ever, is a difficult task because of their large size and high latency. velopments in deep learning-based approaches for tasks like open
In this work, we describe how we deploy a RoBERTa Base ques- domain question answering have resulted in major gains in answer
tion answer classification model in a production environment. We rate performance. They have also been responsible for comparable
also compare the answer retrieval performance of a RoBERTa Base advances in closed domain question answering in fields such as
classifier against a traditional machine learning model in the legal Legal QA [1]. Such progress has resulted in performance gains for
domain by measuring the performance difference between a trained both factoid and non-factoid question answering.
linear SVM on the publicly available PRIVACYQA dataset. We show Transformer architectures have delivered impressive perfor-
that RoBERTa achieves a 31% improvement in F1-score and a 41% mance gains over baselines for standard natural language process-
improvement in Mean Reciprocal Rank over the traditional SVM. ing (NLP) tasks. Open domain language modeling as a pretraining
step, followed by domain specific fine-tuning on another domain
CCS CONCEPTS has delivered state-of-the-art performance for tasks in a specific
domain, including the legal domain. One should thus expect to see
• Information systems → Information Retrieval; Retrieval
significant performance gains in legal question answer retrieval
Tasks and Goals; Question Answering; Information Retrieval;
by utilizing the output of a transformer based classifier which has
Retrieval Models and Ranking; Language Models; Information
been fine-tuned on legal QA pairs.
Retrieval; Evaluation of retrieval results; Relevance assessment.
It has been well observed that transformers are highly perfor-
mant at answering factoid questions which typically have answers
KEYWORDS of one or a few words [5]. Transformer based research in the Legal
Question Answering, Legal Applications, Deep Learning, Language domain has evolved toward more complex non-factoid questions
Models, BERT Engines, Evaluation which are more nuanced and may require several sentences to pro-
ACM Reference Format: vide context and elaboration in order to answer the legal question at
Andrew Vold and Jack G. Conrad. 2021. Using Transformers to Improve hand, for example, "When is a party entitled to a protective order?"
Answer Retrieval for Legal Questions. In Eighteenth International Conference The current work extends this research by processing a publicly
for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, available non-factoid QA dataset in an application workstream,
Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757. while addressing the challenges of performance quality, speed and
3466102 scale.

1 INTRODUCTION 2 PRIOR WORK


Historically, when legal professionals performed natural language The primary approaches employed to improve question answering
search, they would be required to sift through exhaustive lists of search results fall into three categories: document-centric, query-
results, ranked by probability of relevance, in order to identify centric, and ranking-centric (e.g., neural approaches). The works
materials relevant to their search [18]. The task could be a time described below generally fall into one or more of these categories.

Permission to make digital or hard copies of all or part of this work for personal or 2.1 Open-Domain Question Answering
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation Open domain question answering is a task that answers factoid
on the first page. Copyrights for components of this work owned by others than the questions using large collections of documents [19]. Historically,
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission retrieval in open domain QA was usually conducted using tf.idf
and/or a fee. Request permissions from permissions@acm.org. or BM25 approaches, which match keywords with an inverted in-
ICAIL’21, June 21–25, 2021, São Paulo, Brazil dex, and represent the question and content in high-dimensional,
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. sparse vectors [16]. In their 2017 report, Chen et al. propose us-
https://doi.org/10.1145/3462757.3466102 ing Wikipedia for open domain question answering for factoid

245
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Andrew Vold and Jack G. Conrad

questions [5]. The task is one of machine reading at scale, which of a merger or acquisition [6]. They claim that what is novel in
addresses the challenges of document retrieval and machine com- their approach is that the proposed system explicitly handles the
prehension (identifying text spans containing the answer). Their imbalance in the data, by generating synthetic instances of the
approach combines a search component based on bigram hashing minority answer categories, using the Synthetic Minority Oversam-
and tf.idf matching with a multi-layer recurrent neural network pling Technique [4]. This ensures that the number of instances in
model trained to detect answers in Wikipedia paragraphs. They use all the classes are roughly equal to each other, thus leading to more
the SQuAD dataset for training and three other datasets for testing accurate and reliable classification. They use conditional random
[14]. They obtain an F-score of 79%, which was within a point of fields as their text selection algorithm. Each sentence in the contract
the top performing method at the time. under consideration is featurized into a tf.idf vector and fed into the
In their work on dense passage retrieval for open domain ques- CRF algorithm. The authors found a 13% improvement in accuracy
tion answering, Karpukhin et al. show that retrieval can be ef- due to the imbalance handling.
fectively implemented using dense representations alone, where The recently published work on Legal BERT has reported on
embeddings are learned from a small number of questions and performance gains on an assortment of downstream NLP tasks
passages via a simple dual encoder framework [9]. It has outper- [3]. The authors compare the performance of out of the box BERT
formed traditional QA baselines (top-20 results) by 9%-19%, while with a version that benefits from additional pre-training with legal
establishing new end-to-end baseline performance levels. domain data, and finally with a version where the pre-training with
In their earlier work on Bidirectional Encoder Representations legal domain data starts from scratch. The legal domain training
from Transformers (BERT), Devlin et al. introduced a new language data consists of UK and EU legislation, European Court of Justice
representation model which is designed to pre-train deep bidirec- and Court of Human Rights cases, and finally U.S. court cases as
tional representations from unlabeled text by jointly conditioning well as U.S. contracts. The authors show that the best strategy
on both left and right context in all layers [7]. BERT consequently to transfer BERT to a new domain may vary, but that one may
can be fine-tuned with just one additional output layer to create consider either further pre-training or pre-training from scratch
state-of-the-art, highly performant models for a wide range of tasks, on data from the new domain. Legal BERT achieved state-of-art
including question answering. results in three end-tasks, and, most notably, the performance gains
As an extension to BERT, Liu et al. developed a "robustly op- were stronger for the most challenging end-tasks (i.e., multi-label
timized" pretraining approach to BERT known as RoBERTa [10]. classification in ECHR-cases and contract header & lease details
They found that BERT was significantly undertrained. In their repli- in Contracts-NER) where in-domain (legal) knowledge is arguably
cation study of BERT, they carefully measured the impact of many the most important. The authors also released a version of Legal
key hyperparameters and training data size. They showed how BERT-SMALL, which is 3 times smaller than Legal BERT, but quite
hyperparameter choices have a major impact on final results. Their competitive performance-wise to the other versions of Legal BERT.
best model achieved state of the art results against such standard Reports on question answering systems have also recently been
collections as GLUE, RACE, and SQuAD. published by researchers at Thomson Reuters and LexisNexis [2,
Because pre-trained language models are usually computation- 11]. The current work demonstrates the robustness of a Legal QA
ally expensive, and it is difficult to execute them on resource limited system deployed in a multi-stage workstream where the engine is
devices, researchers like Jiao et al. have focused on transformer fine-tuned on an application-specific dataset. The application and
model distillation methods and proposed a novel method that was dataset are discussed below. The system is shown to significantly
specially designed for knowledge distillation (KD). By leveraging outperform the baseline using contemporary neural techniques.
their new KD method, while focusing on the knowledge already
preserved in larger models like RoBERTa, they discovered that such 3 METHODOLOGY
knowledge could be transferred to a smaller TinyBert model [8]. Transformer models have achieved state-of-the-art performance
The new framework captured in TinyBert performs transformer in many NLP applications such as text classification, text summa-
distillation at both the pre-training and task specific learning stages. rization, and question answering. Though transformers are highly
They have shown that their framework ensures that TinyBert cap- performant, their generally large size make them difficult to deploy
tures the general knowledge and task specific knowledge preserved in production systems. Successful transformer model hosting in
in BERT. a production environment would be a major advance in natural
In contrast with factoid question answering, Zhu et al. pursued language applications. For this reason, we developed a high perfor-
non-factoid question answering where the answers tend to be mance question answering (QA) system based on the RoBERTa base
longer passages [22]. In this work, the authors determine that by architecture, but other transformer architectures could be used as
generating synthetic training data of arbitrary volume and with well [10, 12]. The challenges and our strategies for handling these
well understood properties, the learning capacity of Knowledge problems will be discussed in this section.
Graph architectures can be better understood and characterized. QA system researchers do not frequently have access to evalu-
Whether a given neural architecture for KGQA will train a model to ated QA pairs that are broad, balanced, and comparable to what
generalize rather than memorize may depend on dataset properties. a user would ask. Open sourced QA pairs tend to be either very
general or belong to a niche domain. If one is fortunate to have
2.2 Legal Domain Question Answering access to labeled QA pairs in the working domain, it is unlikely that
In a recent work, the authors address a due diligence topic where there is enough data for broad topic coverage. To address this issue,
lawyers review documents for indication of risk due to the prospect subject matter experts (SMEs) can be assigned to procure quality

246
Using Transformers to Improve Answer Retrieval for Legal Questions ICAIL’21, June 21–25, 2021, São Paulo, Brazil

mobile applications [20], and more than 3,500 relevant answers that
have been annotated by experts. From the data provided, we have
obtained approximately 130K passages for our training set, of which
about 25% was used in our validation set. The goal of the collection
was to achieve broad coverage across a spectrum of application
types. The researchers collected privacy policies from 35 mobile
applications representing different categories in the Google Play
Store [17]. Another goal of the creators was to include both policies
from well-known applications, which are likely to have carefully-
constructed privacy policies, and lesser-known applications with
Figure 1: QA System Development Cycle
smaller install bases, whose policies might be considerably less
sophisticated. They set a threshold of 5 million installs to ensure
QA pairs. Yet SMEs often experience fatigue when producing nu-
each category includes applications with installs on both sides of
merous examples, even if the queries originate from user query
the threshold. All policies in the corpus are in English, and were
logs. This phenomenon often manifests itself in the form of weak
collected before April 1, 2018, predating many companies’ GDPR-
question-answer pair generation where examples differ by only
focused revisions.
a few words. To address such limitations, natural language user
queries are identified, run through the classifier, and the highest 3.2.1 Answer Identification. In order to identify legally valid an-
scoring QA pairs are evaluated. The resulting data can then be used swers, seven subject matter experts with legal training were re-
to train the model, yielding a cyclic data curation, model training cruited to formulate answers to the Amazon Mechanical Turk ques-
process as seen in Figure 1. tions. They indicated relevant material within the given privacy
Given the QA system that we developed was intended for ap- policy in addition to supplying relevant metadata regarding the
plication to sets of in-house legal documents, many of which are question’s relevance, subjectivity, OPP-115 category [21], and how
not freely available to the general public, for the purposes of this likely any policy is to containing the answer to the question.
research report, we have opted to apply our techniques to the pub- Table 1 presents aggregate statistics for the PRIVACYQA dataset.
licly available legal questioning collection described in section 3.2. 1750 questions are posed to an imaginary privacy assistant over 35
Though it covers a subdomain of the legal space, it is nonetheless a mobile applications and their associated privacy documents.
broad ranging and complex dataset that contains an array of top-
ics, question and answer lengths and types. It is a nuanced and Dataset Train Test All
challenging set of data which is indicative of the kinds of question No. of Questions 1350 400 1750
and answer types one can expect to see in the legal domain. The No of Policies 27 8 35
findings we obtain apply specifically to the PRIVACYQA dataset, No. of Sentences 3704 1243 4947
but are also representative of the kinds of issues and challenges Avg. Q Length 8.42 8.56 8.46
one encounters with wider-ranging legal datasets as well. Avg. Doc. Length 3121.3 3629.13 3237.37
Avg. Ans. Length 123.73 153.44 139.62
3.1 Training Targets Table 1: Statistics of the PRIVACYQA Dataset
In order to assess the performance of the QA classifier, natural
language user log queries and their retrieved answers are presented 4 EXPERIMENTS
to an SME. The SME then must determine whether or not the top To demonstrate the quality of answer retrieval performance of a
answers returned by the classifier satisfy what was being asked. transformer in comparison with traditional ML models, we fine-
The grade by the SME can be a binary "pass/fail", a letter grade, tune an open domain pretrained RoBERTa classifier and train a
or even a score on a continuous scale. In our case, the grade is linear SVM with tf.idf features on the PRIVACYQA dataset. Training
converted into a label or regression target to be used for model models on this dataset is challenging for several reasons. First, the
fine-tuning. dataset is largely unbalanced with negative examples occurring 25
For our internal QA classifier, we utilized a multi-label grading times more often than positive examples. In addition, there exists
criteria which determined whether or not the answer satisfies the considerable noise in both the queries and the answers. Finally, the
requirements and to what degree it answers the given question. In number of unique questions and answers are far fewer than the
order to avoid grader bias, we have two SMEs grade each QA pair, total counts of QA pairs in the dataset.
and the average is taken. Disagreements of more than one grade Class imbalance is a common problem in real world machine
may be adjudicated by a senior SME. A similar approach was used learning applications. For this reason, there are many methods to
by the creators of the PRIVACYQA dataset, which will be explained effectively combat the adverse effects of training on an imbalanced
below. dataset. These can include over/under sampling, class weighting
on the loss, external or generated training data augmentation, and
3.2 Data more. For our experiments, we apply a simple class weighting
The dataset used in these experiments comes from the PRIVACYQA scheme to give more weight to the underrepresented positive class.
dataset described by Ravichander et al. in [15]. It is a corpus con- The PRIVACYQA data is quite noisy. The queries and answers are
sisting of 1,750 questions about privacy policies associated with riddled with misspellings, URLs, improper grammar, fragmented

247
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Andrew Vold and Jack G. Conrad

sentences, lack of punctuation, and more. In order to have the One of the most important metrics for QA classification systems
data resemble the data existing in our internal system, significant is Mean Reciprocal Rank (MRR). This simple metric is the average
data cleaning and filtering is applied. This includes capitalizing inverse position of the first true labeled examples in the answer
sentence beginnings, removing URLs, removing queries or answers ranks. MRR is a useful metric for ensuring that the highest quality
with more than 4 non-english words, and additional cleaning and answers make it to the highest rank in the list. This is especially
filtering steps. Even after all of this data preprocessing, the data important for applications like question answering which may
remains far from perfect, but is sufficient to meet the requirements return a few or even one answer for a particular query. Due to
of our experimental conditions. the importance of MRR, RoBERTa is the better choice for a QA
The original PRIVACYQA paper split the training and testing model with a 41.4% improvement in MRR over the SVM baseline
datasets by privacy category, rather than by unique queries (Table (** 𝑝 < 1 × 10−5 ).
1). The original PRIVACYQA dataset thus contains data leakage. It is interesting to see that a simple, traditional ML model like
Several queries from the test set can also be found in the training an SVM, operating on sparse word vectors, achieves performance
set. In order to rectify this, we identify the queries which exist in relatively similar to that of a transformer. One explanation for this
the test set, and reassign those QA pairs as training data, which can is that the data is very messy and lacks uniqueness. A simple ML
be seen in Table 2. model doesn’t get distracted by nuances of this dataset such as
fragmented sentences, misspellings, and the frequent use of URLs
Set Positives Negatives Total and company names. A simple ML model is also less prone to over-
Train 6,950 152,903 159,487 fitting than a transformer, especially considering the redundancy
Test 5,276 45,493 50,720 of the text in the dataset. Overfitting was a challenge during exper-
Table 2: Dataset Split Statistics imentation. For this reason, one can expect even higher RoBERTa
performance if the experiments are repeated with a more sophisti-
We perform tf.idf fitting on the unigrams and bigrams from cated strategy for combatting overfitting. A major lesson learned
the corpus of unique answers, and use it to vectorize the QA pairs from running this experiment is to ensure that the data used for
which are then used as inputs to a linear SVM. The hyperparameters training a transformer QA classifier is clean and without redun-
of the SVM are found by performing 5-fold cross validation via dancy. In addition, more careful domain adaptation could be applied
grid searching with maximizing the validation set F1-score as the before fine-tuning on the experimental dataset.
objective. This process leads to optimal hyperparameters for the
SVM model and a consistent training-validation split to be used for
6 APPLICATION PIPELINE
training RoBERTa. Developing a strong QA classifier is only one piece of deploying
Due to the large number of parameters in RoBERTa, it is trained a scalable QA application. It is not feasible to simply concatenate
by gradually unfreezing the layers, starting with the classification all passages from a corpus to a user’s query and sequentially feed
head. The learning rate and the batch size are decreased as layers them to a classifier. Instead, there needs to be a way to quickly filter
are unfrozen, as to avoid overloading the CUDA memory. After out obvious negative passages, yielding a smaller pool of potential
each epoch, the validation F1-score is measured until a plateau is answers to be fed to the classifier. An additional challenge of using a
reached, at which point the model loses generalizability. transformer based classifier like RoBERTa is its size and latency. In
order to address these challenges, we propose a solution consisting
5 RESULTS of a parallel cluster for candidate retrieval (Stage 1) and RoBERTa
After training both RoBERTa and SVM classifiers, the models are operating on a GPU endpoint (Stage 2). In addition, in order not to
run over the test set to determine the performance differences when overwhelm the user of the application, we typically return the top
using a transformer based QA classification engine. The results can n answers as predicted by RoBERTa, where n is small.
be seen in Table 3. One of the most important requirements for a powerful QA clas-
sification engine is to have a sufficiently large corpus of passages
Metric SVM RoBERTa against which a query can be compared. Oftentimes, this can be
Precision 0.212 0.470 on the scale of hundreds of thousands to millions of passages. The
Recall 0.480 0.326 overwhelmingly vast number of passages is irrelevant to a particu-
F1-score 0.294 0.385* lar query, and these are not difficult to identify. For this reason, it is
MRR 0.074 0.105** advisable to have a computationally efficient method of removing
Table 3: Classifier Performance on the Test Set the clearly irrelevant passages before performing any QA infer-
encing. In addition, due to the scale of the data, it is imperative to
As seen in the table, RoBERTa outperforms the SVM for all perform this filtering in parallel. To accomplish this, we employ a
metrics except recall. This makes sense because the SVM looks for parallel data cluster in the cloud with our data spanning several
exact token matches between the query and answer to assign a nodes (See Figure 2). The cluster functions by serving up the top-n
positive label. RoBERTa, however, uses the latent representation of most relevant passages as determined by properties such as term
the tokens to identify potential answers. In any QA application, it overlap between the query and passages. It is up to the application
is important to serve an expansive set of quality answers; for this designer to determine the appropriate number of passages to in-
reason, RoBERTa is preferable to the SVM for its 31% improvement clude in a candidate pool. Typically a candidate pool size between
in F1-score over the SVM (* 𝑝 < 1 × 10−5 ). 100 and 1000 will suffice. Increasing the number of nodes decreases

248
Using Transformers to Improve Answer Retrieval for Legal Questions ICAIL’21, June 21–25, 2021, São Paulo, Brazil

Figure 2: QA Application Pipeline

latency but increases cost, so application engineers must decide in [3] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos.
advance on how many nodes to include in their cluster. Legal-bert: The muppets straight out of law school, 2020.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic
After a satisfactory candidate pool has been retrieved, the QA minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357, June 2002.
pairs are then tokenized, pushed to the CUDA device, and fed to [5] D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading wikipedia to answer open-
domain questions, 2017.
the classifier. The classifier returns a list of prediction scores of the [6] R. Chitta and A. K. Hudek. A reliable and accurate multiple choice question
relevance of the passage to the answer. The passages associated answering system for due diligence. In Proceedings of the Seventeenth International
with these predictions are then sorted, and the top-n are returned, Conference on Artificial Intelligence and Law, ICAIL ’19, pages 184–188, New York,
NY, USA, 2019. Association for Computing Machinery.
where n is determined by the application development team. One [7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep
may also wish to apply a RoBERTa score threshold, so that very low bidirectional transformers for language understanding, 2019.
predictions, which are very often negative, are not shown to the [8] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. Tinybert:
Distilling BERT for natural language understanding. CoRR, abs/1909.10351, 2019.
user. If executed properly on the appropriate hardware, the entire [9] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. t. Yih.
answer serving process can take a second or less to perform. Dense passage retrieval for open-domain question answering, 2020.
[10] Y. Liu, M. O., N. Goyal, J. Du, M. Joshi, D. Chen, O. L., M. Lewis, L. Zettlemoyer,
7 CONCLUSIONS and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
[11] G. McElvain, G. Sanchez, S. Matthews, D. Teo, F. Pompili, and T. Custis. West-
Question answering is a challenging task which has been in devel- search plus: A non-factoid question-answering system for the legal domain. In
Proceedings of the 42nd International ACM SIGIR Conference on Research and
opment for many years. Question answering can take on different Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019,
forms such as answer generation, answer snippet retrieval, and pages 1361–1364. ACM, 2019.
question answer classification. We propose an end-to-end pipeline [12] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli.
fairseq: A fast, extensible toolkit for sequence modeling, 2019.
which combines the speed of a parallel data retrieval mechanism [13] P. Quaresma and I. Rodrigues. A question-answering system for portuguese
with the classification power of a fine-tuned RoBERTa Base classi- juridical documents. In Proceedings of the 10th International Conference on Artifi-
fier. Our observations from our internal data and the data discussed cial Intelligence and Law, ICAIL ’05, pages 256–257, New York, NY, USA, 2005.
Association for Computing Machinery.
in this paper indicate that transformer architectures can achieve [14] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for
greater classification performance than traditional machine learn- machine comprehension of text, 2016.
[15] A. Ravichander, A. W. Black, S. Wilson, T. B. Norton, and N. M. Sadeh. Question
ing methods in legal QA classification tasks. answering for privacy policies: Combining computational and legal perspectives.
We have discussed the efficacy of transformer models in text clas- CoRR, abs/1911.00841, 2019.
sification tasks. We observe a significant increase in F1-score and [16] S. Robertson and H. Zaragoza. The probabilistic relevance framework: Bm25 and
beyond. Found. Trends Inf. Retr., 3(4):333–389, apr 2009.
MRR of a RoBERTa classifier over a linear SVM on the PRIVACYQA [17] P. Story, S. Zimmeck, and N. Sadeh. Which apps have privacy policies? In M. Med-
dataset. Our experiment has shown that transformer models can ina, A. Mitrakas, K. Rannenberg, E. Schweighofer, and N. Tsouroulas, editors,
achieve superior performance over traditional machine learning Privacy Technologies and Policy, pages 3–23, Cham, 2018. Springer International
Publishing.
techniques in legal question answer classification. [18] H. R. Turtle. Text retrieval in the legal world. Artif. Intell. Law, 3(1-2):5–54, 1995.
We have also also discussed some of the challenges and solu- [19] E. M. Voorhees. The trec-8 question answering track report. In Proceedings of
TREC-8, pages 77–82, 1999.
tions associated with developing and operating a transformer based [20] D. Weissenborn, G. Wiese, and L. Seiffe. Making neural QA as simple as possible
question answer classification system. With a large set of content, but not simpler. In Proceedings of the 21st Conference on Computational Natural
subject matter experts, and sufficient computing power, it is pos- Language Learning (CoNLL 2017), pages 271–280, Vancouver, Canada, Aug. 2017.
Association for Computational Linguistics.
sible to train and operate a transformer based system in a cost [21] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. Giovanni Leon,
effective manner. M. Schaarup Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, T. B. Norton,
E. Hovy, J. Reidenberg, and N. Sadeh. The creation and analysis of a website
privacy policy corpus. In Proceedings of the 54th Annual Meeting of the Association
REFERENCES for Computational Linguistics (Volume 1: Long Papers), pages 1330–1340, Berlin,
[1] S. Badugu and R. Manivannan. A study on different closed domain question Germany, Aug. 2016. Association for Computational Linguistics.
answering approaches. Int. J. Speech Technol., 23(2):315–325, 2020. [22] M. Zhu, A. Ahuja, D. Juan, W. Wei, and C. K. Reddy. Question answering with
[2] Z. Bennett, T. Russell-Rose, and K. Farmer. A scalable approach to legal question long multiple-span answers. In T. Cohn, Y. He, and Y. Liu, editors, Proceedings of
answering. In Proceedings of the 16th Edition of the International Conference on the 2020 Conference on Empirical Methods in Natural Language Processing: Findings,
Artificial Intelligence and Law, ICAIL ’17, pages 269–270, New York, NY, USA, EMNLP 2020, Online Event, 16-20 November 2020, pages 3840–3849. Association
2017. Association for Computing Machinery. for Computational Linguistics, 2020.

249
Toward Summarizing Case Decisions via Extracting Argument
Issues, Reasons, and Conclusions
Huihui Xu Jaromir Savelka Kevin D. Ashley
Intelligent Systems Program School of Computer Science Intelligent Systems Program,
University of Pittsburgh Carnegie Mellon University University of Pittsburgh
USA USA USA
huihui.xu@pitt.edu jsavelka@andrew.cmu.edu ashley@pitt.edu

ABSTRACT accessible to the lay public. This depends, however, on whether the
In this paper, we assess the use of several deep learning classifica- summaries capture the gist of the argument in the decision. In prior
tion algorithms as a step toward automatically preparing succinct work, we proposed that such case summaries could be generated
summaries of legal decisions. Short case summaries that tease out by extracting legal argument triples (IRC triples) including: 1) the
the decision’s argument structure by making explicit its issues, con- major issues a court addressed in the case, 2) the court’s conclusion
clusions, and reasons (i.e., argument triples) could make it easier with respect to each issue, and 3) the court’s reasons for reaching
for the lay public and legal professionals to gain an insight into the conclusion.
what the case is about. We have obtained a sizeable dataset of In [23], we evaluated whether a machine learning (ML) model
expert-crafted case summaries paired with full texts of the deci- can identify the components of legal argument triples in summaries
sions issued by various Canadian courts. As the manual annotation prepared by legal professionals. We applied traditional ML algo-
of the full texts is prohibitively expensive, we explore various ways rithms (random forest variations) and deep neural network models
of leveraging the existing longer summaries which are much less (LSTM, CNN and FastText) to identify the sentence components of
time-consuming to annotate. We compare the performance of the IRC triples in legal summaries and to the task of binary classifying
systems trained on the annotations that are manually ported to sentences (IRC vs. non-IRC) in the summaries and corresponding
the full texts from the summaries to the performance of the same full text decisions. While the performance on the summaries was
systems trained on annotations that are projected from the sum- promising, the performance on the full texts was quite poor.
maries automatically. The results show the possibility of pursuing In this work, we have substantially increased the size of the
the automatic annotation in the future. annotated data set of full case texts compared to the prior work. We
focus on applying deep learning algorithms (LSTM, CNN), includ-
CCS CONCEPTS ing pre-trained transformer models (RoBERTa, CNN-BERT) with
different loss functions to deal with the continuing challenge of
• Information systems → Information retrieval; Retrieval mod-
data imbalance in our training set. We report the results of apply-
els and ranking; Similarity measures; • Applied computing →
ing the different kinds of neural models on cases’ full texts after
Law; Annotation.
training with manually-mapped human-annotated sentences from
KEYWORDS the summaries and analyze the effects of using different loss func-
tions and embeddings. We also report results of a proof-of-concept
Information retrieval, argument mining, legal analysis, relevant experiment that applied automatically mapped human-annotated
sentences, summarization sentences from the summaries to the full-texts in order to classify
ACM Reference Format: argument triples. If this succeeds, we would not need to manually
Huihui Xu, Jaromir Savelka, and Kevin D. Ashley. 2021. Toward Summariz- annotate the full texts. It would suffice to manually annotate the
ing Case Decisions via Extracting Argument Issues, Reasons, and Conclu- summaries, automatically map those summary annotations to the
sions. In Eighteenth International Conference for Artificial Intelligence and full texts, and train a model directly on the full texts.
Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY,
USA, 5 pages. https://doi.org/10.1145/3462757.3466098
2 RELATED WORK
1 INTRODUCTION Argument mining research in the legal domain has focused on ex-
The ability to automatically prepare succinct summaries of legal tracting propositions, premises, conclusions, and nested argument
decisions could contribute to making legal source materials more structures [16], argument schemes such as by example [6], rhetori-
cal and other roles that sentences play in legal arguments [1, 19],
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed stereotypical fact patterns that strengthen a side’s claim (i.e., legal
for profit or commercial advantage and that copies bear this notice and the full citation factors) in domains like trade secret law [4], reasons or warrants in
on the first page. Copyrights for components of this work owned by others than ACM arguments citing facts or principles [21], functional parts of legal
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a decisions such as analysis or conclusions [20], and segments by
fee. Request permissions from permissions@acm.org. topic [13] or by linguistic analysis [5, 7, 22].
ICAIL’21, June 21–25, 2021, São Paulo, Brazil We aim to identify legal argument triples and employ them
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00 to succinctly summarize case summaries. Yamada, et al. [24] have
https://doi.org/10.1145/3462757.3466098 summarized Japanese judgments in terms of issues, conclusions, and

250
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Huihui Xu, Jaromir Savelka, and Kevin D. Ashley

framings. The legal argument triples we seek to employ are simpler


types, which have not been tailored to Japanese legal judgements.
In addition, we make use of a set of case summaries prepared by
human experts to assist with extracting argument triples from the
full case texts, a resource not employed in the cited work.

3 DATA SET
We defined the components of legal argument triples as follows:

(1) Issue – Legal question which a court addressed in the case.


(2) Conclusion – Court’s decision for the corresponding issue.
(3) Reason – Sentences that elaborate on why the court reached
the Conclusion.

All non-annotated sentences are treated as non-IRC sentences.


Two paid third-year law school students annotated sentences
from the human-prepared summaries to identify the issues, reasons,
and conclusions. Both students annotated 574 randomly selected
pairs from the 28,733 case/summary pairs that were available to
us. The total number of sentences from the corresponding full
texts is 120,707, which is significantly more than the summaries’
7,484 sentences. The total number of sentences from full texts are
significantly more than sentences from summaries.
Both annotators followed an 8-page Annotation Guide prepared
by the third author, a law professor, in order to mark-up instances
of IRC sentence types in the summaries. Using the Gloss annota-
tion environment of the second author, the annotators worked on
successive batches of summaries during a series of weeks. After an-
notating each batch, the annotators resolved any coding differences
Figure 1: Distribution of annotated IRC type sentences in
in regular Zoom meetings attended by the first and third authors.
summaries (top) and in full texts (bottom).
The procedure for annotating the full texts of cases differs from
annotating the summaries. For each annotated sentence in the
summary, the Annotation Guide instructs annotators to search the
full text of the case for those sentences that are most similar to the Cohen’s 𝜅 [2] is used to measure the degree of agreement be-
summary sentence and to assign them the same label (i.e., Issue, tween two annotators after their independent annotations of each
Conclusion, or Reason) as in the summary. Annotators may pick batch of summaries. The mean of Cohen’s 𝜅 coefficients across all
terms or phrases from the annotated summary sentences and search types for summaries is 0.709, and the mean for full texts is 0.827. Ac-
for corresponding sentences in the full texts. If the annotators find cording to [11], both scores indicate substantial agreement between
corresponding sentences, they do not need to read the full text of annotators about the sentence type. For the summary annotation,
the case. The Guide warns the annotators that there may not be the mean of Reason agreement is the lowest and Issue and Conclu-
an exact correspondence between the annotated sentences in the sion’s are the highest. Reasons are more challenging since they tend
summary and those in the full text of the case. This makes sense; to include descriptions of case facts. The agreement scores of full
having selected sentences in the full case texts to include in the texts are higher than the summaries’ scores. Since the full text anno-
summary, the summarizers probably edited them, for example, by tation took place after reconciling disagreements for the summary
combining some short sentences. annotation, the full text scores were expected to be higher.
By using the summaries’ annotations as anchors to target cor- Figure 1 reports the distributions of final consensus labels from
responding sentences in the full text, we attempted to leverage summaries and full texts. The most frequent label is the non-IRC
the summarizers’ work in selecting important sentences and the label for both summaries and full texts. The second most frequent
annotators’ work in marking up some of those sentences as issues, label is the Reason label for both summaries and full texts. The
conclusions, or reasons. We developed this strategy to expedite label distribution is aligned with our observation: Reasons tend to
the process of full text annotation which would be much more be more elaborated than Issues and Conclusions.
time-consuming and costly if performed directly on the full texts.
The strategy is based on the observation that sentences of sum- 4 EXPERIMENTS
maries stem from those in the full texts. The strategy also helps us We aim to leverage state-of-the-art deep learning models to identify
to confirm the mapping relationship between summaries and full the role a sentence may play in a case as an Issue, Conclusion, or
texts. This in turn helps us to develop the heuristic for automatic Reason. For our experiments, we used 80% of the full texts as the
mapping. training set, 10% as the validation set and 10% as the test set.

251
Toward Summarizing Case Decisions via Extracting Argument Issues, Reasons, and Conclusions ICAIL’21, June 21–25, 2021, São Paulo, Brazil

In this section, we present the details of the models, including Table 1: Comparison of manually annotated IRC summary
convolutional [10] and recurrent neural networks [14], BERT-based sentences with top 1 automatically ranked full-text sen-
neural networks [12] and a hybrid neural model combining a con- tences (Sentence-BERT embedding with cosine similarity)
volutional neural network with BERT embedding [8]. We also ex-
periment with different loss functions for those neural networks, Issue
including cross entropy loss, F1 loss and focal loss. Manual Damage to both vehicles exceeded the insurance deductibles and both parties
claim damages against each other for the amount of the deductibles.
Rank 1 The damages to both the truck and the car exceeded the $500.00 insurance
4.1 Model Architectures deductible. [. . . ]
Convolutional Neural Networks. Convolutional neural networks Reason
(CNN) utilize convolutional filters to extract local features. Origi- Manual The plaintiff should have taken more appropriate measures to avoid the acci-
dent
nally applied to computer vision tasks, CNNs have also achieved a
Rank 1 Even if Schmidt concluded that Henry was going to proceed into his path, he
high level of performance in sentence classification tasks. [9]. had more appropriate alternatives than locking his brakes and turning to the
In our study, we use the settings of hyperparameters of filters right.
from [9]: filter sizes of 3, 4, and 5 with 100 for each size of filter. Conclusion
In other words, the models are looking for tri-grams, 4-grams and Manual Fault for this accident was attributed 10% to the defendant and 90% to the
plaintiff.
5-grams in sentences.
Rank 1 I attribute 10% of the fault in this accident to Henry and 90% to Schmidt.
Long Short-Term Memory Networks. Long Short-Term Memory (LSTM)
networks, a different RNN architecture, overcomes the vanishing
annotators still need to read contextual information around a sen-
gradient problem by employing a cell to control removing or adding
tence to confirm the mapping and IRC type.
information throughout the whole training process [14].
We undertook a proof-of-concept experiment to assess if a strat-
GloVe [17] is an unsupervised learning algorithm for obtaining
egy of automatic mapping could make the process more efficient
vector representations for words1 . We used “glove.6B.100d” as pre-
in the future. The idea is to employ sentence embedding to map
trained word embeddings to feed into the LSTM model, where the
annotated summary sentences to full texts. Sentence embedding
vectors were trained on 6-billion tokens and have 100 dimensions.
can represent a sentence and capture semantic information as vec-
Dropout is also adapted to the LSTM model.
tors. Cosine similarity is used to examine the degree of similarity
BERT-based Neural Networks. Google AI Language introduced Bidi- between sentences in annotated summaries and full texts.
rectional Encoder Representations from Transformers (BERT) in
Sentence-BERT Embedding. Sentence embedding techniques rep-
2018 [3]. Instead of using single word embedding like GloVe, BERT
resent the entire sentence and semantic information as vectors.
takes the context into account by using bidirectional pre-training
Sentence-BERT is a modification of the BERT neural model that
for language representations. This pre-training method is intended
uses siamese and triplet networks to produce semantically mean-
to better grasp contextual meaning of a language than single-
ingful sentence embeddings [18]. Sentence-BERT has achieved high
directional pre-training.
levels of performance in measuring the similarity of sentential ar-
RoBERTa [12] replicates BERT training using an improved train-
guments in [15]. Considering the size of our data set, we chose
ing methodology with more data and computational resources. For
to use the BERT base model for sentence embeddings with 768
our study we used RoBERTa in its default configuration.
dimensions.
Convolutional Neural Network with BERT embedding. CNN with We calculated the cosine similarity score for annotated sentences
BERT embedding takes BERT-pretrained embeddings as input and from a summary to every sentence in its corresponding full text.
feeds them into a CNN model for classification. Unlike GloVe pre- All the similarity scores are ranked in descending order, and only
trained word embedding, BERT-pretrained embedding is not a static the top 5 sentences are selected as useful. The remaining sentences
embedding. As sentences are fed in, it produces the word embed- are marked as non-IRC type sentences.
dings in real time. There are reasons to believe that the automatically mapped sen-
We combine the two models: a BERT-based model and a CNN tences bear useful similarities to manually mapped ones. Table 1
classification model. The encoded text passes through the BERT shows examples comparing manually mapped IRC sentences and
model first and produces BERT embeddings. The dimension of the automatically mapped sentences. The top ranked sentences often
BERT embedding (768) is higher than that of GloVe pre-trained include the same key words as the manual sentences do. How-
word embedding (100). Other hyperparameters remain the same as ever, the Reason sentences that the algorithm prefers have fewer
for the CNN-only model. overlapping keywords than Issue and Conclusion.

4.2 Automatic mapping 5 RESULTS


As mentioned in Section 3, the human annotation of full texts uti- Table 2 reports scores for the classification on the test split of the
lizes manual mapping: annotators used key words from annotated full texts. The left side of the table reports the results of training
sentences in original longer summaries to find corresponding sen- on the full text sentences corresponding to the manually mapped
tences in full texts. Even without actually reading the full case, human-annotated sentences from the summaries.
We tested LSTM, CNN, RoBERTa and CNN-BERT with three
1 https://nlp.stanford.edu/projects/glove/ different loss functions: cross-entropy loss, focal loss and F1 loss.

252
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Huihui Xu, Jaromir Savelka, and Kevin D. Ashley

Table 2: Scores for test set on both manually mapped full text sentences and automatically mapped full text sentences by
using LSTM, CNN, RoBERTa, CNN-BERT. The abbreviations of Issue, Reason and Conclusion are I, R, and C. The suffixes are
-P(precision), -R(recall). Ave-F1 stands for the average of class-wise F1 scores.

Training on manually mapped data Training on automatically mapped data


I-P I-R I-F1 R-P R-R R-F1 C-P C-R C-F1 Ave-F1 I-P I-R I-F1 R-P R-R R-F1 C-P C-R C-F1 Ave-F1
LSTM(cross-entropy) 0.72 0.49 0.58 0.35 0.09 0.15 0.72 0.42 0.53 0.42 0.19 0.41 0.26 0.28 0.16 0.20 0.10 0.23 0.14 0.20
LSTM(focal) 0.75 0.43 0.54 0.38 0.11 0.17 0.72 0.35 0.47 0.39 0.17 0.34 0.22 0.16 0.07 0.09 0.16 0.42 0.23 0.18
LSTM(F1) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.22 0.49 0.30 0.06 0.43 0.11 0.08 0.44 0.13 0.18
CNN(cross-entropy) 0.91 0.34 0.49 0.58 0.08 0.14 0.65 0.36 0.46 0.36 0.27 0.23 0.25 0.25 0.01 0.02 0.19 0.17 0.18 0.15
CNN(focal) 1.00 0.28 0.44 0.44 0.14 0.22 0.73 0.28 0.40 0.35 0.30 0.30 0.30 0.13 0.11 0.12 0.17 0.37 0.23 0.22
CNN(F1) 0.73 0.39 0.50 0.43 0.19 0.27 0.73 0.40 0.52 0.43 0.18 0.50 0.26 0.10 0.23 0.14 0.12 0.49 0.19 0.20
RoBERTa(cross-entropy) 0.35 0.38 0.36 0.08 0.01 0.01 0.71 0.09 0.16 0.18 0.21 0.47 0.29 0.16 0.23 0.19 0.19 0.37 0.25 0.24
RoBERTa(focal) 0.36 0.27 0.31 0.00 0.00 0.00 0.35 0.08 0.12 0.14 0.24 0.41 0.30 0.15 0.20 0.17 0.16 0.31 0.21 0.23
RoBERTa(F1) 0.26 0.63 0.36 0.20 0.22 0.21 0.35 0.51 0.41 0.33 0.17 0.63 0.27 0.14 0.28 0.18 0.30 0.44 0.35 0.27
CNN-BERT(cross-entropy) 0.11 0.43 0.18 0.54 0.10 0.17 0.96 0.13 0.23 0.19 0.10 0.59 0.17 0.45 0.07 0.12 0.68 0.28 0.39 0.23
CNN-BERT(focal) 0.12 0.39 0.18 0.03 0.09 0.05 0.46 0.29 0.36 0.20 0.18 0.48 0.26 0.63 0.09 0.15 0.76 0.21 0.33 0.25
CNN-BERT(F1) 0.57 0.50 0.53 0.37 0.18 0.24 0.85 0.27 0.41 0.39 0.13 0.61 0.22 0.09 0.21 0.12 0.44 0.18 0.26 0.20

All the models were trained for 30 epochs. We stored models’ check- models are trained on automatically mapped data. This means that
points after each epoch and evaluated on the separate validation set. models have lower probability of making correct classifications.
The models with lowest loss value on the validation were selected When we compare the performance among the same models with
for the classification on the test set. The performance on the test different loss functions, the model with F1 loss function always has
set is shown in the table. Average F1 is the average of class-wise F1 the highest average F1 score when trained on manually mapped
scores. data except for LSTM. The same pattern does not hold for the
We found that LSTM with the cross-entropy loss function achieved automatically mapped data. Each loss function has its own strength
the highest F1 scores on identifying Issues and Conclusions which in terms of training set and model selection.
are, 0.58 and 0.53, respectively. Both CNN with the F1 loss function
and CNN-BERT with cross-entropy loss have the highest F1 scores 6 DISCUSSION AND ERROR ANALYSIS
(0.27) on identifying Reasons. On average F1, CNN with the F1 loss
6.1 Discussion
function reached 0.43 which is highest among all the models.
The right side of of the table reports the performance of the The results for classifying full text sentences trained on automati-
models that were trained on the full text sentences corresponding cally mapped data, the right side of Table 2, are significantly higher
to the automatically mapped human-annotated sentences from the than trained on annotated summaries in prior work. There the
summaries. Both LSTM with F1, CNN with focal and RoBERTa highest F1 scores for full texts trained on annotated summaries
with focal tied in their performance on Issues (0.30). LSTM with were Issue (0.27), Reason (0.14), and Conclusion (0.24). We attribute
cross-entropy has the highest F1 (0.20) on Reasons. For Conclu- this improvement to using manually-mapped training sentences in
sion classification, CNN-BERT with cross-entropy loss achieves the the full texts, the higher numbers of annotated data, and the use
highest performance (0.39). Finally, RoBERTa with F1 loss achieves of deep learning algorithms plus transformer models (LSTM, CNN,
the highest score in terms of average F1. RoBERTa, and CNN-BERT).
Surprisingly, RoBERTa and CNN-BERT with cross-entropy and As noted, we tried different kinds of neural models paired with
focal losses perform better on automatically mapped data than different loss functions. We confirmed that the F1 loss function im-
manually mapped data in terms of average F1. However, LSTM and proved the performances of CNN and RoBERTa: RoBERTa with F1
CNN do not show the same pattern. The automatically mapped data loss yielded 0.21 on Reason and 0.41 on Conclusion while RoBERTa
are selected by sentence similarity scores with respect to BERT- without F1 loss produced only 0.01 on Reason and 0.16 on Conclu-
Sentence embedding. RoBERTa and CNN-BERT somehow take the sion. When a loss function is aligned with the evaluation metric, it
advantage of information contained in the sentence embedding is likely to improve model performance. LSTM, however, did not
to make a better classification. We are not sure how it affects the perform well with the F1 loss function: on the manually mapped
performance and will investigate it further. We also observed that data , LSTM(F1) yielded 0.0 on all IRC types.
models tend to perform better on Issue and Conclusion than Reason Those models each have advantages for certain sentence types.
despite the type of training set. Since Reasons frequently include LSTM(cross-entropy) yielded the highest F1 scores on Issues and
case facts, it is harder for models to classify them. Conclusions. CNN(F1) and CNN-BERT(cross-entropy) performed
Despite the relative comparable F1 scores between training on best on identifying Reasons. In general, models have difficulty iden-
manually mapped data and automatically mapped data, the preci- tifying Reason sentences, since Reasons have more complex se-
sion of all types of sentences drops significantly in most cases when mantic meanings. As noted, Reasons are intertwined with facts,
which can easily be classified as the non-IRC type. The annotators

253
Toward Summarizing Case Decisions via Extracting Argument Issues, Reasons, and Conclusions ICAIL’21, June 21–25, 2021, São Paulo, Brazil

LSTM has the ability to detect temporal information about a sequence and can handle arbitrary input lengths. Meanwhile, CNN can only accept fixed size input. We think the ability to handle sequential information and longer lengths makes LSTM more suitable for Issues and Conclusions, since these involve plainer language than Reasons. CNN has the upper hand on spotting Reasons. The literal composition of Reasons is more diverse than that of Issues and Conclusions; the convolutional features can capture this diversity.

6.2 Error Analysis
With respect to the right side of Table 2, the proof-of-concept study training models on automatically mapped data, the results suggest that classifying argument triples is feasible, but less effective than with the manually mapped data when taking precision and recall into account. We are particularly interested in the errors that the models made classifying Reasons. As noted, targeting the Reasons correctly is harder since they tend to be more complex and diverse than Issues and Conclusions.
Some of the misclassifications involved phrases attributing an expressed view to the judge. This is a positive sign, in that such self-referential judicial sentences are relatively less frequent in a case opinion and indicate sentences where the judge is more likely to assert that something is an Issue, Conclusion, or Reason. On the other hand, such self-referential attribution phrases do not necessarily discriminate among the three classifications.

7 FUTURE WORK
We plan to continue to annotate new cases in order to increase the size of the training set. Currently, the corpus includes 574 annotated summary / full text pairs. The size of the data set is still not large enough for adequately training more complex neural network models. The data set is sufficiently large, however, to allow us to continue to explore models and identify some challenges. The experience helps us to improve the quality of data as well as informs our intuitions about how human summarizers do their work. We expect that the more annotated data we collect the more interesting properties we will be able to observe in this process.
As noted, prior work explored different sampling strategies for dealing with imbalanced data to improve model performance. Different sampling methods have their merits in terms of their effects on training sets and model types. In this study, we briefly investigated a different method of adding augmented data to improve the performance of the models. Although the results were not as we expected, we observed that it had some positive effect on identifying Reasons from full texts. We will continue to explore other methods to deal with our imbalanced data.
We also plan to test whether a pre-trained legal language model improves performance over a generic language model.

ACKNOWLEDGMENTS
This work has been supported by grants from the Autonomy through Cyberjustice Technologies Research Partnership at the University of Montreal Cyberjustice Laboratory and the National Science Foundation, grant no. 2040490, FAI: Using AI to Increase Fairness by Improving Access to Justice. The Canadian Legal Information Institute provided the corpus of paired legal cases and summaries. Computation resources are provided by the Center for Research Computing at the University of Pittsburgh.

REFERENCES
[1] A. Bansal, Z. Bu, B. Mishra, S. Wang, K. Ashley, and M. Grabmair. 2016. Document Ranking with Citation Information and Oversampling Sentence Classification in the LUIMA Framework.
[2] J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] M. Falakmasir and K. Ashley. 2017. Utilizing Vector Space Models for Identifying Legal Factors from Text. In JURIX. 183–192.
[5] A. Farzindar and G. Lapalme. 2004. Legal text summarization by exploration of the thematic structure and argumentative roles. In Text Summarization Branches Out. 27–34.
[6] V. Feng and G. Hirst. 2011. Classifying arguments by scheme. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 987–996.
[7] C. Grover, B. Hachey, and C. Korycinski. 2003. Summarising legal texts: Sentential tense and argumentative roles. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop. 33–40.
[8] Changai He, Sibao Chen, Shilei Huang, Jian Zhang, and Xiao Song. 2019. Using convolutional neural network with BERT for intent determination. In 2019 International Conference on Asian Language Processing (IALP). IEEE, 65–70.
[9] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882 (2014). arXiv:1408.5882 http://arxiv.org/abs/1408.5882
[10] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. Text classification algorithms: A survey. Information 10, 4 (2019), 150.
[11] J. Landis and G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics (1977), 159–174.
[12] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[13] Q. Lu, J. Conrad, K. Al-Kofahi, and W. Keenan. 2011. Legal document clustering with built-in topic segmentation. In Proc. 20th ACM Int'l Conf. on Information and Knowledge Management. 383–392.
[14] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao. 2020. Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705 (2020).
[15] Amita Misra, Brian Ecker, and Marilyn A. Walker. 2017. Measuring the similarity of sentential arguments in dialog. arXiv preprint arXiv:1709.01887 (2017).
[16] R. Mochales and M. Moens. 2011. Argumentation mining. Artificial Intelligence and Law 19, 1 (2011), 1–22.
[17] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[18] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084
[19] M. Saravanan and B. Ravindran. 2010. Identification of rhetorical roles for segmentation and summarization of a legal judgment. Artificial Intelligence and Law 18, 1 (2010), 45–76.
[20] J. Savelka and K. Ashley. 2018. Segmenting U.S. Court Decisions into Functional and Issue Specific Parts. In Proceedings, 31st Int. Conf. on Legal Knowledge and Information Systems, JURIX. 111–120.
[21] O. Shulayeva, A. Siddharthan, and A. Wyner. 2017. Recognizing cited facts and principles in legal judgements. Artificial Intelligence and Law 25, 1 (2017), 107–126.
[22] A. Wyner, R. Mochales-Palau, M. Moens, and D. Milward. 2010. Approaches to text mining arguments from legal cases. In Semantic Processing of Legal Texts. Springer, 60–79.
[23] Huihui Xu, Jaromír Šavelka, and Kevin D. Ashley. 2020. Using Argument Mining for Legal Text Summarization. Legal Knowledge and Information Systems, JURIX (2020), 184–193.
[24] H. Yamada, S. Teufel, and T. Tokunaga. 2019. Building a corpus of legal argumentation in Japanese judgement documents: towards structure-based summarisation. Artificial Intelligence and Law 27, 2 (2019), 141–170.

Part III

Extended Abstracts
CriminelBART: A French Canadian Legal Language Model
Specialized in Criminal Law∗
Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel
Laval University, Computer Science Department and Faculty of Law
Québec, Canada
nicolas.garneau@ift.ulaval.ca,eve.gaumond@observatoire-ia.ulaval.ca
luc.lamontagne@ift.ulaval.ca,pierre-luc.deziel@fd.ulaval.ca

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Neural networks;

KEYWORDS
Criminal Law, Language Model, Cloze Tests, Text Generation

ACM Reference Format:
Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel. 2021. CriminelBART: A French Canadian Legal Language Model Specialized in Criminal Law. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466147

1 INTRODUCTION
Learning language representations is a key component in many natural language processing tasks, and their usefulness is most often challenged by specialized target domains and vocabulary. We have witnessed several neural causal language models (CLM) that learn contextual representations, such as ELMo [8]. More recently, the Transformer architecture [10] has tremendously improved language representation learning, giving birth to new architectures such as BERT [4], a masked language model, pushing the state of the art of natural language understanding to an unprecedented level of performance on standard benchmarks. Moreover, it has been found that Transformer-based CLMs, such as GPT [9], are excellent feature extractors as well as being impressive text generators. BART [7], an architecture combining the backbone of both BERT and GPT, proved to be particularly effective at generating text while being competitive in comprehension tasks. BARThez, the French version of BART, was recently introduced as a pre-trained model on a very large monolingual French corpus [6]. In this paper, we introduce CriminelBART, a fine-tuned version of BARThez specialized for criminal law using a French Canadian corpus of legal judgments, and we evaluate its performance on different tasks.

∗ By "Specialized" we mean a language model that knows a little more than the average citizen. CriminelBART is far from having the competencies of a criminal lawyer.

2 CRIMINELBART
There have been few attempts to fine-tune pre-trained language models to make them effective on the legal domain. For example, Chalkidis et al. [3] proposed to fine-tune BERT on English law texts scraped from publicly available resources, yielding LegalBERT. These recent advances are all characterized by the same underlying source language, English. We thus propose in this paper a new French language model specialized in Canadian criminal law, a field gaining more attention recently [1].
One important issue for this model is the choice of vocabulary, which differs in important ways between private and public law. The legal remedies are not the same. Indeed, criminal cases are mainly about fines and prison whereas civil cases are mainly about damages. Also, the lawyer and their clients are not called the same whether it is a criminal trial or a civil matter, etc. We thus deemed it essential to instill this criminal specialization into a language model. Just as a lawyer could not be an expert in every domain, we thought that it would be ill-advised to develop a generalist legal model to perform very specific tasks in the criminal field such as plumitifs description generation [1].
The dataset we are working with is a French legal corpus specific to criminal law. It describes several offenses from different laws such as the Criminal Code and the Controlled Drugs and Substances Act. This dataset, comprising around 9000 judgments from the last 10 years, has been gathered from the Criminal and Penal Chamber and was extracted from the Société Québécoise d'Information Juridique (SOQUIJ) website¹.
To construct a specialized model from this corpus, we propose to further pre-train BARThez [5] according to BART's pre-training objectives (i.e. denoising auto-encoder) using the fairseq library² for 250,000 steps, achieving a train and validation perplexity of 1.95 and 1.92 respectively. We then assess the "legal comprehension" of CriminelBART with three different Cloze tests regarding the prediction of criminal charges, legal provisions, and individuals. Given a textual utterance and deleted passages, a model undergoing a Cloze test tries to fill in the blanks with the best possible answers. It thus requires the ability to understand the context and the vocabulary. Formally, given a textual utterance of length n, a model successfully passes the Cloze test by maximizing the following probability:

P(w_i | w_1, w_2, ..., w_{i-1}, w_{i+1}, ..., w_{n-1}, w_n)    (1)

where w_i could be a word or a phrase.

¹ https://soquij.qc.ca/
² https://github.com/pytorch/fairseq
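As a concrete illustration of this kind of Cloze probing, the sketch below uses the Hugging Face fill-mask pipeline with a BARThez-style checkpoint. The checkpoint name, the assumption that it supports single-token mask filling, and the French prompt (the example from Section 3, translated back into French) are all illustrative; the paper's probes fill multi-token passages, which requires generation rather than a single-token fill.

```python
# Illustrative sketch only: single-token mask filling with a BARThez-style model.
# The checkpoint name is a placeholder for a publicly available BARThez checkpoint.
from transformers import pipeline

fill = pipeline("fill-mask", model="moussaKam/barthez")  # placeholder checkpoint

# "The defendant is accused of <mask> under the Criminal Code."
prompt = "Le défendeur est accusé de <mask> en vertu du Code criminel."
for candidate in fill(prompt, top_k=5):
    print(f"{candidate['token_str']:<20} p={candidate['score']:.3f}")
```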


Criminal charges. In order to determine if CriminelBART grasped the accusations' distribution from the corpus, we ask the model to fill in the following sentence (translated in English):
    The defendant is accused of <Mask> under the Criminal Code.
The top 5 predicted passages (with semantic duplicates removed³) comprise "driving under the influence", "aggravated assault", "possession of narcotics", "dangerous driving", and "hit-and-run". As expected, CriminelBART suggests crimes related to the driving of motor vehicles and some other infractions related to controlled drugs and substances, which constitute the main accusations in the corpus (up to 50%). Nonetheless, CriminelBART predicts various crimes while BARThez, a more generic model, mainly returns crimes restricted to premeditated murder.

Legal provisions. We also probe CriminelBART for legal provisions such that, given a context, it predicts which provision it is associated to. To this end, we created 84 cloze tests⁴ on 28 provisions, where CriminelBART achieves an accuracy of 64% over all provisions. Unsurprisingly, BARThez achieves 0% by predicting random tokens. Here is an example regarding provision 4, "Possession of substance", from the Controlled Drugs and Substances Act:
    PER is put on trial on charges of possession of cannabis, thus committing the criminal act provided in article <Mask>.⁵
By analyzing the predictions, we can see that CriminelBART demonstrates some semantic understanding of the different criminal offenses. For example, section 5 ("Trafficking of substance") and section 7 ("Production of substance") are both predicted in this previous context along with section 4. It is relevant as these criminal offenses are most likely to appear with one another and are semantically related. Interestingly, section 95 ("Possession of prohibited or restricted firearm with ammunition") is not within the top 10 predicted sections for this context. This syntactically good candidate, i.e. the "possession" of something illegal, is rejected by the model as not a semantically good one. These results exhibit CriminelBART's capability to establish semantic relationships between certain provisions due to their collocation or similar contexts.

Privacy. When it comes to neural language models, there are inevitably some privacy concerns that arise. Numerous papers such as Carlini et al. [2] report on the leakage of training data from these kinds of models. In the context of CriminelBART, we feared that it would be possible to reidentify someone from the corpus simply by querying the model with the right questions; e.g. would it be possible to reidentify one defendant by asking the model who was accused of one particular offense. We thus designed a cloze test where the context contains a specific accusation (e.g. fraud) and ask CriminelBART to predict the name of the defendant:
    Mr./Ms. <mask> is accused of fraud under the criminal code.
We did so for the top 10 provisions contained in the judgment corpus and found out that the individual was not changing for different accusations. Indeed, the name most often predicted (either male or female) was the name of a judge appearing most often in the corpus. From the other predicted names, we could not find any association between an individual and who committed the crimes. Similar experiments with a generic BARThez model always result in a few unrelated names such as "Mr. Gagné". While this experiment does not expose a clear bias in CriminelBART, the sole possibility, however small it may be, that defendants' names may be coming out of this model is a privacy matter that cannot be ignored. Even though the identity of judges is deemed public, this is not to be taken lightly. In a context where judges are already reluctant to accept the idea that their work might be subject to analyses conducted by AI, it is important not to scare them with uses of technology that may be prejudicial to them. A more in-depth analysis should be conducted before releasing this language model at scale, which is kept as future work. This decision is also highly motivated by the recent critiques of language models being stochastic parrots and the different attacks performed on them in order to extract the training set. In our future work, we wish to leverage CriminelBART as a textual description generator of plumitifs, short legal documents known to be unintelligible [1].

³ We consider "impaired driving" and "driving while impaired" as semantic duplicates.
⁴ Available here: https://bit.ly/3edmM4N
⁵ We anonymized the defendant's name with the generic token PER on purpose.

Acknowledgements. We thank the reviewers for their insightful comments. This research was funded by both the Natural Sciences and Engineering & Social Sciences and Humanities Research Councils of Canada.

REFERENCES
[1] David Beauchemin, Nicolas Garneau, Eve Gaumond, Pierre-Luc Déziel, Richard Khoury, and Luc Lamontagne. 2020. Generating Intelligible Plumitifs Descriptions: Use Case Application with Ethical Considerations. In Proceedings of the 13th International Conference on Natural Language Generation. Association for Computational Linguistics, Dublin, Ireland, 15–21. https://www.aclweb.org/anthology/2020.inlg-1.3
[2] N. Carlini, Florian Tramèr, Eric Wallace, M. Jagielski, Ariel Herbert-Voss, K. Lee, A. Roberts, Tom Brown, D. Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting Training Data from Large Language Models. ArXiv abs/2012.07805 (2020).
[3] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 2898–2904. https://doi.org/10.18653/v1/2020.findings-emnlp.261
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[5] M. K. Eddine, Antoine J.-P. Tixier, and M. Vazirgiannis. 2020. BARThez: a Skilled Pretrained French Sequence-to-Sequence Model. ArXiv abs/2010.12321 (2020).
[6] Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised Language Model Pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 2479–2490. https://www.aclweb.org/anthology/2020.lrec-1.302
[7] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
[8] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237. https://doi.org/10.18653/v1/N18-1202
[9] A. Radford. 2018. Improving Language Understanding by Generative Pre-Training.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.
5 We anonymized the defendant’s name with the generic token PER on purpose.

Applying Decision Tree Analysis to Family Court Decisions:
Factors Determining Child Custody in Taiwan
Sieh-Chuen Huang (College of Law, National Taiwan University, Taiwan) schhuang@ntu.edu.tw
Hsuan-Lei Shao∗ (East Asia Studies Dept., National Taiwan Normal University, Taiwan) hlshao2@gmail.com
Robert B. Leflar (School of Law, University of Arkansas, United States, retired 2020; College of Law, National Taiwan University, Taiwan, since 2020) rbleflar@uark.edu
∗ Corresponding author
CCS CONCEPTS
• Information systems applications → Data mining; • Decision support systems → Expert systems.

KEYWORDS
child custody, best interests of the child, legal factor, machine learning, decision tree.

1 INTRODUCTION
The doctrine of "best interests of the child" has guided courts in determining post-divorce child custody cases in Taiwan since the amendment of the Taiwan Civil Code in 1996, which overturned previous patriarchal practices at least as a matter of law. Previous empirical studies have adopted descriptive statistics in analyzing court cases to determine which factors, as set out in Article 1055-1, are the ones judges tend to consider. However, these approaches do not clarify which factors judges consider primary.
This study collects Taiwanese family court decisions from 2012 to 2017. The study employs decision tree analysis, a commonly used machine learning technology. This appears to be the first published application worldwide of machine learning to the analysis of family court decisionmaking.
The study concludes that the three most significant factors considered by judges in Taiwan are, first, which parent is the child's current primary caregiver, followed by the wishes of the child and the judge's assessment of the relative quality of each parent's parent-child interaction. This result runs counter to widely held beliefs that parental gender and parents' occupations and economic resources are still prime factors in judges' contemplation.

2 BACKGROUND OF TAIWAN CHILD CUSTODY WHEN PARENTS DIVORCE
Before the 1996 amendment, Taiwan's Civil Code had stipulated that, in both consensual and judicial divorce, the custody of children belonged to the father unless either it had been agreed otherwise in a consensual divorce (Article 1051) or the court had decided otherwise (Article 1055). The 1996 amendment repealed Article 1051 and amended Article 1055, replacing paternal preference with the "best interests of the child" doctrine and recognizing joint custody and non-custodial parents' visitation rights. New Article 1055-1 lists several factors that judges must consider, such as the age, sex, birth order, health condition, and the wishes of the child, and the age, occupation, character, economic ability and lifestyle of the parents. Such a broad standard gives judges considerable discretion in deciding what is in the best interests of the child.
Our goal is to provide a clear answer to the long-standing debate on what "most important factors" in fact influence judges' custody decisions. To accomplish this goal, we applied a machine learning algorithm to determine what those factors are. We established a dataset which we carefully labeled according to our research concerns (weighing normative "factors"). Our model predicts outcomes accurately, and also provides insights on the "most important factors" in a way useful to parents, their lawyers, and legal scholars.

3 RESEARCH DESIGN
3.1 Data Collecting
All cases except juvenile and sexual assault cases decided by district courts in Taiwan since 2000 are open to the public on the official website of the Judicial Yuan. We focus on child custody decisions of first instance decided by family and district courts. Using carefully chosen causes of action, keywords and decision dates, we identified 3,028 child custody decisions between January 1, 2012 and December 31, 2017.
We then limited our sample to cases in which both parents were Taiwanese and both sought to acquire custody. This is because when one parent (usually the defendant) does not come to court or keeps silent, the other (the plaintiff) is very likely to receive custody. We excluded these cases from our dataset. Among the 3,028 cases, the 2,096 cases in which one of the parents did not express any opinion regarding custody and the 97 cases of transnational marriage were therefore excluded. The remaining 835 cases contain 1,290 children. Among them, 1,126 children were under sole custody (87.3%), 159 children were under joint custody (12.3%), and 5 children were under third party guardianship (0.4%).

3.2 Dataset Construction / Annotation Labels
We created a model that predicts the value of a dependent variable: custody granted to father (labeled "1") or mother (labeled "0").


As independent variables, we manually defined 19 factors from Article 1055-1, social workers' evaluation items and previous literature (Table 1 below).

Table 1: Factors Considered by Judges

Character | Factor
Child | Sex; Age; Child Willing; Emotional feelings between the other persons living together and the child; Health condition
Parent | Character (drug use, alcohol consumption, ...); Economy; Willing; Undue behavior (domestic violence towards the child); Parenting time; Parenting environment; Friendly parent; Primary caretaker (caregiver); Understanding the child; Parenting plan
Both | Parent-child interaction; Current residence; Support system
Others | Social worker's report

3.3 Decision Tree Learning
This study adopts the CHAID (Chi-squared Automatic Interaction Detector) algorithm to make strategic splits. This algorithm's predictive performance is better than that of statistical techniques such as regression, and it shows the nodes clearly, making the outcomes explainable in terms of the 19 factors listed above.
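To make the setup concrete, the following is a minimal sketch under stated assumptions: scikit-learn provides CART rather than the CHAID algorithm used in the paper, the toy data are invented, and the feature encodings merely mimic the convention described in Section 4.1 (e.g. Caregiver: 3 = mother, 2 = both, 1 = father).

```python
# Illustrative sketch only: scikit-learn's DecisionTreeClassifier (CART) as a
# stand-in for the paper's CHAID tree; the toy data below are invented.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Caregiver":   [3, 3, 1, 2, 1, 3, 2, 1],   # 3 = mother, 2 = both, 1 = father
    "childWill":   [3, 1, 1, 2, 3, 2, 3, 1],   # child's stated preference
    "Interaction": [2, 1, 3, 2, 1, 2, 3, 1],   # parent-child interaction assessment
    "custody":     [0, 1, 1, 0, 0, 0, 0, 1],   # 0 = mother, 1 = father (target)
})

X, y = data.drop(columns="custody"), data["custody"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text view of the splits
```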
4 RESEARCH RESULTS
4.1 Model Demonstration
Figure 1: Decision Tree of Child Custody Cases in Taiwan

The root node of the tree—the feature that best divides the data—is Caregiver, indicating that judges in Taiwan primarily consider who the current caregiver of the child is in the first place. If the caregiver is the mother (represented by the number "3") or both mother and father (represented by the number "2"), the model will follow the line (branch) on the left side to the next node, childWill, meaning that the model predicts that judges will, secondarily, consider the child's wishes. If the child prefers the mother or has no preference (>= 2), the model again follows the line on the left side, coming to the next end node, which is labeled with a probability distribution indicating that in the validation set, custody has a very high probability (95%) of going to the mother (the label "0"). On the contrary, even if the mother is the primary caregiver, in the case that the child prefers the father (< 2), the probability that the mother gets custody becomes fairly low (17% in the validation set).

4.2 Model Efficiency
The model's accuracy is 96.5% in its test set and the F1 score is 0.9783, indicating that the model is quite satisfactory.

Table 2: Confusion Matrix of Child Custody Model Test Set (Test_set_N=226)

                                          Results predicted by the machine
                                          Positive (for mother)     Negative (for father)
Actual results   Positive (for mother)    True positive, TP=181     False negative, FN=7
                 Negative (for father)    False positive, FP=1      True negative, TN=37
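The reported figures can be reproduced, up to rounding, directly from the Table 2 confusion matrix, treating "for mother" as the positive class:

```python
# Quick check of the reported test-set metrics from the Table 2 confusion matrix,
# with "for mother" as the positive class.
tp, fn, fp, tn = 181, 7, 1, 37

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 218/226 ≈ 0.965
precision = tp / (tp + fp)                                  # ≈ 0.995
recall    = tp / (tp + fn)                                  # ≈ 0.963
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.978

print(f"accuracy={accuracy:.4f}, f1={f1:.4f}")
```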
5 CONCLUSION
Among the numerous factors stipulated in Article 1055-1 of the Taiwan Civil Code, "primary caregiver," "child's wishes" and "parent-child interaction" are the three most significant factors contemplated by judges. The pattern of Taiwanese judges' decision-making regarding child custody appears to be relatively constant and stable during the six-year period studied.
Furthermore, in custody disputes addressed by judicial decisions, the mother seems to have overwhelming supremacy: in our dataset, mothers had a 75% likelihood of receiving sole custody.
This research should help legal scholars identify particular custody cases as either typical or exceptional by comparing the machine-predicted results to actual judgments. A case may be exceptional, worth further exploration and research, if the two outcomes are inconsistent. In addition, this study should assist parents and their lawyers to preliminarily evaluate their possibilities of acquiring custody. And when the outcome of litigation can be predicted in advance by the parties, the likelihood of cases going to court will fall and the likelihood of settlement will increase. Thus, machine learning helps us predict what the "law" is and hence contributes to legal certainty.

ACKNOWLEDGMENTS
Hsuan-Lei Shao, "From Knowledge Genealogy to Knowledge Map - China Studies in Big Data and Machine Learning" (107-2410-H-003-058-MY3, Ministry of Science and Technology), Taiwan.
Sieh-Chuen Huang, "Exploring Implications of Legal Analytics along with AI Technologies" (108-2628-H-002-005-MY2, Ministry of Science and Technology), Taiwan.
Sentence Classification for Contract Law Cases: A Natural
Language Processing Approach
Jonathan R. Mok (Norris Injury Lawyers PC) jonrexmok@gmail.com
Wai Yin Mok (University of Alabama in Huntsville) mokw@uah.edu
Rachel V. Mok (Independent Researcher) rmok57@gmail.com

1 INTRODUCTION
Legal practice can gain much from advances in machine learning and A.I. technology. Common law countries such as the United States, the United Kingdom, Canada, and Australia rely on judicial precedent to decide what law governs new cases and how that law is applied. Judicial precedent is the body of case law that observes how past courts have decided similar cases. Case law is composed of individual case decisions by courts that resolve a particular dispute. When resolving new cases, courts look back on judicial precedent to structure how to decide said new cases. Thus, an extensive knowledge base of how prior cases have been decided under what law is needed in each field of legal practice. Legal research is the practice of identifying relevant case law for a particular set of facts and is essential for any practitioner seeking to make the most persuasive argument for clients. Reviewing case law and identifying relevant cases can be time consuming and imperfect; there is always risk that practitioners overlook relevant and/or new case law that has changed the legal landscape of their practice. Thus practitioners are not only required to retain an extensive knowledge base of relevant case law, but also to keep abreast of changes in case law.
Machine learning and A.I. technology has potential to assist practitioners and the judiciary by automating legal research and analysis for new cases. A fully realized system would streamline legal workflow and may be able to analyze inputs raised in new cases, infer legal issues, identify relevant case law, draft arguments for and against each issue, consider jurisdictional variations, and generate a predictive outcome with a corresponding percent certainty. Such a system could not only be used by practitioners to assist new case analysis, but also by judges to assist them to resolve pending cases without present bias.
However, there are tremendous barriers to developing such a system. Firstly, not all case law is relevant to every new case. The practice of law is subcategorized into many different fields of law that govern each type of dispute. In the United States, the first major difference in law arises when distinguishing whether a case is governed by civil law or criminal law. Civil law governs disputes among private parties that mainly deal with monetary damages; however, the government may be involved in civil suits that it is not immune from. Criminal law governs conduct deemed harmful to the state or society that is prosecuted by the government and can result in the loss of personal liberties, such as imprisonment, and/or fines. Civil law case law is considered different to criminal law case law with little overlap. Moreover, civil law and criminal law are subcategorized into further specialized fields of law; civil law includes contract law, property law, tort law, family law, employment law, patent law, etc., and criminal law includes case law on crimes against a person's body and/or mind, crimes against property, crimes that violate a statute, crimes of financial fraud and/or impropriety, etc. Each field of law listed is based on a body of case law that is considered different to any other with little overlap and can involve years of practice to become intimately familiar with.
That said, a universal system that could analyze any type of case brought before it would need to rely upon a vast knowledge base of case law, be able to distinguish relevant facts from a new case, and pair them with relevant case law to even begin an analysis, let alone complete the remaining steps in a typical legal workflow.
This paper proposes to take tentative steps towards a self-learning system for legal research, with the initial steps being to classify the intended knowledge base, case law, into machine learnable information. This involves classifying individual court opinions into legally useful sentence types, a common practice among law students and practitioners alike. This paper will focus on case law in the United States, specifically contract law case law in the state of Alabama, and may explore other fields of law in later research.
Translating case law into machine learnable information requires understanding case law first. Case law is composed of individual court opinions written by judges. A single case focuses on a particular set of facts that gave rise to a dispute(s). Each case involves at least two parties, but can involve more. Case law generally follows patterns as to how each opinion is written. An opinion can be classified into five main types of sentences for purposes of legal understanding: fact sentences, issue sentences, holding sentences, reasoning sentences, and law sentences.
Fact sentences identify relevant facts to the case such as, but not limited to, party actions leading up to the dispute; background information such as practices, scientific principles, etc. needed to understand the dispute; and damages sustained as a result of the dispute. Bear in mind any case can have a myriad of facts surrounding it; the facts that are included in a judicial opinion are those facts as understood by the court and deemed relevant to resolve the dispute.
Issue sentences frame the nature of the dispute and can be further classified into two types: court issue sentences and party issue sentences. Court issue sentences are statements made by the court itself that define the nature of the dispute; a court emphasizing clarity may even have headers identifying court issue sentences, but this is not expected in every judicial opinion. Party issue sentences are statements made by each party attempting to frame the dispute in their terms and are reiterated by the court; these sentences will typically begin by stating ABC party argues XYZ point.


Party issue sentences are less significant than court issue sentences, unless the court explicitly agrees with a particular party issue sentence, which at that point elevates it to become a court determination known as a court holding.
A holding sentence states a court conclusion that resolves a dispute by applying relevant law to the present set of facts in a case. A holding sentence is the most important type of sentence in any judicial opinion; it is considered precedent and binding on any future case brought within the issuing court's jurisdiction.
Reasoning sentences help explain how a court has reached its conclusion; such reasoning is analogous to a mathematical proof, but not nearly as exact. Reasoning sentences can be difficult to definitively classify; seasoned practitioners can mistake reasoning sentences for holding sentences and vice versa. Any court determination not based on the present facts of the case is not considered a holding sentence, but likely a reasoning sentence; any hypothetical statement, such as an example or analogy, made by the court is not a holding sentence, but likely a reasoning sentence; and any commentary on law or fact not regarding the present case, such as on a past case, is not a holding sentence, but likely a reasoning sentence.
Law sentences are statements of law made by the court and are always preceded or followed by a citation, or the citation may be in the sentence itself. A law sentence may be the court restating a law as it appears in the code of law, the court restating a holding from a prior case, or the court stating its own interpretation of a law or a prior court's holding.
Another type of sentence found in court opinions is the procedural history sentence, which describes how a case has progressed through the court system to be evaluated before the present court. Almost all cases found in case law begin in a court that makes both factual and legal determinations, such as which party's account of the dispute is considered correct and what legal principles are applied to resolve the case; such a court is called a court of first instance or trial court. All losing parties in civil and criminal matters, bar the criminal prosecution, have a right to appeal to a higher court called an appellate court for additional review of mainly legal principles. Appellate courts rarely engage in fact finding and do so only under specific circumstances, such as when new relevant evidence arises in a case and is properly presented. In actuality, the majority of case law is written by appellate courts. Consequently, appellate courts will describe in procedural history sentences how a case was decided by a court of first instance and which party appealed the case on what grounds to bring the case before it. Such information is of some value to practitioners, but can essentially be disregarded when forming a knowledge base for machine learnable information for the purposes of this paper.
Thus far, sentence classification according to fact sentences, issue sentences further classified by court or party, law sentences, reasoning sentences, holding sentences, and procedural history sentences can be applied to every court opinion. Every court opinion also contains references that are valuable to practitioners, but can essentially be disregarded when forming a knowledge base for machine learnable information for the purposes of this paper. Citations are classified as reference sentences in this paper.
As previously stated, this paper will explore classification of legally useful sentence types found in contract law case law in the state of Alabama. Due to the difficulty of classifying reasoning sentences at this stage of the research, such classification has been excluded, but will be examined in later research. As a result, this paper will only focus on seven sentence types: Fact (FCT), Court Issue (CTI), Party Issue (PTI), Holding (HLD), Law (LAW), Procedural History (PRH), and Reference (REF).

2 KNOWLEDGE BASE
(a) Testing cases, 1063 total sentences. (b) Training cases, 646 total sentences.
Figure 1: Confusion matrices of the testing and training cases. The elements along the diagonal show the number of sentences that were identified correctly, and each off-diagonal element shows the number of sentences that were identified incorrectly and in what label they were identified as.

spaCy [1] has been chosen to parse and process court cases. Twelve Alabama contract cases were downloaded from JUSTIA, of which five are designated as training cases and seven as testing cases. Two different approaches are adopted for classifying the sentences: a rule-based approach for the reference sentences and a knowledge-based approach for the other types of sentences. Because fact sentences are the default, the knowledge base contains fragments of sample sentences that are neither reference nor fact sentences. The fragments of the sample sentences are chosen to represent the essential parts of the sample sentences from which they were extracted, but are general enough so that sentences whose types are to be determined may contain similar fragments. The similarity of a sentence and the sentence fragments in the knowledge base is calculated by spaCy's similarity function, which is based on word vectors [2, 3]. Shown in the figure above are the results of our algorithm. Although our algorithm is still in an early stage of development, it yields a classification accuracy rate of 68.67% on the testing cases, demonstrating the validity of our approach.
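A minimal sketch of this kind of knowledge-based matching follows, assuming a spaCy model with word vectors; the fragments, labels, and threshold are invented for illustration and are not the authors' knowledge base.

```python
# Illustrative sketch: compare an unlabeled sentence to labeled fragments with
# spaCy word-vector similarity; fact (FCT) is the default label. The fragments
# and threshold below are placeholders.
import spacy

nlp = spacy.load("en_core_web_md")   # a model that ships with word vectors

knowledge_base = {
    "HLD": ["we therefore hold that the contract was breached"],
    "LAW": ["under Alabama law, a contract requires offer, acceptance, and consideration"],
    "PRH": ["the trial court entered summary judgment and the plaintiff appealed"],
}

def classify(sentence, threshold=0.75):   # placeholder threshold
    doc = nlp(sentence)
    best_label, best_score = "FCT", threshold
    for label, fragments in knowledge_base.items():
        for fragment in fragments:
            score = doc.similarity(nlp(fragment))
            if score > best_score:
                best_label, best_score = label, score
    return best_label

print(classify("The trial court granted summary judgment, and this appeal followed."))
```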
REFERENCES
[1] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303
[2] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1301.3781
[3] Tomás Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. CoRR abs/1310.4546 (2013). arXiv:1310.4546 http://arxiv.org/abs/1310.4546

Constraint Answer Set Programming as a Tool to Improve
Legislative Drafting
A Rules as Code Experiment

Jason Morris
jmorris@smu.edu.sg
Singapore Management University Centre for Computational Law
Singapore

CCS CONCEPTS
• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability.

KEYWORDS
legal knowledge representation and reasoning, constraint answer set programming, rules as code

ACM Reference Format:
Jason Morris. 2021. Constraint Answer Set Programming as a Tool to Improve Legislative Drafting: A Rules as Code Experiment. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466084

1 RULES AS CODE
"Rules as Code" in this paper is used to refer to a proposed methodology of legislative and regulatory drafting.¹ That legislation can be represented in declarative code for automation has long been recognized [6], as has the opportunity for improving the quality of legal drafting with the techniques of formal representation [1]. Rules as Code further proposes that both drafting and automation would be improved by initially co-drafting statute law in both natural and computer languages simultaneously [4].
Knowledge acquisition bottlenecks and roadblocks associated with statutory interpretation are largely avoided. The co-drafted encoding need only reflect what the legislation says, and not what the legislators meant. Legislative intent is instead encoded as tests by people with authoritative knowledge of the intent, the drafters. In this way, failed tests can be used in the drafting process to signal issues with the natural language draft. When the drafting process is complete, an authoritative encoding consistent with the legislative intent already exists. This encoding can be used by regulators and regulated entities to automate services and compliance tasks.

¹ The phrase "Rules as Code" is also often used to refer to legal knowledge representation and reasoning generally.

2 S(CASP)
s(CASP) is a stable-model constraint answer set programming language, implemented in the Ciao programming language [3]. s(CASP) was selected for use in this experiment because of its ability to generate natural language explanations for answer sets [2], and its ability to perform both deductive and abductive reasoning from the same encoding with minor adjustments [3]. Justifications for automated conclusions have long been recognized as useful in legal applications, both for end-users and as a development tool [5].

3 RULE 34, LEGAL PROFESSION (PROFESSIONAL CONDUCT) RULES OF SINGAPORE, 2015
Singapore's Legal Profession Act (Cap. 161) (the "Act") governs the legal profession in Singapore. Part VI of the Act establishes the Professional Conduct Council, which in section 71(2) of the Act is given broad authority for drafting rules governing the practice, conduct, etiquette, and discipline of legal practitioners in Singapore. The Professional Conduct Council has enacted the Legal Profession (Professional Conduct) Rules (S 706/2015) ("the Rules"). In 2015 the Rules were significantly amended with a new Rule 34 setting out restrictions on lawyers accepting executive appointments outside of their legal practice.

4 EXPERIMENTAL DESIGN
Our interdisciplinary team undertook to assess the strengths and weaknesses of s(CASP) as a tool for improving legislative drafting in a Rules as Code approach. The author encoded a literal interpretation of Rule 34 in s(CASP), and separately encoded the author's expectations of the behaviour of Rule 34 as a set of tests. Test failures that the author attributed only or primarily to issues with the natural language drafting of Rule 34 were raised with legally-trained team members to confirm whether the expected behaviour was reasonable, and whether the cause of the test failure was a legal drafting issue.
The discovery of such issues would demonstrate the feasibility of using s(CASP) to detect legislative drafting issues. Any issues arising would be in the context of non-authoritative opinions as to the expected behaviour of the text. The experiment cannot, therefore, be used to diagnose issues with the Rule as enacted.

5 EXPERIMENTAL RESULTS
The code written to implement this experiment is available at https://github.com/smucclaw/r34_scasp.


A set of 25 tests were encoded, and there were 4 test failures not explained by errors encoding the Rule or the tests. These four failures were investigated by the author by performing 'why not' queries and reviewing the justifications provided by s(CASP).
This process revealed that the failing tests were encoded on the basis of an expectation that the word "business" in Rule 34(1)(b) referred to a legal practitioner's activities. But Rule 34(9) defines "business" to refer to a general category of undertaking. Setting out a test in which Rule 34(1)(b) applied, while also using the defined meaning of "business", required making statements that did not have clearly meaningful real world equivalents. This suggested that Rule 34(1)(b) might also use the word "business" in a way inconsistent with the defined meaning, which would be a drafting issue.
That issue was raised with the rest of the research team, who confirmed that Rule 34(1)(b) had been faithfully encoded, that the expectations of the failing tests were reasonable, and that Rule 34(1)(b) required the use of an interpretation of the word "business" that is inconsistent with the defined meaning of the word in order to give effect to that expectation, or to give it any clear meaning at all.
The research team seriously considered the possibility that there might be a different interpretation of other aspects of the Rule that would make Rule 34(1)(b) more clearly meaningful. The team was unable to find an interpretation that would have had that effect and would not also make Rule 34(1)(b) redundant to other portions of the Rule. The research team therefore concluded that it would be more correct if Rule 34(1)(b) referred not to businesses but to the holding of an executive appointment.
The researchers agreed on the following proposed replacement for Rule 34(1)(b):
    (1A) A legal practitioner must not accept any executive appointment that materially interferes with —
    (i) the legal practitioner's primary occupation of practising as a lawyer;
    (ii) the legal practitioner's availability to those who may seek the legal practitioner's services as a lawyer; or
    (iii) the representation of the legal practitioner's clients.
The proposed amendment was encoded, and the tests re-run. All 25 tests passed.

6 CONCLUSIONS
Our experiment demonstrates the use of the Rules as Code methodology to detect a drafting issue in a proposed statutory text, and to verify the effect of a proposed amendment. The issue discovered in this experiment is the type of issue that Rules as Code is intended to address early: one that, if left unaddressed, negatively affects the degree to which the statutory text can be automated.
With regard to s(CASP)'s strengths and weaknesses for this task, the access to "why not" queries and natural language justifications was extremely valuable both in the encoding of the Rule and in the analysis of test failures. s(CASP)'s abductive reasoning over constraints, and the fact that it returned answer sets rather than bindings, allowed the author to test the encoding against a wide variety of fact scenarios simultaneously, quickly providing a deep level of insight into the behaviour of the encoding. s(CASP) also facilitated the use of a version of defeasibility that allowed defeating relations of both the "subject to" and "despite" types to be encoded where they appear in the text, enhancing maintainability of the code [7].
s(CASP)'s abductive queries slow down considerably with the complexity of the code, and so it may not be an appropriate approach for real-time applications of abductive reasoning. However, its performance on deductive reasoning tasks was very efficient, completing the 25 tests in this experiment in an average of less than 1 second each, which suggests it can also be used to answer legal questions with complicated fact scenarios and complicated rules in a user-facing application.

ACKNOWLEDGMENTS
I owe a debt of gratitude to all my colleagues at the SMU Centre for Computational Law, and in particular our Principal Investigator Meng Weng Wong, Industry Director Alexis Chun, and Professors Lim How Khang and Jerrold Soh, all of whom contributed greatly to the legal analysis. Professors Gopal Gupta of the University of Texas at Dallas and Joaquín Arias at Universidad Rey Juan Carlos provided valuable assistance on the effective use of s(CASP). The feedback of the reviewers has also improved the paper and is gratefully acknowledged.
This research is supported by the National Research Foundation (NRF), Singapore, under its Industry Alignment Fund – Pre-Positioning Programme, as the Research Programme in Computational Law. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

REFERENCES
[1] L. Allen and C. R. Engholm. 1978. Normalized Legal Drafting and the Query Method. Journal of Legal Education 29 (1978), 380–412.
[2] Joaquín Arias, Manuel Carro, Zhuo Chen, and Gopal Gupta. 2020. Justifications for Goal-Directed Constraint Answer Set Programming. arXiv preprint arXiv:2009.10238 (2020).
[3] Joaquín Arias, Manuel Carro, Elmer Salazar, Kyle Marple, and Gopal Gupta. 2018. Constraint answer set programming without grounding. Theory and Practice of Logic Programming 18, 3-4 (2018), 337–354.
[4] Organization for Economic Cooperation and Development, Observatory for Public Sector Innovation. [n.d.]. Cracking the Code: Rulemaking for humans and machines. Accessed February 28, 2021, at https://oecd-opsi.org/wp-content/uploads/2020/10/Rules-as-Code_Highlights_Final_HighRes.pdf
[5] D. Merritt. 2017. Expert Systems in Prolog. Independently Published. https://books.google.com.sg/books?id=6IQGyQEACAAJ
[6] Marek J. Sergot, Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek, Peter Hammond, and H. Terese Cory. 1986. The British Nationality Act as a logic program. Commun. ACM 29, 5 (1986), 370–386.
[7] Hui Wan, Benjamin Grosof, Michael Kifer, Paul Fodor, and Senlin Liang. 2009. Logic Programming with Defaults and Argumentation Theories. In Logic Programming, Patricia M. Hill and David S. Warren (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 432–448.

Predicting Legal Proceedings Status: Approaches Based on
Sequential Text Data
Felipe Maia Polo (University of São Paulo, Brazil; Advanced Institute for AI (AI2), Brazil) felipemaiapolo@gmail.com
Itamar Ciochetti (Tikal Tech, Brazil) itamar@tikal.tech
Emerson Bertolo (Tikal Tech, Brazil) emerson@tikal.tech
ACM Reference Format:
Felipe Maia Polo, Itamar Ciochetti, and Emerson Bertolo. 2021. Predicting Legal Proceedings Status: Approaches Based on Sequential Text Data. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466138

1 OBJECTIVE AND PRACTICAL IMPORTANCE OF THIS WORK
The objective of this work, which is fully given by Polo et al. [8], is to develop predictive models to classify Brazilian legal proceedings in three possible classes of legal status: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. Each proceeding is made up of a chronological sequence of short texts called "motions" written by the courts' administrative staff. The motions relate to the proceedings, but not necessarily to their legal status. Moreover, the proceedings' labels are decided by the courts to organize their workflow. This problem's resolution is intended to assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.

2 RELATED WORK
Despite researchers' efforts to create applications in the legal field, we were unable to find an attempt to solve a problem like ours in the literature. The issues closest to ours that we could find in the literature are those of identifying the parties in legal proceedings [7], classification of legal documents according to their administrative labels [2], or predicting the area a proceeding belongs to [11]. This paper has a different application that can be useful when looking for efficiency in legal systems, especially in developing countries. Unlike previous work, we consider sequences of texts explicitly in our modeling, which we have not yet observed in the Law and AI literature.

3 DATA
Our data is composed of two datasets: a dataset of 3 × 10⁶ unlabeled motions and a dataset containing 6449 legal proceedings, each with an individual and variable number of motions, but which have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% is classified as active (class 2), and 7.63% is classified as suspended (class 3). The datasets we use are representative samples from the first and third most significant Brazilian state courts (São Paulo and Rio de Janeiro).
In this work, we split our labeled dataset at random into three parts: a training set (70%) for training models, a validation set (10%) for hyperparameter tuning, and a test set (20%) for final assessment.

4 METHODOLOGY
We used four approaches to extract features from the legal texts and three base classifiers to create our predictive models to classify legal proceedings, i.e., text sequences. A more detailed explanation of our methodology can be found in Polo et al. [8].

4.1 Classifiers
The first classifier we use is a many-to-one long short-term memory network (LSTM). The inputs are given by T vectors representing the T most recent texts in chronological order. The outputs are predicted probabilities for each of the three classes, returned by the Softmax function. The second classifier is a multilayer perceptron neural network (MLP) with one hidden layer and ReLU activation functions. The MLP input is the concatenation of the feature vectors of the last T texts. The third classifier is an XGBoost [1] tree ensemble. We feed the last classifier with the same inputs used for the MLP. All classifiers make a classification by choosing the most probable class, and we fix T = 5, using zero-padding vectors when necessary. Details of the hyperparameter tuning phase can be found in Polo et al. [8].
labels [2] or predicting the area a proceeding belongs to [11]. This et al. [8].
paper has a different application that can be useful when looking
for efficiency in legal systems, especially in developing countries. 4.2 Feature Extraction
Unlike previous work, we consider sequences of texts explicitly We use four different approaches to extract features from texts:
in our modeling, which has not yet been observed in Law and AI Word2Vec (W2V) [5], Doc2Vec (D2V/PV-DM) [4], TFIDF [9], and a
literature by us. Brazilian Portuguese BERT-Base [10]. All of them are completely
unsupervised or self-supervised methods. For all approaches, we
3 DATA had text preprocessing and hyperparameter setting steps, which
Our data is composed of two datasets: a dataset of 3 · 106 unlabeled are detailed in Polo et al. [8]. Word2Vec, Doc2Vec, and TFIDF repre-
motions and a dataset containing 6449 legal proceedings, each sentations are fully trained using the mass of 3 · 106 texts/motions
with an individual and a variable number of motions, but which from unlabeled proceedings, while the BERT-Base is fine-tuned
have been labeled by lawyers. Among the labeled data, 47.14% is in the same dataset, making use of the Masked Language Model
(MLM) objective. One thing that is worth mentioning is that we
Permission to make digital or hard copies of part or all of this work for personal or use a method proposed by Mikolov et al. [6] in order to identify
classroom use is granted without fee provided that copies are not made or distributed presence words (2 to 4) that should be considered as unique tokens
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored. when working with W2V, D2V, and TFIDF.
For all other uses, contact the owner/author(s). Given Word2Vec creates representations for tokens and not for
ICAIL’21, June 21–25, 2021, São Paulo, Brazil entire texts, we use two different approaches to that end. One
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06. of them is used in conjunction with the LSTM classifier, and the
https://doi.org/10.1145/3462757.3466138 other is used in conjunction with the MLP and XGBoost classifiers.
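To make the pipeline concrete, the following is a minimal Python sketch (not the authors' implementation) of the W2V + XGBoost variant described above: each motion is represented by the average of its token embeddings, the last T = 5 motions are concatenated, and XGBoost predicts the proceeding status. All hyperparameters, helper names, and toy data here are illustrative only.

# Sketch of the W2V (average pooling) + XGBoost pipeline; illustrative values only.
import numpy as np
from gensim.models import Word2Vec
from xgboost import XGBClassifier

T, DIM = 5, 100  # number of most recent motions kept, embedding size (illustrative)

def motion_vector(tokens, w2v):
    # Average of the embeddings of the tokens present in the vocabulary.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def proceeding_features(motions, w2v):
    # Keep the T most recent motions (chronological order) and zero-pad if needed.
    recent = motions[-T:]
    mat = [motion_vector(m, w2v) for m in recent]
    while len(mat) < T:
        mat.insert(0, np.zeros(w2v.vector_size))
    return np.concatenate(mat)  # shape: (T * DIM,)

# Toy data standing in for the unlabeled motions and the labeled proceedings.
unlabeled_motions = [["autos", "arquivados"], ["remessa", "origem"]]
proceedings = [([["autos", "arquivados"]], 0), ([["processo", "ativo"]], 1)]

w2v = Word2Vec(sentences=unlabeled_motions, vector_size=DIM, min_count=1)  # unsupervised step
X = np.stack([proceeding_features(m, w2v) for m, _ in proceedings])
y = np.array([label for _, label in proceedings])
clf = XGBClassifier(n_estimators=100).fit(X, y)
print(clf.predict_proba(X))  # class probabilities; the predicted class is the argmax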


Classifier | Feature extraction | Accuracy    | Macro F1    | Macro Precision | Macro Recall | Weighted F1 | Weighted Precision | Weighted Recall
LSTM       | W2V (using CNN)    | 0.93 ± 0.01 | 0.88 ± 0.01 | 0.92 ± 0.01     | 0.85 ± 0.02  | 0.92 ± 0.01 | 0.93 ± 0.01        | 0.93 ± 0.01
LSTM       | Doc2Vec            | 0.82 ± 0.01 | 0.76 ± 0.02 | 0.77 ± 0.02     | 0.75 ± 0.02  | 0.82 ± 0.01 | 0.82 ± 0.01        | 0.82 ± 0.01
LSTM       | TFIDF              | 0.90 ± 0.01 | 0.85 ± 0.01 | 0.85 ± 0.01     | 0.85 ± 0.02  | 0.90 ± 0.01 | 0.90 ± 0.01        | 0.90 ± 0.01
LSTM       | BERT               | 0.93 ± 0.01 | 0.89 ± 0.01 | 0.92 ± 0.01     | 0.87 ± 0.02  | 0.93 ± 0.01 | 0.93 ± 0.01        | 0.93 ± 0.01
XGBoost    | W2V                | 0.92 ± 0.01 | 0.87 ± 0.01 | 0.92 ± 0.01     | 0.84 ± 0.02  | 0.92 ± 0.01 | 0.92 ± 0.01        | 0.92 ± 0.01
XGBoost    | Doc2Vec            | 0.87 ± 0.01 | 0.83 ± 0.01 | 0.89 ± 0.01     | 0.79 ± 0.02  | 0.87 ± 0.01 | 0.88 ± 0.01        | 0.87 ± 0.01
XGBoost    | TFIDF              | 0.92 ± 0.01 | 0.88 ± 0.01 | 0.93 ± 0.01     | 0.84 ± 0.02  | 0.92 ± 0.01 | 0.93 ± 0.01        | 0.92 ± 0.01
XGBoost    | BERT               | 0.92 ± 0.01 | 0.86 ± 0.01 | 0.92 ± 0.01     | 0.83 ± 0.02  | 0.92 ± 0.01 | 0.92 ± 0.01        | 0.92 ± 0.01
Table 1: Evaluation of classification approaches (scores ± bootstrap std. errors). We combine three basic classifiers (LSTM, MLP, and XGBoost) and four approaches for extracting features (Word2Vec, Doc2Vec, TFIDF, and BERT). In this extended abstract, we omit MLP's results since they are not better than LSTM's and XGBoost's.

When using LSTM networks as classifiers, we first create a text matrix, each row being given by a token embedding. Secondly, we apply K convolutional filters [3], given by one-dimensional matrices, to extract the desirable information from the texts. The filters are trained in conjunction with the LSTM weights, while the embeddings are frozen. Due to the last detail, we also refer to the W2V/LSTM classification approach as W2V/CNN/LSTM, where "CNN" stands for convolutional neural networks. On the other hand, to represent texts when working with the MLP and XGBoost classifiers, we use the average vector of each text's token embeddings.

Filter | Token                         | cos(θ)
6      | "final storage of docket"     | 0.46
6      | "final remittance to origin"  | 0.45
6      | "remittance to origin"        | 0.42
7      | "final storage of docket"     | 0.47
7      | "temporarily stored docket"   | 0.43
7      | "final remittance to origin"  | 0.42
Table 2: Similarity between filters and their most similar tokens. It is possible to check what kind of information the filters seek in a text excerpt.
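The W2V/CNN/LSTM classifier described above can be sketched in PyTorch as follows. This is not the authors' code: the layer sizes are illustrative, and how the filter responses are reduced to one vector per motion is an assumption here (max-pooling over token positions).

# Sketch of the W2V/CNN/LSTM classifier: frozen embeddings, K 1-D conv filters,
# a many-to-one LSTM over T = 5 motions, and softmax over the three classes.
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, embedding_matrix, k_filters=9, kernel_size=3, hidden=64, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)  # frozen W2V vectors
        self.conv = nn.Conv1d(embedding_matrix.size(1), k_filters, kernel_size, padding=1)
        self.lstm = nn.LSTM(k_filters, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):
        # token_ids: (batch, T motions, tokens per motion)
        b, t, n = token_ids.shape
        x = self.emb(token_ids.view(b * t, n))        # (b*t, tokens, emb_dim)
        x = self.conv(x.transpose(1, 2))              # filters slide over token positions
        x = torch.max(x, dim=2).values                # pool each filter -> one vector per motion (assumption)
        x = x.view(b, t, -1)                          # sequence of T motion vectors
        _, (h, _) = self.lstm(x)                      # many-to-one LSTM
        return torch.softmax(self.out(h[-1]), dim=-1)

# Toy usage: vocabulary of 10 tokens, 8-dimensional embeddings, batch of 2 proceedings.
emb = torch.randn(10, 8)
model = CnnLstmClassifier(emb)
probs = model(torch.randint(0, 10, (2, 5, 12)))
print(probs.shape)  # torch.Size([2, 3])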
5 CLASSIFICATION RESULTS
Given that all classifiers make their classification by choosing the most probable class, we now compare their performance according to a few key metrics presented in Table 1. In this extended abstract, we omit MLP's results since they are not better than LSTM's and XGBoost's. In general, our classifiers have an excellent performance; however, the Doc2Vec approaches performed relatively worse than the others. A pertinent point that can be observed in the "Recall" and "F1 Score" columns of the macro averages is that our classifiers perform relatively worse when finding examples from the "Suspended" minority class, which adds up to about 7.63% of the data. This is not necessarily a problem, though.

6 INTERPRETABILITY
We present interpretable insights into the W2V/CNN/LSTM approach. To better understand the patterns extracted by the neural network's convolutional layer, let us look at the embedding representations of tokens in our vocabulary that are closest to the filters according to cosine similarity. In this extended abstract, we focus on two specific filters (6 and 7) from a total of nine. Table 2 shows which tokens (translated from Portuguese to English) most closely resemble our filters after they are learned.
One can see that the patterns sought by the neural network do have to do with the classifications we want to make, especially when looking at filters 6 and 7. For example, the expressions "final storage of docket" and "final remittance to origin" indicate archiving of proceedings (class 1), and the expression "temporarily stored docket" may indicate suspension (class 3). Polo et al. [8] go deeper into the analysis and present a method to connect the patterns in Table 2 with the classifier's output.

REFERENCES
[1] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
[2] N. Correia da Silva, F. A. Braz, D. B. Gusmão, F. B. Chaves, D. B. Mendes, D. A. Bezerra, G. G. Ziegler, L. H. Horinouchi, M. H. P. Ferreira, P. H. G. Inazawam, et al. 2018. Document type classification for Brazil's supreme court using a Convolutional Neural Network. (2018).
[3] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[4] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
[5] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[6] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[7] Truong-Son Nguyen, Le-Minh Nguyen, Satoshi Tojo, Ken Satoh, and Akira Shimazu. 2018. Recurrent neural network-based models for recognizing requisite and effectuation parts in legal texts. Artificial Intelligence and Law 26, 2 (2018), 169–199.
[8] Felipe Maia Polo, Itamar Ciochetti, and Emerson Bertolo. 2021. Predicting Legal Proceedings Status: Approaches Based on Sequential Text Data. arXiv:2003.11561 [cs.CL]
[9] Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. (1986).
[10] Fabio Souza, Rodrigo Nogueira, and Roberto Lotufo. 2019. Portuguese Named Entity Recognition using BERT-CRF. arXiv preprint arXiv:1909.10649 (2019). http://arxiv.org/abs/1909.10649
[11] Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, and Josef Van Genabith. 2017. Exploring the use of text classification in the legal domain. arXiv preprint arXiv:1710.09306 (2017).

Pathways to Legal Dynamics in Robotics
Antonino Rotolo Luciano H. Tamargo Diego C. Martínez
University of Bologna Universidad Nacional del Sur Universidad Nacional del Sur
Bologna, Italy Bahia Blanca, Argentina Bahia Blanca, Argentina
antonino.rotolo@unibo.it lt@cs.uns.edu.ar dcm@cs.uns.edu.ar
ACM Reference Format:
Antonino Rotolo, Luciano H. Tamargo, and Diego C. Martínez. 2021. Pathways to Legal Dynamics in Robotics. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466146

1 RESEARCH CHALLENGES
Normative concepts can play a crucial role in modelling the behaviour and interaction of artificial agents. Investigations are still relatively underdeveloped in robotics, while interesting ideas come from related fields, such as multi-agent systems (MAS).
We outline some research challenges in this domain for which we can use existing models of legal change developed in AI&Law. Three challenges for MAS can be adapted for robotics [1]:
Challenge 1. Explain which of the following choices should be made in robotics: (a) norms must be explicitly represented in robots in a declarative way, or (b) norms must be explicitly represented in the overall system specification.
Option (a) must be preferred if we are to avoid trivialising the notion of norm, which is a risk when we see any specification requirement as a norm that the system has to comply with [1]. In addition, since legal norms change, maintenance would probably be easier. However, (b) is more suitable to address this problem:
Problem 1. How can we check whether a robot complies with the norms applicable to it? How can we design a robot such that it complies with a given set of norms?
Addressing Challenge 1 requires a preliminary clarification of the norm features that we need to embed within robots. In particular, temporal aspects are especially relevant for legal dynamics [3], since legal norms can be qualified by temporal properties, such as: (1) the time when the norm comes into existence and belongs to the legal system, (2) the time when the norm is in force, (3) the time when the norm produces legal effects (it is applicable), and (4) the time when the normative effects hold.
A norm is a kind of system constraint. While hard constraints are restricted to preventive control systems in which violations are impossible (we call this mechanism regimentation), soft constraints are used in detective control systems where violations can be detected (we call this mechanism regulation). This justifies the following challenge:
Challenge 2. Make explicit why your norms are a kind of (soft) constraint that deserves special analysis.
With hard constraints the problem is how to reconfigure robots' behaviour in the presence of norm change. If the norms are represented as soft constraints, then the problem is to check whether the process of monitoring violations is correctly managed. For example, it may be the case that violations are not detected often enough.
Whatever model we adopt for legal norms on robotics, we need a formal model handling norm change:
Challenge 3. Why and how can norms be changed at runtime?
Many legal issues can be raised in regard to robots [2]. Example 1 illustrates how norm change can impact robotics.
Example 1. The Italian penal code states the following:
Art. 111 Italian Penal Code – Procuring a person for an offence who is not indictable or not punishable. Anyone who has procured a person for a criminal offence who is not indictable or is not punishable on the basis of a personal condition or quality is liable for that offence which was committed by this person, and an increased penalty is applied.
Imagine Mr. Smith induces a robot to threaten Mr. Jones, and the robot is bound to that goal (to threaten Mr. Jones) but is equipped with autonomy in achieving it. Can we apply art. 111 to this case? It should be noted that the provision covers cases where the procured person is not legally capable (she is not in full possession of her faculties), and this makes the offence committed by the procuring person more serious. However, robots, though intelligent, are not indictable, and the principle of legality in criminal law does not allow the provision to be applied by analogy when the crime is committed by a robot.
Suppose that the legislator enacts on the 1st of January 2005 a new version of art. 111 (denoted as 'Art.111-n'):
Art. 111 [Amended] Italian Penal Code – Procuring a person—who is not indictable or not punishable—or an intelligent machine for an offence. Anyone who has procured a person—who is not indictable or is not punishable on the basis of a personal condition or quality—or an intelligent machine for a criminal offence is liable for that offence which was committed by this person or machine, and an increased penalty is applied.
The AI&Law community proposed several frameworks for norm change, two of them focused on temporal models: one temporalised rule-based system [3] and one extending belief revision techniques [5]. Both view a legal system as a time series LS(t1), LS(t2), ..., LS(tj) of its versions, where each version is obtained from previous ones by entering new norms, or by modification or repeal of existing norms: each LS(ti) is the snapshot of the norms in the legal system at time ti.
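As a toy illustration of this time-series view (and not part of the authors' formal framework), the Python sketch below models a norm with enactment and repeal metadata and distinguishes non-retroactive abrogation from retroactive annulment, anticipating the scenarios discussed in Section 2. All class names, fields, and the applicability test are invented for this example.

# Toy sketch: norm versions over time, with annulment treated as retroactive
# and abrogation as non-retroactive. Names and fields are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Norm:
    name: str                        # e.g. "Art.111" or "Art.111-n"
    enacted: int                     # year the norm enters the legal system
    covers_machines: bool            # toy content: does it reach intelligent machines?
    repealed: Optional[int] = None   # year of abrogation/annulment, if any
    retroactive_repeal: bool = False # True = annulment, False = abrogation

def applies(norm: Norm, event_year: int, reference_year: int) -> bool:
    """Does the norm cover an event, judged from a later reference (decision) year?"""
    if event_year < norm.enacted:
        return False                              # norm did not yet exist
    if norm.repealed is not None and norm.repealed <= reference_year:
        if norm.retroactive_repeal:
            return False                          # annulment: as if never enacted
        if event_year >= norm.repealed:
            return False                          # abrogation: later events no longer covered
    return True

# Section 2 scenario: Art.111-n enacted in 2005, abrogated (non-retroactively) in 2007.
art111_n = Norm("Art.111-n", enacted=2005, covers_machines=True,
                repealed=2007, retroactive_repeal=False)
print(applies(art111_n, event_year=2005, reference_year=2007))  # True: the 2005 inducement is still covered
print(applies(art111_n, event_year=2007, reference_year=2007))  # False: abrogated before the 2007 inducement

# Variant: the same norm annulled (retroactively) in 2007.
annulled = Norm("Art.111-n", enacted=2005, covers_machines=True,
                repealed=2007, retroactive_repeal=True)
print(applies(annulled, event_year=2005, reference_year=2007))  # False: annulment reaches back to 2005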


2 NORM CHANGE IN ROBOTICS
Temporal models are crucial to dynamically regiment robots by design or to activate countermeasures compensating for norm violations. We illustrate how the temporal machinery conceptually operates by using the specification of the MAS proposed by [4] for sociotechnical systems, adapted to a robot interacting with humans. Consider Figure 1. For the sake of simplicity, we can imagine

Figure 1: Static normative interaction

that a robot and a human interact in a given environment at time t. The legal system at t is denoted by LS(t). The technical layer is composed of the software components supporting the robot's actions. Norms in LS(t) can govern robot-human interactions in two ways: (1) by regimentation of the robot's behaviour via access control to software components, thus preventing specific robot actions, or (2) by regulation, in such a way that norms can be violated but we check accountability and possibly resort to sanctions.
An interesting case is how legal dynamics impact the above framework. Figure 2 depicts the case in Example 1.

Figure 2: Example 1: static

Assume we operate in 2005, so the version of art. 111 of the Italian Penal Code in force is the amended one in Example 1, denoted by Art.111-n, where intelligent machines are also considered. Suppose now we are in 2007 and 1) Mr. Smith induces the robot twice to threaten Mr. Jones, a first time in 2005 and a second time in 2007; 2) Art.111-n is abrogated in 2007 and Art.111 is reinstated. This is depicted in Figure 3.

Figure 3: Example 1: dynamic non-retroactive

This case simply embeds legal dynamics within the scenario of Figure 2, since abrogations are not retroactive [3]. We are in 2007, and norm Art.111-n applies in 2007 to the case of Mr. Smith inducing the robot in 2005 to threaten Mr. Jones (by the regulation mechanism, because we are in 2007). If Art.111-n had not been abrogated, then it would also apply in 2007 to the case of Mr. Smith inducing the robot in 2007 to threaten Mr. Jones (either by the regulation mechanism or by preventing it via regimentation); however, this is not the case because the norm is abrogated. Of course, the initial version of the norm, i.e., Art.111, does not cover intelligent machines, and so the case of Mr. Smith inducing the robot in 2007 to threaten Mr. Jones is not relevant.
Suppose now we are in 2007 and 1) Mr. Smith induces the robot twice to threaten Mr. Jones, a first time in 2005 and a second time in 2007; 2) Art.111-n is annulled in 2007 (thus repealing the norm since 2005) and Art.111 is reinstated. This is depicted in Figure 4.

Figure 4: Example 1: dynamic retroactive

This case is different, since annulments are retroactive [3]. We are in 2007, and norm Art.111-n no longer exists, as if it had never been enacted, so it does not cover the cases of Mr. Smith inducing the robot in 2005 and 2007 to threaten Mr. Jones. (Unless Mr. Smith's action was prevented by regimentation in 2005; in that case, we do not have an issue regarding it.) Finally, Art.111 does not cover the case of Mr. Smith inducing the robot in 2007 to threaten Mr. Jones.

REFERENCES
[1] T. Balke et al. Norms in MAS: Definitions and Related Concepts. In Normative Multi-Agent Systems. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2013.
[2] M. Corrales, M. Fenwick, and N. Forgó. Robotics, AI and the Future of Law. Springer, 2018.
[3] G. Governatori and A. Rotolo. Changing legal systems: legal abrogations and annulments in defeasible logic. Logic Journal of the IGPL, 18(1):157–194, 2010.
[4] Ö. Kafali, N. Ajmeri, and M. P. Singh. DESEN: specification of sociotechnical systems via patterns of regulation and control. ACM Trans. Softw. Eng. Methodol., 29(1):7:1–7:50, 2020.
[5] L. H. Tamargo, D. C. Martinez, A. Rotolo, and G. Governatori. An axiomatic characterization of temporalised belief revision in the law. Artificial Intelligence and Law, pages 1–21, 2019.

Labels distribution matters in performance achieved in legal
judgment prediction tasks
Olivier Salaün Philippe Langlais Karim Benyekhlef
salaunol@iro.umontreal.ca felipe@iro.umontreal.ca karim.benyekhlef@umontreal.ca
RALI, DIRO, University of Montréal RALI, DIRO, University of Montréal Cyberjustice Laboratory, Faculty of
Montréal, Québec, Canada Montréal, Québec, Canada Law, University of Montréal
Montréal, Québec, Canada
CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Natural language processing; Neural networks.

KEYWORDS
legal judgment prediction, multilabel text classification, legal articles

ACM Reference Format:
Olivier Salaün, Philippe Langlais, and Karim Benyekhlef. 2021. Labels distribution matters in performance achieved in legal judgment prediction tasks. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466144

1 INTRODUCTION
In recent years, transformer [4] and BERT models [1] have been widely used in plain NLP tasks with the assumption that models first pretrained on massive corpora and then fine-tuned on the dataset of a given task may suffice to achieve significant improvements. At the intersection of machine learning and law, legal judgment prediction (LJP) is a task that aims at predicting the outcome of a lawsuit based on a representation of the case. Such a task is usually formalized in NLP as text classification, with different classes or labels corresponding to the verdicts. One specificity of court rulings is that their decisions are based on the application of legal articles to the facts described by the two parties (applicant and defendant).
In this work, we designed an LJP multilabel classification task based on a corpus of landlord-tenant disputes in French from Quebec, Canada [3, 5], in which a model must predict the verdict labels on the basis of a truncated extract from the decision made by the tribunal. We applied CamemBERT [2], a BERT model pretrained on French material, in order to assess to what extent a pretrained model can handle an LJP task. We also injected article-based input features with the hope that adding knowledge specific to the housing law domain could improve classification performance. Although such an approach yields better results, the labels distribution must be taken into account when analyzing coarse and label-specific scores.

2 PREPROCESSING OF THE DATASET
The decisions of the corpus we used here come from a court of Quebec in Canada that deals with all legal disputes occurring between landlords and tenants. From 2001 to 2018, it issued 667,305 decisions in French, some of which are freely available at the SOQUIJ (Société québécoise d'information juridique) portal (https://soquij.qc.ca/). Decisions have mean and median lengths of 307 and 235 tokens respectively, while the standard deviation is 371, indicating a high variability in length across the documents.
A first step in preparing the dataset consisted in extracting the text of each decision and then splitting it in two parts thanks to syntax-based heuristics: the pre-verdict text and the verdict. The former contains the description of the dispute and is used as text input for the text classification task. We also extracted the housing law articles cited in the pre-verdict text, from which we retained 445 that are cited in Book Five – Obligations of the Civil Code of Quebec, are specifically related to property lease, and are thus more relevant to our task. The verdict text is further processed in order to generate several target labels that cover the diversity of the verdicts decided by the judges. Thanks to regular expressions and the like, plus some expert knowledge of housing law, we pseudo-automatically annotated the verdicts with 23 cumulative labels. Eventually, we excluded all decisions for which no relevant article or verdict label was identified. All in all, the instances of the corpus amount to 544,857 documents with an average of 3.3 labels and 2 cited articles, and are randomly split into training, validation and test sets with a 60-20-20 ratio.

3 MODELS
Within the framework of this multilabel classification, we chose as a baseline a One-Versus-Rest Logistic Regression with the input text represented as character-based TF-IDF vectors spanning 2-grams to 8-grams (the top 100k most frequent n-grams are kept). We also use a CamemBERT model that we fine-tuned on our task for 10 epochs with a batch size of 32 and a learning rate of 10⁻⁵, with the Adam optimizer and binary cross-entropy as the loss function. The maximum sequence length amounts to 128 tokens for all of our models. Moreover, we also propose a model that leverages both CamemBERT and the cited articles by concatenating the BERT output (the vector corresponding to the [CLS] token) with a 445-dimensional one-hot vector that encodes which articles are cited in the decision. Then, the concatenation is sent to two fully connected layers, as shown in Figure 1.
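A minimal PyTorch sketch of the architecture in Figure 1 is given below; it is not the authors' code. It assumes the Hugging Face transformers CamembertModel, and the hidden size of the fully connected layers (256) is not specified in the abstract and is chosen arbitrarily here.

# Sketch of Figure 1: CamemBERT [CLS] output concatenated with a 445-dim
# cited-article indicator vector, followed by two fully connected layers
# producing logits for the 23 cumulative verdict labels.
import torch
import torch.nn as nn
from transformers import CamembertModel

class VerdictClassifier(nn.Module):
    def __init__(self, n_articles=445, n_labels=23, hidden=256):
        super().__init__()
        self.encoder = CamembertModel.from_pretrained("camembert-base")
        dim = self.encoder.config.hidden_size           # 768 for camembert-base
        self.fc1 = nn.Linear(dim + n_articles, hidden)
        self.fc2 = nn.Linear(hidden, n_labels)

    def forward(self, input_ids, attention_mask, article_vec):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]               # [CLS] representation
        x = torch.cat([cls, article_vec], dim=-1)       # inject cited-article features
        return self.fc2(torch.relu(self.fc1(x)))        # multilabel logits

# Training would use binary cross-entropy on the logits, e.g.:
# loss = nn.BCEWithLogitsLoss()(logits, multi_hot_targets)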


Figure 1: Architecture combining CamemBERT with a one-hot vector for cited articles

4 RESULTS
The metrics chosen for evaluation are exact match (EM) and F1 scores. The former implies that all labels of an instance must be exactly predicted by the model in order for the instance to be considered as correctly classified. The latter is an unweighted average of the F1 scores obtained across the 23 labels. As shown in Table 1, CamemBERT favoured models in this task, especially when the BERT output is combined with articles, as such an approach outperforms the sole BERT architecture by 3.2 points and 2.2 points for F1 macro-average and EM scores respectively. Still, such coarse results must be viewed with precaution as they do not take into account the high imbalance among labels.
For instance, the top three most frequent labels cover more than half of the corpus, while the sixteen least frequent ones have a support below 5%. Such biases have repercussions on the F1 scores obtained on each label, as shown in Figure 2. The F1 results obtained for labels with a support below 5% are spread out between 0 (none of the three least frequent labels could be correctly predicted) and 90%. Whenever the support of a label is around or above 40%, the corresponding F1 score has a minimum value of 75%. Another observation that can be drawn is that although the addition of a one-hot vector helps in improving BERT scores, such improvement is only significant for certain verdict labels, suggesting that the inclusion of domain-related knowledge is only significant in some cases. For instance, improvement is very small or non-significant for the three most frequent labels, while it seems more noticeable for those with lower support.

Figure 2: F1 scores obtained for each label (x-axis is in logarithmic scale) relative to label support

                   Logistic regression | CamemBERT | CamemBERT + one-hot
F1 (macro-avg.)    53.5                | 58.4      | 61.6
Exact match        58.6                | 63.7      | 65.9
Table 1: F1 (macro-average) and exact match scores achieved by each model on the test set.

5 CONCLUSION
Within the framework of an LJP task formalized as multilabel text classification that uses a corpus in French of landlord-tenant disputes, we extended a CamemBERT model with a one-hot vector of cited articles. This led to better overall results with respect to a sole BERT approach. Still, the results obtained must be considered with caution. Firstly, the higher the support of a label, the more likely it will be accurately predicted. Secondly, including articles into the model does not improve performance results uniformly across all labels.
As future work, we plan on investigating further in what conditions articles allow classification improvements and how such knowledge could be used as a way to make predictions more suitable for interpretability.

ACKNOWLEDGMENTS
We would like to thank the Cyberjustice Laboratory at Université de Montréal, the LexUM Chair on Legal Information and the Autonomy through Cyberjustice Technologies (ACT) project for their support of this research.

REFERENCES
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). 4171–4186. https://aclweb.org/anthology/papers/N/N19/N19-1423/
[2] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a Tasty French Language Model. arXiv preprint arXiv:1911.03894 (2019).
[3] Olivier Salaün, Philippe Langlais, Andrés Lou, Hannes Westermann, and Karim Benyekhlef. 2020. Analysis and Multilabel Classification of Quebec Court Decisions in the Domain of Housing Law. In International Conference on Applications of Natural Language to Information Systems. Springer, 135–143.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[5] Hannes Westermann, Vern R. Walker, Kevin D. Ashley, and Karim Benyekhlef. 2019. Using Factors to Predict and Analyze Landlord-Tenant Decisions to Increase Access to Justice. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 133–142.

A simple mathematical model
for the legal concept of balancing of interests
Frederike Zufall∗ Rampei Kimura∗ Linyu Peng∗
zufall@coll.mpg.de rampei@aoni.waseda.jp l.peng@mech.keio.ac.jp
Max Planck Institute for Research on Waseda Institute for Advanced Study Keio University
Collective Goods Tokyo, Japan Yokohama, Japan
Bonn, Germany

ACM Reference Format:
Frederike Zufall, Rampei Kimura, and Linyu Peng. 2021. A simple mathematical model for the legal concept of balancing of interests. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466140

This work investigates the extent to which a mathematical model is able to stand in for a legal assessment performed by a lawyer. We propose two simple mathematical models for the legal concept of the balancing of interests by transforming legal criteria into input arguments of the models.
(All authors contributed equally to this research. The full paper is available at https://dx.doi.org/10.2139/ssrn.3834333.)

1 THE LEGAL CONCEPT
A recurring concept in legal systems is the resolution of conflicts between competing interests through balancing. In order to develop a mathematical model for this concept, we build on the conflict between Art. 7 and Art. 8 of the EU Charter of Fundamental Rights (EUCh) (right to privacy and right to the protection of personal data) and Art. 11 EUCh (right to freedom of expression and information). The typical example is the disclosure of personal data on the internet as an act of free expression or as subject to the right of access to information.
As the law ultimately cannot foresee every possible situation in which these interests might collide, the balancing of interests provides the legal instrument to take into consideration the particularities of each case. These particularities can be generalised to legal criteria affecting the balancing. For the balancing between the right to the protection of personal data and access to information, these criteria may be the person's social status or role in public life, the sphere from which the relevant information originated, or the time that has passed since the occurrence of the underlying facts of that information.

2 FROM LEGAL CRITERIA TO MATHEMATICAL LEGAL PARAMETERS
We implement these legal criteria as parameters in our models, called legal parameters.
Status of the person. We understand a person's status as an indicator of the degree of relevance the person is assigned for the public discourse. We operationalise this criterion as the following parameter taking values between 0 and 1:
α_p ∈ [0, 1]: status of the person.
We take the following data points as examples to create data. A person that ...
α_p = 0.01 .. is publicly unknown
α_p = 0.25 .. is relatively unknown to the public
α_p = 0.50 .. is to a certain degree known to the public
α_p = 0.75 .. is largely known in public
α_p = 0.95 .. is known to nearly anyone
Sphere of the information. Given the more or less private nature of information, we assume a sphere model, starting from an inner circle containing the most private information, followed by information relating to family and friends, to the social sphere at the outer circle. We operationalise the sphere of the information by the following parameter with values between 0 and 1:
α_s ∈ [0, 1]: sphere of the information.
α_s = 0.05 (e.g., health data)
α_s = 0.25 (e.g., family and friends)
α_s = 0.50
α_s = 0.75 (e.g., professional misconduct)
α_s = 0.95 (e.g., committing a major crime)
Time. We also consider the passage of time as a legal criterion that affects the balancing. Following the European Court of Justice's case law on the "right to be forgotten" (Case C-131/12 – Google Spain), we assume that the more time has passed, the more the balancing leans towards the right to data protection. Time t is nondimensionalised as a legal parameter α_t:
α_t = t/T ∈ (−∞, 0]: a rescaling of time t ≤ 0 with a properly chosen large number T > 0.
The legal decision is made on facts that have just occurred (α_t = 0), or on facts that occurred in the past:
α_t = −m: m years ago (in this paper, we choose m = 0, 1, 3, 6, 8, 10).
Outcome. We denote the competing rights as (i_1) privacy of information and (i_2) access to information. Their indices are respectively


denoted as u_1 and u_2 such that
u_k ∈ [0, 1], k = 1, 2, subject to u_1 + u_2 = 1,
making it sufficient to use a single parameter u ∈ [0, 1] as the balancing outcome.
Data coding. Based on the above criteria, a dataset (150 sets of outcomes u) is hand-coded by a fully-qualified lawyer, to serve as training data for the models. The data points are based on standards inferred from the relevant case law.

3 THE MATHEMATICAL MODELS
For any given piece of information, the purpose of the models is to determine whether (i_1) privacy of information outweighs (i_2) access to information or vice versa. To summarize, the parameters are defined as follows:
α_p ∈ [0, 1]: status of the person
α_s ∈ [0, 1]: sphere of the information
α_t = t/T ∈ (−∞, 0]: nondimensionalised time
u_k ∈ [0, 1], subject to u_1 + u_2 = 1: index for (i_k), k = 1, 2
The constraint u_1 + u_2 = 1 allows us to define one single index to fulfill the task. This is the outcome u, which is a function of the legal parameters α_p, α_s and α_t. The final decision, namely whether (i_1) privacy of information or (i_2) access to information dominates, is made via comparison with a previously given threshold value u_0 ∈ [0, 1]. Without loss of generality, we assume that when u ≤ u_0, (i_1) dominates, and otherwise, (i_2) dominates.

3.1 A time-independent mathematical model
For simplicity, we first propose a simple quadratic model for each (rescaled) year α_t separately, as follows:
u(α_p, α_s) = c_00 + c_10 α_p + c_01 α_s + c_20 α_p² + c_11 α_p α_s + c_02 α_s²,   (1)
where c_00, c_01, ... are to be determined using the given dataset for each year separately. Note that in the mathematical model, the legal parameters α_p, α_s and α_t are model arguments, while c_00, c_10, ... serve as model parameters. We impose the reasonable assumptions
u(0, 0) = 0 for all α_t,   u(1, 1) = 1 for all α_t,   (2)
leading to, for all α_t,
c_10 + c_01 + c_20 + c_11 + c_02 = 1.   (3)
The proposed model can be regarded as a linear optimisation problem for which the coded data can be used to determine the above coefficients, i.e., the model parameters. Thus, we fit this function to the coded data by using Mathematica; the algorithm is based on the theory of linear least squares. In Table 1, the optimal coefficients (denoted by c*), i.e., the model parameters, are listed for each year.

α_t (year) | c_01*     | c_10*      | c_02*      | c_20*
 0         | 0.756269  | 0.218749   | -0.144324  | 0.181876
-1         | 0.655165  | 0.0286861  | -0.088864  | 0.301803
-3         | 0.429315  | -0.159663  | 0.00774652 | 0.390121
-6         | 0.184965  | -0.174577  | 0.15114    | 0.253607
-8         | 0.129208  | -0.241208  | 0.163708   | 0.30786
-10        | 0.0662971 | -0.295998  | 0.185813   | 0.364145
Table 1: Fitted model parameters for model (1).

This time-independent model (1) does not take α_t as an input argument and only captures each point in time separately.

3.2 A time-dependent mathematical model
In order to model time continuously, and not just as intermittent points, we propose the following time-dependent model for the outcome function:
u(α_p, α_s, α_t) = (c_00 + c_10 α_p + c_01 α_s + c_20 α_p² + c_11 α_p α_s + c_02 α_s²) / (a (log(|α_t| + 1))² + b log(|α_t| + 1) + 1),   (4)
where the model parameters a, b, c_00, c_01, ... are to be determined by using the data. Unlike the time-independent model, this function takes the legal parameter α_t as an argument and reduces to the time-independent model (1) at a given time. We impose the following assumptions:
u(α_p, α_s, −∞) = 0 for all α_p, α_s,   u(0, 0, α_t) = 0 for all α_t,   u(1, 1, 0) = 1,   (5)
yielding c_00 = 0 and
c_10 + c_01 + c_20 + c_11 + c_02 = 1.   (6)
Again, we use Mathematica to derive optimal values of the model parameters using the method of least squares:
a* = 0.165792,  b* = −0.212271,  c_01* = 0.529979,  c_10* = −0.0110422,  c_02* = −0.0559473,  c_11* = 0.295508.   (7)
As an illustration, Figure 1 shows the surface of the fitted time-dependent outcome function u in comparison to our data points at −3 years.

Figure 1: −3 year for the time-dependent model (4).

3.3 Evaluation
Chi-square test. To evaluate the fitted function for our time-dependent model against the whole dataset, we use the chi-square test, where N is the number of data points in the dataset; here N = 150. This gives the reduced chi-square
χ² := Σ_{i=1}^{N} (u_data − u)²   and   χ²/N = 0.0343305.   (8)
It implies that the fitted function can describe the original dataset with sufficient accuracy.
Cross-validation. In order to evaluate the time-dependent model in terms of predictability, we use leave-one-out cross-validation and calculate the mean absolute error:
MAE = 0.0728038.   (9)
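The fitted model can be exercised directly. The Python sketch below (not the authors' Mathematica code) evaluates the time-dependent outcome function (4) with the parameter values in (7) and compares the result against a threshold u_0; the threshold value 0.5 is arbitrary, and c_20 is recovered here from constraint (6) under the assumption that the fit enforces it exactly.

# Sketch: evaluate the fitted time-dependent outcome function (4) and decide
# between privacy (i_1) and access to information (i_2) via a threshold u_0.
import math

A, B = 0.165792, -0.212271
C10, C01, C02, C11 = -0.0110422, 0.529979, -0.0559473, 0.295508
C20 = 1.0 - (C10 + C01 + C02 + C11)   # from constraint (6); c_00 = 0 by (5)

def outcome(alpha_p, alpha_s, alpha_t):
    numerator = (C10 * alpha_p + C01 * alpha_s + C20 * alpha_p**2
                 + C11 * alpha_p * alpha_s + C02 * alpha_s**2)
    L = math.log(abs(alpha_t) + 1.0)
    return numerator / (A * L**2 + B * L + 1.0)

def decision(alpha_p, alpha_s, alpha_t, u0=0.5):   # u0 is an arbitrary illustrative threshold
    u = outcome(alpha_p, alpha_s, alpha_t)
    return "access to information dominates" if u > u0 else "privacy dominates"

# A largely known person (0.75), professional misconduct (0.75), facts from 3 years ago.
print(round(outcome(0.75, 0.75, -3), 3), decision(0.75, 0.75, -3))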

Part IV

Demonstrations
Interactive System for Arranging Issues
based on PROLEG in Civil Litigation
Ken Satoh Kazuko Takahashi Tatsuki Kawasaki
National Institute of Informatics Kwansei Gakuin University educe Co.,Ltd
Chiyoda, Tokyo, Japan Sanda, Hyogo, Japan Chiyoda, Tokyo, Japan
ksatoh@nii.ac.jp ktaka@kwansei.ac.jp sktk40829@gmail.com

1 INTRODUCTION
In Japan, we have the procedure of "arranging issues" in civil litigation, where we clarify which facts are in dispute and what kind of evidence action should be taken for these issues. Currently, IT technology is used only for online meetings for arranging issues, and a more sophisticated method is expected through a full use of IT/AI technology.
We proposed a method for formalizing the Japanese presupposed Ultimate Fact theory (JUF theory; in Japanese, Youken-jijisturon) and converting it into logic programming, and developed a system called PROLEG (PROlog-based LEGal reasoning support system) [1]. JUF formalises which party should give certain facts to obtain a desired legal effect for that party; in other words, it formalizes which party has the burden of proof for these facts. Then, given these facts, PROLEG simulates reasoning by a judge to reach a final conclusion and presents this process in a directed tree structure called a "block diagram."
In this work, we modify the PROLEG system to support arranging issues in civil litigation. The PROLEG system assumes that all the facts are given before the simulation of the judge's reasoning, so there is no interaction between the two parties (plaintiff and defendant) during the simulation. In contrast, given a desired effect requested by one party, our interactive system (which we call int-PROLEG) automatically calculates possible justifications for the desired effect based on the JUF theory stored in the system. After the party chooses a justification, int-PROLEG asks that party for the existence of the facts necessary to satisfy the justification. Then, int-PROLEG asks whether the other party agrees on the alleged facts, and also calculates possible counter-arguments against the chosen justification and provides them to the other party. We iterate this process until no further (counter-)arguments are presented. When this process is finished, the facts on which the parties disagree are the issues for which a judge decides the truth value.
Most interactive argument systems are mainly for constructing arguments manually by a user or for evaluating arguments constructed by a user. A notable exception is the Carneades system, which has a function of argument invention using argumentation schemes [2]. In our system, by contrast, a user does not construct arguments from scratch but chooses a pattern of legal arguments provided by int-PROLEG based on the JUF theory and specifies concrete facts to make concrete legal arguments for a specific case in a civil litigation. Moreover, from our experience working with lawyers, we noticed that if we introduced a system with sophisticated but complex reasoning mechanisms, they would be very reluctant to use it. So the purpose of this work is to identify the simplest function for arranging issues which would be easily understood by lawyers. In a sense, we extract a useful part of Carneades for the arrangement of issues in civil litigation.

2 PROLEG
We first review the PROLEG system [1]. A program of PROLEG consists of a rulebase and a factbase. A rulebase consists of a set of general rules of the form
H ⇐ B1, ..., Bn.
where H (called the head or conclusion) and B1, ..., Bn (called the body) are first-order atoms, and a set of exception rules of the form
exception(H, E)
where H and E are heads of some general rules. We call E an exception. A factbase consists of a set of expressions of the form fact(P), where P is an atom which is never the head of any general rule. We call P a fact predicate.
A rule represents a general default rule meaning that if all Bi in the rule are proved, then in general H is true, except when there is an exception rule exception(H, E) such that E is proved.
Given a PROLEG program, we can construct a proof tree of a given goal, which is the root of the tree; the child nodes are the conditions of the general rules for the conclusion and the exceptions of the conclusion.
and also calculates possible counter-arguments against the chosen the conclusion.
justification and provides them to the other party. We iterate this 3 EXTENSION TO ARRANGE ISSUES
process until no further (counter-)arguments are presented. When
In this section, we show how to modify the PROLEG system into
this process is finished, disagreed facts are issues for which a judge
int-PROLEG.
decides the truth value.
Most interactive argument systems are mainly for construct- 3.1 Indexing a level for PROLEG literals
ing arguments manually by a user or evaluating arguments con- (1) First, we define the dependency on the atomic formula that
structed by a user. A notable exception would be the Carneades appears in the general rule. Among the conclusions of the
system which has a funcition of argument invention using argu- general rule, the conclusion that does not appear in the body
mentation schemes [2]. On the other hand, in our system, a user of any general rule or is not an exception of any exception
does not construct arguments from the scratch but chooses a pat- rule is called 0-level conclusion. Then, when making a top-
tern of legal arguments provided by int-PROLEG based on the JUF down proof tree from the 0-level conclusion using only the
general rules, we end up the fact predicates. We call the fact
Permission to make digital or hard copies of part or all of this work for personal or predicates finally visited 0-level facts and the 0-level con-
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita- clusion and the intermediate visited atomic formulas called
tion on the first page. Copyrights for third-party components of this work must be 0-level atomic formulas.
honored. For all other uses, contact the owner/author(s).
(2) Suppose that the 𝑖-level atomic formulas and the 𝑖-level facts
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). are decided. For exception rules that conclude with the 𝑖-
level atomic formula, the collection of the atomic formulas


3.2 Interaction Process in int-PROLEG
Let D be the defendant and P be the plaintiff.
(1) Let P choose one of the 0-level conclusions. This is P's claim.
(2) We make a top-down proof tree from the chosen 0-level conclusion, and when we encounter a 0-level fact, we ask P whether P claims that fact. If P claims it, then we let P instantiate the variables in the 0-level fact, add the instantiated facts to the factbase, and update the relevant part of the proof tree with the instantiation. If P does not claim it, we delete the path related to the fact in the proof tree.
(3) If all the concrete facts have been entered, the modified proof tree is displayed.
(4) Suppose that a user has finished entering the t-level facts. We make a proof tree for the (t+1)-level exceptions, and when we encounter a (t+1)-level fact:
• if (t+1) is odd, we ask D whether D claims that fact. If D claims it, then we let D instantiate the variables in the (t+1)-level fact, add the instantiated facts to the factbase, and update the relevant part of the proof tree with the instantiation. If D does not claim it, we delete the path related to the fact in the proof tree. We also ask D whether D admits the fact or not, and if D does not admit it, we label the fact as an "issue to be determined".
• if (t+1) is even, we ask P whether P claims that fact. If P claims it, then we let P instantiate the variables in the (t+1)-level fact, add the instantiated facts to the factbase, and update the relevant part of the proof tree with the instantiation. If P does not claim it, we delete the path related to the fact in the proof tree. We also ask P whether P admits the fact or not, and if P does not admit it, we label the fact as an "issue to be determined".
(5) If all the concrete facts have been entered, the modified proof tree is displayed.
(6) Continue the above until no new arguments are raised.
The final proof tree contains all the arguments raised by both parties, and the fact nodes labelled as an "issue to be determined" are the issues for which a judge decides the truth value.

4 DEMONSTRATION
We demonstrate how a plaintiff and a defendant interact with each other to arrange issues in the following case about a claim for payment under a purchase contract (a video is available at http://research.nii.ac.jp/~ksatoh/PROLEGdemo/IssueArrangmentDemo.mp4). In the example case, the plaintiff (Alice) requested the defendant (Bob) to pay for goods according to the purchase contract with Bob. As a counter-argument, Bob claimed that Alice threatened Bob into buying the goods, but Alice did not admit the threatening action. The system works as follows.

Figure 1: Final Diagram

(1) Firstly, Alice claims the payment, and the system tells her a necessary fact to establish the claim according to the Civil Code, which is the establishment of a purchase contract. Alice gives a specific fact for the purchase contract (called "contract1").
(2) Secondly, the system computes counter-arguments against Alice's claim according to the Civil Code and provides two counter-arguments (the "already paid" argument and the "menaced to establish the contract" argument). Bob chooses the "menaced to establish the contract" argument and gives the relevant fact for the argument. (A block connected with a dotted arrow in Fig. 1 represents a counter-argument.)
(3) Thirdly, the system computes counter-arguments against Bob's claim but finds no counter-arguments. So the only thing Alice can do is to deny some facts. In this example, Alice denies the facts related to Bob's claim of being menaced, so these facts become issues for which a judge decides the truth value. (Leaf blocks with 'u' at the bottom in Fig. 1 are issues.)

5 CONCLUSION
We presented how to extend the PROLEG system to arrange issues. This system helps lawyers avoid missing claims and helps lay persons make correct legal claims without lawyers. For future research, we need to evaluate the system and investigate a method for automatically extracting the relevant legal facts directly from claims in natural language.

Acknowledgements: This work was supported by JSPS KAKENHI, JP17H06103 and JP19H05470, and JST AIP, JPMJCR20G4.

REFERENCES
[1] K. Satoh et al. 2012. PROLEG: An Implementation of the Presupposed Ultimate Fact Theory of Japanese Civil Code by PROLOG Technology. In New Frontiers in Artificial Intelligence: JSAI-isAI 2010 Workshops, Revised Selected Papers, LNAI 6797. 153–164.
[2] D. Walton and T. F. Gordon. 2017. Argument Invention with the Carneades Argumentation System. Scripted 14, 2 (2017), 168–207. https://script-ed.org/?p=3391

Live Demonstration Of A Working Collaborative
eNegotiation System (Smartsettle Infinity)
Ernest Thiessen Graham Ross
Founder & President Head of International Marketing
Smartsettle Resolutions Inc. Smartsettle Resolutions Inc.
Vancouver, BC, Canada Vancouver, BC, Canada
ernest.thiessen@smartsettle.com Board Member
International Council for Online Dispute Resolution

KEYWORDS
negotiation, multivariate visual blind bidding, eNegotiation, collaborative, conflict resolution, alternative dispute resolution, ADR, online dispute resolution, ODR, algorithms

The Smartsettle¹ process, using anonymised proposals, Visual Blind Bidding, machine learning of disputant preferences, machine-generated proposals, and intelligent algorithms, not only promotes and increases fair outcomes that go "beyond win-win" but, by rewarding good negotiating behaviour, creates a revolutionary challenge to status quo adversarial methods of negotiation. Eight algorithms power a collaborative negotiation software product called Smartsettle Infinity², which is designed to support multiple decision makers with conflicting objectives and perspectives in complex multi-party negotiations.
Infinity models all aspects of complex cases [3], including any number of quantitative and qualitative issues, and the related preferences of negotiators. With preferences well represented, parties can confidently compare complete settlement packages in a non-emotional and objective way.
The context in the example case, depicted by Figure 1, is a negotiation between a Supplier and a Buyer. The following dozen steps that refer to Figure 1 invoke all eight Smartsettle algorithms. Parties would normally encounter these steps in an iterative process that is designed to reward good negotiating behaviour.

Figure 1: Smartsettle Multivariate Visual Blind Bidding

(1) With the negotiation problem well represented by a Single Negotiating Framework, parties model their problem within the Smartsettle platform.
(2) Parties use Comprehensive Preference Analysis to define how they become satisfied on each of the issues. A secure neutral server guards confidential information.
(3) Each party has a minimum level of satisfaction that they would be willing to accept. The Supplier achieves 100% satisfaction at the top left while the Buyer's goal is at the bottom right, but Smartsettle's goal is to find a fair solution for both parties on the Efficiency Frontier.
(4) The area bounded by the minimum satisfaction of each party and the Efficiency Frontier is the Zone of Possible Agreement.
(5) Parties are encouraged to exchange optimistic proposals (red stars).
(6) The Visual Blind Bidding algorithm is depicted by Figure 1 for a two-party negotiation. Visual Blind Bidding enables parties to request Suggestions (black stars) that are based on party preferences and fall between the optimistic proposals. This process virtually eliminates the tedious negotiation dance that characterizes most ordinary negotiations. Another unique feature of Visual Blind Bidding is the ability of either party or a mediator to make anonymous proposals. They do this by requesting Smartsettle to camouflage their proposal so that it appears no different than any other Suggestion generated by the machine.
(7) Parties motivated to collaborate secretly accept proposals that fall in the Zone of Possible Agreement, which in turn moves them quickly toward a fair resolution of their conflict.
(8) If both parties have each accepted more than one of the same packages by the end of a Session (inside the red circle), Reward for Early Effort determines which of those packages becomes the Baseline agreement (orange star).

¹ Smartsettle is disruptive technology that is emerging from nearly three decades of research and development. This research grew out of a challenge put to Ernest Thiessen in 1990 by his major advisor, Professor Emeritus D. Peter Loucks at Cornell University.
² A simplified version of this system, called Smartsettle ONE, is described in a longer version of this paper [5].


(9) If no mutually accepted packages exist, the Automatic Deal-Closer can be invoked to avoid impasse due to a small gap.
(10) Having reached a Baseline quickly, parties have energy left for Smartsettle's signature algorithm, Maximize the Minimum Gain, to uncover any remaining hidden value and generate an Improvement that distributes the additional value fairly to all parties. This algorithm is foundational to all the others and is endorsed3 by experts in the field.
(11) Fairness Enhancing Normalization distributes additional benefits fairly among all the parties, and this brings them to an optimal solution on the Efficiency Frontier (green star). Smartsettle allows parties to represent their preferences with any scale that is convenient to them. But under the covers, Smartsettle employs a proprietary method for normalization that effectively neutralizes the efforts of any party to inflate the benefits of optimization for themselves.
(12) In the background is the Expert Neutral Deal-Closer, which the parties may fall back on in case of a large gap; the mere existence of this remedy results in it rarely being used.

The process described above works for almost any formal negotiation between any number of parties and can be depicted graphically by adding more dimensions to Figure 1. Maximize the Minimum Gain as described by US ICANS Patent US 5495412A has been slightly modified to deal better with multi-party negotiations.

In high-value cases, Infinity's greatest asset is by far its ability to uncover both tangible and intangible hidden value that improves the outcome for all parties even after a settlement package has been agreed, thereby further enhancing the inter-party relationships. The greatest benefits are achieved when well-trained parties use the system collaboratively. Rather than usurping control from the user or the work of the mediator or negotiation advisor, the process benefits from human input that best understands and assesses the preferences and interests of all parties.

Table 1 summarizes how Smartsettle intelligently rewards good negotiating behaviour. Whether this is artificial intelligence (AI) or intelligence augmented (IA) we leave for the reader to decide.

Acceptance of a fair outcome is the first prerequisite for achieving a result that benefits all parties. Smartsettle enables this behaviour in a process where parties can place secret bids on packages. When a Zone of Agreement occurs, the party who made the smallest last move is rewarded with a bigger portion of the overlap. An agreement is ensured if parties agree to the Expert Neutral Deal-Closer in the Final Session, and in fact agreement is more likely to happen without the need for outside intervention. These first three behaviours all contribute to quickly achieving a fair outcome and are applicable to all negotiations, whether simple or complex.

In more complex multivariate cases, the importance of coming to an early agreement is even greater. In addition to time savings, negotiators also have the opportunity of uncovering hidden value with the fourth behaviour of secure honesty and truthfulness. Thiessen's research [4] and live case work have demonstrated that the magnitude of value left behind in ordinary negotiations can be around 16%. Shell's research [2] concluded that negotiators who are subjected to a tedious negotiation dance become exhausted and have little energy left to go "beyond win-win"® in a search for hidden value. The Smartsettle Visual Blind Bidding process not only conserves the energy of negotiators but also makes it very easy to uncover hidden value.

Table 1: Rewards for Good Negotiating Behaviour

Objective    Behaviour                                    Reward
Fairness     Acceptance of a fair outcome                 A timely win-win outcome
             Early movement to Zone of Agreement          Bigger portion of the overlap
             Agreement to Expert Neutral Deal-Closer      Guaranteed agreement
Efficiency   Secure Honesty / Truthfulness                Uncovered hidden value
Peace        Collaboration                                Improved relationships

In traditional negotiations, parties will often tend to hide or even misrepresent their true preferences. However, with Smartsettle Infinity, the temptation to misrepresent preferences is eliminated. Skilled facilitators help parties understand that it is actually counterproductive to use any sort of deception as a negotiating strategy and that truthfulness is rewarded. All of these good behaviours together represent the fifth behaviour of collaboration, and result in improved relationships.

3 Harvard Professor Emeritus Howard Raiffa published [1] a preference for Maximize the Minimum Gain (MMG) over a Nobel Laureate's algorithm, Maximize the Utility Product (MUP). Raiffa said that MMG was more intuitive and (in his opinion) produced better outcomes in certain hypothetical illustrations. He did admit, however, that the difference between these two algorithms would be insignificant in most real-world applications.

REFERENCES
[1] Howard Raiffa. 1996. Lectures on Negotiation Analysis. Program on Negotiation at Harvard Law School.
[2] G. Richard Shell. 1999. Bargaining for Advantage: Negotiation Strategies for Reasonable People. Penguin.
[3] Ernest Thiessen, Peter Holt, Graham Ross, and Diana Wallis. 2017. Brexit 2.0 Negotiation Simulation with Smartsettle Infinity. International Journal of Online Dispute Resolution 2, 4 (2017).
[4] Ernest M. Thiessen and D. Pete Loucks. 1992. Computer-Assisted Negotiation of Multi-objective Water Resources Conflicts. Water Resources Bulletin, American Water Resources Association 28, 1 (February 1992), 163–177.
[5] Ernest M. Thiessen and Graham L. Ross. 2021. Using AI & IA to Reward Good Negotiating Behaviour. (on smartsettle.com).
Part V

COLIEE Papers
BERT-based Ensemble Methods with Data Augmentation for
Legal Textual Entailment in COLIEE Statute Law Task
Masaharu Yoshioka
yoshioka@ist.hokudai.ac.jp
Faculty of Information Science and Technology, Hokkaido University
Graduate School of Information Science and Technology, Hokkaido University
Sapporo-shi, Hokkaido, Japan

Yasuhiro Aoki and Youta Suzuki
yasu-a_01@eis.hokudai.ac.jp, suzuki@eis.hokudai.ac.jp
Graduate School of Information Science and Technology, Hokkaido University
Sapporo-shi, Hokkaido, Japan

ABSTRACT
The Competition on Legal Information Extraction/Entailment (COLIEE) statute law legal textual entailment task (task 4) is a task to make a system judge whether a given question statement is true or not on the basis of provided articles. In the last COLIEE 2020, the best performance system used bidirectional encoder representations from transformers (BERT), a deep-learning-based natural language processing tool for handling word semantics by considering their context. However, there are problems related to the small amount of training data and the variability of the questions. In this paper, we propose a BERT-based ensemble method with data augmentation to solve this problem. For the data augmentation, we propose a systematic method to make training data for understanding the syntactic structure of the questions and articles for entailment. In addition, due to the non-deterministic characteristics of BERT fine-tuning and the variability of the questions, we propose a method to construct multiple BERT fine-tuned models and select an appropriate set of models for the ensemble. The accuracy of our proposed method for task 4 was 0.7037, which was the best performance among all submissions.

CCS CONCEPTS
• Computing methodologies → Information extraction; Ensemble methods.

KEYWORDS
Textual entailment, Data augmentation, BERT, Ensemble method

ACM Reference Format:
Masaharu Yoshioka, Yasuhiro Aoki, and Youta Suzuki. 2021. BERT-based Ensemble Methods with Data Augmentation for Legal Textual Entailment in COLIEE Statute Law Task. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3462757.3466105

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466105

1 INTRODUCTION
The Competition on Legal Information Extraction/Entailment (COLIEE) [3, 4, 10, 11, 15] serves as a forum to discuss issues related to legal information retrieval (IR) and entailment. There are two types of tasks in COLIEE. One is a task using case law (tasks 1 and 2), and the other is a task using Japanese statute law with Japanese bar exam questions (tasks 3, 4, and 5). Task 3 is an IR task that aims to retrieve (a) relevant law article(s) to judge whether the statement of the question is true, task 4 is an entailment task that judges whether a given relevant article entails a given question statement, and task 5 is a combination of tasks 3 and 4.

Because some of the bar exam questions are based on real use cases, it is important to have a mechanism for semantic matching to discuss the relevance of words in the articles and those in the questions for entailment. At an earlier stage, machine-readable thesauruses such as WordNet [7] and distributed representations of words such as Word2Vec [6] were used. Recently, the deep learning-based natural language processing tool bidirectional encoder representations from transformers (BERT) [1] was introduced. One of the characteristics of BERT is that it provides a general semantic analysis system that can be fine-tuned for a particular task. In the last COLIEE [10], the best performance systems for tasks 3 [12] and 4 [9] used BERT as a core component of the system.

In this paper, we propose a method to use BERT-based ensemble methods for task 4. This method utilizes BERT with data augmentation that increases the number of training examples by making article-and-question pairs systematically using sentences in the statute law articles. We also propose a system that ensembles the results from multiple BERT-based system outputs. The accuracy of the system for task 4 was 0.7037, which was the best performance among all the submitted runs for task 4 at COLIEE 2021.

2 RELATED WORKS
Because bar exam questions include questions about real use cases of articles, it is necessary to discuss the correspondence between the concepts used in the articles and real use cases. In the early stage of COLIEE, several attempts were made to utilize resources for discussing such semantic matching, such as a machine-readable thesaurus and data for the distributed representation of the terms. For example, Mi-Young et al. [5] used Word2Vec [6] as a resource for distributed representation, and Taniguchi et al. [14] proposed a method to utilize WordNet [7] as a machine-readable thesaurus. However, because those methods cannot handle the context to
estimate the meaning of such terms, they are not as effective for utilizing such resources.

Recently, Devlin et al. [1] proposed BERT, a deep learning-based natural language processing tool pretrained for solving general tasks that require semantic information with larger corpora (such as the whole contents of Wikipedia). Based on this training process, BERT can handle the meaning (distributed representation) of words in a sentence by considering the context. In addition, BERT can be used for various tasks by employing a fine-tuning process that utilizes comparatively small numbers of training data. Because a pretrained model of BERT contains rich information about the semantics of the words, the fine-tuned models may be able to handle semantic information even though the words themselves are not included in the training data.

At COLIEE 2020, a BERT-based system achieved the best performance for the legal textual entailment task (JNLP [9]). In that paper, they proposed a lawfulness classification approach that classified the appropriateness of legal statements by using many legal sentences, including the bar exam questions provided by the organizers, without considering the given relevant articles. This approach worked well for COLIEE 2020 because of the large number of training data. In addition, they also pointed out that it was difficult to select an appropriate model using validation data for the unseen questions because of the significant variability of the questions.

To increase the size of the training data, the data augmentation approach is widely used in the field of image recognition [13]. However, few studies related to data augmentation methods have been conducted for the legal textual entailment task. Min et al. [8] proposed a syntactic data augmentation method to increase the robustness of natural language inference. They proposed a systematic method to create positive and negative data from the correct inference sentence by syntactic operations such as passivization and the inversion of subject and object. Evans et al. [2] proposed a method of data augmentation for logical entailment. In this framework, their method increased negative and positive data by modifying logical inference rules using symbolic vocabulary permutation, which includes an operation to make implication rules that share the same contents for the condition and derived parts. Those approaches are useful for designing data augmentation methods for legal textual entailment.

3 BERT-BASED ENSEMBLE LEGAL TEXTUAL ENTAILMENT SYSTEM
Based on a discussion of the previous best performance system (JNLP [9]), we propose a system with the following characteristics.
(1) Textual entailment approach with data augmentation. We assume that the reason why the lawfulness classification approach outperformed the textual entailment one in the last COLIEE is the size of the training data. Therefore, when we provide larger training data by data augmentation, the textual entailment approach may outperform the lawfulness classification approach because it uses the most important information (relevant articles).
(2) Ensemble results of multiple BERT-based model outputs. As discussed, it is difficult to select appropriate models for the task by only evaluating the validation model. From our preliminary experiment (the details are discussed in Section 3.3), we confirmed that the characteristics of the fine-tuned BERT-based models are different and that the accuracy on the validation data is not directly related to that on the test data. We assume that this result reflects the different characteristics of each model and that the appropriate selection of the generated models for the ensemble may improve the performance for the unseen questions.

3.1 Data augmentation using articles
In the deep learning framework, it is common to enlarge training data by modifying the existing data (data augmentation). However, it is important to define an appropriate data augmentation method to obtain better results. Related to the legal textual entailment task, data augmentation methods have been used for natural language and logical inference, as introduced in Section 2. However, it is difficult to apply these methods to this legal textual entailment data.

In this research, we assume that there are two types of errors in judging whether an article entails a given question. One is semantic mismatch, and the other is logical mismatch (the appropriateness of the judicial decision).

For example, let us discuss the example of training data using the following article (a part of Article 9): "A juridical act performed by an adult ward is voidable."
(1) "A juridical act performed by an adult is voidable." The article does not entail this question because of semantic mismatching ("adult" is not "adult ward").
(2) "A juridical act performed by an adult ward is not voidable." The article does not entail this question because of the inappropriateness of the judicial decision ("voidable" and "not voidable").
(3) "A juridical act performed by an adult is not voidable." We cannot judge whether this question is true (it may require another article). However, the given article cannot entail the question.

For the semantic matching case (1), it is difficult to select appropriate pairs ("adult" and "adult ward") for replacement to make such a semantic mismatch sentence. For case (3), it is also difficult to make the data and to use these data as negative examples to identify the types of errors in judging the entailment results.

By contrast, if we make pairs of correct answers with logical mismatch cases (2), the examples may help to explain the importance of comparisons between the judicial decision of the article and that of the question.

Based on this assumption, we create training data that characterize the logical mismatch between the articles and questions. The procedures to make this augmented data are as follows.
(1) Extraction of (a) judicial decision part(s) from the articles. If there are multiple decisions in an article, the sentences in the article are split into smaller sentences that contain one judicial decision (Figure 1). When a split sentence explains an exceptional case, a flipped judicial decision is complemented for the split sentence (underlined part of the split sentences). When the sentence does not contain any
judicial decision (e.g., definitions of terms), those sentences are used only to generate positive pairs.
(2) Make positive and negative data by using the extracted sentences. We use a sentence extracted in step 1 as the article part and the same sentence text as the question for a positive example pair. We then make a sentence that is generated by flipping the judicial decision. In most cases, we add "ない" (not) to, or remove "ない" (not) from, the judicial decision verb. We also use an antonym dictionary to make flipped sentences. Pairs of the original sentence and the flipped sentence are used as negative example pairs. Figure 2 shows an example of making negative pairs.

Figure 1: Example of splitting sentences (Article 613(3)). The figure shows the input article and the extracted judicial decision sentences (the Japanese original is shown in the figure). Input article, Article 613 (3): "If the lessee lawfully subleases a leased thing, the lessor may not duly assert against the sublessee the cancellation by agreement of the lease with the lessee; provided, however, that this does not apply if, at the time of the cancellation, the lessor has a right to cancel due to non-performance on the part of the lessee." Split sentences: (i) "If the lessee lawfully subleases a leased thing, the lessor may not duly assert against the sublessee the cancellation by agreement of the lease with the lessee." (ii) "If the lessor has a right to cancel due to non-performance on the part of the lessee at the time of the cancellation, the lessor may duly assert against the sublessee the cancellation by agreement of the lease with the lessee."

Figure 2: Example of making a negative pair (Article 613(3)). Negative pair — Question: "If the lessee lawfully subleases a leased thing, the lessor may duly assert against the sublessee the cancellation by agreement of the lease with the lessee." Article: "If the lessee lawfully subleases a leased thing, the lessor may not duly assert against the sublessee the cancellation by agreement of the lease with the lessee."

Based on this process, we construct 3,331 (positive: 1,677, negative: 1,654) training examples for data augmentation.
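In essence, each split article sentence is paired with itself as a positive example and with a decision-flipped copy as a negative example. The following minimal sketch illustrates this idea on the English translations; the flip patterns and helper names are simplified, hypothetical stand-ins for the Japanese ない insertion/removal and the antonym dictionary described above, not the authors' actual implementation.

```python
# Sketch: pair each split article sentence with itself (positive) and with a
# decision-flipped copy (negative). The patterns below are illustrative only;
# the paper operates on Japanese text with "nai" and an antonym dictionary.
FLIPS = [("may not", "may"), ("is not", "is")]  # (negated form, affirmative form)

def flip_decision(sentence: str) -> str:
    """Return a copy of the sentence with its judicial decision negated/un-negated."""
    for negated, affirmative in FLIPS:
        if negated in sentence:
            return sentence.replace(negated, affirmative, 1)
        if affirmative in sentence:            # naive substring match; a sketch only
            return sentence.replace(affirmative, negated, 1)
    return sentence                            # no decision verb found

def augment(split_sentences):
    """Yield (article, question, label) triples: 1 = entailed, 0 = not entailed."""
    for sentence in split_sentences:
        yield sentence, sentence, 1            # positive pair
        flipped = flip_decision(sentence)
        if flipped != sentence:
            yield sentence, flipped, 0         # negative (logical mismatch) pair

if __name__ == "__main__":
    for article, question, label in augment(
            ["A juridical act performed by an adult ward is voidable."]):
        print(label, "|", question)
```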
3.2 BERT-based entailment system
We implemented a BERT-based entailment system using the ordinary BERT fine-tuning process proposed in [1]. We concatenate the question and the article using a sentence-separator token ([SEP]) and feed the result into the BERT model to estimate whether the article entails the question (positive: 1) or not (negative: 0). We use the BERT-base model of BERT-Japanese1.
1 https://github.com/cl-tohoku/bert-japanese

Training and validation data are constructed from the training bar exam questions from H18–H30 (13 years of data with 695 questions) by randomly splitting them into 90% (625) for training and 10% (70) for validation. All augmented data are merged with the training data, so we use 3,956 examples for training and 70 for validation. We also made training sets without the augmented data (625 training and 70 validation examples) for comparing system performance without augmentation.

The fine-tuning of the BERT model is done using a maximum sequence length of 256, Adam as the optimizer, a training batch size of 12, and a learning rate of 1e-5. The validation loss is calculated at each epoch, and training stops when the validation loss increases. We use the model with the minimal validation loss.
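A minimal sketch of this fine-tuning setup with the Hugging Face transformers library is given below. The toy batches stand in for the real training and validation splits, and the loop only illustrates the reported hyperparameters (maximum sequence length 256, Adam, batch size 12, learning rate 1e-5, early stopping on validation loss); it is not the authors' exact training script.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cl-tohoku/bert-base-japanese"   # BERT-Japanese, as referenced above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def encode(pairs, labels):
    """Encode (question, article) pairs; the tokenizer inserts [SEP] between them."""
    questions, articles = zip(*pairs)
    enc = tokenizer(list(questions), list(articles), truncation=True,
                    padding="max_length", max_length=256, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

def step(batch, train):
    """Run one forward (and optionally backward) pass, returning the loss."""
    model.train(train)
    with torch.set_grad_enabled(train):
        out = model(**batch)
        if train:
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return out.loss.item()

# Toy stand-ins for the real splits (batch size 12 in the paper).
train_batch = encode([("A question text", "An article text")], [1])
val_batch = encode([("Another question", "Another article")], [0])

best_val, best_state = float("inf"), None
for _ in range(10):                       # stop once the validation loss increases
    step(train_batch, train=True)
    val = step(val_batch, train=False)
    if val >= best_val:
        break
    best_val = val
    best_state = {k: v.clone() for k, v in model.state_dict().items()}
model.load_state_dict(best_state)         # keep the model with minimal validation loss
```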
Fine-tuned models accept a pair of a question statement and (an) article(s) as input and return whether the article(s) entail(s) the statement (positive) or not (negative), which is decided by comparing the scores for the probability of being positive or negative.

3.3 Preliminary experiment
To evaluate the performance of the proposed BERT-based entailment system, we conducted a preliminary experiment using the R01 data (1 year of data with 111 questions) for evaluation. To discuss the effect of the variability of the training and validation question set, we made 10 models using the same procedures. Because of the non-deterministic characteristics of the BERT fine-tuning process and the different, randomly selected training sets, we expected the system to construct different models that use different features for analyzing the texts. Table 1 shows the evaluation results for the validation and test data. As shown in the table, the validation accuracy and loss were not closely related to the test accuracy. We assume that these results reflect the variability of the question set.

Table 1: Evaluation results of the 10 models

Model No.   Validation Loss   Validation Accuracy   Test Accuracy
1           0.6935            0.4857                0.5946
2           0.7247            0.5286                0.6667
3           0.7566            0.6286                0.6486
4           0.6822            0.6286                0.6486
5           0.7347            0.5143                0.6486
6           0.7745            0.6143                0.6396
7           0.6913            0.5429                0.6126
8           0.7123            0.5857                0.6486
9           0.7504            0.6286                0.6486
10          0.7735            0.5857                0.6396

We also made another 10 models without using augmented data. The average accuracy of these 10 models was 0.5108 (best: 0.5946, worst: 0.4505). From this comparison, we confirmed that data augmentation is effective for improving the performance of the BERT training process.

In the inference process, we can estimate the confidence of the BERT model output by comparing the probability of positive or negative. When the probability of positive is almost equal to 1 (or 0), the system outputs a positive (negative) result with higher confidence. By contrast, when the probability is close to 0.5, the system output can be interpreted as less confident.

When we checked the distribution of such confidence for each question, the tendency of the confidence was not consistent among these models. From this observation, we assumed that these models may use different features to estimate the entailment results and that their characteristics may differ, even though we used the same model architecture for training. In such a case, there is a possibility of increasing the performance of the overall system by ensembling the results of different models.

To discuss the appropriate settings for making the ensemble model, we calculated the performance of the ensemble model using these 10 models. For making the ensemble model, we used the average probability of positive and negative from the target models. Table 2 shows the evaluation results of the ensemble models by accuracy. In this table, all combinations of the ensemble models are used (selecting three to 10 models from those introduced in Table 1).

Table 2: Evaluation results of the ensemble models

Models used                         Accuracy
(1, 2, 3, 4, 5, 6, 7)               0.694
(1, 2, 4)                           0.689
(1, 2, 3)                           0.685
(1, 2, 3, 4, 7)                     0.682
(1, 2, 6)                           0.676
(1, 2, 5)                           0.676
(1, 2, 3, 4, 5, 7)                  0.676
(1, 2, 3, 4, 5, 6)                  0.676
(1, 2, 3, 4, 5)                     0.676
···                                 ···
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)     0.622

There were large differences between the accuracies of the ensemble cases. The best-accuracy system ensembled seven models, and the worst used all models. Most of the cases that used three to five models were adequate to estimate the results with better accuracy.

All of the highest-ranked sets contained the best performance system, model 2. In addition, they also used model 1, even though the accuracy of model 1 was the lowest among these 10 models. This suggests that it is important to use a complementary set of models that have different characteristics to improve the overall performance of the ensemble models.

3.4 Submitted results
Based on the results of the preliminary experiments, we submitted the following three results that used different model sets for the ensemble (Table 3).

Table 3: Model sets used for the submitted runs

Submission ID   Models used
HUKB-1          (1, 2, 3, 4, 5, 6, 7)
HUKB-2          (1, 2, 4)
HUKB-3          (1, 2, 4, 7, 8)

HUKB-1 and HUKB-2 were the best and second-best performance systems using the R01 data as a kind of validation data. HUKB-3 selected the five best models using the validation loss information. Table 4 shows the final evaluation results of all submission runs, among which HUKB-2 achieved the highest accuracy.

Table 4: Final evaluation results

Submission ID                   Correct      Accuracy
BaseLine                        43 / 81      0.5309
HUKB-2                          57           0.7037
HUKB-1                          55           0.6790
HUKB-3                          55           0.6790
UA_parser                       54           0.6667
JNLP.Enss5C15050                51           0.6296
JNLP.Enss5C15050SilverE2E10     51           0.6296
JNLP.EnssBest                   51           0.6296
OVGU_run3                       48           0.5926
TR-Ensemble                     48           0.5926
TR-MTE                          48           0.5926
OVGU_run2                       45           0.5556
KIS1                            44           0.5432
KIS3                            44           0.5432
UA_1st                          44           0.5432
KIS2                            43           0.5309
UA_dl                           43           0.5309
TR_Electra                      41           0.5062
OVGU_run1                       36           0.4444
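The ensembling described in Sections 3.3 and 3.4 reduces to two small operations: averaging the per-model softmax outputs and searching over model subsets on a held-out year. The sketch below illustrates both; the probability arrays, the gold labels and the brute-force subset search are illustrative assumptions rather than the submitted implementation.

```python
from itertools import combinations
import numpy as np

def ensemble_predict(prob_list):
    """Average the per-model softmax outputs and take the argmax per question."""
    avg = np.mean(np.stack(prob_list), axis=0)   # shape: (n_questions, 2)
    return avg.argmax(axis=1)                    # 1 = entailed, 0 = not entailed

def rank_subsets(model_probs, gold, min_size=3):
    """Rank every subset of models by accuracy on a validation year (e.g. R01)."""
    scored = []
    for size in range(min_size, len(model_probs) + 1):
        for subset in combinations(range(len(model_probs)), size):
            pred = ensemble_predict([model_probs[i] for i in subset])
            scored.append((float((pred == gold).mean()), subset))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gold = rng.integers(0, 2, size=111)                           # 111 R01 questions
    probs = [rng.dirichlet([1, 1], size=111) for _ in range(10)]  # 10 fine-tuned models
    for acc, subset in rank_subsets(probs, gold)[:3]:
        print(subset, round(acc, 3))
```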
3.5 Discussion
To understand the effect of the ensemble method, we compared the performance of the ensemble results with that of each individual model. Table 5 shows the evaluation results of the 10 models on the test data. This year, the basic models performed well, and the best-performing individual models were almost equivalent to the ensemble ones. However, the appropriate selection of the models (HUKB-2) made the ensemble results better than those of each individual model. These results justify the appropriateness of using the ensemble method by selecting an appropriate ensemble set using validation data.

Table 5: Evaluation results of the 10 models for the test data

Model No.   Accuracy
1           0.6790
2           0.6666
3           0.5185
4           0.5555
5           0.6666
6           0.5308
7           0.6790
8           0.5925
9           0.5555
10          0.5308

Table 6 shows the number of questions classified by agreement level among the models used. "Agree," "Majority," and "Other" represent "all models return the same results," "the final results are the same as majority voting," and all other cases, respectively. From these results, we can confirm that the average-calculation ensemble method is better than majority voting, because the number of correct questions for "Other" is larger than the number of wrong ones. For the "Agree" questions, the best performance system (HUKB-2) had the largest numbers because of the small number of models used (three), but the accuracy of HUKB-1 (using seven models) for "Agree" was better than that of HUKB-2. However, the accuracy of HUKB-3 was lower than that of HUKB-2, which suggests that selecting an appropriate set of models for the ensemble is also effective for maintaining the accuracy on the "Agree" questions.

Table 6: Number of questions classified by the ensemble results

                 Agree               Majority            Other
Submission ID    Correct   Wrong     Correct   Wrong     Correct   Wrong
HUKB-1           17        4         27        14        11        8
HUKB-2           28        9         23        11        6         4
HUKB-3           19        9         17        12        19        9

Second, we analyze the characteristics of our system based on the difficulty estimated by the number of runs that return the correct answer provided by the organizers. Table 7 shows the number of questions corresponding to the number of correct runs among the 18 submitted runs (Table 4). Questions with a smaller number of correct runs may be difficult problems common to all submitted methods.

Table 7: Topic difficulty analysis based on the number of correct runs

No. of correct runs   No. of questions   No. of correct answers by HUKB-2
1–3                   7                  0
4–6                   11                 1
7–9                   12                 9
10–12                 19                 15
13–15                 19                 19
16–18                 13                 13

From this table, we confirm that our method answers the easy questions consistently. These characteristics may come from our ensemble method reducing the effect of the variability of the training data sets.

By contrast, our system performs poorly on difficult questions, suggesting common problems that nearly all submitted systems cannot handle at this moment.

We would like to discuss the characteristics of such difficult questions using examples.

The following question (Figure 3) is a difficult question that only one run can answer correctly. Because the main terms appear in both the question and the first sentence, the systems tend to say positive (entail) for this question. However, it also matches the last sentence, which explains an exceptional case of the article. As a result, the given article does not entail the question.

Because our data augmentation method splits the sentences and only handles flipped negative cases, as introduced in Section 3.1, our system cannot answer this question correctly either. However, because several articles have such exceptional cases, it may be better to propose a data augmentation method to handle such articles.

The following failure example (Figure 4; one run can answer correctly) is also related to a logical expression (quantifier). The article says "together with the obligee" (more than two), but the
question says "independently" (single). For this case, it is not so easy to make a simple data augmentation method for handling this type of logical mismatch.

The following failure example (Figure 5; three runs can answer correctly) is related to both a logical expression and a semantic mismatch. The article says "the other party to the contract gives consent" (consent is required), but the question says "regardless of whether A consents." Because there are no patterns for handling such logical mismatches in the augmented data, it is comparatively difficult for the system to identify this type of logical mismatch. In addition, the vocabulary used for representing the related persons is totally different: "A", "B", and "E" are used in the question, whereas "one of the parties", "the other party" and "the third party" are used in the article. It is also difficult for the system to estimate the relationships among them.

Figure 3: Example of the failure of a difficult question. (The original Japanese question and article are shown in the figure together with the English translations below.)
Question R02-25-E: "If the lessee lawfully subleases a leased thing, the lessor may not duly assert against the sublessee the cancellation by agreement of the lease with the lessee even if the lessor has a right to cancel due to non-performance on the part of the lessee at the time of the cancellation."
Article for entailment (answer is No), Article 613 (3): "If the lessee lawfully subleases a leased thing, the lessor may not duly assert against the sublessee the cancellation by agreement of the lease with the lessee; provided, however, that this does not apply if, at the time of the cancellation, the lessor has a right to cancel due to non-performance on the part of the lessee."

Figure 4: Example of the failure of a difficult question (2).
Question R02-19-I: "If a guarantor has partially paid a secured claim but there is a remaining obligation, the guarantor may independently exercise the secured claim and the security right acquired through subrogation in proportion to the value of the subrogee's performance."
Article for entailment (answer is No), Article 502 (1): "If performance by subrogation occurs with respect to one part of a claim, the subrogee, with the consent of the obligee, may exercise the rights of the subrogee together with the obligee in proportion to the value of the subrogee's performance."

Figure 5: Example of the failure of a difficult question (3).
Question R02-23-U: "Where a contract for the sale of a used watch owned by A has been concluded between A and B: if B makes an agreement with E to transfer the contractual status of the buyer to E, regardless of whether A consents, the status of the buyer is transferred to E."
Article for entailment (answer is No), Article 539-2: "If one of the parties to a contract made an agreement with a third party to transfer that party's contractual status to that third party, and the other party to the contract gives consent to the transfer, the contractual status is transferred to the third party."

4 SUMMARY
In this paper, we introduced our system for participating in task 4 (legal textual entailment) of COLIEE 2021. This system uses a BERT-based entailment system with data augmentation by flipping the judicial decisions of the article sentences. We also proposed a method to make various BERT models and selected an appropriate ensemble model set using a validation data set. The effectiveness of the proposed system was evaluated by COLIEE 2021 task 4 (the textual entailment task), and the accuracy of our system was 0.7037, which was the best among all runs. We also discussed the characteristics of the failures of our system for future development.

ACKNOWLEDGMENT
We thank the organizers of COLIEE for their efforts in constructing this test data. This work was partially supported by JSPS KAKENHI Grant Number 18H0333808.
REFERENCES
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[2] Richard Evans, David Saxton, David Amos, Pushmeet Kohli, and Edward Grefenstette. 2018. Can Neural Networks Understand Logical Entailment?. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SkZxCk-0Z
[3] Yoshinobu Kano, Mi-Young Kim, Randy Goebel, and Ken Satoh. 2017. Overview of COLIEE 2017. In COLIEE 2017. 4th Competition on Legal Information Extraction and Entailment (EPiC Series in Computing, Vol. 47), Ken Satoh, Mi-Young Kim, Yoshinobu Kano, Randy Goebel, and Tiago Oliveira (Eds.). EasyChair, 1–8.
[4] Mi-Young Kim, Randy Goebel, Yoshinobu Kano, and Ken Satoh. 2016. COLIEE-2016: Evaluation of the Competition on Legal Information Extraction and Entailment. In The Proceedings of the 10th International Workshop on Juris-Informatics (JURISIN2016). Paper 11.
[5] Mi-Young Kim, Ying Xu, and Randy Goebel. 2017. Applying a Convolutional Neural Network to Legal Question Answering. In New Frontiers in Artificial Intelligence, Mihoko Otake, Setsuya Kurahashi, Yuiko Ota, Ken Satoh, and Daisuke Bekki (Eds.). Springer International Publishing, Cham, 282–294.
[6] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[7] George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (1995), 39–41.
[8] Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020. Syntactic Data Augmentation Increases Robustness to Inference Heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 2339–2352. https://doi.org/10.18653/v1/2020.acl-main.212
[9] Ha-Thanh Nguyen, Hai-Yen Thi Vuong, Phuong Minh Nguyen, Binh Tran Dang, Quan Minh Bui, Sinh Trong Vu, Chau Minh Nguyen, Vu Tran, Ken Satoh, and Minh Le Nguyen. 2020. JNLP Team: Deep Learning for Legal Processing. In The Proceedings of the 14th International Workshop on Juris-Informatics (JURISIN2020). The Japanese Society of Artificial Intelligence, 195–208.
[10] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. 2020. COLIEE 2020: Methods for Legal Document Retrieval and Entailment. In The Proceedings of the 14th International Workshop on Juris-Informatics (JURISIN2020). The Japanese Society of Artificial Intelligence, 114–127.
[11] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. 2020. A Summary of the COLIEE 2019 Competition. In New Frontiers in Artificial Intelligence, Maki Sakamoto, Naoaki Okazaki, Koji Mineshima, and Ken Satoh (Eds.). Springer International Publishing, Cham, 34–49.
[12] Hsuan-Lei Shao, Yi-Chia Chen, and Sieh-Chuen Huang. 2020. BERT-based Ensemble Model for the Statute Law Retrieval and Legal Information Entailment. In The Proceedings of the 14th International Workshop on Juris-Informatics (JURISIN2020). The Japanese Society of Artificial Intelligence, 223–234.
[13] Connor Shorten and T. Khoshgoftaar. 2019. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data 6 (2019), 1–48. https://doi.org/10.1186/s40537-019-0197-0
[14] Ryosuke Taniguchi, Reina Hoshino, and Yoshinobu Kano. 2019. Legal Question Answering System Using FrameNet. In New Frontiers in Artificial Intelligence, Kazuhiro Kojima, Maki Sakamoto, Koji Mineshima, and Ken Satoh (Eds.). Springer International Publishing, Cham, 193–206.
[15] Masaharu Yoshioka, Yoshinobu Kano, Naoki Kiyota, and Ken Satoh. 2018. Overview of Japanese Statute Law Retrieval and Entailment Task at COLIEE-2018. In The Proceedings of the 12th International Workshop on Juris-Informatics (JURISIN2018). The Japanese Society of Artificial Intelligence, 117–128.
Legal Norm Retrieval with Variations of the BERT Model
Combined with TF-IDF Vectorization
Sabine Wehnert
sabine.wehnert@gei.de
Georg Eckert Institute – Leibniz Institute for International Textbook Research, Germany
Otto von Guericke University, Magdeburg, Germany

Viju Sudhi, Shipra Dureja, Libin Kutty, Saijal Shahania
<firstname>.<lastname>@st.ovgu.de
Otto von Guericke University, Magdeburg, Germany

Ernesto W. De Luca
deluca@gei.de
Georg Eckert Institute – Leibniz Institute for International Textbook Research, Germany
Otto von Guericke University, Magdeburg, Germany
ABSTRACT
In this work, we examine variations of the BERT model on the statute law retrieval task of the COLIEE competition. This includes approaches to leverage BERT's contextual word embeddings, fine-tuning the model, combining it with TF-IDF vectorization, adding external knowledge to the statutes, and data augmentation. Our ensemble of Sentence-BERT with two different TF-IDF representations and document enrichment exhibits the best performance on this task regarding the F2 score. This is followed by a fine-tuned LEGAL-BERT with TF-IDF and data augmentation, and our third approach with the BERTScore. As a result, we show that there are significant differences between the chosen BERT approaches and discuss several design decisions in the context of statute law retrieval.

CCS CONCEPTS
• Applied computing → Law; • Information systems → Document representation; Language models; Similarity measures; Relevance assessment; • Computing methodologies → Neural networks.

KEYWORDS
contextual word embeddings, document enrichment, data augmentation, legal information retrieval

ACM Reference Format:
Sabine Wehnert, Viju Sudhi, Shipra Dureja, Libin Kutty, Saijal Shahania, and Ernesto W. De Luca. 2021. Legal Norm Retrieval with Variations of the BERT Model Combined with TF-IDF Vectorization. In ICAIL '21: International Conference on Artificial Intelligence and Law, June 21–25, 2021, online. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466104

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
ICAIL '21, June 21–25, 2021, online
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466104

1 INTRODUCTION
In this paper, we describe our approach for the retrieval task 3 of the COLIEE competition, based on the English version of the Japanese Civil Code. Legal statute retrieval is a challenging task due to the short, abstract nature of law articles and the at times very specific scenarios described in a query. Given the requirement of high recall and reasonable precision values at the same time, most state-of-the-art retrieval approaches are not reliable enough for real-life scenarios. Since their first use in the Competition on Legal Information Extraction/Entailment (COLIEE), BERT (Bidirectional Encoder Representations from Transformers) approaches have received criticism regarding their explainability compared to other traditional machine learning methods that have been employed. A term frequency–inverse document frequency (TF-IDF) document representation may appear more reliable to a legal practitioner, since ranking based on term statistics and its side effects are interpretable. Nevertheless, the ongoing success of BERT-based methods may justify their continued use. A big part of the performance of BERT models is attributed to the rigorous pre-training on huge corpora and the large model capacity, for example with 768 embedding dimensions in the base model. A pre-trained language model has encountered a big diversity of text and has therefore been exposed to enough examples to form a decent representation for the contextual use of words. Hence, substantial knowledge from a distributional semantics point of view appears to be helpful in this task. However, using a standard BERT model on the plain training data is not sufficient. The COLIEE data for task 3 is quite limited in its size due to the massive human effort behind its creation. This poses a challenge for the training of deep learning models, in addition to the legal jargon, which differs from the language the models may be pre-trained on. We investigate whether data manipulations such as augmentation and the decomposition of relevant articles help with this issue.

In previous COLIEE editions, machine learning models have also benefited from other types of external knowledge, for example by using ontologies for information extraction. This motivates us to also enrich the training data and to encode the additional content jointly with the original text by using a BERT model. In addition, several BERT approaches have emerged, with variants specializing in our domain, such as LEGAL-BERT [3]. Aside from the model selection, we can also choose between using the BERT model in a supervised setting as a relevance classifier or extracting contextual word embeddings and using them directly for similarity scoring. We use our three runs in the competition to test a few BERT variations based on those considerations. The contributions of our three runs are:
• We combine Sentence-BERT with modifications on TF-IDF vectorization and document enrichment strategies
• We perform dataset manipulations to train a BERT classifier for retrieval and combine it with TF-IDF vectorization
• We test similarity scores obtained from the BERTScore

The remainder of this work is organized in the following way: In Section 2, we describe approaches using TF-IDF and BERT models and their various uses in past editions of the COLIEE competition. Section 3 contains conceptual descriptions of each of our three runs: Sentence-BERT Embedding with TF-IDF, LEGAL-BERT with TF-IDF, and BERTScore. Section 4 gives more details on our experimental setup and results, followed by a discussion. In the final section we conclude our results and indicate future research potential.

2 RELATED WORK
In this section, we describe related research on the retrieval methods we used. In particular, we investigate past uses of the respective methods within the COLIEE competition. First, we briefly review TF-IDF (term frequency–inverse document frequency) vectorization and its place within the competition. Second, we collect approaches which are similar to our methods using BERT (Bidirectional Encoder Representations from Transformers) and also distinguish our methods for the three submitted runs from the existing work.

2.1 Retrieval with TF-IDF
TF-IDF vectorization gives an idea of how relevant a particular term is within a document and within the document collection. TF-IDF vectors represent a document by assigning a higher weight to terms which appear relatively frequently in few documents, compared to their usual occurrence in the rest of the corpus, by discounting the term frequency with the inverse document frequency. As Beel et al. [1] comment, the TF-IDF vectorization scheme is the most widely used approach for content-based filtering in recommender systems and related text mining domains. In the COLIEE competition, multiple teams in previous years used TF-IDF vectors, with or without other representation methods, to retrieve the relevant articles given a query [8, 10]. In legal information retrieval, TF-IDF alone is still a valuable baseline model because its results are easy to interpret for domain experts. However, in previous editions of the competition, a mere TF-IDF approach could not reliably achieve winning scores. When used in conjunction with other embedding techniques, competitive results were attainable. One such approach has been employed by Rabelo et al. [9] to address the case law entailment task. They employ two different cosine similarity approaches and a confidence score from BERT [4] to improve the extraction/entailment results. We adopt a similar approach in our second run by combining TF-IDF similarity scores with the softmax scores obtained from fine-tuned BERT models. However, we differ in the way of choosing documents to calculate similarity and in the thresholding for the retrieval task.

2.2 Retrieval with BERT
Nowadays, many pre-trained deep learning-based language models are available, coming from neural network architectures for Natural Language Processing (NLP) with significant improvements for various downstream tasks, such as single-sentence classification, question answering, sentence tagging and paraphrase identification. BERT, introduced by Devlin et al. [4], is currently a common choice for such downstream tasks, replacing various traditional NLP pipelines. Following this, there has been an exhaustive study of the applications of BERT and experiments investigating different fine-tuning methods for these pre-trained models by Sun et al. [12]. They present various fine-tuning strategies for BERT on a text classification task, providing a general solution for achieving state-of-the-art results on a variety of text classification datasets. We follow a few of these best practices for better performance with our selected models, such as:
(1) the use of the right combination of different hyperparameters that directly affect the learning,
(2) the importance of selecting the correct number of warm-up steps,
(3) how concentrating on the decay rate can help to converge towards the minima, and when the learning rate decay should start, and
(4) the right combination of batch size with the number of epochs and warm-up steps.

2.2.1 Fine-tuning BERT. When Devlin et al. proposed the BERT model, they described its use on downstream tasks in two phases: a pre-training and a fine-tuning phase [4]. Therefore, its intended use for any further task is to first fine-tune it in order to achieve the desired performance. Nowadays, BERT is not always fine-tuned; sometimes the pre-trained model and its embeddings perform well enough, if the domain is not substantially changed compared to what the model was pre-trained on. However, for the legal domain it can be worthwhile to adapt an existing BERT model to the different use of vocabulary in that context. This can also be observed in past COLIEE competitions. For the task on statute law retrieval, Nguyen et al. [8] use an ensemble of BERT models. The publicly available bert-base-uncased1 model is pre-trained on the English-language Wikipedia and BooksCorpus [14] and then fine-tuned by Nguyen et al. on the COLIEE training data. The model is combined with another special bert-base-uncased model that is further trained with the masked language model (MLM) objective on the entire COLIEE data (BERT-CC) and fine-tuned on the training data to obtain a measure of relevance. This ensemble of BERT models achieved the best F2 score on the validation data. As their BERT-CC focuses on legal domain knowledge, we reviewed further special BERT models. We found RoBERTa (Robustly Optimized BERT pre-training Approach) [7] with its variants, and LEGAL-BERT [3], as promising models for task 3. RoBERTa [7] is optimized with some alterations to essential hyperparameters in BERT and trained with relatively bigger batches over a larger amount of training data. It also excludes BERT's next-sentence prediction task, allowing it to improve on MLM over BERT. This leads to better performance on various baseline NLP downstream tasks [7]. Similarly, LEGAL-BERT [3] is an adaptation of BERT to the legal domain, where pre-training is carried out on a collection of several fields of English legal text, such as contracts, court cases, and legislation. This special BERT model has been performing better than the original version of BERT on legal domain-specific tasks [3].
1 https://huggingface.co/bert-base-uncased
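As a concrete illustration of the fine-tuning practices listed above (warm-up steps, learning-rate decay, and the interplay of batch size and epochs), the following sketch wires a LEGAL-BERT classifier to AdamW with a linear warm-up/decay schedule from the transformers library. The model identifier nlpaueb/legal-bert-base-uncased and all numeric values are assumptions for illustration, not the settings of our submitted runs.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

MODEL = "nlpaueb/legal-bert-base-uncased"       # LEGAL-BERT base model
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Placeholder values: batch size, epochs and warm-up must be tuned jointly.
num_training_steps = 500                        # ≈ (examples / batch size) * epochs
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # warm-up before decay starts
    num_training_steps=num_training_steps)

batch = tokenizer(["Is registration required to assert the right?"],
                  ["Acquisitions of real rights on immovables ..."],
                  truncation=True, padding=True, return_tensors="pt")
batch["labels"] = torch.tensor([1])             # 1 = relevant, 0 = not relevant

for _ in range(3):                              # a few illustrative optimisation steps
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()                            # learning rate decays after every step
    optimizer.zero_grad()
```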
2.2.2 Contextual Embeddings from BERT. Aside from further training the whole BERT language model and using it for a classification task, we can also use contextual word embeddings from a pre-trained BERT model to determine the semantic similarity of the query and the article(s). Contextual word embeddings are computed at runtime. In particular, we obtain different vectors for the same word when it is used in another context or position in a sentence. In that way, we can also distinguish homonyms when they are accompanied by enough words in the appropriate context. Since the contextual embedding type is quite recent, there is no final consensus in the research community on how to compute the distance between two contextual word embedding sequences. The most common methods are: using the [CLS] token, which is often seen as a representation of a whole sentence, using the individual word embeddings, or averaging all individual word embeddings in a sentence and then computing a similarity score using the Word Mover's Distance [6] or cosine similarity. In the experiments by Reimers et al. [11], using the mean of the individual contextual word embeddings outperformed the approach with the [CLS] token. A recent approach related to this is the BERTScore [13]. After computing the pairwise cosine similarities of all token-wise contextual embeddings from two sentences, BERTScore selects the token pairs between the two sentences which have the highest cosine similarity. Those similarities are summed up and discounted by the number of words in the sentence to obtain precision, recall and the corresponding F1-score. Optionally, the BERTScore can also incorporate IDF weighting. We employ the BERTScore in our third run to test whether the mere embeddings of BERT can capture enough context in the training data, compared to using document enrichment or fine-tuning a BERT model for relevance classification.

For us, it is particularly interesting that there are recent approaches for fine-tuning a language model specifically to obtain meaningful sentence embeddings [2, 11]. In the previous COLIEE edition, the cyber team achieved the best performance among all teams using the universal sentence encoder, TF-IDF and a support vector machine for the case law retrieval task [10]. Hence, we assume that TF-IDF combined with sentence embeddings could also work well on the statute law retrieval task. A new advancement on sentence embeddings has been made by Reimers et al. [11], who introduce Sentence-BERT. It outperforms the existing embedding methods and is found useful for multiple downstream tasks. It is based on a Siamese network architecture which ties the weights of two BERT models (one for each input sentence) that are updated during fine-tuning. As a default, the mean is used to pool the obtained contextual word embeddings from each BERT model. Then, the two resulting sentence embeddings are concatenated with their element-wise difference, so that the final softmax layer can predict a class. We have made use of this state-of-the-art embedding approach to create a richer and more meaningful numeric representation of each article and query pair in our first run. In a previous COLIEE edition, Kim et al. [5] employed a Siamese Deep Convolutional Neural Network for the entailment task, which resulted in better performance compared to regular Convolutional Neural Networks. They attribute their success to the Siamese architecture, which requires fewer parameters due to the weight-sharing mechanism and has a lower risk of overfitting. For this reason, we assume that a sentence embedding based on a similar architecture may be a good fit for the COLIEE data and also perform well on the retrieval task.

Overall, TF-IDF vectorization and BERT-based approaches have already been tried in the course of past COLIEE editions. Nevertheless, there are many options for employing both methods, while fine-tuning can affect the outcome substantially.

3 STATUTE RETRIEVAL TASK
This section describes in detail the three different methods we proposed and implemented for task 3 in COLIEE 2021, as listed in Table 1. While the first method exploits Sentence-BERT coupled with TF-IDF vectors and data enrichment, the second method uses LEGAL-BERT with TF-IDF vectors and data augmentation. The third method applies the BERTScore to solve the problem at hand.

Table 1: Methods for each run for task 3

Run Name     Method
OvGU_run1    Sentence-BERT + TF-IDF + data enrichment
OvGU_run2    LEGAL-BERT + TF-IDF + data augmentation
OvGU_run3    BERTScore

3.1 Sentence-BERT Embedding with TF-IDF
The first run involves a combination of two-stage TF-IDF vectorization with Sentence-BERT embeddings. This run was the best out of all the runs submitted for task 3 in COLIEE 2021. An overview of the approach is depicted in Figure 1 and described in the following.

Figure 1: Overview of the approach using Sentence-BERT embeddings with TF-IDF. (The figure shows three parallel branches: a first-stage TF-IDF over articles enriched with metadata and relevant queries, a second-stage TF-IDF over articles enriched with metadata only, and Sentence-BERT embeddings over articles enriched with metadata, relevant queries and crawled data. In each branch, the cosine similarity between the article vectors v_a and the query vector v_q is computed; the three similarity scores are summed, normalized and thresholded to select the relevant article(s).)

We start by enriching the training data with multiple adjustments as described in Table 2. This enrichment helps us to create vectors for each article in the Civil Code which are more unique than those the training data itself could deliver. A concrete example of the enrichment process for Article 177 can be found in Table 3. We enrich each article in the training data as follows:
• Metadata: We add structural information using the section titles in the Civil Code. In that way, hierarchical relations between articles within the same Part, Chapter, Section and even Subsection are modeled.
• Crawled data: We crawl Japanese open-source commentary on the Civil Code articles and thereby potentially enrich the original article text with general remarks, corner cases, previous versions, related articles and a reasoning for the relation.
• Relevant queries from training data: We parse the training data labels of task 4 (entailment) to enrich our training data of task 3 with queries that have a positive entailment relationship. With a positive entailment relationship, we can be sure that the added queries correspond to the meaning of the article and can help in determining relevance, too.

Table 2: Data enrichment for the statute retrieval task

Enrichment                        Description
Articles with metadata            training data + details regarding Part, Chapter, Section and Subsection
Articles with crawled data        training data + translated crawled data from the website https://ja.wikibooks.org/
Articles with relevant queries    training data + queries from the training data if the entailment label is Y for the respective article
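A minimal sketch of this enrichment step is given below: the article text is concatenated with its metadata, the crawled commentary and the positively entailed task-4 queries to form the enriched views used later for vectorization. The record structure and the example values are hypothetical; only the concatenation logic reflects the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Article:
    number: str
    text: str
    metadata: str                       # Part / Chapter / Section / Subsection titles
    crawled: str = ""                   # translated open-source commentary
    relevant_queries: List[str] = field(default_factory=list)  # entailed task-4 queries

def enrich(article: Article) -> Dict[str, str]:
    """Build the enriched article views used by the two TF-IDF stages and Sentence-BERT."""
    with_metadata = f"{article.text} {article.metadata}"
    with_queries = " ".join([with_metadata] + article.relevant_queries)
    with_crawled = f"{with_queries} {article.crawled}".strip()
    return {
        "metadata_only": with_metadata,       # second TF-IDF stage
        "metadata_and_queries": with_queries, # first TF-IDF stage
        "all_enrichments": with_crawled,      # Sentence-BERT input
    }

art = Article(
    number="177",
    text="Acquisitions of, losses of and changes in real rights on immovables ...",
    metadata="Part II Real Rights, Chapter I General Provisions",
    crawled="Commentary on the scope of registration ...",
    relevant_queries=["A sale of land cannot be asserted against a third party without registration."],
)
print(enrich(art)["metadata_and_queries"][:120])
```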
TF-IDF Stage 1
Article with metadata +
Article with relevant queries
cos

𝑣𝑞

𝑣𝑎
Query
TF-IDF Stage 2
Article with metadata Í
+ cos Normalization Thresholding

𝑣𝑞
Article(s)
𝑣𝑎
𝑣 𝑎 - article vectors (for all articles)
Sentence Embedding 𝑣𝑞 - article vectors (for a given query)
Article with metadata +
cos Í
Article with relevant queries + cosine
cos sum
Article with crawled queries similarity
𝑣𝑞

Figure 1: Overview of the approach using Sentence-BERT embeddings with TF-IDF.

Table 2: Data enrichment for the statute retrieval task

  Enrichment                      Description
  Articles with metadata          training data + details regarding Part, Chapter, Section and Subsection.
  Articles with crawled data      training data + translated crawled data from the website https://ja.wikibooks.org/
  Articles with relevant queries  training data + queries from the training data if the entailment label is Y for the respective article.

(1) TF-IDF vectors are computed for queries and articles together as a two-stage process. In the first stage, we rely on sub-linear term frequency scaling and L2 normalization while computing the vectors. Articles are enriched by a combination of Articles with relevant queries and Articles with metadata. The vectors $\vec{v}$ are computed by the following equations 1 - 4:

    $tf_{t,d} = 1 + \log(tf_{t,d})$   (1)
    $idf_t = \log\left(\frac{N}{1 + df_t}\right)$   (2)
    $w_{t,d} = tf_{t,d} \cdot idf_t$   (3)
    $\vec{v} = \frac{\vec{w}}{\sqrt{\sum_i w_{i,d}^2}}$   (4)

where
  - $tf_{t,d}$ is the term frequency, i.e. the frequency of term $t$ in document $d$. Here, documents are the individual articles of the Civil Code.
  - $N$ is the total number of documents in the collection.
  - $df_t$ is the document frequency, i.e. the frequency of term $t$ in the collection.
  - $w_{t,d}$ is the weight, which is the product of term frequency and inverse document frequency.

The vectors after a single stage of TF-IDF vectorization yielded significant precision-recall trade-offs, reflected in the relatively lower F2 scores. This prompted us to provide a different, but unique representation of the articles, which ended up in a second stage of TF-IDF where query and article vectors have been created considering only the Articles with metadata enrichment. The combination of both stages acts as a counter-balance in the trade-off.

(2) Sentence-BERT embeddings for each article are created with the enrichment described in Table 2. We rely on the implementation³ by Reimers et al. [11] and use the pre-trained paraphrase-distilroberta-base-v1 model to create the article and query embeddings. We select the aforementioned paraphrase model because it was trained on millions of paraphrase examples and is reportedly performing well on natural language inference tasks⁴.

(3) Finally, for each query-article pair we compute the cosine similarity to determine the relevance of each article for the respective query. For each pair, we obtain three different similarity scores from the first stage TF-IDF, second stage TF-IDF and Sentence-BERT embeddings. The sum of these scores is then normalized and we empirically determine a threshold to filter out the best relevant articles for each test query.

3 https://github.com/UKPLab/sentence-transformers
4 https://www.sbert.net/docs/pre-trained_models.html#paraphrase-identification
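To make the combined scoring of steps (1)-(3) concrete, the following is a minimal Python sketch. It is an illustration under our own assumptions (variable and function names are not from the authors' released code), and a single cut-off stands in for the empirically determined threshold:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("paraphrase-distilroberta-base-v1")

def tfidf_similarities(article_texts, query):
    # Sub-linear term frequency scaling and L2 normalization, as in Equations 1-4.
    vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2")
    article_vecs = vectorizer.fit_transform(article_texts)
    return cosine_similarity(vectorizer.transform([query]), article_vecs)[0]

def sbert_similarities(article_texts, query):
    article_emb = sbert.encode(article_texts)
    return cosine_similarity(sbert.encode([query]), article_emb)[0]

def rank_articles(stage1_texts, stage2_texts, sbert_texts, query, cutoff=0.85):
    """stage1_texts: articles with metadata + relevant queries,
    stage2_texts: articles with metadata only,
    sbert_texts:  articles with metadata + relevant queries + crawled data."""
    scores = (tfidf_similarities(stage1_texts, query)
              + tfidf_similarities(stage2_texts, query)
              + sbert_similarities(sbert_texts, query))
    scores = scores / scores.max()            # normalize the summed scores
    ranked = np.argsort(-scores)              # descending order
    return [(int(i), float(scores[i])) for i in ranked if scores[i] >= cutoff]
```

Each representation contributes one cosine similarity per article, mirroring the three branches of Figure 1.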

Table 3: Example data enrichment for Article 177 of the Civil Code

training data
Article 177: Acquisitions of, losses of and changes in real rights on immovables .. and other laws regarding registration.
Metadata
Part: II Real Rights Chapter: I General Provisions Section: 3 Extinctive Prescription
Subsection: Requirements of Perfection of Changes in Real Rights on Immovables
Crawled data
Comprehensive succession - The range of changes in property rights that require registration has been determined ..
Legal evidence theory - What kind of person is referred to as "a person who has a legitimate interest ..
.. (161 unique words in total, shortened to conserve space)
Relevant queries from training data
– H19-11-3: In a case where A bought a registered building owned by B .. his/her acquisition of ownership of that building.
– H21-24-E: If a mortgage creation contract has the agreement of the mortgagee .. there is no registration of its creation.
– R01-6-A: In cases A sold Land X belonging to A and B sold it to C, C may be asserted .. for sales without security.
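As an illustration of how such an enriched representation can be assembled from the pieces shown in Tables 2 and 3, consider the following sketch; the dictionary fields are hypothetical and only stand for the enrichment sources described above:

```python
def enrich_article(article):
    """Concatenate the original article with metadata, crawled commentary and
    positively entailed training queries, following the scheme of Tables 2 and 3."""
    parts = [article["text"]]
    parts += article.get("metadata", [])          # Part / Chapter / Section / Subsection titles
    parts += article.get("crawled", [])           # translated open-source commentary
    parts += article.get("relevant_queries", [])  # task 4 queries with a positive label
    return " ".join(p for p in parts if p)

article_177 = {  # abbreviated, following Table 3
    "text": "Acquisitions of, losses of and changes in real rights on immovables ...",
    "metadata": ["Part: II Real Rights", "Chapter: I General Provisions"],
    "crawled": ["Comprehensive succession - The range of changes in property rights ..."],
    "relevant_queries": ["H19-11-3: In a case where A bought a registered building owned by B ..."],
}
enriched_text = enrich_article(article_177)
```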

For developing an explanatory dialogue in a real setting, the additional text we gained in the enrichment steps can be marked in a different font style. Then, we can highlight important keywords based on the scores of each TF-IDF stage. Since we did not apply any weighting during the cosine similarity computation of the Sentence-BERT embeddings, the similarity between the word vectors of query and article can be visualized using a heatmap.

3.2 LEGAL-BERT with TF-IDF
For our run 2, we treat task 3 as a sentence-pair classification task to predict the relevance of 1 if the given article is related to the query and 0 otherwise. Considering its good performance on previous retrieval tasks, we choose to work with a BERT model. A variety of BERT models that are pre-trained on different datasets can be used for addressing domain-specific tasks with fine-tuning.

3.2.1 BERT configuration. Following this convention with fine-tuning, we initially used bert-base-uncased, which has 12 hidden layers with 768 hidden units in each layer and 12 attention heads. A classification head is added on top of the base model, consisting of a single layer of fully connected linear neurons. We use the softmax function to get a probability distribution for the two labels and use cross-entropy loss with the Adam optimizer to fine-tune the model. We split the training dataset into two parts for fine-tuning the model (∼85 % for training) and use the rest of the dataset for validation (all queries starting with id "R01-*").

3.2.2 Data Pre-processing. An overview of the pre-processing for the LEGAL-BERT with TF-IDF approach is illustrated in Figure 2. We pre-process both the training and validation splits in the following manner:

(1) Data Decomposition: This is performed to extract each relevant article for a given query to form separate instances. For every query, there is one or more than one article associated with and relevant to it. We take individual articles to create a new instance in the training dataset, so that the query can be divided into multiple instances against all of its relevant articles. An example is shown in Table 4 for the query with the Pair ID "H27-22-4":
Query Q: "In the contract for deposit for value, if the performance of the obligation to return the deposited Thing has become impossible due to reasons not attributable to the depositary, he/she may not claim remuneration from the depositor, with respect to the period after the impossibility of performance of the agreed duration."
After achieving better results with data decomposition than with the original dataset, we further extract referenced articles from each relevant article of the query using regular expressions and append those as well to form multiple instances of query-article pairs for each query. The same example is extended further for Approach 2 in Table 5; a code sketch of both approaches follows Table 5.
However, this extensive decomposition of referenced articles did not optimize our recall further. We assume this is plausible, as these articles are supporting articles to the relevant article content but are not directly relevant to the query. We compare the results with and without data decomposition and summarize them in Table 6. We decided to go with Approach 1, where we have a better recall score.

(2) Data Augmentation: We use the non-relevant articles to reduce data imbalance. For this, we enriched this decomposed dataset using the top 50 non-relevant articles for each query instance. These non-relevant articles are based on the highest cosine similarity between TF-IDF vectors of the relevant article to all the articles excluding the other relevant ones for the respective query. This approach is similar to the implementation by Nguyen et al. [8], where they considered query-article similarity. However, we assume that article-article similarity is better suited than query-article similarity, since we find that articles are more related to each other in terms of cosine similarity than they are to the queries. Based on the cosine similarity, we select only the top 50 non-relevant articles as training examples, since we did not intend to reintroduce the data imbalance that we attempted to overcome with augmentation.


Table 4: Approach 1 - Data decomposition of multiple articles for each query into multiple instances

Queries Articles
Before Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Article 648 (1) In the absence of any ... (2) ... the provisions of Article 624 ... (3) ... course of performance.
Article 536 (1) If the performance ... (2) ... obligee for the benefit.
After Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Query Q Article 648 (1) In the absence of any ... (2) ... the provisions of Article 624 ... (3) ... course of performance.
Query Q Article 536 (1) If the performance ... (2) ... obligee for the benefit.

Table 5: Approach 2 - Data decomposition of multiple articles and their referenced articles for each query into multiple instances

Queries Articles
Before Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Article 648 (1) In the absence of any ... (2) ... the provisions of Article 624 ... (3) ... course of performance.
Article 536 (1) If the performance ... (2) ... obligee for the benefit.
After Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Query Q Article 646 (1) A mandatary must deliver to the mandator monies and other things received during ...
Query Q Article 647 If the mandatary has personally consumed monies that were to be delivered to the mandator ...
Query Q Article 648 (1) In the absence of any special agreements, the mandatary may not claim remuneration ...
Query Q Article 649 If costs will be incurred in administering the mandated business, the mandator must ...
Query Q Article 650 (1) If the mandatary has expended costs found to be necessary for the administration ...
Query Q Article 624 (1) An employee may not demand remuneration until the work the employee promised ...
Query Q Article 536 (1) If the performance ... (2) ... obligee for the benefit.
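The decomposition shown in Tables 4 and 5 can be sketched as follows; the data structures and the reference-extraction pattern are our own simplifications, not the authors' implementation (for instance, ranges such as "Articles 646 through 648" are not expanded here):

```python
import re

def decompose(query_id, query_text, relevant_ids, articles, expand_references=False):
    """Approach 1: one (query, article) instance per relevant article.
    Approach 2 (expand_references=True): additionally add articles that are
    referenced inside a relevant article, found via a simple regular expression."""
    instances = []
    for art_id in relevant_ids:
        instances.append((query_id, query_text, articles[art_id]))
        if expand_references:
            for ref in re.findall(r"Article\s+(\d+(?:-\d+)?)", articles[art_id]):
                if ref in articles and ref not in relevant_ids:
                    instances.append((query_id, query_text, articles[ref]))
    return instances
```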

Table 6: Results on the validation set for different data pre-processing approaches of run 2

  Model                                     Prec    Recall
  bert-base-uncased without decomposition   0.1392  0.3973
  bert-base-uncased with Approach 1         0.2529  0.4421
  bert-base-uncased with Approach 2         0.1179  0.4300

(3) Augmenting the Original Dataset: To ensure that the original data could still influence the model, we also append the original data. In other words, for each query without any data decomposition, all relevant articles are processed in an instance as they are given in the dataset. This increases the number of relevant articles for each query at the cost of generating some duplicates, since for queries which have only one relevant article, those are already obtained at the step of data decomposition. Overall, the three pre-processing steps increase the number of training instances by a factor of 10.

3.2.3 Fine-Tuning. On comparing the results with the legal domain-specific pre-trained BERT, bert-base-uncased was outperformed by legal-bert-base-uncased and legal-RoBERTa on similar hyperparameters. We finally choose legal-bert-base-uncased, as it indicated the most satisfactory results, to further test with different experimental setups in Section 4.1. To extract relevant articles for a given query during testing, we combine each query with all the articles. For LEGAL-BERT, we applied the softmax function to the logits predicted from our model. For each query-article pair, we obtain two softmax probability values, indicating the non-relevance and relevance of the article to the query. We only consider the softmax probabilities of the relevance column. To avoid the underflow of softmax probabilities of top relevant articles, we max-normalize these scores. At the same time, we also calculate the query-article cosine similarity of all the articles for each query. The similarity scores are also max-normalized for the same reasons as stated above. We ultimately compute an average of these two normalized scores. To select the top-n relevant articles we use a threshold value selected based on the precision-recall trade-off for the validation set. The time for training the LEGAL-BERT model increases from 2 minutes on the original dataset to 2 hours on the fully enriched dataset⁵. The larger amount of text in the enriched data does not have a significant impact during the test phase. At runtime, we directly process the new query and all pre-stored enriched articles with the already trained language model, so that the prediction is not causing any noticeable delay in the system's response time.

5 We used an NVIDIA Quadro RTX 8000 to accelerate training.
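The test-time scoring just described can be sketched as follows. This is a simplified illustration that assumes a locally fine-tuned sequence-pair classifier (the checkpoint path is hypothetical) and a precomputed TF-IDF cosine similarity per query-article pair:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# LEGAL-BERT tokenizer; the fine-tuned classifier path below is hypothetical.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("./legal-bert-finetuned-task3")
model.eval()

def bert_relevance_scores(query, articles):
    """Softmax probability of the 'relevant' class (label 1) for each query-article pair."""
    scores = []
    for article in articles:
        inputs = tokenizer(query, article, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        scores.append(probs[0, 1].item())
    return np.array(scores)

def ensemble_scores(bert_scores, tfidf_cosine, threshold=0.5):
    """Max-normalize both score vectors (avoiding underflow of small softmax
    probabilities), average them with equal weights, and keep articles above 0.5."""
    combined = (bert_scores / bert_scores.max() + tfidf_cosine / tfidf_cosine.max()) / 2
    return np.where(combined >= threshold)[0]
```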


Figure 2: Pre-processing for LEGAL-BERT with TF-IDF. (Figure omitted: the original training split is decomposed into individual query-article pairs and augmented with the top 50 non-relevant articles per query, selected by TF-IDF cosine similarity.)

Figure 3: Overview of the approach using LEGAL-BERT with TF-IDF. (Figure omitted: the softmax score for the relevant class from LEGAL-BERT and the TF-IDF cosine similarity are each max-normalized, averaged by an arithmetic mean, and thresholded.)

Figure 4: Overview of the approach using BERTScore. (Figure omitted: each query is scored against every article in the corpus with the BERTScore F1, followed by thresholding.)

3.3 BERTScore
In the third run of the retrieval task, we use the BERTScore [13]. In Zhang et al.'s implementation⁶, BERTScore outputs precision, recall, and F1 measure. We use the F1 score as the main value for further analysis. To decide how many articles should be retrieved per query, the following steps are used to determine the BERTScore threshold (K):

(1) We calculate the BERTScore for each article given a query, and then rank the result set in descending order.
(2) For each query in the training data we select the top n documents, with a BERTScore value for each n. From this, we select the BERTScore value K as a threshold where the F2 score is maximized, since the task performance is evaluated on the F2 score.
(3) For the test data, we take the average K of all the BERTScore thresholds from the training data.

6 https://github.com/Tiiiger/bert_score

4 EVALUATION
In this section, we evaluate the previously presented methods. First, we describe details of the experimental setup. Second, we proceed to show the respective results on the competition task. Third, we discuss our findings. We evaluate our runs for the statute retrieval task on our validation split using variations in hyperparameters for training different models. The evaluation contributes a quantitative and qualitative analysis of how the runs perform with different hyperparameter settings - if applicable - and if they are comparable to each other. We discuss a few of the experiments below.

4.1 Experimental Setup
4.1.1 Sentence-BERT Embedding with TF-IDF. To enrich the articles further, we make use of the crawled content⁷.

7 https://ja.wikibooks.org/wiki/民法第<id>条, where <id> stands for the Article ID of the different articles in the Civil Code.


Table 7: Two stages of TF-IDF counter-balancing the precision-recall trade-off with Sentence-BERT.

                            F2      Prec    Recall
  Validation data
  with 1st stage TF-IDF     54.67   50.16   61.98
  with 2nd stage TF-IDF     53.74   49.54   60.27
  with both stages          56.52   52.60   63.39
  COLIEE 2021 test data
  with 1st stage TF-IDF     72.98   66.77   78.40
  with 2nd stage TF-IDF     73.02   66.28   79.63
  with both stages          73.02   67.49   77.78

Table 8: Results on validation set for run 2 candidates

  Model                     Prec    Recall
  bert-base-uncased         0.2529  0.4421
  legal-bert-base-uncased   0.3447  0.5357
  legal-RoBERTa             0.2205  0.4866

Table 9: Task 3 Results for COLIEE 2021

  Position  Run        F2      Prec    Recall  R_30
  1         OvGU_run1  0.7302  0.6749  0.7778  0.8515
  9         OvGU_run2  0.6717  0.4857  0.8025  0.9010
  18        OvGU_run3  0.3016  0.1570  0.7006  0.7030

We extract all the paragraph tags (<p>) and the list tags (<ol>, <ul>, <dl>, <li>) to get relevant information about the articles. This is motivated by the team TRC3 in the previous year of COLIEE [10], where they used the content in Japanese itself. However, we translate the fetched content to English using the google-trans-new package⁸. To vectorize these enriched articles and queries we used the TfidfVectorizer from scikit-learn.

To address the problem of the precision-recall trade-off, we use two stages of TF-IDF, which is motivated from previous experiments we conducted on queries starting with the id "R01-*", as shown in Table 7. It is evident on the validation data that the two-stage TF-IDF can counter-balance the classical trade-off between precision and recall, considering the improved F2 scores. For the COLIEE 2021 test data the second stage has a positive effect on the F2 score as well, though it is not as significant as we found it for our validation split.

The threshold value to filter out the top n relevant articles was found empirically. After normalizing the sum of the scores from the two stages of TF-IDF and Sentence-BERT embeddings, we considered the top 4 articles. This was purely based on our validation data, where none of the queries had relevant articles exceeding a count of four. This is true for the COLIEE 2021 test data as well, where none of the queries have more than 4 relevant articles. To find a threshold for the scores of these articles, an index-based threshold was found to be better than a single value for the whole set. Accordingly, we take the article in the 1st index (with a score of 1.0) and then set a threshold of 0.91 or higher for the articles if found in the 2nd index and a threshold of 0.85 or above if found in the subsequent indices.

8 https://github.com/lushan88a/google_trans_new

4.1.2 LEGAL-BERT with TF-IDF. To decide among the three alternative models that we selected as candidates for our run 2, as discussed in Section 3.2, we validate them on various hyperparameter settings and observe that the default hyperparameters of the Adam Optimizer, with a selective change in the learning rate ranging from 1e-03 to 1e-06, achieve the best results at a learning rate of 1e-05 among all three models (see Table 8) when trained for 3 epochs with batch size 16. Considering the highest recall score, we select legal-bert-base-uncased to be our final choice for run 2.

Further, we experiment with warm-up steps, introduce a decay rate and did some hyperparameter tuning to optimize our results. We notice that with 3500 warm-up steps and a decay rate of 0.1/(1 + epoch), we achieve the best performance. We then perform further training on the validation set with an increased batch size of 24. We then create an ensemble of legal-bert-base-uncased and TF-IDF vectors, both with max-normalized similarity scores for the article-query pairs, assigning equal weights to both the scores. Finally, we fetch the articles that are above the threshold value of 0.5.

4.1.3 BERTScore. For the BERTScore, we use the model type bert-base-uncased, 9 layers and no re-weighting with IDF. This setup was determined based on the performance on our validation data which we also used before (queries starting with the id "R01-*"). The text is processed with the regular Tokenizer of BERT and we pass query and article(s) without further modification to the scorer of the original BERTScore implementation. Our thresholding strategy for this run results in a threshold value of 0.63331205.

4.2 Results
Our first run, OvGU_run1, obtained the first position for its F2 score in the overall task evaluation for COLIEE 2021. OvGU_run2 also has the best recall, sharing the position with the run JNLP.CrossLMultiLThreshold, closely followed by OvGU_run1. Considering Recall at 30, our runs have the third best (for OvGU_run2) and the fifth best (for OvGU_run1) scores. The results for our runs are summarized in Table 9. Values in bold are the best scores for the corresponding metric.

4.3 Discussion
We assume that our first run provides reliable results because of the combination of contextual Sentence-BERT embeddings with the TF-IDF vectors. This is supported by the test query R02-1-A:
"The family court may decide to commence an assistance also in respect of a person whose capacity to appreciate their own situation is extremely inadequate due to a mental disorder.",
as shown in Table 10. For this query, only Sentence-BERT embeddings could retrieve the most relevant Article 15, which was not retrieved in either stage of TF-IDF vectorization.


Table 10: Comparison of results for query R02-1-A

  Method              Retrieved articles
  With TF-IDF stages  Article 11
  With Sentence-BERT  Article 15, Article 7, Article 11
  With combination    Article 11, Article 15
  Relevant articles   Article 15, Article 11

Table 11: Comparison of results for query R02-24-U

  Method             Top retrieved articles
  With Run 1         Article 563, Article 566, Article 567
  With Run 2         Article 563, Article 565, Article 567, Article 562
  Relevant articles  Article 562, Article 563

Article 15 has the following content:
"(Decisions for Commencement of Assistance)
Article 15 (1) The family court may decide to commence an assistance in respect of a person whose capacity to appreciate their own situation is inadequate due to a mental disorder, at the request of the person in question, that person's spouse, that person's relative within the fourth degree of kinship, the guardian, the guardian's supervisor, the curator, the curator's supervisor, or a public prosecutor; provided, however, that this does not apply to a person with respect to whom there are grounds as prescribed in Article 7 or the main clause of Article 11. (2) The issuance of a decision for commencement of assistance at the request of a person other than the person in question requires the consent of the person in question. (3) A decision for commencement of assistance must be made concurrent with a decision as referred to in Article 17, paragraph (1) or a decision as referred to in Article 876-9, paragraph (1)."
It turns out that for this query-article pair, we have a significant term overlap, which may be diluted by the whole article length. In that way, sentence-based approaches in general may work well for this query. Only our run OvGU_run1 and TR_HB have a 100% F2 score for this query.

On comparing our different runs, we find interesting similarities in the articles retrieved by each of them. This might possibly be because of the common TF-IDF coupling in the first two runs. We did not expect that embeddings from a pre-trained model (in run 1) could give more or less comparable results with those from a model further trained on the COLIEE dataset (in run 2).

Another insight from the results is how thresholding plays a significant role in the retrieval task. For example, the test query R02-24-U:
"A donor shall assume a duty to retain the subject matter exercising care identical to that he/she exercises for his/her own property until the completion of such delivery.",
retrieved only one relevant article with run 1 but both relevant articles with run 2. This is described in Table 11. Drawing conclusions from this query and the many similar queries, we are not surprised to see the fine-tuned model of run 2 retrieve 74 candidate articles and run 1 retrieve only 70 candidate articles from a total of 101. This results in run 2 having the overall best recall of 0.8025.

With BERTScore, the interesting query to analyse is the test query R02-17-I:
"In the case that D manifests the intention to release another obligor (C) from the obligation to D, even if neither D nor B manifests a particular intention, D may not claim the payment of 600,000 to another obligor (A)."
We are able to retrieve 3 out of 4 articles (Article 439, 440 and 441) using this technique, which was the highest number when compared to other teams for this query. However, this result can be attributed to the threshold we selected, with high recall and lower precision. The ranking of the articles by the BERTScore is only average even for this query, considering the Mean Average Precision (MAP). The MAP score is only 0.0509 for run 3, while run 1 gets 0.0299 and run 2 achieves 0.1250. The best MAP score for this query, with a value of 0.2309, was obtained by the team JNLP with their run called JNLP.CrossLBertJP. For assessing the final ranking performance of run 3, we can compare its MAP score to other teams. Also here we observe that the BERTScore with a MAP score of 0.5557 is the fourth-lowest performing run in the competition, whereas our run 1 achieves 0.7496 and run 2 has the highest overall MAP score among our runs of 0.7571. The best MAP score of 0.7947 was achieved by the team JNLP with their run JNLP.CrossLMultiLThreshold. This leads us to the conclusion that the standard BERTScore without IDF-reweighting or any further combined methods may not be sufficient to solve this task, at least with the query type distribution of this year's test dataset. We also observe how thresholding influences our F2 score in run 1, so that our method scores higher than a run by the JNLP team which has a better ranking performance.

From the results and discussion above, our main takeaways from this COLIEE edition for task 3 are:
(1) Contextual embeddings can significantly enhance retrieval performance when coupled with TF-IDF vectors.
(2) Adding external knowledge to the articles in the form of structural information, entailed queries or definitions can help to make them more unique.
(3) Data augmentation techniques are useful to train a BERT classifier for a retrieval task.
(4) An intelligent or rather more effective thresholding mechanism should be devised to further improve precision and maintain a decent F2 score.

5 CONCLUSION AND FUTURE WORK
In this work, we study variations of the language model BERT for task 3 of the COLIEE competition on statute law retrieval. We find a benefit in combining the BERT model with TF-IDF vectorization and in working on a sentence level with contextual embeddings. Furthermore, it is helpful to test different pre-trained models and fine-tuning, as well as adding external knowledge and data augmentation techniques. Our winning approach is an ensemble of Sentence-BERT and two different TF-IDF representations with different extents of document enrichment. In the second run, we fine-tune a BERT classifier for retrieval based on an augmented dataset. The third run is similarity scoring using the BERTScore


with thresholding. Future enhancements can consist of an improved thresholding mechanism and of encoding other types of external knowledge, for example named entities.

REFERENCES
[1] Jöran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. 2016. Research-
paper recommender systems: a literature survey. Int. J. Digit. Libr. 17, 4 (2016),
305–338. https://doi.org/10.1007/s00799-015-0156-0
[2] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St
John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al.
2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Con-
ference on Empirical Methods in Natural Language Processing, EMNLP 2018: Sys-
tem Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, Eduardo
Blanco and Wei Lu (Eds.). Association for Computational Linguistics, 169–174.
https://doi.org/10.18653/v1/d18-2029
[3] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras,
and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law
School. CoRR abs/2010.02559 (2020). arXiv:2010.02559 https://arxiv.org/abs/2010.
02559
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Technologies, NAACL-HLT
2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill
Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computa-
tional Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423
[5] Mi-Young Kim, Yao Lu, and Randy Goebel. 2017. Textual Entailment in Legal Bar
Exam Question Answering Using Deep Siamese Networks. In New Frontiers
in Artificial Intelligence - JSAI-isAI Workshops, JURISIN, SKL, AI-Biz, LENLS,
AAA, SCIDOCA, kNeXI, Tsukuba, Tokyo, Japan, November 13-15, 2017, Revised
Selected Papers (Lecture Notes in Computer Science, Vol. 10838), Sachiyo Arai,
Kazuhiro Kojima, Koji Mineshima, Daisuke Bekki, Ken Satoh, and Yuiko Ohta
(Eds.). Springer, 35–48. https://doi.org/10.1007/978-3-319-93794-6_3
[6] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From
Word Embeddings To Document Distances. In Proceedings of the 32nd International
Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 (JMLR
Workshop and Conference Proceedings, Vol. 37), Francis R. Bach and David M. Blei
(Eds.). JMLR.org, 957–966. http://proceedings.mlr.press/v37/kusnerb15.html
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A
Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019).
arXiv:1907.11692 http://arxiv.org/abs/1907.11692
[8] Ha-Thanh Nguyen, Hai-Yen Thi Vuong, Phuong Minh Nguyen, Tran Binh Dang,
Quan Minh Bui, Vu Trong Sinh, Chau Minh Nguyen, Vu D. Tran, Ken Satoh,
and Minh Le Nguyen. 2020. JNLP Team: Deep Learning for Legal Processing in
COLIEE 2020. CoRR abs/2011.08071 (2020). arXiv:2011.08071 https://arxiv.org/
abs/2011.08071
[9] Juliano Rabelo, Mi-Young Kim, and Randy Goebel. 2019. Combining Sim-
ilarity and Transformer Methods for Case Law Entailment. In Proceedings
of the Seventeenth International Conference on Artificial Intelligence and Law,
ICAIL 2019, Montreal, QC, Canada, June 17-21, 2019. ACM, 290–296. https:
//doi.org/10.1145/3322640.3326741
[10] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu
Kano, and Ken Satoh. 2020. COLIEE 2020: Methods for Legal Document Retrieval
and Entailment. https://sites.ualberta.ca/~rabelo/COLIEE2021/COLIEE2020_
summary.pdf. Accessed: 2021-05-09.
[11] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embed-
dings using Siamese BERT-Networks. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong
Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and
Xiaojun Wan (Eds.). Association for Computational Linguistics, 3980–3990.
https://doi.org/10.18653/v1/D19-1410
[12] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune
BERT for Text Classification?. In Chinese Computational Linguistics - 18th China
National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings
(Lecture Notes in Computer Science, Vol. 11856), Maosong Sun, Xuanjing Huang,
Heng Ji, Zhiyuan Liu, and Yang Liu (Eds.). Springer, 194–206. https://doi.org/10.
1007/978-3-030-32381-3_16
[13] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi.
2020. BERTScore: Evaluating Text Generation with BERT. In 8th International
Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020. OpenReview.net. https://openreview.net/forum?id=SkeHuCVFDr
[14] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards
Story-Like Visual Explanations by Watching Movies and Reading Books. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 19–27. https://doi.org/10.1109/ICCV.2015.11

To Tune or Not To Tune?
Zero-shot Models for Legal Case Entailment
Guilherme Moraes Rosa (NeuralMind, Brazil; University of Campinas (Unicamp), Brazil)
Ruan Chaves Rodrigues (NeuralMind, Brazil; Federal University of Goiás (UFG), Brazil)
Roberto de Alencar Lotufo (NeuralMind, Brazil; University of Campinas (Unicamp), Brazil)
Rodrigo Nogueira (NeuralMind, Brazil; University of Campinas (Unicamp), Brazil; University of Waterloo, Canada)

ABSTRACT
There has been mounting evidence that pretrained language models fine-tuned on large and diverse supervised datasets can transfer well to a variety of out-of-domain tasks. In this work, we investigate this transfer ability to the legal domain. For that, we participated in the legal case entailment task of COLIEE 2021, in which we use such models with no adaptations to the target domain. Our submissions achieved the highest scores, surpassing the second-best submission by more than six percentage points. Our experiments confirm a counter-intuitive result in the new paradigm of pretrained language models: given limited labeled data, models with little or no adaption to the target task can be more robust to changes in the data distribution than models fine-tuned on it. Code is available at https://github.com/neuralmind-ai/coliee.

KEYWORDS
Legal NLP, Legal Case Entailment, Deberta, T5, Zero-shot

ACM Reference Format:
Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto de Alencar Lotufo, and Rodrigo Nogueira. 2021. To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3462757.3466103

1 INTRODUCTION
An ongoing trend in natural language processing and information retrieval is to use the same model with small adaptions to solve a variety of tasks. Pretrained transformers, epitomized by BERT [7], are the state-of-the-art in question answering [15], natural language inference [12, 33], summarization [3, 17], and ranking tasks [11, 21]. Although these tasks are diverse, current top-performing models in each of them have a similar architecture to the original Transformer [40] and are pretrained on variations of the masked language modeling objective used by Devlin et al. [7].

Zero-shot and few-shot models are becoming more competitive with models fine-tuned on large datasets. For instance, few-shot results of GPT-3 [4] sparked an interest in prompt engineering methods, which are now an active area of research [20, 36, 38]. The goal of these methods is to find input templates such that the model is more likely to give the correct answer.

In information retrieval, pretrained models fine-tuned only on a large dataset have also shown strong zero-shot capabilities [39]. For example, the same multi-stage pipeline based on T5 [33] was the best or second best-performing system in 4 tracks of the TREC 2021 [27], including specialized tasks such as Precision Medicine [34], and TREC-COVID [48]. A remarkable feature of this pipeline is that, for most tasks, the models are fine-tuned only on a general-domain ranking dataset, i.e., they do not use in-domain data.

However, to date, there has not been strong evidence that zero-shot models transfer well to the legal domain. Most state-of-the-art models need adaptations to the target task. For example, the top-performing system on the legal case entailment task of COLIEE 2020 [31] uses an interpolation of BM25 [35] scores and scores from a BERT model fine-tuned on the target task [24].

In this work, we show that, for the legal case entailment task of COLIEE, pretrained language models without any fine-tuning on the target task perform at least equivalently or even better than models fine-tuned on the task itself. Our approach is characterized as zero-shot since the model was only fine-tuned on annotated data from another domain. Our result confirms in the legal domain a counter-intuitive recent finding in other domains: given limited labeled data, zero-shot models tend to perform better on held-out datasets than models fine-tuned on the target task [27, 32].

2 RELATED WORK
It is a common assumption among NLP researchers that models developed using nonlegal texts would lead to unsatisfactory performance when directly applied to legal tasks [9, 50]. To overcome this issue, general-purpose techniques are adapted to the legal domain. For example, Chalkidis and Kampas [5] pre-trained legal word embeddings using word2vec [22, 23] over a large corpus comprised of legislation from multiple countries. Zhong et al. [51] created a question answering dataset in the legal domain, collected from the National Judicial Examination of China and evaluated different models on it, including Transformers. Elnaggar et al. [8] applied multi-task learning to minimize the problems related to data scarcity


in the legal domain. The models were trained in translation, summarization, and multi-label classification tasks, and achieved better results than single-task models.

Pretrained transformer models have only begun to be adopted in legal NLP applications more broadly [2, 10, 16, 37, 44]. In some tasks, they marginally outperform classical methods, especially when training data is scarce. For example, Zhong et al. [50] showed that a BERT-based model performs better than a tf-idf similarity model on a judgment prediction task [43], but is slightly less effective than an attention-based convolutional neural network [47].

In some cases, they outperform classical methods, but at the expense of using hand-crafted features or by being fine-tuned on the target task. For example, the best submission to task 2 of COLIEE 2019 was a BERT model fed with hand-crafted inputs and fine-tuned on in-domain data [29].

Peters et al. [26] demonstrate that fine-tuning on the target task may not perform better than simple feature extraction from a pretrained model if the pretraining task and the target task belong to highly different domains. These findings lead us to consider zero-shot approaches while investigating how general domain Transformer models can be applied to legal tasks.

Although zero-shot approaches are relatively novel in the legal domain, our work is not the first to apply zero-shot Transformer models to domain-specific entailment tasks where limited labeled data is available. Yin et al. [45] have transformed multi-label classification tasks into textual entailment tasks, and then evaluated the performance of a BERT model fine-tuned on mainstream entailment datasets. Yin et al. [46] also performed similar experiments while transforming question answering and coreference resolution tasks into entailment tasks. We are not the first to use zero-shot techniques on the legal case entailment task. For instance, Rabelo et al. [28] used a BERT fine-tuned for paraphrase detection combined with two transformer-based models fine-tuned on a generic text entailment dataset and features generated by a BERT model fine-tuned on the COLIEE training dataset. However, we are the first to show that zero-shot models can outperform fine-tuned ones on this task.

2.1 The Legal Case Entailment Task
The Competition on Legal Information Extraction/Entailment (COLIEE) [13, 14, 30, 31] is an annual competition whose aim is to evaluate automatic systems on case and statute law tasks. Among the five tasks of the 2021 competition, we submitted systems to task 2, called legal case entailment, which consists of identifying paragraphs from existing cases that entail a given fragment of a base case.

Training data consists of a set of decision fragments, their respective candidate paragraphs that could be relevant or not to the fragment, and a set of labels containing the numbers of the paragraphs by which the decision fragment is entailed. Test data includes only decision fragments and candidate paragraphs, but no labels. As shown in Figure 1, the input to the model is a decision fragment $Q$ of an unseen case and the output should be a set of paragraphs $P = [P_1, P_2, ..., P_n]$ that are relevant to the given decision $Q$. In Table 1, we show the statistics of the 2020 and 2021 datasets.

We separate 80% of the 2020 training set for training and the remaining for validation, which yields 260 and 65 positive examples for the training and validation sets, respectively. Negative examples are all candidates not labeled as positive.

Table 1: Statistics of COLIEE's Task 2.

                                        2020             2021
                                        Train    Test    Train    Test
  Examples (base cases)                 325      100     425      100
  Avg. # of candidates / example        35.52    36.72   35.80    35.24
  Avg. positive candidates / example    1.15     1.25    1.17     1.17
  Avg. of tokens in base cases          37.72    37.03   37.51    32.97
  Avg. of tokens in candidates          100.16   112.65  103.14   100.83

The micro F1-score is the official metric in this task:

    $F_1 = \frac{2 \times P \times R}{P + R}$   (1)

where $P$ is the number of correctly retrieved paragraphs for all queries divided by the number of retrieved paragraphs for all queries, and $R$ is the number of correctly retrieved paragraphs for all queries divided by the number of relevant paragraphs for all queries.

3 METHOD
We experiment with the following models: BM25, monoT5-zero-shot, monoT5, and DeBERTa. We also evaluate an ensemble of our monoT5 and DeBERTa models.

3.1 BM25
BM25 is a bag-of-words retrieval function that scores a document based on the query terms appearing in it. We use the BM25 implemented in Pyserini [18], a Python toolkit that supports replicable information retrieval research. We use its default parameters.

We first index all paragraphs in the datasets of tasks 1 and 2. Having more paragraphs from task 1 improves the term statistics (e.g., document frequencies) used by BM25. The task 1 dataset is composed of long documents, while task 2 is composed of paragraphs. This difference in length may degrade BM25 scores for task 2 paragraphs because the average document length will be higher due to task 1 documents. We address this problem by segmenting each document into several paragraphs using a context window of 10 sentences with overlapping strides of 5 sentences.

The entailed fragment might be comprised of multiple sentences. Here we treat each of its sentences as a query and compute a BM25 score for each sentence and candidate paragraph pair independently. The final score for each paragraph is the maximum among its sentence and paragraph pair scores. We then use the method described in Section 3.5 to select the paragraphs that will comprise our final answer.
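This BM25 setup can be sketched as follows, assuming a Lucene index over the task 1 and task 2 paragraphs has already been built with Pyserini's standard JSON collection indexer; the index path and the naive sentence splitter are our own choices, not part of the authors' code:

```python
from pyserini.search import SimpleSearcher

# Assumes a Lucene index built beforehand from the task 1 and task 2 paragraphs
# (e.g. with `python -m pyserini.index -collection JsonCollection ...`).
searcher = SimpleSearcher("indexes/coliee-paragraphs")  # hypothetical index path
# BM25 with Pyserini's default parameters, as in the paper.

def bm25_scores(fragment, candidate_ids, k=1000):
    """Score every candidate paragraph by the maximum BM25 score obtained over
    the individual sentences of the entailed fragment."""
    sentences = [s.strip() for s in fragment.split(".") if s.strip()]  # naive splitter
    best = {pid: 0.0 for pid in candidate_ids}
    for sentence in sentences:
        for hit in searcher.search(sentence, k=k):
            if hit.docid in best:
                best[hit.docid] = max(best[hit.docid], hit.score)
    return best
```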


Figure 1: COLIEE’s Task 2 example.

3.2 monoT5-zero-shot
At a high level, monoT5-zero-shot is a sequence-to-sequence adaptation of the T5 model [33] proposed by Nogueira et al. [25] and further detailed in Lin et al. [19]. This ranking model is close to or at the state-of-the-art in retrieval tasks such as Robust04 [42], TREC-COVID, and the TREC 2020 Precision Medicine and Deep Learning tracks. Details of the model are described in Nogueira et al. [25]; here, we only provide a short overview.

In the T5 model, all target tasks are cast as sequence-to-sequence tasks. For our task, we use the following input sequence template:

    Query: $q$ Document: $d$ Relevant:   (2)

where $q$ and $d$ are the query and candidate texts, respectively. In this work, $q$ is a fragment, and $d$ is one of the candidate paragraphs. The model estimates a score $s$ quantifying how relevant a candidate text $d$ is to a query $q$. That is:

    $s = P(\text{Relevant} = 1 \mid d, q)$.   (3)

The model is fine-tuned to produce the tokens "true" or "false" depending on whether the candidate is relevant or not to the query. That is, "true" and "false" are the "target tokens" (i.e., ground truth predictions in the sequence-to-sequence transformation). The suffix "Relevant:" in the input string serves as a hint to the model for the tokens it should produce.

We use a T5-large model fine-tuned on MS MARCO [1], a dataset of approximately 530k query and relevant passage pairs. We use a checkpoint available at Huggingface's model hub that was trained with a learning rate of $10^{-3}$ using batches of 128 examples for 10k steps, or approximately one epoch of the MS MARCO dataset.¹ In each batch, a roughly equal number of positive and negative examples is sampled. We refer to this model as monoT5-zero-shot.

Although fine-tuning for more epochs leads to better performance on the MS MARCO development set, Nogueira et al. [25] showed that further training degrades a model's zero-shot performance on other datasets. We observed similar behavior in our task and opted to use the model trained for one epoch on MS MARCO.

At inference time, to compute probabilities for each query-candidate pair, a softmax is applied only on the logits of the tokens "true" and "false". The final score of each candidate is the probability assigned to the token "true".

1 https://huggingface.co/castorini/monot5-large-msmarco-10k

3.3 monoT5
We further fine-tune monoT5-zero-shot on the 2020 task 2 training set, following a training procedure similar to the one described in the previous section. Fragments are mostly comprised of only one sentence, while candidate paragraphs are longer, sometimes exceeding 512 tokens in length. Thus, to avoid excessive memory usage due to the quadratic memory cost of Transformers with respect to the sequence length, we truncate inputs to 512 tokens during both training and inference. The model is fine-tuned with a learning rate of $10^{-3}$ for 80 steps using batches of size 128, which corresponds to 20 epochs. Each batch has the same amount of positive and negative examples. We refer to this model as monoT5.

3.4 DeBERTa
Decoding-enhanced BERT with disentangled attention (DeBERTa) improves on the original BERT and RoBERTa architectures by introducing two techniques: the disentangled attention mechanism and an enhanced mask decoder [12]. Both improvements seek to introduce positional information to the pretraining procedure, both in terms of the absolute position of a token and the relative position between them.

The COLIEE 2021 Task 2 dataset has very few positive examples of entailment. Therefore, for fine-tuning DeBERTa on this dataset, we found it appropriate to artificially expand the positive examples. As fragments take up only a small portion of a base case paragraph, we expand positive examples by generating artificial fragments from the same base case paragraph in which the original fragment has occurred. This is done by moving a sliding window, with a stride that is half the size of the original fragment, over the base case paragraph. Each step of this sliding window is taken to be an artificial fragment, and such artificial fragments are assigned the same labels as the original fragment.

Although the resulting dataset after these operations is several times larger than the original Task 2 dataset, we achieved better results by fine-tuning DeBERTa on a small sample taken from this artificial dataset. After experimenting with distinct sample sizes, we settled for a sample of twenty thousand fragment and candidate paragraph pairs, equally balanced between positive and negative entailment pairs.

In order to find the best hyperparameters for fine-tuning a DeBERTa Large model, we perform a grid search over the hyperparameters suggested by He et al. [12], always early stopping at the second epoch. The best combination of hyperparameters is used to fine-tune the model for ten epochs. The checkpoint with the best performance on the 2020 test set is selected to generate our predictions for the 2021 test set.
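The positive-example expansion described above can be sketched as follows; whitespace tokenization and the function signature are our assumptions:

```python
def expand_positive_fragments(fragment, base_paragraph):
    """Generate artificial fragments by sliding a window of the fragment's
    length over the base case paragraph, with a stride of half that length.
    Every window position becomes a new positive example with the same labels."""
    words = base_paragraph.split()
    frag_len = max(1, len(fragment.split()))
    stride = max(1, frag_len // 2)
    windows = []
    for start in range(0, max(1, len(words) - frag_len + 1), stride):
        windows.append(" ".join(words[start:start + frag_len]))
    return windows

# Each artificial fragment is then paired with the original candidate paragraphs;
# a balanced sample of 20k positive/negative pairs is drawn for fine-tuning.
```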

297
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Rosa et al.

Table 2: Test set results on Task 2 of COLIEE 2020 and 2021. Our best single model F1 for each year is in bold.

                                                     2020                      2021
       Description               Submission name     F1     Prec   Recall      F1     Prec   Recall    α, β, γ
  (1a) Median of submissions     -                   0.5718 -      -           0.5860 -      -         -
  (1b) Best of 2020 [24]         JNLP.task2.BMWT     0.6753 0.7358 0.6240      -      -      -         -
  (1c) 2nd best of 2021          UA_reg_pp           -      -      -           0.6274 -      -         -
  (2)  BM25                      -                   0.6046 0.7222 0.52        0.6009 0.6666 0.5470    0.07, 2, 0.99
  (3)  DeBERTa                   DeBERTa             0.7094 0.7614 0.6640      0.6339 0.6635 0.6068    0, 2, 0.999
  (4)  monoT5                    monoT5              0.6887 0.7155 0.660       0.6610 0.6554 0.6666    0, 3, 0.995
  (5)  monoT5-zero-shot          -                   0.6577 0.7400 0.5920      0.6872 0.7090 0.6666    0, 3, 0.995
  (6)  Ensemble of (3) and (4)   DebertaT5           0.7217 0.7904 0.6640      0.6912 0.7500 0.6410    0.6, 2, 0.999
  (7)  Ensemble of (3) and (5)   -                   0.7038 0.7592 0.6560      0.6814 0.7064 0.6581    0.6, 2, 0.999

3.5 Answer Selection
The models described above estimate a score for each (fragment, candidate paragraph) pair. To select the final set of paragraphs for a given fragment, we apply three rules:

• Select paragraphs whose scores are above a threshold α;
• Select the top β paragraphs with respect to their scores;
• Select paragraphs whose scores are at least γ of the top score.

We use exhaustive grid search to find the best values for α, β, γ on the development set of the 2020 task 2 dataset. We swept α = [0, 0.1, ..., 0.9], β = [1, 2, ..., 10], and γ = [0, 0.1, ..., 0.9, 0.95, 0.99, 0.995, ..., 0.9999]. The best values for each model can be found in Table 3. Note that our hyperparameter search includes the possibility of not using the first or third strategies if α = 0 or γ = 0 are chosen, respectively.

3.6 DeBERTa + monoT5 Ensemble (DebertaT5)
Ensemble methods seek to combine the strengths and compensate for the weaknesses of the models in order that the final model has better generalization performance.

We use the following method to combine the predictions of monoT5 and DeBERTa (both fine-tuned on COLIEE 2020): We concatenate the final set of paragraphs selected by each model. We remove duplicates, preserving the highest score. Then, we apply again the grid search method explained in the previous section to select the final set of paragraphs. It is important to note that our method does not combine scores between models. It ensures that only individual answers with a certain degree of confidence are maintained in the final answer, which generally leads to an increase in Precision. The final answer for each test example can be composed of individual answers from one model or both models.

4 RESULTS
We present our main result in Table 2. Our baseline BM25 method scores above the median of submissions in both COLIEE 2020 and 2021 (row 2 vs. 1a). This confirms that BM25 is a strong baseline and it is in agreement with results from other competitions such as the Health Misinformation and Precision Medicine track of TREC 2020 [27].

Our pretrained transformer models (rows 3, 4 and 5) score above BM25, the best submission of 2020 [24], and the second-best submission of 2021. Likewise, our ensemble method effectively combines DeBERTa and monoT5 predictions, achieving the best score among all submissions (row 6). However, the performance of monoT5-zero-shot decreases when combined with DeBERTa (row 5 vs. 7), showing that monoT5-zero-shot is a strong model.

The most interesting comparison is between monoT5 and monoT5-zero-shot (rows 4 and 5). In the 2020 test data, monoT5 showed better results than monoT5-zero-shot. Hence, we decided to submit only the fine-tuned model to the 2021 competition. After the release of ground-truth annotations of the 2021 test set, our evaluation of monoT5-zero-shot showed that it performs better than monoT5. A similar "inversion" pattern was found for DeBERTa vs. monoT5 (rows 3 and 4). DeBERTa was better than monoT5 on the 2020 test set, but the opposite happened on the 2021 test set.

One explanation for these results is that we overfit on the test data of 2020, i.e., by (unintentionally) selecting techniques and hyperparameters that gave the best result on the 2020 test set as experiments progressed. However, this is unlikely to be the case for our fine-tuned monoT5 model, as our hyperparameter selection is fully automatic and maximized on the development set, whose data is from COLIEE competitions before 2020.

Another explanation is that there is a significant difference between the annotation methodologies of 2020 and 2021. Consequently, models specialized in the 2020 data could suffer from this change. However, this is also unlikely since BM25 performed similarly in both years. Furthermore, we cannot confirm this hypothesis since it is difficult to quantify differences in the annotation process.

Regardless of the reason for the inversion, our main finding is that our zero-shot model performed at least comparably to fine-tuned models on the 2020 test set and achieved the best result of a single model on 2021 test data.

4.1 Ablation of the Answer Selection Method
In Table 3, we show the ablation result of the answer selection method proposed in Section 3.5. Our baseline answer selection method, which we refer to as "no rule" in the table, uses only the paragraph with the highest score as the final answer set, i.e., α = γ = 0 and β = 1.
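For reference, the three selection rules of Section 3.5 and the accompanying grid search can be written compactly as in the sketch below, which treats the rules as a conjunction and assumes an external micro-F1 evaluation function; it is an illustration, not the released implementation:

```python
from itertools import product

def select_paragraphs(scored, alpha, beta, gamma):
    """scored: list of (paragraph_id, score) for one fragment; rules are ANDed."""
    ranked = sorted(scored, key=lambda x: -x[1])
    top = ranked[0][1] if ranked else 0.0
    return [pid for i, (pid, s) in enumerate(ranked)
            if s >= alpha and i < beta and s >= gamma * top]

def grid_search(dev_examples, micro_f1):
    """dev_examples: {query_id: [(paragraph_id, score), ...]};
    micro_f1: callable evaluating predictions against the gold labels."""
    alphas = [round(0.1 * i, 1) for i in range(10)]   # 0, 0.1, ..., 0.9
    betas = list(range(1, 11))                        # 1, 2, ..., 10
    gammas = alphas + [0.95, 0.99, 0.995, 0.999, 0.9999]
    best_params, best_score = None, -1.0
    for alpha, beta, gamma in product(alphas, betas, gammas):
        preds = {qid: select_paragraphs(scored, alpha, beta, gamma)
                 for qid, scored in dev_examples.items()}
        score = micro_f1(preds)
        if score > best_score:
            best_params, best_score = (alpha, beta, gamma), score
    return best_params, best_score
```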


Model F1 Prec Recall 𝛼, 𝛽, 𝛾 Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural In-
monoT5-zero-shot (no rule) 0.6517 0.7373 0.584 0, 1, 0 formation Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan,
monoT5-zero-shot 0.6577 0.74 0.592 0, 3, 0.995 and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.
neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
monoT5 (no rule) 0.6755 0.7600 0.608 0, 1, 0 [5] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation
monoT5 0.6887 0.7155 0.6640 0, 3, 0.995 and legal word embeddings trained on large corpora. Artificial Intelligence and
Law volume 27, pages171–198(2019) (2019).
DeBERTa (no rule) 0.6933 0.7800 0.6240 0, 1, 0
[6] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan
DeBERTa 0.7094 0.7614 0.6640 0, 2, 0.999 Yu. 2020. Recall and Learn: Fine-tuning Deep Pretrained Language Models with
DebertaT5-zero-shot (no rule) 0.6875 0.7777 0.6160 0, 1, 0 Less Forgetting. arXiv:2004.12651 [cs.CL]
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
DebertaT5-zero-shot 0.7038 0.7592 0.6560 0.6, 2, 0.999 Pre-training of Deep Bidirectional Transformers for Language Understanding. In
DebertaT5 (no rule) 0.7022 0.7900 0.6320 0, 1, 0 Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and
DebertaT5 0.7217 0.7904 0.6640 0.6, 2, 0.999
Short Papers). 4171–4186.
[8] Ahmed Elnaggar, Christoph Gebendorfer, Ingo Glaser, and Florian Matthes. 2018.
Table 3: Ablation on the 2020 data of the answer selection Multi-Task Deep Learning for Legal Document Translation, Summarization and
method presented in Section 3.5. Multi-Label Classification. AICCC ’18: Proceedings of the 2018 Artificial Intelligence
and Cloud Computing Conference December 2018 Pages 9–15 (2018).
[9] Ahmed Elnaggar, Bernhard Waltl, Ingo Glaser, Jörg Landthaler, Elena Scepankova,
and Florian Matthes. 2018. Stop Illegal Comments: A Multi-Task Deep Learning
Approach. AICCC ’18: Proceedings of the 2018 Artificial Intelligence and Cloud
5 CONCLUSION
We confirm a counter-intuitive result on a legal case entailment task: that models with little or no adaptation to the target task can have better generalization abilities than models that have been carefully fine-tuned to the task at hand. Domain adversarial fine-tuning [41] and changes to the Adam optimizer [6, 49] have been proposed as valid approaches for fine-tuning Transformer models on small domain-specific datasets. However, whether these techniques could successfully be applied to the legal case entailment task to make models fine-tuned on target task data perform better than zero-shot approaches remains an open question.
Therefore, although domain-specific language model pretraining and adjustments to the fine-tuning process are promising directions for future research, we believe that zero-shot approaches should not be ignored as strong baselines for such experiments.
It should also be noted that our research has implications for future experiments beyond the scope of legal case entailment tasks. Based on previous work by Yin et al. [45, 46], it is possible that other legal tasks with limited labeled data, such as legal question answering, may benefit from our zero-shot approach.
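As a concrete illustration of the kind of zero-shot baseline advocated here, the sketch below scores a candidate paragraph against a base-case fragment with an off-the-shelf monoT5 reranker in the style of [25], without any fine-tuning on COLIEE data. The checkpoint name, prompt template, and truncation length are assumptions for illustration and are not necessarily those used in the experiments above.

    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Assumption: a publicly released monoT5 checkpoint trained on MS MARCO [1].
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco").eval()

    def entailment_score(fragment: str, candidate: str) -> float:
        # Probability that the model generates "true" for the monoT5-style prompt,
        # used directly as a zero-shot relevance/entailment score.
        prompt = f"Query: {fragment} Document: {candidate} Relevant:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
        with torch.no_grad():
            logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
        true_id = tokenizer.encode("true", add_special_tokens=False)[0]
        false_id = tokenizer.encode("false", add_special_tokens=False)[0]
        probs = torch.softmax(logits[[true_id, false_id]], dim=0)
        return probs[0].item()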
REFERENCES
[1] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018).
[2] Purbid Bambroo and Aditi Awasthi. 2021. LegalDB: Long DistilBERT for Legal Document Classification. In 2021 International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT). IEEE, 1–4.
[3] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training. ArXiv. https://www.microsoft.com/en-us/research/publication/unilmv2-pseudo-masked-language-models-for-unified-language-model-pre-training/
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[5] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27 (2019), 171–198.
[6] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. 2020. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. arXiv:2004.12651 [cs.CL]
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[8] Ahmed Elnaggar, Christoph Gebendorfer, Ingo Glaser, and Florian Matthes. 2018. Multi-Task Deep Learning for Legal Document Translation, Summarization and Multi-Label Classification. In AICCC '18: Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference. 9–15.
[9] Ahmed Elnaggar, Bernhard Waltl, Ingo Glaser, Jörg Landthaler, Elena Scepankova, and Florian Matthes. 2018. Stop Illegal Comments: A Multi-Task Deep Learning Approach. In AICCC '18: Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference. 41–47.
[10] Emad Elwany, Dave Moore, and Gaurav Oberoi. 2019. BERT Goes to Law School: Quantifying the Competitive Advantage of Access to Large Legal Corpora in Contract Understanding. In Workshop on Document Intelligence at NeurIPS 2019.
[11] Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline. arXiv preprint arXiv:2101.08751 (2021).
[12] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654 [cs.CL]
[13] Yoshinobu Kano, M. Kim, R. Goebel, and K. Satoh. 2017. Overview of COLIEE 2017. In COLIEE 2017 (EPiC Series in Computing, vol. 47). 1–8.
[14] Yoshinobu Kano, Mi-Young Kim, Masaharu Yoshioka, Yao Lu, Juliano Rabelo, Naoki Kiyota, Randy Goebel, and Ken Satoh. 2018. COLIEE-2018: Evaluation of the competition on legal information extraction and entailment. In JSAI International Symposium on Artificial Intelligence. 177–192.
[15] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1896–1907.
[16] Spyretta Leivaditi, Julien Rossi, and Evangelos Kanoulas. 2020. A Benchmark for Lease Contract Review. arXiv preprint arXiv:2010.10386 (2020).
[17] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7871–7880.
[18] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations. arXiv preprint arXiv:2102.10073 (2021).
[19] Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained transformers for text ranking: Bert and beyond. arXiv preprint arXiv:2010.06467 (2020).
[20] Jiang Lu, Pinghua Gong, Jieping Ye, and Changshui Zhang. 2020. Learning from Very Few Samples: A Survey. arXiv preprint arXiv:2009.02653 (2020).
[21] Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xiang Ji, and Xueqi Cheng. 2021. PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 283–291.
[22] Tomas Mikolov. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013).
[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[24] Ha-Thanh Nguyen, Hai-Yen Thi Vuong, Phuong Minh Nguyen, Binh Tran Dang, Quan Minh Bui, Sinh Trong Vu, Chau Minh Nguyen, Vu Tran, Ken Satoh, and Minh Le Nguyen. 2020. JNLP Team: Deep Learning for Legal Processing in COLIEE 2020. arXiv preprint arXiv:2011.08071 (2020).
[25] Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 708–718.
[26] Matthew E Peters, Sebastian Ruder, and Noah A Smith. 2019. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). 7–14.


[27] Ronak Pradeep, Xueguang Ma, Xinyu Zhang, Hang Cui, Ruizhou Xu, Rodrigo Nogueira, and Jimmy Lin. 2020. H2oloo at TREC 2020: When all you got is a hammer... Deep Learning, Health Misinformation, and Precision Medicine. In Text REtrieval Conference (TREC 2020).
[28] J. Rabelo, M.Y. Kim, and R. Goebel. 2020. Application of text entailment techniques
in COLIEE 2020. International Workshop on Juris-informatics (JURISIN) associated
with JSAI International Symposia on AI (JSAI-isAI) (2020).
[29] Juliano Rabelo, Mi-Young Kim, and Randy Goebel. 2019. Combining similarity and
transformer methods for case law entailment. In Proceedings of the Seventeenth
International Conference on Artificial Intelligence and Law (ICAIL ’19). 290–296.
[30] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu
Kano, and Ken Satoh. 2019. A Summary of the COLIEE 2019 Competition. In
JSAI International Symposium on Artificial Intelligence. 34–49.
[31] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu
Kano, and Ken Satoh. 2020. COLIEE 2020: Methods for Legal Document Retrieval
and Entailment. (2020).
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
et al. 2021. Learning transferable visual models from natural language supervision.
arXiv preprint arXiv:2103.00020 (2021).
[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits
of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine
Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
[34] Kirk Roberts, Dina Demner-Fushman, E. Voorhees, W. Hersh, Steven Bedrick, Alexander J. Lazar, and S. Pant. 2019. Overview of the TREC 2019 Precision Medicine Track. Text REtrieval Conference (TREC) 26 (2019).
[35] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu,
Mike Gatford, et al. 1995. Okapi at TREC-3. Nist Special Publication Sp 109 (1995),
109.
[36] Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few-shot
text classification and natural language inference. arXiv preprint arXiv:2001.07676
(2020).
[37] Shohreh Shaghaghian, Luna Yue Feng, Borna Jafarpour, and Nicolai Pogreb-
nyakov. 2020. Customizing Contextualized Language Models for Legal Docu-
ment Reviews. In 2020 IEEE International Conference on Big Data (Big Data). IEEE,
2139–2148.
[38] Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin
Raffel. 2021. Improving and Simplifying Pattern Exploiting Training. arXiv
preprint arXiv:2103.11955 (2021).
[39] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna
Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of
Information Retrieval Models. arXiv preprint arXiv:2104.08663 (4 2021). https:
//arxiv.org/abs/2104.08663
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All
you Need. In NIPS.
[41] Giorgos Vernikos, Katerina Margatina, Alexandra Chronopoulou, and Ion An-
droutsopoulos. 2020. Domain Adversarial Fine-Tuning as an Effective Regularizer.
arXiv:2009.13366 [cs.LG]
[42] Ellen M. Voorhees. 2004. Overview of the TREC 2004 Robust Track. Proceedings
of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland,
November 16-19, 2004 (2004).
[43] Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong
Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018.
CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. arXiv:1807.02478
(2018).
[44] Chin Man Yeung. 2019. Effects of inserting domain vocabulary and fine-tuning
BERT for German legal language. Master’s thesis. University of Twente.
[45] Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking Zero-
shot Text Classification: Datasets, Evaluation and Entailment Approach.
arXiv:1909.00161 [cs.CL]
[46] Wenpeng Yin, Nazneen Fatema Rajani, Dragomir Radev, Richard Socher, and
Caiming Xiong. 2020. Universal Natural Language Processing with Limited An-
notations: Try Few-shot Textual Entailment as a Start. arXiv:2010.02584 [cs.CL]
[47] Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN:
Attention-based convolutional neural network for modeling sentence pairs. Trans-
actions of the Association for Computational Linguistics 4 (2016), 259–272.
[48] Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, and Jimmy
Lin. 2020. Rapidly Deploying a Neural Search Engine for the COVID-19 Open
Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL
2020.
[49] Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi.
2021. Revisiting Few-sample BERT Fine-tuning. arXiv:2006.05987 [cs.CL]
[50] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. arXiv:2004.12158 (2020).
[51] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. JEC-QA: A Legal-Domain Question Answering Dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 9701–9708 (2020).

Index of Authors

Adler, Rachel F., 119 Distefano, Biagio, 230


Aguiar, Derek, 99 dos Santos Neto, Jose Francisco, 240
Alexander, Charlotte S., 129 Dureja, Shipra, 285
Almasian, Satya, 2 Dzehtsiarou, Kanstantsin, 170
Amariles, David Restrepo, 129 Déziel, Pierre-Luc, 200, 256
Anderson, Brandon R., 159
Aoki, Yasuhiro, 278 Einarsson, Alexander, 119
Araszkiewicz, Michał, 129 El Hamdani, Rajaa, 40
Ash, Elliott, 109 El Hamdani, Rajaa, 129
Ashley, Alexandra, 129 Ellul, Joshua, 190
Ashley, Kevin D., 129, 250 Esteban de la Rosa, Fernando, 195
Atkinson, Katie, 12, 170
Aumiller, Dennis, 2 Falduti, Mattia, 129
Fantinato, Marcelo, 240
Barrie, Cameron, 119 Fungwacharakorn, Wachara, 50
Bench-Capon, Trevor, 12, 170
Benyekhlef, Karim, 129, 268 Garneau, Nicolas, 200, 256
Bertolo, Emerson, 264 Gaumond, Eve, 200, 256
Bex, Floris, 175 Gertz, Michael, 2
Bhattacharya, Paheli, 22 Ghosh, Kripabandhu, 22
Borges, Georg, 32 Ghosh, Saptarshi, 22
Branting, Karl, 129 Gipp, Bela, 109
Brockdorff, Juanita, 190 Glaser, Ingo, 205
Buckley, Joshua, 139 Governatori, Guido, 69, 139
Grabmair, Matthias, 79, 129
Calegari, Roberta, 180 Grant, Jayla C., 129
Castillo, Carlos, 210 Grossi, Davide, 149
Ceross, Aaron, 185 Guha, Neel, 159
Ciochetti, Itamar, 264 Górski, Łukasz, 60
Collenette, Joe, 170
Conrad, Jack G., 245 Hammond, Kristian, 119
Harašta, Jakub, 129
De Luca, Ernesto W., 285 Henderson, Peter, 159
de Souza, Edelcio G., 89 Hirota, Renata, 240
Ho, Daniel E., 79, 159 Pandya, Sachin, 99
Huang, Sieh-Chuen, 258 Peng, Linyu, 270
Huang, Zihan, 79 Peres, Sarajane Marques, 240
Huggins, Anna, 139 Piech, Mateusz, 225
Poddar, Soham, 22
Johnson, Shiwanni, 129 Polo, Felipe Maia, 264
Prakken, Henry, 175
Kaleta, Zbigniew, 225
Karimi-Haghighi, Marzieh, 210 Ramakrishna, Shashishekar, 60
Kawasaki, Tatsuki, 273 Rehm, Georg, 109
Kimura, Rampei, 270 Renooij, Silja, 235
Krasnashchok, Katsiaryna, 40 Restrepo Amariles, David, 40
Krass, Mark S., 79 Riveret, Régis, 180
Kutty, Libin, 285 Robaldo, Livio, 215
Rodrigues, Ruan Chaves, 295
Lackner, Sebastian, 2
Rosa, Guilherme Moraes, 295
Lamontagne, Luc, 200, 256
Ross, Graham, 275
Langlais, Philippe, 268
Rotolo, Antonino, 220, 266
Leflar, Robert B, 258
Ruas, Terry, 109
Li Zhao, Andong L., 119
Rudra, Koustav, 22
Lotufo, Roberto de Alencar, 295
Low, Charles, 79 Salaün, Olivier, 268
Sammut, Trevor, 190
Maranhão, Juliano, 89
Sapienza, Salvatore, 230
Martínez, Diego C., 266
Sartor, Giovanni, 89, 180
Matthes, Florian, 205
Satoh, Ken, 50, 273
McCarthy, Stephen, 190
Savelka, Jaromir, 129, 250
McConnell, Devin J., 99
Scerri, Matthew, 190
Meeùs, Sébastien, 40, 129
Schamberger, Tom, 205
Mok, Jonathan R., 260
Schwartz, David, 119
Mok, Rachel V., 260
Servantez, Sergio, 119
Mok, Wai Yin, 260
Shahania, Saijal, 285
Moreno-Schneider, Julian, 109
Shao, Hsuan-Lei, 258
Morris, Jason, 262
Smith, Clara, 220
Mustapha, Majd, 40
Smywiński-Pohl, Aleksander, 225
Nogueira, Rodrigo, 295 Sovrano, Francesco, 230
Novotná, Tereza, 129 Steging, Cornelis Cor, 235
Sterbentz, Marko, 119
Olivieri, Francesco, 69 Sudhi, Viju, 285
Ostendorff, Malte, 109 Suzuki, Youta, 278

Pace, Gordon, 190 Takahashi, Kazuko, 273


Pack, Harper, 119 Tamargo, Luciano H., 266
Pah, Adam, 119 Teng, Mengqiu, 79
Paley, Andrew, 119 Thiessen, Ernest, 275
Palmirani, Monica, 230 Tippett, Elizabeth, 129

Trecenti, Julio, 240 Wróbel, Krzysztof, 225
Troussel, Aurore, 40, 129
Tsushima, Kanae, 50 Xu, Huihui, 250

Unger, Adriana Jacoto, 240 Yoshioka, Masaharu, 278

Verheij, Bart, 149, 235 Zeleznikow, John, 195


Vitali, Fabio, 230 Zhang, Hongyi, 79
Vold, Andrew, 245 Zheng, Heng, 149
Zheng, Lucia, 159
Wehnert, Sabine, 285 Zhu, James, 99
Westermann, Hannes, 129 Zhu, Tingting, 185
Witt, Alice, 139 Zufall, Frederike, 270

