ICAIL 2021 Proceedings
Proceedings of the Conference

Sponsored by:
The International Association for Artificial Intelligence and Law
Thomson Reuters
University of San Diego Center for IP Law & Markets
Davis Polk & Wardwell LLP
Jusbrasil
Albert Einstein Israeli Hospital
TrademarkNow
Lawgorithm
Legal Robot
LegalCode
Pires e Gonçalves Advogados
Opice Blum Advogados
OASIS Open
Urbano Vitalino Advogados

In cooperation with:
Association for the Advancement of Artificial Intelligence (AAAI)
ACM SIGAI
The Association for Computing Machinery
1601 Broadway, 10th Floor
New York, New York 10019, USA
ACM COPYRIGHT NOTICE. Copyright © 2021 by the Association for Computing Ma-
chinery, Inc. Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are not made
or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by
others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, to republish, to post on servers, or to redistribute to lists, requires prior
specific permission and/or a fee. Request permissions from Publications Dept., ACM,
Inc., fax +1 (212) 869-0481, or permissions@acm.org.
For other copying of articles that carry a code at the bottom of the first or last page,
copying is permitted provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923,
+1-978-750-8400, +1-978-750-4470 (fax).
Conference Organization
Program Chair
Adam Zachary Wyner (Swansea University, United Kingdom)
Conference Chair
Juliano Maranhão (University of São Paulo, Brazil)
Secretary / Treasurer
Michał Araszkiewicz (Jagiellonian University, Poland)
Industry Chair
Fabio Cozman (University of São Paulo, Brazil)
Program Committee
Tommaso Agnoloni (CNR, Italy)
Thomas Ågotnes (University of Bergen, Norway)
Laura Alonso Alemany (Universidad Nacional de Córdoba, Argentina)
Francisco Andrade (Universidade do Minho, Portugal)
Michał Araszkiewicz (Uniwersytet Jagiellonski w Krakowie, Poland)
Kevin Ashley (University of Pittsburgh, United States)
Katie Atkinson (University of Liverpool, United Kingdom)
Trevor Bench-Capon (University of Liverpool, United Kingdom)
Floris Bex (Utrecht University, Netherlands)
Luther Branting (The MITRE Corporation, United States)
Scott Brewer (Harvard Law School, United States)
Pompeu Casanovas (La Trobe, Australia)
Jack G. Conrad (Thomson Reuters, United States)
Claudia d’Amato (University of Bari, Italy)
Luigi Di Caro (University of Turin, Italy)
Rossana Ducato (University of Aberdeen, United Kingdom)
Jenny Eriksson Lundström (Uppsala University, Sweden)
Enrico Francesconi (European Parliament, Luxembourg)
Fernando Galindo Ayuda (Universidad de Zaragoza, Spain)
Kripabandhu Ghosh (Indian Institute of Science Education and Research, India)
Saptarshi Ghosh (Indian Institute of Technology Kharagpur, India)
Randy Goebel (University of Alberta, Canada)
Thomas Gordon (Germany)
Guido Governatori (CSIRO, Australia)
Matthias Grabmair (Technical University of Munich, Germany)
Davide Grossi (University of Groningen and University of Amsterdam, Netherlands)
Maura R. Grossman (University of Waterloo, Canada)
Mustafa Hashmi (Data61, CSIRO Australia)
Bruce Hedin (H5, United States)
Hans Henseler (University of Applied Sciences Leiden, Netherlands)
Rinke Hoekstra (Elsevier, Netherlands)
Joris Hulstijn (Tilburg University, Netherlands)
John Joergensen (Rutgers Law School, United States)
Yoshinobu Kano (Shizuoka University, Japan)
Daniel Katz (Illinois Tech & Bucerius Law School, United States)
Jeroen Keppens (King’s College London, United Kingdom)
Marc Lauritsen (Capstone Practice Systems, United States)
Rūta Liepiņa (Maastricht University, Netherlands)
Arno R. Lodder (Vrije Universiteit Amsterdam, Netherlands)
Prasenjit Majumder (DAIICT, India)
Juliano Maranhão (University of São Paulo, Brazil)
L. Thorne McCarty (Rutgers University, United States)
Parth Mehta (Parmonic AI, India)
Raquel Mochales (Cerence, Belgium)
Ashutosh Modi (Indian Institute of Technology Kanpur, India)
Katsumi Nitta (Tokyo Institute of Technology, Japan)
Merel Noorman (Tilburg University, Netherlands)
Paulo Novais (Universidade do Minho, Portugal)
Gordon Pace (University of Malta, Malta)
Ugo Pagallo (University of Turin, Italy)
Arindam Pal (Data61, CSIRO, Australia)
Monica Palmirani (University of Bologna, Italy)
Girish Palshikar (Tata Consultancy Services Ltd., India)
Sachin Pawar (Tata Consultancy Services, India)
Wim Peters (University of Aberdeen, United Kingdom)
Henry Prakken (Utrecht University & University of Groningen, Netherlands)
Paulo Quaresma (University of Évora, Portugal)
Edwina Rissland (University of Massachusetts/Amherst, United States)
Livio Robaldo (Swansea University, United Kingdom)
Anna Ronkainen (University of Helsinki, Finland)
Antonino Rotolo (University of Bologna, Italy)
Giovanni Sartor (University of Bologna, Italy)
Ken Satoh (NII, Japan)
Burkhard Schafer (University of Edinburgh, United Kingdom)
Fernando Schapachnik (Universidad de Buenos Aires, Argentina)
Uri Schild (Bar Ilan University, Israel)
Frank Schilder (Thomson Reuters, United States)
Marijn Schraagen (Utrecht University, Netherlands)
Erich Schweighofer (University of Vienna, Austria)
Giovanni Sileno (University of Amsterdam, Netherlands)
Munindar Singh (North Carolina State University, United States)
Clara Patricia Smith (Universidad Nacional de La Plata (UNLP), Argentina)
Katsuhiko Toyama (Nagoya University, Japan)
Thomas Vacek (Thomson Reuters, United States)
Leon van der Torre (Luxembourg)
Marc van Opijnen (Ministry of the Interior and Kingdom Relations of the Nether-
lands, Netherlands)
Bart Verheij (University of Groningen, Netherlands)
Serena Villata (Université Côte d’Azur, CNRS, France)
Vern R. Walker (Hofstra University, United States)
Radboud Winkels (University of Amsterdam, Netherlands)
Masaharu Yoshioka (Hokkaido University, Japan)
Haozhen Zhao (Ankura, United States)
Tomasz Zurek (Maria Curie-Skłodowska University, Poland)
Preface
I am pleased to share with you the proceedings of the 18th International Conference
on Artificial Intelligence and Law (ICAIL 2021). Since 1987, the International Asso-
ciation for Artificial Intelligence and Law (IAAIL) has biennially organised ICAIL to
present and discuss research and applications as well as to stimulate interdisciplinary
and international collaboration. The ICAIL series can lay claim to substantial influ-
ence on the recent growth in AI in legal services. This year’s ICAIL upholds and
extends the mission of the IAAIL. ICAIL 2021 runs the week of June 21-25. For the
first time, the conference and workshops are presented entirely online and free of
charge, leading to over 1,500 registrations from 65 countries!
The conference had 89 submissions; 17 were selected for publication as full pa-
pers (~19%), 17 as short papers (~19%), 8 as extended research abstracts (~9%), 2 as
demonstration papers (~2%), and 3 as COLIEE papers (~3%). ICAIL strives to max-
imise the opportunities for researchers to present their work. In addition, ICAIL will
hold a Doctoral Consortium, helping emerging researchers to engage with the ICAIL
community. There will be 11 co-located workshops on focused topics.
Research in AI & Law is highly interdisciplinary. A range of AI theories and tech-
niques may apply to diverse legal information, processes, or topics. As well, there are
important considerations about how the Law applies to AI. The relation between AI
and Law is, then, many-to-many. While machine learning techniques perform well, it
may be crucial to explain results in legal contexts. Moreover, as AI continues to make
inroads into legal services, other success criteria must be addressed such as: account-
ability, accessibility, portability, linking, consistency, and resource sharing. These are
matters for further research.
The interdisciplinarity of AI & Law shows in our invited speakers. Prof. Stuart
Russell, an internationally recognised researcher and AI educator, will speak on “Prov-
ably Beneficial AI”, which will be discussed by a panel, deepening our understanding
of the relation between AI and Law. Joe Cohen of Dentons law firm will talk about
advances in automation, highlighting the real world impact of AI and Law. Finally,
IAAIL president Enrico Francesconi will outline the evolution and the perspectives of
AI research in relation to ICAIL.
Finally, many people worked hard over months to make ICAIL 2021 excellent.
Many thanks to the following. Conference chair Juliano Maranhão and his team took
on the task of putting ICAIL 2021 online in very trying circumstances. The IAAIL
secretary Michał Araszkiewicz, along with Anne Gardner, addressed administrative
and management matters. The most substantive contributors were the authors who
submitted papers and the reviewers who took the time to assess and discuss the sub-
missions; they have shaped the content that advances the field. The organisers of and
presenters at the workshops and Doctoral Consortium all extended the discussion and
supported emerging researchers. Our sponsors provided essential recognition and
support. And finally, we are most appreciative of the IAAIL Executive Committee,
which promotes AI & Law research through the ICAIL conferences.
ICAIL 2021 Program and Schedule of Events
All times in GMT. Events marked REC. are streamed events (replay with live discus-
sion with pre-registered questions).
10:30-11:00 Short break / Networking space
11:00- Session 2
Time Page
11:00 Hardness of Case-Based Decisions: a Formal Theory 149
Zheng, Heng; Grossi, Davide; Verheij, Bart
11:30 Precedential Constraint: The Role of Issues 12
Bench-Capon, Trevor; Atkinson, Katie
12:00 Incorporating Domain Knowledge for Extractive Summarization of Le- 22
gal Case Documents
Bhattacharya, Paheli; Poddar, Soham; Rudra, Koustav; Ghosh, Kri-
pabandhu; Ghosh, Saptarshi
12:30-13:30 Break
13:30-14:30 Session 3
Time Page
13:30 A dynamic model for balancing values 89
Maranhão, Juliano; Souza, Edelcio; Sartor, Giovanni
14:00 On Semantics-based Minimal Revision for Legal Reasoning 50
REC. Fungwacharakorn, Wachara; Tsushima, Kanae; Satoh, Ken
14:30-15:30 Keynote Speaker – iRobot: how to use Robotic Process Automation to
automate certain legal work
15:30-16:00 Short break / Networking space
16:00-16:55 Session 4
Time Page
16:00 Incorporating Domain Knowledge for Extractive Summarization of Le- 22
gal Case Documents
REC. Bhattacharya, Paheli; Poddar, Soham; Rudra, Koustav; Ghosh, Kri-
pabandhu; Ghosh, Saptarshi
16:30 To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment 295
Rosa, Guilherme Moraes; Rodrigues, Ruan Chaves; Lotufo, Roberto;
Nogueira, Rodrigo
16:45 Interactive System for Arranging Issues based on PROLEG in Civil Lit- 273
igation
Satoh, Ken; Takahashi, Kazuko; Kawasaki, Tatsuki
16:50 Live Demonstration of a Working Collaborative e-Negotiation System 275
(Smartsettle Infinity)
Ross, Graham Laurence; Thiessen, Ernest
16:55-17:30 Break
18:30-20:00 Session 5
Time Page
18:30 Precedential Constraint: The Role of Issues 12
REC. Bench-Capon, Trevor; Atkinson, Katie
19:00 BERT-based Ensemble Methods with Data Augmentation for Legal Tex- 278
tual Entailment in COLIEE Statute Law Task
Yoshioka, Masaharu; Aoki, Tasuhiro; Suzuki, Youta
19:15 Legal Norm Retrieval with Variations of the BERT Model Combined 285
with TF-IDF Vectorization
Wehnert, Sabine Sarah; Sudhi, Viju; Dureja, Shipra; Kutty, Libin
Johnny; Shahania, Saijal; De Luca, Ernesto William
19:30 Toward Summarizing Case Decisions via Extracting Argument Issues, 250
Reasons, and Conclusions
Xu, Huihui; Savelka, Jaromir; Ashley, Kevin
19:45 Practical Tools from Formal Models: The ECHR as a Case Study 170
Atkinson, Katie; Collenette, Joe; Bench-Capon, Trevor; Dzehtsiarou,
Kanstantsin
20:30-21:45 Session 6
Time Page
20:30 Hardness of Case-Based Decisions: a Formal Theory 149
REC. Zheng, Heng; Grossi, Davide; Verheij, Bart
21:00 When Does Pretraining Help? Assessing Self-Supervised Learning for 159
Law and the CaseHOLD Dataset of 53,000+ Legal Holdings
Guha, Neel; Zheng, Lucia; Anderson, Brandon Ray; Henderson, Pe-
ter; Ho, Daniel En-Wenn
21:30 Modelling Legal Procedures 220
Rotolo, Antonino; Smith, Clara
21:45 Towards compliance checking in reified I/O logic via SHACL 215
Robaldo, Livio
11:00-11:40 Session 2
Time Page
11:00 Applying Decision Tree Analysis to Family Court Decisions: Factors 258
Determining Child Custody in Taiwan
Huang, Sieh-Chuen; Shao, Hsuan-Lei; Leflar, Robert B
11:05 Constraint Answer Set Programming as a Tool to Improve Legislative 262
Drafting: A Rules as Code Experiment
Morris, Jason Patrick
11:10 CriminelBART: A French Canadian Legal Language Model Specialized 256
in Criminal Law
Garneau, Nicolas; Gaumond, Eve; Lamontagne, Luc; Déziel, Pierre-
Luc
11:15 Sentence Classification for Contract Law Cases: A Natural Language 260
Processing Approach
Mok, Wai Yin; Mok, Jonathan R.; Mok, Rachel V.
11:20 Labels distribution matters in performance achieved in legal judgment 268
prediction task
Salaün, Olivier; Langlais, Philippe; Benyekhlef, Karim
11:25 Pathways to Legal Dynamics in Robotics 266
Rotolo, Antonino; Tamargo, Luciano H.; Martínez, Diego C.
11:30 A simple mathematical model for the legal concept of balancing of in- 270
terests
Zufall, Frederike; Kimura, Rampei; Peng, Linyu
11:35 Predicting Legal Proceedings Status: Approaches Based on Sequential 264
Text Data
Polo, Felipe Maia; Ciochetti, Itamar; Bertolo, Emerson
11:40-13:00 Break
13:00-14:30 Session 3
Time Page
13:00 Automatic Extraction of Amendments from Polish Statutory Law 225
Smywiński-Pohl, Aleksander; Piech, Mateusz; Kaleta, Zbigniew;
Wróbel, Krzysztof
13:15 Enhancing a Recidivism Prediction Tool With Machine Learning: Ef- 210
fectiveness and Algorithmic Fairness
Karimi-Haghighi, Marzieh; Castillo, Carlos
13:30 Converting Copyright Legislation into Machine-Executable Code: In- 139
terpretation, Coding Validation and Legal Alignment
REC. Witt, Alice; Huggins, Anna; Governatori, Guido; Buckley, Joshua
14:00 Unravel Legal References in Defeasible Deontic Logic 69
REC. Governatori, Guido; Olivieri, Francesco
14:30-15:30 Keynote Speaker – Provably Beneficial Artificial Intelligence
16:00-17:00 Session 4
Time Page
16:00 A Combined Rule-Based and Machine Learning Approach for Auto- 40
mated GDPR Compliance Checking
El Hamdani, Rajaa; Mustapha, Majd; Restrepo Amariles, David;
Troussel, Aurore; Meeus, Sébastien; Krasnashchok, Katsiaryna
16:30 When Does Pretraining Help? Assessing Self-Supervised Learning for 159
Law and the CaseHOLD Dataset of 53,000+ Legal Holdings
REC. Guha, Neel; Zheng, Lucia; Anderson, Brandon Ray; Henderson, Pe-
ter; Ho, Daniel En-Wenn
17:00-18:30 Break
18:30-20:00 Session 5
Time Page
18:30 Context-Aware Legal Citation Recommendation using Deep Learning 79
Huang, Zihan; Low, Charles; Teng, Mengqiu; Zhang, Hongyi; Ho,
Daniel E.; Krass, Mark; Grabmair, Matthias
19:00 From Data to Information: Automating Data Science to Explore the U.S. 119
Court System
Li Zhao, Andong L.; Pack, Harper; Servantez, Sergio; Adler, Rachel
F.; Sterbentz, Marko; Pah, Adam; Schwartz, David; Barrie, Cameron;
Einarsson, Alexander; Hammond, Kristian
19:30 Case-level Prediction of Motion Outcomes in Civil Litigation 99
McConnell, Devin J.; Zhu, James; Pandya, Sachin S.; Aguiar, Derek
Cole
20:00-20:30 Short break / Networking space
20:30-21:30 Session 6
Time Page
20:30 Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdic- 129
tions, and Legal Domains
Savelka, Jaromir; Westermann, Hannes; Benyekhlef, Karim; Alexan-
der, Charlotte S.; Grant, Jayla C.; Amariles, David Restrepo; El-
Hamdani, Rajaa; Meeus, Sebastien; Troussel, Aurore; Araszkiewicz,
Michal; Ashley, Kevin D.; Ashley, Alexandra; Branting, Karl L.; Fal-
duti, Mattia; Grabmair, Matthias; Harasta, Jakub; Novotna, Tereza;
Tippett, Elizabeth; Johnson, Shiwanni
21:00 Plum2Text: A French Plumitifs–Descriptions Data-to-Text Dataset for 200
Natural Language Generation
Garneau, Nicolas; Gaumond, Eve; Lamontagne, Luc; Déziel, Pierre-
Luc
21:15 Process Mining-Enabled Jurimetrics: Analysis of a Brazilian Court’s 240
Judicial Performance in the Business Law Processing
Unger, Adriana Jacoto; dos Santos Neto, Jose Francisco; Trecenti,
Julio; Hirota, Renata; Fantinato, Marcelo; Peres, Sarajane Marques
Thursday, 24 June – Main conference
09:00-10:30 Session 1
Time Page
09:00 Explainable Artificial Intelligence, lawyer’s perspective 60
Górski, Łukasz; Ramakrishna, Shashishekar
09:30 Evaluating Document Representations for Content-based Legal Litera- 109
ture Recommendations
Ostendorff, Malte; Ash, Elliott; Ruas, Terry; Gipp, Bela; Moreno-
Schneider, Julian; Rehm, Georg
10:00 Structural Text Segmentation of Legal Documents 2
Aumiller, Dennis; Almasian, Satya; Lackner, Sebastian; Gertz,
Michael
10:30-11:00 Short break / Networking space
11:00-12:00 Session 2
Time Page
11:00 AI systems and product liability 32
Borges, Georg
11:30 A Dataset for Evaluating Legal Question Answering on Private Inter- 230
national Law
Sovrano, Francesco; Palmirani, Monica; Distefano, Biagio; Sapienza,
Salvatore; Vitali, Fabio
11:45 Making Intelligent Online Dispute Resolution Tools available to Self- 195
Represented Litigants in the Public Justice System
Esteban de la Rosa, Fernando; Zeleznikow, John
12:00-12:30 Break
12:30-13:30 IAAIL General Meeting
13:30-14:30 Session 3
Time Page
13:30 Structural Text Segmentation of Legal Documents 2
REC. Aumiller, Dennis; Almasian, Satya; Lackner, Sebastian; Gertz,
Michael
14:00 Anonymization of German Legal Court Rulings 205
Glaser, Ingo; Schamberger, Tom; Matthes, Florian
14:15 Regulating Artificial Intelligence: A Technology Regulator’s Perspective 190
Ellul, Joshua; McCarthy, Stephen; Sammut, Trevor; Brockdorff,
Juanita; Scerri, Matthew; Pace, Gordon J.
14:30-15:30 Presidential Address – The Winter, The Summer and The Summer Dream
of AI in LAW
16:00-17:15 Session 4
Time Page
16:00 Using Transformers to Improve Answer Retrieval for Legal Questions 245
Vold, Andrew; Conrad, Jack G
16:15 From Data to Information: Automating Data Science to Explore the U.S. 119
Court System
REC. Li Zhao, Andong L.; Pack, Harper; Servantez, Sergio; Adler, Rachel
F.; Sterbentz, Marko; Pah, Adam; Schwartz, David; Barrie, Cameron;
Einarsson, Alexander; Hammond, Kristian
16:45 Case-level Prediction of Motion Outcomes in Civil Litigation 99
REC. McConnell, Devin J.; Zhu, James; Pandya, Sachin S.; Aguiar, Derek
Cole
17:15-18:30 Break
18:30-20:00 Session 5
Time Page
18:30 Context-Aware Legal Citation Recommendation using Deep Learning 79
REC. Huang, Zihan; Low, Charles; Teng, Mengqiu; Zhang, Hongyi; Ho,
Daniel E.; Krass, Mark; Grabmair, Matthias
19:00 Explainable Artificial Intelligence, lawyer’s perspective 60
REC. Górski, Łukasz; Ramakrishna, Shashishekar
19:30 On the relevance of algorithmic decision predictors for judicial decision 175
making
Bex, Floris; Prakken, Henry
19:45 Prediction of monetary penalties for data protection cases in multiple 185
languages
Ceross, Aaron William Karl; Zhu, Tingting
20:00-20:30 Short break / Networking space
20:30-22:00 Session 6
Time Page
20:30 Evaluating Document Representations for Content-based Legal Litera- 109
ture Recommendations
REC. Ostendorff, Malte; Ash, Elliott; Ruas, Terry; Gipp, Bela; Moreno-
Schneider, Julian; Rehm, Georg
21:00 AI systems and product liability 32
REC. Borges, Georg
21:30 Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdic- 129
tions, and Legal Domains
REC. Savelka, Jaromir; Westermann, Hannes; Benyekhlef, Karim; Alexan-
der, Charlotte S.; Grant, Jayla C.; Amariles, David Restrepo; El-
Hamdani, Rajaa; Meeus, Sebastien; Troussel, Aurore; Araszkiewicz,
Michal; Ashley, Kevin D.; Ashley, Alexandra; Branting, Karl L.; Fal-
duti, Mattia; Grabmair, Matthias; Harasta, Jakub; Novotna, Tereza;
Tippett, Elizabeth; Johnson, Shiwanni
22:00-22:15 Closing comments conducted by the program chair
Friday, 25 June – Workshops, Doctoral Consortium, Closing
Various-14:30 Workshops
Time schedule to be defined by the organizer of each workshop, finishing activities
before 14:30.
6 hours
ASAIL – Automated Detection, Extraction and Analysis of Semantic Information in
Legal Texts
MWAIL – Multilingual Workshop on AI & Law
PATENTS – Artificial Intelligence and Patents
3 hours
AILBIZ (Legal Business) – International Workshop on A.I. for Understanding the
Legal Business
RELATED – Relations in the Legal Domain
07:30-09:15 1st Panel
07:30 Opening speeches of the Doctoral Consortium
07:45 Beyond persons and things: the legal status of artificial intel-
ligence
Diana Mocanu
08:15 The Digital Administrative Act
Alexander Stepanov
08:45 Constitutional limits to the use of artificial intelligence in court
proceedings
Elisabeth Paar
09:15-09:30 Short break / Networking space
09:30-11:00 2nd Panel
09:30 An African Perspective on Answering the Ethics Question:
Who Should Make the Rules on Self-Driving Cars?
Okechukwu Effoduh
10:00 Judged by Machines? How do algorithms, now used in crimi-
nal justice, impact on the legitimacy of the system?
Cari Hyde-Vaamonde
10:30 Transactions on Privacy and the Tools that Assist – An Inter-
disciplinary Analysis
Kartik Chawla
11:00-11:30 Long break / Networking space
11:30-13:00 3rd Panel
11:30 ALGORITHMS (DIS)SERVING JUSTICE: Risk Assessment
Tools in Pre-trial Process
Mina Ilhan
12:00 HONto: A Knowledge Base from Textbooks for Legal Text Re-
trieval and Recommendations
Sabine Wehnert
12:30 ESRA: An End-to-End System for Re-Identification and
Anonymization of Swiss Court Decisions
Joel Niklaus
13:00-13:15 Short break / Networking space
13:15-14:30 4th Panel
13:15 Measurement of Consistency in Judicial Decisions
Aline Macohin
13:45 Artifact design for design-science research on process mining
for legal compliance
Adriana Jacoto Unger
14:15 Doctoral Consortium Best Paper award & ending
14:30-15:30 Closing speeches
Invited Speakers
Provably Beneficial Artificial Intelligence
Professor Stuart Russell
University of California, Berkeley, United States
Topic Prof. Russell discusses an approach to ethical AI based on the idea that AI
systems should be beneficial to humans, with the key caveat that what counts as
"beneficial" is very unlikely to be fully specified. Issues include social aggregation
and social choice, laws, equity, sadism, pride, envy, and mental integrity.
iRobot: how to use Robotic Process Automation to automate certain legal work
Joe Cohen
Dentons

Topic Robotic Process Automation (RPA) is usually thought of with respect to back-
office functions such as Finance and HR. Recently, however, Dentons have been putting
this technology into the hands of our junior lawyers and asking them to not only sug-
gest some legal tasks for the software to automate, but to actually do the automation
as well. In this talk Joe will cover: How RPA fits into the wider legaltech landscape;
How RPA actually works; How to choose RPA use cases, and when not to use it; How
best to empower lawyers with the right tools.
About Joe leads the Innovation team for the Dentons UK, Ireland and Middle East
region. This includes responsibility for legal technology pilots and implementations,
as well as other ’innovation culture’ initiatives such as innovation training and design
thinking. Prior to this, as a non-lawyer, Joe held innovation positions at Linklaters
and Slaughter and May, and before that was a technology consultant for Deloitte. Joe
was shortlisted for the recent Law.com Innovation Trailblazer of the Year award.
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
I Full Papers 1
Structural Text Segmentation of Legal Documents . . . . . . . . . . . . . 2
Dennis Aumiller, Satya Almasian, Sebastian Lackner and Michael Gertz
Unravel Legal References in Defeasible Deontic Logic . . . . . . . . . . . 69
Guido Governatori and Francesco Olivieri
II Short Papers 169
Practical Tools from Formal Models: The ECHR as a Case Study . . . . . 170
Katie Atkinson, Joe Collenette, Trevor Bench-Capon and Kanstantsin Dzeht-
siarou
A Dataset for Evaluating Legal Question Answering on Private Interna-
tional Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Francesco Sovrano, Monica Palmirani, Biagio Distefano, Salvatore Sapienza
and Fabio Vitali
Labels distribution matters in performance achieved in legal judgment
prediction task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Olivier Salaün, Philippe Langlais and Karim Benyekhlef
IV Demonstrations 272
Interactive System for Arranging Issues based on PROLEG in Civil Lit-
igation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Ken Satoh, Kazuko Takahashi and Tatsuki Kawasaki
To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment . 295
Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto de Alencar Lotufo
and Rodrigo Nogueira
Part I
Full Papers
Structural Text Segmentation of Legal Documents
Dennis Aumiller∗
Institute of Computer Science, Heidelberg University
Heidelberg, Germany
aumiller@informatik.uni-heidelberg.de

Satya Almasian∗
Institute of Computer Science, Heidelberg University
Heidelberg, Germany
almasian@informatik.uni-heidelberg.de
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Aumiller and Almasian, et al.
Structural Text Segmentation of Legal Documents ICAIL’21, June 21–25, 2021, São Paulo, Brazil
longer segments in the form of entire sentences are both used by et al. [14] introduce Coherence-Aware Text Segmentation, which
Poudyal et al. [34], who mine arguments from European case-law encodes a sentence sequence using two hierarchically connected
decision, and Westermann et al. [42], where a system for efficient transformer networks. The two latter models are closest to our
similarity search based on sentence embeddings is presented. work in terms of data size and problem formulation. However, they
rely solely on per sentence predictions, which is incomparable to
2.2 Topic Analysis our paragraph-based method. The model by Glavas et al. is similar
Detection and analysis of topical change are grounded in topic mod- to our approach in that it is based on a transformer architecture, yet,
eling approaches. Earlier work such as LDA [2] treat documents as they do not take advantage of transfer learning from pre-trained
bag-of-words, where each document is assigned to a topic distribu- language models and learn all the features from scratch. Finally,
tion, and each topic is a distribution over all words. More recent Zhang et al. [45] extend text segmentation by outline generation
work has adopted a more sophisticated representation than bag-of- and trained an end-to-end LSTM-model for identifying sections
words and generally models Markovian topic or state transitions to and generating corresponding headings for Wikipedia documents.
capture dependencies between words in a document [15, 41]. With
the rise of distributed word representation, the focus has shifted 2.4 Transformer Language Models
to the combination of LDA and word embeddings [11, 32]. Since The transformer architecture, much like recurrent neural networks,
we are interested in a primary segmentation without necessarily aims to solve sequence-to-sequence tasks, relying entirely on self-
predicting topics, we put a stronger focus on the related work of attention to compute representations of its input and output [40].
segmentation methods, as discussed in the following section. Transformers have made a significant step in bringing transfer
learning to the NLP community, which allows the easy adaptation
2.3 Text Segmentation of a generically pre-trained model for specific tasks. Pre-trained
Text Segmentation is the task of dividing a document into a multi- models such as BERT, GPT-2, and RoBERTa [9, 23, 35] use lan-
paragraph discourse unit that is topically coherent, with the cut-off guage modeling for pre-training on large corpora. These models
point usually indicating a change in topic [17, 39]. Although the are powerful feature generators, which with minimal task-specific
task itself dates back to 1994 [17], most existing text segmentation fine-tuning achieve state-of-the-art performance on a wide variety
datasets are small and limit their scope to sentences (predicting of tasks. Although at the core of all these models lies the idea of
whether two sentences discuss the same topic or not). The most transformers and attention mechanisms, many have been modi-
common dataset is by Choi [7], containing only 920 synthesized pas- fied and optimized to fit various downstream applications. One
sages from the Brown corpus. Choi’s method (C99) is a probabilistic variation based on BERT is Sentence-BERT [36], which combines
algorithm measuring similarity via term overlap. GraphSeg [13] is two BERT-based models in Siamese fashion to derive semantically
an unsupervised graph method that segments documents using a meaningful sentence embeddings. By its design, Sentence-BERT
semantic relatedness graph of a document. GraphSeg is also evalu- also allows for longer input sequences for pairwise training tasks
ated on a small set of 5 manually-segmented political manifestos and outperforms BERT on semantic textual similarity tasks, mak-
from the Manifesto project2 . Another class of methods are topic- ing it a suitable choice for embedding paragraphs. Another notable
based document segmentations, which are statistical models that variant of BERT is RoBERTa, a retraining of BERT with improved
find latent topic assignments reflecting the underlying structure of the document [1, 4, 5, 10, 29, 38]. TopicTiling [38] performs best among this family of methods; it uses LDA to detect topic shifts, computing similarities between adjacent blocks based on their term frequency vectors. Brants et al. [4] follow a similar approach but employ PLSA [19] to compute the estimated word distributions. Another noteworthy approach based on Bayesian topic models is by Chen et al. [5], who constrain latent topic assignments to reflect the underlying organization of document topics. They also publish a test dataset with 218 Wikipedia articles about cities and chemical elements.

All of the aforementioned methods are unsupervised learning approaches, with small annotated datasets used only for evaluation; they are hence not directly comparable to our approach. Instead, we focus on supervised learning of topics and introduce a new dataset with 43,056 automatically labeled documents.

The only two comparable supervised approaches are those of Koshorek et al. [21] and Glavas et al. [14]. Koshorek et al. [21] propose a hierarchical LSTM architecture for learning sentence representations and their dependencies. They train their hierarchical model on a dataset of cleaned Wikipedia articles, called Wiki-727k. Glavas

training methodology and more training data, it achieves slightly better results than BERT on some natural language understanding tasks. Due to the advantages of RoBERTa, we chose RoBERTa and Sentence-RoBERTa from the Sentence-BERT variant for the setup in our approach.

3 SAME TOPIC PREDICTION
We formulate structural text segmentation as a supervised learning task of same topic prediction. Our model consists of two steps: (i) Independent and Identically Distributed Same Topic Prediction (IID STP) and (ii) sequential inference over a full document. As mentioned previously, sections are the considered level of hierarchy in our model, and the structure of sub-sections is ignored in this study. However, the model is easily adaptable to any granularity, and our dataset contains information for all levels. In the first step, we fine-tune transformer-based models to detect topical change for both paragraphs and entire sections. Given two paragraphs or sections, the classifier should correctly identify whether they discuss the same subject or not. We assume that the topic of each paragraph or section is independent of the text before and after, meaning that the topic of one paragraph does not affect the likelihood of the next paragraph belonging to the same topic. We later prove

2 https://manifestoproject.wzb.eu
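As a concrete illustration, the pair construction implied by this formulation can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' released code; the function name and the (text, topic_label) layout are our own:

```python
from itertools import combinations

def make_stp_pairs(chunks):
    """Build labeled pairs for Same Topic Prediction (STP).

    `chunks` is a list of (text, topic_label) tuples; a chunk may be a
    paragraph or an entire section.  Under the independence assumption,
    position is irrelevant: a pair is labeled 1 if both chunks carry
    the same topic label, and 0 otherwise.
    """
    return [
        (a_text, b_text, 1 if a_topic == b_topic else 0)
        for (a_text, a_topic), (b_text, b_topic) in combinations(chunks, 2)
    ]
```

A binary pair classifier (e.g., a fine-tuned RoBERTa model) would then be trained on such (text, text, label) triples.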
Structural Text Segmentation of Legal Documents. Aumiller and Almasian, et al. ICAIL'21, June 21–25, 2021, São Paulo, Brazil.
as our input, the classifier outputs the probability of the two paragraphs belonging to the same topic, independent of their surrounding context, e.g., TF(p_1, p_2) = P(Topic(p_1) = Topic(p_2)). Therefore, given a sequence of paragraphs p_1, ..., p_k and the corresponding predicted labels y = (y_1, ..., y_{k-1}), a segmentation of the document is given by the k-1 predictions of TF, where y_i = 0 denotes the end of a segment at p_i. It is worth noting that regardless of the chunk type used during the training of the classifiers (section or paragraph inputs), the segmentation module operates on paragraphs only. Figure 2 shows the inference on a sample document with four paragraphs and two sections, where the paragraph colors show the topics. The TF classifier is applied to each paragraph pair and can ideally recognize the topic change from P3 to P4 and mark the beginning of the new section.

3.3 Legal Applications
To put the presented segmentation into a legal context, we focus on three main application areas: (i) As mentioned, section-based semantic segmentation can be used as a pre-processing step in a passage retrieval context. This, however, would require additional data with relevance annotations at both the sentence and the paragraph level to compare the specific benefits of our approach, which we leave to future work in this area. (ii) Semantically coherent sections can also be used as a basis for similarity search. This is especially helpful when looking for, e.g., related sections in existing contracts [42]. Here, we focus on Terms-of-Service documents, which are widely available and contain sections that follow a general pattern of similar topics. (iii) Lastly, the section separation can be used for generating outlines of documents, which has previously been shown to work well in other domains, such as Wikipedia [45]. During our document crawl, we also encountered several documents without any sectional headings, which makes it especially hard for lay users to understand the legal context.

4 TERMS-OF-SERVICE (TOS) DATASET
Due to data governance policies in many countries, it is generally mandated that commercial websites contain the necessary legal information for site users. Specifically, this information must be easily reachable via the landing page, which makes it comparatively easy to crawl. For each Terms-of-Service document, we automatically extract the content divided into paragraphs and the respective hierarchical section headings. Further, ToS documents allow us to experiment with a large-scale dataset that comes with a shared set of topics, while still maintaining a heterogeneous set of topics due to the different types of websites. In the following, we discuss the detailed mining process and the limitations of this approach.

4.1 Crawling
As seeds for our crawler, we use the Alexa 1M URL dataset.3 For each URL in the dataset, we try to access the website both with and without the www prefix. First, the landing page is downloaded and parsed using the Beautiful Soup Python package. We then search for hyperlinks with the texts Terms of Service, Terms of Use, Terms and Conditions, and Conditions of Use, and follow them to get to the respective terms-of-service pages. Levenshtein distance with a threshold of 0.75 is used to allow for spelling mistakes and different wording. The raw Hypertext Markup Language (HTML) content of the Terms-of-Service page is downloaded and stored for further processing. In case of an error, e.g., if the website is temporarily unreachable, we retry the same website 2 additional times before skipping it. The unprocessed dataset contains HTML code for roughly 74,000 websites. Note that due to limitations of the current crawler implementation, websites that rely on JavaScript to display content are not supported.

4.2 Section Extraction
Despite the fact that HTML is a structured format, extracting text and hierarchies from it is a non-trivial task. The main reasons are that Web pages often contain a lot of boilerplate (e.g., navigational elements, advertisements, etc.), generally have heterogeneous appearances and implementations, and simply do not always conform to the HTML standard. Here, only a rough outline of the pipeline is given; for further reference, please refer to the implementation in our repository.

Boilerplate Removal. For boilerplate removal, we use the boilerpipe package by Kohlschütter et al. [20], which is based on shallow text features for classifying the text elements on a Web page. The result is an HTML page with all navigational elements, advertisements, and template code removed. Importantly, relevant hierarchical information is retained past this step.

HTML Cleanup. To deal with websites that do not conform to HTML standards, we perform several cleanup steps. This includes, for example, fixing mistakes such as text appearing without a corresponding paragraph (<p> tag) or incorrectly nested tags (e.g., section headings within a <p> tag). We fix such mistakes by adding missing tags and adjusting nested tags, similar to how a web browser would interpret the code.

Language Detection. Since the Alexa dataset also contains many non-English websites, we reject extracted terms of service where the majority of the text most likely has a language different from English. We use the langid Python package to detect the language of each individual paragraph (<p> tag).

Extracting Hierarchy. To obtain the hierarchy, we split the document into smaller chunks. Splits are done in the following order: first we split on each section heading (<h1>-<h6> tags), then on bold text (<b> tag) starting with an enumeration pattern, then on enumerations (<li> tags), then on underlined text (<u> tag) starting with an enumeration pattern, and lastly on regular text (<p> tag) starting with an enumeration pattern. To prevent spurious splits, each criterion is only used if there are at least 5 occurrences within the document. Each time the document is split, we save the corresponding headings, which then form the hierarchy. As enumeration patterns, we recognize Latin numbers, Roman numerals, and letters, optionally prefixed with Part, Section, or Article. The majority of documents contain at most two levels of section hierarchy.

3 Available at: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
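The enumeration-pattern test described under Extracting Hierarchy can be sketched as a regular expression. This is a hypothetical reconstruction: the exact patterns of the pipeline may differ, and the names below are our own. Recall that the pipeline additionally requires at least 5 occurrences of a split criterion per document before using it:

```python
import re

# Assumed sketch of the enumeration patterns described above: Latin
# numbers, Roman numerals, or single letters, optionally prefixed with
# "Part", "Section", or "Article", followed by a delimiter.
ENUM_PATTERN = re.compile(
    r"^\s*(?:(?:Part|Section|Article)\s+)?"    # optional textual prefix
    r"(?:\d+|[IVXLCDM]+|[A-Za-z])[\.\):]\s+"   # number/numeral/letter + delimiter
)

def starts_with_enumeration(text):
    """Return True if `text` begins with a recognized enumeration pattern."""
    return bool(ENUM_PATTERN.match(text))
```

In the pipeline sketched above, this predicate would be applied to the text of candidate split tags (<b>, <u>, <p>) to decide whether they open a new chunk.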
Table 1: Top 10 section topics by document frequency. Additionally, the number of associated paragraphs is given.

Topic Label               Document Frequency   Paragraph Frequency
limitation of liability   21,317               68,517
indemnification           16,698               25,683
law and jurisdiction      15,113               29,790
links to other websites   13,752               24,727
termination               12,855               33,978
warranty                  9,926                41,403
privacy                   8,958                25,022
disclaimer                8,575                29,265
general terms             7,936                54,693

[Figure 3: Illustration of the three sampling strategies (Section, Random Paragraph, and Consecutive Paragraph) over two example documents, Doc 1 and Doc 2.]
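The two statistics reported in Table 1 can be recomputed with a single pass over a labeled corpus. A minimal sketch, assuming a hypothetical layout in which each document is a list of (topic, paragraphs) sections:

```python
from collections import Counter

def topic_frequencies(corpus):
    """Count per-topic document and paragraph frequencies (as in Table 1).

    `corpus` is a list of documents; each document is a list of
    (topic_label, paragraphs) sections, where `paragraphs` is a list of
    paragraph strings.  A topic counts once per document for the
    document frequency, and once per contained paragraph for the
    paragraph frequency.
    """
    doc_freq, par_freq = Counter(), Counter()
    for doc in corpus:
        seen = set()
        for topic, paragraphs in doc:
            seen.add(topic)
            par_freq[topic] += len(paragraphs)
        doc_freq.update(seen)
    return doc_freq, par_freq
```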
Table 2: Prediction accuracy for the independent topic prediction tasks, Same Topic Prediction (STP), Random Paragraph (RP), and Consecutive Paragraph (CP), with different sampling strategies. Standard deviation is reported over 5 runs, and the best model on each respective set is depicted in bold.

sampling strategies. In the following, we describe each strategy in detail and highlight their differences. Across all strategies, we added three positive and three negative samples for each individual section/paragraph.

5.2.1 Section (S) Topic Prediction. In this setup, we use sections as input chunks to the transformer classifiers. The section task showcases how different levels of granularity can affect the prediction results. Specifically, the extremely long input sequences test the limits of what transformers can predict from partial observations, since the majority of inputs will be heavily truncated. To ensure an equal distribution of samples from within the same and different sections, we match each section with three samples from the same topic and three from different topics. The positive and negative sections can be sampled from a different document; the important point is that the positive samples come from the same topic and the negative samples from different ones. The first column of Figure 3 visualizes the section sampling, where the first section of Doc 1 is paired with the second section of Doc 2 to form a positive sample and with the first section of Doc 2 to form a negative sample, respectively. The same strategy is employed for the generation of the development and test sets.

Despite the constraints with respect to the input length, we find that all transformers perform on a near-perfect level (compare Table 2). Comparing these results to the already very well-performing baselines, we suspect that certain keywords give away similar sections, but highlight the fact that no explicit representation of the different topics is given during training in the binary classification task, which makes this a suitable method for dealing with imbalanced topics.

5.2.2 Random Paragraph (RP) Topic Prediction. In contrast to the section-level task, we revert to a more fine-grained distinction of paragraphs in a text. In the Random Paragraph setting, we still generate samples similarly, meaning we include three paragraphs from a random document with the same topic and three negative samples from random paragraphs with different topics. The main difference between the Section and Random Paragraph prediction lies in the level of granularity, not in how the samples are chosen. The second column of Figure 3 highlights this difference, where the samples are paragraphs inside the sections rather than entire sections. Paragraph-based sampling is closer to our inference setup, where each input document is considered one paragraph at a time. However, results show a sharp drop in performance, which can come from the much narrower context of the paragraphs, as well as from a differing selection of test samples compared to the section task. Solely the BoW model seems to be largely unaffected, which is simply due to its low performance in either setting.

5.2.3 Consecutive Paragraph (CP) Topic Prediction. To boost performance and account for the coherent structures in the text, we employ a sampling strategy inspired by Ein Dor et al. [12]. For their triplet loss, samples are generated inside the same document only, which can be translated into sampling from intra-document paragraphs. Note that this strategy also no longer requires any merging and annotation of topics across documents, as all relevant information is now contained within a single document. This fact opens up a much larger generation of training data, which we omit in our current work for the sake of comparability with the RP model. To generate samples, we look at all paragraphs of a section and pair them as positive samples. Negative samples are picked from paragraphs of different sections in the same document. The third column of Figure 3 depicts the consecutive paragraph setup, where the samples are limited to paragraphs of Doc 1. Note that despite their similar setup, the results of RP and CP runs in Table 2 are not evaluated on the same test set and thus are not comparable, since the test sets are each generated with the respective sampling strategies (RP or CP) as well. However, we are able to compare their downstream performance on the subsequently introduced text segmentation task (see Section 5.3 and Table 3).

The results of the different sampling strategies, along with the performance of the baselines, are shown in Table 2, where the transformer-based models all outperform the baselines by a significant margin. Among the baselines, BoW has the worst performance overall, with accuracy close to random, showing that distinct word occurrences are not a sufficient indicator. Average GloVe has the best performance of all baselines, but is still behind the transformers by a large margin. Despite the NLI-pretrained SRoBERTa model (ST-Ro-N) achieving better scores than the base model (ST-Ro) for most setups, the difference is insignificant, indicating that the pre-training on sentence similarity tasks does not directly influence our topic prediction setup.
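The intra-document (CP) pair generation described in Section 5.2.3 can be sketched as follows. This is a simplified illustration with assumed names; the additional limit of three positive and three negative samples per paragraph is omitted here:

```python
from itertools import combinations

def cp_pairs(document):
    """Consecutive Paragraph (CP) sampling from a single document.

    `document` is a list of sections, each section a list of paragraph
    strings.  Positives pair paragraphs within the same section;
    negatives pair paragraphs across different sections of the same
    document.  No cross-document topic annotation is needed.
    """
    pairs = []
    for sec in document:
        pairs += [(a, b, 1) for a, b in combinations(sec, 2)]
    for sec_a, sec_b in combinations(document, 2):
        pairs += [(a, b, 0) for a in sec_a for b in sec_b]
    return pairs
```

Because every label is derived from the document's own section structure, training data can be generated from any crawled, sectioned document without manual topic merging.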
Table 3: Boundary error rate P_k for compared models (lower is better), based on the sampling strategies Random Paragraph (RP), Consecutive Paragraph (CP), and their ensemble variants, RP_Ens and CP_Ens, respectively. Ensemble ("Ens") predictions are obtained by majority voting over model runs.

Model           RP           CP           RP_Ens   CP_Ens
GLV_avg         29.97 ±.09   26.23 ±6.2   29.55    23.06
tf-idf          39.87 ±.24   29.70 ±.28   39.36    28.60
BoW             45.76 ±.67   43.46 ±1.5   46.20    41.80
Random Oracle   35.08 ±.15   -            31.88    -
GraphSeg        -            32.48 ±.46   -        32.28
WikiSeg         -            48.29 ±.30   -        48.29
Ro-CLS          37.26 ±4.8   15.15 ±.00   41.15    15.15
ST-Ro           15.72 ±.11   14.06 ±.14   14.62    13.14
ST-Ro-N         15.97 ±.14   13.97 ±.19   14.81    12.95
Ens consec      -            -            -        12.50

Figure 4: Mistake rate of per-model ensembles, where the suffix CP indicates the consecutive paragraph sampling and RP the random paragraph sampling for each model. The baselines are Rand Oracle (Random Oracle), GLV_avg (average GloVe vectors), tf-idf, and BoW (Bag-of-Words). Ro-CLS is the fine-tuned CLS token for RoBERTa, and the Sentence-Transformer models are ST-Ro and ST-Ro-N, where the latter is pre-trained on the NLI task. The best performing model is Ens All (the ensemble of all models).

5.3 Text Segmentation
By generating a text segmentation over the paragraphs of a full document, the independent prediction results from the previous section can now be compared across several approaches. Specifically, we compare the paragraph-based training methods CP and RP. As an evaluation metric, we follow the related literature and adopt the P_k metric introduced by Beeferman et al. [1], which measures how often two positions k sentences apart are incorrectly classified as lying in the same or in different segments. We use the default window size of half the document length for our evaluation, again following related work. Furthermore, we count the number of explicit misclassifications, and use the accuracy acc_k of "up to k mistakes per document" as an evaluation metric. Due to the coarser nature of paragraphs and the lower number of predictions per document compared to sentence-level segmentation, this is a more illustrative metric. It also relates to the "exact match" metric EM_outline employed by Zhang et al. [45], where acc_0 = EM_outline.

Here, we also include the performance of related works where public and up-to-date code repositories are available. Specifically, we compare to the unsupervised segmentation algorithm GraphSeg [13], and to the supervised model by Koshorek et al. [21], which we dub "WikiSeg". Both approaches operate on the sentence level, though, and their predictions have to be translated back to the paragraph level for comparison of results. We train each model with the parameters suggested in their publicly available repositories. For an additional pseudo-sequential baseline, we use an informed random oracle that has a priori information on the number of topics in the document, and samples from a distribution with adjusted probability P("next section") = #sections/#paragraphs. Note that no additional parameters are learned for any model, and predictions are binarized with a simple 0.5 threshold over the same topic predictions. We provide ensembling results for the majority voting decisions over the five seed runs of each model variant (Ens), which provides further improvements. The best results are obtained by ensembling all consecutive transformer-based methods (Ens consec).

Table 3 shows the results of the evaluation, where one can see that results in the sequential segmentation are directly linked to the performance on the independent classification task seen in Table 2. To verify our initial assumption of cross-document comparability of content from similar sections, we make the following observations: (i) Evaluation performance for the STP setup is consistent for both training strategies (RP and CP) when using Sentence-Transformer models (see Table 2). (ii) Similarly, both CP- and RP-trained Sentence-Transformer segmentations achieve results within 2 percentage points of the respective P_k scores. (iii) In general, the CP training setup yields slightly better P_k scores, likely because the intra-document dependencies are captured better with this sampling strategy, which is more appropriate for the segmentation task. (iv) We find convergence problems for RP training with the [CLS] models, as well as with the tf-idf model. Due to the size of our general training corpus, we therefore conclude that it is realistic to expect topical similarity within a section, even across documents. However, due to the seemingly inconsistent convergence of RP models, we caution against blindly using this strategy, especially when dealing with more heterogeneous corpora. The oracle baseline performs unexpectedly better than both tf-idf and BoW, indicating that additional information about the sections of a document can greatly boost task performance, which might be relevant for future work. Additional pre-training of ST models (ST-Ro-N) does not
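The two evaluation measures can be stated compactly. The sketch below reflects our reading of P_k [1] over paragraph-level boundary sequences (1 marks the last paragraph of a segment) and of the mistake count underlying acc_k; the window size k is left as a parameter rather than fixed to the paper's choice:

```python
def p_k(ref, hyp, k):
    """Boundary error rate P_k over 0/1 boundary sequences.

    For each probe position i, reference and hypothesis are compared on
    whether positions i and i+k fall into the same segment (i.e., no
    boundary occurs in between); P_k is the rate of disagreements.
    """
    probes = len(ref) - k
    errors = sum(
        (sum(ref[i:i + k]) == 0) != (sum(hyp[i:i + k]) == 0)
        for i in range(probes)
    )
    return errors / probes

def acc_k(documents, k):
    """Fraction of documents with at most k boundary misclassifications.

    `documents` is a list of (ref, hyp) pairs of 0/1 boundary sequences.
    """
    ok = sum(
        sum(r != h for r, h in zip(ref, hyp)) <= k
        for ref, hyp in documents
    )
    return ok / len(documents)
```

With this convention, acc_0 counts documents segmented without any mistake, matching the exact-match reading of EM_outline noted above.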
show any significant improvement over the standard ST-Ro models.

To our surprise, the sentence-based implementations (GraphSeg and WikiSeg) show significantly lower performance, and fall even behind the simpler baselines. For GraphSeg, an unsupervised segmentation approach, the lack of explicit training on the different granularity seems to prevent correct predictions on longer segments. WikiSeg heavily preprocesses the data and discards many samples, thus significantly shrinking the training set. Since performance on the reduced training set is still decent, this indicates that training a network from scratch is not suitable with the smaller training set of a reduced corpus and tends to overfit. We expect a significant increase in performance if the training were instead performed without such strict preprocessing criteria, or by continuing fine-tuning from the pre-trained weights of a paragraph-level WikiSeg model. For either baseline model, it is also important to note that these models predict on the entirety of the sequence, which theoretically allows information sharing between different sections in the current sample. However, they show no improvement over our binary prediction setup, which does not share this information. It would be of interest to compare results to sequential transformer-based architectures, such as the one used by Glavas et al. [14]. However, their model again requires training from scratch, which has proven to be inconsistent in our experiments with WikiSeg.

Lastly, the plots of acc_k for the various models in Figure 4 indicate a correlation between the acc_k and P_k measures, which does not apply to sentence-level segmentations. Overall, the best-performing ensembles classify around 25% of documents without any mistake (acc_0), and around 70% with fewer than three mistakes (acc_2) over the entire document. We therefore suggest acc_k as an interpretable addition to the classic evaluation of segmentation approaches when dealing with paragraph-level segmentations.

6 CONCLUSION AND FUTURE WORK
Despite a multitude of previous works, structural text segmentation methods have always focused on very finely segmented text chunks in the form of sentences. In this work, we have shown that a relaxation of this problem to coarser text structures reduces the complexity of the problem, while still allowing for semantic segmentation. Further, we reformulate the oftentimes expensive-to-train sequential setup of text segmentation as a supervised Same Topic Prediction task, which reduces training time while allowing for a near-trivial generation of samples from automatically crawled text documents. To show the applicability of our method, we present a new domain-specific and large corpus of online Terms-of-Service documents, and train transformer-based models that vastly outperform a number of text segmentation baselines.

We are currently investigating the setup for deeper hierarchical sections, for which our dataset already contains annotations, to see whether such notions can also be picked up by an independent classifier and benefit a legal retrieval system. Also, the findings from our Consecutive Paragraph model already indicate that training requires no further information than the ground truth segmentation, which can generally be inferred from structured input formats, such as HTML or XML, making this an attractive option for a larger-scale study of cross-domain document collections. Finally, an interface built on top of our framework, enabling users to judge the usefulness of segmentation for legal use cases, such as a collection of documents from mergers and acquisitions, could be used to determine the efficacy of our improved segmentation.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their insightful comments.

REFERENCES
[1] Doug Beeferman, Adam L. Berger, and John D. Lafferty. 1999. Statistical Models for Text Segmentation. Mach. Learn. 34, 1-3 (1999), 177–210.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.
[3] Luther Karl Branting. 2017. Automating Judicial Document Analysis. In Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 16th International Conference on Artificial Intelligence and Law (ICAIL 2017), London, UK, June 16, 2017 (CEUR Workshop Proceedings, Vol. 2143), Kevin D. Ashley, Katie Atkinson, Luther Karl Branting, Enrico Francesconi, Matthias Grabmair, Marc Lauritsen, Vern R. Walker, and Adam Zachary Wyner (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2143/paper2.pdf
[4] Thorsten Brants, Francine Chen, and Ioannis Tsochantaridis. 2002. Topic-based Document Segmentation with Probabilistic Latent Semantic Analysis. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA. ACM, 211–218.
[5] Harr Chen, S. R. K. Branavan, Regina Barzilay, and David R. Karger. 2009. Global Models of Document Structure using Latent Permutations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, 2009, Boulder, Colorado, USA. 371–379.
[6] Freddy Y. Y. Choi. 2000. Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (Seattle, Washington) (NAACL 2000). ACL, USA, 26–33.
[7] Freddy Y. Y. Choi. 2000. Advances in Domain Independent Linear Text Segmentation. In 6th Applied Natural Language Processing Conference, ANLP, Seattle, Washington, USA, 2000. ACL, 26–33.
[8] Jack G. Conrad, Khalid Al-Kofahi, Ying Zhao, and George Karypis. 2005. Effective Document Clustering for Large Heterogeneous Law Firm Collections. In The Tenth International Conference on Artificial Intelligence and Law, Proceedings of the Conference, June 6-11, 2005, Bologna, Italy, Giovanni Sartor (Ed.). ACM, 177–187. https://doi.org/10.1145/1165485.1165513
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Minneapolis, USA, 2019, Volume 1. 4171–4186.
[10] Satya Dharanipragada, Martin Franz, J. Scott McCarley, Salim Roukos, and Todd Ward. 1999. Story Segmentation and Topic Detection for Recognized Speech. In Sixth European Conference on Speech Communication and Technology, EUROSPEECH 1999, Budapest, Hungary. ISCA.
[11] Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. 2019. Topic Modeling in Embedding Spaces. CoRR abs/1907.04907 (2019). arXiv:1907.04907
[12] Liat Ein Dor, Yosi Mass, Alon Halfon, Elad Venezian, Ilya Shnayderman, Ranit Aharonov, and Noam Slonim. 2018. Learning Thematic Similarity Metric from Article Sections Using Triplet Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). ACL, Melbourne, Australia, 49–54.
[13] Goran Glavas, Federico Nanni, and Simone Paolo Ponzetto. 2016. Unsupervised Text Segmentation Using Semantic Relatedness Graphs. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, *SEM@ACL, Berlin, Germany, 2016. The *SEM 2016 Organizing Committee.
[14] Goran Glavas and Swapna Somasundaran. 2020. Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation. CoRR abs/2001.00891 (2020). arXiv:2001.00891
[15] Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. 2004. Integrating Topics and Syntax. In Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, 2004, British Columbia, Canada]. 537–544.
[16] Zellig S. Harris. 1954. Distributional Structure. WORD 10, 2-3 (1954), 146–162.
[17] Marti A. Hearst. 1994. Multi-Paragraph Segmentation of Expository Text. In 32nd Annual Meeting of the Association for Computational Linguistics, 1994, Las Cruces,
New Mexico, USA, Proceedings. ACL, 9–16.
[18] Marti A. Hearst. 1997. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Comput. Linguist. 23, 1 (March 1997), 33–64.
[19] Thomas Hofmann. 2017. Probabilistic Latent Semantic Indexing. SIGIR Forum 51, 2 (2017), 211–218.
[20] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate Detection Using Shallow Text Features. In Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM, New York, USA, 2010. ACM, 441–450.
[21] Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant. 2018. Text Segmentation as a Supervised Learning Task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, 2018, Volume 2 (Short Papers). ACL, 469–473.
[22] Hideki Kozima. 1993. Text Segmentation Based on Similarity between Words. In 31st Annual Meeting of the Association for Computational Linguistics, 1993, Ohio State University, Columbus, Ohio, USA, Proceedings. ACL, 286–288.
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692
[24] Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, and William Keenan. 2011. Legal document clustering with built-in topic segmentation. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, Craig Macdonald, Iadh Ounis, and Ian Ruthven (Eds.). ACM, 383–392. https://doi.org/10.1145/2063576.2063636
[25] Alex Lyte and Karl Branting. 2019. Document Segmentation Labeling Techniques for Court Filings. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 17th International Conference on Artificial Intelligence and Law (ICAIL 2019), Montreal, QC, Canada, June 21, 2019 (CEUR Workshop Proceedings, Vol. 2385), Kevin D. Ashley, Katie Atkinson, Luther Karl Branting, Enrico Francesconi, Matthias Grabmair, Bernhard Waltl, Vern R. Walker, and Adam Zachary Wyner (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2385/paper5.pdf
[26] Igor Malioutov and Regina Barzilay. 2006. Minimum Cut Model for Spoken Lecture Segmentation. In ACL, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 2006. ACL.
[27] Eneldo Loza Mencía. 2009. Segmentation of legal documents. In The 12th International Conference on Artificial Intelligence and Law, Proceedings of the Conference, June 8-12, 2009, Barcelona, Spain. ACM, 88–97. https://doi.org/10.1145/1568234.1568245
[28] Nada Mimouni. 2013. Modeling Legal Documents as Typed Linked Data for Relational Querying. In Proceedings of the First JURIX Doctoral Consortium and Poster Sessions in conjunction with the 26th International Conference on Legal Knowledge and Information Systems, JURIX 2013, Bologna, Italy, December 11-13, 2013 (CEUR Workshop Proceedings, Vol. 1105), Monica Palmirani and Giovanni Sartor (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-1105/paper6.pdf
[29] Hemant Misra, François Yvon, Olivier Cappé, and Joemon M. Jose. 2011. Text Segmentation: A Topic Modeling Perspective. Inf. Process. Manag. 47, 4 (2011), 528–544.
[30] Hemant Misra, François Yvon, Joemon M. Jose, and Olivier Cappé. 2009. Text segmentation via Topic Modeling: an Analytical Study. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM, Hong Kong, China, 2009. 1553–1556.
[31] Marie-Francine Moens. 2001. Innovative techniques for legal text retrieval. Artif. Intell. Law 9, 1 (2001), 29–57.
[32] Christopher E. Moody. 2016. Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. CoRR abs/1605.02019 (2016). arXiv:1605.02019
[33] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. 1532–1543.
[34] Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2019. Using Clustering Techniques to Identify Arguments in Legal Documents. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 17th International Conference on Artificial Intelligence and Law (ICAIL

Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, Hong Kong, China. ACL, 3980–3990.
[37] Martin Riedl and Chris Biemann. 2012. How Text Segmentation Algorithms Gain from Topic Models. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, 2012, Montréal, Canada. ACL, 553–557.
[38] Martin Riedl and Chris Biemann. 2012. TopicTiling: A Text Segmentation Algorithm based on LDA. In Proceedings of the Student Research Workshop of the 50th Meeting of the Association for Computational Linguistics. Republic of Korea, 37–42.
[39] Masao Utiyama and Hitoshi Isahara. 2001. A Statistical Model for Domain-Independent Text Segmentation. In Association for Computational Linguistics, 39th Annual Meeting and 10th Conference of the European Chapter, Proceedings of the Conference, 2001, Toulouse, France. ACL, 491–498.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA. 5998–6008.
[41] Hanna M. Wallach. 2006. Topic modeling: Beyond Bag-of-Words. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML), Pittsburgh, Pennsylvania, USA, 2006 (ACM International Conference Proceeding Series, Vol. 148). 977–984.
[42] Hannes Westermann, Jaromír Savelka, Vern R. Walker, Kevin D. Ashley, and Karim Benyekhlef. 2020. Sentence Embeddings and High-Speed Similarity Search for Fast Computer Assisted Annotation of Legal Documents. In Legal Knowledge and Information Systems - JURIX 2020: The Thirty-third Annual Conference, Brno, Czech Republic, December 9-11, 2020 (Frontiers in Artificial Intelligence and Applications, Vol. 334), Villata Serena, Jakub Harasta, and Petr Kremen (Eds.). IOS Press, 164–173. https://doi.org/10.3233/FAIA200860
[43] Ross Wilkinson. 1994. Effective Retrieval of Structured Documents. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994 (Special Issue of the SIGIR Forum). ACM/Springer, 311–317.
[44] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019).
[45] Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, and Xueqi Cheng. 2019. Outline Generation: Understanding the Inherent Content Structure of Documents. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, Paris, France, 2019. 745–754.
2019), Montreal, QC, Canada, June 21, 2019 (CEUR Workshop Proceedings, Vol. 2385),
Kevin D. Ashley, Katie Atkinson, Luther Karl Branting, Enrico Francesconi,
Matthias Grabmair, Bernhard Waltl, Vern R. Walker, and Adam Zachary Wyner
(Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2385/paper2.pdf
[35] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. 2018. Language Models are Unsupervised Multitask Learners. (2018).
[36] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings
using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical
11
Precedential Constraint: The Role of Issues
Trevor Bench-Capon and Katie Atkinson
Department of Computer Science, University of Liverpool
Liverpool, United Kingdom
{tbc,katie}@liverpool.ac.uk
ABSTRACT
Horty, Rigoni and Prakken have developed formal characterisations of precedential constraint based on dimensions and factors as introduced in HYPO and CATO. We discuss the relation between dimensions and factors and also describe the current models of precedential constraint based on factors, along with some criticisms of them. We argue that problems arise from ignoring the structure of legal cases that is provided by the notion of issues, and that seeing precedential constraint in terms of issues rather than whole cases provides a more effective approach and better reflects legal practice. The advantages of the issue based approach are illustrated with a concrete example. We then discuss how dimensions should be accommodated, suggesting that this is best done by seeing reasoning with legal cases as a two stage process: first factors are ascribed to cases, and then factor based reasoning can be used to arrive at a decision. Thus precedential constraint can be described in terms of factors, dimensions being handled at the first stage. Both stages are constrained, in different ways, by precedents. We identify three types of precedent: framework precedents, which structure cases into issues; preference precedents, which resolve conflicts between opposing sets of factors within these issues; and ascription precedents, which constrain the mapping from facts to factors.

CCS CONCEPTS
• Applied computing → Law.

KEYWORDS
reasoning with precedents, factors, dimensions, issues

ACM Reference Format:
Trevor Bench-Capon and Katie Atkinson. 2021. Precedential Constraint: The Role of Issues. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466062

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466062

1 INTRODUCTION
Reasoning with precedent cases has been a central concern of AI and Law since the very beginning. At least three questions can be posed in relation to reasoning with precedent cases: (1) how do people reason with precedents? (2) can we use precedents to predict the outcome of new cases? and (3) can we formally characterise how precedents constrain future cases?

An early project addressing precedential reasoning was the HYPO project of Rissland and Ashley, introduced at the first ICAIL [43] and most fully described in [6]. HYPO modelled reasoning with precedents in the US Trade Secrets domain. It influenced a great deal of research by a number of different researchers, as discussed in [10], including the CATO system of Ashley and Aleven, introduced in [4] and most fully described in [3]. CATO also addressed US Trade Secrets. Both HYPO and CATO were concerned with the first of our questions: their goal was to show how arguments concerning new cases can be constructed on the basis of precedent cases, and how such arguments can be challenged by distinguishing the cited precedents. These systems presented arguments for and against particular decisions, but did not attempt to choose between them: that was left to the judgement of the user.

In contrast, systems based on rules, whether based on expert knowledge [49], on a formalisation of legislation [47], or on a combination of the two [8], were able to predict the outcome of a new case entered into the system, answering our second question. It was therefore a natural development to adapt systems such as CATO to offer predictions based on reasoning with precedents. This was done in the Issue Based Prediction system (IBP) [21], in which arguments generated from CATO were organised and evaluated so as to predict an outcome. Subsequently Grabmair further developed this approach to accommodate his value judgement formalism [24]. Predictions based on precedent continue to be implemented in both symbolic systems [2] and machine learning (ML) systems such as [35], which base their predictions on large collections of case decisions. Factor based reasoning is acquiring an important new role in explaining the predictions of ML systems (e.g. [19] and [38]).

The reasoning in HYPO and CATO was embodied in algorithms rather than expressed declaratively and so was not readily amenable to formalisation to address the third question. This situation was changed when Prakken and Sartor provided a means of expressing a case base of precedents as a set of rules and priorities between them [39]. The resulting rule base could then be deployed to predict the outcome of a new case. Further, this laid the foundations for the provision of a formal account of precedential constraint¹. The work was begun by Horty [26], using a factor based representation taken from [6] and [3]. His approach was developed in [30] and extended by Rigoni in [41]. However, it became recognised that factors were not sufficient to capture all the necessary nuances of precedents: some aspects of cases can favour a party to different extents. The need to address dimensions was argued in [15] and

¹ These formal accounts consider that a decision is constrained if any other decision would be inconsistent with past decisions. In practice this constraint may not be respected in given judicial settings. For a jurisprudential discussion see [46].
addressed by Horty in [27], [28] and [29] and by Rigoni in [42]. A comparison of the approaches of Horty and Rigoni is given in [37].

In this paper we will address the question of how precedents constrain decisions in new cases, and in particular identify how domain knowledge can complement the purely formal characterisations. Section 2 reviews the use of dimensions and factors in HYPO and CATO to clarify their different roles: whereas dimensions identify the aspects of cases which must be considered, factors record their legal significance in the particular case by identifying the party favoured by that aspect. Section 3 gives an overview of the formalisations of precedential constraint using factors. In Section 4 we show how these approaches can be improved by exploiting the structure found in legal cases. Section 5 considers how to accommodate dimensions, by considering precedential reasoning as a two stage process. First factors are ascribed on the basis of dimensional facts, and then these factors supply the reasons to resolve the issues, and hence constrain the overall decision. Different precedents are relevant at each stage: some constrain the ascription of factors while others constrain the preferences between sets of factors.

The contributions of the paper are: improvement in the formal characterisation of precedential constraint, both in terms of effectiveness and in reflecting actual decisions, by applying it to issues rather than whole cases; clarification of the role of dimensions by articulating the reasoning process into two distinct stages; and identification of the need to recognise that precedents operate differently at the two stages. Throughout the paper we use examples from US Trade Secrets cases, the most widely discussed domain for reasoning with precedents in AI and Law: as well as HYPO and CATO it has been used in [21], [22], [2], [24], [13], [36], [50] and [38], among many others.

2 DIMENSIONS AND FACTORS
To relate formal work on precedential constraint to actual legal cases, it is important to have a clear understanding of factors and dimensions and the relationship between them. The terms have been used in different ways, but we will consider dimensions as used in HYPO and factors as used in CATO, discussed by Rissland and Ashley in [44]. This is the most common use, and HYPO and CATO were explicitly identified by Horty in [26] and [30] as the source of the factors used in his formal account of precedential constraint, which is the starting point for subsequent discussions of this topic. Moreover, both HYPO dimensions and CATO factors resulted from thorough domain analyses. Most of the many systems addressing US Trade Secrets have taken both the analysis of the domain and the ascription of factors to cases from CATO [3].

In HYPO cases are represented as collections of facts (see Appendix B of [6]). There are thirteen implemented dimensions (Appendix F of [6]) which may be applicable to a case on the basis of these facts. In general a dimension can take a range of values, but in fact ten of the thirteen were two-valued. A list of HYPO's dimensions, summarising Appendix F of [6], is given in Table 1.

Dimensions identify the aspects of cases which need to be considered to see if they are applicable:

    Each dimension has prerequisites that must be satisfied in order for the dimension to be applicable. For example, the dimension Secrets-Voluntarily-Disclosed has as one of its prerequisites that the plaintiff made disclosures of confidential information to outsiders. ([44], p 67)

Although disclosures to outsiders may be a reason to find for the defendant, the lack of disclosures was never found to be used as a reason to find for the plaintiff in the analysed cases [44]: because the plaintiff is expected to take measures to protect the secret, simply refraining from disclosure seems not to strengthen the plaintiff's case. Thus this dimension is not applicable if no disclosures were made. Typically only a few dimensions will be applicable in any given case: in HYPO four or five is typical.

Table 1: Dimensions in HYPO and their corresponding CATO factors. Dimension and Factor IDs are D or F for dimension or factor (factor numbers are those in CATO) followed by p, d, or b to indicate whether it can favour plaintiff, defendant or both.

ID    Dimension                                     Values                                    Number     Plaintiff factors   Defendant factors
                                                                                              of values  in CATO             in CATO
D1p   Competitive Advantage Gained                  Computed from development time and cost   Many       F8p
D2d   Vertical Knowledge                            Vertical or technical                     2                              F11d
D3d   Secrets Voluntarily Disclosed                 Number of disclosures                     Many                           F10d, F27d
D4d   Disclosures Subject to Restriction            Yes or No                                 2          F12p
D5p   Agreement Supported by Consideration          Something or Nothing                      2
D6p   Common Employee Paid to Change Employers      Something or Nothing                      2                              F2d
D7p   Exists Express Noncompetition Agreement       Yes or No                                 2          F13p
D8p   Common Employee Transferred Product Tools     Something or Nothing                      2          F7p
D9p   Non-Disclosure Agreement Re Defendant Access  Yes or No                                 2          F4p
D10d  Common Employee Sole Developer                Yes or No                                 2                              F3d
D11d  Non-Disclosure Agreement Specific             Yes or No                                 2                              F5d
D12d  Disclosure in Negotiations with Defendant     Yes or No                                 2                              F1d
D13b  Security Measures                             Range of possible measures                8          F6p                 F19d

2.1 From Dimensions to Factors
Even if applicable, the value on the dimension may be such that it does not favour either party; the dimension may be neutral in the particular case. Applicable dimensions must be assessed for their legal significance for the particular case, that is, whether they favour a party, and if so, which one. This significance is shown by ascribing a factor to the case. A factor is present if the case lies within a range on a dimension which favours a particular side. At one end the dimension will either be inapplicable, because it does not affect the strength of a side's case, or it will favour a particular side. Moving along the dimension we may enter a neutral area favouring neither side, and then an area which favours the other side. In practice many dimensions have only two points and either favour a particular side or are inapplicable. For a many-valued dimension, such as D3d, if sufficient disclosures to provide a reason for the defendant were made, then the corresponding factor (F10d) applies. It may be, however, that too few disclosures were made to favour the defendant (e.g. Emery v Marcan: "Even though parts drawings may on occasion have been shown to a limited number of outsiders for a particular purpose, this did not in itself necessarily destroy the secrecy which protected them."). Here no factor will apply (although the dimension remains applicable if any disclosures were made, because whether the factor should be ascribed needs to be considered). The point about neutrality is made in [44]:

    Note that CATO does not automatically treat the fact that a factor does not apply to a case as a strength for the opponent. ([44], pp 68-9)

As can be seen from Table 1, only one dimension, D13b, Security Measures, was seen as capable of favouring both sides.

    [...] the Security-Measures dimension was broken into two factors: Security-Measures [F6p], favoring the plaintiff, and No-Security-Measures [F19d], favoring the defendant. This was done because judges explicitly said that the fact that plaintiff had taken no security measures was a positive strength for the opponent. By contrast, Ashley and Aleven did not create a "No-Secrets-Disclosed-Outsiders" factor because they found no cases where judges had said that the absence of any disclosures to outsiders was a positive strength for the plaintiff. ([44], p 69)

Thus the security measures dimension is always applicable, although it is possible that neither F6p nor F19d is present: the plaintiff may have taken sufficient measures to prevent the lack of concern being a strength for the defendant, but without sufficient rigour to be a reason to find for the plaintiff. Thus, although it is always relevant to consider the security measures taken, in many cases there will be no legal significance. Indeed many cases in CATO [3] do not have either F6p or F19d. Of the thirteen dimensions in HYPO, ten, the two-valued dimensions, are either inapplicable or favour a particular side. Of the three multi-valued dimensions, two are considered, if applicable, to be either neutral or capable of favouring only one party (defendant for disclosures, and plaintiff for competitive advantage). Only security measures is capable of favouring either side, or being neutral.

Note, however, that in one case, disclosures to outsiders, there are two pro-defendant factors associated with the dimension. As well as F10d, SecretsDisclosedOutsiders, we also have F27d, DisclosureInPublicForum. This is because F27d provides a much stronger reason for the defendant than F10d, so that it might be that a plaintiff factor such as F12p, OutsiderDisclosuresRestricted, would defeat F10d but not F27d. Thus a dimension may give rise to multiple factors favouring the same side.

This understanding of dimensions and factors shows why it is a mistake to speak of the "negations" of base level factors, as in some recent formally oriented approaches (e.g. [50], [37]). CATO used two distinct factors for the rare case where a dimension could favour either side. Moreover, if a factor is absent, a different factor favouring that side may be present, as with disclosures. Thus the absence of F10d might mean that no disclosures had been made, so that the dimension was inapplicable; that too few disclosures had been made, meaning that the dimension was not legally significant in this case; or that disclosures had been made in a public forum, so that the stronger F27d was present. There seems little sense in wrapping these three quite different notions under the "negation" of F10d. Nor is negation needed to distinguish cases where a base level factor is known absent from those where there is no information about that factor. If a base level factor is not mentioned in the opinion, it played no role in the decision, and hence can safely be considered absent. The absence of a base level factor does not provide a reason for the other side, and so its absence will be mentioned only when its presence was considered but rejected because the case fell into a neutral area on an applicable dimension.

In addition to the fourteen factors derived from the HYPO dimensions in Table 1, CATO introduced another twelve factors. This is because CATO analysed considerably more cases than HYPO and seems to have included more cases questioning whether the information was a trade secret rather than whether there was a confidential relationship. In Table 2 we have related these additional factors to dimensions in the manner of Table 1. These additional factors can be accommodated in seven dimensions, only one of which, D14b, has both plaintiff and defendant factors. Three have multiple factors for the same side. Four are two-valued. The mix is similar to that found in HYPO, and so may be considered typical.

That the additional cases analysed by CATO led to additional factors and dimensions is an indication of how precedent cases are the source of dimensions and factors. The opinions in precedents show what aspects of the cases judges thought relevant, and what, if any, significance they accorded to them in that case. Any given case will only have a few applicable dimensions, and so will only contain a small subset of possible factors. Therefore as we analyse more cases we are likely to encounter more dimensions and more distinctions, and hence more factors.

Table 2: Factors Introduced in CATO Organised into Dimensions. See Table 3 for factor names.

ID    Dimension                     Values                          Number     Plaintiff factors   Defendant factors
                                                                    of values  in CATO             in CATO
D14b  Use of Available Information  Various types of use            Many       F14p                F16d, F25d, F17d
D15p  Similarity of Products        Degrees of similarity           Many       F15p, F18p
D16d  Availability of Information   Various forms of availability   Many                           F20d, F24d
D17p  Invasive Techniques           Yes or No                       2          F22p
D18p  Obtained by Deception         Yes or No                       2          F26p
D19d  Confidentiality Waived        Yes or No                       2                              F23d
D20p  Knew Confidential             Yes or No                       2          F21p

2.2 Arguing with Factors
HYPO and CATO were not concerned with determining or predicting outcomes, but rather with the identification of arguments for the two parties. These arguments were organised in the "three ply" structure common in law (e.g. US Supreme Court Oral Argument, and witness testimony, which follows the initial questions with a cross examination and a redirect). In this structure an outcome is proposed, a response is made by the other side, followed by a rebuttal from the original side. For reasoning with precedents with the proponent arguing for the plaintiff, these three plies in CATO are:

(1) Cite the precedent case with a decision for the desired side which has the most factors in common and fewest distinguishing factors compared with the current case. The side favoured by the factors does not matter.
(2) The opponent may distinguish the cited case. Typically the new case will not contain exactly the same factors as the precedent. Some of these differences will make the case stronger for the plaintiff: plaintiff factors in the current case but not the precedent, and defendant factors in the precedent but not the current case. The defence will be wise to remain silent as to these differences. If, however, the precedent contains plaintiff factors not in the current case, or defendant factors in the current case but not the precedent, the differences may be significant and so provide an argument not to follow the cited precedent².
(3) The proponent may now attempt a rebuttal: downplaying distinctions by citing factors favouring the plaintiff (the differences the defendant could not use in the second ply).

Assuming that the opponent was able to make some distinctions in the second ply, it is now up to the user to decide whether, given the rebuttal, the distinctions are of sufficient weight to merit an outcome different from the precedent case.

This method of arguing with precedents is the basis of the formal characterisations of precedential cases discussed in the next section. Precedents are converted into sets of rules with conjunctions of factors as antecedents. These rules constrain a new case if there is a rule applicable to the new case which finds for a particular side (ply 1), which cannot be distinguished (ply 2), and which is preferred to any applicable rule favouring the other side (ply 3).

² In HYPO and CATO the opponent can also cite counter examples in this ply, but we will not discuss counter examples in this paper.

3 MODELS OF PRECEDENTIAL CONSTRAINT
HYPO and CATO were realised as programs, with the knowledge represented as particular data structures (e.g. case frames in HYPO), and the operation of the reasoning defined in terms of algorithms manipulating these structures (the algorithms for CATO are given in Appendix 3 of [3]). As such, reasoning with cases was not readily amenable to logical analysis until Prakken and Sartor provided a means of expressing precedent cases as sets of rules [39]. Since the factors for the plaintiff provide a reason to find for the plaintiff and the factors favouring the defendant a reason to find for the defendant, the decision in the case can be seen as expressing a preference for one of these reasons. The conjunction of all the factors for a side is the strongest reason for that side, so the precedent can be modelled as a set of three rules expressing that the strongest reason for the winner was preferred to the strongest reason for the loser. Where the case comprises a set of factors P ∪ D, where P is the set of plaintiff factors and D the set of defendant factors, the three rules are:

    r1: P → plaintiff
    r2: D → defendant
    r3: r2 ≺ r1 if the decision was for the plaintiff, and r1 ≺ r2 if the decision was for the defendant.

If we represent all the precedents in the domain using this technique, we can build a logical theory representing our case base of precedents. If we are given a new case, we can see whether the rules apply to it, and if so whether an outcome is determined by the current theory. A distinction will mean that the winner's rule does not apply or that the loser may have a stronger rule. This representation was used in [9], in which the possible sets of plaintiff factors were represented as a partial order, the possible sets of defendant factors were represented as a partial order, and the precedents as ordering relations between these two partial orders. The nodes contain all the possible antecedents for plaintiff and defendant rules and the arcs show the priorities between particular rules. The example from [9] is shown in Figure 1. Deciding a new case is now a matter of adding an arc between the two relevant nodes representing the factors in the new case and deciding which way the arrow should point. The constraint is that the arrow should not introduce a cycle, since this would introduce an inconsistency to the case base. Thus a case which could introduce a cycle is constrained, but if no cycle can result, the judge is free to decide either way.

This idea was refined and presented in a more rigorous way by Horty in [26] and further refined in [30]. Horty was interested in modelling two different accounts of precedential constraint from the jurisprudence literature. One is a very strict version, for which Horty cites [5]. Here any distinction between the precedent and the current case is enough to allow the judge to come to a different decision. This version, which corresponds to Figure 1, is now normally termed the results model in AI and Law [37]. This model encodes precedents as rules in the same way as [39] and [9]. Any weakening
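The rule-and-priority representation described in this section lends itself to a direct implementation. The following is a minimal Python sketch (our own illustration; `forced_outcome` and the data layout are assumptions of ours, not code from the paper or the cited formalisations) of an a fortiori constraint check: a precedent decided for one side forces the same outcome on any new case that is at least as strong for that side, i.e. where the winner's rule applies and the loser has gained no new factors.

```python
# Sketch only: factor-set precedents and an a fortiori constraint test.
# Data layout and names are illustrative, not taken from the paper.

def forced_outcome(casebase, new_p, new_d):
    """Return 'plaintiff', 'defendant', or None for a new case with
    plaintiff factors new_p and defendant factors new_d.

    A precedent (P, D, winner) forces its outcome when the new case is
    at least as strong for the winner: every winning-side factor of the
    precedent recurs, and the losing side has gained no new factors.
    """
    for p_factors, d_factors, winner in casebase:
        if winner == 'plaintiff' and p_factors <= new_p and new_d <= d_factors:
            return 'plaintiff'
        if winner == 'defendant' and d_factors <= new_d and new_p <= p_factors:
            return 'defendant'
    return None

# The two precedents of Example 2 below, both decided for the plaintiff:
casebase = [
    ({'F6p'}, {'F16d'}, 'plaintiff'),   # security measures outweigh reverse engineerability
    ({'F12p'}, {'F10d'}, 'plaintiff'),  # restrictions outweigh the disclosures made
]

# A case combining all four factors is distinguished from both
# precedents, so neither forces an outcome:
print(forced_outcome(casebase, {'F6p', 'F12p'}, {'F10d', 'F16d'}))  # None
```

On this reading the combined case is left unconstrained, which is the weakness the issue based treatment of Section 4 is intended to address.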
So the new case becomes [F15p, F10d, F27d] and r2 will match. This will work from a logical perspective, although care must be taken to use the factor actually present in any explanation.

The fourth criticism is simply that even the reason model does not constrain enough cases. This is because no account is taken of whether the distinction is sufficient to overturn the rule. The reason model as formalised above accepts any distinction, whereas in CATO distinctions can be rejected through downplaying, through substitution and cancellation [40] of factors. The point is clear in the following example.

Example 2. Here we have two cases, both found for the plaintiff. In case 1 the plaintiff took security measures (F6p) and, although the defendant claimed the information was reverse engineerable (F16d), the plaintiff won⁵. In the second case we have the plaintiff making disclosures (F10d), but restricting these disclosures (F12p). Here the preference for the plaintiff is clear. We therefore have the following rules:

    r1: F6p → plaintiff        r4: F12p → plaintiff
    r2: F16d → defendant       r5: F10d → defendant
    r3: r2 ≺ r1                r6: r5 ≺ r4

Now consider a case with all four factors: [F6p, F12p, F10d, F16d]. It seems this should clearly be found for the plaintiff: however, we cannot apply r1 because of the distinction F10d, and we cannot apply r4 because F16d distinguishes. Although neither distinction would be significant in CATO, being cancelled for both precedents by the other factors available, the reason model gives the case as unconstrained. We will address this problem in the next section.

4 THE IMPORTANCE OF ISSUES
The fourth problem discussed in the previous section arises because the reason model considers cases as unstructured bundles of factors, so that a difference which should not be considered significant prevents us from applying the rule which should constrain the case. We can use knowledge of the domain structure to solve this problem. That we do not exploit the full power of our precedents if we consider whole cases was noticed by Branting in [17]:

    combining portions of multiple precedents can permit new cases to be resolved that would be indeterminate if new cases could only be compared to entire precedents. ([17], Abstract)

When Brüninghaus and Ashley adapted CATO to predict cases in IBP [21] they structured the cases around issues, as did Grabmair in his prediction system VJAP [24]. Grabmair reports an improvement over IBP through the use of values, but values raise several additional questions, such as the extent to which they are promoted by different factors and whether value preferences are global or local to issues. Since precedential constraint with values has not yet been given a formal characterisation, we will restrict our consideration in this paper to factors. Issues are a well known concept in law: many law schools teach the Issue-Rule-Application-Conclusion (IRAC) method (or some variant) as a way of analysing legal cases. The IRAC method was applied to the explanation of factor based reasoning in [11]. A key point about IRAC is that the rule (the reason in the sense of the model above) relates to an issue, not to the case as a whole. The issues in US Trade Secrets Law, taken from the Restatement of Torts, were used to group related factors together for IBP [21] and VJAP [24]. These issues were also the basis for the factor hierarchy in [3] and the Abstract Dialectical Framework in [2]. A similar structure is given by Rigoni's framework precedents, and we would argue that the role of framework precedents is to identify the issues in a domain. Other systems, such as CABARET [48], derive their framework of issues from statutes. Unlike factors, issues can be seen as standing in a logical relation to the outcome. To find for the plaintiff, it must be shown both that the information was a trade secret and that it was misappropriated. A trade secret must both be valuable and the information not generally known. For the information to be misappropriated, it must be either that improper means were used, or that the information was used in breach of a confidential relationship. We can express this as the following non-defeasible rules:

    ROT1: TradeSecret ∧ Misappropriated ↔ plaintiff
    ROT2: InfoValuable ∧ SecrecyMaintained ↔ TradeSecret
    ROT3: ImproperMeans ∨ (InfoUsed ∧ ConfidentialRelationship) ↔ Misappropriated

In [7] these issues are used to group CATO's factors as shown in Table 3⁶. Even though some factors appear under two issues, the issues contain only five to seven factors, greatly reducing the possible combinations of relevant factors. That using issues rather than whole cases to constrain decisions will enable us to decide more cases is evidenced by [21]. The issue based IBP was able to reach a prediction in 99.5% of cases, as opposed to the 73.1% achieved by a system considering cases as a whole.

This suggests that instead of describing cases simply as a set of factors, we should distribute these factors across the issues they relate to. Note also that structuring into issues is an implicit feature of rule based systems such as [47] and [2]. We can now apply the methods of precedential constraint developed in [27] and [37] not at the case level, but at the issue level. To see the difference this makes, we will consider a set of cases⁷, taken from [3] and used in [22], shown in Table 4. We have not re-analysed the decisions: the factors for each case are taken from Table II of [22].

Notice that, in all these cases, some issues are uncontested. It seems that we can regard the information as a trade secret, unless argued otherwise, and that there is a presumption that the information was used. On the other hand, the plaintiff needs to establish that improper means were used or that a confidential relationship existed. To find the issue and rule in the case we look at the contested issues, and which factors led to the outcome. This is the method used to identify the rule and resolve the issue when applying the IRAC methodology in [11].

In the next sections we will illustrate the use of the standard reason model followed by the use of the proposed issue based

⁵ This preference for F6p was used in Mason v. Jack Daniel Distillery, 518 So.2d 130 (Ala.Civ.App.1987): "courts have protected information as a trade secret despite evidence that such information could be easily duplicated by others competent in the given field. KFC Corp. v. Marion-Kay Co., 620 F. Supp. 1160 (S.D.Ind. 1985); Sperry
⁶ In [24] Grabmair associated factors to issues a little differently. This does not, however, affect any of the factors in our example below, and so we follow [7] here.
⁷ National Instrument Labs, Inc. v. Hycel, Inc., 478 F.Supp. 1179 (D.Del.1979), M. Bryce & Associates, Inc. v. Gladstone, 107 Wis.2d 241, 319 N.W.2d 907 (Wis.App.1982), K & G Oil Tool & Service Co. v. G & G Fishing Tool Serv., 314 S.W.2d 782 (1958), Televation Telecommunication Systems, Inc. v. Saindon, 522 N.E.2d 1359 (Ill.App. 2 Dist. 1988), Mason v. Jack Daniel Distillery, 518 So.2d 130 (Ala.Civ.App.1987) and The Boeing
Rand Corp. v. Rothlein, 241 F. Supp. 549 (D.Conn. 1964)”. Company v. Sierracin Corporation, 108 Wash.2d 38, 738 P.2d 665 (1987).
17
Precedential Constraint: The Role of Issues ICAIL’21, June 21–25, 2021, São Paulo, Brazil
reason model. We will see that when using issues, more cases are constrained because distinctions relating to issues unrelated to that governed by a rule no longer distinguish that rule, and are relevant only if they constrain that other issue so as to lead to a different outcome.

4.1 Using the reason model

Suppose our first case8 is National Instruments. As can be seen from Table 4, the case turned on whether there was a confidential relationship, given that the plaintiff had made disclosures in negotiations (F1d). The defendant, however, did know that the information was confidential (F21p), and the court found for the plaintiff. The reason model then gives the three rules:

NatInstP: F21p → plaintiff
NatInstD: F1d → defendant
NatInstO: NatInstD ≺ NatInstP

If the next case is Bryce, we can see that it is constrained by these rules: the additional factors are not distinctions because both favour the plaintiff, and so do not give the defendant anything better than NatInstD, which is defeated by NatInstP, which also applies to Bryce. Bryce thus adds no new rules.

Now suppose we are presented with K and G. Here the plaintiff argues that improper means were used, because the defendant used restricted materials (F14p). The defendant counters this by a claim to have reverse engineered the information (F25d). Moreover the defendant argues that the information is not a trade secret because it was reverse engineerable (F16d)9. This is in turn countered by the claim that the uniqueness of the product (F15p) suggests that the information was not readily reverse engineerable. In the judgement both the issues were decided in favour of the plaintiff, since the reverse engineering had made use of restricted materials. Note that there was no need to decide the breach of confidence issue: improper means suffice to establish misappropriation. Although NatInstP applies, it cannot be used because the defendant has stronger rules than NatInstD. Thus the reason in K and G must cover two different issues, InfoValuable and ImproperMeans, and so we get rules spanning both these issues:

KGP: F15p and F14p → plaintiff
KGD: F16d and F25d → defendant

8 The sequencing of the cases used here is for the purposes of illustrating our approach, and is not the actual sequence.
9 Both F25d and F16d were introduced in CATO and relate to the same dimension, D14b. However, F25d, that the information was actually reverse engineered, relates to the issues of whether the information was used and whether improper means were used, whereas F16d, the possibility of reverse engineering, relates to whether the information was valuable and hence a trade secret.
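The factor-based reason model used above has a compact computational reading. The sketch below loosely follows the factor-based account of precedential constraint in [30]: a precedent decided for the plaintiff forces the plaintiff outcome in a new case whenever the precedent's pro-plaintiff rule still applies and the defendant has nothing beyond what already lost. The extra pro-plaintiff factors given to the Bryce-like case (F6p, F12p) are illustrative assumptions, not taken from Table 4.

```python
# A minimal sketch of factor-based precedential constraint (reason model),
# loosely after [30]: a precedent decided for the plaintiff with
# pro-plaintiff factors P and pro-defendant factors D forces the plaintiff
# outcome in a new case (P2, D2) whenever P <= P2 and D2 <= D, i.e. the
# plaintiff's winning reason still applies and the defendant has no factor
# that was not already outweighed in the precedent.

def constrains_for_plaintiff(prec_p, prec_d, new_p, new_d):
    """True if the precedent forces a plaintiff outcome in the new case."""
    return prec_p <= new_p and new_d <= prec_d

# National Instruments: NatInstP (F21p -> plaintiff) preferred over
# NatInstD (F1d -> defendant).
nat_inst_p, nat_inst_d = {"F21p"}, {"F1d"}

# A Bryce-like follow-up case: the extra pro-plaintiff factors (hypothetical
# here: F6p, F12p) are no distinction, and the defendant has nothing new.
bryce_p, bryce_d = {"F21p", "F6p", "F12p"}, {"F1d"}

assert constrains_for_plaintiff(nat_inst_p, nat_inst_d, bryce_p, bryce_d)

# A case where the defendant has an additional factor is not constrained.
assert not constrains_for_plaintiff(nat_inst_p, nat_inst_d,
                                    bryce_p, {"F1d", "F16d"})
```

On this reading, the K and G case above is exactly the unconstrained situation: the defendant's factors F16d and F25d give the defendant something better than NatInstD, so NatInstP cannot be used and fresh rules are needed.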
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Bench-Capon and Atkinson
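Distributing the factors of a case across the issues they relate to, as section 4 proposes, can be sketched as a simple partition. The factor-to-issue mapping below is illustrative only, loosely based on the trade secret issues and factors mentioned in the text; in particular, footnote 9 notes that F25d in fact bears on more than one issue, which a single-valued mapping like this ignores.

```python
# Sketch: representing a case's factors per issue rather than as one flat
# set. Issue names follow the IBP-style issues mentioned in the text
# (InfoValuable, ImproperMeans, ...); the assignments are illustrative only.

FACTOR_ISSUE = {
    "F15p": "InfoValuable",             # unique product
    "F16d": "InfoValuable",             # information reverse engineerable
    "F14p": "ImproperMeans",            # restricted materials used
    "F25d": "ImproperMeans",            # information reverse engineered
    "F1d":  "ConfidentialRelationship", # disclosures in negotiations
    "F21p": "ConfidentialRelationship", # knew information was confidential
}

def by_issue(factors):
    """Partition a flat set of factors into the issues they relate to."""
    issues = {}
    for f in factors:
        issues.setdefault(FACTOR_ISSUE[f], set()).add(f)
    return issues

# A case described as a flat factor set becomes an issue-structured case,
# so precedential constraint can be checked issue by issue.
structured = by_issue({"F21p", "F1d", "F14p"})
```

With cases in this form, the constraint test can be run per issue, so a distinction on one issue no longer blocks a rule that governs a different issue.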
extent of measures taken by him to guard the secrecy of the information; (4) the value of the information to him and to his competitors; (5) the amount of effort or money expended by him in developing the information; (6) the ease or difficulty with which the information could be properly acquired or duplicated by others. Emphasis ours.

These points are all reflected in CATO's factor hierarchy. But as the emphasised terms indicate, ascribing these factors is not simple, but requires a judgement as to whether the extent is sufficient for the factor to apply. This point was addressed by Horty in [27] and [28]. The issue was further discussed by Rigoni in [42]. Horty has modified his approach in [29], and a formal comparison of Horty and Rigoni's approaches is given in [37].

Horty's main example in [27] is taken from [39] and concerns change of fiscal domicile, decided on the basis of several considerations including length of absence and percentage of income earned abroad. In [39] absence was modelled as two factors, which we will call shortStay and longStay11, favouring no change and change respectively. But this raises the question of how we determine whether a particular length of absence, say 24 months, is a shortStay, a longStay, or somewhere in between, and so neutral. Horty responds by introducing the notion of a factor with magnitude (i.e. a factor deriving from a dimension with more than two values) based on the dimensional fact of length of stay. The ascription of factors on the basis of dimensional facts can also be found in [40]. Ascription of factors is constrained by precedents: in a previous case a judge may have found for change on the basis of an absence of 18 months, showing on the result model that any absence of at least 18 months must be considered a longStay. But the judge may have spoken of an absence of greater than one year, so that on the reason model any absence over 12 months is to be considered a longStay. Rigoni's suggestion was to see precedents as fixing "switching points", which determine which (if any) factor applies for various values of the dimensional fact. Rigoni also notes that a dimension may encompass multiple factors for a given side (as with disclosures (D3d) in CATO).

Note here that the precedents which impose bounds on the ranges occupied by factors are a different kind of precedent from those which resolve factor conflicts as discussed above: they express no preferences. Thus what is required to accommodate dimensional facts and factors with magnitude is not a different way of representing precedential constraint, but to recognise that we are looking at a two stage process, with each stage using different types of precedents. The need for two stages was observed in [40]:

    Once the facts of a case have been established - and this is rarely straightforward since the move from evidence to facts is often itself the subject of debate - legal reasoning can be seen, following Ross [45] and Lindahl and Odelstad [33], as a two stage process, first from the established facts to intermediate predicates, and then from these intermediate predicates to legal consequences. CATO has been explicitly identified with the second of these steps (e.g. [20]). ([40], p 22)

This can be seen clearly in [7] where factors – the intermediate predicates – were ascribed to cases by the machine learning program SMILE, before being passed to IBP to predict the legal consequences. More recently this two-stage approach has been used by Branting in [19] and [18]. Thus before we can consider whether a case is constrained, which can be done in terms of factors using the issue based reason model described above, we must first assign the factors. For some factors, those derived from many-valued dimensions, this will involve ascribing the factors on that dimension respecting ranges identified in precedent cases. This can be done using either the reason or the result model, or using Rigoni's switching points. For a discussion of mapping a dimensional fact (age) into ranges through precedents see [25].

Thus the conclusion is that attempting to model precedential constraint in terms of cases represented as sets of dimensions rather than sets of factors as in [37] conflates two distinct steps in the process of reasoning with legal cases. Cases are not represented as sets of dimensions: cases are represented as facts in HYPO, and where they are represented as sets of points on dimensions, as in [40] and [14], these are dimensional facts, the legal significance of which is unknown until they are mapped into factors. If we represent cases as sets of dimensional facts (including dimensions with two values) as in [40], we can derive the factors applicable to the case. Or we can get our factors through machine learning as in [19]. We then organise these factors into issues and apply precedential constraint in terms of the factors associated with each issue as described in section 4.2. While some precedents will supply the plaintiff, defendant and priority rules as described in section 3.1, others will supply rules to move from dimensional facts to factors. To give an example from the fiscal domicile domain, such a rule would be something like: longStay ← absence(A) ∧ A ≥ 12.

One issue in the ascription of factors is that in some cases they do not seem to be independent. Thus in the fiscal domicile case it is possible that there is a trade off between length of absence and amount of income earned, so that whether the percentage of income is considered to be "substantial" is relative to the length of absence. The question of balancing factors has been discussed in [32] and [23], and an equation representing the trade off was used in [12]. In that paper a single factor (e.g. SufficientIncomeGivenAbsence) is ascribed on the basis of the two dimensional facts. How factors are ascribed on the basis of facts relates to the first stage, and the focus of this paper is on the second stage, namely determining how the precedents, when described in terms of factors, constrain the decision. Therefore we will not discuss the important and interesting questions relating to balancing and trade-offs further in this paper.

6 CONCLUDING REMARKS

A number of conclusions can be drawn from the above discussion:
• Reasoning with cases is a two stage process: first factors are ascribed on the basis of (often dimensional) facts, and then the cases are compared with precedents using factors.
• Precedential constraint should be considered in terms of factors, even if we wish to represent cases in terms of dimensional facts. Applicable dimensions show which aspects must be considered, while factors show which side is favoured in the particular case.
• Comparison (for both the results and the reason models) should be at the level of issues, to ignore irrelevant distinctions, and to reflect legal practice better.
• Precedents do not always have the same role:
  – Framework precedents (e.g. Lemon v. Kurtzman) identify the issues and set out the logical framework in which they are considered;
  – Preference precedents (the standard use in CATO) say how conflicting factors within an issue should be resolved;
  – Ascription precedents (e.g. National Instruments states its reasons for withholding F16d at some length) give reasons to determine if a factor should be ascribed to a case or not.

Here we have used the issues from IBP [7]. But we could have used coarser grained issues, perhaps merging the conjoined issues in IBP, or finer grained issues, using the abstract factors of [3] as issues, or even the nodes of the 2-regular hierarchy of [1]. The finer the granularity, the more decisions are constrained. Experiments to investigate the impact of different granularities on predictive accuracy would be interesting. It would also be interesting to explore the use of values rather than factors as the elements over which preferences are expressed as in [16] and [24]. The possibility of using multiple granularities could also be explored, with some arguments being in terms of issues, some in terms of abstract factors, others in terms of values, and others considering whole cases.

Perhaps the best way to deploy machine learning is for the first stage, factor ascription, as in [7] and [19]. Moreover, if we wish to address the second stage with machine learning, perhaps it would be better to predict issues rather than whole cases, and then combine the results using a logical framework to get the overall decision.

11 In [39] long duration and not long duration were used, but for reasons explained in section 2.1, negating factors is problematic and we follow CATO and [44] and use two distinct factors when the dimension can favour both sides. This also permits the possibility of a moderate duration being neutral.

REFERENCES
[1] Latifa Al-Abdulkarim, Katie Atkinson, and Trevor Bench-Capon. 2015. Factors, issues and values: Revisiting reasoning with cases. In Proceedings of the 15th International Conference on AI and Law. 3–12.
[2] Latifa Al-Abdulkarim, Katie Atkinson, and Trevor Bench-Capon. 2016. A methodology for designing systems to reason with legal cases using ADFs. AI and Law 24, 1 (2016), 1–49.
[3] Vincent Aleven. 1997. Teaching case-based argumentation through a model and examples. Ph.D. thesis. University of Pittsburgh.
[4] Vincent Aleven and Kevin D Ashley. 1995. Doing things with factors. In Proceedings of the 5th International Conference on AI and Law. 31–41.
[5] Larry Alexander. 1989. Constrained by precedent. Southern California Law Review 63 (1989), 1–64.
[6] Kevin D Ashley. 1990. Modeling legal arguments: Reasoning with cases and hypotheticals. MIT Press, Cambridge, Mass.
[7] Kevin D Ashley and Stefanie Brüninghaus. 2009. Automatically classifying case texts and predicting outcomes. AI and Law 17, 2 (2009), 125–165.
[8] Trevor Bench-Capon. 1991. Practical legal expert systems: the relation between a formalisation of legislation and expert knowledge. In Law, Computer Science and Artificial Intelligence, M Bennun and A Narayanan (Eds.). Ablex, 191–201.
[9] Trevor Bench-Capon. 1999. Some observations on modelling case based reasoning with formal argument models. In Proceedings of the 7th International Conference on AI and Law. 36–42.
[10] Trevor Bench-Capon. 2017. HYPO's legacy: introduction to the virtual special issue. AI and Law 25, 2 (2017), 205–250.
[11] Trevor Bench-Capon. 2020. Explaining Legal Decisions Using IRAC. In Proceedings of CMNA 2020. CEUR Workshop Proceedings 2669, 74–83.
[12] Trevor Bench-Capon and Katie Atkinson. 2017. Dimensions and Values for Legal CBR. In Proceedings of JURIX 2017. 27–32.
[13] Trevor Bench-Capon and Katie Atkinson. 2018. Lessons from Implementing Factors with Magnitude. In Proceedings of JURIX 2018. 11–20.
[14] Trevor Bench-Capon and Floris Bex. 2015. Cases and Stories, Dimensions and Scripts. In Proceedings of JURIX 2015. 11–20.
[15] Trevor Bench-Capon and Edwina L Rissland. 2001. Back to the future: Dimensions revisited. In Proceedings of JURIX 2001. IOS Press, 41–52.
[16] Trevor Bench-Capon and Giovanni Sartor. 2003. A model of legal reasoning with cases incorporating theories and values. Artificial Intelligence 150, 1-2 (2003), 97–143.
[17] L Karl Branting. 1991. Reasoning with portions of precedents. In Proceedings of the 3rd International Conference on AI and Law. 145–154.
[18] L Karl Branting. 2020. Explanation in Hybrid, Two-Stage Models of Legal Prediction. In The 3rd XAILA Workshop at JURIX 2020.
[19] L Karl Branting, Craig Pfeifer, Bradford Brown, Lisa Ferro, John Aberdeen, Brandy Weiss, Mark Pfaff, and Bill Liao. 2020. Scalable and explainable legal prediction. AI and Law (2020), 1–26.
[20] Stefanie Brüninghaus and Kevin Ashley. 2003. A predictive role for intermediate legal concepts. In Proceedings of JURIX 2003. 153–162.
[21] Stefanie Brüninghaus and Kevin D Ashley. 2003. Predicting outcomes of case based legal arguments. In Proceedings of the 9th International Conference on AI and Law. 233–242.
[22] Alison Chorley and Trevor Bench-Capon. 2005. An empirical investigation of reasoning with legal cases through theory construction and application. AI and Law 13, 3 (2005), 323–371.
[23] Thomas F Gordon and Douglas Walton. 2016. Formalizing Balancing Arguments. In Proceedings of COMMA 2016. 327–338.
[24] Matthias Grabmair. 2017. Predicting trade secret case outcomes using argument schemes and learned quantitative value effect tradeoffs. In Proceedings of the 16th International Conference on AI and Law. 89–98.
[25] John Henderson and Trevor Bench-Capon. 2019. Describing the development of case law. In Proceedings of the 17th International Conference on AI and Law. 32–41.
[26] John F Horty. 2011. Reasons and precedent. In Proceedings of the 13th International Conference on AI and Law. 41–50.
[27] John F Horty. 2017. Reasoning with dimensions and magnitudes. In Proceedings of the 16th International Conference on AI and Law. 109–118.
[28] John F Horty. 2019. Reasoning with dimensions and magnitudes. AI and Law 27, 3 (2019), 309–345.
[29] John F Horty. 2021. Modifying the Reason Model. AI and Law (2021), Online.
[30] John F Horty and Trevor Bench-Capon. 2012. A factor-based definition of precedential constraint. AI and Law 20, 2 (2012), 181–214.
[31] Grant Lamond. 2005. Do precedents create rules? Legal Theory 11 (2005), 1–26.
[32] Marc Lauritsen. 2015. On balance. AI and Law 23, 1 (2015), 23–42.
[33] Lars Lindahl and Jan Odelstad. 2006. Open and closed intermediaries in normative systems. In Proceedings of JURIX 2006. IOS Press, 91–99.
[34] Jo Desha Lucas. 1983. The direct and collateral estoppel effects of alternative holdings. The University of Chicago Law Review 50, 2 (1983), 701–730.
[35] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2019. Using machine learning to predict decisions of the European Court of Human Rights. AI and Law (2019), 1–30.
[36] Henry Prakken. 2019. Modelling accrual of arguments in ASPIC+. In Proceedings of the 17th International Conference on AI and Law. 103–112.
[37] Henry Prakken. 2021. A formal analysis of some factor- and precedent-based accounts of precedential constraint. AI and Law (2021), Online.
[38] Henry Prakken and Rosa Ratsma. 2021. A top-level model of case-based argumentation for explanation: formalisation and experiments. Argument and Computation (2021), Online.
[39] Henry Prakken and Giovanni Sartor. 1998. Modelling reasoning with precedents in a formal dialogue game. AI and Law 6, 3-4 (1998), 231–287.
[40] Henry Prakken, Adam Wyner, Trevor Bench-Capon, and Katie Atkinson. 2015. A formalization of argumentation schemes for legal case-based reasoning in ASPIC+. Journal of Logic and Computation 25, 5 (2015), 1141–1166.
[41] Adam Rigoni. 2015. An improved factor based approach to precedential constraint. AI and Law 23, 2 (2015), 133–160.
[42] Adam Rigoni. 2018. Representing dimensions within the reason model of precedent. AI and Law 26, 1 (2018), 1–22.
[43] Edwina L Rissland and Kevin D Ashley. 1987. A case-based system for Trade Secrets law. In Proceedings of the 1st International Conference on AI and Law. 60–66.
[44] Edwina L Rissland and Kevin D Ashley. 2002. A note on dimensions and factors. AI and Law 10, 1-3 (2002), 65–77.
[45] Alf Ross. 1957. Tû-tû. Harvard Law Review (1957), 812–825.
[46] Frederick Schauer. 1987. Precedent. Stanford Law Review (1987), 571–605.
[47] Marek Sergot, Fariba Sadri, Robert Kowalski, Frank Kriwaczek, Peter Hammond, and Therese H Cory. 1986. The British Nationality Act as a logic program. Commun. ACM 29, 5 (1986), 370–386.
[48] David B Skalak and Edwina L Rissland. 1992. Arguments and cases: An inevitable intertwining. AI and Law 1, 1 (1992), 3–44.
[49] Richard E Susskind. 1989. The Latent Damage system: A jurisprudential analysis. In Proceedings of the 2nd International Conference on AI and Law. 23–32.
[50] Heng Zheng, Davide Grossi, and Bart Verheij. 2020. Case-Based Reasoning with Precedent Models: Preliminary Report. In Proceedings of COMMA 2020. 443–450.
Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents

Paheli Bhattacharya (Department of CSE, IIT Kharagpur, India)
Soham Poddar (Department of CSE, IIT Kharagpur, India)
Koustav Rudra (L3S Research Center, Leibniz University, Hannover, Germany)
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paheli Bhattacharya, Soham Poddar, Koustav Rudra, Kripabandhu Ghosh, and Saptarshi Ghosh
The above limitations of existing methods motivate us to develop a summarization algorithm that will consider various rhetorical segments in a case document, and then decide which parts from each segment to include in the summary, based on guidelines from law practitioners. Supervised methods which are expected to learn these guidelines automatically would require a large number of expert written summaries for lengthy case judgements, which may not be available for all jurisdictions. Hence we go for an unsupervised summarization algorithm in this work.

We propose DELSumm (Domain-adaptive Extractive Legal Summarizer), an unsupervised extractive summarization algorithm for legal case documents. We formulate the task of summarizing legal case documents as maximizing an Integer Linear Programming (ILP) objective function that aims to maximize the inclusion of the most informative sentences in the summary, and has balanced representations from all the thematic segments, while reducing redundancy. We demonstrate how DELSumm can be tuned to operationalize summarization guidelines obtained from law experts over Indian Supreme Court case documents. Comparison with several baseline methods – including five legal domain-specific methods and two state-of-the-art deep learning models trained over thousands of document-summary pairs – suggests that our proposed approach outperforms most existing methods, especially in summarizing the different rhetorical segments present in a case document.

To summarize, the contributions of this paper are:
(1) We propose DELSumm that incorporates domain knowledge provided by legal experts for summarizing lengthy case documents.1
(2) We perform extensive experiments on documents from the Supreme Court of India and compare our proposed method with eleven baseline approaches. We find that DELSumm outperforms a large number of legal-specific as well as general summarisation methods, including supervised neural models trained over thousands of document-summary pairs. Especially, DELSumm achieves much better summarization of the individual rhetorical segments in a case document, compared to most prior methods.
(3) We also show that our proposed approach is robust to inaccurate rhetorical labels generated algorithmically. There is a negligible drop in performance when DELSumm uses rhetorical sentence labels generated by a rhetorical segmentation method (which is a more practical setup), instead of using expert-annotated labels.

To the best of our knowledge, this is the first systematic attempt to computationally model and incorporate legal domain knowledge for summarization of legal case documents. Through comparison with 11 methods, including 2 state-of-the-art deep learning methods, we show that an unsupervised algorithm developed by intelligently including domain expertise can surpass the performance of supervised learning models even when the latter are trained over large training data (which is anyway expensive to obtain in an expert-driven domain such as Law).

1 Implementation publicly available at https://github.com/Law-AI/DELSumm

2 RELATED WORK

Extractive text summarization aims to detect important sentences from the full document and include them in the summary. Existing methods for extractive summarization that can be applied for legal document summarization can be broadly classified into four classes: (i) Unsupervised domain-independent, (ii) Unsupervised domain-specific, (iii) Supervised domain-independent, and (iv) Supervised domain-specific. We briefly describe some popular methods from each of the classes in this section.

2.1 Unsupervised domain-independent methods

Popular unsupervised extractive summarization algorithms identify important sentences either by using Frequency-based methods (e.g. Luhn [16]) or Graph-based methods (e.g. LexRank [7]). The summary is the top ranked sentences. There are also algorithms from the family of matrix-factorization methods, such as LSA [10].

Summarization has also been treated as an Integer Linear Programming (ILP) optimization problem. Such methods have been applied for summarizing news documents [1] and social media posts [19]. Our proposed approach in this work is also based on ILP-based optimization.

2.2 Unsupervised domain-specific methods

There are several unsupervised methods for extractive summarization specifically designed for legal case documents. One of the earliest methods for unsupervised legal document summarization that takes into account the rhetorical structure of a case document was LetSum [8]. They consider certain cue-phrases to assign rhetorical/semantic roles to a sentence in the source document. They specifically consider the roles – Introduction, Context, Juridical Analysis and Conclusion. Sentences are ranked based on their TF-IDF values for estimating their importance. The final summary is generated by taking 10% from the Introduction, 25% from Context, 60% from Juridical Analysis and 5% from the Conclusion segments.

While LetSum considers TF-IDF to rank sentences, Saravanan et al. [20] use a K-Mixture Model to rank sentences for deciding which sentences to include in the summary. We refer to this work as KMM in the rest of the paper. Note that this work also identifies rhetorical roles of sentences (using a graphical model). However, this is a post-summarization step that is mainly used for displaying the summary in a structured way; the rhetorical roles are not used for generating the summary.

Another method, CaseSummarizer [18], finds the importance of a sentence based on several factors including its TF-IDF value, the number of dates present in the sentence, the number of named entities, and whether the sentence is at the start of a section. Sentences are then ranked in order to generate the summary comprising the top-ranked sentences.

Zhong et al. [25] create a template based summary for Board of Veteran Appeals (BVA) decisions from the U.S. Department of Veteran Affairs. The summary contains (i) one sentence from the procedural history, (ii) one sentence from the issue, (iii) one sentence from the service history of the veteran, (iv) a variable number of Reasoning & Evidential Support sentences selected using Maximum Margin Relevance, and (v) one sentence from the conclusion. We refer to this method as MMR in this paper.

Limitations of the methods: CaseSummarizer does not assume the presence of any rhetorical role. The other methods – LetSum, KMM, and MMR – consider their presence but do not include
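To make the kind of ILP formulation described above concrete, here is a brute-force sketch of the underlying optimization: choose a subset of sentences maximizing total informativeness under a word budget while covering every rhetorical segment. A real system would use an ILP solver rather than enumeration, and the sentences, scores, segments and budget below are invented for illustration; this is not DELSumm's actual objective function.

```python
# Brute-force sketch of segment-aware extractive summarization as an
# optimization problem: maximize total informativeness subject to a word
# budget, requiring at least one sentence per rhetorical segment.
# All data here (scores, segments, budget) are invented for illustration.
from itertools import combinations

# (sentence_id, segment, word_count, informativeness_score)
sentences = [
    (0, "Facts",   20, 3.0),
    (1, "Facts",   15, 1.0),
    (2, "Statute", 25, 2.5),
    (3, "Ratio",   30, 4.0),
    (4, "Ratio",   10, 1.5),
]
SEGMENTS = {"Facts", "Statute", "Ratio"}

def best_summary(sents, budget):
    """Exhaustively search subsets; return the feasible one with the
    highest total score, or None if no subset covers all segments."""
    best, best_score = None, float("-inf")
    for r in range(1, len(sents) + 1):
        for subset in combinations(sents, r):
            words = sum(s[2] for s in subset)
            covered = {s[1] for s in subset}
            if words <= budget and covered == SEGMENTS:
                score = sum(s[3] for s in subset)
                if score > best_score:
                    best, best_score = subset, score
    return best

summary = best_summary(sentences, budget=80)
```

With the toy data above, the optimizer prefers the high-scoring Facts and Ratio sentences over packing in more, shorter sentences, because segment coverage is already satisfied; an ILP solver reaches the same optimum without enumerating all subsets.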
them while estimating the importance of sentences for generating the summary. Specifically, LetSum and KMM use generic term-distributional models (TF-IDF and K-mixture models respectively) and MMR uses Maximum Margin Relevance. As evident, rhetorical roles do not play any role in finally selecting the sentences for inclusion in the summary.

We believe that it is more plausible to measure the importance of sentences in each rhetorical segment separately using domain knowledge. In fact, for each segment, the parameters measuring sentence importance can vary. As an example, consider the segment 'Statute'. One can say that sentences from this segment that actually contain a reference to a Statute/Act name are important. In contrast, this reasoning will not hold for the segment 'Facts', and one may derive a different measure based on the presence of certain Part-of-Speech tags, e.g., Nouns (names of people, location, etc.). Hence, in the present work, we especially include such domain knowledge along with segment representations in a more systematic way for generating summaries.

2.3 Supervised domain-independent methods

Supervised neural (Deep Learning-based) methods for extractive text summarization treat the task as a binary classification problem, where sentence representations are learnt using a hierarchical encoder. A survey of the existing approaches can be found in [6].

Two popular extractive methods are NeuralSum [4] and SummaRuNNer [17]. These methods use RNN (Recurrent Neural Network) encoders to learn the sentence representations from scratch. Sentence selection into the summary is based on content, salience, novelty, and absolute and relative position importance. These parameters are also learned in the end-to-end models.

Recently, pretrained encoders (especially transformer-based models such as BERT [5]) have gained much popularity. These encoders have already been pretrained using a large amount of open domain data. Given a sentence, they can directly output its sentence representation. These models can be used in supervised summarization by fine-tuning the last few layers on domain-specific data. BERTSUM [15] is a BERT-based extractive summarization method. Unlike SummaRuNNer, here sentence selection into the summary is based on trigram overlap with the currently generated summary.

Limitations of the neural methods: These Deep Learning architectures have been evaluated mostly in the news summarization domain, and news documents are much shorter and contain simpler language than legal documents. Recent works [22, 23] show that these methods do not perform well in summarizing scientific articles from PubMed and arXiv, which are longer sequences. However, it has not been explored how well these neural models would work in summarizing legal case documents; we explore this question for two popular neural summarization models in this paper.

applied for obtaining the final summary. We refer to this method as Gist [14].

Limitations of the method: Although this method was developed and applied for Chinese legal case documents, domain-specific attributes such as rhetorical labels were not considered.

2.5 Rhetorical roles in a legal case document

A legal case document can be structured into thematic segments, where each sentence can be labelled with a rhetorical role. Note that case documents often do not explicitly specify these rhetorical roles; there exist algorithms that assign rhetorical roles to the sentences [3, 21, 24]. Different prior works have considered different sets of rhetorical roles [3, 8, 20, 25]. It was shown in our prior work [2] that a mapping between the different sets of rhetorical roles is possible.

In this work, we consider a set of eight rhetorical roles suggested in our prior work [3]. Briefly, the rhetorical roles are as follows – (i) Facts: the events that led to filing the case, (ii) Issue: legal questions/points being discussed, (iii) Ruling by Lower Court: case documents from higher courts (e.g., Supreme Court) can contain decisions delivered at the lower courts (e.g. Tribunal), (iv) Precedent: citations to relevant prior cases, (v) Statute: citations to statutory laws that are applicable to the case (e.g., Dowry Prohibition Act, Indian Penal Code), (vi) Arguments delivered by the contending parties, (vii) Ratio: the rationale based on which the judgment is given, and (viii) Final judgement of the present court.

In this work, we attempt to generate a summary that includes representations of all rhetorical segments in the source document (full text). We assume that every sentence in the source document is already labeled with a rhetorical role, either by legal experts or by applying the method proposed in [3].

3 DATASET

For evaluation of summarization algorithms, we need a set of source documents (full text) and their gold standard summaries. Additionally, since we propose to use the rhetorical labels of sentences in a source document for generating the summary, we need rhetorical label annotations of the source documents. Further, since the supervised methods Gist, SummaRuNNer and BERTSUM require training over a large number of document-summary pairs, we also need such a training set. In this section, we describe the training set and the evaluation set (that is actually used for performance evaluation of summarization algorithms).

Evaluation set: Our evaluation dataset consists of a set of 50 Indian Supreme Court case documents, where each sentence is tagged with a rhetorical/semantic label by law experts (out of the rhetorical roles described in Section 2.5). This dataset is made available
2.4 Supervised domain-specific methods by our prior work [3]. We asked two senior law students (from
Liu et.al. [14] have recently developed a supervised learning method the Rajiv Gandhi School of Intellectual Property Law, one of the
for extractive summarization of legal documents. Sentences are most reputed law schools in India) to write summaries for each of
represented using handcrafted features like number of words in these 50 document. They preferred to write extractive summaries of
a sentence, position of the sentence in a document etc. Machine approximately one-third of the length of the documents. We asked
Learning (ML) classifiers (e.g., Decision Tree, MLP, LSTM) are then the experts to summarize each rhetorical segment separately (so
24
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paheli Bhattacharya, Soham Poddar, Koustav Rudra, Kripabandhu Ghosh, and Saptarshi Ghosh
We asked the experts to summarize each rhetorical segment separately (so that we can evaluate how well various models summarize the individual sections); only, they preferred to summarize the rhetorical segments 'Ratio' and 'Precedent' together.
All the summarization methods have been used to generate summaries for these 50 documents. These summaries were then uniformly evaluated against the two gold standard summaries written by the law students, using the standard ROUGE scores (details given in later sections). We report the average ROUGE scores over the two sets of gold standard summaries.

Training set: For training the supervised methods (Gist, SummaRuNNer and BERTSUM), we need an additional training dataset consisting of document-summary pairs. To this end, we crawled 7,100 Indian Supreme Court case documents and their headnotes (short abstractive summaries) from http://www.liiofindia.org/in/cases/cen/INSC/, which is an archive of Indian court cases. These document-summary pairs were used to train the supervised summarization models (details in later sections). We ensured that there was no overlap between this training set and the evaluation set of 50 documents.
Note that the headnotes described above are not considered to be summaries of sufficiently good quality by our law experts, which is why the experts preferred to write their own summaries as gold standard for the documents in the evaluation set. It can be argued that it is unfair to train the supervised summarization models using summaries that are known to be of poorer quality than the target summaries. However, the supervised summarization models require thousands of document-summary pairs for training, and it is practically impossible to obtain so many summaries of the target quality. Hence our only option is to use the available headnotes for training the supervised summarization models. This situation can be thought of as a trade-off between the quantity and quality of training data: if a method requires large amounts of training data, then that data may be of poorer quality than the target quality.

4 PROPOSED APPROACH: DELSUMM
In this section, we describe our proposed algorithm DELSumm. The algorithm uses an optimization framework to incorporate legal domain knowledge (similar to what is stated in [9, 11, 12]) into an objective function with constraints. The objective function is then maximized using Integer Linear Programming (ILP). The symbols used to explain the algorithm are stated in Table 1.

Table 1: Notations used in the DELSumm algorithm
𝐿            : Desired summary length (number of words)
𝑛            : Number of sentences in the document
𝑔            : Number of segments in the document
𝑚            : Number of distinct content words in the document
𝑖            : Index for a sentence (𝑖 = [1 . . . 𝑛])
𝑗            : Index for a content word (𝑗 = [1 . . . 𝑚])
𝑘            : Index for a segment (𝑘 = [1 . . . 𝑔])
𝑥𝑖           : Indicator variable for sentence 𝑖 (1 if sentence 𝑖 is to be included in the summary, 0 otherwise)
𝑦𝑗           : Indicator variable for content word 𝑗 (1 if 𝑗 is to be included in the summary, 0 otherwise)
𝑝𝑖           : Indicator variable for sentence 𝑖 citing a prior case (1 if there is a citation, 0 otherwise)
𝑎𝑖           : Indicator variable for sentence 𝑖 citing a statute (1 if there is a citation, 0 otherwise)
𝐿(𝑖)         : Number of words in sentence 𝑖
𝐼(𝑖)         : Informativeness of sentence 𝑖
𝐶(𝑖)         : Set of content words in sentence 𝑖
𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛(𝑖)  : Position of sentence 𝑖 in the document
𝑆𝑐𝑜𝑟𝑒(𝑗)     : Score (a measure of importance) of content word 𝑗
𝑇𝑗           : Set of sentences where content word 𝑗 is present
𝑊𝑒𝑖𝑔ℎ𝑡(𝑘)    : Weight (a measure of importance) of segment 𝑘
𝑆𝑘           : Set of sentences belonging to segment 𝑘
𝑁𝑂𝑆𝑘         : Minimum number of sentences to be selected from segment 𝑘 in the summary

We consider that a case document has a set of 𝑔 rhetorical segments (e.g., the 𝑔 = 8 rhetorical segments stated in Section 2.5), and the summary is supposed to contain a representation of each segment (as indicated in [9, 11, 12]). The algorithm takes as input (i) a case document where each sentence has a label signifying its rhetorical segment, and (ii) the desired number of words 𝐿 in the summary. The algorithm then outputs a summary of at most 𝐿 words containing a representation of each segment.

The optimization framework: We formulate the summarization problem using an optimization framework. The ILP formulation maximizes the following factors:
(i) Informativeness of a sentence: The informativeness 𝐼(𝑖) of a sentence 𝑖 defines the importance of the sentence in terms of its information content. More informative sentences are likely to be included in the summary. We assume that domain experts will define how to estimate the informativeness of a sentence.
(ii) Content words: Content words signify domain-specific vocabulary (e.g., terms from a legal dictionary, or names of statutes, etc.) and noun phrases. The importance of each content word 𝑗 is given by 𝑆𝑐𝑜𝑟𝑒(𝑗). We assume that domain experts will define which words/terms should be considered 'content words', and how important the content words are.

A summary of length 𝐿 words, consisting of the most informative sentences and content words, is obtained by maximizing the following objective function:

    max  Σ_{𝑖=1}^{𝑛} 𝐼(𝑖) · 𝑥𝑖  +  Σ_{𝑗=1}^{𝑚} 𝑆𝑐𝑜𝑟𝑒(𝑗) · 𝑦𝑗                  (1)

subject to the constraints:

    Σ_{𝑖=1}^{𝑛} 𝑥𝑖 · 𝐿(𝑖)  ≤  𝐿                                              (2)

    Σ_{𝑗 ∈ 𝐶(𝑖)} 𝑦𝑗  ≥  |𝐶(𝑖)| · 𝑥𝑖 ,    for all 𝑖 = [1 . . . 𝑛]             (3)

    Σ_{𝑖 ∈ 𝑇𝑗} 𝑥𝑖  ≥  𝑦𝑗 ,    for all 𝑗 = [1 . . . 𝑚]                        (4)

    Σ_{𝑖 ∈ 𝑆𝑘} 𝑥𝑖  ≥  𝑁𝑂𝑆𝑘 ,    for all 𝑘 = [1 . . . 𝑔]                      (5)
Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents ICAIL’21, June 21–25, 2021, São Paulo, Brazil
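To make the formulation concrete, the objective and constraints above can be exercised on a tiny hypothetical instance. The paper solves the ILP with GUROBI; the sketch below instead enumerates all 0/1 assignments of the 𝑥𝑖 variables, which is feasible only for toy inputs. All sentences, scores and segment labels here are invented for illustration.

```python
from itertools import product

# Toy instance: three sentences, one per rhetorical segment (all values hypothetical)
sentences = ["the rationale was explained", "the appeal is dismissed", "Section 302 IPC applies"]
I = [0.5, 0.9, 0.7]                                      # informativeness I(i)
length = [4, 4, 4]                                       # L(i): number of words in sentence i
content = [{"rationale"}, {"appeal"}, {"section 302"}]   # content words C(i)
score = {"rationale": 1, "appeal": 3, "section 302": 5}  # Score(j)
segment_of = ["ratio", "final", "statute"]               # rhetorical label of each sentence
NOS = {"ratio": 1, "final": 1, "statute": 1}             # NOS_k: min sentences per segment
L = 15                                                   # desired summary length in words

best, best_val = None, float("-inf")
for x in product([0, 1], repeat=len(sentences)):         # all 0/1 assignments of x_i
    # Eqn. 2: the summary must contain at most L words
    if sum(length[i] * x[i] for i in range(len(x))) > L:
        continue
    # Eqn. 5: at least NOS_k sentences from every segment k
    if any(sum(x[i] for i in range(len(x)) if segment_of[i] == k) < n
           for k, n in NOS.items()):
        continue
    # Eqns. 3-4 together force y_j = 1 exactly for the content words of selected sentences
    covered = set().union(*[content[i] for i in range(len(x)) if x[i]])
    # Eqn. 1: informativeness of chosen sentences + scores of covered content words
    val = sum(I[i] * x[i] for i in range(len(x))) + sum(score[w] for w in covered)
    if val > best_val:
        best, best_val = x, val

summary = [sentences[i] for i in range(len(best)) if best[i]]
```

For real documents the same objective and constraints would be handed unchanged to an ILP solver such as GUROBI (as the authors do) rather than enumerated.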
The objective function in Eqn. 1 tries to maximize the inclusion of informative sentences (through the 𝑥𝑖 indicator variables) and the number of important content words (through the 𝑦𝑗 indicator variables). Here 𝑥𝑖 (respectively, 𝑦𝑗) is set to 1 if the algorithm decides that sentence 𝑖 (respectively, content word 𝑗) should be included in the summary. Eqn. 2 constrains the summary length to at most 𝐿 words (note from Table 1 that 𝐿(𝑖) is the number of words in sentence 𝑖). Eqn. 3 implies that if a particular sentence 𝑖 is selected for inclusion in the summary (i.e., if 𝑥𝑖 = 1), then all the content words contained in that sentence (𝐶(𝑖)) are also selected. Eqn. 4 ensures that if a content word 𝑗 is selected for inclusion in the summary (i.e., if 𝑦𝑗 = 1), then at least one sentence where that content word is present is also selected.
Eqn. 5 ensures that from each segment 𝑘, a minimum number of sentences 𝑁𝑂𝑆𝑘 is selected into the summary. This is a key step that ensures representation of all rhetorical segments in the generated summary. We assume that suitable values of 𝑁𝑂𝑆𝑘 will be obtained from domain knowledge.
Any ILP solver can be used to solve the ILP; we specifically used the GUROBI optimizer (http://www.gurobi.com/). Finally, those sentences for which 𝑥𝑖 is set to 1 are included in the summary generated by DELSumm.

Handling redundancy: DELSumm implicitly handles redundancy in the summary through the 𝑦𝑗 variables that indicate the inclusion of content words (refer to Eqn. 3). According to Eqn. 3, if a sentence is included in the summary, then the content words in that sentence are also selected. Consider two sentences 𝑖 and 𝑖′ that are similar, so that including both in the summary would be redundant. The two sentences are expected to have the same content words. Hence, adding both 𝑖 and 𝑖′ (which have the same content words) will not help in maximizing the objective function. The ILP method is therefore expected to refrain from adding both sentences, thus preventing redundancy in the summary.

Applying DELSumm to case documents of a specific jurisdiction: Note that this section has given only a general description of DELSumm, where it has been assumed that many details will be derived from domain knowledge (e.g., the informativeness or importance of a sentence or content word, or the values of 𝑁𝑂𝑆𝑘). In the next section, we specify how these details are derived from the guidelines given by domain experts from India.

5 APPLYING DELSUMM ON INDIAN CASE DOCUMENTS
In this section, we describe how DELSumm was adapted to summarize Indian case documents. We first describe the summarization guidelines stated by law experts from India, and then discuss how to adapt those guidelines into DELSumm. We then discuss comparative results of the proposed algorithm and the baselines on the Indian dataset.

5.1 Guidelines from Law experts
We consulted law experts – senior law students and faculty members from the Rajiv Gandhi School of Intellectual Property Law (a reputed law school in India) – to learn how Indian case documents should be summarized. Based on the rhetorical segments described in Section 2.5, the law experts suggested the following guidelines.
• (G1) In general, the summary should contain representations from all segments of a case document, except the segment 'Ruling by Lower Court', which may be omitted. The relative importance of segments in a summary should be: Final judgement > Issue > Fact > (Statute, Precedent, Ratio) > Argument.
• (G2) The segments 'Final judgement' and 'Issue' are exceptionally important for the summary. These segments are usually very short in the documents, and so they can be included completely in the summary.
• (G3) The important sentences in the various rhetorical segments (Fact, Statute, Precedent, Ratio) should be decided as follows: (a) Fact: sentences that appear at the beginning; (b) Statute: sentences that contain citations to an Act; (c) Precedent: sentences that contain citations to prior cases; (d) Ratio: sentences appearing at the end of the document and sentences that contain citations to an act/law/prior case.
• (G4) The summary must contain sentences that give important details of the case, including the Acts and sections of Acts that were referred to, the names of people, places, etc. that concern the case, and so on. Also, sentences containing specific legal keywords are usually important.

5.2 Operationalizing the guidelines
Table 2 shows how the above-mentioned guidelines from the law experts in India have been operationalized in DELSumm.
• Operationalizing G1: We experiment with two ways of assigning weights to segments – linearly decreasing and exponentially decreasing – based on guideline G1 stated above. We finally decided to go with the exponentially decreasing weights. Specifically, we assign weight 2⁷ to the 'Final Judgement' segment (judged to be the most important by the experts), followed by 2⁶ for 'Issue', 2⁵ for 'Fact', 2³ for 'Statute', 'Ratio' and 'Precedent', and finally 2¹ for 'Argument'.
• Operationalizing G2: As per guideline G2, 𝑁𝑂𝑆𝑘 (the minimum number of sentences from a segment 𝑘) is set to ensure a minimum representation of every segment in the summary – the 'Final judgement' and 'Issue' rhetorical segments are to be included fully, and at least 2 sentences from every other segment are to be included.
• Operationalizing G3: The informativeness of a sentence 𝑖 defines the importance of the sentence in terms of its information content. More informative sentences are likely to be included in the summary. To find the informativeness 𝐼(𝑖) of a sentence 𝑖, we use guideline G3 stated above. 𝐼(𝑖) therefore depends on the rhetorical segment 𝑘 (Fact, Issue, etc.) that contains the particular sentence 𝑖. For instance, G3(a) dictates that sentences appearing at the beginning of the rhetorical segment 'Fact' are important. We incorporate this guideline by weighing the sentences within 'Fact' by the inverse of their position in the document.
According to G3(b) and G3(c), sentences that contain mentions of Statute/Act names (e.g., Section 302 of the Indian Penal Code, Article 15 of the Constitution, Dowry Prohibition Act 1961, etc.) and prior cases/precedents are important. We incorporate this guideline through a Boolean/indicator variable 𝑎𝑖, which is 1 if the sentence contains a mention of a Statute. We use regular expression patterns to detect Statute/Act name mentions.
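One plausible way to code up guidelines G1-G3 is sketched below. The segment weights follow Section 5.2, but the regular expressions and the exact combination into 𝐼(𝑖) are illustrative assumptions, not the authors' implementation.

```python
import re

# Segment weights per G1, exponentially decreasing (values from Section 5.2)
WEIGHT = {"final": 2**7, "issue": 2**6, "fact": 2**5,
          "statute": 2**3, "ratio": 2**3, "precedent": 2**3, "argument": 2**1}

# Hypothetical patterns: the paper uses regular expressions for statute/Act
# mentions and 'Party 1 vs. Party 2' citations, but not necessarily these ones
STATUTE_RE = re.compile(r"\b(?:Section|Article)\s+\d+|\b[A-Z]\w+\s+Act\b")
PRIOR_CASE_RE = re.compile(r"\bvs?\.\s")

def a(sentence):
    """a_i: 1 if the sentence mentions a statute/Act, else 0 (G3(b))."""
    return 1 if STATUTE_RE.search(sentence) else 0

def p(sentence):
    """p_i: 1 if the sentence cites a prior case, else 0 (G3(c))."""
    return 1 if PRIOR_CASE_RE.search(sentence) else 0

def informativeness(sentence, segment, position):
    """One possible reading of I(i) under G3; G3(d)'s additional preference
    for sentences near the end of the document is omitted for brevity."""
    if segment == "fact":
        base = 1.0 / position                 # G3(a): early Fact sentences
    elif segment == "statute":
        base = a(sentence)                    # G3(b): cites an Act
    elif segment == "precedent":
        base = p(sentence)                    # G3(c): cites a prior case
    elif segment == "ratio":
        base = max(a(sentence), p(sentence))  # G3(d): cites an act/law/prior case
    else:
        base = 1.0
    return WEIGHT[segment] * base
```

Scaling by the segment weight realizes the G1 ordering; the indicator functions realize the 𝑎𝑖 and 𝑝𝑖 variables of Table 1.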
Similarly, for detecting whether a sentence contains a reference to a prior case, we use the Boolean variable 𝑝𝑖: if a regular expression pattern of the form Party 1 vs. Party 2 is found in sentence 𝑖, then 𝑝𝑖 is set to 1.
For identifying important Ratio sentences, we look up guideline G3(d), which mandates that sentences containing references to either a Statute/Act name or a prior case are important to be included in the summary. We therefore use the two Boolean/indicator variables 𝑎𝑖 and 𝑝𝑖 to detect the presence of a Statute/Act name or a prior case. If either one occurs in the sentence, we consider that sentence from the segment Ratio to be informative.
• Operationalizing G4: Based on guideline G4 stated above, we need to identify some important 'content words' whose presence would indicate the sentences that contain especially important details of the case. To this end, we consider three types of content words – mentions of Acts/sections of Acts, keywords from a legal dictionary, and noun phrases (since names of people, places, etc. are important). For identifying legal keywords, we use a legal dictionary from the website advocatekhoj.com. For identifying mentions of Acts and statutes, we use a comprehensive list of Acts in the Indian judiciary, obtained from Westlaw India (westlawindia.com). As content word scores (𝑆𝑐𝑜𝑟𝑒(𝑗)), we assign a weight of 5 to statute mentions, 3 to legal phrases and 1 to noun phrases; these scores are based on suggestions by the legal experts on the relative importance of sentences containing different types of content words.

6 BASELINES AND EXPERIMENTAL SETUP
This section describes the baseline summarization methods that we consider for comparison with the proposed method, as well as the experimental setup used to compare all the methods.

6.1 Baseline summarization methods
As described in Section 2, there are four classes of methods that can be applied for extractive summarization of legal case documents. We consider some representative methods from each class as baselines.

Unsupervised domain-independent methods: We consider as baselines the following popular algorithms from this class: (1) Luhn [16], (2) LexRank [7], (3) Reduction [13], and (4) LSA [10] (see Section 2.1 for brief descriptions of these methods).

Unsupervised domain-specific methods: Section 2.2 gave brief descriptions of these methods. Among such methods, we consider the following four as baselines – (1) LetSum [8], (2) KMM [20], (3) CaseSummarizer [18], and (4) a simplified version of MMR [25]; in the absence of identical datasets, we adopt only the Maximum Margin Relevance module and use it to summarize a document.

Supervised domain-specific methods: From this class of algorithms, we consider Gist [14], a recent method that applies general ML algorithms to the task. The best performance was observed using Gradient Boosted Decision Tree as the ML classifier, which is the variant we report (see Section 2.4 for details of the method).

Supervised domain-independent methods: Among these methods, we consider the neural method SummaRuNNer [17] (implementation available at https://github.com/hpzhao/SummaRuNNer). Similar to Gist, it treats extractive summarization as a binary classification problem. The classifier returns a ranked list of sentences based on their prediction/confidence probabilities of inclusion in the summary. We include sentences from this list in decreasing order of their predicted probabilities, until the desired summary length is reached.
We also apply a recent BERT-based summarization method, BERTSUM [15] (implementation available at https://github.com/nlpyang/PreSumm). In BERTSUM, the sentence selection function is a simple binary classification task (whether or not to include a sentence in the summary).² Similar to SummaRuNNer, we use the ranked list of sentences based on the confidence probabilities of their inclusion in the summary. We include sentences one by one into the final summary, until the desired length is reached.

² The original BERTSUM model uses a post-processing step called Trigram Blocking that excludes a candidate sentence if it has a significant amount of trigram overlap with the already generated summary (to minimize redundancy in the summary). However, we observed that this step leads to summaries that are too short, as also observed in [22]. Hence we ignore this step.
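The length-budgeted selection described above for the SummaRuNNer and BERTSUM outputs can be sketched as follows; the sentences and confidence scores are hypothetical, and skipping (rather than stopping at) a sentence that would overflow the budget is one possible reading of the procedure.

```python
# Greedy selection from a ranked list: take sentences in decreasing order of
# predicted probability until the word budget is filled (scores are invented).
def select_summary(sentences, probs, budget):
    """Pick sentences by confidence until `budget` words are used."""
    ranked = sorted(range(len(sentences)), key=lambda i: probs[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n_words = len(sentences[i].split())
        if used + n_words <= budget:
            chosen.append(i)
            used += n_words
    # restore original document order for readability
    return [sentences[i] for i in sorted(chosen)]

docs = ["The appeal is allowed.",                 # 4 words
        "The facts of the case are as follows.",  # 8 words
        "Costs are awarded to the appellant."]    # 6 words
# keeps the two highest-confidence sentences that fit the 12-word budget
print(select_summary(docs, [0.9, 0.4, 0.7], budget=12))
```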
Algorithm        Final judgement  Issue    Facts    Statute  Precedent+Ratio  Argument
                 (2.9%)           (1.2%)   (23.9%)  (7.1%)   (53.1%)          (8.6%)
Unsupervised, Domain Independent
LexRank          0.0619           0.3469   0.4550   0.2661   0.3658           0.4284
LSA              0.0275           0.2529   0.5217   0.2268   0.3527           0.3705
Luhn             0.0358           0.2754   0.5408   0.2662   0.2927           0.3781
Reduction        0.0352           0.3153   0.5064   0.2579   0.3059           0.4390
Unsupervised, Domain Specific
LetSum           0.0423           0.3926   0.6246   0.3469   0.3853           0.2830
KMM              0.3254           0.2979   0.4124   0.3415   0.4450           0.416
CaseSummarizer   0.2474           0.3537   0.4500   0.2255   0.4461           0.4184
MMR              0.4378           0.3548   0.4442   0.2763   0.4647           0.3705
DELSumm          0.7929           0.6635   0.5539   0.4030   0.4305           0.4370
Supervised, Domain Independent
SummaRuNNer      0.4451           0.2990   0.5231   0.1636   0.5215           0.3090
BERTSUM          0.0662           0.3544   0.6376   0.2535   0.3121           0.3262
Supervised, Domain Specific
Gist             0.5844           0.3856   0.4621   0.2759   0.4537           0.2132
Table 4: Segment-wise performance (ROUGE-L F-scores) of the methods. All values are averaged over the 50 documents in the evaluation set. The values in the column headings are the percentage of sentences in the full document that belong to each segment. Values < 0.3 are highlighted in red/underlined; the best value for each segment is in green/bold.

Note that DELSumm is a completely unsupervised method, while SummaRuNNer and BERTSUM are deep learning-based supervised methods trained over 7,100 document-summary pairs. The fact that DELSumm still outperforms these supervised models is because DELSumm intelligently utilizes domain knowledge while generating the summaries.

7.2 Evaluation of segment-wise summarization
Overall ROUGE scores are not the best metrics for evaluating case document summaries. Law experts opine that even methods that achieve high overall ROUGE scores may not represent every segment well in the summary [2]. A segment-wise performance evaluation is practically important since law practitioners often intend to read the summary of a particular segment, and not always the full summary. Hence we perform a segment-wise performance evaluation of the algorithms.
To this end, we proceed as follows. For each rhetorical segment (e.g., Facts, Issues), we extract the portions of an algorithmic summary and the gold standard summary that represent the given segment, using the gold standard rhetorical labels of the individual sentences (which are present in our evaluation set, as detailed in Section 3). Then we compute the ROUGE scores on those specific text portions only. For instance, to compute the ROUGE score on the 'Fact' segment of a particular document, we only consider those sentences in the gold standard summary and the algorithmic summary which have the rhetorical label 'Fact' in the said document. We report the ROUGE score for a particular segment averaged over all 50 documents in the evaluation set.
Table 4 shows the segment-wise ROUGE-L F-scores (averaged over all 50 documents in the evaluation set). The values < 0.3 are underlined and highlighted in red, while the best value for each segment is highlighted in boldface and green.
Similar to [2], we also observe that many of the baseline methods could not represent the 'Final judgement' and 'Issue' segments well in their summaries. This flaw is especially critical since these two segments are the most important in the summary according to the law experts (see Section 5.1). In contrast, DELSumm achieves a very high performance on these two important segments. This difference in performance is possibly because these two segments are also the shortest (constituting only 2.9% and 1.2% of the whole document, as stated in the first row of Table 4), and hence are missed by the other methods, which do not know of their domain-specific importance. This observation shows the necessity of an informed algorithm that can incorporate domain knowledge from experts.
DELSumm also represents the 'Statute' segment better than all other methods. For the 'Statute' segment, the algorithm was formulated in such a way (through the 𝑎𝑖 variable) that it is able to incorporate sentences that contain a mention of an Act/law. Other methods did not perform well in this respect. The performance of DELSumm for the 'Argument' segment is second best (0.4370) after that of Reduction (0.4390); these values are very close, and the difference is not statistically significant.
Our method is unable to perform as well for the 'Precedent + Ratio' and 'Facts' segments as some other methods. Note that these segments account for the maximum number of sentences in a document and also form a large part of the summary (see Table 4, first row). Hence neural methods (e.g., BERTSUM) and methods relying on TF-IDF measures (e.g., LetSum) obtained relatively large amounts of training data for these segments, and hence performed well on them.
Finally, although LetSum is a segment-aware algorithm, its summarization mechanism is not strong enough to understand the importance of sentences in all segments; for instance, it performs very poorly on the 'Final judgement' segment. DL methods fail to perform well on the smaller segments (e.g., Final judgement, Issue), for which lesser amounts of training data are available.
Overall, it is to be noted that DELSumm achieves a much more balanced representation of segments in the summary compared to all the baseline methods. Contrary to the baseline methods, DELSumm shows decent performance for all the segments. This has been ensured mainly through the constraint in Eqn. 5. Hence, even when the most important segments (e.g., Final judgement) are optimized in the summary, the less important ones (e.g., Arguments) are not missed out.

7.3 Discussion on the comparative results
The results stated above suggest that our proposed method performs better than many existing domain-independent as well as legal domain-specific algorithms. The primary reason for this superior performance of DELSumm is that it is able to incorporate legal domain knowledge much more efficiently, through a theoretically grounded approach. By doing this, it is able to surpass the performance of deep learning and machine learning approaches such as SummaRuNNer, BERTSUM and Gist (which are trained over 7,100 document-summary pairs). These results show that if legal
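The segment-wise protocol of Section 7.2 can be sketched as follows. The ROUGE-L F-score below is a bare-bones word-level implementation for illustration only (the experiments use the standard ROUGE toolkit), and the labeled summaries are toy examples.

```python
# Keep only the sentences of one rhetorical label from the system and gold
# summaries, then score that portion with a minimal word-level ROUGE-L.
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ta == tb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f(candidate, reference):
    """ROUGE-L F-score over whitespace tokens (illustrative, unsmoothed)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def segment_text(summary, label):
    """Concatenate the sentences of `summary` (a list of (sentence, label)
    pairs with gold-standard rhetorical labels) carrying the given label."""
    return " ".join(s for s, l in summary if l == label)

gold = [("The appeal is dismissed.", "final"), ("The dispute concerns land.", "fact")]
algo = [("The appeal is dismissed.", "final"), ("Land was sold in 1990.", "fact")]
score = rouge_l_f(segment_text(algo, "final"), segment_text(gold, "final"))
print(round(score, 2))   # identical 'final' segments yield 1.0
```

Per-segment scores such as `score` would then be averaged over the 50 evaluation documents, as described above.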
ROUGE-2 ROUGE-L Analytics’. This work is also supported in part by the European
Algorithm
R F R F Union’s Horizon 2020 research and innovation programme under
LetSum 0.4030 0.4137 0.5898 0.5846 grant agreement No 832921. P. Bhattacharya is supported by a
MMR 0.3733 0.3729 0.6064 0.5680 Fellowship from Tata Consultancy Services.
SummaRuNNer 0.4104 0.4149 0.5835 0.5821
DELSumm (GSL) 0.4323 0.4217 0.6831 0.6017 REFERENCES
DELSumm (AL) 0.4193 0.4075 0.6645 0.5897 [1] Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Multi-
document abstractive summarization using ILP based multi-sentence compres-
Table 5: Performance of DELSumm with Gold Standard la- sion. In Proc. International Conference on Artificial Intelligence.
bels (GSL) and Algorithmic labels (AL) over the 50 docu- [2] Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripa-
bandhu Ghosh, and Saptarshi Ghosh. 2019. A comparative study of summariza-
ments. Apart from the last row, all other rows are repeated
tion algorithms applied to legal case judgments. In Proc. European Conference on
from Table 3. While DELSumm (AL) shows slightly degraded Information Retrieval.
performance than DELSumm (GSL), it still outperforms the [3] Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and
Adam Wyner. 2019. Identification of Rhetorical Roles of Sentences in Indian
other methods according to all ROUGE measures except Legal Judgments. Proc. Legal knowledge and information systems (JURIX) (2019).
ROUGE-2 F-score. [4] Jianpeng Cheng and Mirella Lapata. 2016. Neural Summarization by Extracting
Sentences and Words. In Proc. Annual Meeting of the Association for Computational
observe that, even when DELSumm is used with automatically Linguistics.
generated algorithmic labels (with about 15% noise), it still per- [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. In Proc. NAACL-HLT.
forms better than LetSum, MMR and SummaRuNNer (which were [6] Yue Dong. 2018. A Survey on Neural Network-Based Summarization Methods.
its closest competitors) according to all ROUGE measures except CoRR abs/1804.04589 (2018). arXiv:1804.04589 http://arxiv.org/abs/1804.04589
[7] Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based Lexical
ROUGE-2 Fscore. This robustness of DELSumm is because, through Centrality As Salience in Text Summarization. J. Artif. Int. Res. 22, 1 (2004).
the sentence informativeness and content words measures in the [8] Atefeh Farzindar and Guy Lapalme. 2004. Letsum, an automatic legal text sum-
ILP formulation, DELSumm can capture useful information even if marizing system. Proc. Legal knowledge and information systems (JURIX) (2004).
[9] Jessica Giles. 2015. Writing Case Notes and Case Comments.
the rhetorical labels may be a little inaccurate. http://law-school.open.ac.uk/sites/law-school.open.ac.uk/files/files/PILARS-
Writing-case-comments.pdf.
[10] Yihong Gong and Xin Liu. 2001. Generic Text Summarization Using Relevance
9 CONCLUSION Measure and Latent Semantic Analysis. In Proc. International conference on Re-
search and development in information retrieval (SIGIR).
We propose DELSumm , an unsupervised algorithm that systemati- [11] how-to-brief-case-cuny 2017. How to brief a case. https://www.lib.jjay.cuny.edu/
cally incorporates domain knowledge for extractive summarization how-to/brief-a-case.
of legal case documents. Extensive experiments and comparison [12] intro-case-briefing-northwestern [n.d.]. Introduction to Case Briefing.
http://www.law.northwestern.edu/law-school-life/studentservices/orientation/
with as many as eleven baselines, including deep learning-based documents/Orientation-Reading-Introduction-to-Case-Briefing.pdf.
approaches as well as domain-specific approaches, show the utility of our approach. The strengths of our approach are: (i) DELSumm systematically encodes domain knowledge necessary for legal document summarization into a computational approach; (ii) although an unsupervised approach, it performs on par with supervised learning models trained over huge amounts of training data; (iii) it is able to provide a summary that has a balanced representation from all the rhetorical segments, which is highly lacking in prior approaches; (iv) inaccuracy in labels does not degrade the performance of DELSumm much, thus showing its robustness and rich information identification capabilities; and (v) the method is flexible and generalizable to summarize documents from other jurisdictions; all that is needed are the expert guidelines for what to include in the summary, and how to identify important content words. The objective function and constraints can be adjusted as per the requirements of different jurisdictions (e.g., giving more weight to certain segments). The implementation of DELSumm is publicly available at https://github.com/Law-AI/DELSumm.
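The segment-weighted selection described above can be illustrated with a toy sketch (the sentence data, segment labels, weights, and the greedy strategy are all hypothetical simplifications for illustration; DELSumm itself solves a constrained optimization over the full objective described in the paper):

```python
# Toy sketch of segment-weighted extractive selection under a word budget.
# All data and weights below are hypothetical; DELSumm optimizes a richer
# objective with constraints, not this simple greedy heuristic.

# Each sentence: (rhetorical segment, informativeness score, word count, text)
sentences = [
    ("FACTS",    0.90, 20, "The appellant was convicted under ..."),
    ("FACTS",    0.40, 15, "The incident occurred in ..."),
    ("ISSUE",    0.80, 10, "Whether the conviction was lawful."),
    ("RULING",   0.95, 25, "The appeal is allowed and the conviction set aside."),
    ("ANALYSIS", 0.60, 30, "The court reasoned that ..."),
]

# Jurisdiction-specific weights, e.g. giving the final ruling more importance.
segment_weight = {"FACTS": 1.0, "ISSUE": 1.5, "RULING": 2.0, "ANALYSIS": 1.0}

def summarize(sentences, budget_words):
    """Greedily pick sentences by weighted score per word until the budget is hit."""
    scored = sorted(
        sentences,
        key=lambda s: segment_weight[s[0]] * s[1] / s[2],
        reverse=True,
    )
    summary, used = [], 0
    for seg, score, length, text in scored:
        if used + length <= budget_words:
            summary.append(text)
            used += length
    return summary

print(summarize(sentences, budget_words=60))
```

Raising `segment_weight["RULING"]`, for example, pushes ruling sentences into the summary earlier, which is the kind of jurisdiction-specific adjustment described above.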
In future, we plan to apply DELSumm to documents from other jurisdictions as well as to generate different types of summaries (e.g., for different stakeholders) and analyse the performance.

Acknowledgements: The authors thank the Law experts from the Rajiv Gandhi School of Intellectual Property Law, India, who helped in developing the gold standard data and provided the guidelines for summarization. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled ‘Smart Legal Consultant: AI-based Legal
AI Systems and Product Liability
ABSTRACT
The article examines whether the current product liability law provides an appropriate regulation for AI systems. This question, which is discussed using the example of the European Product Liability Directive, is of great practical importance in the current legal policy discussion on liability for AI systems.

This article demonstrates that in principle the liability requirements are also applicable to AI systems. If the conduct of an AI system is carefully distinguished from its properties, excessive liability can be avoided. Reversing the burden of proof in favour of the injured party in the case of faulty behaviour enables a liability regime that is fair to the interests at stake.

However, product liability law only applies if AI systems lead directly to personal injury or damage to property. Product liability law is not applicable insofar as AI systems indirectly lead to considerable disadvantages for the person concerned, in particular through assessments of persons. Protection against discrimination or otherwise unfair assessments by AI systems shall be effected by other legal instruments.

CCS CONCEPTS
• Applied computing → Law

KEYWORDS
Product liability, AI systems, Product Liability Directive, strict liability, burden of proof

ACM Reference format:
Georg Borges. 2021. AI systems and product liability. In Proceedings of ICAIL 2021, 21-25 June 2021, São Paulo, Brazil, 8 pages. https://doi.org/10.1145/3462757.3466099

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06…$15.00
https://doi.org/10.1145/3462757.3466099

1 The role of liability within the legal framework for AI systems

High expectations are associated with artificial intelligence. At the same time, the risks related to this technology are also a highly debated topic.

This naturally applies in particular to the legal discussion, the aim of which is to reconcile the interest in raising the benefits associated with the introduction of new technologies with the interest in protecting against the associated risks.

Artificial intelligence, especially the production and use of AI-equipped systems, often referred to as autonomous systems or AI systems, has been a challenge not least for the law, which has the task of balancing the interest in enhancing the benefits associated with the introduction of new technologies with the interest in protecting against the risks associated with such technologies.

1.1 Liability for AI systems as a pillar of the legal framework for AI

It goes without saying that artificial intelligence and AI systems raise questions in almost every field of law. An aspect which has gained much attention recently is what is often referred to as "algorithmic fairness" [1]. In the US, the Federal Trade Commission recently published some guidance on the truthful, fair and equitable use of AI [2]. Another aspect of the highest relevance is certainly liability for damage caused by AI systems.

The importance of a legal framework for AI systems and the role of liability can be seen clearly by looking at the current discussion in Europe.

In its 2018 strategy "Artificial Intelligence for Europe" [3], the EU Commission already explicitly focused on ethical and legal issues in addition to the technical and economic aspects. The EU Commission established several expert groups who support the Commission in these tasks. In the field of ethical aspects of AI, the High Level Expert Group on AI [4] developed ethical guidelines for trustworthy AI which were published in June 2018 [5]. Regarding the regulation on liability and security, the Expert Group on liability and new technologies, consisting of a Product Liability Directive formation and a "New Technologies" formation, was established [6].
Recently, in April 2021, the Commission presented a proposal for an Artificial Intelligence Act ('AI Act') [7], which refers to the security of so-called AI systems. The concept of AI systems is defined very broadly in the draft and is likely to cover a large part of modern software and software-equipped products. The Act essentially distinguishes between three groups of AI systems. Firstly, the proposal provides a list of AI systems whose operation is to be prohibited (Art. 5 of the proposal). The second group of so-called high-risk AI systems will be subject to mandatory risk management (Art. 9 of the proposal), while a third group of certain AI systems will be subject to transparency requirements only (Art. 52 of the proposal).

As the second pillar of the legal framework for AI systems, civil liability, i.e. the duty to compensate for damage, is of great importance.

In its final report, the New Technologies formation of the EU Expert Group on liability and new technologies suggested amendments to the existing liability law regarding AI and autonomous systems [8].

In November 2020, the European Parliament issued a resolution on the civil liability regime for AI systems [9], which even contains a proposal for a new regulation on liability for systems equipped with artificial intelligence [10].

1.2 Risks Associated with the Use of AI

The risks associated with the use of AI systems are very different. Legal protection against these risks depends, among other things, on what protected interests are affected and whether the AI system in question can cause damage directly or whether the respective AI system only has an indirect effect through use of its activity by third parties. These differences are expressed in two groups of cases that are currently intensively discussed and investigated in the German "ExamAI" project, among others [11].

One case group concerns damage to persons and property directly caused by AI systems, also called cyber-physical systems [12]. The Uber vehicle accident of 2018, in which a highly automated vehicle hit and fatally injured a pedestrian [13], vividly demonstrated this danger, and awareness of the risks induced by AI systems rose significantly.

Another group of cases concerns assessments of general information on individuals provided by an AI system and used by third parties for decisions which interfere with the rights of the affected individuals. In this group of cases, there is no direct interference with health or property and, most importantly, the interference is not directly caused by the AI system itself.

For the latter case group, connected with the keyword "bias in the data", the COMPAS system can be mentioned, as it illustrates the risks arising from 'biased' algorithms, even though the related questions touch criminal law rather than liability law. COMPAS generates an evaluation of a human being (here: probability of recidivism of offenders) that has been used by third parties (here: judges) to make a decision that was highly relevant for human beings (here: offenders). The evaluation generated by the system was criticized for discriminating against certain groups of people [14], although this criticism is not without controversy [15].

In the current discussion, systems for pre-selecting job applicants are often mentioned [16].

An essential characteristic of damage caused directly or indirectly by AI systems is that the damage is not caused by a physical feature of the system, such as its material properties, but by its own behaviour, i.e., behaviour that is not directly controlled by a human being but by the AI immanent in the system. It is not the human driver who steers the car against the victim, but the automated vehicle itself. The assessment of the criminal or job seeker is not carried out by a human, but by an AI system.

This special feature of AI systems is of decisive importance for liability law, since the damage cannot be directly traced back to the behaviour of a particular natural person, which is why the justification of and the responsibility for the damage must be discussed.

The importance of liability law is based on the dual function of civil liability, which grants the injured party a claim for damages against the addressee of liability. The rules on civil liability traditionally aim to compensate the victim for damage suffered to protected legal assets [17, p. 259f.; 18, p. 166, 565]. At the same time, the threat of liability also serves to steer behaviour by providing an incentive to avoid damage [17, p. 187; 18, p. 169]. However, the risk of liability can also trigger undesirable strategies of liability avoidance. In the current debate, for example, a possible chilling effect of liability risks on the development and market launch of products equipped with artificial intelligence is often mentioned.

In the current legal policy discussion, a tendency is emerging to assign responsibility to the operator of the AI system, at least in the sense of civil liability for damages. In Europe, Art. 4 of the proposed regulation on liability for the operation of AI systems [10] imposes strict liability on the operator of AI systems classified as dangerous for damages resulting from the operation of the AI system. Art. 11 of the proposed regulation [10] states that manufacturers of AI systems are deliberately not covered by the directive.

The proposed regulation mirrors the current discussion in Europe regarding liability for highly automated vehicles. Some authors, e.g., Buck-Heeb and Dieckmann [19] or Kreutz [20], refer to the existing strict liability of the owner of a motor vehicle and, with regard to the manufacturer, to product liability law. The latter raises the question of whether product liability law provides an adequate liability regime for AI systems.

The term product liability refers to the non-contractual, strict liability of the producer for damage caused by a defective product.

Product liability was introduced in many countries in the second half of the 20th century, starting in the U.S., which is considered the "birthplace" of product liability [21]. The transition from fault-based liability to strict liability of the manufacturer for defective products was heavily influenced by the 1963 landmark decision of the California Supreme Court in Greenman v Yuba Power Products [22]. This approach found its way into § 402A of the Restatement Second of Torts from 1966 and was later accepted in most states of the US [23, p. 19; 24, p. 251].
The status of being a component part producer is thus acquired through the training of the network. This aspect can be generalised: whoever controls the learning process in the context of machine learning and thus determines the properties of the learning system is the producer of the respective system.

3.3 Limitation of Damage to Health and Property

Art. 9 of the Product Liability Directive [25] explicitly limits liability to damage to health and property. This is intended to exclude, in particular, mere pecuniary damages, which occur, for example, in the event of a business interruption.

There is thus a "liability gap" in relation to AI systems that generate assessments or evaluations of persons. Whether it is an assessment of the likelihood of recidivism, creditworthiness or performance for a job or a place at university: any damage caused by incorrect assessments is not covered by this concept. Even if one were to assume a violation of personality rights of the person concerned, these rights would not be subject to the scope of product liability law.

This is not to say that product liability law should be extended to cover such damages. However, it is important to note that these damages are currently not covered by product liability law and therefore the legislator must be encouraged to achieve sufficient protection against erroneous assessments by other means.

4 Faulty Decisions and Product Liability

4.1 Conduct as a Defect of a Product?

A major challenge of product liability law in relation to AI is the requirement of a defect of the product. The characteristic of an AI system is that it reacts to situations independently, i.e., without direct control by a human being, thus exhibiting an independent and not previously determined behaviour [45, p. 43f.; 46, p. 7; 47, p. 4]. However, the behaviour of a system in a particular situation as such is not a property of the system. Even in cases where the actor is human, it cannot be directly inferred that a certain behaviour equals a certain characteristic.

The bridge between behaviour and property can be closed by defining a property as the ability to behave in a certain way in a certain situation or not to show a certain behaviour. The property of a highly automated vehicle is thus the suitability or ability to drive "correctly" in a certain situation.

Yet, this makes clear the error that exists in a simple equation of behaviour and property. If one were to conclude a (product) error, i.e., a certain (insufficient) property of this AI system, directly from a faulty behaviour, for example a driving error, the required ability would be directed at behaving correctly in every traffic situation.

This task can be very complex. For example, a traffic situation consists of a multitude of facts which must be converted into information by the vehicle through sensors and then be interpreted correctly in several steps. Then the vehicle must derive a correct decision on the driving strategy (trajectory planning) based on the perceived situation, i.e., the interpreted data. In doing so, all information available in the traffic situation must be taken into account.

The decisive factor is that situations occurring in traffic are unpredictably diverse. Accordingly, the concrete behaviour in such a unique, future traffic situation is also unpredictable. If one equates a faulty behaviour of the system with a product defect, liability arises for behaviour in an unforeseeable multitude of future traffic situations. This is a conceivable liability concept that can be achieved in particular by specific legislation providing for a causal liability of the producer, as some authors suggest in the case of highly automated vehicles [48, p. 277; 49, p. 574].

However, this is not the liability concept of the Product Liability Directive. Product liability law focusses, as can be derived from Art. 6 of the Product Liability Directive, on the existence of properties of the product when it is placed on the market, properties which must correspond to justified safety expectations. The expectation of being able to control unforeseeable future situations is certainly not a justified safety expectation.

If the legislator wanted to introduce such liability, this could be done, e.g., by imposing causal liability on the producer, similar to the operator's liability under § 7 StVG (German Road Traffic Act). According to this norm, the owner of a motor vehicle is liable for accidents that are not due to force majeure. Another example of such causal liability in German law is the liability of the pet owner for so-called "luxury animals" (§ 833 BGB).

A similar result could be achieved by applying the construct of vicarious liability to AI systems and equating them to humans in this respect. This concept is not a part of tort law in Germany or in most legal systems. German law contains such an attribution of fault in § 278 of the German Civil Code (BGB) for agents. It is also argued that this provision should be applied analogously to AI systems [50, p. 211f.]. This liability only applies to contractual obligations and does not apply to non-contractual liability.

As an interim result, this shows that product liability law does not provide for liability for conduct, not even for the property of behaving correctly in future, unforeseeable situations. Therefore, only the ability to behave correctly in a spectrum of foreseeable situations can be described as a property of an AI system in the sense of product liability.

4.2 Reference Point of the Defectiveness of AI Systems

With the interim result that product liability only refers to properties of the AI system, but not to its behaviour in a specific situation, follow-up questions necessarily arise.

The starting point is that the ability of the AI system to behave according to a certain expectation in a certain situation is to be regarded as a property of the AI system and that the lack of this ability can trigger product liability.

First of all, it is important to establish a reference point for this ability, in particular to analyse whether and to what extent human abilities are to be taken into account.

An intuitively obvious reference point would be the human behaviour that is to be replaced by the activity of the AI system,
in the case of automated driving, for example, the behaviour of a human driver [48, p. 276; 51, p. 77]. For the determination of the defectiveness of an AI system, reference would therefore have to be made to human capabilities.

However, a simple reference to human capabilities as a yardstick for errors of AI systems would not be convincing. For example, a human driver is granted a certain reaction time; the ability to react immediately is not expected. The granting of a reaction time measured against human limitations, however, is obviously not necessary for AI systems; on the contrary, it would be highly counterproductive. The improved safety of highly automated vehicles, for example, is based not least on the fast reaction of the system compared to the human driver [42, p. 734f.]. Therefore, an independent standard must be developed, as is generally the case in product liability law [42, p. 733f.].

Admittedly, the reference to human capabilities is to some extent predetermined. Insofar as AI systems are used to provide "human performance", i.e., performance which was previously provided by humans, the expected behaviour of the machine is based on human behaviour. Thus, requirements for its behaviour in road traffic are entirely oriented towards the abilities of a human being. Consequently, insofar as AI systems are used, a behaviour comparable to that of a human being is required [48, p. 275f.].

This orientation towards human behaviour leads to an orientation towards human capabilities without being limited to them. Accordingly, the safety expectations within the framework of product liability law are also not to be limited by the abilities of humans [42, p. 735f.].

This leads to a huge challenge for product liability law: the requirements for the capabilities of AI systems are currently very unclear, and clear, legally secure requirements have not yet emerged in many areas.

Structurally, none of this is new. Requirements often have to be concretised in the respective case, ultimately in binding form by judges. It is therefore particularly relevant to define safety requirements for new technologies in the form of technical standards.

4.3 Requirements for Accuracy of AI Systems

In the context of concretising the safety requirements, it is of great importance whether the accuracy of the system with regard to its behaviour can be expected as a property of an AI system, or whether fault tolerance is permitted.

As already shown, error-free behaviour in every future situation cannot be part of the justified safety expectation, so that a certain degree of error tolerance is required. In this regard, the special features of AI systems compared to traditional, human-controlled machines become particularly clear.

Insofar as the required behaviour is specified, negligence is assumed in the case of a human acting otherwise. If a human driver does not drive correctly, there is fault. However, the producer of the AI system is not its operator in the way a driver would be. If one wanted to make the producer liable for driving errors of the system itself, this would be a new type of causal liability, which is rightly demanded (supra 4.1).

Product liability is aimed at the justified expectations of safety, i.e. the safety that can reasonably be achieved. This standard also applies to the behaviour of the system: what is required is the degree of safety of correct behaviour that can reasonably be ensured by the manufacturer. In this respect, the technical possibilities at the time of placing the product on the market are decisive.

To avoid misunderstandings, it should be said that liability law must be distinguished from market approval rules, which apply to numerous machines such as vehicles, and that the requirements are not identical. The obligation may be stricter than in the approval procedure.

However, if the safety requirements are tied to what is reasonable, exceptions can be considered; product liability law contains a fault tolerance for the behaviour of AI systems.

The extent of this fault tolerance can ultimately only be clarified in individual cases. However, one can ask the fundamental question of whether it is tolerable for an AI system to behave worse than the "average human" in a certain situation. In principle, this question can be answered in the affirmative, since AI systems act completely differently from humans and show strong weaknesses especially in unusual situations. A basic problem of machine learning, as it has been practised until recently, is the lack of consideration dedicated to the confidence in the classification. In many cases, doubts about the classification (such as of an object perceived by sensors) were not or not sufficiently taken into account for planning the behaviour of the AI system.

In addition, AI systems are highly susceptible to manipulation, as has been demonstrated in numerous studies on IT security. It is therefore inherent in the liability concept of product liability law that machines make mistakes that would not be tolerated if made by a human.

This necessarily creates a liability gap in relation to the producer. This gap must be filled outside of product liability law, for example through specific causal liability on the part of the manufacturer or, as is currently the trend, through the liability of other parties involved, such as the operator of the AI system.

However, product liability law offers another important connecting factor for the safety of AI systems. Part of the expectation of safety in a product is that the function of the product has been properly verified. This general aspect is crucial in the case of AI systems: since the behaviour of AI systems based on machine learning cannot be verified by code analysis but can at least be partially verified by testing, the tests on the behaviour of the AI system are a crucial aspect of the safety expectation. Proper testing of the behaviour of AI systems thus becomes the anchor of product liability: if there is no proper test, the product is defective. If the damage can be traced back to a behaviour which would have been detectable if proper tests had been carried out, this error is also causal for the damage.

This shows a major advantage of AI systems compared to human activity: the behaviour of AI systems is in principle reproducible and can therefore be checked by tests in many cases - specifically, in complex situations with a correspondingly large
amount of different input information - not fully, but much better than with humans.

4.4 Burden of Proof

Another aspect of product liability that is often decisive in practice concerns the burden of proof. According to Art. 4 of the European Product Liability Directive, the injured party bears the burden of proof for the existence of the product defect, i.e. he must present the defectiveness of the product and prove it in the event of a dispute.

In this respect, the circumstances of the procedural law systems, which are subject to the law of the EU member states, have a decisive effect. In the United States in particular, the instrument of pretrial discovery provides the plaintiff with a strong instrument to strengthen the factual basis of his claim [52, p. 115f.], which is not known in most European legal systems.

In German law, the requirements for the presentation of the facts on which a claim is based are particularly strict. In principle, the plaintiff is required to provide a complete statement of the facts in such a way that a decision of the dispute can be made based on the plaintiff's submissions [29, no. 92].

These requirements also apply in product liability, there also with regard to the defect of the product. As a rule, the injured party cannot provide such evidence, as he does not know any details about the design or manufacture of the product. Normally, he will only be able to present circumstantial evidence from which it may be possible to conclude that the product is defective.

In view of this procedural starting position, presumptions of the existence of a defect, which lead to the reversal of the burden of proof, play a central role. In fact, case law in Germany, for example, has assumed a presumption in a number of cases: for example, the lack of compliance with statutory safety regulations is supposed to give rise to a presumption that products are defective.

Such presumptions are of crucial importance, especially for AI systems. Without them, it will hardly be possible for the injured party to prove a fault of the AI system, as he has no access to the relevant information. He does not know the design of the system; in machine learning cases he lacks any knowledge about the learning process; and the injured party does not know anything about the tests of the AI system [53, p. 285f.].

Against this background, the New Technologies formation of the Expert Group on liability and new technologies has recommended a shift of the burden of proof in certain cases regarding AI and other new technologies [8, p. 42]. The European Parliament has taken up this idea in its resolution for a civil liability regime for AI [9]. The proposal for a regulation on liability for AI systems [10] contains a different distribution of risk for so-called "AI systems with high risk", for which strict liability of the operator is to apply, and for "other AI systems", for which fault-based liability is to remain. However, according to Art. 8 Para. 2 of the proposed regulation, the operator must prove that personal injury or damage to property was caused through no fault of his own.

This raises the crucial question of the conditions under which a reversal of the burden of proof for the existence of a fault should apply to AI systems. In this respect, great caution is required, as a premature presumption, which may not be rebuttable, would transform product liability into a causal liability of the manufacturer, which is precisely what the legislator wanted to avoid.

In the cases of interest here, the starting point of the damage is a faulty behaviour of the AI system. This behaviour appears externally and is usually more recognisable for the injured party than for the producer. In this respect, the assignment of the burden of proof to the injured party is appropriate. Since the injured party's lack of evidence concerns the facts relating to the design and manufacture underlying the conduct, including the testing of the AI system, the burden of proof should start at precisely this point. Therefore, there should be a presumption of the existence of a defect in the AI system if it has behaved "incorrectly". It is then incumbent on the manufacturer to prove that the fault could not have been detected with proper design and testing. This proof can be provided by the manufacturer presenting and, if necessary, proving the existence of a sufficient test. Since he has access to the relevant data and since he can use external help if necessary, he can reasonably be expected to provide this evidence.

With this easing of the burden of proof, there is probably an appropriate balance of interests. It is often difficult to prove misconduct. In the example of the highly automated vehicle - in the absence of causal liability of the operator - a driving error would have to be proven, which can be difficult if the course of the accident cannot be determined. In this respect, the burden of proof on the injured party is correct, since the manufacturer has no better information in this respect. Here, however, the black box, which is to become obligatory for highly and fully automated vehicles, should provide a remedy, for example in the case of automated driving. The inspection of the recorded data can be ordered by the court, e.g., within the framework of independent evidence proceedings prior to the filing of a lawsuit.

5 Conclusion

As an overall result, product liability law can be said to have a certain performance capacity also in relation to AI systems.

Product liability is comprehensively applicable to AI systems and their components, in particular to neural networks and any other software that creates AI. The manufacturers of the system as well as of the software as a partial product are the addressees of product liability law and can be held liable for damages.

A significant gap arises in the case of AI systems that cannot directly lead to personal injury or damage to property, but which set an essential prerequisite for decisions by third parties, as is the case with AI systems for the assessment of persons.

The central challenge of product liability law in relation to AI systems is the determination of a product defect. Since the concept of defect is based on properties of the product that must be present when the product is placed on the market, the behaviour of the AI system cannot be directly linked to the defect. However, one property of the AI system is the ability to behave in a
37
AI Systems and Product Liability ICAIL 2021, 21-25 June 2021, São Paulo, Brazil
certain way. The more the behaviour depends on a situation and [12] Rasmus Adler, Jens Heidrich, Lisa Jöckel, Michael Kläs. 2020. Anwendungssze-
narien: KI-Systeme in der Produktionsautomatisierung; retrieved from
the more diverse the possible situations are - in the case of auto- https://testing-ai.gi.de/fileadmin/GI/Projekte/KI_Testing_Auditing/Exa-
mated driving, for example, represented by the unforeseeable mAI_Publikation_Anwendungsszenarien_KI_Industrie.pdf (accessed
17.5.2021).
multitude of traffic situations - the less the behaviour of the AI [13] National Transportation Safety Board (NTSB). 2018. Collision Between Vehicle
system can be anticipated by the manufacturer and secured in Controlled by Developmental Automated Driving System and Pedestrian
advance. The justified expectation of safety as a measure of Tempe, Arizona, March 18, 2018 accident report, retrieved from
https://www.ntsb.gov/investigations/AccidentReports/Reports/HAR1903.pdf
faultiness is therefore not directed towards fault-free behaviour (accessed 17.5.2021).
in all situations, but only towards such behaviour in situations [14] Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner. 2016. Machine
Bias, ProPublica, May 23, 2016, retrieved from <https://www.propublica.org/ar-
which can be reasonably foreseen. ticle/machine-bias-risk-assessments-in-criminal-sentencing> (accessed
The sufficient degree of safety is, as described, primarily con- 17.5.2021).
cretised by the reasonableness of the measures. Technical stand- [15] Anthony W. Flores, Kristin Bechtel, Christopher T. Lowenkamp. 2016. False
Positives, False Negatives and False Anaylses: A Rejoinder to “Machine Bias:
ards will play a special role in this, as they emerge. There’s Software Used Across the Country to Predict Future Criminals. And It’s
In the case of AI systems, an important starting point for Biased Against Blacks.”, Federal Probation 80(2), 38–46.
[16] Katharina Zweig, Marc Hauer, Franziska Raudonat. 2020. Anwendungsszenar-
product liability will probably be the testing of AI systems, since ien: KI-Systeme im Personal- und Talentmanagement, retrieved from
the existence of proper tests of AI systems is to be regarded as a https://testing-ai.gi.de/fileadmin/PR/Testing-AI/ExamAI_Publikation_Anwen-
dungsszenarien_KI_HR.pdf (accessed 17.5.2021).
component of the justified expectation of safety. If this is lack- [17] Steven Shavell. 2004. Foundations of Economic Analysis of Law. Cambridge, MA:
ing, the AI system is defective. Belknap Press.
In practice, the burden of proof will play a central role. In this [18] Hans-Bernd Schäfer, Claus Ott. 2012. Lehrbuch der ökonomischen Analyse des
Zivilrechts, Springer.
respect, a reversal of the burden of proof for the existence of a [19] Petra Buck-Heeb, Andreas Dieckmann. 2020. in: Oppermann/Stender-Vor-
fault is to be assumed for AI systems if the system behaves incor- wachs (eds.), Autonomes Fahren, C.H. Beck, 2nd ed., Chap. 3.1.1.
[20] Peter Kreutz. 2020. in: Oppermann/Stender-Vorwachs (eds.), Autonomes Fahren,
rectly. The burden of proof for the faulty behaviour lies with the C.H. Beck , 2nd ed., Ch. 3.1.2.
injured party. [21] Geraint Howells. 1993. Comparative Product Liability, Dartmouth Publishing.
With these requirements, product liability law can contribute [22] Greenman v. Yuba Power Products, Inc. (1963), 59 Cal. 2d 57.
[23] Fairgrieve, Howells, Møgelvang-Hansen, Straetmans, Verhoevens, Machnikow-
to an appropriate balancing of interests with regard to damage ski, Janssen, Schulze. 2016. Product Liability Directive. In Pjotr Machnikowski
caused by AI systems. Nevertheless, the legal framework of AI (Ed.), European Product Liability. An Analysis of the State of the Art in the Era of
New Technologies, 17–111.
systems needs to be supplemented, as otherwise liability risks [24] Matthias Reimann, M. 2015. Product liability. In M. Bussani & A. Sebok (eds.),
and false incentives exist. Comparative Tort Law, 250 – 279, Edward Elgar Publishing.
[25] Council Directive of 25 July 1985 on the approximation of the laws, regulations
and administrative provisions of the Member States concerning liability for de-
REFERENCES fective products (85/374/EEC), O.J. 1985 No. L 210/9.
[1] Virginia Foggo, John Villasenor, Pratyush Garg. 2021. Algorithms and Fairness, [26] Astrid Seehafer, Joel Kohler. 2020. Künstliche Intelligenz: Updates für das Pro-
Ohio State Technology Law Journal 17(1), 123–188. dukthaftungsrecht? Mögliche Anpassungen der europäischen Produkthaf-
[2] Federal Trade Commission (FTC). 2021. Aiming for truth, fairness, and equity tungsrichtlinie für neue Technologien, EuZW 2020, 213–218.
in your company’s use of AI, retrieved from https://www.ftc.gov/news- [27] Thomas Riehm. 2010. 25 Jahre Produkthaftungsrichtlinie – Ein Lehrstück zur
events/blogs/business-blog/2021/04/aiming-truth-fairness-equity-your-compa- Vollharmonisierung, EuZW 2010, 567–571.
nys-use-ai (accessed 17.5.2021). [28] Georg Borges. 2018. Rechtliche Rahmenbedingungen für autonome Systeme,
[3] Communication from the Commission to the European Parliament, the Euro- NJW 2018, 977–982.
pean Council, the Council, the European Economic and Social Committee and [29] Georg Borges. 2021. in: Borges/Hilber (Eds.), Beck’scher Online-Kommentar IT-
the Committee of the Regions, Artificial Intelligence for Europe, COM(2018) 237 Recht, § 1 ProdHaftG.
final, 25.4.2018, https://eur-lex.europa.eu/legal-con- [30] Gerhard Wagner. 2020. in: Münchener Kommentar zum BGB, 8th ed., § 4 Pro-
tent/EN/TXT/PDF/?uri=CELEX:52018DC0237 (accessed 17.5.2021). dHaftG.
[4] For detailed information on the group see https://ec.europa.eu/transparency/re- [31] Georg Borges. 2020. Kann ein Gegenstand nicht Sache und doch Sache sein?.
gexpert/index.cfm?do=groupDetail.groupDetail&groupID=3591&news=1 (ac- Computerprogramme im Privatrecht, in: Omlor (ed.), Weltbürgerliches Recht,
cessed 17.5.2021). Festschrift für Michael Martinek zum 70. Geburtstag, 45–58, C.H. Beck.
[5] High-Level Expert Group on AI, Ethics Guidelines for Trustworthy Artificial [32] OLG Hamm NJW-RR 2012, 355.
Intelligence, retrieved form https://digital-strategy.ec.europa.eu/en/library/eth- [33] Erwin Deutsch. 1989. Der Schutzbereich der Produzentenhaftung nach dem
ics-guidelines-trustworthy-ai (accessed 17.5.2021). BGB und dem PHG, JZ 1989, 465–470.
[6] For detailed information on the group see https://ec.europa.eu/transparency/re- [34] Gerhard Wagner. 2020. in: Münchener Kommentar zum BGB, 8th ed., § 1 Pro-
gexpert/index.cfm?do=groupDetail.groupDetail&groupID=359 (accessed dHaftG.
17.5.2021). [35] Ulrich Berz, Eva Dedy, Claudia Granich. 2000. Haftungsfragen bei dem Einsatz
[7] Proposal for a Regulation laying down harmonized rules on artificial intelli- von Telematik-Systemen im Straßenverkehr, DAR 2000, 545–554.
gence (Artificial Intelligence Act), 21.4.2021, https://ec.europa.eu/news- [36] Volker Jänich, Paul Schrader, Vivian Reck. 2015. Rechtsprobleme des autono-
room/dae/redirection/document/75788 (accessed 17.5.2021). men Fahrens, NZV 2015, 313–318.
[8] Report from the Expert Group on Liability and New Technologies – New Tech- [37] Paul Schrader. 2015. Haftungsrechtlicher Begriff des Fahrzeugführers bei
nologies Formation, Executive Summary, 3 f.; https://ec.europa.eu/transpar- zunehmender Automatisierung von Kraftfahrzeugen, NJW 2015, 3537–3542.
ency/regexpert/index.cfm?do=groupDetail.groupMeetingDoc&docid=36608 [38] Bundesgerichtshof [German Federal Supreme Court], GRUR 1985, 1041.
(accessed 17.5.2021). [39] Helmut Redeker. 2017. IT-Recht, 6th ed., C.H.Beck.
[9] Civil liability regime for artificial intelligence, European Parliament resolution [40] Philipp Reusch. 2020. In Kaulartz/Braegelmann (eds.), Rechtshandbuch Artificial
of 20 October 2020 with recommendations to the Commission on a civil liability Intelligence und Machine Learning, Chap. 4.1, C.H.Beck.
regime for artificial intelligence (2020/2014(INL)), retrieved from [41] Friedrich-Wilhelm Engel. 1986. Produzentenhaftung für Software, CR 1986,
https://www.europarl.europa.eu/doceo/document/TA-9-2020-0276_EN.pdf (ac- 702–708.
cessed 17.5.2021). [42] Gerhard Wagner. 2017. Produkthaftung für autonome Systeme, AcP 2017(6),
[10] Proposal for a Regulation of the European Parliament and of the Council on 707–765.
liability for the operation of Artificial Intelligence-systems, (2020/2014(INL)), [43] Georg Borges. 2021. in: Borges/Hilber (eds.), Beck’scher Online-Kommentar IT-
retrieved from https://www.europarl.europa.eu/doceo/document/TA-9-2020- Recht, § 2 ProdHaftG.
0276_EN.pdf (accessed 17.5.2021). [44] Ulrich Magnus. 2017. In Machnikowski (ed.), European Product Liability. In-
[11] ExamAI – Testing and Auditing of AI systems; retrieved from https://testing- tersentia.
ai.gi.de (accessed 17.5.2021).
38
ICAIL 2021, 21-25 June 2021, São Paulo, Brazil G. Borges
[45] Thomas Schulz. 2015. Verantwortlichkeit bei autonom agierenden Systemen. No-
mos.
[46] Hans Steege. 2021. Auswirkungen von künstlicher Intelligenz auf die Produzen-
tenhaftung in Verkehr und Mobilität. Zum Thema des Plenarvortrags auf dem
59. Deutschen Verkehrsgerichtstag, NZV 2021, 6–13.
[47] Chris Reed, Elizabeth Kennedy, Sara Silva. 2016. Responsibility, Autonomy and
Accountability: Legal Liability for Machine Learning, Queen Mary University of
London Legal Studies Research Paper No. 243/2016.
[48] Georg Borges. 2016. Haftung für selbstfahrende Autos. Warum eine Kausalhaf-
tung für selbstfahrende Autos gesetzlich geregelt werden sollte, CR 2016, 272–
280.
[49] Sabine Gless, Ruth Janal. 2016. Hochautomatisiertes und autonomes Autofahren
– Risiko und rechtliche Verantwortung, JR 2016, 561–575.,
[50] Herbert Zech. 2019. Künstliche Intelligenz und Haftungsfragen, ZfPW 2019,
198–219.
[51] Christian Gomille. 2016. Herstellerhaftung für automatisierte Fahrzeuge, JZ
2016, 76–82.
[52] Geoffrey Hazard, Michele Taruffo.1993. American Civil Procedure. Yale Univer-
sity Press.
[53] Mario Martini. 2019. Blackbox Algorithmus – Grundfragen einer Regulierung
Künstlicher Intelligenz. Springer.
39
A Combined Rule-Based and Machine Learning Approach for Automated GDPR Compliance Checking

Rajaa EL HAMDANI (HEC Paris, France), Majd Mustapha (EURA NOVA, Belgium), David Restrepo Amariles (HEC Paris, France)

ABSTRACT
The General Data Protection Regulation (GDPR) requires data controllers to implement end-to-end compliance. Controllers must therefore ensure that the terms agreed with the data subject and their own obligations under GDPR are respected in the data flows from data subject to controllers, processors and sub-processors (i.e. data supply chain). This paper seeks to contribute to bridging both ends of compliance checking through a two-pronged study. First, we conceptualize a framework to implement a document-centric approach to compliance checking in the data supply chain. Second, we develop specific methods to automate compliance checking of privacy policies. We test a two-modules system, where the first module relies on NLP to extract data practices from privacy policies. The second module encodes GDPR rules to check the presence of mandatory information. The results show that the text-to-text approach outperforms local classifiers and enables the extraction of both coarse-grained and fine-grained information with only one model. We implement a full evaluation of our system on a dataset of 30 privacy policies annotated by legal experts. We conclude that this approach could be generalized to other documents in the data supply chain as a means to improve end-to-end compliance.

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Information extraction; Machine learning; • Social and professional topics → Privacy policies; • Security and privacy → Usability in security and privacy.

ACM Reference Format:
Rajaa EL HAMDANI, Majd Mustapha, David Restrepo Amariles, Aurore Troussel, Sébastien Meeùs, and Katsiaryna Krasnashchok. 2021. A Combined Rule-Based and Machine Learning Approach for Automated GDPR Compliance Checking. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466081

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06...$15.00
https://doi.org/10.1145/3462757.3466081

1 INTRODUCTION
It has become widely acknowledged that complying with data protection laws, particularly the European General Data Protection Regulation (GDPR), is the most difficult compliance challenge organizations face today across industries [1]. Moreover, technology has become the most important compliance cost for organizations, as they turned to specialized technologies to carry out compliance tasks such as document review, regulatory checks and operational audits [1]. Despite the increasing interest in technology as a compliance tool in data protection, there is not yet a comprehensive conceptual framework characterizing the tasks required to verify compliance in the entire data supply chain (i.e., end-to-end compliance), and how current methods can contribute to addressing them.
This paper intends to contribute to both of these issues. First, we lay down a framework to implement and monitor GDPR compliance in the data supply chain through a document-centric approach. We define three key tasks of compliance checking based on the function and content of the document: (1) document to regulation, (2) document to document, and (3) document to operations. We apply this framework to analyze the compliance function of privacy policies in the data supply chain and define the tasks required to determine if a privacy policy is compliant with the GDPR.
Second, we develop and test several methods to verify compliance of privacy policies with the GDPR by leveraging the advantages of both machine learning and rule-based approaches. In particular, we build a two-modules system to verify the completeness of privacy policies with regard to mandatory information. The first module automatically extracts coarse-grained and fine-grained data practices, while the second module analyzes extracted data practices and checks the presence of mandatory information according to the provisions of the GDPR. We make use of the OPP-115 dataset [48] for training and evaluation of our models. We treat the extraction of data practices as a Hierarchical Multi-label Classification (HMTC) task and experiment with two different approaches: local classifiers and text-to-text. Our proposed text-to-text method has several advantages over local classifiers, including extraction of additional information and better scalability.
Our contributions are the following:
• we present a theoretical framework to implement automated GDPR compliance in the data supply chain;
• we provide a formal and substantive approach to verify compliance of privacy policies with the GDPR;
• we develop a combined rule-based and machine learning approach to experiment with automated formal compliance checking of privacy policies with the GDPR;
• we propose a text-to-text approach for HMTC using transfer learning and multi-task learning;
• additionally, we extract the span of text from a privacy policy corresponding to the fine-grained data practices, which provides better explainability of the classification results;
• we further annotate a dataset of 30 privacy policies with the presence of mandatory information, in order to fully evaluate our two-modules system of compliance checking;
• finally, we release a public repository of the dataset, implementations and the fine-tuned T5-11B model on OPP-115¹.
This paper is organized as follows. Section 2 provides an overview of the related work. Section 3 lays down a general compliance framework to automate GDPR compliance and applies it to privacy policies. Section 4 describes the development of data practices extraction algorithms and its evaluation results. Section 5 presents the rule-based system, which verifies the presence of mandatory information, and its evaluation on an annotated dataset.

2 RELATED WORK
The Information Commissioner's Office (ICO), the Data Protection Authority of the United Kingdom, observes that overall compliance with the GDPR is jeopardized by the "opaque nature" of the data supply chain, which is poorly documented and generally fails to comply with the GDPR's accountability principle [2]. The ICO further points out that controllers and processors should not limit compliance to entering into contracts and producing legal documents. They must monitor processing activities and conduct audits to ensure that appropriate technical and organizational measures are in place throughout the data supply chain [2].
Most research in AI and law studying GDPR compliance has focused on the relations between data subjects and data controllers, rather than on the compliance challenges in the data supply chain [3, 6, 11, 54]. More recently, the BPR4GDPR project started working on a compliance ontology specification that supports end-to-end compliance [37] and which can contribute to addressing some of the operational challenges raised by the ICO. A compliance framework is provided in [40] for specific documents in the supply chain such as the data protection impact assessment (DPIA).

2.1 Model-based Compliance Checking
The AI and law research community has developed model-based methods for automated compliance checking, such as legal ontologies that support legal reasoning via logic programming [9, 12, 13, 15, 16, 19, 30, 44, 45]. However, even though most legal rules can be described in logic programming, these methods face two main challenges when applied to real-life cases:
• Knowledge acquisition bottleneck [17]: Logic programming requires the encoding of facts into predicate form, but such an encoding would be very cumbersome for each data protection document in the data supply chain.
• Open texture of legal language [46]: Privacy concepts can be quite abstract, and their evaluation arduous, as it is impossible to define a finite set of rules for all possible applications. For example, the storage limitation principle states that personal data must not be kept "longer than is necessary for the purposes for which the personal data are processed" (Art. 5.1(e)). This principle does not define specific time limits that can be evaluated easily.
To solve the aforementioned problems, we employ natural language processing (NLP) techniques to automatically extract information from the data supply chain documents. Likewise, the authors of [44] suggest using NLP to extract information from legal documents according to their UML model of the GDPR, to automatically construct their model-based representation. A large and growing body of literature uses machine learning and NLP algorithms to extract privacy information from legal documents; however, it only considers privacy policies and does not yet deal with other documents of the data supply chain, such as DPAs and DPIAs.

2.2 Information extraction from privacy policies
Privacy policies are long documents that are difficult to read for data subjects. Empirical studies have been conducted to study and measure the ambiguity and vagueness of privacy policies [5, 20, 24, 36]. Efforts have been made to decrease these deficiencies [42, 51] by using classification techniques to extract data practices from the text and represent them in a user-friendly interface. Other approaches focus on the extraction of one specific type of data practice, such as "opt-out choice" in [38].
More and more methods are being developed to analyze the compliance of privacy policies with the GDPR. A variety of NLP tools such as word embeddings are used in [7, 43] to verify the completeness of privacy policies according to the rules set out by the GDPR. The CLAUDETTE project [6] extracts clauses that are problematic with respect to the GDPR. In a different project [23], privacy policies were analyzed in a large-scale setting to study the effect of the GDPR on their provisions. For instance, the comparison between pre-GDPR and post-GDPR versions of 6,278 English privacy policies showed that the GDPR caused textual changes of the privacy policies, such that their appearance improved, their length increased, and they cover more data practice categories.
A great deal of previous research on privacy policies relies on supervised machine learning methods, which require datasets of annotated privacy policies. However, there are very few such publicly released datasets. In our work, we use OPP-115 [48] – a corpus of 115 privacy policies annotated with both coarse-grained and fine-grained data practices. Several works used OPP-115 to train machine learning algorithms on the task of extracting data practices [14, 25, 28, 29, 32]. PrivacyQA [35] is another publicly available dataset of 1,750 questions about the privacy policies of mobile applications, which is used to train question-answering algorithms. A serious limitation of the publicly available datasets is that their annotation schemes contain few concepts aligned with the GDPR. A connection between the data practices identified in OPP-115 and the GDPR was presented in [31], which revealed that the principle of accountability of Article 5 is absent from OPP-115 concepts.

¹ https://github.com/smartlawhub/Automated-GDPR-Compliance-Checking
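The two challenges of Section 2.1 can be made concrete with a small sketch. Everything below is hypothetical: the predicate-style facts, the purposes and the day limits are invented, and the fixed NECESSARY_DAYS table is exactly the kind of arbitrary proxy that the open-textured wording of Art. 5.1(e) forces an implementer to choose.

```python
# Toy predicate-style encoding of a compliance check (all values hypothetical).
# Each fact must be hand-encoded from a document, which is the knowledge-
# acquisition bottleneck. Facts: (data_category, purpose, retention_days).
FACTS = [
    ("email_address", "newsletter", 365),
    ("payment_data", "fraud_prevention", 3650),
]

# Art. 5.1(e) forbids keeping data "longer than is necessary for the purposes"
# -- an open-textured standard, so an executable rule needs a concrete proxy.
NECESSARY_DAYS = {"newsletter": 730, "fraud_prevention": 1825}  # assumed limits

def violates_storage_limitation(fact):
    """True if the retention period exceeds the assumed per-purpose limit."""
    _, purpose, days = fact
    limit = NECESSARY_DAYS.get(purpose)
    return limit is not None and days > limit

violations = [f for f in FACTS if violates_storage_limitation(f)]
```

The point of the sketch is the gap it exposes: both the facts and the numeric thresholds have to come from somewhere, and the regulation supplies neither.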
3.2 The compliance of privacy policies
Privacy policies are the compliance documents that appear at the top of the data supply chain. Hence, we first apply our framework to analyze privacy policies' compliance in the data supply chain. This paper seeks to define the tasks required to determine a privacy policy's formal and substantive compliance with the GDPR.
The GDPR does not explicitly mention privacy policies, but data controllers widely use them. Their main function is to provide mandatory information to data subjects according to Articles 13 and 14. The absence of any part of this information renders the privacy policy non-compliant. Moreover, the GDPR requires privacy policies to use plain and clear language so individuals can understand how their personal data are processed, provide their consent and exercise their rights. Consequently, the tasks to verify the formal compliance of privacy policies are:
• Check the presence of each piece of mandatory information according to Articles 13 and 14.
• Check the readability and clarity of the language used in the privacy policy.
The substantive compliance checking of privacy policies consists of verifying that the data processing complies with the data protection rules (e.g., fair and transparent processing of Art. 5, and lawfulness of the processing of Art. 6). For example, a privacy policy must specify the legal basis of the processing according to Article 13.1(c), but it must also demonstrate that this legal basis complies with Article 6 requirements on the lawfulness of the processing (e.g., if the legal basis is consent, the controller must ensure that consent has been given for one or more specific purposes).
In this study, we automate the first task of formal compliance checking of privacy policies. In the following sections, we describe how we combined rules and machine learning to check the presence of mandatory information in privacy policies automatically. The end-users of such a system would be lawyers or data protection officers who review large numbers of privacy policies to check their compliance with the GDPR. Another type of end-user would be project managers in small companies who lack the legal knowledge to ensure privacy policies' compliance.

4 EXTRACTION OF DATA PRACTICES FROM PRIVACY POLICIES
Ensuring both the compliance of documents and data processing activities is becoming more burdensome to companies due to several challenges. We focus on the challenge posed by the large number of documents that data protection officers need to review to guarantee compliance. We suggest using natural language processing technologies to assist data protection officers in performing the compliance checking tasks. NLP could help to extract compliance information from unstructured compliance documents and save it into structured formats such as XML or RDF to unlock use cases such as automated compliance checking.
In this paper, we describe our experiment in automating formal compliance of privacy policies. We first train a machine learning algorithm to extract from privacy policies information describing the company's data practices. We then use the extracted information as input to a rule-based system that encodes Articles 13 and 14.

4.1 OPP-115: Training Dataset of Online Privacy Policies
For our task we make use of the Usable Privacy Policy Project's Online Privacy Policies (OPP-115) corpus, introduced by [48], which contains detailed annotations made by Subject Matter Experts (SMEs) for the data practices described in a set of 115 website privacy policies.
At a high level, annotations fall into one of ten data practice categories:
(1) 1st Party Collection/Use: What, why and how information is collected by the service provider.
(2) 3rd Party Sharing/Collection: What, why and how information is shared with or collected by third parties.
(3) User Choice/Control: Control options available to users.
(4) User Access, Edit, & Deletion: If/how users can access, edit or delete information.
(5) Data Retention: How long user information will be stored.
(6) Data Security: Protection measures for user information.
(7) Policy Change: Informing users if policy information has been changed.
(8) Do Not Track: If and how DNT signals for online tracking and advertising are honored.
(9) International & Specific Audiences: Practices pertaining to a specific group of users.
(10) Other: General text, contact information or practices not covered by the other categories.
According to the dataset creators, the best agreement between SMEs was achieved on the Do Not Track class, with Fleiss' Kappa equal to 91%, whereas the most controversial class was Other, with only 49% agreement [48]. We further decompose the latter category into its attributes – "Introductory/Generic", "Privacy Contact Information" and "Practice Not Covered" – resulting in 12 categories.
Figure 1 depicts a fragment of the OPP-115 taxonomy: for each class (grey shaded blocks), a set of lower-level privacy attributes is assigned (20 in total, dark blue shaded blocks), with specific values corresponding to each attribute. For example, the attribute "Personal Information Type" designates the different types of personal information mentioned in the text, as can be seen from the annotations in Figure 2 from the IMDb policy², annotated with the "1st Party Collection/Use" category.
OPP-115 comprises 3,792 segments, each segment labeled with one or more classes out of 12. The SMEs produced a total of 23K annotations of categories. In aggregate, these categories were associated with 128K values for attributes and 103K selected spans of policy text. To the best of our knowledge, this is the first effort to leverage these spans to extract information from privacy policies.
We split the OPP-115 dataset on a policy-document level into 3 sets: 65 policies are used for training, 35 for validation and 30 policies are kept as a testing set.

4.2 Problem formulation
The taxonomy of data practices is organized in a class hierarchy that we model as a Directed Acyclic Graph (DAG), shown in Figure 3.

² To retrieve the exact source used: <https://web.archive.org/web/20200526092253if_/https://www.imdb.com/privacy#auto> ("Automatic Information" sub-section).
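The presence check performed by the rule module can be illustrated with a minimal sketch. The category names follow OPP-115, but the mapping from GDPR articles to supporting categories shown here is a simplified, hypothetical stand-in for the actual rule set described in Section 5.

```python
# Sketch of the rule module: verify that the data practices extracted by the
# first module cover the mandatory information of Articles 13 and 14.
# The article-to-category mapping below is a simplified, assumed example.
MANDATORY = {
    "Art. 13.1(c) purposes of the processing": {"1st Party Collection/Use"},
    "Art. 13.1(e) recipients of the data": {"3rd Party Sharing/Collection"},
    "Art. 13.2(a) storage period": {"Data Retention"},
    "Art. 13.2(b) data subject rights": {"User Access, Edit, & Deletion"},
}

def missing_mandatory_information(extracted_categories):
    """Return the mandatory items with no supporting category in the policy."""
    found = set(extracted_categories)
    return [item for item, support in MANDATORY.items() if not support & found]

# Categories extracted from a (fictional) policy by the first module:
policy_categories = {"1st Party Collection/Use", "Data Retention", "Data Security"}
report = missing_mandatory_information(policy_categories)
```

In this fictional example the report flags the missing recipients and data-subject-rights items; an empty report would mean the policy is formally complete under the simplified mapping.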
Figure 1: The privacy taxonomy of [48]. The top level of the hierarchy (grey shaded blocks) defines coarse-grained data practices or privacy categories. The lower level defines a set of privacy attributes (blue shaded blocks), each assuming a set of values. We show a subset of the taxonomy for clarity and space considerations.
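A taxonomy of this shape can be held as an adjacency list. The sketch below uses an abbreviated, partly assumed subset of the OPP-115 labels (the value lists are illustrative, not the full 20 attributes); because an attribute such as "Personal Information Type" hangs under more than one category, the structure is a DAG rather than a tree, matching the formulation in Section 4.2.

```python
# Abbreviated OPP-115 taxonomy fragment as an adjacency-list DAG
# (labels are a subset; the value lists are assumed for illustration).
TAXONOMY = {
    "1st Party Collection/Use": ["Personal Information Type", "Purpose"],
    "3rd Party Sharing/Collection": ["Personal Information Type", "Third Party Entity"],
    "Personal Information Type": ["Contact", "Location", "Financial"],
    "Purpose": ["Advertising", "Analytics"],
    "Third Party Entity": ["Named third party", "Unnamed third party"],
    "Contact": [], "Location": [], "Financial": [],
    "Advertising": [], "Analytics": [],
    "Named third party": [], "Unnamed third party": [],
}

def leaves_under(node):
    """Collect the leaf values reachable from a node (simple DAG traversal)."""
    children = TAXONOMY.get(node, [])
    if not children:
        return {node}
    out = set()
    for child in children:
        out |= leaves_under(child)
    return out

fine_grained = sorted(leaves_under("1st Party Collection/Use"))
```

Sharing "Personal Information Type" between the two categories is what makes the graph acyclic but not tree-shaped: the same attribute node has two parents.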
language model, pretrained on all the permutations of the input sequence. We fine-tune XLNet³ on 21 tasks: one task for predicting the categories, and the rest, one per attribute, for predicting that attribute's values.

Table 1: Results of categories prediction by the local classifiers approach

                                     CNN               XLNet
Categories                       P    R    F1      P    R    F1
Introductory/Generic            74   40   52      76   54   63
Policy Change                   85   60   71      73   65   69
Specific audiences              90   77   83      85   80   82
Privacy Contact Info            87   52   65      84   75   79
1st Party Collection            67   87   76      84   81   83
Data Retention                  52   39   45      58   36   44
3rd party sharing/collection    71   85   78      76   87   81
User Choice/Control             45   79   58      66   69   67
Practice Not Covered            39   39   38      40   44   42
Data Security                   79   48   60      77   68   72
Access, Edit, Deletion          87   35   50      75   72   74
Do Not Track                   100   29   45      93  100   96
Macro-Average                   72   55   62      76   71   73

4.4 Text-to-Text approach
In this section we explain how we use T5 [33] to solve HMTC. T5 is a pretrained language model based on the transformer architecture. T5 has two main differences from XLNet. First, it is pretrained on a multi-task mixture of unsupervised and supervised tasks. Second, each task is converted into a text-to-text format. We adopt T5 both for its top results on NLP benchmarks and for its text-to-text nature.
The local classifiers approach has two main drawbacks. First, it trains the set of local classifiers independently. Second, the number of local classifiers grows linearly with the size of the label hierarchy. These limitations motivate this second approach, in which we convert the HMTC task into two text-to-text tasks, one for each level of the label hierarchy, to better capture the dependencies between labels belonging to the same level. Moreover, by training one unique model per level, we ensure that the number of classifiers scales linearly with the hierarchy's depth.
Thanks to the text-to-text nature of T5, we can simplify HMTC into the two text-to-text tasks shown in Figure 4. To prepare the task of categories prediction, we prepend the "categories prediction: " prefix to the text of segments and generate one sequence of categories separated by "; ", as shown in Figure 5. The lists of categories were sorted in alphabetical order so that they have the same order across training examples, as advised by the authors of [33].
The second task's objective is to predict the values of the attributes of a category from an input segment, as well as to generate the spans of text related to the predicted values. This task is similar to a reading comprehension task [34], where the question is "what is the value of the attribute?" and the context paragraph is the pair (segment, category). We therefore format it into a text-to-text task, similar to how the authors of T5 formatted the reading comprehension dataset SQuAD (see Figure 6).
Once we format the tasks, we fine-tune the largest available T5, with 11B parameters, on these tasks. We try two fine-tuning methods: the first (advised by [33]) is to fine-tune on each task independently; the second is to fine-tune in a multi-task setting on a mixture of both tasks to capture the global label hierarchy. We fixed the input and output sequence lengths to 512 and the batch size to 16, and performed a grid search over the learning rate. The model was fine-tuned each time for 25,000 steps. Interestingly, the best-performing learning rate is the same (4e-3) for all the models and tasks.

4.5 Evaluation measures
Evaluation of multi-label classification: We use precision, recall, and F1-score to evaluate the extraction of both coarse-grained and fine-grained data practices from privacy policy segments. Since we are in a multi-label classification setting, we adapt the traditional single-label metrics by using the label-based metrics [52]: precision, recall, and F1-score for the j-th class label y_j are defined as follows:

    P_j = TP_j / (TP_j + FP_j)    R_j = TP_j / (TP_j + FN_j)    F1_j = 2 · P_j · R_j / (P_j + R_j)

TP_j, FP_j, TN_j, and FN_j are the numbers of true positive, false positive, true negative, and false negative test examples with respect to class label y_j. To measure the global performance over a set of labels, we compute the macro-average of each metric by averaging over the set of labels⁴.
For the task of values prediction, we only report the results for the attributes and values that we use to automate formal compliance checking of privacy policies, such that the reported metrics are the macro-average over the values necessary for compliance checking.
Evaluation of span extraction: We use the F1-score, as in the SQuAD dataset [34], to evaluate the extraction of the spans associated with the values, by comparing the ground-truth target to the generated target.

4.6 Results and Discussion
Local classifiers approach: In Table 1, we report the results of evaluating the CNN and XLNet experiments on categories prediction. We present the results of the CNN and XLNet for the task of values prediction in Table 2. XLNet outperforms the CNN on the task of categories prediction. However, it performs significantly worse than the CNN on the second task, because there are enough examples to fine-tune XLNet for categories prediction but not enough for values prediction.
Text-to-text approach: We report the results of evaluating the two tasks on the test dataset in Table 3, Table 4, and Table 5. We observe that individual fine-tuning and multi-task fine-tuning achieve similar recall on both tasks, but they differ significantly in precision. By fine-tuning each task separately, we obtain a 2.6% precision improvement for the task of categories prediction and

³ We used the pretrained model available at the Hugging Face models hub.
⁴ It is worth noting that we do not use the same precision, recall, and F1-score as in [14], where the macro-average of each metric is taken over predicting both the presence and the absence of each label.
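The label-based precision, recall, and F1 above, together with their macro-average, can be sketched in plain Python (an illustration of the formulas, not the authors' evaluation code; examples are represented as sets of labels):

```python
def label_metrics(y_true, y_pred, labels):
    """Per-label precision/recall/F1 and their macro-average.

    y_true, y_pred: lists of label sets, one set per test example
    (multi-label ground truth and predictions)."""
    per_label = {}
    for j in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if j in t and j in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if j not in t and j in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if j in t and j not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_label[j] = (prec, rec, f1)
    # Macro-average: unweighted mean of each metric over the label set.
    macro = tuple(sum(m[i] for m in per_label.values()) / len(labels)
                  for i in range(3))
    return per_label, macro

y_true = [{"Data Retention"}, {"Policy Change", "Data Security"}]
y_pred = [{"Data Retention"}, {"Policy Change"}]
_, macro = label_metrics(y_true, y_pred,
                         ["Data Retention", "Policy Change", "Data Security"])
print(round(macro[2], 2))  # macro F1 → 0.67
```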
A Combined Rule-Based and Machine Learning Approach for Automated GDPR Compliance Checking ICAIL’21, June 21–25, 2021, São Paulo, Brazil
Figure 4: The hierarchical multi-label classification of OPP-115 is converted into two text-to-text tasks. The first task is to predict categories
of data-practice from an input segment. We then retrieve the attributes of each predicted category to feed them and their category as input to
the second task. The second task is to predict the values of the input attributes and to generate the corresponding span of text (highlighted).
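The first of the two tasks, prefixing each segment and emitting its alphabetically sorted categories joined by "; ", can be sketched as follows (`make_category_example` is an illustrative helper name, not code from the paper):

```python
def make_category_example(segment, categories):
    """Build one (source, target) pair for the categories-prediction
    task: prepend the task prefix and sort the categories so the
    target order is identical across training examples."""
    source = "categories prediction: " + segment
    target = "; ".join(sorted(categories))
    return source, target

src, tgt = make_category_example(
    "We retain your data for 30 days.",
    ["Data Retention", "1st Party Collection"],
)
print(tgt)  # → 1st Party Collection; Data Retention
```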
Figure 6: Example of an input to train T5 for prediction of attributes' values and the corresponding spans of text

Table 2: Results of values prediction by CNN and XLNet for attributes used in compliance checking. We report the macro-average over the values of each attribute.

                                     CNN               XLNet
Attribute                        P    R    F1      P    R    F1
action first-party              44   47   45      16   33   21
does/does not                   93   77   84      84   82   83
personal information type       73   58   64      56   49   52
purpose                         74   56   64      74   64   69
retention period                79   62   69      50   11   18
access type                     61   51   55      30   26   27
Macro-average                   72   60   65      51   44   47

Table 4: Results of values prediction for attributes used in compliance checking by T5. We report the macro-average over the values of each attribute.

                              Multi-task Fine-tuning    Task 2 Fine-tuning
Attribute                        P    R    F1             P    R    F1
action first-party              55   56   55             62   58   60
does/does not                   90   78   83             91   83   87
personal information type       72   61   66             73   63   68
purpose                         72   65   69             74   67   70
retention period                53   47   50             50   25   33
access type                     62   58   60             71   70   70
Macro-average                   67   61   63             70   61   65

4.5% precision improvement for values prediction. This behavior, where separate models trained on each task outperform the multi-task model, is coherent with previous findings [4, 26, 33].
The performance of span extraction, presented in Table 5, is low in comparison with the performance of transformer models
on similar tasks such as reading comprehension or named entity recognition, which might be due to the relatively small number of training examples given to T5.

Table 5: Results of evaluation of span extraction by T5.

communicated to data subjects (language complexity, length of sentences, etc.). Legal experts manually converted the rules from Articles 13 and 14 into code using the OPP-115 taxonomy. As the OPP-115 taxonomy does not cover all the concepts of the GDPR, we only encoded the articles presented in the second column of Table 7.
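The SQuAD-style token-overlap F1 used to score span extraction (Table 5) can be sketched as below; this is a simplification of the official SQuAD evaluation, which also strips punctuation and English articles before comparing tokens:

```python
from collections import Counter

def span_f1(prediction, ground_truth):
    """Token-overlap F1 between a generated span and the ground-truth
    span (whitespace tokenization only, as a sketch)."""
    pred_toks = prediction.lower().split()
    gold_toks = ground_truth.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(round(span_f1("for 30 days", "data for 30 days"), 3))  # → 0.857
```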
Figure 8: The distribution of errors over types of mandatory information for OPP-115 and post-GDPR privacy policies. (Bar chart; legend: OPP-115, Post-GDPR; y-axis: normalized count of errors, 0.0–0.4; x-axis: information type.)

REFERENCES
[1] 2017. The True Cost of Compliance with Data Protection Regulations. Technical Report. Ponemon Institute LLC.
[2] 2019. ICO Guidance: Update report into adtech and real time bidding. Technical Report. Information Commissioner's Office. 19–21 pages.
[3] David Restrepo Amariles, Aurore Clément Troussel, and Rajaa El Hamdani. 2020. Compliance Generation for Privacy Documents under GDPR: A Roadmap for Implementing Automation and Machine Learning. arXiv preprint arXiv:2012.12718 (2020).
[4] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019 (2019).
[5] Jaspreet Bhatia, Travis D Breaux, Joel R Reidenberg, and Thomas B Norton. 2016. A theory of vagueness and privacy risk perception. In 2016 IEEE 24th International Requirements Engineering Conference (RE). IEEE, 26–35.
[6] Giuseppe Contissa, Koen Docter, Francesca Lagioia, Marco Lippi, Hans-W Micklitz, Przemysław Pałka, Giovanni Sartor, and Paolo Torroni. 2018. Claudette meets GDPR: Automating the evaluation of privacy policies using artificial intelligence. Available at SSRN 3208596 (2018).
[7] Elisa Costante, Yuanhao Sun, Milan Petković, and Jerry Den Hartog. 2012. A machine learning solution to assess privacy policy completeness: (short paper).
In Proceedings of the 2012 ACM Workshop on Privacy in the Electronic Society. 91–96.
[8] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019).
[9] Marina De Vos, Sabrina Kirrane, Julian Padget, and Ken Satoh. 2019. ODRL policy modelling and compliance checking. In International Joint Conference on Rules and Reasoning. Springer, 36–51.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[11] Olha Drozd and Sabrina Kirrane. 2020. Privacy CURE: Consent Comprehension Made Easy. In IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, 124–139.
[12] María Teresa Gómez-López, Luisa Parody, Rafael M Gasca, and Stefanie Rinderle-Ma. 2014. Prognosing the compliance of declarative business processes using event trace robustness. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems". Springer, 327–344.
[13] Guido Governatori and Sidney Shek. 2012. Rule Based Business Process Compliance. In RuleML (2). Citeseer.
[14] Hamza Harkous, Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G Shin, and Karl Aberer. 2018. Polisis: Automated analysis and presentation of privacy policies using deep learning. In 27th USENIX Security Symposium (USENIX Security 18). 531–548.
[15] Mustafa Hashmi, Guido Governatori, and Moe Thandar Wynn. 2012. Business process data compliance. In International Workshop on Rules and Rule Markup Languages for the Semantic Web. Springer, 32–46.
[16] Mustafa Hashmi, Guido Governatori, and Moe Thandar Wynn. 2016. Normative requirements for regulatory compliance: An abstract formal framework. Information Systems Frontiers 18, 3 (2016), 429–455.
[17] Martin Hepp. 2008. Ontologies: State of the art, business potential, and grand challenges. Ontology Management (2008), 3–22.
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[19] Katsiaryna Krasnashchok, Majd Mustapha, Anas Al Bassit, and Sabri Skhiri. 2020. Towards Privacy Policy Conceptual Modeling. In International Conference on Conceptual Modeling. Springer, 429–438.
[20] Logan Lebanoff and Fei Liu. 2018. Automatic detection of vague words and sentences in privacy policies. arXiv preprint arXiv:1808.06219 (2018).
[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[22] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115 (2017).
[23] Thomas Linden, Rishabh Khandelwal, Hamza Harkous, and Kassem Fawaz. 2020. The privacy policy landscape after the GDPR. Proceedings on Privacy Enhancing Technologies 2020, 1 (2020), 47–64.
[24] Fei Liu, Nicole Lee Fella, and Kexin Liao. 2018. Modeling language vagueness in privacy policies using deep neural networks. arXiv preprint arXiv:1805.10393 (2018).
[25] Frederick Liu, Shomir Wilson, Peter Story, Sebastian Zimmeck, and Norman Sadeh. 2018. Towards automatic classification of privacy policy text. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU-ISR-17-118R and CMU-LTI-17-010 (2018).
[26] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730 (2018).
[27] Tomas Mikolov, Kai Chen, G. S. Corrado, and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR.
[28] Majd Mustapha, Katsiaryna Krasnashchok, Anas Al Bassit, and Sabri Skhiri. 2020. Privacy Policy Classification with XLNet (Short Paper). In Data Privacy Management, Cryptocurrencies and Blockchain Technology. Springer, 250–257.
[29] Najmeh Mousavi Nejad, Pablo Jabat, Rostislav Nedelchev, Simon Scerri, and Damien Graux. 2020. Establishing a strong baseline for privacy policy classification. In IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, 370–383.
[30] Monica Palmirani, Michele Martoni, Arianna Rossi, Cesare Bartolini, and Livio Robaldo. 2018. PrOnto: Privacy ontology for legal reasoning. In International Conference on Electronic Government and the Information Systems Perspective. Springer, 139–152.
[31] Ellen Poplavska, Thomas B Norton, Shomir Wilson, and Norman Sadeh. 2020. From Prescription to Description: Mapping the GDPR to a Privacy Policy Corpus Annotation Scheme. In 33rd International Conference on Legal Knowledge and Information Systems, JURIX 2020. IOS Press BV, 243–246.
[32] Wenjun Qiu and David Lie. 2020. Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification. arXiv preprint arXiv:2008.02954 (2020).
[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[34] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018).
[35] Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and Norman Sadeh. 2019. Question answering for privacy policies: Combining computational and legal perspectives. arXiv preprint arXiv:1911.00841 (2019).
[36] Joel R Reidenberg, Jaspreet Bhatia, Travis D Breaux, and Thomas B Norton. 2016. Ambiguity in privacy policies and the impact of regulation. The Journal of Legal Studies 45, S2 (2016), S163–S190.
[37] Community Research and Development Information Service. 2021. Business Process Re-engineering and functional toolkit for GDPR compliance. https://cordis.europa.eu/project/id/787149/results. Accessed: 2021-02-28.
[38] Kanthashree Mysore Sathyendra, Shomir Wilson, Florian Schaub, Sebastian Zimmeck, and Norman Sadeh. 2017. Identifying the provision of choices in privacy policy text. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2774–2779.
[39] Carlos N Silla and Alex A Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1 (2011), 31–72.
[40] Laurens Sion, Pierre Dewitte, Dimitri Van Landuyt, Kim Wuyts, Peggy Valcke, and Wouter Joosen. 2020. DPMF: A Modeling Framework for Data Protection by Design. Enterprise Modelling and Information Systems Architectures (EMISAJ) 15 (2020), 10-1.
[41] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1555–1565.
[42] Welderufael B Tesfay, Peter Hofmann, Toru Nakamura, Shinsaku Kiyomoto, and Jetzabel Serna. 2018. PrivacyGuide: towards an implementation of the EU GDPR on internet privacy policy evaluation. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics. 15–21.
[43] Damiano Torre, Sallam Abualhaija, Mehrdad Sabetzadeh, Lionel Briand, Katrien Baetens, Peter Goes, and Sylvie Forastier. 2020. An AI-assisted approach for checking the completeness of privacy policies against GDPR. In 2020 IEEE 28th International Requirements Engineering Conference (RE). IEEE, 136–146.
[44] Damiano Torre, Ghanem Soltana, Mehrdad Sabetzadeh, Lionel C Briand, Yuri Auffinger, and Peter Goes. 2019. Using models to enable compliance checking against the GDPR: an experience report. In 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems (MODELS). IEEE, 1–11.
[45] Silvano Colombo Tosatto, Guido Governatori, Nick van Beest, and Francesco Olivieri. 2019. Efficient Full Compliance Checking of Concurrent Components for Business Process Models. FLAP 6, 5 (2019), 963–998.
[46] Sebastian Urbina. 2002. Legal method and the rule of law. Vol. 59. Springer Science & Business Media.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
[48] Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N Cameron Russell, et al. 2016. The creation and analysis of a website privacy policy corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1330–1340.
[49] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS.
[50] Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. arXiv preprint arXiv:1909.00161 (2019).
[51] Razieh Nokhbeh Zaeem, Rachel L German, and K Suzanne Barber. 2018. PrivacyCheck: Automatic summarization of privacy policies using data mining. ACM Transactions on Internet Technology (TOIT) 18, 4 (2018), 1–18.
[52] Min-Ling Zhang and Zhi-Hua Zhou. 2013. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 8 (2013), 1819–1837.
[53] Ben Zhou, Daniel Khashabi, Chen-Tse Tsai, and Dan Roth. 2019. Zero-shot open entity typing as type-compatible grounding. arXiv preprint arXiv:1907.03228 (2019).
[54] Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N Cameron Russell, and Norman Sadeh. 2019. MAPS: Scaling privacy compliance analysis to a million apps. Proceedings on Privacy Enhancing Technologies 2019, 3 (2019), 66–86.
On Semantics-based Minimal Revision for Legal Reasoning
Wachara Fungwacharakorn Kanae Tsushima Ken Satoh
National Institute of Informatics and National Institute of Informatics and National Institute of Informatics and
SOKENDAI SOKENDAI SOKENDAI
Chiyoda, Tokyo, Japan Chiyoda, Tokyo, Japan Chiyoda, Tokyo, Japan
wacharaf@nii.ac.jp k_tsushima@nii.ac.jp ksatoh@nii.ac.jp
definition by defining a partial semantics-based minimal revision. Then, we describe a subtype of a partial semantics-based minimal revision based on a dominant rule-base, which is a set of Horn clauses obtained from the subset of a rule-base specific to the considered fact-base. Since the judicial revision of legal interpretation is initiated by an exceptional case, we present one guidance for obtaining the dominant-based minimal revision, based on Legal Debugging [17] and Closed World Specification [6], by putting all facts of an exceptional case as conditions of a new rule. We show that this revision would affect only the dominant rule-base of the fact-bases that are larger than or equal to the exceptional case. Then, to optimize the revision, we remove redundant conditions in the body as long as the minimality of the revision remains.
Beyond the definitions of the dominant-based minimal revision and the guidance for obtaining it, we also compare the dominant-based minimal revision with the syntax-based minimal revision under the Theory Distance Metric [47], which states that the distance of a revision is the minimal number of program edit operations (i.e. deleting a rule, adding a rule with an empty body, adding a condition to a rule, or deleting a condition from a rule) that are used for revising an original program into a new program. This definition has been applied in many rule-based revisions (e.g. [12, 28]) due to its simplicity. From this comparison, we show that the syntax-based minimal revision may cause extra semantics changes compared to the dominant-based minimal revision, especially when the original rule-base contains multiple rules for the same consequence. We argue that such extra semantics changes can be considered as changes unintentionally caused by the syntax-based minimal revision, and a legal reasoning system can use such extra semantics changes to check with the user the intention of the changes.
This paper is structured as follows. Section 2 reviews related work. Section 3 provides preliminary definitions of a normal logic program for legal reasoning and of a revision of a normal logic program based on Legal Debugging and Closed World Specification. Section 4 presents the definitions of the semantics of a rule-base, the difference of semantics, the semantics-based minimal revision, and the dominant-based minimal revision. Section 5 describes one guidance for obtaining a dominant-based minimal revision based on Legal Debugging and Closed World Specification. Section 6 compares the dominant-based minimal revision with the syntax-based minimal revision under the Theory Distance Metric, using the example of the judicial revision of Japanese Civil Code Article 612. Section 7 discusses the results obtained from the comparison between the dominant-based minimal revision and the syntax-based minimal revision. Finally, Section 8 provides the conclusion and future work.

[13]), recent studies have become widely interested in revisions of such theories. For example, Henderson and Bench-Capon [24] have encouraged contrastive revision of a theory by not only supporting accepted consequences but also attacking the objected ones. Rotolo and Roversi [35] have classified theory revision criteria, and minimal change is one of such criteria. This criterion comes from standard revision theory, which intuitively states that we should adjust theories as little as possible. To minimally revise rule-based theories, they present two strategies:
(1) keep the set of rules as close as possible to the original one (this strategy is independent of the facts of the case)
(2) minimize the changes of the set of conclusions obtained from the theory (this strategy is dependent on the facts of the case, since different facts give different sets of conclusions)
The first strategy can be traced back to the early studies of syntax-based (or formula-based) minimal revision, e.g. [29, 47]. On the other hand, the second strategy can be traced back to the early studies of semantics-based (or model-based) minimal revision, e.g. [14, 36] (for an intensive survey on both types of minimal revision, see [16]; for an intensive survey specific to semantics-based revision, see [26]).
Explaining legal change can also be seen in recent extensions of HYPO [7, 31]. For example, Horty Bench-Capon theory [25], one extension of HYPO, states that a new legal judgement can be made as long as it preserves the precedential constraint. This guides a judge to introduce new factors in a new case to preserve the precedential constraint if the new case is exceptional. Horty Bench-Capon theory has been extended into Rigoni theory [32] to support hybrid legal reasoning systems between rule-based and case-based reasoning. Moreover, the recently presented Verheij's case models [44–46] have become new explanations of hybrid legal reasoning systems by using the preference of cases. In our study, we explain legal change using a dominant rule-base. Hence, our study differs from Rigoni theory and Verheij's case models, since their approaches focus on the effects of legal change on cases, such as the precedential constraint or the preference of cases, while our approach focuses on the effects of legal change on the revision of rules.

3 PRELIMINARIES
In this paper, we consider a normal logic program (also known as a Prolog program) for a computational legal representation, as in [38, 39, 41]. We use the following notations in our paper.
Definition 3.1 (Normal Logic Program). A normal logic program (hereafter, a program) is a set of rules of the form
• {𝑏 1, . . . , 𝑏𝑚 , 𝑛𝑜𝑡 𝑏𝑚+1, . . . , 𝑛𝑜𝑡 𝑏𝑛 } as a body of a rule, denoted by 𝑏𝑜𝑑𝑦 (𝑅) (each element of a body is called a condition).
Sometimes, we express the rule in the form ℎ ← 𝐵. where 𝐵 is a body of a rule. We write ℎ. (called a fact) if the body of the rule is empty. A rule is called a Horn clause if the negative body of the rule is empty.
Since a program representing statutes is generally a non-recursive and stratified program, we also hold this presumption in this paper. A definition of a non-recursive and stratified program, adopted from [3], is as follows.
Definition 3.2 (A non-recursive and stratified program). A program 𝑇 is non-recursive and stratified if there is a partition 𝑇 = 𝑇0 ∪ 𝑇1 ∪ . . . ∪ 𝑇𝑛 (𝑇𝑖 and 𝑇𝑗 disjoint for all 𝑖 ≠ 𝑗) such that, if a proposition 𝑝 occurs in a body of a rule in 𝑇𝑖, then a rule with 𝑝 in the head is only contained within 𝑇0 ∪ 𝑇1 ∪ . . . ∪ 𝑇𝑗 where 𝑗 < 𝑖.
In civil litigation, a judge would make a correspondence between factual situations in a case and factual concepts in statutes. Then, the judge would conclude a legal decision based on the related statutes. To reflect this civil litigation, we designate a proposition occurring in a head of a rule as a rule proposition and a proposition not occurring in a head of a rule as a fact proposition. By this designation, we denote the set of all fact propositions by F, called a fact-domain, and we denote all fact propositions occurring in a program 𝑇 by 𝑓 (𝑇 ); hence, 𝑓 (𝑇 ) ⊆ F. We call a program 𝑅𝐵 a rule-base if 𝑅𝐵 has no propositions in F occurring in a head of a rule, and all propositions in 𝑅𝐵 that do not occur in a head of a rule are in F.
As a judge makes a correspondence between factual situations in a case and factual concepts, those factual concepts are represented by fact propositions. Hence, we call a set of facts (rules with empty bodies) constructed from a subset of F a fact-base. A fact-base then represents a case. The semantics is the set of propositions (including rule propositions and fact propositions) which can be concluded when compiling a fact-base 𝐹𝐵 with a rule-base 𝑅𝐵 (denoted by 𝐹𝐵 ∪ 𝑅𝐵). In this paper, we apply the stable model semantics [20], defined as follows.
Definition 3.3 (Stable Model Semantics). Let 𝑇 be a normal logic program and 𝑀 be a set of propositions. Let 𝑡𝑟𝑖𝑚(𝑇 ) be a trimming function defined as follows: 𝑡𝑟𝑖𝑚(𝑇 ) = {ℎ𝑒𝑎𝑑 (𝑅) ← 𝑝𝑜𝑠 (𝑅) | 𝑅 ∈ 𝑇 }, and let 𝑇^𝑀 = 𝑡𝑟𝑖𝑚({𝑅 | 𝑅 ∈ 𝑇 and 𝑛𝑒𝑔(𝑅) ∩ 𝑀 = ∅}). 𝑀 is a stable model semantics of 𝑇 if and only if 𝑀 is the minimum set (in the sense of set inclusion) such that 𝑀 satisfies every rule 𝑅′ ∈ 𝑇^𝑀, that is, 𝑝𝑜𝑠 (𝑅′) ⊆ 𝑀 implies ℎ𝑒𝑎𝑑 (𝑅′) ∈ 𝑀.
The semantics of 𝐹𝐵 ∪ 𝑅𝐵 represents the literal interpretation of a statute (represented by 𝑅𝐵) when applied in a particular case (represented by 𝐹𝐵). We denote that a proposition 𝑝 is in the answer set of 𝐹𝐵 ∪ 𝑅𝐵 by 𝐹𝐵 ∪ 𝑅𝐵 ⊢ 𝑝. Since we presume a non-recursive and stratified program, 𝐹𝐵 ∪ 𝑅𝐵 has a unique semantics [20]. This also reflects the constraint that judges need one unique judgement from legal rules.
When a judge applies the literal interpretation of statutes in a particular case and it leads to counterintuitive consequences, the judge may revise the interpretation of the statutes. We call such a case an exceptional case. To distinguish the present case as an exceptional case, judges would introduce a new factual concept in the present case. Then, judges use the introduced factual concept to revise the statute so that the counterintuitive consequence is resolved. To formalize this procedure, we first define agreement and disagreement as follows.
Definition 3.4 (Agreement and Disagreement). Let 𝐹𝐵1, 𝐹𝐵2 be fact-bases, 𝑅𝐵1, 𝑅𝐵2 be rule-bases, and 𝑝 be a proposition. We say 𝐹𝐵1 ∪ 𝑅𝐵1 agrees with 𝐹𝐵2 ∪ 𝑅𝐵2 on 𝑝 if
• 𝐹𝐵1 ∪ 𝑅𝐵1 ⊢ 𝑝 and 𝐹𝐵2 ∪ 𝑅𝐵2 ⊢ 𝑝, or
• 𝐹𝐵1 ∪ 𝑅𝐵1 ⊬ 𝑝 and 𝐹𝐵2 ∪ 𝑅𝐵2 ⊬ 𝑝.
Otherwise, we say 𝐹𝐵1 ∪ 𝑅𝐵1 disagrees with 𝐹𝐵2 ∪ 𝑅𝐵2 on 𝑝.
The formalization of the procedure goes as follows. First, we have a proposition 𝑝 representing a counterintuitive consequence and 𝐹𝐵𝑒 representing an exceptional case, which contains at least one fact proposition not occurring in an original rule-base (hereafter, we also refer to 𝐹𝐵𝑒 as an exceptional case). Then, we revise an original rule-base 𝑅𝐵 to a new rule-base 𝑅𝐵′; 𝑅𝐵′ is a correct revision of 𝑅𝐵 with respect to 𝐹𝐵𝑒 and 𝑝 if it can resolve the counterintuitive consequence 𝑝. We call such a task a counterintuitive consequence resolution task (CCR task), formally defined as follows.
Definition 3.5 (Counterintuitive Consequence Resolution (CCR) Task). A counterintuitive consequence resolution (CCR) task is a tuple ⟨𝑅𝐵, 𝐹𝐵𝑒, 𝑝⟩ where 𝑅𝐵 is a rule-base representing statutes, 𝐹𝐵𝑒 is a fact-base representing an exceptional case (𝑓 (𝐹𝐵𝑒) ⊈ 𝑓 (𝑅𝐵)), and 𝑝 is a considered counterintuitive consequence. A rule-base 𝑅𝐵′ is a resolution to the CCR task ⟨𝑅𝐵, 𝐹𝐵𝑒, 𝑝⟩ if 𝐹𝐵𝑒 ∪ 𝑅𝐵′ disagrees with 𝐹𝐵𝑒 ∪ 𝑅𝐵 on 𝑝 (hence, it implies that the considered counterintuitive consequence is resolved).
Example 3.6. Let a fact-domain F = {𝑎, 𝑏, 𝑐}, and a rule-base 𝑅𝐵1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}. Suppose 𝑝 is a counterintuitive consequence from applying 𝑅𝐵1 in an exceptional case represented by 𝐹𝐵1 = {𝑎. 𝑐.}. Then, 𝑅𝐵2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} and 𝑅𝐵3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} are both resolutions to the CCR task ⟨𝑅𝐵1, 𝐹𝐵1, 𝑝⟩, since 𝐹𝐵1 ∪ 𝑅𝐵1 disagrees with 𝐹𝐵1 ∪ 𝑅𝐵2 and with 𝐹𝐵1 ∪ 𝑅𝐵3 on 𝑝, which means 𝑝 is resolved in both revisions.
From the previous example, we can see that there is possibly more than one rule in which we can put an exception (𝑛𝑜𝑡 𝑟) so that the counterintuitive consequence 𝑝 is resolved. To let a user specify which rule an exception should be put in, one can consider using Legal Debugging [17], which extends a common algorithmic debugging [40] to legal reasoning. Legal Debugging treats the user as an oracle for an unknown intended interpretation, where a counterintuitive consequence is an element of the symmetric difference of the literal interpretation and the intended interpretation. Legal Debugging iterates, asking the user whether related consequences are counterintuitive, until it can no longer find any related counterintuitive consequences. The last counterintuitive consequence found is called a culprit, defined as follows.
Definition 3.7 (Culprit). A proposition 𝑝 is a culprit with respect to an intended interpretation 𝐼𝑀 and a program 𝑇 if
• 𝑝 ∉ 𝐼𝑀 but there is a rule 𝑅 ∈ 𝑇 (called a supporting rule of 𝑝) such that 𝑝𝑜𝑠 (𝑅) ⊆ 𝐼𝑀, 𝑛𝑒𝑔(𝑅) ∩ 𝐼𝑀 = ∅, and ℎ𝑒𝑎𝑑 (𝑅) = 𝑝 (we call such 𝑝 an incorrect culprit), or
• 𝑝 ∈ 𝐼𝑀 but there is no such supporting rule of 𝑝 in 𝑇 (we call such 𝑝 an incomplete culprit).

We follow the culprit resolution from [19]. To resolve an incorrect culprit, we put exceptions in all supporting rules of the culprit, in the same manner as Closed World Specification [6]. To resolve an incomplete culprit, we simply put a new rule with the culprit as a head. Following this instruction, as illustrated in Algorithm 1, we can reduce a CCR task to the task of finding appropriate conditions for the body of a new rule. Hence, we define a culprit resolution to a CCR task as follows.

Definition 3.8 (Culprit Resolution to a CCR task). Given a CCR task ⟨𝑅𝐵, 𝐹𝐵𝑒, 𝑝⟩; 𝑅𝐵̃ and 𝐻 obtained by Algorithm 1, where 𝑅𝐵̃ is a preliminary revision and 𝐻 = {ℎ1, . . . , ℎ𝑛} is the set of rule propositions that would become heads of new rules; and a set of Horn clauses 𝑅𝐵𝑛 = {ℎ1 ← 𝐵1., . . . , ℎ𝑛 ← 𝐵𝑛.} where 𝐵𝑖 ⊆ 𝑓(𝐹𝐵𝑒) for all 1 ≤ 𝑖 ≤ 𝑛; the rule-base 𝑅𝐵′ = 𝑅𝐵̃ ∪ {ℎ1 ← 𝐵1., . . . , ℎ𝑛 ← 𝐵𝑛.} is called a culprit resolution to the CCR task ⟨𝑅𝐵, 𝐹𝐵𝑒, 𝑝⟩.

Algorithm 1 Preparation Phase of Culprit Resolution
Given: a CCR task ⟨𝑅𝐵, 𝐹𝐵𝑒, 𝑝⟩
  Let 𝑅𝐵̃ = 𝑅𝐵 and 𝐻 = ∅
  for all culprits 𝑝𝑐 detected from 𝑝 do
      if 𝑝𝑐 is an incorrect culprit then
          for all supporting rules 𝑅 of 𝑝𝑐 do
              Let 𝑝𝑒 be a new rule proposition
              Add 𝑝𝑒 to 𝐻
              Add 𝑛𝑜𝑡 𝑝𝑒 to the body of 𝑅 in 𝑅𝐵̃
          end for
      else ⊲ when 𝑝𝑐 is an incomplete culprit
          Add 𝑝𝑐 to 𝐻
      end if
  end for

Due to the property of culprits and Closed World Specification, a culprit resolution to a CCR task resolves the considered counterintuitive consequence in the CCR task. Hence, a culprit resolution is also a kind of resolution.

Example 3.9. Continuing from Example 3.6, if the culprit is 𝑝, then we put 𝑛𝑜𝑡 𝑟 in the first rule, so 𝑅𝐵̃1 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏.} and 𝐻 = {𝑟}. If the culprit is 𝑞, then we put 𝑛𝑜𝑡 𝑟 in the second rule, so 𝑅𝐵̃1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑏.} and 𝐻 = {𝑟}. We get that 𝑅𝐵2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} and 𝑅𝐵3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} in Example 3.6 are also culprit resolutions to the CCR task ⟨𝑅𝐵, 𝐹𝐵1, 𝑝⟩ since both rule-bases include a new Horn clause 𝑟 ← 𝑐. and {𝑐} ⊆ 𝑓(𝐹𝐵1).

4 SEMANTICS-BASED MINIMAL REVISION
Since legal reasoning is a hybrid of reasoning by rules and reasoning by cases [10, 43], a revision should consider the semantics of a rule-base, which varies among cases. Hence, we define the semantics of a rule-base as follows.

Definition 4.1 (Semantics of a rule-base). Let F be a fact-domain and 𝑅𝐵 be a rule-base. The semantics of 𝑅𝐵 is the set of pairs defined as follows: {⟨𝐹, 𝑀⟩ | 𝐹 ⊆ F and 𝑀 is the stable model semantics of 𝐹𝐵 ∪ 𝑅𝐵, where 𝐹𝐵 is a fact-base such that 𝑓(𝐹𝐵) = 𝐹}. We denote this set as 𝑠𝑒𝑚(𝑅𝐵).

We illustrate with the same setting as Example 3.6 throughout this section. Let the fact-domain F be {𝑎, 𝑏, 𝑐} and the rule-base 𝑅𝐵1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}. Then 𝑠𝑒𝑚(𝑅𝐵1) =
{⟨∅, ∅⟩, ⟨{𝑎}, {𝑎, 𝑝, 𝑞}⟩, ⟨{𝑏}, {𝑏, 𝑝, 𝑞}⟩, ⟨{𝑐}, {𝑐}⟩,
⟨{𝑎, 𝑏}, {𝑎, 𝑏, 𝑝, 𝑞}⟩, ⟨{𝑎, 𝑐}, {𝑎, 𝑐, 𝑝, 𝑞}⟩, ⟨{𝑏, 𝑐}, {𝑏, 𝑐, 𝑝, 𝑞}⟩,
⟨{𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑐, 𝑝, 𝑞}⟩}.

Then, we define the difference of semantics as follows.

Definition 4.2 (Difference of Semantics). Let F be a fact-domain and 𝑅𝐵1, 𝑅𝐵2 be two rule-bases. The difference of semantics between 𝑅𝐵1 and 𝑅𝐵2, denoted by 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2), is the set defined as follows: {⟨𝐹, 𝐷⟩ | 𝐹 ⊆ F and 𝐷 is the symmetric difference between the stable model semantics of 𝐹𝐵 ∪ 𝑅𝐵1 and of 𝐹𝐵 ∪ 𝑅𝐵2, where 𝐹𝐵 is a fact-base such that 𝑓(𝐹𝐵) = 𝐹}.

For example, let 𝑅𝐵2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}. We get that 𝑠𝑒𝑚(𝑅𝐵2) =
{⟨∅, ∅⟩, ⟨{𝑎}, {𝑎, 𝑝, 𝑞}⟩, ⟨{𝑏}, {𝑏, 𝑝, 𝑞}⟩, ⟨{𝑐}, {𝑐, 𝑟}⟩,
⟨{𝑎, 𝑏}, {𝑎, 𝑏, 𝑝, 𝑞}⟩, ⟨{𝑎, 𝑐}, {𝑎, 𝑐, 𝑟}⟩, ⟨{𝑏, 𝑐}, {𝑏, 𝑐, 𝑝, 𝑞, 𝑟}⟩,
⟨{𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑐, 𝑝, 𝑞, 𝑟}⟩}
and 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) =
{⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, {𝑟}⟩,
⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝, 𝑞, 𝑟}⟩, ⟨{𝑏, 𝑐}, {𝑟}⟩, ⟨{𝑎, 𝑏, 𝑐}, {𝑟}⟩}.

Let 𝑅𝐵3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}. We get that 𝑠𝑒𝑚(𝑅𝐵3) =
{⟨∅, ∅⟩, ⟨{𝑎}, {𝑎, 𝑝, 𝑞}⟩, ⟨{𝑏}, {𝑏, 𝑝, 𝑞}⟩, ⟨{𝑐}, {𝑐, 𝑟}⟩,
⟨{𝑎, 𝑏}, {𝑎, 𝑏, 𝑝, 𝑞}⟩, ⟨{𝑎, 𝑐}, {𝑎, 𝑐, 𝑞, 𝑟}⟩, ⟨{𝑏, 𝑐}, {𝑏, 𝑐, 𝑞, 𝑟}⟩,
⟨{𝑎, 𝑏, 𝑐}, {𝑎, 𝑏, 𝑐, 𝑞, 𝑟}⟩}
and 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) =
{⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, {𝑟}⟩,
⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝, 𝑟}⟩, ⟨{𝑏, 𝑐}, {𝑝, 𝑟}⟩, ⟨{𝑎, 𝑏, 𝑐}, {𝑝, 𝑟}⟩}.

The difference of semantics reflects changes in consequences (both added and removed) between two rule-bases. Now, we can define a minimal revision in this framework as follows.

Definition 4.3 (Semantics-based Minimal Revision). Let 𝑅𝐵1, 𝑅𝐵2, 𝑅𝐵3 be three rule-bases. We say that 𝑅𝐵2 has a smaller change than 𝑅𝐵3 from 𝑅𝐵1, denoted as 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≤ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3), if for every ⟨𝐹, 𝐷2⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) and ⟨𝐹, 𝐷3⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3), we have 𝐷2 ⊆ 𝐷3. We define 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) < 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) if 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≤ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) but 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) ≰ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2). We call 𝑅𝐵2 a semantics-based minimal revision of 𝑅𝐵1 if 𝑅𝐵2 is a revision of 𝑅𝐵1 and there is no revision 𝑅𝐵′ of 𝑅𝐵1 such that 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵′) < 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2).

However, we consider Definition 4.3 too strong, since it makes revisions from different schemes hard to compare. For example, 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) and 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) are incomparable since
⟨{𝑎, 𝑐}, {𝑝, 𝑞, 𝑟}⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2)
and ⟨{𝑎, 𝑐}, {𝑝, 𝑟}⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3),
hence 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≰ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3), and
On Semantics-based Minimal Revision for Legal Reasoning ICAIL’21, June 21–25, 2021, São Paulo, Brazil
⟨{𝑎, 𝑏, 𝑐}, {𝑝, 𝑟}⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3)
and ⟨{𝑎, 𝑏, 𝑐}, {𝑟}⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2),
hence 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) ≰ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2).

Therefore, we relax Definition 4.3 by considering partial semantics, defined as follows.

Definition 4.4 (Partial Semantics-based Minimal Revision). Let 𝑅𝐵1, 𝑅𝐵2, 𝑅𝐵3 be three rule-bases and 𝑆 be a set of propositions. We say that 𝑅𝐵2 has a smaller change than 𝑅𝐵3 from 𝑅𝐵1 with respect to 𝑆, denoted as 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≤𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3), if for every ⟨𝐹, 𝐷2⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) and ⟨𝐹, 𝐷3⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3), we have 𝐷2 ∩ 𝑆 ⊆ 𝐷3 ∩ 𝑆. We define 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) <𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) if 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≤𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) but 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) ≰𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2). We call 𝑅𝐵2 a partial semantics-based minimal revision of 𝑅𝐵1 with respect to 𝑆 if 𝑅𝐵2 is a revision of 𝑅𝐵1 and there is no revision 𝑅𝐵′ of 𝑅𝐵1 such that 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵′) <𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2).

From the previous example, if 𝑆 = {𝑝} or 𝑆 = {𝑝, 𝑟}, we get that
𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≤𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) but
𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) ≰𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2),
hence 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) <𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3).

However, if 𝑆 = {𝑞}, we get that
𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) ≤𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) but
𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≰𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3),
hence 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) <𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2).

We define a partial difference of semantics as follows.

Definition 4.5 (Partial Difference of Semantics). Let F be a fact-domain, 𝑅𝐵1, 𝑅𝐵2 be two rule-bases, and 𝑆 be a set of propositions. The partial difference of semantics between 𝑅𝐵1 and 𝑅𝐵2 with respect to 𝑆, denoted by 𝐷𝐼𝐹𝐹𝑆(𝑅𝐵1, 𝑅𝐵2), is the set defined as follows: {⟨𝐹, 𝐷 ∩ 𝑆⟩ | ⟨𝐹, 𝐷⟩ ∈ 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2)}.

In this way, given any three rule-bases 𝑅𝐵1, 𝑅𝐵2, 𝑅𝐵3,
𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2) ≤𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵3) ≡ 𝐷𝐼𝐹𝐹𝑆(𝑅𝐵1, 𝑅𝐵2) ≤ 𝐷𝐼𝐹𝐹𝑆(𝑅𝐵1, 𝑅𝐵3).

Continuing from the previous example, 𝐷𝐼𝐹𝐹{𝑝}(𝑅𝐵1, 𝑅𝐵2) =
{⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, ∅⟩,
⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝}⟩, ⟨{𝑏, 𝑐}, ∅⟩, ⟨{𝑎, 𝑏, 𝑐}, ∅⟩}.

When the fact-domain is {𝑎, 𝑏, 𝑐}, we get that 𝑅𝐵2 is a partial semantics-based minimal revision with respect to {𝑝}. This is because, to find a rule-base 𝑅𝐵′ such that 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵′) <𝑆 𝐷𝐼𝐹𝐹(𝑅𝐵1, 𝑅𝐵2), 𝐷𝐼𝐹𝐹{𝑝}(𝑅𝐵1, 𝑅𝐵′) must be {⟨𝐹, ∅⟩ | 𝐹 ⊆ F}; then 𝑅𝐵′ is not a resolution to the CCR task, since the exceptional case is 𝐹𝐵1 = {𝑎. 𝑐.} but 𝐹𝐵1 ∪ 𝑅𝐵 agrees with 𝐹𝐵1 ∪ 𝑅𝐵′ on 𝑝 (the counterintuitive consequence is not resolved). However, if we extend the fact-domain to F = {𝑎, 𝑏, 𝑐, 𝑑}, 𝐷𝐼𝐹𝐹{𝑝}(𝑅𝐵1, 𝑅𝐵2) would become:
{⟨∅, ∅⟩, ⟨{𝑎}, ∅⟩, ⟨{𝑏}, ∅⟩, ⟨{𝑐}, ∅⟩, ⟨{𝑑}, ∅⟩,
⟨{𝑎, 𝑏}, ∅⟩, ⟨{𝑎, 𝑐}, {𝑝}⟩, ⟨{𝑎, 𝑑}, ∅⟩, ⟨{𝑏, 𝑐}, ∅⟩,
⟨{𝑏, 𝑑}, ∅⟩, ⟨{𝑐, 𝑑}, ∅⟩, ⟨{𝑎, 𝑏, 𝑐}, ∅⟩, ⟨{𝑎, 𝑏, 𝑑}, ∅⟩,
⟨{𝑎, 𝑐, 𝑑}, {𝑝}⟩, ⟨{𝑏, 𝑐, 𝑑}, ∅⟩, ⟨{𝑎, 𝑏, 𝑐, 𝑑}, ∅⟩}.

Thus, 𝑅𝐵2 is not a partial semantics-based minimal revision of 𝑅𝐵1 with respect to {𝑝} when the fact-domain is {𝑎, 𝑏, 𝑐, 𝑑}, because there exists a revision 𝑅𝐵′ such that 𝐷𝐼𝐹𝐹{𝑝}(𝑅𝐵1, 𝑅𝐵′) = {⟨𝐹, 𝐷⟩ | 𝐹 ⊆ F, where 𝐷 = {𝑝} if 𝐹 = {𝑎, 𝑐} and 𝐷 = ∅ otherwise}. However, we consider such a revision 𝑅𝐵′ too specific, since it requires only semantics changes for {𝑎, 𝑐}. Therefore, to describe that 𝑅𝐵2 is a suitable semantics-based minimal revision, we say each fact-base has its corresponding specific rule-base and dominant rule-base with respect to the considered consequence. We adopt the definition of a specific rule-base from [30] and define a dominant rule-base as a trimmed version of a specific rule-base as follows.

Definition 4.6 (Specific Rule-base and Dominant Rule-base). Let 𝑅𝐵 be a rule-base, 𝐹𝐵 be a fact-base, and 𝑝 be a proposition. We say a rule-base 𝑆𝑅 ⊆ 𝑅𝐵 is specific to 𝐹𝐵 with respect to 𝑅𝐵 and 𝑝 if 𝑆𝑅 is a minimal set of rules (in the sense of set inclusion) such that
• 𝐹𝐵 ∪ 𝑆𝑅 agrees with 𝐹𝐵 ∪ 𝑅𝐵 on 𝑝, and
• there is no rule-base 𝑆𝑅′ such that 𝑆𝑅 ⊊ 𝑆𝑅′ ⊊ 𝑅𝐵 and 𝐹𝐵 ∪ 𝑆𝑅′ disagrees with 𝐹𝐵 ∪ 𝑆𝑅 on 𝑝.
We call 𝐷𝑅 = 𝑡𝑟𝑖𝑚(𝑆𝑅) a dominant rule-base of 𝐹𝐵 with respect to 𝑅𝐵 and 𝑝 (𝑡𝑟𝑖𝑚 is defined in Definition 3.3).

Now, we define dominant-based semantics as follows.

Definition 4.7 (Dominant-based Semantics). Let F be a fact-domain, 𝑅𝐵 be a rule-base, and 𝑝 be a proposition. The dominant-based semantics of 𝑅𝐵 with respect to 𝑝 is the set of pairs defined as follows: {⟨𝐹𝐵, 𝐴𝐷⟩ | 𝐹𝐵 is a fact-base constructed from a subset of F and 𝐴𝐷 is the set of all dominant rule-bases of 𝐹𝐵 with respect to 𝑅𝐵 and 𝑝}. We denote this set as 𝑑𝑜𝑚𝑝(𝑅𝐵).

Then, we define the difference of dominant-based semantics as follows.

Definition 4.8 (Difference of Dominant-based Semantics). Let F be a fact-domain, 𝑅𝐵1, 𝑅𝐵2 be two rule-bases, and 𝑝 be a proposition. The difference of dominant-based semantics between 𝑅𝐵1 and 𝑅𝐵2 with respect to 𝑝, denoted by 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2), is the set defined as follows: {⟨𝐴𝐷1, 𝐴𝐷2⟩ | ⟨𝐹𝐵, 𝐴𝐷1⟩ ∈ 𝑑𝑜𝑚𝑝(𝑅𝐵1) and ⟨𝐹𝐵, 𝐴𝐷2⟩ ∈ 𝑑𝑜𝑚𝑝(𝑅𝐵2) and 𝐴𝐷1 ≠ 𝐴𝐷2}.

The difference in this definition represents the patterns of fact-bases that move their dominant rule-bases after revising a program, regardless of how many fact-bases move such patterns. From Table 1, continuing from the previous example, 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2) =
{⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑐.}}⟩,
⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑝 ← 𝑞. 𝑞 ← 𝑏.}}⟩}.
From Table 2, 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵3) =
{⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑐.}}⟩,
⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑟 ← 𝑐.}}⟩,
⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑟 ← 𝑐.}}⟩}.

The difference of dominant-based semantics does not change when we extend the fact-domain. Due to the ordering of dominants, we get the following lemma.

Lemma 4.9. Let 𝑅𝐵1, 𝑅𝐵2 be two rule-bases, 𝑝 be a proposition, and ⟨𝐹𝐵1, 𝐴𝐷1⟩, ⟨𝐹𝐵3, 𝐴𝐷3⟩ ∈ 𝑑𝑜𝑚𝑝(𝑅𝐵1) such that 𝐹𝐵1 ⊆ 𝐹𝐵3 and 𝐴𝐷1 ⊆ 𝐴𝐷3. If ⟨𝐴𝐷1, 𝐴𝐷2⟩ ∈ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2) then ⟨𝐴𝐷3, 𝐴𝐷4⟩ ∈ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2).

This lemma intuitively means that if a revision affects the set 𝐴𝐷1 of all dominant rule-bases of 𝐹𝐵1, the revision also affects every set 𝐴𝐷3 of all dominant rule-bases of 𝐹𝐵3 such that 𝐹𝐵3 is a superset of 𝐹𝐵1 and 𝐴𝐷3 is a superset of 𝐴𝐷1. Now, we can define the dominant-based minimal revision as follows.
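The set-based constructions of Definitions 4.1–4.5 on the running example can be checked mechanically. The sketch below is ours, not the paper's: it assumes the non-recursive, stratified setting of Section 3 (so each fact-base has a unique stable model), encodes a rule as a (head, positive body, negative body) tuple, and uses our own helper names (model, sem, diff, pdiff).

```python
from itertools import chain, combinations

# A rule is (head, pos, neg). Programs are assumed non-recursive and
# stratified, so FB ∪ RB has a unique stable model (cf. Section 3).
def model(facts, rules):
    memo = {}
    def holds(x):
        if x not in memo:
            memo[x] = x in facts or any(
                h == x and all(map(holds, pos)) and not any(map(holds, neg))
                for h, pos, neg in rules)
        return memo[x]
    atoms = set(facts) | {a for h, pos, neg in rules for a in (h, *pos, *neg)}
    return frozenset(a for a in atoms if holds(a))

def sem(domain, rules):                        # Definition 4.1
    subs = chain.from_iterable(
        combinations(sorted(domain), k) for k in range(len(domain) + 1))
    return {frozenset(s): model(frozenset(s), rules) for s in subs}

def diff(domain, rb1, rb2):                    # Definition 4.2
    s1, s2 = sem(domain, rb1), sem(domain, rb2)
    return {f: s1[f] ^ s2[f] for f in s1}      # symmetric difference per fact-base

def pdiff(domain, rb1, rb2, s):                # Definition 4.5
    return {f: d & frozenset(s) for f, d in diff(domain, rb1, rb2).items()}

RB1 = [("p", ("q",), ()), ("q", ("a",), ()), ("q", ("b",), ())]
RB2 = [("p", ("q",), ()), ("q", ("a",), ("r",)), ("q", ("b",), ()), ("r", ("c",), ())]
RB3 = [("p", ("q",), ("r",)), ("q", ("a",), ()), ("q", ("b",), ()), ("r", ("c",), ())]
F = {"a", "b", "c"}
# diff(F, RB1, RB2)[{a,c}] = {p,q,r} and diff(F, RB1, RB3)[{a,c}] = {p,r},
# reproducing the incomparability discussed above; restricted to S = {p},
# every entry of pdiff(F, RB1, RB2, S) is contained in the RB3 entry.
```

Running this reproduces the listings of sem(RB1), sem(RB2), sem(RB3) and the two DIFF sets given in the text.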
Table 1: Difference of dominant-based semantics between 𝑅𝐵 1 and 𝑅𝐵 2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑏. 𝑟 ← 𝑐.} in the example
Possible fact-base 𝐹 𝐵 Dominant rule-base(s) of 𝐹 𝐵 (before revision) Dominant rule-base(s) of 𝐹 𝐵 (after revision)
∅ ∅ ∅
{𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}
{𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑐.} ∅ ∅
{𝑎. 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑎. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {r ← c.}
{𝑏. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑎. 𝑏. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
Table 2: Difference of dominant-based semantics between 𝑅𝐵 1 and 𝑅𝐵 3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟 . 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.} in the example
Possible fact-base 𝐹 𝐵 Dominant rule-base(s) of 𝐹 𝐵 (before revision) Dominant rule-base(s) of 𝐹 𝐵 (after revision)
∅ ∅ ∅
{𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}
{𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑐.} ∅ ∅
{𝑎. 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}
{𝑎. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.} {r ← c.}
{𝑏. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {r ← c.}
{𝑎. 𝑏. 𝑐.} {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.} {r ← c.}
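The dominant rule-bases tabulated above can be recomputed by brute force from Definition 4.6. The sketch below is our own illustration: it omits the trim step of Definition 3.3 (which makes no difference for this example, since the minimal specific sets contain no removable rules) and reuses the tuple encoding of rules introduced earlier.

```python
from itertools import chain, combinations

# A rule is (head, pos, neg); the evaluator assumes a non-recursive,
# stratified program, as in Section 3.
def model(facts, rules):
    def holds(x):
        return x in facts or any(
            h == x and all(holds(b) for b in pos) and not any(holds(b) for b in neg)
            for h, pos, neg in rules)
    atoms = set(facts) | {a for h, pos, neg in rules for a in (h, *pos, *neg)}
    return frozenset(a for a in atoms if holds(a))

def subsets(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, k) for k in range(len(xs) + 1))

def dominants(fb, rb, p="p"):
    """Specific rule-bases of Definition 4.6, found by brute force over
    subsets of rb; the trim step (Definition 3.3) is omitted here."""
    target = p in model(fb, rb)
    def on_p(rules):
        return p in model(fb, list(rules))
    cands = [frozenset(sr) for sr in subsets(rb)
             if on_p(sr) == target                   # agrees with FB ∪ RB on p
             and not any(frozenset(sr) < frozenset(sr2) < frozenset(rb)
                         and on_p(sr2) != on_p(sr)   # no in-between subset flips p
                         for sr2 in subsets(rb))]
    return {sr for sr in cands if not any(o < sr for o in cands)}

pq, qa, qb = ("p", ("q",), ()), ("q", ("a",), ()), ("q", ("b",), ())
qa_r, rc = ("q", ("a",), ("r",)), ("r", ("c",), ())
RB1, RB2 = [pq, qa, qb], [pq, qa_r, qb, rc]
# For the exceptional case {a. c.}: before revision the dominant rule-base is
# {p ← q. q ← a.}; after revising to RB2 it moves to {r ← c.}, as in Table 1.
```

Iterating `dominants` over all fact-bases constructed from the fact-domain reproduces the rows of Tables 1 and 2.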
Definition 4.10 (Dominant-based Minimal Revision). Let 𝑅𝐵1, 𝑅𝐵2, 𝑅𝐵3 be three rule-bases and 𝑝 be a proposition. We say that 𝑅𝐵2 has a smaller dominant change than 𝑅𝐵3 from 𝑅𝐵1 with respect to 𝑝, denoted as 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2) ⪯ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵3), when the following condition is satisfied:
If ⟨𝐴𝐷1, 𝐴𝐷2⟩ ∈ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2), then ⟨𝐴𝐷1, 𝐴𝐷3⟩ ∈ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵3).
We define 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2) ≺ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵3) if 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2) ⪯ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵3) but 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵3) ⪯̸ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2). We call 𝑅𝐵2 a dominant-based minimal revision of 𝑅𝐵1 with respect to 𝑝 if 𝑅𝐵2 is a revision of 𝑅𝐵1 and there is no revision 𝑅𝐵′ of 𝑅𝐵1 such that 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵′) ≺ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2).

From the previous example, 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2) ≺ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵3), so 𝑅𝐵3 is not the dominant-based minimal revision of 𝑅𝐵1 with respect to 𝑝. We also get that 𝑅𝐵2 is the dominant-based minimal revision with respect to 𝑝. This is because, to find a revision 𝑅𝐵′ such that 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵′) ≺ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵2), ⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, 𝐴𝐷′⟩ must not be in 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵′). However, since a revision 𝑅𝐵′ must affect {𝑝 ← 𝑞. 𝑞 ← 𝑎.}, as it is the dominant rule-base of {𝑎. 𝑐.} (the exceptional case), ⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, 𝐴𝐷′⟩ must be in 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵′) according to Lemma 4.9. Hence, such a revision 𝑅𝐵′ does not exist, and 𝑅𝐵2 is the dominant-based minimal revision of 𝑅𝐵1.

5 OBTAINING MINIMAL REVISION
One question is, given a CCR task, can we always obtain a dominant-based minimal revision with respect to the considered counterintuitive consequence? The answer to this question is yes. One way is to follow a culprit resolution in Algorithm 1 and introduce all fact propositions occurring in an exceptional case as the body of each new rule, so that all new rules are applicable only to the fact-bases that cover all facts in the considered exceptional case.

From the example in previous sections, let 𝑅𝐵1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}. If the culprit is 𝑝, we get a preliminary revision 𝑅𝐵̃1 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏.} and 𝐻 = {𝑟}. If we add all fact propositions in 𝐹𝐵1 = {𝑎. 𝑐.} to the body of a new rule, we get a culprit resolution 𝑅𝐵4 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑎, 𝑐.}, which is a dominant-based minimal revision of 𝑅𝐵1 with respect to 𝑝 since 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵4) =
{⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑎, 𝑐.}}⟩,
⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑟 ← 𝑎, 𝑐.}}⟩}
and there is no revision 𝑅𝐵′ such that 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵′) ≺ 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵4), in the same manner as for 𝑅𝐵2. This also applies to the scenario when the culprit is 𝑞: 𝑅𝐵5 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑏. 𝑟 ← 𝑎, 𝑐.} is also a dominant-based minimal revision of 𝑅𝐵1 since 𝐷𝑂𝑀𝐷𝐼𝐹𝐹𝑝(𝑅𝐵1, 𝑅𝐵5) =
{⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}}, {{𝑟 ← 𝑎, 𝑐.}}⟩,
⟨{{𝑝 ← 𝑞. 𝑞 ← 𝑎.}, {𝑝 ← 𝑞. 𝑞 ← 𝑏.}}, {{𝑝 ← 𝑞. 𝑞 ← 𝑏.}}⟩}.

In addition, we can optimize 𝑅𝐵5 into 𝑅𝐵2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}, which is also a dominant-based minimal revision. To optimize a revision as we do from 𝑅𝐵5 to 𝑅𝐵2, we need to consider specific rule-bases with respect to the original rule-base and the considered consequence, as in the following theorem.

Theorem 5.1. Given a fact-domain F, a CCR task ⟨𝑅𝐵, 𝐹𝐵𝑒, 𝑝⟩, 𝑅𝐵̃ and 𝐻 obtained by Algorithm 1 where 𝑅𝐵̃ is a preliminary revision, and
𝐻 = {ℎ1, . . . , ℎ𝑛} is a set of rule propositions that would become heads of new rules, and a set of Horn clauses 𝑅𝐵𝑛 = {ℎ1 ← 𝐵1., . . . , ℎ𝑛 ← 𝐵𝑛.} where 𝐵𝑖 ⊆ 𝑓(𝐹𝐵𝑒) for all 1 ≤ 𝑖 ≤ 𝑛. If 𝐵𝑖 satisfies the following condition for all 1 ≤ 𝑖 ≤ 𝑛:
For every fact-base 𝐹𝐵 constructed from a subset of F such that 𝐵𝑖 ⊆ 𝐹𝐵, if there is a rule 𝑅 in a specific rule-base of 𝐹𝐵 with respect to 𝑅𝐵 and 𝑝, and ℎ𝑖 occurs in 𝑅, then 𝐹𝐵𝑒 ⊆ 𝐹𝐵,
then a culprit resolution 𝑅𝐵′ = 𝑅𝐵̃ ∪ {ℎ1 ← 𝐵1., . . . , ℎ𝑛 ← 𝐵𝑛.} is a dominant-based minimal revision of 𝑅𝐵 with respect to 𝑝.

Theorem 5.1 intuitively implies a method for finding a dominant-based minimal revision. Let 𝐹𝐵𝑒 be an exceptional case; the method is to find a minimal set 𝐵 ⊆ 𝑓(𝐹𝐵𝑒) which affects specific rule-bases only of 𝐹𝐵 ⊇ 𝐹𝐵𝑒. Thus, when 𝐵 = 𝑓(𝐹𝐵𝑒), the revision is a dominant-based minimal revision. However, if 𝐵 ⊊ 𝑓(𝐹𝐵𝑒), a revision affects specific rule-bases only of 𝐹𝐵 ⊇ 𝐹𝐵𝑒 if there is no rule 𝑅 in a specific rule-base of 𝐹𝐵 such that 𝐹𝐵 ⊇ 𝐵, 𝐹𝐵 ⊉ 𝐹𝐵𝑒, and 𝑝𝑒 occurs in 𝑅. For example, regarding 𝑅𝐵2 = {𝑝 ← 𝑞. 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}, there is no 𝐹𝐵 such that 𝐹𝐵 ⊇ {𝑐.}, 𝐹𝐵 ⊉ {𝑎. 𝑐.}, and 𝑞 ← 𝑎, 𝑛𝑜𝑡 𝑟. is in the specific rule-base of 𝐹𝐵. However, regarding 𝑅𝐵3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}, there is 𝐹𝐵 = {𝑏. 𝑐.} such that 𝐹𝐵 ⊇ {𝑐.}, 𝐹𝐵 ⊉ {𝑎. 𝑐.}, and 𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. is in the specific rule-base of 𝐹𝐵. Therefore, 𝑅𝐵5 can be optimized to 𝑅𝐵2, but 𝑅𝐵4 cannot be optimized to 𝑅𝐵3.

6 COMPARISON AND EXAMPLE
In this section, we compare the dominant-based minimal revision with the syntax-based minimal revision of the Theory Distance Metric [47], which is a common minimal revision used for describing minimal revision in legislation (e.g., [12, 28]). The definition is formally described in our context as follows.

Definition 6.1 (Syntax-based Minimal Revision). Let 𝑅𝐵 and 𝑅𝐵′ be rule-bases. A revision transformation 𝑟 is such that 𝑟(𝑅𝐵) = 𝑅𝐵′, and 𝑅𝐵′ is obtained from 𝑅𝐵 by program edit operations as follows: deleting a rule, creating a rule with an empty body, adding a condition to a rule in 𝑅𝐵, or deleting a condition from a rule in 𝑅𝐵. 𝑅𝐵′ is a revision of 𝑅𝐵 with distance 𝑐(𝑅𝐵, 𝑅𝐵′) = 𝑛 if and only if 𝑅𝐵′ = 𝑟^𝑛(𝑅𝐵) and there is no 𝑚 < 𝑛 such that 𝑅𝐵′ = 𝑟^𝑚(𝑅𝐵) [47].

Consequently, it is very simple to find a culprit resolution that is also a syntax-based minimal revision. Actually, a culprit resolution with an empty condition for each new rule is a syntax-based minimal revision, since it requires no additional program edit operation, but that kind of culprit resolution can be considered too general. If we require adding some condition to a new rule, we can just add an extra fact proposition from the exceptional case as a condition of the new rule (an extra fact proposition means a fact proposition that does not occur in the rule-base but occurs in the exceptional case). This requires only one program edit operation for each new rule, so the culprit resolution is definitely the syntax-based minimal revision (under the constraint that we require adding some condition to each new rule).

From Example 3.9, we have 𝑅𝐵1 = {𝑝 ← 𝑞. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}; if the culprit is 𝑝, we get a preliminary revision 𝑅𝐵̃1 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏.}. If we add an extra fact proposition 𝑐 to the body of a new rule, we get 𝑅𝐵3 = {𝑝 ← 𝑞, 𝑛𝑜𝑡 𝑟. 𝑞 ← 𝑎. 𝑞 ← 𝑏. 𝑟 ← 𝑐.}. Although 𝑅𝐵3 is the syntax-based minimal revision when we require adding a condition to a new rule, 𝑅𝐵3 is not a dominant-based minimal revision, as we illustrated in the previous section.

Generally, we get that a syntax-based minimal revision is not always a dominant-based minimal revision, especially when the rule-base contains multiple rules for the same consequence. To illustrate more on this constraint, we use an example case adopted from a Supreme Court case, as in [4, 17, 38], relating to Japanese Civil Code Article 612, which states:

Phrase 1: A lessee may not assign the lessee's rights or sublease a leased thing without obtaining the approval of the lessor.

Phrase 2: If the lessee allows any third party to make use of or take profits from a leased thing in violation of the provisions of the preceding paragraph, the lessor may cancel the contract.

The Japanese Civil Code can be represented with the rule-base as follows¹. Hereafter, this rule-base is denoted by 𝐽𝑅𝐵1.

1  cancellation_due_to_sublease :-
2      effective_lease_contract,
3      effective_sublease_contract,
4      using_leased_thing,
5      manifestation_cancellation,
6      not approval_of_sublease.
7  effective_lease_contract :-
8      agreement_of_lease_contract,
9      handover_to_lessee.
10 effective_sublease_contract :-
11     agreement_of_sublease_contract,
12     handover_to_sublessee.
13 effective_sublease_contract :-
14     implicit_sublease.
15 approval_of_sublease :-
16     approval_before_the_day.

This representation illustrates that, to prove the contract was ended due to sublease (represented as cancellation_due_to_sublease), we must prove four requisites (lines 2–5):
(1) the lease contract was effective (represented as effective_lease_contract);
(2) the sublease contract was effective (represented as effective_sublease_contract);
(3) the third party was using the leased thing (represented as using_leased_thing);
(4) the plaintiff manifested the intention of cancellation of the contract (represented as manifestation_cancellation).
And there is one exception, approval_of_sublease in line 6, which is explicitly stated in the Civil Code. To prove the exception, we must prove the approval before the day of cancellation (represented as approval_before_the_day).

To prove that the lease contract was effective (effective_lease_contract), we must prove two requisites (lines 8–9).

¹ This representation is adopted for ease of exposition. The implicit sublease contract is a fictitious condition to illustrate multiple conditions. We use :- instead of ←.
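Under the same non-recursive, stratified reading as Section 3, such a rule-base can be exercised with a small negation-as-failure interpreter. The sketch below is our own illustration of one possible encoding of 𝐽𝑅𝐵1 (the tuple representation and the function name `proves` are ours, not from the paper).

```python
# Each rule: (head, positive body, negated body); one tuple per rule of JRB1.
JRB1 = [
    ("cancellation_due_to_sublease",
     ("effective_lease_contract", "effective_sublease_contract",
      "using_leased_thing", "manifestation_cancellation"),
     ("approval_of_sublease",)),
    ("effective_lease_contract",
     ("agreement_of_lease_contract", "handover_to_lessee"), ()),
    ("effective_sublease_contract",
     ("agreement_of_sublease_contract", "handover_to_sublessee"), ()),
    ("effective_sublease_contract", ("implicit_sublease",), ()),
    ("approval_of_sublease", ("approval_before_the_day",), ()),
]

def proves(facts, rules, goal):
    """Top-down proof with negation as failure; assumes a non-recursive,
    stratified rule-base, so the search terminates."""
    if goal in facts:
        return True
    return any(h == goal
               and all(proves(facts, rules, b) for b in pos)
               and not any(proves(facts, rules, b) for b in neg)
               for h, pos, neg in rules)

facts = {"agreement_of_lease_contract", "handover_to_lessee",
         "agreement_of_sublease_contract", "handover_to_sublessee",
         "using_leased_thing", "manifestation_cancellation"}
# All four requisites (lines 2-5) hold and the approval exception fails, so
# cancellation follows; adding approval_before_the_day blocks it via line 6.
```

With these facts, `proves(facts, JRB1, "cancellation_due_to_sublease")` succeeds, and it fails once `approval_before_the_day` is added to the fact-base.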
Knowledge Representation and Reasoning. AAAI Press, CA, USA, 243–254.
[14] Mukesh Dalal. 1988. Investigations into a theory of knowledge base revision: preliminary report. In Proceedings of the Seventh National Conference on Artificial Intelligence, Vol. 2. AAAI Press, CA, USA, 475–479.
[15] Marina De Vos, Julian Padget, and Ken Satoh. 2010. Legal modelling and reasoning using institutions. In JSAI International Symposium on Artificial Intelligence. Springer Berlin Heidelberg, Berlin, Heidelberg, 129–140.
[16] Alvaro Del Val. 1997. Non monotonic reasoning and belief revision: syntactic, semantic, foundational and coherence approaches. Journal of Applied Non-Classical Logics 7, 1-2 (1997), 213–240.
[17] Wachara Fungwacharakorn and Ken Satoh. 2018. Legal Debugging in Propositional Legal Representation. In JSAI International Symposium on Artificial Intelligence. Springer International Publishing, Cham, 146–159.
[18] Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh. 2020. On the legal revision in PROLEG program. In 34th Proceedings of the Annual Conference of JSAI, Vol. 2020. Japan Society of Artificial Intelligence, Japan, 3G5ES104.
[19] Wachara Fungwacharakorn, Kanae Tsushima, and Ken Satoh. 2021. Resolving counterintuitive consequences in law using legal debugging. Artificial Intelligence and Law (2021), 1–17.
[20] Michael Gelfond and Vladimir Lifschitz. 1988. The stable model semantics for logic programming. In Proceedings of International Logic Programming Conference and Symposium, Robert Kowalski and Kenneth Bowen (Eds.), Vol. 88. MIT Press, Cambridge, MA, USA, 1070–1080.
[21] Guido Governatori, Michael J Maher, Grigoris Antoniou, and David Billington. 2004. Argumentation semantics for defeasible logic. Journal of Logic and Computation 14, 5 (2004), 675–702.
[22] Guido Governatori, Monica Palmirani, Regis Riveret, Antonio Rotolo, and Giovanni Sartor. 2005. Norm Modifications in Defeasible Logic. In Proceedings of the 2005 Conference on Legal Knowledge and Information Systems: JURIX 2005: The Eighteenth Annual Conference. IOS Press, Amsterdam, The Netherlands, 13–22.
[23] Carole D Hafner and Donald H Berman. 2002. The role of context in case-based legal reasoning: teleological, temporal, and procedural. Artificial Intelligence and Law 10, 1 (2002), 19–64.
[24] John Henderson and Trevor Bench-Capon. 2019. Describing the development of case law. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law (ICAIL '19). Association for Computing Machinery, New York, NY, USA, 32–41.
[25] John F Horty and Trevor JM Bench-Capon. 2012. A factor-based definition of precedential constraint. Artificial Intelligence and Law 20, 2 (2012), 181–214.
[26] Hirofumi Katsuno and Alberto O Mendelzon. 1991. Propositional knowledge base revision and minimal change. Artificial Intelligence 52, 3 (1991), 263–294.
[27] Edward H Levi. 2013. An introduction to legal reasoning. University of Chicago Press, Chicago, USA.
[28] Tingting Li, Tina Balke, Marina De Vos, Julian Padget, and Ken Satoh. 2013. A model-based approach to the automatic revision of secondary legislation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law (Rome, Italy) (ICAIL '13). Association for Computing Machinery, New York, NY, USA, 202–206.
[29] Bernhard Nebel et al. 1992. Syntax-based approaches to belief revision. Belief Revision 29 (1992), 52–88.
[30] Henry Prakken. 1991. A tool in modelling disagreement in law: preferring the most specific argument. In Proceedings of the 3rd International Conference on Artificial Intelligence and Law. Association for Computing Machinery, New York, NY, USA, 165–174.
[31] Henry Prakken. 2021. A formal analysis of some factor- and precedent-based accounts of precedential constraint. Artificial Intelligence and Law (2021), 1–27.
[32] Adam Rigoni. 2015. An improved factor based approach to precedential constraint. Artificial Intelligence and Law 23, 2 (2015), 133–160.
[33] Edwina L Rissland and Kevin D Ashley. 1987. A case-based system for trade secrets law. In Proceedings of the 1st International Conference on Artificial Intelligence and Law. Association for Computing Machinery, New York, NY, USA, 60–66.
[34] Edwina L Rissland and M Timur Friedman. 1995. Detecting change in legal concepts. In Proceedings of the 5th International Conference on Artificial Intelligence and Law. Association for Computing Machinery, New York, NY, USA, 127–136.
[35] Antonino Rotolo and Corrado Roversi. 2013. Constitutive rules and coherence in legal argumentation: The case of extensive and restrictive interpretation. In Legal Argumentation Theory: Cross-Disciplinary Perspectives. Springer Netherlands, Dordrecht, The Netherlands, 163–188.
[36] Ken Satoh. 1988. Nonmonotonic reasoning by minimal belief revision. Institute for New Generation Computer Technology, Japan.
[37] Ken Satoh, Kento Asai, Takamune Kogawa, Masahiro Kubota, Megumi Nakamura, Yoshiaki Nishigai, Kei Shirakawa, and Chiaki Takano. 2011. PROLEG: An Implementation of the Presupposed Ultimate Fact Theory of Japanese Civil Code by PROLOG Technology. In New Frontiers in Artificial Intelligence (Lecture Notes in Computer Science). Springer Berlin Heidelberg, Berlin, Heidelberg, 153–164.
[38] Ken Satoh, Masahiro Kubota, Yoshiaki Nishigai, and Chiaki Takano. 2009. Translating the Japanese Presupposed Ultimate Fact Theory into Logic Programming. In Proceedings of the 2009 Conference on Legal Knowledge and Information Systems: JURIX 2009: The Twenty-Second Annual Conference. IOS Press, Amsterdam, The Netherlands, 162–171.
[39] Marek J. Sergot, Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek, Peter Hammond, and H Terese Cory. 1986. The British Nationality Act as a logic program. Commun. ACM 29, 5 (April 1986), 370–386.
[40] Ehud Y. Shapiro. 1983. Algorithmic Program Debugging. MIT Press, Cambridge, MA, USA.
[41] David M Sherman. 1987. A Prolog model of the income tax act of Canada. In Proceedings of the 1st International Conference on Artificial Intelligence and Law. Association for Computing Machinery, New York, NY, USA, 127–136.
[42] Michael Thielscher. 2001. The qualification problem: A solution to the problem of anomalous models. Artificial Intelligence 131, 1-2 (2001), 1–37.
[43] Bart Verheij. 2008. About the Logical Relations between Cases and Rules. In Proceedings of the 2008 Conference on Legal Knowledge and Information Systems: JURIX 2008: The Twenty-First Annual Conference. IOS Press, Amsterdam, The Netherlands, 21–32.
[44] Bart Verheij. 2016. Correct grounded reasoning with presumptive arguments. In European Conference on Logics in Artificial Intelligence. Springer International Publishing, Cham, 481–496.
[45] Bart Verheij. 2016. Formalizing value-guided argumentation for ethical systems design. Artificial Intelligence and Law 24, 4 (2016), 387–407.
[46] Bart Verheij. 2017. Formalizing Arguments, Rules and Cases. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law (London, United Kingdom) (ICAIL '17). Association for Computing Machinery, New York, NY, USA, 199–208.
[47] James Wogulis and Michael J Pazzani. 1993. A methodology for evaluating theory revision systems: Results with Audrey II. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, USA, 1128–1134.
Explainable Artificial Intelligence, Lawyer’s Perspective
Łukasz Górski Shashishekar Ramakrishna
lgorski@icm.edu.pl shashi792@gmail.com
Interdisciplinary Centre for Mathematical and Freie Universität Berlin
Computational Modelling Berlin, Germany
Warsaw, Poland
ABSTRACT
Explainable artificial intelligence (XAI) is a research direction that has already been put under scrutiny, in particular in the AI&Law community. Whilst there have been notable developments in the area of (general, not necessarily legal) XAI, user experience studies regarding such methods, as well as more general studies pertaining to the concept of explainability among users, are still lagging behind. This paper firstly assesses the performance of different explainability methods (Grad-CAM, LIME, SHAP) in explaining the predictions for a legal text classification problem; those explanations were then judged by legal professionals according to their accuracy. Secondly, the same respondents were asked to give their opinion on the desired qualities of an (explainable) artificial intelligence (AI) legal decision system and to present their general understanding of the term XAI. This part was treated as a pilot study for a more pronounced one regarding lawyers' position on AI, and XAI in particular.

CCS CONCEPTS
• Computing methodologies → Neural networks.

KEYWORDS
explainable artificial intelligence, SHAP, LIME, Grad-CAM, survey, XAI

ACM Reference Format:
Łukasz Górski and Shashishekar Ramakrishna. 2021. Explainable Artificial Intelligence, Lawyer's Perspective. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3462757.3466145

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466145

1 INTRODUCTION
Explainable artificial intelligence (XAI) is a research field that aims to make the results of AI systems more understandable to humans [1]. Other domains notwithstanding, law is a domain that needs particular focus regarding explainability: due process of law and fair trial requirements make an explanation of AI-based decision models a must. However, the legal and technological discourses pertaining to the creation of explainable AI seem to be separated. In particular, the methods used to create explainable systems seem to favour the needs of AI engineers. This work was conceived as a first step towards the identification of requirements for explainable AI-based systems that would involve the legal perspective to a greater extent.

For the purpose of this paper, a two-way investigation was performed. Firstly, to assess the performance of different explainability methods (Grad-CAM, LIME, SHAP), we used a convolutional neural network (CNN) for text classification and used those methods to explain its predictions; those explanations were then judged by legal professionals according to their accuracy. Secondly, the same respondents were asked to give their opinion on the desired qualities of an (explainable) artificial intelligence (AI) legal decision system and to present their general understanding of the term XAI. This part was treated as a pilot study for a more pronounced one regarding lawyers' position on AI, and XAI in particular. Our results can be treated as a stepping stone towards more pronounced survey-based research.

This work contributes by:
• Presenting a comparison of different explainability methods when applied to a legal classification neural network.
• Giving an assessment of explainable artificial intelligence as understood by lawyers.

2 RELATED WORK
XAI is a research direction that has already been put under scrutiny, in particular in the AI&Law community. The general consensus is that the opening of black-box models is a conditio sine qua non for assuring their trustworthiness and compliance with the normative background [12]. Moreover, in [33] it was noted that the development of explainable technical solutions is necessarily connected with the regulatory frameworks. Importantly, AI and legal specialists need to engage in a dialogue when tackling explainability questions in legal AI [33]. However, we are currently unaware of any study that aimed to investigate lawyers' understanding of XAI and to elicit a system's requirements from them. Yet, obviously, the legal perspective was brought into the discourse with the advent of legislation such as the European GDPR or the envisioned ePrivacy Regulation or Digital Services Act, other areas of law, like tort, notwithstanding [9]. As far as the legal perspective is concerned, in [7] it was noted that the judiciary will play an important role in the production of different forms of XAI, especially because judicial reasoning, using case-by-case consideration of the facts to produce nuanced decisions, is a pragmatic way to develop rules for XAI [7]. Whilst this way of thinking sounds very attractive and recalls how seminal works on case-based reasoning in AI originated in the AI&Law community [24], the cited passage clearly alludes to the common law judge's mode of thinking, and a civil law judge may be more concerned with general rules.
This train of thought should also remind us of historical developments regarding explainability in AI&Law. Historically, case-based reasoners and rule-based reasoners (symbolic AI) are inherently interpretable [3]. Yet, producing viable explanations for deep learning algorithms (sub-symbolic AI) remains a challenge. Atkinson et al. cite Robbins, who remarked that [a]ny decision requiring an explanation should not be made by machine learning (ML) algorithms. Automation is still an option; however, this should be restricted to the old-fashioned kind of automation whereby the considerations are hard-coded into the algorithm [3][25]. Consequently, historical machine-learning-based systems (like SMILE) were capable of achieving good performance, though introspection of their domain models repeatedly showed that those were often incomplete or faulty [3]. The development of twin systems, with a traditional rule-based or case-based reasoner explaining the results of a neural network, was also a subject of scrutiny. Research in this respect continues to test newly developed methods to achieve good explanations. The usage of an attention network, accompanied by attention-weight-based text highlighting, was explored in [4], with a negative conclusion regarding this approach's feasibility. In general, the Competition on Legal Information Extraction and Entailment (COLIEE) is a venue for applying state-of-the-art research regarding domain-specific information selection and justification generation methods [21][20].

The work [4] is exemplary in including a user evaluation study. However, whilst there have been notable developments in the area of (general, not necessarily legal) XAI, user experience studies regarding such methods, as well as more general studies pertaining to the concept of explainability among users, are still lagging behind [31] (with, for example, <1% of papers in the area of case-based reasoning including user evaluations in the first place [12][11]). The results presented herein can be considered as facilitating such a dialogue.

On the other hand, surveys are commonly used to assess the usability of computer-aided methods for lawyers. Whilst the study of law is commonly associated rather with logical and linguistic studies of texts, interdisciplinary studies of law do include methods like survey-based studies. Works that employed such a methodology include [4] (the aforementioned attention-weight-based text highlighting) or [5] (for building an ontology of basic legal knowledge). As recognized in [5], usability studies are an important part of the human-centered design process, where they are used for, inter alia, requirements specification with the use of questionnaire-based methods. Such a methodology was employed, for example, in [32], where an A/B study was conducted to grade the understandability of a student loan decision letter created by an automated decision system. However, that system was rule-based, whereas explainability is mostly connected with deep-learning systems, which remain inherently non-interpretable.

Contemporary XAI models view human-machine interaction as a dynamic process, in which the user's understanding of the system is continuously impacted by the explanations they receive [12]. The user's mental model of the system can further be decomposed into three parts: the user model of the domain, which pertains to their understanding of the area that the system is working in, for example, legal decision making; the user model of the AI system, pertaining to the knowledge of how the system is implemented, for example using a CNN or a rule-based classifier; and the user model of the explanation, i.e. knowledge of how the explanation strategy works with the system [12]. This tripartite division offers a useful conceptual framework through which the results of this paper can be analysed. While our respondents have deep domain knowledge, their knowledge of AI-based systems or explanation techniques is much shallower, if any.

In regard to the concrete explanation methods we tested (Grad-CAM, LIME, SHAP), they are of different origins. Grad-CAM is a method used for the detection of the most important input data for a CNN's prediction. While it was originally used with image data, it has already been proven to be of use in the case of textual input, including legal texts [8]. However, the aforementioned work does not include an end-user feasibility study and can be supplemented with a comparison of Grad-CAM with other explainability methods. They were, to our knowledge, studied in isolation. For example, there are already first studies regarding the usability of LIME for text data [18] (however, with conclusions pertaining to decision trees and linear models). Available analyses for SHAP, on the other hand, used it with CNN-based text classification models [36]. Nevertheless, it should be noted that post-hoc explanation methods have already been the subject of criticism, due to them representing correlations and not being faithful to the underlying model [30].

3 METHODOLOGY
3.1 Explanations generation
Herein, the effectiveness of different XAI methods when applied to the law is studied. Using a plug-in classification system based on a CNN [13][8], we created a number of predictions based on the well-known PTSD dataset [34]. Those were used as a basis for generating explanations for the CNN's predictions. In general, our pipeline (based on previous work by other authors [29][13], with significant extensions by us) is composed of an embedder, a classification CNN, a visualizer, an XAI module, and a metric-based evaluator; we have also developed a simple pre-processing module that uses some de facto industry-standard text processing libraries for spelling correction, sentence detection, irregular character removal, etc., enhanced with our own implementations which make them better suited for legal texts (though it was not used in this research). Our embedding module houses a plug-in system to handle different variants of embeddings, in particular BERT and word2vec. The classification module houses a simple 1D CNN which facilitates the explainability methods. The XAI module integrates the SHAP, Grad-CAM and LIME models (based on library solutions, as developed, respectively, by [16][29][22]). It connects to the output from the CNN to arrive at explanations based on the model used. For LIME, the following parameters were customized: kernel_width=20, num_features=150, num_samples=400.

For the purpose of this work, as an embedding layer we use DistilBERT [26] (a small, fast, cheap and light Transformer model trained by distilling BERT base), encoding input sentences into vectors. Huggingface's DistilBERT implementation was found to work relatively well with the XAI libraries that we have used; unfortunately, some of them even rely on Huggingface's implementation details, which can be a limiting factor when a plugin architecture is considered.
The CNN used in the pipeline was trained for classification. We use the Post-Traumatic Stress Disorder (PTSD) [19] dataset [35] for CNN training (as well as testing). It annotates a set of sentences originating from 50 decisions issued by the Board of Veterans' Appeals ("BVA") of the U.S. Department of Veterans Affairs according to their function in the decision [35][34][27]. The classification consists of six elements: Finding Sentence, Evidence Sentence, (Evidence-based) Reasoning Sentence, Legal-Rule Sentence, Citation Sentence, Other Sentence. This was found to be a relatively good candidate for neural network training, as the classes distinguished in this dataset are of relatively similar sizes. The output from each of the three explainer models had a different value range, including negative values. Each value inside the value range indicated the association of the token with a specific class. For this experiment of understanding the importance of a word in the downstream classification task, we round all negative explainer values off to 0 and then normalize the remaining value ranges to fit the range between 0 and 1.

The software stack used for the development of this system was instrumented under Anaconda 4.8.3 (with Python 3.8.3). Tensorflow v. 2.2.0 was used for CNN instrumentation and Grad-CAM calculations (with the code itself expanding the prior implementation available at [29]). The DistilBERT implementation and supporting code were sourced from Huggingface libraries: transformers v. 3.1.0, tokenizers v. 0.8.1rc2, nlp v. 0.4.0. For the XAI methods' implementations, SHAP v. 0.37 and LIME v. 0.1.1.18 were used. The code used for this paper is available on GitHub¹. A GPU cluster was used for the calculations (with 4x Nvidia Tesla V100 32GB GPUs).

3.2.1 Grad-CAM. Each element of the class activation map corresponds to one token and indicates its importance in terms of the score of the particular (usually the predicted) class. The class activation map gives information on how strongly the particular tokens present in the input sentence influence the prediction of the CNN.

3.2.2 SHAP. SHAP (SHapley Additive exPlanations) [17] is an explainability method based on the Shapley value concept. SHAP values are used for identifying the contribution of each concept/word/feature in a given sentence. The Shapley value is the average of the marginal contributions across all permutations, i.e. each concept's/feature's marginal contribution is averaged over all permutations to get each entity's contribution. In terms of local interpretability, each observation gets its own set of SHAP values. For our problem, we consider global interpretability, where collective SHAP values can show how much each predictor contributes, either positively or negatively, towards influencing the predictions of the CNN. SHAP has been shown to provide good consistency in attributing importance scores to each feature [15].

3.2.3 LIME. LIME (Local Interpretable Model-agnostic Explanations) [10][23] is another explainability method that provides explanations for the predictions of any classifier by learning interpretable or simpler models locally around the prediction. While Shapley values consider all possible permutations (a unified approach providing global and local consistency), the LIME approach builds sparse linear models around individual predictions in their local vicinity.
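The local-surrogate idea in Sec. 3.2.3 can be sketched without the lime library: perturb the sentence by dropping words, query the black-box classifier on each perturbation, and fit a weighted linear model whose coefficients serve as word importances. The helper below is ours and purely illustrative; its num_samples and kernel_width parameters echo, but do not reproduce, the library settings listed in Sec. 3.1.

```python
import numpy as np

def lime_like_weights(words, predict, num_samples=400, kernel_width=20.0, seed=0):
    """Fit a weighted least-squares surrogate on word-presence masks."""
    rng = np.random.default_rng(seed)
    n = len(words)
    masks = rng.integers(0, 2, size=(num_samples, n))   # which words are kept
    preds = np.array([predict([w for w, m in zip(words, mask) if m])
                      for mask in masks])
    # Exponential kernel: perturbations close to the full sentence weigh more.
    dist = (n - masks.sum(axis=1)) / n
    sample_weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    X = np.hstack([masks, np.ones((num_samples, 1))])
    W = np.diag(sample_weights)
    coef, *_ = np.linalg.lstsq(W @ X, W @ preds, rcond=None)
    return dict(zip(words, coef[:n]))                   # per-word importance

# Black box that only cares about the word "stressor".
f = lambda ws: 1.0 if "stressor" in ws else 0.0
w = lime_like_weights(["credible", "stressor", "occurred"], f)
```

Because the toy black box is exactly linear in the presence of "stressor", the surrogate recovers a weight of 1.0 for it and roughly 0 for the other words; on a real CNN the fit is only a local approximation.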
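The permutation averaging behind the Shapley values of Sec. 3.2.2 can be made concrete with an exact toy computation (ours, purely illustrative; SHAP libraries approximate this average instead of enumerating every permutation).

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution of each
    feature over all orderings in which it can join the coalition."""
    contrib = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        present = set()
        for f in order:
            before = value(present)
            present.add(f)
            contrib[f] += value(present) - before
    return {f: c / len(orders) for f, c in contrib.items()}

# Toy "model": score 1.0 if "stressor" is present, plus 0.5 if both
# "medical" and "evidence" are present (an interaction term).
def score(words):
    s = 1.0 if "stressor" in words else 0.0
    if "medical" in words and "evidence" in words:
        s += 0.5
    return s

vals = shapley_values(["stressor", "medical", "evidence"], score)
# "stressor" gets 1.0; "medical" and "evidence" split the 0.5 interaction
# equally (0.25 each), and the three values sum to the full-coalition score.
```

The same additivity property (attributions summing to the model output) is what makes SHAP values comparable across tokens in a sentence.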
3.4 Pilot User Study
To study prospective users' opinions on explainability in AI, as well as on the particular methods employed in this study, a survey was prepared. It consisted of two parts. The generic one contained three questions pertaining to the respondents' general knowledge of XAI. Those were as follows:
(1) Have you ever encountered the term "explainable artificial intelligence" (explainable AI, XAI)?
(2) According to you, how can "explainable artificial intelligence"/explainability in decision systems be characterized?
  (a) Explains the decision-making process and why we arrived at this result.
  (b) Conclusions are explained without a need for you to understand the inner workings of the system.
  (c) It is more useful for a software developer than for a lawyer.
  (d) AI-based systems are of little use and their explainability is thus irrelevant.
  (e) Explanations given by a computer system are usually not sufficient and need to be supplemented with background legal knowledge.
(3) What are your expectations regarding the justifications given by artificial-intelligence-based automated decision-making systems?

While the list of questions may have been non-exhaustive, the goal of the questions above was only to assess the depth of the respondents' knowledge of XAI. Answers to these questions help us understand each respondent's degree of confidence in the domain of XAI when evaluating/scoring the results generated by the different XAI models.

The second part of the study was aimed at eliciting the lawyers' assessment of the three XAI methods that can be used to explain the CNN's predictions. In total, six correctly classified sentences from the test set were chosen as a basis for the study. The words composing those sentences were highlighted with colors of different intensities, according to their importance in the final prediction. All the words were subject to highlighting, as even stopwords can be of importance in legal interpretation. Respondents were asked to grade each visualization on a scale from 0 (worst) to 10 (best). We decided to use only correctly classified sentences, as the visualizations for incorrect ones might have looked peculiar to the respondents and could have confused them when compared with the visualizations for the correct ones. Moreover, understanding this part of the study was already not easy for the respondents, with one commenting that he skipped this part as he did not understand what the colorful words meant.

In the end, from the classes distinguished in the dataset, three were chosen for the visualization: citation sentence, reasoning sentence (or evidence-based reasoning sentence) and legal rule sentence. The classes were chosen so that each class stands out when compared with the others. The original dataset includes two additional classes, finding sentence and evidence sentence. However, even its authors admit that it may be difficult to tell different sentences apart when all the classes are compared (e.g. fact-finding ones vs. legal rule ones are easy to conflate). As our respondents do not necessarily have experience with cases under the cognition of the Board of Veterans' Appeals, it was decided that using only a subset of the sentences' classes was a good compromise.

The following sentences were chosen for grading:
(1) Evidence-based-reasoning sentences
  (a) Further, as discussed below, none of the medical evidence indicates that a psychiatric disorder had its onset during service, and psychiatric disorders are complex matters requiring medical evidence for diagnosis; they are not the kind of disorders that subject to lay observation.
  (b) Given the inconsistencies between the Veteran's reports and the objective evidence of record, the Veteran's credibility is diminished.
(2) Legal rule sentences
  (a) Service connection for PTSD requires medical evidence diagnosing the condition in accordance with 38 C.F.R. 4.125(a); a link, established by medical evidence, between current symptoms and an in-service stressor; and credible supporting evidence that the in-service stressor occurred.
  (b) There must be 1) medical evidence diagnosing PTSD; 2) a link, established by medical evidence, between current symptoms of PTSD and an in-service stressor; and 3) credible supporting evidence that the claimed in-service stressor occurred.
  (c) The Federal Circuit has held that 38 U.S.C.A. 105 and 1110 preclude compensation for primary alcohol abuse disabilities and secondary disabilities that result from primary alcohol abuse.
(3) Citation sentence
  (a) See also Mittleider v. West, 11 Vet. App. 181, 182 (1998) (in the absence of medical evidence that does so, VA is precluded from differentiating between symptomatology attributed to a nonservice-connected disability and symptomatology attributed to a service-connected disability).

4 RESULTS
4.1 Neural network training
Firstly, for classification, the CNN was trained for 10 epochs, with a batch size of 1000 elements and a learning rate of 0.001. 80% of the PTSD dataset was used for training, with the remaining 20% left out as a test set. This allowed us to achieve an accuracy of ca. 83% on the test set. The PTSD dataset was found to be a good candidate for training, as it is relatively balanced, with class sizes varying from 1941 elements (Evidence sentence) to 389 (Finding sentence) in the case of the training set. The relations between the sizes of the classes in the case of the test set were similar. For details on training effectiveness, Table 1 can be consulted.

4.2 Study results. Lawyers' conceptualization of XAI
For our pilot study, we collected 21 surveys. Respondents represented a variety of legal professions, including university professors, attorneys and junior legal advisors. Results for question 1 (prior exposure to the term "XAI") and question 2 (closed questions regarding the respondents' understanding of XAI) are summarised in Table 2, given as a percentage of the respective answers. In Table 3, the same answers are broken down according to the respondents' prior exposure to XAI.
Table 1: Classification report for the CNN trained for the purpose of this paper

  Class                Precision  Recall  F1-score  Support
  Other Sentence       0.96       0.57    0.71      83
  Reasoning Sentence   0.59       0.41    0.48      148
  Finding Sentence     0.63       0.51    0.57      101
  Legal Rule Sentence  0.76       0.97    0.85      207
  Citation Sentence    0.99       1.00    0.99      213
  Evidence Sentence    0.88       0.95    0.92      479

Table 2: Respondents' answers to questions 1 and 2

              yes   no    no opinion
  Question 1  62%   38%   –
  Question 2
  a           90%   5%    5%
  b           33%   52%   14%
  c           14%   76%   10%
  d           0%    90%   10%
  e           71%   19%   10%

Table 3: Respondents' answers to question 2, broken down according to their prior exposure to XAI

       Prior exposure              No prior exposure
       yes    no    no opinion    yes   no    no opinion
  a    100%   0%    0%            75%   13%   13%
  b    46%    46%   8%            13%   63%   25%
  c    23%    69%   8%            0%    88%   13%
  d    0%     85%   15%           0%    100%  0%
  e    77%    15%   8%            63%   25%   13%

Most respondents did not consider explainability to be more useful for computer scientists (76%); three persons disagreed, with only one of them seeing the need to understand the system's inner workings as a precondition for understanding its explanation (as evidenced by the previous question).

The artificial intelligence & law curriculum seems to have penetrated the ranks of lawyers, as 90% of respondents see the prospects of the use of such systems (and, by extension, of XAI). The remaining 10% had no opinion. 71% of respondents see the need to supplement machine-based explanations with further background legal knowledge; 19% think otherwise.

In conclusion, lawyers seem to be affirmative regarding the usefulness of AI and XAI. Yet, more work should be devoted to preparing explanations that do not involve prior knowledge of how a given system works. If XAI is to be deployed and aimed at non-professionals, the clarity and completeness of the explanations should be the focal point, so that the decisions can be understood even without deeper background legal knowledge.

When we asked each respondent about their list of expectations regarding the justifications given by an AI-based decision system (question 3), the majority of their expectations were in line with the general requirements necessary for such a system. Users' expectations can be divided into three broad categories. These categories are also in line with the user's mental model of an AI system (cf. Sec. 2). General requirements, coming from users' deep knowledge of the law and its principles, like due process, contain expectations such as: fairness, lack of bias, transparency in the decision process, consequence awareness. Secondly, the users' model of an AI-based system tells them how it can increase the effectiveness of their work. Here, respondents remarked that such a system should decrease their manual effort, lessen time consumption and introduce automation. Finally, the users' knowledge of the XAI implementation is concerned with supporting the system's conclusions with citations and the provision of facts relevant to the system's decision. It should be noted that those requirements apply not just to XAI systems, but also to any legal software system in general, AI-based or not. Thus, the mental model of an AI-based system in the case of prospective users is not fully developed. Fig. 1 depicts the key vocabulary set used by the respondents while providing their list of expectations.
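The score comparison reported later (Sec. 4.3) rests on a battery of standard tests: a D'Agostino-Pearson normality check, Levene's test for equality of variances, one-way ANOVA, and the Friedman test as a non-parametric cross-check. A minimal scipy sketch of that battery, run here on made-up 0-10 grades rather than the study data, might look like this:

```python
import numpy as np
from scipy import stats

# Hypothetical 0-10 grades from the same respondents for each method.
grad_cam = np.array([7, 6, 8, 5, 7, 6, 9, 4, 6, 7, 5, 8])
shap     = np.array([6, 7, 5, 6, 8, 5, 7, 6, 4, 7, 6, 5])
lime     = np.array([7, 5, 6, 7, 6, 6, 8, 5, 6, 6, 7, 6])

# D'Agostino-Pearson normality test per method (H0: scores are normal).
normal_p = [stats.normaltest(s).pvalue for s in (grad_cam, shap, lime)]

# Levene's test for equality of variances (an ANOVA assumption).
_, levene_p = stats.levene(grad_cam, shap, lime)

# One-way ANOVA; the non-parametric Friedman test on the paired grades
# serves as a cross-check if the ANOVA assumptions are in doubt.
_, anova_p = stats.f_oneway(grad_cam, shap, lime)
_, friedman_p = stats.friedmanchisquare(grad_cam, shap, lime)
```

Each call returns a test statistic and a p-value; with α = 0.05, a p-value above the threshold means the corresponding null hypothesis (normality, equal variances, equal mean scores) cannot be rejected.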
Some of the requirements listed by the respondents are specific to the legal decision support system's perspective. They are interesting not because of their novelty, but because only a few domain experts/end users thought about them. Some such requirements are shown below:
(1) "It should have the option of forgetting the decisions taken in past. However, it should also have a certain standard form of code of conduct."
(2) "There should be scope for generating contrasting explanations based on various objectives/functions/factors."
(3) "Basis on which AI has neglected the other information."
(4) "Legally correct explanations."
(5) "Option for collaborative (Man-Machine) explanation of the law, context, and linguistic."
(6) "Understandable explanations and Better Interfaces."

4.3 Study results. XAI methods' assessment
Tables 4, 5 and 6 can be consulted for comparisons of the different XAI methods when used with our sample sentences, in terms of user score as well as comparison metrics (please note that the matrix of I metric results is symmetric, therefore repeated results were removed from the table). Below are some observations made based on these results:
• Users' expectations seem to have differed to a large extent, which can be seen from the division of respondents' grades of SHAP's explanation of the citation sentence (Table 6 can be consulted for the visualization). SHAP marked all the words as very important, while, for example, Grad-CAM marked only the reference to the source material and omitted the summary included with the citation. For the authors of this paper, Grad-CAM proved superior here, yet many respondents judged SHAP highly: 9 respondents gave it the lowest score of all three (with two giving 0), while 8 graded it the highest (with one person giving 10).
• While the heatmaps for the citation sentence from Grad-CAM and SHAP seem to be very different in nature (also pointed out by their F metric), the respondents provided scores that vary only by a small margin.
• Based on the F and I metric scores for all the sentence types, LIME in general seems to have higher scores compared to Grad-CAM and then SHAP.
• Both SHAP and LIME seem to be more sensitive to a change in the threshold value in terms of their F metric when compared to Grad-CAM.
• Based on respondents' average scores, Grad-CAM and SHAP seem to perform consistently between sentence types when compared against LIME. But, based on the F metric, both LIME and SHAP seem to have consistent values between sentence types as compared to Grad-CAM.

Herein we have presented a software system, based on a CNN, capable of classifying legally relevant texts, and coupled the system with an explainability module. Furthermore, this explainability module was then subjected to a user study. The use of ANOVA (because D'Agostino's and Pearson's test did not allow us to reject the hypothesis that the scores for each of the XAI methods have normal distributions, and the same conclusions were arrived at when using Levene's test for the equality of variances) confirms that the discrepancy between the various scores is not of practical importance. The same results were achieved using the Friedman test (α = 0.05 in all the cases). Therefore, other factors should be taken into account when choosing a particular method. From a software engineering point of view, it should be noted that the implementation of certain methods is dependent not only on the users' voice but also on technical feasibility. In this respect, it is of importance that Grad-CAM is a method used only together with CNNs, and we are unaware of any works that managed to use it with other neural network architectures. LIME is dependent on a number of hyperparameters. In this work, we have chosen the largest values that allowed us to carry out the calculations with our hardware resources and software implementation. Yet, many works suggest that one should fine-tune those values until the explanations received are close to one's expectations. However, this introduces a risk of overfitting the explanatory procedures, and having such fine-tuned explanations would be counterproductive with regard to our study. As far as SHAP goes, the library solution we used is still under active development, with documentation not always up to date and with limited support for external libraries. Regarding some of the observations presented hereinbefore, there are a few for which different evaluation techniques (technical, empirical, or mixed) need to be performed to arrive at conclusions. Such conclusions need to be backed by domain experts. Those will form a basis for future work.

5 CONCLUSION
In this paper, we have presented a comparison of different explainability methods when applied to a legal classification neural network (CNN) and provided an assessment of explainable artificial intelligence as understood by lawyers. The different XAI methods were graded similarly by the users, though certain variances can be spotted. The metrics presented herein offer software engineers an option to quantify the explanations given by the system. When a more general point of view is taken, one which comes from prospective users, it should be noted that lawyers are generally looking forward to the implementation of (explainable) artificial intelligence systems and solutions that allow them to be more efficient in their use of time. Users are thus waiting for the results of AI, and in particular XAI, research.

ACKNOWLEDGMENTS
This research was carried out with the support of the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, under grant no. GR81-14.

DISCLAIMER
The views reflected in this article are the views of the author (SR) and do not necessarily reflect the views of his employer organisation or its member firms.

REFERENCES
[1] Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6 (2018), 52138–52160.
[2] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity Checks for Saliency Maps. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc., 9505–9515. https://proceedings.neurips.cc/paper/2018/file/294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf
[3] Katie Atkinson, Trevor Bench-Capon, and Danushka Bollegala. 2020. Explanation in AI and law: Past, present and future. Artificial Intelligence (2020), 103387.
[4] L. Karl Branting, Craig Pfeifer, Bradford Brown, Lisa Ferro, John Aberdeen, Brandy Weiss, Mark Pfaff, and Bill Liao. 2020. Scalable and explainable legal prediction. Artificial Intelligence and Law (2020), 1–26.
[5] Núria Casellas. 2011. Legal ontology engineering: Methodologies, modelling trends, and the ontology of professional judicial knowledge. Vol. 3. Springer Science & Business Media.
[6] Jaekeol Choi, Jungin Choi, and Wonjong Rhee. 2020. Interpreting Neural Ranking Models using Grad-CAM. arXiv preprint arXiv:2005.05768 (2020).
[7] Ashley Deeks. 2019. The judicial demand for explainable artificial intelligence. Columbia Law Review 119, 7 (2019), 1829–1850.
[8] Lukasz Gorski, Shashishekar Ramakrishna, and Jedrzej M. Nowosielski. 2020. Towards Grad-CAM Based Explainability in a Legal Text Processing Pipeline. arXiv preprint arXiv:2012.09603 (2020).
[9] Philipp Hacker, Ralf Krestel, Stefan Grundmann, and Felix Naumann. 2020. Explainable AI under contract and tort law: legal incentives and technical challenges. Artificial Intelligence and Law (2020), 1–25.
[10] Linwei Hu, Jie Chen, Vijayan N. Nair, and Agus Sudjianto. 2020. Surrogate Locally-Interpretable Models with Supervised Machine Learning Algorithms. arXiv:2007.14528 [stat.ML]
[11] Mark T. Keane and Eoin M. Kenny. 2019. How case-based reasoning explains neural networks: A theoretical analysis of XAI using post-hoc explanation-by-example from a survey of ANN-CBR twin-systems. In International Conference on Case-Based Reasoning. Springer, 155–171.
[12] Eoin M. Kenny, Courtney Ford, Molly Quinn, and Mark T. Keane. 2021. Explaining black-box classifiers using post-hoc explanations-by-example: The effect of explanations and error-rates in XAI user studies. Artificial Intelligence 294 (2021), 103459. https://doi.org/10.1016/j.artint.2021.103459
[13] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1746–1751. https://doi.org/10.3115/v1/D14-1181
[14] David Krakov and Dror G. Feitelson. 2013. Comparing performance heatmaps. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 42–61.
[15] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2019. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv:1802.03888 [cs.LG]
[16] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
[17] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 4768–4777.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Ł. Górski, S. Ramakrishna
[18] Dina Mardaoui and Damien Garreau. 2020. An Analysis of LIME for Text Data. arXiv preprint arXiv:2010.12487 (2020).
[19] Victoria Hadfield Moshiashwili. 2015. The Downfall of Auer Deference: Veterans Law at the Federal Circuit in 2014.
[20] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. [n.d.]. COLIEE 2020: Methods for Legal Document Retrieval and Entailment. ([n.d.]).
[21] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. 2019. A Summary of the COLIEE 2019 Competition. In JSAI International Symposium on Artificial Intelligence. Springer, 34–49.
[22] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 1135–1144.
Explainable Artificial Intelligence, Lawyer’s Perspective ICAIL’21, June 21–25, 2021, São Paulo, Brazil
[23] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs.LG]
[24] Edwina L. Rissland, Kevin D. Ashley, and Ronald Prescott Loui. 2003. AI and Law: A fruitful synergy. Artificial Intelligence 150, 1-2 (2003), 1–15.
[25] Scott Robbins. 2019. A misdirected principle with a catch: explicability for AI. Minds and Machines 29, 4 (2019), 495–514.
[26] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs.CL]
[27] Jaromír Savelka, Vern R. Walker, Matthias Grabmair, and Kevin D. Ashley. 2017. Sentence Boundary Detection in Adjudicatory Decisions in the United States.
[28] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626.
[29] Haebin Shin. [n.d.]. Grad-CAM for Text. https://github.com/HaebinShin/grad-cam-text. Accessed: 2020-08-05.
[30] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 180–186.
[31] Jasper van der Waa, Elisabeth Nieuwburg, Anita Cremers, and Mark Neerincx. 2021. Evaluating XAI: A comparison of rule-based and example-based explanations. Artificial Intelligence 291 (2021), 103404.
[32] Tom M. van Engers and Dennis M. de Vries. 2019. Governmental Transparency in the Era of Artificial Intelligence. In JURIX. 33–42.
[33] Martijn van Otterlo and Martin Atzmueller. 2018. On Requirements and Design Criteria for Explainability in Legal AI. In XAILA@JURIX.
[34] Vern R. Walker, Ji Hae Han, Xiang Ni, and Kaneyasu Yoseda. 2017. Semantic Types for Computational Legal Reasoning: Propositional Connectives and Sentence Roles in the Veterans' Claims Dataset (ICAIL '17). Association for Computing Machinery, New York, NY, USA, 217–226. https://doi.org/10.1145/3086512.3086535
[35] Vern R. Walker, Krishnan Pillaipakkamnatt, Alexandra M. Davidson, Marysa Linares, and Domenick J. Pesce. 2019. Automatic Classification of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts, Montreal, QC, Canada, June 21, 2019 (CEUR Workshop Proceedings, Vol. 2385). http://ceur-ws.org/Vol-2385/paper1.pdf
[36] Wei Zhao, Tarun Joshi, Vijayan N. Nair, and Agus Sudjianto. 2020. SHAP values for Explaining CNN-based Text Classification Models. arXiv preprint arXiv:2008.11825 (2020).
[37] Wei Zhao, Tarun Joshi, Vijayan N. Nair, and Agus Sudjianto. 2020. SHAP values for Explaining CNN-based Text Classification Models. arXiv:2008.11825 [cs.CL]
Unravel Legal References in Defeasible Deontic Logic
Guido Governatori
Data61, CSIRO
Dutton Park, Queensland, Australia
guido.governatori@data61.csiro.au

Francesco Olivieri
Institute for Integrated and Intelligent Systems, Griffith University
Nathan, Queensland, Australia
f.olivieri@griffith.edu.au
ABSTRACT
Legal documents often contain references to either other documents, or other parts (of the same document). The use of references is meant to reduce the complexity of the documents; however, they pose serious concerns for the formal (logical) representation of the norms stipulated in the document itself. We propose an approach to directly model the references in a logic language and to resolve them during the computation of the legal effects in force in a case. The approach is proved to be computationally feasible and to have an efficient algorithmic implementation.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; • Applied computing → Law; • Theory of computation → Proof theory; Automated reasoning.

KEYWORDS
Defeasible Deontic Logic, legal references

ACM Reference Format:
Guido Governatori and Francesco Olivieri. 2021. Unravel Legal References in Defeasible Deontic Logic. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466080

ICAIL'21, June 21–25, 2021, São Paulo, Brazil. © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8526-8/21/06...$15.00. https://doi.org/10.1145/3462757.3466080

1 INTRODUCTION
A typical characteristic of legal documents aiming at the reduction of the complexity of the documents themselves is the use of references, either to other sections of the same document (internal references), or to sections of other documents (external references). The key idea behind references is that they are used to import "content" from relevant provisions into the provision where the reference appears, without the need to repeat the content/text of the imported provision. Also, frequently, the content is not simply imported, as a reference may require a legal lens to unravel the actual (legal) intent of the content in the context where the reference appears.

The focus of this work is to examine the references in a legal document under the legal lens that assesses the legal status of said references. Here, by the legal status we mean whether the provision corresponding to the reference is applicable, has been complied with, or has been violated. The examples below provide a wide set of instances of the types of references for which we are going to develop techniques to represent them in a logic designed to formalise legal reasoning.

Example 1.1 (Applicable). Section 9.8.2, Telecommunications Consumer Protections Code.
Termination by a Customer: A Supplier must ensure that, if so notified by the Customer who is exercising the applicable termination right in clause 9.9.1 h), if any, as a result of the move, the Supplier terminates the relevant Customer Contract relating to the Telecommunications Service within 5 Working Days of receiving the Customer's notice.

Example 1.2 (Complied with). Section 9.8.2, Telecommunications Consumer Protections Code.
Provided that a Supplier complies with the terms of this clause 9.9 in circumstances of a move to an alternate wholesale network provider, the Supplier is not required to comply with the other provisions of this Chapter in relation to such a move except for clauses 9.5, 9.6, and 9.7.

Example 1.3 (Violated). Section 26-105, Income Tax Assessment Act 1997.
(2) You cannot deduct under this Act a non-cash benefit if:
(a) section 14–5 in Schedule 1 to the Taxation Administration Act 1953 requires you to pay an amount to the Commissioner before providing the benefit, because of any of the following provisions in that Schedule:
(i) section 12–35 (about payments to employees);
(ii) section 12–40 (about payments to directors);
(iii) section 12–47 (about payments to religious practitioners);
(iv) section 12–60 (about payments under labour hire and certain other arrangements);
(v) in relation to a supply, other than a supply referred to in subsection (3) of this section–section 12-190 (about quoting of ABN); and
(b) you fail to comply, or purportedly comply, with section 16–150 in that Schedule in relation to the amount.

Given a formal representation of a set of legal provisions as logical expressions, one of the aims of a logic for legal reasoning is to infer what conclusions are entailed by a set of given premises (e.g., corresponding to the facts of a case). Said conclusions are meant to represent: (a) the legal requirements (or effects) that hold, or are in force, based on the facts of the case, and (b) the set of norms (the given legal provisions).
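A minimal, hypothetical sketch of this inference task: provisions as IF/THEN rules, and a naive forward-chaining closure over a set of facts. The rule and fact names below are invented for illustration, and the closure is purely monotonic; the defeasibility, exceptions, and deontic modalities that the paper formalises in Section 2 are deliberately omitted here.

```python
def closure(facts, rules):
    """Repeatedly fire every rule whose IF-part is satisfied by known facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, conclusion in rules:
            if set(antecedents) <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# A hypothetical reading of Example 1.1: if the Customer gives notice while
# exercising the clause 9.9.1 h) termination right, the Supplier is obliged
# to terminate the relevant contract within 5 Working Days.
rules = [
    (("customer_notice", "termination_right_9_9_1_h"),
     "O_terminate_contract_5_days"),
]
facts = {"customer_notice", "termination_right_9_9_1_h"}
derived = closure(facts, rules)
assert "O_terminate_contract_5_days" in derived  # the obligation is in force
```

Real provisions are defeasible, so a closure like this one would wrongly keep an effect in force in the presence of an applicable exception; that gap is what motivates the machinery developed below.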
In general, following [14], provisions can be represented by "IF . . . THEN . . ." rules, where the IF part contains the conditions under which a rule is able to produce its conclusion (or effect), which is encoded by the THEN part. Provisions are, in general, defeasible; this means that there are some provisions that give baseline conditions under which the effect of the norm holds (or enters in force). However, such baseline rules are subject to exceptions (or exclusions, or derogations). We also adopt the usual distinction between constitutive and prescriptive rules, where a constitutive rule defines an institutional fact in the underlying normative system, while a prescriptive rule stipulates the conditions under which a legal effect is in force and what the legal effect is (either an obligation, a prohibition, or a permission).

Based on the discussion so far, it should be clear that, to formalise provisions containing citations (such as the instances given in the examples), one has to understand what the conditions corresponding to the cited provisions are before such conditions are represented by logical expressions. One option is to formalise the cited provisions and then copy the formalised conditions into the formalisation of the citing provision. However, this is not enough, because we also have to include the logical representation of the semantics of the citation.

Let us start by examining the case of "applicable". It can have multiple meanings. The first is that the IF part of the provision formalisation holds (it is deemed to be true in the particular situation or case). The second meaning is that, in addition to the IF part, the THEN part holds as well. For the first case, the referring provision should import the IF part, and for the second case the import consists of the conjunction of both the IF part and the THEN part. Suppose that section X is modelled by the rule

IF 𝐴1, . . . , 𝐴𝑛 THEN 𝐵 (1)

and that section Y, citing section X, requires conditions 𝐶1, . . . , 𝐶𝑚 to produce the effect 𝐷 when section X applies. According to the first reading, the rule for section Y is

IF 𝐴1, . . . , 𝐴𝑛, 𝐶1, . . . , 𝐶𝑚 THEN 𝐷. (2)

Based on the second interpretation, the rendering of section Y is

IF 𝐴1, . . . , 𝐴𝑛, 𝐵, 𝐶1, . . . , 𝐶𝑚 THEN 𝐷. (3)

However, in case there are multiple provisions supporting the same effect or conclusion, such rules might not offer an accurate model of the normative system. To handle this situation, it is not enough to use the conjunction of the IF and THEN parts; we also have to include, in the formalisation, elements from the provisions that might provide exceptions to the cited provision. Moreover, it is possible to have exclusions of the exclusions, and this has a ripple effect on the conditions to be included in the citing provision. When we look at the examples above, specifically Example 1.3, and we consider real-life Acts, where exclusion sections can run for several pages with many rules encoding the exceptions (and the exceptions themselves have further exclusions), it is evident that unravelling citations based on the technique we just alluded to is untenable.

Consequently, the solution we advocate is to extend the logical language with a new class of unary predicates whose argument is the cited provision; the new predicates correspond to the possible legal statuses of their argument. Hence, for section X in the modelling of section Y, we could use

IF 𝐶1, . . . , 𝐶𝑚, 𝐴𝑝𝑝𝑙𝑦(𝑆𝑒𝑐𝑡𝑖𝑜𝑛 X) THEN 𝐷 (4)

The issue now is to determine when 𝐴𝑝𝑝𝑙𝑦(𝑆𝑒𝑐𝑡𝑖𝑜𝑛 X) holds and can then be used to trigger the rule above. A solution would be to add the rule

IF 𝐴1, . . . , 𝐴𝑛 THEN 𝐴𝑝𝑝𝑙𝑦(𝑆𝑒𝑐𝑡𝑖𝑜𝑛 X). (5)

This approach works well for the first meaning of applicable, but it does not for the second interpretation, where it faces the same issues discussed for the previous approach when exceptions exist.

Complied with and violated are types of references that apply to prescriptive rules only. In the simplest case, the semantics for complied with requires that if the provision is applicable, then the effect (i.e., an obligation or a prohibition) has been fulfilled. This amounts to the following: if the effect is the obligation O𝐴, then 𝐴 holds as well; if the effect is a prohibition (i.e., F𝐴, equivalent to O¬𝐴), then 𝐴 does not hold. The case of violated is similar: the rule is violated if the content of the obligation does not hold, or if it holds when the effect is a prohibition. Simple adaptations of the techniques exemplified above are available. All we have to do is to include O𝐴 and 𝐴 (for complied with), or O𝐴 and ¬𝐴 (for violated), in the IF part of the appropriate rules.

Things get more complicated when we take into account a key feature of normative systems: situations where it is possible to fail to comply with some provisions, but compensatory measures are contemplated. Hence, it is possible to fail in fulfilling the primary obligation and still be compliant by fulfilling the compensatory measures. Nevertheless, we might encounter scenarios where, typically, failing to comply with the primary obligation requires us to comply with the compensatory obligation, but in specific circumstances what constitutes the compensatory measure to recover from the violation of the primary obligation is forbidden, and then there is no effective way to compensate and restore compliance. According to this discussion, we comply with an applicable provision when we fulfil the primary obligation, or when at least one of the compensatory obligations is actually in force and we fulfil it. Conversely, a provision is not complied with when one of the obligations in force (either the primary obligation or one of the compensatory obligations) is not fulfilled and does not admit a compensation. While it is possible to write rules importing the content of the related rules (in this case, rules preventing a compensation from being in force), the process would require an oracle that simulates the underlying reasoning to determine which rules contribute to the conditions of the imported citation.

The method we propose is to incorporate in the logic itself the conditions for the resolution of the citation predicates. This means that, when the logic determines that the conditions of a rule hold and it is possible to infer that the rule's conclusion holds as well, then we can assert: (i) that the rule applies (according to the second reading of applicable), and (ii) that the citations are resolved at the same time the inferences that can be drawn from a set of rules and a set of facts are computed. The combination of (i) and (ii) avoids the need to write additional rules to capture the intended semantics of the cited provisions (eventually using an oracle to solve the semantic conditions).
2 LOGIC
In this section we introduce the logical apparatus, which is an extension of standard Defeasible Logic (DL) [1] and, more specifically, is based on the modal, deontic frameworks proposed in [5], as we shall need means to formalise prescriptive behaviours as well as to determine which situations are compliant and which ones are not.

2.1 Language of Defeasible Deontic Logic
Let PROP be the set of propositional atoms; then the set of literals is Lit = PROP ∪ {¬𝑝 | 𝑝 ∈ PROP}. The complement of a literal 𝑝 is denoted by ∼𝑝: if 𝑝 is a positive literal 𝑞 then ∼𝑝 is ¬𝑞; if 𝑝 is a negative literal ¬𝑞 then ∼𝑝 is 𝑞. The set of deontic literals is ModLit = {𝑀𝑙, ¬𝑀𝑙 | 𝑙 ∈ Lit ∧ 𝑀 ∈ {O, P}}. Note that we will have neither specific rules nor a modality for prohibitions, as we treat them according to the standard duality that something is forbidden iff the opposite is mandatory (F𝑝 ≡ O∼𝑝).

Lab is the set of labels, which are names for rules and will be denoted by small-capital Greek letters. As we will be interested in understanding (deriving) when a rule is violated, complied with, and active, we extend the set of literals with reference literals RefLit = {𝑌(𝛼), ¬𝑌(𝛼) | 𝛼 ∈ Lab ∧ 𝑌 ∈ {𝑎𝑐𝑡𝑖𝑣𝑒, 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑, 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑_𝑤𝑖𝑡ℎ}}.

A defeasible theory 𝐷 is a tuple (𝐹, 𝑅, >). 𝐹 ⊆ Lit is the set of facts (indisputable, constitutive statements). (In this version of the logic, we do not admit obligations and permissions as facts of the theory.) We use two kinds of rules. Non-deontic (thus standard, or constitutive) rules 𝑅^C model constitutive statements (count-as rules). Deontic rules represent prescriptive behaviours; deontic rules are either obligation rules 𝑅^O, which determine when and which obligations are in force, or permission rules 𝑅^P, which represent strong (or explicit) permissions. Finally, > is a binary relation over 𝑅 to solve conflicts in case of potentially conflicting information.

Following the ideas of [8], obligation rules gain more expressiveness with the compensation operator ⊗, which models reparative chains of obligations. Intuitively, 𝑎 ⊗ 𝑏 means that 𝑎 is the primary obligation, but if for some reason we fail to obtain, i.e. to comply with, 𝑎 (by either not being able to prove 𝑎, or by proving ∼𝑎), then 𝑏 becomes the new obligation in force. This operator is used to build chains of preferences, called ⊗-expressions. The formation rules for ⊗-expressions are: (i) every literal is an ⊗-expression; (ii) if 𝐴 is an ⊗-expression and 𝑏 is a literal, then 𝐴 ⊗ 𝑏 is an ⊗-expression. In addition, we stipulate that ⊗ obeys the following properties: (a) associativity: 𝑎 ⊗ (𝑏 ⊗ 𝑐) = (𝑎 ⊗ 𝑏) ⊗ 𝑐; (b) duplication and contraction on the right: ⨂_{𝑖=1}^{𝑚} 𝑐_𝑖 = (⨂_{𝑖=1}^{𝑘−1} 𝑐_𝑖) ⊗ (⨂_{𝑖=𝑘+1}^{𝑚} 𝑐_𝑖), for 𝑗 < 𝑘 and 𝑐_𝑗 = 𝑐_𝑘.

We adopt the standard DL definitions of strict rules, defeasible rules and defeaters [1], where a rule is an expression 𝛼 : 𝐴(𝛼) ↩→_𝑋 𝐶(𝛼). Here 𝛼 ∈ Lab is a unique label naming the rule. ↩→ ∈ {→, ⇒, ;} is the type of the rule: we use → for strict rules, ⇒ for defeasible rules, and ; for defeaters. 𝑋 ∈ {C, O, P} is the kind of the rule: if 𝑋 = C then ↩→ has no subscript and the rule is used to derive non-deontic literals (constitutive statements), whilst if 𝑋 is O or P then the rule is used to derive deontic conclusions (prescriptive statements). 𝐴(𝛼) = 𝑎1, . . . , 𝑎𝑛 is the set of antecedents/premises, where each 𝑎𝑖 is either a literal, a deontic literal, or a reference literal. Lastly, 𝐶(𝛼) is the conclusion of the rule (referred to as the head of the rule), which is a single literal in case 𝑋 ∈ {C, P}, or an ⊗-expression in case 𝑋 = O.

We use several abbreviations on sets of rules. 𝑅_s, 𝑅_d, 𝑅_sd denote, respectively, the set of strict rules only, of defeasible rules only, and of both strict and defeasible rules; 𝑅^𝑋[𝑙] is the set of constitutive rules (𝑋 = C) or deontic rules (𝑋 ∈ {O, P}) whose head is 𝑙. Lastly, 𝑅^O[𝑙, 𝑖] denotes the set of obligation rules where 𝑙 is the 𝑖-th element in the ⊗-expression.

The meaning of an ⊗-expression 𝐶(𝛼) = 𝑐1 ⊗ 𝑐2 ⊗ · · · ⊗ 𝑐𝑚 as the consequent of a rule 𝐴(𝛼) ↩→_O 𝐶(𝛼) is that: if the rule is allowed to draw its conclusion, then 𝑐1 is the obligation in force, and only when 𝑐1 is violated does 𝑐2 become the new obligation in force, and so on for the rest of the elements in the chain. In this setting, 𝑐𝑚 represents the last chance to comply with the prescriptive behaviour enforced by 𝛼; in case 𝑐𝑚 is violated as well, we end up in a non-compliant situation.

A conclusion of 𝐷 is either a tagged (deontic) literal or a tagged label; as will be clear in the remainder of the section, reference literals are not conclusions of rules: they are derived based on various conditions on the provability of particular literals in the rule. For (deontic) literals, a conclusion can have one of the following forms: (i) ±Δ_𝑋 𝑙, which means that 𝑙 is definitely provable/refutable in 𝐷; and (ii) ±𝜕_𝑋 𝑙, which means that 𝑙 is defeasibly provable/refutable in 𝐷. For labels, we use the ideas of ⊥ and ⊤ from [6] (which provide a compact notation for compliance and violation); a conclusion can have one of the following forms: (iii) ±𝜕_A 𝛼, which means that 𝛼 is active/not active (various nuances are possible, and some are given in Definitions 2.5–2.10); (iv) +⊤𝛼, which means that 𝛼 is complied with; (v) −⊤𝛼, which means that 𝛼 is not complied with (violated and not compensated); (vi) +⊥𝛼, which means that 𝛼 is violated; and lastly (vii) −⊥𝛼, which means that 𝛼 has never been violated.

A proof of length 𝑛 in 𝐷 is a finite sequence 𝑃(1), 𝑃(2), . . . , 𝑃(𝑛) of the tagged literals and tagged reference literals just described and formally defined hereafter; 𝑃(1..𝑛) denotes the first 𝑛 steps of 𝑃. If, for instance, 𝐷 proves +𝜕𝑙 at proof step 𝑛, we write 𝑃(𝑛) = +𝜕𝑙, and we also use the conventional notation 𝐷 ⊢ +𝜕𝑙.

Before defining when a rule is applicable/discarded, we provide the two definitions of being a trigger and being a blank. These two concepts concern the elements of the set of antecedents of a given rule. Conceptually, an antecedent is a trigger for a rule when its provability allows such a rule to (potentially¹) fire, whilst it is a blank when its refutability prevents such a rule from firing. ¬O𝑙 (resp. ¬P𝑙) means that the obligation (resp. permission) of 𝑙 is not in force.

The meaning of being active depends on the interpretation, which may vary from context to context. Notice that henceforth we use active to refer to a rule that we described as applicable in Section 1, to distinguish it from the Defeasible Logic notion of a rule being applicable. Various nuances of the notion of active are proposed afterwards, but we first start by formalising provability/refutability for constitutive statements, obligations, permissions, and finally violated and complied-with rules.

Definition 2.1 (Trigger). Given a deontic defeasible theory 𝐷, a proof 𝑃, and a rule 𝛼, we say that 𝑎 ∈ 𝐴(𝛼) is a trigger for 𝛼 at 𝑃(𝑛 + 1) iff

¹ Potentially, because the rule itself may be defeated.
(1) if 𝑎 ∈ Lit, then
 (a) if 𝛼 ∈ 𝑅_𝑠 then +Δ𝑎 ∈ 𝑃(1..𝑛);
 (b) if 𝛼 ∈ 𝑅_𝑑 then +𝜕𝑎 ∈ 𝑃(1..𝑛);
(2) if 𝑎 = 𝑋𝑙 and 𝑋 ∈ {O, P}, then
 (a) if 𝛼 ∈ 𝑅_𝑠 then +Δ_𝑋 𝑙 ∈ 𝑃(1..𝑛);
 (b) if 𝛼 ∈ 𝑅_𝑑 then +𝜕_𝑋 𝑙 ∈ 𝑃(1..𝑛);
(3) if 𝑎 = ¬𝑋𝑙, then −𝜕_𝑋 𝑙 ∈ 𝑃(1..𝑛), with 𝑋 ∈ {O, P};
(4) if 𝑎 is 𝑎𝑐𝑡𝑖𝑣𝑒(𝛼), then +𝜕_EA 𝛼 ∈ 𝑃(1..𝑛);
(5) if 𝑎 is ¬𝑎𝑐𝑡𝑖𝑣𝑒(𝛼), then −𝜕_EA 𝛼 ∈ 𝑃(1..𝑛);
(6) if 𝑎 is 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑(𝛼), then +⊤𝛼 ∈ 𝑃(1..𝑛);
(7) if 𝑎 is ¬𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑(𝛼), then −⊤𝛼 ∈ 𝑃(1..𝑛);
(8) if 𝑎 is 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑(𝛼), then +⊥𝛼 ∈ 𝑃(1..𝑛);
(9) if 𝑎 is ¬𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑(𝛼), then −⊥𝛼 ∈ 𝑃(1..𝑛).

Definition 2.2 (Blank). Given a deontic defeasible theory 𝐷, a proof 𝑃, and a rule 𝛼, we say that a literal 𝑎 ∈ 𝐴(𝛼) is a blank for 𝛼 at 𝑃(𝑛 + 1) iff
(1) if 𝑎 ∈ Lit, then
 (a) if 𝛼 ∈ 𝑅_𝑠 then −Δ𝑎 ∈ 𝑃(1..𝑛);
 (b) if 𝛼 ∈ 𝑅_𝑑 then −𝜕𝑎 ∈ 𝑃(1..𝑛);
(2) if 𝑎 = 𝑋𝑙 and 𝑋 ∈ {O, P}, then
 (a) if 𝛼 ∈ 𝑅_𝑠 then −Δ_𝑋 𝑙 ∈ 𝑃(1..𝑛);
 (b) if 𝛼 ∈ 𝑅_𝑑 then −𝜕_𝑋 𝑙 ∈ 𝑃(1..𝑛);
(3) if 𝑎 = ¬𝑋𝑙, then +𝜕_𝑋 𝑙 ∈ 𝑃(1..𝑛), with 𝑋 ∈ {O, P};
(4) if 𝑎 is 𝑎𝑐𝑡𝑖𝑣𝑒(𝛼), then −𝜕_EA 𝛼 ∈ 𝑃(1..𝑛);
(5) if 𝑎 is ¬𝑎𝑐𝑡𝑖𝑣𝑒(𝛼), then +𝜕_EA 𝛼 ∈ 𝑃(1..𝑛);
(6) if 𝑎 is 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑(𝛼), then −⊤𝛼 ∈ 𝑃(1..𝑛);
(7) if 𝑎 is ¬𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑(𝛼), then +⊤𝛼 ∈ 𝑃(1..𝑛);
(8) if 𝑎 is 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑(𝛼), then −⊥𝛼 ∈ 𝑃(1..𝑛);
(9) if 𝑎 is ¬𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑(𝛼), then +⊥𝛼 ∈ 𝑃(1..𝑛).

Based on these two definitions, we are now ready to define when a rule is applicable/discarded. Note that the definition of being discarded is obtained by applying the strong negation principle² to its positive counterpart.

Definition 2.3 (Applicable & Discarded). Assume a deontic defeasible theory 𝐷.
We say that a rule 𝛼 ∈ 𝑅^C ∪ 𝑅^P is applicable at 𝑃(𝑛 + 1) iff every 𝑎 ∈ 𝐴(𝛼) is a trigger for 𝛼 at 𝑃(𝑛 + 1).
We say that 𝛼 is discarded at 𝑃(𝑛 + 1) iff there exists 𝑎 ∈ 𝐴(𝛼) such that 𝑎 is a blank for 𝛼 at 𝑃(𝑛 + 1).
For obligation rules, we say that 𝛼 ∈ 𝑅^O is applicable at index 𝑖 and 𝑃(𝑛 + 1) iff (1) every 𝑎 ∈ 𝐴(𝛼) is a trigger for 𝛼 at 𝑃(𝑛 + 1), and (2) for all 𝑐_𝑗 ∈ 𝐶(𝛼) with 𝑗 < 𝑖: (2.1) if 𝛼 ∈ 𝑅_s then +Δ_O 𝑐_𝑗 ∈ 𝑃(1..𝑛), if 𝛼 ∈ 𝑅_d then +𝜕_O 𝑐_𝑗 ∈ 𝑃(1..𝑛); and (2.2) +𝜕∼𝑐_𝑗 ∈ 𝑃(1..𝑛).
We say that 𝛼 ∈ 𝑅^O is discarded at index 𝑖 and 𝑃(𝑛 + 1) iff either (1) there exists 𝑎 ∈ 𝐴(𝛼) such that 𝑎 is a blank for 𝛼 at 𝑃(𝑛 + 1), or (2) there exists 𝑐_𝑗 ∈ 𝐶(𝛼), 𝑗 < 𝑖, such that (2.1) if 𝛼 ∈ 𝑅_s then −Δ_O 𝑐_𝑗 ∈ 𝑃(1..𝑛), if 𝛼 ∈ 𝑅_d then −𝜕_O 𝑐_𝑗 ∈ 𝑃(1..𝑛), or (2.2) +𝜕𝑐_𝑗 ∈ 𝑃(1..𝑛).

Intuitively, a rule is applicable when all its premises are proven. Moreover, for an obligation rule, we also need to take its ⊗-chain into consideration. A rule being applicable at an index greater than 1 reflects the idea of compensation: all the previous elements were in-force obligations that have been violated. We thus have to establish whether the element at the current index is an in-force obligation or not. If it is, we have another chance to be compliant with the rule; if it is not, the whole norm cannot be complied with, as the previous step was indeed our last possibility to be compliant. These concepts will be formalised and further explained below with Definition 2.4 and the proof conditions ±⊤ and ±⊥.

We are now ready to provide the strict and defeasible proof conditions for constitutive statements [1]. In the following, we shall omit the explanation of some negative proof conditions, as they can be obtained via the strong negation principle.

+Δ𝑙: If 𝑃(𝑛 + 1) = +Δ𝑙 then
(1) 𝑙 ∈ 𝐹, or
(2) ∃𝛼 ∈ 𝑅_s^C[𝑙] that is applicable.

A constitutive statement is strictly proven if it is either a fact, or there exists an applicable strict rule for it. Note that inconsistencies within a deontic defeasible theory can arise only if derived from the strict part of the theory.

−Δ𝑙: If 𝑃(𝑛 + 1) = −Δ𝑙 then
(1) 𝑙 ∉ 𝐹 and
(2) ∀𝛼 ∈ 𝑅_s^C[𝑙], 𝛼 is discarded.

A constitutive statement is strictly refuted (disproven, or rejected) if it is not a fact and all the strict rules for it are discarded.

+𝜕𝑙: If 𝑃(𝑛 + 1) = +𝜕𝑙 then
(1) +Δ𝑙 ∈ 𝑃(1..𝑛), or
(2) −Δ∼𝑙 ∈ 𝑃(1..𝑛) and
(3) ∃𝛼 ∈ 𝑅_sd^C[𝑙] that is applicable and
(4) ∀𝛽 ∈ 𝑅^C[∼𝑙], either
 (1) 𝛽 is discarded, or
 (2) ∃𝜁 ∈ 𝑅^C[𝑙] s.t. 𝜁 is applicable and 𝜁 > 𝛽.

A constitutive statement is defeasibly proven if either it was already strictly proven, or the opposite is not strictly proven and there exists an applicable rule for the conclusion itself such that every opposite rule is either discarded or defeated by an applicable supporting rule. Note that whilst 𝛽 and 𝜁 can be defeaters, 𝛼 may not.

−𝜕𝑙: If 𝑃(𝑛 + 1) = −𝜕𝑙 then
(1) −Δ𝑙 ∈ 𝑃(1..𝑛) and either
(2) +Δ∼𝑙 ∈ 𝑃(1..𝑛), or
(3) ∀𝛼 ∈ 𝑅_sd^C[𝑙], 𝛼 is discarded, or
(4) ∃𝛽 ∈ 𝑅^C[∼𝑙] s.t.
 (1) 𝛽 is applicable and
 (2) ∀𝜁 ∈ 𝑅^C[𝑙], either 𝜁 is discarded or 𝜁 ≯ 𝛽.

A constitutive statement is defeasibly refuted if it was not strictly proven, and either the opposite statement was strictly proven, or all the supporting rules are discarded, or there exists an 'undefeated' opposite, applicable rule.

Proof conditions for obligations and permissions are updated versions of the ones proposed in [5, 7].

+Δ_O 𝑙: If 𝑃(𝑛 + 1) = +Δ_O 𝑙 then
(1) ∃𝛼 ∈ 𝑅_s^O[𝑙, 𝑖] that is applicable at index 𝑖.

−Δ_O 𝑙: If 𝑃(𝑛 + 1) = −Δ_O 𝑙 then
(1) ∀𝛼 ∈ 𝑅_s^O[𝑙, 𝑖], 𝛼 is discarded at index 𝑖.

For the defeasible derivation, we need to consider attacks even from permissive rules.

² The strong negation principle is closely related to the function that simplifies a formula by moving all negations to the innermost position in the resulting formula, replacing its positive tags with negative ones, and the other way around [2].
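The trigger and blank checks of Definitions 2.1 and 2.2, together with the applicability test of Definition 2.3, can be sketched in code. The sketch below is a simplified reading, not the paper's implementation: it covers defeasible rules only, models a proof history 𝑃(1..𝑛) as a set of tagged conclusions, and uses ASCII tags ("+d"/"-d" for ±𝜕, "+T"/"-T" for ±⊤, "+B"/"-B" for ±⊥) in place of the paper's notation; strict rules and the deontic and active antecedent cases are omitted for brevity.

```python
def is_trigger(antecedent, history):
    """Def 2.1 (simplified): the antecedent's provability is in the history."""
    kind, payload = antecedent
    if kind == "lit":            # item (1b): +∂a already proven
        return ("+d", payload) in history
    if kind == "complied":       # item (6): +T(alpha) already proven
        return ("+T", payload) in history
    if kind == "violated":       # item (8): +B(alpha) already proven
        return ("+B", payload) in history
    raise ValueError(kind)

def is_blank(antecedent, history):
    """Def 2.2 (simplified): the strong negation of the trigger condition."""
    kind, payload = antecedent
    if kind == "lit":            # item (1b): -∂a already proven
        return ("-d", payload) in history
    if kind == "complied":       # item (6): -T(alpha) already proven
        return ("-T", payload) in history
    if kind == "violated":       # item (8): -B(alpha) already proven
        return ("-B", payload) in history
    raise ValueError(kind)

def applicable(rule_body, history):
    """Def 2.3: applicable iff every antecedent is a trigger."""
    return all(is_trigger(a, history) for a in rule_body)

def discarded(rule_body, history):
    """Def 2.3: discarded iff some antecedent is a blank."""
    return any(is_blank(a, history) for a in rule_body)

history = {("+d", "p"), ("+T", "alpha")}
body = [("lit", "p"), ("complied", "alpha")]
assert applicable(body, history) and not discarded(body, history)
```

Note that a rule can be neither applicable nor discarded at a given step, namely when some antecedent has not yet been proven or refuted; this mirrors the constructive, step-by-step nature of the proof conditions.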
+𝜕O𝑙: If 𝑃(𝑛 + 1) = +𝜕O𝑙 then
  (1) +ΔO𝑙 ∈ 𝑃(1..𝑛), or
  (2) −Δ𝑋∼𝑙 ∈ 𝑃(1..𝑛) with 𝑋 ∈ {O, P} and
  (3) ∃𝛼 ∈ 𝑅O_sd[𝑙, 𝑖] is applicable at index 𝑖 and
  (4) ∀𝛽 ∈ 𝑅O[∼𝑙, 𝑗] ∪ 𝑅P[∼𝑙] either
      (1) 𝛽 is discarded (at index 𝑗), or
      (2) ∃𝜁 ∈ 𝑅O[𝑙, 𝑘] s.t. 𝜁 is applicable at index 𝑘 and 𝜁 > 𝛽.

Note that (i) in Condition (4) 𝛽 can be a permission rule, as explicit, opposite permissions represent exceptions to obligations, whereas 𝜁 must be an obligation rule, as a permission rule cannot reinstate an obligation, and that (ii) 𝑙 may appear at different positions (indices 𝑖, 𝑗, and 𝑘) within the three ⊗-chains.

−𝜕O𝑙: If 𝑃(𝑛 + 1) = −𝜕O𝑙 then
  (1) −ΔO𝑙 ∈ 𝑃(1..𝑛), and either
  (2) +Δ𝑋∼𝑙 ∈ 𝑃(1..𝑛) with 𝑋 ∈ {O, P}, or
  (3) ∀𝛼 ∈ 𝑅O_sd[𝑙, 𝑖] is discarded at index 𝑖, or
  (4) ∃𝛽 ∈ 𝑅O[∼𝑙, 𝑗] ∪ 𝑅P[∼𝑙] s.t.
      (1) 𝛽 is applicable (at index 𝑗), and
      (2) ∀𝜁 ∈ 𝑅O[𝑙, 𝑘] either 𝜁 is discarded at index 𝑘 or 𝜁 ≯ 𝛽.

The proof conditions for permissions follow.

+ΔP𝑙: If 𝑃(𝑛 + 1) = +ΔP𝑙 then
  (1) +ΔO𝑙 ∈ 𝑃(1..𝑛), or
  (2) ∃𝛼 ∈ 𝑅P_s[𝑙, 𝑖] is applicable.

We can derive a permission if it was already proven as an obligation.

−ΔP𝑙: If 𝑃(𝑛 + 1) = −ΔP𝑙 then
  (1) −ΔO𝑙 ∈ 𝑃(1..𝑛) and
  (2) ∀𝛼 ∈ 𝑅P_s[𝑙, 𝑖] is discarded.

Permissive defeasible derivations:

+𝜕P𝑙: If 𝑃(𝑛 + 1) = +𝜕P𝑙 then
  (1) +ΔP𝑙 ∈ 𝑃(1..𝑛), or
  (2) −ΔO∼𝑙 ∈ 𝑃(1..𝑛), and either
      (1) +𝜕O𝑙 ∈ 𝑃(1..𝑛), or
      (2) ∃𝛼 ∈ 𝑅P_sd[𝑙] is applicable and
      (3) ∀𝛽 ∈ 𝑅O[∼𝑙, 𝑗] either
          (1) 𝛽 is discarded at index 𝑗, or
          (2) ∃𝜁 ∈ 𝑅P[𝑙] ∪ 𝑅O[𝑙, 𝑘] s.t. 𝜁 is applicable (at index 𝑘) and 𝜁 > 𝛽.

Condition (1) states that if something is obligatory then it is permitted. Condition (3) considers as possible counter-arguments only obligation rules, as situations where both P𝑙 and P∼𝑙 hold are legal.

−𝜕P𝑙: If 𝑃(𝑛 + 1) = −𝜕P𝑙 then
  (1) −ΔP𝑙 ∈ 𝑃(1..𝑛), and either
  (2) +ΔO∼𝑙 ∈ 𝑃(1..𝑛), or
  (3) (1) −𝜕O𝑙 ∈ 𝑃(1..𝑛), and
      (2) ∀𝛼 ∈ 𝑅P_sd[𝑙] is discarded, or
      (3) ∃𝛽 ∈ 𝑅O[∼𝑙, 𝑗] s.t.
          (1) 𝛽 is applicable at index 𝑗 and
          (2) ∀𝜁 ∈ 𝑅P[𝑙] ∪ 𝑅O[𝑙, 𝑘] either 𝜁 is discarded (at index 𝑘), or 𝜁 ≯ 𝛽.

We now have the formal tools to define compliance and violations. We adhere to the principle that a violation occurs when the logic proves both (i) that the prescription holds and (ii) that the opposite of the content of the prescriptive behaviour is derivable. Therefore, given 𝐷 ⊢ +𝜕O𝑙, a violation is 𝐷 ⊢ +𝜕∼𝑙. Naturally, when considering chains of compensatory behaviours, we would like not just to discern norms that have been complied with against norms that have been violated, but we also want to discriminate norms that have never been violated against norms that have been complied with but where at least one compensation has occurred.

+⊤𝛼: If 𝑃(𝑛 + 1) = +⊤𝛼 then
  (1) ∃𝑖. 𝛼 ∈ 𝑅O[𝑙, 𝑖] s.t.
      (1) If (1) 𝛼 is applicable at index 𝑖 and 𝑃(1..𝑛), and
             (2) +𝜕O𝑙 ∈ 𝑃(1..𝑛),
      (2) Then +𝜕𝑙 ∈ 𝑃(1..𝑛).

A norm (obligation rule) is complied with if either it is not applicable, or there exists an element of its ⊗-chain that is an in force obligation and the content of the obligation holds.

−⊤𝛼: If 𝑃(𝑛 + 1) = −⊤𝛼 then
  (1) 𝛼 is applicable at index 1 and 𝑃(1..𝑛),
  (2) +𝜕O𝑐1 ∈ 𝑃(1..𝑛),
  (3) +𝜕∼𝑐1 ∈ 𝑃(1..𝑛), and
  (4) ∀𝑐𝑖 ∈ 𝐶(𝛼), with 𝑖 ≥ 2
      (1) If 𝛼 is applicable at index 𝑖 and 𝑃(1..𝑛), and +𝜕O𝑐𝑖 ∈ 𝑃(1..𝑛),
      (2) Then +𝜕∼𝑐𝑖 ∈ 𝑃(1..𝑛).

A norm is not complied with when it is applicable and all the in force elements of its ⊗-chain have been violated.

+⊥𝛼: If 𝑃(𝑛 + 1) = +⊥𝛼 then
  (1) 𝛼 is applicable at index 1 and 𝑃(1..𝑛), and
  (2) ∃𝑐𝑖 ∈ 𝐶(𝛼), 𝑖 ≥ 2, 𝛼 is applicable at index 𝑖 and 𝑃(1..𝑛).

A violation occurs if at least one in force element of the norm's ⊗-chain has been violated. By Definition 2.3 and the fact that the norm is applicable both at index 1 by Condition (1) and at an index greater than 1 by Condition (2), it follows that a(t least one) violation has occurred. Note that, trivially, −⊤𝛼 implies +⊥𝛼, but not the other way around.

−⊥𝛼: If 𝑃(𝑛 + 1) = −⊥𝛼 then either
  (1) 𝛼 is discarded at index 1 and 𝑃(1..𝑛), or
  (2) 𝛼 ∈ 𝑅O[𝑐, 2] is discarded at index 2 and 𝑃(1..𝑛).

A norm has never been violated if either it was never applicable by Condition (1), or the first element is an in-force, complied with obligation (guaranteed by Condition (2), as the norm was not applicable at index 2). Symmetrically to what was remarked above, −⊥𝛼 implies +⊤𝛼, but not the other way around.

Definition 2.4. Given a deontic defeasible theory 𝐷 and an obligation rule 𝛼 : 𝐴(𝛼) ⇒O 𝑐1 ⊗ · · · ⊗ 𝑐𝑚 applicable at index 1, we say that 𝛼 is
  Strongly complied with: (i) −⊥𝛼, and (ii) +𝜕O𝑐1.
  Weakly complied with: (i) +⊤𝛼 and (ii) +⊥𝛼.
  Violated: +⊥𝛼.
  Not complied with: −⊤𝛼.

A norm, to be complied with, violated, or not complied with, has to be applicable: note that we could have omitted such a requirement, but a discarded norm is always vacuously applicable. A norm is violated when at least one violation has occurred.
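The four statuses of Definition 2.4 depend only on which of ±⊤𝛼, ±⊥𝛼 and +𝜕O𝑐1 are derivable, so the case analysis can be tabulated mechanically. The helper below is a hedged sketch of that reading; the boolean encoding of the proof tags is our assumption:

```python
def norm_status(top, bottom, first_obliged):
    """Statuses of Definition 2.4 for a norm applicable at index 1.

    top          : True for +T(alpha) (complied with), False for -T(alpha)
    bottom       : True for +B(alpha) (a violation occurred), False for -B(alpha)
    first_obliged: True when the first chain element c1 is a proved obligation
    Note: the text observes that -T implies +B, so (False, False) cannot arise.
    """
    statuses = []
    if not bottom and first_obliged:
        statuses.append("strongly complied with")  # (i) never violated, (ii) c1 obliged
    if top and bottom:
        statuses.append("weakly complied with")    # complied with but violated once
    if bottom:
        statuses.append("violated")
    if not top:
        statuses.append("not complied with")
    return statuses
```

A norm that was violated once but then compensated is classified as both weakly complied with and violated, mirroring the discussion above.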
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Guido Governatori and Francesco Olivieri
A norm is strongly complied with if it has never been violated: the theory thus achieves the very first literal of the ⊗-expression. A norm is weakly complied with when, on the contrary, it is complied with but also violated; this means that all the first 𝑗 − 1 elements of the ⊗-expression were violated obligations, whereas the 𝑗-th element is an in force obligation and the theory proves the content of the obligation.

This is exactly what marks the difference between being weakly complied with and being not complied with. In fact, a norm is not complied with in two cases. The former, and more straightforward, case is when all elements (of the ⊗-expression) were in force, violated obligations. The latter case is when the first 𝑗 − 1 elements were in force, violated obligations, whilst the 𝑗-th element is not proven as obligation (i.e. 𝐷 ⊢ −𝜕O𝑐𝑗). Accordingly, it does not matter whether the theory achieves it or not, as the (𝑗 − 1)-th element was the last chance to comply with the norm.

We now enter the final part of this section, and shift our attention from compliance to active rules. Definition 2.5 identifies those applicable rules whose conclusion was actually proved. We name such rules provisionally active.

Definition 2.5 (Provisionally Active).
+𝜕PA𝛼: If 𝑃(𝑛 + 1) = +𝜕PA𝛼 then
  (1) 𝛼 is applicable at 𝑃(1..𝑛), and
  (2) +𝜕𝑋𝐶(𝛼) ∈ 𝑃(1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}.

Definition 2.6 (Provisionally Inactive).
−𝜕PA𝛼: If 𝑃(𝑛 + 1) = −𝜕PA𝛼 then either
  (1) 𝛼 is discarded at 𝑃(1..𝑛), or
  (2) −𝜕𝑋𝐶(𝛼) ∈ 𝑃(1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}.

A fairly straightforward limitation of Definition 2.5 is that it fails to identify which rules 'effectively contribute' in proving a claim: a rule may very well be applicable, but not really necessary, as the following example shows.

Example 2.7. Assume that, in a deontic defeasible theory 𝐷, the following rules are all applicable: 𝛼1, 𝛼2, and 𝛼3 are for 𝑙, while 𝛽1 and 𝛽2 are for ∼𝑙. It also holds that 𝛽1 > 𝛼1, 𝛼2 > 𝛽1, 𝛽2 > 𝛼2 and 𝛼3 > 𝛽2. In this context, rule 𝛼1 does not contribute in defeating any 𝛽-like rule (it is actually defeated by 𝛽1), and thus its contribution is possibly limited/superfluous. On the other hand, 𝛼2 is indeed defeated by 𝛽2, but its role is fundamental because it is the only 𝛼-like rule that defeats 𝛽1.

We hence try to identify those rules that are undefeated (by any applicable, opposite rule).

Definition 2.8 (Provisionally Active v2).
+𝜕PA𝛼: If 𝑃(𝑛 + 1) = +𝜕PA𝛼 then
  (1) 𝛼 is applicable at 𝑃(1..𝑛),
  (2) +𝜕𝐶(𝛼) ∈ 𝑃(1..𝑛), and
  (3) ∀𝛽 ∈ 𝑅[∼𝐶(𝛼)]: if 𝛽 is applicable at (1..𝑛), then 𝛽 ≯ 𝛼.

Definition 2.9 (Provisionally Inactive v2).
−𝜕PA𝛼: If 𝑃(𝑛 + 1) = −𝜕PA𝛼 then either
  (1) 𝛼 is discarded at 𝑃(1..𝑛), or
  (2) −𝜕𝐶(𝛼) ∈ 𝑃(1..𝑛), or
  (3) ∃𝛽 ∈ 𝑅[∼𝐶(𝛼)]: 𝛽 is applicable at (1..𝑛) and 𝛽 > 𝛼.

In many real-life contexts it is indeed important to identify such undefeated rules, but in some other contexts such conditions may be too restrictive, as they fail to include rules that are fundamental in proving a given claim. Consider indeed theory 𝐷 of Example 2.7: 𝐷 ⊬ +𝜕PA𝛼2 according to Definition 2.8, as 𝛼2 is defeated by 𝛽2, even though without 𝛼2 it is not possible to prove +𝜕𝑙, since no other rule defeats 𝛽1. Consequently, our next definition will identify which rules are essential in proving a given claim, and we will call them effectively active rules.

Definition 2.10 (Effectively Active).
+𝜕EA𝛼: If 𝑃(𝑛 + 1) = +𝜕EA𝛼 then
  (1) 𝛼 is applicable at 𝑃(1..𝑛),
  (2) +𝜕𝑋𝐶(𝛼) ∈ 𝑃(1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}, and
  (3) 𝐷 \ {𝛼} ⊢ −𝜕𝑋𝐶(𝛼).

This definition includes the concept of sine qua non: as per Condition (3), 𝛼 is indeed a requisite in proving its conclusion 𝑙, as the theory without 𝛼 cannot prove 𝑙. Note that the contrapositive version of Condition (3), (3′) 𝐷 \ {𝜁 ∈ 𝑅[𝐶(𝛼)] | 𝜁 ≠ 𝛼} ⊢ +𝜕𝑙, would give us slightly different results³.

³ In both Conditions (3) and (3′), we used the notational simplification 𝐷 \ Γ to denote the revision process of removing Γ from 𝑅 and adjusting the superiority relation accordingly.

Definition 2.11 (Effectively Inactive).
−𝜕EA𝛼: If 𝑃(𝑛 + 1) = −𝜕EA𝛼 then either
  (1) 𝛼 is discarded at 𝑃(1..𝑛), or
  (2) −𝜕𝑋𝐶(𝛼) ∈ 𝑃(1..𝑛), 𝑋 ∈ {𝑛𝑢𝑙𝑙, O, P}, or
  (3) 𝐷 \ {𝛼} ⊢ +𝜕𝑋𝐶(𝛼).

We end this section by providing an example that illustrates a proof in our logical apparatus. The explanation hereafter is not complete, as the point is to explain every key notion without slavishly describing the whole procedure.

Example 2.12. Let 𝐷 = (𝐹 = {𝑎, 𝑏, 𝑐, 𝑑}, 𝑅, > = {(𝛼, 𝛾), (𝜒, 𝜁), (𝜂, 𝜇)}) be the deontic defeasible theory such that

𝑅 = { 𝛼 : 𝑎 ⇒ 𝑒        𝛽 : 𝑏 ⇒ 𝑒            𝛾 : 𝑐 ⇒ ∼𝑒       𝜑 : 𝑑 ⇒ 𝑞
      𝜈 : 𝑐 ⇒ ∼𝑧       𝜎 : 𝑎, 𝑏 ⇒ 𝑤         𝜌 : ∼𝑎 ⇒ 𝑓
      𝜒 : 𝑒 ⇒O ∼𝑞      𝜇 : 𝑑, 𝑞 ⇒O ∼𝑤 ⊗ 𝑠   𝜁 : 𝑎, 𝑑 ⇒O 𝑞
      𝜂 : 𝑎, active(𝛼), violated(𝜒) ⇒O 𝑧 ⊗ 𝑤 ⊗ 𝑣 }.

As 𝑎 is a fact, we prove +Δ𝑎 and, in cascade, +𝜕𝑎 as well as −𝜕∼𝑎. (The same applies for the other three literals in the set of facts.) This makes 𝛼 applicable, as all its antecedents are proved, and 𝜌 discarded, as its sole antecedent is refuted. By the same reasoning, 𝛽, 𝛾, 𝜑, 𝜈, 𝜎 are all applicable, while 𝜒, 𝜇, 𝜁, and 𝜂 are applicable at index 1. Since 𝛼 > 𝛾 and since no other applicable rule for 𝑒 is stronger than 𝛾, we have that 𝛼 is effectively active (𝐷 ⊢ +𝜕EA𝛼): in fact, the theory without 𝛼 would not prove +𝜕𝑒. This is not the case for 𝛽, but we can state that 𝛽 is provisionally active, as no applicable rule for ∼𝑒 is stronger than 𝛽. Symmetrically, 𝛾 is provisionally inactive. Both 𝜒 and 𝜁 are applicable at index 1, and, since 𝜒 > 𝜁, we conclude 𝐷 ⊢ +𝜕O∼𝑞. This makes 𝜁 vacuously complied with. Given that the theory also proves +𝜕𝑞 and 𝜒's ⊗-chain consists of only one element, we conclude that 𝜒 is not just violated but also
not complied with (𝐷 ⊢ +⊥𝜒 and 𝐷 ⊢ −⊤𝜒). This, in turn, makes 𝜂 applicable at index 1, and since there are no obligation rules for ∼𝑧, we conclude 𝐷 ⊢ +𝜕O𝑧. Again, as the theory proves +𝜕∼𝑧, then 𝜂 is violated, 𝐷 ⊢ +⊥𝜂. This time the ⊗-chain has other elements; consequently, 𝜂 is now applicable at index 2. The second element of 𝜂's ⊗-chain is 𝑤: 𝜇, the obligation rule for ∼𝑤, is applicable but weaker than 𝜂. Lastly, as 𝐷 ⊢ +𝜕𝑤 through 𝜎, we can conclude that 𝜂 is weakly complied with (𝐷 ⊢ +⊤𝜂). In this theory, no norm was strongly complied with.

3 ALGORITHMS
The algorithms presented in this section, given a deontic defeasible theory as input, compute: (1) the defeasible extension of the theory, (2) which rules are applicable/discarded, (3) which rules are effectively active/inactive, (4) which norms are (not) complied with, and lastly (5) which norms are (never) violated.

Algorithm 1: Compliance
Input: A deontic defeasible theory 𝐷
Output: The defeasible meta-extension 𝐸(𝐷)
1  ±𝜕□ ← ∅, with □ ∈ {C, O, P};
2  ±⊤ ← ∅; ±⊥ ← ∅; ±𝜕A ← ∅; ±𝜕EA ← ∅;
3  InitialiseHerbrandBase(𝐻𝐵);
4  for □𝑙 ∈ Lit ∪ ModLit do 𝑅□[𝑙]_infd ← ∅; // with □ ∈ {C, O, P}
5  for 𝛼 ∈ 𝑅O do initialise 𝛼[𝑘][2] to null, 𝑘 = length(𝐶(𝛼));
6  for 𝑙 ∈ 𝐹 do Prove(𝑙, C); Refute(∼𝑙, C);
7  repeat
8    ±𝜕□′ ← ∅;
9    for □𝑙 ∈ 𝐻𝐵, □ ∈ {C, O, P} do
10     if 𝑅□[𝑙] = ∅ then Refute(𝑙, □);
11     if ∃𝛼 ∈ +𝜕A ∩ 𝑅C[𝑙] then // 𝑙 is a non-deontic literal
12       𝑅C[∼𝑙]_infd ← 𝑅C[∼𝑙]_infd ∪ {𝛽 ∈ 𝑅C[∼𝑙] | 𝛼 > 𝛽};
13       if {𝛽 ∈ 𝑅C[∼𝑙] | 𝛽 > 𝛼} = ∅ then
14         Refute(∼𝑙, C);
15         if 𝑅[∼𝑙] \ 𝑅[∼𝑙]_infd = ∅ then
16           Prove(𝑙, C);
17           Active;
18         end
19       end
20     end
21     if ∃𝛼 ∈ +𝜕A ∩ 𝑅O[𝑙, 𝑖] ∧ ∀𝑗 < 𝑖. (𝛼[𝑗][1] = + ∧ 𝛼[𝑗][2] = −) then
22       𝑅O[∼𝑙]_infd ← 𝑅O[∼𝑙]_infd ∪ {𝛽 ∈ 𝑅𝑋[∼𝑙] | 𝛼 > 𝛽}; // with 𝑋 ∈ {O, P}
23       if {𝛽 ∈ 𝑅[∼𝑙] | 𝛽 > 𝛼} = ∅ then
24         Refute(∼𝑙, □);
25         if (𝑅O[∼𝑙] ∪ 𝑅P[∼𝑙]) \ 𝑅[∼𝑙]_infd = ∅ then
26           Prove(𝑙, O);
27           Prove(𝑙, P);
28           Refute(∼𝑙, P);
29           Active;
30         end
31       end
32     end
33     if ∃𝛼 ∈ +𝜕A ∩ 𝑅P[𝑙] then
34       𝑅P[∼𝑙]_infd ← 𝑅P[∼𝑙]_infd ∪ {𝛽 ∈ 𝑅𝑋[∼𝑙] | 𝛼 > 𝛽}; // with 𝑋 ∈ {O, P}
35       if {𝛽 ∈ 𝑅[∼𝑙] | 𝛽 > 𝛼} = ∅ then
36         Refute(∼𝑙, P);
37         Refute(∼𝑙, O);
38         if (𝑅P[∼𝑙] ∪ 𝑅O[∼𝑙]) \ 𝑅[∼𝑙]_infd = ∅ then
39           Prove(𝑙, P);
40           Active;
41         end
42       end
43     end
44   end
45   ±𝜕□ ← ±𝜕□ ∪ ±𝜕□′;
46 until +𝜕□′ = ∅ and −𝜕□′ = ∅;
47 return 𝐸(𝐷) = (±𝜕C, ±𝜕O, ±𝜕P, ±⊤, ±⊥, ±𝜕A, ±𝜕EA)

The extension of a defeasible theory is, in a sketch, all that the theory can prove and refute (disprove). Typically, a defeasible extension is limited to the (deontic) literals of the theory itself, but as in this case we are interested in understanding which rules are applicable and which norms are (effectively) active, complied with, or violated, we shall extend the standard definition to include such nuances. In the algorithms, we associate the proof tag +𝜕A with applicable rules, and −𝜕A with discarded rules (thus 𝐷 ⊢ ±𝜕A𝛼 for a theory 𝐷 and a rule 𝛼).

Note that the algorithms do not compute the proof conditions for Definitions 2.5–2.9 for space reasons, as such an addition is straightforward and does not result in any increase of the complexity of the algorithms themselves. Also for space reasons, the algorithms presented in this work do not compute the strict part of the extension: to include that is a mundane task, as all the defeasibility checks are not taken into consideration.

Given a deontic defeasible theory 𝐷, 𝐻𝐵𝐷 is the set of literals such that the literal or its complement appears in 𝐷, where 'appears' means that it is a sub-formula of a literal occurring in the theory. The deontic Herbrand Base of 𝐷 is 𝐻𝐵 = {𝑋𝑙 | 𝑙 ∈ 𝐻𝐵𝐷 ∧ 𝑋 ∈ {O, P}}. Note that we do not consider reference literals in the Herbrand Base. Accordingly, the extension of a deontic defeasible theory is defined as follows.

Definition 3.1. Given a deontic defeasible theory 𝐷 = (𝐹, 𝑅, >), we say that the extension is 𝐸(𝐷) = (±𝜕C, ±𝜕O, ±𝜕P, ±𝜕A, ±𝜕EA, ±⊤, ±⊥), where ±𝜕□ = {𝑙 ∈ 𝐻𝐵𝐷 : 𝐷 ⊢ ±𝜕□𝑙} with □ ∈ {C, O, P}, ±𝜕𝑌 = {𝛼 ∈ 𝑅 | 𝐷 ⊢ ±𝜕𝑌𝛼} with 𝑌 ∈ {A, EA}, ±⊤ = {𝛼 ∈ 𝑅 | 𝐷 ⊢ ±⊤𝛼}, and ±⊥ = {𝛼 ∈ 𝑅 | 𝐷 ⊢ ±⊥𝛼}.

We say that two theories 𝐷 and 𝐷′ are equivalent iff 𝐸(𝐷) = 𝐸(𝐷′) (i.e., they have the same extension).

The next definition extends the concept of complement presented in Section 2, and its sole purpose is to ease the notation of the algorithms by establishing the logical connection among proved and refuted literals.

Definition 3.2. We define the complement 𝑋𝑙̃ of literal 𝑋𝑙 as
• Trivially, if 𝑋 = C, then 𝑋𝑙̃ = {∼𝑙}.
• O𝑙̃ = {O∼𝑙, ¬O𝑙, P∼𝑙}.
• P𝑙̃ = {¬P𝑙, O∼𝑙}.

A few important comments for the reader before presenting the algorithms.
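Since the discussion keeps returning to Example 2.12, it may help to see its constitutive fragment in executable form. The encoding below, including the simplified team-defeat check, is our illustrative sketch only: it deliberately ignores the deontic rules, the ⊗-chains, and the reference literals.

```python
# Constitutive rules of Example 2.12: name -> (antecedent, conclusion).
rules = {
    "alpha": ({"a"}, "e"),   "beta":  ({"b"}, "e"),
    "gamma": ({"c"}, "~e"),  "phi":   ({"d"}, "q"),
    "nu":    ({"c"}, "~z"),  "sigma": ({"a", "b"}, "w"),
    "rho":   ({"~a"}, "f"),
}
superiority = {("alpha", "gamma")}  # alpha > gamma; the deontic pairs are omitted
facts = {"a", "b", "c", "d"}

def neg(l):
    return l[1:] if l.startswith("~") else "~" + l

def applicable(name):
    antecedent, _ = rules[name]
    return antecedent <= facts  # all premises are (provable) facts

def defeasibly_proved(lit):
    # lit is provable when it has applicable support and every applicable
    # opposite rule is defeated by some applicable supporting rule.
    support = [n for n, (_, c) in rules.items() if c == lit and applicable(n)]
    against = [n for n, (_, c) in rules.items() if c == neg(lit) and applicable(n)]
    return bool(support) and all(
        any((s, b) in superiority for s in support) for b in against)
```

As in the example, 𝛼 is applicable and 𝛼 > 𝛾, hence +𝜕𝑒 holds, while 𝜌 is discarded because its sole antecedent ∼𝑎 is refuted.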
The algorithms presented here determine the extension of a deontic defeasible theory by computing, at each iteration step, a simpler theory than the one at the previous step. By simpler, we mean that, by proving and refuting literals and standard rules, we can progressively simplify the rules of the theory itself: (i) by progressively eliminating elements from the antecedents of rules, and (ii) by eliminating rules that we know are either discarded or defeated.

Important for this goal is to note that, trivially, a rule with an empty antecedent is vacuously applicable. We thus want to achieve, for any rule, the status where the rule's antecedent is empty (as this will simplify solving the superiorities), and we will do so by progressively eliminating elements from the antecedent as soon as they satisfy the proper condition in Definitions 2.1 and 2.3.

Symmetrically, when a rule is discarded, it can no longer support its conclusion, nor reject the opposite. Therefore, as soon as an element satisfies the proper condition of Definitions 2.2 and 2.3, we can eliminate the corresponding rule from the set of rules.

We begin by populating the Herbrand Base: for every (deontic) literal we create the support set 𝑅_infd, and for every obligation rule a 2-dimensional array which will simplify checking the conditions of applicability and compliance.

Let us consider the theory proposed in Example 2.12. As 𝑎 is in the set of facts, at the first iteration the for loop at Line 6 invokes Procedure Prove to prove it as constitutive. There, it is added to the support set +𝜕C (Line 1). We then eliminate 𝑎 from the antecedent of 𝛼, which is now empty, and so 𝛼 is applicable (Lines 7 and 8). As 𝑎 does not appear in any ⊗-chain, Procedure Prove terminates, and the main algorithm invokes Refute on ∼𝑎.

The set of defeasibly refuted constitutive literals is updated with ∼𝑎, and 𝜌 is discarded as its antecedent contains ∼𝑎; 𝜌 is also effectively inactive. The idea of these simplifications is taken from [5, 7].

The algorithm now enters the main Repeat-Until cycle at Lines 7–46. For every literal 𝑙 in 𝐻𝐵, depending on which type of conclusion it is (□), we first verify whether there is any rule supporting it and, if not, we refute it (Line 10). Otherwise, if there exists an applicable rule 𝛼 supporting it (ifs at Lines 11 for C, 21 for O, and 33 for P), we update the set of defeated rules supporting the opposite conclusion, 𝑅□[∼𝑙]_infd (Lines 12, 22, 34): (i) in case of obligations, both obligation and permission opposite rules (Condition (4) of +𝜕O); (ii) in case of permissions, only obligation rules (Condition (3.2) of +𝜕P). Given that 𝑅[∼𝑙] contains all the opposite rules, and given that we have just verified that 𝛼 for 𝑙 is applicable, we store in 𝑅[∼𝑙]_infd all those rules defeated by 𝛼. The next step is to verify whether there actually exists any rule supporting ∼𝑙 stronger than 𝛼: if not, ∼𝑙 can be refuted (Lines 13, 23, 35).

The idea behind the ifs at Lines 15–18, 25–30, and 38–41 is: if 𝐷 ⊢ +𝜕□𝑙, eventually the repeat-until cycle will have added to 𝑅□[∼𝑙]_infd enough rules to defeat all opposite supports. When that is the case, we invoke Prove on 𝑙, □, and Refute on ∼𝑙, □ (but not in case of permission, as it is legal to have P𝑙 and P∼𝑙 at the same time).

When something is proved, the algorithm verifies whether 𝛼 is effectively active via Procedure Active. Such a procedure controls whether there exists an applicable, opposite rule that is not defeated by any other applicable rule (for □𝑙) than 𝛼.

Procedure Prove
Input: 𝑙 ∈ Lit, □ ∈ {C, O, P}
1  +𝜕□ ← +𝜕□ ∪ {𝑙};
2  𝐻𝐵 ← 𝐻𝐵 \ ({□𝑙} ∪ □𝑙̃);
3  −𝜕𝑌 ← −𝜕𝑌 ∪ {𝜁 ∈ 𝑅 | □𝑙̃ ∈ 𝐴(𝜁)}; // with 𝑌 ∈ {A, EA}
4  > ← > \ {(𝜁,𝜓), (𝜓,𝜁) ∈ > | □𝑙̃ ⊆ 𝐴(𝜁)};
5  switch □ do
6    case □ = C do
7      +𝜕A ← +𝜕A ∪ {𝜁 ∈ 𝑅 | 𝐴(𝜁) \ {𝑙} = ∅};
8      𝑅 ← {𝐴(𝜁) \ {𝑙} ↪ 𝐶(𝜁) | 𝜁 ∈ 𝑅};
9      for 𝜁 ∈ 𝑅O[𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 = length(𝐶(𝜁))
10       𝜁[𝑛][2] ← +;
11       if 𝑛 = 1 ∧ 𝜁[1][1] = + then
12         −⊥ ← −⊥ ∪ {𝜁};
13         𝑅 ← {𝐴(𝜙) \ {¬violated(𝜁)} ↪□ 𝐶(𝜙) | 𝜙 ∈ 𝑅} \ {𝜙 ∈ 𝑅 | violated(𝜁) ∈ 𝐴(𝜙)};
14       else if 𝜁 ∈ +𝜕A ∧ 𝜁[𝑛][1] = + ∧ (∀𝑗 < 𝑛. 𝜁[𝑗][1] = + ∧ 𝜁[𝑗][2] = −) then
15         +⊤ ← +⊤ ∪ {𝜁}; +⊥ ← +⊥ ∪ {𝜁};
16         𝑅 ← {𝐴(𝜙) \ {complied(𝜁)} ↪□ 𝐶(𝜙) | 𝜙 ∈ 𝑅} \ {𝜙 ∈ 𝑅 | ¬complied(𝜁) ∈ 𝐴(𝜙)};
17         𝑅 ← {𝐴(𝜙) \ {violated(𝜁)} ↪□ 𝐶(𝜙) | 𝜙 ∈ 𝑅} \ {𝜙 ∈ 𝑅 | ¬violated(𝜁) ∈ 𝐴(𝜙)};
18       end
19     end
20   end
21   case □ = O do
22     +𝜕A ← +𝜕A ∪ {𝜁 ∈ 𝑅 | 𝐴(𝜁) \ {O𝑙, ¬O∼𝑙, ¬P∼𝑙} = ∅};
23     𝑅 ← {𝐴(𝜁) \ {O𝑙, ¬O∼𝑙, ¬P∼𝑙} ↪□ 𝐶(𝜁) | 𝜁 ∈ 𝑅} \ {𝜁 ∈ 𝑅 | O𝑙̃ ⊆ 𝐴(𝜁)};
24     for 𝜁 ∈ 𝑅O[𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 = length(𝐶(𝜁))
25       𝜁[𝑛][1] ← +;
26       if 𝑛 = 1 ∧ 𝜁[1][2] = + then
27         −⊥ ← −⊥ ∪ {𝜁};
28         𝑅 ← {𝐴(𝜙) \ {¬violated(𝜁)} ↪□ 𝐶(𝜙) | 𝜙 ∈ 𝑅} \ {𝜙 ∈ 𝑅 | violated(𝜁) ∈ 𝐴(𝜙)};
29       else if 𝜁 ∈ +𝜕A ∧ 𝜁[𝑛][2] = + ∧ (∀𝑗 < 𝑛. 𝜁[𝑗][1] = + ∧ 𝜁[𝑗][2] = −) then
30         +⊤ ← +⊤ ∪ {𝜁}; +⊥ ← +⊥ ∪ {𝜁};
31         𝑅 ← {𝐴(𝜙) \ {complied(𝜁)} ↪□ 𝐶(𝜙) | 𝜙 ∈ 𝑅} \ {𝜙 ∈ 𝑅 | ¬complied(𝜁) ∈ 𝐴(𝜙)};
32         𝑅 ← {𝐴(𝜙) \ {violated(𝜁)} ↪□ 𝐶(𝜙) | 𝜙 ∈ 𝑅} \ {𝜙 ∈ 𝑅 | ¬violated(𝜁) ∈ 𝐴(𝜙)};
33       end
34     end
35   end
36   case □ = P do
37     +𝜕A ← +𝜕A ∪ {𝜁 ∈ 𝑅 | 𝐴(𝜁) \ {P𝑙, ¬O∼𝑙} = ∅};
38     𝑅 ← {𝐴(𝜁) \ {P𝑙, ¬O∼𝑙} | 𝜁 ∈ 𝑅} \ {𝜁 ∈ 𝑅 | P𝑙̃ ⊆ 𝐴(𝜁)};
39   end
40 end
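The two simplification moves just described, and used throughout Prove and Refute, can be pictured with plain set operations. The mutable rule encoding below is our own assumption, a sketch rather than the paper's data structures:

```python
def on_proved(lit, rules, applicable):
    """When lit is proved, drop it from every antecedent; a rule whose
    antecedent becomes empty is vacuously applicable (cf. Prove, Lines 7-8)."""
    for name, (antecedent, _conclusion) in rules.items():
        antecedent.discard(lit)
        if not antecedent:
            applicable.add(name)

def on_refuted(lit, rules):
    """When lit is refuted, every rule mentioning it in the antecedent is
    discarded and can be deleted outright (cf. Refute, Line 4)."""
    for name in [n for n, (antecedent, _) in rules.items() if lit in antecedent]:
        del rules[name]
```

On Example 2.12, proving 𝑎 empties 𝛼's antecedent, while refuting ∼𝑎 deletes 𝜌, exactly as in the trace above.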
Procedure Refute
Input: 𝑙 ∈ Lit, □ ∈ {C, O, P}
1  −𝜕□ ← −𝜕□ ∪ {𝑙};
2  𝐻𝐵 ← 𝐻𝐵 \ {□𝑙};
3  −𝜕𝑌 ← −𝜕𝑌 ∪ {𝜁 ∈ 𝑅 | □𝑙 ∈ 𝐴(𝜁)}; // with 𝑌 ∈ {A, EA}
4  𝑅 ← 𝑅 \ {𝜁 ∈ 𝑅 | □𝑙 ∈ 𝐴(𝜁)};
5  > ← > \ {(𝜁,𝜓), (𝜓,𝜁) ∈ > | □𝑙 ∈ 𝐴(𝜁)};
6  if □ = C then
7    for 𝜓 ∈ 𝑅O[𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 = length(𝐶(𝜓))
8      𝜓[𝑛][2] ← −;
9      if 𝜓 ∈ +𝜕A ∧ 𝑛 = length(𝐶(𝜓)) ∧ (∀𝑗 < 𝑛. 𝜓[𝑗][1] = + ∧ 𝜓[𝑗][2] = −) then
10       −⊤ ← −⊤ ∪ {𝜓};
11       𝑅 ← {𝐴(𝜇) \ {¬complied(𝜓)} ↪□ 𝐶(𝜇) | 𝜇 ∈ 𝑅} \ {𝜇 ∈ 𝑅 | complied(𝜓) ∈ 𝐴(𝜇)};
12     end
13   end
14 else if □ = O then
15   for 𝜓 ∈ 𝑅O[𝑙, 𝑛] do // for 𝑛 ≤ 𝑘 = length(𝐶(𝜓))
16     if 𝜓 ∈ +𝜕A ∧ 𝑛 = 1 then
17       𝑅 ← {𝐴(𝜇) \ {complied(𝜓), ¬violated(𝜓)} ↪□ 𝐶(𝜇) | 𝜇 ∈ 𝑅} \ ({𝜙 ∈ 𝑅 | {¬complied(𝜓), violated(𝜓)} ∩ 𝐴(𝜙) ≠ ∅} ∪ {𝜓});
18       +⊤ ← +⊤ ∪ {𝜓}; −⊥ ← −⊥ ∪ {𝜓};
19     else
20       𝜓[𝑛][1] ← −;
21     end
22   end
23 end
24 if □ ∈ {O, P} then
25   𝑅 ← {𝐴(𝜁) \ {¬□𝑙} | 𝜁 ∈ 𝑅};
26 end

Procedure Active
1 if {𝛽 ∈ 𝑅[∼𝑙]_infd | 𝛼 > 𝛽} \ {𝜓 ∈ 𝑅[∼𝑙]_infd | 𝜁 > 𝜓 ∧ 𝜁 ≠ 𝛼 ∧ 𝜁 ∈ +𝜕A ∩ 𝑅[𝑙]} ≠ ∅ then
2   +𝜕EA ← +𝜕EA ∪ {𝛼};
3   𝑅 ← {𝐴(𝜁) \ {active(𝛼)} ↪□ 𝐶(𝜁) | 𝜁 ∈ 𝑅} \ {𝜁 ∈ 𝑅 | ¬active(𝛼) ∈ 𝐴(𝜁)};
4 else
5   −𝜕EA ← −𝜕EA ∪ {𝛼};
6   𝑅 ← {𝐴(𝜁) \ {¬active(𝛼)} ↪□ 𝐶(𝜁) | 𝜁 ∈ 𝑅} \ {𝜁 ∈ 𝑅 | active(𝛼) ∈ 𝐴(𝜁)};
7 end

We conclude the analysis of the algorithms by seeing in more detail how compliance and violations are verified, and we do so by continuing the analysis of Example 2.12.

At a certain iteration, the algorithm will prove +𝜕C𝑤, but assume that so far we have proven neither +𝜕O𝑧 nor +𝜕O𝑤. Procedure Prove will then save '+' in 𝜂[2][2] at Line 10, but nothing can be determined yet on compliance or violation. Later on, it proves −𝜕C𝑧 (and 𝜂[1][2] = −, Line 8 of Refute), and finally +𝜕O𝑧. This implies that 𝜂[1][1] = + at Line 25 of Prove. Once the algorithm proves +𝜕O𝑤, the else if test at Line 29 will succeed, and the algorithm correctly establishes that 𝜂 is complied with (since 𝜂[2][1,2] = +), but also violated (since 𝜂[1][1] = + but 𝜂[1][2] = −).

3.1 Computational properties
We discuss the computational properties of Algorithm 1 Compliance. Due to space reasons, we only sketch the proofs, providing the reader with the reasons why our algorithms are sound, complete, and terminate (but we leave out all the technical details).

In order to discuss termination and computational complexity, we start by defining the size Σ(𝐷) of a meta-theory 𝐷 as the number of occurrences of literals, plus the number of occurrences of rules, plus 1 for every tuple in the superiority relation.

Note that, by implementing hash tables with pointers to the rules where a given literal occurs, each rule can be accessed in constant time. We also implement hash tables for the tuples of the superiority relation where a given rule appears as either of the two elements, and thus even those can be accessed in constant time.

Theorem 3.3. Algorithm 1 Compliance terminates and its complexity is 𝑂(Σ⁴).

Proof. Termination of Procedures Prove, Refute, and Active is straightforward, as the size of the input theory is finite and, at every step, we modify finite sets. The complexity of Prove is 𝑂(Σ²), the complexity of Refute is 𝑂(Σ³) (two inner for loops of 𝑂(Σ) each), and, lastly, the complexity of Active is 𝑂(Σ).

Termination of Algorithm 1 Compliance is bound to the termination of the repeat-until cycle at Lines 7–46, as all other cycles loop over finite sets of elements of the order of 𝑂(Σ). Given that 𝐻𝐵 and 𝑅 are finite, and since every time a literal is proved/refuted it is removed from the corresponding set, the algorithm eventually empties such a set, and, at the next iteration, no modification to the extension can be made. This proves the termination of Algorithm 1 Compliance.

Regarding its complexity, note that: (i) all set modifications are made in linear time, and (ii) the aforementioned repeat-until cycle is iterated at most 𝑂(Σ) times, and so are the two for loops at Lines 9–44. This would suggest that the repeat-until cycle runs in 𝑂(Σ²). A more discerning analysis shows that the complexity is actually 𝑂(Σ): the complexity of the for cycle cannot be considered separately from the complexity of the external repeat-until loop, as the two are strictly interdependent. Indeed, the overall number of operations made by the sum of all loop iterations cannot outrun the number of occurrences of the literals or rules (𝑂(Σ) + 𝑂(Σ)), because the operations in the inner cycles directly decrease, iteration after iteration, the number of the remaining repetitions of the outermost loop, and the other way around. This sets the overall complexity of Algorithm 1 Compliance to 𝑂(Σ⁴). □

Theorem 3.4. Algorithm 1 Compliance is sound and complete:
(1) 𝐷 ⊢ +𝜕𝑋𝑝 iff 𝑝 ∈ +𝜕𝑋 of 𝐸(𝐷), 𝑋 ∈ {C, O, P}, 𝑝 ∈ Lit
(2) 𝐷 ⊢ +𝜕𝑌𝛼 iff 𝛼 ∈ +𝜕𝑌 of 𝐸(𝐷), 𝑌 ∈ {A, EA}, 𝛼 ∈ Lab
(3) 𝐷 ⊢ +⊤𝛼 iff 𝛼 ∈ +⊤ of 𝐸(𝐷), 𝛼 ∈ Lab
(4) 𝐷 ⊢ +⊥𝛼 iff 𝛼 ∈ +⊥ of 𝐸(𝐷), 𝛼 ∈ Lab
(5) 𝐷 ⊢ −𝜕𝑋𝑝 iff 𝑝 ∈ −𝜕𝑋 of 𝐸(𝐷), 𝑋 ∈ {C, O, P}, 𝑝 ∈ Lit
(6) 𝐷 ⊢ −𝜕𝑌𝛼 iff 𝛼 ∈ −𝜕𝑌 of 𝐸(𝐷), 𝑌 ∈ {A, EA}, 𝛼 ∈ Lab
(7) 𝐷 ⊢ −⊤𝛼 iff 𝛼 ∈ −⊤ of 𝐸(𝐷), 𝛼 ∈ Lab
(8) 𝐷 ⊢ −⊥𝛼 iff 𝛼 ∈ −⊥ of 𝐸(𝐷), 𝛼 ∈ Lab.

Proof. The aim of Algorithm 1 Compliance is to compute a defeasible extension of the input theory through successive transformations on the set of facts, rules and the superiority relation. These transformations act in a way to obtain a simpler theory while retaining the same extension. By simpler theory we mean a theory with fewer symbols in it. For instance, given a theory 𝐷 such that 𝐷 ⊢ +𝜕O𝑝, we can remove O𝑝 from the antecedent of the rules, since such a deontic literal no longer plays any role in the rule, and we can delete all the rules where P¬𝑝 is in the antecedent, as such rules can no longer conclude literals (and we also know that the rule is no longer applicable); analogously, when 𝐷 ⊢ +⊤𝛼, we can remove the instances of 𝑐𝑜𝑚𝑝𝑙𝑖𝑒𝑑(𝛼) from the rules (and remove the rules where 𝑣𝑖𝑜𝑙𝑎𝑡𝑒𝑑(𝛼) is in the antecedent). The theory 𝐷′ obtained from 𝐷 through such operations is equivalent to 𝐷, with respect to the elements of the Herbrand base and labels still in the theory.

The proof that the above transformation produces theories equivalent to the original one is by induction on the length of derivations and contrapositive. □

4 CONCLUSIONS AND RELATED WORK
References are widespread in legal documents, and they have been a topic of intensive research in the field of AI and Law (e.g., citation networks, automated detection of citations, citation and reference navigation, . . . ). Despite their pervasive presence, the study of how to logically represent them with the aim of exploiting them for the digitalisation of legislation has been largely neglected. The OASIS LegalRuleML standard [11] introduces the terms "comply" and "violated" (accepting an argument pointing to a legal rule), but the development of a method to solve such references is well beyond the scope of the standard. The task of solving such references has been tackled by [10], adopting the technique of importing the content of the citation (with the appropriate semantic layer), but the approach is restricted to a shallow import (ignoring attacking rules).

The idea of employing terms denoting the legal status of provisions goes back, at least, to the seminal work by Sartor [13]. However, when the approach is used, the information about such terms is either given as part of the input of a case (and it is not determined by the other "facts" of the case and the rules), or addressed using the techniques exemplified by (5) and (4), not dealt with at the logic level, and focusing, at best, on the first reading of "applicable". Often, the use of "applicable" is to facilitate some form of non-monotonic reasoning, avoiding the adoption of a priority relation over rules.

In this paper, we started from the same idea of [13], but with a direct focus on handling the references, not on facilitating some form of defeasible reasoning. Accordingly, references are first-class citizens in our logic, with an equal treatment. This has allowed us to provide an efficient computational treatment of the references. The other major benefit of the approach we adopted is that it enables encodings of pieces of legislation to strictly adhere to the legal isomorphism principle [3], which facilitates the translation from provisions in natural language to their formal representation, and their maintenance when the provisions are amended.

References are a prominent feature in norms amending other norms. Norm change can be modelled by nested rules [4, 9], and [12] develops algorithms to compute the extension of a defeasible theory with nested norms. We plan to investigate how to integrate the techniques proposed in this work with the algorithms in [12] to implement the logics of [4, 9].

REFERENCES
[1] Grigoris Antoniou, David Billington, Guido Governatori, and Michael J. Maher. 2001. Representation results for defeasible logic. ACM Trans. Comput. Log. 2, 2 (2001), 255–287. https://doi.org/10.1145/371316.371517
[2] Grigoris Antoniou, David Billington, Guido Governatori, Michael J. Maher, and Andrew Rock. 2000. A Family of Defeasible Reasoning Logics and its Implementation. In ECAI 2000, Proceedings of the 14th European Conference on Artificial Intelligence, Berlin, Germany, August 20-25, 2000, Werner Horn (Ed.). IOS Press, 459–463.
[3] Trevor J. M. Bench-Capon and Frans Coenen. 1992. Isomorphism and legal knowledge based systems. Artif. Intell. Law 1, 1 (1992), 65–86. https://doi.org/10.1007/BF00118479
[4] Matteo Cristani, Francesco Olivieri, and Antonino Rotolo. 2017. Changes to temporary norms. In Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12-16, 2017, Jeroen Keppens and Guido Governatori (Eds.). ACM, 39–48. https://doi.org/10.1145/3086512.3086517
[5] Guido Governatori, Francesco Olivieri, Antonino Rotolo, and Simone Scannapieco. 2013. Computing Strong and Weak Permissions in Defeasible Logic. J. Philos. Log. 42, 6 (2013), 799–829. https://doi.org/10.1007/s10992-013-9295-1
[6] Guido Governatori, Francesco Olivieri, Simone Scannapieco, and Matteo Cristani. 2011. Designing for Compliance: Norms and Goals. In RuleML 2011-America (LNCS, Vol. 7018). Springer, 282–297. https://doi.org/10.1007/978-3-642-24908-2_29
[7] Guido Governatori, Francesco Olivieri, Simone Scannapieco, Antonino Rotolo, and Matteo Cristani. 2016. The rationale behind the concept of goal. Theory Pract. Log. Program. 16, 3 (2016), 296–324. https://doi.org/10.1017/S1471068416000053
[8] Guido Governatori and Antonino Rotolo. 2006. Logic of Violations: A Gentzen System for Reasoning with Contrary-To-Duty Obligations. Australasian Journal of Logic 4 (2006), 193–215. http://ojs.victoria.ac.nz/ajl/article/view/1780
[9] Guido Governatori and Antonino Rotolo. 2010. Changing legal systems: legal abrogations and annulments in Defeasible Logic. Logic Journal of IGPL 18, 1 (2010), 157–194.
[10] Ho-Pun Lam and Mustafa Hashmi. 2019. Enabling reasoning with LegalRuleML. Theory Pract. Log. Program. 19, 1 (2019), 1–26. https://doi.org/10.1017/S1471068418000339
[11] OASIS. 2017. LegalRuleML core specification version 1.0. Standard Specification. OASIS. http://docs.oasis-open.org/legalruleml/legalruleml-core-spec/v1.0/csprd02/legalruleml-core-spec-v1.0-csprd02.html
[12] Francesco Olivieri, Guido Governatori, Matteo Cristani, and Abdul Sattar. 2021. Computing Defeasible Meta-logic. In Logics in Artificial Intelligence - 17th European Conference, JELIA (LNCS, Vol. 12678), Wolfgang Faber, Gerhard Friedrich, Martin Gebser, and Michael Morak (Eds.). Springer, 69–84. https://doi.org/10.1007/978-3-030-75775-5_6
[13] Giovanni Sartor. 1991. The Structure of Norm Conditions and Nonmonotonic Reasoning in Law. In Proceedings of the Third International Conference on Artificial Intelligence and Law, ICAIL '91, Richard E. Susskind (Ed.). ACM, 155–164. https://doi.org/10.1145/112646.112665
[14] Giovanni Sartor. 2005. Legal Reasoning: A Cognitive Approach to the Law. Springer.
Context-Aware Legal Citation Recommendation using Deep Learning

ABSTRACT
Lawyers and judges spend a large amount of time researching the proper legal authority to cite while drafting decisions. In this paper, we develop a citation recommendation tool that can help improve efficiency in the process of opinion drafting. We train four types of machine learning models, including a citation-list based method (collaborative filtering) and three context-based methods (text similarity, BiLSTM and RoBERTa classifiers). Our experiments show that leveraging local textual context improves recommendation, and that deep neural models achieve decent performance. We show that non-deep text-based methods benefit from access to structured case metadata, but deep models only benefit from such access when predicting from context of insufficient length. We also find that, even after extensive training, RoBERTa does not outperform a recurrent neural model, despite its benefits of pretraining. Our behavior analysis of the RoBERTa model further shows that predictive performance is stable across time and citation classes.

CCS CONCEPTS
• Applied computing → Law; Document analysis; • Information systems → Data mining; Recommender systems; • Computing methodologies → Natural language processing.

KEYWORDS
citation recommendation, citation normalization, legal text, legal opinion drafting, neural natural language processing

ACM Reference Format:
Zihan Huang, Charles Low, Mengqiu Teng, Hongyi Zhang, Daniel E. Ho, Mark S. Krass, and Matthias Grabmair. 2021. Context-Aware Legal Citation Recommendation using Deep Learning. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466066

∗ Authors contributed equally to the paper.
† Corresponding author (matthias.grabmair@tum.de). Current affiliation at TUM; work largely conducted while employed at SINC as part of adjunct affiliation with Carnegie Mellon University, Language Technologies Institute.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466066

1 INTRODUCTION
Government agencies adjudicate large volumes of cases, posing well-known challenges for the accuracy, consistency, and fairness of decisions [2, 27]. One of the prototypical mass adjudicatory agencies in the U.S. context is the Board of Veterans’ Appeals (BVA), which makes decisions on over fifty thousand appeals for disabled veteran benefits annually. Due to these case volumes and constrained resources, the BVA suffers from both a large backlog of cases and large error rates in decisions. Roughly 15% of (single-issue) cases are appealed and around 72% of appealed cases are reversed or remanded by a higher court [14]. These challenges are typical for agencies like the Social Security Administration, the Office of Medicare Hearings and Appeals, and the immigration courts, which adjudicate far more cases than all federal courts combined. Lawyers and judges are hence in great need of tools that can help them reduce the cost of legal research as they draft decisions to improve the quality and efficiency of the adjudication process.
Advancing the application of machine learning to suggesting legal citations is essential to the broader effort to use AI to assist lawyers. Citations are a critical component of legal text in common-law countries. To show that a proposition is supported by law, legal writers cite to statutes passed by a legislature; to regulations written by agencies implementing statutes; and to cases applying legal authorities in a particular context. Such is the importance of citations to legal writing that the traditional method of selecting law students to edit law journals has been a gruelling test on the correct format of legal citations [30]. Achieving performance on more difficult tasks, like text generation and summarization, depends on a sophisticated treatment of citations.
This paper reports on experiments evaluating a series of machine learning tools for recommending legal citations in judicial opinions. We show that deep learning models beat ordinary machine learning tools at recommending legal citations on a variety of metrics, which suggests that the neural models have a stronger capability to exploit semantics to understand which citation is the most appropriate.
We also demonstrate the importance of context in predicting legal citations. For ordinary text-based machine learning models with limited capacity for detecting semantic meaning, structured contextual metadata improves virtually all predictions. For deep learning models, the utility of structured metadata emerges only
in sufficiently difficult settings where there may be a weaker semantic link between the input and the target, and only for certain models. Still, this result shows the potential importance of context to citation predictions. Deep learning models that are able to better incorporate contextual cues from semantic inputs are likely to outperform methods without such capabilities.
Because the BVA corpus has never been made available to the research community, we are releasing the text for single-issue decisions, with legal citation tokenization, case metadata, and our source code upon publication at: https://github.com/TUMLegalTech/bva-citation-prediction. We believe many other advances can be built on this as a benchmark for natural language processing in law.

2 RELATED WORK
2.1 Citation Recommendation
Citation recommendation is a well-studied problem in the domain of academic research paper recommendation, as researchers seek help to navigate vast literatures in their fields. Many of the approaches are transferable to the legal context. They can be broadly categorized into citation-list based methods, which characterize a query document by the incomplete set of citations it contains and provide a global recommendation of citations relevant to the entire document, and context-based methods, which take a particular segment of text from the query document and provide a local recommendation that is relevant to that specific context [26].

2.1.1 Citation-List Based Methods. In this setting, the researcher is drafting a paper and has an incomplete set of citations on hand, and seeks to find additional relevant papers.
An early approach in [28] applies collaborative filtering to this task. There, citing papers are “users” and citations are “items.” Given a new user, the algorithm locates existing users with similar preferences to the new user, and recommends items popular among the existing users. Matrix factorization methods project the sparse, high-dimensional user-item adjacency matrix onto a low-dimensional latent space and compare similarity in this latent space. For example, [4] uses Singular Value Decomposition to find the latent space, and finds performance gains over ordinary collaborative filtering.
Graph-based approaches treat research papers as nodes and citations as edges (directed or undirected), and use graph-based measures of relevance to find relevant nodes to an input set corresponding to the researcher’s incomplete set of citations. Examples include the Katz measure [23], PageRank [31] and SimRank [19]. [12] applies a topic-sensitive version of the PageRank algorithm by up-weighting papers in the incomplete set. [37] finds the Katz measure of node proximity to be a significant feature.
The citation-list based approach has its drawbacks. First, it puts the burden of creating a partial list of citations on the user. Attorneys who are new to veterans’ law would face the well-known “cold-start” problem, where they have difficulty generating enough citations as input to receive quality recommendations. Second, attorneys drafting an opinion may be more interested in local recommendations relevant to their current section of work rather than global recommendations that are generally relevant to the entire case. Third, citation-list based approaches do not exploit the rich information contained in the textual context of each citation.

2.1.2 Context-Based Methods. In this setting, the researcher inputs a span of text (the query context), which can be a particular sentence or paragraph, instead of a list of citations. The system recommends local citations relevant to this query context.
Traditional information retrieval approaches directly compare the words in the query context to the words in the title, abstract, or full text of each cited document, and apply scoring models such as Okapi BM25 [1] or Indri [38] to arrive at a similarity score that is used to rank documents. However, as [33] observes, the full text of cited documents is often noisy and may not contain words similar to those used to describe the document as a whole. This problem is especially pertinent in law. Legal decisions and statutes sometimes lack informative titles, and the key legal implications are often buried in a mountain of other factual or procedural details.
Intuitively, we expect the span of text preceding or surrounding a citation (its citation context) to contain useful information pertaining to the content of the cited document and the reason for citation. This information can then be used for retrieval. [3, 33] demonstrate that indexing academic papers using words found in their citation contexts improves retrieval. He et al. [13] develop this idea further by representing each paper as a collection of citation contexts, and then using a non-parametric similarity measure between a query context and each paper for recommendation. Huang et al. [16] use a neural network to learn word and document representations to perform similarity comparison in that space. More recently, in a work most similar to our approach, Ebesu and Fang [10] directly train an encoder-decoder network with attention for context-aware citation prediction and find that adding embeddings representing the citing and cited authors improves predictions.

2.2 Legal Citation Prediction
Because of the importance of citations to legal writing [8], prior work has explored machine-generated recommendations for legal authorities relevant to a given legal question.
A number of commercial tools claim to assist users in legal research using citations. Zhang and Koppaka [43] describe a feature in LexisNexis that allows users to traverse a semantics-based citation network in which relevance is determined by textual similarity between citation contexts. Other commercial offerings include ROSS Intelligence [18], CaseText’s CARA A.I. [17] and Parallel Search [6], as well as Quick Check by Thomson Reuters [40]. The methodology of such offerings is largely proprietary.
Winkels et al. [41] develop a prototype legal recommender system for Dutch immigration law, which allows legal professionals to search a corpus by clicking on articles of interest; the system returns cases with the highest between-ness centrality with the article. In [8], Dadgostari et al. consider the task of generating a bibliography for a citation-free legal text by modelling the search process as a Markov Decision Process in which an agent iteratively selects relevant documents. At each step, the agent can choose whether to explore a new topic in the original paper or to select a relevant paper from the current topic of focus. An optimal policy is learned using Q-learning. They find this adaptive algorithm to
outperform a simpler method, based on proximity to the original document, on the task of retrieving U.S. Supreme Court decisions.
Other works [11, 21] have tangentially analyzed properties of legal citation networks, exploring measures of authority and relevance of precedents, as well as macro characteristics of the network, such as degree distribution and shortest path lengths. Sadeghian et al. [35] develop a system to automatically identify citations in legal text, extract their context and predict the reason for the citation (e.g., legal basis, exception) based on a curated label set.

3 DATA
3.1 The BVA Corpus
The BVA corpus we use contains the full text of over 1 million appeal decisions from 1999 to 2017. Accompanying each decision is a set of metadata derived from the Veterans Appeals Control and Locator System (VACOLS), which includes fields such as the decision date, diagnostic codes indicating the veteran’s injuries, the case outcome, and an indicator for whether the case was subsequently re-appealed. Each case also contains one or more ‘issue codes,’ which are hand-coded by BVA attorneys and categorize the key legal or factual questions raised (e.g., “entitlement to a burial benefit”). This paper focuses on a subset of 324,309 cases that raise a single issue and have complete metadata, although our methods can be generalized to the full corpus.
We hypothesized that three metadata features would contribute to model performance. First, we included the year of the decision, to reflect changes in citation patterns as new legal precedents emerge over time. Second, we constructed an issue area feature to reflect the substantive issues presented in each case, which we hypothesize to provide strong priors for the type of citations contained within as well. The BVA has a hierarchical coding system comprising program codes, issue codes, and diagnostic codes to categorize each issue. For simplicity and class balancing, we curated a composite issue area variable with 17 classes (see Figure 1). Third, we included a feature referring to the Veterans’ Law Judge (VLJ) who handled the case. This corresponds to the hypothesis that citation patterns vary with the idiosyncrasies of individual judges, inspired in part by [10]. Judge names were anonymized and judges with 5 cases or fewer were collapsed into a single unknown judge category. Summary statistics for these metadata are included in Table 1.

Variable     # Values   Most Frequent Class (# cases)                          Least Frequent Class (# cases)
Year         19         2009 (22,801)                                          2017 (3,651)
Issue Area   17         Service Connection for Bodily Injury Claims (38,956)   Increased Rating for Nerve Damage (2,921)
VLJ          289        Anonymized (6,159)                                     Anonymized (6)
Table 1: Summary Statistics of Corpus Metadata Variables.

3.2 Decision Text Preprocessing
American legal citations follow a predictable format governed by [7]. Case citations, for instance, identify the parties to the case; the reporter containing the case; and finally the page in the reporter where the case begins. Thus, a citation to Brown v. Board of Education of Topeka would begin as follows: Brown v. Board of Education, 347 U.S. 483. This indicates that the first page of Brown is found on the 483rd page of the 347th volume of the United States Reports. The volume-reporter-page citation is usually a unique identifier for each case.¹ Citations to statutory law follow a similar three-part pattern: “18 U.S.C. § 46,” means the 46th section of the 18th title of the United States Code.
These three-part citation patterns form the basis for our text preprocessing pipeline. We first use a series of regular expressions to identify, clean, and classify citations from opinions. We then build a vocabulary of legal authority using publicly-available lists of valid cases and statutes. We use this vocabulary to extract all citations from case texts and represent them using standardized indices. We describe this process in greater detail below.

3.3 Citation Preprocessing
The large raw citation vocabulary obtained from running regular expression extractors on every case is normalized into classes of case, statute, regulation, and unknown citations.
For cases, this normalization involves matching the volume, reporter, and first/last page interval derived from the citation string with an authoritative list of cases found in the CaseLawAccess (CLA) metadata.² If an extracted citation can be matched to a CLA metadata entry, it is replaced with a reference to that entry in the citation vocabulary during tokenization. For example, the extraction ‘Degmetich v. Brown, 8 Vet. App. 208 (1995)’ is resolved to the normalized ‘Degmetich v. Brown, 8 Vet. App. 208, CLA#6456776’ (i.e. CLA metadata entry 6456776), which becomes an entry in the citation vocabulary that is used for all identifiable references to the same case. Citations to the U.S. Code and to the Code of Federal Regulations are extracted using patterns based on the ‘<chapter> U.S.C. <tail>’ and ‘<chapter> C.F.R. <tail>’ anchors. The tail typically consists of one or more section elements, which we break into individual elements that each become their own normalized citation with the same anchor and chapter (e.g., ‘18 U.S.C. §§ 46(a), 46(b)’ becomes the two entries ‘18 U.S.C. § 46(a)’ and ‘18 U.S.C. § 46(b)’). All citations that cannot be normalized into either case, code, or regulation classes will form the ‘unknown’ class. Once normalized, the vocabulary is further reduced by removing all citation entries which occur less than 20 times in the training cases and resolving them to an ‘unknown citation’ token. This threshold was manually chosen as a suitable tradeoff between extensive coverage of citations and baseline frequency to enable the model to learn.
The training data contains about 5M extracted citation instances comprising roughly 97k unique strings. Our normalization procedure reduces this to a citation vocabulary of size 4287, of which 4050 (≈ 94.5%) are normalized (1286 cases, 870 statutes, 1894 regulations). The normalized entries cover about 98.5% of citations

¹ Summary dispositions of a case are sometimes reported in a table, such that multiple cases appear on a single physical page.
² CLA is a public-access project that has digitized the contents of major case reporters [5]. We include the Vet. App. and F.3d reporters, which contain veterans’ law cases and cases from the Federal Courts of Appeal, as these account for the vast majority of cases cited in the corpus.
as the items that a user has liked), and returns other citations that similar documents have also cited.
Formally, assume that the corpus of BVA cases C has V authorities that can be cited. Then every document d′ can be represented by a sparse vector v_d′ ∈ R^V, each of whose dimensions v_d′,c indicates an importance score of a citation c to the document. If citation c is cited in a document, possible scoring functions could include a binary representation (v_d′,c = 1), a term frequency vector (tf), and a tf-idf vector that incorporates the inverse document frequency (idf). With such a representation, a set of document vectors {v_d′ : d′ ∈ D} can be constructed from a document collection D.
Given a draft of a BVA opinion d, its incomplete citation set c_d can also be summarized into a document vector v_d. We use a collaborative filtering approach known as the user-based top-K recommendation algorithm. The algorithm first identifies the K documents D_K(d) that are most similar to d from the collection, based on the cosine similarity of their vector representations:

    sim(v_d, v_d′) = (v_d · v_d′) / (‖v_d‖₂ ‖v_d′‖₂)

The algorithm then finds candidate citations based on what these documents cite. An average of these document vectors weighted by their similarities gives the final recommendation. Specifically, the recommendation score of citation c for document d is given by

    score(d, c) = ( Σ_{d′ ∈ D_K(d)} sim(v_d, v_d′) · v_d′,c ) / ( Σ_{d′ ∈ D_K(d)} sim(v_d, v_d′) )

In our experiments, the document vectors are collected from the training set. The number of top similar documents K is a hyperparameter that can be tuned, and K = 50 is chosen for the results reported. From our trials with three different scoring functions for the document vectors, binary scoring proved to be the most effective choice and was used throughout the experiments.
To incorporate metadata features, a score is assigned to each categorical feature f_i, namely the probability of citing the citation c after conditioning on that feature:

    score(f_i, c) = P(c | f_i)

We take a weighted average of these features and the output of the collaborative filtering algorithm. We adopt the commonly used svmRank algorithm of [20] to learn weights for each feature. We extract all citation occurrences in a random sample of 1000 documents from the training set, perform a pairwise transformation on the data, apply min-max normalization on the pairwise data, and train a linear Support Vector Machine (SVM) on the normalized data. The final score is a linear combination of individual feature scores using the learned weights. Citations suggested by the recommender system are reranked by their final scores and the top citations are chosen as final predictions.

5.2 Text Similarity
The second model uses a context-aware bag-of-words approach to predict citations. Previous studies, such as [13, 34], have demonstrated that the local context of words surrounding each citation occurrence can be used as a compact representation of the cited document to improve retrieval effectiveness, much like how in-link text is used to improve web retrieval. By contrast to collaborative filtering, this approach does not require the user to input an existing set of citations. Instead, the words in a section of interest within the draft opinion are used as a query to find the most relevant citation based on textual similarity of the present context to the previous contexts associated with each citation. Such local citation recommendations have the added advantage of relevance to a particular section of the opinion.
Formally, we adopt the approach of [13], which represents each context by its tf-idf vector (normalized to have an L2-norm of 1). Each citation c is represented by a collection of tf-idf vectors {b_j : j = 1, 2, · · · , k_c}, where each b_j represents the local context of one citation occurrence and k_c is the number of times c was cited in the training set. Given a query context b_d at test time, the relevance of each citation c to the query is then calculated as:

    score(b_d, c) = (1/k_c) Σ_{j=1}^{k_c} (b_d · b_j)²

We removed stopwords, words that occurred in less than 10 documents, and words that contained digits. The most frequent 25,000 words were then chosen as a vocabulary. We used the 50 words preceding (instead of surrounding) each citation as its context, in line with our task to recommend relevant upcoming citations.⁴ Citations that occurred within each context were also used as part of the vocabulary. As some citations were very frequently cited, we collected at most 100 randomly chosen context vectors (i.e. k_c ≤ 100) per citation. Metadata features are incorporated into the model in a way similar to the Collaborative Filtering model (see Section 5.1). Each feature is assigned a score and an SVM model is trained to learn feature weights to produce the final score.

5.3 Bi-directional Long Short Term Memory
LSTMs [15] are a popular form of recurrent neural networks and serve as a well-known baseline for deep neural network models. Variants using LSTM remain competitive in various NLP tasks [22, 25, 29]. BiLSTM (Bi-directional LSTM) improves on the original LSTM by reading inputs in both forward and backward directions. We adopted a two-layer BiLSTM on the BVA corpus for citation prediction. Just like the text similarity baseline, this approach performs local citation recommendation. It takes a sequence of words within the draft opinion as the query context, and predicts which citation is most likely to be cited next given the context. Going beyond the text similarity model, we predict the first citation that appears within a forecasting window of fixed length.
Formally, a sequence of tokens b_d = {b_1, ..., b_l} is extracted from each document d as the query context and we seek to predict the immediate next citation in the upcoming forecasting window of length w. The query context is encoded using pre-trained byte-level Byte Pair Encoding (BPE) [36]. For comparability with the RoBERTa model, we use the ‘roberta-base’ tokenizer provided by Huggingface [42], which has a vocabulary of about 50k tokens. The citation vocabulary indices are re-inserted after encoding, replacing the general citation token to generate the final encoded tokens as described in Section 3.3. The encoded tokens are fed into an embedding layer followed by two stacked bi-directional LSTM

⁴ Note that this means citations are always the very next word after the context. This contrasts with the neural models presented below, where citations may appear at some distance from the context.
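The user-based top-K scoring of Section 5.1 can be sketched in a few lines of NumPy. This is a minimal illustration of the two formulas (cosine similarity, then a similarity-weighted average over the K most similar documents) on toy data with binary citation vectors, not the paper's implementation; the function name `cf_scores` is ours:

```python
import numpy as np

def cf_scores(v_d, V_train, K=50):
    """Score every candidate citation for a draft vector v_d.

    V_train: (num_docs, num_citations) binary matrix; row d' is v_{d'}.
    Returns the similarity-weighted average of the K most similar rows.
    """
    norms = np.linalg.norm(V_train, axis=1) * np.linalg.norm(v_d)
    sims = (V_train @ v_d) / np.where(norms == 0.0, 1.0, norms)  # cosine sim
    top_k = np.argsort(sims)[::-1][:K]                           # D_K(d)
    w = sims[top_k]
    return (w[:, None] * V_train[top_k]).sum(axis=0) / w.sum()

# Toy corpus over 4 citable authorities.
V = np.array([[1, 1, 0, 0],   # doc 0 cites authorities 0 and 1
              [1, 0, 1, 0],   # doc 1 cites authorities 0 and 2
              [0, 0, 0, 1]])  # doc 2 cites authority 3
draft = np.array([1, 1, 0, 0])  # draft opinion already cites 0 and 1
scores = cf_scores(draft, V, K=2)
```

With this toy data the two nearest neighbors are docs 0 and 1, so among the authorities the draft does not yet cite, authority 2 is ranked above authority 3, matching the weighted-average formula above.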
prediction. The collaborative filtering model uses only the previous citations in a document as input. It returns the correct citation as its top-ranked recommendation 10.2% of the time; recall@5 is 25.5%. By contrast, the text similarity baseline achieves a recall@1 of 16.4% and a recall@5 of 41.1%, on average. This is strong evidence that the textual context preceding a citation is a critical signal. By contrast, the document-level statistical information on citation patterns leveraged by collaborative filtering is less informative.
For the text similarity model, adding metadata information generally gives a noticeable improvement over predictions based on text alone. For example, adding structured information on the year of a decision improves performance, which suggests that the model does not otherwise detect temporal information. But not all metadata is equally useful. Adding information on the identity of the judge produces little or no marginal gain. Further, we do not find evidence that metadata enhances the collaborative filtering model. Interestingly, the benefit in recall@1 of case year information is negated when class is added, although recall@5 and recall@20 improve at the same time. If one were to pursue the baseline further, this effect should be examined.
For purposes of this comparison experiment, we train our BiLSTM and RoBERTa models on a context window of 256 tokens and a forecast window of 128 tokens. They are trained until, in our assessment, validation metrics indicated convergence, at which point they dramatically outperform both baselines. Both predict the correct citation roughly 65-66% of the time and produce a recall@5 of around 81-83% using the textual context alone. The neural models' improvement over the text similarity baseline suggests that the ability to encode more complex semantic meanings, and to track long-term dependencies across context windows of significant length, noticeably improves performance in citation recommendations.
We experimented with different metadata combinations for the neural models with 8 epochs of training time and observed no clear differences, and decided to only train all-meta and no-meta models until convergence. Giving the BiLSTM and RoBERTa models access to metadata improves predictive performance by around 0.2-0.6%. That delta, however, is mostly within two standard errors of the two models. Our two possible explanations are (a) that the neural models are capable of implicitly inferring some background features from the legal text itself, and thus they will not benefit much from us providing this information explicitly, and (b) that metadata may not carry much signal for this task.
The superior neural model performance is intuitive in legal text also because the text preceding a citation will typically paraphrase a legal principle or statement that is reflective of that source. We can assume that some portion of our context-forecast instances consist of relatively easy examples. To some degree, short-distance citation prediction can in fact be considered a sentence similarity task. Commercial search engines even use text encoding similarity to suggest cases to cite for a particular sentence (e.g., [6]). Similarly, literal quotations from the source preceding the citation can be certain indicators. However, a pure memorization approach will fail for longer forecast distances, as one can anticipate an upcoming cited source from the narrative progression in the text before it becomes lexically similar to the source closer to the citation. An exception to this consists of large spans of boilerplate text that contain citations and are reused across decisions. To investigate the capacity of our models to anticipate citations from further away, we experiment with different forecasting lengths (see Sections 6.2 and 6.4 below).
A final observation is the stability of predictive performance across the six test set folds, as evidenced by the low standard errors. The neural models have slightly more deviation than the baselines, and the BiLSTM and RoBERTa metrics are generally within ±2 standard errors of each other for a given recall measure.

6.2 Context & Forecasting Window Sizes
To further explore the behavior of the deep neural models, we conducted an ablation study, in which we varied the size of the context and forecasting windows and varied the availability of structured metadata information. We tested 12 different settings for BiLSTM
[Figure 4: three panels of line plots (Recall@1, Recall@5, and Recall@20 on the y-axes) against context window size (64, 128, 256 tokens) for forecast distances of 64 and 128 tokens ahead, comparing BiLSTM and RoBERTa with text alone vs. text & metadata.]
Figure 4: Results of the ablation study for Recall at 1, 5, and 20. Within each panel, the most difficult tasks are in the bottom left corner and the easiest tasks are in the top right. The x-axis shows the context window. “64 ahead” and “128 ahead” refer to the maximum number of tokens between the context window and the target citation. Error bars are 95% confidence intervals.
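All panels report recall@k, which counts a prediction as correct when the target citation appears among the model's top-k recommendations. A minimal sketch of the metric (the function name is ours, not from the paper's code release):

```python
def recall_at_k(ranked_predictions, targets, k):
    """Fraction of instances whose true citation appears in the top-k list.

    ranked_predictions: one ranked citation list per instance, best first.
    targets: the true next citation for each instance.
    """
    hits = sum(t in preds[:k] for preds, t in zip(ranked_predictions, targets))
    return hits / len(targets)
```

For example, with two instances whose targets are ranked first and third respectively, recall@1 is 0.5 while recall@5 is 1.0.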
[Figure 6: combined scatterplot; left y-axis: citation count in training data (0–200,000), right y-axis: citation recall@1 (0.0–1.0), x-axis: citations sorted by recall (0–2000).]
Figure 6: Per-citation recall@1 vs. number of instances in training data for RoBERTa all-meta model.

Distance   N      Recall@1   Recall@5   Recall@20
33-48      9125   68.8       85.2       93.2
49-64      7082   63.3       81.1       91.0
65-80      5452   55.6       75.4       87.6
81-96      4534   52.7       73.1       87.1
97-112     3918   47.9       69.6       84.0
113-128    3403   42.1       66.2       82.7
Table 3: RoBERTa all-meta performance binned by token distance from beginning of forecasting window to target citation, based on single pass over validation set.
6.4 Error Analysis
Figure 5 shows relatively consistent recall at 𝑘 = 1 performance across classes over time. We see a slight downward slope for the case and regulation metrics towards the end of our analysis period. This may be because opinions later in the time period contain new citations and patterns that occur less frequently in the training data. The plot exhibits a single strong upward oscillation in 2002-2003. We believe this is likely due to litigation surrounding the Veterans Claims Assistance Act of 2000, which sparked mass remands by the BVA back to regional offices. The relative shape of the per-class recall graphs stays roughly the same for larger values of 𝑘, albeit shifted to higher absolute recall levels.
To assess the influence of the sampling distribution, the combined scatterplot in Figure 6 plots the recall at 𝑘 = 1 achieved for each citation against its frequency as a prediction target in the training data. Of the 2037 different citations that were loaded in a single pass over the test data (of the total of 4287; see Section 5.5), only about 1200 citations are predicted with non-zero recall. At 𝑘 = 20 this number increases to about 1700 and the red curve shifts right (not shown). The distribution of blue data points indicates that almost all zero-recall citations occur with very low, or zero, frequency. However, citations with high recall do not follow a recognizable frequency pattern. This is informative for the cold-start problem of new sources becoming available that have not yet been cited often enough to be learned by models such as the ones presented here. We are aware of this limitation and leave it for future work.
Finally, we examined whether the number of decisions in the test data authored by a judge correlated with the model's performance in predicting citations from those decisions, but did not find clear patterns. The three-dimensional judge embeddings also did not reveal any clear separation with regard to per-judge recall. We intend to investigate the relationship between attributes of individual VLJs and the behavior of trained models in future work.
To help characterize the underlying behavior of the models, we drew a sample of 200 erroneous predictions generated by a long-trained RoBERTa model similar to the one in Table 2. [Footnote 5: After qualitative error analysis was completed, a pre-processing bug was corrected, leading to changes in recall values of less than 0.5%. Quantitative results and analyses of converged models reported here are from this slightly improved version.] Two sets of observations indicate that the model has developed some conceptual mapping of citations. First, 16% of the erroneous predictions did appear in the forecast window, somewhere after the first citation. Idiosyncrasies in citation order might explain these errors, but there is no conceptual mismatch. Second, around 5% of the errors involve regulations that implement a particular statute. For example, one case cites 38 C.F.R. § 3.156(a), a regulation defining when veterans may present "new and material evidence" to reopen a claim. The model predicted a citation to 38 U.S.C. § 5108(a), which is precisely the statute commanding the BVA to reopen claims when veterans identify "new and material evidence." Again, the erroneous prediction is in exactly the right conceptual neighborhood.
Consistent with our ablation analysis, our review of the errors suggests the critical role that topical changes in long texts play in generating errors. Table 3 shows recall metrics for targets binned by the position of the target citation within the forecast window between minimum and maximum distances. Since legal analysis is often addressed in a single section of an opinion, close citations are more frequent than distant ones. Unsurprisingly, performance decreases with distance from the context window. From the closest to the farthest bin, recall@1 shrinks by a relative 47%, recall@5 by 28%, and recall@20 by 15%. This behavior is intuitive and indicates that the system may indeed memorize contexts immediately surrounding citations. Still, the gradual decline in performance, especially for recall@5, suggests that the model is learning some amount of longer-distance patterns. This is evidence that effective citation recommendation benefits from both a sophisticated representation of context and supervised training on existing citation patterns.

7 CONCLUSION
In this paper, we have implemented and evaluated four models that can recommend citations to lawyers drafting legal opinions. BiLSTM and pretrained RoBERTa perform comparably and outperform the collaborative filtering and bag-of-words baselines. Our ablation experiments show that (a) adding metadata about case year, issue, and judge leads only to insignificant performance improvements for the neural models, and (b) predicting citations further away from the context is more difficult, which can be compensated to some degree by providing more context. Training for extended periods continuously improves performance, up to a recall@5 of 83.2%. As such, we have shown that context-based citation recommendation systems can be implemented as classifiers over a largely normalized citation vocabulary with acceptable performance. Further, our error analysis shows that even incorrect predictions may still be useful.
Our work also points to next steps for legal citation prediction. First, citation prediction can be conceived of more broadly as language generation. Research should hence explore whether
neural models can go beyond pointing to an entry in the citation vocabulary and write valid citation strings appropriate for a given context, possibly as part of a continuation of the text. Second, as a practical matter, it will be important to evaluate the usefulness of the models trained here with expert users. Finally, we note that legal sources and institutions form dynamic systems. Constant adaptation, such as detecting and accounting for changes in precedent, will be key to the future utility of citation systems.
These future directions could rapidly improve legal citation, and our results here show that context-aware citation prediction can play a significant role in improving the accuracy, consistency, and speed of mass adjudication.

8 STATEMENT OF CONTRIBUTIONS
The project was conceived and planned by all authors. ZH, CL, MT, and HZ conducted all model development and experimental work under the mentorship of DEH, MK, and MG. MK and MG developed the citation preprocessing functionality, as well as produced the error analysis. All authors contributed to writing the paper.

9 ACKNOWLEDGMENTS
The authors thank CMU MCDS students Dahua Gan, Jiayuan Xu, and Lucen Zhao for creating the issue typology, Anne McDonough for supporting contributions around citation normalization, and Dave Ames, Eric Nyberg, Mansheej Paul, and RegLab meeting participants for helpful feedback.
A dynamic model for balancing values
Juliano Maranhão∗ (julianomaranhao@usp.br), University of São Paulo Law School, São Paulo, SP, Brazil
Edelcio G. de Souza (edelcio.souza@usp.br), University of São Paulo - FFLCH, São Paulo, SP, Brazil
Giovanni Sartor (giovanni.sartor@eui.eu), European University Institute, Firenze, FI, Italy
ABSTRACT
We propose an additive model for balancing the impacts of actions on values, where factors intensify or attenuate impacts on values, and values are assigned degrees of relative importance (weights). The balancing model induces axiological rules, consisting in prohibitions or permissions that are justified according to the impacts of the prohibited or permitted action on the values at stake. We also propose eight different revision operators, which shift the balance – and thus induce different norms – by expanding or contracting either the set of factors or the set of values. We provide the construction and prove some success properties of those operators.

KEYWORDS
balancing values, change functions, teleological interpretation and argumentation

ACM Reference Format:
Juliano Maranhão, Edelcio G. de Souza, and Giovanni Sartor. 2021. A dynamic model for balancing values. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466143

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06...$15.00
https://doi.org/10.1145/3462757.3466143

1 INTRODUCTION
There is an apparent consensus in jurisprudence that legal decision-making cannot be fully driven by rules alone; it calls for value-based reasoning. Value-based reasoning is indeed relevant to the law in multiple regards.
First of all, ethical and political values (the so-called political morality) provide a critical framework for assessing the merit of positive laws. Existing laws, as resulting from legislative enactment, from the practice of legal officers or from custom, can be critically examined with regard to the extent that they meet or violate ideals of justice and fairness, or that they promote or demote particular human rights or social values. The divergence of laws from ideals of justice may justify citizens in disobeying certain laws (as in cases of civil disobedience) or even officers in refusing the application of such laws. Legal theorists have provided different interpretations of this phenomenon, depending on their views on the relation between law and morality. Those endorsing a complete separation of law and morality argue that laws departing from justice still count as perfectly valid laws [13]; those affirming the intertwining of law and morality argue, on the contrary, that such laws are legally defective and, in extreme cases, legally invalid [3].
Second, values play a key role in the context of legal interpretation. In determining the meaning of legal sources, a dominant role is played by teleological approaches, where the ascription of a meaning to a legal provision over other possible meanings is justified on the ground that the selected meaning better promotes desirable interests or goals, or better prevents undesired outcomes [5].
Values also play a key role in constitutional review, where assessments are often performed according to proportionality, i.e., by determining whether an infringement of constitutional rights is justified by non-inferior advantages with regard to other constitutional rights or values, provided that no less infringing choice delivers a better trade-off [2]. More generally, values play a key role in all instances of legal decision-making where there is a space for discretion. In such cases the decision maker has to consider the merit of alternative choices. This has to be done by taking into account all legally relevant values (which the decision maker is allowed to consider), and the extent to which each choice promotes or demotes such values, in the context provided by relevant features (factors) of the case.
Value-based reasoning has been given growing attention by research on AI & Law. Following the seminal contribution by Berman and Hafner [8], AI & Law research has provided multiple models of the relation between cases (and the factors that such cases include or express) and the values at stake. Bench-Capon and Sartor [6] assign values to factors and consequently to rules embedding such factors. They explain precedents according to the applicable rules and the importance of the (sets of) values promoted by such rules. They compare alternative sets of rules in terms of their coherence with precedents and values. Bench-Capon et al. [19] formalise teleological reasoning using logics for defeasible argumentation, extended with the possibility to express arguments about values, supported by cases. Grabmair [12] defines functions representing the extent to which a factor contributes to making it so that a certain outcome promotes a certain value, and compares alternative outcomes accordingly. Sartor [21] explores the proportional balance of constitutional rights (as theorized by Alexy [2]), where a legal outcome is compared to alternative outcomes based on its impact on the promotion and demotion of values, and examines consistency between value-based decisions of cases, given the factors present in such cases [22].
AI & Law research on statutory interpretation has also explored the relations between legal rules and the values underlying the deliberation and moral/political justification of their enactment, based on dynamic approaches where values guide changes of the content of rules. For instance, Boella et al. [9] introduce values as coherence parameters guiding the change of conceptual rules.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Maranhão, de Souza and Sartor
Such models of statutory interpretation were also explored in the AGM style [15] and in the framework of i/o logics [10]. In this line, Maranhão [16] proposed an architecture of i/o logics where values are represented as rules, and constitutive and regulative rules are the object of different revision functions. Walton, Macagno and Sartor [23] analyse multiple argument schemes for interpretive reasoning.
In this paper we propose an additive model of balancing values. In this model, each relevant factor contributes to intensify or attenuate the impact of the permissibility of the action at stake on a set of relevant values which are promoted or demoted by the permissibility of that action. These influences are then proportionally considered with respect to the relative importance of the values being impacted. The resulting assessments of the action's impacts on single values are then aggregated to determine the action's total impact on the set of values it promotes and on the set of values it demotes. The comparison of the action's impacts on the promoted and on the demoted values enables us to determine whether the action is axiologically permitted or rather prohibited.
After presenting this model, we propose change functions that shift the axiological evaluation and consequently the axiological rule that applies to it. Such shifts are operated by additions or subtractions of factors, or by additions or subtractions of values in the model. These operations have some resemblance to argument moves, where new features of the case or moral considerations are brought forward to oppose previously justified conclusions.
To illustrate our model, we shall use some variations of the leading case Riley v. California, decided by the U.S. Supreme Court in 2014. In that precedent, the court concluded that a specific warrant was needed to access the digital content of a mobile phone of an arrestee, considering the significant amount of personal data usually stored in such a device. In its own words, such access "implicates substantially greater privacy interests than a brief physical search" of the items accessible to the arrestee, which is, on the contrary, allowed.
The paper is structured as follows. First, in Section 2, we introduce the additive model for balancing the impact of actions on values. In Section 3, we describe how to build systems of value-rules, or axiological systems. Then, in Section 4, we introduce and discuss eight revision operators upon axiological systems and specify their success conditions. We conclude the paper by discussing some limitations of our model and indicating possible paths for future research.

2 AN ADDITIVE MODEL FOR BALANCING VALUES
In this Section we shall first introduce the general idea and then provide a formal account of an action's impact on relevant values in given contexts. Then we show how such impact assessment induces value-rules, and finally discuss some properties of systems of such rules and their relation with the deontological rules contained in the positive legal system.

2.1 Introducing the additive model
An axiological model of balancing presupposes a determination of the comparative moral merits of the choice of performing an action rather than abstaining from it. The action may consist in any behaviour, e.g., having an abortion rather than continuing the pregnancy, or accessing a text message stored in a mobile phone rather than respecting its confidentiality. The evaluation model basically compares, for each given action, its impact on the set of values it promotes against its impact on the set of values it demotes, given the constellation of factors, i.e., the context in which the action is performed. Two clarifications are of central importance to understand the model proposed here.
First, we only consider the assessment of the impact of a single action on values, and therefore we only compare the values promoted against the values demoted by that specific action, so that a decision takes place whether that action should or should not be performed on moral grounds. There is no room in this model to compare and decide among different and logically independent actions in terms of their impacts on values. Typically, a claim before a court questions the legality of a particular action, and the court must decide whether the action under evaluation should be performed or not (should be forbidden or permitted, should be punished or not punished). So, we keep the same structure regarding its axiological evaluation. We acknowledge that there may be contexts where a judicial decision compares and chooses among alternative courses of action, for instance, between the consumer's right to receive a new product or to have his money back. However, we shall leave this kind of value assessment to future work.
Second, we assume that the direction of impact of an action on a value – i.e., whether the action promotes or demotes the value – is invariant, although the extent of the promotion or demotion may be intensified or attenuated by the presence of factors in the context of performance. By saying that the direction of impact is invariant, we mean that irrespective of how many attenuating factors are taken into account, the impact of an action on the promotion of a particular value never shifts to its demotion. And, vice versa, the impact of the action at stake on the demotion of a value never shifts to its promotion.
Let us illustrate the model's underlying rationality with an example. Suppose the rules of a condominium forbid people to take the elevator during the COVID-19 pandemic. Suppose now that one inhabitant has a medical emergency. Then one could evaluate whether following the rule would lead to immoral results. The factor "medical emergency" is an intensifier w.r.t. the promotion of the value of the patient's health, which would lead to a permission to use the elevator. But now consider that the emergency does not hinder the patient's ability to walk (for instance, it is a toothache) and that she lives on the second floor. Then the proportional influence of the set of factors on the promotion of the patient's health may become null or negative, but one would not say that the action of taking the elevator would now demote her health in that particular context. Actually, the action still promotes health even in the presence of those attenuating factors. But in such cases the proportional impact of the action is so low that it becomes morally irrelevant to legal considerations, that is, it will not play a role in a consideration of whether to follow the rule or not. Hence, in the model proposed here, attenuating factors only affect the degree of moral impact of the action on a value.
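The elevator example can be given a numerical sketch that anticipates the formal machinery of Section 2.2 (impact functions, weights, and their additive aggregation). All numbers, value names, and factor names below are illustrative assumptions, not values taken from the paper:

```python
# Illustrative sketch of the additive balancing model for the elevator case.
# Values promoted/demoted by permitting the action "take the elevator":
prom = {"health"}            # the patient's health is promoted
dem = {"infection_control"}  # pandemic containment is demoted

w = {"health": 0.8, "infection_control": 0.6}  # assumed weights of values

# I[(factor, value)]: influence of a factor on the action's impact on a value;
# the None key plays the role of the empty-set argument (baseline impact).
I = {
    (None, "health"): 0.2,                 # baseline impact of the action
    (None, "infection_control"): -0.4,
    ("medical_emergency", "health"): 0.6,  # intensifier
    ("able_to_walk", "health"): -0.4,      # attenuator
    ("second_floor", "health"): -0.4,      # attenuator
}

def B(factors, values):
    """Proportional influence of a set of factors on a set of values:
    the sum of I(f, V) * w(V), in the spirit of Definitions 2.3-2.5.
    Here the promoted/demoted value sets are passed in explicitly."""
    return sum(I.get((f, v), 0.0) * w[v] for f in factors for v in values)

context = [None, "medical_emergency", "able_to_walk", "second_floor"]
print(round(B(context, prom), 2))  # prints 0.0  -> promotion becomes negligible
print(round(B(context, dem), 2))   # prints -0.24
```

With the attenuators present, the aggregate influence on health drops to zero even though each individual influence keeps its sign, matching the point that attenuating factors change only the degree of impact, never its direction.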
2.2 The model for balancing
The model of balancing may be described by the structure

V = ⟨Act, Fact, Val, Prom, Dem, {I_x}_{x ∈ Act}, w⟩

whose elements are detailed and discussed below.
We are going to work with two sorts of literals: actions and factors. The set of actions Act is the union PAct ∪ NAct of a set of atomic actions {x_1, x_2, ...} and of their negations {¬x_1, ¬x_2, ...}. Similarly, the set of factors Fact is the union PFact ∪ NFact of a set of atomic factors {f_1, f_2, ...} and of their negations {¬f_1, ¬f_2, ...}. We write x̄ to denote the complement (negation) of action x. The set Val = {V_1, V_2, ..., V_n} is a finite set whose elements are values.
Each action may also be subject to the usual deontic qualifications. By combining x and the conjunction Φ∧ of all factors in a set Φ, we obtain dyadic deontic formulae O(x | Φ∧) and P(x | Φ∧) stating, respectively, that action x is obligatory or permitted under condition Φ∧. We say that the formulas O(x | Φ∧) and P(x̄ | Φ∧) are deontic opposites.
In the following, we distinguish two possible deontic evaluations of an action under given factors. The first, indicated by the operators O_d and P_d, is the deontological evaluation, given by a set of positive norms which explicitly state obligations and permissions (e.g., the norms stated by a legislator). The second, indicated by the operators O_v and P_v, corresponds to the axiological evaluation. This evaluation refers to the connection between actions and values: it considers to what extent an action promotes or demotes a value. Under a different reading, which may be more appropriate when engaging in the axiological evaluation of positive norms, the axiological evaluation considers whether making an action permissible (in a positive legal code) promotes or rather demotes the values at stake. Consider for instance the case of abortion. When considering the ethical merits of the legal permissibility of abortion, we are not engaging with the moral merit of a woman's choice to have or not to have an abortion, but rather with the moral merit of making abortion permissible rather than forbidden. This assessment does not pertain to the morality of individuals, but rather to political morality, i.e., to the morality of making public choices on what is to be imposed or not on citizens.
Thus if x is an action, we define Prom(x) ⊆ Val as the set of values promoted by (the legal permissibility of) action x, and Dem(x) ⊆ Val as the set of values demoted by (the permissibility of) the action x, where Prom(x) ∩ Dem(x) = ∅.
The comparison depends on the evaluations expressed by the quantitative assignments of weights to values and of the influence of factors on the impact of actions on values. For generality's sake, we assume that such indexes can take arbitrary numerical assignments within given ranges. These numbers can be restricted to any scales that may be convenient for the chosen domain of application. Here we shall use the positions (0, .2, .4, .6, .8, 1) in the examples. What matters is that the numerical assignments reflect some relative importance of the elements at stake, as part of a reasoning with dimensions and magnitudes, and how such assessments of relative importance affect the outputs of the system and its overall coherence.

Definition 2.1. (weight function) For the finite set of values Val

Definition 2.2. (impact function) For each action x we define I_x : (Fact ∪ {∅}) × Val → [−1, 1], where I_x(f, V_i) is the influence of factor f on the impact of action x on the value V_i.

Note that the function I_x is fixed for a particular action x, and it captures the influence that each factor has on the impact of the action at stake on each value. The influence may be null, i.e. take value 0, when the factor has no influence on the action's impact on a particular value, that is, when it is a morally neutral factor regarding that action and value. When the influence takes a positive real number in the interval, we say that the factor is an intensifier of the impact of the action on the value. And if the function assigns the factor a negative real number in the interval, we say that the factor is an attenuator of such impact. The action at stake also has a baseline impact on each value, given when the impact function takes the empty set as its argument.
Given these assignments of weight and impact, we may define the proportional impact of a factor on a value for a given action.

Definition 2.3. (Proportional influence of a factor on a value) Let x be an action, f a factor and V_i a value in V ⊆ Val; we define:

Δ_x^f(V_i) = I_x(f, V_i) × w(V_i)

The value assessment will compare the influence of all relevant factors on all relevant values: on the one hand, the values promoted by the action in a given constellation of factors, and on the other hand, the values demoted by that action in that context. So we take the sum of the proportional impacts on each set of promoted or demoted values.

Definition 2.4. (Proportional influence of a factor on a set of values) Let x be an action, f a factor and V ⊆ Val a set of values; then we define:

B_Prom^{f,x}(V) = Σ_{V_i ∈ Prom(x) ∩ V} Δ_x^f(V_i)

The same definition holds mutatis mutandis for the proportional demotion of values, denoted by B_Dem^{f,x}(V). When V = Val we write simply B_Prom/Dem^{f,x}.
The proportional influence of a factor on the promotion (demotion) of values may now be straightforwardly extended to a set of factors:

Definition 2.5. (Proportional influence of a set of factors on a set of values) Let Φ = {f_1, f_2, ..., f_n} be a set of factors; then:

B_Prom^{x,Φ} = Σ_{f ∈ Φ} B_Prom^{x,f}

Again, the same definition holds mutatis mutandis for the proportional influence of a set of factors on the demotion of values, denoted by B_Dem^{x,Φ}.

2.3 Value-rules induced by balancing
We now define an entailment-like relation to extract from a given value assessment whether the action under analysis would be axi-
we define the weight function 𝑤 : 𝑉 𝑎𝑙 −→ [0, 1], where 𝑤 (𝑉𝑖 ) is ologically permitted, forbidden or indifferent. The action at stake
the weight of the value 𝑉𝑖 . is forbidden if the proportional aggregate demotion of values is
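Definitions 2.3–2.5 can be sketched computationally. The following is a minimal sketch, assuming a dict-based encoding of the impact function 𝐼𝑥 and the weight function 𝑤; the names `delta` and `B` are ours, not the paper's:

```python
# A minimal computational sketch of Definitions 2.3-2.5. The dict-based
# representation of the impact function I_x and the weight function w is an
# assumption of this sketch, not the paper's notation.
def delta(I, w, f, v):
    """Δ^f_x(V_i) = I_x(f, V_i) × w(V_i)  (Definition 2.3)."""
    return I.get((f, v), 0.0) * w[v]

def B(I, w, factors, side, V):
    """B^{x,Φ}_Prom/Dem(V): sum of Δ^f_x(V_i) over f ∈ Φ and V_i in side ∩ V
    (Definitions 2.4 and 2.5), where side is Prom(x) or Dem(x)."""
    return sum(delta(I, w, f, v) for f in factors for v in V if v in side)
```

With the scale (0, .2, .4, .6, .8, 1) used in the paper's examples, e.g. an impact of .4 on a value of weight .5 yields a proportional influence of .2.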
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Maranhão, de Souza and Sartor
positive and higher than the proportional aggregate promotion of values. Otherwise, if the number corresponding to the aggregate promotion of values is higher than the aggregate demotion, the action is permitted. If the proportional aggregate promotion and demotion are equivalent, or if the aggregate promotion or demotion is a non-positive real number, then the action is not morally relevant, and therefore also permitted.

Definition 2.6. Let 𝑥 ∈ 𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡, and Φ∧ be the conjunction of all factors in Φ. Then the value-rule induced by Φ and 𝑥 is
(i) 𝑂^𝑣(¬𝑥|Φ∧), if 𝐵^{𝑥,Φ}_𝐷𝑒𝑚 > 𝐵^{𝑥,Φ}_𝑃𝑟𝑜𝑚 and 𝐵^{𝑥,Φ}_𝐷𝑒𝑚 > 0,
(ii) 𝑃^𝑣(𝑥|Φ∧), otherwise.

We say that the balance shifts whenever the addition of a new set of factors to the original set of factors changes the value-rule induced by the balancing. That is, if we have 𝑂^𝑣(¬𝑥|Φ∧), then the addition of the set of factors Θ shifts the balance if it holds that 𝐵^{𝑥,Φ∪Θ}_𝑃𝑟𝑜𝑚 ≥ 𝐵^{𝑥,Φ∪Θ}_𝐷𝑒𝑚 or 𝐵^{𝑥,Φ∪Θ}_𝐷𝑒𝑚 < 0. Likewise, if it holds that 𝑃^𝑣(𝑥|Φ∧), then the balance shifts if we obtain 0 < 𝐵^{𝑥,Φ∪Θ}_𝐷𝑒𝑚 > 𝐵^{𝑥,Φ∪Θ}_𝑃𝑟𝑜𝑚.

The dyadic deontic statements obtained by the balancing may be taken as elements of a theory in a dyadic deontic logic, where further deontic statements may be derived. If Φ = ∅, we express the baseline evaluation of an action 𝑥 as 𝑂/𝑃^𝑣(𝑥|⊤), which may also be written as a monadic obligation/permission 𝑂/𝑃^𝑣 𝑥.

Definition 2.7. Let 𝑥 be an action, Φ a set of factors and V ⊆ 𝑉𝑎𝑙. Then VR^Θ_𝑥(Φ) is the set of value-rules induced by the balancing model for Θ ⊆ Φ, and VR_𝑥(Φ) is the class of sets of axiological rules VR^{Θ𝑖}_𝑥(Φ) such that Θ𝑖 ⊆ Φ.

Let us illustrate the model by an example. We adopt here the convention of omitting the reference to the argument of an impact function when its value is zero.

According to the US case law before the judgment of Riley vs California (2014), a police officer was allowed to access the content of any items, including property, in the premises or surroundings when arresting an individual due to any criminal offense. However, in the Riley vs California case, the arrest included the seizure of a mobile phone and, from the content accessed in that device, the officers found evidence of another crime, which led to a conviction. The U.S. Supreme Court concluded that a specific warrant was needed to access the digital content of the mobile phone of the arrestee. It considered that the significant amount of personal data usually stored in a mobile phone would involve an inadmissible impact on privacy. This rule could be explained by the following considerations on the underlying value impacts: accessing items (𝑎𝑐𝑐) in an arrest has a baseline impact on the promotion of public safety (𝑆𝑎𝑓) and a baseline demotion of property rights (𝑃𝑟𝑖𝑔ℎ𝑡) and privacy (𝑃𝑟𝑖𝑣); the factor "arrest" (𝑎𝑟𝑟) intensifies the promotion of public safety so as to outweigh the extent to which the factor "property" (𝑝𝑟𝑜𝑝) intensifies the demotion (through the same action) of property rights and privacy respectively. However, as considered by the court, if the item collected is a mobile phone (𝑚𝑜𝑏), then the negative impact on privacy is intensified to the extent that the promotion of public safety is outweighed. This led the court to introduce an exception, forbidding access to the digital content of mobile phones collected during an arrest without a specific warrant.

Example 2.8 (Riley vs California). Consider the action 𝑎𝑐𝑐, the set of factors Φ = {𝑎𝑟𝑟, 𝑝𝑟𝑜𝑝, 𝑚𝑜𝑏} and the values V = {𝑆𝑎𝑓, 𝑃𝑟𝑖𝑔ℎ𝑡, 𝑃𝑟𝑖𝑣}, where 𝑃𝑟𝑜𝑚(V) = {𝑆𝑎𝑓} and 𝐷𝑒𝑚(V) = {𝑃𝑟𝑖𝑔ℎ𝑡, 𝑃𝑟𝑖𝑣}, with the following weights: 𝑤(𝑃𝑟𝑖𝑣) = .6, 𝑤(𝑃𝑟𝑖𝑔ℎ𝑡) = .4 and 𝑤(𝑆𝑎𝑓) = .6. Consider also that accessing items found with an individual has only a baseline impact on privacy, as it would not impact property if there are no property items, nor promote safety if there is no evidence that the individual is related to any criminal offence (zero baseline impact). So we have 𝐼𝑎𝑐𝑐(∅, 𝑃𝑟𝑖𝑣) = .4. Now consider the influence of the factors specified above on the impact of the action on the relevant values: a considerable impact on property rights by accessing property, 𝐼𝑎𝑐𝑐(𝑝𝑟𝑜𝑝, 𝑃𝑟𝑖𝑔ℎ𝑡) = .4; a significant promotion of safety if there is sufficient evidence of a criminal offence for an arrest, 𝐼𝑎𝑐𝑐(𝑎𝑟𝑟, 𝑆𝑎𝑓) = .8; and, as considered by the court, an extreme impact on privacy by accessing personal data stored in a mobile phone, 𝐼𝑎𝑐𝑐(𝑚𝑜𝑏, 𝑃𝑟𝑖𝑣) = 1.

In the example above we have the following balance in the baseline context: 𝐵^{𝑎𝑐𝑐,∅}_𝑃𝑟𝑜𝑚 = 0 against 𝐵^{𝑎𝑐𝑐,∅}_𝐷𝑒𝑚 = .24, thus morally prohibiting police officers from accessing items in the possession of any individual (VR^∅_𝑎𝑐𝑐 = {𝑂(¬𝑎𝑐𝑐)}). In the context of an arrest, we would have 𝐵^{𝑎𝑐𝑐,{𝑎𝑟𝑟}}_𝑃𝑟𝑜𝑚 = .48 against 𝐵^{𝑎𝑐𝑐,{𝑎𝑟𝑟}}_𝐷𝑒𝑚 = .24, thus rendering the access morally permitted (VR^{{𝑎𝑟𝑟}}_𝑎𝑐𝑐 = {𝑃(𝑎𝑐𝑐|𝑎𝑟𝑟)}). If the seizure during the arrest includes property items, the access would still be morally justified, given that we would have 𝐵^{𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝}}_𝑃𝑟𝑜𝑚 = .48 against 𝐵^{𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝}}_𝐷𝑒𝑚 = .40, thus reflecting the previous US case law (that is, VR^{{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝}}_𝑎𝑐𝑐 = {𝑃^𝑣(𝑎𝑐𝑐|𝑎𝑟𝑟 ∧ 𝑝𝑟𝑜𝑝)}). Finally, considering the additional factor that the property item is a mobile phone, the finding of the court in Riley vs California would be morally justified by the model, given that 𝐵^{𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝,𝑚𝑜𝑏}}_𝑃𝑟𝑜𝑚 = .48 against 𝐵^{𝑎𝑐𝑐,{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝,𝑚𝑜𝑏}}_𝐷𝑒𝑚 = 1 (thus leading to VR^{{𝑎𝑟𝑟,𝑝𝑟𝑜𝑝,𝑚𝑜𝑏}}_𝑎𝑐𝑐 = {𝑂^𝑣(¬𝑎𝑐𝑐|𝑎𝑟𝑟 ∧ 𝑝𝑟𝑜𝑝 ∧ 𝑚𝑜𝑏)}).

2.4 Consistency, coherence and Radbruch's formula
Based on the aggregate impacts on the demotion and promotion of the values, which are triggered by an action, we define the proportional impact of an induced value-rule as the difference between the values promoted and demoted in the assessment.

Definition 2.9. (Proportional impact of a rule) Consider an action 𝑥, a set of factors Φ and a set of values V. Then 𝜎(𝑥, Φ, V) = 𝐵^{𝑥,Φ}_𝑃𝑟𝑜𝑚(V) − 𝐵^{𝑥,Φ}_𝐷𝑒𝑚(V) is the proportional impact on the values V of the axiological rules 𝑃^𝑣(𝑥|Φ∧)/𝑂^𝑣(¬𝑥|Φ∧).

Given the definition above, the polarity (positive or negative) of the proportional impact indicates the modality of the value-rule induced by the balance model. If 𝜎(𝑥, Φ, V) > 0, then 𝑃^𝑣(𝑥|Φ). If 𝜎(𝑥, Φ, V) < 0 and 𝐵^{𝑥,Φ}_𝐷𝑒𝑚 > 0, then 𝑂^𝑣(¬𝑥|Φ).

Considering that we may obtain value-rules from the balancing model, we may now evaluate the content of positive rules expressed in a dyadic deontic language with the operator 𝑂^𝑑(𝑥|𝑎) for the positive obligation to do 𝑥 in context 𝑎 and the operator 𝑃^𝑑(𝑥|𝑎) for the permission of action 𝑥 in context 𝑎.
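The balances reported in Example 2.8 can be replayed numerically. A minimal sketch, assuming our own dict encoding of the weights and of the impact function (the pseudo-factor `"base"` stands for the impact function's empty argument):

```python
# Numerical check of the balances reported in Example 2.8. The dict encoding
# of w and I_acc is our own; "base" stands for the impact function's empty argument.
w = {"Priv": .6, "Pright": .4, "Saf": .6}
prom, dem = {"Saf"}, {"Pright", "Priv"}
I = {("base", "Priv"): .4, ("prop", "Pright"): .4, ("arr", "Saf"): .8, ("mob", "Priv"): 1.0}

def balance(factors, side):
    fs = set(factors) | {"base"}                  # the baseline impact is always counted
    return sum(I.get((f, v), 0) * w[v] for f in fs for v in side)

# (Φ, B_Prom, B_Dem) as in the running example:
for phi, b_prom, b_dem in [((), 0, .24), (("arr",), .48, .24),
                           (("arr", "prop"), .48, .40),
                           (("arr", "prop", "mob"), .48, 1.0)]:
    assert round(balance(phi, prom), 2) == b_prom
    assert round(balance(phi, dem), 2) == b_dem
```

Each pair of aggregates matches the value-rule the paper induces at that step: demotion prevails at the baseline, promotion prevails once 𝑎𝑟𝑟 is added, and 𝑚𝑜𝑏 flips the balance back.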
A dynamic model for balancing values ICAIL’21, June 21–25, 2021, São Paulo, Brazil
We may assume a dyadic deontic logic consequence relation ⊢ satisfying inclusion and factual detachment, and use it to derive deontic sentences from two different sets of rules.

Definition 2.10. (consistency) Let 𝑥 ∈ 𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡, and Φ∧ be the conjunction of all factors in Φ. Then a set of rules R is consistent iff
• R ⊬ 𝑂^𝑑(𝑥|Φ∧) ∧ 𝑂^𝑑(¬𝑥|Φ∧) and
• R ⊬ 𝑂^𝑑(¬𝑥|Φ∧) ∧ 𝑃^𝑑(𝑥|Φ∧).

Thus, the postulate of consistency is satisfied iff it is not the case that an action 𝑥 is both forbidden and obligatory or both forbidden and permitted by the normative system. The set of rules may be a set of value-rules induced by the balancing model, VR, or a set of positively enacted rules represented as dyadic deontic sentences, LR.

Definition 2.11. (coherence) Let 𝑥 ∈ 𝐴𝑐𝑡 and Φ ⊆ 𝐹𝑎𝑐𝑡. Then (LR, VR) is coherent iff
• LR and VR are both consistent and
• it is not the case that
– LR ⊢ 𝑂^𝑑(𝑥|Φ∧) and VR ⊢ 𝑂^𝑣(¬𝑥|Φ∧), or
– LR ⊢ 𝑃^𝑑(𝑥|Φ∧) and VR ⊢ 𝑂^𝑣(¬𝑥|Φ∧), or
– LR ⊢ 𝑂^𝑑(𝑥|Φ∧) and VR ⊢ 𝑃^𝑣(¬𝑥|Φ∧)

An interesting application of this model to legal theory consists in the interpretation of a formula about the relation between law and morality originally proposed by Gustav Radbruch [20] to determine the (in)validity of the Nazi laws: laws enacted by proper authority and power are legally valid unless they reach an unbearable degree of immorality or injustice.

In our model, we may define an inference relation upon the basic dyadic deontic logic, considering the proportional impact of the value-rule corresponding to the legal rule. If the positively enacted legal rule 𝑂^𝑑(𝑥|Φ∧)/𝑃^𝑑(𝑥|Φ∧) is inconsistent with the value-rule induced by the value assessment 𝑃^𝑣(𝑥|Φ∧)/𝑂^𝑣(¬𝑥|Φ∧), and the value-rule has a proportional impact that exceeds a given threshold, then we say that the positively enacted rule is unbearably unjustified or immoral and therefore invalid, according to Radbruch's legal theory. The inference relation consists in a restriction of the set of positively enacted rules available to derive a normative solution for a given action. There may be different interpretations of Radbruch's formula leading to different associated legal theories, for instance, a generative formula that not only censors positively enacted rules but also generates valid legal content based on moral considerations (see [17]).

Definition 2.12. (Radbruch's Formula) Let 𝑥 be an action, Φ a set of factors, V a valuation and 𝑟 a threshold index. Then:
• ⊢_𝑟𝑎𝑑 𝑃^𝑑(𝑥|Φ∧) iff
– LR ⊢ 𝑃^𝑑(𝑥|Φ∧) and
– it is not the case that VR ⊢ 𝑂^𝑣(¬𝑥|Φ∧) with |𝜎(𝑥, Φ, V)| ⩾ 𝑟¹;
• ⊢_𝑟𝑎𝑑 𝑂^𝑑(𝑥|Φ∧) iff
– LR ⊢ 𝑂^𝑑(𝑥|Φ∧) and
– it is not the case that VR ⊢ 𝑃^𝑣(¬𝑥|Φ∧) with |𝜎(𝑥, Φ, V)| ⩾ 𝑟

In Example 2.8, suppose we set the Radbruch threshold at .2. Then only the rule permitting police officers to access property items would be legally enforceable based on the evaluation, while all other rules indicated in the example – the permission to access items in the possession of any individual, the prohibition to access items with the arrestee and the permission to access the digital content of the arrestee's mobile phone in an arrest – would be unbearably unjust and therefore not legally enforceable, according to Radbruch's theory.

To illustrate this in the model, consider a set of positively enacted rules LR = {𝑃^𝑑(𝑎𝑐𝑐|⊤), 𝑃^𝑑(𝑎𝑐𝑐|𝑝𝑟𝑜𝑝), 𝑂^𝑑(¬𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟), 𝑃^𝑑(¬𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏)}. Now let us compare it with the rules extracted in the evaluation model, indicating the proportional impact of each in parentheses:

AS𝑎𝑐𝑐 = {𝑂^𝑣(¬𝑎𝑐𝑐) (−.24), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑝𝑟𝑜𝑝) (−.4), 𝑃^𝑣(𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟) (.08), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏) (−.52)}.

Hence, according to the value assessment, all positively enacted rules are morally unjustified. Nevertheless, the only rule which would still be valid according to a threshold of .2 would be the prohibition to search property items during an arrest. Following Alexy, we would say that such a prohibition would be morally defective but still legally valid, while all the others are both morally and legally defective ([3], [4]).

¹ We consider |𝑛| the absolute value of the (positive or negative) number 𝑛.
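The threshold filtering of Definition 2.12 can be sketched on the rules of this example. The tuple encoding of rules and the simplistic `conflicts` test below are our own assumptions, tailored to this example only:

```python
# Sketch of the Radbruch filter (Definition 2.12) on the rules of Example 2.8.
# The string encoding of rules and the 'conflicts' test are our own assumptions.
sigma = {"T": -.24, "prop": -.4, "prop&arr": .08, "prop&arr&mob": -.52}
VR = {"T": "O¬acc", "prop": "O¬acc", "prop&arr": "Pacc", "prop&arr&mob": "O¬acc"}
LR = [("Pacc", "T"), ("Pacc", "prop"), ("O¬acc", "prop&arr"), ("P¬acc", "prop&arr&mob")]

def conflicts(legal, value):
    # a legal rule for acc conflicts with a value-rule for ¬acc, and vice versa
    return ("¬" not in value) if "¬" in legal else ("¬acc" in value)

# a positively enacted rule is filtered out when it conflicts with the induced
# value-rule and the absolute proportional impact reaches the threshold r = .2
valid = [(m, c) for m, c in LR if not (conflicts(m, VR[c]) and abs(sigma[c]) >= .2)]
assert ("O¬acc", "prop&arr") in valid     # conflicting, but impact .08 < .2
assert ("Pacc", "T") not in valid         # unbearably unjust: |-.24| >= .2
```

The prohibition in the context 𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 survives because the opposed value-rule's impact (.08) stays below the threshold, matching the Alexy-style verdict in the text.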
In order to do that, first we are going to define, by induction, an axiological system for an action 𝑥 and a set of factors Φ. Notice that we have to consider two cases for the inductive step. The first case covers the hypothesis where there are no shifts in the balance in the course of the addition of new factors, and accordingly of new value-rules to the axiological system. The second case covers the hypothesis where there was at least one shift of balance (and the corresponding induced value-rule) at some previous step of the construction of the axiological system. For simplicity we write the definition only for those cases where the antecedent step starts with an obligation, but it is easy to adapt it, mutatis mutandis, for those cases where the previous step in the construction of the axiological system delivers a permission for the corresponding constellation of factors.

Definition 3.1. Let 𝑥 be an action, 𝐹𝑎𝑐𝑡 = {𝑓1, 𝑓2, ..., 𝑓𝑚} and Φ ⊆ 𝐹𝑎𝑐𝑡. We define a set of axiological rules by induction on an increasing sequence of subsets Φ0 = ∅, Φ1 = {𝑓1}, Φ2 = {𝑓1, 𝑓2}, ..., Φ𝑛 = Φ. Then the Axiological Normative System for 𝑥, AS𝑥, is inductively defined as follows:
• Basic step (Φ = ∅)
– AS^1_𝑥 = {𝑃^𝑣(𝑥|𝑓1), 𝑂^𝑣(¬𝑥|¬𝑓1)}, if AS^0_𝑥 = {𝑂^𝑣(¬𝑥|⊤)} and the balance shifts
– AS^1_𝑥 = {𝑂^𝑣(¬𝑥|𝑓1)}, otherwise.
• Inductive step with one element in AS^𝑛_𝑥; Φ = {𝑓1, 𝑓2, ..., 𝑓𝑛}
– AS^{𝑛+1}_𝑥 = {𝑃^𝑣(𝑥|(Φ∪{𝑓𝑛+1})∧), 𝑂^𝑣(¬𝑥|(Φ∪{¬𝑓𝑛+1})∧)}, if AS^𝑛_𝑥 = {𝑂^𝑣(¬𝑥|Φ∧)} and the balance shifts
– AS^{𝑛+1}_𝑥 = {𝑂^𝑣(¬𝑥|(Φ∪{𝑓𝑛+1})∧)}, otherwise.
• Inductive step with two elements in AS^𝑛_𝑥; Φ = {𝑓1, 𝑓2, ..., 𝑓𝑛}
– AS^{𝑛+1}_𝑥 = {𝑃^𝑣(𝑥|𝑓1 ∧ ... ∧ ¬𝑓𝑛 ∧ 𝑓𝑛+1), 𝑂^𝑣(¬𝑥|𝑓1 ∧ ... ∧ 𝑓𝑛 ∧ ¬𝑓𝑛+1)}, if AS^𝑛_𝑥 = {𝑂^𝑣(¬𝑥|𝑓1 ∧ ... ∧ 𝑓𝑛), 𝑃^𝑣(𝑥|𝑓1 ∧ ... ∧ ¬𝑓𝑛)} and the balance shifts
– AS^{𝑛+1}_𝑥 = {𝑂^𝑣(¬𝑥|𝑓1 ∧ ... ∧ 𝑓𝑛 ∧ 𝑓𝑛+1), 𝑃^𝑣(𝑥|𝑓1 ∧ ... ∧ ¬𝑓𝑛 ∧ 𝑓𝑛+1)}, otherwise.
Then AS𝑥 = ⋃ AS^𝑖_𝑥, for 0 ≤ 𝑖 ≤ 𝑛.

Notice that the choice of a particular set of factors, i.e., the selection of a set Φ1 ⊆ 𝐹𝑎𝑐𝑡 rather than a different set Φ2 ⊆ 𝐹𝑎𝑐𝑡, may deliver a different axiological system, since the assessment of the action's impact on the values is determined by the factors being considered in each set. Moreover, notice that the particular order in which the new factors are introduced in a given set Φ ⊆ 𝐹𝑎𝑐𝑡 determines a specific path leading to the axiological system based on all factors in Φ ⊆ 𝐹𝑎𝑐𝑡.

Indeed, the sequence AS^1_𝑥, ..., AS^𝑛_𝑥 reflects a strategy of argumentation adopted by the parties, as far as this strategy consists in the introduction of new factors (the removal of factors will be considered in Section 4). For instance, that the items were collected in an arrest is an argument for the justification of a permission for the police to access the content of those items. In its turn, arguing that an item collected is a mobile phone favours the axiological prohibition to access its digital content.

In the precedent discussed in Example 2.8, the sequence ⟨∅, {𝑝𝑟𝑜𝑝}, {𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟}, {𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟, 𝑚𝑜𝑏}⟩ would result in AS𝑎𝑐𝑐 = {𝑂^𝑣(¬𝑎𝑐𝑐), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑝𝑟𝑜𝑝), 𝑃^𝑣(𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ ¬𝑎𝑟𝑟), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ 𝑚𝑜𝑏), 𝑃^𝑣(𝑎𝑐𝑐|𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟 ∧ ¬𝑚𝑜𝑏)}. In its turn, the sequence ⟨∅, {𝑚𝑜𝑏}, {𝑚𝑜𝑏, 𝑝𝑟𝑜𝑝}, {𝑚𝑜𝑏, 𝑝𝑟𝑜𝑝, 𝑎𝑟𝑟}⟩ would result in AS𝑎𝑐𝑐 = {𝑂^𝑣(¬𝑎𝑐𝑐), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑚𝑜𝑏), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑚𝑜𝑏 ∧ 𝑝𝑟𝑜𝑝), 𝑂^𝑣(¬𝑎𝑐𝑐|𝑚𝑜𝑏 ∧ 𝑝𝑟𝑜𝑝 ∧ 𝑎𝑟𝑟)}. Nevertheless, each axiological system resulting from the balancing is consistent.

Theorem 3.2. For every action 𝑥 and Φ ⊆ 𝐹𝑎𝑐𝑡, AS𝑥 is consistent.

Proof. (sketch) By construction, we notice that the basic step AS^0_𝑥 contains a single axiological rule, and also that each step AS^{𝑛+1}_𝑥 preserves the consistency of the extracted rules, since the conditions of the opposed modalities 𝑂¬𝑥 and 𝑃𝑥 are, by construction, mutually exclusive. □

Although different consistent axiological systems AS𝑥 may be built upon the same model V, according to the order of the sets of factors used in the construction, all systems will deliver the same result in the presence of all relevant factors at hand. This holds whether all the steps in the construction have one element or there is some step where two elements are included in the axiological system. The end result may be thought of as the result of an argumentation process, where at each step a new factor is brought about that may or may not change the solution within the balance of values. This property of axiological systems may be shown as a corollary of the following general theorem:

Theorem 3.3. Let Φ0 = ∅, Φ1 = {𝑓1}, Φ2 = {𝑓1, 𝑓2}, ..., Φ𝑛 = Φ be an increasing sequence of subsets of Φ ⊆ 𝐹𝑎𝑐𝑡. Then for every natural number 0 ⩽ 𝑖 ⩽ 𝑛, it holds that:
• if 0 < 𝐵^{𝑥,Φ𝑖}_𝐷𝑒𝑚(V) > 𝐵^{𝑥,Φ𝑖}_𝑃𝑟𝑜𝑚(V), then AS𝑥 ⊢ 𝑂(¬𝑥|Φ𝑖∧)
• otherwise, AS𝑥 ⊢ 𝑃(𝑥|Φ𝑖∧)

Proof. For 𝑛 = 0, the result is immediate. For the inductive step, suppose that we have AS𝑥 ⊢ 𝑂(¬𝑥|Φ𝑘∧) for Φ𝑘 = {𝑓1, 𝑓2, ..., 𝑓𝑘}. Then for Φ𝑘 ∪ {𝑓𝑘+1} it holds, by construction, that if 0 < 𝐵^{𝑥,Φ𝑘∪{𝑓𝑘+1}}_𝐷𝑒𝑚(V) > 𝐵^{𝑥,Φ𝑘∪{𝑓𝑘+1}}_𝑃𝑟𝑜𝑚(V), then AS𝑥 ⊢ 𝑂(¬𝑥|Φ𝑘∧ ∧ 𝑓𝑘+1); otherwise, it holds that AS𝑥 ⊢ 𝑃(𝑥|Φ𝑘∧ ∧ 𝑓𝑘+1). This shows that new additions of factors to the axiological rules never change the previous constellation of literals representing factors, nor the previous order of increasing subsets of factors. The case where AS𝑥 ⊢ 𝑃(𝑥|Φ𝑘∧) follows the same steps. □

As a corollary of Theorem 3.3, the final evaluation and the corresponding induced value-rule do not depend on the order of the subsets of relevant factors used in the construction of the axiological system, since the evaluation is obtained by summing all the differences that each factor makes autonomously.

4 SHIFTING THE BALANCE
The ascription of weights to values and of the influence of factors on the action's impact on values are the key aspects of the balancing model V. The set of relevant factors Φ ⊆ 𝐹𝑎𝑐𝑡 and the set of relevant values V ⊆ 𝑉𝑎𝑙 are the building blocks that determine whether the outcome of an evaluation is an axiological prohibition or an axiological permission.

Hence, provided an impact function and a weight assignment, adding or excluding factors and adding or excluding values from the respective sets considered in the evaluation may shift the balance and therefore change the induced value-rule. As mentioned above, such moves may be thought of as the advancement of reasons or
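The step-by-step construction and the order-independence of the final solution can be replayed on the data of Example 2.8. A minimal sketch, with our own dict encoding (the pseudo-factor `"base"` stands for the impact function's empty argument):

```python
# Sketch of the Definition 3.1 construction on the data of Example 2.8,
# checking the corollary of Theorem 3.3: the final solution is order-independent.
w = {"Priv": .6, "Pright": .4, "Saf": .6}
prom, dem = {"Saf"}, {"Pright", "Priv"}
I = {("base", "Priv"): .4, ("prop", "Pright"): .4, ("arr", "Saf"): .8, ("mob", "Priv"): 1.0}

def solution(factors):
    # the value-rule induced for a constellation of factors (Definition 2.6)
    fs = set(factors) | {"base"}
    b = lambda side: sum(I.get((f, v), 0) * w[v] for f in fs for v in side)
    return "O¬acc" if b(dem) > b(prom) and b(dem) > 0 else "Pacc"

def system(order):
    # one solution per step of the increasing sequence of factor subsets
    return [solution(order[:i]) for i in range(len(order) + 1)]

assert system(["prop", "arr", "mob"]) == ["O¬acc", "O¬acc", "Pacc", "O¬acc"]
assert system(["mob", "prop", "arr"]) == ["O¬acc", "O¬acc", "O¬acc", "O¬acc"]
# different paths, same final evaluation once all relevant factors are present:
assert system(["prop", "arr", "mob"])[-1] == system(["mob", "prop", "arr"])[-1]
```

The two asserted sequences correspond exactly to the two axiological systems spelled out in the text, and their last elements coincide, as Theorem 3.3's corollary predicts.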
arguments for the moral justification or disapproval of a permission or a prohibition of the action at stake.

4.1 Possible shifts in the value assessment of an action
To represent such reason-giving or argument dynamics we shall introduce different change functions that either modify the set of relevant factors or modify the set of values considered in the assessment. In this paper we are not going to study combinations of changes of values with changes in the set of factors.

We shall focus on those modifications that shift the balance between the value-impacts of the action being considered, so that, in virtue of the modification, a prohibited action becomes permissible, or a permissible action becomes prohibited:
• Before the change the action is axiologically prohibited, since its impacts on the values it demotes prevail over its impacts on the values it promotes (0 < 𝐵^{𝑥,Φ}_𝐷𝑒𝑚(V) > 𝐵^{𝑥,Φ}_𝑃𝑟𝑜𝑚(V)); after the change the action is axiologically permitted, since its impacts on the values it promotes prevail over its impacts on the values it demotes (0 < 𝐵^{𝑥,Φ}_𝑃𝑟𝑜𝑚(V) ≥ 𝐵^{𝑥,Φ}_𝐷𝑒𝑚(V)).
• Before the change the action is axiologically permitted, since its impacts on the values it promotes prevail over its impacts on the values it demotes (0 < 𝐵^{𝑥,Φ}_𝑃𝑟𝑜𝑚(V) ≥ 𝐵^{𝑥,Φ}_𝐷𝑒𝑚(V)); after the change the action is axiologically prohibited, since its impacts on the values it demotes prevail over its impacts on the values it promotes (0 < 𝐵^{𝑥,Φ}_𝐷𝑒𝑚(V) > 𝐵^{𝑥,Φ}_𝑃𝑟𝑜𝑚(V)).

For a given evaluation 0 < 𝐵^{𝑥,Φ}_𝐷𝑒𝑚(V) > 𝐵^{𝑥,Φ}_𝑃𝑟𝑜𝑚(V), from which we induce 𝑂^𝑣(¬𝑥|Φ∧), there are different ways to shift the balance. First, it is possible to add new factors to Φ that together attenuate the impact of the action on the demotion of the values considered. If this attenuation effect shifts the balance, either by making the degree of demotion negative or by making it equal or lower than the degree of the values promoted, we call this move a factor-undercutting expansion of the relevant factors. Another move to the same effect consists in adding a set of factors that together intensify the promotion of the relevant values. We call this move a factor-rebutting expansion.

Besides adding factors, one may also contest that a given factor is present in the context, which can be reduced to the exclusion of factors from the set of relevant factors Φ. This might produce a shift of balance in two different ways. First, it could be the exclusion of a set of factors that together intensify the demotion of the values by the action at stake. We call this move a factor-undercutting contraction. On the other hand, the same effect could be obtained if one deletes a set of factors that together attenuate the action's promotion of the relevant values. We call this move a factor-rebutting contraction.

Similar moves may be described that modify the set of values V which are considered relevant to the evaluation. A value-undercutting expansion is an addition of values to V which are demoted by the action through attenuating factors, given the set of factors Φ and both the impact and the weight function. In its turn, a value-rebutting expansion is an addition of values to V which are promoted by the action through intensifying factors. The exclusion of elements of V may also produce a shift in the previous balance, changing the normative solution. For instance, a value-undercutting contraction corresponds to the exclusion of values which are demoted by the action through intensifying factors. In its turn, a value-rebutting contraction corresponds to the exclusion of values which are promoted by the action through attenuating factors.

4.2 Factor expansion and contraction operators
We propose here eight change operators to capture the shifts of balance described above. Four of these operators act upon factors, and four upon values. In this subsection we consider the operators on factors, two of which enlarge the set of available factors (factor expansion operators) and two of which restrict such a set (factor contraction operators).

Definition 4.1. (factor expansion function) We define an expansion function 𝑒 : P(𝐹𝑎𝑐𝑡) −→ P(𝐹𝑎𝑐𝑡) − {∅} such that for all Φ ⊂ 𝐹𝑎𝑐𝑡, Φ ∩ 𝑒(Φ) = ∅.

The first two operators add new attenuators to the "winning" side of the balance between values demoted and promoted by the action. This expansion reduces the extent to which the action demotes or promotes the values at stake, to an extent that is sufficient to invert the original balance.

This can happen in two cases. In the first case, before the revision, the action's impact on the demoted values was greater than its impact on the promoted values. The additional factors reduce the demotion of the demoted values to such an extent that the action's impact on the demoted values becomes smaller than its impact on the promoted values. This means that in the context of the extended set of factors the formerly prohibited action becomes permissible.

In the second case, before the revision, the action's impact on the promoted values was greater than its impact on the demoted values. The additional factors reduce the promotion of the promoted values to such an extent that the action's impact on the promoted values becomes smaller than its impact on the demoted values. This means that in the context of the extended set of factors the formerly permissible action becomes prohibited.

Definition 4.2. (factor undercutting expansion function) Let 𝑥 ∈ 𝐴𝑐𝑡, Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉𝑎𝑙. An expansion function 𝑒 is an undercutting expansion function, denoted by 𝑢𝑒(Φ), iff:
• for 𝜎(𝑥, Φ, V) < 0 and 𝐵^{𝑥,Φ}_𝐷𝑒𝑚(V) > 0
(i) for all 𝑓𝑖 ∈ 𝑒(Φ), 𝐼𝑥(𝑓𝑖, 𝑣𝑖) < 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V)
(ii) 𝜎(𝑥, Φ ∪ 𝑒(Φ), V) ⩾ 0 or 𝐵^{𝑥,Φ∪𝑒(Φ)}_𝐷𝑒𝑚(V) ⩽ 0
• for 𝜎(𝑥, Φ, V) ⩾ 0
(i) for all 𝑓𝑖 ∈ 𝑒(Φ), 𝐼𝑥(𝑓𝑖, 𝑣𝑖) < 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V)
(ii) 𝜎(𝑥, Φ ∪ 𝑒(Φ), V) < 0

On this basis we define the corresponding expansion operator.

Definition 4.3. (factor undercutting expansion operator) Let 𝑢𝑒 be an undercutting expansion function. Then we define an operator + on sets of factors such that Φ+ = Φ ∪ 𝑢𝑒(Φ).

Now we turn to the rebutting expansion, which consists in adding new intensifiers to the "losing" side in the balance between
the action’s impacts on promoted vs demoted values, resulting in a Definition 4.10. (factor rebutting contraction operator) Let 𝑟𝑒 be a
shift of such a balance. factor rebutting contraction function. Then we define and operator
⊖ on sets of factors such that Φ ⊖ = Φ − 𝑟𝑐 (Φ).
Definition 4.4. (factor rebutting expansion function) Let 𝑥 ∈ 𝐴𝑐𝑡,
Φ ⊆ 𝐹𝑎𝑐𝑡 and V ⊆ 𝑉 𝑎𝑙. An expansion function 𝑒 is an rebutting Introduce some examples for each contraction operator.
expansion function, in this case denoted by, 𝑟𝑒 (Φ), iff:
4.3 Value expansion and contraction operators
• for 𝜎 (𝑥, Φ, V) < 0 and 𝐵𝑥,Φ >0 In the previous section we have seen how changes in set of the
𝐷𝑒𝑚 (V)
(i) for all 𝑓𝑖 ∈ 𝑒 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝑃𝑟𝑜𝑚(V) relevant factors may modify the evaluation of an action. This may
(ii) 𝜎 (𝑥, Φ ∪ 𝑒 (Φ), V) ⩾ 0 happen either when some additional factors are considered relevant
• for 𝜎 (𝑥, Φ, V) ⩾ 0 or 𝐵𝑥,Φ <0 or when some previously considered factors are rejected as being
𝐷𝑒𝑚 (V) irrelevant. Similarly, changes in the set of values, which have been
(i) for all 𝑓𝑖 ∈ 𝑒 (Φ), 𝐼𝑥 (𝑓𝑖 , 𝑣𝑖 ) > 0 and 𝑣𝑖 ∈ 𝐷𝑒𝑚(V)
𝑥,Φ∪𝑒 (Φ) taken into account, may modify the valuation of an action. This
(ii) 𝜎 (𝑥, Φ ∪ 𝑒 (Φ), V) < 0 and 𝐵𝐷𝑒𝑚 (V) > 0 may happen when additional values are considered contextually
Definition 4.5. (factor rebutting expansion operator) Let 𝑢𝑒 be an relevant, and thus added into the deliberation, or when some values
undercutting expansion operation. Then we define an operator ⊕ previously assumed to be relevant are discarded in the given context.
on sets of factors such that Φ ⊕ = Φ ∪ 𝑟𝑒 (Φ). Hence, in the following, we define the value change operators
capturing these possible argument moves, which may provoke a
Let us now introduce the two contraction operators. These op- shift in the original balance of the proportional impact of actions
erators shift the balance either by weakening the “winning” side, on values. Let us begin with a value expansion function.
i.e., by deleting intensifying factors for that side (undercutting ex-
pansion), or by strengthening the "losing" side, i.e., by deleting attenuating
factors for that side (rebutting expansion). Let us begin by introducing the
contraction function.

Definition 4.6. (factor contraction function) We define a factor contraction
function c : P(Fact) → P(Fact) − {∅} such that for every Φ ⊂ Fact, c(Φ) ⊆ Φ.

Now we introduce the undercutting contraction operator.

Definition 4.7. (factor undercutting contraction function) Let x ∈ Act, Φ ⊆ Fact
and V ⊆ Val. A contraction function c is an undercutting contraction function,
in this case denoted by uc(Φ), iff:
• for σ(x, Φ, V) < 0 and B^{x,Φ}_{Dem(V)} > 0:
  (i) for all f_i ∈ c(Φ), I_x(f_i, v_i) > 0 and v_i ∈ Dem(V)
  (ii) σ(x, Φ − c(Φ), V) ⩾ 0 or B^{x,Φ−c(Φ)}_{Dem(V)} ⩽ 0
• for σ(x, Φ, V) ⩾ 0:
  (i) for all f_i ∈ c(Φ), I_x(f_i, v_i) > 0 and v_i ∈ Prom(V)
  (ii) σ(x, Φ − c(Φ), V) < 0

Definition 4.8. (factor undercutting contraction operator) Let uc be a factor
undercutting contraction function. Then we define an operator − on sets of
factors such that Φ− = Φ − uc(Φ).

We turn now to the construction of the rebutting contraction operator.

Definition 4.9. (factor rebutting contraction function) Let x ∈ Act, Φ ⊆ Fact
and V ⊆ Val. A contraction function c is a factor rebutting contraction
function, in this case denoted by rc(Φ), iff:
• for σ(x, Φ, V) < 0 and B^{x,Φ}_{Dem(V)} > 0:
  (i) for all f_i ∈ c(Φ), I_x(f_i, v_i) < 0 and v_i ∈ Prom(V)
  (ii) σ(x, Φ − c(Φ), V) ⩾ 0
• for σ(x, Φ, V) ⩾ 0 or B^{x,Φ}_{Dem(V)} < 0:
  (i) for all f_i ∈ c(Φ), I_x(f_i, v_i) < 0 and v_i ∈ Dem(V)
  (ii) σ(x, Φ − c(Φ), V) < 0 and B^{x,Φ−c(Φ)}_{Dem(V)} > 0

Definition 4.11. (value expansion function) We define a value expansion function
ve : P(Val) → P(Val) − {∅} such that for every V ⊆ Val, V ∩ ve(V) = ∅.

The first operator weakens the "winner" in the balance, which compares promoted
versus demoted values, thus provoking a shift in the original balance.

Definition 4.12. (value undercutting expansion function) Let x ∈ Act, Φ ⊆ Fact
and V ⊆ Val. An expansion function ve is a value undercutting expansion function
vue(V) iff:
• for σ(x, Φ, V) < 0 and B^{x,Φ}_{Dem(V)} > 0:
  (i) for all v_i ∈ ve(V), I_x(f_i, v_i) < 0 and v_i ∈ Dem(V)
  (ii) σ(x, Φ, V ∪ ve(V)) ⩾ 0 or B^{x,Φ}_{Dem(V ∪ ve(V))} ⩽ 0
• for σ(x, Φ, V) ⩾ 0 or B^{x,Φ}_{Dem(V)} < 0:
  (i) for all v_i ∈ ve(V), I_x(f_i, v_i) < 0 and v_i ∈ Prom(V)
  (ii) σ(x, Φ, V ∪ ve(V)) < 0

Definition 4.13. (value undercutting expansion operator) Let vue be a value
undercutting expansion function. Then we define an operator ⊳ on sets of values
such that V⊳ = V ∪ vue(V).

Now we define the operator which strengthens the "opponent" in order to shift
the balance.

Definition 4.14. (value rebutting expansion function) Let x ∈ Act, Φ ⊆ Fact and
V ⊆ Val. A value expansion function ve is a value rebutting expansion function
vre(V) iff:
• for σ(x, Φ, V) < 0 and B^{x,Φ}_{Dem(V)} > 0:
  (i) for all v_i ∈ ve(V), I_x(f_i, v_i) > 0 and v_i ∈ Prom(V)
  (ii) σ(x, Φ, V ∪ ve(V)) ⩾ 0
• for σ(x, Φ, V) ⩾ 0 or B^{x,Φ}_{Dem(V)} < 0:
  (i) for all v_i ∈ ve(V), I_x(f_i, v_i) > 0 and v_i ∈ Dem(V)
  (ii) σ(x, Φ, V ∪ ve(V)) < 0 and B^{x,Φ}_{Dem(V ∪ ve(V))} > 0

Definition 4.15. (value rebutting expansion operator) Let vre be a value
rebutting expansion function. Then we define an operator ⊲ on sets of values
such that V⊲ = V ∪ vre(V).
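The factor-deletion machinery above can be illustrated numerically. The sketch below is our own toy encoding, not the authors' formalism: we assume σ(x, Φ, V) is a simple additive balance in which each value has a base impact (positive if promoted, negative if demoted) and each factor adds a signed contribution to a value, and we search greedily for a set of factors whose deletion flips a negative balance, in the spirit of the first case of Definition 4.7. All names and data shapes (`sigma`, `undercutting_contraction`, the dictionaries) are illustrative assumptions.

```python
# Toy additive balance, an illustration only (the paper leaves sigma abstract).
def sigma(impact, contrib, factors, values):
    """Additive balance of the action over values V given factors Phi."""
    total = 0.0
    for v in values:
        weight = impact[v]                      # base promotion (>0) / demotion (<0)
        for f in factors:
            weight += contrib.get((f, v), 0.0)  # factor's signed contribution to v
        total += weight
    return total

def undercutting_contraction(impact, contrib, factors, values):
    """Greedily pick a set c(Phi) of factors whose deletion makes the balance
    non-negative (cf. Definition 4.7, first case); None if not applicable."""
    phi, removed = set(factors), set()
    while sigma(impact, contrib, phi, values) < 0:
        # candidates: factors that worsen the demotion of some demoted value
        cands = [f for f in phi
                 if any(impact[v] < 0 and contrib.get((f, v), 0.0) < 0
                        for v in values)]
        if not cands:
            return None   # no undercutting contraction applicable in this model
        worst = min(cands, key=lambda f: sum(contrib.get((f, v), 0.0)
                                             for v in values))
        phi.remove(worst)
        removed.add(worst)
    return removed

impact = {"privacy": -1.0, "security": 2.0}   # privacy demoted, security promoted
contrib = {("arrest", "privacy"): -2.5}       # 'arrest' intensifies the demotion
print(sigma(impact, contrib, {"arrest"}, ["privacy", "security"]))    # -1.5
print(undercutting_contraction(impact, contrib, {"arrest"},
                               ["privacy", "security"]))              # {'arrest'}
```

Deleting the intensifying factor restores a non-negative balance (1.0), so the contraction is applicable in this toy model.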
A dynamic model for balancing values ICAIL’21, June 21–25, 2021, São Paulo, Brazil
Now we are going to introduce the contraction operators that subtract values
from the original set of relevant values used in the balance. The balance shifts
either by deleting values influenced by intensifying factors in the "winner"
side of the balance between demoted and promoted values (undercutting
contraction) or by strengthening the "opponent", that is, by deleting values
which are affected by attenuating factors in the side that "lost" the balance
(rebutting contraction). Let us begin by introducing the value contraction
function.

Definition 4.16. (value contraction function) Let V be a set of values; then a
value contraction function is a function vc : P(Val) → P(Val) − {∅} such that
vc(V) ⊆ V.

First we introduce the value undercutting contraction operator.

Definition 4.17. (value undercutting contraction function) Let x ∈ Act, Φ ⊆ Fact
and V ⊆ Val. A value contraction function vc is a value undercutting contraction
function vuc(V) iff:
• for σ(x, Φ, V) < 0 and B^{x,Φ}_{Dem(V)} > 0:
  (i) for all v_i ∈ vc(V), I_x(f_i, v_i) > 0 and v_i ∈ Dem(V)
  (ii) σ(x, Φ, V − vc(V)) ⩾ 0 or B^{x,Φ}_{Dem(V − vc(V))} ⩽ 0
• for σ(x, Φ, V) ⩾ 0 or B^{x,Φ}_{Dem(V)} < 0:
  (i) for all v_i ∈ vc(V), I_x(f_i, v_i) > 0 and v_i ∈ Prom(V)
  (ii) σ(x, Φ, V − vc(V)) < 0

Definition 4.18. (value undercutting contraction operator) Let vuc be a value
undercutting contraction function. Then we define an operator ÷ on sets of
values such that V÷ = V − vuc(V).

Finally, we turn to the construction of the value rebutting contraction
operator.

Definition 4.19. (value rebutting contraction function) Let x ∈ Act, Φ ⊆ Fact
and V ⊆ Val. A value contraction function vc is a value rebutting contraction
function vrc(V) iff:
• for σ(x, Φ, V) < 0 and B^{x,Φ}_{Dem(V)} > 0:
  (i) for all v_i ∈ vc(V), I_x(f_i, v_i) < 0 and v_i ∈ Prom(V)
  (ii) σ(x, Φ, V − vc(V)) ⩾ 0
• for σ(x, Φ, V) ⩾ 0 or B^{x,Φ}_{Dem(V)} < 0:
  (i) for all v_i ∈ vc(V), I_x(f_i, v_i) < 0 and v_i ∈ Dem(V)
  (ii) σ(x, Φ, V − vc(V)) < 0

Definition 4.20. (value rebutting contraction operator) Let vrc be a value
rebutting contraction function. Then we define an operator ⊗ on sets of values
such that V⊗ = V − vrc(V).

Remark. The functions of undercutting expansion, rebutting expansion,
undercutting contraction, rebutting contraction, value undercutting expansion,
value rebutting expansion, value undercutting contraction, and value rebutting
contraction defined above may not exist, depending on the model V assumed. If
the conditions to apply these functions hold in the assumed model V, we say that
the function is applicable in the model V.

4.4 Change operations on axiological systems
The change operators we have developed so far modify the set of relevant factors
or the set of relevant values, provoking, if the conditions for application of
each function are present in a particular model, a shift in the solution
provided by the axiological system AS_x in the modified constellation of factors
or values considered in the evaluation.

If those conditions are present and the functions apply, provided that the
sequence of factors of the original set of factors is preserved, then a success
and an inclusion result may be shown.

Theorem 4.21. Let AS^x_{Φ,V} be an axiological system² based on model V, e an
expansion function and c a contraction function; then:
i. AS^x_{Φ,V} ⊂ AS^x_{Φ+,V} and, if e is a ue applicable in V, then either
   AS^x_{Φ+,V} ⊢ O_v(x | Φ) ∧ P_v(x | Φ+) or AS^x_{Φ+,V} ⊢ P_v(x | Φ) ∧ O_v(x | Φ+)
ii. AS^x_{Φ,V} ⊂ AS^x_{Φ⊕,V} and, if e is a re applicable in V, then either
   AS^x_{Φ⊕,V} ⊢ O_v(x | Φ) ∧ P_v(x | Φ⊕) or AS^x_{Φ⊕,V} ⊢ P_v(x | Φ) ∧ O_v(x | Φ⊕)
iii. AS^x_{Φ−,V} ⊂ AS^x_{Φ,V} and, if c is a uc applicable in V, then if
   AS^x_{Φ,V} ⊢ O_v(x | Φ), then AS^x_{Φ−,V} ⊢ P_v(x | Φ−), or, if
   AS^x_{Φ,V} ⊢ P_v(x | Φ), then AS^x_{Φ−,V} ⊢ O_v(x | Φ−)
iv. AS^x_{Φ⊖,V} ⊂ AS^x_{Φ,V} and, if c is an rc applicable in V, then if
   AS^x_{Φ,V} ⊢ O_v(x | Φ), then AS^x_{Φ⊖,V} ⊢ P_v(x | Φ⊖), or, if
   AS^x_{Φ,V} ⊢ P_v(x | Φ), then AS^x_{Φ⊖,V} ⊢ O_v(x | Φ⊖)

Proof. Straightforward from theorem 3.3. □

Changes in the set of values may determine changes in the rules of the
axiological system. For instance, in the precedent discussed in example 2.8, the
sequence ⟨∅, {prop}, {prop, arr}, {prop, arr, mob}⟩ would deliver the following
axiological system AS_acc:

{O_v(¬acc), O_v(¬acc | prop), P_v(acc | prop ∧ arr), O_v(¬acc | prop ∧ ¬arr),
O_v(¬acc | prop ∧ arr ∧ mob), P_v(acc | prop ∧ arr ∧ ¬mob)}

Suppose one argues that the context of an arrest also strongly impacts the
promotion of national security Sec, with a proportional impact
B^{acc}_{{prop,arr,mob},{Sec}} = .6; then we would have the following
axiological system AS_acc:

{O_v(¬acc), O_v(¬acc | prop), P_v(acc | prop ∧ arr), O_v(¬acc | prop ∧ ¬arr),
P_v(acc | prop ∧ arr ∧ mob), O_v(acc | prop ∧ ¬arr ∧ mob)}

Hence inclusion simpliciter does not hold. But considering that the modification
is going to occur in a particular step of the evaluation, we have a qualified
form of inclusion where AS^x_{n−1,V} ⊂ AS^x_{n,V∗}, for some n in the sequence
of the subsets of factors considered, and where ∗ may be any value-change
operator here defined. For convenience, we have denoted AS^x_n as AS^x_{n,V} in
order to express the change of the set of relevant values.

Theorem 4.22. Let AS^x_{Φ,V} be an axiological system³ based on model V, ve a
value expansion function and vc a value contraction function; then:
i. if ve is a vue or a vre function applicable in V, then if
   AS^x_{Φ+,V} ⊢ O_v(x | Φ) then AS^x_{Φ+,V∗} ⊢ P_v(x | Φ), and if
   AS^x_{Φ+,V} ⊢ P_v(x | Φ), then AS^x_{Φ+,V∗} ⊢ O_v(x | Φ), for ∗ ∈ {⊳, ⊲}

² We are including V in the notation of AS^x_Φ for convenience. Also, to avoid
clutter with superscripts in the notation, we are going to substitute Φ for ⋀Φ.
³ We are including V in the notation of AS^x_Φ for convenience.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Maranhão, de Souza and Sartor
ii. if vc is a vuc or a vrc applicable in V, then if AS^x_{Φ+,V} ⊢ O_v(x | Φ)
   then AS^x_{Φ+,V∗} ⊢ P_v(x | Φ), and if AS^x_{Φ+,V} ⊢ P_v(x | Φ), then
   AS^x_{Φ+,V∗} ⊢ O_v(x | Φ), for ∗ ∈ {÷, ⊗}

Proof. Straightforward from theorem 3.3. □

5 FINAL REMARKS
The additive model we propose makes strong assumptions about the behaviour of
factors as reasons for the evaluation of the action at stake. It assumes an
atomist conception to the effect that a factor that is a reason for the action
in one case always remains a reason when new factors (reasons) are considered.
Also, a factor always keeps the same polarity and the same contribution in terms
of impact on each value considered. That is, not only is the action invariant
with respect to each value V_i in the assessment, but the factors also retain
the same contribution, with the same polarity and intensity (each intensifies or
attenuates to a given degree the action's promotion or demotion of the value
V_i), in the assessment of new evaluations considering different constellations
of factors. So, the model assumes a strong atomist conception in the assessment
of the moral evaluation of actions and, consequently, of the moral evaluation of
positively enacted rules.

Different conceptions are possible, relaxing the atomist assumptions to
different degrees. For instance, a multiplicative model may make the
contribution of a factor void if a new factor with proportional impact 0 is
considered ([14]). One may also assume a model where the intensity of the
contributions of factors may vary or invert polarity, changing from an
intensifier to an attenuator of the impact of the action on a value, or vice
versa. Or even the polarity of the impact of an action may change: for instance,
an action that, given a constellation of factors, promotes a value may demote
the same value given a new constellation of factors.

Such steps would move the model towards a holist conception where a feature
which is a reason for or against the action in one case may not be a reason, or
may be an opposite reason, in another case, and where each contribution to the
evaluation is variant and contextual. Holist conceptions are usually, although
not necessarily, connected to moral particularism, according to which moral
evaluations of actions do not depend on the subsumption of moral principles, as
opposed to moral universalism, according to which moral judgment is
intrinsically connected to the instantiation of moral principles ([11]).

We shall explore such variations in future developments of the model.

In this paper we have limited ourselves to using the model to induce value-rules
and build axiological systems, which may be assumed as premises in any inference
system of dyadic deontic logic satisfying some conditions. Hence another
relevant path to explore is to embed the model into an inferential system [7].
We see two interesting alternatives. One of them is to follow the statutory
interpretation line of research in AI & Law and embed the model in a deontic
logic with revision operators applied to the logical consequences of normative
systems in the AGM style ([1]). Steps in this direction have been made by [18].
The other is to explore the resemblance of the revision operators and the
axiological system to an argumentation structure, and to embed the model in, or
explore its connections with, a logic of defeasible argumentation. Finally, the
model could be extended in order to compare different and independent actions,
or to evaluate whether positively enacted rules regarding different and
independent actions are justified.

ACKNOWLEDGMENTS
Juliano Maranhão acknowledges the support by the Fundação de Apoio à Pesquisa do
Estado de São Paulo (FAPESP 2019/07665-4) and the IBM Corporation to the Center
for Artificial Intelligence (C4AI/USP). Giovanni Sartor has been supported by
the H2020 European Research Council (ERC) Project "CompuLaw" (G.A. 833647).

REFERENCES
[1] Carlos E. Alchourrón, Paul Gärdenfors, and David Makinson. 1985. On the
Logic of Theory Change: Partial Meet Contraction and Revision Functions. Journal
of Symbolic Logic 50 (1985), 510–530.
[2] R. Alexy. 2002. A Theory of Constitutional Rights. Oxford University Press.
[3] Robert Alexy. 2003. The Argument from Injustice. A Reply to Legal
Positivism. Oxford University Press.
[4] Robert Alexy. 2010. The Dual Nature of Law. Ratio Juris 23 (2010), 167–182.
[5] Aharon Barak. 2005. Purposive Interpretation in Law. Princeton University
Press.
[6] T. J. M. Bench-Capon and G. Sartor. 2003. A Model of Legal Reasoning with
Cases Incorporating Theories and Values. Artificial Intelligence 150 (2003),
97–142.
[7] Christoph Benzmüller, David Fuenmayor, and Bertram Lomfeld. 2021. Encoding
Legal Balancing: Automating an Abstract Ethico-Legal Value Ontology in
Preference Logic. arXiv:2006.12789 [cs.AI]
[8] D. H. Berman and C. D. Hafner. 1993. Representing Teleological Structure in
Case-based Reasoning: The Missing Link. In Proceedings of the Fourth
International Conference on Artificial Intelligence and Law (ICAIL). ACM, 50–59.
[9] G. Boella, G. Governatori, A. Rotolo, and L. van der Torre. 2010. Lex minus
dixit quam voluit, lex magis dixit quam voluit: A formal study on legal
compliance and interpretation. In AICOL-I/IVR-XXIV'09: Proceedings of the 2009
International Conference on AI Approaches to the Complexity of Legal Systems.
Springer, 162–183.
[10] G. Boella, L. van der Torre, and G. Pigozzi. 2016. AGM Contraction and
Revision of Rules. Journal of Logic, Language and Information 25 (2016),
273–297.
[11] J. Dancy. 2004. Ethics Without Principles. Oxford University Press.
[12] M. Grabmair. 2017. Predicting Trade Secret Case Outcomes using Argument
Schemes and Learned Quantitative Value Effect Tradeoffs. In Proceedings of
ICAIL-2017. ACM, 89–98.
[13] Herbert L. A. Hart. 1994. The Concept of Law (2nd ed.). Oxford University
Press.
[14] Shelly Kagan. 1988. The Additive Fallacy. Ethics 99 (1988), 5–31.
[15] J. Maranhão. 2001. Refinement: A Tool to Deal with Inconsistencies. In
Proceedings of the Eighth International Conference on AI and Law (ICAIL 2001).
ACM Press, 52–59.
[16] J. Maranhão. 2017. A Logical Architecture for Dynamic Legal Interpretation.
In Proceedings of the Sixteenth International Conference on AI and Law
(ICAIL '17). ACM Press, 129–138.
[17] J. S. A. Maranhão and G. Sartor. [n.d.]. Interpretive Normative Systems. In
15th International Conference on Deontic Logic and Normative Systems (DEON
2020/2021), F. Liu, A. Marra, P. Portner, and F. Van de Putte (Eds.). College
Publications.
[18] Juliano Maranhão and Giovanni Sartor. 2019. Value Assessment and Revision
in Legal Interpretation. In Proceedings of the 17th International Conference on
Artificial Intelligence and Law (ICAIL 2019). Association for Computing
Machinery, New York, NY, USA, 219–223. https://doi.org/10.1145/3322640.3326709
[19] H. Prakken, A. Wyner, T. Bench-Capon, and K. Atkinson. 2015. A
Formalisation of Argumentation Schemes for Legal Case-based Reasoning in ASPIC+.
Journal of Logic and Computation 25 (2015), 1141–1166.
[20] G. Radbruch. 2006. Statutory Lawlessness and Supra-Statutory Law. Oxford
Journal of Legal Studies 6 (2006), 1–11. (1st ed. 1946.)
[21] G. Sartor. 2013. The Logic of Proportionality: Reasoning with Non-Numerical
Magnitudes. German Law Journal (2013), 1419–1457.
[22] G. Sartor. 2018. Consistency in Balancing: From Value Assessments to
Factor-based Rules. In Proportionality in Law: An Analytical Perspective, D.
Duarte and S. Sampaio (Eds.). Springer, 121–136.
[23] Douglas Walton, Giovanni Sartor, and Fabrizio Macagno. 2018. Statutory
Interpretation as Argumentation. In Handbook of Legal Reasoning and
Argumentation, G. Bongiovanni, G. Postema, A. Rotolo, G. Sartor, C. Valentini,
and D. Walton (Eds.). Springer, 519–560.
Case-level Prediction of Motion Outcomes in Civil Litigation
Devin J. McConnell (devin.mcconnell@uconn.edu) and James Zhu (james.zhu@uconn.edu)
Department of Computer Science, University of Connecticut, Storrs, Connecticut, USA
ICAIL’21, June 21–25, 2021, São Paulo, Brazil McConnell et al.
Second, since court outcomes vary with attorney quality and client resources,
predictions using ML may reduce the litigation disadvantages faced by the poor,
racial minorities, and other vulnerable groups [20]. These problems are
exacerbated by the overwhelming workloads faced by civil courts and lawyers.
Increased workloads contribute to workplace strain, which has detrimental
effects on the ability to function effectively [40] and, for lawyers, a
decreased perception of the ability to uphold the law [12]. However, to ensure
widespread acceptability and trustworthiness of algorithmic decisions [60],
models must be accurate and explainable to all parties.

Methods for characterizing judicial decisions in previous work have focused on
court opinions, ignoring the many important formal procedures that lead to a
final judgment. In this work, we present the first client and lawyer support
methods that predict court outcomes at the level of individual motions. Motions
are formal requests to judges for an official ruling on a contested issue. They
can be submitted before, during, or after the trial and can have a significant
impact on the final disposition of the lawsuit. For example, motions to strike
petition for the removal of all or a subset of the opposing party's pleading.
While many court documents relevant to motion outcomes are filed over the course
of legal proceedings, we focus on complaint documents because (a) they contain
the facts alleged and legal claims asserted and (b) they are available to all
parties when a lawsuit begins and can therefore be used to support decision
making.

We present a general overview of computational prediction of litigation outcomes
and our contributions in Section 2. Section 3 provides details of our methods,
where we describe the court administrative data and legal documents as well as
our approach to feature engineering and predictive modelling. We present results
on approximately 15 years of Connecticut civil case data in Section 4, followed
by a discussion and conclusions in Sections 5 and 6.

2 BACKGROUND
Predicting legal outcomes has traditionally been the purview not only of
practicing lawyers but also of researchers of judicial behavior in law,
political science, and, recently, computer science [37]. Early efforts to
computationally model legal decision making focused on representations of rules
obtained from case law and legislation [68, 32, 64]. When legal decisions can be
modelled as a deterministic process, rule-based AI has achieved considerable
success [16, 23]. More recently, ML has made a large impact on the research and
practice of law in general, and on predictive litigation analysis specifically
[2, 3, 28, 63].

Several efforts are focused on extracting standardized data sets to support ML
in law. The Supreme Court Database contains over two hundred years of U.S.
Supreme Court cases, each containing hundreds of variables [70]. The CASELAW4
data set contains 350,000 common law judicial decisions extracted from US state
appellate courts [62]. The University of Oxford is constructing a database of
100,000 US court case decisions with features that include the facts of the
case, judgements, location, timing, and judicial opinions [24]. These and other
similar works [77] provide benchmarks that will accelerate the use of ML in
litigation, much as CIFAR [44] and MNIST [49] did for image classification.

The most prevalent features used by ML methods to model court decisions are
extracted from legal documents. Legal documents, typically court records,
judicial opinions, and legislation, are difficult to model due to their high
dimensionality. For instance, modelling documents as bags-of-words, which treats
a document as an unordered multiset of words, is commonly assumed in
applications like text classification [43] and topic modelling [9]. With this
simplifying assumption, the dimension of a document is proportional to the
vocabulary size, |V|, which is prohibitively large for many ML methods. When the
ordering of words in a document is considered, the dimensionality grows
exponentially. Therefore, methods typically seek lower dimensional
representations of legal documents that preserve relevant structure of the
underlying text.

Early legal document representations focused on summary statistics, like word
length [8], or other metadata including document complexity [22], publication
date, and amendment counts [47]. Knowledge representations of the dyadic
citation relationships between documents are typically modelled as citation
networks, where vertices correspond to legal documents and a directed edge
exists from document A to document B if A cites B [27]. Important features can
be extracted from the connectivity structure of legal citation networks; e.g.,
directed paths can be interpreted as chains of legal precedent, while network
centrality and in-degree can indicate case importance.

More recently, representations from natural language processing (NLP) have been
used to compute richer representations of legal documents. Term
frequency–inverse document frequency (TF-IDF) is a statistic computed for a word
w and document d:

    f_T(w, d) = f_d(w) · log( |D| / Σ_{d′ ∈ D} 1(w ∈ d′) )

where d ∈ D is a document in corpus D, f_d(w) is the frequency of word w in
document d, and 1(w ∈ d) is an indicator function that is 1 if w appears in d
and 0 otherwise. TF-IDF is often used as a corpus-specific importance weighting
for words [80].

State-of-the-art language embedding models have seen recent success by providing
lower, d-dimensional embeddings of words and documents, where d ≈ 10² ≪ |V|.
These architectures are pre-trained on large general corpora and then either
applied directly to legal documents or fine-tuned to legal-specific applications
using transfer learning. The law2vec model is a neural embedding architecture
based on word2vec [56] that was pre-trained on a large legal corpus consisting
mostly of legislative documents [14]. Taking advantage of recent developments in
NLP, some researchers have used transfer learning techniques from pre-trained
transformer models [21] to classify U.S. Fourth Amendment cases [35].

Machine learning methods for predicting court outcomes have, thus far, been
mostly trained using judicial opinions. One study developed a random forest
classifier to predict over 240,000 justice votes and about 28,000 case outcomes
for the U.S. Supreme Court from 1816 through 2015 [41]. The method predicted
court decisions with 70.2% accuracy and justice votes with 71.9% accuracy. By
comparison, legal experts at best accurately predicted about 66% of the outcomes
in sixty-eight cases argued in the U.S. Supreme Court's 2002 Term [65]. It is
common practice to use a court's past decisions to predict its future decisions,
as was done with data from the European Court of Human Rights [54]. French
Supreme
Figure 1: Motion prediction pipeline overview. Two sets of features were
computed from the State of Connecticut Judicial Branch court administrative
database (minimal and subset, Table 1) and combined with natural language
features extracted from complaint documents using word2vec, TF-IDF, and a
rule-based algorithm. We optimized the hyperparameters of six ML models using
grid search. Here, grid search is described in two dimensions, where the circles
denote parameter configurations and the curves on each axis denote the marginal
classification accuracy. In this example, classification accuracy has higher
variability across values of parameter 1, which primarily determines the choice
of the best parameter setting (red point). A toy decision tree on the attorney
specialization and major code features is shown under our six ML models.
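The grid-search step of the pipeline can be sketched in a few lines. This is a toy, stdlib-only illustration of exhaustive hyperparameter search, not the authors' code: the hypothetical `threshold_classifier` stands in for the paper's SVM/XGBoost models, and every configuration in the grid is scored on a validation set.

```python
import itertools

def threshold_classifier(threshold, flip):
    """Hypothetical 1-d model: predict class 1 when x > threshold (inverted if flip)."""
    return lambda x: int((x > threshold) != flip)

def grid_search(grid, X_val, y_val):
    """Score every configuration in the Cartesian product of the grid."""
    best_params, best_acc = None, -1.0
    for params in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), params))
        model = threshold_classifier(**cfg)
        acc = sum(model(x) == y for x, y in zip(X_val, y_val)) / len(y_val)
        if acc > best_acc:
            best_params, best_acc = cfg, acc
    return best_params, best_acc

grid = {"threshold": [0.0, 0.5, 1.0], "flip": [False, True]}
X_val = [0.2, 0.4, 0.9, 1.3]
y_val = [0, 0, 1, 1]
print(grid_search(grid, X_val, y_val))  # ({'threshold': 0.5, 'flip': False}, 1.0)
```

In practice one would wrap real estimators in the same loop (or use a library grid-search utility); the marginal-accuracy curves in Figure 1 correspond to averaging these scores along each parameter axis.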
Court decisions have been modelled from historical rulings data using a linear
support vector machine (SVM) classifier, assuming a bag-of-words representation
for the rulings documents [73]. Importantly, these methods require data from
decisions and opinions, which distinguishes them from other uses of ML for
real-time litigation support in court cases [25, 57]. Other tools, like
MyOpenCourt, provide an AI platform directly to clients for answering legal
questions [45]. While these tools do provide real-time support for decision
making, the focus is on data mining and learning legal recommendations to
support self-represented litigants, not predictive analytics.

2.1 Our Contributions
Existing approaches to predicting court case outcomes focus on the final
judgements of a trial and rely on retrospective court data from appellate or
national court decisions, and thus are not amenable to informing motion-level
decision making [4, 41, 54, 73]. In this work, we contribute to the field of
legal analytics in several ways:
(1) we define new lower dimensional features to support predictive modelling at
the case level;
(2) we develop and benchmark the first computational pipeline to assist lawyer
and client decision making through the prediction of motion outcomes in district
court data (Figure 1);
(3) we analyze the predictability of motions to strike using both court
administrative data and natural language features extracted from complaint
documents;
(4) we make this pipeline (code, trained models, evaluation scripts) freely
available and open-source.

In this work, we consider k = 2, where 1 and 2 correspond to motion denied and
granted, respectively. Let the observed data be (x_1, ..., x_i, ..., x_n) = X ∈
ℝ^{n×p}, an n × p matrix of n civil court cases each containing p covariates.
Note that X can, in general, contain real, nominal, ordinal, or integer valued
variables. Given a training set (X, Y), the goal is to build a model that
predicts class labels Ȳ_test from held-out test data X_test so as to maximize
the classification accuracy (TP + TN) / |Y_test|, where TP and TN are the
numbers of true positives and true negatives, respectively.

3.1 Connecticut Civil Court Data
The values for X and Y were collected from the State of Connecticut Judicial
Branch, which provides access to materials such as public records and court case
documents, as well as researcher access to their civil court administrative data
[71]. The court administrative data is populated by courthouse staff and stored
in a centralized relational database, which was rebuilt locally in MySQL. Court
case documents are scanned and made available through the Judicial Branch Law
Library API.

We focus on predicting the outcomes for motions to strike. A motion to strike
has significant influence on case outcomes and is therefore an important factor
in legal decision making. In Connecticut, a motion to strike is a written
petition, typically from a defendant to a judge, to remove part or all of a
plaintiff's complaint allegations based on legal insufficiency. While we
restrict our attention to civil cases filed in Connecticut, the methods apply
more generally to other states as long as minimal docket information and
complaint documents are available.
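The TF-IDF statistic defined in Section 2 is simple to compute directly. The sketch below is our own minimal illustration on an invented toy corpus, not the pipeline's implementation, which operates on the Connecticut complaint documents.

```python
import math

def tf_idf(word, doc, corpus):
    """f_T(w, d) = f_d(w) * log(|D| / df(w)), df(w) = docs containing w."""
    tf = doc.count(word)                    # raw frequency of word in doc
    df = sum(word in d for d in corpus)     # document frequency of word
    return tf * math.log(len(corpus) / df)  # undefined when df == 0

corpus = [
    ["motion", "to", "strike", "granted"],
    ["motion", "denied"],
    ["complaint", "negligence", "motion"],
]
print(tf_idf("motion", corpus[0], corpus))  # 0.0 -- appears in every document
print(tf_idf("strike", corpus[0], corpus))  # 1 * log(3/1), about 1.0986
```

Words that occur in every document get weight 0, while rare, document-specific words are weighted up, which is why TF-IDF serves as a corpus-specific importance weighting.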
categorical variables with a number of levels proportional to the Branch website using custom crawling scripts [71]. If a PDF con-
size of the data. Therefore, we developed custom SQL scripts to tained text, we used pdftotext (version 0.26.5) to convert the PDF to
extract 3 informative features based on domain expertise and low a text file (see Supplemental Methods for additional details). If a PDF
missing data rates (< 0.6% missingness). In total, we considered contained an image, we first converted the PDF to a TIFF file using
four court administrative features: juris number, major code, case ImageMagick (version 6.9.10-68 Q16) [72]. Then, we converted the
location, and attorney specialization (Table 1). The juris number is a TIFF to text using tesseract (version 4.1.1-rc2-20-g01fb) [69]. The
unique identifier for the attorney or firm representing the defendant. tesseract optical character recognition (OCR) engine is based on
The major code represents the case type encoded as a Bernoulli LSTM neural networks and maintained by Google. It has shown to
variable for tort or vehicular cases. The case location encodes a 15 have high accuracy on machine-written characters and black and
dimensional categorical variable denoting the Connecticut superior white images, both of which categorize complaint documents [75].
court location for the case.
3.3.1 Rule-based features for complaint documents. We also con-
The attorney specialization is derived based on the entropy of
sider algorithmically generated natural language features based
the case type (i.e., major case code) distribution for each attorney.
on a sequential covering rule generating algorithm [1]. A rule 𝑅
Formally, let the number of different major case codes (e.g. tort
maps a condition (antecedent) to a class (consequent). Here, an-
or vehicular) associated with an attorney be 𝑚 and the counts of
tecedents are a conjunction of Boolean conditions indicating the
cases litigated by an attorney in each major case code be 𝑤 =
presence of a word in a complaint document and the consequent is
(𝑤 1, . . . ,𝑤𝑚 ). Then, we model the case counts 𝑤 for attorney 𝑎 as
motion granted or denied. For example, the rule (𝑐𝑎𝑟 ∈ 𝐷) AND
a multinomial distribution with a Dirichlet prior,
(𝑎𝑐𝑐𝑖𝑑𝑒𝑛𝑡 ∈ 𝐷) AND (𝑛𝑒𝑔𝑙𝑖𝑔𝑒𝑛𝑐𝑒 ∈ 𝐷) ⇒ 1 would map a motion
𝑤𝑎 ∼𝑀𝑢𝑙𝑡𝑖𝑛𝑜𝑚𝑖𝑎𝑙 (𝜃 𝑎 ) to strike associated with complaint 𝐷 to denied if the words “car”,
𝜃 𝑎 ∼𝐷𝑖𝑟𝑖𝑐ℎ𝑙𝑒𝑡 (𝛼 1, . . . ,𝛼𝑚 ) “accident”, and “negligence” are contained within 𝐷.
The sequential covering algorithm proceeds by learning a rule
In this work, we set (𝛼 1, . . . ,𝛼𝑚 ) = 1 . After observing 𝑤𝑎 , the
that maximizes some function. Then, the rule is added to a list and
posterior is a Dirichlet-multinomial distribution
all of the complaint documents covered by this rule are removed
𝑃 (𝜃 𝑎 |𝑤𝑎 ) = 𝐷𝑖𝑟 (𝛼 1 + 𝑤𝑎1, . . . ,𝛼𝑚 + 𝑤𝑎𝑚 ) from the data. This process is repeated until all documents are
We compute the specialization for attorney 𝑎 and major case code removed or sufficient coverage of the data is reached. In this work,
𝑗 as the entropy of the posterior expectation: we learned the number of rules by cross-validation on the training
! set. We consider two functions (or criteria) to optimize: a simple
Õ𝑚
𝑤𝑎 𝑗 + 𝛼 𝑗 𝑤𝑎 𝑗 + 𝛼 𝑗
$$H\big(E[\theta_a \mid w_a]\big) = -\sum_{j=1}^{m} \frac{w_{aj} + \alpha_j}{\sum_{k=1}^{m} (w_{ak} + \alpha_k)} \log \frac{w_{aj} + \alpha_j}{\sum_{k=1}^{m} (w_{ak} + \alpha_k)}$$

where H is Shannon's entropy whose probability vector is the relative frequency of the j-th major case code, smoothed by taking the expectation of a Dirichlet-multinomial distribution with prior Dirichlet(α_1, ..., α_m); this smoothing reduces the variance of entropy calculations for attorneys with small caseloads. Note that this generates m − 1 features.

We consider two feature sets derived from these four court administrative features, minimal and subset (see Table 1).

Table 1: Features of the CT Civil Case Administrative Database. The column labeled S denotes whether the feature was included only in the subset feature set (and not in the minimal feature set).

Feature                 | Description                                                        | S
Juris Number            | Defendant attorney or firm identifier                              |
Attorney Specialization | Case type specialization smoothed by a Dirichlet-multinomial prior |
Major Code              | Classifies the case type (e.g., tort)                              | ✓
Case Location           | Superior court location for the case                               | ✓

3.3 Feature Engineering: Complaint Documents
We downloaded 7904 complaint documents associated with each case in our data as PDF files from the State of Connecticut Judicial Branch.

The rule-based classifiers optimize one of two functions: a simple sequential covering criteria and the First Order Inductive Learner (FOIL) criteria. The function which the simple sequential covering algorithm optimizes is

$$\frac{n^+ + 1}{n^* + k} \qquad (1)$$

where n^+ is the frequency of a word that appears in the granted (GR) documents, n^* is the total number of occurrences of the word across all documents, and k is the number of classes (here, k = 2). In other words, the simple algorithm greedily adds to the antecedent the term that increases the rule's accuracy the most.

A separate function is the FOIL criteria, which is optimized by the RIPPER algorithm [17, 18]. The FOIL criteria is less greedy than Equation 1, attempting to balance information gain with document coverage:

$$n_2^+ \left( \log_2 \frac{n_2^+}{n_2^+ + n_2^-} - \log_2 \frac{n_1^+}{n_1^+ + n_1^-} \right)$$

where
• n_1^+ (n_1^-) is the number of complaint documents associated with a motion to strike that were granted (denied) that the rule covers;
• n_2^+ (n_2^-) is the number of complaint documents associated with a motion to strike that are changed to positive (negative) with the addition of a prospective word to the antecedent.

3.3.2 Word embeddings for complaint documents. We considered three architectures to construct complaint document features from neural embeddings: word2vec [56], doc2vec [48], and law2vec [14]. The word2vec model maximizes log P(w_O | w_I), the log probability of a word w_O given an input word w_I. The doc2vec model
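The two rule-scoring criteria above are simple enough to sketch directly. The following is an illustrative Python version; the function and argument names are ours, not from the paper's released code.

```python
import math

def simple_covering_score(n_plus: int, n_total: int, k: int = 2) -> float:
    """Smoothed rule accuracy of Equation 1: (n+ + 1) / (n* + k),
    where n+ counts occurrences in granted (GR) documents and n* counts
    occurrences across all documents."""
    return (n_plus + 1) / (n_total + k)

def foil_gain(p1: int, n1: int, p2: int, n2: int) -> float:
    """FOIL information gain: p2 * (log2(p2/(p2+n2)) - log2(p1/(p1+n1))).
    p1/n1: granted/denied documents the rule covers before adding a word;
    p2/n2: granted/denied documents it covers after adding the word."""
    return p2 * (math.log2(p2 / (p2 + n2)) - math.log2(p1 / (p1 + n1)))
```

A word that appears mostly in granted documents pushes the first score toward 1, while `foil_gain` also rewards candidate words that keep the rule covering many granted documents, matching the greediness comparison made above.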
Case-level Prediction of Motion Outcomes in Civil Litigation ICAIL’21, June 21–25, 2021, São Paulo, Brazil
is similar, but instead of conditioning on an input word w_I, it conditions on a vector representing the document. We also consider the law2vec model, which was trained on 123,066 documents, including 53,000 UK legislative documents, 62,000 European legislative documents, and thousands of other English legislative, U.S. code, and opinion documents. We computed 300-dimensional embeddings for each complaint document using doc2vec and word2vec models that were pre-trained on AP News, Google News, and Wiki articles. The law2vec repository provides pretrained models for 100- and 200-dimensional vectors; we selected the 200-dimensional vectors. To compute a document representation using word2vec or law2vec, we computed an average word2vec vector weighted by term frequency-inverse document frequency (TF-IDF) [53].

We also produce models trained on a combination of word embeddings and rule classifiers. A simple word2vec or simple law2vec TF-IDF model computes an average word2vec vector for each document that is weighted by TF-IDF and includes only those words identified by the simple rule-based classifier. Likewise, a FOIL word2vec or FOIL law2vec TF-IDF model computes an average word2vec vector for each document that is weighted by TF-IDF and includes only those words identified by the FOIL rule-based classifier.

Decision trees recursively partition the data into subsets that have higher purity. Since finding an optimal decision tree is NP-hard [36, 46], trees are typically constructed in a greedy fashion with successive maximum entropy gain splits and leaves that define a sample's classification. In this work we build decision tree classifiers [66], which produce a single decision tree, and random forest classifiers [11], which are ensembles of decision trees built from bootstrap subsets of the training data.

Gradient boosting generalizes Adaboost, optimizing an approximate negative gradient of the binomial deviance loss function [31]. XGBoost, or Extreme Gradient Boosting, is a highly efficient implementation of gradient-boosted trees with an adjusted loss function to control the complexity of the decision trees [7]:

$$L_{xgb} = \sum_{i=1}^{n} L(y_i, F(x_i)) + \sum_{m=1}^{M} \Omega(h_m), \qquad \Omega(h_m) = \gamma T_m + \frac{1}{2} \lambda \lVert s \rVert^2$$

where L(y_i, F(x_i)) is a loss function computed from the observed motion outcome y_i and the predicted outcome F(x_i), and h_m is the m-th weak learner with T_m leaves and leaf output scores s.

3.4.3 Support Vector Machines. Support vector machines classify data by solving a convex optimization problem that yields a hyperplane optimally separating the two classes [10, 19]. Optimality is measured with respect to the size of the margin between granted and denied motions. They can efficiently compute non-linear decision boundaries with kernel functions, which represent similarities in an inner product space.

(Figure 2: classification accuracy, approximately 0.52-0.58, for the methods ab, dt, gb, rf, svm, and xgb.)
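The TF-IDF-weighted embedding average described in Section 3.3.2, optionally restricted to the words selected by a rule-based classifier, can be sketched as follows. The data layout is illustrative; a real pipeline would load pre-trained word2vec or law2vec vectors and corpus-level TF-IDF scores.

```python
def tfidf_weighted_doc_vector(tokens, embeddings, tfidf, keep=None):
    """Average the word vectors of `tokens`, weighting each vector by its
    TF-IDF score. If `keep` is given, only words in `keep` (e.g., those
    selected by the simple or FOIL rule-based classifier) contribute."""
    dim = len(next(iter(embeddings.values())))
    acc = [0.0] * dim
    total_weight = 0.0
    for tok in tokens:
        if tok not in embeddings or (keep is not None and tok not in keep):
            continue  # skip out-of-vocabulary or non-selected words
        w = tfidf.get(tok, 0.0)
        for i, v in enumerate(embeddings[tok]):
            acc[i] += w * v
        total_weight += w
    return [x / total_weight for x in acc] if total_weight else acc
```

Passing the rule-selected vocabulary as `keep` yields the "simple word2vec" or "FOIL word2vec" variants; omitting it yields the plain TF-IDF-weighted average.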
ICAIL’21, June 21–25, 2021, São Paulo, Brazil McConnell et al.
Figure 3: Predicting motion to strike outcomes across court administrative data and complaint documents. Distinct classifiers
were trained on court administrative data and dense features computed from complaint documents. Document embedding
features based on doc2vec [48] (black) and word2vec [56, 53] (blue) largely improved the classification accuracy of motion
outcomes versus court administrative database features alone (red) for six classifiers [61]: AdaBoost (ab), decision trees (dt),
gradient boosting (gb), random forests (rf), support vector machines (svm), Xgboost (xgb). Box plots are drawn with Tukey
whiskers (median ± 1.5 times interquartile range).
for training and validation. To estimate variability in classification accuracy, we computed 100 bootstrapped samples for each model selected from our grid search.

First, we evaluated the minimal and subset feature sets (Table 1) associated with legal cases to determine feature relevancy for motion outcome prediction. We varied the feature composition for the court administrative data and compared classifier accuracy (Fig. 2). Adding the case location and major code features improved median accuracy for all methods besides random forests. Overall, decision trees exhibited the highest motion to strike classification accuracy (mean: 0.583, median: 0.583), with slightly higher performance than the boosting methods. Given these results, we primarily focus our analysis on the subset featurization.

To assist interpretation of model performance, we compared our ML models with a naive baseline. The naive classifier predicts motion outcomes using the empirical frequency of the training set; with 52% of the motions being granted, we observed a naive baseline accuracy of 0.501. During model selection, we observed a maximum classification accuracy of 0.644 using Adaboost with dense word embeddings, corpus-specific TF-IDF weightings, and FOIL algorithmic rules. This same model had a mean accuracy of 0.605 over 100 bootstraps. However, the highest mean accuracy score from the same group of features was found in decision tree classifiers, with 0.606.

Next, we evaluated whether court administrative data alone was sufficient for learning accurate motion to strike classification models. Using only database features, all methods produced classification accuracies less than 0.60 (Fig. 3). Subsequently, we evaluated these same methods, but including dense natural language features extracted from complaint documents. While concatenating doc2vec features to the database feature vectors improved classification accuracy for most methods, a more careful model defined over word2vec features using corpus-specific weighting (TF-IDF) improved classification accuracy for all models. These results highlight the utility of incorporating natural language features in the prediction of motion outcomes.

Interestingly, the difference between minimal and subset featurization was diminished when including complaint document features in the model. We observed this behavior when considering both doc2vec (Supp. Fig. 1) and word2vec (Supp. Fig. 2) features, although word2vec features continued to yield better performing models. These findings suggest that complaint document embeddings effectively capture the major code and case location features.

We next investigated feature significance for motion to strike outcome prediction for decision trees built on word2vec features. We chose word2vec features because models built on word2vec embeddings produced more accurate results than database-only or database-and-doc2vec features; we selected decision trees since these are the most explainable of the six models tested, with performance similar to the boosting models. We found that features derived from the complaint documents were universally important across all decision tree classifiers (Fig. 4). The case type (Major Code) was also an important feature for a subset of models, as was the attorney specialization (entropy).

Next, we quantified whether tort or vehicular cases had subtle differences that made prediction easier by stratifying classification accuracy by case type in our best performing feature configuration (database features, FOIL and TF-IDF weighted word2vec). All methods, excluding SVM, predicted motion to strike outcomes in vehicular cases with significantly higher accuracy than in tort cases (one-way paired t-test, p ≤ 2.2 × 10^-16) (Fig. 5). This is likely due to inherent properties of vehicular cases and not class imbalance, since vehicular cases encompassed approximately 47% of the total cases.

Lastly, we investigated if legal domain-specific word2vec models would improve classifier performance compared to word2vec
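The naive baseline and the bootstrap variability estimate used above can be sketched as below. This is a schematic with synthetic labels: it resamples held-out predictions with replacement, whereas the paper's exact resampling protocol may differ.

```python
import random

def naive_baseline_accuracy(train_labels, test_labels):
    """Predict every motion with the majority outcome of the training set
    and report accuracy on the test labels."""
    majority = max(set(train_labels), key=train_labels.count)
    return sum(y == majority for y in test_labels) / len(test_labels)

def bootstrap_accuracies(y_true, y_pred, n_boot=100, seed=0):
    """Accuracy over n_boot bootstrap resamples (with replacement) of the
    evaluation set, giving a distribution of accuracy estimates."""
    rng = random.Random(seed)
    n = len(y_true)
    accs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        accs.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    return accs
```

With 52% of motions granted, the majority-class baseline lands near 0.5, which is the reference point the reported 0.6+ accuracies should be compared against.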
(Figure residue: precision-recall curves with average precision (AP) per classifier: dt 0.6274, ab 0.6186, svm 0.6042, gb 0.5734, rf 0.5661; the y-axis is Precision. Feature-importance labels include Juris Number, Entropy, Major Code, and case locations.)

…simple (Fig. 6) and FOIL (Supp. Fig. 5) rule-based classifiers, or when including (Supp. Fig. 6) and excluding (Supp. Fig. 7) FOIL features…
…more genuinely-disputed facts that matter to one or more of the plaintiff's legal claims. However, careful consideration must be taken when deciding which features or court documents are relevant. For example, a motion for summary judgement depends on the quality of the evidence collected by the parties thus far, which may not be well captured by the text of court documents.

Our comparisons between doc2vec and law2vec suggested there are potential benefits to incorporating neural embeddings from models trained specifically on legal corpora. In modern transformer architectures, domain-specific pre-trained models, e.g. BioBERT [51], have been shown to outperform transfer-learning fine-tuning approaches from general corpora like Wikipedia. While some of these models are in the legal domain, e.g. patentBERT [50], no such model exists for court documents. However, one significant challenge to applying such models in the legal domain is that the memory requirements for transformers scale quadratically with sequence length, prohibiting them from being applied directly to the longer texts that are common in complaints and judicial opinions [6]. Recent work on extending the range of transformers provides some evidence that this issue will be addressed [79, 76].

A limitation of the data in our analysis is that courts may occasionally grant a motion to strike in part. This can occur, for example, if two or more legal claims levied against a defendant were formally challenged by a motion to strike. The court may grant the motion to strike with respect to a single legal claim and deny it for the others. In these cases, the motion to strike order code in the Law Library data is noisy since only a single value is provided. Furthermore, the procedure with which courts interpret this order code is heterogeneous. Some courts interpret a granted motion as any motion granted in part or in full; other courts only use the granted order code when the motion is granted in full. Addressing this issue requires either reforming and unifying the data processing procedures across Connecticut courts or developing methods to parse out distinct legal claims from complaint documents and then match them to judicial order documents.

The relevant features for the motion prediction problem can also likely be improved. For example, an attorney is defined based on their juris number and a derived attorney specialization feature. There are other features that are likely relevant for predicting motion outcomes, e.g., attorney experience, record, or case load. Similar feature engineering can be implemented for judges. Case location and other high-dimensional categorical features can be one-hot encoded, but may also benefit from a descriptive, lower-dimensional set of features based on, e.g., court culture.

With the rise of ML in the legal domain, governments across the globe are placing new emphasis on ensuring AI-assisted decision making is done in an ethical, transparent, and nondiscriminatory manner. The European Commission for the Efficiency of Justice adopted five principles in the European Ethical Charter on the use of AI in judicial systems [82]. These principles guarantee that judicial AI is compatible with fundamental rights, nondiscriminatory, transparent, impartial, fair, and explainable. In the U.S., the National Center for State Courts has identified data transparency and investigating how AI transforms judicial processes as national priorities [59].

Fundamental to upholding these ideals is developing methods that are interpretable. Interpretability provides a mechanism to ensure ethical and legal compliance in situations where the ethics or legal implications are ill-defined or cannot be mathematically modelled. Model-based interpretability focuses on restricting the set of models such that a trained model directly informs relationships among model variables [58]. The methods developed in this work were selected, in part, for their high model-based interpretability (e.g. decision trees), but incorporating causal reasoning and explicit Bayesian modelling of relevant variables in the judicial decision-making process would only increase the explainability, interpretability, and, ultimately, trustworthiness of the model. One possibility is to consider the recent work on causal frameworks for decision trees [33, 52, 78].

6 CONCLUSIONS
By developing ML workflows with feature engineering rooted in legal domain expertise, we developed methods to help researchers better understand the predictability of trial motions and to help practitioners make more informed decisions. We developed and benchmarked the first ML methods to predict motion outcomes using only data that is available to all parties at trial. Our work demonstrated that motion to strike outcomes are predictable with high accuracy when new features like attorney specialization are combined with complaint document embeddings.

We expect these methods will be a valuable resource for lawyers and their clients by enabling the estimation of case strength. For example, our methods can be used to predict whether a case will survive a motion to strike after a complaint document is filed. Based on predicted motion outcomes, both parties can make more informed settlement decisions, and lawyers representing the plaintiff can revise the language in their complaint documents. Fitted ML models, training code, and benchmarking code can be accessed at https://github.com/aguiarlab/motionpredict.

REFERENCES
[1] Charu C Aggarwal. 2018. Machine learning for text. Springer.
[2] Sharan Agrawal et al. 2017. Affirm or reverse? using machine learning to help judges write opinions. NBER Working Paper, 29.
[3] Benjamin Alarie et al. 2016. Using machine learning to predict outcomes in tax law. Can. Bus. LJ, 58, 231.
[4] Nikolaos Aletras et al. 2016. Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective. PeerJ Computer Science, 2, e93.
[5] Katie Atkinson et al. 2020. Explanation in AI and law: past, present and future. Artificial Intelligence, 103387.
[6] Iz Beltagy et al. 2020. Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
[7] Candice Bentejac et al. 2020. A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 1–31.
[8] Ryan C Black and James F Spriggs. 2008. An empirical analysis of the length of US Supreme Court opinions. Hous. L. Rev., 45, 621.
[9] David M Blei et al. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
[10] Bernhard E Boser et al. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, 144–152.
[11] Leo Breiman. 2001. Random forests. Machine learning, 45, 1, 5–32.
[12] Shelagh MR Campbell. 2017. Exercising discretion in the context of dependent employment: assessing the impact of workload on the rule of law. Legal Studies, 37, 2, 305–323.
[13] John Celona. 2016. Winning at Litigation through Decision Analysis: Creating and Executing Winning Strategies in any Litigation or Dispute. Springer Series in Operations Research and Financial Engineering. Springer.
[14] Ilias Chalkidis. 2018. Law2Vec: Legal Word Embeddings. (2018). https://archive.org/details/Law2Vec.
[15] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 785–794.
[16] Cary Coglianese. 2004. E-rulemaking: information technology and the regulatory process. Admin. L. Rev., 56, 353.
[17] William W Cohen et al. 1996. Learning rules that classify e-mail. In AAAI spring symposium on machine learning in information access. Volume 18. Stanford, CA, 25.
[18] William W Cohen and Yoram Singer. 1999. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems (TOIS), 17, 2, 141–173.
[19] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning, 20, 3, 273–297.
[20] Lindsey Devers. 2011. Plea and charge bargaining. Research summary for Bureau of Justice Assistance, U.S. Department of Justice, 1.
[21] Jacob Devlin et al. 2018. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[22] Michael Evans et al. 2007. Recounting the courts? applying automated content analysis to enhance empirical legal research. Journal of Empirical Legal Studies, 4, 4, 1007–1039.
[23] Frank Fagan and Saul Levmore. 2019. The impact of artificial intelligence on rules, standards, and judicial discretion. S. Cal. L. Rev., 93, 1.
[24] Felix Steffek. 2021. Law and Autonomous Systems Series: Paving the Way for Legal Artificial Intelligence – A Common Dataset for Case Outcome Predictions. University of Oxford. (2021). https://www.law.ox.ac.uk/business-law-blog/blog/2018/05/law-and-autonomous-systems-series-paving-way-legal-artificial.
[25] Norman Fenton et al. 2016. Bayes and the law. Annual Review of Statistics and Its Application, 3, 1, 51–77. https://doi.org/10.1146/annurev-statistics-041715-033428.
[26] Matthias Feurer and Frank Hutter. 2019. Hyperparameter optimization. In Automated Machine Learning. Springer, Cham, 3–33.
[27] James H Fowler et al. 2007. Network analysis and the law: measuring the legal importance of precedents at the US supreme court. Political Analysis, 324–346.
[28] Jens Frankenreiter and Michael A. Livermore. 2020. Computational methods in legal analysis. Annual Review of Law and Social Science, 16, 1, 39–57. https://doi.org/10.1146/annurev-lawsocsci-052720-121843.
[29] Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55, 1, 119–139.
[30] Yoav Freund et al. 1999. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14, 771-780, 1612.
[31] Jerome H Friedman. 2002. Stochastic gradient boosting. Computational statistics & data analysis, 38, 4, 367–378.
[32] Anne von der Lieth Gardner. 1984. Artificial intelligence approach to legal reasoning. Technical report. Stanford Univ.
[33] Tim Genewein et al. 2020. Algorithms for causal reasoning in probability trees. arXiv preprint arXiv:2010.12237.
[34] Jane Goodman-Delahunty et al. 2010. Insightful or wishful: lawyers' ability to predict case outcomes. Psychology, Public Policy, and Law, 16, 2, 133–157.
[35] Evan Gretok et al. 2020. Transformers for classifying fourth amendment elements and factors tests. In Legal Knowledge and Information Systems: JURIX 2020: The Thirty-third Annual Conference, Brno, Czech Republic, December 9-11, 2020. Volume 334. IOS Press, 63–72.
[36] Thomas Hancock et al. 1996. Lower bounds on learning decision lists and trees. Information and Computation, 126, 2, 114–122.
[37] Allison P. Harris and Maya Sen. 2019. Bias and judging. Annual Review of Political Science, 22, 1, 241–259. https://doi.org/10.1146/annurev-polisci-051617-090650.
[38] Oliver Wendell Holmes. 1897. The path of the law. Harvard Law Review, 10, 8, 457–478.
[39] Jonas Jacobson et al. 2011. Predicting civil jury verdicts: how attorneys use (and misuse) a second opinion. Journal of Empirical Legal Studies, 8, S1, 99–119. http://dx.doi.org/10.1111/j.1740-1461.2011.01229.x.
[40] Robert A Karasek Jr. 1979. Job demands, job decision latitude, and mental strain: implications for job redesign. Administrative science quarterly, 285–308.
[41] Daniel Martin Katz et al. 2017. A general approach for predicting the behavior of the Supreme Court of the United States. PLOS ONE, 12, 4, (April 2017), 1–18. https://doi.org/10.1371/journal.pone.0174698.
[42] Nari Kim and Hyoung Joong Kim. 2017. A study on the law2vec model for searching related law. Journal of Digital Contents Society, 18, 7, 1419–1425.
[43] Sang-Bum Kim et al. 2006. Some effective techniques for naive Bayes text classification. IEEE transactions on knowledge and data engineering, 18, 11, 1457–1466.
[44] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images.
[45] Jason T Lam et al. 2020. The gap between deep learning and law: predicting employment notice. NLLP KDD, 7, 10.
[46] Laurent Hyafil and Ronald L Rivest. 1976. Constructing optimal binary decision trees is NP-complete. Information processing letters, 5, 1, 15–17.
[47] David S Law and David Zaring. 2009. Law Versus Ideology: The Supreme Court and the Use of Legislative History. Wm. & Mary L. Rev., 51, 1653.
[48] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, 1188–1196.
[49] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
[50] Jieh-Sheng Lee and Jieh Hsiang. 2019. Patentbert: patent classification with fine-tuning a pre-trained bert model. arXiv preprint arXiv:1906.02124.
[51] Jinhyuk Lee et al. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36, 4, 1234–1240.
[52] Jiuyong Li et al. 2016. Causal decision trees. IEEE Transactions on Knowledge and Data Engineering, 29, 2, 257–271.
[53] Joseph Lilleberg et al. 2015. Support vector machines and word2vec for text classification with semantic features. In 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE, 136–140.
[54] Masha Medvedeva et al. 2020. Using machine learning to predict decisions of the European Court of Human Rights. Artificial Intelligence and Law, 28, 2, 237–266.
[55] David E. Melnikoff and Nina Strohminger. 2020. The automatic influence of advocacy on lawyers and novices. Nature Human Behaviour, (September 7, 2020), 1–7.
[56] Tomas Mikolov et al. 2013. Efficient estimation of word representations in vector space. (2013). http://arxiv.org/abs/1301.3781.
[57] Jane Mitchell et al. 2020. Machine learning for determining accurate outcomes in criminal trials. Law, Probability and Risk, 19, 1, (March 2020), 43–65.
[58] W James Murdoch et al. 2019. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116, 44, 22071–22080.
[59] National Center for State Courts. 2021. Joint technology committee priority topics. Accessed on 2021-03-01. (2021). https://www.ncsc.org/about-us/committees/joint-technology-committee/priority-topics-old-page.
[60] Patrick W Nutter. 2018. Machine learning evidence: admissibility and weight. U. Pa. J. Const. L., 21, 919.
[61] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[62] Alina Petrova et al. 2020. Extracting Outcomes from Appellate Decisions in US State Courts. In Legal Knowledge and Information Systems: JURIX 2020: The Thirty-third Annual Conference, Brno, Czech Republic, December 9-11, 2020. Volume 334. IOS Press, 133–142.
[63] Arti K Rai. 2018. Machine learning at the patent office: lessons for patents and administrative law. Iowa L. Rev., 104, 2617.
[64] Edwina L Rissland. 1990. Artificial intelligence and law: stepping stones to a model of legal reasoning. The Yale Law Journal, 99, 8, 1957–1981.
[65] Theodore W Ruger et al. 2004. The supreme court forecasting project: legal and political science approaches to predicting supreme court decision making. Columbia Law Review, 1150–1210.
[66] S. R. Safavian and D. Landgrebe. 1991. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21, 3, 660–674.
[67] Robert E Schapire. 2013. Explaining adaboost. In Empirical inference. Springer, 37–52.
[68] Marek J. Sergot et al. 1986. The British Nationality Act as a logic program. Communications of the ACM, 29, 5, 370–386.
[69] Ray Smith. 2007. An overview of the tesseract ocr engine. In Ninth international conference on document analysis and recognition (ICDAR 2007). Volume 2. IEEE, 629–633.
[70] Harold Spaeth et al. 2014. Supreme court database code book. (2014).
[71] State of Connecticut Judicial Branch. 2021. Public Records Online. Accessed on 2021-01-01. (2021). https://jud.ct.gov/lawlib/publicrecords.htm.
[72] Michael Still. 2006. The definitive guide to ImageMagick. Apress.
[73] Octavia-Maria Şulea et al. 2017. Predicting the law area and decisions of French Supreme Court cases. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017. INCOMA Ltd., Varna, Bulgaria, (September 2017), 716–722. https://doi.org/10.26615/978-954-452-049-6_092.
[74] Harry Surden. 2014. Machine learning and law. Wash. L. Rev., 89, 87.
[75] Ahmad P Tafti et al. 2016. OCR as a service: an experimental evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. In International Symposium on Visual Computing. Springer, 735–746.
[76] Yi Tay et al. 2020. Long Range Arena: a benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.
[77] Thomas Vacek et al. 2019. Litigation Analytics: Case outcomes extracted from US federal court dockets. In Proceedings of the Natural Legal Language Processing Workshop 2019, 45–54.
[78] Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113, 523, 1228–1242.
[79] Sinong Wang et al. 2020. Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
[80] Ho Chung Wu et al. 2008. Interpreting tf-idf term weights as making relevance decisions. ACM Transactions on Information Systems (TOIS), 26, 3, 1–37.
[81] Feiyu Xu et al. 2019. Explainable ai: a brief survey on history, research areas, approaches and challenges. In CCF international conference on natural language processing and Chinese computing. Springer, 563–574.
[82] Irina Moroianu Zlatescu and Petru Emanuel Zlatescu. 2019. Implementation of the European ethical charter on the use of artificial intelligence in judicial systems and their environment. Current Issues of the EU Political-Legal Space, 237.
Evaluating Document Representations for Content-based Legal Literature Recommendations

Malte Ostendorff, Open Legal Data, Germany (mo@openlegaldata.io)
Elliott Ash, ETH Zurich, Switzerland (ashe@ethz.ch)
Terry Ruas, University of Wuppertal, Germany (ruas@uni-wuppertal.de)
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Ostendorff et al.
…which the majority have never been investigated in the legal context with a quantitative study, and validate our results qualitatively. (3) We show that the hybrid combination of text-based and citation-based methods can further improve the experimental results.

2 RELATED WORK
Recommender systems are a well-established research field [3], but relatively few publications focus on law as the application domain. Winkels et al. [52] are among the first to present a content-based approach to recommend legislation and case law. Their system uses the citation graph of Dutch Immigration Law and is evaluated with a user study conducted with three participants. Boer and Winkels [9] propose and evaluate Latent Dirichlet Allocation (LDA) [7] as a solution to the cold-start problem in collaborative filtering approaches. In an experiment with 28 users, they find the user-based approach outperforms LDA. Wiggers and Verberne [49] study citations for legal information retrieval and suggest citations should be combined with other techniques to improve performance.

Kumar et al. [22] compare four methods to measure the similarity of Indian Supreme Court decisions: TF-IDF [43] on all document terms, TF-IDF on only specific terms from a legal dictionary, Co-Citation, and Bibliographic Coupling. They evaluate the similarity measures on 50 document pairs with five legal domain experts. In their experiment, Bibliographic Coupling and TF-IDF on legal terms yield the best results. Mandal et al. [28] extend this work by evaluating LDA and document embeddings (Paragraph Vectors [25]) on the same dataset, whereby Paragraph Vectors was found to correlate the most with the expert annotations. Indian Supreme Court decisions are also used as evaluation data by Wagh and Anand [47], who use document similarity based on concepts instead of full text. They extract concepts (groups of words) from the decisions and compute the similarity between documents based on these concepts. Their vector representation, an average of word embeddings and TF-IDF, shows that using IDF to weight word2vec embeddings improves results. Also, Bhattacharya et al. [6] compare citation similarity methods, i.e., Bibliographic Coupling, Co-Citation, Dispersion [32] and Node2Vec [17], and text similarity methods like Paragraph Vectors. They evaluate the algorithms and their combinations using a gold standard of 47 document pairs. A combination of Bibliographic Coupling and Paragraph Vectors achieves the best results.

With Eunomos, Boella et al. [8] present a legal document and knowledge management system for searching legal documents. The document similarity problem is handled using TF-IDF and cosine similarity. Other experiments using embeddings for document similarity include Nanda et al. [33] or Ash and Chen [2].

Even though different methods have been evaluated in the legal domain, most results are not coherent and rely on small-scale user studies. This finding emphasizes the need for a standard benchmark to enable reproducibility and comparability [4]. Moreover, recent Transformer models [46] and novel citation embeddings have not been evaluated in legal recommendation research.

…a particular decision, e.g., to prepare a litigation strategy. Based on the decision at hand, the system recommends other decisions to its users such that the research task is easy to accomplish. A recommendation is relevant when it covers the same topic or provides essential information, e.g., when it overruled the seed decision [45].

3.1 Case Corpus and Silver Standard
Most of the previous works (Section 2) evaluate recommendation relevance by asking domain experts to provide subjective annotations [9, 22, 28, 52]. Especially in the legal domain, these expert annotations are costly to collect and, therefore, their quantity is limited. For the same reason, expert annotations are rarely published. Consequently, the research is difficult to reproduce [4]. In the case of US court decisions, such expert annotations between documents are also not publicly available. We construct two ground truth datasets from publicly available resources, allowing the evaluation of more recommendations, to mitigate the mentioned problems of cost, quantity, and reproducibility.

3.1.1 Open Case Book. With Open Case Book, the Harvard Law School Library offers a platform for making and sharing open-licensed casebooks. The corpus consists of 222 casebooks containing 3,023 cases from 87 authors. Each casebook contains a manually curated set of topically related court decisions, which we use as relevance annotations. The casebooks cover a range from broad topics (e.g., Constitutional law) to specific ones (e.g., Intermediary Liability and Platforms' Regulation). The decisions are mapped to full texts and citations retrieved from the Caselaw Access Project (CAP). After duplicate removal and the mapping procedure, relevance annotations for 1,601 decisions remain.

3.1.2 Wikisource. We use a collection of 2,939 US Supreme Court decisions from Wikisource as ground truth [50]. The collection is categorized into 67 topics like antitrust, civil rights, and amendments. We map the decisions listed in Wikisource to the corpus from CourtListener. The discrepancy between the two corpora decreases the number of relevance annotations to 1,363 court decisions.

Table 1: Distribution of relevant annotations for Open Case Book and Wikisource.

Relevant annotations per document:
               | Mean   | Std.  | Min. | 25%  | 50%   | 75%   | Max.
Open Case Book | 86.42  | 65.18 | 2.0  | 48.0 | 83.0  | 111.0 | 1590.0
Wikisource     | 130.01 | 82.46 | 1.0  | 88.0 | 113.0 | 194.0 | 616.0

We derive a binary relevance classification from Open Case Book and Wikisource. When decisions A and B are in the same casebook or category, A is relevant for B and vice versa. Table 1 presents the distribution of relevance annotations. This relevance classification is limited since a recommendation might still be relevant despite
not being assigned to the same topic as the seed decision. Thus,
we consider the Open Case Book and Wikisource annotations as a
3 METHODOLOGY silver standard rather than a gold one.
In this section, we describe our quantitative evaluation of 27 docu- 1 https://opencasebook.org
ment recommendations methods. We define the recommendation 2 https://case.law
110
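The pairwise derivation described above (two decisions are mutually relevant whenever they share a casebook or topic) can be sketched in a few lines; the casebook data below is a hypothetical toy example, not the actual Open Case Book corpus:

```python
from collections import defaultdict

def derive_relevance(groups):
    """Binary silver-standard labels: decisions sharing a casebook or
    topic are mutually relevant (symmetric, the decision itself excluded).

    `groups` maps a casebook or topic name to its list of decision ids."""
    relevant = defaultdict(set)
    for members in groups.values():
        for a in members:
            for b in members:
                if a != b:
                    relevant[a].add(b)
    return dict(relevant)

# Hypothetical toy casebooks, not the real Open Case Book data.
casebooks = {
    "Constitutional Law": ["Mugler v. Kansas", "Kidd v. Pearson"],
    "Antitrust": ["Kidd v. Pearson", "Standard Oil Co. v. United States"],
}
relevance = derive_relevance(casebooks)
```

A decision appearing in two casebooks is relevant to the members of both, which is what makes the resulting annotation a silver rather than a gold standard.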
Evaluating Document Representations for Content-based Legal Literature Recommendations ICAIL’21, June 21–25, 2021, São Paulo, Brazil
3.2 Evaluated Methods
We evaluate 27 methods, each representing a legal document 𝑑 as a numerical vector 𝑑⃗ ∈ R^𝑠, with 𝑠 denoting the vector size. To retrieve the recommendations, we first obtain the vector representations (or document embeddings). Next, we compute the cosine similarities of the vectors. Finally, we select the top 𝑘 = 5 documents with the highest similarity through nearest neighbor search4. Mean Average Precision (MAP) is the primary and Mean Reciprocal Rank (MRR) the secondary evaluation metric [29]. We compute MAP and MRR over a set of queries 𝑄, whereby 𝑄 is equivalent to the seed decisions, with |𝑄_WS| = 1363 available in Wikisource and |𝑄_OCB| = 1601 for Open Case Book. In addition to the accuracy-oriented metrics, we evaluate the coverage and Jaccard index of the recommendations. The coverage for the method 𝑎 is defined as in Equation 1, where 𝐷 denotes the set of all available documents in the corpus and 𝐷_𝑎 denotes the documents recommended by 𝑎 [16].

    Cov(𝑎) = |𝐷_𝑎| / |𝐷|    (1)

We define the Jaccard index [19] for the similarity and diversity of two recommendation sets 𝑅_𝑎 and 𝑅_𝑏 from methods 𝑎 and 𝑏 for the seed 𝑑_𝑠 in Equation 2:

    𝐽(𝑎, 𝑏) = |𝑅_𝑎 ∩ 𝑅_𝑏| / |𝑅_𝑎 ∪ 𝑅_𝑏|    (2)

We divide the evaluated methods into three categories: word vector-, Transformer-, and citation-based methods.

3.2.1 TF-IDF Baseline. As a baseline method, we use the sparse document vectors from TF-IDF [43], which are commonly used in related works [22, 33]5.

3.2.2 Word vector-based Methods. The following methods are derived from word vectors, i.e., context-free word representations. Paragraph Vectors [25] extend the idea of word2vec [31] to learning embeddings for word sequences of arbitrary length. Paragraph Vectors using distributed bag-of-words (dbow) performed well in text similarity tasks applied on legal documents [2, 28] and other domains [24]. We train Paragraph Vectors' dbow model to generate document vectors for each court decision. Like word2vec, GloVe [38] and fastText [10, 20] produce dense word vectors but do not provide document vectors. To embed a court decision as a vector, we compute the weighted average over its word vectors 𝑤⃗_𝑖, whereby the number of occurrences 𝑐_𝑖 of the word 𝑖 in 𝑑 defines the weight. Averaging of word vectors is computationally effective and yields good results even for representing longer documents [1]. For our experiments, we use word vectors made available by the corresponding authors as well as custom word vectors. While GloVe vectors are pretrained on Wikipedia and Gigaword [38], fastText is pretrained on Wikipedia, the UMBC webbase corpus, and the statmt.org news dataset [10]. Additionally, we use custom word vectors6 for both methods (namely fastTextLegal and GloVeLegal) pretrained on the joint court decision corpus extracted from Open Case Book and Wikisource (see Section 3.1). Using word vectors pretrained on different corpora allows the evaluation of the methods' cross-domain applicability.

3.2.3 Transformer-based Methods. As the second method category, we employ language models for deep contextual text representations based on the Transformer architecture [46], namely BERT [15], RoBERTa [27], Sentence Transformers (Sentence-BERT and Sentence-RoBERTa) [41], Longformer [5], and variations of them. In contrast to Paragraph Vectors and average word vectors, which neglect the word order, the Transformers incorporate word positions, making the text representations context-dependent. BERT significantly improved the state-of-the-art for many NLP tasks. In general, BERT models are pretrained on large text corpora in an unsupervised fashion to then be fine-tuned for specific tasks like document classification [36]. We use four variations of BERT: the original BERT [15] as base and large versions (pretrained on Wikipedia and BookCorpus) and two BERT-base models pretrained on legal corpora. Legal-JHU-BERT-base from Holzenberger et al. [18] is a BERT-base model fine-tuned on the CAP corpus. Similarly, Legal-AUEB-BERT-base from Chalkidis et al. [14] is as well fine-tuned on the CAP corpus but also on other corpora (court cases and legislation from the US and EU, and US contracts). RoBERTa improves BERT with longer training, larger batches, and the removal of the next sentence prediction task for pretraining. Sentence Transformers are BERT and RoBERTa models fine-tuned in a Siamese setting [12] to derive semantically meaningful sentence embeddings that can be compared using cosine similarity (Sentence-BERT and Sentence-RoBERTa). The provided Sentence Transformers variations are the nli- or stsb-versions, which are fine-tuned either on the SNLI and MNLI datasets [11, 51] or on the STS benchmark [13]. As the self-attention mechanism scales quadratically with the sequence length, the Transformer-based methods (BERT, RoBERTa, and Sentence Transformers) bound their representation to 512 tokens. Longformer includes an attention mechanism that scales linearly with sequence length, which allows processing longer documents. We use pretrained Longformer models as provided by Beltagy et al. [5], limited to 4096 tokens. All Transformer models apply mean-pooling to derive document vectors. We experimented with other pooling strategies, but they yield significantly lower results. These findings agree with Reimers and Gurevych [41]. We investigate each Transformer in two variations depending on their availability and w.r.t. model size and document vector size (base with 𝑠 = 768 and large with 𝑠 = 1024).

3.2.4 Citation-based Methods. To generate document vectors, we explore citation-based graph methods in which documents are nodes and edges correspond to citations. Like the text-based representations, the citation graph embeddings have the vector size 𝑑⃗ ∈ R^300. With DeepWalk, Perozzi et al. [39] were the first to borrow word2vec's idea and apply it to graph network embeddings. DeepWalk performs truncated random walks on a graph, and the node embeddings are learned through the node context information encoded in these short random walks, similar to the context sliding window in word2vec. Walklets [40] explicitly encodes multi-scale node relationships to capture community structures with the graph embedding. Walklets generates these multi-scale relationships by subsampling short random walks on the graph nodes. BoostNE [26] is

4 We set 𝑘 = 5 due to the UI of our legal recommender system [35].
5 We use the TF-IDF implementation from the scikit-learn framework [37].
6 The legal word vectors can be downloaded from our GitHub repository.
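The retrieval pipeline of Section 3.2 (count-weighted word-vector averaging followed by cosine-similarity nearest neighbor search) can be sketched as follows; the word vectors and documents are toy placeholders, not the trained fastTextLegal vectors:

```python
import numpy as np

def embed(tokens, word_vectors, size):
    """Count-weighted average: summing one vector per occurrence and
    dividing by the number of matched tokens weights each word by its
    occurrence count c_i, as described in Section 3.2.2."""
    vec, matched = np.zeros(size), 0
    for tok in tokens:
        if tok in word_vectors:
            vec += word_vectors[tok]
            matched += 1
    return vec / matched if matched else vec

def top_k(seed_id, doc_vectors, k=5):
    """Cosine-similarity nearest neighbors, excluding the seed itself."""
    ids = [i for i in doc_vectors if i != seed_id]
    mat = np.stack([doc_vectors[i] for i in ids])
    q = doc_vectors[seed_id]
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-12)
    return [ids[j] for j in np.argsort(-sims)[:k]]

# Toy 2-dimensional word vectors (placeholders, not fastTextLegal).
word_vectors = {
    "liquor": np.array([1.0, 0.0]),
    "sale": np.array([1.0, 0.0]),
    "tax": np.array([0.0, 1.0]),
}
docs = {"A": ["liquor", "sale"], "B": ["liquor"], "C": ["tax"]}
doc_vectors = {i: embed(t, word_vectors, size=2) for i, t in docs.items()}
recommendations = top_k("A", doc_vectors, k=1)
```

In practice the top-𝑘 search would run over precomputed, normalized vectors for all 2,964 documents rather than a Python loop over a dict.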
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Ostendorff et al.
Table 2: Overall scores for top 𝑘 = 5 recommendations from Open Case Book and Wikisource as the number of relevant
documents, precision, recall, MRR, MAP and coverage for the 27 methods and the vector sizes. The methods are divided into:
baseline, word vector-based, Transformer-based, citation-based, and hybrid. High scores according to the exact numbers are
underlined (or bold for category-wise). ∗ values were rounded up.
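The headline metrics of Table 2 can be computed from ranked recommendation lists as in the following minimal sketch (an illustration, not the paper's evaluation code; AP is normalized here by min(#relevant, list length)):

```python
def average_precision(ranked, relevant):
    """AP of one ranked list, normalized by min(#relevant, list length)."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), len(ranked)) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant recommendation; 0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def coverage(all_recommended, corpus):
    """Equation 1: share of the corpus appearing in any recommendation."""
    return len(set(all_recommended)) / len(set(corpus))

def mean_over_queries(metric, runs):
    """MAP or MRR: mean of a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

MAP and MRR are then `mean_over_queries(average_precision, ...)` and `mean_over_queries(reciprocal_rank, ...)` over the 1,601 (Open Case Book) or 1,363 (Wikisource) seed queries.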
does improve the performance, whereby Legal-AUEB-BERT generally outperforms Legal-JHU-BERT. For Open Case Book, Legal-AUEB-BERT is the best model in the Transformer category in terms of MAP even though it is only used as a base version.

Poincaré and Walklets are by far the best methods in the citation category. For Wikisource, the two citation-based methods score the same MAP of 0.031 as fastTextLegal. Compared to the word vector-based methods, the citation methods do better on Wikisource than on Open Case Book.

In the category of hybrid methods, the combination of text and citations improves the performance. For Open Case Book, the score summation Poincaré + fastTextLegal has the same MAP of 0.05 as fastTextLegal but a higher MRR of 0.746. The MRR of Poincaré + fastTextLegal is even higher than the MRR of its sub-methods Poincaré (0.629) and fastTextLegal (0.739) individually. The concatenation Poincaré ∥ fastTextLegal is, with 0.035 MAP, the best method on Wikisource. Using citations as a training signal, as in Sentence-Legal-AUEB-BERT, also improves the performance, but not as much as concatenation or summation. When comparing the three hybrid variations, score summation achieves overall the best results. In the case of Wikisource, the concatenation's scores are below those of its sub-methods, while summation scores at least as high as its best sub-method. Moreover, combining two text-based methods, such as Longformer-large and fastTextLegal, never improves over its sub-methods.

4.1.2 Document Length. The effect of the document length on the performance in terms of MAP is displayed in Figure 1. We group
[Figure 1: MAP per bucket for Open Case Book (top) and Wikisource (bottom); x-axis: text length as word count (8 equal-sized buckets); y-axis: MAP.]
Figure 1: MAP w.r.t. words in the seed document of Open Case Book (top) and Wikisource (bottom). The more words, the better the results; there is no peak at medium length. fastTextLegal outperforms Legal-BERT and Longformer for short documents.
the seed documents into eight equal-sized buckets (each bucket represents an equal number of documents) depending on the word count in the document text to make the two datasets comparable. Both datasets, Open Case Book and Wikisource, present a similar outcome. The MAP increases as the word count increases. Table 2 presents the average over all documents and, therefore, the overall best method is not equal to the best method in some subsets. For instance, Paragraph Vectors achieve the best results for several buckets, e.g., 4772-6172 words in Open Case Book or 6083-8659 words in Wikisource. The text limitation of fastTextLegal (4096 tokens) in comparison to fastText is also clearly visible. The performance difference between the two methods increases as the document length increases. For the first buckets with less than 4096 words, e.g., 187-2327 words in Open Case Book, one could expect no difference since the limitation does not affect the seed documents in these buckets. However, we observe a difference since target documents are not grouped into the same buckets. Remarkably, the performance difference for very long documents is less substantial. When comparing Longformer-large and Legal-AUEB-BERT, we also see an opposing performance shift with changing word count. While Legal-AUEB-BERT's scores are relatively stable throughout all buckets, Longformer depends more on the document length. On the one hand, Longformer performs worse than Legal-AUEB-BERT for short documents, i.e., 187-2327 words in Open Case Book and 31-1777 words in Wikisource. On the other hand, for documents with more words, Longformer mostly outperforms Legal-AUEB-BERT by a large margin. The citation-based method Poincaré is as well affected by the document length. However, this effect is due to a positive correlation between word count and citation count.

4.1.3 Citation Count. Figure 2 shows the effect of the number of in- and out-citations (i.e., edges in the citation graph) on the MAP score. The citation analysis for Wikisource confirms the word count analysis: more data leads to better results. For Open Case Book, instead, the performance of the citation-based methods peaks at 31-51 citations and even decreases at 67-89 citations. When comparing Poincaré and Walklets, there is no superior method and no dependency pattern is visible. The performance effect on DeepWalk is more substantial. The number of citations must be above a certain threshold to allow DeepWalk to achieve competitive results. For Open Case Book, the threshold is at 51-67 citations, and for Wikisource, it is at 30-50 citations. Figure 2 also shows the on-average higher MAP of Poincaré + fastTextLegal in comparison to the other approaches. When no citations are available, citation-based methods cannot recommend any documents, whereas the text methods still work (see 0-14 citations for Open Case Book).

Our citation-based methods use only a fraction of the original citation data, 70,865 citations in Open Case Book and 331,498 citations in Wikisource, because of the limitation to the documents available in the silver standards. For comparison, the most-cited decision from CourtListener (the underlying corpus of Wikisource) has 88,940 citations, whereas in the experimental data of Wikisource the maximum number of in- and out-citations is 386. As a result, we expect the citation-based methods, especially DeepWalk, to work even better when applied on the full corpus.

4.1.4 Coverage and Similarity of Recommendations. In addition to the accuracy-oriented metrics, Table 2 also reports the coverage of the recommendation methods. A recommender system for an expert audience should not focus on a small set of most-popular items but rather provide a high coverage of the whole item collection. However, coverage alone does not account for relevancy and, therefore, it must be contextualized with other metrics, e.g., MAP. Overall, two citation-based methods yield the highest coverage for both datasets, i.e., Poincaré for Open Case Book and DeepWalk for Wikisource. In particular, Poincaré has not only a high coverage but also high MAP scores. Yet, the numbers do not indicate that citation-based methods have generally a higher coverage since the
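The equal-sized buckets used for the word-count and citation-count analyses can be produced by cutting at empirical quantiles; a minimal numpy sketch (the right-inclusive binning mirrors labels like "(31, 1777]", but the paper's exact binning code is not specified):

```python
import numpy as np

def equal_sized_buckets(values, n_buckets=8):
    """Assign each value to one of n_buckets cut at empirical quantiles,
    so every bucket holds roughly the same number of documents."""
    values = np.asarray(values)
    edges = np.quantile(values, np.linspace(0, 1, n_buckets + 1))
    # Right-inclusive bins like "(31, 1777]"; the clip keeps the minimum
    # value (which falls exactly on the leftmost edge) inside bucket 0.
    idx = np.searchsorted(edges, values, side="left") - 1
    return np.clip(idx, 0, n_buckets - 1), edges

word_counts = list(range(1, 17))  # toy word counts for 16 documents
buckets, edges = equal_sized_buckets(word_counts, n_buckets=4)
```

With real word or citation counts, ties at the quantile edges can make the buckets only approximately equal-sized.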
[Figure 2: MAP per citation-count bucket for Open Case Book (top) and Wikisource (bottom), comparing Paragraph Vectors, fastTextLegal, Longformer-large, BoostNE, DeepWalk, Poincaré, Walklets, and Poincaré + fastTextLegal; x-axis: in- and out-citations (8 equal-sized buckets); y-axis: MAP.]
Figure 2: MAP scores w.r.t. citation count for Open Case Book (top) and Wikisource (bottom). Among citation-based methods, Poincaré and Walklets perform on average the best, while DeepWalk outperforms them only for Wikisource and when more than 82 citations are available (rightmost bucket).
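As described in Section 3.2.4, DeepWalk learns node embeddings from truncated random walks over the citation graph. The walk-generation step can be sketched as follows (a toy adjacency dict, not the actual citation data; training the word2vec-style model on the walks is omitted):

```python
import random

def truncated_random_walks(adj, walk_len=10, walks_per_node=5, seed=0):
    """DeepWalk-style walk generation: short random walks over a
    citation graph given as {node: [neighbor, ...]}. The walks serve
    as 'sentences' for a word2vec-style embedding model (not shown)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = adj.get(walk[-1], [])
                if not neighbors:
                    break  # dead end: truncate the walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical toy citation graph, not the CourtListener data.
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
walks = truncated_random_walks(graph, walk_len=4, walks_per_node=2)
```

This also makes the threshold effect plausible: nodes with few citations appear in few, very short walks, giving the embedding model little context to learn from.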
Besides the coverage, we also analyze the similarity or diversity of the recommendations between two methods. Figure 3 shows the similarity measured as the Jaccard index for selected methods. Method pairs with 𝐽(𝑎, 𝑏) = 1 have identical recommendations, whereas 𝐽(𝑎, 𝑏) = 0 means no common recommendations. Generally speaking, the similarity of all method pairs is considerably low (𝐽 < 0.8). The highest similarity can be found between a hybrid method and one of its sub-methods, e.g., Poincaré + fastTextLegal and fastTextLegal with 𝐽 = 0.76. Apart from that, substantial similarity can only be found between pairs from the same category. For example, the pair of the two text-based methods GloVeLegal and fastTextLegal yields 𝐽 = 0.67. Citation-based methods tend to have a lower similarity compared to the text-based methods, whereby the highest Jaccard index between two citation-based methods is achieved for Walklets and Poincaré with 𝐽 = 0.32. Like the coverage metric, the Jaccard index should be considered in relation to the accuracy results. GloVeLegal and fastTextLegal yield equally high MAP scores while also having a high recommendation similarity. In contrast, the MAP for Wikisource from fastTextLegal and Poincaré is equally high, too. However, their recommendation similarity is low (𝐽 = 0.11). Consequently, fastTextLegal and Poincaré provide relevant recommendations that are diverse from each other. This explains the good performance of their hybrid combination.

Figure 3: Jaccard index for similarity or diversity of two recommendation sets (average over all seeds from the two datasets). Columns follow the same method order as the rows:

  TF-IDF                     1.00 0.17 0.15 0.16 0.10 0.04 0.06 0.06 0.06 0.11 0.13
  GloVeLegal                 0.17 1.00 0.40 0.67 0.27 0.08 0.09 0.12 0.11 0.23 0.52
  fastText                   0.15 0.40 1.00 0.41 0.21 0.07 0.07 0.10 0.09 0.18 0.33
  fastTextLegal              0.16 0.67 0.41 1.00 0.28 0.09 0.09 0.13 0.11 0.24 0.76
  Paragraph Vectors          0.10 0.27 0.21 0.28 1.00 0.09 0.09 0.13 0.12 0.19 0.24
  Legal-AUEB-BERT-base       0.04 0.08 0.07 0.09 0.09 1.00 0.04 0.06 0.05 0.07 0.08
  DeepWalk                   0.06 0.09 0.07 0.09 0.09 0.04 1.00 0.20 0.14 0.14 0.12
  Walklets                   0.06 0.12 0.10 0.13 0.13 0.06 0.20 1.00 0.32 0.27 0.18
  Poincaré                   0.06 0.11 0.09 0.11 0.12 0.05 0.14 0.32 1.00 0.39 0.32
  Poincaré || fastTextLegal  0.11 0.23 0.18 0.24 0.19 0.07 0.14 0.27 0.39 1.00 0.23
  Poincaré + fastTextLegal   0.13 0.52 0.33 0.76 0.24 0.08 0.12 0.18 0.32 0.23 1.00
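The Jaccard comparison of Figure 3 (Equation 2, averaged over seeds) and the score-summation hybrid it motivates can be sketched as follows; embeddings and document ids are toy placeholders:

```python
import numpy as np

def jaccard(rec_a, rec_b):
    """Equation 2: overlap of two recommendation sets for one seed."""
    a, b = set(rec_a), set(rec_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def hybrid_rank(seed, candidates, emb_a, emb_b):
    """Score summation: rank candidates by the sum of the cosine
    similarities under both representations (e.g., a citation-based
    and a text-based embedding)."""
    def score(c):
        return cosine(emb_a[seed], emb_a[c]) + cosine(emb_b[seed], emb_b[c])
    return sorted(candidates, key=score, reverse=True)

# Toy embeddings standing in for, e.g., Poincaré (emb_a) and
# fastTextLegal (emb_b); real vectors come from the trained models.
emb_a = {"s": np.array([1.0, 0.0]), "x": np.array([1.0, 0.0]), "y": np.array([0.0, 1.0])}
emb_b = {"s": np.array([0.0, 1.0]), "x": np.array([0.0, 1.0]), "y": np.array([0.0, 1.0])}
```

The concatenation variant (Poincaré ∥ fastTextLegal) would instead build `np.concatenate([emb_a[d], emb_b[d]])` per document and rank with a single cosine similarity.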
Table 3: Examples from fastTextLegal and Poincaré (other methods are in the supplementary material) for Mugler v. Kansas
with relevance annotations by the silver standards (S) and domain expert (D).
court held that Kansas could constitutionally outlaw liquor sales, with constitutional issues raised on substantive due process (Fourteenth Amendment) and takings (Fifth Amendment). We provide a description of the cases and their relevance on GitHub9.

The sample verification indicates the usefulness of both text-based and citation-based methods and does not contradict our quantitative findings. Each of the recommendations has a legally important connection to the seed case (either the Fourteenth Amendment or the Fifth Amendment), although it is difficult to say whether the higher-ranked cases are more similar along an important topical dimension. The rankings do not appear to be driven by the facts presented in the case, as most of them have nothing to do with alcohol bans. Only Kidd v. Pearson (1888) is, like the seed decision, about liquor sales. The samples also do not reveal considerable differences between text- and citation-based similarity. With regard to the silver standards, the domain expert agrees in 14 of 20 cases (70%). In only two cases does the domain expert classify a recommendation as irrelevant despite it being classified as relevant in the silver standard.

5 DISCUSSION
Our experiments explore the applicability of the latest research advances to the use case of legal literature recommendations. Existing studies on legal recommendations typically rely on small-scale user studies and are therefore limited in the number of approaches that they can evaluate (Section 2). For this study, we utilize relevance annotations from two publicly available sources, i.e., Open Case Book and Wikisource. These annotations not only enable us to evaluate the recommendations of 2,964 documents but also to compare in total 41 methods and their variations, of which 27 methods are presented in this paper.

Our extensive evaluation shows a large variance in the recommendation performance. Such a variance is known from other studies [4]. There is no single method that yields the highest scores across all metrics and all datasets. Despite that, fastTextLegal is on average the best of all 41 methods. fastTextLegal yields the highest MAP for Open Case Book, while for Wikisource only hybrid methods outperform fastTextLegal. Also, the coverage of fastTextLegal is considerably high for both datasets. Simultaneously, fastTextLegal is robust to corner cases since neither very short nor very long documents reduce fastTextLegal's performance substantially. These results confirm the finding from Arora et al. [1] that average word vectors are a "simple but tough-to-beat baseline". Regarding baselines, our TF-IDF baseline yields one of the worst results. In terms of accuracy metrics, only some Transformers are worse than TF-IDF, but especially TF-IDF's coverage is the lowest by a large margin. With a coverage below 50%, TF-IDF fails to provide the diverse recommendations that are desirable for legal literature research.

The transfer of research advances to the legal domain is one aspect of our experiments. Thus, the performance of Transformers and citation embeddings is of particular interest. Despite the success of Transformers for many NLP tasks, Transformers yield on average the worst results for representing lengthy documents written in legal English. The other two method categories, word vector-based and citation-based methods, surpass Transformers.

The word vector-based methods achieve overall the best results among the non-hybrid methods. All word vectors with in-domain training, i.e., Paragraph Vectors, fastTextLegal, and GloVeLegal, perform similarly well, with a minor advantage for fastTextLegal. Their similar performance aligns with the large overlap among their recommendations. Despite a small corpus of 65,635 documents, the in-domain training generally improves the performance, as the gap between the out-of-domain fastText and fastTextLegal shows. Given that the training of custom word vectors is feasible on commodity hardware, in-domain training is advised. More significant than the gap between in- and out-of-domain word vectors is the effect of limited document lengths. For Open Case Book, the fastTextLegal variation limited to the first 512 tokens has only 52% of the MAP of the full-text method. For Wikisource, the performance decline exists as well but is less significant. This effect highlights the advantage of the word vector-based methods: they derive meaningful representations of documents with arbitrary length.
The evaluated Transformers cannot process documents of arbitrary length but are limited to either 512 or 4096 tokens. This limitation contributes to Transformers' low performance. For instance, Longformer-large's MAP is almost twice as high as BERT-large's MAP on Open Case Book. However, for Wikisource both models yield the same MAP scores. For Wikisource, the in-domain pretraining has a larger effect than the token limit since Legal-AUEB-BERT achieves the best results among the Transformers. Regarding the Transformer pretraining, the difference between Legal-JHU-BERT and Legal-AUEB-BERT shows the effect of two pretraining approaches. The corpora and the hyperparameter settings used during pretraining are crucial. Even though Legal-JHU-BERT was exclusively pretrained on the CAP corpus, which has a high overlap with Open Case Book, Legal-AUEB-BERT still outperforms Legal-JHU-BERT on Open Case Book. Given these findings, we expect that the performance of Transformers could be improved by increasing the token limit beyond 4096 tokens and by additional in-domain pretraining. Such improvements are technically possible but add significant computational effort. In contrast to word vectors, Transformers are not trained on commodity hardware but on GPUs. Especially long-sequence Transformers such as the Longformer require GPUs with large memory. Such hardware may not be available in production deployments. Moreover, the computational effort must be seen in relation to the other methods. Put differently, even fastTextLegal limited to 512 tokens outperforms all Transformers.

Concerning the citation embeddings, we consider Poincaré, closely followed by Walklets, as the best method. In particular, the two methods outperform the other citation methods even when only a few citations are available, which makes them attractive for legal research. Poincaré also provides the highest coverage for Open Case Book, emphasizing its quality for literature recommendations. For Wikisource, DeepWalk has the highest coverage despite yielding generally low accuracy scores. As Figure 2 shows, DeepWalk's MAP score improves substantially as the number of citations increases. Therefore, we expect that DeepWalk, but also the other citation methods, would perform even better when applied on a larger citation graph. The analysis of recommendation similarity also shows little overlap between the citation-based methods and the text-based methods (Figure 3). This indicates that the two approaches complement each other and motivates the use of hybrid methods.

Related work has already shown the benefit of hybrid methods for literature recommendations [6, 49]. Our experiments confirm these findings. The simple approaches of score summation or vector concatenation can improve the results. In particular, Poincaré + fastTextLegal never leads to a decline in performance. Instead, it increases the performance for corner cases in which one of the sub-methods performs poorly. Vector concatenation has mixed effects on the performance, e.g., a positive effect for Wikisource and a negative effect for Open Case Book. Using citations as training data in Sentence Transformers can also be considered a hybrid method that improves the performance. However, this requires additional effort for training a new Sentence Transformer model.

As we discuss in Section 3.1, we consider Open Case Book and Wikisource more silver than gold standards. With the qualitative evaluation, we mitigate the risk of misinterpreting the quantitative results, whereby we acknowledge our small sample size. The overall agreement with the domain expert is high. The expert tends to classify more recommendations as relevant than the silver standards do, i.e., relevant recommendations are missed. This explains the relatively low recall from the quantitative evaluation. In a user study, we would expect only minor changes in the ranking of methods with similar scores, e.g., fastTextLegal and GloVeLegal. The category ranking would remain the same. The benefit of our silver standards is the number of available relevance annotations. The number of annotations in related user studies is, with up to 50 annotations, rather low. Instead, our silver standards provide a magnitude more relevance annotations. Almost 3,000 relevance annotations enable evaluations regarding text length, citation count, or other properties that would otherwise be magnitudes more difficult. Similarly, user studies are difficult to reproduce as their data is mostly unavailable. This leads to reproducibility being an issue in recommender system research [4]. The open license of the silver standards allows the sharing of all evaluation data and, therefore, contributes to more reproducibility. In summary, the proposed datasets bring great value to the field, overcoming eventual shortcomings.

6 CONCLUSION
We present an extensive empirical evaluation of 27 document representation methods in the context of legal literature recommendations. In contrast to previous small-scale studies, we evaluate the methods over two document corpora containing 2,964 documents (1,601 from Open Case Book and 1,363 from Wikisource). We underpin our findings with a sample-based qualitative evaluation. Our analysis of the results reveals fastTextLegal (averaged fastText word vectors trained on our corpora) as the overall best performing method. Moreover, we find that all methods have a low overlap between their recommendations and are vulnerable to certain dataset characteristics like text length and the number of citations available. To mitigate the weaknesses of single methods and to increase recommendation diversity, we propose hybrid methods like the score summation of fastTextLegal and Poincaré, which outperforms all other methods on both datasets. Although there are limitations in the experimental evaluation due to the lack of openly available ground truth data, we are able to draw meaningful conclusions about the behavior of text-based and citation-based document embeddings in the context of legal document recommendation. Our source code, trained models, and datasets are openly available to encourage further research9.

ACKNOWLEDGMENTS
We would like to thank Christoph Alt, Till Blume, and the anonymous reviewers for their comments. The research presented in this article is funded by the German Federal Ministry of Education and Research (BMBF) through the project QURATOR (Unternehmen Region, Wachstumskern, no. 03WKDA1A) and by the project LYNX, which has received funding from the EU's Horizon 2020 research and innovation program under grant agreement no. 780602.

REFERENCES
[1] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In 5th International Conference on Learning Representations (ICLR 2017), Vol. 15. 416–424.

9 GitHub repository: https://github.com/malteos/legal-document-similarity
From Data to Information: Automating Data Science to Explore the U.S. Court System

Andrew Paley (Northwestern University), andrewpaley@u.northwestern.edu
Andong L. Li Zhao (Northwestern University), andong@u.northwestern.edu
Harper Pack (Northwestern University), harper.pack@northwestern.edu
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paley et al.
has explored such limitations on access [1, 35], as well as the questionable completeness of the data available [30, 32]. Legislation to eliminate the PACER fees is progressing through Congress – one step towards opening the courts to public scrutiny and understanding.

But making court records free won't eliminate all barriers to access. Many open-government initiatives in the U.S. and abroad have yielded a growing array of public datasets [2, 24], and work towards data transparency is an ongoing effort [14]. However, while access to data is necessary, it's insufficient: the applied value of that data to the end goal of increased public understanding – of access to information – remains stymied by the limited analytical skills and resources of the majority of those afforded that data. A survey detailed in [47] found in part that while citizens acknowledge and appreciate moves towards open data, most don't know people in their social circles who take advantage of it. Further, the authors note "most open data released by the government is available in the raw format, which restricts its understandability by all people" and that "this data is mostly usable by experts with some technical knowledge to interpret and develop applications" [47]. Separately, in a case study of Data.gov, [23] argue that open data "generates its value when it is not only available and accessible but also made sense by its users to solve problems" and conclude that "public agencies should invest in new technologies and craft new data management techniques to make data readily accessible to users...providing real-time analysis and updates."

To date, the bridge between raw data and meaningful information has generally been built ad-hoc and on-demand by data scientists, but that resource-intensive approach doesn't scale when considering the information needs of a broader subset of the public. And, in the space of the legal system, even questions as simple as, "Are there differences in how judges handle fee waiver requests?" or "Is there any correlation between a judge's tenure and the length of cases they oversee?" are impossible to answer without significant data expertise or the resources to pay for it. Clearly, open data access isn't enough; we need a mechanism to access the information contained within.

To build that mechanism, in essence, is to automate work that would be done by a data scientist to extract information. Thus, we endeavor to outline what the data scientist's role entails and identify those functions as requirement sets for building the platform.

They felt they were limited by the tools they were currently using and wanted to ask questions of the data that they weren't able to.

To help bridge that gap, we introduce a novel platform and user experience that provides users with the tools necessary to explore data and drive analysis via natural language statements. Our approach leverages an ontology configuration that adds domain-relevant data semantics to a database schema for the sake of supporting search and analysis without user-entered code or SQL. This configuration allows us to abstract away the underlying schema complexities from user concern, understand what filters and analysis are possible and domain-relevant, infer relevant analytics from the data semantics, and provide guided outcomes during both search and analysis.

The associated notebook-style experience is an early embodiment of a new form of human-data, or human-information, interface – a user experience imbued with a set of assistive capabilities where interactions happen in natural language rather than code. The system also generates responses in modalities intuitively appropriate to the nature of the analysis results – from text to various types of visualizations.

1.2 The Data Scientist/Data Interaction
The second set of requirements mirrors the data scientist/data interaction: the wrangling of data into coherent and controlled schemas through various modes of ETL (extract, transform, load), text extraction, data cleaning, and the more complicated arenas of machine learning and language modeling.

To support explorations of the U.S. court system, this includes the structuring and harmonization of court records, with the initial focus here on a snapshot of roughly 270,000 case dockets. This involves consultation with domain experts; the definition of a complex schema (across 30 tables ranging in size from two to thirty-one columns); a pipeline to extract, transform and harmonize the unstructured and semi-structured components of dockets; the integration of additional datasets to expand the information space (starting with background information on federal judges); the creation of a novel dataset for training language models for classification tasks (initially for classifying various types of motions within the scope of a case); and the model training/fine-tuning and validation process in pursuit of proving the utility of framing motion type detection as a classification task.
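As a toy illustration of the extraction step in such a pipeline, the sketch below pulls a filing date and event text out of a semi-structured docket entry line. The entry format, field names, and regex are our own assumptions for illustration; real PACER dockets vary by district and are considerably messier.

```python
import re
from datetime import date

# Hypothetical docket entry format: "MM/DD/YYYY  <entry no.>  <text>".
ENTRY = re.compile(
    r"^(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})"
    r"\s+(?P<number>\d+)\s+(?P<text>.+)$"
)

def parse_entry(line):
    """Transform one raw docket line into a structured record."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None  # a real pipeline would route failures to review
    return {
        "filed": date(int(m["year"]), int(m["month"]), int(m["day"])),
        "entry_number": int(m["number"]),
        "text": m["text"],
    }
```

For example, `parse_entry("05/14/2016  12  MOTION to dismiss filed by Apple Inc.")` yields a record with the date, entry number, and free text separated for loading into the schema.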
study. Our discussion elaborates on the goals of our work, including challenges to be addressed.

Our approach to court docket search and analysis is one early step in the development of an open-source platform aimed at democratizing access to information. In discussion of future work, we outline dual and distinct tracks: the first aimed at continuing to build and augment our U.S. court records database, and the second focused on the ongoing development of the core platform. On the platform side, we point to a future in which additional data can be brought in by technical users who manage data wrangling and define data semantics – the steps we now think of as getting to "open data," but with a newly imagined purpose – and our system scales to new domains, communities, and geographies.

2 RELATED WORKS
Reducing the costs associated with PACER has been pursued as a way to achieve judicial transparency. However, studies have shown the limits of open data in providing greater transparency [44]. Notably, problems persist across many user personas, from citizens to data scientists, government agents, and even academics [7, 18, 19, 23]. We aim to address a subset of these challenges – pertaining to data utility and barriers to information access – by applying automated analytical and visualization capabilities on top of the data.

Much research has focused on automating legal processes [28], predicting outcomes [5], or assessing the value of AI for the two former areas [13, 45]. There have been some recent developments in legal question-answering (QA) systems [11, 21]. However, these have had limited data analytics capabilities [22] and often rely on simple data retrieval for generating answers [33]. While some commercial tools support exploration of court documents, they are prohibitively expensive and limited in terms of scope and consistency of results [1].

More broadly, general QA systems have been the subject of research for decades [16, 40, 46] and are some of the most prominent examples of AI systems [12]. There has been significant progress in neural QA systems [10], with transformer-based models [27] achieving state-of-the-art results on benchmark tasks [39]. However, these QA systems are best suited for unstructured text data where the answer is plainly stated in the corpus itself, unlike our system, which can infer or derive the answer through follow-on analysis. Other approaches aim to understand and decompose the structure of complex questions into discrete parts as a plan for deriving an answer [49]; however, the representation is high-level and distinct from our approach, which constructs runnable queries against a given datasource.

Extensive research has parsed natural language queries into SQL queries [20], using techniques from deep learning [17], rules-based methods [42], or a mixture of both [43]. Instead, our approach automatically generates the space of possible analysis from an ontology configuration, and then translates the underlying analysis plans to natural language, drawing inspiration from prior work [37, 41].

Beyond current work in information retrieval via conversational systems, our approach utilizes a notebook-style interface, with inspiration coming from Jupyter notebooks [38] as well as their forebear, IPython [36]. Automated visualization is a related area of research [31, 50] focusing primarily on presentation layers for a given dataset rather than intent-driven question-answering.

3 THE NATURAL-LANGUAGE NOTEBOOK
Notebook-style interfaces are a standard part of the modern data science toolkit, and for good reason: they support a logical process flow and marry exploratory and presentation layers in one cohesive experience. However, they are the tools of experts – users who know how to code, run analysis, interpret stack traces and explore complex results. They bring order in the form of scaffolding, but remain largely agnostic about content or the specifics of a particular dataset or domain.

We borrow from that scaffolding, but our system leverages simplified data filtering mechanisms and natural language statements. And where other notebooks display a variety of outputs (defined by the near-infinite space of possibility supported by arbitrary code), our system outputs natural language and annotated visualization as a means of conveying information.

This approach maintains the intuitive flow of the notebook user experience but brings its power to people unfamiliar with programming. Our notebooks are domain- and dataset-aware, and the user experience speaks the language not of the data scientist, but of a user reasonably fluent in the domain. Further, they provide assistive mechanisms to surface what the system knows it is capable of, guiding even novice users to understand the range of capabilities available to them.

An exploration of the current iteration of interface mechanisms and output capabilities can be found in Figure 1 and Figure 2, and a deeper discussion of the paradigm follows.

3.1 The UX Paradigm: "Search First, Then Converse"
Our approach separates concerns between search (winnowing the available dataset to a space of interest) and converse (the user inputting statements that drive analysis upon the filtered dataset and the generation of responses). This approach embodies the strengths of the notebook format in focusing on one task at a time and presenting interstitial output as feedback. Further, this has the indirect effect of separating concerns on the backend, supporting a generalizable approach to the specification of filter and analysis configuration.

As depicted in Figure 1, each exploration in our notebook interface starts from a "search" (or "filter") panel: a paginated view of the dataset that matches the current set of filters. The primary entities presented in this view are court cases in the Northern District of Illinois. On initialization, the user is presented with the full space of available data absent any applied filters and can opt to apply filters or skip right to adding analyses of the full dataset.

Below that is the partitioned "converse" step, where the user can enter natural-language statements that drive analysis (Figure 2). Of note, users can enter multiple analysis statements against one data view, stepping through a set of questions while maintaining a thread of prior exploration.

This paradigm means our system does not have to manage statements like "Average case duration grouped by judge tenure for cases
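The converse step's statement entry can be sketched with standard-library approximate string matching over system-generated candidate statements. The candidates below are invented for illustration; this is our sketch, not the platform's actual matcher or statement inventory.

```python
import difflib

# Hypothetical system-generated analysis statements.
CANDIDATES = [
    "Average case duration grouped by judge",
    "Average case duration grouped by judge tenure",
    "Average fee waiver grant rate year-over-year",
    "Case count grouped by nature of suit",
]

def match_statement(user_input, n=3, cutoff=0.3):
    """Return the candidate statements closest to the user's input,
    best match first, using difflib's similarity ratio."""
    return difflib.get_close_matches(
        user_input.lower(),
        [c.lower() for c in CANDIDATES],
        n=n,
        cutoff=cutoff,
    )
```

A partial input such as "average case duration by judge" surfaces the closest generated statements, which the interface can then offer as auto-completions.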
of action, case status, filing date, nature of suit, party name, judge name, attorney name, as well as free text search in the docket entries associated with the case. Ultimately, users can make a few targeted selections and fill in a few inputs to get to searches equivalent to "all cases in the Northern District of Illinois between 2015 and 2017 where Kennelly served as judge" or "all cases with nature of suit property rights where one of the parties is Apple" – the SQL query versions of which only a fraction of those users could generate themselves.

The focus on case dockets being the "primary" searchable unit (as opposed to judges, parties, attorneys, etc.) and all associated filters are entirely configuration-driven and distinct from complexities of the underlying schema. The ontology config maps the machine representation to a user-friendly set of names and attends to the scaffolding of ids, foreign keys and joins. Thus, the filterable fields are a subset of those that exist at the schema level on various tables that join against the case table, and in some cases (such as Judge Name, depicted in Figure 4) actually span multiple fields in the schema (first_name, middle_name, last_name). The key point here is that the user does not need to consider the schema but simply makes decisions about domain-relevant ways to search with guidance from the system about the relevant search space (and the system then generates runnable queries of various types, including string matching and range finding, such as with dates). For domain expert users, our approach is a significant convenience over having to learn or write SQL, and, for less knowledgeable users, it also serves as guidance about relevance in the domain.

    { ...,
      "judgeName": {
        "nicename": "Judge Name",
        "type": "text",
        "allowMultiple": True,
        "autocomplete": acs.getJudges,
        "model": [db.JudgeOnCase, db.Judge],
        "fromTargetModel": ["judges", "judge"],
        "fields": ["first_name", "middle_name", "last_name"],
      },
    ... }

Figure 4: The config entry for the "Judge Name" entity for search/filter capabilities and the results view. 1) "nicename" is the user-facing name of this entity type, 2) "type" and "allowMultiple" inform the input style and query generation mechanisms, 3) "autocomplete" maps to a method on the autocomplete class (can be default or a plugin) and powers the autocomplete API endpoint, 4) "model" and "fromTargetModel" map the model join and relationship feature path from the db.Case table at the ORM level, 5) "fields" defines the field(s) this entity's name/id maps to (affording support for multi-field queries)

Figure 5: Two examples of the primary interface for adding analysis statements. Annotations: 1) The user-entered statement, having been auto-completed progressively via generated statement candidates, 2) Candidate matches given the previously auto-completed statement ("Average Fee Waiver Grant Rate Year-Over-Year") and the subsequently user-appended "grouped by," 3) A user-entered string that hasn't yet been auto-completed, demonstrating fuzzy string matching, 4) A set of fuzzily matched results

3.2.2 Analysis Statements and Query Generation. Once the user has arrived at filtered data they are interested in, they can add multiple analysis statements below the dataview panel. As seen in Figure 5, this is realized on the UX as a fuzzy (i.e., approximate string matching) search across a set of natural language statements, each of which is generated dynamically by the system through inferring relevant analysis possibilities based on the underlying ontology configuration and a core model of analysis types. Because the system is inferring and defining the analysis space based on the ontology components, each generated statement corresponds to an underlying plan representation that is interpretable by the analysis engine for the sake of generating queries and running analytics on any set of filters.

As users select and add additional analysis statements to the notebook, the system responds with answers in the form of text and visualizations, as depicted in Figure 2. As per standard notebook mechanics, each analysis statement is tied to the active filtered set in the panel above such that changes to the filters (and thus the slice of the data presented) will flow through and update each linked result.

The core platform's model of analysis includes a growing set of available operations (e.g., average), as well as specifications on how the operation ought to be performed (e.g., can only be done on numeric fields, how many fields are needed). As seen in Figure 6, the ontology configuration then defines the fields relevant for analysis and their user-friendly names, as well as their attributes (e.g., semantic type, possible transformations into other data types, relevant units, and – in the case of discrete entities delimited by id – how to generate their user-friendly names) and their relationship to the primary model.

To illustrate the generation of the analysis space, in the instantiation referenced in our figures, an analysis configuration that lists ten relevant features for analysis (e.g., Judge Tenure, Nature of Suit, Case Duration, Fee Waiver Grant Status) alongside the metadata
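To make the multi-field query generation concrete, here is a simplified sketch of our own: turning a Figure 4-style config entry and a user's filter value into a parameterized WHERE fragment spanning every schema field the entity maps to. The field names follow Figure 4, but the real system targets an ORM rather than raw SQL strings.

```python
# Simplified, illustrative config entry modeled on Figure 4.
CONFIG = {
    "judgeName": {
        "nicename": "Judge Name",
        "type": "text",
        "fields": ["first_name", "middle_name", "last_name"],
    },
}

def build_where(entity, value, config=CONFIG):
    """Build a parameterized WHERE fragment covering all of the
    entity's configured schema fields."""
    fields = config[entity]["fields"]
    clause = " OR ".join(f"{f} LIKE ?" for f in fields)
    params = [f"%{value}%"] * len(fields)
    return f"({clause})", params

# build_where("judgeName", "Kennelly")
# -> ("(first_name LIKE ? OR middle_name LIKE ? OR last_name LIKE ?)",
#     ["%Kennelly%", "%Kennelly%", "%Kennelly%"])
```

The point of the indirection is that the user only ever sees "Judge Name"; the span across first, middle, and last name fields is a config detail.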
reframing them as classification tasks. For context, the main body of a docket is a series of time-stamped text entries, each marking events in the arc of a given case. These text-snippet representations contain various sorts of useful information, including motions (effectively discrete requests for a judicial decision), the outcome of a given motion, changes of representation or venue or presiding judge, references to evidence or testimony, eventual outcomes, and so on. Being able to identify and classify such information would prove highly valuable for both search and analysis.

To explore approaches, we started with the classification of motion types as our initial target. At first glance, it could be tempting to envision a solution to this classification task based on regular expressions, where motions are explicitly identified by name. However, as depicted in Table 1, a pure regex approach is far too rigid to capture the many complexities found in the docket entry sample space, including multiple motions being named in a single entry, non-motion events referencing motions by name (e.g., notices, orders), and obfuscation of the motion type through varying levels of docket entry metadata. These complexities are further compounded by naming convention variations across districts and the trappings of error-prone human data entry [1].

Thus we pivoted to language modeling. As no training dataset exists for such a task, we created a web application to view and tag the motions pulled from our docket dataset. For both the definition of the space of possible motion types and for the sake of actually tagging the motion entries, we solicited help from legal scholars and their law students. We implemented a voting mechanism in the app such that each motion is tagged three times by three distinct users as a means of ensuring accuracy. Our dataset continues to grow through use of the application, though the experiments that follow leverage a subset of this data.

In order to effectively utilize this data for our classification experiments, we performed some preprocessing on the raw dataset. First, the raw dataset contained several motion classes with few data points. To address these rare motion classes in these initial tests, we set a threshold of 25 data points and merged all classes below this threshold into the "Other Motion" class. Second, we removed all duplicate docket entries that arose as a byproduct of the voting mechanism from the dataset to ensure that the models were not training on some docket entries more than others. After this preprocessing, the smallest motion class contained 25 samples, the largest contained 951, and the median and average of the motion classes were 50 and 152, respectively. For each of the models we used a train/validation/test split of 80/10/10 per class. In total after preprocessing, there were 2,064 training samples with 524 testing samples across 17 distinct motion classes.

Making use of two pretrained transformers, the 110M parameter BERT-base [9] and 125M parameter RoBERTa [29] models, we fine-tuned each on this processed dataset. We made use of the AllenNLP

the transformer models, the training accuracy is slightly higher than the validation/test accuracy for both models, but we believe this margin is reasonable given the small size of the dataset. To reduce the likelihood of overfitting in the future, we continue to grow the tagged motion dataset.

6 EVALUATION
We evaluated our system's effectiveness in handling both search and analysis of data across two separate tracks: 1) usability testing in which target users completed tasks with the system and provided survey feedback, and 2) a case-study comparative analysis to assess the system's efficacy when benchmarked against a data scientist's ad hoc analysis.

6.1 Usability Testing
We gave 15 subjects (14 legal professionals, one journalist) a set of prompts (e.g., "For all cases in the 'N.D. IL' district, which year had the highest average case duration?") and assessed their experiences in: (1) using the search filter, (2) conducting analysis on all the records, and (3) conducting analyses based on specific search criteria of varying complexity. In addition, we gave them time to test their own scenarios while "thinking aloud" so we could capture their intentions and strategies. Participants were then presented with a survey to complete at the end of the session consisting of the modified System Usability Scale (SUS), an evaluation framework shown to be effective at quantifying the complexity and ease of use of interfaces [4]. The average SUS score for our participants' overall experience across (1), (2), and (3) was 72.83, which is considered good usability [3, 26]. When answering the statement regarding whether they would use our system frequently, all but two participants (87%) agreed or strongly agreed with that statement (one was neutral and one disagreed and wrote that docket sheets are not used in their research). These results represent a preliminary round of user testing, and we intend to further analyze the associated feedback and conduct additional user tests targeting users with a wider variety of backgrounds.

6.2 System Evaluation: A Case Study
To further weigh the benefits of our approach, we compare it with prior work done by data scientists examining how fee waiver grant rates vary among judges using ad hoc data processing and analysis [35]. We answer the same question using our system (see Figure 7 for an example of the output) and through observation compare both approaches across three dimensions: speed to insight, flexibility of exploration, and barrier to entry.

The initial data processing pipeline looks similar for both. A systematic analysis of this issue requires paying to download case
framework [15] as a wrapper around the Huggingface Transformers identifying the fee waiver status of each case [35]. However, where
library [48] to fine-tune the models for 10 epochs, using a batch the ad hoc method attends to ETL, aggregation, and visualization
size of 8, and the AdamW optimizer. The RoBERTa model achieved for a single target task, our approach looks to leverage that upfront
training accuracy of 95.69%, validation accuracy of 91.22%, and data work to support a wide array of possible downstream analyses.
test accuracy of 90.08%. The BERT-base model achieved training Thus, when considering a one-off query or single data point, we
accuracy of 96.95%, validation accuracy of 89.69%, and test accuracy cannot definitively say that the ETL, schema and ontology work in
of 89.31%. These results exceeded the baseline bag of embeddings support of our system will require less time than a data scientist
classification model, which achieved a test accuracy of 80.50%. For taking the ad hoc approach. But one-offs aren’t the goal of our
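As a rough illustration of the dataset preparation described in this section (merging classes below the 25-example threshold into “Other Motion”, deduplicating entries produced by the triple-tagging vote, and taking a stratified 80/10/10 split), here is a minimal sketch; the (docket_text, motion_class) data layout and the function name are our own assumptions for illustration, not the authors' released code.

```python
from collections import Counter
import random

def preprocess(entries, threshold=25, seed=0):
    """Merge rare motion classes and split 80/10/10 per class.

    `entries` is a list of (docket_text, motion_class) pairs; this layout
    is an illustrative assumption, not the paper's actual schema.
    """
    # Deduplicate entries that appear multiple times due to the triple-tagging vote.
    entries = list(dict.fromkeys(entries))

    # Merge classes with fewer than `threshold` examples into "Other Motion".
    counts = Counter(label for _, label in entries)
    entries = [(text, label if counts[label] >= threshold else "Other Motion")
               for text, label in entries]

    # Stratified 80/10/10 train/validation/test split, per class.
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    by_class = {}
    for text, label in entries:
        by_class.setdefault(label, []).append((text, label))
    for label, items in by_class.items():
        rng.shuffle(items)
        n = len(items)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```

The per-class split mirrors the paper's stated "80/10/10 per class" strategy, so rare classes still contribute at least one validation and one test example.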
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Paley et al.
From Data to Information: Automating Data Science to Explore the U.S. Court System ICAIL’21, June 21–25, 2021, São Paulo, Brazil
• Issues of ethics and responsibility: One such example is privacy. Court documents are rife with personally identifiable information, and reliably de-identifying documents at scale is a non-trivial problem. Further, the tension between de-identification and information completeness (say, for the sake of mapping to geographies) adds another complication. The use of highly regulated medical records data in research and machine learning provides a promising precedent [8, 25] for reference as we move forward.
• Issues of information misuse: Protecting against misuse of analysis, especially when the barrier of expertise to arriving at such analysis has been lowered, is a significant issue in our increasingly fraught information landscape. In the realm of law, the politicization of judicial decision making and the use of judicial analytics as a means of influencing future outcomes are both potential issues.
• Issues of explainability and data quality: Our scalable approach to data analysis adds a new layer of importance to the explainability of results and also runs the risk of obscuring incomplete or deficient data. To fully realize the promise of data science automation, additional research will be focused on ensuring our system can explain itself and handle issues of data quality gracefully and transparently.
• Issues associated with novel analysis: Inarguably, data scientists can flexibly address novel questions or analysis requirements on the fly, and while our platform’s library of analytics will grow, there will continue to be question types it can’t answer. In future work, we will expand our nascent plugin framework to support custom analysis and continuously grow the built-in libraries.

8 FUTURE WORK
Going forward, various members of our team are pursuing in tandem the dual roadmap we laid out in the introduction.

One thread is aimed at making the raw data emitted from the U.S. court system increasingly machine-readable. This entails everything from the continued evolution of the ingestion pipeline (sourcing data from a wider variety of districts and tackling corner cases in the data) to improvements to the data already obtained through various forms of enrichment. In the near term, we intend to pursue entity disambiguation on parties and attorneys, as well as the creation of additional datasets to train language models for classification outside the scope of the motions described above (such that we can attempt to capture additional data points such as judicial rulings, charge severity, changes in representation, and various forms of case outcome).

The other thread is the work on the core platform itself. This will take a number of forms, including: 1) Expansions to the analytics capabilities and plugins (including the introduction of new response types and visualizations); 2) An evolution of the ontology configuration and support for ontology management through the user experience, allowing for user-driven updates as well as the introduction of new data sources; 3) Ontology-driven derived fields, providing support for adding new data points dynamically and introducing new possibilities for downstream explanations; 4) Support for localization such that the platform could be used by non-English speakers (of note, our ontology-driven approach means very little actual language is coded into the UI, making this an easier pursuit), opening up the possibility of legal documents and open data from other countries being made available through the platform; 5) UX improvements, including changes to analysis statement selection (with fuzzy semantic matching of colloquial terms against terms of art), support for more interactivity in visualizations, and additional explanations associated with analysis results; 6) Support for interactive machine learning by bringing the capabilities of our separate motion tagging application directly to the platform and augmenting it to cover both the extraction/creation of novel tagged datasets and in-platform model training/fine-tuning, validation and testing. This presents a significant opportunity for research in the space of making machine learning accessible to non-technical users.

While we believe deeply in the importance of bringing transparency to the U.S. court system and will continue the data work necessary to do so, we also see this data-information rift throughout the government and public sector in the United States and globally. Thus, we are excited by the prospect of platform improvements to support bringing a variety of new datasets to our application.

9 CONCLUSION
In this work we’ve detailed a novel platform and user experience to allow non-data scientists to drive exploration and analysis of data associated with the U.S. court system. In support of that experience, we defined the process by which we ingested, extracted and structured the data from 270,000 case dockets. Given the results of usability testing presented in our evaluation, we believe we have early confirmation that this new natural language notebook approach marks a step in the direction of democratizing access to data analysis and could have significant impact not only in the space of the U.S. court system, but also more broadly across a variety of publicly available data. Subsequent work is already underway to further develop the capabilities, refine the UX mechanics, and stand up new components of the ecosystem. In tandem, the ingestion, structuring and enrichment of U.S. court records continues as we work towards a comprehensive database mirroring the federal court system.

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation Convergence Accelerator Program under grant no. 1937123 and grant no. 2033604.

REFERENCES
[1] Charlotte Alexander and Mohammed Javad Feizollahi. 2019. On Dragons, Caves, Teeth, and Claws: Legal Analytics and the Problem of Court Data Access. Computational Legal Studies: The Promise and Challenge of Data-Driven Legal Research (Ryan Whalen, ed., Edward Elgar, 2019, Forthcoming) (2019).
[2] Judie Attard, Fabrizio Orlandi, Simon Scerri, and Sören Auer. 2015. A systematic review of open government data initiatives. Government Information Quarterly 32, 4 (2015), 399–418.
[3] Aaron Bangor, Philip Kortum, and James Miller. 2009. Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies 4, 3 (2009), 114–123.
[4] Aaron Bangor, Philip T Kortum, and James T Miller. 2008. An empirical evaluation of the System Usability Scale. Intl. Journal of Human–Computer Interaction 24, 6 (2008), 574–594.
[5] Karl Branting, Brandy Weiss, Bradford Brown, Craig Pfeifer, A Chakraborty, Lisa Ferro, M Pfaff, and A Yeh. 2019. Semi-supervised methods for explainable legal
prediction. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 22–31.
[6] Federal Judicial Center. 2011. Biographical directory of federal judges.
[7] Jonathan Crusoe, Anthony Simonofski, Antoine Clarinval, and Elisabeth Gebka. 2019. The impact of impediments on open government data use: insights from users. In 2019 13th International Conference on Research Challenges in Information Science (RCIS). IEEE, 1–12.
[8] Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association 24, 3 (2017), 596–606.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3558–3567.
[11] Biralatei Fawei, Jeff Z Pan, Martin Kollingbaum, and Adam Z Wyner. 2018. A methodology for a criminal law and procedure ontology for legal question answering. In Joint International Semantic Technology Conference. Springer, 198–214.
[12] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building Watson: An overview of the DeepQA project. AI Magazine 31, 3 (2010), 59–79.
[13] Anthony W Flores, Kristin Bechtel, and Christopher T Lowenkamp. 2016. False positives, false negatives, and false analyses: A rejoinder to Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. Fed. Probation 80 (2016), 38.
[14] World Wide Web Foundation. 2018. Open Data Barometer - Leaders Edition. World Wide Web Foundation.
[15] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson H S Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform.
[16] Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. 1961. Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference. 219–224.
[17] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760 (2017).
[18] Maxat Kassen. 2018. Adopting and managing open data: Stakeholder perspectives, challenges and policy recommendations. Aslib Journal of Information Management (2018).
[19] Muhammad Mahboob Khurshid, Nor Hidayati Zakaria, Ammar Rashid, and Muhammad Nouman Shafique. 2018. Examining the Factors of Open Government Data Usability From Academician’s Perspective. International Journal of Information Technology Project Management (IJITPM) 9, 3 (2018), 72–85.
[20] Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee. 2020. Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment 13, 10 (2020), 1737–1750.
[21] Mi-Young Kim, Randy Goebel, and Ken Satoh. 2015. COLIEE-2015: evaluation of legal question answering. In Ninth International Workshop on Juris-informatics (JURISIN 2015).
[22] Mi-Young Kim, Ying Xu, and Randy Goebel. 2014. Legal question answering using ranking SVM and syntactic/semantic similarity. In JSAI International Symposium on Artificial Intelligence. Springer, 244–258.
[23] Rashmi Krishnamurthy and Yukika Awazu. 2016. Liberating data for public value: The case of Data.gov. International Journal of Information Management 36, 4 (2016), 668–672.
[24] Karim R Lakhani, Robert D Austin, and Yumi Yi. 2002. Data.gov. Harvard Business School.
[25] Joffrey L Leevy, Taghi M Khoshgoftaar, and Flavio Villanustre. 2020. Survey on RNN and CRF models for de-identification of medical free text. Journal of Big Data 7, 1 (2020), 1–22.
[26] James R Lewis. 2018. The System Usability Scale: past, present, and future. International Journal of Human–Computer Interaction 34, 7 (2018), 577–590.
[27] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[28] Tomer Libal and Matteo Pascucci. 2019. Automated reasoning in normative detachment structures with ideal conditions. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 63–72.
[29] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[30] Lynn M LoPucki. 2001. Politics of Research Access to Federal Court Data. Tex. L. Rev. 80 (2001), 2161.
[31] Jock Mackinlay, Pat Hanrahan, and Chris Stolte. 2007. Show me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1137–1144.
[32] Peter W Martin. 2018. District Court Opinions That Remain Hidden Despite a Long-Standing Congressional Mandate of Transparency-the Result of Judicial Autonomy and Systemic Indifference. Law Libr. J. 110 (2018), 305.
[33] Gayle McElvain, George Sanchez, Sean Matthews, Don Teo, Filippo Pompili, and Tonya Custis. 2019. WestSearch Plus: A Non-factoid Question-Answering System for the Legal Domain. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1361–1364.
[34] Michael Bayer. [n.d.]. SQLAlchemy. https://www.sqlalchemy.org/
[35] Adam R Pah, David L Schwartz, Sarath Sanga, Zachary D Clopton, Peter DiCola, Rachel Davis Mersey, Charlotte S Alexander, Kristian J Hammond, and Luís A Nunes Amaral. 2020. How to build a more open justice system. Science 369, 6500 (2020), 134–136.
[36] Fernando Pérez and Brian E Granger. 2007. IPython: a system for interactive scientific computing. Computing in Science & Engineering 9, 3 (2007), 21–29.
[37] Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. 2008. Linking data to ontologies. In Journal on Data Semantics X. Springer, 133–173.
[38] Min Ragan-Kelley, F Perez, B Granger, T Kluyver, P Ivanov, J Frederic, and M Bussonnier. 2014. The Jupyter/IPython architecture: a unified view of computational research, from interactive exploration to communication and publication. AGUFM 2014 (2014), H44D–07.
[39] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2383–2392.
[40] Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 41–47.
[41] Mariano Rodriguez-Muro, Roman Kontchakov, and Michael Zakharyaschev. 2013. Ontology-based data access: Ontop of databases. In International Semantic Web Conference. Springer, 558–573.
[42] Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R Mittal, and Fatma Özcan. 2016. ATHENA: an ontology-driven system for natural language querying over relational data stores. Proceedings of the VLDB Endowment 9, 12 (2016), 1209–1220.
[43] Jaydeep Sen, Chuan Lei, Abdul Quamar, Fatma Özcan, Vasilis Efthymiou, Ayushi Dalmia, Greg Stager, Ashish Mittal, Diptikalyan Saha, and Karthik Sankaranarayanan. 2020. ATHENA++: natural language querying for complex nested SQL queries. Proceedings of the VLDB Endowment 13, 12 (2020), 2747–2759.
[44] Md Shamim Talukder, Liang Shen, Md Farid Hossain Talukder, and Yukun Bao. 2019. Determinants of user acceptance and use of open government data (OGD): An empirical investigation in Bangladesh. Technology in Society 56 (2019), 147–156.
[45] Songül Tolan, Marius Miron, Emilia Gómez, and Carlos Castillo. 2019. Why machine learning may lead to unfairness: Evidence from risk assessment for juvenile justice in Catalonia. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 83–92.
[46] Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, Vol. 99. 77–82.
[47] Vishanth Weerakkody, Zahir Irani, Kawal Kapoor, Uthayasankar Sivarajah, and Yogesh K Dwivedi. 2017. Open data and its usability: an empirical view from the citizen’s perspective. Information Systems Frontiers 19, 2 (2017), 285–300.
[48] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019).
[49] Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics 8 (2020), 183–198.
[50] Kanit Wongsuphasawat, Zening Qu, Dominik Moritz, Riley Chang, Felix Ouk, Anushka Anand, Jock Mackinlay, Bill Howe, and Jeffrey Heer. 2017. Voyager 2: Augmenting visual analysis with partial view specifications. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2648–2659.
Lex Rosetta: Transfer of Predictive Models Across Languages,
Jurisdictions, and Legal Domains
Jaromir Savelka (jsavelka@cs.cmu.edu), Carnegie Mellon University, USA
Hannes Westermann and Karim Benyekhlef, Université de Montréal, Canada
Charlotte S. Alexander and Jayla C. Grant, Georgia State University, USA
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466149

1 INTRODUCTION
This paper explores the ability of multi-lingual sentence embeddings to enable training of predictive models that generalize beyond individual languages, legal systems, jurisdictions, and domains (i.e., contexts). We propose a new type schema for functional segmentation of adjudicatory decisions (i.e., decisions of trial and appellate court judges, arbitrators, administrative judges and boards) and use it to annotate legal cases across eight different contexts (7 countries, 6 languages). We release the newly created dataset (807 documents
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Savelka and Westermann, et al.
with 89,661 annotated sentences) including the annotation schema to the public.1

In the area of AI & Law, research typically focuses on a single context, such as decisions of a specific court on a specific issue within a specific time range. This is justified by the complexity of legal work and the need for nuanced solutions to particular problems. At the same time, this narrow focus can limit the applicability of the research outcomes, since a proposed solution might not be readily transferable to a different context. In text classification, for example, a model might simply memorize a particular vocabulary characteristic of a given context, rather than acquiring the semantics of a predicted type. Adaptation of such a model to a new context would then require the assembly of a completely new dataset. This may be both time-consuming and expensive, since the annotation of legal documents relies on legal expertise.

Certain tasks appear to be of interest to researchers from multiple countries with different legal traditions (e.g., deontic classification of legal norms embodied in statutory law, argument extraction from case law, summarization/simplification of legal documents, etc.). This suggests that there may be several core tasks in AI & Law that are of general interest in almost any context. One such task is the functional segmentation of adjudicatory decisions, which has been the subject of numerous studies in the past (see Section 2). In this paper, we show that for this particular task it is possible to leverage linguistic resources created in multiple contexts.

This has wide-reaching implications for AI & Law research. Since annotation of training data is expensive, models that are able to use existing data from other contexts might be instrumental in enabling real-world applications that can be applied across contexts. Such approaches may further enable international collaboration of researchers, each annotating their own part of a dataset to contribute to a common pool (as we do in this work) that could be used to train strong models able to generalize across contexts.

1.1 Functional Segmentation
We investigate the task of segmenting adjudicatory decisions based on the functional role played by their parts. While there are significant differences in how decisions are written in different contexts, we hypothesize that selected elements might be universal, such as, for example, sections:
(1) describing facts that give rise to a dispute;
(2) applying general legal rules to such facts; or
(3) stating an outcome of the case (i.e., how it was decided).
This conjecture is supported by the results of the comparative project titled Interpreting Precedents [23], which aimed to analyze (among other things) the structure in 11 different jurisdictions. The findings of this project suggest that the structure indicated above may be considered a general model followed in the investigated jurisdictions, although variations exist that are characteristic of particular legal systems and types of courts and their decisions.

The ability to segment cases automatically could be beneficial for many tasks. It could support reading and understanding of legal decisions by students, legal practitioners, researchers, and the public. It could facilitate empirical analyses of the discourse structure of decisions. It could enhance the performance, as well as the user experience, of legal search tools. For example, if a user searches for an application of a legal rule, they might restrict the search to the section where a judge applies the rule to a factual situation. Judges themselves might find the technique useful, within their own jurisdictions but also in transnational disputes involving the application of different legal standards. The same benefits apply to non-court settings, e.g., international arbitration, where many jurisdictions’ laws and their interpretation matter. Further, the segmentation of decisions into meaningful sections could serve as an important step in many legal document processing pipelines.

1.2 Hypotheses
To investigate how well predictive models based on multi-lingual sentence embeddings learn to segment cases into functional parts across different contexts, we evaluated the following hypotheses:
(H1) A model trained on a single context can generalize when transferred to other, previously unseen, contexts.
(H2) A model trained on data pooled from multiple contexts is more robust and generalizes better to unseen contexts than a model trained on a single context.
(H3) A context-specific model benefits from pooling the in-domain data with data from other contexts.

1.3 Contributions
By carrying out this work, we provide the following contributions to the AI & Law research community:
• Detailed definition and analysis of a functional segmentation task that is widely applicable across different contexts.
• A new labeled dataset consisting of 807 documents (89,661 sentences) from seven countries in six different languages.
• Evidence of the effectiveness of multi-lingual embeddings on processing legal documents.
• Release of the code used for data preparation, analysis, and the experiments in this work.

2 RELATED WORK
Segmenting court decisions into smaller elements according to their function or role is an important task in legal text processing. Prior research utilizing supervised machine learning (ML) approaches or expert-crafted rules can roughly be distinguished into two categories. First, the task could be to segment the text into a small number of contiguous parts, typically comprising multiple paragraphs (this work). Different variations of this task were applied to several legal domains from countries such as Canada [15], the Czech Republic [17], France [8], or the U.S. [27]. Second, the task could instead be labeling smaller textual units, often sentences, according to some predefined type system (e.g., rhetorical roles, such as evidence, reasoning, conclusion). Examples from several domains and countries include administrative decisions from the U.S. [33, 41], multi-domain court decisions from India [6], international arbitration decisions [9], or even multi-{domain,country} adjudicatory decisions in English [28]. Identifying the section that states the outcome of the case has also received considerable attention separately [25, 38]. To the best of our knowledge, existing work on functional segmentation of court decisions is limited to a single

1 https://github.com/lexrosetta/caselaw_functional_segmentation_multilingual
language—ours being the first paper exploring the task jointly on legal documents in multiple languages.

In NLP, the success of word embeddings was followed by an increasing interest in learning continuous vector representations of longer linguistic units, such as sentences (a trend that has been reflected in AI & Law research as well [34, 41]). Multi-lingual representations have recently attracted ample attention. While most of the earlier work was limited to a few close languages or pairwise joint embeddings for English and one foreign language, several approaches to obtain general-purpose, massively multi-lingual sentence representations were proposed [5, 11, 13]. Such representations were utilized in many downstream applications, such as document classification [21], machine translation [2], question answering [22], hate speech detection [4], or information retrieval (IR) in the legal domain [40]. Our work is one of the first such applications in the legal domain and, to the best of our knowledge, the first dealing with more than two languages.

Approaches other than language-agnostic sentence embeddings (this work) were used in AI & Law research focused on texts in multiple languages. A recent line of work mapped recitals to articles in EU directives and normative provisions in Member States’ legislation [24]. There, mono-lingual models were used (i.e., one model per language). Other published applications in multi-lingual legal IR were based on thesauri [14, 29]. A common technique to bridge the language gap was the use of ontologies and knowledge graphs [1, 3, 7, 16]. Multi-lingual environments, such as the EU or systems established by international treaties, attracted work on machine translation [20], meaning equivalence verification [32], and the building of parallel corpora [30, 31].

3 DATASET
In creating the dataset, the first goal was to identify a task that would be useful across different contexts. After an extensive literature review, we identified the task of functional segmentation of adjudicatory decisions as a viable candidate. To make the task generalizable, we decided to include only a small number of core types.
(1) Out of Scope – Parts outside of the main document body (e.g., metadata, editorial content, dissents, end notes, appendices).
(2) Heading – Typically an incomplete sentence or marker starting a section (e.g., “Discussion,” “Analysis,” “II.”).
(3) Background – The part where the court describes procedural history, relevant facts, or the parties’ claims.
(4) Analysis – The section containing the reasoning of the court, issues, and application of law to the facts of the case.
(5) Introductory Summary – A brief summary of the case at the beginning of the decision.
(6) Outcome – A few sentences stating how the case was decided (i.e., the overall outcome of the case).

adjudicatory decisions. Each team developed specifications for the decisions to be included in their part of the dataset.

Four of the contexts were double-annotated by two annotators (Canada, Czech R., France, U.S.A. I); the remaining four by just one. Each team had at least one member with a completed law degree. When a team had more than one member, law students were allowed to be included.

A high-level description of the resulting dataset is provided in Table 1. It consists of eight contexts from seven different countries (two parts are from the U.S.) with 807 documents in six languages (three parts are in English). Most of the contexts include judicial decisions, while U.S.A. II was the only context that consisted solely of administrative decisions. There are considerable variations in the length of the documents. While an average document in the U.S.A. I context comprises 530.6 sentences, an average document in the France context is about ten times shorter (59.0 sentences).

The four double-annotated parts enabled us to examine the inter-annotator agreement. Table 2 shows the raw agreement on a character level. While it appears that recognizing the Outcome was rather straightforward in the France and U.S.A. I contexts, it was more complicated in the case of Canada and the Czech R. This might be due to the presence or absence of some structural clue. We also observe that in the Czech R. context it was presumably much easier to distinguish between the Background and Analysis than in the other three contexts.

In this paper, we focus on prediction of the Background, Analysis, and Outcome types. We decided to exclude the Introductory Summary type, since it is mainly present in the data from the United States. For the double-annotated datasets, we picked the annotations that appeared to be of higher quality (either by consensus between the annotators themselves or by a decision of a third unbiased expert).

We first removed all the spans of text annotated with either the Out of Scope or Heading types. The removal of Out of Scope leaves the main body of a decision stripped of potential metadata or editorial content at the beginning of the document, as well as dissents or end notes at the end. The removal of the text spans annotated with the Heading type might appear counter-intuitive, since headings often provide a clear clue as to the content of the following section (e.g., “Outcome”, “Analysis” etc.). We remove these (potentially valuable) headings because we want to focus on the more interesting task of recognizing the sections purely by the semantics of their constitutive sentences. This task is more challenging, and more closely emulates generalization to domains where headings are not used, are not present in all cases, or are not reliable indicators.

The transformed documents are separated into several segments based on the annotations of the three remaining types. Each segment is then split into sentences.2 A resulting document is a se-
quence of sentences labeled with one of the Background, Analysis,
We created detailed annotation guidelines defining the individ- or Outcome types. The highlighted (green) part of Table 1 provides
ual types as well as describing the annotation workflow (tooling,
steps taken during annotation). Eight teams of researchers from six 2 We used the processing pipeline from https://spacy.io/ (large models). For the Czech
different countries (14 persons) were trained in the annotation pro- language we used https://github.com/TakeLab/spacy-udpipe with the Czech model
cess through online meetings. After this, each annotator conducted (PDT) from https://universaldependencies.org/. The output was further processed
with several regular expressions. A different method was used for the French dataset,
a dry-run annotation on 10 cases and received detailed feedback. which consists of a few very long sections, internally separated by a semicolon. After
Then, each team was tasked with assembling approximately 100 consultation with an expert we decided to split the cases by the semicolon as well.
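The sentence-splitting step from footnote 2 can be sketched as follows. This is a minimal illustration rather than the released pipeline: `naive_sentences` is a regex stand-in for the spaCy/spacy-udpipe sentencizers (which require downloaded models), and only the French semicolon rule is taken directly from the footnote. Both function names are hypothetical.

```python
import re

def naive_sentences(text):
    # Regex stand-in for the spaCy / spacy-udpipe sentencizers used in
    # the paper: split on sentence-final punctuation followed by space.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def split_segment(text, context):
    # Per footnote 2, the French decisions consist of a few very long
    # sections internally separated by semicolons, so they are split
    # on ";" instead of being run through a sentencizer.
    if context == "France":
        return [part.strip() for part in text.split(";") if part.strip()]
    return naive_sentences(text)

print(split_segment("The court finds for the plaintiff. Costs are awarded.", "Canada"))
print(split_segment("attendu que X; attendu que Y; par ces motifs", "France"))
```

Each annotated segment would be passed through a function like `split_segment`, yielding the labeled sentence sequences described above.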
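The character-level raw agreement reported in Table 2 can be computed per type as below. This is a hedged sketch: it assumes each annotation is given as one label (or `None`) per character, which is an assumption about the data format, not the authors' released code.

```python
def raw_agreement(ann_a, ann_b, label):
    """Character-level raw agreement for one type (cf. Table 2):
    characters both annotators marked with `label`, divided by the
    characters marked with `label` by either annotator."""
    both = sum(a == label and b == label for a, b in zip(ann_a, ann_b))
    either = sum(a == label or b == label for a, b in zip(ann_a, ann_b))
    return both / either if either else None  # None ~ "N/A" in Table 2

# Toy example: per-character labels over a 10-character span.
a = ["Back"] * 6 + ["Anl"] * 4
b = ["Back"] * 4 + ["Anl"] * 6
print(raw_agreement(a, b, "Back"))  # 4 characters agreed / 6 marked by either
```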
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Savelka and Westermann, et al.
Table 1: Descriptive statistics of the created dataset. Each entry provides information about the country, the language of the decisions (Lang), and the number of documents (Docs) in a specific context. The Sentence-Level Statistics subsection reports basic descriptive statistics focused on sentences as well as the number of sentences labeled with each type (OoS - Out of Scope, Head - Heading, Int.S. - Introductory Summary, Back - Background, Anl - Analysis, Out - Outcome). The part highlighted in green contains the counts of sentences labeled with the types we focus on in this work.

Canada (EN, 100 docs). Sentences: count 12168, avg 121.7, min 8, max 888. Types: OoS 873 (7.2%), Head 438 (3.6%), Int.S. 20 (0.2%), Back 3319 (27.3%), Anl 7190 (59.1%), Out 328 (2.7%). Random selection of cases retrieved from www.canlii.org from multiple provinces. The selection is not limited to any specific topic or court.

Czech R. (CS, 100 docs). Sentences: count 11283, avg 112.8, min 10, max 701. Types: OoS 945 (8.4%), Head 1257 (11.1%), Int.S. 2 (0.0%), Back 3379 (29.9%), Anl 5422 (48.1%), Out 278 (2.5%). A random selection of cases from the Constitutional Court (30), Supreme Court (40), and Supreme Administrative Court (30). Temporal distribution was taken into account.

France (FR, 100 docs). Sentences: count 5507, avg 55.1, min 8, max 583. Types: OoS 3811 (69.2%), Head 220 (4.0%), Int.S. 0 (0.0%), Back 485 (8.8%), Anl 631 (11.4%), Out 360 (6.5%). A selection of cases decided by the Cour de cassation between 2011 and 2019. A stratified sampling based on the year of publication of the decision was used to select the cases.

Germany (DE, 104 docs). Sentences: count 10724, avg 103.1, min 12, max 806. Types: OoS 406 (3.8%), Head 333 (3.1%), Int.S. 38 (0.4%), Back 2960 (27.6%), Anl 6697 (62.4%), Out 290 (2.7%). A stratified sample from the federal jurisprudence database spanning all federal courts (civil, criminal, labor, finance, patent, social, constitutional, and administrative).

Italy (IT, 100 docs). Sentences: count 4534, avg 45.3, min 10, max 207. Types: OoS 417 (9.2%), Head 1098 (24.2%), Int.S. 0 (0.0%), Back 986 (21.7%), Anl 1903 (42.0%), Out 130 (2.9%). The top 100 cases of the criminal courts stored between 2015 and 2020 mentioning "stalking" and keyed to Article 612 bis of the Criminal Code.

Poland (PL, 101 docs). Sentences: count 9791, avg 96.9, min 4, max 1232. Types: OoS 796 (8.1%), Head 303 (3.1%), Int.S. 0 (0.0%), Back 2736 (27.9%), Anl 5820 (59.4%), Out 136 (1.4%). A stratified sample from trial-level, appellate, and administrative courts, the Supreme Court, and the Constitutional Tribunal. The cases mention "democratic country ruled by law."

U.S.A. I (EN, 102 docs). Sentences: count 24898, avg 244.1, min 34, max 1121. Types: OoS 574 (2.3%), Head 1235 (5.0%), Int.S. 475 (1.9%), Back 6042 (24.3%), Anl 16098 (64.7%), Out 474 (1.9%). Federal district court decisions in employment law mentioning "motion for summary judgment," "employee," and "independent contractor."

U.S.A. II (EN, 100 docs). Sentences: count 10756, avg 107.6, min 24, max 397. Types: OoS 1766 (16.4%), Head 650 (6.0%), Int.S. 639 (5.9%), Back 3075 (28.6%), Anl 4402 (40.9%), Out 224 (2.1%). Administrative decisions from the U.S. Department of Labor. The top 100 rulings, ordered in reverse chronological order starting in October 2020, were selected.

Overall (6 languages, 807 docs). Sentences: count 89661, avg 105.6, min 4, max 1232. Types: OoS 9588, Head 5534, Int.S. 1174, Back 22982, Anl 48163, Out 2220.
Table 2: Raw agreement on a character level for the four datasets with two human annotators. The agreement is computed as the percentage of characters where both annotators agree on a specific type over all the characters annotated with that type by either of the annotators. (NM = Not Marked)

          OoS   Head  Int.S.  Back  Anl   Out   NM
Canada    97.2  68.2  44.0    83.3  92.2  79.9  43.4
Czech R.  80.3  54.6   0.0    92.6  94.5  46.9  10.0
France    93.5  92.5   N/A    43.0  72.2  99.1   1.0
U.S.A. I  90.8  71.0  74.2    78.4  93.7  91.1  18.4
Overall   91.8  70.4  72.1    82.1  92.6  77.3   3.8

basic descriptive statistics of the resulting dataset per the individual contexts. Our final dataset for analysis consists of 807 cases split into 74,539 annotated sentences.

4 MODELS

In our experiments we use the Language-Agnostic Sentence Representations (LASER) model [5] to encode sentences from different languages into a shared semantic space. Each document becomes a series of vectors, each representing the semantic content of a single sentence. We use these vectors to train a bidirectional Gated Recurrent Unit (GRU) model [10] for predicting sentence labels.

The LASER model is a language-agnostic bidirectional LSTM encoder coupled with an auxiliary decoder and trained on parallel corpora. The sentence embeddings are obtained by applying a max-pooling operation over the output of the encoder. The resulting sentence representations (after concatenating both directions) are 1024-dimensional. The released trained model,³ which we use in this work, supports 93 languages (including the six in our dataset) belonging to 30 different families and written in 28 different scripts. The model was trained on 223 million parallel sentences. The joint encoder itself has no information on the language or writing script of the tokenized text, while the tokenizer is language specific. It is even possible to mix multiple languages in one sentence. The focus of the LASER model is to produce vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task [5]. An interesting property of such universal multi-lingual sentence embeddings is the increased focus on the sentence semantics, as the syntax or other surface properties are unlikely to be shared among languages.

The GRU neural network [10] is an architecture based on a recurrent neural network (RNN) that is able to learn the mapping from a sequence of arbitrary length to another sequence. GRUs are able to either score a pair of sequences or to generate a target sequence given a source sequence (this work). In a bidirectional GRU, two separate sequences are considered (one from right to left and the other from left to right). Traditional RNNs work well for shorter sequences but cannot be successfully applied to long sequences due to the well-known problem of vanishing gradients. Long Short-Term Memory (LSTM) networks [18] have been used as an effective solution to this problem (the forget gate, along with the additive property of the cell state gradients). GRUs have been proposed as an alternative to LSTMs with a reduced number of parameters. In GRUs there is no explicit memory unit, and the forget gate and the update gate are combined. The performance of GRUs was shown to be superior to that of LSTMs in the scenario of long texts and small datasets [39], which is the situation in this work. For these reasons, we chose to use GRUs over LSTMs.

The overall structure of the employed model is shown in Figure 1.⁴ Each case is transformed into a 1080 × 1024 matrix. The number 1080 represents the maximum length (in sentences) of any case in our dataset. Shorter cases are padded to be of uniform length. The vectors are passed to the model in batches of size 32. They first go through a masking layer, which masks the sentences used for padding.

³ https://github.com/facebookresearch/LASER
⁴ The model was implemented using the Keras framework (https://keras.io/).
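The pooling step described above can be illustrated with NumPy. The random matrices below merely stand in for real encoder outputs (obtaining actual LASER vectors requires the released model), and the 512-dimensions-per-direction split is an assumption used only to show how a 1024-dimensional concatenated representation arises.

```python
import numpy as np

def pool_sentence_embedding(fwd_states, bwd_states):
    """Form a sentence vector the way the text describes LASER doing it:
    concatenate the forward and backward encoder states per token
    (512 + 512 = 1024 dims), then max-pool over the token dimension."""
    token_states = np.concatenate([fwd_states, bwd_states], axis=1)  # (n_tokens, 1024)
    return token_states.max(axis=0)                                  # (1024,)

rng = np.random.default_rng(0)
fwd = rng.normal(size=(12, 512))  # stand-ins for real BiLSTM outputs
bwd = rng.normal(size=(12, 512))
emb = pool_sentence_embedding(fwd, bwd)
print(emb.shape)  # (1024,)
```

Note that max-pooling over tokens makes the sentence vector independent of sentence length, which is what allows sentences of any language and length to share one embedding space.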
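The padding and masking described in the last paragraph can be sketched as follows (a NumPy sketch with a zero `mask_value`; the actual mask value used in the authors' Keras implementation is not stated in the text):

```python
import numpy as np

MAX_SENTS, EMB_DIM = 1080, 1024  # per the paper: each case -> 1080 x 1024

def pad_case(sentence_vectors, mask_value=0.0):
    """Pad (or truncate) one case's sentence embeddings to a fixed-size
    matrix so that cases can be batched; a masking layer configured with
    the same mask_value can then skip the padded rows."""
    doc = np.full((MAX_SENTS, EMB_DIM), mask_value, dtype=np.float32)
    n = min(len(sentence_vectors), MAX_SENTS)
    doc[:n] = sentence_vectors[:n]
    return doc

case = np.ones((37, EMB_DIM), dtype=np.float32)  # a 37-sentence case
padded = pad_case(case)
print(padded.shape)  # (1080, 1024)
```

Stacking 32 such matrices gives one training batch of shape (32, 1080, 1024), matching the batch size mentioned above.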
Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains ICAIL’21, June 21–25, 2021, São Paulo, Brazil
[Figure 1 (graphic omitted): The structure of the sequential model used for prediction. Each case 𝑚 is split into 𝑛 sentences, which are con- […caption truncated]. Stray "Train/Val/Test" fold labels from the accompanying cross-validation diagram are not recoverable from the extraction.]
that the models are able to transfer (at least some) knowledge from
one context to another.
Table 4: Results of the Out-Context, Pooled Out-Context, and Pooled with In-Context experiments. Each row reports the performance of a model across contexts. A bold number in each cell reports the (micro) average F1-score over the predicted classes and the standard deviation across the 10 folds. The F1-scores of the three classes are reported below, ordered as Background, Analysis, Outcome. A visual explanation of cell contents can be found in Figure 3. The random (H1), the best single Out-Context models (H2), and the In-Context (H3) baselines are highlighted. The p-values are color coded to match the baselines.
Canada Czech R. France Germany Italy Poland U.S.A. I U.S.A. II Avg (-test) Avg (+test)
Random .54 ± .04 .49 ± .02 .33 ± .05 .56 ± .04 .50 ± .04 .55 ± .04 .58 ± .03 .51 ± .02 .51 ± .07
(dist.) .31 .66 .06 .36 .59 .04 .32 .37 .25 .29 .68 .03 .31 .62 .04 .32 .67 .00 .26 .72 .02 .37 .61 .02 .32 .62 .10
Canada .82 ± .09 .68 ± .08 .64 ± .09 .81 ± .06 .73 ± .08 .73 ± .07 .87 ± .05 .88 ± .03 .76 ± .09 .77 ± .08
𝑝 = .016 .75 .87 .70 .53 .80 .39 .64 .66 .57 .71 .89 .00 .55 .82 .66 .50 .85 .00 .75 .92 .69 .87 .90 .70 .65 .83 .43 .66 .84 .46
Czech R. .76 ± .09 .91 ± .04 .47 ± .08 .82 ± .08 .82 ± .05 .83 ± .07 .84 ± .06 .86 ± .05 .77 ± .13 .79 ± .13
𝑝 = .016 .71 .80 .31 .90 .92 .64 .52 .36 .48 .75 .89 .01 .81 .84 .40 .73 .90 .00 .73 .90 .28 .86 .88 .41 .73 .80 .27 .75 .81 .32
France .71 ± .07 .61 ± .06 .86 ± .08 .66 ± .10 .73 ± .08 .65 ± .07 .72 ± .07 .68 ± .08 .68 ± .04 .70 ± .07
𝑝 = .016 .45 .83 .69 .37 .78 .45 .81 .83 .98 .37 .82 .00 .59 .81 .69 .33 .82 .00 .32 .86 .69 .48 .81 .73 .42 .82 .46 .47 .82 .53
Germany .72 ± .10 .64 ± .09 .29 ± .12 .88 ± .11 .69 ± .09 .77 ± .09 .73 ± .10 .83 ± .07 .67 ± .16 .69 ± .17
𝑝 = .031 .68 .76 .01 .42 .81 .01 .42 .32 .00 .82 .93 .66 .50 .84 .00 .54 .88 .54 .47 .85 .00 .82 .88 .00 .55 .76 .08 .58 .78 .15
Italy .55 ± .12 .76 ± .09 .63 ± .08 .78 ± .09 .95 ± .02 .73 ± .10 .53 ± .13 .74 ± .08 .67 ± .10 .71 ± .13
𝑝 = .047 .57 .55 .49 .74 .81 .12 .69 .63 .52 .73 .83 .00 .92 .96 .94 .66 .78 .00 .50 .54 .42 .74 .74 .63 .66 .70 .31 .69 .73 .39
Poland .76 ± .08 .83 ± .05 .38 ± .11 .85 ± .08 .83 ± .05 .93 ± .05 .73 ± .08 .83 ± .07 .74 ± .15 .77 ± .16
𝑝 = .016 .66 .84 .00 .82 .89 .01 .48 .52 .00 .73 .91 .44 .80 .91 .00 .89 .95 .88 .44 .86 .00 .82 .87 .00 .68 .83 .06 .71 .84 .17
U.S.A. I .83 ± .06 .65 ± .08 .47 ± .14 .81 ± .07 .65 ± .15 .67 ± .09 .91 ± .03 .89 ± .03 .71 ± .13 .74 ± .14
𝑝 = .016 .76 .87 .59 .45 .79 .49 .35 .61 .35 .71 .89 .00 .40 .80 .58 .38 .83 .00 .84 .94 .73 .87 .91 .68 .56 .81 .38 .60 .83 .43
U.S.A. II .81 ± .08 .67 ± .06 .53 ± .15 .84 ± .10 .75 ± .10 .70 ± .08 .86 ± .05 .94 ± .02 .74 ± .11 .76 ± .12
𝑝 = .016 .74 .86 .65 .49 .80 .31 .50 .63 .41 .76 .91 .00 .57 .84 .75 .43 .84 .00 .72 .92 .59 .93 .96 .82 .60 .83 .39 .64 .85 .44
Pooled .83 ± .06 .87 ± .03 .66 ± .08 .90 ± .04 .85 ± .04 .88 ± .05 .81 ± .10 .92 ± .03 .84 ± .08
𝑝 = .148 .77 .86 .66 .87 .91 .03 .59 .67 .71 .87 .95 .01 .81 .89 .65 .83 .94 .01 .64 .88 .73 .91 .93 .65 .79 .88 .43
Pooled+ .88 ± .05 .94 ± .03 .82 ± .09 .96 ± .02 .94 ± .04 .94 ± .04 .92 ± .03 .96 ± .02 .92 ± .04
𝑝 = .195 .83 .91 .77 .94 .95 .70 .76 .78 .96 .94 .98 .64 .91 .95 .90 .92 .97 .65 .86 .95 .84 .95 .96 .80 .89 .93 .78
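Table 4's bold numbers are micro-averaged F1-scores over the three predicted classes. For single-label sentence classification, pooling true/false positives and false negatives across classes makes the micro average coincide with plain accuracy, as the small sketch below shows (the integer label encoding is illustrative):

```python
import numpy as np

def micro_f1(y_true, y_pred, labels=(0, 1, 2)):  # 0=Back, 1=Anl, 2=Out (illustrative)
    """Micro-averaged F1: pool true positives, false positives, and
    false negatives over all classes before computing a single F1."""
    tp = sum(np.sum((y_pred == l) & (y_true == l)) for l in labels)
    fp = sum(np.sum((y_pred == l) & (y_true != l)) for l in labels)
    fn = sum(np.sum((y_pred != l) & (y_true == l)) for l in labels)
    return 2 * tp / (2 * tp + fp + fn)

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 1, 0])
print(micro_f1(y_true, y_pred))  # equals the accuracy, 4/6
```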
7.2 Pooled Out-Context Experiment (H2)

The performance of the Pooled Out-Context models is reported in the Pooled row of Table 4. The experiment concerns the resulting models' robustness, i.e., whether a model trained on multiple contexts adapts well to unseen contexts. We are especially interested in whether such a model adapts better than the models trained on single contexts.

The results suggest that training on multiple contexts leads to models that are robust and perform better than the models trained on a single context. The multi-context models outperform the best single extra-context model baseline in 7 out of 8 cases. The p = 0.148 needs to be understood in terms of the small number of samples (contexts) and the competitiveness of the baseline.

Interestingly, the Pooled Out-Context models even appear to be competitive with several In-Context models (Canada, Czech R., Germany, U.S.A. II). The overall average F1-scores are often quite high (over 0.80 or 0.90). This is a surprising outcome considering the fact that no data from the context on which a model is evaluated is used during training.

7.3 Pooled with In-Context Experiment (H3)

The performance of the Pooled with In-Context model is reported in the Pooled+ row of Table 4. This experiment models a scenario where a sample of labeled data from the target context is available. The question is whether combining In-Context data with data from other contexts leads to improved performance.

The results appear to suggest that pooling the target context with data from other contexts does lead to improved performance. In the case of 3 out of the 8 contexts (Canada, Czech R., and U.S.A. I) the improvement is clear and substantial across all three classes (Background, Analysis, Outcome). For three additional contexts (Germany, Poland, and U.S.A. II), the performance also improved in terms of the overall (micro) F1-score, but took a slight hit for the challenging Outcome class. With respect to the two remaining contexts (France, Italy), the overall performance of the pooled models is lower than that of the In-Context models. As in the previous experiment, the p = 0.195 needs to be understood in terms of the small number of samples (contexts) and the high competitiveness of the In-Context baseline.
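The document-level visualization discussed in Section 8 below (Figure 4) averages the LASER sentence embeddings of each case and projects the resulting 1024-dimensional document vectors to 2-D with PCA. A NumPy sketch, with random vectors standing in for real embeddings and the PCA done via SVD rather than a library call:

```python
import numpy as np

def document_vector(sentence_embeddings):
    # One 1024-dim vector per case: the average of its sentence embeddings.
    return sentence_embeddings.mean(axis=0)

def pca_2d(doc_vectors):
    """Center the document vectors and project them onto the two leading
    principal components (right singular vectors of the centered matrix)."""
    centered = doc_vectors - doc_vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # (n_docs, 2)

rng = np.random.default_rng(1)
# Stand-ins for 807 cases, each with a random number of 1024-dim sentence vectors.
docs = np.stack([document_vector(rng.normal(size=(rng.integers(5, 50), 1024)))
                 for _ in range(807)])
coords = pca_2d(docs)
print(coords.shape)  # (807, 2)
```

The two output coordinates per case are what a scatter plot like Figure 4 would display.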
Figure 4: Average LASER embeddings for each case document, projected to 2-dimensional space using Principal Component Analysis.

8 DISCUSSION

It appears that the multilingual sentence embeddings generated by the LASER model excel at capturing the semantics of a sentence. A model trained on a single context could in theory capture the specific vocabulary used in that context. This would almost certainly lead to poor generalization across different contexts. However, the performance statistics we observed when transferring the Out-Context models to other contexts suggest that the sentence embeddings provide a representation of the sentences that enables the model to learn aspects of the meaning of a sentence, rather than mere surface linguistic features.

The results clearly point to certain relationships where contexts within the same or related languages appear to work well together, e.g., {Canada, U.S.A. I, and U.S.A. II} or {Czech R. and Poland}. This could indicate that the multi-lingual embeddings work better when the language is the same or similar. It could also point to similarities in legal traditions, e.g., the use of somewhat similar sentences to indicate a transition from one type of section to the next. Note that we removed headings, which means that such explicit cues could not be relied on by the models we trained on the transformed documents. Finally, the cause could also be topical (domain) similarity of the contexts (e.g., both U.S.A. contexts deal with employment law). Also note that the above are just possible explanations; we did not perform feature importance analysis on the models.

To gain insight into this phenomenon, we visualized the relationships among the contexts on a document level. We first calculated the average sentence embedding for each document. This yielded 1024-dimensional vectors for 807 documents representing their semantics. We arranged the resulting vectors in a matrix (1024 × 807) and performed a Principal Component Analysis (PCA), reducing the dimensionality of the document vectors to 2. This operation enabled the convenient visualization shown in Figure 4.

Overall, the cases from the same contexts appear to cluster together. This is expected, as documents written in the same language, having similar topics, or sharing some similarities due to legal traditions are likely to map to vectors that are closer. The documents from both U.S.A. contexts occupy the same region in the embedding space. This is not surprising, as they come from the same jurisdiction, are written in the same language, and deal with similar topics (employment law). The Canadian cases, which are also in English, occupy a nearby space. This could be linked to the language as well as likely similarities in legal traditions. French, German, and Italian documents occupy the middle space; they are closer to the English documents than those from the Czech R. and Poland. Interestingly, the Czech and Polish documents occupy almost the same space. The Polish context focuses on the rule of law while the Czech one is supposed to be more general. As the latter deals with the decisions of the top-tier courts (one of them constitutional), it is possible that the topics substantially overlap. Moreover, Poland and the Czech R. share similar legal traditions and languages from the same family (Slavic), so the close proximity of the documents in the embedding space might not be unexpected. Finally, German, French, and Canadian cases occupy wider areas than documents from other contexts. This could be due to their lack of focus on specific legal domains.

We observed a peculiar phenomenon where the Out-Context models trained on the German and Polish contexts failed to detect Outcome sentences in the six remaining contexts and vice versa. The cause is readily apparent from the visualization shown in Figure 5. Each segment of the figure depicts the spatial distribution of sentences, color coded with their labels, across the documents of a particular context. As can be seen, the cases typically follow a pattern of a contiguous Background segment, followed by a long Analysis section. The several Outcome sentences are placed at the very end of the documents. In the Polish and German decisions, however, the Outcome sentences come first. The GRU models we use rely on the structure as well as the semantics in making their predictions. As we can see, a model trained exclusively on cases that begin with a Background might therefore have difficulties correctly identifying Outcome sections at the beginning, and vice versa. However, as we
will see below, a model trained with data featuring both structures can learn to identify the correct structure based on the semantics of the sentences.

The model trained on the French context appears to perform better at detecting Outcome sentences than models trained on other contexts. This is somewhat surprising, as the French model's overall performance is among the weakest (e.g., compare the Czech model's average F1 = 0.77 to the F1 = 0.68 of the French model). Again, Figure 5 provides an insight into why this happens. The French context is the only one where the count of Outcome sentences is comparable to those of the other two categories. For all the other contexts, the Outcome sentences are heavily underrepresented. This reveals an interesting direction for future work, where the use of re-sampling may yield models with better sensitivity for identifying Outcome sentences.

In two instances, models trained on a single context under-performed the random baseline. The model trained on the Germany context achieved an average F1 = 0.29 when applied to the France context (Random F1 = 0.33). As the model trained on Polish data also performed poorly on the France context (F1 = 0.38), the cause appears to be the inability of the two models to detect the Outcome sentences at the end (discussed above). As the Outcome sentences are heavily present in the French context, this problem manifests in a score lower than the random baseline. The second instance is the model trained on the Italy context applied to the U.S.A. I data (F1 = 0.53 versus F1 = 0.58 for Random). Here, the cause appears to be different. Note that the Italian context likely has a very specific notion of Outcome sentences (F1 = 0.94 on Italy→Italy). It appears that many Analysis sentences from the U.S.A. I context were labeled as Outcome by the model. Summary judgments often address multiple legal issues with their own conclusions, which could have triggered the model to label such sentences as Outcome.

An important finding is the performance of the Pooled Out-Context model (H2) shown in the Pooled row of Table 4. The experiment simulates training a model on several contexts and then applying it to an unseen context. The Pooled Out-Context models, having no access to the data from a target context, reliably outperform the best single Out-Context models. They appear to be competitive with several In-Context models. These results are achieved with a fairly small dataset of 807 cases. We expect that expanding the dataset would lead to further improved performance.

The Pooled with In-Context experiment (H3) models the situation where data from a target context is available in addition to labeled data from other contexts. Our experiments indicate that the use of data from other contexts (if available) in addition to data from the target context is preferable to the use of the data from the target context only. This is evidenced by the improved performance of the models trained on the pooled contexts over the single In-Context models. The models have an interesting property of being able to identify the Outcome sentences with effectiveness comparable to (or higher than) that of the models trained on the same context only. This holds for all the contexts except Poland, where the Outcome performance is a bit lower (0.65 vs. 0.88). This indicates that the model is able to learn the two possible modes of the Outcome section placement. It successfully distinguishes cases where the section is at the beginning from the cases where the Outcome sentences are found toward the end.

The inclusion of the In-Context data in the pooled data leads to a remarkable improvement over only using the pooled Out-Context data. The magnitude of the improvement highlights the importance of including such data in the training. We envision that models trained on different contexts, used in combination with high-speed similarity annotation frameworks [35, 36], could enable highly cost-efficient annotation in situations where resources are scarce. Perhaps adapting a model to an entirely new context could be as simple as starting with a model trained on other contexts and spending a few hours correcting the misconceptions of the model to teach it the particularities of the new context.

9 CONCLUSIONS

We analyzed the use of multi-lingual sentence embeddings in sequence labeling models to enable transfer across languages, jurisdictions, legal systems (common and civil law), and domains. We created a new type schema for functional segmentation of adjudicatory decisions and used it to annotate legal cases across eight different contexts. We found that models generalize beyond the contexts they were trained on and that training the models on multiple contexts increases their robustness and improves the overall performance when evaluating on previously unseen contexts. We also found that pooling the training data of a model with data from additional contexts enhances its performance on the target context. The results are promising in enabling the re-use of annotated data across contexts and creating generalizable and robust models. We release the newly created dataset (807 documents with 89,661 annotated sentences), including the annotation schema and the code used in our experiments, to the public.

This work suggests a promising path for the future of international collaboration in the field of AI & Law. While previous annotation efforts have typically been limited to a single context, the experiments presented here suggest that researchers can work together by annotating cases from many different contexts at the same time. Such a combined effort could aid researchers in creating models that perform well on the data from the context they care about, while at the same time helping other groups train even better models for other contexts. We encourage these research directions and hope to form such collaborations under the Lex Rosetta project.

10 FUTURE WORK

The application of multi-lingual sentence embeddings to functional segmentation of case law across different contexts yielded promising results. At the same time, the work is subject to limitations and leaves much room for improvement. Hence, we suggest several directions for future work:

• Extension of the datasets from different contexts used in this work beyond ~100 documents per context.
• Annotation of data from contexts beyond the eight used here (multi-lingual models support close to 100 languages).
• Analysis of automatic detection of Introductory Summary, Headings, and Out of Scope.
• Identification and investigation of other tasks applicable across different contexts.
• Evaluation of the application of other multilingual models (e.g., those mentioned in Section 2).
• Exploring other transfer learning strategies beyond simple data pooling, such as the framework proposed in [26].
• Using multi-lingual models for annotation tasks with a high-speed annotation framework, such as [35, 36].
• Performing the transfer across contexts with related (but different) tasks, such as in [28].
• Further exploring the differences in the distribution of the multilingual embeddings for purposes of comparing and analyzing domains, languages, and legal traditions.

ACKNOWLEDGMENTS

Hannes Westermann, Karim Benyekhlef, and Kevin D. Ashley would like to thank the Cyberjustice Laboratory at Université de Montréal, the LexUM Chair on Legal Information and the Autonomy through Cyberjustice Technologies (ACT) project for their support of this research. Kevin D. Ashley also thanks the Canadian Legal Information Institute for providing the corpus of legal cases. Matthias Grabmair thanks SINC GmbH for supporting this research. Jakub Harašta and Tereza Novotná acknowledge the support of the ERDF project "Internal grant agency of Masaryk University" (No. CZ.02.2.69/0.0/0.0/19_073/0016943).

REFERENCES
[1] Tommaso Agnoloni, Lorenzo Bacci, Enrico Francesconi, P Spinosa, Daniela Tiscornia, Simonetta Montemagni, and Giulia Venturi. 2007. Building an ontological support for multilingual legislative drafting. Frontiers in Artificial Intelligence and Applications 165 (2007), 9.
[2] Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively Multilingual Neural Machine Translation. In NAACL-HLT, Vol. 1 (Long and Short Papers). 3874–3884.
[3] Gianmaria Ajani, Guido Boella, Luigi Di Caro, Livio Robaldo, Llio Humphreys, Sabrina Praduroux, Piercarlo Rossi, and Andrea Violato. 2016. The European Legal Taxonomy Syllabus: A multi-lingual, multi-level ontology framework to untangle the web of European legal terminology. Applied Ontology 11, 4 (2016).
[4] Sai Saket Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee. 2020. Deep learning models for multilingual hate speech detection. arXiv preprint arXiv:2004.06465 (2020).
[5] Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7 (2019), 597–610.
[6] Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and Adam Wyner. 2019. Identification of rhetorical roles of sentences in Indian legal judgments. In JURIX 2019, Vol. 322. IOS Press, 3.
[7] Guido Boella, Luigi Di Caro, Michele Graziadei, Loredana Cupi, Carlo Emilio Salaroglio, Llio Humphreys, Hristo Konstantinov, Kornel Marko, Livio Robaldo, Claudio Ruffini, et al. 2015. Linking legal open data: breaking the accessibility and
[14] … WordNet architecture. In ICAIL 2005. 163–167.
[15] Atefeh Farzindar and Guy Lapalme. 2004. LetSum, an Automatic Text Summarization system in Law field. In JURIX 2004.
[16] Jorge González-Conejero, Pompeu Casanovas, and Emma Teodoro. 2018. Business Requirements for Legal Knowledge Graph: the LYNX Platform. In TERECOM@JURIX 2018. 31–38.
[17] Jakub Harašta, Jaromír Šavelka, František Kasl, and Jakub Míšek. 2019. Automatic Segmentation of Czech Court Decisions into Multi-Paragraph Parts. Jusletter IT 4, M (2019).
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[19] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[20] Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 2009. 462 Machine Translation Systems for Europe. In Proceedings of the Twelfth Machine Translation Summit. Association for Machine Translation in the Americas, 65–72.
[21] Guokun Lai, Barlas Oguz, Yiming Yang, and Veselin Stoyanov. 2019. Bridging the domain gap in cross-lingual document classification. arXiv preprint arXiv:1909.07009 (2019).
[22] Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating Cross-lingual Extractive Question Answering. In ACL 2020. 7315–7330.
[23] D.N. MacCormick, R.S. Summers, and A.L. Goodhart. 2016. Interpreting Precedents: A Comparative Study. Taylor & Francis.
[24] Rohan Nanda, Llio Humphreys, Lorenzo Grossio, and Adebayo Kolawole John. 2020. Multilingual Legal Information Retrieval System for Mapping Recitals and Normative Provisions. In JURIX 2020. IOS Press, 123–132.
[25] Alina Petrova, John Armour, and Thomas Lukasiewicz. 2020. Extracting Outcomes from Appellate Decisions in US State Courts. In JURIX 2020. 133.
[26] Jaromír Šavelka and Kevin D Ashley. 2015. Transfer of predictive models for classification of statutory texts in multi-jurisdictional settings. In ICAIL 2015. 216–220.
[27] Jaromír Šavelka and Kevin D Ashley. 2018. Segmenting US Court Decisions into Functional and Issue Specific Parts. In JURIX 2018. 111–120.
[28] Jaromír Šavelka, Hannes Westermann, and Karim Benyekhlef. 2020. Cross-Domain Generalization and Knowledge Transfer in Transformers Trained on Legal Data. In ASAIL@JURIX 2020.
[29] Páraic Sheridan, Martin Braschler, and Peter Schäuble. 1997. Cross-language information retrieval in a Multilingual Legal Domain. In International Conference on Theory and Practice of Digital Libraries. Springer, 253–268.
[30] Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schluter, Marek Przybyszewski, and Signe Gilbro. 2014. An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation 48, 4 (2014), 679–707.
[31] Kyoko Sugisaki, Martin Volk, Rodrigo Polanco, Wolfgang Alschner, and Dmitriy Skougarevskiy. 2016. Building a Corpus of Multi-lingual and Multi-format International Investment Agreements. In JURIX 2016.
[32] Linyuan Tang and Kyo Kageura. 2019. Verifying Meaning Equivalence in Bilingual International Treaties. In JURIX 2019. 103–112.
[33] Vern R Walker, Krishnan Pillaipakkamnatt, Alexandra M Davidson, Marysa Linares, and Domenick J Pesce. 2019. Automatic Classification of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning. In ASAIL@ICAIL 2019.
[34] Hannes Westermann, Jaromír Šavelka, and Karim Benyekhlef. 2021. Paragraph Similarity Scoring and Fine-Tuned BERT for Legal Information Retrieval and
language barrier in european legislation and case law. In ICAIL 2015. 171–175. Entailment. In New Frontiers in Artificial Intelligence (Lecture Notes in Computer
[8] Paul Boniol, George Panagopoulos, Christos Xypolopoulos, Rajaa El Hamdani, Science). Springer International Publishing.
David Restrepo Amariles, and Michalis Vazirgiannis. 2020. Performance in the [35] Hannes Westermann, Jaromír Šavelka, Vern R Walker, Kevin D Ashley, and Karim
Courtroom: Automated Processing and Visualization of Appeal Court Decisions Benyekhlef. 2019. Computer-Assisted Creation of Boolean Search Rules for Text
in France. In Proceedings of the Natural Legal Language Processing Workshop 2020. Classification in the Legal Domain. In JURIX 2019, Vol. 322. IOS Press, 123.
[9] Karl Branting, Brandy Weiss, Bradford Brown, Craig Pfeifer, A Chakraborty, Lisa [36] Hannes Westermann, Jaromír Šavelka, Vern R Walker, Kevin D Ashley, and Karim
Ferro, M Pfaff, and A Yeh. 2019. Semi-supervised methods for explainable legal Benyekhlef. 2020. Sentence Embeddings and High-Speed Similarity Search for
prediction. In ICAIL 2019. 22–31. Fast Computer Assisted Annotation of Legal Documents. In JURIX 2020, Vol. 334.
[10] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, IOS Press, 164.
Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase [37] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Break-
Representations using RNN Encoder-Decoder for Statistical Machine Translation. throughs in statistics. Springer, 196–202.
In EMNLP 2014. [38] Huihui Xu, Jaromír Šavelka, and Kevin D Ashley. 2020. Using Argument Mining
[11] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guil- for Legal Text Summarization. In JURIX 2020, Vol. 334. IOS Press.
laume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, [39] Shudong Yang, Xueying Yu, and Ying Zhou. 2020. LSTM and GRU Neural Network
and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning Performance Comparison Study: Taking Yelp Review Dataset as an Example. In
at Scale. In ACL 2020. 8440–8451. IWECAI 2020. IEEE, 98–101.
[12] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. [40] Vladimir Zhebel, Denis Zubarev, and Ilya Sochenkov. 2020. Different Approaches
Journal of Machine learning research 7, Jan (2006), 1–30. in Cross-Language Similar Documents Retrieval in the Legal Domain. In Interna-
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: tional Conference on Speech and Computer. Springer, 679–686.
Pre-training of Deep Bidirectional Transformers for Language Understanding. In [41] Linwu Zhong, Ziyi Zhong, Zinian Zhao, Siyuan Wang, Kevin D Ashley, and
NAACL 2019, Volume 1 (Long and Short Papers). 4171–4186. Matthias Grabmair. 2019. Automatic summarization of legal decisions using
[14] Luca Dini, Wim Peters, Doris Liebwald, Erich Schweighofer, Laurens Mommers, iterative masking of predictive sentences. In ICAIL 2019. 163–172.
and Wim Voermans. 2005. Cross-lingual legal information retrieval using a
Converting Copyright Legislation into Machine-Executable Code: Interpretation, Coding Validation and Legal Alignment

Alice Witt
Law School, Queensland University of Technology
Brisbane, Queensland, Australia
ae.witt@qut.edu.au

Anna Huggins
Law School, Queensland University of Technology
Brisbane, Queensland, Australia
ABSTRACT
A critical challenge in “Rules as Code” (“RaC”) initiatives is enhancing legal accuracy. In this paper, we present the preliminary results of a two-week, first-of-its-kind experiment that aims to shed light on how different legally trained people interpret and convert Australian Commonwealth legislation into machine-executable code. We find that coders collaboratively agreeing on key legal terms, or atoms, before commencing independent coding work can significantly increase the similarity of their encoded rules. Participants nonetheless made a range of divergent interpretive choices, which we argue are most likely due to: (1) the complexity of statutory interpretation, (2) encoded provisions having varying levels of granularity, and (3) the functionality of our coding language. Based on these findings, we draw an important distinction between processes for technical validation of encoded rules, which focus on ensuring rules adhere to select coding languages and conventions, and processes of legal alignment, which we conceptualise as enhancing congruence between the encoded provisions and the true meaning of the statutory text in line with the modern approach to statutory interpretation. We argue that these processes are distinct but both critically important in enhancing the accuracy of encoded rules. We conclude by underlining the need for multi-disciplinary expertise across specific legal subject matters, statutory interpretation and technical programming in RaC initiatives.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; • Applied computing → Law.

KEYWORDS
Rules as code, machine-executable code, statutory interpretation, legal alignment, technical processes

ACM Reference Format:
Alice Witt, Anna Huggins, Guido Governatori, and Joshua Buckley. 2021. Converting Copyright Legislation into Machine-Executable Code: Interpretation, Coding Validation and Legal Alignment. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466083

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06...$15.00
https://doi.org/10.1145/3462757.3466083

1 BACKGROUND
In recent years, there has been significant growth in the “Rules as Code” (“RaC”) movement, a label for diverse initiatives that re-evaluate how, when, and for whom government rules are made [24]. There are two main RaC approaches, the first of which involves converting existing regulation, including statutes, into machine-executable code: “a coded representation of the actual rules in the legislation, written in a computer language, so that computers can read it and then use it to carry out programs” [29, p. 27]. An application of this approach is regulatory technology (“RegTech”) [31], which can help firms comply and stay up to date with the rules governing their commercial activities [24, 34]. The second approach involves “co-drafting” regulation in both natural language and a machine-consumable format, one that can “enable computers to model the effect of the law” [3, p. 76], at the same time. A distinguishing feature of co-drafting is the potential for “digital first” rules [33], which not only represent an output, but also a “strategic and deliberate approach to rulemaking” [24, p. 81]. RaC is therefore related to, yet separate from, “computational law”, which investigates whether regulation can and should be represented in computer code, among other lines of inquiry [30], and “automated decision making”, which refers to decisions made by automated means with varying levels of human involvement [18].

While decades of research inform these overlapping areas, and governments including New Zealand, Australia, France and Canada have made significant inroads with the practical RaC movement [27, 30], challenges persist in ensuring that encoded rules are transparent, traceable, appealable and legally accurate. It can be particularly difficult to enhance the legal accuracy of digitised legislation
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Alice Witt, Anna Huggins, Guido Governatori, and Joshua Buckley
in a way that aligns with the correct legal interpretation of the select statute(s). In the Australian legal system, this challenge is made more complex by the strict separation of judicial, legislative and executive powers in the Australian Constitution, under which only the judiciary can authoritatively interpret the meaning of statutes [18]. A result is that other interpreters, including individual coders and RegTech companies, are expected to mirror the courts’ approach to interpreting statutes [5].

Against this backdrop, we commenced a multi-disciplinary and collaborative research project that seeks to identify the legal and coding challenges of converting Australian Commonwealth legislation into a machine-executable format. Our research largely falls within the first RaC approach given that we are focusing on legislation that already exists. In this paper, as part of the broader research project, we present a first-of-its-kind experiment that aims to shed light on how different legally trained people (T = 3 participants over a two-week period) interpret and convert select provisions of the Copyright Act 1968 (Cth) (“the CA”, “the Act”) into computer code. By examining the choices that interpreters make when converting existing legislation into machine-executable code, we aim to provide critical insights into the coding, statutory interpretation and other issues that arise in the encoding process. This can, in turn, facilitate new understandings of how stakeholders can promote alignment between the languages and logics of statutes and encoded rules for the RaC movement more broadly.

This paper proceeds in five sections. In Section 2, we provide a brief overview of Australian copyright law and, in Section 3, we explain the Defeasible Deontic Logic and Turnip encoding software that participants used to convert select provisions of the CA into computer code. Next, in Section 4, we explain our experiment design and methods. Then, in Section 5, we outline our preliminary results, which support our hypothesis that coders collaboratively agreeing on key atoms, or legal terms, before commencing independent coding work increases the similarity of their coded output. Despite a significant increase in the average similarity of atoms and, to a lesser extent, rules in Week 2, participants made a range of divergent interpretive choices, most likely due to: (1) the complexity of statutory interpretation, (2) encoded provisions having varying levels of granularity, and (3) the functionality of our coding language, Turnip. These differences underline the complexity of attempting to reproduce the languages and logics of a statutory text in machine-executable code.

Overall, we argue that processes for technical coding validation and legal alignment are distinct but both critically important in enhancing the accuracy of RaC. On the one hand, processes for technical validation can be automated and/or manual, and aim to ensure encoded rules adhere to select coding languages and conventions. On the other hand, a process of legal alignment is concerned with enhancing congruence between the encoded rules and the true meaning of the select legislation in line with the modern approach to statutory interpretation, an undertaking that heavily relies on human judgment. This approach, as articulated and applied by the High Court of Australia in Project Blue Sky v Australian Broadcasting Corporation (1998) 194 CLR 355, “requires a combined exercise involving analysis of the text, context and purpose (or policy) of the statute in question” [20, p. 116]. The key task of statutory interpretation is therefore to find the legal meaning of the statutory words (the text) taking into account their context and purpose [26]. We conclude this paper, in Section 6, by highlighting the need in RaC initiatives for multi-disciplinary expertise across specific legal subject matters, statutory interpretation and technical programming. We also identify important areas for future research, such as testing our hypothesis with a larger number of participants, and further exploring what the separate yet interrelated processes for technical coding validation and legal alignment might entail in practice.

2 COPYRIGHT LAW
Copyright is a body of intellectual property law that “confers rights in relation to the reproduction and dissemination of material that expresses ideas or information” [8, p. 7]. For material to be protected under Australian copyright law, it must: (1) fall under one of the established categories of subject matter, including literary, dramatic, musical or artistic “works” (known as “Part III (of the CA) works”), which we focus on in this experiment; and (2) be sufficiently connected to Australia (see ss 10 and 32 of the CA). Additionally, Part III works must be: (3) recorded in “material form” and (4) “original” (s 32 of the CA) [8, ch. 6]. When the relevant criteria for subsistence of copyright are satisfied, the copyright owner has certain exclusive rights (see, e.g., Divisions 1 and 2, Part III of the CA), which a party can directly or indirectly infringe [8, pp. 268, 285].

We focus on copyright law for several reasons, including our research team having some copyright law expertise, and because this body of law is widely applicable to a range of stakeholders but often difficult for lay audiences to understand [7]. In practical terms, our research could be of benefit to large technology companies, galleries, libraries, educational institutions and archives that often deal with copyright issues in bulk [28]. It could also potentially benefit a large number of small content creators, who routinely rely on existing material [2]. In research terms, by focusing on law that already exists, we are generating new knowledge particularly relevant to the first RaC approach. This knowledge can in turn inform the RaC movement more broadly.

3 FROM LEGISLATION TO ENCODED RULES
To convert select provisions of the CA into machine-executable code, participants used a language and program (reasoner) called “Turnip”, which is based on Defeasible Deontic Logic (“DDL”) [14]. The presentation of DDL and Turnip in this paper is based on [12, 14]. DDL is an extension of Defeasible Logic [1], which refers to an interest that can be defeated, and Deontic Logic, which pertains to “the study of those sentences in which only logical words and normative expressions occur essentially. Normative expressions include the words ‘obligation’, ‘duty’, ‘permission’, ‘right’, and related expressions” [10, p. 1]. Defeasible Deontic Logic therefore extends defeasible logic “by adding deontic and other modal operators” [16, p. 47]. More specifically, DDL enables coders to integrate reasoning with exceptions; to model deontic concepts, such as obligations [O], permissions [P], prohibitions [F], and exemptions [E]; and to represent both definitional norms (also known as “constitutive rules”) and prescriptive norms [12, p. 178], all of which are present in the CA. Coders can also use a non-classical compensation operator to model obligations in force after a (potential) violation [14, 15]. Defeasible Deontic Logic has been applied in several studies that aim
to convert different types of Australian regulation into computer code [12, 13, 19].

A rule in DDL takes the form of an IF . . . THEN . . . statement in which “IF” represents the condition(s) of the rule and “THEN” models the effect of the norm [11, p. 284]. Coders can divide rules into constitutive rules that, for instance, define important terms in a normative document (e.g., different types of regulation) or outline condition(s) (i.e., the IF part) that might give rise to legal requirements (i.e., the THEN part, such as obligations, permissions and prohibitions). Rules can be further classified according to their strength: specifically, as strict rules, defeasible rules and defeaters. A strict rule is a rule in the classical sense. Defeasible rules are rules subject to exceptions: the conclusion of the rule holds unless there are other (applicable) rules (for the same conclusion) that defeat the rule. Defeaters are a special kind of rule: they do not support conclusions, but prevent the opposite conclusion [1, p. 257]. For more information about these classifications, see [14].

As previously noted, DDL enables coders to represent both definitional norms, also known as “constitutive rules”, and prescriptive norms. Constitutive rules are those in standard defeasible logic [12, p. 179]. Normative rules can be prescriptive, such as rules establishing that something is obligatory or forbidden, or permissive, including rules establishing that certain activities are explicitly permitted, thereby derogating from prohibitions or obligations to the contrary. The standard form of normative rules is as follows:

𝑟 : 𝐴1, . . . , 𝐴𝑛 ↩→□ 𝐶1 ⊙ · · · ⊙ 𝐶𝑚

In this rule, 𝐴1, . . . , 𝐴𝑛 are the condition(s) of the rule expressed as literals or deontic literals (e.g., an obligation [O] or permission [P]), □ is a deontic modality, and the 𝐶𝑖 are literals (𝐶1 ⊙ · · · ⊙ 𝐶𝑚 is a “reparation chain”). ↩→ is a placeholder for the type of rule: → stands for a strict rule, ⇒ for a defeasible rule, and ⇝ for a defeater. The mode □ of the rule determines the scope of the conclusion. When the mode is [O], the meaning of the right-hand side of the rule is that when the rule applies, [O]𝐶1 is in force (i.e., 𝐶1 is obligatory). If the rule is violated (i.e., ¬𝐶1 holds), then [O]𝐶2 is in force (𝐶2 is obligatory, and 𝐶2 compensates for the violation of [O]𝐶1). We can repeat this reasoning when [O]𝐶2 is potentially violated [14, 15].

DDL is a type of skeptical non-monotonic formalism, which means that when there are applicable rules with conflicting conclusions (i.e., 𝐴 and ¬𝐴), the logic does not provide a conclusion. To solve conflicts, DDL employs a so-called superiority relation: a binary relation over rules establishing the relative strength of rules. For example, if we have an applicable rule 𝑟 for 𝐴 and a second applicable rule 𝑠 for ¬𝐴, we can use 𝑟 > 𝑠 to indicate that 𝑟 is stronger than 𝑠. Accordingly, 𝑟 defeats 𝑠 when both apply, which solves the conflict and allows a coder to conclude 𝐴.

The superiority relation also provides a simple and effective mechanism to encode exceptions. Consider the following defeasible rule:

𝑟 : 𝐴1, . . . , 𝐴𝑛 ⇒ 𝐶

We can model an exception to this rule using a second rule (let us say 𝑠), where the conclusion is the opposite of the conclusion of 𝑟, and the IF part of 𝑠 contains the conditions under which the exception holds. We can formalise this second rule, with the instance 𝑠 > 𝑟 of the superiority relation, as follows:

𝑠 : 𝐵1, . . . , 𝐵𝑚 ⇒ ¬𝐶

The reasoning mechanism of DDL, which is based on an argumentation structure, extends the proof theory of Defeasible Logic [1]. To prove a conclusion, there must be an applicable rule for the said conclusion, and a rule is applicable if all the elements of the antecedent of the rule hold (i.e., have been proved). All counter-arguments must also be rebutted or defeated. A counter-argument is a rule for a conflicting conclusion; that is, the negation of the conclusion or, in the case of deontic conclusions, a conflicting deontic modality. A counter-argument is rebutted if its premise(s) do not hold, or a coder proves that the premise(s) do not hold, and the counter-argument is defeated when its rule is weaker than an applicable rule for the conclusion. Having outlined the basics of DDL, as it applies to this experiment, we now turn to our encoding software.

3.1 Turnip
Turnip¹ is a modern (typed) functional programming implementation of Defeasible Deontic Logic that is written in the programming language Haskell². As previously noted, this software facilitates the conversion of norms (e.g., different types of regulation) into computer code.

Turnip requires coders to define all terms before using them in a set of rules. The basic structure for defining an individual term is:

Type Name description_string

Type is defined in the following table:

Type      Keyword   Sample Values
Boolean   Atom      True, False
String    String    "anything in double quotation marks"
Numeric   Numeric   123.456, -5, 0
Date      Date      1995-02-01
DateTime  DateTime  1995-02-01T13:35
Duration  Duration  10w, 1d, 5h, 30m

In the experiment that is the subject of this paper, participants largely used “atoms”, which correspond to literals in DDL and represent (atomic Boolean) statements that can be either true or false (e.g., Atom person "is a person"). The description string, which is the optional text in double quotation marks (" "), defines the atom in natural language. Coders can also use arithmetic operators (i.e., +, -, *, /) for numeric terms and values; comparison operators (i.e., ==, !=, <, <=, >, >=) to create Boolean types from numeric and duration terms; and conversion functions (e.g., interval, toDays, after) that can operate on dates, times and duration terms. Consider, for example, the interval function that takes two dates as input and returns a duration:

publication.date := 1919-09-01
usage.date := 2010-12-03
interval(usage.date, publication.date) >= 70y

Here the assignment operator := gives values to two terms of type date. Then, we use the interval operator to compute the duration

¹ An online environment to run Turnip rulesets, with samples of the features it offers, is available at http://turnipbox.netlify.com/.
² https://www.haskell.org.
(i.e., time elapsed between the two dates), and we compare it with a given duration (70 years).

Rules also have a basic structure that generally includes a label, a condition list, and a conclusion list. For example:

label : condition_list => conclusion_list

The arrow (=>) determines the type of rule (e.g., strict, defeasible or a defeater). It is important to note that rules are designed to represent norms: a norm prescribes multiple, simultaneous effects, and different norms can prescribe the same effect [12, p. 180]. To make the work of coding in DDL more efficient, a condition list can include a conjunction (&) or disjunction (|) of Booleans, and a conclusion list can be either an assignment, a single Boolean, a conjunction of assignments, or a conjunction of Booleans. The following examples illustrate the equivalence between rules with conjunctions (&) and a disjunction (|) in a condition list:

A & B => C & D    is equivalent to    A & B => C
                                      A & B => D

A | B => C        is equivalent to    A => C
                                      B => C

Turnip syntax, including negation (~) and numeric and temporal expressions, ultimately allows for deontic expressions. A deontic expression is based on the combination of one of four deontic modalities, namely [O], [P], [F] and [E] (i.e., Obliged, Permitted, Forbidden, Exempt), and an atom. For the modalities, notice that [F]A is equivalent to [O]~A (and ~[P]A), and [E]A is equivalent to [P]~A. Given two rule labels, label1 and label2, label1>>label2 denotes the superiority relation between the rules identified by the labels.

Turnip is of course one of several languages and logics that coders can use to convert different types of regulation into machine-executable code [4, 25]. We used this language because it is particularly useful for coders attempting to express complex rule structures. Take, for example, a statutory provision that establishes multiple obligations, permissions and at least one prohibition. We argue that coders can more accurately represent the effects of this provision by using Turnip’s deontic expressions and disjunctive and/or conjunctive condition lists. Thus, although Turnip is not the only encoding software available, it has clear advantages for the purposes of this experiment.

4 EXPERIMENT DESIGN AND METHODS
We conducted this experiment over a two-week period in late 2020. The population for this experiment was the pool of legally trained research assistants attached to the broader research project. While participation was voluntary, in line with our university ethics approval³, participants were paid for their time. We had three participants in total.

In Week/Phase 1, participants independently coded sections (“ss”) 40, 41, 41A and 42 of the CA, which are among several fair dealing exceptions to copyright infringement. These exceptions are those for the purpose of research or study, criticism or review, parody or satire, and reporting the news, respectively. By “independent” coding, we mean that participants worked in remote files, rather than the project’s shared GitHub⁴ repository, and could not communicate with each other, or members of the broader research team, in any way over the course of Week 1. Participants could, however, raise questions or concerns with the first author (the contact person for the experiment) at any time. Additionally, we instructed participants to assume that the elements for determining whether copyright subsists in a work are satisfied (i.e., to assume copyright subsistence), and to encode the select legislative provisions only. This means that participants did not encode relevant case law.

In Week/Phase 2, participants coded ss 31, 32 and 36 of the CA. These provisions outline the nature of copyright in original works, original works in which copyright subsists, and infringement by doing acts comprised in the copyright, respectively. Like in Week 1, participants encoded the select provisions only (i.e., they did not encode case law), and could raise questions or concerns with the first author at any time. Aside from participants encoding different provisions in each phase, a choice that we made largely to avoid participants becoming overly familiar with the statutory text, the main difference between the two phases of this experiment is that Week 2 had an intervention. The intervention was a two-hour compulsory meeting during which participants collaboratively drafted key atoms for manual coding in a single remote file (“Agreed Atom File”). The Agreed Atom File, which set out and defined key atoms for ss 31, 32 and 36 of the CA, was separate from documents outlining our established coding conventions. After this meeting, which the first author facilitated, participants independently coded the select provisions like in Week 1. Participants could refer back to, but not edit, the Agreed Atom File. The purpose of the intervention was to test our hypothesis that coders collaboratively agreeing on key atoms before commencing independent coding work increases the similarity of their coded output (H1). Our null hypothesis was that coders collaboratively agreeing on key atoms before commencing independent coding work does not increase the similarity of their coded output (H0).

In Weeks 1 and 2, participants were allocated up to 7.5 hours to encode the select provisions. For each phase, participants had approximately one week to complete their 7.5 hours of work, which they submitted via email in the form of a single coding file (i.e., each participant had one file for Week 1 and another for Week 2). We assigned each participant a pseudonym and deidentified the participants’ files for both the technical validation and legal alignment processes, to which we now turn.

4.1 Coding Validation Processes
An important part of converting legislation into machine-executable code is measuring the apparent success of encoded rules in terms of coding validation. From a technical perspective, we argue that an encoded provision is “validated” when it adheres, in a formal sense, to our select Turnip language and other relevant coding conventions. In practical terms, this means that the code runs, or works, and produces a definitive outcome. Coding validation is also concerned with the degree of internal consistency between coders, to ensure that encoded rules for the same piece of legislation work together.

³ We received ethics approval for this research at the Queensland University of Technology (QUT Approval 2000000763).
⁴ https://github.com/about.
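One way to make this consistency dimension concrete is the overlap measure used in the automated analysis: the number of atoms shared by two coders divided by the average number of total atoms across their two files. The following is a minimal Python sketch of that measure; the atom names are hypothetical, and the participants’ actual files were Turnip rulesets rather than Python sets:

```python
def atom_similarity(atoms_a: set[str], atoms_b: set[str]) -> float:
    """Shared atoms divided by the average total number of atoms
    across two coders' files; yields a score between 0 and 1."""
    average_total = (len(atoms_a) + len(atoms_b)) / 2
    if average_total == 0:
        return 0.0
    return len(atoms_a & atoms_b) / average_total

# Hypothetical atom sets from two coders encoding the same provision.
coder_one = {"person", "literary.work", "fair.dealing", "research.purpose"}
coder_two = {"person", "literary.work", "fair.dealing", "criticism.purpose"}

print(atom_similarity(coder_one, coder_two))  # 0.75
```

Under this sketch, identical atom sets score 1.0, disjoint sets score 0.0, and the measure is symmetric in the two coders.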
Our coding validation process involves automated and manual analyses of encoded rules. For automated analysis, we created a program that uses string manipulation to parse each participant’s atoms into string arrays and then compares the participants’ encoded rules to find similarities between their approaches. The measure of apparent “similarity” between any two coders is the number of shared atoms and rules, respectively, between the coders divided by the average number of total atoms between the two datasets. This automated analysis focused on syntax and not semantic meaning. We also manually “cleaned” some aspects of the participants’ coding files to attempt to enhance the comparability of encoded rules. For example, we adjusted atom names, which used a mixture of periods ( . ) and camel case (e.g., camelCase), in line with our project-specific naming conventions. We also filtered out common stop words, such as “the” and “a”, normalised the tense of atoms and, where possible, corrected issues with the structure of certain rules.

4.2 Legal Alignment Processes
We draw an important distinction between coding validation, which is principally a technical process, and legal alignment, which we define as the extent to which encoded provisions align with the languages and logics of the select statutory text [18]. At the heart of legal alignment is the accurate representation of laws and other regulation in computer code. Our legal alignment processes are based on the modern approach to statutory interpretation and incorporate the Turnip coding language. As previously mentioned, this approach requires interpreters to examine the text, context and purpose of select legislation [20, 26], which we break down into a 7-step process:
(1) Locate and read a specific statutory provision, and identify key words/elements/conditions (known in coding terms as “atoms”).
(2) Read the provision in the context of the Act as a whole. Interpreters refine the legal meaning of key words by interpreting their meaning within the legislation’s full scope. For example, is a key word defined in the legislation? Under what part of the legislation is the provision located? This step includes interpreters reading intertextual legislation: for example, other statutes that are referenced in an Act.
(3) Consider Parliament’s purpose in enacting the legislation. Interpreters further refine the legal meaning of key words by interpreting the provision in line with the legislation’s purpose. For example, the legislation’s object clause, identified at the start of every statute, can indicate a stated purpose or purposes. If a statute has multiple purposes, then interpreters read the legislation’s establishing documents, such as speeches in Parliament or the legislation’s explanatory memorandum, to determine a hierarchy of purposes.
(4) Evaluate interpretive choices. Interpreters identify all possible interpretations of key words and apply them to findings from Steps 1 to 3 to evaluate which interpretation best aligns with the legislation’s context and purpose.
(5) Consider the “canons of construction”. If Step 4 findings are still ambiguous, then interpreters consider whether and how syntactical presumptions (e.g., noscitur a sociis and ejusdem generis) [26, pp. 212, 215], which relate to the meaning of the statutory text, and/or statutory presumptions (e.g., legislation not operating retrospectively), which relate to the scope and effect of legislation rather than the statutory language, should apply [26, p. 225].
(6) Return to the provision. Interpreters assign a meaning to the provision, and its key words, based on findings from Steps 1 to 5. After assigning meaning, interpreters define atoms and identify modalities (e.g., a permission [P], prohibition [F], obligation [O] or exemption [E]) and key words/elements/conditions for conversion into code.
(7) Acceptance testing of coded provisions. Testing in this context involves coders developing a series of unit tests for select provisions based on case law or examples in explanatory memoranda. Encoded rules “passing” relevant tests is a strong indicator of legal alignment.

At Stage 6, interpreters start to manually convert natural language legislation into the machine-executable Turnip language. In practice, the Turnip language requires interpreters to convert a statutory provision into a set of if-then statements or rules (rulesets). The online runtime environment (the Turnip reasoner) can take a set of rules and facts, respectively, and produce a set of results that are what the logic can infer from applying the facts to the rules.
While acceptance testing is Step 7, we recommend that interpreters start testing encoded rules as early as possible, ideally in parallel with Stage 6 to optimise legal alignment processes. Developing and applying rigorous acceptance testing can, for example, help interpreters to identify edge cases and subtle errors that might be easily overlooked. While copyright has a doctrinally deep and rich body of case law [8], participants did not undertake acceptance testing, principally due to time constraints. Rigorous acceptance testing for select copyright provisions therefore remains an important topic for future research. We provide select examples of our subsequent acceptance testing in Section 5.2.

4.3 Limitations
The findings in this paper should be interpreted with some limitations in mind. First, there is a small number of participants (N = 3) who encoded different provisions in Week 1 and Week 2. This means that we cannot directly compare atoms and rules across weeks, and the results cannot be used to make generalised findings about all coders that might apply our methodology. Second, the measure of apparent “similarity” between atoms and rules is based on our methodology, including Turnip syntax, which is a non-standardised benchmark. Finally, we are not able to reach a definitive conclusion about the extent to which the encoded rules align with the statutory text due to the authoritative interpretive role of the courts under Australia’s constitutional framework [18], a limitation that will apply to any attempt to encode legislation. Despite these limitations, which we expand upon in the Results and Discussion section below, this study makes significant inroads with developing and applying a methodology that fuses the modern approach to statutory interpretation with DDL. This paper also provides valuable insights into sources of apparent coding differences, legal alignment issues and potential solutions, as part of the broader RaC movement.
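Section 4.2 describes the Turnip reasoner as taking a set of rules and facts and producing the results the logic can infer. As a much-simplified illustration of that rules-plus-facts evaluation, the following Python sketch performs plain forward chaining; Turnip’s actual defeasible semantics, negation and modalities are omitted, and the atom names are shortened from P1’s encoding shown in the next section:

```python
# Minimal forward-chaining sketch of "rules + facts => results".
# Turnip's defeasible-logic machinery (negation as in ~computerProgram,
# modalities, rule priorities) is deliberately left out of this sketch.
RULES = [
    # (antecedent atoms, derived atom); names shortened for illustration
    ({"literaryDramaticMusicalWork.copyrightSubsists",
      "copyright.inRelationTo.work"},
     "exclusiveRightTo.enterInto.commercialRentalArrangement"),
]

def infer(rules, facts):
    """Fire rules whose antecedents hold until no new atoms are derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, conclusion in rules:
            if antecedent <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived - set(facts)

facts = {"literaryDramaticMusicalWork.copyrightSubsists",
         "copyright.inRelationTo.work"}
print(sorted(infer(RULES, facts)))
# ['exclusiveRightTo.enterInto.commercialRentalArrangement']
```

On the two input facts, the sketch derives the exclusive-right atom; in Turnip, conflicting and defeasible rules would additionally be resolved by the underlying defeasible logic.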
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Alice Witt, Anna Huggins, Guido Governatori, and Joshua Buckley
5 RESULTS AND DISCUSSION
Overall, the results of our automated syntactical analysis show a significant increase in the similarity of atoms and shared rules drafted by participants after the Week 2 intervention, as illustrated in Table 1. More specifically, Table 2 shows that the similarity of atoms increased from an average of 4.27% in Phase 1 to 57.64% in Phase 2. A corollary of this is that the number of unique atoms, or atoms that are used by one coder only, decreased from Phase 1 to Phase 2. The similarity of rules drafted by participants also increased, albeit marginally, from 0% in Phase 1 to 1.01% in Phase 2. These results support our hypothesis that participants agreeing on key atoms before commencing individual coding work increases the similarity of their encoding choices (H1). Importantly, as the individual coders encoded different statutory provisions in Weeks 1 and 2, this cannot simply be attributed to the coders’ increasing familiarity with the statutory text.

meaning of a statute” [20, p. 118], unless the courts have reached definitive conclusions about, for instance, what different provisions mean and how they apply in certain cases. More specific challenges include converting evidentiary burdens, the exercise of discretion, and ambiguous and open-textured rules in computer code, intra- and intertextual provisions and drafting errors, all of which underline the potential margin of error in interpreting statutory language [17]. Due to the nature of statutory interpretation, it is unsurprising that there were differences in coders’ interpretive choices.
We also observed that participants coded the select provisions with varying levels of granularity. By “granularity”, we mean the extent to which coders split, or broke down, the statutory text into discrete parts. Take, for example, s 40 of the CA that establishes the fair dealing exception to copyright infringement for the purpose of research or study. A selection of key atoms originating from ss 40(1), (1A) and (1B) follows:
P1:
s31_1_c : literaryDramaticMusicalWork.copyrightSubsists & ~computerProgram & copyright.inRelationTo.work =>
    exclusiveRightTo.enterInto.commercialRentalArrangement.workReproducedInSoundRecording

Participant 2 (P2):
s_31_1c : prescribedWorkSection31_1c & copyright.inRelationTo.work =>
    exclusiveRightTo.enterInto.commercialRentalArrangement.inSoundRecording
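The apparent-similarity measure from Section 4.1 (shared atoms divided by the average total number of atoms between two coders) can be applied to encodings like the pair above. The following is a hedged sketch: the function name and the toy atom sets are ours for illustration, not the study’s data or code.

```python
def atom_similarity(atoms_a, atoms_b):
    """Shared atoms between two coders, divided by their average atom count.

    Illustrative reconstruction of the paper's measure; the study's
    implementation may normalise differently.
    """
    average_total = (len(atoms_a) + len(atoms_b)) / 2
    if average_total == 0:
        return 0.0
    return len(atoms_a & atoms_b) / average_total

# Toy sets loosely modelled on the s 31(1)(c) encodings above:
p1_atoms = {"copyright.inRelationTo.work",
            "literaryDramaticMusicalWork.copyrightSubsists",
            "exclusiveRightTo.enterInto.commercialRentalArrangement"}
p2_atoms = {"copyright.inRelationTo.work",
            "prescribedWorkSection31_1c",
            "exclusiveRightTo.enterInto.commercialRentalArrangement"}
print(atom_similarity(p1_atoms, p2_atoms))  # 2 of an average 3 atoms shared
```

On the toy sets, two of an average three atoms are shared, giving a similarity of about 0.67; the study computed the analogous figure across each pair of participants’ files.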
Coders using common conventions and developing clean, well-documented practices is likely to significantly improve the ability of future users to understand how the legislation has been interpreted in the encoding process. It would also greatly simplify the task of future developers who may be called upon to review, audit, or amend the code.

5.2 Enhancing Legal Alignment
Legal alignment, as distinct from and in addition to coding validation, is vital for ensuring the accuracy and legitimacy of encoded legislation. As a further step in checking the accuracy of individual coding choices, we adopted a process of legal alignment, based on the methodology above, to attempt to reproduce the legal logic of select copyright provisions in machine-executable code. Legal alignment focuses only on our interpretation of the provisions using a judicially confirmed process as the basis for our interpretive approach and, critically, is distinct from a claim of “legal validity” that encoded rules correctly reflect the law. It is impossible to guarantee the validity of any encoded legal rules [18]. As noted above, under Australia’s constitutional framework, only the judiciary can conclusively interpret the legal meaning of a statute. Even if a body of case law exists, as it does for copyright law, the nature of the common law system means that while similar cases will generally be treated alike [5], future cases with slightly different facts may trigger a reinterpretation of the law. An added complication is that the construction of statutes is a question of law and therefore open to appeal on the basis of errors in statutory interpretation [20]. This underlines that interpreters cannot authoritatively determine the extent to which their encoding choices are “legally valid” [17]. Instead, we contend that RaC stakeholders should aim for “legal alignment” between encoded rules and the true construction of the statutory text.
As noted above, coders were not asked to undertake acceptance testing in Weeks 1 and 2 due to time constraints. This process was undertaken by the research team after the technical coding validation process. The subsequent legal alignment process highlighted the complexity of attempting to evaluate the extent to which encoded rules align with the languages and logics of a statute. To illustrate this point, it is useful to return to the encoding choices for section 40 of the CA, for which P1 adopted a high-level approach and P3 drafted several fine-grained atoms. This provision, which is one of several exceptions to copyright infringement in Part III, Division 3 of the Act, establishes that copyright in a work or an adaptation of a literary, dramatic or musical work is not infringed by a fair dealing for the purpose of research or study. The wording of this provision is significant; in particular, the legislature’s use of “research or study” (emphasis added). This raises the question of whether an interpreter should code these terms conjunctively (i.e., research and study) or disjunctively (i.e., research or study). In De Garis v Neville Jeffress Pidler Pty Ltd (1990) 37 FCR 99; 95 ALR 625; 18 IPR 292, Beaumont J stated that the terms “research” and “study” take their dictionary meanings and, most critically, should be considered disjunctively [9]. Indeed, the court considered whether the activities at issue could be characterised as “research” or “study” for the purposes of s 40 of the CA, separately. This suggests that P1’s decision to encode the terms in one atom; namely, purposeOf.researchOrStudy, risks deviating from the true meaning of the statutory text as articulated by the courts. P3’s fine-grained approach to construing and encoding the statutory language appears to better align with the court’s interpretation of the statutory text.
The divergent coding choices for Section 40(2) of the CA provide another useful illustration of nuanced legal issues that interpreters could overlook in formal coding validation processes. Section 40(2) outlines five “. . . matters to which regard shall be had, in determining whether a dealing constitutes a fair dealing with the work or adaptation for the purpose of research or study” (emphasis added). We direct our attention to the word “shall” because it confers a mandatory or directory obligation [26, p. 333]. We noticed, however, that only two of the three participants encoded the five matters as obligations ([O]). Consider the rules in Table 5. Significantly, this suggests that P1’s choice to encode the matters in s 40(2) as general atoms rather than obligations deviates from the meaning the legislature seeks to convey. Such a decision matters because the statutory text “has specific legal authority” [21, p. 159] and its component parts — words, symbols and, at times, images — are there for a reason [23]. It can also have significant flow-on effects for potential end users who might, for example, rely on RegTech to comply and stay up to date with regulation governing their commercial activities.
In Australia, like other societies that strive to uphold the legal ideals of the rule of law and separation of powers [32, 35], public law provides a range of checks and balances to assess whether a decision-maker is exercising power in accordance with established rules and principles [6, 18]. We argue that governments, technology companies and other key stakeholders must take steps to ensure that adequate measures are in place to promote alignment between encoded statutory provisions and the true meaning of the statute as interpreted by the courts. This underscores the importance of RaC initiatives having multi-disciplinary expertise across the fields of law, computer science and public policy [17]. From our perspective as legally trained coders, we argue that it is vital to have both subject matter and statutory interpretation expertise as part of a multi-disciplinary RaC team.

6 CONCLUSION AND FUTURE RESEARCH
The results from our experiment illustrate the complexities of attempting to reproduce the languages and logics of a statutory text in machine-executable format. While our analyses are preliminary, we have provided a first-of-its-kind experiment that examines how different legally trained people interpret and convert legislation into computer code in practice. After our intervention in Week 2, a meeting during which participants collaboratively agreed on key atoms for manual coding, we identified a significant increase in the average similarity of atoms — from 4.27% in Week 1 to 57.64% in Week 2 — and, to a lesser extent, rules. This finding, among others, supports our hypothesis that coders collaboratively agreeing on key atoms before commencing independent coding work increases the similarity of their coded output. Importantly, as the individual participants encoded different statutory provisions in Weeks 1 and 2 of the experiment, the greater similarity we observed cannot
Table 5. Encoded rules for s 40(2) drafted by P1, P2 and P3.

P1:
s40_2_aTOd : determining.fairDealing =>
    regardTo.purposeAndCharacter & regardTo.natureOfWorkOrAdaptation &
    regardTo.possibilityOfObtainingWorkOrAdaptation.withinReasonableTime.atOrdinaryCommercialPrice &
    regardTo.effectOfDealingOn.potentialMarketOrValue

P2:
s_40_2 : determiningPotentialFairDealing.researchStudy =>
    [O] regardWorkPurposeCharacter & [O] regardWorkNature &
    [O] regardPossibilityOfPurchasing & [O] regardMarketValueEffect &
    [O] regardSubstantialityOfCopiedPart

P3:
s_40_2_work_a_to_e : entity.determining.whetherDealingIsFair.forPurposesOf.copyrightAct &
    work.isLiteraryOrDramaticOrMusicalOrArtistic & dealing.isReproduction =>
    [O] entity.toConsider.dealing.purpose & [O] entity.toConsider.dealing.character &
    [O] entity.toConsider.dealing.natureOfWorkOrAdaptation &
    [O] entity.toConsider.effectOfDealing.onValueOfWorkOrAdaptation &
    [O] entity.toConsider.possibilityOfObtainingWorkOrAdaptationWithinReasonableTimeAtOrdinaryCommercialPrice
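The modality divergence discussed in Section 5.2 (only two of the three participants marked the five matters as obligations) is mechanically checkable over the textual encodings. A small illustrative sketch; the regular expression is ours and not part of the Turnip toolchain:

```python
import re

def obligations(encoded_rule):
    """Atoms carrying the [O] (obligation) modality in an encoded rule string.

    Simple pattern matching over textual encodings like those above;
    real rules carry richer structure than this regex assumes.
    """
    return re.findall(r"\[O\]\s*([\w.]+)", encoded_rule)

p1_rule = ("s40_2_aTOd : determining.fairDealing => "
           "regardTo.purposeAndCharacter & regardTo.natureOfWorkOrAdaptation")
p2_rule = ("s_40_2 : determiningPotentialFairDealing.researchStudy => "
           "[O] regardWorkPurposeCharacter & [O] regardWorkNature & "
           "[O] regardPossibilityOfPurchasing & [O] regardMarketValueEffect & "
           "[O] regardSubstantialityOfCopiedPart")
print(len(obligations(p1_rule)), len(obligations(p2_rule)))  # 0 5
```

Comparing the obligation counts per participant (0 for P1, 5 for P2 and P3) surfaces exactly the kind of divergence that a purely syntactic validation pass would otherwise overlook.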
simply be attributed to the coders’ increasing familiarity with the statutory text.
Notwithstanding these increases, participants made a range of divergent interpretive choices, which we argue are most likely due to: (1) the complexity of statutory interpretation, (2) encoded provisions having varying levels of granularity, and (3) the functionality of our coding language, Turnip. This underlines that interpretive differences can arise even when coders undertake similar training, use the same language and work in the same team. We explained that interpreters can, to some extent and not without risks, improve the internal consistency of encoded rules by automatically and/or manually cleaning the dataset.
Overall, we contend that RaC initiatives should have processes for technical coding validation and legal alignment, both of which are critically important in enhancing the accuracy of digitising legislation. The former helps to ensure that encoded rules adhere to select coding languages and conventions. It is particularly important that interpreters not only follow select coding languages and conventions, but also develop clean, well-documented code to enable future users to understand how the legislation has been interpreted in the encoding process. A second critical step is to engage in a process of legal alignment, which we conceptualise as enhancing congruence between the encoded rules and the true meaning of the select legislation in line with the modern approach to statutory interpretation. The results of this experiment suggest that a rigorous assessment of legal alignment requires multi-disciplinary expertise across specific legal subject matters, statutory interpretation and technical programming.
There are a range of important opportunities for future research that arise from this study. First, there is scope to expand this experiment to test our hypothesis (H1) across different bodies of law and types of expertise, and with a larger number of participants. Secondly, further exploration of what the separate yet interrelated processes for technical coding validation and legal alignment might entail in practice is warranted. For example, in terms of coding validation, our experiment raises practical questions about how exactly coding teams should choose between different yet equally technically valid coding choices, and the most appropriate and efficient syntactic and semantic conventions and methodologies for identifying and representing terms, predicates and propositions in legal texts. Finally, in terms of coding validation, further research is needed to shed light on acceptance testing options, including encoding case law, for copyright law and beyond. An important part of this work will be clarifying best practices from both a technical and legal perspective.

REFERENCES
[1] Grigoris Antoniou, David Billington, Guido Governatori, and Michael J. Maher. 2001. Representation Results for Defeasible Logic. ACM Transactions on Computational Logic 2, 2 (2001), 255–287.
[2] Patricia Aufderheide, Kylie Pappalardo, Nicolas Suzor, and Jessica Stevens. 2018. Calculating the consequences of narrow Australian copyright exceptions: measurable, hidden and incalculable costs to creators. Poetics 69 (2018), 15–26.
[3] Tom Barraclough, Hamish Fraser, and Curtis Barnes. 2021. Legislation as Code for New Zealand: Opportunities, Risks and Recommendations. Report. Brainbox and The New Zealand Law Foundation, New Zealand.
[4] Sotiris Batsakis, George Baryannis, Guido Governatori, Tachmazidis Ilias, and Grigoris Antoniou. 2018. Legal Representation and Reasoning in Practice: A Critical Comparison. In Legal Knowledge and Information Systems, Monica Palmirani (Ed.). IOS Press, 31–40.
[5] Lisa Burton Crawford and Dan Meagher. 2020. Statutory Precedents under the “Modern Approach” to Statutory Interpretation. Sydney Law Review 42, 2 (2020), 209–239.
[6] Lisa Burton Crawford, Maria O’Sullivan, Janina Boughey, and Melissa Castan. 2017. Public Law and Statutory Interpretation: Principles and Practice. Federation Press, NSW, Australia.
[7] Commonwealth of Australia. 2013. Copyright and the Digital Economy: Discussion Paper. ALRC Discussion Paper 79. https://www.alrc.gov.au/wp-content/uploads/2019/08/dp79_whole_pdf_.pdf
[8] Mark Davidson, Ann Monotti, and Leanne Wiseman. 2012. Australian Intellectual Property Law. Cambridge University Press, Cambridge, U.K.
[9] Bronwen Claire Ewen. 2017. 240 – Intellectual Property, III COPYRIGHT, (8) DEFENCES TO INFRINGEMENT (B) Fair Dealing – Defences to Infringement of Copyright. In Halsbury’s Laws of Australia. LexisNexis Australia, [240–2357].
[10] Dagfinn Føllesdal and Risto Hilpinen. 1971. Deontic Logic: An Introduction. In Deontic Logic: Introductory and Systematic Readings, Risto Hilpinen (Ed.). North Holland, 1–35.
[11] Thomas F. Gordon, Guido Governatori, and Antonino Rotolo. 2009. Rules and Norms: Requirements for Rule Interchange Languages in the Legal Domain (LNCS, 5858), Guido Governatori, John Hall, and Adrian Paschke (Eds.). Springer, Heidelberg, 282–296.
[12] Guido Governatori, Pompeu Casanovas, and Louis de Koker. 2020. On the Formal Representation of the Australian Spent Conviction Scheme. In Rules and Reasoning (LNCS, Vol. 12173), Víctor Gutiérrez Basulto, Tomáš Kliegr, Ahmet Soylu, Martin Giese, and Dumitru Roman (Eds.). Springer International, 177–185.
[13] Guido Governatori, Mustafa Hashmi, Ho-Pun Lam, Serena Villata, and Monica Palmirani. 2016. Semantic Business Process Compliance Checking Using LegalRuleML. In Knowledge Engineering and Knowledge Management (LNAI, 10024), Eva Blomqvist, Paolo Ciancarini, Francesco Poggi, and Fabio Vitali (Eds.). Springer International, 746–761.
[14] Guido Governatori, Francesco Olivieri, Antonino Rotolo, and Simone Scannapieco. 2013. Computing Strong and Weak Permissions in Defeasible Logic. Journal of Philosophical Logic 42, 6 (2013), 799–829.
[15] Guido Governatori and Antonino Rotolo. 2006. Logic of Violations: A Gentzen System for Reasoning with Contrary-To-Duty Obligations. Australasian Journal of Logic 4 (2006), 193–215.
[16] Guido Governatori, Antonino Rotolo, and Erica Calardo. 2012. Possible World Semantics for Defeasible Deontic Logic. In Deontic Logic in Computer Science (DEON 2012) (Lecture Notes in Computer Science, Vol. 7393), Thomas Ågotnes, Jan Broersen, and Dag Elgesem (Eds.). Springer, Heidelberg, 46–60.
[17] Anna Huggins, Alice Witt, Nicolas Suzor, Mark Burdon, and Guido Governatori. 2020. Financial Technology and Regulatory Technology, Issues Paper Submission. Submission 196. Select Senate Committee on Financial Technology and Regulatory Technology. https://www.aph.gov.au/DocumentStore.ashx?id=30153f40-c456-4398-99f2-0dd627f86401&subId=699554
[18] Anna Huggins. 2020. Executive Power in the Digital Age: Automation, Statutory Interpretation and Administrative Law. In Interpreting Executive Power, Janina Boughey and Lisa Burton Crawford (Eds.). Federation Press, 111–128.
[19] Mohammad Badiul Islam and Guido Governatori. 2018. RuleRS: A rule-based architecture for decision support systems. Artificial Intelligence and Law 26, 4 (2018), 315–344. https://doi.org/10.1007/s10506-018-9218-0
[20] Michael Kirby. 2011. Statutory Interpretation: The Meaning of Meaning. Melbourne University Law Review 35, 1 (2011), 113–133.
[21] Michael Kirby. 2012. The Never-Ending Challenge of Drafting and Interpreting Statutes – A Meditation on the Career of John Finemore QC. Melbourne University Law Review 36, 1 (2012), 140.
[22] David Lehr and Paul Ohm. 2017. Playing with the Data: What Legal Scholars Should Learn About Machine Learning. University of California Davis Law Review 51, 2 (2017), 653–717.
[23] John Middleton. 2016. Statutory Interpretation: Mostly Common Sense? Melbourne University Law Review 40, 2 (2016), 626–656.
[24] James Mohun and Alex Roberts. 2020. Cracking the code: Rulemaking for humans and machines. OECD Working Papers on Public Governance. OECD, Paris, France. https://doi.org/10.1787/3afe6ba5-en
[25] Jason Morris. 2020. Spreadsheets for Legal Reasoning: The Continued Promise of Declarative Logic Programming in Law. Masters Thesis, University of Alberta.
[26] Michelle Sanson. 2016. Statutory Interpretation (2nd ed.). Oxford University Press, Oxford, U.K.
[27] Service Innovation Lab (LabPlus). 2018. Better Rules for Government Discovery Report. Technical Report.
[28] Nicolas Suzor. 2014. The only way to fix copyright is to make it fair. The Conversation (21 February 2014). https://theconversation.com/the-only-way-to-fix-copyright-is-to-make-it-fair-23402
[29] Matthew Waddington. 2019. Machine-Consumable Legislation: A Legislative Drafter’s Perspective – Human v Artificial Intelligence. The Loophole 2 (2019), 21–52.
[30] Matthew Waddington. 2020. Rules as Code. Law in Context 37, 1 (2020), 179–186.
[31] Vicki Waye. 2019. Regtech: A New Frontier in Legal Scholarship. Adelaide Law Review 40, 1 (2019), 363–386.
[32] Alice Witt, Nicolas Suzor, and Anna Huggins. 2019. The Rule of Law on Instagram: An Evaluation of the Moderation of Images Depicting Women’s Bodies. UNSW Law Journal 42, 2 (2019), 557–596.
[33] Meng Weng (HUANG Mingrong) Wong. 2020. Rules as Code – Seven Levels of Digitisation. Report. Singapore Management University, School of Law, Singapore.
[34] World Government Summit. 2018. RegTech for Regulators: Re-Architect the System for Better Regulation. Technical Report. https://www.worldgovernmentsummit.org/api/publications/document?id=5ccf8ac4-e97c-6578-b2f8-ff0000a7ddb6
[35] Monika Zalnieriute, Lyria Bennett Moses, and George Williams. 2019. The Rule of Law and Automation of Government Decision-Making. Modern Law Review 82, 3 (2019), 425–427.
Hardness of Case-Based Decisions: a Formal Theory
Heng Zheng (Artificial Intelligence, Bernoulli Institute, University of Groningen, The Netherlands; h.zheng@rug.nl)
Davide Grossi (Artificial Intelligence, Bernoulli Institute, University of Groningen; ILLC/ACLE, University of Amsterdam, The Netherlands; d.grossi@rug.nl)
Bart Verheij (Artificial Intelligence, Bernoulli Institute, University of Groningen, The Netherlands; bart.verheij@rug.nl)
ABSTRACT
Stare decisis is a fundamental principle of case-based reasoning. Yet its application varies in complexity and depends, in particular, on whether relevant past decisions agree, or exist at all. The contribution of this paper is a formal treatment of types of the hardness of case-based decisions. The typology of hardness is defined in terms of the arguments for and against the issue to be decided, and their kind of validity (conclusive, presumptive, coherent, incoherent). We apply the typology of hardness to Berman and Hafner’s research on the dynamics of case-based reasoning and show formally how the hardness of decisions varies with time.

CCS CONCEPTS
• Computing methodologies → Knowledge representation and reasoning; • Applied computing → Law.

KEYWORDS
Case-based reasoning, computational argumentation, hard cases

ACM Reference Format:
Heng Zheng, Davide Grossi, and Bart Verheij. 2021. Hardness of Case-Based Decisions: a Formal Theory. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3462757.3466071

1 INTRODUCTION
Legal decision-making can be hard, very hard. Its complexities, which have been well-recognized in AI and Law since its early days, are numerous. An early contribution to the discussion of the hardness of legal decision-making in AI and Law is ‘An Artificial Intelligence Approach to Legal Reasoning’, a book by Gardner [9]. In that book, Gardner addresses the distinction between hard and easy cases using ideas from jurisprudence. Following Rissland’s review [18] of the landmark work by Gardner, legal decisions are guided rather than governed by existing law; legal terms are open textured; legal questions can have more than one answer, but a reasonable and timely answer must be given; and the answers to legal questions can change over time. These, and other, such complexities of legal decision-making have been the object of much work in AI and Law (cf. also [23]).
But some cases are easier to decide than others. For instance, when all past cases agree on a given legally relevant fact situation, decision-making using the principle of stare decisis can be straightforward. When relevant past cases disagree, things get harder. Sometimes such a conflict of precedents can be resolved, for instance when one precedent is considered a landmark case overturning existing doctrine, or when a precedent comes from a higher level court. But not all conflicts can be resolved, making decision-making harder. Also it can happen that a legally relevant fact situation has no matching precedent, so the stare decisis principle gives no answer.

Paper contribution. As the above examples show, the hardness of case-based decisions comes in different types. It is the topic of this paper to provide a formal theory of the hardness of cases in case-based decision-making. Significant work has been devoted to the nature and dynamics of case-based reasoning (e.g., [1, 2, 5–8, 11, 13–17, 21]), and to the topic of hard cases (e.g., [4, 9, 12]). Yet, to the best of our knowledge, no formal theory has been proposed so far of what makes a current case harder, or less hard, than other cases. We provide such a theory here, by focusing on the following question: is there a typology of how hard it is to make a decision about an issue in case-based reasoning? To answer this question, we propose a formal approach based on the case model formalism [20–22, 25, 26]. We describe a decision-making issue as an argument and its counterargument, and formalize its hardness with the validity of these arguments. We also illustrate the approach with a case study in the dynamics of case-based reasoning (following an example by Berman and Hafner [6–8, 11] as formalized in [21]), and show how hardness varies over time.

Paper outline. Section 2 introduces earlier work in the case model formalism. Section 3 develops a formal theory of hardness. Section 4 shows an application of our approach to a series of concrete legal cases highlighting the development of hardness over time. Section 5 positions our theory within existing literature on case-based decision-making in law. Section 6 concludes. Detailed proof sketches of relevant formal properties are provided throughout the paper.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466071

2 PRELIMINARIES: CASE MODELS
Our approach uses case models [20], a formalism based on a propositional logic language L generated from a finite set of constants. We write ¬ for negation, ∧ for conjunction, ∨ for disjunction, ↔ for equivalence, ⊤ for a tautology, and ⊥ for a contradiction. The associated classical logical consequence relation is denoted |=. Cases can
Figure 1: Example of a case model. Larger boxes denote cases that are preferred over the cases denoted by smaller boxes. (Recovered labels: cases π0, π1, π2, with contents including P ∧ Q, P ∧ ¬Q, and ¬P.)

Figure 2 (caption not recovered): nested sets of arguments, with conclusive arguments inside presumptive arguments, inside coherent arguments, inside the set of all arguments.
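As a reading aid, the classical consequence relation |= over a language generated from finitely many constants can be checked by brute-force enumeration of valuations. A minimal Python sketch (the formula encoding and the names `holds` and `entails` are ours, not from the paper):

```python
from itertools import product

# Minimal sketch of classical consequence |= over a finite propositional
# language, checked by enumerating valuations. Formulas are nested tuples:
# ("atom", "P"), ("not", f), ("and", f, g), ("or", f, g).
def holds(val, f):
    op = f[0]
    if op == "atom":
        return val[f[1]]
    if op == "not":
        return not holds(val, f[1])
    if op == "and":
        return holds(val, f[1]) and holds(val, f[2])
    if op == "or":
        return holds(val, f[1]) or holds(val, f[2])
    raise ValueError(op)

def entails(atoms, f, g):  # f |= g: every valuation satisfying f satisfies g
    vals = (dict(zip(atoms, bits))
            for bits in product([True, False], repeat=len(atoms)))
    return all(holds(v, g) for v in vals if holds(v, f))

P, Q = ("atom", "P"), ("atom", "Q")
print(entails(["P", "Q"], ("and", P, Q), P))  # True: P ∧ Q |= P
print(entails(["P", "Q"], P, Q))              # False: P alone does not settle Q
```

Enumeration is exponential in the number of constants, which is harmless here since the formalism assumes a finite, small language.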
Hardness of Case-Based Decisions: a Formal Theory ICAIL’21, June 21–25, 2021, São Paulo, Brazil
to its falsity. Here arguments are premise-conclusion pairs, without considering a possible internal stepwise structure. Based on two natural ways to compare issues, we will define and study an ordering of issues representing their relative hardness.

3.1 Comparing arguments
First we introduce labels for arguments based on their validity.

Definition 4 (Validity labels for arguments) Let (χ, ρ) be an argument and C a case model. Then the validity label of (χ, ρ) in C is denoted AC(χ, ρ) and is defined as follows:
(1) AC(χ, ρ) = conc if C |= χ ⇒ ρ;
(2) AC(χ, ρ) = pres if C |= χ ⇝ ρ and C ̸|= χ ⇒ ρ;
(3) AC(χ, ρ) = coh if (χ, ρ) is coherent in C and C ̸|= χ ⇝ ρ;
(4) AC(χ, ρ) = incoh if (χ, ρ) is not coherent in C.

For instance, pres is the label for arguments that are presumptively valid, but not conclusive. Using Figure 2 as an illustration, these arguments are in the presumptive arguments set, but not in the conclusive arguments set. The label incoh is used for arguments that are not coherent. In Figure 2, these arguments are in the set of all arguments, but not in the set of coherent arguments.¹

Validity labels come with a natural ordering:

Definition 5 (Validity ordering) The validity ordering is the total order ≥ on the set {conc, pres, coh, incoh} characterized by the following property:² conc ≥ pres ≥ coh ≥ incoh.

Intuitively, based on their validity, arguments with label conc are stronger than arguments with label pres. Similarly, arguments with label pres are stronger than arguments with label coh, and arguments with label coh are stronger than arguments with label incoh.

The following proposition relates argument validities (Definition 3) to validity labels and the validity ordering (Definitions 4 and 5).

Proposition 2 Let (χ, ρ) be an argument and C a case model. Then the following hold:
(1) (χ, ρ) is conclusive if and only if AC(χ, ρ) = conc;
(2) (χ, ρ) is presumptive if and only if AC(χ, ρ) ≥ pres;
(3) (χ, ρ) is coherent if and only if AC(χ, ρ) ≥ coh;
(4) (χ, ρ) is incoherent if and only if AC(χ, ρ) = incoh.

PROOF. Immediate using the definitions and Proposition 1. □

Using the ordering of arguments by their validity label, we can quantify how far apart they are in terms of strength, as follows:

Definition 6 (Validity distance) Let v0, v1 ∈ {conc, pres, coh, incoh}. Then we define the validity distance vd(v0, v1) as the length of the shortest path from v0 to v1 in the validity ordering.

Intuitively, the validity distance between two arguments is the number of steps in the validity ordering ≥ that are needed to go from the label of the weaker argument to the label of the stronger one (of course avoiding loops). Clearly, as ≥ has length 3, the maximal validity distance is 3 and the minimal is 0.

¹ These labels and their ordering in the following Definition 5 are related to the quantitative representation of [20, Section 3.3]. Arguments with label conc correspond to those arguments with strength equal to 1 [20]. Arguments with label pres correspond to those arguments with strength above a given threshold but less than 1. Arguments with label coh correspond to arguments with strength less than the given threshold but still above 0. And arguments with label incoh correspond to arguments with strength equal to 0.
² Recall that a total order is a binary relation which is transitive, total and antisymmetric.

Example 4 In the case model of Figure 1, argument (P, Q) has type pres, and argument (P, ¬Q) has type coh. Hence the validity distance between the validity labels of these two arguments is 1.

3.2 Issues
We now introduce the notion of issue, that is, the specific proposition a case model is supposed to decide about. Our theory concerns precisely the hardness of deciding about the truth or falsity of an issue given a situation in a case model.

Issues and issue types. So an issue is a sentence whose truth or falsity we would like to establish, given a situation. Formally:

Definition 7 (Issues) A situation is a sentence σ ∈ L. A sentence ι ∈ L is an issue given situation σ (denoted σ ± ι) if and only if σ ̸|= ι and σ ̸|= ¬ι.

In other words, an issue for a given situation is a sentence whose truth or falsity is not logically settled by the situation. It is worth observing that σ ± ι is an issue if and only if σ ± ¬ι is an issue.

Example 5 For instance, P ± Q represents an issue Q with respect to a situation P.

Importantly, for every issue σ ± ι there are two naturally associated arguments: (σ, ι) and (σ, ¬ι). We will study hardness as a relation on the types of those pairs of arguments that correspond to issues. Given the pair of arguments induced by an issue, we call its type the multiset (i.e., a set admitting multiple copies of an element) consisting of the validity labels of the two arguments. Formally:

Definition 8 (Types of issues) Let C be a case model and σ ± ι an issue. Then the type of the issue, denoted TC(σ ± ι), is the multiset {AC(σ, ι), AC(σ, ¬ι)}.

Example 6 The type of issue P ± Q with respect to the case model in Figure 1 is {pres, coh}, as AC(P, Q) = pres and AC(P, ¬Q) = coh.

Not all types are logically possible. The following proposition shows there are in total 7 possible issue types.

Proposition 3 Let C be a case model and σ ± ι an issue. Then TC(σ ± ι) equals one of the following:
(1) {conc, incoh};
(2) {pres, pres};
(3) {pres, coh};
(4) {pres, incoh};
(5) {coh, coh};
(6) {coh, incoh};
(7) {incoh, incoh}.

PROOF. By Definition 4, an argument can have 4 possible labels: conc, pres, coh, incoh. There are therefore (4² − 4)/2 + 4 = 10 possible multisets of size 2. Let a case model C = (C, ≥) and an issue σ ± ι be given. We split the proof in two parts. First, we reason by cases and, for a multiset {v0, v1} (for which we can assume v0 ≥ v1), we show what values among conc, pres, coh, incoh v1 can take once we fix v0. Then we display examples showing the types listed in the statement are possible.
v0 = conc Then v1 ∈ {conc, pres, coh, incoh}. If AC(σ, ι) = conc, by Definition 3, for each π ∈ C with π |= σ, we have π |= σ ∧ ι. Then since cases are consistent, π ̸|= σ ∧ ¬ι, so AC(σ, ¬ι) = incoh
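For the simple case models used in the examples, where cases, premises and conclusions are conjunctions of literals, the validity labels of Definition 4 admit a direct computational reading. A minimal Python sketch under that simplifying assumption, paraphrasing the validity notions of [20] (all function and variable names are ours):

```python
# A case is a frozenset of literals, e.g. frozenset({"P", "Q"}) for P ∧ Q,
# with "¬P" a negative literal. A case model is a list of cases plus a
# preference ranking (higher rank = more preferred), as in Figure 1.
# Simplification: premises and conclusions are conjunctions of literals,
# so π |= φ is just set inclusion.

def implies(case, formula):          # π |= φ for conjunctions of literals
    return formula <= case

def coherent(cases, chi, rho):       # some case makes χ ∧ ρ true
    return any(implies(pi, chi | rho) for pi in cases)

def conclusive(cases, chi, rho):     # coherent, and every χ-case implies χ ∧ ρ
    return coherent(cases, chi, rho) and all(
        implies(pi, chi | rho) for pi in cases if implies(pi, chi))

def presumptive(cases, rank, chi, rho):
    # a maximally preferred case among those implying χ implies χ ∧ ρ
    chi_cases = [pi for pi in cases if implies(pi, chi)]
    if not chi_cases:
        return False
    top = max(rank[pi] for pi in chi_cases)
    return any(implies(pi, chi | rho) for pi in chi_cases if rank[pi] == top)

def validity_label(cases, rank, chi, rho):   # Definition 4
    if conclusive(cases, chi, rho):
        return "conc"
    if presumptive(cases, rank, chi, rho):
        return "pres"
    if coherent(cases, chi, rho):
        return "coh"
    return "incoh"

# Figure 1's case model: π0 = P ∧ Q (most preferred), π1 = P ∧ ¬Q, π2 = ¬P.
pi0, pi1, pi2 = frozenset({"P", "Q"}), frozenset({"P", "¬Q"}), frozenset({"¬P"})
cases, rank = [pi0, pi1, pi2], {pi0: 2, pi1: 1, pi2: 0}

print(validity_label(cases, rank, {"P"}, {"Q"}))    # pres (Example 4)
print(validity_label(cases, rank, {"P"}, {"¬Q"}))   # coh  (Example 4)
print(validity_label([pi0], {pi0: 0}, {"P"}, {"Q"}))  # conc
```

The last line reproduces the single-case model used for the {conc, incoh} example in the proof of Proposition 3.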
and the types {conc, conc}, {conc, pres}, and {conc, coh} are not possible. Similarly when AC(σ, ¬ι) = v0. Therefore, TC(σ ± ι) is equal to {conc, incoh}.
v0 = pres Then v1 ∈ {pres, coh, incoh} and AC(σ, ¬ι) ∈ {pres, coh, incoh}. Similarly when AC(σ, ¬ι) = v0. Therefore, TC(σ ± ι) is equal to {pres, pres}, {pres, coh}, or {pres, incoh}.
v0 = coh Then v1 ∈ {coh, incoh}, and TC(σ ± ι) is equal to {coh, coh} or {coh, incoh}.
v0 = incoh Then v1 = incoh and TC(σ ± ι) is {incoh, incoh}.
Now we give an example for each possible type. For {conc, incoh}, let C be a case model with only one case π0 = P ∧ Q, and P ± Q an issue. Then TC(P ± Q) = {conc, incoh}.
Let C be a case model with cases π0 = P ∧ Q and π1 = P ∧ ¬Q, and P ± Q an issue. For {pres, pres}, if C has a preference relation π0 ∼ π1, then TC(P ± Q) = {pres, pres}. For {pres, coh}, if C has a preference relation π0 > π1, then TC(P ± Q) = {pres, coh}.
For {pres, incoh}, let C be a case model with cases π0 = P ∧ Q and π1 = P ∧ R, and P ± Q an issue. If C has a preference relation π0 > π1, then TC(P ± Q) = {pres, incoh}.
For {coh, coh}, let C be a case model with cases π0 = P, π1 = P ∧ Q, and π2 = P ∧ ¬Q, and P ± Q an issue. If C has a preference relation π0 > π1 ∼ π2, then TC(P ± Q) = {coh, coh}.
For {coh, incoh}, let C be a case model with cases π0 = P and π1 = P ∧ Q, and P ± Q an issue. If C has a preference relation π0 > π1, then TC(P ± Q) = {coh, incoh}.
For {incoh, incoh}, let C be a case model with only the case π0 = P, and P ± Q an issue. Then TC(P ± Q) = {incoh, incoh}. □

Proposition 3 depends on the preference relation between cases in C being general. If restrictions are imposed on that preference relation, fewer types may be possible. In particular, if such a preference relation is trivial (in the sense that all cases are at least as preferred as all other cases), as in the case models representing HYPO examples [25], then only 4 types are possible: {conc, incoh}, {pres, incoh}, {pres, pres} and {incoh, incoh}, since then a coherent argument is always presumptive.

Comparing issues. There are two natural ways in which to compare types. They can be compared by the relative strength of their validity labels, or by the distance of their labels. Formally:

Definition 9 (Type orderings) Let {v0, v1} and {v′0, v′1} be types. We define two binary relations ⪰v and ⪰d on types over {conc, pres, coh, incoh} as follows:
(1) {v0, v1} ⪰v {v′0, v′1} if and only if: v0 ≥ v′0 and v1 ≥ v′1, or v0 ≥ v′1 and v1 ≥ v′0;
(2) {v0, v1} ⪰d {v′0, v′1} if and only if vd(v0, v1) ≥ vd(v′0, v′1).
The asymmetric parts of ⪰v, ⪰d are respectively denoted ≻v and ≻d. Their symmetric parts are respectively denoted ∼v and ∼d.

Intuitively, relation ⪰v orders types by the strength of their validity labels as defined in Definition 5. So higher types in ⪰v pertain to issues involving stronger arguments, while lower types in ⪰v pertain to issues involving weaker arguments. Instead, relation ⪰d orders types by how far apart the labels within the type are from each other. So higher types in ⪰d pertain to issues combining a strong and a weak argument, while lower types in ⪰d pertain to issues involving arguments of similar strength.

Figure 3: Hasse diagram of ⪰v. The numbered columns depict the equivalence classes of ⪰d (one per distance from 0 to 3). (Recovered node labels, by validity distance: 3: {conc, incoh}; 2: {pres, incoh}; 1: {pres, coh}, {coh, incoh}; 0: {pres, pres}, {coh, coh}, {incoh, incoh}.)

Example 7 Continuing on Example 6, we have:
{pres, coh} ⪰v {incoh, incoh};
{pres, coh} ⪰d {incoh, incoh}.

We will see these relations at work in the definition of the hardness ordering provided in the next subsection. For now it is important to observe that these relations are well-behaved:

Proposition 4 We have that:
(1) ⪰v is a partial order, which is not total;
(2) ⪰d is a total preorder, which is not antisymmetric.

PROOF. Claim 1. By Definition 9, ⪰v inherits the properties of ≥ (Definition 5) and is therefore reflexive, antisymmetric, and transitive; hence a partial order. However, it is not total, since some types are incomparable: for instance, {conc, incoh} and {pres, coh} cannot be compared, since conc ≥ pres while incoh ≤ coh.
Claim 2. By Definitions 9 and 6, an integer is associated to each type. Then ⪰d inherits the properties of the integer ≥ relation: reflexivity, transitivity and totality. However, it is not antisymmetric: for instance, {pres, coh} ⪰d {coh, incoh} and {coh, incoh} ⪰d {pres, coh} (both have validity distance 1), but {pres, coh} ≠ {coh, incoh}. □

Relations ⪰v and ⪰d are depicted in Figure 3.

3.3 Hardness
We view hardness as a relation between issues: harder vs. easier issues. The intuition that guides our definition is based on the relations ⪰d and ⪰v introduced in the previous section. An issue is 'easy' if the validity labels of the two arguments involved in the issue are far apart by their validity distance: the stronger argument prevails. By means of illustration, an issue where the two arguments are one of type conc and one of type incoh will be easy to decide. It is 'hard' if, vice versa, the validity labels of the two arguments involved in the issue are close by validity distance. The prototypical case consists of arguments with the same validity label. It is in such cases that one can then use relation ⪰v to distinguish among issues whose arguments have the same validity distance. Intuitively, if two issues both involve arguments with the same validity distance, like for instance {pres, coh} and {coh, incoh}, it is the issue involving stronger arguments that is arguably 'easier'.

These intuitions back the definition of hardness as an ordering over issue types, which is based on the lexicographic combination
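The validity distance of Definition 6 and the type orderings of Definition 9 can be checked mechanically. A small Python sketch (the names are ours), indexing the labels along the total order conc ≥ pres ≥ coh ≥ incoh:

```python
# Validity distance (Definition 6) and type orderings (Definition 9).
# Labels are indexed along conc >= pres >= coh >= incoh; the validity
# distance is the index gap. Types are unordered pairs of labels.
ORDER = ["conc", "pres", "coh", "incoh"]

def vd(v0, v1):                      # validity distance, ranges over 0..3
    return abs(ORDER.index(v0) - ORDER.index(v1))

def geq(v0, v1):                     # v0 >= v1 in the validity ordering
    return ORDER.index(v0) <= ORDER.index(v1)

def geq_v(t, u):                     # {v0,v1} ⪰v {v0',v1'} (Definition 9.1)
    (a, b), (c, d) = t, u
    return (geq(a, c) and geq(b, d)) or (geq(a, d) and geq(b, c))

def geq_d(t, u):                     # {v0,v1} ⪰d {v0',v1'} (Definition 9.2)
    return vd(*t) >= vd(*u)

# Example 7: {pres, coh} dominates {incoh, incoh} in both orderings.
print(geq_v(("pres", "coh"), ("incoh", "incoh")))   # True
print(geq_d(("pres", "coh"), ("incoh", "incoh")))   # True
# Proposition 4, Claim 1: {conc, incoh} and {pres, coh} are ⪰v-incomparable.
print(geq_v(("conc", "incoh"), ("pres", "coh")),
      geq_v(("pres", "coh"), ("conc", "incoh")))    # False False
```

The last two calls witness the non-totality of ⪰v asserted in Proposition 4, and vd(pres, coh) = vd(coh, incoh) = 1 witnesses the non-antisymmetry of ⪰d.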
Figure 5: The hardness of SITUATION ± OUTCOME_1 in 7 case models characterized by the type of issues. (Recovered case labels include EXPERT ∧ SITUATION ∧ OUTCOME_2, ¬EXPERT ∧ SITUATION ∧ OUTCOME_1, and ¬EXPERT ∧ SITUATION ∧ ¬OUTCOME_1; one panel label, {coh, coh}, survives.)
In the {coh,incoh} model, SITUATION ± OUTCOME_1 is harder than in the {pres,coh} model, since argument (SITUATION, OUTCOME_1) is less preferable, as the testimony for OUTCOME_1 is made by a non-expert, and for (SITUATION, ¬OUTCOME_1) there is no support. Compared with the {pres,coh} model, there is no expert testimony about the issue, which makes the consideration of this issue harder.

In the {pres,pres} model, issue SITUATION ± OUTCOME_1 is harder than in the {coh,incoh} model, even though the testimony for OUTCOME_1 is from an expert, namely a more preferable source. This is because its counterargument (SITUATION, ¬OUTCOME_1) is also based on an expert, so both having OUTCOME_1 and not having OUTCOME_1 have strong support; the issue is hence harder to solve than in the {coh,incoh} model, where no one testifies that the situation should not have OUTCOME_1.

In the {coh,coh} model, issue SITUATION ± OUTCOME_1 is harder than in the {pres,pres} model. Even though in both models the arguments for having OUTCOME_1 and for not having it are equally strong, type {coh,coh} indicates that the testimonies are from less preferable sources (non-experts), and because of this, the consideration of the issue becomes harder.

Type {incoh,incoh} is the hardest one. As shown in Figure 5, there is no decision at all about issue SITUATION ± OUTCOME_1, hence the consideration of the issue is the hardest, as there is nothing that can be referred to.

4 HARDNESS OVER TIME
In this section, we apply our approach to model case-based decision-making in a real legal domain from the United States, and discuss the development of precedential value in a series of relevant cases, following the research developed by Berman and Hafner [8, 11] and Verheij [21].

The cases we show here are tort cases from New York, which are about car accidents, and concern which rule should be applied when different jurisdictions are relevant. For instance, when people drive from New York and have an accident in Ontario, which rule should be followed, Ontario's or New York's?

Smith v. Clute 277 N.Y. 407, 14 N.E.2d 455 (1938): The claim was in tort law (driver negligence). The territorial rule applies.
Kerfoot v. Kelley 294 N.Y. 288, 62 N.E.2d 74 (1945): The claim was in tort law (driver negligence). The territorial rule applies.
Auten v. Auten 308 N.Y. 155, 124 N.E.2d 99 (1954): The claim was in contract law (enforce a child support agreement). The center-of-gravity rule applies.
Kaufman v. American Youth Hostels 5 N.Y.2d 1016 (1959): The claim was in tort law (travel guide negligence). The territorial rule applies.
Haag v. Barnes 9 N.Y.2d 554, 175 N.E.2d 441, 216 N.Y.S.2d 65 (1961): The claim was in contract law (reopen a child support agreement). The center-of-gravity rule applies.
Kilberg v. Northeast Airlines 9 N.Y.2d 34, 172 N.E.2d 526, 211 N.Y.S.2d 133 (1961): The claim was in tort law (common carrier negligence). The territorial rule is partly applied, and there is an exception for the damages part of the case.
Babcock v. Jackson 12 N.Y.2d 473, 191 N.E.2d 279, 473 N.Y.S.2d 279 (1963): The claim was in tort law (driver negligence). The center-of-gravity rule applies.

SMITH ∧ 1938 ∧ TORT ∧ TERRITORY
KERFOOT ∧ 1945 ∧ TORT ∧ TERRITORY
AUTEN ∧ 1954 ∧ CONTRACT ∧ GRAVITY
KAUFMAN ∧ 1959 ∧ TORT ∧ TERRITORY
HAAG ∧ 1961 ∧ CONTRACT ∧ GRAVITY
KILBERG ∧ 1961 ∧ TORT ∧ EXCEPTION
BABCOCK ∧ 1963 ∧ TORT ∧ GRAVITY
Figure 6: The development of precedential values in cases [21]

Consider the case model constructed in [21], shown in Figure 6, which consists of 7 cases. They are represented by factors for the plaintiff's name (SMITH, KERFOOT, etc.), the year of the decision (1938, 1945, etc.), the kind of case (TORT for a tort case, CONTRACT for a contract case), and the jurisdiction choice rule (TERRITORY for entirely applying the territorial rule, EXCEPTION for partly applying the territorial rule while making an exception for the damages part of the case, and GRAVITY for applying the center-of-gravity rule). The preference relation among these cases is denoted directly by the size of the boxes; namely, the Babcock case is more preferred
than other cases, which are preferentially equivalent, since the Babcock case is a landmark case overriding previous cases, by which the center-of-gravity approach is established for tort law [11, 21]. We also apply the background theory of all cases in the case model set in [21], namely, the plaintiff names exclude each other pairwise (¬(SMITH ∧ KERFOOT), etc.), and similarly for the decision years (¬(1938 ∧ 1945), etc.), the kinds of cases (¬(TORT ∧ CONTRACT)) and the choice rules (¬(TERRITORY ∧ EXCEPTION), etc.).

We analyze the development of the jurisdiction choice rule by restricting the case model to the cases up to and including a particular year. For instance, we write C(1954) for the set consisting of the three cases Smith, Kerfoot and Auten dating from 1954 or before [21].

The issues that we want to analyze in this series of cases are about the development of the jurisdiction choice rule in tort law cases. They are the following:
(1) TORT ± TERRITORY, which is associated with arguments (TORT, TERRITORY) and (TORT, ¬TERRITORY);
(2) TORT ± GRAVITY, which is associated with arguments (TORT, GRAVITY) and (TORT, ¬GRAVITY);
(3) TORT ± EXCEPTION, which is associated with arguments (TORT, EXCEPTION) and (TORT, ¬EXCEPTION);
(4) ⊤ ± GRAVITY, which is associated with arguments (⊤, GRAVITY) and (⊤, ¬GRAVITY).

Issue TORT ± TERRITORY is about whether a tort law case should entirely apply the territorial rule or not. TORT ± GRAVITY is about whether a tort law case should apply the center-of-gravity rule or not. TORT ± EXCEPTION is about whether a tort law case should partly follow the territorial rule and make an exception for the damages part of the case. And ⊤ ± GRAVITY is about the applied status of the center-of-gravity rule in a general sense.

The validity of the arguments listed above has been discussed in [21]. As we show in Section 3, the hardness of an issue is determined by the validity of the arguments it is associated with. For instance, the hardness of issue TORT ± TERRITORY in 1938 with respect to case model C(1938) is:
TC(1938)(TORT ± TERRITORY) = {conc, incoh}
which is determined by the validity of the following arguments:
AC(1938)(TORT, TERRITORY) = conc
AC(1938)(TORT, ¬TERRITORY) = incoh

TORT ± TERRITORY becomes harder in 1961, since the hardness of the issue in C(1961) is {pres, pres}. Notice that according to Definition 4, we consider the labels of arguments (TORT, TERRITORY) and (TORT, ¬TERRITORY) in C(1961) as pres rather than coh.

Based on the validity of the relevant arguments, we summarize the hardness of issues with respect to case models by years in Table 1. The trends of the hardness of issues are shown in Figure 7, from which we draw the following observations about the hardness of issues:
(1) When the center-of-gravity rule is introduced in general (1954) and into the tort law domain (1963), the issues about the GRAVITY rule become harder.
(2) The issues related to the territorial rule become harder correspondingly when the court shows doubt about the rule by making an exception for the damages part of a tort case.
(3) When the center-of-gravity rule is introduced by a landmark case (with higher preference), which makes the rule more preferred, not only does the issue about this rule become harder, but other issues are also affected (they become easier).
(4) In general, after the GRAVITY rule finally becomes the primary one in 1963, the 4 tort-law-relevant issues retain the same hardness as in 1945, when the primary one is the territorial rule. However, more options make the consideration of which rule to apply harder.

The first observation can be illustrated by the introduction of the center-of-gravity rule. In 1954, the center-of-gravity rule starts to be considered by the New York courts in a general sense; even though it has no effect on the hardness of issues TORT ± TERRITORY and TORT ± GRAVITY, it does make issue ⊤ ± GRAVITY harder than in 1945:
TC(1954)(⊤ ± GRAVITY) ≻h TC(1945)(⊤ ± GRAVITY).
This is because, before 1954, it is clear that the GRAVITY rule is not considered in the court, as argument (⊤, GRAVITY) has label incoh and (⊤, ¬GRAVITY) has label conc. However, the Auten case introduces this rule to the series of case models and makes both of the arguments presumptive. The introduction not only makes (⊤, GRAVITY) stronger and (⊤, ¬GRAVITY) weaker, but also shortens the validity distance between the validity labels of the two opposite arguments in the issue. Because of the shorter distance, considering whether the center-of-gravity rule should be generally considered or not becomes harder than before. Similarly, after the center-of-gravity rule is introduced into the tort law domain (by the Babcock case in 1963), we can see the same trend as in 1954.

The second observation concerns the exception, in a tort law case that applied the territorial rule, introduced by the Kilberg case in 1961. After this case is added to the model, it has no effect on the hardness of the issues related to the GRAVITY rule:
TC(1961)(TORT ± GRAVITY) ∼h TC(1959)(TORT ± GRAVITY);
TC(1961)(⊤ ± GRAVITY) ∼h TC(1959)(⊤ ± GRAVITY).
Both TORT ± TERRITORY and TORT ± EXCEPTION become harder:
TC(1961)(TORT ± TERRITORY) ≻h TC(1959)(TORT ± TERRITORY);
TC(1961)(TORT ± EXCEPTION) ≻h TC(1959)(TORT ± EXCEPTION).
This is because the exception makes the consideration of the territorial rule in the tort law domain more complex, as now the courts need to think about whether there will be an exception or not.

The third observation concerns the introduction of the landmark case (Babcock), introducing the center-of-gravity rule in the tort law domain. This makes issue TORT ± GRAVITY harder:
TC(1963)(TORT ± GRAVITY) ≻h TC(1961)(TORT ± GRAVITY).
and other relevant issues easier:
TC(1963)(TORT ± TERRITORY) ≺h TC(1961)(TORT ± TERRITORY);
TC(1963)(TORT ± EXCEPTION) ≺h TC(1961)(TORT ± EXCEPTION);
TC(1963)(⊤ ± GRAVITY) ≺h TC(1961)(⊤ ± GRAVITY).
These trends can be explained from an intuitive perspective. Since from 1963 the GRAVITY rule is primary, for the other options the more preferable choice is not to apply them, hence the issues they are associated with become easier. But TORT ± GRAVITY becomes harder, as we explained in the first observation above.

In the last observation, we find that after the GRAVITY rule becomes primary in 1963, all the tort-law-relevant issues retain the same hardness as they had before 1954. The only difference is that the
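The year-restricted case models C(year) used in this section amount to filtering the Figure 6 case base by decision year. A hypothetical sketch (the data layout and names are ours; the full hardness comparison additionally needs the preference relation and the orderings of Section 3):

```python
# C(year): restrict the Figure 6 case base to cases decided in `year` or
# before. Each case is (plaintiff, year, kind, rule); a simplified sketch.
CASES = [
    ("SMITH",   1938, "TORT",     "TERRITORY"),
    ("KERFOOT", 1945, "TORT",     "TERRITORY"),
    ("AUTEN",   1954, "CONTRACT", "GRAVITY"),
    ("KAUFMAN", 1959, "TORT",     "TERRITORY"),
    ("HAAG",    1961, "CONTRACT", "GRAVITY"),
    ("KILBERG", 1961, "TORT",     "EXCEPTION"),
    ("BABCOCK", 1963, "TORT",     "GRAVITY"),
]

def restrict(cases, year):
    return [c for c in cases if c[1] <= year]

def conclusive_rule(cases, kind, rule):
    # (kind, rule) is conclusive when every case of that kind applies the rule
    of_kind = [c for c in cases if c[2] == kind]
    return bool(of_kind) and all(c[3] == rule for c in of_kind)

print([c[0] for c in restrict(CASES, 1954)])
# the three cases of C(1954): Smith, Kerfoot, Auten
print(conclusive_rule(restrict(CASES, 1938), "TORT", "TERRITORY"))
# True, matching AC(1938)(TORT, TERRITORY) = conc
```

By 1963 the same check fails, since Kilberg (EXCEPTION) and Babcock (GRAVITY) are tort cases not applying the territorial rule, reflecting the shift of TORT ± TERRITORY away from type {conc, incoh}.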
Figure 7: Development of hardness over time for different issues in the series of tort cases. (Recovered plot labels: vertical axis from easier to harder, {conc, incoh}, {pres, incoh}, {pres, coh}, {coh, incoh}, {pres, pres}, {coh, coh}, {incoh, incoh}; horizontal axis, the years 1938, 1945, 1954, 1959, 1961, 1963; plotted issues, TORT ± TERRITORY, TORT ± GRAVITY, TORT ± EXCEPTION, ⊤ ± GRAVITY.)
more preferred rule shifts from the TERRITORY rule to the GRAVITY rule. Moreover, we can see that making the choice of which rule should be applied in 1963 is harder than before 1954. This can be connected to our first observation. Before 1954, there is only one jurisdiction choice rule to be considered, as the argument for applying the TERRITORY rule is conclusive (in C(1938) and C(1945)), and the argument for applying the GRAVITY rule is incoherent (in C(1938) and C(1945)). Therefore, if a current case is given in a year before 1954, by following the precedents, applying the territorial rule in the current case will be the more preferable choice. In 1963, even though the GRAVITY rule has already become more preferred than other choices, the consideration of which rule should be applied still becomes more complex. This is because there are new cases that introduce new choices about the rule application during this period. More choices make the consideration of the application harder.

5 DISCUSSION
In this section, we position our formal theory of the hardness of case-based decisions with respect to related research.

The development of hardness over time. The dynamics of case-based reasoning has for instance been addressed in [11, 13, 14] in terms of rules, values, and reasons, and there changes in these elements are associated with the handling of hard cases. As we show in Section 4, our approach is also relevant for the dynamics of case-based reasoning. We extend the analysis of a series of New York tort cases in [11, 21] with types of issues, from which we find that even though new cases can strengthen an argument's validity, the hardness of issues may also increase.

Our approach gives insight into the five temporal patterns listed in [11, Section 4.2] (also discussed in [21]), providing a different angle in terms of the hardness typology.
(1) A general shift in the relative priority of competing purposes. The Auten and Haag cases introduce the center-of-gravity rule into the contract cases, and let GRAVITY become a presumptive conclusion in general, whereas it was incoherent. However, argument (TORT, GRAVITY) is not yet coherent. From the issue perspective, the Auten case makes the general consideration of the GRAVITY rule (⊤ ± GRAVITY) harder (from
{conc,incoh} to {pres,pres}). After 1954, the consideration of the GRAVITY rule becomes more complicated. However, the hardness of handling the GRAVITY rule in tort law cases has not changed yet, as TORT ± GRAVITY is still as hard as before.
(2) A shift in the relative priority of competing purposes by finding exceptions. The Kilberg case makes TERRITORY, representing the entire application of the territorial rule, no longer a conclusive consequence of tort cases, but only a presumptive consequence. From the issue perspective, after the Kilberg case is added to the model, the type of issue TORT ± TERRITORY shifts from {conc,incoh} to {pres,pres}, in the sense that EXCEPTION (partly applying the territorial rule) makes the territorial rule harder to handle.
(3) The ratio decidendi of an older case is overruled, although it is significantly different. The example of this pattern discussed in [11, 21] is that the Babcock case overrules the Kaufman case. The formal case model we use doesn't distinguish tort cases from passenger cases, thus the pattern is not visible here. But as shown in Figure 7, we can still see that the landmark Babcock case makes the consideration of some issues in a sense harder than right after the Kaufman case.
(4) A case is implicitly overruled. The rule applied in tort cases has changed since 1961 because of the Kilberg case and the Babcock case. The territorial rule is no longer a presumptive conclusion, and the center-of-gravity rule becomes more preferred. If the Kerfoot case had been decided after Kilberg or Babcock, it might have come with a different outcome. As shown in Figure 7, both issues have type {pres,coh} in 1963, but notice that the conclusions of the presumptive arguments in the issues are GRAVITY and ¬TERRITORY. Furthermore, both issues in 1963 are harder than in the period 1938 ∼ 1959; if the Kerfoot case is decided after 1963, though it will more likely apply the GRAVITY rule, the decision-making process still becomes more complicated than before.
(5) A case is explicitly overruled. As discussed by [8, 11], this pattern occurs rarely, and is not shown in our case model.

Compared with other research developed recently, in particular Horty's reason model for precedential constraint [14] and Henderson and Bench-Capon's model [13] following Levi's idea to consider case-based reasoning as a "moving classification system" [15], the approach we apply here is based on a different mechanism in terms of an elementary propositional logical language. Moreover, their research focuses more on the development of cases and the involved rules, and the hardness of issues as they occur in the decision-making process, the main contribution of this paper, has not been formally analyzed there. It seems interesting to follow up on Levi's three-stage life cycle of rules (creation, refinement, replacement) as discussed in [13] using the hardness typology in this paper.

The focus of other research on case-based reasoning is often rather different from our paper, and the hardness approach presented here may supplement that research. For instance, [13, 14] emphasize the role of reasons and rules in legal cases, and how they favor the different parties in the court, which may be connected with our approach, in which issues and their hardness are associated with arguments with equal premises, but opposite conclusions.

The approach we present here continues the discussion in [25], where we find that 'using an incoherent argument can make sense and break new ground. A decision based on such an argument can be considered as going beyond the current legal status modeled in the precedent model.' We can further interpret this idea with the results we get from the case study in Section 4. For instance, after the center-of-gravity rule is introduced into the tort law domain in 1963 by the Babcock case, the validity of the argument for applying the rule in a tort law case, namely (TORT, GRAVITY), shifts from incoh to pres. Even though the validity of the argument becomes stronger, the associated issue TORT ± GRAVITY becomes harder, as the validity distance between the validity labels of the two arguments in the issue is shortened, the other argument becoming weaker.

It could be interesting to enrich the series of cases discussed in Section 4 to include what happened after the Babcock case. Other series of cases that are well-known in AI and Law are also interesting to look at using the hardness theory we developed, for instance, the cases about product liability and privity [3, 13, 15]. Also, since the case model we show in the case study does not exhibit all the possible types in a real legal domain, it can be interesting to investigate whether such a complete case study can be made, in order to better understand the hardness of issues in an actual decision-making process. Natural developments are also to connect our hardness typology in terms of kinds of validity to proof standards [10] and to consider the development of hardness over time in terms of argumentation schemes for case-based reasoning [24]. It would also be interesting to explore the hardness of issues under preference orderings other than significance, for instance, in terms of court levels.

From hardness of issues to easy and hard cases. As discussed by Gardner [9], hard cases are a main topic in law. Rissland summarizes that hard cases in law can arise in three ways [18]:
(1) there exist competing legal rules;
(2) there exist unresolved predicates; and
(3) there exist competing cases.
Our formalism has the potential for modeling the hard cases discussed by Gardner. Competing cases and legal rules can be associated with issues with types {pres,coh}, {pres,pres}, and {coh,coh}. In these types, the conclusions of the arguments involved are opposite to each other, and hence form a competing relation. Unresolved predicates can be associated with issues that do not contain conclusive arguments, namely, where the same premise can lead to different conclusions. If we treat these predicates as the premises, and their indicated meaning as the conclusions, we can analyze the meaning of unresolved predicates as leaving room for debate, hence leading to cases that are harder in the typology. Hence it seems interesting for future research to connect the hardness of issues to insights on easy and hard cases. For instance, there is a connection to Dworkin's famous idea (see e.g. [19, p. 488f.]) that for the perfect, Herculean judge, there is one right solution for all cases, including the hard ones. In our hardness typology, there is a variety of options. Sometimes there is exactly one solution, namely in the types {conc,incoh}, {pres,incoh} and {coh,incoh}. In {incoh,incoh} there is no stare decisis solution. In {pres,pres} and {coh,coh}, there are two equally preferred solutions, and in {pres,coh}, there are two, of which one is strictly preferred over the other. Also, consider the characterization of hard cases in [12] that they require an a-rational decision
ICAIL’21, June 21–25, 2021, São Paulo, Brazil H. Zheng, D. Grossi and B. Verheij
When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings

Lucia Zheng∗ (Stanford University, Stanford, California, USA), zlucia@stanford.edu
Neel Guha∗ (Stanford University, Stanford, California, USA), nguha@stanford.edu
Brandon R. Anderson (Stanford University, Stanford, California, USA), banderson@law.stanford.edu
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Zheng and Guha, et al.
We hypothesize that the puzzling failure to find substantial gains from domain pretraining in law stems from the fact that existing fine-tuning tasks may be too easy and/or fail to correspond to the domain of the pretraining corpus task. We show that existing legal NLP tasks, Overruling (whether a sentence overrules a prior case, see Section 4.1) and Terms of Service (classification of contractual terms of service, see Section 4.2), are simple enough for naive baselines (BiLSTM) or BERT (without domain-specific pretraining) to achieve high performance. Observed gains from domain pretraining are hence relatively small. Because U.S. law lacks any benchmark task that is comparable to the large, rich, and challenging datasets that have fueled the general field of NLP (e.g., SQuAD [36], GLUE [46], CoQA [37]), we present a new dataset that simulates a fundamental task for lawyers: identifying the legal holding of a case. Holdings are central to the common law system. They represent the governing legal rule when the law is applied to a particular set of facts. The holding is precedential and what litigants can rely on in subsequent cases. So central is the identification of holdings that it forms a canonical task for first-year law students to identify, state, and reformulate the holding.

This CaseHOLD dataset (Case Holdings on Legal Decisions) provides 53,000+ multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, that could be cited. We construct this dataset using the rules of case citation [9], which allow us to match a proposition to a source through a comprehensive corpus of U.S. case law from 1965 to the present. Intuitively, we extract all legal citations and use the "holding statement," often provided in parenthetical propositions accompanying U.S. legal citations, to match context to holding [2]. CaseHOLD extracts the context, legal citation, and holding statement and matches semantically similar, but inappropriate, holding propositions. This turns the identification of holding statements into a multiple choice task.

In Table 1, we show a citation example from the CaseHOLD dataset. The Citing Text (prompt) consists of the context and legal citation text, Holding Statement 0 is the correct corresponding holding statement, Holding Statements 1-4 are the four similar, but incorrect, holding statements matched with the given prompt, and the Label is the 0-index label of the correct holding statement answer. For simplicity, we use a fixed context window that may start mid-sentence.

Table 1: CaseHOLD example

Citing Text (prompt): They also rely on Oswego Laborers' Local 214 Pension Fund v. Marine Midland Bank, 85 N.Y.2d 20, 623 N.Y.S.2d 529, 647 N.E.2d 741 (1996), which held that a plaintiff "must demonstrate that the acts or practices have a broader impact on consumers at large." Defs.' Mem. at 14 (quoting Oswego Laborers', 623 N.Y.S.2d 529, 647 N.E.2d at 744). As explained above, however, Plaintiffs have adequately alleged that Defendants' unauthorized use of the DEL MONICO's name in connection with non-Ocinomled restaurants and products caused consumer harm or injury to the public, and that they had a broad impact on consumers at large inasmuch as such use was likely to cause consumer confusion. See, e.g., CommScope, Inc. of N.C. v. CommScope (U.S.A) Int'l Grp. Co., 809 F. Supp.2d 33, 38 (N.D.N.Y 2011) (<HOLDING>); New York City Triathlon, LLC v. NYC Triathlon

Holding Statement 0 (correct answer): holding that plaintiff stated a 349 claim where plaintiff alleged facts plausibly suggesting that defendant intentionally registered its corporate name to be confusingly similar to plaintiffs CommScope trademark

Holding Statement 1 (incorrect answer): holding that plaintiff stated a claim for breach of contract when it alleged the government failed to purchase insurance for plaintiff as agreed by contract

Holding Statement 2 (incorrect answer): holding that the plaintiff stated a claim for tortious interference

Holding Statement 3 (incorrect answer): holding that the plaintiff had not stated a claim for inducement to breach a contract where she had not alleged facts sufficient to show the existence of an enforceable underlying contract

Holding Statement 4 (incorrect answer): holding plaintiff stated claim in his individual capacity

We show that this task is difficult for conventional NLP approaches (BiLSTM F1 = 0.4 and BERT F1 = 0.6), even though law students and lawyers are able to solve the task at high accuracy. We then show that there are substantial and statistically significant performance gains from domain pretraining with a custom vocabulary (which we call Legal-BERT), using all available case law from 1965 to the present (a 7.2% gain in F1, representing a 12% relative boost from BERT). We then experimentally assess conditions for gains from domain pretraining with CaseHOLD and find that the size of the fine-tuning task is the principal other determinant of gains to domain-specific pretraining.

The code, the legal benchmark task datasets, and the Legal-BERT models presented here can be found at: https://github.com/reglab/casehold.

Our paper informs how researchers should decide when to engage in data and resource-intensive pretraining. Such decisions pose an important tradeoff, as cost estimates for fully pretraining BERT can be upward of $1M [41], with potential for social harm [4], but advances in legal NLP may also alleviate huge disparities in access to justice in the U.S. legal system [16, 34, 47]. Our findings suggest that there is indeed something unique to legal language when faced with sufficiently challenging forms of legal reasoning.

2 RELATED WORK

The Transformer-based language model, BERT [12], which leverages a two step pretraining and fine-tuning framework, has achieved state-of-the-art performance on a diverse array of downstream NLP tasks. BERT, however, was trained on a general corpus of Google Books and Wikipedia, and much of the scientific literature has since focused on the question of whether the Transformer-based approach could be improved by domain-specific pretraining.

Outside of the law, for instance, Lee et al. [25] show that BioBERT, a BERT model pretrained on biomedicine domain-specific corpora (PubMed abstracts and full text articles), can significantly outperform BERT on domain-specific biomedical NLP tasks. For instance, it achieves gains of 6-9% in strict accuracy compared to BERT [25] for biomedical question answering tasks (BioASQ Task 5b and Task 5c) [45]. Similarly, Beltagy et al. show improvements from domain
pretraining with SciBERT, using a multi-domain corpus of scientific publications [3]. On the ACL-ARC multiclass classification task [22], which contains example citations labeled with one of six classes, where each class is a citation function (e.g., background), SciBERT achieves gains of 7.07% in macro F1 [3]. It is worth noting that this task is constructed from citation text, making it comparable to the CaseHOLD task we introduce in Section 3.

Yet work adapting this framework for the legal domain has not yielded comparable returns. Elwany et al. [14] use a proprietary corpus of legal agreements to pretrain BERT and report "marginal" gains of 0.4-0.7% on F1. They note that in some settings, such gains could still be practically important. Zhong et al. [49] use BERT pretrained on Chinese legal documents and find no gains relative to non-pretrained NLP baseline models (e.g., LSTM). Similarly, [50] finds that the same pretrained model performs poorly on a legal question and answer dataset.

Hendrycks et al. [19] found that in zero-shot and few-shot settings, state-of-the-art models for question answering, GPT-3 and UnifiedQA, have lopsided performance across subjects, performing with near-random accuracy on subjects related to human values, such as law and morality, while performing at up to 70% accuracy on other subjects. This result motivated their attempt to create a better model for the multistate bar exam by further pretraining RoBERTa [27], a variant of BERT, on 1.6M cases from the Harvard Law Library case law corpus. They found that RoBERTa fine-tuned on the bar exam task achieved 32.8% test accuracy without domain pretraining and 36.1% test accuracy with further domain pretraining. They conclude that while "additional pretraining on relevant high quality text can help, it may not be enough to substantially increase . . . performance." Hendrycks et al. [18] highlight that future research should especially aim to increase language model performance on tasks in subject areas such as law and moral reasoning, since aligning future systems with human values and understanding of human approval/disapproval necessitates high performance on such subject-specific tasks.

Chalkidis et al. [7] explored the effects of law pretraining using various strategies and evaluate on a broader range of legal NLP tasks. These strategies include (a) using BERT out of the box, which is trained on general domain corpora, (b) further pretraining BERT on legal corpora (referred to as LEGAL-BERT-FP), which is the method also used by Hendrycks et al. [19], and (c) pretraining BERT from scratch on legal corpora (referred to as LEGAL-BERT-SC). Each of these models is then fine-tuned on the downstream task. They report that a LEGAL-BERT variant, in comparison to tuned BERT, achieves a 0.8% improvement in F1 on a binary classification task derived from the ECHR-CASES dataset [5], a 2.5% improvement in F1 on the multi-label classification task derived from ECHR-CASES, and between a 1.1-1.8% improvement in F1 on multi-label classification tasks derived from subsets of the CONTRACTS-NER dataset [6, 8]. These gains are small when considering the substantial data and computational requirements of domain pretraining. Indeed, Hendrycks et al. [19] concluded that the documented marginal difference does not warrant domain pretraining.

This existing work raises important questions for law and artificial intelligence. First, these results might be seen to challenge the widespread belief in the legal profession that legal language is distinct [28, 29, 44]. Second, one of the core challenges in the field is that unlike general NLP, which has thrived on large benchmark datasets (e.g., SQuAD [36], GLUE [46], CoQA [37]), there are few large and publicly available legal benchmark tasks for U.S. law. This is explained in part by the expense of labeling decisions and challenges around compiling large sets of legal documents [32], leading the approaches above to rely on non-English datasets [49, 50] or proprietary datasets [14]. Indeed, there may be a kind of selection bias in available legal NLP datasets, as they tend to reflect tasks that have been solved by methods often pre-dating the rise of self-supervised learning. Third, assessment standards vary substantially, providing little guidance to researchers on whether domain pretraining is worth the cost. Studies vary, for instance, in whether BERT is retrained with custom vocabulary, which is particularly important in fields where terms of art can defy embeddings of general language models. Moreover, some comparisons are between (a) BERT pretrained at 1M iterations and (b) domain-specific pretraining on top of BERT (e.g., 2M iterations) [25]. Impressive gains might hence be confounded because the domain pretrained model simply has had more time to train. Fourth, legal language presents unique challenges in substantial part because of the extensive and complicated system of legal citation. Work has shown that conventional tokenization that fails to account for the structure of legal citations can improperly present the legal text [20]. For instance, sentence boundary detection (critical for BERT's next sentence prediction pretraining task) may fail with legal citations containing complicated punctuation [40]. Just as using an in-domain tokenizer helps in multilingual settings [39], using a custom tokenizer should improve performance consistently for the "language of law." Last, few have examined differences across the kinds of tasks where pretraining may be helpful.

We address these gaps for legal NLP by (a) contributing a new, large dataset with the task of identification of holding statements that comes directly from U.S. legal decisions, and (b) assessing the conditions under which domain pretraining can help.

3 THE CASEHOLD DATASET

We present the CaseHOLD dataset as a new benchmark dataset for U.S. law. Holdings are, of course, central to the common law system. They represent the governing legal rule when the law is applied to a particular set of facts. The holding is what is precedential and what litigants can rely on in subsequent cases. So central is the identification of holdings that it forms a canonical task for first-year law students to identify, state, and reformulate the holding. Thus, as for a law student, the goal of this task is two-fold: (1) understand case names and their holdings; (2) understand how to re-frame the relevant holding of a case to back up the preceding argument.

CaseHOLD is a multiple choice question answering task derived from legal citations in judicial rulings. The citing context from the judicial decision serves as the prompt for the question. The answer choices are holding statements derived from citations following text in a legal decision. There are five answer choices for each citing text. The correct answer is the holding statement that corresponds to the citing text. The four incorrect answers are other holding statements.

We construct this dataset from the Harvard Law Library case law corpus (in our analyses below, the dataset is constructed from
the holdout dataset, so that no decision was used for pretraining Legal-BERT). We extract the holding statement from citations (parenthetical text that begins with "holding") as the correct answer and take the text before it as the citing text prompt. We insert a <HOLDING> token in the position of the citing text prompt where the holding statement was extracted. To select four incorrect answers for a citing text, we compute the TF-IDF similarity between the correct answer and the pool of other holding statements extracted from the corpus and select the most similar holding statements, to make the task more difficult. We set an upper threshold for similarity (here 0.75) to rule out indistinguishable holding statements, which would make the task impossible. One of the virtues of this task setup is that we can easily tune the difficulty of the task by varying the context window, the number of potential answers, and the similarity thresholds. In future work, we aim to explore how modifying the thresholds and task difficulty affects results. In a human evaluation, the benchmark by a law student was an accuracy of 0.94.¹

A full example of CaseHOLD consists of a citing text prompt, the correct holding statement answer, four incorrect holding statement answers, and a label 0-4 for the index of the correct answer. The ordering of the indices of the correct and incorrect answers is random for each example and, unlike in a multi-class classification task, the answer indices can be thought of as multiple choice letters (A, B, C, D, E), which do not represent classes with underlying meaning, but instead just enumerate the answer choices. We provide a full example from the CaseHOLD dataset in Table 1.

4 OTHER DATASETS

To provide a comparison on difficulty and domain specificity, we also rely on two other legal benchmark tasks. The three datasets are summarized in Table 2.

Table 2: Dataset overview

Dataset | Source | Task Type | Size
Overruling | Casetext | Binary classification | 2,400
Terms of Service | Lippi et al. [26] | Binary classification | 9,414
CaseHOLD | Authors | Multiple choice QA | 53,137

In terms of size, publicly available legal tasks are small compared to mainstream NLP datasets (e.g., SQuAD has 100,000+ questions). The cost of obtaining high-fidelity labeled legal datasets is precisely why pretraining is appealing for law [15]. The Overruling dataset, for instance, required paying attorneys to label each individual sentence. Once a company has collected that information, it may not want to distribute it freely for the research community. In the U.S. system, much of this meta-data is hence retained behind proprietary walls (e.g., Lexis and Westlaw), and the lack of large-scale U.S. legal NLP datasets has likely impeded scientific progress. We now provide more detail on the two other benchmark datasets.

The Overruling task is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences from the law. An overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved.

The Overruling task dataset was provided by Casetext, a company focused on legal research software. Casetext selected positive overruling samples through manual annotation by attorneys and negative samples through randomly sampling sentences from the Casetext law corpus. This procedure has a low false positive rate for negative samples because the prevalence of overruling sentences in the whole law is low. Less than 1% of cases overrule another case and, within those cases, usually only a single sentence contains overruling language. Casetext validates this procedure by estimating the rate of false positives on a subset of sentences randomly sampled from the corpus and extrapolating this rate to the whole set of randomly sampled sentences to determine the proportion of sampled sentences to be reviewed by human reviewers for quality assurance.

Overruling has moderate to high domain specificity because the positive and negative overruling examples are sampled from the Casetext law corpus, so the language in the examples is quite specific to the law. However, it is the easiest of the three legal benchmark tasks, since many overruling sentences are distinguishable from non-overruling sentences due to the specific and explicit language judges typically use when overruling. In his work on overruling language and speech act theory, Dunn cites several examples of judges employing an explicit performative form when overruling, using keywords such as "overrule", "disapprove", and "explicitly reject" in many cases [13]. Language models, non-neural machine models, and even heuristics generally detect such keyword patterns effectively, so the structure of this task makes it less difficult compared to other tasks. Previous work has shown that SVM classifiers achieve high performance on similar tasks; Sulea et al. [31] achieve a 96% F1 on predicting case rulings of cases judged by the French Supreme Court and Aletras et al. [1] achieve 79% accuracy on predicting judicial decisions of the European Court of Human Rights.

The Overruling task is important for lawyers because the process of verifying whether cases remain valid and have not been overruled is critical to ensuring the validity of legal arguments. This need has led to the broad adoption of proprietary systems, such as Shepard's (on Lexis Advance) and KeyCite (on Westlaw), which have become important legal research tools for most lawyers [11]. High language model performance on the Overruling task could enable further automation of the shepardizing process.

In Table 3, we show a positive example of an overruling sentence and a negative example of a non-overruling sentence from the Overruling task dataset. Positive examples have label 1 and negative examples have label 0.

Table 3: Overruling examples

Passage: for the reasons that follow, we approve the first district in the instant case and disapprove the decisions of the fourth district. (Label: 1)

Passage: a subsequent search of the vehicle revealed the presence of an additional syringe that had been hidden inside a purse located on the passenger side of the vehicle. (Label: 0)

¹ This human benchmark was done on a pilot iteration of the benchmark dataset and may not correspond to the exact TF-IDF threshold presented here.
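The distractor-selection step described above (TF-IDF similarity against the correct holding, with an upper threshold to drop near-duplicates) can be sketched in a few lines. This is an illustrative toy: the paper does not specify its exact vectorizer, so the tokenization, the `tfidf_vectors` weighting, and the sample holding statements below are simplifications; only the 0.75 upper similarity threshold comes from the text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: term frequency times log inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_distractors(correct, pool, k=4, upper=0.75):
    """Rank candidate holdings by similarity to the correct one,
    dropping indistinguishable near-duplicates above the threshold."""
    vecs = tfidf_vectors([correct] + pool)
    scored = [(cosine(vecs[0], vecs[i + 1]), h) for i, h in enumerate(pool)]
    eligible = [(s, h) for s, h in scored if s < upper]
    eligible.sort(key=lambda sh: sh[0], reverse=True)
    return [h for _, h in eligible[:k]]
```

With a near-duplicate of the correct holding in the pool, the duplicate is excluded by the upper threshold, while lexically similar but distinct holdings are kept as the hard incorrect answers.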
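To make concrete why keyword cues carry so much signal on the Overruling task, here is a trivially simple heuristic over the explicit performative verbs Dunn identifies ("overrule", "disapprove", "explicitly reject"). This is a floor for illustration only, not a baseline from the paper; real models learn far richer patterns.

```python
import re

# Cue list drawn from the performative keywords discussed above;
# stemmed alternation so "overruled"/"disapproving" etc. also match.
OVERRULING_CUES = re.compile(r"\b(overrul|disapprov|explicitly reject)", re.IGNORECASE)

def is_overruling(sentence: str) -> int:
    """Label 1 if the sentence contains an explicit overruling cue."""
    return 1 if OVERRULING_CUES.search(sentence) else 0
```

Applied to the two Table 3 passages, the heuristic labels the "disapprove the decisions" sentence 1 and the syringe sentence 0.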
Devlin et al. [12]. One variant is initialized with the BERT base model and pretrained for an additional 1M steps using the case law corpus and the same vocabulary as BERT (uncased). The other variant, which we refer to as Custom Legal-BERT, is pretrained from scratch for 2M steps using the case law corpus and has a custom legal domain-specific vocabulary. The vocabulary set is constructed using SentencePiece [24] on a subsample (appx. 13M) of sentences from our pretraining corpus, with the number of tokens fixed to 32,000. We pretrain both variants with sequence length 128 for 90% and sequence length 512 for 10% of the 2M steps total.

Both Legal-BERT and Custom Legal-BERT are pretrained using the masked language model (MLM) pretraining objective, with whole word masking. Whole word masking and other knowledge masking strategies, like phrase-level and entity-level masking, have been shown to yield substantial improvements on various downstream NLP tasks for English and Chinese text, by making the MLM objective more challenging and enabling the model to learn more about prior knowledge through syntactic and semantic information extracted from these linguistically-informed language units [10, 21, 43]. More recently, Kang et al. [23] posit that whole-word masking may be most suitable for domain adaptation on emrQA [33], a corpus for question answering on electronic medical records, because most words in emrQA are tokenized to sub-word WordPiece tokens [48] in base BERT due to the high frequency of unique, domain-specific medical terminologies that appear in emrQA, but are not in the base BERT vocabulary. Because the case law corpus shares this property of containing many domain-specific terms relevant to the law, which are likely tokenized into sub-words in base BERT, we chose to use whole word masking for pretraining the Legal-BERT variants on the legal domain-specific case law corpus.

The second pretraining task is next sentence prediction. Here, we use regular expressions to ensure that legal citations are included as part of a segmented sentence according to the Bluebook system of legal citation [9]. Otherwise, the model could be poorly trained on improper sentence segmentation [40].³

6 RESULTS

6.1 Base Setup

After pretraining the models as described above in Section 5, we fine-tune on the legal benchmark target tasks and evaluate the performance of each model.

6.1.1 Hyperparameter Tuning. We provide details on our hyperparameter tuning process at https://github.com/reglab/casehold.

6.1.2 Fine-tuning and Evaluation. For the BERT-based models, we use the input transformations described in Radford et al. [35] for fine-tuning BERT on classification and multiple choice tasks, which convert the inputs for the legal benchmark tasks into token sequences that can be processed by the pretrained model, followed by a linear layer and a softmax. For the CaseHOLD task, we avoid making extensive changes to the architecture used for the two classification tasks by converting inputs consisting of a prompt and five answers into five prompt-answer pairs (where the prompt and answer are separated by a delimiter token) that are each passed independently through our pretrained models followed by a linear layer; we then take a softmax over the five concatenated outputs. For Overruling and Terms of Service, we use a single NVIDIA V100 (16GB) GPU to fine-tune on each task. For CaseHOLD, we used eight NVIDIA V100 (32GB) GPUs to fine-tune on the task.

We use 10-fold cross-validation to evaluate our models on each task. We use F1 score as our performance metric for the Overruling and Terms of Service tasks and macro F1 score as our performance metric for CaseHOLD, reporting mean F1 scores over 10 folds. We report our model performance results in Table 5 and report statistical significance from (paired) t-tests with 10 folds of the test data to account for uncertainty.

From the results of the base setup, for the easiest Overruling task, the difference in F1 between BERT (double) and Legal-BERT is 0.5%, and between BERT (double) and Custom Legal-BERT it is 1.6%. Both of these differences are marginal. For the task with intermediate difficulty, Terms of Service, we find that BERT (double), with further pretraining of BERT on the general domain corpus, increases performance over base BERT by 5.1%, but the Legal-BERT variants with domain-specific pretraining do not outperform BERT (double) substantially. This is likely because Terms of Service has low domain-specificity, so pretraining on legal domain-specific text does not help the model learn information that is highly relevant to the task. We note that BERT (double), with 77.3% F1, and Custom Legal-BERT, with 78.7% F1, outperform the highest performing model from Lippi et al. [26] for the general setting of Terms of Service, by 0.4% and 1.8% respectively. For the most difficult and domain-specific task, CaseHOLD, we find that Legal-BERT and Custom Legal-BERT both substantially outperform BERT (double), with gains of 5.7% and 7.2% respectively. Custom Legal-BERT achieves the highest F1 performance for CaseHOLD, with a macro F1 of 69.5%.

We run paired t-tests to validate the statistical significance of model performance differences at a 95% confidence interval. The mean differences between F1 for paired folds of BERT (double) and base BERT are statistically significant for the Terms of Service task, with p-value < 0.001. Additionally, the mean differences between F1 for paired folds of Legal-BERT and BERT (double), with p-value < 0.001, and the mean differences between F1 for paired folds of Custom Legal-BERT and BERT (double), with p-value < 0.001, are statistically significant for the CaseHOLD task. The substantial performance gains from the Legal-BERT model variants were likely achieved because the CaseHOLD task is adequately difficult and highly domain-specific in terms of language.

6.1.3 Domain Specificity Score. Table 5 also provides a measure of domain specificity of each task, which we refer to as the domain specificity (DS) score. We define the DS score as the average difference in pretrain loss between Legal-BERT and BERT, evaluated on the downstream task of interest. For a specific example, we run prediction for the downstream task of interest on the example input using the Legal-BERT and BERT models after pretraining, but before fine-tuning, calculate loss on the task (i.e., binary cross entropy loss for Overruling and Terms of Service, categorical cross entropy loss for CaseHOLD), and take the difference between the loss of the two models. Intuitively, when the difference is large, the general corpus does not predict legal language very well. DS scores

³ Where the vagaries of legal citations create detectable errors in sentence segmentation (e.g., sentences with fewer than 3 words), we omit the sentence from the corpus.
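The segmentation concern behind footnote 3 can be illustrated with a toy splitter: splitting on every period followed by whitespace shreds reporter abbreviations and "v.", while protecting a few citation tokens first keeps the citation inside one sentence. The abbreviation list and the protect-then-split strategy below are illustrative stand-ins, not the paper's actual Bluebook-aware regular expressions.

```python
import re

SENT_BOUNDARY = re.compile(r"(?<=\.)\s+")
# Tiny illustrative subset of citation abbreviations, not the Bluebook rule set.
ABBREVS = ["v.", "N.Y.2d", "N.Y.S.2d", "N.E.2d", "F. Supp.2d", "Inc."]

def naive_split(text):
    """Break on every period + whitespace, citations included."""
    return [s for s in SENT_BOUNDARY.split(text) if s]

def citation_aware_split(text):
    """Mask citation abbreviations, split, then restore them."""
    protected = text
    for i, abbr in enumerate(ABBREVS):
        protected = protected.replace(abbr, f"@ABBR{i}@")
    restored = []
    for s in SENT_BOUNDARY.split(protected):
        if not s:
            continue
        for i, abbr in enumerate(ABBREVS):
            s = s.replace(f"@ABBR{i}@", abbr)
        restored.append(s)
    return restored
```

On a citation-laden sentence like "They rely on Oswego Laborers v. Marine Midland Bank, 85 N.Y.2d 20 ... (1996). The court agreed.", the naive splitter produces three fragments (it breaks after "v."), while the citation-aware splitter produces the intended two sentences.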
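The multiple-choice input transformation of Section 6.1.2 can be sketched as follows. Here `score_pair` is a stand-in for the pretrained encoder plus linear layer (not implemented in this sketch), and `[SEP]` stands in for whatever delimiter token the tokenizer actually uses.

```python
import math

SEP = "[SEP]"  # stand-in delimiter token

def make_pairs(prompt, answers):
    """One CaseHOLD example (prompt + candidate holdings) becomes
    one independent prompt-answer sequence per candidate."""
    return [f"{prompt} {SEP} {a}" for a in answers]

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(prompt, answers, score_pair):
    """Score each pair independently, softmax over the scores,
    and return the index of the most probable holding."""
    scores = [score_pair(pair) for pair in make_pairs(prompt, answers)]
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)
```

With any toy scorer, e.g. word overlap between prompt and answer, `predict` returns the 0-indexed label of the highest-probability holding, matching the dataset's label format.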
Table 5: Test performance, with ±1.96 × standard error, aggregated across 10 folds. Mean F1 scores are reported for Overruling
and Terms of Service. Mean macro F1 scores are reported for CaseHOLD. The best scores are in bold.
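The aggregation reported in Table 5 (mean score over 10 folds with a ±1.96 × standard-error band) and the paired t-tests of Section 6.1 reduce to a few lines. The fold scores in the example are made-up numbers for illustration, not results from the paper.

```python
import math
import statistics

def mean_ci(fold_scores):
    """Mean with a 1.96 x standard-error half-width, as in Table 5."""
    m = statistics.mean(fold_scores)
    half_width = 1.96 * statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return m, half_width

def paired_t(model_a, model_b):
    """Paired t statistic over per-fold scores of two models evaluated
    on the same folds; compare against the t distribution with n-1
    degrees of freedom for significance."""
    diffs = [a - b for a, b in zip(model_a, model_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

Pairing by fold matters: it tests the per-fold differences, so fold-to-fold variation shared by both models cancels out, which is why consistent small gains can still be highly significant.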
framing. This would reflect the first-year law student exercise of re-framing a holding to persuasively match their argument and isolate the two goals of the task.

7 DISCUSSION

Our results resolve an emerging puzzle in legal NLP: if legal language is so unique, why have we seen only marginal gains to domain pretraining in law? Our evidence suggests that these results can be explained by the fact that existing legal NLP benchmark tasks are either too easy or not domain matched to the pretraining corpus. Our paper shows the largest gains documented for any legal task from pretraining, comparable to the largest gains reported by SciBERT and BioBERT [3, 25]. Our paper also shows the highest performance documented for the general setting of the Terms of Service task [26], suggesting substantial gains from domain pretraining and tokenization.

Using a range of legal language tasks that vary in difficulty and domain-specificity, we find that BERT already achieves high performance for easy tasks, so that further domain pretraining adds little value. For the intermediate difficulty task that is not highly domain-specific, domain pretraining can help, but the gain is most substantial for highly difficult and domain-specific tasks.

These results suggest important future research directions. First, we hope that the new CaseHOLD dataset will spark interest in solving the challenging environment of legal decisions. Not only are many available benchmark datasets small or unavailable, but they may also be biased toward solvable tasks. After all, a company would not invest in the Overruling task (baseline F1 with BiLSTM of 0.91) without assurance that there are significant gains to paying attorneys to label the data. Our results show that domain pretraining may enable a much wider range of legal tasks to be solved.

Second, while the creation of large legal NLP datasets is impeded by the sheer cost of attorney labeling, CaseHOLD also illustrates an advantage of leveraging domain knowledge for the construction of legal NLP datasets. Conventional segmentation would fail to take advantage of the complex system of legal citation, but investing in such preprocessing enables better representation and extraction of

an existing large language models to a new task or developing new models, since these workflows require retraining to experiment with different model architectures and hyperparameters. DS scores provide a quick metric for future practitioners to evaluate when resource intensive model adaptation and experimentation may be warranted on other legal tasks. DS scores may also be readily extended to estimate the domain-specificity of tasks in other domains with existing pretrained models like SciBERT and BioBERT [3, 25].

In sum, we have shown that a new benchmark task, the CaseHOLD dataset, and a comprehensively pretrained Legal-BERT model illustrate the conditions for domain pretraining and suggest that language models, too, can embed what may be unique to legal language.

ACKNOWLEDGMENTS

We thank Devshi Mehrotra and Amit Seru for research assistance, Casetext for the Overruling dataset, Stanford's Institute for Human-Centered Artificial Intelligence (HAI) and Amazon Web Services (AWS) for cloud computing research credits, and Pablo Arredondo, Matthias Grabmair, Urvashi Khandelwal, Christopher Manning, and Javed Qadrud-Din for helpful comments.

REFERENCES

[1] Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios Lampos. 2016. Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective. PeerJ Computer Science 2 (2016), e93.
[2] Pablo D. Arredondo. 2017. Harvesting and Utilizing Explanatory Parentheticals. SCL Rev. 69 (2017), 659.
[3] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3615–3620.
[4] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, New York, NY, USA, 610–623.
[5] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4317–4323. https://www.aclweb.org/anthology/P19-
legal texts. 1424
[6] Ilias Chalkidis, Ion Androutsopoulos, and Achilleas Michos. 2017. Extracting
Third, our research provides guidance for researchers on when Contract Elements. In Proceedings of the 16th Edition of the International Con-
pretraining may be appropriate. Such guidance is sorely needed, ference on Articial Intelligence and Law (London, United Kingdom) (ICAIL ’17).
Association for Computing Machinery, New York, NY, USA, 19–28.
given the significant costs of language models, with one estimate [7] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras,
suggesting that full pretraining of BERT with a 15GB corpus can and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of
exceed $1M. Deciding whether to pretrain itself can hence have Law School. In Findings of the Association for Computational Linguistics: EMNLP
2020. Association for Computational Linguistics, Online, 2898–2904. https:
significant ethical, social, and environmental implications [4]. Our //www.aclweb.org/anthology/2020.findings-emnlp.261
research suggests that many easy tasks in law may not require do- [8] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, and Ion Androut-
main pretraining, but that gains are most likely when ground truth sopoulos. 2019. Neural Contract Element Extraction Revisited. Workshop on
Document Intelligence at NeurIPS 2019. https://openreview.net/forum?id=
labels are scarce and the task is sufficiently in-domain. Because B1x6fa95UH
estimates of domain-specificity across tasks using DS score match [9] Columbia Law Review Ass’n, Harvard Law Review Ass’n, and Yale Law Journal.
2015. The Bluebook: A Uniform System of Citation (21st ed.). The Columbia Law
our qualitative understanding, this heuristic can also be deployed to Review, The Harvard Law Review, The University of Pennsylvania Law Review,
determine whether pretraining is worth it. Our results suggest that and The Yale Law Journal.
for other high DS and adequately difficult legal tasks, experimen- [10] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and
Guoping Hu. 2019. Pre-Training with Whole Word Masking for Chinese BERT.
tation with custom, task relevant approaches, such as leveraging arXiv:1906.08101 [cs.CL]
corpora from task-specific domains and applying tokenization / [11] Laura C. Dabney. 2008. Citators: Past, Present, and Future. Legal Reference
sentence segmentation tailored to the characteristics of in-domain Services Quarterly 27, 2-3 (2008), 165–190.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
text, may yield substantial gains. Bender et al. [4] discuss the signif- Pre-training of Deep Bidirectional Transformers for Language Understanding. In
icant environmental costs associated in particular with transferring
167
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Zheng and Guha, et al.
Proceedings of the 2019 Conference of the North American Chapter of the Association [33] Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA:
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and A Large Corpus for Question Answering on Electronic Medical Records. In
Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, Proceedings of the 2018 Conference on Empirical Methods in Natural Language
4171–4186. https://www.aclweb.org/anthology/N19-1423 Processing. Association for Computational Linguistics, Brussels, Belgium, 2357–
[13] Pintip Hompluem Dunn. 2003. How judges overrule: Speech act theory and the 2368. https://www.aclweb.org/anthology/D18-1258
doctrine of stare decisis. Yale LJ 113 (2003), 493. [34] Marc Queudot, Éric Charton, and Marie-Jean Meurs. 2020. Improving Access to
[14] Emad Elwany, Dave Moore, and Gaurav Oberoi. 2019. BERT Goes to Law School: Justice with Legal Chatbots. Stats 3, 3 (2020), 356–375.
Quantifying the Competitive Advantage of Access to Large Legal Corpora in [35] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Im-
Contract Understanding. arXiv:1911.00473 http://arxiv.org/abs/1911.00473 proving language understanding by generative pre-training.
[15] David Freeman Engstrom and Daniel E Ho. 2020. Algorithmic accountability in [36] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.
the administrative state. Yale J. on Reg. 37 (2020), 800. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceed-
[16] David Freeman Engstrom, Daniel E. Ho, Catherine Sharkey, and Mariano- ings of the 2016 Conference on Empirical Methods in Natural Language Pro-
Florentino Cuéllar. 2020. Government by Algorithm: Artificial Intelligence in cessing. Association for Computational Linguistics, Austin, Texas, 2383–2392.
Federal Administrative Agencies. Administrative Conference of the United States, https://www.aclweb.org/anthology/D16-1264
Washington DC, United States. [37] Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. Coqa: A conversa-
[17] European Union 1993. Council Directive 93/13/EEC of 5 April 1993 on unfair terms tional question answering challenge. Transactions of the Association for Compu-
in consumer contracts. European Union. tational Linguistics 7 (2019), 249–266.
[18] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn [38] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in bertol-
Song, and Jacob Steinhardt. 2021. Aligning AI With Shared Human Values. ogy: What we know about how bert works. Transactions of the Association for
arXiv:2008.02275 [cs.CY] Computational Linguistics 8 (2021), 842–866.
[19] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn [39] Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2020.
Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Un- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual
derstanding. arXiv:2009.03300 [cs.CY] Language Models. arXiv:2012.15613 [cs.CL]
[20] Michael J. Bommarito II, Daniel Martin Katz, and Eric M. Detterman. 2018. [40] Jaromir Savelka, Vern R Walker, Matthias Grabmair, and Kevin D Ashley. 2017.
LexNLP: Natural language processing and information extraction for legal and Sentence boundary detection in adjudicatory decisions in the United States.
regulatory texts. arXiv:1806.03688 http://arxiv.org/abs/1806.03688 Traitement automatique des langues 58 (2017), 21.
[21] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and [41] Or Sharir, Barak Peleg, and Yoav Shoham. 2020. The Cost of Training NLP Models:
Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and A Concise Overview. arXiv:2004.08900 [cs.CL]
Predicting Spans. Transactions of the Association for Computational Linguistics 8 [42] Koustuv Sinha, Prasanna Parthasarathi, Joelle Pineau, and Adina Williams. 2020.
(2020), 64–77. https://www.aclweb.org/anthology/2020.tacl-1.5 Unnatural Language Inference. arXiv:2101.00010 [cs.CL]
[22] David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. [43] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin
2018. Measuring the Evolution of a Scientific Field through Citation Frames. Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced Represen-
Transactions of the Association for Computational Linguistics 6 (2018), 391–406. tation through Knowledge Integration. arXiv:1904.09223 [cs.CL]
https://www.aclweb.org/anthology/Q18-1028 [44] P.M. Tiersma. 1999. Legal Language. University of Chicago Press, Chicago,
[23] Minki Kang, Moonsu Han, and Sung Ju Hwang. 2020. Neural Mask Generator: Illinois. https://books.google.com/books?id=Sq8XXTo3A48C
Learning to Generate Adaptive Word Maskings for Language Model Adaptation. [45] George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas,
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara,
Processing (EMNLP). Association for Computational Linguistics, Online, 6102– Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopou-
6120. https://www.aclweb.org/anthology/2020.emnlp-main.493 los, Nicolas Baskiotis, Patrick Gallinari, Thierry Artiéres, Axel-Cyrille Ngonga
[24] Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language Ngomo, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder,
independent subword tokenizer and detokenizer for Neural Text Processing. Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the BIOASQ
arXiv:1808.06226 [cs.CL] large-scale biomedical semantic indexing and question answering competition.
[25] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, BMC Bioinformatics 16, 1 (April 2015), 138.
Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language [46] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel
representation model for biomedical text mining. Bioinformatics 36, 4 (2019), Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for
1234–1240. Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop
[26] Marco Lippi, Przemysław Pałka, Giuseppe Contissa, Francesca Lagioia, Hans- BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association
Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni. 2019. CLAUDETTE: for Computational Linguistics, Brussels, Belgium, 353–355. https://www.aclweb.
an automated detector of potentially unfair clauses in online terms of service. org/anthology/W18-5446
Artificial Intelligence and Law 27, 2 (2019), 117–139. [47] Jonah Wu. 2019. AI Goes to Court: The Growing Landscape of AI for Access
[27] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer to Justice. https://medium.com/legal-design-and-innovation/ai-goes-to-court-
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A the-growing-landscape-of-ai-for-access-to-justice-3f58aca4306f
Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL] [48] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi,
[28] David Mellinkoff. 2004. The language of the law. Wipf and Stock Publishers, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff
Eugene, Oregon. Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan
[29] Elizabeth Mertz. 2007. The Language of Law School: Learning to “Think Like a Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian,
Lawyer”. Oxford University Press, USA. Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick,
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s
Estimation of Word Representations in Vector Space. http://arxiv.org/abs/1301. Neural Machine Translation System: Bridging the Gap between Human and
3781 Machine Translation. arXiv:1609.08144 [cs.CL]
[31] Octavia-Maria, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, [49] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and
and Josef van Genabith. 2017. Exploring the Use of Text Classification in the Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal
Legal Domain. Proceedings of 2nd Workshop on Automated Semantic Analysis Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association
of Information in Legal Texts (ASAIL). for Computational Linguistics. Association for Computational Linguistics, Online,
[32] Adam R. Pah, David L. Schwartz, Sarath Sanga, Zachary D. Clopton, Peter DiCola, 5218–5230. https://www.aclweb.org/anthology/2020.acl-main.466
Rachel Davis Mersey, Charlotte S. Alexander, Kristian J. Hammond, and Luís [50] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and
A. Nunes Amaral. 2020. How to build a more open justice system. Science 369, Maosong Sun. 2020. JEC-QA: A Legal-Domain Question Answering Dataset. ,
6500 (2020), 134–136. 9701-9708 pages.
168
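The conclusion above notes that conventional sentence segmentation fails to respect the complex system of legal citation. As a hedged illustration of that point only (this is not the authors' pipeline; the abbreviation list, the placeholder scheme, and the example sentence are assumptions made for this sketch), compare a naive period-based splitter with one that protects common citation abbreviations:

```python
import re

# Illustrative subset of abbreviations common in US legal citations
# (an assumption for this sketch, not an exhaustive list).
LEGAL_ABBREVS = ["v.", "U.S.", "F.3d", "Cir.", "Inc.", "No."]

def naive_split(text):
    # Split after any period that is followed by whitespace.
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

def citation_aware_split(text):
    # Protect periods inside known abbreviations before splitting,
    # then restore them afterwards. (A crude substring protection;
    # a production system would need real tokenisation.)
    protected = text
    for abbr in LEGAL_ABBREVS:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    parts = [s for s in re.split(r"(?<=\.)\s+", protected) if s]
    return [p.replace("<DOT>", ".") for p in parts]

text = "See Brown v. Board of Education, 347 U.S. 483. The ruling was unanimous."
print(naive_split(text))          # breaks inside the citation (4 fragments)
print(citation_aware_split(text)) # keeps the citation whole (2 sentences)
```

The naive splitter fragments the citation at "v." and "U.S.", while the citation-aware variant yields two well-formed sentences, illustrating why investing in such preprocessing can improve the representation and extraction of legal texts.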
Part II
Short Papers
Practical Tools from Formal Models: The ECHR as a Case Study
Katie Atkinson, Joe Collenette, Trevor Bench-Capon
{katie,j.m.collenette,tbc}@liverpool.ac.uk
Department of Computer Science, University of Liverpool, Liverpool, UK

Kanstantsin Dzehtsiarou
dzeh@liverpool.ac.uk
Department of Law, University of Liverpool, Liverpool, UK
ABSTRACT
One approach to building legal support systems is to run an executable model of the relevant knowledge through an interface designed to collect information from the user and provide explanations. The usability of such systems depends on the terms used in the law being represented: often only users familiar with the practice and application of the law will be able to provide the required information. Earlier work applied this approach to the European Convention on Human Rights (ECHR). Although the performance of the tool built for that domain was good, the questions posed to the user demanded a good deal of knowledge and experience of the ECHR. Here we use the knowledge of an expert with extensive experience of the ECHR to extend the model, through intermediate levels, to identify questions that are appropriate to the target user. We have undertaken a pilot evaluation in which a small number of lawyers have used the prototype program and provided very positive feedback, showing that they are receptive to AI solutions that give effective, explainable decision support.

CCS CONCEPTS
• Applied computing → Law.

KEYWORDS
ADFs, case-based reasoning, explainability, ECHR

ACM Reference Format:
Katie Atkinson, Joe Collenette, Trevor Bench-Capon, and Kanstantsin Dzehtsiarou. 2021. Practical Tools from Formal Models: The ECHR as a Case Study. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466095

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466095

1 INTRODUCTION
Since the pioneering formalisation of the British Nationality Act (BNA) [16], one popular approach to building legal support systems has been to formalise the law and then execute the formalisation through a program able to gather facts about particular cases from the user and provide explanations. In [16] the law in question was statute law, but the approach can also be applied to case law, e.g. [1]. Similar structures are found in hybrid systems based on reasoning with cases, such as CABARET [17], IBP [7] and VJAP [12], which also represent the law as high level rules but, instead of directly querying the user, resolve the leaf predicates according to factors found in precedent cases. In all these approaches, we begin with a high level question, such as ‘is Peter a British Citizen?’, and then unfold it through a series of intermediate concepts until a base level is reached. At this point, hybrid systems will launch their case based reasoning mechanism, but if we are using a rule based approach which expects the user to answer these base level questions, the terms must be readily understood by the user. For a successful system, two kinds of knowledge are required: knowledge as to the questions that should be asked, and knowledge as to how these questions should be answered. A formalisation of the legislation provides the first but not the second, and so if the law does not use terms familiar to the users, they will be unable to answer the questions. This is also true of hybrid systems: the users will need to ascribe the factors, which is itself a substantial and not entirely straightforward task. Moreover, the knowledge of how to answer the questions or ascribe the factors may vary according to the background and skills of the user answering them. This suggests that the formalisation of the law will need to be supplemented by further sets of intermediate concepts, one for each type of user, which will unfold into questions appropriate to those users.

With the BNA, the formalisation of the law resulted in questions such as ‘where was Peter born?’ and ‘who is the father of Peter?’, which appeared to be readily answerable by applicants, or by adjudicators on the basis of an application form. Much of the appeal of this system came from the fact that the questions that resulted directly from executing the formalisation were immediately and intuitively understandable in both these situations. However, those who followed in their footsteps found that this was not always the case. Problems became apparent in a follow-up exercise to the BNA [5]. Two kinds of problems arose.

First, many of the questions were difficult for the lay user to answer, or even unintelligible, such as ‘did Peter pay the qualifying level of contributions in the relevant tax year?’. Such questions might, however, be answerable by an adjudicator, especially if there was access to the contributions record database. With other questions it is the adjudicator who has the problem: lay users will know their birth details, but these may require investigation and verification by the adjudicator, who will therefore need to know what kind of evidence is required. So the questions may need further refinement
to enable the target users to answer them, and may need different refinements for different target users. For a lay person, the contributions condition will need to be expressed in terms of working in a particular year, while for the adjudicator, the date of birth question will need to ask about acceptable forms of proof, such as birth certificates.

The second problem is that the question may have some particular legal interpretation. For example, UK Housing Benefit at one time had an addition if the house was “hard to heat” [4]. While a lay user may well have an opinion on this, in fact there was a very technical definition, spelled out in secondary legislation and case law, in terms of factors such as the size and type of house, the number of rooms and the age of occupants. Clearly this information is needed both to enable lay users to answer the question, and to support the decision making of adjudicators, who may be unfamiliar with the relevant case law.

In [9] an executable model of Article 6 of the European Convention on Human Rights (ECHR) was presented. However, the questions that resulted from that formal model are unlikely to be answered with confidence by a potential applicant, and some might even test an experienced lawyer. Moving towards appropriate questions requires knowledge of the theory and practice of the ECHR. In this paper we will discuss what is needed for that model to become usable, in particular for those processing initial applications. Section 2 will briefly describe the ECHR. Section 3 will describe the formal model of [9]. Section 4 will describe the additional analysis we have carried out to move to a usable system, and Section 5 the resulting prototype. Section 6 gives an initial evaluation from members of the intended user group who have used the prototype. Finally, Section 7 will offer some concluding remarks.

2 DOMAIN OVERVIEW
The European Convention on Human Rights (ECHR) is a regional human rights treaty that is now ratified by 47 European states and covers almost the whole of Europe. Since 1960, when it was established, the European Court of Human Rights (ECtHR) has delivered judgments in thousands of cases and created a significant body of legal precedents.

The ECHR has proved very popular for experimentation with machine learning techniques for legal judgment prediction tasks; for example, see [2], [14], [8], [15] and [13]. These studies all report success, with correct predictions being achieved in around 70-85% of cases. JURI Says [15] reports a success rate of 69% over the last year, although it fell to 60.9% for March 2021.¹

¹ JURI Says can be found at https://jurisays.com/ (accessed 2021/03/01).

3 ADFS FOR REASONING ABOUT ECHR CASES ON ARTICLE 6
The aim of the work described in [9] was to encapsulate Article 6 in an Abstract Dialectical Framework (ADF) [6]. A program based on the developed ADF predicted whether a particular case was admissible and, if so, whether there was a violation of Article 6. One of the main strengths of the program was that it was able to justify its reasoning in terms of the ADF nodes and acceptance conditions, similar to the how? explanation of a rule based system. The ADF was created using the ANGELIC methodology [1], designed to capture case law in a manner that supports argumentation techniques and is maintainable. The ANGELIC ADF corresponds to the factor hierarchy of traditional case based systems such as CATO [3]. In an ADF the nodes are connected into a graph structure, and there are acceptance conditions that determine the acceptability of each parent in terms of its children. The ADFs produced by ANGELIC are always trees. In ANGELIC the acceptance conditions take the form of a set of prioritised sufficient conditions, together with a default to make the conditions collectively necessary. Maintainability comes from the modular nature of this structure.

Testing on a limited set of cases achieved a 100% success rate. Although the test set was small, the benefits of the approach were clear: the ADF was able to predict Article 6 cases with a high success rate, and fully justify its reasoning for the predictions. The ADF developed in [9] included a part relating to admissibility, but the questions developed for admissibility were at a very high level. The model described in this paper, also developed using the ANGELIC methodology, focuses entirely on admissibility. This focus allows a much more detailed analysis to reduce the level of expertise needed to answer the questions with confidence. The decomposition of the high level questions obtained from the documentation used in [9] does, however, require guidance from someone expert in the relevant statute and case law. We have also replaced the command line prototype of [9] with a visual interface that uses standard interactions to make the program more accessible to non-computer specialists.

4 LEGAL FOUNDATIONS OF THE MODEL
The model described here considers whether an application is admissible or not, which is itself a substantial task. All applications submitted to the ECtHR need to be admissible in order to be considered on the merits. In other words, the Court needs to establish that the application complies with a set of formal rules before it can examine its substance [10]. Admissibility involves two types of rules. First, the ECtHR needs to establish that the application falls within its jurisdiction, to confirm that the Court can deal with the application. Secondly, the ECHR has established a set of formal rules that the application itself needs to comply with, such as that the application was first submitted at the national level and was rejected by the national judicial bodies, and that it should be submitted within 6 months after the highest judicial body rejected the same application at the national level. The application should not be abusive, anonymous or trivial. It also should not be clearly without merits or – in the ECHR terms – manifestly ill-founded. If these conditions are not satisfied, the Court declares an application inadmissible. The Court’s decision as to inadmissibility is final and cannot be appealed against.

The importance of admissibility is often underestimated. On average, about 90% of all applications submitted to the ECtHR are declared inadmissible. For instance, in 2019, 44,500 applications were submitted to the Court and, in the same year, 38,480 applications were declared inadmissible. At the same time, in 2019, the Court delivered only 2,187 meritorious judgments [11]. A large number of applications is declared inadmissible every year, so our project has potential importance both for applicants wanting to avoid inadmissibility and for the Court, for which considering inadmissible applications takes a significant proportion of its time.
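Section 3 describes ANGELIC acceptance conditions as a prioritised set of sufficient conditions over a node's children, plus a default that makes them collectively necessary. A minimal sketch of how such a node might be evaluated follows. This is a Python approximation rather than the authors' Java prototype: the class design and condition encoding are assumptions, and while the node identifiers I1F1, I1F1Q1 and I1F1Q2 appear in the paper, the question semantics shown in the comments are paraphrased from the Figure 1 example, not taken from Table 1.

```python
# Sketch of an ANGELIC-style ADF node: acceptance is determined by the
# first prioritised sufficient condition that fires over the children's
# values; if none fires, the default applies.

class Node:
    def __init__(self, name, children=(), conditions=(), default=False):
        self.name = name
        self.children = list(children)      # child Nodes (ANGELIC ADFs are trees)
        self.conditions = list(conditions)  # (predicate over child values, outcome)
        self.default = default

    def evaluate(self, answers):
        # Base level factors are resolved directly from yes/no answers.
        if not self.children:
            return answers[self.name]
        values = {c.name: c.evaluate(answers) for c in self.children}
        # First sufficient condition to fire determines the outcome.
        for predicate, outcome in self.conditions:
            if predicate(values):
                return outcome
        return self.default

# Hypothetical fragment: all necessary signatures (I1F1) are present if
# the applicant signed, or a legal representative signed on their behalf.
i1f1 = Node(
    "I1F1",
    children=[Node("I1F1Q1"), Node("I1F1Q2")],
    conditions=[
        (lambda v: v["I1F1Q1"], True),  # assumed: applicant signed the form
        (lambda v: v["I1F1Q2"], True),  # assumed: representative signed
    ],
    default=False,  # otherwise: not all signatures have been provided
)

print(i1f1.evaluate({"I1F1Q1": False, "I1F1Q2": True}))   # prints: True
print(i1f1.evaluate({"I1F1Q1": False, "I1F1Q2": False}))  # prints: False
```

Because conditions are checked in priority order with a default, each node stays modular: changing one node's condition list does not affect the rest of the tree, which is the maintainability property the ANGELIC description emphasises.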
5 PROTOTYPE IMPLEMENTATION
The current prototype consists of two parts; the ADF model, exe-
Figure 1: The results screen of the JAVA prototype. The user
cuted by a JAVA program, and the JAVA front-end. The ADF model
is presented with a high level result and the reasoning be-
makes the predictions from a set of answers to the questions gath-
hind the decision is shown in the text below.
ered from the user through the front-end, which then presents an
explanation of the prediction to the user. The aim of the application
is to allow people with minimal legal training to quickly identify
whether a case that is being submitted to the ECtHR is likely to be (1989). Identifying such questions requires a good knowledge of
admissible. The current ADF is more extensive in the modelling of the legal practice.
admissibility when compared to the previous ADF developed in [9]. When the JAVA program loads, it parses a text representation of
There are 61 questions which when answered allow the 26 factors the ADF, similar to Table 1. This supports maintenance by enabling
to resolve to a prediction. In the previous ADF of [9] there were any changes to the ADF to be automatically reflected in the program.
only 5 factors relating to admissibility and only 7 questions needed The program asks the user a series of yes/no questions and once all
to be answered. the relevant questions have been answered the results are shown
When developing the ADF we consulted a legal expert in order to the user. The results screen also displays the reasoning as to
to decompose the high level nodes used in [9] and so capture the why that decision was made. Figure 1 shows the results for an
legal knowledge required to ensure that the list of questions is both example where the program informs the user that a signature is
as complete as possible, and appropriate to the target users. Table 1 required for the form and explains that as the applicant does not
shows the text and the acceptance conditions associated with the have legal representation and they have not signed the application
nodes. The root of the ADF is “V1" which indicates whether sub- form, not all signatures have been provided, which in turn means
mitting the application to the ECHR is recommended. The program that the application does not comply with rule 47 of the rules of
will recommend submission if both issues, nodes I1 and I2, are ac- the court and therefore the program recommends not to submit the
cepted. That is, the application is admissible (I2) and the application application.
complies with rule 47 of the court (I1). In turn, node I1 is accepted if the base level factors I1Q1 and I1Q2 are accepted, and the abstract factors representing that all necessary signatures (I1F1) and all documentation (I1F2) have been accepted. Abstract Factor I1F1 has a number of different base level factors, offering different possibilities for I1F1 to be accepted. These represent the different signatures that are needed depending on different situations.

The questions used as part of the ADF are not only based on the requirements of the ECHR but also take into account its case law. The case law makes the questions more nuanced and integrates those aspects of admissibility that do not obviously flow from the text of the Convention. In practice it is almost impossible to submit a successful application without familiarising oneself with the case law, which can be quite broad and diverse, or consulting with a professional lawyer specialising in the law of the ECtHR. This model presents the rules enshrined in the case law of the Court in a simplified form.

In Table 1, I2F2Q4 asks "Is the applicant a potential victim of a violation?", which affects whether the applicant has victim status. It is not obvious to a lay person that potential victims can apply to the Court, in certain limited circumstances. The scope of these circumstances was made clear in the case law: the question has arisen in the context of extradition cases, where the question was whether there was a risk of human rights violations in the receiving state if extradition were permitted, e.g. Soering v. the United Kingdom.

To give a better user experience than the previous work [9], only questions that are needed to generate a recommendation are asked: if a node can be resolved, the program moves on to the next. Different paths lead to different questions, e.g. the questions needed if the application is submitted by a company, rather than as an individual, are notably different.

Using the example in Figure 1 and referring to Table 1, we will show how the explanation is generated. The last question to be answered is I1F1Q2. We start by printing all the base level factors that have the same parent as the last base level factor, as long as the user has answered the corresponding question. The parent of I1F1Q2 is I1F1, and the only other base level factor that has an answer is I1F1Q1. This gives us the first two lines of explanation. Next we print the abstract factors and issues in a hierarchical manner, adding the word "So" before each explanation. Finally we print the root and add the word "Therefore" at the start of the root's explanation.

This prototype has a number of benefits when compared to the system in [9]. Despite the two systems tackling two different aspects of legal reasoning within the ECHR domain, they are comparable as both systems use an underlying ADF that is presented to a user. The major benefit of the current approach is the ease of use when compared to the previous work. We have now incorporated both knowledge of the law, and how that law is applied in practice, which allows expression in terms of questions which the users can
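The bottom-up explanation procedure described above can be sketched as follows. The ADF fragment, node names, and explanation texts here are illustrative stand-ins chosen for the sketch, not the authors' actual node set or implementation.

```python
# Illustrative sketch of the explanation procedure: the ADF fragment
# and its texts are hypothetical, not the paper's actual model.

# Each node maps to (parent, explanation text); None marks the root.
nodes = {
    "ADM":    (None,   "the application is admissible"),
    "I1":     ("ADM",  "the application complies with rule 47"),
    "I1F1":   ("I1",   "all necessary signatures are present"),
    "I1F1Q1": ("I1F1", "the applicant has signed the form"),
    "I1F1Q2": ("I1F1", "the representative has signed the form"),
}

def explain(last_factor, answered):
    """Build the explanation from the last answered base level factor,
    prefixing abstract factors with 'So' and the root with 'Therefore'."""
    lines = []
    parent = nodes[last_factor][0]
    # 1. Base level factors sharing the last factor's parent, provided
    #    the user answered the corresponding question.
    for name, (p, text) in nodes.items():
        if p == parent and name in answered:
            lines.append(text)
    # 2. Walk up through abstract factors and issues, adding "So".
    node = parent
    while nodes[node][0] is not None:
        lines.append("So " + nodes[node][1])
        node = nodes[node][0]
    # 3. The root gets "Therefore".
    lines.append("Therefore " + nodes[node][1])
    return lines

for line in explain("I1F1Q2", {"I1F1Q1", "I1F1Q2"}):
    print(line)
```

Run on the hypothetical fragment, this prints the two sibling base level factors first, then the abstract factor and issue prefixed with "So", and the root prefixed with "Therefore", mirroring the order described in the text.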
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Katie Atkinson, Joe Collenette, Trevor Bench-Capon, and Kanstantsin Dzehtsiarou
Table 1: Nodes of the ADF that are referenced, along with their acceptance conditions.

the person bringing the case the victim". To answer the question effectively the user needs to understand what constitutes a person, a case, and a victim. The current ADF has decomposed the base level factor into abstract factors such as I2F1, which describes what constitutes a valid petitioner, and I2F2, which describes victim status (see Table 1 for questions associated with these abstract factors). The questions associated with these abstract factors require less specialist knowledge. Improvements have been made to the method of interacting with the application, which is more in line with what non-specialist computer users would expect from an application. The previous work used a text-only, command line interface [9]. The new application uses a visual, mouse-driven interface.

Figure 2: Graph showing number of feedback responses to each question, as positive or negative feedback
6 PILOT STUDY
Before embarking upon a full evaluation, we have conducted a pilot study in which the prototype has been tested by a sample of our target audience, which is a small group of lawyers who work within the ECHR. Three independent lawyers (not, of course, including our domain expert) tested the prototype and completed a questionnaire that covered five different aspects of the prototype.

The following is the list of questions that the users of the prototype were presented with, along with the question category in brackets. Each question had four possible responses, for different levels of agreement:

(1) (Functionality) Does the program have a reasonable response time?
(2) (Functionality) Did the program run to completion without any interruptions?
(3) (Usability) How easy was the program to use?
(4) (Usability) How intuitive was the program to start using?
(5) (Explainability) How effective was the explanation given for describing the program's decisions?
(6) (Explainability) How easy was the information to parse?
(7) (Usefulness) How useful would you find this program for assisting you in your work?
Practical Tools from Formal Models: The ECHR as a Case Study ICAIL’21, June 21–25, 2021, São Paulo, Brazil
(8) (Usefulness) Generally how useful would additional technology be for assisting with legal work?
(9) (Questions about Questions) How clear were the questions that you answered within the program? Additionally the users were able to say which questions were unclear.
(10) (Questions about Questions) How much time would you save if you used a fully functional program for your work on deciding on the admissibility of cases?
(11) (Questions about Questions) Does the program reflect how you decide on the admissibility of cases that you process?

In Figure 2 the responses to the questionnaire have been condensed into positive or negative responses, where the top two answers to a question are positive and the bottom two are negative. Though the results of the questionnaire come from a very small sample, with only three lawyers completing the survey, we can conclude that the program developed worked well and was functional, as all the responses received on functionality (Q1, Q2) and usability (Q3, Q4) were positive. Another positive outcome is that two of the three ECHR lawyers responded that they trusted that the justifications for the decisions made were sensible and understandable (Q5), and all three respondents agreed that the information was easy to parse (Q6). All respondents saw the usefulness of our prototype (Q7), with two respondents stating they would use the program as it currently stands, and the other affirming the usefulness but saying that some (as opposed to many) changes are needed. Again all the respondents agreed that technology has a role to play in the legal domain (Q8): one respondent said technology is needed rapidly, while two cautioned that careful development will be needed. The positive responses to Q9 and Q10 were particularly pleasing, since these questions directly concerned the central aims of this exercise: the users all agreed that the questions were suitable for them and that the program would save them time when assessing admissibility. While the majority of the feedback has been positive, it has also highlighted the need for domain experts to be a part of the development process (Q11): although two respondents felt the program reflected all or part of their own process of dealing with admissibility, one felt that only some aspects had been covered. Overall the response to the program is very positive and indicates that it is a sound basis for further development of our legal decision support tools, giving potential for collaborations between computer scientists and lawyers. Encouraged by these results, we will now extend the study to a larger group of potential users, to ensure that this initial evaluation is reflected in the law community.

7 SUMMARY AND NEXT STEPS
Our aim in this paper has been to conduct a deep dive into the legal analysis required to provide a robust, executable model of Article 6 of the ECHR. We focused on the theory and practice of a particular issue within the ECHR, namely admissibility of cases. In consultation with our legal expert, we defined an ADF that captures the domain knowledge relevant for the issue of admissibility and transformed our model into an implemented tool that is able to ask questions appropriate to the target user. This is a necessary step to enable academic exercises on legal case-based reasoning to be transformed into usable tools. Our prototype tool implements the back-end reasoning of the underpinning ADF and provides us with solid grounding on the road to development of a tool for use in practice.

The prototype tool has been tested by a small group of the target audience of lawyers. Their response has been very positive, highlighting that lawyers are receptive to tools that are carefully designed and targeted at problems they encounter with their work, all of which paves the way for a fuller evaluation.

The continuation of the work will focus on expanding on the use of the prototype. We are currently organising, with the assistance of our ECHR expert, a field evaluation of our tool by users who work within the ECtHR. This will provide a greater range of feedback in order to adjust our tool as needed to best fit the audience's needs. Once we have a satisfactory tool for use by those assessing applications – which would provide a way of addressing the current significant backlog of unprocessed cases in the ECtHR – we will develop a set of questions for use by applicants themselves. Such a facility should reduce the number of inadmissible applications by enabling applicants to gain a better understanding of what is required to make an admissible application. In this way, access to the ECtHR could be improved, both by assisting applicants and by speeding up the decision process for cases submitted.

REFERENCES
[1] Latifa Al-Abdulkarim, Katie Atkinson, and Trevor Bench-Capon. 2016. A methodology for designing systems to reason with legal cases using ADFs. AI and Law 24, 1 (2016), 1–49.
[2] Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios Lampos. 2016. Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective. PeerJ Computer Science 2 (2016), e93.
[3] Vincent Aleven. 1997. Teaching case-based argumentation through a model and examples. Ph.D. thesis. University of Pittsburgh.
[4] Trevor Bench-Capon. 1991. Practical legal expert systems: the relation between a formalisation of legislation and expert knowledge. In Law, Computer Science and Artificial Intelligence, M Bennun and A Narayanan (Eds.). Ablex, 191–201.
[5] Trevor Bench-Capon, Gwen Robinson, Tom Routen, and Marek Sergot. 1987. Logic programming for large scale applications in law: A formalisation of supplementary benefit legislation. In Proceedings of the 1st ICAIL. 190–198.
[6] Gerhard Brewka and Stefan Woltran. 2010. Abstract dialectical frameworks. In Twelfth International Conference on the Principles of Knowledge Representation and Reasoning. 102–111.
[7] Stefanie Brüninghaus and Kevin D Ashley. 2003. Predicting outcomes of case based legal arguments. In Proceedings of the 9th ICAIL. ACM, 233–242.
[8] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural legal judgment prediction in English. arXiv preprint arXiv:1906.02059 (2019).
[9] Joe Collenette, Katie Atkinson, and Trevor Bench-Capon. 2020. An explainable approach to deducing outcomes in European Court of Human Rights cases using ADFs. In Proceedings of COMMA 2020. IOS Press, 21–32.
[10] Fiona De Londras and Kanstantsin Dzehtsiarou. 2018. Great Debates on the European Convention on Human Rights. Macmillan International Higher Education.
[11] European Court of Human Rights. 2019. Analysis of statistics. https://www.echr.coe.int/Documents/Stats_analysis_2019_ENG.pdf.
[12] Matthias Grabmair. 2017. Predicting trade secret case outcomes using argument schemes and learned quantitative value effect tradeoffs. In Proceedings of the 16th ICAIL. 89–98.
[13] Arshdeep Kaur and Bojan Bozic. 2019. Convolutional Neural Network-based Automatic Prediction of Judgments of the European Court of Human Rights. In 27th AIAI Irish Conference on AI and Cognitive Science. CEUR 2563, 458–469.
[14] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2019. Using machine learning to predict decisions of the European Court of Human Rights. AI and Law (2019), 1–30.
[15] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2020. JURI SAYS: An Automatic Judgement Prediction System for the European Court of Human Rights. In Proceedings of JURIX 2020. 277–280.
[16] Marek J. Sergot, Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek, Peter Hammond, and H Terese Cory. 1986. The British Nationality Act as a logic program. Commun. ACM 29, 5 (1986), 370–386.
[17] David Skalak and Edwina L Rissland. 1992. Arguments and cases: An inevitable intertwining. AI and Law 1, 1 (1992), 3–44.
On the relevance of algorithmic decision predictors for judicial
decision making
Floris Bex Henry Prakken
Utrecht University, The Netherlands Utrecht University, The Netherlands
Tilburg University, The Netherlands University of Groningen, The Netherlands
f.j.bex@uu.nl h.prakken@uu.nl
ABSTRACT
In this article, we discuss case decision predictors, algorithms which, given some features of a legal case, predict the outcome of the case (i.e. the decision of the judge). We discuss whether, and if so how, such prediction algorithms can be used to support judges in their decision making process. We conclude that case decision predictors can only be useful in individual cases if they can give legal justifications for their predictions, and that only these legal justifications are what should matter for a judge.

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Natural language processing; Machine learning.

KEYWORDS
legal prediction, legal decision making, application of algorithms

ACM Reference Format:
Floris Bex and Henry Prakken. 2021. On the relevance of algorithmic decision predictors for judicial decision making. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466069

ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06 . . . $15.00
https://doi.org/10.1145/3462757.3466069

1 INTRODUCTION
The prediction of the decision of legal cases by means of machine-learning algorithms has become a hot topic [1, 3, 4, 12, 16]. Such algorithmic predictors can have various uses in the law. In this paper we discuss their application to support judges in individual cases, focusing on algorithmic decision predictors: algorithms that predict the final decision of a legal case, a decision that would otherwise be made by the judge(s) (such as guilty/not guilty, rule for plaintiff/defendant). Algorithmic decision predictors are sometimes claimed to improve the predictability and consistency of judicial decision making, which is demanded by the principle of equality (cf. [10]). According to these claims, judges can use decision predictors in order to come to more consistent, more informed and less biased judgments [4, 8, 17]. Others, however, fear that when judges' decisions are informed by algorithmic case predictors, people will not be judged any more on the legal merits of their individual case but on the basis of general statistics [19]. This is related to O'Neill's [18] criticism of 'bucketing', the practice of basing a decision about an individual (e.g., about granting the person a loan) on the fact that the individual is a member of a particular class of which a statistical frequency is known, instead of on the particular situation of that individual. O'Neill [18, pp. 145–6] argues that, although this strategy might optimise the decision maker's profit in the long run, it may lead to unjust decisions in individual cases.

To be able to evaluate this debate it is necessary to have a clear picture of what information a prediction of a decision by an algorithm in a particular case gives to the judge deciding the case. One answer is given in [4]: "an AI system can be trained to accurately forecast based on past behaviour what a user's decision would be in a situation absent lapses in rationality." So if an algorithm performs well on a test set and if it predicts a particular decision in a new case, then an arbitrary rationally-thinking judge would, if assigned to the case, take the predicted decision. Of course, algorithms are rarely 100% accurate, so we look at the probability that an arbitrary competent judge assigned to the case would take a predicted decision. We want to investigate to what extent an algorithmic case prediction can yield such a decision probability: how, and under which assumptions, does a prediction in a particular case combined with information about an algorithm's performance on a test set yield a decision probability for a new case?

This last question immediately gives rise to a new question: why would judges be interested in probabilities at all when deciding a case? After all, we expect judges not to give probabilistic reasons for their decisions (except perhaps on matters of fact) but legal reasons. Still, judges have always looked at what their colleagues decide in similar cases and there are good reasons for doing so, such as improving the consistency of intra-judicial decision making [10, par. 8]. Underlying this is the assumption that if the great majority of their colleagues would take the same decision, then it presumably is the right decision. Of course, this assumption is at best defeasible, and this leads to a second idea, namely, that if an algorithmic decision predictor performs well in the test phase, then its predictions yield the 'normal' decision of the case, so that a judge could only deviate from a prediction if there are special circumstances in the case. We also want to investigate to what extent such thinking is justified.

To address these questions, we first in Section 2 give a brief overview of the main types of algorithmic case predictors for legal cases. We then discuss in Section 3 the various senses in which probabilities can be derived given an algorithm and its evaluation with a test set. The heart of our paper is Section 4, in which we discuss to which extent the probabilities derived from an algorithm
and its evaluation can be applied to a new case that is to be decided in court. Our main conclusion will be that in practice such an application is almost never warranted. We then in Section 5 discuss what this means for the hope that the use of algorithmic case predictors by judges in individual cases will improve the consistency and predictability of judicial decision making.

2 ALGORITHMIC DECISION PREDICTORS
Algorithmic decision predictors come in, roughly, three types: predictors on the basis of legally relevant factors, predictors on the basis of features unrelated to the merits of a case, and predictors on the basis of the textual description of a case (for a recent overview see [3]). We focus on supervised classification algorithms that predict a categorical outcome – one of multiple possible decisions, such as affirm/reverse, guilty/not guilty – and not on algorithms that provide a continuous output – such as a regression algorithm that predicts the length of a sentence or the amount of damages to be paid. Furthermore, note that we only focus on predictors that, given some features of a case, predict the final decision of the case, and that we do not include, for example, algorithms for estimating recidivism risk, as these do not provide a final case decision.

Predicting on the basis of legally relevant factors. One approach predicts decisions on the basis of legally relevant factors in a case, by using either machine-learning techniques or a symbolic model of legal reasoning.¹ This approach describes the facts of a case at a higher level of abstraction than the concrete facts. The factors are assumed to be legally relevant for the case decision, so they can be used for generating informative explanations of a prediction.

The first studies into prediction on the basis of factors applied general machine-learning techniques to encodings of cases in terms of legally relevant factors. An early AI & Law example is Mackaay & Robillard [14], who studied the prediction of a type of Canadian tax case with the nearest-neighbor rule. In AI & Law, various factor-based models for case-based reasoning have been used for generating knowledge-based case decision predictions without the use of machine learning techniques. Examples are the studies of Ashley and his PhD students on the case law concerning misuse of trade secrets in American law [2, 7]. Accuracy levels were obtained of up to 88% [2] and 92% [7]. An advantage of this approach is that the arguments generated about the predicted decision can be used as explanations of the prediction based on legal knowledge and in a form not unlike the arguments of human judges or lawyers.

Predicting on the basis of case metadata. Several authors have used supervised machine learning based on case features that are not related to the merits of the case. An example is the algorithm that predicts decisions of the American Supreme Court on the basis of structured metadata such as the kind of case, the date at which it was decided and which lower court decided the original case [12]. This algorithm, which correctly predicted 70% of the decisions, cannot explain the predicted decisions in a legally meaningful way, since the features on the basis of which it makes its predictions are 'extra-legal', that is, they are not related to the merits of the case.

Predicting on the basis of the textual description of a case. Other algorithms predict decisions based on the text of case law, where statistical correlations are identified between, for example, word combinations in the text and the case decision. Examples are algorithms that predict whether the European Court of Human Rights (ECHR) will, for a specific article from the Convention with the same name, decide whether that article was violated, on the basis of part of the text of the decision by the Court [1, 16] or the facts of the case as communicated to the parties [15]. The performance of these different algorithms is largely comparable, with accuracy and F-measures ranging between 75% and 80%. Although it would seem that this kind of algorithm looks at the legal aspects of the case (procedural history, facts), the identified statistical correlations do not say anything about the legally relevant reasons for the decision of a case. Therefore these algorithms can also not explain their predicted decisions in a legally meaningful way.

3 FROM ALGORITHM PERFORMANCE TO PROBABILITIES
Recall that we want to investigate to which extent the performance of an algorithm on a test set justifies the idea that an arbitrary competent judge assigned to a case will likely take the predicted decision. We call the probability at stake here the decision probability: the probability that an arbitrary competent and rational judge assigned to a particular case will take decision X in that case, given that the algorithm predicts "X", that is, that the case will receive decision X. In formulas this is Pr(X | "X"), where "X" stands for the algorithm predicting decision X. The precise way in which this decision probability can be determined is for present purposes irrelevant, but the idea is that this probability can somehow be derived from the algorithm's performance on the test set. One candidate method is using Precision, the percentage of predictions "X" on the test set that are correct (i.e. the true positives divided by the total number of positively predicted cases). Interpreted as a frequency-type probability, the precision is Pr(X | "X"), which looks like the decision probability we are after. However, we do not commit to exactly this way of determining the decision probability – for present purposes, all that is relevant is that this decision probability will be defined in terms of an algorithm's application to a test set, and the crucial thing to note is that this makes the step to a probability of the same form for a new case that is not in the test set non-trivial.

¹ 'Factors' are here not just CATO-style boolean factors but any abstract fact pattern that can have two or more values.
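The use of precision as a candidate frequency-type reading of Pr(X | "X") can be made concrete with a small calculation. The confusion-matrix counts below are invented for illustration and are not taken from any of the predictors discussed in the text.

```python
# Illustrative only: invented test-set counts.
# "X" = the algorithm predicts decision X; X = the case had decision X.

tp = 80  # predicted "X" and the case had decision X (true positives)
fp = 20  # predicted "X" but the case had a different decision
fn = 10  # did not predict "X" but the case had decision X

# Precision: the fraction of "X" predictions on the test set that were
# correct. Read as a frequency-type probability this is Pr(X | "X").
precision = tp / (tp + fp)
print(precision)  # 0.8

# Note this differs from recall, the frequency-type Pr("X" | X):
recall = tp / (tp + fn)
print(round(recall, 3))  # 0.889
```

As the text stresses, the non-trivial step is not computing such a frequency but transferring it to a single new case outside the test set.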
that a single event occurs. The probabilities that can be defined in terms of an algorithm's performance on the test set are all of the frequency type, since they are based on the relative number of true/false positives/negatives. However, what we want is a belief-type probability, namely, the probability that a given new case will be decided as predicted by the algorithm.

So what we are interested in is what information a prediction of a decision gives to a judge in a particular case that the judge has to decide. The italicised words are crucial, since when a probability is interpreted as a frequency (or in Dawid's [9] terms as a statistical probability), it does not by itself say anything about a particular case. As is well known (e.g. [11, p. 137]), there is a logical gap between frequencies and an individual probability: turning a frequency-type probability into a probability about a particular case is a decision, which has to be justified. Now how can this decision be justified? It turns out that this requires a number of assumptions.

4.1 From the test set to the set of future cases
Clearly, the move from the past to the future is only justified if the set of future cases has the same proportions as the test set. However, this is not guaranteed (see also e.g. [5, 6]). First, the decisions of judges can change in that they start deciding on different grounds or weighing reasons in different ways than they used to do. This can happen, for instance, when moral or political opinions in society change, or because different judges with different legal opinions are assigned to the same type of case. Also, the distribution of types of cases can change because of changes in the world. Moreover, the algorithm could be overfitted on inessential features of the training data (a well-known problem in statistics and machine learning). So (as is well known in the literature on machine learning) in order to accept a probability based on the test set as a probability for a future set of cases, we have to make at least the following assumptions: judges continue to decide cases on the same grounds; the frequency of the various types of cases remains the same; and the algorithm made its predictions on the test set for the right reasons.

4.2 Yielding a decision probability for an individual case
This is not yet all. If the assumptions listed in the previous section are justified, then all we know is that the frequency-type probability derived from the test set can also be applied to a future set of cases (which can be open-ended). However, we are not after a probability of a kind of event (decisions predicted by this algorithm) but after the probability of a single event (this decision predicted by this algorithm). The former can be frequency-type but the latter must be belief-type. We could apply the so-called frequency principle [11, p. 137] and let the latter equate the former. However, if we do so, that is, if we base our probabilities concerning individual case decision predictions on frequencies, then we in fact make a crucial assumption. This assumption is that the only way in which cases can relevantly differ is in the properties on which the relative frequencies are defined, that is, on their real and predicted decision, just as in familiar textbook examples about urns with coloured balls the only relevant way in which the balls can differ is in their colour. While in the textbook examples this assumption is justified, for legal cases it is not. Judges who have to decide a case know much more about it than its predicted decision. And the point is that if a judge has more information than just membership of the 'reference class' of the relative frequency (for instance, '80% of the cases with predicted decision X have decision X'), then it is irrational to rely on the frequency-based probability concerning that class. Instead, one should look at the probability of the decision conditional on the more specific reference class that corresponds to one's knowledge about the case. And this, of course, amounts to thinking about the particulars of the case as judges are used to do.

Our argument is an instance of what philosophers call the problem of finding the right reference class when performing 'direct inference'. It is this reference-class argument that gives a philosophical justification for O'Neill's [18] criticism of 'bucketing' and more generally for the fear of trial by statistics. In essence it means that if nothing more is known of an algorithmic decision predictor than its performance on the test set, then its predicted decisions cannot be regarded as the decision that an arbitrary judge assigned to the case would likely take. So a judge who wants to know what his or her colleagues would likely decide in an individual case should not consult the algorithm, since it does not provide the correct decision probability for the case. This in turn means that there is no meaningful sense in which an algorithmically predicted decision is the 'normal' decision for the case, from which a judge could only deviate if he or she can point at special circumstances that make this case different than a normal case of this kind.

To explain this further, imagine that cases are distributed in such a way that many cases are 'clear', for which a decision predictor would always be correct, but many other cases are 'hard', for which a decision predictor would often be incorrect, but the algorithm cannot explain to which type a new case belongs. Then only in the clear cases can the predicted decision be said to be the 'normal' one. But how can the judge know which case is easy and which case is hard? To know this, the judge has to think about the particulars of the case as judges always do. But then the judge can just as well ignore the algorithmic prediction.

4.3 Objections to the reference class argument
In the previous subsection we concluded that in practice it will be impossible to rationally derive a case-specific decision probability from frequency-type probabilities based on experiments with a test set, so that judges who want to know what their colleagues would likely decide in the case cannot obtain an answer to their question by consulting a case decision predictor. We now discuss possible objections to our reference-class argument for this conclusion.

First, it might be argued that it is still rational to stick to a statistical decision probability for a new case, since there often are no statistics on which a more specific frequency-type probability can be based. Yet this is a reasoning fallacy: if one wants to express a decision probability in such cases, one should take the additional information into account. If this cannot be done on the basis of known frequencies, one should form a probability based on one's information about the specific case, on penalty of making the unfounded assumption that this additional information is irrelevant for the decision of the case (cf. [11, p. 137]).

A variant of this argument is the argument that a belief-type probability is always less well founded than a frequency-based
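The reference-class point can be illustrated numerically. The case counts and the "company vs. individual" feature below are invented for the sketch; the example only shows how conditioning on a more specific reference class can overturn the overall frequency.

```python
# Invented illustration of the reference-class problem discussed above.
# 100 test cases with predicted decision X: overall 80 received X, so
# the frequency-type Pr(X | "X") is 0.8. But a judge may also know the
# case concerns a company rather than an individual, and the frequency
# within that subclass may differ sharply.
cases = (
    [{"decision_X": True,  "company": False}] * 72 +
    [{"decision_X": False, "company": False}] * 8 +
    [{"decision_X": True,  "company": True}]  * 8 +
    [{"decision_X": False, "company": True}]  * 12
)

def pr_decision_given_prediction(cases, **known):
    """Relative frequency of decision X among the predicted-X cases
    that match everything the judge additionally knows."""
    ref = [c for c in cases if all(c[k] == v for k, v in known.items())]
    return sum(c["decision_X"] for c in ref) / len(ref)

print(pr_decision_given_prediction(cases))                # 0.8
print(pr_decision_given_prediction(cases, company=True))  # 0.4
```

Relying on the coarse 0.8 while knowing the case falls in the company subclass, where the frequency is 0.4, is exactly the irrational reliance on the wrong reference class described in the text.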
probability, so that a judge who wants to know what his or her colleagues would likely decide can still look at what an algorithmic decision predictor with a high precision predicts. However, this argument fails, since if one knows more about the case, then sticking to the frequency is even less well-founded. Consider the analogy of an urn with 80% red balls and 20% blue balls. If this is all one knows and one draws a ball from the top of the urn, then it is rational to assume that there is an 80% probability that it will be red. But suppose now that the person who filled the urn tells you that he first put all the red balls in and then all the blue balls, and that he did not shake or stir, and that you take the ball from the top of the urn. It is now irrational to stick to the 80% probability that the ball will be red. In fact, the inverse probability (just a 20% chance that the ball will be red) seems more rational.

One may also consider technical solutions to the reference class problem. The first is to inspect the test set to check the algorithm's performance on subsets of test cases of particular types, as an attempt to make it more likely that the class memberships considered for the algorithm's performance coincide with the knowledge the judge has about a particular case. This is a good idea in theory, but note that this approach in fact amounts to building a legal-knowledge model of the reasons relevant for a decision. Moreover, the created subclasses may be too small to yield reliable probabilities, since in the law the collections of cases usually are not very big [5]. Furthermore, an overly fine-grained feature set may lead to an overfitted model that does not easily generalise [6, p. 9].

A second technical solution is to obtain the probability for a single case directly from the algorithm, that is, the probability of a certain decision 𝑋 given the set of features 𝐹 that represents the case, or 𝑃𝑟(𝑋 |𝐹). Simpler predictive algorithms directly output such a prediction probability for a single case, and for e.g. neural networks or support vector machines (SVMs) it is possible to estimate this probability based on the output of the model (cf. [20]). It can be argued that it is exactly this probability the judge needs: the algorithm captures the behaviour of the judges in the training set cases and then directly outputs the probability that these judges would rule 𝑋 in a case like the current one with features 𝐹. However, this still does not yield the probability that an arbitrary judge would rule in that way given the case, because there need not be a relation between the correctness of predictions and the prediction probability. For example, the algorithm can predict the wrong decision with a high probability, or the algorithm may over- or underestimate individual probabilities simply because this leads to better classification performance. Furthermore, using such advanced techniques brings along even more assumptions and makes it even harder to determine what exactly the given probability means, particularly for a judge with no background in statistics or machine learning. So instead of relying on this probability, the judge would be better off thinking about the particulars of the case as normal.

A final objection is that an algorithm does not have to be perfect, as long as it performs better than human decision makers. Here sometimes the medical domain is mentioned, in which it is widely accepted that, for instance, a human oncologist has to consult a data-driven predictive algorithm for recognising skin cancer if this algorithm has been proven to perform better than humans [21]. However, this analogy breaks down, since unlike in the medical example, a legal predictive algorithm and a judge perform different tasks. In the medical example human and algorithm perform the same task, namely, recognising cancer in images of, for instance, birthmarks. Moreover, the estimates of human and algorithm are compared to the same (objective) truth: by examining the cells under a microscope it can be determined with certainty whether there is cancer. Thus a human expert and an algorithmic expert are compared in terms of the same standard. In such a case a comparison between how humans and algorithms perform is meaningful and the algorithm can be said to perform better than the human doctor, namely, by recognising malign spots missed by the human doctor. However, an algorithmic decision predictor performs a different task than the judge. A decision predictor predicts which decision a judge would take, which is a different task than the task the judge performs, which is deciding the case. Then it is meaningless to compare the performance of the algorithm and the human judge. What is more, even a correct prediction of a legally incorrect decision would count as a success for the predictive algorithm. Such situations may arise, for instance, since the test set contains legally incorrect decisions [5]. Correctly predicting a decision is not the same as predicting a correct decision.

5 CAN A DECISION PREDICTOR IMPROVE PREDICTABILITY AND CONSISTENCY?

In Section 4 we concluded that a judge who has to decide a case and who wants to know what an arbitrary rational judge assigned to the case would probably decide, cannot rely on the statistics provided by (the evaluation of) an algorithmic decision predictor. However, this leaves open the question what other benefits consulting such an algorithm can have for a judge in an individual case. This we discuss below, focusing on the alleged benefit of improving the predictability and consistency of judicial decision making.

First, we have to determine what the terms predictability and consistency mean in this context. Assuming that they mean the same, there are two interpretations. One interpretation is that the same case is decided the same by different judges. Another interpretation is that similar cases are decided in the same way (or a similar way) by the same or different judges. The second interpretation implies the first but not vice versa.

We can now ask how an algorithmic decision predictor can be used in order to improve predictability and consistency. If these terms mean that the same case is decided the same by different judges, then a sure way to guarantee predictability and consistency is to give all judges the same algorithmic decision predictor and to require that they all follow its predictions in all cases. Then different judges would, when assigned to the same case, be guaranteed to take the same decision. However, this does not make sense, since as we argued in Section 4.3 we do not know whether all decisions in the training and test set were correct. If all judges blindly follow the algorithm's prediction, then both its accuracy and precision will increase to 100%, and this would further lead to a tendency to make the predicted decision the legally correct one even if this cannot be justified.

What if predictability and consistency mean that similar cases should be decided the same? Is this improved if we require judges to consult decision predictors as a source of information? Again, for mere decision predictors we cannot know. Suppose an algorithm
On the relevance of algorithmic decision predictors for judicial decision making ICAIL’21, June 21–25, 2021, São Paulo, Brazil
with 90% precision predicts decision X for case C. Does the judge then treat like cases alike if s/he follows the prediction? We cannot know, since the prediction in itself would not give any information about similarity with other cases. In fact, it might well be that an algorithm treats cases that judges would regard as similar as different, or vice versa (likewise [6, p. 6]). For example, text-based decision predictors like the ECHR predictor could fail to recognise that linguistically small differences are legally very relevant.

However, is this different if the prediction is combined with an explanation for it? The answer is negative if the explanation cannot be given in terms of reasons related to the merit of the case. So a SCOTUS-like predictor is ruled out. But this implies that an ECHR-type predictor is also ruled out, since it cannot extract any legally relevant information from the texts to which it is applied, so there is no way to identify whether its prediction is based on legal grounds or on extraneous factors. Only decision predictors that base their predictions on legally relevant factors could possibly yield legally relevant information about similar cases to a judge.

However, we believe that only these legal explanations are what should matter for a judge, and that the judge should ignore the fact that a decision was predicted by an algorithm with good statistical performance on a test set. This use of such algorithms is not much different from how judges currently use other information sources, such as books, journals and peer consultation. Numerical performance indicators like accuracy, precision and recall can justify a degree of trust in algorithms in this general sense, but cannot indicate the quality of individual predictions or explanations. Moreover, evaluating the quality of algorithmic explanations for individual predictions requires validation studies of a kind that goes far beyond the current trend to focus on numerical performance measures like accuracy, precision and recall, and is more akin to an older AI tradition of carrying out empirical validation studies with potential or actual users of the algorithm [13].

6 CONCLUSION

In this paper we argued that a judge who has to decide a case and who wants to know what an arbitrary rational judge assigned to the case would probably decide, cannot rely on the statistics provided by (the evaluation of) an algorithmic decision predictor. The idea that an algorithmic prediction that performed well on a test set yields the 'normal' decision of the case, from which a judge could only deviate if there are special circumstances in the case, is unfounded. Moreover, we argued that relying on the predictions of such algorithms cannot improve the predictability and consistency of judicial decision making in desirable ways. We believe that mere decision predictors, that is, predictors that cannot explain their predictions in legally meaningful terms, should not be used at all by judges as decision-support tools for individual cases. Such algorithms do not give any useful information to judges and may in fact be misleading and cause intellectual laziness.

If an algorithmic decision predictor gives any useful information to judges at all, it is not in its predictions but in its explanations for these predictions. However, we noted that whether algorithmic explanations can indeed improve the quality of judicial decision making requires validation studies of a kind that goes far beyond the current trend to focus on numerical performance measures like accuracy, precision and recall, and instead involves potential or actual users of the algorithms. More generally, we believe that it is important to inform the legal world in transparent language about not only the potential benefits but also the limitations of algorithmic outcome predictors.

Finally, we would like to emphasise that our conclusions are confined to the use of algorithmic decision predictors for informing judges on what they could decide in particular cases. Other uses of such algorithms may well have benefits, but this requires another paper.

REFERENCES
[1] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, and V. Lampos. 2016. Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective. PeerJ Computer Science 2 (2016), e93.
[2] V. Aleven. 2003. Using background knowledge in case-based legal reasoning: a computational model and an intelligent learning environment. Artificial Intelligence 150 (2003), 183–237.
[3] K. Ashley. 2019. A brief history of the changing roles of case prediction in AI and law. Law in Context. A Socio-legal Journal 36, 1 (2019), 93–112.
[4] B. Babic, D. Chen, T. Evgeniou, and A.-L. Fayard. 2021. The better way to onboard AI. Harvard Business Review (2021). http://nber.org/~dlchen/papers/The_Better_Way_to_Onboard_AI.pdf To appear.
[5] T. Bench-Capon. 2020. The need for good old-fashioned AI and law. In International Trends in Legal Informatics: Festschrift for Erich Schweighofer, W. Hötzendorfer, C. Tschol, and F. Kummer (Eds.). Editions Weblaw, Bern, 23–36.
[6] R. Binns. 2020. Analogies and disanalogies between machine-driven and human-driven legal judgement. Journal of Cross-disciplinary Research in Computational Law 1, 1 (2020).
[7] S. Brueninghaus and K. Ashley. 2009. Automatically classifying case texts and predicting outcomes. Artificial Intelligence and Law 17 (2009), 125–165.
[8] I. Chalkidis, I. Androutsopoulos, and N. Aletras. 2019. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4317–4323.
[9] P. Dawid. 2005. Probability and Proof. (2005). http://tinyurl.com/tz85o Appendix to Analysis of Evidence, by T. J. Anderson, D. A. Schum and W. L. Twining.
[10] European Commission for the Efficiency of Justice (CEPEJ). 2018. European ethical Charter on the use of Artificial Intelligence in judicial systems and their environment.
[11] I. Hacking. 2001. An Introduction to Probability and Inductive Logic. Cambridge University Press, Cambridge.
[12] D. Katz, M. Bommarito, and J. Blackman. 2017. A general approach for predicting the behavior of the Supreme Court of the United States. PLoS ONE 12, 4 (2017), e0174698.
[13] R. O'Keefe. 1993. Issues in the verification and validation of knowledge-based systems. In Advances in Software Engineering and Knowledge Engineering, V. Ambriola and G. Tortora (Eds.). Series on Software Engineering and Knowledge Engineering, Vol. 2. World Scientific Publishing Co, 173–189.
[14] E. Mackaay and P. Robillard. 1974. Predicting judicial decisions: The nearest neighbor rule and visual representation of case patterns. Datenverarbeitung im Recht 3 (1974), 302–331.
[15] M. Medvedeva, X. Xu, M. Vols, and M. Wieling. 2020. JURI SAYS: An automatic judgement prediction system for the European Court of Human Rights. In Legal Knowledge and Information Systems. JURIX 2020: The Thirty-Third Annual Conference, S. Villata, J. Harašta, and P. Křemen (Eds.). IOS Press, Amsterdam etc., 277–280.
[16] M. Medvedeva, M. Vols, and M. Wieling. 2020. Using machine learning to predict decisions of the European Court of Human Rights. Artificial Intelligence and Law 28, 2 (2020), 237–266.
[17] F. Muhlenbach and I. Sayn. 2019. Artificial Intelligence and law: What do people really want? Example of a French multidisciplinary working group. In Proceedings of the 17th International Conference on Artificial Intelligence and Law. ACM Press, New York, 224–228.
[18] C. O'Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown.
[19] F. Pasquale and G. Cashwell. 2018. Prediction, persuasion, and the jurisprudence of behaviourism. University of Toronto Law Journal 68, supplement 1 (2018), 63–81.
[20] J. Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10, 3 (1999), 61–74.
[21] J. Susskind. 2018. Future Politics: Living Together in a World Transformed by Tech. Oxford University Press, Oxford.
The Burden of Persuasion in Structured Argumentation
Roberta Calegari (Alma Mater Research Institute for Human-Centered Artificial Intelligence, Bologna, Italy), roberta.calegari@unibo.it
Régis Riveret (Commonwealth Scientific and Industrial Research Organisation, Brisbane, Australia), regis.riveret@data61.csiro.au
Giovanni Sartor (Alma Mater Research Institute for Human-Centered Artificial Intelligence, Bologna, Italy), giovanni.sartor@unibo.it
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Calegari, Riveret and Sartor
been established with certainty that Hellen shot the intruder and that she did so intentionally. However, it remains uncertain whether the intruder was threatening Hellen with a gun, as claimed by the defence, or had turned back and was running away on having been discovered, as claimed by prosecution. The burden of persuasion is on prosecution, who needs to provide a convincing argument for murder. Since in this case it remains uncertain whether there was self-defence, prosecution has failed to provide such an argument. Therefore, the legally correct solution is that there should be no conviction: Hellen needs to be acquitted. □

Burden of persuasion in civil law. In civil law, burdens of production and burdens of persuasion may be allocated in different ways. The general principle is that the plaintiff only has the burden of proof (both of production and persuasion) relative to the operative facts that ground its claim, while the defendant has the burden of proof relative to those exceptions which may prevent the operative facts from delivering their usual outcomes, such as justifications with regard to torts, or incapability and vices of consent in contracts. However, derogations from this principle may be established by the law, in order to take into account various factors, such as the presumed ability of each party to provide evidence in favour of his or her claim, the need to protect weaker parties against abuses, etc. In matters of civil liability, for example, it is usually the case that the plaintiff, who asks for compensation, has to prove both that the defendant caused the harm, and that this was done intentionally or negligently. However, in certain cases, the law establishes an inversion of the burden of proof for negligence. This means that in order to obtain compensation, the plaintiff only has to prove that s/he was harmed by the defendant. This will be sufficient to win the case unless the defendant provides a convincing argument that s/he was diligent (not negligent).

Example 2.2. Let us consider a case in which a doctor caused harm to a patient by misdiagnosing his case. Assume that there is no doubt that the doctor harmed the patient: she failed to diagnose cancer, which consequently spread and became incurable. However, it is uncertain whether or not the doctor followed the guidelines governing this case: it is unclear whether she prescribed all the tests that were required by the guidelines, or whether she failed to prescribe some tests that would have enabled the cancer to be detected. Assume that, under the applicable law, doctors are liable for any harm suffered by their patients, but they can avoid liability if they show that they were diligent (not negligent) in treating the patient, i.e., that they exercised due care. Thus, rather than the patient having the burden of proving that doctors have been negligent (as should be the case according to the general principles), doctors have the burden of proving their diligence. Let us assume that the law also says that doctors are considered to be diligent if they followed the medical guidelines that govern the case. In this case, given that the doctor has the burden of persuasion on her diligence, and that she failed to provide a convincing argument for it, the legally correct solution is that she should compensate the patient. □

These two examples share a common feature. In both, uncertainty remains concerning a decisive issue. However, this uncertainty does not preclude the law from prescribing a single legal outcome in each case. This outcome can be achieved by discarding the arguments that fail to meet the required burden of persuasion, i.e., the prosecution's argument for murder and the doctor's argument for her diligence, respectively.

3 ARGUMENTATION FRAMEWORK

We introduce a structured argumentation framework relying on a lightweight ASPIC+-like argumentation system [14]. For the sake of simplicity, we assume that arguments only consist of defeasible rules, to the exclusion of strict rules and of some constituents of a knowledge base, such as axioms, ordinary premises, assumptions, and issues that can be found in the complete model [14]. A framework based on defeasible rules is sufficient for our purposes and can be extended as needed with further structures.

3.1 Argumentation graphs

Let any literal be an atomic proposition or its negation. Literals are brought into relation through defeasible rules.

Notation 3.1. For any literal 𝜙, its complement is denoted by 𝜙̄, i.e., if 𝜙 is a proposition 𝑝, then 𝜙̄ is ¬𝑝, while if 𝜙 is ¬𝑝, then 𝜙̄ is 𝑝.

Definition 3.1. A defeasible rule 𝑟 is a construct of the form:

𝜌 : 𝜙1, ..., 𝜙𝑛, ∼𝜙1′, ..., ∼𝜙𝑚′ ⇒ 𝜓

with 0 ≤ 𝑛 and 0 ≤ 𝑚, and where
• 𝜌 is the unique identifier for 𝑟, denoted by N(𝑟);
• each of 𝜙1, ..., 𝜙𝑛, 𝜙1′, ..., 𝜙𝑚′, 𝜓 is a literal;
• 𝜙1, ..., 𝜙𝑛, ∼𝜙1′, ..., ∼𝜙𝑚′ are denoted by 𝐴𝑛𝑡𝑒𝑐𝑒𝑑𝑒𝑛𝑡(𝑟) and 𝜓 by 𝐶𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡(𝑟);
• ∼𝜙 denotes the weak negation (negation by failure) of 𝜙: 𝜙 is an exception that would block the application of the rule whose antecedent includes ∼𝜙.

The identifier of a rule can be understood as the name of the rule. It can be used as a literal to specify that the named rule is applicable, and its negation correspondingly to specify that the rule is inapplicable [13].

A superiority relation ≻ is defined over rules: 𝑠 ≻ 𝑟 states that rule 𝑠 prevails over rule 𝑟.

Definition 3.2. A superiority relation ≻ over a set of rules 𝑅𝑢𝑙𝑒𝑠 is an antireflexive and antisymmetric binary relation over 𝑅𝑢𝑙𝑒𝑠, i.e., ≻ ⊆ 𝑅𝑢𝑙𝑒𝑠 × 𝑅𝑢𝑙𝑒𝑠.

A defeasible theory consists of a set of rules and a superiority relation over the rules.

Definition 3.3. A defeasible theory is a tuple ⟨𝑅𝑢𝑙𝑒𝑠, ≻⟩ where 𝑅𝑢𝑙𝑒𝑠 is a set of rules, and ≻ is a superiority relation over 𝑅𝑢𝑙𝑒𝑠.

We can construct arguments by chaining rules from the defeasible theory, as specified in the following definition; cf. [6, 13, 18].

Definition 3.4. An argument 𝐴 constructed from a defeasible theory ⟨𝑅𝑢𝑙𝑒𝑠, ≻⟩ is a finite construct of the form:

𝐴 : 𝐴1, ..., 𝐴𝑛 ⇒𝑟 𝜙

with 0 ≤ 𝑛, and where
• 𝐴 is the argument's unique identifier;
• 𝐴1, ..., 𝐴𝑛 are arguments constructed from the defeasible theory ⟨𝑅𝑢𝑙𝑒𝑠, ≻⟩;
• 𝜙 is the conclusion of the argument, denoted by Conc(𝐴);
• 𝑟 : Conc(𝐴1), ..., Conc(𝐴𝑛) ⇒ 𝜙 is the top rule of 𝐴, denoted by TopRule(𝐴).

Notation 3.2. Given an argument 𝐴 : 𝐴1, ..., 𝐴𝑛 ⇒𝑟 𝜙 as in Definition 3.4, Sub(𝐴) denotes the set of subarguments of 𝐴, i.e., Sub(𝐴) = Sub(𝐴1) ∪ ... ∪ Sub(𝐴𝑛) ∪ {𝐴}.

Different types of inconsistencies can appear between arguments, causing them to attack each other. In the ASPIC family of argumentation frameworks, attack is differentiated from defeat, with the latter taking preferences between arguments into account. Preferences over arguments are defined in the work reported here via a last-link ordering: an argument 𝐴 is preferred over another argument 𝐵 if the top rule of 𝐴 is stronger than the top rule of 𝐵.

Definition 3.5. A preference relation ≻ is a binary relation over a set of arguments A, such that an argument 𝐴 is preferred to argument 𝐵, denoted by 𝐴 ≻ 𝐵, iff TopRule(𝐴) ≻ TopRule(𝐵).

Before specifying the notion of defeat between arguments, let us first identify burdens of persuasion, i.e., those literals the proof of which requires a convincing argument. We assume that such literals are consistent: it cannot be the case that there is a burden of persuasion both on 𝜙 and 𝜙̄.

Definition 3.6 (Burdens of persuasion). Let BurdPers, the set of burdens of persuasion, be a set of literals such that if 𝜙 ∈ BurdPers then 𝜙̄ ∉ BurdPers. We say that an argument 𝐴 is burdened with persuasion if Conc(𝐴) ∈ BurdPers.

We now consider possible collisions between arguments, i.e., those cases in which an argument 𝐴 challenges an argument 𝐵: (a) by contradicting the conclusion of a subargument of 𝐵 (rebutting), or (b) by denying (the application of) the top rule of a subargument of 𝐵, or by contradicting a weak negation in the body of the top rule of a subargument of 𝐵 (undercutting). Note that our notion of rebutting corresponds to the notion of successful rebutting in [14].

Definition 3.7 (bp-rebut). Argument 𝐴 bp-rebuts argument 𝐵 iff ∃𝐵′ ∈ 𝑆𝑢𝑏(𝐵) such that Conc(𝐴) is the complement of Conc(𝐵′) and
(1) Conc(𝐴) ∉ BurdPers and 𝐵′ ⊁ 𝐴, or
(2) Conc(𝐴) ∈ BurdPers and 𝐴 ≻ 𝐵′.

According to Definition 3.7, for an unburdened argument 𝐴 to rebut 𝐵 by contradicting the latter's subargument 𝐵′, it is sufficient that 𝐵′ is not superior to 𝐴. For a burdened argument 𝐴 to rebut 𝐵 by contradicting 𝐵′, it is necessary that 𝐴 is superior to 𝐵′. Thus, burdens of persuasion supplement priorities in deciding conflicts between arguments having opposed conclusions. They dictate the outcome of such conflicts when priorities do not already determine which argument is to prevail: when two arguments contradict one another, the one burdened with persuasion fails to bp-rebut the other, while the latter will succeed in bp-rebutting the first.

Undercutting is defined as usual, including both the case in which the attacker excludes the application of the top rule of the attacked argument (by denying the rule's name) and the case in which it contradicts a weakly negated literal in the body of that rule.

Definition 3.8 (bp-undercut). Argument 𝐴 undercuts argument 𝐵 iff ∃𝐵′ ∈ 𝑆𝑢𝑏(𝐵) such that: (1) Conc(𝐴) = ¬N(𝑟) and TopRule(𝐵′) = 𝑟; or (2) Conc(𝐴) = 𝜙 and ∼𝜙 ∈ 𝐴𝑛𝑡𝑒𝑐𝑒𝑑𝑒𝑛𝑡(TopRule(𝐵′)).

The notions of bp-rebutting and undercutting can then be used to define a defeat relation comprising bp-defeats and strict bp-defeats between arguments.

Definition 3.9 (bp-defeat). A defeat relation ⇝ over a set of arguments A is a binary relation over A, i.e. ⇝ ⊆ A × A, such that ∀𝐴, 𝐵 ∈ A, 𝐴 defeats 𝐵, i.e. 𝐴 ⇝ 𝐵, iff 𝐴 bp-defeats 𝐵 or 𝐴 strictly bp-defeats 𝐵:
(1) 𝐴 bp-defeats 𝐵 iff 𝐴 bp-rebuts 𝐵 or 𝐴 undercuts 𝐵;
(2) 𝐴 strictly bp-defeats 𝐵 iff 𝐴 bp-defeats 𝐵 and 𝐵 does not bp-defeat 𝐴.

Example 3.10 (Civil law example: rules and arguments). To exemplify the notions just introduced, let us formalise Example 2.2 through a set of rules. We assume that sufficient evidence is provided to support (in the absence of evidence to the contrary) the factual claims at issue (𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠, ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠, ℎ𝑎𝑟𝑚), i.e., that the corresponding burdens of production are satisfied.

f1 : ⇒ ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠        r1 : ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 ⇒ ¬𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
f2 : ⇒ 𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠         r2 : 𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠 ⇒ 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
f3 : ⇒ ℎ𝑎𝑟𝑚               r3 : ℎ𝑎𝑟𝑚, ∼𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒 ⇒ 𝑙𝑖𝑎𝑏𝑙𝑒

We can then build the following arguments:

A1 : ⇒f1 ¬𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠      A2 : A1 ⇒r1 ¬𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
B1 : ⇒f2 𝑔𝑢𝑖𝑑𝑒𝑙𝑖𝑛𝑒𝑠       B2 : B1 ⇒r2 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒
C1 : ⇒f3 ℎ𝑎𝑟𝑚             C2 : C1 ⇒r3 𝑙𝑖𝑎𝑏𝑙𝑒

If there were no burden of persuasion, the relations would be the following: arguments A1 and B1 defeat one another, B1 defeats A2, A1 defeats B2, A2 and B2 defeat one another, and B2 strictly defeats C2. If, on the contrary, the burden of persuasion is on the doctor's diligence (𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒 ∈ BurdPers), then B2 fails to defeat A2, so that A2 strictly defeats B2. □

Given a defeasible theory, arguments built from it and defeats between these arguments are gathered into an argumentation graph.

Definition 3.11. An argumentation graph constructed from a defeasible theory 𝑇 is a tuple ⟨A, ⇝⟩, where A is the set of all arguments constructed from 𝑇, and ⇝ is a defeat relation over A.

Notation 3.3. Given an argumentation graph 𝐺 = ⟨A, ⇝⟩, we write A𝐺 and ⇝𝐺 to denote A and ⇝ respectively.

3.2 Labelling semantics

Let us now introduce the notion of {IN, OUT, UND}-labellings of an argumentation graph, so that each argument in the graph is labelled IN, OUT or UND, depending on whether it is accepted, rejected, or undecided, respectively.

Definition 3.12. A {IN, OUT, UND}-labelling 𝐿 of an argumentation graph 𝐺 is a total function 𝐿 : A𝐺 → {IN, OUT, UND}.

Notation 3.4. Given a labelling 𝐿, we write IN(𝐿) for {𝐴 | 𝐿(𝐴) = IN}, OUT(𝐿) for {𝐴 | 𝐿(𝐴) = OUT} and UND(𝐿) for {𝐴 | 𝐿(𝐴) = UND}.

There are various ways to specify {IN, OUT, UND}-labelling functions [1]. For example, they can be complete or grounded.

Definition 3.13. A complete {IN, OUT, UND}-labelling of an argumentation graph 𝐺 is a {IN, OUT, UND}-labelling such that ∀𝐴 ∈ A𝐺
(1) 𝐴 is labelled IN iff all defeaters of 𝐴 are labelled OUT, and
(2) 𝐴 is labelled OUT iff 𝐴 has a defeater labelled IN.

Definition 3.14. A grounded {IN, OUT, UND}-labelling of an argumentation graph 𝐺 is a complete {IN, OUT, UND}-labelling 𝐿 of 𝐺 such that IN(𝐿) is minimal.

Remark that any argument not labelled IN or OUT must be labelled UND, since any {IN, OUT, UND}-labelling is a total function.

While common specifications of {IN, OUT, UND}-labellings define reasonable positions [1], they do not cater for burdens of persuasion. We now specify the notion of bp-labelling, namely, a labelling which

persuasion. Let us assume that (as under Italian law) we have BurdPers = {𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒, 𝑙𝑖𝑎𝑏𝑙𝑒}, i.e., the doctor has to provide a convincing argument that she was diligent, and the patient has to provide a convincing argument for the doctor's liability. As the burdened doctor's argument for 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒 is labelled OUT, her liability can be established even though it remains uncertain whether the guidelines were followed. □
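The defeat relations stated for the civil law example can be checked mechanically. The following Python script is a minimal sketch of our own (not the authors' implementation) of Definitions 3.5 to 3.9 applied to Example 3.10: the encoding of literals (a leading '-' for classical negation), the data layout and the function names are illustrative choices, and the superiority relation is empty, as in the example.

```python
# Sketch of Definitions 3.5-3.9 instantiated on Example 3.10 (illustrative,
# not the authors' implementation).

def neg(lit):
    """Complement of a literal (Notation 3.1)."""
    return lit[1:] if lit.startswith('-') else '-' + lit

# rule name -> (antecedents, weakly negated literals, consequent)
rules = {
    'f1': ([], [], '-guidelines'), 'r1': (['-guidelines'], [], '-dueDiligence'),
    'f2': ([], [], 'guidelines'),  'r2': (['guidelines'], [], 'dueDiligence'),
    'f3': ([], [], 'harm'),        'r3': (['harm'], ['dueDiligence'], 'liable'),
}
superiority = set()  # no rule prevails over another in this example

# argument name -> (direct subarguments, top rule)
arguments = {
    'A1': ([], 'f1'), 'A2': (['A1'], 'r1'),
    'B1': ([], 'f2'), 'B2': (['B1'], 'r2'),
    'C1': ([], 'f3'), 'C2': (['C1'], 'r3'),
}

def conc(a):
    """Conclusion of an argument: the consequent of its top rule."""
    return rules[arguments[a][1]][2]

def sub(a):
    """Sub(A): all subarguments of A, including A itself (Notation 3.2)."""
    s = {a}
    for child in arguments[a][0]:
        s |= sub(child)
    return s

def preferred(a, b):
    """Last-link ordering (Definition 3.5): compare the top rules."""
    return (arguments[a][1], arguments[b][1]) in superiority

def bp_defeats(a, b, burd):
    """A bp-rebuts (Def. 3.7) or undercuts (Def. 3.8, case 2) B."""
    for b1 in sub(b):
        if conc(a) == neg(conc(b1)):  # A contradicts a subargument of B
            if conc(a) not in burd and not preferred(b1, a):
                return True           # unburdened: enough that B' is not superior
            if conc(a) in burd and preferred(a, b1):
                return True           # burdened: A must be superior to B'
        if conc(a) in rules[arguments[b1][1]][1]:
            return True               # contradicts a weak negation in the top rule
    return False

def defeats(burd):
    return {(a, b) for a in arguments for b in arguments
            if a != b and bp_defeats(a, b, burd)}

no_bp = defeats(set())
with_bp = defeats({'dueDiligence', 'liable'})  # as under Italian law

# Without burdens: A2 and B2 defeat one another; B2 strictly defeats C2.
print(('A2', 'B2') in no_bp, ('B2', 'A2') in no_bp, ('C2', 'B2') in no_bp)
# With the burden on dueDiligence: B2 no longer defeats A2,
# so A2 strictly defeats B2.
print(('B2', 'A2') in with_bp, ('A2', 'B2') in with_bp)
```

Running the script reproduces the relations stated in Example 3.10: without burdens of persuasion A2 and B2 defeat one another, whereas with 𝑑𝑢𝑒𝐷𝑖𝑙𝑖𝑔𝑒𝑛𝑐𝑒 ∈ BurdPers the burdened argument B2 fails to defeat A2, which therefore strictly defeats B2.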
of burdens of persuasion, we do not obtain the legally correct answer, namely, acquittal. To obtain acquittal we need to introduce burdens of persuasion. Prosecution has the burden of persuasion on murder: it therefore falls to the prosecution to persuade the judge that there was killing, that it was intentional, and that the killer did not act in self-defence.

[Figure 2: Grounded {IN, OUT, UND}-labelling of Example 2.1 in the absence of burdens of persuasion (top), and bp-labelling with the burden of persuasion BurdPers = {𝑚𝑢𝑟𝑑𝑒𝑟} (bottom).]

The bp-labelling is depicted in Figure 2 (bottom). The prosecution failed to meet its burden of proving murder, i.e., its argument is not convincing, since it remains undetermined whether there was self-defence. Therefore, the argument supporting murder is labelled OUT, and the presumed killer is to be acquitted. □

4 CONCLUSION

we plan to study the properties of our semantics, and its connection with other semantics for argumentation [1, 2].

ACKNOWLEDGMENTS

R. Calegari and G. Sartor have been supported by the H2020 ERC Project "CompuLaw" (G.A. 833647).

REFERENCES
[1] Pietro Baroni, Martin Caminada, and Massimiliano Giacomin. 2011. An introduction to argumentation semantics. The Knowledge Engineering Review 26, 4 (2011), 365–410. https://doi.org/10.1017/S0269888911000166
[2] Pietro Baroni and Régis Riveret. 2019. Enhancing Statement Evaluation in Argumentation via Multi-labelling Systems. Journal of Artificial Intelligence Research 66 (2019), 793–860. https://doi.org/10.1613/jair.1.11428
[3] Roberta Calegari, Andrea Omicini, and Giovanni Sartor. 2020. Argumentation and Logic Programming for Explainable and Ethical AI. In XAI.it 2020 – Italian Workshop on Explainable Artificial Intelligence 2020 (CEUR Workshop Proceedings, Vol. 2742). Sun SITE Central Europe, RWTH Aachen University, Italy, 55–68.
[4] Roberta Calegari and Giovanni Sartor. 2020. Burden of Persuasion in Argumentation. In Proceedings of the 36th International Conference on Logic Programming (Technical Communications), ICLP 2020 (Electronic Proceedings in Theoretical Computer Science, Vol. 325). OPA, Rende (CS), Italy, 151–163. https://doi.org/10.4204/EPTCS.325.21
[5] Roberta Calegari and Giovanni Sartor. 2020. A Model for the Burden of Persuasion in Argumentation. In Legal Knowledge and Information Systems. JURIX 2020: The Thirty-third Annual Conference (Frontiers in Artificial Intelligence and Applications, Vol. 334), Serena Villata, Jakub Harašta, and Petr Křemen (Eds.). IOS, Brno, Czech Republic, 13–22. https://doi.org/10.3233/FAIA200845
[6] Martin Caminada and Leila Amgoud. 2007. On the Evaluation of Argumentation Formalisms. Artificial Intelligence 171, 5–6 (2007), 286–310. https://doi.org/10.1016/j.artint.2007.02.003
[7] Arthur M. Farley and Kathleen Freeman. 1995. Burden of Proof in Legal Argumentation. In Proceedings of the 5th International Conference on Artificial Intelligence and Law. ACM, Maryland, USA, 156–164. https://doi.org/10.1145/222092.222227
[8] Thomas F. Gordon, Henry Prakken, and Douglas Walton. 2007. The Carneades model of argument and burden of proof. Artificial Intelligence 171, 10 (2007), 875–896. https://doi.org/10.1016/j.artint.2007.04.010
[9] Thomas F. Gordon and Douglas N. Walton. 2009. Proof Burdens and Standards. In Argumentation in Artificial Intelligence. Springer, Boston, MA, 239–258. https://doi.org/10.1007/978-0-387-98197-0_12
[10] Ulrike Hahn and Mike Oaksford. 2007. The Burden of Proof and Its Role in Argumentation. Argumentation 21 (2007), 36–61. https://doi.org/10.1007/s10503-007-9022-6
[11] Ronald E. Leenes. 2001. Burden of Proof in Dialogue Games and Dutch Civil Procedure. In Proceedings of the 8th International Conference on Artificial Intelligence and Law. ACM, Missouri, USA, 109–118. https://doi.org/10.1145/383535.383549
[12] Sanjay Modgil and Henry Prakken. 2010. Reasoning about Preferences in Structured Extended Argumentation Frameworks. In Proceedings of COMMA 2010, Computational Models of Argumentation. IOS, Italy, 347–358. https://doi.org/10.3233/978-1-60750-619-5-347
[13] Sanjay Modgil and Henry Prakken. 2014. The ASPIC+ framework for structured argumentation: a tutorial. Argument & Computation 5, 1 (2014), 31–62. https:
We have presented a formal model for the burden of persuasion. The //doi.org/10.1080/19462166.2013.869766
model is based on the idea that arguments burdened with persuasion [14] Henry Prakken. 2010. An Abstract Framework for Argumentation with Struc-
tured Arguments. Argument and Computation 1 (2010), 93–124. https://doi.org/
have to be rejected when there is uncertainty about them. We have 10.1080/19462160903564592
shown how an allocation of the burden of persuasion may lead to a [15] Henry Prakken, Chris Reed, and Douglas N. Walton. 2005. Dialogues about the
single outcome (IN arguments) in contexts in which the assessment Burden of Proof. In Proceedings of the 10th International Conference on Artificial
Intelligence and Law. ACM, Bologna, Italy, 115–124. https://doi.org/10.1145/
of conflicting arguments would otherwise remain undecided. We 1165485.1165503
have also shown how our model is able to address inversions of [16] Henry Prakken and Giovanni Sartor. 1996. Rules about Rules: Assessing Con-
flicting Arguments in Legal Reasoning. Artificial Intelligence and Law 4 (1996),
burdens of proof, namely, those cases in which the burden shifts 331–68. https://doi.org/10.1007/BF00118496
from one party to the other. In such cases, there is the burden of [17] Henry Prakken and Giovanni Sartor. 2010. A Logical Analysis of Burdens of
persuasion over the conclusion of a multistep argument, and at Proof. Legal Evidence and Proof: Statistics, Stories, Logic 1 (2010), 223–253.
[18] Gerard Vreeswijk. 1997. Abstract Argumentation Systems. Artificial Intelligence
the same time a burden of persuasion over the conclusion of an 90, 1–2 (1997), 225–279. https://doi.org/10.1016/S0004-3702(96)00041-0
attacker against a subargument of that multistep argument. The [19] Douglas Walton. 1996. Arguments from Ignorance. Pennsylvania State University
model can be expanded in various ways, to capture further aspects Press, Pennsylvania. https://doi.org/10.1007/978-3-319-15013-0_3
[20] Douglas Walton. 2014. Burden of proof, presumption and argumentation. Cam-
of legal reasoning. For instance, it can also be supplemented with bridge University Press, USA. https://doi.org/10.1017/CBO9781107110311
argumentation over burdens of persuasion [15], in a manner similar [21] C.R. Williams. 2003. Burdens and standards in civil litigation. Sydney Law Review
25 (2003), 165–188.
to the way in which argumentation systems can be expanded to
include argumentation about priorities, see [12, 16]. More generally
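The grounded {IN, OUT, UND}-labelling underlying Figure 2 can be computed by the standard fixpoint construction of argumentation semantics [1]: iteratively label IN every argument all of whose attackers are already OUT, label OUT every argument with an IN attacker, and leave whatever remains undecided as UND. The following is a minimal sketch in Python; the two toy frameworks at the end (a chain and a mutual-attack standoff loosely analogous to the murder/self-defence conflict) are illustrative assumptions, not the paper's actual Example 2.1.

```python
def grounded_labelling(arguments, attacks):
    """Grounded {IN, OUT, UND}-labelling of an abstract argumentation framework.

    arguments: iterable of argument names
    attacks:   set of (attacker, attacked) pairs
    """
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    label = {}
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in label:
                continue
            # IN: every attacker is already OUT (trivially true if unattacked)
            if all(label.get(b) == "OUT" for b in attackers[a]):
                label[a] = "IN"
                changed = True
            # OUT: some attacker is already IN
            elif any(label.get(b) == "IN" for b in attackers[a]):
                label[a] = "OUT"
                changed = True
    # Anything the fixpoint never decides stays undetermined
    for a in arguments:
        label.setdefault(a, "UND")
    return label


# Unattacked chain: C attacks B, B attacks A -> C IN, B OUT, A IN
print(grounded_labelling({"A", "B", "C"}, {("C", "B"), ("B", "A")}))
# Mutual attack (e.g. murder vs. self-defence evidence): both stay UND
print(grounded_labelling({"A", "B"}, {("A", "B"), ("B", "A")}))
```

Under the paper's bp-labelling, an UND argument that bears the burden of persuasion is pushed to OUT instead, which is exactly the move from the top to the bottom labelling in Figure 2.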
Prediction of monetary penalties for data protection cases in
multiple languages
Aaron Ceross (University of Oxford, Oxford, United Kingdom) aaron.ceross@cs.ox.ac.uk
Tingting Zhu (University of Oxford, Oxford, United Kingdom) tingting.zhu@eng.ox.ac.uk
ABSTRACT
As the use of personal data becomes further entrenched in the function of societal interaction, the regulation of such data continues to grow as an important area of law. Nevertheless, data protection authorities have limited resources with which to address an increasing number of investigations. The leveraging of appropriate data-driven models, coupled with the automation of decision making, has the potential to help in such circumstances. In this paper, we evaluate machine learning models from the literature (Support Vector Machine (SVM), Random Forest, and Multinomial Naive Bayes (MNB) classifiers) for natural language processing in order to predict whether a monetary penalty was levied based on a description of case facts. We tested these models on a novel data set collected from the data protection authority of Macao across three languages (Chinese, English, and Portuguese). Our experimental results show that the machine learning models provide the predictability necessary to automate the evaluation of data protection cases. In particular, SVM performs consistently across the three languages, achieving an AUROC of 0.725, 0.762, and 0.748 for Chinese, English, and Portuguese, respectively. We further evaluated the interpretability of the results independently for each language and found that the salient texts identified are shared across the three languages.

ACM Reference Format:
Aaron Ceross and Tingting Zhu. 2021. Prediction of monetary penalties for data protection cases in multiple languages. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466097

ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466097

1 INTRODUCTION
The capture, storage, and processing of personal data within information systems has become a fundamental feature of societal interaction, including social relationships, commerce, government, and education. The increasing multiplicity and complexity of these interactions across different entities necessitates the promulgation of rules regarding the use and storage of personal data in order to prevent misuse of the data, as well as to redress power asymmetries in the relationship between the data subject and the data controller. Correspondingly, data protection law has gained increased visibility in recent years, notably with the enactment of the European Union's General Data Protection Regulation (GDPR) [20].

In general, there exists a disparity between regulators and the objects of regulation. This includes access to information, resources to contest regulatory action, and technical expertise, with the balance of power often favouring corporate entities over the regulating authority. It has been argued that this leads to inefficient regulatory action against such corporate entities [4]. Constrained resources are particularly acute in data protection regulation, with the effect of increasing pressure on the prioritisation of cases, therefore requiring prudence when taking on an investigation in order to maximise the effectiveness of regulatory action [5]. The widespread use of personal data by innumerable entities has resulted in a need to be selective in determining which cases to take forward so as to maximise the effectiveness of the regulation of personal data [14]. Further pressure on resources also comes from newer competences and regulatory expectations, such as mandatory breach notification. In this paper we evaluate text classification methods that have the potential to facilitate this aspect of the regulatory process. We are unaware of any work in the available literature examining the use of machine learning methods for data protection regulatory actions.

2 BACKGROUND
2.1 Automation of data protection regulation
Within the available literature we do not find examples of a data-driven approach being used for regulatory actions in data protection. The wider literature regarding empirical analysis of data protection judgements and regulatory actions has attracted only modest academic interest. There are few data sets which are readily available, and this may contribute to the limited research in this area. For example, Ceross and Simpson [6] provided summary statistics on civil penalties issued by the United Kingdom's data protection authority; however, they did not provide a model in this work. Nevertheless, the authors found that regulatory actions were focused primarily on health and government-held data, with the causes of breach being non-technological (e.g. improper disposal of records and unintended disclosure), which may suggest the priorities of the data protection authority.

2.2 Automated prediction of legal outcomes through textual analysis
There has been increasing interest in the research literature regarding the prediction of the outcomes of legal cases through machine learning and natural language processing. This may be due to the
greater availability of court data as well as the widespread use of machine learning and natural language processing libraries. Aletras et al. [1] constructed a dataset of 584 cases decided by the European Court of Human Rights (ECtHR), focusing on those cases involving Articles 3, 6, and 8 of the European Convention on Human Rights. The authors extracted n-grams and semantic topics as features, and classified the cases using an SVM with 10-fold cross validation. The approach established by Aletras et al. [1] has acted as a template for other studies. For example, using a data set of 5,990 German tax cases, Waltl et al. [23] employed a Naive Bayes classifier with 11 linguistic features and 10-fold cross validation. As another example, Virtucio et al. [22] examined 27,492 cases decided by the Philippine Supreme Court, achieving 59% accuracy using a Random Forest classifier with topics as a feature. More recently, Medvedeva et al. [16] adopted the approach in [1], expanding the scope to analyse ECtHR decisions across 14 articles. The authors in this work selected an equal number of decision outcomes in order to maintain a balanced dataset.

There exist multiple jurisdictions wherein there are multiple official languages of legal action (e.g., Canada and Belgium). In the works examining the ECtHR data, the authors did not consider whether the outcomes were similar for different languages. We add to the literature by comparing the same facts of a case in three languages to determine whether the predictors of the cases are the same across the language versions.

3 DATA
3.1 Description
Macao is a Special Administrative Region of the People's Republic of China. It is a former colony of Portugal which was transferred to Chinese rule in 1999. Macao's designation as a special administrative region means that it retains a degree of autonomy in its public administration and legislation. As such, the city has adopted an independent approach to data protection regulation which is different from that of China. The Personal Data Protection Act (Act 8/2005) (PDPA) [17] regulates the processing and storage of personal data in Macao and establishes a regulator, the Office for Personal Data Protection (GPDP).¹ The data protection regulation in Macao is based on the approach in the European Union, which envisions more prominent roles for data protection authorities [12]. Accordingly, the GPDP has a wide regulatory remit with regard to data protection, including education, advice, and consultation. As part of its regulatory functions, Article 33 of the PDPA provides that the GPDP may impose a monetary penalty ranging from MOP$4,000 to MOP$40,000 (approximately $500 – $5,000 USD) for acts which (i) infringe on the rights of the data subject (Articles 5, 10, 11, 12, 13); (ii) contravene rules related to security and confidentiality of processing (Articles 16, 17); or (iii) fail to appropriately publicise the processing of personal data (Article 25, paragraph 3). In 2019, the GPDP reported an unusual increase in received enquiries regarding data protection (2,940), an increase of 60% from the previous year and the highest in five years [10]. Such an increase inevitably gives rise to issues of prioritisation and resource allocation, as discussed in Section 1.

¹ The acronym used in English is taken from the authority's Portuguese name, Gabinete para a Protecção de Dados Pessoais.

3.2 Data collection
In this study, we extracted the text of completed case investigations by the GPDP from 2007 to 2019. As Chinese and Portuguese are the two co-equal official languages of Macao, the GPDP provides case files on its website in these languages. Additionally, the GPDP provides English translations of the cases, although this translation is unofficial. Overall, the case report itself is divided into three parts: (i) the brief, which explains the facts of the case; (ii) the analysis, which gives an explanation of the applicable law and a consideration of factors; and (iii) the case outcome, which details the decision of the GPDP.

The GPDP provides reports for cases whether or not an infraction of the PDPA is found. This allows for the classification of penalties and closed cases, which represents the full scope of possibilities, with regard to assessment, that data protection authorities face when deciding on cases. Among the retrieved cases, there was a discrepancy between the numbers of translated cases available from the website (316 in written Chinese; 309 in Portuguese; and 292 in English), indicating that the website does not provide every case in all three languages. Using the unique case identification numbers, we kept only those cases common to all three languages, resulting in a dataset of 281 cases, representing investigations between 2007 and 2019. Of these 281 cases, 73 (26%) were given a penalty, with the remaining 208 (74%) closed without one. Most of the cases in the data set arise from complaints (185, 66%), followed by reports (58, 21%), referrals (21, 8%), and active investigations (17, 6%).

4 METHODS
4.1 Data processing and feature extraction
In addition to any of the regulatory actions the GPDP may take, the case outcomes may result in the authority finding that no further action is necessary. In this work, we only make use of the case briefs in order to predict case outcomes. Other sections within the case documentation address the merits of the claims (the analysis) or provide the case outcome, which may strongly influence the results. Table 1 describes the summary linguistic properties of a brief. Chinese text has the shortest mean length of the three languages, as well as the smallest number of tokens. Despite this, it has a larger vocabulary than the Portuguese and English text (3,423 types). The smaller number of tokens may help in reducing the feature set for the purposes of classification, thereby allowing for more effective prediction of classes.

Table 1: Descriptive statistics of case briefs.

Language     #tokens  #types  min length  max length  mean length
Chinese       19,287   3,423          20         459        68.64
English       26,002   2,256          30         240        92.53
Portuguese    26,157   2,856          26         249        93.09

The case briefs are written in short sentences describing the reasons the case has been brought to the GPDP. The brief does not
make an evaluation as to the merits of the case, nor do the briefs provide an indication as to the outcome. The input feature to our classification models can be defined as a document-term frequency matrix, which describes the frequency of terms that occur in a collection of documents. In order to generate this, we lowercase the text and remove 'stop words', which are frequently occurring words of no semantic significance (e.g. articles and prepositions). Both unigram and bigram word features are extracted. While stemming is a common approach to reduce the number of features in language processing tasks, research has demonstrated that lemmatisation, using the infinitive root of a word, may produce better outcomes for such tasks [15]. Finally, the term frequency–inverse document frequency (tf-idf) of each word was extracted to reflect how important the word is in the collection of case briefs.

4.2 Evaluation of classification models
We evaluated the case briefs using the classification models in the identified literature (see Section 2.2). These include (i) a Support Vector Machine (SVM) classifier, a non-probabilistic classifier that maps feature values into a higher dimension to maximise the discrimination between two classes; (ii) a Random Forest (RF) classifier, an ensemble of decision trees built via bootstrapping subsets of features; and (iii) Multinomial Naive Bayes (MNB), a probabilistic classifier which combines Bayes' theorem with a multinomial event model, allowing for explicit modelling of the frequency/count of each of our features.

In our experimental setup, each model was trained and tested using stratified 3-fold cross validation. Stratification was used because the proportion of penalty versus non-penalty case briefs was highly imbalanced (73 penalty cases out of 281). The averaged performance across the 3 folds was then computed for each model and for each language. In evaluating model performance, we considered common metrics including specificity, recall, the F1 score, and the Area Under the Receiver Operating Characteristic curve (AUROC). We also included the Matthews correlation coefficient (MCC), which provides a correlation coefficient between the observed and predicted binary classifications. MCC is robust and therefore particularly useful in scenarios where classes exhibit a strong imbalance [18].

4.3 Interpretability of classification
The interpretability of the resultant machine learning models is a challenging but worthwhile endeavour [11]. As such, we utilise the LIME methodology proposed by Ribeiro et al. [21]. LIME allows for the identification of the areas which provided the most influence on a classification, which is an attractive feature for legal decision making. With the identification of predictive features, we qualitatively assess whether the identified features for correctly predicted positive cases are shared across the three languages. From the comparison of classification models, we select the most performant model across all three languages by assessing the classifier with the best value on each metric for each language, and then checking for consensus among metrics. We re-run a fold from the model in order to examine the quality of predictions across the entire data set. We utilise the predictions on the entire dataset to (i) compare the number of true positives identified in each language and (ii) determine whether the semantic meaning of the determinative features of any particular prediction (i.e. which words were most influential) is shared between the languages. We draw an illustrative case example and compare the top ten features, assessing the extent to which these translate into each language.

5 RESULTS
5.1 Performance of classification models
Table 2 shows the mean and standard deviation of the performance metrics on the test data sets derived from the stratified 3-fold cross validation. Given structural differences between languages, it is expected that different models will be more effective for one language than another. For instance, SVM performs well across most metrics for Portuguese (AUROC of 0.748 and F1 of 0.605) and Chinese (AUROC of 0.725 and F1 of 0.577). For English, MNB was considered the better option (AUROC of 0.774 and F1 of 0.649). However, it is difficult to conclude whether the results are significant due to the limited number of case briefs available. It was further expected that specificity would be high across the different models (a range of 0.875 to 0.981) due to the number of non-penalty briefs available in comparison to those with a penalty. In the case of recall, SVM mostly performed best across the three languages (0.576 for Chinese, 0.630 for English, and 0.602 for Portuguese), where the other models had variable results. The MCC values were also similar across the three languages, with MNB providing the highest values, ranging from 0.48 to 0.534. We determine that SVM achieves the best scores across all languages when considering all metrics.

5.2 Comparison of predictions across languages
Of the 57 cases tested, the performance across the languages varied insignificantly: English, 49 (86%); Portuguese, 50 (88%); and Chinese, 50 (88%). The classification accuracy for No Penalty is high, but this is due to the imbalanced class. Complaints, which make up the majority of reasons initiating a case investigation, receive the most penalties compared to other reasons. Across all languages, true predictions of penalties for complaints were high, with the most identified by English (75%). The largest disparity in scores exists in Active Investigation; only the Chinese-language classifier was able to positively identify one of the two penalties.

5.3 Assessment of interpretability across languages
From the results in Section 5.2, there is some evidence for the feasibility of utilising machine learning for penalty prediction. One of the questions posed by this paper was whether these predictions converge on semantic meaning with regard to the most discriminative features used to make them. We make a qualitative check regarding the features by selecting a case which was correctly identified by the classifier in all languages. For this purpose we select Case No 0002/2014/IP [9], which concerned the uploading of a photo onto a social networking site without the consent of the complainant. The complainant had family photographs professionally taken by a company. Without the knowledge or consent of the complainant, the company uploaded the photo to social media as promotional material. The complainant asked the company to remove the photo but did not receive any response. The complaint
Table 2: Performance of the classification models (mean ± standard deviation).

Language    Model  F1               AUROC            Recall           Specificity      MCC
English     SVM    0.620 (± 0.126)  0.762 (± 0.095)  0.630 (± 0.272)  0.894 (± 0.083)  0.564 (± 0.081)
English     RF     0.480 (± 0.122)  0.666 (± 0.067)  0.385 (± 0.17)   0.947 (± 0.036)  0.423 (± 0.084)
English     MNB    0.649 (± 0.13)   0.774 (± 0.098)  0.631 (± 0.256)  0.918 (± 0.065)  0.588 (± 0.116)
Portuguese  SVM    0.605 (± 0.142)  0.748 (± 0.095)  0.602 (± 0.239)  0.894 (± 0.048)  0.514 (± 0.126)
Portuguese  RF     0.561 (± 0.147)  0.709 (± 0.084)  0.453 (± 0.19)   0.966 (± 0.024)  0.528 (± 0.108)
Portuguese  MNB    0.602 (± 0.163)  0.747 (± 0.102)  0.575 (± 0.255)  0.918 (± 0.56)   0.534 (± 0.135)

Note: the mean and standard deviation are computed on the test sets from the 3-fold stratified cross validation.
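The MCC values reported in Table 2 follow directly from confusion-matrix counts. The sketch below (pure Python; the confusion-matrix counts are hypothetical, chosen only to mirror the paper's 73:208 class imbalance) illustrates why MCC is preferred over accuracy when one class dominates:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when a marginal is empty (the usual convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def specificity(fp, tn):
    """True-negative rate, as reported in Table 2."""
    return tn / (tn + fp)

# A hypothetical degenerate classifier that predicts "no penalty" for all
# 281 cases (73 penalty, 208 no-penalty): accuracy is 208/281 = 74% and
# specificity is perfect, yet MCC is 0 because no penalty is ever found.
print(mcc(tp=0, fp=0, fn=73, tn=208))   # 0.0
print(specificity(fp=0, tn=208))        # 1.0
```

This is why the authors pair specificity with recall and MCC: on this data, a model can score a high specificity while providing no regulatory value at all.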
[Figure 1: Top ten predictive features and their LIME coefficients for Case No 0002/2014/IP. English: photo, several, networking, publish, message, account, specify, take, social, afterwards. Portuguese: foto, tirar, mensagem, rede, companhia, exigir, social, publicar, atender, acto.]
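Per-word coefficients like those in Figure 1 come from LIME's locally fitted surrogate model [21]. A much simpler occlusion heuristic (drop one word, re-score) conveys the same intuition; the sketch below is not the LIME algorithm, and the toy trigger-word "classifier" is a hypothetical stand-in for a trained model's penalty probability:

```python
def word_influence(brief, predict_proba):
    """Occlusion scores: how much P(penalty) drops when a word is removed.

    A crude stand-in for LIME's locally fitted linear weights.
    """
    words = brief.lower().split()
    base = predict_proba(words)
    scores = {}
    for w in set(words):
        without_w = [t for t in words if t != w]
        scores[w] = base - predict_proba(without_w)
    # Most influential words first
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical "classifier": the fraction of words that are penalty triggers.
TRIGGERS = {"photo", "publish", "consent"}
toy_model = lambda words: sum(w in TRIGGERS for w in words) / max(len(words), 1)

ranking = word_influence("company publish photo without consent", toy_model)
print(ranking[0][0])  # a trigger word ranks highest
```

As with the LIME output in Figure 1, the ranking is only as meaningful as the underlying features: a bag-of-words model can surface tokens that carry little semantic content in isolation.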
was submitted to the GPDP, who found that the situation warranted a penalty.

In Figure 1 the most predictive features are shown for the outcome of Case No 0002/2014/IP [9]. The top features for English and Portuguese are 'photo' and 'foto' respectively. In this instance we have semantic convergence, as these mean the same thing in both languages. For Chinese, however, it is '投訴', which translates to 'to complain'. In this example, most of the terms are shared between English and Portuguese, e.g., 'publish' / 'publicar' and 'message' / 'mensagem'. Portuguese and Chinese share the word for 'company': 'companhia' / '公司'. It is noted that many of the Chinese terms do not provide much semantic meaning. For example, the characters '甲' and '乙' are related to the ordering of items, and in legal practice these characters are often used to anonymise names in a text (e.g. Person A/甲 and Person B/乙). The character '均' is used to qualify other nouns and thus may have multiple meanings depending on context, such as 'all' or 'even'; by itself, it does not provide any meaning. In short, while there may be some semantic meaning derived from the prediction (e.g., 'photo', 'foto', '公司'), many of the features have limited semantic utility, such as 'take', '均', and 'acto'. This may have implications for the provision of explanations in support of a prediction when used in a legal setting.

6 DISCUSSION
This work has demonstrated the effectiveness of text classification of data protection cases. The binary classification of penalties has been shown to be effective across multiple language translations of the same facts. Despite the small dataset, this paper's experiment has, at the very least, indicated the strong possibility that machine learning may be included in some manner to facilitate case prioritisation for regulatory action. Data protection authorities may utilise this method to pre-screen complaints and thus more effectively prioritise the use of the authority's resources.

While an argument may be made for using text classification to prioritise cases, there are considerations as to whether such an undertaking impinges on the nature of regulation and the law. Bayamlıoğlu and Leenes [2] regard the use of the technology in legal decisions and data-driven law as degrading the "moral enterprise" of the
law, which has an impact on the human trust and value placed not only in the law itself but, by extension, in those institutions charged with its execution. This is echoed by Hildebrandt [13], who argues that data-driven law creates a crisis for law, in that data-driven predictions may result in an atrophy of the ability to make judgements congruent with the lived experience of individuals; data-driven law is beholden to a type of logic that may not lend itself easily to the rule of law.

Explainability of machine learning models is often suggested as a counter-balance to what may be perceived as the negative effects of automated decisions. Edwards and Veale [8] maintain that an explanation of the logic of a model and its outcomes may have no meaning for an affected individual, giving little recourse by which to challenge the outcome of the automated decision. There is also the question as to whether such explanations are helpful or informative to individuals: Binns et al. [3] investigated the human perception of algorithmic decisions and found that even where explanations are provided for algorithmic decision-making, such explanations may not contribute to a sense of 'fairness' in the decision made.

From the results, we would strongly caution against adopting the methods outlined in the experiment as the sole determination of a penalty allocation. The results from our experiment show that the models are able to indicate which cases may be of more interest to a regulatory authority than others. It is, however, questionable whether the description of factors amounts to a sufficient explanation for legal purposes. This is because the models utilise text processed in a bag-of-words approach, which accounts for the frequency of words, not their semantic meaning. As such, many predictive features, such as those detailed in Section 5.3, may be nonsensical when shown in isolation.

7 CONCLUSIONS AND FUTURE WORK
In this work, we introduced a novel dataset of cases from the data protection authority of Macao and evaluated multiple machine learning classifiers for binary text classification of text in three language versions. Our results show that MNB and SVM performed well across all metrics for all languages, with SVM being considered the most performant. Assessing interpretability was difficult given the bag-of-words model used in the text preprocessing, although there is some overlap in the semantic meaning of features between the languages. The models evaluated are not without their

REFERENCES
[1] Aletras, N., Tsarapatsanis, D., Preoţiuc-Pietro, D., and Lampos, V. Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective. PeerJ Computer Science 2 (2016), e93.
[2] Bayamlıoğlu, E., and Leenes, R. The 'rule of law' implications of data-driven decision-making: a techno-regulatory perspective. Law, Innovation and Technology 10, 2 (2018), 295–313.
[3] Binns, R., Van Kleek, M., Veale, M., Lyngs, U., Zhao, J., and Shadbolt, N. 'It's reducing a human being to a percentage': Perceptions of justice in algorithmic decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, pp. 1–14.
[4] Braithwaite, J. Enforced self-regulation: A new strategy for corporate crime control. Michigan Law Review 80, 7 (1982), 1466–1507.
[5] Ceross, A. Examining data protection enforcement actions through qualitative interviews and data exploration. International Review of Law, Computers & Technology 32, 1 (2018), 99–117.
[6] Ceross, A., and Simpson, A. C. The use of data protection regulatory actions as a data source for privacy economics. In Computer Safety, Reliability, and Security (SAFECOMP) (2017), S. Tonetta, E. Schoitsch, and F. Bitsch, Eds., vol. 10489 of Lecture Notes in Computer Science (LNCS), Springer, pp. 350–360.
[7] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Minneapolis, Minnesota, June 2019), Association for Computational Linguistics, pp. 4171–4186.
[8] Edwards, L., and Veale, M. Enslaving the algorithm: From a "Right to an Explanation" to a "Right to Better Decisions"? IEEE Security & Privacy 16, 3 (2018), 46–54.
[9] Gabinete para a Protecção de Dados Pessoais. Case No: 0002/2014/IP: Uploaded clients' photos by mistake. https://www.gpdp.gov.mo/index.php?m=content&c=index&a=show&catid=209&id=775, 2014. English version.
[10] Gabinete para a Protecção de Dados Pessoais. Case investigations (個案調查). https://www.gpdp.gov.mo/uploadfile/2020/1009/20201009040422173.pdf, Oct. 2020.
[11] Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., and Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) (2018), IEEE, pp. 80–89.
[12] Greenleaf, G. Macao's EU-influenced Personal Data Protection Act. Privacy Laws & Business International Newsletter 96 (2008), 21–22.
[13] Hildebrandt, M. Law as computation in the era of artificial legal intelligence: Speaking law to the power of statistics. University of Toronto Law Journal 68, Supplement 1 (2018), 12–35.
[14] Hustinx, P. The role of data protection authorities. In Reinventing Data Protection?, S. Gutwirth, Y. Poullet, P. De Hert, C. de Terwange, and S. Nouwt, Eds. Springer, 2009, pp. 131–137.
[15] Jianqiang, Z., and Xiaolin, G. Comparison research on text pre-processing methods on Twitter sentiment analysis. IEEE Access 5 (2017), 2870–2879.
[16] Medvedeva, M., Vols, M., and Wieling, M. Using machine learning to predict decisions of the European Court of Human Rights. Artificial Intelligence and Law 28, 2 (2020), 237–266.
[17] Personal Data Protection Act. https://www.gpdp.gov.mo/uploadfile/2016/0302/20160302033801814.pdf, 2005.
[18] Powers, D. M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061 (2020).
[19] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2020).
[20] Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). L119, 4/5/2016, pp. 1–88, 2016.
limitations and the size of training datasets remains a challenge. In [21] Ribeiro, M. T., Singh, S., and Guestrin, C. "Why should I trust you?": Explain-
future work, we aim to employ language models such as BERT [7] ing the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD
to experiment with test different language tasks in data protection International Conference on Knowledge Discovery and Data Mining, San Francisco,
CA, USA, August 13-17, 2016 (2016), pp. 1135–1144.
regulation. Another avenue of future work may include assessing [22] Virtucio, M. B. L., Aborot, J. A., Abonita, J. K. C., Avinante, R. S., Copino,
the utility and interpretability of the outputs of case prediction with R. J. B., Neverida, M. P., Osiana, V. O., Peramo, E. C., Syjuco, J. G., and Tan,
G. B. A. Predicting decisions of the Philippine Supreme Court using natural
case investigators themselves. language processing and machine learning. In 2018 IEEE 42nd Annual Computer
Software and Applications Conference (COMPSAC) (2018), vol. 2, IEEE, pp. 130–135.
ACKNOWLEDGEMENTS [23] Waltl, B., Bonczek, G., Scepankova, E., Landthaler, J., and Matthes, F.
Predicting the outcome of appeal decisions in Germany’s tax law. In International
The authors thank the anonymous reviewers for their feedback as Conference on Electronic Participation (2017), Springer, pp. 89–99.
well as Tasos Papastylianou and Andrew Simpson for comments
on earlier drafts of this work. Special thanks to Ethan Ceross.
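The bag-of-words pipeline evaluated in the paper above — word-frequency features fed to a Multinomial Naive Bayes classifier and a linear SVM — can be sketched with scikit-learn. This is an illustrative sketch only: the toy corpus, labels and default hyperparameters below are our assumptions, not the paper's Macao dataset or settings.

```python
# Illustrative bag-of-words binary classifier comparison (MNB vs. linear
# SVM). The corpus and labels are placeholders, not the paper's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder documents: 1 = case of regulatory interest, 0 = not.
docs = ["data uploaded without consent", "routine enquiry closed",
        "photos disclosed by mistake", "general guidance request"]
labels = [1, 0, 1, 0]

for name, clf in [("MNB", MultinomialNB()), ("SVM", LinearSVC())]:
    pipe = Pipeline([("bow", CountVectorizer()),  # word-frequency features
                     ("clf", clf)])
    pipe.fit(docs, labels)
    print(name, pipe.predict(["information disclosed without consent"]))
```

As the paper notes, the learned features are raw word frequencies, so inspecting the highest-weighted terms of such a model yields words, not legally meaningful explanations.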
Regulating Artificial Intelligence:
A Technology Regulator’s Perspective
Joshua Ellul (joshua.ellul@um.edu.mt), Malta Digital Innovation Authority & University of Malta, Malta
Gordon Pace (gordon.pace@um.edu.mt), Department of Computer Science, University of Malta, Malta
Stephen McCarthy (stephen.mccarthy@mdia.gov.mt), Malta Digital Innovation Authority, Malta
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Joshua Ellul, Gordon Pace, Stephen McCarthy, Trevor Sammut, Juanita Brockdorff, and Matthew Scerri
for precise definitions and objective measures in the legal framework, meant that the Maltese regulatory approach is founded on practical and auditable aspects, and is intended to address concerns with existing technology (as opposed to attempting to address possible issues arising from future development of AI technology, for example Artificial General Intelligence). To implement an AI regulatory framework intended for modern-day technology, and also in aim of not stifling innovation, the framework is primarily voluntary; however, it may be mandated based upon the sector and/or risk associated with the activity within which the AI system is used, or as deemed necessary by another lead authority or governing legislation. This sets the tone of much of the paper, but it is naturally endemic to any discussion of practical implementations of the regulation of technologies. The fast-evolving nature of technology requires law-makers to address existing technology in a sound manner, but also in a way that is expected to be future-proof. A full version of this paper can be found in [7].

2 THE CASE FOR AI ASSURANCES
We start by highlighting issues related to AI-based systems which could result in systems operating incorrectly in relation to their intended functionality, and thereafter build the case for instilling assurances. We concentrate on Artificial Narrow Intelligence (ANI) given that the state-of-the-art has not yet reached levels of Artificial General Intelligence (AGI) [3]. We will use the term AI throughout the rest of the paper to refer to AI that exists today — ANI.

Since the inception of software development, the fact that such systems occasionally fail has been accepted to be the norm. Although much work has gone into developing techniques to reduce the frequency and severity of such occurrences, we continue to experience software malfunction on a daily basis. The impact of such failure is contained as long as the software functions in a closed system, i.e. it has no direct impact on the real world, but frequently software affects the real world in a direct or indirect manner. One finds reports of many catastrophic failures in literature and newspaper reports, with effects ranging from huge financial losses to critical infrastructure failure and even loss of human life. AI systems are no exception when it comes to incorrect behaviour, and even when the algorithms themselves are correctly implemented, incorrect behaviour might emerge. For instance, a correct implementation of a machine learning based algorithm may still learn wrong behaviour due to insufficient training, biases and unbalances that may exist within datasets, etc.

Undoubtedly AI systems should undergo standard quality assurance processes, not only for functional correctness of the algorithms themselves but also with respect to the behaviour emergent following training. However, testing of AI systems is only as good as the coverage of training data, iterations and permutations and use cases which are undertaken. Once an AI system is deployed and it encounters an event that it was not trained to handle, it may well end up handling it incorrectly. More so, if it is continuously learning in a live environment it may be exposed to certain situations which could affect its behaviour negatively.

Part of the challenge is that many AI-based techniques function as black-boxes, for which reason one finds extensive research towards explainable AI. The past decade has seen various infamous cases where unexpected behaviour emerged from AI systems, sometimes of a controversial or even safety-critical nature. The increasing concern is not only to do with cases that have emerged, but also based on the reality that more and more systems are becoming computerised and automated. One often-referenced cliché is that of automated scoring systems [13] in which discrimination is unacceptable, highlighting the need to ensure bias in datasets is removed and attempts made to remove discriminatory features during training.

The concerns highlighted above demonstrate the need to ensure that sufficient assurances are put in place to ensure that AI algorithms are implemented correctly, and that their behaviour is as expected and does not introduce any unwanted biases. Indeed, many are advocating for such regulatory frameworks to be developed and applied to AI systems, but whether such frameworks should be mandatory for all AI-based systems is debatable. We now follow with a case for why such frameworks should not always be mandated, and why they should not be focused on the technology but on the sector or activity that the AI is being used within/for.

3 THE CASE FOR VOLUNTARY ASSURANCES OF AI AND MANDATORY ASSURANCES OF REGULATED AND CRITICAL ACTIVITIES
Setting aside AGI, when it comes to ANI should such frameworks always be mandated? The same AI framework, for instance identifying user preferences, can be the engine behind a wide range of applications, from a personal movie recommendation system, to a social network targeted advertising campaign to influence users in an upcoming election. The underlying infrastructure is application agnostic, but should such an underlying infrastructure be required to be regulated? More so, what difference does it make if an algorithm is AI-based or not and yet can be used for the same activity? Then, should we be talking about AI regulation at all? Or should we be focusing on software — or rather, the activity it is used for, irrespective of how it is implemented?

Regulating all forms of AI would result in shackling and stifling innovation [11]. The definition of AI itself is controversial, and even if a definition is chosen, is it going to be clear what software is AI and what software is not? There are some algorithms which we can ascertain are universally accepted as AI, and some systems which are universally considered to not have aspects of AI within them; however, what should be done about the rest? Could this approach not only stifle AI innovation, but also other software-based innovation?

Looking back at the principles of regulation though, we need to ask ourselves why regulation of AI is being proposed. Is it only because of end-of-the-world scenarios being painted, which require AGI, which the state-of-the-art is currently not capable of? If so, then perhaps we should differentiate between any regulatory requirements for AGI and ANI. We propose that this should be done, at least in the interim until AGI is deemed to be upon us. We leave considerations for AGI as future work, and here will continue discussing aspects pertaining to ANI.

If AI is regulated even when applied to unregulated and non-critical activities, given the line between AI and software in general is blurred, and given that non-AI based techniques may
yield the same sort of undesirable outcomes, then why should not all software be regulated? We propose that mandatory regulation should be sector/activity-based and not based on technology.

The question of what constitutes high-risk, or defines whether a sector or activity should mandate this framework, arises. This is to be left up to other lead authorities and laws of the land to decide. For financial affairs, a financial services authority (a separate body) may impose when a sector or activity should be mandated to undertake a technology audit (as proposed herein), or even if any level of enhanced due diligence is required. Therefore, based on the above we make the argument that mandatory regulatory frameworks should not be technology-specific (or AI-specific), yet should be activity- or sector-specific as defined and required per activity/sector.

AI technology-based assurances may not only be required for regulated activities; various AI-based products and services may also see benefit in providing assurances to various stakeholders. Therefore, the regulatory approach enables technology-based assurances to also be offered on a voluntary basis (besides being mandated from lead authorities of respective sectors/activities). Now, we present the AI technology assurance framework implemented by the Malta Digital Innovation Authority4 which offers certification of AI systems on a voluntary basis where sought, or on a mandatory basis where other lead authorities or laws require it.

4 AN AI TECHNOLOGY ASSURANCE REGULATORY FRAMEWORK
We now present the AI Innovative Technology Arrangement (AI-ITA) regulatory technology assurance framework. Approaches for providing software assurances will invariably have a degree of commonality irrespective of the technology domain, and also the application domain within which the solution is categorised. As such, this framework builds on the Innovative Technology Arrangement (ITA) [6] regulatory assurance framework overseen by the Malta Digital Innovation Authority (MDIA). Rather than mandating compliance and certification of all AI-based systems, the regulatory framework is a voluntary one — unless a lead authority deems that such technology assurances are required. It is in this manner we believe innovation can still flourish, by only requiring mandatory oversight of sectors and activities that should require such oversight.

AI Innovative Technology Arrangement. The challenge with Artificial Intelligence ITAs (AI-ITAs) primarily revolves around identifying what constitutes AI. Rather than define what an AI-ITA is as a hard and fast rule, the guidelines take the approach of defining qualities and criteria that qualify software as an AI-ITA: (a) the ability to use knowledge acquired in a flexible manner in order to perform specific tasks and/or reach specific goals; (b) evolution, adaptation and/or production of results based on interpreting and processing data; (c) systems logic based on the process of knowledge acquisition, learning, reasoning, problem solving, and/or planning; (d) prediction, forecast and/or approximation of results for inputs that were not previously encountered.

The above ensures that techniques and algorithms commonly associated with the wider AI field are captured, and include anything from Deep Learning to Natural Language Processing and Optimisation Algorithms. The MDIA will also continue to monitor developments and update guidelines as required to include (and potentially exclude) defining features of what is/is not classified as an AI-ITA.

System Audits and Subject Matter Experts. The framework provides a structure for the Authority and applicant to work with independent (and approved) system auditors to be able to scrutinise, to a fairly high level of detail, the software itself as well as the manner with which it is being operated, under the ISAE 3000 [12] standard for assurance. The audit of the software system itself is primarily conducted via a code review, whose aim is to ensure that the manner with which the AI-ITA is implemented accurately reflects what the organisation behind the AI-ITA is claiming in its technology blueprints. The rationale behind this is to ensure that any claims being made are truly reflected in the code, which enables the general public, who may not know what AI really is, to gain trust in the system given that it stood up to scrutiny prior to the certificate being issued. Beyond the software, the certification mandates, depending on the type of audit being undertaken and associated controls, to also give the general public assurances that the AI-ITA creator and operator are running the organisation in a manner that meets the standards set out by the MDIA. The certification therefore enables the general public to trust the creator in the manner they build, maintain and run the AI system. Two main types of audits are required throughout an AI-ITA's lifetime: (i) a 'Type 1 Systems Audit', which focuses on providing assurances with respect to functional correctness, typically undertaken as an AI-ITA's first audit; and (ii) a 'Type 2 Systems Audit', which focuses on renewing assurances provided through a previous audit, and which factors in live data and operations associated with the system to assure that the system assurances are still in place within the period under audit.

The audit process begins with the applicant submitting a request (in the form of an application) to the Authority, upon which the Authority will assess the applicant by reviewing the provided documentation around the AI-ITA and conduct its due diligence. Following this, the MDIA issues a Letter of Intent, upon which the applicant will be able to appoint an MDIA-approved Systems Auditor and notify the MDIA of the appointment, for the MDIA to verify that the Systems Auditor has the required competencies (which the Authority has tested the system auditor for). The Systems Auditor will then conduct the audit as per the Authority's guidelines5 and compile a report with their findings, which is issued to the MDIA for review and a subsequent decision on whether the certificate is to be issued. Once issued, a further follow-up audit must be conducted every time there is a material change in the AI-ITA (and on renewal after every two years).

Systems Audits are an integral part of the certification process as they provide the MDIA with an independent report on the particularities of the AI-ITA, specifically the code (and data) and whether it accurately reflects what is being disclosed in the blueprint, and the ongoing operations of the AI-ITA. Systems Audits are conducted by Systems Auditors, who must be independent from the AI-ITA and its operator, who are subject to approval by the Authority, and who need to meet a set of requirements (defined in the Systems Auditor guidelines) through their combined complement of Subject Matter

4 https://mdia.gov.mt
5 https://mdia.gov.mt/wp-content/uploads/2019/10/AI-ITA-Guidelines-03OCT19.pdf
Experts (SMEs) in the fields of IT audit, cybersecurity and technology, with specialisation, in this case, in AI. The SMEs will be the primary individuals responsible for conducting systems audits, and must adhere to a set of requirements, such as ensuring that they meet a level of continuous professional education in the AI field6.

This section describes the requirements that an AI-ITA must meet in order to qualify for certification.

ITA Blueprint. The Blueprint document is an essential document in the certification process as it is meant to provide a detailed description to the Authority on what the system does, and how it is designed and operated. Other than allowing the MDIA to evaluate whether AI-ITA certification is applicable, it is further intended to be used by the Systems Auditors as the document against which aspects such as the code are reviewed. The blueprint also defines a minimum set of disclosures that must be made to direct users (in English) in a non-technical manner, to be able to communicate the features and functionalities of the system and how it respects the ethical AI framework7, limitations to prevention of bias, and the expected accuracy of the AI-ITA.

In a general (AI-agnostic) sense, the detailed description must cover the functional capabilities of the AI-ITA, how the system is to be verified and tested to ensure the results meet expectations, and what the operational limitations of the system are. More specifically, for an AI solution the blueprint must include a disclosure of the AI techniques used, justify why certification is being sought, and explain how specific risks are being managed and mitigated, e.g. what is being done to ensure that the underlying dataset is unbiased. In a broader sense, the Blueprint must highlight the safety mechanisms in place and alignment with Malta's Ethical AI Framework.

ITA Harness. A crucial element that the AI-ITA framework proposes, and which needs to be highlighted clearly in the Blueprint, is the ITA Harness. The ITA Harness provides a safety net for the process by monitoring activity inputs and outputs to ensure that the boundaries (which must also be disclosed in the Blueprint) are respected. Furthermore, the ITA Harness must also be able to handle any anomalies it detects (such as outputs outside expected boundaries) in a manner which is also disclosed. The AI-ITA Harness must also communicate with the Forensic Node (discussed next) to ensure that any anomalies are appropriately logged and can be investigated and rectified. While the harness may not apply to all AI-ITAs, the Authority requires that when it does not apply this must be justified adequately in the blueprint and accompanied with alternative plans of how the behaviour of the AI-ITA will be monitored and contained8.

Forensic Node. The Forensic Node is another requirement mandated by the MDIA, and its implementation and operation are also subject to the audit. The purpose of the Forensic Node is to "store all relevant information on the runtime behaviour of the AI-ITA in real-time such as recording of inputs and outputs, and supporting data related to potential explainability of how an output was derived from a given input wherever applicable". This means that any inputs and outputs, as well as data that supports how the system achieved the results it did, must be stored in a secure data store in real-time. This highlights that the Forensic Node is not only used to support the assessment of (some of) the operating effectiveness of the controls during an audit, but may also be used to support legal compliance by the MDIA (or other authorities), and also enables a further layer of monitoring to be done (manually or automated) by a Technical Administrator (discussed next). It is important to note that the Forensic Node must be separate from the ITA Harness, in that the Forensic Node is more concerned with data logging, as opposed to the monitoring in relation to boundaries.

Technical Administrator. A Technical Administrator, a form of service provider appointed by the AI-ITA to act as the final safeguard for the system, must be appointed and in place at all times. The Technical Administrator must be able to intervene, if required to do so by the MDIA, another authority or legally (such as in the event of a breach of law by the AI-ITA), to limit further impact to the users and, where necessary, limit or reverse losses. For example, consider an AI system that utilises reinforcement learning and which, after a period of time, starts to exhibit discriminatory bias that goes against the principles laid down in the ethical AI framework and/or against the requirements of any laws or rules it must abide by. In this case the Technical Administrator must be able to halt the operation of the system to prevent further damage and revert to an older model (as may be mandated by a legal judgement). As such, this also imposes an indirect requirement for the AI-ITA to provide mechanisms to enable the Technical Administrator to conduct their actions as may be necessary (e.g. by ensuring regular snapshots of the machine learning models are kept to revert back to).

English Description and Consumer Protection. The system being certified is checked by the systems auditors who, amongst other things, ensure that its functionality matches that described in the blueprint in human-readable form (in English). If, post-deployment, the system exhibits behaviour contrary to this description against which it was certified, the Innovative Technology Arrangements and Services Act specifies that the English version prevails legally.

Auditing of Design and Development Processes. Systems Audits include oversight of the design and development process of the system-under-audit. Not only does such oversight cover traditional software engineering principles, but for systems including an element of AI it also includes assurances that certain foundational principles have been taken into consideration in the process.

Build on a human-centric approach. The systems auditors ensure that the AI system was designed in a manner to support and assist humans without overriding the user or leading them into taking any unwanted decisions, and that the manner with which it operates must be equitable and inclusive across different segments of society.

Adherence to applicable laws and regulations. It is crucial that behaviour induced by the system, including parts driven by AI, will be designed in a manner that adheres to the law.

Maximise benefits of AI systems while preventing and minimising their risks. It is crucial that any risks induced through the use of AI are identified and mitigated accordingly, including the setting up of controls to ensure fairness, transparency and resiliency to new AI-specific attack vectors.

Aligned with emerging international standards and norms around AI ethics. As the world is increasingly becoming globalised through technology, which may be further amplified through the proliferation of AI systems, this objective was laid down to ensure that

6 https://mdia.gov.mt/sa-guidelines/
7 https://malta.ai/wp-content/uploads/2019/08/Malta_Towards_Ethical_and_Trustworthy_AI.pdf
8 https://mdia.gov.mt/wp-content/uploads/2019/10/AI-ITA-Blueprint-Guidelines-03OCT19.pdf
Malta's ethical framework is aligned with similar ethical guidelines by the EU commission9 and the OECD10.

The framework further builds on these principles by delineating a number of principles (such as Human Autonomy, Fairness, Prevention of Harm and Explicability) and proposes 63 controls of how these can be tested. While not all of these controls apply to all AI-ITAs, the AI-ITA must show that it has taken them into consideration and justify in the Blueprint (and ultimately to the users) those controls which do not apply.

vehicles, finance, etc.), privacy and data protection, fundamental rights, profiling and anti-discrimination issues, competition law, and legal personality. Quite a number of ethical frameworks have been proposed — it suffices to note that "at least 63 public-private initiatives have produced statements describing high-level principles, values and other tenets to guide the ethical development, deployment and governance of AI" [14]. Within the framework proposed herein an ethical framework is referenced; however, the scope of such a discussion would warrant a paper of its own.
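The ITA Harness and Forensic Node described in the framework above amount to a runtime monitor wrapped around the AI system: outputs are checked against the boundaries disclosed in the Blueprint, every interaction is logged for later audit, and anomalies are handled in a disclosed manner. A minimal illustrative sketch follows; the class and method names (and the clamping fallback) are our own invention, not interfaces of the MDIA framework.

```python
# Illustrative sketch of an ITA-Harness-style runtime monitor with
# forensic logging. Names and structure are the editors' assumptions,
# not the MDIA framework's actual interfaces.
import time

class ForensicNode:
    """Append-only store of inputs, outputs and anomalies."""
    def __init__(self):
        self.records = []

    def log(self, kind, payload):
        self.records.append({"t": time.time(), "kind": kind, "data": payload})

class Harness:
    """Checks model outputs against disclosed boundaries; logs everything."""
    def __init__(self, model, lower, upper, forensic):
        self.model, self.lower, self.upper = model, lower, upper
        self.forensic = forensic

    def run(self, x):
        y = self.model(x)
        self.forensic.log("io", {"input": x, "output": y})
        if not (self.lower <= y <= self.upper):
            # Boundary breach: log the anomaly and apply the disclosed
            # fallback behaviour (here, clamping to the boundary).
            self.forensic.log("anomaly", {"input": x, "output": y})
            return max(self.lower, min(y, self.upper))
        return y

node = ForensicNode()
h = Harness(lambda x: 2 * x, lower=0.0, upper=10.0, forensic=node)
print(h.run(3))  # within the disclosed boundaries
print(h.run(8))  # output 16 breaches the upper bound and is clamped
```

A Technical Administrator's halt-and-revert duty could be layered on the same structure, e.g. by swapping `self.model` for an earlier snapshot when anomalies accumulate.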
Making Intelligent Online Dispute Resolution Tools available to
Self-Represented Litigants in the Public Justice System∗
Towards an Ethical Use of AI Technology in the Administration of Justice
ICAIL’21, June 21–25, 2021, São Paulo, Brazil F. Esteban de la Rosa and J. Zeleznikow
paper describes the underlying reasons that have triggered the rise of SRLs and includes proposals to ensure their fair treatment.

2 EXAMPLES INCORPORATING ARTIFICIAL INTELLIGENCE INTO ONLINE DISPUTE RESOLUTION SYSTEMS

2.1 The Dutch platform Rechtwijzer
Rechtwijzer1 (Roadmap to Justice) was designed for couples with children who are separating. The aim of Rechtwijzer was 'to empower citizens to solve their problems by themselves or together with his or her partner. If necessary, it refers people to the assistance of experts.' Couples pay €100 for access to Rechtwijzer, which starts by asking each partner for information such as their age, income, education, and whether they want the children to live with only one parent or part time with each, then guides them through questions about their preferences.

The platform had a diagnosis phase; an intake phase for the initiating party; and then invited the other to join and undertake the same intake process. Once intake is completed, the parties start working on agreements. The dispute resolution model is that of integrative (principled) negotiation [5]. The parties are informed of rules such as those for dividing property, child support and standard arrangements for visiting rights, so that they could agree on the basis of informed consent. Agreements reached are reviewed by a neutral lawyer. If the proposed solutions are not accepted, then couples can employ the system to request a mediator for an additional €360, or a binding decision by an adjudicator. Rechtwijzer is voluntary and non-binding up until the point where the parties seek adjudication. Rechtwijzer had aimed to be self-financing through user contributions. This has not occurred.

2.2 The British Columbia Civil Resolution Tribunal
The British Columbia Civil Resolution Tribunal [11] is the most significant current widely available ODR system that comes closest to providing a full suite of dispute resolution services. It commences by diagnosing the dispute through a decision tree, and provides legal information and tools such as customized letter templates. If this action does not resolve the dispute, one can then apply to the Civil Resolution Tribunal for dispute resolution. The system directs the user to the appropriate application forms. Once the application is accepted, the user enters a secure and confidential negotiation platform, where the disputants can attempt to resolve their dispute. If the parties cannot resolve the dispute, a facilitator will assist. Agreements can be turned into enforceable orders. If negotiation or facilitation does not lead to a resolution, an independent member will make a determination about the dispute.

Currently, the Civil Resolution Tribunal deals with motor vehicle injury disputes, small claims disputes, strata property disputes, societies and cooperative associations disputes, and shared accommodation and some housing disputes. For some of these domains potential litigants can only use the Civil Resolution Tribunal.

To assist digitally disadvantaged litigants, technical support is provided in accessing the Internet. One of the major reasons that the Civil Resolution Tribunal has been so successful is that British Columbia residents are mandated to use the system when dealing with the issues listed above. Whilst such an approach may be seen as novel and discriminatory, it does ensure that the system is used, with relative ease, quickly and at minimal cost. In most cases parties are to represent themselves, even if representation and legal assistance is allowed.

2.3 The Internet Courts in China
Between 2017 and 2018 China created three new courts: the Hangzhou Internet Court, the Beijing Internet Court and the Guangzhou Internet Court. These courts only have material jurisdiction over internet-related cases. The online platform makes an intelligent litigation risk assessment system available to the user and can provide a report synthesising the litigant's case and the corresponding risk based on the analysis of court data and similar cases. Litigation risk assessment aims to help the party without legal knowledge to identify and exclude common litigation risks, thereby reducing unnecessary losses. Meanwhile, the assessment can make the party aware that litigation is risky and costly, and guide the parties to choose ADR or diversified dispute resolution. The system can automatically generate a complaint letter by simply selecting the suitable response options [4].

2.4 Projects in Estonia and Singapore
In July 2019 the Estonian Ministry of Justice launched a project developing AI software to hear and resolve small economic disputes by eliminating human intervention [10]. The "robot judge" is configured to decide disputes of up to 7,000 euros. According to the project, the disputing parties would have to upload their documents and relevant information to a judicial platform. The AI machine renders a decision that can be appealed to a human judge. The project limits its scope to contractual disputes.

Singapore has been committed to digital justice since 2000. In recent years it has been developing a more ambitious online system, initially only for injuries arising from motor vehicle accidents. An outcome simulator will provide guidance to potential claimants, prior to the commencement of proceedings, helping them to decide on offers from insurance companies. The aim is for parties to first use the technology to reach amicable settlements without professional legal advisors [14].

3 A FRAMEWORK FOR BUILDING ONLINE DISPUTE RESOLUTION TOOLS FOR SELF-REPRESENTED LITIGANTS
An increasing phenomenon in Common Law countries is the growing number of pro se (or self-represented) litigants. Landsman [7] argues that pro se cases pose inherent problems: they can cause delays, increase administrative costs, undermine the judges' ability to maintain impartiality, and can leave the often-unsuccessful litigant feeling as though she has been treated unfairly.

Research conducted in the Family Court of Australia shows that there are a range of reasons why people represent themselves, such as funding cuts and changes in eligibility to legal aid [3]. Other contributing factors include changes in technology, cultural shifts towards self-help and self-representation, and changes in

1 https://rechtwijzer.nl/, last viewed 5 February 2021.
196
Making Intelligent Online Dispute Resolution Tools available to Self-Represented Litigants in the Public Justice System ICAIL’21, June 21–25, 2021, São Paulo, Brazil
legislation. Research on the experiences of self-representation in Australian law has generally accepted that SRLs are at a disadvantage in legal proceedings and that their experience of the legal system may indeed be negative. The lack of knowledge or skills of SRLs means that some are not able to access fair and equal justice in a system often geared towards legal representation.

Although in England litigants can go to court without legal aid, in practice the technical and formal nature of proceedings, with the exception of the small claims procedure (for claims up to £10,000), makes legal aid necessary. Its lack has led to public dissatisfaction, but also to frustration among judges, faced with the need to inform lay litigants about the technicalities of the process without being able to cross the line between information and legal aid. This situation has led to a considerable increase in the time and cost spent on each judicial decision, even doubling it².

2 JUSTICE, "Delivering Justice in an Age of Austerity" (April 2015). Available at https://justice.org.uk/justice-age-austerity-2/ last viewed 19 April 2021.

While some SRLs can present their case competently, most research suggests that SRLs struggle with substantive law and procedure [6].

Recent experiences, such as the online court established in Utah, demonstrate that ODR has the potential to transform the way the American legal system deals with pro se litigants and with access-to-justice issues at large. Although it may seem counterintuitive to bridge the justice gap by precluding people from appearing in court, requiring certain types of claims to begin online will actually provide quicker and more accessible legal solutions. As long as the programming and administration of ODR technology are conducted with attention to legal and ethical concerns, pro se litigants will benefit from having their claims resolved online [2]. To this end, access to justice is helped by the use of intelligent, user-centric ODR systems incorporating AI assessment and diagnosis tools [15].

Stranieri et al.'s [13] approach to providing advice about the distribution of marital property following divorce in Australia was to use machine learning to provide advice about BATNAs (a BATNA is used to inform disputants of the likely outcome if the dispute were to be decided by a decision-maker, e.g. a judge, arbitrator or ombudsman). Despite using machine learning, it involved the development of 94 Toulmin argument structures [16] to model the domain as it existed in 1995. Twenty-five years later, the theoretical principles behind machine learning software have not changed, but computer hardware is now much cheaper and data can be much more easily stored. This has led to the development of 'quicker' systems, which the community has seen as 'more intelligent'³.

3 See for example amica.gov.au, which uses machine learning to advise upon property distribution amongst separating couples in Australia.

Whilst the Split-Up system provides advice about BATNAs, the Family Winner system [1] provided advice to disputing parents on how they could best negotiate trade-offs. The disputing parties were asked to indicate how much they valued each item in dispute. Using logrolling, parties obtained what they most desired.

Zeleznikow [20] discusses how it is possible to build ODR systems that support self-represented litigants and what skills self-represented litigants require to use such systems. Zeleznikow [21] considers how we can construct such systems with user-centric computing. So, what are the various types of ODR systems, and how can self-represented litigants use them? Having regard to the vulnerable position of the unrepresented litigant, a truly helpful ODR system should provide the following facilities:

(1) Case management: the system should allow users to enter information, ask them for appropriate data and provide templates to initiate the dispute. Self-represented litigants should be able to initiate the dispute, enter their pertinent data and track what is happening during the dispute, as well as being aware of what documents are required at specific times;

(2) Triaging: the system should provide information on how important it is to act in a timely manner and on where to send the dispute. This may be particularly important in cases of domestic abuse or where there is a potential for children to be kidnapped. Triaging systems are vital for expediting action in high-risk cases;

(3) Advisory tools: the system should provide tools for reality testing: these could include books, articles, reports of cases, copies of legislation and videos; there would also be calculators (such as to advise upon child support) and BATNA advisory systems (to inform disputants of the likely outcome if the dispute were to be decided by a decision-maker, e.g. a judge, arbitrator or ombudsman). Advisory tools, as suggested by Zeleznikow [19], are a vital cog in supporting self-represented litigants. An important associated question is how we can design advisory tools that self-represented litigants can gainfully use. Are the legal concepts behind these tools too difficult for amateurs to understand? How do we construct suitable user interfaces?

(4) Communication tools – for negotiation, mediation, conciliation or facilitation. This could involve shuttle mediation if required. For many ODR providers, the provision of communication tools is their main goal;

(5) Decision support tools – if the disputants cannot resolve their conflict, software using game theory or AI can be used to facilitate trade-offs. Professionals (such as lawyers) can provide useful advice regarding trade-offs. In their absence, suitable decision support tools are vital;

(6) Drafting software: if and once a negotiated agreement is reached, software can be used to draft suitable agreements. Drafting plans (such as parenting plans) once there is an in-principle agreement for the resolution of a dispute is a non-trivial task.

No single dispute is likely to require all six processes. However, the development of such a hybrid ODR system would be very significant. A total system would require us to construct the appropriate systems 1 to 6, and the ultimate solution is to make sure that all the systems are capable of communicating with each other.

4 ETHICAL ISSUES RELATING TO THE PROVISION OF ARTIFICIAL INTELLIGENCE-BASED TOOLS TO SELF-REPRESENTED LITIGANTS BY THE PUBLIC JUSTICE SYSTEM: A EUROPEAN PERSPECTIVE

Neither the recent official documents of the European Union determining how AI should be used in the field of the administration of
ICAIL’21, June 21–25, 2021, São Paulo, Brazil F. Esteban de la Rosa and J. Zeleznikow
justice⁴ nor the European Ethical Charter (EEC) on the use of AI in judicial systems and their environment, adopted in 2018 by the European Commission for the Efficiency of Justice of the Council of Europe, deal directly with the admission of AI tools aimed at enabling the parties to assess their legal position. Because SRLs generally lack legal skills, and in view of the objective to encourage negotiation, we submit that this use of technology for these purposes should be considered high-risk.

4 Among the most recent official documents are the Proposal for a Regulation Laying down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts of 21.4.2021, COM (2021) 206 final; the Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions, "Digitalisation of justice in the European Union. A toolbox of opportunities", COM (2020) 710 final, of 2.12.2020; the European Parliament Resolution of 20 October 2020 with recommendations to the Commission on a framework of ethical aspects of artificial intelligence, robotics and related technologies (2020/2012(INL)); the White Paper on Artificial Intelligence – A European approach to excellence and trust, COM (2020) 65 final, Brussels, 19.2.2020; the European e-Justice Strategy 2019-2023 of 13 March 2019 (2019/C 96/04); and the Digital Revolution in view of Citizens' Needs and Rights, Opinion of the European Economic and Social Committee of 20.02.2019.

The EEC points out that the inherent risks in these technologies may even transcend the act of judging and affect essential functioning elements of the rule of law and judicial systems. These include principles such as the primacy of law. These tools could create a new form of normativity, which could supplement the law by regulating the sovereign discretion of the judge, potentially leading, in the long term, to a standardisation of judicial decisions based no longer on case-by-case reasoning by the courts, but on a pure statistical calculation linked to the average compensation previously awarded by other courts. That is why the report submits a need to consider whether these solutions are compatible with the individual rights enshrined in the European Convention on Human Rights (ECHR). These would include the rights to a fair trial (particularly the right to a natural judge established by law, the right to an independent and impartial tribunal, and equality of arms in judicial proceedings) and, where insufficient care has been taken to protect data communicated in open data, the right to respect for private and family life. Thus the EEC considers that applications of predictive justice should be assigned to the field of research and further development in order to ensure that they fully tie in with actual needs before contemplating use on a significant scale in the public sphere.

The European Commission (EC) recognises that the use of AI applications can bring many benefits, such as making use of information in new and highly efficient ways, and can improve access to justice, including by reducing the duration of judicial proceedings. At the same time, it is aware that the opacity of, or biases embedded in, certain AI applications can also lead to risks and challenges for the respect and effective enforcement of fundamental rights, including in particular the right to an effective remedy and a fair trial. The EC recognises as possibly high-risk the use of the technology as part of decision-making processes with significant effects on the rights of people. However, it also considers that the requirements proposed in the White Paper on increased transparency, human oversight, accuracy and robustness of these systems aim to facilitate their beneficial use, while ensuring that fundamental rights, including non-discrimination based on sex, racial or ethnic origin, religion or belief, disability, age or sexual orientation, are respected, and rule of law and due process principles upheld.

In order to understand the European position, it is also relevant to know the criterion followed by the new proposal for a Regulation of April 2021. AI systems intended for the administration of justice are not listed among the prohibited practices (art. 5) but among the high-risk AI systems (point 40 of the preamble). The new proposal for a Regulation separates two kinds of judicial activities: it considers as high-risk those systems intended to assist judicial authorities in researching and interpreting facts and the law and in applying the law to a concrete set of facts. Such qualification is not extended to AI systems intended for purely ancillary administrative activities that do not affect the actual administration of justice in individual cases. The proposed Regulation does not establish the definitive answer, as any use of AI must continue to occur solely in accordance with the applicable requirements resulting from the European Charter of Fundamental Rights, the rest of European law and national law.

We submit that, in view of the beneficial impact it may have on the functioning of the judicial system, it is necessary to identify the real possibilities, technical limits and safeguards to be met by the machines offered by the public justice system to SRLs.

For specific areas of administrative law it is possible to develop legal rules as code providing useful information and support for SRLs. The use of rules as code, in combination with user-centric ODR tools using decision trees, may succeed in promoting access to justice for SRLs. The CRT in British Columbia is an example of success. The design of AI rule-based systems does not exhibit the difficulties arising from the lack of transparency and the creation of biases that may arise when employing ML induction algorithms. Deductive AI tools (the so-called expert systems) allow transparency, and the monitoring of the machine's output is facilitated, making it possible to rectify what is necessary in case any errors in the programming are discovered. Programming is, however, a delicate process and, if not done well, can lead to unfair treatment when the algorithm doesn't match reality. This can occur when a one-size-fits-all rule is implemented in a complex environment. A recent example is Australia's Centrelink "robodebt" debacle⁵. In that case, welfare payments made on the basis of self-reported fortnightly income were cross-referenced against an estimated fortnightly income, taken as a simple average of annual earnings reported to the Australian Tax Office, and used to auto-generate debt notices without any further human scrutiny or explanation. This assumption is at odds with how Australia's highly casualised workforce is actually paid. For example, a graphic designer who was unable to find work for nine months of the financial year but earned A$12,000 in the three months before June would have had an automated debt raised against her. This is despite no fraud having occurred, and this scenario constituting exactly the kind of hardship Centrelink is designed to address.

5 See https://tinyurl.com/y3dqe6mg last viewed 19 April 2019.

Rules as code requires alterations to be introduced in case of legislative changes. Although it will not be possible to attain the quality of advice offered by a legal expert, we submit that the information provided to SRLs through machines makes a contribution to
improving access to justice for those who cannot afford legal assistance. Regarding the quality of advice provided by these machines, it seems reasonable to follow the proposals of the European Commission about requirements concerning the possible testing of applications and the need to provide relevant documentation on their purposes and functionalities. It also seems reasonable to require maintaining the possibility to correct errors, and providing information to the user that the answer given by the machine may not necessarily match the answer that would be given by a judge hearing the case.

Two of the disadvantages of the use of machine learning systems are that they are not transparent and that the data and the software on which they are based may be manipulated. There is also a concern that the use of machine learning in the legal system will worsen biases against minorities or deepen the divide between those who can afford quality legal assistance and those who cannot [17]. Algorithms will continue to perpetuate existing biases against vulnerable groups because the algorithms are largely copying and amplifying the decision-making trends embedded in the legal system. There is already a class divide in legal access – those who can afford high-quality legal professionals will always have an advantage. The development of intelligent support systems can partially redress this power imbalance by providing users with important legal information that was previously unavailable to them. Difficulties may stem from biases. One example is COMPAS, a decision support system designed to help parole boards in the United States [18] decide which prisoners to release early by providing a probability score of their likelihood of reoffending. Rather than relying on a simple decision rule, the algorithm used a range of inputs, including demographic and survey information, to derive a score. The algorithm did not use race as an explicit variable, but it did embed systemic racism by using variables that were shaped by police and judicial biases.

What can be done is to ensure the traceability and cleanliness of the data with which the machine operates, and to introduce elements of weighting. But, as Richard Susskind considers, ethical programming is not feasible. It is not at all clear, either technically or philosophically, what is meant when it is proposed that ethics should be embedded in machine learning. Nor is it clear what is meant when it is demanded that software engineers program machine learning systems to provide intelligent explanations. To think so is to misunderstand the difference between the inductive processes inherent in machine learning and the kind of argument we expect when we ask for an explanation [14].

A different issue is the use of AI tools by judges to decide a case. We share the European Commission's view that it is important that judgments are delivered by judges who fully understand the AI applications, and all the information taken into account therein, that they might use in their work (AI not as a replacement, but as augmented intelligence), on the understanding that the use of AI applications must not prevent any public body from giving explanations for its decisions. As for the machine being able to decide the case on its own, as the Estonian project poses, this should not be completely ruled out. However, we are not at that stage yet! In the current state of the art, machines can neither motivate nor explain the decisions and predictions they make [14]. Legal arguments require persuasion that does not depend on predictable variables.

5 CONCLUSION

One of the latest trends in the incorporation of technology into the administration of justice is the provision of support to SRLs by public justice systems through a combination of AI and ODR tools. These allow SRLs to obtain a diagnosis of their case, which influences the parties either in determining a dismissal of the action or in how to negotiate. This combination of tools shows great potential in reducing the level and duration of litigation. This paper submits that this use of the technology must be considered high-risk, as it may function as a replacement of judicial activities. However, it is still possible to obtain positive results from this technology by inserting some safeguards, as is beginning to emerge from the European legal sphere. The debate is now about what safeguards are necessary to ensure that the use of high-risk artificial intelligence tools in the field of justice is fully compatible with the rule of law. The implementation and use of this technology should be preceded by the detection and diagnosis of the functioning of justice in specific sectors, so that efforts are made in the areas with the most pressing needs.

REFERENCES
[1] Emilia Bellucci and John Zeleznikow. 2006. Developing Negotiation Decision Support Systems that support mediators: a case study of the Family_Winner system. Journal of Artificial Intelligence and Law 13, 2 (2006), 233–271.
[2] Julianne Dardanes. 2021. When Accessing Justice Requires Absence from the Courthouse: Utah's Online Dispute Resolution Program and the Impact it Will Have on Pro Se Litigants. Pepperdine Dispute Resolution Law Journal 21, 1 (2021).
[3] John Dewar, Barry W. Smith, and Cate Banks. 2000. Litigants in Person in the Family Court of Australia. Research Report No 20. Family Court of Australia, Canberra.
[4] Xuhui Fang. 2018. Recent Development of Internet Courts in China. International Journal on Online Dispute Resolution 5, 1–2 (2018), 49–55.
[5] Roger Fisher and William Ury. 1981. Getting to Yes. Penguin Group, New York.
[6] Hazel Genn and Yvette Genn. 1989. The Effectiveness of Representation at Tribunals. Lord Chancellor's Department.
[7] Stephan Landsman. 2009. The Growing Challenge of Pro Se Litigation. Lewis & Clark Law Review 13 (2009), 439.
[8] Arno Lodder and John Zeleznikow. 2010. Enhanced Dispute Resolution Through the Use of Information Technology. Cambridge University Press.
[9] Robert H. Mnookin and Lewis Kornhauser. 1979. Bargaining in the Shadow of the Law: The Case of Divorce. The Yale Law Journal 88, 5 (1979), 950.
[10] Eric Niiler. 2019. Can AI Be a Fair Judge in Court? Estonia Thinks So. https://www.wired.com/story/can-ai-be-fair-judge-court-estonia-thinks-so/
[11] Shannon Salter and Darin Thompson. 2016. Public-Centred Civil Justice Redesign: A Case Study of the British Columbia Civil Resolution Tribunal. McGill Journal of Dispute Resolution 3 (2016), 113.
[12] Tania Sourdin. 2018. Judge v. Robot: Artificial Intelligence and Judicial Decision-Making. UNSW Law Journal 41 (2018), 1114.
[13] Andrew Stranieri, John Zeleznikow, Mark Gawler, and Bryn Lewis. 1999. A hybrid rule–neural approach for the automation of legal reasoning in the discretionary domain of family law in Australia. Artificial Intelligence and Law 7, 2–3 (1999).
[14] Richard E. Susskind. 2019. Online Courts and the Future of Justice. Oxford University Press.
[15] Darin Thompson. 2015. Creating New Pathways to Justice Using Simple Artificial Intelligence and Online Dispute Resolution. International Journal of Online Dispute Resolution 4 (2015).
[16] Stephen Toulmin. 1958. The Uses of Argument. Cambridge University Press, Cambridge.
[17] Peter K. Yu. 2020. The Algorithmic Divide and Equality in the Age of Artificial Intelligence. Florida Law Review 72 (2020), 331.
[18] Monika Zalnieriute, Lyria Bennett Moses, and George Williams. 2019. The Rule of Law and Automation of Government Decision-Making. The Modern Law Review 82, 3 (2019), 425–455.
[19] John Zeleznikow. 2002. Using Web-based Legal Decision Support Systems to Improve Access to Justice. Information and Communications Technology Law 11, 1 (2002).
[20] John Zeleznikow. 2020. The Challenges of Using Online Dispute Resolution to Support Self-Represented Litigants. Journal of Internet Law 23, 7 (2020).
[21] John Zeleznikow. 2021. Using Artificial Intelligence to Provide Intelligent Dispute Resolution Support. Group Decision and Negotiation (2021). https://doi.org/10.1007/s10726-021-09734-1
Plum2Text: A French Plumitifs–Descriptions Data-to-Text
Dataset for Natural Language Generation
Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel
Laval University, Computer Science Department and Faculty of Law
Québec, Canada
nicolas.garneau@ift.ulaval.ca,eve.gaumond@observatoire-ia.ulaval.ca
luc.lamontagne@ift.ulaval.ca,pierre-luc.deziel@fd.ulaval.ca
ABSTRACT

In this paper, we introduce a new French Data-to-Text (D2T) dataset in the legal domain: Plum2Text¹. It is made out of plumitif (docket file)–description pairs that are derived from publicly available documents issued by Canadian criminal courts. The development of Plum2Text is primarily intended to support the training of statistical natural language generation algorithms, in order to make the plumitifs more easily understandable for Canadian citizens. The inputs and outputs of the dataset are unique: on the data side, the values of the table contain long pieces of textual utterance, and on the text side (or reference), the text most often consists of a paraphrase of the table values. We describe how we curated the plumitif–description associations by introducing an annotation tool and a methodology specific to the D2T natural language generation task. We do so by using simple yet efficient text classifiers to help the annotator leverage annotated examples during the annotation process. As a matter of privacy, we also illustrate how we decontextualize the descriptions.

1 https://nicolas.nlp.quebec/files/icail_2021_plum2text.jsonl

CCS CONCEPTS

• Applied computing → Law; Annotation; • Computing methodologies → Language resources; Natural language generation.

KEYWORDS

Legal Language Resource, Natural Language Generation, Annotation Methodology

ACM Reference Format:
Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel. 2021. Plum2Text: A French Plumitifs–Descriptions Data-to-Text Dataset for Natural Language Generation. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466148

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466148

1 INTRODUCTION

D2T generation [12, 18] is a specialized task of natural language generation (NLG) in which a model takes as input semi-structured data (e.g. a table 𝑡) and generates a textual utterance (namely the hypothesis ℎ) that is syntactically and semantically faithful to both the structured input and one or possibly several textual references 𝑟. In this paper, we propose Plum2Text, a new French D2T dataset in the legal domain, rather different from those previously introduced in the literature. The dataset was built from Quebec's plumitifs, which are legal documents that lie in the same family as court dockets. The plumitifs are textual summaries, written in a structured format, which present all the steps of a judicial case. They also provide information about the parties' identity, the jurisdiction in charge of administering the case, and some information relating to the nature and the course of the proceeding, as illustrated in Figure 1².

2 All examples (except screenshots) in this paper have been translated from French to English to facilitate understanding.

Figure 1: A plumitif example illustrating the accused's and plaintiff's personal information, along with charges and associated pleas, decisions and penalty.

It has been shown that the plumitifs lack intelligibility [1, 19]. Indeed, these files are written by clerks and contain several abbreviations and references to the Criminal Code's provisions, making them difficult for litigants to understand. Beauchemin et al. [1] attempted to generate plumitif descriptions using a rule-based system. They found out, however, that their model could hardly generalize to plumitifs from other judicial districts or with slight differences in the way they were organized. But having only the plumitifs in
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel
hand, they could not train a statistical model for natural language generation in order to alleviate this problem. We wish to solve this issue by working with court judgments. One way to describe judgments is that they are in a sense a longer version of the plumitif; they thoroughly explain what is written in the plumitif. By pairing plumitifs with their court judgments, we created Plum2Text, a D2T dataset that can be used to train a statistical language generation model. It also allowed us to reframe the problem into one that can be solved by D2T natural language solutions. The input and output components of Plum2Text are unique: the values of the table contain long pieces of text (e.g. paragraphs from the law) and the references are mostly paraphrases of the table content. In the next sections, we compare Plum2Text with other standard D2T datasets known to the community. We also position our contribution amongst other datasets in the legal field. We provide an at-length explanation of the annotation process we used to create Plum2Text and present the tools we designed for this task. We hope that our methodology will encourage the creation of many other datasets, not only in the legal field but also in different D2T domains.

2 RELATED WORK

Since we are introducing a new D2T dataset in the legal domain, we first introduce standard D2T datasets on general-purpose domains (e.g. Wikipedia) known to the community, and further describe what has been proposed in the legal field regarding different natural language processing tasks.

2.1 Data-to-Text Datasets

There is a handful of D2T datasets and each of them has its specifics.

summaries paired with their corresponding statistic tables. The dataset provides 4.9K examples containing on average 630 records, with longer associated texts of 340 tokens on average. Wang [21] extended RotoWire into RotoWire-FG, adding 50% more data and enriched input tables. Wang [21] observed that only 60% of the summary contents can be grounded in the table records and thus proposed a purified version of RotoWire. Thomson et al. [20] refined RotoWire-FG even further by providing more attributes across multiple dimensions, increasing the content overlap between statistic tables and reference texts. Dušek et al. [8] proposed the E2E challenge dataset, a crowdsourced dataset of 50K examples in the restaurant domain, as well as a cleaned version of it [7]. Inputs, using the Meaning Representation format, contain 3–8 attributes and an average of 20 tokens per reference text. Finally, Parikh et al. [15] proposed an open-domain D2T dataset, ToTTo, covering a wide range of topics. It contains over 120K examples of Wikipedia tables along with one-sentence descriptions. They extend their dataset with highlighted cells, offering better control for generation. On average, a given input contains 4 attributes and the reference text has 20 tokens. While this list of D2T datasets is not exhaustive, it illustrates the diverse nature of the data, which is tightly coupled to the task of D2T NLG. We aggregated the datasets' statistics in Table 1.

Table 1: Statistics about the different D2T datasets introduced in Section 2.1: WikiBio, WebNLG, RotoWire, RotoWire as augmented and purified by Wang [21], E2E, ToTTo, and our dataset, Plum2Text. For each dataset, we present the number of examples (NE) and the average number of input attributes (NA), along with the average number of tokens for the inputs (A-Avg) and the references (R-Avg).

Datasets     NE     NA    A-Avg  R-Avg
WikiBio      728K   20    3      26
WebNLG       22K    1-7   5      25
RotoWire     4.9K   630   1      340
RotoWireW+   7.5K   630   1      206
E2E          50K    3-8   15     20
ToTTo        120K   4     2      20
Plum2Text    2.5K   2-9   61     50

2.2 Datasets in the Legal Domain

Legal NLP is an emerging field, and so is the creation of datasets supporting its recent advances. While the following list of resources is not exhaustive, we selected the ones most related to our contribution. Kano et al. [11] introduced in 2018 the COLIEE legal case retrieval task, along with a dataset drawn from an existing collection of case law, primarily from the Federal Court of Canada. In 2019, Rabelo et al. [17] introduced the statute law competition data corpus, where the questions (for the question answering task) were drawn from Japanese legal bar exams. They also proposed three new tasks: legal case entailment, statute law retrieval, and legal question answering.

Xiao et al. [24] introduced the Chinese AI and Law challenge dataset (CAIL2018), the first large-scale Chinese legal dataset for judgment prediction. Still in Chinese, Duan et al. [6] introduced the Chinese judicial reading comprehension (CJRC) dataset. More recently, Chalkidis et al. [3] proposed a large-scale multi-label text classification dataset on European Union legislation. In the French Canadian spectrum, Westermann et al. [22] introduced two datasets drawn from 1 million written judgments from the Régie du logement du Québec, where they used factors to predict and analyze landlord–tenant judgments and to create a chatbot out of it. Cumyn et al. [4] proposed an annotated set of 2,500 judgments using a
that we further expose in this section. Lebret et al. [13] introduced
faceted scheme with the objective of improving the performance
WikiBio, a large-scale fact table of biographical sentences extracted
of legal search engines. Closer to the NLG task, Bhattacharya et al.
from Wikipedia. This dataset has 728K examples, each contain-
[2] studied several summarization techniques on a large set of
ing on average 20 records and 26 description tokens. In a similar
Indian Supreme Court judgments. Ye et al. [25] proposed a dataset
vein, Gardent et al. [9] introduced the WebNLG dataset consisting
where a generative model learns to generate court views from fact
of 22K RDF–text pairs extracted from DBPedia 3 . An input may
descriptions in Chinese. They framed the generation problem as
contain up to seven RDF triples and the average text length is 25
text summarization. From that point of view, Plum2Text is a rather
tokens. Wiseman et al. [23] introduced RotoWire, BasketBall game
unique new dataset in the legal field since it tackles a new task (i.e.
3 https://wiki.dbpedia.org/ D2T generation) and is in French.
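To make the D2T setting concrete: each example pairs an input table of attribute–value pairs with a reference text, and per-dataset statistics of the kind aggregated in Table 1 (example count, attributes per input, average reference length) can be computed directly from such pairs. A minimal Python sketch; the two E2E-style pairs below are invented for illustration only:

```python
# Sketch: computing D2T corpus statistics of the kind reported in Table 1.
# The two example pairs are invented, E2E-style illustrations.
corpus = [
    # (input table: attribute -> value, reference text)
    ({"name": "The Golden Curry", "food": "Italian", "area": "riverside"},
     "The Golden Curry serves Italian food in the riverside area."),
    ({"name": "Blue Spice", "priceRange": "cheap"},
     "Blue Spice is a cheap restaurant."),
]

def corpus_stats(pairs):
    """Return example count, attribute range, and mean reference length."""
    n_attrs = [len(table) for table, _ in pairs]
    ref_lens = [len(text.split()) for _, text in pairs]
    return {
        "examples": len(pairs),
        "attributes": (min(n_attrs), max(n_attrs)),
        "avg_reference_tokens": sum(ref_lens) / len(ref_lens),
    }

print(corpus_stats(corpus))
```

On a real corpus such as E2E, the same computation would yield figures like the 3-8 attribute range and roughly 20-token references reported above.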
Plum2Text: A French Plumitifs–Descriptions Data-to-Text Dataset for Natural Language Generation ICAIL’21, June 21–25, 2021, São Paulo, Brazil
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel
Figure 3: The accusation–paragraphs association interface. The set of accusations is displayed on the left and the whole decision on the right. The relevant paragraph classifier scores every paragraph, and those with a high probability of being relevant are displayed at the top of the interface, in the Relevant section.
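The ranking behaviour described in the caption — every paragraph is scored, and high-probability paragraphs are surfaced in a Relevant section — can be sketched as follows. The `score` function here is a stand-in for the trained relevant-paragraph classifier (a toy keyword heuristic, not the paper's spaCy model):

```python
# Sketch of the interface's ranking logic: a classifier assigns each
# paragraph a relevance probability, and paragraphs above a threshold
# are shown first in a "Relevant" section.
def score(paragraph):
    # Stand-in for the trained relevant-paragraph classifier:
    # a toy keyword heuristic, NOT the model used in the paper.
    keywords = ("guilty", "sentence", "accused")
    hits = sum(word in paragraph.lower() for word in keywords)
    return min(1.0, hits / 2)

def rank_paragraphs(paragraphs, threshold=0.5):
    """Partition paragraphs into (relevant, others), highest scores first."""
    scored = sorted(((score(p), p) for p in paragraphs), reverse=True)
    relevant = [p for s, p in scored if s >= threshold]
    others = [p for s, p in scored if s < threshold]
    return relevant, others

decision = [
    "The hearing took place on a rainy day.",
    "The accused pleaded guilty and the sentence was pronounced.",
]
relevant, others = rank_paragraphs(decision)
```

In the actual annotation tool, the scores come from the trained binary classifier, but the display logic is the same two-way split.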
RPC, we trained a standard binary text classifier from the spaCy library [10]. We used 1,000 relevant paragraphs and 1,000 irrelevant ones, randomly sampled from the judgments. On the test set, the paragraph classifier obtains 85% accuracy.

3.3 Decontextualization of the Paragraphs
In this section, we present how we decontextualize the paragraphs (also known as delexicalization, see [18]). Paul Panenghat et al. [16] showed that delexicalizing a dataset not only removes bias but also improves out-of-domain portability, for instance, in our case, from criminal to civil law. Our decontextualization process is motivated by the underlying task Plum2Text is designed for, i.e. generating an intelligible description of a plumitif. With that in mind, we remove not only personal information but also any information in the paragraph that cannot be found in the plumitif.

3.3.1 Automatic Depersonalization. Following the argumentation of Beauchemin et al. [1], we deemed it essential to remove any personal information that the judgment's paragraphs may hold, especially in the case of releasing this dataset. While the main concern here is privacy, it also greatly reduces the vocabulary size by removing rare tokens, and hopefully will improve the performance of NLG models. The first step of our decontextualization process is to automatically remove all names as well as all information describing places or organizations (e.g. police department). A pre-trained named entity recognition (NER) model [10] is used for this purpose. We also replace dates using regular expressions.

3.3.2 Information Specific to the Case. To further decontextualize the paragraphs, we manually remove information specific to the case. More concretely, we went through every paragraph and, to make sure that people's privacy is well protected:
(1) Replace names, locations, and organizations that the NER model may have missed. For example, to preserve the privacy of certain parties according to the order restricting publication (486.4 (1) C.Cr.), names are often elided as such: J... D... for John Doe. A pre-trained NER model does not catch this kind of elision and leaves unnecessary noise within the dataset. It had to be done manually.
(2) Remove contextual information specific to the crime that was perpetrated. For example, one accused may have assaulted his neighbor, which is very specific to the case. Furthermore, we remove information that is not supported by the table t, such as "after a trial of five days".
(3) Remove gender and numbers⁷ related to the accused. As such, we replace the French feminine version of "accusée" (accused) with "accusé". One example of an accusation involving several defendants is conspiracy, where "Person 1 and Person 2 conspired...". We thus replace "Person 1 and Person 2" with "The accused".
(4) Remove information describing the victims, such as "Person X, only 9 years old" or "Person X, a woman". We also normalize amounts, such as X$ or X kilos of cocaine, regarding crimes committed against the Controlled Drugs and Substances Act.

⁷ One case may be regarding several defendants.

Decontextualization reduced the vocabulary size from 2,144 to 1,464 token types. We should also mention that we retrieve the sections' text from their corresponding law. Compared to the D2T datasets introduced in Section 2.1, Plum2Text is particularly interesting as several values from the table are sentences or even text paragraphs (the laws), as illustrated in Figure 4. Also, the resulting annotated
dataset contains on average 61 tokens per input value, whereas a typical D2T dataset usually has 1 to 5 tokens, as depicted in Table 1. Furthermore, a table has on average 5 associated references with overlapping table values. These characteristics pose some challenges for both the generation of a description and its evaluation.

Figure 4: A translated example of a Plumitif–Paragraph pair from our dataset. The highlighted information in the boxes illustrates the presence of paraphrasing within Plum2Text.

4 CONCLUSION
In this paper, we introduced a new French D2T dataset in the legal field, Plum2Text. We thoroughly present how we created and annotated this dataset, by introducing a methodology that we believe will help the research community. The creation of the dataset presented in this paper is a stepping stone in the development of a web application aiming at making plumitifs – a legal document providing a summary of a given judicial case – more easily understandable. As explained by [1], enhancing the intelligibility of plumitifs fosters the right to access judicial information, which is a hallmark of Canadian democracy (and that of many other countries as well). In future work, we plan to train a statistical NLG model on Plum2Text and enable intelligible descriptions of plumitifs at scale.

Acknowledgements
We thank the reviewers for their insightful comments. This research was funded by the Natural Sciences and Engineering Research Council of Canada and the Social Sciences and Humanities Research Council of Canada.

REFERENCES
[1] David Beauchemin, Nicolas Garneau, Eve Gaumond, Pierre-Luc Déziel, Richard Khoury, and Luc Lamontagne. 2020. Generating Intelligible Plumitifs Descriptions: Use Case Application with Ethical Considerations. In Proceedings of the 13th International Conference on Natural Language Generation. Association for Computational Linguistics, Dublin, Ireland, 15–21. https://www.aclweb.org/anthology/2020.inlg-1.3
[2] P. Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and S. Ghosh. 2019. A Comparative Study of Summarization Algorithms Applied to Legal Case Judgments. In ECIR.
[3] Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2019. Large-Scale Multi-Label Text Classification on EU Legislation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6314–6322. https://doi.org/10.18653/v1/P19-1636
[4] Michelle Cumyn, Günter Reiner, S. Mas, and David Lesieur. 2019. Legal Knowledge Representation Using a Faceted Scheme. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law (2019).
[5] Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4884–4895. https://doi.org/10.18653/v1/P19-1483
[6] X. Duan, Baoxin Wang, Ziyue Wang, Wentao Ma, Yiming Cui, D. Wu, S. Wang, T. Liu, Tianxiang Huo, Z. Hu, Heng Wang, and Z. Liu. 2019. CJRC: A Reliable Human-Annotated Benchmark DataSet for Chinese Judicial Reading Comprehension. ArXiv abs/1912.09156 (2019).
[7] Ondřej Dušek, David M. Howcroft, and Verena Rieser. 2019. Semantic Noise Matters for Neural Natural Language Generation. In Proceedings of the 12th International Conference on Natural Language Generation. Association for Computational Linguistics, Tokyo, Japan, 421–426. https://doi.org/10.18653/v1/W19-8652
[8] Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge. Computer Speech & Language 59 (Jan. 2020), 123–156. https://doi.org/10.1016/j.csl.2019.06.009
[9] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 179–188. https://doi.org/10.18653/v1/P17-1017
[10] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. (2017). To appear.
[11] Yoshinobu Kano, Miyoung Kim, M. Yoshioka, Yao Lu, J. Rabelo, Naoki Kiyota, R. Goebel, and K. Satoh. 2018. COLIEE-2018: Evaluation of the Competition on Legal Information Extraction and Entailment. In JSAI-isAI Workshops.
[12] K. Kukich. 1983. Design of a Knowledge-Based Report Generator. In ACL.
[13] Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 1203–1213. https://doi.org/10.18653/v1/D16-1128
[14] Paul Ohm. 2009. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA L. Rev. 57 (2009), 1701.
[15] Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A Controlled Table-To-Text Generation Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 1173–1186. https://doi.org/10.18653/v1/2020.emnlp-main.89
[16] Mithun Paul Panenghat, Sandeep Suntwal, Faiz Rafique, Rebecca Sharp, and Mihai Surdeanu. 2020. Towards the Necessity for Debiasing Natural Language Inference Datasets. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 6883–6888. https://www.aclweb.org/anthology/2020.lrec-1.850
[17] J. Rabelo, Miyoung Kim, R. Goebel, M. Yoshioka, Yoshinobu Kano, and K. Satoh. 2019. A Summary of the COLIEE 2019 Competition. In JSAI-isAI Workshops.
[18] Ehud Reiter and R. Dale. 1997. Building applied natural language generation systems. Nat. Lang. Eng. 3 (1997), 57–87.
[19] Sandrine Prom Tep, Florence Millerand, Alexandra Parada, Alexandra Bahary, Pierre Noreau, and Anne-Marie Santorineos. 2019. Legal Information in Digital Form: the Challenge of Accessing Computerized Court Records. IJR 8 (2019).
[20] Craig Thomson, Ehud Reiter, and Somayajulu Sripada. 2020. SportSett: Basketball – A robust and maintainable dataset for Natural Language Generation. In IntelLanG: Intelligent Information Processing and Natural Language Generation. https://intellang.github.io/
[21] Hongmin Wang. 2019. Revisiting Challenges in Data-to-Text Generation with Fact Grounding. In Proceedings of the 12th International Conference on Natural Language Generation. Association for Computational Linguistics, Tokyo, Japan, 311–322. https://doi.org/10.18653/v1/W19-8639
[22] Hannes Westermann, V. Walker, K. Ashley, and Karim Benyekhlef. 2019. Using Factors to Predict and Analyze Landlord-Tenant Decisions to Increase Access to Justice. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law (2019).
[23] Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in Data-to-Document Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 2253–2263. https://doi.org/10.18653/v1/D17-1239
[24] Chaojun Xiao, Haoxi Zhong, Z. Guo, Cunchao Tu, Zhiyuan Liu, M. Sun, Yansong Feng, Xianpei Han, Z. Hu, Heng Wang, and J. Xu. 2018. CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. ArXiv abs/1807.02478 (2018).
[25] Hai Ye, Xin Jiang, Zhunchen Luo, and W. Chao. 2018. Interpretable Charge Predictions for Criminal Cases: Learning to Generate Court Views from Fact Descriptions. ArXiv abs/1802.08504 (2018).
Anonymization of German Legal Court Rulings
Ingo Glaser Tom Schamberger Florian Matthes
Technical University of Munich Technical University of Munich Technical University of Munich
Garching bei München, Germany Garching bei München, Germany Garching bei München, Germany
ingo.glaser@tum.de tom.schamberger@tum.de florian.matthes@tum.de
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Glaser et al.
cause harm or engender undesirable personal or legal repercussions [7].

In order to anonymize documents, sensitive information has to be identified first, before the detected entities can be pseudonymized or neutralized. According to the definition of sensitivity, this information must contain at least one direct reference to an object or a juristic person outside the context of the document. In order to preserve the meaning of the text, it is necessary to replace references referring to the same real object or person in such a way that the connection within the text stays intact.

The most challenging aspect of anonymization is the identification step. In this work, the anonymization problem is reduced to a text sequence classification task referred to as contextual sensitivity prediction. It is not sufficient to only detect references, because in legal documents such as court decisions, many references are insensitive. Instead, the sensitivity of references in legal documents is assumed to mainly depend on the textual context. For instance, court names reference real-world objects, but they are insensitive and contain useful information for reviewers. On the other hand, references to expert witnesses must not be exposed, as they are highly sensitive.

In order to underline the importance of supporting the anonymization of German legal court rulings, we conducted expert interviews. In knowledge-intensive and thus highly individual domains such as the legal domain, it is indispensable to first understand the current, non-digital process. Therefore, we also used the interviews to capture the current state of anonymization practice.

We contacted 18 courts with a request for an expert interview. Courts from various jurisdictions were deliberately included. Eight courts agreed to support our research and provided appropriate contacts: one financial court, one social court, one administrative court, and five courts from the ordinary jurisdiction. The latter were distributed across all instances (district court to higher regional court). Furthermore, we also interviewed the Bavarian Ministry of Justice.

One of the major conclusions which emerged during the interviews was that only two courts actually utilize standardized guidelines for the anonymization process. It is important to note that the other courts do not, of course, anonymize indiscriminately, but rely much more on subject matter expertise. Therefore, it was important in these cases to understand the exact processes involved in anonymization, particularly the crucial entity types.

Another interesting aspect, which builds upon the absence of guidelines, is that these entities vary not only between the different jurisdictions, but also within the ordinary courts. While it seems obvious that decisions of different court types require different entity types to be anonymized, it may not be obvious at first glance why this applies even within ordinary courts. However, such courts consist of multiple senates handling different topics. In general, it is important that a judgment is still understandable after anonymization and, above all, that the judicial decision is comprehensible. We therefore hypothesized that an ML model must be specifically trained for the jurisdiction at hand and cannot be utilized as a general-purpose model.

The replacement of sensitive information is a crucial part of the anonymization method. Therefore, it was important to understand the methods used in practice. However, it turned out that, while the applied methods differ, the actual usage is flexible. While the courts use different software to manage trials and store the actual documents, including court rulings, all courts utilize Microsoft Word for the actual anonymization. In fact, the "tool" at hand is the search-and-replace functionality. This makes us believe that an approach such as the one proposed in this paper will be highly beneficial.

4 METHODOLOGY
4.1 Anonymization of German Legal Court Rulings
We introduce an anonymization method for court rulings trained solely on anonymized data. This approach is based on the idea that the sensitivity of an entity does not depend on the specific entity itself, but only on its type and context. In this paper, we developed a machine learning approach which classifies the sensitivity of entities in legal documents based on their context. These entities have been pre-selected by a general-purpose NER.

The dataset used for model training consists of 1,400 anonymized German court rulings of the state court in Munich (Landgericht Munich) that have been published in recent years.¹ In each ruling, each sensitive reference has previously been removed and replaced with a placeholder for publication by the courts. Thus, the anonymized entities, i.e. placeholders, could be detected using a rule-based algorithm, as discussed below.

Table 1 summarizes the most important information about the pre-processed training corpus, excluding the test set of 180 documents. In sum, those documents contain about 35,000 anonymization placeholders that are positively labeled for model training. However, due to the large variety of different reference types and different token lengths, the amount of data is considerably small. This especially impacts rare references like authorizations and bank accounts. Anonymization mistakes in documents aggravate this issue even further.

Table 1: Information summary of the pre-processed training corpus

Document count            1,220
Word count                4,181,266
Placeholder count         33,779
Average tokens per word   1.8

The placeholder detection is a rule-based classification algorithm which takes paragraphs of anonymized legal documents and labels the anonymization placeholders within the paragraph. The algorithm scans words using a sliding window of 3 consecutive text elements. Each triplet of text elements consists of a predecessor, an anonymization candidate, and a successor. Two different types of placeholders are distinguished using regular expressions: obvious and potential placeholders. Obvious placeholders, e.g. "Xxxx", meet strong criteria and represent placeholders that are specially marked by the author, e.g. '"E."'. Potential placeholders may be interpreted as placeholders if viewed outside the context, but may alternatively possess one of the following meanings: (1) omission within citations (e.g. testimonies), (2) abbreviation (e.g. "i.d.R."), (3) reference to pages or […]

¹ Source: https://www.gesetze-bayern.de
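The sliding-window detection just described can be sketched as below. The regular expressions for the two placeholder classes are illustrative assumptions (the paper does not list its exact patterns), and the quotation check stands in for the kind of context rule the algorithm applies:

```python
import re

# Sketch of rule-based placeholder detection over a tokenized paragraph.
# Patterns below are illustrative, not the authors' actual expressions.
OBVIOUS = re.compile(r"^X[xX]+$")               # specially marked, e.g. "Xxxx"
POTENTIAL = re.compile(r"^[A-Z]\.$|^\.\.\.$")   # ambiguous, e.g. "E." or "..."

def detect_placeholders(tokens):
    """Label each token True if it looks like an anonymization placeholder."""
    labels = [False] * len(tokens)
    # Slide a window of 3: (predecessor, candidate, successor).
    padded = [""] + list(tokens) + [""]
    for i, (prev, cand, nxt) in enumerate(zip(padded, padded[1:], padded[2:])):
        if OBVIOUS.match(cand):
            labels[i] = True
        elif POTENTIAL.match(cand):
            # Example context rule: "..." inside a quotation is treated as an
            # omission within a citation, not as a placeholder.
            if prev != '"' and nxt != '"':
                labels[i] = True
    return labels

print(detect_placeholders(["Der", "Zeuge", "Xxxx", "sagte", "..."]))
```

The window gives each candidate access to its immediate neighbours, which is what lets ambiguous tokens such as abbreviations or citation omissions be filtered out by context.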
using Python and the NumPy library. For model training, we used the Adam optimizer [5] with a decaying learning rate. The dataset was split into a training set of 1,220 documents and a test set of the remaining 180 documents. The model was then trained for 4 epochs, i.e. the whole training dataset was iterated 4 times. Beyond the fourth epoch, the model's performance on the test set decreased, indicating over-fitting.

[Figure: model architecture — masked input tokens are encoded by BERT, passed through a Bi-LSTM layer, and classified per token as sensitive (A) or non-sensitive (nA); training minimizes a weighted binary cross-entropy loss.]
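The weighted binary cross-entropy loss named in the figure can be written out directly. A dependency-free sketch, where the positive-class weight `w_pos` is a value we assume for illustration (the paper does not state its weighting):

```python
import math

def weighted_bce(y_true, y_pred, w_pos=2.0, eps=1e-12):
    """Mean weighted binary cross-entropy over per-token labels.

    y_true: 1 = sensitive (anonymize, "A"), 0 = non-sensitive ("nA").
    y_pred: predicted probabilities in [0, 1].
    w_pos:  assumed weight on the positive (sensitive) class.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(w_pos * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

Up-weighting positives penalizes a missed sensitive token more heavily than a false alarm, which matches the asymmetric cost of anonymization errors.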
the high variety of different anonymization standards practiced by different types of courts. Entities such as financials or process dates are often classified as insensitive by a district court, but frequently as sensitive by a financial court. Due to the lack of published financial rulings, the training corpus has been mainly composed of more general rulings, which explains the drop in evaluation performance (recall).

During interviews at the courts, the anonymization web tool was demonstrated, and results from randomly sampled court rulings were presented to court employees specialized in the manual anonymization of court documents. The feedback has been consistently positive, since most employees considered the anonymization of legal documents an unpleasant task. Nevertheless, most interviewees criticized that the error rates were not low enough to enable unsupervised automation. A small portion of the interviewed workers also noted as a drawback that no manual configuration of the algorithm could be done after the model had been trained. However, this seems to be a rather minor concern, as noted by the court director of the same court.

One important module of the presented anonymization approach is the detection of placeholders in the anonymized training data. Poor performance of this rule-based algorithm may lead to low model performance. The algorithm has been evaluated using the Munich district court dataset as used for the model evaluation. The placeholder detection module performed with a precision of 99.9%, a recall of 98.0%, and an accuracy of 99.9%. These results show that rule-based algorithms, as described in Section 4.1, are capable of delivering sufficient performance to pre-process anonymized legal data for anonymization models.

5.3 Pseudonymization
The pseudonymization procedure was evaluated using the Munich financial court rulings, since the contained documents were in the original format, including headers and footers. The evaluation revealed that all sensitive entities were filtered as intended (100%). Thereby, the specialized named entity recognizer correctly recognized 72.8% of the types of sensitive entities, which enables automatic pseudonymization for a majority of entities.

Considering that the remaining entity types may be manually recovered using the anonymized documents, the proposed pseudonymization approach provides a considerable basis for the creation of large datasets resembling original legal data without the necessity for privacy considerations. However, the ratio of correctly identified entity types can still be further improved by more advanced models.

The pseudonymization web tool has been demonstrated during interviews at the courts with samples of court rulings from the financial court in Munich. The samples had already been pseudonymized manually, because the rulings contained original data. The feedback was generally very positive, and courts finally agreed to share data pseudonymized using this tool. Most court employees found the tool simple to work with and appreciated the feature to upload whole archives instead of individual documents, which requires more manual work.

6 CONCLUSION & OUTLOOK
After identifying the problem of anonymization of legal documents, we verified, by means of expert interviews, that automatic legal anonymization is a highly desirable tool for increasing the quantity of publicly available German legal data sets.

For this purpose, multiple deep learning architectures were trained using state-of-the-art, generally pre-trained contextual embeddings. Furthermore, a rule-based placeholder detection algorithm was developed and validated in order to label anonymization placeholders in anonymized legal documents. Due to the difference between training and evaluation data, a "validation-test gap" issue was introduced, which caused a drop in model performance on the non-anonymized test set. This issue was addressed using regularization methods such as input masking and dropout layers. The models were evaluated on a manually created test document corpus. Thereby, we found that purely contextual classification cannot distinguish between named entities and entities that refer to named entities within the document. No model reached both high recall and high precision on the direct sequence classification task. Nevertheless, in combination with a generally trained NER model, the feature-based BERT tuning approach using stacked BiLSTM-RNNs delivered promising results, but a specialized NER model supporting more reference types is required.

In order to achieve high performance on the anonymization task, it is inevitable to utilize original, non-anonymized training data. Hence, we also proposed a system to automatically pseudonymize legal court ruling datasets that produces pairs of original and anonymized court rulings.

To conclude, the anonymization of German legal documents remains a complex problem, and more data is required in order to build fully autonomous anonymization systems. Nonetheless, contextual sensitivity classification represents an important foundation for future anonymization systems.

REFERENCES
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
[2] Francisco Manuel Carvalho Dias. 2016. Multilingual automated text anonymization. Instituto Superior Técnico of Lisboa (2016).
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[4] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-term Memory. https://www.researchgate.net/publication/13853244_Long_Short-term_Memory
[5] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. https://arxiv.org/pdf/1412.6980.pdf
[6] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016).
[7] Ben Medlock. 2006. An Introduction to NLP-based Textual Anonymisation. In LREC. Citeseer, 1051–1056.
[8] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[9] Manavalan Saravanan, Balaraman Ravindran, and Shivani Raman. 2009. Improving legal information retrieval using an ontological framework. Artificial Intelligence and Law 17, 2 (2009), 101–124.
[10] Latanya Sweeney. 1996. Replacing personally-identifying information in medical records, the Scrub system. In Proceedings of the AMIA Annual Fall Symposium. American Medical Informatics Association, 333.
[11] Amund Tveit, Ole Edsberg, TB Rost, Arild Faxvaag, O Nytro, T Nordgard, Martin Thorsen Ranang, and Anders Grimsmo. 2004. Anonymization of general practitioner medical records. In Proceedings of the Second HelsIT Conference.
[12] Marc van Opijnen, Ginevra Peruginelli, Eleni Kefali, and Monica Palmirani. 2017. On-line publication of court decisions in the EU: Report of the policy group of the project 'Building on the European Case Law Identifier'. Available at SSRN 3088495 (2017).
Enhancing a Recidivism Prediction Tool With Machine Learning:
Effectiveness and Algorithmic Fairness
Marzieh Karimi-Haghighi Carlos Castillo
Universitat Pompeu Fabra Universitat Pompeu Fabra
marzieh.karimihaghighi@upf.edu carlos.castillo@upf.edu
ABSTRACT
This paper addresses a key application of Machine Learning (ML) in the legal domain, studying how ML may be used to increase the effectiveness of a criminal recidivism risk assessment tool named RisCanvi, without introducing undue biases. The two key dimensions of this analysis are predictive accuracy and algorithmic fairness. The ML-based prediction models obtained in this study are more accurate at predicting criminal recidivism than the manually-created formula used in RisCanvi, achieving an AUC of 0.76 and 0.73 in predicting violent and general recidivism respectively. However, the improvements are small, and it is noticed that algorithmic discrimination can easily be introduced between groups such as national vs foreigner, or young vs old. It is described how effectiveness and algorithmic fairness objectives can be balanced, applying a method in which a single error disparity in terms of generalized false positive rate is minimized while calibration is maintained across groups. The obtained results show that this bias mitigation procedure can substantially reduce generalized false positive rate disparities across multiple groups. Based on these results, it is proposed that ML-based criminal recidivism risk prediction should not be introduced without applying algorithmic bias mitigation procedures.

CCS CONCEPTS
• Computing methodologies → Supervised learning by classification.

KEYWORDS
criminal recidivism, risk assessment, algorithmic fairness

ACM Reference Format:
Marzieh Karimi-Haghighi and Carlos Castillo. 2021. Enhancing a Recidivism Prediction Tool With Machine Learning: Effectiveness and Algorithmic Fairness. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466150

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06. . . $15.00
https://doi.org/10.1145/3462757.3466150

1 INTRODUCTION
Risk assessment is a necessary process in many important decisions such as public health, information security, project management, auditing, and criminal justice. Since the 1920s, violence risk assessment tools have been progressively used in criminal justice by probation and parole officers, police, and psychologists to assess the risk of harm and of sexual, criminal, and violent offending in more than 44 countries [22, 32]. In comparison to traditional prediction methods and unstructured clinical judgments, risk assessment tools offer superior accuracy and performance [18]. In this regard, factors such as the availability of large databases, inexpensive computing power, and developments in statistics and computer science have brought an increase in the accuracy and applicability of these structured tools [3]. Such advances have effectively increased the use of tools based on Machine Learning (ML) in criminal justice decisions for risk forecasting [4, 7, 8]. Today, various semi-structured protocols for assessing the risk of recidivism can be found in different countries, including the U.S. [16], the U.K. [21], Canada [24], Austria [30], and Germany [13]. In Spain, among current violence risk assessment tools including SAVRY, PCL-R, HCR-20, SVR-20, and SARA, RisCanvi is a relatively new tool for risk assessment of recidivism. It was originally developed in 2009 in response to concerns of Catalan prison system officials regarding violent recidivism among offenders after their sentences.

Research contribution. In this study, the effectiveness and algorithmic fairness of the RisCanvi risk assessment tool are evaluated in comparison to ML models such as logistic regression, perceptron, and support-vector machines in violent and general recidivism prediction. The effectiveness of the ML models is evaluated and compared to RisCanvi in terms of various metrics, including AUC, Generalized False Positive Rate (GFPR), and Generalized False Negative Rate (GFNR). Potential algorithmic bias introduced by the ML methods is also evaluated in both violent and general recidivism prediction. Given that model learning may lead to unfairness [11, 12, 34], the impact of the obtained ML models is compared along nationality (national origin vs foreign origin) and age (young vs old). Some of the observed differences are then addressed through a mitigation procedure [29], which tries to equalize GFPR across nationality and age groups while preserving the calibration in each group.

The rest of this paper is organized as follows. Section 2 outlines related work. In Section 3, the RisCanvi risk assessment tool and the dataset used in this study are described. The methodology, including the ML models and the algorithmic fairness analysis, is presented in Section 4. Results are given in Section 5, and a procedure to mitigate algorithmic discrimination is applied in Section 6. Finally, the results are discussed and the paper is concluded in Section 7.

2 RELATED WORK
The introduction of algorithms for risk assessment in criminal justice is a controversial topic, and perhaps one that has motivated a great deal of research on algorithmic fairness.
In seminal research done by the investigative journalism organization ProPublica [2, 25], it was concluded that a widely-used program named Correctional Offender Management Profiles for Alternative Sanctions (COMPAS) is biased against African American defendants. A follow-up study [19] found that COMPAS outcomes systematically over-predict risk for women, thereby indicating systemic gender bias. However, the findings of the ProPublica study were rejected by Northpointe (the COMPAS developer), claiming their algorithm is fair because it is well calibrated [17]. Moreover, in this report it is shown that the COMPAS risk scales exhibit accuracy equity and predictive parity.

In contrast to the case of COMPAS, other studies have shown that other risk assessment tools such as the Post Conviction Risk Assessment (PCRA), the Structured Assessment of Violence Risk in Youth (SAVRY) and the Youth Level of Service/Case Management Inventory (YLS/CMI) do not exhibit racial bias in recidivism prediction [28, 33]. In a more recent study focused on SAVRY [26, 34], it is shown that although machine learning models could be more accurate than the simple summation used to compute SAVRY scores, they would introduce discrimination against some groups of defendants.

There are many different definitions of algorithmic fairness [27], some of which are incompatible with one another. It is impossible to satisfy all of them simultaneously except in pathological cases (such as a perfect classifier), and in general it is impossible to maximize algorithmic fairness and accuracy at the same time [5, 6]. Hence, there are necessary trade-offs between different metrics [6, 10, 23]. In this regard, some studies [20, 36, 37] try to mitigate potential algorithmic discrimination by satisfying equalized odds, in other words by avoiding disparate mistreatment along different sensitive groups. In addition, due to the importance of calibration in risk assessment tools [6, 17], some previous work has also tried to minimize error disparity across groups while maintaining calibrated probability estimates [29].

The most closely related previous work is Pleiss et al. [29], where algorithmic bias in a machine-learned risk assessment (COMPAS) is minimized by equalizing generalized false positive rates along different races, finding this equalization to be incompatible with calibration. In contrast, in the work presented in this paper, we start from an expert-based risk assessment method, which is not machine learned, and propose a new machine learning model to replace it, describing the effects of algorithmic bias mitigation on both the original and the machine-learned model. Additionally, we find that in RisCanvi equalization along nationality and age groups is not entirely incompatible with calibration.

3 RISCANVI DATASET

3.1 The RisCanvi Risk Assessment Tool
RisCanvi was introduced as a multi-level risk assessment protocol for violence prevention in the Catalan prison system in 2009 [1]. This protocol is applied multiple times during an inmate's period in prison; the official recommendation is to do so every six months, or at the discretion of the case manager. RisCanvi is not a questionnaire. Instead, each inmate is interviewed by professionals. In the original RisCanvi protocol, risk is determined for each inmate relative to four possible outcomes: self-directed violence, violence in the prison facilities, committing further violent offenses, and breaking prison permits. A fifth risk score was introduced more recently for general recidivism [31].

Two versions of the RisCanvi protocol were created: an abbreviated one of 10 items for screening (RisCanvi-S), and a complete one of 43 items (RisCanvi-C). Risk items can be categorized into five different categories: Criminal/Penitentiary, Biographical, Family/Social, Clinical, and Attitudes/Personality. These items can also be divided into static factors (such as "criminal history of family" and "age of starting violent activity") and dynamic factors (such as "member of socially vulnerable groups" and "pro-criminal or antisocial attitudes").

3.2 Dataset
The anonymized dataset used in this research comprises 7,239 offenders who first entered prison between 1989 and 2012 and who were evaluated with the RisCanvi protocol between 2010 and 2013. Only offenders for whom nationality information was recorded were kept, comprising 2,634 offenders. The resulting population was filtered in terms of their violent/general recidivism, release, and last RisCanvi evaluation dates, considering the following conditions: inmates who were released at most 9 months after their last RisCanvi evaluation, and for whom violent/general recidivism (or its absence) was recorded at most two years after their release. This yielded a final sample of 2,027 offenders (out of 2,634). Among this population, 146 committed a violent offence (violent recidivism) and 310 committed a violent or non-violent offence (general recidivism) after being released. The data includes all of the information for the two RisCanvi versions (RisCanvi-S and RisCanvi-C). This study is focused on the RisCanvi-C protocol, the complete version administered after RisCanvi-S; it comprises more risk factors and yields three risk levels (low, medium, and high).

3.3 Violent and General Recidivism
This work is focused on the RisCanvi protocol to assess Violent Recidivism ("REVI" in the RisCanvi manual) and General Recidivism ("REGE" in the RisCanvi manual) risks in sentenced inmates. REVI and REGE risks are outcomes predicted using two different sub-sets of risk factors. REVI risk is obtained using 23 items out of the 43 risk factors of the RisCanvi-C version plus two demographic features (gender and nationality), while REGE risk is computed from 14 items (out of the 43 risk factors of the RisCanvi-C version). In RisCanvi-C, each of the REVI and REGE scores is computed by summing its related features in a hand-crafted formula and then applying two cut-offs, obtaining three risk levels (details in [1]).

The distribution of REVI and REGE risk scores in the last RisCanvi evaluation is compared by nationality and age groups. Grouping by gender is not considered, as the number of women in the sample is too small to draw robust conclusions. The comparison shows that recidivism risk scores have approximately similar distributions along nationality and age groups, except for the REVI score by nationality, which shows that foreigners tend to have lower REVI risk scores compared to Spaniards (figures are omitted for brevity). For age groups, 30 years old is used as a cut-off, as criminology research suggests that the types of offense and context are different
for people under 30 and over 30 (see, e.g., [35]). This age is also used as a cut-off between young and old people in the design of the RisCanvi protocol. In the present dataset, the majority of the population are Spanish nationals (70%) and older than 30 years old (74%).

According to the average violent and general recidivism rates for nationality and age groups, it can be seen that, in general, foreigners and older offenders have a lower recidivism rate.

4 METHODOLOGY
The goal of this study is to compare the effectiveness and fairness of Machine Learning (ML) models and the RisCanvi risk assessment tool in the prediction of violent and general recidivism.

4.1 ML-based Models
Different ML methods, such as logistic regression, multi-layer perceptron (MLP), and support vector machines (SVM), are used. The ground truth is the violent/general recidivism, which is recorded at most two years after the inmate's release.

Different sub-sets of features are tested as input to the ML models, such as the 43 RisCanvi-C items, the Violent Recidivism (REVI)/General Recidivism (REGE) risk items, and a set of features selected from the 43 risk items using a feature selection method. In addition, three demographic features (gender, nationality, and age) are used as general input features. Finally, the average of the REVI/REGE risk scores over all RisCanvi evaluations, from the first to the last, is added.

The split into training and test sets is done k times using stratified k-fold cross-validation, and average results are reported.

4.2 Algorithmic Fairness
Algorithmic fairness is evaluated by comparing the impact of the risk prediction method across nationality and age groups.

Model calibration is a well-known necessary condition, especially in criminal justice risk assessments [6, 17]. If the risk tool is not calibrated with respect to different groups, then the same risk estimate carries different meanings and cannot be interpreted equally for different groups. Furthermore, creating parity in the error rates of different groups ("equalized odds") is a well-established method to mitigate algorithmic discrimination in automatic classification. Previous work has also emphasized the importance of this algorithmic fairness metric for this particular application [20, 36, 37]. Hence, to mitigate potential algorithmic discrimination, a relaxation method [29] is used in this paper which seeks to satisfy equalized odds, i.e., parity in the error rates (generalized false positive rate and generalized false negative rate), while preserving calibration in each sub-group of nationality and age. In most cases, calibration and equalized odds are mutually incompatible goals [10, 23], so this method seeks to minimize only a single error disparity across groups while maintaining calibrated probability estimates.

Generalized False Positive Rate (GFPR) and Generalized False Negative Rate (GFNR) are the standard notions of false-positive and false-negative rates, generalized for use with probabilistic classifiers [29]. If the variable x represents an inmate's feature vector, y indicates whether or not the inmate recidivates, G_1 and G_2 are two different groups, and h_1 and h_2 are binary classifiers which classify samples from G_1 and G_2 respectively, GFPR and GFNR are defined as follows [29]: the GFPR of classifier h_t for group G_t is c_fp(h_t) = E_{(x,y)∼G_t}[h_t(x) | y = 0]. GFPR is the average probability of recidivism that the classifier estimates for people who actually do not recidivate. Conversely, the GFNR of classifier h_t is c_fn(h_t) = E_{(x,y)∼G_t}[(1 − h_t(x)) | y = 1]. The two classifiers h_1 and h_2 show probabilistic equalized odds across groups G_1 and G_2 if c_fp(h_1) = c_fp(h_2) and c_fn(h_1) = c_fn(h_2).

Classifier h_t is said to be well-calibrated if ∀p ∈ [0, 1], P_{(x,y)∼G_t}[y = 1 | h_t(x) = p] = p. To prevent the probability scores from carrying group-specific information, both classifiers h_1 and h_2 are calibrated with respect to groups G_1 and G_2 [6, 17].

5 RESULTS

5.1 Effectiveness Evaluation
Among logistic regression (LR), multi-layer perceptron (MLP) and support vector machines, the best results were obtained using LR for both violent and general recidivism prediction. Hence, the non-LR-based models are omitted for brevity. The final set of features used for the model consists of a sub-set of the 43 risk items of the RisCanvi evaluation, selected using a feature selection method (based on a linear model with L1 penalization to yield sparse coefficients), the average Violent Recidivism (REVI)/General Recidivism (REGE) score (from the first to the last RisCanvi evaluation), gender, nationality, and age at the time of the last evaluation.

Table 1: Effectiveness of models in violent and general recidivism prediction

                     Violent Recidivism       General Recidivism
  Model              AUC    GFNR   GFPR       AUC    GFNR   GFPR
  LR                 0.76   0.82   0.06       0.73   0.75   0.14
  RisCanvi_score     0.72   0.87   0.07       0.70   0.79   0.14

Results in terms of AUC-ROC, GFNR, and GFPR are presented in Table 1 for both violent and general recidivism prediction, compared with the existing RisCanvi protocol. These results are compared against RisCanvi_score, which is the number resulting from the application of the RisCanvi formula.

In both violent and general recidivism prediction, LR yields better results than RisCanvi in terms of all metrics, although the results are close to those of RisCanvi. In general, the LR model is more accurate than RisCanvi, although by a small amount, which is surprising considering that RisCanvi was not computationally optimized for predictive accuracy.

5.2 Algorithmic Fairness Evaluation
The results for the analysis of algorithmic fairness in all metrics along nationality (national and foreigner) and age groups (young and old inmates) are shown in Table 2 for violent and general recidivism prediction. In the LR_calibrated model, the predictions have been calibrated with respect to each of the two sub-groups in nationality and age.

For violent recidivism, all models show a bias against nationals in terms of GFPR. The difference is less noticeable in RisCanvi.
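The generalized rates defined in Section 4.2 can be computed directly from a classifier's probability scores, and the single-disparity relaxation can be illustrated in a few lines. The following is a minimal sketch with invented scores and labels, not the study's data or code; the mixing function is a deterministic, expected-value variant of the randomized procedure of Pleiss et al. [29].

```python
def gfpr(scores, labels):
    """Generalized FPR: mean predicted score over true non-recidivists."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(neg) / len(neg)

def gfnr(scores, labels):
    """Generalized FNR: mean of (1 - score) over true recidivists."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(1 - s for s in pos) / len(pos)

def mix_to_gfpr(scores, labels, target):
    """Shrink scores toward the group's base rate (a trivially calibrated
    predictor) until the group's expected GFPR equals `target`.
    Deterministic sketch of the randomized mixing in Pleiss et al. [29]."""
    g = gfpr(scores, labels)
    mu = sum(labels) / len(labels)   # base rate of the group
    a = (g - target) / (g - mu)      # solves (1 - a)*g + a*mu == target
    return [(1 - a) * s + a * mu for s in scores]

# Invented example: probability scores h(x) and outcomes y for one group.
scores = [0.9, 0.2, 0.7, 0.1, 0.4]
labels = [1, 0, 1, 0, 0]
print(round(gfpr(scores, labels), 3))   # 0.233
print(round(gfnr(scores, labels), 3))   # 0.2
mixed = mix_to_gfpr(scores, labels, target=0.3)
print(round(gfpr(mixed, labels), 3))    # 0.3
```

Calibration is preserved in expectation because the base-rate predictor is itself calibrated for the group; in the paper's pipeline the analogous adjustment is applied separately per nationality and age sub-group.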
Table 2: Effectiveness of models in violent and general recidivism prediction per group
In the LR model, we can also observe higher GFPR for young inmates compared to old offenders. In general, the LR_calibrated and RisCanvi models lead to more algorithmically fair results along both nationality and age in terms of all metrics, except for the metrics in which all the models show discrimination.

The results for general recidivism prediction show a higher AUC for nationals compared to foreigners in RisCanvi. In terms of GFPR, the LR and LR_calibrated models show discrimination against the national group. In the age groups, the LR and LR_calibrated models show higher GFPR for the young compared to the old group. In terms of AUC, we can see more discrimination against young inmates in RisCanvi compared to the other models. As a result, the LR_calibrated model shows better algorithmic fairness properties across nationality, and more balanced values can be observed along the age groups in RisCanvi.

6 EQUALIZED ODDS AND CALIBRATION
In this section, we try to achieve parity along nationality and age groups in terms of two fairness metrics simultaneously. For this purpose, the method introduced by Pleiss et al. [29] is used, which seeks parity in Generalized False Positive Rate (GFPR) or Generalized False Negative Rate (GFNR) while preserving calibration in each sub-group of nationality and age. The conclusion from the previous section, based on the results obtained per group in Table 2, is that in both violent and general recidivism prediction, the machine learning models show inequality in terms of GFPR along nationality and age. RisCanvi also shows an imbalance in GFPR values along nationality groups in violent recidivism prediction.

Hence, we try to create parity in this metric while preserving calibration in each group. The results after bias mitigation are presented in Table 3 for violent and general recidivism prediction. The obtained models are referred to in the following as LR-Equalized, LR_Calibrated-Equalized, and RisCanvi-Equalized.

Comparing the results before and after this bias mitigation (Table 2 and Table 3 respectively) in violent recidivism, it can be seen that the discrimination in GFPR has decreased by 0.08-0.26 and 0.06-0.09 along nationality and age groups respectively. Likewise, comparing the results before and after bias mitigation in general recidivism shows reductions in GFPR disparity of 0.03-0.04 and 0.16-0.19 along nationality and age groups respectively. However, in both violent and general recidivism prediction, the decline in GFPR bias is obtained at the expense of further inequalities in other metrics.

7 DISCUSSION AND CONCLUSIONS
The effectiveness and fairness of Machine Learning (ML) models in violent and general recidivism prediction were compared to the RisCanvi risk assessment tool, an in-use model created by experts. The ML models achieved AUC of 0.76 and 0.73 in violent and general recidivism prediction respectively, slightly better than the AUC of the RisCanvi protocol, which is 0.72 and 0.70. It is noteworthy that in this type of task predictions are not very accurate in general (existing recidivism prediction tools typically have AUC in the range of 0.57-0.74 [9, 14, 15]), and we find that a hand-crafted formula created by experts is quite comparable to a machine-learned one. Although the improvement in accuracy by ML would be insufficient on its own to support its introduction as a risk assessment tool, a key element of ML models is their flexibility. An ML model can be re-trained with newer data and can incorporate new factors as the population of inmates changes and more data on recidivism becomes available.

By studying the differential treatment of RisCanvi and the ML models across different groups, it can be stated that, depending on the desired metric and groups, machine learning and human experts can lead to different but comparable results. An advantage of ML models is that the emphasis on different metrics can be changed during modeling as legal or policy changes are introduced.

In this study, the results in Table 2 showed that in both violent and general recidivism prediction there is an inequality in terms of the Generalized False Positive Rate (GFPR) along nationality and age groups. Using a relaxation method [29], we therefore tried to establish parity in GFPR while preserving calibration in each sub-group of nationality and age. The results after bias mitigation (Table 3) showed that the GFPR disparity in violent and general recidivism decreased by at most 0.26 and 0.04 along nationality and by 0.09 and 0.19 along age, respectively, in exchange for inequalities in some other metrics.

A robust conclusion from this work is that in a context in which predictive factors neither determine nor yield a clear signal of low/medium/high recidivism risk, ML cannot be considered a silver bullet. At the very least, improvements in accuracy need to be carefully contrasted with potential issues of algorithmic fairness.
Table 3: Equalized GFPR while preserving calibration in violent and general recidivism prediction
When introducing ML, calibration and some bias mitigation method (such as the equalized odds approach in this study) need to be used.

ACKNOWLEDGMENTS
This work has been partially supported by the HUMAINT programme (Human Behaviour and Machine Intelligence), Centre for Advanced Studies, Joint Research Centre, European Commission. The project leading to these results has received funding from “la Caixa” Foundation (ID 100010434), under the agreement LCF/PR/PR16/51110009.

REFERENCES
[1] Antonio Andrés-Pueyo, Karin Arbach-Lucioni, and Santiago Redondo. 2018. The RisCanvi: a new tool for assessing risk for violence in prison and recidivism. Recidivism Risk Assessment: A Handbook for Practitioners (2018), 255–268.
[2] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. ProPublica, May 23 (2016).
[3] Richard Berk. 2012. Criminal justice forecasts of risk: A machine learning approach. Springer Science & Business Media.
[4] Richard Berk. 2017. An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. J. of Experimental Criminology 13, 2 (2017), 193–216.
[5] Richard Berk. 2019. Accuracy and fairness for juvenile justice risk assessments. J. of Empirical Legal Studies 16, 1 (2019), 175–194.
[6] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2018. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research (2018), 0049124118782533.
[7] Richard Berk and Jordan Hyatt. 2015. Machine learning forecasts of risk to inform sentencing decisions. Federal Sentencing Reporter 27, 4 (2015), 222–228.
[8] Richard A Berk, Susan B Sorenson, and Geoffrey Barnes. 2016. Forecasting domestic violence: A machine learning approach to help inform arraignment decisions. J. of Empirical Legal Studies 13, 1 (2016), 94–115.
[9] Tim Brennan, William Dieterich, and Beate Ehret. 2009. Evaluating the predictive validity of the COMPAS risk and needs assessment system. Criminal Justice and Behavior 36, 1 (2009), 21–40.
[10] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163.
[11] Alexandra Chouldechova and Aaron Roth. 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810 (2018).
[12] S. Corbett-Davies and S. Goel. 2018. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023 (2018).
[13] K.P. Dahle, J. Biedermann, R.J. Lehmann, and F. Gallasch-Nemitz. 2014. The development of the Crime Scene Behavior Risk measure for sexual offense recidivism. Law and Human Behavior 38, 6 (2014), 569.
[14] Matthew DeMichele, Peter Baumgartner, Michael Wenger, Kelle Barrick, Megan Comfort, and Shilpi Misra. 2018. The Public Safety Assessment: A re-validation and assessment of predictive utility and differential prediction by race and gender in Kentucky. Available at SSRN 3168452 (2018).
[15] S.L. Desmarais, K.L. Johnson, and J.P. Singh. 2016. Performance of recidivism risk assessment instruments in US correctional settings. Psychological Services 13, 3 (2016), 206.
[16] Sarah Desmarais and Jay Singh. 2013. Risk assessment instruments validated and implemented in correctional settings in the United States. (2013).
[17] W. Dieterich, C. Mendoza, and T. Brennan. 2016. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc (2016).
[18] William M Grove, David H Zald, Boyd S Lebow, Beth E Snitz, and Chad Nelson. 2000. Clinical versus mechanical prediction: a meta-analysis. Psychological Assessment 12, 1 (2000), 19.
[19] Melissa Hamilton. 2019. The sexist algorithm. Behavioral Sciences & the Law 37, 2 (2019), 145–157.
[20] M. Hardt, E. Price, and N. Srebro. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems. 3315–3323.
[21] Philip D Howard and Louise Dixon. 2012. The construction and validation of the OASys Violence Predictor: Advancing violence risk assessment in the English and Welsh correctional services. Criminal Justice and Behavior 39, 3 (2012), 287–307.
[22] Danielle Leah Kehl and Samuel Ari Kessler. 2017. Algorithms in the criminal justice system: Assessing the use of risk assessments in sentencing. (2017).
[23] J. Kleinberg, S. Mullainathan, and M. Raghavan. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).
[24] Carolin Kröner, Cornelis Stadtland, Matthias Eidt, and Norbert Nedopil. 2007. The validity of the Violence Risk Appraisal Guide (VRAG) in predicting criminal recidivism. Criminal Behaviour and Mental Health 17, 2 (2007), 89–100.
[25] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. 2016. How we analyzed the COMPAS recidivism algorithm. ProPublica (May 2016).
[26] Marius Miron, Songül Tolan, Emilia Gómez, and Carlos Castillo. 2020. Evaluating causes of algorithmic bias in juvenile criminal recidivism. Artificial Intelligence and Law (2020), 1–37.
[27] Arvind Narayanan. 2018. 21 fairness definitions and their politics. Presented at the Conference on Fairness, Accountability, and Transparency (2018).
[28] Rachael T Perrault, Gina M Vincent, and Laura S Guy. 2017. Are risk assessments racially biased?: Field study of the SAVRY and YLS/CMI in probation. Psychological Assessment 29, 6 (2017), 664.
[29] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. 2017. On fairness and calibration. In Advances in Neural Information Processing Systems. 5680–5689.
[30] M. Rettenberger, M. Mönichweger, E. Buchelle, F. Schilling, and R. Eher. 2010. The development of a screening scale for the prediction of violent offender recidivism. Monatsschrift für Kriminologie und Strafrechtsreform 93, 5 (2010), 346–360.
[31] J.P. Singh, D.G. Kroner, J.S. Wormith, S.L. Desmarais, and Z. Hamilton. 2018. Handbook of recidivism risk/needs assessment tools. John Wiley & Sons.
[32] Jay P Singh, Sarah L Desmarais, Cristina Hurducas, Karin Arbach-Lucioni, Carolina Condemarin, Kimberlie Dean, Michael Doyle, Jorge O Folino, Verónica Godoy-Cervera, Martin Grann, et al. 2014. International perspectives on the practical application of violence risk assessment: A global survey of 44 countries. Int. J. of Forensic Mental Health 13, 3 (2014), 193–206.
[33] Jennifer L Skeem and Christopher T Lowenkamp. 2016. Risk, race, and recidivism: Predictive bias and disparate impact. Criminology 54, 4 (2016), 680–712.
[34] Songül Tolan, Marius Miron, Emilia Gómez, and Carlos Castillo. [n.d.]. Why Machine Learning May Lead to Unfairness: Evidence from Risk Assessment for Juvenile Justice in Catalonia. In Proc. of ICAIL’19.
[35] Jeffrey Todd Ulmer and Darrell J Steffensmeier. 2014. The age and crime relationship: Social variation, social explanations. In The Nurture Versus Biosocial Debate in Criminology: On the Origins of Criminal Behavior and Criminality. SAGE Publications Inc., 377–396.
[36] B. Woodworth, S. Gunasekar, M.I. Ohannessian, and N. Srebro. 2017. Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081 (2017).
[37] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. 2017. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proc. of the 26th Int. Conf. on WWW. 1171–1180.
Towards compliance checking in reified I/O logic via SHACL
Livio Robaldo
livio.robaldo@swansea.ac.uk
Legal Innovation Lab Wales - Swansea University
Swansea, Wales, UK
ABSTRACT
Reified Input/Output logic [29] has been recently proposed to handle natural language meaning in Input/Output logic [17]. So far, the research in reified I/O logic has focused only on KR issues, specifically on how to use the formalism for representing the contextual meaning of norms (see [28]). This paper is the first attempt to investigate reasoning in reified I/O logic, specifically compliance checking. It investigates how to model reified I/O logic formulae in the Shapes Constraint Language (SHACL) [2], a recent W3C recommendation for validating and reasoning with RDFs/OWL.

KEYWORDS
reified I/O logic, SHACL, RDFs/OWL

ACM Reference Format:
Livio Robaldo. 2021. Towards compliance checking in reified I/O logic via SHACL. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466065

ICAIL'21, June 21–25, 2021, São Paulo, Brazil
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8526-8/21/06...$15.00
https://doi.org/10.1145/3462757.3466065

1 INTRODUCTION
Reified Input/Output logic [29] is Input/Output logic [17] enriched with reification. The introduction of reification in I/O logic enhances the expressivity of the I/O logic formulae without substantially affecting the I/O logic constructs that implement deontic reasoning.

Reification is a formal mechanism that associates instantiations of higher-order predicates and operators with FOL terms [13], [27], [26]. The latter can then be directly inserted as arguments of other FOL predicates, which may in turn be reified again into new FOL terms. In other words, reified I/O logic associates norms with explicit terms, e.g., constants or variables, and not only with truth-conditional symbols such as predicates or (second-order) deontic operators. These terms can then be inserted as parameters of separate meta-properties.

Reified I/O logic is grounded on a specific reification-based approach to Natural Language Semantics: the framework in [12]. The main insight of [12] is to use reification massively in order to transform every second-order operator, including boolean connectives, into a FOL predicate applied to FOL terms. The resulting formulae are then flat conjunctions of atomic FOL predicates.

As shown in [28], the formal simplicity and the modular structure of reified I/O logic facilitate the implementation of user-friendly interfaces to encode large knowledge bases of norms in reasonable time. [28] presents the DAPRECO knowledge base (D-KB), a repository of 966 reified I/O logic formulae that translate norms from the GDPR. The D-KB was built in four months via a special JavaScript editor implemented for this purpose.

While past research in reified I/O logic has focused on how to build formulae associated with norms in natural language, this paper represents the first attempt to investigate how these formulae can be implemented and used for compliance checking, i.e., to infer which obligations have been violated in a given state of affairs and with respect to a given set of norms.

Compliance checking has never really been studied in I/O logic. Most past literature in I/O logic has focused on deontic reasoning and, recently, normative reasoning [15]. Deontic reasoning is reasoning about what is obligatory and permitted, while dealing with contrary-to-duty reasoning, deontic paradoxes, ethical/moral conflicts, etc. Reasoning about obligations and permissions is of course orthogonal to what agents actually do, i.e., whether they did or did not violate their obligations or whether they did or did not perform what they were permitted to do.

Compliance checking does not involve deontic reasoning. Still, compliance checking might not be so simple to handle, e.g., because norms might include exceptions that lead to defeasible reasoning.

This paper proposes a formalization of non-deontic inferences in reified I/O logic via SHACL [2]. While recent literature has offered solutions for compliance checking implemented in RDFs/OWL, e.g., [6], only preliminary works use SHACL to this end, e.g., [21].

2 BACKGROUND - REIFIED I/O LOGIC
2.1 Input/Output logic
I/O logic was originally introduced in [17]. I/O logic is a family of logics, just like modal logic is a family of systems K, S4, S5, etc. However, while modal logic uses possible world semantics, I/O logic uses norm-based semantics, in the sense of [11]: I/O systems are families of if-then rules (𝑎, 𝑏), such that when 𝑎 is given in input, 𝑏 is returned in output. 𝑎 and 𝑏 are formulae in another logic, called "the object logic". It has been argued that norm-based reasoning features some advantages over reasoning based on possible world semantics, first of all a lower computational complexity [30].

I/O logic neatly decouples deontic and non-deontic inferences. I/O logic is indeed a meta-logic wrapped around another logic (e.g., [12], in the case of reified I/O logic) called "the object logic". The meta-logic implements deontic inferences while the object logic implements the non-deontic ones. In I/O systems for legal reasoning, rules (𝑎, 𝑏) can be obligations, permissions, and constitutive rules. These are clustered within three distinct sets 𝑂, 𝑃, and 𝐶 such that
∀(𝑎, 𝑏)∈𝑂 reads as "given 𝑎, 𝑏 is obligatory", ∀(𝑎, 𝑏)∈𝑃 reads as "given 𝑎, 𝑏 is permitted", and ∀(𝑎, 𝑏)∈𝐶 reads as "given 𝑎, 𝑏 holds".

Most past research on I/O logic has focused on theoretical investigations in the meta-logic, for modeling deontic reasoning. Since the focus was on studying the meta-logic, the object logic was always kept as simple as possible, i.e., it was always propositional logic. Reified I/O logic is perhaps the most relevant proposal so far in the I/O logic literature that employs an alternative (first-order) object logic: the logical framework in [12].

In I/O logic, inferences in the meta-logic are achieved by imposing axioms and constraints on the sets of if-then rules. Different combinations of axioms and constraints trigger different inferences. For instance, [17] defines the basic axioms in (1), where the symbol '⊢' is the entailment relation of the object logic. Variants of these axioms have been further investigated in [23] and [22].

(1) • SI: from (𝑎, 𝑥) to (𝑏, 𝑥) whenever 𝑏 ⊢ 𝑎.
    • OR: from (𝑎, 𝑥) and (𝑏, 𝑥) to (𝑎 ∨ 𝑏, 𝑥).
    • WO: from (𝑎, 𝑥) to (𝑎, 𝑦) whenever 𝑥 ⊢ 𝑦.
    • AND: from (𝑎, 𝑥) and (𝑎, 𝑦) to (𝑎, 𝑥 ∧ 𝑦).
    • CT: from (𝑎, 𝑥) and (𝑎 ∧ 𝑥, 𝑦) to (𝑎, 𝑦).

By imposing axioms SI, WO, and AND, we obtain a specific derivation system called deriv1. Adding OR to deriv1 gives deriv2. Adding CT to deriv1 gives deriv3. The five axioms together give deriv4. Each derivation system is sound and complete with respect to a different (norm-based) semantics and can therefore trigger different inferences (see [17] for further discussion and details).

Given a derivation system, we may further constrain its sets of if-then rules by considering only subsets that do not yield outputs conflicting with given inputs. This is needed to handle contrary-to-duty reasoning, i.e., to determine which obligations are detached in a situation that already violates some of them [16].

This paper is not concerned with the meta-level of I/O logic. Rather, it will focus on the object logic and non-deontic inferences, including defeasible ones to handle exceptions in legal reasoning.

2.2 Adding reification to I/O logic
Reification is a well-known technique used in linguistics and computer science for representing abstract concepts. These are associated with explicit objects, e.g., FOL terms (see below in this section) or RDF resources (see §3 below), on which we can assert properties. These assertions can be recursively reified again into new terms.

Both [12] and RDFs/OWL recursively reify assertions until the knowledge is represented in terms of a flat list of atomic predicates applied to terms. In RDFs/OWL, these flat lists are made of triples "(subject, predicate, object)", while [12] also allows predicates with higher arity; however, any n-ary predicate can be transformed into an equivalent conjunction of RDF triples.

In [12] and in reified I/O logic, both the antecedent and the consequent of an if-then rule are conjunctions of predicates. Universal and existential quantifiers are added to bind the free variables occurring in the formulae. Universals that outscope the whole if-then rule are used to "carry" individuals from the antecedent to the consequent. Formal details and definitions are available in [29].

A simple example from the D-KB [28] is shown in (2). (2) encodes in reified I/O logic part of Art. 5(1)(a) of the GDPR. The if-then rule belongs to the set 𝑂 (note "∈ 𝑂" in (2)): it is an obligation requiring each personal data processing to be lawful.

(2) ∀𝑒𝑝 ( ∃𝑡1,𝑧,𝑤,𝑦,𝑥 [ (𝑅𝑒𝑥𝑖𝑠𝑡𝐴𝑡𝑇𝑖𝑚𝑒 𝑒𝑝 𝑡1) ∧
        (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎 𝑧 𝑤) ∧ (𝐷𝑎𝑡𝑎𝑆𝑢𝑏𝑗𝑒𝑐𝑡 𝑤) ∧
        (𝐶𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑟 𝑦 𝑧) ∧ (𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑜𝑟 𝑥) ∧ (𝑛𝑜𝑚𝑖𝑛𝑎𝑡𝑒𝑠 𝑦 𝑥) ∧
        (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔’ 𝑒𝑝 𝑥 𝑧) ],
      (𝑖𝑠𝐿𝑎𝑤𝑓𝑢𝑙 𝑒𝑝) ) ∈ 𝑂

Formulae in reified I/O logic employ two kinds of predicates: primed predicates such as 𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔’ and non-primed predicates such as 𝐷𝑎𝑡𝑎𝑆𝑢𝑏𝑗𝑒𝑐𝑡. The former are obtained by reifying the latter; the first argument of a primed predicate is the reification of its non-primed counterpart, i.e., a FOL term.

We should not reify all predicates, but only those we need. For instance, we do need to reify (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔 𝑥 𝑧) into (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔’ 𝑒𝑝 𝑥 𝑧), where 𝑒𝑝 explicitly refers to the action of processing, because we need to assert a property on this action: in the consequent of the obligation, we require it to be lawful, i.e., to satisfy the 𝑖𝑠𝐿𝑎𝑤𝑓𝑢𝑙 predicate. Note that in (2), in order to "carry" the variable 𝑒𝑝 from the antecedent to the consequent, a universal quantifier outscoping the if-then rule has been inserted. All other variables are existentially quantified within the antecedent.

The other predicate that 𝑒𝑝 is required to satisfy is 𝑅𝑒𝑥𝑖𝑠𝑡𝐴𝑡𝑇𝑖𝑚𝑒. This is a special predicate used to assert which reifications "really exist" at a certain time. 𝑅𝑒𝑥𝑖𝑠𝑡𝐴𝑡𝑇𝑖𝑚𝑒 parallels the well-known predicate 𝐻𝑜𝑙𝑑𝑠𝐴𝑡 used in Event Calculus [14].

Thus, formula (2) reads: "for every personal data processing 𝑒𝑝 of some personal data 𝑧, owned by a data subject 𝑤, controlled by a controller 𝑦, and processed by a processor 𝑥 (nominated by 𝑦), it is obligatory for 𝑒𝑝 to be lawful".

2.3 Adding defeasibility to reified I/O logic
It is common in legislation that some rules override others in restricted contexts. These more specific rules are seen as exceptions to the general rules, as penguins may be seen as exceptions to birds with respect to the ability to fly.

In line with the literature, e.g., [10], reified I/O logic models exceptions via special predicates "Ex" that are false by default. This is achieved via negation-as-failure (naf). "naf(Ex)" is true if "Ex" is either false or unknown. On the other hand, when "Ex" holds, "naf(Ex)" is false, and the general rule is blocked. An example, taken from [28], is given by the following rules:

(a) If the data subject has given consent to the processing, then the processing is lawful.
(b) If the age of the data subject is lower than the minimal age for consent of his member state, (a) is not valid.
(c) In case of (b), if the holder of parental responsibility has given consent to the processing, then the processing is lawful.

(a)-(c) are formalized as the following constitutive rules:

(3) ∀𝑒𝑝 ( ∃𝑡,𝑧,𝑤,𝑦,𝑥 [ (𝑅𝑒𝑥𝑖𝑠𝑡𝐴𝑡𝑇𝑖𝑚𝑒 𝑒𝑝 𝑡) ∧ (𝐷𝑎𝑡𝑎𝑆𝑢𝑏𝑗𝑒𝑐𝑡 𝑤) ∧
        (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔’ 𝑒𝑝 𝑥 𝑧) ∧ (𝐶𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑟 𝑦 𝑧) ∧
        (𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑜𝑟 𝑥) ∧ (𝑛𝑜𝑚𝑖𝑛𝑎𝑡𝑒𝑠 𝑦 𝑥) ∧ (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎 𝑧 𝑤) ∧
        (𝐺𝑖𝑣𝑒𝐶𝑜𝑛𝑠𝑒𝑛𝑡𝑇𝑜 𝑤 𝑒𝑝) ∧ 𝑛𝑎𝑓((𝑒𝑥𝑐𝑒𝑝𝑡𝑖𝑜𝑛𝐴𝑔𝑒𝐷𝑆 𝑒𝑝)) ],
      (𝑖𝑠𝐿𝑎𝑤𝑓𝑢𝑙 𝑒𝑝) ) ∈ 𝐶
(4) ∀𝑒𝑝 ( ∃𝑡,𝑧,𝑤,𝑦,𝑥,𝑠 [ (𝑅𝑒𝑥𝑖𝑠𝑡𝐴𝑡𝑇𝑖𝑚𝑒 𝑒𝑝 𝑡) ∧ (𝐷𝑎𝑡𝑎𝑆𝑢𝑏𝑗𝑒𝑐𝑡 𝑤) ∧
        (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔’ 𝑒𝑝 𝑥 𝑧) ∧ (𝐶𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑟 𝑦 𝑧) ∧
        (𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑜𝑟 𝑥) ∧ (𝑛𝑜𝑚𝑖𝑛𝑎𝑡𝑒𝑠 𝑦 𝑥) ∧ (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎 𝑧 𝑤) ∧
        (𝑆𝑡𝑎𝑡𝑒𝑂𝑓 𝑠 𝑤) ∧ (< 𝑎𝑔𝑒𝑂𝑓(𝑤) 𝑚𝑖𝑛𝐶𝑜𝑛𝑠𝑒𝑛𝑡𝐴𝑔𝑒𝑂𝑓(𝑠)) ],
      (𝑒𝑥𝑐𝑒𝑝𝑡𝑖𝑜𝑛𝐴𝑔𝑒𝐷𝑆 𝑒𝑝) ) ∈ 𝐶

(5) ∀𝑒𝑝 ( ∃𝑡,𝑧,𝑤,𝑦,𝑥,𝑠,ℎ [ (𝑅𝑒𝑥𝑖𝑠𝑡𝐴𝑡𝑇𝑖𝑚𝑒 𝑒𝑝 𝑡) ∧ (𝐷𝑎𝑡𝑎𝑆𝑢𝑏𝑗𝑒𝑐𝑡 𝑤) ∧
        (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔’ 𝑒𝑝 𝑥 𝑧) ∧ (𝐶𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑟 𝑦 𝑧) ∧
        (𝑃𝑟𝑜𝑐𝑒𝑠𝑠𝑜𝑟 𝑥) ∧ (𝑛𝑜𝑚𝑖𝑛𝑎𝑡𝑒𝑠 𝑦 𝑥) ∧ (𝑃𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝐷𝑎𝑡𝑎 𝑧 𝑤) ∧
        (𝑆𝑡𝑎𝑡𝑒𝑂𝑓 𝑠 𝑤) ∧ (< 𝑎𝑔𝑒𝑂𝑓(𝑤) 𝑚𝑖𝑛𝐶𝑜𝑛𝑠𝑒𝑛𝑡𝐴𝑔𝑒𝑂𝑓(𝑠)) ∧
        (ℎ𝑎𝑠𝐻𝑜𝑙𝑑𝑒𝑟𝑂𝑓𝑃𝑟 ℎ 𝑤) ∧ (𝐺𝑖𝑣𝑒𝐶𝑜𝑛𝑠𝑒𝑛𝑡𝑇𝑜 ℎ 𝑒𝑝) ],
      (𝑖𝑠𝐿𝑎𝑤𝑓𝑢𝑙 𝑒𝑝) ) ∈ 𝐶

3 COMPLIANCE CHECKING IN RDFS/OWL
RDFs/OWL is nowadays the W3C standard language for the Semantic Web [1]. RDFs/OWL represents knowledge via flat sets of triples "(subject, predicate, object)", in which the predicate is an rdf:Property while the subject and the object can be any rdfs:Resource, including other rdf:Property(s). In other words, RDFs/OWL allows one to treat rdf:Property(s) as first-order terms on which other (meta-)properties can be separately asserted.

It is then evident that reification is, in essence, the very same mechanism used to represent knowledge in RDFs/OWL; hence the idea of implementing reified I/O logic in the W3C standard.

Some proposals have been made to implement compliance checking in RDFs/OWL, e.g., [9] and [6]. In these approaches compliance checking is achieved by enriching the ontology with classes referring to sets of individuals compliant with the norms and by enforcing "is-a" inferences on these classes.

For instance, the OWL ontology used in [9] includes a class Supplier containing individuals that supply consumers with some goods. Since suppliers are obliged to communicate their contractual conditions to their consumers (rule R1), the corresponding class includes a boolean datatype property hasCommunicatedConditions which is true for those suppliers that have complied with their obligation and false otherwise. The ontology then includes a class SupplierR1compliant defined so as to include only individuals in Supplier for which hasCommunicatedConditions is true. Compliance checking is enforced by applying simple "is-a" inferences.

In the same spirit, [6] encodes in a fragment of OWL2 selected norms from Artt. 6, 7, 15, 23, and 30 of the GDPR, which concern data usage policies. Compliance with these policies is again implemented via "is-a" inferences.

While [9] and [6] are of course important contributions in the same direction of research advocated here, it is not clear how to model exceptions in those frameworks. Furthermore, adding explicit classes specifically devoted to "collect" the individuals compliant with the norms, as well as introducing new ones to properly handle exceptions, does not appear to be an easy and intuitive solution. The rest of the paper proposes to use SHACL as an alternative to the accounts in [9] and [6].

4 COMPLIANCE CHECKING IN SHACL
This paper proposes and makes initial investigations to encode legal rules in a formal language different from RDFs/OWL. This formal language is SHACL [2], proposed by W3C precisely for validation and inference on RDFs/OWL graphs. The use of SHACL is currently a matter of ongoing research in the Semantic Web community (see [7], [24], among others).

SHACL appears to be the right formal language for modelling compliance checking, although so far it has been scarcely investigated to this end, preliminary works being [20], [21], and [8].

SHACL was originally proposed to define special conditions on RDFs/OWL graphs, called "SHACL shapes", more expressive than standard OWL cardinality and quantifier restrictions. RDFs/OWL graphs can then be validated against a set of such SHACL shapes. However, SHACL "may be used for a variety of purposes beside validation, including user interface building, code generation and data integration" (cit. [2]). This paper adds a new use case for SHACL in that it proposes to use it for serializing reified I/O logic formulae fit to check compliance.

In order to enhance the expressivity and the flexibility of the standard, a current W3C Working Group Note proposes to enrich SHACL shapes with advanced features¹ such as "SHACL rules" to derive inferred triples from asserted ones, prior to validation.

¹ See https://www.w3.org/TR/shacl-af

As explained in [25], SHACL rules can trigger ontological or non-ontological inferences. Ontological inferences derive facts that can be added to the model. On the other hand, non-ontological inferences have the sole purpose of aggregating data, without necessarily asserting it in the model, in order to facilitate validation.

5 SERIALIZING REIFIED I/O LOGIC IN SHACL
This paper represents the first attempt to investigate how to serialize reified I/O formulae modeling obligations as SHACL shapes and reified I/O formulae modeling constitutive rules as SHACL rules. (6) shows the SHACL shape that serializes (2) above. Both require every personal data processing to be lawful.

(6) CheckLawfulness
      rdf:type sh:NodeShape;
      sh:targetClass PersonalDataProcessing;
      sh:property [ sh:path is-lawful;
                    sh:hasValue "true"^^xsd:boolean; ];

In (6), "sh:" is the SHACL namespace prefix. (6) is a sh:NodeShape requiring each individual of the sh:targetClass to satisfy the sh:property. The latter constrains the individuals reached from the sh:targetClass through the sh:path to satisfy sh:hasValue.

On the other hand, PersonalDataProcessing, is-lawful, and all other RDFs/OWL resources used in this paper are associated 1:1 with the predicates used in the reified I/O logic formulae such as (2), in the same way as the predicates occurring in the D-KB [28] are associated with RDFs/OWL resources from the PrOnto ontology [19], an OWL ontology proposed to conceptualize the data protection domain. Space constraints prevent us from providing further details about the 1:1 mapping between reified I/O logic predicates and RDFs/OWL resources.

SHACL shapes refer to constraints, a solution that appears to be more intuitive and economical than overpopulating the ontology with extra classes as suggested in [9] and [6].

The validation facts, as well as new individuals, derived through SHACL are not necessarily inserted in the ontology. The SHACL
rules to model the reified I/O logic formulae in (3), (4), and (5) represent non-ontological inferences, in the sense explained in [25]: these rules are only functional to infer the truth value of is-lawful before the SHACL shape in (6) is validated.

(3), (4), and (5) are serialized in the SHACL rules in (7), (8), (9), and, below, (10).

(7) sh:rule [rdf:type sh:TripleRule; sh:order 0;
      sh:subject sh:this;
      sh:predicate has-min-consent-age;
      sh:object [sh:path
            (has-theme has-personal-data
             is-personal-data-of has-member-state
             has-min-consent-age);]; ];

(8) sh:rule [rdf:type sh:TripleRule; sh:order 1;
      sh:condition [
            sh:property [sh:path has-min-consent-age;
                  sh:minCount 1;];
            sh:property [
                  sh:path (has-agent has-age);
                  sh:lessThan has-min-consent-age;]; ];
      sh:subject [sh:path has-theme;];
      sh:predicate rdf:type;
      sh:object exceptionAgeDS; ];

(9) sh:rule [rdf:type sh:TripleRule; sh:order 2;
      sh:condition [
            sh:not [sh:property [sh:path has-theme;
                  sh:class exceptionAgeDS;]; ]; ];
      sh:subject [sh:path has-theme;];
      sh:predicate is-lawful;
      sh:object "true"^^xsd:boolean; ];

The sh:targetClass of all these SHACL rules is GiveConsent. Rules are executed according to their sh:order, from the lowest to the highest value. Each rule in (7)-(9) makes a new assertion: the rdf:Property specified in the sh:predicate of the rule is asserted between the two RDFs/OWL resources in the sh:subject and the sh:object. The sh:subject and the sh:object may be the sh:targetClass itself (keyword "sh:this"), a resource reachable from the sh:targetClass through a path specified in sh:path, any other resource in the ontology, or a literal.

(7) is executed first because its sh:order is "0". This rule sets the value of the property has-min-consent-age for each individual in the class GiveConsent. This value is set to the integer value reachable through the sh:path defined on sh:object in (7). Specifically, this is the minimal consent age (has-min-consent-age) of the Member State (has-member-state) of the data subject owning the personal data (has-personal-data is-personal-data-of) involved in the personal data processing occurring as the theme of the GiveConsent instances (has-theme).

It is important to understand that has-min-consent-age will not be asserted on the individuals of GiveConsent in the reference ontology, but only in the derived one. In other words, (7) is a non-ontological inference rule that collects/aggregates this value in GiveConsent for validation purposes only. After the validation, these values will be discarded.

Rule (8) compares the minimal consent age of the agents' Member State, just asserted by (7) on GiveConsent's instances, with the agents' age. The two rules are thus executed in a pipeline, thanks to the SHACL command sh:order. Mirroring these inferences in native RDFs/OWL seems to be more difficult in that the formalism does not allow one to specify a priority between the inference rules.

When the agent's age has been specified (sh:minCount 1) and it is lower than (sh:lessThan) the minimal consent age of the Member State previously asserted by (7), rule (8) asserts the individual of PersonalDataProcessing in the has-theme property of the individual of GiveConsent as a member of the class exceptionAgeDS (see rdf:type in sh:predicate).

Finally, (9) sets as true the property is-lawful of the instances of PersonalDataProcessing that do not (sh:not) belong to the class exceptionAgeDS. (9) implements the reified I/O logic formula shown above in (3), and the SHACL operator sh:not implements the negation-as-failure (predicate naf) occurring therein. sh:not is in fact true when the ontology does not include any specific assertion of the personal data processing as a member of the class exceptionAgeDS. In other words, since the closed world assumption holds for both RDFs/OWL and SHACL, sh:not is true when it is either false or unknown whether the personal data processing belongs to this class.

Finally, (10) implements the reified I/O logic formula (5) above:

(10) sh:rule [rdf:type sh:TripleRule; sh:order 2;
      sh:condition [
            sh:property [sh:path
                  (has-theme has-personal-data
                   is-personal-data-of has-age);
                  sh:lessThan has-min-consent-age; ];
            sh:property [sh:path (has-theme
                   has-personal-data
                   is-personal-data-of
                   has-holder-of-pr);
                  sh:equals has-agent;]; ];
      sh:subject [sh:path has-theme;];
      sh:predicate is-lawful;
      sh:object "true"^^xsd:boolean; ];

If the age of the data subject (has-age) who owns the personal data of the processing (has-personal-data is-personal-data-of) that is the theme of a GiveConsent individual (has-theme) is lower than (sh:lessThan) the minimal consent age of his/her Member State, and the agent of this GiveConsent individual is the holder of the data subject's parental responsibility (has-holder-of-pr), then the boolean is-lawful is again set to true.

6 CONCLUSIONS
Reified I/O logic is a recent deontic logical framework explicitly designed to handle natural language semantics, i.e., to represent norms occurring in existing legislation such as the GDPR. So far, the research in reified I/O logic has focused only on knowledge representation issues, specifically on how to use the formalism for representing the contextual meaning of norms [3].

On the other hand, this paper is the first attempt to investigate computational issues in reified I/O logic, specifically how to represent the reified I/O logic if-then rules in a computable machine-readable format fit to enforce compliance checking. This paper proposed to model regulative rules as SHACL shapes and constitutive rules as SHACL rules. SHACL shapes and rules are applied to RDFs/OWL models that describe states of affairs.
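Operationally, the pipeline of Sections 4 and 5 amounts to: execute the SHACL rules in ascending sh:order to aggregate derived triples, treat negation-as-failure as the absence of an assertion under the closed world assumption, and then validate the shape. The following plain-Python sketch mimics rules (7)-(9) and shape (6) over a toy triple store. All individuals, ages, and the triple-store representation are invented for illustration (rule (10), the parental-consent case, is omitted); a real implementation would run a SHACL-AF engine over actual RDF graphs.

```python
# Toy rule-then-validate pipeline mirroring SHACL rules (7)-(9) and shape (6).
# Hypothetical data: one adult data subject (alice) and one under-age
# data subject (bob) who gave consent himself.

facts = {
    ("c1", "rdf:type", "GiveConsent"), ("c1", "has-agent", "alice"),
    ("c1", "has-theme", "p1"), ("p1", "rdf:type", "PersonalDataProcessing"),
    ("p1", "has-personal-data", "d1"), ("d1", "is-personal-data-of", "alice"),
    ("alice", "has-age", 30), ("alice", "has-member-state", "IT"),
    ("IT", "has-min-consent-age", 14),
    ("c2", "rdf:type", "GiveConsent"), ("c2", "has-agent", "bob"),
    ("c2", "has-theme", "p2"), ("p2", "rdf:type", "PersonalDataProcessing"),
    ("p2", "has-personal-data", "d2"), ("d2", "is-personal-data-of", "bob"),
    ("bob", "has-age", 12), ("bob", "has-member-state", "IT"),
}

def objs(kb, s, p):
    return [o for (s2, p2, o) in kb if s2 == s and p2 == p]

def path(kb, s, *props):
    """Follow a SHACL-like property path, returning the first value found."""
    for p in props:
        vals = objs(kb, s, p)
        if not vals:
            return None
        s = vals[0]
    return s

derived = set(facts)  # rules aggregate triples here, not in the reference facts
consents = sorted(s for (s, p, o) in facts
                  if p == "rdf:type" and o == "GiveConsent")

for g in consents:    # rule (7), sh:order 0: aggregate the minimal consent age
    age = path(derived, g, "has-theme", "has-personal-data",
               "is-personal-data-of", "has-member-state", "has-min-consent-age")
    if age is not None:
        derived.add((g, "has-min-consent-age", age))

for g in consents:    # rule (8), sh:order 1: flag under-age consent
    min_age = path(derived, g, "has-min-consent-age")
    agent_age = path(derived, g, "has-agent", "has-age")
    theme = path(derived, g, "has-theme")
    if None not in (min_age, agent_age, theme) and agent_age < min_age:
        derived.add((theme, "rdf:type", "exceptionAgeDS"))

for g in consents:    # rule (9), sh:order 2: naf = absence of the exception triple
    theme = path(derived, g, "has-theme")
    if theme and (theme, "rdf:type", "exceptionAgeDS") not in derived:
        derived.add((theme, "is-lawful", True))

# shape (6): every PersonalDataProcessing individual must be lawful
violations = sorted(s for (s, p, o) in facts
                    if p == "rdf:type" and o == "PersonalDataProcessing"
                    and (s, "is-lawful", True) not in derived)
print(violations)  # → ['p2']: bob's processing lacks valid consent
```

Note how the derived triples live in a separate set, matching the non-ontological reading of [25]: the reference facts are never modified, and the aggregated values can be discarded after validation.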
The solution proposed here is an alternative to some recent approaches that model compliance checking on RDFs/OWL ontologies, e.g., [9] and [6].

On the other hand, the present work only represents the first step of a research endeavour aiming at developing a full inference engine for reified I/O logic that implements and integrates all components involved in normative reasoning. Much further work needs to be done in order to obtain a formally well-defined framework, tested on existing industrial use cases.

Further directions of research include the automatic or semi-automatic generation of RDFs/OWL or SHACL assertions from legal texts, possibly via NLP (cf. [4], [5], [18]).

ACKNOWLEDGMENTS
This research has been supported by the Legal Innovation Lab Wales operation within Swansea University's Hillary Rodham Clinton School of Law. The operation has been part-funded by the European Regional Development Fund through the Welsh Government.

REFERENCES
[1] 2012. Web Ontology Language (OWL). Technical Report. W3C. https://www.w3.org/OWL
[2] 2017. Shapes Constraint Language (SHACL). Technical Report. W3C. https://www.w3.org/TR/shacl
[3] Cesare Bartolini, Andra Giurgiu, Gabriele Lenzini, and Livio Robaldo. 2016. Towards Legal Compliance by Correlating Standards and Laws with a Semi-automated Methodology. In BNCAI (Communications in Computer and Information Science, Vol. 765). Springer, 47–62.
[4] G. Boella, L. di Caro, L. Humphreys, L. Robaldo, and L. van der Torre. 2012. NLP Challenges for Eunomos, a Tool to Build and Manage Legal Knowledge. In Proceedings of the International Conference on Language Resources and Evaluation.
[5] Guido Boella, Luigi Di Caro, Daniele Rispoli, and Livio Robaldo. 2013. A System for Classifying Multi-label Text into EuroVoc. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law (Rome, Italy) (ICAIL '13). ACM, New York, NY, USA, 239–240.
[6] Piero A. Bonatti, Luca Ioffredo, Iliana M. Petrova, Luigi Sauro, and Ida Sri Rejeki Siahaan. 2020. Real-time reasoning in OWL2 for GDPR compliance. Artificial Intelligence 289 (2020).
[7] Julien Corman, Juan L. Reutter, and Ognjen Savkovic. 2018. Semantics and Validation of Recursive SHACL. In The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11136), Denny Vrandecic, Kalina Bontcheva, Mari Carmen Suárez-Figueroa, Valentina Presutti, Irene Celino, Marta Sabou, Lucie-Aimée Kaffee, and Elena Simperl (Eds.). Springer, 318–336.
[8] Christophe Debruyne, Harshvardhan J. Pandit, Dave Lewis, and Declan O'Sullivan. 2019. Towards Generating Policy-Compliant Datasets. In 13th IEEE International Conference on Semantic Computing, ICSC 2019, Newport Beach, CA, USA, January 30 - February 1, 2019. IEEE, 199–203.
[9] Enrico Francesconi and Guido Governatori. 2019. Legal Compliance in a Linked Open Data Framework. In Legal Knowledge and Information Systems - JURIX 2019: The Thirty-second Annual Conference, Madrid, Spain, December 11-13, 2019 (Frontiers in Artificial Intelligence and Applications, Vol. 322), Michal Araszkiewicz and Víctor Rodríguez-Doncel (Eds.). IOS Press, 175–180.
[10] G. Governatori, F. Olivieri, A. Rotolo, and S. Scannapieco. 2013. Computing Strong and Weak Permissions in Defeasible Logic. Journal of Philosophical Logic 6, 42 (2013), 799–829.
[11] Jörg Hansen. 2014. Reasoning about permission and obligation. In David Makinson on Classical Methods for Non-Classical Problems, S. O. Hansson (Ed.). Outstanding Contributions to Logic, Vol. 3. Springer, 287–333.
[12] J.R. Hobbs and A.S. Gordon. 2017. A Formal Theory of Commonsense Psychology: How People Think People Think. Cambridge University Press.
[13] J. R. Hobbs. 2008. Deep Lexical Semantics. In Proc. of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2008). Haifa, Israel.
[14] R. Kowalski and M. Sergot. 1986. A Logic-based Calculus of Events. New Generation Computing 4, 1 (1986), 67–95.
[15] Tomer Libal and Alexander Steen. 2020. Towards an Executable Methodology for the Formalization of Legal Texts. In Logic and Argumentation - Third International Conference, CLAR 2020, Hangzhou, China, April 6-9, 2020, Proceedings (Lecture Notes in Computer Science, Vol. 12061), Mehdi Dastani, Huimin Dong, and Leon van der Torre (Eds.). Springer, 151–165.
[16] David Makinson and Leendert van der Torre. 2001. Constraints for input/output logics. Journal of Philosophical Logic 30, 2 (2001), 155–185.
[17] David Makinson and Leendert W. N. van der Torre. 2000. Input/Output Logics. Journal of Philosophical Logic 29, 4 (2000), 383–408.
[18] Rohan Nanda, Luigi Di Caro, Guido Boella, Hristo Konstantinov, Tenyo Tyankov, Daniel Traykov, Hristo Hristov, Francesco Costamagna, Llio Humphreys, Livio Robaldo, and Michele Romano. 2017. A Unifying Similarity Measure for Automated Identification of National Implementations of European Union Directives. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law (ICAIL 2017). Association for Computing Machinery.
[19] Monica Palmirani, Michele Martoni, Arianna Rossi, Cesare Bartolini, and Livio Robaldo. 2018. PrOnto: Privacy Ontology for Legal Compliance. In Proceedings of the 18th European Conference on Digital Government (ECEG).
[20] Harshvardhan Jitendra Pandit, Declan O'Sullivan, and Dave Lewis. 2018. Exploring GDPR Compliance Over Provenance Graphs Using SHACL. In Proc. of the Posters and Demos Track of the 14th International Conference on Semantic Systems (SEMANTiCS 2018), Vienna, Austria, September 10-13, 2018 (CEUR Workshop Proceedings, Vol. 2198), Ali Khalili and Maria Koutraki (Eds.).
[21] Harshvardhan J. Pandit, Declan O'Sullivan, and Dave Lewis. 2019. Test-Driven Approach Towards GDPR Compliance. In Semantic Systems. The Power of AI and Knowledge Graphs, Maribel Acosta, Philippe Cudré-Mauroux, Maria Maleshkova, Tassilo Pellegrini, Harald Sack, and York Sure-Vetter (Eds.). Springer International Publishing, 19–33.
[22] Xavier Parent and Leon van der Torre. 2014. Aggregative Deontic Detachment for Normative Reasoning. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference, KR 2014, Vienna, Austria, July 20-24, 2014.
[23] Xavier Parent and Leendert van der Torre. 2014. "Sing and Dance!". In Deontic Logic and Normative Systems, Fabrizio Cariani, Davide Grossi, Joke Meheus, and Xavier Parent (Eds.). Springer International Publishing, 149–165.
[24] Paolo Pareti, George Konstantinidis, Fabio Mogavero, and Timothy J. Norman. 2020. SHACL Satisfiability and Containment. In The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference, Athens, Greece, November 2-6, 2020, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12506), Jeff Z. Pan, Valentina A. M. Tamma, Claudia d'Amato, Krzysztof Janowicz, Bo Fu, Axel Polleres, Oshani Seneviratne, and Lalana Kagal (Eds.). Springer, 474–493.
[25] Paolo Pareti, George Konstantinidis, Timothy J. Norman, and Murat Sensoy. 2019. SHACL Constraints with Inference Rules. In The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11778), Chiara Ghidini, Olaf Hartig, Maria Maleshkova, Vojtech Svátek, Isabel F. Cruz, Aidan Hogan, Jie Song, Maxime Lefrançois, and Fabien Gandon (Eds.). Springer, 539–557.
[26] L. Robaldo. 2010. Independent Set Readings and Generalized Quantifiers. Journal of Philosophical Logic 39, 1 (2010), 23–58.
[27] L. Robaldo. 2011. Distributivity, Collectivity, and Cumulativity in terms of (In)dependence and Maximality. Journal of Logic, Language, and Information 20, 2 (2011), 233–271.
[28] L. Robaldo, C. Bartolini, M. Palmirani, A. Rossi, M. Martoni, and G. Lenzini. 2020. Formalizing GDPR Provisions in Reified I/O Logic: the DAPRECO Knowledge Base. Journal of Logic, Language, and Information 29, 4 (2020).
[29] L. Robaldo and X. Sun. 2017. Reified Input/Output Logic: Combining Input/Output Logic and Reification to Represent Norms Coming from Existing Legislation. Journal of Logic and Computation 7, 8 (2017).
[30] X. Sun and L. Robaldo. 2017. On the Complexity of Input/Output Logic. Journal of Applied Logic 25 (2017), 69–88.
Modelling Legal Procedures
Antonino Rotolo
Alma AI, University of Bologna, Bologna, Italy
antonino.rotolo@unibo.it

Clara Smith
Law Faculty, University of La Plata, La Plata, Argentina
claritasmith@gmail.com
ICAIL’21, June 21–25, 2021, São Paulo, Brazil
to the parties by the authority of the court. In between, there are multiple possible and intermediate claims proposed by the plaintiff and multiple possible defences presented by the defendant and, of course, many intermediate decisions taken by the court². Some of these court decisions have a preclusive nature, i.e. they do not allow the plaintiff and/or the defendant to come back to previous points in the procedural flow. Given this ideal structure, the real procedure, i.e. the actions effectively ordered by the authority, corresponds to a path (a subgraph) in the bigger, ideal graph.

² We will restrict ourselves to procedures including only these agents: plaintiff, defendant, and the decision-maker (e.g. the judge, the court, a mediator).

2.1 Structure of Claims
Plaintiff and defendant are both parties in the procedure. A claim is the only available tool a party has for communicating with the court within a procedure, asking for an action or a fact to be considered. Petitions rarely consist of a single request. They are usually organised in the form of one main (preferred) request plus subsidiary requests. This structure constitutes a common guideline. The main reason for this preferred/subsidiary organisation of actions resides in the fact that plaintiff and defendant know that the court may not decide in their favour regarding their main request. Therefore, each party presents to the court a menu of wills for the court's consideration, altogether and at the same instant.

Petitions can be seen, therefore, as prioritised goals. This preference structure outlines a strategic move: I ask the court to order to bring about my preferred state of affairs; if this would or could not be the case, I ask the court to order to bring about this other state of affairs, and so on. Indeed, the goal pursued by the parties is that her/his main request becomes the content of the decision of the court (if not possible, then one of the subsidiaries, as given).

2.2 Resolutions
When the court receives a claim of a party it analyses it, and chooses among the menu of proposed options. The court's chosen option implies that the court discards those options appearing before the one chosen, and that the options after the one chosen are not taken into account, at least at the time. The normal course of the procedure usually indicates that the plaintiff presents the claim as described, then the court resolves, next the defendant defends himself with his own claim, then the court resolves. To each request of the parties the court produces an answer. We call this response a resolution. The claim/resolution chain repeats from the very first claim of the plaintiff to the judgment.

A resolution always has a performative nature even if it is a low-impact decision (e.g. "Take this fact into account for later in this procedure"); it is an order of the authority. So far, in a sense, resolutions have an executable nature: the court declares through a resolution which actions are to be done. The resolution, thus, has to be executed. Suppose I am sued and I defend myself with the claim of paying with no interest and, subsidiarily, of paying with minimum interest. (My claim operates as the input to the court's decision.) Suppose next that the court sentences me to pay with the minimum interest. Then I, the obliged agent, have to comply with the judgment by effectively and spontaneously paying. With my payment the judgment is considered to be "executed" (and "consumed" in a technical sense w.r.t. that tort.) But, if I do not pay, by the force of law I may be compelled to do so through a brief secondary process known as judgement execution.

3 FORMALISING LEGAL PROCEDURES
To represent legal procedures we will use a multi-agent variant of the basic core of Propositional Dynamic Logic (PDL) [3], enriched with the preference operator ⊗ for denoting preferences among procedural actions [1]. The key reason for using PDL is that, in the procedural law domain, claims and resolutions indeed resemble programs to be executed. Requests or proposed actions are organised in a preference order. Resolutions have their own dynamics of execution (either spontaneously by the one obliged and/or by force of law).

3.1 Syntax
Let Ag be a set of agents. The language L consists of a set PROP = {A, B, C, . . .} of countably many proposition symbols, a set P = {αi | i ∈ Ag} of countably many atomic programs, which we call atomic procedural actions or atomic procedures, the usual boolean operators, the program constructors ;, ∪, and ⊗ = {⊗i | i ∈ Ag}, and the modality [Π] for any procedure Π. A procedure Π is ⊗-free iff ⊗ does not occur in it.

So, formally, expressions of the language are defined as follows:

p ::= A | ¬p | p1 ∧ p2 | ⟨Π⟩p    with    Π ::= αi | Π1 ; Π2 | Π1 ∪ Π2 | Πi1 ⊗i · · · ⊗i Πin

where A ∈ PROP, αi ∈ P, and Πi1, . . . , Πin are ⊗-free.

We usually deal with at least three types of agents; let us denote them p, d, k, representing the plaintiff, the defendant, and the court. Propositional letters denote, as usual, states of affairs. Complex formulas are built using classical boolean connectives as expected. As usual, we also have an infinite collection of [Π] operators where Π is a (lawful) procedure. In the simplest case, we may have [αi], an atomic procedure for the agent i; hence, [αi]A is a formula that reads "every execution of α by i from the present state leads to a state where A is true". The dual assertion ⟨αi⟩A, such that ⟨αi⟩A ≡ ¬[αi]¬A, states that "some execution of α by i from the present state leads to a state where A is true". Complex procedures are intuitively defined from fixed basic atomic procedures as follows:

(Sequence) if Π1 and Π2 are procedures then Π1 ; Π2 ("do Π1 followed by Π2") is a procedure,
(Choice) if Π1 and Π2 are procedures then Π1 ∪ Π2 ("do Π1 or Π2, non-deterministically") is a procedure,
(Preference) if Π1 and Π2 are procedures, so are (Π1 ⊗p Π2), (Π1 ⊗d Π2), and (Π1 ⊗k Π2), meaning "agent p prefers doing Π1, but if not then p prefers doing Π2" (resp. for d, k).

Notice that we do not use the usual program constructor ∗, which models in PDL the execution of a program a non-deterministically chosen finite number of times. Although it would be straightforward to use it here, we can ignore this constructor for our specific purpose.

The crucial variation w.r.t. the original use of ⊗ in [1] is that, in their work, the authors interpret an expression [a ⊗ b]A as saying that a is the most preferred state of affairs, and if a is not the case then b is preferred. In the present work we interpret ⊗ as a preference operator among procedural actions.
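The syntax above maps directly onto an algebraic data type. The following Python sketch is purely illustrative (the encoding and all names are ours, not the paper's); it represents procedures as terms and checks the ⊗-freeness side condition that the grammar imposes on the options of a preference procedure:

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Procedure terms: alpha_i | P1 ; P2 | P1 ∪ P2 | P1 ⊗i ... ⊗i Pn
@dataclass(frozen=True)
class Atom:            # an atomic procedural action, relativised to an agent
    action: str
    agent: str         # e.g. "p" (plaintiff), "d" (defendant), "k" (court)

@dataclass(frozen=True)
class Seq:             # P1 ; P2  ("do P1 followed by P2")
    first: "Proc"
    second: "Proc"

@dataclass(frozen=True)
class Choice:          # P1 ∪ P2  ("do P1 or P2, non-deterministically")
    left: "Proc"
    right: "Proc"

@dataclass(frozen=True)
class Pref:            # P1 ⊗i ... ⊗i Pn, relativised to one agent i
    agent: str
    options: Tuple["Proc", ...]  # the grammar requires these to be ⊗-free

Proc = Union[Atom, Seq, Choice, Pref]

def otimes_free(pi: Proc) -> bool:
    """A procedure is ⊗-free iff no ⊗ (Pref node) occurs in it."""
    if isinstance(pi, Atom):
        return True
    if isinstance(pi, Seq):
        return otimes_free(pi.first) and otimes_free(pi.second)
    if isinstance(pi, Choice):
        return otimes_free(pi.left) and otimes_free(pi.right)
    return False       # a Pref node: not ⊗-free

# [alpha_p ⊗k beta_d]: the court prefers p's proposal, subsidiarily d's.
resolution = Pref("k", (Atom("alpha", "p"), Atom("beta", "d")))
assert all(otimes_free(o) for o in resolution.options)  # side condition holds
```

Making the terms frozen dataclasses gives structural equality for free, which is convenient when later comparing resolutions.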
Remark 3.1. We read αp as "α as proposed by p". For example, the formula [αp ⊗k βd]A is to be read: "the court prefers that p's proposal is done; if that is not procedurally possible then d's proposal must be carried out (leading to a state of affairs where A holds)".

Remark 3.2. Resolutions are expressions in which the procedure subexpression is one relativised to the court (⊗k). Within a (lawful) procedure, obligations are always imposed by the agent that holds the legal power (e.g. the court, the third neutral). When the court speaks, it speaks in the form of an obligation, which is added as a milestone to the procedure. A resolution has the force of law. E.g.: [αp ⊗k βp]A implies the deontic expression O[αp ⊗k βp]A³.

³ "O" is a minimal non-normal deontic operator for representing obligations [2].

In this work we state that obligations such as OA follow from resolutions. Later in this paper we address the detachment of O-expressions from court resolutions.

Example 3.3. The formulas [αp ⊗p βp]A and [αp ⊗k βp]A are read: "The plaintiff proposes α; subsidiarily he asks for β" and "The court resolves that α as proposed by p is to be done; subsidiarily β as proposed by p is to be done" (leading to a state of affairs where A holds), respectively.

The formula [(αp ∪ βp) ⊗k γp]A has in its turn the intuitive reading: "the court decides that either αp or βp is to be performed, γp being subsidiary (leading to a state of affairs where A holds)"⁴.

⁴ We can informally refer to procedures either by mentioning or not the state of affairs they lead us to.

3.2 Semantics
Let us now present an adequate semantics for our logic. The idea is to extend the standard semantics for PDL with a relational version of the one for ⊗. Multi-relational frames for ⊗-logics are based on the idea of directly ranking relations.

Definition 3.4 (Procedural frame). A procedural frame is a structure F = ⟨W, RΠ, RO, ≺⟩, where
• W is a non-empty set of possible worlds,
• RΠ is a countable set {Rαi | αi ∈ P} of binary relations over W; we inductively extend RΠ, for each non-atomic procedure Π, as follows:
  – wRΠ1;Π2v iff there exists a world z such that wRΠ1z and zRΠ2v;
  – wRΠ1∪Π2v iff wRΠ1v or wRΠ2v;
• RO is a countable set of binary relations over W,
• ≺ = {≺i | i ∈ Ag} is a collection of strict partial orders over RΠ.

Definition 3.5 (Procedural model). A procedural model is a structure M = ⟨F, V⟩, where
• F is a procedural frame, and
• V is a valuation function, V : PROP → 2^W.

Definition 3.6 (Preference path). Let RΠ1, . . . , RΠn ⊆ RΠ. We write RΠ1 ≺i · · · ≺i RΠn to express that, for each j where 1 ≤ j < n, RΠj ≺i RΠj+1. We call RΠ1 ≺i · · · ≺i RΠn an i-preference path from RΠ1 to RΠn of length n.

Definition 3.7. Let M be a procedural model M = ⟨W, RΠ, RO, ≺, V⟩, and let Ri(w) := {v ∈ W | wRiv} and ∥A∥V := {w ∈ W | |=Vw A}. The valuation function for M is as follows:
• as usual for atoms and boolean conditions;
• w |= OA iff ∃R ∈ RO such that R(w) = ∥A∥V [2];
• w |= [Π]A iff ∀v ∈ W, if wRΠv then v ∈ ∥A∥V;
• w |= [Π1 ⊗i · · · ⊗i Πn]A iff w |= [Π1 ∪ · · · ∪ Πn]A and there exists a preference path RΠ1 ≺i · · · ≺i RΠn.

To sum up, this semantics combines the one for classical modal logics proposed in [2] and the standard one for PDL, plus ranking accessibility relations for procedures.

Note that the "ideal" graph intuitively described in the first section is a multigraph. Vertices are states of affairs and arcs are procedures, relativised to agents. For example, suppose that starting from the state of affairs v, procedure αp leads us to state w1, procedure µd leads us to state w2, procedure βp leads us to state w3 and procedure γp to state w4. We may e.g. write the plaintiff's request as [αp ⊗p βp ⊗p γp]A, the defendant's defence as µd, and a court's decision as e.g. [αp ⊗k µd]A (or αp ⊗k µd A).

[Figure: a multigraph with arcs αp: v → w1, µd: v → w2, βp: v → w3, and γp: v → w4, where A holds at w1, w2, w3 and w4.]

We assume that ≺ is a collection of strict partial orders, i.e., of relations which are irreflexive, transitive and asymmetric: one cannot validate a formula such as [Π1 ⊗i Π1]A. Transitivity and asymmetry are adopted as expected, and this ensures the validity of

[Π1 ⊗i · · · ⊗i Πn]A ≡ [Π1 ⊗i · · · ⊗i Πk−1 ⊗i Πk+1 ⊗i · · · ⊗i Πn]A, where Πj = Πk, j < k    (Contraction)

Lemma 3.8. The axiom (Contraction) is valid in the class of procedural frames.

4 AXIOMS AND PRINCIPLES
4.1 Consistency
First of all, the fact that the court is the authority and that its resolutions have the force of law can be formalised as follows:

[Π1 ⊗k · · · ⊗k Πj ⊗k · · · ⊗k Πm ⊗k · · · ⊗k Πn]A → ¬[Π1 ⊗x · · · ⊗x Πm ⊗x · · · ⊗x Πj ⊗x · · · ⊗x Πn]A, with x ∈ Ag    (Consistency)

In the simplest case, [Π1 ⊗k Π2]A → ¬([Π2 ⊗x Π1]A).
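The simplest instance of (Consistency) lends itself to a mechanical check over a log of issued preference expressions. A Python sketch with an illustrative encoding of our own (agents as letters, procedures as tuples of action names); it flags any expression that reverses a pair the court has already ordered, for the same goal:

```python
def violates_consistency(expressions):
    """Check the (Consistency) axiom over a log of preference expressions.
    Each entry is (agent, options, goal): [P1 ⊗agent P2 ⊗agent ...]goal.
    Once the court ("k") has issued [.. Pj .. Pm ..]A, no agent (the court
    included) may issue an expression ordering Pm before Pj for the same A."""
    court_orders = [(opts, goal) for agent, opts, goal in expressions
                    if agent == "k"]
    for opts, goal in court_orders:
        for agent2, opts2, goal2 in expressions:
            if goal2 != goal:
                continue
            # look for a pair the court ordered Pj before Pm,
            # but which appears as Pm before Pj in opts2
            for j, pj in enumerate(opts):
                for pm in opts[j + 1:]:
                    if pj in opts2 and pm in opts2 and \
                       opts2.index(pm) < opts2.index(pj):
                        return True
    return False

# The court resolves [alpha ⊗k beta]A; the defendant's [beta ⊗d alpha]A
# would reverse the court's order and is ruled out by (Consistency).
log = [("k", ("alpha", "beta"), "A"), ("d", ("beta", "alpha"), "A")]
```

The check runs over every pair of options in a court resolution, which matches the general form of the axiom rather than only its two-option instance.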
This axiom settles procedural consistency with respect to preferences, i.e., that a resolution always prevails over any preferences and cannot imply a contrary preference by any other agent in the process, not even by means of a court's contradictory resolution. For example, if the court resolves that action αi is to be done (and, subsidiarily, βk), then no other agent should prefer them in the converse order, not even the court itself.

Although this axiom stands for a consistency axiom, its structure cannot be imposed between the plaintiff's and the defendant's respective strategies, as we will see next.

4.2 Procedural Bilaterality
seeking or willing the same state of affairs (and e.g. the situation may be used by the court as a basis to analyse a call for agreement). Suppose now that we have the following model M2:

[Figure: model M2 with arcs αp: v → w1 and βd: v → w2, where A and C hold at w1, and A and D hold at w2.]

M1 and M2 are indeed different models (they are indeed different graphs). Even though both models are different, if αp ≺k βd we have
actions. The court should decide θd ⊗k dp A as it leads to the expected state of affairs A, but with less court activity.

A variant of the Principle of Procedural Economy is the Concentration Principle, which means that agents ought to present together all that can be done in one step. The presentation of the parties' proposals as goal preferences, i.e. as a main goal and subsidiary goals (as in [αp ⊗p βp ⊗p γp]A), is an example of the application of this principle.

5 SPECIFIC RULES
5.1 Detachment of Obligations
From a resolution we can derive an obligation. Such an inference rule reflects the lawful reading of expressions such as [Π1 ⊗k Π2]A. If we have α ⊗k β, then α is obligatory in achieving A and β is to be done in case α is not possible, because the authority imposes so. Let us use O as a non-normal minimal modal deontic operator. Technically, then, from a ⊗k-expression we can derive a formula which is in the scope of O. The general form of this high-level reasoning rule is:

[Π′1 ⊗k · · · ⊗k Π′n ⊗k Π′ ⊗k Π′′1 ⊗k · · · ⊗k Π′′m]A ∧ (⟨Π′1⟩¬A ∧ · · · ∧ ⟨Π′n⟩¬A) → O[Π′]A    (O-detachment)

which should be intuitively understood as: "the court's preference that holds is obligatory", reflecting the intuitive reading of ⊗k.

5.2 Consistency Check w.r.t. Court Resolutions
Suppose that we have the following court resolution: [α ⊗k β ⊗k γ]A. Assume that, later, the court also resolves that [β ⊗k ϕ]A. By application of the O-detachment rule we get O[α]A from the first resolution, and we also get O[β]A from the second resolution. Both detachments give us O[α]A ∧ O[β]A. From the procedural point of view this is not what is expected because, according to the first resolution given, we should get O[α]A but not O[β]A (unless β is to be done because α cannot). This consistency conflict arises when we analyse the resolutions in the framework of a lawful procedure. The intuition behind the solution to this (lawful) inconsistency is that there is a form of procedural reasoning that consists of the temporal ordering of court resolutions and, following, the alignment of the forthcoming resolutions with respect to the first resolution given. Suppose the given resolutions are:

[α ⊗k β ⊗k γ]A    [β ⊗k ϕ]A    [ϕ ⊗k ψ]A.

Note that all three lead to the same state of affairs. We set the first one as the conductor resolution, then align the rest of them according to the preferred option in each resolution, as follows:

(Conductor) Resolution 1: [α ⊗k β ⊗k γ]A
Resolution 2: [β ⊗k ϕ]A
Resolution 3: [ϕ ⊗k ψ]A
Resultant Resolution 4: [α ⊗k β ⊗k (γ ∪ ϕ) ⊗k ψ]A

We call this rule preference alignment. The intuition behind it is the following: indeed, court resolutions have a temporal ordering. According to this ordering, and starting with the first given resolution, the remaining resolutions are aligned to the conductor resolution with respect to their respective preferred option. That is, we put the preferred option of every resolution in the column where the same option appears in the conductor resolution. Then all the given resolutions are rejoined in one resulting main resolution, column by column. In this course, the component options in a column are gathered together with the ∪ (choice) PDL operator.

A preference alignment algorithm is as follows:

procedure preference_alignment(list_of_resolutions): result
begin
  align resol_1 as the conductor resolution
  for i := 2 to n do
    if (preferred action in resol_i is in resol_1) then
      position := position in resol_1 of that preferred action
      align resol_i below resol_1 starting from position
  end;
  i := 1
  result := false
  while (there are actions in column i) do
  begin
    union := connect with ∪ all actions from resol_1 to resol_n in column i
    result := result ⊗k union
    i := i + 1
  end;
end;

6 SUMMARY
A legal procedure in court proceedings is the formal way in which civil proceedings are conducted. A legal procedure was defined as a chain of consecutive actions which has as its (final) goal the decision/solution of a conflict, i.e., as a finite sequence of actions in which the last action is (the creation of) a(n individual) norm, usually an obligation.

We argued that one peculiar aspect of proceedings is that some types of procedure in the process are prioritised. Priorities indeed derive from individual preferences of the parties in the process, or they can also follow from objective ordering requirements of procedures. In order to model legal procedures, in this paper we technically added obligations and a preference operator for procedural actions to a multi-agent version of PDL.

This paper presents preliminary research. For example, complexity features and a full investigation of the effective application of the proposed machinery are a matter of future work.

REFERENCES
[1] Erica Calardo, Guido Governatori, and Antonino Rotolo. Sequence semantics for modelling reason-based preferences. Fundam. Inform., 158(1-3):217–238, 2018.
[2] Erica Calardo and Antonino Rotolo. Variants of multi-relational semantics for propositional non-normal modal logics. J. Appl. Non Class. Logics, 24(4):293–320, 2014.
[3] David Harel, Jerzy Tiuryn, and Dexter Kozen. Dynamic Logic. MIT Press, Cambridge, MA, USA, 2000.
[4] P. Luiso. Diritto processuale civile. Giuffrè, 2019.
[5] Johan van Benthem. Program constructions that are safe for bisimulation. Studia Logica, 60(2):311–330, 1998.
[6] Joachim Zekoll. Comparative civil procedure. In Mathias Reimann and Reinhard Zimmermann, editors, The Oxford Handbook of Comparative Law. Oxford University Press, 2012.
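As an editorial aid to the paper above, its Section 5.2 alignment pseudocode can be read as the following Python sketch. The encoding is our interpretation, not the authors': a resolution is a list of action names ordered by preference, and — following the worked example, where Resolution 3 aligns at the φ introduced by Resolution 2 — a new resolution is aligned against all previously placed actions, not only the conductor. A column holding several actions stands for their ∪-choice.

```python
def preference_alignment(resolutions):
    """Align a temporally ordered list of resolutions (lists of action
    names, most preferred first) and rejoin them column by column.
    Returns the resultant resolution as a list of columns."""
    columns = [[a] for a in resolutions[0]]          # the conductor
    for res in resolutions[1:]:
        # column where the resolution's preferred action already appears;
        # if it appears nowhere, append after the existing columns
        offset = next((i for i, col in enumerate(columns) if res[0] in col),
                      len(columns))
        for j, action in enumerate(res):
            k = offset + j
            if k == len(columns):
                columns.append([])
            if action not in columns[k]:             # gather options per column
                columns[k].append(action)
    return columns

# [alpha ⊗k beta ⊗k gamma]A, [beta ⊗k phi]A, [phi ⊗k psi]A
aligned = preference_alignment([["alpha", "beta", "gamma"],
                                ["beta", "phi"],
                                ["phi", "psi"]])
# -> [["alpha"], ["beta"], ["gamma", "phi"], ["psi"]],
#    i.e. the resultant resolution [alpha ⊗k beta ⊗k (gamma ∪ phi) ⊗k psi]A
```

On the example from the text this reproduces Resultant Resolution 4 exactly.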
Automatic Extraction of Amendments from Polish Statutory Law
Aleksander Smywiński-Pohl, Mateusz Piech, Zbigniew Kaleta
AGH University of Science and Technology, Kraków, Poland
{apohllo,mpiech,zkaleta}@agh.edu.pl

Krzysztof Wróbel
Jagiellonian University, Kraków, Poland
krzysztof@wrobel.pro
ABSTRACT
The article discusses the problem of automatic detection of amendments found in Polish statutory law. We treat the problem as a token-classification task and we introduce a scheme constructed by analysis of more than 200 amending bills. We apply recent neural architectures such as BERT and BiRNN to the task of token classification. The achieved results of all models are very high, as the micro-average F1 score ranges from 96.3% to 98.2% for BiRNN. The presented solution is a first step towards fully automatic structuring and application of amendments in Polish statutory law.

CCS CONCEPTS
• Computing methodologies → Information extraction; • Applied computing → Law.

KEYWORDS
amendment extraction, information extraction, named entity recognition, legal information system, Polish statutory law

ACM Reference Format:
Aleksander Smywiński-Pohl, Mateusz Piech, Zbigniew Kaleta, and Krzysztof Wróbel. 2021. Automatic Extraction of Amendments from Polish Statutory Law. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466141

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06…$15.00
https://doi.org/10.1145/3462757.3466141

1 INTRODUCTION
The legal system in Poland is based on statutory law passed by the Polish parliament. The laws are published in the Journal of Laws of the Republic of Poland and are available via the ISAP¹ website, which distributes the laws as PDF files with metadata available as HTML pages. Although the primary law is linked with all its amendments, the user can only list all laws (but not the amendments specific to that law) that modified the given document. In the case of the most important laws, such as the civil and criminal codes, the consolidated text is also published yearly in the form of PDF files, with specific regulations marked as recently amended. All in all, if the user tries to use that website to track the changes of any law which underwent a number of amendments, the website is not very useful.

¹ http://isap.sejm.gov.pl

Taking into account the fact that the source texts of the amendments are weakly structured (i.e. PDF files) and the fact that the number of amendments is tremendous, we investigate the possibility of applying a machine learning approach to the problem of the automatic structuring of the text that would allow for converting amending bills into structured data.

The contribution of the article is as follows. We start by casting the problem of amendment extraction as token classification. We introduce a scheme devised for detecting the amendments based on an analysis of a large number of Polish amending laws. Then we present three approaches to the problem of amendment extraction: one based on rules and two others based on machine learning. We pay special attention to the pre-processing of the data since it has a huge impact on the obtained results. We discuss the related work in Section 5. We conclude the article with prospects for future research.

Table 1: The types of entities annotated in the amendments.

Amendment type
  add_content  remove_content  change_content
  add_unit     remove_unit     change_unit    change_id
Identifier
  new_id  amended_id  preceding_id
Content
  new_content  old_content  preceding_content

2 ANNOTATION CORPUS
For this research we have created a corpus of 242 bills of Polish statutory law from the years 1993–2018. Table 1 shows the types of the textual units identified by analyzing the sample of bills. The primary element used to define the amendment is its type. The set of required elements and in some cases their exact meaning are dependent on the type of amendment.

The amendments related to units (with the unit suffix) are the basic ones. Thus they require the id of the amended unit, the new content (in the case of addition and change), and the id of the preceding unit (in the case of addition).

The amendments related to the change in content (with the content suffix) are much more sophisticated, since they change the content of the unit only partly. The most common case is addition, substitution or removal of a short phrase or a punctuation mark. But
this action might be much more sophisticated, such as the substitution of a certain phrase en masse, e.g. replacing „minister of sport” with „minister competent for sport”.

The last type of amendment – change_id – is related to the specific case when a part of the text receives a new identifier: mainly when an article containing one, unnumbered paragraph is extended with a new paragraph, and thus the existing paragraph receives an id.

Regarding the identifiers, we distinguish between the id of the element when it is changed or removed (amended_id) and when it is added, since in the second case the id of the preceding unit has to be indicated and we have to distinguish between them. It was assumed that only the most specific part of the identifier is annotated.

The annotation was performed by two independent annotators and then reviewed by a super-annotator with the help of the Inforex system [11]. The corpus contains roughly four hundred thousand tokens and its size is comparable with the manually annotated part of the National Corpus of Polish (1.2 million) [14]. Table 2 summarizes the counts for the different types of tokens. The dominating class are the tokens without annotation (230 thousand tokens). But still almost half of the corpus contains amendment-related annotations. This is not a surprise, since the majority of the enacted bills are amendments to existing laws. The second most popular annotation types are related to content (more than 170 thousand tokens). This is expected, since these tokens constitute the actual contents of the new and changed regulations. The total number of individual amendments – partly reflected by the first group, since the amendment type is usually indicated by two tokens – is roughly 2.5 thousand (this is an estimation, since there are cases when one amendment type relates to changes of more than one element).

Table 2: The counts of tokens for different types of annotations.

Annotation type    Count
Amendment type     5 377
Identifier         9 747
Content          171 937
No annotation    230 300
Total            417 361

We have divided the annotated documents into three subsets: the training (~80%), the development (~10%) and the test (~10%) set. We imposed two requirements when splitting the documents. First of all, since the structure of the bill is crucial for the extraction of amendments, we have split whole documents rather than their fragments, such as individual provisions. As a second requirement, we have paid special attention to keeping a similar distribution of tags in each of the subsets. This wasn't straightforward, due to the first requirement, but we managed to keep each type of tag in the 62%–85%, 8%–23% and 6%–23% range for the training, the development and the test subsets respectively.

3 AMENDMENT DETECTION
The detection of the amendments – treated as a NER-like problem – was tested using the following approaches. The first one was a rule-based approach employing regular expressions (a baseline, henceforth called Rules), the second one was an approach based on the transformer architecture [19] with bi-directional encoding [5] (henceforth called BERT), and the third one was an approach based on a bidirectional recurrent neural network (BiRNN) [16] with Long Short-Term Memory cells [8] (henceforth called BiRNN). We have tested three transformer models pre-trained on different corpora: HerBERT on a large Polish corpus [15], RoBERTa on an English corpus, and XLMR on a multilingual corpus [4]. The HuggingFace Transformers library [21] was used in that setting. The bidirectional RNN employed a character-level language model (cLM) and was pre-trained on Polish texts from the law domain. We have chosen the FLAIR framework [2] for the pre-training and fine-tuning of that model.

3.1 Rules
In order to fairly compare the system with the other approaches, we have created the rules looking only at the training and development datasets. The size of the "model" compared to the size of both the RNN and HerBERT is very favorable for that approach. We were unable to define a regular expression for the preceding_content tag (lack of common phrases in the examples), thus it is not detected by this approach. A tag associated with a regex is assigned to the text span matched with that regex, and the O tag is assigned to the rest of the text. In the case of multiple regular expressions matching overlapping text spans, the more popular class, according to the training and development sets, is assigned. The matching was run from left to right, thus it is continued right after the phrase matching the assigned tag.

3.2 BERT
The second approach uses a transformer model for Polish called HerBERT [15]. The model was trained on a large number of Polish texts and it uses the RoBERTa pre-training optimizations [10]. The model achieves SOTA results on the KLEJ benchmark [15]², a collection of Polish Natural Language Processing (NLP) tasks resembling the GLUE benchmark [20]. The results on that benchmark are very similar to those of the XLMR model [4], yet HerBERT is better suited for Polish, since the previous model was trained on a very large number of languages (100), while HerBERT was trained only on one language. As a result, HerBERT is much smaller and the fine-tuning is faster compared to XLMR. Compared to the previous approach, we have not used any domain-specific texts to reduce the perplexity of the model on texts coming from the law domain. In the reported experiments we have used the large variant of the model.

² Klej in Polish means glue in English.

To measure the impact of language-specific pre-training, we have included two other models: RoBERTa (large) [10] and the largest multilingual model, i.e. XLMR [4]. Both of these models were trained using the same approach, but they were trained for English and 100 languages respectively.

3.3 BiRNN
The last approach reflects a pretty recent SOTA model for NER. Bidirectional RNNs with LSTM cells were very popular until the
transformer model dominated the NLP landscape. Also, the RNNs have one important advantage compared to BERT – they accept inputs of arbitrary length. BERT architectures are limited to 512 subtokens due to memory and time complexity which is quadratic in the length of the input. This is important in the context of amendment detection, since we would prefer a model not requiring a sophisticated text segmentation (at least in the stage of amendment detection), but rather accepting the complete text of the bill. The Flair architecture assumes usage of a character-level language model pre-trained on texts from a given domain. The model was pre-trained on two corpora – the full National Corpus of Polish [13, 14] (approx. 2 billion tokens) and a corpus containing the Polish statutory law and the judgments of Polish courts (approx. 4 billion tokens).

4 EXPERIMENTS
We start the discussion of the experiments by presenting the results for the Rules model. For most of the types of elements the rules work perfectly, yet there is a group of types (4 – new_id, 5 – amended_id, 7 – preceding_id) that are confused for each other. This is the reason why the rules approach (cf. Table 3) gives low scores in terms of micro, macro and weighted average F1 measure. This shows that the identification of the ids is problematic for this approach. We have to stress that we have excluded from the results the false positives related to the O tag, since that would yield a very large number of incorrectly determined identifiers, as they appear in a large number of contexts. Moreover, we were unable to come up with a rule for detecting the 6th category (i.e. preceding_content). We argue that the approach based on rules is not universal enough to easily differentiate between the various types of elements appearing in the amendments.

Table 3: The results of the detection of amendments.

Model      Micro F1   Macro F1   Weighted F1   Support
Rules        69.58      82.03        72.07       1335
HerBERT      97.96      90.81        97.92       1174
RoBERTa      97.69      97.68        97.70       1148
XLMR         96.72      84.62        96.73       1174
BiRNN 1      90.81      80.72        90.60       1773
BiRNN 2      98.20      98.90        98.19       1174

As a second approach, we tested the family of BERT models. These models have a hard limit³ on the length of the input, which amounts to 512 subtokens. It is obvious that the length of bills is longer than that limit, but it might be the case that during training it is not necessary to perform any pre-processing since all phenomena are present at the beginning of the document⁴. We have run two preliminary experiments using the HerBERT model, in order to determine if the pre-processing is needed. In the first, we have submitted the full document as the input. This approach results in

training set). Thanks to that it gave a much better macro-average F1 score (94%), but the micro-average F1 score of 84.36% was far below expectation. That was primarily due to the low outcome for the new_content class: 74.84% F1 score. This result is easy to explain, since the new content usually contains a number of sentences which, classified individually, lack the context required to determine if they are part of the amending or the amended bill (they are quoted in the second case, but the quotation spans a large number of sentences). In fact, the 75% score seems to be very high if we take that phenomenon into account.

As a result, we have decided to perform a more elaborate pre-processing. The algorithm iterated through the lines in the CONLL document. If a new sentence (i.e. an empty line) was detected, it broke the sentence only if the preceding tag was O (i.e. the sentence break was not inside an annotated span). In the other case, it broke the sentence when the first O tag was detected. Such an approach does not ensure that the context is always preserved, but it plays well with the new_content type, since these tokens always form long spans of consecutive tokens.

Yet, during inference such a procedure is impossible to apply, because we do not have the tags we are going to predict. To overcome that problem, we implemented a procedure based on the structure of the bill. We treated the textual content of provisions at any level as independent inputs for the algorithm. In the case the provision of the lowest level was longer than 512 subtokens, it was truncated and the tag assigned to the last token was assigned to the remaining tokens besides the last one (which in most cases is a quotation). Table 3 summarizes the results of an experiment conducted using
a large number of truncated text (more than 60% in the training the models belonging to the BERT family using the more sophisti-
set). In the second approach as input we took individual sentences cated pre-processing. All of them were trained with the same set
(obtained from the Inforex system that uses MorphoDiTa library of hyper-parameters (batch size: 8, epochs: 10, learning rate: 5e-6,
[18] for sentence boundary detection). This approach yielded com- eps: 1e-8, maximum gradient: 1.0, weight decay: 0.0, max sequence
plete coverage of the annotated tokens since the detected sentences length: 512, seed: 0 and F1 score on the development set used for
were always shorter than the 512 limit. early stopping). The results are reported for the test set.
The first approach resulted in 91.58% micro-average F1 score and The first observation relates to the number of testing examples
76.03% macro-average F1 score. The low macro-average score is a re- (support) available for each approach. As it was explained, there
sult of complete ignorance of the types of tags preceding_content were cases when the input had to be truncated. The different number
and remove_content, which were very rare both in the training of examples stems from the fact that the models yield a different
and the testing corpora. This outcome – as expected – falsifies the number of subtokens5 , but the differences are small.
assumption that the pre-processing is not needed. The second observation is that the performance of the models
Regarding the second approach, we observed that it resulted is very good, especially if we look at the micro average scores.
in a larger number of examples (3-fold increase in the case of the XLMR – which is the worst according to that metric – is still almost
3 The limit is hard in the sense that the quadratic memory complexity makes longer 5 HerBERT uses a tokenizer trained on Polish texts, RoBERTa – on English texts and
inputs prohibitively slow to process with the current available hardware. XLMR on texts from 100 languages. Since the vocabulary size is limited, the Polish
4 That would also be a waste of a large number of annotations, yet we believe that it is tokenizer model may yield the lowest number of tokens, since HerBERT and RoBERTa
good to test even the simplest approach. use the same vocabulary size. XLMR uses a dictionary which is 4 times larger.
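The sentence-splitting heuristic used in the elaborate pre-processing (break at an empty line only when outside an annotated span, otherwise defer the break to the first O tag) can be sketched as follows. This is a minimal illustration under our own assumptions about the CoNLL layout (one "token TAG" pair per line, empty line as sentence boundary); the function name and data shapes are ours, not taken from the paper's code:

```python
def split_sentences(conll_lines):
    """Split a CoNLL-style document into model inputs.

    An empty line (detected sentence boundary) triggers a break only
    when the last seen tag is 'O'; otherwise the break is deferred to
    the first 'O'-tagged token, so annotated spans are never cut apart.
    """
    chunks, current = [], []
    pending_break = False  # boundary seen inside an annotated span
    for line in conll_lines:
        if not line.strip():  # empty line = sentence boundary
            if not current:
                continue
            if current[-1][1] == "O":
                chunks.append(current)
                current = []
            else:
                pending_break = True
            continue
        token, tag = line.split()
        if pending_break and tag == "O":
            # the deferred break: the annotated span has just ended
            chunks.append(current)
            current = []
            pending_break = False
        current.append((token, tag))
    if current:
        chunks.append(current)
    return chunks
```

With this heuristic, a sentence boundary that falls inside a long new_content span does not split the span, at the price of occasionally merging the following context into the same input.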
ICAIL’21, June 21–25, 2021, São Paulo, Brazil A. Smywiński-Pohl, M. Piech, Z. Kaleta, K. Wróbel
perfect, yielding a 96.72% micro average F1 score. According to that metric, the best results are obtained by HerBERT, which yields a 1.2 percentage point better result. Yet its macro average result (90.81% F1 score) is rather low compared to the result given by RoBERTa (97.78% F1 score). That result is a bit surprising, especially if we recall that this model is pre-trained for English, while we analyze texts in Polish. A closer inspection of the results shows that the low outcome of HerBERT was due to lower scores for the detection of the content-related tags ({add,remove,change}_content). If we take into account the fact that RoBERTa gives results only 0.27 pp lower than HerBERT in terms of the micro average F1 score, we may conclude that for this problem it seems to be the best option among the family of BERT models, even though it is not pre-trained on Polish texts. This result is particularly interesting since it shows that we could leverage a more recent model pre-trained only for English, especially since pre-training is a very costly procedure.

As the last type of approach, we present the results for the BiRNN model. Although RNNs do not have a hard limit on the input length, to fairly compare their performance with the previous approach we used the same pre-processing strategy. We tested the approach both when the sentences are provided by MorphoDiTa (version 1) and when they are provided by the optimized version of the input pre-processing (version 2). We trained the model with the following hyper-parameters: hidden layer size: 256, max epochs: 150, learning rate: 0.1, mini-batch size: 32, word embeddings: pl-wiki-fasttext-300d-1M [12]. The results of the experiments are given in Table 3. The comparison of the results produced by HerBERT and BiRNN on the input provided by MorphoDiTa shows that – as expected – the number of testing (and training) examples is much higher for BiRNN, since the input is not truncated. The recurrent architecture receives almost 3 times more examples. Interestingly, the model achieves a low value (80.72%) for the macro average F1 score (much lower than HerBERT). Inspection of the individual classes showed that this is caused by 0.0% scores for the new_id and change_id tags, which are rare in the training set. The results for new_content and old_content were also lowering the result, since both of them were below 90%. Yet the results are surprising when compared to the BERT models, since the BiRNN had a much larger number of training instances.

The most interesting result is for the second setting, where the customized pre-processing was applied. Since the input was truncated, the number of training and testing examples was lowered by approx. 33%. Yet this approach gave the best results overall – compared both to the simplified pre-processing and to the family of BERT models. Both the micro (98.20%) and macro (98.90%) average F1 scores were almost perfect, meaning that the model was able to learn all classes. That result shows that even though the approach based on BERT has dominated the landscape of NLP problems, for the amendment extraction problem – at least for the Polish language – RNNs still might be a good alternative.

5 RELATED WORK
The approach used in this article – BIO tagging – is often used for named entity recognition. One of the most commonly used methods for NER is the bidirectional LSTM, often used with word embeddings and similar models. For example, the authors of [2] propose the use of a character-level language model to generate contextual string embeddings. A bidirectional neural model is trained for the task of predicting the next character in the sequence. The hidden states of the forward and backward parts of the model, respectively after and at the beginning of the analyzed word, are concatenated to create a context-aware string embedding. In later work [1] they extend this vector with a second part, which is a function (e.g. element-wise minimum, maximum, or mean) of all embeddings for the same string, including the newest one. One of our solutions (BiRNN) is directly based on that work.

The authors of [22] have created LUKE – a language model based on RoBERTa, where they train the embeddings for entities alongside the embeddings for words. The input for training LUKE is the concatenation of the tokenized sentence and the list of all entities present in this sentence. The training process is similar to BERT and other masked language models – randomly selected parts of the input (in this case words and entities alike) are masked and the model is trained to predict those masked fragments. BERT training is fully unsupervised, so it requires just a plain, unannotated corpus, but LUKE needs an entity-annotated corpus. The authors used the corpus from Wikipedia with good results. LUKE also extends the self-attention mechanism normally used in transformers so that it is entity-aware. Although that approach yields SOTA results for the NER task, it does not apply to our problem, since the entities in our approach are much different from typical entities in NER: they are either short, constant phrases (e.g. is removed, is added), identifiers (e.g. art. 5a, letter b), or long spans of text (the new content of the amended provisions). Besides the first type, which is very easy to detect, they belong to an open set of text phrases and cannot have their own – learned – representations.

The work on automatic processing of legal amendments dates as far back as the work of Timothy Arnold-Moore in 1995 [3]. There are two main approaches to the problem of amendment processing. One is to take two versions of the same legal act and compare them in a diff-like manner. The second approach, presented herein, is to generate an amended version of the legal act by using its previous text and the text of the amending act.

The most common approach is to use syntactic and semantic parsing with a rule-based system. E.g. the authors of [17] use shallow syntactic parsing (chunking) with a battery of finite state automata and a semantic analysis using a compiler based on a specialized grammar. The system also contains an automatic classifier that recognizes three kinds of amending provisions and discards non-amending provisions from further processing. In our work, we expand the set of provisions by providing more fine-grained distinctions and introducing a new type related to the introduction of an identifier.

The authors of [9] treat the task as a slot-filling problem, where the correct frame is chosen based on the verb and its dependents using IF-THEN rules. In the case of multiple solutions, a heuristic is used to pick the best one. They also address the problem of idioms: some complex phrases common in legal documents are rewritten using hand-crafted rules into a form that is easier to process by further stages of the system.

In [7] the authors extend the number of types of modificatory provisions, and the article focuses on temporal modifications – changes
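The gap between micro and macro averages discussed in the experiments (e.g. HerBERT's high micro but low macro score) comes directly from how the two averages are computed. The sketch below illustrates this with hypothetical per-class counts of our own invention (they are not the paper's data):

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def micro_macro_f1(counts):
    """counts maps a class name to its (tp, fp, fn) triple."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)  # pools all counts: frequent classes dominate
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # each class weighs equally
    return micro, macro

# Hypothetical counts: one frequent class handled well, one rare class handled badly.
counts = {"add_content": (980, 10, 10), "remove_content": (1, 4, 4)}
micro, macro = micro_macro_f1(counts)
```

Here the rare class barely moves the micro average but halves the macro average, which is exactly the pattern seen when a model fails on rare tags such as remove_content.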
Automatic Extraction of Amendments from Polish Statutory Law ICAIL’21, June 21–25, 2021, São Paulo, Brazil
to either force or efficacy time. Unlike previous works in this field, they do not process all sentences of the amending act, but instead filter them using regular expressions to increase both accuracy and performance. They also introduce a sliding window to better handle long and complex sentences.

All of the aforementioned systems were devised for Italian legal acts (and, as a result, the Italian language) and use the NIR (Norme in Rete) XML representation of the data.

The authors of [6] describe an attempt to make an automated system for the consolidation of Greek legal acts, based on officially published PDFs. This system uses regular expressions at multiple stages of the process, including recognizing the amendment type (addition, substitution, or deletion of a text portion) and extracting the required data (filling the slots). If the amendment concerns whole structural units, such as paragraphs, the change is applied to the structure of the XML file containing the act. On the subparagraph level, the Python NLTK library is used to break down the paragraph into units of an appropriate level.

6 CONCLUSIONS
We have presented a novel algorithm for detecting amendments found in Polish statutory law. The primary difference between this article and the previous work is the application of neural models to the detection of amendment constituents. By treating it as a token classification problem it is possible to use the most recent SOTA models from the BERT family as well as the somewhat older BiRNNs. It turns out that using such models provides very accurate results. This observation is supported by experiments conducted with the most recent neural models: HerBERT, RoBERTa, XLMR, and BiRNN. All of them yielded at least 96% for the micro average measure. Yet, contrary to our expectations, the best performance among the family of BERT models was achieved by RoBERTa (a model trained for English), while the best results overall were achieved by the BiRNN, whose score was above 98% for each weighting scheme. This opens the possibility for the automation of amendment extraction in Polish.

Still, there are problems that have to be addressed to complete that goal. First of all, our solution is only the first step in the automation pipeline, since the detected tokens have to be converted into meaningful amendment representations. The solution thus requires a proper structuring of the bill as well as automatic detection of the references to the provisions. Both of these problems may be resolved following the same approach, but they were out of the scope of this article. A connected problem is the detection and processing of temporal expressions that determine the application date of a specific amendment. We will address these issues in forthcoming research.

7 ACKNOWLEDGMENTS
This work was supported by the Polish National Centre for Research and Development – LIDER Program under Grant LIDER/27/0164/L-8/16/NCBR/2017 titled "Lemkin – intelligent legal information system" and in part by the PLGrid Infrastructure.

REFERENCES
[1] Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled Contextualized Embeddings for Named Entity Recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 724–728. https://doi.org/10.18653/v1/N19-1078
[2] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. 1638–1649.
[3] Timothy Arnold-Moore. 1995. Automatically processing amendments to legislation. In Proceedings of the 5th International Conference on Artificial Intelligence and Law. 297–306.
[4] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:cs.CL/1911.02116
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] John Garofalakis, Konstantinos Plessas, and Athanasios Plessas. 2016. A Semi-Automatic System for the Consolidation of Greek Legislative Texts. In Proceedings of the 20th Pan-Hellenic Conference on Informatics (Patras, Greece) (PCI '16). Association for Computing Machinery, New York, NY, USA, Article 1, 6 pages. https://doi.org/10.1145/3003733.3003735
[7] Davide Gianfelice, Leonardo Lesmo, Monica Palmirani, Daniele Perlo, and Daniele P. Radicioni. 2013. Modificatory provisions detection: a hybrid NLP approach. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law. 43–52.
[8] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[9] Leonardo Lesmo, Alessandro Mazzei, Monica Palmirani, and Daniele Radicioni. 2013. TULSI: an NLP system for extracting legal modificatory provisions. Artificial Intelligence and Law 21 (05 2013). https://doi.org/10.1007/s10506-012-9127-6
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:cs.CL/1907.11692
[11] Michał Marcińczuk, Marcin Oleksy, and Jan Kocoń. 2017. Inforex – a collaborative system for text corpora annotation and analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP. INCOMA, Shoumen, 473–482.
[12] Agnieszka Mykowiecka, Małgorzata Marciniak, and Piotr Rychlik. 2017. Testing word embeddings for Polish. Cognitive Studies 17 (2017).
[13] Piotr Pęzik. 2012. Wyszukiwarka PELCRA dla danych NKJP. In Narodowy Korpus Języka Polskiego, Adam Przepiórkowski, Mirosław Bańko, Rafał Górski, and Barbara Lewandowska-Tomaszczyk (Eds.). Wydawnictwo Naukowe PWN, 253–279.
[14] Adam Przepiórkowski, Mirosław Bańko, Rafał Górski, and Barbara Lewandowska-Tomaszczyk. 2012. Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN.
[15] Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and Ireneusz Gawlik. 2020. KLEJ: Comprehensive Benchmark for Polish Language Understanding. arXiv:cs.CL/2005.00630
[16] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[17] PierLuigi Spinosa, Gerardo Giardiello, Manola Cherubini, Simone Marchi, Giulia Venturi, and Simonetta Montemagni. 2009. NLP-Based Metadata Extraction for Legal Text Consolidation. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (Barcelona, Spain) (ICAIL '09). Association for Computing Machinery, New York, NY, USA, 40–49. https://doi.org/10.1145/1568234.1568240
[18] Milan Straka and Jana Straková. 2014. MorphoDiTa: Morphological Dictionary and Tagger. (2014).
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:cs.CL/1706.03762
[20] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[21] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. CoRR abs/1910.03771 (2019). arXiv:1910.03771 http://arxiv.org/abs/1910.03771
[22] Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. arXiv:cs.CL/2010.01057
A Dataset for Evaluating Legal Question Answering on Private
International Law
Francesco Sovrano Monica Palmirani Biagio Distefano
francesco.sovrano2@unibo.it monica.palmirani@unibo.it biagio.distefano@univie.ac.at
University of Bologna - DISI University of Bologna - CIRSFID Universität Wien
Italy Italy Austria
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Sovrano, et al.
second approach is the most recent and in many ways the most versatile, but sub-symbolic and opaque. A sub-symbolic approach is said to be more data-oriented, and it follows the recent success of Deep Neural Networks (DNNs) in natural language processing and understanding. The current state of the art in natural language understanding is heavily based on this data-centred approach, and many models specifically applied to legalese have already been published. For example, in 2018 the authors of [3] published a framework for natural language processing and information extraction for legal and regulatory texts. In 2019, the authors of [5] proposed one of the first models for legal word embeddings. Earlier, in 2015, Kim et al. [10] presented one of the very first algorithms based on DNNs for Legal Question Answering (reasoning), applied to a dataset of Boolean questions from Japanese legal bar exams, then followed up by [7] and others [9, 12]. In 2020, Sovrano et al. [13] proposed a novel and hybrid approach for legal question answering on PIL, using a legal ontology based on Ontology Design Patterns (like agent, role, event, temporal parameter, action) in order to mirror the legal significance of the relationships within and among the provisions. More generally, automated reasoning over legal texts (not just the PIL ones) is not a trivial task, due to the fact that legal jargon (legalese) is less frequent and more ambiguous than commonly-used natural language. This is probably the reason why some works have decided to focus on corpora, such as privacy policies [12], with a legal language that is more similar to its natural counterpart, or to focus on more argumentative texts (e.g. sentences, procedural documents, cross-examinations, parliamentary court reports) instead of legislative texts or contracts. In any case, this challenge makes it hard to apply state-of-the-art sub-symbolic question answering algorithms to legislative texts, especially the PIL ones, because of data scarcity or novel topics introduced for the first time in the legal system (e.g., no historical series).

With this work we are interested in advancing automated answering of questions written in legalese about PIL legislative texts. Our goal is to be able to properly evaluate canonical question answering techniques for PIL. This is why we expand the work presented by Sovrano et al. in [13], publishing a larger and more curated dataset extracted from Regulations such as: Rome I Regulation EC 593/2008; Rome II Regulation EC 864/2007; and Brussels I bis Regulation EU 1215/2012.

In Section 2 we describe our dataset and the methodology we followed to design it, while in Section 3 we analyse the results obtained by re-running the experiment of [13] on the new dataset, pointing to future work in Section 4.

2 A DATASET FOR EVALUATING LEGAL QUESTION ANSWERING ON PIL
In this Section, we explain how we expanded the dataset presented in [13], doubling its size. We improved over [13], publishing a larger and more curated dataset for the evaluation of automated question answering on PIL.

Both the old and the new dataset were extracted from the following Regulations, in English:
• Rome I Regulation EC 593/2008;
• Rome II Regulation EC 864/2007;
• and Brussels I bis Regulation EU 1215/2012.

These Regulations are, respectively, on the law applicable to contractual obligations; on the law applicable to non-contractual obligations; and on jurisdiction and the recognition and enforcement of judgements in civil and commercial matters. These Regulations aim to provide a tool for identifying the applicable law and the jurisdiction in cases when two or more legal systems connect and generate complex relationships (e.g. a sale of goods contract between an Italian and a German citizen regarding commodities situated in Spain).

It is important to highlight the fact that, for the construction of the new dataset, we decided to inherit some methodological choices from [13], considering PIL as a subject simply from the point of view of these three EU Regulations, as a self-contained environment, i.e., excluding references to other international conventions and general principles, so that it is possible to evaluate Q&A techniques with respect to their ability to handle the general principles in the recitals, the scope of application in the initial articles, and the specific cases (e.g. exceptions) in the other articles. The methodological choices we kept raised some issues with regard to the formulation of the questions and their relevance. Conceptual questions (e.g. "What is a non-contractual obligation?") cannot be fully answered by relying solely on these 3 Regulations, as the goal of this legislation – when considered atomistically – is limited to disciplining conflict-of-law and conflict-of-jurisdiction cases. While the Regulations, as with any other piece of legislation, rely somewhat on external definitions and legal concepts, including those derived from jurisprudence and opinions from commentators, they also define key concepts intrinsically and specifically for their own purposes (e.g. "judgement" in Art. 2 of Reg. Brussels I bis). Therefore we decided to exclude any conceptual question but those involving key concepts defined within the Regulations.

The legal question answering tools we are interested in evaluating are meant to be used by practising lawyers, with reasonable – yet not expert – knowledge of PIL, to:
• explore the contents of the Regulations;
• get support in the reasoning concerning large Regulations.

The dataset for evaluating such tools shall comprise a set of questions, for each of which there is also a set of expected answers in the form of Articles, Recitals or Commission Statements3. Recitals are considered beside Articles because the user persona could be interested in prima facie interpretive tools emerging from the text itself, let alone the debated bindingness of Recitals. The dataset published in [13] was designed following a methodology that is similar to the one we are going to use for this extension. For the selection of the questions and the identification of the expected answers, we adapted to our case a specific methodology encoded by Ashley and others in their works [1, 2, 6] during the last years. This methodology is common to other works in the field and it is meant to validate the experiment also from a legal perspective. In our case, the questions were selected by two legal experts, while two other independent legal experts, matching our intended user persona, were responsible for identifying the expected answers by relying solely on the verbatim information that can be found in the Regulations.

3 Rome II Regulation contains three Commission Statements meant to bind the EU Commission to publish studies on selected topics.
Therefore, legal experts were instructed to prevent case-law, gen- choose a different applicable law for different parts of the con-
eral principles or scholar opinions from influencing their answers, tract?”); questions whose answer falls in part within the scope of
as well as requested to avoid interaction with each other. As stated the Regulations, but somewhat relies on external concepts were la-
above, the research wants to model only the neutral legislative belled as Normally specific (e.g., “Which parties of a contract should
information from the three Regulations without any interpretation be protected by conflict-of-law rules?”); finally, broad questions
other than the literal one. The inclusion of other knowledge will be whose answer requires the significant use of external legal concepts
left to further research. First, experts read the three Regulations and and resources and whose answer is found through an articulate
answered to the questions without any assistance from auxiliary combinations of articles and recital were labelled as having Low
sources, including tools and previous knowledge. Then, they were specificity (e.g., “How should a contract be interpreted according
allowed to compare their answers with those provided by the tool to this regulation?”).
for legal question answering, selecting tool-assisted correct answers Of the 17 questions that compose the new extended dataset:
and missing replies to be used to calculate performance scores in 29.41% have a Low specificity; 35.29% have a Normal specificity;
the later stages. Despite the efforts to draft interpretation-neutral 35.29% have a High specificity.
questions, each independent expert has a certain margin of appre-
ciation both when providing her/his answers and when assessing 3 DATASET ANALYSIS
the correctness of the tool-provided answers. Therefore, another
In order to understand the behaviour of existing question answer-
intervention was necessary when divergences in their evaluation
ing tools on the new dataset, we repeated on it the experiment
occurred. When identifying the expected answers, the aggrega-
described in [13] ,changing the metrics used for the evaluation.
tion kept into account only theoretical replies that were common
Considering that we are not interested in the order answers are
between the two independent experts. This aggregation was con-
ranked, as metric for estimating the performance of the algorithm
ducted by one legal expert who dispose of a higher level of expertise
we chose: top5-recall, top5-precision and top5-F1, defined as follows.
in comparison to the independent evaluators, yet relying on the
Let 𝑚 be the number of strictly-correct answers that are produced
same criterion, i.e. literal interpretation only.
as output by the algorithm, let |𝐸| be the number of expected an-
At the end of the process we got the 9 new questions shown in
swers for a question, let |𝐴| be the number of given answers to
The questions were chosen with the following criteria: they had to be sufficiently specific to find an adequate answer in the Regulations (we avoided too broad or excessively conceptual questions); the questions were not to be focused on specific cases but formulated at a reasonable level of abstraction (e.g., instead of “Where can an employee that carries out their work in Spain sue an employer located in Spain, if they had not agreed on the jurisdiction?”, a question such as “Where can an employee sue their employer?”); and the questions needed to be sufficiently different from one another (i.e., not asking repetitive questions such as “What is the applicable law in contracts of carriage?” and “What is the applicable law in insurance contracts?”). Some of the questions in the dataset are relatively similar to one another, with some of them being a more precise specification of another, such as “Which parties of a contract should be protected by conflict-of-law rules?” vs “What is the applicable rule to protect the weaker party of a contract?”

Questions in the dataset are not speculative or de iure condendo, and they are agnostic to elements that are placed outside the Regulations (e.g. jurisprudence, general principles, etc.). As such, they are not meant to nudge towards forms of interpretation other than the literal one (e.g. analogy, principle-based reasoning, lex specialis, etc.).

Furthermore, in order to be able to further analyse the results of any evaluation based upon our dataset, we decided to pick a heuristic for classifying questions, namely their context specificity (Low, Normal, High), and we applied it also to the old dataset we extended. Context specificity is a subjective concept and is highly dependent on each jurist. For this reason, we opted for a criterion that would ensure an acceptable level of objectivity. Thus, specific questions whose answer is exactly in the domain of the Regulations were labelled as Highly specific (e.g., “Can the parties …

… Table 2. … a question, then the top5-recall is given by m / min(|E|, 5), while the top5-precision is given by m / min(|A|, 5), where m is the number of relevant answers among the top five, E is the set of expected answers, and A is the list of answers returned. The top5-recall is a measure of how many relevant answers are selected by the algorithm in the top five answers, while the top5-precision is a measure of how many selected answers in the top five are relevant. Knowing the top5-recall and the top5-precision, it is easy to compute the top5-F1 score with the usual formula F1 = 2PR / (P + R).

After running the experiment we computed the average top5-F1 for all the questions in the dataset presented in Section 2 (that is, the old dataset of [13] plus our new extension). The results on the whole dataset are a top5-recall of 37.58%, a top5-precision of 45.17% and a top5-F1 of 38.05%.

We also performed an error analysis taking into consideration how top5-F1 scores vary when the context specificity changes, expecting that questions with low context specificity are harder to answer correctly.

Results partly confirmed our expectations. In fact, we can observe a trend where top5-F1 scores increase proportionally to the context specificity. Our expectations were based on the fact that:
• the specificity of a question is low when it asks something that is not closely related to the Regulations;
• multi-hop reasoning is usually required to answer questions with a low specificity, but the baseline is not equipped for that kind of reasoning (yet).

For example, the question “How should a contract be interpreted according to this regulation?” has a very low specificity and would probably require pinpointing both recitals and articles, and therefore more distinct and distant paragraphs, for a proper answer. Most of the speculative questions would probably require a broader view of the subject matter, having a low specificity with respect to the Regulations and therefore requiring multi-hop reasoning.
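The top-5 metrics used above can be sketched in a few lines of Python. This is an illustrative helper (the function name and input conventions are ours, not the authors’): `expected` plays the role of E and `returned` the ranked answer list A.

```python
def top5_scores(expected, returned):
    """Compute top5-precision, top5-recall and top5-F1 for one question.

    expected: collection of relevant answer identifiers (the set E)
    returned: ranked list of answer identifiers produced by the system (A)
    """
    top5 = returned[:5]
    m = len(set(top5) & set(expected))      # relevant answers within the top five
    recall = m / min(len(expected), 5)      # top5-recall    = m / min(|E|, 5)
    precision = m / min(len(returned), 5)   # top5-precision = m / min(|A|, 5)
    f1 = 2 * precision * recall / (precision + recall) if m else 0.0
    return precision, recall, f1
```

For instance, if one of two expected paragraphs appears at rank 1 of a six-item result list, the recall is 1/2 and the precision 1/5.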
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Sovrano, et al.
Table 1: First block of answers (ordered by the pertinence to the question estimated by the tool) given by the baseline to the
questions in [13]. “B” stands for Brussels, “RI” for Rome I and “RII” for Rome II. “Rec.” stands for Recital, “Art.” for Article, and
“Stat.” for Commission Statement. For each answer, the top5 scores (precision, recall, F1) are shown. In the “Scores” columns:
“P” stands for Precision and “R” stands for Recall. In the “Specificity” column: “L” stands for Low, “N” stands for Normal and
“H” stands for High.
A Dataset for Evaluating Legal Question Answering on Private International Law ICAIL’21, June 21–25, 2021, São Paulo, Brazil
Table 2: Second block of expected answers and answers given by the baseline. See the caption of Table 1 for more details about
how to read this table.
Figure 1: Average top5-F1 scores for each class of context specificity: Low, Normal, High. Scores are respectively: 27.40%, 42.16%, 42.81%. (Bar chart; y-axis: Top5-F1 from 0 to 0.5; x-axis: Specificity, L/N/H.)

… Francesconi, et al. 2012. A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law. Artificial Intelligence and Law 20, 3 (2012), 215–319.
[3] Michael J Bommarito II, Daniel Martin Katz, and Eric M Detterman. 2018. LexNLP: Natural language processing and information extraction for legal and regulatory texts. arXiv preprint arXiv:1806.03688 (2018).
[4] Pompeu Casanovas, Monica Palmirani, Silvio Peroni, Tom Van Engers, and Fabio Vitali. 2016. Semantic web for the legal domain: the next step. Semantic Web 7, 3 (2016), 213–227.
[5] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27, 2 (2019), 171–198.
[6] Jack G Conrad and John Zeleznikow. 2013. The significance of evaluation in AI and law: a case study re-examining ICAIL proceedings. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law. 186–191.
[7] Phong-Khac Do, Huy-Tien Nguyen, Chien-Xuan Tran, Minh-Tien Nguyen, and Minh-Le Nguyen. 2017. Legal question answering using ranking SVM and deep convolutional neural network. arXiv preprint arXiv:1703.05320 (2017).
[8] Meritxell Fernández-Barrera and Giovanni Sartor. 2011. The legal theory perspective: doctrinal conceptual systems vs. computational ontologies. In Approaches to Legal Ontologies. Springer, 15–47.
[9] Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2020. A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering. arXiv preprint arXiv:2005.05257 (2020).
[10] Mi-Young Kim, Ying Xu, and Randy Goebel. 2015. A convolutional neural network in legal question answering. In JURISIN Workshop.
[11] Friedrich V Kratochwil. 1991. Rules, norms, and decisions: on the conditions of practical and legal reasoning in international relations and domestic affairs. Number 2. Cambridge University Press.
[12] Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and Norman Sadeh. 2019. Question answering for privacy policies: Combining computational and legal perspectives. arXiv preprint arXiv:1911.00841 (2019).
[13] Francesco Sovrano, Monica Palmirani, and Fabio Vitali. 2020. Legal Knowledge Extraction for Knowledge Graph Based Question-Answering. In Legal Knowledge and Information Systems: JURIX 2020: The Thirty-third Annual Conference, Vol. 334. IOS Press, 143–153.
Discovering the Rationale of Decisions:
Towards a Method for Aligning Learning and Reasoning
Cor Steging (c.c.steging@rug.nl), Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence, University of Groningen
Silja Renooij (s.renooij@uu.nl), Department of Information and Computing Sciences, Utrecht University
Bart Verheij (bart.verheij@rug.nl), Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence, University of Groningen
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Steging et al.
Discovering the Rationale of Decisions ICAIL’21, June 21–25, 2021, São Paulo, Brazil
Table 1: An overview of the tort law datasets. Datasets marked with an asterisk are used for testing purposes only. For each type of dataset, the size and label distribution is given.

Dataset         Size        T/F label distribution
Regular         5,000/500   50%/50%
Unique*         1024        10.94%/89.06%
Unlawfulness*   168         66.67%/33.33%
Imputability*   128         87.5%/12.5%

The unique dataset contains these 1024 unique instances for all the 10 features plus the label. In this dataset, there are 912 instances where dut is false and 112 instances where dut is true (10.94%).

The regular type datasets are generated such that dut is true in exactly half of the instances. The sets are regular in the sense that balanced label distributions are common in machine learning problems. These regular datasets are generated by sampling uniformly from the subset of cases from the unique dataset, such that each possible case is represented equally within the 50/50 label distribution. In a typical machine learning experiment, only a subset of the possible cases is available and presented to a network, upon which the network will have to learn to generalize to all possible cases. In addition to generating regular type datasets with 5,000 cases, we therefore also generate smaller regular type datasets with only 500 instances; the latter contains 35.35% of the unique instances.

In the tort law domain we focus on the notions of unlawfulness (c2) and imputability (c3) to assess whether the networks are able to discover conditions in the data. For each of the two conditions, we create a dedicated dataset.

The Unlawfulness dataset is the subset of the unique dataset in which the features for the unlawfulness condition c2 can take on any of their values, while the other features have values that are guaranteed to satisfy the remaining conditions. Whether or not there is a duty to repair is therefore solely determined by whether or not condition c2 is satisfied. All combinations of values of the other features are considered. The Unlawfulness dataset therefore consists of 168 unique instances, of which 66.67% have a positive dut value.

The Imputability dataset is a similar subset of the unique dataset, but now the features for the imputability condition (c3) can take on any value, except that the value of vst must be such that condition c5 is satisfied. The value of dut(x) is now completely dependent on whether or not condition c3 evaluates to true. Due to the interdependency of conditions c3 and c5, the Imputability dataset only has 128 unique instances, 87.5% of which have a positive dut value.

4 EXPERIMENTAL SETUP AND RESULTS
In this section we describe and motivate the experiments we performed for the tort law domain and report on their results.

4.1 Experiments
We decide to use neural networks like in [Bench-Capon 1993]. The method is model-agnostic, however, meaning that it can be applied to any other machine learning model as well. We assume that assessing and improving rationale discovery is relevant only for models that perform well on their respective task. Our first step, after training the above mentioned neural networks, is therefore to evaluate their performance on typical test sets in terms of the standard accuracy measure. Subsequently we will evaluate the performance of the networks on the dedicated, knowledge-driven test sets that were specifically designed for assessing the networks’ quality of rationale discovery.

4.1.1 Neural network architectures. Similar to the original experiments, three multilayer perceptrons were used with one, two and three hidden layers, respectively [Bench-Capon 1993]. The nets have 10 input nodes, corresponding to the number of features, and a single output node, representing the duty to repair. The node configuration (i.e. number of nodes per layer) of each network is as follows:
• One hidden layer network: 10-12-1
• Two hidden layer network: 10-24-6-1
• Three hidden layer network: 10-24-10-3-1
We use the MLPClassifier of the scikit-learn package [Pedregosa et al. 2011], the sigmoid function as the activation function, and the Adam stochastic gradient-based optimizer [Kingma and Ba 2015], with a constant learning rate of 0.001. A total of 50,000 training iterations are used with a batch size of 50. Recall that the focus of this study is not on creating the best possible classifier, but to assess rationale discovery.

4.1.2 Training and performance testing. The three types of neural networks are trained and tested on all combinations of different datasets from Table 1. Every combination of training dataset and testing dataset is evaluated in terms of the accuracy of the resulting network on the test data. Because some of the datasets are stochastic (each generated dataset is slightly different), the whole process of data generation, training and testing is repeated 50 times. The mean classification accuracies along with their standard deviations are reported. To assess the rationale discovery capabilities of all the trained networks, we study their performance on the dedicated test sets for the unlawfulness and imputability conditions. Performance is measured both quantitatively, using standard accuracy, and qualitatively by a more detailed comparison of actual and expected outcomes.

4.2 Results
Table 2 shows the mean classification accuracies over 50 runs, together with their standard deviations, for the different combinations of training and testing sets in the tort law domain. The table includes the quantitatively measured performance on the two dedicated test sets.

We can evaluate how well conditions c2 (unlawfulness) and c3 (imputability) are learned. For these conditions, the network should output 1 in cases from the Unlawfulness dataset where the case is unlawful (c2), or in the Imputability dataset where the case can be imputed to a person (c3); otherwise the output should be 0. The mean output of the three-layer network over 50 runs for the two training sets on the Unlawfulness and Imputability datasets is presented in Table 3.
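The network setup described above could be written with scikit-learn roughly as follows. This is a sketch, not the authors’ code: the `make_network` helper and the configuration dictionary are ours, while the hyperparameters follow the description in the text (the 10 input nodes and single output node are implied by the data).

```python
from sklearn.neural_network import MLPClassifier

# Hidden-layer configurations from the paper; input (10) and output (1)
# layers are determined by the data, so only hidden layers are listed.
CONFIGS = {
    "one_hidden": (12,),         # 10-12-1
    "two_hidden": (24, 6),       # 10-24-6-1
    "three_hidden": (24, 10, 3), # 10-24-10-3-1
}

def make_network(name):
    """Build one of the three multilayer perceptrons described above."""
    return MLPClassifier(
        hidden_layer_sizes=CONFIGS[name],
        activation="logistic",     # sigmoid activation function
        solver="adam",             # Adam stochastic gradient-based optimizer
        learning_rate_init=0.001,  # constant learning rate
        batch_size=50,
        max_iter=50_000,           # training iterations
    )
```

Each network returned this way can then be fitted and scored on any of the training/testing dataset combinations from Table 1.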
Table 2: The accuracies obtained by the neural networks in the tort law domain.
Table 3: Mean network output on the Unlawfulness and Imputability datasets versus the logical evaluation of the unlawfulness resp. imputability conditions.

Condition              Trained on all instances   Trained on smaller dataset
Unlawfulness = False   0                          0.018
Unlawfulness = True    1                          1
Imputability = False   0                          0.875
Imputability = True    1                          1

5 DISCUSSION
5.1 Standard Accuracy
Standard accuracy is measured to see whether the learned models are able to solve the classification problem, regardless of whether or not they discovered the rationale underlying the data. We find accuracies of 100% or near 100% for networks trained on all instances (see Table 2). When presented with all unique instances, the networks with one and two hidden layers are able to perfectly predict the outcome from Dutch tort law, and the network with three hidden layers can create a very close approximation.

Presenting a neural network with all available cases is in practice often infeasible. If it is possible, then a simple lookup table rather than a neural network would most likely suffice. For this reason, we also trained the networks on a subset of only around 35% of the unique instances (see Table 2). As expected, the accuracies of the networks on the general test sets drop, but only slightly (to 98-99%). Even on the unique test set, accuracies remain around 96%. This suggests that it is possible for the models to approximate tort law with a small subset of the unique cases.

5.2 Rationale Discovery
Looking at the performance of the networks on the dedicated test sets partially exposes how well the rationale is captured by the network. We designed these test sets such that each one targets a single condition from the domain. In addition to considering the accuracy on these dedicated test sets, we qualitatively evaluate the rationale discovery capabilities of the networks by comparing their outputs with the actual outputs we would ideally expect for the different domains.

Recall that on the Imputability dataset, networks should output 1 if the act is imputable to the person, and 0 otherwise; on the Unlawfulness dataset, the networks should output 1 if the case is unlawful, and 0 otherwise. Table 3 shows how well the networks were able to internalize the notions of unlawfulness and imputability. When trained on all instances, the mean output of the networks is 0 if the logical evaluation of unlawfulness is false, and 1 if it is true, which is exactly what it should do. Networks trained on all instances attain a perfect score on the Imputability dataset as well. This can also be seen in Table 2, where the networks score 100% accuracy on the Unlawfulness and Imputability datasets after training on all instances.

With less data, however, accuracies drop to around 92-95% for the Unlawfulness dataset and 91-94% for the Imputability dataset. This accuracy may still seem high, but we should take into account the label distributions (66.67-33.33% and 87.5-12.5%, respectively). Table 3 shows that networks still perform perfectly on cases in which the unlawfulness and imputability conditions evaluate to true. When the conditions are false, however, mistakes are made. The average output of networks on the Unlawfulness dataset increases to 0.018, where it should be 0, meaning that networks classify some lawful cases as unlawful. In the Imputability dataset, the mean output increases more drastically, to 0.875, when imputability is false, meaning that in 87.5% of the instances in which the act is not imputable to a person, the network incorrectly decided that it should be. This means that despite high accuracy on the general test set, the networks largely ignored the concept of imputability.

5.3 A Method for Rationale Evaluation
Although our experiments and discussion focused on specific example domains and neural networks, our approach for rationale evaluation can be interpreted as a general method independent of the machine learning algorithm applied. Building on the results of this paper, we therefore propose a knowledge-driven method for model-agnostic rationale evaluation, consisting of three distinct steps:
(1) Measure the accuracy of a trained system, and proceed if the accuracy is sufficiently high;
(2) Design dedicated test sets for rationale evaluation targeting selected rationale elements based on expert knowledge of the domain;
(3) Evaluate the rationale through the performance of the trained system on these dedicated test sets.
The first step is based on the assumption that efforts for assessing and possibly improving the rationale discovery capabilities of a learned model are only taken if the general performance of the model is already considered good enough. Here we assume performance is measured using accuracy, but other measures can be employed as well, and the threshold of what is considered good enough may vary per domain and application.

The second step in our method depends on domain knowledge. Hence the method effectively is a quantitative human-in-the-loop solution for rationale evaluation.
Process Mining-Enabled Jurimetrics
Analysis of a Brazilian Court’s Judicial Performance in the Business Law Processing
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Unger et al.
2.2 Jurimetrics
Jurimetrics [11] is defined as ‘statistics applied to the law’. Although it emerged decades ago, recent advances in computing and data storage capabilities have enabled alternative ways of observing patterns in data-based and hence statistics-based court decisions. In the USA, the application of statistics to law has been developed under alternative nomenclature, such as Empirical Legal Studies [7] and, more recently, Judicial Analytics [4]. In Brazil, jurimetrics has received growing interest [12].

When analyzing alternative ways of managing lawsuit processing, the analysis of the lawsuit throughput time using jurimetrics techniques may present quality issues, as it eventually considers inadequate time intervals for the object of analysis [3]. Procedural viscosity [16], defined as “a set of structural characteristics of a lawsuit that is able to affect the speed of its processing”, may apply. Specifically in business law, processing of lawsuits may require about twice as much effort as a common lawsuit [2]. The same authors suggest “future research focused on process flow analysis, i.e., the study of the stochastic process that generates all events and timestamps of lawsuit processing”.

2.3 Process Mining
Process mining [20] emerged as a set of techniques for mining business process-related information from event data logged by information systems. A business process is a chain of activities that produces an outcome that adds value to an organization and its customers [6]. Business process models play a dominant role in the BPM life cycle, helping achieve organizational improvement goals, including reducing costs, lead times, and error rates. By using real event data to discover process models, process mining leverages data mining to understand operational processes in organizations. Table 1 shows the mapping of the basic elements of event logs from the process mining perspective to their counterparts in the procedural law domain.

3 RELATED WORK
Process mining should be seen as an analytical tool naturally suitable for lawsuits due to their inherently procedural nature. It is cited as a promising approach to suggest improvements for lawsuit […] end in itself, so that process mining can enhance the way the judiciary treats its digital data through the application of algorithms for process discovery, compliance and predictive analysis [9]. A technical report with suggested actions to improve judicial efficiency highlights the use of process mining as one of these actions [3].

Nevertheless, few studies on this topic were found in the literature. Empirical studies were performed using lawsuit data from Brazilian courts [19, 22], though focused on comparisons of lawsuit throughput times. Attempts to apply data mining to extract information from lawsuit data were made [13, 18], but these studies do not directly address process mining on that data.

4 RESEARCH METHOD
This study applied the Process Mining Project Methodology [21] to guide the application of process mining to analyze judicial performance in a specific context. Only the first five stages of the method were carried out: planning, extraction, data processing, mining and analysis, and evaluation. The project was finished with the insights generated by the evaluation stage. Besides restricting the scope to business law, a period of analysis was defined to consider only the TJSP’s lawsuits distributed between January 1, 2018 and July 21, 2020. All lawsuits with a procedural movement published in that time interval were considered. Full progress data for each lawsuit was retrieved until July 31, 2020, including lawsuits opened before and lawsuits not yet closed.

The data were extracted in two steps. First, the identifiers of lawsuits of interest were obtained. For this, all issues of the TJSP’s Electronic Journals of Justice (DJE, Diário da Justiça Eletrônico in Portuguese) were downloaded from the DJE website (http://www.dje.tjsp.jus.br), considering the defined analysis period. DJE publishes information on provisional or final decisions for all ongoing lawsuits at TJSP, daily. An automated scraping of these files was carried out using keywords associated with business law litigation. Second, the lawsuit identifiers obtained were used to retrieve data from the e-SAJ website, where information on lawsuits is published. A web scraping was carried out to retrieve information on lawsuit attributes and progress events, including their respective dates. In TJSP, there are four filing court departments dedicated to business
Process Mining-Enabled Jurimetrics ICAIL’21, June 21–25, 2021, São Paulo, Brazil
law; as a result, the lawsuits not filed at these four court departments were discarded. Data from both DJE and e-SAJ websites are publicly available⁷.

The process knowledge transfer with domain experts was carried out, resulting in a mapping between the elements of the dataset and the concepts of event log used by process mining, presented in Table 2. Event data from lawsuits were used to create the event log to be used in process mining. The lawsuit dataset was filtered to remove columns with missing values or data not relevant to the scope of this study. The judge column was anonymized to protect personal data. The additional column order was added to the movement database, and hence to the event log, to allow the process mining discovery algorithm to identify the correct order of activities within a case occurring on the same date.

Figure 2: Histogram of procedural movements by lawsuit
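The ordering step described above can be sketched as follows. The field names and movement values are hypothetical, not those of the actual TJSP dataset; the point is that sorting by case, date, and the added `order` column yields the activity sequence the discovery algorithm consumes.

```python
# Each procedural movement becomes one event; the `order` column
# disambiguates multiple activities of the same case on the same date.
movements = [
    {"case": "001", "activity": "Conclusos", "date": "2019-03-02", "order": 2},
    {"case": "001", "activity": "Distribuído", "date": "2019-03-01", "order": 1},
    {"case": "001", "activity": "Juntada", "date": "2019-03-02", "order": 1},
]

# ISO dates sort correctly as strings, so a composite key is enough.
event_log = sorted(movements, key=lambda e: (e["case"], e["date"], e["order"]))
trace = [e["activity"] for e in event_log]
# trace == ["Distribuído", "Juntada", "Conclusos"]
```

Without the `order` column, the two same-day events of case 001 would have no defined relative order in the discovered process model.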
5 RESULTS
The resulting event log contains data on lawsuits referring to 4,795 cases and 266,834 events, with procedural movements dating back to 2008, and 10 case attributes, as described in Table 2. The event log file⁸ was imported using the EverFlow⁹ process mining tool, which produced the business process maps and the main process metrics, such as number of cases, number of events, and average duration, as presented in Figure 1 and Figure 3. Process map views are user-interactive, so that activities, transitions, and the time interval can be selected and filtered for drill-down analysis. Detailed views on specific metrics are presented on dedicated dashboards and panels, as shown in Figure 2, Figure 4, Figure 5 and Figure 6.

Figure 3: Process map based on average duration metrics

6.1 Control-flow Perspective
The control flow is the main perspective considered in process mining discovery, adding process-oriented value to the data mining analysis of the event log, as shown in Figure 1. The complexity and procedural viscosity of lawsuits in business law can be verified by the process metrics, i.e., an average rate of 55.6 events per case and an average case duration of 334 days. As shown in Figure 2, over 10% of lawsuits have over 100 events per case, which corresponds to procedural movements during lawsuit processing. In addition, one can verify that lawsuit processing in business law has an ad

⁷ The right of access to judicial data in Brazil is guaranteed by the constitutional principle of judicial publicity, except in lawsuits protected by the secrecy of justice.
⁸ The event log file is available in the repository: https://doi.org/10.4121/14593857
⁹ http://everflow.ai
Using Transformers to Improve Answer Retrieval
for Legal Questions
Andrew Vold, Thomson Reuters, TR Labs Research, andrew.vold@thomsonreuters.com
Jack G. Conrad, Thomson Reuters, TR Labs Research, jack.g.conrad@thomsonreuters.com
ABSTRACT
Transformer architectures such as BERT, XLNet, and others are frequently used in the field of natural language processing. Transformers have achieved state-of-the-art performance in tasks such as text classification, passage summarization, machine translation, and question answering. Efficient hosting of transformer models, however, is a difficult task because of their large size and high latency. In this work, we describe how we deploy a RoBERTa Base question answer classification model in a production environment. We also compare the answer retrieval performance of a RoBERTa Base classifier against a traditional machine learning model in the legal domain, by measuring the performance difference relative to a trained linear SVM on the publicly available PRIVACYQA dataset. We show that RoBERTa achieves a 31% improvement in F1-score and a 41% improvement in Mean Reciprocal Rank over the traditional SVM.

CCS CONCEPTS
• Information systems → Information Retrieval; Retrieval Tasks and Goals; Question Answering; Retrieval Models and Ranking; Language Models; Evaluation of retrieval results; Relevance assessment.

KEYWORDS
Question Answering, Legal Applications, Deep Learning, Language Models, BERT Engines, Evaluation

ACM Reference Format:
Andrew Vold and Jack G. Conrad. 2021. Using Transformers to Improve Answer Retrieval for Legal Questions. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466102

ICAIL’21, June 21–25, 2021, São Paulo, Brazil
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8526-8/21/06.
https://doi.org/10.1145/3462757.3466102

…consuming and laborious effort. Over time, we began to see an interest in more focused question answering systems taking the place of traditional information retrieval systems. In the field of AI and Law, Quaresma and Rodrigues were among the first to implement a question answering system for legal documents [13], one that focused on Portuguese legal decisions. More recently, however, developments in deep learning-based approaches for tasks like open domain question answering have resulted in major gains in answer rate performance. They have also been responsible for comparable advances in closed domain question answering in fields such as Legal QA [1]. Such progress has resulted in performance gains for both factoid and non-factoid question answering.

Transformer architectures have delivered impressive performance gains over baselines for standard natural language processing (NLP) tasks. Open domain language modeling as a pretraining step, followed by domain-specific fine-tuning, has delivered state-of-the-art performance for tasks in a specific domain, including the legal domain. One should thus expect to see significant performance gains in legal question answer retrieval by utilizing the output of a transformer-based classifier which has been fine-tuned on legal QA pairs.

It has been well observed that transformers are highly performant at answering factoid questions, which typically have answers of one or a few words [5]. Transformer-based research in the legal domain has evolved toward more complex non-factoid questions, which are more nuanced and may require several sentences to provide context and elaboration in order to answer the legal question at hand, for example, "When is a party entitled to a protective order?" The current work extends this research by processing a publicly available non-factoid QA dataset in an application workstream, while addressing the challenges of performance quality, speed and scale.

2.1 Open-Domain Question Answering
Open domain question answering is a task that answers factoid questions using large collections of documents [19]. Historically, retrieval in open domain QA was usually conducted using tf.idf or BM25 approaches, which match keywords with an inverted index, and represent the question and content in high-dimensional, sparse vectors [16]. In their 2017 report, Chen et al. propose using Wikipedia for open domain question answering for factoid
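As a toy illustration of the sparse keyword-matching retrieval mentioned above, a simplified tf.idf scheme (not the exact weighting used by the cited systems) maps documents and the question to term-weight vectors and ranks documents by dot product:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dict."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))        # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}     # smoothed idf
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]

def rank(question, docs):
    """Return document indices ordered by dot product with the question."""
    vecs = tfidf_vectors(docs)
    q = Counter(question)
    scores = [sum(w * q[t] for t, w in v.items() if t in q) for v in vecs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```

Real systems store these sparse vectors in an inverted index so only documents sharing at least one query term are scored; dense-retrieval approaches, discussed later, replace the sparse vectors with learned embeddings.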
questions [5]. The task is one of machine reading at scale, which addresses the challenges of document retrieval and machine comprehension (identifying text spans containing the answer). Their approach combines a search component based on bigram hashing and tf.idf matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. They use the SQuAD dataset for training and three other datasets for testing [14]. They obtain an F-score of 79%, which was within a point of the top performing method at the time.

In their work on dense passage retrieval for open domain question answering, Karpukhin et al. show that retrieval can be effectively implemented using dense representations alone, where embeddings are learned from a small number of questions and passages via a simple dual encoder framework [9]. It has outperformed traditional QA baselines (top-20 results) by 9%-19%, while establishing new end-to-end baseline performance levels.

In their earlier work on Bidirectional Encoder Representations from Transformers (BERT), Devlin et al. introduced a new language representation model which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [7]. BERT consequently can be fine-tuned with just one additional output layer to create state-of-the-art, highly performant models for a wide range of tasks, including question answering.

As an extension to BERT, Liu et al. developed a "robustly optimized" pretraining approach to BERT known as RoBERTa [10]. They found that BERT was significantly undertrained. In their replication study of BERT, they carefully measured the impact of many key hyperparameters and training data size. They showed how hyperparameter choices have a major impact on final results. Their best model achieved state-of-the-art results against such standard …

… of a merger or acquisition [6]. They claim that what is novel in their approach is that the proposed system explicitly handles the imbalance in the data, by generating synthetic instances of the minority answer categories, using the Synthetic Minority Oversampling Technique [4]. This ensures that the number of instances in all the classes is roughly equal, thus leading to more accurate and reliable classification. They use conditional random fields as their text selection algorithm. Each sentence in the contract under consideration is featurized into a tf.idf vector and fed into the CRF algorithm. The authors found a 13% improvement in accuracy due to the imbalance handling.

The recently published work on Legal BERT has reported performance gains on an assortment of downstream NLP tasks [3]. The authors compare the performance of out-of-the-box BERT with a version that benefits from additional pre-training with legal domain data, and finally with a version where the pre-training with legal domain data starts from scratch. The legal domain training data consists of UK and EU legislation, European Court of Justice and Court of Human Rights cases, and finally U.S. court cases as well as U.S. contracts. The authors show that the best strategy to transfer BERT to a new domain may vary, but that one may consider either further pre-training or pre-training from scratch on data from the new domain. Legal BERT achieved state-of-the-art results in three end-tasks, and, most notably, the performance gains were stronger for the most challenging end-tasks (i.e., multi-label classification in ECHR-cases and contract header & lease details in Contracts-NER), where in-domain (legal) knowledge is arguably the most important. The authors also released a version of Legal BERT-SMALL, which is 3 times smaller than Legal BERT, but quite competitive performance-wise with the other versions of Legal BERT.

Reports on question answering systems have also recently been
collections as GLUE, RACE, and SQuAD. published by researchers at Thomson Reuters and LexisNexis [2,
Because pre-trained language models are usually computation- 11]. The current work demonstrates the robustness of a Legal QA
ally expensive, and it is difficult to execute them on resource limited system deployed in a multi-stage workstream where the engine is
devices, researchers like Jiao et al. have focused on transformer fine-tuned on an application-specific dataset. The application and
model distillation methods and proposed a novel method that was dataset are discussed below. The system is shown to significantly
specially designed for knowledge distillation (KD). By leveraging outperform the baseline using contemporary neural techniques.
their new KD method, while focusing on the knowledge already
preserved in larger models like RoBERTa, they discovered that such 3 METHODOLOGY
knowledge could be transferred to a smaller TinyBert model [8]. Transformer models have achieved state-of-the-art performance
The new framework captured in TinyBert performs transformer in many NLP applications such as text classification, text summa-
distillation at both the pre-training and task specific learning stages. rization, and question answering. Though transformers are highly
They have shown that their framework ensures that TinyBert cap- performant, their generally large size make them difficult to deploy
tures the general knowledge and task specific knowledge preserved in production systems. Successful transformer model hosting in
in BERT. a production environment would be a major advance in natural
In contrast with factoid question answering, Zhu et al. pursued language applications. For this reason, we developed a high perfor-
non-factoid question answering where the answers tend to be mance question answering (QA) system based on the RoBERTa base
longer passages [22]. In this work, the authors determine that by architecture, but other transformer architectures could be used as
generating synthetic training data of arbitrary volume and with well [10, 12]. The challenges and our strategies for handling these
well understood properties, the learning capacity of Knowledge problems will be discussed in this section.
Graph architectures can be better understood and characterized. QA system researchers do not frequently have access to evalu-
Whether a given neural architecture for KGQA will train a model to ated QA pairs that are broad, balanced, and comparable to what
generalize rather than memorize may depend on dataset properties. a user would ask. Open sourced QA pairs tend to be either very
general or belong to a niche domain. If one is fortunate to have
2.2 Legal Domain Question Answering access to labeled QA pairs in the working domain, it is unlikely that
In a recent work, the authors address a due diligence topic where there is enough data for broad topic coverage. To address this issue,
lawyers review documents for indication of risk due to the prospect subject matter experts (SMEs) can be assigned to procure quality
246
Using Transformers to Improve Answer Retrieval for Legal Questions ICAIL’21, June 21–25, 2021, São Paulo, Brazil
mobile applications [20], and more than 3,500 relevant answers that
have been annotated by experts. From the data provided, we have
obtained approximately 130K passages for our training set, of which
about 25% was used in our validation set. The goal of the collection
was to achieve broad coverage across a spectrum of application
types. The researchers collected privacy policies from 35 mobile
applications representing different categories in the Google Play
Store [17]. Another goal of the creators was to include both policies
from well-known applications, which are likely to have carefully-
constructed privacy policies, and lesser-known applications with
Figure 1: QA System Development Cycle
smaller install bases, whose policies might be considerably less
sophisticated. They set a threshold of 5 million installs to ensure
QA pairs. Yet SMEs often experience fatigue when producing nu-
each category includes applications with installs on both sides of
merous examples, even if the queries originate from user query
the threshold. All policies in the corpus are in English, and were
logs. This phenomenon often manifests itself in the form of weak
collected before April 1, 2018, predating many companies’ GDPR-
question-answer pair generation where examples differ by only
focused revisions.
a few words. To address such limitations, natural language user
queries are identified, run through the classifier, and the highest 3.2.1 Answer Identification. In order to identify legally valid an-
scoring QA pairs are evaluated. The resulting data can then be used swers, seven subject matter experts with legal training were re-
to train the model, yielding a cyclic data curation, model training cruited to formulate answers to the Amazon Mechanical Turk ques-
process as seen in Figure 1. tions. They indicated relevant material within the given privacy
Given the QA system that we developed was intended for ap- policy in addition to supplying relevant metadata regarding the
plication to sets of in-house legal documents, many of which are question’s relevance, subjectivity, OPP-115 category [21], and how
not freely available to the general public, for the purposes of this likely any policy is to containing the answer to the question.
research report, we have opted to apply our techniques to the pub- Table 1 presents aggregate statistics for the PRIVACYQA dataset.
licly available legal questioning collection described in section 3.2. 1750 questions are posed to an imaginary privacy assistant over 35
Though it covers a subdomain of the legal space, it is nonetheless a mobile applications and their associated privacy documents.
broad ranging and complex dataset that contains an array of top-
ics, question and answer lengths and types. It is a nuanced and Dataset Train Test All
challenging set of data which is indicative of the kinds of question No. of Questions 1350 400 1750
and answer types one can expect to see in the legal domain. The No of Policies 27 8 35
findings we obtain apply specifically to the PRIVACYQA dataset, No. of Sentences 3704 1243 4947
but are also representative of the kinds of issues and challenges Avg. Q Length 8.42 8.56 8.46
one encounters with wider-ranging legal datasets as well. Avg. Doc. Length 3121.3 3629.13 3237.37
Avg. Ans. Length 123.73 153.44 139.62
3.1 Training Targets Table 1: Statistics of the PRIVACYQA Dataset
In order to assess the performance of the QA classifier, natural
language user log queries and their retrieved answers are presented 4 EXPERIMENTS
to an SME. The SME then must determine whether or not the top To demonstrate the quality of answer retrieval performance of a
answers returned by the classifier satisfy what was being asked. transformer in comparison with traditional ML models, we fine-
The grade by the SME can be a binary "pass/fail", a letter grade, tune an open domain pretrained RoBERTa classifier and train a
or even a score on a continuous scale. In our case, the grade is linear SVM with tf.idf features on the PRIVACYQA dataset. Training
converted into a label or regression target to be used for model models on this dataset is challenging for several reasons. First, the
fine-tuning. dataset is largely unbalanced with negative examples occurring 25
For our internal QA classifier, we utilized a multi-label grading times more often than positive examples. In addition, there exists
criteria which determined whether or not the answer satisfies the considerable noise in both the queries and the answers. Finally, the
requirements and to what degree it answers the given question. In number of unique questions and answers are far fewer than the
order to avoid grader bias, we have two SMEs grade each QA pair, total counts of QA pairs in the dataset.
and the average is taken. Disagreements of more than one grade Class imbalance is a common problem in real world machine
may be adjudicated by a senior SME. A similar approach was used learning applications. For this reason, there are many methods to
by the creators of the PRIVACYQA dataset, which will be explained effectively combat the adverse effects of training on an imbalanced
below. dataset. These can include over/under sampling, class weighting
on the loss, external or generated training data augmentation, and
3.2 Data more. For our experiments, we apply a simple class weighting
The dataset used in these experiments comes from the PRIVACYQA scheme to give more weight to the underrepresented positive class.
dataset described by Ravichander et al. in [15]. It is a corpus con- The PRIVACYQA data is quite noisy. The queries and answers are
sisting of 1,750 questions about privacy policies associated with riddled with misspellings, URLs, improper grammar, fragmented
sentences, lack of punctuation, and more. In order to have the data resemble the data existing in our internal system, significant data cleaning and filtering is applied. This includes capitalizing sentence beginnings, removing URLs, removing queries or answers with more than 4 non-English words, and additional cleaning and filtering steps. Even after all of this data preprocessing, the data remains far from perfect, but is sufficient to meet the requirements of our experimental conditions.

The original PRIVACYQA paper split the training and testing datasets by privacy category, rather than by unique queries (Table 1). The original PRIVACYQA dataset thus contains data leakage: several queries from the test set can also be found in the training set. In order to rectify this, we identify the queries which exist in the test set and reassign those QA pairs as training data, as can be seen in Table 2.

Set    Positives   Negatives   Total
Train  6,950       152,903     159,487
Test   5,276       45,493      50,720

Table 2: Dataset Split Statistics

We perform tf.idf fitting on the unigrams and bigrams from the corpus of unique answers, and use it to vectorize the QA pairs, which are then used as inputs to a linear SVM. The hyperparameters of the SVM are found by performing 5-fold cross validation via grid search, with maximizing the validation set F1-score as the objective. This process leads to optimal hyperparameters for the SVM model and a consistent training-validation split to be used for training RoBERTa.

Due to the large number of parameters in RoBERTa, it is trained by gradually unfreezing the layers, starting with the classification head. The learning rate and the batch size are decreased as layers are unfrozen, so as to avoid overloading the CUDA memory. After each epoch, the validation F1-score is measured until a plateau is reached, at which point the model loses generalizability.

5 RESULTS
After training both the RoBERTa and SVM classifiers, the models are run over the test set to determine the performance differences when using a transformer based QA classification engine. The results can be seen in Table 3.

Metric     SVM     RoBERTa
Precision  0.212   0.470
Recall     0.480   0.326
F1-score   0.294   0.385*
MRR        0.074   0.105**

Table 3: Classifier Performance on the Test Set

As seen in the table, RoBERTa outperforms the SVM on all metrics except recall. This makes sense because the SVM looks for exact token matches between the query and answer to assign a positive label; RoBERTa, however, uses the latent representation of the tokens to identify potential answers. In any QA application, it is important to serve an expansive set of quality answers; for this reason, RoBERTa is preferable to the SVM for its 31% improvement in F1-score over the SVM (* p < 1 × 10⁻⁵).

One of the most important metrics for QA classification systems is Mean Reciprocal Rank (MRR). This simple metric is the average inverse position of the first true-labeled example in the answer ranks. MRR is a useful metric for ensuring that the highest quality answers make it to the top of the ranked list. This is especially important for applications like question answering, which may return only a few, or even one, answer for a particular query. Due to the importance of MRR, RoBERTa is the better choice for a QA model, with a 41.4% improvement in MRR over the SVM baseline (** p < 1 × 10⁻⁵).

It is interesting to see that a simple, traditional ML model like an SVM, operating on sparse word vectors, achieves performance relatively similar to that of a transformer. One explanation for this is that the data is very messy and lacks uniqueness. A simple ML model doesn't get distracted by nuances of this dataset such as fragmented sentences, misspellings, and the frequent use of URLs and company names. A simple ML model is also less prone to overfitting than a transformer, especially considering the redundancy of the text in the dataset. Overfitting was a challenge during experimentation. For this reason, one can expect even higher RoBERTa performance if the experiments are repeated with a more sophisticated strategy for combatting overfitting. A major lesson learned from running this experiment is to ensure that the data used for training a transformer QA classifier is clean and without redundancy. In addition, more careful domain adaptation could be applied before fine-tuning on the experimental dataset.

6 APPLICATION PIPELINE
Developing a strong QA classifier is only one piece of deploying a scalable QA application. It is not feasible to simply concatenate all passages from a corpus to a user's query and sequentially feed them to a classifier. Instead, there needs to be a way to quickly filter out obvious negative passages, yielding a smaller pool of potential answers to be fed to the classifier. An additional challenge of using a transformer based classifier like RoBERTa is its size and latency. In order to address these challenges, we propose a solution consisting of a parallel cluster for candidate retrieval (Stage 1) and RoBERTa operating on a GPU endpoint (Stage 2). In addition, in order not to overwhelm the user of the application, we typically return the top n answers as predicted by RoBERTa, where n is small.

One of the most important requirements for a powerful QA classification engine is a sufficiently large corpus of passages against which a query can be compared. Oftentimes, this can be on the scale of hundreds of thousands to millions of passages. The overwhelming majority of passages are irrelevant to any particular query, and these are not difficult to identify. For this reason, it is advisable to have a computationally efficient method of removing the clearly irrelevant passages before performing any QA inferencing. In addition, due to the scale of the data, it is imperative to perform this filtering in parallel. To accomplish this, we employ a parallel data cluster in the cloud with our data spanning several nodes (see Figure 2). The cluster functions by serving up the top-n most relevant passages as determined by properties such as term overlap between the query and passages. It is up to the application designer to determine the appropriate number of passages to include in a candidate pool. Typically a candidate pool size between 100 and 1000 will suffice. Increasing the number of nodes decreases latency but increases cost, so application engineers must decide in advance how many nodes to include in their cluster.

After a satisfactory candidate pool has been retrieved, the QA pairs are then tokenized, pushed to the CUDA device, and fed to the classifier. The classifier returns a list of prediction scores of the relevance of each passage to the question. The passages associated with these predictions are then sorted, and the top-n are returned, where n is determined by the application development team. One may also wish to apply a RoBERTa score threshold, so that very low predictions, which are very often negative, are not shown to the user. If executed properly on the appropriate hardware, the entire answer serving process can take a second or less to perform.

7 CONCLUSIONS
Question answering is a challenging task which has been in development for many years. Question answering can take on different forms such as answer generation, answer snippet retrieval, and question answer classification. We propose an end-to-end pipeline which combines the speed of a parallel data retrieval mechanism with the classification power of a fine-tuned RoBERTa Base classifier. Our observations from our internal data and the data discussed in this paper indicate that transformer architectures can achieve greater classification performance than traditional machine learning methods in legal QA classification tasks.

We have discussed the efficacy of transformer models in text classification tasks. We observe a significant increase in the F1-score and MRR of a RoBERTa classifier over a linear SVM on the PRIVACYQA dataset. Our experiment has shown that transformer models can achieve superior performance over traditional machine learning techniques in legal question answer classification.

We have also discussed some of the challenges and solutions associated with developing and operating a transformer based question answer classification system. With a large set of content, subject matter experts, and sufficient computing power, it is possible to train and operate a transformer based system in a cost effective manner.

REFERENCES
[1] S. Badugu and R. Manivannan. A study on different closed domain question answering approaches. Int. J. Speech Technol., 23(2):315–325, 2020.
[2] Z. Bennett, T. Russell-Rose, and K. Farmer. A scalable approach to legal question answering. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law, ICAIL '17, pages 269–270, New York, NY, USA, 2017. Association for Computing Machinery.
[3] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos. LEGAL-BERT: The muppets straight out of law school, 2020.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357, June 2002.
[5] D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading Wikipedia to answer open-domain questions, 2017.
[6] R. Chitta and A. K. Hudek. A reliable and accurate multiple choice question answering system for due diligence. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, ICAIL '19, pages 184–188, New York, NY, USA, 2019. Association for Computing Machinery.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
[8] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. TinyBERT: Distilling BERT for natural language understanding. CoRR, abs/1909.10351, 2019.
[9] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. t. Yih. Dense passage retrieval for open-domain question answering, 2020.
[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
[11] G. McElvain, G. Sanchez, S. Matthews, D. Teo, F. Pompili, and T. Custis. WestSearch Plus: A non-factoid question-answering system for the legal domain. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, pages 1361–1364. ACM, 2019.
[12] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A fast, extensible toolkit for sequence modeling, 2019.
[13] P. Quaresma and I. Rodrigues. A question-answering system for Portuguese juridical documents. In Proceedings of the 10th International Conference on Artificial Intelligence and Law, ICAIL '05, pages 256–257, New York, NY, USA, 2005. Association for Computing Machinery.
[14] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text, 2016.
[15] A. Ravichander, A. W. Black, S. Wilson, T. B. Norton, and N. M. Sadeh. Question answering for privacy policies: Combining computational and legal perspectives. CoRR, abs/1911.00841, 2019.
[16] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, Apr. 2009.
[17] P. Story, S. Zimmeck, and N. Sadeh. Which apps have privacy policies? In M. Medina, A. Mitrakas, K. Rannenberg, E. Schweighofer, and N. Tsouroulas, editors, Privacy Technologies and Policy, pages 3–23, Cham, 2018. Springer International Publishing.
[18] H. R. Turtle. Text retrieval in the legal world. Artif. Intell. Law, 3(1-2):5–54, 1995.
[19] E. M. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC-8, pages 77–82, 1999.
[20] D. Weissenborn, G. Wiese, and L. Seiffe. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, Vancouver, Canada, Aug. 2017. Association for Computational Linguistics.
[21] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. Giovanni Leon, M. Schaarup Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, T. B. Norton, E. Hovy, J. Reidenberg, and N. Sadeh. The creation and analysis of a website privacy policy corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1330–1340, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.
[22] M. Zhu, A. Ahuja, D. Juan, W. Wei, and C. K. Reddy. Question answering with long multiple-span answers. In T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 3840–3849. Association for Computational Linguistics, 2020.
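The Stage 1 / Stage 2 design described in Section 6, together with the MRR metric used in Section 5, can be sketched in a few lines of Python. This is an editor's illustrative sketch, not the authors' implementation: the function names are hypothetical, the term-overlap scorer is a stand-in for the parallel retrieval cluster, and the `score_fn` argument is a stand-in for the fine-tuned RoBERTa GPU endpoint.

```python
# Illustrative sketch only (not the paper's code). Stage 1 cheaply filters a
# large passage pool by query-term overlap; Stage 2 re-ranks the survivors
# with a classifier score (a RoBERTa endpoint in the paper, any callable
# here). mean_reciprocal_rank() implements the MRR metric of Section 5.

def stage1_filter(query, passages, pool_size=100):
    """Candidate retrieval: keep the pool_size passages with the most
    term overlap with the query (a proxy for the cluster's ranking)."""
    q_terms = set(query.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(q_terms & set(p.lower().split())),
                    reverse=True)
    return ranked[:pool_size]

def stage2_rank(query, candidates, score_fn):
    """Re-ranking: sort the candidate pool by classifier score, best first."""
    return sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)

def mean_reciprocal_rank(rankings, relevant):
    """MRR: average inverse rank of the first relevant passage per query.
    rankings maps query id -> ranked passage list; relevant maps query id
    -> set of relevant passages. A query with no hit contributes 0."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, passage in enumerate(ranked, start=1):
            if passage in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

In a deployed system, `score_fn` would batch-tokenize the candidate QA pairs and call the GPU endpoint, and the low-score threshold mentioned in Section 6 would be applied to the ranked output before returning the top n answers.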
Toward Summarizing Case Decisions via Extracting Argument
Issues, Reasons, and Conclusions
Huihui Xu (Intelligent Systems Program, University of Pittsburgh, USA), huihui.xu@pitt.edu
Jaromir Savelka (School of Computer Science, Carnegie Mellon University, USA), jsavelka@andrew.cmu.edu
Kevin D. Ashley (Intelligent Systems Program, University of Pittsburgh, USA), ashley@pitt.edu
ABSTRACT
In this paper, we assess the use of several deep learning classification algorithms as a step toward automatically preparing succinct summaries of legal decisions. Short case summaries that tease out the decision's argument structure by making explicit its issues, conclusions, and reasons (i.e., argument triples) could make it easier for the lay public and legal professionals to gain an insight into what the case is about. We have obtained a sizeable dataset of expert-crafted case summaries paired with full texts of the decisions issued by various Canadian courts. As the manual annotation of the full texts is prohibitively expensive, we explore various ways of leveraging the existing longer summaries, which are much less time-consuming to annotate. We compare the performance of systems trained on annotations that are manually ported from the summaries to the full texts with the performance of the same systems trained on annotations that are projected from the summaries automatically. The results show the possibility of pursuing the automatic annotation in the future.

CCS CONCEPTS
• Information systems → Information retrieval; Retrieval models and ranking; Similarity measures; • Applied computing → Law; Annotation.

KEYWORDS
Information retrieval, argument mining, legal analysis, relevant sentences, summarization

ACM Reference Format:
Huihui Xu, Jaromir Savelka, and Kevin D. Ashley. 2021. Toward Summarizing Case Decisions via Extracting Argument Issues, Reasons, and Conclusions. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3462757.3466098

© 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8526-8/21/06. $15.00

1 INTRODUCTION
The ability to automatically prepare succinct summaries of legal decisions could contribute to making legal source materials more accessible to the lay public. This depends, however, on whether the summaries capture the gist of the argument in the decision. In prior work, we proposed that such case summaries could be generated by extracting legal argument triples (IRC triples), including: 1) the major issues a court addressed in the case, 2) the court's conclusion with respect to each issue, and 3) the court's reasons for reaching the conclusion.

In [23], we evaluated whether a machine learning (ML) model can identify the components of legal argument triples in summaries prepared by legal professionals. We applied traditional ML algorithms (random forest variations) and deep neural network models (LSTM, CNN and FastText) to identify the sentence components of IRC triples in legal summaries, and to the task of binary classification of sentences (IRC vs. non-IRC) in the summaries and corresponding full text decisions. While the performance on the summaries was promising, the performance on the full texts was quite poor.

In this work, we have substantially increased the size of the annotated data set of full case texts compared to the prior work. We focus on applying deep learning algorithms (LSTM, CNN), including pre-trained transformer models (RoBERTa, CNN-BERT), with different loss functions to deal with the continuing challenge of data imbalance in our training set. We report the results of applying the different kinds of neural models to cases' full texts after training with manually-mapped human-annotated sentences from the summaries, and analyze the effects of using different loss functions and embeddings. We also report the results of a proof-of-concept experiment that applied automatically mapped human-annotated sentences from the summaries to the full texts in order to classify argument triples. If this succeeds, we would not need to manually annotate the full texts. It would suffice to manually annotate the summaries, automatically map those summary annotations to the full texts, and train a model directly on the full texts.

2 RELATED WORK
Argument mining research in the legal domain has focused on extracting propositions, premises, conclusions, and nested argument structures [16], argument schemes such as argument by example [6], rhetorical and other roles that sentences play in legal arguments [1, 19], stereotypical fact patterns that strengthen a side's claim (i.e., legal factors) in domains like trade secret law [4], reasons or warrants in arguments citing facts or principles [21], functional parts of legal decisions such as analysis or conclusions [20], and segments by topic [13] or by linguistic analysis [5, 7, 22].

We aim to identify legal argument triples and employ them to succinctly summarize case decisions. Yamada, et al. [24] have summarized Japanese judgments in terms of issues, conclusions, and
3 DATA SET
We defined the components of legal argument triples as follows:
annotators still need to read contextual information around a sentence to confirm the mapping and IRC type.

We undertook a proof-of-concept experiment to assess whether a strategy of automatic mapping could make the process more efficient in the future. The idea is to employ sentence embedding to map annotated summary sentences to full texts. Sentence embedding can represent a sentence and capture its semantic information as a vector. Cosine similarity is used to examine the degree of similarity between sentences in annotated summaries and full texts.

Sentence-BERT Embedding. Sentence embedding techniques represent an entire sentence and its semantic information as a vector. Sentence-BERT is a modification of the BERT neural model that uses siamese and triplet networks to produce semantically meaningful sentence embeddings [18]. Sentence-BERT has achieved high levels of performance in measuring the similarity of sentential arguments [15]. Considering the size of our data set, we chose to use the BERT base model for sentence embeddings, with 768 dimensions.

We calculated the cosine similarity score for each annotated sentence from a summary to every sentence in its corresponding full text. All the similarity scores are ranked in descending order, and only the top 5 sentences are selected as useful. The remaining sentences are marked as non-IRC type sentences.

There are reasons to believe that the automatically mapped sentences bear useful similarities to manually mapped ones. Table 1 shows examples comparing manually mapped IRC sentences and automatically mapped sentences. The top-ranked sentences often include the same key words as the manual sentences do. However, the Reason sentences that the algorithm prefers have fewer overlapping keywords than the Issue and Conclusion sentences.

Table 1: Comparison of manually annotated IRC summary sentences with top-1 automatically ranked full-text sentences (Sentence-BERT embedding with cosine similarity)

Issue
  Manual: Damage to both vehicles exceeded the insurance deductibles and both parties claim damages against each other for the amount of the deductibles.
  Rank 1: The damages to both the truck and the car exceeded the $500.00 insurance deductible. [. . . ]
Reason
  Manual: The plaintiff should have taken more appropriate measures to avoid the accident.
  Rank 1: Even if Schmidt concluded that Henry was going to proceed into his path, he had more appropriate alternatives than locking his brakes and turning to the right.
Conclusion
  Manual: Fault for this accident was attributed 10% to the defendant and 90% to the plaintiff.
  Rank 1: I attribute 10% of the fault in this accident to Henry and 90% to Schmidt.

4 MODELS
In this section, we present the details of the models, including convolutional [10] and recurrent neural networks [14], BERT-based neural networks [12], and a hybrid neural model combining a convolutional neural network with BERT embedding [8]. We also experiment with different loss functions for those neural networks, including cross-entropy loss, F1 loss, and focal loss.

4.1 Model Architectures
Convolutional Neural Networks. Convolutional neural networks (CNNs) use convolutional filters to extract local features. Originally applied to computer vision tasks, CNNs have also achieved a high level of performance in sentence classification tasks [9]. In our study, we use the filter hyperparameter settings from [9]: filter sizes of 3, 4, and 5, with 100 filters for each size. In other words, the models look for tri-grams, 4-grams, and 5-grams in sentences.

Long Short-Term Memory Networks. Long Short-Term Memory (LSTM) networks, a different RNN architecture, overcome the vanishing gradient problem by employing a cell that controls removing or adding information throughout the whole training process [14]. GloVe [17] is an unsupervised learning algorithm for obtaining vector representations for words. We used "glove.6B.100d" as pre-trained word embeddings to feed into the LSTM model; the vectors were trained on 6 billion tokens and have 100 dimensions. Dropout is also applied to the LSTM model.

BERT-based Neural Networks. Google AI Language introduced Bidirectional Encoder Representations from Transformers (BERT) in 2018 [3]. Instead of using a single word embedding like GloVe, BERT takes context into account by using bidirectional pre-training for language representations. This pre-training method is intended to grasp the contextual meaning of language better than single-directional pre-training. RoBERTa [12] replicates BERT training using an improved training methodology with more data and computational resources. For our study we used RoBERTa in its default configuration.

Convolutional Neural Network with BERT Embedding. CNN with BERT embedding takes BERT pre-trained embeddings as input and feeds them into a CNN model for classification. Unlike GloVe pre-trained word embedding, BERT pre-trained embedding is not static: as sentences are fed in, it produces the word embeddings on the fly. We combine the two models, a BERT-based model and a CNN classification model: the encoded text first passes through the BERT model, which produces BERT embeddings. The dimension of the BERT embedding (768) is higher than that of the GloVe pre-trained word embedding (100). Other hyperparameters remain the same as for the CNN-only model.
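To make the filter/n-gram correspondence concrete, here is a minimal numpy sketch of Kim-style convolution with max-over-time pooling. The random, untrained weights and the random 20-token "sentence" are placeholders, not the authors' trained parameters:

```python
import numpy as np

def conv_max_pool(embeddings, filter_sizes=(3, 4, 5), n_filters=100, seed=0):
    """For each window size k (an n-gram detector), slide n_filters
    filters over the sentence matrix, apply ReLU, and max-pool over
    time; concatenate to get one fixed-size sentence vector."""
    rng = np.random.default_rng(seed)
    seq_len, dim = embeddings.shape
    features = []
    for k in filter_sizes:
        W = rng.standard_normal((n_filters, k * dim))
        # every k-gram window, flattened: (seq_len - k + 1, k * dim)
        windows = np.stack([embeddings[i:i + k].ravel()
                            for i in range(seq_len - k + 1)])
        fmap = np.maximum(windows @ W.T, 0.0)   # ReLU feature map
        features.append(fmap.max(axis=0))       # max over time
    return np.concatenate(features)             # 3 x 100 = 300 dims

sent = np.random.default_rng(1).standard_normal((20, 100))  # 20 tokens, 100-dim (GloVe-sized)
vec = conv_max_pool(sent)
print(vec.shape)  # (300,)
```

Regardless of sentence length, each filter size contributes one max-pooled value per filter, which is what lets variable-length sentences map to a fixed 300-dimensional feature vector.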
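The sentence-embedding mapping procedure (embed each sentence, rank full-text sentences by cosine similarity, keep the top 5) can be sketched as follows; the small random vectors here are stand-ins for real 768-dimensional Sentence-BERT embeddings:

```python
import numpy as np

def map_summary_to_fulltext(summary_vecs, fulltext_vecs, top_k=5):
    """For each embedded summary sentence, rank all full-text sentence
    embeddings by cosine similarity (descending) and keep the top_k
    indices, mirroring the paper's top-5 selection."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(summary_vecs) @ normalize(fulltext_vecs).T
    order = np.argsort(-sims, axis=1)   # descending similarity
    return order[:, :top_k]

rng = np.random.default_rng(0)
full = rng.standard_normal((12, 8))                        # 12 "full-text" sentences
summ = full[[3, 7]] + 0.01 * rng.standard_normal((2, 8))   # near-copies of sentences 3 and 7
top = map_summary_to_fulltext(summ, full, top_k=5)
print(top[0][0], top[1][0])  # the near-copies should rank first
```

Sentences outside each top-5 list would then be labeled non-IRC, as described above.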
Table 2: Scores on the test set for models trained on both manually mapped and automatically mapped full-text sentences, using LSTM, CNN, RoBERTa, and CNN-BERT. Issue, Reason, and Conclusion are abbreviated I, R, and C; the suffixes -P and -R denote precision and recall. Ave-F1 stands for the average of the class-wise F1 scores.
All the models were trained for 30 epochs. We stored the models' checkpoints after each epoch and evaluated them on the separate validation set. The models with the lowest loss value on the validation set were selected for classification on the test set. The performance on the test set is shown in the table. Average F1 is the average of the class-wise F1 scores.

We found that LSTM with the cross-entropy loss function achieved the highest F1 scores on identifying Issues and Conclusions, 0.58 and 0.53, respectively. Both CNN with the F1 loss function and CNN-BERT with cross-entropy loss have the highest F1 score (0.27) on identifying Reasons. On average F1, CNN with the F1 loss function reached 0.43, the highest among all the models.

The right side of the table reports the performance of the models that were trained on the full-text sentences corresponding to the automatically mapped human-annotated sentences from the summaries. LSTM with F1 loss, CNN with focal loss, and RoBERTa with focal loss tied in their performance on Issues (0.30). LSTM with cross-entropy loss has the highest F1 (0.20) on Reasons. For Conclusion classification, CNN-BERT with cross-entropy loss achieves the highest performance (0.39). Finally, RoBERTa with F1 loss achieves the highest score in terms of average F1.

Surprisingly, RoBERTa and CNN-BERT with cross-entropy and focal losses perform better on automatically mapped data than on manually mapped data in terms of average F1. However, LSTM and CNN do not show the same pattern. The automatically mapped data are selected by sentence similarity scores with respect to Sentence-BERT embedding. RoBERTa and CNN-BERT somehow take advantage of information contained in the sentence embedding to make a better classification. We are not sure how it affects the performance and will investigate it further. We also observed that models tend to perform better on Issue and Conclusion than on Reason regardless of the type of training set. Since Reasons frequently include case facts, it is harder for models to classify them.

Despite the relatively comparable F1 scores between training on manually mapped data and automatically mapped data, the precision for all types of sentences drops significantly in most cases when models are trained on automatically mapped data. This means that such models have a lower probability of making correct classifications. When we compare the performance among the same models with different loss functions, the model with the F1 loss function always has the highest average F1 score when trained on manually mapped data, except for LSTM. The same pattern does not hold for the automatically mapped data. Each loss function has its own strengths with respect to training set and model selection.

6 DISCUSSION AND ERROR ANALYSIS

6.1 Discussion
The results for classifying full-text sentences trained on automatically mapped data, the right side of Table 2, are significantly higher than those trained on annotated summaries in prior work, where the highest F1 scores for full texts trained on annotated summaries were Issue (0.27), Reason (0.14), and Conclusion (0.24). We attribute this improvement to using manually mapped training sentences in the full texts, the larger amount of annotated data, and the use of deep learning algorithms plus transformer models (LSTM, CNN, RoBERTa, and CNN-BERT).

As noted, we tried different kinds of neural models paired with different loss functions. We confirmed that the F1 loss function improved the performance of CNN and RoBERTa: RoBERTa with F1 loss yielded 0.21 on Reason and 0.41 on Conclusion, while RoBERTa without F1 loss produced only 0.01 on Reason and 0.16 on Conclusion. When a loss function is aligned with the evaluation metric, it is likely to improve model performance. LSTM, however, did not perform well with the F1 loss function: on the manually mapped data, LSTM(F1) yielded 0.0 on all IRC types.

Those models each have advantages for certain sentence types. LSTM(cross-entropy) yielded the highest F1 scores on Issues and Conclusions. CNN(F1) and CNN-BERT(cross-entropy) performed best on identifying Reasons. In general, models have difficulty identifying Reason sentences, since Reasons have more complex semantic meanings. As noted, Reasons are intertwined with facts, which can easily be classified as the non-IRC type. The annotators
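The "F1 loss" referred to above is commonly implemented as a differentiable "soft F1" computed from soft counts; the paper does not spell out its exact formulation, so the following numpy sketch is one standard variant, not the authors' implementation:

```python
import numpy as np

def soft_f1_loss(probs, labels, eps=1e-8):
    """Macro soft-F1 loss: build soft TP/FP/FN counts from predicted
    class probabilities, then return 1 - mean(per-class soft F1)."""
    n_classes = probs.shape[1]
    onehot = np.eye(n_classes)[labels]
    tp = (probs * onehot).sum(axis=0)          # soft true positives
    fp = (probs * (1 - onehot)).sum(axis=0)    # soft false positives
    fn = ((1 - probs) * onehot).sum(axis=0)    # soft false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return float(1.0 - f1.mean())

labels = np.array([0, 1, 0])
good = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])   # mostly correct
bad = np.array([[0.4, 0.6], [0.6, 0.4], [0.5, 0.5]])    # mostly wrong
print(soft_f1_loss(good, labels) < soft_f1_loss(bad, labels))  # True
```

Because every term is a smooth function of the probabilities, the loss is differentiable and can be minimized directly, which is what aligns training with the F1 evaluation metric.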
confirmed that Issues and Conclusions are easier to catch. They employ distinct keywords such as "issue", "conclusion", etc.

LSTM has the ability to detect temporal information about a sequence and can handle arbitrary input lengths, while CNN can only accept fixed-size input. We think the ability to handle sequential information and longer lengths makes LSTM more suitable for Issues and Conclusions, since these involve plainer language than Reasons. CNN has the upper hand at spotting Reasons: the literal composition of Reasons is more diverse than that of Issues and Conclusions, and the convolutional features can capture this diversity.

6.2 Error Analysis
With respect to the right side of Table 2, the proof-of-concept study training models on automatically mapped data, the results suggest that classifying argument triples is feasible, but less effective than with the manually mapped data when taking precision and recall into account. We are particularly interested in the errors that the models made classifying Reasons. As noted, targeting Reasons correctly is harder since they tend to be more complex and diverse than Issues and Conclusions.

Some of the misclassifications involved phrases attributing an expressed view to the judge. This is a positive sign, in that such self-referential judicial sentences are relatively infrequent in a case opinion and indicate sentences where the judge is more likely to assert that something is an Issue, Conclusion, or Reason. On the other hand, such self-referential attribution phrases do not necessarily discriminate among the three classifications.

7 FUTURE WORK
We plan to continue to annotate new cases in order to increase the size of the training set. Currently, the corpus includes 574 annotated summary / full-text pairs. The data set is still not large enough to adequately train more complex neural network models. It is sufficiently large, however, to allow us to continue to explore models and identify some challenges. The experience helps us to improve the quality of the data and informs our intuitions about how human summarizers do their work. We expect that the more annotated data we collect, the more interesting properties we will be able to observe in this process.

As noted, prior work explored different sampling strategies for dealing with imbalanced data to improve model performance. Different sampling methods have their merits in terms of their effects on training sets and model types. In this study, we briefly investigated a different method of adding augmented data to improve the performance of the models. Although the results were not as we expected, we observed that it had some positive effect on identifying Reasons in full texts. We will continue to explore other methods to deal with our imbalanced data.

We also plan to test whether a pre-trained legal language model improves performance over a generic language model.

ACKNOWLEDGMENTS
This work has been supported by grants from the Autonomy through Cyberjustice Technologies Research Partnership at the University of Montreal Cyberjustice Laboratory and the National Science Foundation, grant no. 2040490, FAI: Using AI to Increase Fairness by Improving Access to Justice. The Canadian Legal Information Institute provided the corpus of paired legal cases and summaries. Computation resources were provided by the Center for Research Computing at the University of Pittsburgh.

REFERENCES
[1] A. Bansal, Z. Bu, B. Mishra, S. Wang, K. Ashley, and M. Grabmair. 2016. Document Ranking with Citation Information and Oversampling Sentence Classification in the LUIMA Framework.
[2] J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] M. Falakmasir and K. Ashley. 2017. Utilizing Vector Space Models for Identifying Legal Factors from Text. In JURIX. 183–192.
[5] A. Farzindar and G. Lapalme. 2004. Legal text summarization by exploration of the thematic structure and argumentative roles. In Text Summarization Branches Out. 27–34.
[6] V. Feng and G. Hirst. 2011. Classifying arguments by scheme. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 987–996.
[7] C. Grover, B. Hachey, and C. Korycinski. 2003. Summarising legal texts: Sentential tense and argumentative roles. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop. 33–40.
[8] Changai He, Sibao Chen, Shilei Huang, Jian Zhang, and Xiao Song. 2019. Using convolutional neural network with BERT for intent determination. In 2019 International Conference on Asian Language Processing (IALP). IEEE, 65–70.
[9] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882 (2014). arXiv:1408.5882 http://arxiv.org/abs/1408.5882
[10] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. Text classification algorithms: A survey. Information 10, 4 (2019), 150.
[11] J. Landis and G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics (1977), 159–174.
[12] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[13] Q. Lu, J. Conrad, K. Al-Kofahi, and W. Keenan. 2011. Legal document clustering with built-in topic segmentation. In Proc. 20th ACM Int'l Conf. on Information and Knowledge Management. 383–392.
[14] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao. 2020. Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705 (2020).
[15] Amita Misra, Brian Ecker, and Marilyn A. Walker. 2017. Measuring the similarity of sentential arguments in dialog. arXiv preprint arXiv:1709.01887 (2017).
[16] R. Mochales and M. Moens. 2011. Argumentation mining. Artificial Intelligence and Law 19, 1 (2011), 1–22.
[17] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[18] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084
[19] M. Saravanan and B. Ravindran. 2010. Identification of rhetorical roles for segmentation and summarization of a legal judgment. Artificial Intelligence and Law 18, 1 (2010), 45–76.
[20] J. Savelka and K. Ashley. 2018. Segmenting U.S. Court Decisions into Functional and Issue Specific Parts. In Proceedings, 31st Int. Conf. on Legal Knowledge and Information Systems (JURIX). 111–120.
[21] O. Shulayeva, A. Siddharthan, and A. Wyner. 2017. Recognizing cited facts and principles in legal judgements. Artificial Intelligence and Law 25, 1 (2017), 107–126.
[22] A. Wyner, R. Mochales-Palau, M. Moens, and D. Milward. 2010. Approaches to text mining arguments from legal cases. In Semantic Processing of Legal Texts. Springer, 60–79.
[23] Huihui Xu, Jaromír Šavelka, and Kevin D. Ashley. 2020. Using Argument Mining for Legal Text Summarization. Legal Knowledge and Information Systems (JURIX 2020), 184–193.
[24] H. Yamada, S. Teufel, and T. Tokunaga. 2019. Building a corpus of legal argumentation in Japanese judgement documents: towards structure-based summarisation. Artificial Intelligence and Law 27, 2 (2019), 141–170.
Part III
Extended Abstracts
CriminelBART: A French Canadian Legal Language Model
Specialized in Criminal Law∗
Nicolas Garneau, Eve Gaumond, Luc Lamontagne, Pierre-Luc Déziel
Laval University, Computer Science Department and Faculty of Law
Québec, Canada
nicolas.garneau@ift.ulaval.ca, eve.gaumond@observatoire-ia.ulaval.ca,
luc.lamontagne@ift.ulaval.ca, pierre-luc.deziel@fd.ulaval.ca
Criminal charges. In order to determine whether CriminelBART grasped the accusations' distribution from the corpus, we ask the model to fill in the following sentence (translated into English):

The defendant is accused of <Mask> under the Criminal Code.

The top 5 predicted passages (with semantic duplicates removed) comprise "driving under the influence", "aggravated assault", "possession of narcotics", "dangerous driving", and "hit-and-run". As expected, CriminelBART suggests crimes related to the driving of motor vehicles and other infractions related to controlled drugs and substances, which constitute the main accusations in the corpus (up to 50%). Nonetheless, CriminelBART predicts various crimes, while BARThez, a more generic model, mainly returns crimes restricted to premeditated murder.

Legal provisions. We also probe CriminelBART for legal provisions such that, given a context, it predicts which provision the context is associated with. To this end, we created 84 cloze tests on 28 provisions, on which CriminelBART achieves an accuracy of 64% over all provisions. Unsurprisingly, BARThez achieves 0% by predicting random tokens. Here is an example regarding provision 4, "Possession of substance", from the Controlled Drugs and Substances Act;

crimes. Similar experiments with a generic BARThez model always result in a few unrelated names, such as "Mr. Gagné". While this experiment does not expose a clear bias in CriminelBART, the mere possibility, however small, that defendants' names may come out of this model is a privacy matter that cannot be ignored. Even though the identity of judges is deemed public, this is not to be taken lightly. In a context where judges are already reluctant about the idea that their work might be subject to analyses conducted by AI, it is important not to scare them with uses of technology that may be prejudicial to them. A more in-depth analysis should be conducted before releasing this language model at scale, which is kept as future work. This decision is also highly motivated by the recent criticism of language models as stochastic parrots and the different attacks performed on them in order to extract their training sets. In future work, we wish to leverage CriminelBART as a textual description generator for plumitifs, short legal documents known to be unintelligible [1].

Acknowledgements. We thank the reviewers for their insightful comments. This research was funded by both the Natural Sciences and Engineering & Social Sciences and Humanities Research Councils of Canada.
Applying Decision Tree Analysis to Family Court Decisions:
Factors Determining Child Custody in Taiwan

Sieh-Chuen Huang (College of Law, National Taiwan University, Taiwan; schhuang@ntu.edu.tw)
Hsuan-Lei Shao* (East Asia Studies Dept., National Taiwan Normal University, Taiwan; hlshao2@gmail.com)
Robert B. Leflar (School of Law, University of Arkansas, United States (retired 2020); College of Law, National Taiwan University, Taiwan (since 2020); rbleflar@uark.edu)
CCS CONCEPTS
• Information systems applications → Data mining; • Decision support systems → Expert systems.

KEYWORDS
child custody, best interests of the child, legal factor, machine learning, decision tree

1 INTRODUCTION
The doctrine of the "best interests of the child" has guided courts in determining post-divorce child custody cases in Taiwan since the amendment of the Taiwan Civil Code in 1996, which overturned previous patriarchal practices, at least as a matter of law. Previous empirical studies have used descriptive statistics to analyze court cases and determine which of the factors set out in Article 1055-1 judges tend to consider. However, these approaches do not clarify which factors judges consider primary.

This study collects Taiwanese family court decisions from 2012 to 2017 and employs decision tree analysis, a commonly used machine learning technique. This appears to be the first published application worldwide of machine learning to the analysis of family court decisionmaking.

The study concludes that the three most significant factors considered by judges in Taiwan are, first, which parent is the child's current primary caregiver, followed by the wishes of the child and the judge's assessment of the relative quality of each parent's parent-child interaction. This result runs counter to widely held beliefs that parental gender and parents' occupations and economic resources are still prime factors in judges' deliberations.

2 BACKGROUND OF TAIWAN CHILD CUSTODY WHEN PARENTS DIVORCE
Before the 1996 amendment, Taiwan's Civil Code stipulated that, in both consensual and judicial divorce, the custody of children belonged to the father unless either it had been agreed otherwise in a consensual divorce (Article 1051) or the court had decided otherwise (Article 1055). The 1996 amendment repealed Article 1051 and amended Article 1055, replacing the paternal preference with the "best interests of the child" doctrine and recognizing joint custody and non-custodial parents' visitation rights. New Article 1055-1 lists several factors that judges must consider, such as the age, sex, birth order, health condition, and wishes of the child, and the age, occupation, character, economic ability, and lifestyle of the parents. Such a broad standard gives judges considerable discretion in deciding what is in the best interests of the child.

Our goal is to provide a clear answer to the long-standing debate over which "most important factors" in fact influence judges' custody decisions. To accomplish this goal, we applied a machine learning algorithm to determine what those factors are. We established a dataset which we carefully labeled according to our research concerns (weighing normative "factors"). Our model predicts outcomes accurately, and it also provides insights into the "most important factors" in a way useful to parents, their lawyers, and legal scholars.

3 RESEARCH DESIGN

3.1 Data Collecting
All cases decided by district courts in Taiwan since 2000, except juvenile and sexual assault cases, are open to the public on the official website of the Judicial Yuan. We focus on child custody decisions of first instance decided by family and district courts. Using carefully chosen causes of action, keywords, and decision dates, we identified 3,028 child custody decisions between January 1, 2012 and December 31, 2017.

We then limited our sample to cases in which both parents were Taiwanese and both sought to acquire custody. This is because when one parent (usually the defendant) does not come to court or keeps silent, the other (the plaintiff) is very likely to receive custody. We excluded these cases from our dataset. Among the 3,028 cases, the 2,096 cases in which one of the parents did not express any opinion regarding the custodian and the 97 cases of transnational marriage were therefore excluded. The remaining 835 cases involve 1,290 children. Among them, 1,126 children were under sole custody (87.3%), 159 children were under joint custody (12.3%), and 5 children were under third-party guardianship (0.4%).

* Corresponding author.
ICAIL'21, June 21–25, 2021, São Paulo, Brazil. © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8526-8/21/06. https://doi.org/10.1145/3462757.3466076

3.2 Dataset Construction / Annotation Labels
We created a model that predicts the value of a dependent variable: custody granted to father (labeled "1") or mother (labeled "0"). As
independent variables, we manually defined 19 factors from Article 1055-1, social workers' evaluation items, and previous literature (Table 1 below).

Table 1: Factors Considered by Judges

Character  Factor
Child      Sex; Age; Child Willing; Emotional feelings between the other persons living together and the child; Health condition
Parent     Character (drug use, alcohol consumption, ...); Economy; Willing; Undue behavior (domestic violence towards the child); Parenting time; Parenting environment; Friendly parent; Primary caretaker (caregiver); Understanding the child; Parenting plan
Both       Parent-child interaction; Current residence; Support system
Others     Social worker's report

3.3 Decision Tree Learning
This study adopts the CHAID (Chi-squared Automatic Interaction Detector) algorithm to make strategic splits. The algorithm's predictive power is better than that of statistical techniques such as regression, and it displays the nodes clearly, making the outcomes explainable in terms of the 19 factors listed above.

4 RESEARCH RESULTS

4.1 Model Demonstration
Figure 1: Decision Tree of Child Custody Cases in Taiwan

the current caregiver of the child is in the first place. If the caregiver is the mother (represented by the number "3") or both mother and father (represented by the number "2"), the model follows the line (branch) on the left side to the next node, childWill, meaning that the model predicts that judges will, secondarily, consider the child's wishes. If the child prefers the mother or has no preference (>= 2), the model again follows the line on the left side, coming to the next end node, which is labeled with a probability distribution indicating that, in the validation set, custody has a very high probability (95%) of going to the mother (the label "0"). On the contrary, even if the mother is the primary caregiver, in the case that the child prefers the father (< 2), the probability that the mother gets custody becomes fairly low (17% in the validation set).

4.2 Model Efficiency
The model's accuracy is 96.5% on its test set and the F1 score is 0.9783, indicating that the model is quite satisfactory.

Table 2: Confusion Matrix of Child Custody Model Test Set (N = 226)

                                Predicted positive (for mother)   Predicted negative (for father)
Actual positive (for mother)    True positive, TP = 181           False negative, FN = 7
Actual negative (for father)    False positive, FP = 1            True negative, TN = 37

5 CONCLUSION
Among the numerous factors stipulated in Article 1055-1 of the Taiwan Civil Code, "primary caregiver," "child's wishes," and "parent-child interaction" are the three most significant factors contemplated by judges. The pattern of Taiwanese judges' decision-making regarding child custody appears to have been relatively constant and stable during the six-year period studied.

Furthermore, in custody disputes addressed by judicial decisions, the mother seems to have an overwhelming advantage: in our dataset, mothers had a 75% likelihood of receiving sole custody.

This research should help legal scholars identify particular custody cases as either typical or exceptional by comparing the machine-predicted results to actual judgments. A case may be exceptional, and worth further exploration and research, if the two outcomes are inconsistent. In addition, this study should assist parents and their lawyers in preliminarily evaluating their chances of acquiring custody. And when the outcome of litigation can be predicted in advance by the parties, the likelihood of cases going to court will fall and the likelihood of settlement will increase. Thus, machine learning helps us predict what the "law" is and hence contributes to legal certainty.

ACKNOWLEDGMENTS
Hsuan-Lei Shao, "From Knowledge Genealogy to Knowledge Map: China Studies in Big Data and Machine Learning" (107-2410-H-003-058-MY3, Ministry of Science and Technology), Taiwan.
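The two branches walked through in Section 4.1 amount to a pair of nested rules. A hypothetical hard-coded sketch, where the codes 3 = mother and 2 = both parents and the 95%/17% leaf probabilities come from the text, while every other branch of the tree is omitted:

```python
def p_mother_custody(caregiver, child_will):
    """Follow the two decision-tree branches described in Section 4.1:
    first the primary caregiver (3 = mother, 2 = both), then the
    child's wish (>= 2 means prefers mother or no preference).
    Returns the reported validation-set probability that custody
    goes to the mother."""
    if caregiver in (2, 3):          # mother (or both parents) is the caregiver
        if child_will >= 2:          # child prefers mother / no preference
            return 0.95
        return 0.17                  # child prefers the father
    raise NotImplementedError("other branches are not reported in the text")

print(p_mother_custody(3, 2), p_mother_custody(3, 1))  # 0.95 0.17
```

A real CHAID tree would of course be induced from data rather than hand-written; this sketch only mirrors the published walkthrough.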
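The reported numbers can also be checked mechanically: the sample-filtering arithmetic of Section 3.1 and the accuracy/F1 of Section 4.2 both follow from figures given in the text. A small sketch, where the dictionary flags are a hypothetical encoding of the two exclusion criteria and "mother" is the positive class:

```python
def filter_cases(cases):
    """Keep only cases where both parents expressed an opinion on
    custody and the marriage was not transnational (Section 3.1)."""
    return [c for c in cases
            if c["both_expressed_opinion"] and not c["transnational"]]

def metrics(tp, fn, fp, tn):
    """Accuracy and F1 from a binary confusion matrix (Table 2)."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return acc, 2 * precision * recall / (precision + recall)

# Section 3.1: 3,028 decisions; 2,096 + 97 excluded; 835 remain
cases = ([{"both_expressed_opinion": False, "transnational": False}] * 2096
         + [{"both_expressed_opinion": True, "transnational": True}] * 97
         + [{"both_expressed_opinion": True, "transnational": False}] * 835)
kept = filter_cases(cases)

# Table 2: TP = 181, FN = 7, FP = 1, TN = 37 on the 226-case test set
acc, f1 = metrics(tp=181, fn=7, fp=1, tn=37)
print(len(kept), round(acc * 100, 1), round(f1, 4))  # 835 96.5 0.9784
```

The recomputed F1 rounds to 0.9784, in line with the 0.9783 reported in the paper.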
Sentence Classification for Contract Law Cases: A Natural
Language Processing Approach

Jonathan R. Mok (Norris Injury Lawyers PC; jonrexmok@gmail.com)
Wai Yin Mok (University of Alabama in Huntsville; mokw@uah.edu)
Rachel V. Mok (Independent Researcher; rmok57@gmail.com)
in their terms and are reiterated by the court; these sentences will typically begin by stating that ABC party argues XYZ point. Party issue sentences are less significant than court issue sentences, unless the court explicitly agrees with a particular party issue sentence, which at that point elevates it to a court determination known as a court holding.

A holding sentence states a court conclusion that resolves a dispute by applying relevant law to the present set of facts in a case. A holding sentence is the most important type of sentence in any judicial opinion; it is considered precedent and binding on any future case brought within the issuing court's jurisdiction.

Reasoning sentences help explain how a court has reached its conclusion; a reasoning sentence is analogous to a mathematical proof, but not nearly as exact. Reasoning sentences can be difficult to classify definitively; seasoned practitioners can mistake reasoning sentences for holding sentences and vice versa. Any court determination not based on the present facts of the case is not considered a holding sentence, but is likely a reasoning sentence; any hypothetical statement made by the court, such as an example or analogy, is not a holding sentence, but is likely a reasoning sentence; and any commentary on law or fact not regarding the present case, such as on a past case, is not a holding sentence, but is likely a reasoning sentence.

Law sentences are statements of law made by the court and are always preceded or followed by a citation, or the citation may appear in the sentence itself. A law sentence may be the court restating a law as it appears in the code of law, the court restating a holding from a prior case, or the court stating its own interpretation of a law or a prior court's holding.

Another type of sentence found in court opinions is the procedural history sentence, which describes how a case has progressed through the court system to be evaluated before the present court. Almost all cases found in case law begin in a court that makes both factual and legal determinations, such as which party's account of the dispute is considered correct and what legal principles are applied to resolve the case; such a court is called a court of first instance or trial court. All losing parties in civil and criminal matters, bar the criminal prosecution, have a right to appeal to a higher court, called an appellate court, for additional review of mainly legal principles. Appellate courts rarely engage in fact finding and do so only under specific circumstances, such as when new relevant evidence arises in a case and is properly presented. In actuality, the majority of case law is written by appellate courts. Consequently, appellate courts will describe in procedural history sentences how a case was decided by a court of first instance and which party

As previously stated, this paper will explore the classification of legally useful sentence types found in contract law case law in the state of Alabama. Due to the difficulty of classifying reasoning sentences at this stage of the research, such classification has been excluded, but it will be examined in later research. As a result, this paper will focus on only seven sentence types: Fact (FCT), Court Issue (CTI), Party Issue (PTI), Holding (HLD), Law (LAW), Procedural History (PRH), and Reference (REF).

2 KNOWLEDGE BASE
(a) Testing cases, 1,063 total sentences. (b) Training cases, 646 total sentences.
Figure 1: Confusion matrices of the testing and training cases. The elements along the diagonal show the number of sentences that were identified correctly; each off-diagonal element shows the number of sentences that were identified incorrectly and the label they were assigned.

spaCy [1] has been chosen to parse and process court cases. Twelve Alabama contract cases were downloaded from JUSTIA, of which five are designated as training cases and seven as testing cases. Two different approaches are adopted for classifying the sentences: a rule-based approach for the reference sentences and a knowledge-based approach for the other types of sentences. Because fact sentences are the default, the knowledge base contains fragments of sample sentences that are neither reference nor fact sentences. The fragments are chosen to represent the essential parts of the sample sentences from which they were extracted, but they are general enough that sentences whose types are to be determined may contain similar fragments. The similarity of a sentence to the sentence fragments in the knowledge base is calculated by spaCy's similarity function, which is based on word vectors [2, 3]. Shown in the above figure are the results of
appealed the case on what grounds to bring the case before it. Such our algorithm. Although our algorithm is still in an early stage of
information is of some value to practitioners, but can essentially be development, it yields a classification accuracy rate of 68.67% on
disregarded when forming a knowledge base for machine learnable the testing cases, demonstrating the validity of our approach.
information for the purposes of this paper.
Thus far, sentence classification according to fact sentences, is- REFERENCES
sue sentences further classified by court or party, law sentences, [1] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020.
spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/
reasoning sentences, holding sentences, and procedural history 10.5281/zenodo.1212303
sentences can be applied to every court opinion. Every court opin- [2] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
Estimation of Word Representations in Vector Space. In 1st International Con-
ion also contains references that are valuable to practitioners, but ference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May
can essentially be disregarded when forming a knowledge base 2-4, 2013, Workshop Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
for machine learnable information for the purposes of this paper. http://arxiv.org/abs/1301.3781
[3] Tomás Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013.
Citations are classified as reference sentences in this paper. Distributed Representations of Words and Phrases and their Compositionality.
CoRR abs/1310.4546 (2013). arXiv:1310.4546 http://arxiv.org/abs/1310.4546
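The fragment-matching step described above can be sketched in plain Python. This is a toy illustration rather than the authors' system: the three-dimensional word vectors, the fragments, the labels, and the 0.5 threshold are all invented for the example, and spaCy's similarity function is approximated here by the computation it performs internally, cosine similarity over averaged word vectors.

```python
import math

# Toy word vectors standing in for spaCy's pretrained vectors.
# Everything in this table is illustrative, not the authors' knowledge base.
WORD_VECS = {
    "court":    [0.9, 0.1, 0.0],
    "holds":    [0.8, 0.3, 0.1],
    "appealed": [0.1, 0.9, 0.2],
    "trial":    [0.2, 0.8, 0.3],
    "contract": [0.4, 0.2, 0.9],
}

def avg_vector(tokens):
    """Average the vectors of known tokens (spaCy does the same for a Doc)."""
    vecs = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity, the measure behind spaCy's similarity()."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Knowledge base: sentence fragment -> sentence type (labels from the paper,
# fragments invented for the example).
KNOWLEDGE_BASE = {
    ("court", "holds"): "HLD",
    ("appealed", "trial"): "PRH",
}

def classify(tokens, threshold=0.5, default="FCT"):
    """Return the label of the most similar fragment, or FCT (the default)."""
    best_label, best_sim = default, threshold
    for fragment, label in KNOWLEDGE_BASE.items():
        sim = cosine(avg_vector(tokens), avg_vector(fragment))
        if sim > best_sim:
            best_label, best_sim = sim and label, sim
    return best_label

print(classify(["court", "holds", "contract"]))
```

With real spaCy vectors, the same loop would compare `nlp(sentence).similarity(nlp(fragment))` for each knowledge-base entry and keep the best match above the threshold.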
Constraint Answer Set Programming as a Tool to Improve
Legislative Drafting
A Rules as Code Experiment
Jason Morris
jmorris@smu.edu.sg
Singapore Management University Centre for Computational Law
Singapore
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Jason Morris
A set of 25 tests was encoded, and there were 4 test failures not explained by errors in encoding the Rule or the tests. These four failures were investigated by the author by performing 'why not' queries and reviewing the justifications provided by s(CASP).

This process revealed that the failing tests were encoded on the basis of an expectation that the word "business" in Rule 34(1)(b) referred to a legal practitioner's activities. But Rule 34(9) defines "business" to refer to a general category of undertaking. Setting out a test in which Rule 34(1)(b) applied, while also using the defined meaning of "business", required making statements that did not have clearly meaningful real-world equivalents. This suggested that Rule 34(1)(b) might also use the word "business" in a way inconsistent with the defined meaning, which would be a drafting issue.

That issue was raised with the rest of the research team, who confirmed that Rule 34(1)(b) had been faithfully encoded, that the expectations of the failing tests were reasonable, and that Rule 34(1)(b) required the use of an interpretation of the word "business" that is inconsistent with the defined meaning of the word in order to give effect to that expectation, or to give it any clear meaning at all.

The research team seriously considered the possibility that there might be a different interpretation of other aspects of the Rule that would make Rule 34(1)(b) more clearly meaningful. The team was unable to find an interpretation that would have had that effect and would not also make Rule 34(1)(b) redundant to other portions of the Rule. The research team therefore concluded that it would be more correct if Rule 34(1)(b) referred not to businesses but to the holding of an executive appointment.

The researchers agreed on the following proposed replacement for Rule 34(1)(b):

    (1A) A legal practitioner must not accept any executive appointment that materially interferes with —
    (i) the legal practitioner's primary occupation of practising as a lawyer;
    (ii) the legal practitioner's availability to those who may seek the legal practitioner's services as a lawyer; or
    (iii) the representation of the legal practitioner's clients.

The proposed amendment was encoded, and the tests re-run. All 25 tests passed.

6 CONCLUSIONS

Our experiment demonstrates the use of the Rules as Code methodology to detect a drafting issue in a proposed statutory text, and to verify the effect of a proposed amendment. The issue discovered in this experiment is the type of issue that Rules as Code is intended to address early: one that, if left unaddressed, negatively affects the degree to which the statutory text can be automated.

With regard to s(CASP)'s strengths and weaknesses for this task, access to "why not" queries and natural-language justifications was extremely valuable both in the encoding of the Rule and in the analysis of test failures. s(CASP)'s abductive reasoning over constraints, and the fact that it returned answer sets rather than bindings, allowed the author to test the encoding against a wide variety of fact scenarios simultaneously, quickly providing a deep level of insight into the behaviour of the encoding. s(CASP) also facilitated the use of a version of defeasibility that allowed defeating relations of both the "subject to" and "despite" types to be encoded where they appear in the text, enhancing the maintainability of the code [7].

s(CASP)'s abductive queries slow down considerably with the complexity of the code, and so it may not be an appropriate approach for real-time applications of abductive reasoning. However, its performance on deductive reasoning tasks was very efficient, completing the 25 tests in this experiment in an average of less than 1 second each, which suggests it can also be used to answer legal questions with complicated fact scenarios and complicated rules in a user-facing application.

ACKNOWLEDGMENTS

I owe a debt of gratitude to all my colleagues at the SMU Centre for Computational Law, and in particular our Principal Investigator Meng Weng Wong, Industry Director Alexis Chun, and Professors Lim How Khang and Jerrold Soh, all of whom contributed greatly to the legal analysis. Professors Gopal Gupta of the University of Texas at Dallas and Joaquín Arias of Universidad Rey Juan Carlos provided valuable assistance on the effective use of s(CASP). The feedback of the reviewers has also improved the paper and is gratefully acknowledged.

This research is supported by the National Research Foundation (NRF), Singapore, under its Industry Alignment Fund – Pre-Positioning Programme, as the Research Programme in Computational Law. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

REFERENCES
[1] L. Allen and C. R. Engholm. 1978. Normalized Legal Drafting and the Query Method. Journal of Legal Education 29 (1978), 380–412.
[2] Joaquín Arias, Manuel Carro, Zhuo Chen, and Gopal Gupta. 2020. Justifications for Goal-Directed Constraint Answer Set Programming. arXiv preprint arXiv:2009.10238 (2020).
[3] Joaquín Arias, Manuel Carro, Elmer Salazar, Kyle Marple, and Gopal Gupta. 2018. Constraint answer set programming without grounding. Theory and Practice of Logic Programming 18, 3-4 (2018), 337–354.
[4] Organization for Economic Cooperation and Development, Observatory for Public Sector Innovation. [n.d.]. Cracking the Code: Rulemaking for humans and machines. Accessed February 28, 2021, at https://oecd-opsi.org/wp-content/uploads/2020/10/Rules-as-Code_Highlights_Final_HighRes.pdf
[5] D. Merritt. 2017. Expert Systems in Prolog. Independently Published. https://books.google.com.sg/books?id=6IQGyQEACAAJ
[6] Marek J. Sergot, Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek, Peter Hammond, and H. Terese Cory. 1986. The British Nationality Act as a logic program. Commun. ACM 29, 5 (1986), 370–386.
[7] Hui Wan, Benjamin Grosof, Michael Kifer, Paul Fodor, and Senlin Liang. 2009. Logic Programming with Defaults and Argumentation Theories. In Logic Programming, Patricia M. Hill and David S. Warren (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 432–448.
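The test-driven loop described in this experiment can be illustrated with a toy deductive encoding of the proposed Rule 34(1A). This is a plain-Python sketch for illustration only: the experiment itself used s(CASP), and the field names below are our own assumptions, not defined terms of the Rule.

```python
from dataclasses import dataclass

# Toy re-encoding of proposed Rule 34(1A). The three boolean fields mirror
# limbs (i)-(iii) of the amendment; their names are illustrative assumptions.

@dataclass
class ExecutiveAppointment:
    interferes_with_primary_occupation: bool = False  # limb (i)
    interferes_with_availability: bool = False        # limb (ii)
    interferes_with_representation: bool = False      # limb (iii)

def may_accept(appointment: ExecutiveAppointment) -> bool:
    """Rule 34(1A): a legal practitioner must not accept an executive
    appointment that materially interferes with any of limbs (i)-(iii)."""
    return not (
        appointment.interferes_with_primary_occupation
        or appointment.interferes_with_availability
        or appointment.interferes_with_representation
    )

# Encoded tests in the style of the experiment: each pairs a fact scenario
# with the outcome the legal experts expect.
tests = [
    (ExecutiveAppointment(), True),
    (ExecutiveAppointment(interferes_with_primary_occupation=True), False),
    (ExecutiveAppointment(interferes_with_availability=True), False),
    (ExecutiveAppointment(interferes_with_representation=True), False),
]
assert all(may_accept(a) == expected for a, expected in tests)
print("all tests passed")
```

What this sketch cannot do, and s(CASP) can, is answer "why not" queries or reason abductively over unstated facts; it only shows the encode-test-amend cycle in miniature.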
Predicting Legal Proceedings Status: Approaches Based on
Sequential Text Data
Felipe Maia Polo (felipemaiapolo@gmail.com), University of São Paulo, Brazil; Advanced Institute for AI (AI2), Brazil
Itamar Ciochetti (itamar@tikal.tech), Tikal Tech, Brazil
Emerson Bertolo (emerson@tikal.tech), Tikal Tech, Brazil
ACM Reference Format:
Felipe Maia Polo, Itamar Ciochetti, and Emerson Bertolo. 2021. Predicting Legal Proceedings Status: Approaches Based on Sequential Text Data. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466138

1 OBJECTIVE AND PRACTICAL IMPORTANCE OF THIS WORK

The objective of this work, which is fully given by Polo et al. [8], is to develop predictive models to classify Brazilian legal proceedings into three possible classes of legal status: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. Each proceeding is made up of a chronological sequence of short texts called "motions" written by the courts' administrative staff. The motions relate to the proceedings, but not necessarily to their legal status. Moreover, the proceedings' labels are decided by the courts to organize their workflow. This problem's resolution is intended to assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.

2 RELATED WORK

Despite researchers' efforts to create applications in the legal field, we were unable to find an attempt to solve a problem like ours in the literature. The issues closest to ours that we could find in the literature are those of identifying the parties in legal proceedings [7], classifying legal documents according to their administrative labels [2], or predicting the area a proceeding belongs to [11]. This paper has a different application that can be useful when looking for efficiency in legal systems, especially in developing countries. Unlike previous work, we consider sequences of texts explicitly in our modeling, which we have not yet observed in the Law and AI literature.

3 DATA

Our data is composed of two datasets: a dataset of 3·10^6 unlabeled motions and a dataset containing 6449 legal proceedings, each with an individual and variable number of motions, but which have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% is classified as active (class 2), and 7.63% is classified as suspended (class 3). The datasets we use are representative samples from the first and third most significant Brazilian state courts (São Paulo and Rio de Janeiro).

In this work, we split our labeled dataset at random into three parts: a training set (70%) for training models, a validation set (10%) for hyperparameter tuning, and a test set (20%) for final assessment.

4 METHODOLOGY

We used four approaches to extract features from the legal texts and three base classifiers to create our predictive models to classify legal proceedings, i.e., text sequences. A more detailed explanation of our methodology can be found in Polo et al. [8].

4.1 Classifiers

The first classifier we use is a many-to-one long short-term memory network (LSTM). The inputs are given by T vectors representing the T most recent texts in chronological order. The outputs are predicted probabilities for each of the three classes, returned by the Softmax function. The second classifier is a multilayer perceptron neural network (MLP) with one hidden layer and ReLU activation functions. The MLP input is the concatenation of the feature vectors of the last T texts. The third classifier is given by an XGBoost [1] tree ensemble. We feed the last classifier with the same inputs used for the MLP. All classifiers make classifications by choosing the most probable class, and we fix T = 5, using zero-padding vectors when necessary. Details of the hyperparameter tuning phase can be found in Polo et al. [8].

4.2 Feature Extraction

We use four different approaches to extract features from texts: Word2Vec (W2V) [5], Doc2Vec (D2V/PV-DM) [4], TFIDF [9], and a Brazilian Portuguese BERT-Base [10]. All of them are completely unsupervised or self-supervised methods. For all approaches, we had text preprocessing and hyperparameter setting steps, which are detailed in Polo et al. [8]. The Word2Vec, Doc2Vec, and TFIDF representations are fully trained using the mass of 3·10^6 texts/motions from unlabeled proceedings, while the BERT-Base is fine-tuned on the same dataset, making use of the Masked Language Model (MLM) objective. One thing worth mentioning is that we use a method proposed by Mikolov et al. [6] in order to identify phrases (of 2 to 4 words) that should be considered as unique tokens when working with W2V, D2V, and TFIDF.

Given that Word2Vec creates representations for tokens and not for entire texts, we use two different approaches to that end. One of them is used in conjunction with the LSTM classifier, and the other is used in conjunction with the MLP and XGBoost classifiers.
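The fixed-window input shaping described in Section 4.1 (the last T = 5 motions, zero-padded when a proceeding is shorter) can be sketched as follows; the feature dimension of 3 and the helper names are illustrative assumptions, not the authors' code.

```python
# Sketch of the input shaping, assuming each motion has already been mapped
# to a fixed-size feature vector (e.g. by one of the Section 4.2 methods).
# T = 5 as in the paper; the vector size DIM = 3 is illustrative.
T = 5
DIM = 3

def last_t_padded(motion_vectors, t=T, dim=DIM):
    """Keep the t most recent motion vectors (chronological order) and
    left-pad with zero vectors when a proceeding has fewer than t motions.
    This padded sequence is the LSTM input."""
    recent = motion_vectors[-t:]
    padding = [[0.0] * dim] * (t - len(recent))
    return padding + recent

def concat_features(motion_vectors, t=T, dim=DIM):
    """Flat concatenation of the last t (padded) vectors, as fed to the
    MLP and XGBoost classifiers."""
    return [x for vec in last_t_padded(motion_vectors, t, dim) for x in vec]

# A proceeding with only two motions: three zero vectors are prepended.
proceeding = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
assert len(last_t_padded(proceeding)) == 5
assert len(concat_features(proceeding)) == 15
```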
Table 1: Evaluation of classification approaches (scores ± bootstrap std. errors). We combine three basic classifiers (LSTM, MLP,
and XGBoost) and four approaches for extracting features (Word2Vec, Doc2Vec, TFIDF, and BERT). In this extended abstract,
we omit MLP’s results since they are not better than LSTM’s and XGBoost’s.
Pathways to Legal Dynamics in Robotics
Antonino Rotolo (antonino.rotolo@unibo.it), University of Bologna, Bologna, Italy
Luciano H. Tamargo (lt@cs.uns.edu.ar), Universidad Nacional del Sur, Bahia Blanca, Argentina
Diego C. Martínez (dcm@cs.uns.edu.ar), Universidad Nacional del Sur, Bahia Blanca, Argentina
ACM Reference Format:
Antonino Rotolo, Luciano H. Tamargo, and Diego C. Martínez. 2021. Pathways to Legal Dynamics in Robotics. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466146

1 RESEARCH CHALLENGES

Normative concepts can play a crucial role in modelling the behaviour and interaction of artificial agents. Investigations are still relatively underdeveloped in robotics, while interesting ideas come from related fields, such as multi-agent systems (MAS). We outline some research challenges in this domain for which we can use existing models of legal change developed in AI&Law. Three challenges for MAS can be adapted for robotics [1]:

Challenge 1. Explain which of the following choices should be made in robotics: (a) norms must be explicitly represented in robots in a declarative way, or (b) norms must be explicitly represented in the overall system specification.

Option (a) must be preferred if we are to avoid trivialising the notion of norm, which is a risk when we see any specification requirement as a norm that the system has to comply with [1]. In addition, since legal norms change, maintenance would probably be easier. However, (b) is more suitable to address this problem:

Problem 1. How can we check whether a robot complies with the norms applicable to it? How can we design a robot such that it complies with a given set of norms?

Addressing Challenge 1 requires the preliminary clarification of the norm features that we need to embed within robots. In particular, temporal aspects are especially relevant for legal dynamics [3], since legal norms can be qualified by temporal properties, such as: (1) the time when the norm comes into existence and belongs to the legal system, (2) the time when the norm is in force, (3) the time when the norm produces legal effects (it is applicable), and (4) the time when the normative effects hold.

A norm is a kind of system constraint. While hard constraints are restricted to preventive control systems in which violations are impossible (we call this mechanism regimentation), soft constraints are used in detective control systems where violations can be detected (we call this mechanism regulation). This justifies the following challenge:

Challenge 2. Make explicit why your norms are a kind of (soft) constraint that deserves special analysis.

With hard constraints the problem is how to reconfigure robots' behaviour in the presence of norm change. If the norms are represented as soft constraints, then the problem is to check whether the process of monitoring violations is correctly managed. For example, it may be the case that violations are not detected often enough.

Whatever model we adopt for legal norms on robotics, we need a formal model handling norm change:

Challenge 3. Why and how can norms be changed at runtime?

Many legal issues can be raised in regard to robots [2]. Example 1 illustrates how norm change can impact on robotics.

Example 1. The Italian penal code states the following:

    Art. 111 of the Italian Penal Code – Procuring a person for an offence who is not indictable or not punishable. Anyone who has procured a person for a criminal offence who is not indictable or is not punishable on the basis of a personal condition or quality is liable for that offence which was committed by this person, and an increased penalty is applied.

Imagine Mr. Smith induces a robot to threaten Mr. Jones, and the robot is bound to that goal (to threaten Mr. Jones) but is equipped with autonomy in achieving it. Can we apply art. 111 to this case? It should be noted that the provision covers cases where the procured person is not legally capable (she is not in full possession of her faculties), and this makes the offence committed by the procuring person more serious. However, robots, though intelligent, are not indictable, and the principle of legality in criminal law does not allow the provision to be applied by analogy when the crime is committed by a robot.

Suppose that the legislator enacts on the 1st of January 2005 a new version of art. 111 (denoted as 'Art. 111-n'):

    Art. 111 [Amended] of the Italian Penal Code – Procuring a person—who is not indictable or not punishable—or an intelligent machine for an offence. Anyone who has procured a person—who is not indictable or is not punishable on the basis of a personal condition or quality—or an intelligent machine for a criminal offence is liable for that offence which was committed by this person or machine, and an increased penalty is applied.

The AI&Law community has proposed several frameworks for norm change, two of them focused on temporal models: one temporalised rule-based system [3] and one extending belief revision techniques [5]. Both view a legal system as a time-series LS(t_1), LS(t_2), ..., LS(t_j) of its versions, where each version is obtained from previous ones by entering new norms, or by modification or repeal of existing norms: each LS(t_i) is the snapshot of the norms in the legal system at time t_i.
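The time-series view of a legal system can be made concrete with a minimal Python model. Representing norms as plain strings and using only enact/modify/repeal operations are our own simplifying assumptions, far short of the temporal machinery in the frameworks [3, 5].

```python
from dataclasses import dataclass

# Minimal sketch: a legal system as a time-series LS(t1), LS(t2), ...,
# where each version is derived from the previous one. Norms are plain
# strings here; real frameworks attach temporal properties to each norm
# (existence, force, applicability, time of effects).

@dataclass(frozen=True)
class LegalSystemVersion:
    time: int
    norms: frozenset

def enact(ls, t, norm):
    """New version with an additional norm."""
    return LegalSystemVersion(t, ls.norms | {norm})

def repeal(ls, t, norm):
    """New version with a norm removed."""
    return LegalSystemVersion(t, ls.norms - {norm})

def modify(ls, t, old, new):
    """New version where one norm is replaced by an amended one."""
    return LegalSystemVersion(t, (ls.norms - {old}) | {new})

# The 2005 amendment of Example 1 as a modification step.
ls0 = LegalSystemVersion(0, frozenset({"Art. 111"}))
ls1 = modify(ls0, 2005, "Art. 111", "Art. 111-n")

assert "Art. 111-n" in ls1.norms and "Art. 111" not in ls1.norms
```

Each `LegalSystemVersion` is immutable, so the history (ls0, ls1, ...) is exactly the snapshot series LS(t_i) described in the text.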
Labels distribution matters in performance achieved in legal
judgment prediction tasks
Olivier Salaün (salaunol@iro.umontreal.ca), RALI, DIRO, University of Montréal, Montréal, Québec, Canada
Philippe Langlais (felipe@iro.umontreal.ca), RALI, DIRO, University of Montréal, Montréal, Québec, Canada
Karim Benyekhlef (karim.benyekhlef@umontreal.ca), Cyberjustice Laboratory, Faculty of Law, University of Montréal, Montréal, Québec, Canada
CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Natural language processing; Neural networks.

KEYWORDS
legal judgment prediction, multilabel text classification, legal articles

ACM Reference Format:
Olivier Salaün, Philippe Langlais, and Karim Benyekhlef. 2021. Labels distribution matters in performance achieved in legal judgment prediction tasks. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3462757.3466144

1 INTRODUCTION

In recent years, transformer [4] and BERT models [1] have been widely used in plain NLP tasks with the assumption that models first pretrained on massive corpora and then fine-tuned on the dataset of a given task may suffice to achieve significant improvements. At the intersection of machine learning and law, legal judgment prediction (LJP) is a task that aims at predicting the outcome of a lawsuit based on a representation of the case. Such a task is usually formalized in NLP as text classification, with different classes or labels corresponding to the verdicts. One specificity of court rulings is that their decisions are based on the application of legal articles to the facts described by the two parties (applicant and defendant).

In this work, we designed an LJP multilabel classification task based on a corpus of landlord-tenant disputes in French from Quebec, Canada [3, 5], in which a model must predict the verdict labels on the basis of a truncated extract from the decision made by the tribunal. We applied CamemBERT [2], a BERT model pretrained on French material, in order to assess to what extent a pretrained model can handle an LJP task. We also injected article-based input features in the hope that adding knowledge specific to the housing law domain could improve classification performance. Although such an approach yields better results, the labels distribution must be taken into account when analyzing coarse and label-specific scores.

2 PREPROCESSING OF THE DATASET

The decisions of the corpus we used here come from a court of Quebec in Canada that deals with all legal disputes occurring between landlords and tenants. From 2001 to 2018, it issued 667,305 decisions in French, some of which are freely available at the SOQUIJ (Société québécoise d'information juridique) portal¹. Decisions have mean and median lengths of 307 and 235 tokens respectively, while the standard deviation is 371, indicating a high variability in length across the documents.

A first step in preparing the dataset consisted in extracting the text of each decision and then splitting it in two parts thanks to syntax-based heuristics: the pre-verdict text and the verdict. The former contains the description of the dispute and is used as the text input for the text classification task. We also extracted the housing law articles cited in the pre-verdict text, from which we retained the 445 that are cited in Book Five – Obligations of the Civil Code of Quebec, as these are specifically related to property leases and thus more relevant to our task. The verdict text is further processed in order to generate several target labels that cover the diversity of the verdicts decided by the judges. Thanks to regular expressions and the like, plus some expert knowledge of housing law, we pseudo-automatically annotated the verdicts with 23 cumulative labels. Eventually, we excluded all decisions for which no relevant article or verdict label was identified. All in all, the instances of the corpus amount to 544,857 documents with an average of 3.3 labels and 2 cited articles, and are randomly split into training, validation and test sets with a 60-20-20 ratio.

3 MODELS

Within the framework of this multilabel classification, we chose as a baseline a One-Versus-Rest Logistic Regression with the input text represented as character-based TF-IDF vectors spanning 2-grams to 8-grams (the top 100k most frequent n-grams are kept). We also use a CamemBERT model that we fine-tuned to our task for 10 epochs with a batch size of 32 and a learning rate of 10^-5, with the Adam optimizer and binary cross-entropy as the loss function. The maximum sequence length amounts to 128 tokens for all of our models. Moreover, we also propose a model that leverages both CamemBERT and the cited articles by concatenating the BERT output (the vector corresponding to the [CLS] token) with a 445-dimensional one-hot vector that encodes which articles are cited in the decision. Then, the concatenation is sent to two fully connected layers, as shown in Figure 1.

¹ https://soquij.qc.ca/
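The article-augmented input of Section 3 can be sketched as follows. The 768-dimensional [CLS] vector and the 445-dimensional cited-articles vector come from the text; the random [CLS] values and the article indexing are illustrative assumptions.

```python
import random

# Sketch of the article-augmented input: the CamemBERT [CLS] vector is
# concatenated with a binary vector flagging which of the 445 retained
# housing-law articles are cited in the decision.
N_ARTICLES = 445   # number of retained articles (from the text)
CLS_DIM = 768      # CamemBERT-base hidden size

def articles_vector(cited_article_indices, n=N_ARTICLES):
    """Binary vector with a 1.0 at each cited article's index."""
    vec = [0.0] * n
    for idx in cited_article_indices:
        vec[idx] = 1.0
    return vec

def combined_input(cls_vector, cited_article_indices):
    """Concatenation fed to the two fully connected layers (Figure 1)."""
    return cls_vector + articles_vector(cited_article_indices)

# Illustrative [CLS] vector; in the real model it comes from CamemBERT.
cls_vec = [random.random() for _ in range(CLS_DIM)]
x = combined_input(cls_vec, cited_article_indices=[3, 42])
assert len(x) == CLS_DIM + N_ARTICLES  # 1213-dimensional input
```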
4 RESULTS

The metrics chosen for evaluation are exact match (EM) and F1 scores. The former implies that all labels of an instance must be exactly predicted by the model in order for the instance to be considered correctly classified. The latter is an unweighted average of the F1 scores obtained across the 23 labels. As shown in Table 1, the results favour the CamemBERT-based models in this task, especially when the BERT output is combined with articles, as such an approach outperforms the sole BERT architecture by 3.2 points and 2.2 points for F1 macro-average and EM scores respectively. Still, such coarse results must be viewed with precaution as they do not take into account the high imbalance among labels.

                    Logistic regression   CamemBERT   CamemBERT + one-hot
    F1 (macro-avg.)        53.5              58.4            61.6
    Exact match            58.6              63.7            65.9

Table 1: F1 (macro-average) and exact match scores achieved by each model on the test set.

For instance, the top three most frequent labels cover more than half of the corpus, while the sixteen least frequent ones have a support below 5%. Such biases have repercussions on the F1 scores obtained for each label, as shown in Figure 2. The F1 results obtained for labels with a support below 5% are spread out between 0 (none of the three least frequent labels could be correctly predicted) and 90%. Whenever the support of a label is around or above 40%, the corresponding F1 score has a minimum value of 75%. Another observation that can be drawn is that although the addition of a one-hot vector helps in improving BERT scores, such improvement is only significant for certain verdict labels, suggesting that the inclusion of domain-related knowledge is only significant in some cases. For instance, the improvement is very small or non-significant for the three most frequent labels, while it seems more noticeable for those with lower support.

[Figure 2: F1 scores obtained for each label (x-axis is in logarithmic scale) relative to label support.]

5 CONCLUSION

Within the framework of an LJP task formalized as multilabel text classification that uses a corpus in French of landlord-tenant disputes, we extended a CamemBERT model with a one-hot vector of cited articles. This led to better overall results with respect to a sole BERT approach. Still, the results obtained must be considered with caution. Firstly, the higher the support of a label, the more likely it will be accurately predicted. Secondly, including articles in the model does not improve performance results uniformly across all labels.

As future work, we plan on investigating further under what conditions articles allow classification improvements and how such knowledge could be used as a way to make predictions more suitable for interpretability.

ACKNOWLEDGMENTS

We would like to thank the Cyberjustice Laboratory at Université de Montréal, the LexUM Chair on Legal Information and the Autonomy through Cyberjustice Technologies (ACT) project for their support of this research.

REFERENCES
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). 4171–4186. https://aclweb.org/anthology/papers/N/N19/N19-1423/
[2] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a Tasty French Language Model. arXiv preprint arXiv:1911.03894 (2019).
[3] Olivier Salaün, Philippe Langlais, Andrés Lou, Hannes Westermann, and Karim Benyekhlef. 2020. Analysis and Multilabel Classification of Quebec Court Decisions in the Domain of Housing Law. In International Conference on Applications of Natural Language to Information Systems. Springer, 135–143.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[5] Hannes Westermann, Vern R. Walker, Kevin D. Ashley, and Karim Benyekhlef. 2019. Using Factors to Predict and Analyze Landlord-Tenant Decisions to Increase Access to Justice. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 133–142.
A simple mathematical model
for the legal concept of balancing of interests
Frederike Zufall∗ (Max Planck Institute for Research on Collective Goods, Bonn, Germany; zufall@coll.mpg.de)
Rampei Kimura∗ (Waseda Institute for Advanced Study, Tokyo, Japan; rampei@aoni.waseda.jp)
Linyu Peng∗ (Keio University, Yokohama, Japan; l.peng@mech.keio.ac.jp)
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Zufall, Kimura and Peng
denoted as u_1 and u_2 such that

    u_k ∈ [0, 1], k = 1, 2, subject to u_1 + u_2 = 1,

making it sufficient to use a single parameter u ∈ [0, 1] as the balancing outcome.

Data coding. Based on the above criteria, a dataset (150 sets of outcomes u) is hand-coded by a fully qualified lawyer to serve as training data for the models. The data points are based on standards inferred from the relevant case law.

3 THE MATHEMATICAL MODELS
For any given piece of information, the purpose of the models is to determine whether (i_1) privacy of information outweighs (i_2) access to information, or vice versa. To summarize, the parameters are defined as follows:

    α_p ∈ [0, 1]                 status of the person
    α_s ∈ [0, 1]                 sphere of the information
    α_t = t/T ∈ (−∞, 0]          nondimensionalised time
    u_k ∈ [0, 1], subject to u_1 + u_2 = 1    index for (i_k), k = 1, 2

The constraint u_1 + u_2 = 1 allows us to define one single index to fulfill the task. This is the outcome u, which is a function of the legal parameters α_p, α_s and α_t. The final decision, namely whether (i_1) privacy of information or (i_2) access to the information dominates, is made via comparison with a previously given threshold value u_0 ∈ [0, 1]. Without loss of generality, we assume that when u ≤ u_0, (i_1) dominates; otherwise, (i_2) dominates.

3.1 A time-independent mathematical model
For simplicity, we first propose a simple quadratic model for each (rescaled) year α_t respectively, as follows:

    u(α_p, α_s) = c_00 + c_10 α_p + c_01 α_s + c_20 α_p² + c_11 α_p α_s + c_02 α_s²,    (1)

where c_00, c_01, ... are to be determined using the given dataset for each year separately. Note that in the mathematical model, the legal parameters α_p, α_s and α_t are model arguments while c_00, c_10, ... serve as model parameters. We impose the reasonable assumptions

    u(0, 0) = 0 for all α_t,    u(1, 1) = 1 for all α_t,    (2)

leading to c_00 = 0 and, for all α_t,

    c_10 + c_01 + c_20 + c_11 + c_02 = 1.    (3)

The proposed model can be regarded as a linear optimisation problem for which the coded data can be used to determine the above coefficients, i.e. the model parameters. Thus, we fit this function to the coded data using Mathematica; the algorithm is based on the theory of linear least squares. In Table 1, the optimal coefficients (denoted by c*), i.e. the model parameters, are listed for each year.

    α_t (year)    c*_01         c*_10         c*_02          c*_20
     0            0.756269      0.218749      −0.144324      0.181876
    −1            0.655165      0.0286861     −0.088864      0.301803
    −3            0.429315      −0.159663     0.00774652     0.390121
    −6            0.184965      −0.174577     0.15114        0.253607
    −8            0.129208      −0.241208     0.163708       0.30786
    −10           0.0662971     −0.295998     0.185813       0.364145

Table 1: Fitted model parameters for model (1).

This time-independent model (1) does not take α_t as an input argument and only captures each point in time separately.

3.2 A time-dependent mathematical model
In order to model time continuously and not just as intermittent points, we propose the following time-dependent model for the outcome function:

    u(α_p, α_s, α_t) = (c_00 + c_10 α_p + c_01 α_s + c_20 α_p² + c_11 α_p α_s + c_02 α_s²) / (a (log(|α_t| + 1))² + b log(|α_t| + 1) + 1),    (4)

where the model parameters a, b, c_00, c_01, ... are to be determined by using the data. Unlike the time-independent model, this function takes the legal parameter α_t as an argument and reduces to the time-independent model (1) at a given time. We impose the following assumptions:

    u(α_p, α_s, −∞) = 0 for all α_p, α_s,
    u(0, 0, α_t) = 0 for all α_t,    u(1, 1, 0) = 1,    (5)

yielding c_00 = 0 and

    c_10 + c_01 + c_20 + c_11 + c_02 = 1.    (6)

Again, we use Mathematica to derive optimal values of the model parameters using the method of least squares:

    a* = 0.165792,    b* = −0.212271,    c*_01 = 0.529979,
    c*_10 = −0.0110422,    c*_02 = −0.0559473,    c*_11 = 0.295508.    (7)

As an illustration, Figure 1 shows the surface of the fitted time-dependent outcome function u in comparison to our data points at year −3.

Figure 1: Year −3 for the time-dependent model (4).

3.3 Evaluation
Chi-square test. To evaluate the fitted function for our time-dependent model in comparison to the whole dataset, we use the chi-square test, where N is the number of data points in the dataset; here N = 150. This gives the reduced chi-square

    χ² := Σ_{i=1}^{N} (u_data − u)² / u    and    χ²/N = 0.0343305.    (8)

This implies that the fitted function can describe the original dataset with sufficient accuracy.

Cross-validation. In order to evaluate the time-dependent model in terms of predictability, we use leave-one-out cross-validation and calculate the mean absolute error:

    MAE = 0.0728038.    (9)
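The fitting pipeline of Sections 3.1 and 3.3 (linear least squares under constraint (3), then an error measure) can be sketched as follows. The synthetic outcomes below stand in for the hand-coded dataset, which is not reproduced here, and the original fit was done in Mathematica, so this NumPy version only illustrates the method; the elimination of c_11 to enforce constraint (3) is our way of reducing the constrained fit to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the 150 hand-coded outcomes: sample legal
# parameters and noisy outcomes u in [0, 1].
alpha_p = rng.uniform(0.0, 1.0, 150)
alpha_s = rng.uniform(0.0, 1.0, 150)
u_data = np.clip(0.5 * alpha_p + 0.5 * alpha_s + rng.normal(0.0, 0.02, 150), 0.0, 1.0)

# Model (1) with c00 = 0; constraint (3) is imposed by eliminating
# c11 = 1 - (c10 + c01 + c20 + c02), which leaves an ordinary linear
# least-squares problem in the remaining four coefficients:
#   u - ap*as = c10*(ap - ap*as) + c01*(as - ap*as)
#             + c20*(ap^2 - ap*as) + c02*(as^2 - ap*as)
cross = alpha_p * alpha_s
A = np.column_stack([alpha_p - cross, alpha_s - cross,
                     alpha_p**2 - cross, alpha_s**2 - cross])
(c10, c01, c20, c02), *_ = np.linalg.lstsq(A, u_data - cross, rcond=None)
c11 = 1.0 - (c10 + c01 + c20 + c02)

u_fit = c10*alpha_p + c01*alpha_s + c20*alpha_p**2 + c11*cross + c02*alpha_s**2
mae = np.mean(np.abs(u_data - u_fit))         # error measure in the spirit of Eq. (9)
print(round(c10 + c01 + c20 + c11 + c02, 6))  # 1.0 by construction
```

The reduced chi-square of Eq. (8) could be computed the same way, but note that dividing by u makes the statistic sensitive wherever the fitted outcome is near zero.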
Part IV
Demonstrations
Interactive System for Arranging Issues
based on PROLEG in Civil Litigation
Ken Satoh (National Institute of Informatics, Chiyoda, Tokyo, Japan; ksatoh@nii.ac.jp)
Kazuko Takahashi (Kwansei Gakuin University, Sanda, Hyogo, Japan; ktaka@kwansei.ac.jp)
Tatsuki Kawasaki (educe Co., Ltd., Chiyoda, Tokyo, Japan; sktk40829@gmail.com)
1 INTRODUCTION
In Japan, we have the procedure of "arranging issues" in civil litigation, where we clarify which facts are in dispute and what kind of evidence action should be taken for these issues. Currently, IT technology is used only for online meetings for arranging issues, and a more sophisticated method is expected through full use of IT/AI technology.

We proposed a method for formalizing the Japanese presupposed Ultimate Fact theory (JUF theory for short, Youken-jijisturon in Japanese), converting it into logic programming, and developed a system called PROLEG (PROlog-based LEGal reasoning support system) [1]. JUF formalises which party should give certain facts to obtain a desired legal effect for that party; in other words, it formalizes which party has the burden of proof for these facts. Then, given these facts, PROLEG simulates reasoning by a judge to reach a final conclusion and presents this process in a directed tree structure called a "block diagram."

In this work, we modify the PROLEG system to support arranging issues in civil litigation. The PROLEG system assumes that all the facts are given before the simulation of the judge's reasoning, so there is no interaction between the plaintiff and the defendant during the simulation. By contrast, given a desired effect requested by one party, our interactive system (which we call int-PROLEG) automatically calculates possible justifications for the desired effect based on the JUF theory stored in the system. After the party chooses a justification, int-PROLEG asks the party about the existence of the facts necessary to satisfy the justification. Then, int-PROLEG asks whether the other party agrees on the alleged facts, and also calculates possible counter-arguments against the chosen justification and provides them to the other party. We iterate this process until no further (counter-)arguments are presented. When this process is finished, the disputed facts are the issues for which a judge decides the truth value.

Most interactive argument systems are mainly for constructing arguments manually by a user or for evaluating arguments constructed by a user. A notable exception is the Carneades system, which has a function of argument invention using argumentation schemes [2]. By contrast, in our system, a user does not construct arguments from scratch but chooses a pattern of legal arguments provided by int-PROLEG based on the JUF theory and specifies concrete facts to make concrete legal arguments for a specific case in a civil litigation. Moreover, from our experience working with lawyers, we noticed that if we introduced a system with sophisticated but complex reasoning mechanisms, they would be very reluctant to use it. So, the purpose of this work is to identify the simplest function for arranging issues that would be easily understood by lawyers. In a sense, we extract a useful part of Carneades for the arrangement of issues in civil litigation.

2 PROLEG
We first review the PROLEG system [1]. A program of PROLEG consists of a rulebase and a factbase. A rulebase consists of a set of general rules of the form

    H ⇐ B_1, ..., B_n.

where H (called the head or conclusion) and B_1, ..., B_n (called the body) are first-order atoms, and a set of exception rules of the form

    exception(H, E)

where H and E are heads of some general rules. We call E an exception. A factbase consists of a set of expressions fact(P) where P is an atom that is never the head of any general rule. We call P a fact predicate.

A rule R represents a general default rule meaning that if all B_i in R are proved then in general H is true, unless there is an exception rule exception(H, E) such that E is proved.

Given a PROLEG program, we can construct a proof tree of a given goal, which is the root of the tree; the child nodes are the conditions of general rules for the conclusion and the exceptions of the conclusion.

3 EXTENSION TO ARRANGE ISSUES
In this section, we show how to modify the PROLEG system into int-PROLEG.

3.1 Indexing a level for PROLEG literals
(1) First, we define the dependency on the atomic formulas that appear in the general rules. Among the conclusions of the general rules, a conclusion that does not appear in the body of any general rule and is not an exception of any exception rule is called a 0-level conclusion. Then, when making a top-down proof tree from the 0-level conclusion using only the general rules, we end up at the fact predicates. We call the fact predicates finally visited 0-level facts, and the 0-level conclusion and the intermediate visited atomic formulas are called 0-level atomic formulas.
(2) Suppose that the i-level atomic formulas and the i-level facts are decided. For exception rules that conclude with the i-level atomic formula, the collection of the atomic formulas

ICAIL'21, June 21–25, 2021, São Paulo, Brazil. © 2021 Copyright held by the owner/author(s).
1 http://research.nii.ac.jp/~ksatoh/PROLEGdemo/IssueArrangmentDemo.mp4
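The rule semantics of Section 2 (a conclusion holds when some general rule proves it and no exception to it is proved) can be sketched in a few lines. This propositional toy is not the actual system: real PROLEG handles first-order atoms, and the predicate names below are invented, legally inspired examples.

```python
# Minimal sketch of PROLEG-style default evaluation (illustrative only):
# a goal is proved if it is a fact, or if some general rule's body is
# fully proved and no exception to the goal is proved.
general_rules = {            # head -> list of rule bodies
    "contract_effective": [["manifestation_of_offer", "acceptance"]],
    "voidable": [["minor"]],
}
exceptions = {               # exception(H, E): a proved E defeats H
    "contract_effective": ["voidable"],
}
facts = {"manifestation_of_offer", "acceptance", "minor"}

def proved(goal):
    if goal in facts:                        # fact predicate
        return True
    for body in general_rules.get(goal, []):
        if all(proved(b) for b in body):     # every body atom proved
            if not any(proved(e) for e in exceptions.get(goal, [])):
                return True                  # no exception defeats the rule
    return False

print(proved("contract_effective"))  # False: the 'voidable' exception is proved
```

Here the general rule for contract_effective succeeds, but the exception voidable is itself proved (via the fact minor), so the conclusion is defeated, matching the default-with-exceptions reading of PROLEG rules.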
Live Demonstration Of A Working Collaborative
eNegotiation System (Smartsettle Infinity)
Ernest Thiessen (Founder & President, Smartsettle Resolutions Inc., Vancouver, BC, Canada; ernest.thiessen@smartsettle.com)
Graham Ross (Head of International Marketing, Smartsettle Resolutions Inc., Vancouver, BC, Canada; Board Member, International Council for Online Dispute Resolution)
KEYWORDS
negotiation, multivariate visual blind bidding, eNegotiation, collaborative, conflict resolution, alternative dispute resolution, ADR, online dispute resolution, ODR, algorithms

Figure 1: Smartsettle Multivariate Visual Blind Bidding
(9) If no mutually accepted packages exist, the Automatic Deal-Closer can be invoked to avoid an impasse due to a small gap.
(10) Having reached a Baseline quickly, parties have energy left for Smartsettle's signature algorithm, Maximize the Minimum Gain, to uncover any remaining hidden value and generate an Improvement that distributes the additional value fairly to all parties. This algorithm is foundational to all the others and is endorsed³ by experts in the field.
(11) Fairness Enhancing Normalization distributes additional benefits fairly among all the parties, and this brings them to an optimal solution on the Efficiency Frontier (green star). Smartsettle allows parties to represent their preferences with any scale that is convenient to them. But under the covers, Smartsettle employs a proprietary method for normalization that effectively neutralizes the efforts of any party to inflate the benefits of optimization for themselves.
(12) In the background is the Expert Neutral Deal-Closer, which the parties may fall back on in case of a large gap; just the existence of this remedy results in it rarely being used.

The process described above works for almost any formal negotiation between any number of parties and can be depicted graphically by adding more dimensions to Figure 1. Maximize the Minimum Gain as described by US ICANS Patent US 5495412A has been slightly modified to deal better with multi-party negotiations.

In high-value cases, Infinity's greatest asset is by far its ability to uncover both tangible and intangible hidden value, which improves the outcome for all parties even after a settlement package has been agreed and thereby further enhances the inter-party relationships. The greatest benefits are achieved when well-trained parties use the system collaboratively. Rather than usurping control from the user or the work of the mediator or negotiation advisor, the process benefits from human input that best understands and assesses the preferences and interests of all parties.

Table 1 summarizes how Smartsettle intelligently rewards good negotiating behaviour. Whether this is artificial intelligence (AI) or intelligence augmented (IA) we leave for the reader to decide.

Acceptance of a fair outcome is the first prerequisite for achieving a result that benefits all parties. Smartsettle enables this behaviour in a process where parties can place secret bids on packages. When a Zone of Agreement occurs, the party who made the smallest last move is rewarded with a bigger portion of the overlap. An agreement is ensured if parties agree to the Expert Neutral Deal-Closer in the Final Session, and in fact it is more likely to happen without the need for outside intervention. These first three behaviours all contribute to quickly achieving a fair outcome and are applicable to all negotiations, whether simple or complex.

In more complex multivariate cases, the importance of coming to an early agreement is even greater. In addition to time savings, negotiators also have the opportunity of uncovering hidden value with the fourth behaviour of secure honesty and truthfulness. Thiessen's research [4] and live case work have demonstrated that the magnitude of value left behind in ordinary negotiations can be around 16%. Shell's research [2] concluded that negotiators who are subjected to a tedious negotiation dance become exhausted and have little energy left to go "beyond win-win"® in a search for hidden value. The Smartsettle Visual Blind Bidding process not only conserves the energy of negotiators but makes it very easy to uncover hidden value.

Table 1: Rewards for Good Negotiating Behaviour

    Objective     Behaviour                                  Reward
    Fairness      Acceptance of a fair outcome               A timely win-win outcome
                  Early movement to Zone of Agreement        Bigger portion of the overlap
                  Agreement to Expert Neutral Deal-Closer    Guaranteed agreement
    Efficiency    Secure honesty and truthfulness            Uncovered hidden value
    Peace         Collaboration                              Improved relationships

In traditional negotiations, parties will often tend to hide or even misrepresent their true preferences. However, with Smartsettle Infinity, the temptation to misrepresent preferences is eliminated. Skilled facilitators help parties understand that it is actually counterproductive to use any sort of deception as a negotiating strategy and that truthfulness is rewarded. All of these good behaviours together represent the fifth behaviour, collaboration, and result in improved relationships.

³ Harvard Professor Emeritus Howard Raiffa published [1] a preference for Maximize the Minimum Gain (MMG) over the Nobel Laureate's algorithm Maximize the Utility Product (MUP). Raiffa said that MMG was more intuitive and (in his opinion) produced better outcomes in certain hypothetical illustrations. He did admit, however, that the difference between these two algorithms would be insignificant in most real-world applications.

REFERENCES
[1] Howard Raiffa. 1996. Lectures on Negotiation Analysis. Program on Negotiation at Harvard Law School.
[2] G. Richard Shell. 1999. Bargaining for Advantage: Negotiation Strategies for Reasonable People. Penguin.
[3] Ernest Thiessen, Peter Holt, Graham Ross, and Diana Wallis. 2017. Brexit 2.0 Negotiation Simulation with Smartsettle Infinity. International Journal of Online Dispute Resolution 2, 4 (2017).
[4] Ernest M. Thiessen and D. Pete Loucks. 1992. Computer-Assisted Negotiation of Multi-objective Water Resources Conflicts. Water Resources Bulletin, American Water Resources Association 28, 1 (February 1992), 163–177.
[5] Ernest M. Thiessen and Graham L. Ross. 2021. Using AI & IA to Reward Good Negotiating Behaviour. (on smartsettle.com).
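The maximin idea behind Maximize the Minimum Gain can be illustrated with a toy selection rule: among candidate improvement packages, pick the one that maximizes the smallest gain any party receives over the baseline agreement. This is not the patented Smartsettle algorithm (whose details are not given here), and the party names and satisfaction scores are invented.

```python
# Toy maximin selection over candidate improvement packages.
baseline = {"A": 60.0, "B": 55.0}   # hypothetical satisfaction scores per party
candidates = [
    {"A": 70.0, "B": 56.0},         # big gain for A, almost none for B
    {"A": 64.0, "B": 63.0},
    {"A": 66.0, "B": 61.0},         # balanced gains for both parties
]

def min_gain(package):
    """Smallest per-party gain of a package relative to the baseline."""
    return min(package[p] - baseline[p] for p in baseline)

best = max(candidates, key=min_gain)
print(best, min_gain(best))  # {'A': 66.0, 'B': 61.0} 6.0
```

The lopsided first package is rejected even though its total gain is largest, which captures why a maximin rule distributes additional value fairly rather than maximizing any one party's benefit.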
Part V
COLIEE Papers
BERT-based Ensemble Methods with Data Augmentation for
Legal Textual Entailment in COLIEE Statute Law Task
Masaharu Yoshioka (Faculty of Information Science and Technology / Graduate School of Information Science and Technology, Hokkaido University, Sapporo-shi, Hokkaido, Japan; yoshioka@ist.hokudai.ac.jp)
Yasuhiro Aoki (Graduate School of Information Science and Technology, Hokkaido University, Sapporo-shi, Hokkaido, Japan; yasu-a_01@eis.hokudai.ac.jp)
Youta Suzuki (Graduate School of Information Science and Technology, Hokkaido University, Sapporo-shi, Hokkaido, Japan; suzuki@eis.hokudai.ac.jp)
ABSTRACT
The Competition on Legal Information Extraction/Entailment (COLIEE) statute law legal textual entailment task (task 4) is a task in which a system must judge whether a given question statement is true or not based on the provided articles. In the last COLIEE 2020, the best performing system used bidirectional encoder representations from transformers (BERT), a deep-learning-based natural language processing tool that handles word semantics by considering their context. However, there are problems related to the small amount of training data and the variability of the questions. In this paper, we propose a BERT-based ensemble method with data augmentation to solve these problems. For the data augmentation, we propose a systematic method to make training data for understanding the syntactic structure of the questions and articles for entailment. In addition, due to the non-deterministic characteristics of BERT fine-tuning and the variability of the questions, we propose a method to construct multiple BERT fine-tuned models and select an appropriate set of models for the ensemble. The accuracy of our proposed method for task 4 was 0.7037, which was the best performance among all submissions.

CCS CONCEPTS
• Computing methodologies → Information extraction; Ensemble methods.

KEYWORDS
Textual entailment, Data augmentation, BERT, Ensemble method

ACM Reference Format:
Masaharu Yoshioka, Yasuhiro Aoki, and Youta Suzuki. 2021. BERT-based Ensemble Methods with Data Augmentation for Legal Textual Entailment in COLIEE Statute Law Task. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL'21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3462757.3466105

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-8526-8/21/06. https://doi.org/10.1145/3462757.3466105

1 INTRODUCTION
The Competition on Legal Information Extraction/Entailment (COLIEE) [3, 4, 10, 11, 15] serves as a forum to discuss issues related to legal information retrieval (IR) and entailment. There are two types of tasks in COLIEE. One is a task using case law (tasks 1 and 2), and the other is a task using Japanese statute law with Japanese bar exam questions (tasks 3, 4, and 5). Task 3 is an IR task that aims to retrieve (a) relevant law article(s) to judge whether the statement of the question is true, task 4 is an entailment task that judges whether a given relevant article entails a given question statement, and task 5 is a combination of tasks 3 and 4.

Because a part of the bar exam questions are based on real use cases, it is important to have a mechanism for semantic matching to discuss the relevance of words in the articles and those in the questions for entailment. At an earlier stage, machine-readable thesauruses such as WordNet [7] and distributed representations of words such as Word2Vec [6] were used. Recently, the deep-learning-based natural language processing tool bidirectional encoder representations from transformers (BERT) [1] was introduced. One of the characteristics of BERT is that it provides a general semantic analysis system that can be fine-tuned for a particular task. For the last COLIEE [10], the best performing systems for tasks 3 [12] and 4 [9] used BERT as a core component of the system.

In this paper, we propose a method to use BERT-based ensemble methods for task 4. This method utilizes BERT with data augmentation that increases training examples by making article-and-question pairs systematically using sentences in the statute law articles. We also propose a system that ensembles the results from multiple BERT-based system outputs. The accuracy of the system for task 4 was 0.7037, which was the best performance among all the submitted runs for task 4 at COLIEE 2021.

2 RELATED WORKS
Because bar exam questions include questions about real use cases of articles, it is necessary to discuss the correspondence between the concepts used in the articles and the real use cases. In the early stage of COLIEE, several attempts were made to utilize resources for such semantic matching, such as a machine-readable thesaurus and data for the distributed representation of terms. For example, Mi-Young et al. [5] used Word2Vec [6] as a resource for distributed representation, and Taniguchi et al. [14] proposed a method to utilize WordNet [7] as a machine-readable thesaurus. However, because those methods cannot handle the context to
estimate the meaning of such terms, they are not as effective for utilizing such resources.

Recently, Devlin et al. [1] proposed BERT, a deep-learning-based natural language processing tool pretrained for solving general tasks that require semantic information, using large corpora (such as the whole contents of Wikipedia). Based on this training process, BERT can handle the meaning (distributed representation) of words in a sentence by considering the context. In addition, BERT can be used for various tasks by employing a fine-tuning process that utilizes a comparatively small amount of training data. Because a pretrained BERT model contains rich information about the semantics of words, the fine-tuned models may be able to handle semantic information even when the words themselves are not included in the training data.

At COLIEE 2020, a BERT-based system achieved the best performance for the legal textual entailment task (JNLP [9]). In that paper, the authors proposed a lawfulness classification approach that classified the appropriateness of legal statements by using many legal sentences, including the bar exam questions provided by the organizers, without considering the given relevant articles. This approach worked well for COLIEE 2020 because of the large amount of training data. In addition, they also pointed out that it was difficult to select an appropriate model using validation data for the unseen questions because of the significant variability of the questions.

To increase the size of the training data, the data augmentation approach is widely used in the field of image recognition [13]. However, few studies related to data augmentation methods have been conducted for the legal textual entailment task. Min et al. [8] proposed a syntactic data augmentation method to increase the robustness of natural language inference. They proposed a systematic method to create positive and negative data from a correct inference sentence by syntactic operations such as passivization and the inversion of subject and object. Evans et al. [2] proposed a method of data augmentation for logical entailment. In this framework, their method increased negative and positive data by modifying logical inference rules using symbolic vocabulary permutation, which includes an operation to make implication rules that share the same contents for the condition and derived parts. Those approaches are useful for designing data augmentation methods for legal textual entailment.

3 BERT-BASED ENSEMBLE LEGAL TEXTUAL ENTAILMENT SYSTEM
Based on a discussion of the previous best performing system (JNLP [9]), we propose a system with the following characteristics.

(1) Textual entailment approach with data augmentation. We assume that the reason why the lawfulness classification approach outperformed the textual entailment one in the last COLIEE is the size of the training data. Therefore, when we provide larger training data by data augmentation, the textual entailment approach may outperform the lawfulness classification approach because it uses the most important information (the relevant articles).
(2) Ensemble results of multiple BERT-based model outputs. As discussed, it is difficult to select appropriate models for the task by only evaluating on the validation data. From our preliminary experiment (the details are discussed in Section 3.3), we confirmed that the characteristics of the fine-tuned BERT-based models are different and that the accuracy on the validation data is not directly related to that on the test data. We assume that this result reflects the different characteristics of each model and that an appropriate selection of the generated models for the ensemble may improve the performance on unseen questions.

3.1 Data augmentation using articles
In the deep learning framework, it is common to enlarge training data by modifying the existing data (data augmentation). However, it is important to define an appropriate data augmentation method to obtain better results. For the legal textual entailment task, data augmentation methods have been used for natural language and logical inference, as introduced in Section 2. However, it is difficult to apply these methods to this legal textual entailment data.

In this research, we assume that there are two types of errors in judging whether an article entails a given question. One is semantic mismatch, and the other is logical mismatch (the appropriateness of the judicial decision).

For example, let us discuss examples of training data using the following article (a part of article 9): "A juridical act performed by an adult ward is voidable."

(1) "A juridical act performed by an adult is voidable."
The article does not entail this question because of a semantic mismatch ("adult" is not "adult ward").
(2) "A juridical act performed by an adult ward is not voidable."
The article does not entail this question because of the inappropriateness of the judicial decision ("voidable" versus "not voidable").
(3) "A juridical act performed by an adult is not voidable."
We cannot judge whether this question is true (it may require another article). However, the given article cannot entail the question.

For the semantic mismatch case (1), it is difficult to select appropriate pairs ("adult" and "adult ward") for replacement to make such a semantic mismatch sentence. For case (3), which involves both types of mismatch, it is also difficult to make the data and to use these data as negative examples for identifying the type of error in judging the entailment results.

By contrast, if we make pairs of correct answers with logical mismatch cases such as (2), the examples may help to explain the importance of comparing the judicial decision of the article with that of the question.

Based on this assumption, we create training data that characterize the logical mismatch between the articles and questions. The procedures to make this augmented data are as follows.

(1) Extraction of (a) judicial decision part(s) from the articles. If there are multiple decisions in an article, the sentences in the article are split into smaller sentences that contain one judicial decision (Figure 1). When a split sentence explains an exceptional case, a flipped judicial decision is complemented for the split sentence (underlined part of the split sentences). When the sentence does not contain any
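The logical-mismatch augmentation built from case (2) can be sketched as follows: for each judicial-decision sentence, pair the article with itself as a positive example and with its negated decision as a negative example. The string-level negation rule and the English example below are ours; the actual system operates on Japanese statute sentences with a more careful treatment of exceptional cases.

```python
# Toy sketch of logical-mismatch data augmentation (illustrative only).
def negate(decision_sentence):
    """Flip the judicial decision by toggling 'is' / 'is not' (toy rule)."""
    if " is not " in decision_sentence:
        return decision_sentence.replace(" is not ", " is ")
    return decision_sentence.replace(" is ", " is not ")

def augment(article_sentences):
    """Build (article, question, label) training pairs from decision sentences."""
    pairs = []
    for s in article_sentences:
        pairs.append((s, s, "positive"))          # decision matches the article
        pairs.append((s, negate(s), "negative"))  # flipped judicial decision
    return pairs

article = ["A juridical act performed by an adult ward is voidable."]
for art, question, label in augment(article):
    print(label, "|", question)
```

Running this prints the original sentence as a positive pair and "A juridical act performed by an adult ward is not voidable." as a negative pair, mirroring example (2) above.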
Table 1: Evaluation results of the 10 models Table 2: Evaluation results of the ensemble models
epoch and stop-training process when the validation loss increases. Submission ID Model used
We use a model with minimal validation loss. HUKB-1 (1, 2, 3, 4, 5, 6, 7)
Fine-tuned models accept a pair of question statements and (an) HUKB-2 (1, 2, 4)
article(s) as input and return whether the article(s) entail(s) the state- HUKB-3 (1, 2, 4, 7 , 8)
ment (positive) or not (negative), which is decided by comparing
the score for the probability of being positive or negative.
To discuss the appropriate settings for making the ensemble
3.3 Preliminary experiment model, we calculated the performance of the ensemble model using
these 10 models. For making the ensemble model, we used the
To evaluate the performance of the proposed BERT-based entail-
average probability of positive and negative from the target models.
ment system, we conducted a preliminary experiment using the R01
Table 6 shows the evaluation results of the ensemble models by
data (1 year of data with 111 questions) for evaluation. To discuss
accuracy. In this table, all combinations of the ensemble models are
the effect of the variability of the training and validation questions
used (selecting three to 10 models from those introduced in Table
set, we made 10 models using the same procedures. Because of
1).
the non-deterministic characteristics of the BERT fine-tuning pro-
There were large differences between the accuracies of the en-
cess and different training sets selected randomly, we expected
semble cases. The best accuracy system ensembled seven models,
the system to construct different models that use different features
and the worst used all models. Most of the cases that used three
for analyzing the texts. Table 1 shows the evaluation results for
to five models were adequate to estimate the results with better
the validation and test data. As shown in the table, the validation
accuracy.
accuracy and loss were not closely related to the test accuracy. We
All of the highest rank sets contained the best performance
assume that these results reflect the variability of the question set.
system model 2. In addition, they also used model 1, even though
We also made another 10 models without using augmented data.
the accuracy of model 1 was the lowest among these 10 models.
The average accuracy of these 10 models was 0.5108 (best: 0.5946,
This suggests that it is important to use a complementary set of
worst: 0.4505). From this comparison, we confirmed that data aug-
models that have different characteristics to improve the overall
mentation is effective to improve the performance of the BERT
performance of the ensemble models.
training process.
In this inference process, we can estimate the confidence of the
BERT model output by comparing the probability of positive or
3.4 Submitted results
negative. When the probability of positive is almost equal to 1(0), the Based on the results of the preliminary experiments, we submitted
system outputs positive (negative) results with higher confidence. the following three results that used different model sets for the
By contrast, when the probability is close to 0.5, the system output ensemble.
can be interpreted as less confidence. HUKB-1 and HUKB-2 were the best and second-best perfor-
When we checked the distribution of such confidence for each mance systems using R1 data as a kind of validation data. HUKB-3
question, the tendency of such confidence was not consistent among selected the five best models using validation loss information.
these models. From this observation, we assumed that these models Table 4 shows the final evaluation results of all submission runs,
may use different features to estimate the entailment results and among which, HUKB-2 achieved the highest accuracy.
that their characteristics may differ, even though we used the same
model architecture for training. In such a case, there is a probability 3.5 Discussion
to increase the performance of the overall system by ensembling To understand the effect of the ensemble method, we compared
the results of different models. the performance of the ensemble results with one of each model.
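The probability-averaging ensemble and the majority-voting baseline it is compared with can be sketched in a few lines; the probability pairs below are illustrative, not outputs of the models in Table 1:

```python
def ensemble_average(model_probs):
    """Average per-model (positive, negative) softmax outputs and
    return the label with the higher mean probability.

    model_probs: list of (p_positive, p_negative) pairs, one per model.
    """
    n = len(model_probs)
    avg_pos = sum(p for p, _ in model_probs) / n
    avg_neg = sum(q for _, q in model_probs) / n
    return "positive" if avg_pos >= avg_neg else "negative"

def majority_vote(model_probs):
    """Baseline: each model casts one vote based on its own argmax."""
    votes = sum(1 if p >= q else 0 for p, q in model_probs)
    return "positive" if votes * 2 >= len(model_probs) else "negative"

# Averaging can overturn a narrow majority when the minority is confident:
probs = [(0.55, 0.45), (0.52, 0.48), (0.05, 0.95)]
print(majority_vote(probs))      # positive: two weak "positive" votes win
print(ensemble_average(probs))   # negative: the confident model dominates
```

This illustrates why averaging can differ from majority voting: a single model with a probability near 0 or 1 carries more weight than its one vote would.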
BERT-based Ensemble Methods with Data Augmentation for Legal Textual Entailment in COLIEE Statute Law Task ICAIL’21, June 21–25, 2021, São Paulo, Brazil
Table 4: Final evaluation results

Submission ID                  Correct (of 81)   Accuracy
BaseLine                       43                0.5309
HUKB-2                         57                0.7037
HUKB-1                         55                0.6790
HUKB-3                         55                0.6790
UA_parser                      54                0.6667
JNLP.Enss5C15050               51                0.6296
JNLP.Enss5C15050SilverE2E10    51                0.6296
JNLP.EnssBest                  51                0.6296
OVGU_run3                      48                0.5926
TR-Ensemble                    48                0.5926
TR-MTE                         48                0.5926
OVGU_run2                      45                0.5556
KIS1                           44                0.5432
KIS3                           44                0.5432
UA_1st                         44                0.5432
KIS2                           43                0.5309
UA_dl                          43                0.5309
TR_Electra                     41                0.5062
OVGU_run1                      36                0.4444

Table 6: Number of questions classified by the ensemble results

                 Agree              Majority           Other
Submission ID    Correct   Wrong    Correct   Wrong    Correct   Wrong
HUKB-1           17        4        27        14       11        8
HUKB-2           28        9        23        11       6         4
HUKB-3           19        9        17        12       19        9

Table 7: Topic difficulty analysis based on the number of correct runs

No. of correct runs    No. of questions    No. of correct answers by HUKB-2
1–3                    7                   0
4–6                    11                  1
7–9                    12                  9
10–12                  19                  15
13–15                  19                  19
16–18                  13                  13
Table 5: Evaluation results of the 10 models for the test data

Model No.    Accuracy
1            0.6790
2            0.6666
3            0.5185
4            0.5555
5            0.6666
6            0.5308
7            0.6790
8            0.5925
9            0.5555
10           0.5308

Table 5 shows the evaluation results of the 10 models. This year, the basic model performed well and the best-performance systems were almost equivalent to the ensemble ones. However, the appropriate selection of the models (HUKB-2) made the ensemble results better than those of each individual model. These results justify the appropriateness of using the ensemble method with an appropriate ensemble set selected using validation data.

Table 6 shows the number of questions classified by agreement level among the models used. "Agree," "Majority," and "Other" represent "all models return the same results," "final results are the same as majority voting," and others, respectively. From these results, we can confirm that the average-calculation ensemble method is better than majority voting, because the number of correct questions for "Other" is larger than the number of wrong ones. For the "Agree" questions, the best-performance system (HUKB-2) had the largest numbers because of the small number of models used (three), but the accuracy of HUKB-1 (using seven models) for "Agree" was better than that of HUKB-2. However, the accuracy of HUKB-3 was lower than that of HUKB-2, which suggests that selecting an appropriate set of models for the ensemble is also effective for maintaining the accuracy of the "Agree" questions.

Second, we analyze the characteristics of our system based on the difficulty estimated by the number of runs that return the correct answer provided by the organizers. Table 7 shows the number of questions corresponding to the number of correct runs from the 18 submitted runs (Table 4). Questions with a smaller number of correct runs may be commonly difficult problems among all submitted methods.

From this table, we confirm that our method answers the easy questions consistently. These characteristics may come from our ensemble method reducing the effect of the variability of the training data sets.

By contrast, our system performs poorly on difficult questions, suggesting common problems that nearly all submitted systems cannot handle at this moment.

We would like to discuss the characteristics of such difficult questions using examples.

The following question (Figure 3) is a difficult question that only one run can answer correctly. Because the main terms appear in both the question and the first sentence of the article, the systems tend to say positive (entail) for this question. However, it also matches the last sentence, which explains an exceptional case of the article. As a result, the given article does not entail the question.

Because our data augmentation method splits the sentences and only handles flipped negative cases, as introduced in Section 3.2, our system cannot answer this question correctly either. However, because several articles have such exceptional cases, it may be better to propose a data augmentation method that handles such articles.

The following failure example (Figure 4; one run can answer correctly) is also related to a logical expression (quantifier). The article says "together with the obligee" (more than two), but the
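The "Agree"/"Majority"/"Other" classification used in the agreement-level analysis can be reproduced from per-model labels with a small helper; this is a sketch of the definitions quoted above, with hypothetical label lists:

```python
def agreement_level(labels, ensemble_label):
    """Classify one question by how the individual model outputs relate to
    the averaged-ensemble result.

    labels: per-model answers, e.g. ["Y", "N"]
    ensemble_label: the answer chosen by probability averaging
    """
    if len(set(labels)) == 1:
        return "Agree"          # all models return the same result
    majority = max(set(labels), key=labels.count)
    if labels.count(majority) * 2 > len(labels) and majority == ensemble_label:
        return "Majority"       # final result equals the majority vote
    return "Other"              # probability averaging overruled the majority

print(agreement_level(["Y", "Y", "Y"], "Y"))  # Agree
print(agreement_level(["Y", "Y", "N"], "Y"))  # Majority
print(agreement_level(["Y", "Y", "N"], "N"))  # Other
```

The "Other" case is exactly where averaging and majority voting disagree, which is why a surplus of correct over wrong answers in that column favors the averaging method.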
ICAIL'21, June 21–25, 2021, São Paulo, Brazil    M. Yoshioka, Y. Aoki and Y. Suzuki
Legal Norm Retrieval with Variations of the BERT Model Combined with TF-IDF Vectorization

Sabine Wehnert
sabine.wehnert@gei.de
Georg Eckert Institute - Leibniz Institute for International Textbook Research, Germany
Otto von Guericke University, Magdeburg, Germany

Viju Sudhi, Shipra Dureja, Libin Kutty, Saijal Shahania
<firstname>.<lastname>@st.ovgu.de
Otto von Guericke University, Magdeburg, Germany

Ernesto W. De Luca
deluca@gei.de
Georg Eckert Institute - Leibniz Institute for International Textbook Research, Germany
Otto von Guericke University, Magdeburg, Germany
ICAIL ’21, June 21–25, 2021, online    Wehnert et al.
• We perform dataset manipulations to train a BERT classifier for retrieval and combine it with TF-IDF vectorization
• We test similarity scores obtained from the BERTScore

The remainder of this work is organized in the following way: In Section 2, we describe approaches using TF-IDF and BERT models and their various uses in past editions of the COLIEE competition. Section 3 contains conceptual descriptions for each of our three runs: Sentence-BERT Embedding with TF-IDF, LEGAL-BERT with TF-IDF, and BERTScore. Section 4 consists of more details on our experimental setup, results and a following discussion. In the final section we conclude our results and indicate future research potential.

2 RELATED WORK
In this section, we describe related research on the retrieval methods we used. In particular, we investigate past uses of the respective methods within the COLIEE competition. First, we briefly review TF-IDF (term frequency - inverse document frequency) vectorization and its place within the competition. Second, we collect approaches which are similar to our methods using BERT (Bidirectional Encoder Representations from Transformers) and also draw a distinction between our methods for the three runs we submitted and the existing work.

2.1 Retrieval with TF-IDF
TF-IDF vectorization gives an idea of how relevant a particular term is within a document and within the document collection. TF-IDF vectors represent a document by assigning a higher weight to terms which appear relatively frequently in few documents - compared to their usual occurrence in the rest of the corpus - by discounting the term frequency with the inverse document frequency. As Beel et al. [1] comment, the TF-IDF vectorization scheme is the most widely used approach for content-based filtering in recommender systems and related text mining domains. In the COLIEE competition, multiple teams in previous years used TF-IDF vectors with or without other representation methods to retrieve the relevant articles given a query [8, 10]. In legal information retrieval, TF-IDF alone is still a valuable baseline model because its results are easy to interpret for domain experts. However, in previous editions of the competition, a mere TF-IDF approach could not reliably achieve winning scores. When used in conjunction with other embedding techniques, competitive results were attainable. One such approach has been employed by Rabelo et al. [9] to address the case law entailment task. They employ two different cosine similarity approaches and a confidence score from BERT [4] to improve the extraction/entailment results. We adopt a similar approach in our second run by combining TF-IDF similarity scores with the softmax scores obtained from fine-tuned BERT models. However, we also differ in the way of choosing documents to calculate similarity and in thresholding for the retrieval task.

2.2 Retrieval with BERT
Nowadays, many pre-trained deep learning-based language models are available, coming from neural network architectures for Natural Language Processing (NLP) with significant improvements for various downstream tasks, such as single sentence classification, question answering tasks, sentence tagging tasks and paraphrase identification. BERT, introduced by Devlin et al. [4], is currently a common choice for such downstream tasks, replacing various traditional NLP pipelines. Following this, there has been an exhaustive study about the applications of BERT and experiments to investigate different fine-tuning methods for these pre-trained models by Sun et al. [12]. They present various fine-tuning strategies for BERT on a text classification task, providing a general solution for achieving state-of-the-art results on a variety of text classification datasets. We follow a few of these best practices for better performance with our selected models, such as:
(1) the use of the right combination of different hyperparameters that directly affect the learning,
(2) the importance of the selection of the correct value of warm-up steps,
(3) how concentrating on the decay rate can help to converge towards the minima and when the learning rate decay should start, and
(4) the right combination of batch size with the number of epochs and warm-up steps.

2.2.1 Fine-tuning BERT. When Devlin et al. proposed the BERT model, they described its use on downstream tasks in two phases: a pre-training and a fine-tuning phase [4]. Therefore, its intended use for any further task is to first fine-tune it in order to achieve the desired performance. Nowadays, BERT is not always fine-tuned; sometimes the pre-trained model and its embeddings perform well enough, if the domain is not substantially changed compared to what the model was pre-trained on. However, for the legal domain it can be worthwhile to adapt an existing BERT model to the different use of vocabulary in that context. This can also be observed in the past COLIEE competitions. For the task on statute law retrieval, Nguyen et al. [8] use an ensemble of BERT models. The publicly available bert-base-uncased¹ model is pre-trained on the English-language Wikipedia and BooksCorpus [14] and then fine-tuned by Nguyen et al. on the COLIEE training data. The model is combined with another special bert-base-uncased model that is further trained with the masked language model (MLM) objective on the entire COLIEE data (BERT-CC) and fine-tuned on training data to obtain a measure of relevance. This ensemble of BERT achieved the best F2 score for the validation data. As their BERT-CC focuses on legal domain knowledge, we reviewed further special BERT models. We found RoBERTa (Robustly Optimized BERT pre-training Approach) [7] with its variants, and LEGAL-BERT [3], as promising models for task 3. RoBERTa [7] is optimized with some alterations to essential hyperparameters in BERT and trained with relatively bigger batches over a larger amount of training data. It also excludes BERT's next-sentence prediction task, allowing it to improve on MLM over BERT. This leads to a better performance on various baseline NLP downstream tasks [7]. Similarly, LEGAL-BERT [3] is an adaptation of BERT to the legal domain, where pre-training is carried out on a collection of several fields of English legal text, such as contracts, court cases, and legislation. This special BERT model has been performing better than the original version of BERT on legal domain-specific tasks [3].

1 https://huggingface.co/bert-base-uncased
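Best practices (2)-(4) above revolve around the warm-up and decay schedule of the learning rate. A minimal sketch of the warm-up-then-linear-decay shape commonly paired with BERT fine-tuning (the base rate and step counts are illustrative, not the paper's hyperparameters):

```python
def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up to base_lr over warmup_steps, then linear decay
    to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps           # warm-up phase
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))

base_lr, warmup, total = 2e-5, 100, 1000   # illustrative values
print(lr_at_step(50, base_lr, warmup, total))    # 1e-05 (halfway through warm-up)
print(lr_at_step(100, base_lr, warmup, total))   # 2e-05 (peak)
print(lr_at_step(550, base_lr, warmup, total))   # 1e-05 (halfway through decay)
```

Warm-up delays the peak learning rate until gradient statistics stabilize; the decay phase then shrinks the step size so training can settle near a minimum.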
2.2.2 Contextual Embeddings from BERT. Aside from further training the whole BERT language model and using it on a classification task, we can also use contextual word embeddings from a pre-trained BERT model to determine the semantic similarity of the query and the article(s). Contextual word embeddings are computed at runtime. In particular, we obtain different vectors for the same word when it is used in another context or position in a sentence. In that way, we can also distinguish homonyms when they are accompanied by enough words in the appropriate context. Since the contextual embedding type is quite recent, there is no final consensus in the research community on how to compute the distance between two contextual word embedding sequences. The most common methods are: using the [CLS] token, which is often seen as a representation of a whole sentence; using the individual word embeddings; or averaging all individual word embeddings in a sentence and then computing a similarity score using the Word Mover's Distance [6] or cosine similarity. In the experiments by Reimers et al. [11], using the mean of the individual contextual word embeddings outperformed the approach with the [CLS] token. A recent approach related to this is the BERTScore [13]. After computing the pairwise cosine similarity of all token-wise contextual embeddings from two sentences, BERTScore selects token pairs between the two sentences which have the highest cosine similarity. Those similarities are summed up and normalized by the number of words in the sentence to obtain precision, recall and the according F1-score. Optionally, the BERTScore can also incorporate IDF weighting. We employ the BERTScore in our third run to test whether the mere embeddings of BERT can also capture enough context in the training data, compared to using document enrichment or fine-tuning a BERT model for relevance classification.

For us, it is particularly interesting that there are recent approaches for fine-tuning a language model specifically to obtain meaningful sentence embeddings [2, 11]. In the previous COLIEE edition, the cyber team achieved the best performance among all teams using the universal sentence encoder, TF-IDF and a support vector machine for the case law retrieval task [10]. Hence, we assume that TF-IDF combined with sentence embeddings could also work well on the statute law retrieval task. A new advancement in sentence embeddings has been made by Reimers et al. [11], who introduce Sentence-BERT. It outperforms the existing embedding methods and is found useful for multiple downstream tasks. It is based on a Siamese network architecture which ties the weights of two BERT models (one for each input sentence) that are updated during fine-tuning. As a default, the mean is used to pool the obtained contextual word embeddings from each BERT model. Then, the two resulting sentence embeddings are concatenated with their element-wise difference, so that the final softmax layer can predict a class. We have made use of this state-of-the-art embedding approach to create a richer and more meaningful numeric representation of each article and query pair in our first run. In a previous COLIEE edition, Kim et al. [5] employed a Siamese Deep Convolutional Neural Network for the entailment task, which results in better performance compared to regular Convolutional Neural Networks. They attribute their success to the Siamese architecture, which requires fewer parameters due to the weight sharing mechanism and carries a lower risk of overfitting. For this reason, we assume that a sentence embedding based on a similar architecture may be a good fit for the COLIEE data and also perform well on the retrieval task.

Overall, TF-IDF vectorization and BERT-based approaches have already been tried in the course of past COLIEE editions. Nevertheless, there are many options to employ both methods, while fine-tuning can affect the outcome substantially.

Table 1: Methods for each run for task 3

Run Name     Method
OvGU_run1    Sentence-BERT + TF-IDF + data enrichment
OvGU_run2    LEGAL-BERT + TF-IDF + data augmentation
OvGU_run3    BERTScore

3 STATUTE RETRIEVAL TASK
This section describes in detail the three different methods we proposed and implemented for task 3 in COLIEE 2021, as listed in Table 1. While the first method exploits Sentence-BERT coupled with TF-IDF vectors and data enrichment, the second method uses LEGAL-BERT with TF-IDF vectors and data augmentation. The third method applies the BERTScore to solve the problem at hand.

3.1 Sentence-BERT Embedding with TF-IDF
The first run involves a combination of two-stage TF-IDF vectorization with Sentence-BERT embeddings. This run was the best out of all the runs submitted for task 3 in COLIEE 2021. An overview of the approach is depicted in Figure 1 and described in the following. We start by enriching the training data with multiple adjustments as described in Table 2. This enrichment helps us to create vectors for each article in the Civil Code which are more unique than those the training data itself could deliver. A concrete example of the enrichment process for Article 177 can be found in Table 3. We enrich each article in the training data as follows:
• Metadata: We add structural information using the section titles in the Civil Code. In that way, hierarchical relations between articles within the same Part, Chapter, Section and even Subsection are modeled.
• Crawled data: We crawl Japanese open-source commentary on the Civil Code articles and thereby potentially enrich the original article text with general remarks, corner cases, previous versions, related articles and the reasoning for the relation.
• Relevant queries from training data: We parse the training data labels of task 4 (entailment) to enrich our training data of task 3 with queries that have a positive entailment relationship. With a positive entailment relationship, we can be sure that the added queries correspond to the meaning of the article and can help in determining relevance, too.
After data enrichment, we encode the enriched texts with the TF-IDF vectorizers and the Tokenizer² for our Sentence-BERT and progress to the final relevance score calculation with the following steps:

2 https://huggingface.co/distilroberta-base
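The BERTScore greedy matching described in Section 2.2.2 (pairwise cosine similarities, per-token maxima, then precision, recall and F1) can be sketched over toy token embeddings; real usage would take contextual embeddings from a BERT model rather than the invented vectors below:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore(cand_emb, ref_emb):
    """Greedy matching as in BERTScore: precision averages, over candidate
    tokens, the best similarity to any reference token; recall does the
    same from the reference side; F1 combines both."""
    sims = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    precision = sum(max(row) for row in sims) / len(cand_emb)
    recall = sum(max(sims[i][j] for i in range(len(cand_emb)))
                 for j in range(len(ref_emb))) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy embeddings: two identical token sequences match perfectly.
emb = [[1.0, 0.0], [0.0, 1.0]]
p, r, f1 = bertscore(emb, emb)
print(round(f1, 4))  # 1.0
```

The optional IDF weighting mentioned above would replace the plain averages with IDF-weighted sums over the matched similarities.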
[Figure 1: Overview of the first run. Three representations are built for each article: TF-IDF Stage 1 (article with metadata + article with relevant queries), TF-IDF Stage 2 (article with metadata only), and a sentence embedding (article with metadata + article with relevant queries + article with crawled queries). For each representation, the cosine similarity between the query vector 𝑣𝑞 and the article vectors 𝑣𝑎 (for all articles) is computed; the three similarities are summed, normalized and thresholded to select the relevant article(s).]

Table 2: Data enrichment for the statute retrieval task

Enrichment                       Description
Articles with metadata           training data + details regarding Part, Chapter, Section and Subsection.
Articles with crawled data       training data + translated crawled data from the website https://ja.wikibooks.org/
Articles with relevant queries   training data + queries from the training data if the entailment label is Y for the respective article.

(1) TF-IDF vectors are computed for queries and articles together as a two-stage process. In the first stage, we rely on sub-linear term frequency scaling and L2 normalization while computing the vectors. Articles are enriched by a combination of Articles with relevant queries and Articles with metadata. The vectors $\vec{v}$ are computed by the following equations 1-4:

$\mathrm{tf}_{t,d} = 1 + \log(\mathrm{tf}_{t,d})$  (1)

$\mathrm{idf}_t = \log\left(\frac{N}{1 + \mathrm{df}_t}\right)$  (2)

$w_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$  (3)

$\vec{v} = \frac{\vec{w}}{\sqrt{\sum_i w_{i,d}^2}}$  (4)

where,
- $\mathrm{tf}_{t,d}$ is the term frequency - the frequency of term $t$ in document $d$. Here, documents are the individual articles of the Civil Code.
- $N$ is the total number of documents in the collection.
- $\mathrm{df}_t$ is the document frequency - the frequency of term $t$ in the collection.
- $w_{t,d}$ is the weight, which is the product of term frequency and inverse document frequency.
The vectors after a single stage of TF-IDF vectorization yielded significant precision-recall trade-offs, reflected in relatively lower F2 scores. This prompted us to provide a different, but unique, representation of the articles, which ended up in a second stage of TF-IDF where query and article vectors are created considering only the Articles with metadata enrichment. The combination of both stages acts as a counter-balance in the trade-off.
(2) Sentence-BERT embeddings for each article are created with the enrichment described in Table 2. We rely on the implementation³ by Reimers et al. [11] and use the pre-trained paraphrase-distilroberta-base-v1 model to create the article and query embeddings. We select the aforementioned paraphrase model because it was trained on millions of paraphrase examples and is reportedly performing well on natural language inference tasks⁴.
(3) Finally, for each query-article pair we compute the cosine similarity to determine the relevance of each article for the respective query. For each pair, we obtain three different similarity scores from the first-stage TF-IDF, the second-stage TF-IDF and the Sentence-BERT embeddings. The sum of these scores is then normalized and we empirically determine a threshold to filter out the most relevant articles for each test query.

3 https://github.com/UKPLab/sentence-transformers
4 https://www.sbert.net/docs/pre-trained_models.html#paraphrase-identification
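Equations 1-4 can be implemented directly; the following sketch computes the L2-normalized sub-linear TF-IDF vector of one article over a toy three-document corpus (the tokens are invented for illustration):

```python
import math

def tfidf_vector(doc_tokens, corpus):
    """L2-normalized sub-linear TF-IDF vector of one document, following
    Equations 1-4: tf' = 1 + log(tf), idf = log(N / (1 + df)),
    w = tf' * idf, v = w / ||w||."""
    N = len(corpus)
    vocab = sorted(set(doc_tokens))
    weights = []
    for t in vocab:
        tf = 1 + math.log(doc_tokens.count(t))            # Eq. (1)
        df = sum(1 for d in corpus if t in d)             # document frequency
        idf = math.log(N / (1 + df))                      # Eq. (2)
        weights.append(tf * idf)                          # Eq. (3)
    norm = math.sqrt(sum(w * w for w in weights))
    return {t: w / norm for t, w in zip(vocab, weights)}  # Eq. (4)

# Toy corpus of three "articles" (invented tokens):
corpus = [["mortgage", "registration", "obligee"],
          ["deposit", "obligee"],
          ["mandatary", "costs"]]
vec = tfidf_vector(corpus[0], corpus)
print(round(vec["mortgage"], 4))  # 0.7071; "obligee" gets weight 0 (df = 2, idf = log(3/3))
```

Because the vectors are L2-normalized, the cosine similarity used in step (3) reduces to a plain dot product of two such vectors.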
Table 3: Example data enrichment for Article 177 of the Civil Code

Training data
  Article 177: Acquisitions of, losses of and changes in real rights on immovables .. and other laws regarding registration.
Metadata
  Part: II Real Rights; Chapter: I General Provisions; Section: 3 Extinctive Prescription;
  Subsection: Requirements of Perfection of Changes in Real Rights on Immovables
Crawled data
  Comprehensive succession - The range of changes in property rights that require registration has been determined ..
  Legal evidence theory - What kind of person is referred to as "a person who has a legitimate interest ..
  .. (161 unique words in total, shortened to conserve space)
Relevant queries from training data
  – H19-11-3: In a case where A bought a registered building owned by B .. his/her acquisition of ownership of that building.
  – H21-24-E: If a mortgage creation contract has the agreement of the mortgagee .. there is no registration of its creation.
  – R01-6-A: In cases A sold Land X belonging to A and B sold it to C, C may be asserted .. for sales without security.
For developing an explanatory dialogue in a real setting, the additional text we gained in the enrichment steps can be marked in a different font style. Then, we can highlight important keywords based on the scores of each TF-IDF stage. Since we did not apply any weighting during the cosine similarity computation of the Sentence-BERT embeddings, the similarity between the word vectors of query and article can be visualized using a heatmap.

3.2 LEGAL-BERT with TF-IDF
For our run 2, we treat task 3 as a sentence-pair classification task to predict a relevance of 1 if the given article is related to the query and 0 otherwise. Considering its good performance on previous retrieval tasks, we choose to work with a BERT model. A variety of BERT models that are pre-trained on different datasets can be used for addressing domain-specific tasks with fine-tuning.

3.2.1 BERT configuration. Following this convention with fine-tuning, we initially used bert-base-uncased, which has 12 hidden layers with 768 hidden units in each layer and 12 attention heads. A classification head is added on top of the base model, consisting of a single layer of fully connected linear neurons. We use the softmax function to get a probability distribution over the two labels and use cross-entropy loss with the Adam optimizer to fine-tune the model. We split the training dataset into two parts for fine-tuning (∼85% training) the model and use the rest of the dataset for validation (all queries starting with id "R01-*").

3.2.2 Data Pre-processing. An overview of the pre-processing for the LEGAL-BERT with TF-IDF approach is illustrated in Figure 2. We pre-process both the training and validation splits in the following manner:
(1) Data Decomposition: This is performed to extract each relevant article for a given query to form separate instances. For every query, there is one or more than one article associated with and relevant to it. We take individual articles to create a new instance in the training dataset so that the query can be divided into multiple instances against all of its relevant articles. An example is shown in Table 4 for the query with the Pair ID "H27-22-4":
Query Q: "In the contract for deposit for value, if the performance of the obligation to return deposited Thing has become impossible due to reasons not attributable to the depositary, he/she may not claim remuneration from the depositor, with respect to the period after the impossibility of performance of the agreed duration."
After achieving better results with data decomposition than with the original dataset, we further extract referenced articles from each relevant article of the query using regular expressions and append them as well to form multiple instances of query-article pairs for each query. The same example is extended further for Approach 2 in Table 5.
However, this extensive decomposition of referenced articles did not improve our recall further. We assume this is plausible, as these articles are supporting articles for the relevant article content but are not directly relevant to the query. We compare the results with and without data decomposition and summarize them in Table 6. We decided to go with Approach 1, where we have a better recall score.
(2) Data Augmentation: We use the non-relevant articles to reduce data imbalance. For this, we enriched this decomposed dataset using the top 50 non-relevant articles for each query instance. These non-relevant articles are based on the highest cosine similarity between the TF-IDF vectors of the relevant article and all the articles, excluding the other relevant ones for the respective query. This approach is similar to the implementation by Nguyen et al. [8], where they considered query-article similarity. However, we assume that article-article similarity is better suited than query-article similarity, since we find that articles are more related to each other in terms of cosine similarity than they are to the queries. Based on the cosine similarity, we select only the top 50 non-relevant articles as training examples, since we did not intend to reintroduce the data imbalance that we attempted to overcome with augmentation.
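The decomposition step can be sketched as a simple flattening of each (query, relevant articles) record, with a simplified regular expression standing in for the reference extraction of Approach 2 (the actual pattern used in the paper is not given):

```python
import re

def decompose(query_id, query_text, relevant_articles):
    """Approach 1: split one (query, [articles]) record into one
    (query, article, label=1) training instance per relevant article."""
    return [(query_id, query_text, article, 1) for article in relevant_articles]

def referenced_numbers(article_text):
    """Approach 2 helper: article numbers mentioned inside an article text.
    The pattern is a simplification; e.g. it misses the '648' in
    'Articles 646 through 648' because only the first number follows the word."""
    return re.findall(r"Articles? (\d+)", article_text)

instances = decompose("H27-22-4", "In the contract for deposit for value ...",
                      ["Article 665 ...", "Article 648 ...", "Article 536 ..."])
print(len(instances))  # 3
print(referenced_numbers("The provisions of Articles 646 through 648 apply"))  # ['646']
```

The positive instances produced here would then be balanced with the top 50 non-relevant articles per query, as described in the Data Augmentation step.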
Table 4: Approach 1 - Data decomposition of multiple articles for each query into multiple instances
Queries Articles
Before Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Article 648 (1) In the absence of any ... (2) ... the provisions of Article 624 ... (3) ... course of performance.
Article 536 (1) If the performance ... (2) ... obligee for the benefit.
After Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Query Q Article 648 (1) In the absence of any ... (2) ... the provisions of Article 624 ... (3) ... course of performance.
Query Q Article 536 (1) If the performance ... (2) ... obligee for the benefit.
Table 5: Approach 2 - Data decomposition of multiple articles and their referenced articles for each query into multiple instances
Queries Articles
Before Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Article 648 (1) In the absence of any ... (2) ... the provisions of Article 624 ... (3) ... course of performance.
Article 536 (1) If the performance ... (2) ... obligee for the benefit.
After Pre-processing
Query Q Article 665 The provisions of Articles 646 through 648, Article 649, and Article 650, paragraphs ...
Query Q Article 646 (1) A mandatary must deliver to the mandator monies and other things received during ...
Query Q Article 647 If the mandatary has personally consumed monies that were to be delivered to the mandator ...
Query Q Article 648 (1) In the absence of any special agreements, the mandatary may not claim remuneration ...
Query Q Article 649 If costs will be incurred in administering the mandated business, the mandator must ...
Query Q Article 650 (1) If the mandatary has expended costs found to be necessary for the administration ...
Query Q Article 624 (1) An employee may not demand remuneration until the work the employee promised ...
Query Q Article 536 (1) If the performance ... (2) ... obligee for the benefit.
Table 6: Results on the validation set for different data pre-processing approaches of run 2

Model                                      Prec      Recall
bert-base-uncased without decomposition    0.1392    0.3973
bert-base-uncased with Approach 1          0.2529    0.4421
bert-base-uncased with Approach 2          0.1179    0.4300

(3) Augmenting the Original Dataset: To ensure that the original data could still influence the model, we also append the original data. In other words, for each query without any data decomposition, all relevant articles are processed in one instance as they are given in the dataset. This increases the number of relevant articles for each query at the cost of generating some duplicates, since for queries which have only one relevant article, those are already obtained in the data decomposition step. Overall, the three pre-processing steps increase the number of training instances by a factor of 10.

3.2.3 Fine-Tuning. On comparing the results with the legal domain-specific pre-trained BERT models, bert-base-uncased was outperformed by legal-bert-base-uncased and legal-RoBERTa on similar hyperparameters. We finally choose legal-bert-base-uncased, as it indicated the most satisfactory results, to test further with different experimental setups in Section 4.1. To extract relevant articles for a given query during testing, we combine each query with all the articles. For LEGAL-BERT, we applied the softmax function to the logits predicted from our model. For each query-article pair, we obtain two softmax probability values, indicating the non-relevance and relevance of the article to the query. We only consider the softmax probabilities of the relevance column. To avoid the underflow of softmax probabilities of the top relevant articles, we max-normalize these scores. At the same time, we also calculate the query-article cosine similarity of all the articles for each query. The similarity scores are also max-normalized, for the same reasons as stated above. We ultimately compute an average of these two normalized scores. To select the top-n relevant articles, we use a threshold value selected based on the precision-recall trade-off for the validation set. The time for training the LEGAL-BERT model increases from 2 minutes on the original dataset to 2 hours on the fully enriched dataset⁵. The larger amount of text in the enriched data does not have a significant impact during the test phase. At runtime, we directly process the new query and all pre-stored enriched articles

5 We used an NVIDIA Quadro RTX 8000 to accelerate training.
Legal Norm Retrieval with Variations of the BERT Model Combined with TF-IDF Vectorization ICAIL ’21, June 21–25, 2021, online
Figure 2: Pre-processing for LEGAL-BERT with TF-IDF.

Figure 4: Overview of the approach using BERTScore.
with the already trained language model, so that the prediction does not cause any noticeable delay in the system's response time.

(3) For the test data, we take the average K of all the BERTScore thresholds from the training data.
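A minimal sketch of the run-2 score combination described in Section 3.2.3: max-normalize the LEGAL-BERT relevance probabilities and the TF-IDF cosine similarities, average them, and keep articles above a threshold. All function names and toy scores here are ours, not from the original implementation.

```python
def max_normalize(scores):
    """Scale scores so the maximum becomes 1.0 (avoids tiny softmax values)."""
    top = max(scores.values())
    return {k: v / top for k, v in scores.items()}

def combine_scores(softmax_rel, cosine_sim, threshold=0.5):
    """Average the two max-normalized score sets and keep articles whose
    combined score reaches the threshold, ranked by combined score."""
    s = max_normalize(softmax_rel)
    c = max_normalize(cosine_sim)
    combined = {a: (s[a] + c[a]) / 2 for a in s}
    return sorted((a for a, v in combined.items() if v >= threshold),
                  key=lambda a: -combined[a])

# Toy scores for three candidate articles (hypothetical values)
softmax_rel = {"Art. 646": 0.012, "Art. 624": 0.004, "Art. 536": 0.001}
cosine_sim = {"Art. 646": 0.80, "Art. 624": 0.72, "Art. 536": 0.20}
print(combine_scores(softmax_rel, cosine_sim))
```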
ICAIL ’21, June 21–25, 2021, online Wehnert et al.
Table 7: Two stages of TF-IDF counter-balancing the precision-recall trade-off with Sentence-BERT

                         F2     Prec   Recall
Validation data
with 1st stage TF-IDF    54.67  50.16  61.98
with 2nd stage TF-IDF    53.74  49.54  60.27
with both stages         56.52  52.60  63.39
COLIEE 2021 test data
with 1st stage TF-IDF    72.98  66.77  78.40
with 2nd stage TF-IDF    73.02  66.28  79.63
with both stages         73.02  67.49  77.78

Table 8: Results on validation set for run 2 candidates

Model                    Prec    Recall
bert-base-uncased        0.2529  0.4421
legal-bert-base-uncased  0.3447  0.5357
legal-RoBERTa            0.2205  0.4866

Table 9: Task 3 results for COLIEE 2021

Position  Run        F2      Prec    Recall  R_30
1         OvGU_run1  0.7302  0.6749  0.7778  0.8515
9         OvGU_run2  0.6717  0.4857  0.8025  0.9010
18        OvGU_run3  0.3016  0.1570  0.7006  0.7030
paragraph tags (<p>) and the list tags (<ol>, <ul>, <dl>, <li>) to get relevant information about the articles. This is motivated by the team TRC3 in the previous year of COLIEE [10], where they used the content in Japanese itself. However, we translate the fetched content to English using the google-trans-new package⁸. To vectorize these enriched articles and queries, we used the TfidfVectorizer from scikit-learn.

⁸ https://github.com/lushan88a/google_trans_new

To address the problem of the precision-recall trade-off, we use two stages of TF-IDF, motivated by previous experiments we conducted on queries starting with the id "R01-*", as shown in Table 7. It is evident on the validation data that the two-stage TF-IDF can counter-balance the classical trade-off between precision and recall, considering the improved F2 scores. For the COLIEE 2021 test data, the second stage has a positive effect on the F2 score as well, though it is not as significant as for our validation split.

The threshold value to filter out the top n relevant articles was found empirically. After normalizing the sum of the scores from the two stages of TF-IDF and the Sentence-BERT embeddings, we considered the top 4 articles. This was purely based on our validation data, where none of the queries had more than four relevant articles. This is true for the COLIEE 2021 test data as well, where no query has more than 4 relevant articles. To find a threshold for the scores of these articles, an index-based threshold was found to be better than a single value for the whole set. Accordingly, we take the article at the 1st index (with a score of 1.0), then set a threshold of 0.91 or higher for an article found at the 2nd index, and a threshold of 0.85 or above for articles found at the subsequent indices.

4.1.2 LEGAL-BERT with TF-IDF. To decide among the three alternative models that we selected as candidates for our run 2, as discussed in Section 3.2, we validate them on various hyperparameter settings and observe that the default hyperparameters of the Adam optimizer, with the learning rate varied selectively from 1e−03 to 1e−06, achieve the best results for all three models at 1e−05 (see Table 8) when trained for 3 epochs with batch size 16. Considering the highest recall score, we select legal-bert-base-uncased as our final choice for run 2.

Further, we experimented with warm-up steps, introduced a decay rate, and did some hyperparameter tuning to optimize our results. We notice that with 3500 warm-up steps and a decay rate of 0.1/(1 + epoch), we achieve the best performance. We then perform further training on the validation set with an increased batch size of 24. We then create an ensemble of legal-bert-base-uncased and TF-IDF vectors, both with max-normalized similarity scores for the article-query pairs, assigning equal weights to both scores. Finally, we fetch the articles that are above the threshold value of 0.5.

4.1.3 BERTScore. For the BERTScore, we use the model type bert-base-uncased, 9 layers, and no re-weighting with IDF. This setup was determined based on the performance on our validation data, which we also used before (queries starting with the id "R01-*"). The text is processed with the regular tokenizer of BERT, and we pass query and article(s) without further modification to the scorer of the original BERTScore implementation. Our thresholding strategy for this run results in a threshold value of 0.63331205.

4.2 Results

Our first run, OvGU_run1, obtained the first position for its F2 score in the overall task evaluation for COLIEE 2021. OvGU_run2 also has the best recall, sharing the position with the run JNLP.CrossLMultiLThreshold, closely followed by OvGU_run1. Considering Recall at 30, our runs have the third best (for OvGU_run2) and the fifth best (for OvGU_run1) scores. The results for our runs are summarized in Table 9. Values in bold are the best scores for the corresponding metric.

4.3 Discussion

We assume that our first run provides reliable results because of the combination of contextual Sentence-BERT embeddings with the TF-IDF vectors. This is supported by the test query R02-1-A: "The family court may decide to commence an assistance also in respect of a person whose capacity to appreciate their own situation is extremely inadequate due to a mental disorder.", as shown in Table 10. For this query, only the Sentence-BERT embeddings could retrieve the most relevant Article 15, which was not retrieved in either stage of TF-IDF vectorization. Article 15 has
Table 10: Comparison of results for query R02-1-A

Table 11: Comparison of results for query R02-24-U

the following content:

"(Decisions for Commencement of Assistance)
Article 15 (1) The family court may decide to commence an assistance in respect of a person whose capacity to appreciate their own situation is inadequate due to a mental disorder, at the request of the person in question, that person's spouse, that person's relative within the fourth degree of kinship, the guardian, the guardian's supervisor, the curator, the curator's supervisor, or a public prosecutor; provided, however, that this does not apply to a person with respect to whom there are grounds as prescribed in Article 7 or the main clause of Article 11. (2) The issuance of a decision for commencement of assistance at the request of a person other than the person in question requires the consent of the person in question. (3) A decision for commencement of assistance must be made concurrently with a decision as referred to in Article 17, paragraph (1) or a decision as referred to in Article 876-9, paragraph (1)."

It turns out that for this query-article pair, we have a significant term overlap, which may be diluted by the whole article length. In that way, sentence-based approaches in general may work well for this query. Only our run OvGU_run1 and the run TR_HB achieve a 100% F2 score for this query.

On comparing our different runs, we find interesting similarities in the articles retrieved by each of them. This might possibly be because of the common TF-IDF coupling in the first two runs. We did not expect that embeddings from a pre-trained model (in run 1) could give more or less comparable results to those from a model further trained on the COLIEE dataset (in run 2).

Another insight from the results is how thresholding plays a significant role in the retrieval task. For example, the test query R02-24-U: "A donor shall assume a duty to retain the subject matter exercising care identical to that he/she exercises for his/her own property until the completion of such delivery.", retrieved only one relevant article with run 1 but both relevant articles with run 2. This is described in Table 11. Drawing conclusions from this query and the many similar ones, we are not surprised to see the fine-tuned model of run 2 retrieve 74 candidate articles while run 1 retrieves only 70 candidate articles out of a total of 101. This results in run 2 achieving the overall best recall of 0.8025.

With BERTScore, the interesting query to analyse is the test query R02-17-I: "In the case that D manifests the intention to release another obligor (C) from the obligation to D, even if neither D nor B manifests a particular intention, D may not claim the payment of 600,000 to another obligor (A)." We are able to retrieve 3 out of 4 articles (Articles 439, 440 and 441) using this technique, which was the highest number when compared to other teams for this query. However, this result can be attributed to the threshold we selected, with high recall and lower precision. The ranking of the articles by the BERTScore is only average even for this query, considering the Mean Average Precision (MAP). The MAP score is only 0.0509 for run 3, while run 1 gets 0.0299 and run 2 achieves 0.1250. The best MAP score for this query, with a value of 0.2309, was obtained by the team JNLP with their run called JNLP.CrossLBertJP.

For assessing the final ranking performance of run 3, we can compare its MAP score to other teams. Also here, we observe that the BERTScore, with a MAP score of 0.5557, is the fourth-lowest performing run in the competition, whereas our run 1 achieves 0.7496 and run 2 has the highest overall MAP score among our runs with 0.7571. The best MAP score of 0.7947 was achieved by the team JNLP with their run JNLP.CrossLMultiLThreshold. This leads us to the conclusion that the standard BERTScore without IDF re-weighting or any further combined methods may not be sufficient to solve this task, at least with the query type distribution of this year's test dataset. We also observe how thresholding influences our F2 score in run 1, so that our method scores higher than a run by the JNLP team which has a better ranking performance.

From the results and discussion above, our main takeaways from this COLIEE edition for task 3 are:

(1) Contextual embeddings can significantly enhance retrieval performance when coupled with TF-IDF vectors.
(2) Adding external knowledge to the articles in the form of structural information, entailed queries or definitions can help to make them more unique.
(3) Data augmentation techniques are useful to train a BERT classifier for a retrieval task.
(4) An intelligent, or rather more effective, thresholding mechanism should be devised to further improve precision and maintain a decent F2 score.

5 CONCLUSION AND FUTURE WORK

In this work, we study variations of the language model BERT for task 3 of the COLIEE competition on statute law retrieval. We find a benefit in combining the BERT model with TF-IDF vectorization and in working on a sentence level with contextual embeddings. Furthermore, it is helpful to test different pre-trained models and fine-tuning, as well as adding external knowledge and data augmentation techniques. Our winning approach is an ensemble of Sentence-BERT and two different TF-IDF representations with different extents of document enrichment. In the second run, we fine-tune a BERT classifier for retrieval based on an augmented dataset. The third run is similarity scoring using the BERTScore
with thresholding. Future enhancements can consist of an improved thresholding mechanism and of encoding other types of external knowledge, for example named entities.
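As an illustration of the index-based thresholding used in run 1 (Section 4.1.1), assuming a list of candidate articles with max-normalized scores sorted in descending order; the helper name and toy values are ours, not from the original implementation.

```python
def select_articles(ranked_scores, max_articles=4):
    """Keep the top-ranked article (normalized score 1.0), the 2nd-ranked
    article if its score is >= 0.91, and later ranks only if their score is
    >= 0.85, up to four articles (no query had more than four relevant ones)."""
    selected = [ranked_scores[0]]  # rank 1 is always kept
    for rank, (article, score) in enumerate(ranked_scores[1:max_articles], start=2):
        limit = 0.91 if rank == 2 else 0.85
        if score >= limit:
            selected.append((article, score))
    return selected

# Toy ranking with hypothetical normalized scores
ranked = [("Art. 15", 1.0), ("Art. 11", 0.93), ("Art. 7", 0.84), ("Art. 9", 0.60)]
print(select_articles(ranked))
```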
REFERENCES
[1] Jöran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. 2016. Research-
paper recommender systems: a literature survey. Int. J. Digit. Libr. 17, 4 (2016),
305–338. https://doi.org/10.1007/s00799-015-0156-0
[2] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St
John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al.
2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Con-
ference on Empirical Methods in Natural Language Processing, EMNLP 2018: Sys-
tem Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, Eduardo
Blanco and Wei Lu (Eds.). Association for Computational Linguistics, 169–174.
https://doi.org/10.18653/v1/d18-2029
[3] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras,
and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law
School. CoRR abs/2010.02559 (2020). arXiv:2010.02559 https://arxiv.org/abs/2010.
02559
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Technologies, NAACL-HLT
2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill
Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computa-
tional Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423
[5] Mi-Young Kim, Yao Lu, and Randy Goebel. 2017. Textual Entailment in Legal Bar
Exam Question Answering Using Deep Siamese Networks. In New Frontiers
in Artificial Intelligence - JSAI-isAI Workshops, JURISIN, SKL, AI-Biz, LENLS,
AAA, SCIDOCA, kNeXI, Tsukuba, Tokyo, Japan, November 13-15, 2017, Revised
Selected Papers (Lecture Notes in Computer Science, Vol. 10838), Sachiyo Arai,
Kazuhiro Kojima, Koji Mineshima, Daisuke Bekki, Ken Satoh, and Yuiko Ohta
(Eds.). Springer, 35–48. https://doi.org/10.1007/978-3-319-93794-6_3
[6] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From
Word Embeddings To Document Distances. In Proceedings of the 32nd International
Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 (JMLR
Workshop and Conference Proceedings, Vol. 37), Francis R. Bach and David M. Blei
(Eds.). JMLR.org, 957–966. http://proceedings.mlr.press/v37/kusnerb15.html
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A
Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019).
arXiv:1907.11692 http://arxiv.org/abs/1907.11692
[8] Ha-Thanh Nguyen, Hai-Yen Thi Vuong, Phuong Minh Nguyen, Tran Binh Dang,
Quan Minh Bui, Vu Trong Sinh, Chau Minh Nguyen, Vu D. Tran, Ken Satoh,
and Minh Le Nguyen. 2020. JNLP Team: Deep Learning for Legal Processing in
COLIEE 2020. CoRR abs/2011.08071 (2020). arXiv:2011.08071 https://arxiv.org/
abs/2011.08071
[9] Juliano Rabelo, Mi-Young Kim, and Randy Goebel. 2019. Combining Sim-
ilarity and Transformer Methods for Case Law Entailment. In Proceedings
of the Seventeenth International Conference on Artificial Intelligence and Law,
ICAIL 2019, Montreal, QC, Canada, June 17-21, 2019. ACM, 290–296. https:
//doi.org/10.1145/3322640.3326741
[10] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu
Kano, and Ken Satoh. 2020. COLIEE 2020: Methods for Legal Document Retrieval
and Entailment. https://sites.ualberta.ca/~rabelo/COLIEE2021/COLIEE2020_
summary.pdf. Accessed: 2021-05-09.
[11] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embed-
dings using Siamese BERT-Networks. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong
Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and
Xiaojun Wan (Eds.). Association for Computational Linguistics, 3980–3990.
https://doi.org/10.18653/v1/D19-1410
[12] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune
BERT for Text Classification?. In Chinese Computational Linguistics - 18th China
National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings
(Lecture Notes in Computer Science, Vol. 11856), Maosong Sun, Xuanjing Huang,
Heng Ji, Zhiyuan Liu, and Yang Liu (Eds.). Springer, 194–206. https://doi.org/10.
1007/978-3-030-32381-3_16
[13] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi.
2020. BERTScore: Evaluating Text Generation with BERT. In 8th International
Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020. OpenReview.net. https://openreview.net/forum?id=SkeHuCVFDr
[14] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards
Story-Like Visual Explanations by Watching Movies and Reading Books. In 2015
IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile,
December 7-13, 2015. IEEE Computer Society, 19–27. https://doi.org/10.1109/ICCV.2015.11
To Tune or Not To Tune?
Zero-shot Models for Legal Case Entailment
Guilherme Moraes Rosa (NeuralMind, Brazil; University of Campinas (Unicamp), Brazil)
Ruan Chaves Rodrigues (NeuralMind, Brazil; Federal University of Goiás (UFG), Brazil)
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Rosa et al.
in the legal domain. The models were trained on translation, summarization, and multi-label classification tasks, and achieved better results than single-task models.

Pretrained transformer models have only begun to be adopted in legal NLP applications more broadly [2, 10, 16, 37, 44]. In some tasks, they marginally outperform classical methods, especially when training data is scarce. For example, Zhong et al. [50] showed that a BERT-based model performs better than a tf-idf similarity model on a judgment prediction task [43], but is slightly less effective than an attention-based convolutional neural network [47].

In some cases, they outperform classical methods, but at the expense of using hand-crafted features or of being fine-tuned on the target task. For example, the best submission to task 2 of COLIEE 2019 was a BERT model fed with hand-crafted inputs and fine-tuned on in-domain data [29].

Peters et al. [26] demonstrate that fine-tuning on the target task may not perform better than simple feature extraction from a pretrained model if the pretraining task and the target task belong to highly different domains. These findings lead us to consider zero-shot approaches while investigating how general-domain Transformer models can be applied to legal tasks.

Although zero-shot approaches are relatively novel in the legal domain, our work is not the first to apply zero-shot Transformer models to domain-specific entailment tasks where limited labeled data is available. Yin et al. [45] have transformed multi-label classification tasks into textual entailment tasks, and then evaluated the performance of a BERT model fine-tuned on mainstream entailment datasets. Yin et al. [46] also performed similar experiments while transforming question answering and coreference resolution tasks into entailment tasks. We are not the first to use zero-shot techniques on the legal case entailment task. For instance, Rabelo et al. [28] used a BERT fine-tuned for paraphrase detection combined with two transformer-based models fine-tuned on a generic text entailment dataset and features generated by a BERT model fine-tuned on the COLIEE training dataset. However, we are the first to show that zero-shot models can outperform fine-tuned ones on this task.

2.1 The Legal Case Entailment Task

The Competition on Legal Information Extraction/Entailment (COLIEE) [13, 14, 30, 31] is an annual competition whose aim is to evaluate automatic systems on case and statute law tasks.

Among the five tasks of the 2021 competition, we submitted systems to task 2, called legal case entailment, which consists of identifying paragraphs from existing cases that entail a given fragment of a base case.

Training data consists of a set of decision fragments, their respective candidate paragraphs that could be relevant or not to the fragment, and a set of labels containing the numbers of the paragraphs by which the decision fragment is entailed. Test data includes only decision fragments and candidate paragraphs, but no labels. As shown in Figure 1, the input to the model is a decision fragment Q of an unseen case, and the output should be a set of paragraphs P = [P1, P2, ..., Pn] that are relevant to the given decision Q. In Table 1, we show the statistics of the 2020 and 2021 datasets.

We separate 80% of the 2020 training set for training and the remaining for validation, which yields 260 and 65 positive examples for the training and validation sets, respectively. Negative examples are all candidates not labeled as positive.

Table 1: Statistics of COLIEE's Task 2.

                                         2020            2021
                                     Train   Test    Train   Test
Examples (base cases)                325     100     425     100
Avg. # of candidates / example       35.52   36.72   35.80   35.24
Avg. positive candidates / example   1.15    1.25    1.17    1.17
Avg. of tokens in base cases         37.72   37.03   37.51   32.97
Avg. of tokens in candidates         100.16  112.65  103.14  100.83

The micro F1-score is the official metric in this task:

F1 = (2 × P × R) / (P + R),   (1)

where P is the number of correctly retrieved paragraphs for all queries divided by the number of retrieved paragraphs for all queries, and R is the number of correctly retrieved paragraphs for all queries divided by the number of relevant paragraphs for all queries.

3 METHOD

We experiment with the following models: BM25, monoT5-zero-shot, monoT5, and DeBERTa. We also evaluate an ensemble of our monoT5 and DeBERTa models.

3.1 BM25

BM25 is a bag-of-words retrieval function that scores a document based on the query terms appearing in it. We use the BM25 implementation in Pyserini [18], a Python toolkit that supports replicable information retrieval research, with its default parameters.

We first index all paragraphs in the datasets of tasks 1 and 2. Having more paragraphs from task 1 improves the term statistics (e.g., document frequencies) used by BM25. The task 1 dataset is composed of long documents, while task 2 is composed of paragraphs. This difference in length may degrade BM25 scores for task 2 paragraphs because the average document length will be higher due to the task 1 documents. We address this problem by segmenting each document into several paragraphs using a context window of 10 sentences with overlapping strides of 5 sentences.

The entailed fragment might be comprised of multiple sentences. Here we treat each of its sentences as a query and compute a BM25 score for each sentence and candidate paragraph pair independently. The final score for each paragraph is the maximum among its sentence and paragraph pair scores. We then use the method described in Section 3.5 to select the paragraphs that will comprise our final answer.

3.2 monoT5-zero-shot

At a high level, monoT5-zero-shot is a sequence-to-sequence adaptation of the T5 model [33] proposed by Nogueira et al. [25] and further detailed in Lin et al. [19]. This ranking model is close to or at the state of the art in retrieval tasks such as Robust04 [42],
TREC-COVID, and the TREC 2020 Precision Medicine and Deep Learning tracks. Details of the model are described in Nogueira et al. [25]; here, we only provide a short overview.

In the T5 model, all target tasks are cast as sequence-to-sequence tasks. For our task, we use the following input sequence template:

Query: q Document: d Relevant:   (2)

where q and d are the query and candidate texts, respectively. In this work, q is a fragment, and d is one of the candidate paragraphs.

The model estimates a score s quantifying how relevant a candidate text d is to a query q. That is:

s = P(Relevant = 1 | d, q).   (3)

The model is fine-tuned to produce the tokens "true" or "false" depending on whether the candidate is relevant or not to the query. That is, "true" and "false" are the "target tokens" (i.e., ground-truth predictions in the sequence-to-sequence transformation). The suffix "Relevant:" in the input string serves as a hint to the model for the tokens it should produce.

We use a T5-large model fine-tuned on MS MARCO [1], a dataset of approximately 530k query and relevant passage pairs. We use a checkpoint available at Huggingface's model hub that was trained with a learning rate of 10−3 using batches of 128 examples for 10k steps, or approximately one epoch of the MS MARCO dataset.¹ In each batch, a roughly equal number of positive and negative examples is sampled. We refer to this model as monoT5-zero-shot.

¹ https://huggingface.co/castorini/monot5-large-msmarco-10k

Although fine-tuning for more epochs leads to better performance on the MS MARCO development set, Nogueira et al. [25] showed that further training degrades a model's zero-shot performance on other datasets. We observed similar behavior in our task and opted to use the model trained for one epoch on MS MARCO.

At inference time, to compute probabilities for each query-candidate pair, a softmax is applied only on the logits of the tokens "true" and "false". The final score of each candidate is the probability assigned to the token "true".

3.3 monoT5

We further fine-tune monoT5-zero-shot on the 2020 task 2 training set, following a training procedure similar to the one described in the previous section.

Fragments are mostly comprised of only one sentence, while candidate paragraphs are longer, sometimes exceeding 512 tokens in length. Thus, to avoid excessive memory usage due to the quadratic memory cost of Transformers with respect to the sequence length, we truncate inputs to 512 tokens during both training and inference. The model is fine-tuned with a learning rate of 10−3 for 80 steps using batches of size 128, which corresponds to 20 epochs. Each batch has the same number of positive and negative examples. We refer to this model as monoT5.

3.4 DeBERTa

Decoding-enhanced BERT with disentangled attention (DeBERTa) improves on the original BERT and RoBERTa architectures by introducing two techniques: the disentangled attention mechanism and an enhanced mask decoder [12]. Both improvements seek to introduce positional information into the pretraining procedure, both in terms of the absolute position of a token and the relative positions between tokens.

The COLIEE 2021 Task 2 dataset has very few positive examples of entailment. Therefore, for fine-tuning DeBERTa on this dataset, we found it appropriate to artificially expand the positive examples. As fragments take up only a small portion of a base case paragraph, we expand positive examples by generating artificial fragments from the same base case paragraph in which the original fragment occurred. This is done by moving a sliding window, with a stride that is half the size of the original fragment, over the base case paragraph. Each step of this sliding window is taken to be an artificial fragment, and such artificial fragments are assigned the same labels as the original fragment.

Although the resulting dataset after these operations is several times larger than the original Task 2 dataset, we achieved better results by fine-tuning DeBERTa on a small sample taken from this artificial dataset. After experimenting with distinct sample sizes, we settled for a sample of twenty thousand fragment and candidate paragraph pairs, equally balanced between positive and negative entailment pairs.

In order to find the best hyperparameters for fine-tuning a DeBERTa Large model, we perform a grid search over the hyperparameters suggested by He et al. [12], always early stopping at the second epoch. The best combination of hyperparameters is used to fine-tune the model for ten epochs. The checkpoint with the best performance on the 2020 test set is selected to generate our predictions for the 2021 test set.
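A minimal sketch of the sliding-window expansion of positive examples described above, splitting on whitespace tokens for simplicity; the function and variable names are ours, not from the paper's code.

```python
def expand_fragment(paragraph_tokens, fragment_len):
    """Generate artificial fragments from a base-case paragraph by sliding a
    window of the original fragment's length over it, with a stride of half
    that length. Each window inherits the labels of the original fragment."""
    stride = max(1, fragment_len // 2)
    fragments = []
    for start in range(0, len(paragraph_tokens) - fragment_len + 1, stride):
        fragments.append(" ".join(paragraph_tokens[start:start + fragment_len]))
    return fragments

# Toy base-case paragraph and a hypothetical fragment length of 4 tokens
tokens = "the court held that the appellant had no standing to sue".split()
print(expand_fragment(tokens, 4))
```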
                                                 2020                    2021
Description                  Submission name     F1     Prec   Recall   F1     Prec   Recall   α, β, γ
(1a) Median of submissions   -                   0.5718 -      -        0.5860 -      -        -
(1b) Best of 2020 [24]       JNLP.task2.BMWT     0.6753 0.7358 0.6240   -      -      -        -
(1c) 2nd best of 2021        UA_reg_pp           -      -      -        0.6274 -      -        -
(2) BM25                     -                   0.6046 0.7222 0.52     0.6009 0.6666 0.5470   0.07, 2, 0.99
(3) DeBERTa                  DeBERTa             0.7094 0.7614 0.6640   0.6339 0.6635 0.6068   0, 2, 0.999
(4) monoT5                   monoT5              0.6887 0.7155 0.660    0.6610 0.6554 0.6666   0, 3, 0.995
(5) monoT5-zero-shot         -                   0.6577 0.7400 0.5920   0.6872 0.7090 0.6666   0, 3, 0.995
(6) Ensemble of (3) and (4)  DebertaT5           0.7217 0.7904 0.6640   0.6912 0.7500 0.6410   0.6, 2, 0.999
(7) Ensemble of (3) and (5)  -                   0.7038 0.7592 0.6560   0.6814 0.7064 0.6581   0.6, 2, 0.999

Table 2: Test set results on Task 2 of COLIEE 2020 and 2021. Our best single model F1 for each year is in bold.
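The last column of Table 2 lists the α, β, γ answer-selection values defined in Section 3.5. As a minimal sketch of how the three rules interact (the function name and toy scores are ours, not from the paper's code):

```python
def select_paragraphs(scores, alpha, beta, gamma):
    """Apply the three answer-selection rules: keep paragraphs whose score is
    above alpha, within the top beta by score, and at least gamma times the
    top score (alpha=0 or gamma=0 disables the corresponding rule)."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    top = ranked[0][1]
    return [p for i, (p, s) in enumerate(ranked)
            if s > alpha and i < beta and s >= gamma * top]

# Toy scores for four candidate paragraphs (hypothetical values)
scores = {"p1": 0.95, "p2": 0.90, "p3": 0.40, "p4": 0.10}
print(select_paragraphs(scores, alpha=0.2, beta=3, gamma=0.9))
```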
3.5 Answer Selection
The models described above estimate a score for each (fragment, candidate paragraph) pair. To select the final set of paragraphs for a given fragment, we apply three rules:

• Select paragraphs whose scores are above a threshold α;
• Select the top β paragraphs with respect to their scores;
• Select paragraphs whose scores are at least γ of the top score.

We use exhaustive grid search to find the best values for α, β, and γ on the development set of the 2020 Task 2 dataset. We swept α = [0, 0.1, ..., 0.9], β = [1, 2, ..., 10], and γ = [0, 0.1, ..., 0.9, 0.95, 0.99, 0.995, ..., 0.9999]. The best values for each model can be found in Table 3. Note that our hyperparameter search includes the possibility of not using the first or third strategies, if α = 0 or γ = 0 is chosen, respectively.

3.6 DeBERTa + monoT5 Ensemble (DebertaT5)
Ensemble methods seek to combine the strengths and compensate for the weaknesses of the individual models, so that the final model has better generalization performance.

We use the following method to combine the predictions of monoT5 and DeBERTa (both fine-tuned on COLIEE 2020): we concatenate the final sets of paragraphs selected by each model and remove duplicates, preserving the highest score. We then apply the grid search method described in the previous section to select the final set of paragraphs. It is important to note that our method does not combine scores between models; it ensures that only individual answers with a certain degree of confidence are kept in the final answer, which generally leads to an increase in Precision. The final answer for each test example can be composed of individual answers from one model or from both.

4 RESULTS
We present our main results in Table 2. Our baseline BM25 method scores above the median of submissions in both COLIEE 2020 and 2021 (row 2 vs. 1a). This confirms that BM25 is a strong baseline, in agreement with results from other competitions such as the Health Misinformation and Precision Medicine tracks of TREC 2020 [27].

Our pretrained transformer models (rows 3, 4 and 5) score above BM25, the best submission of 2020 [24], and the second-best submission of 2021. Likewise, our ensemble method effectively combines DeBERTa and monoT5 predictions, achieving the best score among all submissions (row 6). However, the performance of monoT5-zero-shot decreases when combined with DeBERTa (row 5 vs. 7), showing that monoT5-zero-shot is a strong model on its own.

The most interesting comparison is between monoT5 and monoT5-zero-shot (rows 4 and 5). On the 2020 test data, monoT5 showed better results than monoT5-zero-shot, so we decided to submit only the fine-tuned model to the 2021 competition. After the release of the ground-truth annotations of the 2021 test set, our evaluation showed that monoT5-zero-shot performs better than monoT5. A similar "inversion" pattern was found for DeBERTa vs. monoT5 (rows 3 and 4): DeBERTa was better than monoT5 on the 2020 test set, but the opposite happened on the 2021 test set.

One explanation for these results is that we overfit on the 2020 test data, i.e., by (unintentionally) selecting techniques and hyperparameters that gave the best results on the 2020 test set as experiments progressed. However, this is unlikely to be the case for our fine-tuned monoT5 model, as our hyperparameter selection is fully automatic and maximized on the development set, whose data is from COLIEE competitions before 2020.

Another explanation is a significant difference between the annotation methodologies of 2020 and 2021, from which models specialized in the 2020 data would suffer. However, this is also unlikely, since BM25 performed similarly in both years. Furthermore, we cannot confirm this hypothesis, since it is difficult to quantify differences in the annotation process.

Regardless of the reason for the inversion, our main finding is that our zero-shot model performed at least comparably to fine-tuned models on the 2020 test set and achieved the best single-model result on the 2021 test data.

4.1 Ablation of the Answer Selection Method
In Table 3, we show the ablation results of the answer selection method proposed in Section 3.5. Our baseline answer selection method, which we refer to as "no rule" in the table, uses only the paragraph with the highest score as the final answer set, i.e., α = γ = 0 and β = 1.
Zero-shot Models for Legal Case Entailment ICAIL’21, June 21–25, 2021, São Paulo, Brazil
For all models, the proposed answer selection method gives an improvement of 0.6 to two F1 points over this baseline.

Model                          F1      Prec    Recall    α, β, γ
monoT5-zero-shot (no rule)     0.6517  0.7373  0.5840    0, 1, 0
monoT5-zero-shot               0.6577  0.7400  0.5920    0, 3, 0.995
monoT5 (no rule)               0.6755  0.7600  0.6080    0, 1, 0
monoT5                         0.6887  0.7155  0.6640    0, 3, 0.995
DeBERTa (no rule)              0.6933  0.7800  0.6240    0, 1, 0
DeBERTa                        0.7094  0.7614  0.6640    0, 2, 0.999
DebertaT5-zero-shot (no rule)  0.6875  0.7777  0.6160    0, 1, 0
DebertaT5-zero-shot            0.7038  0.7592  0.6560    0.6, 2, 0.999
DebertaT5 (no rule)            0.7022  0.7900  0.6320    0, 1, 0
DebertaT5                      0.7217  0.7904  0.6640    0.6, 2, 0.999

Table 3: Ablation on the 2020 data of the answer selection method presented in Section 3.5.

5 CONCLUSION
We confirm a counter-intuitive result on a legal case entailment task: models with little or no adaptation to the target task can have better generalization abilities than models that have been carefully fine-tuned to the task at hand. Domain adversarial fine-tuning [41] and changes to the Adam optimizer [6, 49] have been proposed as valid approaches for fine-tuning Transformer models on small domain-specific datasets. However, whether these techniques can make models fine-tuned on target-task data outperform zero-shot approaches on the legal case entailment task remains an open question.

Therefore, although domain-specific language model pretraining and adjustments to the fine-tuning process are promising directions for future research, we believe that zero-shot approaches should not be ignored as strong baselines for such experiments.

It should also be noted that our research has implications for future experiments beyond the scope of legal case entailment. Based on previous work by Yin et al. [45, 46], it is possible that other legal tasks with limited labeled data, such as legal question answering, may benefit from our zero-shot approach.

REFERENCES
[1] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018).
[2] Purbid Bambroo and Aditi Awasthi. 2021. LegalDB: Long DistilBERT for Legal Document Classification. In 2021 International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT). IEEE, 1–4.
[3] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training. arXiv. https://www.microsoft.com/en-us/research/publication/unilmv2-pseudo-masked-language-models-for-unified-language-model-pre-training/
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[5] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27 (2019), 171–198.
[6] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. 2020. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. arXiv:2004.12651 [cs.CL]
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[8] Ahmed Elnaggar, Christoph Gebendorfer, Ingo Glaser, and Florian Matthes. 2018. Multi-Task Deep Learning for Legal Document Translation, Summarization and Multi-Label Classification. In AICCC '18: Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference. 9–15.
[9] Ahmed Elnaggar, Bernhard Waltl, Ingo Glaser, Jörg Landthaler, Elena Scepankova, and Florian Matthes. 2018. Stop Illegal Comments: A Multi-Task Deep Learning Approach. In AICCC '18: Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference. 41–47.
[10] Emad Elwany, Dave Moore, and Gaurav Oberoi. 2019. BERT Goes to Law School: Quantifying the Competitive Advantage of Access to Large Legal Corpora in Contract Understanding. In Workshop on Document Intelligence at NeurIPS 2019.
[11] Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline. arXiv preprint arXiv:2101.08751 (2021).
[12] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654 [cs.CL]
[13] Yoshinobu Kano, M. Kim, R. Goebel, and K. Satoh. 2017. Overview of COLIEE 2017. In COLIEE 2017 (EPiC Series in Computing, vol. 47). 1–8.
[14] Yoshinobu Kano, Mi-Young Kim, Masaharu Yoshioka, Yao Lu, Juliano Rabelo, Naoki Kiyota, Randy Goebel, and Ken Satoh. 2018. COLIEE-2018: Evaluation of the competition on legal information extraction and entailment. In JSAI International Symposium on Artificial Intelligence. 177–192.
[15] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1896–1907.
[16] Spyretta Leivaditi, Julien Rossi, and Evangelos Kanoulas. 2020. A Benchmark for Lease Contract Review. arXiv preprint arXiv:2010.10386 (2020).
[17] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7871–7880.
[18] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations. arXiv preprint arXiv:2102.10073 (2021).
[19] Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467 (2020).
[20] Jiang Lu, Pinghua Gong, Jieping Ye, and Changshui Zhang. 2020. Learning from Very Few Samples: A Survey. arXiv preprint arXiv:2009.02653 (2020).
[21] Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xiang Ji, and Xueqi Cheng. 2021. PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 283–291.
[22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013).
[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[24] Ha-Thanh Nguyen, Hai-Yen Thi Vuong, Phuong Minh Nguyen, Binh Tran Dang, Quan Minh Bui, Sinh Trong Vu, Chau Minh Nguyen, Vu Tran, Ken Satoh, and Minh Le Nguyen. 2020. JNLP Team: Deep Learning for Legal Processing in COLIEE 2020. arXiv preprint arXiv:2011.08071 (2020).
[25] Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 708–718.
[26] Matthew E Peters, Sebastian Ruder, and Noah A Smith. 2019. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). 7–14.
ICAIL’21, June 21–25, 2021, São Paulo, Brazil Rosa et al.
[27] Ronak Pradeep, Xueguang Ma, Xinyu Zhang, Hang Cui, Ruizhou Xu, Rodrigo Nogueira, and Jimmy Lin. [n.d.]. H2oloo at TREC 2020: When all you got is a hammer... Deep Learning, Health Misinformation, and Precision Medicine. In Text REtrieval Conference (TREC 2020).
[51] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. JEC-QA: A Legal-Domain Question Answering Dataset. Proceedings of the AAAI Conference on Artificial Intelligence 34(05) (2020), 9701–9708.
[28] J. Rabelo, M.Y. Kim, and R. Goebel. 2020. Application of text entailment techniques
in COLIEE 2020. International Workshop on Juris-informatics (JURISIN) associated
with JSAI International Symposia on AI (JSAI-isAI) (2020).
[29] Juliano Rabelo, Mi-Young Kim, and Randy Goebel. 2019. Combining similarity and
transformer methods for case law entailment. In Proceedings of the Seventeenth
International Conference on Artificial Intelligence and Law (ICAIL ’19). 290–296.
[30] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu
Kano, and Ken Satoh. 2019. A Summary of the COLIEE 2019 Competition. In
JSAI International Symposium on Artificial Intelligence. 34–49.
[31] Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masaharu Yoshioka, Yoshinobu
Kano, and Ken Satoh. 2020. COLIEE 2020: Methods for Legal Document Retrieval
and Entailment. (2020).
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
et al. 2021. Learning transferable visual models from natural language supervision.
arXiv preprint arXiv:2103.00020 (2021).
[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits
of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine
Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
[34] Kirk Roberts, Dina Demner-Fushman, E. Voorhees, W. Hersh, Steven Bedrick, Alexander J. Lazar, and S. Pant. 2019. Overview of the TREC 2019 Precision Medicine Track. In Text REtrieval Conference (TREC) 26 (2019).
[35] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP 109 (1995), 109.
[36] Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few-shot
text classification and natural language inference. arXiv preprint arXiv:2001.07676
(2020).
[37] Shohreh Shaghaghian, Luna Yue Feng, Borna Jafarpour, and Nicolai Pogreb-
nyakov. 2020. Customizing Contextualized Language Models for Legal Docu-
ment Reviews. In 2020 IEEE International Conference on Big Data (Big Data). IEEE,
2139–2148.
[38] Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin
Raffel. 2021. Improving and Simplifying Pattern Exploiting Training. arXiv
preprint arXiv:2103.11955 (2021).
[39] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna
Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of
Information Retrieval Models. arXiv preprint arXiv:2104.08663 (4 2021). https:
//arxiv.org/abs/2104.08663
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All
you Need. In NIPS.
[41] Giorgos Vernikos, Katerina Margatina, Alexandra Chronopoulou, and Ion An-
droutsopoulos. 2020. Domain Adversarial Fine-Tuning as an Effective Regularizer.
arXiv:2009.13366 [cs.LG]
[42] Ellen M. Voorhees. 2004. Overview of the TREC 2004 Robust Track. Proceedings
of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland,
November 16-19, 2004 (2004).
[43] Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong
Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018.
CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. arXiv:1807.02478
(2018).
[44] Chin Man Yeung. 2019. Effects of inserting domain vocabulary and fine-tuning
BERT for German legal language. Master’s thesis. University of Twente.
[45] Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking Zero-
shot Text Classification: Datasets, Evaluation and Entailment Approach.
arXiv:1909.00161 [cs.CL]
[46] Wenpeng Yin, Nazneen Fatema Rajani, Dragomir Radev, Richard Socher, and
Caiming Xiong. 2020. Universal Natural Language Processing with Limited An-
notations: Try Few-shot Textual Entailment as a Start. arXiv:2010.02584 [cs.CL]
[47] Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN:
Attention-based convolutional neural network for modeling sentence pairs. Trans-
actions of the Association for Computational Linguistics 4 (2016), 259–272.
[48] Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, and Jimmy
Lin. 2020. Rapidly Deploying a Neural Search Engine for the COVID-19 Open
Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL
2020.
[49] Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi.
2021. Revisiting Few-sample BERT Fine-tuning. arXiv:2006.05987 [cs.CL]
[50] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and
Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal
Artificial Intelligence. arXiv:2004.12158 (2020).
Index of Authors
Trecenti, Julio, 240
Troussel, Aurore, 40, 129
Tsushima, Kanae, 50
Wróbel, Krzysztof, 225
Xu, Huihui, 250