HLDC: Hindi Legal Documents Corpus

All content following this page was uploaded by Ashutosh Modi on 12 April 2022.
entities like persons, locations, organisations, and time entities. COLIEE (Competition on Legal Information Extraction/Entailment) tasks (Kano et al., 2019, 2017) have published legal data in Japanese, along with their English translation. The competition has two sub-tasks, a legal information retrieval task and an entailment identification task between law articles and queries. Multiple datasets in Chinese have been released for different tasks, namely Reading Comprehension (Duan et al., 2019), Similar Case Matching (Xiao et al., 2019), and Question Answering (Zhong et al., 2020b). Duan et al. (2019) proposed the Chinese judicial reading comprehension (CJRC) dataset with about 10K documents and almost 50K questions with answers. Zhong et al. (2020b) presented JEC-QA, a legal question answering dataset collected from the National Judicial Examination of China with about 26K multiple-choice questions. They augment the dataset with a database containing the legal knowledge required to answer the questions and also assign meta information to each of the questions for in-depth analysis. Xiao et al. (2019) proposed CAIL2019-SCM, a dataset containing 8,964 triplets of case documents, with the objective of identifying which two cases in each triplet are more similar. Similar case matching has a crucial application as it helps to identify comparable historical cases. A historical case with similar facts often serves as a legal precedent and influences the judgement. Such historical information can be used to make legal judgement prediction models more robust.

Kleinberg et al. (2017) proposed bail decision prediction as a good proxy to gauge whether machine learning can improve human decision making. The large number of bail documents, along with the binary decision (granted or denied), makes it an ideal task for automation. In this paper, we also propose the bail prediction task using the HLDC corpus.

3 Hindi Legal Documents Corpus

The Hindi Legal Documents Corpus (HLDC) is a corpus of 912,568 Indian legal case documents in the Hindi language. The corpus is created by downloading data from the e-Courts website (a publicly available website: https://districts.ecourts.gov.in/). All the legal documents we consider are in the public domain. We download case documents pertaining to the district courts located in the northern Indian state of Uttar Pradesh (U.P.). We focus mainly on the state of U.P. as it is the most populous state of India, resulting in the filing of a large number of cases in district courts. U.P. has 71 districts and about 161 district courts. U.P. is a predominantly Hindi-speaking state, and consequently, the official language used in district courts is Hindi. We crawled case documents from all districts of U.P. corresponding to cases filed over two years, from May 01, 2019 to May 01, 2021. Figure 2 shows the map of U.P. and the district-wise variation in the number of cases. As can be seen in the plot, the western side of the state has more cases; this is possibly due to the higher population and greater urbanization in the western part. Table 1 shows the %-wise division of different case types in HLDC. As evident from the table, the majority of documents pertain to bail applications. The HLDC corpus has a total of 3,797,817 unique tokens, and on average, each document has 764 tokens.

HLDC Creation Pipeline: We outline the entire pipeline used to create the corpus in Figure 1. The documents on the website are originally typed in Hindi (in Devanagari script) and then scanned to PDF format and uploaded. The first step in HLDC creation is the downloading of documents from the e-Courts website. We downloaded a total of 1,221,950 documents. To extract Hindi text from these, we perform OCR (Optical Character Recognition) via the Tesseract tool [1]. Tesseract worked well for our use case, as the majority of case documents were well-typed, and it outperformed other OCR libraries [2]. The obtained text documents were further cleaned to remove noisy documents, e.g. too short (< 32 bytes) or too long (> 8096 bytes) documents, duplicates, and English documents (details in Appendix B).

[1] https://github.com/tesseract-ocr
[2] https://github.com/JaidedAI/EasyOCR
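The cleaning step described above (length bounds of 32 and 8096 bytes from the paper, duplicate removal, and dropping English documents) can be sketched as below. This is a minimal sketch, not the authors' pipeline: the Devanagari-ratio heuristic and its 0.5 threshold are illustrative assumptions for the language filter.

```python
import hashlib

MIN_BYTES, MAX_BYTES = 32, 8096  # length bounds stated in the paper


def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari Unicode block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum('\u0900' <= c <= '\u097F' for c in chars) / len(chars)


def clean_corpus(docs):
    """Drop too-short/too-long documents, exact duplicates, and mostly non-Hindi documents."""
    seen, kept = set(), []
    for text in docs:
        size = len(text.encode('utf-8'))
        if size < MIN_BYTES or size > MAX_BYTES:
            continue  # too short or too long
        digest = hashlib.md5(text.encode('utf-8')).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        if devanagari_ratio(text) < 0.5:  # assumed threshold, not from the paper
            continue  # likely an English (non-Devanagari) document
        kept.append(text)
    return kept
```

A real pipeline would likely use a proper language-identification model rather than a character-block ratio, but the filtering structure is the same.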
[Figure 2: map of U.P. showing district-wise variation in total cases; e.g., Muzaffarnagar has roughly 80k cases. Plot not recovered.]
[Table 1: %-wise division of case types in HLDC; table body not recovered.]
[Figure 1: HLDC creation pipeline. Hindi Legal Documents Corpus (HLDC) → Filter Bail Documents → Bail Outcome Identification (Bail Granted / Bail Denied) → Judge's Opinion Mining → Bail Documents with Judge's Opinion.]

[Model architecture figure: a frozen multilingual Sentence-BERT sentence encoder produces contextual sentence embeddings; a trainable shared transformer encoder layer feeds two task-specific heads, a bail predictor (loss L_bail) and a salience classifier (loss L_salience), with L_total = L_bail + L_salience. Shared layers vs. task-specific layers are distinguished in the figure.]
auxiliary task. The intuition is that predicting the important sentences via the auxiliary task would force the model to make better predictions, and vice-versa. The input to the model is the sentences corresponding to the facts of a case: s_i^1, s_i^2, ..., s_i^k. A multilingual sentence encoder (Reimers and Gurevych, 2020) is used to get a contextualized representation of each sentence: {h_i^1, h_i^2, ..., h_i^k}. In addition, we append the sentence representations with a special randomly initialized CLS embedding (Devlin et al., 2019) that gets updated during model training. The CLS and sentence embeddings are fed into a standard single-layer transformer architecture (shared transformer).

6.3.1 Bail Prediction Task

A classification head (fully connected MLP layer) on top of the transformer CLS embedding is used to perform bail prediction. We use the standard cross-entropy loss (L_bail) for training.

6.3.2 Salience Classification Task

We use the salience prediction head (MLP) on top of the sentence representations at the output of the shared transformer. For training the auxiliary task, we use sentence salience scores obtained via cosine similarity (these come from the supervised summarization based model). For each sentence, we use the binary cross-entropy loss (L_salience) to predict the salience. Based on our empirical investigations, both losses are equally weighted, and the total loss is given by L = L_bail + L_salience.

[Table 2: Number of documents across each split; table body not recovered.]

7 Experiments and Results

7.1 Dataset Splits

We evaluate the models in two settings: all-district performance and district-wise performance. For the first setting, the model is trained and tested on documents coming from all districts. The train, validation and test split is 70:10:20. The district-wise setting tests the generalization capabilities of the model. In this setting, documents from 44 districts (randomly chosen) are used for training. Testing is done on a different set of 17 districts not present in the train set. The validation set has another set of 10 districts. This split corresponds to a 70:10:20 ratio. Table 2 provides the number of documents across splits. The corpus is unbalanced for the prediction class, with about a 60:40 ratio of positive to negative class (Table 2). All models are evaluated using the standard accuracy and F1-score metrics (Appendix H.1).

Implementation Details: All models are trained using GeForce RTX 2080Ti GPUs. Models are tuned for hyper-parameters using the validation set (details in Appendix H.2).
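The equally weighted multi-task objective can be sketched numerically as follows. This is a minimal sketch under assumptions: the paper only specifies L = L_bail + L_salience, so the function names and the per-sentence averaging of the salience loss are illustrative choices, not the authors' implementation.

```python
import math


def bce(p: float, y: int) -> float:
    """Binary cross-entropy for one predicted probability p in (0, 1) and label y in {0, 1}."""
    eps = 1e-12  # numerical guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def total_loss(bail_prob, bail_label, salience_probs, salience_labels):
    """Equally weighted multi-task objective: L = L_bail + L_salience.

    L_bail: cross-entropy on the document-level bail decision (CLS head).
    L_salience: mean binary cross-entropy over per-sentence salience predictions
    (averaging over sentences is an assumption).
    """
    l_bail = bce(bail_prob, bail_label)
    l_salience = sum(bce(p, y) for p, y in zip(salience_probs, salience_labels)) / len(salience_probs)
    return l_bail + l_salience
```

In practice both heads would be trained jointly by backpropagating this summed loss through the shared transformer layer.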
Model Name                 District-wise     All Districts
                           Acc.    F1        Acc.    F1
Doc2Vec + SVM              0.72    0.69      0.79    0.77
Doc2Vec + XGBoost          0.68    0.59      0.67    0.57
IndicBert-First 512        0.65    0.62      0.73    0.71
IndicBert-Last 512         0.62    0.60      0.78    0.76
TF-IDF+IndicBert           0.76    0.74      0.82    0.81
TextRank+IndicBert         0.76    0.74      0.82    0.81
Salience Pred.+IndicBert   0.76    0.74      0.80    0.78
Multi-Task                 0.78    0.77      0.80    0.78

Table 3: Model results. For TF-IDF and TextRank models we take the sum of the token embeddings.

7.2 Results

The results are shown in Table 3. As can be observed, in general, the performance of models is lower in the district-wise setting. This is possibly due to the lexical variation (Section 3) across districts, which makes it difficult for the model to generalize. Moreover, this lexical variation corresponds to the usage of words from different dialects of Hindi. Another thing to note from the results is that, in general, summarization-based models perform better than Doc2Vec and transformer-based models, highlighting the importance of the summarization step in the bail prediction task. The proposed end-to-end multi-task model outperforms all the baselines in the district-wise setting with 78.53% accuracy. The auxiliary task of sentence salience classification helps learn robust features during training and adds a regularization effect on the main task of bail prediction, leading to improved performance over the two-step baselines. However, in the case of the all-district split, the MTL model fails to beat simpler baselines like TF-IDF+IndicBERT. We hypothesize that this is due to the fact that the sentence salience training […]ments from different districts. Since documents of all districts are combined in this setting, this may introduce diverse sentences, which are harder to encode for the salience classifier, while TF-IDF is able to look at the distribution of words across […]

[…]ments to the bail prediction system. After examining the mis-classified examples, we observed the following. First, the lack of standardization can manifest in unique ways. In one of the documents, we observed that all the facts and arguments seemed to point to the decision of bail granted. Our model also gauged this correctly and predicted bail granted. However, the actual result of the document showed that, even though bail was initially granted, because the accused failed to show up on multiple occasions the judge overturned the decision and the final verdict was bail denied. In some instances, we also observed that even if the facts of the cases are similar, the judgements can differ. We observed two cases about the illegal possession of drugs that differed only slightly in the quantity seized but had different decisions. The model is trained only on the documents and has no access to legal knowledge, hence it is not able to capture such legal nuances. We also performed quantitative analysis on the model output to better understand the performance. Our model outputs a probabilistic score in the range [0, 1]. A score closer to 0 indicates our model is confident that bail would be denied, while a score closer to 1 means bail granted. In Figure 6 we plot the ROC curve to showcase the capability of the model at different classification thresholds. ROC plots the True Positive and False Positive rates at different thresholds. The area under the ROC curve (AUC) is a measure of aggregated classification performance. Our proposed model has an AUC score of 0.85, indicating high classification accuracy for a challenging problem.

[Figure 6: ROC curve of the proposed model; plot not recovered.]
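The AUC discussed above can be computed without tracing the full curve, via the standard rank-statistic equivalence: AUC equals the probability that a randomly chosen positive (bail granted) example scores higher than a randomly chosen negative one, with ties counting one half. This is a generic sketch of that equivalence, not the paper's evaluation code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-statistic (Mann-Whitney) equivalence.

    scores: model output probabilities in [0, 1]; labels: 1 = bail granted, 0 = denied.
    Assumes both classes are present; ties between a positive and a negative count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs where the positive outranks the negative.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The quadratic pairwise loop is fine for illustration; a production implementation would sort once and use ranks.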
[Acknowledgments fragment, truncated in extraction: "… Grant 2021."]

[Figure 7: KDE plot of model output scores; plot not recovered.]

Figure 7: Kernel Density Estimate (KDE) plots of our proposed bail prediction model. The majority of errors (incorrectly dismissed / granted) are borderline cases with model output score around 0.5.

8 Future Work and Conclusion

In this paper, we introduced a large corpus of legal documents for the under-resourced language Hindi: the Hindi Legal Documents Corpus (HLDC). We semi-structure the documents to make them amenable for further use in downstream applications. As a use-case for HLDC, we introduce the task of Bail Prediction. We experimented with several models and proposed a multi-task learning based model that predicts salient sentences as an auxiliary task and bail prediction as the main task. Results show scope for improvement that we plan to explore in future. We also plan to expand HLDC by covering other Indian Hindi-speaking states. Furthermore, as a future direction, we plan to collect legal documents in other Indian languages.

References

Ahsaas Bajaj, Pavitra Dangati, Kalpesh Krishna, Pradhiksha Ashok Kumar, Rheeya Uppaal, Bradford Windsor, Eliot Brenner, Dominic Dotterrer, Rajarshi Das, and Andrew McCallum. 2021. Long document summarization in a low resource setting using pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 71-80, Online. Association for Computational Linguistics.

Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and Saptarshi Ghosh. 2019. A Comparative Study of Summarization Algorithms Applied to Legal Case Judgments. In Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra, editors, Advances in Information Retrieval, volume 11437, pages 413-428. Springer International Publishing, Cham.

Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4317-4323, Florence, Italy. Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6974-6996, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xingyi Duan, Baoxin Wang, Ziyue Wang, Wentao Ma, Yiming Cui, Dayong Wu, Shijin Wang, Ting Liu, Tianxiang Huo, Zhen Hu, Heng Wang, and Zhiyuan Liu. 2019. CJRC: A Reliable Human-Annotated Benchmark DataSet for Chinese Judicial Reading Comprehension. In Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu, editors, Chinese Computational Linguistics, volume 11856, pages 439-451. Springer International Publishing, Cham.

Filippo Galgani, Paul Compton, and Achim Hoffmann. 2012. Towards automatic generation of catchphrases for legal case reports. In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part II, CICLing'12, pages 414-425, Berlin, Heidelberg. Springer-Verlag.

Peter Jackson, Khalid Al-Kofahi, Alex Tyrrell, and Arun Vachher. 2003. Information extraction from case law and retrieval of prior cases. Artificial Intelligence, 150(1):239-290. AI and Law.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948-4961, Online. Association for Computational Linguistics.

Prathamesh Kalamkar, Aman Tiwari, Astha Agarwal, Saurabh Karn, Smita Gupta, Vivek Raghavan, and Ashutosh Modi. 2022. Corpus for automatic structuring of legal documents. CoRR, abs/2201.13125.

Yoshinobu Kano, Mi-Young Kim, Randy Goebel, and Ken Satoh. 2017. Overview of COLIEE 2017. In COLIEE@ICAIL.

Yoshinobu Kano, Mi-Young Kim, Masaharu Yoshioka, Yao Lu, Juliano Rabelo, Naoki Kiyota, Randy Goebel, and Ken Satoh. 2019. COLIEE-2018: Evaluation of the Competition on Legal Information Extraction and Entailment. In Kazuhiro Kojima, Maki Sakamoto, Koji Mineshima, and Ken Satoh, editors, New Frontiers in Artificial Intelligence, volume 11717, pages 177-192. Springer International Publishing, Cham.

Justice Markandey Katju. 2019. Backlog of cases crippling judiciary. https://tinyurl.com/v4xu6mvk.

Mi-Young Kim and Randy Goebel. 2017. Two-step cascaded textual entailment for legal bar exam question answering. In Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, pages 283-290, London, United Kingdom. ACM.

Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2017. Human Decisions and Machine Predictions. The Quarterly Journal of Economics.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, pages II-1188-II-1196. JMLR.org.

Pedro H. Luz de Araujo, Teófilo E. de Campos, Renato R. R. de Oliveira, Matheus Stauffer, Samuel Couto, and Paulo Bermejo. 2018. LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In International Conference on the Computational Processing of Portuguese (PROPOR), Lecture Notes on Computer Science (LNCS), pages 313-323, Canela, RS, Brazil. Springer.

Vijit Malik, Rishabh Sanjay, Shouvik Kumar Guha, Shubham Kumar Nigam, Angshuman Hazarika, Arnab Bhattacharya, and Ashutosh Modi. 2021a. Semantic segmentation of legal documents via rhetorical roles. CoRR, abs/2112.01836.

Vijit Malik, Rishabh Sanjay, Shubham Kumar Nigam, Kripabandhu Ghosh, Shouvik Kumar Guha, Arnab Bhattacharya, and Ashutosh Modi. 2021b. ILDC for CJPE: Indian legal documents corpus for court judgment prediction and explanation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4046-4062, Online. Association for Computational Linguistics.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404-411.

National Judicial Data Grid. 2021. National judicial data grid statistics. https://www.njdg.ecourts.gov.in/njdgnew/index.php.

PopulationU. 2021. Population of Uttar Pradesh. https://www.populationu.com/in/uttar-pradesh-population.

Juan Enrique Ramos. 2003. Using tf-idf to determine word relevance in document queries.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512-4525, Online. Association for Computational Linguistics.

Yunqiu Shao, Jiaxin Mao, Yiqun Liu, Weizhi Ma, Ken Satoh, Min Zhang, and Shaoping Ma. 2020. BERT-PLI: Modeling Paragraph-Level Interactions for Legal Case Retrieval. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 3501-3507, Yokohama, Japan. International Joint Conferences on Artificial Intelligence Organization.
quite strict and hence the results are much lower for the Judge's summary part. The segmentation and Judge's opinion were strictly evaluated, and even a single sentence in the wrong segment reduces the accuracy. We also see that the main binary label of outcome detection (bail granted or denied) had an almost perfect accuracy of 99.4%. Nevertheless, in future we plan to improve our pipeline further by training machine learning models.

Process                                      Accuracy
Header, Body and Case Result Segmentation    89.7%
Judge's Opinion and Facts extraction         85.7%
Bail Decision Extraction                     99.4%

Table 7: Evaluation results of bail document division and bail decision extraction pipeline.

Text Amount | In Value Form
[table rows not recovered from the scan]

Table 8: Text bail amount mapping example.

H Model Details

H.1 Evaluation Metrics

To evaluate the performance of all the models, we use Accuracy and F1-score, which are considered standard evaluation metrics when performing classification experiments. These are mathematically described as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 Score = (2 * Precision * Recall) / (Precision + Recall)

where TP, FP, TN, and FN denote True Positives, False Positives, True Negatives, and False Negatives, respectively. The mathematical formulation for Precision and Recall is given as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

H.2 Hyperparameter Tuning

We used Optuna [7] for hyperparameter optimisation. Optuna allows us to easily define search spaces, select optimisation algorithms, and scale with easy parallelization. We run parameter tuning on 10% of the data to identify the best parameters before retraining the model with the best parameters on the entire dataset. The best parameters are listed in Table 9.

[Table 9: Listing of Hyper-Parameters for Training of Models; table body not recovered.]

[7] https://github.com/optuna/optuna
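The metrics defined in Appendix H.1 can be implemented directly from the formulas. This is a generic sketch, not the authors' evaluation script; the zero-division guards are a standard convention, not specified in the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = bail granted, 0 = bail denied)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Accuracy, Precision, Recall, and F1, matching the formulas in Appendix H.1."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard: no positive labels
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```

For an imbalanced corpus like this one (about 60:40), F1 is the more informative of the two reported metrics.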
I Sample Segmented Document

Header: This chunk of the document contains meta information related to the case, like the court hearing date, IPC sections attached, police station of the complaint, etc.

Judge's Opinion: This refers to the few lines present in the middle portion of the document where the judge writes their opinion of the case.
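The segments annotated above can be held in a simple semi-structured record. This is only an illustrative sketch: the field names below are hypothetical and not the corpus's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentedBailDocument:
    """Illustrative semi-structured view of one segmented bail document (hypothetical schema)."""
    header: str                                  # meta info: hearing date, IPC sections, police station
    facts: List[str] = field(default_factory=list)  # facts and arguments of the case
    judges_opinion: str = ""                     # judge's opinion from the middle of the document
    result: str = ""                             # extracted outcome: "granted" or "denied"


# Example instance with placeholder content.
doc = SegmentedBailDocument(
    header="Court: ..., Hearing date: ..., IPC sections: ...",
    facts=["fact sentence 1", "fact sentence 2"],
    judges_opinion="...",
    result="granted",
)
```

A record like this is what makes downstream tasks such as bail prediction straightforward: the model consumes `facts`, while `result` provides the label.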