HLDC: Hindi Legal Documents Corpus

All content following this page was uploaded by Ashutosh Modi on 12 April 2022.
entities like persons, locations, organisations, and time entities. COLIEE (Competition on Legal Information Extraction/Entailment) tasks (Kano et al., 2019, 2017) have published legal data in Japanese, along with their English translation. The competition has two sub-tasks, a legal information retrieval task and an entailment identification task between law articles and queries. Multiple datasets in Chinese have been released for different tasks, namely Reading Comprehension (Duan et al., 2019), Similar Case Matching (Xiao et al., 2019), and Question Answering (Zhong et al., 2020b). Duan et al. (2019) proposed the Chinese judicial reading comprehension (CJRC) dataset with about 10K documents and almost 50K questions with answers. Zhong et al. (2020b) presented JEC-QA, a legal question answering dataset collected from the National Judicial Examination of China with about 26K multiple-choice questions. They augment the dataset with a database containing the legal knowledge required to answer the questions and also assign meta information to each of the questions for in-depth analysis. Xiao et al. (2019) proposed CAIL2019-SCM, a dataset containing 8,964 triplets of case documents, with the objective of identifying which two cases in each triplet are more similar. Similar case matching has a crucial application as it helps to identify comparable historical cases. A historical case with similar facts often serves as a legal precedent and influences the judgement. Such historical information can be used to make legal judgement prediction models more robust.

Kleinberg et al. (2017) proposed bail decision prediction as a good proxy to gauge whether machine learning can improve human decision making. The large number of bail documents, along with the binary decision (granted or denied), makes it an ideal task for automation. In this paper, we also propose the bail prediction task using the HLDC corpus.

3 Hindi Legal Documents Corpus

The Hindi Legal Documents Corpus (HLDC) is a corpus of 912,568 Indian legal case documents in the Hindi language. The corpus is created by downloading data from the e-Courts website (a publicly available website: https://districts.ecourts.gov.in/). All the legal documents we consider are in the public domain. We download case documents pertaining to the district courts located in the northern Indian state of Uttar Pradesh (U.P.). We focus mainly on the state of U.P. as it is the most populous state of India, resulting in the filing of a large number of cases in district courts. U.P. has 71 districts and about 161 district courts. U.P. is a predominantly Hindi-speaking state, and consequently, the official language used in district courts is Hindi. We crawled case documents from all districts of U.P. corresponding to cases filed over two years, from May 01, 2019 to May 01, 2021. Figure 2 shows the map of U.P. and the district-wise variation in the number of cases. As can be seen in the plot, the western side of the state has more cases; this is possibly due to the higher population and greater urbanization in the western part. Table 1 shows the %-wise division of different case types in HLDC. As evident from the table, the majority of documents pertain to bail applications. The HLDC corpus has a total of 3,797,817 unique tokens, and on average, each document has 764 tokens.

HLDC Creation Pipeline: We outline the entire pipeline used to create the corpus in Figure 1. The documents on the website are originally typed in Hindi (in Devanagari script) and then scanned to PDF format and uploaded. The first step in HLDC creation is the downloading of documents from the e-Courts website. We downloaded a total of 1,221,950 documents. To extract Hindi text from these, we perform OCR (Optical Character Recognition) via the Tesseract tool [1]. Tesseract worked well for our use case, as the majority of case documents were well-typed, and it outperformed other OCR libraries [2]. The obtained text documents were further cleaned to remove noisy documents, e.g. too short (< 32 bytes) or too long (> 8096 bytes) documents, duplicates, and English documents (details in Appendix B).

[1] https://github.com/tesseract-ocr
[2] https://github.com/JaidedAI/EasyOCR
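The cleaning step described above (length bounds of 32 and 8096 bytes from the paper, duplicate removal, and dropping English documents) can be sketched as below. This is a minimal sketch, not the authors' pipeline: the Devanagari-ratio heuristic and its 0.5 threshold are illustrative assumptions for the language filter.

```python
import hashlib

MIN_BYTES, MAX_BYTES = 32, 8096  # length bounds stated in the paper


def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari Unicode block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum('\u0900' <= c <= '\u097F' for c in chars) / len(chars)


def clean_corpus(docs):
    """Drop too-short/too-long documents, exact duplicates, and mostly non-Hindi documents."""
    seen, kept = set(), []
    for text in docs:
        size = len(text.encode('utf-8'))
        if size < MIN_BYTES or size > MAX_BYTES:
            continue  # too short or too long
        digest = hashlib.md5(text.encode('utf-8')).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        if devanagari_ratio(text) < 0.5:  # assumed threshold, not from the paper
            continue  # likely an English (non-Devanagari) document
        kept.append(text)
    return kept
```

A real pipeline would likely use a proper language-identification model rather than a character-block ratio, but the filtering structure is the same.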
[Figure 2: map of U.P. showing district-wise variation in total cases; e.g., Muzaffarnagar has roughly 80k cases. Plot not recovered.]
[Table 1: %-wise division of case types in HLDC; table body not recovered.]
[Figure 1: HLDC creation pipeline. Hindi Legal Documents Corpus (HLDC) → Filter Bail Documents → Bail Outcome Identification (Bail Granted / Bail Denied) → Judge's Opinion Mining → Bail Documents with Judge's Opinion.]

[Model architecture figure: a frozen multilingual Sentence-BERT sentence encoder produces contextual sentence embeddings; a trainable shared transformer encoder layer feeds two task-specific heads, a bail predictor (loss L_bail) and a salience classifier (loss L_salience), with L_total = L_bail + L_salience. Shared layers vs. task-specific layers are distinguished in the figure.]
auxiliary task. The intuition is that predicting the important sentences via the auxiliary task would force the model to make better predictions, and vice-versa. The input to the model is the sentences corresponding to the facts of a case: s_i^1, s_i^2, ..., s_i^k. A multilingual sentence encoder (Reimers and Gurevych, 2020) is used to get a contextualized representation of each sentence: {h_i^1, h_i^2, ..., h_i^k}. In addition, we append the sentence representations with a special randomly initialized CLS embedding (Devlin et al., 2019) that gets updated during model training. The CLS and sentence embeddings are fed into a standard single-layer transformer architecture (shared transformer).

6.3.1 Bail Prediction Task

A classification head (fully connected MLP layer) on top of the transformer CLS embedding is used to perform bail prediction. We use the standard cross-entropy loss (L_bail) for training.

6.3.2 Salience Classification Task

We use the salience prediction head (MLP) on top of the sentence representations at the output of the shared transformer. For training the auxiliary task, we use sentence salience scores obtained via cosine similarity (these come from the supervised summarization based model). For each sentence, we use the binary cross-entropy loss (L_salience) to predict the salience. Based on our empirical investigations, both losses are equally weighted, and the total loss is given by L = L_bail + L_salience.

[Table 2: Number of documents across each split; table body not recovered.]

7 Experiments and Results

7.1 Dataset Splits

We evaluate the models in two settings: all-district performance and district-wise performance. For the first setting, the model is trained and tested on documents coming from all districts. The train, validation and test split is 70:10:20. The district-wise setting tests the generalization capabilities of the model. In this setting, documents from 44 districts (randomly chosen) are used for training. Testing is done on a different set of 17 districts not present in the train set. The validation set has another set of 10 districts. This split corresponds to a 70:10:20 ratio. Table 2 provides the number of documents across splits. The corpus is unbalanced for the prediction class, with about a 60:40 ratio of positive to negative class (Table 2). All models are evaluated using the standard accuracy and F1-score metrics (Appendix H.1).

Implementation Details: All models are trained using GeForce RTX 2080Ti GPUs. Models are tuned for hyper-parameters using the validation set (details in Appendix H.2).
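The equally weighted multi-task objective can be sketched numerically as follows. This is a minimal sketch under assumptions: the paper only specifies L = L_bail + L_salience, so the function names and the per-sentence averaging of the salience loss are illustrative choices, not the authors' implementation.

```python
import math


def bce(p: float, y: int) -> float:
    """Binary cross-entropy for one predicted probability p in (0, 1) and label y in {0, 1}."""
    eps = 1e-12  # numerical guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def total_loss(bail_prob, bail_label, salience_probs, salience_labels):
    """Equally weighted multi-task objective: L = L_bail + L_salience.

    L_bail: cross-entropy on the document-level bail decision (CLS head).
    L_salience: mean binary cross-entropy over per-sentence salience predictions
    (averaging over sentences is an assumption).
    """
    l_bail = bce(bail_prob, bail_label)
    l_salience = sum(bce(p, y) for p, y in zip(salience_probs, salience_labels)) / len(salience_probs)
    return l_bail + l_salience
```

In practice both heads would be trained jointly by backpropagating this summed loss through the shared transformer layer.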
Model Name                 District-wise     All Districts
                           Acc.    F1        Acc.    F1
Doc2Vec + SVM              0.72    0.69      0.79    0.77
Doc2Vec + XGBoost          0.68    0.59      0.67    0.57
IndicBert-First 512        0.65    0.62      0.73    0.71
IndicBert-Last 512         0.62    0.60      0.78    0.76
TF-IDF+IndicBert           0.76    0.74      0.82    0.81
TextRank+IndicBert         0.76    0.74      0.82    0.81
Salience Pred.+IndicBert   0.76    0.74      0.80    0.78
Multi-Task                 0.78    0.77      0.80    0.78

Table 3: Model results. For TF-IDF and TextRank models we take the sum of the token embeddings.

7.2 Results

The results are shown in Table 3. As can be observed, in general, the performance of models is lower in the district-wise setting. This is possibly due to the lexical variation (Section 3) across districts, which makes it difficult for the model to generalize. Moreover, this lexical variation corresponds to the usage of words from different dialects of Hindi. Another thing to note from the results is that, in general, summarization-based models perform better than Doc2Vec and transformer-based models, highlighting the importance of the summarization step in the bail prediction task. The proposed end-to-end multi-task model outperforms all the baselines in the district-wise setting with 78.53% accuracy. The auxiliary task of sentence salience classification helps learn robust features during training and adds a regularization effect on the main task of bail prediction, leading to improved performance over the two-step baselines. However, in the case of the all-district split, the MTL model fails to beat simpler baselines like TF-IDF+IndicBERT. We hypothesize that this is due to the fact that the sentence salience training […]ments from different districts. Since documents of all districts are combined in this setting, this may introduce diverse sentences, which are harder to encode for the salience classifier, while TF-IDF is able to look at the distribution of words across […]

[…]ments to the bail prediction system. After examining the mis-classified examples, we observed the following. First, the lack of standardization can manifest in unique ways. In one of the documents, we observed that all the facts and arguments seemed to point to the decision of bail granted. Our model also gauged this correctly and predicted bail granted. However, the actual result of the document showed that, even though bail was initially granted, because the accused failed to show up on multiple occasions the judge overturned the decision and the final verdict was bail denied. In some instances, we also observed that even if the facts of the cases are similar, the judgements can differ. We observed two cases about the illegal possession of drugs that differed only slightly in the quantity seized but had different decisions. The model is trained only on the documents and has no access to legal knowledge, hence it is not able to capture such legal nuances. We also performed quantitative analysis on the model output to better understand the performance. Our model outputs a probabilistic score in the range [0, 1]. A score closer to 0 indicates our model is confident that bail would be denied, while a score closer to 1 means bail granted. In Figure 6 we plot the ROC curve to showcase the capability of the model at different classification thresholds. ROC plots the True Positive and False Positive rates at different thresholds. The area under the ROC curve (AUC) is a measure of aggregated classification performance. Our proposed model has an AUC score of 0.85, indicating high classification accuracy for a challenging problem.

[Figure 6: ROC curve of the proposed model; plot not recovered.]
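The AUC discussed above can be computed without tracing the full curve, via the standard rank-statistic equivalence: AUC equals the probability that a randomly chosen positive (bail granted) example scores higher than a randomly chosen negative one, with ties counting one half. This is a generic sketch of that equivalence, not the paper's evaluation code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-statistic (Mann-Whitney) equivalence.

    scores: model output probabilities in [0, 1]; labels: 1 = bail granted, 0 = denied.
    Assumes both classes are present; ties between a positive and a negative count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs where the positive outranks the negative.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The quadratic pairwise loop is fine for illustration; a production implementation would sort once and use ranks.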
[Acknowledgments fragment, truncated in extraction: "… Grant 2021."]

[Figure 7: KDE plot of model output scores; plot not recovered.]

Figure 7: Kernel Density Estimate (KDE) plots of our proposed bail prediction model. The majority of errors (incorrectly dismissed / granted) are borderline cases with model output score around 0.5.

8 Future Work and Conclusion

In this paper, we introduced a large corpus of legal documents for the under-resourced language Hindi: the Hindi Legal Documents Corpus (HLDC). We semi-structure the documents to make them amenable for further use in downstream applications. As a use-case for HLDC, we introduce the task of Bail Prediction. We experimented with several models and proposed a multi-task learning based model that predicts salient sentences as an auxiliary task and bail prediction as the main task. Results show scope for improvement that we plan to explore in future. We also plan to expand HLDC by covering other Indian Hindi-speaking states. Furthermore, as a future direction, we plan to collect legal documents in other Indian languages.

References

Ahsaas Bajaj, Pavitra Dangati, Kalpesh Krishna, Pradhiksha Ashok Kumar, Rheeya Uppaal, Bradford Windsor, Eliot Brenner, Dominic Dotterrer, Rajarshi Das, and Andrew McCallum. 2021. Long document summarization in a low resource setting using pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 71-80, Online. Association for Computational Linguistics.

Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and Saptarshi Ghosh. 2019. A Comparative Study of Summarization Algorithms Applied to Legal Case Judgments. In Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra, editors, Advances in Information Retrieval, volume 11437, pages 413-428. Springer International Publishing, Cham.

Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4317-4323, Florence, Italy. Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6974-6996, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xingyi Duan, Baoxin Wang, Ziyue Wang, Wentao Ma, Yiming Cui, Dayong Wu, Shijin Wang, Ting Liu, Tianxiang Huo, Zhen Hu, Heng Wang, and Zhiyuan Liu. 2019. CJRC: A Reliable Human-Annotated Benchmark DataSet for Chinese Judicial Reading Comprehension. In Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu, editors, Chinese Computational Linguistics, volume 11856, pages 439-451. Springer International Publishing, Cham.

Filippo Galgani, Paul Compton, and Achim Hoffmann. 2012. Towards automatic generation of catchphrases for legal case reports. In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part II, CICLing'12, pages 414-425, Berlin, Heidelberg. Springer-Verlag.

Peter Jackson, Khalid Al-Kofahi, Alex Tyrrell, and Arun Vachher. 2003. Information extraction from case law and retrieval of prior cases. Artificial Intelligence, 150(1):239-290. AI and Law.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948-4961, Online. Association for Computational Linguistics.

Prathamesh Kalamkar, Aman Tiwari, Astha Agarwal, Saurabh Karn, Smita Gupta, Vivek Raghavan, and Ashutosh Modi. 2022. Corpus for automatic structuring of legal documents. CoRR, abs/2201.13125.

Yoshinobu Kano, Mi-Young Kim, Randy Goebel, and Ken Satoh. 2017. Overview of COLIEE 2017. In COLIEE@ICAIL.

Yoshinobu Kano, Mi-Young Kim, Masaharu Yoshioka, Yao Lu, Juliano Rabelo, Naoki Kiyota, Randy Goebel, and Ken Satoh. 2019. COLIEE-2018: Evaluation of the Competition on Legal Information Extraction and Entailment. In Kazuhiro Kojima, Maki Sakamoto, Koji Mineshima, and Ken Satoh, editors, New Frontiers in Artificial Intelligence, volume 11717, pages 177-192. Springer International Publishing, Cham.

Justice Markandey Katju. 2019. Backlog of cases crippling judiciary. https://tinyurl.com/v4xu6mvk.

Mi-Young Kim and Randy Goebel. 2017. Two-step cascaded textual entailment for legal bar exam question answering. In Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, pages 283-290, London, United Kingdom. ACM.

Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2017. Human Decisions and Machine Predictions. The Quarterly Journal of Economics.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, pages II-1188-II-1196. JMLR.org.

Pedro H. Luz de Araujo, Teófilo E. de Campos, Renato R. R. de Oliveira, Matheus Stauffer, Samuel Couto, and Paulo Bermejo. 2018. LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In International Conference on the Computational Processing of Portuguese (PROPOR), Lecture Notes on Computer Science (LNCS), pages 313-323, Canela, RS, Brazil. Springer.

Vijit Malik, Rishabh Sanjay, Shouvik Kumar Guha, Shubham Kumar Nigam, Angshuman Hazarika, Arnab Bhattacharya, and Ashutosh Modi. 2021a. Semantic segmentation of legal documents via rhetorical roles. CoRR, abs/2112.01836.

Vijit Malik, Rishabh Sanjay, Shubham Kumar Nigam, Kripabandhu Ghosh, Shouvik Kumar Guha, Arnab Bhattacharya, and Ashutosh Modi. 2021b. ILDC for CJPE: Indian legal documents corpus for court judgment prediction and explanation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4046-4062, Online. Association for Computational Linguistics.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404-411.

National Judicial Data Grid. 2021. National judicial data grid statistics. https://www.njdg.ecourts.gov.in/njdgnew/index.php.

PopulationU. 2021. Population of Uttar Pradesh. https://www.populationu.com/in/uttar-pradesh-population.

Juan Enrique Ramos. 2003. Using tf-idf to determine word relevance in document queries.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512-4525, Online. Association for Computational Linguistics.

Yunqiu Shao, Jiaxin Mao, Yiqun Liu, Weizhi Ma, Ken Satoh, Min Zhang, and Shaoping Ma. 2020. BERT-PLI: Modeling Paragraph-Level Interactions for Legal Case Retrieval. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 3501-3507, Yokohama, Japan. International Joint Conferences on Artificial Intelligence Organization.
quite strict and hence the results are much lower for the Judge's summary part. The segmentation and Judge's opinion were strictly evaluated, and even a single sentence in the wrong segment reduces the accuracy. We also see that the main binary label of outcome detection (bail granted or denied) had an almost perfect accuracy of 99.4%. Nevertheless, in future we plan to improve our pipeline further by training machine learning models.

Process                                      Accuracy
Header, Body and Case Result Segmentation    89.7%
Judge's Opinion and Facts extraction         85.7%
Bail Decision Extraction                     99.4%

Table 7: Evaluation results of bail document division and bail decision extraction pipeline.

Text Amount | In Value Form
[table rows not recovered from the scan]

Table 8: Text bail amount mapping example.

H Model Details

H.1 Evaluation Metrics

To evaluate the performance of all the models, we use Accuracy and F1-score, which are considered standard evaluation metrics when performing classification experiments. These are mathematically described as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 Score = (2 * Precision * Recall) / (Precision + Recall)

where TP, FP, TN, and FN denote True Positives, False Positives, True Negatives, and False Negatives, respectively. The mathematical formulation for Precision and Recall is given as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

H.2 Hyperparameter Tuning

We used Optuna [7] for hyperparameter optimisation. Optuna allows us to easily define search spaces, select optimisation algorithms, and scale with easy parallelization. We run parameter tuning on 10% of the data to identify the best parameters before retraining the model with the best parameters on the entire dataset. The best parameters are listed in Table 9.

[Table 9: Listing of Hyper-Parameters for Training of Models; table body not recovered.]

[7] https://github.com/optuna/optuna
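The metrics defined in Appendix H.1 can be implemented directly from the formulas. This is a generic sketch, not the authors' evaluation script; the zero-division guards are a standard convention, not specified in the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = bail granted, 0 = bail denied)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Accuracy, Precision, Recall, and F1, matching the formulas in Appendix H.1."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard: no positive labels
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```

For an imbalanced corpus like this one (about 60:40), F1 is the more informative of the two reported metrics.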
I Sample Segmented Document

Header: This chunk of the document contains meta information related to the case, like the court hearing date, IPC sections attached, police station of the complaint, etc.

Judge's Opinion: This refers to the few lines present in the middle portion of the document where the judge writes their opinion of the case.
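The segments annotated above can be held in a simple semi-structured record. This is only an illustrative sketch: the field names below are hypothetical and not the corpus's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentedBailDocument:
    """Illustrative semi-structured view of one segmented bail document (hypothetical schema)."""
    header: str                                  # meta info: hearing date, IPC sections, police station
    facts: List[str] = field(default_factory=list)  # facts and arguments of the case
    judges_opinion: str = ""                     # judge's opinion from the middle of the document
    result: str = ""                             # extracted outcome: "granted" or "denied"


# Example instance with placeholder content.
doc = SegmentedBailDocument(
    header="Court: ..., Hearing date: ..., IPC sections: ...",
    facts=["fact sentence 1", "fact sentence 2"],
    judges_opinion="...",
    result="granted",
)
```

A record like this is what makes downstream tasks such as bail prediction straightforward: the model consumes `facts`, while `result` provides the label.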