Raghav 2015
Text and Citations Based Cluster Analysis of Legal Judgments
1 Introduction
The amount of text data available in the legal domain is vast and continuously
growing, which makes it challenging to handle. Apart from the size of the data,
the inherent complexity of the legal domain demands better and more
sophisticated methods for processing legal documents to satisfy the information
needs of legal practitioners. Building efficient search approaches in the legal
domain is an active research area.
In the literature, efforts have been made to address the challenges of infor-
mation overload in the area of web search. A web page has links (references) to
other web pages often called hyperlinks. It is considered that the existence of a
link between two web pages indicates a relationship between the two pages [10].
In the area of web search and retrieval, these links provide important informa-
tion and efficient search systems have been developed by exploiting these links
[4,10] available in the web pages. Also, efficient methods have been developed
for effective organization and retrieval of web pages by extending clustering [9]
and classification [5] approaches. Several efforts have been made to build
efficient search systems based on communities formed by links. In addition,
efforts have also been made in information retrieval to link one topic with
other topics by forming hyperlinks among the topics through text-based
comparison [11,20].
© Springer International Publishing Switzerland 2015
R. Prasath et al. (Eds.): MIKE 2015, LNAI 9468, pp. 449–459, 2015.
DOI: 10.1007/978-3-319-26832-3_42
450 K. Raghav et al.
Legal systems are generally based on one of the two basic systems of law,
viz., civil law and common law. In civil law, core principles are codified into a
referable system which serves as the primary source of law. As opposed to civil
law, the common law is a law developed by judges through previous judgments,
i.e., courts have interpreted the law in individual cases alongside using a referable
system of rules as a source of law. A legal judgment is a closed case. It is a
text document, which explains the formal decision made by a court following
a lawsuit. Similar to the web domain, links can be observed in legal judgments
in the form of a citation network, in which one judgment is said to be connected
to another when it cites that prior judgment. This citation information in
legal judgments could be utilized for efficient search.
In this paper, we make an effort to find related judgments through cluster
analysis by considering judgments in a common law system. In particular, we
consider citation information and propose an approach to find similar legal
judgments. An analysis of judgments delivered by the Supreme Court of India
shows that a considerable number of legal judgments have only a few citations.
Similar to the notion of links among topics [11,20], we employ the notion of a
paragraph link to group similar judgments at the paragraph level. We apply a
clustering approach to the judgment dataset by considering citations alone,
paragraph links alone, and the combination of citations and paragraph links. We
show that it is possible to establish similarity between judgments by
exploiting citations.
In addition, we propose a clustering approach based on citations. Document
clustering using the vector space model is a well-studied approach. In general,
it is accomplished by representing each data object as an n-dimensional feature
vector in which each coordinate corresponds to a term in the vocabulary [18].
For judgments represented only by citation information, applying typical
clustering algorithms such as K-means is difficult, because the mean or central
node of a cluster is hard to compute. We propose a clustering approach that
employs the notion of multiple central judgments to represent a cluster.
Experiments on a real-world dataset show that the proposed clustering approach
can be useful in establishing similarity between judgments by utilizing
citation information.
The rest of the paper is organized as follows. In the next section, we discuss the
related work. In Sect. 3, we explain the proposed approach. In Sect. 4, we discuss
the experimental results. The last section contains the conclusion and future work.
Text and Citations Based Cluster Analysis of Legal Judgments 451
2 Related Work
Extensive work has been done in the area of web search and information retrieval
by exploiting the text based content in the web pages. Traditional methods to
compare two documents treat documents as bag-of-words where each term is
weighted according to TF-IDF score [20]. The vector space model [18] is a pop-
ular approach to model the documents and then cosine similarity method is
employed to compare two documents. The survey [24] provides a taxonomy of
clustering techniques, such as agglomerative and partitional techniques. Its
authors identify recent advancements in this domain and present various
applications of clustering algorithms in the area of information retrieval.
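The bag-of-words comparison mentioned above (TF-IDF weighting followed by cosine similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, the plain tf × log(N/df) weighting and the three toy documents are assumptions made only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents (hypothetical), already lower-cased and split on whitespace.
docs = [
    "court held the appeal allowed".split(),
    "court dismissed the appeal".split(),
    "tax assessment under the act".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # share "court" and "appeal"
sim_02 = cosine(vecs[0], vecs[2])  # share only "the", whose IDF weight is zero
```

In this toy corpus the first two documents share weighted terms and obtain a positive score, while the first and third share only a term that occurs in every document and therefore score zero.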
In web domain, efficient search systems have been developed by exploiting
links [4,10]. The link based approaches have been explored to extract communi-
ties [12]. An effort [7] has been made to identify related web pages by using the
connectivity information in the web.
In [19], an approach has been proposed by considering the document as a
collection of segments of themes or topics. In that approach, to improve the
search performance, similar paragraphs in multiple documents are linked in order
to provide attention to concepts captured at the paragraph level.
In the legal domain, several efforts are being made to build better information
extraction approaches. A machine learning based approach for retrieval of prior
cases has been studied in [2]. A navigation model [25] has been proposed to
browse through legal issues by exploiting the legal citation network in the form
of a semantic network. A probabilistic graphical model [21] has been proposed
for automatic text summarization in the legal domain. An approach [23] has been
proposed to perform automatic categorization of case law into high-level
categories. Conrad et al. [6] conducted clustering experiments (hard clustering,
soft clustering and hierarchical clustering) using several kinds of law firm data
to build a decision support system that helps legal experts. In [16], a
classification-based recursive soft clustering algorithm with built-in topic
segmentation was proposed, employing metadata such as topical classification,
document citations and click-stream data from user behavior databases. The
importance of exploiting link information in legal judgments has been
demonstrated by analyzing sample pairs of judgments [13–15].
In this paper, we have made an effort to establish similarity among legal judg-
ments through cluster analysis by exploiting citation information and paragraph
links.
3 Proposed Approach
In this section, we present the basic idea and the approach for clustering judg-
ments using citations.
search. Suppose a web page X has a link to a web page Y. We say that X has an
out-link to Y and Y has an in-link from X. As per cocitation-based similarity,
two web pages are considered similar if the number of in-links they have in
common is above a certain threshold. As per bibliographic coupling, two web
pages are similar if the number of out-links they have in common is above a
certain threshold. Link-based approaches have been proposed to extract
authoritative resources, assign ranks to pages, extract cohesive communities
and for crawling. There are also efforts to develop improved information
retrieval approaches by exploiting thematic similarity between similar
paragraphs and similar sentences [19].
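The two link-based similarity notions can be made concrete with a short sketch. The adjacency-list representation, the function names and the toy graph are assumptions introduced for illustration; the threshold value is a free parameter that the text does not specify.

```python
def out_links(graph, page):
    """Pages that `page` links to."""
    return set(graph.get(page, ()))


def in_links(graph, page):
    """Pages that link to `page`."""
    return {src for src, targets in graph.items() if page in targets}


def bibliographic_coupling(graph, a, b):
    """Number of out-links that a and b have in common."""
    return len(out_links(graph, a) & out_links(graph, b))


def cocitation(graph, a, b):
    """Number of in-links that a and b have in common."""
    return len(in_links(graph, a) & in_links(graph, b))


def is_similar(common_links, threshold):
    """Two pages count as similar when the common-link count is above the threshold."""
    return common_links > threshold


# Hypothetical link graph: page -> pages it links to.
graph = {
    "X": ["A", "B", "C"],
    "Y": ["B", "C", "D"],
    "Z": ["A", "B"],
}
coupling_xy = bibliographic_coupling(graph, "X", "Y")  # common out-links: B, C
cocite_ab = cocitation(graph, "A", "B")                # common in-links: X, Z
cocite_cd = cocitation(graph, "C", "D")                # common in-link: Y
```

The same functions apply unchanged to judgments if an out-link is read as "cites" and an in-link as "is cited by".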
Similar to a web page, a legal judgment also has citations. As judgments
are highly credible documents, a citation indicates a significant association
between two judgments.
Based on an analysis of real-world data, it has been observed that a
considerable number of judgments have few citations. In order to increase the
number of citations of a judgment, we employ the notion of paragraph links (PLs)
between judgments and perform clustering using the induced citations between
judgments. PLs capture the intricate legal concepts discussed at a fine-grained
level in the paragraphs of a legal judgment.
It is possible to cluster documents by applying a similarity measure such as
cosine similarity between document vectors and then applying K-means or
agglomerative clustering algorithms. However, the center of a cluster is
difficult to define in the case of citation-based similarity measures. In the
next sub-section, we propose a clustering approach that considers multiple
central documents as representative documents of a cluster.
We have carried out the clustering analysis by considering the following
methods.
– Text based clustering: In this approach, the terms present in the text of
the judgments are used to perform similarity analysis among judgments through
clustering. We use the standard K-means clustering algorithm. As a part of
preprocessing, we remove stop words from the document corpus and apply the
Porter stemming algorithm [17] for suffix-stripping of words in the corpus.
After assigning weights to terms with TF-IDF scores, we apply the iterative
partition-based K-means clustering algorithm [8]. We determine the value of k
using the Bayesian information criterion [22], a statistical model selection
criterion that indicates the natural number of clusters suitable for the
dataset.
– Citations based clustering: From each judgment, we remove all text and
keep only the citation information. By applying the proposed clustering approach
(described in the next sub-section), we obtain the clusters.
– Paragraph links (PLs) based clustering: We divide each judgment into a set
of paragraphs. For each judgment, we compute the similarity of each of its
paragraphs with each paragraph of the other judgments. A paragraph link (PL) is
established between two judgments if they have more than a threshold number of
similar paragraphs. By considering only the PLs of each judgment, clusters are
obtained by applying the proposed clustering approach.
– Combination of citations and PLs based clustering: We cluster the
judgments by considering both citations and PLs information.
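The paragraph-link rule from the list above can be sketched as follows. Several details are assumptions, because the text does not fix them: paragraph similarity is approximated here by Jaccard overlap of word sets (a stand-in for whatever measure is actually used), a pair of paragraphs counts as similar when its score reaches `sim_threshold`, and a PL is established when the number of similar pairs exceeds `count_threshold`. Both parameter names and their default values are hypothetical.

```python
def paragraph_similarity(p, q):
    """Assumed stand-in measure: Jaccard overlap of lower-cased word sets."""
    a, b = set(p.lower().split()), set(q.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def has_paragraph_link(paras_a, paras_b, sim_threshold=0.5, count_threshold=1):
    """True if the judgments share more than `count_threshold` similar paragraph pairs."""
    similar_pairs = sum(
        1
        for p in paras_a
        for q in paras_b
        if paragraph_similarity(p, q) >= sim_threshold
    )
    return similar_pairs > count_threshold


# Hypothetical judgments, each already split into paragraphs.
j1 = [
    "the accused was denied bail by the trial court",
    "the appeal concerns land acquisition compensation",
]
j2 = [
    "the accused was denied bail by the high court",
    "the appeal concerns land acquisition compensation award",
]
j3 = [
    "income tax assessment was reopened",
    "the writ petition is dismissed",
]
link_12 = has_paragraph_link(j1, j2)  # two similar paragraph pairs
link_13 = has_paragraph_link(j1, j3)  # no similar paragraph pair
```

Every PL found this way can then be treated as an induced citation between the two judgments, so that the citation-based clustering can also be applied to judgments with few explicit citations.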
Step 1: Finding Initial Clusters: We form the first cluster with the first
judgment. For each other judgment J, we compute the Jaccard coefficient
similarity between J's citations and the citations of each existing cluster. The
judgment J is inserted into the cluster with which it has the maximum
similarity. If the judgment is not similar to any of the existing clusters, a
new cluster is created with the judgment as its only element.
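Step 1 can be sketched as a single pass over the judgments. The sketch makes two assumptions that the description leaves open: the citations of a cluster are taken to be the union of the citations of its members, and "not similar" is read as a Jaccard coefficient that does not exceed `min_similarity` (a hypothetical parameter, 0.0 by default). It covers only this initial step.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def initial_clusters(citations, min_similarity=0.0):
    """Assign each judgment to its most similar existing cluster, or open a new one.

    citations: dict mapping judgment id -> iterable of cited judgment ids,
    processed in insertion order.
    """
    clusters = []  # each cluster: {"members": [...], "citations": set(...)}
    for judgment_id, cited in citations.items():
        cited = set(cited)
        best, best_sim = None, min_similarity
        for cluster in clusters:
            sim = jaccard(cited, cluster["citations"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # Not similar to any existing cluster: start a new one.
            clusters.append({"members": [judgment_id], "citations": set(cited)})
        else:
            best["members"].append(judgment_id)
            best["citations"] |= cited  # assumed: cluster citations = union of members'
    return clusters


# Hypothetical citation data.
citations = {
    "J1": {"C1", "C2"},
    "J2": {"C2", "C3"},
    "J3": {"C9"},
    "J4": {"C9", "C1"},
}
clusters = initial_clusters(citations)
members = [c["members"] for c in clusters]
```

Because the pass is sequential, the result depends on the order in which judgments are processed; here J4 overlaps with both clusters and joins the one with the higher coefficient.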
4 Experiments
In this section, we explain the dataset, preprocessing, evaluation metrics and
results.
Extracting Citations: We extract all the citations from the headnote section
of the judgments using regular expressions for the format used by reporters.
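A minimal sketch of regular-expression based citation extraction is given below. The reporter formats and the expressions actually used are not stated in the text, so both patterns are hypothetical: they target two bracketed or prefixed law-report styles chosen only for illustration, and the sample headnote is an illustrative string.

```python
import re

# Hypothetical patterns; the real reporter formats are an assumption here.
CITATION_PATTERNS = [
    # Style "[year] (volume) REPORTER page", e.g. "[1950] SCR 88".
    re.compile(r"\[\d{4}\]\s+(?:\d+\s+)?[A-Z]+\s+\d+"),
    # Style "AIR year COURT page", e.g. "AIR 1973 SC 1461".
    re.compile(r"\bAIR\s+\d{4}\s+[A-Z]{2,4}\s+\d+"),
]


def extract_citations(headnote):
    """Return the distinct citations found in a headnote, whitespace-normalized.

    Results are ordered by pattern first, then by position in the text.
    """
    citations = []
    for pattern in CITATION_PATTERNS:
        for match in pattern.finditer(headnote):
            normalized = " ".join(match.group(0).split())
            if normalized not in citations:
                citations.append(normalized)
    return citations


headnote = (
    "Relied on [1950] SCR 88 and AIR 1973 SC 1461; "
    "see also [1964] 1 SCR 371 and [1950]  SCR 88."
)
found = extract_citations(headnote)
```

Normalizing whitespace before de-duplication keeps the same citation from being counted twice when it is typeset with different spacing.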
provide a similarity score between 0 and 10 for each pair, based on the utility
in making legal decisions. A similarity score of 0 indicates that there is no
similarity between the two judgments and no utility to the legal practitioner in
making decisions. A similarity score of 10 indicates that the two judgments are
similar to each other and are of good utility to the legal practitioner.
For each clustering method, a judgment pair is classified as true-positive
(TP) if the expert rating is ≥5 and the clustering method assigned the judgments
to the same cluster, and as true-negative (TN) if the expert rating is ≤3 and
the clustering method assigned the judgments to different clusters. It is
classified as false-positive (FP) if the expert rating is ≤3 and the clustering
method assigned the judgments to the same cluster, and as false-negative (FN) if
the expert rating is ≥5 and the clustering method assigned the judgments to
different clusters.
We report the effectiveness of clustering using the binary classification
measures of Precision, Recall and F1 score:
$\mathrm{Precision} = \frac{TP}{TP+FP}$, $\mathrm{Recall} = \frac{TP}{TP+FN}$ and $F_1 = \frac{2 \times TP}{(2 \times TP)+FP+FN}$.
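The evaluation rule can be sketched in a few lines. The input format (a list of expert rating and same-cluster flag per pair) and the sample data are assumptions for illustration. Note that a rating of 4 satisfies neither the ≥5 nor the ≤3 condition, so the sketch leaves such pairs out of all four counts.

```python
def confusion_counts(pairs):
    """Count TP, TN, FP, FN from (expert_rating, same_cluster) pairs."""
    tp = tn = fp = fn = 0
    for rating, same_cluster in pairs:
        if rating >= 5:
            if same_cluster:
                tp += 1
            else:
                fn += 1
        elif rating <= 3:
            if same_cluster:
                fp += 1
            else:
                tn += 1
        # Ratings strictly between 3 and 5 match none of the four definitions.
    return tp, tn, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, Recall and F1 as defined above; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Hypothetical evaluation pairs: (expert rating, assigned to the same cluster?).
pairs = [
    (8, True), (7, True), (9, True),  # rated similar, clustered together
    (6, False),                       # rated similar, clustered apart
    (2, False),                       # rated dissimilar, clustered apart
    (1, True),                        # rated dissimilar, clustered together
    (4, True),                        # rating 4: not counted
]
tp, tn, fp, fn = confusion_counts(pairs)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

With three true positives, one false positive and one false negative, all three measures evaluate to 0.75 for this sample.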
[Figure: x-axis "Number of clusters", ticks 0 to 50]
By combining citations and PLs, we cover 2704 judgments. The average number
of clusters generated is 62. Clustering using PLs covers 45 (96%) of the
47 evaluation pairs. The average values of Precision, Recall and F1 score are
0.81, 0.75 and 0.80, respectively. This average F1 score is better than the
results obtained using text based clustering and comparable to the results
obtained using only citations or only PLs.
For clustering judgments using only citation information, we have employed the
notion of multiple representative judgments for each cluster. To increase the
number of citations of a judgment, we have used the notion of a paragraph link.
The results of the user evaluation study show that the citation-based approach
is effective in establishing similarity between legal judgments. As a part of
future work, we are planning to conduct a detailed user evaluation study. In
addition, we would like to exploit the Acts information in judgments for better
clustering. We would also like to investigate building a better search system
that exploits clusters to provide better utility for legal practitioners.
References
1. The Supreme Court of India Judgments. http://www.liiofindia.org/in/cases/cen/INSC/
2. Al-Kofahi, K., Tyrrell, A., Vachher, A., Jackson, P.: A machine learning approach
to prior case retrieval. In: Proceedings of the 8th International Conference on
Artificial Intelligence and Law, pp. 88–93. ACM (2001)
3. Bottou, L., Bengio, Y.: Convergence properties of the k-means algorithms. In:
Tesauro, G., et al. (eds.) Advances in Neural Information Processing Systems 7,
pp. 585–592. MIT, Cambridge (1995)
4. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine.
Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
5. Calado, P., Cristo, M., Moura, E., Ziviani, N., Ribeiro-Neto, B., Gonçalves, M.A.:
Combining link-based and content-based methods for web document classification.
In: Proceedings of the 12th CIKM, pp. 394–401. ACM (2003)
6. Conrad, J.G., Al-Kofahi, K., Zhao, Y., Karypis, G.: Effective document clustering
for large heterogeneous law firm collections. In: Proceedings of the 10th Interna-
tional Conference on Artificial Intelligence and Law, pp. 177–187. ACM (2005)
7. Dean, J., Henzinger, M.R.: Finding related pages in the world wide web. Comput.
Netw. 31(11–16), 1467–1479 (1999)
8. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a k-means clustering algorithm.
Appl. Stat. 28, 100–108 (1979)
9. He, X., Zha, H., Ding, C.H., Simon, H.D.: Web document clustering using hyperlink
structures. Comput. Stat. Data Anal. 41(1), 19–45 (2002)
10. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM
46(5), 604–632 (1999)
11. Knoth, P., Novotny, J., Zdrahal, Z.: Automatic generation of inter-passage links
based on semantic similarity. In: Proceedings of the 23rd International Confer-
ence on Computational Linguistics, pp. 590–598. Association for Computational
Linguistics (2010)
12. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the web for
emerging cyber-communities. Comput. Netw. 31(11–16), 1481–1493 (1999)
13. Kumar, S.: Similarity Analysis of Legal Judgments and applying Paragraph-link
to Find Similar Legal Judgments. Master’s thesis, International Institute of Infor-
mation Technology Hyderabad (2014)
14. Kumar, S., Reddy, P.K., Reddy, V.B., Singh, A.: Similarity analysis of legal judg-
ments. In: Proceedings of 4th Annual ACM COMPUTE 2011, pp. 17:1–17:4. ACM
(2011)
15. Kumar, S., Reddy, P.K., Reddy, V.B., Suri, M.: Finding similar legal judgements
under common law system. In: Madaan, A., Kikuchi, S., Bhalla, S. (eds.) DNIS
2013. LNCS, vol. 7813, pp. 103–116. Springer, Heidelberg (2013)
16. Lu, Q., Conrad, J.G., Al-Kofahi, K., Keenan, W.: Legal document clustering with
built-in topic segmentation. In: Proceedings of the 20th CIKM, pp. 383–392. ACM
(2011)
17. Porter, M.: An algorithm for suffix stripping. Program Electron. Libr. Inf. Syst.
14(3), 130–137 (1980)
18. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing.
Commun. ACM 18(11), 613–620 (1975)
19. Salton, G., Allan, J., Buckley, C., Singhal, A.: Automatic analysis, theme gener-
ation, and summarization of machine-readable texts. In: Card, S.K., Mackinlay,
J.D., Shneiderman, B. (eds.) Readings in Information Visualization, pp. 413–418.
Morgan Kaufmann Publishers Inc., San Francisco (1999)
20. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval.
Inf. Process. Manag. 24(5), 513–523 (1988)
21. Saravanan, M., Ravindran, B., Raman, S.: Improving legal document summariza-
tion using graphical models. In: Proceedings of the JURIX 2006, pp. 51–60. IOS
(2006)
22. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
23. Thompson, P.: Automatic categorization of case law. In: Proceedings of the 8th
International Conference on Artificial Intelligence and Law, pp. 70–77. ACM (2001)
24. Xu, R., Wunsch II, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw.
16(3), 645–678 (2005)
25. Zhang, P., Koppaka, L.: Semantics-based legal citation network. In: Proceedings of
the 11th International Conference on Artificial Intelligence and Law, pp. 123–130.
ACM (2007)