
Text and Citations Based Cluster Analysis of Legal Judgments

K. Raghav¹ (✉), Pailla Balakrishna Reddy¹, V. Balakista Reddy²,
and Polepalli Krishna Reddy¹

¹ IIIT-Hyderabad, Hyderabad, Telangana State, India
raghav.k@research.iiit.ac.in, pkreddy@iiit.ac.in
² NALSAR University of Law, Hyderabad, Telangana State, India
{balakrishnar,balakista}@gmail.com

Abstract. Developing efficient approaches to extract relevant information
from a collection of legal judgments is a research issue. Legal judgments
contain citations in addition to text. Link information has been exploited
to build efficient search systems in the web domain. Similarly, the citation
information in legal judgments could be utilized for efficient search. In
this paper, we propose an approach to find similar judgments by exploiting
citations in legal judgments through cluster analysis. As several judgments
have few citations, a notion of paragraph link is employed to increase the
number of citations in a judgment. A user evaluation study on a dataset of
judgments of the Supreme Court of India shows that the proposed clustering
approach is able to find similar judgments by exploiting citations and
paragraph links. Overall, the results show that citation information in
judgments can be exploited to establish similarity between judgments.

Keywords: Legal judgments · Citation · Link based analysis · Clustering

1 Introduction
The amount of available text data in the legal domain is vast and continuously
growing, which makes it challenging to deal with. Apart from the size of the
data, the inherent complexity of the legal domain demands better and more
sophisticated methods to process legal documents and satisfy the information
needs of legal practitioners. Building efficient search approaches in the legal
domain is an active research area.
In the literature, efforts have been made to address the challenges of
information overload in the area of web search. A web page has links
(references) to other web pages, often called hyperlinks. It is considered that
the existence of a link between two web pages indicates a relationship between
the two pages [10]. In the area of web search and retrieval, these links provide
important information and efficient search systems have been developed by
exploiting these links [4,10] available in the web pages. Also, efficient
methods have been developed for effective organization and retrieval of web
pages by extending clustering [9] and classification [5] approaches. Several
efforts have been made to build efficient search systems based on communities
formed by links. In addition, efforts have also been made in information
retrieval to link one topic with other topics by forming hyperlinks among the
topics by carrying out text-based comparison [11,20].

© Springer International Publishing Switzerland 2015
R. Prasath et al. (Eds.): MIKE 2015, LNAI 9468, pp. 449–459, 2015.
DOI: 10.1007/978-3-319-26832-3_42

Legal systems are generally based on one of the two basic systems of law,
viz., civil law and common law. In civil law, core principles are codified into a
referable system which serves as the primary source of law. As opposed to civil
law, the common law is a law developed by judges through previous judgments,
i.e., courts have interpreted the law in individual cases alongside using a referable
system of rules as a source of law. A legal judgment is a closed case. It is a
text document, which explains the formal decision made by a court following
a lawsuit. Similar to web domain, links can be observed in legal judgments in
the form of a citation network in which one judgment is said to be connected to
another judgment when it cites the prior judgment. This citation information in
legal judgments could be utilized for efficient search.
In this paper, we have made an effort to find related judgments through
cluster analysis by considering the judgments in a common law system. In
particular, we consider citation information and propose an approach to find
similar legal judgments. By analyzing judgments delivered by the Supreme Court
of India, it has been found that a considerable number of legal judgments have
only a few citations. Similar to the notion of links among topics [11,20], we
employ the notion of paragraph link to group similar judgments at a paragraph
level. We have applied the clustering approach to the judgments dataset by
considering the citations, the paragraph links, and the combination of both.
We show that it is possible to establish similarity between judgments by
exploiting citations.
In addition, we propose a clustering approach based on citations. Document
clustering using the vector space model is a well-studied approach. In general,
it is accomplished by representing each data object as an n-dimensional feature
vector with each coordinate of the vector corresponding to a term in the
vocabulary [18]. For judgments with only citation information, applying typical
clustering algorithms like K-means is difficult, as computing the mean or
central node of the cluster is difficult. We propose a clustering approach that
employs the notion of multiple central judgments to represent the cluster. By
conducting experiments on a real-world dataset, it has been shown that the
proposed clustering approach can be useful in establishing similarity between
judgments by utilizing citation information.
The rest of the paper is organized as follows. In the next section, we discuss
the related work. In Sect. 3, we explain the proposed approach. In Sect. 4, we
discuss the experimental results. The last section contains the conclusion and
future work.

2 Related Work
Extensive work has been done in the area of web search and information
retrieval by exploiting the text-based content of web pages. Traditional
methods to compare two documents treat documents as bags of words where each
term is weighted according to its TF-IDF score [20]. The vector space model [18]
is a popular approach to model documents, and the cosine similarity method is
then employed to compare two documents. The survey [24] provides a taxonomy of
clustering techniques, such as agglomerative and partitional techniques. It
identifies recent advancements in this domain and presents various applications
of clustering algorithms in the area of information retrieval.
In the web domain, efficient search systems have been developed by exploiting
links [4,10]. Link based approaches have been explored to extract
communities [12]. An effort [7] has been made to identify related web pages by
using the connectivity information in the web.
In [19], an approach has been proposed by considering the document as a
collection of segments of themes or topics. In that approach, to improve the
search performance, similar paragraphs in multiple documents are linked in order
to provide attention to concepts captured at the paragraph level.
In the legal domain, several efforts are being made to build better information
extraction approaches. A machine learning based approach for retrieval of prior
cases has been studied in [2]. A navigation model [25] has been proposed to
browse through legal issues by exploiting the legal citation network in the
form of a semantic network. A probabilistic graphical model [21] for automatic
text summarization has been proposed in the legal domain. An approach [23] has
been proposed to perform automatic categorization of case laws into high level
categories. Conrad et al. [6] conducted clustering experiments (hard clustering,
soft clustering and hierarchical clustering) using several kinds of law firm
data for building a decision support system to help legal experts. In [16], a
classification based recursive soft clustering algorithm with built-in topic
segmentation was proposed by employing metadata such as topical classification,
document citations and click stream data from user behavior databases. The
importance of exploiting link information in legal judgments has been
demonstrated by analyzing sample pairs of judgments [13–15].
In this paper, we have made an effort to establish similarity among legal
judgments through cluster analysis by exploiting citation information and
paragraph links.

3 Proposed Approach
In this section, we present the basic idea and the approach for clustering
judgments using citations.

3.1 Basic Idea


In the web domain, links (or URLs) have been exploited for efficient search.
The existence of links is considered an important feature which is widely
exploited for search. If a web page X has a link to a web page Y, we say X has
an out-link to Y and Y has an in-link from X. As per cocitation-based
similarity, two web pages are considered similar if they have common in-links
above a certain threshold. As per bibliographic coupling, two web pages are
similar if they have common out-links above a certain threshold. Link based
approaches have been proposed to extract authoritative resources, assign ranks
to pages, extract cohesive communities and perform crawling. There are also
efforts to develop improved information retrieval approaches by exploiting
thematic similarity between similar paragraphs and similar sentences [19].
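As an illustration of the two link-based similarity notions, the following is a minimal sketch in Python. It is not part of the proposed approach; the function names, the toy link graph and the threshold value are assumptions made only for this example.

```python
def in_links(graph, page):
    """Pages that link to `page`."""
    return {source for source, targets in graph.items() if page in targets}


def out_links(graph, page):
    """Pages that `page` links to."""
    return graph.get(page, set())


def cocitation_count(graph, a, b):
    """Number of in-links shared by pages a and b."""
    return len(in_links(graph, a) & in_links(graph, b))


def coupling_count(graph, a, b):
    """Number of out-links shared by pages a and b."""
    return len(out_links(graph, a) & out_links(graph, b))


# Toy link graph (made-up data): page -> set of pages it links to.
graph = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "A": {"X", "Y"},
    "B": {"X", "Z"},
}

threshold = 2  # hypothetical threshold on the number of common links

# A and B are both linked from P1 and P2, so they pass the cocitation test.
similar_by_cocitation = cocitation_count(graph, "A", "B") >= threshold
# A and B share only the out-link X, so they fail the coupling test.
similar_by_coupling = coupling_count(graph, "A", "B") >= threshold
```

The same two counts apply unchanged to judgments when in-links and out-links are read as in-citations and out-citations.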
Similar to the web page, a legal judgment also has citations. As judgments
are very credible documents, citations indicate significant association between
two judgments.
Based on the analysis of real world data, it has been observed that a
considerable number of judgments have few citations. In order to increase the
number of citations of a judgment, we employ the notion of paragraph links (PLs)
between judgments and perform clustering using induced citations between
judgments. The PLs capture the intricate legal concepts discussed at a minute
level in the paragraphs of a legal judgment.
It is possible to cluster documents by applying similarity measures such as
cosine similarity between document vectors and then applying K-means or
agglomerative clustering algorithms. However, the center of a cluster is
difficult to define in the case of citation-based similarity measures. In the
next sub-section, we propose a clustering approach by considering multiple
central documents as representative documents of the cluster.
We have carried out the clustering analysis by considering the following
methods.

– Text based clustering: In this approach, the terms present in the text of
the judgments are used for performing similarity analysis among judgments using
clustering. We use the standard K-means clustering algorithm. As a part of
preprocessing, we removed stop words from the document corpus and applied the
Porter stemming algorithm [17] for suffix-stripping of words in the corpus.
After assigning weights to terms with TF-IDF scores, we use the iterative
partition based K-means clustering algorithm [8]. We determine the value of k
by using the Bayesian information criterion method [22], which is a statistical
approach to model selection. It provides information regarding the natural
number of clusters suitable for the dataset.
– Citations based clustering: From each judgment, we remove all text and
only keep citation information. By applying the proposed clustering approach
(described in the next sub-section), we obtain the clusters.
– Paragraph links (PLs) based clustering: We divide the judgment into a set
of paragraphs. For each judgment, we compute the similarity of each paragraph
with each paragraph of other judgments. A paragraph link (PL) is established
between two judgments if they have more than a threshold number of similar
paragraphs. By considering only PLs for each judgment, clusters are obtained
by applying the proposed clustering approach.
– Combination of citations and PLs based clustering: We cluster the
judgments by considering both citations and PLs information.
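The last method treats PLs as induced citations. A minimal sketch of one way to build the combined input is given below. It assumes that the two kinds of links are merged by a per-judgment set union, which the text does not state explicitly; the function name and the toy identifiers are hypothetical.

```python
def combine_links(citations, paragraph_links):
    """Merge citations and PLs per judgment (assumed here: plain set union)."""
    combined = {}
    for judgment in set(citations) | set(paragraph_links):
        combined[judgment] = (citations.get(judgment, set())
                              | paragraph_links.get(judgment, set()))
    return combined


# Toy data (made up): judgment id -> ids of linked judgments.
citations = {"J1": {"J2"}, "J2": {"J1"}}
paragraph_links = {"J1": {"J3"}, "J3": {"J1"}}

combined = combine_links(citations, paragraph_links)
# J3 has no citations of its own but becomes clusterable through its PL to J1.
```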

3.2 Clustering Judgments Using Citations


We propose a clustering method by considering the citation information in
judgments. Clustering judgments that have only citations is a difficult task
for two reasons. First, the notion of a center cannot be clearly defined when
clustering using citations. Second, it is observed that there are some
important judgments with which many judgments have common citations. We develop
a clustering approach that considers several representative judgments as the
center of the cluster. A judgment that establishes high connectivity with other
members of the cluster is chosen as a cluster representative. The proposed
methodology consists of three steps, which are explained below.

Step 1: Finding Initial Clusters: We form the first cluster with the first
judgment. For each other judgment J, we compute the Jaccard coefficient
similarity (the size of the intersection of two sets divided by the size of
their union) between J's citations and the citations of each existing cluster.
The judgment J is inserted into the cluster with which it establishes the
maximum similarity. If the judgment is not similar to any of the existing
clusters, a new cluster is created with the judgment as its only element.
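Step 1 can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments. It assumes that the citations of a cluster are the union of its members' citations and that "not similar" means a Jaccard coefficient of zero; the function names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def initial_clusters(citations):
    """Single pass over the judgments in input order.

    citations: dict mapping judgment id -> set of citation ids.
    Returns a list of clusters, each with its members and pooled citations.
    """
    clusters = []
    for judgment, cites in citations.items():
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(cites, cluster["citations"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No existing cluster shares a citation: start a new cluster.
            clusters.append({"members": [judgment], "citations": set(cites)})
        else:
            best["members"].append(judgment)
            best["citations"] |= cites  # assumed: pool the members' citations
    return clusters


# Toy data (made up): judgment id -> citation ids.
toy_citations = {
    "J1": {"A", "B"},
    "J2": {"B", "C"},
    "J3": {"X", "Y"},
    "J4": {"Y"},
}
toy_clusters = initial_clusters(toy_citations)
members = [c["members"] for c in toy_clusters]
```

Because the pass is sequential, the initial clusters depend on the order of the judgments; the refinement in Step 2 is what reduces this dependence.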

Step 2: Refinement of Clusters: The refinement consists of two steps.


– Finding representative judgments: For each cluster generated in the first
step, we find the k representative judgments in the cluster based on the number
of judgments each judgment is connected to within the cluster. We say two
judgments are connected if they have a common citation.
– Assigning judgments to clusters: We compute the Jaccard coefficient
similarity of every judgment with each of the citation sets of the
representative nodes of each cluster. We assign the judgment to the cluster
with maximum similarity.
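One refinement pass can be sketched as below. The sketch assumes that the similarity of a judgment to a cluster is the maximum Jaccard coefficient over that cluster's representatives, and that ties are resolved in favor of the current cluster; neither choice is specified in the text. All names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def representatives(members, citations, k):
    """The k members that share a citation with the most other members."""
    def degree(j):
        return sum(1 for other in members
                   if other != j and citations[j] & citations[other])
    # Sort by decreasing degree; break ties by id for determinism.
    return sorted(members, key=lambda j: (-degree(j), j))[:k]


def reassign(clusters, citations, k):
    """One refinement pass: move each judgment to its most similar cluster."""
    reps = [representatives(members, citations, k) for members in clusters]

    def score(judgment, index):
        return max((jaccard(citations[judgment], citations[r])
                    for r in reps[index]), default=0.0)

    new_clusters = [[] for _ in clusters]
    for current, members in enumerate(clusters):
        for j in members:
            # Highest score wins; on a tie the current cluster is kept.
            best = max(range(len(clusters)),
                       key=lambda i: (score(j, i), i == current))
            new_clusters[best].append(j)
    return new_clusters


# Toy data (made up). J6 starts in the wrong cluster.
toy_citations = {
    "J1": {"A", "B"}, "J2": {"A", "B", "C"}, "J3": {"B", "C"},
    "J4": {"X", "Y"}, "J5": {"X", "Y", "Z"}, "J6": {"A", "B"},
}
before = [["J1", "J2", "J3"], ["J4", "J5", "J6"]]
after = reassign(before, toy_citations, k=1)
```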

Step 3: Termination Condition: After the completion of the second step, we
compute the number of judgments that changed cluster during this refinement
iteration and the number of representative judgments that changed cluster. If
the number of judgments changing clusters is less than a threshold value, or
the number of representative judgments changing clusters is less than a
threshold value, then we stop the refinement and return the current clusters as
the final set of clusters. Otherwise, Step 2 is repeated.
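The termination test can be sketched as a small predicate over two successive cluster assignments. The function names, argument layout and toy values are assumptions of this sketch; the thresholds are expressed as counts, so a percentage such as the 5 % used in the experiments would first be converted to a count of judgments.

```python
def moved_judgments(old_assignment, new_assignment):
    """Judgments whose cluster id differs between two refinement iterations."""
    return {j for j, cluster in new_assignment.items()
            if old_assignment.get(j) != cluster}


def should_stop(old_assignment, new_assignment, representatives,
                judgment_threshold, representative_threshold):
    """Stop when few judgments, or few representatives, changed cluster."""
    moved = moved_judgments(old_assignment, new_assignment)
    moved_reps = moved & set(representatives)
    return (len(moved) < judgment_threshold
            or len(moved_reps) < representative_threshold)


# Toy assignments (made up): judgment id -> cluster index.
old = {"J1": 0, "J2": 0, "J3": 1, "J4": 1}
new = {"J1": 0, "J2": 1, "J3": 1, "J4": 1}
reps = ["J1", "J3"]  # representatives chosen in the previous iteration

# One judgment moved and no representative moved.
stop_now = should_stop(old, new, reps,
                       judgment_threshold=1, representative_threshold=1)
```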

About Convergence: The clustering algorithm we employed is similar to
partitional clustering algorithms like K-means and K-medoids. Hence, the
algorithm converges in a manner similar to K-means. The convergence properties
of partitioning approaches like K-means [3] have been explored in the
literature.

4 Experiments
In this section, we explain the dataset, preprocessing, evaluation metrics and
results.

4.1 Dataset and Preprocessing


We have conducted experiments using a dataset consisting of judgments delivered
by the Supreme Court of India [1] from 1970 to 1993. The dataset consists of
3,738 judgments.

Structure of a Legal Judgment: A legal judgment is a text document. The
following are the important components of a legal judgment: Petitioner,
Respondent, Names of Judges, Date of Judgment, Citation, Act and Headnote.
Here, the Petitioner is the one who presents a petition to the court. The
Respondent is the entity against which/whom an appeal has been made. Act
indicates the brief category of the judgment. The Headnote contains a brief
summary of the judgment. Supreme Court Reports (SCR) is an official reporter
for the judgments delivered by the Supreme Court. Supreme Court Cases (SCC) and
All India Reporter (AIR) are some prominent private reporters. There are two
types of citations for a judgment, namely, out-citations and in-citations. The
out-citations of a judgment are the external references made by the current
judgment. The in-citations are the references made to the current judgment by
other judgments. For example, if a judgment X refers to a judgment Y to provide
the decision, then we say judgment X has an out-citation to Y and Y has an
in-citation from X. The out-citations and in-citations of a judgment together
are referred to as the citations of the judgment.

Extracting Citations: We extract all the citations from the headnote section
of the judgments using regular expressions for the format used by reporters.
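The exact regular expressions are not listed here. The following sketch shows what such an extraction could look like, assuming three common reporter formats ("[year] volume SCR page", "(year) volume SCC page" and "AIR year SC page"). The patterns, the function name and the sample headnote text are illustrative assumptions, and the sample citations are made-up strings rather than real cases.

```python
import re

# Assumed citation formats; real headnotes may use further variants.
CITATION_PATTERNS = [
    re.compile(r"\[\d{4}\]\s+\d+\s+SCR\s+\d+"),  # e.g. [1975] 2 SCR 100
    re.compile(r"\(\d{4}\)\s+\d+\s+SCC\s+\d+"),  # e.g. (1985) 3 SCC 200
    re.compile(r"AIR\s+\d{4}\s+SC\s+\d+"),       # e.g. AIR 1980 SC 55
]


def extract_citations(headnote):
    """Return the set of citation strings found in a headnote."""
    found = set()
    for pattern in CITATION_PATTERNS:
        for match in pattern.finditer(headnote):
            # Collapse runs of whitespace so equal citations compare equal.
            found.add(" ".join(match.group(0).split()))
    return found


sample = ("The court relied on [1975] 2 SCR 100 and AIR 1980 SC 55; "
          "see also (1985) 3  SCC 200.")
sample_citations = extract_citations(sample)
```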

Extracting Paragraph Links (PLs): We consider a paragraph to be the text
between two consecutive <p> HTML tags. We extract all the paragraphs in the
headnote section of the judgment. Then we remove stop words and perform
stemming [17] on all the words in the corpus. We choose only those paragraphs
which have between 20 and 60 words. After extracting the paragraphs for all the
judgments, we find the TF-IDF based cosine similarity of every paragraph with
every other paragraph in the dataset. We consider two paragraphs similar if
they have a cosine similarity ≥0.5. We establish a PL between a pair of
judgments if they have at least three similar paragraphs, which has been
studied as a good estimate of similarity between text documents in [19].
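The PL extraction can be sketched as follows, assuming that paragraphs are already tokenized, stop-word filtered and stemmed. The TF-IDF weighting (raw term frequency times log(N/df)), the counting of similar paragraph pairs per judgment pair, and all names are assumptions of this sketch rather than details given in the text.

```python
import math
from collections import Counter


def tfidf_vectors(paragraphs):
    """paragraphs: list of token lists. Returns one sparse vector per paragraph."""
    n = len(paragraphs)
    df = Counter(term for tokens in paragraphs for term in set(tokens))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
            for tokens in paragraphs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def paragraph_links(judgments, sim_threshold=0.5, min_similar=3,
                    min_words=20, max_words=60):
    """judgments: dict id -> list of paragraphs (token lists).

    Returns the set of judgment pairs (as frozensets) joined by a PL.
    """
    owners, paragraphs = [], []
    for jid, paras in judgments.items():
        for tokens in paras:
            if min_words <= len(tokens) <= max_words:
                owners.append(jid)
                paragraphs.append(tokens)
    vectors = tfidf_vectors(paragraphs)
    similar = Counter()
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if (owners[i] != owners[j]
                    and cosine(vectors[i], vectors[j]) >= sim_threshold):
                similar[frozenset((owners[i], owners[j]))] += 1
    return {pair for pair, count in similar.items() if count >= min_similar}


# Toy data (made up); min_words is lowered because the paragraphs are tiny.
toy = {
    "A": [["land", "tax"], ["bail", "murder"], ["writ", "habeas"]],
    "B": [["land", "tax"], ["bail", "murder"], ["writ", "habeas"]],
    "C": [["contract", "breach"], ["patent", "royalty"], ["land", "tax"]],
}
links = paragraph_links(toy, min_words=1)
```

The all-pairs comparison is quadratic in the number of paragraphs, which is acceptable for a sketch but would need an index or blocking step on a full corpus.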

4.2 Evaluation Metrics


We evaluate the clustering process by utilizing expert scores for 47 random
pairs of judgments provided to legal experts. The domain experts were asked to
provide a similarity score between 0 and 10 for each pair, based on the utility
in making legal decisions. A similarity score of 0 indicates that there is no
similarity between the two judgments and no utility to the legal practitioner
in making decisions. A similarity score of 10 indicates that the judgments are
similar to each other and are of good utility to the legal practitioner.
For each clustering method, a judgment pair is classified as true-positive
(TP ), if the expert rating ≥5 and the clustering method assigned the judgments
to the same cluster and as true-negative (TN ), if the expert rating ≤3 and the
clustering method assigned the judgments to different clusters. It is classified as
false-positive (FP ), if the expert rating ≤3 and the clustering method assigned
the judgments to the same cluster and as false-negative (FN ), if the expert rating
≥5 and the clustering method assigned the judgments to the different clusters.
We report the effectiveness of clustering using the binary classification
measures $\mathrm{Precision} = \frac{TP}{TP + FP}$,
$\mathrm{Recall} = \frac{TP}{TP + FN}$ and
$F_1 = \frac{2 \times TP}{(2 \times TP) + FP + FN}$.
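The evaluation procedure can be sketched as follows. The function names and the toy ratings are hypothetical. A rating of 4 falls under neither definition above, so this sketch leaves such pairs out of the counts, which is an assumption.

```python
def classify_pair(expert_rating, same_cluster):
    """Label one evaluation pair as TP, TN, FP or FN (None for a rating of 4)."""
    if expert_rating >= 5:
        return "TP" if same_cluster else "FN"
    if expert_rating <= 3:
        return "FP" if same_cluster else "TN"
    return None  # assumed: ratings between the two cut-offs are not counted


def evaluate(pairs):
    """pairs: iterable of (expert_rating, same_cluster). Returns (P, R, F1)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for rating, same_cluster in pairs:
        label = classify_pair(rating, same_cluster)
        if label is not None:
            counts[label] += 1
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, f1


# Toy ratings (made up): (expert rating, were both judgments in one cluster?).
toy_pairs = [(8, True), (7, True), (6, False), (2, True), (1, False), (4, True)]
precision, recall, f1 = evaluate(toy_pairs)
```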

4.3 Results of Text Based Clustering


We consider all the judgments as text documents and cluster them using the
K-means clustering algorithm. Two documents are compared by applying cosine
similarity to the corresponding vectors of terms with TF-IDF weights. The
Bayesian information criterion (BIC) [22] approach has been used to determine
the natural number of clusters. The variation of BIC with the number of
clusters, k, is shown in Fig. 1. The number of clusters is set as 13 based on
the elbow point.
The summary of the results obtained by text based clustering for k = 13 and
k = 60 is provided in the second and third columns of Table 1 respectively.
We present the average results of 10 runs of the algorithm. All the 47 pairs
(100 %) participate in the evaluation. The average values of Precision, Recall,
and F1 score for k = 13 are 0.84, 0.68 and 0.75 respectively. The average values
of Precision, Recall, and F1 score for k = 60 are 0.81, 0.64 and 0.71
respectively.
Fig. 1. Number of clusters (x-axis) Vs Bayesian information criterion (BIC, y-axis)



Table 1. Comparison of clustering results

Parameter                          Clustering based on
                                   Text (k = 13)  Text (k = 60)  Citations  PLs        Citations + PLs
Number of judgments covered        3738           3738           1508       1928       2704
Average number of clusters         13             60             140        56         62
Total number of evaluation pairs   47             47             47         47         47
Pairs participated in evaluation   47 (100 %)     47 (100 %)     38 (80 %)  26 (56 %)  45 (96 %)
Average Precision                  0.84           0.81           0.86       0.84       0.81
Average Recall                     0.68           0.64           0.73       0.73       0.75
Average F1 score                   0.75           0.71           0.79       0.78       0.80

4.4 Results of Clustering Using Citations


For each judgment, we keep only the citations (both in-citations and
out-citations) of the judgment and provide them as input to the proposed
clustering algorithm. The judgments with at least 3 citations are used. We
apply the proposed clustering approach, and the termination condition is met
when fewer than 5 % of the judgments change clusters. It has been observed that
the algorithm converges in less than 15 iterations. We vary the number of
representative points from 1 to 6 and, for each value, run 10 trials of the
algorithm. The number of clusters generated and the quality measures are
recorded each time. The number of representative points per cluster is fixed
as 4 based on the best F1 score observed.
The fourth column of Table 1 summarizes the results of this approach. In this
approach, 1508 judgments participated in the clustering process. The number of
clusters generated is 140. Citations based clustering covers 38 (80 %) pairs out
of the 47 evaluation pairs. The average values of Precision, Recall and F1 score
are 0.86, 0.73 and 0.79 respectively.

4.5 Results of Clustering Using Paragraph Links (PLs)


We remove citation information from the judgments and apply the proposed
clustering algorithm using only PLs. The number of representative points per
cluster is fixed as 5.
The results obtained by using this approach are provided in the fifth column
of Table 1. In this approach, 1928 judgments participated in the clustering
process. The average number of clusters generated is 56. The PLs based
clustering covers 26 (56 %) out of the 47 evaluation pairs. The average values
of Precision, Recall and F1 score are 0.84, 0.73 and 0.78 respectively. The
average F1 score of 0.78 is better than that obtained using the text based
approach and comparable to that obtained using citations.

4.6 Results of Clustering by Combining Citations and PLs


We use both the citations and PLs as input to the clustering approach, and the
results obtained are shown in the last column of Table 1. The number of
representative points is set as 6 based on the best F1 score.
By combining citations and PLs we cover 2704 judgments. The average number of
clusters generated is 62. Clustering using both citations and PLs covers 45
(96 %) out of the 47 evaluation pairs. The average values of Precision, Recall
and F1 score are 0.81, 0.75 and 0.80 respectively. The average F1 score of 0.80
is better than that obtained using text based clustering and comparable to that
obtained using only citations or only PLs.

4.7 Summary of the Results


From Fig. 2, it can be observed that a significant number of judgments have
only a few natural citations. Also, there are a significant number of judgments
with few PLs. By combining the citations and PLs, the total number of citations
is increased for a significant number of judgments. As shown in the fourth
column of Table 1, only 1,508 of the 3,738 judgments in the dataset participate
in the clustering using citations. By introducing PLs among the judgments, we
are able to include 1,928 judgments in the clustering process. By combining the
citations and PLs, we achieve more coverage by performing clustering for 2,704
judgments.

Fig. 2. Number of citations Vs Number of judgments

Overall, the experimental results show that the citation information is a
key feature which can be exploited to find similar judgments. The results also
indicate that the PLs can be exploited to establish similarity between judgments.
It has been observed that citation based methods can be used for establishing
similarity among legal judgments by clustering.

5 Conclusion and Future Work


In this paper, we have proposed an approach to cluster judgments by exploiting
citation information to compute the similarity between the judgments. For
clustering judgments using only citation information, we have employed the
notion of multiple representative judgments for each cluster. To increase the
number of citations of a judgment, we have used a notion of paragraph link. The
results of the user evaluation study show that the citation-based approach is
effective in establishing similarity between legal judgments. As a part of
future work, we are planning to conduct a detailed user evaluation study. In
addition, we would like to exploit Acts information in judgments for better
clustering. We would also like to investigate building a better search system
by exploiting clusters to provide better utility for legal practitioners.

References
1. The Supreme Court of India Judgments. http://www.liiofindia.org/in/cases/cen/INSC/
2. Al-Kofahi, K., Tyrrell, A., Vachher, A., Jackson, P.: A machine learning approach
to prior case retrieval. In: Proceedings of the 8th International Conference on
Artificial Intelligence and Law, pp. 88–93. ACM (2001)
3. Bottou, L., Bengio, Y.: Convergence properties of the k-means algorithms. In:
Tesauro, G., et al. (eds.) Advances in Neural Information Processing Systems 7,
pp. 585–592. MIT, Cambridge (1995)
4. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine.
Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
5. Calado, P., Cristo, M., Moura, E., Ziviani, N., Ribeiro-Neto, B., Gonçalves, M.A.:
Combining link-based and content-based methods for web document classification.
In: Proceedings of the 12th CIKM, pp. 394–401. ACM (2003)
6. Conrad, J.G., Al-Kofahi, K., Zhao, Y., Karypis, G.: Effective document clustering
for large heterogeneous law firm collections. In: Proceedings of the 10th
International Conference on Artificial Intelligence and Law, pp. 177–187. ACM (2005)
7. Dean, J., Henzinger, M.R.: Finding related pages in the world wide web. Comput.
Netw. 31(11–16), 1467–1479 (1999)
8. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a k-means clustering algorithm.
Appl. Stat. 28, 100–108 (1979)
9. He, X., Zha, H., Ding, C.H., Simon, H.D.: Web document clustering using hyperlink
structures. Comput. Stat. Data Anal. 41(1), 19–45 (2002)
10. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM
46(5), 604–632 (1999)
11. Knoth, P., Novotny, J., Zdrahal, Z.: Automatic generation of inter-passage links
based on semantic similarity. In: Proceedings of the 23rd International
Conference on Computational Linguistics, pp. 590–598. Association for
Computational Linguistics (2010)
12. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the web for
emerging cyber-communities. Comput. Netw. 31(11–16), 1481–1493 (1999)
13. Kumar, S.: Similarity Analysis of Legal Judgments and applying Paragraph-link
to Find Similar Legal Judgments. Master’s thesis, International Institute of
Information Technology Hyderabad (2014)
14. Kumar, S., Reddy, P.K., Reddy, V.B., Singh, A.: Similarity analysis of legal
judgments. In: Proceedings of 4th Annual ACM COMPUTE 2011, pp. 17:1–17:4. ACM
(2011)
15. Kumar, S., Reddy, P.K., Reddy, V.B., Suri, M.: Finding similar legal judgements
under common law system. In: Madaan, A., Kikuchi, S., Bhalla, S. (eds.) DNIS
2013. LNCS, vol. 7813, pp. 103–116. Springer, Heidelberg (2013)
16. Lu, Q., Conrad, J.G., Al-Kofahi, K., Keenan, W.: Legal document clustering with
built-in topic segmentation. In: Proceedings of the 20th CIKM, pp. 383–392. ACM
(2011)
17. Porter, M.: An algorithm for suffix stripping. Program Electron. Libr. Inf. Syst.
14(3), 130–137 (1980)
18. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing.
Commun. ACM 18(11), 613–620 (1975)
19. Salton, G., Allan, J., Buckley, C., Singhal, A.: Automatic analysis, theme
generation, and summarization of machine-readable texts. In: Card, S.K.,
Mackinlay, J.D., Shneiderman, B. (eds.) Readings in Information Visualization,
pp. 413–418. Morgan Kaufmann Publishers Inc., San Francisco (1999)
20. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval.
Inf. Process. Manag. 24(5), 513–523 (1988)
21. Saravanan, M., Ravindran, B., Raman, S.: Improving legal document
summarization using graphical models. In: Proceedings of the JURIX 2006,
pp. 51–60. IOS (2006)
22. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
23. Thompson, P.: Automatic categorization of case law. In: Proceedings of the 8th
International Conference on Artificial Intelligence and Law, pp. 70–77. ACM (2001)
24. Xu, R., Wunsch II, D.: Survey of clustering algorithms. Trans. Neur. Netw. 16(3),
645–678 (2005)
25. Zhang, P., Koppaka, L.: Semantics-based legal citation network. In: Proceedings of
the 11th International Conference on Artificial Intelligence and Law, pp. 123–130.
ACM (2007)
