
Building Chinese Legal Hybrid Knowledge Network

Sheng Bi, Yanhui Huang, Xiya Cheng, Meng Wang(B), and Guilin Qi

School of Computer Science and Engineering, Southeast University, Nanjing, China
{bisheng,haungyanhui,chengxiya,meng.wang,gqi}@seu.edu.cn

Abstract. Knowledge graphs play an important role in many applications, such as data integration, natural language understanding and semantic search. Recently, there has been some work on constructing legal knowledge graphs from legal judgments. However, this work suffers from two problems. First, it follows the Western legal system and thus cannot be applied to other legal systems, such as Asian legal systems. Second, it attempts to build a precise legal knowledge graph, which is often not effective, especially when constructing precise relationships between legal terms. To solve these problems, in this paper we propose a framework for constructing a legal hybrid knowledge network from Chinese encyclopedias and legal judgments. First, we construct a network of legal terms from encyclopedia data. Then, we build a legal knowledge graph from Chinese legal judgments that captures the strict logical connections in the judgments. Finally, we build a Chinese legal hybrid knowledge network by combining the network of legal terms and the legal knowledge graph. We also evaluate the algorithms used to build the legal hybrid knowledge network on a real-world dataset. Experimental results demonstrate the effectiveness of these algorithms.

Keywords: Legal knowledge graphs · Legal judgments · Chinese encyclopedia · Legal hybrid knowledge network

1 Introduction

Knowledge graphs, which belong to the field of knowledge engineering, were proposed by Google in 2012. A knowledge graph is a multi-relational graph composed of entities as nodes and relations of different types as edges. Knowledge graphs are a form of knowledge representation: they extract domain entities, attributes and their relationships from large amounts of text to produce structured knowledge. Their major advantage is the ability to express knowledge of complex relationships accurately and graphically, which is in line with human learning habits and helps people grasp key knowledge and relationships more quickly. Therefore, knowledge graphs play an
c Springer Nature Switzerland AG 2019
C. Douligeris et al. (Eds.): KSEM 2019, LNAI 11775, pp. 628–639, 2019.
https://doi.org/10.1007/978-3-030-29551-6_56

important role in many applications, such as data integration, natural language understanding and semantic search.
Recently, there has been some work on constructing legal knowledge graphs from legal judgments. Erwin Filtz proposes a method to represent Austrian legal data and enhances the representation with semantics to build a legal knowledge graph that supports unambiguous and useful interlinking of legal cases [6]. European researchers started the LYNX project [13] to build a legal knowledge graph for smart compliance services in multilingual Europe. As a collection of structured data and unstructured documents, this legal knowledge graph covers different jurisdictions, comprising legislation, case law, doctrine, standards, norms, and other documents, and can help companies solve questions and cases related to compliance in different sectors and jurisdictions.
However, these approaches suffer from some problems. First, existing work follows the Western legal system and thus cannot be applied to other legal systems, such as Asian legal systems. Second, existing work attempts to build a precise legal knowledge graph, which is often not effective, especially when constructing precise relationships between legal terms. The relationship between legal terms is sometimes ambiguous and hard to formalize.
To solve these problems, in this paper we propose a framework for constructing a legal hybrid knowledge network from Chinese encyclopedias and legal judgments. First, we crawl legal websites to extract raw encyclopedia data and use this high-quality encyclopedia knowledge to construct a network of legal terms. Then, we build a legal knowledge graph from Chinese legal judgments that captures the strict logical connections in the judgments. Finally, we combine the network of legal terms with the legal knowledge graph to produce our framework, the Chinese legal hybrid knowledge network.
To sum up, in this paper we design a procedure for constructing a knowledge graph and build a Chinese legal hybrid knowledge network by means of data mining, text extraction, natural language processing, etc. Our contributions are as follows:

1. We construct a network of legal terms from encyclopedia data and build a legal knowledge graph from Chinese legal judgments that captures the strict logical connections in the judgments.
2. We propose a framework for constructing a legal hybrid knowledge network from Chinese encyclopedia and legal judgments.
3. We conduct extensive experiments on a real-world dataset to evaluate the algorithms used to build the legal hybrid knowledge network. Experimental results demonstrate the effectiveness of these algorithms.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 provides the details of the approaches to building our legal hybrid knowledge network. In Sect. 4, we evaluate our algorithms and give the experimental results. Finally, Sect. 5 concludes our work.

2 Related Work
Our work is related to the work of knowledge graph construction, especially in
the legal domain.

2.1 Knowledge Graph


A knowledge graph organizes interrelated information, usually limited to a specific business domain, and manages it as a graph. In the past years, there has been a large amount of work on constructing knowledge graphs. Cyc [9], one of the oldest knowledge graphs, devoted fifteen years to building an ontology of general concepts spanning human reality in predicate-logic form. Inspired by widely used information communities such as Wikipedia and the Semantic Web, the American software company Metaweb designed Freebase [4], a practical tuple database used to structure diverse general human knowledge with high scalability. DBpedia [1], the most famous general knowledge graph in the world, extracts structured data from Wikipedia infoboxes and makes this information accessible on the Web, supporting 125 languages and providing large quantities of facts, largely focused on named entities that have Wikipedia articles. Similar to DBpedia, Yet Another Great Ontology (YAGO) [17] also extracts knowledge from Wikipedia (e.g., categories, redirects, infoboxes). Moreover, YAGO extracts information from WordNet [12] (e.g., synsets, hyponymy). Compared with DBpedia, YAGO mainly aims at an automatic fusion of knowledge extracted from diverse Wikipedia language editions, while DBpedia builds a separate knowledge graph for each Wikipedia language edition.
Besides the English knowledge graphs listed above, many Chinese knowledge graphs have been published in recent years, such as Zhishi.me, XLore and CN-DBpedia. Zhishi.me [14], the first large-scale Chinese knowledge graph, extracts structural features from the three largest Chinese encyclopedia sites (i.e., Baidu Baike, Hudong Baike, and Chinese Wikipedia) and proposes several data-level mapping methods for automatic link discovery. XLore [18] is a large-scale multilingual knowledge graph built by structuring and integrating Chinese Wikipedia, English Wikipedia, French Wikipedia, and Baidu Baike. To date, XLore contains 16,284,901 instances, 2,466,956 concepts, and 446,236 properties. Since knowledge bases are updated very slowly, researchers proposed a never-ending Chinese knowledge extraction system, CN-DBpedia [19]. CN-DBpedia provides the freshest knowledge with a smart active-update strategy and can generate a knowledge base that is constantly updated.

2.2 Legal Knowledge Graph

With the rapid development of open knowledge graphs, researchers have drawn attention to knowledge graphs in specific domains, such as the legal domain. Since a knowledge graph needs a certain data representation before it can be used, Erwin Filtz proposes a method to represent Austrian legal data, mainly legal norms and court decisions in legal documents, and enhances the representation with semantics to build a legal knowledge graph that supports unambiguous and useful interlinking of legal cases [6]. European researchers started the LYNX project [13] to build a legal knowledge graph for smart compliance services in multilingual Europe. As a collection of structured data and unstructured documents, this legal knowledge graph covers different jurisdictions, comprising legislation, case law, doctrine, standards, norms, and other documents, and can help companies solve questions and cases related to compliance in different sectors and jurisdictions.
However, existing work on constructing legal knowledge graphs follows the Western legal system and thus cannot be applied to other legal systems, such as Asian legal systems. Moreover, it attempts to build a precise legal knowledge graph, which is often not effective, especially when constructing precise relationships between legal terms. The relationship between legal terms is sometimes ambiguous and hard to formalize. In contrast, in this work we construct a hybrid knowledge network from Chinese encyclopedias and Chinese legal judgments, where the relationships between legal terms are not precisely defined.

3 Methods for Constructing Legal Hybrid Knowledge Network

In this section, we elaborate on our procedure for building the legal hybrid knowledge network, which we divide into three steps. First, we construct a network of legal terms from Chinese Web-based encyclopedias. Second, we construct a legal knowledge graph by extracting triples from legal judgments. Finally, we combine the network of legal terms and the legal knowledge graph. The overall framework is shown in Fig. 1, which indicates the general steps to develop the legal hybrid knowledge network. The details of the building procedure are presented in the following subsections.

3.1 Constructing Legal Terms Network

To construct a network of legal terms, we obtain a large amount of high-quality interconnected semantic data from Chinese encyclopedias. In particular, we identify important structural features in the three largest Chinese encyclopedia sites (i.e., Baidu Baike1, Hudong Baike2, and Chinese Wikipedia3) for extraction and propose several data-level mapping strategies for automatic link discovery. Figure 2 shows an example of building a network of legal terms from Chinese encyclopedias. Table 1 shows the statistics of the entities we crawled from these websites.
From Table 1, we can see that the data sources have a wide coverage of Chinese subjects and span many domains beyond the legal domain. We therefore need to select legal terms from the data sources. However, it is difficult to distinguish

1 https://baike.baidu.com.
2 http://www.baike.com.
3 https://zh.wikipedia.org.

Fig. 1. The framework shows the general steps to develop the Chinese legal hybrid
knowledge network. The top part is a typical process to construct a network of legal
terms from Chinese encyclopedia, and the bottom part is the process to develop legal
knowledge graph from legal judgments online. The lower right part indicates the final
legal hybrid knowledge network.

Table 1. The number of entities extracted from each Chinese encyclopedia.

Baidu Baike | Hudong Baike | Chinese Wikipedia
10,434,530  | 3,728,441    | 736,540

between legal terms and common terms. Moreover, it is worth noting that legal knowledge is not present as-is in these raw texts but must be extracted from them carefully. The entities extracted from the Chinese encyclopedias are not clearly classified, and it is impractical to classify them manually because the set of entities is too large. To solve this problem, we build a classifier to find professional legal terms on these general-purpose websites.
We first obtain legal terms from the encyclopedia information by category: if a term's category belongs to "legalese", "jurist" or "laws and articles", we consider the term legal-related. However, these data alone are not enough. To enlarge the training data, we take these legal terms as seeds and automatically acquire terms that are not related to law. We use these data to train a classifier.
Intuitively, the closer an entity is to a legal term in the internal-link graph, the more likely it is to be legal-related. We assume that neighborhoods of legal terms whose order is smaller than four are legal-related and may be legal terms. In other words, if the order of a neighborhood of the legal entities is bigger than three, the neighborhood has no relation to the legal entities and can be regarded as a negative sample. Therefore, we consider first-order, second-order and third-order neighborhoods of legal entities as candidate legal entities to be classified, and all other neighborhoods as negative samples. Figure 3 illustrates the structure of the different neighborhood orders. For example, for legal entities in Baidu Baike, we consider entities whose category belongs to "legalese", "jurist" or "laws and articles" as legal entities. These legal entities have 96,243 first-order neighborhoods, 1,226,592 second-order neighborhoods, and 8,074,532 third-order neighborhoods. The remaining 1,037,163 neighborhoods are considered to have no relation to legal entities. We randomly selected 5,000 of these 1,037,163 entities for manual evaluation; 4,997 of them turned out to be irrelevant to the legal domain, which supports our hypothesis.
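The neighborhood-order labelling described above amounts to a breadth-first search from the seed legal entities over the internal-link graph. The sketch below illustrates this, assuming the link graph is given as an adjacency dict; the function name and representation are ours, not the paper's implementation.

```python
from collections import deque

def neighborhood_orders(graph, seeds, max_order=3):
    """Label every entity with its link distance (order) from the seed
    legal entities via breadth-first search over internal links.

    graph: dict mapping each entity to an iterable of linked entities.
    seeds: known legal entities (order 0).
    Returns a dict entity -> order for entities within max_order hops;
    anything unreached falls into the negative pool.
    """
    order = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if order[node] == max_order:
            continue  # do not expand beyond third-order neighborhoods
        for nbr in graph.get(node, ()):
            if nbr not in order:
                order[nbr] = order[node] + 1
                queue.append(nbr)
    return order
```

Entities with order 1 to 3 then become the candidates to be classified, while entities absent from the result form the negative pool.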

Fig. 2. The example of building network of legal terms from Chinese encyclopedia.
Firstly, we obtain legal terms from three largest Chinese encyclopedia sites. Then we
calculate the similarities between these legal terms. Finally, these legal terms are linked
together by similarities to form a network of legal terms.

Fig. 3. We use the seeds as positive samples, the neighborhoods whose order is bigger than three as negative samples, and the rest as potentially legal-related.

We use a Support Vector Machine (SVM) [16] as our classifier. We randomly select 30,000 of the 1,037,163 entities that have no relation to the legal domain as negative samples, and collect 27,004 professional legal terms by manual annotation as positive samples. We then use these positive and negative samples to train our classifier. Note that we build five classifiers and obtain the final results by majority vote over them, which improves the accuracy of the classification results. Through these classifiers, we obtain 14,925 legal entities.
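The five-classifier majority vote can be sketched with scikit-learn. The feature matrices (one per representation, e.g. one-hot or TF-IDF vectors) and the function names below are hypothetical stand-ins for the paper's setup, not its actual code.

```python
import numpy as np
from sklearn.svm import SVC

def train_voting_svms(feature_matrices, labels):
    """Train one SVM per feature representation (one-hot, TF-IDF, LDA,
    doc2vec, word2vec in the paper) on the same labelled entities."""
    return [SVC(kernel="linear").fit(X, labels) for X in feature_matrices]

def vote(models, feature_matrices):
    """Majority vote: an entity is kept as legal-related only if more
    than half of the classifiers predict the positive class."""
    votes = np.stack([m.predict(X) for m, X in zip(models, feature_matrices)])
    return (votes.sum(axis=0) > len(models) / 2).astype(int)
```

Voting over classifiers trained on different representations reduces the impact of any single feature set's weaknesses.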

After obtaining enough legal entities through our classifier, we use these entities to construct our network of legal terms. In the network, each entity is regarded as a node, and entities are connected according to the similarities between them. We use SimRank to compute similarity. The weight between entity a and entity b is computed as follows:

$$s(a, b) = \frac{C}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b)) \quad (1)$$

where $s(a, b)$ is the similarity of node $a$ and node $b$, and $s(a, b) = 1$ when $a = b$. $I_i(a)$ is the $i$-th in-neighbor of node $a$. When $I(a) = \emptyset$ or $I(b) = \emptyset$, $s(a, b) = 0$. The parameter $C$ is a damping coefficient with $C \in (0, 1)$. Intuitively, the similarity between $a$ and $b$ is $C$ times the average of the similarities between the in-neighbors of $a$ and the in-neighbors of $b$.
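Eq. (1) can be computed by the standard iterative SimRank procedure. A minimal sketch, assuming the graph is given as a dict from each node to its list of in-neighbors:

```python
import itertools

def simrank(in_neighbors, C=0.8, iterations=10):
    """Iterative SimRank for Eq. (1): pairs start at s(a,a)=1 and
    s(a,b)=0, then are repeatedly updated from the similarities of
    their in-neighbors."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a, b in itertools.product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if not Ia or not Ib:
                new[(a, b)] = 0.0  # s(a, b) = 0 when I(a) or I(b) is empty
                continue
            total = sum(sim[(x, y)] for x in Ia for y in Ib)
            new[(a, b)] = C / (len(Ia) * len(Ib)) * total
        sim = new
    return sim
```

Two terms become similar when they are linked from similar pages; the damping coefficient C keeps scores for distinct nodes strictly below 1.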

3.2 Constructing a Legal Knowledge Graph

In this subsection, we give the details of building our legal knowledge graph. By definition, a knowledge graph is a graph whose nodes are entities and whose edges are relations. Knowledge graphs represent knowledge using RDF-style triples (h, r, t), which describe a relation r between a head entity h and a tail entity t. For instance, "Beijing is the capital of China" can be represented as (Beijing, capitalOf, China). We therefore extract RDF-style triples from legal judgments to build the legal knowledge graph. Taking advantage of the standard format of legal judgments, we extract triples based on several simple manual rules, listed in Table 2. Although the rule-based method extracts some necessary information, such as the plaintiff and defendant, much information is too complex to be extracted by rules alone.
To tackle this problem, we additionally adopt a named entity recognition (NER) method to identify entities hidden in sentences. NER is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as names of persons, organizations and locations, expressions of times, quantities, monetary values, and percentages. Currently, the most popular approach to NER combines a Long Short-Term Memory (LSTM) network with a Conditional Random Field (CRF) [7], and this is the approach used in this paper. Moreover, we obtain keywords and abstracts of legal judgments by means of TextRank [10] to enrich our legal knowledge graph. TextRank is a graph-based ranking model for text processing that can find both the most relevant sentences and the keywords in a text. Figure 4 illustrates the triple extraction and knowledge graph building process.
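To illustrate the keyword side of TextRank: words co-occurring within a small sliding window form an undirected graph, and an iterative PageRank scores each word. The sketch below is a simplified stand-in (it omits the stop-word and part-of-speech filtering a production pipeline would use), not the exact configuration used in the paper.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, top_k=5, d=0.85, iterations=30):
    """TextRank keyword extraction: link words co-occurring within a
    sliding window, then rank them with a plain PageRank iteration."""
    # Build the undirected co-occurrence graph.
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if w != tokens[j]:
                neighbors[w].add(tokens[j])
                neighbors[tokens[j]].add(w)
    # Standard PageRank update on the unweighted graph.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        score = {w: (1 - d) + d * sum(score[v] / len(neighbors[v])
                                      for v in neighbors[w])
                 for w in neighbors}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]
```

Well-connected words (those co-occurring with many distinct neighbors) accumulate the highest scores and are returned as keywords.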

Fig. 4. An example of extracting triples and building a knowledge graph from legal judgments. The left part is an excerpt of a legal judgment and the right part is a knowledge graph consisting of triples extracted from the judgment. Different relations are labelled in different colors. (Color figure online)

Table 2. Extraction rules in legal judgments.

Extraction rule | Example sentence   | Triple
Plaintiff #     | Plaintiff: Rose    | {"subject": "xxx case", "predicate": "plaintiff", "object": "Rose"}
Defendant #     | Defendant: Jack    | {"subject": "xxx case", "predicate": "defendant", "object": "Jack"}
Judge #         | Judge: Smith       | {"subject": "xxx case", "predicate": "judge", "object": "Smith"}
Court clerk #   | Court clerk: Nathy | {"subject": "xxx case", "predicate": "court clerk", "object": "Nathy"}
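The "Field: value" rules of Table 2 translate directly into regular expressions. The case identifier and the exact patterns below are illustrative, assuming one field per line as in the standard judgment format:

```python
import re

# Map the field label in a judgment line to the triple predicate.
RULES = {
    "Plaintiff": "plaintiff",
    "Defendant": "defendant",
    "Judge": "judge",
    "Court clerk": "court clerk",
}

def extract_triples(case_id, judgment_text):
    """Apply the Table-2 style rules: each 'Field: value' line yields
    one RDF-style triple about the case."""
    triples = []
    for field, predicate in RULES.items():
        for m in re.finditer(rf"{field}:\s*(\S+)", judgment_text):
            triples.append({"subject": case_id,
                            "predicate": predicate,
                            "object": m.group(1)})
    return triples
```

Such rules capture the fixed header fields; free-text passages are left to the NER step described above.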

3.3 Building a Hybrid Knowledge Network


Having constructed the network of legal terms and the legal knowledge graph, we use entity links in legal judgments to combine the two into our legal hybrid knowledge network. We use the LSTM+CRF-based method to recognize named entities, and then use string matching to map these named entities to legal entities, producing candidates. Note that a named entity usually corresponds to only one candidate entity. If more than one candidate entity corresponds to the same named entity, we compute the correlation between each candidate entity and the named entity and select the most relevant candidate. Figure 5 shows an example of the Chinese Legal Hybrid Knowledge Network.
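The candidate-selection step can be sketched as follows. Since the paper does not specify its correlation measure, Jaccard overlap between the mention's context words and each candidate's description is used here as an illustrative stand-in; all names below are ours.

```python
def link_entity(mention, context_words, candidates):
    """Map a recognized mention to an entity in the legal terms network.

    candidates: dict entity name -> set of words describing the entity
    (e.g. drawn from its encyclopedia page).  With a single string
    match we link directly; with several, we pick the candidate whose
    description best overlaps the mention's context.
    """
    matched = [e for e in candidates if mention in e or e in mention]
    if not matched:
        return None
    if len(matched) == 1:
        return matched[0]
    ctx = set(context_words)
    def jaccard(entity):
        desc = candidates[entity]
        union = ctx | desc
        return len(ctx & desc) / len(union) if union else 0.0
    return max(matched, key=jaccard)
```

In the common single-candidate case the matching is direct; the overlap score only decides ties among multiple candidates.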

4 Evaluation
In this section, we evaluated three algorithms used in constructing the Chinese
Legal Hybrid Knowledge Network.

4.1 Evaluation of Legal Related Entity Classification


While constructing the network of legal terms, we obtained 29,706 positive samples and 1,037,163 negative samples. For better performance, we

Fig. 5. An example of the Chinese legal hybrid knowledge network. We link the legal entities in a legal judgment to the network of legal terms.

built five classifiers in total, each trained with a different feature representation: one-hot, TF-IDF [15], LDA [3], doc2vec [8], and word2vec [11], respectively. As mentioned above, we used an SVM to perform binary classification. The training dataset included the 29,706 positive samples and 29,706 negative samples; the negative samples were randomly drawn from the full negative pool, and the drawn samples were removed from the pool each time. The performance of the five classifiers with their different features is presented in Fig. 6.
As shown in Fig. 6, all of our classifiers achieve high accuracy, recall and F1 values, which verifies the effectiveness of our methodology.

4.2 Evaluation of Named Entity Recognition


To evaluate the effect of NER, we randomly selected 1,000 legal judgments as our test dataset. Three well-educated annotators with expertise in law were invited to find all named entities in the dataset, which served as the gold labels. All legal judgments were annotated in IOB (short for Inside, Outside, Beginning) format, and each word was tagged as Other or with one of three entity types: Person, Location or Organization. Each line in the file represents one token with two fields: the word itself and its named entity type. In addition, we extracted chunks from the source files and represented every chunk as a three-element tuple: (chunk, type, start position).
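Converting the IOB-tagged tokens into (chunk, type, start position) tuples takes a single pass over the sequence. A minimal sketch, joining tokens without spaces as is appropriate for Chinese text:

```python
def iob_to_chunks(tokens, tags):
    """Convert IOB-tagged tokens into (chunk, type, start) tuples.
    Tags are 'B-TYPE' / 'I-TYPE' / 'O' as in the annotated judgments."""
    chunks, current, ctype, start = [], [], None, None
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            if current:  # close any chunk still open
                chunks.append(("".join(current), ctype, start))
            current, ctype, start = [tok], tag[2:], i
        elif tag.startswith("I-") and current:
            current.append(tok)  # continue the open chunk
        else:  # 'O' (or a stray 'I-') ends any open chunk
            if current:
                chunks.append(("".join(current), ctype, start))
            current, ctype, start = [], None, None
    if current:
        chunks.append(("".join(current), ctype, start))
    return chunks
```

The resulting tuples are exactly the representation compared against the gold annotations in the scoring step.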

Fig. 6. The performance curves of SVM with different features.

We applied LSTM+CRF to recognize named entities. The tool first tokenized the sentences, analyzed them, and created the corresponding lists of tuples. We then compared the output list against the gold-standard list.
We scored NER with exact matching on entity type, which measures the method's capability for accurate named entity detection. We counted TP, FP and FN, and calculated precision and recall as follows:

$$Precision = \frac{TP}{TP + FP} \quad (2)$$

$$Recall = \frac{TP}{TP + FN} \quad (3)$$

Our method proved effective: compared with the annotated data, the precision was 91% and the recall was 87%.
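Exact-match scoring per Eqs. (2) and (3) reduces to set operations over the chunk tuples; a minimal sketch:

```python
def score_exact_match(predicted, gold):
    """Exact-match NER scoring: a prediction counts as a true positive
    only if its full (chunk, type, start) tuple appears in the gold
    annotations."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    fp = len(pred - ref)
    fn = len(ref - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because matching is on the whole tuple, a chunk with the right text but the wrong type or span counts as both a false positive and a false negative.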

4.3 Evaluation of Entity Linking


In this experiment, we randomly selected 1,000 legal judgments, in which 26,000 entities were labelled by law experts. The entity linking tool we used was the Fast Entity Linker Toolkit (FEL)4. We used precision, recall and F1 measure to evaluate the effect of entity linking, calculated as follows:

$$Precision = \frac{|S_1 \cap T_1|}{|S_1|} \times 100\% \quad (4)$$

$$Recall = \frac{|S_1 \cap T_1|}{|T_1|} \times 100\% \quad (5)$$

$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \quad (6)$$

4 https://github.com/yahoo/FEL.

where S1 is the set of entities linked by our method and T1 is the labelled entity set. The results are shown in Table 3.

Table 3. The performance of entity linking.

Precision Recall F1
86.21% 89.69% 87.91%

As shown in Table 3, FEL performs well in precision, recall and F1 measure.

5 Conclusion and Future Work

In this paper, we proposed a methodology for constructing a Chinese Legal Hybrid Knowledge Network. We first built a network of legal terms based on Chinese encyclopedia information and a legal knowledge graph based on the legal knowledge extracted from online legal judgments. We then linked the network of legal terms and the legal knowledge graph through legal entities to compose our final legal hybrid knowledge network. Moreover, we evaluated the algorithms used to build the legal hybrid knowledge network on a real-world dataset; the results showed the effectiveness of these algorithms.
In the future, we will extend our Chinese legal hybrid knowledge network by incorporating more information. Moreover, we will try to apply it to other legal applications, such as legal question answering and similar case recommendation.

Acknowledgement. This work was supported by the National Key R&D Program of China (2018YFC0830200) and a National Natural Science Foundation of China Key Project (U1736204).

References
1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia:
a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC -2007.
LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52
2. Benjamins, V.R., Casanovas, P., Breuker, J., Gangemi, A.: Law and the Semantic Web: Legal Ontologies, Methodologies, Legal Information Retrieval, and Applications, vol. 3369. Springer, Heidelberg (2005). https://doi.org/10.1007/b106624
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn.
Res. 3(Jan), 993–1022 (2003)
4. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collabo-
ratively created graph database for structuring human knowledge. In: Proceedings
of the 2008 ACM SIGMOD International Conference on Management of Data, pp.
1247–1250. ACM (2008)

5. Do, P.K., Nguyen, H.T., Tran, C.X., Nguyen, M.T., Nguyen, M.L.: Legal ques-
tion answering using ranking SVM and deep convolutional neural network. arXiv
preprint arXiv:1703.05320 (2017)
6. Filtz, E.: Building and processing a knowledge-graph for legal data. In: Blomqvist,
E., Maynard, D., Gangemi, A., Hoekstra, R., Hitzler, P., Hartig, O. (eds.) ESWC
2017. LNCS, vol. 10250, pp. 184–194. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58451-5_13
7. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: probabilistic
models for segmenting and labeling sequence data (2001)
8. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In:
International Conference on Machine Learning, pp. 1188–1196 (2014)
9. Lenat, D.B.: CYC: a large-scale investment in knowledge infrastructure. Commun.
ACM 38(11), 33–38 (1995)
10. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the
2004 Conference on Empirical Methods in Natural Language Processing (2004)
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
12. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11),
39–41 (1995)
13. Montiel-Ponsoda, E., Gracia, J., Rodrı́guez-Doncel, V.: Building the legal knowl-
edge graph for smart compliance services in multilingual Europe. In: CEUR work-
shop proceedings No. ART-2018-105821 (2018)
14. Niu, X., Sun, X., Wang, H., Rong, S., Qi, G., Yu, Y.: Zhishi.me - weaving Chinese linking open data. In: Aroyo, L., et al. (eds.) ISWC 2011. LNCS, vol. 7032, pp. 205–220. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25093-4_14
15. Ramos, J., et al.: Using TF-IDF to determine word relevance in document queries.
In: Proceedings of the First Instructional Conference on Machine Learning, Pis-
cataway, NJ, vol. 242, pp. 133–142 (2003)
16. Sánchez A, V.D.: Advanced support vector machines and kernel methods. Neuro-
computing 55(1–2), 5–20 (2003)
17. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge.
In: Proceedings of the 16th International Conference on World Wide Web, pp.
697–706. ACM (2007)
18. Wang, Z., et al.: XLore: a large-scale English-Chinese bilingual knowledge graph. In: International Semantic Web Conference (Posters & Demos), vol. 1035, pp. 121–124 (2013)
19. Xu, B., et al.: CN-DBpedia: a never-ending Chinese knowledge extraction system.
In: Benferhat, S., Tabia, K., Ali, M. (eds.) IEA/AIE 2017. LNCS (LNAI), vol. 10351, pp. 428–438. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60045-1_44