Food Safety Knowledge Graph and Question Answering System

Li Qin, College of Informatics, Huazhong Agricultural University, Wuhan, China, 430070, +86 15607100234, qinli@mail.hzau.edu.cn
Zhigang Hao, College of Informatics, Huazhong Agricultural University, Wuhan, China, 430070, +86 13080336716, hzau111hzg@163.com
Liang Zhao*, College of Informatics, Huazhong Agricultural University, Wuhan, China, 430070, +86 15527838508, zhaoliang323@mail.hzau.edu.cn
ABSTRACT
Food safety has been a focus of public opinion in recent years. Every report of unqualified food can cause widespread panic and the spread of rumors, which has a great impact on social stability. This paper therefore crawled the officially released data on unqualified foods from recent years and designed algorithms to extract general food entities, food domain entities and the relationships between entities from these data. The extracted entity pairs were stored in the gStore database. To address the association of knowledge in the knowledge graph, this paper also designed a food safety ontology that organizes the concepts, classifications and relationships of food production and food inspection. Finally, this paper built an intelligent question answering system on top of gStore's HTTP service to help people obtain unqualified food information through natural language.

CCS Concepts
• Information systems → Information systems applications → Data mining.

Keywords
Food safety; knowledge graph; food ontology; HACCP ontology model; Question Answering System

1. INTRODUCTION
As China's economic aggregate continues to rise, the food industry has also made considerable progress. However, a series of problems have emerged at the same time, among which food safety is particularly important, because it relates not only to people's livelihood, social stability and economic development, but also to the government's prestige and international image [1]. The report of the 19th CPC National Congress clearly stated the need to “implement food safety strategies, and let the people eat with confidence”, and how to ensure food safety has become an important research topic. In recent years, the incidence of food safety incidents in China has fallen significantly, but it remains high [2]. Although the development of the Internet has made more and more information available to people, the geometric growth of Internet data and the disorder of network information obstruct the acquisition and utilization of information [3], which is also a problem for data storage and management. To address this problem, this paper collected information about unqualified foods exposed on the Internet and extracted the food names, manufacturers, production times, production addresses and unqualified items of these foods to construct a food safety knowledge graph. At the same time, an intelligent question answering system was built on the knowledge graph to facilitate analysis and research by consumers and scholars.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICIT 2019, December 20–23, 2019, Shanghai, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-7663-1/19/12…$15.00
DOI: https://doi.org/10.1145/3377170.3377260

2. Knowledge Graph and Related Research
The Knowledge Graph was first proposed by Google in 2012 to enhance the performance of its search engine. Knowledge graph research is a multidisciplinary method that combines the theories and methods of applied mathematics, graphics, information visualization technology, information science and other disciplines with the citation analysis and co-occurrence analysis of metrology. A knowledge graph can show the core structure, development history, frontier areas and overall knowledge structure of a discipline with visual maps. It is intended to describe the concepts, entities and events of the objective world and the relationships between them, and is essentially expressed as a semantic network [4]. Existing knowledge graph resources include DBpedia [5,6], YAGO [7,8], Wikidata [9], the domestic Zhishi.me [10] and so on.
In recent studies, knowledge graph technology has been applied to food safety. Niu Zhe, Tong Maodi et al. [11] studied consumer satisfaction with food safety through knowledge graphs. However, their research did not construct a knowledge graph; it only proposed application directions for knowledge graphs. Ye Jinzhu et al. [1] used the CiteSpace information visualization software to search and visualize the literature in the food field and also studied the future of food safety issues, but their research does not involve the construction of a knowledge graph either. To construct a food knowledge graph, this paper used knowledge graph technology to extract the entities and relationships from the unqualified food information crawled from the network, represented the results as RDF triples, and then stored the triples in a graph database.

3. Food Safety Knowledge Graph
The data used in this paper were mainly obtained from the network. First, these data were processed and analyzed with natural language processing (NLP), and the entities, relationships and attributes needed for the knowledge graph were extracted to form triples, in which the methods of extracting entities,

relationships and attributes were mainly based on rules and keywords. Although much of the existing literature [12-15] proposes methods for extracting entities, relationships and attributes based on machine learning and deep learning, extraction based on rules and keywords is still the best choice in applications that have a strong domain background and high precision requirements. The specific construction process of the knowledge graph is shown in Figure 1.

3.1 Data Collection
First, this paper used Python scripts to crawl public data on unqualified food from the Internet. According to the Notice of the General Administration of Market Supervision on Printing and Distributing the Food Safety Supervision and Inspection Plan for 2019 in China, all data were divided into 33 categories, and an initial text data set was built for each category. The text data sets were processed and imported into a PostgreSQL database for cleaning to obtain a structured text library.

Fig.1 Knowledge graph construction process

3.2 Entities Extraction
Entity extraction was divided into two parts. One was the extraction of "general entities" using named entity recognition technology [16-19] with Stanford's CoreNLP tool. The other was the extraction of "domain entities", mainly based on a keyword matching algorithm.

3.2.1 General entity identification
The Stanford CoreNLP tool can recognize seven entity subclasses: PERSON, ORGANIZATION, MISC, LOCATION, GPE, FACILITY, and DEMONYM. Processing the text with CoreNLP yields the segmented Chinese words of the text and the type of each word, so that entities such as organization, time (MISC), and location (GPE) can be extracted directly. However, Chinese word segmentation may split one entity into multiple words. For example, "Fujian Gloss Deshun Wine Co., Ltd." was divided into "Fujian", "Gloss", "Deshun", "Wine", "Co., Ltd". It was therefore necessary to splice adjacent words of the same type to obtain one entity. The specific algorithm is as follows:

Algorithm 1: General entity extraction algorithm.
Input: the set of segmented Chinese words $C\{c_i\}$, the types of the words $T\{t_i\}$, the locations of the words in the sentence $L\{l_i\}$, $i \in (1, 2, \ldots, n)$, and the specified type $T'$.
Output: the entity set $E\{e_j\}$ and the locations of the entities in the sentence $EL\{el_j\}$, $j \in (1, 2, \ldots, m)$.
Procedure:
(1) Create a cursor $P$ that points to each segmented word $c_i$ in turn.
(2) Determine whether the type $t_i$ of the $c_i$ pointed to by $P$ is equal to $T'$. If so, proceed to step (3); otherwise, $P$ points to the next word, and loop to step (2).
(3) Save the location $l_i$ of $c_i$ as the location of the entity $e_j$, expressed as $el_j = l_i$; at the same time let the starting position of the entity $e_j$ equal $l_i$, expressed as $l_{start} = l_i$.
(4) $P$ points to the next word and judges whether the type $t_i$ of $c_i$ is $T'$. If so, loop to step (4); if not, set the end position of the entity $l_{end} = l_i - 1$, combine the words between $l_{start}$ and $l_{end}$ to generate entity $e_j$, and continue to step (2) until $P$ has traversed all the words.

3.2.2 Domain entity identification
Domain entities are entities that cannot be directly labeled by CoreNLP, such as the names of foods produced by food companies and food spot-check items, so they need to be identified by text feature extraction. Taking the food name first, analysis of the grammar of the text showed that the food name in a Chinese sentence often follows the food company name, and that keywords such as “produce”, “sell” and “make” often appear between them. This paper therefore collected the high-frequency keywords occurring between food names and food companies and defined them as a keyword library named KeyDase. In addition, the name of a food often ends with a food category, such as liquor or can, so a food type library named FoodtypeDase also needs to be defined. Using the two libraries KeyDase and FoodtypeDase, the domain entity identification algorithm can obtain the starting and ending positions of the food name in the Chinese sentence and merge the words between them to generate a domain entity. Taking food name recognition as an example, the specific algorithm is as follows:

Algorithm 2: Food Name Recognition Algorithm.
Input: the set of segmented Chinese words $C\{c_i\}$, the positions of the words in the sentence $L\{l_i\}$, the company entities $GE\{ge_j, e_j \mid t_j = \text{"ORGANIZATION"}\}$, KeyDase, FoodtypeDase, $i \in (1, 2, \ldots, n)$, $j \in (1, 2, \ldots, m)$.
Output: the food entities $FE\{fe_k\}$ and their location information $FEL\{fel_k\}$, $k \in (1, 2, \ldots, q)$.
Procedure:
(1) Create a cursor $P$, pointing to $ge_j$.
(2) Determine whether there is a keyword in KeyDase behind $ge_j$. If so, move $P$ back to the $c_i$ adjacent to the keyword, save the location $l_i$ of $c_i$, and set $l_{start} = l_i + 1$, where $l_{start}$ is the start position of $fe_k$. At the same time note $fel_k = l_i + 1$, and proceed to step (3). If not, make $P$ point to the next $ge_j$ and loop to step (2).
(3) Make $P$ point to the next $c_i$.
(4) Determine whether $c_i$ is in FoodtypeDase. If not, continue to move $P$ forward and loop to step (4) until $P$ encounters a period; if so, preserve $l_i$ as the end position of $fe_k$, expressed as $l_{end} = l_i$.
(5) Connect the $c_i$ between $l_{start}$ and $l_{end}$ to obtain $fe_k$.

When unqualified spot-check items were extracted, it was necessary to establish a vocabulary of food spot-check items, named FoodSampleDase, from the food safety standard documents of China. FoodSampleDase was mainly based on the Notice of the General Administration of Market Supervision on Printing and Distributing the Food Safety Supervision and Inspection Plan for 2019 in China. The unqualified spot-check items present in FoodSampleDase were then extracted from the segmented word set $C$ and expressed as UQE.

3.3 Relationship extraction
This paper defined two relationships between entities: the production relationship between company entities and food entities, and the inspecting relationship between food entities and unqualified item entities. In addition, the attributes of entities included the food production address and the food production time. The extraction algorithm is rule-based. For a sentence containing $X, Y$ ($X, Y \in E$), where $E$ is the set of entities, the extraction rules are defined as follows:

(1) IF $X \in GE\{ge, e \mid t = \text{"ORGANIZATION"}\}$ AND
$Y \in FE\{fe, e \mid t = \text{"FOOD"}\}$ AND
$kw\{x \mid x \in \mathrm{Between}(X, Y)\} \cap GE = nil$ AND
$kw \cap \mathrm{KeyDase} \neq nil$
THEN a production relationship between X and Y.
(2) IF $X \in FE\{fe, e \mid t = \text{"FOOD"}\}$ AND
$Y \in SE\{se, e \mid t = \text{"FOODSAMPLE"}\}$ AND
$kw\{x \mid x \in \mathrm{Between}(X, Y)\} \cap FE = nil$ AND
$kw \cap \{\text{"Detect"}, \text{"Check"}, \text{"Sampling"}, \ldots\} \neq nil$
THEN an inspecting relationship between X and Y.
(3) IF $Y \in AE\{ae, e \mid t = \text{"GPE"}\}$ AND $\mathrm{Distance}(X, Y) < TD$
THEN an attribute relationship between X and Y.
(4) IF $X \in FE\{fe, e \mid t = \text{"FOOD"}\}$, $Y \in DE\{de, e \mid t = \text{"MISC"}\}$
AND $\mathrm{Distance}(X, Y) < TD$
THEN an attribute relationship between X and Y.
Note: ‘Between(X, Y)’ represents the segmented Chinese words between X and Y, ‘Distance(X, Y)’ represents the textual distance between X and Y, and TD is an empirical value representing the threshold of the textual distance.

3.4 RDF triples mapping
The data model commonly used in knowledge graphs is the RDF triple graph, so the extracted entities and relationships must be converted into RDF before the knowledge can be stored and used in the graph database.

3.4.1 RDF graph concept
The RDF graph is defined as a finite set of triples (s, p, o). Each triple represents a declarative sentence, where s is the subject, p is the predicate, and o is the object; (s, p, o) indicates that there is a relationship p between s and o, or that s has the attribute p with value o [20]. The following are some of the triples created in this paper. The structure of the RDF graph is shown in Figure 2.

Fig.2 An example of RDF graph

G = {(Fujian Guangzhedeshun Wine Co., Ltd., name, "Fujian Guangzhedeshun Wine Co., Ltd."),
(Fujian Guangzhedeshun Wine Co., Ltd., production, Wuyuanchun Liquor),
(Wuyuanchun Liquor, name, "Wuyuanchun Liquor"),
(Wuyuanchun Liquor, PD, "October 19, 2015"),
(Wuyuanchun Liquor, unqualified, Saccharin sodium),
(Saccharin sodium, name, "Saccharin sodium")}.

3.4.2 Mapping Model
The conversion of relational data into RDF is mainly implemented through mapping rules written in a mapping language. The main mapping languages are Direct Mapping [21,22] and R2RML [23,24]. Direct Mapping outputs RDF data directly according to the table names and field names defined in the database, so its mapping rules are simple; the R2RML language adopts SubjectMap, PredicateMap, ObjectMap and RefObjectMap [25], which gives it a high degree of flexibility. This paper used D2RQ to generate mapping rules and adjusted them according to the structure of the tables and the relationships between the tables, shown in Figure 3. The mapping model was divided into the mapping of the entity tables and the mapping from the entity tables to the relational tables.

Fig. 3 Structure and relationship of tables

3.5 Ontology Construction
This paper took a bottom-up approach to building the knowledge graph. To make the structure of the knowledge graph clearer and more complete, it was necessary to establish a suitable ontology based on the extracted data. The project has established a food safety ontology.

Fig.4 An unqualified food ontology

The ontology organizes several categories such as Food, Food Hazards, and Food Inspection Items based on China's national food safety standards, and then maps them to the HACCP system for food production. The specific structure is shown in Figure 4.

3.6 Storage and Visualization
The knowledge graph is stored in a graph database. Commonly used graph databases include Neo4j [26], developed by Neo Technology in the United States; gStore [27], developed by Peking University; GraphDB, developed by Ontotext in Bulgaria; and Jena, developed by Apache. This paper uses gStore as its graph database, with SPARQL [28-30] as the corresponding query language. SPARQL is the standard query language for the RDF model developed by the W3C. It introduces the property path mechanism [31] to support navigational queries on the RDF graph. SPARQL was used to query the triples stored in gStore. The visualization of the knowledge graph is shown in Figure 5.

Fig.5 Knowledge graph visualization

4. Question Answering System
The question answering system is based on semantic search and is built on the HTTP service of gStore. By parsing the user's question, the food in the question was mapped to an entity in the knowledge graph, and the food attributes were mapped to the relationships and attributes in the knowledge graph. At the same time, the food company, the address, and the time were mapped to attribute values in the knowledge graph. Next, the mapping result was matched to a specific question template, converted into a SPARQL query statement and submitted to the server. Finally, the results were presented in the web interface.

4.1 Question Analysis
The analysis of the user's question mainly included the segmentation of Chinese words, the merging of segmented words, and the judgement of whether the merged words are key information, such as the name of a company, the name of a food, or a keyword that marks an attribute. The algorithm is similar to Algorithm 1 and Algorithm 2.

4.2 Problem Template Construction
This paper analyzes the user's question and extracts the key information in it. The structure of the key information is used to judge the type of the question, and questions of the same type are abstracted into templates that generate the SPARQL queries, such as:
(1) For questions such as "What is the manufacturer of Wuyuanchun Liquor", which contain the key information "Wuyuanchun Liquor" and “manufacturer”, the entity name and attribute name were abstracted as {S:P}. The SPARQL command is as follows:

PREFIX vocab: <file:///home/haozhigang/d2rq-0.8.1#>
SELECT ?o WHERE { vocab:S vocab:P ?o. }

In the SPARQL command, S was the subject of the triple, P was the predicate of the triple, and ?o was the value being queried. The query results are shown in Figure 6.

Fig.6 A query example based on {S:P} template

(2) For questions such as "Which foods have a food problem that is the same as the unqualified reasons of Wuyuan Chunbai Liquor?", the key information was "food problem", "Wuyuan Chunbai Liquor" and "unqualified reasons", and the question template was {P1:S:P2}. The SPARQL command is as follows:

PREFIX vocab: <file:///home/haozhigang/d2rq-0.8.1#>
SELECT ?s WHERE { vocab:S vocab:P2 ?o. ?s vocab:P1 ?o. }

The query joins two triple patterns. The attribute value retrieved with S and P2 is used as the attribute value of a second triple, which is then matched with P1; the entity being sought is bound to ?s. The query results are shown in Figure 7.
Fig.7 A query example based on {P1:S:P2} template
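The two templates above can be filled in mechanically once the key information has been matched. The following is a minimal sketch of that step, not the authors' implementation: the function names are hypothetical, and the rule that turns a matched name into a `vocab:` term (spaces replaced by underscores) is an assumption, since the paper does not state how names are encoded in the D2RQ vocabulary.

```python
# Illustrative sketch only: helper names and the name-to-term rule are assumptions.
VOCAB_PREFIX = "PREFIX vocab: <file:///home/haozhigang/d2rq-0.8.1#>"


def _term(name: str) -> str:
    """Turn a matched name into a vocab: term (assumed rule: spaces -> underscores)."""
    return "vocab:" + name.strip().replace(" ", "_")


def build_sp_query(s: str, p: str) -> str:
    """{S:P} template: ask for the value ?o of attribute p of entity s."""
    return f"{VOCAB_PREFIX}\nSELECT ?o WHERE {{ {_term(s)} {_term(p)} ?o. }}"


def build_p1_s_p2_query(p1: str, s: str, p2: str) -> str:
    """{P1:S:P2} template: find entities ?s whose p1 value equals the p2 value of s."""
    return (
        f"{VOCAB_PREFIX}\n"
        f"SELECT ?s WHERE {{ {_term(s)} {_term(p2)} ?o. ?s {_term(p1)} ?o. }}"
    )


if __name__ == "__main__":
    print(build_sp_query("Wuyuanchun Liquor", "manufacturer"))
    print(build_p1_s_p2_query("food problem", "Wuyuan Chunbai Liquor", "unqualified reasons"))
```

Keeping one small builder per template mirrors the paper's idea that the structure of the key information, rather than the wording of the question, decides which query is generated.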

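As an illustration of the splicing step described in Algorithm 1 (Section 3.2.1), the sketch below merges runs of adjacent words that share the target type. It is a simplified reading of the algorithm under stated assumptions: the function name `merge_entities` is hypothetical, positions are 0-based indices into the word list, and the `sep` parameter is added for the English example (an empty separator would be the natural choice for Chinese text).

```python
from typing import List, Tuple


def merge_entities(tokens: List[str], types: List[str], target: str,
                   sep: str = " ") -> List[Tuple[str, int, int]]:
    """Splice adjacent words whose type equals `target` into single entities.

    Returns (entity_text, l_start, l_end) tuples with 0-based word positions.
    """
    entities = []
    i, n = 0, len(tokens)
    while i < n:
        if types[i] != target:       # step (2): skip words of another type
            i += 1
            continue
        start = i                    # step (3): remember l_start
        while i < n and types[i] == target:
            i += 1                   # step (4): advance over the run
        end = i - 1                  # l_end = l_i - 1
        entities.append((sep.join(tokens[start:end + 1]), start, end))
    return entities


if __name__ == "__main__":
    words = ["Fujian", "Gloss", "Deshun", "Wine", "Co., Ltd", "produce", "Wuyuanchun", "Liquor"]
    labels = ["ORGANIZATION"] * 5 + ["O", "O", "O"]
    print(merge_entities(words, labels, "ORGANIZATION"))
```

A single pass with an inner loop over each run keeps the work linear in the number of words, which matches the cursor-based description in the paper.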
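Rule (1) of Section 3.3 and the triple form of Section 3.4.1 can be connected with a short sketch. This is an illustrative reading, not the authors' code: the `Entity` tuple layout, the function name `production_relation`, and the requirement that the company precedes the food in the sentence are assumptions; the predicate name `production` follows the example set G in the paper.

```python
from typing import List, Optional, Set, Tuple

# Assumed layout: (text, type, l_start, l_end), positions are 0-based word indices.
Entity = Tuple[str, str, int, int]


def production_relation(tokens: List[str], x: Entity, y: Entity,
                        companies: List[Entity],
                        key_dase: Set[str]) -> Optional[Tuple[str, str, str]]:
    """Return (company, "production", food) if rule (1) holds, else None."""
    if x[1] != "ORGANIZATION" or y[1] != "FOOD" or x[3] >= y[2]:
        return None
    lo, hi = x[3] + 1, y[2]          # words strictly between X and Y
    between = tokens[lo:hi]
    # Between(X, Y) must not contain another company entity.
    if any(lo <= c[2] < hi for c in companies if c != x):
        return None
    # Between(X, Y) must contain at least one keyword from KeyDase.
    if not any(word in key_dase for word in between):
        return None
    return (x[0], "production", y[0])


if __name__ == "__main__":
    words = ["Fujian", "Gloss", "Deshun", "Wine", "Co., Ltd", "produce", "Wuyuanchun", "Liquor"]
    company = ("Fujian Gloss Deshun Wine Co., Ltd", "ORGANIZATION", 0, 4)
    food = ("Wuyuanchun Liquor", "FOOD", 6, 7)
    print(production_relation(words, company, food, [company], {"produce", "sell", "make"}))
```

The other rules follow the same shape: rule (2) swaps in food and spot-check entities with the inspection keywords, and rules (3) and (4) replace the keyword test with a comparison of the textual distance against the threshold TD.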
5. Summary and outlook
Using knowledge graph technology, this paper built a food safety knowledge graph from public data on the Internet and realized a question answering system for food safety that accepts natural language queries. However, since research on food ontology in Chinese is still rare, the ontology used in this system is still insufficient; its classes, instances and relationships all need to be further expanded and improved. On the other hand, as new knowledge is constantly added to the knowledge graph of this system, some conflicts and ambiguities arise in the food safety knowledge. In the next step, we will study how to solve this problem to achieve the effective integration of new knowledge with the original knowledge graph.

6. ACKNOWLEDGMENTS
The project is supported by the Fundamental Research Funds for the Central Universities (No. 2662017JC029) and the National Key Research and Development Project (No. 2018YFC1604000).

7. REFERENCES
[1] Ye Jinzhu, Shan Chu. 2018. Visualization Analysis of Food Safety Research in China Based on Knowledge Mapping. Anhui Agricultural Sciences. 46(02). 177-180.
[2] Hu Yinglian. 2016. China's Food Safety from the Perspective of Social Governance. Social Governance. 2. 36-42.
[3] Zhang Yang, Xie Zhuoli. 2014. Construction of Knowledge Graph Based on Aggregation of Multi-source Online Academic Information. Library and Information Service. 58(22). 84-94.
[4] Huang HengQi, Yu Juan, Liao Xiao, Xi YunJiang. 2019. Reviews on Knowledge Graph Research. Computer Systems & Applications. 28(6). 1-12.
[5] Bizer, C., Lehmann, J., Kobilarov, G., et al. 2009. DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics. 7(3). 154-165. DOI=https://doi.org/10.1016/j.websem.2009.07.002
[6] Auer, S., Bizer, C., Kobilarov, G., et al. 2007. DBpedia: A Nucleus for a Web of Open Data. International Semantic Web Conference (ISWC 2007). 722-735.
[7] Suchanek, F. M., Kasneci, G., Weikum, G. 2007. Yago: A Core of Semantic Knowledge. Proceedings of the 16th International Conference on World Wide Web. 697-706.
[8] Suchanek, F. M., Kasneci, G., Weikum, G. 2008. Yago: A Large Ontology from Wikipedia and WordNet. Journal of Web Semantics. 6(3). 203-217. DOI=https://doi.org/10.1016/j.websem.2008.06.001
[9] Vrandečić, D., Krötzsch, M. 2014. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM. 57(10). 78-85.
[10] Niu, X., Sun, X., Wang, H., et al. 2011. Zhishi.me - Weaving Chinese Linking Open Data. International Semantic Web Conference (ISWC 2011), The Semantic Web. 205-220.
[11] Niu Zhe, Tong Maodi, Chen Tingqiang, et al. 2018. Research Progress on Consumer Food Safety Satisfaction Based on Knowledge Graph. Science and Technology of Food Industry. 39(24). 227-233.
[12] E Haihong, Zhang Wenjing, Xiao Siqi, et al. 2019. Survey of Entity Relationship Extraction Based on Deep Learning. Journal of Software. 30(06). 1793-1818.
[13] Li, F., Yu, H. 2019. An Investigation of Single-Domain and Multidomain Medication and Adverse Drug Event Relation Extraction from Electronic Health Record Notes Using Advanced Deep Learning Models. Journal of the American Medical Informatics Association: JAMIA. 26(7). 646-654. DOI=https://doi.org/10.1093/jamia/ocz018
[14] Christopoulou, F., Tran, T., Sahu, S., Miwa, M., Ananiadou, S. 2019. Adverse Drug Events and Medication Relation Extraction in Electronic Health Records with Ensemble Deep Learning Methods. Journal of the American Medical Informatics Association: JAMIA.
[15] Li, J., Huang, G.M., Chen, J.H., Wang, Y.B. 2019. Dual CNN for Relation Extraction with Knowledge-Based Attention and Word Embeddings. Computational Intelligence and Neuroscience. 1-10. DOI=https://doi.org/10.1155/2019/6789520
[16] Chowdhury, S., Dong, X., Qian, L., Li, X., Guan, Y., Yang, J., et al. 2018. A Multitask Bi-directional RNN Model for Named Entity Recognition on Chinese Electronic Medical Records. BMC Bioinformatics. 19:499.
[17] Zhang, Y., Wang, X. W., Hou, Z., et al. 2018. Clinical Named Entity Recognition from Chinese Electronic Health Records via Machine Learning Methods. JMIR Medical Informatics.
[18] Wang, X., Li, Y., He, T., Jiang, X., Hu, X. 2018. Recognition of Bacteria Named Entity Using Conditional Random Fields in Spark. BMC Systems Biology. 12(S6). DOI=https://doi.org/10.1186/s12918-018-0625-3
[19] Radhakrishnan, Priya. 2018. Named Entity Extraction for Knowledgebase Enhancement. ACM SIGIR Forum. 52(1). 169-170.
[20] Wang, X., Zou, L., Wang, C.K., et al. 2019. Research on Knowledge Graph Data Management: A Survey. Journal of Software. 7. 2139-2174.
[21] Cao, Q., Zhao, Y.M. 2015. Technology Implementation Process and Related Applications of Knowledge Mapping. Information Studies: Theory & Application. 38(12). 127-132.
[22] World Wide Web Consortium. 2012. A Direct Mapping of Relational Data to RDF. DOI=http://hdl.handle.net/10421/7490
[23] Rodríguez-Muro, M., Rezk, M. 2018. Efficient SPARQL-to-SQL with R2RML Mappings. Social Science Electronic Publishing. DOI=http://dx.doi.org/10.2139/ssrn.3199192
[24] Kyzirakos, K., Savva, D., Vlachopoulos, I., Vasileiou, A., Karalis, N., Koubarakis, M., et al. 2018. GeoTriples: Transforming Geospatial Data into RDF Graphs Using R2RML and RML Mappings. Social Science Electronic Publishing. DOI=http://dx.doi.org/10.2139/ssrn.3248492
[25] Huang, H.Q., Yu, J., Liao, X., Xi, Y.J. 2019. Review on Knowledge Graphs. Computer System & Applications. 28(6). 1-12.
[26] Perçuku, A., Minkovska, D., Stoyanova, L. 2017. Modeling and Processing Big Data of Power Transmission Grid Substation Using Neo4j. Procedia Computer Science. 113. 9-16.
[27] Zou, L., Özsu, M. T., Chen, L., Shen, X., Zhao, D. 2014. gStore: A Graph-Based SPARQL Query Engine. The VLDB Journal. 23(4). 565-590.
[28] Zhang, X., Meng, C., Zou, L. 2018. Expressivity Issues in SPARQL: Monotonicity and Two-versus Three-Valued Semantics. Science China (Information Sciences). 61(12). 187-189.
[29] Zhang, W. E., Sheng, Q. Z., Qin, Y., Taylor, K., Yao, L. 2018. Learning-Based SPARQL Query Performance Modeling and Prediction. World Wide Web. 21(4). 1015-1035.
[30] Minjae, S., Oh, H., Seo, S., Lee, K.H. 2019. Map-Side Join Processing of SPARQL Queries Based on Abstract RDF Data Filtering. Journal of Database Management (JDM). 30. 22-40.
[31] Kostylev, E. V., Reutter, J. L., Romero, M., Vrgoč, D. 2015. SPARQL with Property Paths. International Semantic Web Conference (ISWC 2015). 3-18.
