Context Aware and Personalized Citation Recommendation
System
Computer Science and Technology
Summary
Citations are an essential part of any paper or book: they show respect for the original work when an author draws on existing knowledge, and they make it convenient for readers to trace ideas back to their sources. However, as scientific research deepens and the number of researchers grows, the number of papers is expanding sharply. As a result, authors must spend a great deal of time identifying and adding citations while writing, a cumbersome process with little creativity. This paper builds an automated citation recommendation system to solve this problem.
The system automatically recommends a list of papers to cite based on the citation context, saving researchers a great deal of time; it therefore has considerable practical value in scientific writing. In addition, a citation recommendation system can be understood as a combination of a retrieval system and a recommender system, which gives this topic significant research value.
Citation recommendation is a relatively new problem. In past research it was treated as a variant of a retrieval system, with recommendations made purely from the content of the citation context. In this paper, citation recommendation is instead treated as a personalized recommendation process: recommendations are based not only on content but also on the preferences of individual researchers.
This paper constructs a personalized citation recommendation model, the PCR model, which uses a user's publication and citation history and combines it with existing content-based recommendation methods. The PCR model, which combines user citation tendency with content relevance, improves on the latest content-based translation model by 31.67% in recall@10 and 27.65% in MAP.
Context Aware and Personalized Citation Recommendation
System
Abstract
Citations are a necessary part of papers and books; they show the authors' respect for the original works, and they help readers learn about related information. But with the development of science, more and more researchers are doing research work, which makes the number of papers grow rapidly. When authors compose their papers, they need a lot of time to add and confirm citations; it is a time-consuming procedure with little creativity. This paper builds an automatic citation recommendation system to solve the problem.
Our system analyzes the citation context and automatically recommends a list of papers for researchers to cite. The system can save researchers a great deal of time, which means substantial practical value; citation recommendation can also be understood as a combination of a retrieval system and a recommender system, which gives it significant research value.
Citation recommendation is a relatively new problem. In the past, researchers treated it as a variant of a retrieval system and made recommendations according to the content of the citation context. Here, we treat citation recommendation as a personalized recommendation procedure: we also take the author's preferences into consideration and make personalized citation recommendations.
In this paper, we use the author's publication and citation history, combine it with existing content-based recommendation methods, and build a personalized citation recommendation model, the PCR model. The model obtains a 31.67% performance improvement in recall@10 and a 27.65% improvement in MAP compared with the state-of-the-art method.
Contents
1. Introduction
  1.1 Research background
  1.2 Research content
    1.2.1 Obtaining user data and building user information
    1.2.2 Modeling user information
    1.2.3 Combining user information with content-based methods
  1.3 Document organization
  1.4 Paper recommendation
  1.5 Citation recommendation
  1.6 Support Vector Machine (SVM)
2. CACR
  2.1 Related concepts
  2.2 Citation behavior analysis
  2.3 UTD (User Tendency Degree)
    2.3.1 Building user information
    2.3.2 Building the UTD from user information
  2.4 CRD (Content Relevance Degree)
    2.4.1 Language model
    2.4.2 Translation model
  2.5 Combining UTD and CRD
    2.5.1 Score normalization
    2.5.2 Combining multiple scores
3. Experiment design
  3.1 Data set
    3.1.1 Data requirements
    3.1.2 Data acquisition procedure
    3.1.3 MAS API
    3.1.4 Data acquisition tips
    3.1.5 Data preprocessing
  3.2 Evaluation method
  3.3 Experiment framework
  3.4 Experiment results
  3.5 Parameter tuning
  3.6 Feature analysis
4. Summary and future work
  4.1 Summary
  4.2 Future work
Figures
Figure 1.1 Citation recommendation system workflow
Figure 1.2 Growth of the total number of papers over time
Figure 1.3 A citation context sample
Figure 1.4 User reference preference positions
Figure 3.8 Recall rate as a function of position in the result list
Tables
Table 3.1 Abbreviation descriptions
Table 3.2 Levels of recommendation and ways of extension
Table 3.3 MAS API
Chapter 1 Introduction
There is no doubt that researchers need to know the background and the latest progress in their field and identify the shortcomings of current research before carrying out their own work. The work researchers do today is rarely separable from their predecessors' results: they are either engaged in research related to those results or build directly upon them. Therefore, when researchers publish their results in the form of papers, they often need to include a large number of citations, on the one hand to show respect for their predecessors' work, and on the other hand to provide readers with sufficient background information.
However, citing well is not easy. With the development of science and technology, information overload has reached the literature as well. Among a huge number of papers, finding the exact literature that a result mentioned in one's paper should cite is no easy task. There are two ways researchers come to know a scientific result. One is through books, courses, lectures, or other non-paper channels. In this case, when they want to cite the result in their own paper, finding the original source is laborious: they often need to read a large number of papers, expand their reading along the citation relationships between papers, and spend a lot of time locating the source of the result mentioned in their paper. The other way is to read the original paper that introduced the result. In this case it is relatively easy to cite the result, but because many researchers have read a large number of papers, it is difficult to recall a specific paper when the result comes up. Even junior researchers who have read only a small number of papers cannot clearly remember the title, authors, and venue of every paper. Therefore, when researchers need to add references, they almost always have to search through all kinds of information manually, spend a lot of time finding papers, and often do a great deal of unnecessary extended reading; this is undoubtedly a waste of researchers' valuable time.
Several large-scale systems contain a huge amount of paper data and features such as reference counts and conference article lists, for example Google Scholar and Microsoft Academic Search. Google Scholar is the paper retrieval system most familiar to researchers, offering a free paper search service and indexing most of the world's published academic journals. Google cooperates with ACM, Nature, IEEE, OCLC, and many other publishers, and its paper retrieval is excellent in both data volume and query speed. For users who simply want to look up a paper, the functionality it provides is sufficient. However, for citations, Google Scholar provides only two simple features: citation-format generation and reference counting. Microsoft Academic Search is also an excellent paper search engine and provides more functionality, such as classifying papers by field and showing the history of each article's reference counts. For citations, Microsoft Academic Search provides stronger support: it extracts the list of references from each paper and the citation contexts in which other papers cite it, which is described in more detail later in this article. Besides these two larger paper search engines, there are many smaller citation search engines, such as CiteSeer. However, no available system performs citation recommendation.
Imagine a system, as shown in Figure 1.1, such that when you are writing a paper, you only need to write about existing research results based on the knowledge you already have, marking in the manuscript the locations where citations should be added. After the manuscript is entered into the citation recommendation system, the system automatically recommends a list of several candidate citations based on what you wrote. Sometimes, when you cannot remember exactly which paper you want to cite, you only need to write a short description of what you are doing, and the system can still recommend a list of possible citations. To add a reference, you only need to select a paper from the candidate list. This process is labor-saving and convenient; it greatly reduces the workload and improves accuracy. With accurate citation recommendation, authors can quickly add citations while writing; moreover, extensions of citation recommendation technology could recommend research with similar ideas based on a researcher's brief description of their own idea, helping researchers learn the frontier of a specific research direction, and much more.
The sheer amount of data caused by information overload in the paper field makes citation recommendation not only necessary but also challenging.
Figure 1.2 Growth of the total number of papers over time
Building such a system poses many challenges and requires solving many problems; this is the main content of this article. The following briefly lists the issues encountered in personalized citation recommendation and their solutions.
Here, this paper divides researchers into two categories: junior researchers with no publication history, and senior researchers with a publication history. Junior researchers have not worked in research for long and often have not yet formed particular citation preferences; therefore, their demand for personalized citation recommendation is not strong. Researchers with a publication history, however, generally have certain citation preferences, and these preferences need to be considered when recommending citations. As a result, the target users who actually need personalized citation recommendation are senior researchers with a history, and a senior researcher's history is the most representative personalized information. This paper first attempts to use a researcher's past publications to build user information, and then uses this user information to make recommendations.
A user's published papers are relatively easy to obtain: there are many open resources on the Internet for downloading them, and Microsoft Academic Search and Google Scholar provide functional support. The user's citation preferences can then be analyzed from these existing publications. In this way, when the citation recommendation system comes online, it can provide personalized citation recommendation without user intervention; and with existing methods, the system can continuously adjust the user information and the recommendation process to further optimize the recommendation algorithm.
The system would recommend papers to users, and for a paper the user has read carefully it would also recommend similar ones. This article aims to quantify how these user interests develop within the recommendation process, producing a quantitative standard on which user information can be modeled. This article holds that the user already has a preference for each paper before writing the paper; therefore, it uses the Bayes formula to combine the user's preference with the content of a citation context.
Once user information is combined with content-based methods, the various citation-tendency indicators must also be combined. These indicators reflect user preferences, which form over the long term and can be considered the result of many kinds of information; here, this article uses an SVM to combine them. The main constituent elements of the framework are thus the Bayes formula and the SVM, which are described in detail below.
Chapter 2 Related Work
These approaches did not achieve the desired results, so methods better suited to paper recommendation are needed.
Because of the particular nature of papers, content can be used for recommendation [7]. Some researchers construct groups of words of interest and analyze their frequency in each article to make recommendations, which is in fact a variant of the TF*IDF technique used in search engines. The language model [2] is also often used in content-based methods, ranking recommended papers by the probability that each paper generates the user's words of interest. In addition, researchers have used the vector space model [8] to calculate the similarity between the user's vocabulary of interest and an article's vocabulary, and thereby determine the recommendation degree of different articles.
Beyond these basic recommendation methods, recent years have seen some innovative approaches to paper recommendation. K. Chandrasekaran et al. [9] further refined content-based recommendation, recommending papers to users based on what the users had read in CiteSeer. They neither used a bag-of-words model to represent users or articles directly nor used cosine similarity between users and articles. Instead, they represented both users and articles as hierarchical concept trees and described the similarity between a user and a document by edit distance. Compared with traditional content-based article recommendation, this method is a great improvement.
B. Shaparenko et al. [10] were not satisfied with recommending the paper itself; they hoped to extract the essential parts of a paper and provide them for the user to read. They solved the problem with an unsupervised method, analyzing with a language model, approximating with convex programming, and finally computing cosine similarity over the full text to obtain the key passages of the paper and recommend them. S. McNee et al. [11] hoped to solve the cold-start problem effectively when establishing a paper recommendation system. They used the citation relationships among existing researchers, the citation relationships among papers, and other interconnection information as the starting data for collaborative filtering, so that the system could make good recommendations when first run.
K. Sugiyama et al. [12] used the references of a user's published papers and the papers citing them to build the user's neighboring papers and neighboring authors, then combined the collaborators on the user's published papers with the content of those papers to build the user's personal profile. Based on an analysis of the user's recent interests, the similarity between the user's personal profile and other documents serves as the main basis for recommendation.
D. Zhou et al. [13] used the relationships between authors, between papers, and between authors and papers to build three graphs, and combined the three graphs with object-based collaborative filtering. When measuring the similarity between objects, they used low-dimensional data, turned the problem into an optimization problem, and constructed the model with a semi-supervised learning method. T. Tang et al. [14], in an online learning system, used model-driven and hybrid collaborative filtering to make recommendations; their work takes into account users' different interests and different levels of knowledge, and updates a user's knowledge-level information after recommended articles are read, so as to continue recommending suitable articles.
Strohman et al. [15] made the first attempt at citation recommendation. They take the entire manuscript as system input, treat the user's manuscript as a long query string, search the paper library, and return the retrieved papers as recommendations. They divide the search into two steps: first, from a collection of millions of papers, retrieve a first set of papers whose content is closest to the manuscript; second, add the papers referenced by this first set to the candidate list, expanding the whole list to 1000~3000 papers, which are then re-ranked. Using simple features such as publication time, similarity of paper content, co-citation, common authors, Katz distance, and reference count, the weight of each feature is found by gradient ascent, and the recommendation value of each paper is obtained with a weighted linear model. The model is intuitive and effective, but it does not make recommendations located at specific citation contexts.
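The two-step retrieve-then-expand strategy described above can be sketched as follows. The corpus, reference lists, and term-overlap similarity are invented placeholders; a real system would use a proper retrieval model and re-rank the candidates with the weighted linear model.

```python
def recommend(manuscript_terms, corpus, references, first_k=3, sim=None):
    """Two-step candidate generation: retrieve the papers most similar
    to the manuscript, then expand with the papers they reference."""
    if sim is None:
        # Placeholder similarity: number of shared terms.
        sim = lambda terms, doc: len(set(terms) & set(doc))
    ranked = sorted(corpus, key=lambda d: sim(manuscript_terms, corpus[d]),
                    reverse=True)
    first = ranked[:first_k]
    candidates = set(first)
    for d in first:
        # Expand the candidate list along reference relationships.
        candidates |= set(references.get(d, ()))
    # A real system would now re-rank `candidates` before returning.
    return candidates

# Invented toy corpus: paper id -> terms; paper id -> referenced papers.
corpus = {"a": ["x", "y"], "b": ["x"], "c": ["z"]}
references = {"a": ["d"], "b": ["e"]}
result = recommend(["x", "y"], corpus, references, first_k=1)
```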
J. Tang et al. [16] applied no heuristic method when making citation recommendations, but instead recommended citations through topic similarity. They propose a two-layer RBM (Restricted Boltzmann Machine) model: given a collection of papers with citation relationships, the model learns topic distributions from paper content and citation relationships. Given a citation context, the learned topic model matches it against candidate papers and ranks the recommended papers according to the degree of match. Compared with previous methods, this method is a definite improvement.
Y. Lu et al. [3] argue that the citation context differs greatly from the cited paper in wording, so that a direct similarity search is not possible. They use a translation model, treating the citation context and the paper itself as two different languages, and estimate translation probabilities by statistical analysis of existing citation contexts and the papers they cite.
Many features are needed when building user preferences, but when finally integrating with other models for ranking, only one value can be used to sort. As a result, multiple features need to be consolidated. The support vector machine regards an instance as a point in space, with as many dimensions as there are features. For ease of introduction, take two dimensions as an example. Each instance is then a point in a two-dimensional plane, as shown in Figure 2.1, where circular and square dots represent two different categories. The SVM first requires a training set, which can be expressed as:

S = ((x_1, y_1), ..., (x_n, y_n)), x_i ∈ X, y_i ∈ {−1, +1} (2-1)

The model seeks a separating hyperplane:

⟨w, x⟩ + b = 0 (2-2)
Setting the derivatives of the Lagrangian with respect to w and b to zero gives:

w = Σ_{i=1}^{n} α_i y_i x_i (2-4)

Σ_{i=1}^{n} α_i y_i = 0, α_i ≥ 0, i = 1, ..., n (2-5)
Substituting these back into the original Lagrangian function yields the dual form:

L(w, b, α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩ (2-6)

subject to Σ_{i=1}^{n} α_i y_i = 0, α_i ≥ 0, i = 1, ..., n (2-7)
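In this system the trained SVM is ultimately used to collapse several feature scores into a single ranking value. A minimal sketch of that final scoring step, assuming an already-trained weight vector w and bias b; the weights and feature vectors below are invented for illustration.

```python
def svm_score(features, w, b):
    """Linear SVM decision value <w, x> + b, used here to rank candidates."""
    return sum(wi * xi for wi, xi in zip(w, features)) + b

# Hypothetical learned parameters and two candidate papers' feature vectors,
# e.g. (a UTD feature, a CRD score, a recency feature).
w, b = [0.6, 0.3, 0.1], -0.2
candidates = {"a": [0.9, 0.5, 0.1], "b": [0.1, 0.4, 0.9]}
ranking = sorted(candidates, key=lambda k: svm_score(candidates[k], w, b),
                 reverse=True)
```

Only the sign and ordering of the decision values matter for ranking, which is why a single scalar per candidate suffices.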
Chapter 3 PCR

3.1 Related Concepts
Before introducing the method of this article, we first explain some concepts and abbreviations used throughout:

Table 3.1 Abbreviation descriptions

Name | Abbreviation
Content Relevance Degree | CRD
  Measures the degree of relevance between a citation context and an article, considering content only. This value can be computed by any content-based model, such as a common language model.
User Tendency Degree | UTD
  The focus of this article: measures the tendency of a user to cite an article.
Cite Possibility Degree | CPD
  The likelihood that a citation context cites a paper; the final measure used for recommendation.
First, user u must know that paper t exists, be interested in t, and read it; then, when u writes a paper p, u recalls that t is relevant. At this point u may write a description d of t and eventually cite t. Before p is written, t already has a higher probability of being cited than a paper u has not read.
The value between d and t is the CRD, and the value between u and t is the UTD. In previous work, the CRD was often used directly as the CPD while the UTD was ignored, which largely limited further improvement of the recommendations. In fact, users first have different UTDs for different papers, and then produce personalized citation behavior based on these UTDs. In other words, in citation recommendation the UTD can be regarded as a prior on the CPD. The following formula describes their relationship:
CPD = UTD × CRD (3-1)
Therefore, this work focuses on the UTD, using the CRD work completed by predecessors, and aims to add the UTD on top of an existing CRD to further improve citation recommendation. The following sections describe in detail how this article constructs the UTD and the CRD.
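Formula (3-1) treats the UTD as a prior multiplied by the content score. A minimal sketch of how the two values combine when ranking candidates; the scores below are invented for the example.

```python
def cite_possibility(utd, crd):
    """CPD = UTD x CRD: user tendency acts as a prior on content relevance."""
    return utd * crd

# Hypothetical scores for two candidate papers and one citation context.
scores = {
    "t1": cite_possibility(utd=0.8, crd=0.5),  # user often cites this work
    "t2": cite_possibility(utd=0.1, crd=0.6),  # slightly better content match
}
best = max(scores, key=scores.get)
```

Note how the prior dominates here: t2 matches the content slightly better, but t1 wins because the user's tendency toward it is much stronger.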
As mentioned earlier, the value of the UTD must be computed for each "paper - user" pair. This paper first constructs the user information. Here, users are divided into two types: junior researchers with no history, and senior researchers with a publication history.
For novice researchers it is difficult to build user information, since there is no effective user data to support personalization. From another point of view, however, personalization is not a must for everyone: junior researchers are not yet deeply involved in research and often have not formed relatively stable interests and citation preferences, so personalized citation recommendation is of little value to them. Moreover, a junior researcher's scope of contact is still narrow; limiting citation recommendation to what they are already familiar with would instead constrain their range of knowledge. Therefore, for junior researchers, recommendations need only follow the CRD, that is, the results of previous content-based research.
Senior researchers have engaged in scientific research for many years and have made their own contributions in certain fields. They have gradually formed their own points of interest and citation preferences, and those preferences influence their citation behavior, so these researchers need personalized treatment. The papers senior researchers have published are the key to building their user information, and they are the focus of the PCR model. For researchers who have worked for a number of years, all the desired information can be obtained from their publication history, including:
1) The collection of all papers cited by the user; this uses the user's citation precedents as the basis for inferring citation bias.
2) The collection of all authors who have collaborated with the user, which accounts for the influence of the user's social circle on citation preferences.
3) The collection of authors the user has cited, which reflects the user's circle of attention and thus the user's preferences.
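The three collections above can be derived mechanically from a publication history. A sketch under the assumption that each publication record lists its coauthors and its references; all record structures and data here are invented for illustration.

```python
def build_profile(user, publications):
    """Derive the three UTD collections from a user's publication history.

    publications: list of dicts with 'authors' and 'references';
    each reference is a (paper_id, authors_of_that_paper) pair."""
    cited_papers, coauthors, cited_authors = set(), set(), set()
    for pub in publications:
        # Collection 2: everyone who co-wrote a paper with the user.
        coauthors |= set(pub["authors"]) - {user}
        for paper_id, authors in pub["references"]:
            # Collection 1: papers the user has cited.
            cited_papers.add(paper_id)
            # Collection 3: authors the user has cited.
            cited_authors |= set(authors)
    return cited_papers, coauthors, cited_authors

history = [
    {"authors": ["u", "alice"], "references": [("t1", ["bob"]), ("t2", ["carol"])]},
    {"authors": ["u"], "references": [("t1", ["bob"])]},
]
papers, coauthors, cited_authors = build_profile("u", history)
```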
Given a user u and a target paper t, how is the UTD between u and t measured once this user information is available? This article argues that users have different UTDs for different papers because of certain recommendation and extension relationships. Recommenders come from three levels: the user, the user's collaborators, and the authors the user has cited. People at these three levels recommend papers to the user in two ways, writing and citing, and both are regarded as recommendation behavior; for example, an author who writes a paper p once, or cites a paper p once, is considered to have recommended paper p once. After accepting recommendations from these three levels of people, the user extends them: besides noting the paper itself, the user also notes the paper's authors and the venue where it was published. Based on the three levels of recommenders and the three extensions, 3x3 = 9 prior features usable as UTDs can be obtained (Table 3.2).

Table 3.2 Levels of recommendation and ways of extension

             | Paper | Author | Venue
User         | 1_1   | 1_2    | 1_3
Collaborator | 2_1   | 2_2    | 2_3
Cited author | 3_1   | 3_2    | 3_3
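The 3x3 feature grid can be enumerated mechanically; the sketch below mirrors the i_j naming used in the text (level i of recommender, extension j).

```python
# The three levels of recommenders and the three directions of extension.
levels = ["user", "collaborator", "cited_author"]   # who recommends
extensions = ["paper", "author", "venue"]           # what the user notices

# Feature "i_j" pairs recommender level i with extension j.
features = {f"{i}_{j}": (level, ext)
            for i, level in enumerate(levels, 1)
            for j, ext in enumerate(extensions, 1)}
```

For example, feature 2_1 is the collaborators' tendency toward the paper itself, matching the description of 2_1 later in this chapter.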
Next, each UTD feature is described in detail, with some supporting data and a calculation formula. In the following formulas, count(x, y) denotes the number of times x recommends y, where x is a user and y may be a paper, an author, or a venue; count(x) denotes the total number of times user x recommends papers; count_a(x) denotes the total number of times user x recommends authors. For example, if x recommends only one paper, which has three authors, then count(x) is 1 and count_a(x) is 3. The variable u denotes the current user and t the target paper; A denotes the set of authors of the target paper, and v the venue or journal in which t is published; C_u denotes the set of people who have collaborated with u, and R_u the set of authors cited by u.
First is the user's own recommendation behavior. The user is obviously more familiar with their own papers, or with papers they have once cited, and has a stronger tendency to cite them. This high tendency affects not only the user's citation of the paper itself but also extends to the paper's authors and venue. This yields features 1_1 through 1_3.
1. 1_1: the fraction of the user's total recommendations that go to paper t. This feature accounts for two kinds of user behavior: citing one's own published paper, and citing again a paper one has cited before. In the data set of this article there are 55823 paper metadata records; the average author has written about 2 papers, each paper has on average about 3 authors, and a paper that an author has written or previously cited is cited again by that author about 5 times on average. In other words, for an author, the probability of citing an arbitrary paper is 0.079%, while the probability of citing a paper the author has written or previously cited is 29%, roughly 370 times the former. A paper written or cited by the user therefore has a much greater probability of being cited by that user later.
This phenomenon is not surprising. First, each researcher's area is relatively concentrated, and current work is often strongly correlated with past work; hence previously written or cited papers are more likely to be related to the current work. Second, users are more familiar with their own work and the work they have cited. These two factors lead to a higher citation tendency. Therefore, given a paper, the user's own recommendation behavior is the first feature to consider. The calculation formula is as follows:
The calculation formula is as follows:

f_{1_1}(u, t) = rec(u, t) / rec(u)    (3-2)

where rec(u, t) is the number of times user u has recommended (written or cited) paper t, and rec(u) is u's total number of recommendations.
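As a minimal sketch of this feature (the function and variable names are illustrative, not from the paper's implementation), f_{1_1} can be computed by counting over a user's recommendation history:

```python
from collections import Counter

def f_1_1(user_recs, target_paper_id):
    """Fraction of the user's recommendations (papers written or cited)
    that hit the target paper; returns 0 for an empty history."""
    if not user_recs:
        return 0.0
    counts = Counter(user_recs)  # rec(u, t) for every paper t
    return counts[target_paper_id] / len(user_recs)  # rec(u, t) / rec(u)

# Hypothetical history: paper IDs the user wrote or cited, with repeats.
history = ["p1", "p2", "p1", "p3"]
score = f_1_1(history, "p1")  # 2 of the 4 recommendations hit p1
```

The venue and author variants (f_{1_2}, f_{1_3}) follow the same pattern, counting hits on the target paper's authors or venue instead of the paper ID.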
The more often a user cites papers from a conference, or the more papers the user has published at it, the more familiar the user is with that conference, and the more likely its papers are to be cited by the user. This feature expands in the direction of the venue and is calculated as:

f_{1_3}(u, v) = rec(u, v) / rec(u)    (3-4)

where rec(u, v) is the number of u's recommendations that involve venue v.
The above three formulas all consider the user's own behavior. Besides the user, the user's collaborators also exert a recommendation force that influences the user's future citation behavior, so papers published or cited by the user's collaborators also play a role in this model. The following three prior features consider the recommendation behavior of the user's collaborators.
4. f_{2_1}: the fraction of recommendations of the target paper t among all recommendations made by the user's collaborator set C_u. This feature captures the following user behavior: being familiar with collaborators' work and reading collaborators' papers. In the dataset of this article, each user has on average about 9 collaborators, cites each collaborator 5.4 times on average, and papers cited by collaborators are cited 33.69 times on average; each collaborator's recommendation leads to roughly 2 further citations by the user, while a recommendation by an arbitrary other user leads to only about 10.86% as many. The recommendation power of collaborators is thus far higher than that of other authors. Users usually have close ties with their collaborators, which makes it very likely that a user is familiar with collaborators' work and reads their papers. As a result, a paper recommended by a collaborator can have a certain influence on the user, and f_{2_1} measures this effect, as shown in the following formula:

f_{2_1}(C_u, t) = rec(C_u, t) / rec(C_u)    (3-5)
In addition to the user and the user's collaborators, the authors cited by the user also have recommendation power over the user. If a paper, an author, or a venue is recommended many times by the authors a user cites, the user will cite it with higher probability. The next three prior features expand in the same three directions (the paper itself, the paper's authors, the venue of the paper) and consider the recommendation behavior of the user's cited authors. They are calculated in the same way as f_{2_1} to f_{2_3}.
7. f_{3_1}: the fraction of recommendations of the target paper t among all recommendations made by the set R_u of authors the user has cited. This feature considers only the target thesis itself and is calculated as:

f_{3_1}(R_u, t) = rec(R_u, t) / rec(R_u)    (3-8)
These nine features are in fact nine priors (UTD) over different dimensions. Each prior can be multiplied by the content-relevance part (CRD) computed by the underlying model, giving nine different CPD scores. Combining these nine CPDs yields the criterion that is ultimately used to rank the candidate papers for a given citation context.
This article uses a unigram language model [2] and a translation model improved on abstracts [3] as the CRD part of the algorithm; the same two models also serve as comparison baselines. The following describes how these two methods work on the dataset of this article.
i. Language Model
The language model approach was proposed in the late 1990s, is mainly applied in the field of information retrieval, and has achieved good results. By Bayes' rule,

p(D|C) = p(C|D) · p(D) / p(C)    (3-11)
Because this problem does not concern specific probability values, for the same C and in the ranking sense, p(D|C) ∝ p(C|D) · p(D), where p(D) is the prior probability of paper D. Here the papers can be assumed to follow a uniform distribution, i.e. every paper is equally likely; p(D) is then the same for all papers and does not affect the final ranking. Therefore the CPD directly uses the value of p(C|D). This article estimates p(C|D) with a unigram language model. By definition, the unigram model assumes each word occurs independently of the preceding words; it is in fact a "bag-of-words model". Its calculation formula is:

p(C|D) = ∏_{i=1}^{n} p(w_i | D)    (3-12)
where w_i is the i-th word of the citation context and n is the total number of words in the citation context. p(w_i|D) is the distribution probability of each word of the citation context in paper D. For a paper D, let |D| be the total number of words in D and c(w, D) the number of occurrences of word w in D; then the distribution probability of the word can be expressed as:

p(w|D) = c(w, D) / |D|    (3-13)
Given a citation context, the probability of each word is computed not only from the current document, p(w|D), but also from the whole document collection S, p(w|S); the two are combined through a smoothing parameter μ as follows:

p'(w|D) = |D|/(|D|+μ) · p(w|D) + μ/(|D|+μ) · p(w|S)    (3-14)

where μ is a constant set based on experience. Substituting p' into formula (3-12), this article obtains the following formula for calculating the CPD:

p(C|D) = ∏_{i=1}^{n} [ |D|/(|D|+μ) · p(w_i|D) + μ/(|D|+μ) · p(w_i|S) ]    (3-17)

The paper then ranks the candidate papers by the result of the above formula.
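A small sketch of this scoring with Dirichlet-style smoothing as reconstructed above (the tokenization and the value of μ here are assumptions, not the paper's exact settings):

```python
import math
from collections import Counter

def lm_score(context_words, doc_words, collection_words, mu=2000.0):
    """log p(C|D) under a unigram model with smoothing:
    p'(w|D) = |D|/(|D|+mu)*p(w|D) + mu/(|D|+mu)*p(w|S)."""
    d_counts = Counter(doc_words)
    s_counts = Counter(collection_words)
    d_len, s_len = len(doc_words), len(collection_words)
    score = 0.0
    for w in context_words:
        p_wd = d_counts[w] / d_len if d_len else 0.0
        p_ws = s_counts[w] / s_len if s_len else 0.0
        p = (d_len / (d_len + mu)) * p_wd + (mu / (d_len + mu)) * p_ws
        if p == 0.0:          # word unseen even in the collection
            return float("-inf")
        score += math.log(p)  # sum of logs instead of a product, for stability
    return score

doc = "decision tree building scalable".split()
coll = doc + "citation recommendation language model".split()
s = lm_score("scalable decision tree".split(), doc, coll)
```

Summing logarithms instead of multiplying raw probabilities matters here, because the product in (3-17) underflows double precision for long citation contexts (a point the value-gap discussion below also runs into).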
ii. Translation Model
The translation model treats the paper text and the citation context as two languages and learns word-to-word translation probabilities p_t. A self-translation smoothing is applied so that a word retains probability of "translating" to itself:

p_s(w|w') = β · 1(w = w') + (1 − β) · p_t(w|w')    (3-20)

p(C|D) = ∏_{i=1}^{n} p(w_i|D)    (3-21)

where

p(w|D) = λ · p_ml(w|S) + (1 − λ) · Σ_{w'∈D} p_s(w|w') · p_ml(w'|D)    (3-22)

Here p_ml(w|S) and p_ml(w'|D) are the maximum-likelihood estimates of a word in the whole collection S and in thesis D respectively, and p_t(w|w') is the probability, under the translation model, of translating a word in the paper text into a word of the citation context, computed as in (3-19). In this way the translation model bridges the vocabulary gap between the citation context and the paper. In terms of final effect, the translation model improved on abstracts performs better [3], so the abstract-based translation model serves both as the CRD part and as one of the comparison methods for the final evaluation of this paper.
e. Combining UTD and CRD
After the values of UTD and CRD are obtained, the next difficulty is how to combine them. For a given citation context and paper, a single score is required to measure the CPD, which in turn ranks the candidate list. However, the combination runs into four problems: the value gap, multiple scores, multiple authors, and the imbalance between positive and negative examples. To solve these four problems, this article uses the following methods.
In formula (3-1), UTD and CRD must be multiplied. However, computing the CRD actually means multiplying the relative frequencies of the words in the citation context. Citation contexts are fairly long, and given the huge frequency differences between words, the CRD values of different citation-context/paper pairs differ enormously. Take the citation context "sprint [28] and rainforest propose two scalable techniques for decision tree building ..." from the thesis [ ] as an example. Using the language model, its maximum CRD score over the candidate set is 6.726×10^-58 and its minimum is 3.361×10^-137; the largest is about 2×10^79 times the smallest. Using the translation model, its maximum CRD score is 5.258×10^-48 and its minimum is 5.936×10^-77; the largest is 8.86×10^28 times the smallest. By contrast, for the UTD feature with the largest variation over this citation context, the maximum value is 0.0438 and the smallest non-zero value is 3.977×10^-6, a gap of only about 1.1×10^4 times. Therefore, when UTD is multiplied into CRD in this situation, the role of UTD is almost negligible, and the final ranking is nearly identical to the one obtained with CRD alone.
The magnitude of UTD depends primarily on the size of the dataset, while the magnitude of CRD depends on the length of the citation context. The citation context can be truncated to a relatively fixed length, but the dataset size varies with the crawl scale, so a dynamic tuning method is needed to balance the influence of dataset size.
Here a common data-balancing method is used to solve this problem: a shrinkage parameter α is introduced, and exponentiation is used to adjust for different data sizes. Formula (3-1) is updated to the following:

CPD = UTD^α · CRD^(1−α)    (3-23)

A better result can be obtained by tuning the parameter α, and α can be re-tuned when the dataset size changes.
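Since the CRD values can underflow double precision (they reach the 10^-137 scale, as shown above), the combination in (3-23) is best carried out in log space. A sketch, with α as the tunable shrinkage parameter and the numbers below as illustrative inputs:

```python
import math

def cpd_log(utd, log_crd, alpha=0.8):
    """log CPD = alpha*log(UTD) + (1-alpha)*log(CRD), per (3-23).
    Working in logs avoids underflow of the tiny CRD products."""
    if utd <= 0.0:
        return float("-inf")  # a zero prior removes the candidate
    return alpha * math.log(utd) + (1 - alpha) * log_crd

# Two hypothetical candidates: a strong content match with a weak personal
# prior, versus a weaker content match with a strong personal prior.
a = cpd_log(utd=3.977e-6, log_crd=-58 * math.log(10))
b = cpd_log(utd=0.0438, log_crd=-70 * math.log(10))
```

With α = 0.8 the personalized prior can outweigh a moderate content-score deficit, which is the intended balancing effect of the exponent.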
The model of this paper analyzes the user's citation tendency from three angles in three directions each, so there are 9 different priors. After these 9 UTD values are each combined with the CRD, there are 9 different scores. To obtain a single score for ranking, these 9 scores must be combined, which is the second problem mentioned at the beginning of this section; the combination also runs into the remaining two problems mentioned there.
What this article solves is in fact a multidimensional combination problem, and it uses the SVM described above to do the combining. Citation recommendation is in fact a citation prediction problem, which this article treats as a classification problem: each pair of citation context and paper has only two possible relationships, cited or not cited.
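The paper uses LIBSVM [23] for this step. As a dependency-free stand-in, the sketch below combines the scores with a simple linear classifier (a perceptron) after randomly undersampling negatives to balance the classes; the data flow, not the SVM itself, is the point here, and all names and sample values are illustrative:

```python
import random

def undersample(pos, neg, seed=0):
    """Balance classes by randomly keeping as many negatives as positives."""
    rng = random.Random(seed)
    return pos + rng.sample(neg, len(pos))

def train_perceptron(samples, labels, epochs=50):
    """Tiny linear classifier over the combined scores; a stand-in
    for the SVM used in the paper."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is +1 (cited) or -1 (not cited)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical 2-D stand-in for the 9 CPD scores per (context, paper) pair.
pos = [[0.9, 0.8], [0.8, 0.9]]
neg = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.1], [0.05, 0.2]]
data = undersample(pos, neg)
labels = [1, 1, -1, -1]
w, b = train_perceptron(data, labels)
```

The undersampling step mirrors the paper's remedy for the positive/negative imbalance (a cited pair is vastly rarer than an uncited one), described in the summary chapter.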
Chapter 4 Experiment Design
a. Data Set
i. Data Requirements
Unlike many research problems, this problem does not require labeled data. Many recommendation tasks, in order to measure accuracy, require users to label their degree of preference so as to obtain training and test data. For citation recommendation, the data can be split into two parts using time as a natural partition; the citation relations between the two parts directly reflect users' citation behavior. Citations made before the cut-off time can be used as training data, and those made after it as test data. There is no labeling requirement, but citation recommendation has several unique data requirements:
1. A content format that is easy to process. Paper resources are usually in PDF format, which is inconvenient for direct text processing, so the PDF content needs to be converted to text.
2. Clear and unambiguous citation anchors. Because every venue uses a different template, paper formats differ, and so do the formats of reference lists, so the positions where papers are cited must be identified unambiguously.
4. A relatively densely inter-cited paper set. Since the total number of papers is enormous and an author may cite a wide variety of topics, the citation range is rather large. The complete set of papers cannot be obtained; when collecting a subset, an inappropriate method will make the inter-citation rate of the paper set too low and the citation graph too sparse to achieve good experimental results. The data-acquisition strategy of this paper must therefore obtain a subset with a high inter-citation rate.
5. Accurate extraction of meta-information. Personalized recommendation needs not only the content of each paper but, more importantly, accurate extraction of its meta-information, including the author list.
MAS provides a large amount of metadata with each paper, such as the paper itself, its authors, venue, and field; each piece of metadata carries a unique ID by which the entity is identified. After a collection of papers has been crawled, papers by the same author can be grouped by aggregating on the author ID, so the author's citation preferences can be analyzed. This satisfies requirements 5 and 6 above.
Besides the MAS API, which satisfies requirements 5 and 6, the MAS Web site also provides a very important feature called "citation context": when searching for a paper, one can simultaneously see the text with which other papers cite it. The effect when searching for a paper is shown in Figure 4.3. The boxed area shows the text used by other papers when citing this paper; the site marks the citation anchor and the citation context well.
By inspecting the HTML code of the page shown above, the citation relationship can be pinpointed by locating the ID of the citing paper. The boxed area below shows the paper ID corresponding to the first citation context in Figure 4.3. With this feature, the data can be made to satisfy requirements 2 and 3.
Based on the downloadable URLs provided by the MAS API, the PDF of each paper can be obtained. Then the third-party tool PDFMiner¹ is used; PDFMiner is a Python tool with powerful PDF text-recognition capabilities, which accurately extracts the text of each paper into a separate text file, so the papers satisfy requirement 1.
To satisfy data requirement 4, a paper collection with a high inter-citation rate is needed. There is no doubt that papers in similar areas cite each other more, and people tend to cite the more influential papers. Therefore, this paper selects the 10 most important conferences related to the data-mining field as seed conferences for collecting papers. The process is as follows:
1). Select ten seed conferences: ACL, CIKM, EMNLP, ICDE, ICDM, KDD, SIGIR, VLDB, WSDM, and WWW.
2). Obtain from the MAS API the metadata of all papers of these ten conferences from [year] to [year]. The metadata includes: the paper's ID in MAS, title, abstract, publication time, venue, authors, the ID list of referenced papers, and the paper's URL. After obtaining the metadata, this article uses the functionality of the MAS Web site to obtain the citation contexts that cite these papers.
1 http://www.unixuser.org/~euske/python/pdfminer
iii. MAS API
When a query request is submitted to the MAS API, a result is returned in JSON format. JSON is a lightweight data-interchange format and a subset of the JavaScript language; it returns results as mappings of names and values. Data items are separated by commas, curly braces hold objects, and square brackets hold arrays. After a request is submitted to the MAS API, a deeply nested result is returned, for example:
{"d":{"__type":"Response:http:\/\/research.microsoft.com","Author":null,"Con
ference":null,"Domain":null,"Journal":null,"Keyword":null,"Organization":nul
l,"Publication":{"__type":"PublicationResponse:http:\/\/research.microsoft.c
om","EndIdx":0,"StartIdx":1,"TotalItem":0,"Result":[]},"ResultCode":0,"Trend
":null,"Version":"1.1"}}
The result returned by the MAS API can be understood as a tree structure in which each node is a key/value pair; a value may be a single value or a list. For any query statement, MAS returns the same top-level structure, with "d" as the root of the result. The top-level fields and their meanings are:

Field          Meaning
Version        version number of the response
ResultCode     result status code
Publication    publication results
Author         author results
Conference     conference results
Journal        journal results
Organization   organization results
Domain         domain results
Keyword        keyword results
Trend          trend results
Every response has the same structure, but since a query actually asks for only one type of result (for example, only a publication list), the other domains in the response, such as the Author field, come back null.
The seven entity types are Publication, Author, Conference, Journal, Organization, Domain, and Keyword.
Table 4.3 Field result format
Each of the seven typed fields shares the following structure:

Field      Meaning
StartIdx   start index of the returned results
EndIdx     end index of the returned results
TotalItem  total number of results
Result     result list

The results themselves are presented as a list in the Result field, and results may contain one another: for example, a returned paper list has an Author field that is itself a list. The return fields of the different result types are collated in the following table:
Table 4.4 Format of result-list items (fields include name, ID, author, journal, and so on)
As can be seen, the fields contain rich information. However, because the difficulty of filling in each information domain differs, the completeness of the MAS API varies, and many fields return null and need further handling. In addition, every returned result entry carries a __type field indicating the result type. Its value has the form "Response:http:\/\/research.microsoft.com", where the bold part changes with the result type: "Response" marks the root; "PublicationResponse", "AuthorResponse", and so on mark the response of each domain; and "Publication", "Author", and so on mark each returned item. In this way, the results returned by the MAS API can be distinguished by entity, yielding rich metadata.
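A small sketch of unpacking such a response with the standard library, using the empty sample response shown above (the helper name is illustrative):

```python
import json

# The sample (empty) response shown above.
raw = ('{"d":{"__type":"Response:http:\\/\\/research.microsoft.com",'
       '"Author":null,"Conference":null,"Domain":null,"Journal":null,'
       '"Keyword":null,"Organization":null,'
       '"Publication":{"__type":"PublicationResponse:http:\\/\\/research.microsoft.com",'
       '"EndIdx":0,"StartIdx":1,"TotalItem":0,"Result":[]},'
       '"ResultCode":0,"Trend":null,"Version":"1.1"}}')

def extract_results(raw_json, domain="Publication"):
    """Return the Result list for the requested domain, or [] when that
    domain came back null (MAS nulls out every domain except the queried one)."""
    root = json.loads(raw_json)["d"]   # "d" is the root of every response
    section = root.get(domain)
    if section is None:
        return []
    return section.get("Result", [])

pubs = extract_results(raw, "Publication")   # empty list: TotalItem is 0
authors = extract_results(raw, "Author")     # null domain -> []
```

Guarding against the null domains before touching Result is exactly the "needs further handling" point made above.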
To increase speed, one cannot crawl with only a single IP; multiple IPs are required. Of course, hundreds of machines could be used, but the resources required are too great, and gathering the data afterwards would be tedious. Here this article uses a lightweight method: accessing the MAS Web site through a proxy has the same effect as accessing it from a new machine, so apart from occasional accesses from its own IP, the crawler mostly goes through proxies. There are free proxies available on the network; this article first collects free proxies, tests and filters out the available ones, and then uses them in rotation. Assuming there are x proxies, the waiting time is only 1/x of the original, which greatly improves the crawl speed. During the crawl, this article used [number] proxies, which increased the crawl speed by a factor of [number]. This method satisfies the data-acquisition needs of this paper.
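A minimal round-robin proxy pool using only the standard library (the proxy addresses are placeholders; the real crawler would also need the availability test described above):

```python
import itertools
import urllib.request

class ProxyPool:
    """Rotate through tested proxies; each request looks like a new machine."""
    def __init__(self, proxies):
        self.proxies = proxies
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def opener_for_next(self):
        # Route the next request through the next proxy in the rotation.
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)

# Placeholder addresses; in practice these come from the tested free-proxy list.
pool = ProxyPool(["http://p1:8080", "http://p2:8080", "http://p3:8080"])
order = [pool.next_proxy() for _ in range(4)]  # wraps around after 3
```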
In addition, during data acquisition, unexpected events can terminate the program, such as power outages, the process being killed by mistake, or bugs in the program itself. Because crawling is a lengthy process, restarting it from scratch after an unexpected termination wastes a great deal of time and server resources. Therefore, the data acquisition in this article needs to support resuming from a breakpoint: each time the crawler starts, it must learn the previous crawl progress and continue from there. The method used here is to write a log file while crawling; on each start, the crawler first reads the log file, determines the progress, and then continues crawling accordingly.
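A sketch of the breakpoint-resume idea with a plain progress log, one completed paper ID per line (the file name and the in-memory "download" are illustrative):

```python
import os

LOG = "crawl_progress.log"  # illustrative file name

def load_done(log_path=LOG):
    """Read the IDs already crawled; an absent log means a fresh start."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def crawl(paper_ids, log_path=LOG):
    done = load_done(log_path)
    fetched = []
    with open(log_path, "a", encoding="utf-8") as log:
        for pid in paper_ids:
            if pid in done:
                continue          # skip work finished before the crash
            fetched.append(pid)   # placeholder for the real download
            log.write(pid + "\n")
            log.flush()           # progress survives a sudden termination
    return fetched

first = crawl(["a", "b"])         # fresh run: crawls both
second = crawl(["a", "b", "c"])   # resumed run: only "c" remains
os.remove(LOG)                    # clean up the demo log
```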
With these techniques, the required data can be obtained smoothly, so that further experiments can proceed.
v. Data Preprocessing
After the data are obtained, the thesis texts often contain a lot of noise because of authors' different writing habits and the different formatting requirements of venues. The text data must first be preprocessed to reduce noise and improve the effectiveness of the method. The preprocessing consists of the following steps.
1. Convert all letters to lowercase. Before parsing, all letters need to be converted to lowercase, a common step in information retrieval. Letter case interferes with machine processing to a certain extent; the most common case is a capitalized initial letter, but the same word carries the same meaning in a sentence, and failing to match it because of capitalization would lose information. In addition, different authors capitalize the same word differently out of habit: some write Google while others write google, with no difference in meaning. There are of course special cases where case changes the meaning, for example she is a pronoun while S.H.E may refer to a pop group, but such cases rarely occur in the dataset. So converting all letters to lowercase does more good than harm, and it is the first step of preprocessing.
2. Keep only letters and digits; remove other symbols. The meaningful content of an article consists essentially of letters and digits, while punctuation and other characters carry no lexical meaning, so they are removed to reduce noise. Some researchers choose to remove digits as well, but digits often carry important meaning in papers; for instance, "four-color problem" would lose all its information if the digits were removed.
3. Stem all words. English has many tenses and voices: verbs have third-person singular, present, past, and perfect forms, and nouns have singular and plural forms. These variants differ in morphology but are almost identical in meaning, so all words need to be reduced to their stems for better analysis. Another tool, NLTK², is used here. NLTK is short for Natural Language Toolkit, an open-source Python library for natural language processing that can perform the stemming task.
4. Append the citation contexts that cite each paper to the end of that paper's text, and after each citation context add the MAS ID of the citing paper.
After the above four steps and some simple tidying, an easy-to-process text version of each paper is obtained in the following format. Each part is delimited by a box: the first part is the paper's title, the second its abstract, and the third its body (omitted here because the text is too long). The fourth part consists of the citation contexts that cite this paper, each with the MAS ID of the corresponding citing paper. In this way, from one paper one can learn its content together with the IDs and citation contexts of the papers that cite it, and by matching IDs against the papers already in the library, the citation graph of the dataset can be built.
2 http://www.nltk.org/
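The three text-cleaning steps can be sketched as follows; the paper uses NLTK's stemmer, for which a trivial suffix rule stands in here so the sketch stays dependency-free:

```python
import re

def toy_stem(word):
    """Stand-in for NLTK's Porter stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                          # step 1: lowercase
    text = re.sub(r"[^a-z0-9]+", " ", text)      # step 2: keep letters and digits
    return [toy_stem(w) for w in text.split()]   # step 3: stemming

tokens = preprocess("Building scalable Decision-Trees, e.g. SPRINT [28].")
```

In the real pipeline, `nltk.stem.PorterStemmer().stem` would replace `toy_stem`.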
b. Evaluation Method
The first metric is Recall@K, the fraction of test citation contexts whose cited paper appears in the top K of the recommended list:

Recall@K = (1/|T|) · Σ_{c∈T} 1{r_c ≤ K}    (4-1)

where T is the set of test citation contexts and r_c is the rank of the cited paper for context c.
If the cited paper appears at rank r in the recommended list, the score for that citation context is 1/r, and the average of this score over all citation contexts in the test set is the second evaluation indicator (with a single cited paper per context, this equals the mean average precision, MAP). The formula is as follows:

MAP = (1/|T|) · Σ_{c∈T} 1/r_c    (4-3)
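Both metrics can be computed as sketched below (ranks are 1-based; a rank of None means the cited paper is missing from the recommended list, and the rank values are hypothetical):

```python
def recall_at_k(ranks, k=10):
    """Fraction of test contexts whose cited paper ranks within top k (4-1)."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal(ranks):
    """Average of 1/rank over the test contexts (4-3); misses score 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Hypothetical ranks of the true cited paper for four test citation contexts.
ranks = [1, 4, 12, None]
r10 = recall_at_k(ranks, 10)      # 2 of the 4 within the top 10
mrr = mean_reciprocal(ranks)      # (1 + 1/4 + 1/12 + 0) / 4
```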
c. Experiment Framework
Having described the data acquisition process and the evaluation method, here is a brief description of the experimental framework of this study.
1. Use the MAS API to collect the metadata of the required papers and obtain the PDF download links, and obtain the citation context information of the relevant papers from the MAS Web site; download the PDFs from the links and convert them to text. This step must cope with the anti-crawler mechanisms of MAS and the various PDF source sites, and must make server instability transparent.
2. Convert the collected paper texts to lowercase, remove punctuation, apply stemming, and append the crawled citation contexts and IDs to the end of each paper.
3. Extract the user-preference features. Here the metadata of the users' papers must be fully exploited and organized for extraction; this is the focus of this paper.
4. In chronological order, select [number] authors as test users; put the citation contexts from each test user's latest paper into the test set and use the remaining data as the training set. Because the dataset covers only part of a large body of papers, users whose personal information in the dataset is relatively complete must be selected.
5. From the training data, obtain the content-relevance model using some content-based model framework, and train the user-preference features with the SVM model. Combining the results of the two models yields the unified model. This article selects the language model and the translation model as the content-relevance models, which also act as baselines.
6. For all citation contexts in the test set of step 4, remove their citation information, and use the model obtained in step 5 to score and rank all papers in the candidate set of [number] papers.
7. Evaluate the results with MAP and Recall@10.
The overall experimental framework is shown in the following illustration:
d. Experiment Results
In this paper, a random baseline, the unigram language model, and the translation model are used as comparison methods. The models of this paper use the unigram language model and the abstract-improved translation model as CRD, evaluated with the methods described above. Table 4.5 gives the final results of the experiment, and Figure 4.8 shows the details of how the recall rate changes.
Table 4.5 Comparison of the effects of the different models
Here LM_PCR denotes the PCR model using the unigram language model as CRD, and TM_PCR the PCR model using the abstract-improved translation model as CRD. From the above results:
1. TM_PCR performs best, and LM_PCR also outperforms the unigram language model, which contains no personalization information.
2. Whether for the language model or the translation model, the effect improves under the PCR model, which introduces personalization information: the language model gains 234% on MAP and 617% on Recall@10, and the translation model gains 778% on MAP and 650% on Recall@10.
It can be seen that the PCR model brings a clear improvement for both the language model and the translation model. This is because the PCR model can exploit both content relevance and the user's preferences: on the one hand, it preserves good results for citation contexts where content information alone already works well; on the other hand, it improves results for citation contexts where content information alone does not work well. The PCR model is therefore effective and can significantly improve on traditional citation recommendation models.
e. Parameter Analysis
This paper shows the effect of the model under different values of α. Figure 4.9 shows how the PCR model's effect changes as the parameter α varies from 0.1 to 0.9. The best effect is obtained by setting α to 0.8, whether in combination with the translation model or with the language model.
When using the PCR model, α must be re-tuned whenever the dataset is replaced, because of changes in data size and citation-context length.
f. Feature Analysis
The model of this article has 3×3 = 9 features in total. Are these features effective, and which work best? Here the effect of the 9 features is studied experimentally.
Figure 4.10 shows the effect of the model after removing each feature in turn. The x_y item in the graph is the effect after removing feature f_{x_y}; the last item, "none", means no feature is removed, i.e. the final result of the full model.
From Figure 4.10 one can see that removing any feature weakens the experimental effect, indicating that every feature contributes to the recommendation quality. Removing f_{1_1} to f_{1_3} affects the experiment the most, especially f_{1_1}. This shows that the user's own historical behavior has the greatest influence on future citation behavior, among the factors considered: the papers the user wrote, the user's collaborators, the venues the user published at, the papers the user cited, the authors the user cited, and the venues the user cited. Among them, the user citing his own past papers or previously cited papers is the most telling feature.
Chapter 5 Summary and Future Work
a. Summary
In this paper, each user's personalization is taken into account in citation recommendation; we hope to improve recommendation quality by adding decisions based on user preference. For this purpose, this paper presents a personalized citation recommendation model, the PCR model.
Different users have different citation tendencies toward the same paper. The model of this paper quantifies this tendency and builds each user's personalized information from his published and cited papers. This information measures the probability of a user citing a paper along 9 dimensions, where the factors include the user himself, the user's collaborators, the venues the user published at, the papers the user cited, and so on. These factors are combined with an SVM, together with a unigram language model and the abstract-improved translation model. The model also uses a shrinkage parameter to solve the large gap between the values of the traditional CRD part and the newly introduced UTD part. The imbalance between positive and negative examples in the SVM is solved by randomly sampling an equal number of negative examples and averaging over repeated draws.
Finally, after being combined with the UTD, both the traditional unigram language model and the recent abstract-improved translation model show significantly better results. The method makes reasonable use of the known information, further captures users' citation behavior, and makes citation recommendation more accurate.
b. Future work
References
[1] R. Yan, J. Tang, X. Liu, D. Shan, and X. Li. Citation count prediction: Learning
to estimate future citations for literature. In Proceedings of the 20th ACM
international conference on Information and knowledge management, pages
1247–1252. ACM, 2011.
[3] Y. Lu, J. He, D. Shan, and H. Yan. Recommending citations with translation
model. In Proceedings of the 20th ACM international conference on Information
and knowledge management, pages 2017–2020. ACM, 2011.
[8] Salton G, Wong A, Yang C S. A vector space model for automatic indexing[J].
Communications of the ACM, 1975, 18(11): 613–620.
Adaptive Web-Based Systems, pages 83–92. Springer, 2008.
[12] K. Sugiyama and M.-Y. Kan. Scholarly paper recommendation via users' recent research interests. In Proceedings of the 10th annual joint conference on Digital libraries, pages 29–38. ACM, 2010.
[13] D. Zhou, S. Zhu, K. Yu, X. Song, B. L. Tseng, H. Zha, and C. L. Giles. Learning
multiple graphs for document recommendations. In Proceedings of the 19th
international conference on World Wide Web, pages 141–150. ACM, 2008.
[18] Q. He, D. Kifer, J. Pei, P. Mitra, and C. L. Giles. Citation recommendation
without author supervision. In Proceedings of the fourth ACM international
conference on Web search and data mining, pages 755–764. ACM, 2011.
[21] Xue X, Jeon J, Croft W B. Retrieval models for question and answer
archives[C]//Proceedings of the 31st annual international ACM SIGIR conference
on Research and development in information retrieval. ACM, 2008: 475–482.
[23] Chang C C, Lin C J. LIBSVM: a library for support vector machines[J]. ACM
Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 27.
[25] Page L, Brin S, Motwani R, et al. The PageRank citation ranking: Bringing
order to the web[J].