

Identifying related Documents for Research Paper Recommender by CPA and COA

Article · October 2009


Source: DOAJ



Preprint of: Bela Gipp and Jöran Beel. Identifying Related Documents For Research Paper Recommender By CPA And COA. In S. I. Ao, C. Douglas,
W. S. Grundfest, and J. Burgstone, editors, International Conference on Education and Information Technology (ICEIT'09), volume 1 of Lecture Notes in
Engineering and Computer Science, pages 636–639, Berkeley (USA), October 2009. International Association of Engineers (IAENG), Newswood Limited.
ISBN 978-988-17012-6-8. Downloaded from http://www.sciplore.org

Identifying Related Documents For Research Paper Recommender By CPA and COA
Bela Gipp and Jöran Beel
Otto-von-Guericke University Magdeburg, Department of Computer Science, ITI and SciPlore.org
gipp|beel@sciplore.org

Abstract—This work-in-progress paper introduces two new approaches called Citation Proximity Analysis (CPA) and Citation Order Analysis (COA). They can be applied to identify related documents for the purpose of research paper recommender systems. CPA is a variant of co-citation analysis that additionally considers the proximity of citations to each other within an article’s full-text. The underlying idea is that the closer citations are to each other in a document, the more likely it is that the cited documents are related. For example, citations listed in the same sentence are more likely to express related thoughts than citations listed only in the same section. In COA, the order of citations is considered, allowing the identification of a text similar to one that has been translated from language A to language B, as the citations would still occur in the same order. However, it is also shown that CPA and COA cannot replace text analysis and existing citation analysis approaches for research paper recommender systems, since they all have their own strengths and weaknesses.

Index Terms—Bibliometrics, citation proximity analysis, citation order analysis, related documents, research paper recommender

I. INTRODUCTION
The search for related work is a time-consuming procedure that, even if performed by experienced scientists, often leads to unsatisfying results. To alleviate the problem, search engines such as Google Scholar and CiteSeer offer to display “similar” documents based on text and citation analysis.

Superior results are usually achieved by hybrid research paper recommender systems. By combining further techniques such as co-word analysis, collaborative filtering, Subject-Action-Object (SAO) structures, etc., more precise recommendations can be given. However, these approaches are only suitable to a limited extent for identifying related work [2-8].

Taking everything into account, our examination suggests that in the case of scientific documents, usually the best results can be achieved by applying co-citation analysis. Citation proximity analysis is a further development of co-citation analysis.

Figure 1: GUI SciPlore – clustering similar documents

In the research paper recommender SciPlore.org this approach is mainly used for two purposes: first, to cluster similar documents as shown in Figure 1; and second, to give recommendations for further related documents based on one or more documents the user has been interested in, as shown in Figure 2.

In the first part of this paper, related work is presented and the commonly applied citation analysis approaches are discussed with a focus on co-citation analysis. In the following section the CPA approach is introduced. Afterwards, the existing citation analysis approaches are compared to CPA and their suitability for research paper recommender systems is examined. The paper concludes with a summary and an outlook, which includes how this new approach is going to be integrated into the research paper recommender SciPlore.org.
Figure 2: Similar paper recommendation

II. RELATED WORK
The usefulness of a research paper recommender system depends to a large extent on its ability to automatically determine related work to one or more documents. Various approaches exist to determine the degree of similarity of documents in order to identify related work.

Whereas text-mining approaches are used in cases in which references are not stated, citation analysis approaches usually deliver superior results, as e.g. synonyms and unclear nomenclature do not lead to misleading results [3, 4, 5]. Many citation analysis approaches exist and they all have their own strengths and weaknesses for identifying similar documents. Among the most widely used are the easily applicable ‘cited by’ approach, which considers papers as relevant that cite the same input document, and the ‘reference list’ approach, which considers papers as relevant that were referenced by the input document. The best results can usually be obtained by bibliographic coupling and co-citation analysis, which allow calculating the coupling strength [6]. These approaches, which were already invented in the 60s and 70s, are used by scientists and on academic search engine websites like CiteSeer (http://citeseer.ist.psu.edu) [9].

Figure 3: Bibliographic coupling

Documents are bibliographically coupled if they cite one or more documents in common. Figure 3 illustrates this approach: Papers A and B are related because they both cite papers C, D and E.

In contrast, two documents are “co-cited” when at least one paper cites both. This approach is illustrated in Figure 4: Papers A and B are related because they are both cited by papers C, D and E. The more co-citations two papers receive, the more related they are [6].

Figure 4: Co-citation analysis

Although both approaches are suitable to identify similar papers, they serve different purposes. Whereas bibliographic coupling is retrospective, co-citation is essentially a forward-looking perspective [9]. However, both approaches often deliver unsatisfying results, since they only make use of the bibliography at the end of the document without analyzing the constellation of citations. Therefore it is not possible to determine in which part of a related document the content of interest can be found.

III. CITATION PROXIMITY ANALYSIS AND CITATION ORDER ANALYSIS
Instead of just using the bibliography, in CPA the information derived from the proximity of the citations to each other in the full-text is used to calculate the Citation Proximity Index (CPI) in three steps.

1. The document is parsed and a series of heuristics are used to process the citations including their position within the document. (The citations were parsed using a modified version of parsCit (http://wing.comp.nus.edu.sg/parsCit) in combination with exclusively developed software, which is available upon request from the authors.)

2. The citations are assigned to their corresponding items in the bibliography. The overall margin of error with the system we have developed equals nearly three percent for the first and second step.

3. In the third step the proximity among each citation-pair is examined. The underlying assumption is that the closer the citations are to each other, the more likely it is that they are
related. Based on this proximity analysis, the CPI is calculated. If, for example, two citations are given in the same sentence, the probability that they are very similar is higher (CPI = 1) than if they were only in the same paragraph (CPI = 1/2). See Figure 5.

Figure 5: Illustration CPA

However, further research needs to be performed to identify the appropriate weighting of the CPI values according to their occurrence, which also seems to depend on the publication’s research field and the publication’s research type. For example, it seems that for analyzing a technical report or patent specification, different weightings are suitable. First empirical evaluations have led to the values shown in Table 1 for calculating the CPI.

Table 1: CPI values
Occurrence                            CPI value
Sentence                              1
Paragraph                             1/2
Chapter                               1/4
Same journal / same book              1/8
Same journal but different edition    1/16

The results delivered by CPA can be improved by evaluating as many sources as possible. This can be the case due to multiple occurrences of the same citation and due to multiple documents citing a certain document. In our series of tests we experienced the best results by calculating the weighted average of the CPIs. By automating the process described above, we have calculated the CPI for publications contained in the SciPlore database. The results show that in comparison to the results delivered by co-citation analysis, CPA delivers considerably better results in identifying similar documents [1].

Similar to the idea of CPA is another approach currently under development that we call Citation Order Analysis (COA). In contrast to CPA, in COA only the order of citations is considered. The main advantage in comparison to the usually applied text analysis approaches is that even if documents are translated or paraphrased they can still be identified as similar. Depending on the level of tolerance, even summarized documents in which citations were omitted can be identified. This way a digital fingerprint of documents can be created that can, besides for recommender systems, also be used to identify plagiarized work. In some regard, this approach is similar to bibliographic coupling. However, by additionally considering the order of citations, this approach is more precise and robust. Figure 6 illustrates the concept.

Figure 6: Illustration Citation Order Analysis

IV. OUTLOOK
Besides identifying related work, the authors work on applying the idea behind CPA for automatic document classification for the research paper recommender SciPlore [11]. The aim is to automatically analyze the topics within documents by analyzing the distribution of references within research papers. So instead of knowing, for instance, that a certain publication focuses on the relativity theory, the CPA makes it possible to identify the document sections focusing, for example, on ‘Time dilation’, ‘Length contraction’ or ‘Mass-energy equivalence’ and then to give specific recommendations within documents or books. Moreover, it is possible to combine the CPA with text mining algorithms in order to automatically detect e.g. contradicting studies. “The author A has shown in his recent study [reference A] that in contrast to a previous
study [reference B]...” So by analyzing the words between two references it is often possible to automatically analyze the exact relationship between these two references and how they compare to each other.

Oftentimes it is possible, by knowing the position of each citation within a document, to draw conclusions about the document type, e.g. state-of-the-art publications, etc. The gathered information can be used to classify further documents and to develop a more sophisticated ‘Web of Science’. We believe that these technologies, in combination with collaborative filtering, will be the future for identifying related work and will open the doors for powerful research paper recommender systems.

V. DISCUSSION & CONCLUSION
As shown, the CPA and COA offer substantial advantages in identifying related documents in comparison to existing approaches. However, it should also be taken into account that the effort is considerable. It is not sufficient to evaluate the bibliography of documents, but it is necessary to process the complete document, identify each reference and map it to the corresponding entry in the bibliography, which is in practice not always possible, and leads in ca. 3% of cases to mismatches. This is because sometimes only an abstract and the bibliography can be accessed, documents cannot be parsed as OCR fails, or a reference style is used that makes it unfeasible to automatically link references to the corresponding items in the bibliography. This leads to the conclusion that although these new approaches deliver superior results, they cannot completely replace the already existing approaches, but should be used in combination.

REFERENCES
[1] Gipp, B. & Beel, J. (2009). Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis. In Proceedings of the 12th International Conference on Scientometrics and Informetrics, pp. 571-575.
[2] Rip, A., & Courtial, J. (1984). Co-Word Maps of Biotechnology: An Example of Cognitive Scientometrics. Scientometrics, 6(6), 381-400.
[3] Fano, R. M. 1956. Information theory and the retrieval of recorded information, in Documentation in Action, Shera, J. H. Kent, A. Perry, J. W. (Eds.), New York: Reinhold Publ. Co., pp. 238–244.
[4] Marshakova, I. V. 1973. System of document connections based on references, Nauchno-Tekhnicheskaya Informatsiya, vol. 2, no. 6, pp. 3–8.
[5] Beel, J. & Gipp, B. 2008, The Potential of Collaborative Document Evaluation for Science, the 11th International Conference on Digital Asian Libraries (ICADL 2008), December 2 - 5, Kuta, Indonesia, published in G. Buchanan, M. Masoodian & S. Cunningham (Eds.), Digital Libraries: Universal and Ubiquitous Access to Information of Lecture Notes in Computer Science, vol. 5362, DOI 10.1007/978-3-540-89533-6, ISSN 0302-9743, pp. 375-378, Springer-Verlag Berlin Heidelberg.
[6] Small, H. 1973. Co-citation in the scientific literature: a new measure of the relationship between two documents, Journal of the American Society for Information Science, vol. 24, pp. 265–269.
[7] Klavans, R., & Boyack, K. (2006). Identifying a better measure of relatedness for mapping science, Journal of the American Society for Information Science and Technology, Vol. 57, No. 2, pp. 251-263.
[8] Sternitzke, C. Bergmann, I. (2009), Similarity measures for document mapping: A comparative study on the level of an individual scientist, Scientometrics, Vol. 78, No. 1, pp. 113-130.
[9] Garfield, E. (2001, November 27). From Bibliographic Coupling to Co-Citation Analysis Via Algorithmic Historio-Bibliography: A Citationist’s Tribute to Belver C. Griffith. Paper presented at the Drexel University, Philadelphia, PA.
[10] Giles, C. L. Bollacker, K. D. and Lawrence, S. 1998. CiteSeer: an automatic citation indexing system, In Digital Libraries 98 - The Third ACM Conference on Digital Libraries, pp. 89-98.
[11] Gipp, B. Beel, J. & Hentschel, C. (2009), Scienstein - A Research Paper Recommender System, in Proceedings of IEEE International Conference on Emerging Trends in Computing. Tamil Nadu, India.
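
As an illustration of the two classical measures described in Section II, the following minimal sketch counts bibliographic coupling strength and co-citations on a toy corpus. The function names and the input layout (a dictionary mapping each citing document to the list of documents it cites) are assumptions made for this example; they are not taken from SciPlore or from the cited literature.

```python
from collections import defaultdict
from itertools import combinations


def coupling_strength(references):
    """Bibliographic coupling: number of references shared by two citing documents."""
    strength = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(set(references[a]) & set(references[b]))
        if shared:
            strength[(a, b)] = shared
    return strength


def cocitation_counts(references):
    """Co-citation: number of documents that cite both members of a pair."""
    counts = defaultdict(int)
    for cited in references.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy corpus: citing document -> documents it cites.
corpus = {
    "A": ["C", "D", "E"],
    "B": ["C", "D", "E"],
    "F": ["C", "D"],
}

print(coupling_strength(corpus))  # {('A', 'B'): 3, ('A', 'F'): 2, ('B', 'F'): 2}
print(cocitation_counts(corpus))  # {('C', 'D'): 3, ('C', 'E'): 2, ('D', 'E'): 2}
```

Both functions look only at reference lists, which is exactly the limitation that CPA addresses by also using the position of each citation in the full text.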

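
The third CPA step and the values in Table 1 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the position tuple `(chapter, paragraph, sentence)`, the function names, the choice to keep the best CPI per pair within one document, and the equal default weights are all assumptions. The paper reports using a weighted average but does not state the weights. Mapping two citations in different chapters of one document to the "same journal / same book" value of 1/8 is likewise an assumption.

```python
from fractions import Fraction
from itertools import combinations

# CPI values from Table 1, keyed by the smallest text unit two citations share.
CPI_VALUES = {
    "sentence": Fraction(1),
    "paragraph": Fraction(1, 2),
    "chapter": Fraction(1, 4),
    "document": Fraction(1, 8),  # assumption: treated like the "same book" row
}


def pair_cpi(pos_a, pos_b):
    """CPI for two citation positions given as (chapter, paragraph, sentence)."""
    if pos_a == pos_b:
        return CPI_VALUES["sentence"]
    if pos_a[:2] == pos_b[:2]:
        return CPI_VALUES["paragraph"]
    if pos_a[0] == pos_b[0]:
        return CPI_VALUES["chapter"]
    return CPI_VALUES["document"]


def document_cpis(citations):
    """Best CPI per pair of cited works within one citing document.

    `citations` is a list of (cited_id, (chapter, paragraph, sentence)).
    """
    best = {}
    for (id_a, pos_a), (id_b, pos_b) in combinations(citations, 2):
        if id_a == id_b:
            continue  # repeated citation of the same work
        pair = tuple(sorted((id_a, id_b)))
        best[pair] = max(best.get(pair, Fraction(0)), pair_cpi(pos_a, pos_b))
    return best


def weighted_average(values, weights=None):
    """Average CPI over several citing documents (equal weights by default)."""
    weights = weights or [1] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical citing documents.
doc1 = [("X", (1, 1, 1)), ("Y", (1, 1, 1)), ("Z", (1, 2, 1))]
doc2 = [("X", (2, 1, 1)), ("Y", (2, 1, 3))]

cpi_xy = [document_cpis(doc1)[("X", "Y")], document_cpis(doc2)[("X", "Y")]]
print(cpi_xy)                    # [Fraction(1, 1), Fraction(1, 2)]
print(weighted_average(cpi_xy))  # 3/4
```

Using `Fraction` keeps the table values exact; a production system would more likely use floats and weights tuned per research field, as the text suggests.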

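
The paper does not specify how citation orders are compared in COA. One possible reading, sketched below under that caveat, treats each document as the sequence of works it cites and scores two documents by their longest common subsequence (LCS), so that omitted citations lower the score without breaking the match. The LCS measure, the normalization by the longer sequence, and all names are assumptions chosen for illustration.

```python
def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence of two citation sequences."""
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_similarity(seq_a, seq_b):
    """Share of the longer citation sequence that reappears in the same order."""
    if not seq_a or not seq_b:
        return 0.0
    return lcs_length(seq_a, seq_b) / max(len(seq_a), len(seq_b))


# Hypothetical citation sequences (identifiers of cited works, in text order).
original = ["r1", "r2", "r3", "r2", "r4"]
translated = ["r1", "r2", "r3", "r2", "r4"]  # same citations, other language
summary = ["r1", "r3", "r4"]                 # some citations omitted

print(order_similarity(original, translated))  # 1.0
print(lcs_length(original, summary))           # 3
```

A threshold on `order_similarity` would play the role of the "level of tolerance" mentioned in Section III; where to set it is an open question that the paper leaves to further research.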