Plagiarism Detection: The Tool and The Case Study
Vladislav Scherbinin
School of Information Technology and Communications, American University of Nigeria
Lamido Zubairu Way, Yola Township By-Pass, PMB 2250, Yola, Adamawa State, Nigeria
Sergey Butakov
Department of Computer Science, Konkuk University, Chungju Campus
Gong Dong Yeon Gu Dong Room #107, 322 Danwol-Dong, Chungju, Chungcheongbuk, 380-701, South Korea
ABSTRACT
Plagiarism is a problem in many educational institutions around the world. Preventing digital plagiarism requires an
enormous amount of work from educators. In this paper we concentrate on the implementation of a well-known
anti-plagiarism algorithm for local and global search for the original source of a plagiarized assignment. We first
discuss problems in the implementation of existing anti-plagiarism systems, and then describe an open architecture that
could be used for plagiarism detection in different kinds of assignments, from plain text to audio submissions. Finally,
we present preliminary results on the algorithm implementation for processing an archive of 1000+ submissions. We hope
this paper will add a new thread to the discussion of anti-plagiarism systems, especially for new types of assignments.
KEYWORDS
Plagiarism prevention, document fingerprinting, text search
1. INTRODUCTION
Plagiarism prevention is a task for all professional educators, and a number of researchers address this problem.
A query such as “plagiarism detection” sent to any global search provider returns hundreds of papers discussing
the problem, not only in education but in the research community at large. This project concentrates primarily on
student plagiarism in computer science courses, but the results could easily be extended to almost any case of
plagiarism detection. Studies on higher education name many reasons for the plagiarism phenomenon (see Harris, 2004
or Chester, 2001), but that discussion is beyond the scope of this paper. We discuss tools and technologies for
plagiarism prevention and offer new suggestions, results and ideas.
There are many systems on the market that can be used to check whether a paper was plagiarized. One of the
best known is the “Turnitin” system developed by the US company iParadigms LLC (for more details
see Turnitin, 2008). An educational organization can subscribe to the service for direct access or use the
service through plug-ins for the most popular course management systems (CMS), such as Blackboard, WebCT,
ANGEL, and Moodle.
Other popular plagiarism detection services include the CopyCatch Software Suite (CopyCatch, 2008),
available for higher educational institutions in the UK; the “Antiplagiat” plagiarism detection service used by
many universities in Russia (Antiplagiat, 2008); the MyDropBox Internet Plagiarism Detection Service
(MyDropBox, 2008); and the recently started Safe Assign project from Blackboard (Safe Assign, 2008). All
of these services compare a submitted paper with their own database of student papers as well as
with many other sources of information, such as open Internet sites, subscribed libraries or databases of research
papers.
Besides dedicated plagiarism detection services, another common technique employed by many
professors around the world is “to google” parts of student papers to look for similarities on the Internet.
Such an approach can be effective; however, it has its own disadvantages, named in the review by
IADIS International Conference e-Learning 2008
Mottley (2004). Here we need to mention the time it consumes and the poor results when comparing two
documents that are based on the same worksheet. The latter statement requires more explanation. Using
worksheets for student assignments is a common practice in many courses. The teacher provides a worksheet
with questions and expects the student to fill in the gaps. In this case the submissions from two students will have
a very high level of similarity. For example, if the question looks like “The number of operations performed by
computing agent ____”, then two answers such as “The number of operations performed by computing agent __
Forty __” and “The number of operations performed by computing agent __ Forty seven__” will be
recognized by an Internet search engine as almost the same document. Thus, to compare such
submissions we have to use more precise methods.
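The worksheet effect can be illustrated with a minimal sketch. It assumes a simple word-set (Jaccard) similarity as a stand-in for a coarse, search-engine-like comparison; the measure and the function names are illustrative assumptions, not the method used by any particular search provider.

```python
import re


def word_set(text):
    """Lowercase the text and return its distinct words (gap underscores are ignored)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of distinct words that two texts have in common."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


# The two worksheet answers from the example above.
answer_1 = "The number of operations performed by computing agent __ Forty __"
answer_2 = "The number of operations performed by computing agent __ Forty seven__"

# 9 shared words out of 10 distinct words
similarity = jaccard(answer_1, answer_2)
print(similarity)
```

Under this measure the two answers look almost identical, although the only part written by the students differs. This is why the worksheet text has to be discounted before the submissions are compared.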
In many cases the original source of a paper can be located inside the learning community. It can be work
from the same class, from a previous semester, etc. Some software solutions are available that can help
a professor search for locally plagiarized papers. We should mention here such products as Plague Doctor
(Engels et al., 2007), the SNITCH software suite (Niezgoda and Way, 2006), and YAP3 (Wise, 1996). These systems
were developed as “local” services for a learning community, and they are not part of any larger system such as a
university-wide portal or CMS. However, researchers stress that very valuable information on the learning
process is located in CMS databases (Romero et al., 2007), and a new branch of data mining, educational
data mining, was introduced recently (Romero and Ventura, 2007). One of the important fields in
educational data mining is plagiarism search and authorship identification.
As mentioned above, the Turnitin service can be used as an add-on for the Moodle CMS. Such an
implementation has two basic requirements. First, the institution must pay for the appropriate Turnitin
subscription. This requirement might be a problem for small educational institutions. Second, it requires
stable 24/7 connectivity between the Moodle and Turnitin platforms. This requirement could be
a real obstacle for educational institutions in developing nations (Odinma et al., 2008).
In this paper we discuss a case study from the American University of Nigeria (AUN), a four-year
coeducational institution in Yola, North-East Nigeria. AUN emphasizes technology in education, and all
students are equipped with wireless laptops. The university maintains a very advanced local and metropolitan
network infrastructure with 24/7 wireless access. The internal network has operated effectively for almost
three years. External connectivity, however, depends on a satellite Internet service provider; it is expensive and
not as stable as required. Sometimes there is no Internet connectivity for more than 24 hours. This is not
a problem for the CMS implementation, because the Moodle server is located inside the local network and users
can log in to the system even when external connectivity fails. Based on the discussion above, we can
describe the goal of this project as building an anti-plagiarism plug-in for the Moodle CMS with the following
features: (i) it should incorporate the advantages of local and global search; (ii) it should be able to perform
asynchronous global search in off-peak time; and (iii) it should be an open-source solution that contributes to the
Moodle community.
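Feature (ii) can be illustrated with a minimal sketch of a deferred global search queue. Everything in it is a hypothetical assumption made for illustration: the off-peak window of 23:00 to 06:00, the class and function names, and the `search` and `is_online` callbacks that stand in for the real search request and connectivity check.

```python
from collections import deque
from datetime import time

# Hypothetical off-peak window; the real hours depend on the institution.
OFF_PEAK_START = time(23, 0)
OFF_PEAK_END = time(6, 0)


def is_off_peak(now, start=OFF_PEAK_START, end=OFF_PEAK_END):
    """Return True if `now` falls inside the off-peak window."""
    if start <= end:
        return start <= now < end
    # The window wraps around midnight (e.g. 23:00 to 06:00).
    return now >= start or now < end


class GlobalSearchQueue:
    """Holds submissions until a global search is allowed to run."""

    def __init__(self, search, is_online):
        self._pending = deque()
        self._search = search        # callback performing the global search
        self._is_online = is_online  # callback reporting external connectivity

    def submit(self, submission_id):
        """Queue a submission; local search can proceed independently."""
        self._pending.append(submission_id)

    def run(self, now):
        """Process queued submissions while off-peak and online."""
        processed = []
        while self._pending and is_off_peak(now) and self._is_online():
            submission_id = self._pending.popleft()
            self._search(submission_id)
            processed.append(submission_id)
        return processed
```

With such a queue a submission is accepted immediately, and the global part of the check is postponed until the satellite link is both available and lightly loaded, so an outage only delays the global search.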
In the next section we briefly discuss the algorithm and the information processing cycle implemented in the
developed software prototype. We also touch on some software development problems and outline the main
concept for the future development of the project. The last section presents the results of applying the plug-in
to real data from four semesters and discusses the outcome of this implementation.
ISBN: 978-972-8924-58-4 © 2008 IADIS
Based on these requirements, the Winnowing algorithm by Schleimer, Wilkerson, and Aiken (see
Schleimer et al., 2003) was selected for implementation. The algorithm calculates fingerprints
(hashes) of the input texts, and the fingerprints are then used for document comparison. The main
advantages of the Winnowing algorithm are:
1. It does not depend on the type of input. Any plain text (e.g. source code, an essay or a podcast script) can
be used as input.
2. It works fast on large sets of data. Scaling the grammar size makes it possible to perform a quick check
followed by a detailed comparison.
3. The fingerprint of a document template can be easily removed from the student submission.
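The fingerprinting idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the k-gram length `k=5` and window size `window=4` are arbitrary (we assume `k` corresponds to what the list above calls the grammar size), a truncated MD5 digest replaces the rolling hash of the original Winnowing paper, and template removal is done by discarding every k-gram hash that occurs in the template, which is one possible reading of point 3.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and drop everything except letters and digits (an assumption)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text, k):
    """Hash every k-gram of the text (truncated MD5 instead of a rolling hash)."""
    return [
        int.from_bytes(hashlib.md5(text[i:i + k].encode("utf-8")).digest()[:4], "big")
        for i in range(len(text) - k + 1)
    ]


def winnow(text, k=5, window=4):
    """Select the minimum hash of each window (rightmost on ties) with its position."""
    hashes = kgram_hashes(normalize(text), k)
    if not hashes:
        return set()
    window = min(window, len(hashes))
    selected = set()
    for start in range(len(hashes) - window + 1):
        current = hashes[start:start + window]
        smallest = min(current)
        offset = window - 1 - current[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def similarity(submission, other, template="", k=5, window=4):
    """Share of fingerprints in common, ignoring hashes that come from the template."""
    template_hashes = set(kgram_hashes(normalize(template), k))
    a = {h for h, _ in winnow(submission, k, window)} - template_hashes
    b = {h for h, _ in winnow(other, k, window)} - template_hashes
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

For two worksheet answers that share only the question text, `similarity(a, b)` is positive, while `similarity(a, b, template=question)` is expected to drop to zero because every shared fingerprint comes from the template. A larger `k` gives a coarser and faster check, and a smaller `k` gives a more detailed comparison.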
The plug-in is intended to work with the Moodle CMS. MOODLE stands for Modular Object-Oriented
Dynamic Learning Environment (Moodle, 2008). To evaluate how efficiently the combination of Winnowing
and global search works, the loosely coupled architecture shown in Figure 1 was developed.
[Figure 1. Loosely coupled architecture of the prototype. Recoverable labels: Moodle module, user interface
(teacher), Moodle interface, the Internet.]
4. Decide whether the paper was plagiarized “locally” or “globally”. Based on this decision,
extra information is presented to the teacher (such as fragments of the original and the copied paper).
Screenshots with this information for local and global search are presented in Figures 2 and 3
respectively.
Figure 2. Screenshot comparing the submission with a similar document from the local database.
Figure 3. Screenshot comparing the submission with a similar document from the Internet.
The developed prototype does not satisfy all the requirements for Moodle plug-ins because it uses some
Microsoft technologies, but the same algorithms could easily be implemented in PHP, as
required for Moodle. Based on the positive results of prototyping, a new open architecture for the
anti-plagiarism plug-in was developed. It is presented in the figure below. The main concept of this architecture is to
add a tokenization request broker to the anti-plagiarism plug-in. The goal of this addition is to make the
system open for processing almost all kinds of soft submissions. The broker will determine the type of the
submitted file and call the appropriate server to perform the tokenization of the submission. As a result of
tokenization, the system gets back a set of symbols (words, sentences, etc.). After this point it does not
matter to the anti-plagiarism algorithm what kind of assignment was originally submitted by a student,
because the algorithm now works with plain tokenized text.
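The broker idea described above can be sketched as a small dispatch table. This is a hypothetical illustration only: the class and function names are ours, the file type is determined from the file extension (an assumption, since the text does not say how the type is detected), and only a plain-text tokenizer is registered. Tokenizers for other formats, for example a service that transcribes audio, would be registered in the same way.

```python
from pathlib import Path
from typing import Callable, Dict, List

# A tokenizer turns the raw bytes of a submission into a list of tokens.
Tokenizer = Callable[[bytes], List[str]]


def tokenize_plain_text(data: bytes) -> List[str]:
    """Internal tokenizer for plain text: lowercase words split on whitespace."""
    return data.decode("utf-8", errors="replace").lower().split()


class TokenizationBroker:
    """Routes each submission to the tokenizer registered for its file type."""

    def __init__(self) -> None:
        self._tokenizers: Dict[str, Tokenizer] = {}

    def register(self, extension: str, tokenizer: Tokenizer) -> None:
        """Register an internal or external tokenizer for a file extension."""
        self._tokenizers[extension.lower()] = tokenizer

    def tokenize(self, filename: str, data: bytes) -> List[str]:
        """Detect the file type from the extension and call its tokenizer."""
        extension = Path(filename).suffix.lower()
        try:
            tokenizer = self._tokenizers[extension]
        except KeyError:
            raise ValueError(f"no tokenizer registered for {extension!r}") from None
        return tokenizer(data)


broker = TokenizationBroker()
broker.register(".txt", tokenize_plain_text)
tokens = broker.tokenize("essay.txt", b"Plagiarism is a problem")
```

Because every tokenizer returns the same kind of token list, the comparison algorithm does not need to know whether the tokens came from an essay, an office document or a transcribed recording.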
[Figure: open architecture of the anti-plagiarism plug-in. Recoverable labels: Moodle core; Moodle DB based on
almost any SQL DBMS; plug-in tables; plug-in core with tokenization request broker; external tokenizers
(ET1: MS-Office files, ET2: WAV audio, ..., ETN: ...); internal tokenizers (IT1: pure text, IT2: PDF files, ...,
ITn: ...); the Internet.]
Table 1
                                Total number     Plagiarized     Plagiarized from   [...] grading   [header lost]
                                of assignments   assignments     another section
                                submitted        #      %        #      %           #      %        #      %
Teachers A and B; Spring 2006   292              85     29%      -      -           22     26%      85     100%
Teacher B; Spring 2007          237              44     18%      17     38%         5      11%      44     100%
Teacher B; Fall 2007            76               13     17%      13     100%        4      30%      13     100%
Teacher C; Fall 2007            263              59     22%      24     40%         21     36%      59     100%
Teacher C; Spring 2008          93               12     13%      5      42%         *      *        12     100%
Teacher D; Spring 2008          62               10     16%      6      60%         *      *        10     100%
Total                           1023             223    22%      -      -           52     26%      223    100%
* - not applicable (professors did not perform manual search because the system was running).
Using the data described above, we performed a search for plagiarized papers with manual cross-checking
for confirmation. Table 1 summarizes the results of this search. The table gives, among other data,
the total number of assignments submitted in each course and the total number of copied works.
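The totals reported in Table 1 can be recomputed from the per-course rows. The short check below uses only the submitted and plagiarized counts from the table; the variable names are ours.

```python
# (course, assignments submitted, plagiarized assignments) as listed in Table 1
rows = [
    ("Teachers A and B; Spring 2006", 292, 85),
    ("Teacher B; Spring 2007", 237, 44),
    ("Teacher B; Fall 2007", 76, 13),
    ("Teacher C; Fall 2007", 263, 59),
    ("Teacher C; Spring 2008", 93, 12),
    ("Teacher D; Spring 2008", 62, 10),
]

total_submitted = sum(submitted for _, submitted, _ in rows)
total_plagiarized = sum(plagiarized for _, _, plagiarized in rows)
overall_share = 100 * total_plagiarized / total_submitted

print(total_submitted, total_plagiarized, round(overall_share))
```

The sums agree with the Total row of the table: 1023 submissions, 223 of them plagiarized, which is about 22%.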
From the table we can see that the level of plagiarism was remarkably higher before the system
was implemented (before Spring 2008). In Spring 2008 the students were aware that all submissions would be
processed by the system, but they did not believe that it would be effective. After submitting many copied
assignments in the first homework they got very poor results, and the professors demonstrated the system during
class time. This demonstration virtually stopped plagiarism in this course, with a few minor exceptions.
As we can see from Table 1, the most promising result is that the prototype uncovered all plagiarized
submissions and did not give any false hits. Such a result is of great assistance to the professors
teaching this course. The plug-in is now being deployed in the “University Writing 101” course to polish the
implementation of “global” search. The proposed technology helps to uncover cases of “local”
and “global” cheating effectively and to improve course outcomes. The results from Spring 2008 show that once
students see the anti-plagiarism plug-in in action they change their attitude and put more effort into the course.
We can therefore conclude that the prototype implementation was successful and that the project should be moved
to the next phase of development.
4. CONCLUSION
Many researchers highlight that digital plagiarism is on the rise in education. It is therefore important to develop
an open-source anti-plagiarism module compatible with popular course management systems such as
Moodle. The project has a clear and promising outcome for the education community. The software package
should support professors in uncovering copied assignments and reduce the level of digital plagiarism in
higher education. The core of the system is based on the Winnowing algorithm. It serves to compare a
student submission with documents from the local database as well as with the results of a global search returned by
the MSN Live™ search engine. The project has met its first goal: the desktop prototype of the plug-in was
developed. With this prototype, the first batch of 1000+ documents was processed. The outcome of the proposed
technology is very promising. The system saves professors a lot of time, and once it is
implemented at the institution level it does not require a subscription. From the technical point of view, the
system can schedule its Internet traffic consumption and perform global search in off-peak time.
The next step for the project will be the implementation of the proposed open architecture with the tokenization
request broker. This technology, along with speech recognition software, might eventually allow evaluation of
multimedia submissions such as audio podcasts. We should mention that, along with its advantages for education,
Internet technology brings the problem of digital plagiarism, and to be effective anti-plagiarism tools should
keep pace with the technology.
ACKNOWLEDGEMENTS
The work was supported by Konkuk University in 2008.
REFERENCES
Antiplagiat, 2008. Antiplagiat plagiarism detection service. Accessed February, 20, 2008 at http://www.antiplagiat.ru
Boisvert R. F., Irwin M. J., 2006. Plagiarism on the rise, Communications of the ACM, Volume 49 Issue 6, pp. 23 – 24.
Chester, G., 2001. Plagiarism detection and prevention: final report on the JISC electronic plagiarism detection project.
JISC. Retrieved January 30, 2008, from http://www.jisc.ac.uk/uploaded_documents/plagiarism_final.pdf
Clough P., 2000. Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies,
Research Memoranda, CS-00-05, Department of Computer Science, University of Sheffield, UK, 2000
CopyCatch, 2008. CopyCatch Software Suite. Accessed February, 20, 2008 at http://www.copycatchgold.com/
Engels S. et al., 2007. Plagiarism Detection Using Feature-Based Neural Networks. Proceedings of the 38th SIGCSE
Technical Symposium on Computer Science Education, 2007, pp. 34-38.
Harris R., 2004. Anti-Plagiarism Strategies for Research Papers. Retrieved January 30, 2008, from
http://www.virtualsalt.com/antiplag.htm
JISC, 2008. JISC plagiarism detection service. Accessed February, 20, 2008 at http://www.submit.ac.uk
Lukashenko R. et al., 2007. Computer-Based Plagiarism Detection Methods and Tools: An Overview. CompSysTech '07:
Proceedings of the 2007 International Conference on Computer Systems and Technologies, June 2007.
Moodle, 2008. Moodle Statistics. Retrieved January 30, 2008, from http://moodle.org/stats/
Mottley, J., 2004. Is Google suitable for detecting plagiarism? LTSN Bioscience Bulletin, 12, 6 Retrieved January 30,
2008, from ftp://www.bioscience.heacademy.ac.uk/newsletters/ltsn12.pdf
MyDropBox, 2008. MyDropBox Internet Plagiarism Detection Service. Accessed February, 20, 2008 at
http://www.mydropbox.com.
Niezgoda S., Way T. P., 2006. SNITCH: A Software Tool for Detecting Cut and Paste Plagiarism. Proceedings of
SIGCSE'06 conference, 2006, pp. 51-55.
Odinma, A. et al., 2008. Planning, designing and implementing an enterprise network in a developing nation, Int. J.
Enterprise Network Management, Vol. 2, No. 3, pp. 301-317.
Prechelt L. et al., 2002. Finding Plagiarisms among a Set of Programs with JPlag. Journal of Universal Computer Science,
2002, Vol. 8, 11, pp. 1016-1038.
Romero C., Ventura S., 2007. Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications,
Volume 33, Issue 1, July 2007, Pages 135-146
Romero, C. et al., 2007. Data mining in course management systems: Moodle case, Computers & Education (2007),
doi:10.1016/j.compedu.2007.05.016
Safe Assign, 2008. Safe Assign Anti-Plagiarism Service from Blackboard, Accessed March, 30, 2008 at
http://www.safeassign.com/
Schleimer S. et al., 2003. Winnowing: Local Algorithms for Document Fingerprinting. Proceedings of the ACM
SIGMOD International Conference on Management of Data, pages 76-85, June 2003.
Turnitin, 2008. Turnitin Brochure. Retrieved January 30, 2008, from http://turnitin.com/static/pdf/Turnitin_brochure.pdf
Wise M.J., 1996. YAP3: improved detection of similarities in computer program and other texts, ACM SIGCSE Bulletin,
Volume 28, Issue 1, pp. 130 - 134