IADIS International Conference e-Learning 2008
ISBN: 978-972-8924-58-4 © 2008 IADIS
PLAGIARISM DETECTION: THE TOOL AND THE CASE STUDY
Vladislav Scherbinin
School of Information Technology and Communications, American University of Nigeria
Lamido Zubairu Way, Yola Township By-Pass, PMB 2250, Yola, Adamawa State, Nigeria,
Sergey Butakov
Department of Computer Science, Konkuk University, Chungju Campus
Gong Dong Yeon Gu Dong Room #107, 322 Danwol-Dong, Chungju, Chungcheongbuk, 380-701, South Korea,
ABSTRACT
Plagiarism is a problem in many educational institutions around the world. Preventing digital plagiarism requires an
enormous amount of work from educators. In this paper we concentrate on the implementation of a well-known
anti-plagiarism algorithm for local and global search for the original source of a plagiarized assignment. We first
discuss problems in the implementation of existing anti-plagiarism systems, and then describe an open architecture that
could be used for plagiarism detection in different kinds of assignments, from plain text to audio submissions. Finally,
we present preliminary results of applying the algorithm to an archive of more than 1000 submissions. We hope this paper
will add a new thread to the discussion of anti-plagiarism systems, especially for new types of assignments.
KEYWORDS
Plagiarism prevention, document fingerprinting, text search
1. INTRODUCTION
Plagiarism prevention is a task for all professional educators, and a number of researchers address this problem.
A query like “plagiarism detection” submitted to any global search provider will return hundreds of papers
discussing this problem, not only in education but in the research community at large. This project
concentrates primarily on student plagiarism in computer science courses, but the results could easily be
extended to almost any case of plagiarism detection. Studies on higher education name many reasons for the
plagiarism phenomenon (see Harris, 2004 or Chester, 2001), but this discussion is beyond the scope of this
paper. We will discuss tools and technologies for plagiarism prevention and come up with new suggestions,
results and ideas.
There are many systems on the market that can be used to check whether a paper was plagiarized. One of the
best known is the “Turnitin” system developed by the US company iParadigms LLC (for more details
see Turnitin, 2008). An educational organization can subscribe to the service for direct access or use the
service through plug-ins for the most popular course management systems (CMS), such as Blackboard, WebCT,
ANGEL, and Moodle.
Other popular plagiarism detection services include the CopyCatch Software Suite (CopyCatch, 2008),
available to higher educational institutions in the UK; the “Antiplagiat” plagiarism detection service used by
many universities in Russia (Antiplagiat, 2008); the MyDropBox Internet Plagiarism Detection Service
(MyDropBox, 2008); and the recently started Safe Assign project from Blackboard (Safe Assign, 2008). All of
the above-mentioned services compare a submitted paper with their own database of student papers as well as
with many other sources of information, such as open Internet sites, subscribed libraries or databases of research
papers.
Along with dedicated plagiarism detection services, another common technique employed by many
professors around the world is “to google” parts of student papers to look for similarities on the Internet.
Such an approach can be effective; however, it has its own disadvantages, named in the review by
Mottley, 2004. Here we need to mention the time it consumes and the poor results it gives when comparing two
documents that are based on the same worksheet. The last statement requires more explanation. Using
worksheets for student assignments is a common practice in many courses. The teacher provides a worksheet
with questions and expects the student to fill in the gaps. In this case the submissions from two students will have
a very high level of similarity. For example, if the question looks like “The number of operations performed by
computing agent ____”, then two answers like “The number of operations performed by computing agent __
Forty __” and “The number of operations performed by computing agent __ Forty seven__” will be
recognized by an Internet search engine as almost the same document. Thus, to compare submissions of this
kind we have to use more precise methods.
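The worksheet effect can be illustrated with a small calculation. The sketch below is only an illustration and not part of the described system: it assumes a simple word-set (Jaccard) similarity, and the helper names `words` and `jaccard` are hypothetical. It compares the two answers from the example, first as whole documents and then after the worksheet vocabulary has been subtracted.

```python
import re


def words(text: str) -> set[str]:
    """Lower-case word set; blanks such as '____' are not words and are dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words common to both sets (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


template = "The number of operations performed by computing agent ____"
answer_1 = "The number of operations performed by computing agent __ Forty __"
answer_2 = "The number of operations performed by computing agent __ Forty seven__"

# Whole documents: the shared worksheet text dominates the score.
raw_similarity = jaccard(words(answer_1), words(answer_2))

# Student-entered words only: the worksheet vocabulary is subtracted first.
own_1 = words(answer_1) - words(template)
own_2 = words(answer_2) - words(template)
own_similarity = jaccard(own_1, own_2)

print(f"whole documents: {raw_similarity:.2f}, student text only: {own_similarity:.2f}")
```

Under these assumptions the whole-document score is 0.90, while the score on the student-entered words alone is 0.50, which is the part a checker should actually judge.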
In many cases the original source of a paper can be located inside the learning community. It can be work
from the same class, from a previous semester, etc. There are some software solutions available that can help
a professor search for locally plagiarized papers. We should mention here such products as Plague Doctor
(Engels et al. 2007), the SNITCH software suite (Niezgoda and Way, 2006) and YAP3 (Wise, 1996). These systems
were developed as “local” services for a learning community and are not part of any larger system such as a
university-wide portal or CMS. However, researchers stress that very valuable information on the learning
process is located in CMS databases (Romero et al., 2007), and a new branch of data mining, educational
data mining, was introduced recently (Romero and Ventura, 2007). One of the very important fields in
educational data mining is plagiarism search and authorship identification.
As mentioned above, the Turnitin service can be used as an add-on for the Moodle CMS. Such an
implementation has two basic requirements. First, the institution has to pay for the appropriate Turnitin
subscription. This requirement might be a problem for small educational institutions. The second
requirement is stable 24/7 connectivity between the Moodle and Turnitin platforms. This requirement could be
a real obstacle for educational institutions in developing nations (Odinma et al., 2008).
In this paper we discuss a case study from the American University of Nigeria (AUN), a four-year
coeducational institution in Yola, North-East Nigeria. AUN emphasizes technology in education and all
students are equipped with wireless laptops. The university maintains a very advanced local and metropolitan
network infrastructure with 24/7 wireless access. The internal network has operated effectively for almost
three years. The external connectivity, however, depends on a satellite Internet service provider; it is expensive and
not as stable as required. Sometimes there is no Internet connectivity for more than 24 hours. This is not
a problem for the CMS implementation, because the Moodle server is located inside the local network and users
can log in to the system even when the external connectivity fails. Based on the discussion above, we can
describe the goal of this project as building an anti-plagiarism plug-in for the Moodle CMS with the following features:
(i) it should combine the advantages of local and global search; (ii) it should allow asynchronous
global search to be performed in off-peak time; and (iii) it should be an open-source solution that contributes to the
Moodle community.
In the next section we briefly discuss the algorithm and the information processing cycle implemented in the
developed software prototype. We also touch on some software development problems and present the main
concept for the future development of the project. The last section presents the results of applying the plug-in to real
data from four semesters and discusses the outcome of this implementation.
2. PROTOTYPE OF ANTI-PLAGIARISM PLUG-IN

To develop the prototype of the anti-plagiarism plug-in we need to select an algorithm for document comparison
and fit the selected algorithm into a new framework combining local and global search. There are many
well-known algorithms for plagiarism detection; for example, see the reviews by Clough, 2000 and Lukashenko,
2007. This project is aimed at the development of an anti-plagiarism module for different kinds of assignments,
such as free-style essays, worksheet-based assignments and possibly even audio podcasts. Based on this
requirement we have selected the following features for the algorithm to be implemented in the
anti-plagiarism plug-in:
• The algorithm should work on different types of input that can be converted to plain text.
• The algorithm should perform well on large amounts of data while performing one-vs.-all checks.
Based on these requirements, the Winnowing algorithm by Schleimer, Wilkerson, and Aiken (see
Schleimer et al., 2003) was selected for implementation. The algorithm calculates fingerprints
(hashes) of the input texts, and the fingerprints are then used for document comparison. The main
advantages of the Winnowing algorithm are:
1. It does not depend on the type of input. Any plain text (e.g. source code, an essay or a podcast script) can
be used as input.
2. It works fast on large sets of data. Varying the k-gram size makes it possible to perform a quick check
followed by a detailed comparison.
3. The fingerprint of a document template can easily be removed from the student submission.
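The fingerprinting idea can be sketched in a few lines. The code below is a minimal illustration of Winnowing as published by Schleimer et al., not the code of the prototype: the k-gram length `k`, the window size `w`, the use of CRC32 as the hash, and the `containment` score are all assumptions chosen for the example.

```python
import zlib


def normalize(text: str) -> str:
    """Keep letters and digits only, lower-cased (drops punctuation and whitespace)."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring of the normalized text."""
    s = normalize(text)
    return [zlib.crc32(s[i:i + k].encode("utf-8")) for i in range(len(s) - k + 1)]


def winnow(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """Select the minimum hash of each window of w consecutive k-gram hashes.

    Ties are resolved by taking the rightmost minimum. Each fingerprint is a
    (hash, position) pair; a text with fewer than w hashes forms one window.
    """
    hashes = kgram_hashes(text, k)
    fingerprints: set[tuple[int, int]] = set()
    if not hashes:
        return fingerprints
    for start in range(max(1, len(hashes) - w + 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = max(i for i, h in enumerate(window) if h == smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def containment(a_text: str, b_text: str, k: int = 5, w: int = 4) -> float:
    """Share of a_text's fingerprint hashes that also occur in b_text."""
    a = {h for h, _ in winnow(a_text, k, w)}
    b = {h for h, _ in winnow(b_text, k, w)}
    return len(a & b) / len(a) if a else 0.0


original = "xxxx the quick brown fox yyyy"
suspect = "zzzz The quick, brown FOX qqqq"
print(f"containment: {containment(suspect, original):.2f}")
```

With these parameters any shared run of at least w + k - 1 = 8 normalized characters produces at least one common fingerprint. A larger `k` gives fewer and coarser matches, which suits a quick first pass, while a smaller `k` suits a detailed comparison.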
The plug-in is intended to work with the Moodle CMS. MOODLE stands for Modular Object-Oriented
Dynamic Learning Environment (Moodle, 2008). To evaluate how efficiently the combination of Winnowing
and global search works, the loosely coupled architecture shown in Figure 1 was developed.
[Figure 1 diagram. Labels recoverable from the drawing: Moodle interface; Moodle core; Moodle DB based on almost
any SQL DBMS; Moodle module user interface (teacher); plug-in core (stored procedures); plug-in DB based on MSSQL
DBMS; external plug-in tables; the Internet; MSN Live™ Search Service.]
Figure 1. The simplified data flows for software prototype.

As can be seen from Figure 1, this architecture does not interfere with the Moodle interface, as it only
takes data from the Moodle file storage. Based on this architecture we have developed a software prototype of the
system. The prototype compares a student submission with the local document database and with the results
returned by a global search engine. Local search is performed in two main steps, with an optional
third step:
1. The system tokenizes the document and removes the meaningless characters (punctuation
characters, whitespace, etc.).
2. The system calculates the fingerprints of the submitted document and stores the results in the
database.
3. The worksheet fingerprint is removed from the submission if required.
The global search adds four more steps to this sequence:
1. The system searches the Internet for small phrases (such as short sentences) from the text. At this stage the
system connects to the free MSN Live Search service (developed by Microsoft).
2. The system calculates fingerprints for the top documents from the search results.
3. The system compares the submission with the fingerprints of the “local” and “global” documents.
4. The system decides whether the paper was plagiarized “locally” or “globally”. Based on this decision,
extra information is presented to the teacher (such as fragments of the original and the copied paper). The
screenshots with this information for local and global search are presented in Figures 2 and 3
respectively.
Figure 2. Screenshot comparing the submission with similar document from local database
Figure 3. Screenshot comparing the submission with similar document from the Internet.
The developed prototype does not satisfy all the requirements for Moodle plug-ins because it uses some
Microsoft technologies, but the same algorithms could easily be implemented in PHP, as is
necessary for Moodle. Based on the positive results of prototyping, a new open architecture for the
anti-plagiarism plug-in was developed. It is presented in Figure 4. The main concept of this architecture is to
add a tokenization request broker to the anti-plagiarism plug-in. The goal of this addition is to make the
system open to processing almost all kinds of electronic (soft-copy) submissions. The broker will determine the
type of the submitted file and call the appropriate tokenizer to perform the tokenization of the submission. As the
result of tokenization the system will get back a set of symbols (words, sentences, etc.). After this point it does not
matter to the anti-plagiarism algorithm what kind of assignment was originally submitted by the student,
because the algorithm now works with plain tokenized text.
[Figure 4 diagram. Labels recoverable from the drawing: Moodle interface; Moodle core; Moodle DB based on almost
any SQL DBMS; plug-in interface; plug-in core; plug-in tables; tokenization request broker; external tokenizers
(ET1: MS-Office files, ET2: WAV audio, …, ETN); internal tokenizers (IT1: pure text, IT2: PDF files, …, ITn);
the Internet; MSN Live™ Search Service.]
Figure 4. The preliminary architecture with tokenization request broker

As can be seen from Figure 4, such a development will move the project to another level. It will be
possible to use the anti-plagiarism module for the evaluation of audio podcasts if the appropriate tokenizer is
connected to the system. Using speech recognition software, the audio information could be transformed
into text and processed with the algorithm proposed above.
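The broker concept can be sketched as a registry that maps a file type to a tokenizer. The class below is a hypothetical illustration, not the planned PHP implementation: dispatch by file extension is an assumption, and the audio tokenizer is a stub that only marks where an external speech-recognition service would be attached.

```python
import re
from pathlib import Path
from typing import Callable

# A tokenizer turns the raw bytes of a submission into a list of plain-text tokens.
Tokenizer = Callable[[bytes], list[str]]


class TokenizationBroker:
    """Chooses a tokenizer by file extension and returns plain tokens."""

    def __init__(self) -> None:
        self._tokenizers: dict[str, Tokenizer] = {}

    def register(self, extension: str, tokenizer: Tokenizer) -> None:
        self._tokenizers[extension.lower()] = tokenizer

    def tokenize(self, filename: str, data: bytes) -> list[str]:
        extension = Path(filename).suffix.lower()
        tokenizer = self._tokenizers.get(extension)
        if tokenizer is None:
            raise ValueError(f"no tokenizer registered for {extension!r}")
        return tokenizer(data)


def plain_text_tokenizer(data: bytes) -> list[str]:
    """Internal tokenizer for pure text files."""
    return re.findall(r"[a-z0-9]+", data.decode("utf-8", errors="replace").lower())


def audio_tokenizer_stub(data: bytes) -> list[str]:
    """Placeholder for an external tokenizer backed by speech recognition."""
    raise NotImplementedError("would forward the audio to a speech-recognition service")


broker = TokenizationBroker()
broker.register(".txt", plain_text_tokenizer)
broker.register(".wav", audio_tokenizer_stub)

tokens = broker.tokenize("essay.TXT", b"Hello, World!")
print(tokens)
```

Everything downstream of `tokenize` sees only a list of tokens, so a new submission type needs only one more `register` call and no change to the comparison algorithm.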
3. RESULTS AND DISCUSSION

To evaluate the algorithm and the proposed technology we processed student submissions from a course that
was taught at AUN in 2006 – 2008. The course title is “Introduction to Computer Science”; it was offered
for the first time in Spring 2006 to computer science and software engineering majors. The course was taught
with active use of the Moodle CMS. Students submitted all the assignments through the system, and professors
entered all the grades and feedback into the system as well. The summarized results are shown in Table 1.
We observed four semesters with four different professors. In the first semester (Spring 2006) the course was
taught by two professors (referred to as A and B in Table 1). They taught it together using the same course slot
in Moodle. We did not include Fall 2006 for technical reasons. The second observation was made
on the course taught by professor B in Spring 2007. The third and fourth observations were made on the
course taught by professors B and C in separate course slots in Fall 2007. The last two observations were made
on the course taught by professors C and D in Spring 2008, also in separate slots. Teaching in separate slots
gave us the possibility to separate the results of different professors. Each student in the course is required
to submit eleven homework assignments. In each assignment the student has to insert into a
worksheet some information from simulation software and his/her own conclusions based on the simulation
results. A worksheet is an MS-Word file, and the students were expected to submit it through Moodle in
the same format. All students are expected to do the same assignment.
Table 1. Summary of plagiarism cases in four semesters.

                                Total number     Total number of    Number of assignments    Found with        Found by
                                of assignments   plagiarized        plagiarized from         manual grading    software
                                submitted        assignments        another section
                                                 #      %           #      %                 #      %          #      %
Teachers A and B; Spring 2006   292              85     29%         -      -                 22     26%        85     100%
Teacher B; Spring 2007          237              44     18%         17     38%               5      11%        44     100%
Teacher B; Fall 2007            76               13     17%         13     100%              4      30%        13     100%
Teacher C; Fall 2007            263              59     22%         24     40%               21     36%        59     100%
Teacher C; Spring 2008          93               12     13%         5      42%               *      *          12     100%
Teacher D; Spring 2008          62               10     16%         6      60%               *      *          10     100%
Total                           1023             223    22%         -      -                 52     26%        223    100%
* - not applicable (professors did not perform manual search because the system was running).
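The totals in Table 1 can be rechecked with a few lines of arithmetic. The sketch below only re-adds the counts printed in the table; the variable names are ours. It assumes that the manual-grading total is taken relative to the plagiarized submissions of the four courses in which manual search was performed, which is what the printed 26% suggests.

```python
# (course, submitted, plagiarized, found with manual grading or None if not applicable)
rows = [
    ("Teachers A and B; Spring 2006", 292, 85, 22),
    ("Teacher B; Spring 2007", 237, 44, 5),
    ("Teacher B; Fall 2007", 76, 13, 4),
    ("Teacher C; Fall 2007", 263, 59, 21),
    ("Teacher C; Spring 2008", 93, 12, None),
    ("Teacher D; Spring 2008", 62, 10, None),
]

total_submitted = sum(r[1] for r in rows)
total_plagiarized = sum(r[2] for r in rows)

# Manual search was only performed in the first four courses.
manual_rows = [r for r in rows if r[3] is not None]
manual_found = sum(r[3] for r in manual_rows)
manual_base = sum(r[2] for r in manual_rows)

for course, submitted, plagiarized, _ in rows:
    print(f"{course}: {100 * plagiarized / submitted:.1f}% of submissions plagiarized")

print(f"Total: {total_plagiarized} of {total_submitted} "
      f"({100 * total_plagiarized / total_submitted:.1f}%)")
print(f"Manual grading found {manual_found} of {manual_base} "
      f"({100 * manual_found / manual_base:.1f}%)")
```

The sums reproduce the table totals (1023 submissions, 223 plagiarized, 52 found manually). The table prints whole-number percentages, so differences below one percentage point from the values computed here are to be expected.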
Using the data described above we performed a search for plagiarized papers, with manual cross-checking
for confirmation. Table 1 summarizes the results of this search. For each course the table gives
the total number of assignments submitted, the number of copied works, the number copied from another
section, and the numbers found by manual grading and by the software.
From the table we can see that the level of plagiarism was noticeably higher before the system was
introduced (before Spring 2008). In Spring 2008 the students were aware that all submissions would be
processed by the system, but they did not believe that it would be effective. After submitting many copied
assignments for the first homework they got very poor results, and the professors demonstrated the system during
class time. This demonstration virtually stopped plagiarism in this course, with a few minor exceptions.
As we can see from Table 1, the most promising result is that the prototype uncovered all plagiarized
submissions and did not give any false hits. Such a result is of great assistance to the professors
teaching this course. The plug-in is now being implemented in the “University Writing 101” course to polish the
implementation of “global” search. The proposed technology helps to uncover cases of “local”
and “global” cheating effectively and to improve the course outcome. The results from Spring 2008 show that once
students see the anti-plagiarism plug-in in action they change their attitude and put more effort into the course.
We can therefore conclude that the prototype implementation was successful and that the project should be moved to
the next phase of development.
4. CONCLUSION
Many researchers highlight that digital plagiarism is on the rise in education. Thus it is important to develop
an open-source anti-plagiarism module compatible with popular course management systems such as
Moodle. The project has a clear and promising outcome for the education community. The software package
should support professors in uncovering copied assignments and reduce the level of digital plagiarism in
higher education. The core of the system is based on the Winnowing algorithm. It serves to compare a
student submission with documents from the local database as well as with the results of global search returned by
the MSN Live™ search engine. The project has met its first goal: a desktop prototype of the plug-in was
developed. With this prototype, the first batch of 1000+ documents was processed. The outcome of the proposed
technology is very promising. The system saves professors a lot of time, and once it is
implemented at the institution level it does not require a subscription. From the technical point of view, the
system could schedule its Internet traffic consumption and perform global search in off-peak time.
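Off-peak scheduling of the global search can be sketched as a queue that is only drained inside a configured time window. This is a hypothetical illustration: the 01:00 to 05:00 window, the class name and the `run_search` callback are assumptions, not details of the prototype.

```python
from collections import deque
from datetime import datetime, time
from typing import Callable


def is_off_peak(now: time, start: time = time(1, 0), end: time = time(5, 0)) -> bool:
    """True inside the off-peak window; also handles windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


class DeferredSearchQueue:
    """Collects submissions all day and runs their global search only off-peak."""

    def __init__(self) -> None:
        self._pending: deque[str] = deque()

    def submit(self, submission_id: str) -> None:
        self._pending.append(submission_id)

    def run_due(self, now: datetime, run_search: Callable[[str], None]) -> list[str]:
        processed: list[str] = []
        while self._pending and is_off_peak(now.time()):
            submission_id = self._pending.popleft()
            run_search(submission_id)
            processed.append(submission_id)
        return processed


queue = DeferredSearchQueue()
queue.submit("hw1-student-42")

searched: list[str] = []
daytime = queue.run_due(datetime(2008, 4, 1, 14, 0), searched.append)
night = queue.run_due(datetime(2008, 4, 1, 2, 0), searched.append)
print(daytime, night)
```

Local comparison does not need the Internet and can run immediately; only the global step is deferred, so a connectivity outage delays the global results without blocking submissions.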
The next step for the project will be the implementation of the proposed open architecture with the tokenization
request broker. This technology, along with speech recognition software, might eventually allow the evaluation of
multimedia submissions such as audio podcasts. We should mention that, along with its advantages for education,
Internet technology brings issues of digital plagiarism, and to be effective the anti-plagiarism tools should
keep up with the technology.
ACKNOWLEDGEMENTS
The work was supported by Konkuk University in 2008.
REFERENCES
Antiplagiat, 2008. Antiplagiat plagiarism detection service. Accessed February, 20, 2008 at http://www.antiplagiat.ru
Boisvert R. F., Irwin M. J., 2006. Plagiarism on the rise, Communications of the ACM, Volume 49 Issue 6, pp. 23 – 24.
Chester, G., 2001. Plagiarism detection and prevention: final report on the JISC electronic plagiarism detection project.
JISC. Retrieved January 30, 2008, from http://www.jisc.ac.uk/uploaded_documents/plagiarism_final.pdf
Clough P., 2000. Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies,
Research Memoranda, CS-00-05, Department of Computer Science, University of Sheffield, UK, 2000
CopyCatch, 2008. CopyCatch Software Suite. Accessed February, 20, 2008 at http://www.copycatchgold.com/
Engels S. et al., 2007. Plagiarism Detection Using Feature-Based Neural Networks. Proceedings of the 38th SIGCSE
technical symposium on Computer science education. 2007, pp. 34 – 38
Harris R., 2004. Anti-Plagiarism Strategies for Research Papers. Retrieved January 30, 2008, from
http://www.virtualsalt.com/antiplag.htm
JISC, 2008. JISC plagiarism detection service. Accessed February, 20, 2008 at http://www.submit.ac.uk
Lukashenko R. et al. 2007. Computer-Based Plagiarism Detection Methods and Tools: An Overview. CompSysTech '07:
Proceedings of the 2007 international conference on Computer systems and technologies, June 2007
Moodle, 2008. Moodle Statistics. Retrieved January 30, 2008, from http://moodle.org/stats/
Mottley, J., 2004. Is Google suitable for detecting plagiarism? LTSN Bioscience Bulletin, 12, 6 Retrieved January 30,
2008, from ftp://www.bioscience.heacademy.ac.uk/newsletters/ltsn12.pdf
MyDropBox, 2008. MyDropBox Internet Plagiarism Detection Service. Accessed February, 20, 2008 at
http://www.mydropbox.com.
Niezgoda S., Way T. P., 2006. SNITCH: A Software Tool for Detecting Cut and Paste Plagiarism Proceedings of
SIGCSE'06 conference, 2006, pp. 51 – 55.
Odinma, A. et al., 2008. Planning, designing and implementing an enterprise network in a developing nation, Int. J.
Enterprise Network Management, Vol. 2, No. 3, pp.301–317.
Prechelt L. et al., 2002. Finding Plagiarisms among a Set of Programs with JPlag. Journal of Universal Computer Science,
2002, Vol. 8, 11, pp. 1016 - 1038
Romero C., Ventura S., 2007. Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications,
Volume 33, Issue 1, July 2007, Pages 135-146
Romero, C. et al., 2007. Data mining in course management systems: Moodle case, Computers & Education (2007),
doi:10.1016/j.compedu.2007.05.016
Safe Assign, 2008. Safe Assign Anti-Plagiarism Service from Blackboard, Accessed March, 30, 2008 at
http://www.safeassign.com/
Schleimer S. et al., 2003. Winnowing: Local Algorithms for Document Fingerprinting. Proceedings of the ACM
SIGMOD International Conference on Management of Data, pages 76-85, June 2003.
Turnitin, 2008. Turnitin Brochure. Retrieved January 30, 2008, from http://turnitin.com/static/pdf/Turnitin_brochure.pdf
Wise M.J., 1996. YAP3: improved detection of similarities in computer program and other texts, ACM SIGCSE Bulletin,
Volume 28, Issue 1, pp. 130 - 134