Batch 20
Abstract
As technology advances day by day, computer systems are being used for more and more purposes.
Plagiarism reproduces already existing content as modified content. A person's original work can be identified from the way they represent their ideas.
There are two approaches to detecting it: manual and automated.
Many methods and techniques exist, including string matching algorithms, Natural Language Processing (NLP) and the Natural Language Toolkit (NLTK).
The similarity output lies in the interval between 0 and 1.
A value of 0 means the documents have nothing in common, a value of 1 means the documents are exactly the same, and a value between 0 and 1 means the documents have partially similar content.
The main objective is to check the similarity among the input documents precisely using K Nearest Neighbor, string matching techniques and Polya’s Counting Theorem.
Among the string matching algorithms, the Rabin-Karp algorithm is used here.
Introduction
Plagiarism arises when someone makes their work simpler by copying existing resources while creating projects, articles and so on.
Earlier work used Natural Language Processing (NLP) techniques such as the grammar-based method, the semantic-based method and the grammar-semantic hybrid method.
Machine learning techniques include k-nearest neighbors and naive Bayes. Other techniques include text mining, clustering, bi-grams, tri-grams and N-grams.
In the 1990s the plagiarism rate was 74.4%, and nowadays it is expected to be even higher.
This work uses a new concept called Polya’s Counting Theorem.
Literature Survey
| S.No | Paper Name | Author Name | Journal Name | Year | Advantages | Disadvantages |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | Plagiarism Detection Methods and Tools: An Overview | Khaled, F., & H. Al-Tamimi, M.S. | Iraqi Journal of Science | 2021 | Gives an overall definition of plagiarism and works through different papers for the best-known types of plagiarism methods and tools. | Yet another discussion of the most frequent plagiarism types was extensively provided. |
| S.No | Paper Name | Author Name | Journal Name | Year | Advantages | Disadvantages |
| --- | --- | --- | --- | --- | --- | --- |
| 6 | Key Term Extraction using a Sentence based Weighted TF-IDF Algorithm | Vetriselvi.T, Gopalan.N.P, Kumaresan.G | International Journal Of Education and Management Engineering: Hong Kong | 2019 | Gives the best way to extract key terms, which is important in similarity analysis. | WordNet-supported word similarity values can also be implemented. |
| 9 | Plagiarism Detection in Programming Assignments Using Deep Features | Yasaswi, J., Purini, S., & Jawahar, C. V | IAPR Asian Conference on Pattern Recognition | 2017 | It uses every deep features technique. | It is restricted to deep features for the plagiarism detection techniques. |
| S.No | Paper Name | Author Name | Journal Name | Year | Advantages | Disadvantages |
| --- | --- | --- | --- | --- | --- | --- |
| 16 | A Survey of Plagiarism Detection Strategies and Methodologies in Text Document | D.R. Bhalerao, S. S. Sonawane | International Journal of Science, Engineering and Technology Research (IJSETR) | 2015 | This paper deals with strategies and methodologies in text documents. It gives various strategies and methods. | The survey was restricted to only a small number of people. |
| 17 | Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together | Sonawane Kiran Shivaji and Prabhudeva S | International Journal of Computer Applications | 2015 | This describes Hybrid Artificial Neural Network and Support Vector Machine methods to detect plagiarism. | They did not use any other techniques and did not compare with other techniques. |
| 18 | A survey on similarity measures in text mining | M.K. Vijaymeena | Machine Learning and Applications: An International Journal | 2015 | Text mining methods were briefly explained in this paper. | Survey restriction is present in this paper. |
| S.No | Paper Name | Author Name | Journal Name | Year | Advantages | Disadvantages |
| --- | --- | --- | --- | --- | --- | --- |
| 19 | Efficient Hybrid Semantic Text Similarity using WordNet and a Corpus | I. Atoum and A. Otoom | Advanced Computer Science and Applications | 2015 | This describes WordNet and corpus similarity tools. | Lags in efficiency and precision in semantic text similarity. |
| 20 | Detection of Source Code Plagiarism Using Machine Learning Approach | Upul Bandara and Gamini Wijayrathna | International Journal of Computer Theory and Engineering | 2012 | This paper describes source code plagiarism detection using machine learning approaches. | This paper did not work efficiently with any other approaches. |
Existing System
Similarity-checking tools may be paid or free of cost. Some of them are Beagle, Turnitin, Viper, Copyscape, PlagTracker, PlagSpotter, WORD-CHECK and CopyFind.
There are several methods, such as exact match, sentence-based match, fingerprinting, substring matching and citation-based pattern analysis.
NLP and NLTK techniques detect the similarities.
String matching algorithms and the K Nearest Neighbor algorithm were the next step in detecting similarities.
Proposed System
The existing techniques may not be precise and efficient enough for similarity checking.
K Nearest Neighbor may be highly effective.
In addition, a concept from combinatorial theory called Polya’s Counting Theorem is used in this project work.
Architecture Diagram
Modules List
Module 1: Preprocessing.
Module 2: String Combination.
Module 3: String Matching.
Module 4: K Nearest Neighbor.
Module 1: Preprocessing
Module 2: String Combination
Polya’s Counting Theorem:
Polya’s Counting Theorem is a result from combinatorial theory that predicts the number of similar occurrences using the concepts of permutation and combination.
Statement: The configuration counting series $B(x)$ is obtained by substituting the figure-counting series $A(x^i)$ for each $y_i$ in the cycle index $Z(P; y_1, y_2, \ldots, y_k)$ of the permutation group $P$:
$$B(x) = Z\left(P;\ \sum_q a_q x^q,\ \sum_q a_q x^{2q},\ \ldots,\ \sum_q a_q x^{kq}\right)$$
where $B(x)$ is the configuration counting series and $A(x) = \sum_q a_q x^q$ is the figure-counting series.
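As an illustration of the substitution in the statement, here is a standard textbook instance that is not taken from this project: counting the two-colourings of a 4-bead necklace up to rotation. The group $C_4$ and the figure-counting series $A(x) = 1 + x$ are assumptions chosen only for this example.

```latex
% Cycle index of the rotation group C_4 acting on 4 beads
Z(C_4; y_1, y_2, y_3, y_4) = \frac{1}{4}\left(y_1^4 + y_2^2 + 2y_4\right)

% Two colours give A(x) = 1 + x, so each y_i is replaced by 1 + x^i
B(x) = \frac{1}{4}\left[(1 + x)^4 + (1 + x^2)^2 + 2(1 + x^4)\right]
     = 1 + x + 2x^2 + x^3 + x^4
```

The coefficients sum to 6, so there are six distinct necklaces, and the coefficient of $x^2$ says that two of them use exactly two beads of the second colour.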
Algorithm for Polya’s Counting Theorem:
| | w1 | w2 | w3 | w4 |
| --- | --- | --- | --- | --- |
| w1 | - | w1w2 | w1w3 | w1w4 |
| w2 | w2w1 | - | w2w3 | w2w4 |
| w3 | w3w1 | w3w2 | - | w3w4 |
| w4 | w4w1 | w4w2 | w4w3 | - |
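The table of word combinations above can be generated mechanically. The following is a minimal sketch, assuming each combination is an ordered pair of two different words; the function name `word_pairs` is hypothetical and not part of the project.

```python
from itertools import permutations


def word_pairs(words):
    """Return every ordered pair (wi, wj) of distinct words, row by row."""
    # Drop repeated words while keeping first-seen order, so that a word
    # appearing twice in the input does not produce duplicate pairs.
    unique = list(dict.fromkeys(words))
    return list(permutations(unique, 2))


pairs = word_pairs(["w1", "w2", "w3", "w4"])
print(len(pairs))   # 4 words give 4 * 3 = 12 ordered pairs
print(pairs[:3])    # first row of the table: w1w2, w1w3, w1w4
```

For n distinct words this yields n(n - 1) pairs, which matches the twelve off-diagonal cells of the table.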
Graphical Representation of words:
Step 4: Finding the probability for combinations of words:
The general formula is
$$P(D) = \frac{w_i w_j}{W} \tag{20}$$
where $P$ is the probability measure, $D$ is the document, $w_i w_j$ is a word combination with $i, j = 1, 2, 3, \ldots$, and $W$ is the total number of words in document $D$.
Step 5: Comparing the value of P among the input documents:
The last step is to compare the probabilities among the input documents. Consider the $P(D)$ value for the word combination $w_1 w_2$ in the documents $D_1, D_2, D_3, D_4$ and $D_5$.
If any two documents have the same value of $P(D)$, then that word combination is similar between those two documents. That is, if $P(D_1) = P(D_2)$ for the word combination $w_1 w_2$, then the combination is similar between $D_1$ and $D_2$.
Step 5 is repeated for all the documents and all possible combinations of input words, and hence Polya’s coefficient is found.
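Steps 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the numerator of P(D) is taken to be the number of times w_i is immediately followed by w_j in the document, words are split on whitespace, and the names `pair_probabilities` and `shared_pairs` are hypothetical.

```python
from collections import Counter


def pair_probabilities(text):
    """Map each adjacent word pair (wi, wj) to count(wi wj) / W.

    Assumption: a combination is counted when wi is immediately followed
    by wj; W is the total number of words in the document.
    """
    words = text.lower().split()
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(zip(words, words[1:]))
    return {pair: n / total for pair, n in counts.items()}


def shared_pairs(doc_a, doc_b, tol=1e-9):
    """Return the pairs whose P(D) value is equal (within tol) in both documents."""
    pa, pb = pair_probabilities(doc_a), pair_probabilities(doc_b)
    return sorted(p for p in pa.keys() & pb.keys() if abs(pa[p] - pb[p]) <= tol)


d1 = "the quick brown fox"
d2 = "a quick brown dog"
print(shared_pairs(d1, d2))
```

Because the rule in Step 5 requires equal P(D) values, two documents of different lengths only agree on a pair when its count scales with the word total; the tolerance `tol` is an added assumption to avoid floating-point mismatches.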
Module 3: String Matching
Checks whether the combined strings match between two input documents after the string combination step.
It includes the following techniques:
1. Rabin-Karp-Algorithm
2. Knuth-Morris-Pratt Algorithm
3. Boyer-Moore Algorithm
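Of the three techniques listed, the Rabin-Karp algorithm is the one this work names as being used. A minimal sketch of it is given below; the base 256 and the modulus 1,000,000,007 are assumptions chosen for illustration, and the function name `rabin_karp` is hypothetical.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare hashes first, then verify characters to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i] and append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches


print(rabin_karp("abracadabra", "abra"))
```

The rolling hash lets each window be checked in constant time on average, so a combined string from one document can be searched for in another without re-reading every character of every window.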
Module 4: K nearest Neighbor
Checks whether the combined strings match between two input documents after the string combination step.
It includes the following steps:
1. Data preprocessing
2. Fitting the K-NN algorithm to the training set
3. Predicting the test result
4. Visualizing the output using cosine similarity
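The steps above can be sketched with the standard library alone. This is a hedged illustration rather than the project's code: documents are represented as word-count vectors, cosine similarity is used as the nearness measure, and the training texts, the labels "plagiarism" and "other", and the function names are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of word-count vectors, between 0.0 and 1.0."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_predict(query, training, k=3):
    """Label query by majority vote among its k most similar training documents."""
    ranked = sorted(training, key=lambda item: cosine_similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled training set (step 2 simply stores these examples).
training = [
    ("string matching finds repeated text", "plagiarism"),
    ("rabin karp string matching algorithm", "plagiarism"),
    ("the weather is sunny today", "other"),
    ("rain is expected later today", "other"),
]
print(knn_predict("string matching algorithm for text", training, k=3))
```

K-NN has no separate training phase, so "fitting" here amounts to keeping the labelled documents; the cosine score also gives the 0-to-1 similarity value described in the abstract.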
Results and Screenshots
1. KNN CLASSIFIER:
2. KNN Classifier along with Polya’s Counting Theorem
Comparative Analysis
| Categories | Time Consumption | Time Complexity |
| --- | --- | --- |
| Category 1 (KNN classifier) | 18s | O(n) |
| Category 2 (Polya’s Counting Theorem along with KNN) | 14s | O(1) |
KNN combined with Polya’s Counting Theorem has lower time complexity than KNN alone.
Conclusion
Future Enhancement
We tried to implement a concept called Polya’s Counting Theorem in similarity analysis, which reduces the number of steps in the KNN algorithm.
The work can be enhanced using semantic analysis methods, and by including this combinatorial process in all similarity analysis methods.
Reference
“Online Assignment Plagiarism Checking Using Data Mining and NLP”, Taresh Bokade, Tejas Chede, Dhanashri Kuwar, Prof. Rasika Shintre, International Research Journal of Engineering and Technology (IRJET), Volume 08, Issue 01, Jan 2021.
“Online Assignment Plagiarism Detector”, Nikhil Paymode, Rahul Yadav, Sudarshan Vichare, Suvarna Bhoir, International Journal of Advanced Research In Science, Communication and Technology (IJARSCT), Volume 4, Issue 2, April 2021.
“Plagiarism Detection Through Data Mining Techniques”, Rajashekar Nennuri, M Geetha Yadav, M Samhitha, S Sandeep Kumar, G Roshini, International Conference On Recent Trends In Computing, ICRTCE-2021.
“Plagiarism Detection in Programming Assignments using Machine Learning”, Nishesh Awale, Mitesh Pandey, Anish Dulal, Bibek Timsina, Journal of Artificial Intelligence and Capsule Networks, Vol. 02, pp. 177-184, 2021.
“Plagiarism Detection Using Artificial Intelligence Technique In Multiple Files”, Mausumi Sahu, International Journal of Scientific & Technology Research, Volume 5, Issue 04, April 2016.
“Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together”, Sonawane Kiran Shivaji and Prabhudeva S, International Journal of Computer Applications (0975-8887), Volume 116, No. 23, April 2015.
“A novel method to find out the similarity between source codes”, Agrawal, M., & Sharma, D. K., IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics Engineering, 2016, pp. 339-343.
“Key Term Extraction using a Sentence based Weighted TF-IDF Algorithm”, Vetriselvi.T, Gopalan.N.P, Kumaresan.G, International Journal Of Education and Management Engineering: Hong Kong, Vol 9, Iss 4, July 2019.
“Survey and Comparison between Plagiarism Detection Tools”, Mahmoud Nadim Nahas, American Journal of Data Mining and Knowledge Discovery, Volume 2, Issue 2, pp. 50-53, 2017.
Thank You !!!