Batch 20

Similarity Analysis using Polya’s Counting Theorem Among Input Documents

Presented By,
Vinitha R (811718104109)
Yogeshwari M (811718104110)

Guide Name,
Dr. T. Vetriselvi B.E., M.E., M.B.A., Ph.D.
Abstract
 As technology progresses day by day, computer systems are used for more and more purposes.
 Plagiarism reproduces already existing content in a modified form. A person's original work can be identified when their ideas are presented.
 There are two approaches to detection: manual and automated.
 Many methods and techniques exist. Some of them are string matching algorithms, Natural Language Processing (NLP) and the Natural Language Toolkit (NLTK).
 The output is a similarity score in the interval from 0 to 1.
 A score of 0 means the documents have nothing in common. A score of 1 means the documents are exactly similar, and a score between 0 and 1 means the documents have partially similar contents.
 The main objective is to check the similarities among the input documents precisely using K Nearest Neighbor, String Matching Techniques and Polya’s Counting Theorem.
 For string matching, the Rabin-Karp algorithm is used.
Introduction
 Making one's work simpler by copying existing resources when creating projects, articles etc. results in PLAGIARISM.
 Earlier work used Natural Language Processing (NLP) techniques such as the Grammar-based method, the Semantic-based method and the Grammar-Semantic Hybrid method.
 Machine learning techniques include k-nearest neighbors, naive Bayes etc. Other techniques include Text Mining, Clustering, bi-grams, tri-grams, N-grams etc.
 In the 1990s the plagiarism rate was 74.4%, and nowadays it is likely to be higher than expected.
 This work uses a new concept called Polya’s Counting Theorem.
Literature Survey
S.No | Paper Name | Author Name | Journal Name | Year | Advantages | Disadvantages
1 | Online Assignment Plagiarism Checking Using Data Mining and NLP | Taresh Bokade, Tejas Chede, Dhanashri Kuwar, Rasika Shintre | International Research Journal of Engineering and Technology (IRJET) | 2021 | Uses data mining and natural language processing (NLP) techniques like TF-IDF. | Lack of efficiency in these methods.
2 | Online Assignment Plagiarism Detector | Nikhil Paymode, Rahul Yadav, Sudarshan Vichare, Suvarna Bhoir | International Journal of Advanced Research In Science, Communication and Technology (IJARSCT) | 2021 | Natural language processing (NLP) and the Natural Language Toolkit (NLTK) are used. | These methods cannot work for larger documents.
3 | Plagiarism Detection Through Data Mining Technique | Rajashekar Nennuri, M Samhitha, S Sandeep Kumar | International Conference On Recent Trends In Computing, ICRTCE-2021 | 2021 | The hybrid approach and the K-Nearest Neighbor approach give precise output. | K nearest neighbor has more steps to be passed. Different methods in a single step.
4 | Plagiarism Detection in Programming Assignments using Machine Learning | Nishesh Awale, Mitesh Pandey, Anish Dulal, Bibek Timsina | Journal of Artificial Intelligence and Capsule Networks | 2021 | Machine learning techniques predicted efficiently. | Cannot be used for documents with source code.
5 | Plagiarism Detection Methods and Tools: An Overview | Khaled, F., & H. Al-Tamimi, M.S. | Iraqi Journal of Science | 2021 | Gives an overall definition of plagiarism and works through different papers for the best-known types of plagiarism methods and tools. | Yet another discussion of the most frequent plagiarism types was extensively provided.
6 | Key Term Extraction using a Sentence based Weighted TF-IDF Algorithm | Vetriselvi T., Gopalan N.P., Kumaresan G. | International Journal Of Education and Management Engineering: Hong Kong | 2019 | Gives the best way to extract key terms, which is important in similarity analysis. | WordNet-supported word similarity values can also be implemented.
7 | Plagiarism Detection Process using Data Mining Techniques | Muhammad Usman, Muhammad Waleed Ashraf | Riphah International University, Faisalabad | 2019 | Introduced many data mining technologies. | Drawbacks in efficiency and precision.
8 | Survey and Comparison between Plagiarism Detection Tools | Mahmoud Nadim Nahas | American Journal of Data Mining and Knowledge Discovery | 2017 | Compares all the detection tools in a way users can understand. | Highly confusing because of the many comparisons between the tools and techniques.
9 | Plagiarism Detection in Programming Assignments Using Deep Features | Yasaswi, J., Purini, S., & Jawahar, C. V. | IAPR Asian Conference on Pattern Recognition | 2017 | It uses deep feature techniques throughout. | It is restricted to deep features for plagiarism detection.
10 | Plagiarism Detection Softwares and their use | T Sripathy | International Journal of Advanced Research in Science and Engineering | 2017 | Describes the uses of plagiarism detection software. | It only considered the detection software and did not compare the tools with each other for accuracy and precision.
11 | Semantic Similarity Between Sentences | Pantulkar Sravanthi, Dr. B. Srinivasu | International Research Journal of Engineering and Technology (IRJET) | 2017 | It describes the semantic similarities between the given sentences. | It lags on other concepts of predicting similarities.
12 | A novel method to find out the similarity between source codes | Agrawal M., & Sharma D. K. | IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics Engineering | 2016 | It dealt only with source code plagiarism. This paper describes efficient ways to find similarity among source codes. | This paper failed to describe every plagiarism type.
13 | Plagiarism Detection Using Artificial Intelligence Technique in Multiple Files | Mausumi Sahu | International Journal of Scientific and Technology Research | 2016 | This used more human interaction methodologies to predict similarity. | Drawbacks in prediction precision and accuracy.
14 | Plagiarism Detection by using Karp-Rabin, String Matching | Sonawane Kiran Shivaji and Prabhudeva S | International Conference On Recent Trends In Computing, ICRTCE | 2016 | The concept used predicted the output precisely. | The other such techniques predicted well.
15 | A NOVEL METHOD TO FIND OUT THE SIMILARITY BETWEEN SOURCE CODES | Mayank Agrawal, Dilip Kumar Sharma | International Conference on Electrical, Computer and Electronics Engineering | 2015 | Usage of the JSIM tool finds plagiarism. | Only allows comparison among Java source codes.
16 | A Survey of Plagiarism Detection Strategies and Methodologies in Text Document | D.R. Bhalerao, S. S. Sonawane | International Journal of Science, Engineering and Technology Research (IJSETR) | 2015 | This paper deals with strategies and methodologies in text documents. It gives various strategies and methods. | The survey was restricted to a small number of people.
17 | Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together | Sonawane Kiran Shivaji and Prabhudeva S | International Journal of Computer Applications | 2015 | This described Hybrid Artificial Neural Network and Support Vector Machine methods to detect plagiarism. | They did not use any other techniques and failed to compare with other techniques.
18 | A survey on similarity measures in text mining | M.K. Vijaymeena | Machine Learning and Applications: An International Journal | 2015 | Text mining methods were briefly explained in this paper. | Survey restriction is present in this paper.
19 | Efficient Hybrid Semantic Text Similarity using WordNet and a Corpus | I. Atoum and A. Otoom | Advanced Computer Science and Applications | 2015 | This describes WordNet and corpus similarity tools. | Lags on efficiency and precision in semantic text similarity.
20 | Detection of Source Code Plagiarism Using Machine Learning Approach | Upul Bandara and Gamini Wijayrathna | International Journal of Computer Theory and Engineering | 2012 | This paper describes source code plagiarism detection using machine learning approaches. | This paper did not work efficiently with any other approaches.
Existing System
 The tools for similarity checking may be paid or free of cost. Some of the tools are Beagle, Turnitin, Viper, Copyscape, PlagTracker, PlagSpotter, WORD-CHECK, CopyFind etc.
 There are several methods such as exact match, sentence-based match, fingerprinting, substring matching and citation-based pattern analysis.
 NLP and NLTK techniques detect the similarities.
 String matching algorithms and the K Nearest Neighbor algorithm were the next step in detecting similarities.
Proposed System
 The existing techniques may not be precise and efficient for similarity checking.
 K Nearest Neighbor may be highly effective.
 In addition, this project work introduces a concept from combinatorial theory called Polya’s Counting Theorem.
Architecture Diagram
Modules List
 Module 1: Preprocessing.
 Module 2: String Combination.
 Module 3: String Matching.
 Module 4: K Nearest Neighbor.
Module 1: Preprocessing
 Transforms raw data into a useful and efficient format.
 It includes the following techniques:
1. Tokenization
2. Stemming
3. Stopword Removal
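The three preprocessing techniques can be sketched in a few lines of Python. This is a minimal illustration only: the stopword list is a small hypothetical sample, and `crude_stem` is a simple suffix stripper standing in for a real stemmer such as Porter's, so it is not the project's actual implementation.

```python
import re

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenization, then stemming, then stopword removal, as listed above."""
    stems = [crude_stem(token) for token in tokenize(text)]
    return [stem for stem in stems if stem not in STOPWORDS]

print(preprocess("The documents are matching the copied strings"))
```

The output of `preprocess` is the list of words that the later modules would combine and compare.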
Module 2: String Combination
 Used to combine the strings from the input document.
 Can be done using Polya’s Counting Theorem.
 The objective is to find the number of words occurring in pairs.
Polya’s Counting Theorem:
 Polya’s Counting Theorem is a combinatorial theorem which predicts the number of similar occurrences from the concepts of permutation and combination.
 Statement: The configuration counting series $B(x)$ is obtained by substituting the figure-counting series $A(x^i)$ for each $y_i$ in the cycle index $Z(P; y_1, y_2, \dots, y_k)$ of the permutation group $P$:
$$B(x) = Z\left(P;\ \sum_q a_q x^q,\ \sum_q a_q x^{2q},\ \dots,\ \sum_q a_q x^{kq}\right)$$
where $A(x) = \sum_q a_q x^q$ is the figure-counting series and $B(x)$ is the configuration counting series.
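As a small worked illustration of the theorem (not the project's code), consider its simplest unweighted case: the permutation group is the group of rotations of n positions arranged in a cycle, and every $y_i$ in the cycle index is replaced by the number of available colours m. The function name and the colouring example are assumptions chosen for illustration.

```python
from math import gcd

def count_cyclic_colorings(n, m):
    """Count colourings of n positions in a cycle with m colours, up to rotation."""
    # A rotation by r steps splits the n positions into gcd(r, n) cycles,
    # so substituting m for every y_i turns its cycle-index term into m ** gcd(r, n).
    total = sum(m ** gcd(r, n) for r in range(n))
    # Averaging over the n rotations gives the number of distinct configurations.
    return total // n

print(count_cyclic_colorings(4, 2))  # (16 + 2 + 4 + 2) / 4 = 6
```

For four positions and two colours the cycle index is $(y_1^4 + y_2^2 + 2y_4)/4$, which evaluates to 6 distinct configurations.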
Algorithm for Polya’s Counting Theorem:
Step 1: Obtaining the set of words from the input document.
Set of words from document 1 = (w1, w2, w3, w4)
Step 2: Combining the words.
The combinations of words could be w1w2, w2w3, w3w4, w4w1
Step 3: Analyzing the combinations in matrix form.
If we consider two words w1 and w2, there are two possible combinations, w1w2 and w2w1. The matrix below describes the combinations of words in the documents.

      w1     w2     w3     w4
w1    -      w1w2   w1w3   w1w4
w2    w2w1   -      w2w3   w2w4
w3    w3w1   w3w2   -      w3w4
w4    w4w1   w4w2   w4w3   -
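Steps 1 to 3 can be sketched as follows. This is an illustrative reading of the slides: `cyclic_pairs` produces the wrap-around combinations of Step 2, and `pair_matrix` produces the off-diagonal entries of the Step 3 matrix. Both function names are hypothetical.

```python
from itertools import permutations

def cyclic_pairs(words):
    """Step 2: adjacent pairs with wrap-around, e.g. w1w2, w2w3, w3w4, w4w1."""
    n = len(words)
    return [(words[i], words[(i + 1) % n]) for i in range(n)]

def pair_matrix(words):
    """Step 3: every ordered pair of distinct words, keyed by (row, column)."""
    return {(a, b): a + b for a, b in permutations(words, 2)}

words = ["w1", "w2", "w3", "w4"]  # Step 1: set of words from a document
print(cyclic_pairs(words))
matrix = pair_matrix(words)
print(matrix[("w2", "w1")])  # entry in row w2, column w1
print(len(matrix))           # 4 * 3 off-diagonal entries
```

The diagonal of the matrix is left out because a word is not combined with itself.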
Graphical Representation of words:
Step 4: Finding the probability for combinations of words:
The general formula is
$$P(D) = \frac{(w_i w_j)}{W} \qquad (20)$$
where P is the probability measure,
D is the document,
$(w_i w_j)$ is the count of the word combination $w_i w_j$ in D, with i, j = 1, 2, 3, ...,
W is the total number of words in document D.
Step 5: Comparing the value of P among input documents:
The last step is to compare the probability among the input documents.
Consider the P(D) value for the word combination w1w2 among the documents D1, D2, D3, D4 and D5.
If any two documents have the same value of P(D), then that word combination is similar among those two documents.
That is, if P(D1) = P(D2) for the word combination w1w2, then the combination is similar among those two documents.
Step 5 is repeated for all the documents and all the possible combinations of input words, and hence Polya’s coefficient is found.
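Steps 4 and 5 can be sketched as below, under two assumptions that the slides do not state explicitly: a word combination is taken to be a pair of adjacent words, and probabilities are held as exact fractions so that the equality test of Step 5 is not affected by floating-point rounding. The function names and sample sentences are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def pair_probabilities(words):
    """Step 4: P(D) = count of the pair w_i w_j divided by W, for each adjacent pair."""
    total = len(words)  # W, the total number of words in the document
    counts = Counter(zip(words, words[1:]))
    return {pair: Fraction(n, total) for pair, n in counts.items()}

def similar_pairs(doc_a, doc_b):
    """Step 5: pairs whose P(D) value is the same in both documents."""
    pa, pb = pair_probabilities(doc_a), pair_probabilities(doc_b)
    return sorted(pair for pair in pa if pair in pb and pa[pair] == pb[pair])

d1 = "data mining finds similar text".split()
d2 = "we use data mining daily".split()
print(pair_probabilities(d1)[("data", "mining")])  # 1/5
print(similar_pairs(d1, d2))
```

Because P(D) divides by the document length W, two documents of different lengths can contain the same pair and still give different P(D) values, which is worth keeping in mind when comparing documents of unequal size.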
Module 3: String Matching
 Checks whether the combined strings match between two input documents after the process of string combination.
 It includes the following techniques:
1. Rabin-Karp Algorithm
2. Knuth-Morris-Pratt Algorithm
3. Boyer-Moore Algorithm
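Since the abstract names Rabin-Karp as the string matching algorithm used, a minimal sketch of it is given here. The `base` and `mod` values are conventional choices assumed for illustration, not parameters taken from the project.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the start indices of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare the actual substrings only when the hashes agree,
        # which rules out false positives caused by hash collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

The rolling hash lets each window be checked in constant time on average, so a combined string from one document can be searched for in another without comparing every character at every position.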
Module 4: K Nearest Neighbor
 Checks whether the combined strings match between two input documents after the process of string combination.
 It includes the following steps:
1. Data preprocessing step
2. Fitting the K-NN algorithm to the training set
3. Predicting the test result
4. Visualizing the output using cosine similarity
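The K-NN steps can be illustrated with a small pure-Python sketch. It assumes documents are already preprocessed into token lists, uses cosine similarity over term-frequency vectors as the closeness measure, and uses hypothetical labels ("similar", "different"); the project's actual features, labels and library are not given in the slides.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two token lists, using term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def knn_predict(query, training, k=3):
    """Return the majority label among the k training documents most similar to query.

    training is a list of (tokens, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine_similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ("data mining text".split(), "similar"),
    ("data mining tools".split(), "similar"),
    ("cooking pasta recipe".split(), "different"),
]
print(knn_predict("data mining text".split(), training, k=3))
```

The cosine value lies between 0 and 1 for term-frequency vectors, which matches the score range described in the abstract.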
Results and Screenshots
1. KNN Classifier
2. KNN Classifier along with Polya’s Counting Theorem
Comparative Analysis
Categories | Time Consumption | Time Complexity
Category 1 (KNN classifier) | 18s | O(n)
Category 2 (Polya’s Counting Theorem along with KNN) | 14s | O(1)
 Polya’s Counting Theorem combined with KNN has lower time complexity than KNN alone.
Conclusion
 Polya’s Counting Theorem eases the process of determining similarity.
 It is a step towards easier, more precise and less time-consuming results, and it reduces the steps in the KNN training set.
 It makes the final goal more efficient.
Future Enhancement
 We implemented a concept called Polya’s Counting Theorem in similarity analysis, which reduces the steps in the KNN algorithm.
 The work can be enhanced using semantic analysis methods and by including this combinatorial process in all the methods of similarity analysis.
References
 “Online Assignment Plagiarism Checking Using Data Mining and NLP”, Taresh Bokade, Tejas Chede, Dhanashri Kuwar, Prof. Rasika Shintre, International Research Journal of Engineering and Technology (IRJET), Volume 08, Issue 01, Jan 2021.
 “Online Assignment Plagiarism Detector”, Nikhil Paymode, Rahul Yadav, Sudarshan Vichare, Suvarna Bhoir, International Journal of Advanced Research In Science, Communication and Technology (IJARSCT), Volume 4, Issue 2, April 2021.
 “Plagiarism Detection Through Data Mining Techniques”, Rajashekar Nennuri, M Geetha Yadav, M Samhitha, S Sandeep Kumar, G Roshini, International Conference On Recent Trends In Computing, ICRTCE-2021.
 “Plagiarism Detection in Programming Assignments using Machine Learning”, Nishesh Awale, Mitesh Pandey, Anish Dulal, Bibek Timsina, Journal of Artificial Intelligence and Capsule Networks, Vol. 02, pp. 177-184, 2021.
 “Plagiarism Detection Using Artificial Intelligence Technique In Multiple Files”, Mausumi Sahu, International Journal of Scientific & Technology Research, Volume 5, Issue 04, April 2016.
 “Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together”, Sonawane Kiran Shivaji and Prabhudeva S, International Journal of Computer Applications (0975 – 8887), Volume 116, No. 23, April 2015.
 “A novel method to find out the similarity between source codes”, Agrawal, M., & Sharma, D. K., IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics Engineering, pp. 339-343, 2016.
 “Key Term Extraction using a Sentence based Weighted TF-IDF Algorithm”, Vetriselvi T., Gopalan N.P., Kumaresan G., International Journal Of Education and Management Engineering: Hong Kong, Vol. 9, Iss. 4, July 2019.
 “Survey and Comparison between Plagiarism Detection Tools”, Mahmoud Nadim Nahas, American Journal of Data Mining and Knowledge Discovery, Volume 2, Issue 2, pp. 50-53, 2017.
Thank You !!!