Report on

Removal of Duplicate data using AI

Submitted to
Amity University, Ranchi (Jharkhand)

In partial fulfilment of the requirements for the award of the degree of

B.Tech (CSE)

By

Md Gulam Hassnain

A35705219006

SEM-8th (2019-2023)

Under the guidance of


Dr. Kanika Thakur

DEPARTMENT OF COMPUTER SCIENCE


Amity Institute of Information Technology
AMITY UNIVERSITY JHARKHAND

Ranchi
2019- 2023

DECLARATION
I, MD GULAM HASSNAIN, student of B.Tech (CSE), hereby declare that the project titled
“Removal of Duplicate Data using AI”, which is submitted by me to the Department of
Computer Science and Technology, Amity School of Engineering and Technology, Amity
University, Ranchi, Jharkhand, in partial fulfillment of the requirement for the award of the
degree of Bachelor of Technology, has not previously formed the basis for the award of any
degree or other similar title or recognition. The author attests that permission has been
obtained for the use of any copyrighted material appearing in the dissertation/project
report other than brief excerpts requiring only proper acknowledgement in scholarly
writing, and all such use is acknowledged.

Yours sincerely

Md. Gulam Hassnain

A35705219006

CERTIFICATE
This is to certify that Mr. Md Gulam Hassnain, student of B.Tech (CSE), Ranchi,
has worked under the able guidance and supervision of Dr. Amarnath Singh,
Faculty Guide.

This project report meets the requisite standard for the partial fulfillment of the
undergraduate degree of Bachelor of Technology. To the best of my knowledge, no
part of this report is copied and the contents are based on original research.

I am aware that in case of non-compliance, Amity School of Engineering and
Technology may cancel the report.

Signature

Dr. Kanika Thakur

(Faculty Guide)

ACKNOWLEDGEMENT
I express my sincere gratitude to my faculty guide, Dr. Kanika Thakur, for her able
guidance, continuous support, and cooperation throughout my research work,
without which the present work would not have been possible. My endeavour
stands incomplete without dedicating my gratitude to her; she has contributed a
lot towards the successful completion of my research work.

I would also like to express my gratitude to my family and friends for their unending
support and tireless effort that kept me motivated throughout the completion of
this research.

Yours sincerely

MD. GULAM HASSNAIN

B. TECH (CSE)

2019-2023

TABLE OF CONTENTS

1. ABSTRACT
2. METHODOLOGY
3. INTRODUCTION
4. DATA DEDUPLICATION
5. IMPORTANCE OF REMOVING DUPLICATE DATA
6. DATA DEDUPLICATION EXPLAINED
7. BENEFITS OF DATA DEDUPLICATION
8. REAL-LIFE EXAMPLE
9. DATA DEDUPLICATION WITH AI
10. AI-POWERED BENEFITS
11. AI ALGORITHMS FOR DUPLICATE DATA REMOVAL
12. ROLE OF AI IN IMPROVING DATA DEDUPLICATION
13. STAGES OF DATA CLEANING
14. HOW DUPLICATE DATA CAN BE PROBLEMATIC FOR DATA ANALYSIS
15. LIMITATIONS AND CONSIDERATIONS
16. HIGHLIGHT FEATURES
17. SYSTEM REQUIREMENTS
18. IMPLEMENTATION
19. RESULT
20. CONCLUSION
21. REFERENCES

ABSTRACT

Getting rid of duplicate data is a crucial step in data management, since it helps to
protect the data's accuracy, consistency, and integrity. Manually locating and
eliminating duplicate data can be a difficult and time-consuming operation due to
the daily increase in data volume. This is where artificial intelligence (AI) can be
quite useful. A dataset's duplicate data can be automatically found and eliminated
using AI-powered algorithms. These algorithms often analyze data using machine
learning methods to look for patterns that might identify duplicate records. They can
also learn from human feedback to gradually increase their accuracy.

For instance, the Python module Pandas offers robust tools for data analysis and
manipulation. Duplicate entries can be eliminated from a Pandas DataFrame object
using its built-in drop_duplicates() function. The function offers more sophisticated
options for duplicate detection and removal through a variety of arguments,
including subset, keep, and inplace.

Overall, employing AI-powered libraries in Jupyter Notebook helps speed up
duplicate data removal and reduce the amount of time it takes. Users can also
perform data analysis and visualization, as well as adjust the algorithmic settings
for greater accuracy. To ensure the quality and integrity of the data, it is crucial to
validate the results and bear in mind the constraints and assumptions of the
algorithms that were employed.

METHODOLOGY

The methodology for removing duplicate data using AI in Jupyter Notebook
typically involves the following steps:

Importing data: The dataset must first be imported into Jupyter Notebook, usually
with the help of the Pandas library. This phase could entail connecting to a database
or reading data from a file.

Data cleaning: After that, the dataset is cleaned and preprocessed to guarantee
accuracy and consistency. This could entail changing data types, eliminating
unneeded columns or rows, and filling in or removing missing values.

Duplicate data detection: The cleaned dataset is then scanned for duplicate records.
This can be done with Pandas' built-in duplicated() function for exact matches, or
with AI-based techniques that compare records and flag entries that are exact or
near duplicates.

Duplicate data removal: When duplicate data is found, it can be eliminated from
the dataset. Pandas' built-in functions, or specially created scripts that eliminate
duplicate data based on a set of criteria, can be used to accomplish this.

Data validation: The dataset is tested to make sure it is correct and consistent after
duplicate data has been removed. This could entail assessing the dataset's size and
organizational structure as well as contrasting it with data from other sources.

Data analysis: Finally, analysis and modelling can be done using the cleaned and
deduplicated dataset. This could entail employing data exploration tools or
machine learning algorithms to create predictive models.
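
To make these steps concrete, here is a minimal sketch of the workflow in a single
notebook cell, assuming a hypothetical customers.csv file and illustrative column
names (name, email, phone):

```python
import pandas as pd

# Importing data: read the raw dataset (hypothetical file and column names)
df = pd.read_csv("customers.csv")

# Data cleaning: normalize key fields and drop rows missing them
df["email"] = df["email"].str.strip().str.lower()
df["phone"] = df["phone"].astype(str).str.replace(r"\D", "", regex=True)
df = df.dropna(subset=["email", "phone"])

# Duplicate data detection: flag rows that repeat an email/phone combination
dupe_mask = df.duplicated(subset=["email", "phone"], keep="first")
print(f"Found {dupe_mask.sum()} duplicate rows out of {len(df)}")

# Duplicate data removal: keep only the first occurrence of each combination
deduped = df.drop_duplicates(subset=["email", "phone"], keep="first")

# Data validation: confirm the size of the result matches expectations
assert len(deduped) == len(df) - dupe_mask.sum()
print(deduped.shape)

# Data analysis: the cleaned, deduplicated frame is now ready for further work
deduped.to_csv("customers_deduped.csv", index=False)
```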

To guarantee that the outcomes are repeatable and transparent, it is crucial to
record the actions taken and the decisions made throughout the entire process. In
order to guarantee that the data is correct and consistent, it is also crucial to
confirm the results using outside sources or professional judgement.

INTRODUCTION

Duplicate data can result in a variety of concerns in the field of data management
and analysis, from decreased accuracy to performance problems. The manual
detection and elimination of duplicate data can be difficult and time-consuming
due to the daily increase in data volume. In order to solve this issue effectively and
precisely, artificial intelligence (AI) enters the picture.

Jupyter Notebook, a popular web tool that enables users to create and share
documents containing live code, visualizations, and narrative text, is one platform
where AI-powered algorithms can be utilized for duplicate data elimination. Jupyter
Notebook has a wealth of AI-powered libraries that can assist with duplicate data
reduction and is the perfect setting for data analysis and machine learning
activities.

Users of Jupyter Notebook may automatically find and eliminate duplicate data
using AI algorithms, which makes the process quicker and more effective. Powerful
tools and algorithms can automatically find and eliminate duplicate records, and
libraries like Pandas, NumPy, and Scikit-Learn also offer sophisticated options for
more precise and efficient duplicate removal.

In Jupyter Notebook, removing duplicate data using AI can help maintain data
integrity, accuracy, and consistency while also enhancing overall data management
and analysis. By automating the onerous and repetitive processes of duplicate data
removal, it enables users to concentrate on the more intricate and valuable
components of data analysis. To ensure the quality and integrity of the data, it is
crucial to validate the results and bear in mind the constraints and assumptions
of the algorithms that were employed.

What is Data Deduplication?

Deduplicating data is the process of removing unnecessary data from a dataset. It
entails locating and deleting duplicate versions of files, emails, or other data kinds
that are exactly the same or nearly the same. By eliminating duplicates, businesses
can increase their storage capacity, shorten backup times, and enhance their
capacity for data recovery.

A pointer to the unique data copy is used to replace redundant data blocks. In this
regard, data deduplication closely resembles incremental backup, which transfers
just the data that has changed since the last backup.

To find duplicate byte patterns, data deduplication software examines the data. In
this approach, the deduplication program checks that the single stored byte pattern
is accurate and legitimate before using it as a reference. Any later request to store
the same byte pattern is answered with an additional pointer to the previously
stored byte pattern.

An example of data deduplication

The identical 1-MB file attachment may appear 100 times in a typical email system.
If the email platform is backed up or archived, all 100 instances are saved, needing
100 MB of storage space. Thanks to data deduplication, the attachment is only
stored once; subsequent copies are all linked to the original copy. In this example,
the 100 MB storage demand is reduced to 1 MB.

Importance of removing duplicate data from datasets

Duplicate data sets have the potential to contaminate the training data with the
test data, or the other way around. Outliers may undermine the training process
and cause your model to "learn" patterns that do not actually exist, while entries
with missing values will cause models to interpret features incorrectly.

Data Deduplication Explained

Different types of data deduplication exist. In its most basic version, the method
eliminates identical files at the level of individual files. File-level deduplication and
single-instance storage (SIS) are other names for this.

Deduplication goes a step further by finding and removing redundant data
segments that are the same, even though the files they are in are not totally
identical. Storage space is freed up by this process, known as block-level
deduplication or sub-file deduplication. The term deduplication on its own is
frequently used to refer to block-level deduplication; when people mean file-level
deduplication, they usually say so explicitly.

There are two types of block-level deduplication: fixed block boundaries, where the
majority of block-level deduplication takes place, and variable block boundaries,
where data is divided up at varying intervals. Once the dataset has been divided
into a number of little pieces of data, known as chunks or shards, the rest of the
procedure often stays the same.

Each shard is subjected to a hashing method, such as SHA-1, SHA-2, or SHA-256, by
the deduplication system, which results in a cryptographic alpha-numeric
representation of the shard known as a hash. The value of that hash is then checked
against a hash table or hash database to see whether it has ever been seen before.
If it has never been seen before, the new shard is written to storage and the hash
is added to the hash table or database; otherwise, the shard is discarded and a new
reference to the previously stored shard is added to the hash table or database.
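
As a rough illustration of this hashing workflow (not any particular product's
implementation), the sketch below uses fixed-size chunks, SHA-256, and a simple
in-memory dictionary standing in for the hash table and the storage system:

```python
import hashlib

def deduplicate_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size shards, store each unique shard once,
    and keep only references (hashes) for repeated shards."""
    store = {}        # hash -> shard contents (stands in for the storage system)
    references = []   # ordered list of hashes needed to rebuild the data

    for i in range(0, len(data), chunk_size):
        shard = data[i:i + chunk_size]
        digest = hashlib.sha256(shard).hexdigest()  # cryptographic fingerprint
        if digest not in store:        # never seen before: write shard to storage
            store[digest] = shard
        references.append(digest)      # always record a pointer to the shard
    return store, references

# Example: a payload with heavy repetition deduplicates to a single stored shard
store, refs = deduplicate_chunks(b"A" * 4096 * 100)
print(len(refs), "references,", len(store), "unique shard stored")
```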

What are the benefits of deduplication?

Consider how frequently you edit documents to make minor changes. Even if you
simply altered one byte, an incremental backup would still save the entire file.
Every important business asset has a chance to include duplicate data. In many
organizations, up to 80% of company data is duplicated.

Target deduplication, also known as target-side deduplication, allows a customer
to significantly reduce storage, cooling, floor space, and maintenance costs by
doing the deduplication process inside a storage system after the native data has
been placed there. A customer can save money on storage and network traffic by
employing source deduplication, also known as source-side or client-side
deduplication, where redundant data is detected at the source before being
transported across the network. This is because redundant data segments are
detected before being transferred.

With cloud storage, source deduplication performs incredibly well and can
significantly speed up backups. Deduplication speeds up backup and recovery by
lowering the amount of data and network bandwidth that backup operations
require. When deciding whether to employ deduplication, think about whether your
company could profit from these advancements.

What is a real-life deduplication example?

Consider a scenario where a company manager distributes 500 copies of a 1 MB
file containing a financial outlook report with visuals to the entire team. All 500
copies of that file are currently being kept on the company's email server. If you
utilize a data backup solution, all 500 copies of the emails will be preserved and
take up 500 MB of server space. Even a simple data deduplication mechanism at
the file level would save only one copy of the report. Every other instance refers
only to that one stored copy. This means that the unique data's final bandwidth
and storage demand on the server is only 1 MB.

Another illustration is what occurs when businesses periodically perform full
backups of files even when just a small number of bytes have changed, as a result
of long-standing design issues with backup systems. Eight weekly full backups on a
100 TB file server would produce 800 TB of backups, and incremental backups over
the same period of time would likely produce another 8 TB or more. Without
slowing down restore performance, a competent deduplication solution may
reduce this 808 TB to less than 100 TB.

Why Data Deduplication?

80% of a data scientist's time is spent on data preparation, which was rated as the
least enjoyable aspect of the job by 76% of those surveyed.

For these reasons, it would be wise to invest whatever sum of money is required to
fully automate these procedures. The challenging activities, such as connecting to
data sources and creating reliable pipelines, are what interest the data
professionals (engineers, scientists, IT teams, etc.) who are in charge of preparing
data. Making data experts perform tiresome jobs like data preparation is
detrimental, since it decreases their morale and diverts them from more crucial
duties.

Effective deduplication can significantly improve a company's bottom line.
Although the cost per stored unit has decreased as cloud storage has grown in
popularity, there are still expenses related to managing huge amounts of data, and
duplicate data increases these expenses. Decision-making may also take longer as
a result of the additional information.

Duplicate data can also produce inaccurate outcomes, making it challenging to
make wise business decisions. Such errors can be disastrous in today's highly
competitive economic environment. There are several locations and methods for
storing data, which increases the likelihood of errors.

Deduplication is a crucial step in the data cleansing process that can help to lower
this risk. For analyses to produce precise and timely findings, duplicate data must
be eliminated from databases or from data models (using a deduplication scrubber
or other tool).

The business analytics solutions from Grow can assist with data deduplication and
analysis to produce quick insights for your entire organization.

Data Deduplication with AI

According to a McKinsey & Company analysis, businesses may increase productivity
by up to 50% by utilizing AI and machine learning to enhance their data
management and analytics.

Data deduplication can be accomplished using a variety of AI algorithms, such as
deep learning and machine learning.

In order to find duplicate data, machine learning algorithms may analyze datasets
and spot trends. They can learn from prior data deduplication efforts and gradually
increase their accuracy. Deep learning techniques are very helpful for complicated
datasets because they can use neural networks to spot and get rid of duplicate
data.

AI-driven data deduplication can assist organizations in a number of ways. For
instance, it can cut down on the time and effort needed for data deduplication,
freeing up staff members to work on more important projects. Additionally, it can
increase data deduplication accuracy, lowering the possibility of data errors and
inconsistencies.

With the aid of AI-powered business analytics solutions, organizations may also
find duplicate data that might otherwise go unnoticed, resulting in a more
thorough and efficient data deduplication process. Such solutions can likewise help
organizations locate previously undiscovered data patterns and insights, leading to
better decision-making and better commercial results.

Let's imagine that a business has a client database that contains duplicate records.
To make sure that its customer information is correct and current, the organization
wishes to get rid of the duplicate records. Here is how AI, ML, and deep learning
can assist with this work:

Artificial intelligence-based methods for data deduplication: The business can
utilize AI-based methods like machine learning and deep learning to find and
eliminate duplicate client records. These methods analyze the data using
algorithms to look for patterns that point to duplicate entries.

Datasets used as training for AI-based methods: The business must create a training
dataset before using AI-based techniques for data deduplication. To train the AI
model, the dataset should contain examples of duplicate and unique customer
entries.

Consider a sample dataset of customer entries (a hypothetical version of such a
table is sketched in the code below). The dataset can be analyzed by AI to spot
duplicate customer entries. Because the email addresses and phone numbers in
this example match, AI may determine that "John Smith" and "John Doe" are the
same individual. Similarly, AI may determine that two "Sarah Brown" records refer
to the same person based on how closely their phone numbers and emails match.
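
The sample table itself is not reproduced in this text version; the sketch below
builds a hypothetical dataset consistent with the description above (all names,
email addresses, and phone numbers are illustrative) and flags records that repeat
an email address or phone number as likely duplicates:

```python
import pandas as pd

# Hypothetical customer records reconstructed from the description above
customers = pd.DataFrame({
    "name":  ["John Smith", "John Doe", "Sarah Brown", "Sarah Brown", "Amit Kumar"],
    "email": ["john@example.com", "john@example.com",
              "sarah@example.com", "sarah@example.com", "amit@example.com"],
    "phone": ["9991112222", "9991112222", "8883334444", "8883334444", "7775556666"],
})

# Flag every record whose email or phone already appeared earlier in the table
likely_dupes = customers[
    customers.duplicated(subset=["email"], keep="first")
    | customers.duplicated(subset=["phone"], keep="first")
]
print(likely_dupes)   # "John Doe" and the second "Sarah Brown" are flagged
```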

Popular AI Algorithms for Duplicate Data Removal

>Parsing

The parsing approach is employed in data purification to find syntax problems.
Lexical and domain errors can be fixed by parsing, since it first uses a sample set of
values to determine the format of the domain. Additionally, it produces a
discrepancy detector for anomaly detection.

>Data Transformation

The process of data transformation is similar to that of data cleansing in that data
is first mapped from one format to another, into a common scheme, and then it is
transformed into the desired format. Prior to mapping, transformations are used
to clean up the data by standardizing and normalizing it.

>Integrity Constraint Enforcement

When data is changed by adding, removing, or updating something, integrity is the
main concern. During integrity constraint checking, a change is rejected if any
integrity constraint is broken. Additional identified updates are applied to the
original data only if no integrity requirement has been violated.

>Duplicate Elimination

Data cleansing must include duplicate deletion. Every method of duplicate
elimination has a number of variations, and each duplicate detection method must
include an algorithm that recognizes duplicate entries.

Role of AI in improving data deduplication

AI plays a crucial role in improving data deduplication, since it helps to get beyond
some of the drawbacks of manual approaches for finding and eliminating
duplicates. The following are some ways that AI might enhance data deduplication:

1. Speed and efficiency: The ability of AI to process massive amounts of data
   rapidly and accurately is one of the main advantages of employing it for data
   deduplication. Duplicate detection in a large dataset can be time-consuming
   and laborious using standard manual methods. AI systems, on the other hand,
   can analyze enormous amounts of data and find duplicates far more quickly
   and effectively.
2. Accuracy: Accuracy is another benefit of employing AI for data deduplication.
   AI algorithms can spot patterns and resemblances that are challenging for
   humans to see, since they are built to learn from data. By utilizing AI,
   businesses can make sure that duplicates are correctly recognized and
   eliminated from their datasets, increasing the accuracy of their data as a
   whole.
3. Scalability: AI-based data deduplication methods can be used on datasets of
   any size because they are very scalable. This is especially advantageous for
   businesses that deal with enormous amounts of data, including social
   networking platforms, e-commerce businesses, and financial institutions.
   Thanks to AI, these businesses can effectively manage their data and make
   sure that duplicates are eliminated.
4. Consistency: Results from manual data deduplication techniques might be
   unpredictable and depend on the person doing the work. AI algorithms, on
   the other hand, take a consistent approach, so the same outcomes are
   obtained each time the algorithm is run. Companies that need to make sure
   their data is correct and consistent across various systems and applications
   may find this consistency useful.

5. Learning and adaptation: AI systems are able to adjust their methods for
   recognizing duplication by learning from new data. This means that the AI
   model can be modified as the data evolves over time to guarantee that it
   correctly detects and eliminates duplicates. Companies that deal with quickly
   changing data, such as healthcare providers or online retailers, might benefit
   greatly from this agility.
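
As a rough, hypothetical illustration of this learning behaviour, the sketch below
trains a simple scikit-learn classifier on a few hand-labelled record pairs; the
features (name similarity, email match, phone match) and the tiny training set are
assumptions made for the example, not part of the original report:

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(a, b):
    """Turn a pair of (name, email, phone) records into numeric features."""
    name_sim = SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()
    email_match = float(a[1].lower() == b[1].lower())
    phone_match = float(a[2] == b[2])
    return [name_sim, email_match, phone_match]

# Hand-labelled example pairs: 1 = duplicate, 0 = distinct (illustrative data)
pairs = [
    (("John Smith", "john@example.com", "9991112222"),
     ("Jon Smith", "john@example.com", "9991112222"), 1),
    (("Sarah Brown", "sarah@example.com", "8883334444"),
     ("Sarah Brown", "sarah@example.com", "8883334444"), 1),
    (("Amit Kumar", "amit@example.com", "7775556666"),
     ("Priya Singh", "priya@example.com", "6664443333"), 0),
    (("John Smith", "john@example.com", "9991112222"),
     ("Priya Singh", "priya@example.com", "6664443333"), 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

# The model learns which feature patterns correspond to "duplicate"
model = LogisticRegression().fit(X, y)

# New, unseen pair: predicted probability that the two records are duplicates
candidate = pair_features(("J. Smith", "john@example.com", "9991112222"),
                          ("John Smith", "john@example.com", "9991112222"))
print(model.predict_proba([candidate])[0][1])
```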

Stages of Data Cleaning

Stage 1: Removing Duplicates

Duplicate entries are problematic for a variety of reasons. An entry that appears
more than once is given significant weight during training. Models that seem to do
well on frequent entries actually don't. Duplicate entries may also destroy the
separation between the train, validation, and test sets when identical items are not
all in the same set. This could lead to erroneous performance forecasts that let the
model down in terms of actual outcomes.

Database duplicates can come from a wide range of causes, including processing
operations that were repeated along the data pipeline. Duplicate information
substantially impairs learning, but the issue can be fixed easily. One possibility is to
mandate that key columns be unique whenever possible. Another choice is to run
a script that will immediately detect and delete duplicate entries. This is easy to do
with Pandas' drop_duplicates() capability, as shown in the sample code below:
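
The original sample code is not reproduced in this text version; a minimal stand-in
using Pandas' drop_duplicates (with illustrative column names) might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Alice", "Carol"],
    "email": ["alice@x.com", "bob@x.com", "alice@x.com", "carol@x.com"],
})

# Remove rows that are exact duplicates across all columns
df_unique = df.drop_duplicates()

# Or consider only selected columns, keep the last occurrence, and edit in place
df.drop_duplicates(subset=["email"], keep="last", inplace=True)

print(df_unique)
print(df)
```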

Stage 2: Removing Irrelevant Data

Since data usually comes from numerous sources, there is a good chance that
a given table or database contains entries that shouldn't be there. In some
cases, it could be required to filter out older entries. In other situations, more
complex data filtering is necessary.
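
A small, hypothetical filtering example with Pandas (the file name, column names,
and cut-off date are assumptions):

```python
import pandas as pd

# Hypothetical input with a timestamp column
df = pd.read_csv("records.csv", parse_dates=["created_at"])

# Drop entries older than a chosen cut-off date
df = df[df["created_at"] >= "2020-01-01"]

# Drop entries that fall outside the scope of the analysis, e.g. a test region
df = df[df["region"] != "internal-test"]
```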

Stage 3: Fixing Structural Errors

Similar-sounding tables, columns, or values frequently coexist in the same
database. Because a data engineer inserted an underscore or capital letter where
it wasn't supposed to be, your data can end up in a mess. Consolidating these
objects will go a long way towards making your data clean and suitable for
learning.
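
A brief sketch of how such structural inconsistencies might be harmonized with
Pandas (the column and value names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Customer_Name ": ["IBM", "ibm ", "I.B.M."],
                   "City": ["Ranchi", "ranchi", "RANCHI"]})

# Standardize column names: strip stray spaces/underscores and use lower case
df.columns = (df.columns.str.strip().str.lower()
                        .str.replace("_", " ").str.replace(" ", "_"))

# Standardize values so "Ranchi", "ranchi" and "RANCHI" become one category
df["city"] = df["city"].str.strip().str.title()
print(df["city"].unique())   # ['Ranchi']
```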

Stage 4: Detecting Outliers

It might be challenging to identify outliers. It demands a greater understanding of
how the data should appear and of when entries should be ignored because they
are suspect. Think of a real estate dataset where a property value has been entered
with an extra digit. Even though it is fairly easy to make this kind of error, it can
significantly harm the model's ability to learn. Investigating the possible values and
ranges for numerical and categorical data inputs is the first step in locating
undesirable outliers. As an illustration, a negative car cost is certainly an unwanted
outlier. Additionally, by employing algorithms for anomaly or outlier identification
such as KNN or Isolation Forest, outliers can be automatically discovered and
deleted.
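
A hedged sketch of automatic outlier detection with scikit-learn's IsolationForest
(the prices and contamination rate are purely illustrative):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical property prices, one of which has an extra digit typed by mistake
prices = pd.DataFrame({"price": [250_000, 300_000, 275_000, 2_800_000, 260_000]})

# Fit an Isolation Forest and label each row as inlier (1) or outlier (-1)
model = IsolationForest(contamination=0.2, random_state=0)
prices["flag"] = model.fit_predict(prices[["price"]])

# Keep only the inliers for training
cleaned = prices[prices["flag"] == 1].drop(columns="flag")
print(cleaned)
```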

Stage 5: Handling Missing Data

The most important step in ML data cleaning is handling missing data. Missing data
can occur as a result of online forms in which only the mandatory fields were filled
out, or when tables and forms were updated. In some cases, it makes sense to
substitute the mean or most prevalent value for any missing data. If more
important fields are missing, it may be preferable to discard the entire data entry.
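
A short sketch of both strategies with Pandas (the columns and values are
hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [34, np.nan, 29, 41],
                   "city": ["Ranchi", "Delhi", None, "Patna"],
                   "income": [np.nan, 52000, 48000, np.nan]})

# Substitute a representative value: the mean for numbers, the mode for categories
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# If a critical field is missing, it may be better to discard the whole entry
df = df.dropna(subset=["income"])
print(df)
```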

How Duplicate Data Can Be Problematic for Data Analysis

>Costs And Lost Income

Consider the additional expenses involved in sending one person five copies of the
identical catalogue. Users must be able to locate duplicate records and stop new
duplicate entries from being added to the CRM in order to help avoid wasteful costs.

>Difficult Segmentation

Furthermore, it becomes challenging to segment effectively without a precise
perspective of every customer. Non-targeted email distribution can reduce open
and click-through rates, wasting your time and money. When it comes to attracting
and retaining customers, having high-quality, accurate customer information makes
it possible to provide more personalized communications.

>Less Informed Decisions and Inaccurate Reports

Make sure your data is thorough, accurate, and duplicate-free if you intend to use
it to influence decisions on how to best position your company for future
commercial growth. Decisions based on low-quality data are nothing more than
guesses.

>Poor Business Processes

Employees can switch to using more conventional techniques to maintain client
data, such as Excel or even Post-it notes, when they grow weary of the CRM due to
the volume of erroneous data and duplicates. Using such tools may limit your view
of your clients and restrict the expansion of your company.

The quantity of customer records will increase as your clientele and business
expand, which will make the data more difficult to maintain and raise the possibility
that it will be lost.

Limitations and Considerations

AI algorithms are not error-free and can still make mistakes, despite the fact that
they can considerably increase the effectiveness and accuracy of duplicate data
removal. To guarantee the quality and integrity of the data, human oversight and
validation are crucial. Users should also evaluate the results to make sure they are
acceptable and correct, and be aware of the assumptions and limits of the
algorithms being employed.

The Highlight Features:

Some of the key features of Jupyter Notebook include:

Interactive computing: Jupyter Notebook makes it simple to examine and
experiment with data by allowing users to run and modify code interactively.

Language agnostic: Python, R, Julia, and a host of other programming languages are
all supported by Jupyter Notebook. This makes it a flexible platform for activities
involving machine learning and data analysis.

Easy visualization: Jupyter Notebook offers built-in support for data visualization
tools, such as Matplotlib and Seaborn, making it easy to generate visualizations of
data.

Collaboration: Jupyter Notebook makes it simple to collaborate on data research
projects, since it enables numerous users to work on the same notebook
concurrently.

Documentation: Jupyter Notebook makes it simple to create detailed
documentation for data analysis projects by allowing users to add narrative text,
equations, and images to their notebooks.

Large ecosystem: There is a sizable and vibrant user and developer community for
Jupyter Notebook, and there are numerous libraries and extensions available to
extend its capabilities.

Cloud-based: Jupyter Notebook makes it simple to access and share data analysis
projects from anywhere, because it can be run locally on a user's computer or in the
cloud.

System Requirements:

 VS Code

C, C#, C++, Fortran, Go, Java, JavaScript, Node.js, Python, and Rust are just a few
of the programming languages that may be utilized with Visual Studio Code, a
source-code editor. It is built on the Electron framework, which is used to create
cross-platform desktop applications with Node.js.

 Python

Python is a widely used interpreted language for data processing. Python was
selected for this project due to its straightforward syntax and broad selection of
libraries and modules. Python's syntax enables programmers to express concepts
in less code than may be possible in languages like C++ or Java, because it was
designed to be extendable. The main cause of Python's strong demand among
programmers is its sizable, well-known library. MIME and HTTP are also supported
for internet-connected applications. Python version 3.6 (64-bit) is used in this work
on Windows 10.

Implementation:
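
The original implementation screenshots are not reproduced in this text version.
The following is a minimal sketch consistent with the described outcome: reading
an input CSV, removing duplicates with Pandas, and writing Result.csv. The input
file name and its contents are assumptions.

```python
import pandas as pd

# Load the raw dataset (hypothetical input file)
data = pd.read_csv("data.csv")
print("Rows before deduplication:", len(data))

# Report how many duplicate rows exist, then remove them
print("Duplicate rows found:", data.duplicated().sum())
result = data.drop_duplicates(keep="first")

# Write the cleaned dataset to Result.csv, as described in the Result section
result.to_csv("Result.csv", index=False)
print("Rows after deduplication:", len(result))
```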

Result:

A file named Result.csv is generated in which all duplicate data has been removed.

Conclusion

In Jupyter Notebook, the elimination of duplicate data using AI can dramatically
increase data integrity, correctness, and consistency while also enhancing data
administration and analysis. By automating the onerous and repetitive processes
of duplicate data removal, it enables users to concentrate on the more intricate
and valuable components of data analysis.

In conclusion, the problem of duplicate data in data management and analysis may
be effectively and accurately solved by using artificial intelligence algorithms in
Jupyter Notebook. Libraries like Pandas and Scikit-Learn, which have built-in
functions and algorithms that can automatically find and eliminate duplicate data,
offer strong tools for data manipulation and analysis. Jupyter Notebook provides an
optimal setting for data analysis and machine learning activities, enabling users to
visualize and analyze data as well as fine-tune algorithm parameters for increased
accuracy.

To ensure the quality and integrity of the data, it is crucial to validate the results
and bear in mind the constraints and assumptions of the algorithms that were
employed. In order to guarantee that the data is correct and consistent, human
oversight and validation are necessary. Overall, the removal of duplicate data using
AI algorithms in Jupyter Notebook can considerably increase data integrity,
correctness, and consistency while also enhancing data administration and
analysis.

References:

> https://www.druva.com/glossary/what-is-deduplication-definition-and-related-faqs

> https://www.grow.com/blog/data-deduplication-with-ai

> https://indicodata.ai/blog/should-we-remove-duplicates-ask-slater/

> https://runneredq.com/news/problems-generated-by-having-duplicate-data-in-a-database/

> https://deepchecks.com/what-is-data-cleaning/

> https://www.researchgate.net/publication/339561834_An_Effective_Duplicate_Removal_Algorithm_for_Text_Documents

