
INTRODUCTION TO DATA MINING
CA-CCTP-4 DATA MINING AND DATA WAREHOUSE
1. Basic Data Mining Tasks.
2. DM versus Knowledge Discovery in Databases.
3. Data Mining Issues.
4. Data Mining Metrics.
5. Social Implications of Data Mining.
6. Overview of Applications of Data Mining.
Introduction to Data Mining :
 Mining refers to the extraction of valuable things.
 Data mining refers to the collection, cleaning, processing and analysis of data in order to retrieve
meaningful information from huge volumes of data.

 Definition :
 It is the analysis of data using software techniques and statistical methods to find patterns in the data.
 Data mining deals with the discovery of hidden knowledge, unexpected patterns and new rules from large data sets.
 Data mining is the use of algorithms to extract the information and patterns derived by the KDD (Knowledge
Discovery in Databases) process.
 It is also known as the process of extracting hidden information from a large data set,
i.e. mining knowledge from data.
 It is the extraction of interesting (non-trivial, implicit, previously unknown) knowledge from data.
 It works on historical data.
Types of Data Mining

A)Relational Database:
 A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which
data can be accessed in various ways without having to reorganize the database tables.
 Tables convey and share information, which facilitates data searchability, reporting, and organization.

B)Data warehouses:
 A Data Warehouse is the technology that collects the data from various sources within the organization to provide
meaningful business insights.
 The huge amount of data comes from multiple places such as Marketing and Finance.
 The extracted data is utilized for analytical purposes and helps in decision-making for a business organization.
 The data warehouse is designed for the analysis of data rather than transaction processing.
C)Data Repositories:
 The Data Repository generally refers to a destination for data storage.
 However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure,
for example, a group of databases in which an organization keeps various kinds of information.
D)Object-Relational Database:
 A combination of an object-oriented database model and relational database model is called an object-relational model.
 It supports Classes, Objects, Inheritance, etc.
 One of the primary objectives of the Object-relational data model is to close the gap between the Relational database
and the object-oriented model practices frequently utilized in many programming languages, for example, C++, Java,
C#, and so on.
E)Transactional Database:
 A transactional database refers to a database management system (DBMS) that has the potential to undo a database
transaction if it is not performed appropriately.
 Although this was a distinctive capability long ago, today most relational database systems support
transactional database activities.
Advantages of Data Mining
o The Data Mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make profitable adjustments in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short
time.

Disadvantages of Data Mining


o There is a possibility that organizations may sell useful customer data to other organizations for
money. According to one report, American Express has sold its customers' credit card purchase data to other
organizations.
o Many data mining analytics software packages are difficult to operate and need advanced training to work with.
o Different data mining tools operate in distinct ways due to the different algorithms used in their
design. Therefore, selecting the right data mining tool is a very challenging task.
o Data mining techniques are not perfectly precise, so they may lead to severe consequences in certain
conditions.
Data Mining versus Query Processing
1)Data mining : It deals with the discovery of hidden knowledge, unexpected patterns and new rules from large data sets.
Query processing : Queries are the primary mechanism for retrieving information from a database and consist of questions presented to the database in a predefined format.
2)Data mining : It is the analysis of data.
Query processing : It gives the answer to a question fired at the database.
3)Data mining : What we are looking for is usually yet to be found.
Query processing : What we are looking for already exists in a table in the database.
4)Data mining : It uses data from various heterogeneous data sources.
Query processing : A query can be executed on an operational or traditional database.
5)Data mining : Data from various sources needs to be integrated and pre-processed before data mining can be done.
Query processing : Query data need not be integrated and pre-processed.
6)Data mining : A query in data mining does not have any fixed format.
Query processing : A query is written in a specific syntax : Select attribute from table name where condition
7)Data mining : The output is analyzed data, presented as patterns or knowledge.
Query processing : The output is a subset of the database.
8)Data mining : Different visualization techniques such as graphical, icon-based, hierarchical, geometric, pixel-based and hybrid are used to present the output.
Query processing : The output is displayed in terms of columns or sometimes as text.
9)Data mining : We are drowning in data but starving for knowledge.
Query processing : In a normal query we are only looking for data.
10)Data mining : It is actually one of the steps of the KDD process.
Query processing : Normal queries are used within data mining when extracting patterns.
11)Data mining : There is no fixed language used for data mining.
Query processing : There are different languages : SQL, PL/SQL, MySQL
12)Data mining : Different tools are used for data mining.
Query processing : No special tool is used for a normal query.
13)Data mining : It can be used with all kinds of databases.
Query processing : It is used in an RDBMS.
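The fixed query pattern "Select attribute from table name where condition" can be sketched with Python's built-in sqlite3 module. The sales table, its columns and its rows are hypothetical and serve only as illustration. The result is simply a subset of the stored rows, not a newly discovered pattern.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "bread", 120.0),
        ("North", "milk", 80.0),
        ("South", "bread", 200.0),
        ("South", "milk", 50.0),
    ],
)

# SELECT attribute FROM table WHERE condition:
# the answer already exists in the table and is returned as a subset of it.
rows = conn.execute(
    "SELECT store, amount FROM sales WHERE product = ? ORDER BY store",
    ("bread",),
).fetchall()
print(rows)
conn.close()
```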
Data mining issues :
1)Human interaction :
When a data mining task is undertaken, the goal is often not clear at the outset.
Users as well as technical experts may not know in advance what results to expect.
A proper interface is therefore needed.

2)Overfitting :
It is a statistical modeling error.
When a model is fitted too closely to one particular data set, it may not suit other data in the future.

3)Outlier :
When a model is derived, some data values do not fit the model.
These values are significantly different from normal values, or they do not fit in any cluster.

4)Interpretation of result :
Interpreting the results is a difficult task.

5)Visualization of result :
Visualization is useful to understand and quickly view the output of different data mining algorithms.
6)Large data set :
Many techniques are difficult to apply to very large data sets.

7)High dimensionality :
A large number of attributes leads to confusion about which of them are relevant.

8)Multimedia data :
It is difficult to use traditional methods on audio, video and other multimedia data.

Other issues : Missing data, noisy data, irrelevant data
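The overfitting issue listed above can be illustrated with a small sketch. All data points, and the assumed true rule "label is high when x >= 5", are made up for this example. A model that memorizes the training data (including one noisy point) scores perfectly on that data but worse on new data than a simpler rule.

```python
# Toy 1-D data set. Assumed true rule: label is "high" when x >= 5.
# The training point (3, "high") is deliberately noisy.
train = [(1, "low"), (2, "low"), (3, "high"), (4, "low"),
         (6, "high"), (7, "high"), (8, "high")]
test = [(2.9, "low"), (3.2, "low"), (4.4, "low"), (5.5, "high")]


def memorizing_model(x):
    """1-nearest-neighbour: copies the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def simple_model(x):
    """A single threshold, which cannot memorize the noisy point."""
    return "high" if x >= 5 else "low"


def accuracy(model, data):
    """Fraction of records whose predicted label equals the true label."""
    correct = sum(1 for x, label in data if model(x) == label)
    return correct / len(data)


for name, model in [("memorizing", memorizing_model), ("simple", simple_model)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizing model reproduces the noisy training label near x = 3, so it fits the training set perfectly but misclassifies nearby test points.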


Difference between KDD and Data Mining
Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related yet
slightly different concepts.
KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process,
which deals with identifying patterns in data.
In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.
Basic Data Mining Tasks.

 Data mining functionalities specify the kinds of patterns to be identified in data mining activities.
Data mining has wide application for forecasting and characterizing data in big data.
 Data mining tasks fall into two major categories: descriptive and predictive.

A)Descriptive data mining:


 Descriptive data mining offers a detailed description of the data; for example, it gives insight into what is going
on inside the data without any prior idea.
 It brings out the common characteristics of the data.
 It covers any information that helps to grasp what is going on in the data.

B)Predictive Data Mining:


 This allows users to infer the values of features that are not explicitly available.
 For example, projecting the market analysis for the next quarters from the results of the previous quarters.
In general, predictive analysis forecasts or infers unknown values from the data previously available.
 For instance, judging from a patient's medical records whether the patient suffers from a particular illness.
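As a minimal sketch of the predictive idea, the next quarter can be projected from previous quarters by fitting a straight-line trend with least squares. The sales figures and the function name linear_forecast are hypothetical, and a linear trend is only one simple choice of model.

```python
def linear_forecast(history):
    """Fit y = intercept + slope * t by least squares and predict the next period."""
    n = len(history)
    times = range(n)
    mean_t = sum(times) / n
    mean_y = sum(history) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times, history))
             / sum((t - mean_t) ** 2 for t in times))
    intercept = mean_y - slope * mean_t
    # The next period has time index n.
    return intercept + slope * n


# Hypothetical sales for four previous quarters.
quarterly_sales = [100, 110, 120, 130]
next_quarter = linear_forecast(quarterly_sales)
print(next_quarter)
```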
Stages of data mining Process :KDD
It is the process of digging out the truth that lies in a database and is not yet known, that is, things which
have not previously been discovered.

1)Selection :
The data to be mined does not necessarily come from a single source.
The data may have many heterogeneous origins.
The data needs to be obtained from various data sources and files.
Data selection is based on the mining goal.
Data relevant to the mining task is selected from the various sources.

2)Pre-processing :
Pre-processing involves cleaning and integration of data.
The data selected for mining may contain incorrect or irrelevant values, which lead to unwanted results.
Some values may be missing or erroneous.
When data is collected from heterogeneous sources, it comes in a variety of formats.

3)Transformation :
It is the process of converting data into a format which is suitable for processing.
Data is molded into the form required by the data mining process.
4)Mining :
This step uses methods and techniques to extract the patterns present in the data.
The process involves transforming relevant data records into patterns, for example by using classification.
This step involves applying various data mining algorithms to the transformed data.
The mining process generates the desired result for which the whole KDD process is undertaken.

5)Visualization / Interpretation :
The results are presented to the user in the form of reports.
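The five stages can be sketched as a chain of small functions. Everything here is a hypothetical illustration: the two sources, the field names, the cut-off values and the counting step standing in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"customer": "c1", "age": 23, "spend": 120.0},
            {"customer": "c2", "age": None, "spend": 80.0}]
source_b = [{"customer": "c3", "age": 47, "spend": 900.0},
            {"customer": "c4", "age": 51, "spend": 1100.0},
            {"customer": "c5", "age": 35, "spend": -5.0}]  # erroneous value


def select(*sources):
    """1) Selection: gather the task-relevant fields from every source."""
    return [{"age": r["age"], "spend": r["spend"]} for src in sources for r in src]


def preprocess(records):
    """2) Pre-processing: drop records with missing or clearly invalid values."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]


def transform(records):
    """3) Transformation: recode raw numbers into categories suited to mining."""
    return [("young" if r["age"] < 40 else "older",
             "high" if r["spend"] >= 500 else "low") for r in records]


def mine(pairs):
    """4) Mining: count how often each (age group, spend level) pair occurs."""
    return Counter(pairs)


def present(patterns):
    """5) Interpretation: report the patterns in readable form."""
    return [f"{age} customers with {spend} spend: {count}"
            for (age, spend), count in sorted(patterns.items())]


report = present(mine(transform(preprocess(select(source_a, source_b)))))
print("\n".join(report))
```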

Diagram
Knowledge Representation Methods :

When the data mining task is completed, the next part is visualization of the results.
The mined knowledge is visualized or represented using various visualization tools so as to understand the knowledge
that has been derived from the mining process.
Techniques :
1)Graphical : Traditional graph structures, including bar charts, pie charts, histograms and line graphs, may be
used.
2)Geometric : This technique includes the box plot and scatter diagram techniques.
3)Icon-based : This technique uses figures, colors or other icons to represent the data.
4)Pixel-based : With this technique each data value is shown as a uniquely colored pixel.
5)Hierarchical : This technique divides the display area into regions based on data values.
6)Hybrid : The preceding approaches can be combined into one display.
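As a rough sketch of the graphical technique, a bar chart can be rendered as plain text. The segment labels are hypothetical mined results, and a real report would normally use a plotting tool instead.

```python
from collections import Counter

# Hypothetical mined result: the segment assigned to each customer.
segments = ["low", "high", "low", "medium", "low", "medium"]
counts = Counter(segments)


def text_bar_chart(counts):
    """Render one line per category, with one '#' per observation."""
    width = max(len(label) for label in counts)
    return [f"{label.ljust(width)} | {'#' * n} ({n})"
            for label, n in sorted(counts.items())]


chart = text_bar_chart(counts)
print("\n".join(chart))
```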
The following are areas where data mining is widely used:
 Data Mining in Healthcare:
 Data mining in healthcare has excellent potential to improve the health system.
 It uses data and analytics for better insights and to identify best practices that will enhance health care services
and reduce costs.
 Analysts use data mining approaches such as Machine learning, Multi-dimensional database, Data
visualization, Soft computing, and statistics.
 Data mining can be used to forecast the number of patients in each category.
 The procedures ensure that the patients get intensive care at the right place and at the right time.
 Data mining also enables healthcare insurers to recognize fraud and abuse.
 Data Mining in Market Basket Analysis:
 Market basket analysis is a modeling method based on a hypothesis: if you buy a specific group of products,
then you are more likely to buy another group of products.
 This technique may enable the retailer to understand the purchase behavior of a buyer.
 This data may assist the retailer in understanding the requirements of the buyer and altering the store's layout
accordingly.
 Analytical comparisons of results can be made between various stores and between customers in different
demographic groups.
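The market basket hypothesis is commonly checked with the support, confidence and lift of a rule such as "bread implies butter". The sketch below uses made-up transactions and hypothetical helper names.

```python
# Hypothetical shopping baskets, one set of products per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"milk", "eggs"},
    {"bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    both = support(antecedent | consequent)
    confidence = both / support(antecedent)
    lift = confidence / support(consequent)
    return both, confidence, lift


rule_support, rule_confidence, rule_lift = rule_metrics({"bread"}, {"butter"})
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests that buyers of the first group are more likely than average to buy the second group, which is the kind of finding a retailer might use when changing the store layout.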
 Data mining in Education:
 Educational data mining (EDM) is a newly emerging field, concerned with developing techniques that explore
knowledge from the data generated in educational environments.
 EDM objectives include predicting students' future learning behavior, studying the impact of
educational support, and advancing learning science.
 An institution can use data mining to make precise decisions and also to predict students' results.
With the results, the institution can concentrate on what to teach and how to teach.
Data Mining in Manufacturing Engineering:
 Knowledge is the best asset possessed by a manufacturing company.
 Data mining tools can be useful for finding patterns in a complex manufacturing process.
 Data mining can be used in system-level designing to obtain the relationships between product architecture,
product portfolio, and data needs of the customers. It can also be used to forecast the product development
period, cost, and expectations among the other tasks.
Data Mining in CRM (Customer Relationship Management):
 Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing
customer loyalty, and implementing customer-oriented strategies.
 To maintain a good relationship with its customers, a business organization needs to collect data and analyze
it. With data mining technologies, the collected data can be used for analytics.
Data Mining in Fraud detection:
 Billions of dollars are lost to fraud.
 Traditional methods of fraud detection are time-consuming and complex.
 Data mining provides meaningful patterns and turns data into information.
 An ideal fraud detection system should protect the data of all the users.
 Supervised methods use a collection of sample records that are classified as fraudulent or
non-fraudulent.
 A model is constructed using this data, and the model is then used to identify whether a new record is
fraudulent or not.
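The supervised approach can be sketched with a very simple classifier. The labeled records, the two features (amount and transactions per day) and the nearest-centroid method are all assumptions chosen for brevity; real fraud detection systems use far richer features, feature scaling and proper evaluation.

```python
import math

# Hypothetical labeled sample records: (amount, transactions per day).
labeled = {
    "non-fraudulent": [(20, 1), (35, 2), (50, 1), (40, 3)],
    "fraudulent": [(900, 10), (1200, 15), (1000, 12)],
}


def centroid(points):
    """Average of each feature over a list of records."""
    return tuple(sum(values) / len(points) for values in zip(*points))


# "Training": summarize each class by the centroid of its sample records.
model = {label: centroid(points) for label, points in labeled.items()}


def classify(record):
    """Assign the label whose centroid is closest to the record."""
    return min(model, key=lambda label: math.dist(record, model[label]))


print(classify((30, 2)))
print(classify((950, 11)))
```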
Data Mining in Lie Detection:
 Apprehending a criminal is the easier part; bringing out the truth from them is a very challenging task.
 Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist
communications, etc.
 This includes text mining, which seeks meaningful patterns in data that is usually
unstructured text.
 The information collected from previous investigations is compared, and a model for lie detection is
constructed.
Data Mining in Financial Banking:
 The digitalization of the banking system generates an enormous amount of data with every new
transaction.
 Data mining can help bankers solve business-related problems in banking and finance by
identifying trends, causalities, and correlations in business information and market costs that are not instantly
evident to managers or executives because the data volume is too large or the data is produced too rapidly to be
screened by experts.
 Managers may use these findings for better targeting, acquiring, retaining, segmenting, and maintaining
profitable customers.
What is Knowledge Discovery?
Some people don’t differentiate data mining from knowledge discovery while others view data mining as an
essential step in the process of knowledge discovery.
Here is the list of steps involved in the knowledge discovery process −
 Data Cleaning − In this step, noise and inconsistent data are removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
 Data Transformation − In this step, data is transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations.
 Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.
KDD Process Steps:
The knowledge discovery in databases (KDD) process includes the following steps:
 Goal identification: Develop and understand the application domain and the relevant prior knowledge and
identify the KDD process's goal from the customer perspective.

 Creating a target data set: Selecting the data set, or focusing on a subset of variables or data samples, on which
discovery is to be performed.

 Data cleaning and preprocessing: Basic operations include removing noise if appropriate, collecting the
necessary information to model or account for noise, deciding on strategies for handling missing data fields,
and accounting for time sequence information and known changes.

 Data reduction and projection: Finding useful features to represent the data depending on the purpose of the
task. The effective number of variables under consideration may be reduced through dimensionality reduction
or transformation methods, or invariant representations for the data can be found.

 Matching process objectives: Matching the goals of the KDD process (step 1) to a particular data mining method,
for example, summarization, classification, regression, clustering, and others.
 Modelling and exploratory analysis and hypothesis selection: Choosing the data mining algorithm(s) and
selecting the method or methods to be used to search for data patterns. This process includes deciding which models and
parameters may be appropriate (e.g., models for categorical data are different from models on vectors over the reals) and
matching a particular data mining method with the overall criteria of the KDD process (for example,
the end-user might be more interested in understanding the model than in its predictive capabilities).
 Data Mining: The search for patterns of interest in a particular representational form or a set of such
representations, including classification rules or trees, regression, and clustering. The user can significantly
aid the data mining method by carrying out the preceding steps properly.
 Presentation and evaluation: Interpreting mined patterns, possibly returning to any of
steps 1 to 7 for additional iterations. This step may also involve the visualization of the extracted patterns
and models or visualization of the data given the extracted models.
 Taking action on the discovered knowledge: Using the knowledge directly, incorporating the knowledge in
another system for further action, or simply documenting and reporting to stakeholders. This process also
includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
Data Mining Metrics:
Data mining is often regarded as one of the forms of artificial intelligence: it uses perception models, analytical models, and
multiple algorithms to imitate techniques of the human brain. Data mining helps machines make human-like
decisions and choices.

The user of data mining tools has to supply the machine with rules, preferences, and even experience in order to obtain
decision support. Data mining metrics are as follows −

Usefulness − Usefulness involves several metrics that tell us whether the model provides useful information. For
instance, a data mining model that correlates store location with sales can be both accurate and reliable, yet
not be useful, because the result cannot be generalized by adding more stores at the same location.

Furthermore, it does not answer the fundamental business question of why specific locations have more sales. One
may also find that a model that appears successful is meaningless because it depends on cross-correlations in the
data.
Return on Investment (ROI) − Data mining tools find interesting patterns buried inside the data and
develop predictive models. These models have several measures that denote how well they fit the records.
It is not always clear how to make a decision based on some of the measures reported as part of data mining
analyses.
Access Financial Information during Data Mining − The simplest way to frame decisions in financial terms is to
augment the raw information that is generally mined so that it also contains financial data. Some organizations are investing
in and developing data warehouses and data marts.

The design of a warehouse or mart takes into account the types of analyses and data needed for expected
queries. Designing warehouses in a way that allows access to financial information along with access to more
typical data on product attributes, user profiles, etc. can be useful.

Converting Data Mining Metrics into Financial Terms − A common data mining metric is the measure of "Lift". Lift
is a measure of what is achieved by using the specific model or pattern relative to a base rate in which the model is not
used. High values mean much is achieved. It can seem, then, that one can simply make a decision based on Lift.
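A minimal sketch of the Lift calculation, assuming a hypothetical marketing campaign: the response rate among the customers selected by a model is divided by the base response rate obtained without the model. All counts are invented for illustration.

```python
def lift(model_responders, model_contacted, base_responders, base_contacted):
    """Response rate with the model divided by the base rate without it."""
    model_rate = model_responders / model_contacted
    base_rate = base_responders / base_contacted
    return model_rate / base_rate


# Hypothetical campaign: 50 of 1000 customers respond when chosen at random,
# while 20 of the 100 customers ranked highest by a model respond.
campaign_lift = lift(20, 100, 50, 1000)
print(campaign_lift)
```

A lift of 1 would mean the model does no better than the base rate.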

Accuracy − Accuracy is a measure of how well the model correlates results with the attributes in the data that has been
provided. There are several measures of accuracy, but all of them depend on the data that
is used. In reality, values can be missing or approximate, or the data can have been changed by several processes.

During exploration and development, one can decide to accept a specific amount of error in the data,
especially if the data is fairly uniform in its characteristics. For example, a model that predicts sales for a specific store
based on past sales can be strongly correlated and very accurate, even if that store consistently used the wrong
accounting techniques. Thus, measurements of accuracy should be balanced by assessments of reliability.
Social Implications of Data Mining:
Data mining is the process of finding useful new correlations, patterns, and trends by sifting through large amounts of
data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. It is the
analysis of observational data sets to discover unsuspected relationships and to summarize the data in novel ways that are
both understandable and useful to the data owner.

Data mining systems are designed to facilitate the identification and classification of individuals into different groups or
segments. From the perspective of a commercial firm, and possibly of the industry as a whole, the use of data
mining can be interpreted as a discriminatory technology in the rational pursuit of profits.

There are various social implications of data mining which are as follows −

Privacy − It is a loaded issue. In recent years, privacy concerns have taken on a more important role in American society
as merchants, insurance companies, and government agencies amass warehouses of personal records.

The concerns that people have over the collection of this data will generally extend to any analytic capabilities applied to the
data. Users of data mining should start thinking about how their use of this technology will be affected by legal issues
associated with privacy.
Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use
the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or
anomalies that are very complex, difficult, or time-consuming to recognize.

The founder of Microsoft's Exploration Team used complex data mining algorithms to solve a problem that had haunted
astronomers for some years: reviewing, describing, and categorizing 2 billion sky objects recorded over 3
decades. The algorithms were able to extract the features that characterized sky objects as stars or galaxies. This developing
field of data mining and profiling has several frontiers where it can be used.

Unauthorized Use − Trends obtained through data mining, intended to be used for marketing or some other ethical
purpose, can be misused. Unethical businesses or people can use the information obtained through data mining to take advantage of
vulnerable people or to discriminate against a specific group of people. Furthermore, the data mining tech.
Data Mining Applications
Here is the list of areas where data mining is widely used −
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection
Question Bank :
1)Explain Data Mining with advantages and disadvantages.
2)Explain the steps in KDD.
3)What are the problems faced in data mining?
4)Why do we need pre-processed data?
5)Write the difference between Query processing and data mining.
6)Write a note on types of data mining.
7)What is Knowledge Discovery?
8)Write a note on Data mining applications.
