HTCB unit 2

Unit-2 Basic Concept of Data Mining

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Data mining is the process of discovering patterns, correlations, and
anomalies in large datasets to predict outcomes. Using a mix of statistics, machine
learning, and database systems, it aims to turn raw data into useful information.

Application: Retailers use data mining to understand customer purchasing patterns and
predict future buying behaviors.

Data Collection

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: This is the process of gathering and measuring information on variables of
interest in a systematic way.

Example: Collecting customer transaction data from a supermarket.

Application: Enables targeted marketing campaigns by understanding customer
preferences.

Types of Data

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Data can be categorized into various types based on its characteristics.

1. Structured Data: Organized in rows and columns (e.g., databases, spreadsheets).

- Example: Employee records in a company’s database.

2. Unstructured Data: Not organized in a predefined manner (e.g., emails, videos).

- Example: Social media posts.


3. Semi-Structured Data: Contains tags or markers to separate data elements.

- Example: XML files.

Application: Different types of data require different techniques for processing and
analysis.
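
Illustration: the sketch below reads one record in each of the three forms using only Python's standard library, to show why each type needs its own handling. The sample values and field names (emp_id, dept, and so on) are invented for this example and are not taken from any real dataset.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, so each field is addressed by column name.
structured = "emp_id,name,dept\n101,Asha,Sales\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tags mark the elements, but the layout may vary per record.
semi_structured = "<employee id='101'><name>Asha</name><dept>Sales</dept></employee>"
root = ET.fromstring(semi_structured)

# Unstructured: free text with no schema; the most we can do here is tokenize.
unstructured = "Asha from Sales posted: great quarter, team!"
tokens = (
    unstructured.lower()
    .replace(",", "")
    .replace("!", "")
    .replace(":", "")
    .split()
)

print(rows[0]["dept"])         # field lookup by column name
print(root.find("dept").text)  # field lookup by tag
print(tokens[:3])              # only raw words are available
```

The structured and semi-structured records can be queried by field, while the unstructured text would need further techniques (for example text mining) before any field-level analysis.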

KDD Process (Knowledge Discovery in Databases)

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: A process of extracting useful information from large datasets, involving
several steps.

1. Data Selection: Choosing the relevant data for analysis.

2. Data Cleaning: Removing noise and inconsistencies.

3. Data Transformation: Converting data into suitable formats.

4. Data Mining: Applying algorithms to discover patterns.

5. Interpretation/Evaluation: Making sense of the discovered patterns.

Example: Analyzing medical records to find patterns related to a disease outbreak.
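
Illustration: the five steps above can be chained as plain functions. This is a minimal sketch under stated assumptions, not a real epidemiological method: the records, the 38.0 °C fever threshold, and the 0.5 alert threshold are all hypothetical.

```python
# Hypothetical raw records; None marks a missing temperature reading.
records = [
    {"id": 1, "name": "A", "region": "north", "temp_c": 39.1},
    {"id": 2, "name": "B", "region": "north", "temp_c": None},
    {"id": 3, "name": "C", "region": "south", "temp_c": 36.8},
    {"id": 4, "name": "D", "region": "north", "temp_c": 38.7},
    {"id": 4, "name": "D", "region": "north", "temp_c": 38.7},  # duplicate
    {"id": 5, "name": "E", "region": "south", "temp_c": 37.0},
]

def select(rows):
    # 1. Data Selection: keep only the fields relevant to the analysis.
    return [(r["id"], r["region"], r["temp_c"]) for r in rows]

def clean(rows):
    # 2. Data Cleaning: drop missing readings and exact duplicates.
    seen, out = set(), []
    for row in rows:
        if row[2] is not None and row not in seen:
            seen.add(row)
            out.append(row)
    return out

def transform(rows, fever_threshold=38.0):
    # 3. Data Transformation: turn a temperature into a fever flag.
    return [(region, temp >= fever_threshold) for _, region, temp in rows]

def mine(rows):
    # 4. Data Mining: compute the fever rate per region.
    counts = {}
    for region, fever in rows:
        total, fevers = counts.get(region, (0, 0))
        counts[region] = (total + 1, fevers + int(fever))
    return {region: fevers / total for region, (total, fevers) in counts.items()}

def evaluate(rates, alert_threshold=0.5):
    # 5. Interpretation/Evaluation: report regions above the alert threshold.
    return sorted(region for region, rate in rates.items() if rate > alert_threshold)

rates = mine(transform(clean(select(records))))
flagged = evaluate(rates)
print(rates, flagged)
```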

Data Preprocessing

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Preparing raw data for analysis by cleaning, transforming, and organizing it.

Example: Removing duplicates and filling missing values in a customer dataset.

Application: Ensures data quality and improves the accuracy of mining results.
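
Illustration: a minimal sketch of the example above, removing duplicate rows and filling missing values. The customer rows are hypothetical, and filling with the mean is only one possible choice; the median or a domain-specific default may suit other data better.

```python
from statistics import mean

# Hypothetical customer rows: (customer_id, age); None marks a missing age.
customers = [(1, 34), (2, None), (3, 28), (1, 34), (4, 40)]

def preprocess(rows):
    # Remove exact duplicates while keeping the first occurrence of each row.
    unique = list(dict.fromkeys(rows))
    # Fill missing ages with the mean of the observed ages.
    known = [age for _, age in unique if age is not None]
    fill = mean(known)
    return [(cid, age if age is not None else fill) for cid, age in unique]

cleaned = preprocess(customers)
print(cleaned)
```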

Outlier Detection

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Identifying data points that deviate significantly from the rest of the dataset.

Example: Detecting fraudulent transactions in banking.

Application: Improves the quality of data analysis by flagging anomalies, which can then be
removed as noise or investigated as events of interest (as in fraud detection).
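
Illustration: one common way to flag outliers is the 1.5 × IQR rule, which marks values far outside the middle half of the data. The notes do not name a specific method, so this choice is an assumption, and the transaction amounts are hypothetical. `statistics.quantiles` needs Python 3.8 or later.

```python
from statistics import quantiles

# Hypothetical transaction amounts; one value is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 500, 22]

def iqr_outliers(values, k=1.5):
    # Quartiles split the sorted data into four parts; IQR = Q3 - Q1.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Anything outside [lower, upper] is flagged for review.
    return [v for v in values if v < lower or v > upper]

outliers = iqr_outliers(amounts)
print(outliers)
```

A flagged value is a candidate for review, not automatically an error: in a fraud setting the flagged transaction is exactly what the analyst is looking for.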

Data Integration

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Combining data from different sources into a coherent dataset.

Example: Merging customer information from different departments (sales, support).

Application: Provides a unified view of data, essential for comprehensive analysis.
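
Illustration: a minimal sketch of merging two departmental extracts on a shared key. The `customer_id` keys and field names are hypothetical, and the merge keeps every customer seen in either source (an outer join); other join rules are equally valid.

```python
# Hypothetical extracts from two departments, keyed by customer_id.
sales = {
    101: {"name": "Asha", "total_spent": 250.0},
    102: {"name": "Ben", "total_spent": 90.0},
}
support = {
    101: {"open_tickets": 2},
    103: {"open_tickets": 1},
}

def integrate(*sources):
    # Combine all fields known about each key across every source.
    merged = {}
    for source in sources:
        for key, fields in source.items():
            merged.setdefault(key, {}).update(fields)
    return merged

unified = integrate(sales, support)
print(unified[101])
```

Customer 103 appears only in the support data, so the unified record has no name; real integration work also has to resolve such gaps and conflicting values.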

Data Transformation

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Converting data into a suitable format for analysis.

Example: Normalizing data values to a common scale.

Application: Facilitates easier and more accurate data analysis.
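
Illustration: min-max normalization, one way to bring values to a common scale, maps each value x to (x - min) / (max - min), so the result lies between 0 and 1. The income figures are hypothetical.

```python
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 35000, 50000]
normalized = min_max_normalize(incomes)
print(normalized)  # [0.0, 0.5, 1.0]
```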

Data Reduction

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Reducing the volume of data while preserving the information needed for analysis.

Example: Using sampling or aggregation to reduce the size of a dataset.

Application: Makes analysis more efficient by reducing computational requirements.
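
Illustration: a minimal sketch of the two techniques named in the example, random sampling and aggregation. The daily sales rows are hypothetical, and the fixed seed is only there to make the sample repeatable.

```python
import random
from collections import defaultdict

# Hypothetical daily sales rows: (month, amount).
daily_sales = [("Jan", 100), ("Jan", 150), ("Feb", 200), ("Feb", 50), ("Mar", 300)]

def sample_rows(rows, k, seed=0):
    # Sampling: keep k rows chosen at random without replacement.
    return random.Random(seed).sample(rows, k)

def aggregate_by_month(rows):
    # Aggregation: replace many daily rows with one total per month.
    totals = defaultdict(int)
    for month, amount in rows:
        totals[month] += amount
    return dict(totals)

sampled = sample_rows(daily_sales, 2)
monthly = aggregate_by_month(daily_sales)
print(sampled, monthly)
```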

Data Generation

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Creating new data based on existing data.

Example: Generating synthetic data to test a machine learning model.

Application: Useful when real data is scarce or sensitive.
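
Illustration: a minimal sketch that generates synthetic values matching the mean and standard deviation of a small real sample. It assumes the data is roughly normally distributed, which is a simplification; the ages are hypothetical.

```python
import random
from statistics import mean, stdev

# Hypothetical real sample that is too small (or too sensitive) to share.
real_ages = [23, 35, 31, 44, 29, 38]

def generate_synthetic(values, n, seed=0):
    # Fit a normal distribution to the sample, then draw n new values from it.
    rng = random.Random(seed)
    mu, sigma = mean(values), stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

synthetic_ages = generate_synthetic(real_ages, 1000)
print(round(mean(real_ages), 1), round(mean(synthetic_ages), 1))
```

The synthetic values resemble the original sample statistically without repeating any individual record.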

Data Summarization

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Providing a compact representation of a dataset.

Example: Creating summary statistics like mean, median, and standard deviation.

Application: Helps in quickly understanding the main characteristics of the data.
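
Illustration: the summary statistics named in the example can be computed with Python's standard `statistics` module. The order values are hypothetical.

```python
from statistics import mean, median, stdev

def summarize(values):
    # A compact description of the dataset in a handful of numbers.
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }

order_values = [12, 15, 11, 18, 14]
summary = summarize(order_values)
print(summary)
```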

Data Presentation

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Visualizing data and results in an understandable manner.


Example: Using charts and graphs to present sales data trends.

Application: Facilitates decision-making by providing clear insights.
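
Illustration: a minimal text-only bar chart, as a stand-in for a plotting library. The monthly sales figures are hypothetical, and the sketch assumes all values are positive.

```python
def text_bar_chart(series, width=20):
    # Scale each bar so the largest value spans the full width.
    peak = max(series.values())
    lines = []
    for label, value in series.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<4}{bar} {value}")
    return "\n".join(lines)

monthly_sales = {"Jan": 250, "Feb": 125, "Mar": 500}
chart = text_bar_chart(monthly_sales)
print(chart)
```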

Data Mining Functionalities

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Various tasks that data mining can perform, such as classification,
clustering, association rule mining, etc.

Example: Classifying customer reviews as positive or negative.

Application: Enables solving different types of problems using data.
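
Illustration: association rule mining, one of the functionalities listed above, usually scores a rule "A -> B" by its support (the share of transactions containing both A and B) and its confidence (the share of transactions containing A that also contain B). The baskets below are hypothetical.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    # Fraction of baskets that contain every item in the itemset.
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    # Of the baskets containing the antecedent, the fraction that also
    # contain the consequent.
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, round(rule_confidence, 3))
```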

Classification and Architecture of Data Mining Systems

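𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Data mining systems can be classified by the kinds of databases mined, the
kinds of knowledge discovered, the techniques used, and the applications targeted. The
architecture is the framework and structure within which data mining operations are
carried out.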

Example: A data mining system with components for data preprocessing, mining
algorithms, and result evaluation.

Application: Helps in organizing and managing the data mining process efficiently.

Data Mining Query Language


𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: A specialized language to interact with data mining systems to perform
various tasks.

Example: SQL-like queries to perform data mining operations.

Application: Allows users to specify what they want to analyze and how.

Data Mining Task Primitives

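𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: The basic elements used to specify a data mining task: the task-relevant
data, the kind of knowledge to be mined, background knowledge, interestingness measures,
and the way discovered patterns are presented.

Example: Specifying which data to select, how it should be transformed, and which kind of
pattern to discover.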

Application: Forms the building blocks of data mining processes.

Integration of a Data Mining System with a Data Warehouse

𝘿𝙀𝙁𝙄𝙉𝙄𝙏𝙄𝙊𝙉: Combining data mining tools with a data warehouse to enhance data
analysis.

Example: Using data mining techniques on a data warehouse to discover trends in
historical sales data.

Application: Provides a comprehensive approach to data analysis by leveraging the storage
and processing capabilities of data warehouses.

Diagrams

The following text diagrams summarize the main ideas:

1. KDD Process Diagram:

```

Data Selection -> Data Cleaning -> Data Transformation -> Data Mining ->
Interpretation/Evaluation

```

2. Data Preprocessing Steps:

```

Raw Data -> Cleaning -> Integration -> Transformation -> Reduction -> Prepared Data

```

3. Data Mining Functionalities:

```
Data Mining
|-- Classification
|-- Clustering
|-- Association Rule Mining
```