Data Warehousing & Data Mining


 What is a data warehouse (a central storage place for organized data)?


A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant (SINT)
collection of data in support of management's decision-making.
In simple terms, it is like a central storage place for organized data: a specialized database
where you collect and store information from various sources, such as your company's sales,
customer data, and more. This data is stored in a way that makes it easy to analyze and to
base business decisions on. It's like having a well-organized library of data that helps you
gain insights and make informed choices.
 Subject Oriented: Data warehouses are designed to focus on specific subjects,
such as sales, customers, or products, to support meaningful analysis in a
particular area of business.
 Integrated: Data from various sources are integrated into a single repository,
often through ETL (Extract, Transform, Load) processes, ensuring consistency
and uniformity of data for analysis.
 Non-Volatile: Data in a data warehouse is not typically changed or updated
frequently. It is considered stable and is mainly used for historical analysis.
 Time Variant: Data warehouses store historical data over time, allowing users to
track changes and trends over different time periods.
o Allows the analysis of Past
o Relates information to the Present.
o Enables forecast to Future.
 Data Granularity: In a data warehouse, it is efficient to keep data summarized at
different levels of detail (granularity).
Data Warehouse – Subject oriented

[Diagram: operational applications (Saving Account, Loans Account, Current Account) are integrated into the data warehouse under a single subject: Account.]
Data Warehousing vs Data Mining:

Data Warehousing:
1. Purpose:
 Storage and Retrieval: Data warehousing primarily focuses on the efficient storage and
retrieval of large volumes of structured data.
 Decision Support: It is designed to support business intelligence and decision-making
processes by providing a centralized repository of integrated data.
2. Structure:
 Structured Data: Data warehouses store structured data, often organized in the form of star
schemas or snowflake schemas. These schemas facilitate efficient querying and reporting.
3. Process:
 ETL Processes: Extract, Transform, Load (ETL) processes are commonly used to extract data
from various sources, transform it into a consistent format, and load it into the data
warehouse.
4. User Interaction:
 Reporting and Analysis: Users interact with data warehouses through reporting and analysis
tools to gain insights into historical trends, make informed decisions, and generate business
reports.
Data Mining:
1. Purpose:
 Pattern Discovery: Data mining focuses on discovering patterns, relationships, and trends
within large datasets that may not be immediately apparent.
 Predictive Modeling: It involves building models that can predict future trends or behaviors
based on historical data.
2. Structure:
 Structured and Unstructured Data: Data mining can work with both structured and
unstructured data. It explores data to uncover hidden patterns.
3. Process:
 Exploratory Data Analysis: Data mining involves exploratory data analysis techniques,
statistical analysis, and machine learning algorithms to discover patterns and insights.
4. User Interaction:
 Model Building: Data mining involves building models using algorithms. Users, often data
scientists or analysts, interact with the data mining process to develop and refine these
models.
Relationship:
 Complementary: While data warehousing provides a platform for storing and managing data, data
mining extracts valuable information and knowledge from that data. Data mining can be applied to
data stored in a data warehouse.
In summary, data warehousing is about storing and managing data for efficient retrieval and
analysis, while data mining is about discovering patterns and knowledge from large datasets. Together, they
contribute to the process of turning raw data into actionable insights for decision-making.
Data Warehouse Components | 3-Layer Architecture:

A data warehouse typically follows a three-layer architecture, which is designed to efficiently store and
manage data for analysis and reporting. Here are the three main components of this architecture:
1. Data Source Layer:
 This is the first layer in the architecture and is responsible for extracting data from various
source systems. Source systems can include databases, operational systems, external data
sources, and more.
 Data extraction tools or processes are used to collect and transform data from these sources
into a format suitable for storage in the data warehouse.
 In this layer, data can be in different formats and structures, and it may need to undergo
significant transformations to align with the data warehouse's structure.
2. Data Warehouse Layer:
 The data warehouse layer is the central component of the architecture, where data is stored
and organized for efficient querying and analysis.
 Key components within the data warehouse layer include:
 Data Warehouse Database: This is where the transformed and integrated data from
the data source layer is stored. It's typically designed using a dimensional model,
such as a star or snowflake schema, to enable easy retrieval and analysis.
 Data Integration: ETL (Extract, Transform, Load) processes are used to integrate and
consolidate data from various sources into the data warehouse. Data quality and
consistency are maintained during this process.
 Metadata Repository: Metadata, which includes information about the data's
source, transformation, and meaning, is stored in a metadata repository. This is
crucial for understanding the data and for data lineage.
3. Data Presentation Layer:
 The data presentation layer is where end-users interact with the data warehouse to access
and analyze data. It's often designed for easy reporting and analysis.
 Components within the data presentation layer include:
 Reporting Tools: These tools allow users to create, run, and schedule reports based
on the data stored in the data warehouse.
 Query and Analysis Tools: Users can write ad-hoc queries or perform data analysis
using tools designed for this purpose.
 Data Mining and Business Intelligence Tools: For advanced analytics, data mining,
and business intelligence, additional tools may be integrated with the data
presentation layer.
In summary, the three-layer architecture of a data warehouse includes the data source layer for data
extraction and transformation, the data warehouse layer for data storage and integration, and the data
presentation layer for user interaction and reporting. This architecture is designed to make data readily
accessible and understandable for business intelligence and decision-making purposes.

2, 3, 4 Tier Architecture Models


2, 3, and 4-tier architecture models are used in software design and development to structure and separate
the different components and functions of an application. Here's an overview of each of these models:
2-Tier Architecture:
In a 2-tier architecture, the application is divided into two main components:
1. Client Tier: This is the user interface or client-side component, where the user interacts with the
application. It is responsible for presenting data to the user and sending user requests to the server.
2. Server Tier: The server-side component processes the user's requests received from the client and
interacts with the database to retrieve or update data. It handles the business logic and data
management.
This architecture is relatively simple and is suitable for small applications. However, it can lead to scalability
and maintainability issues as the application grows.
3-Tier Architecture:
A 3-tier architecture adds an additional layer to the 2-tier model, separating the application into three main
components:
1. Presentation Tier (Client): This tier is responsible for the user interface and user experience. It
handles user interactions and displays information to the user.
2. Application Tier (Middle Tier): The application tier contains the business logic and serves as an
intermediary between the presentation tier and the data tier. It processes user requests,
communicates with the database, and implements the application's core functionality.
3. Data Tier (Server): The data tier is responsible for data storage and retrieval. It manages the
database system, where application data is stored and accessed.
A 3-tier architecture offers better scalability, as changes in one tier have fewer ripple effects on the other
tiers. It also promotes reusability of code and is suitable for larger, more complex applications.
4-Tier Architecture:
A 4-tier architecture extends the 3-tier model by adding an additional layer:
1. Presentation Tier (Client): This tier handles the user interface and user interactions.
2. Application Tier (Business Logic): This layer contains the core business logic and application
functionality.
3. Data Tier (Data Storage): It manages data storage and retrieval in the database.
4. Services Tier (Application Services): The services tier is responsible for providing various services,
such as security, authentication, and external integrations. It acts as an intermediary layer between
the application tier and external services or APIs.
A 4-tier architecture enhances modularity and allows for more extensive separation of concerns. It is often
used in large and complex enterprise applications that require a high degree of flexibility and scalability.

Data Warehouse Need, Goals, Advantages, Benefits, and Problems in Implementation


KDD (Knowledge Discovery in Databases):
1. Data Selection: Choose the relevant data from various sources.
2. Data Pre-processing: Clean the data, handle missing values, and transform it into a suitable format.
3. Data Reduction: Reduce the volume but produce the same analytical results.
4. Transformation and Encoding: Convert data into appropriate forms for mining. This might involve
encoding categorical variables or scaling numerical ones.
5. Data Mining: Apply various algorithms to identify patterns or relationships in the data.
6. Pattern Evaluation: Assess the mined patterns' significance and usefulness.
7. Knowledge Representation: Present the discovered knowledge in a form that is understandable and
usable.
8. Interpretation and Evaluation: Evaluate the results and interpret them in the context of the
problem.
9. Deployment: Integrate the discovered knowledge into the business processes.
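The KDD steps above can be sketched end to end in plain Python. This is a minimal, hypothetical example: the toy records, the "product A" selection rule, and the trivial increasing-trend "pattern" are all illustrative stand-ins for real data and real mining algorithms.

```python
# 1. Data Selection: pick the relevant records (sales of product "A" only).
raw = [
    {"product": "A", "month": 1, "units": "10"},
    {"product": "A", "month": 2, "units": None},   # missing value
    {"product": "A", "month": 3, "units": "30"},
    {"product": "B", "month": 1, "units": "5"},
]
selected = [r for r in raw if r["product"] == "A"]

# 2. Pre-processing: drop rows with missing values and convert types.
clean = [{**r, "units": int(r["units"])} for r in selected if r["units"] is not None]

# 3./4. Reduction and transformation: keep only the fields needed for mining.
series = [(r["month"], r["units"]) for r in clean]

# 5. Data Mining: a deliberately trivial "pattern" -- is the trend increasing?
increasing = all(b[1] > a[1] for a, b in zip(series, series[1:]))

# 6.-8. Evaluation, representation, interpretation: state the finding usably.
knowledge = "units of product A rise month over month" if increasing else "no clear trend"
print(knowledge)
```

A real pipeline would replace step 5 with an actual algorithm (clustering, association rules, a classifier), but the surrounding selection, cleaning, and evaluation steps keep the same shape.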

DBMS vs Data Warehouse:
1. Purpose:
 DBMS: Primarily designed for transactional processing. It's used for day-to-day operations, handling
a large number of small, individual transactions.
 Data Warehouse: Designed for analytical processing. It's used for reporting, analysis, and decision-
making.
2. Data Structure:
 DBMS: Normalized data structure to minimize redundancy and ensure data consistency.
 Data Warehouse: Often uses a denormalized structure optimized for query and analysis
performance.
3. Data Scope:
 DBMS: Focuses on current, operational data.
 Data Warehouse: Integrates and stores historical data from various sources.
4. Query and Reporting:
 DBMS: Optimized for simple queries and transactional processing.
 Data Warehouse: Optimized for complex queries involving aggregations and analysis of large
datasets.
5. Performance:
 DBMS: Prioritizes quick transaction processing.
 Data Warehouse: Prioritizes fast query performance and reporting.
6. Usage:
 DBMS: Used for real-time, transactional applications like order processing, inventory management,
etc.
 Data Warehouse: Used for business intelligence, decision support, and analytical applications.
7. Data Integration:
 DBMS: Typically deals with data from a specific application or domain.
 Data Warehouse: Integrates data from multiple sources, providing a comprehensive view.
8. Schema:
 DBMS: Usually has a single schema designed for a specific application.
 Data Warehouse: Often has a star or snowflake schema to facilitate efficient querying.

OLTP vs OLAP:
OLTP stands for Online Transaction Processing. It's a type of system that manages and supports transaction-
oriented applications. In OLTP systems, the emphasis is on quick and reliable transaction processing. These
systems are designed to handle a large number of short online transactions, such as inserting, updating,
and deleting records in a database.
On the other hand, OLAP stands for Online Analytical Processing. OLAP systems are designed for complex
queries and data analysis. They are used for business intelligence and reporting purposes. OLAP databases
are optimized for read-intensive operations and provide a multidimensional view of the data, allowing
users to analyze and explore it from different perspectives.
In summary, OLTP focuses on transaction processing, ensuring that database transactions are processed
efficiently, while OLAP is geared towards analytical processing, allowing users to analyze and gain insights
from large volumes of data. Both concepts are crucial in the realm of databases and can be relevant to
discussions in a technical interview.
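The OLTP/OLAP contrast can be sketched with a single in-memory SQLite table: short single-row write transactions on one side, one read-heavy aggregation on the other. This is only an illustration; real OLTP and OLAP workloads run on separate, differently tuned systems, and the `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP style: many short transactions, each touching individual rows.
with conn:
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("East", 120.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("East", 80.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("West", 200.0))

# OLAP style: one read-intensive query aggregating across the whole dataset.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 200.0), ('West', 200.0)]
```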

ETL:

ETL stands for Extract, Transform, Load. It's a process used in data integration and warehousing.
Here's a breakdown:
1. Extract: Data is collected and pulled from various sources. This could be databases, applications, or
different systems.
2. Transform: The extracted data might not be in the right format or structure. In this step, the data is
cleaned, transformed, and converted into a suitable format. It could involve filtering out
unnecessary information, changing data types, or combining data from different sources.
3. Load: The transformed data is then loaded into a target database or data warehouse. This is where
it's organized and made ready for analysis.
Imagine you're making a smoothie. You extract the fruits (Extract), blend them together with maybe some
yogurt or other ingredients (Transform), and then pour the smoothie into a glass (Load). ETL is a similar
concept but for data, making sure it's ready and tasty for analysis!
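The three ETL steps can be sketched in plain Python. The two "sources", their field names, and the list standing in for the warehouse are all hypothetical; a real pipeline would read from databases or APIs and load into an actual warehouse.

```python
# Extract: pull raw records from two imaginary source systems.
source_a = [{"name": "alice", "spend": "120.50"}]
source_b = [{"customer": "BOB", "total": "80"}]

# Transform: unify field names, fix casing, convert strings to numbers.
def transform(rec):
    name = rec.get("name") or rec.get("customer")
    amount = rec.get("spend") or rec.get("total")
    return {"customer": name.title(), "amount": float(amount)}

staged = [transform(r) for r in source_a + source_b]

# Load: write the cleaned, consistent rows into the target (a list here).
warehouse = []
warehouse.extend(staged)
print(warehouse)
```

Note how the transform step resolves the two sources' different schemas (`name`/`spend` vs `customer`/`total`) into one consistent format before loading, which is the heart of the "T" in ETL.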
Star Schema:

In the context of databases and data warehousing, a star schema is a type of schema where a central fact
table is connected to one or more dimension tables through foreign key relationships. Here's a breakdown
of its components:
1. Fact Table:
 Central table in the star schema.
 Holds the quantitative data (facts) to be analyzed.
 Examples include sales, revenue, or any measurable metric.
2. Dimension Tables:
 Surround the fact table like points of a star.
 Contain descriptive attributes to provide context to the facts.
 Examples include time, geography, product, or any other categorization.
3. Attributes:
 Individual fields in dimension tables.
 Provide detailed information about the dimension.
4. Foreign Key:
 Links the primary key of a dimension table to the foreign key in the fact table.
 Enables the connection between fact and dimension tables.
Advantages:
 Simplifies queries by breaking down data into smaller, manageable pieces.
 Enhances query performance as it involves simpler join operations.
 Provides a clear structure for organizing data.
 Flexibility
 Scalability
Use Cases:
 Commonly used in data warehousing for analytical purposes.
 Well-suited for scenarios where there's a central metric (fact) surrounded by various dimensions.
Example: Consider a sales database:
 Fact Table: Sales
 Dimension Tables: Time, Product, Location
 Attributes: Date, Product Name, City
So, if you want to analyze sales data over time or by location, the star schema makes it efficient to do so.

OR

1. Fact Table:
 The fact table contains the core data that you want to analyze. This is often numeric data like
sales revenue, quantity sold, or any other measurable metric.
 It typically has foreign keys that connect to the primary keys of dimension tables,
establishing relationships between the central data and the descriptive context.
2. Dimension Tables:
 Each dimension table represents a specific aspect or category related to the data in the fact
table.
 Attributes in dimension tables provide detailed information about the dimension. For
example, a "Time" dimension might include attributes like day, month, and year.
 Dimension tables are not usually directly related to each other in a star schema, making
queries simpler and more efficient.
3. Hierarchies:
 Dimension tables often have hierarchies. For instance, the "Time" dimension might have a
hierarchy from year to quarter to month to day.
 Hierarchies are useful for drilling down into data at different levels of granularity.
4. Snowflake Schema vs. Star Schema:
 In a snowflake schema, dimension tables are normalized, meaning the attributes are further
broken down into sub-dimensions. In contrast, a star schema keeps dimensions
denormalized for simplicity and faster querying.
5. Query Optimization:
 Star schemas are optimized for query performance, especially for analytical queries that
involve aggregations and grouping.
 Joins between the fact table and dimension tables are straightforward, leading to faster data
retrieval.
6. Data Integrity:
 Foreign key relationships ensure data integrity. They enforce that each fact in the fact table
corresponds to a valid combination of dimension values.
7. Scalability:
 Star schemas are scalable and well-suited for data warehousing environments where large
volumes of data need to be analyzed efficiently.

Snowflake Schema:
The Snowflake Schema is a type of database schema in which a central fact table is connected to multiple
dimensions as well as to sub-dimensions. It's an extension of the star schema, where dimension hierarchies
are normalized into separate related tables, resembling the shape of a snowflake.
Key characteristics of a Snowflake Schema:
1. Normalized Structure:
 Unlike the star schema, where dimensions are denormalized, the snowflake schema
normalizes dimensions into multiple related tables.
 This normalization reduces redundancy and improves data integrity.
2. Hierarchical Organization:
 Dimension tables in a snowflake schema often have a hierarchical organization, which means
they are organized into levels, and each level is stored in a separate table.
3. Multiple Levels of Relationships:
 Relationships between the central fact table and dimension tables can have multiple levels.
For example, a dimension may have a parent-child relationship within itself.
4. Advantages:
 Normalization can save storage space and reduce data redundancy.
 Enhances data integrity by eliminating duplicate data.
Example:
Consider a scenario where you have a sales fact table at the center. The associated dimensions include
"Time," "Product," and "Location." In a snowflake schema:
 The "Time" dimension may be normalized into tables like "Year," "Quarter," and "Month."
 The "Product" dimension might be normalized into tables such as "Category," "Subcategory," and
"Product."
 The "Location" dimension could be normalized into "Country," "State," and "City" tables.
Pros and Cons:
Pros:
 Improved data integrity.
 Reduction in data redundancy.
 Easier to maintain and update dimension tables.
Cons:
 Increased complexity in query performance due to multiple table joins.
 More intricate to design and understand compared to star schemas.
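The snowflake trade-off shows up directly in SQL: normalizing the "Product" dimension of the earlier example into Category, Subcategory, and Product tables means rolling revenue up to category now takes a chain of joins. The tables and rows below are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category    (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE subcategory (subcategory_id INTEGER PRIMARY KEY, subcategory_name TEXT,
                          category_id INTEGER REFERENCES category(category_id));
CREATE TABLE product     (product_id INTEGER PRIMARY KEY, product_name TEXT,
                          subcategory_id INTEGER REFERENCES subcategory(subcategory_id));
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES product(product_id), revenue REAL);

INSERT INTO category VALUES (1, 'Electronics');
INSERT INTO subcategory VALUES (1, 'Phones', 1);
INSERT INTO product VALUES (1, 'Smartphone A', 1);
INSERT INTO fact_sales VALUES (1, 500.0), (1, 300.0);
""")

# Rolling up to category requires joining through every level of the
# normalized hierarchy -- the extra-join cost listed under Cons above.
rows = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN product p     ON f.product_id = p.product_id
    JOIN subcategory s ON p.subcategory_id = s.subcategory_id
    JOIN category c    ON s.category_id = c.category_id
    GROUP BY c.category_name
""").fetchall()
print(rows)  # [('Electronics', 800.0)]
```

In the star-schema version, the same roll-up would need only one join, because category would be just another attribute on a denormalized product dimension.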

Attribute Types
