
Module 1: Introduction to Data Warehousing

1. Data Warehousing Architecture.

Data warehousing architecture refers to the structure and organization of a


data warehouse, which is a centralized repository for storing and analyzing
large volumes of data from different sources. There are three common
approaches to constructing data warehouse layers:

1. Single-tier architecture: This approach aims to minimize data redundancy by storing data in a single layer. However, it is not commonly used in practice due to its limitations.

2. Two-tier architecture: In this approach, physically available sources and the data warehouse are separated. However, this architecture is not expandable and may face connectivity issues due to network limitations.

3. Three-tier data warehouse architecture: This is the most widely used architecture for data warehouses. It consists of three layers:

Bottom Tier: This layer contains the data warehouse database server, which is a relational database system. Back-end tools and utilities are used to extract, clean, load, and refresh data into this layer.

Middle Tier: In the middle tier, the OLAP (Online Analytical Processing)
Server is implemented either through Relational OLAP (ROLAP) or
Multidimensional OLAP (MOLAP). ROLAP maps multidimensional data
operations to standard relational operations, while MOLAP directly
implements multidimensional data and operations.

Top Tier: This layer serves as the front-end client layer, housing query tools,
reporting tools, analysis tools, and data mining tools. Users interact with the
data warehouse through this layer to retrieve and analyze data for
decision-making purposes.
2. Data Warehouse v/s Data Mart.

Data Warehouse | Data Mart

1. Enterprise-wide data | Department-wide data
2. Multiple subject areas | Single subject area
3. Multiple data sources | Limited data sources
4. Occupies large memory | Occupies limited memory
5. Longer time to implement | Shorter time to implement
6. Less flexible due to its comprehensive nature | More flexible, allowing agile changes
7. Supports strategic decision making | Supports tactical decision making
8. Used in trend analysis, business intelligence, etc. | Used in customer relationship management (CRM), decision making, etc.

3. Data Warehouse Schema

1. Star Schema
The star schema is a basic and widely used schema for organizing data in a
data warehouse or dimensional data marts. In this schema, there is one
central "fact" table surrounded by multiple "dimension" tables. The fact table
contains numerical data or measurements, while the dimension tables
provide context or details about the data. This schema is called "star"
because its structure resembles a star, with the fact table at the center and
the dimension tables at the edges. It is simple, efficient, and easy to
understand, making it a popular choice for building data warehouses and
handling common queries effectively.
In the given demonstration,
● SALES is a fact table having the attributes Product ID, Order ID, Customer ID, Employee ID, Total, Quantity, and Discount; the first four reference the dimension tables and the last three are measures.
● Employee dimension table contains the
attributes Emp ID, Emp Name, Title,
Department and Region
● Product dimension table contains the
attributes Product ID, Product Name,
Product Category, Unit Price
● Customer dimension table contains the
attributes Customer ID, Customer Name,
Address, City, Zip
● Time dimension table contains the
attributes Order ID, Order Date, Year,
Quarter, Month

Advantages of Star Schema: Simplifies queries, streamlines business reporting, and efficiently feeds OLAP cubes.

Disadvantages of Star Schema: Weak data integrity, limited analytical flexibility, and challenges with many-to-many relationships.
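To make the structure concrete, here is a minimal sketch of the SALES star schema using Python's built-in sqlite3 module. The table and column names follow the demonstration above, but the exact DDL and the sample query are illustrative assumptions, not part of the original example.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch
cur = conn.cursor()

# Dimension tables sit at the points of the star.
cur.execute("CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, ProductName TEXT, ProductCategory TEXT, UnitPrice REAL)")
cur.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT, Address TEXT, City TEXT, Zip TEXT)")
cur.execute("CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, EmpName TEXT, Title TEXT, Department TEXT, Region TEXT)")
cur.execute("CREATE TABLE TimeDim (OrderID INTEGER PRIMARY KEY, OrderDate TEXT, Year INTEGER, Quarter INTEGER, Month INTEGER)")  # the Time dimension

# The fact table sits at the centre: four foreign keys plus three measures.
cur.execute("""
CREATE TABLE Sales (
    ProductID  INTEGER REFERENCES Product(ProductID),
    OrderID    INTEGER REFERENCES TimeDim(OrderID),
    CustomerID INTEGER REFERENCES Customer(CustomerID),
    EmpID      INTEGER REFERENCES Employee(EmpID),
    Total      REAL,
    Quantity   INTEGER,
    Discount   REAL
)""")

# A typical star-join query: total sales per product category per quarter.
query = """
SELECT p.ProductCategory, t.Year, t.Quarter, SUM(s.Total) AS revenue
FROM Sales s
JOIN Product p ON p.ProductID = s.ProductID
JOIN TimeDim t ON t.OrderID   = s.OrderID
GROUP BY p.ProductCategory, t.Year, t.Quarter
"""
print(cur.execute(query).fetchall())

Because every dimension joins directly to the fact table, such reporting queries stay one level of joins deep, which is what makes the star schema easy for OLAP tools to consume.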

2. Snowflake Schema
The snowflake schema is like a cousin of the star schema.
In this schema, instead of each dimension being one big table, dimensions are split into multiple smaller, more organized tables.
Snowflaking happens when dimensions in a star schema get complex, with many layers of relationships and each child table having multiple parent tables.
This complexity only affects the dimension tables, not the fact tables.
● The Employee dimension table now contains the attributes
EmployeeID, EmployeeName, DepartmentID, Region, Territory
● The DepartmentID attribute links the Employee dimension table with the Department dimension table
● The Department dimension is used to provide detail about each
department, such as Name and Location of the department
● The Customer dimension table now contains the attributes
CustomerID, CustomerName, Address, CityID
● The CityID attribute links the Customer dimension table with the City dimension table
● The City dimension table has details about each city such as CityName,
Zipcode, State and Country

The main difference between the star schema and the snowflake schema is that the dimension tables of the snowflake schema are maintained in normalized form to reduce redundancy.

Advantages:
Snowflake schema ensures structured data, reducing issues with data
integrity.
It optimizes disk space usage by organizing data efficiently.
Disadvantages:
While snowflaking saves space in dimension tables, the overall impact on
the data warehouse size is often minimal.
Snowflaking should be avoided unless necessary, and hierarchies should
remain within dimension tables without being split.
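As a minimal, self-contained sketch of the normalization described above (the sample rows are invented for illustration), the snippet below splits the Customer dimension into Customer and City tables with sqlite3; reaching a city attribute now costs one extra join compared with the star schema.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# In a star schema, Customer would hold City and Zip directly.
# Snowflaking normalizes those attributes out into a separate City table.
cur.execute("CREATE TABLE City (CityID INTEGER PRIMARY KEY, CityName TEXT, Zipcode TEXT, State TEXT, Country TEXT)")
cur.execute("""
CREATE TABLE Customer (
    CustomerID   INTEGER PRIMARY KEY,
    CustomerName TEXT,
    Address      TEXT,
    CityID       INTEGER REFERENCES City(CityID)
)""")

cur.execute("INSERT INTO City VALUES (1, 'Mumbai', '400001', 'Maharashtra', 'India')")
cur.execute("INSERT INTO Customer VALUES (10, 'Asha', '12 Hill Rd', 1)")

# Reaching a city attribute requires an extra hop through the City table.
rows = cur.execute("""
SELECT c.CustomerName, ci.CityName, ci.Country
FROM Customer c
JOIN City ci ON ci.CityID = c.CityID
""").fetchall()
print(rows)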

3. Factless Fact Table:


Factless fact tables, though lacking measures, serve to represent dimension
intersections, providing flexibility in data warehouse design and enabling
the depiction of many-to-many relationships. Examples of their utility
include tracking student attendance, product promotion events, or
insurance-related accidents, with dimensions for students, time, and classes.

Think about a record of student attendance in classes. In this case, the fact table would consist of three dimensions: the student dimension, the time dimension, and the class dimension, with each row simply recording one combination of student, class, and date.

For example, one can easily answer the following questions with this factless fact table:
• How many students attended a particular class on a particular day?
• How many classes on average does a student attend on a given day?
Without using a factless fact table, we would need two separate fact tables to answer the above two questions.
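A minimal sketch of such an attendance table in Python with sqlite3 is shown below; the student, class, and date values are invented, and the point is that each row holds only dimension keys with no measure column.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Factless fact table: only dimension keys, no measure columns.
cur.execute("CREATE TABLE Attendance (StudentID INTEGER, ClassID INTEGER, DateKey TEXT)")
cur.executemany(
    "INSERT INTO Attendance VALUES (?, ?, ?)",
    [(1, 101, '2024-01-10'), (2, 101, '2024-01-10'), (1, 102, '2024-01-10')],
)

# Q1: how many students attended class 101 on 2024-01-10?
print(cur.execute(
    "SELECT COUNT(*) FROM Attendance WHERE ClassID = 101 AND DateKey = '2024-01-10'"
).fetchone()[0])

# Q2: how many classes on average does a student attend on that day?
print(cur.execute("""
SELECT AVG(n) FROM (
    SELECT COUNT(*) AS n FROM Attendance
    WHERE DateKey = '2024-01-10'
    GROUP BY StudentID
)
""").fetchone()[0])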

Types of Factless Fact Tables:


Factless fact tables for events record occurrences or events without
associated measures, useful for tracking activities.
In dimensional data warehouses, such tables are common, especially when
there are events to be tracked but no corresponding measurements.
On the other hand, factless fact tables for conditions are used to represent
relationships between dimensions when no clear transactions are involved.
They help in analyzing aspects of a business where negative conditions or
events are significant, like a bookstore not selling any books for a certain
period.
4. Fact Constellation:
A Fact Constellation is a schema used to represent multidimensional
models, consisting of multiple fact tables that share some common
dimension tables. It can be thought of as a collection of several star
schemas, which is why it's also called a Galaxy schema. Fact Constellation
schemas are commonly used in complex Data Warehouse designs and are
more intricate compared to star and snowflake schemas. They are necessary
for handling complex systems effectively.

In the above demonstration:
● Placement is a fact table having attributes (Stud_roll, Company_id, TPO_id) with facts (Number of students eligible, Number of students placed).
● Workshop is a fact table having attributes (Stud_roll, Institute_id, TPO_id) with facts (Number of students selected, Number of students who attended the workshop).
● Company is a dimension table having attributes (Company_id, Name, Offer_package).
● Student is a dimension table having attributes (Student_roll, Name, CGPA).
● TPO is a dimension table having attributes (TPO_id, Name, Age).
● Training Institute is a dimension table having attributes (Institute_id, Name, Full_course_fee).
So, there are two fact tables, Placement and Workshop, belonging to two different star schemas: one with fact table Placement and dimension tables Company, Student, and TPO, and another with fact table Workshop and dimension tables Training Institute, Student, and TPO.
Both star schemas have two dimension tables in common, hence forming a fact constellation or galaxy schema.

Advantage: Provides a flexible schema.

Disadvantage: It is much more complex and hence hard to implement and maintain.
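As a rough sketch using the attribute names listed above (the DDL itself is an assumption for illustration), the galaxy structure amounts to two fact tables whose foreign keys point at the same Student and TPO dimension tables:

import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Dimension tables; Student and TPO are shared by both stars.
cur.execute("CREATE TABLE Student (Student_roll INTEGER PRIMARY KEY, Name TEXT, CGPA REAL)")
cur.execute("CREATE TABLE TPO (TPO_id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER)")
cur.execute("CREATE TABLE Company (Company_id INTEGER PRIMARY KEY, Name TEXT, Offer_package REAL)")
cur.execute("CREATE TABLE TrainingInstitute (Institute_id INTEGER PRIMARY KEY, Name TEXT, Full_course_fee REAL)")

# Two fact tables, each the centre of its own star, both reusing Student and TPO.
cur.execute("""
CREATE TABLE Placement (
    Stud_roll  INTEGER REFERENCES Student(Student_roll),
    Company_id INTEGER REFERENCES Company(Company_id),
    TPO_id     INTEGER REFERENCES TPO(TPO_id),
    students_eligible INTEGER,
    students_placed   INTEGER
)""")
cur.execute("""
CREATE TABLE Workshop (
    Stud_roll    INTEGER REFERENCES Student(Student_roll),
    Institute_id INTEGER REFERENCES TrainingInstitute(Institute_id),
    TPO_id       INTEGER REFERENCES TPO(TPO_id),
    students_selected INTEGER,
    students_attended INTEGER
)""")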

4. Data Warehouse design approaches.

Data warehouse design approaches are pivotal in constructing efficient data warehouses, potentially saving considerable time and project costs. There are two primary methods: the top-down approach and the bottom-up approach.

In the top-down approach, championed by Bill Inmon, the data warehouse is initially designed, followed by the creation of data marts. This method involves extracting data from diverse source systems using ETL tools, validating it, and then loading it into the data warehouse. Various aggregation and summarization techniques are applied within the data warehouse. Subsequently, data marts extract this aggregated data and further transform it to align with their specific needs.
Contrarily, Ralph Kimball's bottom-up approach, known as
dimensional modeling or the Kimball methodology,
focuses on creating data marts first to address particular
business processes' reporting and analytics requirements.
These data marts are then used as the building blocks to
construct an enterprise data warehouse, allowing for a
more targeted and iterative approach to data warehouse
design.

Selecting the appropriate approach hinges on the project's specifics, as each approach comes with its own set of advantages and considerations concerning implementation, scalability, and adaptability to changing business needs.

5. What is Dimensional Modeling?

Dimensional modeling is a technique used in data warehousing to structure data in a way that is optimized for analysis and reporting. It was developed by Ralph Kimball and consists of fact and dimension tables.

In a dimensional model, data is organized to facilitate the analysis of numeric information like sales numbers, counts, or weights. Unlike relational models, which are focused on real-time transaction processing, dimensional models prioritize ease of retrieving information and generating reports.

The key elements of a dimensional data model include:

1. Facts: These are the measurements or metrics from a business process. For example, in a sales process, a fact could be quarterly sales numbers.

2. Dimensions: Dimensions provide context to the facts by describing the who, what, and where of a business event. For instance, in the sales process, dimensions could include customer names, product names, and locations.
3. Attributes: Attributes are the characteristics of dimensions that help filter,
search, or classify facts. For example, attributes of a location dimension
could include state, country, or zipcode.

Dimensional modeling helps in organizing data for efficient analysis and reporting in data warehouse systems, making it easier to understand and interpret business metrics.
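As a small illustration of facts, dimensions, and attributes working together, the pandas sketch below (with invented sample data and column names) slices a sales fact by dimension attributes; it assumes pandas is available and is not tied to any particular warehouse product.

import pandas as pd

# Fact table: one row per sale, with a numeric measure (amount).
sales = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "location_id": [10, 20, 10, 20],
    "amount": [250.0, 100.0, 75.0, 300.0],
})

# Dimension tables: descriptive attributes used to filter and group the facts.
products = pd.DataFrame({"product_id": [1, 2, 3],
                         "product_name": ["Pen", "Book", "Bag"],
                         "category": ["Stationery", "Stationery", "Luggage"]})
locations = pd.DataFrame({"location_id": [10, 20],
                          "state": ["MH", "KA"],
                          "country": ["India", "India"]})

# Join facts to dimensions, then aggregate by dimension attributes.
report = (sales.merge(products, on="product_id")
               .merge(locations, on="location_id")
               .groupby(["category", "state"])["amount"].sum())
print(report)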
6. OLTP v/s OLAP

OLTP | OLAP

1. Manages real-time transactional data | Analyzes historical data for decision-making
2. Optimized for fast and efficient transaction processing | Used for complex queries and data analysis
3. Response time is in milliseconds | Response time is in seconds to minutes
4. Uses a traditional DBMS | Uses a data warehouse
5. It is a customer-oriented process | It is a market-oriented process
6. Queries used are standardized and simple | Queries used involve aggregations
7. Allows read/write operations | Only read and rarely write operations
8. Online banking, e-commerce, order processing systems | Business intelligence, data warehousing systems
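The contrast shows up directly in the kind of SQL each workload issues. The sketch below pairs a typical OLTP point update with a typical OLAP aggregation; the table and rows are invented and sqlite3 is used purely for illustration.

import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                [(1, 'A', 'West', 120.0), (2, 'B', 'East', 80.0), (3, 'A', 'West', 40.0)])

# OLTP style: touch a single record, read/write, millisecond expectations.
cur.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 1")

# OLAP style: scan history and aggregate for decision-making, read-only.
print(cur.execute("SELECT region, SUM(amount), AVG(amount) FROM orders GROUP BY region").fetchall())

The OLTP statement touches one row by primary key, while the OLAP query scans and aggregates the whole table, which is why the two workloads are kept on separate systems.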

7. Steps in ETL Process.

The ETL (Extract, Transform, Load) process involves several steps to effectively manage and organize data for storage in a Data Warehouse:

1. Extraction: The first step in the ETL process is extraction, where data is
gathered from various source systems. This data can be in different formats
such as relational databases, NoSQL databases, XML files, or flat files. The
extracted data is then stored in a staging area before being loaded into the
data warehouse. This staging area serves as an intermediate step to ensure
that the data is in a consistent format and not corrupted before entering the
data warehouse.
2. Transformation: In the transformation step, rules or functions are applied
to the extracted data to convert it into a standardized format. This may
involve various processes such as filtering to include only certain attributes,
cleaning to replace null values with default values, joining to combine
multiple attributes into one, splitting to divide a single attribute into
multiple attributes, and sorting tuples based on specific criteria.

3. Loading: The final step of the ETL process is loading, where the
transformed data is inserted into the data warehouse. The loading process
can occur at different frequencies, either updating the data warehouse
frequently or at regular intervals, depending on system requirements. The
rate and timing of loading are determined by the specific needs of the
system.

Additionally, the ETL process can utilize the pipelining concept, where data
flows through a series of interconnected stages or tasks. Pipelining helps
streamline the ETL process by enabling a continuous flow of data from
extraction to transformation to loading, improving efficiency and reducing
latency.
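A compact way to see the three steps together is a pandas-based sketch like the one below; the source file name, column names, and cleaning rules are assumptions made for illustration and do not describe any specific ETL tool.

import sqlite3
import pandas as pd

# 1. Extract: pull raw data from a source system into a staging DataFrame.
staged = pd.read_csv("sales_source.csv")          # hypothetical source file

# 2. Transform: filter, clean, derive, and sort into a standardized format.
staged = staged[["order_id", "customer", "qty", "unit_price"]]   # filtering attributes
staged["qty"] = staged["qty"].fillna(0)                          # cleaning null values
staged["total"] = staged["qty"] * staged["unit_price"]           # deriving a measure
staged = staged.sort_values("order_id")                          # sorting tuples

# 3. Load: insert the transformed rows into the warehouse table.
warehouse = sqlite3.connect("warehouse.db")
staged.to_sql("fact_sales", warehouse, if_exists="append", index=False)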
8. What is Loading?

Loading is when we move information from one place to another in a computer system. It's like transferring data from where it's stored to where we want to keep it.

There are different types of loading:

Initial Load: This is when we move all the data for the first time, like when you move into a new house and bring all your belongings with you.

Incremental Load: Periodically, we apply only the data that is new or has changed since the last load. After putting data into the database, we make sure that everything is organized properly and that each piece of information is connected to the right categories or groups, like making sure each item in a store is in the correct section on the shelves.

Full Refresh: This is when we clear out a space completely and put in new data, like emptying a shelf and putting all new books on it.
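In practice, the three load types differ mainly in how much of the target table they touch. A rough pandas/sqlite3 sketch with hypothetical table and column names might look like this:

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")   # hypothetical warehouse database

def initial_load(df: pd.DataFrame) -> None:
    # First-ever load: bring everything across.
    df.to_sql("customers", conn, if_exists="replace", index=False)

def incremental_load(df: pd.DataFrame, last_load_ts: str) -> None:
    # Periodic load: apply only rows added or changed since the last run.
    delta = df[df["updated_at"] > last_load_ts]
    delta.to_sql("customers", conn, if_exists="append", index=False)

def full_refresh(df: pd.DataFrame) -> None:
    # Wipe the table and reload it from scratch.
    conn.execute("DELETE FROM customers")
    df.to_sql("customers", conn, if_exists="append", index=False)
    conn.commit()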
Module 3: Data Mining and Data Preprocessing

Q1. Knowledge Discovery in Database process (KDD)

Knowledge Discovery in Databases (KDD) is a systematic process aimed at uncovering valuable insights from large datasets. It begins with learning about the application domain, understanding relevant prior knowledge, and defining the goals of the analysis. Next, a target dataset is created through careful data selection, ensuring it aligns with the objectives of the analysis.

Data cleaning and preprocessing follow, constituting a significant portion of the process, where data is refined to address issues like missing values, duplicates, and inconsistencies. This step ensures the data is in a suitable format for analysis. Subsequently, data reduction and transformation techniques are applied to find useful features, reduce dimensionality, and create an invariant representation of the dataset.

The next phase involves choosing appropriate data mining functions such as
summarization, classification, regression, association, or clustering, based on
the analysis goals. This is followed by selecting the mining algorithms that
best suit the chosen functions.
The core of KDD lies in data mining itself, where patterns of interest are
sought within the dataset. Once patterns are discovered, they undergo
evaluation to determine their significance and usefulness. This includes
visualization, transformation, and removing redundant patterns to present
the knowledge gained in a meaningful way.

Ultimately, the discovered knowledge is put to use, informing decision-making processes, driving business strategies, or addressing specific objectives within the application domain. Throughout the entire process, iteration and refinement are common, ensuring that the insights gained are actionable and valuable.
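To keep the sequence concrete, here is a heavily simplified sketch of the KDD stages using pandas; the data file, column names, and the choice of a plain group-by summarization as the "mining" step are all assumptions made for illustration.

import pandas as pd

# 1-2. Understand the domain and select the target data set (hypothetical file).
data = pd.read_csv("transactions.csv", usecols=["customer", "category", "amount"])

# 3. Clean and preprocess: drop duplicates and rows with missing values.
data = data.drop_duplicates().dropna()

# 4. Reduce / transform: keep a compact, analysis-ready representation.
data["amount"] = data["amount"].astype(float)

# 5-6. Choose a mining function and algorithm; here, simple summarization.
patterns = data.groupby(["customer", "category"])["amount"].agg(["count", "sum"])

# 7. Evaluate and present: keep only the patterns that look interesting.
interesting = patterns[patterns["count"] >= 5].sort_values("sum", ascending=False)
print(interesting.head(10))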

Q2. Data Mining Architecture

Data mining involves digging through various sources of data to find useful
patterns and insights. Here's how it works:

1. Data Sources:
Data comes from places like databases, data warehouses, and the web.
Different types of data are gathered and cleaned up for analysis.

2. Database or Data Warehouse Server:


This is where all the data is stored and ready for processing.
The server finds and retrieves the specific data needed for mining.
3. Data Mining Engine:
The heart of the system, it has modules for different mining tasks like finding
associations, classifying data, clustering, and predicting trends.

4. Pattern Evaluation Modules:


These modules assess the value or interest of patterns found by the mining
engine.
They help focus the search on patterns that are truly meaningful.

5. Graphical User Interface (GUI):


The interface between the user and the system.
Makes it easy for users to interact with the system without understanding its
complexities.
Displays results in a user-friendly way.

6. Knowledge Base:
Helps guide the mining process and assess the value of results.
Can include user beliefs and experiences to enhance accuracy.
The mining engine can consult the knowledge base for better results.

In simple terms, data mining involves gathering, cleaning, and analyzing data from different sources with the help of specialized software. The system then identifies valuable patterns and presents them to users through an easy-to-use interface.

Q3. Data Mining Applications

Data mining finds applications in database analysis and decision support, market analysis, risk management, and fraud detection. Data sources include credit card transactions, loyalty cards, and customer complaints.
Target marketing identifies customer clusters based on characteristics like
interests and income. Cross-market analysis identifies product sales
associations. Customer profiling helps understand purchasing patterns,
while intelligent query answering aids in customer service.
In finance, data mining assists in cash flow analysis, asset evaluation, and
resource planning. It also aids in competitive monitoring and pricing
strategy development. Fraud detection, widely used in various sectors, relies
on historical data to identify suspicious patterns, such as staged accidents in
insurance or money laundering. In retail, it helps in identifying dishonest
employees and reducing shrinkage.

Other applications include sports analytics, such as NBA game statistics analysis, astronomy discoveries, and internet web surf analysis for market-related pages, improving web marketing effectiveness.

Q4. Data Mining Issues

1. Mining methodology and user interaction:


Mining different kinds of knowledge: This means finding various types of
insights from databases.
Interactive mining: Users can interact with the mining process at different
levels of detail.
Incorporation of background knowledge: Existing knowledge is used to
enhance the mining process.
Handling noise and incomplete data: Dealing with errors or missing
information in the data.
Pattern evaluation: Determining the importance or relevance of discovered
patterns.

2. Performance and scalability:


Efficiency and scalability of algorithms: Ensuring that mining processes are
fast and can handle large amounts of data.
Parallel, distributed, and incremental methods: Using techniques that allow
for processing data in parallel, across multiple systems, and in small
incremental steps.

3. Issues relating to the diversity of data types:


Handling relational and complex data types: Dealing with different types of
data structures and relationships.
Mining from heterogeneous databases: Extracting information from various
types of databases and global systems like the World Wide Web.
4. Issues related to applications and social impacts:
Application of discovered knowledge: Applying the insights gained from
data mining in real-world scenarios.
Domain-specific tools: Creating specialized tools for specific industries or
fields.
Intelligent query answering: Developing systems that can provide
intelligent responses to user queries.
Integration of discovered knowledge: Incorporating new findings with
existing knowledge.
Data security, integrity, and privacy: Ensuring that data used in mining is
protected and privacy is maintained.

Q5. Data Preprocessing

Data preprocessing is a critical step in data warehouse and data mining processes, transforming raw data into a format suitable for analysis and modeling. Here's an overview of the typical steps involved:

1. Data Collection: Gather data from various sources like databases, files,
APIs, etc., and consolidate them into a central repository.

2. Data Cleaning: Remove errors, inconsistencies, and irrelevant information by handling missing values, outliers, and duplicates, and by standardizing data formats.

3. Data Transformation: Convert categorical variables into numerical representations, scale numerical features, engineer new features, and reduce dimensionality.

4. Data Integration: Combine data from different sources into a unified dataset, resolving inconsistencies or conflicts in the data schema.

5. Data Aggregation: Aggregate data at various levels of granularity to derive insights at different levels of abstraction, summarizing and condensing data.

6. Data Loading: Load the preprocessed data into a data warehouse or a data mining tool for further analysis and modeling.
7. Data Mining: Apply data mining techniques such as clustering,
classification, regression, association rule mining, or anomaly detection to
discover patterns, trends, and insights from the preprocessed data.

8. Evaluation and Validation: Evaluate the effectiveness of the preprocessing steps and the quality of the resulting data for the intended analysis tasks. Validate the models built on preprocessed data using appropriate metrics and techniques.

9. Iterative Process: Data preprocessing is often an iterative process where steps may need to be revisited based on insights gained during analysis or feedback from model performance.

By following these steps systematically, data preprocessing can help improve the quality and usability of data for effective analysis and decision-making in data warehousing and data mining applications.
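A condensed sketch of steps 2 to 5 in pandas is shown below; the column names are invented and the thresholds are chosen arbitrarily for illustration.

import pandas as pd

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # 2. Cleaning: drop duplicates, fill missing numeric values, clip outliers.
    df = df.drop_duplicates()
    df["income"] = df["income"].fillna(df["income"].median())
    df["income"] = df["income"].clip(upper=df["income"].quantile(0.99))

    # 3. Transformation: encode a categorical column and min-max scale a numeric one.
    df = pd.get_dummies(df, columns=["segment"])
    df["income"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

    # 5. Aggregation: summarize at the customer level of granularity.
    return df.groupby("customer_id").mean(numeric_only=True)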
