Middle Tier: In the middle tier, the OLAP (Online Analytical Processing)
Server is implemented either through Relational OLAP (ROLAP) or
Multidimensional OLAP (MOLAP). ROLAP maps multidimensional data
operations to standard relational operations, while MOLAP directly
implements multidimensional data and operations.
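To make the ROLAP idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its columns are hypothetical, not from any particular warehouse. It shows how a multidimensional roll-up (total sales per category per quarter) maps onto an ordinary relational GROUP BY:

```python
import sqlite3

# In-memory database standing in for the warehouse's relational store.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (category TEXT, quarter TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Books", "Q1", 120.0), ("Books", "Q2", 80.0), ("Toys", "Q1", 45.0)],
)

# ROLAP maps the multidimensional roll-up (category x quarter -> total)
# onto a standard relational aggregation query.
for row in con.execute(
    "SELECT category, quarter, SUM(amount) FROM sales "
    "GROUP BY category, quarter"
):
    print(row)
```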
Top Tier: This layer serves as the front-end client layer, housing query tools,
reporting tools, analysis tools, and data mining tools. Users interact with the
data warehouse through this layer to retrieve and analyze data for
decision-making purposes.
2. Data Warehouse v/s Data Mart.
1. Star Schema
The star schema is a basic and widely used schema for organizing data in a
data warehouse or dimensional data marts. In this schema, there is one
central "fact" table surrounded by multiple "dimension" tables. The fact table
contains numerical data or measurements, while the dimension tables
provide context or details about the data. This schema is called "star"
because its structure resembles a star, with the fact table at the center and
the dimension tables at the edges. It is simple, efficient, and easy to
understand, making it a popular choice for building data warehouses and
handling common queries effectively.
In the given demonstration (a code sketch of the schema follows this list),
● SALES is the fact table, having the attributes Product ID, Order ID, Customer ID, Employee ID, Total, Quantity, and Discount. The first four are foreign keys referencing the dimension tables; the last three are measures.
● Employee dimension table contains the
attributes Emp ID, Emp Name, Title,
Department and Region
● Product dimension table contains the
attributes Product ID, Product Name,
Product Category, Unit Price
● Customer dimension table contains the
attributes Customer ID, Customer Name,
Address, City, Zip
● Time dimension table contains the
attributes Order ID, Order Date, Year,
Quarter, Month
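The same star can be outlined as relational table definitions. This is an illustrative sketch only, using Python's sqlite3; the column types are assumptions, not part of the original design:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables sit at the points of the star.
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                       title TEXT, department TEXT, region TEXT);
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_name TEXT,
                       product_category TEXT, unit_price REAL);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT,
                       address TEXT, city TEXT, zip TEXT);
CREATE TABLE time_dim (order_id INTEGER PRIMARY KEY, order_date TEXT,
                       year INTEGER, quarter INTEGER, month INTEGER);

-- The fact table at the centre: four foreign keys plus three measures.
CREATE TABLE sales (
    product_id  INTEGER REFERENCES product(product_id),
    order_id    INTEGER REFERENCES time_dim(order_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    emp_id      INTEGER REFERENCES employee(emp_id),
    total REAL, quantity INTEGER, discount REAL
);
""")
print("star schema created")
```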
2. Snowflake Schema
The snowflake schema is a variant of the star schema.
Instead of each dimension being one big table, the dimensions are split into
multiple smaller, better-organized tables.
Snowflaking occurs when the dimensions of a star schema become complex,
with multiple levels of relationships in which a child table can have more
than one parent. This complexity affects only the dimension tables, not the
fact tables. (A code sketch of the normalized split appears after the list below.)
● The Employee dimension table now contains the attributes
EmployeeID, EmployeeName, DepartmentID, Region, Territory
● The DepartmentID attribute links the Employee dimension table with
the Department dimension table
● The Department dimension is used to provide detail about each
department, such as Name and Location of the department
● The Customer dimension table now contains the attributes
CustomerID, CustomerName, Address, CityID
● The CityID attribute links the Customer dimension table with the City
dimension table
● The City dimension table has details about each city such as CityName,
Zipcode, State and Country
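As an illustration, here is a minimal sqlite3 sketch of the normalized Customer/City split described above (column types again assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- In the star schema, city details lived inside the customer table.
-- Snowflaking normalizes them out into their own dimension table.
CREATE TABLE city (
    city_id INTEGER PRIMARY KEY,
    city_name TEXT, zipcode TEXT, state TEXT, country TEXT
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT, address TEXT,
    city_id INTEGER REFERENCES city(city_id)
);
""")
print("snowflaked customer dimension created")
```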
The main difference between the star schema and the snowflake schema is
that the dimension tables of the snowflake schema are maintained in
normalized form to reduce redundancy.
Advantages:
Snowflake schema ensures structured data, reducing issues with data
integrity.
It optimizes disk space usage by organizing data efficiently.
Disadvantages:
While snowflaking saves space in dimension tables, the overall impact on
the data warehouse size is often minimal.
Snowflaking should be avoided unless necessary, and hierarchies should
remain within dimension tables without being split.
In the above demonstration:
● Placement is a fact table having attributes (Stud_roll, Company_id,
TPO_id) with facts (Number of students eligible, Number of students
placed).
● Workshop is a fact table having attributes (Stud_roll, Institute_id,
TPO_id) with facts (Number of students selected, Number of students
who attended the workshop).
● Company is a dimension table having attributes (Company_id, Name,
Offer_package).
● Student is a dimension table having attributes (Student_roll, Name,
CGPA).
● TPO is a dimension table having attributes (TPO_id, Name, Age).
● Training Institute is a dimension table having attributes (Institute_id,
Name, Full_course_fee).
So there are two fact tables, Placement and Workshop, belonging to two
different star schemas: one star schema with fact table Placement and
dimension tables Company, Student, and TPO, and another with fact table
Workshop and dimension tables Training Institute, Student, and TPO.
The two star schemas share two dimension tables (Student and TPO),
thereby forming a fact constellation, or galaxy, schema.
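A minimal sketch of this galaxy schema, again using Python's sqlite3 (column types are assumptions, and the key is written student_roll throughout for consistency):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student (student_roll INTEGER PRIMARY KEY, name TEXT, cgpa REAL);
CREATE TABLE tpo     (tpo_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT,
                      offer_package REAL);
CREATE TABLE training_institute (institute_id INTEGER PRIMARY KEY,
                                 name TEXT, full_course_fee REAL);

-- Two fact tables, each the centre of its own star, sharing the
-- student and tpo dimensions: a fact constellation (galaxy).
CREATE TABLE placement (
    student_roll INTEGER REFERENCES student(student_roll),
    company_id   INTEGER REFERENCES company(company_id),
    tpo_id       INTEGER REFERENCES tpo(tpo_id),
    num_eligible INTEGER, num_placed INTEGER
);
CREATE TABLE workshop (
    student_roll INTEGER REFERENCES student(student_roll),
    institute_id INTEGER REFERENCES training_institute(institute_id),
    tpo_id       INTEGER REFERENCES tpo(tpo_id),
    num_selected INTEGER, num_attended INTEGER
);
""")
print("galaxy schema created")
```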
OLTP v/s OLAP:
● OLTP is optimized for fast and efficient transaction processing, while
OLAP is used for complex queries and data analysis.
1. Extraction: The first step in the ETL process is extraction, where data is
gathered from various source systems. These sources can take different
forms, such as relational databases, NoSQL databases, XML files, or flat files.
The extracted data is first stored in a staging area before being loaded into
the data warehouse. This staging area serves as an intermediate step that
ensures the data is in a consistent format and is not corrupted before it
enters the warehouse.
2. Transformation: In the transformation step, rules or functions are applied
to the extracted data to convert it into a standardized format. This may
involve various processes such as filtering to include only certain attributes,
cleaning to replace null values with default values, joining to combine
multiple attributes into one, splitting to divide a single attribute into
multiple attributes, and sorting tuples based on specific criteria.
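These rules can be illustrated with a short, self-contained Python sketch; the record fields and the default value below are illustrative assumptions, not part of any particular source system:

```python
# Illustrative raw records extracted from a source system.
raw = [
    {"name": "Riya Shah",   "city": "Mumbai", "sales": 120},
    {"name": "Arjun Mehta", "city": None,     "sales": 95},
]

transformed = []
for rec in raw:
    # Cleaning: replace null values with a default.
    city = rec["city"] if rec["city"] is not None else "UNKNOWN"
    # Splitting: divide a single attribute into multiple attributes.
    first, last = rec["name"].split(" ", 1)
    # Filtering: keep only the attributes the warehouse needs.
    transformed.append({"first": first, "last": last,
                        "city": city, "sales": rec["sales"]})

# Sorting: order tuples on a chosen criterion.
transformed.sort(key=lambda r: r["sales"], reverse=True)
print(transformed)
```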
3. Loading: The final step of the ETL process is loading, where the
transformed data is inserted into the data warehouse. Loading can occur at
different frequencies, either continuously or at scheduled intervals; the rate
and timing are determined by the specific needs of the system.
Additionally, the ETL process can utilize the pipelining concept, where data
flows through a series of interconnected stages or tasks. Pipelining helps
streamline the ETL process by enabling a continuous flow of data from
extraction to transformation to loading, improving efficiency and reducing
latency.
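The pipelining idea can be sketched with Python generators, where each stage consumes records from the previous one so that data flows continuously from extraction through transformation to loading (a conceptual sketch only, not a production ETL tool):

```python
def extract():
    # Stage 1: yield records one at a time from a source (here, a list).
    for rec in [{"id": 1, "amount": "10"}, {"id": 2, "amount": "25"}]:
        yield rec

def transform(records):
    # Stage 2: standardize each record as it arrives.
    for rec in records:
        yield {"id": rec["id"], "amount": float(rec["amount"])}

def load(records):
    # Stage 3: insert each record into the warehouse (here, just print).
    for rec in records:
        print("loaded", rec)

# Records flow through extract -> transform -> load without waiting
# for any stage to finish the whole dataset first.
load(transform(extract()))
```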
8. What is Loading?
Loading is the final step of the ETL process, in which the transformed data is
written into the data warehouse, as described in step 3 above.
Within the KDD (Knowledge Discovery in Databases) process, a subsequent
phase involves choosing appropriate data mining functions, such as
summarization, classification, regression, association, or clustering, based on
the analysis goals. This is followed by selecting the mining algorithms that
best suit the chosen functions.
The core of KDD lies in data mining itself, where patterns of interest are
sought within the dataset. Once patterns are discovered, they undergo
evaluation to determine their significance and usefulness. This includes
visualization, transformation, and removing redundant patterns to present
the knowledge gained in a meaningful way.
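As a toy illustration of the mining step itself, here is a self-contained Python sketch that searches for one simple kind of pattern, frequently co-occurring item pairs, in a small transaction set (the baskets and the 50% support threshold are invented for the example):

```python
from collections import Counter
from itertools import combinations

# Toy transaction data; each row is one market basket.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

# Count how often each unordered pair of items appears together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs that meet a minimum support of 50% of transactions.
min_support = 0.5 * len(transactions)
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)  # {('bread', 'milk'): 3, ('eggs', 'milk'): 2}
```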
Data mining involves digging through various sources of data to find useful
patterns and insights. Here's how it works:
1. Data Sources:
Data comes from places like databases, data warehouses, and the web.
Different types of data are gathered and cleaned up for analysis.
6. Knowledge Base:
Helps guide the mining process and assess the value of results.
Can include user beliefs and experiences to enhance accuracy.
The mining engine can consult the knowledge base for better results.
1. Data Collection: Gather data from various sources like databases, files,
APIs, etc., and consolidate them into a central repository.
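A minimal sketch of such consolidation in Python; the CSV text and JSON payload below are inlined stand-ins for real file and API sources:

```python
import csv
import io
import json

# Stand-in for a file source: CSV data that would normally be read from disk.
csv_source = io.StringIO("id,name\n1,Riya\n2,Arjun\n")
# Stand-in for an API source: a JSON payload that would arrive over HTTP.
api_payload = json.loads('[{"id": 3, "name": "Meera"}]')

# Consolidate both sources into one central list of records.
repository = []
repository.extend({"id": int(r["id"]), "name": r["name"]}
                  for r in csv.DictReader(csv_source))
repository.extend(api_payload)
print(repository)
```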