
Data Warehouse Final Notes

What is Star Schema?


A star schema is a data warehouse design model that uses a central fact table to
store quantitative data, and multiple dimension tables to store descriptive data. The
fact table is connected to the dimension tables by foreign keys. This design makes it
easy to query the data warehouse for specific information.

Here is a more detailed explanation of the star schema:

 Fact table: The fact table is the central table in the star schema. It stores
quantitative data, such as sales amounts, product quantities, and customer
ages.
 Dimension tables: The dimension tables store descriptive data, such as
product names, customer names, and sales dates.
 Foreign keys: The fact table is connected to the dimension tables by foreign
keys. Foreign keys are columns in the fact table that refer to columns in the
dimension tables. This allows the data warehouse to maintain referential
integrity, which means that the data in the fact table and the dimension tables
are always consistent.

Characteristics of Star Schema


The star schema is well suited to data warehouse database design because of the
following features:

o It creates a denormalized database that can quickly provide query
responses.
o It provides a flexible design that can be changed easily or added to
throughout the development cycle, and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

 Query Performance: Star schemas are efficient for querying because they
have a limited number of tables and clear join paths. Small single-table
queries can be executed almost instantaneously, and large join queries in
seconds or minutes.
 Load Performance and Administration: Star schemas are easy to load and
administer because they have a simple structure. Dimension tables can be
populated once and refreshed occasionally, and new facts can be added
regularly and selectively.
 Built-in Referential Integrity: Star schemas enforce referential integrity
because each row in a dimension table has a unique primary key, and every
key in the fact table is a legitimate foreign key drawn from a dimension
table. This ensures that the data in the star schema is always consistent.
 Easily Understood: Star schemas are easy to understand and navigate
because dimensions are joined only through the fact table. This makes it easy
for users to understand the relationships between different parts of the data.

Disadvantage of Star Schema


Some conditions cannot be met by star schemas. For example, the relationship
between a user and a bank account cannot be described as a star schema,
because the relationship between them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.
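The SALES example can be made concrete in code. Below is a minimal sketch using Python's built-in sqlite3 module; the table and column names (dim_time, dim_item, dim_branch, units_sold, dollars_sold) are illustrative assumptions, not part of the notes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes (names are illustrative).
cur.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT)")
cur.execute("CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT)")
cur.execute("CREATE TABLE dim_branch (branch_id INTEGER PRIMARY KEY, branch_name TEXT)")

# The central fact table: quantitative measures plus a foreign key per dimension.
cur.execute("""CREATE TABLE sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    item_id INTEGER REFERENCES dim_item(item_id),
    branch_id INTEGER REFERENCES dim_branch(branch_id),
    units_sold INTEGER, dollars_sold REAL)""")

cur.execute("INSERT INTO dim_time VALUES (1, 2023, 'Q1')")
cur.execute("INSERT INTO dim_item VALUES (1, 'Laptop')")
cur.execute("INSERT INTO dim_branch VALUES (1, 'Downtown')")
cur.execute("INSERT INTO sales VALUES (1, 1, 1, 5, 4500.0)")

# A typical star-schema query: join the fact table to a dimension and aggregate.
row = cur.execute("""
    SELECT i.item_name, SUM(s.dollars_sold)
    FROM sales s
    JOIN dim_item i ON s.item_id = i.item_id
    GROUP BY i.item_name""").fetchone()
print(row)  # ('Laptop', 4500.0)
```

Note how each dimension is only one join away from the fact table, which is what keeps the join paths clear.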

What is Snowflake Schema?


A snowflake schema is a data warehouse design model that is an extension of a
star schema. In a snowflake schema, the dimension tables are normalized, which
means that they are broken down into smaller tables. This can improve data
integrity and performance, but it can also make the schema more complex. The
snowflake schema consists of one fact table which is linked to many dimension
tables, which can be linked to other dimension tables through a many-to-one
relationship. Tables in a snowflake schema are generally normalized to the third
normal form. Each dimension table represents exactly one level in a hierarchy.

Here is a more detailed explanation of the snowflake schema:


 Fact table: The fact table is the central table in the snowflake schema. It
stores quantitative data, such as sales amounts, product quantities, and
customer ages.
 Dimension tables: The dimension tables store descriptive data, such as
product names, customer names, and sales dates.
 Normalized dimension tables: Normalized dimension tables are smaller
tables that are created from the original dimension tables. They are created
by breaking down the original dimension tables into smaller tables that contain
only related data.
 Foreign keys: The fact table is connected to the normalized dimension tables
by foreign keys. Foreign keys are columns in the fact table that refer to
columns in the normalized dimension tables. This allows the data warehouse
to maintain referential integrity, which means that the data in the fact table
and the normalized dimension tables are always consistent.
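To illustrate the normalization described above, here is a small sketch (again with Python's sqlite3; the item/category hierarchy and all names are assumed for illustration) showing how a snowflaked dimension forces an extra join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# In a snowflake schema the dimension hierarchy is normalized:
# each table holds exactly one level (item -> category).
cur.execute("CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT)")
cur.execute("""CREATE TABLE dim_item (
    item_id INTEGER PRIMARY KEY,
    item_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id))""")
cur.execute("""CREATE TABLE sales (
    item_id INTEGER REFERENCES dim_item(item_id),
    dollars_sold REAL)""")

cur.execute("INSERT INTO dim_category VALUES (10, 'Electronics')")
cur.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                [(1, "Laptop", 10), (2, "Phone", 10)])
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 4500.0), (2, 800.0)])

# Querying by category now needs an extra join through the hierarchy,
# which is the trade-off for removing redundancy from dim_item.
row = cur.execute("""
    SELECT c.category_name, SUM(s.dollars_sold)
    FROM sales s
    JOIN dim_item i ON s.item_id = i.item_id
    JOIN dim_category c ON i.category_id = c.category_id
    GROUP BY c.category_name""").fetchone()
print(row)  # ('Electronics', 5300.0)
```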

Advantage of Snowflake Schema


1. The primary advantage of the snowflake schema is the improvement in
query performance due to minimized disk storage requirements and the
joining of smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension
levels and components.
3. No redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema


1. The primary disadvantage of the snowflake schema is the additional
maintenance effort required due to the increasing number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.

S.NO | Star Schema | Snowflake Schema
1. | In star schema, the fact tables and the dimension tables are contained. | In snowflake schema, the fact tables, dimension tables as well as sub-dimension tables are contained.
2. | Star schema is a top-down model. | It is a bottom-up model.
3. | Star schema uses more space. | It uses less space.
4. | It takes less time for the execution of queries because there are fewer JOINs between tables. | It takes more time than star schema for the execution of queries because there are more JOINs between tables.
5. | A star schema has denormalized dimension tables. | A snowflake schema has normalized dimension tables.
6. | A star schema is easier to design and implement. | Its design is complex.
7. | The query complexity of star schema is low. | The query complexity of snowflake schema is higher than star schema.
8. | Its understanding is very simple. | Its understanding is difficult.
9. | It has a smaller number of foreign keys. | It has a larger number of foreign keys.
10. | It has high data redundancy. | It has low data redundancy.

What is Fact Constellation Schema?


§ Also known as the Galaxy Schema or the Starflake Schema
§ It is an extension of the traditional Star Schema and allows for more complex
relationships among multiple fact tables and dimension tables.
§ In a Fact Constellation Schema, multiple fact tables share common dimension
tables.
§ This schema design is suitable when there are multiple independent business
processes or events that need to be analyzed separately but also have some
common dimensions.

Here is a more detailed explanation of the fact constellation schema:

 Fact tables: The fact tables are the central tables in the fact constellation
schema. They store quantitative data, such as sales amounts, product
quantities, and customer ages.
 Dimension tables: The dimension tables store descriptive data, such as
product names, customer names, and sales dates.
 Shared dimension tables: The shared dimension tables are dimension
tables that are used by multiple fact tables. This can improve data integrity
and performance because the data in the shared dimension tables only needs
to be stored once.
 Foreign keys: The fact tables and the dimension tables are connected by
foreign keys. Foreign keys are columns in the fact tables and the dimension
tables that refer to each other. This allows the data warehouse to maintain
referential integrity, which means that the data in the fact tables and the
dimension tables are always consistent.
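A minimal sketch of a fact constellation, using Python's sqlite3. The two business processes (sales and shipping) and all table names are assumptions chosen for illustration; the point is that both fact tables reference the same shared dimension table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One shared dimension table, stored only once...
cur.execute("CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT)")

# ...referenced by two independent fact tables (two business processes).
cur.execute("CREATE TABLE fact_sales (item_id INTEGER, dollars_sold REAL)")
cur.execute("CREATE TABLE fact_shipping (item_id INTEGER, units_shipped INTEGER)")

cur.execute("INSERT INTO dim_item VALUES (1, 'Laptop')")
cur.execute("INSERT INTO fact_sales VALUES (1, 4500.0)")
cur.execute("INSERT INTO fact_shipping VALUES (1, 5)")

# Each process can be analyzed separately against the same shared dimension.
sales = cur.execute("""SELECT i.item_name, SUM(f.dollars_sold)
                       FROM fact_sales f JOIN dim_item i USING (item_id)
                       GROUP BY i.item_name""").fetchone()
shipped = cur.execute("""SELECT i.item_name, SUM(f.units_shipped)
                         FROM fact_shipping f JOIN dim_item i USING (item_id)
                         GROUP BY i.item_name""").fetchone()
print(sales, shipped)  # ('Laptop', 4500.0) ('Laptop', 5)
```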

The fact constellation schema is a more complex data warehouse design model
than the star schema. However, it can offer some advantages, such as improved
data integrity, performance, and scalability.

Here are some of the benefits of using a fact constellation schema:

 Data Integrity: Fact constellation schemas can improve data integrity by
reducing the risk of data duplication and inconsistency, since shared
dimension data is stored only once.
 Performance: Fact constellation schemas can improve performance by
reducing the number of joins that need to be performed when querying
the data warehouse.
 Scalability: Fact constellation schemas can be scaled up to handle large
amounts of data more easily than star schemas. This is because the shared
dimension tables are smaller and easier to manage.

However, there are also some disadvantages to using a fact constellation schema:

 Complexity: Fact constellation schemas are more complex than star
schemas. This can make them more difficult to understand and maintain.
 Cost: Fact constellation schemas can be more expensive to implement and
maintain than star schemas. This is because they require more tables and
more complex relationships between the tables.

DIFFERENCE BETWEEN BI AND DATA WAREHOUSE:

Business intelligence for strategic planning using Data Warehouse
Business intelligence (BI) plays a crucial role in strategic planning using a data
warehouse. A data warehouse is a central repository that integrates data from
various sources, providing a foundation for BI activities. Here's how BI supports
strategic planning using a data warehouse:

1. Data Consolidation and Integration: A data warehouse brings together data
from different systems and sources within an organization. It consolidates and
integrates data, ensuring consistency and accuracy. This unified view of
data allows for comprehensive analysis and informed decision-making.
2. Historical and Current Data Analysis: A data warehouse stores both
historical and current data, enabling organizations to analyze trends, patterns,
and performance over time. Strategic planners can leverage this historical
data to identify long-term trends, understand past successes and
failures, and make predictions about future outcomes.
3. Ad Hoc Reporting and Analysis: BI tools integrated with the data
warehouse provide ad hoc reporting and analysis capabilities. Strategic
planners can explore data, create custom reports, and perform interactive
analysis to gain insights into various aspects of the business. This helps in
identifying opportunities, evaluating performance, and understanding
market dynamics.
4. Key Performance Indicators (KPIs) Tracking: A data warehouse allows
organizations to define and track KPIs relevant to their strategic objectives.
Strategic planners can monitor performance metrics in real-time, track
progress towards goals, and identify areas that require attention or
improvement. KPI dashboards and scorecards provide visual
representations of performance, facilitating easy monitoring and decision-
making.
5. Predictive Analytics and Forecasting: BI tools integrated with the data
warehouse support predictive analytics and forecasting. Strategic planners
can leverage advanced analytical techniques to forecast future trends,
simulate scenarios, and evaluate the potential impact of strategic
decisions. This helps in developing robust strategic plans and making data-
driven decisions.
6. Competitive Analysis: BI tools integrated with external data sources can
provide insights into the competitive landscape. Strategic planners can
gather market intelligence, analyze competitor performance, identify
market trends, and uncover opportunities or threats. This information
guides strategic decision-making, such as product positioning, market entry
strategies, or identifying competitive advantages.
7. Collaboration and Information Sharing: BI platforms often provide
collaboration features that enable strategic planners to share reports, insights,
and analysis with stakeholders across the organization. This fosters a data-
driven culture, facilitates alignment, and ensures that strategic plans are well-
communicated and supported by relevant stakeholders.

Business intelligence (BI) is a broad term for the process of collecting, analyzing,
and presenting data to help businesses make better decisions. Data warehousing is
a key component of BI, as it allows businesses to store and organize large amounts
of data in a way that can be easily accessed and analyzed.

BI can be used for strategic planning by helping businesses to:

 Understand their customers: By analyzing customer data, businesses can
gain insights into their customers' needs, preferences, and behaviors. This
information can be used to develop new products and services, improve
customer service, and target marketing campaigns more effectively.
 Identify trends: By analyzing historical data, businesses can identify trends
that may affect their business in the future. This information can be used to
make decisions about product development, marketing, and investment.
 Benchmark themselves against competitors: By comparing their
performance to that of their competitors, businesses can identify areas where
they can improve. This information can be used to set goals, develop
strategies, and allocate resources.
 Make better decisions: By providing businesses with insights into their
operations, BI can help them to make better decisions about everything from
product pricing to marketing campaigns. This can lead to improved efficiency,
profitability, and customer satisfaction.

Data warehouses can be a valuable tool for strategic planning, but they are not a
silver bullet. To be successful, businesses need to have a clear understanding of
their goals and objectives, and they need to be able to effectively use the data that is
stored in the data warehouse.

Here are some of the benefits of using a data warehouse for strategic planning:

 Improved decision-making: Data warehouses can provide businesses with
a wealth of information that can be used to make better decisions about
everything from product pricing to marketing campaigns.
 Increased efficiency: Data warehouses can help businesses to streamline
their operations and identify areas where they can improve efficiency.
 Improved customer service: Data warehouses can help businesses to
better understand their customers' needs and preferences, which can lead to
improved customer service.
 Increased profitability: Data warehouses can help businesses to identify
new opportunities for growth and revenue.

ETL (Extract, Transform, and Load) Process


INTRODUCTION:

ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format suitable
for loading into a data warehouse, and then load it into the warehouse. The process
of ETL can be broken down into the following three stages:

Extract: The first stage in the ETL process is to extract data from various sources
such as transactional systems, spreadsheets, and flat files. This step involves
reading data from the source systems and storing it in a staging area.
Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and
validating the data, converting data types, combining data from multiple sources, and
creating new data fields.

Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the
warehouse.

The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data
warehouse is accurate, complete, and up-to-date. It also helps to ensure that the
data is in the format required for data mining and reporting.
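The three stages can be sketched end-to-end in a few lines. This is an illustrative toy pipeline, not a production ETL tool; the source rows, country mapping, and table layout are all assumed:

```python
import sqlite3

# Hypothetical source rows, e.g. as read from a flat file (the source system).
source_rows = [
    {"name": "Alice", "country": "U.S.A", "amount": "100"},
    {"name": "Bob", "country": "America", "amount": None},
]

# Extract: copy source data into a staging area untouched.
staging = list(source_rows)

# Transform: standardize country names and fill NULLs with defaults.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
transformed = []
for r in staging:
    transformed.append({
        "name": r["name"],
        "country": COUNTRY_MAP.get(r["country"], r["country"]),
        "amount": float(r["amount"]) if r["amount"] is not None else 0.0,
    })

# Load: write the cleaned rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (:name, :country, :amount)", transformed)
rows = conn.execute("SELECT name, country, amount FROM customers").fetchall()
print(rows)  # [('Alice', 'USA', 100.0), ('Bob', 'USA', 0.0)]
```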

Extraction:
The first step of the ETL process is extraction. In this step, data is extracted
from various source systems, which can be in various formats like relational
databases, NoSQL stores, XML, and flat files, into the staging area. It is
important to extract the data from the source systems and store it in the staging
area first, rather than directly in the data warehouse, because the extracted data
is in various formats and can also be corrupted. Loading it directly into the data
warehouse may damage it, and rollback would be much more difficult. Therefore,
this is one of the most important steps of the ETL process.

Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard format.
It may involve the following processes/tasks:

Filtering – loading only certain attributes into the data warehouse.

Cleaning – filling up the NULL values with some default values, mapping U.S.A,
United States, and America into USA, etc.

Joining – joining multiple attributes into one.

Splitting – splitting a single attribute into multiple attributes.

Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
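The transformation tasks listed above (filtering, cleaning, joining, splitting, sorting) can be sketched on plain Python records. All field names and rules here are hypothetical examples:

```python
records = [
    {"full_name": "Doe, Jane", "city": "NYC", "state": "NY", "sales": None},
    {"full_name": "Roe, John", "city": "LA", "state": "CA", "sales": 250},
]

transformed = []
for r in records:
    # Splitting: one attribute ("full_name") becomes two.
    last, first = [p.strip() for p in r["full_name"].split(",")]
    transformed.append({
        "first_name": first,
        "last_name": last,
        # Joining: two attributes ("city", "state") become one.
        "location": f"{r['city']}, {r['state']}",
        # Cleaning: fill NULL values with a default.
        "sales": r["sales"] if r["sales"] is not None else 0,
    })

# Filtering: keep only the rows/attributes needed in the warehouse.
transformed = [t for t in transformed if t["sales"] >= 0]
# Sorting: order tuples on a key attribute.
transformed.sort(key=lambda t: t["last_name"])

print(transformed[0]["first_name"], transformed[0]["location"])  # Jane NYC, NY
```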

Loading:
Loading is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.

Loading can be carried in two ways:

1. Refresh: Data warehouse data is completely rewritten, meaning the older
file is replaced. Refresh is usually used in combination with static extraction to
populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the
Data Warehouse. An update is typically carried out without deleting or
modifying preexisting data. This method is used in combination with
incremental extraction to update data warehouses regularly.
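The two loading modes can be sketched against a SQLite table. This is an illustrative sketch: refresh rewrites the table completely, while update applies only changed or new rows (here via an upsert) without touching preexisting data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse (id INTEGER PRIMARY KEY, value TEXT)")

def refresh(rows):
    """Refresh: completely rewrite the warehouse table (initial population)."""
    conn.execute("DELETE FROM warehouse")
    conn.executemany("INSERT INTO warehouse VALUES (?, ?)", rows)

def update(changes):
    """Update: apply only source changes, without rewriting existing data."""
    conn.executemany(
        "INSERT INTO warehouse VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET value = excluded.value", changes)

refresh([(1, "a"), (2, "b")])   # full load (static extraction)
update([(2, "b2"), (3, "c")])   # incremental load of changed/new rows only
rows = conn.execute("SELECT * FROM warehouse ORDER BY id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b2'), (3, 'c')]
```

Row 1 survives untouched, row 2 is modified in place, and row 3 is appended, which is exactly the behavior the update mode describes.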

ADVANTAGES AND DISADVANTAGES:

Advantages of ETL process in data warehousing:

 Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
 Better data integration: ETL process helps to integrate data from multiple
sources and systems, making it more accessible and usable.
 Increased data security: ETL process can help to improve data security by
controlling access to the data warehouse and ensuring that only authorized
users can access the data.
 Improved scalability: ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
 Increased automation: ETL tools and technologies can automate and
simplify the ETL process, reducing the time and effort required to load and
update data in the warehouse.

Disadvantages of ETL process in data warehousing:


 High cost: ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
 Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or resources.
 Limited flexibility: ETL process can be limited in terms of flexibility, as it may
not be able to handle unstructured data or real-time data streams.
 Limited scalability: ETL process can be limited in terms of scalability, as it
may not be able to handle very large amounts of data.
 Data privacy concerns: ETL process can raise concerns about data privacy,
as large amounts of data are collected, stored, and analyzed.

Here are five important challenges for ETL (Extract, Transform, Load):

1. Data Quality Assurance: Ensuring data quality is a significant challenge in
ETL processes. Source systems may have inconsistent or incomplete data,
leading to issues such as missing values, duplicates, or inaccuracies. It's
crucial to implement data validation, cleansing, and enrichment techniques to
improve data quality before loading it into the target system.
2. Scalability and Performance: ETL processes often deal with large volumes
of data, and scalability and performance are critical factors. As data volumes
increase, the ETL system should be able to handle the load efficiently without
compromising performance. Optimizing data extraction, transformation, and
loading operations, as well as employing parallel processing techniques, can
help overcome scalability challenges.
3. Data Integration and Compatibility: ETL processes need to integrate data
from diverse sources, which may have different data formats, structures, or
semantics. Mapping and transforming data to ensure compatibility and
consistency across systems can be complex. Dealing with different data
types, handling data from legacy systems, and resolving data conflicts are
common challenges in ETL integration.
4. Change Management and Data Governance: ETL processes must adapt to
changes in source systems, data models, or business requirements.
Managing changes effectively while maintaining data integrity and consistency
is a challenge. It involves version control, documentation, impact analysis,
and coordination with stakeholders to ensure that ETL processes evolve with
the changing needs of the organization.
5. Error Handling and Exception Management: ETL processes can encounter
errors, exceptions, or data anomalies during extraction, transformation, or
loading stages. Detecting and handling errors promptly is crucial for
maintaining data integrity. Implementing robust error handling mechanisms,
logging, and exception management processes are essential to identify,
capture, and resolve errors effectively.

Benefits of Data Warehousing:

 Improved Decision-Making: Data warehouses consolidate and organize
data from various sources into a unified and structured format. This enables
organizations to perform in-depth analysis and gain insights that support
informed decision-making. By providing a comprehensive view of the data,
data warehouses facilitate strategic planning and forecasting.
 Enhanced Data Quality: Data warehouses employ data cleansing, validation,
and integration processes, which help improve data quality. By ensuring
consistency, accuracy, and completeness of data, organizations can rely on
high-quality information for analysis and reporting.
 Faster Query and Reporting Performance: Data warehouses are optimized
for query performance, allowing users to retrieve and analyze data rapidly.
With pre-aggregated and pre-calculated measures, complex queries can be
executed efficiently, providing quick access to critical business information.
 Scalability and Flexibility: Data warehouses can handle large volumes of
data and support scalable growth. They are designed to accommodate
additional data sources and changes in business requirements, allowing
organizations to adapt and expand their data storage and analysis capabilities
as needed.
 Historical Data Analysis: Data warehouses store historical data over an
extended period, enabling trend analysis, pattern identification, and
comparative studies. This historical perspective provides valuable insights into
long-term business performance, customer behavior, and market trends.

Limitations of Data Warehousing:

 Cost and Complexity: Building and maintaining a data warehouse can be
costly and complex. It requires significant investments in hardware, software,
infrastructure, and skilled resources. The development, integration, and
ongoing maintenance of data warehouse systems demand careful planning
and budget allocation.
 Data Integration Challenges: Integrating data from multiple disparate
sources can be a complex task. Variations in data formats, structures, and
semantics can pose challenges for data integration and may require extensive
data transformation and cleansing efforts.
 Data Latency: Data warehouses operate on scheduled ETL processes, which
may introduce a delay between the availability of data in source systems and
its availability in the data warehouse. Real-time or near-real-time data updates
may not be achievable in certain scenarios, limiting the timeliness of insights
derived from the data warehouse.
 Data Security and Privacy: Data warehouses store sensitive and
confidential business data. Ensuring data security, privacy, and compliance
with regulatory requirements is crucial. Organizations must implement
appropriate security measures and access controls to protect data from
unauthorized access or breaches.
 Dependency on Data Sources: Data warehouses rely on the availability and
reliability of data from source systems. If the quality or consistency of source
data is compromised, it can impact the accuracy and reliability of insights
derived from the data warehouse. Regular monitoring and data governance
processes are necessary to maintain data integrity and address any issues in
the source systems.

Significance of Data warehouse with business intelligence


The significance of a data warehouse in conjunction with business intelligence (BI)
lies in their ability to work together to provide valuable insights and support informed
decision-making within an organization. Here are some key reasons why a data
warehouse is significant in the context of business intelligence:

 Centralized Data Storage: A data warehouse serves as a central repository
for structured, integrated, and historical data from various operational systems
within an organization. It consolidates data into a consistent and standardized
format, making it easier for BI tools and applications to access and analyze
the data.
 Data Integration and Consolidation: Data warehouses integrate data from
multiple sources, such as transactional databases, CRM systems, ERP
systems, and more. By consolidating data into a single source of truth, data
warehouses provide a unified view of the organization's data, eliminating data
silos and inconsistencies.
 Data Quality and Consistency: Data warehouses employ ETL (Extract,
Transform, Load) processes to cleanse, validate, and transform data. This
ensures data quality, consistency, and integrity, enabling accurate analysis
and reporting. Data cleansing techniques eliminate errors, duplicates, and
inconsistencies, leading to reliable and trustworthy data.
 Historical Data Analysis: Data warehouses store historical data over time,
allowing for trend analysis, pattern identification, and comparative studies.
This historical perspective enables organizations to identify long-term trends,
make predictions, and derive actionable insights from the data.
 Faster Query and Reporting Performance: Data warehouses are designed
for optimized query performance, enabling fast and efficient data retrieval.
With pre-aggregated and summarized data, BI tools can execute complex
queries and generate reports quickly, empowering users to access critical
business information in a timely manner.

What is Data Mart?


A data mart is a subset of a data warehouse that focuses on a specific functional
area or department within an organization. It is a smaller, more specialized database
that contains a subset of data from the overall data warehouse. Data marts are
designed to meet the specific needs of a particular business unit, enabling easier
and faster access to relevant data for analysis and reporting purposes.
Reasons for creating a data mart:

 Improved Performance: By creating a data mart, organizations can improve
query performance and response time for users within specific departments or
functional areas. Data marts contain a subset of data relevant to the specific
needs of the business unit, allowing for quicker data retrieval and analysis.
 Simplified Data Access: Data marts provide a simplified and user-friendly
interface for accessing and analyzing data. They are designed with the
specific requirements of a particular business unit in mind, making it easier for
users to navigate and find the data they need without having to sift through
the entire data warehouse.
 Enhanced Data Relevance: Data marts are tailored to the specific needs of a
department or business unit, ensuring that the data included is relevant and
meaningful for their analytical and reporting requirements. This improves the
accuracy and relevance of data analysis, enabling better decision-making
within the department.
 Departmental Autonomy: Data marts allow individual departments or
business units to have more control over their data and analysis processes.
They can define their own data models, schemas, and hierarchies based on
their specific needs, without impacting other departments' data structures or
access.
 Creates collective data by a group of users
 Easy access to frequently needed data
 Ease of creation
 Improves end-user response time
 Lower cost than implementing a complete data warehouse
 Potential clients are more clearly defined than in a comprehensive data
warehouse
 It contains only essential business data and is less cluttered.

Types of Data Marts:


Dependent Data Mart: A dependent data mart is created by extracting data from a
centralized data warehouse. It relies on the data warehouse for data integration and
transformation processes. This type of data mart is easier to maintain as it leverages
the existing infrastructure and processes of the data warehouse.
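One simple way a dependent data mart can be carved out of a warehouse is as a filtered view over the warehouse tables. Below is a hedged sketch using Python's sqlite3, with all table and department names assumed for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A (toy) centralized warehouse table covering all departments.
conn.execute("CREATE TABLE warehouse_sales (region TEXT, dept TEXT, amount REAL)")
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)",
                 [("East", "Marketing", 100.0),
                  ("West", "Marketing", 50.0),
                  ("East", "Finance", 300.0)])

# A dependent data mart: a department-specific subset of the warehouse,
# exposing only the rows and columns Marketing needs.
conn.execute("""CREATE VIEW marketing_mart AS
                SELECT region, amount FROM warehouse_sales
                WHERE dept = 'Marketing'""")

rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region").fetchall()
print(rows)  # [('East', 100.0), ('West', 50.0)]
```

In practice a data mart is often a physically separate, refreshed table rather than a live view, but the subset-of-the-warehouse idea is the same.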

Independent Data Mart:

§ An independent data mart is built separately from the data warehouse and is
created specifically for a particular department or business unit.
§ It may use data from multiple sources, including the data warehouse, but it
has its own data integration and transformation processes.
§ This type of data mart provides more autonomy and flexibility but requires
additional effort for data integration.

Hybrid Data Mart: A hybrid data mart combines elements of both dependent and
independent data marts. It leverages the data warehouse for common data elements
while also incorporating department-specific data integration and transformation
processes. This type of data mart strikes a balance between central control and
departmental autonomy.
Difference between OLAP and OLTP

Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists of only operational current data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for Data Mining, Analytics, Decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
Purpose | It serves the purpose to extract information for analysis and decision-making. | It serves the purpose to insert, update, and delete information from the database.
Volume of data | A large amount of data is stored, typically in TB, PB. | The size of the data is relatively small as the historical data is archived, in MB, GB.
Queries | Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast as the queries operate on 5% of the data.
Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database.
Backup and Recovery | It only needs backup from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously.
