Star Schema and Technology Review

Musa Sami Ata Abdel-Rahman


Supervisor: Professor Sebastian Link
Postgraduate Project in Computer Science

Abstract— A data warehouse (DW) is a collection of decision support technologies, aimed at enabling the knowledge worker to make better and faster decisions. We are living in the era of Big Data, and several evolutions have happened in terms of data resources and data formats over the last decades. That forces academia and industry to search deeply for alternatives and to provide better solutions in response to these new factors, in a way that keeps the DW able to work in the same way as, or better than, the past solutions that were provided through traditional databases. This paper concentrates on the most important component of DW systems, the star schema. The paper explores the data warehouse concept and OLAP, compares the star schema with other data modelling approaches such as 3NF and the snowflake schema, and shows why we need the star schema. Within that we highlight the evolution of database systems and present several studies on converting the star schema from traditional DBMSs to other types of database systems. In the end, we make a review from the industry and show what the commercial vendors provide to implement the star schema within their products.

Index Terms— Data warehouse, Star Schema, OLAP, NoSQL, NewSQL, RDBMS, Business Intelligence

1. INTRODUCTION

We are living in times when everything is increasing, especially when we talk about data. The universe is moving fast, and more data is generated exponentially every day. With all the data coming your way, you need the right tools to deal with the growing amounts of data. The growth in the quantity and diversity of data has led to data sets larger than is manageable by conventional, hands-on management tools. To manage these new and potentially invaluable data sets, new methods of data management and new applications, in the form of predictive analytics, have been developed. According to [1], the most important asset of any company is its information, which is always kept in two main forms: operational systems of record and the data warehouse. Large corporations are overwhelmed with data. They invest in data facilitation to support them in the decision-making process, including reporting, dashboards, advanced visualisation, end-user self-service and initiatives strategic to business intelligence. The main concern for these companies is not only how to keep data, but how to narrow the gap between the time of data generation and the time of gaining the benefits hidden within it. In other words, the challenge for companies today, many beholden to legacy systems and IT bureaucracy, is that this growing demand for analytics often outstrips their ability to service it. That makes dealing with Big Data analytics another challenge that these corporates are facing, especially when they have multiple systems and a wide range of distributed data resources, so the effort will not only be in unlocking the potentially useful information, but also in gathering, processing and consolidating these data before starting the analytics process. Generally, data management is a set of applications of quantitative and qualitative methods to solve relevant problems and predict outcomes. One of the salient revelations of today, with the vast and growing amount of data, is that domain knowledge and analysis cannot be separated. It is mandatory to have a full background about the situation that we need to analyze, and that can be done by aligning the technical side and the business side together to gain results. It is important for the technical team to be aware of the business domain, and at the same time, the business should be aware of the new ideas and technologies that can be effective for delivering analysis. To satisfy that, we need to know about the technologies that can support corporates in managing their data and the facilities that can help decision makers benefit from the data warehouse, since it is an essential element of the decision support process [2].

The evolution of big data has driven several types of research on the data warehouse to enhance data delivery. These researches are done on the concept itself or go deeply into each data warehouse component, like data modelling or the ETL (Extract, Transform, and Load) process. Researchers are not only within academia but also within the technology vendors. Many commercial products and services are now available, and all of the principal database management system vendors now offer solutions in these areas. Recently, some of them asked whether the data warehouse is still alive or not. According to Eckerson and Cloudera, data warehousing is not dead, but it is struggling. Many challenges of legacy data warehousing force it to evolve and adapt to fit the realities of modern data management and to overcome the challenges of scalability and elasticity, data variety, data latency, adaptability, data silos and data science compatibility. But that does not prevent, as we said before, the data warehouse from being enhanced and becoming part of solutions for other new data management concepts, like the data lake. According to Hortonworks, both environments are complementary to one another in modern business. We can use the features and facilities of both to cross the big data challenges by having integration between both of them and breaking the gaps that they have. This paper tries to explore the data warehouse, focusing on the most important concept that it provides, the star schema, and how the current commercial products available are dealing with
the star schema and how they support this concept. We should not forget that the database concept has changed over at least the last 10 years. Before, we were dealing with classically structured databases, which serve relational and structured data, but now we have other types within the industry, for semi-structured and unstructured data. We will take these new facts into consideration and see how vendors deal with them to implement the star schema, or whether they provide a solution to have a star schema at all.

2. DATA WAREHOUSE OVERVIEW

A. Definition

We have different definitions for the data warehouse. A widely accepted definition is that a data warehouse is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions [1]. A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making [2]. According to Ralph Kimball and Margy Ross [1], the data warehouse and the operational systems are typically correlated, but have different aspects. The operational systems are optimized to process transactions quickly; on the other hand, the data warehouse (DW) is responsible for evaluating the organization's performance over time, by keeping an eye on the operational systems which feed the data warehouse. Different terms are also used to distinguish between the two concepts: for the data warehouse we use OLAP, online analytical processing, and OLTP, online transaction processing, describes the operational systems. The data warehouse has the following main goals to achieve [1]:

• it must make an organization's information easily accessible
• it must present the organization's information consistently
• it must be adaptive and resilient to change
• it must be a secure bastion that protects our information assets
• it must serve as the foundation for improved decision making
• the business community must accept the data warehouse if it is to be deemed successful

B. OLAP and OLTP

As an extension to the definition and characteristics of the data warehouse, we need to define the distinction between OLAP and OLTP. In general, the main differences are:

• data contents and system orientation
• database model design

These two points summarize the differences between both systems in terms of performance and workload. That comes from the type of operations that we use on each system, the size of the data, and the purposes of the design adopted in each of them. OLTP mainly uses insert and update data manipulations (DML), and the database is designed to deal with a specific objective, which means distributed facts within several systems. In contrast, OLAP uses SQL queries for reading data and does not need the kind of data manipulation operations OLTP does; that comes from the design used, which mainly guarantees a consolidation of data from heterogeneous sources (i.e. several OLTP sources).

The database modelling in both systems is different, and we can say it is the major factor that controls and leads the differences between the two. OLTP uses ER modelling, 3NF (3rd normal form) or higher. It eliminates data duplication and provides efficient storage for detailed data, but it is not suitable for data analysis. When we look at figure 1, which represents a classic sales system, the ER model has 7 tables compared with the 4 tables that the dimensional model has. It is clear that ER modelling would need a complex query to gain a high-level view of the data compared with the dimensional model, since we need to make 6 join operations compared with 3 joins on the other model. We know that join operations decrease performance, especially when we have a huge number of records [3]. We can also notice that the dimensional model is simple and clear for any user, unlike the ER model, which needs a technical user to understand it, and that is not helpful in data analytics.

Fig. 1. ER model versus dimensional model [3]

Decision support systems concentrate mainly on having multidimensional data, which needs special operations like roll-up (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies, slice and dice (selection and projection), and pivot (re-orienting the multidimensional view of data) [2].
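To make these operations concrete, the following is a minimal SQL sketch of drill-down, roll-up and slice over a hypothetical sales fact table and date dimension (the table and column names are illustrative, not taken from any specific system):

```sql
-- Drill-down: revenue at the finer month level.
SELECT d.year, d.month, SUM(s.revenue) AS revenue
FROM   sales s
JOIN   dim_date d ON s.date_key = d.date_key
GROUP  BY d.year, d.month;

-- Roll-up: the same measure aggregated to the coarser year level.
SELECT d.year, SUM(s.revenue) AS revenue
FROM   sales s
JOIN   dim_date d ON s.date_key = d.date_key
GROUP  BY d.year;

-- Slice: fix one dimension value before aggregating.
SELECT d.year, SUM(s.revenue) AS revenue
FROM   sales s
JOIN   dim_date d ON s.date_key = d.date_key
WHERE  d.quarter = 1
GROUP  BY d.year;
```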
C. Data Warehouse architecture

A typical architecture is shown in figure 2. As we can see, it is a combination of systems and operations. The OLTP systems and external data sources are the main input for the data warehouse. The ETL process is mainly responsible for data manipulation, in terms of extracting data from its sources, transforming it and loading it. Data cleaning is one of the main duties of ETL, which includes handling missing values and parsing into a standard format. You may find in other resources that another component is added between OLTP and the data warehouse, called the "Operational Data Stage" (ODS) [1]. It is mainly an additional step that can be added between the OLTP systems and the data warehouse. Instead of transferring data directly, we can put the ODS in the middle, which will contain the data needed for the DW system, and after doing the cleaning part, we can then transfer it to the DW system. Another component that you may see is the data mart. It is derived from the data warehouse, but is higher level and usually oriented to a specific business line or team. It enables faster rollout, since data marts do not require enterprise-wide consensus, but they may lead to complex integration problems in the long run if a complete business model is not developed [2].

Fig. 2. Data Warehousing Architecture [4]

The most important parts of the DW architecture that are related to our study are data modelling, the database management system, and the methods used to fetch our data. The data modelling within the data warehouse should reflect the OLAP structure, which means a multidimensional model that can facilitate complex analyses and visualization. It is an important technique for making databases simple, delivering data that is understandable to the business users, and delivering fast query performance [1]. We have several database management systems with utilities and characteristics used to implement the source and target data stores. Basically, they should satisfy the requirements needed to implement a data warehouse system. That includes supporting a dimensional data model and providing a language that can be used to fetch and process the data [4]. The evolution in database systems brings other languages besides SQL (Structured Query Language) and its extensions, like object-oriented ones. Since a data warehouse reflects the business model of an enterprise, an essential element of a warehousing architecture is metadata management. It is divided into three types: administrative metadata, business metadata, and operational metadata [2]. Administrative metadata is used to describe the format, structure, syntax, and semantics of data, such as justification and spacing. Business metadata includes business terms and definitions, ownership of the data, and charging policies. Operational metadata includes information that is collected during the operation of the warehouse, such as usage statistics, error reports, and audit trails.

3. DIMENSIONAL MODELING REVIEW

A. Basic concept

Dimensional modeling uses two major types of tables: fact tables and dimension tables. Fact tables hold numeric data that can be summarized as needed and contain information about specific business processes or events within these processes. Each row in a fact table corresponds to a measurement event and represents a specific level of detail at which the facts may be defined and summarized, referred to as the grain [5]. The measures have different types [6]:

• Additive: measures that can be added across all dimensions
• Nonadditive: measures that cannot be added across any dimension
• Partially additive: measures that can be added across some dimensions and not others

Dimension tables are integral companions to a fact table. The dimension tables contain the textual context associated with a business process measurement event and describe the who, what, where, when, how, and why associated with that event [1]. In the dimensional model, the data is described by the relation between the fact table and the dimension tables. The fact table contains the content of the data warehouse, and it is surrounded by several dimension tables containing the descriptions. Besides measures, the fact table also has keys, which consist of a group of foreign keys (FK) that point to the primary keys of the dimension tables associated with this fact table to enable business analysis [6].

Fig. 3. Example of a fact table with dimension tables (star schema)

Figure 3 gives an example that clarifies the relationship between a fact table and dimension tables. The join cardinality of dimension tables to fact tables is one-to-many, because each record in a dimension table can describe many records in the fact table. The dimension table has another important function: each dimension table consists of columns that correspond to attributes of the dimension, and the attributes of a dimension may be related via a hierarchy of relationships. For example, in figure 3 we have a dimension table called DimDate, containing (Year, Quarter, Month, Week, Day), connected to the fact table (online-Sales), which contains a measure called Revenue; we can then summarize the revenue at different time levels by drill-down or roll-up.
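As a sketch, the star schema described above could be declared as follows; DimDate, the online-sales fact table and the Revenue measure come from the example in the text, while the remaining tables and columns are purely illustrative:

```sql
CREATE TABLE DimDate (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    quarter  INTEGER,
    month    INTEGER,
    week     INTEGER,
    day      INTEGER
);

CREATE TABLE DimCustomer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    city          VARCHAR(50)
);

-- The fact table holds the measures plus foreign keys to the dimensions.
CREATE TABLE OnlineSales (
    date_key     INTEGER NOT NULL REFERENCES DimDate (date_key),
    customer_key INTEGER NOT NULL REFERENCES DimCustomer (customer_key),
    quantity     INTEGER,
    revenue      DECIMAL(12, 2)  -- additive measure
);

-- Rolling revenue up to the year level then follows the DimDate hierarchy.
SELECT d.year, SUM(f.revenue) AS revenue
FROM   OnlineSales f
JOIN   DimDate d ON f.date_key = d.date_key
GROUP  BY d.year;
```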
The strength of any dimensional model comes from the summarization that the model can provide. That can be done with highly tuned queries between the fact table and dimension tables, or by building separate summary fact tables with specific dimensions [2]. A materialized view is another approach to summarizing that combines physical implementation and queries. It is a static view that contains the results of an aggregated query and has the same concept as a summary table. Instead of executing complex queries, the materialized view is represented as a database object. It facilitates access to the fact table and the desired dimensions, and it reflects any change in the source data (the fact table) without re-executing the complex query that it represents [2].
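A minimal sketch of this idea, using the generic CREATE MATERIALIZED VIEW form (exact syntax and refresh behaviour differ between vendors; the table names are the illustrative ones used above):

```sql
CREATE MATERIALIZED VIEW mv_revenue_by_year AS
SELECT d.year, SUM(f.revenue) AS revenue
FROM   OnlineSales f
JOIN   DimDate d ON f.date_key = d.date_key
GROUP  BY d.year;

-- End users query the view like an ordinary table; the DBMS (or a scheduled
-- refresh) keeps it aligned with the fact table instead of re-running the join.
SELECT year, revenue FROM mv_revenue_by_year WHERE year = 2019;
```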
Some data architects think that this separation of the data into facts and dimensions leads to complex queries, and that this affects system performance. As an alternative solution, they came up with the flat table, which collapses all entities in the data model down into the terminal entities. The flat table minimizes the number of tables in the database and therefore minimizes the possibility that joins will be needed in end-user queries [6]. With this approach, there is no loss of the information that the original data has, but it leads to other problems like massive redundancy, and it will not contain any aggregations. In fact, it may lead to aggregation errors when there are dependency relationships between kernel entities.
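For illustration, a flat table can be sketched as one wide, heavily redundant table built once during ETL from the illustrative star schema above (CREATE TABLE AS syntax varies slightly between vendors):

```sql
CREATE TABLE SalesFlat AS
SELECT f.revenue, f.quantity,
       d.year, d.quarter, d.month, d.week, d.day,
       c.customer_name, c.city
FROM   OnlineSales f
JOIN   DimDate     d ON f.date_key     = d.date_key
JOIN   DimCustomer c ON f.customer_key = c.customer_key;
```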
B. Star schema and Snowflake schema

Data warehouses are typically based on relational star schemas or snowflake schemas, an approach to implementing multidimensional databases using relational database technology. When we need to talk about either of them, we cannot neglect the other, especially when we know that the snowflake schema can be considered a special type of extended star schema, since it represents a connected lattice of star schemas.

The star schema consists of a central fact table surrounded by dimension tables, as shown in figure 3 [7]. When the model contains multiple fact tables sharing dimension tables, as shown in figure 4, we call it a constellation schema or galaxy schema [8].

Fig. 4. Galaxy schema

According to Kimball, the design approach for the star schema is based on the analysis of business process requirements: identifying the grain of the data, defining the relevant facts that need to be aggregated, defining the dimensional attributes to aggregate by, and then forming a star schema based on these [6]. A dimension table in the star schema represents a set of attributes which are not normalized, which means that the star schema does not support a direct attribute hierarchy [2], and that is the main difference between the star model and the snowflake model. The snowflake schema normalizes its dimension tables, and that gives a multi-level hierarchy, unlike the star schema, which gives a single level of hierarchy [4]. When we compare figure 3 and figure 5, we can see that no change happens to the fact table, but the approach for the dimension tables is different. Dependency chains in the star schema are collapsed, while in the snowflake schema they are expanded out. The snowflake schema is closer to third normal form (3NF), and that is reflected in breaking up the dimension tables to avoid redundancy, which is done by representing each level as a separate table that includes a key and the descriptive attributes of the level [6].

Fig. 5. Snowflake schema example
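As a sketch of what this normalization looks like, the date hierarchy of the earlier illustrative schema could be snowflaked into one table per level (here the fact table is assumed to carry the lowest-level month key instead of a single date key; all names are illustrative):

```sql
CREATE TABLE DimYear (
    year_key INTEGER PRIMARY KEY,
    year     INTEGER
);

CREATE TABLE DimQuarter (
    quarter_key INTEGER PRIMARY KEY,
    quarter     INTEGER,
    year_key    INTEGER REFERENCES DimYear (year_key)
);

CREATE TABLE DimMonth (
    month_key   INTEGER PRIMARY KEY,
    month       INTEGER,
    quarter_key INTEGER REFERENCES DimQuarter (quarter_key)
);

-- A yearly roll-up now has to walk the hierarchy with extra joins.
SELECT y.year, SUM(f.revenue) AS revenue
FROM   OnlineSales f
JOIN   DimMonth   m ON f.month_key   = m.month_key
JOIN   DimQuarter q ON m.quarter_key = q.quarter_key
JOIN   DimYear    y ON q.year_key    = y.year_key
GROUP  BY y.year;
```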
C. Star schema versus Snowflake schema

Both models can be used for data warehouse systems. In general, the star schema is the most used, but that does not mean it is free of drawbacks. The star schema has several advantages: it is simple and easy to understand. It reduces the number of tables in the database and the number of relationships between them, so the number of joins required in user queries is minimized, and that leads to higher performance [6]. Finally, the denormalized approach that the star schema takes makes the aggregation of measures straightforward, and that makes the star schema a preferred choice for business people. However, the star schema has some drawbacks. For example, it does not model hierarchies adequately, since the hierarchy structure is not clear and additional information is required to represent it. Sometimes it is difficult to associate attributes with their corresponding levels. Moreover, when we have many levels in hierarchies, the number of attributes is at least as large as the hierarchy depth, making the hierarchy structure difficult to understand [9].

The snowflake schema is preferred by technical people, since the normalized tables are easy to maintain and optimize storage space. Besides that, it represents hierarchical structures better. Every level can be easily distinguished and, further, levels can be reused between different hierarchies. Additionally, in this representation, specific attributes can be included in the different levels of a hierarchy [9]. On the other hand, snowflake schemas are less performant for querying, due to the joins that are needed for combining the data scattered in the various tables composing a hierarchy. The 3NF that the snowflake schema has would bring the same disadvantages that we discussed before in section 2. The ETL process these days uses different techniques like pipelining, partitioning and parallelism to avoid sequential loading [2]. With the snowflake schema, the ETL process would be affected, since the snowflake is more complex in design and cannot be parallelised, because the dependencies within the model restrict it.

Both models are implemented in real life, and some people, like Charles D. Tupper [6], do not think that one of them is the best solution; rather, the optimal one is a compromise
between the two options. The business and the use case can be the main factors in the decision. According to Oracle [10], Ralph Kimball proposes three cases where a snowflake implementation is not only acceptable but is also the key to a successful design:

• large customer dimensions where, for example, 80 per cent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 per cent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions
• financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products
• multi-enterprise calendar dimensions, because each organization has idiosyncratic fiscal periods, seasons, and holidays

Furthermore, Kimball usually advises that it is not a good idea to expose end users to a physical snowflake design, because it almost always compromises understandability and performance. In his book (The Data Warehouse Toolkit) [1], he encourages not using the snowflake schema, since it does not have the ease of use of the star schema. Even the advantages that the snowflake provides, like saving disk space and the ability to browse within a dimension, are in Kimball's view minor issues compared with the primary objectives that the design should handle: performance, optimization and ease of use. We can sacrifice these advantages to gain the primary objectives [1].

Complexity is not only determined by querying and loading, but also by the changes that can be made in the future. For that purpose, Kaas, Pedersen and Rasmussen [11] made a study about schema evolution for stars and snowflakes. They investigated how both models respond to changes, instance adaptation and the impact on existing queries, by implementing eight evolution operations that cover insertion and deletion of dimensions, levels, dimension attributes, and measure attributes. They came to the conclusion that in all cases the star schema was either superior to the snowflake schema or had the same problems, and it is considerably more robust towards schema changes than the snowflake schema.

D. OLAP and Star Schema

Within the research, I found that in the past there was a separation between the OLAP concept and the star schema. In fact, OLAP was representing the multidimensional model that the decision makers were looking for, and usually they presented OLAP logically as a cube, as in figure 6. We can see that it gives highly interactive support to decision makers by providing the functionality needed for summarizing, viewing, and consolidating the information available in data warehouses [1]. At that time, vendors were implementing OLAP in two ways: relational OLAP (ROLAP) and multidimensional OLAP (MOLAP). The physical implementation for ROLAP is simply a star schema, and MOLAP is an array format. MOLAP can be considered the nearest approach to the cube model, and many vendors like Oracle, Microsoft and IBM provided MOLAP solutions, but the hard effort to maintain and administer them, especially with the evolution of the data and its amount, made vendors concentrate on the ROLAP approach, especially when you know that MOLAP has two storage layers and indexing techniques that put high limitations on DW performance and data scalability [2]. So with the evolution in database systems, you will find papers that discuss the implementation of OLAP on those systems, and they concentrate on ROLAP (star schema) rather than MOLAP (the OLAP cube). According to Alfredo Cuzzocrea, Ladjel Bellatreche and Il-Yeol Song [12], the implementation of the OLAP cube over Big Data is considered an open problem and an active research trend in the field of data warehousing. They highlighted the reasons and suggested future research directions that could serve in this direction.

Fig. 6. OLAP cube example

E. Query on Star schema

The main idea of the star schema is making the data available in a simple model, which simplifies loading and accessing the data. We have seen that the star schema structure distributes the data into a large fact table connected to several dimension tables. In general, when we make a query on a star schema for a specific pattern, it does not mean only reading the data as it is, but also includes data processing based on business needs. In fact, it is a reflection of the OLAP operations that we discussed before in section 2-B. The procedure of executing a query includes the following steps [5] (a SQL sketch follows the list):

• Selection of required facts: first, the end-user decides which facts should be selected by identifying the fact tables to be used.
• Selection of required dimensions: by joining dimensions to the query, end-users add the required context.
• Limiting the scope of facts: the facts are filtered according to dimensional values (either from the fact table or from joined dimensions).
• Summarization of facts: the raw facts are summarized with aggregate functions, such as SUM() or COUNT().
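A minimal sketch of these four steps on the illustrative OnlineSales star schema used earlier (names and filter values are hypothetical):

```sql
SELECT d.year,
       c.city,
       SUM(f.revenue) AS total_revenue,  -- summarization of facts
       COUNT(*)       AS sale_count
FROM   OnlineSales f                                    -- required facts
JOIN   DimDate     d ON f.date_key     = d.date_key     -- required dimensions
JOIN   DimCustomer c ON f.customer_key = c.customer_key
WHERE  d.year >= 2018                                    -- limiting the scope
  AND  c.city = 'Auckland'
GROUP  BY d.year, c.city;
```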
The first thing that comes to our mind when we talk about the query is SQL, but we will see later that different approaches came later, based on the evolution of database management systems and data warehouse solutions. A query on a star schema is called a star query, and joining its tables is called a star join [1]. The art of executing the query efficiently is called optimization,
and measuring the speed of the query is called performance. These two factors are the main indicators that differentiate between data warehouse vendors. They are struggling to provide access to the data with an optimal query that has a high speed. Optimization is vital to the performance of the database system. Each vendor has its own query optimizer that makes an execution plan to give a robust query and utilizes the system resources to provide a result with high performance (speed) [5]. But how does a star query affect performance? Simply by three operations: star joins, filtering, and aggregations. When you make a query that includes these actions, you initiate two implicit operations, data loading and data scanning. The mechanism of these two operations differs based on vendor specifications, and it relies on several factors, like the algorithms used, the hardware and the storage techniques. So when we define a query with multiple joins, with several filter conditions, and ask for aggregations over several dimensions, it raises the workload internally in terms of loading and scanning, and that can be considered a physical definition of a complex query [2]. For this reason, providing means for increasing the performance of a data warehouse for analytical queries and other types of data processing is one of the important research and technological areas [13]. Many solutions were provided in the past to enhance performance related to the data structure, like providing materialized views, summary tables, or even a flat table within the star schema [2], which means transferring the latency cost to the ETL process. Providing indexes is another solution to enhance the join and scan operations, like bitmap indexing (which is not supported by the snowflake schema according to Kimball [1]), join indexing, and bitmap-join indexing [2].
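As an illustration of this family of techniques, the sketch below uses Oracle-style syntax for a bitmap index and a bitmap join index on the illustrative schema from section 3 (other vendors expose similar features with different syntax):

```sql
-- Bitmap index on a low-cardinality dimension attribute.
CREATE BITMAP INDEX ix_dimdate_quarter ON DimDate (quarter);

-- Bitmap join index: pre-joins the fact table to a dimension attribute so a
-- star join filtering on city can be resolved largely from the index.
CREATE BITMAP INDEX ix_sales_customer_city
    ON OnlineSales (DimCustomer.city)
    FROM OnlineSales, DimCustomer
    WHERE OnlineSales.customer_key = DimCustomer.customer_key;
```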
Later, new challenges came to data warehouse and business intelligence (BI) systems, like making these systems near real-time, which means keeping them in sync with the original sources by enhancing the ETL process and the query process. Query transformation is another approach that serves in this direction. It is based on conducting individual tasks such as translating values, deriving new calculated fields, joining tables at the source, converting date fields, and reformatting field sizes, table names and data types [14]. To make that practical, all major vendors offer various strategies in parallel processing, data partitioning, and query extensions [2]. These strategies differ between vendors, since the performance area relies on three DW components: the hardware used, the data storage, and the database management system. It is important to know how each vendor uses these components to form its solution; that will lead us to know how they implement the star schema physically and in which direction they concentrate to enhance star query performance [13].

4. DATABASE MANAGEMENT SYSTEMS

The evolution of database systems through the past years was mainly related to the data content and responding to the Big Data phenomenon. Several types were provided in order to systematically and spontaneously tackle the growing varieties and vagaries of data structures, schemas, sizes, speeds, and scopes. It is important to recognize these types in order to figure out how the consequences of that evolution affect star schema implementation. The most commonly used systems are relational database systems (RDBMS), Big Data management with distributed file systems (DFS), the non-relational data model (NoSQL), and NewSQL. A different classification can also be made.

A. Relational database systems (RDBMS)

These are the most commonly used systems, and they had the biggest share of the market at the time object-oriented databases were created. They come from the relational model, which is based on representing data in tables with rows and columns. A table must contain keys: a primary key to differentiate between rows, and foreign keys that refer to other tables to create a link between their data. These links are governed by integrity rules, which are responsible for ensuring that the data they contain stays accurate and is always accessible [15]. RDBMSs typically use SQL and stored procedures to manage structured data and allow JOIN operations to combine the different tables. Both of these make RDBMSs able to handle queries and transactions that ensure efficient, correct, and robust data processing even in the presence of failures [4] [15]. Key features of RDBMSs are ACID (Atomicity, Consistency, Isolation, and Durability) transactions [16], and besides the relational model, RDBMSs can also handle the dimensional model when they deal with structured data, and they are applicable to some semi-structured data. However, the evolution of Big Data shows that the RDBMS has shortcomings. With a static schema, non-linear query execution time and unstable query plans, the RDBMS is poor in scalability. It is very expensive and difficult to handle. To scale a relational database, it has to be distributed onto multiple servers, which makes handling tables across different servers difficult and leads to complex queries that affect query performance.

B. Data file distribution and data engines

In the past, parallelism was the most significant technique used for processing massive databases, and all major vendors of database management systems offered data partitioning and parallel query processing [2]. Later, with the evolution in data amount and structure, which became known as Big Data, the traditional databases were not capable of handling unstructured data and high volumes of real-time datasets, and that forced solutions that combine structured and unstructured data, or in other words, handle the main attributes of Big Data: volume, velocity, and variety (the 3Vs) [17]. Distributed file systems (DFS) came to solve the shortages in scalability and performance that the RDBMS has. They are responsible for storing and sharing files over a set of connected nodes with reliability, scalability, availability and efficiency. Google started that approach by providing the Google File System (GFS), and based on that, Hadoop offered its own Distributed File System (HDFS). Basically, it is a framework (ecosystem) to manage the processing and the hardware available, combining the software platform and the hardware to get flexibility in storing and processing large volumes of both
structured and unstructured data, by taking data transactions in whatever form and processing those transactions across commodity hardware in parallel. The main workload in this approach relies on the data processing model and data engines rather than on data storing and formatting. It started with MapReduce, which uses a divide-and-conquer strategy to make computations [17]. The simplicity of its principle and basic concept made MapReduce a key to success for processing, analyzing and managing large datasets in several implementations, but that did not remove the difficulties it has in development, optimization, and maintaining its functions, especially in large-scale projects. Therefore, these difficulties gave rise to query platforms based on MapReduce, a layer constructed directly on top of MapReduce, designed to translate queries and scripts into executable native MapReduce jobs and to solve other limitations like the join operation and handling streaming data. Platforms like Hive, Impala, and Spark not only give a SQL-like query interface but also enable treating DFS systems as schemas. Later, several Big Data processing models were suggested in combination with MapReduce, or to replace MapReduce, such as the Resilient Distributed Dataset (RDD) and Bulk Synchronous Processing (BSP) [17], and the evolution that happened to HDFS was done through these processing models, by having a platform and a SQL-like query interface.

For example, Spark started with the RDD to extend Hadoop to a general-purpose framework and to replace the MapReduce model with a better solution. It is an immutable distributed collection of elements of the data, partitioned across the nodes in the cluster, that can be operated on in parallel with a low-level API that offers transformations and actions. Later, Spark came up with the DataFrame, based on the RDD. Instead of dealing with low-level programming, the DataFrame is better in terms of having an interface over RDDs, especially when you have structured data. It has SQL for high-level data processing, and that gives the ability to treat data within the DFS similarly to the data within a traditional database, as a schema.
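To illustrate the "SQL-like interface over a distributed file system" idea, the sketch below uses Hive-style syntax; the path, file format and table layout are assumptions for the example, not taken from the text:

```sql
-- Expose files already sitting in HDFS as a queryable table.
CREATE EXTERNAL TABLE online_sales (
    date_key     INT,
    customer_key INT,
    revenue      DECIMAL(12, 2)
)
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/online_sales';

-- The engine (Hive, Impala, Spark SQL, ...) translates an ordinary star-style
-- aggregation into distributed jobs over those files.
SELECT date_key, SUM(revenue) AS revenue
FROM   online_sales
GROUP  BY date_key;
```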
Fig. 7. Evolution of data and big data process engines

C. Non-relational data stores (NoSQL)

NoSQL can be read as "No SQL" or "Not only SQL". These are distributed, non-relational database systems, not primarily built on tables, since they take many forms, such as document-based, graph-based, and object-based, and they generally do not use SQL for data manipulation [4]. NoSQL has several motivations that encourage vendors to adopt it in their data warehouse solutions [18]:

• Simple and flexible: NoSQL data stores offer flexible schemas or are sometimes completely schema-free, and are designed to handle a wide variety of data structures
• Ability to scale horizontally over many commodity servers: some data stores provide data scaling, while others are more concerned with read and/or write scaling
• High availability: many NoSQL data stores are aimed at highly distributed scenarios, and consider partition tolerance as unavoidable
• Unlike RDBMSs, they do not support ACID transactions as provided by an RDBMS; NoSQL data stores are sometimes referred to as BASE systems [19] (Basically Available, Soft state, Eventually consistent)

NoSQL databases can be classified into four categories: key-value stores, document stores, column stores, and graph stores. You can find other papers that classify these categories as aggregate-oriented and non-aggregate-oriented [19]. In general, choosing one of them for data storage relies mainly on the business needs and on the main features that differ between them, which are summarized at the end of this section in table I.

Key-value stores have a simple data model based on key-value pairs, which resembles an associative map or a dictionary. The basic concept is that a globally distributed hash table has keys that lead to different database servers scattered all over the world. Each data item is converted into a key using some unique formula, which is stored in the lookup table or directory. When the data are needed, the key is converted into the location of the data, and accordingly the data are retrieved [19]. The value is opaque to the data store and can be used to store any arbitrary data, including an integer, a string, an array, or an object, providing a schema-free data model [18]. Along with being schema-free, key-value stores are very efficient, fast, and scalable, but since they allow many data structures, these cannot be easily modelled, which means that they are not suitable for scenarios requiring relations or structures. Any functionality requiring relations, structures, or both must be implemented in the client application interacting with the key-value store [18].

Fig. 8. Key-value store example by André Ribeiro

Document stores provide another derivative of the key-value store data model by using keys to locate documents inside the data store, but they are more complex compared with key-value stores. Like the key-value store, we have a key and a value, but the value in document stores is semi-structured data, and that gives the ability to include hundreds of attributes; the number and type of attributes can differ from row to row, and that makes the value be represented as a document. Most document stores represent documents using the JSON format or some format derived from it. A set of documents is called a collection. Collections
may contain any number of documents, each of which should have a unique key [19]. This is the key feature of document-oriented databases that makes them schema-less, as well as giving them the advantage of porting and storing data of different types easily. Moreover, document stores give the capability of indexing documents based on the primary key as well as on the contents of the documents, and that differentiates this data model from the key-value store model, in which the values are opaque to the data store [18]. The ability to contain complex data structures such as nested objects gives document stores different options for data processing; some of them provide MapReduce, other systems provide SQL processing capabilities like an RDBMS, but that does not mean that they match it in structure, since rows within a table in an RDBMS follow the same schema, while in document stores each document can have a different structure [19].

Fig. 9. Document-store example [16]

Column stores are a hybrid between NoSQL and relational databases. They provide some row-and-column structure, but they do not have the strict rules of relational databases. Some column databases are inspired by key-value databases, but most of them are inspired by Google's Bigtable, an internal distributed data storage system developed by Google [18]. The approach is to store and process data by column instead of by row. Each row is composed of a set of column families, and different rows can have different column families [3]. Some systems store the data belonging to a row together on the same node, and other systems separate them onto several nodes using composite partition keys, like Cassandra [18], which also provides the additional functionality of super columns, formed by grouping various columns together. With this design, column stores are more flexible compared with an RDBMS, since they access precisely the data they need to answer a query rather than scanning and discarding unwanted data in rows, by performing I/O only on the blocks corresponding to the columns that are actually being read [19]. Besides that, a prior definition of columns is not required, which offers huge flexibility in storing any data type [18].

Fig. 10. Column-store example [16]

Graph stores are considered to be the best choice to store data along with relationships. Graph stores originated from graph theory, which is a mathematical concept used to represent a set of objects, known as vertices or nodes, and the links (or edges) that interconnect these vertices [18]. Graph stores use graphs as their data model, and on top of a graph database there may be a key/value store, a columnar database, a BigTable database or a combination of these and other architectures [19]. Graph stores have many features, like efficiency in traversing relationships between different entities. They offer persistent storage of objects and relationships and support simple and understandable queries with their own syntax [16], and that makes them suitable in scenarios such as social networking applications, pattern recognition, dependency analysis, recommendation systems and solving path-finding problems raised in navigation systems. For these reasons, Facebook, for example, implements everything in a graph database [18].

Fig. 11. Graph-store example [16]

TABLE I
SALIENT FEATURES OF VARIOUS NOSQL DATA STORES [18]

Data Store | Performance | Scalability | Flexibility | Complexity
Key-Value  | High        | High        | High        | Low
Column     | High        | High        | Moderate    | Low
Document   | High        | Variable    | High        | Low
Graph      | Variable    | Variable    | High        | High

D. NewSQL

NoSQL databases were designed to address performance and scalability requirements, but they did not concentrate on ACID transactions as provided by the RDBMS. The idea
of combining both sets of features came with a new generation of databases, NewSQL. It is a hybrid solution that aims to bring the relational data model into the world of NoSQL [18]. The main focus of NewSQL systems is to address the issue of scalability of online transaction processing (OLTP) workloads with support for programming model customizability, and to reduce management tasks in SQL databases. NewSQL databases open a path to migrate from traditional SQL systems (RDBMS) to modern SQL systems that have the same NoSQL and Big Data characteristics, performance and scalability [20], by absorbing the data models and data manipulation capabilities of NoSQL databases and extending RDBMS functionalities. That does not mean a full replacement for the RDBMS; they are designed for specific scenarios in which they can be successful, like OLAP and Big Data analytics [21]. The five main characteristics of NewSQL are:

• SQL as the primary mechanism for interaction
• ACID support for transactions
• a shared-nothing distributed architecture
• a non-locking concurrency control mechanism, so real-time reads will not conflict with writes and thus cause them to stall
• an architecture providing much higher per-node performance than is available from traditional RDBMS solutions

NewSQL can be split into three categories, based on the different approaches adopted by vendors: New Architectures, Transparent Sharding Middleware, and Database-as-a-Service [22].

New-architecture databases are newly designed from scratch to achieve performance and scalability. They are designed from a new codebase without any of the architectural baggage of legacy systems. They can be further categorized into general-purpose databases and in-memory databases [23]. The main advantage of this type is that all parts of the system can be optimized for multi-node environments. For example, most NewSQL DBMSs are able to send intra-query data directly between nodes rather than having to route it to a central location, as with some middleware systems. Having a custom engine instead of relying on an off-the-shelf distributed filesystem like HDFS or a storage fabric like Apache Ignite is another advantage, because it gives the DBMS full responsibility for distributing the database across its resources and allows the DBMS to send the query to the data rather than bring the data to the query, which leads to less network traffic, since transmitting the queries typically costs less than transmitting the data. Besides that, managing their own storage also enables such DBMSs to employ more sophisticated replication schemes than what is possible with the block-based replication scheme used in HDFS. That allows these DBMSs to achieve better performance than other systems that are layered on top of other existing technologies. On the other hand, this type of new system can be unattractive to some organizations. Most of them prefer adopting well-known technologies rather than using new ones, because the old ones have been tested and have many more experienced people compared with the new ones.

Transparent sharding middleware systems provide a sharding middleware layer to automatically split databases across multiple nodes. These nodes run the same DBMS, have only a portion of the overall database, and cannot be accessed independently [22]. The centralized middleware component routes queries, coordinates transactions, and manages data placement, replication, and partitioning across the nodes. Each node is responsible for executing queries on behalf of the middleware at its local DBMS instance and returning results. Altogether, these allow middleware products to present a single logical database to the application without needing to modify the underlying DBMS. With this type of system, developers do not need to make any changes to their application to use the new sharded database, because such middleware is often a drop-in replacement for an application that is already using an existing single-node DBMS. The middleware approach is a good solution for scaling, especially when you have a traditional DBMS (RDBMS), but with these RDBMSs we will not get a strong manager that is optimized for memory-oriented storage like in some of the NewSQL systems; it will force us to apply a local optimization on each node for each query.

Database-as-a-Service offerings come from cloud computing providers and are also based on new architectures. We know that with these services organizations do not have to maintain the DBMS on either their own private hardware or on a cloud-hosted virtual machine. Instead, the DBaaS provider is responsible for maintaining the physical configuration of the database, including system tuning, replication, and backups. Although the cloud was used before with RDBMS and NoSQL, those used the same underlying disk-oriented DBMS; here we are considering only cloud systems that have a NewSQL architecture, which uses a log-structured storage manager to improve I/O parallelism [22]. The main advantage is that such a provider offers its own custom DBaaS which can be deployed on all of the major cloud platforms. It gives the ability to distribute the database across different providers in the same geographical region to avoid downtime due to service outages [24].

5. RELATED WORK

The main studies on the star schema were done in two directions: implementing the star schema from the RDBMS on other database management systems, and star query performance. Through those studies, the main question was whether these systems could support the purposes of OLAP, like multidimensional modelling, data roll-up and drill-down, hierarchies, and slice and dice operations. That is compatible with Golfarelli and Rizzi [25], who made a paper that surveyed more than 20 years of research on data warehouse systems. They concluded that most of the papers were done on 4 main subjects: query processing, multidimensional modelling, performance and tuning, and architecture. Sen and Sinha [26] made an infrastructure-based comparison between data warehouse systems for 15 vendors providing data warehousing systems. Although they came to
the conclusion that the methodologies used by these companies differ in details, all focus on the techniques of capturing and modelling user requirements in a meaningful way. They used several attributes to check the differences. Two of these attributes were related to data modelling and the ability to support normalization/denormalization. Within the data modelling, they differentiated between star schema and relational modelling, and they found that 2 out of 15 were not supporting the star schema: Teradata/NCR and SAS. Support for normalization/denormalization is important for OLAP queries, and that was important for measuring performance and how it can be improved. They found that some systems supported that explicitly, but some of them were not reporting anything regarding it.

Using the MapReduce technique for star schema data warehouses to solve Big Data problems related to aggregations and joins was another attractive subject. Barkhordari and Niamanesh [27] proposed a method called Chabok that is based on two-phased MapReduce and implemented on Hadoop, and TPC-DS queries (http://www.tpc.org/tpcds/) were executed for benchmarking. They used two-phased MapReduce since one phase represents the fact table and the other one represents a dimension table. They made a query language called MHBQL which has the same functionality as SQL in terms of selection, joins and aggregations. They compared their queries with Spark and HBase queries and found that they have high performance, since they adopted independence of hardware nodes, which enables higher speeds of query execution, omits joins and removes network congestion. Instead of using data replication to solve the performance issue, they made it optional and used it only for maintenance and backup purposes.

Chevalier, El Malki, Kopliku, Teste, and Tournier [28] investigated the use of NoSQL systems for implementing OLAP systems, mainly the ROLAP model (star schema). They defined a set of rules to map star schemas into two NoSQL models, column-oriented (HBase) and document-oriented (MongoDB), and the process included data transformation, data loading and aggregate computation. That process allowed them to compare the different approaches with each other. They were successful not only in implementing their use case on both systems, but they also measured the performance of both systems. In general, they recommended HBase (column-oriented), since it showed better use of space and memory, and the aggregation time was reasonable. When they compared in terms of queries that demand multiple attributes of a relation, the document-oriented store (MongoDB) showed better results, since it supports nested fields, unlike the column-oriented store, where it might take longer because the data would not be available in one place. Similar studies were done by Yangui, Nabli, and Gargouri [29]. They were interested in transforming the multidimensional model into NoSQL systems, mainly Cassandra (column-oriented) and MongoDB (document-oriented). The conclusion that they reached was similar to [28]: the queries on the document-oriented store are not only better than on the column-oriented one, but also better at expressing hierarchies. Maity, Soumya Sen, and Debnath [30] looked at the challenges of implementing OLAP using MongoDB, which is classified as a document-oriented store. They found that they could implement the star schema, but they recommended having a new join mechanism that can be equivalent to the natural join in an RDBMS. That was required not only for selection but also for creating a fact table. Besides that, they found that not all OLAP operations can be performed directly, because of technical issues related to the MongoDB environment. Kovačević and Mekterović [31] proposed alternative business intelligence engines based on NoSQL and NewSQL instead of the RDBMS. It could be similar to [26] described before, but here they did not state explicitly whether these systems can have a star schema or not. They mainly looked at those systems that could provide properties similar to relational databases, like ACID, SQL and a relational schema. They made a comparison among 18 features, 4 of them related to the star schema: data model, OLAP operations, SQL/ACID and query language. So we considered that a system that could provide those 4 features could also give the star schema explicitly; otherwise, it needs a workaround method to implement it. There were 14 systems used in the study, 6 of them NoSQL-based, and the other 8 NewSQL-based. It was clear that the NewSQL systems would support the star schema, and they address that before making the comparison. They explored the features of NewSQL and found that they match the features that the RDBMS has. The systems they used for NewSQL are Apache Spark, Apache Drill, Vertica, SAP HANA, Oracle TimesTen, SQL Server column indexes, VoltDB, and NuoDB. For NoSQL, it was clear that not all of them support implementing the star schema in an explicit way, but we knew from the previous work that it could be done, especially with the column-oriented type.

Many studies have been done on star query performance; we mention them as part of the star schema subject and to emphasize that the star schema is still important for implementations. We know that in 2006 Stonebraker's Vertica developed the Star Schema Benchmark (SSB) to measure star schema query performance. It is an extension of the TPC-H benchmark. Later, in 2009, it was extended to SSB+ by providing a new approach, Adjoined Dimension Columns (ADC), to speed up the star query [32]. That was based on traditional databases (RDBMS), so the evolution of databases enforced having a benchmark for the star schema implemented on other types of database store. Malki, Arlind, Essaid, and Olivier [33] summarized that in their paper, Benchmarking Big Data OLAP NoSQL Databases. They not only introduced DSS benchmarks related to NoSQL and big data, like KoalaBench, but also implemented several models, like the star schema, snowflake, flat table and flat flexible, on 4 NoSQL databases, and they concluded that the star model had better performance compared with the others. In fact, this research can also be considered one of the implementations of the star schema on NoSQL databases. Enhancing the join operation is also one of the factors related to star query performance, like what Zhu, Navneet, Saurabh, and Patel did in [34]. They introduced a new strategy in query execution based on the star schema:
lookahead information passing (LIP), which speeds up star queries and, as they say, gives an optimal query plan. Makreshanski and others introduced the many-query join (MQJoin) [35], a method for sharing the execution of a join that can efficiently deal with hundreds of concurrent queries by minimizing redundant work and making efficient use of main-memory bandwidth and multi-core architectures.

6. VENDOR REVIEW

In this part, we make a review of commercial products available and provided by some of the main vendors. It was hard to figure out what the vendors are providing, especially when you look at their white papers. They explain the implementation of the data warehouse and OLAP by their systems; some of them clarify that directly, and the others describe the general approach that they have, but as we said before, most of them consider mainly the star schema as the first model for their implementation.

A. SAP HANA

SAP is one of the software corporations that make enterprise software related to the data warehouse and business intelligence. SAP cares about dynamic dimensional modelling, so their solution SAP Business Warehouse powered by SAP HANA deploys a model called Dynamic Star Schema, which is a new modelling pattern partitioning the dimensions (master data) of a fact-table InfoProvider using CompositeProviders or Open ODS views of type fact [36]. At first, they were extending the star schema on a row-store RDBMS to flat star schemas on the SAP HANA column store. But later they found limitations with that approach, in terms of flexibility, adaptability, maintenance and support of modern federated DWH architectures. The Dynamic Star Schema is modelled using:

• a CompositeProvider on a SAP BW fact-table InfoProvider of the Integrated DWH / Propagation Layer
• a Union-CompositeProvider on an InfoProvider if you only want to define the star schema
• an Open ODS view of type fact on a DB table/view and field-based DSO (advanced) of the Raw DWH / Open ODS Layer that serve as fact tables

Fig. 12. HANA Dynamic Star Schema

The main components are CompositeProviders, InfoObjects and InfoProviders, used to define and associate tables. Table partitioning, federation, and the Layered Scalable Architecture (LSA) are features provided to simplify the data warehouse. The Dynamic Star Schema allows combining fact data and master data across the layer borders, within SAP BW and outside SAP BW (federation), and combining partitioned master data (dimensions).

B. Microsoft - SQL Server 2017 and Power BI

Microsoft supports the data warehouse and the star schema through many products that they have provided for many years. For physical implementation, the latest product that they have is SQL Server 2017. Microsoft provides many features in this product; it is cloud-ready, in-memory and columnstore-based, and it uses the star schema. It uses Azure for ETL, and according to [37], one of the data pipeline stages is transforming data into a star schema, which allows the end-user to make queries over both traditional relational data and new data types with common T-SQL commands and the PolyBase bridge to translate to non-SQL Server repositories. The query processing is optimized for the smaller result sets typical of data warehouses, such as performing star joins and aggregations. That may affect the performance, but features like columnstore indexes, Massively Parallel Processing (MPP) and B-tree indexes guarantee a high performance.
so Their solution SAP Business Warehouse powered by SAP Another option that Microsoft provides is implementing Star
HANA is deploying a model called Dynamic Star Schema, Schema logically through Power BI, the latest Business in-
which is a new modelling pattern partitioning Dimensions telligence framework that Microsoft spends their investments
(master data) of fact-table InfoProvider using Composite and efforts to provide on that domain [38]. It gives the ability
Providers or Open ODS Views of type fact [36]. At first, they of connecting to several data sources and making the logical
were extending Star Schema on row-store RDBMS to Flat modelling, like Star Schema without having that physically.
Star Schemas on SAP HANA column-store. But later they With that virtual modelling, many operations like joining
found limitations with that approach, in terms of Flexibility, tables, aggreagtions and selections can be done through
adaptability, Maintenance and Support of modern federated that , wwithout effecting the physical implementation, and
DWH architectures. The Dynamic Star Schema is modelled that is common used in other BI systems, But when the
using:- implementation is done on a physical star schema, it will en-
‚ CompositeProvider on SAP BW fact-table InfoProvider able auto-detection feature, which will activate automatically
of Integrated DWH/ Propagation Layer the properties that related to the star schema,like creating
‚ Union-CompositeProvider on InfoProvider if you only relationships between tables, setting Cardinality, and Cross
want to define the Star Schema filter direction.
‚ Open ODS view of type fact on DB-table/ -view and
field-based DSO (advanced) of the Raw DWH / Open C. Amazon Redshift
ODS Layer that serve as fact tables Redshift is a distributed column-oriented data store, based
B. Microsoft - SQL Server 2017 and Power BI

Microsoft supports the data warehouse and the star schema through many products it has provided over the years. For physical implementation, the latest product is SQL Server 2017. Microsoft provides many features in this product: it supports cloud deployment, in-memory processing and columnstore storage, and it uses the star schema. Azure is used for ETL, and according to [37], one of the data pipeline stages is transforming data into a star schema. That allows the end user to query both traditional relational data and new data types with common T-SQL commands, while the PolyBase bridge translates queries to non-SQL Server repositories. The query processing is optimized for the smaller result sets typical of data warehouses, such as performing star joins and aggregations; this may affect performance, but features such as columnstore indices, Massively Parallel Processing (MPP) and B-tree indices guarantee high performance.
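As a rough illustration of the physical side, the following T-SQL sketch (the table and column names are our own, hypothetical ones) builds a fact table stored as a clustered columnstore index, the columnstore feature mentioned above, together with two small dimension tables, and then runs a typical star join against them:

  -- Dimension tables (illustrative).
  CREATE TABLE dbo.DimDate    (DateKey INT PRIMARY KEY, CalendarYear INT);
  CREATE TABLE dbo.DimProduct (ProductKey INT PRIMARY KEY, Category NVARCHAR(50));

  -- Fact table stored column-wise via a clustered columnstore index.
  CREATE TABLE dbo.FactSales (
      DateKey     INT            NOT NULL,
      ProductKey  INT            NOT NULL,
      SalesAmount DECIMAL(18, 2) NOT NULL
  );
  CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;

  -- A typical star join over the fact and dimension tables.
  SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
  FROM dbo.FactSales AS f
  JOIN dbo.DimDate    AS d ON f.DateKey    = d.DateKey
  JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
  WHERE d.CalendarYear = 2017
  GROUP BY d.CalendarYear, p.Category;

This is only a sketch of the general pattern, not Microsoft's reference design; the point is that the fact table carries the columnstore index while the dimensions stay small row-store tables.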
Another option Microsoft provides is implementing the star schema logically through Power BI, the business intelligence framework in which Microsoft has been investing its efforts in that domain [38]. It gives the ability to connect to several data sources and to build the logical model, such as a star schema, without having it physically. With that virtual modelling, many operations such as joining tables, aggregations and selections can be done without affecting the physical implementation, which is also common in other BI systems. When the implementation is done on a physical star schema, however, Power BI enables an auto-detection feature that automatically activates the properties related to the star schema, such as creating relationships between tables and setting the cardinality and cross-filter direction.
C. Amazon Redshift

Redshift is a distributed column-oriented data store, based on the relational row-based PostgreSQL DBMS, and a petabyte-scale cloud DW. It uses an MPP architecture, breaking large data sets into chunks that are assigned to slices within each node. A Redshift cluster consists of several compute nodes that contain the data and a leader node that receives queries and manages their execution in the cluster, while Redshift Spectrum enables running SQL queries on semi-structured data in the S3 distributed file system [39].

Fig. 13. Amazon Redshift Architecture [39]

Redshift strongly supports the star schema; in fact, when you look at its developer guide, you will see that the first modelling example is a star schema. Amazon believes that most current DW implementations use a star schema, which encourages migration from the systems companies currently have to Amazon Redshift. Notably, Redshift supports the Star Schema Benchmark (SSB), which means they care not only about modelling but also about performance: the benchmark lets you test the design and other factors such as load time, storage use, and query performance. Besides that, Redshift provides several distribution styles for the rows of a table, such as KEY, ALL and EVEN distribution. These help the query optimizer redistribute rows to the compute nodes as needed to perform joins and aggregations; redistribution might involve either sending specific rows to nodes for joining or broadcasting an entire table to all of the nodes. Data compression is another feature that supports Redshift's optimization functionality. Because Amazon Redshift uses a columnar design, it can choose an optimal compression algorithm for each column individually, based on the data contained in that column, typically yielding significantly better compression than a row-based database. Compression helps queries run faster and minimizes the amount of physical storage the data consumes.
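To show how these distribution and compression choices surface in the table definitions, here is a small sketch of Redshift DDL for a hypothetical star schema (all table and column names are assumptions made for the example): the fact table is distributed on its main join key and sorted on the date key, a small dimension is replicated to every node, and one column carries an explicit compression encoding.

  -- Fact table: rows co-located by product_key, sorted by date_key.
  CREATE TABLE sales (
      date_key     INTEGER NOT NULL,
      store_key    INTEGER NOT NULL,
      product_key  INTEGER NOT NULL,
      sales_amount DECIMAL(18,2) ENCODE zstd   -- explicit column compression
  )
  DISTSTYLE KEY
  DISTKEY (product_key)
  SORTKEY (date_key);

  -- Small dimension replicated to all compute nodes to avoid redistribution.
  CREATE TABLE product_dim (
      product_key INTEGER NOT NULL,
      category    VARCHAR(50),
      brand       VARCHAR(50)
  )
  DISTSTYLE ALL;

Choosing the fact table's distribution key to match the most frequent join column is what lets those joins happen locally on each slice instead of triggering redistribution.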
We know that one of the physical implementations of a star schema is defining the relations between the tables, that is, defining primary keys and foreign keys (constraints). Redshift uses this information to optimize queries by eliminating redundant joins and establishing a join order. Some designers, however, do not follow that procedure of relation definition; as an alternative, the query optimizer can detect redundant joins without constraints being defined, provided the statistics are kept up to date. That can be done with the ANALYZE operation, which is responsible for updating statistics, or by enabling options of the COPY operation so that statistics are updated while the data is being loaded.
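Continuing the sketch above, the constraints below are not enforced by Redshift but give the optimizer the join information just described, and ANALYZE refreshes the statistics after a load (the names are again the hypothetical ones used earlier):

  -- Informational constraints: Redshift does not enforce them,
  -- but the planner uses them to order and eliminate joins.
  ALTER TABLE product_dim ADD PRIMARY KEY (product_key);
  ALTER TABLE sales ADD FOREIGN KEY (product_key) REFERENCES product_dim (product_key);

  -- Refresh optimizer statistics after loading; COPY can also do this
  -- automatically when run with STATUPDATE ON.
  ANALYZE sales;
  ANALYZE product_dim;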
Workload management and cursors are other features Redshift provides. Although neither is specific to the star schema, both support the performance of the data warehouse. Workload management can separate longer-running queries from shorter-running queries; this helps keep applications that require responsive queries from being backed up behind long-running ones, and it allows allocating more memory to the queues that handle the more intensive queries in order to boost their performance. Cursors are used to reduce the impact on memory when result sets are too big: instead of materializing the whole result in memory, cursors return the data a chunk at a time and control the maximum size to be loaded.
D. Oracle

It was a challenge to summarize Oracle's approach to the star schema. Once you search for the star schema on Oracle.com, you will find that it is covered in many articles, blogs, technical documentation, academic reviews, and use cases. Besides that, Oracle supports all the DBMS types, and its products, systems, and frameworks can be integrated with other solutions and platforms. You can feel that Oracle fully adopts the star schema. For example:
• The OLAP Cube option available in Oracle 12c is represented through a star schema [40].
• Oracle Advanced Analytics (OAA), an option of Oracle Database Enterprise Edition for uncovering hidden relationships in the data, uses data mining tools that mine data organized as a star schema to deliver real-time results and predictions [41].
• Oracle usually uses bitmap indexes and the star transformation for implementing star queries, but that was not applicable within its in-memory database and cloud DW, which use a column-oriented store; so Oracle introduced the vector transformation (Vector Group By), which provides the same functionality as the star transformation on a column-oriented database [42] (a short sketch of the classic approach is given below).
In general, Oracle supports the star schema implementation from different perspectives: physically through the databases and logically through the BI systems.
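To make the classic approach in the last bullet concrete, the following Oracle SQL sketch (table and index names are assumptions) creates bitmap indexes on the fact table's dimension keys and enables the star transformation, which is what allows the optimizer to rewrite a star query to use those indexes:

  -- Bitmap indexes on the fact table's foreign-key columns (illustrative names).
  CREATE BITMAP INDEX sales_date_bix    ON sales (date_key);
  CREATE BITMAP INDEX sales_product_bix ON sales (product_key);

  -- Allow the optimizer to apply the star transformation in this session.
  ALTER SESSION SET star_transformation_enabled = TRUE;

  SELECT p.category, SUM(s.sales_amount) AS total_sales
  FROM   sales s, date_dim d, product_dim p
  WHERE  s.date_key = d.date_key
  AND    s.product_key = p.product_key
  AND    d.calendar_year = 2018
  GROUP  BY p.category;

On the column-oriented in-memory store, the Vector Group By transformation described above plays the equivalent role without the bitmap indexes.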
E. IBM

IBM has provided many solutions and platforms that support the data warehouse, and all of them support star schema modelling, such as Cognos BI, the DB2 database, Db2 Warehouse and InfoSphere Warehouse. That support comes in several forms: visualization tools to implement the star schema, performance and optimization enhancements, or new algorithms for star joins. For example, the latest product, Db2 Warehouse, includes features that are not specific to the star schema but are important to have [43]:
• an embedded Apache Spark engine, which uses in-memory SQL columnar processing with an MPP cluster architecture to speed up complex queries and predictive model building, testing, and deployment (we will see Spark's role in a star schema in 6-H);
• automatic schema generation and data loading, which includes the star schema;
• BLU Acceleration, IBM's in-memory columnar processing technology: entire datasets and decompression are not needed in memory, and it offers pre-fetching of data and data skipping.
IBM has put its efforts not only into the industry side but into academia as well; you can easily find many white papers from IBM related to database management and the star schema. For example, IBM introduced the Zigzag join algorithm for DB2 star schema queries [44], which improves consistency in performance and reduces the execution time of queries in data warehouse or data mart environments with large volumes of potentially partitioned data, complex ad hoc queries, and a database design that uses a star schema.
IBM's business intelligence and reporting tools are similar to other tools that give the ability to design a logical model even if that model does not exist physically. Cognos, for example, has a component called Impromptu Catalog, a business intelligence tool repository that holds a logical design for the database and is similar in functionality to the SAP BusinessObjects universe builder. Through that component, you can implement any data model, including the star schema.
IBM recognizes the book Dimensional Modeling: In a Business Intelligence Environment as a reference for data modelling, and within this book you can find that the star schema, the multi-star schema, and the snowflake are identified as the main modelling designs for IBM products.
F. Hive

Hive is a query engine developed at Facebook that brings the concepts of tables, columns, partitions, and SQL to the Hadoop architecture by internally executing MapReduce jobs, keeping the extensibility and flexibility of Hadoop [7]. HiveQL is the SQL dialect that Hive provides. It has the same operations that SQL has (selection, joins and aggregations), which allows implementing the typical relational structures such as a star schema.
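A minimal HiveQL sketch of what that looks like, with hypothetical table names, is shown below: a partitioned fact table and a dimension table in ORC format, followed by an ordinary star join expressed in HiveQL.

  -- Hypothetical fact table, partitioned by date and stored as ORC.
  CREATE TABLE sales (
      store_key    INT,
      product_key  INT,
      sales_amount DECIMAL(18,2)
  )
  PARTITIONED BY (sale_date STRING)
  STORED AS ORC;

  -- Dimension table.
  CREATE TABLE product_dim (
      product_key INT,
      category    STRING
  )
  STORED AS ORC;

  -- Star join over the fact and dimension tables.
  SELECT p.category, SUM(s.sales_amount) AS total_sales
  FROM sales s
  JOIN product_dim p ON s.product_key = p.product_key
  WHERE s.sale_date >= '2018-01-01'
  GROUP BY p.category;

Hive compiles the query into the underlying execution jobs, but the star schema itself is defined exactly as it would be in a relational database.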
G. AtScale

AtScale is a BI platform for big data, enabling the use of popular business intelligence tools on big data and modern data platforms. It aims to provide an OLAP engine for Hadoop, avoiding ETL and building a virtual ROLAP (vROLAP) directly above the original data storage. The cube that AtScale provides is virtual, since it does not contain data; it describes only how to process and optimize the data at query runtime. That gives the ability to apply whatever logical model you want, even without having it physically. So when you need to define a star schema, you can easily define measures, dimensions, hierarchies, and joins through AtScale; whether or not your data is in a data-warehouse-style format does not matter, and AtScale will let you create any logical model you need, regardless of what the physical model on disk is [45]. Besides modelling, the vROLAP engine provides Smart Aggregates to solve one of the biggest drawbacks of ROLAP, maintaining aggregate tables: instead of building and maintaining summary tables up front, the vROLAP engine dynamically builds and maintains aggregates on demand, based on what BI users need. For performance and optimization, the vROLAP engine optimizes queries to work with the data in its current form, using information about the data to get optimal performance from existing Hadoop resources. vROLAP not only optimizes query performance but also shortens the entire time-to-insight lifecycle, removing the bottlenecks and complexity that have been a barrier to the widespread adoption of OLAP on Hadoop [46].
H. Spark and SparkLineData

Spark has become an accepted architecture for an enterprise data warehouse. It started with the resilient distributed dataset (RDD) to solve data processing by partitioning data and executing various parallel operations on a cluster, mainly for unstructured data. After that, Spark started to provide the SQL and DataFrame modules to handle structured data on top of RDDs, but with a better interface. A data frame has the typical structure of any table in a traditional database, consisting of rows and columns; it can be connected to several resources, such as a traditional DB, NoSQL stores or files, and it can easily be used to represent the star schema components. Spark SQL not only covers the operations that traditional SQL has but also ships the Catalyst Optimizer, a query optimizer with functionality similar to other query optimizers: it detects join operations and produces a query plan for better performance. The Catalyst Optimizer converts queries into logical and physical trees corresponding to the different phases of optimization, and within each stage these trees are manipulated by rules, as shown in figure 14. In Spark, star schema joins are detected using table and column statistics; with this information, the optimizer joins the tables of a star schema together in an optimal way [47].

Fig. 14. Catalyst Optimizer
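Because that detection relies on statistics, a minimal Spark SQL sketch of the workflow looks roughly as follows (table and column names are assumptions, and the tables are assumed to be registered in the Spark catalog):

  -- Collect the table and column statistics the optimizer uses to
  -- recognize the fact and dimension roles.
  ANALYZE TABLE sales COMPUTE STATISTICS;
  ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS date_key, product_key;
  ANALYZE TABLE product_dim COMPUTE STATISTICS FOR COLUMNS product_key;

  -- Enable the cost-based optimizer so those statistics drive join planning.
  SET spark.sql.cbo.enabled=true;

  -- A star join that Catalyst can now plan with the fact table at the centre.
  SELECT p.category, SUM(s.sales_amount) AS total_sales
  FROM sales s
  JOIN product_dim p ON s.product_key = p.product_key
  GROUP BY p.category;

The query itself is plain SQL; what changes is that the statistics let the planner treat the largest table as the fact table and order the dimension joins around it.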
Spark is now part of many solutions and is used by many vendors as a part of the ETL process, but the most recent product related to OLAP and business intelligence based on Apache Spark is SparkLineData. SparkLineData was created to solve one of the gaps in this stack: the ability to support fast slice-and-dice ad-hoc workloads, which are fairly common, for example when browsing and doing discovery over large datasets. It is a native business intelligence platform geared towards providing fast ad-hoc querying over a logical cube (i.e. a star schema). It is based on Apache Druid OLAP indexing technology, and the technique of using OLAP indexing has the additional benefit of simplifying the ETL processes and the data management layer [48]. SparkLineData can be deployed over enterprise data warehouses, data lakes and IoT analytics, and it delivers sub-second response times for analytics queries using in-memory multi-dimensional indexes and advanced query optimization on Apache Spark. Recently, Oracle completed the acquisition of SparklineData to augment Oracle PaaS services for enterprise customers transitioning to big data solutions (Oracle - SparklineData). The key components for getting a star schema are the Druid index, the Catalyst optimizer, and the DataSource; SparkLineData orchestrates execution between Druid and Spark, and the Druid planner generates optimal physical plans to take advantage of both Spark and Druid capabilities. The documents related to SparklineData are rare, and even the website that it previously had has been closed, but Vincent Smith [48] built a star schema using SparklineData and showed with several queries how easy it was to obtain one with better performance.

Fig. 15. SparkLineData BI Stack
I. SAS - Data Integration Server

In the past, SAS did not support star schema modelling [26], but now it does. SAS earlier introduced the LASR Star Schema, which is similar to a typical star schema in a relational database but is limited to a single join condition; it does not support multiple key columns between fact and dimension tables. Later, SAS provided the Data Integration Server, the next-generation product after the SAS ETL Server [49]. The visual design component of the SAS Data Integration Server, SAS Data Integration Studio, can be used to build the star schema or to import data models directly from standard data modelling tools, such as CA ERwin or Rational Rose. With Data Integration Studio you can easily identify the fact and dimension tables, enable features such as slowly changing dimensions (SCD) for tracking changes in dimension tables through the SCD Type 2 Loader, manage changes and updates on the star schema, and apply several join algorithms such as the sort-merge join, index join, hash join, and a multi-way join. For optimization and performance, SAS provides another component with Data Integration, the Scalable Performance Data Server, which is responsible for performing, validating and optimizing star joins with the best performance. When the star join is utilized, it requires only three steps to complete, regardless of the number of dimension tables; this can improve performance when working with large dimension and fact tables.
7. CONCLUSION

Data modelling is still the area of most interest in business intelligence and data warehouse environments. Through this paper, we made a review of the data warehouse with the main focus on the star schema. We have seen why we need the star schema and how it differs from other data models. We showed the areas related to the star schema that have attracted the most research, such as implementation, performance and optimization. It was important to look at the evolution of database management systems, which was a response to the amount of data that has increased over the past decades. We have seen that for each type of database management system there was a study or a use case in which a star schema was implemented on that type of DBMS, which means the star schema is still considered a core part of the data warehouse concept and, later, of big data solutions. We made a review of the current applications, frameworks and BI systems provided by several commercial vendors, and we have seen that these vendors insist on adopting the star schema in their solutions, not only to absorb the previous solutions that organizations have but also to give them the ability to implement a star schema in their future solutions. The implementation of a star schema can be physical or logical, which means that even in solutions that do not implement the star schema within the database structure, it can be implemented through the BI and reporting interface. Many factors can influence the decision to implement a star schema, such as performance, real-time data and providing an optimal approach for decision makers, but that does not mean the star schema cannot handle them. Many studies from academia and industry have proved that the star schema is still applicable to many scenarios and use cases, and even new concepts like the data lake use the star schema as one of their components. In conclusion, the new challenges facing the star schema concept have not made academia or industry give up; rather, they have forced them to think deeply about solving the new situations, whether through new algorithms or implementation approaches on the academic side, or by combining several components and products into one package to handle the new situations, as most of the commercial vendors do these days.
REFERENCES

[1] Kimball, R. and Ross, M., The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 2nd ed. Wiley Publishing, 2013. [Online]. Available: https://books.google.co.nz/books?hl=en&lr=&id=XoS2oy1IcB4C&oi=fnd&pg=PA1&ots=1CIdmyiPbD&sig=u_Yqnw3V88kgWLGeGM9YZPnfp3I&redir_esc=y#v=onepage&q&f=false
[2] Chaudhuri, S. and Dayal, U., "An overview of data warehousing and OLAP technology," SIGMOD Rec., vol. 26, no. 1, pp. 65–74, Mar. 1997. [Online]. Available: http://doi.acm.org.ezproxy.auckland.ac.nz/10.1145/248603.248616
[3] Celko, J., "Chapter 2 - Columnar databases," in Joe Celko's Complete Guide to NoSQL, Celko, J., Ed. Boston: Morgan Kaufmann, 2014, pp. 15–25. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780124071926000029
[4] Homayouni, H., Ghosh, S., and Ray, I., "Data warehouse testing," ser. Advances in Computers. Elsevier, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0065245817300578
[5] Linstedt, D. and Olschimke, M., "Chapter 7 - Dimensional modeling," in Data Vault 2.0, Linstedt, D. and Olschimke, M., Eds. Boston: Morgan Kaufmann, 2016, pp. 171–193. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780128025109000076
[6] Tupper, C. D., "19 - Dimensional warehouses from enterprise models," in Data Architecture, Tupper, C. D., Ed. Boston: Morgan Kaufmann, 2011, pp. 337–356. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B978012385126000019X
[7] Vaisman, A. and Zimányi, E., Logical Data Warehouse Design. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 121–178. [Online]. Available: https://doi.org/10.1007/978-3-642-54655-6_5
[8] Han, J., Kamber, M., and Pei, J., "4 - Data warehousing and online analytical processing," in Data Mining, 3rd ed., ser. The Morgan Kaufmann Series in Data Management Systems, Han, J., Kamber, M., and Pei, J., Eds. Boston: Morgan Kaufmann, 2012, pp. 125–185. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123814791000046
[9] Malinowski, E. and Zimányi, E., Conventional Data Warehouses. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 77–135. [Online]. Available: https://doi.org/10.1007/978-3-540-74405-4_3
[10] "Star and snowflake schemas," https://www.oracle.com/webfolder/technetwork/tutorials/obe/db/10g/r2/owb/owb10gr2_gs/owb/lesson3/starandsnowflake.htm, accessed: 2018-12-24.
[11] Kaas, C., Pedersen, T. B., and Rasmussen, B., "Schema evolution for stars and snowflakes," in ICEIS, 2004. [Online]. Available: https://pdfs.semanticscholar.org/5fed/155cd5c6187300a33f241f5b3383a2028aee.pdf
[12] Cuzzocrea, A., Bellatreche, L., and Song, I.-Y., "Data warehousing and OLAP over big data: Current challenges and future research directions," in Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, ser. DOLAP '13. New York, NY, USA: ACM, 2013, pp. 67–70. [Online]. Available: http://doi.acm.org.ezproxy.auckland.ac.nz/10.1145/2513190.2517828
[13] Wrembel, R., Data Warehouse Performance: Selected Techniques and Data Structures. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 27–62. [Online]. Available: https://doi.org/10.1007/978-3-642-27358-2_2
[14] Vaisman, A. and Zimányi, E., Data Warehouses: Next Challenges. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 1–26. [Online]. Available: https://doi.org/10.1007/978-3-642-27358-2_1
[15] "A relational database overview," https://docs.oracle.com/javase/tutorial/jdbc/overview/database.html, accessed: 2018-12-24.
[16] Siddiqa, A., Karim, A., and Gani, A., "Big data storage technologies: a survey," Frontiers of Information Technology & Electronic Engineering, vol. 18, no. 8, pp. 1040–1070, Aug. 2017. [Online]. Available: https://doi.org/10.1631/FITEE.1500441
[17] Wu, C., Buyya, R., and Ramamohanarao, K., "Chapter 1 - Big data analytics = machine learning + cloud computing," in Big Data, Buyya, R., Calheiros, R. N., and Dastjerdi, A. V., Eds. Morgan Kaufmann, 2016, pp. 3–38. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780128053942000015
[18] Grolinger, K., Higashino, W. A., Tiwari, A., and Capretz, M. A., "Data management in cloud environments: NoSQL and NewSQL data stores," Journal of Cloud Computing: Advances, Systems and Applications, vol. 2, no. 1, p. 22, Dec. 2013. [Online]. Available: https://doi.org/10.1186/2192-113X-2-22
[19] Chandra, D. G., "BASE analysis of NoSQL database," Future Generation Computer Systems, vol. 52, pp. 13–21, 2015, Special Section: Cloud Computing: Security, Privacy and Practice. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X15001788
[20] Srinivasa, K. G. and Hiriyannaiah, S., "Chapter five - Comparative study of different in-memory (No/New) SQL databases," in A Deep Dive into NoSQL Databases: The Use Cases and Applications, ser. Advances in Computers, Raj, P. and Deka, G. C., Eds. Elsevier, 2018, vol. 109, pp. 133–156. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0065245817300463
[21] Pokorný, J., "Database technologies in the world of big data," in Proceedings of the 16th International Conference on Computer Systems and Technologies, ser. CompSysTech '15, 2015, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/2812428.2812429
[22] Pavlo, A. and Aslett, M., "What's really new with NewSQL?" SIGMOD Rec., vol. 45, no. 2, pp. 45–55, Sep. 2016. [Online]. Available: http://doi.acm.org/10.1145/3003665.3003674
[23] Kumar, R. and Charu, S., "NewSQL databases: Scalable RDBMS for OLTP needs to handle big data," Mar. 2014.
[24] "ClearDB overview," https://www.cleardb.com/developers/platform/overview, accessed: 2019-01-07.
[25] Golfarelli, M. and Rizzi, S., From Star Schemas to Big Data: 20+ Years of Data Warehouse Research. Cham: Springer International Publishing, 2018, pp. 93–107. [Online]. Available: https://doi.org/10.1007/978-3-319-61893-7_6
[26] Sen, A. and Sinha, A. P., "A comparison of data warehousing methodologies," Commun. ACM, vol. 48, no. 3, pp. 79–84, Mar. 2005. [Online]. Available: http://doi.acm.org/10.1145/1047671.1047673
[27] Barkhordari, M. and Niamanesh, M., "Chabok: a map-reduce based method to solve data warehouse problems," Journal of Big Data, vol. 5, no. 1, p. 40, Oct. 2018. [Online]. Available: https://doi.org/10.1186/s40537-018-0144-5
[28] Chevalier, M., El Malki, M., Kopliku, A., Teste, O., and Tournier, R., "How can we implement a multidimensional data warehouse using NoSQL?" in Enterprise Information Systems, Hammoudi, S., Maciaszek, L., Teniente, E., Camp, O., and Cordeiro, J., Eds. Cham: Springer International Publishing, 2015, pp. 108–130.
[29] Yangui, R., Nabli, A., and Gargouri, F., "Automatic transformation of data warehouse schema to NoSQL data base: Comparative study," Procedia Computer Science, vol. 96, pp. 255–264, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050916319391
[30] Maity, B. and Debnath, N. C., "Challenges of implementing data warehouse in MongoDB environment," Journal of Fundamental and Applied Sciences, vol. 10, pp. 222–228, 2018. [Online]. Available: https://www.ajol.info/index.php/jfas/article/view/171660
[31] Kovačević, I. and Mekterović, I., "Alternative business intelligence engines," in 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), May 2017, pp. 1385–1390.
[32] O'Neil, P., O'Neil, E., Chen, X., and Revilak, S., "The star schema benchmark and augmented fact table indexing," in Performance Evaluation and Benchmarking, Nambiar, R. and Poess, M., Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 237–252.
[33] El Malki, M., Kopliku, A., Sabir, E., and Teste, O., "Benchmarking big data OLAP NoSQL databases," in Ubiquitous Networking, Boudriga, N., Alouini, M.-S., Rekhis, S., Sabir, E., and Pollin, S., Eds. Cham: Springer International Publishing, 2018, pp. 82–94.
[34] Zhu, J., Potti, N., Saurabh, S., and Patel, J. M., "Looking ahead makes query plans robust: Making the initial case with in-memory star schema data warehouse workloads," Proc. VLDB Endow., vol. 10, no. 8, pp. 889–900, Apr. 2017. [Online]. Available: https://doi.org/10.14778/3090163.3090167
[35] Makreshanski, D., Giannikis, G., Alonso, G., and Kossmann, D., "Many-query join: Efficient shared execution of relational joins on modern hardware," The VLDB Journal, vol. 27, no. 5, pp. 669–692, Oct. 2018. [Online]. Available: https://doi-org.ezproxy.auckland.ac.nz/10.1007/s00778-017-0475-4
[36] "SAP Business Warehouse 7.5 powered by SAP HANA," https://www.sap.com/documents/2016/03/445ef806-647c-0010-82c7-eda71af511fa.html, accessed: 2019-01-15.
[37] "Enterprise BI in Azure with SQL Data Warehouse," https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/data/enterprise-bi-sqldw, accessed: 2019-01-15.
[38] "Power BI - Create and manage relationships in Power BI Desktop," https://docs.microsoft.com/en-us/power-bi/desktop-create-and-manage-relationships, accessed: 2019-02-09.
[39] "Amazon Redshift database developer guide, API version 2012-12-01," https://docs.aws.amazon.com/redshift/latest/dg/redshift-dg.pdf, accessed: 2019-01-15.
[40] "Online analytic processing with Oracle Database 12c," https://docs.microsoft.com/en-us/power-bi/desktop-create-and-manage-relationships, accessed: 2019-02-12.
[41] "Online analytic processing with Oracle Database 12c," https://www.oracle.com/technetwork/database/options/advanced-analytics/advanced-analytics-wp-12c-1896138.pdf, accessed: 2019-02-12.
[42] "Oracle in-memory overview," https://www.oracle.com/technetwork/database/in-memory/overview/twp-dbim-usage-2441076.html, accessed: 2019-02-12.
[43] "IBM Db2 Warehouse," https://www.ibm.com/products/db2-warehouse, accessed: 2019-02-12.
[44] "Zigzag join enablement for DB2 star schema queries," https://www.ibm.com/developerworks/data/library/techarticle/dm-1303zigzag/index.html, accessed: 2019-02-12.
[45] "AtScale technical overview," https://www.atscale.com/resource/atscale-technical-overview, accessed: 2019-02-10.
[46] "AtScale BI and OLAP for Hadoop," https://www.atscale.com/resource/atscale-bi-olap-for-hadoop-technical-doc, accessed: 2019-02-10.
[47] "Star schema enhancements in Apache Spark," https://developer.ibm.com/code/2018/04/16/star-schema-enhancements-in-apache-spark/, accessed: 2019-02-10.
[48] "SparklineData overview," https://github.com/SparklineData/spark-druid-olap/wiki/Overview, accessed: 2019-02-10.
[49] "Stars and models: How to build and maintain star schemas using SAS Data Integration Server in SAS 9," Nancy Rausch, SAS Institute Inc., Cary, NC. https://support.sas.com/resources/papers/proceedings/proceedings/sugi31/096-31.pdf, accessed: 2019-02-10.