Star Schema and Technology Review
Musa Sami Ata Abdel-Rahman
Supervisor: Professor Sebastian Link
Abstract— A data warehouse (DW) is a collection of decision support technologies aimed at enabling the knowledge worker to make better and faster decisions. We are living in the era of Big Data, and several evolutions have happened in terms of data resources and data formats over the last decades. That forces academia and industry to search deeply for alternatives and to provide better solutions in response to these new factors, so that the DW remains valid and works the same way or better compared with the past solutions that were provided through traditional databases. This paper concentrates on the most important component in DW systems, the Star Schema. The paper explores the data warehouse concept and OLAP, compares the star schema with other data modelling approaches like 3NF and the snowflake schema, and shows why we need the star schema. Within that, we highlight the evolution of database systems and present several earlier studies on converting the star schema from traditional DBMSs to other types of database systems. In the end, we make a review from the industry and show what the commercial vendors provide to implement the star schema within their products.

Index Terms— Data warehouse, Star Schema, OLAP, NoSQL, NewSQL, RDBMS, Business Intelligence

1. INTRODUCTION

We are living in times when everything is increasing, especially when we talk about data. The universe is moving fast, and more data is generated exponentially every day. With all the data coming your way, you need the right tools to deal with the growing amounts of data. The growth in the quantity and diversity of data has led to data sets larger than is manageable by conventional, hands-on management tools. To manage these new and potentially invaluable data sets, new methods of data management and new applications, in the form of predictive analytics, have been developed. According to [1], the most important asset of any company is its information, which is always kept in two main forms: operational systems of record and the data warehouse. Large corporations are overwhelmed with data. They invest in data facilitation to support them in the decision-making process, including reporting, dashboards, advanced visualisation, end-user self-service and initiatives strategic to business intelligence. The main concern for these companies is not only how to keep data, but how to narrow the gap between the time of data generation and the time of gaining the benefits hidden within it. In other words, the challenge for companies today, many beholden to legacy systems and IT bureaucracy, is that this growing demand for analytics often outstrips their ability to service it. That makes dealing with Big Data analytics another challenge these corporates are facing, especially when they have multiple systems and a wide range of data distribution resources, so the effort will not only be in unlocking the potentially useful information, but also in gathering, processing and consolidating these data before starting the analytics process. Generally, data management is a set of applications of quantitative and qualitative methods to solve relevant problems and predict outcomes. One of the salient revelations of today, with the vast and growing amount of data, is that domain knowledge and analysis cannot be separated. It is mandatory to have a full background about the situation that we need to analyze, and that can be done by aligning the technical side and the business side together to gain results. It is important for the technical team to be aware of the domain of the business and, at the same time, the business should be aware of the new ideas and technologies that can be effective for delivering analysis. To satisfy that, we need to know about the technologies that can support corporates in managing their data and the facilities that can help decision makers benefit from the data warehouse, since it is an essential element of the decision support process [2]. The evolution of big data has driven several types of research on the data warehouse to enhance data delivery. These studies address the concept itself or go deeply into each data warehouse component, like data modelling or the ETL (Extract, Transform, and Load) process. Researchers are found not only within academia but also among the technology vendors. Many commercial products and services are now available, and all of the principal database management system vendors now offer solutions in these areas. Recently, some of them asked whether the data warehouse is still alive or not. According to Eckerson and Cloudera, data warehousing is not dead, but it is struggling. Many challenges of legacy data warehousing force it to evolve and adapt to fit the realities of modern data management and to overcome the challenges of scalability and elasticity, data variety, data latency, adaptability, data silos and data science compatibility. But, as we said before, that does not prevent the data warehouse from being enhanced and becoming part of solutions for other new data management concepts, like the data lake. According to Hortonworks, both environments are complementary to one another in modern business. We can use the features and facilities of both to overcome the big data challenges by integrating the two and bridging the gaps between them. This paper tries to explore the data warehouse, focusing on the most important concept that it provides, the star schema, and how the current commercial products available are dealing with
the star schema and how they support this concept. We must not forget that the database concept has changed over at least the last 10 years. Before, we were dealing with classically structured databases, which serve relational and structured data, but now we have other types within the industry: semi-structured and unstructured data. We will take these new facts into consideration and see how vendors deal with them to implement the star schema, or whether they offer a solution to have a star schema at all.

2. DATA WAREHOUSE OVERVIEW

A. Definition

We have different definitions for the data warehouse. A widely accepted definition is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions [1]. A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making [2]. According to Ralph Kimball and Margy Ross [1], the data warehouse and operational systems are typically correlated, but have different aspects. The operational systems are optimized to process transactions quickly; the data warehouse (DW), on the other hand, is responsible for evaluating the organization's performance over time, by keeping an eye on the operational systems which feed the data warehouse. Different terms are also used to distinguish between the two concepts. For a data warehouse, we can use OLAP, online analytical processing, and OLTP, online transaction processing, to describe the operational systems. The data warehouse has the following main goals to achieve [1]:

• It must make an organization's information easily accessible.
• It must present the organization's information consistently.
• It must be adaptive and resilient to change.
• It must be a secure bastion that protects our information assets.
• It must serve as the foundation for improved decision making.
• The business community must accept the data warehouse if it is to be deemed successful.

B. OLAP and OLTP

As an extension to the definition and characteristics of the data warehouse, we need to define the distinction between OLAP and OLTP. In general, the main differences are:

• Data contents and system orientation
• Database model design

These two points summarize the differences between both systems in terms of performance and workload. That comes from the type of operations that we use on both systems, the size of the data, and the purposes of the design adopted in each of them. OLTP mainly uses insert and update data manipulations (DML), and the database is designed to deal with a specific objective, which means distributed facts within several systems. In contrast, OLAP uses SQL queries for reading data and does not need the kind of data manipulation operations that OLTP has; that comes from the design used, which mainly guarantees data consolidation from heterogeneous sources (i.e. several OLTP sources).
The database modelling in the two systems is different, and we can say it is the major factor that controls and leads the differences between them. OLTP uses ER modelling, 3NF (3rd normal form) or higher. It eliminates data duplication and provides efficient storage for detailed data, but it is not suitable for data analysis. When we look at figure 1, which represents a classic sales system, the ER model has 7 tables compared with the 4 tables that the dimensional model has. It is clear that ER modelling requires a complex query to gain a high level of data compared with the dimensional model, since we need to make 6 join operations compared with 3 joins in the other model. We know that join operations decrease performance, especially when we have a huge number of records [3]. We can notice that the dimensional model is simple and clear for any user, unlike the ER model, which needs a technical user to understand it, and that is not helpful in data analytics.
Decision support systems mainly concentrate on having multidimensional data, which needs special operations like roll-up (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies, slice and dice (selection and projection), and pivot (re-orienting the multidimensional view of data) [2].

Fig. 1. ER Model versus Dimensional model [3]

C. Data Warehouse architecture

A typical architecture is shown in figure 2. As we can see, it is a combination of systems and operations. OLTP systems and external data sources are the main input for the data warehouse. The ETL process is mainly responsible for data manipulation, in terms of extracting data from its resources, transforming it, and loading it. Data cleaning is one of the main duties of ETL, which includes handling missing values and parsing into a standard format. You may find in other resources that another component is added between the OLTP and the data warehouse, called the "Operational Data Stage" (ODS) [1]. It is mainly an additional step that can be added between the OLTP and the data warehouse. Instead of transferring data directly, we can put the ODS in the middle, which will contain the data needed for the DW system; after doing the cleaning part, we can then transfer it to the DW system.
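The cleaning duties just described, handling missing values and parsing into a standard format, can be sketched as a small transform step. This is an illustrative sketch only; the record layout, field names, and the ISO 8601 date target are assumptions, not taken from the paper:

```python
from datetime import datetime

# Toy ETL cleaning step: fill missing values and parse dates into a
# standard (ISO 8601) format. The record layout is hypothetical.
RAW_ROWS = [
    {"customer": "Ada", "revenue": "120.5", "date": "03/14/2020"},
    {"customer": None,  "revenue": "",      "date": "12/01/2020"},
]

def clean(row):
    return {
        "customer": row["customer"] or "UNKNOWN",   # handle missing value
        "revenue": float(row["revenue"] or 0.0),    # default a missing measure to 0
        # parse the source format into the standard ISO representation
        "date": datetime.strptime(row["date"], "%m/%d/%Y").date().isoformat(),
    }

cleaned = [clean(r) for r in RAW_ROWS]
print(cleaned[1])  # {'customer': 'UNKNOWN', 'revenue': 0.0, 'date': '2020-12-01'}
```

In a real ETL pipeline this step would sit between extraction from the OLTP sources and the load into the DW (or the ODS), as the architecture above describes.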
Fig. 2. Data Warehousing Architecture [4]

Another component that you may find is the data mart. It is derived from the data warehouse, but is more high level and usually oriented to a specific business line or team. It enables faster rollout, since data marts do not require enterprise-wide consensus, but they may lead to complex integration problems in the long run if a complete business model is not developed [2]. The most important parts of the DW architecture related to our study are the data modelling, the database management systems, and the methods used to fetch our data. The data modelling within the data warehouse should reflect the OLAP structure, which means a multidimensional model that can facilitate complex analyses and visualization. It is an important technique for making databases simple, delivering data that is understandable to the business users, and delivering fast query performance [1]. We have several database management systems that have utilities and characteristics used to implement the source and target data stores. Basically, a DBMS should satisfy the requirements needed to implement a data warehouse system. That includes supporting a dimensional data model and providing a language that can be used to fetch and process the data [4]. The evolution in database systems brings other languages besides SQL (Structured Query Language) and its extensions, like object-oriented ones. Since a data warehouse reflects the business model of an enterprise, an essential element of a warehousing architecture is metadata management. It is divided into three types: administrative metadata, business metadata, and operational metadata [2]. Administrative metadata is used to describe the format, structure, syntax, and semantics of data, such as justification and spacing. Business metadata includes business terms and definitions, ownership of the data, and charging policies. Operational metadata includes information that is collected during the operation of the warehouse, such as usage statistics, error reports, and audit trails.

3. DIMENSIONAL MODELING REVIEW

A. Basic concept

Dimensional modeling uses two major types of tables: fact tables and dimension tables. Fact tables hold numeric data that can be summarized as needed and contain information about specific business processes or events within these processes. Each row in a fact table corresponds to a measurement event and represents a specific level of detail at which the facts may be defined and summarized, referred to as the grain [5]. The measures have different types [6]:

• Additive: measures that can be added across all dimensions
• Nonadditive: measures that cannot be added across all dimensions
• Partially additive: measures that can be added across some dimensions and not with others

Dimension tables are integral companions to a fact table. The dimension tables contain the textual context associated with a business process measurement event and describe the who, what, where, when, how, and why associated with that event [1]. In the dimensional model, the data is described by the relation between the fact table and the dimension tables. The fact table contains the content of the data warehouse, and it is surrounded by several dimension tables containing the description. Besides measures, the fact table also has keys, which consist of a group of foreign keys (FK) that point to the primary keys of the dimension tables associated with this fact table, to enable business analysis [6]. Figure 3 gives an example that clarifies the relationship between a fact table and dimension tables. The join cardinality of dimension tables to fact tables is one-to-many, because each record in a dimension table can describe many records in the fact table. The dimension table has another important function. Each dimension table consists of columns that correspond to attributes of the dimension, and the attributes of a dimension may be related via a hierarchy of relationships. For example, in figure 3 we have a dimension table called (DimDate) that contains (Year, Quarter, Month, Week, Day) and is connected to the fact table (Online-Sales), which contains a measure called (Revenue); we can then summarize the revenue at different time levels by drill-down or roll-up.

Fig. 3. Example of a Fact table with dimension tables (Star Schema)

The strength of any dimensional modelling comes from the summarizing that the model can provide. That can be done with highly tuned queries between the fact table and dimension tables, or by building separate summary fact tables with specific dimensions [2].
The materialized view is another approach to summarizing, one that combines physical implementation and queries. It is a static view that contains the results of an aggregated query and has the same concept as a summary table. Instead of executing complex queries, the materialized view is represented as a database object. It facilitates access to the fact table and the desired dimension, and it reflects any change in the source data (the fact table) without re-executing the complex query that it represents [2]. Some data architects think that this separation of the data into facts and dimensions leads to complex queries, and that this affects system performance. As an alternative solution, they came up with the flat table, which collapses all entities in the data model down into the terminal entities. The flat table minimizes the number of tables in the database and therefore minimizes the possibility that joins will be needed in end-user queries [6]. With this approach, there is no loss of the information that the original data has, but it leads to other problems like massive redundancy, and it will not contain any aggregations. In fact, it may lead to aggregation errors when there are dependency relationships between kernel entities.

Fig. 5. Snowflake schema example
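The DimDate/Revenue roll-up described above can be made concrete with an in-memory SQLite database. This is a minimal sketch; the table layout and values are simplified assumptions, not the paper's actual figure:

```python
import sqlite3

# Minimal star schema: one fact table with a foreign key into one dimension.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY,
                          Year INTEGER, Quarter INTEGER, Month INTEGER);
    CREATE TABLE FactOnlineSales (DateKey INTEGER REFERENCES DimDate(DateKey),
                                  Revenue REAL);
    INSERT INTO DimDate VALUES (1, 2020, 1, 2), (2, 2020, 3, 8), (3, 2021, 1, 1);
    INSERT INTO FactOnlineSales VALUES (1, 100.0), (1, 50.0), (2, 70.0), (3, 30.0);
""")

# Roll-up: aggregate Revenue at the Year level of the date hierarchy.
by_year = cur.execute("""
    SELECT d.Year, SUM(f.Revenue) FROM FactOnlineSales f
    JOIN DimDate d ON f.DateKey = d.DateKey
    GROUP BY d.Year ORDER BY d.Year
""").fetchall()

# Drill-down: the same join at the finer (Year, Quarter) grain.
by_quarter = cur.execute("""
    SELECT d.Year, d.Quarter, SUM(f.Revenue) FROM FactOnlineSales f
    JOIN DimDate d ON f.DateKey = d.DateKey
    GROUP BY d.Year, d.Quarter ORDER BY d.Year, d.Quarter
""").fetchall()
print(by_year)     # [(2020, 220.0), (2021, 30.0)]
print(by_quarter)  # [(2020, 1, 150.0), (2020, 3, 70.0), (2021, 1, 30.0)]
```

The only thing that changes between the two queries is the GROUP BY grain, which is exactly what roll-up and drill-down mean over a dimension hierarchy.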
B. Star schema and Snowflake schema

Data warehouses are typically based on relational star schemas or snowflake schemas, an approach to implementing multidimensional databases using relational database technology. When we need to talk about either of them, we cannot neglect the other, especially when we know that the snowflake schema can be considered a special type of extended star schema, since it represents a connected lattice of star schemas.
The star schema consists of a central fact table surrounded by dimension tables, as shown in figure 3 [7]. When the model contains multiple fact tables that share dimension tables, as shown in figure 4, we call it a constellation schema or galaxy schema [8]. According to Kimball, the design approach for the star schema is based on the analysis of business process requirements: identifying the grain of the data, defining the relevant facts that need to be aggregated, defining the dimensional attributes to aggregate by, and then forming a star schema based on these [6]. A dimension table in the star schema represents a set of attributes which are not normalized, which means that the star schema does not support a direct attribute hierarchy [2], and that is the main difference between the star model and the snowflake model. The snowflake schema normalizes its dimension tables, which gives a multi-level hierarchy, unlike the star schema, which gives a single level of hierarchy [4]. When we compare figure 3 and figure 5, we can see that no change happened to the fact table, but the approach to the dimension tables is different. Dependency chains in the star schema are collapsed, while in the snowflake schema they are expanded out. The snowflake schema is near to (3NF) in normalization, and that is reflected by breaking up the dimension tables to avoid redundancy, which is done by representing each level as a separate table that includes a key and the descriptive attributes of the level [6].

Fig. 4. Galaxy schema

C. Star schema versus Snowflake schema

Both models can be used for data warehouse systems. In general, the star schema is the most used, but that does not mean it is free of drawbacks. The star schema has several advantages: it is simple and easy to understand. It reduces the number of tables in the database and the number of relationships between them, so the number of joins required in user queries is minimized, which leads to higher performance [6]. Finally, the denormalized approach that the star schema has makes the aggregation of measures straightforward, and that makes the star schema a preferred choice for business people.
However, the star schema has some drawbacks. For example, it does not model hierarchies adequately, since the hierarchy structure is not clear and additional information is required to represent it. Sometimes it is difficult to associate attributes with their corresponding levels. Moreover, when we have many levels in the hierarchies, the number of attributes is at least as large as the hierarchy depth, making the hierarchy structure difficult to understand [9].
The snowflake schema is preferred by technical people, since the normalized tables are easy to maintain and optimize storage space. Besides that, it represents hierarchical structures better. Every level can be easily distinguished and, further, levels can be reused between different hierarchies. Additionally, in this representation, specific attributes can be included at the different levels of a hierarchy [9]. On the other hand, snowflake schemas are less performant for querying, due to the joins that are needed for combining the data scattered in the various tables composing a hierarchy. The (3NF) that the snowflake schema has brings the same disadvantages that we discussed before in section 2-B. The ETL process these days uses different techniques like pipelining, partitioning, and parallelism to avoid sequential loading [2]. With the snowflake schema, the ETL process would be affected, since the snowflake is more complex in design and cannot be parallelised, because the dependencies within the model restrict it.
Both models are implemented in real life, and some people, like Charles D. Tupper [6], do not think that either of them is the best solution; the optimal one is a compromise between the two options.
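The extra join cost of a snowflaked hierarchy can be seen in a small sketch. Here a hypothetical product dimension is normalized into two tables, so reaching the category level from the fact table takes two joins instead of one; all names and values are illustrative:

```python
import sqlite3

# Snowflaked dimension: DimProduct -> DimCategory, each hierarchy level
# stored as its own table with a key, as the text describes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE DimCategory (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
    CREATE TABLE DimProduct  (ProductKey INTEGER PRIMARY KEY, ProductName TEXT,
                              CategoryKey INTEGER REFERENCES DimCategory(CategoryKey));
    CREATE TABLE FactSales   (ProductKey INTEGER REFERENCES DimProduct(ProductKey),
                              Revenue REAL);
    INSERT INTO DimCategory VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO DimProduct  VALUES (10, 'SQL Primer', 1), (11, 'Chess Set', 2);
    INSERT INTO FactSales   VALUES (10, 25.0), (10, 25.0), (11, 40.0);
""")

# Aggregating by category now traverses the hierarchy: two joins rather than
# the single join a denormalized star dimension would need.
rows = cur.execute("""
    SELECT c.CategoryName, SUM(f.Revenue)
    FROM FactSales f
    JOIN DimProduct  p ON f.ProductKey  = p.ProductKey
    JOIN DimCategory c ON p.CategoryKey = c.CategoryKey
    GROUP BY c.CategoryName ORDER BY c.CategoryName
""").fetchall()
print(rows)  # [('Books', 50.0), ('Games', 40.0)]
```

In a star schema the CategoryName column would simply live inside DimProduct, trading redundancy for one fewer join per hierarchy level.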
The business and the use case can be the main factors in the decision. According to Oracle [10], Ralph Kimball proposes three cases where a snowflake implementation is not only acceptable but is also the key to a successful design:

• Large customer dimensions where, for example, 80 per cent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 per cent involve reliably registered customers about whom you collect many detailed data by tracking many dimensions
• Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products
• Multienterprise calendar dimensions, because each organization has idiosyncratic fiscal periods, seasons, and holidays

Furthermore, Kimball usually advises that it is not a good idea to expose end users to a physical snowflake design, because it almost always compromises understandability and performance. In his book (The Data Warehouse Toolkit) [1], he encourages not using the snowflake, since it does not have the ease of use that the star schema has. Even the advantages that the snowflake provides, like saving disk space and the ability to browse within a dimension, are, Kimball thinks, minor issues compared with the primary objectives that the design should handle: performance, optimization and ease of use. We can sacrifice these advantages to gain the primary objectives [1].
Complexity is determined not only by querying and loading, but also by the changes that can be made in the future. For that purpose, Christian Kaas, Torben, and Rasmussen [11] made a study about schema evolution for stars and snowflakes. They investigated how both models respond to changes, instance adaptation, and the impact on existing queries, by implementing eight evolution operations that cover the insertion and deletion of dimensions, levels, dimension attributes, and measure attributes, and studying the evolution properties of star and snowflake schemas. They came to the conclusion that in all cases the star schema was either superior to the snowflake schema or had the same problems, and it is considerably more robust towards schema changes than the snowflake schema.

D. OLAP and Star Schema

Within the research, I found that in the past there was a separation between the OLAP concept and the star schema. In fact, OLAP represented the multidimensional model that the decision makers were looking for, and usually OLAP was presented logically as a cube, like figure 6. We can see that it gives highly interactive support to decision makers by providing the functionality needed for summarizing, viewing, and consolidating the information available in data warehouses [1]. At that time, vendors were implementing OLAP in two types: relational OLAP (ROLAP) and multidimensional OLAP (MOLAP). The physical implementation of ROLAP is simply a star schema, and MOLAP is an array format. MOLAP can be considered the nearest approach to the cube model, and many vendors like Oracle, Microsoft and IBM provided MOLAP solutions, but the hard effort to maintain and administrate them, especially with the evolution of the data and its amount, made vendors concentrate on the ROLAP approach, especially when you know that MOLAP has two storage layers and indexing techniques that place high limitations on DW performance and data scalability [2]. So, with the evolution in database systems, you would find that papers discussing the implementation of OLAP on those systems concentrate on ROLAP (the star schema) rather than MOLAP (the OLAP cube). According to Alfredo Cuzzocrea, Ladjel Bellatreche and Il-Yeol Song [12], the implementation of the OLAP cube over Big Data is considered an open problem and an actual research trend in the field of data warehousing. They highlighted the reasons and suggested the future research directions that could serve in this direction.

Fig. 6. OLAP Cube example

E. Query on Star schema

The main idea of the star schema is making the data available in a simple model, which simplifies loading and accessing the data. We have seen that the star schema structure distributes the data into a large fact table connected to several dimension tables. In general, when we make a query on a star schema for a specific pattern, it does not only mean reading the data as it is, but also includes data processing based on business needs. In fact, it is a reflection of the OLAP operations that we discussed before in 2-B. The procedure of executing a query includes [5]:

• Selection of required facts: first, the end-user decides which facts should be selected by identifying the fact tables to be used.
• Selection of required dimensions: by joining dimensions to the query, end-users add the required context.
• Limiting the scope of facts: the facts are filtered according to dimensional values (either from the fact table or from joined dimensions).
• Summarization of facts: the raw facts are summarized with aggregate functions, such as SUM() or COUNT().

The first thing that comes to mind when we talk about the query is SQL, but we will see later that different approaches came along based on the evolution of database management systems and data warehouse solutions. A query on a star schema is called a star query, and joining its tables is called a star join [1]. The art of executing an efficient query is called optimization, and measuring the speed of the query is called performance.
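The four steps above can be traced in a single star query. The following is an illustrative SQLite sketch; the schema and values are assumptions, not the paper's example:

```python
import sqlite3

# Tiny star schema to walk through the four-step query procedure.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE DimStore (StoreKey INTEGER PRIMARY KEY, Region TEXT);
    CREATE TABLE FactSales (StoreKey INTEGER REFERENCES DimStore(StoreKey),
                            Revenue REAL);
    INSERT INTO DimStore VALUES (1, 'North'), (2, 'South');
    INSERT INTO FactSales VALUES (1, 10.0), (1, 20.0), (2, 99.0);
""")

total, n = cur.execute("""
    SELECT SUM(f.Revenue), COUNT(*)             -- step 4: summarization of facts
    FROM FactSales f                            -- step 1: selection of required facts
    JOIN DimStore d ON f.StoreKey = d.StoreKey  -- step 2: selection of dimensions
    WHERE d.Region = 'North'                    -- step 3: limiting the scope
""").fetchone()
print(total, n)  # 30.0 2
```

Each clause of the star query corresponds to one step of the procedure, which is why the star schema keeps such queries readable even for non-technical users.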
and measuring the speed of the query is called Performance. speeds, and scopes. It is important to recognize that to
These two factors are the main indicators that differentiate figure out how the consequences of that evolution effect
between data warehouse vendors. They are struggling to on star schema implementation. The most commonly used
provide access to the data with an optimal query that has systems are- Relational database systems (RDBMS), Big
a high speed. Optimization is vital to the performance of the data management with the distributed file system (DFS),
database system. Each vendor has its own query optimizer Nonrelational data model (NoSQL), and (NewSQL). a dif-
that makes an execution plan to give a robust query and ferent classification can be done based
utilizes the system resources to provide a result with a high
performance (speed) [5]. But How Star Query is affecting A. Relational database systems (RDBMS)
performance? simply by three operations, Star joins, filtering, The most commonly used systems, and had the biggest
and aggregations. When you make a query that includes share of the market at the time object-oriented databases were
these actions, you will initiate two implicit operations, data created. It came from the relational model which is based
loading and data scanning. The mechanism of these two on representing data into tables with rows and columns. A
operations differs based on vendor specifications, and it relies table must contain keys, Primary key to differentiate between
on several factors, like the algorithm used, hardware and rows, and Foreign key that refer to other tables to create a link
storage techniques. So when we define a query with multiple between their data. These links are known by Integrity Rules,
joins, with several filter conditions, and ask for aggregations which is responsible to ensure that the data they contain stay
with several dimension, then it would raise the workload accurate and are always accessible [15]. RDBMSs typically
internally and in terms of loading and scanning, and that can use SQL and Stored procedures to manage structured data
be considered as a physical definition of complex query[2]. and allow JOIN-operations to combine the different tables.
For this reason, providing means for increasing the perfor- Both of them make RDBMSs able to handle queries and
mance of a data warehouse for analytical queries and other transactions that ensure efficient, correct, and robust data
types of data processing is one of the important research processing even in the presence of failures [4] [15]. Key
and technological areas [13]. Many solutions were provided features of RDBMSs are ACID (Atomicity, Consistency,
on the past to enhance the performance related to the data Isolation, and Durability) [16], and besides Relational model,
structure, like providing Materialized view, summary tables, RDBMSs can also handle dimensional model when they deal
or even Flat Table within star schema [2], which means with structured data and applicable to some semi-structured
transferring the latency cost to the ETL process. providing data. However, The evolution of Big data shows that RDBMS
indexes is another solution to enhance join operation and has a shortage. With static schema, no linear query execution
scan operation, like bitmap indexing (which is not supported by the snowflake schema, according to Kimball [1]), join indexing, and bitmap-join indexing [2]. Later, new challenges came to data warehouse and business intelligence (BI) systems, such as making these systems near real-time, which means keeping them in sync with the original sources by enhancing the ETL and query processes. Query transformation is another approach that serves this direction. It is based on conducting individual tasks such as translating values, deriving new calculated fields, joining tables at the source, converting date fields, and reformatting field sizes, table names, and data types [14]. To make that practical, all major vendors offer varying strategies in parallel processing, data partitioning, and query extensions [2]. These strategies differ between vendors, since performance relies on three DW components: the hardware used, the data storage, and the database management system. It is important to know how each vendor uses these components to shape its solution; that will tell us how they implement the star schema physically and in which direction they concentrate their efforts to enhance star query performance [13].

4. DATABASE MANAGEMENT SYSTEMS

The evolution of database systems over the past years was mainly related to the data content and to responding to the Big Data phenomenon. Several types were provided in order to systematically and spontaneously tackle the growing varieties and vagaries of data structures, schemas, and sizes.

Because of long response times and unstable query plans, RDBMS is poor in scalability. It is very expensive and difficult to handle: to scale a relational database, it has to be distributed onto multiple servers, which makes handling tables across different servers difficult and leads to complex queries that affect query performance.

B. Data File distribution and Data Engines

In the past, parallelism was the most significant technique used for processing massive databases, and all major vendors of database management systems offered data partitioning and parallel query processing [2]. Later, with the evolution in data volume and structure, which later became known as Big Data, traditional databases were not capable of handling unstructured data and high volumes of real-time datasets. That forced vendors to come up with solutions that combine structured and unstructured data, or, in other words, that handle the main attributes of Big Data: volume, velocity, and variety (the 3Vs) [17]. Distributed file systems (DFS) came to address the shortages in scalability and performance that RDBMS has. A DFS is responsible for storing and sharing files over a set of connected nodes with reliability, scalability, availability, and efficiency. Google started this approach by providing the Google File System (GFS), and based on it, Hadoop offered its own Distributed File System (HDFS). Basically, Hadoop is a framework (ecosystem) that manages the processing and the available hardware, combining the software platform and the hardware to gain flexibility in storing and processing large volumes of both structured and unstructured data.
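The partition-then-process-in-parallel pattern that these engines apply at cluster scale can be sketched in a few lines of Python. This is a minimal toy illustration, not code from any of the systems discussed: the dataset, function names, and worker count are all hypothetical, and real engines partition data across machines rather than across local processes.

```python
from multiprocessing import Pool
from collections import Counter
from functools import reduce

def count_words(chunk):
    """Map step: compute a partial word count for one data partition."""
    return Counter(word for line in chunk for word in line.split())

def partition(data, n):
    """Split the input into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_word_count(lines, workers=4):
    """Process partitions in parallel, then merge the partial results."""
    with Pool(workers) as pool:
        partials = pool.map(count_words, partition(lines, workers))
    # Reduce step: combine the per-partition counters into one result.
    return reduce(lambda a, b: a + b, partials, Counter())

if __name__ == "__main__":
    lines = ["star schema", "snowflake schema", "star query"] * 100
    print(parallel_word_count(lines)["schema"])  # 200
```

The same divide-and-conquer shape, with the merge step distributed as well, underlies the MapReduce-style processing models discussed next.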
Structured and unstructured data are taken as data transactions in whatever form, and those transactions are processed across commodity hardware in parallel. The main workload in this approach relies on the data processing model and the data engines rather than on data storing and formatting. It started with MapReduce, which uses a divide-and-conquer strategy to perform computations [17]. The simplicity of its principle and basic concept made MapReduce a key to success for processing, analyzing, and managing large datasets in many implementations, but that did not remove the difficulties it presents in developing, optimizing, and maintaining its functions, especially in large-scale projects. These difficulties gave rise to query platforms based on MapReduce: layers constructed directly on top of MapReduce, designed to translate queries and scripts into executable native MapReduce jobs and to address other limitations, such as join operations and the handling of streaming data. Platforms like Hive, Impala, and Spark not only provide a SQL-like query interface but also make it possible to treat DFS systems as schemas. Later, several Big Data processing models were suggested, either combined with MapReduce or as replacements for it, such as the Resilient Distributed Dataset (RDD) and Bulk Synchronous Processing (BSP) [17], and HDFS evolved along with these models by gaining a platform and a SQL-like query interface.

For example, Spark started with the RDD to extend Hadoop into a general-purpose framework and to replace the MapReduce model with a better solution. An RDD is an immutable distributed collection of data elements, partitioned across the nodes of the cluster, that can be operated on in parallel through a low-level API offering transformations and actions. Later, Spark introduced the DataFrame, built on top of the RDD. Instead of requiring low-level programming, the DataFrame offers a better interface over RDDs, especially for structured data. It supports SQL for high-level data processing, which gives the ability to treat data within a DFS much like data within a traditional database, through a schema.

Fig. 7. Evolution of data and big data process engines

C. Nonrelational data Store (NoSQL)

NoSQL can be read as "No SQL" or "Not only SQL". NoSQL systems are distributed, non-relational database systems that are not primarily built on tables, since the data can take many forms, such as document-based, graph-based, and object-based, and they generally do not use SQL for data manipulation [4]. NoSQL has several characteristics that encourage vendors to adopt it in their data warehouse solutions [18]:

• Simple and flexible. NoSQL data stores offer flexible schemas or are sometimes completely schema-free, and they are designed to handle a wide variety of data structures.
• Ability to scale horizontally over many commodity servers. Some data stores provide data scaling, while others are more concerned with read and/or write scaling.
• High availability. Many NoSQL data stores are aimed at highly distributed scenarios and consider partition tolerance unavoidable.
• Unlike RDBMS, they do not support ACID transactions. NoSQL data stores are sometimes referred to as BASE systems [19] (Basically Available, Soft state, Eventually consistent).

NoSQL databases can be classified into four categories: key-value stores, document stores, column stores, and graph stores. Other papers classify them into aggregate-oriented and non-aggregate-oriented [19]. In general, choosing one of them for data storage relies mainly on the business needs and on the main features that differ between them, which are summarized at the end of this section in Table I.

Key-value stores have a simple data model based on key-value pairs, which resembles an associative map or dictionary. The basic concept is that of a globally distributed hash table whose keys lead to different database servers scattered all over the world. Each data item is converted into a key using some unique formula, which is stored in a lookup table or directory. When the data are needed, the key is converted into the location of the data, and the data are retrieved accordingly [19]. The value is opaque to the data store and can be used to hold any arbitrary data, including an integer, a string, an array, or an object, providing a schema-free data model [18]. Along with being schema-free, key-value stores are very efficient, fast, and scalable, but since many data structures cannot be modelled easily as key-value pairs, they are not suitable for scenarios requiring relations or structures. Any functionality requiring relations, structures, or both must be implemented in the client application interacting with the key-value store [18].

Document stores provide another derivative of the key-value data model, using keys to locate documents inside the data store, but they are more complex than key-value stores. As in a key-value store, there is a key and a value, but in a document store the value is semi-structured data, which gives the ability to include hundreds of attributes, where the number and type of attributes can differ from row to row; this is what makes the value a document. Most document stores represent documents using the JSON format or some format derived from it. A set of documents is called a collection. Collections
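The key-value lookup flow and the schema-free document model described above can be sketched together in a toy in-memory store. This is purely illustrative (the class name, the number of "servers", and the sample documents are all hypothetical): a key is hashed to pick one of several servers, and the value stored there is an opaque JSON document whose attributes may differ from one entry to the next.

```python
import hashlib
import json

class TinyDocumentStore:
    """Toy sketch: hash a key to locate a 'server', store a JSON document."""

    def __init__(self, n_servers=3):
        # Each "server" is just a local dict standing in for a remote node.
        self.servers = [dict() for _ in range(n_servers)]

    def _locate(self, key):
        # The "unique formula": hash the key, then map it onto a server.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def put(self, key, document):
        self._locate(key)[key] = json.dumps(document)

    def get(self, key):
        return json.loads(self._locate(key)[key])

store = TinyDocumentStore()
# Two documents with different attributes -- no shared schema required.
store.put("order:1", {"customer": "Ana", "total": 99.5})
store.put("user:7", {"name": "Lee", "tags": ["vip", "beta"], "age": 41})
print(store.get("user:7")["tags"])  # ['vip', 'beta']
```

A pure key-value store would stop at opaque `put`/`get`; the document variant differs only in that the stored value is structured enough (here, JSON) for the store to index and query attributes inside it.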
discarding unwanted data in rows by performing I/O only on the blocks corresponding to the columns that are actually being read [19]. Besides that, a prior definition of columns is not required, which offers great flexibility in storing any data type [18]. Graph stores are considered to be the best