
Data Engineering

Chapter 3
Data Storage and Data Models
By Dr. Ghazi Al-Naymat

Data Storage

• Data storage in a data science process means keeping the useful data that you may
later analyze to extract actionable insights.

• Data storage is a big deal. It is usually handled by a traditional database, but for
big data, companies use databases, data warehouses, and data lakes.
What’s a database?
• A database is a storage location that houses structured data. Relational databases require schemas and are not a
good fit for unstructured or semi-structured data. Because of this rigid schema, they are not suited to be the centralized
place to store data from multiple sources where the raw data varies in format and structure.
• For all organizations, the use cases for databases include:
➢ Creating reports for financial and other data
➢ Analyzing relatively small datasets
➢ Automating business processes
➢ Auditing data entry
• Popular databases are:
➢ Oracle
➢ PostgreSQL
➢ MongoDB
➢ Redis
➢ Elasticsearch
➢ Apache Cassandra
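
The rigid schema described above can be illustrated with a minimal sketch. It uses SQLite through Python's standard sqlite3 module purely as a stand-in for a relational database; the invoice table and its columns are hypothetical and not part of the original slides.

```python
import sqlite3

# In-memory relational database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE invoice (
        id     INTEGER PRIMARY KEY,
        client TEXT NOT NULL,
        amount REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

# A row that matches the schema is accepted.
conn.execute("INSERT INTO invoice (client, amount) VALUES (?, ?)", ("Acme", 120.5))

# A row that violates the schema (client is missing) is rejected.
rejected = False
try:
    conn.execute("INSERT INTO invoice (client, amount) VALUES (?, ?)", (None, 10.0))
except sqlite3.IntegrityError:
    rejected = True

# A record carrying an undeclared field cannot be stored as-is.
unknown_column = False
try:
    conn.execute(
        "INSERT INTO invoice (client, amount, notes) VALUES (?, ?, ?)",
        ("Acme", 5.0, "free text"),
    )
except sqlite3.OperationalError:
    unknown_column = True

row_count = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
```

Only the first insert succeeds, which is why data whose shape varies from source to source is awkward to centralize in a relational table.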
Relational (SQL) vs non-relational (NoSQL)
• When it comes to choosing a database, one of the biggest decisions an organization
may have to make is whether to pick a relational (SQL) or non-relational (NoSQL)
data structure. While both are good choices, each has clear advantages and
disadvantages, which are discussed in Chapter 4.

• The amount of data a single database can store is limited, so enterprises tend
to use data warehouses, which are designed to handle much larger volumes of data.
What’s a data warehouse?
• Data warehouses are large storage locations for data accumulated from a wide
range of sources.

• Data warehouses are popular with mid-size and large companies as a way of sharing
data and content across teams. They help management make “data-driven” decisions.

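• Popular data warehouse platforms and related technologies include:
➢ Amazon Redshift
➢ Microsoft Azure (Azure Synapse Analytics)
➢ Google BigQuery
➢ Star and snowflake schemas (warehouse design patterns rather than products)
➢ Micro Focus Vertica
➢ Teradata
➢ Amazon DynamoDB (strictly a NoSQL key-value database)
➢ PostgreSQL…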
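
As a rough illustration of how a warehouse gathers data from several sources to support decisions, here is a minimal star schema sketch: one fact table joined to one dimension table. It runs on SQLite via Python's sqlite3 module only for convenience; the table names, source names, and figures are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: a central fact table plus one dimension table.
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        source     TEXT NOT NULL,   -- operational system the row came from
        amount     REAL NOT NULL
    );
    """
)

conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?)",
    [(1, "books"), (2, "games")],
)
# Rows accumulated from two different (made-up) source systems.
conn.executemany(
    "INSERT INTO fact_sales (product_id, source, amount) VALUES (?, ?, ?)",
    [(1, "web_shop", 20.0), (1, "retail_pos", 15.0), (2, "web_shop", 60.0)],
)

# A typical reporting query: total sales per category across all sources.
totals = conn.execute(
    """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
```

Here totals holds one row per category, the kind of aggregated view a management report would be built on.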
What’s a data lake?
• A data lake is a large storage repository that holds a huge amount of raw data in its
original format until you need it. Data lakes address the biggest limitation of data
warehouses: their lack of flexibility.

• Structure is applied only when the data is retrieved (schema on read). This is
ideal for data scientists and data analysis developers, who can create new data models
on the fly, but data lakes do not provide the same reporting capabilities and ease of
use for business users.
• Storing data in data lakes is much cheaper than in a data warehouse. Data lakes are
very popular in the modern stack because of their flexibility and costs, but they are
not a replacement for data warehouses or relational databases.

• Popular data lake technologies are:
➢ Hadoop
➢ MS Azure
➢ Amazon S3 (Amazon Simple Storage Service)
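
The idea of keeping raw data in its original format and applying structure only on retrieval can be sketched as follows. A temporary local directory stands in for the lake, and the event records and the read_purchases helper are hypothetical, chosen only to show the pattern.

```python
import json
import tempfile
from pathlib import Path

# Raw records with different shapes, stored exactly as they arrive.
raw_events = [
    {"user": "u1", "action": "click", "ts": "2024-01-01T10:00:00"},
    {"user": "u2", "action": "purchase", "amount": "19.99"},
    {"device": "sensor-7", "temperature": 21.4},
]


def read_purchases(path):
    """Apply structure at read time: keep purchases as (user, amount) pairs."""
    purchases = []
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        if record.get("action") == "purchase":
            purchases.append((record["user"], float(record["amount"])))
    return purchases


# A temporary directory plays the role of the data lake in this sketch.
with tempfile.TemporaryDirectory() as lake_dir:
    raw_file = Path(lake_dir) / "events.jsonl"
    # Writing needs no schema: every record is stored as one JSON line.
    raw_file.write_text(
        "\n".join(json.dumps(event) for event in raw_events), encoding="utf-8"
    )
    purchases = read_purchases(raw_file)
```

Nothing was validated when the file was written; the reader decides which fields matter, which is flexible for analysts but harder for business users who expect ready-made reports.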
Data warehouse vs data lake?
Data Models – RDBMS (SQL)
• Data modeling is the process of producing a descriptive diagram of relationships between
various types of information that are to be stored in a database.

• A data model describes information in a systematic way that allows it to be stored and retrieved
efficiently in a Relational Database Management System (RDBMS), such as SQL Server,
MySQL, or Oracle. The model can be thought of as a way of translating an accurate description
of things in the real world, and of the relationships between them, into rules that can be
followed and enforced by computer code.
Data Models
• There are three stages or types of data model (called schemas):
➢ Conceptual – This is the first step in the modeling process, which imposes a theoretical order on data as it
exists in relationship to the entities being described, often real-world artifacts or concepts.
➢ Logical – Taking the semantic structure built at the conceptual stage, the logical modeling process attempts to
impose order by establishing discrete entities, key values, and relationships in a logical structure that is
brought into at least 4th normal form (4NF).
➢ Physical – Despite the name, this stage is not physical at all (the term avoids using “logical” twice). It breaks the
data down into the actual tables, clusters, and indexes required for the data store.
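
The step from a logical to a physical model might look like the sketch below. The department and employee entities are hypothetical (loosely inspired by a company database), the logical model is written as a plain Python dictionary, and SQLite is assumed as the target data store only because it ships with Python.

```python
import sqlite3

# Logical model: entities, keys, and relationships, independent of any product.
logical_model = {
    "department": {"key": "dept_id", "attributes": ["name"]},
    "employee": {
        "key": "emp_id",
        "attributes": ["name", "salary"],
        "references": {"dept_id": "department"},
    },
}

# Physical model: concrete tables and an index for one specific data store.
physical_ddl = """
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    salary  REAL NOT NULL,
    dept_id INTEGER NOT NULL REFERENCES department(dept_id)
);
CREATE INDEX idx_employee_dept ON employee(dept_id);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript(physical_ddl)

conn.execute("INSERT INTO department (dept_id, name) VALUES (1, 'Research')")
conn.execute("INSERT INTO employee (name, salary, dept_id) VALUES ('Sara', 3000.0, 1)")

# The relationship from the logical model is now enforced by the data store.
fk_enforced = False
try:
    conn.execute("INSERT INTO employee (name, salary, dept_id) VALUES ('Omar', 2500.0, 99)")
except sqlite3.IntegrityError:
    fk_enforced = True

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

The index on employee(dept_id) is a purely physical decision: it changes how fast lookups run, not what the model means.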
Company DB – ER Model
Company DB – Relational database schema
Company DB – Physical Model
Data Models - NoSQL
• NoSQL databases such as MongoDB, Cassandra, and HBase are among the most widely adopted
industry tools. They are commonly grouped into four data models:
➢ Key-value model
➢ Column (wide-column) model
➢ Document model
➢ Graph model
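
To make the four models concrete, the sketch below expresses one hypothetical customer in each of them using plain Python structures. It does not use any real NoSQL client library; the keys, field names, and the neighbors helper are assumptions made for illustration.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value_store = {"customer:42": json.dumps({"name": "Lina", "city": "Amman"})}

# Wide-column: row key -> column family -> columns; rows may differ in columns.
wide_column_store = {
    "customer:42": {
        "profile": {"name": "Lina", "city": "Amman"},
        "orders": {"2024-01-05": "order:7"},
    },
}

# Document: a nested, self-describing record.
document_store = [
    {
        "_id": 42,
        "name": "Lina",
        "city": "Amman",
        "orders": [{"order_id": 7, "items": ["book", "pen"]}],
    },
]

# Graph: nodes plus labeled edges between them.
nodes = {"customer:42": {"name": "Lina"}, "order:7": {"total": 12.5}}
edges = [("customer:42", "PLACED", "order:7")]


def neighbors(node_id, label):
    """Return the nodes reached from node_id over edges with the given label."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == label]


# The same question, "what did customer 42 order?", asked of each model.
city = json.loads(key_value_store["customer:42"])["city"]  # value must be decoded first
order_ref = wide_column_store["customer:42"]["orders"]["2024-01-05"]
items = document_store[0]["orders"][0]["items"]
placed = neighbors("customer:42", "PLACED")
```

The access pattern differs in each case: a key lookup, a row-and-column lookup, a walk through a nested document, and an edge traversal.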
Document model
Graph DB
Key-Value DB
Wide-Column DB
