Describe core data concepts

Summary

• Structured data – Here the data follows a fixed structure, or predefined schema.
• Here the data is normally represented in the form of tables.
• You can also have relationships between the tables of data.
• Semi-structured data – Here there is a defined structure, but individual records can deviate from it or carry more fields than it defines. This makes it flexible in nature.
• A commonly used format is JSON – JavaScript Object Notation.
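
A minimal sketch of what "semi-structured" means in practice, using Python's standard `json` module. The customer records and field names are made up for illustration: both records share a core structure, but the second one carries an extra field.

```python
import json

# Two hypothetical customer records: they share a core structure,
# but the second one carries an extra "phones" field.
raw = """
[
  {"id": 1, "name": "Ann"},
  {"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"]}
]
"""

customers = json.loads(raw)

# Collect the field names used by each record.
fields_per_record = [sorted(record.keys()) for record in customers]
print(fields_per_record)
```

Both records parse without any schema change, which is the flexibility the bullets above describe.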

• Unstructured data – Here the data has no predefined schema and can exist in binary form.
• Examples – Documents, images and videos.

• Text files – Here the data is represented as text-based files.
• A common representation is CSV – comma-separated values.
• Here the data is in a human-readable format.
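
As a small illustration, the sketch below reads CSV text with Python's standard `csv` module. The column names and values are invented for the example.

```python
import csv
import io

# Hypothetical CSV content: a header row followed by data rows.
text = "id,name,city\n1,Ann,Oslo\n2,Bob,Lima\n"

# DictReader uses the first row as the field names.
rows = list(csv.DictReader(io.StringIO(text)))
for row in rows:
    print(row["id"], row["name"], row["city"])
```

Note that every value comes back as a string; a CSV file carries no type information.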

• Avro – This is a row-based format for storing data.
• Here each record of data contains both the structure and the data for the record.
• The header information is stored in JSON format, and the data is stored in binary format.
• This is a good format when it comes to compressing data and using less storage for data.

• ORC – This is the Optimized Row Columnar format.
• Here the data is organized into columns, which is different from the normal row-based format.
• Here the data is divided into stripes. Each stripe contains the data for one or more columns.

• Parquet – This is a column-based format for storing data.
• This format is also used when you want to store data efficiently; it is compressed in nature.
• Here the data is separated into different row groups.
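
The idea of row groups holding column data can be sketched in plain Python. This is only a conceptual model, not the actual Parquet file layout (which also involves encoding, compression and metadata); the records and the `to_row_groups` helper are hypothetical.

```python
# Hypothetical records in the usual row-based layout.
rows = [
    {"id": 1, "city": "Oslo", "amount": 10},
    {"id": 2, "city": "Lima", "amount": 20},
    {"id": 3, "city": "Oslo", "amount": 30},
    {"id": 4, "city": "Pune", "amount": 40},
]


def to_row_groups(records, group_size):
    """Split records into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(records), group_size):
        chunk = records[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        groups.append(columns)
    return groups


row_groups = to_row_groups(rows, group_size=2)

# Reading one column only touches that column's values in each group.
total = sum(sum(group["amount"]) for group in row_groups)
print(row_groups[0])
print(total)
```

Keeping the values of one column together is what makes column-based formats efficient for queries that read only a few columns.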

• Database – Here you store relational data in the form of tables.
• You can also establish relationships across tables.
• You can use the different database options on the Azure platform.

• Transaction data processing – Here data is processed via the use of transactions.
• Systems normally need to sustain millions of transactions per day.
• The underlying database system must be capable of sustaining these transactions.
• This is also referred to as an OLTP solution – Online Transactional Processing system.

• Transaction data processing – Here the systems conform to the ACID concepts.
• Atomicity - Here each transaction is treated as a single unit. The transaction is either completed or rejected as a whole.
• Consistency - Here the data must move consistently from one valid state to another.
• Isolation - Here concurrent transactions should not interfere with each other.
• Durability - Once a transaction is committed, it should remain committed.
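
Atomicity can be demonstrated with a small sketch. It uses SQLite through Python's standard `sqlite3` module as a stand-in for a full database server; the `accounts` table and the `transfer` helper are made up for the example. The second update violates a CHECK constraint, so the whole transaction is rolled back, including the first update that had already succeeded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, source, target, amount):
    """Move money between two accounts as one unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

After the failed transfer both balances are unchanged: the credit to the target account was undone together with the rejected debit.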

• Analytical data processing – Here analysis is performed on the underlying data.
• First you would have your raw data stored in a data lake solution.
• Then the data is transformed and sent to a data warehouse.
• Analysis is performed on the data in the data warehouse.
• This is normally known as an OLAP system – Online Analytical Processing system.

• Azure virtual machine – This is Infrastructure as a Service (IaaS).
• Here you spin up a machine in the cloud.
• Then you install the database software.
• Then you host the database on the database server.

• Azure SQL Database – This is Platform as a Service (PaaS).
• Here the compute infrastructure is managed for you.
• You just need to start working with the database.
• Azure SQL Database – This is based on Microsoft SQL Server.
• Other managed options – Azure Database for MySQL, Azure Database for MariaDB, Azure Database for PostgreSQL.

• Azure Cosmos DB – This is a fully managed NoSQL database solution.
• Azure Storage Accounts – This is storage in the cloud.
• Here you get access to services – Blob, File, Queue and Table.
• Azure Data Factory – This service can be used to orchestrate pipelines.
• The pipelines can be used to transform data and transfer it from the sources to the destination.

• Azure Synapse – This is used to host a data warehouse.
• You also get access to other features such as pipelines and Spark for data processing.
• Azure Databricks – Here again you can make use of Spark clusters for data ingestion and processing.
• Azure HDInsight – This provides a managed platform for Apache open-source solutions.

• Azure Stream Analytics – This is used to process data in near real time.
• This service allows you to capture and process streaming data.
• Azure Data Explorer – This is used to analyze high volumes of data in near real time.
• Microsoft Purview – This is used for data governance.
• Microsoft Power BI – This is used for generating visualizations.

Identify considerations for relational data on Azure

SQL Statements

SQL Statements - Types

• Transact-SQL (T-SQL) – This is used by Microsoft SQL Server and Azure SQL-based services.
• pgSQL – This is used in PostgreSQL.
• PL/SQL – This is used in Oracle.

SQL Statements - Groups

• DDL Statements – These are used to work with tables and other objects in a database. For example, being able to create a table.
• DCL Statements – These are used to manage permissions to objects in the database.
• DML Statements – These are used to work with the rows in tables.

DDL Statements

1. CREATE – This can be used to create a table or another object in the database.
2. ALTER – This can be used to alter a table. For example, you can add another column to a table.
3. DROP – This can be used to remove an object such as a table from the database.
4. RENAME – This can be used to rename an object in the database.

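
The four DDL operations can be tried out with a short sketch. It uses SQLite through Python's standard `sqlite3` module, so the syntax is SQLite's dialect rather than T-SQL (for example, renaming is written as `ALTER TABLE ... RENAME TO`). The `course` table and the `table_names` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def table_names(conn):
    """List the tables that currently exist in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


# CREATE: add a new table.
conn.execute("CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT)")

# ALTER: add another column to the table.
conn.execute("ALTER TABLE course ADD COLUMN price REAL")
columns = [row[1] for row in conn.execute("PRAGMA table_info(course)")]

# RENAME: SQLite spells this as ALTER TABLE ... RENAME TO.
conn.execute("ALTER TABLE course RENAME TO courses")
after_rename = table_names(conn)

# DROP: remove the table from the database.
conn.execute("DROP TABLE courses")
after_drop = table_names(conn)

print(columns, after_rename, after_drop)
```
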
DCL Statements

1. GRANT – This can be used to grant specific permissions.
2. DENY – This can be used to deny certain permissions.
3. REVOKE – This can be used to remove a permission that was granted earlier on.

DML Statements

1. SELECT – This can be used to read the rows from a table.
2. INSERT – This can be used to add rows to a table.
3. UPDATE – This can be used to modify existing rows in the table.
4. DELETE – This can be used to delete existing rows in the table.

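
A minimal sketch of the four DML statements, again using SQLite via Python's `sqlite3` module as a stand-in for a server database. The table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT, price REAL)")

# INSERT: add rows to the table.
conn.executemany(
    "INSERT INTO course (id, title, price) VALUES (?, ?, ?)",
    [(1, "Data basics", 10.0), (2, "Cloud basics", 12.0)],
)

# UPDATE: modify an existing row.
conn.execute("UPDATE course SET price = 15.0 WHERE id = 1")

# DELETE: remove an existing row.
conn.execute("DELETE FROM course WHERE id = 2")

# SELECT: read the rows that are left.
remaining = conn.execute("SELECT id, title, price FROM course").fetchall()
print(remaining)
```
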
Views


1. Virtual – This is a virtual table where the contents are defined by a query.
2. Details – The view consists of named columns and rows.
3. Table – The data comes from a table referenced by the view.
4. Query – The query can take data from multiple tables.

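
The sketch below shows a view whose query takes data from two tables. It runs on SQLite through Python's `sqlite3` module; the `customer` and `orders` tables and the `customer_totals` view are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0);

-- The view stores the query, not a copy of the data.
CREATE VIEW customer_totals AS
SELECT c.name AS name, SUM(o.amount) AS total
FROM customer AS c JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name;
""")

query = "SELECT name, total FROM customer_totals ORDER BY name"
totals = conn.execute(query).fetchall()

# New rows in a base table show up the next time the view is queried.
conn.execute("INSERT INTO orders VALUES (13, 2, 10.0)")
updated = conn.execute(query).fetchall()
print(totals, updated)
```

Because the view is only a stored query, its results always reflect the current contents of the underlying tables.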
Stored procedures


1. SQL Commands – This is a set of SQL commands.
2. Run – You can run the stored procedure at any point in time.
3. Logic – Normally you would encapsulate some logic in the stored procedure.
4. Parameters – You can also define parameters for your stored procedure.

Azure SQL database

• Here you model the data in the form of tables.
• You store structured data in the tables.
• The tables are normally stored in a database.
• A table consists of columns and rows of data.

• SQL – This is a language that allows you to work with the underlying data.
• You can create a table or drop a table.
• You can insert data or update data in a table.
• You can also read data from within the table.

• Normalization – Here you structure your data based on different normal forms.
• These normal forms help to reduce data redundancy and improve data integrity.
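
How normalization reduces redundancy can be sketched with plain Python data structures. The order rows below are hypothetical: customer details are repeated on every order, and the sketch splits them into a customer table and an orders table that only references the customer.

```python
# Hypothetical denormalized order rows: customer details repeat on every order.
flat_orders = [
    {"order_id": 10, "customer_id": 1, "customer_name": "Ann", "city": "Oslo", "amount": 25},
    {"order_id": 11, "customer_id": 1, "customer_name": "Ann", "city": "Oslo", "amount": 5},
    {"order_id": 12, "customer_id": 2, "customer_name": "Bob", "city": "Lima", "amount": 40},
]

# Customer table: one row per customer, keyed by customer_id.
customers = {}
for row in flat_orders:
    customers[row["customer_id"]] = {"name": row["customer_name"], "city": row["city"]}

# Orders table: keeps only a reference (foreign key) to the customer.
orders = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"], "amount": r["amount"]}
    for r in flat_orders
]

# A change of city is now made in one place instead of on every order.
customers[1]["city"] = "Bergen"
print(customers)
print(orders)
```

With the customer details stored once, two orders can no longer disagree about the same customer's city, which is the integrity benefit mentioned above.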

• There are different options for hosting a database.
• IaaS – Infrastructure as a Service – Virtual Machines – Here you install the database software and manage the machine.
• PaaS – Platform as a Service – Azure SQL Database – Here even the underlying compute infrastructure is managed for you.

Your own server - Advantages

• Full control – You have full control over the underlying database engine.
• Security – You get to control all of the security aspects.
• Any version – You can use any database version.
• Integration – You can install custom tools for integration purposes.

Your own server - Downside

• Management – You have to manage the underlying infrastructure.
• Backups – You need to implement backups.
• High Availability – You need to manage high availability.
• Patching – You need to install updates.

• Subscription – This is used for billing purposes.
• Azure Active Directory – This is used for managing users within your Azure account.
• Resource – You create a resource based on a particular service.
• Resource Group – This is used for logically grouping your resources.
• Location – For most resources you need to specify a location for the deployment of the resource.

• Azure SQL Database – This is a completely managed service.
• The underlying compute infrastructure is managed for you.
• It also has options such as high availability and backups.

• Azure SQL Managed Instance – This is a deployment model that provides native integration with the Azure virtual network service.
• It provides near 100% compatibility with the latest SQL Server features.
• Here again the infrastructure is managed for you.
• Companies can also easily migrate their existing on-premises databases to the Managed Instance.

• MySQL is an open-source relational database management system.
• You can store your data in the form of tables.
• You can query for data using the Structured Query Language (SQL).
• Azure Database for MySQL is a fully managed database service.
• Here the underlying platform is managed by the service itself.
• Here you also get high availability, backups and patching.

• PostgreSQL is a free and open-source relational database management system.
• It has support for transactions that follow the ACID concepts – Atomicity, Consistency, Isolation and Durability.
• It also has support for views, foreign keys, triggers and stored procedures.
• Azure Database for PostgreSQL is a fully managed database service.
• Here the underlying platform is managed by the service itself.
• Here you also get high availability, backups and patching.

Describe how to work with non-relational data on Azure

Azure Storage Accounts

• This service allows you to store objects in the cloud.
• Here you can make use of different services – Blob, Queue, File and Table.
• There are also different types of storage accounts.

• Blob service – This service is optimized for storing large amounts of unstructured data.
• Use case examples – storing images, videos, log files, documents.
• In the blob service, you create a container, which is used to organize a set of blobs.
• Block blobs – These are used to store text and binary data.
• Page blobs – These are used to store virtual hard drive files that serve as disks for your Azure virtual machines.

• File service – This is used for hosting file shares in the cloud.
• These shares can be accessed via the SMB (Server Message Block) protocol.
• You can mount the file shares from Windows, Linux and macOS clients.

• Table service – This service is used for storing non-relational structured data.
• It's ideal for storing flexible data sets because the data does not have to conform to a fixed schema.
• In a table, you store entities. An entity is a set of properties.
• A property is a name-value pair.
• The partition key is used to split the data across various partitions, and the row key is used to identify an item within a partition.
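
The roles of the partition key and row key can be modelled with a small in-memory sketch. This is not the Azure SDK: the `table` dictionary and the `upsert` and `get` helpers are hypothetical and only illustrate how an entity is addressed by the two keys.

```python
from collections import defaultdict

# partition key -> row key -> entity (a set of name-value properties)
table = defaultdict(dict)


def upsert(partition_key, row_key, **properties):
    """Insert or replace the entity identified by the two keys."""
    table[partition_key][row_key] = properties


def get(partition_key, row_key):
    """Look up a single entity by partition key and row key."""
    return table[partition_key][row_key]


# Entities in the same table do not need the same properties.
upsert("Oslo", "c001", name="Ann", email="ann@example.com")
upsert("Oslo", "c002", name="Bob")
upsert("Lima", "c001", name="Eve", loyalty_points=120)

print(get("Lima", "c001"))
print(sorted(table["Oslo"]))
```

The same row key (`c001`) appears in two partitions without conflict, because a row key only has to be unique within its partition.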

Access tiers

• Hot – This is optimized for data that is accessed frequently.
• Cool – This is optimized for data that is infrequently accessed and stored for at least 30 days.
• Archive – This is optimized for storing data that is rarely accessed and stored for at least 180 days.

• The Archive access tier is good for long-term backups.
• You can set the access tier at the Storage account level to Hot or Cool.
• At the object level, you can also set the Archive access tier.

Data Redundancy

• Locally redundant storage (LRS) – Here data is copied synchronously three times within a single physical location in the primary region.
• Zone-redundant storage (ZRS) – Here data is copied synchronously across three Azure availability zones in the primary region.
• Geo-redundant storage (GRS) – Here data is copied synchronously three times within a single physical location in the primary region using LRS. It then copies your data asynchronously to a single physical location in the secondary region.
• Geo-zone-redundant storage (GZRS) – Here data is copied synchronously across three Azure availability zones in the primary region using ZRS. It then copies your data asynchronously to a single physical location in the secondary region.

Azure Cosmos DB

• This is a fully managed NoSQL database.
• The database provides fast response times and is highly scalable.
• Here the underlying infrastructure is completely managed by Azure.
• It is commonly used for web, mobile, gaming and IoT applications that need to handle massive amounts of data.

Cosmos DB API

• Core SQL API – If you need to query for items using Structured Query Language.
• MongoDB API – If you need to host a MongoDB compatible database.
• Cassandra API – If you need to host a Cassandra compatible database.
• Gremlin API – If you need to host a graph-based database.
• Table API – If you need to store data in the form of tables.

Describe an analytics workload on Azure

Data Warehousing

1. Data Ingestion – The first step can involve taking data from various sources (real-time streams and other data sources) and loading it into a data lake.
2. Transformation – Once you have the data in the data lake, you can look to cleanse and transform the data.
3. Loading data – Once you have the data in the state you desire, you can then load it into a data warehouse.
4. Visualization – You can analyze the data or even visualize the data in the data warehouse with the help of tools like Microsoft Power BI.


• The data in your data warehouse can again be represented as tables.
• But here the data is used mostly for analysis rather than for operational needs.
• Fact tables – These are tables that contain records of immutable facts, e.g. sales data.
• Dimension tables – These are tables that hold reference data and add context when analyzing the data in the fact tables, e.g. customer data.
• Star Schema – You can develop a star schema design by making use of fact and dimension tables.
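
A small star schema can be sketched with SQLite through Python's `sqlite3` module. All table names, columns and figures are hypothetical: one fact table of sales references two dimension tables, and the query aggregates the facts by a dimension attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    product_key INTEGER REFERENCES dim_product (product_key),
    quantity INTEGER,
    amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Ann', 'Norway'), (2, 'Bob', 'Peru');
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'), (3, 'Editor', 'Software');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 900.0), (1, 2, 2, 40.0), (2, 3, 1, 60.0), (2, 2, 1, 20.0);
""")

# Analyze the facts through a dimension: total sales amount per product category.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

The fact table sits in the middle and each dimension joins to it directly, which is what gives the design its star shape.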

• You can load data from various sources into the data lake.
• In the data lake, you can store your structured data, semi-structured data and even unstructured data.
• Data lakes normally offer schema-on-read, which is the ability to define tabular schemas on semi-structured data files at the time the data is read.
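
Schema-on-read can be illustrated with a short sketch in plain Python. The raw lines stand in for a semi-structured file in the lake, and the `read_with_schema` helper is hypothetical: the tabular schema is applied only when the data is read.

```python
import json

# Hypothetical raw file content in the lake: one JSON document per line.
raw_lines = [
    '{"id": 1, "name": "Ann", "city": "Oslo"}',
    '{"id": 2, "name": "Bob", "tags": ["new"]}',
]

# The schema is chosen by the reader, not enforced when the file was written.
schema = ["id", "name", "city"]


def read_with_schema(lines, columns):
    """Project each JSON document onto the requested columns."""
    for line in lines:
        doc = json.loads(line)
        yield tuple(doc.get(col) for col in columns)


table = list(read_with_schema(raw_lines, schema))
print(table)
```

Fields missing from a document come back as `None`, and fields outside the schema (such as `tags`) are simply ignored by this reader.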

• Batch processing – Here multiple records are processed in a single operation.
• The records can be collected over time and then processed together.
• You might want to carry out the processing of data at night, when the workload on operational systems is light.
• Here the latency is higher – You might only get the result after hours of processing.
• You have to make sure your data set is free from errors. Otherwise the batch process might result in errors.

• Stream processing – Here the data is processed in real time.
• Here the time window for collecting events is short. You could process the data, let’s say, every minute.
• Here the latency is lower, because you are processing the data more frequently.
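
Processing events in short time windows can be sketched as follows. The events and the `tumbling_window_sums` helper are hypothetical, and the sketch iterates over a finished list rather than a live stream; a real streaming engine would emit each result as its window closes.

```python
from collections import defaultdict

# Hypothetical events: (seconds since start, value).
events = [(5, 10), (42, 20), (61, 5), (118, 7), (125, 1)]


def tumbling_window_sums(stream, window_seconds=60):
    """Sum the values of the events that fall into each fixed-size window."""
    sums = defaultdict(int)
    for timestamp, value in stream:
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)


per_minute = tumbling_window_sums(events)
print(per_minute)
```

Each key is the start of a one-minute window, so a result is available about a minute after the first event of that window arrives, instead of hours later as in a nightly batch.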

• Azure Data Factory – This is a cloud service that you can use for your ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) workflows.
• You can create data-driven workflows that can be used for orchestrating data movement and transforming data at scale.
• You can ingest data from various sources, then perform the transformation, and then load the data into the destination.
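
The extract, transform and load steps can be sketched in plain Python. This is not the API of any Azure service: the source text, the `extract`, `transform` and `load` functions, and the `sales` table are all made up, with an in-memory SQLite database standing in for the destination.

```python
import csv
import io
import sqlite3

# Hypothetical source data with inconsistent formatting and one bad row.
SOURCE = "name,amount\n ann ,10\nBOB,twenty\nEve,30\n"


def extract(text):
    """Read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean up names and drop rows whose amount is not a number."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append((row["name"].strip().title(), amount))
    return cleaned


def load(conn, rows):
    """Write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE)))
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)
```

In an ELT workflow the last two steps swap: the raw rows are loaded first and the clean-up runs inside the destination system.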

• Pipeline – Here you create a set of activities.
• Activities – These define the actions that need to be performed on your data.
• Datasets – You have the input and output datasets that represent the input and output data for an activity.
• Linked Service – This is the connection to either the source or destination data store.
• Integration Runtime – This is the underlying compute environment that can be used for running your pipeline.

• Mapping data flow – Here you can design more complex data transformations.
• Here the data transformations are executed in the form of activities on Apache Spark clusters.
• You get a visual designer, so you don’t need any sort of coding experience.

• Azure Stream Analytics – This is a managed service that can be used for complex event processing.
• You can ingest data in real time from various sources – Azure Event Hubs, Azure IoT Hub or Azure Blob storage containers.
• You can use queries to process the data.
• You can define outputs where the data is sent after processing – Azure SQL Database, Azure Synapse etc.

• Azure Synapse – This is an enterprise analytics service.
• You can use this service to host your data warehouses.
• You can also use Spark technologies for processing your data.
• You can also make use of pipelines for your data integration needs.

• Azure Databricks – This is a set of tools that can be used for your enterprise data solutions.
• You can use your Spark clusters for analyzing your data.
• You can also use this service along with your Machine Learning services.

• Azure Data Explorer – This provides an interactive query experience when it comes to working with log and telemetry data.
• You can ingest data from a variety of sources – Azure Event Hubs, Azure Data Lake etc.
• You can perform data analytics with the use of the Kusto Query Language (KQL).

Data Analytics


1. Descriptive – Here the focus is on what is happening. Based on the data you have, you can determine what the current situation is.
2. Diagnostic – Here the focus is on why something is happening. Based on the data you have, you can decide why something is happening.
3. Predictive – Here the focus is on what will happen.
4. Prescriptive – Here the focus is on actions that can be taken to reach a desired goal.

• Let’s say a company is collecting data about its e-commerce application.
• Descriptive – What is the progress of the sales to date?
• Diagnostic – Why are some products not selling during the sale period?
• Predictive – What are the projected sales for the next quarter?
• Prescriptive – What actions can be taken to get sales back on track?
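
The difference between the descriptive and predictive questions can be sketched with a few lines of Python. The monthly figures are invented, and the straight-line forecast in the hypothetical `linear_forecast` helper is only one simple way to project a trend, not a recommended forecasting method.

```python
# Hypothetical monthly sales figures for three months.
monthly_sales = [100.0, 120.0, 140.0]

# Descriptive: what has happened so far?
total_to_date = sum(monthly_sales)


def linear_forecast(values, steps_ahead=1):
    """Fit a least-squares line through the values and extend it."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


# Predictive: what is likely to happen next?
next_month = linear_forecast(monthly_sales)
print(total_to_date, next_month)
```

Diagnostic and prescriptive analysis need more context than a single series of numbers, such as product, campaign or pricing data, so they are not covered by this sketch.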