12/1/22, 12:18 PM (99+) Azure Data Engineering: Azure Blob Storage vs.

Azure Blob Storage vs. Azure Data Lake Storage Gen2 | LinkedIn

Azure Data Engineering: Azure Blob Storage vs. Azure Data Lake Storage Gen2

Published on October 6, 2020

Heather Grandy
Biomedical Engineering
Master's Student @ The…

Disclaimer: Note that this article is not official Microsoft content. Please
visit Microsoft Docs to read official Microsoft content.

This article is geared towards helping readers prepare for the Azure Data Engineer

Associate certification as well as to simply learn about Azure Data & AI technologies.

In this article, I will be diving into Azure Blob Storage vs. Azure Data Lake Storage

Gen2 (ADLS Gen2) from the lens of a Data Engineer. The differences between these two

offerings are a common discussion topic in my customer workshops and are important to

understand when making architectural decisions as well as when preparing for your

certification.

Let’s start with the data – what kind of data can you store in Azure Blob or ADLS
Gen2?

https://www.linkedin.com/pulse/azure-data-engineering-series-part-1-blob-storage-vs-lake-grandy/ 1/18

Azure Blob storage and ADLS Gen2 are both well-suited for storing unstructured data.

Think videos, photos, audio files, text files, Excel files, and more! Since you are storing

data in an unstructured format, you cannot directly query data in either service. You will

need to leverage another service to begin querying and/or analyzing that data – that is a

topic for a future article.

How do you provision these services in Azure?

Both Azure Blob storage and ADLS Gen2 are provisioned through an Azure Storage

Account. To reduce administrative overhead, Azure Storage Accounts contain four

different Azure storage services – Blobs, Queues, Tables, and Files. You can use all or

just one of these services within a single storage account, up to the resource limits. In this

article, I will be focusing only on Blob storage, but I want to provide a brief overview of

each offering:

Azure Blob Storage: Object storage solution for the cloud. Blob storage is

optimized for storing massive amounts of unstructured data – a.k.a. data that does

not adhere to a particular schema or definition, such as text data, photos, videos,

etc. Blobs are organized by containers. By the way, Blob = Binary Large Object.

Azure Queues: Service for storing large numbers of messages that can be

accessed from anywhere in the world via authenticated calls using HTTP or

HTTPS. A queue can contain millions of messages, up to the total capacity limit of a

storage account. Queues are often used to create a backlog of work to process

asynchronously.

Azure Tables: Service for storing NoSQL (key/value) data with a schema-less

design. Table storage is often used to store flexible datasets such as user data for

web apps, device information, or other types of metadata. A storage account can

contain any number of tables, up to its capacity limit.

Azure Files: A fully managed file share service in the cloud, accessible via Server

Message Block (SMB) protocol or Network File System (NFS) protocol. Azure

Files can be used to completely replace or supplement traditional on-premises file

servers.
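
As a rough illustration of how the four services above share one storage account, the sketch below derives the default public endpoint of each service from the account name. The function name is hypothetical, and the pattern assumes the standard public-cloud endpoints with no custom domain.

```python
# Services that live side by side in a single storage account.
SERVICES = ("blob", "queue", "table", "file")


def service_endpoints(account_name: str) -> dict:
    """Return the default HTTPS endpoint for each storage service."""
    return {
        service: f"https://{account_name}.{service}.core.windows.net"
        for service in SERVICES
    }


if __name__ == "__main__":
    for service, url in service_endpoints("hgdp200storage").items():
        print(f"{service:>5}: {url}")
```

Because every endpoint starts with the account name, one name has to work for all four services at once.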

Once you create an Azure Storage Account, you will see these options in the resource

overview page from the Azure portal. This is shown in the screenshot below in my

storage account named hgdp200storage. Side note about storage account names – they

have to be globally unique within all of Azure! The Blob storage option, containers, is

indicated in red. Note that sensitive information from this overview page has been

redacted.
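
Global uniqueness can only be checked by Azure itself, but the format rules for a storage account name (3 to 24 characters, lowercase letters and digits only) can be checked locally. The helper below is a minimal sketch of that format check; its name is my own and it does not call any Azure API.

```python
import re

# 3 to 24 characters, lowercase letters and digits only.
_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the naming format only; uniqueness must be verified by Azure."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("hgdp200storage", "HGDP200", "my-storage", "ab"):
        print(candidate, is_valid_storage_account_name(candidate))
```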

A Deeper Look: Azure Blob Storage

As stated above, Azure Blob storage is optimized for storing massive amounts of

unstructured data. After creating an Azure Storage account, the next step is to create

containers which are used to organize a set of Blobs, like a directory in a file system –

similar, but not the same! Blob storage accounts are only capable of mimicking a

hierarchical folder structure; they do not support true directories. Once you have created

containers, you can store your Blobs (the actual files).
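
To make "mimicking a folder structure" concrete, here is a small in-memory sketch, not the Azure SDK. Blob names are plain strings that happen to contain slashes, and a "folder" listing is just a filter on a name prefix plus a delimiter. The function and the sample blob names are hypothetical.

```python
def list_virtual_directory(blob_names, prefix, delimiter="/"):
    """Emulate a one-level directory listing over a flat list of blob names."""
    folders, files = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something deeper than this level: report it as a virtual folder.
            folders.add(prefix + head + delimiter)
        else:
            files.append(name)
    return sorted(folders), sorted(files)


blobs = [
    "logs/2020/10/app.log",
    "logs/2020/11/app.log",
    "logs/readme.txt",
    "images/cat.png",
]
print(list_virtual_directory(blobs, "logs/"))
# (['logs/2020/'], ['logs/readme.txt'])
```

Note that `logs/2020/` never exists as an object of its own; it is inferred from the names of the blobs beneath it.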

To visualize the components of an Azure Storage account, I have included a diagram

below. In this case, there is a storage account named hgdp200storage with two containers,

resources and example. Within these containers, there are a few Blobs of various formats

stored. The purpose of this diagram is to demonstrate the relationship between Azure Storage

Account artifacts. Note that a storage account can hold zero or many containers, and

containers can hold zero or many blobs.
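
The same relationship can be sketched as a tiny data model. This is only an illustration of the account, container, and blob nesting described above; the class names and the blob name `diagram.png` are made up, and the URL assumes the default public blob endpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    blobs: list = field(default_factory=list)  # zero or many blob names


@dataclass
class StorageAccount:
    name: str
    containers: dict = field(default_factory=dict)  # zero or many containers

    def add_blob(self, container: str, blob: str) -> str:
        """Record a blob name and return its default URL."""
        target = self.containers.setdefault(container, Container(container))
        target.blobs.append(blob)
        return f"https://{self.name}.blob.core.windows.net/{container}/{blob}"


account = StorageAccount("hgdp200storage")
url = account.add_blob("resources", "diagram.png")
account.containers.setdefault("example", Container("example"))  # still empty
print(url)
```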

In short, some of the key benefits of Azure Blob storage include:

Tiered access – Hot, Cool, Archive

Low-cost storage option for unstructured data

Built-in high-availability and disaster recovery with various redundancy options

With Azure Blob storage, there were opportunities for improvement in terms of

optimizations for big data analytics workloads. This leads me to the next discussion topic,

Azure Data Lake Storage Gen1!

A Deeper Look: Azure Data Lake Storage Gen1

! IMPORTANT ! Before I begin, it is crucial to point out that you should not use ADLS

Gen1 for any new projects. It is a legacy service and it is recommended to instead use

ADLS Gen2. I am introducing it in this article to highlight some of the key ADLS Gen1

capabilities that are included in ADLS Gen2.

Below is a list of some of the key capabilities offered by ADLS Gen1. To learn more,

refer to the Microsoft Azure Data Lake Storage Gen1 documentation.

ADLS Gen1 is an Apache Hadoop file system that is compatible with Hadoop

Distributed File System (HDFS) and works with the Hadoop ecosystem. If you’re

not familiar with Hadoop, it is an open-source platform that focuses on

simplifying distributed data processing. This is kind of a big deal for Hadoop

users! To work with the Hadoop ecosystem, your data needs to be stored in an

HDFS-compatible file system. So, that means that users could store all of their data in ADLS Gen1 and

use it in their Hadoop workloads.

ADLS Gen1 supports virtually unlimited storage. Individual files can range from

kilobytes to petabytes in size.

Access Control Lists (ACLs) can be implemented to manage access to your data

in ADLS Gen1.

ADLS Gen1 can be accessed via the file system, prefixed by adl://. The ability to

access data this way allows for potential optimizations, particularly in big data

analytics scenarios.

Since I do not recommend using ADLS Gen1 (remember it is a legacy service!), I want to

keep this section short and move on to ADLS Gen2.

Introducing Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 emerged from the following simple equation:

Azure Blob Storage + Azure Data Lake Storage Gen1 = Azure Data Lake Storage

Gen2

Joking aside, ADLS Gen2 truly is the result of converging the capabilities of two storage

services, Azure Blob Storage and Azure Data Lake Storage Gen1. The result? You get the

best of both worlds. File system semantics, directory and file-level security capabilities

from ADLS Gen1 are combined with the low-cost, tiered storage, high

availability/disaster recovery capabilities from Azure Blob Storage.

ADLS Gen2 was designed with big data analytics in mind and is a key component in

modern data analytics, data science, and data warehousing architectures. A fundamental

component of ADLS Gen2 is the addition of a hierarchical namespace to Blob storage.

To explain what this term really means, think about the file explorer on your computer.

You likely have created (or at least attempted to create) an organized folder structure.

Unlike Blob storage, you have the ability to create a folder structure with a hierarchy in

your ADLS Gen2 account. Besides providing a familiar interface style for developers, the

hierarchical namespace is preferred when working with big data analytics frameworks

like Hive and Spark. Without real directories, applications must process potentially

millions of individual blobs to accomplish directory-level tasks, whereas the hierarchical

namespace processes these tasks by updating the parent directory. Spark jobs, for

example, often write output to temporary locations and rename the location at the end of

the job. The time to rename is significantly lower with a hierarchical namespace.
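
The rename difference can be sketched with a toy model. This is not how Azure implements either namespace. It only counts how many entries have to be touched: in a flat namespace every blob under the prefix gets a new name, while in a hierarchical namespace only the parent directory's entry changes. The file names and folder names are hypothetical.

```python
def rename_flat(blobs: dict, old_prefix: str, new_prefix: str) -> int:
    """Flat namespace: every blob under the prefix is re-keyed individually."""
    operations = 0
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
        operations += 1
    return operations


def rename_hierarchical(parent: dict, old_name: str, new_name: str) -> int:
    """Hierarchical namespace: only the parent's entry for the directory changes."""
    parent[new_name] = parent.pop(old_name)
    return 1


files = [f"part-{i:05d}.parquet" for i in range(10_000)]
flat = {f"output/_temporary/{f}": b"" for f in files}
tree = {"output": {"_temporary": {f: b"" for f in files}}}

print(rename_flat(flat, "output/_temporary/", "output/final/"))    # 10000
print(rename_hierarchical(tree["output"], "_temporary", "final"))  # 1
```

The counts are illustrative rather than measured, but they show why the cost of a flat rename grows with the number of files while the hierarchical rename does not.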

So, how do you enable this seemingly ~magical~ hierarchical namespace? If there is one

thing you take away from this article, remember that if you try searching for “Azure Data

Lake Storage Gen2” in the Azure portal, you will not find what you’re looking for!

ADLS Gen2 accounts are provisioned by configuring the “enable hierarchical

namespace” option in the creation process of an Azure Storage Account. Once the

hierarchical namespace is enabled on a storage account, it cannot be disabled.

The next image shows what you will expect to see if you are provisioning your storage

account from the Azure portal. Under the advanced tab, there is an option called “Data

Lake Storage Gen2 hierarchical namespace” which is disabled by default. To use the

ADLS Gen2 capabilities, switch this to enabled and continue through the resource

provisioning process.
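
For readers who provision through templates or the REST API rather than the portal, the same switch appears as the `isHnsEnabled` property of the storage account resource. The sketch below only builds an example request body; the helper name, the location, and the SKU are placeholder assumptions, and no call to Azure is made.

```python
import json


def storage_account_request_body(location: str, enable_hns: bool) -> dict:
    """Build an example body for creating a general-purpose v2 storage account."""
    return {
        "location": location,              # placeholder region
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},   # placeholder redundancy option
        "properties": {"isHnsEnabled": enable_hns},
    }


print(json.dumps(storage_account_request_body("eastus", True), indent=2))
```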

Once the storage account is provisioned, you can verify that the hierarchical namespace is

enabled by navigating to the resource in the Azure portal and searching for the

“Configuration” option on the left-hand blade. Notice the option to enable hierarchical

namespace is greyed out since it cannot be modified post-provisioning.

Besides the hierarchical namespace, ADLS Gen2 has several other notable capabilities:

Like ADLS Gen1, ADLS Gen2 is Hadoop compatible, meaning you can manage

and access data just as you would with HDFS.

The new ABFS driver (ABFS = Azure Blob Filesystem) is available within all

Apache Hadoop environments and allows for other Azure services to access data

stored in ADLS Gen2. These services include: Azure HDInsight, Azure

Databricks, and Azure Synapse Analytics. The ABFS driver is optimized

specifically for big data analytics.

Both ACL and POSIX permissions, plus additional granularity specific to ADLS

Gen2, are supported.

Data stored in ADLS Gen2 is not required to be moved or transformed prior to

performing analysis, reducing the required transaction cost.

ADLS Gen2 provides the same data redundancy and access tier offerings as

Azure Blob storage.
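
Engines that use the ABFS driver address data with a URI of the form `abfss://<file system>@<account>.dfs.core.windows.net/<path>`. The helpers below are a minimal sketch for building and taking apart such a URI. The function names, the file system name `raw`, and the file path are hypothetical.

```python
from urllib.parse import urlparse


def abfs_uri(file_system: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS URI; the abfss scheme uses TLS, abfs does not."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def parse_abfs_uri(uri: str) -> dict:
    """Split an ABFS URI into file system, account name, and path."""
    parts = urlparse(uri)
    return {
        "file_system": parts.username,
        "account": parts.hostname.split(".")[0],
        "path": parts.path.lstrip("/"),
    }


uri = abfs_uri("raw", "hgdp200storage", "sales/2020/10/orders.csv")
print(uri)
print(parse_abfs_uri(uri))
```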

To summarize, ADLS Gen2 is built on top of Azure Blob storage. It supports the core

capabilities of Azure Blob storage while leveraging ADLS Gen1

features and introducing new functionality. To reiterate, ADLS Gen2 is not a separate

service in Azure, but is provisioned through an Azure Storage Account by enabling the

hierarchical namespace configuration option. ADLS Gen2 is optimized for big data

analytics workloads.

Comparison: Azure Blob Storage vs. Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a superset of Azure Blob storage capabilities. In the list

below, some of the key differences between ADLS Gen2 and Blob storage are

summarized.

ADLS Gen2 supports ACL and POSIX permissions allowing for more granular

access control compared to Blob storage.

ADLS Gen2 introduces a hierarchical namespace. This is a true file system,

unlike Blob Storage which has a flat namespace. This capability has a significant

impact on performance, especially in big data analytics scenarios.

ADLS Gen2 is an HDFS-compatible store. This means that Apache Hadoop

services can use data stored in ADLS Gen2 as they would a native file system. Azure Blob

storage with a flat namespace does not provide these file system semantics natively.
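
To give a feel for the ACL model mentioned in the first item of the list above, the sketch below parses the short text form of a POSIX-style ACL, such as `user::rwx,group::r-x,other::---`, into structured entries. The parser is my own illustration, not an Azure SDK function, and `analyst-object-id` stands in for a real identity's object ID.

```python
def parse_acl(acl: str) -> list:
    """Split a POSIX-style ACL string into structured entries."""
    entries = []
    for raw in acl.split(","):
        fields = raw.strip().split(":")
        is_default = fields[0] == "default"  # default ACLs are inherited by children
        if is_default:
            fields = fields[1:]
        kind, principal, perms = fields
        entries.append({
            "default": is_default,
            "type": kind,
            "id": principal or None,  # empty id means the owning user or group
            "read": perms[0] == "r",
            "write": perms[1] == "w",
            "execute": perms[2] == "x",
        })
    return entries


for entry in parse_acl("user::rwx,group::r-x,other::---,user:analyst-object-id:r--"):
    print(entry)
```

Each entry targets one directory or file, which is what makes the per-folder security boundaries of a hierarchical namespace possible.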

One last area of comparison I want to address is cost. Yes, there are price differences

between Azure Blob storage and ADLS Gen2. Generally, transactional costs for ADLS

Gen2 are slightly higher than those of Blob, but this is oftentimes offset by the resulting

reduced compute costs. To get more details on pricing, please refer to the ADLS Gen2

pricing page and the Azure Blob Storage pricing page.

Should I always choose Azure Data Lake Storage Gen2 over Azure Blob Storage?

With this information in mind, it may seem like you should always choose ADLS Gen2

over Blob Storage – this is not the case! If you are storing VHD files or have a workload

that would not benefit from a file system hierarchy, then ADLS Gen2 may not be the right

choice. Those are just two examples, but I do encourage you to ask lots of questions

when selecting one of these storage options for your projects to ensure you choose the

option that is best suited for your workload.

Designing your Data Lake

Finally, I want to address one last important topic – how do you structure your data lake?

Don’t let the service names mislead you – your data lake in Azure could be ADLS Gen2

or Blob storage. This is an architectural decision you will have to make (data architecture

is the focus of the DP-201 exam).

Certainly, you do not just want to dump all of your data into a single blob container or

filesystem. That approach will only result in more problems and ultimately, you will have

a….. *cringes*….. data swamp!

No one wants a data swamp, right? So, how should you structure your data lake? The

short answer is… it depends! The longer (and more helpful) answer is more nuanced;

two of the key considerations are summarized below.

1. Think about creating zones for your data, where each zone holds data in a different

“stage.” Create zones in your data lake through separate file systems. Examples of zones

include:

Raw: As the name indicates, this is where your data is stored in its raw,

unprocessed format.

Curated: Contains processed data for specific use cases.

2. Consider ways in which you can create an efficient and logical folder structure such

that you are optimizing for data retrieval. In other words, spend sufficient time planning

your data lake structure. This includes thinking about user groups and security

boundaries, as well as partitioning.
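
One possible way to combine the two considerations above is sketched below: each zone is its own file system, and inside it the folders are partitioned by source, dataset, and date. This layout is an assumption for illustration only, not a prescribed standard, and the source and dataset names are invented.

```python
from datetime import date
from pathlib import PurePosixPath

ZONES = ("raw", "curated")  # one file system per zone


def lake_path(source: str, dataset: str, day: date, file_name: str) -> str:
    """Build a date-partitioned folder path inside a zone's file system."""
    return str(PurePosixPath(
        source,
        dataset,
        f"year={day:%Y}",
        f"month={day:%m}",
        f"day={day:%d}",
        file_name,
    ))


for zone in ZONES:
    print(zone, lake_path("crm", "customers", date(2020, 10, 6), "customers.csv"))
# raw crm/customers/year=2020/month=10/day=06/customers.csv
# curated crm/customers/year=2020/month=10/day=06/customers.csv
```

Partitioning by date in the path lets a query engine skip whole folders when a job only needs a narrow time range.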

For more information on designing your data lake, I highly recommend reading the

following blog posts from SQL Chick (Melissa Coates). These blog posts are by far the

best and most thorough explanations I have found for data lake design considerations.

Zones in a Data Lake

Data Lake Use Cases and Planning Considerations

Resources

Much of this article was written thanks to Microsoft documentation, as well as a few

other blog posts. Below is a summary of resources I used when writing this article.

· Azure Data Engineer Associate

· Azure Storage Account Overview

· Azure Storage Account Resource Limits

· Azure Blob Storage

· Azure Queues

· Azure Tables

· Azure Files

· Quickstart: Create an Azure Storage Account

· Tiered access – Hot, Cool, Archive

· Azure Storage Redundancy Options

· Azure Data Lake Storage Gen1 **for reference only, this is a legacy service**

· Introduction to Azure Data Lake Storage Gen2

· ADLS Gen2 Hierarchical Namespace

· ADLS Gen2 ABFS Driver

· ADLS Gen2 pricing

· Azure Blob Storage pricing page

· Zones in a Data Lake

· Data Lake Use Cases and Planning Considerations

Thanks for reading!

That brings us to the end of the article. I hope this helped clarify the differences between

Azure Blob Storage and ADLS Gen2! If you have any questions or comments, please

post them below.
