Azure Data Engineering: Azure Blob Storage vs. Azure Data Lake Storage Gen2
Heather Grandy
Biomedical Engineering
Master's Student @ The…
Disclaimer: Note that this article is not official Microsoft content. Please
visit Microsoft Docs to read official Microsoft content.
This article is geared towards helping readers prepare for the Azure Data Engineer
Associate certification as well as to simply learn about Azure Data & AI technologies.
In this article, I will be diving into Azure Blob Storage vs. Azure Data Lake Storage
Gen2 (ADLS Gen2) from the lens of a Data Engineer. The differences between these two
offerings are a common discussion topic in my customer workshops and are important to
understand when making architectural decisions as well as when preparing for your
certification.
Let’s start with the data – what kind of data can you store in Azure Blob or ADLS
Gen2?
https://www.linkedin.com/pulse/azure-data-engineering-series-part-1-blob-storage-vs-lake-grandy/ 1/18
12/1/22, 12:18 PM (99+) Azure Data Engineering: Azure Blob Storage vs. Azure Data Lake Storage Gen2 | LinkedIn
Azure Blob storage and ADLS Gen2 are both well-suited for storing unstructured data.
Think videos, photos, audio files, text files, Excel files, and more! Since you are storing
data in an unstructured format, you cannot directly query data in either service. You will
need to leverage another service to begin querying and/or analyzing that data.
Both Azure Blob storage and ADLS Gen2 are provisioned through an Azure Storage account,
which offers four different Azure storage services – Blobs, Queues, Tables, and Files. You can use all or
just one of these services within a single storage account, up to the resource limits. In this
article, I will be focusing only on Blob storage, but I want to provide a brief overview of
each offering:
· Azure Blob Storage: Object storage solution for the cloud. Blob storage is
optimized for storing massive amounts of unstructured data – a.k.a. data that does
not adhere to a particular schema or definition, such as text data, photos, videos,
etc. Blobs are organized into containers. By the way, Blob = Binary Large Object.
· Azure Queues: Service for storing large numbers of messages that can be
accessed from anywhere in the world via authenticated calls using HTTP or
HTTPS. A queue can hold millions of messages, up to the total capacity limit of the
storage account. Queues are often used to create a backlog of work to process
asynchronously.
· Azure Tables: Service for storing NoSQL (key/value) data with a schema-less
design. Table storage is often used to store flexible datasets such as user data for
web apps, device information, or other types of metadata. A storage account can
contain any number of tables, up to its capacity limit.
· Azure Files: A fully managed file share service in the cloud, accessible via Server
Message Block (SMB) protocol or Network File System (NFS) protocol. Azure
file shares can replace or supplement traditional on-premises file servers.
Once you create an Azure Storage Account, you will see these options in the resource
overview page from the Azure portal. This is shown in the screenshot below in my
storage account named hgdp200storage. Side note about storage account names – they
have to be globally unique within all of Azure! The Blob storage option, containers, is
indicated in red. Note that sensitive information from this overview page has been
redacted.
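Since the naming rules trip people up, here is a minimal sketch of a local pre-check for a storage account name. It assumes the documented format rules (3 to 24 characters, lowercase letters and numbers only); the function name is my own, and global uniqueness can only be confirmed by Azure itself, so this check cannot cover it.

```python
import re

# Format rules only: 3-24 characters, lowercase letters and digits.
# Global uniqueness cannot be verified locally; Azure checks it at creation time.
_ACCOUNT_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if `name` satisfies the storage account naming format."""
    return _ACCOUNT_NAME_RE.fullmatch(name) is not None


print(is_valid_storage_account_name("hgdp200storage"))  # True
print(is_valid_storage_account_name("My-Storage"))      # False
```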
As stated above, Azure Blob storage is optimized for storing massive amounts of
unstructured data. After creating an Azure Storage account, the next step is to create
containers which are used to organize a set of Blobs, like a directory in a file system –
similar, but not the same! Blob storage accounts are only capable of mimicking a
hierarchical folder structure; they do not support true directories. Once you have created
containers, you can upload Blobs to them. This relationship is illustrated in the diagram
below. In this case, there is a storage account named hgdp200storage with two containers,
resources and example. Within these containers, there are a few Blobs of various formats
stored. The purpose of this diagram is to demonstrate the relationship between Azure Storage
Account artifacts. Note that a storage account can hold zero or many containers, and
a container can hold zero or many Blobs.
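To make the "mimicking a folder structure" point concrete, here is a small pure-Python simulation. It is only a sketch of the idea: a container holds a flat set of Blob names, and a "folder" is just a shared name prefix ending in a delimiter. The function and the Blob names are hypothetical and do not call any Azure SDK.

```python
def list_virtual_level(blob_names, prefix="", delimiter="/"):
    """Split a flat list of blob names into blobs and virtual folders at one level."""
    blobs, folders = [], set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter remains, so this blob lives "inside" a virtual folder.
            folders.add(prefix + head + delimiter)
        else:
            blobs.append(name)
    return sorted(blobs), sorted(folders)


# Hypothetical blob names inside a single container.
names = ["images/cat.png", "images/raw/dog.jpg", "readme.txt"]

print(list_virtual_level(names))             # (['readme.txt'], ['images/'])
print(list_virtual_level(names, "images/"))  # (['images/cat.png'], ['images/raw/'])
```

Notice that no folder object exists anywhere in this model. Delete the two Blobs under images/ and the "folder" disappears with them.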
With Azure Blob storage, there were opportunities for improvement in terms of
optimizations for big data analytics workloads. This leads me to the next discussion topic,
Azure Data Lake Storage Gen1 (ADLS Gen1).
! IMPORTANT ! Before I begin, it is crucial to point out that you should not use ADLS
Gen1 for any new projects. It is a legacy service and it is recommended to instead use
ADLS Gen2. I am introducing it in this article to highlight some of the key ADLS Gen1
capabilities that ADLS Gen2 builds on.
Below is a list of some of the key capabilities offered by ADLS Gen1. To learn more,
see the ADLS Gen1 entry under Resources.
· ADLS Gen1 is an Apache Hadoop file system that is compatible with Hadoop
Distributed File System (HDFS) and works with the Hadoop ecosystem. If you’re
not familiar with Hadoop, it is an open-source framework aimed at
simplifying distributed data processing. This is kind of a big deal for Hadoop
users! To work with the Hadoop ecosystem, your data needs to be stored in
HDFS or an HDFS-compatible file system. So, that means that users could store all of
their data in ADLS Gen1 and work with it using Hadoop ecosystem tools.
· ADLS Gen1 supports virtually unlimited storage. Individual files can range from
kilobytes to petabytes in size.
· Access Control Lists (ACLs) can be implemented to manage access to your data
in ADLS Gen1.
· ADLS Gen1 can be accessed via the file system, prefixed by adl://. The ability to
access data this way allows for potential optimizations, particularly in big data
analytics scenarios.
Since I do not recommend using ADLS Gen1 (remember it is a legacy service!), I want to
move on to its successor, Azure Data Lake Storage Gen2.
Azure Data Lake Storage Gen2 emerged from the following simple equation:
Azure Blob Storage + Azure Data Lake Storage Gen1 = Azure Data Lake Storage
Gen2
Joking aside, ADLS Gen2 truly is the result of converging the capabilities of two storage
services, Azure Blob Storage and Azure Data Lake Storage Gen1. The result? You get the
best of both worlds. File system semantics and directory- and file-level security capabilities
from ADLS Gen1 are combined with the low-cost, tiered storage and high
availability/disaster recovery capabilities of Azure Blob storage.
ADLS Gen2 was designed with big data analytics in mind and is a key component in
modern data analytics, data science, and data warehousing architectures. A fundamental
feature of ADLS Gen2 is the hierarchical namespace.
To explain what this term really means, think about the file explorer on your computer.
You likely have created (or at least attempted to create) an organized folder structure.
Unlike Blob storage, you have the ability to create a folder structure with a hierarchy in
your ADLS Gen2 account. Besides providing a familiar interface style for developers, the
hierarchical namespace is preferred when working with big data analytics frameworks
like Hive and Spark. Without real directories, applications must process potentially
millions of individual Blobs to achieve directory-level tasks. In contrast, a hierarchical
namespace processes these tasks by updating the parent directory. Spark jobs, for
example, often write output to temporary locations and rename the location at the end of
the job. The time to rename is significantly lower with a hierarchical namespace.
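The rename difference is easy to see in a toy model. The sketch below is my own simplification, not how the services are implemented: the flat namespace is a dictionary keyed by full Blob name, the hierarchical namespace is a nested dictionary, and each function returns how many entries it had to touch. The paths and function names are hypothetical.

```python
def rename_flat(blobs: dict, old_prefix: str, new_prefix: str) -> int:
    """Rename a virtual directory in a flat namespace: every matching blob is rewritten."""
    matching = [name for name in blobs if name.startswith(old_prefix)]
    for name in matching:
        blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
    return len(matching)


def rename_hierarchical(parent: dict, old_name: str, new_name: str) -> int:
    """Rename a real directory: only the entry in the parent directory changes."""
    parent[new_name] = parent.pop(old_name)
    return 1


# A job that wrote 1,000 output files to a temporary location.
flat = {f"output/_temporary/part-{i:05d}.parquet": b"" for i in range(1000)}
tree = {"output": {"_temporary": {f"part-{i:05d}.parquet": b"" for i in range(1000)}}}

flat_ops = rename_flat(flat, "output/_temporary/", "output/final/")
tree_ops = rename_hierarchical(tree["output"], "_temporary", "final")
print(flat_ops, tree_ops)  # 1000 1
```

The flat version scales with the number of files written, while the hierarchical version stays at a single update regardless of job size.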
So, how do you enable this seemingly ~magical~ hierarchical namespace? If there is one
thing you take away from this article, remember that if you try searching for “Azure Data
Lake Storage Gen2” in the Azure portal, you will not find what you’re looking for!
ADLS Gen2 is not a separate resource type; you get it by enabling the “hierarchical
namespace” option in the creation process of an Azure Storage Account. Once the
hierarchical namespace is enabled on a storage account, it cannot be turned off, so make
this decision deliberately (existing accounts can be upgraded to enable it, but the upgrade is one-way).
The next image shows what you will expect to see if you are provisioning your storage
account from the Azure portal. Under the advanced tab, there is an option called “Data
Lake Storage Gen2 hierarchical namespace” which is disabled by default. To use the
ADLS Gen2 capabilities, switch this to enabled and continue through the resource
provisioning process.
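If you provision through templates or the REST API rather than the portal, the same toggle shows up as a property on the storage account resource. The sketch below only builds a request body as a Python dictionary. To my knowledge the property is named isHnsEnabled in the storage account resource schema, but treat that name, the StorageV2 kind, the SKU, and the region as assumptions to check against the official documentation. The helper function is hypothetical and nothing is sent to Azure.

```python
import json


def storage_account_request_body(location: str, enable_hns: bool = True) -> dict:
    """Build an illustrative creation payload for a general-purpose v2 storage account."""
    return {
        "location": location,
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        # Assumed property name for the hierarchical namespace toggle.
        "properties": {"isHnsEnabled": enable_hns},
    }


body = storage_account_request_body("eastus")
print(json.dumps(body, indent=2))
```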
Once the storage account is provisioned, you can verify that the hierarchical namespace is
enabled by navigating to the resource in the Azure portal and searching for the
“Configuration” option on the left-hand blade, which shows the hierarchical namespace setting.
Besides the hierarchical namespace, ADLS Gen2 has several other notable capabilities:
· Like ADLS Gen1, ADLS Gen2 is Hadoop compatible, meaning you can manage
and access data just as you would with HDFS.
· The new ABFS driver (ABFS = Azure Blob Filesystem) is available within all
Apache Hadoop environments and allows for other Azure services to access data
stored in ADLS Gen2.
· Both ACL and POSIX permissions, plus additional granularity specific to ADLS
Gen2, are supported.
· ADLS Gen2 provides the same data redundancy and access tier offerings as
Azure Blob storage.
To summarize, ADLS Gen2 is built on top of Azure Blob storage. It supports the core
Blob storage features while introducing new functionality. To reiterate, ADLS Gen2 is not a separate
service in Azure, but is provisioned through an Azure Storage Account by enabling the
hierarchical namespace configuration option. ADLS Gen2 is optimized for big data
analytics workloads.
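In practice, the ABFS driver mentioned above is what you meet whenever you point Spark or another Hadoop-compatible tool at ADLS Gen2: paths take the form abfss://<file-system>@<account>.dfs.core.windows.net/<path>. Here is a minimal sketch for building and parsing such paths. The helper names and the file system and folder names are hypothetical; only the URI shape is taken from the driver's documented format.

```python
from urllib.parse import urlsplit


def build_abfss_uri(account: str, file_system: str, path: str) -> str:
    """Build an ABFS (TLS) URI for a path inside an ADLS Gen2 file system."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def parse_abfss_uri(uri: str):
    """Return (account, file_system, path) from an abfs:// or abfss:// URI."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri}")
    file_system, _, host = parts.netloc.partition("@")
    account = host.split(".", 1)[0]
    return account, file_system, parts.path.lstrip("/")


uri = build_abfss_uri("hgdp200storage", "raw", "sales/2022/data.parquet")
print(uri)  # abfss://raw@hgdp200storage.dfs.core.windows.net/sales/2022/data.parquet
print(parse_abfss_uri(uri))  # ('hgdp200storage', 'raw', 'sales/2022/data.parquet')
```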
Comparison: Azure Blob Storage vs. Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is a superset of Azure Blob storage capabilities. In the list
below, some of the key differences between ADLS Gen2 and Blob storage are
summarized.
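· ADLS Gen2 supports ACL and POSIX permissions, allowing for more granular
access control than Blob storage offers.
· ADLS Gen2 has a hierarchical namespace,
unlike Blob Storage which has a flat namespace. This capability has a significant
effect on the performance of directory-level operations.
· ADLS Gen2 is Hadoop compatible, so Hadoop ecosystem
services can use data stored in ADLS Gen2. Plain Blob storage does not provide this
HDFS-compatible file system interface (Hadoop can reach it only through the older WASB driver).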
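To give a feel for what POSIX-style ACLs look like, here is a simplified sketch that parses an ACL written in the short form scope:qualifier:permissions (for example user::rwx,group::r-x,other::---) and looks up a single entry. It deliberately ignores the mask entry, default ACLs, and the order in which the service really evaluates entries, so treat it as an illustration of the notation rather than of ADLS Gen2's access check. The identifier analyst-id and both function names are hypothetical.

```python
def parse_acl(acl: str) -> dict:
    """Parse 'scope:qualifier:perms' entries into a {(scope, qualifier): perms} map."""
    entries = {}
    for entry in acl.split(","):
        scope, qualifier, perms = entry.strip().split(":")
        entries[(scope, qualifier)] = perms
    return entries


def is_allowed(entries: dict, scope: str, qualifier: str, action: str) -> bool:
    """Check one entry for 'r', 'w', or 'x'. Simplified: no mask, no fallback rules."""
    if action not in ("r", "w", "x"):
        raise ValueError("action must be one of 'r', 'w', 'x'")
    return action in entries.get((scope, qualifier), "---")


# Hypothetical ACL on a directory: owner has full access, one named user can read and list.
acl = parse_acl("user::rwx,user:analyst-id:r-x,group::r-x,other::---")

print(is_allowed(acl, "user", "analyst-id", "r"))  # True
print(is_allowed(acl, "user", "analyst-id", "w"))  # False
print(is_allowed(acl, "other", "", "r"))           # False
```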
One last area of comparison I want to address is cost. Yes, there are price differences
between Azure Blob storage and ADLS Gen2. Generally, transactional costs for ADLS
Gen2 are slightly higher than those of Blob, but this is oftentimes offset by the resulting
reduced compute costs. To get more details on pricing, please refer to the ADLS Gen2
pricing page.
Should I always choose Azure Data Lake Storage Gen2 over Azure Blob Storage?
With this information in mind, it may seem like you should always choose ADLS Gen2
over Blob Storage – this is not the case! If you are storing VHD files or have a workload
that would not benefit from a file system hierarchy, then ADLS Gen2 may not be the right
choice. Those are just two examples, but I do encourage you to ask lots of questions
when selecting one of these storage options for your projects to ensure you choose the
right one.
Finally, I want to address one last important topic – how do you structure your data lake?
Don’t let the service names mislead you – your data lake in Azure could be ADLS Gen2
or Blob storage. This is an architectural decision you will have to make.
Certainly, you do not just want to dump all of your data into a single blob container or
filesystem. That approach will only result in more problems and ultimately, you will have
a data swamp on your hands.
No one wants a data swamp, right? So, how should you structure your data lake? The
short answer is… it depends! The longer (and more helpful) answer is more nuanced and
comes down to a few considerations:
1. Think about creating zones for your data, where each zone holds data in a different
“stage.” Create zones in your data lake through separate file systems. Examples of zones
include:
Raw: As the name indicates, this is where your data is stored in its raw,
unprocessed format.
2. Consider ways in which you can create an efficient and logical folder structure such
that you are optimizing for data retrieval. In other words, spend sufficient time planning
your data lake structure. This includes thinking about user groups and security.
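One way to keep a folder structure consistent is to generate paths from a single helper instead of typing them by hand. The sketch below is just one possible convention, not a recommendation from Microsoft: the first segment is the zone (a separate file system, as suggested in point 1), followed by source system, dataset, and a date-based folder. The source and dataset names are hypothetical.

```python
from datetime import date


def lake_path(zone: str, source: str, dataset: str, day: date) -> str:
    """Build '<zone>/<source>/<dataset>/<yyyy>/<mm>/<dd>' for one day's data."""
    return f"{zone}/{source}/{dataset}/{day:%Y/%m/%d}"


print(lake_path("raw", "crm", "customers", date(2022, 12, 1)))
# raw/crm/customers/2022/12/01
```

Date-based folders like these make it cheap to read or reprocess a single day without scanning the whole dataset, which is what optimizing for retrieval looks like in practice.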
For more information on designing your data lake, I highly recommend reading the
following blog posts from SQL Chick (Melissa Coates). These blog posts are by far the
best and most thorough explanations I have found for data lake design considerations.
Resources
Much of this article was written thanks to Microsoft documentation, as well as a few
other blog posts. Below is a summary of resources I used when writing this article.
· Azure Queues
· Azure Tables
· Azure Files
· Azure Data Lake Storage Gen1 **for reference only, this is a legacy service**
That brings us to the end of the article. I hope this helped clarify the differences between
Azure Blob Storage and ADLS Gen2! If you have any questions or comments, please
share them.