Chapter 4 (PRE 6)


Chapter 4

DATA WAREHOUSING AND MANAGEMENT

ETL Process in Data Warehouse

• ETL stands for Extract, Transform, Load. It is a process used in data warehousing to extract data from various sources, transform it into a format suitable for loading into a data warehouse, and then load it into the warehouse.

• The ETL process is iterative: it is repeated as new data is added to the warehouse. The process is important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format required for data mining and reporting.

Extract

• The first stage in the ETL process is to extract data from various sources such as transactional systems, spreadsheets, and flat files. This step involves reading data from the source systems and storing it in a staging area.

• Before data can be moved to a new destination, it must first be extracted from its source, such as a data warehouse or data lake. In this first step of the ETL process, structured and unstructured data is imported and consolidated into a single repository. Volumes of data can be extracted from a wide range of data sources, including:

• Existing databases and legacy systems
• Cloud, hybrid, and on-premises environments
• Sales and marketing applications
• Mobile devices and apps
• CRM systems
• Data storage platforms
• Data warehouses
• Analytics tools

Transform

• In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields. During this phase of the ETL process, rules and regulations can be applied that ensure data quality and accessibility. You can also apply rules to help your company meet reporting requirements. The process of data transformation consists of several sub-processes:

• Cleansing: inconsistencies and missing values in the data are resolved.
• Standardization: formatting rules are applied to the dataset.
• Deduplication: redundant data is excluded or discarded.
• Verification: unusable data is removed and anomalies are flagged.
• Sorting: data is organized according to type.
• Other tasks: any additional or optional rules can be applied to improve data quality.

• Transformation is generally considered to be the most important part of the ETL process. Data transformation improves data integrity by removing duplicates and ensuring that raw data arrives at its new destination fully compatible and ready to use.
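The transformation sub-processes named above (cleansing, standardization, deduplication, verification, sorting) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample rows, the field names (id, name, signup, amount), the two accepted date formats, and the choices to deduplicate by id and sort by id are all hypothetical and do not come from any particular ETL tool.

```python
from datetime import datetime

# Hypothetical rows as they might arrive from the extract step (all strings).
RAW_ROWS = [
    {"id": "1", "name": " Alice ", "signup": "2024-01-05", "amount": "10.50"},
    {"id": "2", "name": "BOB", "signup": "05/01/2024", "amount": ""},
    {"id": "1", "name": "alice", "signup": "2024-01-05", "amount": "10.50"},
    {"id": "3", "name": "Carol", "signup": "not a date", "amount": "7"},
]


def cleanse(row):
    """Cleansing: trim stray whitespace and fill a missing amount with 0."""
    cleaned = {key: value.strip() for key, value in row.items()}
    if cleaned["amount"] == "":
        cleaned["amount"] = "0"
    return cleaned


def parse_date(text):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize(row):
    """Standardization: one name casing, one date format, typed fields."""
    return {
        "id": int(row["id"]),
        "name": row["name"].title(),
        "signup": parse_date(row["signup"]),
        "amount": float(row["amount"]),
    }


def deduplicate(rows):
    """Deduplication: keep only the first row seen for each id."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def verify(rows):
    """Verification: split rows into usable ones and flagged anomalies."""
    valid = [row for row in rows if row["signup"] is not None]
    rejected = [row for row in rows if row["signup"] is None]
    return valid, rejected


def transform(raw_rows):
    rows = [standardize(cleanse(row)) for row in raw_rows]
    valid, rejected = verify(deduplicate(rows))
    valid.sort(key=lambda row: row["id"])  # Sorting: order by id
    return valid, rejected


valid_rows, rejected_rows = transform(RAW_ROWS)
```

Here valid_rows holds the rows that are ready to load, while rejected_rows keeps the flagged anomalies (the row with the unparseable date) so that they can be reviewed rather than silently dropped.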
Load

• After the data is transformed, it is loaded into the data warehouse. This step involves creating the physical data structures and loading the data into the warehouse. Data can be loaded all at once (full load) or at scheduled intervals (incremental load).

• Full loading: In a full-load scenario, everything that comes from the transformation assembly line goes into new, unique records in the data warehouse or data repository. Though there may be times this is useful for research purposes, full loading produces datasets that grow rapidly with every load and can quickly become difficult to maintain.

• Incremental loading: A less comprehensive but more manageable approach is incremental loading. Incremental loading compares incoming data with what's already on hand, and only produces additional records if new and unique information is found. This architecture allows smaller, less expensive data warehouses to maintain and manage business intelligence.

Advantages of ETL process in data warehousing

• Improved data quality: The ETL process ensures that the data in the data warehouse is accurate, complete, and up-to-date.

• Better data integration: The ETL process helps to integrate data from multiple sources and systems, making it more accessible and usable.

• Increased data security: The ETL process can help to improve data security by controlling access to the data warehouse and ensuring that only authorized users can access the data.

• Improved scalability: The ETL process can help to improve scalability by providing a way to manage and analyze large amounts of data.

• Increased automation: ETL tools and technologies can automate and simplify the ETL process, reducing the time and effort required to load and update data in the warehouse.

Disadvantages of ETL process in data warehousing

• High cost: The ETL process can be expensive to implement and maintain, especially for organizations with limited resources.

• Complexity: The ETL process can be complex and difficult to implement, especially for organizations that lack the necessary expertise or resources.

• Limited flexibility: The ETL process can be limited in terms of flexibility, as it may not be able to handle unstructured data or real-time data streams.

• Limited scalability: The ETL process can be limited in terms of scalability, as it may not be able to handle very large amounts of data.

• Data privacy concerns: The ETL process can raise concerns about data privacy, as large amounts of data are collected, stored, and analyzed.

WHAT IS ELT?

• ELT is an alternative to the traditional extract/transform/load (ETL) process. It pushes the transformation component of the process to the target database for better performance. Because it takes advantage of the processing capability already built into a data storage infrastructure, ELT reduces the time data spends in transit and boosts efficiency.

• Extract: This step works similarly in both ETL and ELT data management approaches. Raw streams of data from virtual infrastructure, software, and applications are ingested either in their entirety or according to predefined rules.

• Load: Here is where ELT branches off from its ETL cousin. Rather than delivering this mass of raw data to an interim processing server for transformation, ELT delivers it directly to the target storage location. This shortens the cycle between extraction and delivery.

• Transform: The database or data warehouse sorts and normalizes the data, keeping part or all of it on hand and accessible for customized reporting. The overhead for storing this much data is higher, but it offers more opportunities to mine it for relevant business intelligence in near real-time.

Benefits of ELT

• Simplifying management: ELT separates the loading and transformation tasks, minimizing the interdependencies between these processes, lowering risk, and streamlining project management.

• Future-proofed data sets: ELT implementations can be used directly for data warehousing systems, but oftentimes ELT is used in the data lake approach, in which data is collected from a range of sources. This, combined with the separation of the transformation process, makes it easier to make future changes to the warehouse structure.

• Leveraging the latest technologies: ELT solutions harness the power of new technologies in order to push improvements, security, and compliance across the enterprise. ELT also leverages the native capabilities of modern cloud data warehouses and big data processing frameworks.

• Lowering costs: Like most cloud services, cloud-based ELT can result in lower total cost of ownership, because an upfront investment in hardware is often unnecessary.

• Flexibility: The ELT process is adaptable and flexible, so it's suitable for a variety of businesses, applications, and goals.

• Scalability: The scalability of a cloud infrastructure and hosted services like integration platform-as-a-service (iPaaS) and software-as-a-service (SaaS) gives organizations the ability to expand resources on the fly. They add the compute time and storage space necessary for even massive data transformation tasks.

ELT vs. ETL

• The differences between ELT and a traditional ETL process are more significant than just switching the L and the T. The biggest determinant is how, when, and where the data transformations are performed.

• With ETL, the raw data is not available in the data warehouse because it is transformed before it is loaded. With ELT, the raw data is loaded into the data warehouse (or data lake) and transformations occur on the stored data.

• Staging areas are used for both ELT and ETL, but with ETL the staging areas are built into the ETL tool being used. With ELT, the staging area is in a database used for the data warehouse.
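The ELT ordering described above (load the raw data first, then transform it inside the target database) can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch, not a production pattern: an in-memory SQLite database stands in for the data warehouse, and the table names raw_orders and orders and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical extracted records, still untyped and uncleaned.
RAW_ORDERS = [
    ("1001", " alice ", "19.50"),
    ("1002", "BOB", "5.00"),
    ("1002", "BOB", "5.00"),  # duplicate record from the source system
]

# An in-memory SQLite database stands in for the target warehouse.
conn = sqlite3.connect(":memory:")

# Load: the raw data goes straight into a staging table in the target
# database, with no interim processing server.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)

# Transform: the database itself cleans, types, and deduplicates the data.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        LOWER(TRIM(customer))     AS customer,
        CAST(amount AS REAL)      AS amount
    FROM raw_orders
    """
)
conn.commit()

# The raw rows remain available next to the transformed table.
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean_rows = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
```

Because raw_orders is kept, a later change to the transformation rules only requires rebuilding orders from the stored raw data. In an ETL pipeline the same cleanup would run before the insert, and only the transformed rows would reach the warehouse.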
