FPT Handbook: Data Engineer


Table of Contents

Data Engineer
    How To Become A Data Engineer?
    Data Sources
    Data Staging Area
    Extract – Transform – Load (ETL)
    Data Warehouse
    SQL Server & SQL Server Integration Service

Data Engineer
How To Become A Data Engineer?

Who is a Data Engineer?


Every data-driven business requires a framework for its data science and data
analytics pipelines. The person responsible for building and maintaining this
framework is known as a Data Engineer. These engineers are responsible for an
uninterrupted flow of data between servers and applications.
A data engineer therefore builds, tests, and maintains data structures and
architectures for data ingestion, processing, and deployment of large-scale,
data-intensive applications.
Data engineers work in tandem with data architects, data analysts, and data
scientists, who turn that data into visualizations and stories. The most crucial role
of the data engineer is to design, develop, construct, install, test, and
maintain the complete data management and processing systems.


So what exactly do they do? They create the framework that makes data consumable
for data scientists and analysts, who then use it to derive insights. In short,
data engineers are the builders of data systems.

Responsibilities of a Data Engineer


Management
The data engineer manages databases by creating optimal database designs, implementing
schema changes, and maintaining data architecture standards across all of the
business's databases. The data engineer is also responsible for enabling data
migration between different servers and databases, for example migrating data from
SQL Server to MySQL. He also defines and implements data stores based on system
requirements and user requirements.

System Design
Data engineers should always build a system that is scalable, robust, and fault-
tolerant, so that the system can be scaled up as the number of data sources grows
and can handle a massive amount of data without failure. For instance, imagine a
situation where a source of data doubles or triples in volume but the system fails
to scale up; it would cost far more time and resources to rebuild a system that can
take in this extra data. Big data engineers have a role here: they handle the
extract, transform, and load (ETL) process, which is a blueprint for how the collected raw
data is processed and transformed into data ready for analysis.

Analytics
The data engineer performs ad-hoc analyses of data stored in the business's
databases and writes SQL scripts, stored procedures, functions, and views. He is
responsible for troubleshooting data issues within and across the business and for
presenting solutions to these issues. The data engineer proactively analyzes and
evaluates the business's databases in order to identify and recommend improvements
and optimizations. He prepares activity and progress reports on database status and
health, which are later presented to senior data engineers for review and
evaluation. In addition, the data engineer analyzes complex data systems and their
elements, dependencies, data flows, and relationships in order to contribute to
conceptual, physical, and logical data models.
Other responsibilities include improving foundational data procedures, integrating
new data management technologies and software into existing systems, building data
collection pipelines, and finally performance tuning to make the whole system more
efficient.
Data engineers are considered the "librarians" of the data warehouse: they catalog
and organize metadata and define the processes by which one files or extracts data
from the warehouse. Nowadays, metadata management and tooling have become a vital
component of the modern data platform.

Goals of a Data Engineer


Developing Data Pipelines
This skill set involves transferring data from one point to another: taking data
from the operational systems and moving it into something that can be analyzed by
an analyst or data scientist, which leads to the next goal of managing tables and
data sets.

Managing tables and Data Sets


The data transferred through pipelines populates sets of tables that are then used
by analysts or data scientists to extract insights from the data, for example
analyzing a product such as a blog site with questions like: What are people
reading? How are they reading it? How long do they stay on particular articles?

Designing the product


Data engineers end up playing an important role in understanding what users want to
gain from large datasets, and in considering, at development time, the questions
users might have while using the product. For example, when developing a dashboard:
how are people going to use it? What other features could be added, and how
far-fetched are they?


Conceptual Skills Required to be a Data Engineer

The most required skill in data engineering is the ability to design and build data
warehouses, where all the raw data is collected, stored, and retrieved. Without a
data warehouse, all the tasks that data scientists do would become impractical: it
would either get too expensive or grow too large to scale. The other skills required
are:

1. Data Modelling
The data model is an essential part of the data science pipeline. Data modelling is
the process of converting a sophisticated software system design into an
easy-to-comprehend diagram, using text and symbols to represent the flow of data.
Data models are built during the analysis and design phase of a project to ensure
the requirements of a new application are fully understood. These models can also be
invoked later in the data lifecycle to rationalize data designs that were initially
created by programmers on an ad hoc basis.

Stages in Data Modelling


• Conceptual: This is the first step in data modelling; it imposes a logical order
on the data as it exists in relation to the entities it describes.
• Logical: The logical modeling process attempts to impose order by
establishing discrete entities, fundamental values, and relationships into
logical structures.
• Physical: This step breaks the data down into the actual tables, clusters, and
indexes required for the data storage.


Hierarchical Data Model: This data model arranges data in a tree-like structure;
one-to-many arrangements marked these early efforts, which replaced file-based
systems. E.g., IBM's Information Management System (IMS), which found extensive use
in businesses such as banking.
Relational Data Model: Relational models replaced hierarchical models, as they
reduced program complexity versus file-based systems and did not require developers
to define data paths.
Entity-Relationship Model: Closely related to the relational model, these models use
diagrams and flowcharts to graphically illustrate the elements of the database and
ease understanding of the underlying model.
Graph Data Model: A much more advanced version of the hierarchical model which,
together with graph databases, is used for describing complicated relationships
within data sets.

2. Automation
Industries use automation to increase productivity, improve quality and consistency,
reduce costs, and speed delivery. It provides benefits in greater magnitude to every
team player in an organization including Testers, Quality Analysts, Developers, or
even Business Users.
Automation can provide the following benefits:

• Speed: It is fast, hence, dramatically reduces team development time.

• Flexibility: Respond to changing business requirements quickly and easily.

• Quality: Automation tools produce tested, high-performance, complete, and
readable code.
• Consistency: It is easy for one developer to understand another's code.
In data science, designing a data warehouse and its architecture takes a long time
to complete, and semi-automated steps result in a data warehouse that is limited and
inflexible. So data engineers came up with a solution: automate the data warehouse
across every step of its life cycle, thus reducing the effort required to manage it.
The need for data engineers to implement data warehouse automation (DWA) tools is
growing, as these tools eliminate hand-coding and custom design for planning,
designing, building, and documenting decision support infrastructure.

3. Extraction, Transformation, And Load (ETL)

ETL is defined as the procedure of copying data from one or more sources into a
destination system, which represents the data differently from the source or in a
different context than the source. ETL is often used in data warehousing.
Data extraction is the concept of extracting data from heterogeneous or homogeneous
sources; data transformation processes the data by cleansing it and transforming it
into a proper storage structure for the purpose of querying and analysis; finally,
data loading describes the insertion of data into the final target database, such as
an operational data store, a data mart, a data lake, or a data warehouse.
In data science, ETL involves pulling data out of operational systems like MySQL or
Oracle, moving it into a data warehouse such as SQL Server or a modern platform like
Hadoop or Redshift, and then formatting it in such a way that analysts can use it.
The ETL process ultimately feeds the analytical data layer and does more than
extract data: it aggregates data and runs metrics and algorithms on it so that it
can be easily fed into future dashboards.
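To make the flow concrete, here is a minimal sketch of an ETL step in Python using
only the standard library. The file name transactions.csv, the SQLite target
warehouse.db, and the column names (taken from the transaction table later in this
handbook) are assumptions for illustration, not a prescribed implementation.

# Minimal ETL sketch: extract from a flat file, transform, load into SQLite.
import csv
import sqlite3

# Extract: read raw transaction rows from a CSV flat file
with open('transactions.csv', newline='') as f:
    rows = list(csv.DictReader(f))

# Transform: cast types and drop rows with a missing product ID
clean = [
    {'prod_id': int(r['ProdID']), 'subtotal': float(r['subtotal'])}
    for r in rows if r.get('ProdID')
]

# Load: insert the cleaned records into a warehouse-style table
con = sqlite3.connect('warehouse.db')
con.execute('CREATE TABLE IF NOT EXISTS sales (prod_id INTEGER, subtotal REAL)')
con.executemany('INSERT INTO sales VALUES (:prod_id, :subtotal)', clean)
con.commit()
con.close()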

4. Product Understanding
Data engineers look at data as their product, so it has to be built in such a way
that users can actually use it. If we are building datasets for machine learning
engineers or data scientists, we need to understand how they are going to use them,
what models they want to build, and whether enough information is being provided at
the customer level. This is required because it is the data engineer who works at
the level of individual records and decides how to aggregate them.

Becoming a Data Engineer

1. Programming Language: Start with learning a programming language, like
Python, as it has a clear and readable syntax, versatility, widely available
resources, and a very supportive community.
2. Operating System: Mastery of at least one OS such as Linux or UNIX is
recommended; RHEL is a prevalent OS adopted by the industry and can also be
learned.
3. DBMS: Enhance your DBMS skills and get hands-on experience with at least
one relational database, preferably MySQL or Oracle DB. Be thorough with
database administrator skills as well as skills like capacity planning,
installation, configuration, database design, monitoring, security, and
troubleshooting, including backup and recovery of data.
4. NoSQL: This is the next skill to focus on, as it will help you understand how
to handle semi-structured and unstructured data.
5. ETL: Understand how to extract data from various sources using ETL and data
warehousing tools, transform and clean the data according to user needs, and
then load it into the data warehouse. This is an important skill that data
engineers must possess. Since we are in an age where data is the fuel of the
21st century, various data sources and numerous technologies have evolved over
the last two decades, the major ones being NoSQL databases and big data
frameworks.
6. Big Data Frameworks: Big data engineers are required to learn multiple big
data frameworks to create and design processing systems.
7. Real-time Processing Frameworks: Concentrate on learning frameworks
like Apache Spark, an open-source cluster computing framework for real-time
processing; when it comes to real-time data analytics, Spark stands as the
go-to tool across solutions.
8. Cloud: Next in the career path, learning cloud technology will serve as a big
plus. A good understanding of cloud technology provides the option of storing
significant amounts of data while keeping big data available, scalable, and
fault-tolerant.


Data Sources
General about Data source

With the current strong development of science and technology, the data sources used
as input for building a data warehouse are very diverse and rich. For each specific
data warehouse, the data sources will differ by domain. For example, for a data
warehouse about music, the input will mainly revolve around information about songs;
for a data warehouse about painting, it will be information about images; for a data
warehouse about the weather, the data source will revolve around locations and
time-series features of the weather; and for business data, the data source will be
built around transaction information.
Therefore, to simplify the study of data sources that feed the data warehouse
building process, we only need to consider the commonly used types of data source,
which include:
• Flat files with structured data (csv, json, xml/RDF) and flat files with
unstructured data (txt, log files)
• Binary data formatted as a table/matrix (xls/xlsx files, avro files, music,
video, and photo files)


• Data from database/ data warehouse


• Data from websites
• Data from API

Below are details about each type of data source:

Flat Files
A flat file is a very common way to store data when recording the raw information of
original documents; it is both simple and usable. Log files, for example, carry the
data transferred between steps in a processing sequence.
Flat files are text files in plain text format, so we can read and edit them with
simple tools available on popular operating systems such as Notepad (Windows) or
Emacs & Vim (Linux/Unix or macOS).
The content of a flat file is commonly structured data, with data fields separated
in a delimited, fixed-width, or mixed format.
Sometimes flat files are also used for unstructured data. In this case they are
processed to extract the necessary data into structured data files, so that the
results are convenient to check and use in the next steps.


Flat file with structured data


The delimited format uses column and row delimiters to define columns and rows. As
an example, consider structured data contained in the following table:

TranID   SaleDate              SubID   ProdID   subtotal
10001    20210727 10:25:52     1       30015    50000
10001    20210727 10:25:52     2       30175    250000
10002    20210727 10:25:52     1       20042    175000

The same table can be stored as a flat file with columns delimited by the Tab
character, by the Comma character, or in a fixed-width format (an example appears
below).
Fixed-width format uses width to define columns and rows. This format also includes
a character for padding fields to their maximum width.
Ragged-right format uses width to define all columns, except for the last column,
which is delimited by the row delimiter.
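For illustration, here is the table above rendered as a comma-delimited flat file;
the tab-delimited version is identical with a Tab character in place of each comma,
and the fixed-width version pads each field to a constant width. The values are
taken directly from the table above.

TranID,SaleDate,SubID,ProdID,subtotal
10001,20210727 10:25:52,1,30015,50000
10001,20210727 10:25:52,2,30175,250000
10002,20210727 10:25:52,1,20042,175000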


Flat file with unstructured data


Sometimes we need to read flat files containing randomly distributed information. In
this case we gather the necessary information into a structured data format to
facilitate its use in the next steps. To do that, we rely on markers in the data
samples to be read in order to capture and filter them. Suppose we have a file
containing free-form sale records, one sale per line, with a date followed by a list
of products and counts.
To implement the above idea, we can run a script like the following to collect the
data into a file whose output contains only structured data. This result makes it
very easy to load the data into a table or a matrix for further processing.
awk 'BEGIN { printf("SaleDate,Product,Count\n") }
{
    n = split($0, record, ":")
    date = record[1]
    values = record[2]
    m = split(values, fields, ",")

    # Strip the leading label from the date part
    $0 = date
    date = substr(date, length($1) + 2)

    # Each comma-separated field holds a product name followed by its count
    for (id in fields) {
        $0 = fields[id]
        prodname = substr(fields[id], 2, length(fields[id]) - length($NF) - 2)
        printf("%s,%s,%s\n", date, prodname, $NF)
    }
}' [FlatFileInput.txt] > [FlatFileOutput.csv]

The result is a CSV file with a SaleDate,Product,Count header and one record per product sold, which can be loaded into a table or opened directly in MS Excel.



Common Structured Data Formats of Flat Files


CSV (Comma Separated Values)
Files with the .csv (Comma Separated Values) extension are plain text files that
contain records of data with comma-separated values. Each line in a CSV file is a
new record from the set of records contained in the file. Such files are generated
when data transfer is intended from one storage system to another. Since most
applications can recognize records separated by commas, importing such data files
into a database is very convenient. Almost all spreadsheet applications, such as
Microsoft Excel or OpenOffice Calc, can import CSV without much effort. Data
imported from such files is arranged in the cells of a spreadsheet for presentation
to the user.
The CSV file format is specified by RFC 4180, which defines a file to be
CSV-compliant if:
• Each record is located on a separate line, delimited by a line break (CRLF).
For example:
o h1,h2,h3 CRLF
o v1,v2,v3 CRLF


• The last record in the file may or may not have an ending line break. For
example:
o h1,h2,h3 CRLF
o v1,v2,v3
• There may be an optional header line appearing as the first line of the file with
the same format as normal record lines. This header will contain names
corresponding to the fields in the file and should contain the same number of
fields as the records in the rest of the file (the presence or absence of the header
line should be indicated via the optional “header” parameter of this MIME
type). For example:
o field_name,field_name,field_name CRLF
o h1,h2,h3 CRLF
o v1,v2,v3 CRLF
• Within the header and each record, there may be one or more fields, separated
by commas. Each line should contain the same number of fields throughout
the file. Spaces are considered part of a field and should not be ignored. The
last field in the record must not be followed by a comma. For example:
o h1,h2,h3
• Each field may or may not be enclosed in double quotes (however some
programs, such as Microsoft Excel, do not use double quotes at all). If fields
are not enclosed with double quotes, then double quotes may not appear inside
the fields. For example:
o "h1","h2","h3" CRLF
o v1,v2,v3
• Fields containing line breaks (CRLF), double quotes, and commas should be
enclosed in double-quotes. For example:
o "h1","h CRLF
o 2","h3" CRLF
o v1,v2,v3


• If double quotes are used to enclose fields, then a double quote appearing
inside a field must be escaped by preceding it with another double quote. For
example:
o "h1","h""2","h3"
Most CSV libraries (Super CSV, for example) escape double quotes with a preceding
double quote; note that the sometimes-used convention of escaping double quotes as
\" (instead of "") is not part of RFC 4180.
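As a quick illustration of these rules, Python's built-in csv module applies the
quoting and escaping described above automatically. The file name and sample values
are assumptions for illustration.

# Write and read an RFC 4180-style CSV file with Python's built-in csv module.
import csv

rows = [
    ['h1', 'h2', 'h3'],            # header line
    ['v1', 'v "quoted" 2', 'v3'],  # embedded quotes are escaped as "" on write
]

with open('example.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

with open('example.csv', newline='') as f:
    for record in csv.reader(f):
        print(record)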

JSON (JavaScript Object Notation)


JSON, or JavaScript Object Notation, is a lightweight, text-based open standard
designed for human-readable data interchange. The conventions used by JSON are
familiar to programmers of languages including C, C++, Java, Python, Perl, etc.
• JSON stands for JavaScript Object Notation.
• The format was specified by Douglas Crockford.
• It was designed for human-readable data interchange.
• It has been extended from the JavaScript scripting language.
• The filename extension is .json.
• JSON Internet Media type is application/json.
• The Uniform Type Identifier is public.json.
Uses of JSON

• It is used when writing JavaScript-based applications, including browser
extensions and websites.
• JSON format is used for serializing and transmitting structured data over
network connection.
• It is primarily used to transmit data between a server and web applications.
• Web services and APIs use JSON format to provide public data.
• It can be used with modern programming languages.
Characteristics of JSON

• JSON is easy to read and write.


• It is a lightweight text-based interchange format.


• JSON is language independent.
Simple Example in JSON

The following example shows how to use JSON to store information related to books
based on their topic and edition.
{
   "book": [
      {
         "id": "01",
         "language": "Java",
         "edition": "third",
         "author": "Herbert Schildt"
      },
      {
         "id": "07",
         "language": "C++",
         "edition": "second",
         "author": "E. Balagurusamy"
      }
   ]
}

After understanding the above structure, let's try another example. Save the
below code as json.htm:
<html>

<head>

<title>JSON example</title>

<script language = "javascript" >

var object1 = { "language" : "Java", "author" : "herbert schildt" };


document.write("<h1>JSON with JavaScript example</h1>");

document.write("<br>");

document.write("<h3>Language = " + object1.language+"</h3>");

document.write("<h3>Author = " + object1.author+"</h3>");

var object2 = { "language" : "C++", "author" : "E-Balagurusamy" };

document.write("<br>");

document.write("<h3>Language = " + object2.language+"</h3>");

document.write("<h3>Author = " + object2.author+"</h3>");

document.write("<hr />");

document.write(object2.language + " programming language can be studied " + "from book written by " + object2.author);

document.write("<hr />");

</script>

</head>

<body>

</body>

</html>

Now open json.htm using any JavaScript-enabled browser; the page prints the
language and author values defined in the two objects above.
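In a data pipeline, the same kind of document is more commonly parsed with Python's
built-in json module. A minimal sketch, assuming the book example above has been
saved to a file named books.json:

# Parse the book JSON shown earlier with Python's standard json module.
import json

with open('books.json') as f:   # books.json is an assumed file name
    data = json.load(f)

for book in data['book']:
    print(book['language'], '-', book['author'])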


XML (Extensible Markup Language)


XML is a specification that was developed for transmitting information over the
Web. XML is based on the Standard Generalized Markup Language (SGML) and is a
cross-platform, software- and hardware-independent tool. In addition, it has the
following attributes:
• Is a markup language much like HTML.
• Is designed to describe data.
• Can be used for all types of data and graphics.
• Is independent of applications, platforms, or vendors.
• Allows designers to create their own customized tags, enabling the definition,
transmission, validation, and interpretation of data between applications and
between organizations.

XML tags identify the data and are used to store and organize the data, rather than
specifying how to display it like HTML tags, which are used to display the data.
XML is not going to replace HTML in the near future, but it introduces new
possibilities by adopting many successful features of HTML.


There are three important characteristics of XML that make it useful in a variety of
systems and solutions −
• XML is extensible − XML allows you to create your own self-descriptive
tags, or language, that suits your application.
• XML carries the data, does not present it − XML allows you to store the
data irrespective of how it will be presented.
• XML is a public standard − XML was developed by an organization called
the World Wide Web Consortium (W3C) and is available as an open standard.

XML Usage

A short list of XML usage says it all −


• XML can work behind the scene to simplify the creation of HTML documents
for large web sites.
• XML can be used to exchange the information between organizations and
systems.
• XML can be used for offloading and reloading of databases.
• XML can be used to store and arrange the data, which can customize your
data handling needs.
• XML can easily be merged with style sheets to create almost any desired
output.
• Virtually, any type of data can be expressed as an XML document.
What is Markup?

XML is a markup language that defines set of rules for encoding documents in a
format that is both human-readable and machine-readable. So what exactly is a
markup language? Markup is information added to a document that enhances its
meaning in certain ways, in that it identifies the parts and how they relate to each
other. More specifically, a markup language is a set of symbols that can be placed in
the text of a document to demarcate and label the parts of that document.
The following example shows how XML markup looks when embedded in a piece of
text:


<message>

<text>Hello, world!</text>

</message>

This snippet includes the markup symbols, or the tags such as


<message>...</message> and <text>... </text>. The tags <message> and
</message> mark the start and the end of the XML code fragment. The tags <text>
and </text> surround the text Hello, world!

Why Use XML?

An industry typically uses data exchange methods that are meaningful and specific
to that industry. With the advent of e-commerce, businesses conduct an increasing
number of relationships with a variety of industries and, therefore, must develop
expert knowledge of the various protocols used by those industries for electronic
communication.
The extensibility of XML makes it a very effective tool for standardizing the format
of data interchange among various industries. For example, when message brokers
and workflow engines must coordinate transactions among multiple industries or
departments within an enterprise, they can use XML to combine data from disparate
sources into a format that is understandable by all parties.

Is XML a Programming Language?

A programming language consists of grammar rules and its own vocabulary which
is used to create computer programs. These programs instruct the computer to
perform specific tasks. XML does not qualify to be a programming language as it
does not perform any computation or algorithms. It is usually stored in a simple text
file and is processed by special software that is capable of interpreting XML.
The following sample XML file describes the contents of an address book:
<?xml version="1.0"?>

<address_book>

<person gender="f">

<name>Jane Doe</name>


<address>

<street>123 Main St.</street>

<city>San Francisco</city>

<state>CA</state>

<zip>94117</zip>

</address>

<phone area_code="415">555-1212</phone>

</person>

<person gender="m">

<name>John Smith</name>

<phone area_code="510">555-1234</phone>

<email>johnsmith@somewhere.com</email>

</person>

</address_book>
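For completeness, here is a minimal sketch of reading this address book in Python
with the standard xml.etree.ElementTree module; the file name address_book.xml is an
assumption for illustration.

# Parse the address book XML above with Python's built-in ElementTree module.
import xml.etree.ElementTree as ET

tree = ET.parse('address_book.xml')   # assumed file name for the sample document
for person in tree.getroot().findall('person'):
    name = person.findtext('name')
    phone = person.find('phone')
    print(name, person.get('gender'), phone.text, phone.get('area_code'))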

How Do You Describe an XML Document?

There are two ways to describe an XML document: XML Schemas and DTDs.
XML Schemas define the basic requirements for the structure of a particular XML
document. A Schema describes the elements and attributes that are valid in an XML
document, and the contexts in which they are valid. In other words, a Schema
specifies which tags are allowed within certain other tags, and which tags and
attributes are optional. Schemas are themselves XML files.
The schema specification is a product of the World Wide Web Consortium (W3C).
For detailed information on XML schemas,
see http://www.w3.org/TR/xmlschema-0/.

The following example shows a schema that describes the preceding address book
sample XML document:
<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">

<xsd:element name="address_book" type="bookType"/>

<xsd:complexType name="bookType">
<xsd:element name="person" type="personType"/>
</xsd:complexType>


<xsd:complexType name="personType">
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="address" type="addressType"/>
<xsd:element name="phone" type="phoneType"/>
<xsd:element name="email" type="xsd:string"/>
<xsd:attribute name="gender" type="xsd:string"/>
</xsd:complexType>

<xsd:complexType name="addressType">

<xsd:element name="street" type="xsd:string"/>


<xsd:element name="city" type="xsd:string"/>
<xsd:element name="state" type="xsd:string"/>
<xsd:element name="zip" type="xsd:string"/>
</xsd:complexType>

<xsd:simpleType name="phoneType">
<xsd:restriction base="xsd:string"/>
<xsd:attribute name="area_code" type="xsd:string"/>
</xsd:simpleType>

</xsd:schema>

You can also describe XML documents using Document Type Definition (DTD)
files, a technology older than XML Schemas. DTDs are not XML files.
The following example shows a DTD that describes the preceding address book
sample XML document:

<!DOCTYPE address_book [
<!ELEMENT person (name, address?, phone?, email?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (street, city, state, zip)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>

<!ATTLIST person gender CDATA #REQUIRED>


<!ATTLIST phone area_code CDATA #REQUIRED>
]>


An XML document can include a Schema or DTD as part of the document itself,
reference an external Schema or DTD, or not include or reference a Schema or DTD
at all. The following excerpt from an XML document shows how to reference an
external DTD called address.dtd:

<?xml version="1.0"?>
<!DOCTYPE address_book SYSTEM "address.dtd">
<address_book>
...

XML documents only need to be accompanied by Schema or DTD if they need to


be validated by a parser or if they contain complex types. An XML document is
considered valid if 1) it has an associated Schema or DTD, and 2) it complies with
the constraints expressed in the associated Schema or DTD. If, however, an XML
document only needs to be well-formed, then the document does not have to be
accompanied by a Schema or DTD. A document is considered well-formed if it
follows all the rules in the W3C Recommendation for XML 1.0. For the full XML
1.0 specification, see http://www.w3.org/XML/.

YAML (YAML Ain’t Markup Language)


YAML is a digestible data serialization language that is often used to create
configuration files and works in conjunction with any programming language.
YAML is designed for human interaction and is a strict superset of JSON, another
data serialization language. Because it is a strict superset, it can do everything
that JSON can and more. One major difference is that newlines and indentation
actually mean something in YAML, as opposed to JSON, which uses brackets and braces.
The format lends itself to specifying configuration, which is how it is used at
CircleCI.

Basic Components of YAML File

The basic components of YAML are described below.

Conventional Block Format
This block format uses hyphen+space to begin a new item in a specified list. Observe
the example shown below:
--- # Favorite movies


- Casablanca

- North by Northwest

- The Man Who Wasn't There

Inline Format

Inline format is delimited with comma and space, and the items are enclosed in
square brackets as in JSON. Observe the example shown below:
--- # Shopping list

[milk, groceries, eggs, juice, fruits]

Folded Text

Folded text converts newlines to spaces and removes the leading whitespace.
Observe the example shown below −
- {name: John Smith, age: 33}
- name: Mary Smith
  age: 27

The structure which follows all the basic conventions of YAML is shown below −
men: [John Smith, Bill Jones]

women:

- Mary Smith

- Susan Williams

Synopsis of YAML Basic Elements

The synopsis of YAML basic elements is given here:
• Comments in YAML begin with the (#) character.
• Comments must be separated from other tokens by whitespaces.
• Indentation of whitespace is used to denote structure.
• Tabs are not included as indentation for YAML files.
• List members are denoted by a leading hyphen (-).
• List members are enclosed in square brackets and separated by commas.
• Associative arrays are represented using colon ( : ) in the format of key value
pair. They are enclosed in curly braces {}.


• Multiple documents with single streams are separated with 3 hyphens (---).
• Repeated nodes in each file are initially denoted by an ampersand (&) and by
an asterisk (*) mark later.
• YAML requires that colons and commas used as separators be followed by a
space when used with scalar values.
• Nodes should be labelled with an exclamation mark (!) or double exclamation
mark (!!), followed by a string which can be expanded into a URI or URL.
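A minimal sketch of loading such a document in Python with the third-party PyYAML
package (installed with pip install pyyaml); the document below mirrors the block
and inline examples above.

# Parse a small YAML document with the PyYAML package.
import yaml

doc = """
men: [John Smith, Bill Jones]
women:
  - Mary Smith
  - Susan Williams
"""

data = yaml.safe_load(doc)
print(data['men'][0])   # John Smith
print(data['women'])    # ['Mary Smith', 'Susan Williams']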

XML vs JSON vs YAML vs CSV


Here are some guiding considerations to make decisions when choosing a flat file
format for the data to be stored:

Choose XML

• You have to represent mixed content (tags mixed within text); you might even
consider HTML for this reason.
• There's already an industry standard XSD to follow.
• You need to transform the data to another XML/HTML format. (XSLT is
great for transformations.)
Choose JSON
• You have to represent data records, and a closer fit to JavaScript is valuable
to your team or your community.
Choose YAML
• You have to represent data records, and you value some additional features
missing from JSON: comments, strings without quotes, order-preserving
maps, and extensible data types.
Choose CSV
• You have to represent data records, and you value ease of import/export
with databases and spreadsheets.
The flat file is easy to use, but it has the disadvantage that its input/output
speed is not good, so it should be considered carefully before using it for tasks
that need to process data at high speed or at large scale. This disadvantage is
overcome by some of the other data sources, which are presented below.

Binary data formatted as the table/ matrix


Another common type of data is a binary table. The file types mentioned in this
section contain information in binary format, and the data inside is arranged in
standard structures that make the files easy to create and use. These files have
great advantages in terms of data exchange speed and capacity.
Some commonly used binary file formats are:

XLS/XLSX files
The XLSX and XLS file extensions are used for Microsoft Excel spreadsheets, part
of the Microsoft Office Suite of software.
XLSX/XLS files are used to store and manage data such as numbers, formulas, text,
and drawing shapes.
XLSX is part of Microsoft Office Open XML specification (also known as OOXML
or OpenXML), and was introduced with Office 2007. XLSX is a zipped, XML-based
file format. Microsoft Excel 2007 and later uses XLSX as the default file format
when creating a new spreadsheet. Support for loading and saving legacy XLS files
is also included.
XLS is the default format used with Office 97-2003. XLS is a Microsoft proprietary
Binary Interchange File Format. Microsoft Excel 97-2003 uses XLS as the default
file format when creating a new document.
The default extension used by this format is: XLSX or XLS.

How to load XLS/XLSX files with Python


Excel files can be read easily using the pandas module in Python.

Read Excel column names

We import the Pandas module, including ExcelFile. The method read_excel() reads
the data into a Pandas Data Frame, where the first parameter is the filename and the
second parameter is the sheet.


The list of columns is available as df.columns.

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('File.xlsx', sheet_name='Sheet1')

print("Column headings:")
print(df.columns)

Using the data frame, we can get all the values in a column as a list; simply use
the column header:

print(df['Sepal width'])

Read Excel data

We start with a simple Excel file, a subset of the Iris dataset.

To iterate over the list, we can use a loop:

for i in df.index:
print(df['Sepal width'][i])

We can save an entire column into a list:

listSepalWidth = df['Sepal width']


print(listSepalWidth[0])

We can simply take entire columns from an excel sheet:


import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('File.xlsx', sheet_name='Sheet1')

sepalWidth = df['Sepal width']


sepalLength = df['Sepal length']
petalLength = df['Petal length']

AVRO files
Avro is an open source project that provides data serialization and data exchange
services for Apache Hadoop. These services can be used together or independently.
Avro facilitates the exchange of big data between programs written in any language.
With the serialization service, programs can efficiently serialize data into files or
into messages. The data storage is compact and efficient. Avro stores both the data
definition and the data together in one message or file.
Avro stores the data definition in JSON format making it easy to read and interpret;
the data itself is stored in binary format making it compact and efficient. Avro files
include markers that can be used to split large data sets into subsets suitable for
Apache MapReduce processing. Some data exchange services use a code generator
to interpret the data definition and produce code to access the data. Avro doesn't
require this step, making it ideal for scripting languages.
A key feature of Avro is robust support for data schemas that change over time —
often called schema evolution. Avro handles schema changes like missing fields,
added fields and changed fields; as a result, old programs can read new data and new
programs can read old data. Avro includes APIs for Java, Python, Ruby, C, C++ and
more. Data stored using Avro can be passed from programs written in different
languages, even from a compiled language like C to a scripting language like Apache
Pig.

Working with Avro Data


Apache Avro is a data serialization framework where the data is serialized in a
compact binary format. Avro specifies that data types be defined in JSON. Avro
format data has an independent schema, also defined in JSON. An Avro schema,
together with its data, is fully self-describing.

Data Type Mapping


Avro supports both primitive and complex data types.
To represent Avro primitive data types in Greenplum Database, map data values to
Greenplum Database columns of the same type.
Avro supports complex data types including arrays, maps, records, enumerations,
and fixed types. Map top-level fields of these complex data types to the Greenplum
Database TEXT type. While Greenplum Database does not natively support these
types, you can create Greenplum Database functions or application code to extract
or further process subcomponents of these complex data types.
The following table summarizes external mapping rules for Avro data.

Avro Data Type                              PXF/Greenplum Data Type
boolean                                     boolean
bytes                                       bytea
double                                      double
float                                       real
int                                         int or smallint
long                                        bigint
string                                      text
Complex type: Array, Map, Record, or Enum   text, with delimiters inserted between collection items, mapped key-value pairs, and record data
Complex type: Fixed                         bytea
Union                                       follows the above conventions for primitive or complex data types, depending on the union; supports Null values


Avro Schemas and Data


Avro schemas are defined using JSON, and composed of the same primitive and
complex types identified in the data type mapping section above. Avro schema files
typically have a .avsc suffix.
Fields in an Avro schema file are defined via an array of objects, each of which is
specified by a name and a type.

Handling Avro files in Python


This is an example usage of avro-python3 in a Python 3 environment.
# Python 3 with `avro-python3` package available

import copy

import json

import avro

from avro.datafile import DataFileWriter, DataFileReader

from avro.io import DatumWriter, DatumReader

# Note that we combined namespace and name to get "full name"

schema = {
    'name': 'avro.example.User',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'}
    ]
}

# Parse the schema so we can use it to write the data
schema_parsed = avro.schema.Parse(json.dumps(schema))

# Write data to an avro file

with open('users.avro', 'wb') as f:


writer = DataFileWriter(f, DatumWriter(), schema_parsed)

writer.append({'name': 'Pierre-Simon Laplace', 'age': 77})

writer.append({'name': 'John von Neumann', 'age': 53})

writer.close()

# Read data from an avro file

with open('users.avro', 'rb') as f:

reader = DataFileReader(f, DatumReader())

metadata = copy.deepcopy(reader.meta)

schema_from_file = json.loads(metadata['avro.schema'])

users = [user for user in reader]

reader.close()

print(f'Schema that we specified:\n {schema}')

print(f'Schema that we parsed:\n {schema_parsed}')

print(f'Schema from users.avro file:\n {schema_from_file}')

print(f'Users:\n {users}')

# Schema that we specified:

# {'name': 'avro.example.User', 'type': 'record',

# 'fields': [{'name': 'name', 'type': 'string'}, {'name': 'age', 'type': 'int'}]}

# Schema that we parsed:

# {"type": "record", "name": "User", "namespace": "avro.example",

# "fields": [{"type": "string", "name": "name"}, {"type": "int", "name": "age"}]}

# Schema from users.avro file:

# {'type': 'record', 'name': 'User', 'namespace': 'avro.example',

# 'fields': [{'type': 'string', 'name': 'name'}, {'type': 'int', 'name': 'age'}]}

# Users:

# [{'name': 'Pierre-Simon Laplace', 'age': 77}, {'name': 'John von Neumann', 'age': 53}]

Issue with name, namespace, and full name


An interesting thing to note is what happens with the name and namespace fields.
The schema we specified has the full name of the schema that has both
name and namespace combined, i.e., 'name': 'avro.example.User'. However, after
parsing with avro.schema.Parse(), the name and namespace are separated into
individual fields. Further, when we read back the schema from the users.avro file,
we also get the name and namespace separated into individual fields.
Avro specification, for some reason, uses the name field for both the full name and
the partial name. In other words, the name field can either contain the full name or
only the partial name. Ideally, Avro specification should have kept partial_name,
namespace, and full_name as separate fields.
This behind-the-scene separation and in-place modification may cause unexpected
errors if your code depends on the exact value of name. One common use case is
when you’re handling lots of different schemas and you want to
identify/index/search by the schema name.
A best practice to guard against possible name errors is to always parse a dict
schema into an avro.schema.RecordSchema using avro.schema.Parse(). This will
generate the namespace, fullname, and simple_name (partial name), which you can
then use with peace of mind.
print(type(schema_parsed))

# <class 'avro.schema.RecordSchema'>

print(schema_parsed.avro_name.fullname)

# avro.example.User

print(schema_parsed.avro_name.simple_name)

# User

print(schema_parsed.avro_name.namespace)

# avro.example

This problem of name and namespace deepens when we use a third-party package
called fastavro, as we will see in the next section.

Avro <> DataFrame


As we have seen above, the Avro format simply requires a schema and a list of
records. We don't need a dataframe to handle Avro files. However, we can write
a pandas dataframe into an Avro file or read an Avro file into a pandas dataframe.
To begin with, we can always represent a dataframe as a list of records and vice
versa:
• List of records – pandas.DataFrame.from_records() –> Dataframe
• List of records <– pandas.DataFrame.to_dict(orient='records') – Dataframe
Using the two functions above in conjunction with avro-python3 or fastavro, we can
read/write dataframes as Avro. The only additional work we would need to do is to
inter-convert between pandas data types and Avro schema types ourselves.
An alternative solution is to use a third-party package called pandavro, which does
some of this inter-conversion for us.
import copy

import json

import pandas as pd

import pandavro as pdx

from avro.datafile import DataFileReader

from avro.io import DatumReader

# Data to be saved

users = [{'name': 'Pierre-Simon Laplace', 'age': 77},

{'name': 'John von Neumann', 'age': 53}]

users_df = pd.DataFrame.from_records(users)

print(users_df)

# Save the data without any schema

pdx.to_avro('users.avro', users_df)

# Read the data back

users_df_redux = pdx.from_avro('users.avro')

print(type(users_df_redux))

# <class 'pandas.core.frame.DataFrame'>


# Check the schema for "users.avro"

with open('users.avro', 'rb') as f:

reader = DataFileReader(f, DatumReader())

metadata = copy.deepcopy(reader.meta)

schema_from_file = json.loads(metadata['avro.schema'])

reader.close()

print(schema_from_file)

# {'type': 'record', 'name': 'Root',

# 'fields': [{'name': 'name', 'type': ['null', 'string']},

# {'name': 'age', 'type': ['null', 'long']}]}

In the above example, we didn't specify a schema ourselves and pandavro assigned
the name = Root to the schema. We can also provide a schema dict to the
pandavro.to_avro() function, which will preserve the name and namespace
faithfully.

Avro with PySpark


Using Avro with PySpark is fraught with a sequence of issues; let's see the common
ones step by step.
We will use Scala 2.11, Spark 2.4.4, and Apache Spark's spark-avro 2.4.4 within a
pyspark shell.
$ $SPARK_INSTALLATION/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4

Within the pyspark shell, we can run the following code to write and read Avro.
# Data to store

users = [{'name': 'Pierre-Simon Laplace', 'age': 77},

{'name': 'John von Neumann', 'age': 53}]

# Create a pyspark dataframe

users_df = spark.createDataFrame(users, 'name STRING, age INT')

# Write to a folder named users


users_df.write.format('avro').mode("overwrite").save('users-folder')

# Read the data back

users_df_redux = spark.read.format('avro').load('./users-folder')

Image, Video & Audio files


There is another rich vein of data available, however, in the form of multimedia.
This collection of binary-based data includes images, videos, and audio.
- An image, digital image, or still image is a binary representation of visual
information, such as drawings, pictures, graphs, logos, or individual video
frames.
- A video is a sequence of images (called frames) captured and eventually
displayed at a given frequency. However, by stopping at a specific frame of
the sequence, a single video frame, i.e. an image, is obtained.
- Any digital information with speech or music stored on and played through a
computer is known as an audio file or sound file.

Digital image (video) structure:


In digital world, an image is represented as a two-dimensional function f(x,y) , where
x and y are spatial coordinates, and the value of f at any pair of coordinates (x,y) is
called the intensity of the image at that point. For a gray-range image, the intensity
is given by just one value (one channel). For color images, the intensity is a 3D
vector (three channels), usually distributed in the order RGB.
There are three main types of images:

• Intensity image is a data matrix whose values have been scaled to represent
intensities. When the elements of an intensity image are of class uint8 or class
uint16, they have integer values in the range [0,255] and [0,65535],
respectively. If the image is of class float32, the values are single-precision
floating-point numbers, usually scaled to the range [0,1], although it is not
rare to use the scale [0,255] too.


• Binary image is a black and white image. Each pixel has one logical value,
0 or 1.
• Color image is like an intensity image but with three channels, i.e. each pixel
has three intensity values (RGB) instead of one.
The result of sampling and quantization is a matrix of real numbers. The size of the
image is the number of rows by the number of columns, M×N. In Python, the image is
indexed with the usual convention: row first, then column.


When performing mathematical transformations of images, we often need the image to
be of double (floating-point) type, but when reading and writing we save space by
using integer codification.
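A minimal sketch of this convention with NumPy and Pillow (Pillow is introduced in
the next section); the file name picture.jpg is an assumption for illustration.

# Read an image as uint8, convert to float for processing, then back to uint8.
import numpy as np
from PIL import Image

img = np.asarray(Image.open('picture.jpg'))    # uint8 matrix, shape (M, N, 3) for RGB
print(img.shape, img.dtype)

img_float = img.astype(np.float64) / 255.0     # double type, scaled to [0, 1]
darker = (img_float * 0.5 * 255).astype(np.uint8)   # back to integer codification
Image.fromarray(darker).save('picture_dark.jpg')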

Working with Images in Python


(Link: https://www.geeksforgeeks.org/working-images-python/)
PIL is the Python Imaging Library which provides the python interpreter with image
editing capabilities. It was developed by Fredrik Lundh and several other
contributors. Pillow is the friendly PIL fork and an easy to use library developed by
Alex Clark and other contributors. We’ll be working with Pillow.
Installation:
• Linux: On Linux terminal type the following:
pip install Pillow
Installing pip via terminal:
sudo apt-get update
sudo apt-get install python-pip

• Windows: Download the appropriate Pillow package for the Python version you
have installed.
We'll be working with the Image module here, which provides a class of the same
name and a lot of functions to work on our images. To import the Image module,
our code should begin with the following line:

from PIL import Image

Operations with Images:


• Open a particular image from a path:


#img = Image.open(path)
# On successful execution of this statement,
# an object of Image type is returned and stored in img variable)
try:
img = Image.open(path)
except IOError:
pass
# Use the above statement within try block, as it can
# raise an IOError if file cannot be found,
# or image cannot be opened.

Retrieve size of image: The instances of Image class that are created have many
attributes, one of its useful attribute is size.

from PIL import Image


filename = "image.png"
with Image.open(filename) as image:
width, height = image.size
#Image.size gives a 2-tuple and the width, height can be obtained

• Some other attributes are: Image.width, Image.height, Image.format,


Image.info etc.
• Save changes in image: To save any changes that you have made to the
image file, we need to give path as well as image format.

img.save(path, format)
# format is optional, if no format is specified,
#it is determined from the filename extension

• Rotating an Image: Image rotation takes the angle as a parameter and returns
the rotated image.

from PIL import Image

def main():
    try:
        # Relative path
        img = Image.open("picture.jpg")
        # Rotate by the given angle
        img = img.rotate(180)
        # Save in the same relative location
        img.save("rotated_picture.jpg")
    except IOError:
        pass

if __name__ == "__main__":
    main()

Digital audio structure

(Link: https://www.kdnuggets.com/2020/02/audio-data-analysis-deep-learning-python-part-1.html)
The sound excerpts are digital audio files in audio format. Sound waves are digitized
by sampling them at discrete intervals known as the sampling rate (typically 44.1kHz
for CD-quality audio meaning samples are taken 44,100 times per second).
Each sample is the amplitude of the wave at a particular time interval, where the
bit depth determines how detailed the sample will be, also known as the dynamic
range of the signal (typically 16-bit, which means a sample can take one of 65,536
possible amplitude values).


Handle audio file by Python:


Python has some great libraries for audio processing, like Librosa and PyAudio.
There are also built-in modules for some basic audio functionality.
We will mainly use two libraries for audio acquisition and playback:
1. Librosa
It is a Python module for analyzing audio signals in general, but geared more
towards music. It includes the nuts and bolts to build a MIR (music information
retrieval) system. It is very well documented, with a lot of examples and
tutorials.
Installation:
pip install librosa
or
conda install -c conda-forge librosa
To fuel more audio-decoding power, you can install ffmpeg which ships with many
audio decoders.
2. IPython.display.Audio
IPython.display.Audio lets you play audio directly in a jupyter notebook.


Take a short audio file (the example below assumes a file named gruesome.wav) and
load it in your Jupyter console.
Loading an audio file:
import librosa

audio_data = '/../../gruesome.wav'

x , sr = librosa.load(audio_data)

print(type(x), type(sr))   # <class 'numpy.ndarray'> <class 'int'>
print(x.shape, sr)         # (94316,) 22050

This returns an audio time series as a numpy array with a default sampling rate (sr)
of 22.05 kHz, mono. We can change this behavior by resampling at 44.1 kHz:
librosa.load(audio_data, sr=44100)

Or, to disable resampling:
librosa.load(audio_data, sr=None)

The sample rate is the number of samples of audio carried per second, measured in
Hz or kHz.
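Because the sample rate ties sample counts to time, the duration of the clip follows
directly; a quick check using the array and sample rate loaded above:

# Duration in seconds = number of samples / samples per second.
duration = len(x) / sr               # e.g. 94316 / 22050 ≈ 4.28 seconds
print(round(duration, 2))

# librosa provides a helper that computes the same value:
print(librosa.get_duration(y=x, sr=sr))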

Data from SQL database and data warehouse

Data from Data Warehouse:
A data warehouse (DW) is a digital storage system that connects and harmonizes
large amounts of data from many different sources. Its purpose is to feed business
intelligence (BI), reporting, and analytics, and support regulatory requirements – so
companies can turn their data into insight and make smart, data-driven decisions.
Data warehouses store current and historical data in one place and act as the single
source of truth for an organization.


Most data-warehousing projects combine data from different source systems. Each
separate system may also use a different data organization and/or format. Common
data-source formats include relational databases, XML, JSON and flat files, but may
also include non-relational database structures such as Information Management
System (IMS) or other data structures such as Virtual Storage Access Method
(VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched
from outside sources by means such as web spidering or screen-scraping.
There are two scenarios when getting data from a DW:
- Extract data directly from DW with valid credentials to access databases.
- Extract data indirectly and limited (data size, extract speed, permissions) via
given APIs, please refer part: Web Services: Data from API for more
information.
* API-specific challenges: While it may be possible to extract data from a database
using SQL, the extraction process for SaaS products relies on each platform’s
application programming interface (API). Working with APIs can be challenging:

• APIs are different for every application.


• Many APIs are not well documented. Even APIs from reputable, developer-friendly
companies sometimes have poor documentation.
• APIs change over time. For example, Facebook’s “move fast and break
things” approach means the company frequently updates its reporting APIs –
and Facebook doesn’t always notify API users in advance.
* The steps that you should follow for extracting data from a DW:

Step 1: Purpose of extracting data:


Determine the objective of the data, for example: which process you want to analyze.
Indicate the scope of the process (where it starts and where it ends).

Step 2: Selection Method


Determine whether you can extract all activities in a certain timeframe or whether
you will select cases based on a particular start or end activity. If you select cases
based on a start or end activity, write down which start or end activity will be used.

Step 3: Databases, Tables and Fields


Based on the data description of the DW, define the databases, tables and fields that relate to the data you want to extract.

Step 4: Timeframe
Read the chapter on How Much Data Do You Need? and determine the timeframe
based on your data selection method. Write down the start and the end date of the
timeframe you want to extract the data for.

Step 5: Required Format


When you ask someone to extract data for you, they will ask about the required
format. With “format” they typically mean everything, including the content, the
timeframe, etc.—not just the actual file format.
* To query data from a Data Warehouse, please refer to the section Data from SQL database below.

Data from SQL database:


A database is a systematic collection of data. They support electronic storage and
manipulation of data. Databases make data management easy.
Database files store data in a structured format, organized into tables and fields.
Individual entries within a database are called records. Databases are commonly
used for storing data referenced by dynamic websites.
Structure of a database:


Microsoft SQL Server is a relational database management system (RDBMS) that, at its fundamental level, stores the data in tables. The tables are the database objects
that behave as containers for the data, in which the data will be logically organized
in rows and columns format. Each row is considered as an entity that is described by
the columns that hold the attributes of the entity. For example, the customers table
contains one row for each customer, and each customer is described by the table
columns that hold the customer information, such as the CustomerName and
CustomerAddress. The table rows have no predefined order, so that, to display the
data in a specific order, you would need to specify the order that the rows will be
returned in. Tables can also be used as a security boundary/mechanism, where
database users can be granted permissions at the table level.

Table basics
SQL Server tables are contained within database object containers that are called
Schemas. The schema also works as a security boundary, where you can limit
database user permissions to be on a specific schema level only. You can imagine
the schema as a folder that contains a list of files. You can create up to 2,147,483,647
tables in a database, with up to 1024 columns in each table. When you design a
database table, the properties that are assigned to the table and the columns within
the table will control the allowed data types and data ranges that the table accepts.


Proper table design will make it easier and faster to store data into and retrieve data
from the table.

Special table types


In addition to the basic user defined table, SQL Server provides us with the ability
to work with other special types of tables. The first type is the Temporary Table that
is stored in the tempdb system database. There are two types of temporary tables: a
local temporary table that has the single number sign prefix (#) and can be accessed
by the current connection only, and the Global temporary table that has two number
signs prefix (##) and can be accessed by any connection once created.
A Wide Table is a table that uses Sparse Columns for optimized storage of NULL values, reducing the space consumed by the table and increasing the number of columns allowed in that table to 30,000 columns.
System Tables are a special type of table in which the SQL Server Engine stores
information about the SQL Server instance configurations and objects information,
that can be queried using the system views.
Partitioned Tables are tables in which the data is divided horizontally into separate units in the same filegroup or different filegroups, based on a specific key, to enhance data retrieval performance.

Database Query
A query is a way of requesting information from the database. A database query can
be either a select query or an action query. A select query is a query for retrieving
data, while an action query requests additional actions to be performed on the data,
like deletion, insertion, and updating.
For example, a manager can perform a query to select the employees who were hired
5 months ago. The results could be the basis for creating performance evaluations.

Query Language
Many database systems expect you to make requests for information through a
stylised query written in a specific query language. This is the most complicated
method because it compels you to learn a specific language, but it is also the most
flexible.


Query languages are used to create queries in a database.

Examples of Query Languages


Structured Query Language (SQL) is the best-known query language. Other implementations and dialects under the SQL umbrella include:

• MySQL
• Oracle SQL
• NuoDB
Query languages for other types of databases, such as NoSQL databases and graph
databases, include the following:
• Cassandra Query Language (CQL)
• Neo4j’s Cypher
• Data Mining Extensions (DMX)
• XQuery

Power of Queries
A database can uncover intricate trends and actions, but this power is only unlocked through the use of queries. A complex database contains multiple tables storing vast amounts of data. A query lets you filter it into a single table, so you can analyse it much more easily.
Queries also can execute calculations on your data, summarise your data for you,
and even automate data management tasks. You can also evaluate updates to your
data prior to committing them to the database, for still more versatility of usage.
Queries can perform a number of various tasks. Mainly, queries are used to search
through data by filtering specific criteria. Other queries contain append, crosstab,
delete, make table, parameter, totals, and update tools, each of which performs a
specific function. For example, a parameter query runs variations of a specific query: it prompts the user to enter a field value and then uses that value to build the criteria. In comparison, totals queries let users organise and
summarise data.


In a relational database, which is composed of records or rows of data, the SQL SELECT statement lets the user select data and deliver it to an application from the database. The resulting data is saved in a result table, which is referred to as the result set. The SELECT statement can be combined with other clauses, such as FROM, WHERE and ORDER BY. The SQL SELECT query can also group and combine data, which can be useful for creating analyses or summaries.

Create table in database:


The SQL CREATE TABLE statement is used to create a new table.
Syntax
The basic syntax of the CREATE TABLE statement is as follows −

CREATE TABLE table_name(


column1 datatype,
column2 datatype,
column3 datatype,
.....
columnN datatype,
PRIMARY KEY( one or more columns )
);

CREATE TABLE is the keyword telling the database system what you want to do.
In this case, you want to create a new table. The unique name or identifier for the
table follows the CREATE TABLE statement.
Then in brackets comes the list defining each column in the table and what sort of
data type it is. The syntax becomes clearer with the following example.
A copy of an existing table can be created using a combination of the CREATE
TABLE statement and the SELECT statement.
Example
The following code block is an example that creates a CUSTOMERS table with ID as the primary key; the NOT NULL constraints indicate that these fields cannot be NULL when records are created in this table −


SQL> CREATE TABLE CUSTOMERS(


ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25) ,
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID)
);

You can verify whether your table has been created successfully by looking at the message displayed by the SQL server; otherwise, you can use the DESC command as follows:

SQL> DESC CUSTOMERS;


+---------+---------------+------+-----+---------+-------+
| Field   | Type          | Null | Key | Default | Extra |
+---------+---------------+------+-----+---------+-------+
| ID      | int(11)       | NO   | PRI |         |       |
| NAME    | varchar(20)   | NO   |     |         |       |
| AGE     | int(11)       | NO   |     |         |       |
| ADDRESS | char(25)      | YES  |     | NULL    |       |
| SALARY  | decimal(18,2) | YES  |     | NULL    |       |
+---------+---------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

Now, you have CUSTOMERS table available in your database which you can use
to store the required information related to customers.
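
As a small sketch (the connection string values and the sample rows are placeholders), the same table can be populated from Python using pyodbc with parameterized INSERT statements:

import pyodbc

# Placeholder connection details for a SQL Server instance
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;UID=my_user;PWD=my_password"
)
cursor = conn.cursor()

rows = [
    (1, 'Ramesh', 32, 'Ahmedabad', 2000.00),   # illustrative sample data
    (2, 'Khilan', 25, 'Delhi', 1500.00),
]

cursor.executemany(
    "INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY) VALUES (?, ?, ?, ?, ?)",
    rows,
)
conn.commit()
conn.close()

Parameterized queries (the ? placeholders) keep values out of the SQL text, which avoids quoting problems and SQL injection.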

Extract data from database:


The SELECT statement is used to retrieve data from the data tables. Following are
examples of SQL SELECT statements:


• To select all columns from a table (Customers) for rows where the Last_Name
column has Smith for its value, you would send this SELECT statement to the
server back end:

SELECT * FROM Customers WHERE Last_Name='Smith';

The server back end would reply with a result set similar to this:

+---------+-----------+------------+
| Cust_No | Last_Name | First_Name |
+---------+-----------+------------+
| 1001    | Smith     | John       |
| 2039    | Smith     | David      |
| 2098    | Smith     | Matthew    |
+---------+-----------+------------+
3 rows in set (0.05 sec)

• To return only the Cust_No and First_Name columns, based on the same
criteria as above, use this statement:

SELECT Cust_No, First_Name FROM Customers WHERE Last_Name='Smith';

The subsequent result set might look like:

+---------+------------+
| Cust_No | First_Name |
+---------+------------+
| 1001    | John       |
| 2039    | David      |
| 2098    | Matthew    |
+---------+------------+
3 rows in set (0.05 sec)

To make a WHERE clause find inexact matches, add the pattern-matching operator LIKE. The LIKE operator uses the % (percent symbol) wild card to match


zero or more characters, and the underscore ( _) wild card to match exactly one
character. For example:
• To select the First_Name and Nickname columns from the Friends table for
rows in which the Nickname column contains the string "brain", use this
statement:

SELECT First_Name, Nickname FROM Friends WHERE Nickname LIKE '%brain%';

The subsequent result set might look like:

+------------+-----------+
| First_Name | Nickname  |
+------------+-----------+
| Ben        | Brainiac  |
| Glen       | Peabrain  |
| Steven     | Nobrainer |
+------------+-----------+
3 rows in set (0.03 sec)

• To query the same table, retrieving all columns for rows in which the First_Name column's value begins with any letter and ends with "en", use this statement:

SELECT * FROM Friends WHERE First_Name LIKE '_en';

The result set might look like:

+------------+-----------+----------+
| First_Name | Last_Name | Nickname |
+------------+-----------+----------+
| Ben        | Smith     | Brainiac |
| Jen        | Peters    | Sweetpea |
+------------+-----------+----------+
2 rows in set (0.03 sec)


• If you used the % wild card instead (for example, '%en') in the example
above, the result set might look like:

+------------+-----------+-----------+
| First_Name | Last_Name | Nickname  |
+------------+-----------+-----------+
| Ben        | Smith     | Brainiac  |
| Glen       | Jones     | Peabrain  |
| Jen        | Peters    | Sweetpea  |
| Steven     | Griffin   | Nobrainer |
+------------+-----------+-----------+
4 rows in set (0.05 sec)
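
The examples above can also be run from code. Below is a minimal sketch (connection details are placeholders) that sends the same LIKE queries to the server from Python and loads each result set into a pandas DataFrame:

import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;UID=my_user;PWD=my_password"
)

# Exact match on Last_Name
smiths = pd.read_sql(
    "SELECT * FROM Customers WHERE Last_Name = ?", conn, params=['Smith']
)

# Pattern match: nicknames containing the string 'brain'
brains = pd.read_sql(
    "SELECT First_Name, Nickname FROM Friends WHERE Nickname LIKE ?",
    conn,
    params=['%brain%'],
)

print(smiths.head())
print(brains.head())
conn.close()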

Web Services: Data from websites


There’s no denying that organizations are leveraging web data every day. The web
represents the single, largest data source – a data source that is growing exponentially
and changes constantly. It is where equity and financial research, retail and
manufacturing, and travel and hospitality businesses go to find the most up-to-date
information that can be used to inform decision-making, fuel investment models,
provide alternative data sets, and offer insights.
Businesses around the world are losing trillions of dollars due to lack of timely
access to high-quality data. In fact, IBM estimates that poor-quality data costs
businesses in the U.S. more than $3 trillion annually. Today, organizations trying to
leverage web data use a technique called web scraping. But just as the internet has
brought a revolution to information by making it possible to access almost any
information, communicate with anyone else in the world, and so much more,
organizations can do better when it comes to leveraging web data – they can use a
Web Data Integration (WDI) approach.

A More Sophisticated Perspective on Web Scraping


WDI is an emerging category – a revolution – that does away with the need for
traditional web scraping. Web Data Integration is a new approach to acquiring and


managing web data that focuses on data quality and control. It still achieves the same
objectives as web scraping, but it is much more sophisticated, providing an end-to-
end solution that treats the entire web data lifecycle as a single, integrated process.

Web scraping is in fact a component of Web Data Integration, but Web Data
Integration also allows you to:
• extract data from non-human readable output (hidden data)
• programmatically extract data several screens deep into transaction flows
• perform calculations and combinations on data to make it richer and more
meaningful
• cleanse the data
• normalize the data
• apply additional QA processes
• transform the data
• integrate the data not just via files but APIs and streaming capabilities
• extract data on demand
• analyze data with change, comparison, and custom reports


Web Data Integration Unlocks the Value of Web Data


According to Opimas Research, total spend on Web Data Integration is
estimated to hit $5 billion in 2019. Given this reporting on estimated spend, it seems
that as companies urgently try to become “data-driven” as a part of digital
transformation, they are also stepping up their game when it comes to web data, the
value of it, and how they work with it.

Ovum reports that when treated as a single, holistic workflow (from web data
extraction to insight) with the same level of data validation discipline that is normally
accorded to conventional BI data or big data, web data can yield valuable insights.
This is the value of a Web Data Integration approach – and why Import.io has
developed an end-to-end Web Data Integration platform to better serve the need to
treat the web data each company (or each team) needs as the valuable data set that it
truly is.
As market research, business intelligence, analyst, and data teams in companies from
a broad range of industries continue to realize the value that can be found in datasets
that reside outside of their organizations’ walls, they will undoubtedly turn to the
web as a key source of intelligence. High-quality Web Data Integration solutions
enable the speedy and repeatable automation of web data capture and aggregation to
fuel a broad array of mission critical strategies like:

• staying a step ahead of the competition by monitoring pricing from rival retailers or manufacturers


• rating the financial health of companies through indicators such as sentiment expressed in industry blogs, social media, or news aggregator sites
• gauging risk by tracing product reviews to gain insights into product quality
or perceptions.
Data from the web complements conventional enterprise analytic data or big data by
adding evidence or providing context. And, for those companies who realize the
need to go beyond traditional web scraping, Web Data Integration will provide a
competitive edge by yielding hidden insights about the market.

Web Services: Data from API


An increasingly popular method for collecting data online is via a Representational State Transfer Application Programming Interface, which is the full name of a REST API, or simply API. This refers to a set of protocols that a user can use to query a
web service for data. Many organizations make their data available through an API.
The three most common formats for data from API are JSON, XML, and YAML
(see detail in Flat Files section).

What is an API?
API stands for application programming interface. The most important part of this
name is “interface,” because an API essentially talks to a program for you. You still
need to know the language to communicate with the program, but without an API,
you won’t get far.
When programmers decide to make some of their data available to the public, they
“expose endpoints,” meaning they publish a portion of the language they’ve used to
build their program. Other programmers can then pull data from the application by
building URLs or using HTTP clients (special programs that build the URLs for you)
to request data from those endpoints.
Endpoints return text that’s meant for computers to read, so it won’t make complete
sense if you don’t understand the computer code used to write it.
An API allows one program to request data from another.


Why You Should Use an API


Computers make a lot of things easier, especially tasks that involve collecting and
sorting through tons of data. Let’s say you wanted to know how many times a
particular business partner submitted invoices to your company. You could feasibly
go into your company’s invoice records, scan the “from” data input, and print each
invoice individually for your audit.
On the other hand, if all invoices were uploaded to a central database, you could
write a simple program that accesses that database and finds all the instances of
the partner’s name. This would take much less time and be much more accurate.

What are the benefits of APIs?


APIs are all over the web, and are therefore common in modern business. Due to
their ease of use, there’s been a huge increase in API usage among platform and
infrastructure businesses. On top of that, APIs enable users to integrate apps like
Salesforce, Eloqua, and Marketo to better integrate their lead routing. It’s important
for revenue teams to flow lead data between their marketing platform and CRM, and
for customer support teams to flow data between their helpdesk and payment
processing system to manage renewals, upsells, and churn.
The best platforms for working with APIs make use of their power and flexibility by
letting users port over data from custom fields, which is a real pain point, even for
software with out-of-the-box integrations. For example, Marketo and Salesforce
offer native integrations, but since every sales organization designates their CRM
fields differently, this integration typically isn’t up to the task of keeping up with
every custom field. Modern API-based tools, such as the Tray Platform, give users
the ability and flexibility to map custom fields between different apps, and even flow
data directly between them via API calls. Providing this power saves a ton of time
and prevents manual errors, while freeing up users to focus on more-important
strategic concerns, such as how to grow customer engagement or win more sales
deals.


Architecture of an API
APIs consist of three parts:
• User: the person who makes a request
• Client: the computer that sends the request to the server
• Server: the computer that responds to the request
Someone will build the server first, since it acquires and holds data. Once that server
is running, programmers publish documentation, including the endpoints where
specific data can be found. This documentation tells outside programmers the
structure of the data on the server. An outside user can then query (or search) the
data on the server, or build a program that runs searches on the database and
transforms that information into a different, usable format.
That’s super confusing, so let’s use a real example: an address book.
Back in the analog days, you’d receive a copy of the White Pages phone book, which
listed every person in your town by name and address, in alphabetical order. If you
needed a friend’s address, you could look them up by last name, find the address,
then look up their street on the maps included in the back. This was a limited amount
of information, and it took a long time to access. Now, through the magic of
technology, all of that information can be found in a database.
Let’s build a database containing the White Pages for a fictional town called
Customer_info. The folks at Customer_info decided that when they built their
database, they would create a few categories of information with nested data
underneath. These are our endpoints, and they’ll include all the information the API
will release to an outside program.
Here are the endpoints listed in the documentation for Customer_info:
• /names
o /first_name, /last_name
• /addresses
o /street_address, /email_address/
• /phones


o /home_phone, /mobile_phone
Obviously, this isn’t all the information that could be collected about a person. Even
if Customer_info collected more private information about the Customer_info
residents (like birthdates and social security numbers), this data wouldn’t be
available to outside programmers without knowing the language of those endpoints.
These endpoints tell you the language you must use to request information from the
database. Should you want a list of all of the folks in Customer_info with the last
name Smith, you could do one of two things:
1. Make a URL request in a browser for that information. This uses your
internet browser as the client, and you will get back a text document in coding
language to sort through. That URL might look something like this:
http://api.customer_info.com/names?last_name=smith
2. Use a program that requests the information and translates it into usable
form. You can code your own program or use a ready-made HTTP client.
The first option is great for making simple requests with only a few responses (all of
the folks in a Customer_info with the last name Xarlax, for example — I’m pretty
sure there are only six households with this name in Customer_info). The second
option requires more coding fluency, but is great for programmers who want to use
the database of another program to enhance their own apps.
Many companies use the open APIs from larger companies like Google and
Facebook to access data that might not otherwise be available. APIs, in this case,
significantly lower the barriers to entry for smaller companies that would otherwise
have to compile their own data.
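
As a sketch against the hypothetical Customer_info API described above (the authorization header and the shape of the JSON response are assumptions, not a real service), the same request can be made from Python with an HTTP client library such as requests:

import requests

API_KEY = "your-api-key"                       # issued by the (hypothetical) provider
url = "http://api.customer_info.com/names"

response = requests.get(
    url,
    params={"last_name": "smith"},             # builds ...?last_name=smith
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
response.raise_for_status()                    # fail loudly on HTTP errors

for person in response.json():                 # assuming the endpoint returns a JSON list
    print(person["first_name"], person["last_name"])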

Actions You Can Take Through an API


Wow. Okay, so an API is how two computers talk to each other. The server has the
data and sets the language, while the client uses that language to ask for information
from the server (FYI, servers do not send data without a client requesting data, but
developers have found some ways around this with webhooks). APIs can do
anything!
Well, not so fast. The language and syntax of APIs severely limits their abilities.
There are four types of actions an API can take:


• GET: requests data from a server — can be status or specifics (like last_name)
• POST: sends changes from the client to the server; think of this as adding
information to the server, like making a new entry
• PUT: revises or adds to existing information
• DELETE: deletes existing information
When you combine the endpoints with these actions, you can search or update any
available information over an API. You’ll need to check the API documentation to
find out how to code these actions, as they’re all different.
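
To make the four actions concrete, here is a hedged sketch against the same hypothetical Customer_info API (the paths, payloads and the record id are illustrative only; a real API's documentation defines the actual shapes):

import requests

BASE = "http://api.customer_info.com"
headers = {"Authorization": "Bearer your-api-key"}

# GET: read data
r = requests.get(f"{BASE}/names", params={"last_name": "smith"}, headers=headers)

# POST: create a new entry
r = requests.post(f"{BASE}/names",
                  json={"first_name": "Jane", "last_name": "Smith"},
                  headers=headers)

# PUT: revise an existing entry (id 42 is made up)
r = requests.put(f"{BASE}/names/42", json={"first_name": "Janet"}, headers=headers)

# DELETE: remove an entry
r = requests.delete(f"{BASE}/names/42", headers=headers)
print(r.status_code)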
While we’re talking about language and syntax, let’s cover the ways you can make
a request on a server:
HTTP: hypertext transfer protocol. This is how you got to our site in the first place
— by typing a URL in the search bar in your browser. This is a really easy way to
access data, but it won’t come back to you in a pretty format if you request a lot of
information. We’ll go into this further in just a second.
Text formats: XML, JSON. These are the main languages for accessing data over
an API. When you receive your data, you will need to wade through the XML or
JSON code to understand what the server gave you.

How to Use an API?


Many of the API uses you’ll see in your daily business lives move information from
one program to similar form fields in another program. This is especially useful
when you’re trying to share information that you would otherwise have to enter
repeatedly — e.g. sharing leads between your marketing automation platform and
your CRM.
The uses and examples listed here are much more basic and pull much less data than a standard API integration, but they'll give you a good idea of the steps in the API process.
1. Most APIs require an API key. Once you find an API you want to play with,
look in the documentation for access requirements. Most APIs will ask you to
complete an identity verification, like signing in with your Google account.


You’ll get a unique string of letters and numbers to use when accessing the
API, instead of just adding your email and password every time (which isn’t
very secure — for more information on authorizations and verifications,
read this).
2. The easiest way to start using an API is by finding an HTTP client online, like
REST-Client, Postman, or Paw. These ready-made (and often free) tools
help you structure your requests to access existing APIs with the API key you
received. You’ll still need to know some of the syntax from the
documentation, but there is very little coding knowledge required.
3. The next best way to pull data from an API is by building a URL from existing
API documentation. This YouTube video explains how to pull location
data from Google Maps via API, and then use those coordinates to find nearby
photos on Instagram.
Overall, an API request doesn’t look that much different from a normal browser
URL, but the returned data will be in a form that’s easy for computers to read.
Finally, an API is useful for pulling specific information from another program. If
you know how to read the documentation and write the requests, you can get lots of
great data back, but it may be overwhelming to parse all of this. That’s where
developers come in. They can build programs that display data directly in an app or
browser window in an easily consumable format.
APIs have been a game-changer for modern software. The rise of the API economy
not only enables software companies to rapidly build in key functionality that might
previously have taken months or years of coding to implement, but it has also
enabled end users to connect their best-of-breed apps and flow data freely among
them via API calls.
Modern companies make the most of their cloud apps’ APIs to rapidly and
automatically deploy mission-critical data across their tech stack, even from custom
fields that don’t work with out-of-the-box integrations. FICO used API integrations
and automation to grow engagement for its marketing campaigns by double digits.
AdRoll used automated API calls among its revenue stack to increase sales meetings
13%.


Data Staging Area


The first destination of the data that has been extracted from source is the staging
area. Sometimes staging areas are also called landing zones for flat files, XML files,
Cobol files and the like.
This logical layer:
• Acts as a temporary storage area for data manipulation before it enters the
data Warehouse
• And serves to isolate the rate at which data is received into the data warehouse
from the frequency at which data is refreshed and made available to the end
user.
It is also possible that in some implementations this layer is not necessary, as all data
transformation processing will be done “on the fly” as data is extracted from the
source system before it is inserted directly into the Operational data layer
(Foundation).

Why Do We Need a Staging Area in a Data Warehouse?
A staging area is an intermediate storage area used for data processing during the
extract, transform and load (ETL) process. The data staging area sits between the
data source(s) and the data target(s), which are often data warehouses, data marts, or
other data repositories. Consider a simple data warehouse that takes data from a few RDBMS source systems and loads it into the dimension and fact tables of the warehouse. The staging area is the place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or the fact tables. We basically need a staging area to hold the data and perform data cleansing and merging before loading the data into the warehouse.
If the target and source databases are different and the target table volume is high (millions of records), then without a staging table we would need to design a workflow that uses a lookup to determine whether each record already exists in the target table. Because the target is so large, building that lookup cache is costly and hurts performance. If we create staging tables in the target database instead, we can simply do an outer join in the source qualifier to determine inserts versus updates; this approach gives good performance and avoids a full table scan of the target. We can also create indexes on the staging tables: since these tables are designed for a specific application, they do not impact any other schemas or users. While processing flat files into the data warehouse we can also perform cleansing. Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.

• Since it is a one-to-one mapping from the ODS (Operational Data Sources) to staging, we do a truncate and reload (see the sketch after this list).
• We can create indexes in the staging area so that our source qualifier performs at its best.
• If we have a staging area, there is no need to rely on an Informatica transformation to know whether a record already exists or not.
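
The truncate-and-reload pattern above can be sketched as follows (the schema, table and column names are hypothetical, the DSN is a placeholder, and the ODS is assumed to be reachable from the same server); staging is wiped and reloaded from the ODS, then merged into the warehouse table to decide inserts versus updates:

import pyodbc

conn = pyodbc.connect("DSN=my_dwh")   # placeholder DSN for the warehouse server
cur = conn.cursor()

# 1. Truncate and reload the staging table (one-to-one copy from the ODS)
cur.execute("TRUNCATE TABLE stg.Customer;")
cur.execute("""
    INSERT INTO stg.Customer (CustomerID, CustomerName, CustomerAddress)
    SELECT CustomerID, CustomerName, CustomerAddress
    FROM ods.Customer;
""")

# 2. Merge staging into the target: update existing rows, insert new ones
cur.execute("""
    MERGE dwh.DimCustomer AS tgt
    USING stg.Customer AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.CustomerAddress = src.CustomerAddress
    WHEN NOT MATCHED THEN
        INSERT (CustomerID, CustomerName, CustomerAddress)
        VALUES (src.CustomerID, src.CustomerName, src.CustomerAddress);
""")

conn.commit()
conn.close()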

Data cleansing
Weeding out unnecessary or unwanted things (characters, spaces, etc.) from incoming data to make it more meaningful and informative.

Data merging
Data can be gathered from heterogeneous systems and put together

Data scrubbing
Data scrubbing is the process of fixing or eliminating individual pieces of data that
are incorrect, incomplete or duplicated before the data is passed to end user.
Data scrubbing is aimed at more than eliminating errors and redundancy. The goal
is also to bring consistency to various data sets that may have been created with
different, incompatible business rules.


ODS - Operational Data Sources


The ODS is essentially a replica of the OLTP system; the need for it is to reduce the burden on the production system (OLTP) while fetching data for loading targets. Hence it is a standard requirement for every warehouse.
So do we transfer data from OLTP to the ODS every day to keep it up to date?
OLTP is a sensitive database: it should not be hit with many heavy SELECT statements, as that may impact its performance, and if something goes wrong while fetching data from OLTP into the data warehouse it directly impacts the business. The ODS is the replication of OLTP and enables management to gain a consistent picture of the business.
The ODS (operational data source) is sometimes treated as a kind of staging area. It is a main component of the data warehouse that collects data from different sources and captures day-to-day transactions. It is the database used to capture daily business activities, and it is a normalized database.
The ODS can be described as a snapshot of the OLTP system. It acts as a source for the EDW (Enterprise Data Warehouse). The ODS is more normalized than the EDW, and it does not store history. Normally the dimension tables remain in the ODS (SCD types can be applied in the ODS) whereas the facts flow on to the EDW. More importantly, client report requirements determine what should or should not be kept in the ODS or the EDW.

Some organizations also treat the ODS as a data backup or data recovery copy.
When loading the data warehouse, we take the ODS tables as source tables. The ODS is fully normalized, so retrieving data from it can take time. How often the ODS is refreshed depends on the organization.

Architecture
Image of Staging area:


See also: Data Warehousing - 34 Kimball Subsystems

Fundamental


Traditional

Back Room / Front Room


Be careful with this one: there is a fork from the staging area to both the data warehouse and the ODS. This fork is a bad design because you publish the same data twice and thereby introduce inconsistency.


Extract – Transform – Load (ETL)


What is ETL?
ETL is a process that extracts the data from different source systems, then transforms
the data (like applying calculations, concatenations, etc.) and finally loads the data
into the Data Warehouse system. Full form of ETL is Extract, Transform and Load.
It's tempting to think that creating a Data Warehouse is simply a matter of extracting data from multiple sources and loading it into the Data Warehouse database. This is far from the truth: it requires a complex ETL process. The ETL process requires active inputs from various stakeholders including developers, analysts, testers and top executives, and is technically challenging.

Why do you need ETL?


There are many reasons for adopting ETL in the organization:
• It helps companies to analyze their business data for taking critical business
decisions.
• Transactional databases cannot answer complex business questions that a data warehouse populated through ETL can.
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving the data from various sources into a data
warehouse.
• As data sources change, the Data Warehouse stays up to date, as long as you implement CDC (Change Data Capture) or SCD (Slowly Changing Dimensions).
• Well-designed and documented ETL system is almost essential to the success
of a Data Warehouse project.
• Allow verification of data transformation, aggregation and calculations rules.
• ETL process allows sample data comparison between the source and the target
system.


• ETL process can perform complex transformations and requires the extra area
to store the data.
• ETL helps to Migrate data into a Data Warehouse. Convert to the various
formats and types to adhere to one consistent system.
• ETL is a predefined process for accessing and manipulating source data into
the target database.
• ETL in data warehouse offers deep historical context for the business.
• It helps to improve productivity because it codifies and reuses without a need
for high technical skills.

Benefits of ETL
By collecting large quantities of data from multiple sources, ETL can help you turn
data into business intelligence. It can help you drive invaluable insights from it and
uncover new growth opportunities. It does so by creating a single point-of-view so
that you can make sense of the data easily. It also lets you put new data sets next to
the old ones to give you historical context. As it automates the entire process, ETL
saves you a great deal of time and helps you reduce costs. Instead of spending time
manually extracting data or using low-capacity analytics and reporting tools, you can
focus on your core competencies while your ETL solution does all the legwork.

With data governance comes data democracy as well. That means making your
corporate data accessible to all team members who need it to conduct the proper
analysis necessary for driving insights and building business intelligence.
Following are biggest benefits of ETL:

Ease of Use through Automated Processes


As already mentioned in the beginning, the biggest advantage of ETL tools is the
ease of use. After you choose the data sources, the tool automatically identifies the
types and formats of the data, sets the rules how the data has to be extracted and
processed and, finally, loads the data into the target storage. This makes coding in the traditional sense, where you have to write every single procedure and line of code yourself, unnecessary.


Facilitate performance
One of the most important benefits of ETL is its ability to ensure that business users
have fast access to large amounts of transformed and integrated data to inform their
decision making. Because ETL tools perform most processing during data
transformation and loading, most data are already ready for use by the time it’s
loaded into the data store. When BI applications query the database, they don’t have
to join records, standardize formatting and naming conventions, or even perform
many calculations to generate a report – which means that they can deliver results
significantly faster. An advanced ETL solution will even include performance-
enhancing technologies like cluster awareness, massively parallel processing, and
symmetric multi-processing that further boost data warehouse performance.

Provide a visual flow


Modern ETL applications feature a graphical user interface (GUI) that makes it easy
for users to design ETL processes with minimal programming expertise. Instead of
wrestling with SQL, Python or Bash scripts, stored procedures, and other
technologies, all your users have to do is specify rules and use a drag-and-drop
interface to map the flows of data in a process. Being able to see each step between
source systems and the data warehouse also gives them greater understanding of the
logic behind the data flow. These self-service tools also contain great collaboration
tools, making it possible for more people in the organization to participate in
developing and maintaining the data warehouse.

Leverage an existing development framework


ETL tools are specifically designed for complex data integration tasks like moving
data, populating a data warehouse, and integrating data from multiple source
systems. They also provide metadata about the data they handle and help manage
data governance tasks, which supports data quality processes and helps even novice
teams build and extend data warehouses.


Provide operational resilience


ETL solutions provide the necessary functionality and standards for catching
operational problems in the data warehouse before they create performance
bottlenecks. They automate and monitor data flows, alerting the IT team to errors
during transformation. By minimizing the human error inherent in hand-coded
solutions, the ETL process makes data processing more efficient and reduces the
likelihood of downstream data integrity issues.

Track data lineage and perform impact analysis


The best modern ETL solutions give users deep insight into the data catalog,
allowing them to drill down into reports to see how each result was generated, what
source systems the data came from, where the data was stored in the data warehouse,
how recently it was refreshed, and how it was extracted and transformed. ETL also
lets users determine how changes in the data schema might affect their reports, and
how to make the necessary adjustments.

Enable advanced data profiling and cleansing


Business intelligence, machine learning, and other data-driven initiatives are only as
good as the data that informs them. ETL tools support solid data management by
letting you apply and maintain complex universal formatting standards and semantic
consistency to all data sets as you move and integrate them. This helps all your teams
understand each other’s needs and find the most relevant data based on their business
context.

Good for complex data management situations


ETL tools are great to move large volumes of data and transfer them in batches. In
case of complicated rules and transformations, ETL tools simplify the task and assist
you with data analysis, string manipulation, data changes and integration of multiple
data sets.


Enhanced business intelligence


Data access is easier/better with ETL tools as it simplifies the process of extracting,
transforming and loading. Improved access to information directly impacts the
strategic and operational decisions that are based on data driven facts. ETL tools also
enable business leaders to retrieve information based on their specific needs and
make decisions accordingly.

Handle Big Data


Modern ETL tools can combine very large data sets of both structured and
unstructured data from disparate sources in a single mapping using Hadoop or similar
connectors. They can also prepare very large data volumes that don't need to be stored in data warehouses for use by data integration solutions.

High return on investment (ROI)


ETL tools help businesses save costs and thereby generate higher revenues. In fact, a study conducted by the International Data Corporation revealed that the implementation of ETL resulted in a median 5-year ROI of 112% with a mean payback of 1.6 years.

ETL Process in Data Warehouses


ETL is a 3-step process:


Step 1 - Extraction
In this step of ETL architecture, data is extracted from the source system into the
staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the Data Warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data Warehouse. The Data Warehouse needs to integrate systems that have different DBMSs, hardware, operating systems and communication protocols. Sources could include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, amongst others.
Hence one needs a logical data map before data is extracted and loaded physically.
This data map describes the relationship between sources and target data.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction- without update notification.
3. Partial Extraction- with update notification


Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases. Any slowdown or locking could affect the company's bottom line.
Some validations are done during Extraction:
• Reconcile records with the source data
• Make sure that no spam/unwanted data loaded
• Data type check
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place or not
• Ensuring no data is lost during extraction
• Handling any errors that occur
Build the logical data mapping from sources to target: here we analyze the data sources and their content, and collect the business rules for the ETL process. Identify mainframe and other source types, since knowing which data source types you will work with helps determine the correct solution for handling them effectively and optimally. Then move on to extracting and detecting changes, which need to be captured (see the sketch below).
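
As a sketch of the extraction methods above (the source DSN, table and watermark column are hypothetical), a full extraction simply pulls everything, while a partial extraction without update notification pulls only rows changed since a watermark that the ETL process tracks itself:

import pandas as pd
import pyodbc
from datetime import datetime

conn = pyodbc.connect("DSN=my_source_system")   # placeholder DSN

# Full extraction: pull everything (simple, but heavy on the source system)
orders_full = pd.read_sql("SELECT * FROM Orders", conn)

# Partial extraction: pull only rows changed since the last successful run.
# The watermark would normally be read from a control table, not hard-coded.
last_run = datetime(2024, 1, 1)
orders_delta = pd.read_sql(
    "SELECT * FROM Orders WHERE LastModified > ?",
    conn,
    params=[last_run],
)

print(len(orders_full), "rows (full);", len(orders_delta), "rows (delta)")
conn.close()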

Step 2 - Transformation
Data extracted from source server is raw and not usable in its original form.
Therefore, it needs to be cleansed, mapped and transformed. In fact, this is the key
step where ETL process adds value and changes data such that insightful BI reports
can be generated.
It is one of the important ETL concepts: you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.
In the transformation step, you can perform customized operations on the data. For instance, the user may want a sum-of-sales revenue figure that is not in the database. Or, if the first name and the last name in a table are in different columns, it is possible to concatenate them before loading.


Following are Data Integrity Problems:


1. Different spelling of the same person like Jon, John, etc.
2. There are multiple ways to denote company name like Google, Google Inc.
3. Use of different names like Cleaveland, Cleveland.
4. There may be a case that different account numbers are generated by various
applications for the same customer.
5. In some data, required fields remain blank
6. Invalid product collected at POS as manual entry can lead to mistakes.
Validations are done during this stage
• Filtering – Select only certain columns to load
• Using rules and lookup tables for Data standardization
• Character Set Conversion and encoding handling
• Conversion of Units of Measurements like Date Time Conversion, currency
conversions, numerical conversions, etc.
• Data threshold validation check. For example, age cannot be more than two
digits.
• Data flow validation from the staging area to the intermediate tables.
• Required fields should not be left blank.


• Cleaning (for example, mapping NULL to 0, or Gender Male to "M" and Female to "F", etc.)
• Split a column into multiples and merging multiple columns into a single
column.
• Transposing rows and columns,
• Use lookups to merge data
• Using any complex data validation (e.g., if the first two columns in a row are
empty then it automatically rejects the row from processing)
• Design the target objects that result from the transformation (such as fact tables) and where the business rules are applied
• Some tables need to be created for monitoring, such as error event, audit dimension and logging event tables
• Rules and metrics need to be defined in order to measure the data
• Sample data validations: row counts, column counts, nullity, numeric and date ranges, length restrictions, valid and invalid values, and data/value rules after applying the business rules (see the sketch after this list)
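
Below is a small sketch of a few of the rules listed above, applied with pandas to a hypothetical extracted DataFrame (column names and sample values are made up): mapping NULL to 0, mapping gender values to "M"/"F", splitting one column into two, and a simple threshold validation:

import pandas as pd

df = pd.DataFrame({
    "full_name": ["John Smith", "Jane Doe"],
    "gender": ["Male", "Female"],
    "sales_amount": [120.5, None],
    "age": [34, 27],
})

# Cleaning: map NULL to 0 and Gender values to 'M'/'F'
df["sales_amount"] = df["sales_amount"].fillna(0)
df["gender"] = df["gender"].map({"Male": "M", "Female": "F"})

# Split a single column into multiple columns
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Data threshold validation: age should not be more than two digits
bad_rows = df[df["age"] > 99]
assert bad_rows.empty, "threshold validation failed"

print(df)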

Step 3 - Loading
Loading data into the target data warehouse database is the last step of the ETL
process. In a typical Data warehouse, huge volume of data needs to be loaded in a
relatively short period (nights). Hence, load process should be optimized for
performance.
In case of load failure, recover mechanisms should be configured to restart from the
point of failure without data integrity loss. Data Warehouse admins need to monitor,
resume, cancel loads as per prevailing server performance.
Types of Loading:
• Initial Load — populating all the Data Warehouse tables
• Incremental Load — applying ongoing changes as when needed
periodically.


• Full Refresh — erasing the contents of one or more tables and reloading with fresh data (see the sketch below).
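
These loading modes can be sketched as follows (the connection string, file names and table names are placeholders): an initial load or full refresh replaces the table contents, while an incremental load only appends the new batch:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:password@my_dwh_dsn")   # placeholder

dim_customer = pd.read_parquet("dim_customer.parquet")    # output of the transform step (hypothetical file)
new_facts = pd.read_parquet("fact_sales_delta.parquet")   # today's incremental batch (hypothetical file)

# Initial load / full refresh: drop existing contents and reload
dim_customer.to_sql("DimCustomer", engine, schema="dwh",
                    if_exists="replace", index=False)

# Incremental load: append only the new rows
new_facts.to_sql("FactSales", engine, schema="dwh",
                 if_exists="append", index=False)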
Load verification

• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check combined values and calculated measures.
• Data checks in dimension table as well as history table.
• Check the BI reports on the loaded fact and dimension table.
• Data object time
• Fact table or Dimension table delivery (add, update)
• Loading type
• Replication for high availability
• Backup for disaster recovery
• Any late-arriving data? What data latency is acceptable?
ETL Issues
• The benefits that we have elaborated above are all related to traditional ETL.
However, traditional ETL tools cannot keep up with the high speed of changes
that is dominating the big data industry. Let’s take a look at the shortcomings
of these traditional ETL tools.
• Traditional ETL tools are highly time-consuming. Processing data with ETL
means to develop a process in multiple steps every time data needs to get


moved and transformed. Furthermore, traditional ETL tools are inflexible for
changes and cannot load readable live-data into the BI front end. We also have
to mention the fact that it is not only a costly process but also time consuming.
And we all know that time is money.
• There are some factors that influence the function of ETL tools and processes. These factors can be divided into the following categories:

Data Architecture Issues


Similarity of Source and Target Data Structures:
The more the source data structures differ from those of the target data, the more complex the traditional ETL processing and maintenance effort becomes. Due to the different structures, the load process will typically have to parse the records, transform values, validate values, substitute code values, etc.

Quality of Data:
Common data quality issues include missing values, code values not in the correct list of values, invalid dates and referential integrity issues. It makes no sense to load the data warehouse with poor-quality data. As an example, if the data warehouse will be used for database marketing, the addresses should be validated to avoid returned mail.

Complexity of the Source Data:


Depending on the sourcing team's background, some data sources are more complex
than others. Examples of complex sources may include multiple record types, bit
fields and packed decimal fields. This kind of data will translate into requirements
of the ETL tool or custom written solution since they are unlikely to exist in the
target data structures. Individuals on the sourcing team that are unfamiliar with these
types may need to do some research in these areas.

Dependencies in the Data:


Dependencies in the data will determine the order in which you load tables.
Dependencies also tend to reduce parallel loading operations, especially if data is
merged from different systems, which are on a different business cycle. Complex


dependencies will also tend to make to load processes more complex, encourage
bottlenecks and make support more difficult.

Meta Data:
Technical meta data describes not only the structure and format of the source and
target data sources, but also the mapping and transformation rules between them.
Meta data should be visible and usable to both programs and people.

Application Architecture Issues


Logging:
ETL processes should log information about the data sources they read, transform
and write. Key information includes the date processed, the number of rows read and written, errors encountered, and rules applied. This information is critical for quality assurance and serves as an audit trail. The logging process should be rigorous enough that you can trace data in the data warehouse back to the source. In addition, this information should be available while the processes are running to assist in estimating completion times.

Notification:
The ETL requirements should specify what makes an acceptable load. The ETL
process should notify the appropriate support people when a load fails or has errors.
Ideally, the notification process should plug into your existing error tracking system.

Cold start, warm start:


Unfortunately, systems do crash. You need to be able to take the appropriate action if the system crashes while your ETL process is running. Partial loads can be a real pain. Depending on the size of your data warehouse and the volume of data, you may want to start over, known as a cold start, or start from the last known successfully loaded records, known as a warm start. The logging process should provide you with information about the state of the ETL process.


People Issues
Management’s comfort level with technology:
How conversant is your management with data warehousing architecture? Will you have a data warehouse manager? Does management have a development background? They may suggest doing all the ETL processing with Visual Basic. Comfort level is a valid concern, and these concerns will constrain your options.

In-House expertise:
What is your business's tradition? SQL Server? ETL solutions will be drawn from current conceptions, skills and toolsets. Acquiring, transforming and loading the data warehouse is an ongoing process that will need to be maintained and extended as more subject areas are added to the data warehouse.

Support:
Once the ETL processes have been created, they need support; ideally, you should plug into existing support structures, including people with appropriate skills, notification mechanisms and error tracking systems. If you use a tool for ETL, the support staff may need to be trained. The ETL process should be documented, especially in the area of auditing information.

Technology Architecture Issues


Interoperability between platforms:
There must be a way for systems on one platform to talk to systems on another. FTP is a common way to transfer data from one system to another. FTP requires a physical network path from one system to another as well as the internet protocol on both systems. External data sources usually arrive on physical media (such as disk or tape) or via an internet server.

Volume and frequency of loads:


Since the data warehouse is loaded with batch programs, a high volume of data will
tend to reduce the batch window. The volume of data also affects the back out and


recovery work. Fast load programs reduce the time it takes to load data into the data
warehouse.

Disk space:
Not only does the data warehouse have requirements for a lot of disk space, but there
is also a lot of hidden disk space needed for staging areas and intermediate files. For
example, you may want to extract data from source systems into flat files and then
transform the data to other flat files for load.

Scheduling:
Loading the data warehouse could involve hundreds of source files, which originate on different systems, use different technologies and are produced at different times. A monthly load may be common for some portions of the warehouse and a quarterly load for others. Some loads may be on demand, such as lists of products or external data. Some extract programs may run on a different type of system than your scheduler.

Best practices ETL process


Following are the best practices for ETL Process steps:
Never try to clean all the data
Every organization would like to have all of its data clean, but most are not ready to pay for it or to wait for it. Cleaning it all would simply take too long, so it is better not to try to cleanse all the data.
Never clean Anything
Always plan to clean something because the biggest reason for building the Data
Warehouse is to offer cleaner and more reliable data.
Determine the cost of cleansing the data
Before cleansing all the dirty data, it is important for you to determine the cleansing
cost for every dirty data element.

To speed up query processing, have auxiliary views and indexes


To reduce storage costs, store summarized data on cheaper media such as disk or tape. Also, trade off the volume of data to be stored against its detailed usage; trading off the level of granularity of the data can decrease storage costs.

ETL Tools
There are many Data Warehousing tools are available in the market. Here, are some
most prominent one:
MarkLogic
MarkLogic is a data warehousing solution which makes data integration easier and
faster using an array of enterprise features. It can query different types of data like
documents, relationships, and metadata.

https://www.marklogic.com/product/getting-started/
Oracle
Oracle is the industry-leading database. It offers a wide range of choice of Data
Warehouse solutions for both on-premises and in the cloud. It helps to optimize
customer experiences by increasing operational efficiency.

https://www.oracle.com/index.html
Amazon RedShift:
Amazon Redshift is Data warehouse tool. It is a simple and cost-effective tool to
analyze all types of data using standard SQL and existing BI tools. It also allows
running complex queries against petabytes of structured data.

https://aws.amazon.com/redshift/?nc2=h_m1
Here is a complete list of useful Data warehouse Tools.

ETL vs. ELT


Extract, load, transform (ELT) is a variant of ETL where the extracted data is loaded
into the target system first. The architecture for the analytics pipeline shall also
consider where to cleanse and enrich data as well as how to conform dimensions.


Cloud-based data warehouses like Amazon Redshift, Google BigQuery, and


Snowflake Computing have been able to provide highly scalable computing power.
This lets businesses forgo preload transformations and replicate raw data into their data warehouses, where they can transform it as needed using SQL.
After having used ELT, data may be processed further and stored in a data mart.
There are pros and cons to each approach. Most data integration tools skew towards
ETL, while ELT is popular in database and data warehouse appliances. Similarly, it
is possible to perform TEL (Transform, Extract, Load) where data is first
transformed on a blockchain (as a way of recording changes to data, e.g., token
burning) before extracting and loading into another data store.
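
For instance, a minimal ELT-style transformation expressed directly in the warehouse
with SQL might look like the following. The raw_orders and clean_orders tables are
hypothetical and are used only to illustrate the pattern.

-- Raw data is loaded as-is into raw_orders first (the "EL" part), then
-- transformed inside the warehouse (the "T" part). Names are assumptions.
CREATE TABLE dbo.clean_orders
(
    order_id    INT,
    customer_id INT,
    order_date  DATE,
    amount      DECIMAL(12, 2)
);

INSERT INTO dbo.clean_orders (order_id, customer_id, order_date, amount)
SELECT CAST(order_id AS INT),
       CAST(customer_id AS INT),
       TRY_CONVERT(DATE, order_date),                    -- cleanse: tolerate bad dates
       COALESCE(TRY_CONVERT(DECIMAL(12, 2), amount), 0)  -- cleanse: default bad amounts
FROM dbo.raw_orders
WHERE order_id IS NOT NULL;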

Ref:
https://en.wikipedia.org/wiki/Extract,_transform,_load
https://www.guru99.com/etl-extract-load-process.html


Data Warehouse
What Is a Data Warehouse?
A data warehouse (often abbreviated as DW or DWH)(1) is a database designed
to enable business intelligence activities, that is, designed for analysis (read
operations) rather than for transaction processing (write, update and delete
operations), and (2) typically has data from different sources. They usually include
historical data derived from transaction data and in this case data warehouses (3)
separate analysis workload from transaction workload. This separation helps
improve the performance of business intelligence and transaction databases.
A data warehouse as a master database may contain multiple relational databases.
Within each database, schemas are defined for SQL query performance, tables can
be organized inside of schemas, and data is organized into tables. When data is
ingested, it is stored in various tables described by the schema. Query tools use the
schema to determine which data tables to access and analyze.
In addition to relational databases as mentioned above, a data warehouse
environment can include an extraction, transportation, transformation, and loading
(ETL) solution, and data services such as statistical analysis, reporting and data
mining. Thanks to ETL, data in a warehouse is clean, enriched and transformed.

Architecture with a data warehouse


In pipeline graphs, ETL may be represented as a separate component outside the data
warehouse to emphasize its importance. It is important to note that defining the
ETL process is a very large part of the design effort of a data warehouse. The speed
and reliability of ETL operations are the foundation of the data warehouse.

Data warehouse vs database


Newcomers often struggle to differentiate between these two critical concepts, so
below is a side-by-side comparison.

The key differences between a database and a data warehouse:

- Purpose: a database is designed to record transactional processing; a data warehouse is designed to analyze and report.
- Processing method: a database uses Online Transactional Processing (OLTP); a data warehouse uses Online Analytical Processing (OLAP).
- Data source: a database captures data as-is from a single source, such as a transactional system; a data warehouse collects and transforms data from many sources.
- Usage: a database helps perform the fundamental operations of your business; a data warehouse allows you to analyze your business.
- Tables and joins: tables and joins in a database are complex because they are normalized; in a data warehouse they are simpler because the data is denormalized.
- Orientation: a database is an application-oriented collection of data; a data warehouse is a subject-oriented collection of data.
- Storage limit: a database is generally limited to a single application; a data warehouse stores data from any number of applications.
- Availability: data in a database is available in real time; data in a warehouse is refreshed from source systems as and when needed.
- Modeling: ER modeling techniques are used for designing a database; dimensional data modeling techniques are used for designing a data warehouse.
- Technique: a database captures data; a data warehouse analyzes data.
- Data type: data stored in a database is up to date; a data warehouse stores current and historical data, which may not be up to date.
- Storage of data: a database uses a flat relational approach for data storage; a data warehouse uses a dimensional approach for the data structure, for example star and snowflake schemas.
- Optimization: a database is optimized for high-throughput write operations; a data warehouse is optimized for high-speed query performance.
- Query type: simple transaction queries are used against a database; complex queries are used against a data warehouse for analysis purposes.
- Data summary: detailed data is stored in a database; a data warehouse stores highly summarized data.

Characteristics of Data Warehouse

- Subject-Oriented: a data warehouse is organized around specific subjects or
themes, such as sales, distribution, or marketing, rather than around applications.
- Integrated: a data warehouse is built by integrating data from various
sources, such as a mainframe system and a relational database.
- Time-Variant: the data residing in the data warehouse is identified with a
specific interval of time and delivers information from a historical perspective.
- Non-Volatile: the data residing in the data warehouse is permanent. Data is
not erased or deleted when new data is inserted. Two types of data operations
are performed in the data warehouse:


• Data Loading
• Data Access

Data Warehousing Types


There are three main types of Data Warehouses, namely Enterprise Data Warehouse
(EDW), Operational Data Store (ODS) and Data Mart.
Enterprise Data Warehouse (EDW): a centralized warehouse. It provides decision
support services across the enterprise. It offers a unified approach for organizing and
representing data. It also provides the ability to classify data according to subject
and to give access according to those divisions.
Operational Data Store (ODS): is used when an organization’s reporting needs are
not satisfied by a data warehouse or an OLTP system. In ODS, a data warehouse can
be refreshed in real-time, making it best for routine activities like storing employees’
records.

Data Mart: As part of a data warehouse, Data Mart is particularly designed for
a specific business line like finance, accounts, sales, purchases, or inventory. The
warehouse allows you to collect data directly from the sources.
The above gives a brief overview of DWH types. If you are interested in this topic,
here is the link for further reading.

Data Warehouse Architecture

The Data Warehouse Architecture can be defined as a structural representation of
the functional arrangement on which a Data Warehouse is built, including all of its
major components.

There are typically four layers in every DWH Architecture:

- The Source Layer
- The Staging Layer
- The Storage Layer
- The Presentation Layer


Three Tiers in DWH Architecture


- Bottom Tier:

This tier consists of the storage, which is usually a relational database system. The
cleansed and transformed data is loaded into this layer of the architecture. This tier
also acts as the staging area where raw data from different sources is pulled together
for further processing.
- Middle Tier:

This tier is generally an Online Analytical Processing (OLAP) server. OLAP enables
faster query and better performance for Data Warehouse operation. The Middle Tier
can either be a Relational OLAP or a Multi-dimensional OLAP implementation. This
provides an abstracted view of the stored, transformed data.
- Top Tier:

This tier is the front-end the end user interacts with. It is generally tools and API that
connect and get data out from the Data Warehouse. It can run interactive user
queries, report on transformed data, analyze and mine data.


Approaches to DWH Architecture Design


Top-down approach (defined by Inmon):


- External sources can be structured, semi-structured, or unstructured.
- The staging area is where the data is ingested, validated, transformed, and loaded.
- The Data Warehouse is where the data and metadata are stored.
- Data Marts segregate organization data by function (e.g., sales, finance).
- Data Mining is where the data is analyzed and patterns are identified in the data.
Merits & demerits of this approach:
- Consistent dimensional view across Data Marts
- Resilient towards changes in the business
- Cost and time are higher when designing this flexible approach

Bottom-up approach (defined by Kimball):

- External sources can be structured, semi-structured, or unstructured.
- The staging area is where the data is ingested, validated, transformed, and loaded.
- In this approach Data Marts take precedence over the warehouse and are created first.
- Data Marts are then integrated to form the Data Warehouse, and then data is mined.

Merits & demerits of this approach:

- Provides smaller, quicker user deliverables
- More open to Data Warehouse extensions compared to the first approach
- Cost and time are considered better, as users get to see results much faster
- The dimensional view of Data Marts can lack consistency, resulting in a weaker Data Warehouse

To sum up, the differences between the two approaches are:

- Top-Down: provides a definite and consistent view of information, as information from the data warehouse is used to create the Data Marts. Bottom-Up: reports can be generated easily, as data marts are created first, and it is relatively easy to interact with data marts.
- Top-Down: a strong model, hence preferred by big companies. Bottom-Up: not as strong, but the data warehouse can be extended and the number of data marts can grow.
- Top-Down: time, cost, and maintenance are high. Bottom-Up: time, cost, and maintenance are low.


DWH Architecture Properties

Architecting a Data Warehouse should involve the following considerations


- Separation: analytical and transactional processing should be kept apart
as much as possible.
- Scalability: hardware and software architectures should be simple to upgrade
as the data volume that has to be managed and processed, and the number of
users' requirements, progressively increase.
- Extensibility: the technologies adopted by an enterprise evolve over time as
the business grows. An extensible architecture should be able to incorporate
them over the lifetime of the Data Warehouse without a complete redesign.
- Security: the Data Warehouse provides the single version of the truth for the
enterprise. All the enterprise data needs to be carefully secured to ensure the
right users have access to the right information. This is a critical
consideration for the Data Warehouse architecture.
- Administrability: in general, a complicated system has a very short life
span. This holds true for the administration of any system; simpler administration
ensures the longevity of the Data Warehouse in an enterprise. In short, DWH
management should not be complicated.


DWH Components

A typical data warehouse has four main components: a central database, ETL
(extract, transform, load) tools, metadata, and access tools. All of these
components are engineered for speed so that you can get results quickly and analyze
data on the fly.

Central Database
The Data store is one of the critical components of the Data Warehouse environment.
This can be implemented utilizing RDBMS. The implementation would be
constrained by the fact that traditional RDBMS is optimized for transactional
processing. For instance ad-hoc query, multi-table joins, aggregation are resource
intensive and slow down performance. Alternate approaches could be considered as
follows
- Deploy relational databases in parallel to allow for scalability. Parallel
relational databased allow shared memory or shared nothing model on various
multiprocessor configurations or massively parallel processors.
- Deploy index structures that can be used to bypass relational table scans,
improve speed and overall performance.


- Deploy multidimensional databases (MDDBs) to overcome limitations imposed
by the relational database. For instance, consider deployment of OLAP as
either ROLAP or MOLAP.

ETL Tools (Acquisition, Clean-up and Transformation of Data)


ETL Tools are deployed for acquisition, clean-up, conversions, summarizations and
any other changes needed to transform source data into meaningful information in
the Data Warehouse. ETL is an acronym for Extract, Transform and Load.
The process generally includes:
- Anonymization of data as per regulatory stipulations
- Eliminating unwanted data from the operational data that does not add value
- Search-and-replace of common data and definitions for data arriving from
different sources
- Calculating summaries and derived values
- Populating defaults in case of missing data
- Removing duplicates in data arriving from multiple data sources
ETL tools utilize scheduled background jobs, scripts, programs, and so on to regularly
update data in the Data Warehouse. The tools also add value in maintaining
metadata while dealing with the heterogeneity of external databases and data.
Metadata
Metadata is the definition of the data being stored in the Data Warehouse. It assists
in building, maintaining, and managing the Data Warehouse. Metadata is critical as
it specifies the source, usage, values, and features of Data Warehouse data. It even
defines how the data can be changed and processed. It forms an integral part of the
Data Warehouse data store.
There is business metadata, which adds context to your data, and technical metadata,
which describes how to access data – including where it resides and how it is
structured.

Query Tools


Query tools allow business users to interact with the data in the Data Warehouse
interactively. Ad-hoc queries provide dynamic input for strategic decisions.
The categories of query tools are as follows:
- Managed queries help business users run ad-hoc queries on the data store
interactively in a very user-friendly manner.
- Reporting tools allow organizations to generate regular operational reports. They
also support high-volume batch jobs with printing and summarization.
- Application development tools are used by power users to run customized
reports and analyses to satisfy the organization's ad-hoc requirements dynamically.
- Data mining tools enable the process of discovering meaningful new
correlations, patterns, and trends in all the organization's data stored in the
Data Warehouse. These tools can enable automation and can feed into AI/ML.
- OLAP tools enable multi-dimensional views of data, allowing business users
to answer complex business questions and to elaborate on the parameters
involved.

Data Marts
Data Marts are access layers specific to a business function. They are tuned for
performance and address the specific needs of business functions like Sales,
Marketing, and Finance. Modular Data Marts feed into the overall Data Warehouse
created for the organization. Considerations for Data Mart design vary from
organization to organization. Data Marts could exist in the same data store as the
Data Warehouse or in their own independent physical data stores.

Data Warehouse Bus


The Data Warehouse Bus determines the flow of data in the Data Warehouse. This flow
can be categorized as inflow, upflow, downflow, outflow, and metaflow. While
designing the Data Warehouse Bus, shared dimensions and facts across Data Marts
need to be considered.

Data Warehouse Architecture Best Practices


The following best practices can be considered while designing a Data Warehouse
Architecture
1. Optimize Data Warehouse models for information retrieval utilizing
dimensional, de-normalized or hybrid approach that suit the needs at hand
2. Choose the appropriate design approach as top down or bottom up suitable to
the organizations business needs
3. Data needs to be processed quickly and accurately. Data needs to be
consolidated into a single version of the truth
4. Design data acquisition and cleansing process such that the Metadata is shared
between components of the Data Warehouse
5. Consider implementing ODS model when information retrieval need is near
bottom of data abstraction pyramid or when there are multiple operational
sources
6. Integrate the data model further toward consolidation, with consideration for
a 3NF data model; this makes it easier to adopt ETL and data cleansing
tools
What is a data lake?

A data lake is a centralized repository that stores all structured and unstructured
data at any scale. You can store your data without having to first structure the data,
and can run different types of analytics—from dashboards and visualizations to big
data processing, real-time analytics, and machine learning and full-text search.


The structure of the data or schema is not defined when data is captured. This
means you can store all of your data without careful design or the need to know what
questions you might answer in the future. Data Lakes allow you to import any
amount of data that can come in real-time in its original format, saving time of
defining schema.
A data lake can be an upstream database for a data warehouse, as is seen below.

Data warehouse vs Data lake

If you’d like to read further about data lakes, you can refer to here and/or this one.
They are not exhaustive, but they will give you the big picture and hopefully some
ideas to dig deeper.

Characteristics: Data Warehouse vs. Data Lake

- Data: a data warehouse holds relational data from transactional systems, operational databases, and line-of-business applications; a data lake holds non-relational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications.
- Schema: in a data warehouse the schema is designed prior to the data warehousing implementation (schema-on-write); in a data lake it is written at the time of analysis (schema-on-read).
- Data quality: a data warehouse stores transformed data; a data lake stores transformed or raw data.
- Users: a data warehouse serves business analysts (with SQL queries); a data lake serves data scientists, data developers, and business analysts (using transformed data) with various tools such as Apache Hadoop, Presto, and Apache Spark.
- Analytics: a data warehouse supports batch reporting, BI, and visualizations; a data lake supports machine learning, predictive analytics, data discovery, and profiling.

What is a data mart?

A data mart serves the same role as a data warehouse, but it is intentionally limited
in scope. It may serve one particular department or business unit like marketing
and sales. Data marts may be a subset of a data warehouse that is highly curated for
a specific end user. The following graph illustrates this possible case.


A possible architecture with a data warehouse and data marts
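
As a small illustration, a sales data mart can be exposed as its own schema of curated
views over the central warehouse tables. The schema, view, and table names below
are assumptions used only for the sketch.

CREATE SCHEMA sales_mart;
GO

-- A curated, business-friendly view for the sales department.
CREATE VIEW sales_mart.v_daily_revenue
AS
SELECT d.calendar_date,
       p.product_name,
       SUM(f.sales_amount) AS revenue
FROM dbo.fact_sales  AS f
JOIN dbo.dim_date    AS d ON d.date_key = f.date_key
JOIN dbo.dim_product AS p ON p.product_key = f.product_key
GROUP BY d.calendar_date, p.product_name;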


Data warehouse vs data mart

DWH – Important Concepts


SCHEMA
Schema is a logical description of the entire database. It includes the name and
description of records of all record types including all associated data-items and
aggregates. Much like a database, a data warehouse also requires a schema. A
database uses a relational model, while a data warehouse uses Star, Snowflake, or
Fact Constellation schemas. This section discusses the schemas used in a data
warehouse.

Star schema
A star schema is a database organizational structure, optimized for use in a data
warehouse or business intelligence system, that uses a single large fact table and one
or more smaller dimension tables. It is called a star schema because the fact table
sits at the center of the logical diagram and the small dimension tables branch off to
form the points of the star.

A fact table sits at the center of a star schema database, and each star schema
database only has a single fact table. The fact table contains the specific
quantifiable data to be analyzed, such as sales figures.
The fact table stores two types of information: numeric values and dimension
attribute values. Using a sales database as an example:
• Numeric value cells are unique to each row or data point and do not correlate
or relate to data stored in other rows. These might be facts about a transaction,
such as an order ID, total amount, net profit, order quantity or exact time.
• The dimension attribute values do not directly store data; they store the
foreign key value for a row in a related dimension table. Many rows in the
fact table will reference this type of information. So, for example, it might store
the sales employee ID, a date value, a product ID, or a branch office ID.
Dimension tables store supporting information to the fact table. Each star schema
database has at least one dimension table. Each dimension table will relate to a
column in the fact table with a dimension value, and will store additional
information about that value.


• The employee dimension table may use the employee ID as a key value and
can contain information such as the employee's name, gender, address or
phone number.
• A product dimension table may store information such as the product name,
manufacture cost, color or first date on market.
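
To make the sales example above concrete, here is a minimal T-SQL sketch of a
simple star schema; the table and column names are assumptions chosen to mirror
the employee and product dimensions just described.

-- Dimension tables hold descriptive attributes.
CREATE TABLE dbo.dim_employee
(
    employee_key  INT PRIMARY KEY,
    employee_name NVARCHAR(100),
    gender        CHAR(1),
    phone_number  VARCHAR(20)
);

CREATE TABLE dbo.dim_product
(
    product_key      INT PRIMARY KEY,
    product_name     NVARCHAR(100),
    manufacture_cost DECIMAL(10, 2),
    color            NVARCHAR(30)
);

-- The fact table stores numeric measures plus foreign keys to the dimensions.
CREATE TABLE dbo.fact_sales
(
    order_id       INT,
    date_key       INT,
    employee_key   INT REFERENCES dbo.dim_employee (employee_key),
    product_key    INT REFERENCES dbo.dim_product (product_key),
    order_quantity INT,
    sales_amount   DECIMAL(12, 2)
);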
Benefits of the Star Schema

• It is extremely simple to understand and build.


• No need for complex joins when querying data.
• Accessing data is faster (because the engine doesn’t have to join various tables
to generate results).
• Simpler to derive business insights.
• Works well with certain tools for analytics, in particular, with OLAP systems
that can create OLAP cubes from data stored using star schema.
Disadvantages of the Star Schema

• Denormalized data can cause integrity issues. This means some data can turn
out to be inconsistent at times.
• Maintenance may appear simple at the beginning, but the larger data
warehouse you need to maintain, the harder it becomes (due to data
redundancy).
• It requires a lot more disk space than snowflake schema to store the same
amount of data.
• Many-to-many relationships are not supported.
• Limited possibilities for complex queries development.

Snowflake schema
The snowflake schema is an extension of a star schema. The main difference is that
in this architecture each dimension (reference) table can itself be linked to one or
more further reference tables. The aim is to normalize the data.
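
A hedged sketch of the same product dimension in snowflake form, where the
category attributes are normalized into their own table (hypothetical names):

CREATE TABLE dbo.dim_product_category
(
    category_key  INT PRIMARY KEY,
    category_name NVARCHAR(50)
);

-- The product dimension now references the category sub-dimension
-- instead of repeating the category attributes on every row.
CREATE TABLE dbo.dim_product_sf
(
    product_key  INT PRIMARY KEY,
    product_name NVARCHAR(100),
    category_key INT REFERENCES dbo.dim_product_category (category_key)
);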

Star schema vs. snowflake schema


Benefits of the Snowflake Schema

• Uses less disk space because data is normalized and there is minimal data
redundancy.
• Offers protection from data integrity issues.
• Maintenance is simple due to a smaller risk of data integrity violations and
low level of data redundancy.


• It is possible to use complex queries that don’t work with a star schema. This
means more space for powerful analytics.
• Supports many-to-many relationships.
Disadvantages of the Snowflake Schema

• Harder to design compared to a star schema.


• Maintenance can be more complex due to a large number of different tables
in the data warehouse.
• Queries can be very complex, including many levels of joins between many
tables.
• Queries can be slower in some cases because many joins should be done to
produce final output.
• More specific skills are needed for working with data stored using snowflake
schema.

Fact Constellation Schema


A fact constellation (or galaxy schema) has multiple fact tables.
• The following diagram shows one more fact table, namely shipping, in
addition to the sales table.

Star schema vs. fact constellation schema
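
As a rough sketch, the shipping fact mentioned above could simply be a second fact
table that reuses the same conformed dimensions as the sales fact; the names below
are assumptions.

-- A second fact table sharing dim_date and dim_product with dbo.fact_sales
-- turns the star into a fact constellation (galaxy) schema.
CREATE TABLE dbo.fact_shipping
(
    shipment_id      INT,
    date_key         INT,   -- same dim_date as the sales fact
    product_key      INT,   -- same dim_product as the sales fact
    shipped_quantity INT,
    freight_cost     DECIMAL(12, 2)
);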


PARTITIONING
Partitioning is done to enhance performance and facilitate easy management of data.
Partitioning also helps in balancing the various requirements of the system. It
optimizes the hardware performance and simplifies the management of data
warehouse by partitioning each fact table into multiple separate partitions. In this
chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?


For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This
huge size of fact table is very hard to manage as a single entity. Therefore it needs
partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with
all the data. Partitioning allows us to load only as much data as is required on a
regular basis. It reduces the time to load and also enhances the performance of the
system.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced.
Query performance is enhanced because now the query scans only those partitions
that are relevant. It does not have to scan the whole data.

Partitioning Strategies
1. Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period.
Here each time period represents a significant retention period within the business.
For example, if the user queries for month to date data then it is appropriate to
partition the data into monthly segments. We can reuse the partitioned tables by
removing the data in them.
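
In SQL Server, time-based range partitioning can be sketched with a partition
function and scheme such as the following; the boundary dates, filegroup mapping,
and table are assumptions for illustration only.

-- Monthly partitions: each boundary value starts a new partition (RANGE RIGHT).
CREATE PARTITION FUNCTION pf_monthly (DATE)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

-- Map every partition to the PRIMARY filegroup for simplicity.
CREATE PARTITION SCHEME ps_monthly
AS PARTITION pf_monthly ALL TO ([PRIMARY]);

-- The fact table is then created on the partition scheme, keyed by sales_date.
CREATE TABLE dbo.fact_sales_by_month
(
    sales_date   DATE NOT NULL,
    product_key  INT,
    sales_amount DECIMAL(12, 2)
) ON ps_monthly (sales_date);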
2. Partition by Time into Different-sized Segments


This kind of partition is done where the aged data is accessed infrequently. It is
implemented as a set of small partitions for relatively current data, larger partition
for inactive data.
3. Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time, such
as product group, region, supplier, or any other dimension. Consider an example:
suppose a marketing function has been structured into distinct regional departments,
for example on a state-by-state basis. If each region wants to query information
captured within its region, it would prove more effective to partition the fact table
into regional partitions. This will speed up queries because they do not need to scan
information that is not relevant.
4. Partition by Size of Table

When there is no clear basis for partitioning the fact table on any dimension, we
should partition the fact table on the basis of size. We can set a predetermined size
as a critical point; when the table exceeds the predetermined size, a new table
partition is created.
5. Partition by normalization

Normalization is the standard relational method of database organization. In this
method, repeated rows of attributes are collapsed into a single row in a separate
table, which reduces space. Take a look at the following tables, which show how
normalization is performed.

• Table before Normalization

Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        sunny       Bangalore  S
35          4    5.33   3-Sep-13    16        sunny       Bangalore  S
40          5    2.50   3-Sep-13    64        san         Mumbai     W
45          7    5.66   3-Sep-13    16        sunny       Bangalore  S

• Table after Normalization

Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W

Product_id  Quantity  Value  sales_date  Store_id
30          5         3.67   3-Aug-13    16
35          4         5.33   3-Sep-13    16
40          5         2.50   3-Sep-13    64
45          7         5.66   3-Sep-13    16
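
The normalization above can be expressed in SQL roughly as follows; the
denormalized source table name is an assumption made for the sketch.

-- Store attributes are moved into their own table and referenced by key.
CREATE TABLE dbo.store
(
    store_id   INT PRIMARY KEY,
    store_name NVARCHAR(50),
    location   NVARCHAR(50),
    region     CHAR(1)
);

INSERT INTO dbo.store (store_id, store_name, location, region)
SELECT DISTINCT store_id, store_name, location, region
FROM dbo.sales_denormalized;           -- hypothetical "before" table

CREATE TABLE dbo.sales
(
    product_id INT,
    quantity   INT,
    value      DECIMAL(10, 2),
    sales_date DATE,
    store_id   INT REFERENCES dbo.store (store_id)
);

INSERT INTO dbo.sales (product_id, quantity, value, sales_date, store_id)
SELECT product_id, qty, value, sales_date, store_id
FROM dbo.sales_denormalized;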

METADATA


Metadata is simply defined as data about data: the description of the structure,
content, keys, indexes, etc., of the data.
Metadata plays a more important role than other data warehouse data and is
important for many reasons. For example, metadata is used as a directory to help
the decision support system analyst locate the contents of the data warehouse, and
as a guide to the data mapping when data is transformed from the operational
environment to the data warehouse environment.
Metadata also serves as a guide to the algorithms used for summarization between
the current detailed data and the lightly summarized data, and between the lightly
summarized data and the highly summarized data.
Metadata should be stored and managed persistently (i.e., on disk).
Here are several examples.

• Metadata for a document may contain the document created date, last
modified date, it’s size, author, description, etc.
• Metadata for ETL includes the job name, source tables/files, target
tables/files, and frequency.
• Metadata associated with data management defines the data store in the Data
Warehouse. Every object in the database needs to be described including the
data in each table, index, and view, and any associated constraints.
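
Technical metadata of this kind is usually queryable. For example, SQL Server
exposes table and column definitions through the standard INFORMATION_SCHEMA
views; the table name filtered on below is a hypothetical warehouse table.

SELECT TABLE_SCHEMA,
       TABLE_NAME,
       COLUMN_NAME,
       DATA_TYPE,
       IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'fact_sales'        -- hypothetical warehouse table
ORDER BY ORDINAL_POSITION;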

Categories of Metadata


- Business Metadata: It has the data ownership information, business


definition and changing policies.
- Technical Metadata: It includes database system names, table and column
names and sizes, data types and allowed values. It also includes structural
information such as primary and foreign key attributes and indices.
- Operational Metadata: It includes currency of data and data lineage.
Currency of data means whether the data is active, archived or purged.
Lineage of data means the history of data migrated and transformation applied
on it.

The Role of Metadata in the Data Warehouse

● Consistency of definitions: One department refers to “revenues,” another to


“sales.” Are they talking about the same activity? One unit talks about
“customers,” another about “users” or “clients.” Are these different
classifications or different terms for the same classification? Effective
metadata management can ensure that the same data “language” applies
throughout the organization. Business users can easily understand the full
meaning of data. With understanding comes eased communication, and an
overall improved process. The business metadata primarily supports business
end users who do not have a technical background, and cannot use the
technical metadata to determine what information is stored inside the Data
Warehouse. Technical metadata primarily supports technical staff who must
implement and deploy the Data Warehouse.
● Clarity of relationships: Meta data management illuminates the associations
among all components of the warehouse: business rules, tables, columns,
transformations, and user views of the data. By clarifying relationships
throughout the Data Warehouse environment, managed Meta data enables
warehouse managers to see the bigger picture—to fully understand the
meanings of the data assets, and to accurately predict and manage the
impact of changes to the data warehouse.
● Availability of information: Meta data exists “behind the scenes,” revealing
the origin of data, who defined it, when it was modified, and much more.
● Resource discovery: Metadata serves the same functions in resource
discovery as good cataloging does by:
o Allowing resources to be found by relevant criteria;
o Identifying resources;
o Bringing similar resources together;
o Giving location information.


A data warehouse system at buildtime and usetime

Metadata has been identified as a key success factor in data warehouse projects. It
captures all kinds of information necessary to extract, transform and load data from
source systems into the data warehouse, and afterwards to use and interpret the data
warehouse contents.
The generation and management of metadata serves two purposes: (1) to minimize
the efforts for development and administration of a data warehouse and (2) to
improve the extraction of information from it.

OLAP (Online Analytical Processing)


OLAP is a category of software that allows users to analyze information from
multiple database systems at the same time. It enables end-users to perform ad hoc
analysis of data in multiple dimensions, thereby providing the insight and
understanding they need for better decision making.
Analysts frequently need to group, aggregate and join data. These operations in
relational databases are resource intensive. With OLAP data can be pre-calculated
and pre-aggregated, making analysis faster.
Business is a multidimensional activity and businesses are run on decisions based
on multiple dimensions. Businesses track their activities by considering many
variables. When these variables are tracked on a spreadsheet, they are set on axes (x
and y) where each axis represents a logical grouping of variables in a category.
For example, sales in units or dollars may be tracked over one year’s time, by month,
where the sales measures might logically be displayed on the y axis and the months
might occupy the x axis (i.e., sales measures are rows and months are columns).
To analyze and report on the health of a business and plan future activity, many
variable groups or parameters must be tracked on a continuous basis—which is
beyond the scope of any number of linked spreadsheets. These variable groups or
parameters are called Dimensions.
Unlike relational databases, OLAP tools do not store individual transaction records
in two-dimensional, row-by-column format, like a worksheet, but instead use
multidimensional database structures—known as Cubes —to store arrays of
consolidated information. The OLAP cube is a data structure optimized for very
quick data analysis. The data and formulas are stored in an optimized
multidimensional database, while views of the data are created on demand.
Analysts can take any view, or Slice, of a Cube to produce a worksheet-like view of
points of interest. Rather than simply working with two dimensions (a standard
spreadsheet) or three dimensions (for example, a workbook with tabs of the same
report, varied by one variable), companies have many dimensions to track. For example,
a business that distributes goods from more than a single facility will have at least
the following dimensions to consider: Accounts, Locations, Periods, Salespeople,
and Products. These dimensions comprise a base for the company's planning,
analysis and reporting activities. Together they represent the “whole” business
picture, providing the foundation for all business planning, analysis and reporting
activities.


Types of OLAP
OLAP can be categorized into specific types based on how they function, as described
below.

1. Relational OLAP (ROLAP)

ROLAP stands for Relational Online Analytical Processing. ROLAP stores data
in columns and rows (also known as relational tables) and retrieves the information
on demand through user submitted queries. A ROLAP database can be accessed
through complex SQL queries to calculate information. ROLAP can handle large
data volumes, but the larger the data, the slower the processing times.
Because queries are made on demand, ROLAP does not require the storage and pre-
computation of information. However, the disadvantage of ROLAP implementations
is the potential performance constraints and scalability limitations that result from
large and inefficient join operations between large tables. Examples of popular
ROLAP products include Metacube by Stanford Technology Group, Red Brick
Warehouse by Red Brick Systems, and AXSYS Suite by Information Advantage.
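
As a small illustration of ROLAP-style analysis, multidimensional subtotals can be
computed directly in SQL with GROUP BY CUBE; the fact and dimension names
below are assumptions.

-- Totals by region, by product, by both, and a grand total in one query.
SELECT s.region,
       p.product_name,
       SUM(f.sales_amount) AS total_sales
FROM dbo.fact_sales  AS f
JOIN dbo.dim_store   AS s ON s.store_key = f.store_key
JOIN dbo.dim_product AS p ON p.product_key = f.product_key
GROUP BY CUBE (s.region, p.product_name);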

2. Multidimensional OLAP (MOLAP)

MOLAP stands for Multidimensional Online Analytical Processing. MOLAP
uses a multidimensional cube that accesses stored data through various
combinations. Data is pre-computed, pre-summarized, and stored (a difference from
ROLAP, where queries are served on demand).
A multicube approach has proved successful in MOLAP products. In this approach,
a series of dense, small, precalculated cubes make up a hypercube. Tools that
incorporate MOLAP include Oracle Essbase, IBM Cognos, and Apache Kylin.
Its simple interface makes MOLAP easy to use, even for inexperienced users. Its
speedy data retrieval makes it the best for “slicing and dicing” operations. One major
disadvantage of MOLAP is that it is less scalable than ROLAP, as it can handle a
limited amount of data.

3. Hybrid OLAP (HOLAP)

HOLAP stands for Hybrid Online Analytical Processing. As the name suggests,
the HOLAP storage mode connects attributes of both MOLAP and ROLAP. Since
HOLAP involves storing part of your data in a ROLAP store and another part in a
MOLAP store, developers get the benefits of both.
With this use of the two OLAPs, the data is stored in both multidimensional
databases and relational databases. The decision to access one of the databases
depends on which is most appropriate for the requested processing application or
type. This setup allows much more flexibility for handling data. For theoretical
processing, the data is stored in a multidimensional database. For heavy processing,
the data is stored in a relational database.
Microsoft Analysis Services and SAP AG BI Accelerator are products that run off
HOLAP.
How does it work?
A Data warehouse would extract information from multiple data sources and formats
like text files, excel sheet, multimedia files, etc.
The extracted data is cleaned and transformed. Data is loaded into an OLAP server
(or OLAP cube) where information is pre-calculated in advance for further analysis.


DATA INTEGRATION: ETL vs ELT

ETL (extract, transform, load) has been a standard approach to data integration for
decades. But the rise of cloud computing and the need for self-service data
integration has enabled the development of new approaches such as ELT (extract,
load, transform).

What is ETL?

ETL is a data integration process that helps organizations extract data from various
sources and bring it into a single database. ETL involves three steps:
• Extraction: Data is extracted from source systems—SaaS, online, on-
premises, and others—using database queries or change data capture
processes. Following the extraction, the data is moved into a staging area.
• Transformation: Data is then cleaned, processed, and turned into a common
format so it can be consumed by a targeted data warehouse, database, or data
lake.


• Loading: Formatted data is loaded into the target system. This process can
involve writing to a delimited file, creating schemas in a database, or a new
object type in an application.
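
A condensed sketch of those three steps expressed in SQL terms is shown below;
every object name (staging schema, source database, target fact table) is an
assumption used only to illustrate the flow.

-- Extract: land the source rows unchanged in a staging table.
INSERT INTO staging.orders_raw (order_id, customer_email, amount, order_date)
SELECT order_id, customer_email, amount, order_date
FROM source_db.dbo.orders;                  -- e.g., via a linked server or file load

-- Transform: cleanse, standardize, and mask sensitive fields.
SELECT order_id,
       HASHBYTES('SHA2_256', customer_email) AS customer_hash,   -- mask PII
       TRY_CONVERT(DECIMAL(12, 2), amount)   AS amount,
       TRY_CONVERT(DATE, order_date)         AS order_date
INTO staging.orders_clean
FROM staging.orders_raw;

-- Load: move the conformed rows into the target warehouse table.
INSERT INTO dw.fact_orders (order_id, customer_hash, amount, order_date)
SELECT order_id, customer_hash, amount, order_date
FROM staging.orders_clean;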
Advantages of ETL Processes

ETL integration offers several advantages, including:


• Preserves resources: ETL can reduce the volume of data that is stored in the
warehouse, helping companies preserve storage, bandwidth, and computation
resources in scenarios where they are sensitive to costs on the storage side.
Although with commoditized cloud computing engines, this is less of a
concern.
• Improves compliance: ETL can mask and remove sensitive data, such as IP
or email addresses, before sending it to the data warehouse. Masking,
removing, and encrypting specific information helps companies comply with
data privacy and protection regulations such as GDPR, HIPAA, and CCPA.
• Well-developed tools: ETL has existed for decades, and there is a range of
robust platforms that businesses can deploy to extract, transform, and load
data. This makes setting up and maintaining ETL pipelines much easier.
Drawbacks of ETL processes

Companies that use ETL also have to deal with several drawbacks:

• Legacy ETL is slow: traditional ETL tools require disk-based staging
and transformations.
• Frequent maintenance: ETL data pipelines handle both extraction and
transformation. But they have to undergo refactors if analysts require different
data types or if the source systems start to produce data with deviating formats
and schemas.
• Higher Upfront Cost: Defining business logic and transformations can
increase the scope of a data integration project.

What is ELT?


ELT is a data integration process that transfers data from a source system into a target
system without business logic-driven transformations on the data. ELT involves
three stages:

• Extraction: Raw data is extracted from various sources, such as applications,


SaaS, or databases.
• Loading: Data is delivered directly to the target system – typically with
schema and data type migration factored into the process.
• Transformation: The target platform can then transform data for reporting
purposes. Some companies rely on tools like dbt for transformations on the
target.
ELT reorders the steps involved in the integration process with transformation
occurring at the end instead of in the middle of the process.

Advantages of ELT processes

ELT offers a number of advantages:


• Fast extraction and loading: Data is delivered into the target system
immediately with minimal processing in-flight.
• Lower upfront development costs: ELT tools are typically adept at simply
plugging source data into the target system with minimal manual work from
the user given that user-defined transformations are not required.
• Low-maintenance: Cloud-based ELT technologies typically automate things
like schema changes so there’s minimal maintenance.
• More flexibility: Analysts no longer have to determine what insights and data
types they need in advance but can perform transformations on the data as
needed in the warehouse with tools like dbt
Drawbacks of ELT processes

ELT is not without challenges, including:


• Overgeneralization: some modern ELT tools make generalized data
management decisions for their users, such as rescanning all tables in the
event of a new column or blocking all new transactions in the case of a long-
running open transaction. This may work for some users, but could result in
unacceptable downtime for others.
• Security gaps: Storing all the data and making it accessible to various users
and applications come with security risks. Companies must take steps to
ensure their target systems are secure by properly masking and encrypting
data.
• Compliance risk: Companies must ensure that their handling of raw data
won’t run against privacy regulations and compliance rules such as HIPAA,
PCI, and GDPR.
• Increased Latency: In cases where transformations with business logic ARE
required in ELT, you must leverage batch jobs in the data warehouse. If
latency is a concern, ELT may slow down your operations.

ETL vs ELT Comparison

Batch processing implies moving data from point A to point B. Processes for doing
such tasks are known as ETL processes: Extract, Transform, and Load.
These processes are based on extracting data from sources, transforming it, and
loading it into a data lake or data warehouse.
In recent years, however, another approach has been introduced: the ELT approach.


ETL is the legacy way, where transformations of your data happen on the way to
the lake.
ELT is the modern approach, where the transformation step is saved until after the
data is in the lake. The transformations really happen when moving from the Data
Lake to the Data Warehouse.
ETL was developed when there were no data lakes; the staging area for the data
that was being transformed acted as a virtual data lake. Now that storage and
compute is relatively cheap, we can have an actual data lake and a virtual data
warehouse built on top of it.
The ELT approach is often preferred over ETL since it fosters practices that make
data warehousing processes easier, e.g., highly reproducible processes, a simpler
data pipeline architecture, and so on.

DATA LOADING
Batch vs Streaming vs Lambda
Batch processing is based on loading the data in batches. This means, your data is
loaded once per day, hour, and so on.

Stream processing is based on loading the data as it arrives. This is usually done
using a Pub/Sub system. So, in this way, you can load your data to the data
warehouse nearly in real-time.


These two types of processing are not mutually exclusive. They may coexist in a
data pipeline; see the Lambda and Kappa architectures for more info. This section
focuses mainly on the batch approach.
Lambda Architecture combines batch and streaming pipeline into one architecture.
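
For the batch approach, a single load step is often just a bulk file load scheduled once
per day or hour. A minimal T-SQL sketch, in which the file path, format, and staging
table are all assumptions, could look like this:

-- Load one daily extract file into a staging table in a single batch.
BULK INSERT staging.daily_sales
FROM 'C:\loads\sales_20240101.csv'
WITH (
    FIRSTROW = 2,            -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);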

DATA PIPELINE: Basic Concepts & Roadmap


What is Data Pipeline?


Building a data warehouse pipeline can be complex sometimes. If you are starting in
this world, you will soon realize there is no right or wrong way to do it. It always
depends on your needs.
Yet, there are a couple of basic processes you should put in place when building a
data pipeline to improve its operability and performance.
This section shares a roadmap that can serve as a guide when building a data
warehouse pipeline.
The roadmap is intended to help teams implement DataOps when building a data
warehouse pipeline through a set of processes.
The roadmap covers five processes that should be implemented to improve your
data pipeline operability and performance: Orchestration, Monitoring, Version
Control, CI/CD, and Configuration Management.
It assumes familiarity with basic data warehousing terminology, e.g., data lakes,
data warehouses, batching, streaming, ETL, and ELT.

Data Pipeline Roadmap


There are five processes we recommend you should put in place to improve your
data pipeline operability and performance.

These processes are Orchestration, Monitoring, Version Control, CI/CD,


and Configuration Management.


Such processes are defined based on the DataOps philosophy, which “is a collection
of technical practices, workflows, cultural norms, and architectural patterns ”
enabling to reduce technical debt in the data pipeline — among other things.

Orchestration

We all have written CRON jobs for orchestrating data processes at some point in our
lives.
When data is in the right place and it arrives at the expected time, everything runs
smoothly. But, there is a problem. Things always go wrong at some point. When it
happens everything is chaos.
Adopting better practices for handling data orchestration is necessary — e.g., retry
policies, data orchestration process generalization, process automation, task
dependency management, and so on.
As your pipeline grows, so does the complexity of your processes. CRON jobs fall
short for orchestrating a whole data pipeline. This is where Workflow management
systems (WMS) step in. They are systems oriented to support robust operations
allowing for orchestrating your data pipeline.


Some of the WMS used in the data industry are Apache Airflow, Apache Luigi,
and Azkaban.

Monitoring

Have you ever been in the position where all the dashboards are down and business
users come looking for you to fix them? Or maybe your DW is down and you don't
know it? That's why you should always monitor your data pipeline!
Monitoring should be a proactive process, not just a reactive one. If your dashboard or
DW is down, you should know it before business users come looking for you.
To do so, you should put monitoring systems in place. They run continuously to give
you real-time insights about the health of your data pipeline.
Some tools used for monitoring are Grafana, Datadog, and Prometheus.

CI/CD

Does updating changes in your data pipeline involve a lot of manual and error-prone
processes to deploy them to production? If so, CI/CD is a solution for you.
CI/CD stands for Continuous Integration and Continuous Deployment. The goal of
CI is "to establish a consistent and automated way to build, package, and test
applications". On the other hand, CD "picks up where continuous integration ends.
CD automates the delivery of applications to selected infrastructure environments."
(more info here).
CI/CD allows you to push changes to your data pipeline in an automated way. Also,
it will reduce manual and error-prone work.
Some tools used for CI/CD are Jenkins, GitlabCI, Codeship, and Travis.

Configuration Management

So, imagine your data pipeline infrastructure breaks down for some reason. For
example, you need to redeploy the whole orchestration pipeline infrastructure.
How do you do it?
That’s where configuration management comes in. Configuration management
“deals with the state of any given infrastructure or software system at
any given time.” It fosters practices like Infrastructure as Code. Additionally, it deals
with the whole configuration of the infrastructure — more info here.
Some tools used for Configuration Management are Ansible, Puppet,
and Terraform.

Version control

Finally, one of the most known processes in the software industry: version control.
We all have had problems when version control practices are not in place.
Version control manages changes in artifacts. It is an essential process for tracking
changes in the code, iterative development, and team collaboration.
Some tools used for Version Control are Github, GitLab, Docker Hub, and DVC.

Questions need to be answered before building a pipeline


A data pipeline is a series of data processing steps. If the data is not currently loaded
into the data platform, then it is ingested at the beginning of the pipeline. Then there
are a series of steps in which each step delivers an output that is the input to the next
step. To design and implement a pipeline suitably and to address the 5Vs (Volume,
Velocity, Variety, Veracity, Value) of Big Data, some considerations need to be
addressed first:
1. Which type of data processing needs to be handled (streaming or batching)?
2. What rate of data do you expect (daily or hourly data? 90% or 95% of captured
data)?
3. What data structures and data source types do you need to handle?
4. Is the data being generated in the cloud or on-premises?
5. Where will you implement the data pipeline?
6. What specific technologies does your team need to have?
7. How will you choose the technical stack?
8. Etc.


Summary
1. Data Warehouse is an information system that contains historical and
commutative data from single or multiple sources.
2. A Data Warehouse is subject oriented as it offers information regarding
subject instead of organization's ongoing operations.
3. In Data Warehouse, integration means the establishment of a common unit of
measure for all similar data from the different Databases.
4. Data Warehouse is non-volatile in the sense that the previous data is not
erased when new data is entered in it.
5. A Data Warehouse is Time-variant as the data in a Data Warehouse has high
shelf life.
6. There are 5 components of Data Warehouse Architecture namely Datastore,
ETL Tools, Metadata, Query Tools, Data Marts
7. The Query tools can be categorized as Query and Reporting tools, Application
Development tools, Data Mining tools, and OLAP tools
8. The data sourcing, transformation, and migration tools are used for
performing all the conversions and summarizations.
9. In the Data Warehouse Architecture, meta-data plays an important role as it
specifies the source, usage, values, and features of data warehouse data.

Data Warehousing Application Examples


Case Study: eWallet - Dimensional Modelling

Case Study: Retail Store - DWH

Case Study: Netflix - Optimizing DWH Storage

Case Study: Paypal - DWH Migration to Cloud

Reference Link:
https://www.educba.com/data-warehouse-architecture/?source=leftnav


https://hevodata.com/learn/data-warehouse-design-a-comprehensive-guide/

https://insights.sap.com/what-is-a-data-warehouse/

https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
https://techblogmu.blogspot.com/2017/11/what-is-meant-by-metadata-in-context-
of.html
https://www.striim.com/etl-vs-elt/


SQL Server & SQL Server Integration Service
SQL Server Tutorials
This article collects SQL Server tutorial documentation through the links below,
plus some SQL Server Integration Services (SSIS) material that provides a glimpse
of how to build a Data Warehouse in a hybrid environment of SQL Server and SSIS.
SQL Server tutorial:

https://www.sqlservertutorial.net/

https://www.guru99.com/ssis-tutorial.html
SQL Server Integration Service (SSIS) tutorial & examples:

https://www.sentryone.com/ssis-basics-guide

https://www.mssqltips.com/sqlservertutorial/9065/sql-
server-integration-services-ssis-data-flow-task-example/

https://docs.microsoft.com/en-us/sql/integration-
services/integration-services-tutorials
Here are some of the most basic things you need to know when working with SSIS:

SQL Server Integration Service Tutorials


SQL Server Integration Services (SSIS) is a component of Microsoft SQL Server
used to perform a broad range of data migration tasks. SSIS is a platform for data
integration and workflow applications. It is a
flexible data warehousing tool that is used for data loading, extraction, and
transformation tasks such as merging, cleaning, and aggregating data.
It can extract data from different sources like Excel files, Oracle, SQL Server
databases, DB2 databases, etc. SSIS makes it easy to move data from one database
to another database. SSIS also includes graphic tools and windows wizards to
perform workflow functions like sending emails, FTP operations, Data sources, and
destination.

Why we use SSIS


The following are the key reasons to use SSIS
• SSIS tool helps to merge data from various data stores
• It helps to clean and standardize the data
• It can load the data from one datastore to another datastore very easily
• SSIS contains a GUI that allows the users to transfer the data quickly instead
of writing an extensive program
• Populates Data warehouse and data marts
• It is cheaper than other ETL tools
• Automates administrative functions and Data loading
• Build Business Intelligence into a Data Transformation process
• It provides robust error and event handling.

Features of SSIS
The following are the salient features of SSIS
• Relevant Data integration functions
• Data mining query transformation
• High-speed data connectivity elements such as connectivity to Oracle or SAP
• Effective implementation speed
• Provides rich Studio environment
• Tight integration with other Microsoft SQL software


• Term extraction and Term Lookup transformation

SSIS Architecture

The following are the components of SSIS architecture:


• Control flow
• Data flow
• Event handling
• Package explorer
• Parameters

Control flow


Control flow acts as the brain of the SSIS package. It includes containers and tasks
that are managed by precedence constraints. It helps to arrange the order of execution
for all its components.

Precedence constraints
The precedence constraints are the package component that directs tasks to execute
in a predefined order. It defines the workflow of the entire SSIS package. It helps
you to connect tasks to control the flow. Depending on the configuration, the
precedence constraints can be represented as dotted or solid lines with blue, red, or
green color.

Task
A task is an individual unit of work. It is the same as a method used in a programming
language. We don’t use any programming codes, but we implement drag and drop
techniques to design surfaces and to configure them.

Container
Containers are objects that help SSIS to provide structure to one or more tasks. It
provides visual consistency and also allows us to declare event handlers and
variables that could be in the scope of specific containers.
There are three types of containers. They are as follows:
• Sequence container: Sequence container is a subset of an SSIS package. It
acts as a single control point for the tasks that are defined inside a container.
It is used for grouping the tasks. We can split the control flow into multiple
logical units using sequence containers.
• For loop container: It is used for executing all inside tasks for a fixed number
of executions. It provides the same functionality as the sequence container
except that it also allows us to run the tasks multiple times
• For each loop container: It is more complicated than For Loop container
since it has many use cases and requires more complex configurations. It can
accomplish more popular actions such as looping across files within a
directory or looping over an executed SQL task result set.

Data flow


Data flow tasks encapsulate the data flow engine that moves data between sources
and destinations and allows the user to transform, clean, and modify the data as it
moves. The data flow is often termed the heart of SSIS.

Packages
One of the core components of SSIS is the package. It is a collection of tasks that execute in order; precedence constraints manage the order in which the tasks run. Packages can be saved as files on disk or stored on SQL Server in the MSDB database or the package catalog.
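As a quick illustration, packages saved to SQL Server can be listed with a simple query; this is only a sketch, assuming SQL Server 2008 or later, where MSDB-stored packages appear in the msdb.dbo.sysssispackages catalog table:

-- List SSIS packages stored in the MSDB database (catalog table name assumed as above)
SELECT name, description, createdate
FROM msdb.dbo.sysssispackages
ORDER BY createdate DESC;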

Parameters
Parameters allow the user to assign values to properties within packages at the time of execution. They behave much like variables, with a few key exceptions, and they let you change how a package executes without editing and redeploying it.

How SSIS works


SSIS is a platform for data integration and workflow. Both data transformation and workflow creation can be done using SSIS packages. It consists of three major components, namely:

• Operational data
• ETL process
• Data warehouse

Operational data
Operational data is stored in a database designed to integrate data from multiple sources for additional operations on that data. It is the place where most of the data used in current operations is housed before it is transferred to the data warehouse for long-term storage.

ETL Process
Extract, Transform, and Load (ETL) is a method of extracting data from different sources, transforming it to meet requirements, and loading it into a target
data warehouse. The data can be in any format, such as an XML file, a flat file, or a database file. ETL also ensures that the data stored in the data warehouse is accurate, high quality, relevant, and useful for its users.
Extract: The process of extracting data from various data sources, depending on different validation points. The data can be in any format, such as XML, a flat file, or a database file.
Transformation: The entire data set is analyzed and various functions are applied to it so that it can be loaded into the target database in a cleaned, standard format.
Load: The process of loading the cleaned and extracted data into a target database using minimal resources. It also validates the number of rows that have been processed; an index helps to track the number of rows loaded into the data warehouse and to identify the data format.
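To make the three steps concrete, the sketch below expresses a minimal ETL pass in plain T-SQL; all names such as stg.Sales and dw.FactSales are hypothetical and only illustrate the pattern (in SSIS the same steps are normally implemented with a Data Flow task):

-- Extract: pull today's rows from a hypothetical staging table
SELECT OrderID, OrderDate, CustomerID, Amount
INTO #Extracted
FROM stg.Sales
WHERE LoadDate >= CAST(GETDATE() AS DATE);

-- Transform: clean and standardize the extracted rows
UPDATE #Extracted SET Amount = 0 WHERE Amount IS NULL;

-- Load: insert the cleaned rows into the hypothetical target fact table
INSERT INTO dw.FactSales (OrderID, OrderDate, CustomerID, Amount)
SELECT OrderID, OrderDate, CustomerID, Amount
FROM #Extracted;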

Data warehouse
The data warehouse is a single, complete, and consistent store of data that is formed by combining data from multiple sources. It captures data from diverse sources for useful analysis and access. Data warehousing is the practice of accumulating, assembling, and managing data from various sources in order to answer business questions and support decision making.
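Dimension and fact tables are the usual building blocks of such a store; the sketch below shows a minimal, purely illustrative star schema (all table and column names are hypothetical):

-- One dimension table and one fact table that references it (hypothetical names)
CREATE TABLE dbo.DimProduct (
    ProductKey  INT IDENTITY(1,1) PRIMARY KEY,
    ProductName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.FactSales (
    SalesKey    INT IDENTITY(1,1) PRIMARY KEY,
    ProductKey  INT NOT NULL REFERENCES dbo.DimProduct(ProductKey),
    OrderDate   DATE NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
);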

Types of SSIS tasks


In the SSIS tool, there are different types of tasks that are used to perform various kinds of activities.

Task – Description

Execute SQL task – Executes a SQL statement against a relational database.

Data Flow task – Reads data from different sources, transforms the data while it is in memory, and writes it out to various destinations.

Execute Package task – Executes other packages within the same project.

Execute Process task – Runs an external application or batch file; you can specify command-line parameters.

Analysis Services Processing task – Processes objects of a tabular model or an SSAS cube.

FTP task – Allows the user to perform basic FTP functionality.

File System task – Performs manipulations in the file system, such as deleting files, creating directories, renaming files, and moving the source file.

Script task – A blank task in which you can write .NET code that performs any task you want to accomplish.

Send Mail task – Sends an e-mail to notify users that your package has finished or that an error occurred.

Bulk Insert task – Loads data into a table by using the BULK INSERT command (see the sketch after this table).

XML task – Helps to split, merge, or reformat any XML file.

WMI Event Watcher task – Allows the SSIS package to wait for and respond to certain WMI events.

Web Service task – Executes a method on a web service.
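For example, the Bulk Insert task issues the equivalent of a T-SQL BULK INSERT statement; the sketch below is only illustrative, and the file path, options, and target table are hypothetical:

-- Load a comma-delimited flat file into a staging table (hypothetical path and table)
BULK INSERT dbo.StagingSales
FROM 'D:\data\sales.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);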

Other essential ETL tools


• SAS Data management
• Oracle Warehouse Builder


• IBM Infosphere Information Server


• Elixir Repertoire for Data ETL
• SAP Data Services
• Informatica PowerCenter
• Sagent Data Flow
Best practices of SSIS
• SSIS is an in-memory pipeline. So it is essential to ensure that all
transformations occur in memory
• Plan for capacity by understanding resource utilization
• Schedule and distribute it correctly
• Optimize the SQL Lookup transformation, data source, and destination
• Try to minimize logged operations

Advantages of using SSIS


SSIS tool offers many benefits. Some of them are as follows:
• Ease and speed of implementation
• Broad documentation support
• Tight integration with SQL server
• Provides real-time, message-based capabilities
• Support for the distribution model
• It allows you to use the SQL Server Destination rather than OLE DB to load
the data faster
• Standardized data integration
• It helps you remove the network as a bottleneck for the insertion of data by SSIS into SQL Server.

Drawbacks of Using SSIS


A few disadvantages of using the SSIS tool are as follows:


• Unclear vision and strategy


• SSIS lacks support for alternative data integration styles
• Sometimes it creates issues in non-windows environments
• Problematic integration with other products

SSIS Components
SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server
database software that can be used to perform a broad range of data migration tasks.
SSIS is a platform for data integration and workflow applications.

SSIS Development Studio


The Business Intelligence Development Studio (BIDS) is the desktop workstation
component you use to design, develop, and test SSIS packages. BIDS provides you
with a totally graphical-oriented development environment, allowing you to copy,
maintain, and create new packages by using a menu and toolbox drag-and-drop
method for development. BIDS is a comprehensive development platform that
supports collaboration with source code management and version control; provides
debugging tools such as breakpoints, variable watches, and data viewers; and
includes the SQL Server Import and Export Wizard to jump-start package
development.


Within BIDS, the SQL Server Import and Export Wizard allows you to generate
SSIS packages to copy data from one location to another quickly and easily. The
Import and Export Wizard guides you through a series of configuration editor pages
that allow you to select the source data, select your target destination, and map
source to target data elements. You might find this wizard helpful for creating a
starting point for a package.

SSIS Runtime Services


SSIS Runtime Services manages storage of packages in .dtsx (SSIS package system
file format) files or in the MSDB database and manages and monitors their
execution. SSIS Runtime Services saves your package layout, applies
configurations, executes packages, manages data source and destination connection
strings and security, and supports logging for tracking and debugging. SSIS Runtime
Services executable includes the package and all its containers, tasks, custom tasks,
and event handlers.
After you design, develop, and complete your testing of SSIS packages from your
desktop BIDS, you will want to deploy and implement the packages for scheduled
or on-demand processing to the SSIS Runtime Services server. In some companies,
the deployment of finished packages is oftentimes performed by a production
administrator or other authorized group. At other times, packages can be deployed
by the developer. Either way, you can use the graphical interface or a command-line
utility to configure and complete the package deployment.

SSIS Package Deployment


• SQL Server Management Studio (SSMS) is a desktop workstation component for the deployment and management of packages in production environments. SSMS connects directly to SSIS Runtime Services, provides access to the Execute Package utility, is used to import and export packages to and from the available storage modes (the MSDB database or the SSIS Package Store), and allows you to view and monitor running packages.
• There are also two command-line utilities that you can use to manage, deploy,
and execute SSIS packages. An alternative to SSMS, Dtutil.exe, provides package management functionality at the command prompt to copy, move, or delete packages or to confirm that a package exists.
• Finally, a more advanced feature is the Integration Services Object Model that
includes application programming interfaces (APIs) for customizing run-time
and data flow operations and automating package maintenance and execution
by loading, modifying, or executing new or existing packages
programmatically from within your business applications.

SSIS ETL
The ‘T’ in ETL stands for transformation. The goal of transformation is to convert
raw input data to an OLAP-friendly data model. This is also known as dimensional
modeling.
Microsoft SQL Server Integration Services (SSIS) is a platform for building high-
performance data integration solutions, including extraction, transformation, and
load (ETL) packages for data warehousing. SSIS includes graphical tools and
wizards for building and debugging packages; tasks for performing workflow
functions such as FTP operations, executing SQL statements, and sending e-mail
messages; data sources and destinations for extracting and loading data;
transformations for cleaning, aggregating, merging, and copying data; a
management service, the Integration Services service for administering package
execution and storage; and application programming interfaces (APIs) for
programming the Integration Services object model.
When designing the ETL process it’s good to think about the three fundamental
things it needs to do:

• Extract data from external data sources such as line-of-business systems, CRM systems, relational databases, web services, and SharePoint lists.
• Transform the data. This includes cleansing the data and converting it to an
OLAP-friendly data model. The OLAP-friendly data model traditionally
consists of dimension and fact tables in a star or snowflake schema and closely
maps SSAS’s dimensional model (SSAS stands for SQL Server Analysis Services).


• Load the data so that it can be quickly accessed by querying tools such as
reports. In practice, this implies processing SSAS cubes.
An ETL process is a program that periodically runs on a server and orchestrates the
refresh of the data in the BI system. SQL Server Integration Services (SSIS) is a
development tool and runtime that is optimized for building ETL processes.
Learning SSIS involves a steep learning curve and if you have a software
development background as I do, you might first be inclined to build your ETL
program from scratch using a general-purpose programming language such as C#.
However, once you master SSIS you’ll be able to write very efficient ETL processes
much more quickly. This is because SSIS lets you design ETL processes in a
graphical way (but if needed you can write parts using VB or C#). The SSIS
components are highly optimized for ETL type tasks and the SSIS run-time executes
independent tasks in parallel where possible. If you’re a programmer, you’ll find it
amazingly difficult to write your own ETL process using a general-purpose language
and make it run more efficiently than one developed in SSIS.

What is SQL Server Integration Services?


SSIS is a platform for data integration and workflow applications. It features a fast
and flexible data warehousing tool used for data extraction, transformation, and
loading (ETL). The tool may also be used to automate the maintenance of SQL
Server databases and updates to multidimensional cube data.

What is ETL SQL Server?


Microsoft SQL Server Integration Services (SSIS) is a platform for building high-performance data integration solutions, including extraction, transformation, and load (ETL) packages for data warehousing.

What is the use of the SSIS package?


The primary use of SSIS is data warehousing, as the product features a fast and flexible tool for data extraction, transformation, and loading (ETL). The tool may also be used to automate the maintenance of SQL Server databases, update multidimensional cube data, and perform other functions.

What is ETL design?


The process of extracting data from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for extraction, transformation,
and loading. Note that ETL refers to a broad process, and not three well-defined
steps.

SSIS data flow to transform the data


In SSIS you can design your ETL process using control flows and data flows. Data
flows in SSIS are a type of control flow that allows you to extract data from external data sources, flow that data through several transformations such as sorting,
filtering, merging it with other data and converting data types, and finally store the
result at a destination, usually a table in the data warehouse. This is very powerful
and data flows seem to lend themselves very well for integrating the extract and
transformation tasks within them. This is why I call this the “obvious” approach and
many tutorials about SSIS follow this approach. The obvious approach seems
especially attractive because it is very efficient and there’s no need to store
intermediate results.
The figure below illustrates this process:

The top-level control flow in the Integration Services project may look like this:


The “Extract and Transform” box is a sequence container that holds a data flow for
each of the tables that will be refreshed in the data warehouse. In this example, there
is one fact table and there are three dimension tables. SSIS will execute the data
flows in parallel, and when all of them have completed, the cube will be processed.
The transformation of data takes place in the data flows.
The transformations needed in each of the data flows would typically look something
like this:


Event Handlers in SSIS


SQL Server Integration Services provides the ability to handle any type of event associated with the execution of its tasks and containers, through the ability to configure corresponding event handlers.
The following list contains the more significant and commonly monitored types of events (you might be able to spot some of them in the Output window during package execution in Debug mode).

On Error
Generated as the result of an error condition. It falls into the category of the most frequently implemented types of event handlers. Its purpose can be to provide additional information that simplifies troubleshooting, or to notify about a problem and the need for remediation.
On Warning
Similar to the On error event, it is raised in response to a problem (although not as
significant in terms of severity).

On Information
Produces reporting information relating to the outcome of either validation or
execution of a task or container (other than warning or error)
On Task Failed
Signals the failure of a task; typically follows an OnError event.
On Pre Execute
Indicates that an executable component is about to be launched.
On Pre validate
Marks the beginning of the component validation stage, following the on pre execute
event. The main purpose of validation is detection of potential problems that might
prevent execution from completing successfully.

On Post Validate
Occurs as soon as the validation process of the component is completed (following the OnPreValidate event).
On post-Execute
Takes place after an executable component finishes running
On Variable Value Changed
Allows you to detect changes to variables. The scope of the variable determines which executable will raise the event. In addition, in order for the event to take place, the variable's RaiseChangedEvent property must be set to True (the default is False).
On progress
Raised at the point where measurable progress is made by the executable (for
example, when running execute SQL Task).


This can be evaluated by monitoring the values of the system variables associated with the OnProgress event handler, such as ProgressComplete, ProgressCountLow, and ProgressCountHigh.
Steps to configure Event handler
Scenario: Clean up or truncate the destination table before executing or loading data
into destination.
• Open Business Intelligence Development studio
• Create a new package and Rename it as Event Handler.dtsx
• In control flow drag and drop the data flow task and Rename it as DFT Event
Handler
• In data flow drag and drop OLEDB source
• Double click on OLEDB source to edit it
• Provide connection manager if exists
• Select [human resource]. [Employee] table from the dropdown list
• Select columns from left panel
• Click OK
• Drag and drop OLEDB destination and make a connection from OLEDB
source to destination
• Double click on OLEDB destination
• Provide destination connection manager
• Click new to create a destination table and Rename OLEDB destination as
employee details
• Click ok Twice
• Go to event handler tab
• Drag and Drop Execute SQL Task
o [on pre-validate Event handler for package executable]
• Double click on Execute SQL task provide connection if exists


• Provide the following SQL command to clean up the data from the destination table:
Truncate Table Employee details
• Click ok
Execute package
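For reference, the statement placed in the Execute SQL Task is an ordinary T-SQL TRUNCATE; a minimal sketch, assuming the destination table was created as [Employee details] in the dbo schema (bracket delimiters are only needed if the table name contains a space):

-- Clean up the destination table before the data flow loads it (schema and table name assumed)
TRUNCATE TABLE dbo.[Employee details];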

Different ways to execute SSIS package


Execute SSIS package using BIDS
During the development phase of the project, developers can test the SSIS package
execution by running the package from BIDS.
Open BIDS, configure a new package
Press Alt+Ctrl+L for solution explorer
Select the package
Right-click and select execute package option

Execute SSIS package using DTEXEC.EXE :


Command Line utility
The DTEXEC command-line utility can execute an SSIS package that is stored in the file system or in SQL Server.
For example: DTEXEC.EXE /F "C:\packages\check points.dtsx"
Execute SSIS package using DTEXECUI.EXE:
The Execute Package Utility graphical interface can execute an SSIS package that is stored in the file system or in SQL Server.
At the command line, type DTEXECUI.EXE and press Enter, which opens the Execute Package Utility editor.
In the Execute Package Utility editor, select the General tab:
Package Source – File System
• Package – Click Browse
• Select any package from the list and click open


Click Execute to execute the linked/embedded package.
(or)
The Execute Package Utility is also used when you execute the SSIS package from the Integration Services node in SQL Server Management Studio.
• Open SQL Server Management Studio
• Connect to integration services
• Expand stored packages
Expand file system
Select File System and right-click, select Import Package
In the Import Package dialog:
Package location – File System

• Package – click browse


• Select any package from the list and click Open
• Package name – place the cursor in the Package name field
• Click OK
• Select imported package
• Right-click and select run package
In the Execute Package Utility editor, click Execute.

Execute the SSIS package using SQL server agent (using jobs):
• Open SQL server management studio
• Connect to database engine
Select SQL server agent
Note: Ensure that the SQL Server Agent service is started.
• Select Jobs
Right-click and select a new job

• Provide Job Name as Load for each loop container


• Select Steps page


• Click new
Step Name - Load for each loop container
• Type - SQL Server Integration Service Package
In General Tab,
Package Source - File System
• Package – click Browse
• Select any package from the list of package
• Click open
• Click ok
• Select Schedule package
• Click new
• Provide the job Name as Load For each loop –Container
• Schedule Type – Recurring
• Set the Frequency (When to start the execution of the specified package)
Click ok twice
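The same job can also be created in T-SQL through the msdb stored procedures; the sketch below is a hedged example only, and the package path and job names are hypothetical:

-- Create a job with one SSIS step and attach it to the local server (names and path assumed)
USE msdb;
EXEC dbo.sp_add_job @job_name = N'Load For each loop container';
EXEC dbo.sp_add_jobstep
    @job_name  = N'Load For each loop container',
    @step_name = N'Load for each loop container',
    @subsystem = N'SSIS',
    @command   = N'/FILE "D:\ssis packages\packages\ForEachLoop.dtsx"';
EXEC dbo.sp_add_jobserver @job_name = N'Load For each loop container', @server_name = N'(local)';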
Lookup Transformation:
It is used to compare data from the source against a reference dataset using a simple equality join.
Note: When implementing a data warehouse, the reference data is usually a dimension table.
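Conceptually, the Lookup splits the source rows the way the two queries below would; this is only an illustrative T-SQL sketch using the AdventureWorks-style table names referenced in the steps that follow:

-- Rows whose ProductCategoryID has a match in the reference table (Lookup match output)
SELECT c.*
FROM Production.ProductCategory AS c
WHERE EXISTS (SELECT 1
              FROM Production.ProductSubcategory AS s
              WHERE s.ProductCategoryID = c.ProductCategoryID);

-- Rows without a match (the redirected rows / error output used in the steps below)
SELECT c.*
FROM Production.ProductCategory AS c
WHERE NOT EXISTS (SELECT 1
                  FROM Production.ProductSubcategory AS s
                  WHERE s.ProductCategoryID = c.ProductCategoryID);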
Steps to implement Look Up Transformation:
• Open Business Intelligence development studio
• Create a new package and rename it as lookup .dtsx
• In control flow, drag and drop a data flow task
• In data flow, drag and drop OLEDB Source
• Double click on OLEDB Source to edit it


• Provide Connection Manager if it exists and select Production.ProductCategory
• Select columns and click ok
• Drag and drop Lookup Transformation and make a connection from the source to the Lookup
• Provide connection manager and select Production.ProductSubcategory
• Click Configure Error Output and set Redirect Row under the Error column
• Click ok
• Click ok
• Drag and drop OLEDB destination
• Make a connection from Lookup to the destination
• Double click on OLEDB destination to configure it
• Provide destination connection manager and click New to create a new destination table, and rename the table as Matched data
• Click ok
• Select Mappings
• Drag and drop another OLEDB destination to capture the unmatched records from source to reference dataset
• Make a connection from Lookup to this OLEDB destination using the error output (red data flow path)
• Click ok in the Lookup error output editor
• Double click on the destination to configure it
• Provide destination connection manager and click New to create the destination table and rename it as Unmatched records
• Select Mappings and click ok
Execute package
Note:

• Matched records will be updated to the destination.


• All the unmatched records will be inserted to the destination.
• In the Lookup transformation, edit or remove the mappings between all the columns except ProductCategoryID.
Scenario: Dynamic flat file destination
• In control flow, drag and drop a Data Flow Task
Define the following variables with respect to the package.

Name             Data type    Value
Uv source path   string       D:\ssis packages\packages\
Uv File Name     string       vendor

In Data flow drag and drop OLEDB source


• Double click on OLEDB source to edit it
• Provide a source connection manager and select Production.ProductSubcategory from the drop-down list
• Select columns
• Drag and drop flat file destination and make a connection from OLEDB
Source to flat-file
• Double click on the flat file to edit it
• Click new to configure new flat file connection Manager
• Select Delimited flat-file format
• Click ok


Connection Manager Name - Dynamic vendor flat file


Description – Dynamic vendor flat file
File Name – Type the following path and file with an extension that is not available
at the destination.
• D:\ssis Package\Package\vendor.txt
• Select columns page
• Click ok
• Select Mappings page
• Click ok, In Connection Manager, Select dynamic vendor flat file
• Press F4 for properties
• Expression – click browse
Select ConnectionString and click the browse button to open the expression builder; provide the following expression, which creates a new flat file dynamically (the file-name variable is the Uv File Name variable defined above):
@[User::Uv source path] + "\\" + @[User::Uv File Name] + "-" + (DT_WSTR, 10)(DT_DBDATE)@[System::StartTime] + ".txt"
• Click ok twice
• Close property window
Execute package
Script task:
Script task is used to design custom interfaces.

Scenario:

Create a text file on every corresponding month dynamically using script task.
Define the following variables of type string:
• Uv Source path: D:\ssis packages\packages
• Uv file name: Product category Details on
• Uv Full path:


• Drag and Drop script task


• Double click on the script task to edit or configure it.
• Select Script from the left panel and set,
• Read-Only Variables – Uv Source path, Uv File Name
• Read-write variable – Uv Full path
• Click Design Script, which opens the Microsoft Visual Studio for Applications IDE.
Provide the following VB.NET script to create a dynamic text file with the specified name.
Dim sSourcePath As String
Dim sFileName As String
Dim sFullPath As String

' e.g. D:\ssis packages\packages\
sSourcePath = Dts.Variables("Uv Source path").Value.ToString()

' e.g. Product category Details on
sFileName = Dts.Variables("Uv File Name").Value.ToString()

' e.g. D:\ssis packages\packages\Product category Details on_<month><year>.txt
sFullPath = sSourcePath + sFileName + "_" + Month(Now()).ToString() + Year(Now()).ToString() + ".txt"

Dts.Variables("Uv Full path").Value = sFullPath

• Select the Debug menu and select Build to build the above script
• Select file menu and Select close and return
• Click ok
• Drag and drop data flow task and make a connection from script task to data
flow task
• In Data Flow Drag and Drop OLEDB Source
• Double click on OLEDB Source to configure it
• Provide connection manager if it exists
• Select the ProductCategory table
• Select columns from left pane and click ok


• Drag and drop flat file destination and make a connection from OLEDB
source to flat file destination
• Double click on flat file destination
• Click new to create new connection Manager
• Select Delimited flat file format
• Click ok
• Provide flat file connection manager name and description if any
• Type the following path
• D:\ssis package\package\product category Details on.txt
• Select columns from left panel
• Click ok
• Select mapping from left panel
• Click ok
• To connection manager select flat file connection manager
• Press F4 for properties and set, expression - click Browse
• Select connection string and click browse to build the expression in
expression builder
• Expand Variables
• Drag and drop User::Uv Full path into the Expression section, i.e. @[User::Uv Full path]
• Click ok twice
• Close properties window
Execute the package
Providing Security for SSIS Package:
The protection level is a property of the SSIS package that specifies how sensitive information is saved within the package and whether to encrypt the package or sensitive portions of it.
Example: sensitive information would be the password to the database.
Steps to Configure protection level in SSIS
Open Business Intelligence Development Studio
Create an OLEDB connection with SQL Server authentication and provide credentials
Design package
Select package in control flow, right-click
Select properties,
Security
Protection level – Don’t save Sensitive
Don’t Save Sensitive:
When you specify Don’t Save Sensitive as the protection level, any sensitive information is not written to the package XML file when you save the package. This can be useful when you want to make sure that anything sensitive is excluded from the package before sending it to someone. After saving the package with this setting, if you open the OLEDB connection manager, the password is blank even though the Save my password checkbox is checked.
Encrypt Sensitive with User Key:
Encrypt Sensitive with User Key encrypts sensitive information based on the credentials of the user who created the package.
There is a limitation with this setting: if another user (a different user than the one who created and saved the package) opens the package, an error will be displayed stating that the package could not decrypt the protected XML node (DTS:Password).
Encrypt Sensitive with Password:
The Encrypt Sensitive with Password setting requires a password in the package, and that password is used to encrypt and decrypt the sensitive information in the package. To fill in the package password, click the button in the PackagePassword field of the package properties and provide the password and its confirmation. When you open a package with this setting, you will be prompted to enter the password.
Note: The Encrypt Sensitive with Password setting for the ProtectionLevel property overcomes the limitation of the Encrypt Sensitive with User Key setting by allowing any user to open the package as long as they have the password.
Encrypt All with Password:
The Encrypt All with Password setting is used to encrypt the entire content of the SSIS package with the specified password. You specify the package password in the PackagePassword property, the same as with the Encrypt Sensitive with Password setting. After saving the package, you can view the package XML and see that the content is encrypted between encrypted-data tags.
Encrypt All with User Key:
The Encrypt All with User Key setting is used to encrypt the entire contents of the SSIS package by using the user key. This means that only the user who created the package will be able to open it, view or modify it, and run it.
Server Storage
The Server Storage setting for the ProtectionLevel property allows the package to retain all sensitive information when you are saving the package to SQL Server. SSIS packages saved to SQL Server use the MSDB database.
Pivoting and Unpivoting Examples
Presenting data in a different shape can make analysis easier: turning columns into rows and rows into columns are two more ways of presenting data so that the end user can understand it easily.
Unpivot
The process of turning columns into rows is known as Unpivot.
Steps to Configure Unpivot
Prepare the following Excel Sheet

Year Category Jan Feb March April

2008 Bikes 100 200 300 400

2008 Accessories 200 270 300 320

2009 Components 100 120 300 150

2009 Phones and components 400 800 400 300
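For comparison, the same reshaping can be expressed in T-SQL with the UNPIVOT operator; a minimal sketch, assuming the sheet above has been loaded into a hypothetical table dbo.MonthlySales([Year], Category, Jan, Feb, March, April):

-- Turn the month columns into rows (table name assumed)
SELECT [Year], Category, [Month], SalesAmount
FROM dbo.MonthlySales
UNPIVOT (SalesAmount FOR [Month] IN (Jan, Feb, March, April)) AS u;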

Open Business Intelligence Development Studio


Create a new package and rename it as Unpivot.dtsx
In control flow drag and drop Data Flow Task
In Data Flow drag and drop Excel source


• Double click on Excel source to configure it


• Click ok
• Click Browse
• Select the Unpivot.xls file and click Open
• Click ok
• Select sheet 1 from the drop-down list
• Select columns and click ok
• Drag and drop Unpivot transformation and make a connection from the excel
source to Unpivot
• Double click on Unpivot
• Select the columns to unpivot: Jan, Feb, March, and April
• Rename the pivot key value column name as Months
• Specify Sales Amount as the destination column for all the selected pivot key values (input columns)
• Click ok
• Make sure that the Excel source file is closed.
• Drag and drop Excel destination
• Make a connection from Unpivot to Excel destination
• Double click on Excel destination to configure it
• Click New to create new destination excel sheet and rename it as Unpivot
data
• Click ok
• Select Mappings
Execute package

Pivot
A process of turning rows into columns is known as a pivot
Steps to configure the pivot


Prepare the following Excel sheet for source data.

Year Quarter Sales Amount

2009 Q1 100

2009 Q2 200

2009 Q3 300

2009 Q4 400
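For comparison, the same reshaping can be expressed in T-SQL with the PIVOT operator; a minimal sketch, assuming the sheet above has been loaded into a hypothetical table dbo.QuarterlySales([Year], Quarter, SalesAmount):

-- Turn the quarter rows into columns (table name assumed)
SELECT [Year], [Q1], [Q2], [Q3], [Q4]
FROM dbo.QuarterlySales
PIVOT (SUM(SalesAmount) FOR Quarter IN ([Q1], [Q2], [Q3], [Q4])) AS p;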

• Rename the Excel Pivot.xls


• Open Business Intelligence Development Studio
• Create a new package and rename it as Pivot.dtsx
• In control flow drag and drop data flow task.
• In Data Flow Drag and Drop Excel source
• Double click on Excel source to configure it
• Click New
• Click Browse
• Select pivot .xls and click open
• Click ok
• Select sheet 1 from the drop-down list
• Drag and Drop pivot transformation and make a connection from Excel
source to Pivot
• Double click on Pivot
• Select Input columns tab and check all input columns
• Select input-output properties tab
• Expand pivot default output
• Expand Input column


• Select Year and Set, pivot usage – 1


• Select Quarter input column and Set, pivot usage – 2
• Select Sales Amount input column and set, pivot usage – 3
Pivot Usage
Pivot usage tells SSIS how to treat the column, that is, what its role is during the transformation process:
• 0 – The column is not pivoted; it is passed through to the output.
• 1 – The column is part of the set key that identifies one or more rows as part of one set. All input rows with the same set key are combined into one output row.
• 2 – The column is a pivot column; its values become the new output columns.
• 3 – The values from this column are placed in the columns that are created as a result of the pivot.
Expand Pivot Default Output and create the following columns by clicking the Add Columns button: Year, Q1, Q2, Q3, Q4:
• Copy or Note the Lineage ID of input column year
• Select year output column and set,
• Pivot Key-value – Year
• Source column – 58(Lineage ID of year input column)
• Select Q1 and set,
• Pivot Key-value – Q1
• Source column – 64
• Select Q2 and set,
• Pivot Key value – Q2
• Source column – 64
And follow the same process for Q3 and Q4 output columns:
• Click Refresh


• Click ok
• Drag and Drop Excel destination
• Double click on Excel destination to edit it
• Provide Excel connection Manager,
• Click New to create a new table (sheet) and name it as pivot data
• Click ok twice
Execute package
