FPT-HandBook Data Engineer (11123)
Data Engineer
How To Become A Data Engineer?
So what exactly do they do? They create the framework that makes data consumable for data scientists and analysts, so those users can derive insights from it. In short, data engineers are the builders of data systems.
System Design
Data engineers should always build systems that are scalable, robust, and fault-tolerant, so that the system can scale up as the number of data sources grows and can handle a massive amount of data without any failure. For instance, imagine a situation where a source of data is doubled or tripled but the system fails to scale up; it would then cost far more time and resources to build a system that can take in this extensive data. Big data engineers have a role here: they handle the extract, transform, and load (ETL) process, which is a blueprint for how the collected raw data is processed and transformed into data ready for analysis.
Analytics
The data engineer performs ad-hoc analyses of data stored in the business's databases and writes SQL scripts, stored procedures, functions, and views. They are responsible for troubleshooting data issues within and across the business and for presenting solutions to these issues. The data engineer proactively analyzes and evaluates the business's databases in order to identify and recommend improvements and optimizations. They prepare activity and progress reports on database status and health, which are later presented to senior data engineers for review and evaluation. In addition, the data engineer analyzes complex data systems and elements, dependencies, data flows, and relationships so as to contribute to conceptual, physical, and logical data models.
Other responsibilities include improving foundational data procedures, integrating new data management technologies and software into existing systems, building data collection pipelines, and, finally, performance tuning to make the whole system more efficient.
Data engineers are considered the "librarians" of the data warehouse: they catalog and organize metadata and define the processes by which data is filed into or extracted from the warehouse. Nowadays, metadata management and tooling have become a vital component of the modern-day platform.
1. Data Modelling
The data model is an essential part of the data science pipeline. Data modelling is the process of converting a sophisticated software system design into a diagram that can be easily comprehended, using text and symbols to represent the flow of data. Data models are built during the analysis and design phase of a project to ensure the requirements of a new application are fully understood. These models can also be invoked later in the data lifecycle to rationalize data designs that were originally created by programmers on an ad hoc basis.
Hierarchical Data Model: This data model arranges data in a tree-like structure; one-to-many arrangements marked these efforts, and they replaced file-based systems. E.g., IBM's Information Management System (IMS), which found extensive use in businesses like banking.
Relational Data Model: Relational models replaced hierarchical models, as they reduced program complexity versus file-based systems and did not require developers to define data paths.
Entity-Relationship Model: Closely related to the relational model, these models use diagrams and flowcharts to graphically illustrate the elements of the database and ease understanding of the underlying models.
Graph Data Model: This is a more advanced version of the hierarchical model which, together with graph databases, is used for describing complicated relationships within data sets.
2. Automation
Industries use automation to increase productivity, improve quality and consistency,
reduce costs, and speed delivery. It provides benefits in greater magnitude to every
team player in an organization including Testers, Quality Analysts, Developers, or
even Business Users.
Automation can provide the following benefits: it replaces manual coding and custom design for planning, designing, building, and documenting the decision-support infrastructure.
3. ETL
ETL is defined as the procedure of copying data from one or more sources into a destination system, which represents the data differently from the source or in a different context than the source. ETL is often used in data warehousing.
Data extraction means extracting data from heterogeneous or homogeneous sources; data transformation processes the data by cleansing it and transforming it into a proper storage structure for the purpose of querying and analysis; finally, data loading describes the insertion of data into the final target database, such as an operational data store, a data mart, a data lake, or a data warehouse.
In data science, ETL involves pulling data out of operational systems like MySQL or Oracle, moving it into a data warehouse like SQL Server or a modern-day warehouse like Hadoop or Redshift, and then formatting it in such a way that analysts can work with it. The ETL process ultimately feeds the analytical data layer and does more than extract data: it performs tasks like aggregating data and running metrics and algorithms on the data so that it can be easily fed into future dashboards.
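A minimal end-to-end sketch of such a pipeline in Python; the source file, column names, and target table are illustrative assumptions, with SQLite standing in for a real warehouse:

import sqlite3
import pandas as pd

# Extract: pull raw records from an operational export (this could equally be
# a MySQL/Oracle query through a database connector)
raw = pd.read_csv("orders_export.csv")

# Transform: cleanse and aggregate so the data is analysis-ready
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily_revenue = raw.groupby(raw["order_date"].dt.date)["amount"].sum().reset_index()

# Load: write the transformed data into the analytical store
conn = sqlite3.connect("warehouse.db")
daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)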
4. Product Understanding
Data engineers look at the data as their product, so it is built in such a way that users can use it. If we are building datasets for machine learning engineers or data scientists, we need to understand how they are going to use them: what models do they want to build, and is enough information being provided at the customer level? This is required because the data engineer works at a fine level of granularity and must aggregate things appropriately.
5. Data is the fuel of the 21st century; various data sources and numerous technologies have evolved over the last two decades, the major ones being NoSQL databases and big data frameworks.
6. Big Data Frameworks: Big data engineers are required to learn multiple big
data frameworks to create and design processing systems.
7. Real-time Processing Frameworks: Concentrate on learning frameworks like Apache Spark, an open-source cluster computing framework for real-time processing; when it comes to real-time data analytics, Spark stands as the go-to tool across all solutions.
8. Cloud: Next in the career path, one must learn cloud, which will serve as a big plus. A good understanding of cloud technology provides the option of storing significant amounts of data while keeping big data available, scalable, and fault-tolerant.
Data Sources
General Information about Data Sources
With the current strong development of science and technology, the data sources used as input for building a data warehouse are very diverse and rich. For each specific data warehouse, the data sources will differ depending on the field. For example, for a data warehouse about music, the input will mainly revolve around song information; for a data warehouse about painting, it will be image information; for a data warehouse about the weather, the data source will revolve around locations and time-series weather features; and for business data, the data source will revolve around transaction information.
Therefore, to simplify the study of data sources that feed the data warehouse building process, we only need to consider and present the commonly used data sources, which include the following types:
• Flat file with structured data (csv, json, xml/RDF) and flat file with
unstructured data (file txt, log)
• Binary data formatted as the table/ matrix (file xls/xlsx, file avro, file music,
file video, file photo)
Flat Files
A flat file is a very common way to store data when recording the raw information of original documents; it is both simple and usable. Log files, for example, carry the data transferred between steps in a processing sequence.
Flat files are text files in plain text format, so we can read and edit them with simple tools available on popular operating systems, such as Notepad (Windows) or Emacs & Vim (Linux/Unix or macOS).
The content of a flat file is most commonly structured data, with data fields separated using a delimited, fixed-width, or mixed format. Sometimes flat files are also used for unstructured data; in this case, they are processed to extract the necessary data within them into structured data files, and the results are then convenient for checking and use in the next steps.
A flat file may delimit its columns in one of several ways (a short reading example follows the list):
• Delimited format separates columns with a delimiter character such as a Tab or a Comma.
• Fixed width format uses width to define columns and rows. This format also includes a character for padding fields to their maximum width.
• Ragged right format uses width to define all columns, except for the last column, which is delimited by the row delimiter.
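As a quick illustration (a minimal sketch; the file names, column widths, and column layout are assumptions, not taken from the handbook), pandas can read each of these layouts directly:

import pandas as pd

# Tab-delimited flat file
tab_df = pd.read_csv("sales_tab.txt", sep="\t")
# Comma-delimited flat file (CSV)
csv_df = pd.read_csv("sales_comma.csv")
# Fixed-width flat file: column widths are declared explicitly
fwf_df = pd.read_fwf("sales_fixed.txt", widths=[10, 20, 8])

print(tab_df.head())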
To implement the above ideas, we can run a script like the following to collect data into a file whose output contains only structured data. This result makes it very easy to load the data into a table or a matrix for further processing.
awk 'BEGIN { printf("SaleDate,Product,Count\n") }
{
    # Each input line is assumed to look like "date:product,count"
    n = split($0, record, ":")
    date = record[1]
    values = record[2]
    m = split(values, fields, ",")
    printf("%s,%s,%s\n", date, fields[1], fields[2])
}' input.log > sales.csv
A CSV (comma-separated values) flat file follows these conventions:
• The last record in the file may or may not have an ending line break. For example:
o h1,h2,h3 CRLF
o v1,v2,v3
• There may be an optional header line appearing as the first line of the file with
the same format as normal record lines. This header will contain names
corresponding to the fields in the file and should contain the same number of
fields as the records in the rest of the file (the presence or absence of the header
line should be indicated via the optional “header” parameter of this MIME
type). For example:
o field_name,field_name,field_name CRLF
o h1,h2,h3 CRLF
o v1,v2,v3 CRLF
• Within the header and each record, there may be one or more fields, separated
by commas. Each line should contain the same number of fields throughout
the file. Spaces are considered part of a field and should not be ignored. The
last field in the record must not be followed by a comma. For example:
o h1,h2,h3
• Each field may or may not be enclosed in double quotes (however some
programs, such as Microsoft Excel, do not use double quotes at all). If fields
are not enclosed with double quotes, then double quotes may not appear inside
the fields. For example:
o "h1","h2","h3" CRLF
o v1,v2,v3
• Fields containing line breaks (CRLF), double quotes, and commas should be
enclosed in double-quotes. For example:
o "h1","h CRLF
o 2","h3" CRLF
o v1,v2,v3
JSON files
The following example shows how to use JSON to store information related to books based on their topic and edition.
{
   "book": [
      {
         "id": "01",
         "language": "Java",
         "edition": "third"
      },
      {
         "id": "07",
         "language": "C++",
         "edition": "second",
         "author": "E.Balagurusamy"
      }
   ]
}
After understanding the above program, we will try another example. Let's save the
below code as json.htm:
<html>
<head>
<title>JSON example</title>
<script language="javascript">
   // Illustrative JSON object (assumed; the exact data used in this example
   // is not preserved in this copy)
   var book = { "id": "07", "language": "C++", "edition": "second" };
   document.write("<hr />");
   document.write("Language = " + book.language);
   document.write("<br>");
   document.write("Edition = " + book.edition);
   document.write("<br>");
   document.write("<hr />");
</script>
</head>
<body>
</body>
</html>
Now let's open json.htm using any JavaScript-enabled browser; the page displays the values read from the JSON object.
XML files
XML tags identify the data and are used to store and organize it, rather than specifying how to display it, which is what HTML tags are used for. XML is not going to replace HTML in the near future, but it introduces new possibilities by adopting many successful features of HTML.
There are three important characteristics of XML that make it useful in a variety of
systems and solutions −
• XML is extensible − XML allows you to create your own self-descriptive
tags, or language, that suits your application.
• XML carries the data, does not present it − XML allows you to store the
data irrespective of how it will be presented.
• XML is a public standard − XML was developed by an organization called
the World Wide Web Consortium (W3C) and is available as an open standard.
XML Usage
XML is a markup language that defines set of rules for encoding documents in a
format that is both human-readable and machine-readable. So what exactly is a
markup language? Markup is information added to a document that enhances its
meaning in certain ways, in that it identifies the parts and how they relate to each
other. More specifically, a markup language is a set of symbols that can be placed in
the text of a document to demarcate and label the parts of that document.
Following example shows how XML markup looks, when embedded in a piece of
text −
<message>
<text>Hello, world!</text>
</message>
An industry typically uses data exchange methods that are meaningful and specific
to that industry. With the advent of e-commerce, businesses conduct an increasing
number of relationships with a variety of industries and, therefore, must develop
expert knowledge of the various protocols used by those industries for electronic
communication.
The extensibility of XML makes it a very effective tool for standardizing the format
of data interchange among various industries. For example, when message brokers
and workflow engines must coordinate transactions among multiple industries or
departments within an enterprise, they can use XML to combine data from disparate
sources into a format that is understandable by all parties.
A programming language consists of grammar rules and its own vocabulary which
is used to create computer programs. These programs instruct the computer to
perform specific tasks. XML does not qualify to be a programming language as it
does not perform any computation or algorithms. It is usually stored in a simple text
file and is processed by special software that is capable of interpreting XML.
The following sample XML file describes the contents of an address book:
<?xml version="1.0"?>
<address_book>
<person gender="f">
<name>Jane Doe</name>
<address>
<city>San Francisco</city>
<state>CA</state>
<zip>94117</zip>
</address>
<phone area_code="415">555-1212</phone>
</person>
<person gender="m">
<name>John Smith</name>
<phone area_code="510">555-1234</phone>
<email>johnsmith@somewhere.com</email>
</person>
</address_book>
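As a quick illustration (a minimal sketch; it assumes the document above is saved as address_book.xml), Python's standard library can parse this file directly:

import xml.etree.ElementTree as ET

tree = ET.parse("address_book.xml")
root = tree.getroot()

# Walk every person element and print the fields used in the sample document
for person in root.findall("person"):
    name = person.find("name").text
    gender = person.get("gender")
    phone = person.find("phone")
    print(name, gender, phone.get("area_code"), phone.text)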
There are two ways to describe an XML document: XML Schemas and DTDs.
XML Schemas define the basic requirements for the structure of a particular XML
document. A Schema describes the elements and attributes that are valid in an XML
document, and the contexts in which they are valid. In other words, a Schema
specifies which tags are allowed within certain other tags, and which tags and
attributes are optional. Schemas are themselves XML files.
The schema specification is a product of the World Wide Web Consortium (W3C).
For detailed information on XML schemas,
see http://www.w3.org/TR/xmlschema-0/.
The following example shows a schema that describes the preceding address book
sample XML document:
<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">
<xsd:complexType name="bookType">
<xsd:element name="person" type="personType"/>
</xsd:complexType>
<xsd:complexType name="personType">
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="address" type="addressType"/>
<xsd:element name="phone" type="phoneType"/>
<xsd:element name="email" type="xsd:string"/>
<xsd:attribute name="gender" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="addressType">
<xsd:element name="city" type="xsd:string"/>
<xsd:element name="state" type="xsd:string"/>
<xsd:element name="zip" type="xsd:string"/>
</xsd:complexType>
<xsd:simpleType name="phoneType">
<xsd:restriction base="xsd:string"/>
<xsd:attribute name="area_code" type="xsd:string"/>
</xsd:simpleType>
</xsd:schema>
You can also describe XML documents using Document Type Definition (DTD)
files, a technology older than XML Schemas. DTDs are not XML files.
The following example shows a DTD that describes the preceding address book
sample XML document:
<!DOCTYPE address_book [
<!ELEMENT address_book (person*)>
<!ELEMENT person (name, address?, phone?, email?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (street, city, state, zip)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
]>
An XML document can include a Schema or DTD as part of the document itself,
reference an external Schema or DTD, or not include or reference a Schema or DTD
at all. The following excerpt from an XML document shows how to reference an
external DTD called address.dtd:
<?xml version="1.0"?>
<!DOCTYPE address_book SYSTEM "address.dtd">
<address_book>
...
YAML files
The basic components of YAML are described below.
Conventional Block Format
This block format uses hyphen+space to begin a new item in a specified list. Observe the example shown below −
--- # Favorite movies
- Casablanca
- North by Northwest
Inline Format
Inline format is delimited with comma and space and the items are enclosed in JSON-style brackets. Observe the example shown below −
--- # Shopping list
[milk, pumpkin pie, eggs, juice]
Folded Text
Folded text converts newlines to spaces and removes the leading whitespace. Observe the example shown below −
- {name: John Smith, age: 33}
- name: Mary Smith
  age: 27
The structure which follows all the basic conventions of YAML is shown below −
men: [John Smith, Bill Jones]
women:
- Mary Smith
- Susan Williams
• Multiple documents within a single stream are separated by three hyphens (---).
• Repeated nodes are first denoted by an ampersand (&) and later referenced with an asterisk (*).
• In YAML, colons and commas used as list separators must be followed by a space when used with scalar values.
• Nodes may be labelled with an exclamation mark (!) or double exclamation mark (!!), followed by a string which can be expanded into a URI or URL.
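As a brief illustration (a minimal sketch using the third-party PyYAML package, one common way to parse YAML in Python), the structure shown above can be loaded into native Python objects:

import yaml  # PyYAML

document = """
men: [John Smith, Bill Jones]
women:
  - Mary Smith
  - Susan Williams
"""
data = yaml.safe_load(document)
print(data["men"][0])   # John Smith
print(data["women"])    # ['Mary Smith', 'Susan Williams']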
Choose XML
• You have to represent mixed content (tags mixed within text); you might even consider HTML for this reason.
• There's already an industry standard XSD to follow.
• You need to transform the data to another XML/HTML format. (XSLT is
great for transformations.)
Choose JSON
• You have to represent data records, and a closer fit to JavaScript is valuable
to your team or your community.
Choose YAML
• You have to represent data records, and you value some additional features
missing from JSON: comments, strings without quotes, order-preserving
maps, and extensible data types.
Choose CSV
• You have to represent data records, and you value ease of import/export
with databases and spreadsheets.
The flat file is easy to use, but it has the disadvantage of poor input/output speed, so it should be considered carefully before being used for tasks that require high processing speed or large data volumes. This disadvantage is overcome by some of the other data sources presented below.
XLS/XLSX files
The XLSX and XLS file extensions are used for Microsoft Excel spreadsheets, part
of the Microsoft Office Suite of software.
XLSX/XLS files are used to store and manage data such as numbers, formulas, text,
and drawing shapes.
XLSX is part of Microsoft Office Open XML specification (also known as OOXML
or OpenXML), and was introduced with Office 2007. XLSX is a zipped, XML-based
file format. Microsoft Excel 2007 and later uses XLSX as the default file format
when creating a new spreadsheet. Support for loading and saving legacy XLS files
is also included.
XLS is the default format used with Office 97-2003. XLS is a Microsoft proprietary
Binary Interchange File Format. Microsoft Excel 97-2003 uses XLS as the default
file format when creating a new document.
The default extension used by this format is: XLSX or XLS.
We import the pandas module, including ExcelFile. The method read_excel() reads the data into a pandas DataFrame, where the first parameter is the filename and the second parameter is the sheet.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

# Note: recent pandas versions use sheet_name; older versions used sheetname
df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
print("Column headings:")
print(df.columns)
Using the data frame, we can get all the rows below an entire column as a list. To
get such a list, simply use the column header
print(df['Sepal width'])
for i in df.index:
print(df['Sepal width'][i])
AVRO files
Avro is an open source project that provides data serialization and data exchange
services for Apache Hadoop. These services can be used together or independently.
Avro facilitates the exchange of big data between programs written in any language.
With the serialization service, programs can efficiently serialize data into files or
into messages. The data storage is compact and efficient. Avro stores both the data
definition and the data together in one message or file.
Avro stores the data definition in JSON format making it easy to read and interpret;
the data itself is stored in binary format making it compact and efficient. Avro files
include markers that can be used to split large data sets into subsets suitable for
Apache MapReduce processing. Some data exchange services use a code generator
to interpret the data definition and produce code to access the data. Avro doesn't
require this step, making it ideal for scripting languages.
A key feature of Avro is robust support for data schemas that change over time —
often called schema evolution. Avro handles schema changes like missing fields,
added fields and changed fields; as a result, old programs can read new data and new
programs can read old data. Avro includes APIs for Java, Python, Ruby, C, C++ and
more. Data stored using Avro can be passed from programs written in different
languages, even from a compiled language like C to a scripting language like Apache
Pig.
Avro format data has an independent schema, also defined in JSON. An Avro schema, together with its data, is fully self-describing. Avro primitive types map to SQL (PostgreSQL) column types roughly as follows:
Avro type      PostgreSQL type
boolean        boolean
bytes          bytea
double         double
float          real
long           bigint
string         text
import copy
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# The two record fields below are assumed from the example data that follows
schema = {
    'name': 'avro.example.User',
    'type': 'record',
    'fields': [{'name': 'name', 'type': 'string'},
               {'name': 'age', 'type': 'int'}]
}
schema_parsed = avro.schema.Parse(json.dumps(schema))

# Write the example users to an Avro file
users = [{'name': 'Pierre-Simon Laplace', 'age': 77},
         {'name': 'John von Neumann', 'age': 53}]
writer = DataFileWriter(open('users.avro', 'wb'), DatumWriter(), schema_parsed)
for user in users:
    writer.append(user)
writer.close()

# Read the users (and the embedded schema) back from the file
reader = DataFileReader(open('users.avro', 'rb'), DatumReader())
metadata = copy.deepcopy(reader.meta)
schema_from_file = json.loads(metadata['avro.schema'])
users = [user for user in reader]
reader.close()

print(f'Users:\n {users}')
# Users:
# [{'name': 'Pierre-Simon Laplace', 'age': 77}, {'name': 'John von Neumann', 'age': 53}]
An interesting thing to note is what happens with the name and namespace fields.
The schema we specified has the full name of the schema that has both
name and namespace combined, i.e., 'name': 'avro.example.User'. However, after
parsing with avro.schema.Parse(), the name and namespace are separated into
individual fields. Further, when we read back the schema from the users.avro file,
we also get the name and namespace separated into individual fields.
Avro specification, for some reason, uses the name field for both the full name and
the partial name. In other words, the name field can either contain the full name or
only the partial name. Ideally, Avro specification should have kept partial_name,
namespace, and full_name as separate fields.
This behind-the-scenes separation and in-place modification may cause unexpected
errors if your code depends on the exact value of name. One common use case is
when you’re handling lots of different schemas and you want to
identify/index/search by the schema name.
A best practice to guard against possible name errors is to always parse a dict
schema into an avro.schema.RecordSchema using avro.schema.Parse(). This will
generate the namespace, fullname, and simple_name (partial name), which you can
then use with peace of mind.
print(type(schema_parsed))
# <class 'avro.schema.RecordSchema'>
print(schema_parsed.avro_name.fullname)
# avro.example.User
print(schema_parsed.avro_name.simple_name)
# User
print(schema_parsed.avro_name.namespace)
# avro.example
This problem of name and namespace deepens when we use a third-party package
called fastavro, as we will see in the next section.
We often need to write a pandas dataframe into an Avro file or read an Avro file into a pandas dataframe. To begin with, we can always represent a dataframe as a list of records and vice versa:
• List of records –> pandas.DataFrame.from_records() –> DataFrame
• DataFrame –> pandas.DataFrame.to_dict(orient='records') –> List of records
Using the two functions above in conjunction with avro-python3 or fastavro, we can read/write dataframes as Avro. The only additional work we would need to do is to inter-convert between pandas data types and Avro schema types ourselves.
An alternative solution is to use a third-party package called pandavro, which does
some of this inter-conversion for us.
import copy
import json
import pandas as pd
import pandavro as pdx
from avro.datafile import DataFileReader
from avro.io import DatumReader

# Data to be saved (the same example users as above)
users = [{'name': 'Pierre-Simon Laplace', 'age': 77},
         {'name': 'John von Neumann', 'age': 53}]
users_df = pd.DataFrame.from_records(users)
print(users_df)

# Write the dataframe to Avro and read it back
pdx.to_avro('users.avro', users_df)
users_df_redux = pdx.from_avro('users.avro')
print(type(users_df_redux))
# <class 'pandas.core.frame.DataFrame'>

# Inspect the schema that pandavro generated
reader = DataFileReader(open('users.avro', 'rb'), DatumReader())
metadata = copy.deepcopy(reader.meta)
schema_from_file = json.loads(metadata['avro.schema'])
reader.close()
print(schema_from_file)
In the above example, we didn’t specify a schema ourselves and pandavro assigned
the name = Root to the schema. We can also provide a schema dict to
pandavro.to_avro() function, which will preserve the name and namespace
faithfully.
Within the pyspark shell, we can run the following code to write and read Avro.
# Data to store (assumes the `users` list from the earlier examples and the
# spark-avro package being available in the pyspark shell)
users_df = spark.createDataFrame(users)
users_df.write.format('avro').mode("overwrite").save('users-folder')
users_df_redux = spark.read.format('avro').load('./users-folder')
Image files
• Intensity image is a data matrix whose values have been scaled to represent intensities. When the elements of an intensity image are of class uint8 or class uint16, they have integer values in the range [0, 255] and [0, 65535], respectively. If the image is of class float32, the values are single-precision floating-point numbers, usually scaled to the range [0, 1], although it is not rare to use the scale [0, 255] too.
• Binary image is a black and white image. Each pixel has one logical value, 0 or 1.
• Color image is like an intensity image but with three channels, i.e. each pixel has three intensity values (RGB) instead of one.
The result of sampling and quantization is a matrix of real numbers. The size of the image is the number of rows by the number of columns, M×N. The indexation of the image in Python follows the usual convention:
from PIL import Image

path = "picture.jpg"   # path to the image file (illustrative)

# img = Image.open(path)
# On successful execution of this statement, an object of Image type
# is returned and stored in the img variable
try:
    img = Image.open(path)
except IOError:
    pass
# Use the above statement within a try block, as it can
# raise an IOError if the file cannot be found,
# or the image cannot be opened.
Retrieve size of image: The instances of the Image class that are created have many attributes; one of the useful attributes is size.
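A minimal illustration, assuming img is the Image object opened above:

# size returns a (width, height) tuple for the opened image
print(img.size)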
• Saving an Image:
img.save(path, format)
# format is optional; if no format is specified,
# it is determined from the filename extension
• Rotating an Image: The image rotation needs angle as parameter to get the
image rotated.
from PIL import Image

def main():
    try:
        # Relative path
        img = Image.open("picture.jpg")
        # Angle given
        img = img.rotate(180)
        # Saved in the same relative location
        img.save("rotated_picture.jpg")
    except IOError:
        pass

if __name__ == "__main__":
    main()
Audio files
(Link: https://www.kdnuggets.com/2020/02/audio-data-analysis-deep-learning-python-part-1.html)
The sound excerpts are digital audio files in audio format. Sound waves are digitized
by sampling them at discrete intervals known as the sampling rate (typically 44.1kHz
for CD-quality audio meaning samples are taken 44,100 times per second).
Each sample is the amplitude of the wave at a particular time interval, and the bit depth determines how detailed the sample will be, also known as the dynamic range of the signal (typically 16 bit, which means a sample can take one of 65,536 amplitude values).
I have uploaded a random audio file to Vocaroo, an online voice recorder. Let us now load the file in your Jupyter console.
Loading an audio file:
import librosa
audio_data = '/../../gruesome.wav'
x , sr = librosa.load(audio_data)
This returns an audio time series as a numpy array with a default sampling rate (sr) of 22.05 kHz, mono. We can change this behavior by resampling at 44.1 kHz:
librosa.load(audio_data, sr=44100)
or to disable resampling.
librosa.load(audio_path, sr=None)
The sample rate is the number of samples of audio carried per second, measured in
Hz or kHz.
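As a quick check (a minimal sketch using the x and sr values loaded above), the clip duration follows directly from the number of samples and the sample rate:

# duration in seconds = number of samples / samples per second
duration = len(x) / sr
print(duration)
# librosa also provides a helper that computes the same value
print(librosa.get_duration(y=x, sr=sr))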
Most data-warehousing projects combine data from different source systems. Each
separate system may also use a different data organization and/or format. Common
data-source formats include relational databases, XML, JSON and flat files, but may
also include non-relational database structures such as Information Management
System (IMS) or other data structures such as Virtual Storage Access Method
(VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched
from outside sources by means such as web spidering or screen-scraping.
There are two scenarios when getting data from a DW:
- Extract data directly from DW with valid credentials to access databases.
- Extract data indirectly, with limits (data size, extraction speed, permissions), via given APIs; please refer to the part Web Services: Data from API for more information.
* API-specific challenges: While it may be possible to extract data from a database using SQL, the extraction process for SaaS products relies on each platform's application programming interface (API), and working with APIs can be challenging.
Based on the data description of the DW, define the databases, tables, and fields related to the data that you want to extract.
Step 4: Timeframe
Read the chapter on How Much Data Do You Need? and determine the timeframe
based on your data selection method. Write down the start and the end date of the
timeframe you want to extract the data for.
Table basics
SQL Server tables are contained within database object containers that are called
Schemas. The schema also works as a security boundary, where you can limit
database user permissions to be on a specific schema level only. You can imagine
the schema as a folder that contains a list of files. You can create up to 2,147,483,647
tables in a database, with up to 1024 columns in each table. When you design a
database table, the properties that are assigned to the table and the columns within
the table will control the allowed data types and data ranges that the table accepts.
Proper table design will make it easier and faster to store data into and retrieve data from the table.
Database Query
A query is a way of requesting information from the database. A database query can
be either a select query or an action query. A select query is a query for retrieving
data, while an action query requests additional actions to be performed on the data,
like deletion, insertion, and updating.
For example, a manager can perform a query to select the employees who were hired
5 months ago. The results could be the basis for creating performance evaluations.
Query Language
Many database systems expect you to make requests for information through a
stylised query written in a specific query language. This is the most complicated
method because it compels you to learn a specific language, but it is also the most
flexible.
• MySQL
• Oracle SQL
• NuoDB
Query languages for other types of databases, such as NoSQL databases and graph
databases, include the following:
• Cassandra Query Language (CQL)
• Neo4j’s Cypher
• Data Mining Extensions (DMX)
• XQuery
Power of Queries
A database has the ability to uncover intricate patterns and relationships, but this power is only utilised through the use of queries. A complex database contains multiple tables storing vast amounts of data. A query lets you filter it into a single table, so you can analyse it much more easily.
Queries also can execute calculations on your data, summarise your data for you,
and even automate data management tasks. You can also evaluate updates to your
data prior to committing them to the database, for still more versatility of usage.
Queries can perform a number of various tasks. Mainly, queries are used to search
through data by filtering specific criteria. Other queries contain append, crosstab,
delete, make table, parameter, totals, and update tools, each of which performs a
specific function. For example, a parameter query runs variations of a specific query: it prompts the user to enter a field value and then uses that value to build the criteria. In comparison, totals queries let users organise and summarise data.
CREATE TABLE is the keyword telling the database system what you want to do.
In this case, you want to create a new table. The unique name or identifier for the
table follows the CREATE TABLE statement.
Then, in brackets, comes the list defining each column in the table and what sort of data type it is. The syntax becomes clearer with the CUSTOMERS example below.
A copy of an existing table can be created using a combination of the CREATE
TABLE statement and the SELECT statement.
Example
The following is an example which creates a CUSTOMERS table with ID as a primary key; the NOT NULL constraints indicate that those fields cannot be NULL while creating records in this table −
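A minimal sketch of such a statement, shown with Python's built-in sqlite3 module; the ADDRESS and SALARY columns are illustrative assumptions rather than part of the handbook's original listing:

import sqlite3

conn = sqlite3.connect(":memory:")   # throw-away database for illustration
cur = conn.cursor()

# CUSTOMERS table: ID is the primary key, and the NOT NULL constraints mean
# those fields must be supplied when records are created
cur.execute("""
    CREATE TABLE CUSTOMERS (
        ID      INTEGER      NOT NULL,
        NAME    VARCHAR(20)  NOT NULL,
        AGE     INTEGER      NOT NULL,
        ADDRESS CHAR(25),
        SALARY  DECIMAL(18, 2),
        PRIMARY KEY (ID)
    )
""")
conn.commit()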
You can verify if your table has been created successfully by looking at the message
displayed by the SQL server, otherwise you can use the DESC command as follows
−
Now, you have CUSTOMERS table available in your database which you can use
to store the required information related to customers.
• To select all columns from a table (Customers) for rows where the Last_Name column has Smith for its value, you would send this SELECT statement to the server back end:
SELECT * FROM Customers WHERE Last_Name = 'Smith';
The server back end would reply with a result set similar to this:
+---------+-----------+------------+
| Cust_No | Last_Name | First_Name |
+---------+-----------+------------+
| 1001    | Smith     | John       |
| 2039    | Smith     | David      |
| 2098    | Smith     | Matthew    |
+---------+-----------+------------+
3 rows in set (0.05 sec)
• To return only the Cust_No and First_Name columns, based on the same criteria as above, use this statement:
SELECT Cust_No, First_Name FROM Customers WHERE Last_Name = 'Smith';
+---------+------------+
| Cust_No | First_Name |
+---------+------------+
| 1001    | John       |
| 2039    | David      |
| 2098    | Matthew    |
+---------+------------+
3 rows in set (0.05 sec)
With the LIKE operator you can use the percent sign (%) wild card to match zero or more characters, and the underscore (_) wild card to match exactly one character. For example:
• To select the First_Name and Nickname columns from the Friends table for rows in which the Nickname column contains the string "brain", use this statement:
SELECT First_Name, Nickname FROM Friends WHERE Nickname LIKE '%brain%';
+------------+-----------+
| First_Name | Nickname  |
+------------+-----------+
| Ben        | Brainiac  |
| Glen       | Peabrain  |
| Steven     | Nobrainer |
+------------+-----------+
3 rows in set (0.03 sec)
• To query the same table, retrieving all columns for rows in which the First_Name column's value begins with any letter and ends with "en", use this statement:
SELECT * FROM Friends WHERE First_Name LIKE '_en';
+------------+-----------+----------+
| First_Name | Last_Name | Nickname |
+------------+-----------+----------+
| Ben        | Smith     | Brainiac |
| Jen        | Peters    | Sweetpea |
+------------+-----------+----------+
2 rows in set (0.03 sec)
• If you used the % wild card instead (for example, '%en') in the example
above, the result set might look like:
+------------+-----------+-----------+
| First_Name | Last_Name | Nickname  |
+------------+-----------+-----------+
| Ben        | Smith     | Brainiac  |
| Glen       | Jones     | Peabrain  |
| Jen        | Peters    | Sweetpea  |
| Steven     | Griffin   | Nobrainer |
+------------+-----------+-----------+
4 rows in set (0.05 sec)
Web Data Integration
Web Data Integration is an approach to managing web data that focuses on data quality and control. It still achieves the same objectives as web scraping, but it is much more sophisticated, providing an end-to-end solution that treats the entire web data lifecycle as a single, integrated process. Web scraping is in fact a component of Web Data Integration, but Web Data Integration also allows you to:
• extract data from non-human readable output (hidden data)
• programmatically extract data several screens deep into transaction flows
• perform calculations and combinations to data to make it richer and more
meaningful
• cleanse the data
• normalize the data
• apply additional QA processes
• transform the data
• integrate the data not just via files but APIs and streaming capabilities
• extract data on demand
• analyze data with change, comparison, and custom reports
Ovum reports that when treated as a single, holistic workflow (from web data
extraction to insight) with the same level of data validation discipline that is normally
accorded to conventional BI data or big data, web data can yield valuable insights.
This is the value of a Web Data Integration approach – and why Import.io has
developed an end-to-end Web Data Integration platform to better serve the need to
treat the web data each company (or each team) needs as the valuable data set that it
truly is.
As market research, business intelligence, analyst, and data teams in companies from a broad range of industries continue to realize the value that can be found in datasets that reside outside of their organizations' walls, they will undoubtedly turn to the web as a key source of intelligence. High-quality Web Data Integration solutions enable the speedy and repeatable automation of web data capture and aggregation to fuel a broad array of mission-critical strategies.
What is an API?
API stands for application programming interface. The most important part of this
name is “interface,” because an API essentially talks to a program for you. You still
need to know the language to communicate with the program, but without an API,
you won’t get far.
When programmers decide to make some of their data available to the public, they
“expose endpoints,” meaning they publish a portion of the language they’ve used to
build their program. Other programmers can then pull data from the application by
building URLs or using HTTP clients (special programs that build the URLs for you)
to request data from those endpoints.
Endpoints return text that’s meant for computers to read, so it won’t make complete
sense if you don’t understand the computer code used to write it.
An API allows one program to request data from another.
Architecture of an API
APIs consist of three parts:
• User: the person who makes a request
• Client: the computer that sends the request to the server
• Server: the computer that responds to the request
Someone will build the server first, since it acquires and holds data. Once that server
is running, programmers publish documentation, including the endpoints where
specific data can be found. This documentation tells outside programmers the
structure of the data on the server. An outside user can then query (or search) the
data on the server, or build a program that runs searches on the database and
transforms that information into a different, usable format.
That’s super confusing, so let’s use a real example: an address book.
Back in the analog days, you’d receive a copy of the White Pages phone book, which
listed every person in your town by name and address, in alphabetical order. If you
needed a friend’s address, you could look them up by last name, find the address,
then look up their street on the maps included in the back. This was a limited amount
of information, and it took a long time to access. Now, through the magic of
technology, all of that information can be found in a database.
Let’s build a database containing the White Pages for a fictional town called
Customer_info. The folks at Customer_info decided that when they built their
database, they would create a few categories of information with nested data
underneath. These are our endpoints, and they’ll include all the information the API
will release to an outside program.
Here are the endpoints listed in the documentation for Customer_info:
• /names
o /first_name, /last_name
• /addresses
o /street_address, /email_address/
• /phones
o /home_phone, /mobile_phone
Obviously, this isn’t all the information that could be collected about a person. Even
if Customer_info collected more private information about the Customer_info
residents (like birthdates and social security numbers), this data wouldn’t be
available to outside programmers without knowing the language of those endpoints.
These endpoints tell you the language you must use to request information from the
database. Should you want a list of all of the folks in Customer_info with the last
name Smith, you could do one of two things:
1. Make a URL request in a browser for that information. This uses your
internet browser as the client, and you will get back a text document in coding
language to sort through. That URL might look something like this:
http://api.customer_info.com/names?last_name=smith
2. Use a program that requests the information and translates it into usable
form. You can code your own program or use a ready-made HTTP client.
The first option is great for making simple requests with only a few responses (all of
the folks in a Customer_info with the last name Xarlax, for example — I’m pretty
sure there are only six households with this name in Customer_info). The second
option requires more coding fluency, but is great for programmers who want to use
the database of another program to enhance their own apps.
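A minimal sketch of option 2 using the Python requests library; the endpoint, parameter, and field names come from the hypothetical Customer_info example above, not from a real API:

import requests

# Hypothetical endpoint; a real API documents its own base URL and auth scheme
url = "http://api.customer_info.com/names"
response = requests.get(url, params={"last_name": "smith"})

print(response.status_code)   # 200 if the request succeeded
print(response.json())        # parsed JSON body, e.g. a list of matching records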
Many companies use the open APIs from larger companies like Google and
Facebook to access data that might not otherwise be available. APIs, in this case,
significantly lower the barriers to entry for smaller companies that would otherwise
have to compile their own data.
1. Most APIs require you to register for an API key. You'll get a unique string of letters and numbers to use when accessing the API, instead of just adding your email and password every time (which isn't very secure; see the API's documentation on authorization and verification for details).
2. The easiest way to start using an API is by finding an HTTP client online, like
REST-Client, Postman, or Paw. These ready-made (and often free) tools
help you structure your requests to access existing APIs with the API key you
received. You’ll still need to know some of the syntax from the
documentation, but there is very little coding knowledge required.
3. The next best way to pull data from an API is by building a URL from existing
API documentation. This YouTube video explains how to pull location
data from Google Maps via API, and then use those coordinates to find nearby
photos on Instagram.
Overall, an API request doesn’t look that much different from a normal browser
URL, but the returned data will be in a form that’s easy for computers to read.
Finally, an API is useful for pulling specific information from another program. If
you know how to read the documentation and write the requests, you can get lots of
great data back, but it may be overwhelming to parse all of this. That’s where
developers come in. They can build programs that display data directly in an app or
browser window in an easily consumable format.
APIs have been a game-changer for modern software. The rise of the API economy
not only enables software companies to rapidly build in key functionality that might
previously have taken months or years of coding to implement, but it has also
enabled end users to connect their best-of-breed apps and flow data freely among
them via API calls.
Modern companies make the most of their cloud apps’ APIs to rapidly and
automatically deploy mission-critical data across their tech stack, even from custom
fields that don’t work with out-of-the-box integrations. FICO used API integrations
and automation to grow engagement for its marketing campaigns by double digits.
AdRoll used automated API calls among its revenue stack to increase sales meetings
13%.
When determining inserts versus updates, avoid building a lookup cache on the target table: the target has huge volumes, so creating the cache is costly and will hit performance. If we create staging tables in the target database, we can simply do an outer join in the source qualifier to determine insert/update; this approach gives good performance and avoids a full table scan on the target. We can also create indexes on the staging tables: since these tables were designed for a specific application, doing so will not impact any other schemas/users. While processing flat files into the data warehouse we can also perform cleansing. Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.
Data cleansing
Weeding out unnecessary or unwanted things (characters and spaces etc) from
incoming data to make it more meaningful and informative
Data merging
Data can be gathered from heterogeneous systems and put together
Data scrubbing
Data scrubbing is the process of fixing or eliminating individual pieces of data that
are incorrect, incomplete or duplicated before the data is passed to end user.
Data scrubbing is aimed at more than eliminating errors and redundancy. The goal
is also to bring consistency to various data sets that may have been created with
different, incompatible business rules.
Architecture
(Figure: the staging area within the ETL architecture.)
Fundamentals of traditional ETL
• ETL process can perform complex transformations and requires the extra area
to store the data.
• ETL helps to Migrate data into a Data Warehouse. Convert to the various
formats and types to adhere to one consistent system.
• ETL is a predefined process for accessing and manipulating source data into
the target database.
• ETL in data warehouse offers deep historical context for the business.
• It helps to improve productivity because it codifies and reuses without a need
for high technical skills.
Benefits of ETL
By collecting large quantities of data from multiple sources, ETL can help you turn
data into business intelligence. It can help you derive invaluable insights from it and
uncover new growth opportunities. It does so by creating a single point-of-view so
that you can make sense of the data easily. It also lets you put new data sets next to
the old ones to give you historical context. As it automates the entire process, ETL
saves you a great deal of time and helps you reduce costs. Instead of spending time
manually extracting data or using low-capacity analytics and reporting tools, you can
focus on your core competencies while your ETL solution does all the legwork.
With data governance comes data democracy as well. That means making your
corporate data accessible to all team members who need it to conduct the proper
analysis necessary for driving insights and building business intelligence.
The following are the biggest benefits of ETL:
Facilitate performance
One of the most important benefits of ETL is its ability to ensure that business users
have fast access to large amounts of transformed and integrated data to inform their
decision making. Because ETL tools perform most processing during data
transformation and loading, most data are already ready for use by the time it’s
loaded into the data store. When BI applications query the database, they don’t have
to join records, standardize formatting and naming conventions, or even perform
many calculations to generate a report – which means that they can deliver results
significantly faster. An advanced ETL solution will even include performance-
enhancing technologies like cluster awareness, massively parallel processing, and
symmetric multi-processing that further boost data warehouse performance.
Step 1 - Extraction
In this step of ETL architecture, data is extracted from the source system into the
staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the
source into Data warehouse database, rollback will be a challenge. Staging area gives
an opportunity to validate extracted data before it moves into the Data warehouse.
The data warehouse needs to integrate systems that have different DBMS, hardware, operating systems, and communication protocols. Sources could
include legacy applications like Mainframes, customized applications, Point of
contact devices like ATM, Call switches, text files, spreadsheets, ERP, data from
vendors, partners amongst others.
Hence one needs a logical data map before data is extracted and loaded physically.
This data map describes the relationship between sources and target data.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction- without update notification.
3. Partial Extraction- with update notification
Irrespective of the method used, extraction should not affect performance and
response time of the source systems. These source systems are live production
databases. Any slowdown or locking could affect the company's bottom line.
Some validations are done during Extraction (a small code sketch of such checks follows this list):
• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Data type check
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place or not
• Make sure there is no data loss during extraction
• Handle any errors that occur
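A minimal sketch of a few such checks using pandas; the file name, key column, and expected row count are illustrative assumptions:

import pandas as pd

extracted = pd.read_csv("extracted_customers.csv")

# Key checks: the key column must exist, be non-null and be unique
assert "customer_id" in extracted.columns
assert extracted["customer_id"].notnull().all(), "null keys found"
assert not extracted["customer_id"].duplicated().any(), "duplicate keys found"

# Reconciliation: compare the row count against the count reported by the source
source_row_count = 10000   # would normally come from the source system
assert len(extracted) == source_row_count, "row counts do not reconcile"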
Build the logical data mapping from sources to target: here we analyze the data sources, analyze the data content, and collect the business rules for the ETL process. Identify mainframe and other source types, which helps determine the correct solution for handling them in an effective and optimal way. Then we go forward with extracting and detecting changes, which need to be captured.
Step 2 - Transformation
Data extracted from source server is raw and not usable in its original form.
Therefore, it needs to be cleansed, mapped and transformed. In fact, this is the key
step where ETL process adds value and changes data such that insightful BI reports
can be generated.
It is one of the important ETL concepts where you apply a set of functions on
extracted data. Data that does not require any transformation is called as direct move
or pass through data.
In the transformation step, you can perform customized operations on data. For instance, the user may want a sum-of-sales revenue figure which is not in the database, or the first name and the last name in a table may be in different columns; it is possible to concatenate them before loading.
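A minimal sketch of those two transformations with pandas, on made-up data:

import pandas as pd

# Made-up rows standing in for extracted source data
sales = pd.DataFrame({
    "first_name": ["John", "Jane"],
    "last_name": ["Smith", "Doe"],
    "sales": [120.0, 300.5],
})

# Concatenate first and last name into a single column before loading
sales["full_name"] = sales["first_name"] + " " + sales["last_name"]

# Derive the sum-of-sales revenue figure that is not stored in the source
total_revenue = sales["sales"].sum()
print(total_revenue)   # 420.5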
Step 3 - Loading
Loading data into the target data warehouse database is the last step of the ETL
process. In a typical Data warehouse, huge volume of data needs to be loaded in a
relatively short period (nights). Hence, load process should be optimized for
performance.
In case of load failure, recover mechanisms should be configured to restart from the
point of failure without data integrity loss. Data Warehouse admins need to monitor,
resume, cancel loads as per prevailing server performance.
Types of Loading:
• Initial Load — populating all the Data Warehouse tables
• Incremental Load — applying ongoing changes as when needed
periodically.
• Full Refresh —erasing the contents of one or more tables and reloading with
fresh data.
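A minimal sketch of these load types using pandas and SQLite; the table and file names are illustrative, and a real warehouse load would normally use the warehouse's own bulk loader:

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")
new_rows = pd.read_csv("daily_sales.csv")

# Initial load / full refresh: replace the table contents with fresh data
new_rows.to_sql("fact_sales", conn, if_exists="replace", index=False)

# Incremental load: append only the ongoing changes
new_rows.to_sql("fact_sales", conn, if_exists="append", index=False)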
Load verification
• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check the combined values and calculated measures.
• Data checks in dimension table as well as history table.
• Check the BI reports on the loaded fact and dimension table.
Other aspects to define for each load include:
• Data object time
• Fact table or dimension table delivery (add, update)
• Loading type
• Replication for high availability
• Backup for disaster recovery
• Any late-arriving data? What data latency is acceptable?
ETL Issues
• The benefits that we have elaborated above are all related to traditional ETL.
However, traditional ETL tools cannot keep up with the high speed of changes
that is dominating the big data industry. Let’s take a look at the shortcomings
of these traditional ETL tools.
• Traditional ETL tools are highly time-consuming. Processing data with ETL means developing a multi-step process every time data needs to be moved and transformed. Furthermore, traditional ETL tools are inflexible to change and cannot load readable live data into the BI front end. We also have to mention that the process is not only costly but also time-consuming, and we all know that time is money.
• There are some factors that influence the function of ETL tools and processes. These factors can be divided into the following categories:
Quality of Data:
Common data quality issues include missing values, code values not matching the correct list of values, date issues, and referential integrity issues. It makes no sense to load the data
warehouse with poor quality data. As an example, if the data warehouse will be used
for database marketing, the addresses should be validated to avoid returned email.
Dependencies:
Source data dependencies will also tend to make load processes more complex, encourage bottlenecks, and make support more difficult.
Meta Data:
Technical meta data describes not only the structure and format of the source and
target data sources, but also the mapping and transformation rules between them.
Meta data should be visible and usable to both programs and people.
Notification:
The ETL requirements should specify what makes an acceptable load. The ETL
process should notify the appropriate support people when a load fails or has errors.
Ideally, the notification process should plug into your existing error tracking system.
People Issues
Management’s comfort level with technology:
How conversant is your management with data warehousing architecture? Will you
have a data warehouse manager? Does management have a development background? They may suggest doing all the ETL processes with Visual Basic. Comfort level is a valid concern, and these concerns will constrain your options.
In-House expertise:
What is your business's tradition? SQL Server? ETL solutions will be drawn from
current conceptions, skills and toolsets. Acquiring, transforming and loading the data
warehouse is an ongoing process and will need to be maintained and extended as
more subject areas are added to the data warehouse.
Support:
Once the ETL processes have been created, they need support; ideally, you should plug into existing support structures, including people with appropriate skills, notification mechanisms, and error tracking systems. If you use a tool for ETL, the
support staff may need to be trained. The ETL process should be documented,
especially in the area of auditing information.
Load performance: load programs should support restart and recovery work, and fast load programs reduce the time it takes to load data into the data warehouse.
Disk space:
Not only does the data warehouse have requirements for a lot of disk space, but there
is also a lot of hidden disk space needed for staging areas and intermediate files. For
example, you may want to extract data from source systems into flat files and then
transform the data to other flat files for load.
Scheduling:
Loading the data warehouse could involve hundreds of source files, which originate on different systems, use different technology, and are produced at different times. A
monthly load may be common for some portions of the warehouse and a quarterly
load for others. Some loads may be on demand such as lists of products or external
data. Some extract programs may be run on a different type of system than your
scheduler.
To reduce storage costs, store summarized data onto disk or tape. Also, a trade-off between the volume of data to be stored and its detailed usage is required; trade off the level of granularity of data to decrease the storage costs.
ETL Tools
There are many data warehousing tools available in the market. Here are some of the most prominent ones:
MarkLogic
MarkLogic is a data warehousing solution which makes data integration easier and
faster using an array of enterprise features. It can query different types of data like
documents, relationships, and metadata.
https://www.marklogic.com/product/getting-started/
Oracle
Oracle is the industry-leading database. It offers a wide range of choice of Data
Warehouse solutions for both on-premises and in the cloud. It helps to optimize
customer experiences by increasing operational efficiency.
https://www.oracle.com/index.html
Amazon RedShift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool to
analyze all types of data using standard SQL and existing BI tools. It also allows
running complex queries against petabytes of structured data.
https://aws.amazon.com/redshift/?nc2=h_m1
Ref:
https://en.wikipedia.org/wiki/Extract,_transform,_load
https://www.guru99.com/etl-extract-load-process.html
Data Warehouse
What Is a Data Warehouse?
A data warehouse (often abbreviated as DW or DWH)(1) is a database designed
to enable business intelligence activities, that is, designed for analysis (read
operations) rather than for transaction processing (write, update and delete
operations), and (2) typically has data from different sources. A data warehouse
usually includes historical data derived from transaction data, and in this case it (3)
separates the analysis workload from the transaction workload. This separation helps
improve the performance of both business intelligence and transaction databases.
A data warehouse as a master database may contain multiple relational databases.
Within each database, schemas are defined for SQL query performance, tables can
be organized inside of schemas, and data is organized into tables. When data is
ingested, it is stored in various tables described by the schema. Query tools use the
schema to determine which data tables to access and analyze.
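As a minimal, hypothetical sketch of this organization (the schema, table, and column names below are illustrative assumptions, not taken from any particular warehouse):

-- Create a schema to group related warehouse tables
CREATE SCHEMA sales;
GO
-- A table organized inside the schema; ingested data lands in tables like this
CREATE TABLE sales.orders (
    order_id      INT           NOT NULL PRIMARY KEY,
    customer_id   INT           NOT NULL,
    order_date    DATE          NOT NULL,
    total_amount  DECIMAL(12,2) NOT NULL
);
GO
-- Query tools use the schema-qualified name to locate and analyze the data
SELECT order_date, SUM(total_amount) AS daily_sales
FROM sales.orders
GROUP BY order_date;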
In addition to relational databases as mentioned above, a data warehouse
environment can include an extraction, transportation, transformation, and loading
(ETL) solution, and data services such as statistical analysis, reporting and data
mining. Thanks to ETL, data in a warehouse is clean, enriched and transformed.
The main differences between an operational database and a data warehouse:
- Data source: a database captures data as-is from a single source, such as a
transactional system; a data warehouse collects and transforms data from many
sources.
- Tables and joins: tables and joins of a database are complex because they are
normalized; in a data warehouse they are simple because the data is denormalized.
- Data type: data stored in a database is up to date; a data warehouse stores current
and historical data, which may not be up to date.
- Data loading and data access also differ between the two systems.
Data Mart: As part of a data warehouse, a data mart is designed for a specific
business line like finance, accounting, sales, purchasing, or inventory. A data mart
can also collect data directly from the sources.
Above is a brief definition to give you an overview of DWH types; see the reference
links at the end of this chapter for further reading.
- Bottom Tier:
This tier consists of the storage layer, which is usually a relational database system.
The cleansed and transformed data is loaded into this layer of the architecture. It also
acts as the staging area where raw data from different sources is pulled together for
further processing.
- Middle Tier:
This tier is generally an Online Analytical Processing (OLAP) server. OLAP enables
faster queries and better performance for data warehouse operations. The middle tier
can be either a Relational OLAP or a Multi-dimensional OLAP implementation. It
provides an abstracted view of the stored, transformed data.
- Top Tier:
This tier is the front end the end user interacts with. It generally consists of tools and
APIs that connect to the data warehouse and get data out of it. It can run interactive
user queries, report on transformed data, and analyze and mine data.
Notes on the bottom-up approach (data marts built first):
- Cost and time consumed are considered better, as users get to see results much
faster.
- The dimensional view of the data marts may lack consistency, resulting in a weaker
data warehouse.
Comparing the top-down and bottom-up approaches:
- Top-down provides a definite and consistent view of information, as the data
warehouse is used to create the data marts. Bottom-up generates reports easily, as the
data marts are created first and it is relatively easy to interact with them.
- Top-down is a strong model and hence preferred by big companies. Bottom-up is
not as strong, but the data warehouse can be extended and the number of data marts
increased.
- Top-down: time, cost, and maintenance are high. Bottom-up: time, cost, and
maintenance are low.
DWH Components
A typical data warehouse has four main components: a central database, ETL
(extract, transform, load) tools, metadata, and access tools. All of these
components are engineered for speed so that you can get results quickly and analyze
data on the fly.
Central Database
The Data store is one of the critical components of the Data Warehouse environment.
This can be implemented utilizing RDBMS. The implementation would be
constrained by the fact that traditional RDBMS is optimized for transactional
processing. For instance, ad-hoc queries, multi-table joins, and aggregations are
resource intensive and slow down performance. The following alternative approaches
can be considered:
- Deploy relational databases in parallel to allow for scalability. Parallel relational
databases allow a shared-memory or shared-nothing model on various
multiprocessor configurations or massively parallel processors.
- Deploy index structures that bypass relational table scans and improve speed and
overall performance.
Query Tools
Query tools allow business users to interact with the data in the Data Warehouse
interactively. Ad-hoc queries provide dynamic input for strategic decisions.
The categories of query tools are as follows:
- Managed Queries help business users run ad-hoc queries on the data store
interactively, in a very user-friendly manner.
- Reporting Tools allow organizations to generate regular operational reports. They
also support high-volume batch jobs with printing and summarization.
- Application Development Tools are used by power users to run customized
reports and analyses to satisfy an organization's ad-hoc requirements.
- Data Mining Tools enable the process of discovering meaningful new
correlations, patterns, and trends in the organization's data stored in the
Data Warehouse. These tools can enable automation and can feed into AI/ML.
- OLAP Tools enable multi-dimensional views of data, allowing business users to
answer complex business-scenario questions and explore the parameters involved.
Data Marts
Data Marts are access layers specific to a business function. It is tweaked for
performance and addresses specific needs of business functions like Sales,
Marketing, Finance, etc… Modular Data Marts feed into the overall Data Warehouse
created for the organization. Considerations for Data Mart design vary from
organization to organization. Data Marts could exist in the same data store as the
Data Warehouse or could exist in their own independent physical data stores.
A data lake is a centralized repository that stores all structured and unstructured
data at any scale. You can store your data without having to first structure the data,
and can run different types of analytics—from dashboards and visualizations to big
data processing, real-time analytics, and machine learning and full-text search.
The structure of the data or schema is not defined when data is captured. This
means you can store all of your data without careful design or the need to know what
questions you might answer in the future. Data Lakes allow you to import any
amount of data that can come in real-time in its original format, saving time of
defining schema.
A data lake can be an upstream database for a data warehouse.
If you would like to read further about data lakes, see the reference links at the end
of this chapter; they are not exhaustive, but they give the big picture and, hopefully,
some ideas to dig deeper.
A data mart serves the same role as a data warehouse, but it is intentionally limited
in scope. It may serve one particular department or business unit like marketing
and sales. Data marts may be a subset of a data warehouse that is highly curated for
a specific end user.
Star schema
A star schema is a database organizational structure optimized for use in a data
warehouse or business intelligence that uses a single large fact table and one or more
smaller dimensional tables. It is called a star schema because the fact table sits at the
center of the logical diagram, and the small dimensional tables branch off to form
the points of the star.
A fact table sits at the center of a star schema database, and each star schema
database only has a single fact table. The fact table contains the specific
quantifiable data to be analyzed, such as sales figures.
The fact table stores two types of information: numeric values and dimension
attribute values. Using a sales database as an example:
• Numeric value cells are unique to each row or data point and do not correlate
or relate to data stored in other rows. These might be facts about a transaction,
such as an order ID, total amount, net profit, order quantity or exact time.
• The dimension attribute values do not directly store data, but they store the
foreign key value for a row in a related dimensional table. Many rows in the
fact table will reference this type of information. So, for example, a row might store
the sales employee ID, a date value, a product ID, or a branch office ID.
Dimension tables store supporting information to the fact table. Each star schema
database has at least one dimension table. Each dimension table will relate to a
column in the fact table with a dimension value, and will store additional
information about that value.
• The employee dimension table may use the employee ID as a key value and
can contain information such as the employee's name, gender, address or
phone number.
• A product dimension table may store information such as the product name,
manufacture cost, color or first date on market.
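As a minimal T-SQL sketch of the star schema described above (all table and column names are illustrative assumptions based on the sales example, not a prescribed design):

-- Dimension tables branch off the central fact table
CREATE TABLE dim_employee (
    employee_id   INT PRIMARY KEY,
    employee_name VARCHAR(100),
    phone_number  VARCHAR(20)
);
CREATE TABLE dim_product (
    product_id       INT PRIMARY KEY,
    product_name     VARCHAR(100),
    manufacture_cost DECIMAL(10,2)
);
-- The single fact table holds numeric measures plus foreign keys to the dimensions
CREATE TABLE fact_sales (
    order_id       INT PRIMARY KEY,
    order_quantity INT,
    total_amount   DECIMAL(12,2),
    order_date     DATE,
    employee_id    INT REFERENCES dim_employee (employee_id),
    product_id     INT REFERENCES dim_product (product_id)
);
-- A typical star-schema query joins the fact table to one or more dimensions
SELECT p.product_name, SUM(f.total_amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
GROUP BY p.product_name;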
Disadvantages of the Star Schema
• Denormalized data can cause integrity issues. This means some data can turn
out to be inconsistent at times.
• Maintenance may appear simple at the beginning, but the larger data
warehouse you need to maintain, the harder it becomes (due to data
redundancy).
• It requires a lot more disk space than snowflake schema to store the same
amount of data.
• Many-to-many relationships are not supported.
• Limited possibilities for complex queries development.
Snowflake schema
The snowflake schema is an extension of the star schema. The main difference is that
in this architecture each dimension (reference) table can itself be linked to one or
more further reference tables. The aim is to normalize the data.
Advantages of the Snowflake Schema:
• Uses less disk space because data is normalized and there is minimal data
redundancy.
• Offers protection from data integrity issues.
• Maintenance is simple due to a smaller risk of data integrity violations and
low level of data redundancy.
• It is possible to use complex queries that don’t work with a star schema. This
means more space for powerful analytics.
• Supports many-to-many relationships.
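A hedged sketch of how a snowflake design might normalize one of the dimensions from the star-schema example above (again, the table and column names are illustrative assumptions): category attributes move out of the product dimension into their own lookup table.

-- Category attributes are split out of the product dimension
CREATE TABLE dim_category (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(100)
);
CREATE TABLE dim_product_snowflake (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_id  INT REFERENCES dim_category (category_id)
);
-- Queries now need an extra join compared with the star version
SELECT c.category_name, SUM(f.total_amount) AS revenue
FROM fact_sales f
JOIN dim_product_snowflake p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category_name;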
Disadvantages of the Snowflake Schema
PARTITIONING
Partitioning is done to enhance performance and facilitate easy management of data.
Partitioning also helps in balancing the various requirements of the system. It
optimizes the hardware performance and simplifies the management of data
warehouse by partitioning each fact table into multiple separate partitions. In this
chapter, we will discuss different partitioning strategies.
Partitioning Strategies
1. Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period.
Here each time period represents a significant retention period within the business.
For example, if the user queries for month to date data then it is appropriate to
partition the data into monthly segments. We can reuse the partitioned tables by
removing the data in them.
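A minimal SQL Server sketch of this strategy, assuming the illustrative fact_sales table is range-partitioned by month on its order_date column (the boundary dates and single-filegroup mapping are assumptions; partition-level TRUNCATE requires SQL Server 2016 or later):

-- Monthly boundaries: each partition holds one month of facts
CREATE PARTITION FUNCTION pf_monthly (DATE)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');
-- Map every partition to the PRIMARY filegroup for simplicity
CREATE PARTITION SCHEME ps_monthly
AS PARTITION pf_monthly ALL TO ([PRIMARY]);
-- Create the fact table on the partition scheme
CREATE TABLE fact_sales_monthly (
    order_id     INT,
    order_date   DATE NOT NULL,
    total_amount DECIMAL(12,2)
) ON ps_monthly (order_date);
-- Reuse a partition by removing only its data (the oldest month here)
TRUNCATE TABLE fact_sales_monthly WITH (PARTITIONS (1));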
2. Partition by Time into Different-sized Segments
This kind of partitioning is done where aged data is accessed infrequently. It is
implemented as a set of small partitions for relatively current data and larger
partitions for inactive data.
3. Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than time such
as product group, region, supplier, or any other dimension. Let's have an example.
Suppose a marketing function has been structured into distinct regional departments,
for example on a state-by-state basis. If each region wants to query only information
captured within its region, it is more effective to partition the fact table into regional
partitions. The queries speed up because they do not need to scan information that is
not relevant.
4. Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, we
should partition the fact table on the basis of its size. We can set a
predetermined size as a critical point. When the table exceeds the predetermined size,
a new table partition is created.
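One way to act on this strategy is to monitor the table's row count and create the next partition once it crosses the critical point; a hedged SQL Server sketch (the table name and threshold are assumptions):

-- Count the rows currently held by the fact table (heap or clustered index)
SELECT SUM(row_count) AS total_rows
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.fact_sales')
  AND index_id IN (0, 1);
-- If total_rows exceeds the predetermined size (say 500 million rows),
-- split off a new partition for subsequent loads.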
5. Partition by Normalization
Normalization (row splitting) removes redundant columns from the fact table into
separate, smaller tables. In the original example, the store attributes are split out of
the sales facts into their own table (the column names below are reconstructed):
Store table:
store_id  store_name  location   region
16        sunny       Bangalore  W
64        san         Mumbai     S
Sales fact table after normalization:
product_id  quantity  value  sales_date  store_id
30          5         3.67   3-Aug-13    16
35          4         5.33   3-Sep-13    16
40          5         2.50   3-Sep-13    64
45          7         5.66   3-Sep-13    16
METADATA
Metadata is simply defined as data about data; or the description of the structure,
content, keys, indexes, etc., of data.
Metadata play a more important role than other data warehouse data and are
important for many reasons. For example, metadata are used as a directory to help
the decision-support analyst locate the contents of the data warehouse, and
as a guide to the data mapping when data are transformed from the operational
environment to the data warehouse environment.
Metadata also serve as a guide to the algorithms used for summarization between
the current detailed data and the lightly summarized data, and between the lightly
summarized data and the highly summarized data.
Metadata should be stored and managed persistently (i.e., on disk).
Here are several examples.
• Metadata for a document may contain the document's creation date, last
modified date, size, author, description, etc.
• Metadata for ETL includes the job name, source tables/files, target
tables/files, and frequency.
• Metadata associated with data management defines the data store in the Data
Warehouse. Every object in the database needs to be described including the
data in each table, index, and view, and any associated constraints.
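Much of this structural metadata can be queried directly from the database catalog. In SQL Server, for example, a minimal sketch might look like this:

-- List every table and column the warehouse exposes, with data types
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;
-- Indexes are also described in the catalog views
SELECT t.name AS table_name, i.name AS index_name, i.type_desc
FROM sys.indexes i
JOIN sys.tables t ON t.object_id = i.object_id;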
Categories of Metadata
Metadata has been identified as a key success factor in data warehouse projects. It
captures all kinds of information necessary to extract, transform and load data from
source systems into the data warehouse, and afterwards to use and interpret the data
warehouse contents.
The generation and management of metadata serves two purposes: (1) to minimize
the efforts for development and administration of a data warehouse and (2) to
improve the extraction of information from it.
Types of OLAP
OLAP can be categorized into specific types based on how they function, as
described below.
ROLAP stands for Relational Online Analytical Processing. ROLAP stores data
in columns and rows (also known as relational tables) and retrieves the information
on demand through user submitted queries. A ROLAP database can be accessed
through complex SQL queries to calculate information. ROLAP can handle large
data volumes, but the larger the data, the slower the processing times.
Because queries are made on-demand, ROLAP does not require the storage and pre-
computation of information. However, the disadvantages of ROLAP implementations
are the potential performance constraints and scalability limitations that result from
large and inefficient join operations between large tables. Examples of popular
ROLAP products include Metacube by Stanford Technology Group, Red Brick
Warehouse by Red Brick Systems, and AXSYS Suite by Information Advantage.
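The SQL that a ROLAP engine generates is ordinary aggregation over the relational tables. A hand-written equivalent, reusing the illustrative star-schema tables sketched earlier, might look like this:

-- Roll revenue up by product and year, with subtotals and a grand total
SELECT p.product_name,
       YEAR(f.order_date) AS order_year,
       SUM(f.total_amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
GROUP BY ROLLUP (p.product_name, YEAR(f.order_date));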
HOLAP stands for Hybrid Online Analytical Processing. As the name suggests,
the HOLAP storage mode combines attributes of both MOLAP and ROLAP. Since
HOLAP involves storing part of your data in a ROLAP store and another part in a
MOLAP store, developers get the benefits of both.
With this use of the two OLAPs, the data is stored in both multidimensional
databases and relational databases. The decision to access one of the databases
depends on which is most appropriate for the requested processing application or
type. This setup allows much more flexibility for handling data. For theoretical
processing, the data is stored in a multidimensional database. For heavy processing,
the data is stored in a relational database.
Microsoft Analysis Services and SAP AG BI Accelerator are products that run off
HOLAP.
How does it work?
A data warehouse extracts information from multiple data sources and formats,
such as text files, Excel sheets, multimedia files, etc.
The extracted data is cleaned and transformed. Data is loaded into an OLAP server
(or OLAP cube) where information is pre-calculated in advance for further analysis.
ETL (extract, transform, load) has been a standard approach to data integration for
decades. But the rise of cloud computing and the need for self-service data
integration has enabled the development of new approaches such as ELT (extract,
load, transform).
What is ETL?
ETL is a data integration process that helps organizations extract data from various
sources and bring it into a single database. ETL involves three steps:
• Extraction: Data is extracted from source systems—SaaS, online, on-
premises, and others—using database queries or change data capture
processes. Following the extraction, the data is moved into a staging area.
• Transformation: Data is then cleaned, processed, and turned into a common
format so it can be consumed by a targeted data warehouse, database, or data
lake.
• Loading: Formatted data is loaded into the target system. This process can
involve writing to a delimited file, creating schemas in a database, or a new
object type in an application.
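A deliberately simplified T-SQL sketch of these three steps; the staging and warehouse table names, and the three-part source name, are assumptions, and a real pipeline would add error handling and auditing:

-- Extract: copy raw rows from the source system into a staging table
INSERT INTO staging.orders_raw (order_id, customer_id, order_date, amount_text)
SELECT order_id, customer_id, order_date, amount_text
FROM source_db.dbo.orders;
-- Transform and Load: clean, convert to a common format, and insert into the target
INSERT INTO dw.fact_orders (order_id, customer_id, order_date, total_amount)
SELECT order_id,
       customer_id,
       CAST(order_date AS DATE),
       TRY_CAST(amount_text AS DECIMAL(12,2))
FROM staging.orders_raw
WHERE TRY_CAST(amount_text AS DECIMAL(12,2)) IS NOT NULL;  -- drop unparseable rows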
Advantages of ETL Processes
Companies that use ETL also have to deal with several drawbacks:
What is ELT?
ELT is a data integration process that transfers data from a source system into a target
system without business logic-driven transformations on the data. ELT involves
three stages:
event of a new column or blocking all new transactions in the case of a long-
running open transaction. This may work for some users, but could result in
unacceptable downtime for others.
• Security gaps: Storing all the data and making it accessible to various users
and applications come with security risks. Companies must take steps to
ensure their target systems are secure by properly masking and encrypting
data.
• Compliance risk: Companies must ensure that their handling of raw data
won’t run against privacy regulations and compliance rules such as HIPAA,
PCI, and GDPR.
• Increased Latency: In cases where transformations with business logic ARE
required in ELT, you must leverage batch jobs in the data warehouse. If
latency is a concern, ELT may slow down your operations.
Batch processing implies moving the data from point A to point B. Processes
that perform such tasks are known as ETL processes — Extract, Transform, and
Load.
These processes are based on extracting data from sources, transforming, and
loading it to a data lake or data warehouse.
In recent years, though, another approach has been introduced: the ELT approach.
ETL is the legacy way, where transformations of your data happen on the way to
the lake.
ELT is the modern approach, where the transformation step is saved until after the
data is in the lake. The transformations really happen when moving from the Data
Lake to the Data Warehouse.
ETL was developed when there were no data lakes; the staging area for the data
being transformed acted as a virtual data lake. Now that storage and
compute are relatively cheap, we can have an actual data lake and a virtual data
warehouse built on top of it.
The ELT approach is often preferred over ETL since it fosters practices that make
data warehousing processes easier — e.g., highly reproducible processes, a simpler
data pipeline architecture, and so on.
DATA LOADING
Batch vs Streaming vs Lambda
Batch processing is based on loading the data in batches. This means, your data is
loaded once per day, hour, and so on.
Stream processing is based on loading the data as it arrives. This is usually done
using a Pub/Sub system. So, in this way, you can load your data to the data
warehouse nearly in real-time.
These two types of processing are not mutually exclusive. They may coexist in a
data pipeline — see the Lambda and Kappa architectures for more info. Here, we
focus on the batch approach.
Lambda Architecture combines batch and streaming pipeline into one architecture.
Building a data warehouse pipeline can be complex sometimes. If you are starting in
this world, you will soon realize there is no right or wrong way to do it. It always
depends on your needs.
Yet, there are a couple of basic processes you should put in place when building a
data pipeline to improve its operability and performance.
This section shares a roadmap that can serve as a guide when building a data
warehouse pipeline. The roadmap is intended to help people implement DataOps
when building a data warehouse pipeline through a set of processes.
In the roadmap section we talk about five processes that should be implemented to
improve your data pipeline operability and performance — Orchestration,
Monitoring, Version Control, CI/CD, and Configuration Management.
The roadmap assumes familiarity with some of the data warehousing terminology —
e.g., data lakes, data warehouses, batching, streaming, ETL, ELT, and so on.
Such processes are defined based on the DataOps philosophy, which "is a collection
of technical practices, workflows, cultural norms, and architectural patterns" that,
among other things, enables a reduction of technical debt in the data pipeline.
Orchestration
We all have written CRON jobs for orchestrating data processes at some point in our
lives.
When data is in the right place and it arrives at the expected time, everything runs
smoothly. But there is a problem: things always go wrong at some point, and when
that happens, everything is chaos.
Adopting better practices for handling data orchestration is necessary — e.g., retry
policies, data orchestration process generalization, process automation, task
dependency management, and so on.
As your pipeline grows, so does the complexity of your processes. CRON jobs fall
short for orchestrating a whole data pipeline. This is where Workflow management
systems (WMS) step in. They are systems designed to support robust operation and
orchestration of your data pipeline.
Some of the WMS used in the data industry are Apache Airflow, Apache Luigi,
and Azkaban.
Monitoring
Have you been in the position where all dashboards are down and business users
come looking for you to fix them? Or maybe your DW is down and you don't even
know it? That's why you should always monitor your data pipeline!
Monitoring should be a proactive process, not just reactive. So, if your dashboard or
DW is down, you should know it before business users come looking out for you.
To do so, you should put monitoring systems in place. They run continuously to give
you real-time insights about the health of your data pipeline.
Some tools used for monitoring are Grafana, Datadog, and Prometheus.
CI/CD
Does updating changes in your data pipeline involve a lot of manual and error-prone
processes to deploy them to production? If so, CI/CD is a solution for you.
CI/CD stands for Continuous Integration and Continuous Deployment. The goal of
CI is "to establish a consistent and automated way to build, package, and test
applications". CD, on the other hand, "picks up where continuous integration ends.
CD automates the delivery of applications to selected infrastructure environments."
— more info here.
CI/CD allows you to push changes to your data pipeline in an automated way. Also,
it will reduce manual and error-prone work.
Some tools used for CI/CD are Jenkins, GitlabCI, Codeship, and Travis.
Configuration Management
Imagine that your data pipeline infrastructure breaks down for some reason and, for
example, you need to redeploy the whole orchestration infrastructure. How would
you do it?
That’s where configuration management comes in. Configuration management
“deals with the state of any given infrastructure or software system at
any given time.” It fosters practices like Infrastructure as Code. Additionally, it deals
with the whole configuration of the infrastructure — more info here.
Some tools used for Configuration Management are Ansible, Puppet,
and Terraform.
Version control
Finally, one of the most known processes in the software industry: version control.
We all have had problems when version control practices are not in place.
Version control manages changes in artifacts. It is an essential process for tracking
changes in the code, iterative development, and team collaboration.
Some tools used for Version Control are Github, GitLab, Docker Hub, and DVC.
Summary
1. Data Warehouse is an information system that contains historical and
commutative data from single or multiple sources.
2. A Data Warehouse is subject oriented as it offers information regarding a
subject instead of the organization's ongoing operations.
3. In Data Warehouse, integration means the establishment of a common unit of
measure for all similar data from the different Databases.
4. Data Warehouse is non-volatile in the sense that the previous data is not
erased when new data is entered in it.
5. A Data Warehouse is time-variant as the data in a Data Warehouse has a high
shelf life.
6. There are five components of Data Warehouse architecture, namely the data
store, ETL tools, metadata, query tools, and data marts.
7. The query tools can be categorized as query and reporting tools, application
development tools, data mining tools, and OLAP tools.
8. The data sourcing, transformation, and migration tools are used for
performing all the conversions and summarizations.
9. In the Data Warehouse Architecture, meta-data plays an important role as it
specifies the source, usage, values, and features of data warehouse data.
Reference Link:
https://www.educba.com/data-warehouse-architecture/?source=leftnav
https://hevodata.com/learn/data-warehouse-design-a-comprehensive-guide/
https://insights.sap.com/what-is-a-data-warehouse/
https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
https://techblogmu.blogspot.com/2017/11/what-is-meant-by-metadata-in-context-
of.html
https://www.striim.com/etl-vs-elt/
https://www.sqlservertutorial.net/
https://www.guru99.com/ssis-tutorial.html
SQL Server Integration Service (SSIS) tutorial & examples:
https://www.sentryone.com/ssis-basics-guide
https://www.mssqltips.com/sqlservertutorial/9065/sql-server-integration-services-ssis-data-flow-task-example/
https://docs.microsoft.com/en-us/sql/integration-services/integration-services-tutorials
Here are some of the most basic things you need to know when working with SSIS:
SSIS is a flexible data warehousing tool that is used for data loading, extraction, and
transformation tasks such as merging, cleaning, and aggregating data.
It can extract data from different sources like Excel files, Oracle, SQL Server
databases, DB2 databases, etc. SSIS makes it easy to move data from one database
to another database. SSIS also includes graphical tools and wizards to perform
workflow functions such as sending emails and FTP operations, and to configure
data sources and destinations.
Features of SSIS
The following are the salient features of SSIS
• Relevant Data integration functions
• Data mining query transformation
• High-speed data connectivity elements such as connectivity to Oracle or SAP
• Effective implementation speed
• Provides rich Studio environment
• Tight integration with other Microsoft SQL software
SSIS Architecture
Control flow
Control flow acts as the brain of the SSIS package. It includes containers and tasks
that are managed by precedence constraints. It helps to arrange the order of execution
for all its components.
Precedence constraints
The precedence constraints are the package component that directs tasks to execute
in a predefined order. It defines the workflow of the entire SSIS package. It helps
you to connect tasks to control the flow. Depending on the configuration, the
precedence constraints can be represented as dotted or solid lines with blue, red, or
green color.
Task
A task is an individual unit of work, similar to a method in a programming
language. However, instead of writing code, we use drag-and-drop techniques on the
design surface and then configure the tasks.
Container
Containers are objects that help SSIS to provide structure to one or more tasks. It
provides visual consistency and also allows us to declare event handlers and
variables that could be in the scope of specific containers.
There are three types of containers. They are as follows:
• Sequence container: Sequence container is a subset of an SSIS package. It
acts as a single control point for the tasks that are defined inside a container.
It is used for grouping the tasks. We can split the control flow into multiple
logical units using sequence containers.
• For loop container: It is used for executing all inside tasks for a fixed number
of executions. It provides the same functionality as the sequence container
except that it also allows us to run the tasks multiple times
• For each loop container: It is more complicated than For Loop container
since it has many use cases and requires more complex configurations. It can
accomplish more popular actions such as looping across files within a
directory or looping over an executed SQL task result set.
Data flow
Data flow tasks encapsulate the data flow engine, which moves data between sources
and destinations and allows the user to transform, clean, and modify the data while it
is in transit. The data flow is often termed the heart of SSIS.
Packages
One of the core components of SSIS is packages. It is the collection of tasks that
execute in order. The precedence constraint helps to manage the order in which the
task will run. Packages can be saved as files or stored on a SQL Server, in the msdb
database or the package catalog database.
Parameters
Parameters allow the user to assign values to the properties within packages at the
time of execution. It behaves much like variables but with a few main exceptions. It
also permits you to change package execution without editing and redeploying the
package.
• Operational data
• ETL process
• Data warehouse
Operational data
Operational data is held in a database designed to integrate data from multiple
sources for additional operations on the data. It is the place where most of the data
used in the current operation is housed before it is transferred to the data warehouse
for long-term storage.
ETL Process
Extract, Transform, and Load (ETL) is a method of extracting data from different
sources, transforming it to meet the requirements, and loading it into a target
data warehouse. The data can be in any format: an XML file, a flat file, or a database
file. ETL also ensures that the data stored in the data warehouse is accurate, high
quality, relevant, and useful for the users.
Extract: It is the process of extracting the data from various data sources depending
on different validation points. And the data can be any format such as XML, flat file,
or any database file.
Transformation: In transformation, the entire data is analyzed, and various functions
are applied to it to load the data to the targeted database in a cleaned and general
format.
Load: It is the process of loading the cleaned and extracted data to a target database
using minimal resources. It also validates the number of rows that have been
processed. The index helps to track the number of rows that are loaded in the data
warehouse. It also helps to identify the data format.
Data warehouse
The data warehouse is a single, complete, and consistent store of data that is
formulated by combining data from multiple data sources. It captures the data from
diverse sources for useful analysis and access. Data warehousing involves assembling
and managing a large accumulated set of data from various sources to answer
business questions and support decision making.
Common control flow tasks and their descriptions:
- Execute SQL Task: executes a SQL statement against a relational database.
- Data Flow Task: reads data from different sources, transforms the data while it is
in memory, and writes it out to various destinations.
- File System Task: performs manipulations in the file system, such as deleting files,
creating directories, renaming files, and moving source files.
- Script Task: a blank task in which you can write .NET code to perform whatever
work you need.
- Send Mail Task: sends an email, for example to notify users that your package has
finished or that an error occurred.
- Bulk Insert Task: loads data into a table by using the BULK INSERT command.
- WMI Event Watcher Task: allows the SSIS package to wait for and respond to
certain WMI events.
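For reference, the T-SQL statement that the Bulk Insert task wraps looks roughly like the following; the file path, table name, and delimiters are illustrative assumptions:

-- Load a comma-delimited flat file straight into a table
BULK INSERT dbo.EmployeeDetails
FROM 'C:\data\employee_details.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2   -- skip the header row
);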
SSIS Components
SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server
database software that can be used to perform a broad range of data migration tasks.
SSIS is a platform for data integration and workflow applications.
Within BIDS, the SQL Server Import and Export Wizard allows you to generate
SSIS packages to copy data from one location to another quickly and easily. The
Import and Export Wizard guides you through a series of configuration editor pages
that allow you to select the source data, select your target destination, and map
source to target data elements. You might find this wizard helpful for creating a
starting point for a package.
SSIS ETL
The ‘T’ in ETL stands for transformation. The goal of transformation is to convert
raw input data to an OLAP-friendly data model. This is also known as dimensional
modeling.
Microsoft SQL Server Integration Services (SSIS) is a platform for building high-
performance data integration solutions, including extraction, transformation, and
load (ETL) packages for data warehousing. SSIS includes graphical tools and
wizards for building and debugging packages; tasks for performing workflow
functions such as FTP operations, executing SQL statements, and sending e-mail
messages; data sources and destinations for extracting and loading data;
transformations for cleaning, aggregating, merging, and copying data; a
management service, the Integration Services service for administering package
execution and storage; and application programming interfaces (APIs) for
programming the Integration Services object model.
When designing the ETL process it’s good to think about the three fundamental
things it needs to do:
• Load the data so that it can be quickly accessed by querying tools such as
reports. In practice, this implies processing SSAS cubes.
An ETL process is a program that periodically runs on a server and orchestrates the
refresh of the data in the BI system. SQL Server Integration Services (SSIS) is a
development tool and runtime that is optimized for building ETL processes.
Learning SSIS involves a steep learning curve and if you have a software
development background as I do, you might first be inclined to build your ETL
program from scratch using a general-purpose programming language such as C#.
However, once you master SSIS you’ll be able to write very efficient ETL processes
much more quickly. This is because SSIS lets you design ETL processes in a
graphical way (but if needed you can write parts using VB or C#). The SSIS
components are highly optimized for ETL type tasks and the SSIS run-time executes
independent tasks in parallel where possible. If you’re a programmer, you’ll find it
amazingly difficult to write your own ETL process using a general-purpose language
and make it run more efficiently than one developed in SSIS.
The process of extracting data from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for extraction, transformation,
and loading. Note that ETL refers to a broad process, and not three well-defined
steps.
The top-level control flow in the Integration Services project may look like this:
The “Extract and Transform” box is a sequence container that holds a data flow for
each of the tables that will be refreshed in the data warehouse. In this example, there
is one fact table and there are three dimension tables. SSIS will execute the data
flows in parallel, and when all of them have completed the cube will be processed.
The transformation of data takes place in the data flows.
The transformations needed in each of the data flows would typically look something
like this:
OnInformation
Produces reporting information relating to the outcome of either validation or
execution of a task or container (other than warnings or errors).
OnTaskFailed
Signals the failure of a task and typically follows an OnError event.
OnPreExecute
Indicates that an executable component is about to be launched.
OnPreValidate
Marks the beginning of the component validation stage, following the OnPreExecute
event. The main purpose of validation is the detection of potential problems that
might prevent execution from completing successfully.
OnPostValidate
Occurs as soon as the validation process of the component is completed (following
the OnPreValidate event).
OnPostExecute
Takes place after an executable component finishes running.
OnVariableValueChanged
Allows you to detect changes to variables. The scope of the variable determines
which executable will raise the event. In addition, for the event to be raised, the
variable's change-event property must be set to True (the default is False).
OnProgress
Raised at the point where measurable progress is made by the executable (for
example, when running an Execute SQL Task).
• Provide the following SQL command to clean up the data in the destination
table:
• TRUNCATE TABLE EmployeeDetails
• Click OK
Execute package
Execute the SSIS package using SQL server agent (using jobs):
• Open SQL server management studio
• Connect to database engine
Select SQL server agent
Note: Ensure that the SQL Server Agent service is started
• Select Jobs
Right-click and select a new job
• Unmatched records
• Select mapping and click ok
Execute package
Note:
Scenario:
Create a text file on every corresponding month dynamically using script task.
Define the following variables of type String:
• Uv Source path: D:ssispackagespackages
• Uv file name: Product category Details on
• Uv Full path:
‘D:ssis package.Package
• Select Debug Menu and Select build to build the above Scripting
• Select file menu and Select close and return
• Click ok
• Drag and drop data flow task and make a connection from script task to data
flow task
• In Data Flow Drag and Drop OLEDB Source
• Double click on OLEDB Source to configure it
• Provide the connection manager if it exists
• Select Product category table
• Select columns from left pane and click ok
• Drag and drop flat file destination and make a connection from OLEDB
source to flat file destination
• Double click on flat file destination
• Click new to create new connection Manager
• Select Delimited flat file format
• Click ok
• Provide flat file connection manager name and description if any
• Type the following path
• D:ssis packagepackageproduct category Detailson.txt
• Select columns from left panel
• Click ok
• Select mapping from left panel
• Click ok
• To connection manager select flat file connection manager
• Press F4 for properties and set, expression - click Browse
• Select connection string and click browse to build the expression in
expression builder
• Expand Variables
• Drag and drop User::UvFullPath into the expression section, i.e.
@[User::UvFullPath]
• Click ok twice
• Close properties window
Execute the package
Providing Security for SSIS Package:
The protection level is an SSIS package property that specifies how sensitive
information is saved within the package, and whether to encrypt the package or the
sensitive portions of it.
Example: the sensitive information could be the password to a database.
Steps to Configure protection level in SSIS
Open Business Intelligence Development Studio
Create an OLEDB connection with server authentication and provide the credentials
Design package
Select package in control flow, right-click
Select properties,
Security
Protection level – Don’t save Sensitive
Don’t save Sensitive:
When you specify Don't Save Sensitive as the protection level, any sensitive
information is not written to the package XML file when you save the package. This
can be useful when you want to make sure that anything sensitive is excluded from
the package before sending it to someone. After saving the package with this setting
and reopening the OLEDB connection manager, the password is blank even though
the Save My Password checkbox is checked.
Encrypt Sensitive with User Key:
Encrypt Sensitive with User Key encrypts sensitive information based on the
credentials of the user who created the package.
There is a limitation with this setting: if another user (a different user than the one
who created and saved the package) opens the package, an error is displayed saying
that the package failed to load because the sensitive data, encrypted with the original
user's key, could not be decrypted.
Encrypt Sensitive with Password:
The Encrypt Sensitive with Password setting requires a password in the package,
and that password is used to encrypt and decrypt the sensitive information in the
package. To fill in the package password, click the button in the PackagePassword
field of the package properties and provide the password and its confirmation. When
you open a package with this setting, you will be prompted to enter the password.
Note: The Encrypt Sensitive with Password setting for the ProtectionLevel property
overcomes the limitation of the Encrypt Sensitive with User Key setting by allowing
any user to open the package as long as they have the password.
Encrypt All with Password:
The Encrypt All with Password setting is used to encrypt the entire content of the
SSIS package with the specified password. You specify the package password in the
PackagePassword property, the same as with the Encrypt Sensitive with Password
setting. After saving the package, you can view the package XML; its content is
encrypted inside encrypted-data tags.
Encrypt All with User Key:
The Encrypt All with User Key setting is used to encrypt the entire contents of the
SSIS package using the user key. This means that only the user who created the
package will be able to open it, view or modify it, and run it.
Server Storage
The Server Storage setting for the ProtectionLevel property allows the package to
retain all sensitive information when you save the package to SQL Server. SSIS
packages saved to SQL Server use the msdb database.
Pivoting and Unpivoting Examples
The presentation of data matters for easy analysis; turning columns into rows and
rows into columns is another way of presenting data so that the end user can
understand it easily.
Unpivot
The process of turning columns into rows is known as unpivot.
Steps to configure Unpivot
Prepare an Excel sheet like the following (reconstructed from the original example;
the quarterly column headings are an assumption):
Category                Year  Q1   Q2   Q3   Q4
Phones and components   2009  400  800  400  300
Pivot
The process of turning rows into columns is known as pivot.
Steps to configure Pivot
Prepare source data like the following (Year, Quarter, Sales):
2009  Q1  100
2009  Q2  200
2009  Q3  300
2009  Q4  400
• Click ok
• Drag and Drop Excel destination
• Double click on Excel destination to edit it
• Provide Excel connection Manager,
• Click New to create a new table (sheet) and name it as pivot data
• Click ok twice
Execute package
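For readers who prefer T-SQL to the SSIS designer, the same reshaping can be sketched with the PIVOT and UNPIVOT operators. The table and column names below are illustrative assumptions modeled on the quarterly sales example above:

-- Pivot: turn quarterly rows (Year, Quarter, SalesAmount) into one row per year
SELECT [Year], [Q1], [Q2], [Q3], [Q4]
FROM (SELECT [Year], [Quarter], SalesAmount FROM dbo.QuarterlySales) AS src
PIVOT (SUM(SalesAmount) FOR [Quarter] IN ([Q1], [Q2], [Q3], [Q4])) AS p;

-- Unpivot: turn the quarter columns of a wide table back into rows
SELECT [Year], [Quarter], SalesAmount
FROM dbo.QuarterlySalesWide
UNPIVOT (SalesAmount FOR [Quarter] IN ([Q1], [Q2], [Q3], [Q4])) AS u;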