
www.fullinterview.com
ETL Testing
Data Warehousing and its Concepts:
What is a Data Warehouse?
A data warehouse is a centrally managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM or ERP systems). It may also gather manual inputs from users, determining criteria and parameters for grouping or classifying records.
The data warehouse database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time with minimum disruption to operational systems; this is ensured by a strategy implemented in the ETL process.
A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated, and it then becomes ready to be loaded into the data warehouse.
A data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed in the time variant.
Sometimes, where only a portion of the detailed data is required, it may be worth considering using a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.
Data warehouse Architecture:
Data warehouse Architecture (Contd.):
Advantages of Data warehouse:
- A data warehouse provides a common data model for all data of interest, regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
- Inconsistencies are identified and resolved prior to the loading of data into the data warehouse. This greatly simplifies reporting and analysis.
- Information in the data warehouse is under the control of data warehouse users, so that even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
- Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
- Data warehouses enhance the value of operational business applications, notably customer relationship management (CRM) systems.
- Data warehouses facilitate decision support system applications such as trend reports (e.g. the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
Disadvantages of Data Warehouse:
- Data warehouses are not the optimal environment for unstructured data.
- Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
- Over their life, data warehouses can have high costs; maintenance costs in particular are high.
- Data warehouses can get outdated relatively quickly, and there is a cost to delivering suboptimal information to the organization.
- There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed, or functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.
ETL Concept:
ETL is the automated and auditable data acquisition process from source systems that involves one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.
E - Extracting data from source operational or archive systems, which are the primary sources of data for the data warehouse.
T - Transforming the data, which may involve cleaning, filtering, validating and applying business rules.
L - Loading the data into the data warehouse or any other database or application that houses the data.
ETL Process:
The ETL process involves the Extraction, Transformation and Loading steps.
Extraction:
The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data format. Common data source formats are relational databases and flat files, but they may include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources through web spidering or screen scraping. Extraction converts the data into a format suitable for transformation processing.
An intrinsic part of the extraction involves parsing the extracted data, resulting in a check of whether the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
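The parse-and-check step might look like the following sketch, assuming a flat-file source whose rows should match an `id,date,amount` layout (the file contents and pattern are made up for illustration):

```python
import re

# Hypothetical lines extracted from a flat file: expected layout "id,date,amount"
extracted = [
    "101,2023-01-15,250.00",
    "102,15/01/2023,90.00",        # wrong date format -> should be rejected
    "103,2023-02-01,not-a-number", # non-numeric amount -> should be rejected
]

# Expected pattern: integer id, ISO date, decimal amount
ROW_PATTERN = re.compile(r"^\d+,\d{4}-\d{2}-\d{2},\d+(\.\d+)?$")

accepted, rejected = [], []
for line in extracted:
    # Rows failing the structural check are rejected rather than loaded
    (accepted if ROW_PATTERN.match(line) else rejected).append(line)
```

Only `accepted` rows move on to transformation; `rejected` rows would typically be logged for review.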
Transformation:
Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is secured, you have to worry about its format and structure, because it will not be in the format needed for the target: for example, the grain level and the data types might be different. The data cannot be used as it is; some rules and functions need to be applied to transform it.
One of the purposes of ETL is to consolidate the data in a central repository, or to bring it to one logical or physical place. Data can be consolidated from similar systems, different subject areas, etc.
ETL must support data integration for data coming from multiple sources and at different times. This has to be a seamless operation, so as to avoid overwriting existing data, creating duplicate data, or, even worse, simply being unable to load the data into the target.
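Consolidation without duplication can be sketched as follows. The two source systems, the `customer_id` business key and the "newest record wins" rule are all assumptions chosen for illustration; real merge rules come from the business.

```python
# Rows arriving from two hypothetical source systems; customer_id is the
# business key shared across systems.
system_a = [{"customer_id": 1, "email": "old@x.com", "updated": "2023-01-01"}]
system_b = [{"customer_id": 1, "email": "new@x.com", "updated": "2023-06-01"},
            {"customer_id": 2, "email": "b@x.com",   "updated": "2023-03-01"}]

# Integrate both feeds: process in timestamp order so that, when two systems
# know the same customer, the newer record overwrites the older one instead
# of creating a duplicate.
merged = {}
for row in sorted(system_a + system_b, key=lambda r: r["updated"]):
    merged[row["customer_id"]] = row

consolidated = list(merged.values())
```

The result holds one row per business key, with the latest known values, regardless of which system or load window each row came from.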
Loading:
The loading process is critical to integration and consolidation. It decides the modality of how the data is added to the warehouse, or simply rejected. Methods like insertion, updating or deletion are executed at this step. What happens to the existing data? Should the old data be deleted because of the new information? Or should the old data be archived? Should the new data be treated as an addition to the existing data?
So data has to be loaded into the data warehouse with the utmost care, and a data auditing process is what establishes the confidence level. This auditing process normally happens after the data has been loaded.
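One common loading modality, updating existing rows while inserting new ones (an "upsert"), can be sketched with SQLite's `INSERT ... ON CONFLICT` clause. The `dim_product` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (sku TEXT PRIMARY KEY, price REAL)")

def load(rows):
    # Upsert: insert new SKUs; update existing SKUs instead of duplicating them
    conn.executemany(
        "INSERT INTO dim_product (sku, price) VALUES (?, ?) "
        "ON CONFLICT(sku) DO UPDATE SET price = excluded.price",
        rows,
    )

load([("A1", 9.99), ("B2", 5.00)])   # initial load
load([("A1", 11.49), ("C3", 2.75)])  # incremental load: A1 updated, C3 added
```

After both loads the table holds three rows, with A1 carrying its updated price; an audit step would then compare these counts and values against the source.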
List of ETL tools:
Below is a list of ETL tools available in the market:

List of ETL Tools                    ETL Vendors
Oracle Warehouse Builder (OWB)       Oracle
Data Integrator & Data Services      SAP BusinessObjects
IBM Information Server (DataStage)   IBM
SAS Data Integration Studio          SAS Institute
PowerCenter                          Informatica
Elixir Repertoire                    Elixir
Data Migrator                        Information Builders
SQL Server Integration Services      Microsoft
Talend Open Studio                   Talend
DataFlow Manager                     Pitney Bowes Business Insight
Data Integrator                      Pervasive
Open Text Integration Center         Open Text
Transformation Manager               ETL Solutions Ltd.
Data Manager/Decision Stream         IBM (Cognos)
CloverETL                            Javlin
ETL4ALL                              IKAN
DB2 Warehouse Edition                IBM
Pentaho Data Integration             Pentaho
Adeptia Integration Server           Adeptia
ETL Testing:
Following are some common goals for testing an ETL application:
Data completeness - To ensure that all expected data is loaded.
Data quality - To ensure that the ETL application correctly rejects, substitutes default values for, corrects, and reports invalid data.
Data transformation - To ensure that all data is correctly transformed according to business rules and design specifications.
Performance and scalability - To ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing - To ensure that the ETL process functions well with other upstream and downstream applications.
User-acceptance testing - To ensure the solution fulfills the users' current expectations and also anticipates their future expectations.
Regression testing - To keep the existing functionality intact each time a new release of code is completed.
Basically, data warehouse testing is divided into two categories: 'back-end testing' and 'front-end testing'. The former, which is the ETL testing proper, compares the source systems' data to the end-result data in the loaded area; the latter refers to the user checking the data by comparing their MIS with the data displayed by the end-user tools.
Data Validation:
Data completeness is one of the basic forms of data validation. It verifies that all expected data loads into the data warehouse. This includes validating all the records and fields and ensuring that the full contents of each field are loaded.
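A completeness check typically compares record counts and then field contents between source and target. A minimal sketch, using two in-memory SQLite tables as stand-ins for the source system and the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INTEGER, name TEXT);
CREATE TABLE tgt (id INTEGER, name TEXT);
INSERT INTO src VALUES (1,'alpha'),(2,'beta'),(3,'gamma');
INSERT INTO tgt VALUES (1,'alpha'),(2,'beta'),(3,'gam');  -- field truncated in load
""")

# Record-level completeness: did every expected row arrive?
src_count = conn.execute("SELECT COUNT(*) FROM src").fetchone()[0]
tgt_count = conn.execute("SELECT COUNT(*) FROM tgt").fetchone()[0]
counts_match = (src_count == tgt_count)

# Field-level completeness: any rows whose full contents did not survive?
truncated = conn.execute("""
    SELECT s.id FROM src s JOIN tgt t ON s.id = t.id
    WHERE s.name <> t.name
""").fetchall()
```

Note that the counts match here even though row 3 was truncated: count checks alone are not sufficient, which is why the field-level comparison is part of completeness testing.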
Data Transformation:
Validating that the data is transformed correctly based on business rules can be one of the most complex parts of testing an ETL application with significant transformation logic. One way of testing is to pick some sample records and compare them manually to validate the data transformation, but this method requires manual testing steps and testers who have a good amount of experience and understanding of the ETL logic.
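The sample-record approach can be partially automated: encode the expected target value for each sample source record and compare. The business rule below (format a name as "LAST, First") is a hypothetical example, not one taken from this document.

```python
# Hypothetical transformation rule under test: full_name = "LAST, First"
def transform_name(first, last):
    return f"{last.upper()}, {first.capitalize()}"

# Sample source records paired with the value a tester expects in the target
samples = [
    (("mary", "smith"), "SMITH, Mary"),
    (("joe",  "doe"),   "DOE, Joe"),
]

# Collect any sample whose actual transformed value differs from the expected one
mismatches = [(args, expected, transform_name(*args))
              for args, expected in samples
              if transform_name(*args) != expected]
```

An empty `mismatches` list means every sampled record was transformed as the business rule specifies; a non-empty one pinpoints exactly which inputs diverged.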
Data Warehouse Testing Life Cycle:
Like any other piece of software, a DW implementation undergoes the natural cycle of Unit testing, System testing, Regression testing, Integration testing and Acceptance testing.
Unit testing: Traditionally this has been the task of the developer. This is white-box testing to ensure the module or component is coded as per the agreed-upon design specifications. The developer should focus on the following:
a) All inbound and outbound directory structures are created properly, with appropriate permissions and sufficient disk space, and all tables used during the ETL are present with the necessary privileges.
b) The ETL routines give expected results:
i. All transformation logic works as designed, from source to target
ii. Boundary conditions are satisfied, e.g. checks for date fields with leap-year dates
iii. Surrogate keys have been generated properly
iv. NULL values have been populated where expected
v. Rejects have occurred where expected, and a log for rejects is created with sufficient details
vi. Error recovery methods work
vii. Auditing is done properly
c) The data loaded into the target is complete:
i. All source data that is expected to get loaded into the target actually gets loaded; compare counts between source and target, and use data profiling tools
ii. All fields are loaded with their full contents, i.e. no data field is truncated while transforming
iii. No duplicates are loaded
iv. Aggregations take place in the target properly
v. Data integrity constraints are properly taken care of
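A few of the checks above can be phrased directly as unit-test assertions. The target rows below are invented for illustration; the point is that surrogate-key uniqueness, expected NULLs, duplicate detection and leap-year boundaries are all mechanically checkable.

```python
import datetime

# Hypothetical loaded target rows: (surrogate_key, natural_id, ship_date)
target = [
    (1, "A", datetime.date(2024, 2, 29)),  # leap-year boundary date (check b.ii)
    (2, "B", None),                        # NULL expected for unshipped orders (b.iv)
    (3, "C", datetime.date(2023, 7, 1)),
]

# b.iii: surrogate keys generated properly (no repeats)
keys = [k for k, _, _ in target]
surrogate_keys_unique = len(keys) == len(set(keys))

# c.iii: no duplicate natural/business keys loaded
no_duplicate_naturals = len({n for _, n, _ in target}) == len(target)

# b.ii: the Feb 29 boundary date survived the load intact
leap_date_survived = target[0][2] == datetime.date(2024, 2, 29)
```

In a real project these assertions would run inside the developer's unit-test framework against the actual staging and target tables rather than in-memory tuples.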
System testing: Generally, the QA team owns this responsibility. For them, the design document is the bible, and the entire set of test cases is directly based upon it. Here we test the functionality of the application, and it is mostly black-box. The major challenge here is the preparation of test data: an intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible, use production-like data. You may also use data generation tools, or customized tools of your own, to create test data. We must test for all possible combinations of input and specifically check out the errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.
The QA team must test for:
i. Data completeness: match source-to-target counts in terms of the business. Also, the load windows, the refresh period for the DW, and the views created should be signed off by the users.
ii. Data aggregations: match aggregated data against the staging tables.
iii. Granularity of data is as per specifications.
iv. Error logs and audit tables are generated and populated properly.
v. Notifications to IT and/or the business are generated in the proper format.
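The "intelligently designed input dataset" mentioned above usually mixes production-like rows with deliberate boundary and error cases. A sketch of a small custom generator, with all field names and ranges invented for illustration:

```python
import random

def make_test_orders(n, seed=42):
    """Generate production-like rows plus deliberate edge cases."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    rows = [{"id": i, "amount": round(rng.uniform(1, 500), 2)}
            for i in range(n)]
    # Deliberate boundary / error cases appended so the ETL's
    # error and exception handling is actually exercised
    rows += [
        {"id": n,     "amount": 0.0},    # boundary: zero amount
        {"id": n + 1, "amount": -5.0},   # error case: negative amount
        {"id": n + 2, "amount": None},   # error case: missing value
    ]
    return rows

data = make_test_orders(10)
```

Seeding the generator keeps runs reproducible, which matters later when the same inputs are reused for regression testing.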
Regression testing: A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation, it is very critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for the existing code whenever a new piece of code is introduced. However, a better strategy could be to preserve earlier test input data and result sets and run the same tests again; the new results can then be compared against the older ones to ensure proper functionality.
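The preserve-and-compare strategy can be sketched as a baseline file: run the ETL once on a fixed input, save the result set, and after each release rerun the same input and diff against the saved baseline. The `run_etl` function here is a trivial stand-in for the real pipeline.

```python
import json
import os
import tempfile

def run_etl(rows):
    # Stand-in for the ETL under test: sums amounts per region
    out = {}
    for r in rows:
        out[r["region"]] = out.get(r["region"], 0) + r["amount"]
    return out

# Preserved test input: reused unchanged across releases
fixed_input = [{"region": "east", "amount": 10},
               {"region": "west", "amount": 5}]

# Release N: preserve the result set as a baseline file
baseline_path = os.path.join(tempfile.mkdtemp(), "baseline.json")
with open(baseline_path, "w") as f:
    json.dump(run_etl(fixed_input), f, sort_keys=True)

# Release N+1: rerun the preserved input and diff against the baseline
with open(baseline_path) as f:
    baseline = json.load(f)
regression_free = (run_etl(fixed_input) == baseline)
```

Any enhancement that changes the output for the preserved input flips `regression_free` to False, flagging a potential regression for review.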
Integration testing: This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:
i. The sequence of jobs to be executed, with job dependencies and scheduling
ii. Re-startability of jobs in case of failures
iii. Generation of error logs
iv. Cleanup scripts for the environment, including the database
This activity is a combined responsibility, and the participation of experts from all related applications is a must in order to avoid misinterpretation of results.
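Point (i), job sequencing with dependencies, amounts to ordering jobs so every job runs after the jobs it depends on. The standard library's `graphlib` (Python 3.9+) can sketch this; the job names and dependency graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job dependency graph: each job maps to the jobs it depends on
deps = {
    "stage_extract": set(),
    "load_dims":     {"stage_extract"},
    "load_facts":    {"load_dims"},
    "build_reports": {"load_facts"},
}

# static_order() yields a valid execution sequence: dependencies always first
run_order = list(TopologicalSorter(deps).static_order())
```

An integration test would verify that the scheduler's actual execution order respects this sequence, and that a restart after a mid-chain failure resumes from the failed job rather than from the beginning.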
Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges of whether the application works as they expect. However, business users may not have proper ETL knowledge, so the development and test teams should be ready to provide answers about the ETL process as it relates to data population. The test team must have sufficient business knowledge to translate the results in terms of the business. Also, the load windows, the refresh period for the DW, and the views created should be signed off by the users.
Performance testing: In addition to the above tests, a DW must necessarily go through another phase called performance testing. Any DW application is designed to be scalable and robust; therefore, when it goes into the production environment, it should not cause performance problems. Here we must test the system with a huge volume of data, and ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts and others who can review and validate your code for optimization.
Summary:
Testing a DW application should be done with a sense of utmost responsibility. A bug in a DW traced at a later stage results in unpredictable losses, and the task is even more difficult in the absence of any single end-to-end testing tool. So the strategies for testing should be methodically developed, refined and streamlined. This is all the more true because the requirements of a DW often change dynamically; under such circumstances, repeated discussions with the development team and the users are of the utmost importance to the test team. Another area of concern is test coverage, which has to be reviewed multiple times to ensure the completeness of testing. Always remember: a DW tester must go the extra mile to ensure near defect-free solutions.