
Hi, this is Thirupathirao and I have 4.11 years of experience in ETL testing. My last project was with United Natural Foods, a US-based client mainly in the retail domain. In this project we used SQL Server as the database, SQL for writing and validating the data, Informatica PowerCenter as the ETL tool, and Jira as the defect tracking and management tool.
Roles & Res: Currently we work in an Agile methodology with two-week sprints. Once sprint planning is done, a few user stories are assigned to me; I analyse the user stories for the requirements, prepare queries based on the STM document as per the business logic, and once the code has been deployed to the test environment I prepare jobs and validate the ETL process. I am also involved in preparing, reviewing and executing test cases as per the business requirements.
CASE: the CASE statement goes through conditions and returns a value when the first condition is met (like if-then-else).
RANK, DENSE_RANK, ROW_NUMBER: these functions assign a rank or number to each record in a table. RANK gives tied rows the same rank and skips the following rank(s); DENSE_RANK gives tied rows the same rank without skipping; ROW_NUMBER assigns a unique sequential number to every row.
VIEW: a view is a virtual table that acts like an actual table; only the view definition is stored in the database, not the result set, so a view takes no storage for its data.
MATERIALIZED VIEW: the results of the view expression are stored in the database, so a materialized view does occupy storage.
SUB Query: a subquery is a SQL query within another query. It is a subset of a SELECT statement whose return values are used in filtering the conditions of the main query.
Correlated SUB Query: a correlated subquery is a subquery that uses values from the outer query in order to complete. Because a correlated subquery requires the outer query to be executed first, the correlated subquery must run once for every row in the outer query. It is also known as a synchronized subquery.
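A minimal sketch of a correlated subquery, assuming a hypothetical employee table with empid, dept_id and salary columns; the inner query re-runs for every row of the outer query and returns employees earning more than their own department's average:
SELECT e1.empid, e1.salary
FROM employee e1
WHERE e1.salary > (SELECT AVG(e2.salary) FROM employee e2 WHERE e2.dept_id = e1.dept_id);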
Water Fall Model: the waterfall model is a classical model used in the system development life cycle to create a system with a linear and sequential approach. It is termed waterfall because the model develops systematically from one phase to another in a downward fashion.
Agile Model: the main difference is that Waterfall is a linear way of working that requires the team to complete each project phase before moving on to the next one, while Agile encourages the team to work simultaneously on different phases of the project.
Project Architecture: currently we extract the data from different sources like flat files, database tables and XML files and load it into the landing area. The data is then moved to the staging layer, where we remove unwanted data and duplicate data and apply the business logic. The data then moves to the data warehouse, where we perform high-level validations like record count validations, duplicate checks and null value checks, and use EXCEPT/MINUS queries to check whether the data has been loaded as per the business requirement or not.
Validations we perform: record count validations, reconciliation checks, data length, data types, constraint checks, index checks, source data validation, data comparison checks, duplicate data validations, data with primary key / foreign key, null value checks.
What layer we work on: source to target.
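As a sketch of the count and EXCEPT/MINUS checks mentioned above, assuming hypothetical source_table and target_table tables with the same column layout and empid as the business key:
-- record count validation: the two counts should match
SELECT COUNT(*) AS src_count FROM source_table;
SELECT COUNT(*) AS tgt_count FROM target_table;
-- EXCEPT check: rows present in the source but missing in the target (should return zero rows)
SELECT * FROM source_table
EXCEPT
SELECT * FROM target_table;
-- duplicate check on the target
SELECT empid, COUNT(*) FROM target_table GROUP BY empid HAVING COUNT(*) > 1;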
Priority is the order in which the developer should resolve a defect, whereas Severity is the degree of impact that a defect has on the operation of the product. Priority is categorized into three types: low, medium and high, whereas Severity is categorized into five types, with critical being the highest. Severity is a factor used to identify how much a defect impairs product usage, and there are many scales of severity.
Defect Life Cycle: New > Open or Reject > In Analysis > In Development > Ready to Test > In Test > Done or Re-open.
Testing Life Cycle: Requirement Phase > Analysis Phase > Test Preparation > Test Execution > Sign-off.
Regression testing: this testing is done to make sure that new code changes do not have side effects on the existing functionalities. It ensures that the old code still works once the latest code changes are done.
CAST, CONVERT: change the data type from one format to another, e.g. SELECT CAST(25.65 AS varchar), CONVERT(int, 25.65).
Change Data Capture: inserting new records, updating one or more fields of existing records and deleting records are the types of changes which Change Data Capture processes must detect in the source system.
Surrogate Key: a key that is generated when a new record is inserted into a table. When a primary key is generated at runtime, it is called a surrogate key. A surrogate key is an internally generated key, invisible to the user and carrying no business meaning; for example, a sequential number can be a surrogate key.
Natural Key: a natural key is a single column or a combination of columns that has a business value and occurs naturally in the real world (e.g. Social Security Number, International Standard Book Number).
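A small illustration of a surrogate key next to a natural key, assuming a hypothetical customer_dim table; the IDENTITY column is the system-generated surrogate key, while the SSN is the natural key carrying business meaning:
CREATE TABLE customer_dim (
    customer_sk   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key generated by the system
    ssn           VARCHAR(11) NOT NULL UNIQUE,    -- natural key from the real world
    customer_name VARCHAR(100)
);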
DML: INSERT, UPDATE, DELETE. DDL: CREATE, ALTER, DROP. DCL: GRANT, REVOKE. TCL: COMMIT, ROLLBACK.
LIKE: SELECT FullName FROM EmployeeDetails WHERE FullName LIKE '__hn%';
CONCAT: SELECT CONCAT(EmpId, ManagerId) AS NewId FROM EmployeeDetails; -- returns EmpId and ManagerId concatenated
TRIM: UPDATE EmployeeDetails SET FullName = LTRIM(RTRIM(FullName));
CASE example: SELECT CASE WHEN empid = 1 THEN 'yes' ELSE 'no' END AS flag FROM employee;
Max salary in each dept: SELECT dept_id, MAX(salary) AS max_salary_per_dept FROM employee GROUP BY dept_id;
Ranking functions: SELECT *, RANK() OVER(ORDER BY salary DESC) AS ranks, DENSE_RANK() OVER(ORDER BY salary DESC) AS dense_ranks, ROW_NUMBER() OVER(ORDER BY salary DESC) AS row_numbers FROM managers;
PRIMARY KEY: a PRIMARY KEY constraint uniquely identifies each record in a database table. All columns participating in a primary key constraint must not contain NULL values.
FACT TABLE: a fact table basically represents the metrics, measurements or facts of a business process. Facts are stored in fact tables and are linked to a number of dimension tables via foreign keys.
Additive facts: facts that can be summed up across all the dimensions associated with the fact table. Semi-additive facts: facts that can be summed across some dimensions but not all. Non-additive facts: facts that cannot be summed up across any dimension.
DIMENSION TABLE: dimensions are descriptive data identified by keys, and they are organized in tables called dimension tables. Conformed dimension: a dimension table which can be shared by multiple fact tables. Junk dimension: a single dimension that groups together miscellaneous low-cardinality flags and indicators which do not belong in the other dimensions.
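To show how a fact table links to dimension tables through foreign keys, a hedged example assuming hypothetical sales_fact, date_dim and product_dim tables:
SELECT d.calendar_year, p.product_name, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key            -- foreign key to the date dimension
JOIN product_dim p ON f.product_key = p.product_key   -- foreign key to the product dimension
GROUP BY d.calendar_year, p.product_name;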
LEAD, LAG: SELECT sale_value, LAG(sale_value) OVER(ORDER BY sale_value) AS prev_value, LEAD(sale_value) OVER(ORDER BY sale_value) AS next_value FROM sales_table;
EVEN/ODD rows: SELECT E.EmpId, E.Project, E.Salary FROM (SELECT *, ROW_NUMBER() OVER(ORDER BY EmpId) AS RowNumber FROM EmployeeSalary) E WHERE E.RowNumber % 2 = 0; -- use % 2 = 1 for the odd rows
3rd highest salary: SELECT MIN(salary) FROM (SELECT TOP 3 salary FROM employee ORDER BY salary DESC) AS third;
2nd highest salary: SELECT MAX(salary) FROM employee WHERE salary NOT IN (SELECT MAX(salary) FROM employee);
Nth highest salary: SELECT * FROM employee e1 WHERE (n-1) = (SELECT COUNT(DISTINCT e2.salary) FROM employee e2 WHERE e2.salary > e1.salary);
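The Nth highest salary can also be taken with DENSE_RANK; a sketch assuming the same employee table and N = 3:
SELECT salary
FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employee) t
WHERE rnk = 3;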
New table create: SELECT * INTO newtable FROM oldtable WHERE 1 = 0; -- copies the structure without data
Duplicate find: SELECT empid, COUNT(*) FROM employee GROUP BY empid HAVING COUNT(*) > 1;
Delete duplicates: WITH cte AS (SELECT *, ROW_NUMBER() OVER(PARTITION BY empid ORDER BY empid) AS rn FROM employeetable) DELETE FROM cte WHERE rn > 1;
SUBSTRING: SELECT SUBSTRING(fullname, 1, CHARINDEX(' ', fullname) - 1) AS firstname, SUBSTRING(fullname, CHARINDEX(' ', fullname) + 1, LEN(fullname)) AS lastname FROM employee;
Left & Right: SELECT name, LEFT(name, CHARINDEX(' ', name) - 1) AS firstname, RIGHT(name, LEN(name) - CHARINDEX(' ', name)) AS lastname FROM employeetable;
SCD (Slowly Changing Dimension): an SCD is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.
SCD1: the new data overwrites the existing data, so the existing data is lost as it is not stored anywhere else.
SCD2: a new dimension record is created with the changed data values and this new record becomes the current record; the new row is added to the dimension table while the old row is kept as history.
SCD2 metadata: eff_start_date, eff_end_date and is_current are designed to manage the state of the record. eff_start_date and eff_end_date contain the time interval during which the record is effective, and the timestamp metadata is the actual time when the customer record was generated. For every record which is inserted, the Start_Date field is loaded with the system date value and the End_Date field is loaded with the maximum date value '9999-12-31' to represent it as an ACTIVE record. When the record data is modified, the End_Date field of the existing record is updated with the system date value to make it INACTIVE.
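A minimal SCD Type 2 sketch, assuming a hypothetical customer_dim table with the metadata columns described above; the current record is expired and the new version is inserted as the active record:
-- expire the existing active record for the changed customer
UPDATE customer_dim
SET eff_end_date = GETDATE(), is_current = 0
WHERE customer_id = 101 AND is_current = 1;
-- insert the new version as the active record
INSERT INTO customer_dim (customer_id, customer_city, eff_start_date, eff_end_date, is_current)
VALUES (101, 'Hyderabad', GETDATE(), '9999-12-31', 1);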
Max salary with dept: SELECT d.deptname, MAX(e.salary) FROM department d LEFT JOIN employee e ON e.dept_id = d.id GROUP BY d.deptname;
Date: SELECT firstname, GETDATE() AS currentdate, joiningdate, DATEDIFF(mm, joiningdate, GETDATE()) AS totalmonths FROM emptable;
Datepart: DATEPART(month, column_name); e.g. SELECT DATEPART(year, joiningdate) AS joining_year, COUNT(*) FROM emptable GROUP BY DATEPART(year, joiningdate);
Sum of salary above a threshold with 3 tables: SELECT a.empname, b.deptname, SUM(c.empsalary) FROM emptable a JOIN depttable b ON a.dept_id = b.dept_id JOIN salary c ON c.empid = a.empid WHERE c.empsalary > 20000 GROUP BY b.deptname, a.empname HAVING SUM(c.empsalary) > 40000;
Top 2 salaries per dept: SELECT deptname, empname, salary FROM (SELECT deptname, empname, salary, RANK() OVER(PARTITION BY deptname ORDER BY salary DESC) AS salary_ranked FROM emptable) a WHERE salary_ranked <= 2;
How to upper-case only the first letter: SELECT UPPER(SUBSTRING(fullname, 1, 1)) + SUBSTRING(fullname, 2, LEN(fullname)) AS capitalized FROM employee;
CONSTRAINTS: UNIQUE, NOT NULL, CHECK, DEFAULT, INDEX, PRIMARY KEY, FOREIGN KEY.
JOINS: a join clause is used to combine two or more tables or SELECT statements on related columns between them.
Inner join: returns the records that have matching values in both tables.
Left join: returns all records from the left table and the matched records from the right table.
Right join: returns all records from the right table and the matched records from the left table.
Full outer join: returns all records when there is a match in either the left or the right table.
Natural join: a NATURAL JOIN is similar to an INNER JOIN, but we do not need to use the ON clause during the join; we just specify the tables.
UNION & UNION ALL: the UNION operator is used to combine the result sets of two or more SELECT statements; every SELECT statement within the union must have the same number of columns and compatible data types, and UNION does not return duplicate rows. UNION ALL returns all records, including duplicates.
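A short illustration of the difference, assuming two hypothetical tables emp_2022 and emp_2023 with the same columns; UNION removes duplicate rows while UNION ALL keeps them:
SELECT empid FROM emp_2022
UNION
SELECT empid FROM emp_2023;     -- duplicates removed
SELECT empid FROM emp_2022
UNION ALL
SELECT empid FROM emp_2023;     -- duplicates retained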
Staging Area: during the ETL process a staging area is used as an intermediate storage area; it serves as a temporary area between the data sources and the data warehouse.
Normalization: normalization is a database design technique which is implemented to reduce redundant data / duplicate data / repeated data in the database; the normalization rules divide larger tables into smaller tables and link them using relationship keys.
OLTP: online transaction processing captures, stores and processes data from transactions in real time.
OLAP: online analytical processing uses complex queries to analyze aggregated historical data coming from OLTP systems.
Smoke testing: smoke testing is performed to ascertain that the critical functionalities of the program are working fine; it exercises the entire system from end to end.
Sanity testing: sanity testing is done at random to verify that each functionality is working as expected.
ETL bugs: source bugs, calculation bugs, cosmetic bugs, input/output bugs.
Unix: lines 11 to 20 of a file: head -20 file.txt | tail -10. Count of unique rows: sort -u file.txt | wc -l. Find duplicate rows: sort file.txt | uniq -d | wc -l. grep command: grep 'string' file.txt; grep -c 'string' file.txt (match count); grep -v 'string' file.txt (non-matching lines).
STAR SCHEMA: a star schema contains both dimension tables and fact tables; the fact table sits at the center and is surrounded by the dimension tables. SNOWFLAKE SCHEMA: a snowflake schema contains dimension tables, fact tables and sub-dimension tables; each dimension is normalized into sub-dimensions.
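As a sketch of how a snowflake schema normalizes a dimension into a sub-dimension, assuming hypothetical product_dim and product_category tables (in a star schema the category columns would simply stay inside product_dim):
CREATE TABLE product_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(50)
);
CREATE TABLE product_dim (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES product_category(category_key)  -- link to the sub-dimension
);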

Informatica Power Center:


What is the difference between active and passive transformation?
An active transformation is a transformation that changes the number of rows when the source data is passed through it. For example, the Aggregator transformation is a type of active transformation that performs aggregations on groups, such as sum, and reduces the number of rows.
A passive transformation is a transformation that does not change the number of rows when the source data is passed through it, i.e., no new rows are added and no existing rows are dropped. In this transformation, the number of input and output rows is the same.
Mapping: a mapping is a pipeline or structural flow of data that describes how data flows from the source to the destination through transformations.
What is a session?
A session is a property in Informatica that has a set of instructions to define when and how to move the data from the source table to the target table. A session is like a task that we create in the Workflow Manager. Any session that you create must have a mapping associated with it. A session must have a single mapping at a time, and it cannot be changed. In order to execute the session, it must be added to a workflow. A session can be either a reusable or a non-reusable object, where reusable means the same session can be used in multiple workflows.
What is the Designer?
The Designer is a graphical user interface that builds and manages objects like source tables, target tables, mapplets, mappings, and transformations. Mappings are created in the Designer by using the Source Analyzer to import source tables and the Target Designer to import target tables.
Workflow:
A workflow is a set of instructions used to execute the mappings. The workflow contains various tasks such as the session task, command task, event-wait task, email task, etc., which are used to execute the sessions. It is also used to schedule the mappings. All the tasks inside a workflow are connected to each other through links. After creating the workflow, we can execute it in the Workflow Manager and monitor its progress through the Workflow Monitor.
Workflow Monitor?
The Workflow Monitor is used to monitor the execution of workflows or of the tasks available in a workflow. It is mainly used to view details such as event log information, the list of executed workflows, and their execution time.
Source Qualifier Transformation?
The Source Qualifier transformation selects the records from multiple sources, and the sources can be relational tables, flat files, and Informatica PowerExchange services. It is an active and connected transformation. When you add source tables to a mapping, a Source Qualifier is added automatically.
Expression Transformation?
Expression Transformation is a passive and connected transformation. It is used to manipulate the values in a single row. Examples of expression transformation are concatenating the first name and last name, adjusting the student records, converting strings to dates, etc. It also evaluates conditional statements before passing the data to other transformations. Expression transformation uses numeric and logical operators.
Sorter Transformation?
It is an active transformation. It is used to sort the data either in ascending or in descending order, similar to the ORDER BY clause in SQL.
Aggregator Transformation?
Aggregator transformation is a connected and active transformation. It is used to perform aggregate functions over a group of rows such
as sum, average, count, etc., similar to the aggregate functions in SQL such as sum(), avg(), count(), etc.
Filter Transformation?
Filter transformation is an active and connected transformation. It filters out rows as they pass through it, i.e., it can change the number of rows that are passed through. It applies a filter condition to the incoming data; the condition returns either a true or a false value, and rows for which it returns false are dropped.
Joiner Transformation?
o Joiner Transformation is an active and connected transformation. It allows you to create joins in Informatica, similar to the joins that we create in a database. In a joiner transformation, the join is between two sources (a SQL analogy is sketched after this list):
1. Master source 2. Detail source
o Master outer join: In Master outer join, the resultset contains all the records from the Detail source and the matching rows in the master
source. This join will be similar to the Right join in SQL.
o Detail outer join: In Detail outer join, the resultset contains all the records from the Master source and the matching rows in the Detail
source. This join will be similar to the Left join in SQL.
o Full Outer Join: In Full outer join, the resultset contains all the records from both the sources, i.e., Master and Detail source.
o Normal join, the resultset contains only the matching rows between Master and Detail source. This join is similar to the inner join in SQL.
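As a rough SQL analogy for the four join types (a sketch only, assuming hypothetical master_src and detail_src tables joined on id; Informatica itself does not use this syntax):
-- Normal join       : INNER JOIN        (only the matching rows)
-- Master outer join : RIGHT JOIN below  (all detail rows plus matching master rows)
-- Detail outer join : LEFT JOIN         (all master rows plus matching detail rows)
-- Full outer join   : FULL OUTER JOIN   (all rows from both sources)
SELECT m.id, m.master_col, d.detail_col
FROM master_src m
RIGHT JOIN detail_src d ON m.id = d.id;   -- master outer join equivalent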
Router Transformation?
Router transformation is an active and connected transformation. Router transformation is similar to the filter transformation, as both transformations test the input data against conditions. In a Filter transformation you can apply only one filter condition, and if the condition is not satisfied that particular row is dropped. In a Router transformation more than one condition can be applied, so the single input data can be checked against multiple conditions and routed to multiple output groups.
Lookup Transformation?
o Lookup transformation can be an active as well as a passive transformation, and it can be used in both connected and unconnected mode. It is used to look up data in a source, source qualifier, flat file, or relational table in order to retrieve the data. We can import the lookup definition from any flat file or relational database, and the Integration Service queries the lookup source based on the ports and the lookup condition, and then returns the result to other transformations.
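A connected lookup against a relational table behaves much like a left outer join that brings back the matched columns; a rough SQL analogy, assuming hypothetical source_rows and lookup_table tables (an analogy only, not Informatica syntax):
SELECT s.order_id, s.customer_id, l.customer_name            -- customer_name plays the role of the returned lookup port
FROM source_rows s
LEFT JOIN lookup_table l ON l.customer_id = s.customer_id;   -- the join condition plays the role of the lookup condition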
