

Phases vs Checkpoints

Phases are used to break the graph into pieces. Temporary files created by a phase are deleted after its completion. Phases are used to effectively separate the resource-intensive (CPU, disk) parts of the application.

Checkpoints are created for recovery purposes: if the graph fails, you can recover to the latest saved point and rerun from it.

You can have phase breaks with or without checkpoints.

xfr

A new sandbox will have several standard directories: mp, dml, xfr, db, run, etc. The xfr directory holds files with the extension .xfr containing your own custom functions (include them via "somepath/xfr/yourfile.xfr"). Usually XFR stores mapping (transform) functions.

Three types of parallelism

1) Data parallelism - the data is divided into partitions, and copies of the same component process the different partitions simultaneously.
2) Component parallelism - different components execute simultaneously on different data.
3) Pipeline parallelism - sequentially connected components work on different records at the same time, so records flow through the pipeline continuously.

Multi-File System (MFS)

- m_mkfs - create a multifile system (m_mkfs ctrlfile partfile1 partfile2 ...)
- m_ls - list the contents of a multifile directory
- m_rm - remove a multifile
- m_cp - copy a multifile
- m_mkdir - add more directories to an existing multifile directory structure

(a small command-line sketch follows)
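A minimal command-line sketch of these commands (host and directory names are hypothetical; check the exact m_mkfs arguments for your Co>Operating System version):

    # create a 4-way multifile system: control directory plus 4 data partitions
    m_mkfs //host1/u01/mfs/mfs4way \
           //host1/u01/mfs/parts/p0 //host1/u01/mfs/parts/p1 \
           //host2/u01/mfs/parts/p2 //host2/u01/mfs/parts/p3

    # list, copy, and remove multifiles inside it
    m_ls //host1/u01/mfs/mfs4way
    m_cp //host1/u01/mfs/mfs4way/customers.dat //host1/u01/mfs/mfs4way/customers.bak
    m_rm //host1/u01/mfs/mfs4way/customers.bak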

Memory requirements of a graph

- Each partition of a component uses roughly 8 MB plus its max-core (if any).
- Add the size of the lookup files used in the phase (counted once if a multifile lookup is loaded once).
- Multiply by the degree of parallelism, and add up every component that runs in that phase.
- Select the largest-memory phase in the graph; that is the graph's peak requirement (a worked example follows).
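An illustrative calculation with made-up numbers: a phase with 3 components running 4 ways parallel needs about 3 x 4 x 8 MB = 96 MB of base memory; if one of those components is a Sort with a 64 MB max-core, add 4 x 64 MB = 256 MB; a 50 MB lookup file loaded once adds another 50 MB, for roughly 400 MB in that phase.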

How to calculate a SUM

- ROLLUP
- SCAN
- SCAN WITH ROLLUP
- Scan followed by Dedup Sorted, keeping the last record

Dedup Sorted with a null key

If we don't specify any key in the Dedup Sorted component, the whole input is treated as one group and the output depends on the keep parameter:

- first - only the first record is kept
- last - only the last record is kept
- unique_only - there will be no records in the output (no record is unique)

Join on a partitioned flow

file1 has fields (A,B,C) and file2 has fields (A,B,D). We partition both files and join them on "A,B" - should we partition by "A,B"? The original answer is not clear. In general it is enough to partition both inputs by a prefix of the join key (for example by "A"), since matching records still land in the same partition; partitioning by the full key "A,B" usually gives a more even distribution.

Checkin, checkout

You can do checkin/checkout using the wizard right from the GDE.

How to have different passwords for QA and production

Parameterize the .dbc file, or use environment variables for the connection settings, so each environment resolves to its own credentials.

How to get records 50-75 out of 100

- use Scan and Filter
- m_dump <dml> <mfs file> -start 50 -end 75 (see the sketch below)
- use the next_in_sequence() function and Filter by Expression (next_in_sequence() >= 50 && next_in_sequence() <= 75)
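A quick command-line sketch of the m_dump approach (the DML and data file names are hypothetical; the -start/-end options are as quoted above):

    # print only records 50 through 75 of a multifile
    m_dump $AI_DML/customers.dml $AI_MFS/customers.dat -start 50 -end 75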

How to convert a serial file into MFS

Create the multifile system, then use a partition component (for example Partition by Key or Partition by Round-robin) to write the serial data into the multifile.

Project parameters vs. sandbox parameters

When you check out a project into your sandbox, you get a copy of the project parameters; within your sandbox you can refer to them as sandbox parameters.

Bad-Straight-flow

The error you get when connecting mismatching components (for example, connecting a serial flow straight to an MFS flow without using a partition component).

Merging graphs

You cannot merge two Ab Initio graphs directly. You can use the output of one graph as the input of another, and you can also copy/paste components between graphs.

Partitioning, re-partitioning, departitioning

- partitioning - dividing a single flow of records into multiple parallel flows
- departitioning - removing the partitioning (gathering or merging the parallel flows back into one)
- re-partitioning - changing the number of partitions (or the partitioning key) of an already partitioned flow

Lookup file

For large amounts of data use an MFS lookup file (instead of a serial one) together with lookup_local.

Indexing

There are no indexes as such in Ab Initio. But there is an "output index" feature in some components (for example Reformat) that directs records to particular output ports from the transform part.

Environment project

A special public project that exists in every Ab Initio environment. It contains all the environment parameters required by the private and public projects that include the Environment project.

Aggregate vs Rollup

Aggregate is the older component. Rollup is newer and extended, and is recommended instead of Aggregate (built-in functions like sum, count, avg, min, max, product; plus you can write your own transform).

EME, GDE, Co>Operating System

- EME = Enterprise Meta>Environment. A version-controlled repository (functions: version control, statistical analysis, dependency analysis). It sits on the server and stores graphs, transformations, config info, and source and target metadata; you check code in and out of it. The /Projects directory of the EME contains the projects to which sandboxes are connected. It also helps in dependency analysis. Use air commands to manipulate repository objects.
- GDE = Graphical Development Environment (on the developer's desktop), used to build and run graphs.
- Co>Operating System = the Ab Initio server software installed on the host machines on top of the native operating system.

Fencing

Fencing means controlling jobs on a priority basis. In Ab Initio it usually refers to customized phase breaking, so that a high-volume process does not end up deadlocked with other processes.

- Fencing - changing the priority of a job.
- Phasing - managing resources to avoid deadlocks, for example by limiting the number of simultaneous processes (breaking the graph into phases, only one of which can run at a time).

Continuous components

Continuous components produce usable output while the graph is still running (they process a continuous stream of input). Examples: Continuous Rollup, Continuous Update, Batch Subscribe.


Deadlock

A deadlock occurs when two or more processes are each waiting for a resource held by the other. To avoid it, use phasing and careful resource planning (flow buffering).

Environment variables

- AB_HOME - where the Co>Operating System is installed
- AB_AIR_ROOT - default location of the EME datastore
- standard environment parameters of sandboxes: AI_SORT_MAX_CORE, AI_HOME, etc.
- to list them from the unix prompt: env | grep AI (see the sketch below)
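A small shell sketch for inspecting these settings (standard unix shell assumed):

    # where is the Co>Operating System installed?
    echo $AB_HOME

    # list all Ab Initio related variables
    env | grep -E '^(AB_|AI_)'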


Wrapper script

A unix (ksh) script used to run graphs from the command line (see the examples at the end of this document).

Multistage component

A multistage component is a component whose transform is carried out in five stages (1. input select, 2. temporary initialization, 3. processing, 4. output selection, 5. finalize). So it is a transform component that uses packages. Examples: Scan, Normalize, Denormalize Sorted, Rollup.

Dynamic DML

Dynamic DML is used when the input metadata can change: at different times different input files are received, each with a different DML. In that case we can use a flag in the file (or its name) to identify which input file was received, and accordingly the corresponding DML is used.

Fan-in, fan-out

- fan-out - a partition component (increases the number of parallel flows)
- fan-in - a departition component (decreases the number of parallel flows)

Lock

A user can lock a graph for editing so that other users see that it is locked and cannot edit the same graph at the same time.

Join vs Lookup

Lookup is good for speed with small files (the lookup file is loaded into memory). For large files use Join. You may need to increase the max-core limit to handle big in-memory joins.

Multi Update

Multi Update executes SQL statements; it treats each input record as a completely separate piece of work.

Scheduler

- We can use Autosys, Control-M, or any other scheduler.
- We can take care of dependencies in the scheduler: if scripts should run sequentially, we can chain them as dependent jobs in Autosys, or we can create a wrapper script that runs the commands sequentially (nohup command1.ksh; nohup command2.ksh; etc.). We can even use Plan>It in Ab Initio to execute individual scripts as tasks.

API and Utility modes in Input Table

These are the two database interface modes: API mode works through SQL statements via the database API, while Utility mode uses whatever bulk load/unload utilities the vendor provides.

Lookup file

- The Lookup File component. Functions: lookup, lookup_count, lookup_next, lookup_match, lookup_local.
- Lookup files are always used in combination with other components (referenced from their transform functions).
Calling a stored procedure in the DB

You can call a stored procedure (for example, from the database components); you can even write the SP call in Ab Initio. Keep it efficient so it does not hurt performance.

Frequently used functions

string_ltrim, string_lrtrim, string_substring, now()

Data validation functions

is_valid, is_null, is_blank, is_defined

Driving port

When joining inputs (in0, in1, ...), one of them is designated the driving input (by default, in0). The driving input is usually the largest one. The smallest input can have its "sorted-input" parameter set to "Input need not be sorted", because it will be loaded completely into memory.

Ab Initio vs Informatica for ETL

Ab Initio benefits: parallelism is built in, multifile systems handle very large amounts of data, and graphs are easy to build and run. The generated scripts can be easily modified as needed (if something could not be done in the GDE itself). The scripts can be easily scheduled with any external scheduler and easily integrated with other systems.

Ab Initio doesn't require a dedicated administrator.

Ab Initio doesn't have built-in CDC capabilities (Change Data Capture).

Ab Initio lets you attach error / reject files to each component, so you can capture and analyze the message and data separately per component (unlike Informatica, which has just one huge log). Ab Initio also shows run-time metrics for each component.

Override key

The override-key option is used when we need to match records on fields that have different field names in the inputs.

Control file

The control file of a multifile should be in the multifile directory; it is the top-level file that points to the data partitions (the underlying serial files).

max-core

The max-core parameter (for example, 100 MB for Sort) specifies the maximum amount of memory a component (like Sort or Rollup) may use per partition before spilling to disk. Usually you don't need to change the default value. Setting it too high may degrade performance because of OS swapping and the degradation of other components.

Input parameters

Graph > select the Parameters tab > click "Create" to define a parameter. Usage: $paramname (Edit > Parameters). The value is substituted at run time. You may need to declare the parameter's scope as formal if it is supplied from outside the graph.
Error trapping

Each component has reject, error, and log ports. Reject captures the rejected records, Error captures the corresponding error messages, and Log captures the execution statistics of the component. You can control the error tolerance of each component by setting its reject-threshold parameter ("Never abort", "Abort on first reject") or by setting ramp/limit. You can also use the force_error() function in a transform function.


How to see resource usage

In the GDE go to View > Tracking Details: it will show CPU time, memory usage, records processed, etc., per component.

Assign Keys component

Easy to use and saves development time, but you need to understand how it generates the keys, and you can't control it easily.

Join in DB vs join in Ab Initio

- Scenario 1 (preferred): we run a query which joins the tables in the database and gives us the result in just one DB component.
- Scenario 2 (much slower): we use two database components to unload the tables and join them in Ab Initio.

Join with DB

Not recommended if the number of records is big. It is better to unload the data from the database and then join in Ab Initio.

Data skew

A parameter showing how unevenly data is distributed between partitions.

skew = (partition size - average partition size) * 100 / (size of the largest partition)
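A toy calculation under the formula above (numbers are made up): if four partitions hold 40, 30, 20 and 10 MB (average 25 MB), the skew of the largest partition is (40 - 25) * 100 / 40 = 37.5%, while a perfectly balanced distribution would show 0% skew.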

dbc vs cfg

- .dbc - database configuration file (database name, nodes, version, character set, etc.). It resides in the db directory of the sandbox.
- .cfg - any other type of config file; for example, the configuration of a remote connection (name of the remote server, user/password to connect to the database, location of the files on the remote machine, connection method). The .cfg file resides in the config directory.

Compilation errors

"Depth not equal", "data format error", etc.

Depth error: we get this error when two connected components don't match in their layout (depth of parallelism).

Types of partition components

Broadcast, Partition by Expression, Partition by Round-robin, Partition by Key, Partition with Load Balance, Partition by Percentage, Partition by Range.

Unused port

When joining, matched (used) records go to the output port, while unmatched (unused) records go to the corresponding unused port.
Tuning performance

- Go parallel using partitioning; Round-robin partitioning gives a good load balance.
- Use the Multi-File System (MFS).
- Use Ad Hoc MFS to read many serial files in parallel with a single component.
- Once data is partitioned, do not bring it back to serial and then parallel again; repartition instead.
- Do not access large files via NFS; use the FTP components instead.
- Use lookup_local rather than lookup (especially with partitioned data).
- Use Rollup and Filter as early as possible to reduce the number of records. Ideally do it in the source (database) before the data reaches the graph.
- Remove unnecessary components. For example, instead of a Filter by Expression you can often implement the same logic in a Reformat, Join, or Rollup transform. Another example: when joining data from 2 files, drop or filter fields in the Join itself instead of adding an additional component for that.
- Use Gather instead of Concatenate.
- It is faster to do a sort after a partition than to do a partition after a sort.
- Try to avoid using the Join with DB component.
- When getting data from a database, make sure you use it efficiently (filters, indexes, etc.). If possible, do the necessary selection and aggregation in the database before getting the data into Ab Initio.
- Tune max-core for optimal performance (for example, for an in-memory sort it should be larger than the input file).
- Note: if an in-memory join cannot fit its non-driving inputs within the max-core limit, it will drop all the inputs to disk, and the in-memory setting no longer makes sense.
- Using phase breaks lets you allocate more memory to individual components, thus improving performance.
- Use a checkpoint after a sort to land data on disk.
- Use the Join and Rollup in-memory feature.
- When joining a very small dataset to a very large one, it is more efficient to broadcast the small dataset to the MFS using the Broadcast component, or to use the small file as a lookup. But for a large dataset don't use Broadcast as a partitioner.
- Use the Ab Initio layout instead of the database default to achieve better parallelism.
- Change the AB_REPORT parameter to an increased monitoring interval.
- Use catalogs for reusability.
- Components like Join/Rollup should have the "Input must be sorted" option if they are placed after a Sort component.
- Minimize the number of Sort components. Minimize the use of sorted Join/Rollup components and, if possible, replace them with their in-memory versions. Carry only the required fields through Sort, Reformat, and Join components. Use "Sort within Groups" instead of a full Sort when the data is already sorted on the major key.
- Use phasing/flow buffers in case of merge- or sort-joins.
- Minimize the use of regular-expression functions (like re_index) in transfer functions.
- Avoid repartitioning data unnecessarily. When splitting records into more than two flows, use Reformat rather than Broadcast.
- For combining records from 2 flows, use Concatenate only when there is a need to follow a specific order; if no order is required, it is preferable to use the Gather component.
- Instead of putting many Reformat components in sequence, use the output-indexes parameter in the first Reformat component and put the condition there.

Delta table

- Delta table - maintains the sequence of changes for each data record.
- Master (or base) table - the table on top of which the delta changes are applied.

Scan vs Rollup

Rollup performs aggregate calculations on groups (one output record per group); Scan produces cumulative (running) totals, one output record per input record.

Packages

Packages are used in multistage components and other transform components (they hold shared types and functions).

Reformat vs "Redefine Format"

- Reformat - derives new data by adding/dropping fields and changing values (runs a transform).
- Redefine Format - only renames fields / reinterprets the record format without changing the data.

Conditional DML

DML in which the record format is chosen (separated) based on a condition.

Sort within Groups

The prerequisite for using Sort within Groups is that the data is already sorted by the major key. Sort within Groups outputs each group, sorted on the minor key, as soon as it finishes reading that major-key group. It works like an implicit per-group sort phase.

Passing a condition as a parameter

Define a formal keyword parameter of type string, for example FilterCondition, and suppose you want it to do filtering on COUNT. In the graph, in your Filter by Expression component, enter the condition as $FilterCondition.

Now on your command line (or in the wrapper script) supply the condition:

YourGraphname.ksh -FilterCondition 'COUNT > 0'

Passing a file name as a parameter

#!/bin/ksh

# Run the project set-up script for the environment
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR

# Export the script parameters to INPUT_FILE_PARAMETER_1 and _2
if [ $# -eq 2 ];
then
    INPUT_FILE_PARAMETER_1=$1
    INPUT_FILE_PARAMETER_2=$2

    cd $AI_RUN
    # This graph is using the first input file
    ./my_graph1.ksh $INPUT_FILE_PARAMETER_1
    # This graph is using the second input file
    ./my_graph2.ksh $INPUT_FILE_PARAMETER_2
    exit 0;
else
    echo "Insufficient parameters"
    exit 1;
fi
-------------------------------------
#!/bin/ksh

# Run the project set-up script for the environment
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR

# Export the script parameter to INPUT_FILE_NAME
export INPUT_FILE_NAME=$1

cd $AI_RUN
# This graph is using the input file
./my_graph1.ksh

# This graph also is using the input file
./my_graph2.ksh

exit 0;

How to remove header and trailer lines?

Use conditional DML, where you can separate the detail records from the header and trailer. For validations, use a Reformat with output count = 3 (out0: header, out1: detail, out2: trailer).
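Outside the graph, a quick shell equivalent (not the conditional-DML method above, just a unix alternative for simple serial files; the file name is hypothetical):

    # drop the header (first line) and the trailer (last line)
    sed '1d;$d' input_file.dat > detail_only.dat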

How to create a multifile system on Windows

- First method: in the GDE go to Run > Execute Command and run m_mkfs, e.g. m_mkfs c:\control c:\dp1 c:\dp2 c:\dp3 c:\dp4
- Second method: double-click on the file component, then double-click on Partitions; there you can enter the partition locations.

Vector

A vector is simply an array: an ordered set of elements of the same type (the type can be any type, including a vector or a record).

Dependency analysis

Dependency analysis answers questions about data lineage: where does the data come from, what applications produce and depend on this data, and so on.


Surrogate key

There are many ways to create a surrogate key. For example, use the next_in_sequence() function in your transform. Or you can use the "assign key values" component. Or you can write a stored procedure in the database.

Note: if you use partitions, then do something like this:

(next_in_sequence() - 1) * number_of_partitions() + this_partition()
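A quick numeric check of that formula (made-up values): with 4 partitions, the 3rd record of the partition whose this_partition() value is 2 gets (3 - 1) * 4 + 2 = 10, so different partitions can never generate the same key.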

.abinitiorc

This is a configuration file for Ab Initio, located in the user's home directory (and also in $AB_HOME/Config). It sets the Ab Initio home path, configuration variables (AB_WORK_DIR, AB_DATA_DIR, etc.), login info and login methods for the hosts used for execution (EME host, connection method, etc.).

.profile

Your ksh init file (environment variables, aliases, path variables, command-prompt settings, etc.).

data mapping, data modelling

How to execute the graph

From the GDE: the whole graph, phase by phase, or from a checkpoint. (A command-line sketch follows.)
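A minimal command-line sketch, reusing the wrapper-script convention from the examples earlier in this document (the graph script name is hypothetical):

    cd $AI_RUN          # the run directory of the deployed sandbox
    ./my_graph1.ksh     # execute the deployed graph script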

Write Multiple Files

A component which allows writing simultaneously into multiple output files.

Testing

Run the graph and check the results. Use components from the Validate category to verify the output.

Sandbox vs EME

A sandbox is your private area where you develop and test; only one version of the code can be in the sandbox at any time. The EME is the central repository that keeps all the versions of the code that have been checked into it (so it provides version control).

Layout

The layout describes where the data files are and where the components run, and whether the data is serial or partitioned (multifile). The layout is determined by the location of the data file (or the control file for a multifile). In the graph the layout usually propagates automatically (for a multifile you have to provide the details of the partitions).

Latest versions

April 2009: GDE ver. 1.15.6, with the matching Co>Operating System release.

Graph parameters

Menu Edit > Parameters allows you to specify parameters private to the graph. They can be of two types: local and formal.

Plan>It

You can define pre- and post-processes and triggers. You can also define tasks that run on success or on failure of the graphs.

Frequently used components

- input file / output file
- input table / output table
- lookup / lookup_local
- reformat
- gather / concatenate
- join
- run sql
- join with db
- compression components
- filter by expression
- sort (single or multiple keys)
- rollup
- trash
- partition by expression / partition by key

Running on hosts (GDE and Co>Operating System)

The Co>Operating System is layered on top of the native OS (unix, etc.). The GDE generates a script (according to the "run" settings); the Co>Operating System executes the scripts on the different machines (using the specified connection methods, like rexec, telnet, rsh, rlogin) and brings the status/return codes back.

Conventional loading vs direct loading

This is basically an Oracle question regarding SQL*Loader. Conventional load uses insert statements: all triggers fire, all constraints are checked, and all indexes are updated. Direct load writes data directly block by block (it can load into a specific partition); only some constraints are checked, indexes may be maintained afterwards, and there are native options to skip index maintenance.

Semi-join

In Ab Initio there are 3 types of joins: inner join, outer join, and semi-join. The join type is controlled by the record_requiredN parameter of each input port:

- for an inner join, record_requiredN is true for all the "in" ports;
- for an outer join, it is false for all the "in" ports;
- for a semi-join, it is true for the required input(s) and false for the other inputs.

http://www.geekinterview.com/Interview-Questions/Data-Warehouse/Abinitio/page10
