Micro Design - ABC Framework Ver 1.3
For: ABC Framework Project
Owner: IBM
Creation Date: 20/05/12
Last Updated: 18/09/12
Version: V1.3
Authors: IBM
Approval List
Serial# Name Position Sign-off Status Sign-off Date

Distribution List
Serial# Name Email ID Position
1 INTRODUCTION
2 PROGRAM SPECIFICATION
2.1 Tables Used for ETL Framework and ABC Verification
2.2 Informatica ETL Objects Created for ABC Checks
3 UNIX SHELL SCRIPTS
3.1 Autosys Dependency Jobs
4 UTILITIES
1 Introduction
The purpose of this document is to provide detailed information regarding the ETL process control framework and the audit-balance-check (ABC) functionality it supports.
2 Program Specification
As part of the process flow, the ETL framework will perform audit, balance, and control checks for all source systems, irrespective of whether the source is a flat file or a relational database.
Data loading will be audited depending on the type of interface between the source system and the landing area:
For systems where files are provided on the ETL server, audit-balance-checks will be done using shell scripts.
For systems where files are pulled via SFTP from the source system server, audit-balance-checks will be done using shell scripts.
For systems where data is pulled from the source system database using Informatica mappings, audit-balance-checks will be done using Informatica worklets.
The ABC framework will maintain the job stream and job execution history, together with the results of the ABC checks performed, in the set of metadata and transaction tables listed below.
SYSTEM_CONFIG_METADATA – This table is used to maintain the metadata related to the various folders and FTP location details for each source system. Data from this table is utilized during job execution to set Informatica parameters for source and target files, log files, bad files, etc.
GLOBAL_CONFIG – This table is used to maintain the metadata related to success/failure email IDs/group IDs and other parameters, such as the waiting period for which a dependent process will search for parent job completion.
JOB_STREAM_METADATA – This table is used to hold the metadata for job streams. Appropriate values need to be populated in this table for each job stream during code deployment.
OBJECT_DATA_METADATA – This table is used to contain the metadata related to each job and the table it loads. It will hold metadata such as the source object name, target object name, source object pattern, acquisition type, acquisition frequency, and the generic ABC checks applicable to each job name.
Generic ABC checks are COUNT, CHECKSUM, CNT_ZERO, CHKSUM_ZERO, CNT_CSM_SAME, RANGE and the business date check (see the BATCH_JOB_AUDIT result columns below).
ABC_CHECK_METADATA – This table is used to keep the metadata for each custom ABC check that would be performed against a job. These checks will be used primarily during the load from Staging to SOR and from SOR to Mart.
JOB_PARAM_DATA – This table is used to keep the data corresponding to all parameters and their values for all workflows/sessions. It contains data organized by section, e.g. "Global" and object-wise sections.
BATCH_JOB_STATUS – This table is used to keep the audit data for the execution of each job defined in the OBJECT_DATA_METADATA table.
BATCH_JOB_AUDIT – This table is used to keep the audit data for the results of the generic as well as custom ABC checks defined against each job in the OBJECT_DATA_METADATA table. The field ABC_CHECK_ID has been added to identify which ABC rule each record corresponds to.
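A minimal sketch of how an operator might inspect these tables, assuming they live in an Oracle schema reachable via sqlplus; the connect string and schema are placeholders, while the column names come from the data model descriptions below:

sqlplus -s edw_user/"$ORA_PASS"@EDWDB <<'EOF'
-- Run status of all jobs for one business processing date
SELECT job_stream_id, business_processing_date, job_run_status
FROM   batch_job_status
WHERE  business_processing_date = TO_DATE('15-MAY-12', 'DD-MON-YY');
EOF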
Control Framework
DDL.zip
Data Model
JOB_STREAM_METADATA:
Column Name Description
RUN_FREQUENCY Will hold values like 'D' for daily, 'W' for weekly, 'M' for monthly, etc.
BUSINESS_DATE Will hold the date to which the data belongs, in 'dd-MMM-yy' format.
JOB_STREAM:
Column Name Description
EDW_AREA Will hold values like 'STG' for staging, 'SOR' for enterprise data warehouse, 'MART' for the mart area.
BUSINESS_PROCESSING_DATE Will hold the date to which the data belongs, in 'dd-MMM-yy' format.
OBJECT_DATA_METADATA:
Column Name Description
SYSTEM_NAME Name of the source system, e.g. VP, PRIMA, LAS, LAG, etc.
SOURCE_OBJECT_NAME Name of the source entity (can be a table or flat file) for the job name.
TARGET_OBJECT_NAME Name of the target entity (can be a table or flat file) for the job name.
JOB_NAME Name of the job; for Informatica jobs it is the workflow name.
INPUT_FILE_PATTERN This field contains the pattern used by the SFTP mechanism to find which files need to be transferred from the remote server.
TARGET_OBJECT_TYPE Holds the object type of the target entity (FILE / TABLE).
CHKSUM_WHERECLAUSE This field will hold the WHERE clause applied to the checksum expression, e.g. WHERE RECORD_ID='2' (as done for HTSM in the VP system).
RANGE_WHERECLAUSE This field will hold the WHERE clause applied to the range expression.
HDR_CHECKSUM_DIV This field will hold the value of the checksum division if required for a file type (the default value is 1; for some file types like HTSM, ATH2 and ATH4 it is 100).
BATCH_JOB_STATUS:
Column Name Description
JOB_STREAM_ID Flows from the JOB_STREAM table based upon which job stream is being executed.
BUSINESS_PROCESSING_DATE Will hold the date for which the job stream/job is being run.
JOB_RUN_STATUS Will hold the status of the job; contains values like 'SUCCESS', 'FAIL', 'RUN', 'SKIP' depending upon job execution.
BATCH_JOB_AUDIT:
Column Name Description
JOB_STREAM_ID Flows from the JOB_STREAM table based upon which job stream is being executed.
OBJECT_ID The object ID associated with every job name being executed.
SOURCE_RECORD_EXTRACTED Actual number of source records extracted during the job run.
TARGET_RECORD_UPDATED Actual number of records updated in the target table during the job run.
RUN_DATE Contains the date for which the job is being run.
PREV_CNT Actual count value from the previous period's run for a job.
PREV_CHKSUM Actual checksum value from the previous period's run for a job.
CURR_CNT Actual count value of the current run for the same job.
CURR_CHKSUM Actual checksum value of the current run for the same job.
RANGE_MIN_VAL Holds the minimum value of the range for the job being run.
RANGE_MAX_VAL Holds the maximum value of the range for the job being run.
COUNT_RESULT Will hold 'SUCCESS' or 'FAIL' depending upon the status of the ABC 'COUNT' check for the job.
CHECKSUM_RESULT Will hold 'SUCCESS' or 'FAIL' depending upon the status of the ABC 'CHECKSUM' check for the job.
CNT_ZERO_RESULT Will hold 'SUCCESS' or 'FAIL' depending upon the status of the ABC 'CNT_ZERO' check for the job.
CHKSUM_ZERO_RESULT Will hold 'SUCCESS' or 'FAIL' depending upon the status of the ABC 'CHKSUM_ZERO' check for the job.
CNT_CSM_SAME_RESULT Will hold 'SUCCESS' or 'FAIL' depending upon the status of the ABC 'CNT_CSM_SAME' check for the job.
RANGE_RESULT Will hold 'SUCCESS' or 'FAIL' depending upon the status of the corresponding ABC 'RANGE' check for the job.
HEADER_CHKSUM Holds the checksum value shown in the header line of the flat file source.
HEADER_CNT Holds the record count shown in the header line of the flat file source.
HEADER_BUSINESS_DATE Holds the business date shown in the header line of the flat file source.
BUSINESS_DATE_CHK_RESULT Will hold 'SUCCESS' or 'FAIL' depending upon the status of the ABC business date check for the job.
ABC_CHECK_METADATA:
Column Name Description
MINIMUM_THRESHOLD Will hold the minimum threshold value for the target column used in the range expression.
MAXIMUM_THRESHOLD Will hold the maximum threshold value for the target column used in the range expression.
SOURCE_COUNT_EXPR Will hold the expression for the actual count value on the source side of the target table, e.g. COUNT(DISTINCT ACCT).
SOURCE_CHECKSUM_EXPR Will hold the expression for the actual checksum value on the source side of the target table, e.g. SUM(LOGO).
TARGET_COUNT_EXPR Will hold the expression for the actual count value on the target side of the target table, e.g. COUNT(DISTINCT ACCT).
TARGET_CHECKSUM_EXPR Will hold the expression for the actual checksum value on the target side of the target table, e.g. SUM(LOGO).
TARGET_RANGE_EXPR Will hold the expression for the target column whose range is to be determined, e.g. COUNT(DISTINCT ACCT).
OBJECT_ID Will hold the object ID for which the ABC check is being evaluated, e.g. 14 for ATH1.
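To illustrate the logic only (the actual custom checks run inside an Informatica mapping, described later), a hypothetical shell sketch of evaluating one ABC_CHECK_METADATA row; nzsql usage, table names and connection details are assumptions:

# Evaluate the source and target count expressions for one check row
SRC_CNT=$(nzsql -d EDW -A -t -c "SELECT COUNT(DISTINCT ACCT) FROM STG_VP_ATH1")   # SOURCE_COUNT_EXPR
TGT_CNT=$(nzsql -d EDW -A -t -c "SELECT COUNT(DISTINCT ACCT) FROM SOR_VP_ATH1")   # TARGET_COUNT_EXPR
# The COUNT_RESULT written to BATCH_JOB_AUDIT would then be:
if [ "$SRC_CNT" -eq "$TGT_CNT" ]; then COUNT_RESULT=SUCCESS; else COUNT_RESULT=FAIL; fi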
SYSTEM_CONFIG_METADATA:
Column Name Description
INBOUND_DIR Name of the directory where the source files are to be kept.
REJECT_DIR Directory for data files that are rejected after file validation.
ARCHIVE_DIR Directory for archiving data files that pass file validation, before the Informatica job runs for that file.
BAD_FILE_DIR Directory for files that are rejected during the Informatica load.
SESS_LOG_DIR Directory for the log files created during Informatica execution.
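As an illustration only, a per-system directory layout could be created as below; the base path follows the parameter file location used later in this document, and the directory names themselves are assumptions:

# Create the SYSTEM_CONFIG_METADATA directories for one source system (VP)
for d in inbound reject archive bad_files sess_logs; do
    mkdir -p "/edw/infa_shared/VP/$d"
done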
GLOBAL_CONFIG:
Column Name Description
FILE_WATCHER_WAIT_INTERVAL Will hold the waiting period (in seconds) for which an Autosys dependent job will search for the file indicating its parent job's completion.
JOB_PARAM_DATA:
Column Name Description
PARAM_SECTION Will hold the "SECTION" value for that job name, like "GLOBAL" or "[ORGNAME_STG_VP.WF:wf_stg_VP_ATH1.ST:s_m_stg_VP_ATH1]".
PARAM_VALUE Will hold the actual value of the parameter name to be used during workflow (job name) execution, like STG_VP_ATH1_JSID (to be dynamically replaced with an actual value like 1, 1001, etc.) for $$JOB_STRM_ID, or "s_m_stg_VP_ATH1.log" for $PMSessionLogFile.
(a) Mapping Details
Specification Type Specification
Description It compares values like the count and checksum against a column defined in the control file (generated from the header record of the source file) with the actual record count and checksum values for that specific column of the target table (once loaded from the source file/table).
Input Tables OBJECT_DATA_METADATA, actual STAGING table, BATCH_JOB_AUDIT
Sequence of this mapping Run once after each target table load from file.
Parameter/Variables $$TABLENAME → Name of the stage table
$$CHECKSUM_EXPR → Used to hold checksum expressions like SUM(ABS(col_name)) or SUM(col_name), etc.
$$ABCRESULT → Will be used later for generic ABC checks
$$JOB_NAME → Job name for which the ABC checks are being performed
$$JS_ID → Job stream ID for the job
$$JOB_RUN_ID → Current job run ID
$$SRCROW → Number of source records extracted; fetched from the Informatica metadata variable during the execution of the mapping
$$TGTINSROW → Number of target records inserted; fetched from the Informatica metadata variable during the execution of the mapping
$$BUS_DT_CTL_FILE → Value of the business date from the source control file
$$CNT_CTL_FILE → Value of the record count from the control file
$$CHKSUM_CTL_FILE → Value of the checksum column from the control file
$$WHERECLAUSE → For files like HTSM of the VP system, where $$WHERECLAUSE is needed to get the CHKSUM value
$$RANGE_EXPR → Used to hold the range expression
$$RANGE_WHERECLAUSE → Used to hold the WHERE clause for the range expression
$$CHECKSUM_DIV → Used to hold the CHECKSUM_DIV value by which the value provided in the header file is divided (as in the case of ATH2, ATH4, HTSM)
NB – A shortcut to this mapping would be created in each of the folders specific to each source system.
One worklet (mentioned below) needs to be created in each folder specific to a source system, and this worklet is to be added after each target entity load session of the workflow.
Session specific to source systems (wherever a flat file is the source, e.g. VP):
Specification Type Specification
Informatica server used to execute the mapping IS_ORGNAME_EDW_DEV
Session name s_m_READ_VP_CONTROL_FILE
Mapping name m_READ_VP_CONTROL_FILE
Session Type non-reusable
Session Log File <sessionname>.log
Parameter File /edw/infa_shared/ParamFiles/<system_name>_<Target_entity_Name>.prm
Source NA
Record Treatment Target Insert
Post_session_success_variable_assignments Assigns all the variables from the control file related to the job_name to the next session
Parameter File Entry for each workflow for the ABC Check Session:
Sample entry for the ABC check related to ATH2:
[ORGNAME_STG_VP.WF:wf_stg_VP_ATH2.WT:wklt_Generic_ABC_Check_STG]
$PMSessionLogFile=s_m_Generic_ABC_Check.log
$DBConnection_Src=CONN_EDW_NZ
$DBConnection_ABC=Oracle_repo
$$TABLENAME=STG_VP_ATH2
$$CHECKSUM_EXPR=SUM(ABS(MT_AMOUNT))
$$JS_ID=STG_VP_ATH2_JSID
$$JOB_RUN_ID=STG_VP_ATH2_JOBID
$$JOB_NAME=STG_VP_ATH2_JOBNM
$$CTRL_FILE=ATH2_20120515.ctl
$$COUNT_VALUE=0
$$CHECKSUM_VALUE=0
$$CHECKSUM_EXPR=VAR_CHKSUMEXPR
$$WHERECLAUSE=VAR_WHERECLAUSE
$$RANGE_EXPR=VAR_RANGEEXPR
$$RANGE_WHERECLAUSE=VAR_RANGEWHERECLAUSE
$$CHECKSUM_DIV=VAR_CHECKSUMDIV
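Once the parameter file is in place, the workflow can be started with pmcmd. A usage sketch, in which the domain name, user, and parameter file name are placeholders (the integration service, folder, and workflow names follow the conventions above):

pmcmd startworkflow -sv IS_ORGNAME_EDW_DEV -d Domain_EDW -u infa_user -p "$INFA_PASS" \
    -f ORGNAME_STG_VP -paramfile /edw/infa_shared/ParamFiles/VP_ATH2.prm \
    -wait wf_stg_VP_ATH2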
(b) Mapping Details
Specification Type Specification
Description It compares values like the count, checksum, and range defined by the source-side columns "SOURCE_COUNT_EXPR" and "SOURCE_CHECKSUM_EXPR" with the actual record counts and checksum values computed from the target-side columns "TARGET_COUNT_EXPR" and "TARGET_CHECKSUM_EXPR" for that specific column of the target table.
Input Tables ABC_CHECK_METADATA, actual SOR/MART table, BATCH_JOB_AUDIT
Sequence of this mapping Run once after each workflow (containing multiple target tables) load from file.
Session Details:
Specification Type Specification
Informatica server used to execute the mapping IS_ORGNAME_EDW_DEV or IS_ORGNAME_EDW_UAT
Session name s_m_Custom_ABC_Check
Mapping name m_Custom_ABC_Check
Session Type non-reusable
Session Log File <sessionname>.log
Parameter File /edw/infa_shared/ParamFiles/<system_name>_<Target_entity_Name>.prm
Source NA
Record Treatment Target Data driven
Pre_session_variable_assignment NA
NB – One worklet is to be defined for this "s_m_Custom_ABC_Check" session in each SOR/MART folder, based upon subject area. This worklet is to be attached after each SOR or MART workflow so that all the ABC checks (defined in ABC_CHECK_METADATA) are executed after each workflow (job) runs.
3 Unix Shell Scripts
Scripts.zip Backup_14Sep.zip
Predecessor job NA
Input Arguments SQL file, RPT file, variable number of arguments
SQL Files Used (if any) NA
Configuration File (if any) NA

Predecessor job NA
Input Arguments NA
SQL Files Used (if any) NA
Configuration File (if any) NA

Predecessor job NA
Input Arguments Input_Filename
SQL Files Used (if any) NA
Configuration File (if any) NA
3.1 Autosys Dependency Jobs:
Autosys cannot create a dependency between jobs in two separate flows. To overcome this constraint, we have the following strategy:
OBJECT_DEPENDENCY_METADATA:
Column Name Data Type
Parent_Job_Object_Id Integer
Child_Job_Object_id Integer
File_Watcher_Ind Char(1)
Process Description:
1. Create a touch ("file watcher") file [standard: <object_id>_BUSDATE.suc, e.g. 1_20120807.suc]. The touch file is created in the existing shell script <Job_afterrun.sh>.
2. Create a file watcher job to perform the following steps (see the sketch after this list):
1. Read the OBJECT_DEPENDENCY_METADATA table for the job to check whether a dependency exists for that object; if a dependency exists, check for the respective file watcher file.
2. If the file watcher file for the parent job is found in step 1, allow the child job to execute.
3. If the file watcher file for the parent job is not found in step 1, wait 30 seconds (configurable via the GLOBAL_CONFIG column FILE_WATCHER_WAIT_INTERVAL) and go back to step 1 to re-check.
3. Create a shell script to remove the file watcher file related to the parent job.
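A minimal sketch of the file watcher logic above, assuming a hypothetical helper get_parent_ids that reads OBJECT_DEPENDENCY_METADATA (e.g. via sqlplus) and prints one parent object ID per line; the .suc directory and script name are placeholders:

#!/bin/sh
# file_watcher.sh <child_object_id> <business_date>
OBJECT_ID=$1
BUSDATE=$2
WAIT_INTERVAL=${FILE_WATCHER_WAIT_INTERVAL:-30}   # sourced from GLOBAL_CONFIG in practice
SUC_DIR=/edw/infa_shared/filewatcher              # assumed location of the .suc touch files

for PARENT_ID in $(get_parent_ids "$OBJECT_ID"); do    # hypothetical metadata lookup
    SUC_FILE="$SUC_DIR/${PARENT_ID}_${BUSDATE}.suc"
    while [ ! -f "$SUC_FILE" ]; do                     # parent job not finished yet
        sleep "$WAIT_INTERVAL"                         # wait, then re-check
    done
done
exit 0   # all parent touch files found; the child job may run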
4 Utilities:
1. Bulkwriter – An awk script specifically written to change the writer for a target table from "Netezza Relational writer" to "Netezza Bulk writer". This utility is applied to the XML file (an exported Informatica workflow) and generates a changed XML file. The changed XML can then be imported in Informatica Workflow Manager and will contain the properties set to "Netezza Bulk writer". A usage sketch is given below.
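A hedged usage sketch; the awk script and workflow file names are placeholders:

# Rewrite the writer type in an exported workflow XML
awk -f bulkwriter.awk wf_stg_VP_ATH2.XML > wf_stg_VP_ATH2_bulk.XML
# Then import wf_stg_VP_ATH2_bulk.XML via Workflow Manager (or pmrep objectimport)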
2. Session Config – An awk script specifically written to change the following session properties. This utility is applied to the XML file (an exported Informatica workflow) and generates a changed XML file. The changed XML can be imported in Informatica Workflow Manager and will contain the updated session properties.
3. Parameter File Conversion – An awk script specifically written for uploading the "conventional parameter file" to the target table JOB_PARAM_DATA. This utility is applied to the text file (containing parameter name-value pairs) and generates a '~'-separated output file.
4. Parameter File Load – The converted file is loaded with the following SQL*Loader control file (see the invocation sketch below):
load data
infile './vp_template.prm.out'
append
into table tmp_job_param_data
fields terminated by '~' optionally enclosed by '"'
(filename, param_section, wf_name, param_name, param_value)
vp_template.prm.out is a '~'-separated flat file which is used in this second step.
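A hedged invocation sketch for the load above; the connect string and control file name are placeholders:

sqlldr userid=edw_user/"$ORA_PASS"@EDWDB \
       control=job_param_data.ctl \
       log=job_param_data.log bad=job_param_data.bad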
5. Repository Backup – This shell script is used to take a backup of the Informatica repository.
Usage:
sh repository_backup.sh
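The script itself is not included in this document; a minimal sketch of what repository_backup.sh might contain, using pmrep, with the repository name, domain, credentials, and backup directory as placeholders:

#!/bin/sh
# repository_backup.sh - back up the Informatica repository with pmrep
BACKUP_DIR=/edw/infa_shared/Backup
STAMP=$(date +%d%b)                                    # e.g. 14Sep, as in Backup_14Sep.zip
pmrep connect -r EDW_REPO -d Domain_EDW -n repo_admin -x "$REPO_PASS" || exit 1
pmrep backup -o "$BACKUP_DIR/Backup_${STAMP}.rep" -f   # -f overwrites an existing file
zip -j "$BACKUP_DIR/Backup_${STAMP}.zip" "$BACKUP_DIR/Backup_${STAMP}.rep"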