
Teradata Architecture:

3 main Components:

1. Parsing Engine -> Session Control, Parser, Optimizer, Dispatcher
2. BYNET -> BYNET 0 & BYNET 1
3. Access Module Processor (AMP)

 The PE checks the syntax of the query and the user's security rights.
 The PE then comes up with the best optimized plan for executing the query.
 The PE passes this plan through the BYNET to the AMPs.
 The AMPs follow the plan and retrieve the data from their disks.
 The AMPs then pass the data back to the PE through the BYNET.
 The PE then passes the data to the user.

1. GRANT CREATE FUNCTION ON APPL TO vkonara;
2. GRANT DROP FUNCTION ON APPL TO vkonara;
3. GRANT SELECT, INSERT, UPDATE ON APPL TO vkonara;
4. Create a user account:

CREATE USER vkonara AS
PERMANENT = 1000000 BYTES
PASSWORD = ***
TEMPORARY = 1000000 BYTES
SPOOL = 1000000 BYTES;

5. Create a table:

CREATE MULTISET TABLE APPL.EMPLOYEE ,FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
EmployeeNo INTEGER,
FirstName VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
LastName VARCHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
DOB DATE FORMAT 'YYYY-MM-DD',
JoinedDate DATE FORMAT 'YYYY-MM-DD',
DepartmentNo BYTEINT)
PRIMARY INDEX ( EmployeeNo );

6. Check data distribution across AMPs.

SELECT HASHAMP(HASHBUCKET(HASHROW(EmployeeNo))) AS "AMP#",
EmployeeNo, HASHROW(EmployeeNo), HASHBUCKET(HASHROW(EmployeeNo))
--,COUNT(*)
FROM EMPLOYEE
--GROUP BY 1
ORDER BY 2,1 DESC;

7. Check data distribution counts per AMP.

SELECT HASHAMP(HASHBUCKET(HASHROW(EmployeeNo))) AS "AMP#", COUNT(*)
FROM EMPLOYEE
GROUP BY 1
ORDER BY 2,1 DESC;

8. EXPLAIN SELECT * FROM EMPLOYEE;

This is without collecting stats:


1) First, we lock a distinct APPL."pseudo table" for read on a
RowHash to prevent global deadlock for APPL.EMPLOYEE.
2) Next, we lock APPL.EMPLOYEE for read.
3) We do an all-AMPs RETRIEVE step from APPL.EMPLOYEE by way of an
all-rows scan with no residual conditions into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with low confidence to be 2 rows (140 bytes).
The estimated time for this step is 0.03 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.

After collecting stats:

COLLECT STATISTICS COLUMN(EmployeeNo) ON Employee;

HELP STATISTICS employee;

EXPLAIN SELECT * FROM EMPLOYEE;

1) First, we lock a distinct APPL."pseudo table" for read on a
RowHash to prevent global deadlock for APPL.EMPLOYEE.
2) Next, we lock APPL.EMPLOYEE for read.
3) We do an all-AMPs RETRIEVE step from APPL.EMPLOYEE by way of an
all-rows scan with no residual conditions into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with high confidence to be 6 rows (420 bytes).
The estimated time for this step is 0.03 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.
9. How to check a user's access rights.

SELECT *
FROM dbc.allrights
WHERE username='vkonara'
AND databasename='APPL';

D – Delete, I – Insert, R – Retrieve (Select), SH – Show table/view, U – Update.

10. Find the Table Space Size of your table across all AMPs in Teradata

SELECT DATABASENAME, TABLENAME, SUM(CURRENTPERM)
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'APPL' AND TABLENAME = 'EMPLOYEE'
GROUP BY DATABASENAME , TABLENAME;

11. Create a non-unique secondary index (NUSI) on the employee table as below:

CREATE INDEX (FIRSTNAME) ON "APPL"."EMPLOYEE";

Now ask for the explain plan on the employee table with a WHERE condition on the NUSI column:

EXPLAIN SELECT * FROM "APPL"."EMPLOYEE" WHERE FIRSTNAME LIKE 'Vina%';



1) First, we lock a distinct APPL."pseudo table" for read on a
RowHash to prevent global deadlock for APPL.EMPLOYEE.
2) Next, we lock APPL.EMPLOYEE for read.
3) We do an all-AMPs RETRIEVE step from APPL.EMPLOYEE by way of an
all-rows scan with a condition of ("APPL.EMPLOYEE.FirstName LIKE
'Vina%'") into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with no confidence to be 6
rows (420 bytes). The estimated time for this step is 0.03
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.

After collecting stats on the table, run the explain plan again:

COLLECT STATISTICS ON "APPL"."EMPLOYEE" COLUMN FIRSTNAME;

EXPLAIN SELECT * FROM "APPL"."EMPLOYEE" WHERE FIRSTNAME LIKE 'Vina%';

1) First, we lock a distinct APPL."pseudo table" for read on a
RowHash to prevent global deadlock for APPL.EMPLOYEE.
2) Next, we lock APPL.EMPLOYEE for read.
3) We do an all-AMPs RETRIEVE step from APPL.EMPLOYEE by way of an
all-rows scan with a condition of ("APPL.EMPLOYEE.FirstName LIKE
'Vina%'") into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with high confidence to be
3 rows (210 bytes). The estimated time for this step is 0.03
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.
Example – compare index access paths using a volatile table:

CREATE MULTISET VOLATILE TABLE VT_STUDENT, NO LOG
(
STUDENT_ID INTEGER NOT NULL,
STUDENT_NAME VARCHAR(50),
STUDENT_AGE INTEGER,
STUDENT_COUNTRY VARCHAR(20)
)
UNIQUE PRIMARY INDEX(STUDENT_ID)
UNIQUE INDEX (STUDENT_NAME)
INDEX(STUDENT_AGE)
ON COMMIT PRESERVE ROWS;


Now run the query using UNIQUE PRIMARY INDEX

EXPLAIN SEL * FROM VT_STUDENT WHERE STUDENT_ID=1;

1) First, we do a single-AMP RETRIEVE step from DBC.VT_STUDENT by way
of the unique primary index "DBC.VT_STUDENT.STUDENT_ID = 1" with
no residual conditions. The estimated time for this step is 0.01
seconds.
-> The row is sent directly back to the user as the result of
statement 1. The total estimated time is 0.01 seconds.

Now run the query using UNIQUE SECONDARY INDEX


EXPLAIN SEL * FROM VT_STUDENT WHERE student_name='WILLIAM';

1) First, we do a two-AMP RETRIEVE step from DBC.VT_STUDENT by way of
unique index # 4 "DBC.VT_STUDENT.STUDENT_NAME = 'WILLIAM'" with no
residual conditions. The estimated time for this step is 0.01
seconds.
-> The row is sent directly back to the user as the result of
statement 1. The total estimated time is 0.01 seconds.

Now run the query using NON UNIQUE SECONDARY INDEX

EXPLAIN SEL * FROM VT_STUDENT WHERE student_age=12;

1) First, we do an all-AMPs RETRIEVE step from DBC.VT_STUDENT by way
of an all-rows scan with a condition of (
"DBC.VT_STUDENT.STUDENT_AGE = 12") into Spool 1 (group_amps),
which is built locally on the AMPs. The size of Spool 1 is
estimated with low confidence to be 4 rows (220 bytes). The
estimated time for this step is 0.03 seconds.
2) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.

Now run the query using a column which has no index defined on it.

EXPLAIN SEL * FROM VT_STUDENT WHERE STUDENT_COUNTRY='USA';

1) First, we do an all-AMPs RETRIEVE step from DBC.VT_STUDENT by way
of an all-rows scan with a condition of (
"DBC.VT_STUDENT.STUDENT_COUNTRY = 'USA'") into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with no confidence to be 1 row (55 bytes).
The estimated time for this step is 0.03 seconds.
2) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.

We have used a column that is not part of any index in the query. This results in an all-AMP fetch from the table, and the row estimate has NO confidence.

12. What is a pseudo table Lock in Explain Plan?

When you retrieve rows from a table, the very first step in the explain plan is the pseudo table lock on that table.

e.g. – Type select * from <databasename>.<tablename> in SQL Assistant and run the explain plan for this table. The very first steps you see are –
1) First, we lock a distinct <databasename>."pseudo table" for read on a
RowHash to prevent global deadlock for
<databasename>.<tablename>.
2) Next, we lock <databasename>.<tablename> for read.
We know that retrieving rows from the table requires the read lock we see placed in step 2, but what is the pseudo table lock in step 1?
We know that each AMP holds a portion of a table, and that when a full table scan is performed each AMP reads its portion of the table.
Now suppose two different users want to place multiple locks on the same table, and one user gets one lock while the other user gets another. Each user now requires the lock held by the other and would wait indefinitely for it, because both users are waiting for each other to release their locks. This is called a DEADLOCK.

A Pseudo Lock is how Teradata prevents a deadlock.


When a user runs an all-AMP operation, Teradata assigns a single AMP to command the other AMPs to lock the table. We can call this the "gatekeeper" AMP. This AMP is always responsible for locking that particular table on all AMPs, so all the users running an all-AMP query on the table have to report to this "gatekeeper" AMP for permission on locks.
The "gatekeeper" AMP never plays favorites and grants the locks on a first come, first served basis. The first user to run the query gets the lock; the others have to wait. In this way Teradata prevents the deadlock situation when an all-AMP operation is made in a query.

Note – Teradata selects this "gatekeeper" AMP by hashing the table name used in the select query and then matching the hash value in the hash map. The AMP number it gets from the hash map is assigned as the "gatekeeper" AMP.

Primary Index in Teradata:


The primary index is one of the most powerful features provided by Teradata. Every table in Teradata has a primary index (unless it is explicitly created without one), defined at the time of creating the table. If any change to the primary index needs to be implemented, one needs to drop the table and recreate it; a PI can't be altered or modified. It is the most preferred and important index for the reasons below:

 Data Distribution
 Known access path
 Improves Join performance

There are two types of Primary Indexes:

1. Unique Primary Index (UPI)
2. Non-Unique Primary Index (NUPI)

Let us now understand what exactly happens when you define a PI on a table.
Unique Primary Index (UPI):-
As the name suggests, a UPI allows only unique values, i.e. no duplicates are allowed. Access via a UPI is a one-AMP operation, and data distribution is even. A UPI column can contain one null value. Syntax for a Unique Primary Index:
CREATE TABLE sample_1
(col_a INT,
col_b VARCHAR(20),
col_c CHAR(4))
UNIQUE PRIMARY INDEX (col_a);
For example: we have an Employee table where EMP_NO is the primary index (we have chosen it because EMP_NO is unique for every employee).

Data distribution using UPI:-

Sample query:
INSERT INTO DBNAME.EMPLOYEE VALUES (011,'Wilson',20,'2010-10-26',5000);
When a user submits an insert query for a table with a primary index, the following processes occur:

1. The index value goes through the hashing algorithm, which produces a 32-bit row hash value, something like 0011 0011 0101 0101 0000 0001 0110 0001 for EMP_NO 011.
2. The first 20 bits of this 32-bit row hash determine the hash bucket, and the hash map (which contains about 1 million hash buckets) maps that bucket to the AMP on which the row will reside. For a 4-AMP system, suppose the hash map entry for our value points to AMP 4; that is where the row will reside.
3. The PE sends the row to that AMP with the row hash attached to it.
4. A uniqueness value is assigned to each row. As EMP_NO is unique for all rows, with a UPI the uniqueness value will be 1 for every row. This will become clearer when we study NUPI.
So this is how a row is distributed to an AMP. The same process applies for retrieval.
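The hashing chain described above can be inspected directly with Teradata's built-in hash functions. A minimal sketch (the literal 11 stands in for EMP_NO 011; the resulting AMP number will differ per system and configuration):

SELECT HASHROW(11) AS Row_Hash,                       -- 32-bit row hash
HASHBUCKET(HASHROW(11)) AS Hash_Bucket,               -- bucket taken from the first bits of the hash
HASHAMP(HASHBUCKET(HASHROW(11))) AS Target_AMP;       -- AMP assigned to that bucket by the hash map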
 
Non-Unique Primary Index (NUPI):-
A NUPI allows duplicate values. It can have any number of null values.
Syntax for NUPI:-
CREATE TABLE sample_1
(col_a INT,
col_b VARCHAR(20),
col_c CHAR(4))
PRIMARY INDEX (col_a);
We will take the same example to understand NUPI.

We have taken EMP_NAME as the NUPI column, as it can contain duplicate records. In our table the employee name Gary appears twice, i.e. it is a duplicate value. Now let us see what happens when the PE receives a duplicate value.
1. The same process of generating the row hash value is followed.
2. To differentiate between the duplicates, a uniqueness value is added to the hash value.
3. The uniqueness value makes each row distinct from all its duplicates. If we had one more employee named Gary, his uniqueness value would have been 3.
4. As the AMP is selected based on the hash value, all the duplicate values go to the same AMP.
5. Because the duplicates reside on the same AMP, a NUPI can lead to uneven distribution of data and may cause performance to degrade.

Together, the row hash and the uniqueness value make up the 64-bit ROW ID, which uniquely identifies each row on a given AMP.
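To see that all duplicates of a NUPI value land on the same AMP, a query along these lines can be run (a sketch reusing the example's EMPLOYEE table and EMP_NAME column):

SELECT HASHAMP(HASHBUCKET(HASHROW(EMP_NAME))) AS "AMP#",
EMP_NAME,
COUNT(*) AS Dup_Count   -- all duplicates of a name share one AMP
FROM EMPLOYEE
GROUP BY 1, 2
ORDER BY 1;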

Secondary Index in Teradata:


Before you start with SI, as a prerequisite you should first read about the Primary Index in Teradata.

So after learning about the primary index, the question is: when we already have UPI and NUPI, what is the use of this Secondary Index?
The best answer is that Secondary Indexes provide an alternate path to the data and should be used for queries that run many times.
Teradata runs extremely well without secondary indexes, but since secondary indexes use up space and overhead, they should only be used for "KNOWN QUERIES", i.e. queries that are run over and over again. Once you know the data warehouse environment, you can create secondary indexes to enhance its performance.
Syntax of creating Secondary Index
Syntax of UNIQUE SI:
CREATE UNIQUE INDEX (Column/Columns) ON <dbname>.<tablename >;
Syntax of NON-UNIQUE SI:
CREATE INDEX (Column/Columns) ON <dbname>.<tablename >;
Note – An SI can be created even after the table is populated with data, unlike a PI which is created only at table creation time. You can create and drop an SI at any time.
Whenever you create an SI on a table, Teradata creates a subtable on every AMP. This subtable contains the three columns given below –
1. Secondary index value
2. Secondary index row ID (the hashed value of the SI value)
3. Base table row ID (the row ID of the actual base row)
We will see the use of these values later in this post.
USI Subtable Example
When we define a UNIQUE SI on a table, Teradata immediately creates a USI subtable on each AMP for that table.

Remember that creating a subtable requires PERM space, so always be wise in choosing your SI.

Normally the best SI is the column or columns most often used in the WHERE clause.
Now I'll explain the in-depth architecture of subtable creation and SI retrieval for a better understanding of the concept.
Suppose we have an Employee table (base table) having attributes Emp, Dept, Fname, Lname and Soc_Security, and we define a USI on the column Soc_Security.
An SI subtable is created on each AMP holding information about the SI column and the corresponding base row ID (Base Table Row-ID), which is the row ID in the actual Employee table. The steps involved in loading this subtable are as follows –
1) Teradata first creates the subtable on all AMPs.
2) It then hashes the value of the USI column (Soc_Security) and, based on that hash value, it checks the hash map for the AMP number whose subtable will hold this USI value.
3) After getting the respective AMP number, the SI value along with the two other attributes (secondary index row ID and base table row ID) is stored in the subtable of that AMP.
In this way we populate the USI subtable on each AMP. As the SI column is UNIQUE, there is no duplication of SI values in any subtable; each row in the subtable is unique and will fetch only one row when we make a query on that SI column.
Note – As it is clear now that defining an SI requires the creation of a subtable, be aware that an SI has a space cost on the Teradata system.
Teradata retrieval of a USI query
Suppose on the above example we make the query –
SELECT * FROM Employee_Table WHERE Soc_Security = '123-99-8888';
When the TD optimizer finds a USI in the WHERE clause, it knows that it is a 2-AMP operation and that only one row will be returned. The steps it performs for retrieval are as follows –
1) It hashes the SI value ('123-99-8888') using the hashing algorithm.
2) It checks this hash value in the hash map and gets the AMP number from it; that AMP's subtable stores this SI value.
3) It goes to the Employee subtable on that AMP and retrieves the base row ID stored for that hash value.
4) This base row ID is sent back over the BYNET.
5) The base row ID is then used to fetch the resulting row from the base table, i.e. the row for which Soc_Security = '123-99-8888'.
As we have seen, the Teradata system requires 2 AMPs to reach the answer row, which is why we call a USI operation a 2-AMP operation. Even if the SI row resides on the same AMP as the base row, the base row ID obtained from the subtable is still used to start a second access based on that base row ID, so it is always called a 2-AMP operation.
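Putting this together, the USI from the example could be defined and verified as follows (a sketch; Employee_Table and Soc_Security are the example's names, and the EXPLAIN should report a two-AMP retrieval by way of the unique index):

CREATE UNIQUE INDEX (Soc_Security) ON Employee_Table;
EXPLAIN SELECT * FROM Employee_Table WHERE Soc_Security = '123-99-8888';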
NUSI Subtable Example
When we define a NUSI on a table, Teradata builds a subtable on each AMP in the same fashion as with a USI. The difference is that the NUSI subtable is AMP-local: each AMP builds a subtable that reflects and points to only the base rows it owns.

Suppose we have an Employee table (base table) on which we define a NUSI on the column Fname.
1) Teradata first creates the subtable on all AMPs.
2) Each AMP holds the secondary index values for the rows in its own portion of the base table only. In our example, each AMP holds the Fname values for all employee rows stored on that AMP (AMP-local).
3) Each AMP-local Fname entry carries the Base Table Row-ID (a pointer) so the AMP can retrieve the base row quickly if needed. If an AMP contains duplicate first names, only one subtable row is built for that name, with multiple base row IDs. For example, for Fname = 'John' the subtable holds multiple base row IDs for this value.
Teradata retrieval of a NUSI query
Suppose on the above example we make the query –
SELECT * FROM Employee_Table WHERE Fname = 'John';
When a NUSI (Fname) is used in the WHERE clause of an SQL statement, the PE optimizer recognizes the non-unique secondary index and performs an all-AMP operation to look in the subtables for the requested value. The steps it performs for retrieval are as follows –
1) It hashes the NUSI value ('John') using the hashing algorithm.
2) It instructs all AMPs to look for this hash value in their Employee subtables. Note that unlike with a USI there is no hash map lookup, because each AMP's subtable contains entries for its own base rows only. So this hash lookup is performed on every AMP's subtable.
3) Any AMP which doesn't have this hash value no longer participates in the operation.
4) When the hash value is found, the corresponding base row IDs are fetched from the subtable and used for the actual retrieval of the rows.
The point to note here is that a NUSI operation is not the same as an FTS (full table scan).
Suppose Fname were not a NUSI and we made a query on Fname in the WHERE clause. In that case every row of the Employee table would have to be read into spool, where we match the value given in the WHERE clause against the spooled rows.
With Fname defined as a NUSI, the TD optimizer already knows that this column is a NUSI and that it is already organized by value in the subtable on each AMP. So it skips the full scan and directly matches the value in each subtable.
The PE decides whether a NUSI is strongly selective and worth using over a full table scan, so it is advisable to always COLLECT STATS on NUSI columns. You can check EXPLAIN to see whether a NUSI is being utilized or a full table scan is taking place.
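A sketch of the same flow for the NUSI example, including the recommended stats collection (Employee_Table and Fname are the example's names):

CREATE INDEX (Fname) ON Employee_Table;
COLLECT STATISTICS ON Employee_Table COLUMN (Fname);
EXPLAIN SELECT * FROM Employee_Table WHERE Fname = 'John';  -- should show the NUSI being used rather than an FTS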
Secondary Index Summary
1) You can have up to 32 secondary indexes on a table.
2) Secondary indexes provide an alternate path to the data.
3) The two types of secondary indexes are USI and NUSI.
4) Every secondary index defined causes each AMP to create a subtable.
5) USI subtables are hash distributed.
6) NUSI subtables are AMP-local.
7) USI queries are two-AMP operations.
8) NUSI queries are all-AMP operations, but not full table scans.
9) Always collect statistics on all NUSI indexes.

Partition Primary Index – Basics :

A partitioned primary index (PPI) physically splits a table into a series of subtables. With proper use of a partitioned primary index we can save queries from time-consuming full table scans: instead of scanning the full table, only one particular partition is accessed.

Follow the example below to get an insight into PPI –


We have an order table (ORDER_TABLE) with two columns – Order_Date and Order_Number – in which the PI is defined on Order_Date. The primary index (Order_Date) is hashed and rows are distributed to the proper AMP based on the row hash value, then sorted by Row ID.

Now when we execute the query –

SELECT * FROM Order_Table WHERE Order_Date BETWEEN DATE '2003-01-01' AND DATE '2003-01-31';

This query will result in a full table scan despite Order_Date being the PI, because PI access works only for equality conditions, not ranges.
Now we define a PPI on the column Order_Date. The primary index (Order_Date) is still hashed and rows are distributed to the proper AMP based on the row hash value, but within each AMP the rows are sorted by Order_Date and not by Row ID.

Now when we execute the query –

SELECT * FROM Order_Table WHERE Order_Date BETWEEN DATE '2003-01-01' AND DATE '2003-01-31';

This query will not result in a full table scan, because all the January orders are kept together in their partition.

Partitions are usually defined based on Range or Case as follows.


Partition by CASE
CREATE TABLE ORDER_Table (
ORDER_ID INTEGER NOT NULL,
CUST_ID INTEGER NOT NULL,
ORDER_DATE DATE ,
ORDER_AMOUNT INTEGER
)
PRIMARY INDEX (CUST_ID)
PARTITION BY CASE_N (
ORDER_AMOUNT < 10000 ,
ORDER_AMOUNT < 20000 ,
ORDER_AMOUNT < 30000,
NO CASE OR UNKNOWN ) ;

Partition by RANGE

CREATE TABLE ORDER_Table
(
ORDER_ID INTEGER NOT NULL,
CUST_ID INTEGER NOT NULL,
ORDER_DATE DATE ,
ORDER_AMOUNT INTEGER
)
PRIMARY INDEX (CUST_ID)
PARTITION BY RANGE_N (
ORDER_DATE BETWEEN DATE '2012-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH,
NO RANGE OR UNKNOWN ) ;

If we use NO RANGE or NO CASE, then all values not falling into any defined range/case go into a single partition.
If we specify UNKNOWN, then all null values are placed in that partition.
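With the RANGE_N definition above, a date-bounded query only has to touch the matching monthly partition instead of scanning the whole table, for example:

SELECT * FROM ORDER_Table
WHERE ORDER_DATE BETWEEN DATE '2012-01-01' AND DATE '2012-01-31';  -- January partition only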

Advantages of Partitioned Primary Index –

 Partitioned primary index is one of the unique features of Teradata; it distributes rows into partitions so that they can be retrieved much faster than with a conventional approach.
 Maximum partitions allowed by Teradata – 65,535.
 It reduces the overhead of scanning the complete table (FTS), thus improving performance.
 In PPI tables a row is hashed normally on the basis of its PI, but the row is stored on the AMP within its respective partition. That is, rows are sorted first by their partition column and then, inside each partition, by their row hash.
 Usually PPIs are defined on a table to increase query efficiency by avoiding full table scans, without the overhead and maintenance costs of secondary indexes.
 Deletes on a PPI table (e.g. dropping whole partitions) are much faster.
 For range-based queries we can often remove the SI and use the PPI instead, saving the overhead of the SI subtable.
Disadvantages of Partitioned Primary Index –
 PPI rows are 2 bytes longer, so the table uses more PERM space.
 If an SI is defined on a PPI table, the SI subtable also grows by 2 bytes for each referencing row ID.
 PI access can be degraded if the partitioning column is not part of the PI. For example, a query specifying a PI value but no value for the partitioning column must look in each partition of the table, losing the advantage of using the PI in the WHERE clause.
 Joins of a PPI table to non-partitioned tables may be degraded. If one of the tables is partitioned and the other one is not, a sliding-window merge join takes place.
 The PI can't be defined as UNIQUE when the partitioning column is not part of the PI.

Collect Statistics in Teradata:

There are many ways to generate a query plan for a given SQL, and collecting statistics
ensures that the optimizer will have the most accurate information to create the best access
and join plans.
The optimizing phase of Teradata makes decisions on how to access table data. These decisions can be very important when table joins (especially those involving multiple joins) are required by a query. By default, the Optimizer uses approximations of the number of rows in each table (known as the cardinality of the table) and of the number of unique values in indexes in making its decisions. To build such estimates, the Optimizer samples a random AMP, and it is possible for the estimates to be significantly off. This can lead to poor choices of join plans and associated increases in the response times of the queries involved.
One way to help the Optimizer make better decisions is to give it more accurate information
as to the content of the table. This can be done using the COLLECT STATISTICS statement.
When the Optimizer finds that there are statistics available for a referenced table, it will use
those statistics instead of using estimated table cardinality or estimated unique index value
counts.
Stats should be collected mainly under the below circumstances:
1. A rule of thumb is to collect statistics when the data has changed by 10% (that would be 10% more rows inserted, or 10% of the rows deleted, or 10% of the rows changed, or some combination).
2. The range of values for an index or column of a table for which statistics have been
collected has changed significantly. Sometimes one can infer this from the date and time the
statistics were last collected, or by the very nature of the column (for instance, if the column
in question holds a transaction date, and statistics on that column were last gathered a year
ago, it is almost certain that the statistics for that column are stale).
How are stats built for a table?
TD builds the uniqueness count for each identified column / set of columns for the
completed table/partition data and stores the information in the DBC tables.
Whenever the stats are collected later, the previously collected information is lost and fresh
stats are updated in the DBC tables.
The time taken to collect stats doesn’t depend on how frequently the stats have been
collected or how recently the stats have been collected.
Stats should be collected on all dimension, history, transactional, reference and aggregate tables based on the approach below:
1. If the table is loaded under DELETE INSERT mode, then STATS should be collected during
each load.
2. If the table is built under INSERT UPDATE mode, then STATS should be collected if the
data demographics change by more than 10%.
3. If the target is a transactional table loaded in APPEND mode, then STATS should be
collected if the data demographics change by more than 10%.
4. If the table is built under INSERT mode; (aggregate tables where data is built for a
particular duration and queried upon this duration) tables where partitions are built over
each aggregation period, STATS should be collected on the new partition, even if the data
demographics for the entire table changes less than 10%, because user queries or extractions
might be built over data for current period of aggregation.
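Refreshing statistics simply means re-running the same COLLECT STATISTICS statement; as noted above, the previously collected values are replaced in the DBC tables. A sketch using the earlier APPL.EMPLOYEE example:

COLLECT STATISTICS COLUMN(EmployeeNo) ON APPL.EMPLOYEE;  -- replaces the previously collected stats
HELP STATISTICS APPL.EMPLOYEE;  -- shows when stats were last collected and the unique value counts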

What are No Primary Index tables?

Starting with Teradata Release 13, tables can be defined without having a primary index.
As we all know, the primary index is the main idea behind an evenly data distribution on a
Teradata system. By design, the primary index ensures that a Teradata system is
unconditionally scalable.
Hence the question: what is the point of tables without a primary index, and how do they fit into the hashing design of Teradata?
First, some words regarding how data is distributed in the case of a no primary index table: rows are simply distributed randomly across the AMPs.
As no hashing takes place, but rows still have to be identified uniquely, the ROWID is generated differently from the ROWID of a regular table with a primary index: as there is no hash value, Teradata uses the hash bucket of the responsible AMP and adds a uniqueness value. The bytes normally occupied by the hash value can then be used to increase the range for generating uniqueness values.
This is how no primary index tables are created (note that only MULTISET is allowed):
CREATE MULTISET TABLE <TABLE>
(
PK INTEGER NOT NULL
) NO PRIMARY INDEX;

Usage for No Primary Index Tables


As no primary index tables are distributed randomly across the AMPs, loading becomes faster. Take as an example the phases of a FastLoad:
1. Incoming rows are distributed in a round-robin fashion randomly across all AMPs
2. The rows are hashed by the primary index value and forwarded to the responsible AMPs
3. The responsible AMPs sort the received rows by ROWID
Now consider a no primary index table. After distributing the rows randomly across the AMPs we are done: no hashing and redistribution is needed, and no sorting is needed. Further, as rows are assigned randomly to the AMPs, the data will always be distributed evenly across all AMPs and no skew will occur.
As you can imagine, this makes loading much faster. Only the acquisition phase of the loading utilities is executed.
However useful no primary index tables are for decreasing load times, don't forget that without a primary index Teradata is limited to full table scans when rows have to be retrieved.
You will probably recognize some similarities between no primary index tables and the Teradata columnar feature introduced with Teradata 14. Basically, tables which use the new column partition feature of Teradata are likewise no primary index tables.
Although they offer great performance improvements for certain workload types, column stores on Teradata equally lack primary index access.
To some extent, this disadvantage of no primary index tables can be compensated for with join indexes or secondary indexes.
Basically, no primary index tables are not designed to be production tables. Consider using them during the ETL process in cases where Teradata has to do full table scans anyway, such as SQL transformations carried out on each row.

There are some further restrictions if you decide to use no primary index tables. Here are the
most important:
 Only MULTISET tables can be created
 No identity columns can be used
 NoPi tables cannot be partitioned with a PPI
 No statements with an update character are allowed (UPDATE, MERGE INTO, UPSERT)
 No Permanent Journal possible
 Cannot be defined as Queue Tables
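As an illustration of the ETL staging usage suggested above, a NoPI staging table might be loaded like this (a sketch with hypothetical names; note that INSERT is allowed while UPDATE/MERGE are not):

CREATE MULTISET TABLE stg_sales
(
sale_id INTEGER,
sale_amount DECIMAL(10,2)
) NO PRIMARY INDEX;

INSERT INTO stg_sales
SELECT sale_id, sale_amount FROM sales;  -- rows land randomly and evenly, with no hashing or sort step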

Join Indexes:

A join index joins two tables together and keeps the result set in the permanent space of Teradata. The join index holds the result set of the two tables, and at join time the parsing engine decides whether it is faster to build the result set from the actual BASE tables or from the join index. Users never query the join index directly. In essence, a join index is the pre-joined result of two tables, kept so that the parsing engine can take the result set from the join index instead of performing the join on the base tables.
Types of JOIN index –

Multi table JOIN index


Suppose we have two BASE tables EMPLOYEE_TABLE and DEP_TABLE, which holds the data
of EMPLOYEE and DEPARTMENT respectively. Now a JOIN index on these two tables will be
somewhat-
CREATE JOIN INDEX EMP_DEPT
AS
SELECT EMP_NO, EMP_NAME, EMP_DEPT, EMP_SAL, EMP_MGR
FROM EMPLOYEE_TABLE EMP
INNER JOIN DEP_TABLE DEP
ON EMP.EMP_DEPT = DEP.DEPT_NO
PRIMARY INDEX (EMP_NO);

This way the join index EMP_DEPT holds the result set of the two base tables, and at join time the PE decides whether it is faster to join the actual tables or to take the result set from the join index. So always choose the list of columns and tables for a join index wisely.
Single Table JOIN index
A single-table join index duplicates a single table but changes the primary index. Users still query only the base table, and it is the PE that decides which result set is faster: the join index or the actual base table. The reason to create a single-table join index is that joins can be performed faster, because no redistribution or duplication needs to occur.
CREATE JOIN INDEX EMP_SNAP
AS
SELECT EMP_NO, EMP_NAME, EMP_DEPT
FROM EMPLOYEE_TABLE
PRIMARY INDEX(EMP_DEPT);

Aggregate JOIN index

An aggregate join index allows the tracking of averages, sums and counts on a table. This join index is used when we need to perform aggregate functions on the table's data, for example:
CREATE JOIN INDEX EMP_DEPT_SUMMARY
AS
SELECT EMP_DEPT, SUM(EMP_SAL) AS SAL_SUM, COUNT(*) AS EMP_CNT
FROM EMPLOYEE_TABLE
GROUP BY EMP_DEPT
PRIMARY INDEX(EMP_DEPT);

The main fundamentals of join indexes are –

 A join index is not a pointer to data; it actually stores data in PERM space.
 Users never query a join index directly; it is the PE that decides which result set to take.
 Join indexes are updated when their base tables are changed.
 They can't be loaded with FastLoad or MultiLoad.
Teradata Space Management:

Teradata is designed to reduce the DBA's administrative work when it comes to space management. Space is configured in the following ways in a Teradata system:

1)      PERMANENT SPACE
2)      SPOOL SPACE
3)      TEMPORARY SPACE

1) PERMANENT SPACE – Permanent space is where objects (databases, users, tables) are created and stored. PERM space is distributed evenly across all the AMPs. Even distribution matters because the objects are then spread across all the AMPs, and at data retrieval time all AMPs work in parallel to fetch the data.
Unlike other relational databases, Teradata does not physically reserve PERM space at object creation time; instead it defines an upper limit, and PERM space is then used dynamically by the objects.
E.g. if a database is defined with 500 GB of PERM space and the actual size of the database is only 300 GB, the remaining 200 GB is usable as SPOOL space; there is no need to hold the 200 GB while the database does not require it. When the database requires more space, this 200 GB is released from SPOOL and given back to the database. This mechanism ensures enough space to execute all processes on the Teradata system.
2) SPOOL SPACE – Spool space is the space on the system that has not been allocated. The primary purpose of SPOOL space is to hold intermediate results of queries being processed in Teradata. For example, when executing a conditional query, all the qualifying rows which satisfy the given condition are stored in SPOOL space for further processing by the query. Any PERM space currently unassigned is available as SPOOL space.
Defining a SPOOL space limit is not required when users and databases are created, but it is highly recommended to define the upper limit of SPOOL space for any object (users, databases) you create. If no upper limit is defined for an object's SPOOL space, a query processed for that object might consume all the space in the system and cause a "runaway transaction".
One of the differences between PERM space and SPOOL space is this:

In PERM space, if we create a CHILD database from a PARENT database, the amount of PERM space for the CHILD database is subtracted from the PARENT's PERM space.
For example, a database SYSDBA is allotted 500 GB of PERM space. If we create a CHILD database, say HR, from SYSDBA and allot 200 GB of PERM space to HR, then this 200 GB is subtracted from the PARENT database SYSDBA. Similarly, if we define another CHILD database SALARY from HR and allot 100 GB of PERM space to it, then this 100 GB is deducted from HR.
The SPOOL space limit for a CHILD database, however, is not subtracted from its immediate PARENT; the CHILD database's SPOOL space can be as large as its immediate PARENT's. In this spool allocation the CHILD databases HR and SALARY have the same amount of SPOOL space as their PARENT database SYSDBA.
 
To define PERM space and SPOOL space on a database, we use a query like the one below:

CREATE DATABASE teradatatech AS PERM = 10000000, SPOOL = 20000000;

3) TEMP SPACE – The amount of space used for global temporary tables is known as TEMP space. These results remain available to the user until the session is terminated. Tables created in TEMP space will survive a restart. Permanent space not being used for tables is available as TEMP space.
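Current versus maximum space per database can be checked against the DBC.DiskSpaceV view; a common sketch:

SELECT DatabaseName,
SUM(MaxPerm) AS Max_Perm,
SUM(CurrentPerm) AS Current_Perm,  -- summed across all AMPs
SUM(MaxSpool) AS Max_Spool
FROM DBC.DiskSpaceV
GROUP BY 1
ORDER BY 1;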

Recovery Journal:

The Teradata database uses recovery journals to automatically maintain data integrity in the case of:
 An interrupted transaction
 An AMP failure
Recovery journals are created, maintained and purged by the system automatically, so no DBA intervention is required. Recovery journals are tables stored on disk, so they take up disk space on the system.
There are three types of Recovery Journal in Teradata-
1. Transient Journal
2. Down – AMP Recovery Journal
3. Permanent Journal
Now we look on each of the recovery journal in details –
Transient Journal
A transient journal maintains data integrity when in-flight transactions are interrupted. Data
is returned to its original state after transaction failure.
A transient journal is used during normal system operation to keep “before images” of
changed rows so the data can be restored to its previous state if the transaction is not
completed. This happens on each AMP as changes occur. When a transaction is started, the
system automatically stores a copy of all the rows affected by the transaction in the
transient journal until the transaction is completed. Once the transaction is completed the
“before images” are purged.
In the event of transaction failure, the “before images” are reapplied to the affected tables
and deleted from the journal, and the “rollback” operation is completed.
Down AMP Recovery Journal
The down AMP recovery journal allows continued system operation while an AMP is down.
A down AMP recovery journal is used with fallback protected tables to maintain a record of
write transactions (updates, creates, inserts, deletes, etc) on the failed AMP while it is
unavailable.
The Down AMP recovery journal starts automatically after the loss of an AMP in a cluster.
Any changes to the data in the failed AMP are logged into the Down AMP recovery journal
by the other AMPs in the cluster. When the failed AMP is brought back online, the restart
process includes applying the changes in the Down – AMP recovery journal to the recovered
AMP.
The journal is discarded once the process is complete, and the AMP is brought online, fully
recovered.
Permanent Journal
Permanent Journals are an optional feature used to provide an additional level of data
protection. You specify the use of permanent journal at the table level. It provides full-table
recovery to a specific point in time. It can also reduce the need for costly and time –
consuming full table backups.
Permanent journals are tables stored on the disk arrays like user data, so they take up additional disk space on the system. The database administrator maintains the permanent journal entries (deleting, archiving, and so on). A database can have one permanent journal.
When you create a table with permanent journaling, you must specify whether the permanent journal will capture:
 Before images – for rollback to “undo” a set of changes to a previous state.
 After images – for roll forward to “redo” to a specific state.
 Following is the syntax for specifying a permanent journal –

CREATE DATABASE teradatatech
FROM sysdba AS
PERM = 4000000,    /* permanent space */
SPOOL = 2000000,   /* spool space */
NO FALLBACK,
ACCOUNT = '$admin',
NO BEFORE JOURNAL,
AFTER JOURNAL,
DEFAULT JOURNAL TABLE = teradata.journal;
Here the admin has opted for only AFTER JOURNAL and has named the journal table "teradata.journal".
When a user creates a table in the database "teradatatech", by default AFTER JOURNAL is in effect to protect his data when a hardware failure occurs.
 He can opt for NO AFTER JOURNAL by overriding the default.
 Scenario 1: here, by default, the table has the AFTER JOURNAL option.
 CREATE TABLE table_name
( field1 INTEGER,
field2 INTEGER)
PRIMARY INDEX (field1);
 Scenario 2: in this case the user has specifically stated that he wants no AFTER JOURNAL for his data. This is how a user can override the default.
 CREATE TABLE table_name,
FALLBACK,
NO AFTER JOURNAL
( field1 INTEGER,
field2 INTEGER)
PRIMARY INDEX (field1);
 Whenever journaling is in effect and the user inserts/updates and the transaction is committed, the affected rows are backed up in the journal table "teradata.journal".

Teradata - JOIN strategies:

Join Methods
Teradata uses different join methods to perform join operations. Some of the commonly
used Join methods are −

 Merge Join
 Nested Join
 Product Join
Merge Join
Merge Join method takes place when the join is based on the equality condition. Merge
Join requires the joining rows to be on the same AMP. Rows are joined based on their row
hash. Merge Join uses different join strategies to bring the rows to the same AMP.
Strategy #1
If the join columns are the primary indexes of the corresponding tables, then the joining
rows are already on the same AMP. In this case, no distribution is required.

Consider the following Employee and Salary Tables.

CREATE SET TABLE EMPLOYEE,FALLBACK (
EmployeeNo INTEGER,
FirstName VARCHAR(30) ,
LastName VARCHAR(30) ,
DOB DATE FORMAT 'YYYY-MM-DD',
JoinedDate DATE FORMAT 'YYYY-MM-DD',
DepartmentNo BYTEINT
)
UNIQUE PRIMARY INDEX ( EmployeeNo );
CREATE SET TABLE Salary (
EmployeeNo INTEGER,
Gross INTEGER,
Deduction INTEGER,
NetPay INTEGER
)
UNIQUE PRIMARY INDEX(EmployeeNo);

When these two tables are joined on EmployeeNo column, then no redistribution takes
place since EmployeeNo is the primary index of both the tables which are being joined.
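A quick way to confirm this is to EXPLAIN the join; for the tables above, the plan should show a merge join with no redistribution step (a sketch):

EXPLAIN
SELECT e.EmployeeNo, s.NetPay
FROM EMPLOYEE e
INNER JOIN Salary s
ON e.EmployeeNo = s.EmployeeNo;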

Strategy #2
Consider the following Employee and Department tables.

CREATE SET TABLE EMPLOYEE,FALLBACK (
EmployeeNo INTEGER,
FirstName VARCHAR(30) ,
LastName VARCHAR(30) ,
DOB DATE FORMAT 'YYYY-MM-DD',
JoinedDate DATE FORMAT 'YYYY-MM-DD',
DepartmentNo BYTEINT
)
UNIQUE PRIMARY INDEX ( EmployeeNo );
CREATE SET TABLE DEPARTMENT,FALLBACK (
DepartmentNo BYTEINT,
DepartmentName CHAR(15)
)
UNIQUE PRIMARY INDEX ( DepartmentNo );

If these two tables are joined on the DepartmentNo column, then the rows need to be redistributed, since DepartmentNo is the primary index in one table and a non-primary-index column in the other. In this scenario, the joining rows may not be on the same AMP. In such a case, Teradata may redistribute the employee table on the DepartmentNo column.

Strategy #3
For the above Employee and Department tables, Teradata may instead duplicate the Department table on all AMPs, if the Department table is small.

Nested Join
Nested Join doesn't use all AMPs. For a Nested Join to take place, one of the conditions should be equality on the unique primary index of one table, with that column then joined to any index on the other table.

In this scenario, the system fetches the single row using the unique primary index of one table and uses that row hash to fetch the matching records from the other table. Nested join is the most efficient of all join methods.

Product Join
Product Join compares each qualifying row from one table with each qualifying row from the other table. A product join may take place due to some of the following factors −

 WHERE condition is missing.
 Join condition is not based on an equality condition.
 Table aliases are incorrect.
 There are multiple join conditions.

Teradata - Data Protection:

Transient Journal
Teradata uses Transient Journal to protect data from transaction failures. Whenever any
transactions are run, Transient journal keeps a copy of the before images of the affected
rows until the transaction is successful or rolled back successfully. Then, the before images
are discarded. The transient journal is kept on each AMP. It is an automatic process and cannot be disabled.

Fallback
Fallback protects table data by storing a second copy of each row of a table on another AMP called the fallback AMP. If one AMP fails, then the fallback rows are accessed. With this, even if one AMP fails, data is still available through the fallback AMP. The fallback option can be used at table creation or after table creation. Fallback ensures that the second copy of the rows of the table is always stored on another AMP to protect the data from AMP failure. However, fallback occupies twice the storage and doubles the I/O for inserts, updates and deletes.

Down AMP Recovery Journal
The Down AMP recovery journal is activated when the AMP fails and the table is fallback
protected. This journal keeps track of all the changes to the data of the failed AMP. The
journal is activated on the remaining AMPs in the cluster. It is an automatic process and
cannot be disabled. Once the failed AMP is live again, the data from the Down AMP recovery journal is synchronized with it. Once this is done, the journal is discarded.

Cliques
Clique is a mechanism used by Teradata to protect data from Node failures. A clique is
nothing but a set of Teradata nodes that share a common set of Disk Arrays. When a node
fails, then the vprocs from the failed node will migrate to other nodes in the clique and
continue to access their disk arrays.

Hot Standby Node


Hot Standby Node is a node that does not participate in the production environment. If a
node fails then the vprocs from the failed nodes will migrate to the hot standby node. Once
the failed node is recovered it becomes the hot standby node. Hot Standby nodes are used
to maintain the performance in case of node failures.

RAID
Redundant Array of Independent Disks (RAID) is a mechanism used to protect data from
Disk Failures. A disk array consists of a set of disks which are grouped as a logical unit. This unit may look like a single unit to the user, but it may be spread across several disks.

RAID 1 is commonly used in Teradata. In RAID 1, each disk is associated with a mirror disk.
Any changes to the data in primary disk is reflected in mirror copy also. If the primary disk
fails, then the data from mirror disk can be accessed.

Teradata - Performance Tuning:

Explain
The first step in performance tuning is the use of EXPLAIN on your query. The EXPLAIN plan gives the details of how the optimizer will execute your query. In the explain output, check for keywords like confidence level, join strategy used, spool file size, redistribution, etc.
Collect Statistics
The optimizer uses data demographics to come up with an effective execution strategy. The COLLECT STATISTICS command is used to collect data demographics of a table. Make sure that the statistics collected on the columns are up to date.

 Collect statistics on the columns that are used in WHERE clause and on the columns
used in the joining condition.
 Collect statistics on the Unique Primary Index columns.
 Collect statistics on Non-Unique Secondary Index columns. The optimizer will decide whether it can use the NUSI or a full table scan.
 Collect statistics on the join index even though statistics on the base table are collected.
 Collect statistics on the partitioning columns.

Data Types
Make sure that proper data types are used. This avoids using more storage than required.

Conversion
Make sure that the data types of the columns used in join condition are compatible to avoid
explicit data conversions.

Sort
Remove ORDER BY clauses unless they are really required.

Spool Space Issue


A spool space error is generated if the query exceeds the per-AMP spool space limit for that user. Verify the explain plan and identify the steps that consume the most spool space. Such intermediate results can be split off into separate queries that build temporary tables, as in the sketch below.
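For example, a spool-heavy join step could be materialized once into a volatile table and reused across subsequent queries (a sketch using the earlier Employee/Salary tables; vt_emp_pay is a hypothetical name):

CREATE MULTISET VOLATILE TABLE vt_emp_pay AS
(
SELECT e.EmployeeNo, s.NetPay
FROM EMPLOYEE e
INNER JOIN Salary s
ON e.EmployeeNo = s.EmployeeNo
) WITH DATA
PRIMARY INDEX (EmployeeNo)
ON COMMIT PRESERVE ROWS;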

Primary Index
Make sure that the Primary Index is correctly defined for the table. The primary index
column should evenly distribute the data and should be frequently used to access the data.

SET Table
If you define a SET table, then the optimizer will check whether the record is a duplicate for each and every record inserted. To remove this duplicate check condition, you can define a Unique Secondary Index on the table.
UPDATE on Large Table
Updating a large table will be time consuming. Instead of updating the table in place, it can be faster to delete the records and re-insert them with the modified values.

Dropping Temporary Tables


Drop temporary tables (staging tables) and volatile tables once they are no longer needed. This will free up permanent space and spool space.

MULTISET Table
If you are sure that the input records will not contain duplicates, then you can define the target table as a MULTISET table to avoid the duplicate-row check used by a SET table.

Teradata – BTEQ:

BTEQ utility is a powerful utility in Teradata that can be used in both batch and interactive
mode. It can be used to run any DDL statement, DML statement, create Macros and stored
procedures. BTEQ can be used to import data into Teradata tables from flat file and it can
also be used to extract data from tables into files or reports.

BTEQ Terms
Following is the list of terms commonly used in BTEQ scripts.

 LOGON − Used to log into the Teradata system.
 ACTIVITYCOUNT − Returns the number of rows affected by the previous query.
 ERRORCODE − Returns the status code of the previous query.
 DATABASE − Sets the default database.
 LABEL − Assigns a label to a set of SQL commands.
 RUN FILE − Executes the query contained in a file.
 GOTO − Transfers control to a label.
 LOGOFF − Logs off from database and terminates all sessions.
 IMPORT − Specifies the input file path.
 EXPORT − Specifies the output file path and initiates the export.

Example
Following is a sample BTEQ script.

.LOGON 192.168.1.102/dbc,dbc;
DATABASE tduser;

CREATE TABLE employee_bkup (
EmployeeNo INTEGER,
FirstName CHAR(30),
LastName CHAR(30),
DepartmentNo SMALLINT,
NetPay INTEGER
)
Unique Primary Index(EmployeeNo);

.IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;

SELECT * FROM Employee SAMPLE 1;
.IF ACTIVITYCOUNT <> 0 THEN .GOTO InsertEmployee;

DROP TABLE employee_bkup;

.IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;

.LABEL InsertEmployee
INSERT INTO employee_bkup
SELECT a.EmployeeNo,
a.FirstName,
a.LastName,
a.DepartmentNo,
b.NetPay
FROM
Employee a INNER JOIN Salary b
ON (a.EmployeeNo = b.EmployeeNo);

.IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;


.LOGOFF;

The above script performs the following tasks.

 Logs into Teradata System.


 Sets the Default Database.
 Creates a table called employee_bkup.
 Selects one record from the Employee table to check if the table has any records.
 Drops the employee_bkup table if the Employee table is empty.
 Transfers control to the label InsertEmployee, which inserts records into the employee_bkup table.
 Checks ERRORCODE after each SQL statement to make sure the statement was successful.
 ACTIVITYCOUNT returns the number of records selected/impacted by the previous SQL query.

Teradata – FastLoad:

FastLoad utility is used to load data into empty tables. Since it does not use transient
journals, data can be loaded quickly. It doesn't load duplicate rows even if the target table
is a MULTISET table.

Limitation
The target table must not have secondary indexes, join indexes or foreign key references.

How FastLoad Works


FastLoad is executed in two phases.

Phase 1
 The parsing engine reads the records from the input file and sends a block to each AMP.
 Each AMP stores the blocks of records.
 Then AMPs hash each record and redistribute them to the correct AMP.
 At the end of Phase 1, each AMP has its rows but they are not in row hash sequence.

Phase 2
 Phase 2 starts when FastLoad receives the END LOADING statement.
 Each AMP sorts the records on row hash and writes them to the disk.
 Locks on the target table are released and the error tables are dropped.

Example
Create a text file with the following records and name the file as employee.txt.

101,Mike,James,1980-01-05,2010-03-01,1
102,Robert,Williams,1983-03-05,2010-09-01,1
103,Peter,Paul,1983-04-01,2009-02-12,2
104,Alex,Stuart,1984-11-06,2014-01-01,2
105,Robert,James,1984-12-01,2015-03-09,3

Following is a sample FastLoad script to load the above file into Employee_Stg table.

LOGON 192.168.1.102/dbc,dbc;
DATABASE tduser;
BEGIN LOADING tduser.Employee_Stg
ERRORFILES Employee_ET, Employee_UV
CHECKPOINT 10;
SET RECORD VARTEXT ",";
DEFINE in_EmployeeNo (VARCHAR(10)),
in_FirstName (VARCHAR(30)),
in_LastName (VARCHAR(30)),
in_BirthDate (VARCHAR(10)),
in_JoinedDate (VARCHAR(10)),
in_DepartmentNo (VARCHAR(02)),
FILE = employee.txt;
INSERT INTO Employee_Stg (
EmployeeNo,
FirstName,
LastName,
BirthDate,
JoinedDate,
DepartmentNo
)
VALUES (
:in_EmployeeNo,
:in_FirstName,
:in_LastName,
:in_BirthDate (FORMAT 'YYYY-MM-DD'),
:in_JoinedDate (FORMAT 'YYYY-MM-DD'),
:in_DepartmentNo
);
END LOADING;
LOGOFF;

Executing a FastLoad Script


Once the input file employee.txt is created and the FastLoad script is named as
EmployeeLoad.fl, you can run the FastLoad script using the following command in UNIX
and Windows.

FastLoad < EmployeeLoad.fl;


Once the above command is executed, the FastLoad script will run and produce the log. In
the log, you can see the number of records processed by FastLoad and status code.

**** 03:19:14 END LOADING COMPLETE
Total Records Read = 5
Total Error Table 1 = 0 ---- Table has been dropped
Total Error Table 2 = 0 ---- Table has been dropped
Total Inserts Applied = 5
Total Duplicate Rows = 0
Start: Fri Jan 8 03:19:13 2016
End : Fri Jan 8 03:19:14 2016
**** 03:19:14 Application Phase statistics:
Elapsed time: 00:00:01 (in hh:mm:ss)
0008 LOGOFF;
**** 03:19:15 Logging off all sessions

FastLoad Terms
Following is the list of common terms used in FastLoad script.

 LOGON − Logs into Teradata and initiates one or more sessions.


 DATABASE − Sets the default database.
 BEGIN LOADING − Identifies the table to be loaded.
 ERRORFILES − Identifies the 2 error tables that need to be created/updated.
 CHECKPOINT − Defines when to take checkpoint.
 SET RECORD − Specifies if the input file format is formatted, binary, text or
unformatted.
 DEFINE − Defines the input file layout.
 FILE − Specifies the input file name and path.
 INSERT − Inserts the records from the input file into the target table.
 END LOADING − Initiates phase 2 of the FastLoad. Distributes the records into the
target table.
 LOGOFF − Ends all sessions and terminates FastLoad.

Teradata – MultiLoad:

MultiLoad can load multiple tables at a time and it can also perform different types of tasks such as INSERT, DELETE, UPDATE and UPSERT. It can load up to 5 tables at a time and perform up to 20 DML operations in a script. Unlike FastLoad, MultiLoad does not require the target table to be empty.

MultiLoad supports two modes −

 IMPORT
 DELETE
MultiLoad requires a work table, a log table and two error tables in addition to the target table.
 Log table − Maintains the checkpoints taken during the load and the results from each phase of MultiLoad, for restart purposes.
 Error tables − These tables are populated during the load when an error occurs. The first error table stores conversion errors, whereas the second error table stores duplicate records.
 Work table − The MultiLoad script creates one work table per target table. The work table is used to keep the DML tasks and the input data.

Limitation
MultiLoad has some limitations.

 Unique Secondary Index not supported on target table.


 Referential integrity not supported.
 Triggers not supported.
How MultiLoad Works
MultiLoad import has five phases −

 Phase 1 − Preliminary Phase – Performs basic setup activities.


 Phase 2 − DML Transaction Phase – Verifies the syntax of DML statements and brings
them to Teradata system.
 Phase 3 − Acquisition Phase – Brings the input data into work tables and locks the
table.
 Phase 4 − Application Phase – Applies all DML operations.
 Phase 5 − Cleanup Phase – Releases the table lock.

The steps involved in a MultiLoad script are −

 Step 1 − Set up the log table.


 Step 2 − Log on to Teradata.
 Step 3 − Specify the Target, Work and Error tables.
 Step 4 − Define INPUT file layout.
 Step 5 − Define the DML queries.
 Step 6 − Name the IMPORT file.
 Step 7 − Specify the LAYOUT to be used.
 Step 8 − Initiate the Load.
 Step 9 − Finish the load and terminate the sessions.
Example
Create a text file with the following records and name the file as employee.txt.

101,Mike,James,1980-01-05,2010-03-01,1
102,Robert,Williams,1983-03-05,2010-09-01,1
103,Peter,Paul,1983-04-01,2009-02-12,2
104,Alex,Stuart,1984-11-06,2014-01-01,2
105,Robert,James,1984-12-01,2015-03-09,3

The following example is a MultiLoad script that reads records from the employee.txt file and loads them into the Employee_Stg table.

.LOGTABLE tduser.Employee_log;
.LOGON 192.168.1.102/dbc,dbc;
.BEGIN MLOAD TABLES Employee_Stg;
.LAYOUT Employee;
.FIELD in_EmployeeNo * VARCHAR(10);
.FIELD in_FirstName * VARCHAR(30);
.FIELD in_LastName * VARCHAR(30);
.FIELD in_BirthDate * VARCHAR(10);
.FIELD in_JoinedDate * VARCHAR(10);
.FIELD in_DepartmentNo * VARCHAR(02);

.DML LABEL EmpLabel;
INSERT INTO Employee_Stg (
EmployeeNo,
FirstName,
LastName,
BirthDate,
JoinedDate,
DepartmentNo
)
VALUES (
:in_EmployeeNo,
:in_FirstName,
:in_Lastname,
:in_BirthDate,
:in_JoinedDate,
:in_DepartmentNo
);
.IMPORT INFILE employee.txt
FORMAT VARTEXT ','
LAYOUT Employee
APPLY EmpLabel;
.END MLOAD;
.LOGOFF;

Executing a MultiLoad Script


Once the input file employee.txt is created and the multiload script is named as
EmployeeLoad.ml, then you can run the Multiload script using the following command in
UNIX and Windows.

Multiload < EmployeeLoad.ml;

Teradata – FastExport:

FastExport utility is used to export data from Teradata tables into flat files. It can also generate the data in report format. Data can be extracted from one or more tables using joins. Since FastExport exports the data in 64K blocks, it is useful for extracting large volumes of data.

Example
Consider the following Employee table.
EmployeeNo  FirstName  LastName  BirthDate
101         Mike       James     1/5/1980
104         Alex       Stuart    11/6/1984
102         Robert     Williams  3/5/1983
105         Robert     James     12/1/1984
103         Peter      Paul      4/1/1983

Following is an example of a FastExport script. It exports data from employee table and
writes into a file employeedata.txt.

.LOGTABLE tduser.employee_log;
.LOGON 192.168.1.102/dbc,dbc;
DATABASE tduser;
.BEGIN EXPORT SESSIONS 2;
.EXPORT OUTFILE employeedata.txt
MODE RECORD FORMAT TEXT;
SELECT CAST(EmployeeNo AS CHAR(10)),
CAST(FirstName AS CHAR(15)),
CAST(LastName AS CHAR(15)),
CAST(BirthDate AS CHAR(10))
FROM
Employee;
.END EXPORT;
.LOGOFF;
Executing a FastExport Script
Once the script is written and named as employee.fx, you can use the following command
to execute the script.

fexp < employee.fx


After executing the above command, you will receive the following output in the file
employeedata.txt.

103 Peter Paul 1983-04-01
101 Mike James 1980-01-05
102 Robert Williams 1983-03-05
105 Robert James 1984-12-01
104 Alex Stuart 1984-11-06

FastExport Terms
Following is the list of terms commonly used in FastExport script.

 LOGTABLE − Specifies the log table for restart purpose.


 LOGON − Logs into Teradata and initiates one or more sessions.
 DATABASE − Sets the default database.
 BEGIN EXPORT − Indicates the beginning of the export.
 EXPORT − Specifies the target file and the export format.
 SELECT − Specifies the select query to export data.
 END EXPORT − Specifies the end of FastExport.
 LOGOFF − Ends all sessions and terminates FastExport.
