BISP Teradata Basics
Manohar Krishna
1. Architecture
2. PI, SI, PPI
3. Data protection
4. Spaces and Tables
5. Other indexes
Teradata Database System
A Teradata Database system contains one or more nodes. A node is a term for
a processing unit under the control of a single operating system. The node is
where the processing occurs for the Teradata Database. There are two types of
Teradata Database systems:
Symmetric multiprocessing (SMP) - An SMP Teradata Database has a single
node that contains multiple CPUs sharing a memory pool.
Massively parallel processing (MPP) - Multiple SMP nodes working together
comprise a larger, MPP implementation of a Teradata Database. The nodes are
connected using the BYNET, which allows multiple virtual processors on
multiple nodes to communicate with each other.
Node Components
A node is the basic building block of a Teradata Database system, and contains
a large number of hardware and software components. A conceptual
diagram of a node and its major components is shown below. Hardware
components are shown on the left side of the node and software
components are shown on the right side.
Client Connections
Users can access data in the Teradata Database through an application on
both channel-attached and network-attached clients. Additionally, the node
itself can act as a client. Teradata client software is installed on each client
(channel-attached, network-attached, or node) and communicates with
RDBMS software on the node. You may hear either type of client referred to
by the term "host," though this term is not typically used in documentation
or product literature.
Trusted Parallel Application (TPA)
A Trusted Parallel Application (TPA) uses PDE to implement virtual
processors (vprocs). The Teradata Database is classified as a TPA. The four
components of the Teradata Database TPA are:
SMP systems do not contain BYNET hardware. The PDE and BYNET software
emulate BYNET activity in a single-node environment.
BYNET Unique Features
The BYNET has several unique features:
• Scalable: As you add more nodes to the system, the overall network bandwidth
scales linearly. This linear scalability means you can increase system size without
performance penalty -- and sometimes even increase performance.
• Fault tolerant: Each network has multiple connection paths. If the BYNET
detects an unusable path in either network, it will automatically reconfigure that
network so all messages avoid the unusable path. Additionally, in the rare case that
BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are
re-routed to BYNET 1.
NoPI
A NoPI choice will result in distribution of the data between AMPs based on
random generator code.
A common use may be for staging or intermediate tables used with load
operations.
NoPI is available with Teradata 13.0.
CREATE TABLE sample_2
(col_x INTEGER
,col_y INTEGER
,col_z INTEGER)
NO PRIMARY INDEX;
o_#   c_#  o_dt  o_st    o_#   c_#  o_dt  o_st    o_#   c_#  o_dt  o_st    o_#   c_#  o_dt  o_st
7202  2    4/09  C       7325  2    4/13  O       7188  1    4/13  C       7324  3    4/13  O
7415  1    4/13  C       7103  1    4/10  O       7225  2    4/15  C       7384  1    4/12  C
7402  3    4/16  C
Row Distribution Using a NUPI – Case 2
Order table:
Order   Customer  Order  Order
Number  Number    Date   Status
(PK)    (NUPI)
7325    2         4/13   O
7324    3         4/13   O
7415    1         4/13   C
7103    1         4/10   O
7225    2         4/15   C
7384    1         4/12   C
7402    3         4/16   C
7188    1         4/13   C
7202    2         4/09   C
Notes:
• Customer_Number may be the preferred access column for ORDER table, thus a good index candidate.
• Values for Customer_Number are somewhat non-unique.
• Choice of Customer_Number is therefore a NUPI.
• Rows with the same PI value distribute to the same AMP.
• Row distribution is less uniform or skewed.
Distribution of rows across the four AMPs:
AMP                      AMP                      AMP                      AMP
o_#   c_#  o_dt  o_st    o_#   c_#  o_dt  o_st    o_#   c_#  o_dt  o_st    (no rows)
7325  2    4/13  O       7384  1    4/12  C       7402  3    4/16  C
7202  2    4/09  C       7103  1    4/10  O       7324  3    4/13  O
7225  2    4/15  C       7415  1    4/13  C
                         7188  1    4/13  C
Row Distribution Using a Highly Non-Unique Primary Index (NUPI) – Case 3
Order table:
Order   Customer  Order  Order
Number  Number    Date   Status
(PK)                     (NUPI)
7325    2         4/13   O
7324    3         4/13   O
7415    1         4/13   C
7103    1         4/10   O
7225    2         4/15   C
7384    1         4/12   C
7402    3         4/16   C
7188    1         4/13   C
7202    2         4/09   C
Notes:
• Values for Order_Status are “highly” non-unique.
• Choice of Order_Status column is a NUPI.
• Only two values exist, so only two AMPs will ever be used for this table.
• Table will not perform well in parallel operations.
• Highly non-unique columns are poor PI choices generally.
• The degree of uniqueness is critical to efficiency.
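One rough way to gauge how unique a candidate Primary Index column is, before committing to it, is to count rows per value. The sketch below assumes a hypothetical table named orders whose columns o_num, c_num, o_dt and o_st mirror the Order table in the cases above; the names are illustrative only.

```sql
-- Hypothetical table and column names (orders, c_num, o_st), for illustration.

-- Rows per value for the Customer_Number candidate (Case 2).
SELECT c_num
      ,COUNT(*) AS rows_per_value
FROM orders
GROUP BY c_num
ORDER BY rows_per_value DESC;

-- Distinct values per candidate column. A very low count, as for
-- Order_Status in Case 3, signals a poor Primary Index choice.
SELECT COUNT(DISTINCT c_num) AS distinct_customers
      ,COUNT(DISTINCT o_st)  AS distinct_statuses
      ,COUNT(*)              AS total_rows
FROM orders;
```

For the nine sample rows shown above, the second query would report 3 distinct customer numbers and 2 distinct status values.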
Hash Bucket #
A Row Hash is the 32-bit result of applying a hashing algorithm to an index value.
The DSW (Destination Selection Word) or Hash Bucket is represented by the high order
16 bits of the Row Hash.
Hash Map
A Hash Map is uniquely configured for each system.
It is an array of 65,536 entries (buckets) which associates bucket numbers with
specific AMPs.
Two systems with the same number of AMPs will have the same Hash Map.
Changing the number of AMPs in a system requires a change to the Hash Map.
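The chain from index value to row hash to hash bucket to AMP can be inspected with Teradata's built-in hash functions. This is a minimal sketch; the literal 7325 is just a sample order number from the tables above, and the result depends on the value's data type and on the system's Hash Map.

```sql
-- HASHROW returns the 32-bit row hash of a value.
-- HASHBUCKET extracts the hash bucket number from a row hash.
-- HASHAMP maps a bucket number to the AMP that owns it.
SELECT HASHROW(7325)                      AS row_hash
      ,HASHBUCKET(HASHROW(7325))          AS hash_bucket
      ,HASHAMP(HASHBUCKET(HASHROW(7325))) AS amp_number;

-- With no argument, HASHAMP returns the highest AMP number,
-- so adding 1 gives the number of AMPs in the system.
SELECT HASHAMP() + 1 AS amp_count;
```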
Duplicate Hash Values
It is possible for the hashing algorithm to end up with the same row hash value for
two different rows. There are two ways this could happen:
To differentiate each row in a table, every row is assigned a unique Row ID. The Row ID is the
combination of the row hash value and a uniqueness value.
Row ID = Row Hash Value + Uniqueness Value
The uniqueness value is used to differentiate between rows whose Primary Index
values generate identical row hash values. In most cases, only the row hash value
portion of the Row ID is needed to locate the row.
Row ID
Rows are logically maintained in Row ID sequence.
Row Hash    Uniqueness Id   Row data
(32 bits)   (32 bits)
3B11 5032   0000 0001       1018  Reynolds  Jane
3B11 5032   0000 0002       1020  Reynolds  Evan
3B11 5032   0000 0003       1031  Reynolds  Jason
3B11 5033   0000 0001       1014  Jacobs    Paul
3B11 5034   0000 0001       1012  Chevas    Jose
3B11 5034   0000 0002       1021  Carnet    Jean
:           :               :     :         :
Using Hash Functions to View Distribution
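A common pattern for viewing distribution is to nest the three hash functions around a column and count rows per AMP. The sketch below assumes the hypothetical orders table with a c_num column used for illustration in this deck's Order examples. The column does not have to be the table's current Primary Index, so the same query can preview how a candidate index would distribute.

```sql
-- Hypothetical table/column names (orders, c_num), for illustration only.
-- Shows how many rows each AMP would hold if c_num were the Primary Index.
SELECT HASHAMP(HASHBUCKET(HASHROW(c_num))) AS amp_number
      ,COUNT(*)                            AS row_count
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
```

AMPs that would receive no rows do not appear in the result, so fewer result rows than AMPs is itself a sign of skew.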
Duplicate Rows
A duplicate row is a row in a table whose column values are identical to another
row in the same table. In other words, the entire row is the same, not just the
index.
Duplicate rows are allowed in the Teradata Database. When you create a
table, the following definitions determine whether or not it can contain
duplicate rows:
– MULTISET tables: May contain duplicate rows. The Teradata Database will not check
for duplicate rows.
– SET tables: The default in Teradata session mode. The Teradata Database checks for
and does not permit duplicate rows. If a SET table is created with a Unique Primary
Index, the check for duplicate rows is replaced by a check for duplicate index values.
Duplicate Rows…
A duplicate row is a row of a table whose column values are all identical to another row in the same table.
col_a  col_b  col_c
20     50     A
25     50     A      (duplicate row)
25     50     A      (duplicate row)
• Because a PK uniquely identifies each row, ideally a relational table should not have duplicate rows.
• The ANSI standard, however, permits duplicate rows for specialized situations, thus Teradata permits them as well.
• You may select whether your table will or will not allow them:
  SET table: checks for* and disallows duplicate rows.
  MULTISET table: doesn't check for and allows duplicate rows.
* Note: If a UPI is selected on a SET table, the duplicate row check is replaced by a check for duplicate index values.
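The difference between the two table kinds can be sketched with a pair of otherwise identical tables. The table names dup_check_set and dup_check_multi are hypothetical; the columns follow the col_a, col_b, col_c example above.

```sql
-- Hypothetical table names, for illustration only.
CREATE SET TABLE dup_check_set
(col_a INTEGER
,col_b INTEGER
,col_c CHAR(1))
PRIMARY INDEX (col_a);

CREATE MULTISET TABLE dup_check_multi
(col_a INTEGER
,col_b INTEGER
,col_c CHAR(1))
PRIMARY INDEX (col_a);

-- MULTISET: both inserts succeed, leaving two identical rows.
INSERT INTO dup_check_multi VALUES (25, 50, 'A');
INSERT INTO dup_check_multi VALUES (25, 50, 'A');

-- SET: the second insert is rejected with a duplicate row error.
INSERT INTO dup_check_set VALUES (25, 50, 'A');
INSERT INTO dup_check_set VALUES (25, 50, 'A');
```

Both tables use a non-unique Primary Index here, so the SET table has to compare whole rows; with a UPI the cheaper index-value check described in the note would apply instead.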
Secondary Indexes
• A secondary index can be used to impose uniqueness within a column or set of columns.
• A table can have from 0 to 32 secondary indexes. Each index can have up to 64 columns.
• You can submit a request without specifying a Primary Index and still access
the data. The following access methods do not use a Primary Index:
– Unique Secondary Index (USI)
– Non-Unique Secondary Index (NUSI)
– Full-Table Scan
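As a minimal sketch, secondary indexes are added to and removed from an existing table with CREATE INDEX and DROP INDEX. The orders table and its columns are hypothetical names that mirror the Order examples in this deck.

```sql
-- Hypothetical table/column names (orders, o_num, o_dt), for illustration only.

-- USI: enforces uniqueness on the order number when it is not the Primary Index.
CREATE UNIQUE INDEX (o_num) ON orders;

-- NUSI: an alternate access path on a non-unique column.
CREATE INDEX (o_dt) ON orders;

-- A secondary index can be dropped without touching the base table.
DROP INDEX (o_dt) ON orders;
```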
Comparison of Primary and Secondary indexes
4 | Values should not change | Values may be changed (redistributes row) | Values may be changed | Values may be changed
5 | Column should not change | Column should not change | Column cannot be changed (drop and recreate table) | Index may be changed (drop and recreate index)
• PPI rows are 2 bytes longer, so the table uses more PERM space.
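For context, a Partitioned Primary Index is declared by adding a PARTITION BY clause to the Primary Index definition. The sketch below is an assumption-laden illustration: the table name orders_ppi, its columns, and the 2012 date range are all hypothetical.

```sql
-- Hypothetical table, columns and date range, for illustration only.
-- Rows still hash to AMPs by c_num; on each AMP they are grouped
-- into one partition per month of o_dt.
CREATE TABLE orders_ppi
(o_num INTEGER NOT NULL
,c_num INTEGER
,o_dt  DATE
,o_st  CHAR(1))
PRIMARY INDEX (c_num)
PARTITION BY RANGE_N(o_dt BETWEEN DATE '2012-01-01'
                          AND     DATE '2012-12-31'
                          EACH INTERVAL '1' MONTH);
```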
Disk level protection: RAID 1
RAID 1 characteristics:
• Data is fully replicated on a mirror disk.
• Provides high data availability and performance, but storage costs are high.
AMP level protection: FALLBACK CLUSTER
A cluster is a group of AMPs that act as a single fallback unit.
A Fallback row is a copy of a “Primary row” which is stored on a different AMP within
the same cluster.
After the loss of any AMP, a Down-AMP Recovery Journal is started automatically.
Its purpose is to log any changes to rows which reside on the down AMP. Any inserts, updates, or
deletes affecting rows on the down AMP are applied to the Fallback copy within the cluster. The AMP
that holds the Fallback copy logs the Row ID in its Recovery Journal.
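Fallback protection is requested per table. As a minimal sketch, assuming a hypothetical table named orders_fb:

```sql
-- Hypothetical table name and columns, for illustration only.
-- FALLBACK asks Teradata to keep a second copy of every row
-- on a different AMP in the same cluster.
CREATE TABLE orders_fb, FALLBACK
(o_num INTEGER NOT NULL
,c_num INTEGER
,o_dt  DATE
,o_st  CHAR(1))
UNIQUE PRIMARY INDEX (o_num);

-- Fallback can be removed later, for example to reclaim PERM space.
ALTER TABLE orders_fb, NO FALLBACK;
```

A fallback table needs roughly twice the PERM space of the same table without fallback, since every row is stored twice.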
RAID1 and FALLBACK
Node level protection: CLIQUE
Hot standby node: CLIQUE
Data integrity protection: LOCKS
Locking Modifier:
Lock requests are queued behind all outstanding incompatible lock requests for the
same object.
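The locking modifier is a prefix on the request itself. The sketch below assumes a hypothetical orders table; it shows an ACCESS lock for reads that must not queue behind writers, and the NOWAIT option for requests that should fail rather than wait in the queue.

```sql
-- Hypothetical table name (orders), for illustration only.

-- Read through concurrent writers; the result may include uncommitted changes.
LOCKING TABLE orders FOR ACCESS
SELECT o_st
      ,COUNT(*) AS order_count
FROM orders
GROUP BY 1;

-- Request a READ lock, but abort immediately instead of queuing
-- if an incompatible lock is already held.
LOCKING TABLE orders FOR READ NOWAIT
SELECT COUNT(*) AS total_orders
FROM orders;
```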
Transaction level protection: Journals
Spaces In Teradata
3 Types of Spaces in Teradata
1) Perm space:
The space occupied by tables, indexes, and stored procedures.
2) Spool space:
Spool space is work space used to hold intermediate answer sets. Any
Perm space currently unassigned is available as Spool space.
3) Temp space:
The space occupied by global temporary tables.
[Figure: Relation of PERM and SPOOL Space]
Perm Spaces distribution Spool space distribution
Space terminology
Assigning Perm and Spool Limits
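Space limits are set when a user or database is created and can be changed afterwards. The following is a sketch only: the names demo_user and demo_parent, the password, and the byte counts are hypothetical placeholders.

```sql
-- Hypothetical names, password and sizes, for illustration only.
-- Limits are given in bytes; PERM is taken from the owner (demo_parent).
CREATE USER demo_user FROM demo_parent AS
   PERM      = 10000000
  ,SPOOL     = 50000000
  ,TEMPORARY = 20000000
  ,PASSWORD  = demo_pw_1;

-- Raise the spool limit later without recreating the user.
MODIFY USER demo_user AS SPOOL = 100000000;
```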
Types of temp Tables
1) Derived
2) Volatile temporary
3) Global temporary
Derived Tables
• A derived table is local to the query: it exists only for the duration of the query.
SELECT prod_id, sale_date, amount, AVGSALE
FROM sales_table,
     (SELECT AVG(amount) FROM sales_table) AS TEMP (AVGSALE)
ORDER BY 3 DESC;
Volatile Tables
In the volatile table example, we stated ON COMMIT PRESERVE ROWS. This clause allows us to use the
volatile table again for other queries in the session. The default is ON COMMIT DELETE
ROWS, which means the data is deleted when the transaction is committed. Since this is rarely what is
intended, it is common to include ON COMMIT PRESERVE ROWS in the table creation statement.
The following commands are not applicable to volatile tables:
• COLLECT/DROP/HELP STATISTICS (possible from TD 13 onward)
• CREATE/DROP INDEX
• ALTER TABLE
• GRANT/REVOKE privileges
• DELETE DATABASE/USER (does not drop volatile tables)
In addition, volatile tables cannot be loaded with the MultiLoad or FastLoad utilities.
Working with Volatile Table
Step 1: Creation
CREATE VOLATILE TABLE vt_deptsal
(deptno SMALLINT
,avgsal DEC(9,2)
,maxsal DEC(9,2)
,minsal DEC(9,2)
,sumsal DEC(9,2)
,empcnt SMALLINT)
ON COMMIT PRESERVE ROWS;
Step 2: Population
INSERT INTO vt_deptsal
SELECT dept ,AVG(sal) ,MAX(sal) ,MIN(sal) ,SUM(sal) ,COUNT(emp)
FROM emp
GROUP BY 1;
Step 3: Using
Show all employees who make the minimum salary in their department.
Note: This joins the volatile table with a permanent table.
SELECT emp, last, dept, sal
FROM emp INNER JOIN vt_deptsal
ON dept=deptno
WHERE sal=minsal
ORDER BY 3;
Global Temporary Table
Characteristics:
Global Temporary Tables are created using the CREATE GLOBAL TEMPORARY TABLE statement. They require a
base definition which is stored in the Data Dictionary (DD).
NOTE: The term 'Temporary Table' can mean different things to different people. In
Teradata terminology, 'Temporary Tables' means 'Global Temporary Tables'.
Working with Global Temporary Table
CREATE GLOBAL TEMPORARY TABLE gt_deptsal
(deptno SMALLINT
,avgsal DEC(9,2)
,maxsal DEC(9,2)
,minsal DEC(9,2)
,sumsal DEC(9,2)
,empcnt SMALLINT);

-- Drops the local instance of the table only.
DROP TEMPORARY TABLE gt_deptsal;

-- Drops the base definition and the local instance of the table, if present.
-- Fails if there are other instances of the table in the system.
DROP TABLE gt_deptsal;

-- Drops the base table and all instances.
-- Fails if any instance is in an active transaction.
DROP TABLE gt_deptsal ALL;
Statistics
In Teradata, statistics can be understood as the landmarks for an address:
the address is the large body of data, and the statistics are key information
about that data.
NOTE: Use "DIAGNOSTIC HELPSTATS ON FOR SESSION;"
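A typical workflow, sketched under the assumption of a hypothetical orders table, is to collect statistics on the columns used for access and joins, then review what exists. With DIAGNOSTIC HELPSTATS switched on, EXPLAIN output for the session also lists statistics the optimizer would recommend collecting.

```sql
-- Hypothetical table/column names (orders, c_num, o_dt), for illustration only.

-- Collect statistics on a column.
COLLECT STATISTICS ON orders COLUMN c_num;

-- List the statistics that exist on the table.
HELP STATISTICS orders;

-- Ask EXPLAIN to append recommended statistics for this session.
DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT *
FROM orders
WHERE o_dt = DATE '2012-04-13';

-- Remove statistics that are no longer useful.
DROP STATISTICS ON orders COLUMN c_num;
```

The recommendations are suggestions rather than requirements; collecting every listed statistic adds maintenance cost.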