Performance Comparison between Row Organized and Column Organized Tables

Somraj Chakrabarty (somrajob@gmail.com)
Senior Consultant
CapGemini India

Suvradeep Sensarma (sensarmasuvradeep@gmail.com)
Senior Consultant
CapGemini India

Abhinava Mukherjee (abhinava_mukherjee@yahoo.com)
Consultant
CapGemini India

01 September 2016

The page is the basic building block of a DB2 database and also the basic unit of I/O.
Traditionally, DB2 stored data on a page row-wise, so a given page contained multiple rows of
data. Version 10.5 introduced a new way of organizing data on a page: data can also be organized
by column, or ‘column organized’. DB2 introduced this feature to improve the performance of
certain database queries, in some cases manifold. This article explains how data is stored on data
pages for column organized tables, describes the resulting performance benefits, and compares the
performance of row organized and column organized tables.

Introduction to Column Organized Tables


DB2 Version 10.5 introduced compressed column-organized tables for DB2 databases.
Column-organized tables are tables whose data pages contain column data instead of row data.
The way we query the tables does not change, but the way DB2 handles the data internally changes
significantly. Compression is more effective in column-organized tables because the probability
of finding repeating patterns on a page is very high when all the data on the page comes from the
same column. This results in large storage savings and, in specific cases, performance benefits.
Row-organized tables, on the other hand, store the data from all the columns of a row together,
and the data in those columns can vary widely, which reduces the probability of finding repeating
patterns on a page.
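
To make the compression argument concrete, the following toy sketch lays the same values out row-wise and column-wise on fixed-capacity "pages" and counts how many values on each page repeat an earlier value on that page. The sample rows, the page capacity of six values, and the helper names are illustrative assumptions; this is not DB2's page format or its compression algorithm.

```python
# Hypothetical sample rows (ID, NAME, STREAM); the data is invented for illustration.
rows = [
    (1, "Asha", "SCIENCE"),
    (2, "Ravi", "SCIENCE"),
    (3, "Mina", "ARTS"),
    (4, "John", "SCIENCE"),
    (5, "Lata", "ARTS"),
    (6, "Omar", "SCIENCE"),
]
VALUES_PER_PAGE = 6  # toy page capacity, counted in values rather than bytes


def paginate(values, size):
    """Split a flat list of values into consecutive pages of at most `size` values."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def repeats(page):
    """Number of values on a page that duplicate another value on the same page."""
    return len(page) - len(set(page))


# Row-wise layout: all columns of row 1, then all columns of row 2, and so on.
row_major = [value for row in rows for value in row]
# Column-wise layout: all IDs, then all NAMEs, then all STREAMs.
col_major = [row[c] for c in range(3) for row in rows]

row_repeats = sum(repeats(p) for p in paginate(row_major, VALUES_PER_PAGE))
col_repeats = sum(repeats(p) for p in paginate(col_major, VALUES_PER_PAGE))

print("repeated values per layout:", {"row": row_repeats, "column": col_repeats})
```

With this sample data the column-wise layout places every STREAM value on one page, so it sees more in-page repetition than the row-wise layout, which is the property a page-level compressor can exploit.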

© Copyright IBM Corporation 2016 Trademarks


developerWorks® ibm.com/developerWorks/

Architectural difference between row organized and column organized tables


The traditional way of storing data on a DB2 data page is to store multiple rows on a given page;
a table stored this way is referred to as a 'row-organized table'. The number of rows that fit on a
page depends on the row width and the page size: the narrower the row, the more rows fit on a
page.
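
As a rough illustration of that relationship, the sketch below computes an upper bound on the number of whole rows per page. The function name and the sample sizes are assumptions, and the calculation ignores the page header, the slot directory, and any per-page row limits that DB2 applies.

```python
def rows_per_page(page_size_bytes: int, row_width_bytes: int, overhead_bytes: int = 0) -> int:
    """Upper bound on whole rows per page (a row cannot span pages)."""
    if row_width_bytes <= 0:
        raise ValueError("row width must be positive")
    usable = max(page_size_bytes - overhead_bytes, 0)
    return usable // row_width_bytes


# Hypothetical 4 KB page with a narrow and a wide row.
narrow = rows_per_page(4096, 100)
wide = rows_per_page(4096, 400)
print(narrow, wide)
```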

Figure 1 is a pictorial representation of a row organized table, where each DB2 page contains
multiple rows. Note that this is only an illustration; an actual DB2 page contains additional
structures such as page header information and a slot directory. In a row organized table, a row
cannot span multiple pages.

Figure 1: Row Organized Tables

With a columnar format, a single page stores the values of just a single column, taken from
multiple rows. DB2 allocates extents of pages for each column in a column-organized table. The
page size and extent size are fixed for each table, based on the table space assigned when the
CREATE TABLE statement is executed. Because each column has its own pages, the database engine
performs I/O only for the columns that a query references. This can save a lot of resources when
processing certain kinds of queries.

Figure 2 is a pictorial representation of a column organized table, in which pages are allocated
column-wise. Each page is filled with data from a single column; the number of rows whose data
fits on a page varies.


Figure 2: Column Organized Tables

TSN stands for Tuple Sequence Number; it plays a role similar to the row ID in a row organized
table. Each row is assigned a TSN, in ascending order, when the row is stored. A TSN uniquely
identifies one row of data within a table, and DB2 uses the TSN to locate and retrieve column data
for a specific row. When a column-organized table is created, DB2 creates a system-generated
page map index, which contains one entry for each page in the column-organized table. The index
is assigned a system-generated name and uses the schema SYSIBM. The page map index maps TSNs
to pages.
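
The idea of mapping a TSN to a page can be sketched with a sorted list and a binary search. This is a toy model under stated assumptions: the layout of the entries (first TSN on the page, page number), the sample values, and the function name are hypothetical, and the real structure of the page map index is not described here.

```python
from bisect import bisect_right

# Hypothetical page map for ONE column: (first TSN stored on the page, page number).
# Entries are kept sorted by first TSN, one entry per page.
page_map = [(0, 10), (1000, 11), (2500, 12)]
first_tsns = [first_tsn for first_tsn, _ in page_map]


def page_for_tsn(tsn: int) -> int:
    """Return the page number whose TSN range contains `tsn`."""
    if tsn < first_tsns[0]:
        raise ValueError("TSN precedes the first mapped page")
    # bisect_right finds the first entry starting AFTER tsn; step back one entry.
    index = bisect_right(first_tsns, tsn) - 1
    return page_map[index][1]


print(page_for_tsn(999), page_for_tsn(1000), page_for_tsn(2600))
```

To assemble a full row, a lookup like this would be repeated once per referenced column, since each column keeps its values on its own pages.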

Storage Objects for Column Organized Tables


A column-organized table has two internal storage objects for data:

•  Column-Organized Storage Object: The user column data is stored in a set of pages termed
the column-organized storage object. This object includes user data and available empty
pages.

•  Data Object: The column-level dictionaries and other table metadata are stored in the data
storage object for the table.

Creation of Column organized tables


•  Using the ORGANIZE BY COLUMN clause: The CREATE TABLE statement has a new 'ORGANIZE BY
COLUMN' clause for creating a column organized table.

CREATE TABLE STUDENT (ID SMALLINT NOT NULL, NAME VARCHAR(9), STREAM VARCHAR(10)) ORGANIZE BY COLUMN IN USERSPACE1

•  Setting the database configuration parameter DFT_TABLE_ORG to COLUMN: If DFT_TABLE_ORG
is set to COLUMN, all tables are created as column organized tables by default, and there is
no need to specify the 'ORGANIZE BY COLUMN' clause. Alternatively, the default table
organization is changed to COLUMN automatically when the DB2_WORKLOAD registry variable is
set to ANALYTICS. This setting establishes a configuration that is optimal for analytic
workloads.


•  Converting row organized tables to column organized tables using db2convert: The
db2convert tool converts one or all row-organized user tables in a specified database into
column-organized tables. The row-organized table remains online during the conversion.
Internally, db2convert calls the ADMIN_MOVE_TABLE procedure to perform the conversion.

For the full syntax of db2convert, refer to the DB2 10.5 Infocenter link in Resources.

The following command converts all row-organized user-defined tables to column-organized
tables within the database SAMPLE:
db2convert -d SAMPLE

The following command converts the single row-organized table SCHEMA1.TAB1 to a column-
organized table in the database SAMPLE:
db2convert -d SAMPLE -z SCHEMA1 -t TAB1

•  Converting row organized tables to column organized tables using ADMIN_MOVE_TABLE: The
ADMIN_MOVE_TABLE procedure can be used to convert a row-organized table to a
column-organized table. To request the conversion, specify ORGANIZE BY COLUMN as an option
of ADMIN_MOVE_TABLE. Specify the COPY_USE_LOAD option to move the data with the LOAD
utility, which generates the column dictionaries.

For example:
Call admin_move_table ('TEST','ACCT2','AS2','AS2','AS2','ORGANIZE BY COLUMN',
'','','','COPY_USE_LOAD','MOVE')

Catalog and Monitoring information added for Column Organized tables
With the introduction of column organized tables, new catalog and monitoring information has
been added to the existing catalog and monitoring tables and routines. This information helps in
understanding the vital details of a particular column organized table.

The first thing to look for is the TABLEORG column in SYSCAT.TABLES. A value of 'C' indicates a
column organized table and 'R' indicates a row organized table.

A new column, MPAGES, has also been added to SYSCAT.TABLES. It shows the total number of
pages used for table metadata and is non-zero only for a table that is organized by column. For
column-organized tables, the user table data is stored in a special column organized storage
object, and the column NPAGES in SYSCAT.TABLES is a count of these pages with table column
data. Because the column dictionaries for column-organized tables can be much larger than the
dictionary data stored for row organized tables, MPAGES includes these column dictionaries. The
column FPAGES in SYSCAT.TABLES shows a total page count for column-organized tables, which
includes both the data object and the column organized object.

Below is an example of output for newly introduced columns for SYSCAT.TABLES for both Row
and Column Organized tables.
db2 "select substr(tabschema,1,8) as schema, substr(tabname,1,20) as table, colcount, card, tableorg, npages, fpages, mpages from syscat.tables where tabschema='SCHEMA1'"

SCHEMA   TABLE         COLCOUNT  CARD   TABLEORG  NPAGES  FPAGES  MPAGES
-------- ------------- --------  -----  --------  ------  ------  ------
SCHEMA1  TEST1              145  94984  R           3998    3999       0
SCHEMA1  TEST1_COLORG       145  94984  C            717    1492     775
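
The page counts in this sample output can be cross-checked with a few lines of arithmetic. The sketch below copies the values from the output above; the dictionary layout and variable names are illustrative, and the NPAGES + MPAGES = FPAGES relationship is only verified for this one sample, not claimed as a general rule.

```python
# Values copied from the SYSCAT.TABLES sample output for schema SCHEMA1.
catalog = {
    "TEST1": {"tableorg": "R", "npages": 3998, "fpages": 3999, "mpages": 0},
    "TEST1_COLORG": {"tableorg": "C", "npages": 717, "fpages": 1492, "mpages": 775},
}

row_tab = catalog["TEST1"]
col_tab = catalog["TEST1_COLORG"]

# For the column-organized table: column data pages plus metadata pages.
col_total = col_tab["npages"] + col_tab["mpages"]
fpages_matches = col_total == col_tab["fpages"]

# Fraction of pages saved relative to the row-organized copy of the same data.
page_reduction = 1 - col_tab["fpages"] / row_tab["fpages"]

print(f"NPAGES + MPAGES = {col_total}, equals FPAGES: {fpages_matches}")
print(f"page reduction vs. row-organized: {page_reduction:.1%}")
```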

SYSIBMADM.ADMINTABINFO also has two new columns, COL_OBJECT_L_SIZE and COL_OBJECT_P_SIZE.
COL_OBJECT_L_SIZE is the amount of disk space logically allocated for the column-organized data
in the table, and COL_OBJECT_P_SIZE is the amount of disk space physically allocated for it. Both
are reported in kilobytes.

In addition, many monitor elements have been added to the monitoring routines for column
organized tables, for example pool_col_l_reads, pool_col_p_reads, pool_async_col_reads,
pool_async_col_writes, object_col_l_reads and object_col_p_reads. Use these monitor elements
to understand what portion of the I/O is driven by access to column-organized tables when a
workload touches both row-organized and column-organized tables.

Do Column organized tables improve query performance?


To answer this question, we ran a series of tests on several tables to check whether column
organized tables really improve query performance.

Using the same data, we created one row organized table named TEST2_ROWORG and one
column organized table named TEST2_COLORG. We used the clause ORGANIZE BY ROW to create the
row organized table because the registry variable DB2_WORKLOAD was set to ANALYTICS, which
sets the DFT_TABLE_ORG database configuration parameter to COLUMN so that all tables are
created as column organized tables by default.

Both tables have the same structure and cardinality, as shown in Table 1.

Table 1. Table Attributes

Attribute  Value
CARD       1135438
COLCOUNT   12

The query below shows that the column organized table is smaller than its row organized
counterpart.


db2 "select substr(tabname,1,25) as tab, data_object_p_size, index_object_p_size, col_object_p_size from sysibmadm.admintabinfo where tabschema='SCHEMA2'"

TAB           DATA_OBJECT_P_SIZE  INDEX_OBJECT_P_SIZE  COL_OBJECT_P_SIZE
------------  ------------------  -------------------  -----------------
TEST2_ROWORG              215040                    0                  0
TEST2_COLORG                2048                 3072             124928

For the column-organized version of the table, most of the disk space is allocated in the
column-organized object, shown in the column COL_OBJECT_P_SIZE. The column dictionaries and
other metadata are allocated in the data object, shown as DATA_OBJECT_P_SIZE. The page map
index for a column-organized table takes a small amount of space in the index object, shown as
INDEX_OBJECT_P_SIZE. All amounts are in kilobytes.
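
To quantify the difference, the sketch below totals the three physical sizes for each table and derives the saving. The numbers are copied from the ADMINTABINFO output above; the dictionary layout and helper name are illustrative assumptions.

```python
# Physical sizes in kilobytes, copied from the ADMINTABINFO sample output.
sizes_kb = {
    "TEST2_ROWORG": {"data": 215040, "index": 0, "col": 0},
    "TEST2_COLORG": {"data": 2048, "index": 3072, "col": 124928},
}


def total_kb(table: str) -> int:
    """Sum of the data, index, and column-organized object sizes for one table."""
    return sum(sizes_kb[table].values())


row_total = total_kb("TEST2_ROWORG")
col_total = total_kb("TEST2_COLORG")
saving_kb = row_total - col_total
saving_fraction = saving_kb / row_total

# Share of the column-organized table's space that holds the column data itself.
col_share = sizes_kb["TEST2_COLORG"]["col"] / col_total

print(f"row-organized: {row_total} KB, column-organized: {col_total} KB")
print(f"saving: {saving_kb} KB ({saving_fraction:.1%}); column object share: {col_share:.1%}")
```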

Next, we ran the following equivalent queries against the row organized and column organized tables.
db2 "select * from schema2.test2_roworg"
db2 "select * from schema2.test2_colorg"

The results for the two queries are shown in Table 2. They were collected with the db2batch and
db2exfmt (total query cost) tools.

Table 2. Table Result1

Parameter         ROWORG    COLORG
Elapsed Time      2.528742  1.946262
TOTAL_L_READS     6685      3612
POOL_READ_TIME    67        0
TOTAL_CPU_TIME    2215213   3385222
TOTAL_WAIT_TIME   4891      3301
Total Query Cost  8748.19   4177.46

The column organized table performs better on every metric except TOTAL_CPU_TIME, which is
higher for the column organized table. The main reason is that, for larger tables, far fewer pages
have to be fetched from disk into the buffer pool than for the row organized table. The
TOTAL_L_READS count for the column organized table is just over half of that for the row
organized table (3612 versus 6685), hence the overall benefit in performance.

We then ran another query on the same tables, this time selecting only one column. The results,
shown in Table 3, were even better for the column organized table.
db2 "select col1 from schema2.test2_roworg"
db2 "select col1 from schema2.test2_colorg"

Table 3. Table Result2

Parameter         ROWORG    COLORG
Elapsed Time      2.322436  1.104926
TOTAL_L_READS     6684      180
POOL_READ_TIME    26        0
TOTAL_CPU_TIME    546005    702004
TOTAL_WAIT_TIME   6119      1609
Total Query Cost  8681.37   663.526

Here, logical reads drop considerably for the column organized table (180 versus 6684). Overall
I/O is reduced with columnar technology because only the columns that the query needs are read.
This can often make 95 percent of the I/O go away because most analytic workloads access only
a subset of the columns. By contrast, if you access only 20 columns of a 50-column table in a
traditional row store, you end up doing I/O and consuming server memory even for data in
columns that are of no interest to the query.
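
The two result tables can be summarized as relative changes per metric. The sketch below copies the measurements from Table 2 and Table 3; the dictionary layout and the helper name are illustrative assumptions, and a negative value means the column organized table recorded the lower number.

```python
# (ROWORG, COLORG) pairs copied from Table 2 ("select *") and Table 3 ("select col1").
results = {
    "select *": {
        "Elapsed Time": (2.528742, 1.946262),
        "TOTAL_L_READS": (6685, 3612),
        "POOL_READ_TIME": (67, 0),
        "TOTAL_CPU_TIME": (2215213, 3385222),
        "TOTAL_WAIT_TIME": (4891, 3301),
        "Total Query Cost": (8748.19, 4177.46),
    },
    "select col1": {
        "Elapsed Time": (2.322436, 1.104926),
        "TOTAL_L_READS": (6684, 180),
        "POOL_READ_TIME": (26, 0),
        "TOTAL_CPU_TIME": (546005, 702004),
        "TOTAL_WAIT_TIME": (6119, 1609),
        "Total Query Cost": (8681.37, 663.526),
    },
}


def relative_change(roworg: float, colorg: float) -> float:
    """Change of the COLORG value relative to the ROWORG value."""
    return (colorg - roworg) / roworg


changes = {
    query: {metric: relative_change(*pair) for metric, pair in metrics.items()}
    for query, metrics in results.items()
}

for query, metrics in changes.items():
    for metric, change in metrics.items():
        print(f"{query:12} {metric:17} {change:+.1%}")
```

In these measurements TOTAL_CPU_TIME is the one metric that rises for the column organized table in both queries, while logical reads fall by roughly 46% for the full-row query and by more than 97% for the single-column query.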

We carried out similar tests on smaller tables. For smaller tables, however, the row organized
version performed better than its column organized counterpart. This is because, for smaller
tables, more extents are allocated for a column organized table than for a row organized table.
For example, a table might have 30 rows and 50 columns in a table space with an extent size of
4 pages. Each column is allocated at least one extent, so the table requires at least 200 pages
(50 columns x 4 pages) for 30 data rows.
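
The arithmetic of that example can be written as a one-line function. The function name is an assumption; the calculation covers only the minimum of one extent per column and ignores metadata pages.

```python
def min_pages_column_organized(column_count: int, extent_size_pages: int) -> int:
    """Minimum data pages when every column is allocated at least one extent."""
    return column_count * extent_size_pages


pages = min_pages_column_organized(column_count=50, extent_size_pages=4)
rows = 30
print(f"{pages} pages for {rows} rows, about {pages / rows:.1f} pages per row")
```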

So, before implementing column organized tables in your environment, test them thoroughly.
Implement column organized tables where the workload is predominantly analytical. Such
workloads are characterized by non-selective data access (that is, queries access more than
approximately 5% of the data) and by extensive scanning, grouping, and aggregation. Workloads
that are transactional in nature should not use column-organized tables; traditional
row-organized tables with index access are generally better suited for these environments. For
mixed workloads, which combine analytic query processing with very selective access (involving
less than 2% of the data), a mix of row-organized and column-organized tables might be suitable.

DML operations on Column Organized tables


The storage for column organized tables is designed specifically for SELECT operations in
analytical workloads. Because every column of a column-organized table is stored on different
data pages, a DML operation needs to access and change many pages.

•  INSERT: When DB2 inserts a row into a column organized table, it has to write data to as
many pages as the table has columns. For example, if the table has 30 columns, a single
insert affects 30 pages. Newly inserted rows are always stored in the last partially filled
page assigned to each column. DB2 does some special internal processing for inserts into
column-organized tables that buffers the new data, to reduce the processing overhead when
applications insert many new rows.

•  DELETE: A delete operation on a column organized table does not actually release space,
so the space does not become available for new inserts. Instead, the data for each column is
flagged as deleted in the page for that column. Extents that contain pages where the entire
column data has been flagged as deleted can be released using a REORG with RECLAIM
EXTENTS.

•  UPDATE: An update is processed as a DELETE of the old data row followed by an INSERT of
the changed row. This affects every page containing columns for the data row. It also means
that an updated row consumes space in proportion to the number of times the row has been
updated, until space reclamation occurs.
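
The space behaviour of DELETE and UPDATE described above can be mimicked with a small toy model. The class and method names are hypothetical, and the model tracks whole row versions only; it does not represent per-column pages, insert buffering, or extent reclamation.

```python
class ToyColumnTable:
    """Toy model: each stored row version takes the next slot; deletes only set a flag."""

    def __init__(self):
        self.versions = []  # each entry: {"key": ..., "values": ..., "deleted": bool}

    def insert(self, key, values):
        """Append a new row version and return its position (a stand-in for a TSN)."""
        self.versions.append({"key": key, "values": values, "deleted": False})
        return len(self.versions) - 1

    def delete(self, key):
        """Flag live versions of the key as deleted; no space is released."""
        for version in self.versions:
            if version["key"] == key and not version["deleted"]:
                version["deleted"] = True

    def update(self, key, values):
        """An update is a delete of the old version plus an insert of the new one."""
        self.delete(key)
        return self.insert(key, values)

    def slots_used(self):
        return len(self.versions)

    def live_rows(self):
        return sum(1 for version in self.versions if not version["deleted"])


table = ToyColumnTable()
table.insert("acct-1", {"balance": 100})
for balance in (90, 80, 70):
    table.update("acct-1", {"balance": balance})

print(f"live rows: {table.live_rows()}, slots used: {table.slots_used()}")
```

In this model one logical row that has been updated three times occupies four slots, which illustrates why heavily updated tables grow until space is reclaimed.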

Easy to Maintain
Column organized tables remove many of the maintenance chores that DBAs generally face with a
normal table. These tables need little or no maintenance activity.

•  REORG: The classic offline or online REORG is not needed for these tables. REORG is needed
only to reclaim extents, using the RECLAIM EXTENTS clause, and even that is automated if
DB2_WORKLOAD is set to ANALYTICS.

•  RUNSTATS: There is no need to run RUNSTATS manually if DB2_WORKLOAD is set to
ANALYTICS; statistics are collected automatically.

•  INDEXES: These tables do not need any user-created indexes. The system automatically
creates an index, called the page map index, to locate the data. Even if you run db2advis for
a query on column organized tables, it does not suggest any indexes.

•  MDCs or MQTs: MDCs and MQTs are not required to improve query performance.

Basically, a column organized table is 'load and go': you simply create it and load it. Little or
no maintenance is needed to improve performance. These tables are already optimized to give the
best performance (for certain workloads).

Points to remember
1. Column organized tables change the traditional way of storing data on pages to improve
performance for analytical queries relative to row organized tables.
2. They are easy to implement: just CREATE, LOAD and GO.
3. They require minimal maintenance by DBAs, saving a great many man-hours of table
maintenance.
4. DML operations require some extra work compared to traditional row organized tables.
5. They perform far better than row organized tables for analytical workloads, where queries
access more than approximately 5% of the data.
6. Very large, high-cardinality tables usually perform better in column organized format,
whereas small, low-cardinality tables respond better in row organized format, even for
analytical workloads.


Acknowledgments
Special thanks to Manish Makwana for review and advice towards writing this article.


Resources
•  Learn more from Database: https://www.ibm.com/developerworks/library/dm-1406convert-table-db2105/
•  Infocenter link: https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0060592.html
•  Stay current with developer technical events and webcasts focused on a variety of IBM
products and IT industry topics.
•  Follow developerWorks on Twitter
•  Get involved in the developerWorks Community. Connect with other developerWorks users
while you explore developer-driven blogs, forums, groups, and wikis.


About the authors


Somraj Chakrabarty

Somraj Chakrabarty has a bachelor's degree in electronics and communication
engineering from NIT, Durgapur. He has around 8 years of experience as a DB2
LUW database administrator in the finance, retail, and manufacturing domains. He has
worked in technology companies like TCS and Infosys, and is currently associated
with CapGemini India as a Senior Consultant. He mostly supports multiple projects
in performance tuning and design areas. He is a certified Advanced DB2 Database
Administrator.

Suvradeep Sensarma

Suvradeep Sensarma is a DB2 LUW DBA senior consultant with CapGemini India.
He has extensive experience working with customers in performance tuning and
support tips on DB2 on LUW. He is also certified in DB2 10.1 DBA for Linux, UNIX,
and Windows (Exam 611).

Abhinava Mukherjee

Abhinava Mukherjee is a DB2 LUW DBA with CapGemini India supporting multiple
projects in various domains. He has a master's degree in Computer Application along
with hands-on IT experience in datacenter operation and system administration
on AIX, Unix, AS-400 and Windows environments. He is also certified in DB2
10.1 Fundamentals (Exam 610).

© Copyright IBM Corporation 2016
(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)
