
Analytical Functions in ORACLE 8i

Edward Kosciuzko, Sequel Consulting, Inc.

Introduction
The purpose of this article is to introduce some of the analytical functions that were added in ORACLE 8i. After
reading Oracle’s documentation on the functions, I feel certain that most users will have, or already had, trouble
understanding exactly what some of the options are. The windowing clause options, in particular, were poorly documented
and required a lot of testing to determine exactly what the options were and, even more importantly, when they were
permitted. Numerous examples are contained in this article to explain the various options.

Not all of the functions are covered here, due to time. The regression analysis functions should be self-explanatory after
understanding the functions covered in this article. Not being a statistician, I avoided some of the statistical functions
like the plague.

Given my special interest in SQL, I found that these new functions also provide a far superior way of specifying complex
queries, and of listing aggregates alongside the details used to compute them. Included below are numerous examples, and
in certain cases, execution statistics are listed to illustrate the significant performance improvements that can be
attained with the new functions.

Objective of Functions
While these functions can be implemented by utilizing standard SQL, the benefits are:
• simplicity of specification
• reducing network traffic
• moving processing to the server
• providing superior performance over previous SQL functions

Simplicity
In the early days of Oracle Corporation I would demo ORACLE to prospective clients and tell them that the beauty of the
relational approach is that any query could be formulated with SQL. Fortunately, only one client ever asked me (using
ORACLE’s demo database) to list the sum of salaries by department and compare that against all other departments (i.e.
what percentage of the total company salaries each department’s sum represented). Having demonstrated ORACLE for
years, I immediately hedged by saying you must first create a view. The following view was required for the solution:

CREATE VIEW co_tot_sal (total_sal) AS
  SELECT SUM(sal) FROM emp

The final SQL would then be:

SELECT deptno, (SUM(sal)/total_sal)*100
FROM emp e, co_tot_sal c
GROUP BY deptno, total_sal
SQL 1

The problem here was having the appropriate views created in advance.

Another problem encountered years ago was trying to phrase a query to find the top 10 salesmen. To illustrate let’s look at
the top 2 salaries in the EMP table.

www.nyoug.org 1 212-978-8890
SELECT * FROM emp e1
WHERE EXISTS
(SELECT null FROM emp e2
WHERE e2.sal > e1.sal
AND e1.rowid != e2.rowid
HAVING COUNT(*) <2)
SQL 2

Specifying SQL like this is beyond the average user, and it’s inefficient because there is no way to inform ORACLE what
we are trying to achieve.

Reducing Network Traffic


Many of the data analysis tools consume large amounts of data that must be transmitted over the network to the client for
analysis. The specialized tool then produces the summaries requested. Now the server can produce the summaries and
transmit only those.

Moving Processing to Server


How long ago were we thrilled to have PCs to offload the processing from the server? Now we’re moving it back.
Consider the reports we could generate easily using SQL*Plus and the BREAK and COMPUTE commands. Of course
you were forced to display the details, but the totaling and subtotaling were performed at the client. This processing was
simply rescanning and resorting the results of the query. Now ORACLE performs those functions.

Analytical Function vs Standard Aggregates


Differentiating the analytical functions from the standard aggregate functions, such as AVG, SUM, etc., really starts from
a similarity: both types of functions work on sets of values. The difference is the way in which the sets are defined. The
standard aggregates produce one value for each set of rows defined by the GROUP BY clause. The analytical functions
also let you group the rows returned by the query, and the value of the analytical function is based on that group of rows.
The difference is that GROUP BY compresses the detail rows into a single row, whereas the analytical functions
produce a value for each detail row in a group. The groups defined by analytical functions are called partitions.

The following query uses a standard aggregate with a GROUP BY to produce the sum of salaries per group defined by the
same job and deptno.

SELECT deptno, job, SUM(sal)
FROM emp
GROUP BY deptno, job
SQL 3

DEPTNO JOB SUM(SAL)
10 CLERK 1300
10 MANAGER 2450
10 PRESIDENT 5000
20 ANALYST 6000
20 CLERK 1900
20 MANAGER 2975
30 CLERK 950
30 MANAGER 2850
30 SALESMAN 5600
Table 1

The following illustrates an analytical function that produces the same sum but lists it with all the details.

SELECT empno, deptno, job,
       SUM(sal) OVER (PARTITION BY deptno, job) sum_sal
FROM emp
SQL 4

EMPNO DEPTNO JOB SUM_SAL


7934 10 CLERK 1300
7782 10 MANAGER 2450
7839 10 PRESIDENT 5000
7788 20 ANALYST 6000
7902 20 ANALYST 6000
7369 20 CLERK 1900
7876 20 CLERK 1900
7566 20 MANAGER 2975
7900 30 CLERK 950
7698 30 MANAGER 2850
7499 30 SALESMAN 5600
7654 30 SALESMAN 5600
7844 30 SALESMAN 5600
7521 30 SALESMAN 5600
Table 2

The main difference to recognize at this point is that the analytical aggregates do not compress the groups of rows into a
single row as the standard aggregates do. That means the analytical functions can also be applied to a SQL module
containing a GROUP BY. However, when the SQL module does have a GROUP BY, the only columns or expressions that
can be referenced by the analytical functions are the columns/expressions that are being grouped, plus the other
aggregates.
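
To make the contrast concrete, here is a minimal sketch that runs SQL 3 and SQL 4 side by side. It is not ORACLE: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later, which accepts the same OVER (PARTITION BY ...) syntax) as a stand-in, with the EMP rows copied from the listings in this article.

```python
import sqlite3

# EMP rows as listed in the article: (empno, deptno, job, sal).
EMP = [
    (7369, 20, "CLERK", 800), (7499, 30, "SALESMAN", 1600),
    (7521, 30, "SALESMAN", 1250), (7566, 20, "MANAGER", 2975),
    (7654, 30, "SALESMAN", 1250), (7698, 30, "MANAGER", 2850),
    (7782, 10, "MANAGER", 2450), (7788, 20, "ANALYST", 3000),
    (7839, 10, "PRESIDENT", 5000), (7844, 30, "SALESMAN", 1500),
    (7876, 20, "CLERK", 1100), (7900, 30, "CLERK", 950),
    (7902, 20, "ANALYST", 3000), (7934, 10, "CLERK", 1300),
]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, deptno INTEGER, job TEXT, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)", EMP)

# Standard aggregate (SQL 3): one row per (deptno, job) group.
grouped = conn.execute(
    "SELECT deptno, job, SUM(sal) FROM emp GROUP BY deptno, job"
).fetchall()

# Analytical aggregate (SQL 4): one row per employee, each carrying its group's sum.
detailed = conn.execute(
    "SELECT empno, deptno, job, "
    "SUM(sal) OVER (PARTITION BY deptno, job) AS sum_sal FROM emp"
).fetchall()

group_sums = {(deptno, job): total for deptno, job, total in grouped}
print(len(grouped), "grouped rows,", len(detailed), "detail rows")
for empno, deptno, job, sum_sal in detailed:
    # Every detail row repeats the value the GROUP BY produced for its group.
    assert sum_sal == group_sums[(deptno, job)]
```

The GROUP BY query returns fewer rows than the table holds, while the analytical query returns one row per employee with the group total repeated on each.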

Partitions
The analytical functions operate on groups of rows called partitions. The syntax for the SUM analytical function is as
follows:

SUM (column/expression) OVER ( [PARTITION BY col/express, [col/express, …] ] )

The PARTITION clause is optional. If the PARTITION clause is not used the set of rows operated on by the analytical
function is the entire result set. This is analogous to the standard aggregate when there is no GROUP BY clause. For
example, the following uses the SUM analytical function to retrieve the total salaries for all EMP rows, and lists it with
each individual EMP row, allowing us to determine what percentage of the total salaries an EMP’s salary is.

SELECT empno, (sal/SUM(sal) OVER () ) AS percent
FROM emp
SQL 5

EMPNO PERCENT
7369 .027562446
7499 .055124892
7521 .043066322
7566 .102497847
7654 .043066322
7698 .098191214
7782 .084409991
7788 .103359173
7839 .172265289
7844 .051679587
7876 .037898363
7900 .032730405
7902 .103359173
7934 .044788975
Table 3

Compare this solution with the solution used in SQL 1.
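
The department-level percentage that SQL 1 needed a view for can be sketched the same way. The following is an illustration only, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) in place of ORACLE; the inline view and the column name dept_sal are my own choices, not part of the original SQL 1.

```python
import sqlite3

# (deptno, sal) pairs from the article's EMP table.
EMP = [
    (20, 800), (30, 1600), (30, 1250), (20, 2975), (30, 1250), (30, 2850),
    (10, 2450), (20, 3000), (10, 5000), (30, 1500), (20, 1100), (30, 950),
    (20, 3000), (10, 1300),
]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (deptno INTEGER, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", EMP)

# Group once per department, then let SUM(...) OVER () supply the company total
# on every row, so no pre-built view is needed.
dept_pct = conn.execute(
    """
    SELECT deptno, dept_sal,
           100.0 * dept_sal / SUM(dept_sal) OVER () AS pct
    FROM (SELECT deptno, SUM(sal) AS dept_sal
          FROM emp
          GROUP BY deptno)
    ORDER BY deptno
    """
).fetchall()

for deptno, dept_sal, pct in dept_pct:
    print(deptno, dept_sal, round(pct, 2))
```

The empty OVER () makes the whole result set the partition, which is what replaces the one-row view.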

Execution Plan
So what’s really happening within ORACLE? The execution plan for SQL 4 in figure 1 shows the sorting used to produce
the output of the analytical function in step 2. After the normal criteria and grouping (if a GROUP BY is part of the
syntax), a scan and sort is performed on the result set to produce the analytical function output.

Figure 1

Top or Bottom N Values


The top or bottom refers to the rows in a result set that either have the largest (top) or smallest (bottom) values. For
instance, in sales it’s important to be able to identify things such as:
• top n selling products
• top n selling regions
• top n salesmen
• bottom n selling products
• etc

SQL 2 above illustrates retrieving the top 2 highest paid employees in the EMP table. ORACLE now provides two
analytical functions that rank the rows in the result set based on a set of columns. There are two functions because
ranking semantics fall into two categories: one where rank values are skipped due to ties and one where no values are
skipped. The functions are RANK and DENSE_RANK. DENSE_RANK is the one that doesn’t skip values.

To illustrate the idea of skipping values, the following SQL ranks the EMP rows by SAL using both functions.

SELECT empno, sal, RANK() OVER ( ORDER BY sal) Rank_Values,
       DENSE_RANK () OVER (ORDER BY sal) Dense_Rank_Values
FROM emp
SQL 6

The results are displayed in Table 4. Check where the SAL values are the same. The first location is highlighted in yellow:
both EMPNO = 7521 and 7654 have a SAL of 1250. Both RANK and DENSE_RANK give the tied SAL values the
same rank; it’s the subsequent SAL values where the ranking differs. With RANK, since two rows tie for a rank
of 4, the rank of 5 is skipped, making 6 the next rank value, whereas with DENSE_RANK rank values are not skipped.
That next row (EMPNO = 7934) is highlighted in green (dark shading).

EMPNO SAL RANK DENSE_RANK
7369 800 1 1
7900 950 2 2
7876 1100 3 3
7521 1250 4 4
7654 1250 4 4
7934 1300 6 5
7844 1500 7 6
7499 1600 8 7
7782 2450 9 8
7698 2850 10 9
7566 2975 11 10
7788 3000 12 11
7902 3000 12 11
7839 5000 14 12
Table 4

It should be noted that providing ties with the same rank is important, since they both have the same value.
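
The skipped and unskipped rank values in Table 4 can be reproduced with a small sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for ORACLE; RANK and DENSE_RANK are written exactly as in SQL 6, and the outer ORDER BY is added only to make the printed order stable.

```python
import sqlite3

# (empno, sal) pairs from the article's EMP table.
EMP = [
    (7369, 800), (7499, 1600), (7521, 1250), (7566, 2975), (7654, 1250),
    (7698, 2850), (7782, 2450), (7788, 3000), (7839, 5000), (7844, 1500),
    (7876, 1100), (7900, 950), (7902, 3000), (7934, 1300),
]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", EMP)

rows = conn.execute(
    """
    SELECT empno, sal,
           RANK()       OVER (ORDER BY sal) AS rank_value,
           DENSE_RANK() OVER (ORDER BY sal) AS dense_rank_value
    FROM emp
    ORDER BY sal, empno
    """
).fetchall()

# Map each employee to its (RANK, DENSE_RANK) pair for easy lookup.
ranks = {empno: (rnk, dense) for empno, _sal, rnk, dense in rows}
for row in rows:
    print(row)
```

Looking up 7521 and 7654 in ranks gives the same pair for both, and 7934 is where the two functions start to disagree.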

The syntax for the RANK and DENSE_RANK functions is the same. The syntax follows:

RANK () OVER ([PARTITION BY col/express [,col/express, …] ]
              ORDER BY col/express [,…] [ASC|DESC] [NULLS FIRST|NULLS LAST])

The RANK function itself does not take an argument. As always, the PARTITION clause, which groups rows of the result
set for the input to the analytical function, is optional. If omitted, the entire result set is the partition. RANK and
DENSE_RANK require an ORDER BY, since the rows must be sorted by the columns the ranking is applied
to. As with the standard ORDER BY, the sort direction can be specified with ASC or DESC for each ORDER BY
column/expression. And also like the standard ORDER BY, nulls can appear last or first for each ORDER BY item. The
default depends on whether you are ordering by ASC or DESC. If ordering by ASC, by default nulls will appear last, and
the reverse for DESC.

Note that the Data Warehousing Guide shows the “[collate clause]”. Who knows what they were thinking, but just
disregard it.

From Query
Ever wonder why ORACLE introduced the ability to place a SQL statement in the FROM clause of a SQL module?
Initially it provided a means of sidestepping the creation of a view. The real significance is the ability to filter the results
of a SQL statement relative to the selected items. This becomes especially important with analytical functions, since they
cannot appear in the WHERE clause. The work-around is to embed the SQL in the FROM clause of another SQL module
and then reference the result set in the WHERE clause. For instance, the analogous SQL to produce the results of SQL 2
appears in SQL 7.

SELECT empno, sal, rank_value
FROM (SELECT empno, sal,
             RANK() OVER ( ORDER BY sal DESC) AS rank_value
      FROM emp)
WHERE rank_value <=2
SQL 7

The main query, whose results are ranked, is the SELECT embedded in the FROM clause. In order to return only the top 2,
that query must be embedded in a FROM clause so that the outer WHERE clause can filter the rows.

First note that the intention is to return the top 2 paid employees. That means we must sort by SAL, and the sort must be
in descending order since the first sorted row will get the rank of 1. If the rows are sorted in ascending order the rank of 1
identifies the bottom paid employees.

EMPNO SAL RANK_VALUE


7839 5000 1
7788 3000 2
7902 3000 2
Table 5

To highlight the difference between RANK and DENSE_RANK, consider what SQL 7 would have produced if 2
employees tied for 1st place. Those 2 employees would both have a RANK value of 1, and EMPNO=7788 and 7902 would
have a RANK value of 3. But if we used the DENSE_RANK function, both EMPNO = 7788 and 7902 would have a
DENSE_RANK of 2. So the criterion “rank_value <=2” works for the RANK function, but would have produced the
wrong answer if DENSE_RANK were used.
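
The tie-for-first scenario is easy to try out. The sketch below is illustrative only: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later) instead of ORACLE, the five salary rows are hypothetical, and top_two is a helper name of my own.

```python
import sqlite3

# Hypothetical salaries: employees 1 and 2 tie for first place.
EMP = [(1, 5000), (2, 5000), (3, 3000), (4, 3000), (5, 2975)]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", EMP)


def top_two(rank_function):
    """Return the empnos whose rank under rank_function is <= 2 (SQL 7 pattern)."""
    sql = f"""
        SELECT empno
        FROM (SELECT empno,
                     {rank_function}() OVER (ORDER BY sal DESC) AS rank_value
              FROM emp)
        WHERE rank_value <= 2
        ORDER BY empno
    """
    return [empno for (empno,) in conn.execute(sql)]


with_rank = top_two("RANK")              # rank values are 1, 1, 3, 3, 5
with_dense_rank = top_two("DENSE_RANK")  # rank values are 1, 1, 2, 2, 3
print(with_rank, with_dense_rank)
```

With RANK the filter keeps only the two tied leaders; with DENSE_RANK it also lets through the two employees in second place, which is four rows for a "top 2" request.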

Certain types of top-bottom queries are more complex when the top or bottom members are based on an aggregate. For
example, the TIME_SHEETS table lists the hours worked per project per employee. To list the top 5 employees who
worked the most hours would require the following solution:

SELECT *
FROM
  (SELECT emp_seq , SUM (hours ) AS sum_hrs
   FROM time_sheets
   GROUP BY emp_seq )
WHERE 5 >
  (SELECT count (count (* ) )
   FROM time_sheets
   GROUP BY emp_seq
   HAVING SUM (hours ) > sum_hrs )
SQL 8

Unfortunately SQL 8’s execution would not finish in “your lifetime”. (I executed the SQL for over 24 hours and then
cancelled.) SQL 8 requires grouping the entire table once and then, for each employee, the correlated
subquery has to recompute the total hours per employee and keep only those who worked more hours. The
count is then compared against 5, since we want to list only the top 5 workers. To make this work you have no choice but
to embed the initial GROUP BY in the FROM clause of the main SQL module, otherwise there is no way to reference the
sum of hours for an employee in the correlated subquery.

A more efficient solution follows:

SELECT *
FROM (SELECT emp_seq, SUM(hours),
RANK () OVER (ORDER BY SUM(hours) DESC) AS rnk
FROM time_sheets
GROUP BY emp_seq)
WHERE rnk <= 5
SQL 9

Only one grouping of the data is necessary. And performance is reasonable for a TIME_SHEETS table with 13,939,925
rows. The execution statistics are listed in figure 2.

Figure 2

Note that SQL 8 and 9 did not account for NULLs. In both cases you can simply eliminate the NULLs with a WHERE
clause, or in SQL 8, you can order the results and request NULLs to appear first or last. The ORDER BY clause in SQL 9
allows the same type of NULL handling.

One final example ranks the employees by their hiredate and birthdate in descending order enabling us to obtain the last
10 employees hired, and if there is a tie, the youngest employee is ranked lower. SQL 10 below uses standard SQL.

SELECT emp_seq, hiredate, birthdate
FROM employees e1
WHERE 10 > (SELECT count(*) FROM employees e2
            WHERE e2.hiredate > e1.hiredate
            OR (e2.hiredate = e1.hiredate AND e2.birthdate < e1.birthdate))
SQL 10

SQL 10 is not intuitive to specify, though it does make sense if you consider the request carefully. It’s
basically the subquery that’s difficult. As with the RANK function, if the primary column, HIREDATE, is equal, then the
tie breaker is the BIRTHDATE column. So we OR a criterion stating that if the HIREDATEs are equal, then the
BIRTHDATE of the subquery must be less than that of the outer query. In other words, the subquery returns, for each
employee in the outer query, the number of employees that have a more recent hiredate plus, when the hiredate is the
same, the number of employees with a lesser birthdate. The complexity only increases as the number of columns involved
in the ranking increases. But not so with the RANK function. SQL 11 accomplishes the same task and is trivial compared to
SQL 10. Adding more columns for the ranking only means adding the column to the ORDER BY clause of the RANK.

SELECT /*+ ALL_ROWS */ *
FROM (SELECT emp_seq, hiredate, birthdate,
             RANK() OVER (ORDER BY hiredate DESC, birthdate ASC) rnk
      FROM employees)
WHERE rnk <= 10
SQL 11

The execution of SQL 10 was over 30 minutes while using the RANK function in SQL 11 took a fraction of a second.
(The EMPLOYEES table contains 15,000 rows.)

Ranking Subtotals
When performing data analysis using the CUBE or ROLLUP functions, often it’s the subtotals and totals that need to be
ranked. The key to specifying the ranking involves the GROUPING function which allows us to determine when the row
contains a subtotal or total. GROUPING of a column that is part of the ORDER BY clause of the RANK function returns
1 when the NULL is due to a subtotal or total.

Using the EMP and DEPT tables the listing of the average salary by department, all departments, job and all jobs is
simple. To filter out the details, use the HAVING clause.

SELECT DECODE(GROUPING(dname), 1, 'All Departments', dname) AS dname,
DECODE(GROUPING(job), 1, 'All Jobs', job) AS job,
COUNT(*) "Total Empl", AVG(sal) * 12 "Average Sal",
RANK() OVER (PARTITION BY GROUPING(dname), GROUPING(job)
ORDER BY COUNT(*) DESC) AS rnk
FROM emp, dept
WHERE dept.deptno = emp.deptno
GROUP BY CUBE (dname, job)
HAVING GROUPING(dname) = 1 OR GROUPING(job) = 1
SQL 12

DNAME JOB Total Empl Average Sal RNK


SALES All Jobs 6 18800 1
RESEARCH All Jobs 5 26100 2
ACCOUNTING All Jobs 3 35000 3
All Departments CLERK 4 12450 1
All Departments SALESMAN 4 16800 1
All Departments MANAGER 3 33100 3
All Departments ANALYST 2 36000 4
All Departments PRESIDENT 1 60000 5
All Departments All Jobs 14 24878.5714 1
Table 6

Windowing Functions
Certain analytical functions operate on a subset of rows within a partition. These subsets are referred to as windows.
There are two types of windows that can be specified: physical and logical. Physical means a specific number of
rows, whereas logical means the window is based on the ORDER BY value (only one column/expression can occur in the
ORDER BY in certain circumstances). The syntax to specify a window follows the ORDER BY syntax (the ORDER BY
is mandatory):

ROWS | RANGE {{UNBOUNDED PRECEDING | <value expression4> PRECEDING}
             | BETWEEN {UNBOUNDED PRECEDING | <value expression4> PRECEDING}
               AND {CURRENT ROW | <value expression4> FOLLOWING}}

The ROWS keyword refers to a physical window and RANGE to a logical window. The other keywords are relative to the
current row. But it’s the current row that has different meanings for physical and logical windows.

Logical Windows
To better understand the difference between physical and logical windows, let’s start with the logical window, since
physical windows should be simple enough to understand.

The following query uses the EMP table to list the sum of salaries for employees with a lower or equal salary. The logical
window only specifies an upper limit.

SELECT empno, sal,
       SUM(sal) OVER (ORDER BY sal
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_sal
FROM emp
SQL 13

The results are in table 7 below:

EMPNO SAL SUM_SAL
7369 800 800
7900 950 1750
7876 1100 2850
7521 1250 5350
7654 1250 5350
7934 1300 6650
7844 1500 8150
7499 1600 9750
7782 2450 12200
7698 2850 15050
7566 2975 18025
7788 3000 24025
7902 3000 24025
7839 5000 29025
Table 7

The rows in yellow (shading), EMPNO = 7521 and 7654, both have the same SUM_SAL value. This is the key to
understanding logical windows. The point here is that CURRENT ROW refers to all rows that have the same value of the
ORDER BY column. Since both highlighted employees have the same SAL, both values are added to the sum for EMPNO=7521.
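
One way to see what CURRENT ROW means in a logical window is to compute the same running sum with RANGE and with ROWS. This is a minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) rather than ORACLE, and using only the six lowest-paid EMP rows from Table 7. The extra empno in the ROWS ordering is my addition so that the tied rows come back in a fixed order.

```python
import sqlite3

# The six lowest-paid EMP rows from the article: (empno, sal).
EMP = [(7369, 800), (7900, 950), (7876, 1100),
       (7521, 1250), (7654, 1250), (7934, 1300)]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", EMP)

rows = conn.execute(
    """
    SELECT empno, sal,
           SUM(sal) OVER (ORDER BY sal
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum,
           SUM(sal) OVER (ORDER BY sal, empno
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum
    FROM emp
    ORDER BY sal, empno
    """
).fetchall()

# Map each employee to its (logical sum, physical sum) pair.
sums = {empno: (range_sum, rows_sum) for empno, _sal, range_sum, rows_sum in rows}
for row in rows:
    print(row)
```

For the tied salaries the RANGE column includes both 1250 values on each of the two rows, while the ROWS column adds them one at a time.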

To further illustrate the point, the following query computes the sum of the DEPTNO values (forget the query makes no
sense).

SELECT empno, deptno, sal,
       SUM(deptno) OVER (ORDER BY sal
                         RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_deptno
FROM emp
SQL 14

The results follow:

EMPNO DEPTNO SAL SUM_DEPTNO


7369 20 800 20
7900 30 950 50
7876 20 1100 70
7521 30 1250 130
7654 30 1250 130
7934 10 1300 140
7844 30 1500 170
7499 30 1600 200
7782 10 2450 210
7698 30 2850 240
7566 20 2975 260
7788 20 3000 300
7902 20 3000 300
7839 10 5000 310
Table 8

Other rows share the DEPTNO value of 30 with the yellow (light shaded) highlighted row in table 8, but the window is
based on equal or lesser values of SAL, since the ORDER BY is on SAL. The red (dark shaded) rows have the same SAL
value, so the SUM_DEPTNO value is the same for both rows.

Date Intervals
If the ORDER BY is over a date column, it would be helpful to specify an interval without having to consider the actual
physical values. When using a logical window specification (only with logical windows) and the ORDER BY
column/expression is a date, you can easily specify date intervals in terms of days, months or years. This feature gives you
the ability to specify sliding date windows for requests, such as summarizing outstanding invoices. Combine this with the
CASE function and you can easily request invoices “30 days outstanding”, “60 days…”, etc.

To illustrate some of the interval syntax, I downloaded historical stock pricing for ORCL from ’01-Dec-00’ to
’14-Dec-01’. The moving average for 30 days is returned in SQL 15, along with the average for the next 30 days from the
current date.

SELECT quote_date, close,
       AVG(close) OVER (ORDER BY quote_date
                        RANGE INTERVAL '30' DAY PRECEDING) AS prv_30,
       AVG(close) OVER (ORDER BY quote_date
                        RANGE BETWEEN CURRENT ROW
                        AND INTERVAL '30' DAY FOLLOWING) AS fol_30
FROM stock_quotes
SQL 15

When BETWEEN is not used, ORACLE treats the value supplied as the start-point and the current row as the end-point.
So PRV_30 averages the stock prices from 30 days preceding the current row through the current row. FOL_30 averages
the prices from the current row through 30 days following.

If you want to compare PRV_30 and FOL_30, embed the SQL in a FROM clause. For example if SQL 15 was embedded
in a FROM clause, a criterion could be applied to the outer query to return only those rows where the difference between
PRV_30 and FOL_30 is more than 25% of PRV_30. Other types of analysis can easily be performed to compare an
increase in the moving average with the change in volume.

The Data Warehousing Guide illustrates the INTERVAL syntax using DAYS/MONTHS/YEARS. Drop the S in the time
categories to compile without error. I couldn’t find anything in the SQL Reference Manual.

ORACLE provides two other functions to assist in the specification of a time interval; NUMTODSINTERVAL and
NUMTOYMINTERVAL. The syntax is as follows:

NUMTODSINTERVAL (n, 'DAY|HOUR|MINUTE|SECOND')

NUMTOYMINTERVAL (n, 'YEAR|MONTH')

The DS in NUMTODSINTERVAL stands for Day to Second, and the YM in NUMTOYMINTERVAL stands for Year to Month.
The first parameter does not have to be a constant; if you want to use a numeric column as the first parameter of
NUMTODSINTERVAL, you can. Using the STOCK_QUOTES table, you can specify a logical window as:

RANGE NUMTODSINTERVAL (open, 'DAY') PRECEDING

The Unwritten Documentation


I only hope the folks that write the instructions for nuclear power plants are better than Oracle’s documentation crew. The
following query drove me crazy trying to figure out what in the world was happening. It deals with a logical window
defined by ‘n’ PRECEDING or FOLLOWING. SQL 16 below was initially used to test the features.

SELECT emp_seq, effective_date, sal,
MAX(sal) OVER (ORDER BY effective_date DESC
RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS Max_Sal
FROM sal_history
SQL 16

So in the logical world, what does “1 PRECEDING” mean? Using the previous knowledge that was also not documented
well, the CURRENT ROW should refer to the group of rows having the same EFFECTIVE_DATE since that’s what we
ordered by. Does ‘1 PRECEDING’ mean the previous logical group? The results of the query are displayed in table 9.

EMP_SEQ EFFECTIVE_DATE SAL MAX_SAL


1015 11-JAN-01 500 500
1001 06-JAN-01 300 300
1003 06-JAN-01 200 300
1015 06-JAN-01 300 300
1001 01-JAN-01 200 200
1003 01-JAN-01 100 200
1002 01-JAN-01 150 200
1015 01-JAN-01 200 200
1001 22-DEC-00 100 1000
1007 22-DEC-00 400 1000
1009 22-DEC-00 1000 1000
Table 9

The rows in the same logical group are highlighted with the same color. If ‘1 PRECEDING’ actually meant one logical
row preceding the current row, then MAX_SAL for EMP_SEQ 1001 on 06-JAN-01 should be 500, but instead it’s 300, which
is the maximum SAL for that logical group. The same goes for all the other logical groups.

So to make sense out of this, you first have to consider what the rows in the partition are ordered by; a date column. It
turns out that since the sort column is a date column ‘1 PRECEDING’ means ‘1 DAY PRECEDING’. To check this out,
change the 1 to a 5 since ’11-JAN-01’ is 5 days after ’06-JAN-01’.

SELECT emp_seq, effective_date, sal,
       MAX(sal) OVER (ORDER BY effective_date DESC
                      RANGE BETWEEN 5 PRECEDING AND CURRENT ROW) AS Max_Sal
FROM sal_history
SQL 17

EMP_SEQ EFFECTIVE_DATE SAL MAX_SAL


1015 11-JAN-01 500 500
1001 06-JAN-01 300 500
1003 06-JAN-01 200 500
1015 06-JAN-01 300 500
1001 01-JAN-01 200 300
1003 01-JAN-01 100 300
1002 01-JAN-01 150 300
1015 01-JAN-01 200 300
1001 22-DEC-00 100 1000
1007 22-DEC-00 400 1000
1009 22-DEC-00 1000 1000
Table 10

Now what happens when the ORDER BY column is a numeric? The following is similar to SQL 17 except the ORDER
BY is by SAL.

SELECT emp_seq, effective_date, sal,
       MAX(sal) OVER (ORDER BY sal DESC
                      RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS Max_Sal
FROM sal_history
SQL 18

If you look at the results it’s clear that ‘1 PRECEDING’ doesn’t mean 1 logical row. Just like the date field, it means
units of the ORDER BY value, here units of SAL. SQL 19 uses a value of 100.

SELECT emp_seq, effective_date, sal,
       MAX(sal) OVER (ORDER BY sal DESC
                      RANGE BETWEEN 100 PRECEDING AND CURRENT ROW) AS Max_Sal
FROM sal_history
SQL 19

The results are:

EMP_SEQ EFFECTIVE_DATE SAL MAX_SAL


1009 22-DEC-00 1000 1000
1015 11-JAN-01 500 500
1007 22-DEC-00 400 500
1001 06-JAN-01 300 400
1015 06-JAN-01 300 400
1003 06-JAN-01 200 300
1015 01-JAN-01 200 300
1001 01-JAN-01 200 300
1002 01-JAN-01 150 200
1003 01-JAN-01 100 200
1001 22-DEC-00 100 200
Table 11
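
The value-based reading of ‘100 PRECEDING’ can be checked with a small sketch. It assumes Python's built-in sqlite3 module in place of ORACLE, and it needs SQLite 3.28 or later, the first release that accepts a numeric offset in a RANGE window. Only the SAL column of SAL_HISTORY is loaded, with the values taken from Table 11.

```python
import sqlite3

# SAL values from the article's SAL_HISTORY listing (Table 11).
SALS = [1000, 500, 400, 300, 300, 200, 200, 200, 150, 100, 100]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sal_history (sal INTEGER)")
conn.executemany("INSERT INTO sal_history VALUES (?)", [(s,) for s in SALS])

# With ORDER BY sal DESC, "100 PRECEDING" reaches back to rows whose SAL is
# up to 100 units higher than the current row's SAL, not to 100 rows.
rows = conn.execute(
    """
    SELECT sal,
           MAX(sal) OVER (ORDER BY sal DESC
                          RANGE BETWEEN 100 PRECEDING AND CURRENT ROW) AS max_sal
    FROM sal_history
    ORDER BY sal DESC
    """
).fetchall()

# Rows with the same SAL get the same MAX_SAL, so a dict per SAL is enough.
max_by_sal = {sal: max_sal for sal, max_sal in rows}
print(max_by_sal)
```

Each SAL maps to the largest SAL that is at most 100 units above it, which matches the MAX_SAL column of Table 11.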

Now the results make sense. Logical appears to always refer to the value of the ORDER BY. That might explain why
logical windows are limited to one ORDER BY column/expression when a specific numeric value is given for the
PRECEDING keyword. The next logical question is: what about sorting by a character column? This is something else that
is never mentioned in the manuals. I tried the following SQL to see what it would generate.

SELECT empno, job, MAX(sal) OVER (ORDER BY job
                                  RANGE 1 PRECEDING) max_job
FROM emp
SQL 20

And all it generated was error “ORA-00902: invalid datatype”. So I guess we should assume that you just can’t do that;
but as you’ll see you can sort by character columns when the window is a physical window.

Physical Windows
Physical windows are pretty straightforward, except for when the window is limited by the number of rows. For instance,
you can specify the end-points as either the boundaries of the partition, or a specified number of rows. Just use ROWS
instead of RANGE to indicate a physical window. SQL 20 is rewritten below as a physical window instead:

SELECT empno, job, MAX(sal) OVER (ORDER BY job
                                  ROWS 1 PRECEDING) max_job
FROM emp
SQL 21

The results are as you would expect. So where would you use a physical window? A good example is historical data. For
example, the SAL_HISTORY table contains a history of all salaries per employee. Determining the amount of each raise
requires sorting the rows per employee in descending order and then comparing the current row with the next row. Since
the last row in each partition (by EMP_SEQ) is the first salary assigned to the employee, there was no raise, so the
computation returns zero for it. We must eliminate the last row of each partition.

The LAST_VALUE function allows us to select the last row in the window. FIRST_VALUE selects the first row.

SELECT emp_seq, sal, effective_date,
       sal - LAST_VALUE(sal) OVER
           (PARTITION BY emp_seq ORDER BY effective_date DESC
            ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS raise,
       MIN(effective_date) OVER (PARTITION BY emp_seq ORDER BY effective_date)
           AS first_sal
FROM sal_history
SQL 22

The MIN function is included to get the date on which each employee was first given a salary. We can compare that
with the EFFECTIVE_DATE; if they are equal, we don’t return the row. Table 12 shows the data returned by SQL 22.

EMP_SEQ SAL EFFECTIVE_DATE RAISE FIRST_SAL


1001 300 06-JAN-01 100 22-DEC-00
1001 200 01-JAN-01 100 22-DEC-00
1001 100 22-DEC-00 0 22-DEC-00
1002 150 01-JAN-01 0 01-JAN-01
1003 200 06-JAN-01 100 01-JAN-01
1003 100 01-JAN-01 0 01-JAN-01
1007 400 22-DEC-00 0 22-DEC-00
1009 1000 22-DEC-00 0 22-DEC-00
1015 500 11-JAN-01 200 01-JAN-01
1015 300 06-JAN-01 100 01-JAN-01
1015 200 01-JAN-01 0 01-JAN-01
Table 12

Each partition is shaded in a different color. The first SAL_HISTORY row for each employee has the
EFFECTIVE_DATE and FIRST_SAL in bold making it easy to see which row to exclude.

Recall that in order to compare the aggregate with the column we need to embed the query in a FROM clause and then use
a WHERE clause to filter out the first SAL_HISTORY row per employee. The final solution is SQL 23.

SELECT *
FROM (SELECT emp_seq, sal, effective_date,
             sal - LAST_VALUE(sal) OVER
                 (PARTITION BY emp_seq ORDER BY effective_date DESC
                  ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS raise,
             MIN(effective_date) OVER (PARTITION BY emp_seq ORDER BY effective_date)
                 AS first_sal
      FROM sal_history)
WHERE effective_date != first_sal
SQL 23

How would you specify that query without the analytical functions? And more important, how much of a performance
gain do you get? SQL 24 performs the same task as SQL 23 but doesn’t use analytical functions. It requires a self-join in
order to get a SAL_HISTORY row joined to the previous SAL_HISTORY row. The self-join isn’t simple because the
EFFECTIVE_DATEs have to be joined via a correlated subquery.

SELECT s2.effective_date, s2.sal, s2.sal - s1.sal AS raise
FROM sal_history s1, sal_history s2
WHERE s1.emp_seq = s2.emp_seq
AND s1.effective_date = (SELECT MAX(effective_date) FROM sal_history
                         WHERE emp_seq = s2.emp_seq
                         AND effective_date < s2.effective_date)
SQL 24

Figure 3 shows the execution statistics where “SQL 1: /TUTORIAL” in the figure is SQL 24 above, and “SQL 2:
/TUTORIAL” is SQL 23 above. The performance is significantly better in all aspects.

Figure 3
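
The two formulations can be checked against each other on the small SAL_HISTORY sample from Table 12. This is a sketch under assumptions, not a reproduction of the timing test: it uses Python's built-in sqlite3 module (SQLite 3.25 or later) instead of ORACLE, stores dates as ISO text, renames the RAISE alias to raise_amt, and adds emp_seq to the self-join's select list so the two result sets can be compared.

```python
import sqlite3

# SAL_HISTORY rows from the article, with dates written as ISO strings so
# that plain string comparison orders them chronologically.
SAL_HISTORY = [
    (1001, "2000-12-22", 100), (1001, "2001-01-01", 200), (1001, "2001-01-06", 300),
    (1002, "2001-01-01", 150),
    (1003, "2001-01-01", 100), (1003, "2001-01-06", 200),
    (1007, "2000-12-22", 400),
    (1009, "2000-12-22", 1000),
    (1015, "2001-01-01", 200), (1015, "2001-01-06", 300), (1015, "2001-01-11", 500),
]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sal_history (emp_seq INTEGER, effective_date TEXT, sal INTEGER)")
conn.executemany("INSERT INTO sal_history VALUES (?, ?, ?)", SAL_HISTORY)

# SQL 23 pattern: physical window of the current row plus the next (older) row.
analytic = conn.execute(
    """
    SELECT emp_seq, effective_date, raise_amt
    FROM (SELECT emp_seq, sal, effective_date,
                 sal - LAST_VALUE(sal) OVER
                     (PARTITION BY emp_seq ORDER BY effective_date DESC
                      ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS raise_amt,
                 MIN(effective_date) OVER
                     (PARTITION BY emp_seq ORDER BY effective_date) AS first_sal
          FROM sal_history)
    WHERE effective_date != first_sal
    ORDER BY emp_seq, effective_date
    """
).fetchall()

# SQL 24 pattern: self-join through a correlated subquery.
self_join = conn.execute(
    """
    SELECT s2.emp_seq, s2.effective_date, s2.sal - s1.sal AS raise_amt
    FROM sal_history s1, sal_history s2
    WHERE s1.emp_seq = s2.emp_seq
      AND s1.effective_date = (SELECT MAX(effective_date) FROM sal_history
                               WHERE emp_seq = s2.emp_seq
                                 AND effective_date < s2.effective_date)
    ORDER BY s2.emp_seq, s2.effective_date
    """
).fetchall()

print(analytic == self_join)
for row in analytic:
    print(row)
```

Both queries list one row per raise, and employees with a single salary record drop out of both.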

Defaults
If you look carefully you’ll find a small note in the Data Warehousing Guide indicating what the default is when the
windowing clause is omitted from a windowing function. The default is:

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

But this only occurs for a windowing function. RANK, for instance, is not a windowing function, while some functions
such as SUM, AVG, MIN, etc. can be used either as a windowing function or not. This makes it difficult to know what
will happen by default.

The following SQL uses the SUM function but does not specify a PARTITION, ORDER BY or windowing clause. By
default, the PARTITION is the entire result set.

SELECT deptno, ename, sal,
       SUM(sal) OVER () AS tot_sal
FROM emp
SQL 25

The result of the SUM function in SQL 25 is the total of all salaries. The default windowing clause does not apply here
because there is no ORDER BY. Recall that to specify a windowing clause, you must have an ORDER BY clause.

By adding an ORDER BY clause to SQL 25, we get SQL 26:

SELECT deptno, ename, sal,
       SUM(sal) OVER ( ORDER BY sal) AS tot_sal
FROM emp
SQL 26

The result set is displayed in table 13.

DEPTNO ENAME SAL TOT_SAL


20 SMITH 800 800
30 JAMES 950 1750
20 ADAMS 1100 2850
30 WARD 1250 5350
30 MARTIN 1250 5350
10 MILLER 1300 6650
30 TURNER 1500 8150
30 ALLEN 1600 9750
10 CLARK 2450 12200
30 BLAKE 2850 15050
20 JONES 2975 18025
20 SCOTT 3000 24025
20 FORD 3000 24025
10 KING 5000 29025
Table 13

It looks like TOT_SAL is a running total, but not exactly. Because an ORDER BY is specified, the default windowing is
applied. That means for a given row, all SAL values from the beginning of the partition up to the current row will be
summed.

Because the default window is a logical window, rows with duplicate ORDER BY values are treated as a
single row. For example, the yellow shaded rows (WARD and MARTIN) have the same SAL, 1250. Therefore both rows have
the same TOT_SAL value, which is the sum of all SAL values for rows prior to the SAL equal to 1250, plus the sum of the
rows with a SAL of 1250.
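
The effect of the default can be made visible by putting three forms of the same SUM in one query. The sketch assumes Python's built-in sqlite3 module (SQLite 3.25 or later), whose documented default frame matches the one quoted above. It stands in for ORACLE and uses only the six lowest-paid employees from Table 13, so the grand total differs from the article's.

```python
import sqlite3

# The six lowest-paid employees from Table 13: (ename, sal).
EMP = [("SMITH", 800), ("JAMES", 950), ("ADAMS", 1100),
       ("WARD", 1250), ("MARTIN", 1250), ("MILLER", 1300)]

# SQLite is only a stand-in for ORACLE here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", EMP)

rows = conn.execute(
    """
    SELECT ename, sal,
           SUM(sal) OVER ()             AS no_order_by,
           SUM(sal) OVER (ORDER BY sal) AS default_window,
           SUM(sal) OVER (ORDER BY sal
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS explicit_window
    FROM emp
    ORDER BY sal, ename
    """
).fetchall()

# Index the three sums by employee name.
by_name = {ename: (total, dflt, explicit) for ename, _sal, total, dflt, explicit in rows}
for row in rows:
    print(row)
```

Without an ORDER BY every row shows the grand total of the sample; with an ORDER BY and no windowing clause the values are identical to the explicitly written RANGE window, including the shared value for WARD and MARTIN.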

Other Analytical Functions

RATIO_TO_REPORT
The RATIO_TO_REPORT function computes the ratio of the column/expression to the total of the
column/expression for all rows in the partition. An ORDER BY is not permitted, which in turn means a window clause is
not permitted.

In the following example, we query the total hours per employee per project, and also list what portion of the total hours
worked by an employee was spent on each project. ORACLE must first compute the sum of hours per project
per employee and then total those sums before comparing the hours on a project to the total. Note that the parameter to the
function does not have to be listed separately on the SELECT list, as was done in SQL 27. Also note that the parameter to
RATIO_TO_REPORT is an aggregate.

SELECT emp_seq, proj_seq, SUM(hours) AS sum_hrs,
       RATIO_TO_REPORT(SUM(hours))
           OVER ( PARTITION BY emp_seq ) AS ratio
FROM time_sheets
GROUP BY emp_seq, proj_seq
SQL 27

The partial results are listed in table 14 below.

EMP_SEQ  PROJ_SEQ  SUM_HRS       RATIO
-------  --------  -------  ----------
   2903        10       12          .6
   2903        11        8          .4
   2907        11       12           1
   2921        10        9  .310344828
   2921        11       12  .413793103
   2921        13        8  .275862069
   2934        10        8           1
   2941        11        8           1
   2945        11       10  .555555556
   2945        13        8  .444444444
Table 14
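To make the two-step computation concrete, here is a hedged sketch of the same arithmetic written without RATIO_TO_REPORT, which SQLite does not provide: each project's hours are divided by SUM(...) OVER (PARTITION BY emp_seq). It runs against an in-memory SQLite database from Python (my assumption, so that no ORACLE instance is needed). The individual time_sheets rows are hypothetical; they were chosen only so that the per-project sums match employees 2903 and 2921 in Table 14.

```python
import sqlite3

# Hypothetical detail rows (emp_seq, proj_seq, hours); only the sums come from Table 14.
TIME_SHEETS = [
    (2903, 10, 5), (2903, 10, 7), (2903, 11, 8),
    (2921, 10, 9), (2921, 11, 4), (2921, 11, 8), (2921, 13, 8),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE time_sheets (emp_seq INTEGER, proj_seq INTEGER, hours INTEGER)")
conn.executemany("INSERT INTO time_sheets VALUES (?, ?, ?)", TIME_SHEETS)

QUERY = """
SELECT emp_seq, proj_seq, sum_hrs,
       sum_hrs * 1.0 / SUM(sum_hrs) OVER (PARTITION BY emp_seq) AS ratio
FROM (SELECT emp_seq, proj_seq, SUM(hours) AS sum_hrs   -- step 1: hours per project
      FROM time_sheets
      GROUP BY emp_seq, proj_seq)
ORDER BY emp_seq, proj_seq                              -- step 2 is the windowed SUM
"""

rows = conn.execute(QUERY).fetchall()
for emp_seq, proj_seq, sum_hrs, ratio in rows:
    print(emp_seq, proj_seq, sum_hrs, round(ratio, 9))
```

The ratios within each employee always add up to 1, which is a handy sanity check when the partitioning is more complicated.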

LAG/LEAD
The LAG and LEAD functions are analogous to the FIRST_VALUE and LAST_VALUE functions, in the sense that each
of the functions returns a specific value from another row in a partition. FIRST_VALUE (a windowing function)
references the first row in the window and returns the value of the parameter. LAG (not a windowing function) can
reference any row previous to the current row using an optional offset.

LAG has 1 mandatory and 2 optional parameters. The first parameter is the item in a row to return; the second parameter
is the offset from the current row that identifies the row the value is returned from (1 if omitted); the last parameter is
the default to return if the offset moves outside of the partition (NULL if omitted).

SELECT emp_seq, effective_date, sal,
       LAG(sal,2) OVER (ORDER BY effective_date) lg,
       FIRST_VALUE(sal) OVER (ORDER BY effective_date) fv
FROM sal_history
SQL 28

The FIRST_VALUE function in SQL 28 will always return the same value since there is only one partition. LAG will
always return the SAL value from 2 rows previous to the current row; for the first two rows there is no such row and
no default was supplied, so LAG returns NULL.
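The optional parameters are easier to follow with a few rows in front of you. The sketch below is an illustration under my own assumptions: the sal_history rows are invented, and the query is run from Python against an in-memory SQLite database (3.25 or later) rather than ORACLE. It shows LAG with and without a default, LEAD with the default offset of 1, and FIRST_VALUE for comparison.

```python
import sqlite3

# Hypothetical salary history for one employee (emp_seq, effective_date, sal).
SAL_HISTORY = [
    (101, "2000-01-01", 1000),
    (101, "2000-07-01", 1100),
    (101, "2001-01-01", 1250),
    (101, "2001-07-01", 1400),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sal_history (emp_seq INTEGER, effective_date TEXT, sal INTEGER)")
conn.executemany("INSERT INTO sal_history VALUES (?, ?, ?)", SAL_HISTORY)

QUERY = """
SELECT sal,
       LAG(sal, 2)      OVER (ORDER BY effective_date) AS lg,          -- NULL when out of range
       LAG(sal, 2, 0)   OVER (ORDER BY effective_date) AS lg_default,  -- 0 when out of range
       LEAD(sal)        OVER (ORDER BY effective_date) AS next_sal,    -- offset defaults to 1
       FIRST_VALUE(sal) OVER (ORDER BY effective_date) AS fv
FROM sal_history
ORDER BY effective_date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

NULL comes back to Python as None, so the first two rows show None in the lg column and 0 in lg_default.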

CASE
I have to mention this function because it can be a big help in specifying queries that need to group rows based on
complex criteria or specify complex criteria in the WHERE clause. For example, grouping unpaid invoices by the number
of days past due requires subtracting the invoice date from the current date, and then using that value to place the row in
categories such as "30 days late", "60 days late", etc.

SELECT CASE WHEN sysdate-inv_date > 90 THEN '90 days overdue'
            WHEN sysdate-inv_date > 60 THEN '60 days overdue'
            WHEN sysdate-inv_date > 30 THEN '30 days overdue'
            WHEN sysdate-inv_date > 0  THEN 'less than 30 days overdue' END
       AS period,
       SUM(amount) AS amount
FROM invoices
WHERE paid_date IS NULL
GROUP BY CASE WHEN sysdate-inv_date > 90 THEN '90 days overdue'
              WHEN sysdate-inv_date > 60 THEN '60 days overdue'
              WHEN sysdate-inv_date > 30 THEN '30 days overdue'
              WHEN sysdate-inv_date > 0  THEN 'less than 30 days overdue' END
SQL 29

The results are:

PERIOD                      AMOUNT
-------------------------   ------
30 days overdue               4301
60 days overdue               6255
90 days overdue               1012
less than 30 days overdue    10302
Table 15
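The bucketing logic can be tried out on a handful of rows. The following is a sketch under stated assumptions rather than the original script: the invoices rows are invented, a fixed AS_OF date stands in for sysdate so the output is repeatable, SQLite's julianday() replaces ORACLE date subtraction, and the query runs from Python against an in-memory SQLite database. I also compute the CASE once in an inline view and group by its alias, which avoids repeating the expression in the GROUP BY.

```python
import sqlite3

AS_OF = "2000-06-30"  # hypothetical stand-in for sysdate

# Hypothetical invoices (inv_date, paid_date, amount); the last one is already paid.
INVOICES = [
    ("2000-06-20", None, 100),
    ("2000-05-20", None, 200),
    ("2000-04-20", None, 300),
    ("2000-03-01", None, 400),
    ("2000-02-15", None, 50),
    ("2000-03-01", "2000-03-15", 999),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (inv_date TEXT, paid_date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", INVOICES)

QUERY = """
SELECT period, SUM(amount) AS amount
FROM (SELECT CASE WHEN days_late > 90 THEN '90 days overdue'
                  WHEN days_late > 60 THEN '60 days overdue'
                  WHEN days_late > 30 THEN '30 days overdue'
                  WHEN days_late > 0  THEN 'less than 30 days overdue' END AS period,
             amount
      FROM (SELECT julianday(?) - julianday(inv_date) AS days_late, amount
            FROM invoices
            WHERE paid_date IS NULL))
GROUP BY period
ORDER BY period
"""

totals = dict(conn.execute(QUERY, (AS_OF,)).fetchall())
for period, amount in totals.items():
    print(f"{period:<27}{amount:>6}")
```

Because the WHEN branches are evaluated in order, an invoice that is 121 days old lands in the first bucket only, even though it also satisfies the later conditions.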

Now imagine making this request without the CASE function. SQL 30, below, uses the DECODE function to obtain the
same results as SQL 29, but with a great deal more complexity. And imagine someone else trying to figure out what SQL
30 means after you leave.

SELECT DECODE(SIGN(sysdate-inv_date-90), -1, DECODE(SIGN(sysdate-inv_date-60), -1,
       DECODE(SIGN(sysdate-inv_date-30), -1, 'less than 30 days overdue',
       '30 days overdue'), '60 days overdue'), '90 days overdue') AS period,
       SUM(amount) AS amount
FROM invoices
WHERE paid_date IS NULL
GROUP BY DECODE(SIGN(sysdate-inv_date-90), -1, DECODE(SIGN(sysdate-inv_date-60), -1,
         DECODE(SIGN(sysdate-inv_date-30), -1, 'less than 30 days overdue',
         '30 days overdue'), '60 days overdue'), '90 days overdue')
SQL 30

Also note that the CASE function can appear within the WHERE clause, allowing the specification of complex criteria.

CUME_DIST
You all have been part of this type of ranking. When you got your SAT scores you know that you were in the top 10%
perhaps, or when you pay taxes you might feel better to at least know that you’re in the top 1% of income.

CUME_DIST is one of the analytical functions used to determine what fraction of the values in a sorted list come before,
or are equal to, the current value. The exact definition of the function is:

CUME_DIST(x) = (number of values in the set that come before x in the specified order, or are equal to x) / N

where N is the number of rows in the set.

The ORDER BY is mandatory, since a sorted list is required. The value of CUME_DIST ranges from greater than 0 to 1.
Using a simple table of student scores, the following query returns the CUME_DIST of the score for each student.

SELECT student_id, score, CUME_DIST() OVER (ORDER BY score)
FROM scores
SQL 31

The results of the query are:

STUDENT_ID  SCORE   CUME_DIST
----------  -----  ----------
         1     45  .083333333
         4     50  .166666667
         7     58         .25
         3     63  .333333333
        12     69  .416666667
         6     72          .5
         9     76  .583333333
         2     85         .75
         8     85         .75
        10     87  .833333333
        11     92  .916666667
         5     98           1
Table 16

The highest grade is identified by a CUME_DIST of 1. If the CUME_DIST column is multiplied by 100, then we have
the percentile. Student 5 would be in the 100th percentile, meaning that he did as well as or better than 100% of the students.

The difference between CUME_DIST and RANK is that RANK doesn't inform you of a row's value relative to
the rest; it simply gives the position of the value in the list. CUME_DIST, on the other hand, is a relative value: if an ASC
ORDER BY is used, it tells you what portion of the set has a value less than or equal to the row's value. If the
DESC option is used, it tells you what portion of the set has a value greater than or equal to the row's value.
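The contrast between the two functions, and between ASC and DESC, can be checked with the twelve scores listed above. This is a minimal sketch, assuming an in-memory SQLite database (3.25 or later) driven from Python in place of ORACLE; the column aliases cd_asc, cd_desc and rnk are mine.

```python
import sqlite3

# (student_id, score) pairs taken from the CUME_DIST results table.
SCORES = [
    (1, 45), (4, 50), (7, 58), (3, 63), (12, 69), (6, 72),
    (9, 76), (2, 85), (8, 85), (10, 87), (11, 92), (5, 98),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student_id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", SCORES)

QUERY = """
SELECT student_id, score,
       CUME_DIST() OVER (ORDER BY score)      AS cd_asc,   -- share scoring <= this score
       CUME_DIST() OVER (ORDER BY score DESC) AS cd_desc,  -- share scoring >= this score
       RANK()      OVER (ORDER BY score DESC) AS rnk       -- position only, 1 = best
FROM scores
ORDER BY score, student_id
"""

# Map student_id to (cd_asc, cd_desc, rnk).
by_student = {}
for student_id, score, cd_asc, cd_desc, rnk in conn.execute(QUERY):
    by_student[student_id] = (cd_asc, cd_desc, rnk)
    print(student_id, score, round(cd_asc, 9), round(cd_desc, 9), rnk)
```

Students 2 and 8, who both scored 85, come out with a cd_asc of .75, a cd_desc of 5/12 and a rank of 4: the rank says where they stand, while the two CUME_DIST values say how much of the class is at or below, and at or above, their score.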

Summary
• simplicity
• efficiency
• able to apply multiple analytical functions with different partitioning of the data
• ability to display aggregates along with the detail data used to derive the values
• great way to move details to warehouse while simultaneously storing the aggregates with the details.
• can list aggregates on the same row where each aggregate can be derived from a different group of rows using
partition and window clauses.

BIO
Edward Kosciuzko is a principal with Sequel Consulting, Inc., and can be reached at 973-226-7835.
