
The Many Uses of SQL Subqueries

Tasha Chapman, Oregon Department of Consumer and Business Services

Subqueries are very useful tools in SQL. A subquery is an embedded query nested within another query.
They’re most often used when a particular value that is needed for the outer query is unknown, but can
be searched for using an SQL query. This paper acquaints the reader with the concept of subqueries, and
provides some examples of how they might be used.

Getting started – the basics of Proc SQL


Before we delve into subqueries, let’s start with a basic primer on SQL. In SAS, SQL statements
are sandwiched between a Proc SQL statement and a QUIT statement. Every SQL query must
have a SELECT clause and a FROM clause, but can also have optional clauses such as WHERE,
HAVING, and others.

The SELECT clause is used to select the variables from the source table. It is analogous to a KEEP
statement in a DATA step, or a VAR statement in a SAS procedure. You can select a single
variable, or multiple variables in a SELECT clause.

The FROM clause tells SAS which data table(s) to pull the data from. It is analogous to a SET
statement in a DATA step, or a DATA= option in a SAS procedure. You can pull data from one
source table or multiple source tables.

The WHERE clause is used to implement criteria for selecting rows from the source table. It is
the same as the WHERE statement used in DATA steps and SAS Procedures. Using a WHERE
clause is optional, but it can be useful in getting correct results and also is important for efficient
programming.

Below is an example of an SQL query. In this example we’re selecting the variables Title, Author
and ISBN from the table Books, but only where the author’s name is Cody.

proc sql;
select Title, Author, ISBN
from Books
where Author = 'Cody';
quit;
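
To make the analogies above concrete, here is a rough DATA step counterpart to the same selection. This is only a sketch: it assumes Books is a SAS data set in the WORK library, and the output data set name cody_books is hypothetical (it does not appear in the original example).

```sas
/* DATA step sketch of the same selection (cody_books is a hypothetical name) */
data cody_books;
  set Books;                 /* plays the role of the FROM clause   */
  where Author = 'Cody';     /* plays the role of the WHERE clause  */
  keep Title Author ISBN;    /* plays the role of the SELECT clause */
run;

/* Proc SQL prints its result by default, so print the data set to match */
proc print data=cody_books noobs;
run;
```

Note that the DATA step needs a separate procedure to display its result, whereas the SQL query selects and prints in one step.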

What is a subquery?
Now that you know what a basic SQL query looks like, what is a subquery? A subquery is essentially
a query within a query. It is a complete query, enclosed within parentheses, that is
processed first. The results of the subquery are then passed back and used as part of the outer
query. Subqueries are most often used in the WHERE or HAVING clause, in which the results of
the subquery are used as a condition of the outer query. But subqueries can also be used in the
SELECT or FROM clause, if appropriate.

Still confused? No worries. An example of a subquery might help you out. Let’s say that you
have a data table showing insurance policies for a set of policy holders. The table includes the
unique policy number, the name of the policy holder, the name of the insurance company
holding the policy, and the beginning and ending dates of the policy:

Policy Table
Policy_no  Policy_holder       Insurer  Pol_begin_dt  Pol_end_dt
A12859     Callahan Auto       BeSafe   03/01/2008    02/28/2009
C52996     Acme Co.            M&M      05/25/2006    09/14/2007
Q58953     Acme Co.            BeSafe   09/15/2007    06/14/2010
D58622     Dunder Mifflin      LexCorp  06/05/2006    03/03/2010
P99785     Ewing Oil           M&M      03/21/2004    11/21/2004
W08308     Paper St. Soap Co.  BeSafe   07/08/2005    06/30/2006
T99775     Paper St. Soap Co.  BeSafe   06/30/2006    09/30/2009
D77582     Duff Beer           LexCorp  05/08/2007    06/08/2008
G88958     Rick’s Café         BeSafe   12/01/2005    06/30/2007
Q99585     Lyon Estates        BeSafe   06/26/2005    11/05/2015
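
If you would like to run the examples yourself, the Policy table can be built with a DATA step along these lines. This is a sketch rather than the author's code: the pipe delimiter, the variable lengths, and the MMDDYY10. informat and format are assumptions made for illustration, and the dates are assumed to be stored as SAS date values.

```sas
/* Sketch: build the sample Policy table (lengths and formats are assumptions) */
data Policy;
  infile datalines dlm='|' truncover;   /* pipe-delimited so names may contain spaces */
  input Policy_no     :$6.
        Policy_holder :$30.
        Insurer       :$10.
        Pol_begin_dt  :mmddyy10.
        Pol_end_dt    :mmddyy10.;
  format Pol_begin_dt Pol_end_dt mmddyy10.;  /* display dates as mm/dd/yyyy */
  datalines;
A12859|Callahan Auto|BeSafe|03/01/2008|02/28/2009
C52996|Acme Co.|M&M|05/25/2006|09/14/2007
Q58953|Acme Co.|BeSafe|09/15/2007|06/14/2010
D58622|Dunder Mifflin|LexCorp|06/05/2006|03/03/2010
P99785|Ewing Oil|M&M|03/21/2004|11/21/2004
W08308|Paper St. Soap Co.|BeSafe|07/08/2005|06/30/2006
T99775|Paper St. Soap Co.|BeSafe|06/30/2006|09/30/2009
D77582|Duff Beer|LexCorp|05/08/2007|06/08/2008
G88958|Rick’s Café|BeSafe|12/01/2005|06/30/2007
Q99585|Lyon Estates|BeSafe|06/26/2005|11/05/2015
;
run;
```

Storing the dates as numeric SAS dates (rather than character strings) is what allows the date arithmetic and date comparisons used in the later examples.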

We could run a basic SQL query that shows us only those policies held by BeSafe insurance
company using the following syntax [1]:

proc sql;
select *
from Policy
where Insurer = 'BeSafe';
quit;

The resulting output would look like so:

Policy Table (only BeSafe policies)
Policy_no  Policy_holder       Insurer  Pol_begin_dt  Pol_end_dt
A12859     Callahan Auto       BeSafe   03/01/2008    02/28/2009
Q58953     Acme Co.            BeSafe   09/15/2007    06/14/2010
W08308     Paper St. Soap Co.  BeSafe   07/08/2005    06/30/2006
T99775     Paper St. Soap Co.  BeSafe   06/30/2006    09/30/2009
G88958     Rick’s Café         BeSafe   12/01/2005    06/30/2007
Q99585     Lyon Estates        BeSafe   06/26/2005    11/05/2015

[1] The asterisk selects all variables. It is similar to using a KEEP _ALL_ statement in SAS.

Let’s say that you also have a table that contains some basic characteristics about each
insurance company, such as whether they issue policies internationally or only in the U.S., and
which U.S. state the company is headquartered in:

Insurer_Co Table
Insurer  Insurer_type   HQ_state
BeSafe   US Only        Oregon
LexCorp  International  New York
M&M      International  Florida
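
As with the Policy table, the Insurer_Co table can be recreated with a short DATA step. Again this is only a sketch: the pipe delimiter and the variable lengths are assumptions chosen for illustration.

```sas
/* Sketch: build the sample Insurer_Co table (lengths are assumptions) */
data Insurer_Co;
  infile datalines dlm='|' truncover;   /* pipe-delimited so values may contain spaces */
  input Insurer      :$10.
        Insurer_type :$15.
        HQ_state     :$15.;
  datalines;
BeSafe|US Only|Oregon
LexCorp|International|New York
M&M|International|Florida
;
run;
```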

Now let’s say that your boss, or your client, or your own personal curiosity comes along and
says, “Hey! Can you tell me which policies are held by insurance companies that issue
international policies?” (Right now you might be tempted to say “Um… I don’t know…” but in a
few paragraphs you’ll be able to say “Absolutely!” with a smile.)

We know which policies in the Policy table are held by which insurers. And we know which
insurers in the Insurer_Co table sell international policies. But how do we put these two pieces
together?

If we look at the Insurer_Co table, we see that there are currently two insurers, LexCorp and
M&M, that issue international policies. In theory we could use the WHERE clause in SQL to
query for policies issued by “LexCorp” and “M&M.” And yes, technically that would be an easy
solution given our example data with only three insurers. But what if we had 100 insurers in our
Insurer_Co table? Or 1,000? You wouldn’t want to manually search for all the international
insurers. That would be silly and (unless you’re padding someone’s bill) a complete waste of
time. What do you do?

Well, by golly, we could use an SQL query to search for all the international insurers in the
Insurer_Co table. That query would look something like this:

proc sql;
select Insurer
from Insurer_Co
where Insurer_Type = 'International';
quit;

The results of the query above would be (as expected) “LexCorp” and “M&M.” Now what we
need to do is use the results of this query in another query that selects the appropriate policies
in the policy table. Thank goodness for subqueries. We’ll use an outer query that searches for
the appropriate policies based on the results of our subquery:

proc sql;
select *
from Policy
where Insurer in
      (select Insurer
       from Insurer_Co
       where Insurer_Type = 'International');
quit;

The SQL procedure will run the inner query first (“select insurer from insurer_co…”) which will
result in a list of insurance companies that meet the specified criteria (“where insurer_type =
‘International’ ”). The resulting list will then be used in the outer query (“select * from policy…”)
just as if we’d manually entered the list ourselves. In other words, the above query will return
the same results as:

proc sql;
select *
from Policy
where Insurer in ('LexCorp' 'M&M');
quit;

…but with less work on our part. Either way we get the same results:

Policy Table (only policies from international insurers)
Policy_no  Policy_holder       Insurer  Pol_begin_dt  Pol_end_dt
C52996     Acme Co.            M&M      05/25/2006    09/14/2007
D58622     Dunder Mifflin      LexCorp  06/05/2006    03/03/2010
P99785     Ewing Oil           M&M      03/21/2004    11/21/2004
D77582     Duff Beer           LexCorp  05/08/2007    06/08/2008

Rules of SQL subqueries


There are a few rules to SQL subqueries that you should be aware of:

• The subquery must be a complete SQL query in itself, with the required SELECT and
  FROM clauses. One benefit of this requirement is that if you are unsure of your
  query, you can always run your subquery as a standalone query to make sure that you
  are getting accurate results.
• Even though the subquery is a complete query, only one semicolon is used in the entire
  SQL statement, and that is at the end of the outer query. You do not need to put a
  semicolon at the end of your subquery.
• The subquery must be completely enclosed in a set of parentheses.
• Some subqueries will bring back a single row of results. Others (like the example above)
  will bring back multiple rows of results. As such, when using a subquery in a WHERE or
  HAVING clause, make sure you use the appropriate comparison operator. If we had
  written the above example using “Where insurer = …” instead of “Where insurer in…”
  we would have had an error, because you cannot use an equal sign to compare against
  a list.
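
The last rule can be seen with the sample tables. The sketch below assumes the Policy and Insurer_Co tables shown earlier. The HQ_state condition is an illustrative filter that does not appear in the original examples; it was chosen because exactly one insurer in the sample data is headquartered in Oregon.

```sas
proc sql;
  /* This subquery returns a single row (BeSafe), so an equal sign is acceptable */
  select *
    from Policy
   where Insurer =
         (select Insurer
            from Insurer_Co
           where HQ_state = 'Oregon');

  /* This subquery returns two rows (LexCorp and M&M), so IN is needed */
  select *
    from Policy
   where Insurer in
         (select Insurer
            from Insurer_Co
           where Insurer_Type = 'International');
quit;
```

Using IN is also safe when the subquery happens to return a single row, so it is a reasonable default whenever you are not sure how many rows will come back.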

More examples of subqueries – Find the policy with the latest ending date
In the example above, we used a subquery to apply the results from a query of one table to the
WHERE clause in a query of another table. In the next examples, we’ll use functions to calculate
some value using the data from the policy table, and apply the results to another query of the
same table.

For example, let’s say we wanted to know which policy had the latest ending date. We can use
the MAX function to calculate the latest ending date, but then how do we find which policy has
this date? By using a subquery, of course! [2]

First we have to create a query that finds the latest ending date in the Policy table using the
maximum (MAX) function, like so:

proc sql;
select max(pol_end_dt)
from Policy;
quit;

This query returns the value of the latest policy ending date, which happens to be 11/05/2015.
Next, we apply this as a subquery, to find the policy that has this ending date:

proc sql;
select *
from Policy
where pol_end_dt =
      (select max(pol_end_dt)
       from Policy);
quit;

The result is the policy that has the ending date of 11/05/2015:

Policy Table (policy with latest ending date)
Policy_no  Policy_holder  Insurer  Pol_begin_dt  Pol_end_dt
Q99585     Lyon Estates   BeSafe   06/26/2005    11/05/2015

The inner query calculated the latest ending date in the Policy table, and the outer query found
the policy record with this ending date.
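
One practical note on the standalone MAX query, offered as a sketch: a calculated column does not normally carry over the format of the source column, so the latest ending date may print as a raw SAS date number rather than as 11/05/2015. Assuming the dates are stored as SAS date values, attaching a format in the SELECT clause makes the result readable. The column name latest_end_dt is hypothetical.

```sas
proc sql;
  /* Name the calculated column and display it as mm/dd/yyyy */
  select max(pol_end_dt) as latest_end_dt format=mmddyy10.
    from Policy;
quit;
```

The format only affects how the value is displayed. The subquery version works the same either way, because the comparison is made on the underlying date values. If several policies shared the latest ending date, the subquery version would return all of them.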

[2] If you weren’t able to anticipate the answer to this question, then you obviously haven’t been paying
attention.

More subqueries – Find the policy with the shortest policy period
Here’s another example. Let’s say we wanted to find the policy with the shortest policy period.
First we calculate the shortest policy period in the Policy table, using the
minimum (MIN) function:

proc sql;
select min(pol_end_dt-pol_begin_dt)
from Policy;
quit;

Then we find which policy has this calculated policy period length by applying this subquery to
the WHERE clause of the outer query:

proc sql;
select *
from Policy
where pol_end_dt-pol_begin_dt =
      (select min(pol_end_dt-pol_begin_dt)
       from Policy);
quit;

The resulting table shows the policy with the shortest policy period:

Policy Table (policy with the shortest policy period)
Policy_no  Policy_holder  Insurer  Pol_begin_dt  Pol_end_dt
P99785     Ewing Oil      M&M      03/21/2004    11/21/2004

The inner query calculated the shortest length between the policy beginning date and policy
ending date in the Policy table, and the outer query found the policy record with this matching
policy period length.
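
Earlier we mentioned that subqueries can also appear in the FROM clause. As a sketch of that idea, the query below uses a FROM-clause subquery to calculate each policy's period length and display it next to the policy. The column name period_days and the alias periods are hypothetical, and the dates are assumed to be SAS date values so that subtracting them gives a number of days.

```sas
proc sql;
  select Policy_no, Policy_holder, period_days
    /* FROM-clause subquery: adds the calculated period length to each policy */
    from (select Policy_no, Policy_holder,
                 pol_end_dt - pol_begin_dt as period_days
            from Policy) as periods
   /* WHERE-clause subquery: the shortest period in the table */
   where period_days =
         (select min(pol_end_dt - pol_begin_dt)
            from Policy);
quit;
```

With the sample data this should list policy P99785 (Ewing Oil) with a period of 245 days, which makes it easier to confirm that the right record was found.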

More subqueries – Find policies with longer than average policy periods
What if we wanted to find the policy (or policies) that had longer than average policy periods?
Can we do that? Yes, we can! First, you write a query that calculates the average policy period
using the mean (AVG) function, like so:

proc sql;
select avg(pol_end_dt-pol_begin_dt)
from Policy;
quit;

Then we plug this subquery into an outer query that looks for policies that have policy periods
longer than this calculated value:

proc sql;
select *
from Policy
where pol_end_dt-pol_begin_dt >
      (select avg(pol_end_dt-pol_begin_dt)
       from Policy);
quit;

The resulting table shows all the policies with longer than average policy periods:

Policy Table (only policies with longer than average policy periods)
Policy_no  Policy_holder       Insurer  Pol_begin_dt  Pol_end_dt
Q58953     Acme Co.            BeSafe   09/15/2007    06/14/2010
D58622     Dunder Mifflin      LexCorp  06/05/2006    03/03/2010
T99775     Paper St. Soap Co.  BeSafe   06/30/2006    09/30/2009
Q99585     Lyon Estates        BeSafe   06/26/2005    11/05/2015
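
The same average can also be used as a subquery in a HAVING clause, which applies a condition to groups rather than to individual rows. The sketch below is an illustrative variation, not one of the original examples: it looks for insurers whose average policy period is longer than the overall average. The column name avg_period_days is hypothetical.

```sas
proc sql;
  select Insurer,
         avg(pol_end_dt - pol_begin_dt) as avg_period_days
    from Policy
   group by Insurer
  /* Keep only insurers whose own average exceeds the overall average */
  having avg(pol_end_dt - pol_begin_dt) >
         (select avg(pol_end_dt - pol_begin_dt)
            from Policy);
quit;
```

With the sample data, only BeSafe should qualify, since its long-running Lyon Estates policy pulls its average above the overall average.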

Using subqueries to query down the rows


Now let’s say that your boss, or your client, or your own personal curiosity comes along and
says, “Hey! Can you tell me which policy holders had insurance during the entire 2007 calendar
year?” And you say, “Sure! That sounds easy enough.” And then you look at the policy table…

Policy Table
Policy_no  Policy_holder       Insurer  Pol_begin_dt  Pol_end_dt
A12859     Callahan Auto       BeSafe   03/01/2008    02/28/2009
C52996     Acme Co.            M&M      05/25/2006    09/14/2007
Q58953     Acme Co.            BeSafe   09/15/2007    06/14/2010
D58622     Dunder Mifflin      LexCorp  06/05/2006    03/03/2010
P99785     Ewing Oil           M&M      03/21/2004    11/21/2004
W08308     Paper St. Soap Co.  BeSafe   07/08/2005    06/30/2006
T99775     Paper St. Soap Co.  BeSafe   06/30/2006    09/30/2009
D77582     Duff Beer           LexCorp  05/08/2007    06/08/2008
G88958     Rick’s Café         BeSafe   12/01/2005    06/30/2007
Q99585     Lyon Estates        BeSafe   06/26/2005    11/05/2015

Some policy holders, like Dunder Mifflin, have a single policy that covers all of 2007. However,
other policy holders, like Acme Co., have multiple policies, each of which covers portions of
2007. What do you do? We’ll approach this problem step by step.

First, we’ll look for policy holders that have policies starting on or before January 1st, 2007. Note
that we’re looking for policy holders, not individual policies. This will be important later on:

proc sql;
select Policy_Holder
from Policy
where Pol_begin_dt le '01jan2007'd ;
quit;

This query gets us a list of policy holders (and, although we did not select policy begin date in
our query, it is shown below for demonstration purposes):

Policy table (policies starting on or before January 1st, 2007)
Policy_holder       Pol_begin_dt
Acme Co.            05/25/2006
Dunder Mifflin      06/05/2006
Ewing Oil           03/21/2004
Paper St. Soap Co.  07/08/2005
Paper St. Soap Co.  06/30/2006
Rick’s Café         12/01/2005
Lyon Estates        06/26/2005

Next we’ll look for policy holders that have policies ending on or after December 31st, 2007:

proc sql;
select Policy_Holder
from Policy
where Pol_end_dt ge '31dec2007'd ;
quit;

From this query we get the following list of policy holders:

Policy table (policies ending on or after December 31st, 2007)
Policy_holder       Pol_end_dt
Callahan Auto       02/28/2009
Acme Co.            06/14/2010
Dunder Mifflin      03/03/2010
Paper St. Soap Co.  09/30/2009
Duff Beer           06/08/2008
Lyon Estates        11/05/2015

Finally we’ll use both queries as subqueries in our outer query to find policy holders that have at least one
policy that starts on or before January 1st, 2007 (first subquery) and have at least one policy that
ends on or after December 31st, 2007 (second subquery):

proc sql;
select distinct Policy_holder   /* DISTINCT lists each policy holder once */
from Policy
where Policy_holder in
      (select Policy_holder     /* First subquery */
       from Policy
       where Pol_begin_dt le '01jan2007'd)
  and Policy_holder in
      (select Policy_holder     /* Second subquery */
       from Policy
       where Pol_end_dt ge '31dec2007'd);
quit;

The outer query is looking for policy holders that are on both lists. Ewing Oil and Rick’s Café
both have policies that start on or before January 1st, 2007, but neither has a policy that ends
on or after December 31st, 2007. Callahan Auto and Duff Beer both have policies that end on or after
December 31st, 2007, but neither holds a policy that starts on or before January
1st, 2007. That leaves the remaining four policy holders as fully covered during 2007 – Acme Co.,
Dunder Mifflin, Paper St. Soap Co., and Lyon Estates.

Policy_holder
Acme Co.
Dunder Mifflin
Paper St. Soap Co.
Lyon Estates
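
As a cross-check on the two-subquery approach, the same list can be produced by grouping on policy holder and testing each holder's earliest start date and latest end date. This is a sketch of an alternative, not the method used in this paper, and it assumes the same Policy table.

```sas
proc sql;
  select Policy_holder
    from Policy
   group by Policy_holder
  /* Earliest start on or before Jan 1 and latest end on or after Dec 31 */
  having min(Pol_begin_dt) le '01jan2007'd
     and max(Pol_end_dt)   ge '31dec2007'd;
quit;
```

With the sample data this should return the same four policy holders. Like the subquery version, it only checks the earliest start and the latest end. Neither approach verifies that there are no gaps between a policy holder's policies, so data with gaps in coverage would need an additional check.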

In conclusion…
Hopefully this paper provided you with a basic understanding of SQL subqueries and some
examples of how you might use them. However, this is just the tip of the iceberg. SQL
subqueries can be used in all sorts of ways and to solve all sorts of problems. You can learn
more by reading up on SQL and (a personal favorite) experimenting on your own. Have fun!

For additional resources, check out SAS Help and Documentation and sasCommunity.org
(www.sascommunity.org/wiki).

Your comments and questions are valued and encouraged. Additional questions can also be
addressed to the author:
Tasha Chapman
Oregon Department of Consumer and Business Services, Salem, Oregon
Tasha.L.Chapman@state.or.us

SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
