
Manisha Reddy

Structured Query Language (SQL) is a standard language used to manage relational
databases. SQL Server is a relational database management system (RDBMS)
developed by Microsoft. In SQL Server, there are various ways to store and manipulate
data, including Common Table Expressions (CTEs) and Temporary Tables. While they
might seem similar, there are some fundamental differences between the two.

Common Table Expressions (CTE)
Temporary Table
Differences between CTEs and Temporary Tables
When to use CTEs vs. Temporary Tables
Temporary tables to the rescue?
Best Practice

Common Table Expressions (CTE)


A CTE is a named temporary result set that you can reference within a SELECT,
INSERT, UPDATE, or DELETE statement. A CTE is defined using the WITH clause,
and it can reference itself recursively. The technique is also known as subquery factoring,
and it can be used to simplify complex queries and make them more readable and easier to
maintain.

You can think of a Common Table Expression (CTE) as a table subquery. A table
subquery, also sometimes referred to as a derived table, is a query that is used as the
starting point to build another query. Like a subquery, it will exist only for the duration of
the query. CTEs make the code easier to write as you can write the CTEs at the top of
your query – you can have more than one CTE, and CTEs can reference other CTEs –
and then you can use the defined CTEs in your main query.

CTEs are not physically stored on disk, and their lifespan is limited to the execution of a
single query. This means that you cannot create, alter or drop CTEs explicitly. Also, you
cannot reference a CTE from multiple queries within the same batch.

CTEs make the code easier to read, and favor reuse: imagine that in each CTE you are
defining the subset of data that you want to work on in the main query and you are
giving it a label. In the main query then you can just refer to that subset by using its
label instead of having to write the whole subquery.

CTEs also allow for some complex scenarios, such as recursive queries.
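As a rough illustration of a recursive CTE, the sketch below walks a small reporting hierarchy. It is only a sketch under stated assumptions: it runs on SQLite through Python's built-in sqlite3 module rather than on SQL Server, and the Staff table, its columns, and the sample rows are hypothetical. The query is written with WITH RECURSIVE here; T-SQL uses plain WITH for the same shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical hierarchy: ManagerID points at another row in the same table.
    CREATE TABLE Staff (EmployeeID INTEGER PRIMARY KEY, Name TEXT, ManagerID INTEGER);
    INSERT INTO Staff VALUES
        (1, 'Asha', NULL),
        (2, 'Ben', 1),
        (3, 'Chen', 1),
        (4, 'Dee', 2);
""")

# Anchor member: the employee with no manager.
# Recursive member: anyone whose manager is already in OrgChart.
rows = conn.execute("""
    WITH RECURSIVE OrgChart (EmployeeID, Name, Depth) AS (
        SELECT EmployeeID, Name, 0
        FROM Staff
        WHERE ManagerID IS NULL
        UNION ALL
        SELECT s.EmployeeID, s.Name, o.Depth + 1
        FROM Staff AS s
        JOIN OrgChart AS o ON s.ManagerID = o.EmployeeID
    )
    SELECT Name, Depth
    FROM OrgChart
    ORDER BY Depth, EmployeeID
""").fetchall()

print(rows)
```

Each pass of the recursive member adds one more level of the hierarchy until no new rows are produced.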

Advantages:
● Readability: CTEs enhance the readability of complex queries by breaking them
into modular, named components.
● Recursive Queries: CTEs can be used for recursive queries, where a query
refers to its own output.

Example:

Let’s consider the following example, where we have a table named “Employees” with
columns “EmployeeID”, “FirstName”, “LastName”, “DepartmentID”, and “Salary”:

In this example, we’re using a CTE named “TopEmployees” to retrieve the top 10
employees with the highest salary from the “Employees” table. We then select the full
name and salary of these top employees from the CTE. The CTE allows us to simplify
the query and make it more readable by separating the top employees selection from
the final select statement.
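The query described above might look roughly like the following. This is a sketch, not the original listing: it runs on SQLite via Python's sqlite3 module, so it uses LIMIT 10 where T-SQL would use SELECT TOP 10, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY,
        FirstName TEXT,
        LastName TEXT,
        DepartmentID INTEGER,
        Salary INTEGER
    )
""")
# Hypothetical sample data: 12 employees with salaries 51000 .. 62000.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(i, f"First{i}", f"Last{i}", i % 3, 50000 + 1000 * i) for i in range(1, 13)],
)

top = conn.execute("""
    WITH TopEmployees AS (
        SELECT FirstName, LastName, Salary
        FROM Employees
        ORDER BY Salary DESC
        LIMIT 10              -- T-SQL equivalent: SELECT TOP 10 ... ORDER BY Salary DESC
    )
    SELECT FirstName || ' ' || LastName AS FullName, Salary
    FROM TopEmployees
    ORDER BY Salary DESC
""").fetchall()

for full_name, salary in top:
    print(full_name, salary)
```

The CTE isolates the "which employees" question, and the final SELECT only deals with presentation.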

Temporary Table
Temporary tables are also temporary result sets that are stored in the tempdb system
database. Unlike CTEs, temporary tables are physically stored on disk, and you can
create, alter or drop them explicitly. Temporary tables can be used to store and
manipulate large amounts of data and can be used in multiple queries within the same
batch.

Temporary tables can be created using the CREATE TABLE statement with the prefix
“#” or “##” for local and global temporary tables, respectively. Local temporary tables
are only accessible from the current session and are automatically dropped when the
session ends. Global temporary tables are accessible from all sessions, and they are
dropped automatically when the last session referencing them is closed.

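Types of Temporary Tables:

● Local Temporary Tables (prefix “#”): Visible only to the session that created
them, and dropped when that session ends.
● Global Temporary Tables (prefix “##”): Visible to all sessions, and dropped when
the creating session ends and no other session is still referencing them.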

Let’s consider the following example, where we want to create a temporary table to
store the top 10 employees with the highest salary:

In this example, we create a temporary table named “#TopEmployees” with columns
“EmployeeID”, “FirstName”, “LastName”, and “Salary”. We then insert the top 10
employees with the highest salary from the “Employees” table into the temporary table.
Finally, we select the full name and salary of these top employees from the temporary
table and drop the table explicitly.
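The steps described above (create, insert, select, drop) can be sketched as follows. This is an approximation run on SQLite through Python's sqlite3 module: SQLite writes CREATE TEMP TABLE TopEmployees where T-SQL would write CREATE TABLE #TopEmployees, and LIMIT 10 stands in for TOP 10. The sample data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT,
        DepartmentID INTEGER, Salary INTEGER
    )
""")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(i, f"First{i}", f"Last{i}", i % 3, 50000 + 1000 * i) for i in range(1, 13)],
)

# 1. Create the temporary table explicitly.
conn.execute("""
    CREATE TEMP TABLE TopEmployees (
        EmployeeID INTEGER, FirstName TEXT, LastName TEXT, Salary INTEGER
    )
""")

# 2. Load the ten highest-paid employees into it.
conn.execute("""
    INSERT INTO TopEmployees (EmployeeID, FirstName, LastName, Salary)
    SELECT EmployeeID, FirstName, LastName, Salary
    FROM Employees
    ORDER BY Salary DESC
    LIMIT 10
""")

# 3. Query it like any other table.
top = conn.execute("""
    SELECT FirstName || ' ' || LastName AS FullName, Salary
    FROM TopEmployees
    ORDER BY Salary DESC
""").fetchall()

# 4. Drop it explicitly once it is no longer needed.
conn.execute("DROP TABLE TopEmployees")
remaining = conn.execute(
    "SELECT name FROM sqlite_temp_master WHERE name = 'TopEmployees'"
).fetchall()
print(top[0], remaining)
```

Unlike the CTE version, every step here is a separate statement, which is what makes the intermediate result reusable.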

Differences between CTEs and Temporary Tables


The main differences between CTEs and Temporary Tables are:
● Storage: CTEs are not physically stored on disk, while temporary tables are.
● Lifespan: CTEs exist only for the duration of the query execution, while
temporary tables can exist beyond a single query execution.
● Explicit Management: You cannot explicitly create, alter, or drop a CTE, while you
can with a temporary table.
● Scope: CTEs are only accessible within the query that defines them, while
temporary tables can be accessed by multiple queries within the same batch.
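The scope difference in the last bullet is easy to observe. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption; the table and CTE names are made up): the CTE is gone as soon as its statement finishes, while the temp table remains queryable on the same connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Statement 1: the CTE "nums" exists only inside this one statement.
total = conn.execute("""
    WITH nums AS (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3)
    SELECT SUM(n) FROM nums
""").fetchone()[0]

# Statement 2: the CTE is no longer defined, so this fails.
try:
    conn.execute("SELECT COUNT(*) FROM nums")
    cte_visible = True
except sqlite3.OperationalError:
    cte_visible = False

# A temp table, by contrast, survives across statements on the same connection.
conn.execute("""
    CREATE TEMP TABLE nums_tmp AS
    SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3
""")
count = conn.execute("SELECT COUNT(*) FROM nums_tmp").fetchone()[0]
total_tmp = conn.execute("SELECT SUM(n) FROM nums_tmp").fetchone()[0]

print(total, cte_visible, count, total_tmp)
```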

When to use CTEs vs. Temporary Tables


CTEs are often preferred over temporary tables when you need to simplify complex
queries and improve query readability. CTEs also let you reference the same named
result set several times within one query without repeating its definition, although
SQL Server may evaluate the CTE again for each reference.

Temporary tables are useful when you need to store and manipulate large amounts of
data, and you need to reference that data across multiple queries within the same
batch. Temporary tables can also improve query performance by allowing you to index
and optimize the data in the table.

Temporary tables to the rescue?


Temporary tables can help to greatly reduce or even fix poor row estimates caused by
error amplification, where small estimation errors in nested subqueries or CTEs
compound as the query grows. How? By storing the result of a subquery in a temporary
table, you reset that amplification: the query engine can use the actual data (and its
statistics) in the temporary table, so it no longer has to guess as much.

Another reason to use a temporary table is if you have a complex query that needs to
be used one or more times in subsequent steps and you want to avoid spending time
and resources executing that query again and again (especially if the result set is small
compared to the originating data and/or the subsequent queries will not be able to push
any optimization down to the subquery because you are working on aggregated data, for
example).

But there is no “one-solution-fits-all” here. You must try to see if, for your use case, a
subquery is enough, or a temporary table is needed to give the query engine some
leverage to get better estimations and thus a better execution plan.

Also keep in mind that using temporary tables comes with some overhead. Aside from
the obvious space usage, resources (and thus time) will be spent just loading
them. Sometimes you might even need to create indexes on temporary tables to make
sure subsequent queries perform well.
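To make the trade-off concrete, here is a small sketch of materializing an aggregated result once and then indexing it for later lookups. It runs on SQLite via Python's sqlite3 module, which is an assumption made for runnability; on SQL Server the table would be named #customer_totals and the plan output would look different. The orders table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 50, i) for i in range(1, 1001)],
)

# Pay the loading cost once: aggregate into a temp table.
conn.execute("""
    CREATE TEMP TABLE customer_totals AS
    SELECT customer_id, SUM(amount) AS total, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
""")

# Then index the temp table to support the lookups that follow.
conn.execute("CREATE INDEX idx_totals_customer ON customer_totals (customer_id)")

total = conn.execute(
    "SELECT total FROM customer_totals WHERE customer_id = ?", (7,)
).fetchone()[0]

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT total FROM customer_totals WHERE customer_id = 7"
    )
)
print(total, plan)
```

Whether the extra load and index build pay off depends on how many times the temp table is read afterwards.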

The data persisted in the temporary table, also, is not automatically kept up to date with
any changes that might be made to the data in the tables used in the originating query.
It is your responsibility to refresh the data on the temporary table anytime you need it
(Another option would be to use Indexed Views: see below for more details on this
feature).

Best Practice
The best practice for choosing between CTE and TempTable depends on the specific
needs of the query. If the query is simple and does not require the use of complex logic,
then a CTE may be the best option. However, if the query is complex or requires the
use of large amounts of data, then a TempTable may be the better choice.

In general, CTEs are a good choice for the following:

● Improving the readability and maintainability of complex queries
● Simplifying recursive queries
● Creating temporary result sets that are only needed for the current query
● When SQL Server can do a good job of estimating how many rows will come out of
the CTE, and what the contents of those rows will be
● When what comes out of the CTE doesn’t really influence the behavior of the rest
of the query
● When you’re not sure what portions of the CTE’s data will actually be necessary
for the rest of the query (because SQL Server can figure out what parts to
execute, and what parts to simply ignore)

TempTables are a good choice for the following:

● Storing large amounts of data
● Creating temporary tables that need to be accessed by multiple queries
● Creating temporary tables that need to be indexed or constrained
● When you have to refer to the output multiple times
● When you need to pass data between stored procedures
● When you need to break a query up into phases to isolate unpredictable
components that dramatically affect the behavior of the rest of the query

Ultimately, the best way to choose between CTE and TempTable is to experiment with
both options and see which one works best for the specific needs of the query.

I’d suggest starting with CTEs because they’re easy to write and to read. If you hit a
performance wall, try ripping out a CTE and writing it to a temp table, then joining to the
temp table.

Subquery
Subqueries (also known as inner queries or nested queries) are a tool for performing
operations in multiple steps. For example, if you wanted to take the sums of several
columns, then average all of those values, you'd need to do each aggregation in a
distinct step.

Types of Subqueries:
● Single-row Subquery: Returns a single value.
● Multi-row Subquery: Returns multiple rows.
● Multi-column Subquery: Returns multiple columns.

Subqueries can be used in several places within a query, but it's easiest to start with
the FROM clause. Here's an example of a basic subquery:

Let's break down what happens when you run the above query:

First, the database runs the "inner query"—the part between the parentheses:

If you were to run this on its own, it would produce a result set like any other query. It
might sound like a no-brainer, but it's important: your inner query must actually run on
its own, as the database will treat it as an independent query. Once the inner query
runs, the outer query will run using the results from the inner query as its underlying
table:

Subqueries in the FROM clause are required to have names, which are added after the
parentheses the same way you would add an alias to a normal table. In this case, we've
used the name "sub."

A quick note on formatting: The important thing to remember when using subqueries is
to provide some way for the reader to easily determine which parts of the query will be
executed together. Most people do this by indenting the subquery in some way. The
examples in this tutorial are indented quite far, all the way to the parentheses. This
isn't practical if you nest many subqueries, so it's fairly common to only indent two
spaces or so.
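The inner-then-outer flow described above can be sketched as follows. The incidents table, its columns, and its rows are hypothetical stand-ins for the tutorial's dataset, and the sketch runs on SQLite through Python's sqlite3 module. The inner query is run once on its own, then reused as a named subquery ("sub") in the FROM clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE incidents (id INTEGER PRIMARY KEY, day_of_week TEXT, resolution TEXT);
    INSERT INTO incidents (day_of_week, resolution) VALUES
        ('Friday', 'NONE'),
        ('Friday', 'ARREST'),
        ('Monday', 'NONE'),
        ('Friday', 'NONE');
""")

# The inner query is a complete query in its own right.
inner_sql = "SELECT * FROM incidents WHERE day_of_week = 'Friday'"
inner_rows = conn.execute(inner_sql).fetchall()

# The outer query treats the inner result as its underlying table, named "sub".
outer_rows = conn.execute(f"""
    SELECT sub.id, sub.resolution
    FROM (
          {inner_sql}
         ) AS sub
    WHERE sub.resolution = 'NONE'
    ORDER BY sub.id
""").fetchall()

print(len(inner_rows), outer_rows)
```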

The above examples, as well as the practice problem, don't really require
subqueries: they solve problems that could also be solved by adding multiple
conditions to the WHERE clause. These next sections provide examples for which
subqueries are the best or only way to solve their respective problems.

Using subqueries to aggregate in multiple stages


What if you wanted to figure out how many incidents get reported on each day of the
week? Better yet, what if you wanted to know how many incidents happen, on average,
on a Friday in December? In January? There are two steps to this process: counting
the number of incidents each day (inner query), then determining the monthly average
(outer query):

If you're having trouble figuring out what's happening, try running the inner query
individually to get a sense of what its results look like. In general, it's easiest to write
inner queries first and revise them until the results make sense to you, then to move on
to the outer query.
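A minimal sketch of this two-stage aggregation, assuming a hypothetical incidents table with an ISO-formatted incident_date and a day_of_week column (not the tutorial's actual schema), and run on SQLite via Python's sqlite3 module. The inner query counts incidents per day; the outer query averages those daily counts per month and weekday.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (incident_date TEXT, day_of_week TEXT)")

# Hypothetical data: 2 and 4 incidents on two January Fridays, 5 on a December Friday.
sample = (
    [("2014-01-03", "Friday")] * 2
    + [("2014-01-10", "Friday")] * 4
    + [("2014-12-05", "Friday")] * 5
)
conn.executemany("INSERT INTO incidents VALUES (?, ?)", sample)

rows = conn.execute("""
    SELECT substr(sub.incident_date, 6, 2) AS month,
           sub.day_of_week,
           AVG(sub.incidents) AS avg_incidents
    FROM (
          -- Stage 1: one row per day with its incident count.
          SELECT incident_date, day_of_week, COUNT(*) AS incidents
          FROM incidents
          GROUP BY incident_date, day_of_week
         ) AS sub
    -- Stage 2: average the daily counts per month and weekday.
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()

print(rows)
```

Averaging directly over the raw rows would not work here, because the thing being averaged (incidents per day) only exists after the first aggregation.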

Subqueries in conditional logic


You can use subqueries in conditional logic (in conjunction with WHERE, JOIN/ON, or
CASE). The following query returns all of the entries from the earliest date in the
dataset (theoretically—the poor formatting of the date column actually makes it return
the value that sorts first alphabetically):

The above query works because the result of the subquery is only one cell. Most
conditional logic will work with subqueries containing one-cell results. However, a
plain comparison such as = fails when the inner query returns multiple rows; in that
case, use IN (or a related construct such as EXISTS, ANY, or ALL):

Note that you should not include an alias when you write a subquery in a conditional
statement. This is because the subquery is treated as an individual value (or set of
values in the IN case) rather than as a table.
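Both forms can be sketched side by side. The table and data below are hypothetical, the dates are stored as ISO strings so that they sort chronologically, and the sketch runs on SQLite through Python's sqlite3 module. Note that neither subquery carries an alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE incidents (id INTEGER PRIMARY KEY, incident_date TEXT);
    INSERT INTO incidents (incident_date) VALUES
        ('2014-01-01'), ('2014-01-01'), ('2014-01-02'), ('2014-01-03');
""")

# One-cell subquery: a plain "=" comparison is enough.
earliest = conn.execute("""
    SELECT id
    FROM incidents
    WHERE incident_date = (SELECT MIN(incident_date) FROM incidents)
    ORDER BY id
""").fetchall()

# Multi-row subquery: use IN to match against the whole set.
first_two_days = conn.execute("""
    SELECT id
    FROM incidents
    WHERE incident_date IN (
          SELECT DISTINCT incident_date
          FROM incidents
          ORDER BY incident_date
          LIMIT 2
         )
    ORDER BY id
""").fetchall()

print(earliest, first_two_days)
```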

Joining subqueries
You may remember that you can filter queries in joins. It's fairly common to join a
subquery that hits the same table as the outer query rather than filtering in the WHERE
clause. The following query produces the same results as the previous example:

This can be particularly useful when combined with aggregations. When you join, the
requirements for your subquery output aren't as stringent as when you use the WHERE
clause. For example, your inner query can output multiple results. The following query
ranks all of the results according to how many incidents were reported in a given day. It
does this by aggregating the total number of incidents each day in the inner query, then
using those values to sort the outer query:

Subqueries can be very helpful in improving the performance of your queries. Let's
revisit the Crunchbase Data briefly. Imagine you'd like to aggregate all of the
companies receiving investment and companies acquired each month. You could do
that without subqueries if you wanted to, but don't actually run this as it will take
minutes to return:

Note that in order to do this properly, you must join on date fields, which causes a
massive "data explosion." Basically, what happens is that you're joining every row in a
given month from one table onto every row in that same month on the other table, so the
number of rows returned is enormous. Because of this multiplicative effect, you
must use COUNT(DISTINCT) instead of COUNT to get accurate counts.

You can see this below:



The following query shows 7,414 rows:


SELECT COUNT(*) FROM tutorial.crunchbase_acquisitions

The following query shows 83,893 rows:


SELECT COUNT(*) FROM tutorial.crunchbase_investments

The following query shows 6,237,396 rows:


SELECT COUNT(*)
FROM tutorial.crunchbase_acquisitions acquisitions
FULL JOIN tutorial.crunchbase_investments investments
ON acquisitions.acquired_month = investments.funded_month

If you'd like to understand this a little better, you can do some extra research on
cartesian products. It's also worth noting that the FULL JOIN and COUNT above
actually run pretty fast; it's the COUNT(DISTINCT) that takes forever. More on that in
the lesson on optimizing queries.

Of course, you could solve this much more efficiently by aggregating the two tables
separately, then joining them together so that the counts are performed across far
smaller datasets:

Note: We used a FULL JOIN above just in case one table had observations in a month
that the other table didn't. We also used COALESCE to display months when the
acquisitions subquery didn't have month entries (presumably no acquisitions occurred
in those months). We strongly encourage you to re-run the query without some of these
elements to better understand how they work. You can also run each of the subqueries
independently to get a better understanding of them as well.
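The aggregate-first idea can be illustrated on a toy scale. This is a sketch, not the original query: the acquisitions and investments tables and their rows are hypothetical, and because FULL JOIN is not available in every SQLite version, the sketch builds the list of months with a UNION and LEFT JOINs each aggregated subquery onto it. COALESCE fills in 0 for months missing from one side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acquisitions (company TEXT, acquired_month TEXT);
    CREATE TABLE investments (company TEXT, funded_month TEXT);
    INSERT INTO acquisitions VALUES ('A', '2013-01'), ('B', '2013-01'), ('C', '2013-02');
    INSERT INTO investments VALUES ('D', '2013-01'), ('E', '2013-01'), ('F', '2013-01'),
                                   ('G', '2013-03'), ('H', '2013-03');
""")

# Joining the raw tables on month multiplies rows: 2 x 3 matches for 2013-01.
exploded = conn.execute("""
    SELECT COUNT(*)
    FROM acquisitions AS a
    JOIN investments AS i ON a.acquired_month = i.funded_month
""").fetchone()[0]

# Aggregating each table first keeps every join input at one row per month.
rows = conn.execute("""
    SELECT months.month,
           COALESCE(acq.companies_acquired, 0) AS companies_acquired,
           COALESCE(inv.companies_funded, 0) AS companies_funded
    FROM (
          SELECT acquired_month AS month FROM acquisitions
          UNION
          SELECT funded_month FROM investments
         ) AS months
    LEFT JOIN (
          SELECT acquired_month AS month, COUNT(DISTINCT company) AS companies_acquired
          FROM acquisitions
          GROUP BY acquired_month
         ) AS acq ON acq.month = months.month
    LEFT JOIN (
          SELECT funded_month AS month, COUNT(DISTINCT company) AS companies_funded
          FROM investments
          GROUP BY funded_month
         ) AS inv ON inv.month = months.month
    ORDER BY months.month
""").fetchall()

print(exploded, rows)
```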

Subqueries and UNIONs


For this next section, we will borrow directly from the lesson on UNIONs—again using
the Crunchbase data:

It's certainly not uncommon for a dataset to come split into several parts, especially if
the data passed through Excel at any point (Excel can only handle ~1M rows per
spreadsheet). The two tables used above can be thought of as different parts of the
same dataset—what you'd almost certainly like to do is perform operations on the entire
combined dataset rather than on the individual parts. You can do this by using a
subquery:

This is pretty straightforward; try it for yourself.
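A small sketch of a subquery over a UNION ALL, assuming two hypothetical tables that hold parts of the same dataset (the names investments_part1 and investments_part2 and their columns are illustrative). It runs on SQLite via Python's sqlite3 module. The outer query aggregates over the combined rows as if they came from a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE investments_part1 (company TEXT, status TEXT);
    CREATE TABLE investments_part2 (company TEXT, status TEXT);
    INSERT INTO investments_part1 VALUES
        ('A', 'operating'), ('B', 'acquired'), ('C', 'operating');
    INSERT INTO investments_part2 VALUES
        ('D', 'operating'), ('E', 'closed');
""")

rows = conn.execute("""
    SELECT sub.status, COUNT(*) AS companies
    FROM (
          SELECT * FROM investments_part1
          UNION ALL
          SELECT * FROM investments_part2
         ) AS sub
    GROUP BY sub.status
    ORDER BY sub.status
""").fetchall()

print(rows)
```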



Stored Procedure
A stored procedure is a set of SQL statements that can be saved and executed later. It
can take parameters, making it a reusable and efficient way to perform specific tasks or
operations on a database.

Advantages:
● Code Reusability: Stored procedures can be called from multiple locations,
promoting code reuse.
● Enhanced Security: Users can execute a stored procedure without direct table
access, improving security.

Example:

We can write the query using the CTE and subquery

subquery:

CTE:

Is there any reason that we might choose to use a CTE over a subquery? In terms of
performance, they are pretty much the same. Remember from our talk on the order of
operations in SQL that a subquery will run before the main query, and that is the same
with the CTE, so in either case you are basically querying a query.

If the performance of the two options is the same, why would you choose a CTE over
a subquery? For a simple query like my example, it’s probably going to come down to
personal preference. However, for more complex queries that require multiple
subqueries, using CTEs can make your query easier to understand. This is especially
important when you are writing queries that will need to be used or edited by multiple
users. With the subquery structure, it isn’t always easy to see what the author intended.

You can create multiple CTEs to use in a query, just as you can create multiple
subqueries. You can also name your CTE whatever you like. I used CTE before to
make it clear which part was the CTE, but it is better to use descriptive names,
which also make it easier to understand what your query is doing.

Here’s an example:

We could also write this using multiple subqueries:



The query using multiple CTEs is easier to read, but you get the same results no matter
which way you write the query.

One item to point out here: when using multiple CTEs, you only need to use the WITH
keyword once, and you separate the individual CTEs with a comma.
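The single-WITH, comma-separated form looks roughly like this. The Employees data and the CTE names DeptAverages and HighPayingDepts are hypothetical, and the sketch runs on SQLite through Python's sqlite3 module. Note that the second CTE reads from the first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, DepartmentID INTEGER, Salary INTEGER);
    INSERT INTO Employees (DepartmentID, Salary) VALUES
        (1, 50000), (1, 70000), (2, 40000), (2, 44000), (3, 90000);
""")

rows = conn.execute("""
    WITH DeptAverages AS (
        SELECT DepartmentID, AVG(Salary) AS AvgSalary
        FROM Employees
        GROUP BY DepartmentID
    ),                              -- comma, not a second WITH
    HighPayingDepts AS (
        SELECT DepartmentID
        FROM DeptAverages           -- a CTE can reference an earlier CTE
        WHERE AvgSalary >= 60000
    )
    SELECT e.EmployeeID, e.DepartmentID, e.Salary
    FROM Employees AS e
    JOIN HighPayingDepts AS h ON h.DepartmentID = e.DepartmentID
    ORDER BY e.EmployeeID
""").fetchall()

print(rows)
```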

So, all this to say that CTEs and subqueries will accomplish the same thing. What about
temp tables? There is one major difference between CTEs/subqueries and temp tables. A
temp table can be accessed by multiple queries in the same SQL session. A
CTE/subquery is only available for a single query.

What does that mean? Let’s say that I had multiple queries that needed to use the
same ‘mask’ CTE. I could put that CTE at the beginning of each query, but that would
require a lot of extra typing (which is not high on my list of ways to spend time), and
performance would deteriorate: the same temporary result set would be run for each
query, and that takes extra time. This is why we use temp tables.

Let’s take a look at how this would work.

Once you create and populate your temp table (the SELECT clause is populating the
table), you can query it multiple times until you disconnect your SQL session. Note that
the exact syntax depends on the database: SQL Server supports SELECT ... INTO
#table, while SQLite uses CREATE TEMP TABLE ... AS SELECT instead.

So, now we can query like normal using our temp.mask table.

Notice that we still used a CTE. Your temp table works just like any other table when
querying; you just have to create it once per session. There are some things to keep in
mind. Your temp table is stored, so it is not re-run for each query (unlike a
CTE/subquery), which can significantly improve performance if you are running multiple
queries using the same temporary data. However, because the temp table has to be
written and stored, if you are only using it for a single query, performance will likely
be worse with a temp table.

In summary, we can use CTE/subqueries interchangeably, but the CTE is easier to read
and see what is going on, so it is best used when a query will be used/edited by other
users. The use case for these is single queries — the information will not need to be
accessed by multiple queries in a single SQL session. A temp table is also a temporary
data source, but it will be available for multiple queries during the same SQL session.

Other stuff that you may want to know

Indexed Views
A special kind of view, the Indexed View, can be created so that the produced result
is materialized and persisted into the database data file. With Indexed Views, the result
doesn’t need to be recalculated every time, so they are great for improving read
performance. In HTAP scenarios they can give a great performance boost. The
database engine will also make sure that every time data in one of the base tables
used in an Indexed View is updated, the persisted result is updated too, so that you
always have fresh values.

Inline Table-Valued Functions (aka Parametrized Views)


Sometimes you would like to have a View with parameters, to make it easier to return
just the subset of values you are interested in. In Azure SQL and SQL Server, you can
create parametrized views. They fall (more correctly, IMHO) under the umbrella of
“Functions”, and specifically they can be created by using Inline Table-Valued
Functions:
