Outros Joins - LOOP, HASH and MERGE Join Types

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

(http://www.

madeir

LOOP, HASH and MERGE Join Types


by Eitan Blumin (http://www.madeiradata.com/author/eitanblumin/) on 5 January 2012
(http://www.madeiradata.com/loop-hash-and-merge-join-types/) in Blog
(http://www.madeiradata.com/category/blog/) • 16 Comments (http://www.madeiradata.com/loop-
hash-and-merge-join-types/#comments)

Today I’ll talk about the available JOIN operator types in SQL Server (Nested
Loops, Hash and Merge Joins), their differences, best practices and complexity.

For the samples in this post, we’ll use the free AdventureWorks database
sample available here:
http://msftdbprodsamples.codeplex.com/releases/view/4004
(http://msftdbprodsamples.codeplex.com/releases/view/4004)

Introduction: What are Join Operators?


A join operator is a type of an algorithm which the SQL Server Optimizer chooses
in order to implement logical joins between two sets of data.

The SQL Server Optimizer may choose a different algorithm for different
scenarios based on the requested query, available indexes, statistics and number
of estimated rows in each data set.

It’s possible to find the operator which was used by looking at the execution plan
that SQL Server has prepared for your query.
For more information on execution plans and how to read them, I recommend
checking out the first chapter out of Grant Fritchey’s excellent book:
http://www.simple-talk.com/sql/performance/execution-plan-basics/
(http://www.simple-talk.com/sql/performance/execution-plan-basics/)

NESTED LOOPS
“Nested Loops” is the simplest operator of the bunch.

We’ll take the following query as an example (which gets some order detail
columns for orders placed during July 2001):

SELECT
OH.OrderDate, OD.OrderQty, OD.ProductID, OD.UnitPrice
FROM
Sales.SalesOrderHeader AS OH
JOIN
Sales.SalesOrderDetail AS OD
ON
OH.SalesOrderID = OD.SalesOrderID
WHERE
OH.OrderDate BETWEEN '2001­07­01' AND '2001­07­31'

The resulting execution plan looks like this:

(http://i0.wp.com/www.madeirasql.com/wp-content/uploads/join-types-1-exec-
plan.jpg)

The operator on the top right is called the outer input and the one just below it
is called the inner input.

What the “Nested Loops” operator basically does is: For each record from the
outer input – find matching rows from the inner input.

Technically, this means that the clustered index scan you see as the outer input is
executed once to get all the relevant records, and the clustered index seek you
see below it is executed for each record from the outer input.

We’ll verify this information by placing the cursor over the Clustered Index Scan
operator and looking at the tooltip:

(http://i2.wp.com/www.madeirasql.com/wp-content/uploads/join-types-2-
details.jpg)

We can see that the estimated number of executions is 1. We’ll look at the tooltip
of the Clustered Index Seek:

(http://i2.wp.com/www.madeirasql.com/wp-content/uploads/join-types-3-
details.jpg)

This time we can see that the estimated number of executions is 179 which is the
approximate number of rows returned from the outer input.

In terms of complexity (assume N is the number of rows from the outer output
and M is the total number of rows in the SalesOrderDetail table): The complexity
of this query is: O(NlogM) where “logM” is the complexity of each seek in the
inner input.
The SQL Server Optimizer will prefer to choose this operator type when the outer
input is small and the inner input has an index on the column(s) by which the two
data sets are joined. The bigger the difference in number of rows between the
outer and inner inputs, the more benefit this operator will provide over the other
operator types.

Having indexes and up-to-date statistics is crucial for this join type, because you
don’t want SQL Server to accidently think there’s a small number of rows in one
of the inputs when in fact there are a whole lot. For example: Performing 10
times index seek is nothing like performing 100,000 times index seek, especially if
the table size of the inner input is around 120,000 and you’d be better off doing
one table scan instead.

MERGE Join
The “Merge” algorithm is the most efficient way to join between two very large
sets of data which are both sorted on the join key.

We’ll use the following query as an example (which returns a list of customers
and their sale order identifiers):

SELECT
OC.CustomerID, OH.SalesOrderID
FROM
Sales.SalesOrderHeader AS OH
JOIN
Sales.Customer AS OC
ON
OH.CustomerID = OC.CustomerID

The execution plan for this query is:

(http://i1.wp.com/www.madeirasql.com/wp-content/uploads/join-types-4-exec-
plan.jpg)

• First, we’ll notice that both of the data sets are sorted on the CustomerID
column: The Customer table because that’s the clustered primary key, and
the SalesOrderHeader table because there’s a nonclustered index on the
CustomerID column.
• By the width of the arrows between the operators (and by placing the cursor
over them), we can see that the number of rows returned from each set is
rather large.
• In addition, we used the equality operator (=) in the ON clause (the Merge
join requires an equality operator).

These three factors cause SQL Server Optimizer to choose the Merge Join for this
query.

The biggest performance gain from this join type is that both input operators are
executed only once. We can verify this by placing the cursor over them and see
that the number of executions is 1 for both of the operators. Also, the algorithm
itself is very efficient:

The Merge Join simultaneously reads a row from each input and compares them
using the join key. If there’s a match, they are returned. Otherwise, the row with
the smaller value can be discarded because, since both inputs are sorted, the
discarded row will not match any other row on the other set of data.

This repeats until one of the tables is completed. Even if there are still rows on
the other table, they will clearly not match any rows on the fully-scanned table, so
there is no need to continue. Since both tables can potentially be scanned, the
maximum cost of a Merge Join is the sum of both inputs. Or in terms of
complexity: O(N+M)

If the inputs are not both sorted on the join key, the SQL Server Optimizer will
most likely not choose the Merge join type and instead prefer the Hash join (will
be explained soon). However if it does anyway (whether it’s because it was forced
to due to a join hint, or because it was still the most efficient), then SQL will need
to sort the table which is not already sorted on the join key.

HASH Match
The “Hash” join type is what I call “the go-to guy” of the join operators. It’s the one
operator chosen when the scenario doesn’t favor in any of the other join types.
This happens when the tables are not properly sorted, and/or there are no
indexes. When SQL Server Optimizer chooses the Hash join type, it’s usually a bad
sign because something probably could’ve been done better (for example, adding
an index). However, in some cases (complex queries mostly), there’s simply no
other way.

We’ll use the following query as an example (which gets the first and last names
of contacts starting with “John” and their sales order identifiers):

SELECT
OC.FirstName, OC.LastName, OH.SalesOrderID
FROM
Sales.SalesOrderHeader AS OH
JOIN
Person.Contact AS OC
ON
OH.ContactID = OC.ContactID
WHERE
OC.FirstName LIKE 'John%'

The execution plan looks like this:

(http://i2.wp.com/www.madeirasql.com/wp-content/uploads/join-types-5-exec-
plan.jpg)

Because there’s no index on the ContactID column in the SalesOrderHeader


table, SQL Server chooses the Hash join type.

Before diving into the example, I’ll first explain two important concepts: A
”Hashing” function and a “Hash Table”.

“Hashing” is a programmatic function which takes one or more values and


converts them to a single symbolic value (usually numeric). The function is usually
one-way meaning you can’t convert the symbolic value back to its original value
(s), but it’s deterministic meaning if you provide the same value as input you will
always get the same symbolic value as output. Also, it’s possible for several
different inputs to result in the same output hash value (meaning, the function
isn’t necessarily unique).
A “Hash Table” is a data structure that divides all rows into equal-sized “buckets”,
where each “bucket” is represented by a hash value. This means that when you
activate the hash function on some row, using the result you’ll know immediately
to which bucket it belongs.

As with the Merge join, the two input operators are executed only once. We can
verify this by looking at the tooltips of the input operators.

Using available statistics, SQL Server will choose the smaller of the two inputs to
serve as the build input and it will be the one used to build the hash table in
memory. If there’s not enough memory for the hash table, SQL Server will use
physical disk space in TEMPDB. After the hash table is built, SQL Server will get
the data from the larger table, called the probe input, compare it to the hash
table using a hash match function, and return any matched rows. In graphical
execution plans, the build input will always be the one on top, and the probe
input will be the one below.

As long as the smaller table is very small, this algorithm can be very efficient. But
if both tables are very large, this can be a very costly execution plan.

SQL Server Optimizer uses statistics to figure out the cardinality of the values.
Using that information, it dynamically creates a hash function which will divide the
data into as many buckets with sizes as equal as possible.

Because  it’s so dynamic,  it’s difficult to estimate the complexity of the creation of
the hash table and the  complexity of each hash match. Because SQL Server
Optimizer performs this dynamic operation during execution time and not during
compilation time, sometimes the values you see in the execution plan are
incorrect. In some cases, you could compare a hash join and a nested loops join,
see that according to the execution plans the nested loop is more expensive (in
terms of logical reads etc.), when in fact the hash join executes much slower
(because its cost estimation is incorrect).

For our complexity terms, we’ll assume hc is the complexity of the hash table
creation, and hm is the complexity of the hash match function. Therefore, the
complexity of the Hash join will be O(N*hc + M*hm + J) where N is the smaller
data set, M is the larger data set and J is a “joker” complexity addition for the
dynamic calculation and creation of the hash function.
Microsoft has an interesting page in Books Online that describes further Hash
Join sub-types and additional aspects about them. I highly recommend you read
it: http://msdn.microsoft.com/en-us/library/ms189313.aspx
(http://msdn.microsoft.com/en-us/library/ms189313.aspx)

Query Hints
Using Query Hints, it’s possible to force SQL Server to use specific join types.
However, it’s highly not recommended to do so especially in production
environments, because the same join type may not be the best choice forever
(because data can change), and SQL Server Optimizer usually has it right
(assuming statistics are up-to-date).

I’ll talk about them just so you can experiment with the different join types on
your test environment and see the differences.

To force SQL Server to use specific join types using query hints, you add the
OPTION clause at the end of the query, and use the keywords LOOP JOIN,
MERGE JOIN or HASH JOIN.

Try executing the above queries with different join hints and see what happens:

SELECT OC.CustomerID, OH.SalesOrderID


FROM Sales.SalesOrderHeader AS OH
JOIN Sales.Customer AS OC
ON OH.CustomerID = OC.CustomerID
OPTION (HASH JOIN)

SELECT OC.FirstName, OC.LastName, OH.SalesOrderID


FROM Sales.SalesOrderHeader AS OH
JOIN Person.Contact AS OC
ON OH.ContactID = OC.ContactID
WHERE OC.FirstName LIKE 'John%'
OPTION (LOOP JOIN)

SELECT OC.FirstName, OC.LastName, OH.SalesOrderID


FROM Sales.SalesOrderHeader AS OH
JOIN Person.Contact AS OC
ON OH.ContactID = OC.ContactID
WHERE OC.FirstName LIKE 'John%'
OPTION (MERGE JOIN)

Summary
Nested Loops
• Complexity: O(NlogM)
• Used usually when one table is significantly small
• The larger table has an index which allows seeking it using the join key

Merge Join
• Complexity: O(N+M)
• Both inputs are sorted on the join key
• An equality operator is used
• Excellent for very large tables

Hash Match
• Complexity: O(N*hc+M*hm+J)
• Last-resort join type
• Uses a hash table and a dynamic hash match function to match rows

Resources
The following resources were used for the making of this post:

• Inside the SQL Server Query Optimizer (http://www.simple-


talk.com/books/sql-books/inside-the-sql-server-query-optimizer/) by
Benjamin Nevarez
• SQL Server Execution Plans
(http://www.sqlservercentral.com/articles/books/65831/) by Grant
Fritchey
• Advanced Query Tuning Concepts (http://msdn.microsoft.com/en-
us/library/ms191426.aspx) at Microsoft’s SQL Server Books Online
• Blog post about Hash Joins
(http://blogs.msdn.com/b/craigfr/archive/2006/08/10/687630.aspx) by Craig
Freedman

About Latest Posts

Eitan Blumin
(Http://Www.madeiradata.com/Author/Eitanb
lumin/)
Eitan has a black belt in SQL Server design and development.
(http://ww Eitan is a brilliant architect as well as a professional developer who is
w.madeira capable of turning a very complex design into a practical and efficient
data.com/ solution. He has nine years of experience facing all kinds of SQL Server

author/eit

You might also like