
Your Trusted Analytics and Planning Partner

Distributions in Azure Synapse

Maria Thomas
TekLink International, Inc.
www.Teklink.com
Table of Contents

• Distributions in Azure Synapse
• Synapse SQL architecture components
• What is a distributed table
• Hash-distributed tables
• Round-robin distributed tables
• Replicated tables

TekLink International Confidential 2


Distributions in Azure Synapse

• Hash distribution
• Round-robin distribution
• Replicated tables



Synapse SQL Architecture Components



What is a distributed table?

• A distributed table in Azure Synapse appears logically as a single table, but its rows are
physically stored across the 60 distributions of the dedicated SQL pool.
• How well the data is distributed directly affects query execution performance.
• You choose the sharding pattern used to distribute the data when you define the table. These
sharding patterns are supported:
  • Hash
  • Round-robin
  • Replicate
• A distribution is the basic unit of storage and processing for parallel queries that run on distributed data.
When Synapse SQL runs a query, the work is divided into 60 smaller queries that run in parallel.
Each of the 60 smaller queries runs on one of the data distributions. Each compute node manages one or more of the
60 distributions.
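
To make the relationship between distributions and compute nodes concrete, the sketch below spreads the 60 distributions across a given number of compute nodes. The even, modulo-based split and the name `distributions_per_node` are assumptions made for illustration; the real placement is managed by the service.

```python
DISTRIBUTION_COUNT = 60  # a dedicated SQL pool always has 60 distributions


def distributions_per_node(compute_nodes: int) -> dict[int, list[int]]:
    """Spread the 60 distributions across compute nodes as evenly as possible.

    Illustrative only: the actual mapping is handled by the service.
    """
    if not 1 <= compute_nodes <= DISTRIBUTION_COUNT:
        raise ValueError("compute_nodes must be between 1 and 60")
    mapping: dict[int, list[int]] = {node: [] for node in range(compute_nodes)}
    for distribution in range(DISTRIBUTION_COUNT):
        # Hypothetical placement rule: cycle through the nodes in order.
        mapping[distribution % compute_nodes].append(distribution)
    return mapping


if __name__ == "__main__":
    for nodes in (1, 6, 60):
        layout = distributions_per_node(nodes)
        print(nodes, "node(s):", len(layout[0]), "distribution(s) on node 0")
```

With one compute node, that node manages all 60 distributions; with 60 compute nodes, each node manages exactly one.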



Hash Distributed Tables

• To shard data into a hash-distributed table, a hash function is used to deterministically assign each row
to one distribution. In the table definition, one of the columns is designated as the distribution column.
The hash function uses the values in the distribution column to assign each row to a distribution.
• Because identical values always hash to the same distribution, the engine has built-in knowledge of the row
locations. In a dedicated SQL pool this knowledge is used to minimize data movement during queries,
which improves query performance.
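
The deterministic assignment described above can be sketched in a few lines. This is a minimal illustration, not the actual algorithm: SHA-256 stands in for Synapse's internal hash function, and `distribution_for` and the sample `customer_id` rows are hypothetical.

```python
import hashlib

DISTRIBUTION_COUNT = 60


def distribution_for(value: object) -> int:
    """Map a distribution-column value to a distribution number (0-59).

    SHA-256 is a stand-in; Synapse uses its own internal hash function.
    """
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTION_COUNT


if __name__ == "__main__":
    rows = [
        {"customer_id": 101, "amount": 25.0},
        {"customer_id": 202, "amount": 40.0},
        {"customer_id": 101, "amount": 12.5},  # same key as the first row
    ]
    for row in rows:
        print(row, "-> distribution", distribution_for(row["customer_id"]))
```

Rows that share a `customer_id` always land in the same distribution, so a join or aggregation on that column can be answered without moving rows between distributions.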



Hash Distributed Tables

• Hash-distributed tables work well for large fact tables in a star schema. They can have very large
numbers of rows and still achieve high performance. A hash-distributed table can deliver the highest
query performance for joins and aggregations on large tables.
• Consider using a hash-distributed table when:
  • The table size on disk is more than 2 GB.
  • The table has frequent insert, update, and delete operations.





Choosing a distribution column

Choose a distribution column whose data distributes evenly (no data skew).
To balance the parallel processing, select a distribution column that:
• Has many unique values. All rows with the same value are assigned to the same distribution. Since there
are 60 distributions, some distributions can hold more than one unique value while others may end up with none.
• Has no NULLs, or only a few. If all values in the column are NULL, all the rows are
assigned to the same distribution.
• Is not a date column. All data for the same date lands in the same distribution. If several users are all
filtering on the same date, then only 1 of the 60 distributions does all the processing work.
Choose a distribution column that minimizes data movement.
To minimize data movement, select a distribution column that:
• Is used in JOIN, GROUP BY, DISTINCT, and HAVING clauses.
If none of your columns has enough distinct values for a distribution column, you can create a new
column as a composite of one or more values.
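
The effect of the column choice on skew can be checked with a small simulation. The sketch below is illustrative only: SHA-256 stands in for Synapse's internal hash function, and `skew_report`, `order_ids`, and `order_dates` are hypothetical names and sample data.

```python
import hashlib
from collections import Counter

DISTRIBUTION_COUNT = 60


def distribution_for(value: object) -> int:
    """Stand-in hash: map a value to a distribution number (0-59)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTION_COUNT


def skew_report(values: list) -> tuple[int, float]:
    """Return (distributions that hold rows, largest share of all rows)."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(distribution_for(v) for v in values)
    largest_share = max(counts.values()) / len(values)
    return len(counts), largest_share


if __name__ == "__main__":
    order_ids = list(range(100_000))        # many unique values
    order_dates = ["2024-01-15"] * 100_000  # one date shared by every row
    print("order_id:  ", skew_report(order_ids))
    print("order_date:", skew_report(order_dates))
```

A column with many unique values spreads rows over all distributions, while a column where every row carries the same value (a single date, or all NULLs) puts every row in one distribution.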

Round-Robin Distributed Table

A round-robin table delivers fast performance when used as a staging table for loads.

A round-robin distributed table distributes rows evenly across all distributions, but without any further
optimization.

A distribution is first chosen at random, and then buffers of rows are assigned to distributions sequentially. It
is quick to load data into a round-robin table, but query performance can often be better with hash-distributed
tables.

Joins on round-robin tables require reshuffling data, which takes additional time.
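
The round-robin assignment described above can be sketched as follows. This is a simplified model: `round_robin_assign`, `buffer_size`, and `seed` are hypothetical names and parameters introduced for the example, not Synapse settings.

```python
import random
from collections import Counter

DISTRIBUTION_COUNT = 60


def round_robin_assign(row_count: int, buffer_size: int = 1, seed=None) -> list[int]:
    """Assign rows to distributions in round-robin order.

    A starting distribution is picked at random, then each buffer of rows
    goes to the next distribution in sequence.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    start = rng.randrange(DISTRIBUTION_COUNT)
    return [
        (start + row // buffer_size) % DISTRIBUTION_COUNT
        for row in range(row_count)
    ]


if __name__ == "__main__":
    assignments = round_robin_assign(row_count=6_000, buffer_size=10, seed=42)
    per_distribution = Counter(assignments)
    print("distinct row counts per distribution:", set(per_distribution.values()))
```

Placement depends only on arrival order, not on any column value, so rows with the same join key usually sit in different distributions. That is why a join on a round-robin table has to reshuffle data first.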



Replicated Table

• A replicated table has a full copy of the table accessible on each compute node. Replicating a table removes the need to transfer data
among compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table
size is less than 2 GB compressed. 2 GB is not a hard limit. If the data is static and does not change, you can replicate larger tables.
Replicated tables work well for dimension tables in a star schema.
Replicated tables may not yield the best query performance when:
• The table has frequent insert, update, and delete operations. Data manipulation language (DML) operations require a rebuild of
the replicated table. Rebuilding frequently can cause slower performance.
• The SQL pool is scaled frequently. Scaling a SQL pool changes the number of compute nodes, which forces a rebuild of the replicated table.



