Micro Partitions and Clustering

• Traditional data warehouses rely on static partitioning of large tables to achieve acceptable
performance and enable better scaling.
• In these systems, a partition is a unit of management that is manipulated independently
using specialized DDL and syntax; however, static partitioning has a number of
well-known limitations, such as maintenance overhead and data skew, which can result in
disproportionately-sized partitions.
• All data in Snowflake tables is automatically divided into micro-partitions, which are
contiguous units of storage.
• Micro-partitions and data clustering are two of the principal concepts underlying Snowflake's physical table structures.
• Each micro-partition contains between 50 MB and 500 MB of uncompressed data.
Micro Partitions and Clustering
Snowflake stores metadata about all rows stored in a micro-partition, including:
• The range of values for each of the columns in the micro-partition.
• The number of distinct values.
• Additional properties used for both optimization and efficient query processing.

• Snowflake enables precise pruning of columns in micro-partitions at query run-time, including columns containing semi-structured data.
• If a filter predicate selects only 10% of a table's data, then thanks to the micro-partition arrangement, ideally only about 10% of its micro-partitions need to be scanned. For example, a query targeting a particular hour in a year of data would ideally scan only 1/8760 (24 × 365 = 8,760 hours) of the micro-partitions in the table, and then scan only the portions of those micro-partitions that contain the relevant columns.
• The efficiency of pruning can be observed by comparing the partitions scanned and partitions total statistics in the TableScan operators of the query profile (see the first example after this list).
• The wider the gap between the number of partitions scanned and the total number of partitions, the better.
• If these numbers are close to each other, pruning is not helping your query; in that case, for very large tables, you could change the clustering key.
• The closer the ratio of scanned micro-partitions and columnar data to the ratio of data actually selected, the more efficient the pruning performed on the table.
• Not all predicate expressions can be used to prune. For example, Snowflake does not prune micro-partitions based on a predicate with a subquery, even if the subquery results in a constant (see the second example below).
• For smaller tables, you could instead reorganize your query to include a filter that uses the existing clustering key.
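
Beyond the query profile UI, one hedged way to check pruning efficiency is the documented PARTITIONS_SCANNED and PARTITIONS_TOTAL columns of the ACCOUNT_USAGE.QUERY_HISTORY view (the filter and limit values below are arbitrary):

    -- Pruning efficiency of recent queries: the lower the scan ratio, the better
    select query_id,
           partitions_scanned,
           partitions_total,
           partitions_scanned / nullif(partitions_total, 0) as scan_ratio
    from snowflake.account_usage.query_history
    where partitions_total > 0
    order by start_time desc
    limit 20;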
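
And a minimal sketch of the subquery limitation, assuming a hypothetical orders table clustered on order_date:

    -- Prunes: a literal predicate on the clustering column
    select * from orders where order_date = '2023-06-01';

    -- Does not prune: a predicate with a subquery,
    -- even though the subquery returns a single constant value
    select * from orders
    where order_date = (select max(order_date) from orders);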
Micro Partitions and Clustering
Clustering
• Typically, the data stored in tables is sorted/ordered along natural dimensions (e.g., date, region). This natural clustering is a key factor in query performance.
• As data is inserted/loaded into a table, clustering metadata is collected and recorded for each micro-partition created during the process. This helps prevent unnecessary scans.
• Snowflake maintains clustering metadata for the micro-partitions in a table, including:
• The total number of micro-partitions that comprise the table.
• The number of micro-partitions containing values that overlap with each other (in a specified subset of table columns).
• The depth of the overlapping micro-partitions.
• Any data landing in Snowflake goes through the following operations:
1. Divide and map the incoming data into micro-partitions, using the ordering of the data as it is inserted/loaded.
2. Compress the data.
3. Capture and store the metadata.
Benefits of micro-partitions
1- Snowflake derives micro-partitions automatically, so they do not need to be defined or maintained up front.
2- Micro-partitions are 50-500 MB in size before Snowflake applies compression, which enables efficient, fine-grained pruning for faster queries.
3- Micro-partitions can overlap in their range of values, which prevents skew.
4- Columnar storage enables efficient scanning of only the columns referenced in a query (see the example below).
5- Columns are compressed individually within each micro-partition.
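
To illustrate point 4: because storage within each micro-partition is columnar, a query reading one column never touches the others. A small sketch, using the customer table from the example on the next slide:

    -- Only the dob column is read from each scanned micro-partition;
    -- f_name, l_name, active, and city are never touched
    select dob
    from customer
    where year(dob) = 2003;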
Micro Partitions and Clustering
Clustering depth
• The clustering depth for a populated table measures the average depth (1 or greater) of the overlapping micro-partitions for specified columns in a
table. The smaller the average depth, the better clustered the table is with regards to the specified columns.

Example: a customer table whose rows land in five micro-partitions, ordered by dob.

Partition 1 (dob min 2/5/1995, max 9/19/1997)
  id  f_name   l_name  dob         active  city
  13  sjka     kcds    2/5/1995    TRUE    Hyd
  9   ajksja   nsln    8/13/1997   FALSE   delhi
  19  akjas    nksd    9/11/1997   TRUE    mumbai
  4   hdonc    nlks    9/19/1997   TRUE    pune

Partition 2 (dob min 10/19/1997, max 2/15/2003)
  16  wno      mnclas  10/19/1997  TRUE    nasik
  10  nonwn    nlsk    1/21/2003   FALSE   lucknow
  18  mcdm     nkdlc   1/23/2003   TRUE    delhi
  7   lmkm     iw      2/15/2003   TRUE    hyd

Partition 3 (dob min 3/2/2003, max 3/21/2003)
  11  ncsln    lndc    3/2/2003    FALSE   Hyd
  1   nsn      ncl     3/15/2003   TRUE    delhi
  5   nsdlnld  ncdslk  3/19/2003   TRUE    mumbai
  8   eojei    nsdn    3/21/2003   TRUE    pune

Partition 4 (dob min 4/4/2003, max 4/17/2003)
  14  nsdnk    nscdn   4/4/2003    TRUE    Hyd
  12  nlsdn    ncdsl   4/11/2003   FALSE   delhi
  6   nsdln    ncsn    4/13/2003   TRUE    mumbai
  17  snoej    nskl    4/17/2003   TRUE    pune

Partition 5 (dob min 4/20/2003, max 8/25/2008)
  15  knslnc   lmsdc   4/20/2003   FALSE   nasik
  20  nsdlcn   alk     4/24/2003   TRUE    lucknow
  2   nsdln    mdcs    4/29/2003   TRUE    delhi
  3   nsdln    scd     8/25/2008   TRUE    hyd

select * from customer where year(dob) = 1995
  -> 1995 appears only in partition 1, so it has 0 overlaps.

select * from customer where year(dob) = 1997
  -> 1997 spans partitions 1 and 2, so it has 1 overlap and a depth of 2.

select * from customer where year(dob) = 2003
  -> 2003 spans partitions 2 through 5, so it has 3 overlaps and a depth of 4.
Micro Partitions and Clustering
Clustering depth
• As the number of overlapping micro-partitions decreases, the overlap depth decreases.
• When there is no overlap in the range of values across all micro-partitions, the micro-partitions are considered to be in a constant state (i.e. they cannot
be improved by clustering).
• SYSTEM$CLUSTERING_DEPTH( '<table_name>' , '( <col1> [ , <col2> ... ] )' [ , '<predicate>' ] )
• Computes the average depth of the table according to the specified columns (or the clustering key defined for the table). The average depth of a
populated table (i.e. a table containing data) is always 1 or more. The smaller the average depth, the better clustered the table is with regards to the
specified columns.
• SYSTEM$CLUSTERING_INFORMATION( '<table_name>' , '( <col1> [ , <col2> ... ] )' )
• Returns clustering information, including average clustering depth, for a table based on one or more columns in the table.
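
A quick sketch of calling both system functions against the customer example above:

    -- Average clustering depth of customer with respect to dob
    select system$clustering_depth('customer', '(dob)');

    -- Full clustering information for the same column, returned as JSON
    -- (includes the total partition count and the average depth)
    select system$clustering_information('customer', '(dob)');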
• To improve the clustering of the underlying table micro-partitions, you can always manually sort rows on key table columns and re-insert them into the table; however, performing these tasks could be cumbersome and expensive. Instead, Snowflake supports automating them by designating one or more table columns/expressions as a clustering key for the table (see the sketch at the end of this list). A table with a clustering key defined is considered to be clustered.
• In particular, to see performance improvements from a clustering key, a table has to be large enough to consist of a sufficiently large number of micro-
partitions, and the column(s) defined in the clustering key have to provide sufficient filtering to select a subset of these micro-partitions.
• In general, tables in the multi-terabyte (TB) range will experience the most benefit from clustering, particularly if DML is performed
regularly/continually on these tables.
• A clustering key is a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the data in the table in the same
micro-partitions. This is useful for very large tables where the ordering was not ideal (at the time the data was inserted/loaded) or extensive DML has
caused the table’s natural clustering to degrade.
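
A minimal sketch of designating a clustering key, assuming a hypothetical orders table:

    -- Define a clustering key when creating the table
    create table orders (
        order_id   number,
        order_date date,
        amount     number(10, 2)
    )
    cluster by (order_date);

    -- Or add/change the clustering key on an existing table
    alter table orders cluster by (order_date);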
Micro Partitions and Clustering
• Some general indicators that can help determine whether to define a clustering key for a table include:
• Queries on the table are running slower than expected or have noticeably degraded over time.
• The clustering depth for the table is large.
• Benefits of using a clustering key:
1- Improved scan efficiency and pruning.
2- Better column compression than tables with no clustering.
3- No additional administration is required; all future maintenance is performed by Snowflake.
• The compute resources used to perform clustering consume credits. As such, you should cluster only when queries will benefit substantially from the clustering. Clustering also helps queries run faster wherever a sort on the clustering key is required.
• The more frequently a table is queried, the more benefit clustering provides. However, the more frequently a table changes, the more expensive it is to keep it clustered. Therefore, clustering is generally most cost-effective for tables that are queried frequently and do not change frequently.
• The number of distinct values (i.e., cardinality) in a column/expression is a critical aspect of selecting it as a clustering key. It is important to choose a clustering key that has:
• A large enough number of distinct values to enable effective pruning on the table.
• A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.
• A column with very low cardinality (e.g., a column that indicates only whether a person is male or female) might yield only minimal pruning. At the other extreme, a column with very high cardinality (e.g., a column containing UUID or nanosecond timestamp values) is also typically not a good candidate to use as a clustering key directly; see the sketch below for reducing cardinality with an expression.
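
A hedged sketch of the common workaround, clustering on an expression that reduces cardinality (events, event_ts, and uuid_col are hypothetical names):

    -- Alternative 1: cluster a high-cardinality timestamp column by its date
    alter table events cluster by (to_date(event_ts));

    -- Alternative 2: cluster a UUID-like string column by a short prefix
    alter table events cluster by (substring(uuid_col, 1, 4));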
Average depth = (sum of overlap depths) / (number of micro-partitions)
Since Automatic Clustering is on, whenever there is an insert, Snowflake creates new micro-partitions and reclusters as needed.
Enabling Automatic Clustering simply flags to Snowflake that reclustering should be performed for the table in question.
alter table t2_order_proirity suspend recluster;
alter table t2_order_proirity resume recluster;
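
To see what Automatic Clustering is costing, a hedged sketch using the documented ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY view (the 30-day window is arbitrary):

    -- Credits consumed by Automatic Clustering, per table, over the last 30 days
    select database_name, schema_name, table_name,
           sum(credits_used)         as credits,
           sum(num_rows_reclustered) as rows_reclustered
    from snowflake.account_usage.automatic_clustering_history
    where start_time >= dateadd('day', -30, current_timestamp())
    group by 1, 2, 3
    order by credits desc;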
Clustering is not supported for external tables.
