Snowflake Interview 2024 03
The Snowflake Data Cloud has a number of powerful features that empower organizations to make
more data-driven decisions.
In this blog, we’re going to explore Snowflake’s Dynamic Data Masking feature in detail, including
what it is, how it helps, and why it’s so important for security purposes.
Snowflake Dynamic Data Masking (DDM) is a data security feature that allows you to mask sections
of data (from a table or a view) at query time, preserving anonymity through a predefined masking policy.
Data owners can decide how much sensitive data to reveal to different data consumers or data
requestors using Snowflake's Dynamic Data Masking feature, which helps prevent both accidental and
intentional exposure. It is a policy-based security feature that keeps the data in the database unchanged
while hiding sensitive data (e.g. PII, PHI, payment card data) in the query result set for specific database
fields.
For example, a call center agent may be able to identify a customer by checking the final four
characters of their Social Security Number (SSN) or PII field, but the entire SSN or PII field of the
customer should not be shown to the call center agent (data requester).
A Dynamic Data Masking (also known as on-the-fly data masking) policy can be specified to hide part
of the SSN or PII field so that the call center agent (data requester) does not get access to the sensitive
data. Likewise, an appropriate data masking policy can be defined to protect SSNs or other PII
fields, allowing production support members to query production environments for troubleshooting
without seeing any SSN or any other PII fields, thus satisfying compliance requirements.
Figure 1: Data Masking Using Masking Policy in Snowflake
The intention of Dynamic Data Masking is to protect the actual data by substituting or hiding it from
non-privileged users wherever it is not required, without changing or altering the data at rest.
Data is masked for different reasons. The main reason is risk reduction: limiting the possibility of a
sensitive data leak, in line with guidelines set by security teams. Data is also masked for commercial
reasons, such as masking financial data that should not be common knowledge even within the
organization. There are also compliance reasons, driven by requirements or recommendations from
specific standards and regulations such as GDPR, SOX, HIPAA, and PCI DSS.
Masking projects are usually initiated by data governance or compliance teams, often driven by
requirements from the privacy office or legal team that personally identifiable information be protected.
In Snowflake, Dynamic Data Masking is applied through masking policies. Masking policies are
schema-level objects that can be applied to one or more columns in a table or a view (standard &
materialized) to selectively hide and obfuscate according to the level of anonymity needed.
Once created and associated with a column, the masking policy is applied to the column at query
runtime at every position where the column appears.
Figure 2: How Dynamic Data Masking Works in Snowflake
To apply Dynamic Data Masking, a masking policy object must first be created. Like many
other objects in Snowflake, the masking policy is a securable, schema-level object.
The following is an example of a simple masking policy that masks the SSN based on a
user's role.
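The policy code itself is not reproduced in the text above; a minimal sketch consistent with the description that follows can be reconstructed. The mask_ssn name, the ssn_txt input column, and the CALL_CENTER_AGENT and PROD_SUPP_MEMBER roles come from the surrounding text, while the exact masking expressions (the regexp_replace patterns) are illustrative assumptions.
-- sketch of a role-based SSN masking policy (masking expressions are assumptions)
create or replace masking policy mask_ssn as (ssn_txt string) returns string ->
  case
    when current_role() in ('CALL_CENTER_AGENT') then regexp_replace(ssn_txt, '^.{7}', 'xxx-xx-')  -- partial mask: reveal only the last four characters
    when current_role() in ('PROD_SUPP_MEMBER') then regexp_replace(ssn_txt, '[0-9]', 'x')          -- replace every numeric character with 'x'
    else NULL
  end;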
The masking policy name, "mask_ssn", is the unique identifier within the schema. The signature
for the masking policy specifies the input column, in this example "ssn_txt", along with its data
type (string) to evaluate at query runtime. The return data type must match the input data type,
followed by the SQL expression that transforms or masks the data (ssn_txt in this example).
The SQL expression can include built-in functions, UDFs, or conditional expression functions (like
CASE in this example).
In the above example, the SSN is partially masked if the current role of the user
is CALL_CENTER_AGENT. If the user's role is PROD_SUPP_MEMBER, all numeric
characters are replaced with the character 'x'. For any other role, NULL is returned.
Once the masking policy is created, it needs to be applied to a table or view column. This can be done
during the table or view creation or using an alter statement.
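For instance, the mask_ssn policy from earlier can be attached to the customer table's ssn column with an ALTER statement; the sketch below mirrors the mask_fname example later in this section, and the fully qualified policy name is an assumption.
-- apply mask_ssn masking policy to customer.ssn column (qualified name assumed)
alter table if exists customer modify column ssn set masking policy
mydb.myschema.mask_ssn;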
We can create multiple masking policies and apply them to different columns at the same time.
In the previous example, we masked the customer table's SSN column. We can create additional
masking policies for the first_name, last_name, and date-of-birth columns and alter the customer table
to apply them.
-- masking policy to mask first name
create or replace masking policy mask_fname as (fname_txt string) returns string ->
  case
    when current_role() in ('CALL_CENTER_AGENT') then 'xxxxxx'
    when current_role() in ('PROD_SUPP_MEMBER') then 'xxxxxx'
    else NULL
  end;
-- apply mask_fname masking policy to customer.first_name column
alter table if exists customer modify column first_name set masking policy
mydb.myschema.mask_fname;
The best aspect of Snowflake’s data masking strategy is that end users can query the data without
knowing whether or not the column has a masking policy. Whenever Snowflake discovers a column
with a masking policy associated, the Snowflake query engine transparently rewrites the query at
runtime.
For authorized users, query results return sensitive data in plain text, whereas sensitive data is
masked, partially masked, or fully masked for unauthorized users.
If we take our customer data set where masking policies are applied on different columns, a query
submitted by a user and the query executed after Snowflake rewrites the query automatically looks as
follows.
Query Type | Query Submitted By User | Rewritten Query by Snowflake
Simple Query | select dob, ssn from customer; | select mask_dob(dob), mask_ssn(ssn) from customer;
Query with where clause predicate | select dob, ssn from customer where ssn = '576-77-4356'; | select mask_dob(dob), mask_ssn(ssn) from customer where mask_ssn(ssn) = '576-77-4356';
The rewrite is performed in all places where the protected column is present in the query, such as in
“projections”, “where” clauses, “join” predicates, “group by” statements, or “order by” statements.
There are cases where data masking on a particular field depends on other column values besides user
roles. To handle such a scenario, Snowflake supports conditional masking policy, and to enable this
feature, additional input parameters can be passed as an argument along with data type.
Let’s say a user has opted to show his/her educational detail publicly but this flag is false for many
other users. In such a case, the user’s education detail will be masked only if the public visibility flag
is false, else this field will not be masked.
Figure 7: Conditional Data Masking Policies SQL Construct
-- DDL for user table
create or replace table user
(
id number,
first_name string,
last_name string,
DoB string,
highest_degree string,
visibility boolean,
city string,
zipcode string
);
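Building on the user table above, a minimal sketch of such a conditional masking policy is shown below. The highest_degree and visibility columns come from the DDL; the policy name mask_degree and the masked-value text are assumptions.
-- conditional masking policy: mask highest_degree only when visibility is false
-- (policy name and masked text are illustrative assumptions)
create or replace masking policy mask_degree as (degree_txt string, visibility_flag boolean) returns string ->
  case
    when visibility_flag then degree_txt
    else '***MASKED***'
  end;

-- apply the policy, passing the conditional column with USING
alter table if exists user modify column highest_degree
  set masking policy mask_degree using (highest_degree, visibility);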
A new masking policy can be created quickly and easily, with no overhead of reloading historical
data.
You can write a policy once and have it apply to thousands of columns across databases and
schemas.
Masking policies are easy to manage and support centralized and decentralized administration
models.
Easily mask data before sharing.
Easily change masking policy content without having to reapply the masking policy to
thousands of columns.
Conclusion
Snowflake’s Dynamic Data Masking is a very powerful feature that allows you to bring all kinds of
sensitive data into your data platform and manage it at scale.
Snowflake’s policy-based approach, along with role-based access control (RBAC), allows you to
prevent sensitive data from being viewed by table/view owners and users with privileged
responsibilities.
If you're looking to take advantage of Snowflake's Dynamic Data Masking feature, the data experts at
phData would love to help make this a reality. Feel free to reach out today for more information.
1. What is Snowflake Dynamic Data Masking?
Snowflake Dynamic Data Masking is a security feature that allows organizations to mask
sensitive data in their database tables, views, and query results in real-time. This is useful for
protecting sensitive information from unauthorized access or exposure.
Snowflake Dynamic Data Masking allows the data to be masked as it is accessed, rather than being
permanently altered in the database.
With Dynamic Data Masking, users can choose which data to mask and how it should be masked,
such as by replacing sensitive information with dummy values or by partially revealing data. This can
be done at the column level, meaning that different columns can be masked differently depending on
the sensitivity of the data they contain.
2. Steps to apply Snowflake Dynamic Data Masking on a column
Follow below steps to perform Dynamic Data Masking in Snowflake.
Step-1: Create a Custom Role with Masking Privileges
Step-2: Assign Masking Role to an existing Role/User
Step-3: Create a Masking Policy
Step-4: Apply the Masking Policy to a Table or View Column
Step-5: Verify the masking rules by querying data
Step-1: Create a Custom Role with Masking Privileges
The below SQL statement creates a custom role MASKINGADMIN in Snowflake.
create role MASKINGADMIN;
The below SQL statement grants privileges to create masking policies to the role MASKINGADMIN.
grant create masking policy on schema MYDB.MYSCHEMA to role MASKINGADMIN;
The below SQL statement grants privileges to apply masking policies to the role MASKINGADMIN.
grant apply masking policy on account to role MASKINGADMIN;
Step-2: Assign Masking Role to an existing Role/User
The MASKINGADMIN role by default will not have access to any database or warehouse. The role
needs to be assigned to another custom role or a user who has privileges to access a database and
warehouse.
The below SQL statement assigns MASKINGADMIN to another custom role named
DATAENGINEER.
grant role MASKINGADMIN to role DATAENGINEER;
This allows all users with the DATAENGINEER role to inherit masking privileges. Instead, if you want
to limit the masking privileges, assign the role to individual users.
The below SQL statement assigns MASKINGADMIN to a User named STEVE.
grant role MASKINGADMIN to user STEVE;
Step-3: Create a Masking Policy
The below SQL statement creates a masking policy STRING_MASK that can be applied to columns
of type string.
create or replace masking policy STRING_MASK as (val string) returns string ->
case
when current_role() in ('DATAENGINEER') then val
else '*********'
end;
This masking policy masks the data in the column it is applied to when queried from any role other than
DATAENGINEER.
Step-4: Apply (Set) the Masking Policy to a Table or View Column
The below SQL statement applies the masking policy STRING_MASK on a column named
LAST_NAME in EMPLOYEE table.
alter table if exists EMPLOYEE modify column LAST_NAME set masking policy STRING_MASK;
Note that prior to dropping a policy, the policy needs to be unset from all the tables and views on
which it is applied.
Step-5: Verify the masking rules by querying data
Verify the data present in EMPLOYEE table by querying from two different roles.
The below image shows data present in EMPLOYEE when queried from DATAENGINEER role.
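In place of the screenshots, the same verification can be done directly in a worksheet. The sketch below uses the EMPLOYEE table, LAST_NAME column, and DATAENGINEER role from the earlier steps; the FIRST_NAME column and the PUBLIC role are assumptions standing in for "some other role".
-- query as DATAENGINEER: LAST_NAME is returned in plain text
use role DATAENGINEER;
select FIRST_NAME, LAST_NAME from EMPLOYEE;

-- query as any other role (PUBLIC here is an assumption): LAST_NAME is returned as '*********'
use role PUBLIC;
select FIRST_NAME, LAST_NAME from EMPLOYEE;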
Row-level access policies/security : This feature defines row access policies to filter
visible rows based on user permissions.
Object tagging : Tags objects to classify and track sensitive data for compliance and
security.
Object tag-based masking policies : This feature enables the protection of column data
by assigning a masking policy to a tag, which can then be set on a database object or the
Snowflake account.
Data classification : This feature allows users to automatically identify and classify
columns in their tables containing personal or sensitive data.
Object dependencies : This feature allows users to identify dependencies among
Snowflake objects.
Access History : This feature provides a record of all user activity related to data access
and modification within a Snowflake account. Essentially, it tracks user queries that read
column data and SQL statements that write data. The Access History feature is
particularly useful for regulatory compliance auditing and also provides insights into
frequently accessed tables and columns.
Snowflake Dynamic Data Masking: Snowflake Dynamic Data Masking is a feature that
enables organizations to hide sensitive data by masking it with other characters. It allows
users to create Snowflake masking policies to conceal data in specific columns of tables
or views. Dynamic Data Masking is applied in real-time, ensuring that unauthorized
users or roles only see masked data.
External Tokenization: Before we delve into External Tokenization, let's first
understand what Tokenization is. Tokenization is a process that replaces sensitive data
with ciphertext, rendering it unreadable. It involves encoding and decoding sensitive
information, such as names, into ciphertext. On the other hand, External Tokenization
enables the masking of sensitive data before it is loaded into Snowflake, which is
achieved by utilizing an external function to tokenize the data and subsequently loading
the tokenized data into Snowflake.
While both Snowflake Dynamic Data Masking and External Tokenization are column-level
security features in Snowflake, Dynamic Data Masking is more commonly used as it allows
users to easily implement data masking without the need for external functions. External
Tokenization, on the other hand, involves a more complex setup and is typically not widely
implemented in organizations.
Redaction: Replaces data with a fixed set of characters, like XXX, ***, &&&.
Random data: Replaces with random fake data based on column data type.
Shuffling: Scrambles the data while preserving format.
Encryption: Encrypts the data, allowing decryption for authorized users.
When a user queries a table or view protected by a Snowflake dynamic data masking policy, the
masking rules are applied before the results are returned, ensuring users only see the masked
version of sensitive data, even if their permissions allow viewing the actual data.
Snowflake dynamic data masking is a powerful tool for protecting sensitive data. It is easy to
use, scalable, and can be applied to any number of tables or views. Snowflake Dynamic Data
Masking can help organizations to comply with data privacy regulations, such as the General
Data Protection Regulation (GDPR) , HIPAA, SOC, and PCI DSS.
Risk Mitigation: The main purpose of Snowflake Dynamic Data Masking is to reduce
the risk of unauthorized access to sensitive data. By masking sensitive columns in
query results, Snowflake Dynamic Data Masking prevents potential leaks of data to
unauthorized users.
Confidentiality: Snowflake may contain financial data, employee data, intellectual
property or other information that should remain confidential. Snowflake Dynamic Data
Masking ensures this sensitive data is not exposed in query results to unauthorized users.
Regulatory Compliance: Regulations like GDPR, HIPAA, SOC, and PCI DSS require
strong safeguards for sensitive and personally identifiable information. Snowflake
Dynamic Data Masking helps meet compliance requirements by protecting confidential
data from bad actors.
Snowflake Governance Initiatives: Snowflake Data governance and security teams
typically drive initiatives to implement controls like Snowflake Dynamic Data Masking
to better manage and protect sensitive Snowflake data access.
Privacy and Legal Requirements: Privacy regulations and legal obligations may
require Snowflake to mask sensitive data from unauthorized parties. Dynamic Data
Masking provides the technical controls to enforce privacy requirements for data access.
Granting masking policy privileges to roles - Snowflake masking policies
Creating Snowflake masking policy to mask strings - Snowflake masking policies
This masking policy masks the data in the column it is applied to when queried from any role other
than school_principal.
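The policy being referred to is not shown in the text; a minimal sketch consistent with this description would look like the following. The name data_masking is taken from the ALTER statement below, and the masked value is an assumption.
-- sketch: mask the column for every role except SCHOOL_PRINCIPAL
create or replace masking policy data_masking as (val string) returns string ->
  case
    when current_role() in ('SCHOOL_PRINCIPAL') then val
    else '*********'
  end;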
In the ALTER TABLE statement, replace:
[TABLE_NAME] with the name of the table or view where the column is located,
[COLUMN_NAME] with the name of the column to be masked, and
[POLICY_NAME] with the name of the masking policy created in the previous step.
Here is an example:
ALTER TABLE IF EXISTS student_records
MODIFY COLUMN email
SET masking policy data_masking;
Applying masking policy to
Snowflake table column - Snowflake masking policies - Snowflake Dynamic Data Masking
Creating Partial Data Masking Policy in Snowflake - Snowflake column level security
This particular masking policy masks the email address by replacing everything after the first
period with asterisks (*), while leaving the email domain unmasked. Users with
the SCHOOL_PRINCIPAL role will be able to see the full email address, while users with
other roles will only see the first part of the email address, followed by asterisks.
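A minimal sketch of such a partial masking policy is shown below. The policy name and the regular expression are assumptions chosen so that the text before the first period and the domain stay visible, matching the description above.
-- sketch: partial mask of the local part of an email address
create or replace masking policy email_partial_mask as (val string) returns string ->
  case
    when current_role() in ('SCHOOL_PRINCIPAL') then val
    else regexp_replace(val, '\\..*@', '.*****@')  -- e.g. 'jane.doe@school.edu' -> 'jane.*****@school.edu'
  end;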
Applying partial masking policy to email column in Snowflake - Snowflake Dynamic Data Masking
This statement applies the masking policy to the email column. Once you have applied the
masking policy, users with the SCHOOL_PRINCIPAL role will be able to see the full email
address for all students in the student_records table. Note that users with other roles will only be
able to see the first part of the email address, followed by asterisks.
Applying conditional masking policy to email column based on student_id in Snowflake - Snowflake
column level security - Snowflake Dynamic Data Masking
This statement applies the masking policy to the email column, considering the values in the
email and student_id columns.
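The shape of that conditional policy can be sketched as follows. The email column being masked, the student_id conditional column, and the student_records table come from the surrounding text; the policy name and the specific student_id condition are illustrative assumptions.
-- sketch of a conditional masking policy over email and student_id
-- (policy name and the student_id condition are illustrative assumptions)
create or replace masking policy email_conditional_mask as (email string, student_id number) returns string ->
  case
    when current_role() in ('SCHOOL_PRINCIPAL') then email
    when student_id < 1000 then email   -- hypothetical condition driven by the second column
    else '*********'
  end;

-- apply it, passing both columns: the first is masked, the second drives the condition
alter table if exists student_records modify column email
  set masking policy email_conditional_mask using (email, student_id);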
Here are some additional points to remember while working with Snowflake dynamic data
masking:
Snowflake dynamic data masking policies obfuscate data at query runtime; the original data
is unchanged.
Snowflake dynamic data masking prevents unauthorized users from seeing real data.
Take a backup of the data before applying masking.
Masking applies only when reading data, not to DML.
Snowflake dynamic data masking policy names must be unique within a database
schema.
Masking policies are inherited by cloned objects, ensuring consistent data protection
across replicated data.
Masking policies cannot be directly applied to virtual columns in Snowflake. To apply a
dynamic data masking policy to a virtual column, you can create a view on the virtual
columns and then apply the policy to the corresponding view columns.
Snowflake records the original query executed by the user on the History page of the
web interface. The query details can be found in the SQL Text column, providing
visibility into the original query even with data masking applied.
Masking policy names used in a specific query can be found in the Query Profile, which
helps in tracking the applied policies for auditing and debugging purposes.
Conclusion
Ultimately, data security is a critical concern for organizations, and Snowflake's Dynamic Data
Masking feature offers a powerful solution to protect sensitive Snowflake data. Snowflake's
Dynamic Data Masking is an extremely powerful tool that empowers organizations to bring
sensitive data into Snowflake platforms while effectively managing it at scale. Snowflake
dynamic data masking combines policy-based approaches and role-based access control (RBAC)
and makes sure that only authorized individuals can access sensitive data, protecting it from
prying eyes and mitigating the risk of data breaches. Throughout this article, we explored the
concept, benefits, and implementation of Dynamic Data Masking, covering step-by-step
instructions for building and applying masking policies. We also delved into advanced
techniques like partial and conditional data masking, discussed policy management, and
highlighted the limitations as well as its benefits.
Just as a skilled locksmith carefully safeguards valuable treasures in a secure vault, Snowflake's
Dynamic Data Masking feature acts as a trustworthy guardian for organizations' sensitive data.
FAQs
What is Snowflake Dynamic Data Masking?
Snowflake Dynamic Data Masking is a security feature in Snowflake that allows the masking of
sensitive data in query results.
How does Dynamic Data Masking work in Snowflake?
It works by applying masking policies to specific columns in tables and views, which replace the
actual data with masked data in query results.
Can I apply Dynamic Data Masking to any column in Snowflake?
Yes, you can apply it to any table or view column that contains sensitive data. It cannot be
applied directly to virtual columns.
Is the original data altered when using Dynamic Data Masking?
No, the original data in the micro-partitions is unchanged. Only the query results are masked.
Who can define masking policies in Snowflake?
Only users with the necessary privileges, such
as ACCOUNTADMIN or SECURITYADMIN roles, can define masking policies.
Can I use Dynamic Data Masking with third-party tools?
Yes, as long as the tool can connect to Snowflake and execute SQL queries.
How can I test my Snowflake masking policies?
You can test them by running SELECT queries and checking if the returned data is masked as
expected.
Can I use Dynamic Data Masking to mask data in real-time?
Yes, the data is masked in real-time during query execution.
Can I use different Snowflake masking policies for different users?
Yes, you can define different masking policies and grant access to them based on roles in
Snowflake.
What types of data can I mask with Dynamic Data Masking?
You can mask any type of data, including numerical, string, and date/time data.
What happens if I drop a masking policy?
Only future queries will show unmasked data. Historical query results from before the policy
was dropped remain masked.
Can I use Dynamic Data Masking with Snowflake's Materialized Views feature?
Yes, masking will be applied at query time on the materialized view, not during its creation.
What is Row-Level Security?
Row-Level Security is a security mechanism that limits the records returned from a database
table based on the permissions provided to the currently logged-in user. Typically, this is done
such that certain users can access only their data and are not permitted to view the data of other
users.
In our previous article we have discussed how to implement Row-Level Security using Secure Views.
In this article let us understand how to set up Row-Level Security on a database table using Row
Access Policies in Snowflake.
2. What are Row Access Policies in Snowflake?
A Row Access Policy is a schema-level object that determines whether a given row in a table or
view can be viewed by a user using the following types of statements.
1. SELECT statements
2. Rows selected by UPDATE, DELETE, and MERGE statements.
A row access policy is added to a table or a view by binding it to one or more columns present in
that object. A row access policy can be added to a table or view either when the object is created or after
the object is created.
3. Steps to implement Row-Level Security using Row Access Policies in Snowflake
Follow below steps to implement Row-Level Security using Row Access Policies in Snowflake.
1. Create a table to apply Row-Level Security
2. Create a Role Mapping table
3. Create a Row Access Policy
4. Add the Row Access Policy to a table
5. Create Custom Roles and their Role Hierarchy
6. Grant SELECT privilege on table to custom roles
7. Grant USAGE privilege on virtual warehouse to custom roles
8. Assign Custom Roles to Users
9. Query and verify Row-Level Security on table using custom roles
10. Revoke privileges on role mapping table to custom roles
3.1. Create a table to apply Row-Level Security
Let us consider a sample employees table as an example for the demonstration of row-level security
using row access policies.
The below SQL statements creates a table named employees with required sample data in hr schema
of analytics database.
use role SYSADMIN;
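The CREATE and INSERT statements are not reproduced in the text; a minimal sketch is shown below. The analytics database and hr schema come from the description above, while the column names and the sample rows are hypothetical (a country column is implied by the role mapping in the next step).
create or replace table analytics.hr.employees (
  employee_id   number,
  employee_name varchar(50),
  country       varchar(50),
  salary        number
);

-- hypothetical sample rows
insert into analytics.hr.employees values
  (1, 'ALEX',  'US', 90000),
  (2, 'MARIA', 'UK', 85000),
  (3, 'RAVI',  'IN', 60000);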
Employees table
3.2. Create a Role Mapping table
The below SQL statements creates mapping table named role_mapping which stores the country and
corresponding role to be assigned for the users of that country as shown below.
use role SYSADMIN;
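The role_mapping statements are likewise not reproduced; a minimal sketch consistent with the description (country and its corresponding role) is below. The role names and sample rows are hypothetical.
create or replace table analytics.hr.role_mapping (
  country   varchar(50),
  role_name varchar(50)
);

-- hypothetical country-to-role mapping rows
insert into analytics.hr.role_mapping values
  ('US', 'US_HR_ROLE'),
  ('UK', 'UK_HR_ROLE'),
  ('IN', 'IN_HR_ROLE');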
EndPoint URL:
https://<account>.snowflakecomputing.com/v1/data/pipes/<pipeName>/insertFiles?
requestId=<requestId>
Headers:
Content-Type: application/json
Accept: application/json
Authorization: Bearer <jwt_token>
Body:
{
"files":[
{
"path":"path/file1.csv",
"size":100
},
{
"path":"path/file2.csv",
"size":100
}
]
}
In the Endpoint URL of the above request:
account: The Account Identifier of your Snowflake account which can be obtained
from your login URL.
pipeName: Fully qualified Snowpipe Name. ex: my_db.my_schema.my_pipe.
requestId: A random string used to track requests. The same can be passed
to the insertReport endpoint to find the load status of files processed in a particular
request.
Below is the EndPoint URL of the sample request made for the demonstration.
https://eubchbl-al20253.snowflakecomputing.com/v1/data/pipes/
DEMO_DB.PUBLIC.MY_REST_SNOWPIPE/insertFiles?requestId=S0MeRaNd0MvA1ue01
In the Headers section of the above request:
<jwt_token> is the token generated in step-2.
The below image shows the HTTP Method, EndPoint URL, Headers configured in the API request
to submit a list of files for ingestion in Postman.
EndPoint URL:
https://<account>.snowflakecomputing.com/v1/data/pipes/<pipeName>/insertReport?
requestId=<requestId>&beginMark=<beginMark>
Headers:
Content-Type: application/json
Accept: application/json
Authorization: Bearer <jwt_token>
In the Endpoint URL of the above request:
account: The Account Identifier of your Snowflake account which can be obtained
from your login URL.
pipeName: Fully qualified Snowpipe Name. ex: my_db.my_schema.my_pipe.
requestId: A random string used to track requests submitted in the REST request to
ingest files.
beginMark: Marker, returned by a previous call to insertReport, that can be used to
reduce the number of repeated events seen when repeatedly calling insertReport.
Below is the response of the request used for demonstration which shows the load status of the files
processed.
{
"pipe": "DEMO_DB.PUBLIC.MY_REST_SNOWPIPE",
"completeResult": true,
"nextBeginMark": "1_1",
"files": [
{
"path": "Inbox/s3_emp_2.csv",
"stageLocation": "s3://te-aws-s3-bucket001/",
"fileSize": 18777931,
"timeReceived": "2023-05-21T15:19:04.353Z",
"lastInsertTime": "2023-05-21T15:19:35.356Z",
"rowsInserted": 199923,
"rowsParsed": 199923,
"errorsSeen": 0,
"errorLimit": 1,
"complete": true,
"status": "LOADED"
},
{
"path": "Inbox/s3_emp_1.csv",
"stageLocation": "s3://te-aws-s3-bucket001/",
"fileSize": 18777914,
"timeReceived": "2023-05-21T15:19:04.353Z",
"lastInsertTime": "2023-05-21T15:19:35.356Z",
"rowsInserted": 200195,
"rowsParsed": 200195,
"errorsSeen": 0,
"errorLimit": 1,
"complete": true,
"status": "LOADED"
}
],
"statistics": {
"activeFilesCount": 0
}
}
The load status can also be verified from Snowflake using the COPY_HISTORY table function as
shown below.
SELECT * FROM
TABLE(INFORMATION_SCHEMA.COPY_HISTORY(TABLE_NAME=>'EMPLOYEES',
START_TIME=> DATEADD(MINUTES, -10, CURRENT_TIMESTAMP())));
COPY_HISTORY output
4. Snowpipe REST API Response Codes
Below are the expected response codes of the Snowpipe REST API requests.
Response Code | Description
404 | Failure. pipeName not recognized. This error code can also be returned if the role used when calling the endpoint does not have sufficient privileges.
5. Closing Points
In this article we have used Postman REST Client to use Snowpipe REST API. However in your
application, you can choose to use either Java or Python SDKs which are provided by Snowflake.
The difference is the Snowflake provided SDKs automatically handle the creation and management of
JWT tokens required for authentication.
For more details, refer Snowflake Documentation.
Execute multiple SQL statements in a single Snowflake API request
April 30, 2023
Contents
Introduction
Submitting multiple statements in a single Snowflake SQL REST API request
Extracting Results of each SQL Statement in the API request
Introduction
In our previous article we discussed an overview of the Snowflake SQL REST API and how to
submit a SQL API request to execute a SQL statement. That method allows submitting only one SQL
statement for execution.
In this article let us understand how to submit a request containing multiple statements to the
Snowflake SQL API.
Submitting multiple statements in a single Snowflake SQL REST API request
The process to submit multiple statements in a single request is similar to submitting a single
statement in a request, except that in the body part of the request:
In the statement field, enter multiple statements separated using semicolon (;)
In the parameters field, set the MULTI_STATEMENT_COUNT field to the
number of SQL statements in the request.
The Snowflake SQL REST API request for executing multiple SQL statements in a single request is
as follows.
HTTP Method: POST
EndPoint URL:
https://<account_identifier>.snowflakecomputing.com/api/v2/statements
Headers:
Authorization: Bearer <jwt_token>
Content-Type: application/json
Accept: application/json
X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT
Body:
{
  "statement": "select * from table1;select * from table2;",
  "timeout": 60,
  "database": "<your_database>",
  "schema": "<your_schema>",
  "warehouse": "<your_warehouse>",
  "role": "<your_role>",
  "parameters": {
    "MULTI_STATEMENT_COUNT": "<statements_count>"
  }
}
To learn more about how to generate a JWT token, refer our previous article.
For example, below is the body part of the request submitting two SQL statements for execution.
{
"statement": "select * from employees where employee_id=101;select * from employees where
employee_id=102;",
"timeout": 60,
"database": "DEMO_DB",
"schema": "PUBLIC",
"warehouse": "COMPUTE_WH",
"role": "ACCOUNTADMIN",
"parameters": {
"MULTI_STATEMENT_COUNT": "2"
}
}
In the above example
MULTI_STATEMENT_COUNT is set to 2 which corresponds to the number of
SQL statements being submitted.
To submit a variable number of SQL statements in the statement field,
set MULTI_STATEMENT_COUNT to 0. This is useful in an application where the
number of SQL statements submitted is not known at runtime.
If the value of MULTI_STATEMENT_COUNT does not match the number of SQL
statements specified in the statement field, the SQL API returns an error.
The below image shows the HTTP Method, EndPoint URL, Headers configured in the API request
to execute a SQL statement in Postman.
API request showing the HTTP Method, EndPoint URL, Headers in Postman
The below image shows the Body part of API request configured to execute a SQL statement in
Postman.
EndPoint URL:
https://<account_identifier>.snowflakecomputing.com/api/v2/statements/<statementHandle>
Headers:
Authorization: Bearer <jwt_token>
Content-Type: application/json
Accept: application/json
X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT
Introduction
Snowflake provides multiple ways to manage data efficiently in its database. Snowflake SQL REST
API is one such feature which allows users to interact with Snowflake through HTTP requests,
making it easy to integrate with other systems.
Before we jump into understanding the capabilities of Snowflake SQL REST API and how to access
it, let us quickly understand what an API is.
What is REST API?
API stands for Application Programming Interface, a software intermediary provided by an
application that allows two applications to talk to each other.
A real-world example of APIs is the weather app on your mobile. These apps do not have a
weather forecasting system of their own. Instead, they provide weather information by accessing the API
of a third-party weather provider. Apple, for instance, uses The Weather Channel's API.
REST stands for REpresentational State Transfer, which is an architectural style. REST defines a set
of principles and standards with which APIs can be built, and it is the most widely accepted
architectural style for building APIs.
A REST API request is generally made up of four parts: HTTP Method, Endpoint URL, Headers,
and Body.
We will discuss more about making a Snowflake SQL REST API request in the subsequent
sections of the article.
Snowflake SQL REST API capabilities
The operations that can be performed using Snowflake SQL REST API are
Submit SQL statements for execution.
Check the status of the execution of a statement.
Cancel the execution of a statement.
Fetch query results concurrently.
This API can be used to execute standard queries and most DDL and DML statements.
Snowflake SQL REST API Endpoints
The Snowflake SQL REST API can be accessed using the following URL.
https://<account_identifier>.snowflakecomputing.com/api
The <account_identifier> can be obtained easily from your login URL.
The API consists of the /api/v2/statements/ resource and provides the following endpoints:
The following endpoint is used to submit a SQL statement for execution.
/api/v2/statements
The following endpoint is used to check the status of the execution of a statement.
/api/v2/statements/<statementHandle>
The following endpoint is used to cancel the execution of a statement.
/api/v2/statements/<statementHandle>/cancel
In the steps to come, we shall learn how to access all these endpoints using Postman.
Steps to access Snowflake SQL REST API using Postman
Postman is a powerful tool for testing APIs, and it allows us to easily make HTTP requests and view
responses. You can either download and install desktop application of Postman or use its web
version from any device by creating an account.
Follow the below steps to access the Snowflake SQL REST API using Postman.
Authentication
Every API request we make must also include the authentication information. There are two options
for providing authentication: OAuth and JWT key pair authentication. In this article we will use JWT
key pair authentication for demonstration purpose.
Follow below steps to use JWT Key Pair Authentication.
1. Configure Key-Pair Authentication by performing below actions.
Generate Public-Private Key pair using OpenSSL.
Assign the generated public key to your Snowflake user.
The generated private key should be stored in a file and available locally on machine
where JWT is generated.
Refer our previous article for more details on configuring Key Pair authentication.
2. Once Key Pair Authentication for your Snowflake account is set up, a JWT token should be generated.
This JWT is a time-limited token signed with your private key, and Snowflake uses it to verify that you
authorized this token to be used to authenticate as you for the SQL API.
Below is the command to generate JWT token using SnowSQL.
snowsql --generate-jwt -a <account_identifier> -u <username> --private-key-path
<path>/rsa_key.pem
The below image shows generating JWT using SnowSQL command line tool.
EndPoint URL:
https://<account_identifier>.snowflakecomputing.com/api/v2/statements
Headers:
Authorization: Bearer <jwt_token>
Content-Type: application/json
Accept: application/json
X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT
Body:
{
"statement": "select * from table",
"timeout": 60,
"database": "<your_database>",
"schema": "<your_schema>",
"warehouse": "<your_warehouse>",
"role": "<your_role>"
}
In the body part of the above request
The statement field specifies the SQL statement to execute.
The timeout field specifies that the server allows 60 seconds for the statement to be
executed.
The other fields are self-explanatory.
The below image shows the HTTP Method, EndPoint URL, Headers configured in the API request
to execute a SQL statement in Postman.
The below image shows the Body part of API request configured to execute a SQL statement in
Postman.
EndPoint URL:
https://<account_identifier>.snowflakecomputing.com/api/v2/statements/<statementHandle>
Headers:
Authorization: Bearer <jwt_token>
Content-Type: application/json
Accept: application/json
X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT
If the statement has finished executing successfully, Snowflake returns the HTTP response code 200
and the results in a ResultSet object. However, if an error occurred when executing the statement,
Snowflake returns the HTTP response code 422 with a QueryFailureStatus object.
Cancelling the Execution of a SQL statement
To cancel the execution of a statement, send a POST request to the /api/v2/statements/ endpoint and
append the statementHandle to the end of the URL path followed by cancel as a path parameter.
The Snowflake SQL REST API request to cancel the execution of a SQL statement is as follows.
HTTP Method: POST
EndPoint URL:
https://<account_identifier>.snowflakecomputing.com/api/v2/statements/<statementHandle>/cancel
Headers:
Authorization: Bearer <jwt_token>
Content-Type: application/json
Accept: application/json
X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT
HOW TO: Generate JWT Token for Snowflake Key Pair Authentication?
April 9, 2023
Contents
Introduction
What is JWT Token?
Pre-requisites for generating JWT Token for Snowflake Authentication
Generating JWT Token using Snowflake SnowSQL
Generating JWT Token using Python
Introduction
In our previous article, we discussed how to set up Key Pair Authentication in
Snowflake. But if you want to connect to Snowflake via the Snowflake SQL REST API using key pair
authentication, it expects a valid JWT (JSON Web Token).
In this article let us understand what a JWT token is, how to generate it, and the pre-requisites for generating it.
What is JWT Token?
JWT, or JSON Web Token, is an open industry standard for securely transmitting information between
parties as a JSON object, most commonly used to identify an authenticated user.
Once the user is logged in, each subsequent request will also include the JWT. JWT tokens are
usually valid only for a certain period (about 60 minutes) and need to be regenerated after the
token expires.
Pre-requisites for generating JWT Token for Snowflake Authentication
Below are the pre-requisites for generating JWT Token for Snowflake Key Pair Authentication.
1. Generate Public-Private Key pair using OpenSSL.
2. Assign the public key to your Snowflake user.
3. The generated private key should be stored in a file and available locally on machine
where JWT is generated.
Refer our previous article for more details on configuring Key Pair authentication.
Generating JWT Token using Snowflake SnowSQL
SnowSQL is a command line tool for connecting to Snowflake. Snowflake SnowSQL lets you execute
all SQL queries and perform DDL and DML operations, including loading data into and unloading
data out of database tables.
Refer our previous article to learn more about Snowflake SnowSQL.
SnowSQL has a parameter, --generate-jwt, which generates the JWT Token when used in
conjunction with the following parameters.
-a <account_identifier> : It is the unique name assigned to your account. It can be
extracted from the URL to login to Snowflake account as shown below.
<account_identifier>.snowflakecomputing.com
-u <username> : It is the user name with which you connect to the specified account.
--private-key-path <path>: It is the location where the generated private key file is
placed.
Below is the command to generate JWT token in SnowSQL.
snowsql --generate-jwt -a <account_identifier> -u <username> --private-key-path
<path>/rsa_key.pem
The below image shows generating JWT using SnowSQL command line tool.
Testing OpenSSL installation
2.4. Generate Private Key
To generate a private key, open a command prompt window and navigate to path where keys needs to
be stored. You can generate either an encrypted version of the private key or an unencrypted version
of the private key.
To generate an unencrypted version, use the following command:
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.pem -nocrypt
To generate an encrypted version, use the following command (which omits “-nocrypt”):
openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out rsa_key.pem
You will have to enter a passphrase as Encryption Password using this method which would be
required during authentication.
These commands generate a private key in PEM format as shown below.
-----BEGIN PRIVATE KEY-----
MIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQC0ElLYu+UZjgft
6th1HDppkJg1pbEzCiUw6+czuiDgzfnbvEG8Ah/y1Ir2f27AmCUVvfFIXiEfGFIY
...
d+7T5RSG+bQyylGPpfpdig==
-----END PRIVATE KEY-----
2.5. Generate Public Key
The Public key is generated by referencing the Private Key.
The following command generates the public key using the private key contained in rsa_key.pem
openssl rsa -in rsa_key.pem -pubout -out rsa_key.pub
This command generates a public key in PEM format as shown below.
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAtBJS2LvlGY4H7erYdRw6
aZCYNaWxMwolMOvnM7og4M3527xBvAIf8tSK9n9uwJglFb3xSF4hHxhSGH7sy1n2
...
qIQIDAQAB
-----END PUBLIC KEY-----
3. Configuring Key Pair Authentication in Snowflake
Follow below steps to configure Key Pair Authentication for all supported Snowflake clients.
3.1. Generate and store the Public and Private keys securely
Generate the public and private keys using OpenSSL as explained in the previous section. If the keys
are generated in a different location, move them to your local directory where the snowflake client
runs.
The key files should be protected from unauthorized access, and it is the user's responsibility to secure
the keys.
3.2. Assign the Public Key to a Snowflake User
The public key should be assigned to the user using ALTER USER statement as shown below.
ALTER USER SFUSER08 SET RSA_PUBLIC_KEY = 'MIIBIjANB…';
Only users with SECURITYADMIN role and above can alter the user.
3.3. Verify the assigned Public Key of a User
Verify that the public key is successfully assigned to the user using the DESCRIBE USER statement as
shown below.
DESCRIBE USER SFUSER08;
Verifying the Public Key and Public Key Finger print assigned to the user
The above output of the DESCRIBE USER shows the assigned public key and the public key finger
print generated.
With this step the configuration of the Key Pair Authentication is completed.
4. Connect Snowflake using Key-Pair Authentication
Below are the supported Snowflake Clients with Key Pair Authentication.
SnowSQL (CLI Client)
Snowflake Connector for Python
Snowflake Connector for Spark
Snowflake Connector for Kafka
Go driver
JDBC Driver
ODBC Driver
Node.js Driver
.NET Driver
PHP PDO Driver for Snowflake
Let us use SnowSQL to verify whether the generated private key can be used to connect to
Snowflake.
In order to connect SnowSQL through key pair authentication, the private key must be available on a
local directory of the machine where SnowSQL is installed.
To know more about how to download and install SnowSQL, refer our previous article.
Run the below command to connect to SnowSQL using Private key.
snowsql -a <account_identifier> -u <user> --private-key-path <path>/rsa_key.pem
The <account_identifier> can be extracted from the URL to login to Snowflake account.
<account_identifier>.snowflakecomputing.com
The below image shows that we were able to successfully connect to Snowflake SnowSQL using the
private key generated.
Sample output of the UDF example (daily sale amount by date, followed by the overall total):
2023-01-01 2876.93, 2023-01-02 3509.75, 2023-01-03 2971.66, 2023-01-04 3328.32; total 12686.66
Although the body of a UDF can contain a complete SELECT statement, it cannot contain DDL
statements or any DML statement other than SELECT.
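As an illustration of that rule, the following is a minimal sketch of a scalar SQL UDF whose body is a single SELECT; the daily_sales table and its column names are illustrative assumptions.
-- sketch: scalar SQL UDF whose body is a SELECT (table and columns assumed)
CREATE OR REPLACE FUNCTION get_total_sales(d DATE)
RETURNS NUMBER(38,4)
AS
$$
  SELECT SUM(sale_amount) FROM daily_sales WHERE sale_date = d
$$;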
8. Table Function with Examples
Table Functions, or Tabular SQL UDFs (UDTFs), return a set of rows consisting of 0, 1, or
more rows, each of which has 1 or more columns.
While creating UDTFs using the CREATE FUNCTION command, the <result_data_type> should
be TABLE(...). Inside the parentheses, specify the output column names along with their expected data
types.
Consider the below tables sales_by_country and currency as examples for demonstration purposes.
CREATE OR REPLACE TABLE sales_by_country(
year NUMBER(4),
country VARCHAR(50),
sale_amount NUMBER
);
Sample data in sales_by_country:
YEAR | COUNTRY | SALE_AMOUNT
2022 | US | 90000
2023 | US | 100000
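As an illustration of the TABLE(...) construct, the following is a minimal sketch of a UDTF over the sales_by_country table defined above; the function name and the country filter are assumptions.
-- sketch: UDTF returning the yearly sales for a given country
CREATE OR REPLACE FUNCTION sales_for_country(country_name VARCHAR)
RETURNS TABLE (year NUMBER(4), sale_amount NUMBER)
AS
$$
  SELECT year, sale_amount
  FROM sales_by_country
  WHERE country = country_name
$$;

-- usage: call the table function in the FROM clause
SELECT * FROM TABLE(sales_for_country('US'));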
CALL MyStoredProcedure(argument_1);
Caller's rights stored procedure | Owner's rights stored procedure
Runs with the privileges of the caller. | Runs with the privileges of the owner.
Inherits the current warehouse of the caller. | Inherits the current warehouse of the caller.
Uses the database and schema that the caller is currently using. | Uses the database and schema that the stored procedure is created in, not the database and schema that the caller is currently using.
5. Demonstration of Caller’s and Owner’s Rights
Let us understand how Caller’s and Owner’s Rights work with an example
using ACCOUNTADMIN and SYSADMIN roles.
Using ACCOUNTADMIN role, let us create a table named Organization for demonstration.
USE ROLE ACCOUNTADMIN;
CREATE TABLE organization(id NUMBER, org_name VARCHAR(50));
When the table is queried using the SYSADMIN role, it throws an error as shown below, since no grants
on this table are provided to SYSADMIN.
USE ROLE SYSADMIN;
SELECT * FROM organization;
Let us create a stored procedure with Caller’s rights using ACCOUNTADMIN role to delete data
from Organization table.
USE ROLE ACCOUNTADMIN;
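The procedure body is not reproduced in the text; a minimal sketch consistent with the surrounding steps is shown below. The name sp_demo_callers_rights and the caller's-rights setting come from the text that follows, while the exact body and return message are assumptions.
-- sketch of the caller's rights procedure that deletes data from the organization table
CREATE OR REPLACE PROCEDURE sp_demo_callers_rights()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS CALLER
AS
$$
BEGIN
  DELETE FROM organization;
  RETURN 'Rows deleted from organization table';
END;
$$
;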
Assign the grants to execute the stored procedure to the SYSADMIN role.
USE ROLE ACCOUNTADMIN;
GRANT USAGE ON PROCEDURE DEMO_DB.PUBLIC.sp_demo_callers_rights() TO ROLE
SYSADMIN;
The output of the caller’s rights stored procedure with SYSADMIN role is as below.
USE ROLE SYSADMIN;
CALL sp_demo_callers_rights();
Since the SYSADMIN role does not have any privileges on the Organization table, the execution of the
procedure with caller's rights also fails.
The owner of the stored procedure can change the procedure from an owner’s rights stored procedure
to a caller’s rights stored procedure (or vice-versa) by executing an ALTER
PROCEDURE command as shown below.
ALTER PROCEDURE sp_demo_callers_rights() EXECUTE AS OWNER;
The output of the owner’s rights stored procedure with SYSADMIN role is as below.
USE ROLE SYSADMIN;
CALL sp_demo_callers_rights();
Though the SYSADMIN role does not have privileges on the Organization table, the execution of the
procedure, which deletes data from the Organization table, succeeds because the procedure
executes with Owner's rights.
Check out other articles related to Snowflake Stored Procedures
Snowflake Stored Procedures
Chapter-1: Create Procedure
Chapter-2: Variables
Chapter-3: EXECUTE IMMEDIATE
Chapter-4: IF-ELSE, CASE Branching Constructs
Chapter-5: Looping in Stored Procedures
Chapter-6: Cursors
Chapter-7: RESULTSET
Chapter-8: Exceptions
Chapter-9: Caller’s and Owner’s Rights in Snowflake Stored Procedures
What are Stored Procedures?
Stored procedures allow you to write procedural code that executes business logic by combining
multiple SQL statements. In a stored procedure, you can use programmatic constructs to perform
branching and looping.
A stored procedure is created with a CREATE PROCEDURE command and is executed with
a CALL command.
Snowflake supports writing stored procedures in multiple languages. In this article we will discuss
writing stored procedures using Snowflake SQL Scripting.
2. Stored Procedure Syntax in Snowflake
The following is the basic syntax for creating Stored Procedures in Snowflake.
CREATE OR REPLACE PROCEDURE <name> ( [ <arg_name> <arg_data_type> ] [ , ... ] )
RETURNS <result_data_type>
LANGUAGE SQL
AS
$$
<procedure_body>
$$
;
Note that you must use string literal delimiters (' or $$) around the procedure definition (body) if you are
creating a Snowflake Scripting procedure in the Classic Web Interface or SnowSQL. The string literal
delimiters (' or $$) are not mandatory when writing procedures in Snowsight.
Let us understand the various parameters in the stored procedure construct.
2.1. NAME <name>
Specifies the name of the stored procedure.
The name must start with an alphabetic character and cannot contain spaces or special characters
unless the entire identifier string is enclosed in double quotes (e.g. “My Procedure”). Identifiers
enclosed in double quotes are also case-sensitive.
2.2. INPUT PARAMETERS ( [ <arg_name> <arg_data_type> ] [ , … ] )
A stored procedure can be built that takes one or more arguments as input parameters, or even
without any input parameters.
The <arg_name> specifies the name of the input argument.
The <arg_data_type> specifies the SQL data type of the input argument.
-- Stored Procedure with multiple input arguments
CREATE OR REPLACE PROCEDURE my_proc( id NUMBER, name VARCHAR)
-- Example: declaring a variable in the DECLARE section
net_sales NUMBER(38,2);
-- Example: declaring and assigning a variable with LET in the procedure body
LET net_sales := 98.67;
-- Example: returning a variable
RETURN gross_sales;
END;
4. Using a Variable in a SQL Statement (Binding)
The variables declared in the stored procedure can be used in the SQL statements using colon as
prefix to the variable name. For example:
DELETE FROM EMPLOYEES WHERE ID = :in_employeeid;
If you are using the variable as the name of an object, use the IDENTIFIER keyword to indicate that
the variable represents an object identifier. For example:
DELETE FROM IDENTIFIER(:in_tablename) WHERE ID = :in_employeeid;
If you are building a SQL statement as a string to execute, the variable does not need the colon prefix.
For example:
LET sql_stmt := 'DELETE FROM EMPLOYEES WHERE ID = ' || in_employeeid;
Note that if you are using the variable with RETURN, you do not need the colon prefix. For example:
RETURN my_variable;
5. Assigning result of a SQL statement to Variables using INTO clause in Snowflake Stored
Procedures
You can assign expression result of a SELECT statement to Variables in Snowflake Stored
Procedures using INTO clause.
The syntax to assign result of a SQL statement to variables is as below.
SELECT <expression1>, <expression2>, ... INTO :<variable1>, :<variable2>, ... FROM ...
WHERE ...;
In the syntax:
The value of <expression1> is assigned to <variable1>.
The value of <expression2> is assigned to <variable2>.
Note that the SELECT statement used to assign values to variables must return only a single output
row.
Consider below data as an example to understand how it works.
CREATE OR REPLACE TABLE employees (id INTEGER, firstname VARCHAR);
Sample data in the employees table: id = 101, firstname = 'TONY'.
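Putting the syntax together with the employees table above, a minimal sketch of a procedure using the INTO clause could look like this; the procedure name, variable names, and return message are assumptions.
-- sketch: assigning SELECT results to variables with INTO
CREATE OR REPLACE PROCEDURE sp_demo_into()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
  id_variable INTEGER;
  name_variable VARCHAR;
BEGIN
  SELECT id, firstname INTO :id_variable, :name_variable FROM employees WHERE id = 101;
  RETURN 'Id: ' || id_variable || ', Name: ' || name_variable;
END;
$$
;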
6. Variable Scope in Snowflake Stored Procedures
If you have nested blocks in your stored procedures and multiple variables with the same name are
declared in them, the scope of each variable is local to the block in which it is declared.
For example, if you have an outer block and inner block where you have declared a
variable my_variable and assigned value as 5 in outer block and 7 in inner block. As long as the
variable is used in the inner block, the value remains 7 and all operations outside the inner block, the
value assigned to variable remains 5.
When a variable name is referenced, Snowflake looks for the variable by starting first in the current
block, and then working outward one block at a time until a matching name is found.
But if you are using SnowSQL or the classic web interface, you must specify the block as a string
literal (enclosed in single quotes or double dollar signs), and you must pass the block to the
EXECUTE IMMEDIATE command as shown below.
EXECUTE IMMEDIATE
$$
DECLARE
net_sales NUMBER(38, 2);
tax NUMBER(38, 2);
gross_sales NUMBER(38, 2) DEFAULT 0.0;
BEGIN
net_sales := 98.67;
tax := 1.33;
gross_sales := net_sales + tax;
RETURN gross_sales;
END;
$$
;
5. Executing SQL statements in Stored Procedures using EXECUTE IMMEDIATE
The below stored procedure is an example which executes the SQL statements using EXECUTE
IMMEDIATE.
The first EXECUTE IMMEDIATE command executes the CREATE statement
declared in the variable create_stmt.
The second EXECUTE IMMEDIATE command executes the DELETE statement
declared in variable delete_stmt in concatenation with a filter condition passed as a
string.
This also demonstrates that EXECUTE IMMEDIATE works not only with a string literal, but also
with an expression that evaluates to a string (VARCHAR).
CREATE OR REPLACE PROCEDURE sp_execute_immediate_demo()
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
create_stmt VARCHAR DEFAULT 'CREATE OR REPLACE TABLE temp_emp AS SELECT *
FROM employees';
delete_stmt VARCHAR DEFAULT 'DELETE FROM temp_emp';
result NUMBER DEFAULT 0;
BEGIN
EXECUTE IMMEDIATE create_stmt;
EXECUTE IMMEDIATE delete_stmt || ' WHERE status= ''INACTIVE'' ';
result := (SELECT COUNT(*) FROM temp_emp);
RETURN result;
END;
$$
;
The output of the procedure gives the record count in the table temp_emp after removing the
INACTIVE records.
6. EXECUTE IMMEDIATE with USING clause in Snowflake
The EXECUTE IMMEDIATE command is used in conjunction with USING clause to pass bind
variables to the SQL query passed as a string literal to it. The bind variables are passed as a list
separated by comma and enclosed in brackets.
The syntax to use EXECUTE IMMEDIATE with USING clause in Snowflake is as follows.
EXECUTE IMMEDIATE '<sql_query>' USING (bind_variable1, bind_variable2,…);
A bind variable holds a value to be used in SQL query executed by EXECUTE IMMEDIATE
command.
The below stored procedure is an example in which the values for the filter condition of a SQL query
executed by the EXECUTE IMMEDIATE command are passed through bind variables defined in the
USING clause.
CREATE OR REPLACE PROCEDURE purge_data_by_date()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
sql_stmt VARCHAR DEFAULT 'DELETE FROM employees WHERE hire_date BETWEEN :1
AND :2';
min_date DATE DEFAULT '2015-01-01';
max_date DATE DEFAULT '2017-12-31';
result VARCHAR;
BEGIN
EXECUTE IMMEDIATE sql_stmt USING (min_date, max_date);
result := 'Data deleted between ' || min_date || ' and ' || max_date;
RETURN result;
END;
$$
;
7. EXECUTE IMMEDIATE with INTO clause in Snowflake
Using the INTO clause in conjunction with EXECUTE IMMEDIATE, we could specify the list of user-
defined variables to hold the values returned by a SELECT statement, as in Oracle. However,
EXECUTE IMMEDIATE with the INTO clause is currently not supported in Snowflake, unlike in
Oracle.
Instead we can still assign the values to the user defined variables from the result of a SQL statement
by using INTO clause directly in SELECT statement as shown below.
SELECT id, firstname INTO :id_variable, :name_variable FROM employees WHERE id = 101;
For more details refer our previous article.
IF-ELSE, CASE Statements in Snowflake Stored Procedures
February 26, 2023
Contents
Introduction
IF Statement
CASE Statement
Simple CASE Statement
Introduction
Snowflake Stored Procedures supports following branching constructs in the stored procedure
definition.
IF ELSE
CASE
IF Statement
IF statement in Snowflake provides a way to execute a set of statements if a condition is met.
The following is the syntax to the IF statement in Snowflake.
IF ( <condition> ) THEN
<statement>;
ELSEIF ( <condition> ) THEN
<statement>;
ELSE
<statement>;
END IF;
In an IF statement:
The ELSEIF and ELSE clauses are optional.
If an additional condition needs to be evaluated, add statements
under ELSEIF clause.
Multiple conditions can be evaluated using multiple ELSEIF clauses.
If none of the provided conditions are true, specify statements to execute
in ELSE clause.
The following is an example of Snowflake stored procedure calculating the maximum among the
three numbers using IF statement.
CREATE OR REPLACE PROCEDURE sp_demo_if(p NUMBER, q NUMBER, r NUMBER)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
var_string VARCHAR DEFAULT 'Maximum Number: ';
BEGIN
IF ( p>=q AND p>=r ) THEN
RETURN var_string || p ;
ELSEIF ( q>=p AND q>=r ) THEN
RETURN var_string || q ;
ELSE
RETURN var_string || r ;
END IF;
END;
$$
;
The output of the procedure is as follows
CALL sp_demo_if(11,425,35);
SP_DEMO_IF
Maximum Number: 425
The following are the different type of Loops supported in Snowflake Stored Procedures.
FOR
WHILE
REPEAT
LOOP
In this article let us discuss the different loops in Snowflake Stored Procedures with examples.
Contents
FOR Loop in Snowflake Stored Procedures
WHILE Loop in Snowflake Stored Procedures
REPEAT Loop in Snowflake Stored Procedures
LOOP Loop in Snowflake Stored Procedures
FOR Loop in Snowflake Stored Procedures
A FOR loop enables a particular set of steps to be executed for a specified number of times until
a condition is satisfied.
The following is the syntax of FOR Loop in Snowflake Stored Procedures.
FOR <counter_variable> IN [ REVERSE ] <start> TO <end> { DO | LOOP }
<statement>;
END { FOR | LOOP } [ <label> ] ;
The keyword DO should be paired with END FOR and the keyword LOOP should be paired
with END LOOP. For example:
FOR...DO
...
END FOR;
FOR...LOOP
...
END LOOP;
In a FOR Loop:
The <counter_variable> iterates from the value defined for <start> to the value defined for <end> in the syntax.
Note that if a variable with the same name as <counter_variable> is declared outside the loop, the outer variable and the loop variable are independent.
Use the REVERSE keyword to iterate from <end> down to <start>.
If there are multiple loops defined in the procedure, use the <label> to identify each loop individually. Labels also make it possible to target a specific loop with BREAK and CONTINUE statements.
The following is an example of Snowflake Stored Procedure which calculates the sum of first n
numbers using FOR Loop.
CREATE OR REPLACE PROCEDURE sp_demo_for_loop(n NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
total_sum INTEGER DEFAULT 0;
BEGIN
FOR i IN 1 TO n DO
total_sum := total_sum + i ;
END FOR;
RETURN total_sum;
END;
$$
;
The output of the procedure with FOR Loop is as follows.
CALL sp_demo_for_loop(5);
SP_DEMO_FOR_LOOP
15
The following is an example of a Snowflake Stored Procedure which prints the numbers in reverse order
using the REVERSE keyword in a FOR Loop.
CREATE OR REPLACE PROCEDURE sp_demo_reverse_for_loop(n NUMBER)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
reverse_series VARCHAR DEFAULT '';
BEGIN
FOR i IN REVERSE 1 TO n LOOP
reverse_series := reverse_series ||' '|| i::VARCHAR ;
END LOOP;
RETURN reverse_series;
END;
$$
;
The output of the procedure with REVERSE keyword in FOR Loop is as follows.
CALL sp_demo_reverse_for_loop(5);
SP_DEMO_REVERSE_FOR_LOOP
5 4 3 2 1
WHILE Loop in Snowflake Stored Procedures
A WHILE loop iterates while a specified condition is true. The condition for the loop is tested
immediately before executing the body of the loop in WHILE loop. If the condition is false, the
loop is not executed even once.
The following is the syntax of WHILE Loop in Snowflake Stored Procedures.
WHILE ( <condition> ) { DO | LOOP }
<statement>;
END { WHILE | LOOP } [ <label> ] ;
In a WHILE Loop:
The <condition> is an expression that evaluates to a BOOLEAN.
The keyword DO should be paired with END WHILE and the
keyword LOOP should be paired with END LOOP.
If there are multiple loops defined in the procedure, use the <label> to identify each loop individually. Labels also make it possible to target a specific loop with BREAK and CONTINUE statements.
Note that if the <condition> never evaluates to FALSE, and the loop does not contain a BREAK
command (or equivalent), then the loop will run and consume credits indefinitely.
The following is an example of Snowflake Stored Procedure which calculates the sum of first n
numbers using WHILE Loop.
CREATE OR REPLACE PROCEDURE sp_demo_while_loop(n NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
total_sum INTEGER DEFAULT 0;
BEGIN
LET counter := 1;
WHILE (counter <= n) DO
total_sum := total_sum + counter;
counter := counter + 1;
END WHILE;
RETURN total_sum;
END;
$$
;
The output of the procedure with WHILE Loop is as follows.
CALL sp_demo_while_loop(6);
SP_DEMO_WHILE_LOOP
21
REPEAT Loop in Snowflake Stored Procedures
A REPEAT loop iterates until a specified condition is true. This is similar to DO WHILE loop in
other programming languages which tests the condition at the end of the loop. This means that
the body of a REPEAT loop always executes at least once.
The following is the syntax of REPEAT Loop in Snowflake Stored Procedures.
REPEAT
<statement>;
UNTIL ( <condition> )
END REPEAT [ <label> ] ;
In a REPEAT Loop:
The <condition> is an expression that evaluates to a BOOLEAN.
The <condition> is evaluated at the end of the loop and is defined
using UNTIL keyword.
The following is an example of Snowflake Stored Procedure which calculates the sum of first n
numbers using REPEAT Loop.
CREATE OR REPLACE PROCEDURE sp_demo_repeat(n NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
total_sum INTEGER DEFAULT 0;
BEGIN
LET counter := 1;
REPEAT
total_sum := total_sum + counter;
counter := counter + 1;
UNTIL(counter > n)
END REPEAT;
RETURN total_sum;
END;
$$
;
The output of the procedure with REPEAT Loop is as follows.
CALL sp_demo_repeat(7);
SP_DEMO_REPEAT
28
LOOP Loop in Snowflake Stored Procedures
A LOOP loop executes until a BREAK command is executed. It does not specify a number of
iterations or a terminating condition.
The following is the syntax of LOOP Loop in Snowflake Stored Procedures.
LOOP
<statement>;
END LOOP [ <label> ] ;
In a LOOP Loop:
The user must explicitly exit the loop by using BREAK command in the loop.
The BREAK command is normally embedded inside branching logic
(e.g. IF Statements or CASE Statements).
The BREAK command immediately stops the current iteration, and skips any
remaining iterations.
The following is an example of Snowflake Stored Procedure which calculates the sum of first n
numbers using LOOP Loop.
CREATE OR REPLACE PROCEDURE sp_demo_loop(n NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
total_sum INTEGER DEFAULT 0;
BEGIN
LET counter := 1;
LOOP
IF(counter > n) THEN
BREAK;
END IF;
total_sum := total_sum + counter;
counter := counter + 1;
END LOOP;
RETURN total_sum;
END;
$$
;
The output of the procedure with LOOP is as follows.
CALL sp_demo_loop(8);
SP_DEMO_LOOP
36
1. Cursors in Snowflake Stored Procedures
A Cursor is a named object in a stored procedure which allows you to loop through a set of rows of a query result set, one row at a time. It allows you to perform the same set of actions on each row individually while looping through the result of a SQL query.
Working with Cursors in Snowflake Stored Procedures includes the following steps.
1. Declaring a cursor either in DECLARE or BEGIN…END section of the stored
procedure.
2. Opening a cursor using OPEN command.
3. Fetching rows from cursors using FETCH command.
4. Closing a cursor using CLOSE command.
2. Syntax of Cursors in Snowflake Stored Procedures
2.1. Declaring a Cursor
A Cursor must be declared before using it. Declaring a Cursor defines the cursor with a name
and the associated SELECT statement.
The syntax for declaring a CURSOR in DECLARE section of the procedure is as follows.
DECLARE
<cursor_name> CURSOR FOR <select_statement>;
-- Example:
DECLARE
my_cursor CURSOR FOR SELECT id, firstname FROM employees;
The syntax for declaring a CURSOR in BEGIN…END section of the procedure is as follows.
BEGIN
…
LET <cursor_name> CURSOR FOR <select_statement>;
…
END;
-- Example:
BEGIN
…
LET my_cursor CURSOR FOR SELECT id, firstname FROM employees;
…
END;
2.2. Opening a Cursor
The cursor must be explicitly opened using the OPEN command before fetching rows from it. The
query associated with the cursor is not executed until the cursor is opened.
The syntax to OPEN a CURSOR in stored procedure is as follows.
OPEN <cursor_name>;
-- Example:
BEGIN
OPEN my_cursor;
…
END;
2.3. Fetching data from Cursor
The FETCH command retrieves rows one at a time from the result set of the query associated with the cursor. Each FETCH command that you execute fetches a single row and advances the internal counter to the next row.
As a result, the FETCH command must be executed multiple times, typically inside a loop, until the last row is fetched. If a FETCH command is executed after all rows have been fetched, it retrieves NULL values.
The syntax to FETCH data from CURSOR in stored procedures is as follows.
FETCH <cursor_name> INTO <variable_1>,<variable_2>,…;
-- Example:
BEGIN
…
FETCH my_cursor INTO my_variable_1, my_variable_2;
…
END;
2.4. Closing a Cursor
The cursor must be closed once all rows are fetched using the CLOSE command.
The syntax to CLOSE a CURSOR in stored procedures is as follows.
CLOSE <cursor_name>;
-- Example:
BEGIN
…
CLOSE my_cursor;
END;
3. Setting up a query for Cursor demonstration
The following SELECT query fetches all the tables present in PUBLIC schema
of DEMO_DB database. The query uses INFORMATION_SCHEMA which is a data dictionary
schema available under each database.
SELECT table_name, table_type
FROM demo_db.information_schema.tables
WHERE table_schema = 'PUBLIC' AND table_type= 'BASE TABLE'
ORDER BY table_name;
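Putting the above steps together, the following is a minimal sketch of a stored procedure (the procedure name is illustrative) that opens a cursor on this query, fetches rows in a WHILE loop until FETCH returns NULL values, and returns the table names as a single string.
CREATE OR REPLACE PROCEDURE sp_demo_cursor()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
v_table_name VARCHAR;
v_table_type VARCHAR;
table_list VARCHAR DEFAULT '';
my_cursor CURSOR FOR
SELECT table_name, table_type
FROM demo_db.information_schema.tables
WHERE table_schema = 'PUBLIC' AND table_type = 'BASE TABLE'
ORDER BY table_name;
BEGIN
-- Open the cursor, fetch rows one at a time, and close it once all rows are read
OPEN my_cursor;
FETCH my_cursor INTO v_table_name, v_table_type;
WHILE (v_table_name IS NOT NULL) DO
table_list := table_list || v_table_name || ' ';
FETCH my_cursor INTO v_table_name, v_table_type;
END WHILE;
CLOSE my_cursor;
RETURN table_list;
END;
$$
;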
The below is another simple example showing an exception being raised using the RAISE command,
where the exception number and message are defined.
CREATE OR REPLACE PROCEDURE sp_raise_exception()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
MY_SP_EXCEPTION EXCEPTION(-20001, 'Raised user defined exception MY_SP_EXCEPTION.');
BEGIN
RAISE MY_SP_EXCEPTION;
END;
$$
;
The output of the stored procedure is as follows.
CALL sp_raise_exception();
Note that the exception number and message are displayed as per the exception definition and are
different from previous example.
4. Catching an Exception in Snowflake Stored Procedures
Whenever we raise an exception using the RAISE command, the call fails, providing the
information of the error.
Instead of letting the job fail, we can also handle the exception by catching it using
the EXCEPTION block of the stored procedure.
The syntax to catch an exception using the EXCEPTION block in the stored procedure is as shown
below.
BEGIN
…
EXCEPTION
WHEN <exception_name> THEN
<statement>;
END;
The below is an example showing how to catch an exception using EXCEPTION block in stored
procedures.
CREATE OR REPLACE PROCEDURE sp_raise_exception()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
MY_SP_EXCEPTION EXCEPTION(-20001, 'Raised user defined exception MY_SP_EXCEPTION.');
BEGIN
RAISE MY_SP_EXCEPTION;
EXCEPTION
WHEN MY_SP_EXCEPTION THEN
RETURN 'Raised user defined exception MY_SP_EXCEPTION';
END;
$$
;
The output of the stored procedure is as follows.
CALL sp_raise_exception();
The below is an example of a stored procedure capturing the error using built-in exception
EXPRESSION_ERROR.
CREATE OR REPLACE PROCEDURE sp_expression_error()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
var1 FLOAT;
var2 VARCHAR DEFAULT 'Some text';
BEGIN
var1 := var2;
RETURN var1;
EXCEPTION
WHEN EXPRESSION_ERROR THEN
RETURN 'EXPRESSION_ERROR:'||SQLSTATE||':'||SQLCODE||':'||SQLERRM;
END;
$$;
The output of the stored procedure is as follows.
CALL sp_expression_error();
Note that we have not declared any exceptions in the above examples; the error information is
captured using built-in exceptions.
In both of the above examples, you can replace the built-in exception with OTHER and it will still
capture the error information. Besides the built-in exceptions, the OTHER clause also catches any
user-defined exception that is declared but not explicitly handled in the EXCEPTION block.
The below is an example of a stored procedure capturing the error using OTHER.
CREATE OR REPLACE PROCEDURE sp_demo_other()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
var1 NUMBER;
BEGIN
var1 := 1/0;
RETURN var1;
EXCEPTION
WHEN OTHER THEN
RETURN 'OTHER_ERROR:'||SQLSTATE||':'||SQLCODE||':'||SQLERRM;
END;
$$;
The output of the stored procedure is as follows.
CALL sp_demo_other();
7. Closing Points
More than one exception can be handled by a single exception handler using the OR keyword.
The below exception block shows the same value being returned for multiple exceptions using OR.
EXCEPTION
WHEN MY_EXCEPTION_1 OR MY_EXCEPTION_2 OR MY_EXCEPTION_3 THEN
RETURN 123;
WHEN MY_EXCEPTION_4 THEN
RETURN 4;
WHEN OTHER THEN
RETURN 99;
The exception handler should be at the end of the block. If the block contains statements after the
exception handler, those statements are not executed.
If more than one WHEN clause could match a specific exception, then the first WHEN clause that
matches is the one that is executed. The other clauses are not executed.
If you want to raise the exception which you caught in your exception handler, execute RAISE
command without any arguments.
BEGIN
DELETE FROM emp;
EXCEPTION
WHEN STATEMENT_ERROR THEN
LET ERROR_MESSAGE := SQLCODE || ': ' || SQLERRM;
INSERT INTO error_details VALUES (:ERROR_MESSAGE); -- Capture error details into a table.
RAISE; -- Raise the same exception that you are handling.
END;
Caller’s and Owner’s Rights in Snowflake Stored Procedures
March 31, 2023
1. Introduction
Stored procedures in Snowflake run either with caller's rights or with owner's rights, which determine
the privileges with which the statements in the stored procedure execute. By default, when a stored
procedure is created in Snowflake without specifying the rights with which it should be executed, it
runs with owner's rights.
In this article, let us discuss what caller's rights and owner's rights are, the differences between the
two, and how to implement them in Snowflake stored procedures.
2. Caller’s Rights in Snowflake Stored Procedures
A caller’s rights stored procedure runs with the privileges of the role that called the stored procedure.
The term “Caller” in this context refers to the user executing the stored procedure, who may or may
not be the creator of the procedure.
Any statement that the caller could not execute outside the stored procedure cannot be executed
inside the stored procedure with caller’s rights.
At the time of creation of stored procedure, the creator has to specify if the stored procedure runs with
caller’s rights. The default is owner’s rights.
The syntax to create a stored procedure with caller’s rights is as shown below.
CREATE OR REPLACE PROCEDURE <procedure_name>()
RETURNS <data_type>
LANGUAGE SQL
EXECUTE AS CALLER
AS
$$
…
$$;
3. Owner’s Rights in Snowflake Stored Procedures
An Owner’s rights stored procedure runs with the privileges of the role that created the stored
procedure. The term “Owner” in this context refers to the user who created the stored procedure, who
may or may not be executing the procedure.
The primary advantage of owner's rights is that the owner can delegate privileges to another role
through the stored procedure without actually granting those privileges outside the procedure.
For example, suppose a user who does not have access to clean up data in a table is granted access to
a stored procedure (with owner's rights) which does it. That user, who has no privileges on the table,
can clean up the data in the table by executing the stored procedure. But the same statements, when
executed outside the procedure, cannot be executed by the user.
The syntax to create a stored procedure with owner’s rights is as shown below.
CREATE OR REPLACE PROCEDURE <procedure_name>()
RETURNS <data_type>
LANGUAGE SQL
EXECUTE AS OWNER
AS
$$
…
$$;
Note “EXECUTE AS OWNER” is optional. Even if the statement is not specified, the procedure is
created with owner’s rights.
4. Difference between Caller’s and Owner’s Rights in Snowflake
The below are the differences between Caller’s and Owner’s Rights in Snowflake.
Caller's Rights:
Runs with the privileges of the caller.
Inherits the current warehouse of the caller.
Uses the database and schema that the caller is currently using.
Owner's Rights:
Runs with the privileges of the owner.
Inherits the current warehouse of the caller.
Uses the database and schema that the stored procedure is created in, not the database and schema that the caller is currently using.
5. Demonstration of Caller’s and Owner’s Rights
Let us understand how Caller’s and Owner’s Rights work with an example
using ACCOUNTADMIN and SYSADMIN roles.
Using ACCOUNTADMIN role, let us create a table named Organization for demonstration.
USE ROLE ACCOUNTADMIN;
CREATE TABLE organization(id NUMBER, org_name VARCHAR(50));
When the table is queried using the SYSADMIN role, it throws an error as shown below, since no grants
on this table have been provided to SYSADMIN.
USE ROLE SYSADMIN;
SELECT * FROM organization;
Let us create a stored procedure with Caller’s rights using ACCOUNTADMIN role to delete data
from Organization table.
USE ROLE ACCOUNTADMIN;
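The procedure definition itself is not reproduced here; the following is a minimal sketch of what a caller's rights procedure named sp_demo_callers_rights (the name used in the grant below) that deletes data from the Organization table could look like.
CREATE OR REPLACE PROCEDURE sp_demo_callers_rights()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS CALLER
AS
$$
BEGIN
-- Delete all rows from the Organization table created above
DELETE FROM organization;
RETURN 'Data deleted from Organization table';
END;
$$
;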
Assign the grants to execute the stored procedure to the SYSADMIN role.
USE ROLE ACCOUNTADMIN;
GRANT USAGE ON PROCEDURE DEMO_DB.PUBLIC.sp_demo_callers_rights() TO ROLE
SYSADMIN;
The output of the caller’s rights stored procedure with SYSADMIN role is as below.
USE ROLE SYSADMIN;
CALL sp_demo_callers_rights();
Since the SYSADMIN role does not have any privileges on the Organization table, the execution of the
procedure with caller's rights also fails.
The owner of the stored procedure can change the procedure from an owner’s rights stored procedure
to a caller’s rights stored procedure (or vice-versa) by executing an ALTER
PROCEDURE command as shown below.
ALTER PROCEDURE sp_demo_callers_rights() EXECUTE AS OWNER;
The output of the owner’s rights stored procedure with SYSADMIN role is as below.
USE ROLE SYSADMIN;
CALL sp_demo_callers_rights();
Though the SYSADMIN role does not have privileges on the Organization table, the execution of the
procedure, which deletes data from the Organization table, succeeds because the procedure executes
with owner's rights.
Window Functions in Snowflake Snowpark
February 21, 2024
1. Introduction to Window Functions
A Window Function performs a calculation across a set of table rows that are somehow related
to the current row and returns a single aggregated value for each row. Unlike regular aggregate
functions like SUM() or AVG(), window functions do not group rows into a single output row.
Instead, they compute a value for each row based on a specific window of rows.
Refer to our previous article to learn more about Window Functions in SQL, including examples to
help you understand them better.
In this article, let us explore how to implement Window Functions on DataFrames in Snowflake
Snowpark.
2. Window Functions in Snowpark
The Window class in Snowpark enables defining a WindowSpec (or Window Specification) that
determines which rows are included in a window. A window is a group of rows that are associated
with the current row by some relation.
Syntax
The following is the syntax to form a WindowSpec in Snowpark.
Window.<partitionBy_specification>.<orderBy_specification>.<windowFrame_specification>
PartitionBy Specification
The Partition specification defines which rows are included in a window (partition). If
no partition is defined, all the rows are included in a single partition.
OrderBy Specification
The Ordering specification determines the ordering of the rows within the window.
The ordering could be ascending (ASC) or descending (DESC).
WindowFrame Specification
The Window Frame specification defines the subset of rows within a partition over
which the window function operates. It determines the range of rows to consider for
each row’s computation within the window partition.
A Window Function is formed by passing the WindowSpec to the aggregate functions (like SUM(),
AVG(), etc.) using the OVER clause.
<aggregate_function>(<arguments>).over(<windowSpec>)
To know the full list of functions that support Windows, refer to Snowflake Documentation.
3. Demonstration of Window Functions in Snowpark
Consider the EMPLOYEES data below for the demonstration of the Window Functions in
Snowpark.
#// creating dataframe with employee data
employee_data = [
[1,'TONY',24000,101],
[2,'STEVE',17000,101],
[3,'BRUCE',9000,101],
[4,'WANDA',20000,102],
[5,'VICTOR',12000,102],
[6,'STEPHEN',10000,103],
[7,'HANK',15000,103],
[8,'THOR',21000,103]
]
------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |
------------------------------------------------
|1 |TONY |24000 |101 |
|2 |STEVE |17000 |101 |
|3 |BRUCE |9000 |101 |
|4 |WANDA |20000 |102 |
|5 |VICTOR |12000 |102 |
|6 |STEPHEN |10000 |103 |
|7 |HANK |15000 |103 |
|8 |THOR |21000 |103 |
------------------------------------------------
3.1. Find the Employees with Highest salary in each Department
Follow the below steps to find the details of Employees with the highest salary in each Department
using Window Functions in Snowpark.
STEP-1: Import all the necessary Snowpark libraries and create a WindowSpec
The following code creates a WindowSpec where a partition is created based on the DEPT_ID field
and the rows within each partition are ordered by SALARY in descending order.
#// Importing Snowpark Libraries
from snowflake.snowpark import Window
from snowflake.snowpark.functions import row_number, desc, col, min
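The statements that create the WindowSpec and add the RANK column (STEP-2 of this walkthrough) are not reproduced above. The following is a sketch consistent with the output shown below, assuming df_emp is the EMPLOYEES DataFrame created earlier.
#// creating a WindowSpec partitioned by DEPT_ID and ordered by SALARY descending
windowSpec = Window.partitionBy("DEPT_ID").orderBy(desc("SALARY"))
#// STEP-2: assign a rank to each employee within a department based on salary
df_emp.withColumn("RANK", row_number().over(windowSpec)).show()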
---------------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |"RANK" |
---------------------------------------------------------
|4 |WANDA |20000 |102 |1 |
|5 |VICTOR |12000 |102 |2 |
|1 |TONY |24000 |101 |1 |
|2 |STEVE |17000 |101 |2 |
|3 |BRUCE |9000 |101 |3 |
|8 |THOR |21000 |103 |1 |
|7 |HANK |15000 |103 |2 |
|6 |STEPHEN |10000 |103 |3 |
---------------------------------------------------------
In the above code, we have used the DataFrame.withColumn() method that returns a DataFrame
with an additional column with the specified column name computed using the specified expression.
In this scenario, the method returns all the columns of the DataFrame df_emp along with a new field
named RANK computed based on the Window Function passed as an expression.
Alternatively, the DataFrame.select() method can be used to achieve the same output as shown
below.
df_emp.select("*", row_number().over(windowSpec).alias("RANK")).show()
STEP-3: Filter the records with the Highest Salary in each Department
The following code filters the records with RANK value 1 and sorts the records based on
the DEPT_ID.
df_emp.withColumn("RANK",
row_number().over(windowSpec)).filter(col("RANK")==1).sort("DEPT_ID").show()
---------------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |"RANK" |
---------------------------------------------------------
|1 |TONY |24000 |101 |1 |
|4 |WANDA |20000 |102 |1 |
|8 |THOR |21000 |103 |1 |
---------------------------------------------------------
When executed, this code is translated and executed as SQL in Snowflake through the Snowpark API.
The resulting SQL statement will be equivalent to the following query.
SELECT * FROM(
SELECT
EMP_ID, EMP_NAME, SALARY, DEPT_ID,
ROW_NUMBER() OVER (PARTITION BY DEPT_ID ORDER BY SALARY DESC) AS RANK
FROM EMPLOYEES
)
WHERE RANK = 1
ORDER BY DEPT_ID ;
3.2. Calculate Total Sum of Salary for each Department and display it alongside each employee’s
record
The following code calculates the total Sum of the Salary for each Department and displays it
alongside each employee record using Window Functions in Snowpark.
windowSpec = Window.partitionBy("DEPT_ID")
df_emp.withColumn("TOTAL_SAL", sum("SALARY").over(windowSpec)).show()
--------------------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |"TOTAL_SAL" |
--------------------------------------------------------------
|1 |TONY |24000 |101 |50000 |
|2 |STEVE |17000 |101 |50000 |
|3 |BRUCE |9000 |101 |50000 |
|4 |WANDA |20000 |102 |32000 |
|5 |VICTOR |12000 |102 |32000 |
|6 |STEPHEN |10000 |103 |46000 |
|7 |HANK |15000 |103 |46000 |
|8 |THOR |21000 |103 |46000 |
--------------------------------------------------------------
The above Snowpark code is equivalent to the following SQL query.
SELECT
EMP_ID, EMP_NAME, SALARY, DEPT_ID,
SUM(SALARY) OVER (PARTITION BY DEPT_ID) AS TOTAL_SAL
FROM EMPLOYEES
;
3.3. Calculate the Cumulative Sum of Salary for each Department
The following code calculates the total Cumulative Sum of the Salary for each Department and
displays it alongside each employee record using Window Functions in Snowpark.
windowSpec = Window.partitionBy("DEPT_ID").orderBy(col("EMP_ID"))
df_emp.withColumn("CUM_SAL", sum("SALARY").over(windowSpec)).show()
------------------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |"CUM_SAL" |
------------------------------------------------------------
|1 |TONY |24000 |101 |24000 |
|2 |STEVE |17000 |101 |41000 |
|3 |BRUCE |9000 |101 |50000 |
|4 |WANDA |20000 |102 |20000 |
|5 |VICTOR |12000 |102 |32000 |
|6 |STEPHEN |10000 |103 |10000 |
|7 |HANK |15000 |103 |25000 |
|8 |THOR |21000 |103 |46000 |
------------------------------------------------------------
The above Snowpark code is equivalent to the following SQL query.
SELECT
EMP_ID, EMP_NAME, SALARY, DEPT_ID,
SUM(SALARY) OVER (PARTITION BY DEPT_ID ORDER BY EMP_ID) AS CUM_SAL
FROM EMPLOYEES
;
3.4. Calculate the Minimum Salary Between the Current Employee and the one Following for each
Department
The following code calculates the minimum salary between the current employee and the one
following for each department using Window Functions in Snowpark.
windowSpec =
Window.partitionBy("DEPT_ID").orderBy(col("EMP_ID")).rows_between(Window.currentRow,1)
df_emp.withColumn("MIN_SAL", min("SALARY").over(windowSpec)).sort("EMP_ID").show()
------------------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |"MIN_SAL" |
------------------------------------------------------------
|1 |TONY |24000 |101 |17000 |
|2 |STEVE |17000 |101 |9000 |
|3 |BRUCE |9000 |101 |9000 |
|4 |WANDA |20000 |102 |12000 |
|5 |VICTOR |12000 |102 |12000 |
|6 |STEPHEN |10000 |103 |10000 |
|7 |HANK |15000 |103 |15000 |
|8 |THOR |21000 |103 |21000 |
------------------------------------------------------------
The above Snowpark code is equivalent to the following SQL query where the salary of the current
employee record is compared with the next employee record.
SELECT
EMP_ID, EMP_NAME, SALARY, DEPT_ID,
MIN(SALARY) OVER (PARTITION BY DEPT_ID ORDER BY EMP_ID ROWS BETWEEN
CURRENT ROW AND 1 FOLLOWING) AS MIN_SAL
FROM EMPLOYEES
ORDER BY EMP_ID
;
Window Frames require the data within the window to be ordered. So, even though the ORDER BY
clause is optional in regular window function syntax, it is mandatory in window frame syntax.
1. Introduction
The Table.update() method in Snowpark helps in updating the rows of a table. It returns a
tuple UpdateResult, representing the number of rows modified and the number of multi-joined rows
modified. This method can also be used to update the rows of a DataFrame.
Syntax
Table.update(<assignments>, <condition>, [<source>])
Parameters
<assignments>
A dictionary that contains key-value pairs representing columns of a DataFrame and
the corresponding values with which they should be updated. The values can either be
a literal value or a column object.
<condition>
Represents the specific condition based on which a column should be updated. If no
condition is specified, all the rows of the DataFrame will be updated.
<source>
Represents another DataFrame based on which the data of the current DataFrame will
be updated. The join condition between both the DataFrames should be specified in
the <condition>.
2. Steps to Update a DataFrame in Snowpark
Follow the below steps to update data of a DataFrame in Snowpark using Table.update() method.
1. Create a DataFrame with the desired data using Session.createDataFrame(). The
DataFrame could be built based on an existing table or data read from a CSV file or
content created within the code.
2. Create a temporary table with the contents of the DataFrame using
the DataFrameWriter class.
3. Create a DataFrame to read the contents of the temporary table
using Session.table() method.
4. Using the Table.update() method, update the contents of the DataFrame which is
created using a temporary table.
5. Display the contents of the DataFrame to verify that the appropriate records have
been updated using the DataFrame.show() method.
Temporary tables only exist within the session in which they were created and are not visible to other
users or sessions. Once the session ends, the table is completely purged from the system. Therefore,
temporary tables are well-suited in the scenario of updating DataFrames.
3. Demonstration of Updating all rows of a DataFrame
STEP-1: Create DataFrame
The following code creates a DataFrame df_emp which holds the EMPLOYEES data as shown
below.
#// create a DataFrame with employee data
employee_data = [
[1,'TONY',24000,10],
[2,'STEVE',17000,10],
[3,'BRUCE',9000,20],
[4,'WANDA',20000,20]
]
------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |
------------------------------------------------
|1 |TONY |24000 |10 |
|2 |STEVE |17000 |10 |
|3 |BRUCE |9000 |20 |
|4 |WANDA |20000 |20 |
------------------------------------------------
STEP-2: Create Temporary Table
The following code creates a temporary table named tmp_emp in the Snowflake database using the
contents of df_emp DataFrame.
#// create a temp table
df_emp.write.mode("overwrite").save_as_table("tmp_emp", table_type="temp")
STEP-3: Read Temporary Table
The following code creates a new DataFrame df_tmp_emp which reads the contents of temporary
table tmp_emp.
#// create a DataFrame to read contents of temp table
df_tmp_emp = session.table("tmp_emp")
df_tmp_emp.show()
------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |
------------------------------------------------
|1 |TONY |24000 |10 |
|2 |STEVE |17000 |10 |
|3 |BRUCE |9000 |20 |
|4 |WANDA |20000 |20 |
------------------------------------------------
STEP-4: Update DataFrame
The following code updates all the records of DataFrame df_tmp_emp by multiplying the DEPT_ID
values by 10 and doubling the SALARY amounts.
#// update DEPT_ID and SALARY fields of all records
from snowflake.snowpark.types import IntegerType
from snowflake.snowpark.functions import cast
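#// (Assumed statements, not shown in the original) Table.update() call that multiplies
#// DEPT_ID by 10 and doubles SALARY for every row, consistent with the output below
from snowflake.snowpark.functions import col
df_tmp_emp.update({"DEPT_ID": cast(col("DEPT_ID") * 10, IntegerType()),
                   "SALARY": col("SALARY") * 2})
df_tmp_emp.show()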
------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |
------------------------------------------------
|1 |TONY |48000 |100 |
|2 |STEVE |34000 |100 |
|3 |BRUCE |18000 |200 |
|4 |WANDA |40000 |200 |
------------------------------------------------
4. Updating a DataFrame based on a Condition
The following code updates the salary of all employees belonging to department 100.
#// update the SALARY field of employees where DEPT_ID is 100
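#// (Assumed statement, not shown in the original) add 100 to SALARY for the rows
#// where DEPT_ID is 100, consistent with the output below
from snowflake.snowpark.functions import col
df_tmp_emp.update({"SALARY": col("SALARY") + 100}, col("DEPT_ID") == 100)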
df_tmp_emp.show()
------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |
------------------------------------------------
|1 |TONY |48100 |100 |
|2 |STEVE |34100 |100 |
|3 |BRUCE |18000 |200 |
|4 |WANDA |40000 |200 |
------------------------------------------------
5. Updating a DataFrame based on data in another DataFrame
A DataFrame can also be updated based on the data in another DataFrame
using Table.update() method.
The following code updates employees’ SALARY in df_tmp_emp DataFrame where EMP_ID is
equal to EMP_ID in another DataFrame df_salary.
#// update DataFrame based on data in another DataFrame
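#// (Assumed code, not shown in the original) a DataFrame holding revised salaries and an
#// update joining on EMP_ID; the salary values are chosen to match the output below
df_salary = session.create_dataframe([[1, 50000], [2, 35000]], schema=["EMP_ID", "SALARY"])
df_tmp_emp.update({"SALARY": df_salary["SALARY"]},
                  df_tmp_emp["EMP_ID"] == df_salary["EMP_ID"],
                  df_salary)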
df_tmp_emp.show()
------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |
------------------------------------------------
|1 |TONY |50000 |100 |
|2 |STEVE |35000 |100 |
|3 |BRUCE |18000 |200 |
|4 |WANDA |40000 |200 |
------------------------------------------------
6. Updating a DataFrame using Session.sql() Method
The Session.sql() method in Snowpark can be used to execute a SQL statement. It returns a new
DataFrame representing the results of a SQL query.
Follow the below steps to update the data of a DataFrame in Snowpark using
the Session.sql() method.
1. Create a DataFrame with the desired data using Session.createDataFrame(). The
DataFrame could be built based on an existing table or data read from a CSV file or
content created within the code.
2. Create a temporary table with the contents of the DataFrame using
the DataFrameWriter class.
3. Use the Session.sql() method to update the contents of the temporary table.
4. Create a DataFrame to read the contents of the updated temporary table using
the session.table() method.
5. Display the contents of the DataFrame to verify that the appropriate records have
been updated using the DataFrame.show() method.
#// create DataFrame
employee_data = [
[1,'TONY',24000,10],
[2,'STEVE',17000,10],
[3,'BRUCE',9000,20],
[4,'WANDA',20000,20]
]
employee_schema = ["EMP_ID", "EMP_NAME", "SALARY", "DEPT_ID"]
df_emp =session.createDataFrame(employee_data, schema=employee_schema)
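The remaining steps are not reproduced in the original; the following is a minimal sketch of them, reusing the temporary table name tmp_emp from the earlier demonstration together with an illustrative UPDATE statement that doubles the salary of department 10.
#// create a temp table with the DataFrame contents
df_emp.write.mode("overwrite").save_as_table("tmp_emp", table_type="temp")
#// update the temp table using Session.sql() and read it back into a DataFrame
session.sql("UPDATE tmp_emp SET SALARY = SALARY * 2 WHERE DEPT_ID = 10").collect()
df_tmp_emp = session.table("tmp_emp")
df_tmp_emp.show()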
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|1 |TONY |101 |
|2 |STEVE |101 |
|3 |BRUCE |102 |
|4 |WANDA |102 |
|5 |VICTOR |103 |
|6 |HANK |105 |
-----------------------------
#// create dataframe with departments data
department_data = [
[101,'HR'],
[102,'SALES'],
[103,'IT'],
[104,'FINANCE'],
]
-----------------------
|"DEPT_ID" |"NAME" |
-----------------------
|101 |HR |
|102 |SALES |
|103 |IT |
|104 |FINANCE |
-----------------------
3.1. Join DataFrames in Snowpark
The EMPLOYEES and DEPARTMENTS DataFrames can be joined using
the DataFrame.join method in Snowpark as shown below.
#// Joining two DataFrames
#// Method-1
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).show()
#// Method-2
df_emp.join(df_dept, df_emp["DEPT_ID"] == df_dept["DEPT_ID"]).show()
------------------------------------------------------------------------------
|"ID" |"l_vrun_NAME" |"l_vrun_DEPT_ID" |"r_jc4z_DEPT_ID" |"r_jc4z_NAME" |
------------------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
------------------------------------------------------------------------------
3.2. Join DataFrames referring to a Single Column Name in Snowpark
The DataFrames can be joined by referring to a single column name if the name of the column is the same
in both the DataFrames.
The EMPLOYEES and DEPARTMENTS DataFrames can be joined by referring to a single
column DEPT_ID as shown below.
#// Joining two DataFrames referring to a single column
df_emp.join(df_dept, "DEPT_ID").show()
----------------------------------------------------
|"DEPT_ID" |"ID" |"l_9ml8_NAME" |"r_8dfz_NAME" |
----------------------------------------------------
|101 |1 |TONY |HR |
|101 |2 |STEVE |HR |
|102 |3 |BRUCE |SALES |
|102 |4 |WANDA |SALES |
|103 |5 |VICTOR |IT |
----------------------------------------------------
3.3. Rename Ambiguous Columns of Join operation Output in Snowpark
When two DataFrames are joined, the overlapping columns will have random column names in the
resulting DataFrame as seen in the above examples.
The randomly named columns can be renamed using Column.alias as shown below.
#// Renaming the ambiguous columns
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).\
select(df_emp.ID, df_emp.NAME.alias("EMP_NAME"), df_emp.DEPT_ID.alias("DEPT_ID"),
df_dept.NAME.alias("DEPT_NAME")).show()
-----------------------------------------------
|"ID" |"EMP_NAME" |"DEPT_ID" |"DEPT_NAME" |
-----------------------------------------------
|1 |TONY |101 |HR |
|2 |STEVE |101 |HR |
|3 |BRUCE |102 |SALES |
|4 |WANDA |102 |SALES |
|5 |VICTOR |103 |IT |
-----------------------------------------------
3.4. Rename Ambiguous Columns of Join operation Output using lsuffix and rsuffix
The randomly named overlapping columns can be renamed using lsuffix and rsuffix parameters
in DataFrame.join method.
lsuffix – Suffix to add to the overlapping columns of the left DataFrame.
rsuffix – Suffix to add to the overlapping columns of the right DataFrame.
#// Renaming the ambiguous columns using lsuffix and rsuffix
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP",
rsuffix="_DEPT").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
--------------------------------------------------------------------
It is recommended to use lsuffix and rsuffix parameters within DataFrame.join method when there
are overlapping columns between the DataFrames.
3.5. Join DataFrames based on Multiple Conditions in Snowpark.
DataFrames can be joined based on multiple conditions separated by the “&” symbol as shown
below.
DataFrame.join(right DataFrame, (<join_condition_1>) & (<join_condition_2>))
When the names of the columns are the same between both the DataFrames, they can be joined by
passing a list of column names as shown below.
DataFrame.join(right DataFrame, ["col_1", "col_2", ..])
The following is an example of joining EMPLOYEES and DEPARTMENTS based on multiple
conditions.
#// Joining two DataFrames based on Multiple conditions
df_emp.join(df_dept, (df_emp.DEPT_ID == df_dept.DEPT_ID) & (df_emp.ID <
df_dept.DEPT_ID), \
lsuffix="_EMP", rsuffix="_DEPT").show()
3.6. Join Types in Snowpark
By default, the DataFrame.join method applies an inner join to join the data between the two
DataFrames. The other supported join types can be specified to join the data between two DataFrames
as shown below.
#// Left Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT",
join_type="left").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
|6 |HANK |105 |NULL |NULL |
--------------------------------------------------------------------
Instead of the "join_type" parameter, we can also use the "how" parameter to specify the join type.
#// Right Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT",
how="right").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
|NULL |NULL |NULL |104 |FINANCE |
--------------------------------------------------------------------
IN Operator in Snowflake Snowpark
February 13, 2024
1. Introduction
The IN operator in SQL allows you to specify multiple values in a WHERE clause to filter the data. It
serves as a shorthand for employing multiple OR conditions. Additionally, the IN operator can be
utilized with a subquery within the WHERE clause.
In this article, let us explore how the IN operator can be implemented with DataFrames in Snowflake
Snowpark.
2. IN Operator in Snowflake Snowpark
The Column.in_() method in Snowpark returns a conditional expression that can be passed to
the DataFrame.filter() method (equivalent to WHERE in SQL) to perform the equivalent of an IN
operator in SQL.
The supported values of Column.in_() method are a sequence of values or a DataFrame that
represents a subquery.
3. Demonstration of IN operator in Snowpark
Consider the EMPLOYEES and DEPARTMENTS data below for the demonstration of the IN
operator in Snowpark.
#// create dataframe with employees data
employee_data = [
[1,'TONY',101],
[2,'STEVE',101],
[3,'BRUCE',102],
[4,'WANDA',102],
[5,'VICTOR',103],
[6,'HANK',105],
]
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|1 |TONY |101 |
|2 |STEVE |101 |
|3 |BRUCE |102 |
|4 |WANDA |102 |
|5 |VICTOR |103 |
|6 |HANK |105 |
-----------------------------
#// create dataframe with departments data
department_data = [
[101,'HR'],
[102,'SALES'],
[103,'IT'],
[104,'FINANCE'],
]
-----------------------
|"DEPT_ID" |"NAME" |
-----------------------
|101 |HR |
|102 |SALES |
|103 |IT |
|104 |FINANCE |
-----------------------
3.1. Filtering Data from a Snowpark DataFrame using a Single Value
The following example extracts details of employees with ID=1 from the EMPLOYEES DataFrame.
df_emp.filter(col("ID")==1).show()
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|1 |TONY |101 |
-----------------------------
The above Snowpark code is equivalent to the following SQL query.
SELECT * FROM EMPLOYEES WHERE ID = 1;
3.2. Filtering Data from a Snowpark DataFrame using Multiple Values
The following example extracts details of employees with ID 1, 2, and 3 from
the EMPLOYEES DataFrame using Column.in_() method.
from snowflake.snowpark.functions import col
df_emp.filter(col("ID").in_(1,2,3)).show()
#// (or) //
df_emp.filter(df_emp.col("ID").in_(1,2,3)).show()
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|1 |TONY |101 |
|2 |STEVE |101 |
|3 |BRUCE |102 |
-----------------------------
The above Snowpark code is equivalent to the following SQL query.
SELECT * FROM EMPLOYEES WHERE ID IN (1,2,3);
4. Implementing SubQueries in Snowpark using Column.in_() method
A Subquery also known as an inner query or nested query is a query nested within another SQL
statement. The inner query is executed first and its results are used by the outer query to further filter,
join, or manipulate data.
Consider a scenario where we need to extract the details of all employees belonging to
the SALES department. The same can be achieved using the below SQL query using a filter condition
on the SALES department in a subquery.
SELECT * FROM EMPLOYEES WHERE DEPT_ID IN (
SELECT DEPT_ID FROM DEPARTMENTS WHERE NAME = 'SALES');
Let us understand how the same can be implemented in Snowpark.
STEP-1: Extract the list of values to be passed to the IN operator into a DataFrame
In this scenario, we need to extract the DEPT_ID value of the SALES department from
the DEPARTMENTS DataFrame into a new DataFrame.
The following code applies a filter on the NAME field with values as ‘SALES’ and selects only
the DEPT_ID field from the DEPARTMENTS DataFrame into a new DataFrame.
df_dept_SALES = df_dept.filter(col("NAME")=="SALES").select("DEPT_ID")
df_dept_SALES.show()
-------------
|"DEPT_ID" |
-------------
|102 |
-------------
STEP-2: Pass the DataFrame representing a Subquery as Input Parameter to the Column.in_() Method
The following code returns the details of all employees belonging to the SALES department by
passing the DataFrame that holds the SALES department ID value as input to
the Column.in_() method.
df_emp_SALES = df_emp.filter(col("DEPT_ID").in_(df_dept_SALES))
df_emp_SALES.show()
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|3 |BRUCE |102 |
|4 |WANDA |102 |
-----------------------------
5. Implementing the IN operator in the Snowpark SELECT clause
The Column.in_() method in Snowpark can also be passed to a DataFrame.select() call. The
expression returns a Boolean value and evaluates to true if the value in the column is one of the values
in the specified sequence.
The following code returns the ID column from the EMPLOYEES DataFrame along with a new
column that returns true if the ID value is present in one of the values passed to the IN operator.
#// IN operator in SELECT clause
df_emp.select(col("ID"), col("ID").in_(1,2,3).alias("IS_EXISTS")).show()
----------------------
|"ID" |"IS_EXISTS" |
----------------------
|1 |True |
|2 |True |
|3 |True |
|4 |False |
|5 |False |
|6 |False |
----------------------
The above Snowpark code is equivalent to the following SQL query.
SELECT ID, ID IN (1,2,3) AS IS_EXISTS FROM EMPLOYEES;
6. NOT IN Operator in Snowflake Snowpark
In SQL, using the NOT keyword preceding the IN operator retrieves all records that do not match any
of the values in the list. For example, the following SQL statement returns all employee records
whose ID is not 1, 2, or 3.
SELECT * FROM EMPLOYEES WHERE ID NOT IN (1,2,3);
There is no in-built method available in Snowpark that can perform the same actions as NOT
IN operator in SQL.
To implement the NOT IN operator in Snowpark, we still utilize the Column.in_() method.
However, it’s essential to ensure that the DataFrame passed as an input parameter to the method
contains a list of values other than those in the specified list.
GET all employee records whose ID is not 1, 2, or 3.
STEP-1: Use the IN Operator in the SELECT Clause to Identify Values Not Present in the Specified
List
The following code returns ‘True’ for IDs that are present in the list passed to
the Column.in_() method, and ‘False’ if they are not present.
df1 = df_emp.select(col("ID"), col("ID").in_(1,2,3).alias("IS_EXISTS"))
df1.show()
----------------------
|"ID" |"IS_EXISTS" |
----------------------
|1 |True |
|2 |True |
|3 |True |
|4 |False |
|5 |False |
|6 |False |
----------------------
STEP-2: Filter Values Not Present in the Specified List
The following code retrieves all the IDs from the EMPLOYEES DataFrame, excluding those that are
present in the specified list, by filtering out the records that returned ‘False’ in the previous step.
df2 = df1.filter(col("IS_EXISTS")=='False').select("ID")
df2.show()
--------
|"ID" |
--------
|4 |
|5 |
|6 |
--------
STEP-3: Filter DataFrame by Passing a List of Values Not Present in the Specified List
The following code retrieves all employee records whose ID is not 1, 2 or 3
from EMPLOYEES DataFrame by passing a DataFrame that holds all the employee IDs except for
1, 2, and 3.
df3 = df_emp.filter(col("ID").in_(df2))
df3.show()
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|4 |WANDA |102 |
|5 |VICTOR |103 |
|6 |HANK |105 |
-----------------------------
All the above mentioned steps are equivalent to the following SQL query.
SELECT * FROM EMPLOYEES WHERE ID IN (
SELECT ID FROM(
SELECT ID, ID IN (1,2,3) IS_EXISTS FROM EMPLOYEES)
WHERE IS_EXISTS = 'False');
The process to identify the values that are not present in the required list may vary depending on the
specific scenario. However, the overall approach is to identify the values absent from the list specified
in the NOT IN condition and leverage them to filter the records.
1. Introduction
Joins are used to combine rows from two or more tables, based on a related column between them.
Joins allow for the creation of a comprehensive result set that incorporates relevant information from
multiple tables.
In this article, let us explore how to join data between two DataFrames in Snowflake Snowpark.
2. Joins in Snowpark
The DataFrame.join method in Snowpark helps in performing a join of specified type on the data of
the current DataFrame with another DataFrame based on a list of columns.
Syntax:
DataFrame.join(right DataFrame, <join_condition>, join_type=<join_type>)
Parameters:
right DataFrame – The other DataFrame to be joined with.
<join_condition> – The condition on which data in both the DataFrames is joined. The valid
values for a join condition are
A column name or a list of column names. When column names are specified, this
method assumes that columns with the same names are present in both the
DataFrames.
A column expression built from columns of both DataFrames that specifies the join
condition.
<join_type> – The type of join to be applied to join data between two DataFrames. The Snowpark
API for Python supports the following join types found in SQL.
SQL Join Type          Supported Value
Inner Join             inner
Left Outer Join        left, leftouter
Right Outer Join       right, rightouter
Full Outer Join        full, outer, fullouter
Cross Join             cross
3. Demonstration of Joins in Snowpark
Consider the EMPLOYEES and DEPARTMENTS data below for the demonstration.
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|1 |TONY |101 |
|2 |STEVE |101 |
|3 |BRUCE |102 |
|4 |WANDA |102 |
|5 |VICTOR |103 |
|6 |HANK |105 |
-----------------------------
#// create dataframe with departments data
department_data = [
[101,'HR'],
[102,'SALES'],
[103,'IT'],
[104,'FINANCE'],
]
-----------------------
|"DEPT_ID" |"NAME" |
-----------------------
|101 |HR |
|102 |SALES |
|103 |IT |
|104 |FINANCE |
-----------------------
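For reference, the df_emp and df_dept DataFrames used in the join examples below can be built from
the data shown above. The following is a minimal sketch assuming the column names displayed in the
output tables; the employee list is reconstructed from the output.
#// create dataframes with employees and departments data
employee_data = [
[1,'TONY',101],
[2,'STEVE',101],
[3,'BRUCE',102],
[4,'WANDA',102],
[5,'VICTOR',103],
[6,'HANK',105]
]
df_emp = session.createDataFrame(employee_data, schema=["ID","NAME","DEPT_ID"])
df_dept = session.createDataFrame(department_data, schema=["DEPT_ID","NAME"])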
3.1. Join DataFrames in Snowpark
The EMPLOYEES and DEPARTMENTS DataFrames can be joined using
the DataFrame.join method in Snowpark as shown below.
#// Joining two DataFrames
#// Method-1
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).show()
#// Method-2
df_emp.join(df_dept, df_emp["DEPT_ID"] == df_dept["DEPT_ID"]).show()
------------------------------------------------------------------------------
|"ID" |"l_vrun_NAME" |"l_vrun_DEPT_ID" |"r_jc4z_DEPT_ID" |"r_jc4z_NAME" |
------------------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
------------------------------------------------------------------------------
3.2. Join DataFrames referring to a Single Column Name in Snowpark
The DataFrames can be joined by referring to a single column name if the name of the column is the
same in both the DataFrames.
The EMPLOYEES and DEPARTMENTS DataFrames can be joined by referring to a single
column DEPT_ID as shown below.
#// Joining two DataFrames referring to a single column
df_emp.join(df_dept, "DEPT_ID").show()
----------------------------------------------------
|"DEPT_ID" |"ID" |"l_9ml8_NAME" |"r_8dfz_NAME" |
----------------------------------------------------
|101 |1 |TONY |HR |
|101 |2 |STEVE |HR |
|102 |3 |BRUCE |SALES |
|102 |4 |WANDA |SALES |
|103 |5 |VICTOR |IT |
----------------------------------------------------
3.3. Rename Ambiguous Columns of Join operation Output in Snowpark
When two DataFrames are joined, the overlapping columns will have random column names in the
resulting DataFrame as seen in the above examples.
The randomly named columns can be renamed using Column.alias as shown below.
#// Renaming the ambiguous columns
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).\
select(df_emp.ID, df_emp.NAME.alias("EMP_NAME"), df_emp.DEPT_ID.alias("DEPT_ID"),
df_dept.NAME.alias("DEPT_NAME")).show()
-----------------------------------------------
|"ID" |"EMP_NAME" |"DEPT_ID" |"DEPT_NAME" |
-----------------------------------------------
|1 |TONY |101 |HR |
|2 |STEVE |101 |HR |
|3 |BRUCE |102 |SALES |
|4 |WANDA |102 |SALES |
|5 |VICTOR |103 |IT |
-----------------------------------------------
3.4. Rename Ambiguous Columns of Join operation Output using lsuffix and rsuffix
The randomly named overlapping columns can be renamed using lsuffix and rsuffix parameters
in DataFrame.join method.
lsuffix – Suffix to add to the overlapping columns of the left DataFrame.
rsuffix – Suffix to add to the overlapping columns of the right DataFrame.
#// Renaming the ambiguous columns using lsuffix and rsuffix
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP",
rsuffix="_DEPT").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
--------------------------------------------------------------------
It is recommended to use lsuffix and rsuffix parameters within DataFrame.join method when there
are overlapping columns between the DataFrames.
3.5. Join DataFrames based on Multiple Conditions in Snowpark.
DataFrames can be joined based on multiple conditions separated by the “&” symbol as shown
below.
DataFrame.join(right DataFrame, (<join_condition_1>) & (<join_condition_2>))
When the names of the columns are the same between both the DataFrames, they can be joined by
passing a list of column names as shown below.
DataFrame.join(right DataFrame, ["col_1", "col_2", ..]
The following is an example of joining EMPLOYEES and DEPARTMENTS based on multiple
conditions.
#// Joining two DataFrames based on Multiple conditions
df_emp.join(df_dept, (df_emp.DEPT_ID == df_dept.DEPT_ID) & (df_emp.ID <
df_dept.DEPT_ID), \
lsuffix="_EMP", rsuffix="_DEPT").show()
3.6. Join Types in Snowpark
By default, the DataFrame.join method applies an inner join to join the data between the two
DataFrames. The other supported join types can be specified to join the data between two DataFrames
as shown below.
#// Left Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT",
join_type="left").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
|6 |HANK |105 |NULL |NULL |
--------------------------------------------------------------------
Instead of the “join_type” parameter, we can also use the “how” parameter to specify the join
type.
#// Right Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT",
how="right").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
|NULL |NULL |NULL |104 |FINANCE |
--------------------------------------------------------------------
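For completeness, a full outer join can be written the same way. The following is a minimal sketch
assuming the same df_emp and df_dept DataFrames; the result would be expected to include both the
unmatched employee (HANK) and the unmatched department (FINANCE) with NULLs.
#// Full Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT",
how="full").show()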
GROUP BY in Snowflake Snowpark
February 2, 2024
Contents
1. Introduction
2. GROUP BY clause in Snowpark
3. Demonstration of GROUP BY clause in Snowpark
3.1. Find the Number of Employees in each Department
3.2. Find the MAX and MIN Salary of employees in each Department
4. HAVING Clause in Snowflake Snowpark
4.1. Find the Departments with more than two employees
1. Introduction
The GROUP BY clause in SQL is utilized in conjunction with the SELECT statement to aggregate
data from multiple records and organize the results based on one or more columns. The GROUP BY
clause returns a single row for each group.
In this article, let us explore how to implement the GROUP BY clause on rows of a DataFrame in
Snowflake Snowpark.
2. GROUP BY clause in Snowpark
The DataFrame.group_by method in Snowpark is similar to the GROUP BY clause in Snowflake
and groups rows based on the specified columns.
Syntax:
One or multiple columns can be passed as inputs to the group_by method as shown below.
DataFrame.group_by("col_1", "col_2",…)
A List of column names can be passed as inputs to the group_by method as shown below.
DataFrame.group_by(["col_1", "col_2",…])
Return Value:
The DataFrame.group_by method returns a RelationalGroupedDataFrame as an output.
A RelationalGroupedDataFrame is a representation of an underlying DataFrame where rows are
organized into groups based on common values. Aggregations can then be defined on top of this
grouped data.
>>> df = df_employee.group_by("DEPT_ID")
>>> type(df)
<class 'snowflake.snowpark.relational_grouped_dataframe.RelationalGroupedDataFrame'>
Unlike a regular DataFrame, only a limited set of methods are supported on a
RelationalGroupedDataFrame. To know the full list of methods supported on a
RelationalGroupedDataFrame, refer to the Snowflake Documentation.
3. Demonstration of GROUP BY clause in Snowpark
Consider the EMPLOYEE data below for the demonstration of the implementation of the GROUP
BY in Snowpark.
#// create dataframe with employee data
employee_data = [
[1,'TONY',24000,101],
[2,'STEVE',17000,101],
[3,'BRUCE',9000,101],
[4,'WANDA',20000,102],
[5,'VICTOR',12000,102],
[6,'STEPHEN',10000,103],
[7,'HANK',15000,103],
[8,'THOR',21000,103]
]
------------------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |"DEPT_ID" |
------------------------------------------------
|1 |TONY |24000 |101 |
|2 |STEVE |17000 |101 |
|3 |BRUCE |9000 |101 |
|4 |WANDA |20000 |102 |
|5 |VICTOR |12000 |102 |
|6 |STEPHEN |10000 |103 |
|7 |HANK |15000 |103 |
|8 |THOR |21000 |103 |
------------------------------------------------
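For reference, the df_employee DataFrame used in the examples below can be created from this data.
The following is a minimal sketch assuming the column names shown in the output above.
#// create the EMPLOYEES DataFrame from the sample data
df_employee = session.createDataFrame(employee_data, schema=["EMP_ID","EMP_NAME","SALARY","DEPT_ID"])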
3.1. Find the Number of Employees in each Department
The following is the SQL Query that calculates the number of employees in each department.
SELECT DEPT_ID, COUNT(EMP_ID)
FROM EMPLOYEES
GROUP BY DEPT_ID;
The same can be achieved in Snowpark using the DataFrame.group_by method as shown below
>>> from snowflake.snowpark.functions import count
>>> df_employee.group_by("DEPT_ID").agg(count("EMP_ID")).show()
-------------------------------
|"DEPT_ID" |"COUNT(EMP_ID)" |
-------------------------------
|101 |3 |
|102 |2 |
|103 |3 |
-------------------------------
3.2. Find the MAX and MIN Salary of employees in each Department
The following is the SQL Query that calculates the MAX and MIN salary of employees in each
department.
SELECT DEPT_ID,
MAX(SALARY) MAX_SALARY, MIN(SALARY) MIN_SALARY
FROM EMPLOYEES
GROUP BY DEPT_ID
;
The same can be achieved in Snowpark using the DataFrame.group_by method as shown below.
>>> from snowflake.snowpark.functions import max, min
>>> df_employee.group_by("DEPT_ID").agg(max("SALARY"), min("SALARY")).show()
---------------------------------------------
|"DEPT_ID" |"MAX(SALARY)" |"MIN(SALARY)" |
---------------------------------------------
|101 |24000 |9000 |
|102 |20000 |12000 |
|103 |21000 |10000 |
---------------------------------------------
Note that we are employing Aggregate Functions using the DataFrame.agg method in conjunction
with the DataFrame.group_by method to achieve the solution.
Aliases can be added to the aggregate fields using the Column.alias method to return renamed
column names, as shown below.
>>> df_employee.group_by("DEPT_ID").agg(max("SALARY").alias("MAX_SALARY"),
min("SALARY").alias("MIN_SALARY")).show()
-------------------------------------------
|"DEPT_ID" |"MAX_SALARY" |"MIN_SALARY" |
-------------------------------------------
|101 |24000 |9000 |
|102 |20000 |12000 |
|103 |21000 |10000 |
-------------------------------------------
4. HAVING Clause in Snowflake Snowpark
The HAVING clause in SQL is used in conjunction with the GROUP BY clause to filter the results
of a query based on aggregated values. Unlike the WHERE clause, which filters individual rows
before they are grouped, the HAVING clause filters the result set after the grouping and aggregation
process.
In Snowpark, there is no equivalent method that provides the functionality of the HAVING clause in
SQL. Instead, we can use DataFrame.filter method which filters rows of a DataFrame based on the
specified conditional expression (similar to WHERE in SQL).
Let us understand with an example.
4.1. Find the Departments with more than two employees
The following is the SQL Query that returns the departments with more than two employees.
SELECT DEPT_ID
FROM EMPLOYEES
GROUP BY DEPT_ID
HAVING COUNT(EMP_ID)>2;
Follow the below steps to return the departments with more than two employees in Snowpark.
STEP-1: Find the Number of Employees in each Department
using DataFrame.group_by and DataFrame.agg methods as shown below.
>>> df1 = df_employee.group_by("DEPT_ID").agg(count("EMP_ID").alias("EMP_COUNT"))
>>> df1.show()
---------------------------
|"DEPT_ID" |"EMP_COUNT" |
---------------------------
|101 |3 |
|102 |2 |
|103 |3 |
---------------------------
STEP-2: Filter the records with employee count >2 using the DataFrame.filter method as shown
below.
>>> df2 = df1.filter(col("EMP_COUNT") > 2)
>>> df2.show()
---------------------------
|"DEPT_ID" |"EMP_COUNT" |
---------------------------
|101 |3 |
|103 |3 |
---------------------------
STEP-3: Select only the Department ID field using the DataFrame.select method as shown below.
>>> df3 = df2.select("DEPT_ID")
>>> df3.show()
-------------
|"DEPT_ID" |
-------------
|101 |
|103 |
-------------
All these steps can be combined into a single command, as shown below.
>>> df_employee.group_by("DEPT_ID").agg(count("EMP_ID").alias("EMP_COUNT")).\
filter(col("EMP_COUNT")>2).select("DEPT_ID").show()
-------------
|"DEPT_ID" |
-------------
|101 |
|103 |
-------------
This is equivalent to the SQL query below, where an outer query is employed to filter the departments
using the WHERE clause.
SELECT DEPT_ID FROM(
SELECT DEPT_ID, COUNT(EMP_ID) EMP_COUNT
FROM EMPLOYEES
GROUP BY DEPT_ID)
WHERE EMP_COUNT>2;
Aggregate Functions in Snowflake Snowpark
January 27, 2024
Contents
1. Introduction
2. Aggregate Functions in Snowpark
3. Demonstration of Aggregate Functions using DataFrame.agg Method in Snowpark
3.1. Passing a DataFrame Column Object
3.2. Passing a Tuple with Column Name and Aggregate Function
3.3. Passing a List of Column Objects and Tuple
3.4. Passing a dictionary Mapping Column Name to Aggregate Function
4. Aggregate Functions using DataFrame.select method in Snowpark
5. Renaming the Return Aggregate Fields
6. Passing Return Value of an Aggregate Function as an Input
1. Introduction
Aggregate functions perform a calculation on a set of values and return a single value. These
functions are often used in conjunction with the GROUP BY clause to perform calculations on
groups of rows.
To know the list of all the supported aggregate functions in Snowflake, refer to Snowflake
Documentation.
In this article, we will explore how to use aggregate functions in Snowflake Snowpark Python.
2. Aggregate Functions in Snowpark
The DataFrame.agg method in Snowpark is used to aggregate the data in a DataFrame. This method
accepts any valid Snowflake aggregate function names as input to perform calculations on multiple
rows and produce a single output.
There are several ways the DataFrame columns can be passed to DataFrame.agg method to perform
aggregate calculations.
1. A Column object
2. A tuple where the first element is a column object or a column name and the second
element is the name of the aggregate function
3. A list of the above
4. A dictionary that maps column name to an aggregate function name.
3. Demonstration of Aggregate Functions using DataFrame.agg Method in Snowpark
Follow the below steps to perform Aggregate Calculations using DataFrame.agg Method.
STEP-1: Establish a connection with Snowflake from Snowpark using
the Session class.
STEP-2: Import all the required aggregate functions (min, max, sum, etc.,) from
the snowflake.snowpark.functions package.
STEP-3: Create a DataFrame that holds the data on which aggregate functions are to
be applied.
STEP-4: Implement aggregate calculations on the DataFrame using the
DataFrame.agg method.
Demonstration
Consider the EMPLOYEE data below for the demonstration of the implementation of the Aggregate
functions in Snowpark.
#// Creating a DataFrame with EMPLOYEE data
employee_data = [
[1,'TONY',24000],
[2,'STEVE',17000],
[3,'BRUCE',9000],
[4,'WANDA',20000],
[5,'VICTOR',12000],
[6,'STEPHEN',10000]
]
------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |
------------------------------------
|1 |TONY |24000 |
|2 |STEVE |17000 |
|3 |BRUCE |9000 |
|4 |WANDA |20000 |
|5 |VICTOR |12000 |
|6 |STEPHEN |10000 |
------------------------------------
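For reference, the df_employee DataFrame used in the examples below can be created from this data.
The following is a minimal sketch assuming the column names shown in the output above.
#// create the EMPLOYEE DataFrame from the sample data
df_employee = session.createDataFrame(employee_data, schema=["EMP_ID","EMP_NAME","SALARY"])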
3.1. Passing a DataFrame Column Object
Import all the necessary aggregate function methods from
the snowflake.snowpark.functions package before performing aggregate calculations as shown
below.
#// Importing the Aggregate Function methods
from snowflake.snowpark.functions import col, min, max, avg
df_employee.agg(max(col("SALARY")), min(col("SALARY"))).show()
---------------------------------
|"MAX(SALARY)" |"MIN(SALARY)" |
---------------------------------
|24000 |9000 |
---------------------------------
3.2. Passing a Tuple with Column Name and Aggregate Function
#// Passing a tuple with column name and aggregate function to DataFrame.agg method
df_employee.agg(("SALARY", "max"), ("SALARY", "min")).show()
---------------------------------
|"MAX(SALARY)" |"MIN(SALARY)" |
---------------------------------
|24000 |9000 |
---------------------------------
3.3. Passing a List of Column Objects and Tuple
#// Passing a list of the values
df_employee.agg([("SALARY", "min"), ("SALARY", "max"), avg(col("SALARY"))]).show()
-------------------------------------------------
|"MIN(SALARY)" |"MAX(SALARY)" |"AVG(SALARY)" |
-------------------------------------------------
|9000 |24000 |15333.333333 |
-------------------------------------------------
3.4. Passing a dictionary Mapping Column Name to Aggregate Function
#// Passing a dictionary mapping column name to aggregate function
df_employee.agg({"SALARY": "min"}).show()
-----------------
|"MIN(SALARY)" |
-----------------
|9000 |
-----------------
4. Aggregate Functions using DataFrame.select method in Snowpark
The DataFrame.select method can be used to return a new DataFrame with the specified Column
expressions as output. Aggregate functions can be utilized as column expressions to select and process
data from a DataFrame.
#// Aggregate functions using select method
df_employee.select(min("SALARY"), max("SALARY")).show()
-----------------------------------------
|"MIN(""SALARY"")" |"MAX(""SALARY"")" |
-----------------------------------------
|9000 |24000 |
-----------------------------------------
5. Renaming the Return Aggregate Fields
The output fields from the Aggregate Functions can be renamed to new column names
using the Column.as_ or Column.alias methods as shown below.
#// Renaming column names
df_employee.agg(min("SALARY").as_("min_sal"), max("SALARY").alias("max_sal")).show()
-------------------------
|"MIN_SAL" |"MAX_SAL" |
-------------------------
|9000 |24000 |
-------------------------
df_employee.select(min("SALARY").as_("MIN_SAL"),
max("SALARY").alias("MAX_SAL")).show()
-------------------------
|"MIN_SAL" |"MAX_SAL" |
-------------------------
|9000 |24000 |
-------------------------
6. Passing Return Value of an Aggregate Function as an Input
Let us understand this with a simple example. Consider the requirement is to get the employee details
with max salary. This can be accomplished using the below SQL query.
-- Get employee details with MAX Salary
SELECT * FROM EMPLOYEES WHERE SALARY IN(
SELECT MAX(SALARY) FROM EMPLOYEES) ;
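In Snowpark, the inner query can be evaluated first by collecting the aggregate result into a local list of
Row objects. The following is a minimal sketch assuming the df_employee DataFrame defined above.
#// Collect the max salary into a local list of Row objects
df_max_sal = df_employee.agg(max("SALARY").alias("MAX_SALARY")).collect()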
print(df_max_sal)
----------------------------
|[Row(MAX_SALARY=24000)] |
----------------------------
Extract the max salary amount from the list object as shown below.
max_sal = df_max_sal[0]['MAX_SALARY']
type(max_sal)
-----------------
|<class 'int'> |
-----------------
print(max_sal)
----------
|24000 |
----------
The DataFrame.filter method filters rows from a DataFrame based on the specified conditional
expression (similar to WHERE in SQL).
The following code extracts the employee details with max salary.
#// Get employee details with max salary
df_employee.filter(col("SALARY") == max_sal).show()
------------------------------------
|"EMP_ID" |"EMP_NAME" |"SALARY" |
------------------------------------
|1 |TONY |24000 |
------------------------------------
HOW TO: Create and Read Data from Snowflake Snowpark DataFrames?
January 10, 2024
Contents
1. Introduction
2. What is a DataFrame?
3. Pre-requisites to create a DataFrame in Snowpark
4. How to create a DataFrame in Snowpark?
5. How to Read data from a Snowpark DataFrame?
6. How to create a DataFrame in Snowpark with a List of Specified Values?
7. How to create a DataFrame in Snowpark with a List of Specified Values and Schema?
8. How to create a DataFrame in Snowpark using Pandas?
9. How to create a DataFrame in Snowpark from a range of numbers?
10. How to create a DataFrame in Snowpark from a Database Table?
11. How to create a DataFrame in Snowpark by reading files from a stage?
1. Introduction
Snowpark is a developer framework from Snowflake that allows developers to interact with
Snowflake directly and build complex data pipelines. In our previous article, we discussed what
Snowflake Snowpark is and how to set up a Python development environment for Snowpark.
In Snowpark, the primary method for querying and processing data is through a DataFrame. In this
article, we will explore what DataFrames are and guide you through the process of creating them in
Snowpark.
2. What is a DataFrame?
A DataFrame in Snowpark acts like a virtual table that organizes data in a structured manner.
Think of it as a way to express a SQL query, but in a different language. It operates lazily,
meaning it doesn’t process the data until you instruct it to perform a specific task, such as
retrieving or analyzing information.
The Snowpark API finally converts the DataFrames into SQL to execute your code in Snowflake.
3. Pre-requisites to create a DataFrame in Snowpark
To construct a DataFrame, you have to make use of the Session class in Snowpark, which establishes a
connection with a Snowflake database and provides methods for creating DataFrames and accessing
objects.
When you create a Session object, you provide connection parameters to establish a connection with a
Snowflake database as shown below.
import snowflake.snowpark as snowpark
from snowflake.snowpark import Session
connection_parameters = {
"account": "snowflake account",
"user": "snowflake username",
"password": "snowflake password",
"role": "snowflake role", # optional
"warehouse": "snowflake warehouse", # optional
"database": "snowflake database", # optional
"schema": "snowflake schema" # optional
}
session = Session.builder.configs(connection_parameters).create()
To create DataFrames in a Snowsight Python worksheet, construct them within the handler function
(main) and utilize the Session object (session) passed into the function.
def main(session: snowpark.Session):
# your code goes here
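For example, a minimal worksheet handler might look like the following sketch, assuming the imports
shown above; the sample values are placeholders.
#// minimal Snowsight worksheet handler (sketch)
def main(session: snowpark.Session):
    #// build a DataFrame and return it so Snowsight displays the results
    df = session.createDataFrame([[1, 'SNOW'], [2, 'PARK']], schema=["ID", "NAME"])
    return df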
4. How to create a DataFrame in Snowpark?
The createDataFrame method of Session class in Snowpark creates a new DataFrame containing the
specified values from the local data.
Syntax:
The following is the syntax to create a DataFrame using createDataFrame method.
session.createDataFrame(data[, schema])
The accepted values for data in the createDataFrame method are List, Tuple or a Pandas
DataFrame.
Lists are used to store multiple items in a single variable and are created using square
brackets.
ex: myList = ["one", "two", "three"]
Tuples are used to store multiple items in a single variable and are created using
round brackets. The contents of a tuple cannot change once they have been created in
Python.
ex: myTuple = ("one", "two", "three")
Pandas is a Python library used for working with data sets. Pandas allows the
creation of DataFrames natively in Python.
The schema in the createDataFrame method can be a StructType containing names and data types of
columns, or just a list of column names, or None.
5. How to Read data from a Snowpark DataFrame?
Data from a Snowpark DataFrame can be retrieved by utilizing the show method.
Syntax:
The following is the syntax to read data from a Snowpark DataFrame using show method.
DataFrame.show([n, max_width])
Parameters:
n – The value represents the number of rows to print out. The default value is 10.
max_width – The maximum number of characters to print out for each column.
6. How to create a DataFrame in Snowpark with a List of Specified Values?
Example-1:
The following is an example of creating a DataFrame with a list of values and assigning the column
name as “a”.
df1 = session.createDataFrame([1,2,3,4], schema=["a"])
df1.show()
------
|"A" |
------
|1 |
|2 |
|3 |
|4 |
------
The DataFrame df1 when executed is translated and executed as SQL in Snowflake by Snowpark API
as shown below.
SELECT "A" FROM (
SELECT $1 AS "A"
FROM VALUES (1::INT), (2::INT), (3::INT), (4::INT)
) LIMIT 10
Example-2:
The following is an example of creating a DataFrame with multiple lists of values and assigning the
column names as “a”,”b”, “c” and “d”.
df2 = session.createDataFrame([[1,2,3,4],[5,6,7,8]], schema=["a","b","c","d"])
df2.show()
--------------------------
|"A" |"B" |"C" |"D" |
--------------------------
|1 |2 |3 |4 |
|5 |6 |7 |8 |
--------------------------
The DataFrame df2 when executed is translated and executed as SQL in Snowflake by Snowpark API
as shown below.
SELECT "A", "B", "C", "D" FROM (
SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D"
FROM VALUES
(1::INT, 2::INT, 3::INT, 4::INT),
(5::INT, 6::INT, 7::INT, 8::INT)
) LIMIT 10
7. How to create a DataFrame in Snowpark with a List of Specified Values and Schema?
When schema parameter in the createDataFrame method is passed as a list of column names or
None, the schema of the DataFrame will be inferred from the data across all rows.
Example-3:
The following is an example of creating a DataFrame with multiple lists of values with different data
types and assigning the column names as “a”,”b”, “c” and “d”.
df3 = session.createDataFrame([[1, 2, 'Snow', '2024-01-01'],[3, 4, 'Park', '2024-01-02']],
schema=["a","b","c","d"])
df3.show()
----------------------------------
|"A" |"B" |"C" |"D" |
----------------------------------
|1 |2 |Snow |2024-01-01 |
|3 |4 |Park |2024-01-02 |
----------------------------------
The DataFrame df3 when executed is translated and executed as SQL in Snowflake by Snowpark API
as shown below.
SELECT "A", "B", "C", "D" FROM (
SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D"
FROM VALUES
(1::INT, 2::INT, 'Snow'::STRING, '2024-01-01'::STRING),
(3::INT, 4::INT, 'Park'::STRING, '2024-01-02'::STRING)
) LIMIT 10
Note that in the above query, since we did not explicitly specify the data types of the columns
during definition, the values ‘2024-01-01’ and ‘2024-01-02’, despite being of “Date” data type,
are identified as “String” data type.
Example-4:
Create a custom schema parameter of StructType containing names and data types of columns and
pass it to the createDataFrame method as shown below.
#//create dataframe with schema
from snowflake.snowpark.types import IntegerType, StringType, StructField, StructType, DateType
my_schema = StructType(
[StructField("a", IntegerType()),
StructField("b", IntegerType()),
StructField("c", StringType()),
StructField("d", DateType())]
)
df4 = session.createDataFrame([[1, 2, 'Snow', '2024-01-01'],[3, 4, 'Park', '2024-01-02']],
schema=my_schema)
df4.show()
------------------------------------------
|"A" |"B" |"C" |"D" |
------------------------------------------
|1 |2 |Snow |2024-01-01 00:00:00 |
|3 |4 |Park |2024-01-02 00:00:00 |
------------------------------------------
The DataFrame df4 when executed is translated and executed as SQL in Snowflake by the Snowpark API,
referencing the columns with the defined data types as shown below.
SELECT
"A", "B", "C",
to_date("D") AS "D"
FROM (
SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D"
FROM VALUES
(1::INT, 2::INT, 'Snow'::STRING, '2024-01-01'::STRING),
(3::INT, 4::INT, 'Park'::STRING, '2024-01-02'::STRING)
) LIMIT 10
Note that in the above query, the column “D” is read as Date data type in Snowflake.
8. How to create a DataFrame in Snowpark using Pandas?
A Pandas DataFrame can be passed as “data” to create a DataFrame in Snowpark.
Example-5:
The following is an example of creating a Snowpark DataFrame using pandas DataFrame.
import pandas as pd
df_pandas = session.createDataFrame(pd.DataFrame([1,2,3],columns=["a"]))
df_pandas.show()
------
|"a" |
------
|1 |
|2 |
|3 |
------
Unlike DataFrames created with Lists or Tuples using the ‘createDataFrame‘ method, when a
DataFrame is created using a pandas DataFrame, the Snowpark API creates a temporary table
and imports the data from the pandas DataFrame into it. When extracting data from the
Snowpark DataFrame created using the pandas DataFrame, the data is retrieved by querying
the temporary table.
The DataFrame df_pandas when executed is translated and executed as SQL in Snowflake by
Snowpark API as shown below.
SELECT * FROM
"SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA"."SNOWPARK_TEMP_TABLE_9RSV8KITUO" LIMIT 10
9. How to create a DataFrame in Snowpark from a range of numbers?
A DataFrame from a range of numbers can be created using the range method of the Session class in
Snowpark. The resulting DataFrame has a single column named “ID” containing elements in a range
from start to end.
Syntax:
The following is the syntax to create a DataFrame using range method.
session.range(start[, end, step])
Parameters:
start : The start value of the range. If end is not specified, start will be used as the
value of end.
end : The end value of the range.
step : The step or interval between numbers.
Example-6:
The following is an example of creating a DataFrame with a range of numbers from 1 to 9.
df_range = session.range(1,10).to_df("a")
df_range.show()
-------
|"A" |
-------
|1 |
|2 |
|3 |
|4 |
|5 |
|6 |
|7 |
|8 |
|9 |
-------
The DataFrame df_range when executed is translated and executed as SQL in Snowflake by
Snowpark API as shown below.
SELECT * FROM (
SELECT ( ROW_NUMBER() OVER ( ORDER BY SEQ8() ) - 1 ) * (1) + (1) AS id
FROM ( TABLE (GENERATOR(ROWCOUNT => 9)))
) LIMIT 10
Example-7:
The following is an example of creating a DataFrame with a range of numbers from 1 to 9 with a step
value of 2 and returning the output column renamed as “A”.
df_range2 = session.range(1,10,2).to_df("a")
df_range2.show()
-------
|"A" |
-------
|1 |
|3 |
|5 |
|7 |
|9 |
-------
The DataFrame df_range2 when executed is translated and executed as SQL in Snowflake by
Snowpark API as shown below.
SELECT "ID" AS "A" FROM (
SELECT ( ROW_NUMBER() OVER ( ORDER BY SEQ8() ) - 1 ) * (2) + (1) AS id
FROM ( TABLE (GENERATOR(ROWCOUNT => 5)))
) LIMIT 10
10. How to create a DataFrame in Snowpark from a Database Table?
The sql and table methods of Session class in Snowpark can be used to create a DataFrame from a
Database Table.
Example-8:
The following is an example of creating a DataFrame from a database table by executing a SQL query
using sql method of Session class in Snowpark.
df_sql = session.sql("SELECT * FROM SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.MONTHLY_REVENUE")
df_sql.show(5)
----------------------------------
|"YEAR" |"MONTH" |"REVENUE" |
----------------------------------
|2012 |5 |3264300.11 |
|2012 |6 |3208482.33 |
|2012 |7 |3311966.98 |
|2012 |8 |3311752.81 |
|2012 |9 |3208563.06 |
----------------------------------
The DataFrame df_sql when executed is translated and executed as SQL in Snowflake by Snowpark
API as shown below.
SELECT * FROM (SELECT * FROM
SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.MONTHLY_REVENUE) LIMIT 5
Example-9:
The following is an example of creating a DataFrame from a database table using the table method of
the Session class in Snowpark.
df_table = session.table("MONTHLY_REVENUE")
df_table.show(5)
----------------------------------
|"YEAR" |"MONTH" |"REVENUE" |
----------------------------------
|2012 |5 |3264300.11 |
|2012 |6 |3208482.33 |
|2012 |7 |3311966.98 |
|2012 |8 |3311752.81 |
|2012 |9 |3208563.06 |
----------------------------------
The DataFrame df_table when executed is translated and executed as SQL in Snowflake by
Snowpark API as shown below.
SELECT * FROM MONTHLY_REVENUE LIMIT 5
11. How to create a DataFrame in Snowpark by reading files from a stage?
DataFrameReader class in Snowpark provides methods for loading data from a Snowflake stage to a
DataFrame with format-specific options. To use it:
1. Create a DataFrameReader object through Session.read method.
2. For CSV file format, create a custom schema parameter of StructType containing
names and data types of columns.
3. Set the file format specific properties such as delimiter using options() method.
4. Specify the file path and stage details by calling the method corresponding to the
CSV format, csv().
Example-10:
The following is an example of creating a DataFrame in Snowpark by reading CSV files from S3
stage.
from snowflake.snowpark.types import IntegerType, StringType, StructField, StructType
schema = StructType(
[StructField("EMPLOYEE_ID", IntegerType()),
StructField("FIRST_NAME", StringType()),
StructField("LAST_NAME", StringType()),
StructField("EMAIL", StringType())
])
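The read call itself is not shown above. The following is a minimal sketch of what it might look like,
using the schema defined above and the stage path @my_s3_stage/Inbox/ referenced in the generated
SQL further below; the delimiter and header options are assumptions based on that SQL.
#// read the staged CSV files into a DataFrame (sketch)
df_s3_employee = session.read.schema(schema).options({"field_delimiter": ",", "skip_header": 1}).csv("@my_s3_stage/Inbox/")
df_s3_employee.show(5)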
--------------------------------------------------------------
| EMPLOYEE_ID | FIRST_NAME | LAST_NAME | EMAIL |
--------------------------------------------------------------
| 204384 | Steven | King | SKING@test.com |
| 204388 | Neena | Kochhar | NKOCHHAR@test.com |
| 204392 | Lex | De Haan | LDEHAAN@test.com |
| 204393 | Alexander | Hunold | AHUNOLD@test.com |
| 204394 | Bruce | Ernst | BERNST@test.com |
--------------------------------------------------------------
The DataFrame df_s3_employee when executed is translated and executed as SQL in Snowflake by
Snowpark API as shown below.
1. A temporary file format is created using the properties specified in
the options() method.
2. The stage files are queried using the file format created in the first step, and the
columns are cast into the data types specified in the defined schema.
3. The file format created in the first step is dropped.
--create a temporary file format
CREATE SCOPED TEMPORARY FILE FORMAT IF NOT EXISTS
"SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598
TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1
--select data from stage files using the temporary file format
SELECT * FROM (
SELECT
$1::INT AS "EMPLOYEE_ID",
$2::STRING AS "FIRST_NAME",
$3::STRING AS "LAST_NAME",
$4::STRING AS "EMAIL"
FROM @my_s3_stage/Inbox/( FILE_FORMAT =>
'"SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598')
) LIMIT 5
--drop the temporary file format
DROP FILE FORMAT IF EXISTS
"SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598
df_customer = session.table("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER")
df_customer_filter = df_customer.filter(col("C_NATIONKEY")=='15')
df_customer_select = df_customer_filter.select(col("C_CUSTKEY"),col("C_NAME"))
Overwrite Data
The following code writes the contents of the df_customer_select DataFrame to the specified
Snowflake table, overwriting the table’s existing data if it already exists.
customer_wrt = df_customer_select.write.mode("overwrite").save_as_table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER")
When executed, this code is translated and executed as SQL in Snowflake through the Snowpark API.
The resulting SQL statement is as follows:
--creates a new table overwriting the existing one if already exists
CREATE OR REPLACE TABLE
SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER (
"C_CUSTKEY" BIGINT NOT NULL,
"C_NAME" STRING(25) NOT NULL
) AS SELECT * FROM (
SELECT "C_CUSTKEY", "C_NAME" FROM
SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER WHERE ("C_NATIONKEY" = '15')
);
The following code confirms that the table is created and displays the count of records loaded into the
CUSTOMER table.
session.table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER").count()
+----+
|5921|
+----+
Append Data
The following code appends the contents of the df_customer_select DataFrame to the specified
Snowflake table, adding new records to the existing ones.
customer_wrt = df_customer_select.write.mode("append").save_as_table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER")
When executed, this code is translated and executed as SQL in Snowflake through the Snowpark API.
The resulting SQL statement is as follows:
--verifies if specified table already exists
show tables like 'CUSTOMER' in schema
SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA
--Inserts data since the table already exists; otherwise the table would be created first and then loaded.
INSERT INTO SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER
SELECT "C_CUSTKEY", "C_NAME"
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
WHERE ("C_NATIONKEY" = '15')
The following code displays the count of records in the CUSTOMER table. Since we already
created and loaded the table in the preceding step, the increased record count confirms that the data was appended.
session.table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER").count()
+------+
|11842 |
+------+
Ignore Data
The following code ignores the write operation if the specified table already exists.
customer_wrt = df_customer_select.write.mode("ignore").save_as_table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER")
When executed, this code is translated and executed as SQL in Snowflake through the Snowpark API.
The resulting SQL statement is as follows:
--creates table only if not already existing
CREATE TABLE IF NOT EXISTS
SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER (
"C_CUSTKEY" BIGINT NOT NULL,
"C_NAME" STRING(25) NOT NULL
) AS
SELECT * FROM (
SELECT "C_CUSTKEY", "C_NAME"
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
WHERE "C_NATIONKEY" = '15'
);
The following code displays the count of records in the CUSTOMER table. Since we already
created and loaded the table in the preceding steps, the unchanged record count confirms that no data was written
into the table.
session.table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER").count()
+------+
|11842 |
+------+
Throw Error
The following code throws an exception if the specified table already exists.
customer_wrt = df_customer_select.write.mode("errorifexists").save_as_table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER")
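Because the CUSTOMER table already exists from the earlier steps, this write fails with an "object already exists" error. The following is a minimal sketch of handling that failure; the SnowparkSQLException class from snowflake.snowpark.exceptions is treated here as an assumption and is not part of the original example.
from snowflake.snowpark.exceptions import SnowparkSQLException

try:
    df_customer_select.write.mode("errorifexists").save_as_table("SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CUSTOMER")
except SnowparkSQLException as e:
    #// the target table already exists, so Snowflake rejects the CREATE TABLE statement
    print("Write failed:", e)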
STEP-1: Create a Snowpark Session by defining the Snowflake connection parameters, as shown below.
from snowflake.snowpark import Session

connection_parameters = {
"account": "snowflake account",
"user": "snowflake username",
"password": "snowflake password",
"role": "snowflake role", # optional
"warehouse": "snowflake warehouse", # optional
"database": "snowflake database", # optional
"schema": "snowflake schema" # optional
}
session = Session.builder.configs(connection_parameters).create()
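To confirm that the session was created successfully, the current context can be printed. This is a minimal sketch; the get_current_database and get_current_schema methods are standard Session methods but are an addition here, not part of the original walkthrough.
#// verify the session by printing the database and schema it is connected to
print(session.get_current_database(), session.get_current_schema())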
STEP-2: Define a schema parameter of StructType containing names and data types of columns.
StructType is a data type representing a collection of fields that may have different data types. It is
commonly used to define the schema of a DataFrame or a column with a nested structure.
Define a schema parameter as a StructType that includes the names and corresponding data types of
the columns in the CSV files from which data should be copied as shown below.
from snowflake.snowpark.types import IntegerType, StringType, StructField, StructType, DateType
schema = StructType(
[StructField("EMPLOYEE_ID", IntegerType()),
StructField("FIRST_NAME", StringType()),
StructField("LAST_NAME", StringType()),
StructField("EMAIL", StringType()),
StructField("HIRE_DATE", DateType()),
StructField("SALARY", IntegerType())
])
STEP-3: Read Data from Staged Files into a Snowpark DataFrame using Session.read Method
The DataFrameReader class in Snowpark provides methods for reading data from a Snowflake stage into a
DataFrame with file-format-specific options. A DataFrameReader object can be created
through the Session.read method, as shown below.
#// Use the DataFrameReader (session.read) to read from a CSV file
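#// NOTE: the original read call is not shown in the source; the lines below are a minimal sketch,
#// assuming the same stage (my_s3_stage), file (employee.csv), and CSV options used in the later examples
df_employee = session.read.schema(schema) \
    .options({"field_delimiter": ",", "skip_header": 1}) \
    .csv('@my_s3_stage/Inbox/employee.csv')
#// display the data read from the staged file (output shown below)
df_employee.show()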
-------------------------------------------------------------------------------------
|"EMPLOYEE_ID" |"FIRST_NAME" |"LAST_NAME" |"EMAIL"   |"HIRE_DATE" |"SALARY" |
-------------------------------------------------------------------------------------
|100           |Steven       |King        |SKING     |2003-06-17  |24000    |
|101           |Neena        |Kochhar     |NKOCHHAR  |2005-09-21  |17000    |
|102           |Lex          |De Haan     |LDEHAAN   |2001-01-13  |17000    |
|103           |Alexander    |Hunold      |AHUNOLD   |2006-01-03  |9000     |
|104           |Bruce        |Ernst       |BERNST    |2007-05-21  |6000     |
-------------------------------------------------------------------------------------
STEP-4: COPY Data from Snowpark DataFrame INTO Snowflake table using
DataFrame.copy_into_table Method
DataFrame.copy_into_table method in Snowpark executes a COPY INTO <table> command to
load data from files in a stage location into a specified table.
This method differs slightly from the COPY INTO <table> command: when the input files are CSV and the
target table does not exist, it creates the table automatically, which the COPY INTO <table> command does not do.
Syntax:
The following is the syntax to copy data from a CSV file into a table using
the DataFrame.copy_into_table method in Snowpark.
DataFrame.copy_into_table(table_name, [optional_parameters])
Example:
The following code copies the data from DataFrame df_employee into the EMPLOYEE table in
Snowflake.
#// writing data using DataFrame.copy_into_table method
copied_into_result = df_employee.copy_into_table("employee")
When executed, this code is translated and executed as SQL in Snowflake through the Snowpark API.
The resulting SQL statement is as follows:
-- Verifying if table exists
show tables like 'employee'
-- Creating table
CREATE TABLE employee(
"EMPLOYEE_ID" INT,
"FIRST_NAME" STRING,
"LAST_NAME" STRING,
"EMAIL" STRING,
"HIRE_DATE" DATE,
"SALARY" INT)
The files parameter can be used to copy data from specific files only in the stage location, and force=True loads the files even if they have already been loaded earlier.
#// Copies data from only the specified files in the stage location
copied_into_result = df_employee.copy_into_table("employee", files=['employee.csv'], force=True)
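The copy_into_table method returns the result rows of the underlying COPY INTO command (one row per file), so the copied_into_result variable can be printed to confirm which files were loaded; this check is an addition for illustration, not part of the original example.
#// inspect the COPY INTO result rows returned by copy_into_table
print(copied_into_result)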
6. How to Specify File Pattern to Load Data into Snowflake Table from Snowpark?
Using the pattern parameter in the DataFrame.copy_into_table method, we can explicitly define the
file pattern from which data should be copied into a Snowflake table.
Syntax:
DataFrame.copy_into_table(table_name, pattern='<regex_file_pattern>')
The following code copies data from stage files that match the file pattern specified in
the DataFrame.copy_into_table method into the Snowflake EMPLOYEE table.
#// Reads all files from the Inbox folder in Stage location
df_employee = session.read.schema(schema).options({"field_delimiter": ",", "skip_header": 1}).csv('@my_s3_stage/Inbox/')
#// Copies data from files which match the specified file pattern from stage location
copied_into_result = df_employee.copy_into_table("employee", pattern='emp[a-z]+.csv', force=True)
7. How to COPY INTO Snowflake Table with Different Structure from Snowpark?
When the structure of columns is different between the stage files and the Snowflake table, we can
specify the order of target columns to which data should be saved in the table using
the target_columns parameter in the DataFrame.copy_into_table method.
Syntax:
DataFrame.copy_into_table(table_name, target_columns=['<column_1>','<column_2>',..])
The following is the order of columns in the stage file vs. the Snowflake table vs. the target_columns
parameter (showing the first column as an example):
employee.csv     EMPLOYEES table     target_columns parameter
EMPLOYEE_ID      ID                  ID
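Assuming the EMPLOYEES table uses the column names listed in the target_columns parameter below (ID, FNAME, LNAME, EMAIL_ADDRESS, JOIN_DATE, SALARY), a call that only remaps the columns, without any transformations, might look like the following sketch.
#// copy staged data into target columns with different names (no transformations applied)
copied_into_result = df_employee.copy_into_table(
    "employees",
    target_columns=['ID','FNAME','LNAME','EMAIL_ADDRESS','JOIN_DATE','SALARY'],
    force=True
)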
The transformations parameter can additionally be used to transform the column values while copying them, as in the following example.
from snowflake.snowpark.functions import ltrim, rtrim, concat, substr

copied_into_result = df_employee.copy_into_table("employees",
    target_columns=['ID','FNAME','LNAME','EMAIL_ADDRESS','JOIN_DATE','SALARY'],
    transformations=['$1', ltrim(rtrim('$2')), ltrim(rtrim('$3')), concat(substr('$2',1,1),'$3'), '$5', '$6'],
    force=True, on_error="CONTINUE")
To find the complete list of functions supported in the transformations parameter, refer to the Snowflake
Documentation.
The following is the transformed data loaded into the EMPLOYEES table in Snowflake.
Transformed data in EMPLOYEES Table
Introduction to Snowflake Snowpark for Python
January 2, 2024
Contents
1. What is Snowflake Snowpark?
2. Setting up Snowpark Python Environment to connect Snowflake
1. Installing Python
2. Installing Snowpark
3. Installing Visual Studio Code
3. Connecting Snowflake using Snowpark Python
3.1. Import Snowpark Libraries
3.2. Create Connection Parameters
3.3. Create a Session
3.4. Write your code
4. Writing Snowpark Code in Python Worksheets
4.1. Creating Python Worksheets
4.2. Writing Snowpark Code in Python Worksheets
4.3. Snowpark Python Packages for Python Worksheets
5. Closing Thoughts
1. What is Snowflake Snowpark?
Snowpark is an intuitive library that offers an API for querying and processing data at scale in
Snowflake. It seamlessly integrates DataFrame-style programming into preferred languages like
Python, Java, and Scala, and all operations occur within Snowflake using the elastic and serverless
Snowflake engine. This eliminates the need to move data to the system where your application code
runs.
When developing Snowpark applications, there are some key concepts that are important to
understand.
The Snowpark API enables you to write code from your IDE or notebook in your
preferred language, which is then converted into SQL and executed in Snowflake.
A core abstraction in Snowpark is the DataFrame, representing a query in your
chosen language.
Snowpark does not require a separate cluster outside of Snowflake for computations.
The queries built using DataFrames are converted to SQL, efficiently distributing
computation across Snowflake's elastic engine.
DataFrames in Snowpark are executed lazily, running only when actions like retrieval,
storage, or viewing of data are performed (see the sketch after this list).
Snowpark DataFrames also run entirely within Snowflake, ensuring data remains
within Snowflake unless explicitly requested by the application.
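To illustrate the lazy execution model, the sketch below builds a query without running it and only triggers execution when show() is called. It uses the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER sample table as an example and assumes an existing Snowpark session named session.
from snowflake.snowpark.functions import col

#// building the DataFrame only constructs a query; nothing runs in Snowflake yet
df = session.table("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER")
df_filtered = df.filter(col("C_NATIONKEY") == '15')
#// show() is an action: the query is translated to SQL and executed in Snowflake
df_filtered.show()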
2. Setting up Snowpark Python Environment to connect Snowflake
The following are the prerequisites for setting up a local Python development environment to build
applications using Snowpark.
1. Install Python
2. Install Snowpark
3. Install Visual Studio Code
1. Installing Python
At the time of writing this article, the supported version of Python for Snowpark is 3.10. To know the
latest supported version, please refer to the Snowflake Documentation.
Follow the steps below to install Python on your local Windows computer.
1. Navigate to the official Python website. Click on Downloads and then select Windows.
2. Navigate to the supported version of Python on the downloads page and download the installer.
3. Once the executable file is downloaded completely, open the file to install Python.
4. In the installation wizard, verify the path where the installation files will be saved. Also, select the
checkbox at the bottom to Add Python to PATH and click Install Now.
5. Wait for the wizard to finish the installation process until the Setup was successful message
appears. Click Close to exit the wizard.
6. To verify if Python is installed on your machine, issue the following command from the command
prompt (start >> cmd) of your machine.
python --version
2. Installing Snowpark
To install the Snowpark Python package, execute the following command from your command
prompt window.
pip install snowflake-snowpark-python
The download begins and all the required packages are installed into your Python virtual environment.
If pip is not already installed on your machine, follow the steps below before running the
above command.
1. Run the following cURL command in the command prompt to download the get-pip.py file.
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
2. Once the download is complete, run the following Python command to install pip.
python get-pip.py
3. Open a new command prompt window and run the following command to verify that pip was
installed successfully.
pip --version
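Once pip is available and the snowflake-snowpark-python package has been installed, a quick sanity check is to import the package and print its version from a Python shell. This is a minimal sketch; the __version__ attribute is assumed here and is not part of the original instructions.
#// confirm that the Snowpark package is importable in the active Python environment
import snowflake.snowpark
print(snowflake.snowpark.__version__)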
3. Installing Visual Studio Code
Visual Studio Code is a popular code editor from Microsoft with support for
development operations like debugging, task running, and version control.
Follow the steps below to install Visual Studio Code on Windows.
1. Navigate to the official website of Visual Studio code.
2. Click on the Download for Windows button on the website to start downloading the application.
3. Once the download finishes, click on the installer icon to start the installation process of Visual
Studio Code.
4. In the installation wizard, agree to the terms and conditions, and proceed by clicking the “Next”
and “Install” buttons on the subsequent pages.
5. After successful installation of Visual Studio Code, go to the extensions tab in Visual Studio Code,
search for the Python extension and install it.
6. The Python extension tries to find and select what it deems the best environment for the workspace.
To manually specify the environment, press Ctrl+Shift+P to open the VS Code Command
Palette and execute the command Python: Select Interpreter
7. The Python: Select Interpreter command displays a list of available global environments, select
the python 3.10 environment which we set up in the first step.
3. Connecting Snowflake using Snowpark Python
To connect to Snowflake from Snowpark, import the Session class, define the connection parameters, and create a Session, as shown in the code below.
from snowflake.snowpark import Session

connection_parameters = {
"account": "qokbyrr-ag94793",
"user": "SFUSER13",
"password": "Abc123",
"role": "ACCOUNTADMIN",
"warehouse": "SNOWPARK_DEMO_WH",
"database": "SNOWPARK_DEMO_DB",
"schema": "SNOWPARK_DEMO_SCHEMA"
}
new_session = Session.builder.configs(connection_parameters).create()
df_campaign_spend = new_session.table('campaign_spend')
df_campaign_spend.show()
Below is the output of the Snowpark application code.
Application Output
Here, the application code is run on your local machine, but the actual query execution is
performed within Snowflake.
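One way to see this split is to inspect the SQL that Snowpark generates for a DataFrame before it runs; the queries property used below is an assumption added for illustration and is not part of the original article.
#// print the SQL statements Snowpark generated for the DataFrame (these run in Snowflake, not locally)
print(df_campaign_spend.queries)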
4. Writing Snowpark Code in Python Worksheets
Snowflake supports writing Snowpark code in Python worksheets to process data using Snowpark
Python in Snowsight. You can conduct your development and testing in Snowflake without the need
to install dependent libraries by writing code in Python worksheets.
4.1. Creating Python Worksheets
To start coding in Python worksheets, open Worksheets in Snowsight, click + to add a new
worksheet, and select Python Worksheet.
The below image shows the default code with which the Python worksheet is created in Snowsight.
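The default code in a new Python worksheet looks roughly like the following sketch; the exact template and table name should be treated as assumptions, since the generated code may vary across Snowsight versions.
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col

def main(session: snowpark.Session):
    #// your Snowpark code goes here, inside the main handler
    tableName = 'information_schema.packages'
    dataframe = session.table(tableName).filter(col("language") == 'python')
    #// print a sample of the DataFrame to standard output
    dataframe.show()
    #// the returned DataFrame appears in the Results tab of the worksheet
    return dataframe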
5. Closing Thoughts
While there is much more to cover on Snowpark, I trust this article has offered you a fundamental
understanding, particularly beneficial for individuals without a programming background. It aims to
assist you in initiating your Snowpark learning journey and building a solid foundation for exploring
its capabilities.
Watch this space for more informative content on Snowflake Snowpark!