AWS Project1
Project Statement:
Data Pipeline with Glue and Redshift:
Challenge:
Build a simple ETL pipeline using Glue crawlers and jobs to
Extract data from a CSV file (e.g., customer data) stored in S3,
Transform it (e.g., filter, format), and
Load it into a Redshift table.
Steps:
1. Create an S3 bucket.
2. Create a Redshift cluster.
3. Create an IAM role.
4. Create the AWS Glue database, table, crawler, and ETL job.
5. Integrate all the services using Visual ETL.
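The steps above start from the S3 bucket. As a sketch, this is the request payload that would be handed to boto3's `s3_client.create_bucket`; the bucket name, region, and object key are illustrative assumptions, not values from this walkthrough.

```python
# Payload that would be passed to boto3's s3_client.create_bucket(**create_bucket_kwargs).
# Bucket name, region, and key are hypothetical placeholders.
bucket_name = "etl-demo-customer-data"  # assumed name
create_bucket_kwargs = {
    "Bucket": bucket_name,
    # Outside us-east-1, a LocationConstraint is required:
    "CreateBucketConfiguration": {"LocationConstraint": "us-east-2"},
}

# Where the source CSV would live for the crawler to discover later:
csv_key = "raw/customers.csv"
s3_path = f"s3://{bucket_name}/{csv_key}"
print(s3_path)
```

Keeping the CSV under a dedicated prefix such as `raw/` makes it easy to point the crawler at just the source data.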
Created a workgroup and a namespace, which AWS Glue needs in order to connect to Redshift:
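The console steps for the namespace and workgroup correspond roughly to two calls on boto3's `redshift-serverless` client. This is a hedged sketch of those payloads; all names, the capacity, and the credentials are assumptions.

```python
# Hypothetical payloads for boto3's "redshift-serverless" client:
# client.create_namespace(**namespace_kwargs), then
# client.create_workgroup(**workgroup_kwargs).
namespace_kwargs = {
    "namespaceName": "etl-demo-namespace",
    "dbName": "dev",
    "adminUsername": "awsuser",
    "adminUserPassword": "REPLACE_ME",  # never hard-code a real password
}
workgroup_kwargs = {
    "workgroupName": "etl-demo-workgroup",
    # Linking the workgroup to the namespace is what ties compute to storage:
    "namespaceName": namespace_kwargs["namespaceName"],
    "baseCapacity": 8,            # RPUs; a small size for a demo
    "publiclyAccessible": False,  # Glue connects from inside the VPC
}
```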
Now we create an IAM role so that AWS Glue has the necessary permissions to access the S3 bucket and Redshift. Since we need access to both S3 and Redshift, I am attaching the “AWSGlueConsoleFullAccess” policy.
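For Glue to use this role, the role's trust policy must allow the Glue service to assume it. A minimal sketch, assuming a hypothetical role name; the trust JSON would be passed to `iam.create_role(AssumeRolePolicyDocument=...)`:

```python
import json

# Trust policy letting the Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
role_name = "glue-etl-demo-role"  # illustrative name

# The walkthrough attaches this broad managed policy; a production role
# should be scoped down to the specific bucket and Redshift workgroup.
policy_arn = "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"
trust_json = json.dumps(trust_policy)
```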
Creating a database, a table, and a Glue crawler to discover the schema of the CSV file in S3. We must also specify the S3 location and the IAM role created earlier.
• Create a crawler in AWS Glue; it will catalog the data stored in the S3 bucket.
• Upon running the crawler, we can see one table created under the database we created above.
1. Crawler in Running status:
2. Once the crawler is done, we check whether the table was created:
• Under Tables, we can see that the schema has been inferred automatically by Glue itself.
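The crawler configured in the console maps onto a single boto3 call. A sketch of the kwargs for `glue.create_crawler(**crawler_kwargs)`; the crawler name, role, database, and S3 path are illustrative assumptions standing in for the values in the screenshots.

```python
# Hypothetical kwargs for boto3's glue.create_crawler(**crawler_kwargs).
crawler_kwargs = {
    "Name": "customer-csv-crawler",
    "Role": "glue-etl-demo-role",    # the IAM role created earlier
    "DatabaseName": "etl_demo_db",   # the Glue catalog database
    "Targets": {
        "S3Targets": [{"Path": "s3://etl-demo-customer-data/raw/"}],
    },
}
# After create_crawler, glue.start_crawler(Name="customer-csv-crawler")
# would run it; on completion, one catalog table appears with the
# schema inferred from the CSV header and sampled rows.
```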
Connection creation: this connection is important because it allows AWS Glue to access Amazon Redshift.
Look for Amazon Redshift on the Choose data source page.
In this step we must provide the details of the workgroup we created earlier. Upon selecting the workgroup, AWS can identify the database name and username automatically; we must provide the password.
In the next step, we review the details and finish creating the connection.
Successful connection:
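Under the hood, the wizard produces a Glue JDBC connection. As a sketch, this is the shape of the input for `glue.create_connection(ConnectionInput=...)`; the endpoint, account ID, database, and credentials below are placeholder assumptions (the console fills most of them in when you pick the workgroup).

```python
# Hypothetical input for boto3's glue.create_connection(ConnectionInput=connection_input).
connection_input = {
    "Name": "redshift-demo-connection",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        # Placeholder endpoint; copy the real one from the workgroup details page.
        "JDBC_CONNECTION_URL": (
            "jdbc:redshift://etl-demo-workgroup.123456789012"
            ".us-east-2.redshift-serverless.amazonaws.com:5439/dev"
        ),
        "USERNAME": "awsuser",
        "PASSWORD": "REPLACE_ME",  # prefer Secrets Manager in practice
    },
}
```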
Now we focus on the target storage system, which in our case is Redshift.
Click the Query editor v2 option in the left navigation bar, as shown:
Create a table in Amazon Redshift using the query editor so that its columns match the data types of the CSV in S3.
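A sketch of the DDL to run in Query editor v2. The column list here is a hypothetical customer layout; the real statement must match the actual CSV header and the types the crawler inferred.

```python
# Example CREATE TABLE statement for the query editor. Columns are
# illustrative; align them with the crawler's inferred schema.
create_table_sql = """
CREATE TABLE IF NOT EXISTS public.customers (
    customer_id  INTEGER,
    first_name   VARCHAR(100),
    last_name    VARCHAR(100),
    email        VARCHAR(255),
    signup_date  DATE
);
""".strip()
print(create_table_sql)
```

Matching the column types up front avoids load-time conversion errors when the Glue job writes into the table.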
Once done:
Once the job is done, we can see the data loaded into the Redshift table, as below:
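One quick way to confirm the load outside the console is a count query through the Redshift Data API. A sketch of the payload for boto3's `redshift-data` client `execute_statement` call, reusing the illustrative workgroup and table names assumed above:

```python
# Hypothetical payload for boto3's "redshift-data" client:
# client.execute_statement(**query_kwargs), then fetch the result with
# get_statement_result once the statement finishes.
query_kwargs = {
    "WorkgroupName": "etl-demo-workgroup",  # serverless workgroup assumed above
    "Database": "dev",
    "Sql": "SELECT COUNT(*) FROM public.customers;",
}
```

A non-zero count confirms the ETL job actually wrote rows into the target table.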