
DOING ETL IN TALEND

Talend is a popular open-source data
integration and ETL (Extract, Transform,
Load) software platform used for connecting,
transforming, and managing data from
various sources to target destinations.

In this project we'll use Talend to extract data from a Postgres database, transform it with Talend, and then output it into a BigQuery table.
DATASET
kaggle.com/datasets/suraj520/car-sales-data

The dataset is a tabular dataset that contains 2,500,000 rows of information on car sales from a car dealership over the course of a year. The dataset includes nine columns of data for each car sale.

The dataset was imported into the Postgres database using DBeaver.

Column              Type
Date                varchar(50)
Salesperson         varchar(50)
Customer            varchar(50)
Car Make            varchar(50)
Car Model           varchar(50)
Car Year            int4
Sale Price          int4
Commission Rate     float4
Commission Earned   float4
CONNECTING TO
THE DATABASE
Create a tDBConnection component to connect to the Postgres database. Fill out the fields required to connect: database host, port, database name, schema, username, and password.

The Postgres database was running in a Docker container for this project.
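For reference, here is a minimal sketch of the equivalent JDBC connection that tDBConnection establishes under the hood. The host, port, database name, and credentials are placeholder assumptions, and the Postgres JDBC driver must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class PostgresConnectionSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder values; match these to your own Postgres container.
            String url = "jdbc:postgresql://localhost:5432/car_sales?currentSchema=public";
            try (Connection conn = DriverManager.getConnection(url, "postgres", "password")) {
                System.out.println("Connected: " + !conn.isClosed());
            }
        }
    }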
QUERYING THE DATA
The connection component is then used to query the database through a tDBInput component. The tDBInput component can also be used to map and rename columns to a more standard convention (without special characters or spaces).

[Screenshot: tDBInput configuration — the connection component, a query that can be changed depending on conditions, and column remapping & renaming]
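For illustration, the Query field of tDBInput holds the SQL as a Java string. Something like the following would also take care of the renaming; the table name and column aliases are assumptions based on the schema above:

    // Hypothetical contents of the tDBInput Query field. The aliases strip
    // spaces and special characters so they match the Talend schema.
    String query =
        "SELECT \"Date\"              AS sale_date, "
      + "       \"Salesperson\"       AS salesperson, "
      + "       \"Customer\"          AS customer, "
      + "       \"Car Make\"          AS car_make, "
      + "       \"Car Model\"         AS car_model, "
      + "       \"Car Year\"          AS car_year, "
      + "       \"Sale Price\"        AS sale_price, "
      + "       \"Commission Rate\"   AS commission_rate, "
      + "       \"Commission Earned\" AS commission_earned "
      + "FROM public.car_sales";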
ADDING UNIQUE ID
A unique ID for each row is useful in the database for calculating some aggregated data, but the original data doesn't provide a unique ID for each row. Therefore it is a good idea to add one. This can be done with the tAddCRCRow component.

[Screenshot: tAddCRCRow configuration — the columns used for calculating the unique ID]
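Conceptually, tAddCRCRow appends a checksum column computed from the selected input columns. A rough sketch of the idea in plain Java follows; Talend's exact concatenation and encoding details may differ:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class CrcIdSketch {
        // Compute a CRC32 checksum over the chosen columns of one row.
        static long crcForRow(String... columns) {
            CRC32 crc = new CRC32();
            for (String col : columns) {
                if (col != null) {
                    crc.update(col.getBytes(StandardCharsets.UTF_8));
                }
            }
            return crc.getValue();
        }

        public static void main(String[] args) {
            System.out.println(crcForRow("2022-05-01", "Alice", "Toyota", "Corolla"));
        }
    }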
DATA QUALITY
AND VALIDATION
After adding a unique ID, data quality and validation checks are applied to the data using the tSchemaComplianceCheck component. This component validates every input row against a reference schema, checking types, nullability, and length against the reference values.
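As a simplified illustration of the kind of per-column check tSchemaComplianceCheck performs (not Talend's actual implementation):

    public class ComplianceSketch {
        // Reject a value that is null where nulls are not allowed,
        // or that exceeds the declared column length.
        static boolean isCompliant(String value, boolean nullable, int maxLength) {
            if (value == null) {
                return nullable;
            }
            return value.length() <= maxLength;
        }

        public static void main(String[] args) {
            System.out.println(isCompliant("Toyota", false, 50)); // true
            System.out.println(isCompliant(null, false, 50));     // false: null not allowed
        }
    }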
DATA
TRANSFORMATION
Once the data quality and validation checks pass, we can proceed to data transformation. We use the tMap component to transform data in Talend. The transformations applied to the data in this project produce three new columns: car_age, profit_range, and sales_year.
tMap EXPRESSION
car_age

This column calculates the age of the car at the time of purchase.

profit_range

Creates buckets or categories for "Sale Price" ranges (e.g., low, medium, high) to analyze the sales distribution.

sales_year

Year of the sale. This column could actually be omitted, since I only use it to calculate aggregate rows inside Talend, not in BigQuery.
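In plain Java, the three derivations look roughly like this. The variable names and the profit_range bucket boundaries are assumptions; in tMap these would be written as expressions on the output columns:

    public class TmapExpressionSketch {
        public static void main(String[] args) {
            // Example input row values (names and bucket bounds are assumptions).
            int carYear = 2018;
            int saleYear = 2022;  // parsed from the sale date
            int salePrice = 35000;

            // car_age: age of the car at the time of purchase
            int carAge = saleYear - carYear;

            // profit_range: bucket the sale price into low / medium / high
            String profitRange = salePrice < 20000 ? "low"
                               : salePrice < 40000 ? "medium"
                               : "high";

            // sales_year: year of the sale
            int salesYear = saleYear;

            System.out.println(carAge + " " + profitRange + " " + salesYear);
        }
    }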


PUTTING IT ALL
TOGETHER
After the data transformation we can combine the new columns with the other selected columns in the dataset as tMap's output. In this case we selected all columns. We also renamed the CRC column to ID.
LOAD TO BIGQUERY
Now that all transformations are done, we can load the data into BigQuery. The tBigQueryOutput component is used to load the data into a BigQuery table. tBigQueryOutput needs a service account with admin roles on BigQuery and Cloud Storage. The component also requires you to fill out the project ID, dataset, table name, GCS bucket, file URI, and the path to the service account key.
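Under the hood, tBigQueryOutput stages the rows in the GCS bucket and then runs a BigQuery load job. A rough equivalent using the google-cloud-bigquery client library, with placeholder project, dataset, table, bucket, and key-file names:

    import com.google.auth.oauth2.GoogleCredentials;
    import com.google.cloud.bigquery.*;
    import java.io.FileInputStream;

    public class BigQueryLoadSketch {
        public static void main(String[] args) throws Exception {
            // Authenticate with the service account key file (path is a placeholder).
            GoogleCredentials credentials =
                GoogleCredentials.fromStream(new FileInputStream("/path/to/service-account.json"));
            BigQuery bigquery = BigQueryOptions.newBuilder()
                .setCredentials(credentials)
                .setProjectId("my-project-id")
                .build()
                .getService();

            // Load the CSV staged in the GCS bucket into the target table.
            TableId tableId = TableId.of("car_sales_dataset", "car_sales");
            LoadJobConfiguration loadConfig =
                LoadJobConfiguration.builder(tableId, "gs://my-bucket/car_sales.csv")
                    .setFormatOptions(FormatOptions.csv())
                    .setAutodetect(true)
                    .build();

            Job job = bigquery.create(JobInfo.of(loadConfig));
            job = job.waitFor();  // block until the load job finishes
            System.out.println("Load job succeeded: " + (job.getStatus().getError() == null));
        }
    }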
RESULT IN BIGQUERY
JOB RUN RESULT
