Summer Internship Report (ETSI-600) (KOUSTAV DUTTA 49)
on
A STUDY ON DATA PROCESSING USING UBER REAL-TIME DATA: CREATING AN
ETL PIPELINE
by
KOUSTAV DUTTA (A914145022049)
Kolkata
Date:
KOUSTAV DUTTA
(A914145022049)
CERTIFICATE
Kolkata
Date:
ACKNOWLEDGEMENT
I would like to thank ABU IMRAN AHMED Sir for his guidance and kind co-operation in
assigning me this research work. He has been a constant support through thick and thin.
I would also like to thank my classmate TANZIL AHMED for their kind support throughout
the project; they helped me gain a great deal of knowledge through their research. Finally,
I would like to thank my parents, without whose support this would never have been possible.
ABSTRACT:
This project focuses on the development and implementation of a robust data processing
ecosystem, blending the versatility of Python, the scalability of Google Cloud Platform (GCP)
services, and an efficient Extract, Transform, Load (ETL) pipeline. With the goal of optimizing
data workflows, enhancing analytics, and supporting informed decision-making, the project
unfolds through strategic steps. Beginning with a well-structured data model visualized through
an Entity-Relationship (ER) diagram, Python emerges as a key player for intricate tasks within
the ETL pipeline. GCP's suite of services, including Cloud Storage, Compute Engine, and
BigQuery, further amplifies the project's capabilities, providing a scalable infrastructure for data
storage, computation, and advanced analytics. The processes of data extraction, transformation,
and loading are carefully optimized, ensuring high-quality output at each stage. The integration
of cloud services augments scalability and real-time analytics, solidifying the system's
efficiency. The culmination of this effort is a user-friendly dashboard, offering an intuitive
interface for users to interact with the processed data. This dashboard serves as a gateway to
meaningful insights, empowering users to make informed decisions. Through this initiative, the
project aims to unlock the full potential of data, transforming raw information into a valuable
resource for strategic decision-making in the data-driven landscape.
TABLE OF CONTENTS
● Abstract
● Objective
● What is ETL Pipeline?
  ○ Our Model
  ○ ER Diagram
  ○ Platform Use
  ○ Data Set
    ■ Training Data
    ■ Python
      ● Indexing
      ● Merging
  ○ Cloud Services
    ■ GCP
      ● Cloud Storage
      ● Compute Engine
      ● BigQuery
  ○ Data Extraction
  ○ Data Transformation
  ○ Data Loading
  ○ Dashboard
  ○ Conclusion
  ○ References
Introduction
In the era of data-driven decision-making, the seamless processing and analysis of vast datasets
are paramount. This project embarks on a journey to develop a comprehensive data ecosystem,
integrating the power of Python, Google Cloud Platform (GCP) services, and an efficient
Extract, Transform, Load (ETL) pipeline. The overarching objective is to create a robust
infrastructure capable of handling diverse datasets, optimizing data workflows, and providing
insightful analytics for informed decision-making.
As organizations grapple with the complexities of big data, the need for sophisticated data
processing solutions becomes increasingly critical. Python, renowned for its versatility and
extensive libraries, emerges as a key player in implementing intricate tasks within the ETL
pipeline. This, coupled with the scalable and reliable services offered by GCP (Cloud Storage
for secure data storage, Compute Engine for computational power, and BigQuery for advanced
analytics), forms the backbone of our approach.
The development of a well-structured data model, visualized through an Entity-Relationship
(ER) diagram, guides the implementation of our model. The synergy between Python's
capabilities, the scalability of GCP, and the efficiency of ETL processes positions this initiative
to deliver a seamlessly integrated platform.
Throughout this project, we will delve into the intricacies of data extraction, transformation, and
loading, optimizing each step to ensure the highest quality output. The integration of Cloud
Services will further enhance the scalability and real-time analytics capabilities of our system.
Ultimately, the project aims not only to establish a resilient ETL pipeline but also to craft a
user-friendly dashboard. This dashboard will serve as the interface through which users can
interact with processed data, gaining meaningful insights and empowering them to make
informed decisions. In this endeavor, we are poised to unlock the full potential of data,
transforming it from raw information into a valuable resource for strategic decision-making.
OBJECTIVE:
The primary goal of this project is to design, implement, and optimize a comprehensive data
processing and analytics ecosystem. The key objectives are as follows:
1. Develop a Robust ETL Pipeline:
- Create an efficient Extract, Transform, Load (ETL) pipeline to handle data seamlessly.
- Utilize Python as the primary programming language for ETL tasks, capitalizing on its
versatility and extensive libraries.
2. Model Development and ER Diagram:
- Design and implement a data model to represent the structure and relationships within the
dataset.
- Develop an Entity-Relationship (ER) diagram to visualize the database schema and guide the
model's implementation.
3. Platform Utilization:
- Leverage Google Cloud Platform (GCP) for its suite of services, including Cloud Storage,
Compute Engine, and BigQuery.
- Harness the scalability and reliability of GCP to handle varying data loads and computational
requirements.
4. Data Handling with Python:
- Utilize Python for crucial data processing tasks such as indexing and merging to optimize data
retrieval and manipulation.
- Leverage Python's capabilities to seamlessly integrate the ETL pipeline with the overall data
ecosystem.
WHAT IS ETL?
ETL stands for Extract, Transform, Load, which represents a common process in data integration
and data warehousing. Here's a breakdown of each component:
1. Extract:
- The first step involves extracting data from various sources such as databases, files,
applications, or external systems.
- Extracted data is often in its raw form and may be inconsistent or incompatible with the target
system.
2. Transform:
- Once data is extracted, it undergoes transformation processes to convert it into a usable format.
- Transformations can include cleaning, filtering, aggregating, and restructuring the data.
- The goal is to ensure that data is accurate, consistent, and suitable for analysis or storage.
3. Load:
- The transformed data is then loaded into a target system, such as a data warehouse, database, or
another storage solution.
- Loading can be done in various ways, including full loading (all data is loaded each time) or
incremental loading (only new or changed data is loaded).
ETL processes are fundamental in data engineering and play a crucial role in consolidating data
from different sources, making it accessible and useful for analysis, reporting, and business
intelligence. ETL tools and frameworks automate and streamline these processes, reducing
manual effort and ensuring the efficiency and accuracy of data movement and transformation.
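To make the three stages concrete, the following is a minimal sketch of an ETL run written in Python, the language used throughout this project. It extracts data from a hypothetical rides.csv file, applies a simple cleaning transformation, and loads the result into a local SQLite table; the file, table, and column names here are illustrative assumptions rather than artifacts of this project.

```python
import sqlite3
import pandas as pd

# --- Extract: pull raw data from a source (a hypothetical CSV file here) ---
raw = pd.read_csv("rides.csv")  # assumed columns: pickup_time, fare_amount

# --- Transform: clean and reshape the raw data into an analysis-ready form ---
clean = raw.dropna(subset=["fare_amount"])                   # drop incomplete rows
clean = clean[clean["fare_amount"] > 0]                      # remove invalid fares
clean["pickup_time"] = pd.to_datetime(clean["pickup_time"])  # normalize types

# --- Load: write the transformed data into a target store (SQLite for illustration) ---
with sqlite3.connect("warehouse.db") as conn:
    # if_exists="replace" mimics a full load; "append" would correspond to incremental loading
    clean.to_sql("rides_clean", conn, if_exists="replace", index=False)
```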
OUR MODEL
BigQuery:
1. Serverless Data Warehousing:
BigQuery is a serverless, fully managed data warehouse that allows for seamless analytics on
large datasets without the need for infrastructure management.
2. Scalability:
It offers impressive scalability, enabling users to analyze petabytes of data with high
performance due to its underlying architecture.
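As a brief illustration of how such a warehouse can be queried from Python, the hedged sketch below uses the google-cloud-bigquery client. The dataset and table names (uber_data_engineering_project.tbl_analytics) follow later sections of this report, while the specific columns queried are assumptions about the final schema.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses Application Default Credentials

# Example analytical query; column names are assumptions based on the schema below
sql = """
SELECT rate_code_name, AVG(fare_amount) AS avg_fare, COUNT(*) AS trips
FROM `uber_data_engineering_project.tbl_analytics`
GROUP BY rate_code_name
ORDER BY trips DESC
"""

results = client.query(sql).result()  # runs serverlessly; no cluster to manage
for row in results:
    print(row.rate_code_name, round(row.avg_fare, 2), row.trips)
```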
DATA SET:
● Geographical Information: Latitude and longitude coordinates for pickup and drop-off points,
giving insight into popular routes and areas with high demand.
● Fare Information: Fare charges for each ride, including surge pricing data during peak hours
or high-demand periods.
● User Details: Anonymized user identifiers and user types (e.g., regular users, promotional users).
● Time-Based Trends: Analysis of ride patterns during various times of the day, days of the week,
or extraordinary events, along with seasonal variations in demand.
● Additional Context: Weather conditions during rides and unique events or holidays impacting
ride volume.
Analyzing this dataset can provide valuable insights for urban planning, traffic management,
and optimizing Uber's services in New York. It is a valuable resource for data scientists,
analysts, and policymakers to understand transportation dynamics, improve efficiency, and
enhance the overall urban mobility experience.
1. Imports the necessary libraries:
   a. mage - the Mage AI library.
   b. pandas - a Python library for data analysis and manipulation.
2. Creates a Mage AI task called extract_transform_load. The extract_transform_load task
extracts data from a CSV file, transforms it into a Pandas DataFrame, and loads it into a
Google Cloud Storage bucket.
3. Configures the extract_transform_load task. The task is configured to extract data from the
CSV file data.csv and load it into the Google Cloud Storage bucket my-bucket.
4. Runs the extract_transform_load task. The task is run using the mage.run_task() function.
5. Prints the status of the extract_transform_load task. The mage.get_task_status() function is
used to get the status of the task.
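The Mage AI pipeline itself is configured through the platform; as a rough, standalone approximation of what the extract_transform_load task does, the sketch below uses only pandas and the google-cloud-storage client rather than the Mage AI API. The file name data.csv and bucket name my-bucket come from the description above, while the destination object path and the transformation step are assumptions.

```python
import pandas as pd
from google.cloud import storage  # pip install google-cloud-storage

def extract_transform_load(csv_path="data.csv", bucket_name="my-bucket",
                           destination_blob="uber/transformed_data.csv"):
    """Minimal ETL sketch: read a CSV, apply a light transformation,
    and upload the result to a Google Cloud Storage bucket."""
    # Extract: load the raw CSV into a DataFrame
    df = pd.read_csv(csv_path)

    # Transform: drop exact duplicates and normalize column names (illustrative only)
    df = df.drop_duplicates().reset_index(drop=True)
    df.columns = [c.strip().lower() for c in df.columns]

    # Load: write the transformed data to the GCS bucket
    client = storage.Client()  # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")
    return f"gs://{bucket_name}/{destination_blob}"

if __name__ == "__main__":
    print("Uploaded to:", extract_transform_load())
```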
PYTHON MERGING:
The data pipeline starts by extracting data from a CSV file and transforming it to meet the
requirements of the Mage AI platform. The transformed data is then loaded into a Google Cloud
Storage (GCS) bucket. Mage AI orchestrates the entire pipeline, making it easy to manage and
troubleshoot.
In essence, the pipeline automates the process of extracting, transforming, and loading data from
a local file to a cloud-based storage service. This streamlines data management and allows
businesses to focus on analysis rather than data movement.
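The merge step is only described here, not shown; the following small sketch illustrates the kind of pandas merging this pipeline relies on, building a passenger_count dimension with a surrogate key and merging that key back into the trip records. The data values are made up for illustration.

```python
import pandas as pd

# Made-up trip rows standing in for the Uber CSV
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-03-01 08:15", "2023-03-01 17:40"]),
    "passenger_count": [1, 3],
    "fare_amount": [12.5, 30.0],
})

# Build a small passenger_count dimension with a surrogate key
passenger_count_dim = trips[["passenger_count"]].drop_duplicates().reset_index(drop=True)
passenger_count_dim["passenger_count_id"] = passenger_count_dim.index

# Merge the surrogate key back into the (fact-style) trip table on the natural column
fact = trips.merge(passenger_count_dim, on="passenger_count", how="left")
print(fact)
```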
ER DIAGRAM:
Fact & Dimension Table:
Fact Table:
● Contains quantitative measures or metrics that are used for analysis.
● Typically contains foreign keys that link to dimension tables.
● Contains columns that have high cardinality and change frequently.
● Contains columns that are not useful for analysis by themselves, but are necessary for
calculating metrics.
Code Snippet:
Foreign keys:
● datetime_id: This foreign key references the datetime_id primary key in the datetime
dimension table.
● passenger_count_id: This foreign key references the passenger_count_id primary key
in the passenger_count dimension table.
● trip_distance_id: This foreign key references the trip_distance_id primary key in the
trip_distance dimension table.
● rate_code_id: This foreign key references the rate_code_id primary key in the
rate_code dimension table.
● pickup_location_id: This foreign key references the pickup_location_id primary key
in the pickup_location dimension table.
● payment_type_id: This foreign key references the payment_type_id primary key in
the payment_type dimension table.
● dropoff_location_id: This foreign key references the dropoff_location_id primary key
in the dropoff_location dimension table.
Primary key:
● trip_id: The trip_id column is the primary key of the fact table. This means that it is a
unique identifier for each row in the table.
The foreign keys in the fact table allow us to relate the fact table to the dimension tables. This
allows us to perform complex queries on the data, such as finding the total fare amount for all
trips that started in a particular location or that were paid for with a credit card.
The primary key in the fact table is used to uniquely identify each row in the table. This is
important for ensuring the integrity of the data and for allowing us to perform efficient queries.
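To make this concrete, the hedged sketch below joins a miniature, made-up fact table to two of its dimension tables via the foreign keys listed above and computes the total fare for credit-card trips by pickup zone; the pickup_zone and payment_type_name columns are assumptions about the dimension attributes.

```python
import pandas as pd

# Hypothetical miniature tables following the star schema described above
fact_table = pd.DataFrame({
    "trip_id": [0, 1, 2, 3],
    "pickup_location_id": [0, 0, 1, 1],
    "payment_type_id": [0, 1, 0, 0],
    "fare_amount": [12.5, 30.0, 8.0, 22.0],
})
pickup_location_dim = pd.DataFrame({
    "pickup_location_id": [0, 1],
    "pickup_zone": ["Manhattan", "JFK Airport"],  # illustrative values
})
payment_type_dim = pd.DataFrame({
    "payment_type_id": [0, 1],
    "payment_type_name": ["Credit card", "Cash"],
})

# Join the fact table to its dimensions via the foreign keys, then aggregate
joined = (fact_table
          .merge(pickup_location_dim, on="pickup_location_id")
          .merge(payment_type_dim, on="payment_type_id"))

# Total fare amount for trips paid with a credit card, grouped by pickup zone
credit_card_fares = (joined[joined["payment_type_name"] == "Credit card"]
                     .groupby("pickup_zone")["fare_amount"].sum())
print(credit_card_fares)
```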
Dimension Table:
● Contains columns that describe attributes of the data being analyzed.
● Typically contains primary keys that link to fact tables.
● Contains columns that have low cardinality and don't change frequently.
● Contains columns that can be used for grouping or filtering data for analysis.
CODE SNIPPETS:
Datetime Dimension Table
Column Name    Data Type      Description
datetime_id    int            Unique identifier for the datetime record
date           varchar(10)    Date of the datetime record
time           varchar(8)     Time of the datetime record
weekday        varchar(10)    Day of the week of the datetime record (e.g., Sunday, Monday, Tuesday, etc.)
month          varchar(10)    Month of the datetime record (e.g., January, February, March, etc.)
quarter        varchar(10)    Quarter of the year of the datetime record (e.g., Q1, Q2, Q3, Q4)
year           varchar(4)     Year of the datetime record
day_of_year    int            Day of the year of the datetime record (e.g., 1, 2, 3, ...)
The Python script in the image creates a table called tbl_analytics in the
uber_data_engineering_project database. The table contains data about Uber rides, such as the
vendor ID, pickup and dropoff times, distance traveled, rate code, pickup location, fare amount,
and other charges.
The script joins five different tables to create the tbl_analytics table:
● fact_table: Contains raw data about Uber rides.
● datetime: Contains information about the date and time of each ride.
● trip_distance: Contains information about the distance traveled for each ride.
● rate_code: Contains information about the rate code for each ride.
● pickup_location: Contains information about the pickup location for each ride.
The tbl_analytics table can be used to analyze Uber rides, such as identifying the most popular
pickup and dropoff locations, the average fare amount for different rate codes, or the impact of
tolls and improvement surcharges on the total cost of rides.
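The script itself appears only as an image in the original report; the following is a hedged reconstruction of the kind of statement it runs, submitted through the google-cloud-bigquery client. The dataset name uber_data_engineering_project and the five table names come from the text above, while the exact column list and the USING join keys are assumptions based on the schema described earlier.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hedged reconstruction: join the fact table to four of its dimensions
# and materialize the result as tbl_analytics. Column names are assumptions.
query = """
CREATE OR REPLACE TABLE `uber_data_engineering_project.tbl_analytics` AS
SELECT
  f.trip_id,
  f.VendorID,
  d.tpep_pickup_datetime,
  d.tpep_dropoff_datetime,
  td.trip_distance,
  rc.rate_code_name,
  pl.pickup_latitude,
  pl.pickup_longitude,
  f.fare_amount,
  f.total_amount
FROM `uber_data_engineering_project.fact_table` AS f
JOIN `uber_data_engineering_project.datetime` AS d USING (datetime_id)
JOIN `uber_data_engineering_project.trip_distance` AS td USING (trip_distance_id)
JOIN `uber_data_engineering_project.rate_code` AS rc USING (rate_code_id)
JOIN `uber_data_engineering_project.pickup_location` AS pl USING (pickup_location_id)
"""

client.query(query).result()  # wait for the job to finish
print("tbl_analytics created")
```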
OUTPUT:
Data Extraction:
1. Importing libraries: The script starts by importing the necessary libraries: pandas for
data manipulation and datetime for date-related operations.
2. Loading data: The script reads the CSV file containing Uber ride information using the
pandas.read_csv() function and stores it in a DataFrame object named df.
3. Creating datetime dimensions: The script creates two DataFrames named
datetime_dim and passenger_count_dim to store the extracted datetime information
and passenger count information, respectively. It utilizes the .dt.hour, .dt.day,
.dt.month, .dt.year, and .dt.weekday attributes of the pandas .dt accessor to extract the
relevant information from the datetime columns (a sketch of these dimension-building
steps is shown after this list).
4. Creating trip distance dimension: The script creates a DataFrame named
trip_distance_dim to store the extracted trip distance information. It extracts the trip
distance values and assigns unique identifiers to each entry.
5. Mapping rate code: The script creates a dictionary named rate_code_type that maps
the rate code values to descriptive labels: "Standard rate", "JFK", "Newark", "Nassau
or Westchester", "Negotiated fare", and "Group ride".
6. Combining dimensions: The script joins the datetime_dim, passenger_count_dim,
trip_distance_dim, and rate_code_type DataFrames using the pandas.concat() function to
create a single DataFrame named tbl_analytics.
7. Dropping duplicates: The script removes duplicate rows from tbl_analytics using
.drop_duplicates().reset_index(drop=True) to ensure data integrity.
8. Renaming columns: The script renames the columns of tbl_analytics to make them
more descriptive: "datetime_id", "tpep_pickup_datetime", "pick_hour", "pick_day",
"pick_month", "pick_year", "pick_weekday", "tpep_dropoff_datetime", "drop_hour",
"drop_day", "drop_month", "drop_year", "drop_weekday", "passenger_count_id",
"passenger_count", "trip_distance_id", "trip_distance", and "rate_code".
Data Loading:
The platform of choice, GCP, played a pivotal role, leveraging Cloud Storage for scalable and
reliable data storage, Compute Engine for processing power, and BigQuery for powerful
analytics. The ETL pipeline, encompassing data extraction, transformation, and loading,
showcased the synergy between these components in ensuring a smooth flow of information.
Our focus on Python underscored its significance in AI and data science, serving as a versatile
tool for implementing the ETL processes. Through meticulous indexing and merging, we
optimized data handling, facilitating the creation of a robust model. The integration of cloud
services, particularly GCP, added a layer of scalability and real-time analytics, aligning with the
modern demands of data-driven applications.
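The loading code is not reproduced in this report; as a minimal sketch of how a transformed DataFrame can be loaded into BigQuery, the example below uses the google-cloud-bigquery client with a hypothetical destination table ID and a stand-in DataFrame.

```python
import pandas as pd
from google.cloud import bigquery  # pip install google-cloud-bigquery pyarrow

client = bigquery.Client()

# Hypothetical destination table; replace with your own project and dataset names
table_id = "my-project.uber_data_engineering_project.tbl_analytics"

# Stand-in for the transformed analytics DataFrame produced in the earlier steps
tbl_analytics = pd.DataFrame({"trip_id": [0, 1], "fare_amount": [12.5, 30.0]})

# Load the DataFrame into BigQuery, replacing any existing table contents
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
job = client.load_table_from_dataframe(tbl_analytics, table_id, job_config=job_config)
job.result()  # wait for the load job to complete
print(f"Loaded {job.output_rows} rows into {table_id}")
```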
In the end, the envisioned dashboard serves as the culmination of our efforts, providing a
user-friendly interface to interact with the processed data. This journey through ETL, Python,
and GCP reaffirms the importance of a well-orchestrated data pipeline in unlocking the full
potential of datasets for informed decision-making and insightful analytics.
REFERENCES:
A number of great references are available online; the following video was particularly
helpful for this project.
Video Reference:
https://www.youtube.com/watch?v=WpQECq5Hx9g&t=2621s