Summer Internship Report (ETSI-600) (KOUSTAV DUTTA 49)
on
A STUDY ON DATA PROCESSING USING UBER REAL-TIME DATA: CREATING AN
ETL PIPELINE
by
KOUSTAV DUTTA (A914145022049)
Kolkata
Date:
KOUSTAV DUTTA
(A914145022049)
CERTIFICATE
Kolkata
Date:
ACKNOWLEDGEMENT
I would like to thank ABU IMRAN AHMED Sir for his guidance and kind co-operation in
assigning me this research work. He has been a constant support through thick and thin.
I would also like to thank my classmate TANZIL AHMED for their kind support throughout
the project; they helped me gain a great deal of knowledge through their research. Finally,
I would like to thank my parents, without whose support this would never have been possible.
ABSTRACT:
This project focuses on the development and implementation of a robust data processing
ecosystem, blending the versatility of Python, the scalability of Google Cloud Platform (GCP)
services, and an efficient Extract, Transform, Load (ETL) pipeline. With the goal of optimizing
data workflows, enhancing analytics, and supporting informed decision-making, the project
unfolds through strategic steps. Beginning with a well-structured data model visualized through
an Entity-Relationship (ER) diagram, Python emerges as a key player for intricate tasks within
the ETL pipeline. GCP's suite of services, including Cloud Storage, Compute Engine, and
BigQuery, further amplifies the project's capabilities, providing a scalable infrastructure for data
storage, computation, and advanced analytics. The processes of data extraction, transformation,
and loading are carefully optimized, ensuring high-quality output at each stage. The integration
of cloud services augments scalability and real-time analytics, solidifying the system's
efficiency. The culmination of this effort is a user-friendly dashboard, offering an intuitive
interface for users to interact with the processed data. This dashboard serves as a gateway to
meaningful insights, empowering users to make informed decisions. Through this initiative, the
project aims to unlock the full potential of data, transforming raw information into a valuable
resource for strategic decision-making in the data-driven landscape.
TABLE OF CONTENTS
● Abstract
● Objective
● What is ETL Pipeline?
  ○ Our Model
  ○ ER Diagram
  ○ Platform Use
  ○ Data Set
    ■ Training Data
    ■ Python
      ● Indexing
      ● Merging
  ○ Cloud Services
    ■ GCP
      ● Cloud Storage
      ● Compute Engine
      ● BigQuery
  ○ Data Extraction
  ○ Data Transformation
  ○ Data Loading
  ○ Dashboard
  ○ Conclusion
  ○ References
Introduction
In the era of data-driven decision-making, the seamless processing and analysis of vast datasets
are paramount. This project embarks on a journey to develop a comprehensive data ecosystem,
integrating the power of Python, Google Cloud Platform (GCP) services, and an efficient
Extract, Transform, Load (ETL) pipeline. The overarching objective is to create a robust
infrastructure capable of handling diverse datasets, optimizing data workflows, and providing
insightful analytics for informed decision-making.
As organizations grapple with the complexities of big data, the need for sophisticated data
processing solutions becomes increasingly critical. Python, renowned for its versatility and
extensive libraries, emerges as a key player in implementing intricate tasks within the ETL
pipeline. This, coupled with the scalable and reliable services offered by GCP (Cloud Storage
for secure data storage, Compute Engine for computational power, and BigQuery for advanced
analytics), forms the backbone of our approach.
The development of a well-structured data model, visualized through an Entity-Relationship
(ER) diagram, guides the implementation of our model. The synergy between Python's
capabilities, the scalability of GCP, and the efficiency of ETL processes positions this initiative
to deliver a seamlessly integrated platform.
Throughout this project, we will delve into the intricacies of data extraction, transformation, and
loading, optimizing each step to ensure the highest quality output. The integration of Cloud
Services will further enhance the scalability and real-time analytics capabilities of our system.
Ultimately, the project aims not only to establish a resilient ETL pipeline but also to craft a
user-friendly dashboard. This dashboard will serve as the interface through which users can
interact with processed data, gaining meaningful insights and empowering them to make
informed decisions. In this endeavor, we are poised to unlock the full potential of data,
transforming it from raw information into a valuable resource for strategic decision-making.
OBJECTIVE:
The primary goal of this project is to design, implement, and optimize a comprehensive data
processing and analytics ecosystem. The key objectives are as follows:
1. Develop a Robust ETL Pipeline:
- Create an efficient Extract, Transform, Load (ETL) pipeline to handle data seamlessly.
- Utilize Python as the primary programming language for ETL tasks, capitalizing on its
versatility and extensive libraries.
2. Model Development and ER Diagram:
- Design and implement a data model to represent the structure and relationships within the
dataset.
- Develop an Entity-Relationship (ER) diagram to visualize the database schema and guide the
model's implementation.
3. Platform Utilization:
- Leverage Google Cloud Platform (GCP) for its suite of services, including Cloud Storage,
Compute Engine, and BigQuery.
- Harness the scalability and reliability of GCP to handle varying data loads and computational
requirements.
4. Data Handling with Python:
- Utilize Python for crucial data processing tasks such as indexing and merging to optimize data
retrieval and manipulation.
- Leverage Python's capabilities to seamlessly integrate the ETL pipeline with the overall data
ecosystem.
WHAT IS ETL?
ETL stands for Extract, Transform, Load, which represents a common process in data integration
and data warehousing. Here's a breakdown of each component:
1. Extract:
- The first step involves extracting data from various sources such as databases, files,
applications, or external systems.
- Extracted data is often in its raw form and may be inconsistent or incompatible with the target
system.
2. Transform:
- Once data is extracted, it undergoes transformation processes to convert it into a usable format.
- Transformations can include cleaning, filtering, aggregating, and restructuring the data.
- The goal is to ensure that data is accurate, consistent, and suitable for analysis or storage.
3. Load:
- The transformed data is then loaded into a target system, such as a data warehouse, database, or
another storage solution.
- Loading can be done in various ways, including full loading (all data is loaded each time) or
incremental loading (only new or changed data is loaded).
ETL processes are fundamental in data engineering and play a crucial role in consolidating data
from different sources, making it accessible and useful for analysis, reporting, and business
intelligence. ETL tools and frameworks automate and streamline these processes, reducing
manual effort and ensuring the efficiency and accuracy of data movement and transformation.
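To make the three stages concrete, the following is a minimal sketch of an ETL run written in Python, the language used throughout this project. It extracts data from a hypothetical rides.csv file, applies a simple cleaning transformation, and loads the result into a local SQLite table; the file, table, and column names here are illustrative assumptions rather than artifacts of this project.

```python
import sqlite3
import pandas as pd

# --- Extract: pull raw data from a source (a hypothetical CSV file here) ---
raw = pd.read_csv("rides.csv")  # assumed columns: pickup_time, fare_amount

# --- Transform: clean and reshape the raw data into an analysis-ready form ---
clean = raw.dropna(subset=["fare_amount"])                   # drop incomplete rows
clean = clean[clean["fare_amount"] > 0]                      # remove invalid fares
clean["pickup_time"] = pd.to_datetime(clean["pickup_time"])  # normalize types

# --- Load: write the transformed data into a target store (SQLite for illustration) ---
with sqlite3.connect("warehouse.db") as conn:
    # if_exists="replace" mimics a full load; "append" would correspond to incremental loading
    clean.to_sql("rides_clean", conn, if_exists="replace", index=False)
```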
OUR MODEL
BigQuery:
1. Serverless Data Warehousing:
BigQuery is a serverless, fully managed data warehouse that allows for seamless analytics on
large datasets without the need for infrastructure management.
2. Scalability:
It offers impressive scalability, enabling users to analyze petabytes of data with high
performance due to its underlying architecture.
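As a brief illustration of how such a warehouse can be queried from Python, the hedged sketch below uses the google-cloud-bigquery client. The dataset and table names (uber_data_engineering_project.tbl_analytics) follow later sections of this report, while the specific columns queried are assumptions about the final schema.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses Application Default Credentials

# Example analytical query; column names are assumptions based on the schema below
sql = """
SELECT rate_code_name, AVG(fare_amount) AS avg_fare, COUNT(*) AS trips
FROM `uber_data_engineering_project.tbl_analytics`
GROUP BY rate_code_name
ORDER BY trips DESC
"""

results = client.query(sql).result()  # runs serverlessly; no cluster to manage
for row in results:
    print(row.rate_code_name, round(row.avg_fare, 2), row.trips)
```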
DATA SET:
● Geographical Information: Latitude and longitude coordinates for pickup and drop-off points,
giving insight into popular routes and areas with high demand.
● Fare Information: Fare charges for each ride, including surge pricing data during peak hours
or high-demand periods.
● User Details: Anonymized user identifiers and user types (e.g., regular users, promotional users).
● Time-Based Trends: Analysis of ride patterns during various times of the day, days of the week,
or extraordinary events, along with seasonal variations in demand.
● Additional Context: Weather conditions during rides and unique events or holidays impacting
ride volume.
Analyzing this dataset can provide valuable insights for urban planning, traffic management,
and optimizing Uber's services in New York. It is a valuable resource for data scientists,
analysts, and policymakers to understand transportation dynamics, improve efficiency, and
enhance the overall urban mobility experience.
1. Imports the necessary libraries:
   a. mage - the Mage AI library.
   b. pandas - a Python library for data analysis and manipulation.
2. Creates a Mage AI task called extract_transform_load. The extract_transform_load task
extracts data from a CSV file, transforms it into a Pandas DataFrame, and loads it into a
Google Cloud Storage bucket.
3. Configures the extract_transform_load task. The task is configured to extract data from the
CSV file data.csv and load it into the Google Cloud Storage bucket my-bucket.
4. Runs the extract_transform_load task. The task is run using the mage.run_task() function.
5. Prints the status of the extract_transform_load task. The mage.get_task_status() function is
used to get the status of the task.
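The Mage AI pipeline itself is configured through the platform; as a rough, standalone approximation of what the extract_transform_load task does, the sketch below uses only pandas and the google-cloud-storage client rather than the Mage AI API. The file name data.csv and bucket name my-bucket come from the description above, while the destination object path and the transformation step are assumptions.

```python
import pandas as pd
from google.cloud import storage  # pip install google-cloud-storage

def extract_transform_load(csv_path="data.csv", bucket_name="my-bucket",
                           destination_blob="uber/transformed_data.csv"):
    """Minimal ETL sketch: read a CSV, apply a light transformation,
    and upload the result to a Google Cloud Storage bucket."""
    # Extract: load the raw CSV into a DataFrame
    df = pd.read_csv(csv_path)

    # Transform: drop exact duplicates and normalize column names (illustrative only)
    df = df.drop_duplicates().reset_index(drop=True)
    df.columns = [c.strip().lower() for c in df.columns]

    # Load: write the transformed data to the GCS bucket
    client = storage.Client()  # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")
    return f"gs://{bucket_name}/{destination_blob}"

if __name__ == "__main__":
    print("Uploaded to:", extract_transform_load())
```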
PYTHON MERGING:
The data pipeline starts by extracting data from a CSV file and transforming it to meet the
requirements of the Mage AI platform. The transformed data is then loaded into a Google Cloud
Storage (GCS) bucket. Mage AI orchestrates the entire pipeline, making it easy to manage and
troubleshoot.
In essence, the pipeline automates the process of extracting, transforming, and loading data from
a local file to a cloud-based storage service. This streamlines data management and allows
businesses to focus on analysis rather than data movement.
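The merge step is only described here, not shown; the following small sketch illustrates the kind of pandas merging this pipeline relies on, building a passenger_count dimension with a surrogate key and merging that key back into the trip records. The data values are made up for illustration.

```python
import pandas as pd

# Made-up trip rows standing in for the Uber CSV
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-03-01 08:15", "2023-03-01 17:40"]),
    "passenger_count": [1, 3],
    "fare_amount": [12.5, 30.0],
})

# Build a small passenger_count dimension with a surrogate key
passenger_count_dim = trips[["passenger_count"]].drop_duplicates().reset_index(drop=True)
passenger_count_dim["passenger_count_id"] = passenger_count_dim.index

# Merge the surrogate key back into the (fact-style) trip table on the natural column
fact = trips.merge(passenger_count_dim, on="passenger_count", how="left")
print(fact)
```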
ER DIAGRAM:
Fact & Dimension Table:
Fact Table:
● Contains quantitative measures or metrics that are used for analysis.
● Typically contains foreign keys that link to dimension tables.
● Contains columns that have high cardinality and change frequently.
● Contains columns that are not useful for analysis by themselves, but are necessary for
calculating metrics.
Code Snippet:
Foreign keys:
● datetime_id: This foreign key references the datetime_id primary key in the datetime
dimension table.
● passenger_count_id: This foreign key references the passenger_count_id primary key
in the passenger_count dimension table.
● trip_distance_id: This foreign key references the trip_distance_id primary key in the
trip_distance dimension table.
● rate_code_id: This foreign key references the rate_code_id primary key in the
rate_code dimension table.
● pickup_location_id: This foreign key references the pickup_location_id primary key
in the pickup_location dimension table.
● payment_type_id: This foreign key references the payment_type_id primary key in
the payment_type dimension table.
● dropoff_location_id: This foreign key references the dropoff_location_id primary key
in the dropoff_location dimension table.
Primary key:
● trip_id: The trip_id column is the primary key of the fact table. This means that it is a
unique identifier for each row in the table.
The foreign keys in the fact table allow us to relate the fact table to the dimension tables. This
allows us to perform complex queries on the data, such as finding the total fare amount for all
trips that started in a particular location or that were paid for with a credit card.
The primary key in the fact table is used to uniquely identify each row in the table. This is
important for ensuring the integrity of the data and for allowing us to perform efficient queries.
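To make this concrete, the hedged sketch below joins a miniature, made-up fact table to two of its dimension tables via the foreign keys listed above and computes the total fare for credit-card trips by pickup zone; the pickup_zone and payment_type_name columns are assumptions about the dimension attributes.

```python
import pandas as pd

# Hypothetical miniature tables following the star schema described above
fact_table = pd.DataFrame({
    "trip_id": [0, 1, 2, 3],
    "pickup_location_id": [0, 0, 1, 1],
    "payment_type_id": [0, 1, 0, 0],
    "fare_amount": [12.5, 30.0, 8.0, 22.0],
})
pickup_location_dim = pd.DataFrame({
    "pickup_location_id": [0, 1],
    "pickup_zone": ["Manhattan", "JFK Airport"],  # illustrative values
})
payment_type_dim = pd.DataFrame({
    "payment_type_id": [0, 1],
    "payment_type_name": ["Credit card", "Cash"],
})

# Join the fact table to its dimensions via the foreign keys, then aggregate
joined = (fact_table
          .merge(pickup_location_dim, on="pickup_location_id")
          .merge(payment_type_dim, on="payment_type_id"))

# Total fare amount for trips paid with a credit card, grouped by pickup zone
credit_card_fares = (joined[joined["payment_type_name"] == "Credit card"]
                     .groupby("pickup_zone")["fare_amount"].sum())
print(credit_card_fares)
```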
Dimension Table:
● Contains columns that describe attributes of the data being analyzed.
● Typically contains primary keys that link to fact tables.
● Contains columns that have low cardinality and don't change frequently.
● Contains columns that can be used for grouping or filtering data for analysis.
CODE SNIPPETS:
Datetime Dimension Table
Column Name    Data Type      Description
datetime_id    int            Unique identifier for the datetime record
date           varchar(10)    Date of the datetime record
time           varchar(8)     Time of the datetime record
weekday        varchar(10)    Day of the week of the datetime record (e.g., Sunday, Monday, Tuesday, etc.)
month          varchar(10)    Month of the datetime record (e.g., January, February, March, etc.)
quarter        varchar(10)    Quarter of the year of the datetime record (e.g., Q1, Q2, Q3, Q4)
year           varchar(4)     Year of the datetime record
day_of_year    int            Day of the year of the datetime record (e.g., 1, 2, 3, ...)
The Python script in the image creates a table called tbl_analytics in the
uber_data_engineering_project database. The table contains data about Uber rides, such as the
vendor ID, pickup and dropoff times, distance traveled, rate code, pickup location, fare amount,
and other charges.
The script joins five different tables to create the tbl_analytics table:
● fact_table: Contains raw data about Uber rides.
● datetime: Contains information about the date and time of each ride.
● trip_distance: Contains information about the distance traveled for each ride.
● rate_code: Contains information about the rate code for each ride.
● pickup_location: Contains information about the pickup location for each ride.
The tbl_analytics table can be used to analyze Uber rides, such as identifying the most popular
pickup and dropoff locations, the average fare amount for different rate codes, or the impact of
tolls and improvement surcharges on the total cost of rides.
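The script itself appears only as an image in the original report; the following is a hedged reconstruction of the kind of statement it runs, submitted through the google-cloud-bigquery client. The dataset name uber_data_engineering_project and the five table names come from the text above, while the exact column list and the USING join keys are assumptions based on the schema described earlier.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hedged reconstruction: join the fact table to four of its dimensions
# and materialize the result as tbl_analytics. Column names are assumptions.
query = """
CREATE OR REPLACE TABLE `uber_data_engineering_project.tbl_analytics` AS
SELECT
  f.trip_id,
  f.VendorID,
  d.tpep_pickup_datetime,
  d.tpep_dropoff_datetime,
  td.trip_distance,
  rc.rate_code_name,
  pl.pickup_latitude,
  pl.pickup_longitude,
  f.fare_amount,
  f.total_amount
FROM `uber_data_engineering_project.fact_table` AS f
JOIN `uber_data_engineering_project.datetime` AS d USING (datetime_id)
JOIN `uber_data_engineering_project.trip_distance` AS td USING (trip_distance_id)
JOIN `uber_data_engineering_project.rate_code` AS rc USING (rate_code_id)
JOIN `uber_data_engineering_project.pickup_location` AS pl USING (pickup_location_id)
"""

client.query(query).result()  # wait for the job to finish
print("tbl_analytics created")
```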
OUTPUT:
Data Extraction:
1. Importing libraries: The script starts by importing the necessary libraries: pandas for
data manipulation and datetime for date-related operations.
2. Loading data: The script reads the CSV file containing Uber ride information using the
pandas.read_csv() function and stores it in a DataFrame object named df.
3. Creating datetime dimensions: The script creates two DataFrames named
datetime_dim and passenger_count_dim to store the extracted datetime information
and passenger count information, respectively. It utilizes the .dt.hour, .dt.day,
.dt.month, .dt.year, and .dt.weekday attributes of the pandas .dt accessor to extract the
relevant information from the datetime columns (a sketch of these dimension-building
steps is shown after this list).
4. Creating trip distance dimension: The script creates a DataFrame named
trip_distance_dim to store the extracted trip distance information. It extracts the trip
distance values and assigns unique identifiers to each entry.
5. Mapping rate code: The script creates a dictionary named rate_code_type that maps
the rate code values to descriptive labels: "Standard rate", "JFK", "Newark", "Nassau
or Westchester", "Negotiated fare", and "Group ride".
6. Combining dimensions: The script joins the datetime_dim, passenger_count_dim,
trip_distance_dim, and rate_code_type DataFrames using the pandas.concat() function to
create a single DataFrame named tbl_analytics.
7. Dropping duplicates: The script removes duplicate rows from tbl_analytics using
.drop_duplicates().reset_index(drop=True) to ensure data integrity.
8. Renaming columns: The script renames the columns of tbl_analytics to make them
more descriptive: "datetime_id", "tpep_pickup_datetime", "pick_hour", "pick_day",
"pick_month", "pick_year", "pick_weekday", "tpep_dropoff_datetime", "drop_hour",
"drop_day", "drop_month", "drop_year", "drop_weekday", "passenger_count_id",
"passenger_count", "trip_distance_id", "trip_distance", and "rate_code".
Data Loading:
The platform of choice, GCP, played a pivotal role, leveraging Cloud Storage for scalable and
reliable data storage, Compute Engine for processing power, and BigQuery for powerful
analytics. The ETL pipeline, encompassing data extraction, transformation, and loading,
showcased the synergy between these components in ensuring a smooth flow of information.
Our focus on Python underscored its significance in AI and data science, serving as a versatile
tool for implementing the ETL processes. Through meticulous indexing and merging, we
optimized data handling, facilitating the creation of a robust model. The integration of cloud
services, particularly GCP, added a layer of scalability and real-time analytics, aligning with the
modern demands of data-driven applications.
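The loading code is not reproduced in this report; as a minimal sketch of how a transformed DataFrame can be loaded into BigQuery, the example below uses the google-cloud-bigquery client with a hypothetical destination table ID and a stand-in DataFrame.

```python
import pandas as pd
from google.cloud import bigquery  # pip install google-cloud-bigquery pyarrow

client = bigquery.Client()

# Hypothetical destination table; replace with your own project and dataset names
table_id = "my-project.uber_data_engineering_project.tbl_analytics"

# Stand-in for the transformed analytics DataFrame produced in the earlier steps
tbl_analytics = pd.DataFrame({"trip_id": [0, 1], "fare_amount": [12.5, 30.0]})

# Load the DataFrame into BigQuery, replacing any existing table contents
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
job = client.load_table_from_dataframe(tbl_analytics, table_id, job_config=job_config)
job.result()  # wait for the load job to complete
print(f"Loaded {job.output_rows} rows into {table_id}")
```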
In the end, the envisioned dashboard serves as the culmination of our efforts, providing a
user-friendly interface to interact with the processed data. This journey through ETL, Python,
and GCP reaffirms the importance of a well-orchestrated data pipeline in unlocking the full
potential of datasets for informed decision-making and insightful analytics.
REFERENCES:
A number of great references are available online; the following video was particularly
helpful for this project.
Video Reference:
https://www.youtube.com/watch?v=WpQECq5Hx9g&t=2621s