College Report
Submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology
Rajasthan Technical University
By
Bhavika Sabnani
PCE16CE027
CANDIDATE'S DECLARATION
I, BHAVIKA SABNANI, hereby declare that I have undertaken 60 days of industrial training at
NetParam Technologies Pvt. Ltd. (an IT Division of C-DAC ATC NETCOM) during the period
from 04 May, 2019 to 04 July, 2019 in partial fulfillment of the requirements for the award of the degree.
The work being presented in the training report submitted to the Department of Computer
Engineering is an authentic record of my training work.
It has not been submitted anywhere else for the award of any degree, diploma or fellowship of
any University or Institution.
ACKNOWLEDGEMENT
A project of such vast coverage cannot be realized without help from numerous sources and
people in the organization. I am thankful to Mr. Shashikant Singhi, Chairman, PGC, and
Prof. (Dr.) Mahesh M. Bundele, Director, PCE, for providing me a platform to carry out such
a training successfully.
I am also very grateful to Dr. Surendra Yadav (HOD, CE) for his kind support.
I would like to take this opportunity to show my gratitude towards Mrs. Shalini Puri and
Dr. Sunil Pathak (Practical Training Seminar – 7CSTR), who helped me in the successful
completion of my Final Year Practical Training. They guided and motivated me, and were a
source of inspiration for me to carry out the necessary proceedings for the training to be
completed successfully.
I am also privileged to have Mr. Manish Kumar Sharma, who has furnished me with his
valuable facilities, without which this work could not have been completed.
I would also like to express my heartfelt appreciation to all of my friends, whose direct or
indirect suggestions helped me to develop this project, and to the entire team for their
valuable suggestions.
Lastly, thanks to all faculty members of Computer Engineering department for their moral
support and guidance.
ABSTRACT
Data generated by different sources is increasing day by day, which has given birth to a new term, “Big
Data”. Proper management and analysis of this data is required so that useful results can be produced from
it. By analyzing data, useful insights can be generated from it. Many organizations that are working on
Big Data benefit from the analysis of data. Hadoop is an open-source framework which provides different tools
to store and analyze Big Data. This report presents an analysis of the hotel booking datasets of two popular websites, Goibibo and MakeMyTrip.
TABLE OF CONTENTS
CHAPTER 1 : INTRODUCTION ............................................................ 1
2.3.2 MapReduce ..................................................................... 11
2.4 Hive ............................................................................ 13
3.1 Introduction .................................................................... 19
LIST OF FIGURES
LIST OF TABLES
Table 3  Difference between Tableau and Power BI .................................... 18
Table 4  Description of editgoibibo.csv ............................................. 20
LIST OF ABBREVIATIONS
Abbreviation  Full Form
BI Business Intelligence
MR MapReduce
CHAPTER I
INTRODUCTION
1.1 What is Data?
Data is raw facts or figures related to an object or event. All the facts (meaningful or meaningless)
collected about an object are termed as data. The processed form of data is known as information. Data
can be categorized into three types:
i. Structured
ii. Unstructured
iii. Semi-structured
The term “Big Data” refers to a huge amount of data that may be structured, semi-structured or
unstructured. Enterprises are generating a very large volume of data at a high speed. The reasons behind
this large volume of data are rapid innovations and emerging companies like Google, Yahoo, etc.
This data includes videos, images, sensor data, weblogs etc. generated from different sources like
mobile phones, military surveillance, scientific research, e-commerce websites and so on. This
has given birth to a new type of data called Big Data, which is unstructured or sometimes semi-
structured, and also unpredictable in nature. Social media websites generate data in real time, which
is increasing exponentially day by day. As the data is growing constantly, it is a tedious task to
manage and process such a huge volume of data. Organizations store a large amount of data and
often fail to create useful insights from that data, because analyzing a large data set is not easy.
A database can be defined as a collection of data. A database stores data in a systematic way, and
manipulation of data in a database is done with the help of a software program known as a Database
Management System (DBMS). A DBMS is software designed for the manipulation and management
of data stored in a database. This data manipulation includes the insertion of new data and the
updation or deletion of existing data. Some examples of Database Management Systems are
MySQL, Oracle, PostgreSQL, MariaDB, IBM DB2 etc.
2. Hierarchical Database
3. Network Database
4. Relational Database
5. Object-Oriented Database
6. Graph Database
An RDBMS cannot solve the Big Data problem, as it can only store structured data while the data
being generated has no fixed format: it can be structured, unstructured or semi-structured. Besides
this, an RDBMS has a slow processing speed, which means it takes a lot of time to process large
data sets. Due to these disadvantages of RDBMS, Hadoop came into existence, which provides a
solution to these problems.
Hadoop is the solution to Big Data problems. It is an open-source framework, written in Java and
developed by the Apache Software Foundation. Hadoop was created by Doug Cutting. It is based on
distributed computing, storing large data on a cluster of commodity hardware in a distributed
manner.
Big Data is a term used for a huge amount of data. This data constitutes both structured and
unstructured data that is growing at a very fast rate day by day. Organizations are facing challenges
in managing and analyzing this data to produce meaningful results, because traditional database
systems are not capable of handling it. Big Data is characterized by the following ‘V’s:
• Volume – Volume refers to the sheer amount of data being generated.
• Velocity – Velocity refers to the speed at which the data is generated and processed.
• Variety – Variety defines the type of data, as data can be categorized into structured, semi-
structured or unstructured.
• Veracity – Veracity is the quality or trustworthiness of the data. It shows the accuracy of the
data.
• Value – Value means how much meaningful or useful data we can extract.
In Hadoop 1 there are only two components – HDFS and MapReduce. HDFS is used for
storage. Hadoop 1 uses MapReduce both for data processing and for resource
management.
In Hadoop 2, HDFS is again used for storage, but MapReduce is used only for data
processing; a new component, YARN, was introduced for resource
management.
S. No | Hadoop 1 | Hadoop 2
1 | Supports the MapReduce (MR) processing model only. | Can work with MR as well as other distributed computing models like Spark, Hama, Giraph & HBase coprocessors.
2 | MR does both processing and cluster-resource management. | YARN is used for cluster resource management, and different processing models are used for processing.
3 | Supports only 4000 nodes per cluster. | Scalable up to 10000 nodes per cluster.
4 | Has a Single Point of Failure, because it has a single NameNode. | Supports a standby NameNode, in case of NameNode failure.
5 | Java version 6 was the minimum requirement. | Java version 8 is the minimum requirement.
6 | HDFS supports replication for fault tolerance. | HDFS supports erasure coding.
• Open Source
• Reliability
• Fault Tolerance
• High Availability
• Scalability
• Economic
• Distributed Processing
• HDFS
• MapReduce
• YARN
Along with these three core components, some other components of the Hadoop framework are –
• Hive
• Pig
• Sqoop
• HBase
It is the most important component of the Hadoop Ecosystem. HDFS is the primary storage system of
Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable,
fault-tolerant, reliable and cost-efficient data storage for Big Data. HDFS is a distributed file
system that works on a master–slave architecture.
• NameNode – It is also known as the Master Node. It is responsible for storing the metadata,
i.e. the number of blocks created, the locations of the blocks, details of the Datanodes on
which the data is stored, and other information about the files.
• DataNode – It is also known as the Slave Node. The HDFS Datanode stores the actual file
data in the form of blocks in HDFS. The Datanode performs data read and write operations.
Datanodes also perform block creation, deletion and replication tasks on the
instruction of the NameNode.
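The block-based storage described above can be illustrated with a small sketch. This is a toy simulation only, not the real HDFS client API; the 128 MB block size and replication factor of 3 are HDFS's defaults, and the round-robin placement is a simplification of HDFS's rack-aware policy.

```python
# Toy simulation: how HDFS conceptually splits a file into fixed-size
# blocks and assigns each block's replicas to DataNodes.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def assign_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Round-robin each block's replicas across DataNodes (real HDFS
    uses rack-aware placement; round-robin is a simplification)."""
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = assign_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Each block ends up on three different nodes, so the loss of any single DataNode never loses data.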
Hadoop MapReduce is the processing component of Hadoop ecosystem, which provides parallel
processing of data. All the data which is stored in the Hadoop Distributed File system is processed
by MapReduce programs. Due to this parallel processing of data large datasets can be processed at
a faster rate.
• Map phase
• Reduce phase
Each phase takes input in the form of key-value pairs and also produces output in key-value
form. There are two functions by which the processing is accomplished – the map
function and the reduce function. The map function takes a set of data as input and produces a set of
key-value pairs. The output generated by the map function is called the intermediate output. The
output from the map function is supplied as input to the reduce function, and the final result is
generated by the reduce function.
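The two phases above can be sketched with the classic word-count example. This is a plain-Python simulation of the map, shuffle and reduce steps, not actual Hadoop code:

```python
from collections import defaultdict

# Toy MapReduce word count: map emits (word, 1) pairs, the shuffle
# groups intermediate values by key, and reduce sums each group.

def map_phase(line):
    """Map: take one line of text, emit intermediate (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into the final result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "hadoop stores big data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(intermediate))
# result["big"] == 3, result["data"] == 2
```

In real Hadoop the map and reduce functions run in parallel on different nodes, with the framework performing the shuffle between them.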
Apache YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop.
YARN was introduced in Hadoop 2.x. YARN allows different data processing engines – graph
processing, interactive processing, stream processing as well as batch processing – to run and
process data stored in HDFS (Hadoop Distributed File System). Apart from resource management,
YARN is also used for job scheduling. The main components of YARN are –
• Resource Manager
• Node Manager
Apache Hive is an open-source data warehouse, which is built on top of Hadoop. It is used for
querying and analyzing large amounts of data (structured or semi-structured) stored in HDFS.
Writing MapReduce jobs for a task is complex, but with Hive, data can be analyzed using
SQL queries. Hive uses an SQL-like language known as Hive Query Language (HQL).
The SQL-like queries written by the programmer are automatically translated into MapReduce jobs.
• Metastore – It stores metadata for each of the tables like their schema and location.
• Driver – It acts like a controller which receives the HiveQL statements. The driver starts
the execution of the statement by creating sessions. It monitors the life cycle and progress
of the execution. Driver stores the necessary metadata generated during the execution of a
HiveQL statement.
• Compiler – It performs the compilation of the HiveQL query, converting the query to an
execution plan. The compiler in Hive first converts the query to an Abstract Syntax Tree
(AST), checks for compatibility and compile-time errors, and then converts the AST into an
optimized DAG.
• Executor – Once compilation and optimization complete, the executor executes the tasks.
• CLI, UI, and Thrift Server – CLI (command-line interface) provides a user interface for
an external user to interact with Hive. Thrift server in Hive allows external clients to
interact with Hive over a network, similar to the JDBC or ODBC protocols.
Partitioning is a way of grouping the related parts of a table based on the values of particular
columns like city, date etc. Using partitioning, a large table is divided into small partitions so that
only the relevant partitions need to be scanned for a query. Hive supports two types of partitioning:
• Static Partitioning
• Dynamic Partitioning
Bucketing in Hive is used to segregate Hive table data into multiple files or directories, which
makes sampling and joins more efficient.
• The data present in partitions can be divided further into buckets.
• The division is performed based on the hash of the particular columns that we selected in the
table.
• Buckets use a form of hashing algorithm at the back end to read each record and place it
into buckets.
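The hash-based placement described above can be sketched as follows. This is a simplified illustration: Hive computes `hash(column) % numBuckets` with its own hash function, so here Python's `zlib.crc32` merely stands in for it, and the city values are invented for the example.

```python
import zlib

# Sketch: bucketing assigns each record to one of a fixed number of
# buckets based on a hash of the chosen column (here, `city`).
NUM_BUCKETS = 4

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Deterministically map a column value to a bucket number."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

records = ["Jaipur", "Mumbai", "Delhi", "Jaipur", "Goa"]
buckets = {}
for city in records:
    buckets.setdefault(bucket_for(city), []).append(city)

# The same value always lands in the same bucket, which is what makes
# bucketed joins and sampling possible.
assert bucket_for("Jaipur") == bucket_for("Jaipur")
```

Because the mapping is deterministic, two bucketed tables with the same bucket count can be joined bucket-by-bucket without shuffling all the data.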
Apache Pig is an abstraction over MapReduce. When it comes to analyzing large sets of
data and representing them as data flows, we use Apache Pig. Generally, we use it
with Hadoop. By using Pig, we can perform all the data manipulation operations in Hadoop.
Pig offers a high-level language to write data analysis programs, which is called Pig Latin. Pig
Latin is a procedural programming language and fits very naturally in the pipeline paradigm. By
using Pig, complex queries involving many joins and filters can be performed easily.
• Atom – It is a simple atomic data value. It is stored as a string but can be used as either a
string or a number.
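The pipeline paradigm of Pig Latin (load, then filter, then group, then aggregate) can be mimicked in plain Python. The dataset and field layout below are invented purely for illustration; this is not Pig itself:

```python
from itertools import groupby
from operator import itemgetter

# Toy relation standing in for loaded data: (hotel, state, star).
hotels = [
    ("H1", "Rajasthan", 5),
    ("H2", "Rajasthan", 3),
    ("H3", "Goa", 5),
    ("H4", "Goa", 4),
]

# Pig: f = FILTER hotels BY star == 5;
five_star = [h for h in hotels if h[2] == 5]

# Pig: g = GROUP f BY state;  c = FOREACH g GENERATE group, COUNT(f);
five_star.sort(key=itemgetter(1))  # groupby needs sorted input
counts = {state: len(list(rows))
          for state, rows in groupby(five_star, key=itemgetter(1))}
# counts == {"Goa": 1, "Rajasthan": 1}
```

Each Python step corresponds to one Pig Latin statement, which is what "fits naturally in the pipeline paradigm" means: every statement consumes the relation produced by the previous one.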
Power BI is a Data Visualization and Business Intelligence (BI) tool provided by Microsoft that
allows the analysis and visualization of data by providing different visuals. Datasets from different
data sources can be imported into Power BI and converted into interactive dashboards and
BI reports.
Power BI provides a set of products, each of which has its own features. The different Power BI
products are –
• Power BI Service – The SaaS (software as a service) based online service (formerly known
as Power BI for Office 365, now referred to as PowerBI.com or simply Power BI).
• Power BI Mobile Apps – The Power BI mobile apps for Android and iOS devices, as well
as for Windows devices.
DAX (Data Analysis Expressions) is a formula expression language that can be used in different
BI and visualization tools. DAX is also known as a function language, because the full code is kept
inside functions.
• Aggregation functions
• Counting functions
• Logical functions
• Information functions
• Text functions
• Date functions
This section provides a detailed description of the project “Data Analysis of Goibibo and Make
My Trip”, which includes details of the dataset, the queries applied using Pig and Hive for the
analysis of the data, and the visualization of the results of those queries using Power BI.
As the number of hotels is increasing, it is necessary to analyze the data of these hotels. The
main aim behind this project is to produce useful insights from the available data by
analyzing it using different tools, and to generate a better visualization of that data which can
facilitate its users.
Here, we have analyzed the hotel booking datasets of two popular websites, Goibibo and
MakeMyTrip.
We have collected the data from http://www.kaggle.com, a popular website that provides
datasets to both students and professionals working in the field of Data Science.
• editgoibibo.csv
• editmake.csv
This file contains the details of the hotels listed in the MakeMyTrip dataset.
Hotel star rating | Int (0–5) | Hotels can range from 1-star to 5-star hotels.
• To show the number of hotels for each star rating for 10 states.
➢ insert into table countstarrating select state, count(id), h_star from goibibo group by state, h_star;
➢ create table c_ptype (p_type string,c_ptype int) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';
➢ insert into table c_ptype select p_type,count(*) from goibibo group by p_type;
➢ insert into table go_statemaxh select state, count(*) as c from goibibo group by state order by c desc limit 1;
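What the `go_statemaxh` query computes – a count per state, keeping only the single largest – can be sketched in Python. The list of states below is illustrative only, not taken from the real dataset:

```python
from collections import Counter

# Illustrative stand-in for the `state` column of the goibibo table.
states = ["Rajasthan", "Goa", "Rajasthan", "Kerala", "Rajasthan", "Goa"]

# Equivalent of: select state, count(*) as c ... group by state
per_state = Counter(states)

# Equivalent of: order by c desc limit 1
top_state, top_count = per_state.most_common(1)[0]
# top_state == "Rajasthan", top_count == 3
```

Hive performs the same group-and-count as a MapReduce job over the full table, but the logic is identical.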
state,round(avg(site),2),round(avg(service),2),round(avg(amenity),2),round(avg(food),2),ro
Fig 3.9 Average of Food and Drinks rating for hotels of different ratings
• Count the number of hotels in each state having 100 guest recommendation and a 5 rating for all parameters.
p_name:chararray, p_type:chararray, city:chararray, state:chararray, guest:int, h_star:int, latitude:float, longitude:float, review:int, c_rev:int, site:float, service:float, amenities:float, food:float, location:float, clean:float);
➢ f = filter goibibo by guest==100 and site==5 and service==5 and amenities==5 and food==5 and location==5 and clean==5;
➢ gp = group f by state;
l = limit m 1;
generate FLATTEN(l);
};
l = limit m 1;
generate FLATTEN(l);
};
➢ insert into table mmt_citymaxh select city, count(*) as c from mmt group by city order by c desc limit 1;
➢ CREATE TABLE mmt11(p_type string, `count` int) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';
➢ insert into table mmt11 select p_type,count(*) from mmt group by p_type;
as (id:int, p_name:chararray, p_type:chararray, city:chararray, h_star:int, latitude:double, longitude:double, mmt_review_score:float, mmt_tripadvisor_count:int, room_types:chararray, site_review_rating:float);
FLATTEN(group)as(city,p_type,h_star),COUNT(make.p_name)as count;
l = limit m 1;
generate FLATTEN(l);
};
who use online websites like Goibibo and MakeMyTrip for hotel booking. By using this analysis,
they can differentiate between hotels based on their property type as well as their ratings.
The second type of users are business people, as this analysis is also useful for business purposes.
Using this analysis, hotel owners can compare their hotels with other hotels present in their area and
can find out what kind of improvements they need to make in their hotels to provide a better
customer experience.
Along with these datasets (Goibibo and MakeMyTrip), we can also use the data of other online
websites. By analyzing that data, more accurate results can be generated. Besides this, a comparison
between different booking websites can also be made.
We can also apply different ML and AI algorithms in our project to improve its efficiency and to
generate better predictions.
Thus the task of Big Data analysis is not only important but also a necessity. In fact,
many organizations that have implemented Big Data are realizing a significant competitive
advantage compared to organizations with no Big Data efforts. The analysis of hotel data is
important as it gives an easy way of comparing hotels in a particular area, and the
visualization of the analyzed data provides the desired information to its users in an easier way.