College Report
Submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology
Rajasthan Technical University
By
Bhavika Sabnani
PCE16CE027
CANDIDATE'S DECLARATION
I, BHAVIKA SABNANI, hereby declare that I have undertaken 60 days of industrial training at
NetParam Technologies Pvt. Ltd. (an IT Division of C-DAC ATC NETCOM) during the period
from 04 May, 2019 to 04 July, 2019 in partial fulfillment of the requirements for the award of the degree.
The work being presented in the training report submitted to the Department of Computer
Engineering is an authentic record of my training work.
It has not been submitted anywhere else for the award of any degree, diploma or fellowship of
any University or Institution.
ACKNOWLEDGEMENT
A project of such vast coverage cannot be realized without help from numerous sources and
people in the organization. I am thankful to Mr. Shashikant Singhi, Chairman, PGC, and
Prof. (Dr.) Mahesh M. Bundele, Director, PCE, for providing me a platform to carry out such
a training successfully.
I am also very grateful to Dr. Surendra Yadav (HOD, CE) for his kind support.
I would like to take this opportunity to show my gratitude towards Mrs. Shalini Puri and
Dr. Sunil Pathak (Practical Training Seminar – 7CSTR), who helped me in the successful
completion of my Final Year Practical Training. They guided and motivated me, and were a
source of inspiration for me to carry out the necessary proceedings for the training to be
completed successfully.
I am also privileged to have Mr. Manish Kumar Sharma, who has furnished me with his
valuable facilities, without which this work could not have been completed.
I would also like to express my heartfelt appreciation to all of my friends, whose direct or
indirect suggestions helped me to develop this project, and to the entire team for their
valuable suggestions.
Lastly, thanks to all faculty members of Computer Engineering department for their moral
support and guidance.
ABSTRACT
Data generated by different sources is increasing day by day, which has given birth to a new term, “Big
Data”. Proper management and analysis of this data is required so that useful results can be produced from
it. By analyzing data, useful insights can be generated from it. Many organizations that are working on
Big Data benefit from the analysis of data. Hadoop is an open-source framework which provides different tools
to store and analyze Big Data. This report presents an analysis of the hotel booking datasets of two popular websites, Goibibo and MakeMyTrip.
TABLE OF CONTENTS
CHAPTER 1 : INTRODUCTION ............................................................ 1
2.3.2 MapReduce ..................................................................... 11
2.4 Hive ............................................................................ 13
3.1 Introduction .................................................................... 19
LIST OF FIGURES
LIST OF TABLES
Table 3  Difference between Tableau and Power BI .................................... 18
Table 4  Description of editgoibibo.csv ............................................. 20
LIST OF ABBREVIATIONS
Abbreviation  Full Form
BI Business Intelligence
MR MapReduce
CHAPTER I
INTRODUCTION
1.1 What is Data?
Data is raw facts or figures related to an object or event. All the facts (meaningful or meaningless)
collected about an object are termed as data. The processed form of data is known as information. Data
can be categorized into three types:
i. Structured
ii. Unstructured
iii. Semi-structured
The term “Big Data” refers to a huge amount of data that may be structured, semi-structured or
unstructured. Enterprises are generating a very large volume of data at a high speed. The reasons behind
this large volume of data are rapid innovations and emerging companies like Google, Yahoo, etc.
This data includes videos, images, sensor data, weblogs etc. generated from different sources like
mobile phones, military surveillance, scientific research, e-commerce websites and so on. This
has given birth to a new type of data called Big Data, which is unstructured or sometimes semi-
structured, and also unpredictable in nature. Social media websites generate data in real time, which
is increasing exponentially day by day. As the data is growing constantly, it is a tedious task to
manage and process such a huge volume of data. Organizations store a large amount of data and
often fail to create useful insights from that data, because analyzing a large data set is not easy.
A database can be defined as a collection of data. A database stores data in a systematic way, and
manipulation of data in a database is done with the help of a software program known as a Database
Management System (DBMS). A DBMS is software designed for the manipulation and management
of data stored in a database. This data manipulation includes the insertion of new data and the
updation or deletion of existing data. Some examples of Database Management Systems are
MySQL, Oracle, PostgreSQL, MariaDB, IBM DB2 etc.
2. Hierarchical Database
3. Network Database
4. Relational Database
5. Object-Oriented Database
6. Graph Database
An RDBMS cannot solve the Big Data problem, as it can only store structured data while the data
being generated has no fixed format: it can be structured, unstructured or semi-structured. Besides
this, an RDBMS has a slow processing speed, which means it takes a lot of time to process large
data sets. Due to these disadvantages of RDBMS, Hadoop came into existence, which provides a
solution to these problems.
Hadoop is the solution to Big Data problems. It is an open-source framework, written in Java and
developed by the Apache Software Foundation. Hadoop was created by Doug Cutting. It is based on
distributed computing, storing large data on a cluster of commodity hardware in a distributed
manner.
Big Data is a term used for a huge amount of data. This data constitutes both structured and
unstructured data that is growing at a very fast rate day by day. Organizations are facing challenges
in managing and analyzing this data to produce meaningful results, because traditional database
systems are not capable of handling it. Big Data is characterized by the following ‘V’s:
• Volume – Volume refers to the sheer amount of data being generated.
• Velocity – Velocity refers to the speed at which the data is generated and processed.
• Variety – Variety defines the type of data, as data can be categorized into structured, semi-
structured or unstructured.
• Veracity – Veracity is the quality or trustworthiness of the data. It shows the accuracy of the
data.
• Value – Value means how much meaningful or useful data we can extract.
In Hadoop 1 there are only two components – HDFS and MapReduce. HDFS is used for
storage. Hadoop 1 uses MapReduce both for data processing and for resource
management.
In Hadoop 2, HDFS is again used for storage, but MapReduce is used only for data
processing; a new component, YARN, was introduced for resource
management.
S. No | Hadoop 1 | Hadoop 2
1 | Supports the MapReduce (MR) processing model only. | Can work with MR as well as other distributed computing models like Spark, Hama, Giraph & HBase coprocessors.
2 | MR does both processing and cluster-resource management. | YARN is used for cluster resource management, and different processing models are used for processing.
3 | Supports only 4000 nodes per cluster. | Scalable up to 10000 nodes per cluster.
4 | Has a Single Point of Failure, because it has a single NameNode. | Supports a standby NameNode, in case of NameNode failure.
5 | Java version 6 was the minimum requirement. | Java version 8 is the minimum requirement.
6 | HDFS supports replication for fault tolerance. | HDFS supports erasure coding.
• Open Source
• Reliability
• Fault Tolerance
• High Availability
• Scalability
• Economic
• Distributed Processing
• HDFS
• MapReduce
• YARN
Along with these three core components, some other components of the Hadoop framework are –
• Hive
• Pig
• Sqoop
• HBase
It is the most important component of the Hadoop Ecosystem. HDFS is the primary storage system of
Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable,
fault-tolerant, reliable and cost-efficient data storage for Big Data. HDFS is a distributed file
system that works on a master–slave architecture.
• NameNode – It is also known as the Master Node. It is responsible for storing the metadata,
i.e. the number of blocks created, the locations of the blocks, details of the Datanodes on
which the data is stored, and other information about the files.
• DataNode – It is also known as the Slave Node. The HDFS Datanode stores the actual file
data in the form of blocks in HDFS. The Datanode performs data read and write operations.
Datanodes also perform block creation, deletion and replication tasks on the
instruction of the NameNode.
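The block-based storage described above can be illustrated with a small sketch. This is a toy simulation only, not the real HDFS client API; the 128 MB block size and replication factor of 3 are HDFS's defaults, and the round-robin placement is a simplification of HDFS's rack-aware policy.

```python
# Toy simulation: how HDFS conceptually splits a file into fixed-size
# blocks and assigns each block's replicas to DataNodes.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def assign_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Round-robin each block's replicas across DataNodes (real HDFS
    uses rack-aware placement; round-robin is a simplification)."""
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = assign_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Each block ends up on three different nodes, so the loss of any single DataNode never loses data.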
Hadoop MapReduce is the processing component of Hadoop ecosystem, which provides parallel
processing of data. All the data which is stored in the Hadoop Distributed File system is processed
by MapReduce programs. Due to this parallel processing of data large datasets can be processed at
a faster rate.
• Map phase
• Reduce phase
Each phase takes input in the form of key-value pairs and also produces output in key-value
form. There are two functions by which the processing is accomplished – the map
function and the reduce function. The map function takes a set of data as input and produces a set of
key-value pairs. The output generated by the map function is called the intermediate output. The
output from the map function is supplied as input to the reduce function, and the final result is
generated by the reduce function.
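The two phases above can be sketched with the classic word-count example. This is a plain-Python simulation of the map, shuffle and reduce steps, not actual Hadoop code:

```python
from collections import defaultdict

# Toy MapReduce word count: map emits (word, 1) pairs, the shuffle
# groups intermediate values by key, and reduce sums each group.

def map_phase(line):
    """Map: take one line of text, emit intermediate (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into the final result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "hadoop stores big data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(intermediate))
# result["big"] == 3, result["data"] == 2
```

In real Hadoop the map and reduce functions run in parallel on different nodes, with the framework performing the shuffle between them.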
Apache YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop.
YARN was introduced in Hadoop 2.x. YARN allows different data processing engines – graph
processing, interactive processing, stream processing as well as batch processing – to run and
process data stored in HDFS (Hadoop Distributed File System). Apart from resource management,
YARN is also used for job scheduling. The main components of YARN are –
• Resource Manager
• Node Manager
Apache Hive is an open-source data warehouse, which is built on top of Hadoop. It is used for
querying and analyzing large amounts of data (structured or semi-structured) stored in HDFS.
Writing MapReduce jobs for a task is complex, but with Hive, data can be analyzed using
SQL queries. Hive uses an SQL-like language known as Hive Query Language (HQL).
The SQL-like queries written by the programmer are automatically translated into MapReduce jobs.
• Metastore – It stores metadata for each of the tables like their schema and location.
• Driver – It acts like a controller which receives the HiveQL statements. The driver starts
the execution of the statement by creating sessions. It monitors the life cycle and progress
of the execution. Driver stores the necessary metadata generated during the execution of a
HiveQL statement.
• Compiler – It performs the compilation of the HiveQL query, converting the query to an
execution plan. The compiler in Hive first converts the query to an Abstract Syntax Tree
(AST), checks for compatibility and compile-time errors, and then converts the AST into an
optimized DAG.
• Executor – Once compilation and optimization complete, the executor executes the tasks.
• CLI, UI, and Thrift Server – CLI (command-line interface) provides a user interface for
an external user to interact with Hive. Thrift server in Hive allows external clients to
interact with Hive over a network, similar to the JDBC or ODBC protocols.
Partitioning is a way of grouping the related parts of a table based on the values of particular
columns like city, date etc. Using partitioning, a large table is divided into small partitions so that
only the relevant partitions need to be scanned for a query. Hive supports two types of partitioning:
• Static Partitioning
• Dynamic Partitioning
Bucketing in Hive is used to segregate Hive table data into multiple files or directories, which
makes sampling and joins more efficient.
• The data present in partitions can be divided further into buckets.
• The division is performed based on the hash of the particular columns that we selected in the
table.
• Buckets use a form of hashing algorithm at the back end to read each record and place it
into buckets.
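The hash-based placement described above can be sketched as follows. This is a simplified illustration: Hive computes `hash(column) % numBuckets` with its own hash function, so here Python's `zlib.crc32` merely stands in for it, and the city values are invented for the example.

```python
import zlib

# Sketch: bucketing assigns each record to one of a fixed number of
# buckets based on a hash of the chosen column (here, `city`).
NUM_BUCKETS = 4

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Deterministically map a column value to a bucket number."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

records = ["Jaipur", "Mumbai", "Delhi", "Jaipur", "Goa"]
buckets = {}
for city in records:
    buckets.setdefault(bucket_for(city), []).append(city)

# The same value always lands in the same bucket, which is what makes
# bucketed joins and sampling possible.
assert bucket_for("Jaipur") == bucket_for("Jaipur")
```

Because the mapping is deterministic, two bucketed tables with the same bucket count can be joined bucket-by-bucket without shuffling all the data.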
Apache Pig is an abstraction over MapReduce. When it comes to analyzing large sets of
data and representing them as data flows, we use Apache Pig. Generally, we use it
with Hadoop. By using Pig, we can perform all the data manipulation operations in Hadoop.
Pig offers a high-level language to write data analysis programs, which is called Pig Latin. Pig
Latin is a procedural programming language and fits very naturally in the pipeline paradigm. By
using Pig, complex queries involving many joins and filters can be performed easily.
• Atom – It is a simple atomic data value. It is stored as a string but can be used as either a
string or a number.
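The pipeline paradigm of Pig Latin (load, then filter, then group, then aggregate) can be mimicked in plain Python. The dataset and field layout below are invented purely for illustration; this is not Pig itself:

```python
from itertools import groupby
from operator import itemgetter

# Toy relation standing in for loaded data: (hotel, state, star).
hotels = [
    ("H1", "Rajasthan", 5),
    ("H2", "Rajasthan", 3),
    ("H3", "Goa", 5),
    ("H4", "Goa", 4),
]

# Pig: f = FILTER hotels BY star == 5;
five_star = [h for h in hotels if h[2] == 5]

# Pig: g = GROUP f BY state;  c = FOREACH g GENERATE group, COUNT(f);
five_star.sort(key=itemgetter(1))  # groupby needs sorted input
counts = {state: len(list(rows))
          for state, rows in groupby(five_star, key=itemgetter(1))}
# counts == {"Goa": 1, "Rajasthan": 1}
```

Each Python step corresponds to one Pig Latin statement, which is what "fits naturally in the pipeline paradigm" means: every statement consumes the relation produced by the previous one.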
Power BI is a Data Visualization and Business Intelligence (BI) tool provided by Microsoft that
allows the analysis and visualization of data by providing different visuals. Datasets from different
data sources can be imported into Power BI and converted into interactive dashboards and
BI reports.
Power BI provides a set of products, each of which has its own features. The different Power BI
products are –
• Power BI Service – The SaaS (software as a service) based online service (formerly known
as Power BI for Office 365, now referred to as PowerBI.com or simply Power BI).
• Power BI Mobile Apps – The Power BI mobile apps for Android and iOS devices, as well
as for Windows devices.
DAX (Data Analysis Expressions) is a formula expression language that can be used in different
BI and visualization tools. DAX is also known as a function language, because the full code is kept
inside functions.
• Aggregation functions
• Counting functions
• Logical functions
• Information functions
• Text functions
• Date functions
This section provides a detailed description of the project “Data Analysis of Goibibo and Make
My Trip”, which includes details of the dataset, the queries applied using Pig and Hive for the
analysis of the data, and the visualization of the results of those queries using Power BI.
As the number of hotels is increasing, it is necessary to analyze the data of these hotels. The
main aim behind this project is to produce useful insights from the available data by
analyzing it using different tools, and to generate a better visualization of that data which can
facilitate its users.
Here, we have analyzed the hotel booking datasets of two popular websites, Goibibo and
MakeMyTrip.
We have collected the data from http://www.kaggle.com, a popular website that provides
datasets to both students and professionals working in the field of Data Science.
• editgoibibo.csv
• editmake.csv
This file contains the details of the hotels listed in the MakeMyTrip dataset.
Hotel star rating | Int (0–5) | Hotels can range from 1-star to 5-star hotels.
• To show the number of hotels for each star rating for 10 states.
➢ insert into table countstarrating select state, count(id), h_star from goibibo group by state, h_star;
➢ create table c_ptype (p_type string,c_ptype int) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';
➢ insert into table c_ptype select p_type,count(*) from goibibo group by p_type;
➢ insert into table go_statemaxh select state, count(*) as c from goibibo group by state order by c desc limit 1;
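What the `go_statemaxh` query computes – a count per state, keeping only the single largest – can be sketched in Python. The list of states below is illustrative only, not taken from the real dataset:

```python
from collections import Counter

# Illustrative stand-in for the `state` column of the goibibo table.
states = ["Rajasthan", "Goa", "Rajasthan", "Kerala", "Rajasthan", "Goa"]

# Equivalent of: select state, count(*) as c ... group by state
per_state = Counter(states)

# Equivalent of: order by c desc limit 1
top_state, top_count = per_state.most_common(1)[0]
# top_state == "Rajasthan", top_count == 3
```

Hive performs the same group-and-count as a MapReduce job over the full table, but the logic is identical.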
state,round(avg(site),2),round(avg(service),2),round(avg(amenity),2),round(avg(food),2),ro
Fig 3.9 Average of Food and Drinks rating for hotels of different ratings
• Count the number of hotels in each state having 100 guest recommendation and a 5 rating for all parameters.
p_name:chararray, p_type:chararray, city:chararray, state:chararray, guest:int, h_star:int, latitude:float, longitude:float, review:int, c_rev:int, site:float, service:float, amenities:float, food:float, location:float, clean:float);
➢ f = filter goibibo by guest==100 and site==5 and service==5 and amenities==5 and food==5 and location==5 and clean==5;
➢ gp = group f by state;
l = limit m 1;
generate FLATTEN(l);
};
l = limit m 1;
generate FLATTEN(l);
};
➢ insert into table mmt_citymaxh select city, count(*) as c from mmt group by city order by c desc limit 1;
➢ CREATE TABLE mmt11(p_type string, `count` int) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';
➢ insert into table mmt11 select p_type,count(*) from mmt group by p_type;
as (id:int, p_name:chararray, p_type:chararray, city:chararray, h_star:int, latitude:double, longitude:double, mmt_review_score:float, mmt_tripadvisor_count:int, room_types:chararray, site_review_rating:float);
FLATTEN(group)as(city,p_type,h_star),COUNT(make.p_name)as count;
l = limit m 1;
generate FLATTEN(l);
};
who use online websites like Goibibo and MakeMyTrip for hotel booking. By using this analysis,
they can differentiate between hotels based on their property type as well as their ratings.
The second type of users are business people, as this analysis is also useful for business purposes.
Using this analysis, hotel owners can compare their hotels with other hotels present in their area and
can find out what kind of improvements they need to make in their hotels to provide a better
customer experience.
Along with these datasets (Goibibo and MakeMyTrip), we can also use the data of other online
websites. By analyzing that data, more accurate results can be generated. Besides this, a comparison
between different booking websites can also be made.
We can also apply different ML and AI algorithms in our project to improve its efficiency and to
generate better predictions.
Thus the task of Big Data analysis is not only important but also a necessity. In fact,
many organizations that have implemented Big Data are realizing a significant competitive
advantage compared to organizations with no Big Data efforts. The analysis of hotel data is
important as it gives an easy way of comparing hotels in a particular area, and the
visualization of the analyzed data provides the desired information to its users in an easier way.