Problem Statement of This Dataset
Telecom companies record their users’ activity data, i.e. calls, SMS and internet usage for billing
and monitoring purposes. This data is commonly called Call Detail Record (CDR) data, although,
apart from calls, it also covers text messages, internet usage, etc.
Since a typical telecom company has millions of users, they generate massive quantities of data.
In this assignment, you will analyse the ‘Milano Grid’ telecom data, which contains the
telecommunication activity recorded over the city of Milano in Italy. The data was collected by
the Telecom Italia cellular network.
The city of Milano is divided into various grids, or squares, as shown below, each denoted in this
Please note that:
All the square ids belong to the city of Milano. The country code corresponding to Milano is 39
(Italy).
Country code 0 means that the telco operator doesn't know the country of origin/destination or
that the user asked to have this info hidden for privacy reasons. For the purpose of the
Square id: the id of the square that is part of the Milano GRID; TYPE: numeric
Time interval: the beginning of the time interval, expressed as the number of
milliseconds elapsed since the Unix epoch (January 1st, 1970, 00:00 UTC). The end of the time
interval can be obtained by adding 600000 milliseconds (10 minutes) to this value. TYPE:
numeric
Country code: the phone country code of a nation. Depending on the measured activity
this value assumes different meanings that are explained later. TYPE: numeric
SMS-in activity: the activity in terms of received SMS inside the Square id, during the
Time interval and sent from the nation identified by the Country code. TYPE: numeric
SMS-out activity: the activity in terms of sent SMS inside the Square id, during the Time
interval and received by the nation identified by the Country code. TYPE: numeric
Call-in activity: the activity in terms of received calls inside the Square id, during the
Time interval and issued from the nation identified by the Country code. TYPE: numeric
Call-out activity: the activity in terms of issued calls inside the Square id, during the
Time interval and received by the nation identified by the Country code. TYPE: numeric
Internet traffic activity: the activity in terms of performed internet traffic inside the
Square id, during the Time interval and by the nation of the users performing the connection
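The Time interval arithmetic described in the schema above can be sketched in Python. This is only an illustration and not part of the required HQL submission; the function name interval_bounds and the sample timestamp are hypothetical and are not taken from the dataset.

```python
from datetime import datetime, timedelta, timezone

# 10 minutes expressed in milliseconds, as stated in the schema
INTERVAL_MS = 600_000

# Reference point for the millisecond counter
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def interval_bounds(start_ms):
    """Return (start, end) of a 10-minute interval as UTC datetimes.

    start_ms is the Time interval value: milliseconds since the Unix epoch.
    """
    start = UNIX_EPOCH + timedelta(milliseconds=start_ms)
    end = start + timedelta(milliseconds=INTERVAL_MS)
    return start, end


# Hypothetical sample value, chosen only for illustration
start, end = interval_bounds(1383260400000)
print(start.isoformat(), "to", end.isoformat())
```

Adding a timedelta to a fixed UTC epoch keeps the arithmetic in integers and avoids any dependence on the local time zone of the machine running the code.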
Important notes
Files are in .txt format. If no activity was recorded for a field specified in the schema above, then
the corresponding value is missing from the file. For example, if for a given combination of the
Square id 's', the Time interval 'i' and the Country code 'c' no SMS was received (so, SMSin is
activity, Callin is the value corresponding to the Call-in activity, Callout is the value
corresponding to the Call-out activity and internettraffic is the value corresponding to the
Please note that since this .txt file doesn’t have any column names as the first row, you don’t
row will be skipped and you will get wrong outputs to all the questions listed below.
Please note that the answer to every question should be only ONE numeric value. Whenever you
value, please show an aggregated value, combining all the squareids or the countries.
The dataset is of size ~6MB, which should not be deleted. Hence import the entire data in
an external table.
The data has a lot of empty cells. Import it such that the empty cells are treated as NULL
in Pig.
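The effect of treating empty cells as NULL can be illustrated outside the query engine with a short Python sketch. It assumes, without confirmation from the source, that fields are tab-separated and appear in the schema order given earlier. The names parse_record, total_activity and time_interval, as well as the sample line, are hypothetical. Counting a missing value as zero when summing is also an assumption, based on the note that a value is absent when no activity was recorded.

```python
# Column order assumed to follow the schema; time_interval is an assumed name
FIELDS = ["square_id", "time_interval", "country_code",
          "sms_in", "sms_out", "call_in", "call_out", "internet_traffic"]
INT_FIELDS = {"square_id", "time_interval", "country_code"}
ACTIVITY_FIELDS = FIELDS[3:]


def parse_record(line, sep="\t"):
    """Split one line into a dict, mapping empty cells to None (NULL)."""
    parts = line.rstrip("\r\n").split(sep)
    # Pad rows whose trailing empty cells were dropped
    parts += [""] * (len(FIELDS) - len(parts))
    record = {}
    for name, raw in zip(FIELDS, parts):
        raw = raw.strip()
        if raw == "":
            record[name] = None
        elif name in INT_FIELDS:
            record[name] = int(raw)
        else:
            record[name] = float(raw)
    return record


def total_activity(record):
    """Sum the five activity fields, counting None as 0 (an assumption)."""
    return sum(record[name] or 0.0 for name in ACTIVITY_FIELDS)


# Hypothetical line: sms_out and call_out are empty
sample = "1\t1383260400000\t39\t0.5\t\t0.25\t\t2.0\n"
rec = parse_record(sample)
print(rec["sms_out"], total_activity(rec))
```

The same concern carries over to HQL, where arithmetic involving NULL yields NULL, so a sum across the activity columns needs explicit handling of missing values.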
You are working as an analyst responsible for analyzing telecom data of three countries
(country_code = 39, 43, 33). First create a table from the external table created in task 1, to store
the data corresponding only to these three countries. Do the following analysis on these three
countries.
How many grids (square_ids) are there in total in the given three countries? Display the
country.
Which country among the given three has the second highest total activity? Note that
total activity is defined as the sum of sms_in, sms_out, call_in, call_out, internet_traffic. Display
Which squareID has the maximum total SMS activity in these three countries? Note that
What is the total activity for the three countries? Note that total activity is defined as
Now, say you want to analyze a specific country with country code=39.
What is the total call activity from the three square_ids to country_code 39?
What is the total SMS activity from the three square_ids to country_code 39?
What is the total activity, i.e. sum of CallIn, CallOut, SMSIn, SMSOut, internet traffic of
You are supposed to code entirely in Hive Query Language (HQL). For each question, write the
code in a well-commented notepad which you will submit at the end. The notepad should have