
Problem Statement

Telecom companies record their users’ activity data, i.e. calls, SMS and internet usage, for billing
and monitoring purposes. This data is commonly called Call Detail Record (CDR) data, though
apart from calls, it also contains data on text messages, internet usage, etc.

Since a typical telecom company has millions of users, it generates massive quantities of data.

In this assignment, you will analyze the ‘Milano Grid’ telecom data, which contains the
telecommunication activity recorded by the Telecom Italia cellular network over the city of
Milano in Italy.

The city of Milano is divided into various grids, or squares, as shown below, each denoted in this
dataset by a unique square_id.

• Please note that:
All the square ids belong to the city of Milano. The country code corresponding to Milano is 39.
Country code 0 means that the telco operator doesn't know the country of origin/destination or
that the user asked to have this info hidden for privacy reasons. For the purpose of the
assignment, you will assume that 0 is just another country code.

The schema of the dataset is described below:

• Square id: the id of the square that is part of the Milano GRID; TYPE: numeric

• Time interval: the beginning of the time interval expressed as the number of
milliseconds elapsed from the Unix Epoch on January 1st, 1970 at UTC. The end of the time
interval can be obtained by adding 600000 milliseconds (10 minutes) to this value. TYPE:

• Country code: the phone country code of a nation. Depending on the measured activity
this value assumes different meanings that are explained later. TYPE: numeric

• SMS-in activity: the activity in terms of received SMS inside the Square id, during the
Time interval and sent from the nation identified by the Country code. TYPE: numeric

• SMS-out activity: the activity in terms of sent SMS inside the Square id, during the Time
interval and received by the nation identified by the Country code. TYPE: numeric

• Call-in activity: the activity in terms of received calls inside the Square id, during the
Time interval and issued from the nation identified by the Country code. TYPE: numeric

• Call-out activity: the activity in terms of issued calls inside the Square id, during the
Time interval and received by the nation identified by the Country code. TYPE: numeric

• Internet traffic activity: the activity in terms of performed internet traffic inside the
Square id, during the Time interval and by the nation of the users performing the connection
identified by the Country code. TYPE: numeric
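
The Time interval arithmetic described in the schema can be sketched in Hive as follows. This is only an illustration: the timestamp literal is a hypothetical sample value, not taken from the dataset, and the column aliases are invented names. Note that from_unixtime expects seconds rather than milliseconds, and that it formats the result in the time zone of the Hive session or server, which may not be UTC.

```sql
-- Hypothetical sample value standing in for one Time interval entry (milliseconds since the Unix Epoch).
-- The end of the interval is the start plus 600000 ms (10 minutes), as stated in the schema.
-- from_unixtime takes seconds, so the millisecond value is divided by 1000 and cast to BIGINT.
SELECT
  1383260400000                                        AS interval_start_ms,
  1383260400000 + 600000                               AS interval_end_ms,
  from_unixtime(CAST(1383260400000 / 1000 AS BIGINT))  AS interval_start_ts;
```

On a real table, the literal would be replaced by whatever name you give the Time interval column.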

Important notes

Files are in .txt format. If no activity was recorded for a field specified in the schema above, then
the corresponding value is missing from the file. For example, if for a given combination of the
Square id 's', the Time interval 'i' and the Country code 'c' no SMS was received (so, SMSin is
empty), the corresponding record looks as follows:

s \t i \t c \t \t SMSout \t Callin \t Callout \t Internettraffic

where \t corresponds to the tab character, SMSout is the value corresponding to the SMS-out
activity, Callin is the value corresponding to the Call-in activity, Callout is the value
corresponding to the Call-out activity and Internettraffic is the value corresponding to the
Internet traffic activity.

Please note that since this .txt file doesn’t have column names in its first row, you must not
set tblproperties ("skip.header.line.count"="1"). If you do, the data stored in the first
row will be skipped and you will get wrong outputs for all the questions listed below.

Please note that the answer to every question should be only ONE numeric value. Whenever you
are unsure whether to display a country-wise result, a squareid-wise result, or an aggregated
value, please show an aggregated value, combining all the squareids or the countries.

Task 1: Understand the data in hand

• The dataset is about 6 MB in size and should not be deleted when the table is dropped. Hence,
import the entire data into an external table.

• The data has a lot of empty cells. Import it such that the empty cells are treated as NULL
in Hive.
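
One possible shape for the Task 1 table is sketched below. Everything named here is an assumption made for illustration: the table name telecom_cdr, the column names, the column types (the schema only says "numeric"), and the LOCATION path. The sketch deliberately omits skip.header.line.count, since the file has no header row. With Hive's default text SerDe, empty fields in numeric columns should already be read as NULL; the serialization.null.format property makes the empty-string-as-NULL rule explicit.

```sql
-- Sketch only. Table name, column names, column types and the HDFS path are assumptions.
-- EXTERNAL keeps the underlying file in place if the table is later dropped.
-- No skip.header.line.count property is set, because the file has no header row.
CREATE EXTERNAL TABLE IF NOT EXISTS telecom_cdr (
  square_id        INT,
  time_interval    BIGINT,
  country_code     INT,
  sms_in           DOUBLE,
  sms_out          DOUBLE,
  call_in          DOUBLE,
  call_out         DOUBLE,
  internet_traffic DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/telecom_cdr'
TBLPROPERTIES ('serialization.null.format' = '');

-- Quick sanity check: COUNT(column) skips NULLs, so the two counts differ
-- whenever some rows have an empty SMS-in field.
SELECT COUNT(*) AS total_rows, COUNT(sms_in) AS rows_with_sms_in
FROM telecom_cdr;
```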

Task 2: Writing Hive Queries - I

You are working as an analyst responsible for analyzing telecom data of three countries
(country_code = 39, 43, 33). First create a table from the external table created in Task 1, to store
the data corresponding only to these three countries. Do the following analysis on these three
countries:

• How many grids (square_ids) are there in total in the given three countries? Display the
number of unique grids in the three countries.

• Which country has the minimum total internet activity? Display the country code of this
country.

• Which country among the given three has the second highest total activity? Note that
total activity is defined as the sum of sms_in, sms_out, call_in, call_out, internet_traffic. Display
the country code of this country. Do not compress the table.

• Which squareID has the maximum total SMS activity in these three countries? Note that
total SMS activity is the sum of incoming and outgoing SMSes.

• What is the total activity for the three countries? Note that total activity is defined as
the sum of sms_in, sms_out, call_in, call_out, internet_traffic.
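
Because missing values are loaded as NULL, adding activity columns together row by row needs some care. The sketch below uses literal values only, with invented aliases, to show the behaviour; it assumes that a missing value should count as zero activity, which matches the note that a value is missing when no activity was recorded.

```sql
-- Adding NULL to a number yields NULL, so one empty field would blank out the whole row total.
SELECT 1.5 + CAST(NULL AS DOUBLE) AS naive_total;

-- COALESCE substitutes 0 for the missing value, so the remaining fields still count.
SELECT 1.5 + COALESCE(CAST(NULL AS DOUBLE), 0) AS null_safe_total;
```

SUM() over a single column already skips NULLs, so this issue only arises when several columns are added together before aggregating.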

Task 3: Writing Hive Queries - II

• Now, say you want to analyze a specific country with country code=39.

• What is the total call activity from the three square_ids to country_code 39?

• What is the total SMS activity from the three square_ids to country_code 39?

• What is the total activity, i.e. sum of CallIn, CallOut, SMSIn, SMSOut, internet traffic of
the three square_ids?

You are supposed to code entirely in Hive Query Language (HQL). For each question, write the
code in a well-commented notepad which you will submit at the end. The notepad should have
the entire code.
