Case Study Hadoop
Parkavi.A (1), Dr.N.Vetrivelan (2)
(1) Assistant Professor, CSE Department, M.S.Ramaiah Institute of Technology, Bangalore-560054, India (parkavi.a@msrit.edu)
(2) Professor, Department of Computer Applications, Periyar Maniammai University, Tanjore, Tamilnadu, India (nvetri@yahoo.com)
Abstract – In this paper a proposal is made for maintaining citizens’ information using geo-distributed data centers in different regions of the country, so that analytics can be performed over the citizen details to find statistics of citizens with respect to specific criteria. If big data analytics needs to be performed over citizen data that is stored across the country, it can be optimized using the data transformation graph technique. As the citizen information contains sensitive data, the personal information can be hidden using the anonymization technique.

Keywords – Map Reduce, Datacenters, Anonymization, Citizen system

I. INTRODUCTION

In India, information about citizens is known to, and recorded in files by, the administrators of villages, towns or zones of cities. That information can be collected and stored in data centers, which can be maintained in the zonal offices of the administrators[1]. The details of the head of the family and of the family members, including name, age, qualification, job, salary income, medical and health details, additional earning details and participation in events (which yields popularity and a good name for the country, with proofs of the information), have to be maintained in the files of the data centers. The citizen information has to be updated at least every year, after verifying each family in person with proofs.

A. Expected benefits to Government using the geo distributed data centers and hadoop framework

Nowadays governments issue many benefit schemes for the people, based on statistics about the people's needs collected from the regional authorities. Even so, some essential needs of the people may sometimes escape the notice of the government.

For example, consider a family with small children in which the father has met with an accident and has died, or is not in a position to work after recovering from the accident. If the father worked in a private job, the family cannot get benefits from a government scheme. In such cases the government has to run an analysis over such criteria to meet the needs of these families by providing educational facilities, training, and jobs for the wife or for children who have finished their education, etc. This kind of analysis can help remove poverty from our nation, and all families can be taken care of by the government. This was tedious, if not impossible, before. But now, by using the latest technologies, it is possible to work towards removing poverty from our country.

By using a Hadoop based system, big data analysis over the citizens' data[2] can be done to collect the required statistical data, and based on it the government can form committees to benefit the people by issuing proper schemes. [1][4]

II. MAP REDUCE FRAMEWORK ACROSS REGION WISE DATA CENTERS IN GEO DISTRIBUTED WAY

Big data analytics needs applications which create and manage bulk information to improve performance, monitoring and verification. The cloud resources divide the citizen files into chunks, so that they can be processed in parallel using the Hadoop framework. [1]

In the Hadoop framework, map reduce operations usually use “key,value pairs” as intermediate data. The Hadoop framework provides the facility of storing data near the place where it is analyzed. This ensures that, even though the data is collected in different regions of the country, it can be analyzed globally for statistical information by the central government.[4]

Where necessary, the data can be replicated to increase availability. For example, the details of revenue department staff can be replicated. This may be done for backup as well as for specific analysis. An analysis may be done to find out how many employees have two more years until retirement. Based on that result, the government can offer a scheme like voluntary retirement, so that new young people can be recruited for the revenue departments. This is one example scenario where replication speeds up the analysis and also helps to overcome failures.

The analyzed results of each region can be collected to produce a single data set. So individual data centers can be maintained at the regional centers to hold the citizen files.
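The map reduce flow described in Section II (files divided into chunks, key/value pairs as intermediate data, regional results collected into a single data set) can be sketched in plain Python using the retirement analysis as the example. This is only an illustrative sketch and does not use the Hadoop API. The record layout, the retirement age of 60, the region names and all function names are assumptions made for the example, not details given in the paper.

```python
from collections import defaultdict

# Assumption: retirement age is not stated in the paper; 60 is a placeholder.
RETIREMENT_AGE = 60


def map_employee(region, record, current_year):
    """Map step: emit a (region, 1) pair if the employee retires within two years."""
    years_left = RETIREMENT_AGE - (current_year - record["birth_year"])
    if 0 <= years_left <= 2:
        yield region, 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_regional_job(region, chunks, current_year):
    """Run map and reduce inside one regional data center.

    `chunks` is a list of splits; in a real cluster each split could be
    handled by its own mapper in parallel. Here they are processed in turn.
    """
    pairs = []
    for chunk in chunks:
        for record in chunk:
            pairs.extend(map_employee(region, record, current_year))
    return reduce_counts(pairs)


def merge_regional_results(results):
    """Collect the local results of all regions into a single data set."""
    merged = defaultdict(int)
    for local in results:
        for key, value in local.items():
            merged[key] += value
    return dict(merged)


# Hypothetical staff records, already divided into chunks per region.
regions = {
    "south": [
        [{"name": "A", "birth_year": 1955}, {"name": "B", "birth_year": 1980}],
        [{"name": "C", "birth_year": 1954}],
    ],
    "north": [
        [{"name": "D", "birth_year": 1956}],
    ],
}

local_results = [run_regional_job(r, chunks, 2013) for r, chunks in regions.items()]
national = merge_regional_results(local_results)
print(national)  # {'south': 2}
```

Only the small per-region counts travel to the central site, not the raw citizen files, which is the point of executing the analysis near the data.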
A. Optimizing the path of mapping and reducing jobs of big data analytics

[Figure: Citizen info in files, Anonymization, Data Analytics]

The geo-distributed citizen info system will have n regions of states in the country, so one data center can be maintained for each region. According to the analytics needed, a cluster of k data centers can be formed logically, based on their locality of interest. There are various possible execution paths for a map reduce job. [1][4]

Case 1: For example, to identify the number of cancer patients in a particular state, the analytics task can be executed near or within the data center, and the local result, which is a subset of the global result, can be sent to the job tracker running in the global mapper. In this case, distributed data is copied to the center running the mapper. [1]

… in map reduce manner. The sequences of jobs can be arranged in a hierarchical tree to represent their mapping and reducing. The data transformation graph can be used to find the optimized path for executing the map reduce jobs. The input files, which are divided into chunks called splits, will be assigned to the mappers of the analysis job. [1]

For maintaining the jobs, a group manager can be used in the system. It can be used to start the analysis jobs over the data centers. The data transformation graph (DTG) algorithm for finding the optimal execution path will also be carried out by the group manager. DTG can be …

The group manager of the analysis jobs can check the job managers by using a heartbeat checking mechanism. Thus the liveness of the job managers can be verified. In the same way, the task managers can be monitored by the job managers.

III. GENERALIZATION OF DATA TO PRESERVE THE PRIVACY OF CITIZENS USING MAP REDUCE FRAMEWORK

Privacy preservation is a very important criterion in the cloud, because applications like e-health records, e-
2013 IEEE International Conference on Computational Intelligence and Computing Research