
2013 IEEE International Conference on Granular Computing (GrC)

Construct Rough Approximation based on GAE

Lin Shi
School of Computer Science and Technology, Dalian University of Technology, Dalian, China
e-mail: selinhil@126.com

Yang Zhou
School of Information and Communication Engineering, Dalian University of Technology, Dalian, China
e-mail: 751478976@qq.com

Jun Meng (corresponding author)
School of Computer Science and Technology, Dalian University of Technology, Dalian, China
e-mail: mengjun@dlut.edu.cn

Tsauyoung Lin
Department of Computer Science, San Jose State University, San Jose, USA
e-mail: tylin@cs.sjsu.edu

Abstract—Recently, cloud computing has emerged as a new paradigm which focuses on web-scale problems, large data centers, multiple models of computing and highly interactive web applications. It is highly available and scalable, offering distributed and parallel data storage and computing on top of a large number of cheap PCs. As a representative product, Google App Engine (GAE), a platform-as-a-service (PaaS) cloud computing platform, builds mainly on the Google File System (GFS) and the MapReduce programming model for massive data processing. This paper analyzes GAE from the viewpoint of granular computing (GrC) and explains why it is suitable for massive data mining. Further, we present an example of how to use it to construct the neighborhoods of a rough set and to compute lower and upper approximations accurately and strictly.

Keywords—cloud computing; Google App Engine; granular computing; rough set

I. INTRODUCTION

Google App Engine (GAE) [1] is a prominent and representative platform of cloud computing [2]. It embodies the idea of platform as a service (PaaS) [3]. Since the first version was released in 2008, it has attracted a lot of attention in computer science. As a parallel computing framework, it has been evaluated to give commendable support for the development of computationally intensive high-performance computing (HPC) [4] algorithms and applications. Data submitted by users are stored in Google's distributed Bigtable [5] using the Google File System (GFS) [6]. Besides, it employs the MapReduce programming model [7, 8] to handle massive data computing. GAE supports the development of web applications for both scientific researchers and companies and is suitable for the mining of massive internet data.

Granular computing (GrC) [9] is a theory which simulates human thinking in solving complicated problems. Since the concept was proposed last century, researchers have taken a keen interest in information granulation. Generally, GrC involves a number of theories and techniques and acts as a method for massive data mining, the solution of complex problems, fuzzy information processing, etc. Rough set theory, quotient space theory and fuzzy information granulation constitute the three primary research models of GrC. Furthermore, any method or theory based on grouping, classification or clustering can fall into GrC theory. In 1979, the first concept of information granularity was proposed by Zadeh [10]. He held that "information granularity", as a new concept, exists in many domains, while it appears in different forms. Hobbs directly used "granularity" as the title of his research and discussed the divide and conquer of granules [11]. Lin proposed the "neighborhood system" in 1988 and studied the relationship between neighborhood systems and databases [12]. Since then, a lot of related work has been carried out.

This paper analyzes GAE in detail with the theory of GrC to present the reason why it is a suitable platform for massive data mining. We focus on the distributed storage and computing mechanism at the bottom of GAE, which shares the concept of "divide and conquer" with GrC. At last, a brief example is given to show how to construct the neighborhoods of a rough set and compute lower and upper approximations based on GAE.

The remainder of this paper is organized as follows. In section 2, some basic concepts and definitions about GrC, rough sets and neighborhoods are introduced. The frameworks of GFS and the MapReduce model are elaborated and analyzed with GrC theory in sections 3 and 4. In section 5, computing methods for granulation and for lower and upper approximations are proposed with the MapReduce model of GAE. Finally, section 6 concludes this paper.

II. GRANULAR COMPUTING MODEL AND ROUGH SET THEORY

GrC is a computing paradigm and conceptual theory of information processing. The main content of GrC is granulating the universe of a problem and gaining basic information granules.
It simulates human thinking, simplifies complicated problems and avails massive data mining, the solution of complex problems, fuzzy information processing, etc.

In this section, we review some basic concepts of GrC and rough set theory [13, 14, 15], then discuss the definitions of neighborhood and approximation in detail.

Definition 1. We use a triplet (IG, EG, FG) to describe a granule, where IG reflects the general characteristics, rules and common features of the elements inside the granule, EG is the collection of its individual elements and FG is a transform function on them.

Definition 2. Let DS = (U, A, V, f) be a decision system. U is a finite and non-empty set of objects called the universe and A is a non-empty set of attributes. C and D are two finite and non-empty subsets of attributes called the condition and decision attributes respectively, which meet the conditions C ∪ D = A and C ∩ D = ∅. For any a ∈ A: U → Va, Va is called the value set of attribute a, and V = ∪a∈A Va represents the value set of all attributes in DS. f is called an information function, with f(x, a) ∈ Va for any a ∈ A and x ∈ U.

Any subset of attributes B ⊆ A can be associated with a binary relation IND(B), called an indiscernibility relation.

Definition 3. The indiscernibility relation is denoted as:

IND(B) = {(x, y) ∈ U × U | f(x, b) = f(y, b), ∀b ∈ B},  (1)

where B denotes a subset of A. This relation is an equivalence relation. For an object x ∈ U, all objects equivalent to x compose the equivalence class of x, denoted by [x]_B = {y ∈ U | (x, y) ∈ IND(B)}. All unique equivalence classes based on B compose the partition induced by B, and these equivalence classes are the elementary sets of rough set theory.

The main idea of rough set theory is to use two sets, each representable by elementary sets, to approximate an indescribable subset X of U. The two sets are key concepts in rough set theory and are named the lower and upper approximations of X, respectively.

Definition 4. For a subset of objects X ⊆ U and a subset of attributes B ⊆ C, the lower and upper approximations of X are:

B(X) = {x | x ∈ U ∧ [x]_B ⊆ X},  (2)

B̄(X) = {x | x ∈ U ∧ [x]_B ∩ X ≠ ∅}.  (3)

The lower approximation set B(X) contains all objects which can be certainly classified as objects of X based on the set of attributes B. The upper approximation set B̄(X) is the set of objects which can possibly be classified as objects of X.

Further, the classical Pawlak rough set model can be extended to a generalized rough set model by using an arbitrary binary relation instead of the equivalence relation, for example a tolerant relation, replacing equivalence classes with neighborhoods. We consider the tolerant relation because it is more suitable for noisy, approximate, incomplete or uncertain data modeling compared with the equivalence relation. The equivalence relation can be regarded as a special tolerant relation.

The neighborhood based on a tolerant relation is defined as follows.

Definition 5. For any object x ∈ U and a tolerant relation R, the neighborhood of x is:

N_R(x) = {y | xRy, y ∈ U}.  (4)

The lower and upper approximation operators in tolerant rough sets are defined as below.

Definition 6. For a subset of objects X ⊆ U and a tolerant relation R on U, the lower and upper approximation operators in the generalized rough set are:

apr_R(X) = {x | N_R(x) ⊆ X, x ∈ U},  (5)

apr̄_R(X) = {x | N_R(x) ∩ X ≠ ∅, x ∈ U}.  (6)
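
To make Definitions 5 and 6 concrete, the following is a minimal Python sketch (our own illustration, not part of the paper's algorithms; the universe, relation and target set are invented) that computes neighborhoods and both approximation operators:

# Sketch of Definitions 5 and 6 under a tolerant relation.
# The universe U, relation R and set X are illustrative only.
U = {1, 2, 3, 4, 5}
pairs = {(1, 2), (2, 3), (4, 5)}
# Tolerant relation: reflexive and symmetric, not transitive.
R = {(x, x) for x in U} | pairs | {(y, x) for (x, y) in pairs}

def neighborhood(x):
    # N_R(x) = {y | xRy, y in U}  (Definition 5, eq. 4)
    return {y for y in U if (x, y) in R}

def lower(X):
    # apr_R(X) = {x | N_R(x) is contained in X}  (eq. 5)
    return {x for x in U if neighborhood(x) <= X}

def upper(X):
    # apr'_R(X) = {x | N_R(x) meets X}  (eq. 6)
    return {x for x in U if neighborhood(x) & X}

X = {1, 2, 5}
print(lower(X))   # {1}: only N_R(1) = {1, 2} lies inside X
print(upper(X))   # {1, 2, 3, 4, 5}: every neighborhood meets X

Because R is reflexive, apr_R(X) ⊆ X ⊆ apr̄_R(X) always holds, which is the characteristic sandwich property of rough approximations.
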
III. GOOGLE FILE SYSTEM

GFS [6] is a distributed and scalable file system running on cheap commodity hardware. It supplies clients with high aggregate performance. The file system has been successfully and widely deployed in Google as a storage platform for large data sets. The largest cluster contains 1000 storage nodes with disk storage of up to 300 TB, which can be heavily and continuously accessed by hundreds of machines.

Generally, a GFS cluster contains one master node and multiple chunkservers running on cheap commodity machines. Files stored in GFS are divided into fixed-size chunks of 64 MB, and each chunk has a globally unique ID. Each file chunk is replicated on multiple chunkservers, three by default, on different racks. This ensures that the master node can clone existing replicas whenever a chunkserver goes offline or its data are corrupted. The master node is in charge of maintaining the control information, namespace, mappings and locations of each chunk. Chunkservers respond to users' requests under the control of the master node. The architecture is shown in Fig. 1.

Each chunkserver can be viewed as a granule constructed by a relation S. Denote all file chunks as a universe U. For any x, y ∈ U, if they satisfy the relation xSy, they are stored and maintained by the same local chunkserver. A client gets the description of the information of interest and the relevant data from a specific granule. Thus, GFS constructs multiple granules upon data files and guarantees access speed, data security and scalability when facing huge data scales and massive user requests.
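
As a toy illustration of this granular view (our sketch only; it is not GFS's real placement policy, which also weighs rack topology, disk usage and load), the following Python fragment cuts a file into 64 MB chunks and assigns each chunk, with three replicas, to chunkservers, so that every chunkserver ends up holding one granule of chunks:

# Toy sketch of GFS-style granulation: chunks form the universe,
# and each chunkserver holds one granule (the chunks it stores).
# Round-robin placement is an assumption for illustration only.
import itertools

CHUNK_SIZE = 64 * 1024 * 1024        # fixed 64 MB chunks
REPLICAS = 3                         # default replication factor

def num_chunks(file_size):
    # Number of fixed-size chunks needed for file_size bytes.
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place(n_chunks, servers):
    # Assign each chunk ID to REPLICAS consecutive servers.
    granules = {s: set() for s in servers}
    ring = itertools.cycle(servers)
    for chunk_id in range(n_chunks):
        for _ in range(REPLICAS):
            granules[next(ring)].add(chunk_id)
    return granules

servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]
print(place(num_chunks(300 * 1024 * 1024), servers))
# e.g. {'cs0': {0, 1, 3}, 'cs1': {0, 2, 3}, ...}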

Fig. 1. The architecture of Google file system.

Fig. 2. The MapReduce programming model.


IV. MAPREDUCE MODEL

MapReduce [7, 8] is one of the most significant core models, and its associated implementation, in GAE. It focuses on processing large data sets in a distributed computing environment and is amenable to a variety of real-world tasks. The MapReduce programming model is described by Dean [7].

The model takes a set of input (key, value) pairs and produces a set of output (key, value) pairs after a series of processing steps. Users implement the MapReduce process through two interfaces: Map and Reduce.

The Map function takes input pairs as (key1, value1). All input pairs are stored in a distributed environment and divided into small splits. Each split is processed by one map task, so the number of splits, i.e. the input scale, decides the total number of map tasks. Splits are divided intelligently by the model itself to maintain load balancing. After all map tasks are finished, the MapReduce library gathers the intermediate values with the same intermediate key and transfers them to the reduce function as pairs (key2, value2).

The Reduce function accepts an intermediate key2 and a set of values for key2. It then merges all values of the same key through a computing process and forms a smaller set of values (key3, value3). Each reduce task produces just one output or none. Unlike for map tasks, the number of reduce tasks is not decided by the scale of the data; it can be specified separately.

The different phases of the MapReduce model and the detailed data flow in GAE are shown in Fig. 2.

Next, we give a simple example to explain how the MapReduce programming model realizes the concept of "divide and conquer" and analyze the model using GrC theory.

Table I is an information table; we will trace in detail how the MapReduce programming model counts the number of occurrences of each word.

TABLE I. INFORMATION TABLE

U  C
1  what is GAE
2  GAE is a platform
3  what is cloud computing
4  cloud computing is a concept

In the MapReduce programming model, the real input data is a file in which each line is assigned a distinct key; we transform it into a table to aid understanding. From the point of view of GrC, this process can be illustrated by Fig. 3 with detailed data granulation and parallel computation steps.

As shown in Fig. 3, the 4 lines of sentences in the input data file, denoted as the universe U, are numbered automatically by the MapReduce model. Then a tolerance relation S is defined upon U. For any x, y ∈ U, if they satisfy the tolerant relation xSy, they are divided into the same split as an independent information granule G_Si. Each map task processes one granule with user-defined functions and outputs (key, value) pairs. The outputs from all map tasks are stored locally as a temporary set. When all map tasks are finished, the (key, value) pairs from the different map tasks are gathered and redivided by an equivalence relation P: for any x, y in the temporary set, if x.key equals y.key, they belong to the same partition G_Pi as a new information granule. Each G_Pi, with its unique key value, is then delivered to a reduce task. The final result combines all (key, value) pairs into one output file.

The whole process adequately reflects "divide and conquer". Such a programming model running in a distributed environment has sufficient ability to deal with massive data mining.
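
The word count of Table I can be simulated in a few lines of plain Python (a sketch of the model's data flow only; it is not the actual GAE MapReduce API):

# Simulation of the word-count example of Table I / Fig. 3.
# Plain Python stands in for map tasks, shuffle and reduce tasks.
from collections import defaultdict

lines = {1: "what is GAE",
         2: "GAE is a platform",
         3: "what is cloud computing",
         4: "cloud computing is a concept"}

def map_task(key1, value1):
    # Map: emit (word, 1) for each word of one line granule G_Si.
    return [(word, 1) for word in value1.split()]

intermediate = []                       # local outputs of all map tasks
for key1, value1 in lines.items():
    intermediate.extend(map_task(key1, value1))

groups = defaultdict(list)              # shuffle: regroup pairs by key,
for key2, value2 in intermediate:       # i.e. the equivalence relation P
    groups[key2].append(value2)

def reduce_task(key2, values):
    # Reduce: merge one key granule G_Pi into a single output pair.
    return key2, sum(values)

print(dict(reduce_task(k, v) for k, v in groups.items()))
# {'what': 2, 'is': 4, 'GAE': 2, 'a': 2, 'platform': 1,
#  'cloud': 2, 'computing': 2, 'concept': 1}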

Fig. 3. A MapReduce programming example analyzed with GrC theory.

V. INSTANCE ANALYSIS

Recently, the data explosion is emerging as one of the hottest topics in the computer realm, and traditional computing methods face huge challenges from it. Cloud computing, with its highly available and scalable computing ability and distributed storage structure, has become the most prominent countermeasure. We now present how GAE deals with massive data mining, using the computation of rough set approximations as a simple example.

A decision table S = (U, A, V, f) is given in Table II, where C1 and C2 are the condition attributes and D is the decision attribute.

TABLE II. DECISION TABLE

Object  C1  C2  D
x1      c1  c2  d1
x2      c2  c3  d2
x3      c3  c1  d2
x4      c2  c3  d2
x5      c1  c2  d1
x6      c3  c2  d2
x7      c3  c1  d1
x8      c1  c1  d2

Assume that there is a tolerant relation on the objects, where TC1 = {{c1,c2}, {c2,c3}}, short for TC1 = {{c1,c1}, {c2,c2}, {c3,c3}, {c1,c2}, {c2,c1}, {c2,c3}, {c3,c2}}, TC2 = {{c1}, {c2}, {c3}} and TD = {{d1}, {d2}}. Obviously, TC2 and TD satisfy an equivalence relation and can be regarded as special tolerant relations.

A. Information Granulation

Information granulation is a significant part of GrC issues, and it can be hard with massive data. Neighborhood construction is a granulation procedure in rough set theory.

During the past decades, many rough set based algorithms have been proposed. However, growing data volumes make them a challenging task. A previous study has proved the commendable behavior of MapReduce jobs for the computation of equivalence classes and rough set approximations [16]. In this instance, we try to extend the equivalence relation to a tolerance relation.

Firstly, we propose a parallel MapReduce method to construct the neighborhood of each object in the universe U.

Algorithm 1. MapReduce functions for neighborhood construction.

Map function input:
  key: document name
  value: Si = (Ui, C∪D, V, f)
Output:
  <key', value'> pairs, where key' is the information set of an object with respect to the condition attribute set C and value' denotes the object ID.
begin
  for each xi ∈ Ui do
    let value' = ID of xi
    let key' = ∅
    for each tempc1 in g(fC1(xi)) do
      for each tempc2 in g(fC2(xi)) do
        let key' = <tempc1, tempc2>
        output.collect(key', value')
      end
    end
  end
end

Here fCj(xi) denotes the jth condition attribute value of object xi, and g(fCj(xi)) gets the set of values in condition attribute Cj which have the tolerant relation with fCj(xi).

Reduce function input:
  key: information set of an object with respect to the condition attribute set C
  value: object ID
Output:
  <key', value'> pairs, where key' is the information set of the object with respect to the condition attribute set C and value' denotes the object ID set.
begin
  let key' = key
  let value' = ∅
  for each ID do
    let value' = value' ∪ ID
  end
  output.collect(key', value')
end

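
Before deploying Algorithm 1 as real map and reduce tasks, it can be prototyped in ordinary Python. In the sketch below (our illustration, with the tolerance classes of TC1 and TC2 written out as a lookup table, and names such as map_task ours), the map step emits one pair per tolerant description and the reduce step unions the IDs per key:

# Prototype of Algorithm 1 on the data of Table II.
# Plain Python stands in for the map and reduce tasks.
from collections import defaultdict

table = {"x1": ("c1", "c2"), "x2": ("c2", "c3"), "x3": ("c3", "c1"),
         "x4": ("c2", "c3"), "x5": ("c1", "c2"), "x6": ("c3", "c2"),
         "x7": ("c3", "c1"), "x8": ("c1", "c1")}

# g(.) per attribute: the values tolerant with v (TC1: c1~c2 and
# c2~c3; TC2 is an equivalence, i.e. plain equality).
g = [{"c1": ("c1", "c2"), "c2": ("c1", "c2", "c3"), "c3": ("c2", "c3")},
     {"c1": ("c1",), "c2": ("c2",), "c3": ("c3",)}]

def map_task(split):
    # Emit (<tempc1, tempc2>, object ID) for each tolerant value pair.
    for obj, (v1, v2) in split.items():
        for tempc1 in g[0][v1]:
            for tempc2 in g[1][v2]:
                yield (tempc1, tempc2), obj

def reduce_task(pairs):
    # Union all object IDs sharing the same description key.
    granules = defaultdict(set)
    for key, obj in pairs:
        granules[key].add(obj)
    return granules

granules = reduce_task(map_task(table))
print(sorted(granules[("c1", "c2")]))    # ['x1', 'x5'], i.e. C2" below
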
In the Map function, objects are partitioned into neighborhoods within their respective splits. In the Reduce function, neighborhoods with the same information set are combined together. The results are as follows:

C1' = <c1, c1>, C1" = {x8};
C2' = <c1, c2>, C2" = {x1, x5};
C3' = <c1, c3>, C3" = {x2, x4};
C4' = <c2, c1>, C4" = {x3, x7, x8};
C5' = <c2, c2>, C5" = {x1, x5, x6};
C6' = <c2, c3>, C6" = {x2, x4};
C7' = <c3, c1>, C7" = {x3, x7};
C8' = <c3, c2>, C8" = {x6};
C9' = <c3, c3>, C9" = ∅.

Each line of the results can be viewed as an information granule, where Ci' is the label of the granule and Ci" is the collection of objects contained in it. The neighborhood of an object xi based on the condition attribute set C can be found using the results above. For example, the condition values of object x1 are <c1, c2>, which equals C2', so N_C(x1) = C2" = {x1, x5}.

In this way, massive data are sorted and gathered for further analysis and processing. At the same time, the neighborhood construction method benefits from the MapReduce model and gains high speedup, scaleup and sizeup ability.

Next, in order to compute the approximation of each granule on the decision attribute, the universe U is granulated on D. A similar parallel process can be implemented as in Algorithm 1, using the decision attribute D in place of the condition attributes C. Thus, we get the following results:

D1' = <d1>, D1" = {x1, x5, x7};
D2' = <d2>, D2" = {x2, x3, x4, x6, x8}.

B. Rough Set Approximations Computation

In section A we obtained the neighborhoods on both the condition attributes and the decision attribute. This subsection gives the detailed process of computing the rough set approximations using the MapReduce model of GAE.

Firstly, a parallel algorithm is conducted to construct the associations between the neighborhoods on the condition attribute set and on the decision attribute. The detailed algorithm is shown as follows:

Algorithm 2. MapReduce functions for association construction.

Map function input:
  key: document name
  value: Si = (Ui, C∪D, V, f)
Output:
  <key', value'> pairs, where key' is the information of the attribute set and value' is ∅.
begin
  for each xi ∈ Ui do
    let key' = fC1(xi) + fC2(xi) + fD(xi)
    let value' = ∅
    output.collect(key', value')
  end
end

Reduce function input:
  key: information of the attribute set
  value: ∅
Output:
  <key', value'> pairs, where key' is the associated pair of condition and decision granules and value' = True.
begin
  let key' = h(key) + g(key)
  let value' = True
  output.collect(key', value')
end

Here h(key) gets the condition granule tag and g(key) gets the decision granule tag according to the content of the key.

We use Ass(Ci', Dj') = True, meaning that Ci' and Dj' are associated, to denote each output line of the Reduce function. Thus, the final result is Ass(C2', D1') = True, Ass(C6', D2') = True, Ass(C7', D2') = True, Ass(C8', D2') = True, Ass(C7', D1') = True, Ass(C1', D2') = True.
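
Algorithm 2 can be prototyped the same way (our sketch; h and g are realized here as simple projections of the concatenated description):

# Prototype of Algorithm 2 on Table II: map emits the full
# description of each object; reduce turns every distinct
# description into one association tag.
table = {"x1": ("c1", "c2", "d1"), "x2": ("c2", "c3", "d2"),
         "x3": ("c3", "c1", "d2"), "x4": ("c2", "c3", "d2"),
         "x5": ("c1", "c2", "d1"), "x6": ("c3", "c2", "d2"),
         "x7": ("c3", "c1", "d1"), "x8": ("c1", "c1", "d2")}

def map_task(split):
    # Emit (fC1(x) + fC2(x) + fD(x), empty value) per object.
    for obj, desc in split.items():
        yield desc, None

def h(key):
    # Condition granule tag: the (C1, C2) part of the key.
    return key[:2]

def g(key):
    # Decision granule tag: the D part of the key.
    return key[2]

# Reduce: one association per distinct description key.
associations = {(h(key), g(key)): True for key, _ in map_task(table)}
for cond, dec in sorted(associations):
    print("Ass(%s, %s) = True" % (cond, dec))
# Six tags, matching the result above: (c1,c2)-d1, (c2,c3)-d2,
# (c3,c1)-d2, (c3,c1)-d1, (c3,c2)-d2 and (c1,c1)-d2.
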
In this way, the associations between the condition and decision granules are confirmed, and the rough set approximations computation issue is transformed from one over massive data to one over a much smaller set of neighborhood tags. Next we implement a serial algorithm to compute the approximations of each decision granule.

With massive data, if we computed the approximations directly, memory might overflow, since the neighborhoods based on the condition attributes and on the decision attribute both contain too many objects. Hence, we use a special algorithm for the computation of the upper and lower approximations of each decision granule.

Algorithm 3. A serial algorithm to compute the approximations of each decision granule according to the association tags.

Input:
  Condition and decision granule tags and the associations between them.
Output:
  apr_R(Di"), apr̄_R(Di").
begin
  for each Di" do
    apr̄_R(Di") = ∅
    for each Ck' do
      if Ass(Ck', Di') = True then
        apr̄_R(Di") = apr̄_R(Di") ∪ Ck"
      end
    end
    output(apr̄_R(Di"))
  end
  for each Ck' do
    if Ass(Ck', D1') = True and Ass(Ck', D2') = True then
      set Ck'.bool = False
    end
  end
  for each Di" do
    apr_R(Di") = ∅
    for each Ck' do
      if Ass(Ck', Di') = True and Ck'.bool ≠ False then
        apr_R(Di") = apr_R(Di") ∪ Ck"
      end
    end
    output(apr_R(Di"))
  end
end
We now explain the detailed process of computing the upper and lower approximations of D1" and D2". When we compute the upper approximation of D1", we have the associations Ass(C2', D1') = True and Ass(C7', D1') = True. The object sets of C2" and C7" are combined into the upper approximation set, so we gain the upper approximation apr̄_R(D1") = {x1, x3, x5, x7} for D1". Similarly for D2", we have the associations Ass(C6', D2') = True, Ass(C7', D2') = True, Ass(C8', D2') = True and Ass(C1', D2') = True, and finally obtain the upper approximation set apr̄_R(D2") = {x2, x3, x4, x6, x7, x8}. For the lower approximation set of D1", while scanning each Ck', the boolean value of C7' is set to False. Two associations exist with D1", one of which has a False boolean value; thus, we only add C2" to the lower approximation and get the final result apr_R(D1") = {x1, x5}. Using the same method we get the lower approximation set of D2": apr_R(D2") = {x2, x4, x6, x8}.
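
For completeness, the following is a direct Python transcription of Algorithm 3 (our sketch; the granule contents and association tags are copied from the results above), and it reproduces the four approximation sets just derived:

# Transcription of Algorithm 3: the approximations of each
# decision granule are computed from association tags alone.
C = {"C1": {"x8"}, "C2": {"x1", "x5"}, "C6": {"x2", "x4"},
     "C7": {"x3", "x7"}, "C8": {"x6"}}     # only granules with tags
Ass = {("C2", "D1"), ("C6", "D2"), ("C7", "D2"),
       ("C8", "D2"), ("C7", "D1"), ("C1", "D2")}
D = ["D1", "D2"]

upper = {d: set() for d in D}
for d in D:                          # first loop: upper approximations
    for ck, objs in C.items():
        if (ck, d) in Ass:
            upper[d] |= objs

flagged = {ck for ck in C            # Ck'.bool = False for granules
           if all((ck, d) in Ass for d in D)}   # tied to both decisions

lower = {d: set() for d in D}
for d in D:                          # second loop: lower approximations
    for ck, objs in C.items():
        if (ck, d) in Ass and ck not in flagged:
            lower[d] |= objs

print(sorted(upper["D1"]), sorted(lower["D1"]))
# ['x1', 'x3', 'x5', 'x7'] ['x1', 'x5']
print(sorted(upper["D2"]), sorted(lower["D2"]))
# ['x2', 'x3', 'x4', 'x6', 'x7', 'x8'] ['x2', 'x4', 'x6', 'x8']
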
VI. CONCLUSION

Cloud computing has emerged as a popular method for dealing with massive data mining. With the information explosion on the internet, traditional computing methods cannot satisfy people's needs anymore. GAE, as a forerunner of cloud computing, has impeccable abilities for facing the distributed storage and computing challenges. GrC is a method which simulates human thinking. It simplifies complicated problems using "divide and conquer" methods and has been used in many academic sectors. When GrC meets cloud computing, we believe they will both make great progress.

ACKNOWLEDGMENT

This paper is supported by the Natural Science Foundation of Liaoning Province of China (No. 20130200029) and the National Students' Innovation and Entrepreneurship Training Program of China (No. 201210141014).

REFERENCES

[1] M. Malawski, M. Kuźniar, P. Wójcik and M. Bubak, "How to use Google App Engine for free computing", IEEE Internet Computing, vol. 17, 2013, pp. 50-59.
[2] V. Mauch, M. Kunze and M. Hillenbrand, "High performance cloud computing", Future Generation Computer Systems, vol. 29, August 2013, pp. 1408-1416.
[3] G. Lawton, "Developing software online with platform-as-a-service technology", Computer, 2008, pp. 13-15.
[4] K. Dowd, C. R. Severance and M. K. Loukides, High Performance Computing, 2nd ed., Sebastopol, CA: O'Reilly & Associates, 1998.
[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, et al., "Bigtable: a distributed storage system for structured data", ACM Transactions on Computer Systems, 2008.
[6] S. Ghemawat, H. Gobioff and S. T. Leung, "The Google file system", ACM SIGOPS Operating Systems Review, vol. 37, ACM, 2003.
[7] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, 2008, pp. 107-113.
[8] J. Ekanayake, S. Pallickara and G. Fox, "MapReduce for data intensive scientific analyses", IEEE Fourth International Conference on e-Science, 2008, pp. 7-12.
[9] T. Y. Lin, "Granular computing: practices, theories, and future directions", Computational Complexity, Springer New York, 2012, pp. 1404-1420.
[10] L. A. Zadeh, "Fuzzy sets and information granularity", Amsterdam: North-Holland Publishing, 1979.
[11] J. R. Hobbs, "Granularity", Proceedings of the Ninth International Joint Conference on Artificial Intelligence, 1985.
[12] T. Y. Lin, "Neighborhood systems and relational databases", Proceedings of the 1988 ACM Sixteenth Annual Conference on Computer Science, 1988.
[13] Z. Z. Shi, Z. Zheng and Z. Q. Meng, "Image segmentation-oriented tolerance granular computing model", IEEE International Conference on Granular Computing, 2008.
[14] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.
[15] Y. Y. Yao and T. Y. Lin, "Generalization of rough sets using modal logic", Intelligent Automation and Soft Computing, 1996, pp. 103-120.
[16] J. Zhang, T. Li, D. Ruan, et al., "A parallel method for computing rough set approximations", Information Sciences, 2012, pp. 209-223.

