Incremental and SQL-Based Data Grid Mining Algorithm for Mobility Prediction of Mobile Users

U. Sakthi, Research Scholar, Computer Science Department, Anna University, Chennai, India (sakthi.ulaganathan@gmail.com)
R. S. Bhuvaneswaran, Assistant Professor, Ramanujan Computing Center, Anna University, Chennai, India
Abstract

In this paper, we propose a new SQL-based incremental distributed algorithm for predicting the next location of a mobile user in a mobile web environment. A parallel and distributed data mining algorithm is applied to the moving logs stored in a geographically distributed data grid to generate mobility patterns, which support various location-based services for mobile users. Existing work derives mobility patterns by re-executing the mining algorithm from scratch, which results in excessive computation. In our work, we have designed a new incremental algorithm that maintains the infrequent mobility patterns and thereby avoids unnecessary scans of the full database. We built a data grid system on a cluster of workstations using the open-source Globus Toolkit (GT) and the Message Passing Interface extended with Grid Services (MPICH-G2). The experiments were conducted on the original data sets with incremental additions of data, and the computation time was recorded for each data set. We analyzed our results for various sizes of data sets; they show that the time taken to generate mobility patterns by the incremental mining algorithm is less than that of the re-computing approach.

1. Introduction

Grid computing has emerged as an important new field, distinguished from conventional distributed systems by its focus on large-scale resource sharing and problem solving in dynamic, multi-institutional virtual organizations [7]. The Knowledge Grid (KG) is a parallel and distributed architecture that integrates data mining techniques [8] with grid technologies. The Knowledge Grid is exploited to perform distributed data mining on very large data sets available over grids, to find hidden valuable information, to build process models for making decisions, and to turn results into business decisions. In our work, a knowledge grid is developed to predict the next location of a mobile user in a mobile environment. By using the predicted location, the system effectively allocates resources to mobile users in the neighboring location, and it becomes possible to answer queries that refer to the future position of mobile users. A data grid is designed to allow large moving logs to be stored in repositories. In the business area it is necessary to develop an environment for analysis, inference and discovery over the data grid. Therefore, the evolution of the data grid is represented by the knowledge grid, which offers high-level services for distributed mining and extraction of knowledge from the data repositories available on the data grid.

Location management, or location tracking, is the mechanism of locating a particular mobile user at any given point of time. The two strategies for location tracking are location updating and location prediction [15]. Location updating is a passive method in which the system periodically updates the current location in a database. Location prediction is a dynamic strategy in which the system proactively estimates the mobile user's location based on a user movement model. A Personal Communication System (PCS) allows mobile users to move from one location to another, since these systems are based on the notion of wireless access [14]. In a mobile system, each mobile user is associated with a Home Location Register (HLR), which stores the up-to-date location of the mobile user. These logs accumulate into a large database, to which data mining techniques [13] are applied to find the frequently followed locations. Each Base Station (BS) in the PCS is connected with a separate Home Location Register, leading to a geographically distributed data grid node. The grid network was built with a cluster of grid nodes containing the moving logs of mobile users. Existing research applied data mining techniques to mobile data for path mining on a single database server; if the size of the moving logs is very large, the overhead of integrating them into one database server becomes significant.
The paper is organized as follows. In Section 2, the process of incremental mining of Mobility Patterns (MP) of mobile users in a grid environment is described. Section 3 describes the related work in distributed data mining, and Section 4 describes the services of the knowledge grid. Section 5 describes how inter-process communication is performed in MPICH-G2. Section 6 explains the distributed algorithm for mining mobility patterns in the grid environment. Sections 7 and 8 explain the process of generating mobility rules and mobility prediction. Section 9 describes the incremental mining algorithm that updates the mobility patterns when new moving logs are added to the original database. The performance analysis of the proposed method is described in Section 10. The conclusion is given in Section 11.

2. Mobility Pattern Mining

Location management is purely a database updating and querying procedure. Let UAP = ⟨l1, l2, …, ln⟩ be a sequence of locations. A database DB is a set of mobile user records, where each record contains a set of locations. Let DB = {DB1, DB2, ..., DBm}, where m is the number of data grid nodes in the grid network and DBk is the database stored in the kth data grid node. Local frequent locations are mined and communicated to all other grid nodes to find the global frequent locations, called mobility patterns. Mobility rules of the form A → B are generated from the mobility patterns; a rule has global confidence c if c% of the records in DB that contain A also contain A followed by B. The mobility rule A → B expresses that if the mobile user's current location is A, then he/she will move to location B. The mobility rule A → B has global support s if s% of the records contain A ∪ B.
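As a concrete illustration of these definitions, the following is a minimal sketch (the helper names and the four-record toy database are ours, not the paper's) of computing the global support of a pattern and the confidence of a mobility rule A → B:

```python
def contains(record, pattern):
    """True if the locations of `pattern` occur in `record` in order."""
    it = iter(record)
    return all(loc in it for loc in pattern)

def global_support(db, pattern):
    """Percentage of records in DB that contain `pattern`."""
    return 100.0 * sum(contains(r, pattern) for r in db) / len(db)

def confidence(db, head, tail):
    """Percentage of the records containing `head` that also contain
    `head` followed by `tail`."""
    head_count = sum(contains(r, head) for r in db)
    full_count = sum(contains(r, head + tail) for r in db)
    return 100.0 * full_count / head_count if head_count else 0.0

db = [(4, 6, 8), (4, 8, 5), (4, 6), (2, 3)]   # toy mobile-user records
s = global_support(db, (4, 6))                # 2 of 4 records -> 50.0
c = confidence(db, (4,), (6,))                # rule <4> -> <6>
```

For the rule ⟨4⟩ → ⟨6⟩, three of the toy records contain location 4 and two of them contain 4 followed by 6, so the confidence is about 66.7%.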
3. Related Works

In a mobile environment, the moving log size is very large, which increases the overhead of integrating all the moving logs into one database server [10]. A centralized mobility pattern algorithm was discussed in [11], but it is not efficient for large data sizes. Many parallel and distributed variants of the sequential apriori algorithm [9, 12] have been discussed elsewhere. In [1, 4], grid implementations of frequent item set mining deal with the sales transactions of a company; these algorithms cannot be used directly in our domain, because they do not take the network topology into account while generating the candidate patterns. The generation of candidate patterns in [13] is also not the same as candidate pattern generation in the mobile environment: in a PCS, only a sequence of neighboring locations of the network can be considered as a mobility pattern. We propose an incremental parallel and distributed knowledge grid mining approach based on the apriori algorithm for mining mobility patterns in the mobile environment.

4. Knowledge Grid

Knowledge Discovery in Databases (KDD) employs a variety of software tools, called data mining, to find useful patterns and models. The computational power of parallel and distributed systems is applied to the analysis of large volumes of data maintained over geographically distributed sites. This technique is called Parallel and Distributed Knowledge Discovery (PDKD). Grid computing is the most prominent emerging platform for high-performance distributed applications such as the knowledge discovery process. The knowledge grid is an integration of basic grid services, data grid services and computational grid services for distributed data mining and knowledge discovery [2]. We built our knowledge grid using the open-source Globus Toolkit [3].

4.1 Globus Toolkit Services

The basic grid components provided by the Globus Toolkit are:
1. Grid Security Infrastructure (GSI): Provides authentication (identity of the users and services) based on certificates produced by a certificate authority and the X.509 standard. Mutual authentication is provided by the Secure Socket Layer (SSL) protocol.
2. Monitoring and Discovery Service (MDS): An information service component that provides information about available grid resources and periodically collects their status. It provides static information such as hostname and operating system, and dynamic information such as CPU workload and memory status.
3. Globus Resource Allocation and Management (GRAM): Responsible for resource allocation, job creation, monitoring management and job control, and provides an interface for job submission on remote machines.
4. GridFTP: An extension of FTP for parallel data transfer, with file transfer based on GSI authentication mechanisms.
5. Heart Beat Monitor (HBM): Responsible for identifying Globus process failures and application process failures so that immediate recovery action can be taken.
6. Global Access to Secondary Storage (GASS): Exposes interfaces that help clients access data in a uniform manner from heterogeneous data sources. A data caching service is utilized to improve data access performance, and metadata catalog services allow us to search for a multitude of services based on object metadata attributes.
7. Dynamically-Updated Resource Online Co-allocator (DUROC): Acts as a coordinator between subjobs running on different computational grids.

4.2 Knowledge Grid Services

The knowledge grid contains a Core K-Grid layer and a High-level K-Grid layer [5]. The Core K-Grid layer performs two main services. 1. The Knowledge Directory Service (KDS) manages metadata about the data sources, the algorithms used for data mining, the mining results and the visualization tools. All metadata are represented as eXtensible Markup Language (XML) documents stored in the Knowledge Metadata Repository (KMR). The mobility patterns discovered by the execution of the distributed mining process are stored in the Knowledge Base Repository (KBR). 2. The Resource Allocation and Execution Management Service (RAEMS) generates the data mining process execution plan, which is stored in the Knowledge Execution Plan Repository (KEPR). The execution plan generates resource requests expressed in the Resource Specification Language (RSL) for GRAM. The High-level K-Grid layer supports the following services. 1. The Data Access Service (DAS) is responsible for accessing data for data mining. 2. The Tools and Algorithm Access Service (TAAS) is responsible for loading the tools and algorithms defined in the KDS. 3. The Execution Plan Management Service (EPMS) enables users to create an execution plan by assigning programs to data resources; on multiple executions of a program, different execution plans are created. 4. The Result Presentation Service (RPS) is responsible for presenting the mobility patterns stored in the knowledge base repository to the users.
5. Inter-Process Communication Using MPICH-G2

MPICH-G2 is a grid-enabled open-source library implementing the Message Passing Interface (MPI). It supports parallel and distributed data mining applications running on clusters of machines of different architectures. MPICH-G2 uses TCP for inter-machine communication and a vendor MPI API for intra-machine communication. Our distributed KMPM algorithm is executed on several computational grids using the MPICH-G2 component, which uses an RSL script for subjob execution. The design of our parallel and distributed mobility pattern mining application on the knowledge grid is shown in Figure 1. Initially, the Grid Security Infrastructure generates certificates for user authentication to sign on to other sites. The user can use the Monitoring and Discovery Service to select grid nodes based on memory, CPU load and network topology. The globusrun command is executed to submit the job on multiple machines by creating an MPI computation. MPICH-G2 uses RSL to specify the URLs of the computational and grid resources. The RSL script defines the job in the following way:

+
(&(resourceManagerContact="kalannia/jobmanager-pbs")
  (count=10)
  (label="subjob 0")
  (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
  (directory="/home/sakthi/gridmining")
  (executable="/home/sakthi/gridmining/KMPM")
)

Figure 1. Inter-process communication. (Diagram: the mpirun command generates a resource specification using MDS; multiple subjobs are submitted through GASS with the globusrun command; the DUROC coordinator dispatches subjobs to GRAM job managers after GSI authentication; each job manager creates a PBS job script for its local scheduler, which runs processes P1-P6; MPICH-G2 uses a vendor API within a cluster and TCP between clusters.)
6. Distributed Mining of Mobility Patterns

A pattern X is globally large if X.sup > S, and locally large if X.supi > Si, where S is the global minimum support and Si is the local minimum support at node Pi. Our KMPM algorithm is executed on n processing nodes in parallel. Intermediate results are transferred to other nodes using the send and receive commands of MPICH-G2.

The generation of candidate sequences is implemented as an SQL k-way join that removes infrequent subsequences [6]. Candidate subsequences are generated by maintaining the chronological order of locations. Previous work generates a large number of Local Candidate sequences of length k (LCk) from the original UAPs. In our work, we generate candidate sequences from the previously generated Local Mobility Patterns (LMPk-1) using an SQL k-way join operation, so that all subsequences of a new sequence must be frequent. Our algorithm generates a small number of candidate sequences, and hence involves less computation time in determining the global mobility patterns. The pseudocode for the Knowledge grid based Mobility Pattern Mining (KMPM) algorithm is given below:

// pseudocode for local mining at a process Pi
KMPM()
Input: moving paths of mobile users
Output: local frequent mobility patterns
LSk = null    // k is the length of the subsequence
for each UAP a ∈ Di
    find the subsequences of a and put them in S
    for each subsequence s ∈ S
        // calculate the support count and store it in LSk
        s.count = s.count + s.suppInc
    end for
end for

// pseudocode for global mining at the kth pass
globalmining()
Input: local frequent patterns
Output: global frequent patterns
k = 1
while (GSk ≠ null)
    for every i from 1 to n
        node Pi exchanges and merges the local support counts of LSk
        with all n nodes, finds the global support count of every
        subsequence and stores it in GSk
    end for
    for each subsequence in GSk
        if the global support count of the subsequence is above the
        minimum global support count then
            put the subsequence in the global mobility pattern GMPk
    end for
    increment k by 1
end while

The above algorithm is an adaptation of the apriori algorithm to a distributed environment. Every process generates subsequences of length k, called Local Subsequences (LSk), and calculates a local support count for each LSk. These local support counts are exchanged with all other processes to generate the Global Subsequences (GSk), and a global support count is then calculated for each GSk. The subsequences whose support count is greater than the global support threshold are selected as Global Mobility Patterns of length k (GMPk). For instance, consider the UAPs ⟨4, 6, 8, 0, 5⟩, ⟨2, 4, 8, 0, 6⟩ and ⟨1, 2, 4, 6⟩, where the number 4 represents a location of the mobile user. The support count of the subsequence ⟨4, 6⟩ is calculated as s.count = s.count + suppInc, with suppInc = 1 / (1 + totdis), where totdis is the number of locations between 4 and 6. The s.count value is 2 because ⟨4, 6⟩ appears with adjacent locations in the 1st and 3rd UAPs. In the 2nd UAP there are two locations between 4 and 6, so the support value for ⟨4, 6⟩ becomes ⟨4, 6⟩.count = 2 + 1/(1 + 2) = 2.33. This increases the accuracy of the support counting, and the algorithm therefore generates more accurate global mobility patterns.

7. Mobility Rule Generation

In our knowledge grid, the mobility patterns stored in the knowledge base repository after the execution of the parallel and distributed mining algorithm can be used to generate mobility rules. For example, for the mobility pattern ⟨4, 6, 8⟩ the mobility rules are as follows:
⟨4⟩ → ⟨6, 8⟩
⟨4, 6⟩ → ⟨8⟩

From the UMPs, all possible mobility rules are generated and their confidence values are calculated. In general, a mobility rule R is represented as ⟨t1, t2, ..., ti⟩ → ⟨ti+1, ti+2, …, tp⟩. The confidence value for the rule R is calculated using the following formula:

Confidence(R) = ⟨t1, t2, ..., tp⟩.count / ⟨t1, t2, ..., ti⟩.count

Then the mobility rules whose confidence is higher than a predefined confidence threshold (confmin) are selected. These mobility rules are used in the next phase for mobility prediction: a mined mobility rule is compared with the current location of the mobile user to predict the next possible locations.
8. Mobility Prediction

In the mobile web environment, the next location of a mobile user is predicted using the mobility rules and the current location of the mobile user. A mobility rule contains two parts: the head, the part before the arrow, and the tail, the part after the arrow. Our process generates the set of rules whose head matches the current location of the mobile user; these rules are called matching rules. The first location in the tail of each matching rule, together with its match value, is stored in the resultant array. The match value is calculated by summing the support value of the UMP and the confidence of the rule. The matching rules in the array are sorted in descending order of match value, so this process ranks the most confident and frequent rules first.

The parameter m defines the number of predictions required; only the first m locations are selected from the resultant array. In Figure 2, there are three locations: 4, 6 and 8. For example, suppose the mobile user is currently in location 4. Our algorithm generates the matching rules ⟨4⟩ → ⟨6⟩ and ⟨4⟩ → ⟨8⟩. The match value is calculated for each predicted location and stored in the resultant array, which then contains the two values [(6, 78.56), (8, 68.56)]. If m = 1, then location 6 is predicted as the next location. If m = 2, then locations 6 and 8 are predicted as the next locations. The predicted locations can be used by an expert system to provide location-based services to the mobile user.

[Figure 2. Location graph with locations 4, 6 and 8.]

9. Incremental Mobility Rule Mining

The parallel and distributed algorithm is executed on the data grid to find the mobility patterns when moving logs are added to or removed from the database, without re-executing the algorithm from scratch. The algorithm uses the concept of the negative border by maintaining the infrequent mobility patterns. The infrequent sequences are the sequences which did not satisfy the minimum support. During each pass of the data mining algorithm, the set of sequences (GSk) is computed from the previous mobility patterns (GMPk-1). The negative border of infrequent sequences is found by NMPk = GSk - GMPk, where NMPk represents the infrequent sequences in the kth pass.

Table 1. Parameters used in the experiment

  S.No  Notation                             Meaning
  1     DB                                   Transactions in the original database
  2     db                                   Transactions that are newly added
  3     DB+                                  Transactions in the updated database DB ∪ db
  4     GMP+, GMPDB, GMPdb                   Mobility patterns in the respective database
  5     NMP(GMP+), NMP(GMPDB), NMP(GMPdb)    Negative border in the respective database
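The matching-rule ranking of Section 8 can be sketched as follows. The tuple representation of rules and the individual support and confidence values are our assumptions, chosen so that the match values reproduce the resultant array [(6, 78.56), (8, 68.56)] of the example; the paper does not give the underlying values.

```python
def predict_next(rules, current, m):
    """rules: (head, tail, support, confidence) tuples; returns the first m
    predicted locations with their match values, best first."""
    result = []
    for head, tail, support, confidence in rules:
        if head == current:                        # matching rule
            match_value = support + confidence     # support + confidence
            result.append((tail[0], match_value))  # first location of the tail
    result.sort(key=lambda r: r[1], reverse=True)  # descending match value
    return result[:m]                              # only m predictions

rules = [
    ((4,), (6,), 2.33, 76.23),   # <4> -> <6>  (support/confidence assumed)
    ((4,), (8,), 2.33, 66.23),   # <4> -> <8>
    ((6,), (8,), 2.00, 80.00),   # head does not match location 4
]
predictions = predict_next(rules, (4,), 2)   # locations 6 then 8
```

With m = 2, the sketch yields location 6 (match value about 78.56) followed by location 8 (about 68.56), matching the resultant array above.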
The algorithm for updating the mobility patterns is as follows:
if GMPDB+ ∪ NMP(GMPDB+) ≠ GMPDB ∪ NMP(GMPDB) then
    s = GMPDB+
    repeat
        compute s = s ∪ NMP(s)
    until s does not grow
    GMPDB+ = { x ∈ s | support(x) >= min_sup }

Initially, the original transactions are mined to generate the Global Mobility Patterns (GMPDB) for the specified minimum support. While the mobility patterns are generated, their negative border NMP(GMPDB) is also generated and retained. The negative border is used to avoid re-computation when new transactions are added to the database. When newly arriving moving log transactions are added to the database (db), the frequent mobility patterns for the new transactions (GMPdb) are generated for the user-specified minimum support. There are three cases to update the mobility patterns. 1. The support count is calculated for each sequence in GMPDB from GMPdb, and GMPDB is updated if the minimum support is satisfied. 2. The support counts of the sequences common to both GMPdb and NMPdb are counted, and GMPDB is updated if the support count is satisfied. 3. […]

[…] the support count 0.30%. From Figure 4, the performance improvement was about 50% for 0.20%, and it increased to 55% for the support count 0.25%. The performance increased by 60% for the support count 0.30%.

[Figure 3. Dataset T5I2D1400K, data added in increments of 100K from 1000K: computation time in seconds (0-10000) for the re-computing (RC) and incremental (IM) approaches at support values 0.2%, 0.25% and 0.3%.]

[Figure 4: computation time in seconds.]
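The negative-border update loop of Section 9 can be sketched as follows: a toy, self-contained version in which the apriori-style candidate join, the pruning rule and all data are our assumptions (the paper computes the supports with SQL joins over the grid nodes).

```python
def update_mobility_patterns(db_plus, patterns, min_count):
    """Grow the pattern set with its negative border until it stops
    growing, then keep the patterns frequent in DB+ = DB ∪ db."""

    def contains(record, pattern):
        it = iter(record)                 # pattern must occur in order
        return all(loc in it for loc in pattern)

    def negative_border(s):
        # apriori-style join: patterns overlapping in all but one location;
        # prune candidates that have a sub-pattern outside s
        cands = set()
        for p in s:
            for q in s:
                if len(p) == len(q) and p != q and p[1:] == q[:-1]:
                    c = p + (q[-1],)
                    subs = {c[:i] + c[i + 1:] for i in range(len(c))}
                    if subs <= s:
                        cands.add(c)
        return cands - s

    s = set(patterns)
    while True:                           # repeat s = s ∪ NMP(s)
        grown = s | negative_border(s)
        if grown == s:                    # until s does not grow
            break
        s = grown
    # keep x in s with support(x) >= min_sup over the updated database
    return {x for x in s if sum(contains(r, x) for r in db_plus) >= min_count}

db_plus = [(4, 6, 8), (4, 6, 8), (4, 6), (2, 3)]   # DB ∪ db (toy data)
updated = update_mobility_patterns(db_plus, {(4,), (6,), (8,)}, 2)
```

Starting from the frequent single locations, the closure generates the candidate pairs and triples; counting over DB+ then keeps ⟨4⟩, ⟨6⟩, ⟨8⟩, ⟨4, 6⟩, ⟨4, 8⟩, ⟨6, 8⟩ and ⟨4, 6, 8⟩, without rescanning the original database during the closure step.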
[…] added to the database. The precision is calculated as the number of correctly predicted locations divided by the total number of predictions made.

[Figure: precision of the predictions (0.7-0.9).]

References

[2] M. Cannataro, A. Congiusta, A. Pugliese, D. Talia, and P. Trunfio, "Distributed Data Mining on Grids: Services, Tools, and Applications", IEEE Transactions on Systems, Man, and Cybernetics, Dec. 2004, pp. 2451-2465.

[3] I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems", Journal of Computer Science and Technology, Jul. 2006, pp. 513-520.