
Towards the Improving of LiDAR Point Cloud’s Classification with Hidden Association Rules


Antonio Balderas Cepeda (a,*), Wilver-Enrique Salinas Castillo (a), Martín-Eleno Vogel Vazquez (a), Sergio-Bernardo Jímenez Hernández (b)

(a) Facultad de Ingeniería y Ciencias, Universidad Autónoma de Tamaulipas
C.U. Adolfo López Mateos. C.P. 87157
Tel: +52 834 3181800 ext. 2149
Fax: +52 834 3181718

(b) Facultad de Ingeniería “Arturo Narro Siller”, Universidad Autónoma de Tamaulipas
C.U. Tampico-Madero. C.P. 89337

Keywords: Association rules, classification, GIS, aerial laser, filtering, LiDAR

1. Introduction

Association rule mining (Ceglar and Roddick, 2006) is an active research topic. Its applications range from early warning in food supply networks (Beulens et al., 2006) to spatial data analysis (Ding et al., 2006; Koperski and Han, 1995). Traditional association rules are useful for discovering potentially interesting patterns in databases, but in some cases there are millions of rules to be analyzed, and the discovery of infrequent/hidden patterns is difficult (if not impossible) because in most cases infrequent patterns are hidden from the traditional definition of the association rule (AR).


* Corresponding author
Email addresses: mabalderas@acm.org (Antonio Balderas Cepeda), wsalinas@uat.edu.mx (Wilver-Enrique Salinas Castillo), mvogel@uat.edu.mx (Martín-Eleno Vogel Vazquez), sjimenez@uat.edu.mx (Sergio-Bernardo Jímenez Hernández)

Preprint submitted to KAIS February 20, 2012


Infrequent patterns remain hidden from traditional ARs by the very nature of a rare pattern: it occurs sparsely, in few records of a database. Since ARs discard rare items a priori, it becomes impossible to detect such rare patterns. Discovering them requires a technique that considers items present in a very low percentage of the database records, which implies the generation of bigger data structures and many false positives; that is, items that appear together in a database only by chance.
Many approaches have been proposed to discover infrequent patterns (Balderas, 2010; Bezerra et al., 2009; Das and Schneider, 2007; Koh and Rountree, 2005). The work of Das and Schneider (2007) detects anomalous records in databases using marginal distributions over attribute subsets. Koh and Rountree (2005) define sporadic rules as association rules with low support and high confidence.
On the topic of ruleset reduction, the work of Webb (2006) presented non-derivable association rules to avoid redundant/derivable rules. It shows that 99% (depending on the case) of the traditional association rules are redundant or derivable.
Knowledge discovery from spatial databases is of special relevance since there
is a huge volume of laser scanner data (LiDAR), aerial imaging and satellite sensor
data. Hence, data mining techniques can provide insight and help improve spatial
data analysis. Moreover, a special characteristic of spatial data analysis is the
immediate applicability of benefits to society, especially in developing countries.
In this document we propose a hidden association rule mining filtering technique to discover hidden patterns in spatial datasets (LiDAR), and we apply the proposed technique to improve the classification of LiDAR data points into land and non-land areas. Although there are some proposals (Ding et al., 2006; Koperski and Han, 1995; Bogorny et al., 2008; Mennis and Liu, 2005) to discover knowledge from spatial datasets, in this paper we propose a new filter for hidden association rules (HARs), called IDENT, that in combination with interest measures avoids the generation of millions of hidden association rules and provides a useful ruleset.
In this document we present ongoing research work. We propose filtering techniques to discover compact and useful hidden rulesets. Moreover, the proposed approach for reducing the number of anomalous rules obtained can be parallelized immediately. Furthermore, the rules obtained can be useful in bio-surveillance, credit screening, marketing, and any application that requires the identification of rare and significant patterns.
In the following sections we define our proposal and then discuss the experimental results obtained from the analyzed LiDAR data.

2. Definitions

To define the filter called IDENT and the adaptation of the Certainty Factor as a filter, we must recall the definitions of hidden anomalous rules. Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of binary attributes called items. Let $T$ be a database of transactions. Each transaction $t$ is represented as a binary vector, with $t[k] = 1$ if $t$ contains the item $I_k$ and $t[k] = 0$ otherwise. The database contains one tuple for each transaction; a transaction $t$ contains $X$ (a set of some items in $I$) if $t[k] = 1$ for all items $I_k$ in $X$.
A canonical anomalous association rule (CAAR) is an associative and implicative pattern of the form $X \Rightarrow A_j \mid Y_l$, where $X$ is a set of some items in $I$, and $A_j$ and $Y_l$ are single items in $I$ that are not present in $X$. The confidence factor of the rule is the percentage of transactions in $T$ containing $X \neg Y_l$ that also contain $A_j$. The confidence and support of a rule are then given by $\mathrm{conf}(X \Rightarrow A_j \mid Y_l) = \Pr(A_j \mid X \neg Y_l)$ and $\mathrm{supp}(X \Rightarrow A_j \mid Y_l) = \Pr(A_j X \neg Y_l)$. The support of an itemset $X$ is the number of transactions in $T$ that contain $X$. Support is a measure of statistical significance, while confidence measures the strength of the implication.
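
To make the two measures concrete, the following is a minimal Java sketch, not the IMAAM implementation. It assumes that the database is held as a boolean matrix `db[t][k]` (transaction t, item k) and it reports support as a relative frequency so that it is directly comparable with the probability form above. The class and method names (`CaarMeasures`, `cover`, `coverXNotY`) are hypothetical.

```java
import java.util.BitSet;

/** Sketch: support and confidence of a CAAR X => A | Y over binary transactions. */
class CaarMeasures {

    /** Ids of the transactions that contain every item index in {@code items}. */
    static BitSet cover(boolean[][] db, int... items) {
        BitSet tids = new BitSet(db.length);
        for (int t = 0; t < db.length; t++) {
            boolean all = true;
            for (int k : items) {
                if (!db[t][k]) {
                    all = false;
                    break;
                }
            }
            if (all) {
                tids.set(t);
            }
        }
        return tids;
    }

    /** Transactions that contain X but not the single item y (the X¬Y part of the rule). */
    static BitSet coverXNotY(boolean[][] db, int[] x, int y) {
        BitSet tids = cover(db, x);
        tids.andNot(cover(db, y));
        return tids;
    }

    /** supp(X => A | Y) = Pr(A X ¬Y), as a relative frequency (assumption of this sketch). */
    static double support(boolean[][] db, int[] x, int a, int y) {
        BitSet tids = coverXNotY(db, x, y);
        tids.and(cover(db, a));
        return (double) tids.cardinality() / db.length;
    }

    /** conf(X => A | Y) = Pr(A | X ¬Y); returns 0 when no transaction contains X¬Y. */
    static double confidence(boolean[][] db, int[] x, int a, int y) {
        BitSet base = coverXNotY(db, x, y);
        if (base.isEmpty()) {
            return 0.0;
        }
        BitSet hit = (BitSet) base.clone();
        hit.and(cover(db, a));
        return (double) hit.cardinality() / base.cardinality();
    }
}
```

For a table of eight transactions in which four contain X¬Y and three of those also contain A, `confidence` gives 3/4 and `support` gives 3/8.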
Let $X$ be an itemset. $X + H$ is an extension of $X$ iff $X \cap H = \emptyset$, and we write it $XH$.
A general filter over CAARs is defined to preserve general rules. Let a CAAR be $X \Rightarrow A_j \mid Y_l$. Every rule with an extended antecedent, that is, every rule $XH \Rightarrow A_j \mid Y_l$, is pruned. This filter will be called AP.
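
The AP filter can be sketched in a few lines of Java. This is an illustrative sketch under our own assumptions, not the authors' code: items are plain strings, a rule is pruned when another rule in the same list has the same $A_j$ and $Y_l$ and a strictly smaller antecedent contained in it, and the names `ApFilter`, `Rule`, `isExtension` and `apply` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the AP filter: drop every rule XH => A | Y when X => A | Y is present. */
class ApFilter {

    /** A CAAR X => a | y; item names are plain strings for illustration only. */
    static final class Rule {
        final Set<String> x;
        final String a;
        final String y;

        Rule(String a, String y, String... x) {
            this.x = new TreeSet<String>(Arrays.asList(x));
            this.a = a;
            this.y = y;
        }

        @Override
        public String toString() {
            return x + " => " + a + " | " + y;
        }
    }

    /** True when r has the form XH => A | Y for another listed rule X => A | Y, H non-empty. */
    static boolean isExtension(Rule r, List<Rule> rules) {
        for (Rule g : rules) {
            if (g != r && g.a.equals(r.a) && g.y.equals(r.y)
                    && r.x.size() > g.x.size() && r.x.containsAll(g.x)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the rules that are not antecedent extensions of another rule, in input order. */
    static List<Rule> apply(List<Rule> rules) {
        List<Rule> kept = new ArrayList<Rule>();
        for (Rule r : rules) {
            if (!isExtension(r, rules)) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

The quadratic scan keeps the sketch short; an index keyed on the pair ($A_j$, $Y_l$) would be the natural refinement for large rulesets.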
A significance filter is defined using the conditional probability. Let a rule be $X \Rightarrow A_j \mid Y_l$; the rule's significance is given by the percentage of transactions that contain $A_j$ among those that contain $X \neg Y_l$. This filter will be called SIG.

$$\mathrm{sig}(X \Rightarrow A_j \mid Y_l) = \frac{\Pr(A_j X \neg Y_l)}{\Pr(X \neg Y_l)}$$

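A certainty filter is defined by adapting the certainty factor interest measure to the case of CAARs. This filter will be called CF.

$$
\mathrm{CF}(X \Rightarrow A_j \mid Y_l) =
\begin{cases}
\dfrac{\mathrm{conf}(X \neg Y_l \Rightarrow A_j) - \mathrm{supp}(A_j)}{1 - \mathrm{supp}(A_j)} & \text{if } \mathrm{conf}(X \neg Y_l \Rightarrow A_j) > \mathrm{supp}(A_j), \\[2ex]
\dfrac{\mathrm{conf}(X \neg Y_l \Rightarrow A_j) - \mathrm{supp}(A_j)}{\mathrm{supp}(A_j)} & \text{if } \mathrm{conf}(X \neg Y_l \Rightarrow A_j) < \mathrm{supp}(A_j), \\[2ex]
0 & \text{if } \mathrm{conf}(X \neg Y_l \Rightarrow A_j) = \mathrm{supp}(A_j).
\end{cases}
$$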
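
The three cases of CF translate directly into code. The sketch below assumes that both arguments are relative frequencies in [0, 1], which the formula requires for the denominators to make sense; the class and method names are hypothetical.

```java
/** Sketch of the certainty factor used by the CF filter. */
class CertaintyFactor {

    /**
     * @param conf  confidence of X¬Y => A, in [0, 1]
     * @param suppA relative support of the consequent A, in [0, 1] (assumption of this sketch)
     * @return a value in [-1, 1]; positive when X¬Y raises the belief in A
     */
    static double cf(double conf, double suppA) {
        if (conf > suppA) {
            // conf > suppA implies suppA < 1, so the denominator is positive
            return (conf - suppA) / (1.0 - suppA);
        }
        if (conf < suppA) {
            // conf < suppA implies suppA > 0, so the denominator is positive
            return (conf - suppA) / suppA;
        }
        return 0.0;
    }
}
```

For example, a rule with confidence 0.9 whose consequent has relative support 0.5 gets CF = (0.9 - 0.5) / (1 - 0.5) = 0.8, while a confidence of 0.25 against the same consequent gives CF = -0.5.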
Finally, to identify CAARs supported by the same transactions in $T$, we define a filter that we will call IDENT. Let two CAARs be $X \Rightarrow A_j \mid Y_l$ and $V \Rightarrow A_j \mid Y_l$. If $\mathrm{supp}(A_j V X) = \mathrm{supp}(A_j X Y_l)$ and $\mathrm{supp}(A_j V X) = \mathrm{supp}(A_j X)$, and both rules have the same support, then these rules relate to the same transactions; they are correlated.
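
One way to read IDENT operationally is as a grouping step. The sketch below is a simplification under stated assumptions and not the authors' definition: it assumes that every rule already carries the bitset of transaction ids that support it, and it groups rules with the same $A_j$ and $Y_l$ whose bitsets are identical, instead of testing the support equalities above. All names (`IdentFilter`, `Rule`, `tids`, `group`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: group CAARs that are supported by exactly the same transactions. */
class IdentFilter {

    /** A rule reduced to a label, its consequent a, its exception y and its supporting tids. */
    static final class Rule {
        final String name;
        final String a;
        final String y;
        final BitSet tids;

        Rule(String name, String a, String y, BitSet tids) {
            this.name = name;
            this.a = a;
            this.y = y;
            this.tids = tids;
        }
    }

    /** Convenience helper: builds a bitset from a list of transaction ids. */
    static BitSet tids(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    /** Groups rules by (a, y, supporting transactions); groups keep first-seen order. */
    static List<List<Rule>> group(List<Rule> rules) {
        Map<List<Object>, List<Rule>> groups = new LinkedHashMap<List<Object>, List<Rule>>();
        for (Rule r : rules) {
            // BitSet implements equals/hashCode on its set bits, so it can be part of the key
            List<Object> key = Arrays.<Object>asList(r.a, r.y, r.tids);
            List<Rule> members = groups.get(key);
            if (members == null) {
                members = new ArrayList<Rule>();
                groups.put(key, members);
            }
            members.add(r);
        }
        return new ArrayList<List<Rule>>(groups.values());
    }
}
```

Each resulting group can then be reported as a single pattern, which is one way such a filter would shrink the ruleset.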

2.1. Constraints
To provide a concise anomalous ruleset, we use the following minimum thresholds: a minimum confidence threshold (to provide a measure of the strength of the association), a minimum support threshold (to provide statistical significance to the obtained pattern), an absolute minimum threshold (to avoid the extraction of very low support patterns supported by one or two transactions in $T$), a minimum attribute domain threshold (in the case of relational databases) to reveal hidden patterns and avoid obtaining a binary choice, and a minimum significance threshold that provides a measure of the strength of the anomalous pattern obtained.

$$
\begin{aligned}
\mathrm{conf}(X \Rightarrow A_j \mid Y_l) &\geq \theta, & \mathrm{conf}(X Y_l \Rightarrow \neg A_j) &\geq \theta, \\
\mathrm{minSupp}(X \Rightarrow A_j \mid Y_l) &\geq \epsilon, & \mathrm{absMinSupp}(X \Rightarrow A_j \mid Y_l) &\geq \omega, \\
\mathrm{minDom}(X \Rightarrow A_j \mid Y_l) &\geq \beta, & \mathrm{minSig}(X \Rightarrow A_j \mid Y_l) &\geq \gamma
\end{aligned}
$$

3. Algorithm

We implement this proposal using the “in memory anomalous associations miner” (IMAAM) algorithm, although any association rule algorithm can be modified to obtain CAARs. We use a vertical mining approach in which every itemset of level k (k-itemset) is stored in a tree-like structure, as illustrated in figure 1. Within each k-itemset we store a reduced bitset representation of the transactions in which it appears. We divide the extraction of CAARs into three stages. First, we obtain the k-itemsets using the minSupp and absMinSupp thresholds; the tree-like structure is constructed at this stage. Second, we obtain association rules and canonical anomalous association rules; some filters (SIG, CF) can be applied at this stage. Lastly, we perform the filtering, grouping and ranking.

[Figure 1 image: a tree with a root node and three levels, k = 1, 2, 3, holding itemsets over the items a to f]

Figure 1: Tree structure representation of the data structure stored in memory. This structure is generated by our algorithm (called IMAAM). The k column indicates the k-itemsets obtained, while the spotted cells in the tree correspond to infrequent k-itemsets.
The most computationally intensive stage is the first one. The algorithm must obtain the support of every k-itemset; in general terms, this process is as follows:
supportEx {
    tree.start(minSupp, absMinSupp, minDom)
    tree.generateRoot()    // generate the 1-itemsets
    while (k <= userLimit && LB != empty) {
        tree.reorderLB()   // last branch (LB)
        tree.copyLB()
    }
}
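
The support computation behind a vertical, bitset-based layout can be sketched as follows. This is an illustration of the general vertical mining idea and not the IMAAM tree itself: it assumes a boolean transaction matrix as input, uses the standard `java.util.BitSet`, and omits the tree, the branch reordering and all thresholds. The names `VerticalSupport`, `verticalize`, `extend` and `supportCount` are hypothetical.

```java
import java.util.BitSet;

/** Sketch: itemset support by intersecting per-item transaction bitsets. */
class VerticalSupport {

    /** Builds one bitset per item; bit t is set when transaction t contains the item. */
    static BitSet[] verticalize(boolean[][] db, int itemCount) {
        BitSet[] columns = new BitSet[itemCount];
        for (int k = 0; k < itemCount; k++) {
            columns[k] = new BitSet(db.length);
        }
        for (int t = 0; t < db.length; t++) {
            for (int k = 0; k < itemCount; k++) {
                if (db[t][k]) {
                    columns[k].set(t);
                }
            }
        }
        return columns;
    }

    /** Transactions of the (k+1)-itemset obtained by adding one item to a k-itemset. */
    static BitSet extend(BitSet parentTids, BitSet itemTids) {
        BitSet child = (BitSet) parentTids.clone();
        child.and(itemTids);
        return child;
    }

    /** Absolute support of an itemset given by item indices (at least one index required). */
    static int supportCount(BitSet[] columns, int... items) {
        BitSet tids = (BitSet) columns[items[0]].clone();
        for (int i = 1; i < items.length; i++) {
            tids = extend(tids, columns[items[i]]);
        }
        return tids.cardinality();
    }
}
```

Because a child itemset only needs its parent's bitset and one item column, each level of the tree can be derived from the previous one without rescanning the database.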
We performed the analysis on a Turion II M600 laptop with 4 GB of RAM and on a Sun Ultra 40 workstation with 16 GB of RAM and four processors, and used the MARS 6 software (Merrick Co.) and AutoCAD Map3D (Autodesk) to manipulate models and data. We implemented our algorithm in Java SE 1.6 with NetBeans 6.7 and used a proprietary dataset of aerial LiDAR data from a region within México. This dataset is presented in figure 2 and contains 5,176,609 records of LiDAR data that were classified mostly as terrain. The dataset contains columns such as X, Y, Z, number of returns, and classification. The classification of the LiDAR data was made manually with the tools of the MARS 6 software.

Figure 2: The terrain model obtained from the LiDAR dataset used in this proposal.

4. Application

Our pipeline can be described as follows: the continuous values were discretized into 10 equal-width bins and then processed by our Java implementation. The resulting CAARs were modeled again in AutoCAD Map3D. We evaluated our technique using the following minimum thresholds: minDom = 4, absMinSupp = 3, minSig = 1/4, minSupp = 5% and 10%, minConf = 75% and 90%.
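
The discretization step can be illustrated with a short equal-width binning sketch. It assumes that the bin range is taken from the minimum and maximum of the column being discretized and that the maximum value is placed in the last bin; the paper does not state either detail, and the names `EqualWidthBins`, `bin` and `discretize` are hypothetical.

```java
/** Sketch of equal-width discretization of one continuous column (for example X, Y or Z). */
class EqualWidthBins {

    /** Maps a value to a bin index in [0, bins - 1] over the range [min, max]. */
    static int bin(double value, double min, double max, int bins) {
        if (max <= min) {
            return 0; // degenerate range: everything falls into a single bin
        }
        int index = (int) ((value - min) / (max - min) * bins);
        if (index < 0) {
            return 0;
        }
        return Math.min(index, bins - 1); // the maximum value goes to the last bin
    }

    /** Discretizes a whole column, taking the range from the column itself. */
    static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = bin(values[i], min, max, bins);
        }
        return out;
    }
}
```

With 10 bins, each bin index becomes one item of the attribute's domain, which is the form in which a column can enter the transaction database.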
In figure 3 we illustrate a CAAR that identified a point cloud corresponding to a vegetation layer that was wrongly classified as terrain; hence, CAARs can be used to improve the classification of LiDAR data. Quantitatively, the number of CAARs obtained was under 30 in every case, a set manageable by a human (and by a program). Therefore, our filtering approach was useful, since we were able to improve the LiDAR data analysis with a manageable set of CAARs. Processing the five million records took about two seconds at the filtering stage and seven minutes at the itemset and rule generation stages.

Figure 3: The red point cloud shows a hidden association rule. The back layer is a model of the terrain, as shown in a previous image.

5. Conclusions

We investigated the problem of improving the classification layers of LiDAR data. Although our ongoing research effort relates to GIS applications, our technique can be used in other knowledge domains such as medicine, credit screening, agriculture, climate change, bio-surveillance, and more.
We have shown that the proposed filter (IDENT) for the identification of overlapping canonical anomalous association rules works well jointly with the adaptation of interest measures such as Certainty Factors and Confidence Factors, together with Occam's razor (AP filter). We applied the hidden association rules technique to LiDAR data analysis efficiently and showed that hidden association rules can help to improve that analysis, but we are convinced that more research and funding are necessary to reach further conclusions for the benefit of society and of the customers of such technologies.

References

Balderas, M.-A., 2010. Rare Association Rule Mining and Knowledge Discovery:
Technologies for Infrequent and Critical Event Detection. IGI-Global, Ch.
Mining Hidden Association Rules from Real-Life Data, pp. 168–184.

Beulens, A., Li, Y., Kramer, M., van der Vorst, J., 2006. Possibilities for applying
data mining for early warning in food supply networks. In: 20th Workshop on
Methodologies and Tools for Complex System Modeling and Integrated Policy
Assessment. International Institute for Applied Systems Analysis.

Bezerra, F., Wainer, J., Aalst, W. M. P., 2009. Anomaly Detection Using Process
Mining. Vol. 29 of Lecture Notes in Business Information Processing. Springer
Berlin Heidelberg, pp. 149–161.

Bogorny, V., Kuijpers, B., Alvares, L. O., January 2008. Reducing uninteresting
spatial association rules in geographic databases using background knowledge:
a summary of results. International Journal of Geographical Information
Science 22, 361–386.

Ceglar, A., Roddick, J. F., 2006. Association mining. ACM Computing Surveys
38 (2), 5.

Das, K., Schneider, J., 2007. Detecting anomalous records in categorical datasets.
In: Proceedings of the 13th ACM SIGKDD international conference on
Knowledge discovery and data mining. KDD ’07. ACM, New York, NY, USA,
pp. 220–229.

Ding, W., Eick, C. F., Wang, J., Yuan, X., 2006. A framework for regional
association rule mining in spatial datasets. IEEE International Conference on
Data Mining 0, 851–856.

Koh, Y., Rountree, N., 2005. Finding sporadic rules using apriori-inverse. In:
Proceedings of the Advances in Knowledge Discovery and Data Mining, 9th
Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer,
pp. 97–106.

Koperski, K., Han, J., 1995. Discovery of spatial association rules in geographic
information databases. In: Lecture Notes in Computer Science - Advances in
Spatial Databases. Springer-Verlag, pp. 47–66.

Mennis, J., Liu, J. W., 2005. Mining association rules in spatio-temporal data: An
analysis of urban socioeconomic and land cover change. Transactions in GIS
9 (1), 5–17.

Webb, A. G. I., 2006. Discovering significant rules. In: Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data
mining. ACM Press, Philadelphia, PA, USA, pp. 434–443.
