IPLoM: Clustering Event Logs Using Iterative Partitioning
Loghound and Teiresias on 7 different event log files, which were manually labeled by our faculty's tech support group. Results show that IPLoM consistently outperforms the other algorithms and achieves an average (across the datasets) F-Measure of 78%, whereas the second best algorithm (SLCT) achieves an average F-Measure of 10%.

The rest of this paper is organized as follows: Section 2 discusses previous work in event type pattern mining and categorization. Section 3 outlines the proposed algorithm and the methodology used to evaluate its performance. Section 4 describes the results, whereas Section 5 presents the conclusion and future work.

2. BACKGROUND AND PREVIOUS WORK

Data clustering as a technique in data mining or machine learning is a process whereby entities are sorted into groups called clusters, where members of each cluster are similar to each other and dissimilar from members of other groups. Clustering can be useful in the interpretation and classification of large data sets, which may be overwhelming to analyze manually. Clustering can therefore be a useful first step in the automatic analysis of event logs.

If each textual line in an event log is considered a data point and its individual words considered attributes, then the job of clustering an event log reduces to one in which similar log messages are grouped together. For example, the log entry "Command has completed successfully" can be considered a 4-dimensional data point with the attributes "Command", "has", "completed", "successfully". However, as stated in [14], traditional clustering algorithms are not suitable for event logs for the following reasons:

1. The event lines do not have a fixed number of attributes.

2. The data point attributes, i.e. the individual words or tokens on each line, are categorical. Most conventional clustering algorithms are designed for numerical attributes.

3. Event log lines are high dimensional. Traditional clustering algorithms are not well suited for high dimensional data.

4. Traditional clustering algorithms also tend to ignore the order of attributes. In event logs the attribute order is important.

While several algorithms like CLIQUE, CURE and MAFIA have been designed for clustering high dimensional data [1, 14], these algorithms are still not quite suitable for log files, because an algorithm suitable for clustering event logs needs not only to be able to deal with high dimensional data; it also needs to be able to deal with data with different attribute types, ignore the order of the input records, and discover clusters that exist in subspaces of the high dimensional data [1, 14].

For these reasons several algorithms and techniques for automatic clustering and/or categorization of log files have been developed. Moreover, some researchers have also attempted to apply techniques designed for pattern discovery in other types of textual data to the task of clustering event logs. In [5] the authors attempt to classify raw event logs into a set of categories based on the IBM CBE (Common Base Event) format [2] using Hidden Markov Models (HMM) and a modified Naive Bayesian Model. They were able to achieve 85% and 82% classification accuracy respectively. While similar, the automatic categorization done in [5] is not the same as discovering event log clusters or formats. This is because the work done in [5] is a supervised classification problem, with predefined categories, while the problem we tackle is unsupervised, with the final categories not known a priori. On the other hand, SLCT [14] and Loghound [15] are two algorithms which were designed specifically for automatically clustering log files and discovering event formats. This is similar to our objective in this paper. Because both SLCT and Loghound are similar to the Apriori algorithm, they require the user to provide a support threshold value as input. This support threshold is not only used to control the output of these algorithms but is fundamental to their internal mechanism.

SLCT works through a three-step process. The steps are described below:

1. It first identifies the frequent words (words that occur more frequently than the support threshold value), or 1-itemsets, from the data.

2. It then extracts the combinations of these 1-itemsets that occur in each line in the data set. These 1-itemset combinations are cluster candidates.

3. Finally, those cluster candidates that occur more frequently than the support value are selected as the clusters in the data set.

Loghound, on the other hand, discovers frequent patterns from event logs by utilizing a frequent itemset mining algorithm. It mirrors the Apriori algorithm more closely than SLCT because it works by finding itemsets which may contain more than one word, up to a maximum value provided by the user. With both SLCT and Loghound, lines that do not match any of the frequent patterns discovered are classified as outliers.

SLCT and Loghound have received considerable attention and have been used in the implementation of the Sisyphus Log Data Mining toolkit [13], as part of the LogView log visualization tool [9], and in online failure prediction [11]. Fig. 1 shows four examples of the type of clusters that SLCT and Loghound are able to find; the asterisks in each line indicate placeholders that can match any word. We will adopt this cluster representation in the rest of our work.

A comparison of SLCT against a bio-informatics pattern discovery algorithm developed by IBM called Teiresias is carried out in [12]. Teiresias was designed to discover all patterns of at least a given specificity and support in categorical data. Teiresias can be described as an algorithm that takes a set of strings X and breaks them up into a set of unique characters C, which are the building blocks of the strings. It then proceeds to find all motifs (patterns) having at least a specificity determined by L/W, where L is the number of non-wildcard characters from C and W is the width of the motif with wildcards included. A support value K can also be provided, i.e. Teiresias only finds motifs that occur at least K times in the set of strings X. While Teiresias was adjudged by the author to work just as effectively as SLCT, it was found not to scale efficiently to large data sets.
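SLCT's three-step process can be sketched as two passes over the tokenized log lines. The sketch below is our own simplified illustration, not Vaarandi's implementation: word positions matter when counting 1-itemsets, and positions not covered by a frequent word are reported as the `*` wildcards shown in Fig. 1.

```python
from collections import Counter

def slct_sketch(lines, support):
    """Simplified SLCT: cluster log lines via frequent (position, word) pairs."""
    tokenized = [line.split() for line in lines]

    # Step 1: find the frequent words (1-itemsets), keyed by token position.
    word_counts = Counter((pos, word)
                          for tokens in tokenized
                          for pos, word in enumerate(tokens))
    frequent = {item for item, n in word_counts.items() if n >= support}

    # Step 2: each line's combination of frequent 1-itemsets is a cluster candidate.
    candidates = Counter()
    for tokens in tokenized:
        key = tuple((pos, word) for pos, word in enumerate(tokens)
                    if (pos, word) in frequent)
        if key:
            candidates[(len(tokens), key)] += 1

    # Step 3: candidates meeting the support threshold become clusters;
    # positions with no frequent word become '*' wildcard placeholders.
    clusters = []
    for (length, key), n in candidates.items():
        if n >= support:
            fmt = ['*'] * length
            for pos, word in key:
                fmt[pos] = word
            clusters.append(' '.join(fmt))
    return clusters
```

Raising the support threshold coarsens the output: words that fall below the threshold drop out of the frequent set, and their positions degrade into wildcards.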
Figure 1: Sample clusters generated by SLCT and Loghound
enough lines to exceed a certain percentage (the partition
support threshold) of the log messages in the partition.
Figure 5: IPLoM Step-4: Discover cluster descriptions. The unique token counts are beside arrows.

3.6 Algorithm Parameters

In this section, we give a brief overview of the parameters/thresholds used by IPLoM. The fact that IPLoM has several parameters that can be used to tune its performance provides flexibility for system administrators, since it gives them the option of using their expert knowledge when they see it necessary.

• File Support Threshold: Ranges between 0 and 1. It reduces the number of clusters produced by IPLoM. Any cluster whose instances have a support value less than this threshold is discarded. The higher this value is set, the fewer the number of clusters that will be produced.

• Partition Support Threshold: Ranges between 0 and 1. It is essentially a threshold that controls backtracking. Based on our experiments, the guideline is to set this parameter to very low values, i.e. < 0.05, for optimum performance.

• Upper Bound and Lower Bound: Range between 0 and 1. They control the decision on how to treat the M side of relationships in Step-2. Lower Bound should usually take values < 0.5 while Upper Bound takes values > 0.5.

• Cluster Goodness Threshold: Ranges between 0 and 1. It is used to avoid further partitioning. Its optimal value should lie in the range of 0.3-0.6.

Sensitivity analysis performed to evaluate the stability of IPLoM using different values for the parameters shows little deviation in performance.

Tech-Support members of the Dalhousie Faculty of Computer Science produced the line format cluster descriptions of these 7 datasets manually. Table 1 gives the number of clusters identified in each file manually. Again, due to privacy issues we are unable to provide the manually produced cluster descriptions of the non-opensource log files, but the manually produced cluster descriptions for the HPC data are available for download. These cluster descriptions then became our gold standard against which to measure the performance of the other algorithms as an information retrieval (IR) task. As in classic IR, our performance metrics are Recall, Precision and F-Measure, which are described in [17]. The True Positive (TP), False Positive (FP) and False Negative (FN) values were derived by comparing the set of manually produced line formats to the set of retrieved formats produced by each algorithm. In our evaluation, a line format is still considered a FP even if it matches a manually produced line format to some degree; the match has to be exact for it to be considered a TP. The next section gives more details about the results of our experiments.

4. RESULTS

We tested SLCT, Loghound, Teiresias and IPLoM on the data sets outlined in Table 1. The parameter values used in running the algorithms in all cases are provided in Table 2. The rationale for choosing the support values used for SLCT, Loghound and IPLoM is explained later in this section. The seed value for SLCT and Loghound is a seed for a random number generator used by the algorithms; all other parameter values for SLCT and Loghound are left at their default values. The parameters for Teiresias were also chosen to achieve the lowest support value allowed by the algorithm. The IPLoM parameters were all set intuitively, except in the case of the cluster goodness threshold and the partition support threshold. In setting the cluster goodness threshold, we ran IPLoM on the HPC file while varying this value. The parameter was then set to the value (0.34) that gave the best result and was kept constant for the other files used in our experiments. The partition support threshold was set to 0 to provide a baseline performance. It is pertinent to note that we were unable to test the Teiresias algorithm against all our data sets, due to its inability to scale to the size of our data sets, a problem that is attested to in [12]. Thus in this work it was only tested against the Syslog data set.
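The exact-match scoring used in our evaluation can be illustrated as a set comparison between the gold-standard line formats and the formats an algorithm retrieves. The function below is our own sketch, and the format strings in the usage example are invented for illustration:

```python
def exact_match_scores(gold_formats, retrieved_formats):
    """Recall/Precision/F-Measure where a retrieved line format only
    counts as a TP if it matches a gold-standard format exactly."""
    gold, retrieved = set(gold_formats), set(retrieved_formats)
    tp = len(gold & retrieved)           # exact matches
    fp = len(retrieved - gold)           # retrieved, but no exact gold match
    fn = len(gold - retrieved)           # gold formats never retrieved exactly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

Note that a near-miss format (e.g. one wildcard too many) contributes both a FP and a FN under this scheme, which is why partial matches depress both Precision and Recall.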
Table 1: Log Data Statistics
S/No Name Description No. of Messages No. of Formats (Manual)
1 HPC High Performance Cluster Log (Los Alamos) 433490 106
2 Syslog OpenBSD Syslog 3261 60
3 Windows Windows Oracle Application Log 9664 161
4 Access Apache Access Log 69902 14
5 Error Apache Error Log 626411 166
6 System OS X Syslog 24524 9
7 Rewrite Apache mod rewrite Log 22176 10
[Seven panels, (a) to (g), one per data set in Table 1, each plotting F-Measure against support (%) for SLCT, IPLoM and Loghound.]

Figure 6: Comparing F-Measure performance of IPLoM, Loghound and SLCT using support values.
[Three panels plotting Recall, Precision and F-Measure across the event logs.]

Figure 8: Comparing performance of default IPLoM output with best case performances of Teiresias, Loghound and SLCT.
Table 5: Algorithm performance based on cluster token length

Token Length   No. of     Percentage Retrieved (%)
Range          Clusters   SLCT    Loghound   IPLoM
1 - 10         316        12.97   13.29      53.80
11 - 20        142         7.04    9.15      49.30
>21             68        15.15   16.67      51.52

Table 6: Algorithm performance based on cluster instance frequency

Instance Frequency   No. of     Percentage Retrieved (%)
Range                Clusters   SLCT    Loghound   IPLoM
1 - 10               263         2.66    1.90      44.87
11 - 100             144        16.67   18.75      47.92
101 - 1000            68        20.59   23.53      72.06
>1000                 51        34.00   38.00      82.00

IPLoM's performance was more resilient. The results used in our token length and cluster instance frequency evaluations are based on cluster description formats only.

5. CONCLUSION AND FUTURE WORK

In this paper we introduce IPLoM, a novel algorithm for the mining of event log clusters. Through a 3-step hierarchical partitioning process, IPLoM partitions log data into its respective clusters. In its 4th and final stage, IPLoM produces cluster descriptions or line formats for each of the clusters produced.

We implemented IPLoM and tested its performance against the performance of algorithms previously used in the same task, i.e. SLCT, Loghound and Teiresias. In our experiments we compared the results of the algorithms against results produced manually by human subjects on seven different data sets. The results show that IPLoM has an average (across the data sets) Recall of 0.81, Precision of 0.73 and F-Measure of 0.76. It is also shown that IPLoM demonstrated statistically significantly better performance than either SLCT or Loghound on six of the seven data sets. Future work will focus on using the results of IPLoM, i.e. the extracted cluster formats, in other automatic log analysis tasks.

Acknowledgements

This research is supported by the NSERC Strategic Grant and network, ISSNet. The authors would also like to acknowledge the staff of Palomino System Innovations Inc., TARA Inc. and Dal-CS Tech-Support for their support in completing this work.

This work is conducted as part of the Dalhousie NIMS Lab at http://www.cs.dal.ca/projectx/.

6. REFERENCES

[1] J. H. Bellec and M. T. Kechadi. CUFRES: Clustering using fuzzy representative events selection for the fault recognition problem in telecommunications networks. In PIKM '07: Proceedings of the ACM First Ph.D. Workshop in CIKM, pages 55-62, New York, NY, USA, 2007. ACM.
[2] B. T. et al. Automating problem determination: A first step toward self healing computing systems. IBM White Paper, October 2003.
[3] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1-12, 2000.
[4] M. Klemettinen. A Knowledge Discovery Methodology for Telecommunications Network Alarm Databases. PhD thesis, University of Helsinki, 1999.
[5] T. Li, F. Liang, S. Ma, and W. Peng. An integrated framework on mining log files for computing system management. In Proceedings of ACM KDD 2005, pages 776-781, 2005.
[6] C. Lonvick. The BSD syslog protocol. RFC 3164, August 2001.
[7] Los Alamos National Security. Operational data to support and enable computer science research. Published to the web, January 2009.
[8] S. Ma and J. Hellerstein. Mining partially periodic event patterns with unknown periods. In Proceedings of the 16th International Conference on Data Engineering, pages 205-214, 2000.
[9] A. Makanju, S. Brooks, N. Zincir-Heywood, and E. E. Milios. LogView: Visualizing event log clusters. In Proceedings of the Sixth Annual Conference on Privacy, Security and Trust (PST 2008), pages 99-108, October 2008.
[10] A. Makanju, N. Zincir-Heywood, and E. E. Milios. IPLoM: Iterative partitioning log mining. Technical report, Dalhousie University, February 2009.
[11] F. Salfner and M. Malek. Using hidden semi-Markov models for effective online failure prediction. In 26th IEEE International Symposium on Reliable Distributed Systems, pages 161-174, 2007.
[12] J. Stearley. Towards informatic analysis of syslogs. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 309-318, 2004.
[13] J. Stearley. Sisyphus log data mining toolkit. Accessed from the web, January 2009.
[14] R. Vaarandi. A data clustering algorithm for mining patterns from event logs. In Proceedings of the 2003 IEEE Workshop on IP Operations and Management (IPOM), pages 119-126, 2003.
[15] R. Vaarandi. A breadth-first algorithm for mining frequent patterns from event logs. In Proceedings of the 2004 IFIP International Conference on Intelligence in Communication Systems (LNCS), volume 3283, pages 293-308, 2004.
[16] R. Vaarandi. Mining event logs with SLCT and Loghound. In Proceedings of the 2008 IEEE/IFIP Network Operations and Management Symposium, pages 1071-1074, April 2008.
[17] Wikipedia.org. Precision and Recall. Published to the web, http://en.wikipedia.org/wiki/Precision_and_Recall. Last checked April 23, 2009.
[18] Q. Zheng, K. Xu, W. Lv, and S. Ma. Intelligent search for correlated alarm from database containing noise data. In Proceedings of the 8th IEEE/IFIP Network Operations and Management Symposium (NOMS), pages 405-419, 2002.
Algorithm 1 Partition Prune Function
Input: Collection C[] of log file partitions.
  Real number PS as partition support threshold.
Output: Collection C[] of log file partitions with all partitions with support less than PS grouped into one partition.
1: Create temporary partition Temp_P
2: for every partition in C do
3:   Supp = #LinesInPartition / #LinesInCollection
4:   if Supp < PS then
5:     Add lines from partition to Temp_P
6:     Delete partition from C[]
7:   end if
8: end for
9: Add partition Temp_P to collection C[]
10: Return(C)

[Continuation of the Step-3 partitioning algorithm (partition by search for bijection); its opening lines fall on a page not included here.]
18: Create set S_Temp with token values on the many side of the relationship.
19: split_rank := Get_Rank_Position(S_Temp).
20: if split_rank = 2 then
21:   split_pos = P2
22: else
23:   split_pos = P1
24: end if
25: else {mapping is M-M}
26:   if partition has gone through Step 2 then
27:     Move to next token.
28:   else {partition is from Step 1}
29:     Create sets S_Temp1 and S_Temp2 with token values on both sides of the relationship.
30:     if S_Temp1 has lower cardinality then
31:       split_pos = P1
32:     else
33:       split_pos = P2
34:     end if
35:   end if
36: end if
37: Split partition into new partitions based on token values in split_pos.
38: if partition is empty then
39:   Move to next partition.
40: end if
41: end for
42: if partition is not empty then
43:   Create new partition with remainder lines.
44: end if
45: Add new partitions to Temp_C
46: Temp_C = Partition_Prune(Temp_C) {See Algorithm 1}
47: Add all partitions from Temp_C to C_Out
48: end for
49: C_Out = File_Prune(C_Out) {See Algorithm ??}

[Tail of the Get_Rank_Position function:]
20: end if
21: Return(split_rank)

Algorithm 3 Procedure DetermineP1andP2
Input: Partition P.
  Real number CT as cluster goodness threshold.
1: Determine token count of P as token_count.
2: if token_count > 2 then
3:   Determine the number of token positions with only one unique value as count_1.
4:   GC = count_1 / token_count
5:   if GC < CT then
6:     (P1, P2) = Get_Mapping_Positions(P) {See Algorithm 5}
7:   else
8:     Return to calling procedure, add P to C_Out and move to next partition.
9:   end if
10: else if token_count = 2 then
11:   (P1, P2) = Get_Mapping_Positions(P)
12: else
13:   Return to calling procedure, add P to C_Out and move to next partition.
14: end if
15: Return()

Algorithm 5 Get Mapping Positions Function
Input: Log file partition P.
Output: Integer token positions P1 and P2 as (P1, P2).
1: Determine token count of P as count_token
2: if count_token = 2 then
3:   Set P1 to first token position.
4:   Set P2 to second token position.
5: else {count_token is > 2}
8:   Determine the token count value with the highest frequency other than value 1 as freq_card.
9:   if there is a tie for highest frequency value then
10:     Select lower token value as freq_card
11:   end if
12:   if the frequency of freq_card > 1 then
13:     Set P1 to first token position with cardinality freq_card.
14:     Set P2 to second token position with cardinality freq_card.
15:   else {the frequency of freq_card = 1}
16:     Set P1 to first token position with cardinality freq_card.
17:     Set P2 to first token position with the next most frequent cardinality other than value 1.
18:   end if
19: else {P is from Step 1}
20:   Set P1 to first token position with lowest cardinality.
21:   Set P2 to second token position with lowest cardinality or first token position with the second lowest cardinality.
25: Return((P1, P2))
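As an illustration, Algorithm 1 and the cluster goodness ratio GC from Algorithm 3 translate into a few lines of Python. This is our own sketch, not the reference implementation: partitions are held as plain lists of log lines, and every line within a partition is assumed to have the same token count, as IPLoM's first partitioning step guarantees.

```python
def partition_prune(partitions, ps):
    """Algorithm 1 sketch: partitions whose share of all lines falls below
    the partition support threshold ps are merged into one outlier
    partition Temp_P, which is appended to the collection."""
    total_lines = sum(len(p) for p in partitions)
    kept, temp_p = [], []
    for part in partitions:
        if len(part) / total_lines < ps:
            temp_p.extend(part)      # low-support partition: lines go to Temp_P
        else:
            kept.append(part)
    kept.append(temp_p)
    return kept

def cluster_goodness(partition):
    """GC from Algorithm 3: the fraction of token positions that hold a
    single unique value across all lines of the partition."""
    columns = list(zip(*(line.split() for line in partition)))
    count_1 = sum(1 for values in columns if len(set(values)) == 1)
    return count_1 / len(columns)
```

A partition whose GC falls below the cluster goodness threshold is considered heterogeneous and is partitioned further; one at or above it is passed straight to the output collection.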