Network Data Classification Using Graph Partition: Sahan L. Maldeniya, Ajantha S. Atukorale, Wathsala W. Vithanage


Network Data Classification Using Graph Partition

Sahan L. Maldeniya, Ajantha S. Atukorale, Wathsala W. Vithanage


University of Colombo School of Computing
No 35, Reid Avenue,
Colombo 00700, Sri Lanka
smaldeniya@virtusa.com
aja@ucsc.cmb.ac.lk
wathsala@opensource.lk

Abstract: Network classification is applied in many domains, ranging from preserving the quality of a network to analyzing personal characteristics of network users. However, current methods for network data classification do not meet expectations. This is because networks are dynamic and prone to rapid change, while the methods used for classification have either been trained using examples or defined using heuristics. The World Wide Web is itself a large graph made of a number of URLs connected to each other via hyperlinks. Hence, in this work we use this graph nature of the WWW and apply graph theory to partition the network in order to classify network data. We use results obtained by classifying the network traffic with the k-means algorithm to evaluate the performance and usability of the proposed method.

I. INTRODUCTION

The Internet has become one of the things that change rapidly with changes in technology. Not only does it change, it also changes people, their behaviors and their attitudes. From being a simple medium for delivering static web pages and email in its early ages, the Internet has evolved to become a weapon for revolting against tyrants. Studying the Internet has become a science [1], in which studies cover not only technological aspects but also the patterns, social effects and trends that have arisen because of the Internet.

While managing a rapidly growing user base, ISPs¹ have to serve users by providing a safe and secure service, maintaining a good QoS² and allocating satisfactory bandwidth to the users of their services. The Internet is becoming a scarce resource, which has drawn the attention of even the UN³, which imposed new regulations to control the unlimited usage of the Internet. To allocate this limited resource fairly and securely while maintaining a good QoS, ISPs and network administrators are focusing on traffic analysis and network classification methods. The origin of network classification goes back to the early 2000s, when the user base was much smaller than at present and the applications of the Internet were far more limited.

One of the earliest methods used to classify network traffic was port-based traffic classification [2]. In port-based classification, network traffic is classified using the TCP/UDP port numbers of the data packets. Here, it is assumed that applications keep using the port numbers that have been dedicated to them.

Another method used in the early days of network data classification was payload-based classification [2], which relies on application-specific data. This method can be further divided into two parts: protocol decoding, where the application protocol data is used, and signature-based identification, where a search is carried out to identify an application-specific byte sequence in the packet payload.

However, with the growth, development and evolution of the Internet, these methods introduce their own set of problems, such as applications that have no single dedicated port, port allocation on demand, the complexities of load processing in networking devices, problems that arise when trying to process encrypted data, and breaches of privacy policies. To overcome these difficulties of the traditional methodologies, researchers turned their attention to other methods such as machine learning, statistical and heuristic-based methods.

In this paper we introduce a novel approach to classify network data using graph theory.

The Internet (or World Wide Web) is by nature a large graph in which URLs or URIs act as nodes and hyperlinks act as edges that connect these nodes with each other in multiple ways. Because of this, the World Wide Web inherits the qualities of a graph. This makes it much easier to use graph partition methods to classify or cluster a graph created from data collected as network traffic.

We have used the Louvain algorithm [3] as the graph partition method. Network traffic collected from the network operations center of the University of Colombo School of Computing has been used to create graphs, partition them and evaluate this method.

The rest of this paper is organized as follows. Section II discusses related work in the network traffic classification domain. Section III explains the Louvain algorithm, which is used as the graph partition algorithm in this work. Section IV describes our graph creation and traffic classification approach. Section V discusses the results and Section VI concludes our work.
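As an illustration of the port-based classification described in the Introduction, the following minimal sketch maps a TCP/UDP destination port to an application label. It is not part of the method proposed in this paper: the port table is a small hypothetical excerpt of well-known port assignments, and the name `classify_by_port` is an assumption made for this example.

```python
# Hypothetical excerpt of well-known port assignments used as the lookup table.
WELL_KNOWN_PORTS = {
    21: "FTP",
    23: "Telnet",
    25: "SMTP",
    80: "HTTP",
    443: "HTTPS",
}


def classify_by_port(dst_port):
    """Return the application label for a destination port, or "unknown"."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")


if __name__ == "__main__":
    # A web server listening on 8080 is not recognized, because the
    # lookup assumes every application keeps its dedicated port.
    for port in (80, 443, 8080):
        print(port, classify_by_port(port))
```

The unrecognized port in the usage loop shows the weakness noted above: once an application stops using its dedicated port, a pure port lookup can no longer label its traffic.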
¹ Internet Service Providers
² Quality of Service
³ United Nations

978-1-4799-2084-6/13/$31.00 ©2013 IEEE ICON 2013

II. RELATED WORKS

Network traffic classification has been an active and continuing research area for more than a decade. In this section we
describe several studies done in network traffic classification which are related to our work.

When traditional methods for network traffic classification were no longer usable, most researchers focused their attention on using machine learning, which was an active and successful method in other domains at that time, to classify network traffic. Both supervised and unsupervised machine learning approaches have been applied to classify network traffic.

In 2004 Roughan et al. [4] published their analysis of the usability of the nearest neighbors (NN), linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) algorithms for classifying labeled traffic data. They found that average packet length and flow duration are the most significant of the features they considered for classifying network traffic data. According to their results, the lowest error rate is seen for the classification that uses three protocol class types, comprising FTP, telnet and real media flows. Moore and Zuev in 2005 proposed using the Naive Bayes technique for labeled traffic data classification [5] [6]. They improved their classifier by using the Naive Bayes algorithm with kernel estimation and a fast correlation-based filter, which improved both performance and flow accuracy compared with the simple Naive Bayes classifier. In 2006, Park et al. [7] tested and compared the Naive Bayesian classifier with Kernel Estimation (NBKE), the Decision Tree and the Reduced Error Pruning Tree to test the suitability of those algorithms for the traffic classification domain. Results from this work showed that the Decision Tree and the Reduced Error Pruning Tree can achieve higher accuracy than NBKE. In 2006 Nguyen and Armitage [8] proposed a method to address the issue of timely and continuous classification of network traffic by using the most recent N packets of a flow, which they called a classification sliding window. The specialty of this classifier is that it does not need to capture the start of the traffic flow, and it allows classification to be initiated at any point, even when traffic flows are already in progress. Kiviat graphs [9] have been used to visualize the resulting clusters because cluster meaning is easier to interpret from them than from inter-arrival time/packet size plots. In 2004, McGregor et al. [10] suggested a method to classify network traffic by classifying the packet headers using the expectation maximization algorithm. Erman et al. in 2006 [11] compared three clustering algorithms and their usage in network traffic classification. In this research they evaluated the results of the k-means and DBSCAN algorithms against previous results published for the AutoClass algorithm by Zander et al. [12]. According to their results, the authors stated that the k-means and AutoClass algorithms produce more evenly distributed clusters than the DBSCAN algorithm. They found that the reason DBSCAN is not able to produce evenly distributed clusters is that it tries to include noisy data in the existing clusters. Their results also show that the k-means algorithm clusters data faster than the other two algorithms. We used the k-means algorithm, based on the results of the above research, to compare against the results of our classifier. In 2011 Atukorale and Vithanage presented a classifier [13] that works based on the intensity of the requests made to each website domain in the short run and the long run. This classifier is capable of clustering four website domains based on the above technique. Kohonen's self-organizing map [14] has been used to identify the most effective features after training with the traffic flows collected from access logs of HTTP proxy servers.

Even though machine learning has become an active participant in the network traffic classification domain, other areas such as statistical methods and heuristic-based methods have also contributed to it. In 2007, Crotti et al. [15] proposed a traffic classification method using statistical data, which they refer to as statistical fingerprints. Here the fingerprint for a given application layer protocol is generated by evaluating a set of probability density functions estimated from a set of flows. Sengupta and Sil [16] in 2012 published their work on classifying network traffic using rough set theory. Graph-theory-based classification methods are relatively new to the network traffic classification domain, because the efficiency of these methods was a concern in the past. Nevertheless, in 2003 Nobel et al. [16] proposed two methods to detect anomalies in graphs: detecting anomalous substructures and detecting anomalous sub-graphs. Siska et al. [17] proposed in 2010 a flow trace generator to test and evaluate intrusion detection systems using graph-based traffic classification techniques. Mehr et al. [18] also suggested a hybrid method in 2011 to identify the similarity of web pages using distributed automata and graph partition theories.

Our work is based on an efficient and accurate graph partition method proposed by Blondel et al. [3] in 2008. This method is explained in more detail in Section III. We use the modularity classes that result from running the Louvain algorithm on our data set to partition a graph created from the data set.

III. LOUVAIN ALGORITHM

This algorithm was proposed by Blondel et al. [3] in 2008. It finds high-modularity partitions of a large network, with complete unfolding of the hierarchical community structure, in a short time. The algorithm consists of two iteratively repeating phases. At the beginning, a separate modularity class is assigned to every node. As the algorithm runs, it accumulates nodes of different classes into common classes, using the modularity gain achieved by moving a node into such a class, and thereby forms communities of nodes. The community to which a node belongs is decided by the highest positive modularity gain achieved by the selected node with respect to its neighbor communities.

The second phase of the algorithm builds a new network using the communities found in the first phase. Here the communities formed in the first phase are treated as nodes of the new network. New edges are assigned between nodes that represent communities connected in the previous step. Each of these new edges has a weight calculated by summing up the weights of the links
between nodes in the two connected communities found in the first phase.

A. Modularity of a partition.

Modularity measures the quality of a partition. Consider a weighted graph G in which i and j are nodes and the community of node i is denoted by $c_i$. The modularity value $Q \in [-1, 1]$ can be calculated by

\[ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{1} \]

where $A_{ij}$ is the adjacency matrix that represents the graph G, $\delta(c_i, c_j)$ is 1 when nodes i and j are in the same community and 0 otherwise, $k_i$ is the degree of node i and m is the total weight, given by the following equations:

\[ k_i = \sum_{j} A_{ij} \]

\[ m = \frac{1}{2} \sum_{i,j} A_{ij} \]

Algorithm 1 Louvain algorithm
1: Assign each node in the graph to its own community.
2: For each node v in graph G, calculate the modularity gain $\Delta Q$ between v and its neighbour nodes.
3: If there is a positive modularity gain between v and a neighbour node, add v to the community of the neighbour node and move to the next node in the list.
4: If modularity cannot be increased any more within the node communities, stop the current process and move to step 5; otherwise carry on until no more optimization can be achieved.
5: Merge the nodes of each newly created community into a single node that represents that community, and connect two such nodes by an edge whose weight equals the sum of the weights of the edges linking the two communities.
6: Go to step 2 and repeat the process iteratively until no more communities are left to merge.
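To make equation (1) and the first phase of Algorithm 1 (steps 1 to 4) concrete, the following minimal pure-Python sketch computes Q for a partition and then greedily moves nodes into neighbouring communities. It is an illustration only and not the implementation used in this work: the names `build_adjacency`, `modularity` and `local_moving` and the six-node example graph are assumptions, and each candidate move is scored by recomputing Q instead of using a closed-form gain, which is easy to check but far slower than a practical implementation. The merging phase (step 5) is not shown.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build a symmetric weighted adjacency map from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return dict(adj)


def modularity(adj, community):
    """Modularity Q of a partition, following equation (1)."""
    degree = {i: sum(nbrs.values()) for i, nbrs in adj.items()}  # k_i
    two_m = sum(degree.values())                                 # 2m
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:                     # delta(c_i, c_j)
                q += adj[i].get(j, 0.0) - degree[i] * degree[j] / two_m
    return q / two_m


def local_moving(adj):
    """Naive first phase: move nodes while a move strictly increases Q."""
    community = {v: v for v in adj}      # step 1: one community per node
    improved = True
    while improved:
        improved = False
        for v in adj:
            original = community[v]
            best_q, best_c = modularity(adj, community), original
            for u in adj[v]:             # try the community of each neighbour
                community[v] = community[u]
                q = modularity(adj, community)
                if q > best_q + 1e-12:   # accept strict improvements only
                    best_q, best_c = q, community[u]
            community[v] = best_c
            if best_c != original:
                improved = True
    return community


# Example graph (an assumption): two triangles joined by the edge 2-3.
EDGES = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
ADJ = build_adjacency(EDGES)
PARTITION = local_moving(ADJ)
print(PARTITION, modularity(ADJ, PARTITION))
```

For this example graph the sketch places each triangle in its own community, which gives Q = 5/14.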
B. Modularity gain.

The modularity gain $\Delta Q$ is calculated by formula (2), where the sum of the weights of the links inside community C is denoted by $\Sigma_{in}$, the sum of the weights of the links incident to nodes in C is denoted by $\Sigma_{tot}$, $k_i$ is the sum of the weights of the links incident to node i, $k_{i,in}$ is the sum of the weights of the links from i to nodes in C, and the sum of the weights of all links in the network is denoted by m.

\[ \Delta Q = \left[ \frac{\Sigma_{in} + 2 k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \tag{2} \]

Using the above formulas, the modularity classes of the graph are calculated as in Algorithm 1.

IV. NETWORK DATA CLASSIFICATION METHOD

A. Data collection and processing.

We have used BRO IDS [19] to collect data from the university network. We modified the code of BRO IDS to remove all privacy-related details, such as email addresses, passwords and credit card numbers, from the data set. The collected data is in text format. BRO prints this data to log files hourly; the files are placed in a folder named after the date on which the data was collected, and a different log file is created for each common protocol type. Protocols such as HTTP, HTTPS, FTP and SMTP are considered common protocol types by BRO. It also saves connection-related data in the same way. These log files have been used as the raw data for our work.

The size of these log files exceeded 100 megabytes. Hence the analysis of these log files was done using terminal tools that ship with Linux distributions.

The majority of the network traffic we collected consisted of data belonging to the HTTP and HTTPS protocols. Since our approach is based on creating a graph from URLs and hyperlinks, we used the network traffic of these two protocol types.

In the HTTP and HTTPS log files each outgoing request is logged on a separate line. To create a graph from these network dumps we needed URLs and hyperlinks. BRO stores URLs and hyperlinks in the log files as a URL and the referrer of that URL. The log files also contain many other fields, such as the time stamp of the HTTP/HTTPS request, source IP address, destination IP address, source port and destination port. Each request can be distinguished from another using the source IP address, URL and time stamp fields of the request. Hence we have written several Python scripts to extract the source IP address, URL, time stamp and the URL that initiated the request. The extracted data was stored in a MySQL database because of the ease of processing and analyzing data using SQL.

We created several data chunks from the stored data, where each chunk includes the network traffic of two consecutive days. These data chunks were used to create several graphs and compare the results, which are discussed in Section V-A.

B. Graph of network data creation approach.

We used an adjacency-matrix-based approach to create the graph from the collected network traffic. Prior to creating the graph, each data chunk was analyzed to obtain the number of distinct URLs in that chunk. This number was used to create a square matrix for the analyzed data chunk. The adjacency matrix is of integer type, where each integer holds a count.

Since URLs consist of character sequences, they have to be mapped to indexes. We have used a simple hash function to achieve this URL-to-index mapping. The calculated hash value for a particular URL is unique only to that particular data set.
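The URL-to-index mapping and the integer adjacency matrix described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used in this work: the input is simplified to hypothetical `(referrer, url)` pairs, the names `index_of` and `build_matrix` are chosen for this example, and the index is taken as the MD5 digest modulo the number of distinct URLs. Note that a modulo mapping of this kind can send two distinct URLs to the same index.

```python
import hashlib


def index_of(url, size):
    """Map a URL to a matrix index as MD5(url) mod size."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % size


def build_matrix(records):
    """Build an integer adjacency matrix from (referrer, url) pairs.

    Cell [r][c] counts the requests whose referrer hashes to row r and
    whose requested URL hashes to column c.
    """
    records = list(records)
    distinct = {u for pair in records for u in pair}
    size = len(distinct)                      # one row and column per distinct URL
    matrix = [[0] * size for _ in range(size)]
    for referrer, url in records:
        matrix[index_of(referrer, size)][index_of(url, size)] += 1
    return matrix


# Hypothetical records: (referring URL, requested URL).
RECORDS = [
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://b.example/"),
    ("http://b.example/", "http://c.example/"),
]
MATRIX = build_matrix(RECORDS)
for row in MATRIX:
    print(row)
```

Because the index depends on the number of distinct URLs, the same URL generally receives a different index in a different data set.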
Hence the hash values have to be recalculated every time a new data set is used. We have used the MD5 algorithm as our hashing algorithm.

Algorithm 2 Index calculating function
S ← number of unique URLs in the data set
function INDEXOF(URL u, S)
    h ← MD5(u)
    return h mod S
end function

Using Algorithm 2 we have been able to index URLs. Algorithm 3 has then been used to create the network matrix, which is an adjacency matrix of the network traffic.

Algorithm 3 Matrix of network
Dataset D
U ← allUniqueURLsInDataSet(D)
M[size of U][size of U]    ▷ Integer square matrix initialized to 0
for u ∈ U do
    V ← getRefferingURLsBy(u, D)    ▷ All URLs referred by u
    uh ← indexOf(u)
    for v ∈ V do
        vh ← indexOf(v)
        M[uh][vh] ← M[uh][vh] + 1
    end for
end for
return M

C. Graph partitioning using Louvain algorithm.

We used an XML file format called GEXF to store the graph obtained by converting the resulting network matrix. Gephi [20] has been used to visualize and analyze the resulting graph. We then applied the Louvain algorithm to the created graph. Gephi has tools to color the partitions produced by the Louvain method in different color schemes. This gives a clear image of the partitions and where they reside in the graph.

Fig. 1. Graph after partitioning.

We have used several data sets and compared the resulting partitioned graphs.

V. RESULTS

Our test bed consisted of an Intel Core i7 processor with 8 MB of cache and 6 GB of physical memory. We used the Python programming language and the R statistical tool to analyse the results.

A. Representation of resulted partitions.

Our data set consisted of the natural traffic flows collected from the users of the University network. For this reason the collected traffic data is spread over a huge range of URL categories. To properly categorize these URLs we needed a standard method of labeling; in other words, a directory of URLs is needed to categorize the collected network traffic. Because no open standard for categorizing URLs and no web directory service was available, the resulting clusters have not been labeled. Hence we focused on identifying the resulting clusters by comparing the content of several such partitioned graphs.

We gained a rough idea about the partitions by manually analyzing the content of each partition. A comparison of the partitions obtained from several data sets also showed that, most of the time, a URL from a partition of one data set does not fall into the partition with the same neighborhood of URLs in another data set. However, we observed that a URL can be found in a subset of the neighborhood of one such partition.

This can be explained by the Louvain algorithm. In the algorithm, partitions are formed by accumulating several sub-partitions. In different data sets, however, the properties needed to form identical accumulated partitions might not be present. Hence sub-partitions accumulate with some other sub-partitions, forming a totally new partition.

B. Comparison with k-means clustering.

According to the results published by Erman et al. in 2006 [11], the k-means algorithm is the most efficient unsupervised machine learning approach among the three algorithms they tested. Based on these results we used the k-means algorithm for comparison with the results of our method.

The k-means clustering algorithm was applied to the data extracted from the graph. Here we considered only the edges of our graph. The within-cluster sum of squares was plotted against the number of clusters for this data, as shown in Figure 2(a). This plot was used to identify the appropriate number of clusters in our graph data set for the k-means algorithm.

The number of clusters that minimizes the within-cluster sum of squares was chosen as the ideal number of clusters for the data set, and the k-means algorithm was applied to the data set. We plotted the resulting clusters, applying a color scheme to distinguish each cluster.

Table I shows details about the clusters formed by the two algorithms for our data sets. We can observe that the Louvain algorithm always formed a larger number of partitions than the k-means algorithm.
(a) Within cluster sum of squares against number of clusters. (b) Plotted clusters from k-means.

Fig. 2. Eccentricity distributions of two distinct data sets.
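Panel (a) above plots the within-cluster sum of squares against the number of clusters. The following minimal sketch shows how such a curve can be computed with a plain k-means (Lloyd's algorithm) in pure Python. It is only an illustration: the analysis in this work used its own tooling, the encoding of graph edges as feature vectors is not reproduced here, so eight hypothetical 2-D points stand in for the data, and the names `kmeans` and `wcss_curve` and the deterministic initialization are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic start; returns (centroids, labels)."""
    ordered = sorted(points)
    # Start from k points spread evenly over the sorted data (an assumption;
    # random initialization is more common in practice).
    centroids = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def wcss_curve(points, max_k):
    """Within-cluster sum of squares for k = 1 .. max_k."""
    curve = []
    for k in range(1, max_k + 1):
        centroids, labels = kmeans(points, k)
        total = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        curve.append((k, total))
    return curve


# Hypothetical data: two well-separated groups of four points each.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
for k, total in wcss_curve(POINTS, 4):
    print(k, total)
```

On this toy data the sum of squares drops from 404 at k = 1 to 4 at k = 2, the number of groups actually present.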

(a) Graph G1 . (b) Graph G2 .

Fig. 3. Eccentricity distributions of two distinct data sets.
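The eccentricity of a node is the greatest shortest-path distance from that node to any other node it can reach, and an eccentricity distribution such as those in Fig. 3 counts how many nodes share each value. The following minimal sketch computes such a distribution with a breadth-first search. It is an illustration rather than the tool used in this work: it assumes an unweighted, undirected graph given as a hypothetical adjacency map, and the function names are chosen for this example.

```python
from collections import Counter, deque


def eccentricity(adj, source):
    """Greatest hop distance from source to any reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:          # first visit gives the shortest distance
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())


def eccentricity_distribution(adj):
    """Map each eccentricity value to the number of nodes that have it."""
    return Counter(eccentricity(adj, v) for v in adj)


# Hypothetical example: a path graph 0 - 1 - 2 - 3 - 4.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
DISTRIBUTION = eccentricity_distribution(PATH)
print(sorted(DISTRIBUTION.items()))
```

Two graphs can then be compared by comparing their distributions, normalized by node count when the graphs differ in size.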

No. of nodes   No. of edges   No. of partitions   No. of k-means clusters
11433          34830          2488                30
13686          48083          2792                24
5419           12898          1140                19

TABLE I
DETAILS OF DATASETS.

The larger number of partitions formed by the Louvain algorithm might be because nodes that cannot be assigned to a partition are left alone by the algorithm. Also, at the beginning the Louvain algorithm assigns each node to a separate cluster. The number of partitions will be high when single nodes are left out without being accumulated into partitions.

C. Time complexities.

The Louvain method is known to reduce run time after detecting several hierarchies of communities and has a time complexity of O(n log n). The k-means algorithm is known to have a time complexity of O(n^(dk+1) log n), where d is the number of dimensions over which the data is scattered and k is the number of centroids. Both algorithms address NP-hard problems, while the Louvain method takes a greedy optimization approach. Table II shows the time complexities of the proposed graph partitioning approach and the clustering approach.

According to Table II, the Louvain algorithm can cluster network traffic in less time than the k-means algorithm. Since running time matters in the traffic classification domain, the best results can be achieved by using the graph partition approach.

D. Eccentricity distributions.

After comparing eccentricity plots of several networks, as shown in Figure 3, we observed that they have identical distributions. This means that nodes are positioned in an identical way, with a similar frequency of nodes having the same distance value from one node to another. Hence the number of nodes met when traveling from one node to another has the same values across the data sets collected.

An explanation for this could be that only a portion of the World Wide Web (WWW) is visible to a given region. Hence the URLs that link with each other and are visible to that region change rarely. The reasons that define the portion of the WWW visible to a given region vary from the legal system of that region to cultural values and social beliefs. This could also vary based on technology.
                       k-means clustering                    Louvain method
Data filtering         O(n)                                  O(n)
Network creation       O(n^2)                                O(n^2)
Clustering algorithm   O(n^(dk+1) log n)                     O(n log n)
Total                  O(n^(dk+1) log n) + O(n^2) + O(n)     O(n log n) + O(n^2) + O(n)

TABLE II
TIME COMPLEXITIES.

Before coming to a final conclusion, more research should be carried out on the eccentricity graphs of network traffic, using data collected from different locations.

VI. CONCLUSION

Our work has illustrated the importance of using graph theory to classify network traffic. Since the World Wide Web is a large graph, graph theory is suitable for approximating the natural clusters that reside in a network by the resulting partitions.

In this work we only used pre-collected traffic data to create graphs. Hence the graph is static for the period over which the data was collected. To obtain a better understanding, however, theories about dynamic graphs [21] should be used. These dynamic graph theories can also be used to classify network traffic obtained from live streaming sources.

Our work can be used in the intrusion detection domain to find network anomalies. It can also be used with QoS services, where ISPs can find high-traffic partitions and use that information to solve problems that arise with high network usage.

We faced the lack of a standard web categorization method while carrying out our work. Though a few public web directories exist, maintained through crowdsourcing, they are not updated frequently enough to cover the rapidly expanding World Wide Web. In order to use the full power of network partitioning there should be a frequently updated, standard directory service to label and categorize URLs.

REFERENCES

[1] J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, and D. Weitzner, "Web Science: An interdisciplinary approach to understanding the World Wide Web," Communications of the ACM - Web science, vol. 51, no. 7, pp. 60-69, 2008.
[2] T. Nguyen and G. Armitage, "A survey of techniques for internet traffic classification using machine learning," IEEE Communications Surveys & Tutorials, vol. 10, no. 4, pp. 56-76, 2008.
[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
[4] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification," in Proceedings of the 4th ACM/SIGCOMM conference on Internet measurement. Taormina, Sicily, Italy: ACM, 2004, pp. 135-148.
[5] D. Zuev and A. W. Moore, "Traffic Classification Using a Statistical Approach," in Passive and Active Network Measurement, C. Dovrolis, Ed. Berlin / Heidelberg: Springer, 2005, pp. 321-324.
[6] A. W. Moore and D. Zuev, "Internet Traffic Classification Using Bayesian Analysis Techniques," SIGMETRICS Perform. Eval. Rev., vol. 33, no. 1, pp. 50-60, 2005.
[7] J. Park, H.-R. Tyan, and C.-C. J. Kuo, "GA-Based Internet Traffic Classification Technique for QoS Provisioning," in Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia. IEEE Computer Society, 2006, pp. 251-254.
[8] T. T. T. Nguyen and G. Armitage, "Training on multiple sub-flows to optimise the use of Machine Learning classifiers in real-world IP networks," in Proceedings of the IEEE 31st Conference on Local Computer Networks. Tampa, Florida, USA: IEEE Computer Society, December 2006, pp. 369-376.
[9] K. W. Kolence and P. J. Kiviat, "Software unit profiles & kiviat figures," SIGMETRICS Perform. Eval. Rev., vol. 2, no. 3, pp. 2-12, Sep. 1973. [Online]. Available: http://doi.acm.org/10.1145/1041613.1041614
[10] A. McGregor, M. Hall, and P. Lorier, "Flow clustering using machine learning techniques," in Passive and Active Network Measurement, 5th International Workshop, PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004, Proceedings, B. Chadi and I. Pratt, Eds. Springer, 2004, vol. 3015, pp. 205-214.
[11] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the 2006 SIGCOMM workshop on Mining network data. New York, New York, USA: ACM Press, 2006, pp. 281-286.
[12] S. Zander, T. Nguyen, and G. Armitage, "Automated traffic classification and application identification using machine learning," in Proceedings of The IEEE Conference on Local Computer Networks 30th Anniversary, ser. LCN '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 250-257.
[13] W. W. Vithanage and A. S. Atukorale, "A Novel Classifier for Engineering Web Traffic," in 2011 IEEE Symposium on Computers and Communications ISCC. IEEE, 2011, pp. 1009-1016.
[14] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.
[15] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, "Traffic classification through simple statistical fingerprinting," ACM SIGCOMM Computer Communication Review, vol. 37, no. 1, p. 5, 2007.
[16] N. Sengupta and J. Sil, "Evaluation of Rough Set Theory Based Network Traffic Data Classifier Using Different Discretization Method," International Journal of Information and Electronics Engineering, vol. 2, no. 3, pp. 338-341, 2012.
[17] P. Siska, M. P. Stoecklin, A. Kind, and T. Braun, "A flow trace generator using graph-based traffic classification techniques," in Proceedings of the 6th International Wireless Communications and Mobile Computing Conference, ser. IWCMC '10. New York, NY, USA: ACM, 2010, pp. 457-462. [Online]. Available: http://doi.acm.org/10.1145/1815396.1815503
[18] S. M. Mehr, M. Taran, A. B. Hashemi, and M. R. Meybodi, "Determining web pages similarity using distributed learning automata and graph partitioning," 2011 International Symposium on Artificial Intelligence and Signal Processing (AISP), pp. 129-134, Jun. 2011. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5960971
[19] V. Paxson, "Bro: a System for Detecting Network Intruders in Real-Time," Computer Networks, vol. 31, no. 23-24, pp. 2435-2463, 1999. [Online]. Available: http://www.icir.org/vern/papers/bro-CN99.pdf
[20] M. Bastian, S. Heymann, and M. Jacomy, "Gephi: An open source software for exploring and manipulating networks," in International AAAI Conference on Weblogs and Social Media, 2009. [Online]. Available: https://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/154/1009
[21] C. C. Bilgin and B. Yener, "Dynamic Network Evolution: Models, Clustering, Anomaly Detection."
