Network Data Classification Using Graph Partition: Sahan L. Maldeniya, Ajantha S. Atukorale, Wathsala W. Vithanage
Abstract: Network classification is applied in many domains, ranging from preserving the quality of a network to analyzing the personal characteristics of network users. However, current methods for network data classification do not meet expectations. This is because networks are dynamic and prone to rapid change, while the methods used for classification have been either trained using examples or defined using heuristics.
The World Wide Web itself is a large graph made up of URLs connected to each other via hyperlinks. Hence, in this work we use this graph nature of the WWW and apply graph theory to partition the network in order to classify network data. We use the results obtained by classifying the network traffic with the k-means algorithm to evaluate the performance and usability of the proposed method.

I. INTRODUCTION
The Internet has become one of the things that change rapidly as technology changes. Not only does it change, it also changes people, their behaviors and their attitudes. From being a simple medium for delivering static web pages and email in its early days, the Internet has evolved into a weapon for revolting against tyrants. The study of the Internet has become a science [1], covering not only technological aspects but also the patterns, social effects and trends that have arisen because of the Internet.
While managing a rapidly growing user base, ISPs^1 have to serve users by providing a safe and secure service, maintaining a good QoS^2 and allocating satisfactory bandwidth to the users of their services. The Internet is becoming a scarce resource, which has drawn the attention of even the UN^3, which imposed new regulations to control the unlimited usage of the Internet. To serve their customers by fairly allocating this limited resource in a secure manner while maintaining a good QoS, ISPs and network administrators are focusing on traffic analysis and network classification methods. The origin of network classification goes back to the early 2000s, when the user base was much smaller than at present and the applications of the Internet were much more limited.
One of the earliest methods used to classify network traffic was port-based traffic classification [2]. In port-based classification, network traffic is classified using the TCP/UDP port numbers of the data packets. Here, it is assumed that applications keep using the port numbers that have been dedicated to them.
Another method used in the early days of network data classification was payload-based classification [2], which relies on application-specific data. This method can be further divided into two parts: protocol decoding, where the application protocol data are used, and signature-based identification, where a search is carried out to identify an application-specific byte sequence in the packet payload.
However, with the growth, development and evolution of the Internet, these methods introduced their own set of problems: applications no longer have a single dedicated port, ports are allocated on demand, payload inspection places a processing load on networking devices, encrypted data are difficult to process, and privacy policies may be breached. To overcome these difficulties of the traditional methodologies, researchers turned their attention to other methods, such as machine learning, statistical and heuristic-based methods.
In this paper we introduce a novel approach to classifying network data using graph theory.
The Internet (or World Wide Web) is by nature a large graph, in which URLs or URIs act as nodes and hyperlinks act as edges connecting these nodes with each other in multiple ways. Because of this, the World Wide Web has inherited the qualities of a graph. This makes it much easier to use graph partition methods to classify or cluster a graph created from data collected as network traffic.
We used the Louvain algorithm [3] as the graph partition method. Network traffic collected from the network operations center of the University of Colombo School of Computing was used to create graphs, partition them and evaluate this method.
The rest of this paper is organized as follows. Section II discusses related work in the network traffic classification domain. Section III explains the Louvain algorithm, which is used as the graph partition algorithm in this work. Section IV describes our graph creation and traffic classification approach. Section V-A discusses the results and Section VI concludes our work.
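The view of the Web as a graph, with URLs as nodes and hyperlinks as edges, can be illustrated with a minimal Python sketch. The (referrer, URL) pairs and the `build_link_graph` helper below are hypothetical and serve only as an illustration; they are not taken from the data set or the scripts used in this work.

```python
from collections import defaultdict


def build_link_graph(requests):
    """Build a directed graph: each URL maps to the set of URLs it links to."""
    graph = defaultdict(set)
    for referrer, url in requests:
        graph[referrer].add(url)      # hyperlink: referrer -> url
        graph.setdefault(url, set())  # link targets are nodes as well
    return dict(graph)


# Hypothetical (referrer, requested URL) pairs.
requests = [
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
]
graph = build_link_graph(requests)
```

Each key of `graph` is a node and each member of its set is an outgoing edge, which is the structure that a graph partition method operates on.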
^1 Internet Service Providers
^2 Quality of Service
^3 United Nations

II. RELATED WORKS
Network traffic classification has been an active and continuing research area for more than a decade. In this section we
$$m = \frac{1}{2} \sum_{i,j} A_{ij}$$

B. Modularity gain.
The modularity gain $\Delta Q$ is calculated by formula (2), where $\Sigma_{in}$ denotes the sum of the weights of the links inside community $C$, $\Sigma_{tot}$ denotes the sum of the weights of the links incident to nodes in $C$, $k_i$ is the sum of the weights of the links incident to node $i$, $k_{i,in}$ is the sum of the weights of the links from $i$ to nodes in $C$, and $m$ is the sum of the weights of all links in the network.

$$\Delta Q = \left[ \frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \qquad (2)$$

Using the above formulas, the modularity classes of the graph are calculated as in Algorithm 1.

IV. NETWORK DATA CLASSIFICATION METHOD
A. Data collection and processing.
We used BRO IDS [19] to collect data from the university network. We modified the code of BRO IDS to remove all privacy-related details, such as email addresses, passwords and credit card numbers, from the data set. The collected data are in text format. BRO writes these data to log files hourly; the files are placed in a folder named after the date on which the data were collected, and a separate log file is created for each common protocol type. BRO considers protocols such as HTTP, HTTPS, FTP and SMTP to be common protocol types. It also saves connection-related data in the same way. These log files were used as the raw data for our work.
The size of these log files exceeded 100 megabytes. Hence the analysis of these log files was done using terminal tools that ship with Linux distributions.
The majority of the network traffic we collected consisted of data belonging to the HTTP and HTTPS protocols. Since our approach is based on creating a graph from URLs and hyperlinks, we used the network traffic of these two protocol types.
In the HTTP and HTTPS log files, each outgoing request is logged on a separate line. To create a graph from these network dumps we needed URLs and hyperlinks. BRO stores URLs and hyperlinks in the log files as a URL and the referrer of that URL. The log files also contain many other fields, such as the time stamp of the HTTP/HTTPS request, source IP address, destination IP address, source port and destination port. Each request can be distinguished from the others using the source IP address, URL and time stamp fields of the request. Hence we wrote several Python scripts to extract the source IP address, the URL, the time stamp and the URL that initiated the request. The extracted data were stored in a MySQL database because of the ease of processing and analyzing data using SQL.
We created several data chunks from the stored data, where each chunk includes the network traffic of two consecutive days. These data chunks were used to create several graphs and compare the results, which are discussed in Section V-A.

B. Graph of network data creation approach.
We used an adjacency-matrix-based approach to create the graph from the collected network traffic. Prior to creating the graph, each data chunk was analyzed to obtain the number of distinct URLs in that data chunk. This number was used as the dimension of a square matrix for the analyzed data chunk. The adjacency matrix is of integer type, where the integer type is used to hold a count.
Since URLs consist of character sequences, they have to be mapped to indexes. We used a simple hash function to achieve this URL-to-index mapping. The calculated hash value for a particular URL will be unique only to that particular data set. Hence we have to calculate
the hash values every time we use a new data set. We used the MD5 algorithm as our hashing algorithm.

Algorithm 2 Index calculating function
S ← number of unique URLs in the data set
function INDEXOF(URL u, S)
    h ← MD5(u)
    return h mod S
end function

Using Algorithm 2 we were able to index the URLs. Algorithm 3 was then used to create the network matrix, which is the adjacency matrix of the network traffic.

Algorithm 3 Matrix of network
Dataset D
U ← allUniqueURLsInDataSet(D)
M[size of U][size of U]    ▷ Integer square matrix initialized to 0
for u ∈ U do
    V ← getRefferingURLsBy(u, D)    ▷ All URLs referred by u
    uh ← indexOf(u, size of U)
    for v ∈ V do
        vh ← indexOf(v, size of U)
        M[uh][vh] ← M[uh][vh] + 1
    end for
end for
return M

C. Graph partitioning using Louvain algorithm.
We used an XML file format called GEXF to store the graph obtained by converting the resulting network matrix. Gephi [20] was used to visualize and analyze the resulting graph. We then applied the Louvain algorithm to the created graph. Gephi has tools to color the partitions produced by the Louvain method in different color schemes. This gives a clear picture of the partitions and of where they reside in the graph.

Fig. 1. Graph after partitioning.

V. RESULTS
Our test bed consisted of an Intel Core i7 processor with 8 MB of cache and 6 GB of physical memory. We used the Python programming language and the R statistical tool to analyse the results.

A. Representation of resulted partitions.
Our data set consisted of the natural traffic flows collected from the users of the University network. For this reason the collected traffic data are spread across a huge range of URL categories. To properly categorize these URLs we needed a standard method of labeling. In other words, a directory of URLs is needed to categorize the collected network traffic. Due to the unavailability of an open standard for categorizing URLs, or of a web directory service, the resulting clusters have not been labeled. Hence we focused on identifying the resulting clusters by comparing the content of several such partitioned graphs.
We gained a rough idea about the partitions by manually analyzing the content of each partition. A comparison of the partitions resulting from several data sets also showed that, most of the time, a URL from a partition of one data set does not fall into the partition with the same neighborhood of URLs in another data set. However, we observed that such a URL can be found in a subset of the neighborhood of one such partition.
This can be explained by the Louvain algorithm. In the algorithm, partitions are formed by accumulating several sub-partitions. In different data sets, however, the properties needed to form identical accumulated partitions might not be present. Hence sub-partitions will merge with some other sub-partitions, forming a totally new partition.

B. Comparison with k-means clustering.
According to the results published by Erman et al. in 2006 [11], the k-means algorithm is the most efficient unsupervised machine learning approach among the three algorithms they tested. Based on these results we used the k-means algorithm for comparison with the results of our method.
The k-means clustering algorithm was applied to data extracted from the graph. Here we considered only the edges of our graph. The within-cluster sum of squares was plotted against the number of clusters for these data, as shown in Figure 2(a). This plot was used to identify the appropriate number of clusters in our graph data set for the k-means algorithm.
The number of clusters that minimizes the within-cluster sum of squares was chosen as the ideal number of clusters for the data set, and the k-means algorithm was applied to the data set. We plotted the resulting clusters, applying a color scheme to distinguish each cluster.
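The within-cluster sum of squares (WCSS) curve used to choose the number of clusters can be sketched in plain Python as follows. This is a minimal illustration and not the code used to produce the results in this work: encoding each edge as a (source index, target index) point, the deterministic farthest-first initialisation, and the names `k_means` and `wcss_curve` are all assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def initial_centroids(points, k):
    """Deterministic farthest-first start (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids


def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = initial_centroids(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def wcss_curve(points, max_k):
    """Within-cluster sum of squares for k = 1 .. max_k."""
    curve = {}
    for k in range(1, max_k + 1):
        centroids, labels = k_means(points, k)
        curve[k] = sum(squared_distance(p, centroids[label])
                       for p, label in zip(points, labels))
    return curve


# Hypothetical edges, each encoded as (source index, target index).
edges = [(0, 0), (0, 1), (1, 0), (1, 1),
         (100, 100), (100, 101), (101, 100), (101, 101)]
curve = wcss_curve(edges, 3)
```

For this toy input the WCSS drops sharply between one and two clusters and only slightly afterwards. Since the WCSS generally keeps shrinking as clusters are added, such a curve is usually read by looking for the point where the decrease levels off.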
We used several data sets and took the partitioned graphs to compare the results.

Table I shows details about the clusters formed by the two algorithms for our data sets. We can observe that the Louvain algorithm always formed a larger number of partitions than the k-means algorithm. The reason for this might be that nodes which cannot be assigned to partitions are left alone by the Louvain algorithm. Also, at its beginning the Louvain algorithm assigns each node to a separate cluster. The number of partitions will be high when single nodes are left out without being accumulated into partitions.

Fig. 2. (a) Within-cluster sum of squares against number of clusters. (b) Plotted clusters from k-means.

C. Time complexities.
The Louvain method is known to reduce its run time after detecting several hierarchies of communities and has a time complexity of $O(n \log n)$. The k-means algorithm is known to have a time complexity of $O(n^{dk+1} \log n)$, where $d$ is the number of dimensions the data are scattered through and $k$ is the number of centroids. Both algorithms address NP-hard problems, while the Louvain method takes a greedy optimization approach. Table II summarizes the time complexities of both approaches.

                      k-means clustering                   Louvain method
Data filtering        O(n)                                 O(n)
Network creation      O(n^2)                               O(n^2)
Clustering algorithm  O(n^{dk+1} log n)                    O(n log n)
Total                 O(n^{dk+1} log n) + O(n^2) + O(n)    O(n log n) + O(n^2) + O(n)

TABLE II
TIME COMPLEXITIES.

D. Eccentricity distributions.
After comparing the eccentricity plots of several networks, as shown in Figure 3, we observed that they have identical distributions. This means the nodes are positioned in an identical way, with similar frequencies of nodes having the same distance value from one node to another. Hence the number of nodes met when traveling from one node to another has the same values across the data sets collected.
An explanation for this could be that only a portion of the World Wide Web (WWW) is visible to a given region. Hence the URLs that link with each other, and that are visible to that region, rarely change. The reasons that define the portion of the WWW visible to a given region vary from the legal system of that region to cultural values and social beliefs. This could also vary based on technology. Before coming to
a final conclusion, more research should be carried out on the eccentricity graphs of network traffic using data collected from different locations.

VI. CONCLUSION
Our work has illustrated the importance of using graph theory to classify network traffic. Since the World Wide Web is a large graph, graph theory is suitable for approximating the resulting partitions to the natural clusters residing in a network.
In this work we used only pre-collected traffic data to create graphs. Hence the graph is static for the period over which the data were collected. However, to obtain a better understanding, theories about dynamic graphs [21] should be used. These dynamic graph theories can also be used to classify network traffic obtained from live streaming sources.
Our work can be used in the intrusion detection domain to find network anomalies. It can also be used with QoS services, where ISPs can find high-traffic partitions and use that information to solve problems that arise with high network usage.
We faced the lack of a standard web categorization method while carrying out our work. Though a few public web directories exist that are maintained using crowdsourcing, they are not updated frequently enough to cover the rapidly expanding World Wide Web. In order to use the full power of network partitioning there should be a frequently updated, standard directory service to label and categorize URLs.

REFERENCES
[1] J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, and D. Weitzner, "Web Science: An interdisciplinary approach to understanding the World Wide Web," Communications of the ACM - Web science, vol. 51, no. 7, pp. 60-69, 2008.
[2] T. Nguyen and G. Armitage, "A survey of techniques for internet traffic classification using machine learning," IEEE Communications Surveys & Tutorials, vol. 10, no. 4, pp. 56-76, 2008.
[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
[4] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification," in Proceedings of the 4th ACM/SIGCOMM conference on Internet measurement. Taormina, Sicily, Italy: ACM, 2004, pp. 135-148.
[5] D. Zuev and A. W. Moore, "Traffic Classification Using a Statistical Approach," in Passive and Active Network Measurement, C. Dovrolis, Ed. Berlin / Heidelberg: Springer, 2005, pp. 321-324.
[6] A. W. Moore and D. Zuev, "Internet Traffic Classification Using Bayesian Analysis Techniques," SIGMETRICS Perform. Eval. Rev., vol. 33, no. 1, pp. 50-60, 2005.
[7] J. Park, H.-R. Tyan, and C.-C. J. Kuo, "GA-Based Internet Traffic Classification Technique for QoS Provisioning," in Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia. IEEE Computer Society, 2006, pp. 251-254.
[8] T. T. T. Nguyen and G. Armitage, "Training on multiple sub-flows to optimise the use of Machine Learning classifiers in real-world IP networks," in Proceedings of the IEEE 31st Conference on Local Computer Networks. Tampa, Florida, USA: IEEE Computer Society, December 2006, pp. 369-376.
[9] K. W. Kolence and P. J. Kiviat, "Software unit profiles & Kiviat figures," SIGMETRICS Perform. Eval. Rev., vol. 2, no. 3, pp. 2-12, Sep. 1973. [Online]. Available: http://doi.acm.org/10.1145/1041613.1041614
[10] A. McGregor, M. Hall, and P. Lorier, "Flow clustering using machine learning techniques," in Passive and Active Network Measurement, 5th International Workshop, PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004, Proceedings, C. Barakat and I. Pratt, Eds. Springer, 2004, vol. 3015, pp. 205-214.
[11] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the 2006 SIGCOMM workshop on Mining network data. New York, New York, USA: ACM Press, 2006, pp. 281-286.
[12] S. Zander, T. Nguyen, and G. Armitage, "Automated traffic classification and application identification using machine learning," in Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary, ser. LCN '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 250-257.
[13] W. W. Vithanage and A. S. Atukorale, "A Novel Classifier for Engineering Web Traffic," in 2011 IEEE Symposium on Computers and Communications (ISCC). IEEE, 2011, pp. 1009-1016.
[14] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.
[15] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, "Traffic classification through simple statistical fingerprinting," ACM SIGCOMM Computer Communication Review, vol. 37, no. 1, p. 5, 2007.
[16] N. Sengupta and J. Sil, "Evaluation of Rough Set Theory Based Network Traffic Data Classifier Using Different Discretization Method," International Journal of Information and Electronics Engineering, vol. 2, no. 3, pp. 338-341, 2012.
[17] P. Siska, M. P. Stoecklin, A. Kind, and T. Braun, "A flow trace generator using graph-based traffic classification techniques," in Proceedings of the 6th International Wireless Communications and Mobile Computing Conference, ser. IWCMC '10. New York, NY, USA: ACM, 2010, pp. 457-462. [Online]. Available: http://doi.acm.org/10.1145/1815396.1815503
[18] S. M. Mehr, M. Taran, A. B. Hashemi, and M. R. Meybodi, "Determining web pages similarity using distributed learning automata and graph partitioning," 2011 International Symposium on Artificial Intelligence and Signal Processing (AISP), pp. 129-134, Jun. 2011. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5960971
[19] V. Paxson, "Bro: a System for Detecting Network Intruders in Real-Time," Computer Networks, vol. 31, no. 23-24, pp. 2435-2463, 1999. [Online]. Available: http://www.icir.org/vern/papers/bro-CN99.pdf
[20] M. Bastian, S. Heymann, and M. Jacomy, "Gephi: An open source software for exploring and manipulating networks," in International AAAI Conference on Weblogs and Social Media, 2009. [Online]. Available: https://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/154/1009
[21] C. C. Bilgin and B. Yener, "Dynamic Network Evolution: Models, Clustering, Anomaly Detection."