
International Journal of Pure and Applied Mathematics
Volume 117 No. 19 2017, 403-410
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue

A SIMPLE DATA COMPRESSION ALGORITHM FOR ANOMALY DETECTION IN WIRELESS SENSOR NETWORKS

1Uthayakumar J, 2Vengattaraman T, 3J. Amudhavel
1Research Scholar, Department of Computer Science, Pondicherry University, Puducherry, India
2Assistant Professor, Department of Computer Science, Pondicherry University, Puducherry, India
3Associate Professor, Department of CSE, KL University, Andhra Pradesh, India
1uthayresearchscholar@gmail.com, 2vengattaramant@gmail.com, 3info.amudhavel@gmail.com

Abstract: Wireless Sensor Networks (WSN) consist of numerous sensor nodes and are deeply embedded into the real world for environmental monitoring. As the sensor nodes are battery powered, energy efficiency is considered an important design issue in WSN. Since data transmission consumes more energy than sensing and processing, many research efforts have been carried out to reduce the amount of data transmission. Data compression (DC) techniques are commonly used to reduce the amount of data transmitted. On the other side, anomaly detection is also a challenging task in WSN for enhancing data integrity. To achieve this, the sensor nodes append labels to the sensed data to differentiate between actual and abnormal values. The label can be represented as '0' for actual data and '1' for anomalous data. In this paper, we employ the Lempel Ziv Markov-chain Algorithm (LZMA) to compress the labeled data in WSN. LZMA is a lossless data compression algorithm well suited for real-time applications. The LZMA algorithm compresses the labeled data and transmits it to the Base Station (BS) via single hop and multi hop communication. Extensive experiments were performed using real-world labeled WSN datasets. To ensure the effectiveness of the LZMA algorithm, it is compared with 5 well-known compression algorithms, namely Deflate, Lempel Ziv Welch (LZW), Burrows Wheeler Transform (BWT), Huffman coding and Arithmetic coding (AC). Compared with these existing methods, LZMA achieves significantly better compression, with an average compression ratio of 0.104 at a bit rate of 0.839 BPC.

Keywords: Anomaly detection; Data compression; Lempel Ziv Welch; Multi hop communication; Wireless Sensor Networks

1. Introduction

The recent advancement in wireless networks and Micro-Electro-Mechanical-Systems (MEMS) leads to the development of low-cost, compact and smart sensor nodes. WSN are randomly deployed in the sensing field to measure physical parameters such as temperature, humidity, pressure, vibration, etc. [1]. WSN are widely used in tracking and data gathering applications, including surveillance (indoor and outdoor), healthcare, disaster management, habitat monitoring, etc. [2]. A sensor node is built up of four components, namely transducer, microcontroller, battery, and transceiver. Sensor nodes are constrained in energy, bandwidth, memory and processing capabilities. As they are battery powered and usually deployed in harsh environments, it is not easy to recharge or replace batteries [3]. The lifetime of a WSN can be extended in two ways: increasing the battery storage capacity and effectively utilizing the available energy. Increasing the battery capacity is not possible in all situations, so the effective utilization of available energy is considered an important design issue. Several researchers have observed that a large amount of energy is spent on data transmission compared to sensing and processing operations [4]. This reveals that reducing the amount of data transmission is an effective way to achieve energy efficiency. Since the sensed data exhibit strong temporal correlation, data transmission carries considerable redundancy, and DC is considered a useful approach to eliminate this redundancy in the sensed data [5].

A DC technique represents the data in its compact form without compromising the data quality beyond a certain extent. It is used to compress text, images, audio, video, etc. [6]. The compact form of any data is achieved by recognizing and exploiting the patterns that exist in the data. DC is divided into two types based on the quality of the reconstructed data: lossless compression and lossy compression [7]. Lossy compression incurs a loss of quality in the reconstructed data; it achieves better compression and is useful in situations where the loss of quality is acceptable, for example images, audio and videos. In some situations, the loss of information is unacceptable and the reconstructed data should be the

exact replica of the original data [8]. The basic idea of compressing data involves two steps: eliminating redundant data and eliminating irrelevant data. The redundancy present in real-world data is what makes data compression possible. The removal of data that cannot be perceived by the human eye is termed irrelevancy reduction. The reduction in the amount of data makes it possible to store more information in the same storage space and reduces the transmission time significantly. This property is highly useful in WSN for compressing sensed data [9].

Another important challenge in WSN is to maintain the integrity of the data sensed by the sensors. This requirement leads to a research problem known as anomaly detection [10]. It plays a major role in intrusion detection and fault diagnosis, and it is needed to detect any misbehavior or anomalies for the reliable and secure functioning of the network. Anomaly detection is useful in WSN for identifying abnormal variations in the sensing field [11]. It is the process of raising an alert when a significant change occurs. For instance, consider a WSN deployed to monitor environmental conditions such as temperature and humidity for forest fire detection. When a sensor malfunctions or a fire breaks out, the sensed values will drastically deviate from the actual values. These abnormal conditions are identified and notified to the BS for further investigation. Anomaly detection operates in two ways when integrated into a WSN: centralized approaches and distributed approaches. In centralized approaches, the sensor node senses the environment and transmits the sensed data to the BS, and only the BS determines whether the data is actual or anomalous. This traditional approach forces the sensor node to send all raw or erroneous measurements to the BS, wasting energy on the transmission of a large number of raw sensor measurements. In distributed approaches, the sensor nodes sense the field and identify anomalies themselves using an anomaly detection algorithm [12]. The sensor node appends a label to the sensed value to represent anomalies; this label is used to differentiate between normal data and anomalous data. In this paper, we employ the LZMA lossless compression algorithm to compress labeled data in WSN.

1.1 Contribution of this paper

The contribution of this paper is summarized as follows: (i) a lossless LZMA compression algorithm is used to compress labeled WSN data; (ii) two labeled WSN datasets (temperature and humidity), covering both single hop and multi hop communication, are used; and (iii) LZMA results are compared with 5 well-known compression algorithms, namely Huffman coding, AC, LZW, BWT and the Deflate algorithm, in terms of Compression Ratio (CR), Compression Factor (CF) and Bits per character (BPC).

1.2 Organization of this paper

The rest of the paper is organized as follows: Section 2 reviews classical DC techniques and anomaly detection techniques in WSN. Section 3 presents the LZMA compression algorithm for labeled data in WSN. Section 4 describes the performance evaluation methodology for the single hop and multi hop scenarios. Section 5 discusses the results, and Section 6 concludes with the highlighted contributions, future work, and recommendations.

2. Related Work

Energy efficiency is the major design issue in WSN. Clustering and routing are the most widely used energy-efficient techniques [13]; numerous clustering and routing techniques have been developed and are surveyed in the literature [14], [15]. Data compression is an alternative way to achieve energy efficiency, and DC techniques for WSN have been presented in [16]. The popular coding methods are Huffman coding, Arithmetic coding, Lempel Ziv coding, the Burrows-Wheeler transform, RLE, and scalar and vector quantization.

Huffman coding [17] is the most popular coding technique and effectively compresses data in almost all file formats. It is a type of optimal prefix code widely employed in lossless data compression. It is based on two observations: (1) in an optimum code, frequently occurring symbols are mapped to shorter code words than symbols that appear less frequently; and (2) in an optimum code, the two least frequent symbols have code words of the same length. The basic idea is to assign variable-length codes to input characters depending upon their frequency of occurrence; the output is a variable-length code table for coding the source symbols. The code is uniquely decodable, and the method consists of two steps: constructing a Huffman tree from the input sequence and traversing the tree to assign codes to characters. Huffman coding remains popular because of its simple implementation, fast compression and lack of patent coverage. It is commonly used in text compression.

AC [18] is another important coding technique for generating variable-length codes. It is superior to Huffman coding in various aspects and is highly useful in situations where the source contains a small alphabet with skewed probabilities. When a string is encoded using arithmetic

coding, frequently occurring symbols are coded with fewer bits than rarely occurring symbols. Arithmetic coding converts the input data into a floating point number in the range of 0 to 1. The algorithm works by dividing the interval from 0 to 1 into segments whose lengths are proportional to the probabilities of the symbols; the output for each successive symbol is then identified within the respective segment. It is harder to implement than other methods. There are two main variants, namely Adaptive Arithmetic Coding and Binary Arithmetic Coding. A benefit of arithmetic coding over Huffman coding is the capability to separate the modeling and coding aspects of the compression approach. It is used in image, audio and video compression.

Dictionary-based coding approaches are useful in situations where the data to be compressed contains repeated patterns. Such a coder maintains a dictionary of frequently occurring patterns; when a pattern occurs in the input sequence, it is coded with an index into the dictionary, and when the pattern is not in the dictionary, it is coded with a less efficient fallback approach. The Lempel-Ziv (LZ) algorithm is a dictionary-based coding algorithm commonly used in lossless file compression, widely adopted because of its adaptability to various file formats. It looks for frequently occurring patterns and replaces them with a single symbol, maintaining a dictionary of these patterns whose length is bounded by a fixed value. This method is most effective for larger files and less effective for smaller files, since for small files the dictionary can become larger than the original file. The two main versions of LZ were developed by Ziv and Lempel in two separate papers in 1977 and 1978, and they are named LZ77 [19] and LZ78 [20]. These algorithms differ significantly in how they search for and find matches. The LZ77 algorithm uses a sliding window and searches for matches within a predetermined distance back from the present position; gzip, ZIP, and V.42bis use LZ77. The LZ78 algorithm follows a more conservative approach of appending strings to the dictionary.

LZW, developed by Terry Welch in 1984 [21], is an enhanced version of LZ77 and LZ78. The encoder constructs an adaptive dictionary to characterize variable-length strings with no prior knowledge of the input, and the decoder dynamically constructs the same dictionary from the received codes. In text data, some symbols occur very frequently; the encoder stores these symbols and maps each to one code. Typically, an LZW code is 12 bits long (4096 codes). The first 256 entries (0-255) represent the ASCII codes of individual characters, and the remaining 3840 codes (256-4095) are defined by the encoder to represent variable-length strings. UNIX compress, GIF images, TIFF images and other file formats use LZW coding.

BWT [22], also known as block-sorting compression, rearranges the character string into runs of identical characters. It is used together with two techniques, the move-to-front transform and RLE, and compresses data easily in situations where the string consists of runs of repeated characters. The most important feature of BWT is its reversibility: the transform is fully reversible and does not require any extra bits. BWT is a "free" method of improving the efficiency of text compression algorithms at the cost of some additional computation, and it is used in bzip2. A simpler lossless data compression technique is RLE [23]. It represents sequences of identical symbols as runs and the remaining data as non-runs; a run is stored as two parts, the data value and a count, instead of the original run. It is effective for data with high redundancy.

3. The LZMA Algorithm on Anomaly Labeled Data

LZMA is a modified version of the Lempel-Ziv algorithm designed to achieve a higher CR [24]. It is a lossless data compression algorithm based on the principle of dictionary-based encoding. LZMA utilizes a complex data structure to encode one bit at a time. It uses a variable-length dictionary (maximum size 4 GB) and is mainly used to encode unknown data streams. It is capable of compressing data generated at a rate of 10-20 Mbps in a real-time environment. Though it uses a larger dictionary, it still achieves the same decompression speed as other compression algorithms. The underlying LZ77 algorithm encodes a byte sequence by reference to the existing contents instead of the original data; when no identical byte sequence is available in the existing contents, the address and sequence length are set to '0' and the new symbol is encoded literally. LZ77 also employs a dynamic dictionary to compress unknown data with the help of the sliding window concept. LZMA extends the LZ77 algorithm by adding a delta filter and a range encoder. The delta filter alters the input data stream so that it compresses more effectively under the sliding window: it stores or transmits data as differences between successive data items instead of the complete values. The output of the delta encoding for the first byte is the byte itself, and each subsequent byte is stored as the difference between the current and its

previous byte. For continuously varying real-time data, delta encoding makes the sliding dictionary more efficient [18, 19]. For example, consider the sample input sequence 2, 3, 4, 6, 7, 9, 8, 7, 5, 3, 4. After delta encoding, the output sequence is 2, 1, 1, 2, 1, 2, -1, -1, -2, -2, 1. The number of distinct symbols in the input sequence is 8, while the number of distinct symbols in the output sequence is only 4.

Static and adaptive dictionaries are the commonly used dictionary types. A static dictionary uses fixed entries and constants based on the application and the text, while an adaptive dictionary takes its entries from the text and is generated at run time. A search buffer is employed as the dictionary, and the buffer size is chosen based on implementation parameters; patterns in the text are assumed to occur within the range of the search buffer. The offset and length of a match are individually encoded, and a bit-mask is also separately encoded. Using an appropriate data structure for the buffer decreases the search time for the longest matches. Sliding-dictionary encoding is comparatively more tedious than decoding, as it requires identifying the longest match. The range encoder encodes all the symbols of the message into a single number to attain a better CR; it deals efficiently with probabilities that are not exact powers of two. The steps involved in range encoding are listed below.

• Start with a large-enough range of integers and a probability estimation for the symbols.
• Divide the initial range into sub-ranges whose sizes are proportional to the probability of the symbol they represent.
• Encode each symbol of the message by narrowing the current range down to just the sub-range that corresponds to the next symbol to be encoded.
• The decoder must use the same probability estimation as the encoder, which can either be sent in advance or derived from already transferred data [20].

Fig. 1. Workflow of LZMA on anomaly labeled data in WSN

The overall operation is shown in Fig. 1. LZMA compression is used to compress real-time data generated rapidly. Initially, the sensor nodes sense the physical environment. Each sensed value is tested for anomalies, and a label value is appended by the sensor to every individual sensed value. The label '1' is appended when the sensed data differs from the actual data, i.e. an abnormal value is found; likewise, the label '0' is appended when the sensed data shows no deviation from the actual data, i.e. a normal value is found. The sensor node appends the label to the sensed data and then performs compression. The sensor node runs the LZMA algorithm, which efficiently compresses the labeled data irrespective of the label value, using the dictionary, the sliding window concept and the range encoder. The compressed data is then transmitted to the BS, which receives it and performs the decompression process. As LZMA is a lossless compression technique, the reconstructed data is an exact replica of the original data, with no loss of information.

4. Performance Evaluation

To ensure the effectiveness of the LZMA algorithm in compressing labeled data, its lossless compression performance is compared with 5 different, well-known compression algorithms, namely Huffman coding, AC, LZW, BWT coding and the Deflate algorithm.

4.1 Metrics

In this section, the metrics used to analyze compression performance are discussed: CR, CF, and BPC.

Compression Ratio (CR): CR is defined as the ratio of the number of bits in the compressed data to the number of bits in the uncompressed data, as given in Eq. (1). A value of CR of 0.62 indicates that the data consumes 62% of its

original size after compression. A value of CR greater than 1 indicates negative compression, i.e. the compressed data size is larger than the original data size.

CR = (No. of bits in compressed data) / (No. of bits in original data)    (1)

Compression Factor (CF): CF is a performance metric which is the inverse of the compression ratio. A higher value of CF indicates effective compression, while a lower value indicates expansion.

CF = (No. of bits in original data) / (No. of bits in compressed data)    (2)

Bits per character (BPC): BPC is the average number of bits required to represent one character of the input data after compression. It is defined as the ratio of the number of bits in the output sequence to the total number of characters in the input sequence.

BPC = (No. of bits in compressed data) / (No. of characters in original data)    (3)

4.2 Dataset Description

For experimentation, two publicly available labeled WSN datasets are used. The labeled WSN dataset consists of temperature, humidity and label values gathered from single-hop and multi-hop scenarios using TelosB motes [25]. The data is collected for 6 hours at a time interval of 5 seconds. The dataset contains labeled data in which the value '0' indicates an actual value and '1' indicates an abnormal value. The data were collected in both indoor and outdoor environments. The description of the labeled WSN dataset is tabulated in Table 1.

Table 1 Dataset Description

5. Results And Discussion

To highlight the good characteristics of LZMA-based labeled data compression, it is compared with 5 state-of-the-art approaches. A direct comparison is made with the results of existing methods using the same set of 2 datasets. Table 2 summarizes the experimental results of the compression algorithms based on three compression metrics: CR, CF and BPC. As is evident from Table 2, the overall compression performance of the LZMA algorithm is significantly better than that of the other algorithms on both datasets. It is observed that the LZMA algorithm achieves almost equal compression in both single and multi-hop scenarios. It is also noted that Huffman coding produces poorer results than the other algorithms. The compression performance on the 8 different dataset files reveals the interesting fact that the compression algorithms perform very differently depending on the nature of the applied dataset. The existing methods, especially Deflate and BWT, achieve almost similar compression performance.

Likewise, Huffman and Arithmetic coding also produce approximately equal compression performance. This is due to the fact that the efficiency of an Arithmetic code is always better than, or at least identical to, that of a Huffman code. Similar to Huffman coding, Arithmetic coding also tries to estimate the probability of occurrence of particular symbols and to optimize the length of the necessary code, and it achieves an optimum which corresponds exactly to the theoretical prescriptions of information theory. A minor degradation results from inaccuracies caused by the correction operations for the interval division.

On the other side, Huffman coding incurs rounding errors because its code lengths are limited to multiples of a bit; its deviation from the theoretical value is larger than the inaccuracy of arithmetic coding. Though LZW achieves better compression than Huffman and arithmetic coding, it fails to achieve better compression than Deflate and BWT.
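The comparisons above are expressed in the metrics of Section 4.1. As a minimal illustration, Python's standard-library lzma module can reproduce the three measures on a labeled stream; the readings below are invented for illustration and are not drawn from the paper's TelosB dataset:

```python
import lzma

# Hypothetical labeled readings: each line is "value,label",
# where '0' marks a normal value and '1' an anomalous one.
readings = [(22.1, 0), (22.2, 0), (22.2, 0), (22.3, 0), (48.9, 1), (22.4, 0)] * 200
original = "\n".join(f"{v:.1f},{label}" for v, label in readings).encode("ascii")

compressed = lzma.compress(original)

# Metrics of Section 4.1; characters coincide with bytes for this ASCII stream.
cr = len(compressed) / len(original)        # Eq. (1): compressed bits / original bits
cf = len(original) / len(compressed)        # Eq. (2): inverse of CR
bpc = 8 * len(compressed) / len(original)   # Eq. (3): output bits per input character

print(f"CR = {cr:.4f}  CF = {cf:.2f}  BPC = {bpc:.3f}")

# Lossless: decompression yields an exact replica of the original.
assert lzma.decompress(compressed) == original
```

Note that BPC = 8 × CR whenever each input character occupies one byte, which is why a CR near 0.104 and a bit rate near 0.839 BPC are mutually consistent.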

Table 2 Comparison results of LZMA with existing methods
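A comparison of the kind summarized in Table 2 can be sketched with Python's standard-library codecs: zlib implements DEFLATE, bz2 is BWT-based (as in bzip2), and lzma implements LZMA. The data here is synthetic, so the resulting ratios illustrate the procedure rather than reproduce the figures of Table 2:

```python
import bz2
import lzma
import zlib

# Synthetic labeled stream (hypothetical, not the paper's dataset):
# "value,label" lines with '0' = normal and '1' = anomaly.
rows = [(21.9 + 0.1 * (i % 5), int(i % 97 == 0)) for i in range(2000)]
data = "\n".join(f"{v:.1f},{label}" for v, label in rows).encode("ascii")

codecs = {
    "Deflate (zlib)": zlib.compress,
    "BWT-based (bz2)": bz2.compress,
    "LZMA (lzma)": lzma.compress,
}

# CR per codec: compressed size over original size, as in Eq. (1).
results = {name: len(compress(data)) / len(data) for name, compress in codecs.items()}

for name, cr in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} CR = {cr:.4f}")
```

On highly repetitive labeled data such as this, all three codecs compress well below a CR of 1, and the ranking between them depends on the dataset, mirroring the paper's observation that performance varies with the nature of the applied data.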

LZW works well in situations where the level of redundancy is high. When the dictionary size is increased, the number of bits required for indexing also increases; this limitation of LZW makes its results lag behind Deflate, BWT, and LZMA.

Overall, LZMA yields more effective compression than the existing methods. Generally, dictionary-based coding approaches are useful in situations where the data to be compressed contains repeated patterns. As LZW is a dictionary-based method, it produces good results for the labeled WSN dataset because of the repeated occurrence of temperature and humidity values. LZMA extends this dictionary-based scheme with a range encoding technique, which enables significantly higher compression than LZW. Interestingly, LZMA requires a minimum bit rate of 0.749 BPC for the single hop indoor data and a maximum bit rate of 0.922 BPC for the single hop outdoor 2 data. It is also noted that Huffman coding and Arithmetic coding achieve poor performance, with average bit rates of 3.84 BPC and 3.659 BPC respectively. It is observed that LZMA achieves an average CR of 0.104 at a bit rate of 0.839 BPC.

6. Conclusion

This paper employs the Lempel Ziv Markov chain Algorithm (LZMA), a lossless compression technique, to compress labeled WSN data. The sensor node uses the LZMA algorithm to compress the labeled data and transmits it to the BS via single-hop or multi-hop communication. The proposed method enhances the network lifetime by reducing the amount of data transmission, while at the same time allowing anomalous data to be easily identified. The performance of LZMA is compared with state-of-the-art approaches, namely Arithmetic coding, Huffman coding, BWT, LZW and the Deflate algorithm. Comparing the compression performance of the LZMA method with the existing methods, LZMA achieves significantly better compression, with an average CR of 0.104 at a bit rate of 0.839 BPC. In future, this work can be extended to compress real-time data from several applications.

References

[1] D. Estrin, J. Heidemann, S. Kumar, and M. Rey, "Next Century Challenges: Scalable Coordination in Sensor Networks," in Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), 1999, pp. 263–270.

[2] K. Sohraby, D. Minoli, and T. Znati, Wireless Sensor Networks: Technology, Protocols, and Applications. Wiley, 2007.

[3] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "Wireless sensor networks: a survey," Comput. Networks, vol. 38, no. 4, pp. 393–422, 2002.

[4] C. S. Raghavendra, K. M. Sivalingam, and T. Znati, Wireless Sensor Networks, 1st ed. Springer US, 2004.

[5] C. A. Smith, "A Survey of Various Data Compression Techniques," Int. J. of Recent Technol. Eng., vol. 2, no. 1, pp. 1–20, 2010.

[6] D. Salomon, Data Compression: The Complete Reference, 4th ed. Springer, 2007.

[7] K. Sayood, Introduction to Data Compression. Morgan Kaufmann, 2006.

[8] S. W. Drost and N. Bourbakis, "A Hybrid system for real-time lossless image compression," Microprocess. Microsyst., vol. 25, no. 1, pp. 19–31, 2001.

[9] N. Kimura and S. Latifi, "A Survey on Data Compression in Wireless Sensor Networks," in Proc. Int. Conf. Information Technology: Coding and Computing, 2005, pp. 16–21.

[10] J. W. Branch, B. K. Szymanski, C. Giannella, R. Wolff, and H. Kargupta, "In-network outlier detection in wireless sensor networks," in Proc. of ICDCS, 2006.

[11] S. Rajasegarar, J. C. Bezdek, C. Leckie, and M. Palaniswami, "Elliptical anomalies in wireless sensor networks," ACM Trans. Sens. Networks, vol. 6, no. 1, 2009.

[12] M. Moshtaghi, S. Rajasegarar, C. Leckie, and S. Karunasekera, "Anomaly detection by clustering ellipsoids in wireless sensor networks," in Proc. of the ISSNIP, 2009.

[13] W. R. Heinzelman, A. Chandrakasan, and H. Balakrishnan, "Energy-efficient communication protocol for wireless microsensor networks," in Proc. 33rd Annu. Hawaii Int. Conf. Syst. Sci., 2000, pp. 3005–3014.

[14] Sariga and P. Sujatha, "A survey on unequal clustering protocols in Wireless Sensor Networks," J. King Saud Univ. - Comput. Inf. Sci., 2017.

[15] M. M. Afsar and M. H. Tayarani, "Clustering in sensor networks: a literature survey," J. Netw. Comput. Appl., vol. 46, pp. 198–226, 2014.

[16] T. Srisooksai, K. Keamarungsi, P. Lamsrichan, and K. Araki, "Practical data compression in wireless sensor networks: A survey," J. Netw. Comput. Appl., vol. 35, no. 1, pp. 37–59, 2012.

[17] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proc. IRE, vol. 40, no. 9, pp. 1098–1102, 1952.

[18] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, no. 6, pp. 520–540, 1987.

[19] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337–343, 1977.

[20] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Trans. Inf. Theory, vol. 24, no. 5, pp. 530–536, 1978.

[21] T. A. Welch, "A Technique for High-Performance Data Compression," IEEE Computer, vol. 17, no. 6, pp. 8–19, 1984.

[22] M. Burrows and D. Wheeler, "A block-sorting lossless data compression algorithm," Digital SRC Research Report 124, 1994.

[23] J. Capon, "A probabilistic model for run-length coding of pictures," IRE Trans. Inf. Theory, vol. 5, no. 4, pp. 157–163, 1959.

[24] Z. Tu and S. Zhang, "A Novel Implementation of JPEG 2000 Lossless Coding Based on LZMA," in Proceedings of the Sixth IEEE International Conference on Computer and Information Technology, 2006.

[25] S. Suthaharan, M. Alzahrani, S. Rajasegarar, C. Leckie, and M. Palaniswami, "Labelled Data Collection for Anomaly Detection in Wireless Sensor Networks," pp. 269–274, 2010.
