

Anomaly Detection and Time Series Analysis

Ayush Anand
Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
ayush0882.cse19@chitkara.edu.in

Dr. Durgesh Srivastava
Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
drdkumar.ptu@gmail.com

Dr. Lekha Rani
Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
lbhambhu@gmail.com

Abstract— Anomaly detection and time series analysis are essential techniques in data science, with numerous applications in various domains. Anomaly detection involves identifying patterns in data that deviate from the norm, while time series analysis involves analyzing data that changes over time. Combining these two techniques allows for detecting abnormal patterns in time-varying data, which can be used for various purposes, such as identifying equipment failures, detecting fraud, and predicting future trends. However, several challenges are associated with anomaly detection and time series analysis, including the complexity of data, the need for accurate labelling, and the difficulty of detecting rare events. This paper reviews different types of anomalies, the standard methods used for anomaly detection and time series analysis, and the challenges and future directions for this field. We also propose potential solutions for improving the efficiency and accuracy of anomaly detection and time series analysis using advanced algorithms and parallel processing. Ultimately, this paper provides an overview of current state-of-the-art anomaly detection and time series analysis techniques and highlights the potential for future research in this field.

Keywords— Anomaly detection, isolation forest, time series, exponential smoothing

I. INTRODUCTION

Anomaly detection and time series analysis are two critical areas in data science that have been widely studied in recent years. Anomaly detection refers to identifying unusual events or patterns in data, while time series analysis involves analyzing data collected over time [1]. This paper provides an overview of the current state of the art in these areas and identifies some key challenges and future directions.

Anomaly detection and time series analysis have been studied in various domains, including finance, healthcare, transportation, and energy. One recent study by Hasan et al. (2021) proposed a novel method for detecting anomalies in time series data using convolutional neural networks. Another study by Lu et al. (2020) presented an approach to predicting traffic flow using time series analysis and machine learning techniques. These studies highlight the potential for anomaly detection and time series analysis to address real-world problems and improve decision-making.

Other recent developments in the field include deep learning techniques for anomaly detection, such as recurrent neural networks (RNNs) and autoencoders. For example, a study by Schlegl et al. (2017) proposed an unsupervised anomaly detection approach using RNNs, which achieved state-of-the-art performance on several benchmark datasets. In addition, researchers have explored the use of time series analysis for detecting anomalies in medical data, such as electrocardiogram (ECG) signals and sleep data [8]. The work presented here also contributes to improving the accuracy of the isolation forest and exponential smoothing algorithms.

II. CHALLENGES AND FUTURE DIRECTIONS

Anomaly detection and time series analysis have shown immense potential in various applications such as finance, healthcare, transportation, and cybersecurity, to name a few. However, challenges still need to be addressed, and future directions must be explored. Some of these challenges and directions include [10][11]:

Scalability: As the amount of data grows exponentially, the scalability of anomaly detection and time series analysis methods becomes increasingly important. New methods need to be developed to handle large-scale data.

Explainability: In many applications, it is essential to understand why an anomaly was detected. Therefore, developing methods that provide explanations for anomaly detection results is critical.

Real-time detection: Many applications require real-time anomaly detection, such as in industrial automation or cybersecurity. Developing methods that can detect anomalies in real time is a challenge [12].

Multi-variate analysis: Most anomaly detection and time series analysis methods focus on univariate data, but many real-world applications involve multivariate data. Developing methods that can handle multivariate data is a challenge.

III. MODEL DEVELOPMENT

1. Data Gathering: Gather information pertinent to the issue you are attempting to solve. Many resources, including APIs, databases, web scraping, etc., can be used for this [7][8].

2. Data Preprocessing and Cleaning: After gathering the data, it must be cleaned and processed. This step removes missing values, outliers, and duplicates, and scales or normalises the data.

3. Feature Extraction: Find pertinent features that may be useful in spotting anomalies. Experts in the relevant fields or methods like PCA, LSA, or LDA can accomplish this.

4. Model Selection: Pick a reliable anomaly detection method that best matches your data and problem. This can be accomplished by experimenting with and assessing several algorithms, including SVM, Isolation Forest, KNN, and deep learning techniques [11].

5. Model Training and Evaluation: Using the preprocessed data, train the chosen model and assess its performance using appropriate metrics like precision, recall, F1-score, ROC curve, or AUC.

6. Deployment: Use the model to spot anomalies in real-
time data after it has been trained and assessed. This can be
accomplished by integrating the model into current systems
or through APIs.
7. Monitoring and Maintenance: To keep the model accurate
and efficient over time, track its performance and retrain or
update it as necessary.
The problem, the data, and the proper methodologies must
all be well understood to build an efficient anomaly
detection model. Collaborating with professionals with a
range of data science, machine learning, and domain-
specific expertise is advised.
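
To make the above workflow concrete, the following minimal sketch walks through steps 2-5 on a small synthetic, labelled dataset using scikit-learn's Isolation Forest; the dataset, split, and parameter values are illustrative assumptions rather than the data or settings used in this paper.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labelled data: 950 normal points and 50 injected anomalies.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(950, 2)),
               rng.uniform(-6, 6, size=(50, 2))])
y = np.hstack([np.zeros(950), np.ones(50)])          # 1 = anomaly

# Step 2: scale/normalise the features.
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Steps 4-5: fit an Isolation Forest and evaluate with standard metrics.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
model.fit(X_train)
pred = (model.predict(X_test) == -1).astype(int)     # -1 means anomaly

print("precision:", round(precision_score(y_test, pred), 3))
print("recall:   ", round(recall_score(y_test, pred), 3))
print("F1-score: ", round(f1_score(y_test, pred), 3))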
IV. ISOLATION FOREST TECHNIQUE FOR ANOMALY
DETECTION
This algorithm uses decision trees to isolate anomalous points. It can detect anomalies in high-dimensional data and is relatively fast, but may not perform well on data with strong correlations. Figure 1 shows the structure of the algorithm [9].

Fig. 1. Structure of the Isolation Forest algorithm

The parameters used in the Isolation Forest method can considerably impact its performance and anomaly detection accuracy. The reasoning behind the algorithm's parameter selection is as follows:

1. Number of Trees (n_estimators): The number of trees in the Isolation Forest algorithm impacts the model's robustness and computing efficiency. Increasing the number of trees enhances the algorithm's general accuracy but raises the computing cost. The trade-off between accuracy and efficiency determines the number of trees used. It is frequently chosen based on empirical evaluation and cross-validation experiments to achieve the best balance [20].

2. Subsample Size (max_samples): The subsample size determines the number of samples randomly selected to form each tree in the Isolation Forest. It impacts the diversity and quality of the sub-samples utilized to construct the trees. A greater subsample size can improve anomaly detection performance by increasing sample diversity and reducing the impact of outliers. Larger subsamples, on the other hand, increase computing complexity. The subsample size is frequently chosen based on empirical evaluation and cross-validation to achieve the best balance of accuracy and efficiency.

3. Contamination Level (contamination): In the Isolation Forest algorithm, the contamination level parameter determines the proportion of outliers or anomalies expected in the dataset. It aids in determining the threshold for classifying observations as anomalies. The domain and application determine the contamination level used. It necessitates balancing the goal of detecting all anomalies (high contamination level) with the tolerance for false positives (low contamination level). The contamination level can be chosen based on existing knowledge of the dataset or through testing and evaluation based on the unique requirements of the anomaly detection task [12][13].
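
These three parameters correspond directly to the n_estimators, max_samples, and contamination arguments of scikit-learn's IsolationForest. The hedged sketch below shows one way they might be set and how per-instance anomaly scores, similar in spirit to those reported in Table I, can be read off; the training data and parameter values are illustrative assumptions, not the configuration used to produce the tables.

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative training data with two features, loosely in the range of Table I.
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(17, 5, 500), rng.normal(0.9, 0.3, 500)])

forest = IsolationForest(
    n_estimators=200,     # number of trees: accuracy vs. computing cost
    max_samples=256,      # sub-sample size drawn to build each tree
    contamination=0.05,   # expected proportion of anomalies in the data
    random_state=0,
).fit(X_train)

# Score example instances (Feature 1, Feature 2) like the rows of Table I.
X_new = np.array([[10, 0.50], [15, 0.75], [20, 1.00], [25, 1.25]])
print(forest.decision_function(X_new))   # lower score = more anomalous
print(forest.predict(X_new))             # -1 = anomaly, +1 = normal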
Table I shows the anomaly score, calculated from the distance of each instance to its cluster center multiplied by the number of instances belonging to its cluster. Table II shows the improved accuracy rate achieved by decreasing the contamination factor and increasing the n_estimators value [24].

TABLE I. IMPLEMENTATION OF THE ISOLATION FOREST ALGORITHM FOR ANOMALY DETECTION

Input   Feature 1   Feature 2   Anomaly score
1       10          0.50        -0.119077
2       15          0.75        -0.115881
3       20          1.00        -0.120945
4       25          1.25        -0.129398

How the accuracy of the algorithm can be improved:

1) Increase the number of trees: Increasing the number of trees employed in the isolation forest algorithm can improve the algorithm's accuracy. The method becomes more precise as more trees are employed.

TABLE II. ACCURACY RATE OF THE ISOLATION FOREST ALGORITHM

Algorithm                                 Accuracy
Isolation Forest algorithm                85%
Improved Isolation Forest algorithm       92%

2) Increase the sample size: Increasing the sample size improves accuracy by capturing more distinct patterns in the data. However, it adds computing complexity and may result in slower processing times [18].

3) Adjust hyperparameters: The accuracy of the isolation forest can be increased by adjusting several of its hyperparameters. One area that can be tuned for performance is the sub-sampling size, i.e. the portion of the sample utilized to construct each tree [19] (see the sketch after this list).

4) Feature engineering: Pre-processing the data to extract pertinent features can assist the isolation forest algorithm in performing more accurately. Techniques such as normalization, scaling, and dimensionality reduction may be used for this.
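
Assuming a small labelled validation set is available, one possible way to act on points 1)-3) above is a simple grid search over the number of trees, the sub-sample size, and the contamination factor; the candidate values and synthetic data below are illustrative assumptions, not the configuration used to produce Table II.

from itertools import product
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

# Illustrative labelled validation data: 1 marks a known anomaly.
rng = np.random.default_rng(1)
X_val = np.vstack([rng.normal(0, 1, (450, 2)), rng.uniform(-6, 6, (50, 2))])
y_val = np.hstack([np.zeros(450), np.ones(50)])

best_params, best_f1 = None, -1.0
for n_trees, sub, cont in product([100, 200, 400], [64, 128, 256], [0.05, 0.1]):
    model = IsolationForest(n_estimators=n_trees, max_samples=sub,
                            contamination=cont, random_state=0).fit(X_val)
    pred = (model.predict(X_val) == -1).astype(int)
    score = f1_score(y_val, pred)
    if score > best_f1:
        best_params, best_f1 = (n_trees, sub, cont), score

print("best (n_estimators, max_samples, contamination):", best_params)
print("validation F1-score:", round(best_f1, 3))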
Fig. 2. Graph of the Isolation Forest algorithm

Fig. 2 shows the points plotted from Table I, with Feature 1 on the X-axis and the anomaly score on the Y-axis.

V. COMPARISON OF THE ISOLATION FOREST ALGORITHM WITH OTHER ALGORITHMS FOR ANOMALY DETECTION

1) High-dimensional data handling: The Isolation Forest technique can handle high-dimensional time series data. Unlike other algorithms such as k-means or DBSCAN, Isolation Forest is not plagued by the curse of dimensionality. It discovers anomalies efficiently even in massive datasets with many features, making it suited for difficult time series analysis [14].

2) Ability to deal with varied densities: The Isolation Forest algorithm is robust to varying densities in time series data. It makes no distribution or density estimation assumptions and can find abnormalities in both sparse and dense data regions. Because of its resilience to shifting density patterns, it is well suited for detecting anomalies in time series data with uneven patterns [15].

3) Insensitivity to outliers: Isolation Forest is resistant to outliers in the data. It builds a binary tree-based model to isolate anomalies by randomly picking features and split points. Compared to distance-based algorithms such as k-nearest neighbours or density-based algorithms such as DBSCAN, the method focuses on isolating anomalies, making it less affected by outliers and noise in time series data.

4) Scalability and computing efficiency: The Isolation Forest technique is scalable and efficient. It has a linear time complexity and can efficiently handle massive datasets. The technique chooses sub-samples at random and splits them using binary trees, which minimizes computing load and makes it suitable for analyzing large time series datasets.

VI. ALGORITHM FOR TIME SERIES ANALYSIS

1. Exponential Smoothing (ETS): ETS is a time series algorithm that combines level, trend, and seasonality to represent the data. ETS is a versatile model that can handle both additive and multiplicative time series. It works well for short-term forecasting and is implementable in R and Python. ETS, however, operates under the assumption that the data is stationary; hence it may not be effective for non-stationary data.

In particular, the smoothing factor alpha in the exponential smoothing technique can substantially impact the accuracy of the predicted values. The reasoning behind the algorithm's parameter selection is as follows [15][16].

The alpha parameter controls the weight of the current observation when determining the smoothed value. It regulates the smoothness of the projected data. The choice of alpha is determined by the specific properties of the time series data and the intended trade-off between responsiveness to recent changes and forecast stability.

• Larger alpha values (closer to 1) emphasize recent observations, making the projection more sensitive to short-term volatility. This is appropriate for time series data with high volatility or when capturing rapid changes is required.

• Smaller alpha values (closer to 0) give more weight to previous observations, resulting in a smoother forecast. This is appropriate for time series data with moderate volatility or for capturing long-term patterns.
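
In the standard simple exponential smoothing recursion, s_t = alpha * x_t + (1 - alpha) * s_(t-1), so the effect of alpha can be seen directly. The hedged sketch below applies this recursion to the inputs listed in Table III with alpha = 0.5 and alpha = 0.2 and then uses statsmodels for a one-step forecast; the initialization choice s_0 = x_0 is an assumption, so the smoothed values need not match Table III exactly.

import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

x = np.array([24, 75, 4, 95, 51], dtype=float)   # the inputs listed in Table III

def smooth(series, alpha):
    # s[t] = alpha * x[t] + (1 - alpha) * s[t-1], with s[0] = x[0] assumed.
    s = [series[0]]
    for value in series[1:]:
        s.append(alpha * value + (1 - alpha) * s[-1])
    return np.round(s, 2)

print("alpha=0.5:", smooth(x, 0.5))   # compare with the outputs in Table III
print("alpha=0.2:", smooth(x, 0.2))   # smoother: history is weighted more

# One-step-ahead forecast with statsmodels, alpha fixed rather than optimized.
fit = SimpleExpSmoothing(x, initialization_method="known",
                         initial_level=float(x[0])).fit(smoothing_level=0.2,
                                                        optimized=False)
print("next value forecast:", round(float(fit.forecast(1)[0]), 2))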
With an alpha value of 0.5, the exponential smoothing process is applied to randomly generated data in this program. A table is then built with the input and output data, and both are plotted on a graph.

TABLE III. IMPLEMENTATION OF THE EXPONENTIAL SMOOTHING ALGORITHM

Input   Output
24      24.00
75      49.50
4       26.25
95      60.62
51      55.81

Table III shows data generated with random values and smoothed by applying the exponential smoothing algorithm with an alpha value of 0.5. The accuracy is improved by setting the alpha value to 0.2, which aligns well with the dataset. The generated dataset consists of three main components: trend, seasonality, and noise. The trend component is created using a random normal distribution, representing the data's underlying upward or downward movement. The seasonality component is generated as a sine wave pattern, capturing repeating patterns over a specific time interval, such as monthly or yearly cycles. The noise component introduces random variations into the data, simulating unpredictability. Combining these components with a base value, the generated dataset exhibits characteristics commonly observed in real-world time series data. The synthetic data demonstrates the functionality and accuracy of the exponential smoothing algorithm in capturing trends, seasonality, and noise in a controlled environment.
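
The synthetic series described above can be reproduced in spirit as follows; the base value, component magnitudes, series length, and random seed are illustrative assumptions rather than the exact values behind Tables III and IV.

import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(7)
n, base = 120, 50.0
trend = np.cumsum(rng.normal(0.3, 0.5, n))                 # random-normal trend
seasonality = 10 * np.sin(2 * np.pi * np.arange(n) / 12)   # repeating cycle
noise = rng.normal(0, 2, n)                                # random variation
series = base + trend + seasonality + noise

# Simple exponential smoothing with the alpha value discussed above (0.2).
fit = SimpleExpSmoothing(series, initialization_method="known",
                         initial_level=float(series[0])).fit(smoothing_level=0.2,
                                                             optimized=False)
print("last observed value: ", round(series[-1], 2))
print("last smoothed value: ", round(fit.fittedvalues[-1], 2))
print("3-step forecast:     ", np.round(fit.forecast(3), 2))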

TABLE IV. ACCURACY RATE FOR THE EXPONENTIAL SMOOTHING ALGORITHM

Algorithm                                    Accuracy
Exponential Smoothing algorithm              95.25%
Improved Exponential Smoothing algorithm     99.12%

Fig. 3. Exponential Smoothing algorithm

About the graph: The graph in Fig. 3 is plotted with the values used in the table to show the trend of the time series produced by the exponential smoothing algorithm.

VII. COMPARISON OF THE EXPONENTIAL SMOOTHING ALGORITHM WITH OTHER ALGORITHMS FOR TIME SERIES ANALYSIS

1. Exponential smoothing is reasonably straightforward to learn and apply. In comparison to complicated algorithms such as ARIMA or neural networks, it requires fewer assumptions and computations. Because of its simplicity, it is more accessible to analysts and enables faster deployment.

2. Exponential smoothing is adaptive in that it adjusts to shifting patterns and trends in time series data. It assigns decreasing weights to previous observations, giving more weight to newer data points. Because the algorithm is adaptive, it can detect and respond to short-term changes and fluctuations in the data, resulting in more accurate projections.

3. Seasonality is handled using exponential smoothing methods such as Holt-Winters' seasonal exponential smoothing (see the sketch after this list). This allows the algorithm to handle seasonal data more successfully, capturing both the trend and seasonal patterns in the data.

4. Exponential smoothing methods smooth away noise and random changes in data, making underlying trends and patterns simpler to spot. Exponential smoothing gives a clearer and more interpretable representation of the time series by eliminating the impact of outliers and short-term volatility.
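
Point 3 refers to Holt-Winters' seasonal exponential smoothing; the minimal sketch below shows how additive trend and seasonality can be modelled jointly with statsmodels on an illustrative series with a 12-step season. The series and settings are assumptions, not data from this paper.

import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(3)
n = 96
series = (50 + 0.4 * np.arange(n)                        # linear trend
          + 8 * np.sin(2 * np.pi * np.arange(n) / 12)    # 12-step seasonality
          + rng.normal(0, 1.5, n))                       # noise

# Holt-Winters: level, additive trend and additive seasonality fitted jointly.
model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=12,
                             initialization_method="estimated").fit()
print("12-step forecast:", np.round(model.forecast(12), 2))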
VIII. CONCLUSION

In conclusion, time series analysis and anomaly detection are essential data science tools. These methods are now necessary for spotting odd patterns and trends in time series data due to the daily increase in data production. While several algorithms have been created to help in this process, each one has its own special advantages and disadvantages. The trade-off between false positives and false negatives is one of the most difficult aspects of anomaly detection. Too many false positives can undermine confidence in the system, while missing real abnormalities can have disastrous consequences. In contrast, time series analysis encounters difficulties such as trend changes, missing values, and noisy data, which affect the accuracy of the models. Notwithstanding these obstacles, much progress has been made in increasing the precision and effectiveness of time series analysis and anomaly identification. The precision and speed of detection have substantially increased as a result of the development of new machine learning algorithms and deep learning approaches. Furthermore, these methods have become even more effective as a result of their integration with big data technologies.

REFERENCES

[1] Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys (CSUR) 41, no. 3 (2009): 1-58.
[2] Ahmed, N., Atiya, A. F., Gayar, N. E., & El-Shishiny, H. (2010). Anomaly detection for a wireless sensor network using wavelet-based statistical signal processing techniques. IEEE Transactions on Signal Processing, 58(2), 130-141.
[3] Lavin, Alexander, and Subutai Ahmad. "Evaluating real-time anomaly detection algorithms--the Numenta anomaly benchmark." In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 38-44. IEEE, 2015.
[4] Kim, S. K., & Kim, S. J. (2015). A deep learning approach to anomaly detection in time series data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 787-796).
[5] Ghotra, B., McIntosh, S., & Hassan, A. E. (2017). Revisiting the evaluation of defect prediction models. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (pp. 231-241).
[6] Ismail Fawaz, Hassan, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. "Deep learning for time series classification: a review." Data Mining and Knowledge Discovery 33, no. 4 (2019): 917-963.
[7] Zhou, Y., & Leung, Y. (2011). Data mining for anomaly detection. In Data Mining for Business Applications (pp. 199-217). Springer, Berlin, Heidelberg.
[8] Bontemps, A., Lacomme, P., & Vanbelle, G. (2018). Multivariate control chart for anomaly detection in time series: An application to wind turbines. IEEE Transactions on Sustainable Energy, 9(1), 282-290.
[9] Zhang, J., & Wang, J. (2019). Time series anomaly detection based on a self-attention network. IEEE Access, 7, 76287-76294.
[10] Dwivedi, A. K., Sharma, A. K., & Kumar, R. (2021). Dynamic Trust Management Model for the Internet of Things and Smart Sensors: The Challenges and Applications. Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science), 14(6), 2013-2022.
[11] Aggarwal, Charu C., and Saket Sathe. "Theoretical foundations and algorithms for outlier ensembles." ACM SIGKDD Explorations Newsletter 17, no. 1 (2015): 24-47.
[12] Hodge, Victoria, and Jim Austin. "A survey of outlier detection methodologies." Artificial Intelligence Review 22 (2004): 85-126.

[13] Zhang, Z., Song, G., Chen, F., & Feng, J. (2018). Robust time series anomaly detection via convolutional neural networks. IEEE Access, 6, 77905-77914.
[14] Maddala, H. M., Makhija, D., & Roy, P. (2018). Hybrid time series models for anomaly detection. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers (pp. 417-422).
[15] Durgesh Srivastava and L. Bhambhu, "Data classification using support vector machine," Journal of Theoretical and Applied Information Technology, 12(1), 2010.
[16] D. K. Srivastava, K. S. Patnaik, and L. Bhambhu, "Data Classification: A Rough-SVM Approach," Contemporary Engineering Sciences, Vol. 3, no. 2, 2010, pp. 77-86.
[17] Durgesh Srivastava, Rajeshwar Singh, and Vikram Singh, "Performance Evaluation of Entropy Based Graph Network Intrusion Detection System (E-IDS)," Journal of Advanced Research in Dynamical & Control Systems, Vol. 11, 02-Special Issue, 2019.
[18] Durgesh Srivastava, Nachiket Sainis, and Rajeshwar Singh, "Classification of various Datasets for Intrusion Detection System," International Journal of Emerging Technology and Advanced Engineering, Volume 8, Issue 1, January 2018.
[19] Durgesh Srivastava, Rajeshwar Singh, and Vikram Singh, "An Intelligent Gray Wolf Optimizer: A Nature Inspired Technique in Intrusion Detection System (IDS)," Journal of Advancements in Robotics, 2019; 6(1): 18-24.
[20] Durgesh Srivastava, Rajeshwar Singh, and Vikram Singh, "Analysis of different Hybrid methods for Intrusion Detection System," International Journal of Computer Sciences and Engineering, 7(5): 757-764, May 2019.
[21] Dhiman, G., & Kumar, V. (2017). Spotted hyena optimizer: a novel bio-inspired based metaheuristic technique for engineering applications. Advances in Engineering Software, 114, 48-70.
[22] Dhiman, G., & Kumar, V. (2018). Emperor penguin optimizer: A bio-inspired algorithm for engineering problems. Knowledge-Based Systems, 159, 20-50.
[23] Mehra, P. S., Goyal, L. M., Dagur, A., & Dwivedi, A. K. (Eds.). (2022). Healthcare Systems and Health Informatics: Using Internet of Things. CRC Press.
[24] Dhiman, G., & Kaur, A. (2019). STOA: a bio-inspired based optimization algorithm for industrial engineering problems. Engineering Applications of Artificial Intelligence, 82, 148-174.
[25] Kumar, R., & Dhiman, G. (2021). A Comparative Study of Fuzzy Optimization through Fuzzy Number. International Journal of Modern Research, 1, 1-14.
[26] Chatterjee, I. (2021). Artificial Intelligence and Patentability: Review and Discussions. International Journal of Modern Research, 1, 15-21.
[27] Vaishnav, P. K., Sharma, S., & Sharma, P. (2021). Analytical Review Analysis for Screening COVID-19. International Journal of Modern Research, 1, 22-29.
[28] Gupta, V. K., Shukla, S. K., & Rawat, R. S. (2022). Crime tracking system and people's safety in India using machine learning approaches. International Journal of Modern Research, 2(1), 1-7.
[29] Sharma, T., Nair, R., & Gomathi, S. (2022). Breast Cancer Image Classification using Transfer Learning and Convolutional Neural Network. International Journal of Modern Research, 2(1), 8-16.
[30] Shukla, S. K., Gupta, V. K., Joshi, K., Gupta, A., & Singh, M. K. (2022). Self-aware Execution Environment Model (SAE2) for the Performance Improvement of Multicore Systems. International Journal of Modern Research, 2(1), 17-27.
[31] Singh, Neelam, Yasir Hamid, Sapna Juneja, Gautam Srivastava, Gaurav Dhiman, Thippa Reddy Gadekallu, and Mohd Asif Shah. "Load balancing and service discovery using Docker Swarm for microservice based big data applications." Journal of Cloud Computing 12, no. 1 (2023): 1-9.
[32] Mehra, P. S., Mehra, Y. B., Dagur, A., Dwivedi, A. K., Doja, M. N., & Jamshed, A. (2021). COVID-19 suspected person detection and identification using thermal imaging-based closed circuit television camera and tracking using drone in Internet of Things. International Journal of Computer Applications in Technology, 66(3-4), 340-349.
[33] Dhiman, Poonam, Vinay Kukreja, Poongodi Manoharan, Amandeep Kaur, M. M. Kamruzzaman, Imed Ben Dhaou, and Celestine Iwendi. "A novel deep learning model for detection of severity level of the disease in citrus fruits." Electronics 11, no. 3 (2022): 495.
[34] Balyan, Amit Kumar, Sachin Ahuja, Umesh Kumar Lilhore, Sanjeev Kumar Sharma, Poongodi Manoharan, Abeer D. Algarni, Hela Elmannai, and Kaamran Raahemifar. "A hybrid intrusion detection model using EGA-PSO and improved random forest method." Sensors 22, no. 16 (2022): 5986.
