Professional Documents
Culture Documents
Wavelet Transform and Unsupervised Machine Learning To Detect Insider Threat On Cloud File-Sharing
Wavelet Transform and Unsupervised Machine Learning To Detect Insider Threat On Cloud File-Sharing
Abstract—As increasingly more enterprises are deploying cloud services are offered. These include password protection, data
file-sharing services, this adds a new channel for potential insider encryption, data loss prevention, disaster recovery management,
threats to company data and IPs. In this paper, we introduce a and security analytics. However, most of these are rule-based,
two-stage machine learning system to detect anomalies. In the first and generate a lot of alerts with a high false positive rate. In
stage, we project the access logs of cloud file-sharing services onto addition, many of the alerts are related or duplicated, which
relationship graphs and use three complementary graph-based overwhelms the capacity of the security operations center (SOC)
unsupervised learning methods: OddBall, PageRank and Local to investigate all of them. In this study, we develop a two-stage
Outlier Factor (LOF) to generate outlier indicators. In the second machine learning system to automatically detect user anomalous
stage, we ensemble the outlier indicators and introduce the behavior based on the access logs of cloud file-sharing systems
discrete wavelet transform (DWT) method, and propose a
and feed a small selective group of users to the SOC for
procedure to use wavelet coefficients with the Haar wavelet
function to identify outliers for insider threat. The proposed
investigation. The architecture of our proposed machine learning
system has been deployed in a real business environment, and system is shown in Fig. 1.
demonstrated effectiveness by selected case studies.
I. INTRODUCTION
Enterprise file sharing, whether on the cloud (public, hybrid,
or private) or on-premises, enables individuals to share files
from mobile devices and PCs. Sharing can happen between Fig. 1. Architecture of machine learning system for cloud file-sharing outlier
people (for example, partners and customers) within or outside detection
the organization, as well as between applications. The offerings
enable modern user productivity and collaboration scenarios for The access logs include detailed file access information such
the creation of a digital workplace [Basso et al., 2016]. However, as uploads, down-loads, prospecting, viewing, and
cloud file-sharing creates a new channel for insider threats, modifications. On a daily basis, we parse the logs, extract the
including the leakage and exfiltration of business strategies, features, and create a user/file relationship graph. The graph is
customer information, personal identifiable information (PII), dynamic as users, files, and the relations between them change
company IPs, source code, etc. over time.
Depending on the vendors that provide enterprise file sharing In order to detect outliers among different users, we apply
services, certain levels of security and data protection in cloud three different but complementary machine learning algorithms.
156
indicators and create a new time series. This new time series Fig 3). And later investigation proves that this is a significant
generates stronger correlation with each of the individual outlier detection. It is this that demonstrates the benefit of using DWT
indicators. The DWT will be applied to the new time series to to generate wavelet risk score for potential insider threat
generate the final outlier score. detections. By adjusting the wavelet risk score threshold, we can
send a very small number of alerts for SOC to investigate.
TABLE I. CORRELATION MATRIX In summary, the Wavelet risk score from the first level
coefficients of the Haar DWT is able to capture the patterns in
Ensemble OddBall PageRank LOF
both the outlier indicator feature space and temporal space and
Ensemble 1.0000 0.5153 0.3544 0.8435
we propose to use that to flag risky users.
OddBall 0.5153 1.0000 0.2703 0.0365
PageRank 0.3544 0.2703 1.0000 0.0010
IV. CONCLUSIONS AND DISCUSSIONS
LOF 0.8435 0.0365 0.0010 1.0000
Mean Value 0.1073 0.0202 0.0032 0.2984 In this paper, we propose a novel two-stage machine learning
Total Record 22,016 Distinct User 688 system which leverages cloud file-sharing access data to
automatically detect user anomalous behavior for potential
B. Time Series Patterns of Outlier Indicators and Wavelet Risk insider threat. In the first stage, we project the access log data
Score onto user/file and user/user relationship graphs and apply three
In this section, we randomly sample two users and look into unsupervised algorithms to generate outlier indicators. In the
details of the trending relationship between the ensemble outlier second stage, we ensemble the outlier indicators and introduced
indicator and the Wavelet risk score to make sure that wavelet the DWT with the Haar wavelet function to model users’
risk score is able to detect changes in the ensemble outlier temporal behavior for insider threat detection. Our experiment
indicator time series and effectively flag potential insider threat. results demonstrate that the first level coefficients of the DWT
with the Haar wavelet is able to effectively capture the temporal
patterns of the user behaviors.
As for future work, we will incorporate more data sources,
including but not limited to, active directory logs, windows
logon logs, physical security logs, and printer logs, to better
detect insider threats holistically.
REFERENCES
[1] Akoglu L., McGlohon M. and Faloutsos C. (2010). Oddball:
Spotting anomalies in weighted graphs. Pacific-Asia Conference
on Knowledge Discovery and Data Mining. Springer, 2010
[2] Akoglu L., Tong H. and Koutra D. (2015). Graph based anomaly
detection and description: a survey. Data Mining and Knowledge
Discovery 29.3: 626-688, 2015
[3] Basso M., Hobert K. A. and Mann J. (2016). Magic quadrant for
enterprise file synchronization and sharing, 2016
[4] Bilen, C., & Huzurbazar, S. (2002). Wavelet-based detection of
outliers in time series. Journal of Computational and Graphical
Fig. 3. Trend plots of Ensemble score and Wavelet risk score (over 32 days) Statistics, 11(2), 311-327
[5] Breunig M. M., et al (2000). LOF: identifying density-based local
Fig. 3 shows the two users’ daily ensemble values, and the outliers. Proceedings of the ACM SIGMOD international
Wavelet risk scores over a 32-day period. Note that Ensemble conference on Management of data, 2000
series use the vertical axis on the left, while the Wavelet risk [6] Grané, A., & Veiga, H. (2010). Wavelet-based detection of
score uses the vertical axis on the right. The figure shows that outliers in financial time series. Computational Statistics & Data
the Wavelet risk scores are closely correlated with Ensemble Analysis, 54(11), 2580-2593.9
scores and the Wavelet risk scores show big spikes when the [7] Page L., Brin S. and Motwani R. (1999). The PageRank citation
ensemble value increases significantly. However, the more ranking: Bringing order to the web. Stanford InfoLab.
important characteristic is that the wavelet risk score can detect http://ilpubs. Stanford. edu 8090: 422, 1999
subtle changes in the ensemble outlier indicator, for example for [8] Zimek, A., Ricardo JGB Campello, and Jörg Sander. (2014)
user 1, the ensemble outlier indicator looks like having a regular "Ensembles for unsupervised outlier detection: challenges and
fluctuation during days 21-25 as in the other time intervals (the research questions a position paper." Acm Sigkdd Explorations
green line of the top chart in Fig 3), but there is a much clearer Newsletter 15.1 (2014): 11-22
indication of the wavelet score (the red line of the top chart in
157