
OpenUBA: A SIEM-agnostic, Open Source Framework for Modeling User Behavior

May 25, 2020

Jovonni L. Pharr
jpharr2@student.gsu.edu
Atlanta, GA, USA
[v0.0.1]

Abstract

This project describes a system for the analytic modeling of user and entity behavior. We draw on the scientific computing community and the tools used within it. We demonstrate the conditions under which model invocation takes place, the handling of model results, and the components of a model. Version control is described for models that change over time, and feedback is described for both rules and models. Efficient system storage is demonstrated for anomalies and events of interest, and various risk calculation approaches are described. The system takes advantage of Turing-completeness for model development, instead of limiting development freedom. Lastly, the different scenarios in which models and rules accept feedback are covered. The components described construct a UBA system designed to be extensible, powerful, and open.

1 Motivation

There exist several security monitoring tools focused on providing user and entity behavior monitoring capabilities. UBA tools attempt to replace a SIEM solution, but the two should work together: Security Operation Centers (SOCs) should not be forced to choose between a SIEM solution and a UBA solution. Although some SIEM solutions are releasing their own UBA features, there is a difference between a flexible, expressive UBA tool and a bolted-on UBA feature.

1.1 Problems

Many UBA platforms take a "black box" approach to modeling. The very concept of modeling can vary across UBA tools, though it is defined more concretely in mathematics. UBA vendors often claim their models are intellectual property, which is an understandable business decision to make. However, the ramifications of proprietary models include vendor lock-in, decreased model developer-friendliness, and hindered community growth and participation.

Also, some SOCs must use multiple tools to gain the benefits of UBA, which can be expensive depending on the vendor. OpenUBA [1] aims to solve these problems by providing:

1. An open/"white box" model standard
2. SIEM-agnosticism
3. Turing-complete, expressive modeling
4. Model feedback from cases
5. Model versioning
6. A lightweight architecture
7. Alerting

2 Modeling

UBA tools use an overarching concept of a model M. A model can be an arbitrary set of commands that change the state of the system. Models may execute both stochastic and deterministic sets of computations. Given the depth and breadth of security use cases, model development in any modern UBA tool should be Turing-complete and fully expressive. UBA tools attempt to offer a flexible model development practice, but because that practice typically falls short of Turing-completeness, model development is inherently limited, although expressive.

2.1 Mapping Mitre Att&ck

Mapping to the Mitre Att&ck matrix is important for SOCs to monitor their own performance with respect to their coverage and security posture. Models from the model library are mapped to Mitre Att&ck techniques and tactics, which enables coverage with respect to Mitre Att&ck to be calculated across the model population. This is becoming common practice for security tools, as Mitre maintains an industry-leading position on standardizing attack vectors and surfaces so that organizations can use a "common language" – the mapping itself is arbitrary.

2.2 Defining Models

We define an abstraction of a model,

    M = {(m, λ, s) | S_min ≤ s ≤ S_max}    (1)

where m is a Mitre mapping, λ is the Merkle root hash of the model component tree, s is the contributing risk score for the model, and S_min, S_max are the system's global risk score limits. Each model has a relation to an m_technique and an m_tactic.

Models run in model groups, which may contain ≥ 1 model. We define the model execution function,

    Ξ = execute(M)    (2)

which maps the cartesian product of the set of models and the set of datasets in the system.
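The model abstraction (m, λ, s) can be sketched in code. This is a minimal illustration under stated assumptions, not the OpenUBA API: the class names, field names, and global limits below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical global risk score limits (S_min, S_max)
S_MIN, S_MAX = 0.0, 100.0

@dataclass(frozen=True)
class MitreMap:
    """A model's relation to a Mitre Att&ck technique and tactic."""
    technique: str
    tactic: str

@dataclass(frozen=True)
class Model:
    """The tuple (m, λ, s): Mitre map, Merkle root hash of the model
    component tree, and the model's contributing risk score."""
    mitre: MitreMap
    merkle_root: str
    risk_score: float

    def __post_init__(self):
        # Enforce S_min <= s <= S_max, as in the model definition
        if not (S_MIN <= self.risk_score <= S_MAX):
            raise ValueError("risk score outside global limits")
```

A model group is then simply a collection of such tuples executed over shared datasets.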


    Ξ : M × D → r    (3)

where r is a model result, D is the set of data in the system, and d ∈ D is the subset of data used by the execute function. Model results can be a set of relations between a user U and s, given that s is between S_min and S_max. Such a relation can be "U has a new risk score of s", "U has their risk score increased/decreased by s", or another arbitrary risk score calculation.

2.3 Model Library

We make use of a model registry, from which models are installed for use in the system. A model registry can be arbitrary, but it must be trusted by the user, although models are verified before installation and usage.

2.3.1 Components

Models are constructed from ≥ 1 component. Components, at a high level, are files. Model library entries contain the file hashes and encoded-data hashes of each component,

    {(c_type, f, d) : c ≠ ∅, ∀c ∈ C}    (4)

where f is the file and d is the encoded data. A model component contains a set of hashes,

    H := {H(f), H(d) | f ≠ ∅, d ≠ ∅}    (5)

where H is a hash function, and f and d must both be present. Together, these form the model component,

    c = {(c_type, H)}    (6)

2.4 Common Modules

For a given model, the model author may be required to rewrite a basic preprocessing step, which may differ depending on the details of how the model is being used. For this reason, we provide common model modules to be used by models. For example, if a model needs to parse incoming data a certain way, the logic can be encapsulated inside an external module and invoked by the model, not written within the model itself.

2.4.1 Verification

Model verification takes one of two forms: the system either verifies the contents from the model library before installation, or verifies the files the model uses after installation, but prior to each execution of the model.

Both verification functions – verifying the encoded data of the model,

    V_d : C_data → {0, 1}    (7)

and verifying the files the model uses,

    V_f : C_file → {0, 1}    (8)

– map an element of the component to a truth value. Overall verification is the logical conjunction of both verification types, represented by the boolean expression

    V := V_d ∧ V_f    (9)

We represent the model invocation function as

    Φ = invoke(...)    (10)

    Φ(Γ, M) = { install(M),  Γ = d
              { Ξ(M),        Γ = f
              { Err,         otherwise        (11)

Using the data and file verifications, we construct the simple verification conditions to be satisfied before model invocation. Invocation occurs only if the verification function V holds:

    V(M) = { Φ(Γ, M),  if V
           { Err,      otherwise        (12)
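The two verification functions and their conjunction V = V_d ∧ V_f can be sketched as follows. The hash-bookkeeping layout is an assumption for illustration, not OpenUBA's actual registry format.

```python
import hashlib

def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_data(encoded: bytes, expected_hash: str) -> bool:
    """V_d: map the component's encoded data to a truth value."""
    return _sha256(encoded) == expected_hash

def verify_file(path: str, expected_hash: str) -> bool:
    """V_f: map a file the model uses to a truth value."""
    with open(path, "rb") as fh:
        return _sha256(fh.read()) == expected_hash

def verify(encoded: bytes, data_hash: str, path: str, file_hash: str) -> bool:
    """Overall verification as the conjunction V_d AND V_f."""
    return verify_data(encoded, data_hash) and verify_file(path, file_hash)
```

Running `verify` both at install time and before each execution mirrors the two verification forms described above.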


In addition to installation verification, the same process is executed upon each model execution. This ensures that the integrity of a model is checked prior to running it.

2.5 Model Groups

Model groups are sets of models that run with a shared context and shared datasets. We define executing over model groups by

    r_j = Ξ(M_group_i, M_j), ∀i, j ∈ M_enabled    (13)

given that M_j is enabled, where M is the model configuration, which defines the group.

2.5.1 Data Loaders

A data loader provides models with the data needed to execute, and any contextual information needed by the model. Context for a model may be the host information when connecting remotely to a data source, or the location of the data source.

If a system has n models enabled for proxy data, the data should be loaded at most once for the model group. The number of times the same data is loaded, D_Mgroup, is not proportional to the number of models, |M|:

    D_Mgroup ∝̸ |M|,  M ∈ M_group    (14)

Data loaders enable model groups to share the data used for modeling.

2.6 Model Return

Model return types are defined at model configuration time,

    ρ = result_type    (15)

and the set of result types is

    ρ := {U_risk, U_anom, ..., U_n}    (16)

For a return type of U_risk,

    M_res = { {u_1_new, ..., u_n_new},  if ρ = U_risk
            { {u_1, ..., u_n},          if ρ = U_anom
            { Err,                      otherwise        (17)

where u_n_new is either the new risk score for user u, or the risk score to be used in the new risk score calculation.

2.7 Sequence Models

M can be a single-fire model or a sequence of individual models. Single-fire models perform their logic on the input data and return a result. Single-fire models are Markovian, meaning that the state of the system after Ξ(M) depends only on the current state and the state transition initiated by invoking Ξ.

Sequence models depend on outputs from previous models. One approach is to feed the output of one model directly into the input of another. However, we abstract the connection between models by memoizing model outputs.

A sequence model does not always execute immediately after its predecessor. It is therefore efficient to memoize prior model output. This approach ensures that the time window used during model training/inference can be as long as the system's historical window. In other words, we can store memoized model outputs and condition on them at model execution time.

2.8 Model Versioning

If there exists a model whose version is the maximum version number in the set of model versions, the system will infer on that model by default. Given the set of all model versions, M_vers,

    M_vers := {v_1, v_2, ..., v_n | M_v ≥ 0, ∀v}    (18)
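A minimal sketch of this default-version inference follows; the registry layout (a mapping of version number to model) is an assumption for illustration, not OpenUBA's registry API.

```python
# Infer on the maximum model version by default.
def default_model(versions: dict):
    """Select the model carrying the highest version number."""
    latest = max(versions)        # maximum version number in the set
    return latest, versions[latest]

registry = {1: "detector_v1", 2: "detector_v2", 3: "detector_v3"}
version, model = default_model(registry)   # -> (3, "detector_v3")
```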


    ∃M ∋ max(M_vers) = M_v ⇒ Ξ(M_v)    (19)

This prevents multiple versions of the same model from executing, especially while running on the same data.

2.9 Rules

Traditionally, a rule engine within a SIEM contains a set of conditions, and the engine checks event data against these rules. With OpenUBA, an entire rule engine can be defined within a standalone model. This rule engine executes within a model group and can output risk scores just as a traditional model does. This approach enables rule engines to be used in the same way models are used.

2.10 Storage

Another issue with current-state UBA tools is the requirement to store a copy of production data in such a way that the tool can access it. This forces UBA users to eventually copy large amounts of data. Although UBA tools tend to delete data beyond a specified time window, copying the data alone can be challenging for users with large amounts of data.

To be lightweight and "SIEM-agnostic", we store only the anomaly data and supporting events within a specified time window. This way, the system can still execute on production data, but without the requirement to retain all of the data on disk or in memory.

If an enterprise proxy log source can produce 1B records on a given day, which in turn generate 50,000 proxy anomalies, retaining the 50,000 records on disk is far more efficient than copying the full set of data. This constraint makes the system more lightweight than a traditional, bloated SIEM or UBA system. The number of stored events, e_stored, is much less than the number of observed events, e_observed:

    e_stored ≪ e_observed    (20)

If specific models call for a sequence of events to occur within an interval of time, [t*, t], we can efficiently retain a potentially anomalous event for up to t − ∆, where ∆ is the amount of time into the past the system can observe an event – typically 90 days in traditional SIEM products.

However, this data storage configuration is arbitrary, because at any point in time we can store the anomaly data in an external abstract data store, and a model can simply retrieve it from outside the system, similar to how models retrieve the data used during computation.

3 Logs

We process logs both for creating users and entities, and for model execution. Although the process by which the data is handled is the same, the intent differs between the two,

    process(i, j), ∀i, j    (21)

3.1 ID Feature

For each log source, a feature to be used as an identifier is configured. The identifier should serve as a common key across all log sources, or must be a subset of another key. This identifier makes it less costly to decipher the attributing account for any event e, not that it cannot be done otherwise.

3.2 Dormancy

Dormancy is determined by the absence of a subset of e in D_u; the dormancy period is arbitrary. If exactly 0 events from the preconfigured events, e_dorm, are found within the logs of u, we say u is dormant:

    U_dorm = {u | ∀u ∈ U, e_dorm ∉ D_u}    (22)

4 Risk

Risk scores can not only be arbitrary, but can also mislead or confuse stakeholders. "Black box" risk calculations hinder SOCs from fully understanding how case-generating anomalies are precisely scored, and from concretely explaining the calculations to third parties. There is an inherent tradeoff between interpretable risk scores and the determinism of the risk score calculation. The more a model uses stochastic and complex processes to calculate risk scores, the more SOC analysts lose granular explainability of risk results.

The philosophical concept of a risk score is subjective as well, and is designed for human intuition. This goal can be hindered by arbitrary vendor-driven risk score calculations, which decrease the interpretability of model outputs. In data science, there exists an accuracy vs. interpretability tradeoff: as more complex models are invented, which increase in performance, we lose human intuition about the inner workings of the model. This justifies enabling SOC teams to transparently define their risk score calculation logic as much as possible – hiding concrete risk score logic only harms day-to-day SOC decision making, and requires a great deal of trust from the SOC in the vendor.

4.1 Risk Calculation

We purposefully abstract the risk calculation, and demonstrate basic risk score calculations for an individual and a group. If a model results in a new user risk score, we can simply overwrite the user's risk,

    u_risk_{t+1} = r_u    (23)

If a model results in a quantity of risk, s, to be used in the final calculation of risk for U, then the abstracted risk score calculation process is invoked using that quantity of risk as an input variable. For example, if the risk calculation for users is a summation of the previous risk score of u and the model output with respect to u, then

    u_risk_{t+1} = u_risk_t + r_u    (24)

We define an abstracted risk calculation function, and map it over all users,

    u_risk = γ(u, risk(u, t), M_u_r), ∀u ∈ U    (25)

γ is a function on the cartesian product of the user set and two real numbers, yielding a real number that represents the new risk score for user u,

    γ : u × ℝ × ℝ → ℝ    (26)

For simple risk score calculations, the function does not need u; however, the system can consider a user's profile as context during risk calculation, as opposed to a risk score being independent of u.

5 Case Feedback

Upon a SOC analyst providing feedback to a model, the model adjusts its internal logic in some meaningful way. Feedback into a model differs in several ways depending on the properties of the model. However, feedback on a rule must be separated from feedback to a model.

5.1 Model Feedback

Model feedback usually does not involve a change in semantic or syntactic form, because model parameters usually change after feedback – although additional feature selection or preprocessing may also change after feedback.

1. Tuning
   (a) changing parameters in a model
2. Remodeling
   (a) rewriting, or editing, the underlying model structure

The feedback process used is mainly for parameter tuning, but more work is to be done on automatically remodeling based on feedback.
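Parameter tuning from case feedback can be sketched for a simple standard-deviation rule. The class, method names, and weighting scheme below are illustrative assumptions, not OpenUBA's implementation.

```python
import statistics

# Hypothetical sketch: a rule flags events more than s standard
# deviations from 'normal'; false-positive feedback folds the
# case-generating value back into the baseline.
class ThresholdModel:
    def __init__(self, baseline, s=2.0):
        self.values = list(baseline)   # observations defining 'normal'
        self.s = s

    def is_anomalous(self, x) -> bool:
        mu = statistics.mean(self.values)
        sigma = statistics.pstdev(self.values)
        return abs(x - mu) > self.s * sigma

    def feedback_false_positive(self, x, weight=1.0):
        # Absorb the event into the baseline so sigma(x) reflects the
        # feedback; a weight in [0, 1] damps the adjustment by blending
        # the event toward the current mean.
        mu = statistics.mean(self.values)
        self.values.append(mu + weight * (x - mu))
```

With weight = 1.0 the event is fully absorbed into "normal"; smaller weights damp the feedback, e.g. in proportion to how often the case has been deemed a false positive.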


For a simple statistical model such as

    if user a triggers rule R and
    S standard deviations from 'normal'

if a result was a false positive, the model would include the false-positive metric in its calculation of the standard deviation, σ(x).

5.1.1 Weighting

Model feedback can also be weighted. Instead of σ(x) merely having the case-generating event included within its calculation, the function could also apply a weight in [0, 1], which can damp the feedback. Damping can be based on several variables, including the number of times the case has been deemed a false positive, or accepted behavior, by the SOC.

5.2 Rule Feedback

Rule feedback may look different from model feedback. Rule feedback may require metaprogramming of the actual rule, or changing a portion of it. Feedback for a rule such as

    if event_type = A

may become

    if event_type == 0
    and if event_type == 1

after feedback. In other words, feedback on rules can change the syntactic form of the rule, or its semantic form, such as its variables. Each model defines its own process through which it interprets feedback given on the cases it generates.

6 Future Work

Herein, we have demonstrated an "Open Model" UBA solution for monitoring users and entities. Once systems like OpenUBA are implemented, macro-level SOC activities can be performed using the system's data and community data. Several threat feeds exist today, but few, if any, focus on analytics and modeling insights from current threats. Sharing hyperparameters to fine-tune a shared security model on the most recent botnet activity becomes possible when "Open Model" solutions are more widely used.

6.1 Shared SOC

A SuperSOC consists of a set of small SOCs, each with their own people, processes, and data. SuperSOCs enable multiple distributed SOCs to function as one organism with regard to security threat modeling and security analytics.

6.2 Distributed Intel

Community-driven intel is not uncommon in several security circles. Isolated, analytics-driven security tools hinder SOCs from building truly knowledge-driven analytic solutions by removing community participation. More work is to be done in this space.

6.3 Cross-SOC Models

SOCs can also begin to collaboratively model security use cases – for example, knowing how the anomalousness of a user's proxy traffic in your organization compares to other companies in the same industry. Inter-SOC collaboration is still an active area of research.

References

[1] Georgia Cyber Warfare Range, OpenUBA, http://openuba.org
