
Machine Learning in Digital Forensics:

Development Data Issues

Carl Stuart Leichter, PhD


carl.leichter@ntnu.no
NTNU Testimon Digital Forensics Group
Representational Development Data
is Essential for ML Applications
• Representative of what?
– What are you trying to model?
– What principle are you testing?

The key is the structure in the data!

2
Representational Development Data
• The structure in the data reflects the structure in the
system under study.
– If we have collected the right data
• Measure wood brightness and grain prominence

– If we have correctly collected the data


• Proper lighting conditions for measuring brightness, etc.
• Noise and error?
3
A Data Complexity Progression
• Initial proof-of-principle testing (POP)
– Create (simulate) data that has structure based on *your* model

• Incrementally more complicated train/test data


– More simulated data
– Synthesize data that has the intended structure
• Constructed from select pieces of real data (if possible)
• Case study data
• Ogerta Elezaj’s “missing” data
• Edgar Lopez’s different financial fraud scenarios
– Injection of known fraud

– Limited: cannot replicate data from complex systems

• Data from a real example, but “cleaned up”


– No missing data
– Reduced noise

• Fully realistic data

• The last two may be infeasible
– Financial Fraud Data

4
Real Example of POP Data

The Problem* Model:


• 10 target classes
• Sparse Data (very little data available)
– As few as 6 training examples from one of the classes
• High* Data Dimensionality
– >10-d Data Space

* Los Alamos National Laboratory (LANL 1993)


5
MLP-BP “Universal Voxel”
Requires 5 MLP-BP Neurons

6
[Figure: ANN-8]

Radial Basis Function (RBF) Nets
Permit a Simpler Internal Structure for Some Complex Models

[Figure: Complex Model, 1 RBF Neuron]
7
Proof of Principle Testing
• Create data that adheres to your model structure

• Simulated data for LANL CNLS:


– Multi-class data
• 10 classes simulated
– Hyper-dimensional data
• 10 dimensions simulated
– Sparse training data
• As few as 6 exemplars simulated for one of the classes

The principle was proven with the simulated data

The real data was then successfully classified
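
The original LANL code is not shown in these slides; as a minimal sketch (my construction, with hypothetical cluster parameters), POP data with the same gross structure can be simulated in Python: 10 classes, 10 dimensions, and as few as 6 exemplars for one class.

```python
import numpy as np

rng = np.random.default_rng(42)

N_CLASSES = 10      # 10 target classes
N_DIMS = 10         # 10-dimensional data space
# Sparse training data: most classes get 30 exemplars, one gets only 6
counts = [30] * N_CLASSES
counts[3] = 6       # hypothetical choice of which class is data-starved

X, y = [], []
for label, n in enumerate(counts):
    # Each class is a Gaussian cluster with its own random mean;
    # this cluster structure is the "principle" the classifier must learn.
    mean = rng.uniform(-5.0, 5.0, size=N_DIMS)
    X.append(rng.normal(loc=mean, scale=1.0, size=(n, N_DIMS)))
    y.append(np.full(n, label))

X = np.vstack(X)
y = np.concatenate(y)
print(X.shape, np.bincount(y))   # (276, 10), with one class of only 6 samples
```

Because the class structure is known by construction, a method that fails on this data has a problem with the method itself rather than with the data.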


8
Comparing Performance Results

• Testing a specific method


– Parameter tuning

• Compare performance across different methods


– Two different ML algorithms
– Same ML, two different feature vector strategies

• Need standardized and well understood data sets


– Beyond basic proof of principle
– Beyond toy data sets
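
As an illustration only (the slides do not name specific algorithms), two classifiers can be compared on the same standardized data set with identical cross-validation folds, and a single method can be tuned separately:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # standardized, well-understood data set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Compare two different ML algorithms on identical folds
for name, clf in [("SVC", SVC()), ("RandomForest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")

# Parameter tuning for a single method
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}, cv=cv)
grid.fit(X, y)
print("best SVC params:", grid.best_params_)
```

Using the same folds for every candidate is what makes the comparison meaningful; the data set itself must be well understood so that differences reflect the methods, not quirks of the data.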

9
Publicly Available Data Sets
• ML Research
– “Toy” Data Sets
• The Iris data set is widely used in general ML studies.

– Benchmark Data Sets and Data Repositories


• General: http://archive.ics.uci.edu/ml/ (UCI Machine Learning Repository)

• Digital Forensics Research


– NIST CFReDS
– Brian Carrier Data Sets

– These don’t cover all domains


• Big Data?
• Malware?
10
Malware Test Data Repository
• Is this sample similar to already known samples?
• How well does our system detect this sample?
• Is the sample linked to a certain campaign?
• Can the sample be attributed to a specific group?

Such a repository doesn’t exist (yet)!


(NTNU Testimon DFG is proposing one)

11
Synthetic Data is a Solution
• The synthetic dataset can be generated according to the requirements for different test scenarios.

– Each test scenario models a different aspect of the system under study

– Incremental scenario testing (in principle) allows isolated testing of each part of your model

– The requirements of incremental scenario testing can improve the development of your fundamental model!

– Even if your model passes all of the tests you set before it, you may still discover new aspects of your model that you haven’t even considered.
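
A toy sketch of scenario-driven synthesis (the field names, distributions, and rates below are hypothetical, not taken from the slides): a baseline of legitimate records is generated, and each test scenario injects a different, known fraud pattern, so the ground-truth labels are exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_scenario(n_normal=10_000, scenario="card_not_present", fraud_rate=0.01):
    """Generate labelled synthetic transactions for one test scenario (toy model)."""
    # Baseline: legitimate transaction amounts, log-normally distributed
    amounts = rng.lognormal(mean=3.0, sigma=1.0, size=n_normal)
    labels = np.zeros(n_normal, dtype=int)

    # Inject known fraud whose structure depends on the scenario under test
    n_fraud = int(n_normal * fraud_rate)
    if scenario == "card_not_present":
        fraud = rng.lognormal(mean=5.0, sigma=0.5, size=n_fraud)   # few large amounts
    elif scenario == "micro_transactions":
        fraud = rng.uniform(0.5, 2.0, size=n_fraud)                # many tiny amounts
    else:
        raise ValueError(f"unknown scenario: {scenario}")

    X = np.concatenate([amounts, fraud])
    y = np.concatenate([labels, np.ones(n_fraud, dtype=int)])
    return X, y

X, y = make_scenario(scenario="micro_transactions", fraud_rate=0.02)
print(X.shape, y.sum(), "injected fraud records")
```

Because each scenario changes one aspect at a time, a detector can be tested in isolation against exactly the structure that scenario is meant to represent.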

12
Data Simulation:
A Well-Established and Accepted Method in Research

• Testing: Reproducible synthetic data


• Documentation: Data synthesis error conditions are identified and the error rate is measurable.
• Publication: Data synthesis procedures and results are subject to peer-reviewed publication and ”general acceptance” in the scientific community.
• Governance: The existence of official standards that govern data synthesis can be established.

The Daubert Standard!

13
Learning to Synthesize a Fraud Model

14
Recall The Data Sim Presentation
But there is analysis of real data to extract ”aggregated information” about the fraud scenario. This information serves to model the behaviour of the essential components of the internal structure of the system under study: the financial fraud.

• The essential components?


– Enough details to accurately model the behaviour sought, without
fully replicating all of the system components

• Relativistic versus Newtonian Mechanics


• Ideal Gas Law

• Daubert Implications?
15
Enables Incremental Modeling
• The synthetic dataset has the benefit that it can be generated according to the researcher’s needs, to study how a certain type of fraud might affect a specific scenario.
• Start with simplest scenario (POP)
• Incrementally add complexity


• Target complexity
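
One possible way to express this progression in code (my sketch, not from the slides) is to layer complexity onto the same base generator, so each step can be validated before the next is added:

```python
import numpy as np

rng = np.random.default_rng(2)

def base_data(n=5_000, n_features=8):
    """Simplest POP scenario: clean, fully observed, two well-separated classes."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, n_features))
    return X, y

def add_complexity(X, level):
    """Incrementally degrade the data toward the target (realistic) complexity."""
    X = X.copy()
    if level >= 1:                                   # level 1: measurement noise
        X += rng.normal(0, 1.0, X.shape)
    if level >= 2:                                   # level 2: missing values
        X[rng.random(X.shape) < 0.05] = np.nan
    if level >= 3:                                   # level 3: class overlap / drift
        X[: len(X) // 10] += rng.normal(0, 3.0, (len(X) // 10, X.shape[1]))
    return X

X0, y = base_data()
for level in range(4):
    Xl = add_complexity(X0, level)
    print(f"level {level}: NaNs={np.isnan(Xl).sum()}, std={np.nanstd(Xl):.2f}")
```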

16
Enables Increasing Model Complexity

• Social Network Analysis (SNA) helps to recreate the topology of the customer relations inside the simulation.
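
The slides do not show how that topology is built; as one possible stand-in (an assumption on my part), a scale-free graph, a structure commonly reported in social network analysis, can approximate the customer relation topology until the SNA-derived structure from real data is plugged in, e.g. with networkx:

```python
import networkx as nx

# Stand-in topology for customer relations (assumption: scale-free structure;
# in practice the degree distribution and communities would come from SNA of real data)
G = nx.barabasi_albert_graph(n=1_000, m=3, seed=7)

print(G.number_of_nodes(), "customers,", G.number_of_edges(), "relations")

# Transactions can then be simulated along the edges of this graph,
# so that injected fraud propagates through a realistic relation structure.
hubs = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:5]
print("most connected customers:", hubs)
```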

17
Avoid “Pushbutton _______”!

18
Brain Computer Interface Research
• Electroencephalogram Data (EEG)
– Brainwaves

[Images: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/EEG_cap.jpg/220px-EEG_cap.jpg and https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Spike-waves.png/420px-Spike-waves.png]

19
POP: Detecting Changes in Brain State
[Figure: Training data, Non-Alcoholic Data vs. Alcoholic Data]

20
Alcoholic vs Non-alcoholic EEG
64 Channels of Data
[Figure: Testing data]

21
Generalizing Beyond POP
[Figures: Training data and Testing data]

22
New Results Are Terrible
Successful POP Experiment

Unsuccessful extension

23
24
My Expectation: Generally Recognize Alcoholic vs Non-Alcoholic

[Figure: Non-Alcoholic Data vs. Alcoholic Data]

The Result: These 2 Specific Patterns Are Different
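
One way to expose this failure mode (this check is my illustration, not part of the original experiment) is to compare record-wise cross-validation with subject-wise cross-validation: if accuracy collapses when whole subjects are held out, the model has learned the individual recordings rather than the alcoholic vs. non-alcoholic brain state.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy stand-in for EEG features: each subject has a strong personal signature,
# while the class (alcoholic vs. non-alcoholic) contributes only a weak offset.
n_subjects, epochs_per_subject, n_features = 10, 40, 64
X, y, groups = [], [], []
for subj in range(n_subjects):
    label = subj % 2                                  # half the subjects per class
    signature = rng.normal(0, 3.0, n_features)        # subject-specific pattern
    class_effect = label * 0.3                        # weak true class signal
    X.append(signature + class_effect + rng.normal(0, 1.0, (epochs_per_subject, n_features)))
    y.append(np.full(epochs_per_subject, label))
    groups.append(np.full(epochs_per_subject, subj))
X, y, groups = np.vstack(X), np.concatenate(y), np.concatenate(groups)

clf = SVC()
record_wise = cross_val_score(clf, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(f"record-wise CV accuracy : {record_wise.mean():.3f}")
print(f"subject-wise CV accuracy: {subject_wise.mean():.3f}")
# A large gap indicates the classifier recognizes the specific subjects, not the brain state.
```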


25
Avoid “Pushbutton _______”!
• The Square Root of Green is NOT Anxiously Mustard!
– The Square Root of Green?!
– Anxiously Mustard?!?!

26
Thank You!
• Questions
• Comments
• Feedback
• Improvements

carl.leichter@ntnu.no

27
