
Machine Learning in Digital Forensics:

Development Data Issues

Carl Stuart Leichter, PhD


carl.leichter@ntnu.no
NTNU Testimon Digital Forensics Group
Representational Development Data
is Essential for ML Applications
• Representative of what?
– What are you trying to model?
– What principle are you testing?

The key is the structure in the data!

2
Representational Development Data
• The structure in the data reflects the structure in the
system under study.
– If we have collected the right data
• Measure wood brightness and grain prominence

– If we have correctly collected the data


• Proper lighting conditions for measuring brightness, etc.
• Noise and error?
3
A Data Complexity Progression
• Initial proof-of-principle testing (POP)
– Create (simulate) data that has structure based on *your* model

• Incrementally more complicated train/test data


– More simulated data
– Synthesize data that has the intended structure
• Constructed from select pieces of real data (if possible)
• Case study data
• Ogerta Elezaj’s “missing” data
• Edgar Lopez’s different financial fraud scenarios
– Injection of known fraud

– Limited: cannot replicate data from complex systems

• Data from a real example, but “cleaned up”


– No missing data
– Reduced noise

• Fully realistic data

• The last two may be infeasible
– Financial Fraud Data

4
Real Example of POP Data

The Problem* Model:


• 10 target classes
• Sparse Data (very little data available)
– As few as 6 training examples from one of the classes
• High* Data Dimensionality
– >10-d Data Space

* Los Alamos National Laboratory (LANL 1993)


5
MLP-BP “Universal Voxel”
Requires 5 MLP-BP Neurons

6
[Figure: ANN-8]

Radial Basis Function (RBF) Nets
Permit a Simpler Internal Structure for Some Complex Models

[Figure: Complex Model, 1 RBF Neuron]
7
Proof of Principle Testing
• Create data that adheres to your model structure

• Simulated data for LANL CNLS:


– Multi-class data
• 10 classes simulated
– Hyper-dimensional data
• 10 dimensions simulated
– Sparse training data
• As few as 6 exemplars simulated for one of the classes

The principle was proven with the simulated data

The real data was then successfully classified
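
The original LANL code is not shown in these slides; as a minimal sketch (my construction, with hypothetical cluster parameters), POP data with the same gross structure can be simulated in Python: 10 classes, 10 dimensions, and as few as 6 exemplars for one class.

```python
import numpy as np

rng = np.random.default_rng(42)

N_CLASSES = 10      # 10 target classes
N_DIMS = 10         # 10-dimensional data space
# Sparse training data: most classes get 30 exemplars, one gets only 6
counts = [30] * N_CLASSES
counts[3] = 6       # hypothetical choice of which class is data-starved

X, y = [], []
for label, n in enumerate(counts):
    # Each class is a Gaussian cluster with its own random mean;
    # this cluster structure is the "principle" the classifier must learn.
    mean = rng.uniform(-5.0, 5.0, size=N_DIMS)
    X.append(rng.normal(loc=mean, scale=1.0, size=(n, N_DIMS)))
    y.append(np.full(n, label))

X = np.vstack(X)
y = np.concatenate(y)
print(X.shape, np.bincount(y))   # (276, 10), with one class of only 6 samples
```

Because the class structure is known by construction, a method that fails on this data has a problem with the method itself rather than with the data.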


8
Comparing Performance Results

• Testing a specific method


– Parameter tuning

• Compare performance across different methods


– Two different ML algorithms
– Same ML, two different feature vector strategies

• Need standardized and well understood data sets


– Beyond basic proof of principle
– Beyond toy data sets
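
As an illustration only (the slides do not name specific algorithms), two classifiers can be compared on the same standardized data set with identical cross-validation folds, and a single method can be tuned separately:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # standardized, well-understood data set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Compare two different ML algorithms on identical folds
for name, clf in [("SVC", SVC()), ("RandomForest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")

# Parameter tuning for a single method
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}, cv=cv)
grid.fit(X, y)
print("best SVC params:", grid.best_params_)
```

Using the same folds for every candidate is what makes the comparison meaningful; the data set itself must be well understood so that differences reflect the methods, not quirks of the data.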

9
Publicly Available Data Sets
• ML Research
– “Toy” Data Sets
• The Iris data set is widely used in general ML studies.

– Benchmark Data Sets and Data Repositories


• General: http://archive.ics.uci.edu/ml/ (UCI Machine Learning Repository)

• Digital Forensics Research


– NIST CFReDS
– Brian Carrier Data Sets

– These don’t cover all domains


• Big Data?
• Malware?
10
Malware Test Data Repository
• Is this sample similar to already known samples?
• How well does our system detect this sample?
• Is the sample linked to a certain campaign?
• Can the sample be attributed to a specific group?

Such a repository doesn’t exist (yet)!


(NTNU Testimon DFG is proposing one)

11
Synthetic Data is a Solution
• The synthetic dataset can be generated according to the requirements for different test scenarios.

– Each test scenario models a different aspect of the system under study

– Incremental scenario testing (in principle) allows isolated testing of each part of your model

– The requirements of incremental scenario testing can improve the development of your fundamental model!

– Even if your model passes all of the tests you set before it, you may still discover new aspects of your model that you haven’t even considered.
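
A toy sketch of scenario-driven synthesis (the field names, distributions, and rates below are hypothetical, not taken from the slides): a baseline of legitimate records is generated, and each test scenario injects a different, known fraud pattern, so the ground-truth labels are exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_scenario(n_normal=10_000, scenario="card_not_present", fraud_rate=0.01):
    """Generate labelled synthetic transactions for one test scenario (toy model)."""
    # Baseline: legitimate transaction amounts, log-normally distributed
    amounts = rng.lognormal(mean=3.0, sigma=1.0, size=n_normal)
    labels = np.zeros(n_normal, dtype=int)

    # Inject known fraud whose structure depends on the scenario under test
    n_fraud = int(n_normal * fraud_rate)
    if scenario == "card_not_present":
        fraud = rng.lognormal(mean=5.0, sigma=0.5, size=n_fraud)   # few large amounts
    elif scenario == "micro_transactions":
        fraud = rng.uniform(0.5, 2.0, size=n_fraud)                # many tiny amounts
    else:
        raise ValueError(f"unknown scenario: {scenario}")

    X = np.concatenate([amounts, fraud])
    y = np.concatenate([labels, np.ones(n_fraud, dtype=int)])
    return X, y

X, y = make_scenario(scenario="micro_transactions", fraud_rate=0.02)
print(X.shape, y.sum(), "injected fraud records")
```

Because each scenario changes one aspect at a time, a detector can be tested in isolation against exactly the structure that scenario is meant to represent.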

12
Data Simulation:
A Well-Established and Accepted Method in Research

• Testing: Reproducible synthetic data


• Documentation: Data synthesis error conditions are identified and the error rate is measurable.
• Publication: Data synthesis procedures and results are subject to peer-reviewed publication and ”general acceptance” in the scientific community.
• Governance: The existence of official standards that govern data synthesis can be established.

The Daubert Standard!

13
Learning to Synthesize a Fraud Model

14
Recall The Data Sim Presentation
But there is analysis of real data to extract ”aggregated information” about the fraud scenario. This information serves to model the behaviour of the essential components of the internal structure of the system under study: the financial fraud.

• The essential components?


– Enough details to accurately model the behaviour sought, without
fully replicating all of the system components

• Relativistic versus Newtonian Mechanics


• Ideal Gas Law

• Daubert Implications?
15
Enables Incremental Modeling
• The synthetic dataset has the benefit that it can be generated according to the researcher’s needs, to study how a certain type of fraud might affect a specific scenario.
• Start with simplest scenario (POP)
• Incrementally add complexity


• Target complexity
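
One possible way to express this progression in code (my sketch, not from the slides) is to layer complexity onto the same base generator, so each step can be validated before the next is added:

```python
import numpy as np

rng = np.random.default_rng(2)

def base_data(n=5_000, n_features=8):
    """Simplest POP scenario: clean, fully observed, two well-separated classes."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, n_features))
    return X, y

def add_complexity(X, level):
    """Incrementally degrade the data toward the target (realistic) complexity."""
    X = X.copy()
    if level >= 1:                                   # level 1: measurement noise
        X += rng.normal(0, 1.0, X.shape)
    if level >= 2:                                   # level 2: missing values
        X[rng.random(X.shape) < 0.05] = np.nan
    if level >= 3:                                   # level 3: class overlap / drift
        X[: len(X) // 10] += rng.normal(0, 3.0, (len(X) // 10, X.shape[1]))
    return X

X0, y = base_data()
for level in range(4):
    Xl = add_complexity(X0, level)
    print(f"level {level}: NaNs={np.isnan(Xl).sum()}, std={np.nanstd(Xl):.2f}")
```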

16
Enables Increasing Model Complexity

• Social Network Analysis (SNA) helps to recreate the topology of the customer relations inside the simulation.
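
The slides do not show how that topology is built; as one possible stand-in (an assumption on my part), a scale-free graph, a structure commonly reported in social network analysis, can approximate the customer relation topology until the SNA-derived structure from real data is plugged in, e.g. with networkx:

```python
import networkx as nx

# Stand-in topology for customer relations (assumption: scale-free structure;
# in practice the degree distribution and communities would come from SNA of real data)
G = nx.barabasi_albert_graph(n=1_000, m=3, seed=7)

print(G.number_of_nodes(), "customers,", G.number_of_edges(), "relations")

# Transactions can then be simulated along the edges of this graph,
# so that injected fraud propagates through a realistic relation structure.
hubs = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:5]
print("most connected customers:", hubs)
```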

17
Avoid “Pushbutton _______”!

18
Brain Computer Interface Research
• Electroencephalogram Data (EEG)
– Brainwaves

[Images: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/EEG_cap.jpg/220px-EEG_cap.jpg and https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Spike-waves.png/420px-Spike-waves.png]

19
POP: Detecting Changes in Brain State
[Figure: Training data, Non-Alcoholic Data vs. Alcoholic Data]

20
Alcoholic vs Non-alcoholic EEG
64 Channels of Data
[Figure: Testing data]

21
Generalizing Beyond POP
[Figures: Training data and Testing data]

22
New Results Are Terrible
Successful POP Experiment

Unsuccessful extension

23
24
My Expectation: Generally Recognize Alcoholic vs Non-Alcoholic

[Figure: Non-Alcoholic Data vs. Alcoholic Data]

The Result: These 2 Specific Patterns Are Different
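
One way to expose this failure mode (this check is my illustration, not part of the original experiment) is to compare record-wise cross-validation with subject-wise cross-validation: if accuracy collapses when whole subjects are held out, the model has learned the individual recordings rather than the alcoholic vs. non-alcoholic brain state.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy stand-in for EEG features: each subject has a strong personal signature,
# while the class (alcoholic vs. non-alcoholic) contributes only a weak offset.
n_subjects, epochs_per_subject, n_features = 10, 40, 64
X, y, groups = [], [], []
for subj in range(n_subjects):
    label = subj % 2                                  # half the subjects per class
    signature = rng.normal(0, 3.0, n_features)        # subject-specific pattern
    class_effect = label * 0.3                        # weak true class signal
    X.append(signature + class_effect + rng.normal(0, 1.0, (epochs_per_subject, n_features)))
    y.append(np.full(epochs_per_subject, label))
    groups.append(np.full(epochs_per_subject, subj))
X, y, groups = np.vstack(X), np.concatenate(y), np.concatenate(groups)

clf = SVC()
record_wise = cross_val_score(clf, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(f"record-wise CV accuracy : {record_wise.mean():.3f}")
print(f"subject-wise CV accuracy: {subject_wise.mean():.3f}")
# A large gap indicates the classifier recognizes the specific subjects, not the brain state.
```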


25
Avoid “Pushbutton _______”!
• The Square Root of Green is NOT Anxiously Mustard!
– The Square Root of Green?!
– Anxiously Mustard?!?!

26
Thank You!
• Questions
• Comments
• Feedback
• Improvements

carl.leichter@ntnu.no

27
