Lec3 Data Processing
Filtering, Normalization and Correlation
SIT Internal
Lecture 3 Contents
○ Events of security interest
○ Log filtering
○ Data normalization
○ Event correlation
OS logs
• Authentication
• Linux syslog, remote user authenticating with Secure Shell (SSH) daemon
OS logs
• Service startup, shutdown and status change
• Service crash
• Linux syslog, FTP server shutting down involuntarily (due to a crash or a kill command)
OS logs
• Miscellaneous status message
• Linux syslog, successful connection to a POP3 mail daemon by a remote user “anton”
• Linux syslog shows a connection failure (due to access controls) to a telnet service
Accuracy of data
• Data discrepancy caused by
Time synchronization
○ Challenges with time synchronization
○ dead battery or other hardware failure
○ which time zone?
○ NTP clock drift causes time deviations on the order of seconds
○ syslog forwarder mystery: his time vs. my time
○ log lag
○ 5:17, AM or PM?
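One common way to sidestep the "which time zone?" and "AM or PM?" questions is to convert every timestamp to UTC in 24-hour form before comparing logs. A minimal sketch, assuming each source records its UTC offset; the function name `to_utc`, the timestamp format, and the sample values are illustrative and not taken from the lecture.

```python
from datetime import datetime, timezone

def to_utc(stamp: str, fmt: str = "%Y-%m-%d %H:%M:%S %z") -> datetime:
    """Parse a timestamp that carries a UTC offset and convert it to UTC."""
    return datetime.strptime(stamp, fmt).astimezone(timezone.utc)

# Two hypothetical log sources reporting the same instant in different zones.
a = to_utc("2024-03-01 17:17:00 +0800")
b = to_utc("2024-03-01 05:17:00 -0400")
print(a.isoformat(), a == b)
```

This does not fix clock drift or a dead hardware clock; it only removes the ambiguity that comes from mixing local time zones.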
Data Quality
• Data have quality if they satisfy the
requirements
• Accuracy - Errors
• Completeness – Missing values
• Consistency – Huge deviation
• Timeliness - Updated
• Believability – Trust in the data
• Interpretability – Ease of understanding
DATA CLEANSING
Data Cleaning
• Data filtering
• Irrelevant data fields
• Duplicated data entries, could be from different sources
• Redundant data that is heavily dependent and can be derived from other data,
e.g., collinearity between data, DoB and Age
Raw logs → Filtering & Normalization → Correlation
○ Filtering
○ Take in raw log data, determine whether to keep it
○ Normalization
○ Take the raw log, map its various elements to a common
format
○ Event – a normalized log message
○ Correlation
○ Normalized log data is input to correlation
○ Match a single normalized piece of data, or a series of
data, for the purpose of taking an action
• Place the log messages you already know about in an ignore file, so that they are excluded and only unfamiliar messages remain for review.
Output
297 cron: (root) CMD (/usr/bin/at)
167 sendmail: alias database /etc/aliases.db out of date
120 ftpd: PORT
61 lpd: restarted
48 kernel: wdpi0: transfer size=2048 intr cmd DRQ
…
The number preceding the log message shows how many times the log message was seen in log
files.
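The ignore-file idea and the counted output can be sketched as follows. This is a minimal illustration, assuming the ignore file holds regular expressions; `IGNORE_PATTERNS` and `summarize` are hypothetical names, and the sample messages are borrowed from the output shown here.

```python
import re
from collections import Counter

# Hypothetical ignore patterns for messages already known to be benign.
IGNORE_PATTERNS = [re.compile(r"cron: \(root\) CMD"), re.compile(r"lpd: restarted")]

def summarize(messages, ignore=IGNORE_PATTERNS):
    """Drop messages matching any ignore pattern, then count the rest."""
    kept = [m for m in messages if not any(p.search(m) for p in ignore)]
    return Counter(kept).most_common()

sample = [
    "cron: (root) CMD (/usr/bin/at)",
    "ftpd: PORT",
    "ftpd: PORT",
    "lpd: restarted",
    "sendmail: alias database /etc/aliases.db out of date",
]
for message, count in summarize(sample):
    print(count, message)
```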
Source IP, Source port
Destination IP, Destination port
Category
• E.g., login.success
Timestamps
• Time generated
• Time received
• Regression
• Best line to fit two attributes
• Equation
• Use one attribute to predict other
• Regression
• Take half of the dataset to model an equation
• Validate the equation using the other half of the dataset
• Predict risk from reliability and/or vice-versa
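The fit-then-validate procedure can be sketched with an ordinary least-squares line. The data below are synthetic (reliability, risk) pairs made up purely for illustration, and `fit_line` is a hypothetical helper name; the split simply takes alternate rows.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic (reliability, risk) pairs, invented for this sketch only.
data = [(1, 9.1), (2, 8.0), (3, 6.9), (4, 6.1), (5, 5.0), (6, 3.9), (7, 3.1), (8, 2.0)]
train, validate = data[::2], data[1::2]  # half to model, half to validate

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
errors = [abs(y - (slope * x + intercept)) for x, y in validate]
print(f"risk = {slope:.3f} * reliability + {intercept:.3f}")
print(f"mean absolute error on held-out half: {sum(errors) / len(errors):.3f}")
```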
Normalization
○ Parse the log messages that you would like to keep, piecing apart their components to turn them into a common format.
Normalization - Steps
1. Get documentation for products you are using.
2. Read the documentation for descriptions of what the raw log data looks
like and what each field is.
3. Come up with the proper parsing expression to normalize the data.
○ Most log analysis systems utilize a regular expression
implementation to parse the data.
4. Test the parsing logic on sample raw log data.
5. Deploy the parsing logic.
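Steps 3 and 4 can be sketched for one message type. This is an illustration under assumptions: the sample line follows the usual OpenSSH "Accepted password" layout but is invented here, and the output field names (`category`, `time_generated`, and so on) are one possible common format rather than a standard.

```python
import re

# Parsing expression for a successful SSH login reported through syslog.
SSHD_ACCEPT = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) sshd\[\d+\]: "
    r"Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<src_ip>\S+) port (?P<src_port>\d+)"
)

def normalize(raw: str):
    """Map a raw sshd line to a common event format, or None if it does not match."""
    m = SSHD_ACCEPT.match(raw)
    if m is None:
        return None
    return {
        "category": "login.success",
        "time_generated": m["time"],
        "device": m["host"],
        "user": m["user"],
        "src_ip": m["src_ip"],
        "src_port": int(m["src_port"]),
    }

# Step 4: test the parsing logic on a (made-up) sample raw log line.
raw = "Jan 12 06:49:42 host1 sshd[1234]: Accepted password for anton from 10.0.0.5 port 51234 ssh2"
event = normalize(raw)
print(event)
```

Returning `None` for unmatched lines keeps unparsed messages visible, so they can be reviewed and given their own expression later.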
Detecting Discrepancy
• Detect discrepancy
• Knowledge of metadata – Domain expertise
• Understanding data types and attributes
• Outliers
• Standard deviation from mean
Outlier Detection
• Quantitative data
• Descriptive statistics
• Standard deviation, Box plots
• Regression
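Two of the quantitative checks can be sketched briefly: distance from the mean in standard deviations, and the box-plot rule based on the interquartile range. The thresholds (2 standard deviations, 1.5 × IQR) are common conventions assumed here, and the hourly login counts are invented.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if stdev and abs(v - mean) / stdev > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside the box-plot whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

# Hypothetical logins per hour; the last value is the suspicious one.
logins_per_hour = [12, 15, 11, 14, 13, 12, 16, 14, 13, 95]
print(zscore_outliers(logins_per_hour), iqr_outliers(logins_per_hour))
```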
Outlier Detection
• Qualitative data
• Calculate similarity and correlation
• Use clustering methods
• Find the ones which cannot be
assigned to a cluster
Handling outliers
Event Correlation
Correlation makes a
difference between:
○ “14:10 7/4/20110 User Roberts Successful Authenticate to 10.100.52.105 from
10.10.8.22”
and...
Correlation
○ Reduce false positives
○ E.g. Intrusion Detection System (IDS) to consult a vulnerability
database
Correlation
Basic forms of correlation
○ Rule based
○ Statistical
Correlation
○ Normalization of raw event data is crucial to effectively perform atomic
correlation
○ Keeping normalized logs in a database table supports database style
searches
○ Show [All Logs] From [All Devices] from the [last two weeks], where
the [username] is [Roberts]
○ Just as with any database, event normalization allows the creation of
summarization reports
○ Which User Accounts have accessed the highest number of distinct
hosts in the last month?
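Both kinds of question map directly onto SQL once events are normalized. A minimal sketch using an in-memory SQLite table; the table layout, column names, and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, device TEXT, username TEXT, dst_host TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("2024-03-01T14:10:00", "fw1", "Roberts", "10.100.52.105", "login.success"),
        ("2024-03-02T09:00:00", "srv2", "Roberts", "10.100.52.106", "login.success"),
        ("2024-03-02T09:05:00", "srv2", "Lee", "10.100.52.106", "login.failure"),
    ],
)

# Show all logs from all devices since a cutoff time where the username is Roberts.
roberts = conn.execute(
    "SELECT * FROM events WHERE username = ? AND ts >= ? ORDER BY ts",
    ("Roberts", "2024-02-20T00:00:00"),
).fetchall()

# Summarization report: distinct hosts accessed per user account, highest first.
per_user = conn.execute(
    "SELECT username, COUNT(DISTINCT dst_host) AS hosts FROM events "
    "GROUP BY username ORDER BY hosts DESC"
).fetchall()
print(roberts)
print(per_user)
```

Storing timestamps as ISO 8601 strings is a simplification that lets plain string comparison act as a time filter.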
Rule Correlation
○ Correlate events by behavioural rules
○ Requires stateful rule engine
○ Pseudocode for reconnaissance attempts followed by a
firewall policy violation
○ If the system sees an event E1 where E1.eventType = portscan
○ followed by an event E2 where E2.srcip = E1.srcip and E2.dstip =
E1.dstip and E2.eventType = firewall.reject
○ then do something (Email, alert, etc.)
○ E1 is detected by IDS, E2 by a firewall that implicitly rejects
○ “Followed by” doesn’t mean follow immediately
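The pseudocode above can be sketched as a small stateful matcher. Assumptions: events are dictionaries with `time`, `eventType`, `srcip`, and `dstip` keys, "do something" is reduced to yielding the matched pair, and a five-minute window stands in for the age-out period. The addresses are made up.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed age-out period for remembered portscans

def correlate(events):
    """Yield (E1, E2) where a portscan is followed by a firewall.reject
    with the same source and destination IP within WINDOW."""
    pending = {}  # (srcip, dstip) -> most recent portscan event (the state)
    for ev in sorted(events, key=lambda e: e["time"]):
        key = (ev["srcip"], ev["dstip"])
        if ev["eventType"] == "portscan":
            pending[key] = ev
        elif ev["eventType"] == "firewall.reject" and key in pending:
            e1 = pending[key]
            if ev["time"] - e1["time"] <= WINDOW:
                yield e1, ev
            else:
                del pending[key]  # the remembered portscan has aged out

t0 = datetime(2024, 3, 1, 12, 0, 0)
events = [
    {"time": t0, "eventType": "portscan", "srcip": "10.0.0.9", "dstip": "10.0.0.1"},
    {"time": t0 + timedelta(minutes=1), "eventType": "login.success", "srcip": "10.0.0.7", "dstip": "10.0.0.1"},
    {"time": t0 + timedelta(minutes=2), "eventType": "firewall.reject", "srcip": "10.0.0.9", "dstip": "10.0.0.1"},
    {"time": t0 + timedelta(minutes=3), "eventType": "firewall.reject", "srcip": "10.0.0.8", "dstip": "10.0.0.1"},
]
matches = list(correlate(events))
print(len(matches), "correlated pair(s)")
```

The unrelated login between E1 and E2 does not break the match, which reflects that "followed by" need not mean "immediately followed by".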
Rule Correlation
○ Functionalities required for rule correlation
○ Stateful behaviour
○ Counting
○ Timeout: e.g. default age-out period of five mins
○ Rule reuse: e.g. reuse components of conditional
statements
○ Priorities: dictate the order of rules to be performed
○ Language for specifying rules, e.g. XML, Lisp
○ Action: e.g. write to text files, create help desk tickets
Micro-Level Correlation
○ Correlate fields within a single event or set of events
○ Source IP
○ Destination IP
○ Time
○ …
○ Match fields between events, across time periods, across devices
○ E.g. If a single host fails to log in to three separate servers using the
same credentials, within a 6-second time window, raise an alert.
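The failed-login example can be sketched with a sliding time window. Assumptions: the input contains only failed-login events as `(time_in_seconds, source_host, username, server)` tuples, and "same credentials" is approximated by the same username. `failed_login_alerts` and the sample data are hypothetical.

```python
from collections import defaultdict, deque

def failed_login_alerts(events, window=6.0, min_servers=3):
    """Alert when one source host fails to log in to at least `min_servers`
    distinct servers with the same username within `window` seconds."""
    recent = defaultdict(deque)  # (source_host, username) -> deque of (time, server)
    alerts = []
    for t, src, user, server in sorted(events):
        q = recent[(src, user)]
        q.append((t, server))
        while q and t - q[0][0] > window:
            q.popleft()  # drop failures that fell out of the window
        if len({s for _, s in q}) >= min_servers:
            alerts.append((src, user, t))
            q.clear()  # start counting afresh after raising an alert
    return alerts

events = [
    (0.0, "10.0.0.9", "admin", "srvA"),
    (2.0, "10.0.0.9", "admin", "srvB"),
    (4.5, "10.0.0.9", "admin", "srvC"),
    (1.0, "10.0.0.7", "bob", "srvA"),
    (30.0, "10.0.0.7", "bob", "srvB"),
]
print(failed_login_alerts(events))
```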
Macro-level Correlation
○ Pull in other sources of information, fusion correlation
○ E.g. compare vulnerability scan data with event data
○ Make reference to the Contextual data
○ E.g. user role on a particular system
○ Pull user information from an LDAP server or Active Directory server.
○ Contextual data can be input to rule correlation
Correlation Patterns
Micro-Level Macro-Level
Source IP correlation
○ Sort a chunk of network connection data by the source IP
○ Lets the analyst visualize what the system is up to
Source IP correlation - case study
○ BlackIce log (host IDS)
○ Arrival Time: Apr 4, 2000 20:49:31.0479
○ Version: 4
○ Header length: 20 bytes
○ Total length: 60
○ Identification: 0x5434
○ Source: 131.183.39.83 (131.183.39.83)
○ Destination: MY.NET.70.234 (MY.NET.70.234)
○ Transmission Control Protocol, Src port: 3611 (3611), Dst port: 53 (53)
○ BlackIce has detected source host 131.183.39.83 probing the system on destination port TCP 53 (DNS)
Source IP correlation - case study (cont’d)
○ Snort records on 131.183.39.83
○ 04/04-20:42:57.484472 131.183.39.83:1641 -> MY.NET.1.0:53
○ 04/04-20:42:57.485577 131.183.39.83:1647 -> MY.NET.1.6:53
○ 04/04-20:42:57.485655 MY.NET.1.3:53 -> 131.183.39.83:1644
○ … (lots of records deleted)
○ 04/04-21:02:43.801043 131.183.39.83:2890 -> MY.NET.254.169:53
○ 04/04-21:02:44.795187 131.183.39.83:2924 -> MY.NET.254.203:53
○ 04/04-21:02:44.796316 131.183.39.83:2926 -> MY.NET.254.205:53
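Grouping records like these by source IP can be sketched as below. The regular expression is an assumption fitted to the four sample lines reused here (including the sanitized `MY.NET` addresses), not a general Snort parser.

```python
import re
from collections import defaultdict

RECORD = re.compile(
    r"^(?P<time>\S+) (?P<src>[\w.]+):(?P<sport>\d+) -> (?P<dst>[\w.]+):(?P<dport>\d+)$"
)

records = """\
04/04-20:42:57.484472 131.183.39.83:1641 -> MY.NET.1.0:53
04/04-20:42:57.485577 131.183.39.83:1647 -> MY.NET.1.6:53
04/04-20:42:57.485655 MY.NET.1.3:53 -> 131.183.39.83:1644
04/04-21:02:43.801043 131.183.39.83:2890 -> MY.NET.254.169:53
""".splitlines()

by_source = defaultdict(list)  # source IP -> list of (time, destination, port)
for line in records:
    m = RECORD.match(line)
    if m:
        by_source[m["src"]].append((m["time"], m["dst"], int(m["dport"])))

# Busiest sources first, with the number of distinct targets each one touched.
for src, hits in sorted(by_source.items(), key=lambda kv: -len(kv[1])):
    targets = {dst for _, dst, _ in hits}
    print(src, len(hits), "records,", len(targets), "distinct targets")
```

One source contacting many distinct destinations on the same port is the pattern that makes the scan stand out.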
Destination IP correlation
○ Target based analysis
○ Sort by destination IP to locate systems that have become servers
Time correlation
○ Sorting by the time field and using the source IP as the second sort key
can merge files from more than one source to examine the network
activity that spans multiple sites
○ Sorting event fields in various ways helps us find the relations that might
otherwise remain hidden
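Merging files from two sites and sorting by time, then source IP, can be sketched as follows. Assumptions: events are already normalized to dictionaries, timestamps are ISO 8601 UTC strings so that string order equals time order, and the sample events are invented. Source IPs are compared as plain strings here, which is lexical rather than numeric ordering.

```python
from operator import itemgetter

site_a = [
    {"time": "2024-03-01T10:00:05Z", "srcip": "10.0.0.9", "site": "A"},
    {"time": "2024-03-01T10:00:01Z", "srcip": "10.0.0.7", "site": "A"},
]
site_b = [
    {"time": "2024-03-01T10:00:05Z", "srcip": "10.0.0.3", "site": "B"},
    {"time": "2024-03-01T10:00:03Z", "srcip": "10.0.0.9", "site": "B"},
]

# Primary sort key: time. Secondary sort key: source IP.
merged = sorted(site_a + site_b, key=itemgetter("time", "srcip"))
for ev in merged:
    print(ev["time"], ev["srcip"], ev["site"])
```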
Port correlation
○ Of interest when an IDS detects activity targeting your Web server
○ Look for events having 80 or 443 as the destination port
Anti-port correlation
○ Use open port information along with firewall data to detect attacks in
the slow or low category.
○ Nmap can be used to track open ports on your systems
○ Pseudocode
○ if event E1.dstport is not in (known_open_ports on E1.dstip)
○ then doSomething
○ It helps to detect worms
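The anti-port check can be sketched in a few lines. Assumptions: the open-port inventory is a dictionary built elsewhere (for example from periodic scans of your own hosts), and `KNOWN_OPEN_PORTS`, `is_anti_port`, and the addresses are hypothetical.

```python
# Hypothetical inventory: destination IP -> set of ports known to be open.
KNOWN_OPEN_PORTS = {"10.0.0.1": {22, 80, 443}, "10.0.0.2": {25}}

def is_anti_port(event, inventory=KNOWN_OPEN_PORTS):
    """True when the event targets a port not known to be open on the destination."""
    return event["dstport"] not in inventory.get(event["dstip"], set())

events = [
    {"dstip": "10.0.0.1", "dstport": 443},   # expected service
    {"dstip": "10.0.0.1", "dstport": 8080},  # nothing should be listening here
    {"dstip": "10.0.0.2", "dstport": 25},
]
suspicious = [e for e in events if is_anti_port(e)]
print(suspicious)
```

Treating a host that is missing from the inventory as having no open ports is a design choice in this sketch; it flags traffic to unknown hosts rather than ignoring it.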
Geographic location correlation
○ Websites provide tools that you can use to query IP addresses and networks, especially for contact information.
○ Plotting attacks on a map based on this information helps to track down evildoers.
Geographic location correlation – case study
The Apache access log records each access, which can be used to pinpoint and identify:
○ Location
○ Time
○ Type of request of client
○ Type of response of server
E.g.,
58.214.19.53 - - [21/Aug/2005:04:31:13 -0400] "GET / HTTP/1.1" 403 3931 "-"
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
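Pulling those fields out of the sample access-log entry can be sketched with a regular expression for the combined log layout. The pattern is an assumption fitted to this one line. Mapping the client IP to a location would need an external lookup service or database, which is not shown.

```python
import re
from datetime import datetime

COMBINED = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# The sample entry from the slide, joined back into a single line.
line = ('58.214.19.53 - - [21/Aug/2005:04:31:13 -0400] "GET / HTTP/1.1" 403 3931 "-" '
        '"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"')

m = COMBINED.match(line)
when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
record = {
    "client": m["client"],       # starting point for a location lookup
    "time": when,                # time of the request, with UTC offset
    "request": m["request"],     # type of request from the client
    "status": int(m["status"]),  # type of response from the server
}
print(record)
```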
Vulnerability correlation
○ Vulnerability scanners provide information on vulnerable hosts
○ Hostname or IP address, vulnerable service or port: e.g. Sendmail port
(25)
○ Remediation steps, e.g. Patch version of Sendmail
○ Combine vulnerability scan data with real-time event data
○ IDS reports that a range of ports is being scanned across several hosts
○ Verify whether the ports are active and vulnerable
Vulnerability correlation
• Reduce noise by reporting based upon high value systems or asset
weights
• Add context of target operating system
• Add knowledge of vulnerabilities
• Rules
• Target Vulnerable to Detected Exploit
• Vulnerable to Detected Exploit on Different Port
• Vulnerable to Different Exploit than Detected on Attacked Port
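The three rules above can be sketched as a classifier over IDS alerts. Assumptions: scan results are held as a dictionary of host, port, and vulnerability names, alerts carry `dstip`, `dstport`, and `exploit` keys, and all names and values (`SCAN`, `classify`, the vulnerability labels) are hypothetical.

```python
# Hypothetical scan results: host -> {port: set of vulnerability names}.
SCAN = {"10.0.0.5": {25: {"sendmail-overflow"}, 80: {"iis-remote-cmd"}}}

def classify(alert, scan=SCAN):
    """Label an IDS alert according to what the scanner knows about the target."""
    ports = scan.get(alert["dstip"], {})
    if alert["exploit"] in ports.get(alert["dstport"], set()):
        return "target vulnerable to detected exploit"
    if any(alert["exploit"] in vulns for vulns in ports.values()):
        return "vulnerable to detected exploit on different port"
    if ports.get(alert["dstport"]):
        return "vulnerable to different exploit than detected on attacked port"
    return "no known vulnerability"

alert = {"dstip": "10.0.0.5", "dstport": 80, "exploit": "iis-remote-cmd"}
print(classify(alert))
```

Checking the rules from most to least specific means an alert receives the strongest label that applies.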
• AlienVault Vulnerability Scanner detected the “IIS remote command execution” vulnerability on the server
Lecture 3 Summary
Raw logs → Filtering & Normalization → Correlation
○ Data filtering
○ Irrelevant data fields
○ Duplicated data entries, could be from different sources
○ Redundant data that is heavily dependent and can be derived from other data,
e.g., collinearity between data, DoB and Age
Lecture 3 Summary
Correlation Patterns
Micro-Level Macro-Level