Advanced Netflow Analysis
Presentation by ANML
June 2004
Indiana University

About the Presenter
• Mark Meiss
• Academic Background:
  – B.S. Mathematics, B.S. Computer Science
  – Ph.D. student in Department of Computer Science
• Research interests:
  – Structural analysis of network traffic data
  – High-performance file transfer protocols
  – Autonomous information retrieval agents

About the Presenter
• Professional Experience:
  – Over 10 years in software development
  – With IU IT Services since 1997
  – Worked with Bloomington NOC
  – First employee of ANML
  – Developed Animated Traffic Map, Router Proxy, Tsunami file transfer protocol, etc.

Introduction
• The ANML has a number of new and ongoing projects involving netflow data analysis
  – Statistical analysis
  – Visualization
  – Structural analysis

Overview
1. Fastcount [David Ripley]
  – Analysis of unique hosts on Abilene network
2. gCube [Greg Travis]
  – 3D visualization of real-time netflow data
3. FlowRank [Mark Meiss and Ed Balas]
  – Structural analysis of connectivity data
  – Behavioral identification of applications

Fastcount

Motivation
• Abilene netflow data must be anonymized before it is written to disk
  – Lower 13 bits are masked out
  – This makes deriving a count of active hosts on the network difficult
• Unanonymized data can be used in-memory
  – Persistently running application
  – Clever memory management

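To make the counting problem concrete, here is a minimal sketch of what masking the lower 13 bits does to IPv4 addresses. It assumes "masked out" means the bits are set to zero; the helper name `anonymize` and the sample addresses are hypothetical and are not taken from the Abilene tooling.

```python
import ipaddress

MASKED_BITS = 13  # from the slide: the lower 13 bits are masked out


def anonymize(addr: str, masked_bits: int = MASKED_BITS) -> str:
    """Zero the low-order bits of an IPv4 address (hypothetical helper)."""
    value = int(ipaddress.IPv4Address(addr))
    mask = (0xFFFFFFFF << masked_bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & mask))


# Two distinct hosts inside the same 2**13-address block collapse to one value,
# so unique hosts can no longer be counted from the data on disk.
a = anonymize("192.0.2.17")
b = anonymize("192.0.3.200")
print(a, b)
```

Both sample addresses map to the same anonymized value, which is why the host count has to be derived in memory before anonymization.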
Complication
• Just because we see an IP address in a flow doesn’t mean it’s an active host
  – Many worms spray packets with false sources
  – Many worms also scan ranges of destination addresses
• Solution: Make sure that we see traffic going to and coming from a host
  – Sampling problems will diminish over time

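The "going to and coming from" rule can be sketched as a set intersection. This is only an illustration of the idea, not Fastcount's implementation; the `(src, dst)` record shape and the function name `active_hosts` are assumptions.

```python
def active_hosts(flows):
    """Return hosts seen as both a source and a destination.

    `flows` is an iterable of (src, dst) pairs; this record shape is an
    assumption made for illustration.
    """
    sources, destinations = set(), set()
    for src, dst in flows:
        sources.add(src)
        destinations.add(dst)
    return sources & destinations


flows = [
    ("10.0.0.1", "10.0.0.2"),     # request
    ("10.0.0.2", "10.0.0.1"),     # reply
    ("203.0.113.9", "10.0.0.1"),  # spoofed source: never receives traffic
    ("10.0.0.1", "10.0.0.77"),    # scanned address: never sends traffic
]
hosts = active_hosts(flows)
print(sorted(hosts))
```

A persistently running counter would keep both sets in memory and let the intersection grow as sampled flows accumulate, which is how sampling gaps diminish over time.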
gCube

Overview
• Inspiration
  – “Spinning Cube of Potential Doom” visualization on display at Supercomputing 2003
  – http://www.nersc.gov/nusers/security/TheSpinningCube.php
• Basic idea
  – 3D plot with source IP, destination IP, and port as axes
  – Show network activity in real time
  – Use netflow data instead of IDS feed

Why Use 3D?
• Many anomalies can be quickly characterized at a glance
  – Port scans are vertical lines
  – Source and destination scans are horizontal lines
  – Activity in private, multicast, and unused address space is easy to see
• Can also see change over time
  – “Internet cloud chamber”

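A rough sketch of the three-axis mapping shows why a port scan appears as a vertical line. The axis assignment (x for source IP, y for destination IP, z for port as the vertical axis) and the scaling to a unit cube are assumptions; the slides do not give gCube's actual layout.

```python
import ipaddress


def flow_to_point(src: str, dst: str, port: int):
    """Map a flow to (x, y, z) in the unit cube (hypothetical layout)."""
    x = int(ipaddress.IPv4Address(src)) / 0xFFFFFFFF
    y = int(ipaddress.IPv4Address(dst)) / 0xFFFFFFFF
    z = port / 0xFFFF  # port on the vertical axis
    return (x, y, z)


# A port scan: one source probing many ports on one destination.
scan = [flow_to_point("198.51.100.7", "192.0.2.10", p) for p in range(1, 1025)]
xs = {point[0] for point in scan}
ys = {point[1] for point in scan}
zs = {point[2] for point in scan}
print(len(xs), len(ys), len(zs))
```

Every point of the scan shares the same x and y and differs only in z, so the points stack into a vertical line. Holding the port fixed and sweeping the destination address would instead give a horizontal line.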
FlowRank

PageRank
• PageRank is a Web page ranking system invented by Brin and Page of Google
  – Attempts to measure importance of a Web page
  – Pages gain rank by being pointed to by many pages and by being pointed to by pages with high rank
  – Calculated offline using an iterative algorithm
  – Examines only the connections in the Web, not the content of the pages

Technical Details of PageRank
• A given set of Web pages creates an implied directed graph of connections
  – The graph has an edge from page A to page B if page A links to page B
• This graph can be represented as a matrix
  – If entry (i, j) is non-zero, page i links to page j
  – Sparse representation is necessary
    • Google’s matrix has over 1,000,000,000,000,000,000 entries

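One simple way to store only the non-zero entries is an adjacency map from each page to the set of pages it links to. This is a toy sketch; the page names and the function `build_link_graph` are made up for illustration.

```python
def build_link_graph(links):
    """Store only the non-zero entries: page -> set of pages it links to."""
    graph = {}
    for src, dst in links:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # pages with no outbound links still appear
    return graph


links = [("A", "B"), ("A", "C"), ("B", "C")]
graph = build_link_graph(links)

stored = sum(len(targets) for targets in graph.values())  # entries actually kept
dense = len(graph) ** 2                                    # entries in the full matrix
print(stored, dense)
```

Storage grows with the number of links rather than with the square of the number of pages, which is what makes a matrix of that size workable.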
Technical Details of PageRank
• Problem with “dangling links”
  – These are links to pages that contain no links of their own
  – These pages absorb PageRank without distributing it to other pages
• Solution is to say that a page without outbound links actually links to every page with equal probability

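The dangling-page fix can be sketched as a rule for building one row of the random-walk matrix. The adjacency-map input and the helper name `transition_row` are assumptions for illustration.

```python
def transition_row(page, graph):
    """Probability of stepping from `page` to each page in `graph`.

    `graph` maps each page to the set of pages it links to (an assumed
    representation). A page without outbound links is treated as linking
    to every page with equal probability.
    """
    pages = sorted(graph)
    targets = graph[page] or set(pages)
    share = 1.0 / len(targets)
    return {p: (share if p in targets else 0.0) for p in pages}


graph = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
row_a = transition_row("A", graph)  # ordinary page: splits evenly over its links
row_c = transition_row("C", graph)  # dangling page: spreads evenly over all pages
print(row_a)
print(row_c)
```

With this rule every row sums to 1, so no rank is absorbed by a page that has nowhere to send it.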
Calculating PageRank
• We can think of the connectivity matrix as defining a Markov model that generates a random list of Web pages
  – In other words, we can use the matrix to make a random walk of the Web
• The PageRank vector is the principal (first) eigenvector of the connectivity matrix
  – In other words, each entry is the probability that we’re at that page during our random walk

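The random-walk view suggests a simple way to approximate the principal eigenvector: start from a uniform distribution and apply the walk repeatedly (power iteration). The sketch below is a toy version under stated assumptions: it uses no damping factor, since the slides do not mention one, and it relies on the example graph being strongly connected and aperiodic so that the iteration settles.

```python
def pagerank(graph, iterations=100):
    """Power iteration on the random walk implied by `graph`.

    `graph` maps each page to the set of pages it links to. Dangling
    pages spread their rank uniformly over all pages.
    """
    pages = sorted(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for page in pages:
            targets = graph[page] or pages
            share = rank[page] / len(targets)
            for target in targets:
                nxt[target] += share
        rank = nxt
    return rank


graph = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
rank = pagerank(graph)
print(rank)
```

For this three-page graph the walk settles near A = 2/11, B = 3/11, C = 6/11: page C, which is pointed to by both other pages, ends up with the highest rank.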
Vulnerability of PageRank
• PageRank was first published in 1998
• Since then, it has been shown to be vulnerable to “clique attacks”
  – Unsavory Web site owner buys 75 domains
  – Home page on each domain points to each of the other domains
  – All of the domains thus rise in PageRank score
• Google blacklists Web sites for this

FlowRank
• Netflow records also create an implied connectivity matrix
  – We can create an edge from host A to host B if host A transmits data to host B
• The vulnerability to a clique attack becomes a detector of peer-to-peer applications and social networks!

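Building the implied connectivity graph from flow records might look like the sketch below. The `(src, dst, byte_count)` tuple is an assumed, simplified record shape rather than the netflow export format, and `connectivity_graph` is a hypothetical name.

```python
def connectivity_graph(flows):
    """Build host -> set of hosts it transmitted data to.

    `flows` is an iterable of (src, dst, byte_count) tuples; the record
    layout is an assumption, not the netflow wire format.
    """
    graph = {}
    for src, dst, byte_count in flows:
        graph.setdefault(src, set())
        graph.setdefault(dst, set())  # receive-only hosts are still nodes
        if byte_count > 0:
            graph[src].add(dst)       # edge A -> B: A transmitted data to B
    return graph


flows = [
    ("h1", "h2", 1200),
    ("h2", "h1", 300),
    ("h1", "h3", 40),
]
graph = connectivity_graph(flows)
print(graph)
```

The result has the same shape as a Web link graph, so a PageRank-style iteration can be run on it unchanged.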
Weighted PageRank
• The volume of data in a flow is an important characteristic of the traffic
  – We modify the basic PageRank algorithm by weighting all entries based on traffic volume
  – This new algorithm still converges, but the final values have a significantly different distribution

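One plausible reading of "weighting all entries based on traffic volume" is to split each host's rank among its neighbours in proportion to the bytes sent to each. The slides do not spell out the exact weighting, so the normalization below is an assumption, as are the input shape `weights[src][dst]` and the sample byte counts.

```python
def weighted_rank(weights, iterations=100):
    """Power iteration where each edge carries a traffic-volume weight.

    `weights[src][dst]` is the number of bytes sent from src to dst
    (an assumed input shape). Each host's rank is divided among its
    neighbours in proportion to those byte counts.
    """
    hosts = sorted(set(weights) | {d for out in weights.values() for d in out})
    rank = {h: 1.0 / len(hosts) for h in hosts}
    for _ in range(iterations):
        nxt = {h: 0.0 for h in hosts}
        for h in hosts:
            # A host that sent nothing is treated as sending equally to all.
            out = weights.get(h) or {x: 1 for x in hosts}
            total = sum(out.values())
            for dst, volume in out.items():
                nxt[dst] += rank[h] * volume / total
        rank = nxt
    return rank


weights = {
    "h1": {"h2": 9000, "h3": 1000},
    "h2": {"h1": 500, "h3": 500},
    "h3": {"h1": 500, "h2": 500},
}
rank = weighted_rank(weights)
print(rank)
```

Without weights, all three hosts in this fully connected example would score 1/3. With the byte counts applied, rank shifts toward `h2`, the host that receives most of `h1`'s traffic, which illustrates how the weighted values can follow a different distribution.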
[Figure: Weighted PageRank]

So What’s It Good For?
• These are potential applications; this research is just starting
  – Automatic detection of peer-to-peer applications or bot networks
  – Heuristic for node importance in visualization tools
  – Heuristic for ordering importance of IDS anomalies

Rethinking the Edges
• In theory, every TCP connection between host A and host B involves two flows
  – One from host A to host B
  – One from host B to host A
• Due to sampling, we often catch only one of the two
  – This interferes with the operation of FlowRank

Rethinking the Edges
• When we see a flow from host A to host B, why should the edge go from A to B and not from B to A?
  – We can try to identify which host is the client (initiator of the connection) and which is the server (receiver of the connection)
  – We can make a good guess at this by studying the relative frequency of the ports used

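A minimal sketch of the port-frequency guess: count how often each port number appears across all flows, then treat the endpoint using the more common port as the server. The `(src, src_port, dst, dst_port)` record shape, the tie-breaking rule, and the name `guess_roles` are assumptions; the slides do not describe the actual heuristic in detail.

```python
from collections import Counter


def guess_roles(flows):
    """Orient each flow as a (client, server) edge using port frequency.

    `flows` holds (src, src_port, dst, dst_port) tuples, an assumed
    shape. The port seen more often across all flows is taken to be the
    service port, so the host using it is labelled the server. Ties keep
    the original flow direction.
    """
    flows = list(flows)
    freq = Counter()
    for _, sport, _, dport in flows:
        freq[sport] += 1
        freq[dport] += 1
    edges = []
    for src, sport, dst, dport in flows:
        if freq[sport] > freq[dport]:
            edges.append((dst, src))  # source port is the service port
        else:
            edges.append((src, dst))
    return edges


flows = [
    ("c1", 40211, "web", 80),
    ("c2", 51872, "web", 80),
    ("web", 80, "c3", 33490),  # only the reply direction was sampled
]
edges = guess_roles(flows)
print(edges)
```

All three edges come out pointing from client to server, including the one where sampling caught only the server's reply.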
Rethinking the Edges
• This client/server classification seems to greatly increase the utility of the connectivity graph
• Examining the connectivity graph over time can give us an idea of the type of application that runs on a TCP port

[Figure: TCP Port 80 (httpd)]

[Figure: TCP Port 25 (smtp)]

[Figure: TCP Ports 6881-6889 (bittorrent)]

Conclusion
• Netflow data is good for more than basic anomaly detection and bin-totalling
• Useful for interactive visualization of real-time network events
• Structural analysis may be useful for analyzing the intent of network traffic without examining individual packets
