Internet Traffic Classification using Machine Learning
Presenter: Klara Nahrstedt
CS598KN, September 19, 2017
Paper Authors: T. Nguyen, G. Armitage
IEEE Communications Surveys & Tutorials, Vol. 10, No. 4, 2008

Outline
• Motivation and Problem Description
• ML Techniques
• Application of ML in IP Traffic Classification
• Review of ML-based IP Traffic Classification Techniques
• Summary

Motivation
• Real-time traffic classification has the potential to solve difficult network management problems
• Network managers need to know traffic characteristics
• Traffic classification can be useful in
  • QoS provisioning
    • Real-time traffic classification is core to QoS-enabled services and automated QoS architectures
    • RT traffic classification allows operators to respond to network congestion problems
  • Automated intrusion detection systems
    • Detect patterns indicative of denial-of-service attacks
    • Trigger automated re-allocation of network resources
    • Identify customer use of network resources that contradicts the operator's terms of service
  • Clarification of ISP obligations with respect to "lawful interception" of IP traffic

Problem Description
• IP traffic is often defined as a set of flows with 5-tuple parameters
  • Protocol type
  • Source address: port
  • Destination address: port
• One can use simple classification to identify applications that use 'well-known' TCP or UDP port numbers (e.g., web traffic on port 80) and use this classification to regulate traffic
• Problem: many apps use unpredictable port numbers
  • We need more sophisticated classification techniques to infer app type
• Problem: deep packet inspection techniques are not effective, because these techniques assume
  • Assumption 1: 3rd parties unaffiliated with either source or recipient are able to inspect each IP packet payload
  • Assumption 2: Classifier knows the syntax of each app's packet payload
• Why didn't IntServ or DiffServ work for QoS provisioning?
• Problems with QoS signaling
• Problems with service pricing mechanisms

Challenges
• Violation of 1st assumption
  • Customers use encryption to obfuscate packet contents (including TCP and UDP port numbers)
  • Governments impose privacy regulations constraining the ability of 3rd parties to lawfully inspect payloads at all
• Violation of 2nd assumption
  • Commercial devices need repeated updates to stay ahead of regular changes in every app's packet payload format
  • This causes heavy operational load

Goal and Approach
• Goal:
  • We need new approaches to recognize application-level usage patterns without deep packet inspection
• Approach:
  • Recognize statistical patterns in externally observable attributes of the traffic
    • Examples: packet length, inter-packet arrival time
  • Cluster IP traffic into groups that have similar traffic patterns
  • Classify one or more apps of interest
  • Apply Machine Learning techniques to IP traffic classification

Approach – Applying ML
• Step 1: features are defined by which future unknown IP traffic may be identified and differentiated
  • Features are attributes of flows calculated over multiple packets
  • Feature examples:
    • max and min of packet length in each direction
    • flow duration
    • inter-packet arrival time
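As a concrete illustration of Step 1, the feature examples above can be computed from packet timestamps and sizes alone. A minimal sketch (the flow record and numbers below are invented for illustration, not from the paper):

```python
# Sketch: per-flow features from (timestamp, length) pairs.
# The packet list at the bottom is made up for illustration.

def flow_features(packets):
    """packets: time-ordered list of (timestamp_sec, length_bytes)."""
    lengths = [length for _, length in packets]
    times = [t for t, _ in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "min_len": min(lengths),                  # min packet length
        "max_len": max(lengths),                  # max packet length
        "duration": times[-1] - times[0],         # flow duration
        "mean_iat": sum(gaps) / len(gaps) if gaps else 0.0,  # inter-packet arrival
    }

pkts = [(0.00, 60), (0.02, 1500), (0.05, 1500), (0.09, 60)]
print(flow_features(pkts))
```

A real classifier would compute such features per direction of a bi-directional flow; this sketch treats the flow as one packet stream for brevity.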
• Step 2: ML classifier is trained to associate sets of features with known traffic classes (creating rules)
• Step 3: ML algorithm is applied to classify unknown traffic using previously learned rules

Traffic Classification Metrics
• Classification metrics quantify how accurately the technique or model makes decisions when presented with previously unseen data
• Assumption: we have traffic class X
• Goal: the traffic classifier is used to decide whether packets (or flows) belong to class X when presented with a mixture of previously unseen traffic
  • Input: mixed traffic of packets or flows
  • Output: does a flow (packet) belong to class X or not
• Metrics characterizing a classifier's accuracy
  • False Negative
  • False Positive
  • True Negative
  • True Positive
• Other classifier evaluation metrics in the ML literature:
  • Recall – % of members of class X correctly classified as belonging to class X
  • Precision – % of those instances classified as class X that truly are members of class X

Traffic Classification Metrics
"In pattern recognition, information retrieval and binary classification,
• precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while
• recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.
Both precision and recall are therefore based on an understanding and measure of relevance." (Wikipedia)

Limitations of Packet Inspection for Traffic Classification
• Traditional IP traffic classification uses
  • Packet's TCP or UDP port numbers (port-based classification)
  • Reconstruction of protocol signatures in its payload (payload-based classification)
• Port-based classification limitations
  • Some apps may not have ports registered with IANA (e.g., the Napster and Kazaa P2P apps)
  • Apps may use ports other than their well-known ports to avoid OS access control restrictions (e.g., non-privileged users may be forced to run HTTP servers on ports other than port 80)
  • Server ports may be dynamically allocated (e.g., the RealVideo streamer dynamically negotiates the server port used for data transfer)
  • IP layer encryption may obfuscate the TCP/UDP headers

Limitations of Packet Inspection
• Payload-based IP traffic classification limitations
  • Payload-based inspection avoids reliance on fixed port numbers, but
  • Imposes significant complexity and processing load on traffic identification devices
  • Must be kept up-to-date with extensive knowledge of application protocol semantics
  • Must be powerful enough to perform concurrent analysis of a potentially large number of flows

Background of Machine Learning
• Input of ML process
  • Data instances / datasets
  • Each instance is characterized by the values of its features (attributes or discriminators)
• Output of ML process
  • Description of knowledge (depends on the particular ML approach)
• Types of learning
  • Classification (supervised learning)
  • Clustering (unsupervised learning)
  • Associations – learning associations between features
  • Numeric prediction – the predicted outcome is not a discrete class but a numeric quantity
• Classification and Clustering are used for network traffic classification

Background of Machine Learning (2)
• Supervised Learning
  • Modeling input/output relations
  • Identifying a mapping from input features to output class
  • Learned knowledge is represented as a flowchart, decision tree, or classification rules and used to classify a new unseen instance
• Two phases
  • Training phase – construct the classification model
  • Testing phase – use the classification model to classify new unseen instances
• Classification algorithms
  • Differ mainly in how the classification model is constructed and what optimization algorithm is used to search for a good model
  • Examples of classification algorithms: Decision tree, Naïve Bayes techniques

Background of Machine Learning (3)
• Clustering
  • Does not receive guidance (class labels)
  • Discovers natural clusters (groups) in data
  • Finds patterns in input data
  • Clusters instances with similar properties (e.g., distance measuring approach)
• Basic clustering methods
  • Classic k-means algorithm
    • Forms clusters in numeric domains, partitioning instances into disjoint clusters
  • Incremental clustering
    • Generates a hierarchical grouping of instances
  • Probability-based clustering method
    • Assigns instances to classes probabilistically, not deterministically

K-means clustering
"k-means clustering aims to
partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells." (Wikipedia)

Background of Machine Learning (4)
• Evaluation of supervised learning algorithms
  • Optimize recall and precision
  • Problem: often there is a tradeoff between them, and the app context decides which is more important
  • Consider tradeoff tools
    • The receiver operating characteristic (ROC) curve provides a way to visualize tradeoffs between TP and FP
  • Consider an important issue: the cost of trading between Recall and Precision
  • Challenge: datasets for training and testing (should be different, but this is difficult)

Background of ML (5)
• Possible solution: consider the "holdout" method
  • Set aside part (2/3) of the pre-labeled dataset for training and 1/3 for testing
• Possible solution: if only a small dataset is available, consider the N-fold cross-validation method
  • The set is first split into N approximately equal partitions (folds)
  • Each partition (1/N) is used for testing while the rest ((N-1)/N) is used for training
  • The procedure repeats N times so that every instance is used exactly once for testing
  • Recall and Precision are calculated as the average of the recalls and precisions measured during all N tests
• Possible solution: if partitioning into N subsets does not guarantee equal representation of any given class, consider the stratification method
  • Randomly sample the dataset in such a way that each class is equally represented in both training and testing

Background of Machine Learning (6)
• Evaluation of unsupervised learning
  • Answer questions
    • How many clusters are hidden in the data
    • What is the optimal number of clusters
    • Whether the resulting clusters are meaningful or just an artifact of the algorithms
    • How easy the clusters are to use
    • How fast the algorithm is to employ
    • What is the intra-cluster quality
    • How good is the inter-cluster separation
    • What is the cost of labeling clusters
    • What are the requirements in terms of computation and storage
• Three approaches to investigate cluster validity
  • External criteria approach – based on prior information about the data
  • Internal criteria approach – based on examining the internal structure inherited from the dataset
  • Relative criteria approach – based on finding the best clustering scheme that a clustering algorithm can define under certain assumptions and parameters

Background of Machine Learning (7)
• Feature selection algorithms
  • Feature selection process = identification of the smallest necessary set of features required to achieve one's accuracy goal
  • Selection of features is crucial – irrelevant or redundant features often negatively impact the accuracy of ML algorithms
• Classification of feature selection algorithms
  • Filter methods
    • Make an independent assessment based on general characteristics of the data
    • Rely on a certain metric to rate and select the best subset before learning commences
    • Are not biased towards any ML algorithm
  • Wrapper methods
    • Evaluate
performance of different subsets using the ML algorithm that will ultimately be employed for learning
    • Are biased towards the ML algorithm used
• Example
  • Correlation-based Feature Selection (CFS) filter technique with Greedy, Best-First or Genetic search

Application of ML to Traffic Classification (1)
• Definitions:
  • Uni-directional flow (packets going in one direction, defined by the five-tuple: source and destination IP addresses, ports, and protocol number)
  • Bi-directional flow (pair of uni-directional flows going in opposite directions between the same source and destination IP addresses and ports)
  • Full flow (bi-directional flow captured over its entire lifetime)
• Class: IP traffic caused by an application or group of apps
• Instances: multiple packets belonging to the same flow
• Features: numerical attributes calculated over multiple packets belonging to individual flows
  • Mean packet lengths
  • Standard deviation of inter-packet arrival times
  • Total flow length
  • Fourier transform of packet inter-arrival time

Application of ML (2)
Training a supervised ML traffic classifier:
- Traces are collected from online game traffic and other interfering apps (HTTP, DNS, SSH)
- Flow processing – calculate statistical properties of these flows
- Data sampling – narrow the search space

Training and testing for a two-class supervised ML traffic classifier:
- Feature filtering – limit the number of features actually used in training of the ML classifier (cross-validation,…)

Application of ML (3)
Data flow within operational supervised ML traffic classifier
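The operational data flow above ends in a classify step. The sketch below shows a hypothetical version of that step using Gaussian Naïve Bayes (one of the algorithms the survey discusses) over two flow features, then computes the recall and precision metrics defined earlier for one class. All feature values, class names, and numbers are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: Gaussian Naive Bayes over two flow features
# (mean packet length, mean inter-arrival time), then recall/precision
# for the "bulk" class. Training and test flows are invented.
import math

def gaussian_stats(values):
    m = sum(values) / len(values)
    s = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return m, max(s, 1e-6)            # floor std to avoid division by zero

def train_nb(flows):
    """flows: list of (feature_vector, label) -> per-class Gaussian stats."""
    by_class = {}
    for vec, label in flows:
        by_class.setdefault(label, []).append(vec)
    return {label: [gaussian_stats(col) for col in zip(*vecs)]
            for label, vecs in by_class.items()}

def log_pdf(x, m, s):
    return -math.log(s * math.sqrt(2 * math.pi)) - (x - m) ** 2 / (2 * s * s)

def classify(model, vec):
    # pick the class maximizing the summed per-feature log-likelihood
    return max(model, key=lambda label: sum(
        log_pdf(x, m, s) for x, (m, s) in zip(vec, model[label])))

train = [((1400, 0.01), "bulk"), ((1480, 0.02), "bulk"),
         ((80, 0.20), "interactive"), ((120, 0.30), "interactive")]
model = train_nb(train)

test = [((1300, 0.015), "bulk"), ((90, 0.25), "interactive"),
        ((1450, 0.012), "bulk")]
preds = [classify(model, vec) for vec, _ in test]
tp = sum(1 for p, (_, y) in zip(preds, test) if p == "bulk" and y == "bulk")
fp = sum(1 for p, (_, y) in zip(preds, test) if p == "bulk" and y != "bulk")
fn = sum(1 for p, (_, y) in zip(preds, test) if p != "bulk" and y == "bulk")
print("recall:", tp / (tp + fn), "precision:", tp / (tp + fp))
```

Note the independence assumption: each feature's likelihood is computed separately and the log-likelihoods are summed, which is what makes Naïve Bayes cheap enough for per-flow classification.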
Clustering Approaches
• Flow clustering using Expectation Maximization (EM)
  • EM clusters traffic with similar observable properties into different app types
  • HTTP, FTP, SMTP, IMAP, DNS, NTP traffic studied
  • Group traffic flows into a small number of clusters
  • Create classification rules from the clusters
  • From these rules remove features that have no large impact
  • Repeat this process
• The EM algorithm groups traffic into a number of classes based on traffic type (bulk transfer, small transactions, multiple transactions)
• Results are limited in identifying individual apps of interest
• Other approaches in paper (read – interesting)

EM Algorithm
• "The expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables." (Wikipedia)
• Expectation (E) step
  • creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters
• Maximization (M) step
  • computes parameters maximizing the expected log-likelihood found in the E step
• These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
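The E/M iteration described above can be sketched in one dimension. The code below fits two Gaussian components to made-up packet lengths; real flow clustering, as in the EM work above, operates on many features per flow, so this only illustrates the shape of the iteration.

```python
# Sketch: EM for a 2-component 1-D Gaussian mixture over packet
# lengths. Data and initialization choices are illustrative.
import math

def pdf(x, m, s):
    # Gaussian density
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iters=50):
    """Fit a 2-component 1-D Gaussian mixture; return the sorted means."""
    m1, m2 = min(xs), max(xs)                 # crude initialization
    s1 = s2 = (m2 - m1) / 4 or 1.0
    w = 0.5                                   # mixing weight of component 1
    for _ in range(iters):
        # E step: responsibility of component 1 for each observation
        r = [w * pdf(x, m1, s1) /
             (w * pdf(x, m1, s1) + (1 - w) * pdf(x, m2, s2)) for x in xs]
        # M step: re-estimate parameters from the soft assignments
        n1 = sum(r)
        n2 = len(xs) - n1
        m1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        m2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        s1 = max(math.sqrt(sum(ri * (x - m1) ** 2
                               for ri, x in zip(r, xs)) / n1), 1e-3)
        s2 = max(math.sqrt(sum((1 - ri) * (x - m2) ** 2
                               for ri, x in zip(r, xs)) / n2), 1e-3)
        w = n1 / len(xs)
    return sorted([m1, m2])

lens = [60, 64, 70, 1400, 1480, 1500]   # ACK-sized vs full-size packets
print(em_two_gaussians(lens))
```

The two recovered means land near the small-packet and full-size-packet groups, which is the sense in which EM "clusters traffic with similar observable properties" without any labels.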
Supervised Learning Approaches
• Real-time traffic classification using multiple sub-flows (Nguyen & Armitage, 2006)
  • Timely and continuous classification is important
  • RT classification uses the most recent N packets of a flow – a classification sliding window
  • Use of a small number of packets for classification
    • Ensures timeliness of classification and reduces the buffer space required to store packets for classification
    • Offers the potential to monitor a traffic flow during its lifetime in a timely manner within the constraints of physical resources
• Training ML classifiers on multiple sub-flow features
  • Extract two or more sub-flows (of N packets) from every flow that represents the class of traffic one wishes to identify in the future
  • Each sub-flow should be taken from places in the original flow having noticeably different statistical properties (start, middle of flow)
  • Train the ML classifier with a combination of these sub-flows rather than the original full flows
  • This optimization is demonstrated using the Naïve Bayes algorithm
• Other approaches – in paper (interesting)

Challenges
• Timely and continuous classification
  • Most work has evaluated the efficacy of different ML algorithms when applied to entire datasets of IP traffic, trained and tested over full flows
  • Some work has explored the performance of ML classifiers that utilize only the first few packets of a flow, but these cannot cope with missing a flow's initial packets
• Directional neutrality
  • Many works assume bi-directional flows and knowledge of the first packets
  • Getting the direction wrong will degrade classification accuracy
• Efficient use of memory and processors
  • There are tradeoffs between the classification performance of a classifier and its resource consumption
  • Using a large number of features improves accuracy, but requires a lot of resources
• Portability and robustness
  • Portability is not considered carefully in classification models
  • Few works evaluate the robustness of classification performance when packet loss, packet fragmentation, delay and jitter occur

Summary
• Good introduction to the use of ML for traffic classification
• Traffic classification is important for many purposes, and definitely in multimedia networking and cyber-physical systems
  • Besides QoS services, consider anomaly detection, proactive real-time network monitoring, management, routing traffic
• Important concepts for multimedia traffic
  • Real-time traffic classification
  • Continuous traffic classification
  • Feature selection that includes delay, jitter, bandwidth
  • Machine Learning algorithms with high accuracy for traffic classification
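The multiple sub-flow idea from the Supervised Learning Approaches slide, extracting N-packet windows from statistically different places in a flow, can be sketched as follows; the function name and the stand-in data are made up for illustration.

```python
# Sketch: extract two N-packet sub-flows (start and middle) from a
# full flow, as in the multiple sub-flow training idea above.

def sub_flows(packets, n):
    """Return (start, middle) windows of n packets each."""
    start = packets[:n]
    mid_at = max((len(packets) - n) // 2, 0)   # center the middle window
    middle = packets[mid_at:mid_at + n]
    return start, middle

flow = list(range(100))          # stand-in for a 100-packet full flow
head, mid = sub_flows(flow, 10)
print(head[0], mid[0])           # 0 45
```

A classifier trained on features of both windows, rather than on full flows, can then recognize a flow from any recent N-packet sliding window, which is what makes the approach usable mid-flow.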