HMM
Abstract
TCP has an embedded congestion control algorithm, by which senders limit the rate at which they send packets into the network based on the congestion they perceive. There are two main stages in this algorithm: slow start (SS), where the TCP sender increases its rate exponentially, and congestion avoidance (CA), where the TCP sender adjusts its rate in an additive-increase, multiplicative-decrease fashion. Recently, researchers have argued that most of the flows in the Internet, such as short web-page transfers, never get enough time to exit the slow start phase. In practice, this means that they never get enough time to ramp up to full capacity. The main goal of this work is to study the impact that slow start has on different flows in the Internet. To that end, we take a probabilistic approach based on an HMM to determine whether a flow has left slow start or not. After properly training the HMM on a training set, this enables us to test and study statistics on a large amount of data.
1. Introduction

The Internet is composed of a series of routers and links that have limited capacities. Millions of users are connected to the Internet, and it would not make sense to dimension resources to accommodate all users at the same time. As users join and leave the network, some resources are occupied and others are left free. These fluctuations are an inherent part of the Internet. When resource demands exceed capacity, congestion occurs. If all users always sent at their maximum rate, the buffers in the routers would overflow, which is what happened in the 1980s, when the Internet experienced its first congestion collapse. Consequently, researchers had to come up with a mechanism to dynamically allocate resources to the different users. Different approaches to congestion control have been considered: end-to-end congestion control, where only the transport layer is part of the congestion control mechanism (routers are left out), and network-assisted congestion control, where routers would give explicit feedback to sources as to when to slow down. It is not clear which of the two approaches is better, but since the latter results in additional overhead, end-to-end congestion control is the one widely deployed in the Internet.
congestion, and multiplicatively decreases its rate when it detects congestion. Each time an ACK is received, the congestion window is increased by MSS × MSS/cwnd. When a loss is detected by a triple duplicate ACK, the congestion window is cut in half. If a loss is detected by a timeout, the congestion window is set to 1 MSS and the connection re-enters slow start.
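These rules can be sketched as window-update functions (a simplified model with an assumed MSS of 1500 bytes, not a real TCP implementation):

```python
# Simplified sketch of TCP congestion-avoidance window updates (AIMD).
# MSS and the event model are illustrative assumptions, not a real TCP stack.

MSS = 1500  # maximum segment size in bytes (assumed)

def on_ack(cwnd):
    """Additive increase: each ACK grows cwnd by MSS * MSS / cwnd,
    so cwnd grows by roughly one MSS per round-trip time."""
    return cwnd + MSS * MSS / cwnd

def on_triple_dup_ack(cwnd):
    """Multiplicative decrease: a triple duplicate ACK halves cwnd."""
    return cwnd / 2

def on_timeout(cwnd):
    """A timeout resets cwnd to one MSS (the flow re-enters slow start)."""
    return MSS

# example: a window of 10 MSS grows by one tenth of an MSS per ACK
cwnd = on_ack(10 * MSS)
```

With a 10-MSS window, ten ACKs (one window's worth) grow cwnd by about one MSS in total, which is the usual one-MSS-per-RTT behaviour of congestion avoidance.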
1.2. Objectives
The main intention of this work is to study the impact that slow start has on flows across the Internet. Many researchers claim that most data transfers in the Internet never ramp up to full capacity, and that this results in underutilization of the links. As an alternative, they suggest, for example, that slow start should begin with a much bigger congestion window [2]. It would therefore be very interesting to know what percentage of flows never leave slow start, or what the average connection time or file size needed for a flow to leave slow start is. Equipped with the proper probabilistic tools, we will try to answer some of these questions.
Gaussians, which means that they will be fully specified by a mean and a covariance matrix. HMMs seem to be a very useful tool for our application of detecting whether a flow is in slow start or in congestion avoidance. In our problem, we observe sequences of data which depend on the state the flow is in. However, we do not have information about the current state; that is, for each packet that we observe, there is no specific field saying whether it was sent in SS or CA.
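As a sketch, a two-state HMM of this kind (states SS and CA, Gaussian emissions) generates data as follows; all parameter values here are illustrative, not the ones learned later:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden states: 0 = slow start (SS), 1 = congestion avoidance (CA).
pi = np.array([1.0, 0.0])          # flows always start in SS
A = np.array([[0.9, 0.1],          # transition probabilities (illustrative)
              [0.01, 0.99]])       # CA rarely returns to SS
means = np.array([0.5, 2.0])       # emission means per state (illustrative)
stds = np.array([0.2, 0.5])        # emission std devs per state

def sample_flow(T):
    """Sample a length-T sequence of (hidden states, observed throughputs)."""
    states, obs = [], []
    z = rng.choice(2, p=pi)
    for _ in range(T):
        states.append(int(z))
        obs.append(rng.normal(means[z], stds[z]))  # emission depends on state
        z = rng.choice(2, p=A[z])                  # Markov transition
    return states, obs

states, obs = sample_flow(20)
```

The learning problem below is the inverse of this sketch: only `obs` is available, and the state sequence must be inferred.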
2. Dataset
In order to perform this machine learning project, it was necessary to obtain a large number of packet traces from TCP flows in the Internet. One possible approach would have been to collect the data ourselves. However, the adopted solution was to download the 2011 Internet Traces from the CAIDA website [3]. The Cooperative Association for Internet Data Analysis (CAIDA) collects several different types of data at geographically and topologically diverse locations, and makes this data available to the research community while preserving the privacy of the individuals and organizations who donate data or network access. This means that, for example, the IP addresses in the traces are anonymized. However, this is done in such a way that if two IP addresses are equal in the original trace, they will also be equal in the anonymized trace. This preserves the most important characteristics and still allows researchers to employ machine learning techniques. Using datasets collected by specialists in the area seemed like a much more robust approach: seemingly minor methodological details can seriously influence or even invalidate any analysis that is subsequently performed on the data. The dataset used contains anonymized passive traffic traces from CAIDA's equinix-chicago and equinix-sanjose monitors on high-speed Internet backbone links. The Endace network cards used to record these traces provide timestamps with nanosecond precision; however, the anonymized traces are stored in pcap format with timestamps truncated to microseconds.
At first, a pre-processing of the data was performed. The different TCP flows were separated based on their stream number. For each packet, the throughput was computed as the packet size divided by the inter-arrival time:

Th = S / Δt,

where S is the packet size in bytes and Δt is the time elapsed since the previous packet of the same flow.
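As a sketch, assuming the per-packet throughput is the packet size divided by the inter-arrival time, this pre-processing step could look like the following (the function and data layout are illustrative, not the code actually used):

```python
def per_packet_throughput(packets):
    """packets: list of (timestamp_seconds, size_bytes) for one flow,
    sorted by time. Returns the throughput of each packet after the
    first, as size / inter-arrival time (bytes per second).
    The exact definition in the report is an assumption here."""
    th = []
    for (t_prev, _), (t_cur, size) in zip(packets, packets[1:]):
        dt = t_cur - t_prev
        if dt > 0:  # guard against timestamps truncated to the same value
            th.append(size / dt)
    return th

# example: three packets of 1500 bytes, 10 ms apart -> roughly 150 kB/s each
th = per_packet_throughput([(0.00, 1500), (0.01, 1500), (0.02, 1500)])
```

The `dt > 0` guard matters with this dataset, since the pcap timestamps are truncated to microseconds and back-to-back packets can appear simultaneous.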
3. Learning
The next step was to use a big training set to learn the parameters of the HMM. This can be done using maximum likelihood estimation, that is, determining the parameters that maximize the probability of observing the given data, p(X). Since we do not know the values of the hidden states, we have to perform unsupervised learning. Because it is not possible to obtain a closed-form solution in this case, expectation maximization (EM) has to be performed. The training starts with some initial parameters for the model (θ_old). In the E step, these parameters are used to find the posterior distributions of the latent variables, p(Z|X, θ_old). In the M step, we maximize the expectation of the logarithm of the complete-data likelihood while fixing p(Z|X, θ_old). This is done iteratively until some stopping criterion is met. The learning was performed by applying an HMM toolbox to a training set. However, we still needed to provide initial estimates for the prior of the first latent variable, for the transition matrix, and for the emission parameters. As suggested in [1], the emission parameters were initialized by fitting a mixture of Gaussians to the whole dataset. In this step we obviously lose the sequentiality of the data, but it gives us a good initialization for the means and covariances of the Gaussian distributions. The algorithm tried to find two mixture components, one corresponding to SS and the other to CA. Since a flow always starts in slow start, the prior of the SS state was set to 1 and the prior of CA to 0. The transition matrix was initialized based on the intuition that a flow very rarely returns to SS after shifting to CA. When in SS, it was assumed to be equally probable to stay in SS or to switch to CA.
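The initialization just described can be sketched concretely. Here the mixture-of-Gaussians fit is replaced by a crude median split (a stand-in for the toolbox actually used), and the small CA-to-SS probability `eps` is an illustrative value:

```python
import numpy as np

def init_hmm_params(X, eps=0.01):
    """Initial HMM parameters as described in the text.
    X: 1-D array of observations pooled over all flows (sequentiality
    deliberately ignored at this stage). eps is the assumed small
    probability of returning from CA to SS."""
    pi = np.array([1.0, 0.0])            # flows always start in SS
    A = np.array([[0.5, 0.5],            # SS: equally likely to stay or leave
                  [eps, 1.0 - eps]])     # CA: very rarely back to SS
    # crude two-component Gaussian fit: split the observations at the median
    # (a simple stand-in for the mixture-of-Gaussians initialization)
    lo, hi = X[X <= np.median(X)], X[X > np.median(X)]
    means = np.array([lo.mean(), hi.mean()])
    vars_ = np.array([lo.var(), hi.var()])
    return pi, A, means, vars_

# example with synthetic pooled observations from two regimes
X = np.concatenate([np.random.default_rng(0).normal(0.5, 0.2, 200),
                    np.random.default_rng(1).normal(2.0, 0.5, 200)])
pi, A, means, vars_ = init_hmm_params(X)
```

EM then refines these values; the initialization only needs to put the two emission components on opposite sides of the data.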
4. Decoding
The problem of decoding corresponds to finding the most probable values of the hidden states in a sequence. This allows us to estimate when a flow transitions from SS to CA. Note that this is a different problem from finding the most probable hidden state at each instant of time (a sequence of such states may not even form a possible path). Naively, we would have to evaluate exponentially many paths. However, the Viterbi algorithm allows us to compute the best path through a message-passing procedure in which we only need to keep track of K paths: at each time step, we keep only the best path leading to each of the states. The Viterbi algorithm was applied, using a Matlab toolbox, to a test set different from the training set used to fit the parameters. The most probable paths for the latent variables of the different flows were then stored in a cell array structure.
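A compact log-space version of this recursion, written here for a discrete-emission HMM with illustrative parameters (the actual work used a Matlab toolbox and Gaussian emissions), looks like:

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most probable hidden-state path.
    obs: sequence of observation indices; log_pi: (K,) initial log-probs;
    log_A: (K, K) transition log-probs; log_B: (K, M) emission log-probs."""
    K, T = len(log_pi), len(obs)
    delta = log_pi + log_B[:, obs[0]]   # best log-prob ending in each state
    psi = np.zeros((T, K), dtype=int)   # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: best path i -> j
        psi[t] = scores.argmax(axis=0)         # best predecessor of each state
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]               # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# toy example: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1
log = np.log
path = viterbi([0, 0, 1, 1, 1],
               log(np.array([0.9, 0.1])),
               log(np.array([[0.7, 0.3], [0.05, 0.95]])),
               log(np.array([[0.9, 0.1], [0.2, 0.8]])))
```

Only K running path scores are kept per time step, which is exactly the pruning the text describes.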
5. Results
In this section, the most important results of the work are presented. In the unsupervised training, the following parameters were learned:

prior = [1 0], mean = [1.0038 0.8899]

From a critical perspective, it can be said that the log-likelihood values obtained during training were very low. This is perhaps due to the fact that there is a lot of data, and thus the probability of observing so much data is almost equal to zero. One way to overcome this could be to use a MAP estimate with a larger prior covariance, so as to increase the probability of the observed data.
The first problem that we proposed to solve was to find how many flows of the test set ever leave slow start. By analysing all the path sequences, we can easily verify how many of them shifted to CA and how many remained in SS. We observed that 38.48% of the flows never left slow start. Furthermore, we only consider flows that have more than ten packets, so this value would be even bigger if we considered every flow. The second problem that we wanted to address was how many packets have to be sent, on average, before a flow enters congestion avoidance. For this, we analysed all the most probable path sequences and kept the index of the packet at which the most probable state starts being CA. We conclude that, on average, 14.83 packets have to be sent before a flow shifts to CA. For an MSS of 1500 bytes, this means that approximately 22 kB of data have to be sent before shifting to CA.

6. Conclusion

The focus of this project was more on applying the available machine learning tools in a proper way than on implementing the actual tools. In this context, it was possible to put into practice tools and algorithms learned in class, such as HMMs, maximum likelihood estimation, mixtures of Gaussians, k-means, and EM. The processing of the data and the feature selection turned out to be a little cumbersome and time-consuming, but in the end it was possible to use a very big dataset with a lot of information concerning real-world flows in the Internet. The probabilistic model developed allowed us to analyse a big test set of transfers, leading us to the conclusion that, in fact, many flows in the Internet never get past the slow start phase.
7. Future work
An interesting extension to HMMs is the switching linear dynamic system (SLDS). In an HMM, the latent variables are discrete. An extension is to consider continuous latent variables, which results in a linear dynamic system (LDS). The SLDS can be viewed as a combination of linear dynamic systems with a hidden Markov model, where the HMM allows us to switch stochastically between the different LDSs. An LDS would be very useful in modelling the TCP congestion control mechanism, since the evolution of the congestion window is in fact a linear dynamical system: the state (the congestion window) evolves throughout time, with each value generating a specific range of throughput observations.
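As a rough illustration of one such regime, the congestion-avoidance LDS could be sketched as a noisy linear recurrence on the window, with noisy emissions (all parameter values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
MSS = 1500.0  # assumed segment size in bytes

def simulate_ca_lds(w0, steps, state_noise=50.0, obs_noise=200.0):
    """Toy linear-dynamic-system view of congestion avoidance:
    per round-trip the window grows by about one MSS plus Gaussian
    state noise, and each window value emits a noisy observation.
    All parameters are illustrative, not fitted to data."""
    w = w0
    windows, emissions = [], []
    for _ in range(steps):
        w = w + MSS + rng.normal(0.0, state_noise)    # linear state evolution
        windows.append(w)
        emissions.append(w + rng.normal(0.0, obs_noise))  # noisy emission
    return windows, emissions

windows, emissions = simulate_ca_lds(10 * MSS, 5)
```

In an SLDS, a discrete HMM variable would select between recurrences like this one (CA) and an exponential-growth regime (SS), while inference recovers both the regime and the continuous window trajectory.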
References
[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 1st ed. 2006, corr. 2nd printing, Oct. 2006.
[2] N. Dukkipati, T. Refice, Y. Cheng, J. Chu, T. Herbert, A. Agarwal, A. Jain, and N. Sutin. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev., 40:26–33, June 2010.
[3] kc claffy, D. Andersen, and P. Hick. The CAIDA anonymized 2011 Internet traces.
[4] J. F. Kurose and K. W. Ross. Computer Networking: A Top-Down Approach. Addison-Wesley, USA, 5th edition, 2009.