Computer Networking Chap3


Chap. 3 Transport Layer
‰ Goal : study principles of providing comm services to app processes
and implementation issues in the Internet protocols, TCP and UDP
‰ Contents
z Relationship bw transport and net layers

{ extending net layer’s delivery service to a delivery service bw

two app-layer processes, by covering UDP


z Principles of reliable data transfer and TCP

z Principles of congestion control and TCP’s congestion control

3-1
Chap.3 Transport Layer
‰ Introduction and Transport-Layer Services
z Relationship Between Transport and Network Layers

z Overview of the Transport Layer in the Internet

‰ Multiplexing and Demultiplexing


‰ Connectionless Transport: UDP
‰ Principle of Reliable Data Transfer
‰ Connection-Oriented Transport: TCP
‰ Principles of Congestion Control
‰ TCP Congestion Control

3-2
Overview of Transport-layer
‰ provide logical comm bw app
processes running on diff hosts
‰ transport protocols run in end
systems
z sending side: converts msgs from
app process into transport-layer
pkts (segments in Internet term),
passes them to net layer
{ (possibly) break app msgs into

small chunks, and add headers


z receiving side: processes
segments from net layer, making
them available to app
‰ more than one transport protocol
available to apps
z Internet: TCP and UDP
3-3
Relationship bw Transport and Network layers
‰ transport layer provides logical comm bw processes, whereas net layer
provides logical comm bw hosts
‰ Household analogy
z kids in one household (A) write letters to kids in another household (B)

{ Ann in A and Bill in B collect/distribute mail from/to other kids

z analogies

{ letters in envelopes ~ app messages

{ kids ~ processes

{ houses ~ hosts

{ Ann and Bill ~ transport protocol

Š not involved in delivering mail bw mail centers


Š Susan and Harvey, substituting for Ann and Bill, may provide diff service
Š services (e.g., delay and bw guarantees) clearly constrained by the
service the postal service provides
Š certain service (e.g., reliable, secure) can be offered even when
postal service doesn’t offer the corresponding service
{ postal service ~ net layer protocol
3-4
Overview of Transport-layer in the Internet
‰ IP (Internet Protocol) provides best-effort delivery service
z makes “best-effort” to deliver segments, but no guarantees : no
guarantee on orderly delivery, integrity of data in segments
⇒ unreliable service
‰ User Datagram Protocol (UDP) : provides an unreliable
connectionless service, no-frills extension of IP service
z transport-layer multiplexing and demultiplexing : extend IP’s
host-to-host delivery to process-to-process delivery
z integrity checking by including error-detection fields in segment
header
‰ Transmission Control Protocol (TCP) : provides a reliable
connection-oriented service with several additional services to app
z reliable data transfer : correct and in-order delivery by using
{ flow control and error control (seq #, ack, timers)

z connection setup
z congestion control

3-5
Chap.3 Transport Layer
‰ Introduction and Transport-Layer Services
‰ Multiplexing and Demultiplexing
‰ Connectionless Transport: UDP
‰ Principle of Reliable Data Transfer
‰ Connection-Oriented Transport: TCP
‰ Principles of Congestion Control
‰ TCP Congestion Control

3-6
Multiplexing and Demultiplexing
‰ a process can have one or more sockets; each socket having a unique id
‰ multiplexing at sending host : Ann’s job in household analogy
z gathering data chunks at source host from diff sockets

z encapsulating each chunk with header info to create segments

z passing segments to net layer

‰ demultiplexing at receiving host : Bill’s job in household analogy


z delivering data in a seg to the correct socket

3-7
How Demultiplexing Works
‰ host receives IP datagrams
z each datagram has src and dst IP addrs

{ each datagram carries a transport-layer seg

z each seg has src and dst port #s

{ well-known port #s : reserved for well-known app protocols,

ranging 0 ~ 1023 : HTTP(80), FTP(21), SMTP(25) , DNS(53)


{ other #s : can be used for user apps

‰ IP addrs and port #s used to direct seg to appropriate socket

3-8
Connectionless Multiplexing and Demultiplexing
‰ creating UDP socket
DatagramSocket mySocket1 = new DatagramSocket();
{ transport layer automatically assigns to the socket a port # in the

range 1024~65535 not currently used by any other UDP socket


DatagramSocket mySocket2 = new DatagramSocket(19157);
{ app assigns a specific port # 19157 to the UDP socket

z typically, the port # in the client side is automatically assigned,


whereas the server side assigns a specific port #
‰ When a host receives UDP seg, it
checks dst port # in the seg and
directs the seg to the socket with
that port #
z UDP socket identified by 2-tuple :
(dst IP addr, dst port #)
{ IP datagrams with diff src IP

addrs and/or src port #s are


directed to the same socket
z src port addr is used as dst port
addr in return seg 3-9
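The 2-tuple lookup described above can be sketched as a table keyed by (dst IP, dst port). This is only an illustration of the demux rule; the class and method names are invented, not a real kernel or Java API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of UDP demultiplexing: a segment is directed to the socket
// bound to its (dst IP, dst port) 2-tuple, regardless of the source.
class UdpDemux {
    private final Map<String, String> sockets = new HashMap<>();

    // bind a socket id to a destination (IP, port) pair
    void bind(String dstIp, int dstPort, String socketId) {
        sockets.put(dstIp + ":" + dstPort, socketId);
    }

    // direct an arriving segment to the socket matching its dst (IP, port)
    String demux(String dstIp, int dstPort) {
        return sockets.get(dstIp + ":" + dstPort);
    }

    public static void main(String[] args) {
        UdpDemux host = new UdpDemux();
        host.bind("10.0.0.5", 53, "dnsSocket");
        // segments from different sources, same dst (IP, port) -> same socket
        System.out.println(host.demux("10.0.0.5", 53)); // dnsSocket
    }
}
```

Because the key ignores the source address, datagrams from diff src IP addrs and/or src port #s land in the same socket, exactly as the slide states.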
Connection-Oriented Mux/Demux (1)
‰ TCP socket identified by 4-tuple
(src IP addr, src port #, dst IP addr, dst port #)
‰ demultiplexing at receiving host
z 4-tuple used to direct seg to appropriate socket

z TCP segs with diff src IP addrs or src port #s are directed
to two diff sockets (except TCP segs carrying conn-establishment
requests)
‰ server host may support many simultaneous TCP sockets
z each socket identified by its own 4-tuple

3-10
Connection-Oriented Mux/Demux (2)

3-11
Connection-Oriented Mux/Demux : Threaded Server
‰ Today’s high-performance Web servers use only one process, but
create a new thread for each new client conn
z connection sockets may be attached to the same process

3-12
Chap.3 Transport Layer
‰ Introduction and Transport-Layer Services
‰ Multiplexing and Demultiplexing
‰ Connectionless Transport: UDP
z UDP Segment Structure

z UDP Checksum

‰ Principle of Reliable Data Transfer


‰ Connection-Oriented Transport: TCP
‰ Principles of Congestion Control
‰ TCP Congestion Control

3-13
User Datagram Protocol (UDP) [RFC 768]
‰ no-frills, bare bones transport protocol : adds nothing to IP but,
z multiplexing/demultiplexing : src and dst port #s

z (light) error checking

‰ features of UDP
z unreliable best-effort service : no guarantee on correct delivery

{ UDP segments may be lost and delivered out of order to app

z connectionless : no handshaking bw UDP sender and receiver

‰ Q: Isn’t TCP always preferable to UDP? A: No


z simple, but well suited to certain apps such as real-time apps

{ delay-sensitive, but tolerant of some data loss

z no conn establishment ⇒ no additional notable delay

z simple ⇒ no conn state, including send/receive buffers,


congestion-control parameters, seq and ack # parameters
z small pkt header overhead : 8 bytes compared to 20 bytes in TCP

3-14
Popular Internet Apps and Their Protocols

3-15
Controversy on UDP
‰ UDP lacks congestion control and reliable data transfer
‰ when many users start streaming high-bit-rate video, packets
overflow at routers, resulting in
z high loss rates for UDP packets

z decreased TCP sending rates

⇒ adaptive congestion control for all sources, including UDP

sources, is required, in particular for streaming multimedia apps
‰ build reliability directly into app (e.g., adds ack/rexmission)
z many of today’s proprietary streaming apps run over UDP, but
builds ack and rexmission into app in order to reduce pkt loss
z nontrivial, but can avoid xmission-rate constraint imposed by
TCP’s congestion control mechanism

3-16
UDP Segment Structure
‰ Source port #, dst port # : used for multiplexing/demultiplexing
‰ Length : length of UDP seg including header, in bytes
‰ Checksum : to detect errors (i.e., bits altered) on an end-end basis
z error source : noise in the links or while stored in a router

{ some link-layer protocol may not provide error checking

3-17
UDP Checksum Calculation (1) : Sender
‰ sum all 16-bit words in the segment, two words at a time, with
any overflow wrapped around
‰ take 1’s complement of the sum; the result is the checksum value
(ex) three 16-bit words
0110011001100000
0101010101010101
1000111100001100
z sum of first two words
  0110011001100000
+ 0101010101010101
= 1011101110110101
z adding third word
  1011101110110101
+ 1000111100001100
= 1 0100101011000001 : overflow bit 1 wrapped around
  0100101011000010
checksum value : 1011010100111101 (1’s complement)

3-18
UDP Checksum Calculation (2) : Receiver
‰ add all 16-bit words including checksum, and decide
z no error detected, if the result is 1111111111111111

z error detected, otherwise

{ nonetheless the check is not perfect : an error may actually

have occurred even when no error is detected


‰ UDP is not responsible for recovering from errors
z reaction to detecting errors depends on implementations

{ simply discard damaged seg, or

{ pass damaged seg to app with warning

3-19
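The wraparound arithmetic on the two slides above can be sketched in Java. This is an illustrative helper (class and method names are mine, not part of any UDP API), assuming the 16-bit words are passed as ints:

```java
// Sketch of the UDP checksum: one's-complement sum of 16-bit words
// with end-around carry, then complement; receiver re-sums and expects
// all ones when the segment is intact.
class UdpChecksum {
    // sum words with end-around carry (without complementing)
    static int onesSum(int[] words) {
        int sum = 0;
        for (int w : words) {
            sum += w & 0xFFFF;
            if ((sum & 0x10000) != 0)      // overflow: wrap the carry around
                sum = (sum & 0xFFFF) + 1;
        }
        return sum;
    }

    // checksum = one's complement of the one's sum
    static int checksum(int[] words) {
        return ~onesSum(words) & 0xFFFF;
    }

    public static void main(String[] args) {
        // the three 16-bit words from the slide's example
        int[] words = {0b0110011001100000, 0b0101010101010101, 0b1000111100001100};
        int cs = checksum(words);
        System.out.println(Integer.toBinaryString(cs)); // 1011010100111101

        // receiver side: sum of data words plus checksum is all ones if intact
        int[] withCs = {words[0], words[1], words[2], cs};
        System.out.println(onesSum(withCs) == 0xFFFF);  // true
    }
}
```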
Chap.3 Transport Layer
‰ Introduction and Transport-Layer Services
‰ Multiplexing and Demultiplexing
‰ Connectionless Transport: UDP
‰ Principle of Reliable Data Transfer
z Building a Reliable Data Transfer Protocol

z Pipelined Reliable Data Transfer Protocol

z Go-Back-N (GBN)

z Selective Repeat (SR)

‰ Connection-Oriented Transport: TCP


‰ Principles of Congestion Control
‰ TCP Congestion Control

3-20
Reliable Data Transfer : Service Model and Implementation
‰ reliable data transfer : no corruption, no loss, and in-order delivery
z of central importance to networking : not only at transport layer,
but also at link layer and app layer

rdt_send() : called from app
deliver_data() : called by rdt to deliver data to app
udt_send() : called by rdt to send pkt over unreliable channel
rdt_rcv() : called from channel upon pkt arrival
3-21
Reliable Data Transfer: Implementation Consideration
‰ characteristics of unreliable channel determines the complexity of
reliable data transfer protocol
‰ We will
z incrementally develop sender and receiver sides of rdt protocol,
considering increasingly complex model of underlying channel
z consider only unidirectional data transfer for simplicity

{ but control pkts are sent back and forth

z use finite state machines (FSM) to specify sender, receiver

dashed arrow : initial state
event causing state transition
actions taken on state transition
state : next state uniquely determined by event
Λ : no event or no action

3-22
rdt1.0 : Perfectly Reliable Channel
‰ Assumptions of underlying channel
z perfectly reliable : no bit errors, no loss of packets

‰ separate FSMs for sender and receiver


z sender sends data into underlying channel

z receiver reads data from underlying channel

3-23
rdt2.0 : Channel with Errors
‰ New assumptions of underlying channel
z may be corrupted when transmitted, propagated, or buffered

z no loss and in-order delivery

‰ Automatic Repeat reQuest (ARQ) protocols


z error detection : extra bits placed in checksum field

z receiver feedback : ACK/NAK pkt explicitly sent back to sender

{ ACK (positive acknowledgement) : when pkt received OK

{ NAK (negative acknowledgement) : when pkt received in error

z rexmission : sender rexmits pkt on receipt of NAK

3-24
rdt2.0 : not Corrupted

3-25
rdt2.0 : Corrupted

3-26
rdt2.0 : Fatal Flaw
Q: How to recover from errors in ACK or NAK pkts?
z minimally, need to add checksum bits to ACK/NAK pkts

z possible solutions
{ sender/receiver repeatedly requests repetition of a garbled ACK
or NAK : hard to find a way out
{ add enough checksum bits for correction : not applicable for
lost pkt
{ simply resend the pkt when receiving a garbled ACK or NAK ⇒
incurs possible duplicate at receiver
Š receiver doesn’t know whether it is a new pkt or a rexmission
(i.e., a duplicate pkt)
‰ handling duplicates : add a new field (seq # field) to the packet
z sender puts a seq # into this field, and receiver discards
duplicate pkt
z a 1-bit seq # suffices for a stop-and-wait protocol

‰ rdt2.0 is a stop-and-wait protocol : sender sends one pkt, then waits
for receiver response
3-27
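The duplicate-handling rule above can be sketched with a 1-bit seq # on the receiver side. Names are illustrative, not from the rdt FSMs:

```java
// Sketch of a stop-and-wait (rdt2.1-style) receiver: a 1-bit seq #
// distinguishes a new pkt from a rexmitted duplicate.
class StopAndWaitReceiver {
    int expectedSeq = 0;

    // returns true if the pkt is new (delivered to app),
    // false if it is a duplicate (discarded, ACK resent)
    boolean onReceive(int seq) {
        if (seq == expectedSeq) {
            expectedSeq = 1 - expectedSeq;  // flip 0 <-> 1
            return true;                    // deliver data, send ACK
        }
        return false;                       // duplicate: discard, resend ACK
    }

    public static void main(String[] args) {
        StopAndWaitReceiver r = new StopAndWaitReceiver();
        System.out.println(r.onReceive(0)); // true : new pkt
        System.out.println(r.onReceive(0)); // false: rexmission, a duplicate
        System.out.println(r.onReceive(1)); // true : next pkt
    }
}
```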
Description of sol 1 of Fatal Flaw of rdt2.0

(figure : the "dictation" analogy)
z A dictates something to B
z B replies “please repeat” or “ok”, but the reply arrives corrupted
z A didn’t understand, and asks “What did you say?”
z B has no idea whether “What did you say?” is part of the dictation
or a request for repetition of the last reply

3-28
rdt2.1 : Employing Seq # - Sender

3-29
rdt2.1 : Employing Seq # - Receiver

3-30
rdt2.1 : Discussion
‰ sender
z seq # added to pkt

z two seq #’s (0,1) will suffice

z must check if received ACK/NAK corrupted

z twice as many states

{ state must remember whether current pkt has seq # of 0 or 1

‰ receiver
z must check if received pkt is duplicate

{ state indicates whether 0 or 1 is expected pkt seq #

z receiver cannot know if its last ACK/NAK received OK at sender

3-31
rdt2.2 : NAK-free
‰ accomplish the same effect as a NAK, by sending an ACK for the
last correctly received pkt
z receiver must explicitly include seq # of pkt being ACKed

‰ sender that receives two ACKs for the same pkt (i.e., duplicate ACKs)
knows that receiver didn’t correctly receive the pkt following the pkt
being acked twice, thus rexmits that following pkt

3-32
rdt2.2 : NAK-free (Sender)

3-33
rdt2.2 : NAK-free (Receiver)

3-34
rdt3.0 : Channel with Errors and Loss
‰ new assumptions of underlying channels :
z can lose pkts (data or ACKs)

Q : how to detect pkt loss and what to do when pkt loss occurs
z checksum, seq #, ACKs, rexmissions are of help, but not enough

‰ approaches
z sender waits proper amount of time (at least round-trip delay +
processing time at receiver) to convince itself of pkt loss
z rexmits the pkt if ACK not received within this time

z if a pkt (or its ACK) just overly delayed, sender may rexmit the
pkt even though it has not been lost
{ but, seq # handles the possibility of duplicate pkts

‰ implementation
z countdown timer set appropriately starts each time pkt is sent

z rexmit pkt when the timer is expired

3-35
rdt3.0 : Channel with Errors & Loss (Sender)

3-36
rdt3.0 : Channel with Errors & Loss – Operation (1)

3-37
rdt3.0 : Channel with Errors & Loss – Operation (2)

3-38
Performance of rdt3.0 (Stop-and-Wait Protocol)

‰ assumption : ignore xmission time of ACK pkt (which is extremely small)
and processing time of pkt at the sender and receiver
‰ sender utilization Usender : fraction of time sender is busy sending into ch
ex) 1 Gbps link, 30 ms RTT, 1 KB packet
ttrans = L/R = 8,000 bits/packet / 10^9 bits/sec = 0.008 ms
Usender = ttrans / (RTT + ttrans) = 0.008 / (30 + 0.008) ≈ 0.00027
very poor!
z net protocol limits the capabilities provided by underlying net HW
3-39
Pipelining
‰ sends multiple pkts without waiting for acks
z range of seq #s is increased

z buffering at sender and/or receiver required

{ sender : pkts that have been xmitted but not yet acked

{ receiver : pkts correctly received

sender is assumed to send 3 pkts before being acked
Usender = 3⋅ttrans / (RTT + ttrans) = 0.024 / 30.008 ≈ 0.0008 : essentially tripled
‰ two generic forms of pipelined protocols: go-Back-N, selective repeat
3-40
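The two utilization figures above come from one expression, U = n⋅ttrans / (RTT + ttrans) for n pkts sent per RTT. A small sketch reproducing the slide's numbers (names are illustrative):

```java
// Sender-utilization arithmetic from the stop-and-wait and pipelining
// slides: U = (n * t_trans) / (RTT + t_trans), all times in ms.
class Utilization {
    static double usender(int n, double ttransMs, double rttMs) {
        return (n * ttransMs) / (rttMs + ttransMs);
    }

    public static void main(String[] args) {
        // 1 KB (8,000-bit) packet over a 1 Gbps link -> 0.008 ms
        double ttrans = 8000.0 / 1e9 * 1000;
        System.out.println(usender(1, ttrans, 30)); // stop-and-wait: ~0.00027
        System.out.println(usender(3, ttrans, 30)); // 3-pkt pipeline: ~0.0008
    }
}
```

Tripling n triples U while the denominator stays dominated by RTT, which is why pipelining pays off so directly here.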
Go-Back-N (GBN) Protocol
‰ sender’s view of seq #s in GBN

z window size N : # of pkts allowed to send without waiting for ACK


{ GBN often referred to as sliding window protocol

z pkt’s seq # : carried in a k-bit field in pkt header


{ range of seq # : [0, 2^k − 1] with modulo-2^k arithmetic

‰ events at GBN sender


z invocation from above : before sending, check if window isn’t full

z receipt of an ACK : cumulative ack - ack with seq # n indicates all


pkts with a seq up to and including n have been correctly received
z timeout : resend all pkts previously xmitted but not yet acked

‰ drawback of GBN : when window size and bw-delay product are large,
a single pkt error causes a large # of unnecessary rexmissions
3-41
Go-Back-N (GBN) Protocol : Sender

‰ a single timer : for the oldest xmitted but not yet acked pkt
‰ upon receipt of an ACK, if there are
z no outstanding unacked pkts, the timer is stopped
z still xmitted but not yet acked pkts, the timer is restarted
3-42
Go-Back-N (GBN) Protocol : Receiver
‰ when pkt with seq # n is received correctly and in-order, receiver
sends an ACK for pkt n and delivers data portion to upper layer
‰ receiver discards out-of-order pkts and resends an ACK for the
most recently received in-order pkt
z simple receiver buffering : needn’t buffer any out-of-order pkts
z only info needed : seq # of next in-order pkt, expectedseqnum

3-43
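The receiver rule above can be sketched with a single `expectedseqnum` variable. Names are illustrative; this is only the ACK-generation logic, not a full GBN implementation:

```java
// Sketch of the GBN receiver: ACK and deliver only in-order pkts,
// discard out-of-order pkts and re-ACK the last in-order seq #.
class GbnReceiver {
    int expectedSeqNum = 0;   // the only state the receiver needs

    // returns the ACK number sent for an arriving (uncorrupted) pkt
    int onReceive(int seqNum) {
        if (seqNum == expectedSeqNum) {
            // deliver data portion to upper layer, advance expected seq #
            expectedSeqNum++;
            return seqNum;            // ACK the pkt just delivered
        }
        return expectedSeqNum - 1;    // duplicate ACK for last in-order pkt
    }

    public static void main(String[] args) {
        GbnReceiver r = new GbnReceiver();
        System.out.println(r.onReceive(0)); // 0
        System.out.println(r.onReceive(2)); // 0 : pkt 1 missing, dup ACK
        System.out.println(r.onReceive(1)); // 1
    }
}
```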
Go-Back-N (GBN) Protocol : Operation

window size = 4

3-44
Selective Repeat (SR) Protocol
‰ sender rexmits only pkts for which ACK not received ⇒ avoid unnecessary
rexmission
‰ receiver individually acks correctly received pkts regardless of their order
z out-of-order pkts are buffered until missing pkts are received

3-45
SR Protocol : Sender/Receiver Events and Actions
‰ sender
z data from above : if next available seq # is in window, send pkt

z timeout(n) : resend pkt n, restart timer


{ each pkt has its own (logical) timer

z ACK(n) in [sendbase,sendbase+N]
{ mark pkt n as received

{ if n is equal to send_base, window base is moved forward to next

unacked pkt, and any unxmitted pkts now in the window are xmitted


‰ receiver
z pkt n in [rcvbase, rcvbase+N-1] correctly received : send ACK(n)
{ if not previously received, it is buffered

{ if n is equal to rcv_base, this pkt and previously buffered in-order pkts

are delivered to upper layer, and receive window moved forward by the
# of pkts delivered to upper layer
z pkt n in [rcvbase-N,rcvbase-1] correctly received
{ an ACK generated even though previously acked

{ if not acked, sender’s window may never move forward; for example, ack

for send_base pkt in Figure 3.23


z otherwise : ignore 3-46
SR Operation

3-47
Max. Window Size
‰ GBN protocol
z window size N ≤ 2^k − 1 (k : # of bits in seq field), not 2^k, why?
ex) k=2 ⇒ seq #s : 0, 1, 2, 3; max N = 3
‰ SR protocol
z scenarios
(a) : all acks are lost
Š sender rexmits, and receiver incorrectly accepts duplicate as new
(b) : all acks received correctly, but pkt 3
is lost
{ receiver can’t distinguish xmission of pkt
0 in (b) from rexmission of pkt 0 in (a)
z further consideration on scenario (a)
{ A rexmits pkt 0; B receives and buffers it
{ B sends piggybacked ack for pkt 2 that is
already acked but lost
{ A advances window to [3, 0, 1], and sends pkt 3
{ B receives pkt 3, and delivers pkt 0 (no
good!) in buffer and pkt 3 to upper layer
z wayout : avoid overlapping of SR windows
{ N ≤ 2^(k−1), k: # of bits in seq field 3-48
rdt : Comment on Packet Reordering
‰ since seq #s are reused, old copies of a pkt with a seq/ack # of x
can appear, even though neither sender’s nor receiver’s window
contains x
z use of max pkt lifetime : constrain pkt to live in the net

{ ~ 3 minutes in TCP for high-speed net

3-49
Summary of rdt Mechanisms

3-50
Chap.3 Transport Layer
‰ Introduction and Transport-Layer Services
‰ Multiplexing and Demultiplexing
‰ Connectionless Transport: UDP
‰ Principle of Reliable Data Transfer
‰ Connection-Oriented Transport: TCP
z TCP Connection

z TCP Segment Structure

z Round-Trip Time Estimation and Timeout

z Reliable Data Transfer

z Flow Control

z TCP Connection Management

‰ Principles of Congestion Control


‰ TCP Congestion Control
3-51
TCP Connection
‰ two processes establish a connection via 3-way handshake before sending
data, and initialize TCP variables
z full duplex : bi-directional flow bw processes in the same conn
z point-to-point : bw one sender and one receiver
{ multicasting is not possible with TCP

‰ a stream of data passes through a socket into send buffer


z TCP grabs chunks of data from send buffer
z max seg size (MSS) : max amount of app-layer data in seg
{ set based on Path MTU of link-layer

{ typically, 1,460 bytes, 536 bytes, or 512 bytes

z each side of conn has send buffer and receive buffer

3-52
TCP Segment Structure

• seq # and ack # : for reliable data xfer; counted in bytes, not pkts
• header length : 4-bit #, counted in 32-bit words
• receive window : for flow control; # of bytes receiver willing to receive
• checksum : for error detection
• options : typically empty
- time-stamping
- mss, window scaling factor negotiation, etc.

• ACK : indicates value in ack field is valid


• SYN, RST, FIN : used for connection setup and teardown
• PSH : receiver should pass data to upper layer immediately
• URG : indicates there is urgent data in the seg, marked by sending-side upper layer
- urgent data pointer indicates the last byte of urgent data
- generally, PSH and URG are not used 3-53
Seq Numbers and Ack Numbers
‰ seq # : byte-stream # of the 1st byte in seg; numbered over the xmitted
byte stream, not over the series of xmitted segs
z TCP implicitly numbers each byte in data stream

z initial seq # is chosen randomly rather than set to 0, why?


‰ ack # : seq # of next byte expected from other side
z cumulative ACK

Q : how to handle out-of-order segs at receiver? discard, or buffer
while waiting for missing bytes to fill in the gaps
z TCP leaves the decision up to implementation, but the latter is
chosen in practice
3-54
Telnet : Case Study of Seq and Ack Numbers
‰ each ch typed by A is echoed back by B and displayed on A’s screen

ACK piggybacked on B-to-A data seg

explicit ACK with no data

3-55
Estimating Round-Trip Time (RTT)
‰ clearly, TCP timeout value > RTT
Q : How much larger? How to estimate RTT? Each seg exploited in
estimating RTT? …
‰ estimating RTT
z SampleRTT : time measured from seg xmission until ACK receipt

{ measured not for every seg xmitted, but for one of xmitted segs

approximately once every RTT


{ rexmitted segs are not considered in measurements

{ fluctuates from seg to seg : any single measurement is atypical ⇒ needs some sort of avg

‰ Exponential Weighted Moving Average (EWMA) of RTT


z avg several recent measurements, not just current SampleRTT

EstimatedRTT = (1 - α)⋅EstimatedRTT + α⋅SampleRTT


{ recommended value of α : 0.125

z more weight on recent samples than on old samples


z weight of a given sampleRTT decays exponentially fast as updates

proceed
3-56
RTT Samples and RTT Estimates

variations in the Sample RTT are smoothed out in Estimated RTT

3-57
Retransmission Timeout Interval
‰ DevRTT, variation of RTT : an estimate of how much SampleRTT
deviates from EstimatedRTT
DevRTT = (1-β)⋅DevRTT + β⋅|SampleRTT−EstimatedRTT|
z large (or small) when there is a lot of (or little) fluctuation

z recommended value of β : 0.25

‰ TCP’s timeout interval


z should be larger than EstimatedRTT, or sender would unnecessarily rexmit!

z but, if too much larger, TCP wouldn’t quickly rexmit, leading to


large data transfer delay
z thus, timeout interval should be EstimatedRTT plus some safety
margin that varies as a function of fluctuation in SampleRTT
TimeoutInterval = EstimatedRTT + 4⋅DevRTT

3-58
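The EWMA and timeout formulas above can be collected into one small sketch with the recommended α = 0.125 and β = 0.25. Initializing from the first sample follows common practice (e.g., RFC 6298) and is an assumption here, as are the names:

```java
// Sketch of TCP's RTT estimation: DevRTT is updated before EstimatedRTT,
// and the timeout is EstimatedRTT plus a 4*DevRTT safety margin.
class RttEstimator {
    double estimatedRtt;   // ms
    double devRtt;         // ms
    boolean first = true;

    void update(double sampleRtt) {
        if (first) {                      // initialize from the first sample
            estimatedRtt = sampleRtt;
            devRtt = sampleRtt / 2;
            first = false;
        } else {
            // DevRTT = (1-beta)*DevRTT + beta*|SampleRTT - EstimatedRTT|
            devRtt = 0.75 * devRtt + 0.25 * Math.abs(sampleRtt - estimatedRtt);
            // EstimatedRTT = (1-alpha)*EstimatedRTT + alpha*SampleRTT
            estimatedRtt = 0.875 * estimatedRtt + 0.125 * sampleRtt;
        }
    }

    double timeoutInterval() {
        return estimatedRtt + 4 * devRtt;
    }

    public static void main(String[] args) {
        RttEstimator e = new RttEstimator();
        for (double s : new double[]{100, 120, 110, 300, 105})
            e.update(s);
        System.out.printf("EstimatedRTT=%.1f DevRTT=%.1f Timeout=%.1f%n",
                e.estimatedRtt, e.devRtt, e.timeoutInterval());
    }
}
```

Note how the single 300 ms outlier inflates DevRTT (and thus the margin) far more than it moves EstimatedRTT, which is exactly the intent of the 4⋅DevRTT term.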
TCP Reliable Data Transfer
‰ reliable data transfer service on top of IP’s unreliable service
z seq # : to identify lost and duplicate segs

z cumulative ack : positive ACK (i.e, NAK-free)

z timer

{ a single rexmission timer is recommended [RFC 2988], even if

there are multiple xmitted but not yet acked segs


{ rexmissions triggered by

Š when timed out


Š 3 duplicate acks at sender : fast rexmit in certain versions

‰ We’ll discuss TCP rdt in two incremental steps


z highly simplified description : only timeouts considered

z more subtle description : duplicate acks as well as timeouts


considered
in both cases, error and flow control are not taken into account

3-59
Simplified TCP Sender

seq # is byte-stream # of the first data byte in seg

TimeoutInterval = EstimatedRTT + 4⋅DevRTT

some not-yet-acked segs are acked


move window forward

3-60
TCP Retransmission Scenarios

(figure) three rexmission scenarios, SendBase advancing 100 → 120 in each:
z rexmission due to a lost ack
z seg 100 not rexmitted
z cumulative ack avoids rexmission of first seg

3-61
TCP Modifications : Doubling Timeout Interval
‰ at each timeout, TCP rexmits and sets next timeout interval to
twice the previous value
⇒ timeout intervals grow exponentially after each rexmission
‰ but, for the other events (i.e., data received from app and ACK
received) timeout interval is derived from most recent values of
EstimatedRTT and DevRTT

3-62
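The doubling rule above can be sketched in a few lines; names are illustrative, and the base value stands in for the EstimatedRTT + 4⋅DevRTT computation:

```java
// Sketch of TCP's exponential timeout backoff: double on each timeout,
// reset to the RTT-derived value when new data is sent or a fresh ACK arrives.
class TimeoutBackoff {
    double baseTimeoutMs;      // EstimatedRTT + 4*DevRTT (assumed given)
    double currentTimeoutMs;

    TimeoutBackoff(double base) { baseTimeoutMs = currentTimeoutMs = base; }

    void onTimeout()   { currentTimeoutMs *= 2; }            // rexmit, double
    void onAckOrSend() { currentTimeoutMs = baseTimeoutMs; } // back to derived value

    public static void main(String[] args) {
        TimeoutBackoff t = new TimeoutBackoff(200);
        t.onTimeout();
        t.onTimeout();
        System.out.println(t.currentTimeoutMs); // 800.0 : grows exponentially
        t.onAckOrSend();
        System.out.println(t.currentTimeoutMs); // 200.0 : reset on new ACK/data
    }
}
```

The doubling acts as a crude form of congestion control: repeated timeouts suggest congestion, so the sender backs off instead of rexmitting at the same pace.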
TCP ACK Gen Recommendation [RFC 1122, 2581]

‰ timeout period can be relatively long ⇒ may increase e-t-e delay


‰ when sending a large # of segs back to back (such as a large file), if
one seg is lost, there will likely be many back-to-back duplicate ACKs for it

3-63
TCP Modifications : TCP Fast Retransmit
‰ TCP Fast Retransmit : rexmits a (missing) seg before its timer
expiration, if TCP sender receives 3 duplicate ACKs

if (y > SendBase) {   // event: ACK received, with ACK field value of y
    SendBase = y
    if (there are currently not-yet-acked segs)
        start timer
}
else {                // a duplicate ACK for already ACKed seg
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3)
        resend seg with seq # y   // TCP fast retransmit
}

3-64
Is TCP Go-Back-N or Selective Repeat?
‰ similarity of TCP with Go-Back-N
z TCP : cumulative ack for the last correctly received, in-order seg

z correctly received but out-of-order segs are not
individually acked
⇒ TCP sender need only maintain SendBase and NextSeqNum
‰ differences bw TCP and Go-Back-N : many TCP implementations
z buffer correctly received but out-of-order segs rather than discard

z also, suppose a seq of segs 1, 2, … N, are received correctly in-order,


ACK(n), n < N, gets lost, and remaining N-1 acks arrive at sender before
their respective timeouts
{ TCP rexmits at most one seg, i.e., seg n, instead of pkts, n, n+1, …, N

{ TCP wouldn’t even rexmit seg n if ACK(n+1) arrived before timeout for

seg n
‰ a modification to TCP in [RFC 2018] : selective acknowledgement
z TCP receiver acks out-of-order segs selectively rather than cumulatively

z when combined with selective rexmission - skipping segs selectively


acked by receiver – TCP looks a lot like generic SR protocol
‰ Thus, TCP can be categorized as a hybrid of GBN and SR protocols
3-65
Flow Control : Goal
‰ receiving app may not read data from rcv buffer as quickly as it
should
z it may be busy with some other task

z it may be relatively slow at reading data, leading to overflow of

receiver’s buffer by too much data sent too quickly by sender
‰ flow control : a speed-matching service, matching sending rate
against reading rate of receiving app
z goal : eliminate possibility of sender overflowing receiver buffer

(note) to make the discussion simple, TCP receiver is assumed to


discard out-of-order segs

3-66
Flow Control : How It Works?
RcvBuffer : size of buffer space allocated to a conn
RcvWindow : amount of free buffer space at receiver’s buffer
initial value of RcvWindow = RcvBuffer

LastByteRcvd, LastByteRead : variables at receiver


LastByteSent, LastByteAcked : variables at sender

‰ at receiver
z not to overflow : LastByteRcvd – LastByteRead ≤ RcvBuffer
LastByteRcvd – LastByteRead : # of bytes received not yet read
z RcvWindow advertising : RcvWindow placed in receive window field in
every seg sent to sender
RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]
‰ at sender : limits unacked # of bytes to RcvWindow
z LastByteSent – LastByteAcked ≤ RcvWindow

LastByteSent – LastByteAcked : # of byte sent but not yet acked

3-67
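The two inequalities above reduce to simple window arithmetic. A sketch with the slide's variable names (the numeric values are made up for illustration):

```java
// Sketch of TCP flow control arithmetic:
//   receiver: RcvWindow = RcvBuffer - (LastByteRcvd - LastByteRead)
//   sender:   may send while LastByteSent - LastByteAcked <= RcvWindow
class FlowControl {
    static int rcvWindow(int rcvBuffer, int lastByteRcvd, int lastByteRead) {
        return rcvBuffer - (lastByteRcvd - lastByteRead);
    }

    static boolean senderMaySend(int lastByteSent, int lastByteAcked, int rcvWindow) {
        return lastByteSent - lastByteAcked <= rcvWindow;
    }

    public static void main(String[] args) {
        int rcvBuffer = 4096;
        // 3000 bytes received, app has read only 1000 -> 2000 still buffered
        int win = rcvWindow(rcvBuffer, 3000, 1000);
        System.out.println(win);                        // 2096
        // sender with 2096 unacked bytes in flight sits right at the limit
        System.out.println(senderMaySend(5096, 3000, win)); // true
        System.out.println(senderMaySend(5097, 3000, win)); // false: must wait
    }
}
```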
Flow Control : Avoiding Sender Blocking
‰ suppose A is sending to B, B’s rcv buffer becomes full so that
RcvWindow = 0, and after advertising RcvWindow = 0 to A, B has
nothing to send to A
z note that TCP at B sends a seg only if it has data or ack to send

{ there is no way for B to inform A of some space having opened

up in B’s rcv buffer ⇒ A is blocked, and can’t xmit any more!


z wayout : A continues to send segs with one data byte when
RcvWindow = 0, which will be acked
{ eventually, the buffer will begin to empty and ack will contain

a nonzero RcvWindow value

3-68
TCP Connection Management : Establishment
‰ 3-way handshake
1. client sends SYN seg to server
{ contains no app data
{ randomly selects client initial seq #
2. server replies with SYNACK seg
{ server allocates buffers and variables to the connection
{ contains no app data
{ randomly selects server initial seq #
3. client replies with ACK seg
{ client allocates buffers and variables to the connection
{ may contain data
3-69
TCP Connection Management : Termination

‰ Either client or server can end the
TCP connection
‰ duration of TIME_WAIT period :
implementation dependent
z typically, 30 secs, 1 min, 2 mins

‰ RST seg : seg with RST flag set to 1


z sent when receiving a TCP seg
whose dst port # or src IP addr
does not match any ongoing conn

3-70
TCP State Transition : Client
Socket clientSocket = new Socket("hostname","port#");

3-71
TCP State Transition : Server
ServerSocket welcomeSocket = new ServerSocket(port#)

Socket connectionSocket = welcomeSocket.accept();
3-72


Chap.3 Transport Layer
‰ Introduction and Transport-Layer Services
‰ Multiplexing and Demultiplexing
‰ Connectionless Transport: UDP
‰ Principle of Reliable Data Transfer
‰ Connection-Oriented Transport: TCP
‰ Principles of Congestion Control
z The Causes and the Costs of Congestion

z Approaches to Congestion Control

z Network-Assisted Congestion-Control Example for ATM ABR

‰ TCP Congestion Control

3-73
Preliminary of Congestion Control
‰ pkt loss (at least, perceived by sender) results from overflowing of
router buffers as the net becomes congested
z rexmission treats a symptom, but not the cause, of net
congestion
‰ cause of net congestion : too many sources attempting to send data
at too high a rate
z basic idea of wayout : throttle senders in face of net congestion

z what’s different from flow control?

‰ ranked high in top-10 list of networking problems

3-74
Causes and Costs of Congestion : Scenario 1
‰ assumptions
z no error control, flow control, and congestion control
z host A and B send data at an avg rate of λin bytes/sec, respectively
z share a router with outgoing link capacity of R and infinite buffer space
z ignore additional header info (transport-layer and lower-layer)

cost of congested net : avg delay


grows unboundedly large as arrival
rate nears link capacity
3-75
Causes and Costs of Congestion : Scenario 2 (1)
‰ assumptions
z one finite buffer space

z each host with same λin, retransmit dropped packets

3-76
Causes and Costs of Congestion : Scenario 2 (2)

case a case b case c


‰ case a (unrealistic) : host A can somehow determine if router
buffer is free, and send a pkt when buffer is free
z no loss, thus no rexmission ⇒ λ’in= λin

‰ case b : a pkt is known for certain to be dropped


z R/3 : original data, R/6 : rexmitted data

z cost of congested net : sender must rexmit dropped pkt

‰ case c : premature timeout for each pkt ⇒ rexmit each pkt twice
z cost of congested net : unneeded rexmissions waste link bw
3-77
Causes and Costs of Congestion : Scenario 3
‰ assumptions
z 4 routers, each with finite buffer space and link capacity of R
z each of 4 hosts has same λin, rexmits over 2-hop paths

• consider A→C conn


• a pkt dropped at R2 (due to high λin
from B) wastes the work done by R1

cost of congested net : a pkt
dropped at some point wastes the
xmission capacity used up to that point
3-78
Two Broad Approaches to Congestion Control
‰ end-end congestion control
z no explicit support (by feedback) from net layer

z congestion inferred by end-system based on observed net


behavior, e.g., pkt loss and delay
z approach taken by TCP

{ congestion is inferred by TCP seg loss indicated by timeout or

triple duplicate acks


‰ network-assisted congestion control
z routers provide explicit feedback to end systems regarding
congestion state in the net
z single bit indication

{ SNA, DECnet, TCP/IP ECN [RFC2481], ATM ABR congestion

control
z explicit rate : the rate router can support on its outgoing link

3-79
Two Types of Feedback of Congestion Info
‰ direct feedback : from a router to the sender by using choke pkt
‰ feedback via receiver
z router mark/update a field in a pkt flowing forward to indicate
congestion
z upon receipt of the pkt, receiver notifies sender of congestion

3-80
ATM ABR Congestion Control
‰ Asynchronous Transfer Mode (ATM)
z a virtual-circuit switching architecture

z info delivered in fixed size cell of 53 bytes

z each switch on src-to-dst path maintains per-VC state

‰ Available Bit Rate (ABR) : an elastic service


z if net underloaded, use as much bandwidth as available

z if net congested, sender rate is throttled to predetermined min


guaranteed rate
‰ Resource Management (RM) cells
z interspersed with data cells, conveying congestion-related info

{ rate of RM cell interspersion : tunable parameter

Š default value : one every 32 data cells


z provides both feedback-via-receiver and direct feedback

{ sent by src flowing thru switches to dst, and back to src

{ switch possibly generate RM cell itself, and send directly to src

3-81
Mechanisms of Congestion Indication in ATM ABR

‰ Explicit Forward Congestion Indication (EFCI) bit


z EFCI bit in a data cell is set to 1 at congested switch
z if a data cell preceding RM cell has EFCI set, dst sets CI bit of RM cell,
and sends it back to src
‰ CI (Congestion Indication) and NI (No Increase) bits
z set by congested switch, NI/CI bit for mild/severe congestion
z dst sends the RM cell back to src with CI and NI bits intact

‰ Explicit Rate (ER) : two-byte field in RM cell


z congested switch may lower ER value in a passing RM cell
z when returned to src, it contains the max supportable rate on the path
3-82
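The ER mechanism above can be sketched as a min-computation along the path: each switch may only lower the ER field, so the source learns the bottleneck rate. This is a minimal illustrative sketch; the function name and rate values are hypothetical, not part of the ATM specification.

```python
# Sketch: how the Explicit Rate (ER) field of an RM cell might be
# updated as the cell passes each switch on the src-to-dst path.
# A switch may lower ER to the rate it can support on its outgoing
# link; ER never increases, so the returned value is the path minimum.
def er_after_path(initial_er, supportable_rates):
    er = initial_er
    for rate in supportable_rates:  # one entry per switch traversed
        er = min(er, rate)
    return er

# src proposes 100 Mbps; switches can support 80, 45, and 60 Mbps
print(er_after_path(100, [80, 45, 60]))  # -> 45, the path bottleneck
```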
Chap.3 Transport Layer
‰ Introduction and Transport-Layer Services
‰ Multiplexing and Demultiplexing
‰ Connectionless Transport : UDP
‰ Principle of Reliable Data Transfer
‰ Connection-Oriented Transport : TCP
‰ Principles of Congestion Control
‰ TCP Congestion Control
z Fairness

z TCP Delay Modeling

3-83
Preliminary of TCP Congestion Control (1)
‰ basic idea of TCP congestion control : limit sending rate based on
the network congestion perceived by sender
z increase/reduce sending rate when sender perceives little/much
congestion along the path bw itself and dst
‰ to keep the description concrete, sending a large file is assumed
‰ How does sender limit sending rate?
LastByteSent - LastByteAcked ≤ min{CongWin, RcvWindow} (1)
z CongWin : a variable limiting sending rate due to perceived
congestion
z henceforth, RcvWindow constraint ignored in order to focus on
congestion control
z (1) limits the amount of unacked data, thus the sending rate
{ consider conn for which loss and xmission delay are negligible

then, sending rate ≈ CongWin / RTT
3-84
Preliminary of TCP Congestion Control (2)
‰ How does sender perceive congestion on path bw itself and dst?
z a timeout or the receipt of three duplicate ACKs

‰ TCP is self-clocking : acks are used to trigger increases in cong
window size, and thus in the sending rate
z consider an optimistic case of cong-free, in which acks are taken as
an indication that seg are successfully delivered to dst
z if acks arrive at a slow/high rate, cong window is increased more
slowly/quickly
‰ How to regulate sending rate as a function of perceived congestion?
z TCP congestion control algorithms, consisting of 3 components

{ additive-increase, multiplicative-decrease (AIMD)

Š AIMD is a big-picture description; details are more complicated


{ slow start

{ reaction to timeout events

3-85
Additive-Increase, Multiplicative-Decrease
‰ multiplicative decrease : cut CongWin in half down to 1 MSS when
detecting a loss
‰ additive increase: increase CongWin by 1 MSS every RTT until a loss
detected (i.e., when perceiving e-t-e path is congestion-free)
z commonly, accomplished by increasing CongWin by MSS⋅(MSS/CongWin)
bytes for each receipt of new ack
ex) MSS=1,460 bytes, CongWin=14,600 bytes ⇒ 10 segs sent within RTT
Š an ACK for a seg increases CongWin by 1/10⋅MSS, thus after ack for all 10
segs (thus, for one RTT) CongWin is increased by MSS
z congestion avoidance : linear increase phase of TCP cong control

saw-toothed pattern of CongWin

3-86
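The saw-toothed pattern can be sketched in a few lines. This is a simplified model, not actual TCP code: a loss is assumed whenever CongWin reaches a hypothetical path limit `w_max`, whereas real TCP infers loss from a timeout or triple duplicate acks.

```python
# Sketch of the AIMD saw-tooth (units: MSS, one entry per RTT "round").
def aimd_trace(w_max, rounds):
    cong_win, trace = 1, []
    for _ in range(rounds):
        trace.append(cong_win)
        if cong_win >= w_max:
            cong_win = max(1, cong_win // 2)  # loss: multiplicative decrease
        else:
            cong_win += 1                     # additive increase: +1 MSS per RTT
    return trace

print(aimd_trace(w_max=8, rounds=12))  # rises to 8, halves to 4, rises again
```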
TCP Slow Start
‰ When a TCP conn begins, CongWin is typically
initialized to 1 MSS ⇒ initial rate ≈ MSS/RTT
ex) MSS = 500 bytes, RTT = 200 msec ⇒ initial
sending rate : only about 20 kbps
z linear increase at init. phase results in a
waste of bw, considering available bw may be
>> MSS/RTT
z desirable to quickly ramp up to some
respectable rate
‰ slow start (SS) : during initial phase, increase
sending rate exponentially fast by doubling
CongWin every RTT until a loss occurs
z achieved by increasing CongWin by 1 MSS

for receipt of ack

3-87
Reaction to Congestion
Q: When does CongWin switch from exponential increase to linear increase?
A: when CongWin reaches Threshold
z Threshold : a variable set to a half of CongWin just before a loss
{ initially set large, typically 65 Kbytes, so that it has no initial effect

{ maintained until the next loss

‰ TCP Tahoe, early version of TCP


z CongWin is cut to 1 MSS both for a timeout and for 3 duplicate acks
{ Jacobson’s algorithm [Jacobson 1988]

‰ TCP Reno [RFC2581, Stevens ’94] : reaction to loss depends on loss type
z for 3 duplicate acks receipt : CongWin is cut in half, then grows linearly
z for a timeout event : CongWin is set to 1 MSS (SS phase), then grows
exponentially to a Threshold, then grows linearly (CA phase)
z idea : 3 dup acks indicate the net is still capable of delivering some segs
{ TCP Reno cancels SS phase after a triple duplicate ack : fast recovery

‰ many variations of TCP Reno [RFC 3782, RFC 2018]


z TCP Vegas [Brakmo 1995]
{ idea : early warning - detect congestion in routers before pkt loss occurs

{ when this imminent pkt loss, predicted by observing RTT, is detected,

CongWin is lowered linearly; the longer the RTT, the greater the congestion
3-88
TCP Congestion Control Algorithms

(figure: CongWin evolution per round; initial value of Threshold = 8 MSS,
triple duplicate acks just after 8th round)

3-89
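The Tahoe/Reno contrast on the preceding slides can be sketched round by round. This is a simplified model under the figure's scenario (Threshold starts at 8 MSS, triple duplicate acks just after round 8); it omits fast-recovery details such as window inflation during recovery.

```python
# Per-round CongWin evolution (units: MSS) for TCP Tahoe vs. Reno.
def next_round(cong_win, threshold):
    # slow start doubles (capped at Threshold); cong avoidance adds 1 MSS/RTT
    if cong_win < threshold:
        return min(cong_win * 2, threshold)
    return cong_win + 1

def trace(variant, rounds, loss_round, threshold=8):
    cong_win, out = 1, []
    for r in range(1, rounds + 1):
        out.append(cong_win)
        if r == loss_round:                  # triple duplicate acks arrive
            threshold = cong_win // 2        # Threshold := half of CongWin
            cong_win = 1 if variant == "tahoe" else threshold
        else:
            cong_win = next_round(cong_win, threshold)
    return out

print(trace("tahoe", 12, loss_round=8))  # restarts slow start from 1 MSS
print(trace("reno", 12, loss_round=8))   # resumes linearly from Threshold
```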
TCP Reno Congestion Control Algorithm
‰ [RFC 2581, Stevens 1994]

3-90
Steady-State Behavior of a TCP Connection
‰ Consider a highly simplified macroscopic model for steady-state
behavior of TCP
z SS phases ignored since they are typically very short

z Letting W be the window size when a loss event occurs, RTT and
W are assumed to be approximately constant during a conn
Q : What’s avg throughput of a long-lived TCP conn as a function of
window size and RTT?
A : avg throughput of a TCP connection = 0.75⋅W / RTT   (2)
z a pkt is dropped when the rate increases to W/RTT

z then the rate is cut in half and linearly increases by MSS/RTT


every RTT until it again reaches W/RTT
z this process repeats over and over again

3-91
TCP Futures
‰ TCP congestion control has evolved over the years and continues to evolve
z [RFC 2581] : a summary as of the late 1990s

z [Floyd 2001] : some recent developments

z traditional scheme is not necessarily good for today’s HTTP-dominated
Internet or for future Internet services with high bandwidth-delay products
ex) Consider a high-speed TCP conn with 1500-byte segments, 100ms RTT, and
want to achieve 10 Gbps throughput through this conn
z to meet this, from (2) required window size is

W = (RTT/0.75)⋅tput = (0.1 sec/0.75) ⋅ 10^10 bits/sec ⋅ 1/(1,500 × 8 bits/seg)
  = 10^7/90 ≈ 111,111 segs
z this is a lot of segs, so that there is high possibility of errors, leading us
to derive a relationship bw throughput and error rate [prob. P39]
avg throughput of a TCP conn = 1.22 ⋅ MSS / (RTT ⋅ √L)
⇒ L = 2⋅10^-10, i.e., one loss for every 5⋅10^9 segs : unattainably low!
⇒ new vers of TCP required for high-speed environments [RFC 3649, Jin 2004]
3-92
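The slide's arithmetic can be reproduced directly by inverting the two throughput formulas; this is a sanity check of the stated numbers, not a new result.

```python
# Invert (2), tput = 0.75*W/RTT, to get the required window in segments.
def required_window_segs(tput_bps, rtt_sec, seg_bytes):
    return tput_bps * rtt_sec / (0.75 * seg_bytes * 8)

# Invert tput = 1.22*MSS/(RTT*sqrt(L)) to get the required loss rate L.
def required_loss_rate(tput_bps, rtt_sec, seg_bytes):
    mss_bits = seg_bytes * 8
    return (1.22 * mss_bits / (tput_bps * rtt_sec)) ** 2

print(round(required_window_segs(10e9, 0.1, 1500)))  # ≈ 111,111 segs
print(required_loss_rate(10e9, 0.1, 1500))           # ≈ 2e-10
```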
TCP Fairness (1)
‰ suppose K TCP conns pass though a bottleneck link bw of R, with each conn
sending a large file
⇒ avg xmission rate of each conn is approximately R/K
‰ TCP congestion control is fair : each conn gets an equal share of
bottleneck link’s bw among competing TCP conns
‰ consider a link of R shared by two TCP conn, with idealized assumptions
z same MSS and RTT, sending a large amount of data, operating in CA
mode (AIMD) at all times, i.e., ignore SS phase

3-93
TCP Fairness (2)

‰ bw realized by two conns fluctuates
along equal bw share line, regardless
of their initial rates
‰ in practice, RTT value differs from
conn to conn
z conns with a smaller RTT grab the
available bw more quickly (i.e., open
their cong window faster), thus get
higher throughput than those conns
with larger RTTs
(figure: conn1-vs-conn2 throughput trajectory A→B→C→D; loss occurs at B
and D with CA phases in between, converging toward the ideal operating
point on the equal bw share line)
3-94
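The convergence toward the equal-share line can be sketched numerically: additive increase keeps the gap between two same-RTT flows constant, while each multiplicative decrease halves it, so the gap shrinks toward zero. Capacity, rates, and step counts below are arbitrary units, not from the slide.

```python
# Two same-RTT AIMD flows sharing a bottleneck of the given capacity.
def aimd_two_flows(r1, r2, capacity, steps):
    for _ in range(steps):
        if r1 + r2 >= capacity:      # loss event: both flows halve
            r1, r2 = r1 / 2, r2 / 2
        else:                        # both additively increase, +1 per RTT
            r1, r2 = r1 + 1, r2 + 1
    return r1, r2

a, b = aimd_two_flows(60.0, 10.0, capacity=100, steps=200)
print(a, b)  # rates end up nearly equal despite the unequal start
```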
Some other Fairness Issues
‰ Fairness and UDP
z multimedia apps, e.g., Internet phone and video conferencing do
not want their rate throttled even if net is congested
z thus runs over UDP rather than TCP, pumping audio/video at
const rate, and occasionally lose pkt rather than reducing rate
when congested ⇒ UDP sources may crowd out TCP traffic
z research issue : TCP-friendly cong control

{ goal : make UDP traffic behave fairly, thus keep the Internet from
being flooded by UDP sources
‰ Fairness and parallel TCP connections
z a session can open multiple parallel TCP conn’s bw C/S, thus gets
a large portion of bw in a congested link
{ a Web browser to xfer multiple objects in a page

ex) a link of rate R supporting 9 ongoing C/S apps


{ a new app, asking for 1 TCP conn, gets an equal share of R/10

{ a new app, asking for 11 TCP conns, gets an unfair rate of R/2

3-95
TCP Delay Modeling
‰ We’d like to compute the time for TCP to send an object under some simple models
z latency : defined as the time from when a client initiates a TCP conn until
the time at which it receives the requested object
‰ assumptions : made in order not to obscure the central issues
z simple one-link net of rate R bps
z amount of data sender can xmit is limited solely by cong window
z pkts are neither lost nor corrupted, thus no rexmission
z all protocol header overheads : ignored
z object consists of an integer # of MSSs
{ O: object size [bits], S : seg size [bits] (e.g., 536 bits)

z xmission time for segs including control info : ignored


z initial threshold of TCP cong control scheme is so large as not to be
attained by cong window
‰ without cong window constraint : the latency is 2⋅RTT+O/R
z clearly, the SS procedure and dynamic cong window increase this minimal latency

3-96
Static Congestion Window (1)
‰ W : a positive integer, denoting a
fixed-size static congestion window
z upon receipt of rqst, server
immediately sends W segs back-to-back
to client, then one seg for
each ack from client W=4

‰ 1st case : WS/R > RTT+S/R


z ack for 1st seg in 1st window
received before sending 1st
window’s worth of segs
z server xmit segs continuously until
entire object is xmitted
z thus, the latency is
2⋅RTT+O/R

3-97
Static Congestion Window (2)
‰ 2nd case : WS/R < RTT+S/R
z ack for 1st seg in 1st window received
after sending 1st window’s worth of segs
‰ latency = setup time + time for xmitting
object + sum of times in idle state
z let K : # of windows covering object
K = ⎡O/(WS)⎤, i.e., O/(WS) rounded up if not an integer (figure: W=2)
z # of times being in idle state = K-1

z duration of server being in idle state


S/R+RTT-WS/R
z thus, the latency is
2⋅RTT+O/R+(K-1)[S/R+RTT-WS/R]+
where [x]+ = max(x,0)

transmitting state
idle state
3-98
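The two static-window cases can be checked with the slide's latency formula. Parameter values below are illustrative, chosen only to land in each case.

```python
from math import ceil

# latency = 2*RTT + O/R + (K-1)*[S/R + RTT - W*S/R]+, K = ceil(O/(W*S))
def static_window_latency(O, S, W, R, rtt):
    K = ceil(O / (W * S))                       # windows covering the object
    idle = max(0.0, S / R + rtt - W * S / R)    # per-window idle time, [x]+
    return 2 * rtt + O / R + (K - 1) * idle

# 2nd case (W*S/R < RTT + S/R): server idles between windows
print(static_window_latency(O=100_000, S=1_000, W=2, R=1e6, rtt=0.1))
# 1st case (W*S/R > RTT + S/R): no idling, latency = 2*RTT + O/R
print(static_window_latency(O=100_000, S=1_000, W=200, R=1e6, rtt=0.1))
```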
Dynamic Congestion Window (1)
‰ cong window grows according to slow start,
i.e., doubled every RTT
z O/S : # of segs in the object
z # of segs in kth window : 2^(k-1)
z K : # of windows covering object
K = min{k : 2^0 + 2^1 + … + 2^(k-1) ≥ O/S}
  = min{k : 2^k − 1 ≥ O/S}
  = min{k : k ≥ log2(O/S + 1)}
  = ⎡log2(O/S + 1)⎤
z xmission time of kth window = (S/R)⋅2^(k-1)
z duration in idle state of kth window = [S/R + RTT − 2^(k-1)⋅(S/R)]+
(figure example: O/S=15 ⇒ K=4, Q=2, P=min{Q,K-1}=2)
3-99
Dynamic Congestion Window (2)
‰ latency = setup time + time for xmitting object + Σ times in idle state
latency = 2⋅RTT + O/R + Σ_{k=1}^{K-1} [S/R + RTT − 2^(k-1)⋅(S/R)]+   (3)
z Q : # of times server being idle if object were of infinite size
Q = max{k : S/R + RTT − 2^(k-1)⋅(S/R) ≥ 0} = max{k : 2^(k-1) ≤ 1 + RTT/(S/R)}
  = max{k : k ≤ log2(1 + RTT/(S/R)) + 1} = ⎣log2(1 + RTT/(S/R))⎦ + 1
‰ actual # of times server is idle is P=min{Q, K-1}, then (3) becomes
latency = 2⋅RTT + O/R + Σ_{k=1}^{P} [S/R + RTT − 2^(k-1)⋅(S/R)]
        = 2⋅RTT + O/R + P⋅[S/R + RTT] − (S/R)⋅Σ_{k=1}^{P} 2^(k-1)
        = 2⋅RTT + O/R + P⋅[S/R + RTT] − (2^P − 1)⋅(S/R)   (4)
3-100
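The closed form (4) can be cross-checked against the direct idle-time sum of (3). Numbers below are illustrative (they use the slide's O/S = 15, so K = 4, but an arbitrary R and RTT).

```python
from math import ceil, floor, log2

# Closed form (4) of the slow-start latency model.
def ss_latency(O, S, R, rtt):
    K = ceil(log2(O / S + 1))                 # windows covering the object
    Q = floor(log2(1 + rtt / (S / R))) + 1    # idle times if object infinite
    P = min(Q, K - 1)                         # actual number of idle times
    return 2 * rtt + O / R + P * (S / R + rtt) - (2 ** P - 1) * S / R

# Direct evaluation of the sum in (3), for cross-checking.
def ss_latency_direct(O, S, R, rtt):
    K = ceil(log2(O / S + 1))
    idle = sum(max(0.0, S / R + rtt - 2 ** (k - 1) * S / R)
               for k in range(1, K))
    return 2 * rtt + O / R + idle

O, S, R, rtt = 15_000, 1_000, 1e6, 0.1        # O/S = 15 segs -> K = 4
print(ss_latency(O, S, R, rtt))
print(ss_latency_direct(O, S, R, rtt))        # same value as (4)
```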
Dynamic Congestion Window (3)
‰ comparing TCP latency of (4) with minimal latency
latency / minimal latency = 1 + [P⋅((S/R)/RTT + 1) − (2^P − 1)⋅(S/R)/RTT] / [2 + (O/R)/RTT]
                          = 1 + [P + ((S/R)/RTT)⋅(P + 1 − 2^P)] / [2 + (O/R)/RTT]
                          ≤ 1 + P / [2 + (O/R)/RTT]
   (the last term is the latency contributed by slow start)
z slow start significantly increases latency when object size is
relatively small (implicitly, high xmission rate) and RTT is
relatively large
{ this is often the case with the Web

‰ See the examples in the text

3-101
HTTP Modeling
Assume Web page consists of
z 1 base HTML page (of size O bits)

z M images (each of size O bits)

‰ non-persistent HTTP
z M+1 TCP conns in series

z response time = 2⋅(M+1)RTT + (M+1)O/R + sum of idle times

‰ persistent HTTP
z 2 RTT to request and receive base HTML file

z 1 RTT to request and receive M images

z response time = 3⋅RTT + (M+1)O/R + sum of idle times

‰ non-persistent HTTP with X parallel conns


z suppose M/X is integer

z 1 TCP conn for base file

z M/X sets of parallel conns for images

z response time = 2⋅(M/X + 1)RTT + (M+1)O/R + sum of idle times

3-102
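The three response-time formulas above can be compared numerically (idle times omitted for brevity, as they would simply add to each expression). The page parameters below are illustrative, not from the slide.

```python
# Response-time formulas for the three HTTP variants (idle times omitted).
def nonpersistent(M, O, R, rtt):
    return 2 * (M + 1) * rtt + (M + 1) * O / R

def persistent(M, O, R, rtt):
    return 3 * rtt + (M + 1) * O / R

def nonpersistent_parallel(M, O, R, rtt, X):
    # assumes M/X is an integer, as on the slide
    return 2 * (M // X + 1) * rtt + (M + 1) * O / R

# illustrative page: 10 images of 100 kbit each, R = 1 Mbps, RTT = 100 ms
M, O, R, rtt = 10, 100_000, 1e6, 0.1
print(nonpersistent(M, O, R, rtt))              # ≈ 3.3 s
print(persistent(M, O, R, rtt))                 # ≈ 1.4 s
print(nonpersistent_parallel(M, O, R, rtt, 5))  # ≈ 1.7 s
```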
