
Statistical multiplexing: sharing of bandwidth between users

Say the network supports 100 Mbps. Each user subscribes to 5 Mbps and uses the network 1/2 of the time. So it looks like the network can support 20 users. But the chance of all 20 using their bandwidth at once is 1/2^20, so most of the bandwidth is wasted. We can have 30 users and most of the time they'll still get enough bandwidth given their usage. 30/20 = 1.5 is the statistical multiplexing gain.
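The overload chance with 30 users can be checked with a binomial sum; a sketch, assuming each user is active independently half the time:

```python
# Sketch: with 30 subscribers each active 1/2 the time, how often do more
# than 20 (the 100 Mbps link's capacity in 5 Mbps shares) transmit at once?
from math import comb

n, p = 30, 0.5          # 30 users, each active 1/2 of the time
capacity = 100 // 5     # the link supports 20 simultaneous 5 Mbps users

# P(more than `capacity` users active at once), binomial distribution
p_overload = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(capacity + 1, n + 1))
print(f"probability of overload: {p_overload:.4f}")   # about 2%
```

So overselling the link by 1.5x only degrades service about 2% of the time.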
Content delivery: to deliver the same content to n users, instead of routing it n times through almost the same path (different only at the end when it goes to different clients), route it to a replica machine close to all the users, then distribute to each user. If the common path has m machines, the old way takes mn hops, the new one takes m + n hops -> way more efficient.
Large networks have more economic value than small ones because of Metcalfe's law: the value of an n-node network is proportional to n^2, eg a 6-node network has 6*5 = 30 connections, but a 12-node network has 12*11 = 132 connections.
Host = edge device = sink
Link: 3 kinds
full duplex: bidirectional
half duplex: bidirectional but not at the same time
simplex: unidirectional, not common
Wireless links: messages are broadcast to all nodes in range
Note that when node A is sending to node B
B can't send to any other node at the same time because a node can't send and receive at the same time
no other node that's connected to B can send to B because that'd interfere with B's reception
The Socket API: used for clients and servers to talk to each other, supports streams (reliable) and datagrams (unreliable)
Apps can attach to the network at different ports.
Primitive calls:
SOCKET: create new communication endpoint
SEND/RECEIVE
CONNECT: try to establish a connection
BIND: associate local address with socket
LISTEN: announce readiness to accept connection
ACCEPT: establish incoming connection
CLOSE
Example app: client and server connect, client sends request, server sends respo
nse, both disconnect.
client.socket
server.socket
server.bind
server.listen
server.accept*
client.connect*
client.send
server.receive*
server.send
client.receive*
client.close
server.close
*: blocking calls: wait for other side to do something
Program code:
Client

socket()
getaddrinfo() // translates human friendly names to addresses eg IP addresses
connect()
...
send()
receive()
...
close()
Server
socket()
getaddrinfo()
bind()
listen()
accept() // loops from here to end to serve many clients
...
receive()
...
send()
close()
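The two call sequences above can be sketched with Python's socket module. As a simplification, the sketch uses a literal loopback address (so getaddrinfo() is skipped) and lets the OS pick a free port; the server's accept/receive/send runs in a thread so one script shows both sides:

```python
import socket
import threading

# server side: socket(), bind(), listen() up front
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket()
srv.bind(("127.0.0.1", 0))                                # bind(); port 0 = OS picks
srv.listen(1)                                             # listen()
port = srv.getsockname()[1]

def server():
    conn, addr = srv.accept()          # accept(): blocks until a client connects
    request = conn.recv(1024)          # receive(): blocks until data arrives
    conn.sendall(b"response")          # send()
    conn.close()                       # close()

t = threading.Thread(target=server)
t.start()

# client side
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket()
cli.connect(("127.0.0.1", port))                          # connect() (blocking)
cli.sendall(b"request")                                   # send()
reply = cli.recv(1024)                                    # receive(): blocks
cli.close()                                               # close()
t.join()
srv.close()
print(reply)
```

The blocking calls marked with * in the sequence above are exactly the ones that block here: accept, connect, and the two receives.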
Traceroute: looks inside the network as a packet travels. Returns a message from the router at each hop, thus tracing the packet's path.
Protocol: instances of a protocol on different nodes can talk to each other indirectly, by having instances of a lower-layer protocol on the same nodes talk to each other. The lower protocol uses yet lower protocols to talk, all the way down to the physical medium.
Protocol stack: set of protocols used on a node.
Example: browser connected to the internet wirelessly
HTTP
TCP
IP
802.11
Encapsulation: lower layers wrap content from higher layers and add their own info to make a new message for delivery. Like the postal service, lower layers don't look inside the contents of higher layers.
Lower layers usually add their info to the front, called the header. Sometimes they also add trailers, or encrypt or compress contents, or do segmentation and reassembly.
A host running multiple apps using multiple protocols (eg browser and Skype) will need to route packets from the same lower-layer protocol to the right higher-layer protocol, called demultiplexing. This is done with demultiplexing keys in the headers. The key in each layer says what the next higher protocol is, eg the Ethernet header may point to IP instead of ARP, the IP header points to TCP instead of UDP. The TCP header has the port number to point to the right app.
Advantages of layering
Info hiding and reuse: when the lower layers differ (eg Ethernet and 802.11), the top layers can still be the same -> reuse.
Networks using different same-layer protocols (eg Ethernet and 802.11) can talk by having a middle device that takes off the header of one protocol and attaches the header of the other, then passes the message along. This is a router.
Disadvantages:
Overhead, but only a few bytes, very insignificant for large messages
Hides info, eg when you need to know if you're wired or wireless.
Reference models
Help guide us at what layer to implement a function eg routing.
OSI model
application
presentation
session
transport
network
data link
physical
Internet reference model
application eg SMTP, HTTP, RTP, DNS
transport eg TCP, UDP
internet eg IP
link (physical) eg 802.11, Ethernet
Units of info and their layers:
application: message
transport: segment
network: packet
link: frame
physical: bit
Devices according to their layers:
repeater: only physical
switch/bridge: link
router: link/network
proxy: link/network/transport/app: eg firewalls
Model layers don't correspond exactly with protocols, eg there can be multiple protocols in a layer.
Physical layer
Physical links have 2 properties: rate R aka bandwidth B (bits/s) and delay D (depends on the length of the link) (signals propagate at c in free space, 2/3 c in wire)
Latency: delay to send a message on a link
Transmission delay: time to put an M-bit message 'on the wire'
T = M/R
Propagation delay: time for bits to propagate across the wire:
D = length/(2/3 * c)
L = T + D
Either delay may dominate in a given case, in which case we can ignore the other one.
Bandwidth-delay product: amount of data that can be held in a wire at a given moment
BD = R*D (how fast data can go times how long it takes data to go)
Relationship between bandwidth and propagation delay:
eg B = 100 bps, D = 1s
then the wire can transfer 100 bits per second, or 1 bit per 0.01 second. So in the first 0.01 second, you put the first bit on the wire; in the second 0.01 second, the second bit, etc. After 100 times, 1 s has passed and you have 100 bits on the wire. Now the first bit arrives at the other end, because the propagation delay was 1 s.
If B were 1000 bps, after 1 s you'd have 1000 bits on the wire before the first bit arrives.
If D were 0.1 s, you'd only have enough time to put 10 bits on the wire before the first bit arrives.
This means higher B or higher D means more bits on the wire at any given time.
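The four formulas above, computed for an assumed example (a 100 Mbps link 2000 km long, carrying a 1250-byte message); here propagation delay dominates:

```python
# T = M/R, D = length/(2/3 c), L = T + D, BD = R*D, with assumed numbers
C = 3e8                      # speed of light in free space, m/s
R = 100e6                    # rate, bits/s
length = 2_000_000           # link length, m (2000 km)
M = 1250 * 8                 # message size, bits

T = M / R                    # transmission delay: time to put M bits on the wire
D = length / ((2 / 3) * C)   # propagation delay (signals travel at 2/3 c in wire)
L = T + D                    # total latency
BD = R * D                   # bandwidth-delay product: bits 'in flight'

print(f"T = {T*1e3:.1f} ms, D = {D*1e3:.1f} ms, "
      f"L = {L*1e3:.1f} ms, BD = {BD:.0f} bits")
```

T comes out to 0.1 ms but D is 10 ms, so D dominates and ~1 Mbit is in flight at any moment.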

Physical media
Wires
Twisted pair: LANs and phones. Twisting reduces radiated signal
Cat 5 UTP (unshielded twisted pair) cable: 4 pairs of twisted cable.
Coaxial cable: from the inside out: copper core, insulating material, braided outer conductor, protective plastic covering. Faster.
Fiber: long thin pure strand of glass
multimode: shorter links, cheaper
single mode: up to 100 km
Wireless
can interfere -> must moderate
How to carry digital signals through analog cables
Signals can be decomposed into harmonics (sine waves) using Fourier analysis.
If the bandwidth (width of the frequency band - the EE definition) is reduced, the signal is degraded until it's no longer clear whether a bit is a 0 or a 1. We can recover the signal up to a certain level of degradation.
Effects of traveling over wire on a signal
delayed
attenuated
frequencies above a cutoff are highly attenuated
noise added
Fiber: very little attenuation over very long distances
Wireless:
very fast attenuation at a rate of 1/d^2 where d is distance from the transmission point
interference and confusion if a computer can receive multiple signals at once
Spatial reuse: can use the same frequency if the distance is long enough, because one of the two signals will be too weak at any given place
Multipath fading: when a receiver receives 2 signals from the same transmitter, but one bounced off a reflector (eg a filing cabinet) so it's delayed and shifted a bit. When the two signals are added they cancel each other out -> multipath fading. It's possible for reflectors to be positioned such that the reflected signal is as strong as the original, so when added we have a signal twice as strong.
Modulation: for carrying digital data over analog signals
Baseband modulation: for physical media like wires
NRZ (Non-Return to Zero) scheme: high voltage = 1, low = 0
Can use more signal levels, eg 4 levels to transmit 2 bits at a time
(level 4 to transmit 11, 3 for 10, 2 for 01, 1 for 00, for example)
Engineering considerations:
Clock recovery: when there's a long string of repeated bits eg 000000...0, it's hard for an ordinary clock to tell one 0 from another, so we want to introduce some variation in the bitstream.
multiple coding schemes make this happen, eg Manchester coding. One simple way is 4B/5B: for every possible 4-bit sequence, send an alternative 5-bit sequence instead, eg instead of 0000, send 11110. To prevent long runs of 1s, when there's a 1, invert the signal; if it's a 0, the signal stays where it is (no matter if it had been high or low).
Passband modulation: for fiber/wireless: need higher carrier frequencies which pass well through the media
amplitude shift keying: carrier at high amplitude for 1, low/zero amplitude for 0
frequency shift keying: high frequency for 1, low for 0
phase shift keying: up then down for 1, down then up for 0
Network limits: expressed in terms of bandwidth B, signal strength on the receiver side S, and noise strength N
Nyquist limit: ignoring noise, if the bandwidth is B, the maximum symbol rate is 2B, where a symbol is a waveform that can represent a number of bits (eg 1 bit if there are 2 possible signal levels aka amplitudes, 2 bits if 4 signal levels, etc). Since with V signal levels you can transmit lg(V) bits, the max bit rate is
R = 2B * lg(V) (bits/s)
Shannon capacity
Signal to Noise Ratio (SNR): how many signal levels the receiver can distinguish, based on S/N, eg if S = 4, N = 1 (noise is within 1 signal level) then the receiver can distinguish 4 levels.
SNR = 10 * log10(S/N) (decibels)
eg S/N = 1000, SNR = 30 dB
Shannon capacity: max info-carrying rate of the channel:
C = B * lg(1 + S/N) (bits/s)
# the lg(1 + S/N) part converts the SNR to the number of bits carried per symbol
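Both limits side by side, with assumed illustrative values (B = 1 MHz, V = 4 levels, S/N = 1000):

```python
from math import log2, log10

B = 1e6           # bandwidth, Hz
V = 4             # distinguishable signal levels
S_over_N = 1000

nyquist = 2 * B * log2(V)            # max bit rate ignoring noise, bits/s
snr_db = 10 * log10(S_over_N)        # the SNR in decibels (30 dB here)
shannon = B * log2(1 + S_over_N)     # capacity with noise, bits/s

print(f"Nyquist: {nyquist/1e6} Mbps, SNR: {snr_db} dB, "
      f"Shannon: {shannon/1e6:.2f} Mbps")
```

With 4 levels Nyquist allows 4 Mbps, but the 30 dB channel could in principle carry ~10 Mbps, so more signal levels would help here.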
Wired vs wireless:
Wired: can change the wire properties to reach the requisite SNR and B -> can change the data rate
Wireless: can only fix B, not SNR (which depends on distance from the transmitter and position) -> must adapt the data rate to the SNR
Example: DSL
phone lines can transmit 2 MHz but only use the bottom 4 kHz -> can use the rest for data. Houses closer to the local exchange get higher SNR -> faster.
Link layer
how to transfer messages (frames) across physical links
sender adds header and trailer to packet (payload)
pass to physical layer
receiver receives and unwraps packet
straddles OS and hardware layers
Framing methods: how to turn a stream of bits into a series of frames
Byte count: start each frame with a length field. Problem: hard to resynchronize after a framing error (everything after the error will be off, with no way to redetect the frame start)
Byte stuffing: have special flag bytes at the beginning and end of each frame. If the payload contains the flag character, escape it. If the payload contains an escape, escape that too (otherwise, if the data ends with ESC, the processed version will end ESC FLAG, and the receiver will think the FLAG is part of the payload).
eg A FLAG B -> A ESC FLAG B
A ESC B -> A ESC ESC B
A ESC FLAG B -> A ESC ESC ESC FLAG B
A ESC ESC B -> A ESC ESC ESC ESC B
Now every unescaped flag in the stream is the start/end of a frame
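The escaping rule can be sketched directly, assuming FLAG = 0x7E and ESC = 0x7D (this is the plain insert-an-ESC variant, without the XOR trick PPP uses):

```python
FLAG, ESC = 0x7E, 0x7D

def stuff(payload: bytes) -> bytes:
    out = bytearray()
    for b in payload:
        if b in (FLAG, ESC):
            out.append(ESC)        # escape both FLAG and ESC in the payload
        out.append(b)
    return bytes(out)

def unstuff(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == ESC:
            i += 1                 # drop the ESC, keep the byte it protects
        out.append(data[i])
        i += 1
    return bytes(out)

payload = bytes([0x41, FLAG, ESC, 0x42])    # A FLAG ESC B
frame = stuff(payload)                      # A ESC FLAG ESC ESC B
assert unstuff(frame) == payload            # round-trips cleanly
```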
Bit stuffing: call 111111 a flag. In real data, insert a 0 after every 11111. The receiver deletes any 0 found after a 11111.
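Bit stuffing can be sketched on strings of '0'/'1' (a toy model; real implementations work on the bit stream itself):

```python
def bit_stuff(bits: str) -> str:
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == "1" else 0
        if run == 5:
            out.append("0")        # break the run so it can't look like a flag
            run = 0
    return "".join(out)

def bit_unstuff(bits: str) -> str:
    out, run, i = [], 0, 0
    while i < len(bits):
        out.append(bits[i])
        run = run + 1 if bits[i] == "1" else 0
        if run == 5:
            i += 1                 # skip the stuffed 0
            run = 0
        i += 1
    return "".join(out)

data = "0111111"                       # contains six 1s, ie a flag
assert bit_stuff(data) == "01111101"   # a 0 is inserted after five 1s
assert bit_unstuff(bit_stuff(data)) == data
```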
Example: PPP over SONET
PPP: Point to Point Protocol, uses byte stuffing
Assume the flag is 0x7E and ESC is 0x7D. To stuff a byte, add ESC, then XOR the next byte with 0x20 (which flips bit 5, value 0x20, from 1 to 0). To unstuff, remove the ESC, then XOR the next byte with 0x20 (flipping bit 5 back to 1, restoring the original contents). This removes flag and ESC values from the frame contents.
eg contents contain 0x7D. Add ESC -> 0x7D 0x7D. XOR the second byte with 0x20 -> 0x7D 0x5D.
Error detection and correction
Noise errors: can't tell if 0 or 1.

Add check bits to the message bits to detect and correct errors
Naive approach: send 2 copies
can detect as many errors as there are bits in the original
can't correct any (don't know which copy is wrong)
takes 2 errors (in the same bit position) to fail (the 2 copies match but both are wrong)
takes up 50% of bandwidth
-> terrible
Send codewords: D data bits + R check bits (aka systematic block code) where R = fn(D)
sender computes the R check bits from the D bits
receiver recomputes R' check bits based on the D bits it received. If R != R', there was an error.
There are 2^(D + R) possible codewords, but only 2^D correct (valid) codewords -> a randomly chosen codeword has only a 1/2^R chance of being a correct one.
Hamming distance: how many bits need to flip to turn D1 into D2
Hamming distance of a code: minimum distance between any pair of codewords
Observe that if we duplicate a pair of bit sequences, their Hamming distance doubles, eg HD(0, 1) = 1, HD(00, 11) = 2.
For a code of distance d + 1, up to d errors can be detected
eg 000 and 111 -> d = 2. More than 2 errors means 000 becomes 111 and vice versa -> still valid -> can't be detected.
For a code of distance 2d + 1, up to d errors can be corrected by mapping to the nearest codeword.
eg 000 and 111 -> d = 1. On receiving 001, map to 000.
Error detection schemes:
Parity Bit: add 1 check bit that's the sum modulo 2 of the data bits (same as XORing all the data bits)
Distance of the code is 2, because if I flip 2 bits the parity bit is right again.
Can detect all odd numbers of errors
Checksums: sum data in N-bit words, eg 16 bits in TCP/IP
example: in TCP/IP, the checksum for D bits is a 16-bit value made by: those D bits broken up into 16-bit numbers, added up, any carry bits (excess bits on the left) added back on the right, then bit-flipped (one's complement)
eg to transmit 0001 f203 f4f5 f6f7
sum = 2ddf0
carry bit added = ddf2
bit flipped = 220d -> checksum
On the receiving end: add the received bits in groups of 16 bits, add the carry bits, flip. If the result is 0, the data is correct.
Distance of code: 2 (can change a 0 to 1 and a 1 to 0 -> same sum)
Can detect max 1 error, can't correct any
Can detect burst errors (series of errors) of up to 16 bits. There are 2^16 possible 16-bit sequences, so a random checksum has a 1/2^16 chance of being correct -> much better than parity.
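The worked example above in code: sum the 16-bit words, fold the carries back in, then take the one's complement:

```python
def checksum16(words):
    """16-bit Internet checksum over a list of 16-bit words."""
    total = sum(words)
    while total >> 16:                       # fold carry bits back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF                   # one's complement

words = [0x0001, 0xF203, 0xF4F5, 0xF6F7]
print(hex(checksum16(words)))                # 0x220d, as in the example

# Receiver: checksum over data plus the transmitted checksum; 0 means
# no error was detected.
assert checksum16(words + [0x220D]) == 0
```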
Cyclic Redundancy Check (CRC): given n data bits, generate k check bits such that the n+k bits are evenly divisible by a generator C
eg n = 302, k = 1, C = 3, check bit = 1 because 3021 % 3 = 0
Has to be done in binary -> has to use modulo-2 arithmetic
Send:
extend the n data bits with k zeros
divide by the generator value C
keep remainder, ignore quotient
adjust k check bits by remainder

Receive:
divide and check for zero remainder
Example:
n = 1101011111, k = 4, C = 10011
11010111110000
⊕ 10011000000000 (C shifted under the leading 1, subtraction = XOR)
01001111110000
⊕ 01001100000000
00000011110000
⊕ 00000010011000
00000001101000
⊕ 00000001001100
00000000100100
⊕ 00000000100110
00000000000010
-> remainder is 10, so check bits are 0010
Standard CRC-32 is 0x82608EDB7
HD = 4 -> can detect up to 3 bit errors
can catch all odd numbers of errors
can catch bursts of up to k bits in error
not vulnerable to systematic errors like checksum
CRCs are widely used on links: Ethernet, wifi, ADSL, cable
Checksums used in Internet: IP, TCP, UDP
Error correction
For a code of HD 2d+1, with <= d errors, can correct by mapping to the closest codeword
Hamming code: method for constructing a code with a distance of 3.
with k check bits, can protect n = 2^k - k - 1 data bits
put check bits in positions p that are powers of 2, starting with position 1.
the check bit at position p is the parity of the positions with a p term in their values (ie the bit for p in their binary expansion is on)
Example: data = 0101, 3 check bits
7 bit code -> check bits at 1, 2 and 4
__0_101
at 1, covers parity of positions 1, 3 (1+2), 5 (1+4), 7 (1 + 2 + 4)
-> par(0, 1, 1) = 0
at 2, covers parity of positions 2, 3, 6, 7 -> par(0, 0, 1) = 1
at 4, covers parity of positions 4, 5, 6, 7 -> par(1, 0, 1) = 0
-> 0100101
To decode:
recompute check bits (with parity sum including the check bit)
arrange as binary number
value (aka syndrome) tells error position
0 means no error
otherwise, flip bit to correct
Example: 0100101
p1 = par(0011) = 0
p2 = par(1001) = 0
p3 = par(0101) = 0
syndrome = 000 -> no error
data = 0101
if error: 0100111
p1 = par(0011) = 0
p2 = par(1011) = 1
p3 = par(0111) = 1
syndrome = 110 = 6 -> 6th bit is wrong -> flip the 6th bit -> data 0101
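The construction and decoding above, sketched for the (7,4) case with the example data 0101:

```python
def hamming_encode(data_bits):               # eg [0, 1, 0, 1]
    code = [0] * 8                           # positions 1..7 (index 0 unused)
    for pos, bit in zip([3, 5, 6, 7], data_bits):
        code[pos] = bit                      # data goes in non-power-of-2 slots
    for p in (1, 2, 4):                      # check bit p = parity of positions
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    return code[1:]

def hamming_decode(code7):
    code = [0] + list(code7)
    syndrome = 0
    for p in (1, 2, 4):                      # recompute, including the check bit
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome += p
    if syndrome:                             # nonzero syndrome: flip that position
        code[syndrome] ^= 1
    return [code[i] for i in (3, 5, 6, 7)], syndrome

codeword = hamming_encode([0, 1, 0, 1])
assert codeword == [0, 1, 0, 0, 1, 0, 1]     # 0100101, as in the example

corrupted = codeword[:]
corrupted[5] ^= 1                            # flip the 6th bit
data, syndrome = hamming_decode(corrupted)
assert syndrome == 6 and data == [0, 1, 0, 1]
```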
Other correction codes:
Convolutional codes
Low density parity check: state of the art
Is detection or correction better?
depends on error pattern
eg 1000 bit messages, error rate 1/10000 per bit
correction: need 10 check bits per message -> overhead 10 bits, whether or not there's an error
detection: 1 check bit per message, plus a 1000 bit retransmission 1/10 of the time
overhead: 1 + 1000/10 = 101 bits -> higher
errors in bursts of 100 but only 1-2/1000 messages have errors
correction: need ~100 check bits -> overhead 100 bits
detection: ~32 check bits per message (so the chance of missing an error is 1/2^32) plus a 1000 bit resend 2/1000 of the time -> overhead 32 + 2*1000/1000 = 34 bits -> lower
Use correction:
when errors are expected
no time for retransmission
used in the physical layer and the application layer, eg Forward Error Correction (FEC)
Use detection:
when errors are not expected (more efficient)
for large errors when they do occur
used in link layer for residual errors the physical layer didn't get
Error handling strategies
detect errors and retransmit frame using Automatic Repeat reQuest (ARQ)
used when errors are common and expected, eg wifi and TCP
receiver sends an ACK to acknowledge correct frames. Sender resends if it doesn't receive the ACK within a certain time.
determining the timeout is tricky
how to deal with duplicates when the first ACK is lost and the sender sends the same frame again -> both frames and ACKs must carry sequence numbers
in Stop-and-Wait, only 1 bit is needed (0 or 1)
not efficient for high bandwidth because only 1 frame can be outstanding from the sender -> throughput is capped by delay: no matter how high the bandwidth, it still takes 2D time to get a single frame through
Solution: sliding window algorithm: allow W frames to be outstanding, such that W frames can be sent every 2D (aka RTT - round trip time). This way the network is kept busy - there are just enough frames to fill every 2D interval.
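The window size follows from the reasoning above: W frames must cover one RTT's worth of transmission. A quick calculation with assumed numbers (10 Mbps link, 50 ms RTT, 1250-byte frames):

```python
import math

R = 10e6                    # link rate, bits/s
rtt = 0.05                  # round trip time (2D), s
frame_bits = 1250 * 8       # frame size, bits

# frames outstanding needed to fill the pipe: the bandwidth-delay product
# divided by the frame size
W = math.ceil(R * rtt / frame_bits)
print(W)                    # 50 frames
```

With Stop-and-Wait (W = 1) this link would run at 1/50th of its capacity.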
Multiplexing: sharing bandwidth
time division multiplexing: users take turns on a fixed schedule, eg with 3
users, 2-1-3-2-1-3. The gap in between is called guard time.
each user sends at a high rate a fraction of the time
have to coordinate whose info to send when
when someone can send, they'll send faster
frequency division multiplexing: different users take up different frequency
bands eg user 1 takes up the 60-64 kHz band, user 2 takes 64-68, etc
each user sends at a steady low rate
In GSM, each call is at a different frequency band. Within a frequency band,
time division is used.
Problem: network usage is bursty, eg you need a high rate when downloading a page, then 0 rate while reading it, but both TDM and FDM are designed for fixed usage patterns and a fixed number of users.
Solution: observe that if each user uses the network burstily, and the max rate each needs is R, then the max rate for both is R' < 2R because they don't both need the max rate at the same time.
-> Multiple Access protocols
2 families of MA protocols:
Randomized: nodes randomize their resource access attempts, good for low
loads, eg wifi
Problem: all nodes are connected together, so any node can send to any other node, but there's no central node to organize the flow -> collisions can occur when multiple nodes try to send at once.

Initial solution: ALOHA protocol (connected the Hawaiian islands): when a node has to send, it sends. If it doesn't get an ACK, the send failed, so it waits a random amount of time and tries again -> simple, decentralized, works well under low load, but inefficient (18%) under high load. Simple improvement: divide time into slots to avoid packets that are only partially lost -> efficiency goes up to 36%.
Ethernet: similar to ALOHA, all nodes are connected to a central cable. Improvements:
CSMA (Carrier Sense Multiple Access): listen to the medium for activity before sending. Collisions can still happen because nodes can send at the same time, or at different times but the packets need time to propagate across the wire -> CSMA is only effective when the bandwidth-delay product is small (< 1 packet).
Collision Detection: if the network is busy, send a JAM signal. May take up to 2D seconds to detect a collision (D for the first colliding node's packet to reach the other end, D for the second node's packet to come back) -> the network imposes a minimum frame size that lasts 2D seconds so a node can't finish sending before detecting the collision. Ethernet's minimum frame size is 64 bytes (smaller packets are padded out).
Problem: persistence (what to do if another node is sending).
Naive solution: wait, then send as soon as the other node is finished.
Problem: if more than 1 node is waiting, they'll all start at the same time. In fact, this scheme guarantees a collision whenever one can happen.
Ideally, we want each waiting node to send with probability 1/N, N being the number of waiting nodes. But in a dynamic network we don't know what N is
Binary Exponential Backoff (BEB) algorithm to estimate N:
after 1 collision, assume there are 2 waiting nodes -> wait 0 or 1 frame times
after 2 collisions, wait 0 to 3 frame times
after 3 collisions, wait 0 to 7 frame times, etc
Efficient because the number of collisions needed to estimate N is only lg(N), eg if 100 nodes collide, only ~7 collisions need to be detected.
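The backoff rule above as a one-liner sketch (real Ethernet also caps the range after 10 collisions and gives up after 16):

```python
import random

def beb_wait(collisions: int) -> int:
    # after the i-th collision, pick a wait uniformly from 0..2^i - 1 frame times
    return random.randrange(2 ** collisions)

print([beb_wait(i) for i in (1, 2, 3)])   # drawn from [0,1], [0,3], [0,7]
```

Doubling the range on each collision is what makes the estimate converge in ~lg(N) rounds.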
Ethernet frame format:
preamble: a fixed bit pattern that lets receivers synchronize on the start of a frame
destination address
source address
needs sender and receiver addresses because the cable is shared
destination goes before source so every node can quickly see if the frame is for them
data
padding
checksum (CRC-32)
Contention-free: nodes order their resource access attempts; good for high or guaranteed loads.
Wireless Multiple Access
Problem:
can't use CSMA because different nodes have different coverage areas -> can't listen for busy lines. 2 problems:
hidden terminals: if the nodes are laid out A B C, and each node can only reach its neighbors, then A and C can't hear each other -> they're 'hidden terminals' when sending to B. They can't hear that the other is sending to B, so they'll both send to B, and there'll be a collision.
exposed terminals: A B C D; B can send to A and C can send to D at the same time, but since B and C can hear each other, they won't send at the same time -> B and C are 'exposed terminals'. But we want them to send at the same time because it causes no problem and saves time.

can't use CD because nodes can't hear while sending -> they keep sending and the collision keeps happening, undetected, for a long time.
potential solution: Multiple Access with Collision Avoidance (MACA): the sender sends a Request to Send (RTS) first; the receiver sends a Clear to Send (CTS) back. The CTS states how long the packet is, so all nodes hearing the CTS can stay silent.
in the A B C scenario, A sends RTS to B, B sends CTS to A and C, C knows B is receiving so C doesn't send -> problem solved.
in the A B C D scenario, B sends RTS to A, C sends RTS to D; B gets C's RTS and C gets B's, but they're busy sending so they won't hear them. A and D send CTSs back, but D and A won't hear the other's CTS, so B and C are clear to send.
Physical layer of wifi:
use 20/40 MHz channels in ISM bands that the government opened up for free use.
Link layer:
frame format:
destination address
source address
access point address
frames are ACKed and retransmitted with ARQ
errors are detected with CRC-32
to avoid collision, use CSMA/CA:
sender inserts small random gaps, ie instead of sending as soon as the line is clear, it waits a random amount of time. This way RTS/CTS are optional.
Contention-free MA:
token ring: nodes are wired in a circle and a token is passed around. The node with the token gets to send.
pros
prevents collisions under high load
predictable demand
cons
higher overhead at low load
token can get lost
can use time references: if token is lost for some time, assume
loss.
Switched Ethernet: modern Ethernet, has a central switch instead of all nodes connecting to a wire
Devices:
Hub: physical layer, connects all hosts so any host can talk to any other host. Equivalent to a shared wire.
Switch: link layer. Each host's line coming in branches out into as many branches as there are other nodes, each of which crosses the other nodes' lines coming in, resulting in an n*n grid. This way independent pairs of nodes can communicate at the same time by taking different paths in the grid. The wires are bidirectional so that the links can be full duplex. Switches have buffers at either input or output to store frames when there are more frames going to the same host than the links can handle.
Hubs and switches have replaced wires because wiring to a single place is more convenient, and if the hub/switch fails, we know to just replace it instead of hunting down the wire. They're also scalable: all ports can use max bandwidth instead of sharing the cable's bandwidth.
How do switches figure out which port leads to which address? By backward learning: when a host sends, the switch sees its source address and the port it's coming in on, so it builds a table mapping address to port. If it needs to forward to an address whose port it doesn't know, it just sends to all ports. The wrong ports just ignore it.
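Backward learning can be sketched as a dict from address to port (a toy model; the port numbers and addresses here are illustrative):

```python
class Switch:
    def __init__(self, ports):
        self.ports = ports
        self.table = {}                       # address -> port, learned lazily

    def handle(self, in_port, src, dst):
        """Return the list of ports to forward a frame out of."""
        self.table[src] = in_port             # learn the sender's port
        if dst in self.table:
            return [self.table[dst]]          # known destination: one port
        return [p for p in self.ports if p != in_port]   # unknown: flood

sw = Switch(ports=[1, 2, 3])
assert sw.handle(1, "A", "B") == [2, 3]       # B unknown: flood (except in-port)
assert sw.handle(2, "B", "A") == [1]          # A was learned from port 1
```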
This works when multiple switches and hubs are connected together too. Assuming no loops, if switch 1 doesn't know the destination port, it'll broadcast to all its nodes including any hub/switch it's connected to. Switch 2 receives the broadcast and rebroadcasts to all its nodes, one of which is the destination, so the frame arrives at the destination. Now both switches have learned the address associated with the source port.
If there's a loop, eg switches C and D have 2 links to each other: say A, which belongs to switch C, wants to send to F, which belongs to switch D. A sends to C, C broadcasts to D twice (because of the 2 links), D broadcasts back to C twice, C broadcasts to D twice again as well as back to A, etc -> fail.
Solution: the switches together find a spanning tree (a tree, meaning no loops, that connects all switches) and treat this spanning tree as the topology.
Theory:
1. the switch with the lowest address is the root
2. grow the tree as shortest distances from the root (if 2 links have the same distance, use the lower-address one) (in practice, 2 happens at the same time as 1)
3. turn off ports for forwarding if they're not in the spanning tree
Practice:
initially, every switch assumes it's the root
each switch sends periodic updates to its neighbors with its address, the root's address, and its distance in hops to the root. Format: A, C, 1 means I'm A, the root is C, my distance to the root is 1.
the switches favor ports with the shortest distance to the lowest root, using the lowest address to break ties.
Example: fig 1.
Network layer:
Why need one when we have the link layer? The link layer doesn't scale: every switch would have to maintain millions of entries in the host table, every send to an unknown address would be broadcast to millions of hosts, and different kinds of link layers don't work together (eg wifi, Ethernet).
Routing vs forwarding:
routing: deciding which way a packet should go
forwarding: sending the actual packet along the found route. A local action at each node as packets arrive -> fast (unlike routing, which is a network-wide process).
2 kinds of services the network layer provides to the transport layer:
Datagrams aka connectionless service: each unit is self-contained, like the
post office (IP)
Virtual circuits: connection-oriented, like the phone.
Both are implemented with store-and-forward packet switching: routers receive a full packet and may store it temporarily before forwarding. Mostly they only store when there's contention for bandwidth -> we use statistical multiplexing to share the bandwidth over time
Datagrams:
Router output ports have buffering for when they get input from several sources. The buffer is typically FIFO and there's a discard policy for when congestion happens.
Each router has a forwarding table keyed by address that gives the next hop for each destination address, which may change, eg if A needs to go through C to reach F, the F entry in A's table has the value C. Packets may take different routes and arrive out of order.
Each packet contains the destination address.
Easier to mask failure - just resend to different route.
Hard to add quality of service to individual packets.
VC:
3 phases: set up, data transfer and teardown. During the teardown, the routers along the circuit delete the circuit state.
Packets only contain a circuit identifier.
Each router has a forwarding table keyed by circuit that gives the output line and the next label to place on the packet as it moves along the circuit. Each entry has 'in' and 'out' parts to identify where a packet is coming from and what connection label to put on it next, eg

H1 1 C 5
H3 1 C 2
means if the packet comes from H1 on connection 1, send to C on connection 5; if from H3 on connection 1, send to C on connection 2. Notice the network can handle multiple connections from and to the same pair of hosts.
Application: Multi-Protocol Label Switching (MPLS): used in ISPs to set up connections inside their backbone; adds an MPLS label to IP packets at ingress, removes it at egress.
Hard to mask failure: have to replace router if failing.
Easy to add quality of service.
Internetworking
Hard because many things differ between networks: service model, addressing, quality of service (eg some packets are priority and need better service), packet sizes, etc
Example: wifi datagram network to MPLS VC network to Ethernet datagram network
IPv4 packet format:
Version (4)
IP Header Length (IHL, length of everything before payload)
Total length
Protocol: what protocol is inside the IP, eg TCP
Header checksum
Source address
Destination address (both 32 bits)
IP Addresses:
written in 'dotted quad' notation
Blocks of IP addresses share a common prefix, which is a group of bits at th
e beginning of the address.
Classful addressing (still embedded in addresses but ignored):
Class A: start with 0, prefix 8 -> there are 127 of them, each supporting 2^24 hosts
Class B: start with 10, prefix 16
Class C: start with 110, prefix 24
Routers have a table of prefixes. The prefixes may overlap. The longest matching prefix, ie the most specific (fewest addresses in the block), wins.
Example: router has 2 entries:
prefix          next hop
192.24.0.0/18   D    # from 192.24.0.0 to 192.24.63.255 (last 16 bits 00/000000 00000000 to 00/111111 11111111)
192.24.12.0/22  B    # from 192.24.12.0 to 192.24.15.255 (last 16 bits 000011/00 00000000 to 000011/11 11111111) -> lies within the D range, but is more specific
Addresses:
192.24.6.0   -> last 16 bits 00000110 00000000 -> fits D but not B -> D
192.24.14.32 -> last 16 bits 00001110 00100000 -> fits both B and D -> B because B is more specific
192.24.54.0  -> last 16 bits 00110110 00000000 -> fits D but not B -> D
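The three lookups above can be reproduced with the standard library's ipaddress module:

```python
import ipaddress

# the router's two entries from the example
table = [
    (ipaddress.ip_network("192.24.0.0/18"), "D"),
    (ipaddress.ip_network("192.24.12.0/22"), "B"),
]

def lookup(addr: str) -> str:
    ip = ipaddress.ip_address(addr)
    matches = [(net, hop) for net, hop in table if ip in net]
    # longest prefix = most specific block wins
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("192.24.6.0"))     # D
print(lookup("192.24.14.32"))   # B (matches both, /22 is longer)
print(lookup("192.24.54.0"))    # D
```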
Hosts also have a forwarding table. Only 2 entries are needed:
network's prefix -> means on the local network, so go to that IP address directly
0.0.0.0 -> catches everything else, which is the router's job, so send to the router
Routers can do this too, eg send all addresses not in their network to their ISP
DHCP (Dynamic Host Configuration Protocol): helps nodes get their IP addresses when they boot (before that, nodes only have their Ethernet address)
leases IP addresses to nodes
provides other params: network prefix, address of local router, DNS server, time server, etc
uses UDP ports 67 for client -> server and 68 for server -> client
Bootstrap issue: how does a node know where the DHCP server is to ask it what the node's IP address is? It broadcasts a message and lets the DHCP server pick it up. The broadcast address is all 1's.
Client sends server a Discover request to look for the DHCP server
Server sends an Offer of a new IP address to client
Client sends a Request for that IP address to server
Server sends an ACK to acknowledge giving the client the address
-----DORA
If the lease is up, the client only needs to do the last 2 steps to rene
w its current address.
ARP (Address Resolution Protocol): given IP address, find Ethernet (MAC) address
. Useful for sending to nodes within same network, where you know the receiver's
IP address but not their MAC address.
Sender broadcasts Request to the network eg 'who has IP 1.2.3.4?'.
Receiver receives it and Replies, 'I do, at 1:2:3:4:5:6'
Packet fragmentation problem: different networks have different packet sizes aka
Maximum Transmission Unit (MTU).
We want to send as large packets as possible to reduce header overhead.
Solutions:
fragmentation: routers along the way split up large packets, end host re
assembles them
in each packet header, there are total length, packet ID (fragments
of the same packet have the same ID), byte offset of current packet, and MF (Mor
e Fragment - whether there's more after this one) fields
Cons:
lots of work for the sending router
receiving router has to save fragments in buffer until it has al
l the pieces
more likely to lose packets
security vulnerability since it's harder to see what's in a pack
et
Path MTU discovery: sender sends a large packet. If destination network
doesn't support it, destination router will send back a message with the MTU it
does accept. Sender sends another packet in this new size. May be repeated furth
er along the network.
Works well because there are only a limited number of sizes
Implemented with ICMP
Don't Fragment (DF) bit is set so error will be returned if packet i
s too large.
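The fragmentation fields described above (shared packet ID, byte offset in 8-byte units, MF flag) can be sketched as follows. The dict layout is illustrative, not the exact IP header format:

```python
def fragment(payload: bytes, packet_id: int, mtu_payload: int):
    """Split a payload into fragments; offsets count 8-byte units as in IP."""
    assert mtu_payload % 8 == 0, "non-final fragment sizes must be multiples of 8"
    frags = []
    for start in range(0, len(payload), mtu_payload):
        chunk = payload[start:start + mtu_payload]
        frags.append({
            "id": packet_id,                           # same for all fragments of one packet
            "offset": start // 8,                      # position, in 8-byte units
            "mf": start + mtu_payload < len(payload),  # More Fragments after this one?
            "data": chunk,
        })
    return frags

def reassemble(frags):
    """End host puts fragments back in order by offset."""
    frags = sorted(frags, key=lambda f: f["offset"])
    assert not frags[-1]["mf"]    # the last fragment carries MF = 0
    return b"".join(f["data"] for f in frags)

data = bytes(range(100)) * 30     # a 3000-byte packet
frags = fragment(data, packet_id=7, mtu_payload=1480)
assert reassemble(frags) == data
```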
Internet Control Message Protocol (ICMP): sits on top of IP, provides error repo
rt and testing
An ICMP packet is sent by the destination router whenever a packet was too l
arge.
Format:
type
code
checksum
start of bad packet
Examples:
    Name                                    Type/Code   Usage
    Destination Unreachable (net or host)   3/0 or 1    lack of connectivity
    Destination Unreachable (fragment)      3/4         path MTU discovery (this error means the packet
                                                        needed to be fragmented but the DF bit was set)
    Time Exceeded (Transit)                 11/0        traceroute
    Echo Request or Reply                   8 or 0/0    ping
There's a TTL (Time to live) field in the IP header that's decremented each hop. ICMP error when it reaches 0. This protects against forwarding loops.
Traceroute works by sending messages with increasing TTL, starting at 1. Thus, at each subsequent step, it gets a Time Exceeded message back from the router at that hop. That's how it knows where the packets are going.
IPv6:
128 bits, 8 groups of 4 hex digits
Can write shorthand by skipping initial 0s in each group and skipping groups
of 0000 altogether
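Python's ipaddress module applies exactly those shorthand rules (drop leading zeros in each group, collapse one run of all-zero groups to '::'); the address below is a made-up documentation-range example:

```python
import ipaddress

full = "2001:0db8:0000:0000:0000:0000:0000:0001"
addr = ipaddress.ip_address(full)
print(addr.compressed)  # 2001:db8::1  (shorthand form)
print(addr.exploded)    # back to the full 8-group form
```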
Will need to be compatible with IPv4 for a long time.
Solutions:
Dual stack: hosts speak both IPv4 and IPv6
Translators: translate packets back and forth
Tunnels: to cross IPv4 links that connect islands of IPv6 networks, by w
rapping the IPv6 packet
NAT boxes:
present in network devices
maps many internal IP addresses onto 1 external IP address, using one of the private prefixes internally, eg 192.168.0.0/16 or 10.0.0.0/8
works by keeping an internal/external table of IP addresses and ports: 2 internal hosts map to the same external address but different external ports, so internally they're told apart by the (address, port) pair.
the NAT box rewrites the IP header to change the internal address to the external one for outgoing packets and vice versa for incoming.
The table is populated on a host's first attempt to send a packet. The NAT w
ill then copy down the host's internal address and port, and the external addres
s, and assign the host a new random port.
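The table mechanics above can be sketched as a toy NAT box. The external IP, the port range, and sequential (rather than random) port assignment are illustrative choices, not anything NAT mandates:

```python
import itertools

class NatBox:
    """Toy NAT: maps (internal IP, internal port) <-> external port."""

    def __init__(self, external_ip="203.0.113.5"):   # made-up external address
        self.external_ip = external_ip
        self.ports = itertools.count(40000)          # hand out fresh external ports
        self.out_map = {}                            # (int_ip, int_port) -> ext_port
        self.in_map = {}                             # ext_port -> (int_ip, int_port)

    def outgoing(self, int_ip, int_port):
        """Rewrite an outgoing packet's source; populate the table on first use."""
        key = (int_ip, int_port)
        if key not in self.out_map:
            ext_port = next(self.ports)
            self.out_map[key] = ext_port
            self.in_map[ext_port] = key
        return self.external_ip, self.out_map[key]

    def incoming(self, ext_port):
        """Map an incoming packet back to the internal host, or None (dropped)."""
        return self.in_map.get(ext_port)

nat = NatBox()
a = nat.outgoing("192.168.0.2", 5000)
b = nat.outgoing("192.168.0.3", 5000)   # same internal port, different external port
assert a != b
assert nat.incoming(a[1]) == ("192.168.0.2", 5000)
assert nat.incoming(9999) is None       # weakness: unsolicited traffic can't get in
```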
Weaknesses:
traffic can't come in without first having outgoing traffic so the NAT e
ntry can be set up
apps that expose the IP address can't be used
Routing:
one way is the spanning tree algorithm, but it's inefficient because the lin
ks removed to avoid loops could be the shortest path.
Delivery models:
unicast: one sender to one receiver
broadcast: 1 sender to all receivers
multicast: 1 sender to several but not all receivers
anycast: to nearest receiver
Properties of a good routing scheme:
correctness: found paths should work
efficiency: use bandwidth well
fairness: all nodes should have the ability to send/receive
fast convergence: recover from failures quickly
scalability: as networks grow
Finding best unicast routes:
define a cost function to compute the cost of each path
factors the cost function can take in: distance, data cost, number of ho
ps, etc
subsets of the shortest path will also be shortest paths.
sink tree: union of all shortest paths to a node from all other nodes (s
ink = destination node)
useful because all those paths may converge in a few links going to
the sink, so once we get to the converging node, we don't need to care where the
packet is coming from.
Dijkstra's algorithm:
need to know complete network topology
finds shortest path to all nodes from a given node
steps:
mark all nodes tentative
set distances to 0 for the source and infinity for all other nodes
while tentative nodes remain:
extract N, a node with lowest distance
add link to N to the shortest path tree
relax the distances of neighbors of N by lowering any better dis
tance estimates
example: Fig 2
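The steps above can be sketched directly in Python (the graph below is a small made-up example, not the one in Fig 2):

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> {neighbor: cost}."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]       # tentative nodes, keyed by current distance
    done = set()
    while heap:
        d, n = heapq.heappop(heap)          # extract a node with lowest distance
        if n in done:
            continue
        done.add(n)                         # n's distance is now final
        for neighbor, cost in graph[n].items():
            if d + cost < dist[neighbor]:   # relax: lower any better estimate
                dist[neighbor] = d + cost
                heapq.heappush(heap, (dist[neighbor], neighbor))
    return dist

g = {
    "A": {"B": 2, "C": 5},
    "B": {"A": 2, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}
print(dijkstra(g, "A"))  # {'A': 0, 'B': 2, 'C': 3, 'D': 4}
```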
Distance Vector Routing:
doesn't require knowing the whole network topology like Dijkstra's
rarely used in practice
nodes keep distance vectors about each other, as well as next hops to all de
stinations
steps:
initialize vector with 0 cost to self, infinity cost to other destinatio
ns
periodically send vector to neighbors
update vector for each destination by selecting the shortest distance he
ard, after adding cost of neighbor link
use best neighbor for forwarding
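One round of the update rule above can be sketched like this (node names, link costs, and advertised vectors are made-up example values):

```python
INF = float("inf")

def dv_update(my_vector, neighbor_vectors, link_costs):
    """One distance-vector round: for each destination, take the best of
    (cost of the link to a neighbor + that neighbor's advertised distance)."""
    new_vector, next_hops = dict(my_vector), {}
    for dest in my_vector:
        for nbr, vec in neighbor_vectors.items():
            d = link_costs[nbr] + vec.get(dest, INF)
            if d < new_vector[dest]:
                new_vector[dest] = d
                next_hops[dest] = nbr      # use best neighbor for forwarding
    return new_vector, next_hops

# Node A with neighbors B (link cost 1) and C (link cost 4)
mine = {"A": 0, "B": INF, "C": INF, "D": INF}
nbrs = {"B": {"A": 1, "B": 0, "C": 2, "D": 5},
        "C": {"A": 4, "B": 2, "C": 0, "D": 1}}
vec, hops = dv_update(mine, nbrs, {"B": 1, "C": 4})
assert vec == {"A": 0, "B": 1, "C": 3, "D": 5}
assert hops["D"] == "C"   # 4 + 1 = 5 beats 1 + 5 = 6 via B
```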
problem (count to infinity): if a link failure splits the network into islands, nodes cut off from a destination will try to reroute via their neighbors. But those neighbors advertise routes that secretly pass back through the very nodes asking, so each round the advertised distances grow, and the nodes keep discovering longer and longer paths until the cost reaches infinity.
there are heuristics to fix this, eg 'split horizon, poison reverse' (don't advertise a route back to the neighbor you learned it from, or advertise it back with infinite cost), but they're not very effective -> link state algorithms, eg running Dijkstra's, are more popular in practice
Routing Information Protocol (RIP):
use hop count as metric
infinity = 16 hops -> can only use in small networks
includes split horizon, poison reverse
routers send vectors every 30s
run on top of UDP
timeout in 180s to detect failures
Flooding: send the same message to every other node. When a node receives a flooded message, it forwards it on all its links except the one it arrived on. Nodes should recognize messages they've already seen so they don't pass them on again. The same node may still receive the message multiple times from multiple paths -> inefficient.
Link state algorithm:
each node floods a link state packet (LSP) that describes their portion of t
he topology
all nodes combine all LSPs to get full topology
each node runs Dijkstra
if there's a link failure, the nodes on either end would notice and broadcas
t their distance to each other as infinity so the network will adapt.
Equal-cost multi-path routing:
enable using multiple paths of the same minimum cost between the same 2 node
s
instead of keeping a next hop, keep a set
source and sink trees become DAGs
if packets between the same pair of nodes are sent on multiple paths, they c
an arrive out of order. So we try to send messages between same pairs on the sam
e path.

Hierarchical routing: to reduce size of routing tables and number of routing mes
sages as the number of hosts grow.
divide hosts into regions. routers talk to regions instead of hosts.
can be less efficient sometimes (not taking most direct route)
IP aggregation and subnetting:
routers can change the prefix without affecting hosts.
thus, we can split a less specific prefix into more specific ones, eg divide 1 /16 network into several /18 networks -> subnetting
joining multiple smaller prefixes into one larger prefix is aggregation
aggregation helps keep the routing table small because there's only 1 prefix to save.
Networks on the internet are connected directly or through Internet Exchange Poi
nts (IXPs)
Different networks may have different routing policies, which can cause inef
ficiency
eg two ISPs want to choose the shortest route within their network, but
when they want to talk, the shortest path in the sender ISP leads to a longer pa
th in the destination ISP. Also the other direction may not use the same path.
Example policies:
transit policies: between customers and ISPs: customers can use ISPs to
send and receive data anywhere on the network, not just from and to other hosts
in the ISP
peer policies: between ISPs, so that one ISP can send and receive traffi
c meant for their customer going through the other ISP.
Border Gateway Protocol (BGP): the interdomain routing protocol used in the
internet.
is a path vector protocol: like distance vector protocol but stores path
s instead of distances
different parties like ISPs are called Autonomous Systems (AS)
border routers of ASs announce BGP routes to each other, including IP prefix, path vector (list of ASs on the way to the prefix, used to detect loops) and next hop, in the opposite direction of the traffic
kinds of policies:
transit: if A buys transit service from B, B sends to and receives f
rom A, but A doesn't do any intermediate routing for B. A is the end of the line
, similar to customers buying service from ISPs
peer: if A and B are peers, they send to and receive from each other
, without going through a bigger ISP, but they don't carry intermediate traffic.
So if A peers with B, B peers with C, A still can't send to C through B.
example: AS1 is a big ISP. AS2, AS3, AS4 are small ISPs that rent ac
cess from AS1. Prefixes A, B, C are on AS2, AS3, AS4 respectively. AS1 tells AS2
[B, (AS1, AS3)], meaning you can get to B by going to AS1 then AS3. Traffic fro
m AS1 to AS2 uses the transit policy, from AS2 to AS1 uses customer policy, and
AS2 to AS3 uses peer policy.
Transport layer
the data it carries isn't examined along the way, unlike lower-layer headers (which are examined by switches and routers)
unit is segment; has TCP header and payload
can provide unreliable or reliable transport
unreliable: UDP (uses messages)
messages may be lost, reordered, duplicated
limited message size
pretty much like IP
doesn't take into account receiver's state or network state
reliable: TCP (uses bitstreams)
sends each byte only once, reliably and in order
arbitrary length content
takes into account receiver's state and network state
uses the Socket API
sockets let apps attach to the transport layer using different ports
used for connection-oriented transport: LISTEN, ACCEPT, CONNECT
used for connectionless transport: SEND(TO), RECEIVE(FROM)
ports are 16-bit
an application process is identified by the tuple: IP address, protocol
(TCP or UDP) and port
servers often bind to ports under 1024, aka 'well-known ports'
clients can use any port as long as they announce it to the server
OS often assigns 'ephemeral' ports to clients
common ports:
20, 21: FTP
22: SSH
25: SMTP
110: POP3
143: IMAP
443: HTTPS
554: RTSP (media player control)
631: IPP (printer sharing)
UDP
used by apps that don't want reliability or bytestreams, eg
VOIP (unreliable)
DNS, RPC (message-oriented)
DHCP (bootstrapping)
datagram size up to 64KB
header format:
source port (16b)
destination port (16b)
UDP length (16b)
UDP checksum (16b)
TCP
connection establishment:
sender and receiver must agree on a set of params eg Maximum Segment
Size (MSS), aka signaling, similar to dialing a phone
three way handshake: opens connection for data in both directions
each side probes the other with a fresh Initial Sequence Number
(ISN)
sends on a SYNchronize segment
echo on an ACKnowledge segment
eg client (active party) sends server (passive party) a SYN
and an SN
server responds with a SYN with its own SN, and an ACK in th
e same segment for efficiency, with the client's SN + 1
client sends an ACK with the server's SN + 1
connection is now established
this way guards against delayed or duplicated packets because th
e SNs will be off
connection establishment FSM: figs 3, 4
Closing connection:
4 way handshake
active party sends FIN x (FIN = finish, x = arbitrary SN)
passive party sends ACK x+1
passive party sends FIN y
active party sends ACK y+1
active party waits for twice the max segment lifetime (60s) in c
ase the ACK was lost, in which case a second FIN will be sent
active party closes connection
connection closing FSM: figs 5, 6
Sliding window:
to maximize number of packets that can be sent given a network capacity.
builds on Stop-and-Wait: sends 1 packet, waits for ACK, sends another packet
SaW is inefficient: if R = 1 Mbps, D = 50 ms, the roundtrip time is 100 ms, so SaW can only send 10 packets/s; at 10-kbit packets that's 100 kbps, only 1/10 the network capacity. Same with unlimited R.
Sliding window allows W packets to be outstanding -> can send W packets
per roundtrip time.
Needs w = 2BD bits to fill network path
Explanation:
roundtrip time is 2D, so we need W to be as many packets as can be s
ent within 2D time. If the bandwidth is B b/s, and the propagation delay is D, i
n 2D time we can have max 2BD bits on the wire, and that's what we want - for th
e wire to be fully occupied.
Example: R = 1 Mbps, D = 50 ms
w = 10^6 bits/s * 50 * 10^-3 s * 2 = 10^5 bits = 100 kbits
If a packet is 10 kbits, the sliding window is 10 packets.
if R = 10 Mbps, the window is 100 packets
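The W = 2BD arithmetic above, written out (numbers are the ones from the example):

```python
def window_bits(bandwidth_bps, one_way_delay_s):
    """Bits that must be in flight to fill the pipe: W = 2 * B * D
    (one full round trip's worth of data)."""
    return 2 * bandwidth_bps * one_way_delay_s

w = window_bits(1e6, 50e-3)            # R = 1 Mbps, D = 50 ms
assert w == 100_000                    # 100 kbits, as in the example
assert w / 10_000 == 10                # = 10 packets of 10 kbits each
assert window_bits(10e6, 50e-3) / 10_000 == 100   # R = 10 Mbps -> 100 packets
```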
Sender
keeps 2 state variables, indexed by SN:
last frame sent (LFS)
last ACK received (LAR)
while LFS - LAR < W, more packets can be sent
As ACKs are received, the set of unACKed frames move along to th
e right, hence 'sliding window'
Receiver
Many versions of sliding window:
Go-Back-N
simplest, inefficient with errors
receiver keeps a single packet buffer
state variable: last ACK sent (LAS)
when receiving
if SN = LAS + 1
accept and pass to app
update LAS
send ACK
else
discard as out of order
handling retransmissions
keeps 1 timer
when timer goes off, if not ACKed, resend all packet
s starting at LAS + 1
Selective Repeat
more complex, better performance
buffers out of order segments to reduce retransmissions, but
still pass packets to app in order
buffer contains w packets (max outstanding at any time)
buffers packets from LAS + 1 to LAS + w
pass up to app in-order segments from LAS + 1
update LAS
send ACK for LAS whether LAS has changed or not
handling retransmissions
keeps 1 timer for each unACKed segment
on timeout, resend that segment
hope to resend fewer segments
setting timeouts:
easy on ethernet because RTT is short and predictabl
e
harder on the internet
adaptive timeout: keeps 2 values
smoothed RTT (SRTT):
SRTT(n+1) = 0.9 * SRTT(n) + 0.1 * RTT(n+1)
    -> new estimate is 90% old estimate and 10% new data
variance in RTT:
Svar(n+1) = 0.9 * Svar(n) + 0.1 * |RTT(n+1) - SRTT(n+1)|
    -> the new sample is how far the new RTT deviates from the smoothed estimate
effect of these values is to smooth out sharp va
riations in RTT
eg Fig 7, 8 (adaptive timeout)
set timeout to a combination of these estimates
TCP: timeout(n) = SRTT(n) + 4 * Svar(n)
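The two moving averages and the timeout formula above can be sketched as (RTT samples below are made-up values in ms):

```python
def adaptive_timeout(rtt_samples, alpha=0.1):
    """Adaptive timeout: EWMAs of RTT and of RTT variation.
    alpha = 0.1 matches the 90% old / 10% new mix above;
    returns SRTT + 4 * Svar as in TCP."""
    srtt, svar = rtt_samples[0], 0.0
    for rtt in rtt_samples[1:]:
        srtt = (1 - alpha) * srtt + alpha * rtt              # SRTT(n+1)
        svar = (1 - alpha) * svar + alpha * abs(rtt - srtt)  # Svar(n+1)
    return srtt + 4 * svar

assert adaptive_timeout([100, 100, 100]) == 100   # steady RTT -> timeout = SRTT
print(adaptive_timeout([100, 100, 400, 100, 100]))  # a spike raises it, smoothly
```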
Sequence numbers: need more values than the 0/1 of SaW
Go-Back-N
W + 1
Selective Repeat
W for packets and W for ACKs of earlier packets
typically implemented with a N-bit counter that wraps around
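The Go-Back-N receiver discipline above can be sketched as a toy simulation (not real TCP; SNs start at 1 for simplicity):

```python
def gbn_receiver(arrivals):
    """Go-Back-N receiver: accept only the next in-order SN, discard the rest;
    send a cumulative ACK for the last in-order SN after every arrival."""
    las, delivered, acks = 0, [], []
    for sn in arrivals:
        if sn == las + 1:
            delivered.append(sn)   # in order: pass to app
            las = sn               # update LAS
        acks.append(las)           # ACK LAS whether it changed or not
    return delivered, acks

# Packet 2 is lost, 3 and 4 arrive early and are discarded, then all are resent
delivered, acks = gbn_receiver([1, 3, 4, 2, 3, 4])
assert delivered == [1, 2, 3, 4]
assert acks == [1, 1, 1, 2, 3, 4]   # note the duplicate ACKs of 1
```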
Flow control
for when the receiver gets more packets than it can handle (ie the app d
oesn't call RECEIVE to retrieve packets in the buffer fast enough)
without flow control, if the receiver gets more packets than the buffer
has room for, it'll have to discard packets.
to avoid, tell the sender the available buffer space (aka WIN)
sender uses min(W, WIN) as effective window size
TCP
message boundaries not preserved between send() and receive(), eg send a
s 4 512 B packets but receive as 1 2048 B packet
bidirectional data transfer: can combine ACKs for one direction with dat
a going in the other direction
receiver sends cumulative ACK containing next expected SN (LAS + 1)
receiver also optionally sends selective ACKs (SACK) to report its buffe
r state
receiver can ACK up to 3 ranges of bytes at a time (eg 1-100 and 201-300
but not 101-200)
TCP uses a 'three duplicate ACK' heuristic to save some time rather than
wait for timeout
if 4 ACKs come back with a cumulative ACK of 100, but the last 3 als
o reporting additional received packets of 201-300, 201-400 and 201-500, then th
e 101-200 range is probably lost, so resend even if the timeout isn't up.
Congestion control
Cause: recall that routers have buffers for each output line. When there
's more input than buffer space, congestion happens.
Goodput: rate of new, unrepeated traffic (~throughput).
A network collapse happens when the offered load goes up to a point wher
e packets begin to be dropped at the receiver, and have to be retransmitted. Alt
hough the data transfer rate is still high, the actual goodput is low because a
lot of it is repeat data.
Need a good bandwidth allocation policy that's efficient (uses all netwo
rk capacity) and fair (doesn't choke a sender just because it causes congestion)
need to work with both transport and network layers
transport because it's responsible for injecting traffic into th
e network
network because that's where the router is, so it knows when con
gestion is happening
efficiency is more important than fairness
fairness is a vague concept: should connections that use more ho
ps be penalized? connections that send more data?
-> more important to just avoid starvation
bottleneck: the lowest-capacity link (itself in a 1-link connection)

Max-Min Fairness:
"maximize the minimum"
increase all flows' rates equally until some link reaches capacity; hold the flows through that link constant and keep increasing the remaining flows until another link saturates, and so on
ex: Fig 9 (max-min)
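The "maximize the minimum" idea can be sketched for the single-link case (the Fig 9 example involves multiple links; this simplification just splits one link's capacity among flows with given demands):

```python
def max_min(capacity, demands):
    """Max-min fair shares of one link: repeatedly satisfy every demand below
    the equal share; leftover capacity is split among the rest."""
    alloc, remaining, pending = {}, capacity, dict(demands)
    while pending:
        share = remaining / len(pending)
        small = {f: d for f, d in pending.items() if d <= share}
        if not small:
            for f in pending:
                alloc[f] = share       # everyone left gets the equal share
            return alloc
        for f, d in small.items():     # small demands are fully satisfied...
            alloc[f] = d
            remaining -= d             # ...and free up capacity for the rest
            del pending[f]
    return alloc

# 10 Mbps link; A wants 2, B wants 8, C wants 10 (made-up numbers)
print(max_min(10, {"A": 2, "B": 8, "C": 10}))  # {'A': 2, 'B': 4.0, 'C': 4.0}
```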
Bandwidth allocation models
open loop: reserve bandwidth before use
closed loop: use feedback to adjust rates
enforced by host (so host must play nice) or network?
expressed in sliding window size or absolutes eg packets/s?
-> TCP: closed loop, host driven, window based
    Host driven: the network gives the transport layer a binary signal saying whether congestion is happening. Sender uses a "control law" to adjust the sliding window size.
Additive Increase Multiplicative Decrease (AIMD): a control
law where hosts additively (adding a constant) increase rate while network is no
t congested and multiplicatively (dividing by a constant) decrease rate when con
gestion happens
Fig 10
Very efficient because only needs 1 binary signal
Works better than any other control law (MIAD, MIMD, AIA
D)
ACK Clocking
helps sender pace packets according to slowest link's capacity
at first sender would send as fast as the fastest link
then the packets would bunch up in the bottleneck and slow down when the
y get to the receiver
so the ACKs would come back at the same slow rate
the sender would perceive the slower rate and start sending at that rate
instead
in TCP this is called the congestion window or cwnd
cwnd caps how much the sender may have outstanding, vs the flow control window which caps how much the receiver can accept
the rate is roughly cwnd/RTT
Slow Start: way to allocate bandwidth more quickly than AIMD, used in TCP
window size starts at 1, doubles with every RTT until packet loss occurs
, then switch to AIMD to zero in on the right rate
-> the increase is slow at first (1, 2, 4) but accelerates quickly becau
se it's exponential
set a ssthresh variable that's infinity at first, then after the first p
acket loss, change it to cwnd/2
next time when cwnd >= ssthresh, switch from Slow Start to AIMD
after a timeout, the sender starts SS again with cwnd = 1
Difference from AIMD:
SS sends 2 packets with every ACK received, effectively doubling cwn
d every RTT because during one RTT, sender will have received cwnd ACKs (Fig 11
- SS timeline)
AIMD sends 1 more packet with every cwnd ACKs received, so only addi
ng 1 packet every RTT
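A toy per-RTT trace of cwnd under these rules (an illustrative simulation, not real TCP; it models only the timeout-and-restart behavior described above):

```python
def cwnd_trace(rounds, loss_round):
    """cwnd per RTT: slow start doubles cwnd; after a loss, ssthresh = cwnd/2,
    cwnd restarts at 1, and growth turns additive once cwnd reaches ssthresh."""
    cwnd, ssthresh, trace = 1, float("inf"), []
    for r in range(rounds):
        trace.append(cwnd)
        if r == loss_round:
            ssthresh = cwnd / 2       # remember half the rate that caused loss
            cwnd = 1                  # timeout: back to slow start
        elif cwnd < ssthresh:
            cwnd *= 2                 # slow start: exponential growth
        else:
            cwnd += 1                 # AIMD additive increase
    return trace

print(cwnd_trace(10, loss_round=4))  # [1, 2, 4, 8, 16, 1, 2, 4, 8, 9]
```

The trace shows both regimes: exponential up to the loss at cwnd = 16, then slow start again up to ssthresh = 8, then additive increase.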
Fast Retransmit: alternative to the MD part of AIMD
Recall that ACKs contain the SN of the last cumulative packet received p
lus 1, so if packets 13, 15, 16, 17 were received, the ACKs for 15, 16, 17 will
only report 14, because 14 wasn't received
After 3 duplicate ACKs (so really the 4th ACK bearing the same SN), the
sender will assume the next SN was lost and resend that
This way in the above example, receiver will continue to send ACK 14 a f
ew more times, then it'll get 14 and ACK 18 because it also received 15, 16, 17
Now everything can continue as before
This way the sender can detect a single loss faster than waiting for a conservative timeout
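The three-duplicate-ACK rule can be sketched as a small detector (a toy model of the sender's bookkeeping, not real TCP):

```python
def detect_loss(acks, dup_threshold=3):
    """Fast retransmit trigger: after dup_threshold duplicate ACKs (i.e. the
    same cumulative ACK seen dup_threshold + 1 times), assume that SN was lost
    and return it for retransmission."""
    count = {}
    for ack in acks:
        count[ack] = count.get(ack, 0) + 1
        if count[ack] == dup_threshold + 1:
            return ack                 # resend this SN without waiting for timeout
    return None

# Receiver got 13, 15, 16, 17 -> it keeps ACKing 14 (the next expected SN)
assert detect_loss([14, 14, 14, 14]) == 14
assert detect_loss([14, 15, 16, 17]) is None   # ACKs advancing: nothing lost
```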
Fast Recovery
While waiting for the ACK to change, sender can continue to send new pac
kets because the duplicate ACKs mean other packets were still received. Thus the
ACK clock is kept going
Instead of reverting to slow start on a packet loss, just adjust cwnd to
current cwnd/2 and switch to AI
Note all this can only repair single packet loss; we still need timeouts for
bigger losses.
Used in TCP Reno.
Explicit Congestion Notification (ECN): router actively reports congestion
Not deployed yet
Router queues should normally be 0. When router senses the queue buildin
g up, meaning congestion is happening, it flips a bit in the IP header
The receiver sees the flipped bit and notifies the sender
Advantages
clear signal
early detection
no loss
no extra packets to send
Disadvantages
need to upgrade router and hosts
Session layer
services for getting different resources for the same application (hence sam
e session)
Presentation layer
indicate type and encoding of content
type eg image/jpeg, text/html
encoding eg gzip
headers are small, so they're left as readable text rather than packed into binary, which makes them easier to read and debug
Application Layer
doesn't always have a GUI, eg DNS
DNS
maps from names (high level resource identifiers) to IP addresses (low l
evel addresses)
before DNS, used HOSTS.TXT file regularly retrieved for all hosts from a
central machine at the Network Information Center (NIC)
not efficient, prone to naming conflicts, have to work with NIC to g
et your name in, unnecessary entries, etc
DNS is hierarchical, starting from "." (typically omitted)
generic
    aero -> top level domain (TLD)
    com
    edu
        washington
            cs
                robot
            eng
    gov
    org
    ...
countries
    au
        edu
            uwa
    jp
    uk
    ...
the namespace is divided into zones, eg "edu" or "washington + cs", that
are administered by the same entity

each zone has a nameserver that people should contact to access reso
urces in that zone
a zone is comprised of DNS resource records
type A: IPv4 address of a host
AAAA ("quad A"): IPv6 address
NS: nameserver of domain
CNAME: canonical name for an alias (eg galah.washington.edu may
be the same as cs.washington.edu)
SOA (Start of Authority): key info about the zone (nameserver, v
alidity period of entries, etc)
the upper level nameservers should have pointers to lower level ones
when client requests a domain name, client's nameserver first asks the root
servers, which tells it the lower level nameserver to ask next, until it gets to
a nameserver that has the requested domain in its zone
eg for filts.cs.vu.nl to access cs.washington.edu
filts.cs.vu.nl asks its local (cs.vu.nl) nameserver
local nameserver asks root nameserver (a.root-servers.net)
root nameserver doesn't know but knows where to find .edu nameserver
local nameserver asks .edu nameserver
.edu nameserver doesn't know but knows where to find washington.edu
nameserver
local nameserver asks washington.edu nameserver
washington.edu nameserver doesn't know but knows where to find cs.wa
shington.edu nameserver
local nameserver asks cs.washington.edu nameserver
requested domain is within cs.washington.edu's zone, so it gives the
local nameserver its IP address
local nameserver sends IP address back to client
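The local nameserver's iterative loop above can be sketched against a toy delegation tree. The zone data structure, server names, and the final IP address are all made-up stand-ins for illustration:

```python
# Toy delegation data: each server either delegates a suffix or knows the answer.
ZONES = {
    ".":      {"delegate": {"edu.": "edu-ns"}},
    "edu-ns": {"delegate": {"washington.edu.": "uw-ns"}},
    "uw-ns":  {"delegate": {"cs.washington.edu.": "cse-ns"}},
    "cse-ns": {"answer":   {"cs.washington.edu.": "128.95.155.135"}},  # hypothetical IP
}

def resolve(name, server="."):
    """Iterative resolution: each server returns either the answer or a
    referral to a more specific nameserver; we keep asking until done."""
    hops = []
    while True:
        zone = ZONES[server]
        hops.append(server)
        if name in zone.get("answer", {}):
            return zone["answer"][name], hops
        for suffix, next_server in zone.get("delegate", {}).items():
            if name.endswith(suffix):
                server = next_server   # follow the referral down the tree
                break
        else:
            return None, hops          # nobody knows this name

ip, hops = resolve("cs.washington.edu.")
print(ip, hops)   # walks root -> edu -> washington.edu -> cs.washington.edu
```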
notice there were 2 kinds of queries
recursive: the kind done by client to local nameserver: nameserver does all the resolution work and returns the complete answer to the client
    offloads the resolution burden from the client to the server
iterative: done by local nameserver to all other nameservers: na
meserver returns either the answer or a delegated subdomain and doesn't look fur
ther
good for servers with high loads
queries can be cached for instant access
there's a TTL field, usually between hours and days, during whic
h the cache is up to date
can cache partial addresses
eg if local nameserver doesn't know the IP of cs.washington.
edu but does know washington.edu, it can go directly to washington.edu
local nameserver address is part of info returned by DHCP
there are 13 root nameservers: a.root-servers.net through m.
all nameservers need to know the IPs of these 13
that's configured in a file called named.ca (ca = cache) which helps the nameserver bootstrap
there are actually 250 instances of the root nameservers, but th
ey share the 13 IP addresses using IP anycast, which lets multiple locations hav
e the same IP (clients are routed to the geographically closest one)
DNS Protocol
built on UDP, port 53
use ARQ for reliability
query and response share a 16 bit message ID field
nameservers tend to be redundant which also helps with load balancing
DNS needs security so clients aren't directed to the wrong machine
DNSSEC (DNS with security extension) is being deployed
HTTP
designed to transmit various web resources in addition to HTML (JS, CSS
etc)

how it works:
start with the page URL
protocol (http://)
server (en.wikipedia.org)
page on server (/wiki/Vegemite)
resolve the server to IP address
set up TCP connection to server
send HTTP request for the page
get HTTP response
execute/fetch embedded resources/render (eg JS/images/CSS respective
ly)
clean up any idle TCP connections
methods: GET, POST et al
Response codes:
1xx: information
2xx: success
3xx: redirection
4xx: client error
5xx: server error
Headers
browser capabilities (client -> server)
User-Agent
Accept
Accept-Charset
Accept-Encoding
Accept-Language
caching related (both)
If-Modified-Since
Date
Last-Modified
Expires
browser context (client -> server)
Cookie
Referer (spelled that way in the standard)
Authorization
Host
content delivery
Content-Encoding
Content-Length
Content-Type
Set-Cookie
HTTP performance
measures
Page Load Time (PLT): key measure but hard to measure
early HTTP loads one resource after another, each in its own TCP connection -> inefficient (no parallelism, and repeated TCP setup overhead when a single connection would have sufficed)
to reduce PLT:
reduce content size, including taking client properties into acc
ount, eg big screen -> send big image; mobile phone -> send small image
compression
parallel connections: make x TCP connections at once
connections compete with each other
persistent connections: make 1 TCP connection and use it for mul
tiple requests
doesn't cause network bursts and loss like parallel
saves time because client would otherwise need to send a Con
nect request to the server with each TCP connection
also no slow-start
pipelining: like persistent but as soon as client finds out it needs to make x more requests (eg after the page downloads, it finds there's x images on the page), it makes those requests one after another instead of waiting for the response to each
caching: access local copy instead of requesting server again
to know if local copy is still valid:
Expires header if available
heuristics -> content available right away
is content cacheable (does it look static or dynamic
)
was it recently validated (recently fetched)
was it not modified recently
ask the server using a conditional GET (still saves work
because doesn't need to retransmit the whole content) -> content available afte
r 1 RTT
use Last-Modified header from server
use Etag (hash of content) header from server
web proxies: intermediary between pools of clients and e
xternal web servers
bigger shared cache -> one client can access the cac
he downloaded for another client earlier
security checking (content scanning)
firewalls (aka organizational access policies)
CDNs
local servers placed close to clients and serving the same c
opy to those clients instead of having them fetch individual copies over and ove
r from a remote server
popularity of content obeys Zipf's Law: the kth most popular
content is 1/k as popular as the most popular content, eg the 2nd most popular
content is half as popular as the most popular one.
-> popularity plotted from most to least popular falls off like 1/k (a straight line on a log-log plot)
-> caching the most popular contents take care of a lot of c
ontent, but there's still a long tail that needs taking care of
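The head-vs-tail claim can be checked numerically (a sketch assuming popularity of item k is proportional to 1/k, normalized over a made-up catalog of 1000 items):

```python
def zipf_share(k, n):
    """Fraction of total requests going to the kth most popular of n items,
    when item k's popularity is proportional to 1/k (Zipf's Law)."""
    total = sum(1 / i for i in range(1, n + 1))
    return (1 / k) / total

# 2nd most popular item is half as popular as the 1st
assert abs(zipf_share(2, 1000) - zipf_share(1, 1000) / 2) < 1e-12

# Caching just the top 10 of 1000 items covers a large share of requests,
# but well over half the traffic is still in the long tail
top10 = sum(zipf_share(k, 1000) for k in range(1, 11))
print(round(top10, 2))   # about 0.39
```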
can just use browser and proxy caches but that only benefits
one client/one organization
want to place replicas all over the internet
-> use DNS servers
DNS servers are updated to return different server addre
sses depending on the client
the server looks at the client's IP, determines the loca
tion, and returns the closest server
SPDY ("speedy"): experimental protocol being deployed to speed u
p HTTP
parallel HTTP requests on a single TCP connection (vs serial
HTTP requests on a single TCP connection as in a persistent connection)
compressed headers
client priority
mod_pagespeed: speed things up on server side
minify JS (make variable names shorter)
make nested CSS inline
resize images for client
etc
P2P
disadvantages of client/server CDNs
expensive, managed infrastructure
no privacy
problems with P2P
peers have limited capabilities
solution: create minimum spanning tree with the root being the peer with the desired content
    -> scales well
break up file into pieces that can be downloaded in parallel fro
m different peers
incentive to upload
only allow peers to download if they're also uploading
how to keep up with changing peers
peers need to learn how to find content
use distributed hashtables (DHT): indexed by content, values
are peers having that content
to create the entry for a file
hash the filename
determine what node is responsible for storing the e
ntry based on the hash
forward the entry from node to node until reaching t
he right node
to retrieve an entry
hash the filename
contact the node responsible for the hash to get the
entry
Quality of Service
the web runs on a best-effort basis, but sometimes the amount of queuing and loss isn't acceptable
eg browser should have higher priority than torrent app which runs in the ba
ckground
a best effort basis is going to create a bottleneck and dropped packets
at the router
browser can have higher queue priority than torrent app
with eg Skype and BitTorrent, Skype needs low delay, but not high bandwidth,
so Skype can have a small amount of bandwidth earmarked as high priority, but B
itTorrent can have all the other bandwidth -> both are happy
in real-time communications, we need constant playout, so if a packet arrive
s too late, it might as well not arrive.
variation in delay is called jitter
need a buffer to help
Real-time Transport Protocol (RTP): used to carry media on top of best effor
t UDP
header:
payload type (media format eg MP4, G.711)
timestamp: helpful to play things in real time
sequence numbers
Session Initiation Protocol (SIP): open protocol for establishing calls over IP
use simple method/response codes like HTTP
runs on TCP or UDP
process:
caller sends invite signal
receiver sends 180 'ringing' signal
when someone picks up, receiver sends 200 'OK' signal
caller sends ACK
RTP media
either party sends 'BYE'
other party sends 200 'OK'
Video media
streaming: less demanding
single direction -> delay not too bad
if the network is fast enough, the media player's buffer will eventu
ally fill
then it'll stop requesting new data
media players have a high and a low watermark
if the buffer reaches the high watermark, stop requesting new data
if the buffer reaches the low watermark, start requesting new da
ta
uses TCP instead of UDP because low delay isn't essential, and loss
recovery simplifies presentation (eg fill in missing material)
how it works:
client browser sends media request (HTTP) to server
server sends HTTP response back with a metafile
client hands off metafile to media player
media player sends media request (Real-time Streaming Protocol RTSP) to server
server sends media response (TCP or UDP)
interactive, eg videoconferencing
Weighted fair queueing: alternative to FIFO queueing
when multiple apps share a router's FIFO queue, each with its own incomi
ng flow
            shorter RTT flows are favored in the long term because their packets a
re emptied sooner from the queue
if one flow gets a lot of traffic, the other ones will have to wait
a long time for their turn
possible solution: round robin queueing
takes 1 packet from 1 flow at a time -> no starving
order of arrival isn't preserved
bandwidth isn't evenly shared because packets could be different siz
es in different flows
better solution: fair queueing
RR but approximate bit-level fairness
compute virtual finish time of each flow
virtual clock ticks every time all flows have sent 1 bit
send packets in order of their virtual finish times
doesn't aim to be perfect - don't interrupt transmitting a big p
acket to transmit a small one because the small one has an earlier finish time
notation:
Arrive(j)F: time packet j arrives in flow F
                Length(j)F: length of packet j in flow F
                Finish(j)F = max(Arrive(j)F, Finish(j-1)F) + Length(j)F ->
Finish means the time when the router sends off the packet to the app and is do
ne with it. The formula means a packet's finish time is how long it takes to tra
nsmit it (length) plus a baseline of whatever is the time the router is ready to
work on it, which is the later of when it arrives and when the router is done w
ith the previous packet.
Example: 3 flows, F1-F3. Flows 1 and 3 have 1000B packet size, F
2 has 300B packets.
-> Finish(0)F = 0 for all F
                    assuming all queues are backlogged (aka Arrive(j)F < Finish(j-1)F always, meaning there are always more packets for all flows waiting in the queue)
then Finish(1)F1 = 1000, Finish(2)F1 = 2000, Finish(3)F1 = 3000,
etc
                    Finish(1)F2 = 300, Finish(2)F2 = 600, Finish(3)F2 = 900, Finish(4)F2 = 1200, Finish(5)F2 = 1500, ...
etc
then we'll send in order of finish time
1(F2) -> 2(F2) -> 3(F2) -> 1(F1) -> 1(F3) -> 4(F2), etc
because their finish times are, respectively:
300, 600, 900, 1000, 1000, 1200, etc
Weighted Fair Queueing (WFQ): each flow has a weight, to be used to
compute Finish:
Finish(j)F = max(Arrive(j)F, Finish(j-1)F) + Length(j)F/Weight(F
)
                    This means with a weight of 2, a flow gets to send 2x more bits during the same time
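The finish-time bookkeeping above can be sketched as follows. The flow names and packet counts reproduce the example from the notes; all queues are assumed backlogged, so the max never picks the arrival time.

```python
def fq_schedule(flows, packets_per_flow):
    """flows: {name: (packet_length_bits, weight)}. Returns (finish, flow, j) in send order.

    Backlogged case of the formula: Finish(j)F = Finish(j-1)F + Length(j)F / Weight(F).
    """
    events = []
    for name, (length, weight) in flows.items():
        finish = 0.0
        for j in range(1, packets_per_flow + 1):
            finish += length / weight            # virtual finish time of packet j
            events.append((finish, name, j))
    events.sort()                                # send in order of virtual finish time
    return events

# the example from the notes: F1/F3 send 1000-bit packets, F2 sends 300-bit packets
order = fq_schedule({"F1": (1000, 1), "F2": (300, 1), "F3": (1000, 1)}, 4)
```

The first six sends are F2's packets 1-3 (finish times 300, 600, 900), then the first packets of F1 and F3 (both 1000), then F2's packet 4 (1200), matching the order in the notes. Giving a flow weight 2 halves its per-packet finish increments, so it sends 2x more bits in the same virtual time.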
Traffic shaping
helps limit total traffic so we can make good on QoS promises
average traffic flow isn't a good indication in the case of bursty flows
average flow is still useful, but we need other measures
token buckets
            has 2 params, R (average rate) and B (max burst size)
the bucket contains tokens exchangeable for bits transmitted
new tokens come in at the same rate as the network's average rat
e
the depth of the bucket represents the max allowed burstiness
when new packets come in, the bucket is checked
if there are enough tokens for the amount of data coming in,
those tokens are removed and the data are allowed to go on
if not
to shape traffic, wait until enough tokens accumulate
to police traffic, demote or drop packets until enough t
okens accumulate
token buckets are run at the edge of the network eg on routers.
Only packets with enough tokens are allowed into the network
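The check-and-refill logic above can be sketched as a policing token bucket. The rate and depth numbers in the usage are made up; a shaper would queue the packet and wait for tokens instead of rejecting it.

```python
class TokenBucket:
    """R = average rate (tokens/sec), B = bucket depth (max burst in tokens)."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst        # bucket starts full
        self.last = now

    def allow(self, n_bits, now):
        # refill at the average rate, capped at the bucket depth
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n_bits:
            self.tokens -= n_bits  # enough tokens: remove them, let the data through
            return True
        return False               # policing: drop/demote; shaping would wait instead
```

With rate=100 and burst=500, a 400-bit burst at t=0 passes, an immediate 200-bit packet is rejected, and after 1 second of refill the same 200-bit packet passes.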
QoS architecture
packets are tagged with priority by user
        To mark priority on a packet, use a 6-bit IP header field called DSCP (Differentiated Services Code Point)
0: default forwarding/best effort -> elastic, eg BitTorrent
10-38: assured forwarding/enhanced effort -> average rate (strea
ming video)
46: expedited forwarding (real time) -> low loss/delay (VoIP, ga
ming)
48: precedence, eg network control -> high priority (routing pro
tocol)
Marking can be done according to local policy (eg gaming is high at
home but low at work) or by the OS
token buckets are enforced at edge of network
ISP may not allow user traffic to be marked as precedence
out of profile traffic (eg a regular packet being marked as preceden
ce) can be remarked to 'best effort'
service tier is checked and enforced by ISP
service tiers are per hop, not end-to-end
have to keep amount of high priority traffic low
Difficulties
have to be deployed at all ISPs to see the value
is tied to pricing, otherwise everyone will mark all packets as high pri
ority
slow/difficult deployment
Guaranteed service
need admission control to keep traffic under control and reject flows th
at can't be satisfied
rejecting should be infrequent
can guarantee a rate at the router:
if all flows' weights sum to 100, then a flow with weight 10 is guar
anteed to get at least 1/10 the bandwidth, eg if the bandwidth is 100 Mbps, the
flow gets >= 10 Mbps
to guarantee a rate across the network, need to make sure that rate is g
uaranteed at every router in the network, so have to give the flow the right wei
ght
for all routers i
W(F)i / W(i) * L(i) >= R Mbps
                (the fraction of a flow's weight over the total weight should be such that, after taking the bandwidth into account, you get the desired guarantee)
to guarantee a delay (>= latency), need to shape traffic -> use token bu
ckets
worst case delay is if the max burst of B bits arrive all at once ->
need to put all in the queue
then the last packet to join the queue will need B/R seconds to leav
e the queue, where R is the drain rate
to guarantee a delay across the network, observe that if traffic wer
e perfectly smooth, the delay would be the latency. If there's one burst, it'll
queue up in one router and be smoothed out as it empties into the next router an
d all subsequent routers -> a burst is only penalized once -> the max network de
lay is the same as the max router delay: latency + B/R
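The latency + B/R bound above can be checked with a line of arithmetic; the numbers below are illustrative.

```python
def max_network_delay(latency_s, burst_bits, rate_bps):
    # a burst is only penalized once across the path, so the bound is latency + B/R
    return latency_s + burst_bits / rate_bps

# hypothetical flow: 20 ms path latency, 1 Mbit max burst, 10 Mbps guaranteed rate
delay = max_network_delay(0.020, 1_000_000, 10_000_000)  # 0.02 + 0.1 = 0.12 s
```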
Security
security by obscurity: try to keep the encoding scheme secret
kinds of security threats:
eavesdropper: reads messages that aren't for them
in the Goal and Threat model, the goal is confidentiality: only reci
pient of a message can read it
if Alice sends a private message to Bob, and Eve can observe the mes
sage going back and forth, then Eve is a passive adversary
encryption/decryption scheme: Alice's plaintext message is encrypted
into ciphertext. The encryption scheme is parameterized by a secret key that Alice and Bob have, but the algorithm itself is public.
This way if the key is compromised, just use a different key.
weakness: key must be distributed
this is a symmetric key encryption scheme, eg Advanced Encryptio
n Standard (AES)
efficient, can send at high data rate
public key cryptography: Bob has a public key. Alice uses it to encr
ypt her message. Bob uses his private key to decrypt.
good for establishing a connection between people who don't know
each other well enough to exchange a key private to their relationship
slow
            winning combination: use public-key crypto to send a symmetric key, which is a small message, then use that symmetric key to send large messages
                this key is called a session key
tamperer: use other people's machines to send messages
            need to make sure the message came from the right person (authenticity) and is unchanged (integrity)
person in the middle (Trudy - intruder) is an active adversary
public key schemes aren't enough if she can do things with messa
ges (even if she can't read them) eg flip some bits
she can rearrange blocks to make a different valid message, eg "
stop do not buy now" becomes "buy now do not stop"
solution: include a Message Authentication Code (MAC) to validate th
e integrity/authenticity of the message, eg a hash MAC
                MACs are generated using a symmetric key scheme, so Bob can validate that it was Alice who sent the message, because only Alice and Bob have the key
but Bob can't convince someone else that the message came from A
lice, because he could have generated the message himself
solution 2: digital signatures
Alice has a private key. She uses it to generate a short signatu
re. Everyone else can use Alice's public key to verify that the signature is her
s
since signatures use public keys, they're slow
to speed things up, use message digests
                    message digests are a fixed-length hash of an arbitrary-length message, such that it's computationally infeasible to find 2 messages with the same digest
                        digests can be used to represent messages instead of signatures
note the hash function is deterministic, not parameteriz
ed by a key
now we can sign the hash of the message instead of the m
essage itself
-> faster because the hash is capped at eg 160 bits
also need to prevent replays, eg take an old message and resend as a
new one.
the attacker can save all messages and resend them 1 year later,
unread and untampered with, and still cause damage because the contents are out
dated
prevent by including proof of freshness such as timestamps or a
nonce (number once), which is like a sequence number - goes up with each new mes
sage
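The digest-then-authenticate idea above can be sketched with SHA-256 and HMAC from the standard library. The key and message are made up, and a real digital signature would sign the digest with a private key (e.g. RSA) so third parties could verify it; the HMAC here is the symmetric MAC variant, which only the key holders can check.

```python
import hashlib
import hmac

message = b"buy now do not stop" * 1000    # arbitrary-length message
digest = hashlib.sha256(message).digest()  # fixed 32-byte hash, keyless and deterministic

# MAC over the (short) digest instead of the whole message -> faster
key = b"shared-secret"                     # hypothetical key Alice and Bob share
tag = hmac.new(key, digest, hashlib.sha256).digest()

def verify(key, message, tag):
    """Recompute the digest and MAC, compare in constant time."""
    d = hashlib.sha256(message).digest()
    return hmac.compare_digest(tag, hmac.new(key, d, hashlib.sha256).digest())
```

Verification succeeds on the original message; changing even one block (as in the "buy now do not stop" rearrangement above) changes the digest, so the tag no longer matches.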
impersonator: pretend to be someone else to get info
disrupter: disrupt network services
Wireless security
different than wired because anyone can pick up and send messages
uses passwords
Wired Equivalent Privacy (WEP) was used in 1999
        switched to Wi-Fi Protected Access (WPA)
            only encrypts the payload (the IP packet), not the wireless link layer headers
each client connected to an access point (AP) has a session key to p
rove it has the password
the AP also gives out a group key so it can broadcast to all clients
to join the network
first the client enters the password
the AP sends back a nonce
the client computes the session key from the AP's nonce, MAC add
resses and master key
the AP does the same -> knows the client is legit because it has
the session key
the AP sends the client the group key, encrypted using the sessi
on key
the client gets the group key -> knows the AP is legit because i
t has the session key
the client ACKs the group key
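The mutual proof-of-possession in the steps above can be sketched as a key derivation over the public handshake values. This is a simplified stand-in: real WPA uses a specific PRF over the master key, both nonces, and both MAC addresses, and confirms the key with a MIC rather than ever sending it.

```python
import hashlib
import hmac

def derive_session_key(master_key, ap_nonce, client_nonce, ap_mac, client_mac):
    # both sides mix the same public values (nonces, MAC addresses) with the
    # shared master key, so they derive the same session key without sending it
    material = ap_nonce + client_nonce + ap_mac + client_mac
    return hmac.new(master_key, material, hashlib.sha256).digest()
```

The AP and the client each compute the key locally; an eavesdropper sees the nonces and MAC addresses but, lacking the master key, cannot derive the session key.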
        in enterprise networks, there's no shared password; each client has credentials that they exchange with the authentication server
HTTPS
add-on to HTTP
means HTTP over Secure Sockets Layer (SSL)/Transport Layer Security (TLS
)
    TLS is the successor to SSL
TLS sits between TCP and HTTP; both run unaltered
helps clients make sure server is who it says it is (eg bank)
client verifies a public key really belongs to an entity by using a
certificate which binds an entity to a public key, and is signed by a trusted pa
rty, also using public keys
the certificates are commonly in X.509 format
instead of having a single server giving out certificates, we ha
ve a Public Key Infrastructure (PKI) forming a hierarchy so multiple servers cal
led Certificate Authorities (CAs) can issue certificates -> the root server does
n't have to handle all certificates
to verify website ABC
                    browser has root's public key (aka root certificates, ~100 of them)
                    can use root to verify RA1's key to be X
                    can use RA1 to verify CA1's key to be Y
                    can use CA1 to verify ABC's key to be Z
sometimes a public key is compromised and the certificate ha
s to be revoked -> the key is added to the PKI's Certificate Revocation List (CR
L)
DNS security
doesn't need encryption because anyone can ask and get an answer
    DNS spoofing: intruder sends a fake DNS reply, causing the wrong domain-to-IP binding to be cached
how does Trudy know when the DNS request is sent, and to whom?
            by impersonating a local client and sending a request to the local nameserver
how does she create a fake reply that looks real?
only replies from authoritative IPs are accepted
put IP of authoritative nameserver in source field
the IDs of the request and reply must match
guess a 16-bit ID
send a lot and hope one matches
observe if there's a pattern in IDs
there must be an outstanding request
send reply right after query
then the real reply will be ignored because there's no m
ore outstanding query
DNSSec
being deployed
adds new types of records
RRSIG: digital signatures of records
DNSKEY: public keys of a nameserver's own replies
DS: public keys of nameservers that a nameserver delegates to (e
g the .edu nameserver can delegate to uw.edu)
when a client requests www.uw.edu
local nameserver asks edu nameserver
nameserver authenticates with edu PK and delegates to uw.edu
nameserver
uw.edu authenticates with uw.edu PK and delegates to www.uw.
edu nameserver
Firewalls
    inspects every packet passing between the internal network and the internet
problem is to translate high level policies to low level filtering rules
limited viewpoint in the network: can't read messages because broken
across packets, encryption
types
stateless firewall
most simple
            static rules, eg block incoming port 23 (telnet)
stateful firewall
tracks packet exchanges
eg NAT boxes: allow incoming packets if there's an outgoing conn
ection from inside
application layer firewall
rules are based on app usage and content
can emulate higher layers, eg piece together packets to check fo
r viruses, etc
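A stateless rule set like the one above can be sketched as a first-match filter. The field names and rules are illustrative; real firewalls also match on interfaces, protocols, and address ranges.

```python
# hypothetical policy: block incoming telnet, allow everything else
RULES = [
    {"action": "block", "direction": "in", "dst_port": 23},  # telnet
    {"action": "allow"},                                     # no conditions: matches anything
]

def filter_packet(packet, rules=RULES):
    """Return the action of the first rule whose fields all match the packet."""
    for rule in rules:
        conditions = {k: v for k, v in rule.items() if k != "action"}
        if all(packet.get(k) == v for k, v in conditions.items()):
            return rule["action"]
    return "block"  # fail closed if no rule matches
```

Rule order matters: the catch-all "allow" must come last, or it would shadow the telnet block. A stateful firewall would additionally consult a table of outgoing connections before deciding.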
deployment
can separate the network into internal hosts and external servers th
at connect to the internet a lot, eg email servers, that are outside the firewal
l but still in the network.
this way if the external servers fail, it's easy to take them off th
e network
can be a specialized device, part of an AP, or part of the OS
centralized firewall makes managing easier
            distributed firewall (on each host) is more secure
if one host fails, the other hosts are still protected, vs a dan
gerous packet having access to all hosts
better visibility of apps
VPNs
    private networks across geographical locations, need to keep intruders out
can lease a separate line -> expensive
just use the internet -> cheaper
goal: keep a logical network separate from the internet while still
using the internet for connectivity
threat: Trudy may access VPN and intercept or tamper with messages
use tunneling:
packets from one remote host in the VPN to the other have their
source and destination altered to go to the right router
IP packets are encrypted and wrapped in a second IP layer wh
ose source and destination are tunnel endpoints
to maintain confidentiality, integrity and authenticity, use
IPSEC (IP Security)
set up a symmetric key for host pairs
communication becomes more connection-oriented because t
here are session keys to set up
adds Encapsulated Security Protocol (ESP) header and aut
hentication HMAC trailer
other traffic is unchanged
DDoS
encryption does no good
no good solution
DoS: can be triggered by a small number of strange packets eg
Ping of Death (a malformed packet)
        SYN flood: attacker sends many SYN packets (requesting TCP connections) to a host
SYN cookies help
        DDoS attackers use a botnet - an army of compromised hosts
in a DDoS, attackers can falsify their IP addresses because the network
doesn't check
that way they can hide their location
called IP spoofing
this way the victim thinks the wrong person is attacking them
the attacker can actually make someone else attack the victim for th
em by first pretending to be the victim and sending packets to the scapegoat, ma
king the scapegoat send packets back to the source IP, which is the victim
        solution: ingress filtering: have ISP verify IP source address
slow to deploy because doesn't directly benefit ISP
have more network capacity
use CDNs
have more peak capacity
have earlier filtering