1125 Dibbiny Tragler Iskra Shern Efraim
Telco Cloud: Common Network Requirements
● Virtual networking -- VxLAN
● Resilience through redundancy
● Bonded Uplinks with LACP
● VLAN-Trunking to hosted VNFs
● BGP for route discovery between leaves
● Mirrored traffic flows
● Performance:
○ Small packets at high rates and throughput with rate limiting
○ Rapid flow creation and eviction: churn
● Scale: especially for flows (>10K per sec)
● Cost and power efficiency
● OpenStack and K8s integration
Telco SDN Multi-Cluster Deployment with VXLAN Overlay
[Figure: Cluster1 and Cluster2 behind border DataCenter gateways (SD-WAN, BGP L3VPN with MPLS), joined by an L3 fabric with ECMP. Leaves carry underlay VLAN 10 and VLAN 20. Hosts (Host1, Host2, Host100, Hostn) run OVS VTEPs (VTEP IPs 10.0.0.1, 10.0.0.2, 20.0.0.1) with VXLAN VNI 1000 and VNI 2000. VMs attach over overlay VLAN 100 and VLAN 200 (VM addresses 192.168.1.1/28, 192.168.1.2/28, 192.168.100.1).]
North South and East West traffic with VXLAN overlay and VLAN tagging
[Figure: Server-1 and Server-2 under OpenStack with an SDN plugin and SDN control. VMs (VM1 to VM8) attach through tagged (T) and untagged (U) VLAN interfaces (VLANs 100, 200, 300) to VF1 to VF4 on the NIC OVS e-switch. OVS in user space handles the slow path and flow programming. Each NIC has a PF with ports P1 and P2, a VTEP, and a LAG to its leaf. East-West traffic runs between the servers; North-South traffic runs through the leaves to the DC-GW (L2VNI 2000, L3VNI 50, L3VNI 60).]
VF LAG Overview
● Requirements: HW offload uses SR-IOV, therefore there is a need to offload
the Linux bond functionality:
○ Link aggregation: Combining a number of physical ports (IEEE 802.3ad)
○ High availability: Supporting Link failover
● Expectations:
○ Single VF aggregating both physical ports of the same NIC for RX and TX
○ VF associated with either PF sends data over either physical port
○ Transparent to the VM
○ Initialized by creating a bond device that enslaves both ports of the NIC
VF LAG HW Offload
● In LAG mode flow rules are offloaded to the FDB of both e-switches for high
availability
○ For uplink representors, Shared TC Block is used to replicate the rules on both e-switches.
○ For other representors, the driver is responsible for replicating the rules to both e-switches.
● The driver registers to netdev events to set the affinity of live ports
● Send queues’ affinity is set in a round-robin fashion to both physical ports
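The round-robin affinity and the failover behaviour described above can be illustrated with a minimal sketch. It is not driver code: the function name `assign_affinity` and the port numbering are hypothetical, chosen only to show how queues could be spread over the live ports and remapped when one port goes down.

```python
from typing import Dict, List


def assign_affinity(num_queues: int, live_ports: List[int]) -> Dict[int, int]:
    """Map each send queue index to a physical port, round-robin over live ports.

    Hypothetical helper: a real driver would react to netdev events and
    reprogram hardware; here we only compute the mapping.
    """
    if not live_ports:
        raise ValueError("no live physical port to carry traffic")
    return {q: live_ports[q % len(live_ports)] for q in range(num_queues)}


# Both physical ports up: queues alternate between port 1 and port 2.
both_up = assign_affinity(4, [1, 2])

# Port 1 went down: every queue is remapped to the surviving port.
after_failover = assign_affinity(4, [2])
```

Recomputing the whole mapping on each port event keeps the sketch simple; the VM keeps using the same VF throughout, which is the "transparent to the VM" expectation.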
VLAN Tagging and Header Rewrite
VLAN Tagging and IP Header rewrite
● Overlay tagging:
○ Pushing VLAN tag for VM traffic
○ Popping VLAN tag for VM traffic
● Underlay tagging:
○ Mainly used to segregate different types of traffic (control plane, user plane, monitoring, ...)
● Header rewrites:
○ TTL decrement for routing across subnets
○ Copy ToS/DSCP from the inner IP header to the outer IP header
○ UDP source port: Hash of inner headers to provide entropy for ECMP
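The VLAN push and pop operations listed above can be sketched on raw Ethernet frame bytes. This is an illustration under simple assumptions (a single 802.1Q tag, no QinQ); the helper names `push_vlan`, `pop_vlan` and `vlan_id` are hypothetical and do not come from OVS or the NIC driver.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def push_vlan(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert a 4-byte 802.1Q tag after the destination and source MACs."""
    if not 0 <= vid <= 0xFFF or not 0 <= pcp <= 7:
        raise ValueError("VLAN ID must fit in 12 bits and PCP in 3 bits")
    tci = (pcp << 13) | vid  # priority (3 bits), DEI (1 bit, left 0), VLAN ID (12 bits)
    return frame[:12] + struct.pack("!HH", TPID_8021Q, tci) + frame[12:]


def pop_vlan(frame: bytes) -> bytes:
    """Remove the outermost 802.1Q tag from a tagged frame."""
    (tpid,) = struct.unpack("!H", frame[12:14])
    if tpid != TPID_8021Q:
        raise ValueError("frame carries no 802.1Q tag")
    return frame[:12] + frame[16:]


def vlan_id(frame: bytes) -> int:
    """Return the VLAN ID of the outermost 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        raise ValueError("frame carries no 802.1Q tag")
    return tci & 0x0FFF


# Made-up untagged frame: dst MAC, src MAC, EtherType IPv4, dummy payload.
untagged = bytes(6) + bytes([2, 0, 0, 0, 0, 1]) + struct.pack("!H", 0x0800) + b"payload"
tagged = push_vlan(untagged, vid=100)
```

Pushing on traffic leaving the VM and popping on traffic towards it (or the reverse, depending on the port role) leaves the original frame unchanged end to end, which is what makes the tagging transparent to the VM.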
[Figure: VXLAN packet layout: Outer Ethernet | Outer IP | Outer UDP | VXLAN | Inner Ethernet | Inner IP | Payload. The outer Ethernet header carries the local host src MAC, the remote host or gateway dst MAC, the underlay VLAN and the Ethertype. The inner Ethernet header carries the VM src MAC, the VM dst MAC, the overlay or QinQ VLAN and the Ethertype. Annotations: push or pop overlay VLAN; set or copy IP ToS/DSCP; set UDP src port with hash on inner L2/L3/L4 headers; decrement TTL when routing across segments.]
Remote Mirroring Overview
● Duplicating the traffic of one VF to a remote VF/Host
● Use cases:
○ Debuggability
○ Lawful intercepts
○ Monitoring
○ Other
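As a rough illustration of duplicating traffic to a remote destination, the sketch below forwards a frame normally and emits a second copy wrapped in a basic GRE header. GRE is used because the deck lists "Remote port mirroring with GRE encap" among the current features; the outer Ethernet/IP headers are omitted, and the function and port names are hypothetical.

```python
import struct
from typing import List, Tuple

GRE_PROTO_TEB = 0x6558  # GRE protocol type: payload is an Ethernet frame


def gre_encap(frame: bytes) -> bytes:
    """Prepend a basic 4-byte GRE header (no checksum, key or sequence)."""
    return struct.pack("!HH", 0, GRE_PROTO_TEB) + frame


def forward_with_mirror(frame: bytes, out_port: str, mirror_port: str) -> List[Tuple[str, bytes]]:
    """Return the original delivery plus a GRE-wrapped copy for the mirror."""
    return [(out_port, frame), (mirror_port, gre_encap(frame))]


# Made-up frame and port names, for illustration only.
frame = bytes(12) + struct.pack("!H", 0x0800) + b"data"
outputs = forward_with_mirror(frame, out_port="vf2", mirror_port="gre-mirror")
```

The original frame is delivered untouched, so mirroring stays invisible to the monitored VF; only the copy carries the extra encapsulation.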
Remote Mirroring Configurations
Sample configurations for local and remote traffic
[Figure: sample mirroring setups for local traffic and remote traffic on two KVM hypervisors, V1 and V2 (RHEL 7.5). VMs (VM11, VM22, VM81) attach through VF1 and VF2 to VRS-TC on bridge alubr0 above the NIC. The hosts (addresses ending .51, .52, .143.153, .143.154) are connected by VXLAN tunnels, and a GRE tunnel carries mirrored traffic to the mirroring destination.]
Looking ahead for OVS TC/flower HW offload
OVS TC/flower Flow Offload
OVS 2.11 with RHEL 7.7, Mellanox ConnectX-5 (mlx5)
Current Features
● Line rate throughput 25Gbps, 50Gbps
● Flow Match - 5 Tuple, IPv4, IPv6, MAC, TCP/UDP port, protocol, VLAN, VXLAN
● Action - set QoS ToS/DSCP, set TTL
● Overlay encap - VLAN, QinQ, GRE, VXLAN
● VLAN aware VMs - VXLAN over VLAN
● Scale to thousands of flows with MAC learning
● ethtool SR-IOV per VF Rate limiting
● Linux bonding offload - VF-LAG
● Remote port mirroring with GRE encap
Work in progress (community collaboration)
● Flow insertion rate - TC parallelization, software steering
● Traffic engineering - MPLS Segment Routing
● Load balancer & Firewall - Connection tracking
● vDPA - replace SR-IOV, live migration
[Figure: OpenStack Neutron drives ovsdb-server and ovs-vswitchd on the hypervisor. VMs attach through VF NICs (SR-IOV). The OVS kernel datapath offloads flows through TC/flower to the NIC OVS e-switch (VF1, VF2, PF). A Linux bond over Port1 and Port2 is offloaded as a NIC bond.]
Flow aggregation and insertion rate
User flow aggregation: Routed flows with TTL decrement
● Typically all North-South flows are routed, East-West flows between VNFs are routed, and remote mirroring traffic is routed
● Routed flows need a TTL decrement, which the kernel datapath performs as a match on the TTL value followed by a set
● This can result in an explosion of user flows
● Kernel patch: [PATCH net-next] openvswitch: add TTL decrement action
Routed flow with VXLAN encap:
skb_priority(0/0),skb_mark(0/0),in_port(ens15f0_2),eth(src=c6
:17:f1:df:4e:d3,dst=68:54:ed:00:8e:10),eth_type(0x0800),ipv4(
src=192.168.0.0/255.255.0.0,dst=192.168.2.1,proto=6,tos=0/0x3
,ttl=64),tcp(src=32768/0x8000,dst=0/0), packets:3381147,
bytes:5288111012, used:0.700s, offloaded:yes, dp:tc,
actions:set(tunnel(tun_id=0x86d288,dst=10.10.10.2,tp_dst=4789
,flags(key))),set(eth(src=68:54:ed:00:13:5c,dst=92:eb:fc:be:f
1:c8)),set(ipv4(ttl=63)),vxlan_sys_4789
[Figure: Spine and Leaf with VXLAN ECMP, L3VNI 8300, underlay VLAN 10. Host1 (10.10.10.1) runs OVS with VTEP 1 and VM1 (192.168.1.1/28) on VLAN 100. Host2 (10.10.10.2) runs OVS with VTEP 2 and VM2 (192.168.2.1/28) on VLAN 200.]
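The effect of match-and-set on the number of flows can be sketched with a toy model. It assumes that a match-and-set datapath needs one entry per distinct incoming TTL (as in the dump, which matches ttl=64 and sets ttl=63), while a dedicated decrement action would not need the TTL in the match at all. The function names and the sample TTL values are made up for illustration.

```python
from typing import Iterable, Set, Tuple

Packet = Tuple[str, int]  # (destination IP, incoming TTL)


def flows_match_and_set(packets: Iterable[Packet]) -> Set[Tuple[str, int]]:
    """One datapath entry per (destination, TTL): match ttl=N, set ttl=N-1."""
    return {(dst, ttl) for dst, ttl in packets}


def flows_dec_ttl(packets: Iterable[Packet]) -> Set[str]:
    """One datapath entry per destination: the action decrements any TTL."""
    return {dst for dst, _ in packets}


# Same destination reached by senders with different initial TTLs and hop counts.
packets = [("192.168.2.1", ttl) for ttl in (64, 128, 255, 63, 127)]

with_match_and_set = len(flows_match_and_set(packets))
with_decrement_action = len(flows_dec_ttl(packets))
```

In this toy case five entries collapse into one, which is the aggregation the proposed kernel action is after.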
User flow aggregation - Hash for ECMP/Entropy
● Load balance traffic across 4 or more ECMP paths
● Hash (dp_hash) on inner L2/L3/L4 headers; 5-tuple hash: IP src/dst, TCP/UDP port, protocol
● Set the outer VXLAN UDP source port with the hash result
● There is a dp_hash bug: the first packet (ovs-vswitchd) and subsequent packets (kernel/NIC) are hashed differently
● Today hashing is done in userspace, hence the explosion of flows
● Move the ECMP/entropy hash to the kernel and offload it to the NIC
[Figure: Spine and Leaf with VXLAN ECMP, VNI 1000, underlay VLAN 10. Host1 (10.0.0.1) runs OVS with VTEP 1 and VM1 (192.168.1.1/28) on VLAN 100. Host2 (10.0.0.2) runs OVS with VTEP 2 and VM2 (192.168.1.2/28) on VLAN 100.]
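The entropy mechanism can be sketched as follows. This is not the dp_hash algorithm: CRC32 from the standard library is swapped in as a stand-in hash, and the modulo-based path choice is an assumed model of what a fabric switch does with the outer headers. The port range 49152-65535 is the dynamic range that RFC 7348 recommends for the VXLAN UDP source port.

```python
import zlib
from typing import Tuple

# Inner 5-tuple: src IP, dst IP, protocol, src port, dst port
FiveTuple = Tuple[str, str, int, int, int]

PORT_MIN, PORT_MAX = 49152, 65535  # dynamic port range recommended by RFC 7348


def flow_hash(flow: FiveTuple) -> int:
    """Stand-in hash over the inner 5-tuple (CRC32, not the real dp_hash)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key)


def vxlan_src_port(flow: FiveTuple) -> int:
    """Fold the inner-header hash into the outer UDP source port range."""
    return PORT_MIN + flow_hash(flow) % (PORT_MAX - PORT_MIN + 1)


def ecmp_path(outer_src_port: int, num_paths: int = 4) -> int:
    """Assumed fabric behaviour: pick a path from the outer UDP source port."""
    return outer_src_port % num_paths


# Made-up inner flow between the two VMs of the figure.
flow = ("192.168.1.1", "192.168.1.2", 6, 32768, 80)
port = vxlan_src_port(flow)
path = ecmp_path(port)
```

Because the port depends only on the inner headers, all packets of one inner flow take the same path, provided every stage computes the same hash. The bug noted above is precisely a case where the first packet and later packets are hashed differently.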
Future - Load Balancing for East-West Services
Future use cases with potential for flow explosion and flow churn
Inner IP/TCP/UDP headers need to be processed on the fly
● OVS for load balancers to distribute East-West traffic across 3-8 remote VNFs
● Resolve VIP and alternate destination VNF MAC/IP for each flow
● User flows from VM1 need to be distributed across a set of 3 VNFs (VM2, VM3, VM4) on Host2 and Host3
● Flows will be distributed across 4 ECMP paths
[Figure: Spine and Leaf with VXLAN ECMP, VNI 1000. Host1 (10.0.0.1), Host2 (10.0.0.2) and Host3 (10.0.0.3) each run OVS with VTEP 1 to VTEP 3. VM1 to VM4 use the addresses 20.1.1.1/28, 20.1.1.2/28, 20.1.1.3/28 and 192.168.1.2/28. VNF redundancy VIP 1.1.1.1.]
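A minimal sketch of per-flow VIP resolution shows why this use case multiplies flows. It assumes a hash-based backend choice (CRC32 as a stand-in) and one table entry per client flow; the class name, the client address and the port numbers are hypothetical, while the VIP and the VM2/VM3/VM4 set come from the slide.

```python
import zlib
from typing import Dict, List, Optional, Tuple

# Flow key: src IP, src port, dst IP, dst port, protocol
Flow = Tuple[str, int, str, int, int]


class LoadBalancer:
    """Toy VIP load balancer that keeps one table entry per client flow."""

    def __init__(self, vip: str, backends: List[str]) -> None:
        self.vip = vip
        self.backends = backends
        self.table: Dict[Flow, str] = {}

    def forward(self, flow: Flow) -> Optional[str]:
        if flow[2] != self.vip:
            return None  # not addressed to the VIP, nothing to resolve
        if flow not in self.table:
            # Slow path: resolve the VIP to one backend and install an entry.
            index = zlib.crc32(repr(flow).encode()) % len(self.backends)
            self.table[flow] = self.backends[index]
        return self.table[flow]


lb = LoadBalancer("1.1.1.1", ["VM2", "VM3", "VM4"])
# 100 client connections from one VM, differing only in source port.
flows = [("192.168.1.2", 40000 + i, "1.1.1.1", 80, 6) for i in range(100)]
chosen = [lb.forward(f) for f in flows]
```

Every new connection adds an entry and every closed connection should remove one, so the table size and its churn follow the connection rate rather than the number of VMs.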
Future - Network Policy, ACLs, Connection tracking
Future use cases with potential for flow explosion and flow churn
● Firewall and network policy (Security Groups) between services
● Inner IP/TCP/UDP headers need to be examined on the fly for firewall rules
● Allow traffic from VM1 to VM3 and VM2 to VM4 but deny all other traffic
○ Offload stateless ACLs
○ Offload stateful security groups with OVS Connection Tracking
○ Conntrack offload - flow setup in slow path and then offload flow to NIC
● Connection tracking offload patches
https://patchwork.ozlabs.org/cover/1203705/
[Figure: Spine and Leaf with ECMP. Host1 (10.0.0.1), Host2 (10.0.0.2) and Host3 (10.0.0.3) each run OVS with VTEP 1 to VTEP 3. VM1 to VM4 use the addresses 20.1.1.1/24, 20.1.3.1/24, 20.1.4.3/24 and 20.1.21.1/24.]
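The policy on this slide (VM1 to VM3 and VM2 to VM4 allowed, everything else denied) can be sketched as a stateless rule set plus a toy connection tracker. The sketch simplifies heavily: connections are keyed by VM pair instead of 5-tuple, there is no timeout or TCP state, and all names are hypothetical. It only illustrates the "set up in slow path, then offload" sequence.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (source VM, destination VM)

ALLOWED: Set[Pair] = {("VM1", "VM3"), ("VM2", "VM4")}


def stateless_allow(src: str, dst: str) -> bool:
    """Stateless ACL: only the listed directions may start a connection."""
    return (src, dst) in ALLOWED


class ConnTracker:
    """Toy conntrack: first packet takes the slow path, later ones are offloaded."""

    def __init__(self) -> None:
        self.offloaded: Set[Pair] = set()

    def process(self, src: str, dst: str) -> str:
        if (src, dst) in self.offloaded:
            return "fast-path"
        if stateless_allow(src, dst):
            # New allowed connection: track it and offload both directions,
            # so that replies are accepted without a separate rule.
            self.offloaded.add((src, dst))
            self.offloaded.add((dst, src))
            return "slow-path"
        return "drop"


ct = ConnTracker()
results = [
    ct.process("VM3", "VM1"),  # reply direction before any connection exists
    ct.process("VM1", "VM3"),  # first packet of an allowed connection
    ct.process("VM3", "VM1"),  # reply, now part of a tracked connection
    ct.process("VM1", "VM3"),  # later packet of the same connection
    ct.process("VM1", "VM4"),  # not covered by the policy
]
```

The stateful part is what lets VM3 answer VM1 without opening VM3 to VM1 in general, at the price of one tracked entry per connection.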
Improving flow injection rate to NIC
Increase flow insertion rate to > 10K flows/s
● In kernel 5.4 - TC parallelization for flow injection, bypassing the RTNL lock
e474619a2498 ("net: sched: flower: don't check for rtnl on head dereference")
195c234d15c9 ("net: sched: flower: handle concurrent mask insertion")
259e60f96785 ("net: sched: flower: protect masks list with spinlock")
9a2d93899897 ("net: sched: flower: handle concurrent filter insertion in fl_change")
272ffaadeb3e ("net: sched: flower: handle concurrent tcf proto deletion")
3d81e7118d57 ("net: sched: flower: protect flower classifier state with spinlock")
c24e43d83b7a ("net: sched: flower: track rtnl lock state")
…
MPLS over UDP and Segment Routing for the Edge
MPLS to the Edge - Source based traffic engineering
[Figure: DataCenter1 and DataCenter2, each with SDN control, spine and leaf layers running Segment Routing MPLS (OSPF or ISIS), and a DC-GW. The DC-GWs are connected over SD-WAN with BGP L3 VPN/VPLS. A global controller performs path computation across both data centers.]
Thank you
PCAP@OVS egress
VNF to PNF example
RFC7510: MPLSoUDP
MPLS labels
BSID: 50100
HV SID: 524283
SL: 524286
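How the three labels above would sit in an MPLS label stack can be sketched with the standard 32-bit label stack entry layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL). The order of the labels in the stack, the TTL and the traffic class are assumptions made for illustration; the capture itself is not reproduced here. 6635 is the destination port assigned to MPLS-in-UDP in RFC 7510.

```python
import struct
from typing import List

MPLS_UDP_DST_PORT = 6635  # destination port for MPLS-in-UDP (RFC 7510)


def encode_label_stack(labels: List[int], ttl: int = 64, tc: int = 0) -> bytes:
    """Encode labels as 4-byte MPLS label stack entries, first label outermost."""
    out = b""
    for i, label in enumerate(labels):
        if not 0 <= label < (1 << 20):
            raise ValueError("MPLS label must fit in 20 bits")
        bottom = 1 if i == len(labels) - 1 else 0  # S bit set on the last entry only
        entry = (label << 12) | (tc << 9) | (bottom << 8) | ttl
        out += struct.pack("!I", entry)
    return out


# Assumed order (not confirmed by the slide): BSID, then HV SID, then SL.
stack = encode_label_stack([50100, 524283, 524286])
entries = struct.unpack("!III", stack)
```

The encoded stack would follow the outer UDP header directly, with the inner packet after the bottom-of-stack entry.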
OVS tables for SR over UDP
VNF to WAN endpoint example
Table 13 = Routing
Table 62 = BSID