
Copyright

Segment Routing Part II – Traffic Engineering


by Clarence Filsfils, Kris Michielsen, François Clad, Daniel Voyer

Copyright © 2019 Cisco Systems, Inc. and/or its affiliates. All rights reserved.

All rights reserved. No part of this book may be reproduced, distributed, or transmitted in any form or
by any means, electronic or mechanical, including photocopying, recording, or by any information
storage and retrieval system, without written permission from the publisher or author, except for the
inclusion of brief quotations in a review.

Kindle Edition v1.1, May 2019


Warning and Disclaimer
Every effort has been made to make this book as complete and as accurate as possible, but no
warranty or fitness is implied.

THE INFORMATION HEREIN IS PROVIDED ON AN “AS IS” BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS, EXPRESS, IMPLIED OR STATUTORY, INCLUDING WITHOUT LIMITATION, WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
The publisher and authors shall have neither liability nor responsibility to any person or entity with
respect to any loss or damages arising from the information contained in this book.

The views and opinions expressed in this book belong to the authors or to the person who is quoted.
Trademark Acknowledgments
Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the
U.S. and other countries. To view a list of Cisco trademarks, go to this URL:
www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their
respective owners. The use of the word partner does not imply a partnership relationship between
Cisco and any other company. (1110R)

All terms mentioned in this book that are known to be trademarks or service marks have been
appropriately capitalized. The authors cannot attest to the accuracy of this information. Use of a term
in this book should not be regarded as affecting the validity of any trademark or service mark.
Preface
This book is the second part of the series about Segment Routing (SR).

As more and more Service Providers and Enterprises operate a single network infrastructure to
support an ever-increasing number of services, the ability to custom fit transport to application needs
is of paramount importance.

In that respect, network operators have been exploring Traffic Engineering techniques for some years
now but have obviously run into many scaling issues preventing them from having an end-to-end, fine-
grained control over the myriad services they offer.

Segment Routing Traffic Engineering (SR-TE) has changed that game and has become the undisputed
solution to deliver Traffic Engineering capabilities at scale.
Audience
We have tried to make this book accessible to a wide audience and to address beginner, intermediate
and advanced topics. We hope it will be of value to anyone trying to design, support or just
understand SR-TE from a practical perspective. This includes network designers, engineers,
administrators, and operators, both in service provider and enterprise environments, as well as other
professionals or students looking to gain an understanding of SR-TE.

We have assumed that readers are familiar with data networking in general, and with the concepts of
IP, IP routing, and MPLS, in particular. There are many good books available on these subjects.

We have also assumed that readers know the basics of Segment Routing. Part I of this SR book series
(available on amazon.com) is a great resource to learn the SR fundamentals.
Disclaimer
This book only reflects the opinion of the authors and not the company they work for. Every statement
made in this book is a conclusion drawn from personal research by the authors and lab tests.

Also consider the following:

Some examples have been built with prototype images. It is possible that at the time of publication,
some functions and commands are not yet generally available. Cisco Systems does not commit to
release any of the features described in this book. For some functionalities this is indicated in the
text, but not for all. On the bright side: this book provides a sneak preview to the state-of-the-art
and you have the opportunity to get a taste of the things that may be coming.

It is possible that some of the commands used in this book will be changed or become obsolete in
the future. Syntax accuracy is not guaranteed.

For illustrative purposes, the authors took the liberty to edit some of the command output examples,
e.g., by removing parts of the text. For this reason, the examples in this book have no guaranteed
accuracy.
Reviewers
Many people have contributed to this book and they are acknowledged in the Introduction chapter. Delivering a book that is accurate, clear, and enjoyable to read requires many eyes, other than the eyes of the authors.

Here we would like to specifically thank the people that have reviewed this book in its various stages
towards completion. Without them this book would not have been at the level it is now. A sincere
“Thank you!” to all.

In alphabetical order:

Mike DiVincenzo, Alberto Donzelli, Darren Dukes, Muhammad Durrani, Rakesh Gandhi, Arkadiy
Gulko, Al Kiramoto, Przemyslaw Krol, Sakthi Malli, Paul Mattes, Hardik Patel, Rob Piasecki,
Carlos Pignataro, Robert Raszuk, Joel E. Roberts, JC Rode, Aisha Sanes, Grant Socal, Simon
Spraggs, YuanChao Su, Ketan Talaulikar, Mike Valentine, Bhupendra Yadav, Frank Zhao, and Qing
Zhong.
Textual Conventions
This book has multiple flows:

General flow: This is the regular flow of content that a reader wanting to learn SR would follow. It
contains facts, no opinions and is objective in nature.

Highlights: Highlight boxes emphasize important elements and topics for the reader. These are
presented in a “highlight” box.

HIGHLIGHT
This is an example of a highlight.

Opinions: This content expresses opinions, choices and tradeoffs. This content is not necessary to
understand SR, but gives some more background to the interested reader and is presented as
quotes. We have also invited colleagues in the industry who have been very active on the SR
project to share their opinions on SR in general or some specific aspects. The name of the person
providing that opinion is indicated in each quote box.

“This is an example opinion. ”

— John Doe

Reminders: The reminders briefly explain technological aspects (mostly outside of SR) that may
help understanding the general flow. They are presented in a “reminder” box.

REMINDER
This is an example reminder
Illustrations and Examples Conventions
The illustrations and examples in this book use the following conventions:

Router-id of NodeX is 1.1.1.X. Other loopbacks have address 1.1.n.X, with n an index.

Interface IPv4 address of an interface on NodeX connected to NodeY is 99.X.Y.X/24, with X<Y.
E.g., a link connecting Node2 to Node3 has a network address 99.2.3.0/24; the interface address on
Node2 is 99.2.3.2 and on Node3 it is 99.2.3.3.

Prefix-SIDs are labels in the range 16000 to 23999. This is the default Segment Routing Global
Block (SRGB) in Cisco devices.

Explicit local SIDs are labels in the range [15000-15999]. This is the default Segment Routing
Local Block (SRLB) in Cisco devices.

Dynamic Adjacency-SIDs are labels in the [24000-24999] range and have the format 240XY for an
adjacency on X going to Y.

Dynamic Binding-SIDs are labels in the range [40000-40999].

Dynamic BGP EPE Peering-SIDs are labels in the range [50000-50999].

Dynamic labels allocated by other (non-SR) MPLS applications such as LDP, RSVP-TE, BGP-LU,
etc., are in the range [90000-99999].

SID lists are written as <S1, S2, S3>, ordered first to last, i.e., top to bottom for the SR MPLS label
stack.
Segment Routing, Part II
Table of Contents
Copyright
Preface

1 Introduction
1.1 Subjective Introduction of Part I
1.2 A Few Terms
1.3 Design Objectives

1.3.1 An IP-Optimized Solution


1.3.2 A Simple Solution
1.3.3 A Scalable Solution
1.3.4 A Modular Solution
1.3.5 An Innovative Solution
1.4 Service-Level Assurance (SLA)
1.5 Traffic Matrix
1.6 Capacity Planning
1.7 Centralized Dependency

1.7.1 Disjoint Paths


1.7.2 Inter-Domain
1.7.3 Bandwidth Broker
1.7.4 Multi-Layer Optimization
1.8 A TE Intent as a SID or a SID list
1.9 SR Policy
1.10 Binding SID
1.11 How Many SIDs and Will It Work?
1.12 Automation Based on Colored Service Routes
1.13 The SR-TE Process

1.13.1 One Process, Multiple Roles

1.13.2 Components
1.13.3 SR-TE Database

1.13.4 SR Native Algorithms


1.13.5 Interaction With Other Processes and External APIs
1.13.6 New Command Line Interface (CLI)
1.14 Service Programming

1.15 Lead Operator Team


1.16 SR-TE Cisco Team
1.17 Standardization
1.18 Flow of This Book
1.19 References
Section I – Foundation
2 SR Policy
2.1 Introduction
2.1.1 An Explicit Candidate Path of an SR Policy

2.1.2 Path Validation and Selection


2.1.3 A Low-Delay Dynamic Candidate Path
2.1.4 A Dynamic Candidate Path Avoiding Specific Links
2.1.5 Encoding a Path in a Segment List
2.2 SR Policy Model
2.2.1 Segment List
2.2.2 Candidate Paths
2.3 Binding Segment
2.4 SR Policy Configuration
2.5 Summary

2.6 References

3 Explicit Candidate Path


3.1 Introduction

3.2 SR-MPLS Labels


3.3 Segment Descriptors
3.4 Path Validation
3.5 Practical Considerations

3.6 Controller-Initiated Candidate Path


3.7 TDM Migration
3.8 Dual-Plane Disjoint Paths Using Anycast-SID
3.9 Summary
3.10 References
4 Dynamic Candidate Path
4.1 Introduction
4.1.1 Expressing Dynamic Path Objective and Constraints
4.1.2 Compute Path = Solve Optimization Problem

4.1.3 SR-Native Versus Circuit-Based Algorithms


4.2 Distributed Computation
4.2.1 Headend Computes Low-Delay Path
4.2.2 Headend Computes Constrained Paths
4.2.2.1 Affinity Link Colors
4.2.2.2 Affinity Constraint
4.2.3 Other Use-Cases and Limitations
4.2.3.1 Disjoint Paths Limited to Single Head-End
4.2.3.2 Inter-Domain Path Requires Multi-Domain Information
4.3 Centralized Computation

4.3.1 SR PCE

4.3.1.1 SR PCE Redundancy


4.3.2 SR PCE Computes Disjoint Paths

4.3.2.1 Disjoint Group


4.3.2.2 Path Request, Reply, and Report
4.3.2.3 Path Delegation
4.3.3 SR PCE Computes End-To-End Inter-Domain Paths

4.3.3.1 SR PCE’s Multi-Domain Capability


4.3.3.2 SR PCE Computes Inter-Domain Path
4.3.3.3 SR PCE Updates Inter-Domain Path
4.4 Summary
4.5 References
5 Automated Steering
5.1 Introduction
5.2 Coloring a BGP Route
5.2.1 BGP Color Extended Community

5.2.2 Coloring BGP Routes at the Egress PE


5.2.3 Conflict With Other Color Usage
5.3 Automated Steering of a VPN Prefix
5.4 Steering Multiple Prefixes With Different SLAs
5.5 Automated Steering for EVPN
5.6 Other Service Routes
5.7 Disabling AS
5.8 Applicability
5.9 Summary
5.10 References

6 On-demand Next-hop

6.1 Coloring
6.2 On-Demand Candidate Path Instantiation

6.3 Seamless Integration in SR-TE Solution


6.4 Tearing Down an ODN Candidate Path
6.5 Illustration: Intra-Area ODN
6.6 Illustration: Inter-domain ODN

6.7 ODN Only for Authorized Colors


6.8 Summary
6.9 References
7 Flexible Algorithm
7.1 Prefix-SID Algorithms
7.2 Algorithm Definition
7.2.1 Consistency
7.2.2 Definition Advertisement
7.3 Path Computation

7.4 TI-LFA Backup Path


7.5 Integration With SR-TE
7.5.1 ODN/AS
7.5.2 Inter-Domain Paths
7.6 Dual-Plane Disjoint Paths Use-Case
7.7 Flex-Algo Anycast-SID Use-Case
7.8 Summary
7.9 References
8 Network Resiliency
8.1 Local Failure Detection

8.2 Intra-Domain IGP Flooding

8.3 Inter-Domain BGP-LS Flooding


8.4 Validation of an Explicit Path

8.4.1 Segments Expressed as Segment Descriptors


8.4.2 Segments Expressed as SID Values
8.5 Recomputation of a Dynamic Path by a Headend
8.6 Recomputation of a Dynamic Path by an SR PCE

8.7 IGP Convergence Along a Constituent Prefix-SID


8.7.1 IGP Reminder
8.7.2 Explicit Candidate Path
8.7.3 Dynamic Candidate Paths
8.8 Anycast-SIDs
8.9 TI-LFA protection
8.9.1 Constituent Prefix-SID
8.9.2 Constituent Adj-SID
8.9.3 TI-LFA Applied to Flex-Algo SID

8.10 Unprotected SR Policy


8.11 Other Mechanisms
8.11.1 SR IGP Microloop Avoidance
8.11.2 SR Policy Liveness Detection
8.11.3 TI-LFA Protection for an Intermediate SID of an SR Policy
8.12 Concurrency
8.13 Summary
8.14 References
Section II – Further details
9 Binding-SID and SRLB

9.1 Definition

9.2 Explicit Allocation


9.3 Simplification and Scaling

9.4 Network Opacity and Service Independence


9.5 Steering Into a Remote RSVP-TE Tunnel
9.6 Summary
9.7 References

10 Further Details on Automated Steering


10.1 Service Routes With Multiple Colors
10.2 Coloring Service Routes on Ingress PE
10.3 Automated Steering and BGP Multi-Path
10.4 Color-Only Steering
10.5 Summary
10.6 References
11 Autoroute and Policy-Based Steering
11.1 Autoroute

11.2 Pseudowire Preferred Path


11.3 Static Route
11.4 Summary
11.5 References
12 SR-TE Database
12.1 Overview
12.2 Headend
12.3 SR PCE
12.3.1 BGP-LS
12.3.2 PCEP

12.4 Consolidating a Multi-Domain Topology

12.4.1 Domain Boundary on a Node


12.4.2 Domain Boundary on a Link

12.5 Summary
12.6 References
13 SR PCE
13.1 SR-TE Process

13.2 Deployment
13.2.1 SR PCE Configuration
13.2.2 Headend Configuration
13.2.3 Recommendations
13.3 Centralized Path Computation
13.3.1 Headend-Initiated Path
13.3.2 PCE-Initiated Path
13.4 Application-Driven Path
13.5 High-Availability

13.5.1 Headend Reports to All PCEs


13.5.2 Failure Detection
13.5.3 Headend Re-Delegates Paths to Alternate PCE Upon Failure
13.5.3.1 Headend-Initiated Paths
13.5.3.2 Application-Driven Paths
13.5.4 Inter-PCE State-Sync PCEP Session
13.5.4.1 State-Sync Illustration
13.5.4.2 Split-Brain
13.6 BGP SR-TE
13.7 Summary

13.8 References

14 SR BGP Egress Peer Engineering


14.1 Introduction

14.2 SR BGP Egress Peer Engineering (EPE)


14.2.1 SR EPE Properties
14.3 Segment Types
14.4 Configuration

14.5 Distribution of EPE Information in BGP-LS


14.5.1 BGP Peering SID TLV
14.5.2 Single-hop BGP Session
14.5.3 Multi-hop BGP Session
14.6 Use-Cases
14.6.1 SR Policy Using Peering-SID
14.6.2 SR EPE for Inter-Domain SR Policy Paths
14.7 Summary
14.8 References

15 Performance Monitoring – Link Delay


15.1 Performance Measurement Framework
15.2 The Components of Link Delay
15.3 Measuring Link Delay
15.3.1 Probe Format
15.3.2 Methodology
15.3.3 Configuration
15.3.4 Verification
15.4 Delay Advertisement
15.4.1 Delay Metric in IGP and BGP-LS

15.4.2 Configuration

15.4.3 Detailed Delay Reports in Telemetry


15.5 Usage of Link Delay in SR-TE

15.6 Summary
15.7 References
16 SR-TE Operations
16.1 Weighted Load-Sharing Within SR Policy Path

16.2 Drop on Invalid SR Policy


16.3 SR-MPLS Operations
16.3.1 First Segment
16.3.2 PHP and Explicit-Null
16.3.3 MPLS TTL and Traffic-Class
16.4 Non-Homogenous SRGB
16.5 Candidate-Paths With Same Preference
16.6 Summary
16.7 References

Section III – Tutorials


17 BGP-LS
17.1 BGP-LS Deployment Scenario
17.2 BGP-LS Topology Model
17.3 BGP-LS Advertisement
17.3.1 BGP-LS NLRI
17.3.1.1 Protocol-ID Field
17.3.1.2 Identifier Field
17.3.2 Node NLRI
17.3.3 Link NLRI

17.3.4 Prefix NLRI

17.3.5 TE Policy NLRI


17.3.6 Link-State Attribute

17.4 SR BGP Egress Peer Engineering


17.4.1 PeerNode SID BGP-LS Advertisement
17.4.2 PeerAdj SID BGP-LS Advertisement
17.4.3 PeerSet SID BGP-LS Advertisement

17.5 Configuration
17.6 ISIS Topology
17.6.1 Node NLRI
17.6.2 Link NLRI
17.6.3 Prefix NLRI
17.7 OSPF Topology
17.7.1 Node NLRI
17.7.2 Link NLRI
17.7.3 Prefix NLRI

17.8 References
18 PCEP
18.1 Introduction
18.1.1 Short PCEP History
18.2 PCEP Session Setup and Maintenance
18.2.1 SR Policy State Synchronization
18.3 SR Policy Path Setup and Maintenance
18.3.1 PCC-Initiated SR Policy Path
18.3.2 PCE-Initiated SR Policy Path
18.3.3 PCE Updates SR Policy Path

18.4 PCEP Messages

18.4.1 PCEP Open Message


18.4.2 PCEP Close Message

18.4.3 PCEP Keepalive Message


18.4.4 PCEP Request message
18.4.5 PCEP Reply Message
18.4.6 PCEP Report Message

18.4.7 PCEP Update Message


18.4.8 PCEP Initiate Message
18.4.9 Disjointness Association Object
18.5 References
19 BGP SR-TE
19.1 SR Policy Address-Family
19.1.1 SR Policy NLRI
19.1.2 Tunnel Encapsulation Attribute
19.2 SR Policy BGP Operations

19.2.1 BGP Best-Path Selection


19.2.2 Use of Distinguisher NLRI Field
19.2.3 Target Headend Node
19.3 Illustrations
19.3.1 Illustration NLRI Distinguisher
19.3.1.1 Same Distinguisher, Same NLRI
19.3.1.2 Different Distinguishers, Different NLRIs
19.4 References
20 Telemetry
20.1 Telemetry Configuration

20.1.1 What Data to Stream

20.1.2 Where to Send It and How


20.1.3 When to Send It

20.2 Collectors and Analytics


20.3 References
Section IV – Appendices
A. Introduction of SR Book Part I

A.1 Objectives of the Book


A.2 Why Did We Start SR?
A.3 The SDN and OpenFlow Influences
A.4 100% Coverage for IPFRR and Optimum Repair Path
A.5 Other Benefits
A.6 Team
A.7 Keeping Things Simple
A.8 Standardization and Multi-Vendor Consensus
A.9 Global Label

A.10 SR MPLS
A.11 SRv6
A.12 Industry Benefits
A.13 References
B. Confirming the Intuition of SR Book Part I
B.1 Raincoat and Boots on a Sunny Day
B.2 ECMP-Awareness and Diversity
1 Introduction
This chapter is written by Clarence Filsfils in a subjective manner.

He describes the intuitions, the design objectives and the target use-cases that led to the definition of
the SR-TE solution. He shares the design experience collected while deploying this solution.
1.1 Subjective Introduction of Part I
Appendix A provides an unmodified copy of the subjective introduction of Part I of this SR book
series as written in 2016. It describes the intuitions and design objectives of the overall Segment
Routing solution.

Specifically to this book and the theme of Traffic Engineering, Part I’s introduction described:

Scalability and complexity problem of the RSVP-TE full-mesh concept

The difference between SR Traffic Engineering solution and RSVP-TE circuit model

The difference between the SR Traffic Engineering solution and OpenFlow

Appendix B has been written in April 2019. It provides a few public data points that were released
after the publication of Part I and that confirm its intuitions and analysis.
1.2 A Few Terms
SR-TE refers to the Segment Routing Traffic Engineering solution.

This solution translates the intent of the operator (delay, disjointness, bandwidth) into “SR policies”,
programs these SR Policies into the network, and steers traffic onto its appropriate SR Policy.

An SR Policy is fundamentally a list of segments. In its simplest form, this is a sequence of IP waypoints expressed as SR-MPLS (or SRv6) segments, also referred to as Segment IDs or SIDs. A list of segments, or SID list, is represented as <S1, S2, …>, where S1 is the first segment to visit.

An SR-TE intent, defined as an optimization objective and a set of constraints, is translated into a list
of segments by a compute engine. The compute engine can be a router or a Path Computation Element
(PCE, referred to as SR PCE in the context of an SR deployment). The former case results in a
distributed solution while the latter case results in a centralized solution. A hybrid solution mixes
computation on the router and the SR PCE.

The SID list that implements an SR-TE intent is called the Solution SID List.
1.3 Design Objectives
Here are the objectives that we followed for the SR-TE project.

1.3.1 An IP-Optimized Solution


Prior to SR-TE, IP had always been engineered with underlying connection-oriented point-to-point
non-ECMP circuits: Frame-Relay, ATM and then RSVP-TE MPLS LSPs. A clear goal of the SR
project is to leverage IP and its properties (like ECMP) to engineer IP traffic.

SR expresses an explicit path as an ordered list of ECMP-aware shortest-paths to intermediate IP


waypoints. It uses IP waypoints to build an IP engineering solution. As the waypoints are IP-based,
they benefit from IP properties (like ECMP). This provides an IP-optimized solution.

By being designed for IP, the SR-TE solution is directly applicable to any instantiation of SR: SR-
MPLS (MPLS dataplane) or SRv6 (IPv6 dataplane). In this book, we provide all the concepts and
illustrations for SR-MPLS. However, be assured that all the concepts also directly apply to SRv6.
Part III in this series of SR books is dedicated to SRv6 and will provide the SRv6-TE illustrations.

1.3.2 A Simple Solution


Very few operators had access to circuit-based TE technology due to its high cost of operation and
complexity. A clear goal of the SR project is to drastically widen the applicability of TE, making it
usable by any IP operator.

At the heart of network services are the BGP-based service routes. How can we automate the SR-TE
solution around these service routes? We simply tag the BGP routes with a color that expresses their
SLA intent. We configure a few templates that define the SLA associated to each color, then let BGP
and SR-TE figure out the solution themselves. If a router does not have enough information to
compute a path by itself, it automatically requests help from an SR PCE.

Delay is a fundamental SLA requirement. How can we automate the computation of business SLAs
expressed on delay (minimum delay or minimum cost with a bound on the delay)? We automate the
measurement and signaling of one-way link-delays, and define rules that preserve the stability of the
routing system.
These early questions in our SR-TE journey drove our research and our engineering. The effective
solution is now proven by vast deployment.

Simplicity has always been our first priority throughout the SR project. In the context of the SR-TE
solution, this translated into:

Integrated per-link and per-policy performance monitoring

Integrated OAM

Integrated counters and telemetry, which eliminate the need of Netflow/IPFIX to get the demand
matrix

Automation of the SR Policy (called ODN)

Automation of the BGP service steering onto an SR Policy (called AS)

1.3.3 A Scalable Solution


The lack of scalability and the complexity of RSVP-TE are eliminated, i.e.:

k × N² states in the fabric (a full mesh among N edge nodes, with k tunnels for each pair of nodes to hack around the lack of native ECMP in an RSVP circuit); see the short sketch after this list

race conditions upon network topology change

the notion of tunnel
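To make the k × N² figure concrete, here is a back-of-the-envelope Python sketch. The node and tunnel counts used in it are assumptions chosen purely for illustration, not measurements from any deployment.

def rsvp_te_lsp_count(n_edge_nodes, k_tunnels_per_pair):
    # k tunnels for each ordered pair of edge nodes: roughly k * N^2 LSPs,
    # each of which lays per-hop state along its path.
    return k_tunnels_per_pair * n_edge_nodes * (n_edge_nodes - 1)

# Assumed example: 100 edge nodes, 2 tunnels per pair to emulate ECMP.
print(rsvp_te_lsp_count(100, 2))  # 19800 LSPs signaled in the fabric
# With SR-TE, transit nodes hold no per-policy state; only the headend of an
# SR Policy programs a label-stack push.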

Instead, SR-TE offers a stateless solution that supports traffic steering and label stack push without
performance degradation. It also comes with efficient SR-native algorithms that reduce the
computation complexity, while minimizing the solution SID list length and maximizing the ECMP
diversity.

1.3.4 A Modular Solution


Abstract analysis helps us to isolate each behavior in its fundamental building blocks. We schedule
our engineering work such that building blocks are developed independently and with the APIs that
allow their combination. Flex-Algo, ODN, AS, SR PCE, native SR algorithms, TI-LFA are various
examples of this decomposition of the overall solution into independent building blocks. TI-LFA can
be used with or without Flex-Algo. ODN can be used with or without the SR PCE and AS. Flex-Algo
can be combined with ODN and AS, or used independently.

Another aspect of this modularity is the support of diverse deployment models.

Centralized (“vertical”): an SDN controller makes all the decisions centrally and programs the
network accordingly. The routers receive their SID lists in an explicit manner.

Distributed: the router dynamically translates an SLA intent into a SID list. The router uses SR
native algorithms to compute the solution SID list on the basis of the information in the SR-TE
database.

Hybrid (“horizontal”): upon installing a colored BGP service route, a router detects the need for an
SR policy and requests an SR PCE to compute the SID list that implements the required SLA for
that service. The router automatically steers the service traffic in the dynamically instantiated SR
Policy.

We often refer to the first and last models as “vertical” and “horizontal”.

The “vertical” model has a “god system” that oversees everything, decides and programs. The
network is relatively passive and all the intelligence is in the central SDN system.

The “horizontal” intelligence model is the sum of many interacting components. The BGP route-
reflectors distribute the colored BGP routes. BGP-LS route-reflectors1 distribute the topology
information from the various domains. The SR PCEs listen to the redundant BGP-LS route-reflection
system to update their SR-TE DB. The PEs listen to the BGP route-reflectors to learn the BGP service
destinations, endpoints and SLA colors. The PEs listen to the SR PCE advertisements to learn who
and where they are. Upon receiving a colored BGP route to next-hop N with SLA color C, on-demand
(dynamically), the PE requests one of the discovered SR PCEs for an SR Policy solution to endpoint
N with color C. The SR PCE becomes responsible for this SR Policy (the request is stateful) and it
updates the PE if the solution SID list changes. Any of the components can be scaled horizontally. If
there are more PEs or more BGP routes, then additional route-reflectors can be added. If there are
more domains, then additional BGP-LS route-reflectors can be added. If there are more PEs or more
SR Policies within a region, then additional SR PCEs can be added in that region.
The horizontal model is also called “hybrid” because it is a combination of the “vertical SDN”
centralized intelligence (indeed, we have SR PCEs that help do inter-domain wide “central”
computations) with distributed intelligence (the colored BGP route reflection, the BGP-LS route
reflection, the availability of many concurrent SR PCEs).

As we write this book, the two models have vast deployment. The former better fits the WEB/OTT context while the latter better fits the SP/Enterprise market. We knew that different operators would
have different preferences and that each solution would have their pros and cons. Instead of an
arbitrary choice for one single option, we decided to be agnostic. We designed the overall solution
with a set of common components.

As SR-TE deployments may follow various models, the solution also supports different control plane
protocols.

From the controller to the network: PCEP, BGP SR-TE and Network Configuration Protocol
(NETCONF) and the flexibility to adapt to other API/signaling technologies as per the industry
evolution.

From the network to the controller: Telemetry, BGP-LS, PCEP and NETCONF.

A specific note on BGP-LS, the BGP Link-state address-family. Often BGP-LS is viewed as being
restricted to allow a set of IGP routers in an IGP area to transport the IGP LS-DB via a BGP channel
to the controllers. In our SR-TE architecture, BGP-LS has a much wider role: it provides a channel
between each independent router and the controllers to report any information local to the router, its
immediate topology in a way that is independent of the presence or not of an IGP (e.g., applicability
to BGP-based Data Centers), the state of its SR Policies (e.g., applicability beyond the router’s
immediate topology), the capabilities of the node (e.g., how many segments it can push/insert),
distribution of performance measurement (e.g., delay), etc.

For example, when an inter-domain SR Policy is built from a Metro1 domain to a Metro2 domain via
the core domain, the SR PCE wants to leverage any available transit SR Policy in the core. To do so,
the SR PCE needs to know all the available SR Policies in the core domain. This is straightforward
thanks to a BGP-LS session from any SR Policy headend to one of the BGP-LS Route Reflectors
(RRs). Any SR PCE peering to one of the BGP-LS RRs will get all the instantiated SR Policies in the
SR domain.

1.3.5 An Innovative Solution


A key principle that was continuously reinforced within the SR team is to think outside of the box.

An in-the-box TE mindset assumes that a tunnel is necessary, that a full-mesh is necessary, that tunnels need to be pre-configured, that steering is done with autoroute and that a tunnel is non-ECMP… It has been done like this for 20 years, so it must be the starting point.

No. These concepts must not be “blindly” assumed.


1.4 Service-Level Assurance (SLA)
The purpose of any Traffic Engineering technology is to enable Service-Level Assurances (SLAs).
An SLA is defined as a combination of requirements in terms of end-to-end delay, loss, bandwidth
and disjointness. Some of these requirements are partially or entirely controlled by routing, while
others are also affected by external factors such as the network utilization.

The end-to-end delay is composed of the propagation delay and the scheduling delay2. The
propagation delay is computed as the fiber length divided by the speed of light in the medium. It is
independent of the network utilization and entirely controlled by routing: minimizing the accumulated
propagation delay between two nodes in the network is equivalent to finding a path between these
nodes with the minimum accumulated fiber length. On the other hand, the scheduling delay represents
the queue length divided by the port rate. This component of the end-to-end delay is affected by the
link utilization as well as the scheduling strategy. An operator wishing to minimize the scheduling
delay would ensure that the network links are provisioned with sufficient capacity and that delay-
sensitive traffic is directed to a dedicated queue.
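As a rough illustration of these two components, the following Python sketch applies the definitions above. The speed of light in fiber (about 2 × 10^8 m/s) and the example figures are assumptions used only to show the orders of magnitude.

SPEED_OF_LIGHT_IN_FIBER_M_PER_S = 2.0e8   # assumed, roughly c / 1.5

def propagation_delay_s(fiber_length_m):
    # Independent of load: fiber length divided by the speed of light in the medium.
    return fiber_length_m / SPEED_OF_LIGHT_IN_FIBER_M_PER_S

def scheduling_delay_s(queue_bytes, port_rate_bps):
    # Load-dependent: queue length divided by the port rate.
    return queue_bytes * 8 / port_rate_bps

print(propagation_delay_s(1000e3))          # 1000 km of fiber: ~5 ms
print(scheduling_delay_s(1_000_000, 10e9))  # 1 MB queued on a 10 Gbps port: ~0.8 ms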

Packet loss also comprises a congestion-independent and a congestion-dependent component. The former is caused by a cable or connection defect, while the latter is the result of queue buildup past their limit. Congestion-independent loss can be monitored by every router on a per-link basis and
reported as part of its topology update (e.g., ISIS/OSPF extended TE attributes). Any router or PCE
having access to this information can thus compute a path across the network ensuring that the
congestion-independent loss remains below a certain threshold. Computing such a path is a routing
problem.

Similarly, disjoint paths — that do not share specific types of resources such as links, nodes or Shared
Risk Link Groups (SRLGs) — are computed based on the knowledge of the topology and its SRLGs.
It is a routing problem.

The same applies for the inclusion or exclusion of certain types of resources along a path. For
example, a subset of links in a topology may support MACSec (IEEE 802.1AE) encryption. The intent
to “only use MACSec protected links” is translated into a constraint to exclude all non-MACSec
protected links from the path, which is again a routing problem.
In this book, we focus on the routing problems. These take as input an augmented topology database
that may contain the network topology of multiple domains, delay and loss measurement data, as well
as policy information such as SRLG and TE affinities.
1.5 Traffic Matrix
A (traffic) demand D(X, Y) is the amount of traffic entering the network at ingress edge node X and
leaving the network at egress edge node Y.

The demand matrix (also called traffic matrix) contains all the demands crossing the network. It has
as many rows as ingress edge nodes and as many columns as egress edge nodes. The demand D(X, Y)
is the cell of the matrix at row X and column Y.

Clearly, simplifying the demand matrix collection process is a key goal of the SR-TE project.

The two solutions used historically were very complex and did not scale. Experts were required to
set up and operate these solutions.

The first solution consisted in running a full-mesh of RSVP-TE tunnels from any edge to any edge,
steering all the traffic in these tunnels and collecting the tunnel traffic counters. All these tunnels were
set with zero-bandwidth and were following the IGP shortest-path. Hence, the whole network was
then governed by an N² scaling problem just to get some counters incremented. This seemed like a vastly inefficient solution and in fact a hack.

The second solution consists of running NetFlow/IPFIX on each external interface of all the edge
nodes of the network and streaming the flow information to controllers for demand matrix deduction.
Contrary to the full-mesh of RSVP-TE tunnel solution, this is not a hack. The traffic is un-modified
and continues being steered by the IGP. The traffic information obtained through the NetFlow/IPFIX
solution can be instrumental for many solutions other than the capacity planning, e.g., for security
analysis. The problem is a practical one: not all external interfaces of all edge nodes support
NetFlow/IPFIX and the NetFlow/IPFIX collectors and processors require some dedicated investment
and support. In conclusion, if the NetFlow/IPFIX solution is not motivated by other reasons, it
seemed too complex and expensive just for the purpose of deriving the demand matrix.

As part of the SR project, we designed an automated solution to derive the demand matrix in any
Segment Routing deployment.

From a high-level perspective, the idea is relatively simple. Any SR router X marks its interfaces as
internal or external. For any packet received on an external interface, if X routes the packet towards a
BGP destination B/b via an egress edge node Y, then X increments D(X, Y). At regular intervals (e.g.,
15 minutes) X stores these D(X, *) counters in a local history table. Periodically, a controller can
retrieve these history tables from all the routers and easily deduce the demand matrix evolution over
time.

This information can be collected via Telemetry (see chapter 20, "Telemetry") or NETCONF.
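The following Python sketch models this accumulation logic. It is a simplified illustration of the behavior described above, not the actual router implementation; the class and method names are hypothetical.

from collections import defaultdict

class DemandCounter:
    """Per-ingress-node demand accounting (simplified sketch)."""

    def __init__(self, node):
        self.node = node
        self.d = defaultdict(int)   # D(X, Y): bytes from this ingress X to egress Y
        self.history = []           # periodic snapshots of D(X, *)

    def on_packet(self, received_on_external_interface, egress_node, size_bytes):
        # Only traffic entering the network on an external interface is counted.
        if received_on_external_interface:
            self.d[egress_node] += size_bytes

    def snapshot(self):
        # e.g., every 15 minutes; a controller later retrieves the history
        # (telemetry or NETCONF) to deduce the network-wide demand matrix.
        self.history.append(dict(self.d))
        self.d.clear()

x = DemandCounter("X")
x.on_packet(True, "Y", 1500)
x.on_packet(True, "Y", 1500)
x.on_packet(False, "Y", 1500)   # received on an internal interface: not counted
x.snapshot()
print(x.history)                # [{'Y': 3000}]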
1.6 Capacity Planning
Capacity Planning is the continuous art of forecasting traffic load, including under failure conditions,
in order to evolve the network topology, its capacity, and its routing to meet a defined service-level
agreement (SLA).

The input to the process consists of discovering the following:

the topology and the raw capacity of the links

the measured propagation delay of the links (performance measurement)

the shared-risk link groups (SRLGs)

the demand matrix

The SRLGs represent the likely concurrent failures to consider. The demand matrix represents the
traffic crossing the network, essentially the traffic going from each ingress to each egress node.

A well-known capacity planning solution is Cisco WAN Automation Engine (WAE) Planning [Cisco
WAE].

Historically, the capacity planning solution was an offline process. For example, every quarter, the
projected traffic demand at 6 or 12 months is simulated for all the possible network failures (link,
node and SRLG). Additional capacity is ordered accordingly to ensure that no congestion occurs
within the expected traffic matrix growth and expected network failures. Often, the process is much
simpler and is based on empirical rules such as “upgrade a link as soon as some measure of averaged
peak traffic is above a certain threshold”.

With a semi-automated process, the capacity planning solution becomes “online”: it continuously
collects all the inputs described previously and checks whether the target SLA is maintained. If a
threshold is exceeded, the capacity planning solution issues an alert to the operator and may propose
a tactical traffic-engineering solution. The operator is still involved to assess the situation and to
validate the proposed solution. The long-term solution is still to add capacity to handle the network
growth; however, the operator now has a solution to handle an unexpected surge of traffic or an
unexpected network failure.
With the automated process, the tactical traffic-engineering policies are programmed to the network
automatically when required, without the need for operator validation.
1.7 Centralized Dependency
A “central” computation viewpoint is needed for the following cases: disjoint paths, inter-domain,
bandwidth brokering and multi-layer optimization.

In this book, we focus on the central computation performed by the SR PCE.

1.7.1 Disjoint Paths


A common (central) viewpoint is required to compute disjoint paths from separate source/headend
nodes.

For example, pseudo-wire 1 (PW1) from A to C in Figure 1‑1 must be disjoint from PW2 (from B to
D). Nodes A and B cannot individually compute a path that would meet this requirement as they are
not aware of the behavior of the other node.

Figure 1-1: Disjoint paths


The solution consists in having these two nodes (A and B) contact the same SR PCE and ask it to
compute both disjoint paths and communicate the solution to each respective headend.

The SR PCE computes the paths P1 and P2, respectively from A to C and B to D, and ensures that P1
be disjoint from P2. It then provides these paths respectively to A and B. The stateful SR PCE tracks
this path diversity SLA and recomputes the paths P1 and P2 upon topology change.
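The following Python sketch shows the essence of the disjointness check that the SR PCE applies to the two computed paths; the resource sets are assumed values for illustration, and in reality this check is of course performed jointly with the path computation itself.

def disjoint(path1_resources, path2_resources):
    # path*_resources: the set of links, nodes or SRLG values traversed by each path.
    return set(path1_resources).isdisjoint(path2_resources)

# Assumed example resources, for illustration only.
p1_links = {("A", "E"), ("E", "C")}
p2_links = {("B", "F"), ("F", "D")}
print(disjoint(p1_links, p2_links))   # True: the two paths are link-disjoint

p1_srlgs = {100, 200}
p2_srlgs = {200, 300}
print(disjoint(p1_srlgs, p2_srlgs))   # False: both paths cross SRLG 200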

1.7.2 Inter-Domain
A centralized viewpoint (spanning all the domains) is required when the SR Policy goes from a
headend to an endpoint that are in two isolated domains (e.g., no redistribution between domains, no
BGP RFC3107).

A centralized viewpoint (spanning all the domains and with SLA information in the SR-TE DB) is
required when the SR Policy goes from a headend to an endpoint that are in two isolated domains and
a specific SLA is required (e.g., minimum delay, avoid blue links). In this case, best-effort inter-
domain connectivity (redistribution, BGP RFC3107) is of no help. A centralized path computation is
required to find a SID list that implements an inter-domain path satisfying the SLA requirement.

One could say that this is theoretically incorrect. Indeed, it is true that the headend of the first domain
can itself peer with the BGP-LS route-reflectors and hence acquire the complete inter-domain
topology augmented with the delay information. However, in practice we have not seen this
deployment model. The typical design is to keep headends computing themselves for their domain and
asking their SR PCEs for any inter-domain computations.

1.7.3 Bandwidth Broker


A centralized controller is required when the SLA depends on the avoidance of congestion and the
offline capacity planning process does not guarantee that enough bandwidth is always available.

In such case, individual routers cannot coordinate, in an efficient and scalable manner, the allocation
of their respective traffic to avoid congestion. A central bandwidth broker is required to coordinate
the allocation of bandwidth. A centralized optimization technique is required to optimize the
bandwidth allocation.
MPLS RSVP-TE uses a hop-by-hop signaling technique to lay per-circuit state in the network and
reserve bandwidth. While this admission control technique avoids congestion, it creates scale
problems (k × N²).

As mentioned earlier, a design objective for SR-TE is to eliminate the non-IP circuits and their state
scaling problem. Instead of laying state in the network to implement a distributed admission control
mechanism, SR-TE uses a centralized controller which acts as a bandwidth broker. The centralized
bandwidth broker receives the bandwidth-related request from the individual routers, keeps track of
the available bandwidth in the network and routes the requests accordingly.

The inner details of the bandwidth broker solution are specific to each individual implementation and
are rarely publicly available.

A few examples of bandwidth-aware SR controllers are [Cisco WAE], [Google Espresso], [Facebook EdgeFabric], [Alibaba NetO], and [Microsoft SWAN].

1.7.4 Multi-Layer Optimization


A centralized control point is required when L1-L3 combined topologies are considered. Use-cases
like Link Augmentation or L3 node bypass are examples.
1.8 A TE Intent as a SID or a SID list
In the first public presentation that we gave on SR (October 2012), we described two fundamental
ways to express a TE intent with segment routing: with one single Flex-Algo SID or with multiple
SIDs.

Let’s start with multiple SIDs as this is likely the most intuitive.

If the shortest-path from Tokyo to Brussels is via USA (e.g., lower cost of transported bits), the low-
delay path will likely be expressed with the SR Policy <toMoscow, toBrussels> as shown in Figure 1‑2. toMoscow is the first segment and represents the shortest-path from Tokyo to Moscow.
toBrussels is the second segment and represents the shortest-path from Moscow to Brussels.

Figure 1-2: Low-delay path Tokyo – Brussels – using IGP shortest path SIDs

The computation engine (a router or an SR PCE) collects the topology and its segments and expresses
the intent (the SR Policy) as a SID list. The salient point to remember here is that each prefix segment
expresses a shortest-path according to the basic IGP metric.

There is a different way to approach the problem: defining additional segments that have specific TE
properties (i.e., prefix segments that do not follow the IGP shortest-path).

Let’s assume the following:


each router monitors the propagation delay of each of its links

the propagation delay of each link is flooded by the IGP

each IGP node is configured with a second SID which is flooded with the “delay” attribute

each IGP node computes the shortest-path of a “delay” SID with the per-link propagation delay

“toBrussels(Delay)” indicates the “delay” SID for Brussels

Then, it is easy to deduce that the low-delay SR policy from Tokyo to Brussels can be expressed with
a single SID “toBrussels(Delay)”.

Such a SID is called an “IGP Flex-Algo” SID (see chapter 7, "Flexible Algorithm") based on the
following:

IGP because the IGP computes the related shortest path (e.g., min-delay) and floods the related
information inside the IGP domain

Algo for Algorithm because we associate a specific TE intent to the SID, expressed as an
optimization objective (an algorithm). Each objective is identified by an Algo number.

Flex because any operator is free to define the intent of each Flex-Algo it instantiates

Operator 1 may define Algo128 to minimize TE metric and exclude red affinity

Operator 2 may define Algo128 to minimize delay metric and exclude blue affinity

The same intent (SR Policy) can thus be expressed as <toMoscow, toBrussels> or
<toBrussels(Delay)>. See Figure 1‑3.
Figure 1-3: Low-delay path Tokyo – Brussels – using low-delay SIDs

An intent can also be expressed as a SID list of Flex-Algo SIDs. Let’s assume that Moscow and
Tokyo are in the Asian domain while Brussels and Moscow are in the European domain. An inter-
domain low-delay path from Tokyo to Brussels could then be expressed as <toMoscow(Delay),
toBrussels(Delay)>. But maybe in the Asian domain, the IGP shortest-path from Tokyo to Moscow is
also the low-delay path, while in the European domain the IGP shortest path from Moscow to
Brussels is not on the low-delay path. In that case, the end-to-end SR policy could be expressed as
<toMoscow, toBrussels(Delay)>. See Figure 1‑4.
Figure 1-4: Low-delay path Tokyo – Brussels – combining different algorithm SIDs

That is right, the SR-TE solution is about combining any SIDs to express the intent. If it is possible to
do so with one SID, it will be done. If several SIDs are required, this is also fine. If several SIDs of a
different nature are needed, this is fine as well.

The solution works like the multi-color and multi-shape bricks of the LEGO® construction toys. IGP
SIDs are the yellow bricks. The yellow bricks implement a minimum-IGP-cost shortest-path. Flex-
Algo1 IGP SIDs are bricks of red color. The red bricks implement a minimum-delay shortest-path.
Flex-Algo2 IGP SIDs are bricks of green color. The green bricks implement a minimum-IGP-cost
shortest-path constrained to the green plane of the network. The blue bricks implement a minimum-
IGP-cost shortest-path constrained to the blue plane of the network. Depending on the intent, the
solution uses different bricks. Furthermore, the solution may combine bricks of different colors (for
example, as part of an inter-domain policy). This is the inherent richness of the solution and this is
exactly what we had proposed in the first SR presentation at Cisco NAG in October 2012.

One must not think that IGP Flex-Algo is opposed to SR-TE. It is an inherent part of the SR-TE
solution. Historically, it was introduced in the very first SR presentation in October 2012.
1.9 SR Policy
There is no concept of a tunnel in SR-TE. Instead, we introduce the concept of an SR Policy.

An SR Policy realizes a TE intent by means of a solution SID list.

The SIDs that are part of the SID list may be of any type: IGP SID, IGP Flex-Algo SID, BGP SID, …

An SR policy is identified by the following three attributes:

headend: where the policy is instantiated

endpoint: where the policy ends: an IPv4/v6 address

color3: represents an intent and is a new fundamental concept to automate SR-TE

At a headend, an SR Policy is identified by the (color, endpoint) tuple.

The headend of an SR Policy binds a Binding SID to the policy.

The headend of a valid SR Policy installs the SR Policy in the forwarding plane as an MPLS rewrite:

(incoming label = Binding SID → POP and PUSH Solution SID list)
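A minimal Python sketch of this model is shown below. It is a conceptual illustration only (not the IOS XR implementation); the label values follow the conventions of this book (prefix-SIDs in the 16xxx range, a Binding SID from the SRLB), and 90000 stands in for whatever label may already be on the packet.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SRPolicy:
    headend: str
    color: int
    endpoint: str        # IPv4/IPv6 address
    binding_sid: int
    sid_list: List[int]  # solution SID list, first segment first

    def key(self) -> Tuple[int, str]:
        # At a headend, the policy is identified by the (color, endpoint) tuple.
        return (self.color, self.endpoint)

    def mpls_rewrite(self, incoming_stack: List[int]) -> List[int]:
        # incoming label == Binding SID -> POP it and PUSH the solution SID list.
        assert incoming_stack and incoming_stack[0] == self.binding_sid
        return self.sid_list + incoming_stack[1:]

pol = SRPolicy(headend="Node1", color=100, endpoint="1.1.1.5",
               binding_sid=15001, sid_list=[16003, 16005])
print(pol.key())                          # (100, '1.1.1.5')
print(pol.mpls_rewrite([15001, 90000]))   # [16003, 16005, 90000]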
1.10 Binding SID
The Binding SID (BSID) is fundamental to Segment Routing. It provides scaling, network opacity and
service independence.

Figure 1-5: DC Inter-Connect with SLA and without inter-domain BGP

Let’s illustrate these benefits with the diagram in Figure 1‑5 where DCI1 has a low-delay SR Policy
“Pol1” to DCI3 with SID list <Prefix-SID(D), Adj-SID(D-to-E), Prefix-SID(DCI3)> and with a
Binding SID BSID(Pol1).

In this context, a low-delay multi-domain SR Policy from S to Z is simply expressed as <Prefix-SID(DCI1), BSID(Pol1), Prefix-SID(Z)>.

Without the leverage of the intermediate core SR Policy, S would need to steer its low-delay flow
into the SR Policy with SID list < Prefix-SID(DCI1), Prefix-SID(D), Adj-SID(D-to-E), Prefix-
SID(DCI3), Prefix-SID(Z)>.

The use of a BSID (and the transit SR Policy) decreases the number of segments imposed by the
source.
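The short Python sketch below simply counts the labels in the two stacks of the figure above; the symbolic label names are the ones used in this example.

# Label stacks pushed by the source S, written first (top) to last (bottom).
without_transit_policy = ["Prefix-SID(DCI1)", "Prefix-SID(D)", "Adj-SID(D-to-E)",
                          "Prefix-SID(DCI3)", "Prefix-SID(Z)"]
with_transit_policy = ["Prefix-SID(DCI1)", "BSID(Pol1)", "Prefix-SID(Z)"]

print(len(without_transit_policy), len(with_transit_policy))   # 5 3
# At DCI1, BSID(Pol1) is popped and the SID list of Pol1,
# <Prefix-SID(D), Adj-SID(D-to-E), Prefix-SID(DCI3)>, is pushed. If the core
# path of Pol1 changes, BSID(Pol1) does not, so the stack pushed by S is unchanged.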

A BSID acts as a stable anchor point which isolates one domain from the churn of another domain.

Upon topology changes within the core of the network, the low-delay path from DCI1 to DCI3 may
change. While the path of an intermediate policy changes, its BSID does not change. Hence the policy
used by the source does not change and the source is shielded from the churn in another domain.
A BSID provides opacity and independence between domains. The administrative authority of the
core domain may want to exercise a total control over the paths through this domain so that it can
perform capacity planning and introduce TE for the SLAs it provides to the leaf domains. The use of a
BSID allows keeping the service opaque. S is not aware of the details of how the low-delay service
is provided by the core domain. S is not aware of the need of the core authority to temporarily change
the intermediate path.
1.11 How Many SIDs and Will It Work?
Let’s assume that a router S can only push 5 labels and a TE intent requires a SID list of 7 labels <S1,
S2, S3, S4, S5, S6, S7> where these labels are IGP Prefix SIDs to respective nodes 1, 2, 3, 4, 5, 6
and 7.

Are we stuck? No, for two reasons: Binding SID and Flex-Algo SID.

The first solution is clear as it leverages the Binding SID properties (Figure 1‑6 (a)).

Figure 1-6: Solutions to handle label imposition limit

The SID list at the headend node becomes <S1, S2, S3, S4, B> where B is the Binding SID of a
policy <S5, S6, S7> at node 4. The SID list at the headend node meets the 5-label constraint of that
node.

While a binding SID and policy at node 4 does add state to the core for a policy on the edge, the state
is not per edge policy. Therefore, a single binding SID B for policy <S5, S6, S7> may be reused by
many edge node policies.
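A minimal Python sketch of this first solution follows; the helper name and the splitting rule (keep room for exactly one Binding SID) are assumptions used for illustration.

def fit_to_push_limit(sid_list, max_push, bsid_of_tail_policy):
    # Return (SID list imposed by the headend, SID list of the intermediate policy).
    if len(sid_list) <= max_push:
        return sid_list, []
    head = sid_list[:max_push - 1]     # keep room for the Binding SID
    tail = sid_list[max_push - 1:]     # carried by the policy at the intermediate node
    return head + [bsid_of_tail_policy], tail

full = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]
print(fit_to_push_limit(full, 5, "B"))
# (['S1', 'S2', 'S3', 'S4', 'B'], ['S5', 'S6', 'S7'])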
The second solution is equally straightforward: instantiate the intent on the nodes in the network as an
extra Flex-Algo IGP algorithm (say AlgoK) and allocate a second SID S7’ to node 7 where S7’ is
associated with AlgoK (Figure 1‑6 (b)). Clearly the SID list now becomes <S7’> and only requires
one label to push ☺.

These two concepts guarantee that an intent can be expressed as an SR Policy that meets the
capabilities of the headend (e.g., max push ≤ 5 labels).

As we detail the SR-TE solution, we explain how the characteristics of the nodes are discovered
(how many labels can they push) and how the optimization algorithms take this constraint into
consideration.

Last but not least, it is important to remember that most modern forwarding ASICs can push at least 5
labels and that most use-cases require less than 5 labels (based on extensive analysis of various TE-
related use-cases on ISP topologies). Hence, this problem is rarely a real constraint in practice.
1.12 Automation Based on Colored Service Routes
A key intuition at the base of the simplicity and automation of our SR-TE solution has been to place
the BGP routes at the center of our solution. That idea came during a taxi ride in Rome with Alex
Preusche and Alberto Donzelli.

BGP routes provide reachability to services: Internet, L3VPN, PW, L2VPN.

We allow the operator to mark the BGP routes with colors. Hence, any BGP route has a next-hop and
a color.

The next-hop indicates where we need to go.

The color is an extended community attribute carrying a 32-bit value. The color indicates how we need to reach the next-hop. It defines a TE SLA intent.

An operator allocates TE SLA intent to each color as he wishes. A likely example is:

No color: “best-effort”

Red: “low-delay”

Let us assume the operator marks a BGP route 9/8 via 1.1.1.5 with color red while BGP route 8/8 via
1.1.1.5 is left uncolored.

Upon receiving the route 8/8, Node 1 installs 8/8 via 16005 (prefix-SID of 1.1.1.5) in FIB. This is the
classic behavior. The traffic to 8/8 takes the IGP path to 1.1.1.5 which is the best-effort (lowest cost)
path.

Upon receiving the route 9/8, Node 1 detects that the route has a color red that matches a local TE
SLA template red. As such, the BGP process asks the TE process for the local SR Policy with (color
= red; endpoint = 1.1.1.5). If this SR Policy does not yet exist, then the TE process instantiates it on-
demand. Whether pre-existing or on-demand instantiated, the TE process eventually returns an SR
Policy to the BGP process. The BGP process then installs 9/8 on the returned SR policy. The traffic to
9/8 takes the red SR Policy to 1.1.1.5 which provides the low-delay path to node 5.
The installation of a BGP route onto an SR Policy is called Automated Steering (AS). This is
a fundamental simplification as one no longer needs to resort to complex policy-based routing
constructions.

The dynamic instantiation of an SR Policy based on a color template and an endpoint is called On-
Demand Next-hop (ODN). This is a fundamental simplification as one no longer needs to pre-
configure any SR Policy. Instead, all the edge nodes are configured with the same few templates.
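The following Python sketch captures the decision logic described in this section. It is a deliberately simplified model (the dictionary-based policy store, the color value 100 standing for “red”, and the template name are all assumptions), not the BGP or SR-TE process implementation.

policies = {}                        # (color, endpoint) -> instantiated SR Policy
odn_templates = {100: "low-delay"}   # color -> locally configured SLA template

def get_or_create_policy(color, endpoint):
    key = (color, endpoint)
    if key not in policies and color in odn_templates:
        # ODN: instantiate the SR Policy on demand from the color template.
        policies[key] = "SR Policy to {} ({})".format(endpoint, odn_templates[color])
    return policies.get(key)

def install_bgp_route(prefix, next_hop, color=None):
    policy = get_or_create_policy(color, next_hop) if color is not None else None
    if policy:
        return "{} steered onto {}".format(prefix, policy)            # Automated Steering
    return "{} via IGP shortest path to {}".format(prefix, next_hop)  # classic behavior

print(install_bgp_route("8.0.0.0/8", "1.1.1.5"))             # best-effort
print(install_bgp_route("9.0.0.0/8", "1.1.1.5", color=100))  # low-delay SR Policy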
1.13 The SR-TE Process
The SR-TE solution is implemented in the Cisco router operating systems (Cisco IOS XR, Cisco IOS
XE, Cisco NX-OS) as a completely new process (i.e., different and independent from MPLS RSVP-
TE).

In this book, we explain the IOS XR SR-TE implementation in detail. We ensured consistency
between the three OS implementations and hence the concepts are easily leveraged for IOS XE and
NX-OS platforms.

1.13.1 One Process, Multiple Roles


The SR-TE process supports multiple roles:

Within the headend: SR-TE brain of the router (as a local decision maker)

Topology database limited to its local domain

Possibly contains multi-domain topology database, but not yet seen in practice

Outside the headend: SR PCE (helping routers for path disjointness or inter-domain policies)

This multi-role ability is very similar to the BGP process in the router OS:

BGP brain of the router (receives paths and installs the best paths in the RIB)

BGP Route Reflector (helping routers to collect all the paths)

In the BGP case, a border router peers to a Route Reflector (RR) to get all its BGP paths. The border
router and the route reflector run the same OS and use the same BGP process. The border router uses
the BGP process as a local headend, for BGP path selection and forwarding entry installation. The
route reflector uses the BGP process only to aggregate paths and reflects the best ones.

This is very similar for SR-TE.

In the SR-TE case, a border router (headend in SR-TE language) runs the SR-TE process to manage
its policies, computes the SID list when it can, requests help from an SR PCE when it cannot, and
eventually installs the forwarding entries for the active paths of its policies.
The SR PCE runs the same SR-TE process but in PCE mode. In this case, the SR-TE process
typically collects more topological information from multiple domains (e.g., for inter-domain path
calculation) and provides centralized path computation services for multiple headends.

The ability to use a single consistent architecture and implementation for multiple roles or use-cases
is a great benefit.

However, it may sound confusing initially. For example, let us ask the following question: does the
router have the multi-domain topology? Three answers are possible:

No. The router only has local domain information and asks the SR PCE for help for inter-domain
policies.

Yes. The router is an SR PCE and to fulfill this role it aggregates the information from multiple
domains.

Yes. The router is a headend, but the operator has decided to provide multi-domain information
directly to this headend such that it computes inter-domain policies by itself (without SR PCE
help).

The SR Process architecture and implementation allows these three use-cases.

In this book, we walk you through the various use-cases and we provide subjective hints and
experience to identify which use-cases are most common. For example, while the last answer is
architecturally possible and it will likely eventually be deployed, for now such design has not come
up in deployment discussions.

Another likely question may be: “Is the SR PCE a virtualized instance on a server?” Two answers are
possible:

Yes. An SR PCE does not need to install any policy in its dataplane. It has a control-plane only
role. Hence, it may be more scalable/cost-effective to run it as a virtual instance

No. An SR PCE role may be enabled on a physical router in the network. The router is already
present, it already has the SR-TE process, the operator may as well leverage it as an additional SR
PCE role.
Again, this is very similar to the BGP RR role: it can be virtualized on a server or enabled on a
physical router.

1.13.2 Components
At high-level, the SR-TE process comprises the following components:

SR-TE Database

Local Policy Database:

Multiple candidate paths per policy received via multiple channels (BGP, PCEP, NETCONF,
CLI)

Validation process

Selection process

Binding SID (BSID) association

Dynamic Path Computation – the SR-native algorithms

On-Demand Policy instantiation (ODN)

Policy installation in FIB

Automated Steering (AS)

Policy Reporting

BGP-LS

Telemetry

NETCONF/YANG

1.13.3 SR-TE Database


A key component of the SR-TE process is the SR-TE database (SR-TE DB).
The SR-TE process collects topology information from ISIS/OSPF and BGP-LS and stores this
information in the SR-TE database (SR-TE DB).

We designed the SR-TE DB to be multi-domain aware.

This means that the SR-TE process must be able to consolidate in the SR-TE DB the topology
information of multiple domains, while discarding the redundant information.

The SR-TE DB is also designed to hold more than the topological information, it includes the
following:

Segments (Prefix-SIDs, Adj-SIDs, Peering SIDs)

TE Link Attributes (such as TE metric, delay metric, SRLG, affinity link color)

Remote Policies (in order to leverage a policy in a central domain as a transit policy for an end-to-end inter-domain policy between two different access domains, an SR PCE must know about all the policies installed in the network that are eligible for use as transit policies)
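As a rough illustration of what the SR-TE DB holds, here is a minimal Python data-model sketch. The class and field names are assumptions made for illustration; the actual implementation is different and considerably richer.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Link:
    local_node: str
    remote_node: str
    igp_metric: int
    te_metric: int = 0
    delay_us: int = 0                              # measured link delay
    srlgs: List[int] = field(default_factory=list)
    affinities: List[str] = field(default_factory=list)
    adj_sid: int = 0

@dataclass
class Node:
    router_id: str
    domain: str
    prefix_sids: Dict[int, int] = field(default_factory=dict)   # algorithm -> SID

@dataclass
class RemotePolicy:
    headend: str
    endpoint: str
    color: int
    binding_sid: int              # usable as a transit segment in other policies

@dataclass
class SRTEDatabase:
    nodes: Dict[str, Node] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)
    remote_policies: List[RemotePolicy] = field(default_factory=list)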

1.13.4 SR Native Algorithms


This is a very important topic.

The algorithms to translate an intent (e.g., low-delay) into a SID list have nothing to do with the
circuit algorithms used by ATM/FR PNNI or MPLS RSVP-TE.

The SR native algorithms take into consideration the properties of each available segment in the
network, such as the segment algorithm or ECMP diversity, and compute paths as sequences of
segments rather than forwarding links.

The SR native algorithms are implemented in the SR-TE process, and hence are leveraged either in
headend or SR PCE mode.

1.13.5 Interaction With Other Processes and External APIs


The SR-TE process has numerous interactions with other router OS processes and with external
entities via APIs.
Those interactions are classified as a router headend role (“headend”) or an SR PCE role (“SR
PCE”). These hints are indicative/subjective as many other use-cases are possible.

Figure 1-7: SR-TE interaction with other processes and external APIs

ISIS/OSPF

Headend discovers the local-domain link-state topology

BGP-LS

SR PCE discovers local/remote-domain link-state topology information

SR PCE discovers SR Policies in local/remote domains

Headend reports its locally programmed SR Policies and their status

Headend discovers the local-domain link-state topology (e.g., SR BGP-only DC)

PCEP

Headend requests a path computation from the SR PCE


Conversely, SR PCE receives a path computation request

Headend learns a path from the SR PCE

Conversely, SR PCE signals a path to the headend

Headend reports its local SR Policies to the SR PCE

BGP-TE

Headend learns a path from the SR PCE

Conversely, SR PCE signals a path to the headend

NETCONF

Headend learns a path from the SR PCE

Conversely, SR PCE signals a path to the headend

SR PCE discovers SR Policies of the headend

Conversely, headend signals SR Policies to the SR PCE

FIB

Headend installs a policy in the dataplane

BGP Service Route and equivalent (Layer-2 Pseudowire) for automated steering purposes

Headend’s BGP process asks the SR-TE process for an SR Policy (Endpoint, color)

Headend’s SR-TE process communicates the local valid SR-TE policies to the BGP process

Northbound API for high-level orchestration and visibility (SR PCE)

Applications such as Cisco WAE, Packet Design LLC, etc.


We will come back to these different interactions and APIs throughout this book and illustrate them
with examples.

1.13.6 New Command Line Interface (CLI)


The SR-TE CLI in IOS XR 6.2.1 and above has been designed to be intuitive and minimalistic.

It fundamentally differs from the classic MPLS RSVP-TE CLI in IOS XR.
1.14 Service Programming
While we will detail the notion of “network programming” and hence “service programming” in Part
III of this series of SR books, it is important to understand that the Service Function Chaining (SFC)
solution is entirely integrated within the SR-TE solution.

The solution to enforce a stateless topological traffic-engineering policy through a network is the
same as the solution to enforce a stateless service program.

We call a “Network Function (NF)” an application/function provided by a hardware appliance, a
Virtual Machine (VM) or a container. An NF can reside anywhere within the SR domain. It can be
close to the access or centralized within a DC.

We use the term “service program” instead of the classical term “service chain” because the
integrated SR-TE solution for TE and SFC delivers more than a simple sequential chain.

The SR solution allows for the same expression as a modern programming language: it supports
flexible branching based on rich conditions. An example of application that leverages the unique
capabilities of SR is the Linux iptables-based SERA firewall [SERA]. This open-source firewall can
filter packets based on their attached SR information and perform SR-specific actions, such as
skipping one or several of the remaining segments. This has also been demonstrated with the open-
source SNORT application.

Aside from the richness of the SR-TE “service program” solution, we also have the scale and
simplicity benefit of SR: the state is only at the edge of the network and no other protocol is needed to
support Network Functions Virtualization (NFV) based solutions (the SR-TE solution is simply
reused).

This is very different from other solutions like Network Service Header (NSH) that installs state all
over the network (less scalable) and creates a new encapsulation, resulting in more protocols, more
overhead, and more complexity.
1.15 Lead Operator Team
The SR-TE solution has been largely influenced by lead operators around the globe.

Martin Horneffer from Deutsche Telekom is an industry TE veteran. In 2005, he highlighted the
scaling and complexity drawback of MPLS RSVP-TE and proposed an alternative solution based on
capacity planning and IGP-metric tuning [IGP Tuning]. He likes to remind the community that the
absolute pre-requisite for a TE process is the reliable and scalable collection of the input
information: the optical SRLGs, the per-link performance metrics (delay and loss) and the demand
matrix. Martin has significantly influenced the SR-TE design by keeping the focus on the automation
and simplification of the input collections and the relationship between TE and Capacity Planning.

Paul Mattes from Microsoft is an IP and SDN industry veteran. His long-time experience with
developing BGP code gives him a clear insight into network complexity and scale. His early
participation in one of the major SDN use-cases [SWAN] gave him an insight into how SR could
preserve the network programming benefit introduced by OpenFlow while significantly scaling its
operation by leveraging the distributed routing protocols and their related prefix segments. Paul has
significantly influenced the SR-TE design, notably any use-case involving a centralized SR-TE
controller with higher-layer applications.

Alex Bogdanov, Steven Lin, Rob Shakir and later Przemyslaw Krol from Google, have joined the
work initiated with Paul, have validated the BGP SR-TE benefits and have helped to refine many
details as we were working on their use-cases. They significantly influenced the RSVP-TE/SR-TE
bandwidth Broker interworking solution.

Niels Hanke from Vodafone Germany has been a key lead operator with one of the first ever SR-
MPLS deployments. Together with Anton Karneliuk from Vodafone Germany, we then engineered the
first worldwide delivery of a three-tier latency service thanks to SR-TE, ODN/AS and IGP SR Flex-
Algo. Anton joined me at the Cisco Live!™ 2019 event to share some information on this new
deployment. The video is available on our site segment-routing.net.

Mike Valentine from a leading financial service customer was the first to realize the applicability of
SR in his sector both to drastically simplify the operation and to provide automated end-to-end SLA
SR Policies. His blunt feedback helped me a lot to convince Siva to “bite the bullet”, forget the
original prototype code and re-engineer the SR-TE process (and its CLI) from scratch.

Stéphane Litkowski from Orange Group was our first SR PCE user. His passion and the co-
development efforts with Siva helped to refine our design and speed-up our production plans.

Dan Voyer of Bell Canada is an IP/MPLS and SDN industry veteran, and his team has deployed an
impressive SR network across several domains. His use-case, combining service chaining and SR-
TE and applying it to external services such as SDWAN, was a key influence in the SR design.

Gaurav Dawra, first at Cisco and then at LinkedIn, extended the BGP control plane and BGP-LS.
Gaurav also provided key insight into the DC and inter-DC use-case for SR-TE.

Dennis Cai from Alibaba helped the community to understand SR-TE deployment and SR Controller
architecture within a WEB context. His presentation is available on our site segment-routing.net.

Arkadiy Gulko of Thomson Reuters has worked closely with us to refine the IGP SR Flex-Algo
solution and show its great applicability for low-latency or diversity requirements.

As in any public presentation on SR, we thank the lead operator team for all their help in defining and
deploying this novel SR solution.
1.16 SR-TE Cisco Team
Siva Sivabalan, Tarek Saad and Joseph Chin worked closely to design and implement the SR-TE
solution on IOS XR from scratch. Siva Sivabalan continues to lead all the SR-TE development and
his energy has been key to deliver the SR-TE solution to our lead operators.

As the project and the deployment expanded, the larger SR-TE team was formed with Mike
Koldychev, Johnson Thomas, Arash Khabbazibasmenj, Abdul Rehman, Alex Tokar, Guennoun
Mouhcine, David Toscano, Bhupendra Yadav, Bo Wu, Peter Pieda, Prajeet G.C. and Jeff Williams.

The SR-TE solution has been tested by Zhihao Hong, Vibov Bhan, Vijay Iyengar, Braven Hong,
Wanmathy Dolaasthan, Manan Patel, Kalai Sankaralingam, Suguna Ganti, Paul Yu, Sudheer
Kalyanashetty, Murthy Haresamudra, Matthew Starky, Avinash Tadimalla, Yatin Gandhi and Yong
Wang.

François Clad played a key role defining the SR native algorithms.

Junaid Israr, Apoorva Karan, Bertrand Duvivier and Ianik Semco have been the project and product
managers for SR-TE and have been instrumental in making things happen from a development process
viewpoint. They orchestrated the work across all the components and APIs so as to deliver a single,
modular and easy to operate solution.

Jose Liste, Kris Michielsen, and Alberto Donzelli supported the first major SR-TE designs and
deployments. They are excellent sources of reality-check to keep the team focused on deployment
requirements. Their professional handling of demos and proof-of-concepts has been key to our
communication of these novel ideas.

Frederic Trate has been leading our external communication and has been a great source of
brainstorming for our long-term strategy.

Tim LaBerge influenced the SR-TE design during his tenure at Microsoft (partnering with Paul
Mattes) and as part of the Cisco WAE development team.

Peter Psenak led the SR IGP Flex-Algo implementation.


Ketan Talaulikar led the BGP-LS architecture and implementation and the related BGP-only design
together with Krishna Swamy.

Zafar Ali led the OAM for SR-TE and our overall SR activity at the IETF.

Rakesh Gandhi led the Performance Monitoring for SR-TE.

Sagar Soni and Patrick Khordoc provided key help on Performance Monitoring implementation.

Dhanendra Jain and Krishna Swamy led the BGP work for SR-TE and were key players for the
ODN/AS implementation.

David Ward, SVP and Chief Architect for Cisco Engineering, was essential to realize the opportunity
with SR and fund the project in September 2012.

Ravi Chandra, SVP Core Software Group, proved essential to execute our project beyond its first
phase. Ravi has been leading the IOS XR, IOS XE and NX-OS software at Cisco. He very quickly
understood the SR opportunity and funded it as a portfolio-wide program. We could then really tackle
all the markets interested in SR (hyper-scale WEB operators, SP and Enterprise) and all the network
segments (DC, metro/aggregation, edge, backbone).

Sumeet Arora, SVP SP Routing, provided strong support to execute our project across the SP routing
portfolio: access, metro, core, merchant and Cisco silicon.

Venu Venugopal, VP, acted as our executive sponsor during most of the SR-TE engineering phase. He
played a key role orchestrating the commitment and effective delivery of all the components of the
solution. We had a great trip together to visit Paul and Tim at Microsoft a few years ago. This is
where the BGP-TE component got started.
1.17 Standardization
As we explained in Part I of this SR book series, we are committed to standardization and have
published at IETF all the details that are required to ensure SR-TE inter-operability (e.g., protocol
extensions).

Furthermore, we have documented a fairly detailed description of our local-node behavior to allow
operators to request similar behaviors from other vendors.

[draft-ietf-spring-segment-routing-policy] is the main document to consider. It describes the SR-TE
architecture and its key concepts. It introduces the various protocol extensions:

draft-filsfils-spring-sr-traffic-counters

draft-filsfils-spring-sr-policy-considerations

draft-ietf-idr-bgp-ls-segment-routing-ext

draft-ietf-idr-te-lsp-distribution

draft-ietf-idr-bgpls-segment-routing-epe

draft-ietf-lsr-flex-algo

draft-ietf-pce-segment-routing

draft-sivabalan-pce-binding-label-sid

draft-ietf-pce-association-diversity

draft-ietf-idr-segment-routing-te-policy

RFC8491

RFC8476

draft-ietf-idr-bgp-ls-segment-routing-msd

See the complete list on www.segment-routing.net/ietf.


1.18 Flow of This Book
This book focuses on the SLA intents that can be handled as a routing problem: low-delay, disjoint
planes, resource inclusion/exclusion, intra and inter-domain.

We relate these SLA intents to use-cases and show how they can be met by various SR-TE solutions:
SR-TE policy with explicit path, SR-TE Policy with dynamic path, SR IGP Flex-Algo.

For example, the dual-plane disjointness service can be supported by an explicit path leveraging a
per-plane anycast SID, by a dynamic path excluding an affinity, or by an IGP Flex-Algo enabled on
the chosen plane.

The first chapters introduce the notions of SR Policy and its candidate paths (static and dynamic). We
then delve into the heart of the SR-TE solution: On-Demand Policy (ODN) and Automated Steering
(AS).

We then cover SR IGP Flexible Algorithm, Network Resiliency and Binding SID.

At that point, the key concepts will have been covered. The remainder of the book is a series of
chapters that detail topics introduced earlier. For example, the notion of SR-TE Database
is introduced in the first chapters as it is key to understand how native SR algorithms compute a
solution SID list and how an explicit candidate path is validated. Chapter 12, "SR-TE Database"
revisits that concept and covers it in depth.

We chose this two-step approach to ease the learning curve. We believe that the most important
thing is to understand what the different components of the SR-TE solution are and how they
interact with each other. Once this is well understood, the reader can then zoom in on a specific
component and study it in more depth.
1.19 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence


Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-
routing-policy-02 (Work in Progress), October 2018

[draft-ietf-pce-segment-routing] "PCEP Extensions for Segment Routing", Siva Sivabalan,


Clarence Filsfils, Jeff Tantsura, Wim Henderickx, Jonathan Hardwick, draft-ietf-pce-segment-
routing-16 (Work in Progress), March 2019

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano


Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-
segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-ietf-idr-bgp-ls-segment-routing-ext] "BGP Link-State extensions for Segment Routing",


Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Hannes Gredler, Mach Chen, draft-ietf-idr-
bgp-ls-segment-routing-ext-12 (Work in Progress), March 2019

[draft-sivabalan-pce-binding-label-sid] "Carrying Binding Label/Segment-ID in PCE-based


Networks.", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Jonathan Hardwick, Stefano Previdi,
Cheng Li, draft-sivabalan-pce-binding-label-sid-06 (Work in Progress), February 2019

[draft-ietf-lsr-flex-algo] "IGP Flexible Algorithm", Peter Psenak, Shraddha Hegde, Clarence


Filsfils, Ketan Talaulikar, Arkadiy Gulko, draft-ietf-lsr-flex-algo-01 (Work in Progress),
November 2018

[RFC8491] "Signaling Maximum SID Depth (MSD) Using IS-IS", Jeff Tantsura, Uma Chunduri,
Sam Aldrin, Les Ginsberg, RFC8491, November 2018

[RFC8476] "Signaling Maximum SID Depth (MSD) Using OSPF", Jeff Tantsura, Uma Chunduri,
Sam Aldrin, Peter Psenak, RFC8476, December 2018
[draft-ietf-idr-bgp-ls-segment-routing-msd] "Signaling MSD (Maximum SID Depth) using Border
Gateway Protocol Link-State", Jeff Tantsura, Uma Chunduri, Gregory Mirsky, Siva Sivabalan,
Nikos Triantafillis, draft-ietf-idr-bgp-ls-segment-routing-msd-04 (Work in Progress), February
2019

[draft-ietf-idr-te-lsp-distribution] "Distribution of Traffic Engineering (TE) Policies and State


using BGP-LS", Stefano Previdi, Ketan Talaulikar, Jie Dong, Mach Chen, Hannes Gredler, Jeff
Tantsura, draft-ietf-idr-te-lsp-distribution-10 (Work in Progress), February 2019

[draft-ietf-pce-association-diversity] "Path Computation Element communication Protocol (PCEP)


extension for signaling LSP diversity constraint", Stephane Litkowski, Siva Sivabalan, Colby
Barth, Mahendra Singh Negi, draft-ietf-pce-association-diversity-06 (Work in Progress), February
2019

[draft-ietf-idr-tunnel-encaps] "The BGP Tunnel Encapsulation Attribute", Eric C. Rosen, Keyur


Patel, Gunter Van de Velde, draft-ietf-idr-tunnel-encaps-11 (Work in Progress), February 2019

[SR.net] <http://www.segment-routing.net>

[GMPLS-UNI] <https://www.cisco.com/c/en/us/td/docs/routers/asr9000/software/asr9k-r6-
5/mpls/configuration/guide/b-mpls-cg-asr9000-65x/b-mpls-cg-asr9000-65x_chapter_0111.html>

[Cisco WAE] <https://www.cisco.com/c/en/us/products/routers/wae-planning/index.html>

[Google Espresso] “Taking the Edge off with Espresso: Scale, Reliability and Programmability for
Global Internet Peering.”, Kok-Kiong Yap, Murtaza Motiwala, Jeremy Rahe, Steve Padgett,
Matthew Holliman, Gary Baldus, Marcus Hines, Taeeun Kim, Ashok Narayanan, Ankur Jain,
Victor Lin, Colin Rice, Brian Rogan, Arjun Singh, Bert Tanaka, Manish Verma, Puneet Sood,
Mukarram Tariq, Matt Tierney, Dzevad Trumic, Vytautas Valancius, Calvin Ying, Mahesh
Kallahalla, Bikash Koley, and Amin Vahdat, Proceedings of the Conference of the ACM Special
Interest Group on Data Communication (SIGCOMM '17), 2017.
<https://doi.org/10.1145/3098822.3098854>

[Facebook EdgeFabric] “Engineering Egress with Edge Fabric: Steering Oceans of Content to the
World.”, Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V.
Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng, Proceedings
of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '17), .
2017. <https://doi.org/10.1145/3098822.3098853>,
<https://research.fb.com/publications/engineering-egress-with-edge-fabric/>

[Alibaba NetO] “NetO: Alibaba’s WAN Orchestrator”, Xin Wu, Chao Huang, Ming Tang, Yihong
Sang, Wei Zhou ,Tao Wang, Yuan He, Dennis Cai, Haiyong Wang, and Ming Zhang, SIGCOMM
2017 Industrial Demos, 2017. <http://conferences.sigcomm.org/sigcomm/2017/files/program-
industrial-demos/sigcomm17industrialdemos-paper1.pdf>

[SERA] “SERA: SEgment Routing Aware Firewall for Service Function Chaining scenarios”,
Ahmed Abdelsalam, Stefano Salsano, Francois Clad, Pablo Camarillo and Clarence Filsfils, IFIP
Networking, Zurich, Switzerland, May 2018.
<http://dl.ifip.org/db/conf/networking/networking2018/1B2-1570422197.pdf>

[IGP tuning] “IGP Tuning in an MPLS Network”, Martin Horneffer, NANOG 33, February 2005,
Las Vegas.

[SWAN] “Achieving high utilization with software-driven WAN.”, Chi-Yao Hong, Srikanth
Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer,
Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM (SIGCOMM '13), 2013.
<https://doi.org/10.1145/2486001.2486012>, <http://research.microsoft.com/en-
us/projects/swan>

1. While a typical deployment uses RRs to scale BGP, any BGP distribution model can be used:
full-mesh, RRs, Confederations.

2. Historically, serialization delay (packet size divided by the Gbps rate of the link) was also a
component. It has now become negligible.

3. Not to be confused with link colors or affinity colors, which are used to mark links in order to
include or exclude them from a TE path.
Section I – Foundation
This section describes the foundational elements of SR Traffic Engineering.
2 SR Policy
What we will learn in this chapter:

An SR Policy at a headend is identified by the endpoint and a color

An SR Policy has one or more candidate paths, which are often simply called “paths”

A candidate path is in essence a SID list (a list of segments), or a set of SID lists

A candidate path can be dynamic or explicit

A candidate path can be instantiated via CLI/NETCONF or signaled via PCEP or BGP-TE

The valid path with the highest preference is selected as active path

The active path’s SID lists are programmed in the forwarding table

The SID lists of an SR Policy are the SID lists of its active path

An SR Policy is bound to a Binding-SID

A valid SR Policy has its Binding-SID installed in the MPLS forwarding table as a local incoming
label with action “Pop and push the SID list of the SR Policy”

We first introduce the concept of an SR Policy through examples, then we provide a formal definition
and revisit the key concept of the Binding Segment. The last section introduces the configuration
model.
2.1 Introduction
Segment Routing allows a headend node to steer a packet flow along any path in the network by
imposing an ordered list of segments on the packets of the flow. No per-flow state is created on the
intermediate nodes, the per-flow state is carried within the packet’s segment list. Each segment in the
segment list identifies an instruction that the node must execute on the packet, the most common being
a forwarding instruction.

Segments and Segment Identifiers (SIDs)


A Segment is an instruction that a node executes on the incoming packet that carries the instruction in its header. Examples of
instructions are forward packet along the shortest path to its destination, forward packet through a specific interface, deliver
packet to a given application/service instance, etc.

A Segment Identifier (SID) identifies a Segment. The format of a SID depends on the implementation. The SR MPLS
implementation uses MPLS labels as SIDs. SRv6 uses SIDs in the format of IPv6 addresses but they are not actually IPv6
addresses as their semantic is different.

While there is a semantic difference between segments and SIDs, the latter being the identifier of the former, both terms are often
used interchangeably and the meaning can be derived from its context. This also applies to their derivatives such as “segment list”
and “SID list”.

In the MPLS instantiation of Segment Routing, a SID is an MPLS label and a SID list is an MPLS
stack.

In the context of topological traffic-engineering, the SID list (the MPLS label stack) is composed of
prefix-SIDs (shortest path to the related node) and adjacency SIDs (specific use of a link).

A key benefit of the Segment Routing solution is that it integrates topological traffic-engineering and
service chaining in the same solution. The SID list can be a combination of Prefix-SIDs, Adjacency-
SIDs and service SIDs. The first two help to guide the packets topologically through the network
while the latter represents services distributed through the network (hardware appliance, VM or
container).

In this book (Part II), we will focus on the topological traffic-engineering (more briefly, traffic
engineering or SR-TE). Part III will explain how the same solution is leveraged for SR-based service
chaining.
The central concept in our SR-TE solution is “SR Policy”. SR Policy governs the two fundamental
actions of a traffic engineering solution:

Expressing a specific path through the network

Typically different from the shortest-path computed by the IGP

Steering traffic onto the specific path

In the remaining of this section, we use four illustrations to introduce some key characteristics of an
SR Policy. The first example illustrates the concept of explicit path. The second example introduces
the concept of validation and selection of an explicit path. The third example introduces the concept
of a dynamic path with the minimization of a cumulative metric. The fourth example extends this latter
concept by adding a constraint to the dynamic path.

The ability to engineer a path through a network cannot be dissociated from the steering of traffic onto
that path. Automated Steering (AS) is the SR-TE functionality that automatically steers colored
service routes on the appropriate SR Policy. We briefly present the AS functionality in the first
example, while chapter 5, "Automated Steering" and chapter 10, "Further Details on Automated
Steering" cover the steering concept in detail.

SR Policy is not a tunnel!


“The term “tunnel” historically involved several issues that we absolutely did not want Segment Routing to inherit: 1/ pre-
configuration at a headend towards a specific endpoint with a specific SLA or path; 2/ performance degradation upon
steering; 3/ scale limitation due to the handling of a tunnel as an interface; 4/ autoroute steering as sole steering mechanism

The think-out-of-the-box and simplification/automation mindset pushed us to reconsider the key concepts at the base of an
SR solution.

This allowed us to identify the notions of SR Policy, color of an SR Policy, On-Demand Policy (ODN) and Automated
Steering (AS).

These concepts were not present in the RSVP-TE tunnel model and allowed us to simplify the TE operation. ”

— Clarence Filsfils

The illustrations assume a single IGP area network (e.g., Figure 2‑1).
Figure 2-1: Network Topology with IGP Shortest path and SR Policy with explicit path

The default IGP metric for the links in this network is 10. The link between Node3 and Node4 is an
expensive, low capacity link and therefore the operator has assigned it a higher IGP link metric of
100. Also, the link between Node8 and Node5 has a higher IGP link metric of 100.

2.1.1 An Explicit Candidate Path of an SR Policy


We illustrate the concept of an explicit candidate path of an SR Policy. We briefly introduce the
Automated Steering (AS) of a flow onto an SR Policy.

Note that, while the exact term is “candidate path”, we may use the shorter term “path” instead.

We assume that the operator wants some traffic from Node1 to Node4 to go via the path
1→7→6→8→5→4.

The operator could express this path as <16008, 24085, 16004>: shortest-path to Node8 (Prefix-SID
of Node8), adjacency from Node8 to Node5 (Adjacency-SID), shortest-path to Node4 (Prefix-SID of
Node4). Per the illustration conventions used in this book, 24085 is the Adj-SID label of Node8 for
its link to Node5.
Such a SID list expressed by the operator is called “explicit”. The term “explicit” means that the
operator (potentially via an external controller) computes the source-routed path and programs it on
the headend router. The headend router is explicitly told what source-routed path to use. The headend
router simply receives a SID list and uses it as such.

No intent is associated with the SID list: that is, the headend does not know why the operator selected
that SID list. The headend just instantiates it as per the explicit configuration.

This explicit SID list (the sequence of instructions to go from Node1 to Node4) is configured in an
SR Policy at headend Node1.

SR Policy (headend Node1, color blue, endpoint Node4)

candidate-paths:

1. explicit: SID list <16008, 24085, 16004>

We see that the SR Policy is identified by a tuple of three entries: the headend at which the SR Policy
is instantiated, the endpoint and a color.

As explained in the introduction, the color is a way to express an intent. At this point in the
illustration, think of the color as a way to distinguish multiple SR Policies, each with its own intent,
from the same headend Node1 to the same endpoint Node4.

For example, Node1 could be configured with the following two SR Policies:

SR Policy (headend Node1, color blue, endpoint Node4)

candidate-paths:

1. explicit: SID list <16008, 24085, 16004>

SR Policy (headend Node1, color orange, endpoint Node4)

candidate-paths:

1. explicit: SID list <16003, 24034>


These two SR Policies need to be distinguished because Node1 may want to concurrently steer a
flow into the blue SR Policy while another flow is steered into the orange SR Policy.

In fact, you can already intuit that the color is not only for SR Policy distinction. It plays a key role in
steering traffic, and more specifically in automatically steering traffic in the right SR Policy by
similarly coloring the service route (i.e., intuitively, the orange-colored traffic destined to Node4 will
go via the orange SR Policy to Node4 and the blue-colored traffic destined to Node4 will go via the
blue SR Policy to Node4).

A fundamental component of the SR-TE solution is “Automated Steering” (AS). While this is
explained in detail later in the book, we provide a brief introduction here.

Figure 2-2: Two SR Policies from Node1 to Node4

Automated Steering allows the headend Node1 to automate the steering of its traffic into the
appropriate SR Policy. In this example, we see in Figure 2‑2 that Node4 advertises two BGP routes
to Node1: 1.1.1.0/24 with color orange and 2.2.2.0/24 with color blue. Based on these colors,
headend Node1 automatically steers any traffic to orange prefix 1.1.1.0/24 into the SR Policy
(orange, Node4) and any traffic to blue prefix 2.2.2.0/24 into the SR Policy (blue, Node4).

Briefly, applying Automated Steering at a headend allows the headend to automatically steer a
service route (e.g., BGP, PW, etc.) into the SR Policy that matches the service route color and the
service route endpoint (i.e., BGP nexthop).

Remember that while in this text we refer to colors as names for ease of understanding, a color is
actually a number. For example, the color blue could be the number 10. To advertise a color with a
prefix, BGP adds a color extended community attribute to the advertisement (more details in chapter
5, "Automated Steering").

Let us also reiterate a key benefit of Segment Routing. Once the traffic is steered into the SR Policy
configured at Node1, the traffic follows the traffic-engineered path without any further state in the
network.

For example, if the blue traffic destined for Node4 is steered in the SR Policy (Node1, blue, Node4),
then the blue traffic will go via 1→7→6→8→5→4. No state is present at 7, 6, 8, 5 or 4 for this flow.
The only state is present at the SR Policy headend. This is a key advantage compared to RSVP-TE
which creates per-flow state throughout the network.

In these first examples, each SR Policy had one single explicit candidate path, specified as a
single explicit SID list.

In the next illustrations, we expand the SR Policy concept by introducing multiple candidate paths per
policy and then the notion of dynamic candidate paths.

2.1.2 Path Validation and Selection


In the second example, we assume that the operator wants some traffic from Node1 to Node4 to go
via one of two possible candidate paths, 1→7→6→8→5→4 and 1→2→3→4. These paths can be
expressed with the SID lists <16008, 24085, 16004> and <16003, 24034> respectively. The operator
prefers using the first candidate path and only use the second candidate path if the first one is
unusable. Therefore, he assigns a preference value to each candidate path, with a higher preference
value indicating a more preferred path. The first path has a preference 100 while the second path has
preference 50. Both paths are shown in Figure 2‑3.

Figure 2-3: SR Policy (blue, Node4) with two candidate paths

A candidate path is usable when it is valid. A common path validity criterion is the reachability of
its constituent SIDs. The validation rules are specified in chapter 3, "Explicit Candidate Path".

When both candidate paths are valid (i.e., both paths are usable), headend Node1 selects the highest
preference path and installs the SID list of this path (<16008, 24085, 16004>) in its forwarding table.
At any point in time, the blue traffic that is steered into this SR Policy is only sent on the selected
path, any other candidate paths are inactive.

A candidate path is selected when it has the highest preference value among all the valid candidate
paths of the SR Policy. The selected path is also referred to as the “active path” of the SR Policy. In
case multiple valid candidate paths have the same preference, the tie-breaking rules described in
chapter 16, "SR-TE Operations" are evaluated to select a path.

A headend re-executes the active path selection procedure whenever it learns about a new candidate
path of an SR Policy, when the active path is deleted, or when an existing candidate path is
modified or its validity changes.
At some point, a failure occurs in the network. The link between Node8 and Node5 fails, as shown in
Figure 2‑4. At first, Topology Independent Loop-Free Alternate (TI-LFA) protection ensures that the
traffic flows that were traversing this link are quickly (in less than 50 milliseconds) restored. Chapter
8, "Network Resiliency" describes TI-LFA protection of SR Policies. Refer to SR book Part I for
more details of TI-LFA.

Eventually, headend Node1 learns via IGP flooding that the Adj-SID 24085 of the failed link has
become invalid. Node1 evaluates the validity of the path’s SID list1 and, due to the presence of the
invalid Adj-SID, invalidates the SID list and the candidate path. Node1 then re-executes the path
selection process. Node1 selects the next highest preference valid candidate path,
the path with preference 50. Node1 installs the SID list of this path – <16003, 24034> – in the
forwarding table. From then, the blue traffic that is steered into this SR Policy is sent on the new
selected path.

Figure 2-4: SR Policy (blue, Node4) with two candidate paths – highest preference path is invalid

After restoring the failed link, the candidate path with preference 100 becomes valid again. Headend
Node1 will perform the SR Policy path selection procedure again, select the valid candidate path
with the highest preference and update its forwarding table with this path’s SID list <16008, 24085,
16004>. The blue traffic that is steered into this SR Policy is sent on the path 1→7→6→8→5→4, as
shown in Figure 2‑3.
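
The selection behavior illustrated in this example can be captured in a few lines. The Python sketch
below is only conceptual: candidate paths are plain dictionaries, and the tie-breaking rules that
apply when several valid paths share the same preference (see chapter 16, "SR-TE Operations") are
not modeled.

def select_active_path(candidate_paths):
    """Return the valid candidate path with the highest preference, or None.
    Conceptual sketch; equal-preference tie-breakers are not modeled."""
    valid = [path for path in candidate_paths if path["valid"]]
    return max(valid, key=lambda path: path["preference"]) if valid else None

# SR Policy (blue, Node4): the two explicit candidate paths of the example.
paths = [
    {"preference": 100, "valid": True, "sid_list": [16008, 24085, 16004]},
    {"preference": 50,  "valid": True, "sid_list": [16003, 24034]},
]
print(select_active_path(paths)["sid_list"])   # [16008, 24085, 16004]

# The Node8-Node5 link failure invalidates Adj-SID 24085 and hence the first path.
paths[0]["valid"] = False
print(select_active_path(paths)["sid_list"])   # [16003, 24034]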

2.1.3 A Low-Delay Dynamic Candidate Path


Often, the operator prefers to simply express an intent (e.g., low-delay2 to a specific endpoint or
avoid links with TE affinity color purple) and let the headend translate this intent into a SID list. This
is called a “dynamic” candidate path. The headend dynamically translates the intent into a SID list,
and most importantly, continuously responds to any network change by updating the SID list as
required to meet the intent.

The intent is formally defined as a minimization of an additive metric (e.g., IGP metric, TE metric or
Link-Delay) and a set of constraints (e.g., avoid/include IP address, SRLG, TE Affinity).

Leveraging the link-state routing protocol (ISIS, OSPF or BGP-LS), each node distributes its own
local information, its own piece of the network jigsaw puzzle. Besides the well-known topology
information (nodes, links, prefixes, and their attributes), the link-state advertisement may include SR
elements (SRGB, SIDs, …) and other link attributes (delay, loss, SRLGs, affinity, …).

The headend node receives all this information and stores it in its local SR-TE database (SR-TE
DB). This SR-TE DB contains a complete topological view of the local IGP area, including SR-TE
information. The SR-TE DB contains everything that the headend node needs to compute the paths
through the network that meet the intent.

The headend uses “native SR” algorithms to translate the “intent” of a dynamic path into a SID list.
The term “native SR” highlights that the algorithm has been optimized for SR. It maximizes ECMP
and minimizes the SID list length.

Let us now consider an SR Policy from Node1 to Node4 that expresses an intent to provide a low-
delay path. To compute the low-delay path, the headend node needs to know the delay of the links in
the network. Figure 2‑5 shows the measured delay values for each link. Headend Node1 receives
these link-delay metrics via the IGP and adds them to its SR-TE DB. Node1 can now compute the
low-delay path to Node4, which is simply the shortest path computation using the link-delay as
metric. The resulting path is 1→2→3→4, with a cumulative delay 12+11+7 = 30.
Figure 2-5: Network topology with measured link-delay values
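
At its core, this low-delay computation is a shortest path computation that uses the link-delay as
the metric. The Python sketch below illustrates the idea on a toy graph: only the delays 12, 11 and 7
along the 1→2→3→4 path are taken from the figure, the other delay values are invented so that the
example is runnable; it is not the actual SR-native algorithm.

import heapq

def shortest_path(links, src, dst):
    """Plain Dijkstra; 'links' maps (node, node) -> metric (here: link delay)."""
    graph = {}
    for (a, b), metric in links.items():
        graph.setdefault(a, []).append((b, metric))
        graph.setdefault(b, []).append((a, metric))
    queue, visited = [(0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None

# Delays 12, 11 and 7 are from the example; the other values are illustrative.
link_delay = {(1, 2): 12, (2, 3): 11, (3, 4): 7, (3, 6): 20, (6, 5): 20,
              (1, 7): 15, (7, 6): 15, (6, 8): 15, (8, 5): 15, (5, 4): 15}
print(shortest_path(link_delay, 1, 4))   # (30, [1, 2, 3, 4])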

Node1 encodes the path in the SID list <16003, 24034>; 16003 is the shortest-path to Node3 and
hence correctly follows 1→2→3 and then 3→4 is enforced with the Adj SID from Node3 to Node4
(24034).

To recap, an SR Policy may be instantiated with a “dynamic” candidate path. A “dynamic” candidate
path expresses an intent. The headend uses its SR-TE Database and its SR native algorithms
(explained later in the book) to translate the intent into a SID list. Any time the network changes, the
headend updates the SID list accordingly.

Assuming that the operator uses the “green” color for “low-delay”, the SR Policy that we just
analyzed can be summarized as follows:

SR Policy (Node1, green, Node4)

candidate-paths:
1. dynamic: delay optimized

→ SID list <16003, 24034>

2.1.4 A Dynamic Candidate Path Avoiding Specific Links


Let us now assume that the operator needs an SR Policy “purple” from Node1 to Node4 which
minimizes the accumulated IGP metric from Node1 to Node4 while avoiding the links with affinity
color red.

Link affinity colors are a set of properties that can be assigned to a link.

Each link in the network can have zero, one or more affinity colors. These colors are advertised in the
IGP (and BGP-LS) with the link (IGP adjacency, really).

While we refer here to affinity colors as color names, each affinity color is actually a bit in a bitmap.
If a link has a given color, then the bit in the bitmap that corresponds to that color is set.
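
For instance, assuming the red affinity is mapped to bit position 0 (an arbitrary choice for this
illustration), checking whether a link carries the red color is a simple bit test:

RED_BIT = 0   # assumed bit position of the "red" affinity (illustrative only)

def has_affinity(link_affinity_bitmap, bit_position):
    """Return True if the affinity bit at 'bit_position' is set on the link."""
    return bool(link_affinity_bitmap & (1 << bit_position))

print(has_affinity(0b0001, RED_BIT))   # True: the link is colored red
print(has_affinity(0b0000, RED_BIT))   # False: the link carries no red color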

Colors
The term “color” may refer to an attribute of two distinct elements of a network that must not be confused.

Link affinity color is a commonly used name to indicate an IETF Administrative Group (RFC 3630 and RFC 5305) or Resource Class
(RFC 2702). Link affinity colors are link attributes that are used to express some notion of "class". The link affinity colors can then be
used in path computation to include or exclude links with some combination of colors.

SR Policy colors on the other hand, are used to identify an intent or SLA. Such color is used to match a service route that requires a
given SLA to an SR Policy that provides the path that satisfies this SLA. The SLA color is attached to a service route advertisement
as a community.

As shown in Figure 2‑6, only the link between Node7 and Node6 has affinity color red. Node6 and
Node7 distribute this affinity color information within the network in the IGP. Node1 receives this
information and inserts it into its SR-TE DB.
Figure 2-6: Network topology with link affinity

To compute the path, Node1 prunes the links excluded by the constraint “avoid red links” (i.e., the
red links) from the topology model in its SR-TE DB and computes the IGP metric shortest path to Node4 on the
pruned topology. The resulting path is 1→2→3→6→5→4. Node1 encodes this path in the optimal
SID list <16003, 16004>. 16003 is the Prefix-SID of Node3, transporting the packet via the IGP
shortest path to Node3. 16004 is the Prefix-SID of Node4, transporting the packet via the IGP shortest
path to Node4.
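
Conceptually, the “exclude red links” constraint simply removes those links from the candidate
topology before running the same kind of shortest path computation as in the low-delay example, this
time with the IGP metric. The following Python sketch of the pruning step uses a toy, partial link
set with assumed affinity bit assignments:

RED_BIT = 0   # assumed bit position of the "red" affinity (illustrative only)

def prune_excluded_links(links, exclude_bit):
    """Return a copy of the link set without the links carrying the excluded affinity."""
    return {key: attrs for key, attrs in links.items()
            if not attrs["affinity_bits"] & (1 << exclude_bit)}

# Toy data: only the link 7-6 carries the red affinity, as in the example.
links = {
    (1, 2): {"igp_metric": 10, "affinity_bits": 0b0},
    (7, 6): {"igp_metric": 10, "affinity_bits": 0b1},   # the red link
    (3, 4): {"igp_metric": 100, "affinity_bits": 0b0},
}
pruned = prune_excluded_links(links, RED_BIT)
print((7, 6) in pruned)   # False: the red link is removed before path computation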

This SR Policy is summarized as follows:

SR Policy (Node1, purple, Node4)

candidate-paths:

1. dynamic: minimize IGP-metric AND exclude red links

→ SID list <16003, 16004>


In section 2.2 we formalize the SR Policy concepts illustrated in this section.

2.1.5 Encoding a Path in a Segment List


A segment list expresses a path in a sequence of segments. Each segment specifies a part of the path
and, when taken in isolation, is independent from the intent of the end-to-end path. SR-TE derives an
optimal segment list to encode the end-to-end path by using the segments at its disposal.

By default, the available segments are the regular IGP prefix segments (following the IGP shortest
path) and IGP adjacency segments.3

In the low-delay example in section 2.1.3, SR-TE encoded the low-delay path in a sequence of a
Prefix-SID and an Adj-SID.

The illustration of that example is repeated here in Figure 2‑7 for ease of reference.

Figure 2-7: Encoding the low-delay path in a segment list

The low-delay path from Node1 to Node4 is 1→2→3→4. Each node advertises a Prefix-SID and an
Adj-SID for each of its adjacencies. These are the segments that SR-TE has available in its toolbox to
encode the path.
Since this path has no ECMP, SR-TE could encode the path using the sequence of Adj-SIDs of all
links that the path traverses: <24012, 24023, 24034>. This is not optimal as it requires a SID for each
hop.

SR-TE sees that the portion of the path 1→2→3 can be expressed by the Prefix-SID of Node3.
Indeed, the IGP shortest path from Node1 to Node3 is 1→2→3.

The IGP shortest path from Node3 to Node4 is 3→6→5→4, which does not match the desired path
3→4. The path 3→4 can be expressed by the Adj-SID 24034 of Node3 for the link to Node4.
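
The greedy idea described above — cover as much of the desired path as possible with a Prefix-SID
and fall back to an Adj-SID when the IGP shortest path diverges — can be sketched as follows. The
shortest-path answers and SID values are hard-coded from this example; a real implementation derives
them from the SR-TE DB, handles ECMP and uses more elaborate native SR algorithms. This is an
illustration of the principle, not the actual algorithm.

# Facts taken from this example: IGP shortest paths and SIDs.
igp_shortest_path = {(1, 3): [1, 2, 3],      # stated in the text
                     (3, 4): [3, 6, 5, 4]}   # stated in the text
prefix_sid = {3: 16003, 4: 16004}
adj_sid = {(1, 2): 24012, (2, 3): 24023, (3, 4): 24034}

def encode(path):
    """Greedily encode a hop-by-hop path into Prefix-SIDs and Adj-SIDs.
    Simplified sketch: ignores ECMP and equal-cost shortest paths."""
    sids, i = [], 0
    while i < len(path) - 1:
        # Find the farthest node whose IGP shortest path matches the desired path.
        best = None
        for j in range(len(path) - 1, i, -1):
            if igp_shortest_path.get((path[i], path[j])) == path[i:j + 1]:
                best = j
                break
        if best is not None and best > i + 1:
            sids.append(prefix_sid[path[best]])   # a Prefix-SID covers several hops
            i = best
        else:
            sids.append(adj_sid[(path[i], path[i + 1])])   # enforce the next link
            i += 1
    return sids

print(encode([1, 2, 3, 4]))   # [16003, 24034]
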
2.2 SR Policy Model
An SR Policy is uniquely identified by a tuple consisting of the following three elements:

Headend

Endpoint

Color

The headend is where the policy is instantiated/implemented.

The endpoint is typically the destination of the SR Policy, specified as an IPv4 or IPv6 address.

The color is an arbitrary 32-bit numerical value used to differentiate multiple SR Policies with the
same headend and endpoint. The color is key for the “Automated Steering” functionality. The color
typically represents an intent, a specific way to reach the endpoint (e.g., low-delay, low-cost with
SRLG exclusion, etc.).

At a given headend node, an SR Policy is fully identified by the (color, endpoint) tuple. In this book,
when we assume that the headend is well-known, we will often refer to an SR Policy as (color,
endpoint).

Only one SR Policy with a given color C can exist between a given pair (headend node, endpoint). In
other words: each SR Policy tuple (headend node, color, endpoint) is unique.

As illustrated in Figure 2‑8, an SR Policy has at least one candidate path and a single active
candidate path. The active candidate path is the valid path of best preference. The SID list of a policy
is the SID list of its active path.
Figure 2-8: SR Policy model

For simplicity we start with the assumption that each candidate path has only one SID list. In reality,
each candidate path can have multiple SID lists, each with an associated load-balancing weight.
The traffic on that candidate path is then load-shared over all valid SID lists of that path, in
accordance with their weight ratio. This is explained in more detail in chapter 16, "SR-TE
Operations".

2.2.1 Segment List


A Segment List (SID list) is a sequence of SIDs that encodes the path of a packet through the network.
In the SR MPLS implementation, a SID is a label value and a SID list is a stack of labels. For a
packet steered into the SR Policy, this stack of labels (SID list) is imposed on the packet’s header.

A SID list is represented as an ordered list <S1, S2, ..., Sn>, where S1 is the first SID, which is the top
label of the SR MPLS label stack, and Sn is the last SID, the bottom label for SR MPLS.

In the SR-MPLS implementation, there are two ways to configure a SID as part of a SID list: by
directly specifying its MPLS label value (e.g., 16004) or by providing a segment descriptor (e.g., IP
address 1.1.1.4) that the headend node translates into the corresponding label value. There is an
important difference between the two options.
A SID expressed as an MPLS label value is checked only if it is in the first position of the SID list. It
is checked (validated) to find the outgoing interface and next-hop.

A SID expressed as a segment descriptor is always checked for validity. The headend must indeed
translate that segment descriptor into an MPLS label (i.e., what is imposed on the packets are label
stacks); this act of translating the segment descriptor to the MPLS label is the validity check. If it
works, then the SID is valid. If the translation fails (the segment descriptor IP address is not seen in
the SR-TE DB or the IP address is seen but without a SID) then the SID is invalid. A SID bound to a
prefix of a failed node or a failed adjacency is invalid. That SID is not in the SR-TE DB and the
headend cannot translate the segment descriptor into an MPLS label value.
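
The difference between the two forms can be summarized in a small resolution routine: a label value
is taken as-is, while a segment descriptor must be resolvable against the SR-TE DB or the whole SID
list is declared invalid. This Python sketch is illustrative only, not the actual validation code
(in particular, the resolution of the first SID to an outgoing interface and next-hop is not modeled).

# Hypothetical SR-TE DB view: segment descriptor (IP address) -> MPLS label.
srte_db_sids = {"1.1.1.3": 16003, "1.1.1.4": 16004, "1.1.1.8": 16008}

def resolve_sid_list(entries):
    """Each entry is either an int (MPLS label, taken as-is) or a string
    (segment descriptor that must resolve against the SR-TE DB)."""
    labels = []
    for entry in entries:
        if isinstance(entry, int):
            labels.append(entry)                 # label value: no validity check here
        elif entry in srte_db_sids:
            labels.append(srte_db_sids[entry])   # descriptor resolved to its label
        else:
            return None                          # unresolvable descriptor: invalid SID list
    return labels

print(resolve_sid_list([16008, "1.1.1.4"]))   # [16008, 16004]
print(resolve_sid_list(["1.1.1.99"]))         # None: unknown descriptor (hypothetical address)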

Most often an explicit path will be configured with all SIDs expressed as MPLS label values. This
would be the case when an external controller has done all the computations and is actively (stateful)
monitoring the policy. In such case, the controller is in charge and it does not want the headend to
second guess its operation.

There is a second reason to express an explicit SID as an MPLS label value: when the SID belongs to
a remote domain. In that case, the headend has no way to validate the SID (it does not have the link-
state topology of the remote domain), hence an MPLS label value is used to avoid the validity check.

“I can't emphasize enough how much having complete SR-based MPLS connectivity set up "for free" by the IGP allows
you to concentrate on the real work of traffic engineering. For example, you can craft an explicit path to navigate traffic
through a critically-congested region, then use a node SID at the bottom of the stack to get the rest of the way to egress
via IGP shortest path. And you never need to worry about LSP signaling. ”

— Paul Mattes

2.2.2 Candidate Paths


A candidate path of an SR Policy represents a specific way to transport traffic from the headend to the
endpoint of the corresponding SR Policy.

A dynamic candidate path is expressed as an optimization objective and a set of constraints. Using
the native SR algorithms and information in its local SR-TE DB, the headend computes a solution to
the optimization problem4, and provides the solution (the optimal path) as a SID list. If the headend
does not have the required information in its SR-TE DB to compute the path, the headend may
delegate the computation to a controller or Path Computation Element (PCE). Whenever the network
state changes, the path is automatically recomputed. Read chapter 4, "Dynamic Candidate Path" for
more information about dynamic paths.

An explicit candidate path is expressed as a SID list. The SID list can be provided to the headend in
various ways, most commonly by configuration or signaled by a controller. Despite its name, an
explicit path is likely the result of a dynamic computation. The difference with the dynamic path is
that the headend is not involved in the computation of the explicit path’s SID list; the headend only
receives the result as an explicit SID list. Read chapter 3, "Explicit Candidate Path" for more
information about explicit paths.
Figure 2-9: Dynamic and explicit candidate paths

Instead of specifying a single candidate path for an SR Policy, an operator may want to specify
multiple possible candidate paths with an order of preference.

Refer to Figure 2‑3 of the example in section 2.1.2, where the operator specified two candidate paths
for the SR Policy from Node1 to Node4. At any point in time, one candidate path is selected and
installed in the forwarding table. This is the valid candidate path with the highest preference. Upon
invalidation of the selected path, the next highest preference valid path is selected as the new
selected path.

Each candidate path has a preference. The default value is 100. The higher the preference, the more
preferred the path.
The active candidate path of an SR Policy is the valid path with highest preference.

A candidate path is valid if usable. The validation process of explicit paths is detailed in chapter 3,
"Explicit Candidate Path". Briefly, the first SID must be resolved on a valid outgoing interface and
next-hop and the SR-TE DB must be able to resolve all the other SIDs that are expressed as segment
descriptors into MPLS labels. To validate a dynamic path, the path is recomputed.

A headend re-executes the active path selection procedure whenever it learns about a new candidate
path of an SR Policy, when the active path is deleted, or when an existing candidate path is modified
or its validity changes. In other words, the active path of an SR Policy, at any time, is the valid
candidate path with the highest preference value.

A headend may be informed about candidate paths for a given SR Policy by various means, including
local configuration, NETCONF, PCEP or BGP (we will discuss these different mechanisms
throughout the book). The source of the candidate path does not influence the selection of the SR
Policy’s active path; the SR Policy’s active path is selected based on its validity and its preference. A
later section will explain how a single active path is selected when an SR Policy has multiple valid
candidate paths with the same preference.
Each candidate path can be learned via a different way
“A headend can learn different candidate paths of an SR Policy via different means: some via local configuration, others
via PCEP or BGP SR-TE.

The SR-TE solution is designed to be modular and hence the SR Policy model allows for mixing candidate paths from
various sources.

In practice, the reader can assume that one single source is used in a given deployment model.

In a distributed control-plane model, the candidate path (or more likely the ODN dynamic template it derives from) is likely
learned by the headend via local configuration (itself automated by a solution such as Cisco NSO).

In a centralized control-plane model, the (explicit) candidate path is likely learned by the headend from the controller via
BGP SR-TE or PCEP.

In some specific and less frequent deployments, the operator may mix different sources: some base
candidate path is learned from the local configuration while some specific candidate paths are
learned via BGP SR-TE or PCEP. Likely these latter are more preferred when present.

As such we defined the SR Policy concept such that each of its candidate paths can be learned in a different way:
configuration, NETCONF, PCEP and BGP SR-TE.

In practice, we advise you to focus on the most likely use-case: all of the candidate paths of a policy are learned via the
same mechanism (e.g., configuration). ”

— Clarence Filsfils
2.3 Binding Segment
The Binding Segment (BSID) is fundamental to SR-TE.

A Binding-SID is bound to an SR Policy. It provides the key into the bound SR Policy. On a given
headend, a BSID is bound to a single SR Policy at any point in time. The function of a BSID (i.e., the
instruction it represents) is to steer labeled packets into its associated SR Policy.

In the SR MPLS implementation, a BSID B bound to an SR Policy P at headend H is a local label of
H. Only the headend H has a state for this SR Policy and hence has a forwarding entry for B.

When a remote node R wants to steer packets into the SR Policy P of headend node H, the remote
node R pushes a label stack with the prefix SID of H followed by the BSID of P. The prefix SID of H
may be replaced by any SID list that eventually reaches H. If R is attached to H, then R can simply
push the BSID label to steer the packets into P.

The BSID bound to an SR Policy P can be either explicitly provided by the operator or controller, or
dynamically allocated by the headend. We will detail the allocation process in a later chapter.

The BSID is an attribute of a candidate path and the BSID of an SR Policy is the BSID of the active
candidate path.

The BSID of an SR Policy could change upon change of active path


“This is a consequence of the ability to learn any candidate path independently. As a result, each candidate path may be
learned with a BSID.

In practice, we advise you to focus on the most likely use-case: all of the candidate paths of a policy have the same BSID.
Hence, upon active path change, the BSID of the policy does not change. ”

— Clarence Filsfils

Any packet that arrives at the headend with a BSID on top of its label stack is steered into the SR
Policy associated with the BSID. The headend pops the BSID label, pushes the label stack (SID list)
associated with the SR Policy on the packet’s header, and forwards the packet according to the first
label of the SID list.
For an SR Policy that has a valid path, the headend node installs the following entry in its forwarding
table for an SR Policy with SID list <S1, S2> and BSID B1:

Incoming label: B1

Label operation: Pop, Push <S1, S2>

Egress: egress information of S1

Is the label value of a BSID at time T a reliable identifier of an SR Policy at any time?
“Theoretically, no. As we highlighted previously, in a corner case, each different candidate path may have a different
BSID. Hence, upon valid path change, the BSID of an SR Policy may change and hence its label value.

Theoretically, the unique time-independent identification of an SR Policy is the tuple (headend, color, endpoint).

However, in a normal deployment, all the candidate paths of the same policy are defined with the same BSID and hence
there is a single and stable BSID per policy (whatever the selected candidate path) and hence the BSID is in practice a
good identifier for a policy. ”

— Clarence Filsfils

In the topology of Figure 2‑10, Node1 is the headend of an SR Policy with SID list <16003, 24034>.
Node1 allocated a BSID 40104 for this SR Policy.
Figure 2-10: Binding-SID

When Node1 instantiated the SR Policy, it installed the forwarding entry:

Incoming label (Binding-SID): 40104

Outgoing label operation: pop, push <16003, 24034>

Egress interface: egress interface of label 16003: to Node2

A remote node, such as Node10 in Figure 2‑10, can steer a packet into Node1’s SR Policy by
including the BSID of the SR Policy in the label stack of the packet. Node1’s SR Policy (green,
1.1.1.4) is then used as a Transit Policy in the end-to-end path from Node10 to Node4.

The label stack of a packet going from Node10 to Node4 is shown in Figure 2‑10. In the example,
Node10 imposes the label stack <16001, 40104> on the packet and sends it towards Node9. Label
16001, the Prefix-SID of Node1, brings the packet to Node1. BSID 40104 steers the traffic into the
SR Policy. Label 40104 is popped and labels <16003, 24034> are pushed.
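
It may help to trace these label operations in a few lines of Python. The stacks and labels are those
of Figure 2‑10; the code is a paper exercise only, not a forwarding implementation.

# Node1's forwarding entry for its Binding-SID, as installed in the example.
bsid_entries = {40104: [16003, 24034]}   # BSID -> SID list of the bound SR Policy

def node1_processes(label_stack):
    """If the top label is a known BSID: pop it and push the policy's SID list."""
    top, rest = label_stack[0], label_stack[1:]
    if top in bsid_entries:
        return bsid_entries[top] + rest
    return label_stack

# Node10 imposes <16001, 40104>. Label 16001 (Prefix-SID of Node1) transports the
# packet to Node1; Node1 then processes the BSID 40104 found on top of the stack.
print(node1_processes([40104]))   # [16003, 24034]: the packet now follows the SR Policy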

The headend of a Transit Policy steers the packet into this Transit Policy without re-classifying the
packet. The packet classification was done by a node located remotely from the headend node,
Node10 in this example. Node10 decided to steer a packet into this specific end-to-end SR Policy to
Node4. Node10 can classify this packet based on its destination address only, or it could also take
other elements into account (source address, DSCP, …). Node10 encoded the result of the
classification as a BSID in the segment list imposed on the packet. Node1 simply forwards the packet
according to this packet’s label stack.

The BSID is further discussed in chapter 9, "Binding-SID and SRLB", and the role of the BSID in
traffic steering is also covered in chapter 10, "Further Details on Automated Steering".
2.4 SR Policy Configuration
This section shows the configurations of the SR Policies that were presented in the examples of the
introduction section of this chapter.

An Explicit Candidate Path of an SR Policy

For ease of reference the network topology is repeated here in Figure 2‑11. This illustration shows
two SR Policies on headend Node1. Both SR Policies have the same endpoint Node4 and each has an
explicit candidate path.

Figure 2-11: Two SR Policies from Node1 to Node4

The configuration of headend Node1 in Example 2‑1 specifies two SR Policies with the names
POLICY1 and POLICY2. SR Policies are configured under the segment-routing traffic-eng
configuration section. The name of an SR Policy is user-defined and is unique on the headend node.
Example 2-1: SR Policy configuration Node1

segment-routing
traffic-eng
policy POLICY1
!! (blue, Node4)
color 20 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
policy POLICY2
!! (orange, Node4)
color 40 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST2
!
segment-list name SIDLIST1
index 10 mpls label 16008 !! Prefix-SID Node8
index 20 mpls label 24085 !! Adj-SID link 8->5
index 30 mpls label 16004 !! Prefix-SID Node4
!
segment-list name SIDLIST2
index 10 mpls label 16003 !! Prefix-SID Node3
index 20 mpls label 24034 !! Adj-SID link 3->4

SR Policy POLICY1 is configured with a color value 20 and endpoint address 1.1.1.4. The color
value 20 is chosen by the operator and represents a specific service level or specific intent for the
policy. The endpoint 1.1.1.4 is the loopback prefix advertised by Node4, as shown in Figure 2‑11.

One candidate path is specified with preference 100: an explicit path with a segment list named
SIDLIST1. SIDLIST1 is a locally significant user-defined name that identifies the segment list.

The segment list SIDLIST1 specifies the SIDs in increasing index order, which maps to the top-to-bottom order of the SR-MPLS label stack.
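
As a trivial Python illustration (not router code), sorting the configured entries of SIDLIST1 by index yields the imposed label stack:

# Segment-list entries as configured: (index, MPLS label value).
sidlist1 = [(10, 16008), (20, 24085), (30, 16004)]
label_stack = [label for _, label in sorted(sidlist1)]   # sort by index
print(label_stack)   # [16008, 24085, 16004] -> 16008 is the top label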

The first SID (index 10) is the Prefix-SID 16008 of Node8’s loopback prefix 1.1.1.8/32. The second
SID (index 20) is the Adjacency-SID 24085 of Node8 for the adjacency to Node5. The third and last
SID (index 30) is the Prefix-SID 16004 of Node4’s loopback prefix 1.1.1.4/32. The resulting
candidate path (1→7→6→8→5→4) is shown in Figure 2‑11.

A second SR Policy POLICY2 is configured with a color 40 and endpoint address 1.1.1.4.

One candidate path is specified with preference 100: an explicit path with a segment list named
SIDLIST2.
The first SID (index 10) is the Prefix-SID 16003 of Node3’s loopback prefix 1.1.1.3/32. The second
and last SID (index 20) is the Adjacency-SID 24034 of Node3 for the adjacency to Node4. The
resulting candidate path (1→2→3→4) is shown in Figure 2‑11.

Path Validation and Selection

In an SR Policy with multiple candidate paths, each one is configured with a different preference value. The higher the preference value, the more preferred the candidate path.

SR Policy (blue, Node4) of Node1 in Figure 2‑12 has two candidate paths, one with preference 50
and one with preference 100. The preference 100 candidate path is the current active path since it has
the highest preference.

Figure 2-12: SR Policy (blue, Node4) with two candidate paths

Example 2‑2 shows the SR Policy configuration on headend Node1. This SR Policy has two
candidate paths. The preference 100 path is an explicit path with segment list SIDLIST1. The second
candidate path has preference 50 and is an explicit path with segment list SIDLIST2.

Both candidate paths are illustrated in Figure 2‑12.


Example 2-2: SR Policy configuration Node1 – explicit path

segment-routing
traffic-eng
policy POLICY1
!! (blue, Node4)
color 20 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
preference 50
explicit segment-list SIDLIST2
!
segment-list name SIDLIST1
index 10 mpls label 16008 !! Prefix-SID Node8
index 20 mpls label 24085 !! Adj-SID link 8->5
index 30 mpls label 16004 !! Prefix-SID Node4
!
segment-list name SIDLIST2
index 10 mpls label 16003 !! Prefix-SID Node3
index 20 mpls label 24034 !! Adj-SID link 3->4

A Low-Delay Dynamic Candidate Path

Another SR Policy, named POLICY3, is now configured on Node1 in Figure 2‑13. This SR Policy
has a dynamic path, where Node1 dynamically computes the low-delay path to Node4. In the
example, all nodes in the topology measure the delay of their links and distribute these delay metrics
using the IGP. These measured link-delay metrics are displayed next to the links in the illustration.
How these delay metrics are measured and distributed is discussed in chapter 15, "Performance
Monitoring – Link Delay". The IGP on Node1 receives all these link-delay metrics and stores them in
its database. From these delay metrics, you can deduce that the low-delay path from Node1 to Node4
is 1→2→3→4, as shown in Figure 2‑13. The cumulative delay of this path is 12+11+7 = 30.
Figure 2-13: Network topology with measured link-delay values

The configuration of this SR Policy on Node1 is shown in Example 2‑3.

A color value 30 is assigned to this SR Policy with IPv4 endpoint address 1.1.1.4. This color value is
chosen by the operator to indicate the “low-delay” SLA.

Example 2-3: SR Policy configuration Node1 – dynamic path

segment-routing
traffic-eng
policy POLICY3
!! (green, Node4)
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay

One candidate path is specified for this SR Policy: a dynamic path minimizing the accumulated link-
delay to the endpoint. The link-delay metric is the optimization objective of this dynamic path. The
preference of the candidate path is 100.

A Dynamic Candidate Path Avoiding Specific Links

The next SR Policy on headend Node1 also has a dynamic candidate path. This SR Policy path
provides the IGP shortest path avoiding the red links and is illustrated in Figure 2‑14.

To compute the path, Node1 prunes the links that do not meet the constraint “avoiding red links” from
the network graph in its SR-TE DB and computes the IGP metric shortest path to Node4 on the pruned
topology. The resulting path is 1→2→3→6→5→4. Node1 encodes this path in the optimal SID list
<16003, 16004>. The first entry 16003 is the Prefix-SID of Node3, transporting the packet via the
IGP shortest path to Node3; and the second entry 16004 is the Prefix-SID of Node4, transporting the
packet via the IGP shortest path to Node4.

Figure 2-14: Network topology with link affinity

The configuration of this SR Policy on Node1 is shown in Example 2‑4.


A color value 50 is assigned to this SR Policy with IPv4 endpoint address 1.1.1.4. This color value is
chosen by the operator to indicate the SLA of this path.

Example 2-4: SR Policy configuration Node1 – dynamic path

segment-routing
traffic-eng
policy POLICY4
!! (purple, Node4)
color 50 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type igp
!
constraints
affinity
exclude-any
name red
!
affinity-map
name red bit-position 3

One candidate path is specified for this SR Policy. The preference of the candidate path is 100.

The optimization objective of this path is to minimize the accumulated IGP metric to the endpoint. The path must not traverse any “red” links, as specified by the constraints of this dynamic path. exclude-any name red means to avoid links that have the affinity link color “red”.
2.5 Summary
An SR Policy essentially consists of an ordered list of SIDs. In the SR-MPLS instantiation, this SID
list is represented as a stack of MPLS labels that is imposed on packets that are steered into the SR
Policy.

An SR Policy is uniquely identified by the tuple (headend node, color, endpoint). When the headend
is known, an SR Policy is identified by (color, endpoint).

The color is used to distinguish multiple SR Policies between the same headend node and endpoint. It is typically an identifier of the SLA that the SR Policy provides, e.g., color “green” = low-delay. The color is key to automating traffic steering (Automated Steering (AS)) and the on-demand instantiation of SR Policies (ODN).

An SR Policy has one or multiple candidate paths.

A candidate path can be dynamically computed (dynamic path) or explicitly specified (explicit path).

For a dynamic path, the SID list is computed based on an optimization objective and a set of
constraints provided by the headend.

For an explicit path, the SID list is explicitly specified by the operator or by an application. The
headend is not involved in the specification nor the computation of the explicit path’s SID list. The
headend only needs to validate the explicit SID list.

A candidate path can be instantiated from different sources: configured via CLI/NETCONF, or signaled via PCEP or BGP SR-TE. An SR Policy can contain candidate paths from different sources.

The active candidate path of an SR Policy is the valid path with highest preference.

Packets steered into an SR Policy follow the SID list associated with the active path of the SR Policy.

An SR Policy is bound to a Binding-SID. The BSID is a key into the SR Policy. In the SR-MPLS
instantiation, a BSID is a local label at the headend of the policy. Traffic received by the headend
with the BSID as top label is steered into the policy. Specifically, the BSID is popped and the active
SID list is pushed on the packet.
2.6 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

[RFC7471] "OSPF Traffic Engineering (TE) Metric Extensions", Spencer Giacalone, David Ward,
John Drake, Alia Atlas, Stefano Previdi, RFC7471, March 2015

[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer
Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016

1. By default, Node1 does not invalidate the segment list if it is expressed using label values. More
information about controlling validation of a segment list can be found in chapter 3, "Explicit
Candidate Path".↩

2. In this book, we use the term “delay” even in cases where the term “latency” is commonly used.
The reason is that the metric that expresses the link propagation delay is named “link delay”
(RFC 7810 and RFC 7471). A low-delay path is then a path with minimal cumulative link-delay
metric.↩

3. Chapter 7, "Flexible Algorithm" describes how operators can extend their segment “toolbox” by
defining their own Prefix-SIDs.↩

4. Computing a path is really solving an optimization problem. Given the information in the SR-TE
DB, find the optimal path (minimizing the specified metric) while adhering to the specified
constraints.↩
3 Explicit Candidate Path
What we will learn in this chapter:

An explicit candidate path is formally defined as a weighted set of SID lists

An explicit candidate path is typically a single SID list

An explicit candidate path can be locally configured on the headend node

An explicit candidate path can be instantiated on the headend node by a controller through a
signaling protocol

A SID can be expressed as a segment descriptor or an MPLS label value

Validation of an explicit SID list and hence validation of an explicit candidate path

Pros and cons of using a segment descriptor or an MPLS label value

Two use-cases: dual-plane disjoint paths and TDM migrations

An explicit candidate path of an SR Policy is directly associated with a list of SIDs. These SIDs can
be expressed either with their MPLS label values or using abstract segment descriptors that are
deterministically resolved into label values by the headend. The latter provides a level of resiliency
against changes in the MPLS label allocations and allows the headend to validate the SID list before
configuring its forwarding table. On the other hand, the resolution procedure requires that the headend
knows the label value for each segment descriptor in the SID list. Depending on the reach of the SR Policy and the knowledge of the entity initiating it, one type of expression would thus be preferred over the other.

The instantiation of explicit candidate paths using both types of SID expression is described in this
chapter, with examples of scenarios in which each one is used. Two practical use-cases are also
detailed to further illustrate the role of explicit candidate paths in real-world deployments.
3.1 Introduction
Formally, an explicit candidate path is defined as a weighted set of SID lists. For example, an explicit
candidate path could be defined as SID list <S1, S2, S3> with weight W1 and SID list <S4, S5, S6>
with weight W2. If that candidate path is selected as the active path of the policy, then the two SID
lists are installed in the dataplane. The traffic flows steered into the SR Policy are load-balanced
between the two SID lists with a ratio of W1/(W1+W2) on the first list and W2/(W1+W2) on the
second.
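
As a rough illustration of this weighted load-balancing (a sketch, not the actual forwarding implementation), the following Python fragment maps a per-flow hash to one of the SID lists in proportion to the configured weights; the SID lists, weights, and flow tuple are hypothetical values:

# Two SID lists with weights W1=3 and W2=1; flows are distributed 3/4 vs 1/4.
sid_lists = [
    (["S1", "S2", "S3"], 3),   # weight W1
    (["S4", "S5", "S6"], 1),   # weight W2
]

def select_sid_list(flow_hash):
    total = sum(w for _, w in sid_lists)   # W1 + W2
    bucket = flow_hash % total             # stable per-flow bucket
    for sids, w in sid_lists:
        if bucket < w:
            return sids                    # this list carries w/total of the flows
        bucket -= w

print(select_sid_list(hash(("10.0.0.1", "10.0.0.2", 6, 1234, 80))))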

In practice, most of the use-cases define an explicit candidate path as a single SID list. Therefore, this
chapter focuses on the single-list case, while more details on the multiple SID list generalization are
provided in chapter 16, "SR-TE Operations".

The key property of an explicit candidate path is that the SID list is provided to the headend. The
headend does not need to compute it and is oblivious of the operator’s intent for this candidate path.

A segment in an explicit SID list can be expressed either as an SR-MPLS label or as a segment
descriptor.

Intuitively, the segment descriptor of a Prefix SID is the IP address of the related prefix and the
descriptor of an Adj-SID is the IP address of the related adjacency.

An explicit candidate path can be initiated on a headend node by configuration, using the Command
Line Interface (CLI), the classic XML API, or the NETCONF protocol; or by a signaling protocol,
such as PCEP or BGP.

Although it is possible for an operator to manually configure this type of path on the headend node,
explicit paths are typically initiated by a controller that then also takes the responsibility to monitor
and maintain them. These are referred to as controller-initiated paths.

The responsibility of the headend for an explicit candidate path is usually limited to:

1. Translating any SID expressed as a segment descriptor into an SR-MPLS label

2. Validating the outgoing interface and next-hop


3.2 SR-MPLS Labels
The SIDs in the SID list can be specified as MPLS labels, with any value in the MPLS label range.

Each SID in an explicit path configuration is associated with an index and the SID list is ordered by
increasing SID indexes. In this book the indexes are numbered in increments of 10, but this is not a
requirement.

Example 3‑1 shows the required configuration to specify a SID list, named SIDLIST1, with MPLS
label values. The first entry (index 10) is the label 16008 that is the Prefix-SID associated with
prefix 1.1.1.8/32 on Node8; the second entry (index 20) is the label 15085 that is the Adj-SID of
Node8 for its adjacency to Node5; and the third entry (index 30) is the label 16004 that is the
Prefix-SID associated with Node4.

Example 3-1: SR Policy with explicit path on Node1 – SID list using MPLS labels

segment-routing
traffic-eng
policy POLICY1
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
segment-list name SIDLIST1
index 10 mpls label 16008 !! Prefix-SID Node8
index 20 mpls label 15085 !! Adj-SID link 8->5
index 30 mpls label 16004 !! Prefix-SID Node4

The Adj-SID 15085 is a manual Adj-SID configured on Node8 for its adjacency to Node5. Label
15085 is allocated from Node8’s SRLB. Since this Adj-SID is configured, it is persistent across
reloads.

The SID list SIDLIST1 is used as part of an explicit candidate path for the SR Policy POLICY1,
illustrated on Figure 3‑1. The headend node Node1 can directly use the MPLS labels in SIDLIST1 to
program the forwarding table entry for POLICY1. Node1 only needs to resolve the first SID 16008 to
obtain the outgoing interface to be associated with the SR Policy forwarding entry.
Figure 3-1: Explicit path example

The output in Example 3‑2 shows the SR Policy POLICY1 instantiated on Node1 with the
configuration in Example 3‑1. Node1 mapped the first SID label value 16008 to the Prefix-SID (line
16).

Example 3-2: SR Policy using explicit segment list with label values

1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy


2
3 SR-TE policy database
4 ---------------------
5
6 Color: 10, End-point: 1.1.1.4
7 Name: srte_c_10_ep_1.1.1.4
8 Status:
9 Admin: up Operational: up for 00:00:15 (since Jan 24 10:51:03.477)
10 Candidate-paths:
11 Preference: 100 (configuration) (active)
12 Name: POLICY1
13 Requested BSID: dynamic
14 Explicit: segment-list SIDLIST1 (valid)
15 Weight: 1
16 16008 [Prefix-SID, 1.1.1.8]
17 15085
18 16004
19 Attributes:
20 Binding SID: 40014
21 Forward Class: 0
22 Steering BGP disabled: no
23 IPv6 caps enable: yes
As you can see in Example 3‑2, IOS XR uses the auto-generated name srte_c_10_ep_1.1.1.4 to
identify the SR Policy. This name is composed of the SR Policy’s color 10 and endpoint 1.1.1.4. The
configured name POLICY1 is used as the name of the candidate path.

The headend Node1 dynamically allocates the BSID 40014 for this SR Policy, as shown in
Example 3‑2 (line 20) and installs the BSID forwarding entry for this SR Policy, as shown in
Example 3‑3. The BSID forwarding entry instructs Node1, for any incoming packet that has the BSID
40014 as top label, to pop the BSID label and steer the packet into the SR Policy POLICY1, which
causes the SR Policy’s SID list to be imposed on the packet.

Example 3-3: SR Policy BSID forwarding entry on headend Node1

RP/0/0/CPU0:xrvr-1#show mpls forwarding labels 40014


Local Outgoing Prefix Outgoing Next Hop Bytes
Label Label or ID Interface Switched
------ ----------- --------- -------------------- --------------- --------
40014 Pop No ID srte_c_10_ep_1.1.1.4 point2point 0

The forwarding entry of the SR Policy is shown in Example 3‑4. The imposed label stack is
(top→bottom): (16008, 15085, 16004). The outgoing interface (Gi0/0/0/1) and next hop (99.1.7.7)
are derived from the first SID in the list (16008) and point towards Node7.

Example 3-4: SR Policy imposed segment list on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail


Color Endpoint Segment Outgoing Outgoing Next Hop Bytes
List Label Interface Switched
----- ---------- ------------ -------- ------------ ------------ ---------
10 1.1.1.4 SIDLIST1 16008 Gi0/0/0/1 99.1.7.7 0
Label Stack (Top -> Bottom): { 16008, 15085, 16004 }
Path-id: 1, Weight: 0
Packets Switched: 0
Packets/Bytes Switched: 0/0
(!): FRR pure backup
3.3 Segment Descriptors
A segment descriptor identifies a SID in such a way that the headend node can retrieve the
corresponding MPLS label value through a deterministic SID resolution procedure. While multiple
types of descriptors exist (see draft-ietf-spring-segment-routing-policy), this chapter focuses on the
two descriptor types most commonly used:

IPv4 address identifying its corresponding Prefix-SID

IPv4 address identifying a numbered point-to-point link1 and its corresponding Adj-SID

The segment descriptor of a Prefix-SID is its associated prefix. Since a Prefix-SID is always
associated with a prefix, the resolution of a descriptor prefix to the corresponding SID label value is
straightforward. The prefix-to-SID-label mapping is advertised via IGP or BGP-LS and stored in
each node’s SR-TE DB. For example, line 11 of the SR-TE DB snapshot in Example 3‑5 indicates
that the prefix 1.1.1.8(/32) is associated with the label 16008.

When several Prefix-SID algorithms are available in a domain, the same prefix can be associated
with multiple Prefix-SIDs, each bound to a different algorithm. In that case, the prefix alone is not
sufficient to uniquely identify a Prefix-SID and should be completed by an algorithm identifier. If the
algorithm is not specified, then the headend selects by default the strict-SPF (algorithm 1) Prefix-SID
if available, or the regular SPF (algorithm 0) Prefix-SID otherwise. See chapter 7, "Flexible
Algorithm" for details of the different algorithms.

Similarly, an IP address configured on the interface to a point-to-point adjacency can serve as the
segment descriptor of an Adj-SID. Such an IP address identifies a specific L3 link in the network. For
example, the address 99.5.8.8 is configured on Node8 for its link with Node5 and thus identifies the
link between Node5 and Node8. The headend can then derive the direction in which that link should
be traversed from the preceding SID in the SID list. In this example, if the preceding SID in the list
ends at Node5, then the link should be traversed from Node5 to Node8. Conversely, if the preceding
SID were to end at Node8, then that same link should have been traversed from Node8 to Node5.
This headend intelligence allows the segment descriptor to be the IP address of either end of the link;
it does not specifically need to be a local, or remote, interface address. The link and direction
together identify a specific adjacency that the headend can look up in its SR-TE DB to determine the
Adj-SID label value.

A node may advertise several Adj-SIDs for a given adjacency: typically, one protected2 Adj-SID and one unprotected Adj-SID, as illustrated on lines 22 and 33 of Example 3‑5. In this situation, the headend selects a protected Adj-SID by default.

Example 3-5: Translation of segment descriptor into MPLS label

1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng ipv4 topology 1.1.1.8


2
3 SR-TE topology database
4 -----------------------
5
6 Node 8
7 TE router ID: 1.1.1.8
8 Host name: xrvr-8
9 ISIS system ID: 0000.0000.0008 level-2
10 Prefix SID:
11 Prefix 1.1.1.8, label 16008 (regular)
12
13 Link[0]: local address 99.5.8.8, remote address 99.5.8.5
14 Local node:
15 ISIS system ID: 0000.0000.0008 level-2
16 Remote node:
17 TE router ID: 1.1.1.5
18 Host name: xrvr-5
19 ISIS system ID: 0000.0000.0005 level-2
20 Metric: IGP 10, TE 10, Delay 6
21 Admin-groups: 0x00000000
22 Adj SID: 24085 (protected) 24185 (unprotected)
23
24 Link[1]: local address 99.6.8.8, remote address 99.6.8.6
25 Local node:
26 ISIS system ID: 0000.0000.0008 level-2
27 Remote node:
28 TE router ID: 1.1.1.6
29 Host name: xrvr-6
30 ISIS system ID: 0000.0000.0006 level-2
31 Metric: IGP 10, TE 10, Delay 10
32 Admin-groups: 0x00000000
33 Adj SID: 24086 (protected) 24186 (unprotected)
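
To make the resolution procedure concrete, the following Python sketch resolves segment descriptors against a toy SR-TE DB modeled after the snapshot above. It is illustrative only, not the router implementation; the data structures, the prefix-to-node table and the unprotected Adj-SID value 24134 are assumptions made for this example:

# Toy SR-TE DB: prefix-to-label map, prefix-to-node map, and per-link Adj-SIDs.
srte_db = {
    "prefix_sids": {"1.1.1.8": 16008, "1.1.1.3": 16003, "1.1.1.4": 16004},
    "prefix_owner": {"1.1.1.8": "Node8", "1.1.1.3": "Node3", "1.1.1.4": "Node4"},
    "links": {
        ("Node8", "Node5"): {"local": "99.5.8.8", "remote": "99.5.8.5",
                             "protected": 24085, "unprotected": 24185},
        ("Node3", "Node4"): {"local": "99.3.4.3", "remote": "99.3.4.4",
                             "protected": 24034, "unprotected": 24134},  # 24134 assumed
    },
}

def resolve(descriptors):
    labels, current_node = [], None
    for d in descriptors:
        if d in srte_db["prefix_sids"]:                   # Prefix-SID descriptor
            labels.append(srte_db["prefix_sids"][d])
            current_node = srte_db["prefix_owner"][d]
        else:                                             # Adj-SID descriptor (link address)
            for (node, nbr), link in srte_db["links"].items():
                # the link direction is derived from where the preceding SID ends
                if d in (link["local"], link["remote"]) and node == current_node:
                    labels.append(link["protected"])      # protected Adj-SID by default
                    current_node = nbr
                    break
            else:
                raise ValueError(d + " can not be resolved to a SID")
    return labels

print(resolve(["1.1.1.3", "99.3.4.3"]))   # [16003, 24034], as in Example 3-6 below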

Example 3‑6 and Figure 3‑2 show how these segment descriptors are used as part of an explicit SID list
configuration. The first entry (index 10) uses the IPv4 address 1.1.1.3 as the segment descriptor for
the Prefix-SID associated with prefix 1.1.1.3/32 of Node3. The second entry (index 20) uses
address 99.3.4.3 as the segment descriptor for the Adj-SID of Node3 for its adjacency to Node4. The
address 99.3.4.3 is configured on Node3’s interface for the point-to-point link between Node3 and
Node4, and the preceding segment ending at Node3 indicates that the link should be traversed from
Node3 to Node4.
Example 3-6: SIDs specified as segment descriptors

segment-routing
traffic-eng
policy POLICY2
color 20 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST2
!
segment-list name SIDLIST2
index 10 address ipv4 1.1.1.3 !! Prefix-SID Node3
index 20 address ipv4 99.3.4.3 !! Adj-SID link 3->4

Figure 3-2: SIDs specified as segment descriptors example

The SID list SIDLIST2 is configured as an explicit candidate path of SR Policy POLICY2. The output
in Example 3‑7 shows that the headend Node1 resolved the segment descriptors in SIDLIST2 into the
corresponding label values and used them to program the forwarding table entry for POLICY2. The
address 1.1.1.3 is resolved to the Prefix-SID label 16003 and 99.3.4.3 to the protected Adj-SID label
24034.
Example 3-7: SR Policy using explicit segment list with IP addresses

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 20, End-point: 1.1.1.4


Name: srte_c_20_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:00:20 (since May 4 09:25:54.847)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: POLICY2
Requested BSID: dynamic
Explicit: segment-list SIDLIST2 (valid)
Weight: 1
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Attributes:
Binding SID: 40114
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

If the Adj-SID label on Node3 for the adjacency to Node4 changes value, for example following a
reload of Node3, headend Node1 automatically updates the SID list with the new value of the Adj-
SID label. No configuration change is required on Node1.

The BSID and SR Policy forwarding entries are equivalent to those of the previous example.

In the two examples above, the explicit SID list was either configured with only MPLS label values
or with only segment descriptors. Segment descriptors and MPLS label values can also be combined
in an explicit SID list, but with one limitation: once an entry in the SID list is specified as an MPLS
label value, all subsequent entries must also be specified as MPLS label values. In other words, a
SID specified as segment descriptor cannot follow a SID specified as MPLS label value.
3.4 Path Validation
An explicit candidate path is valid if its SID list is valid. More generally, when an explicit candidate
path is expressed as a weighted set of SID lists, the explicit candidate path is valid if it has at least
one valid SID list.

A SID list is valid if all the following conditions are satisfied:

The SID list contains at least one SID

The SID list has a weight larger than 0

The headend can resolve all SIDs expressed as segment descriptors into MPLS labels

The headend can resolve the first SID into one or more outgoing interfaces and next-hops

The first condition is obvious: an empty SID list is invalid.

As mentioned in chapter 2, "SR Policy", each SID list has an associated weight value that controls the
relative amount of traffic steered over this particular SID list. The default weight is 1 for SID lists
defined as part of an explicit candidate path. If the weight is zero, then the SID list is considered
invalid.

The headend node uses the information in its local SR-TE DB to resolve each SID specified as
segment descriptor to its MPLS label value. If a segment descriptor is not present in the headend
node’s SR-TE DB, then the headend node cannot retrieve the corresponding label value. Therefore, a
SID list containing a segment descriptor that cannot be resolved is considered invalid.

Finally, the headend node must be able to find out where to send the packet after imposing the SID
list. The headend node uses its local information to determine a set of outgoing interfaces and next-
hops from the first SID in the list. If it is unable to resolve the first SID of a SID list into at least one
outgoing interface and next hop, then that SID list is invalid.
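
The following Python sketch summarizes this validation logic. It is a simplification, not the actual implementation; the descriptors and next_hops dictionaries are hypothetical stand-ins for the headend's SR-TE DB lookups:

# Simplified check of the four validation conditions for an explicit SID list.
def sid_list_is_valid(sid_list, weight, descriptors, next_hops):
    if not sid_list:                          # 1. at least one SID
        return False
    if weight <= 0:                           # 2. weight larger than 0
        return False
    labels = []
    for kind, value in sid_list:
        if kind == "descriptor":              # 3. every descriptor must resolve
            if value not in descriptors:
                return False
            labels.append(descriptors[value])
        else:                                 # SIDs given as label values are not checked
            labels.append(value)
    # 4. the first SID must resolve to at least one outgoing interface/next hop
    return len(next_hops.get(labels[0], [])) > 0

descriptors = {"1.1.1.8": 16008, "1.1.1.4": 16004}       # 99.5.8.8 not in the SR-TE DB
next_hops = {16008: [("Gi0/0/0/1", "99.1.7.7")]}
print(sid_list_is_valid([("label", 16008), ("label", 15085), ("label", 16004)],
                        1, descriptors, next_hops))       # True
print(sid_list_is_valid([("descriptor", "1.1.1.8"), ("descriptor", "99.5.8.8")],
                        1, descriptors, next_hops))       # False: descriptor unresolved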

To illustrate the validation procedure for explicit paths, we compare two configuration variants for
the SR policy illustrated on Figure 3‑3: the SIDs specified as label values and the SIDs specified as
segment descriptors.

A failure occurs in the network: the link between Node8 and Node5 goes down. As a result, the IGP withdraws the Adj-SID of this link and the headend removes the Adj-SID from its SR-TE DB. If this Adj-SID is protected by TI-LFA, then traffic is temporarily (~ 15 minutes) forwarded on the TI-LFA backup path.

Figure 3-3: Explicit path example

First, consider the case where the SIDs in the SID list are specified as label values. This is the
configuration in Example 3‑8. The SID list is not empty and has the default weight value of 1, so the
first two validation conditions are satisfied. The next condition does not apply since the SIDs are
specified as label values. The first SID in the list is 16008, which is the Prefix-SID of Node8. Node8
is still reachable from Node1, via Node7, and hence Node1 can resolve the outgoing interface and
nexthop for 16008 accordingly. This satisfies the last validation condition. Consequently, Node1 still
considers SIDLIST1 as valid and keeps steering traffic into POLICY1. If the Adj-SID 15085 is
protected by TI-LFA, then the traffic is forwarded along the backup path and continues to reach
Node4 for a time, but starts being dropped by Node8 as soon as the IGP removes the Adj-SID backup
path. If the Adj-SID is not protected, then the traffic is immediately dropped by Node8 when the link
goes down.
Example 3-8: SR Policy with explicit path on Node1 – SID list using MPLS labels

segment-routing
traffic-eng
policy POLICY1
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
segment-list name SIDLIST1
index 10 mpls label 16008 !! Prefix-SID Node8
index 20 mpls label 15085 !! Adj-SID link 8->5
index 30 mpls label 16004 !! Prefix-SID Node4

Now, consider the case where the SIDs in the SID list are specified with segment descriptors. This is
the configuration in Example 3‑9. The first two validation conditions, SID list not empty and weight
not 0, are satisfied. The fourth condition is satisfied as well: the prefix-SID 16008 of Node8’s prefix
1.1.1.8/32 is still reachable from the headend Node1 via Node7. However, when Node1 attempts to
resolve the second segment descriptor 99.5.8.8 into an MPLS label value, it is unable to find the IP
address in its SR-TE DB. This entry was removed from the SR-TE DB following the withdrawal of
the link between Node8 and Node5 in the IGP. The failure to resolve this second segment descriptor
violates the third validation condition and causes Node1 to invalidate SIDLIST2. The candidate path
and SR Policy are also invalidated since no other option is available.

Example 3-9: SIDs specified as segment descriptors

segment-routing
traffic-eng
policy POLICY2
color 20 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST2
!
segment-list name SIDLIST2
index 10 address ipv4 1.1.1.8 !! Prefix-SID Node8
index 20 address ipv4 99.5.8.8 !! Adj-SID link 8->5
index 30 address ipv4 1.1.1.4 !! Prefix-SID Node4

Example 3‑10 shows the status of the SR Policy with the configuration in Example 3‑9 when the link
Node8-Node5 is down. Notice that this SR Policy is Operational down (line 9) since its only
Candidate path is down.
Example 3-10: SR Policy using explicit segment list with IP addresses – status down

1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy


2
3 SR-TE policy database
4 ---------------------
5
6 Color: 20, End-point: 1.1.1.4
7 Name: srte_c_20_ep_1.1.1.4
8 Status:
9 Admin: up Operational: down for 00:00:06 (since May 4 10:47:10.263)
10 Candidate-paths:
11 Preference: 100 (configuration)
12 Name: POLICY2
13 Requested BSID: dynamic
14 Explicit: segment-list SIDLIST2 (inactive)
15 Inactive Reason: Address 99.5.8.5 can not be resolved to a SID
16 Weight: 1
17 Attributes:
18 Forward Class: 0
19 Steering BGP disabled: no
20 IPv6 caps enable: yes

Constraints

Besides the four validation conditions explained previously, the operator may express additional constraints on an explicit candidate path.

For example, the operator may request the headend to ensure that the path followed by the explicit
SID list never uses a link or node with a given SRLG or TE-affinity, or may require that the
accumulated link delay be smaller than a bound.

Chapter 4, "Dynamic Candidate Path" contains more information about constraints.

Resiliency

While explicit paths are static by nature, they benefit from different resiliency mechanisms. Let us
consider the Adj-SID S1 associated with link L. Let us assume that S1 is the first SID of the list or is
expressed as a segment descriptor.

Should L fail, the IGP convergence ensures that the headend SR-TE DB is updated within a few hundred milliseconds. The SR-TE DB update triggers the invalidation of S1, the invalidation of its associated SID list and hence the invalidation of its associated explicit candidate path (assuming that it is the only SID list for this candidate path). This invalidation may allow another candidate path of the SR Policy to become active, or it may cause the SR Policy to become invalid and hence be removed from the forwarding table. In such a case, the traffic would then follow the default IGP shortest path (or be dropped as described in chapter 16, "SR-TE Operations").
3.5 Practical Considerations
In practice, the choice of expressing a SID list with label values or segment descriptors depends on
various factors, such as how reliably the entity initiating the path can maintain it, or whether the
headend has enough visibility to resolve all the segment descriptors that would be used in the SID
list.

This section provides hints on the most appropriate mode of expression through several practical examples.

A Controller May Not Want the Headend to Second-Guess

An explicit candidate path is typically computed by an external controller. The intent for that path is
known by the controller.

This controller typically monitors the network in real time and updates any explicit SID list when
needed. This can happen upon the loss of a link or node (an alternate path is then selected) or upon
the addition of a link or node in the topology (a better path is now possible).

As the controller knows the intent and it monitors the network in real time, most likely the controller
does not want the headend to second-guess its decisions.

In such a case, the controller will express the SIDs as MPLS labels. The headend does not check the
validity of such SIDs and the controller stays in full control.

An Operator May Want Headend Monitoring

The operator may instantiate an explicit SR Policy as two (or more) explicit candidate paths (CPs):
CP1 with preference 200 and CP2 with preference 100. The operator does not want to continuously
check the state of the network and promptly react to a change. Instead, the operator wants the headend
to automatically switch from CP1 to CP2 when CP1 becomes unavailable and from CP2 to CP1 when
CP1 becomes available.

In such a case, the operator will express the SIDs as segment descriptors. This will force the headend
to validate each SID individually and hence the SID list in its entirety. When a SID of CP1’s SID list
becomes invalid, the headend will invalidate the entire SID list and CP1, and activate CP2. When the
SID becomes valid again, the headend will re-enable the SID list and activate CP1.
SID Translation Is Limited to the Headend Domain

When it is known that a segment descriptor is not present in the headend’s SR-TE DB, the SID must
be expressed as a label value. The translation from segment descriptor to SID label value would not
work otherwise and hence there would be no point in expressing the related SID using a segment
descriptor.

This is typically the case when a SID is outside the local domain of the headend.

Dynamically Allocated Labels Are Hard to Predict

One must know the label value of a SID to be able to use that mode of expression.

This is not a problem for Prefix-SIDs, as they have fixed, network-wide label values3.

It may be a problem for Adj-SIDs or Peering-SIDs, which often use dynamically allocated label values that are hard – even impossible – to predict. These dynamic labels are subject to change; for example, a router could allocate different label values for its Adj-SIDs after it has been reloaded. If the Adj-SID label value changes, the configured SID list with the old Adj-SID label value no longer provides the correct path and must be reconfigured.

Assume that in the configuration of Example 3‑1 the operator had configured the dynamic Adj-SID
label 24085 instead of the manual Adj-SID. Then assume that Node8 was reloaded and has acquired
a new label value 24000 for the Adj-SID of the link Node8→Node5. Since the old label value 24085
for this Adj-SID is no longer correct, the Adj-SID entry (index 20) in the SR Policy’s SID list on
Node1 must then be reconfigured, using the new label value, as index 20 mpls label 24000.

A solution to this problem is to configure explicit Adj-SIDs and Peering SIDs. Since they are
configured, they are persistent, even across reloads.
3.6 Controller-Initiated Candidate Path
A controller, or an application above a controller, can translate an SLA from a headend H to an
endpoint E into an Explicit Candidate Path. Once computed, the controller signals the path to H via
PCEP or BGP SR-TE.

We leave the protocol details to the chapters that are dedicated to these protocols. For now, it is
enough to know that these protocols provide the means to convey the explicit candidate path from the
controller to the headend.

After the protocol exchange, the headend node instantiates the path or updates the path if it already
existed.

South-bound and north-bound interfaces


The south-bound and north-bound indications refer to the typical architectural drawing, where the north-bound interface is drawn on
top of the illustrated component and the south-bound interface is drawn below it.

For a controller, examples of the north-bound interface are: REST (Representational State Transfer) and NETCONF (Network
Configuration Protocol).

Controller south-bound interface examples are PCEP, BGP, classic XML, and NETCONF.

In the topology of Figure 3‑4, a controller is programmed to deliver a low-delay path from Node1 to
1.1.1.4 (Node4). The controller is monitoring the topology and learns that the link between Node7
and Node6 experiences a higher than normal delay, for example due to an optical circuit change.
Therefore, the controller decides to avoid this link by initiating on headend Node1 a new explicit
path with the SID list <16003, 16004>, which encodes the path 1→2→3→6→5→4.

The controller signals this path for SR Policy (blue, 1.1.1.4) to headend Node1 via its south-bound
interface. This path is a candidate path for the SR Policy (blue, 1.1.1.4). If the SR Policy (blue,
1.1.1.4) already exists on Node1 (initiated by any protocol: CLI, PCEP, BGP, …), this new candidate
path is added to the list of candidate paths of this SR Policy. The path selection procedure then
decides which candidate path becomes the active path. In this case, we assume that SR Policy (blue,
1.1.1.4) does not yet exist on Node1. Therefore, Node1 instantiates the SR Policy with a single
candidate path and programs the forwarding entry for this SR Policy. The path is then ready to be used; traffic can be steered onto it.

Figure 3-4: Controller-initiated path

Using this mechanism, a controller can steer any traffic flow on any desired path through the network
by simply programming the SR Policy path on the headend node.

The controller is responsible for maintaining the path. For example, following a change in the
network, it must re-compute the path and update it on headend Node1 if required.
3.7 TDM Migration
In this use-case, we explain how an operator can pin a pseudowire (PW) onto a specific SR Policy
and ensure that the PW traffic will be dropped as soon as the SR Policy’s selected path becomes
invalid.

The steering of the PW in the SR policy is not the focus of this section. Briefly, we use the L2VPN
preferred path mechanism which allows us to steer a PW into an SR policy. We will detail this in
chapter 10, "Further Details on Automated Steering".

The focus of this section is on two specific mechanisms:

The ability to avoid using a TI-LFA FRR backup path

The ability to force the traffic to be dropped upon SR policy invalidation (instead of letting the PW
follow the IGP shortest-path)

Explicit paths allow pinning down services on a pre-defined path through the network. This may be
desired when, for example, migrating a Time Division Multiplexing (TDM) service to an IP/MPLS
infrastructure.

In the example topology of Figure 3‑5, two disjoint PWs are configured between Node1 and Node4.
These PWs carry high traffic volumes and the operator does not want both PWs to traverse the same
core link (2-3 or 6-5) since their combined load exceeds the capacity of these links.

The operator configures two SR Policies on Node1 and steers one PW in each of them using the
L2VPN preferred-path functionality.
Figure 3-5: TDM migration

Each of these two SR Policies has an explicit SID list, where the SIDs are the unprotected
Adjacency-SIDs of the links on the path. This ensures that the SR Policies are pinned to their path.

Unprotected Adj-SID


Unprotected Adj-SIDs are not protected by a local protection mechanism such as TI-LFA. By using this type of Adj-SID for this path, local protection can be enabled on all links in the network so that other traffic flows benefit from it, while the traffic flows using the unprotected Adj-SIDs do not use the local protection functionality.

The relevant configuration of Node1 is shown in Example 3‑11. The Adj-SIDs in the SID lists are
specified as interface IP addresses. For example, 99.1.2.2 is the IP address of the interface on Node2
for its link to Node1.

The headend Node1 maps these interface IP addresses to the Adj-SID of the link. However, as
mentioned in section 3.3, IP addresses are mapped to protected Adj-SID labels by default. Hence, the
configuration line constraints segments unprotected (see lines 10-12 of Example 3‑11) is
added to instruct Node1 to map the IP addresses to the unprotected Adj-SID labels instead.
By default, an SR Policy with no valid candidate path is invalidated and the traffic that was steered
into it falls back to its default forwarding path, which is usually the IGP shortest path to its
destination. However, in this use-case the operator specifically wants to force the traffic to be
dropped upon SR Policy invalidation. This behavior is achieved by configuring steering
invalidation drop under the SR Policy (see lines 5-6 of Example 3‑11).

“The invalidation-drop behavior is a great example of lead operator partnership.

The lead operator brought up their requirement very early in the SR-TE design process (~2014) and this helped us define and implement the behavior. This has now been shipping since 2015 and is used in deployment when the operator prefers to drop some traffic rather than letting it flow over paths that do not meet its requirements (e.g., from a BW/capacity viewpoint). ”

— Clarence Filsfils
Example 3-11: TDM migration – configuration of Node1

1 segment-routing
2 traffic-eng
3 policy POLICY1
4 color 10 end-point ipv4 1.1.1.4
5 steering
6 invalidation drop
7 candidate-paths
8 preference 100
9 explicit segment-list SIDLIST1
10 constraints
11 segments
12 unprotected
13 !
14 policy POLICY2
15 color 20 end-point ipv4 1.1.1.4
16 steering
17 invalidation drop
18 candidate-paths
19 preference 100
20 explicit segment-list SIDLIST2
21 constraints
22 segments
23 unprotected
24 !
25 segment-list name SIDLIST1
26 index 10 address ipv4 99.1.2.2 !! link 1->2
27 index 20 address ipv4 99.2.3.3 !! link 2->3
28 index 30 address ipv4 99.3.4.4 !! link 3->4
29 !
30 segment-list name SIDLIST2
31 index 10 address ipv4 99.1.6.6 !! link 1->6
32 index 20 address ipv4 99.5.6.5 !! link 6->5
33 index 30 address ipv4 99.4.5.4 !! link 5->4
34 !
35 l2vpn
36 pw-class PREF-PATH1
37 encapsulation mpls
38 preferred-path sr-te policy POLICY1
39 !
40 pw-class PREF-PATH2
41 encapsulation mpls
42 preferred-path sr-te policy POLICY2
43 !
44 xconnect group XG1
45 p2p PW1
46 interface GigabitEthernet0/0/0/0
47 neighbor ipv4 1.1.1.3 pw-id 1
48 pw-class PREF-PATH1
49 !
50 p2p PW2
51 interface GigabitEthernet0/0/0/1
52 neighbor ipv4 1.1.1.3 pw-id 2
53 pw-class PREF-PATH2

At the time of writing, steering invalidation drop and constraints segments unprotected
were only available in the initial (and now deprecated) SR-TE CLI.
3.8 Dual-Plane Disjoint Paths Using Anycast-SID

“There are at least three SR solutions to the disjoint paths use-case: SR Policy with a dynamic path, SR IGP Flex-Algo and
Explicit candidate path. In this section, we describe the last option. We will describe the other options later in this book.

The key point I would like to highlight is that this explicit candidate path option has been selected for deployment.

In theory, this explicit path solution does not work when an access node loses all its links to the chosen blue plane (2-11 and
2-13 both fail in the following illustration) or the blue plane partitions (11-12 and 13-14 both fail).

However, in practice, some operators estimate that these events are highly unlikely and hence they select this simple dual-
plane solution for its pragmatic simplicity.

Other operators prefer a solution that dynamically ensures that the disjoint objective is always met whatever the state of
the network, even if unlikely. The two other design options meet those requirements. We will cover them later in the book. ”

— Clarence Filsfils

The topology in Figure 3‑6 shows a dual-plane network topology, a design that is used in many
networks.

The blue plane consists of nodes 11 to 14.

The green plane consists of nodes 21 to 24.

The common practice consists in configuring the inter-plane connections, also known as shunt links (e.g., the link between Node11 and Node21), with a high (bad) IGP metric. These links are represented with thinner lines to illustrate this.

When the shunt links have such a high metric, the traffic that enters one plane remains on the same
plane until its destination. The only scenario that would make the traffic cross to the other plane is a
partitioning of its initial plane; i.e., a failure that causes one part of the plane to become isolated from
the rest. In this case, reachability between the partitions may only be possible via the other plane.
This is very rare in practice as this would require at least two independent failures.

Edge nodes connect redundantly to each plane, either via direct links or indirectly via another edge node.
Figure 3-6: Dual-plane disjoint paths using anycast-SID

Let us assume that all the blue nodes are configured with Anycast-SID 16111 and all the green nodes
are configured with Anycast-SID 16222.
Anycast-SID
As we explained in Part1 of this book, Anycast-SIDs not only provide more load-balancing and node resiliency, they are also useful to
express macro-engineering policies that steer traffic via groups of nodes (“anycast sets”) instead of via individual nodes. Each
member of an anycast set advertises the same Anycast-SID.

All blue plane nodes advertise the Anycast-SID 16111. For this, the configuration of Example 3‑12 is applied on all the blue plane
nodes. In essence, an Anycast-SID is a Prefix-SID that is advertised by multiple nodes. Therefore, the regular Prefix-SID
configuration is used to configure it. All blue plane nodes advertise the same Loopback1 prefix 1.1.1.111/32 with Prefix-SID 16111.

Example 3-12: Dual-plane disjoint paths – anycast-SID configuration

interface Loopback1
description blue plane anycast address
ipv4 address 1.1.1.111/32
!
router isis 1
interface Loopback1
address-family ipv4 unicast
prefix-sid absolute 16111 n-flag-clear

By default, when configuring a Prefix-SID, its N-flag is set, indicating that the Prefix-SID identifies a single node. However, an
Anycast-SID does not identify a single node, but it identifies a group of nodes: an anycast set. Therefore, an Anycast-SID must be
advertised with the Prefix-SID N-flag unset, requiring the n-flag-clear keyword in the Prefix-SID configuration. Note that for ISIS,
this configuration also clears the N-flag in the prefix-attribute.
Anycast segments: versatile and powerful
“Anycast segments are, in my eyes, a very versatile and powerful tool for any network designer who does not want to or
cannot completely rely on a central controller to initiate all needed paths. Whenever TE policies are designed and maybe
even configured by a human, anycast segments can solve several problems:

They provide resiliency. More than one node can carry the same SID and possibly several nodes can serve the same purpose and step in for each other. See chapter 8, "Network Resiliency".

Using the same SID or SID descriptor on all routers significantly simplifies router configuration.

IT effort will often be reduced as fewer parameters are needed to generate the configuration for each router.

Last but not least, anycast segments can be seen as a method of abstraction: an Anycast-SID no longer stands just for a particular node, or group of nodes, but rather for a function that needs to be applied to a packet, or for a certain property of forwarding a packet through a network.

Thus, for my specific needs, anycast segments are an essential element of how Segment Routing brings traffic engineering to a whole new level. ”

— Martin Horneffer

In such a dual-plane topology, a very simple solution to steer traffic via the blue plane to Node3
consists in imposing the SID list <16111, 16003>, where 16003 is the Prefix-SID of Node3.

Indeed, 16111 steers the traffic into the blue plane and then the dual-plane design ensures that it
remains in the same plane until its destination. The traffic would not use the other plane as the shunt
links have a very bad metric and multiple independent failures would be required to partition the blue
plane.

Similarly, traffic can be steered in the green plane with the SID list <16222, 16003>.

To steer traffic via the blue plane, the operator configures an SR Policy “BLUE” with color blue
(value 10), endpoint 1.1.1.3 (Node3) and the explicit segment list SIDLIST1, as shown in
Example 3‑13. This SID list is expressed with two segment descriptors: 1.1.1.111 is resolved into the
blue plane Anycast-SID 16111 and 1.1.1.3 maps to the Prefix-SID of Node3.

A second SR Policy, named “GREEN”, with color green (value 20), endpoint 1.1.1.3 (Node3) and
the explicit SID list SIDLIST2 is used to steer traffic via the green plane. SIDLIST2 also contains
two entries: 1.1.1.222 maps to the green plane Anycast-SID 16222 and 1.1.1.3 maps to the Prefix-SID
of Node3.

Example 3-13: Dual-plane disjoint paths using anycast-SID – configuration

segment-routing
traffic-eng
policy BLUE
color 10 end-point ipv4 1.1.1.3
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
policy GREEN
color 20 end-point ipv4 1.1.1.3
candidate-paths
preference 100
explicit segment-list SIDLIST2
!
segment-list name SIDLIST1
index 10 address ipv4 1.1.1.111 !! blue plane anycast
index 20 address ipv4 1.1.1.3
!
segment-list name SIDLIST2
index 10 address ipv4 1.1.1.222 !! green plane anycast
index 20 address ipv4 1.1.1.3

Now, assume that an L3VPN service is required between Node1 and Node3, with two VRFs: blue
and green. These two VRFs should be carried over disjoint paths wherever possible. This can be
achieved by steering the VRF blue packets into SR Policy BLUE, and the VRF green packets into SR
Policy GREEN. Traffic steering details are covered in chapter 5, "Automated Steering". For now, it
is enough to know that blue colored prefixes are steered into the SR Policy with color blue and green
colored prefixes are steered into the SR Policy with color green.

The traceroute output in Example 3‑14 shows that the traffic of the VRF blue traverses the blue plane
(nodes 11 to 14) while the traffic of VRF green traverses the green plane (nodes 21 to 24). Both VRF
prefixes 10.10.10.10 and 20.20.20.20 have Node3 as BGP next hop.

The MPLS labels on the packet received by Node2 are the Anycast-SID of the plane, 16111 or 16222,
the Prefix-SID of the BGP next hop, 16003, and the VPN label 9000x for the prefix.
Example 3-14: Dual-plane disjoint paths using anycast-SID – traceroute

RP/0/0/CPU0:xrvr-1#traceroute vrf blue 10.10.10.10

Type escape sequence to abort.


Tracing the route to 10.10.10.10

1 99.1.2.2 [MPLS: Labels 16111/16003/90009 Exp 0/0/0] 19 msec 19 msec 19 msec


2 99.2.11.11 [MPLS: Labels 16003/90009 Exp 0/0] 29 msec 19 msec 19 msec
3 99.11.12.12 [MPLS: Labels 16003/90009 Exp 0/0] 19 msec 19 msec 19 msec
4 99.3.12.3 19 msec 19 msec 19 msec

RP/0/0/CPU0:xrvr-1#traceroute vrf green 20.20.20.20

Type escape sequence to abort.


Tracing the route to 20.20.20.20

1 99.1.2.2 [MPLS: Labels 16222/16003/90007 Exp 0/0/0] 19 msec 19 msec 19 msec


2 99.2.23.23 [MPLS: Labels 16003/90007 Exp 0/0] 29 msec 19 msec 19 msec
3 99.23.24.24 [MPLS: Labels 16003/90007 Exp 0/0] 19 msec 19 msec 19 msec
4 99.3.24.3 19 msec 19 msec 19 msec
3.9 Summary
An explicit path can be provided to the headend by configuration or signaled by a controller
(NETCONF, PCEP, BGP).

An explicit candidate path is formally defined as a weighted set of SID lists. An explicit candidate
path is typically a single SID list.

A headend node is not involved in the path computation or the SID list encoding of an explicit path; it instantiates an explicit path verbatim, as provided.

A SID can be expressed as an MPLS label or a segment descriptor.

A segment descriptor is preferred when one wants the headend to validate the SID.

A segment descriptor cannot be used for a SID that is unknown to the headend (e.g., inter-domain).

An explicit candidate path is valid if at least one of its SID lists is valid.

An explicit SID list is valid if all the following conditions are met:

The SID list contains at least one SID

The SID list has a weight value larger than 0

The headend can resolve all SIDs expressed as segment descriptors

The headend can resolve the first SID into one or more outgoing interfaces and next-hops.

An explicit candidate path has many applications. We illustrated two of them:

TDM migration to an SR-based infrastructure

Disjoint paths in a dual-plane network


3.10 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[RFC8402] "Segment Routing Architecture", Clarence Filsfils, Stefano Previdi, Les Ginsberg,
Bruno Decraene, Stephane Litkowski, Rob Shakir, RFC8402, July 2018

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

1. Restricting this segment descriptor to point-to-point links allows the SID to be determined with a
single interface address, as opposed to a pair of (local, remote) interface addresses in the
general case.↩

2. As described in SR book Part I and in chapter 8, "Network Resiliency", a protected Adj-SID is FRR-enabled while its unprotected counterpart is not.↩

3. This assumes that the same SRGB is used on all nodes; using different SRGBs is strongly discouraged.↩
4 Dynamic Candidate Path
What we will learn in this chapter:

Dynamically computing an SR-TE path is solving an optimization problem with an optimization objective and constraints. The information in the SR-TE DB is used to compute a path.

SR-optimized algorithms have been developed to compute paths and encode these paths in SID
lists to make optimal use of the benefits of SR, leveraging the available ECMP in the network.

The headend node or a Path Computation Element (PCE) can compute SR-TE paths.

The base IOS XR software offers SR PCE server functionality.

In many cases, the headend node itself can compute SLA paths in a single IGP area, for example delay-optimized paths or paths that avoid specific resources.

For specific path computations, where the headend does not have the necessary information in its
SR-TE DB, the headend node uses an SR PCE to compute paths. For example, computing disjoint
paths from different headend nodes or computing end-to-end inter-domain paths.

A headend can not only request an SR PCE to compute paths but also delegate control of the paths to the SR PCE, which then autonomously maintains these paths.

The SR-TE DB is natively multi-domain capable. The SR PCE learns the topologies of all domains
and the Peering-SIDs of the BGP peering links via BGP-LS.

The SR PCE functionality can be distributed among several SR PCE servers throughout the
network. If needed, these servers may synchronize between each other.
4.1 Introduction
A dynamic candidate path is a candidate path of an SR Policy that is automatically computed by a
headend (router) or by a Path Computation Element (PCE) upon request of the headend. Such a path is
automatically re-computed when required to adapt to a changing network.

4.1.1 Expressing Dynamic Path Objective and Constraints


While an explicit path is expressed as a list of segments, a dynamic path is expressed as an
optimization objective and a set of constraints. These two elements specify the SR-TE intent.

The optimization objective is the characteristic of the path that must be optimized. Typically, this is
the minimization of a specific metric such as the IGP link metric or the minimum link-delay metric.

Constraints are limitations that the resulting path must honor. For example, one may want the path to
avoid specific links or groups of links or have a cumulative path metric that is bound by a maximum
value.

4.1.2 Compute Path = Solve Optimization Problem


Computing a path through a network that satisfies a set of given requirements is solving an
optimization problem, basically a constrained shortest path problem.

“Find the shortest path to each node in the network” is an example of an optimization problem that is
solved by using Dijkstra’s well-known Shortest Path First (SPF) algorithm. To solve this problem,
Dijkstra’s algorithm uses the network graph (consisting of nodes and links, also known as vertices
and edges in graph theory) and a cost associated with each edge (link).

In networking, the link-state IGPs use Dijkstra’s algorithm to compute the Shortest Path Tree (SPT);
the edge cost is then the IGP link metric. Prefixes are leaves hanging from the nodes (vertices).

The information that is needed to run Dijkstra’s SPF and compute prefix reachability – nodes, links,
and prefixes – is distributed by the link-state IGPs. Each node keeps this information in its Link-state
Database (LS-DB).
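
As a reminder of the basic mechanism, the following Python fragment runs a textbook Dijkstra SPF on a small toy graph; the topology and metrics are illustrative assumptions, with the edge costs playing the role of IGP link metrics:

import heapq

# Toy graph: node -> {neighbor: IGP metric}.
graph = {
    "Node1": {"Node2": 10, "Node7": 10},
    "Node2": {"Node1": 10, "Node3": 10},
    "Node3": {"Node2": 10, "Node4": 10},
    "Node7": {"Node1": 10, "Node6": 10},
    "Node6": {"Node7": 10, "Node4": 30},
    "Node4": {"Node3": 10, "Node6": 30},
}

def spf(source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                                  # stale heap entry
        for nbr, metric in graph[node].items():
            if d + metric < dist.get(nbr, float("inf")):
                dist[nbr] = d + metric
                heapq.heappush(heap, (d + metric, nbr))
    return dist                                       # shortest distance to every node

print(spf("Node1"))   # Node4 is reached at cost 30, via Node2 and Node3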

This introduces two necessary elements that are needed to compute paths in a network: a database
that contains all necessary information about the network and a computation engine that applies
algorithms to the information to solve the optimization problem, i.e., compute dynamic paths.

Database

The database used for the computation is the SR-TE DB. Besides the network graph (nodes and
links), the SR-TE DB contains various other information elements that the computation engine uses to
solve different optimization problems. A few examples illustrate this. To compute delay-optimized
paths, link delay information is required. To provide disjoint paths, information about the
programmed paths is required.

Chapter 12, "SR-TE Database" goes into much more detail on the SR-TE DB.

Computation Engine

The computation engine translates an SR-TE intent into a SID list.

The path computation algorithms are at the core of the computation engine. Efficient SR-native
optimization algorithms have been developed based on extensive scientific research; see the
SIGCOMM 2015 paper [SIGCOMM2015].

This chapter focuses on four use-cases: delay-optimized paths, resource-avoidance paths, disjoint
paths, and inter-domain paths.

4.1.3 SR-Native Versus Circuit-Based Algorithms


Classic MPLS-TE (RSVP-TE) was designed 20 years ago as an ATM/FR sibling. Its fundamental
building block is a point-to-point non-ECMP circuit with per-circuit state at each hop.

Even though ECMP is omnipresent in IP networks, classic RSVP-TE circuit-based paths do not
leverage this ECMP by definition. As a result, the RSVP-TE solution needs many tunnels between the
same set of headends and endpoints in order to use multiple paths through the network (one tunnel
over each possible path). This drastically increases operational complexity and decreases scalability
due to the number of tunnels required for proper traffic load-balancing.

As indicated before, considerable research went into the development of efficient SR-optimized path
computation algorithms for SR-TE. These algorithms natively use the characteristics of the IP network
and inherently leverage all available ECMP paths. The outcome of a native, SR-optimized
computation is a reduced SID list that leverages the SR capabilities, such as multi-hop segments and
ECMP-awareness.

Few SIDs, ECMP
“As explained in the introduction of Part1, the intuition for Segment Routing came up while driving to Rome and realizing
that the path to Rome avoiding the Gottardo pass (the shortest path to Rome from Brussels is via the Gottardo pass) could
be expressed as simply “from Brussels, go to Chamonix and then from Chamonix go to Rome”. Only two segments are
required! All the simulations we did later confirmed this basic but key intuition: few SIDs were required.

My experience designing and deploying network had told me that ECMP is a basic property of modern IP network. Hence
the intuition while driving to Rome was also that segment-routed paths would naturally favor ECMP.

This was clear because each prefix segment expresses a shortest-path and network topologies are designed such that
shortest-paths involve as much ECMP as possible (to load-share and better use resources and for robustness).

This ECMP property was proved later by all the simulations we did. ”

— Clarence Filsfils

To illustrate the benefits of the SR-native algorithms over the circuit-based classic RSVP-TE
solution, Figure 4‑1 compares RSVP-TE circuit-based optimization with SR-TE optimization. In the
topology a path is computed using both methods from Node1 to Node3, avoiding the link between
Node2 and Node3.
Figure 4-1: Circuit Optimization versus SR Optimization

The RSVP-TE solution first prunes the red link. Second, it computes the shortest path from 1 to 3 in
the pruned graph. Third, it selects one single non-ECMP path out of the (potentially ECMP) shortest
path. In this example, there are three possible shortest paths and let us assume RSVP-TE picks the
path 1→4→5→7→3. Applying this old circuit-based solution to SR would specify each link on the
path, leading to a SID list <24014, 24045, 24057, 24073>, where 240XY represents the Adj-SID of
NodeX to NodeY. This SID list can be shortened to <16005, 16003>, where 1600X is the Prefix-SID
of NodeX, using a trivial path-to-SID list algorithm. However, this is still not as good as the SR-
native solution described in this chapter. It still does not leverage ECMP and this classic path
computation cannot adapt the path to SR-specific requirements such as segment list size or more
ECMP paths.
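
One way such a path-to-SID list conversion could work is a greedy procedure: from the current node, jump to the farthest node of the explicit path that is still reached over a shortest path, and emit that node's Prefix-SID. The Python sketch below is illustrative only; it checks that a sub-path is a shortest path by comparing its cost with the SPF distance, and it deliberately ignores the ECMP divergence that a production algorithm must also handle.

def path_cost(path, metric):
    return sum(metric[(a, b)] for a, b in zip(path, path[1:]))

def encode_path_as_prefix_sids(path, dist, metric, prefix_sid):
    # Greedy sketch: cover as much of the explicit path as possible with
    # each Prefix-SID, as long as the covered sub-path is a shortest path
    sids = []
    i = 0
    while i < len(path) - 1:
        j = i + 1
        for k in range(len(path) - 1, i, -1):
            if path_cost(path[i:k + 1], metric) == dist[path[i]][path[k]]:
                j = k
                break
        sids.append(prefix_sid[path[j]])
        i = j
    return sids

# Hypothetical chain A-B-C-D, all link metrics 10; dist holds the SPF
# distances (as computed by, e.g., Dijkstra's algorithm)
metric = {}
for a, b in [('A', 'B'), ('B', 'C'), ('C', 'D')]:
    metric[(a, b)] = metric[(b, a)] = 10
dist = {'A': {'A': 0, 'B': 10, 'C': 20, 'D': 30},
        'B': {'A': 10, 'B': 0, 'C': 10, 'D': 20},
        'C': {'A': 20, 'B': 10, 'C': 0, 'D': 10},
        'D': {'A': 30, 'B': 20, 'C': 10, 'D': 0}}
prefix_sid = {'A': 16001, 'B': 16002, 'C': 16003, 'D': 16004}
print(encode_path_as_prefix_sids(['A', 'B', 'C', 'D'], dist, metric, prefix_sid))
# [16004]: the whole explicit path is a shortest path, one Prefix-SID suffices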

The SR-TE solution uses a completely different algorithm that seeks to leverage as much ECMP as
possible while using as few SIDs as possible. For this reason, we call it the “SR-native” algorithm.
In this example, the SR native algorithm finds the SID list <16007, 16003>, where 1600X is the
Prefix-SID of NodeX. This SID list only uses two SIDs. This SID list load-balances traffic over 3
paths.
Clearly the SR-native algorithm is preferred for SR applications.
4.2 Distributed Computation
The SR-TE computation engine can run on a headend (router) or on an SR Path Computation Element
(SR PCE). The former involves a distributed solution while the latter involves a centralized solution.
Both headend and SR PCE leverage the SR-native algorithms.

The SR-TE implementation in IOS XR can be used both as an SR-TE headend and an SR PCE.

When possible, one should leverage the SR-TE path computation of the headend (distributed design).
This provides a very scalable solution. When needed, path computation is delegated to an SR PCE
(centralized).

Router and SR PCE use the same path computation algorithms. The difference of their functionality is
not the computation engine but the content of the SR-TE DB.

A headend’s SR-TE DB is most often limited to the local domain and the local SR policies.

An SR PCE’s SR-TE DB may contain (much) more information such as the state of other SR Policies
and additional performance information. Knowing other SR Policies allows disjoint path
computation, and knowing other domains allows inter-domain path computation.

Computing an SR-TE path from a headend to an endpoint within the same IGP area requires only
information of the local IGP area. The headend node learns the information of its local IGP area and
therefore it can compute such paths itself. Sections 4.2.1 and 4.2.2 detail two such examples: low-
delay and resource exclusion.

In a distributed model, the headend itself computes paths locally. When possible, this model should
be used as it provides a very scalable solution.

4.2.1 Headend Computes Low-Delay Path


The network operator wants to provide a low-delay path for a delay-sensitive application. The
source and destination of the path are in the same IGP area. For SR-TE to compute such a path, the
link-delay information must be available in the SR-TE DB. Each node in the area measures the delays
of its links and floods the measured link delay metrics in the IGP.
Link delay metric
The operator can use the measured link delay metric which is dynamically measured per-link by the router and distributed in the IGP.
This methodology has the benefit of always ensuring a correct link delay, even if the optical topology changes due to optical circuit
restoration. In addition, using the measured link delay removes the operational complexity of manually configuring link delays. See
chapter 15, "Performance Monitoring – Link Delay" for more details of the measured delay metric functionality.

In case the direct measurement and distribution of the link delay metric is not available, the operator can use the TE link metric to
represent link delay. The TE metric is an additional link metric, distributed in the IGP, that the operator can leverage to represent the
needs of a particular application. If the delays of the links in the network are constant and known (e.g., based on information coming
from the optical network and fiber lengths), the operator can configure the TE link metrics to represent the (static) link delay.

Given that each node distributes the link delay metrics in the IGP, each headend node in the area
receives this information and stores it in its SR-TE DB. This way, a headend node has the necessary
information to compute delay-optimized paths in the network.

For a headend to feed the IGP-learned information into the SR-TE DB, the distribute link-state
command must be configured under the IGP, as illustrated in Example 4‑1 for both ISIS and OSPF.
This command has an optional parameter instance-id <id>, which is only relevant in multi-domain
topologies. See chapter 17, "BGP-LS" for details.

Example 4-1: Feed SR-TE DB with IGP information

router isis SR
distribute link-state
!
router ospf SR
distribute link-state

As an example, the operator has enabled the nodes in the network of Figure 4‑2 to measure the delay
of their links. The current link delays are displayed in the illustration as the unidirectional link-delay
in milliseconds. The IGP link metrics are 10.
Figure 4-2: Headend computes low-delay path

The operator needs a delay-optimized path from Node1 to Node5 and configures an SR Policy on
Node1 with a dynamic path, optimizing the delay metric. Example 4‑2 shows the SR Policy
configuration on Node1. The SR Policy “LOW-DELAY” has a single candidate path that is
dynamically computed by optimizing the delay metric. No path constraints are applied.

Example 4-2: SR Policy configuration – headend computed dynamic low-delay path

segment-routing
traffic-eng
policy LOW-DELAY
color 20 end-point ipv4 1.1.1.5
candidate-paths
preference 100
dynamic
metric
type delay

The resulting path, 1→2→3→4→5 with a cumulative delay of 10 + 10 + 10 + 9 = 39 ms, is illustrated in Figure 4‑2. This path can be expressed with a SID list <16003, 16004, 16005>, where
1600X is the Prefix-SID of NodeX. Each of these Prefix-SIDs expresses the IGP shortest path to their
target node. This SID list is shown in the output of Example 4‑3.
Example 4-3: Headend computed low-delay SR Policy path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy


SR-TE policy database
---------------------

Color: 20, End-point: 1.1.1.5


Name: srte_c_20_ep_1.1.1.5
Status:
Admin: up Operational: up for 06:57:40(since Sep 14 07:55:11.176)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: LOW-DELAY
Requested BSID: dynamic
Dynamic (valid)
Metric Type: delay, Path Accumulated Metric: 39
16003 [Prefix-SID, 1.1.1.3]
16004 [Prefix-SID, 1.1.1.4]
16005 [Prefix-SID, 1.1.1.5]
Attributes:
Binding SID: 40026
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The operator notices that this delay-optimized path only uses the path via Node4, while two equal
cost (IGP metric) paths are available between Node3 and Node5. This is due to a small difference in
link delay between these two paths from Node3 to Node5: the path via Node4 has a delay of 10 + 9 =
19 ms, while the path via Node6 has a delay of 10 + 10 = 20 ms, 1 ms more. The operator considers
the difference in delay between these two paths insignificant and would prefer to leverage the
available ECMP between Node3 and Node5.

The operator can achieve this by specifying a margin: the computation then tolerates a solution that is not optimal, as long as it stays within that margin of the optimal solution. The margin can be specified as an absolute value or as a relative value (percentage), both in comparison to the optimal solution.
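
Conceptually, the margin relaxes the selection among candidate solutions: any path whose cumulative delay stays within the margin of the minimum-delay path is acceptable, and among those the computation prefers the encoding with the fewest SIDs (and thus more ECMP). The Python fragment below is only a sketch of that selection logic, reusing the delays and SID lists of this example; it is not the actual SR-native algorithm.

def pick_path_with_margin(candidates, margin, relative=False):
    # candidates: list of (cumulative_delay_us, sid_list) tuples
    min_delay = min(delay for delay, _ in candidates)
    bound = min_delay * (1 + margin / 100.0) if relative else min_delay + margin
    eligible = [c for c in candidates if c[0] <= bound]
    return min(eligible, key=lambda c: (len(c[1]), c[0]))  # fewest SIDs wins

candidates = [
    (39000, [16003, 16004, 16005]),  # minimum-delay solution via Node4
    (40000, [16003, 16005]),         # worst-case delay 1 ms higher, fewer SIDs, ECMP
]
print(pick_path_with_margin(candidates, margin=2000))   # absolute margin of 2000 us
# (40000, [16003, 16005])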

In Example 4‑4 an absolute margin of 2 ms (2000 µs) is specified for the dynamic path of the SR
Policy to Node5 on Node1. The cumulative delay of the solution path can be up to 2 ms larger than
the minimum-delay path. With this configuration the solution SID list is <16003, 16005>, which
leverages the ECMP between Node3 and Node5.

Use the keyword relative to specify a relative margin.


Example 4-4: Dynamic path with delay margin

segment-routing
traffic-eng
policy LOW-DELAY
color 20 end-point ipv4 1.1.1.5
candidate-paths
preference 100
dynamic
metric
type delay
margin absolute 2000

Delay margin
“Back in 2014 when designing the SR-TE native algorithms, we realized that the intuition to use as much ECMP as possible
would not work for delay optimization without the notion of margin.

First, let us explain the reason. This is again very intuitive.

Human beings optimize network topologies and assign IGP metrics to enhance the ECMP nature of shortest paths.

For example, two fibers between Brussels and Paris are assigned the same IGP cost of 10 while one fiber takes 300 km
and the other one takes 500 km. From a capacity viewpoint, this distance difference does not matter.

From a delay viewpoint it does matter. The dynamic router-based performance monitoring of the two fibers will detect a
difference of 200 km / c ≈ 1 msec, where c is the speed of light in the fiber.

The IGP will flood one link with 1.5 msec of delay and the other link with 2.5 msec of delay.

Remote routers computing a low-delay path (e.g., from Stockholm to Madrid) would have to insert a prefix-SID in Brussels
followed by the adjacency SID of the first fiber to make sure that the second fiber is avoided. Two more SIDs and no
ECMP just for gaining 1 msec between Stockholm and Madrid.

We could easily guess that some operator would prefer fewer SIDs and more ECMP in exchange for some margin around
the low-delay path.

Hence, we introduced the notion of margin for low-delay SR-native optimization.

We then designed the related algorithm which finds the segment-routed path with the least amount of SIDs within the
margin above the lowest-delay path. ”

— Clarence Filsfils

Whenever the topology changes, or the link delay measurements change significantly (see chapter 15,
"Performance Monitoring – Link Delay", discussing the link delay measurement), headend Node1 re-
computes the paths in the new topology and updates the SID list of the SR Policy accordingly.
4.2.2 Headend Computes Constrained Paths
The operator needs to provide a path that avoids certain resources.

The example network is a dual-plane network. A characteristic of such a network design is that, when a packet lands on a plane, it stays on that plane until it reaches its destination, provided the plane is not partitioned. By default, traffic flows are load-balanced over both planes.

The operator can steer traffic flows onto one of the planes. This can be achieved in multiple ways.
One way is by using an Anycast-SID assigned to all devices in a plane as described in the previous
chapter (chapter 3, "Explicit Candidate Path").

Another possibility is to compute dynamic paths, restricting the path to a given plane. This method is
described in this section. Yet another way is to use the Flex-Algo functionality described in chapter 7,
"Flexible Algorithm".
Resource exclusion
“The richness of the SR-TE solution allows to solve a given problem in different ways, each with different trade-offs.

Let us illustrate this with the dual-plane disjoint paths use-case (e.g., enforcing a flow through the blue plane).

In the explicit path chapter 3, "Explicit Candidate Path", we showed a first solution using the anycast SID of the blue plane.

In this dynamic path chapter, we show a second solution using a dynamic path with an “exclude affinity green”.

Later on, in the SR IGP Flex-Algo chapter 7, "Flexible Algorithm", we will describe a third solution.

The first solution does not require any intelligence on the headend but may fail to adapt to rare topology issues.

The second solution requires SR-TE headend intelligence (SR-TE DB and the SR-native algorithm) without network-wide
IGP change or dependency.

The third solution leverages the IGP itself but requires a network-wide feature capability.

In my opinion, each is useful, and the selection is done case by case based on the specific operator situation.

This is confirmed by my experience as I have been involved with the deployment of the first two and at the time of writing
I am involved in the deployment of the third solution type. Each operator had its specific reasons to prefer one solution over
another.

The SR solution is built as modules. Each operator can use the module he wants based on his specific analysis and
preference. ”

— Clarence Filsfils

4.2.2.1 Affinity Link Colors


To steer traffic flows on a single plane the operator uses an SR Policy with a dynamic path that is
constrained to the desired plane. One way to apply this restriction is by using the link affinity
functionality. The links in each plane are colored; the links in one plane are colored blue and the links
in the other plane green. The path via the green plane can then be computed as “optimize IGP metric,
avoiding blue links”, and the path via the blue plane as “optimize IGP metric, avoiding green links”.
Link colors and affinity
An operator can mark links with so-called affinity link colors, known as administrative groups in IETF terminology. Historically,
there were 32 different colors (groups) available, later extended to allow more (256 colors in Cisco IOS XR). These colors allow
links to be mapped into groups or classes. For example, links in a given region are colored red and links in another region are colored
green.

The colors of a link are advertised in IGP as a bitmap, where each color represents a bit in this bitmap. The IGP advertises the
affinity bitmaps with the adjacencies. An operator can freely choose which bit represents which color. For example, color red is
represented by the first bit in the bitmap and color green by the third bit. If a link has a given color, the affinity bitmap of that link is
then advertised with the bit of the corresponding color set. Each link can have multiple colors, in which case multiple bits in the
bitmap will be set.

These link colors are inserted in the SR-TE DB. A headend node can then use these link colors in its path computation to include or
exclude links with a given color or set of colors in the path. For example, the operator can specify to compute a dynamic path,
optimizing IGP metric and avoiding links with color red.

Refer to RFC 3630, RFC 5305, and Section 6.2 of RFC 2702 for IETF information about affinity and link colors.
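
In code terms, an affinity constraint boils down to a simple bitwise test on the advertised bitmap. The Python sketch below is illustrative, using hypothetical bit positions (0 for GREEN, 2 for BLUE, matching the configuration shown later in Example 4‑5) and the exclude-any semantics used in this chapter.

GREEN = 1 << 0   # affinity name mapped to bit-position 0
BLUE = 1 << 2    # affinity name mapped to bit-position 2

def link_pruned(link_affinity_bits, exclude_any_mask):
    # exclude-any: prune the link if it carries any of the excluded colors
    return (link_affinity_bits & exclude_any_mask) != 0

green_link = GREEN                                        # bitmap 0b001
print(link_pruned(green_link, exclude_any_mask=BLUE))     # False: link is kept
print(link_pruned(green_link, exclude_any_mask=GREEN))    # True: link is pruned

Other matching modes, such as include-any or include-all, are variations of the same bitwise comparison.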

The topology of Figure 4‑3 is a dual-plane network. One plane, named “Green”, consists of the nodes
11 to 14, while the other plane, named “Blue”, consists of the nodes 21 to 24. Node2 and Node3 are
connected to both planes. All the links in plane Green are colored with the same affinity color
(green), the links in plane Blue with another affinity color (blue).

Figure 4-3: dual-plane affinity resource avoidance example


The affinity color configuration is illustrated with Node2’s configuration in Example 4‑5. This node
has a link in each plane, the link to Node11 in the Green plane and the link to Node21 in the Blue
plane.

The configuration starts by defining human-friendly affinity color names. These names can be any
user-defined strings, not just names of colors as usually shown in examples. Each name identifies a
specific bit in the affinity bitmap. The position of the bit in the bitmap is zero-based, the first bit has
position 0. Name GREEN in the example corresponds to the bit at position 0 in the bitmap (GREEN
position 0), name BLUE corresponds to the bit at position 2.

The naming scheme for the affinity-maps is locally significant; the names and their mapping to the bit
positions are not distributed. Consistency is key; it is strongly recommended to have a consistent
name-to-bit-position mapping configuration across all nodes, e.g., by using an orchestration system.

After defining the names, they can be assigned to links. Interface Gi0/0/0/0 on Node2 is in the plane
Green and is marked with affinity GREEN (affinity name GREEN), while interface Gi0/0/0/1,
which is in the plane Blue, is marked with the name BLUE. Each link can be marked with multiple
names by configuring multiple affinity names under the interface. The node then advertises for each
interface the affinity bitmap with the bits that correspond to the configured names set to 1.

Example 4-5: Assigning Affinity link colors – Node2

segment-routing
traffic-eng
affinity-map
name GREEN bit-position 0
name BLUE bit-position 2
!
interface Gi0/0/0/0
!! link to Node11, in plane Green
affinity name GREEN
!
interface Gi0/0/0/1
!! link to Node21, in plane Blue
affinity name BLUE

Since each node advertises the affinity-map of its links in the IGP, all nodes in the IGP area receive
that information and insert it in their SR-TE DB. A headend node can then use this information in its
path computations.

4.2.2.2 Affinity Constraint


The operator configures two SR Policies to Node3 on Node1. Each SR Policy uses one of the two
planes. The operator configures one SR Policy to compute a dynamic path avoiding BLUE links,
hence restricted to the Green plane, and another SR Policy with a path that avoids GREEN links,
hence restricted to the Blue plane. The SR Policy configuration of Node1 is shown in Example 4‑6.

Example 4-6: Affinity link color resource avoidance example – Node1

segment-routing
traffic-eng
policy VIA-PLANE-GREEN
color 20 end-point ipv4 1.1.1.3
candidate-paths
preference 100
dynamic
metric
type igp
!
constraints
affinity
exclude-any
name BLUE
!
policy VIA-PLANE-BLUE
color 30 end-point ipv4 1.1.1.3
candidate-paths
preference 100
dynamic
metric
type igp
!
constraints
affinity
exclude-any
name GREEN

The first SR Policy, named VIA-PLANE-GREEN, has color 20 and endpoint 1.1.1.3 (Node3). A
single candidate path is configured, dynamically computing an IGP metric optimized path, excluding
links with color BLUE. Headend Node1 locally computes the path. The resulting path is shown in
Figure 4‑3. Notice that the path follows plane Green, leveraging the available ECMP within this
plane. Example 4‑7 shows the status of this SR Policy on Node1.
Example 4-7: SR Policy status on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 20, End-point: 1.1.1.3


Name: srte_c_20_ep_1.1.1.3
Status:
Admin: up Operational: up for 00:10:54 (since Mar 2 17:27:13.549)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: VIA-PLANE-GREEN
Requested BSID: dynamic
Constraints:
Affinity:
exclude-any:
BLUE
Dynamic (valid)
Metric Type: IGP, Path Accumulated Metric: 50
16014 [Prefix-SID, 1.1.1.14]
16003 [Prefix-SID, 1.1.1.3]
Attributes:
Binding SID: 40048
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The second SR Policy in Node1’s configuration, named VIA-PLANE-BLUE, steers the traffic along
plane Blue. Node1 computes this path by optimizing the IGP metric and avoiding links with color
GREEN.

Whenever the topology changes, headend Node1 re-computes the SR Policy paths in the new topology
and updates the SID lists of the SR Policies accordingly.

With the configuration in Example 4‑6, three paths are available to Node3, two via SR policies and
one via the IGP shortest path. The SR Policies restrict the traffic to one of the planes, while the IGP
shortest path uses both planes.

By default, service traffic is steered via the IGP shortest path to its nexthop. If the nexthop has an
associated Prefix-SID then that will be imposed. This is the default Prefix-SID forwarding behavior.
For example, service traffic with nexthop Node3 is steered via Node3’s Prefix-SID 16003.

The IGP Prefix-SID 16003 follows the unconstrained IGP shortest path, leveraging all available
ECMP paths. Hence, by default, the traffic flows from Node1 to Node3 are not limited to a single
plane, but they are distributed over all available ECMP.
Steering service traffic into the SR Policies towards its nexthop can be done using Automated
Steering (AS) by attaching the color of the required SR Policy to the destination prefix. Attach color
20 to steer service traffic into SR Policy VIA-PLANE-GREEN and color 30 to steer it into SR Policy
VIA-PLANE-BLUE. Chapter 5, "Automated Steering" describes AS in more detail.

This gives the operator three steering possibilities: via both planes for uncolored destinations, via
plane Green for destinations with color 20, and via plane Blue for destinations with color 30.

4.2.3 Other Use-Cases and Limitations


4.2.3.1 Disjoint Paths Limited to Single Head-End
Disjoint (diverse) paths are paths that do not traverse common network elements, such as links,
nodes, or SRLGs. To compute disjoint SR-TE paths, information about all paths is required.

If the disjoint paths have a common headend and the endpoints(s) of these paths are within the same
IGP area, the headend can compute these paths itself. In that case the headend has knowledge of the
local IGP area and it has knowledge of both paths since it is the headend of both.

If the disjoint paths have distinct headends, then these headends cannot compute the paths since they
only know about their own SR Policy paths and are unaware of the other headend’s SR Policy paths.
If one does not know the other path, then computing a disjoint path is not possible.

This use-case requires a centralized solution in which a single computation entity is aware of both paths and can therefore provide disjoint paths. Section 4.3 of this chapter details the disjoint paths use-case
between two sets of headends and endpoints.

4.2.3.2 Inter-Domain Path Requires Multi-Domain Information


Computing paths that traverse different IGP areas or domains requires knowledge about all these IGP
areas and domains.

Typically, a headend node only has knowledge of its local IGP area. The headend's SR-TE DB could in principle contain a multi-domain topology, but that is not yet seen in practice, for scalability reasons.
If the headend node does not have multi-domain information, it cannot compute inter-area and inter-
domain paths. This requires using a centralized solution. Section 4.3.3 of this chapter details the
inter-domain path computation use-case.
4.3 Centralized Computation
When possible, one should leverage the SR-TE path computation of the headend (distributed design).
When needed, path computation is delegated to an SR PCE (centralized).

Router and SR PCE use the same path computation algorithms. The difference of their functionality is
not the computation engine but the content of the SR-TE DB.

A headend’s SR-TE DB is most often limited to its local domain and its own SR policies, as was
pointed out before.

An SR PCE’s SR-TE DB may contain more information, such as the topology of other domains and
state of other SR Policies. Knowing other domains allows inter-domain path computation. Knowing
other SR Policies allows disjoint path computation.

4.3.1 SR PCE
A Path Computation Element (PCE) is an element in the network that provides a path computation
service to Path Computation Clients (PCCs).

Typically, a PCC communicates with a PCE using the PCE communication Protocol (PCEP). Each
headend node can act as a PCC and request the PCE to compute a path using a client/server
request/reply model. The PCE computes the path and replies to the PCC, providing the path details.

To compute paths, an SR PCE has an SR-TE DB and a computation engine, like an SR-TE headend
node. For its local IGP area, the SR PCE learns the topology information from IGP and stores that
information in its SR-TE DB. The PCE uses this information to compute paths, hereby using the same
path computation algorithms as used by a headend node.

A headend node acts as a PCC and requests an SR PCE to compute a path to an endpoint. In its
request the headend provides the path’s optimization objective and constraints. The SR PCE then
computes the path and returns the resulting path to the PCC (headend) as a SID list. The headend then
instantiates this path.

But an SR PCE can do more than compute paths; as a stateful PCE it can control paths as well. A PCC
hands over control of a path to a PCE by delegating this path to the PCE.
“It's a very tedious task to organize the efficiency of a network in regards of RSVP-TE. We have to manually tweak the
timers, the cost and add lots of route-policies to make sure that the traffic will use the pattern that we want.

SR-TE dynamic path drastically reduces this complexity and eliminates the need for manual intervention while maintaining
a highly optimized and robust network infrastructure. One easy use-case is the disjoint paths for video streaming; when
sending duplicate streams, SR-TE dynamic path allows us to make sure the two copies will never use the same links all
along the path from source to destination. This is done by automatically gathering information from the LS-DB to the SR
PCE. The PCE can then compute SR-policy paths that match the desired objective, such as delay, and constraints. ”

— Daniel Voyer

IOS XR SR PCE Server

SR PCE server functionality is available in the base IOS XR software image and it can be used on all
physical and virtual IOS XR platforms.

SR PCE functionality can be enabled on any IOS XR node that is already in the network. However,
for scalability and to prevent a mix of functionalities on a given node, it may be desirable to deploy
separate nodes for the SR PCE server functionality, using either a hardware or a virtual platform.

The SR PCE server functionality is enabled in IOS XR using a configuration as in Example 4‑8. The
configured IP address is used for the PCEP session between PCC and PCE. It must be a globally
reachable (not in a VPN) local IP address.

SR PCE receives its topology information from the different protocols (ISIS, OSPF, and BGP) via an
internal API. To enable the IGP (ISIS or OSPF) of the PCE node to feed its IGP link-state database
(LS-DB) to the SR-TE process, configure distribute link-state under the IGP. This way, the SR
PCE server learns the local IGP area’s topology information and inserts it in its SR-TE DB. BGP
automatically feeds its BGP-LS information to SR PCE, without additional configuration, as
described in section 4.3.3. The instance-id 101 in the distribute link-state command is the
domain identifier that is used to distinguish between domains in the SR-TE DB. This is explained
further in chapter 12, "SR-TE Database" and chapter 13, "SR PCE".
Example 4-8: SR PCE server configuration

pce
address ipv4 1.1.1.10
!
router isis SR !! or "router ospf SR"
distribute link-state instance-id 101

IOS XR PCE Client (PCC)

To enable a headend as PCE Client (PCC), configure the address of the SR PCE, as shown in
Example 4‑9 for a PCE with address 1.1.1.10. With this configuration, the PCC establishes a PCEP
session to the SR PCE. Multiple PCEs can be configured with an order of preference, as explained in
chapter 13, "SR PCE".

Example 4-9: PCE address on PCC

segment-routing
traffic-eng
pcc
pce address ipv4 1.1.1.10

4.3.1.1 SR PCE Redundancy


SR PCE uses the existing PCEP high availability functionality to provide redundancy. The three main
components of this functionality are topology learning, SR Policy reporting, and re-delegation
behavior.

A headend has a PCEP connection to a pair (or even a larger group) of PCEs. For redundancy, it is
important that these PCEs have a common knowledge of the topology and the SR Policies in the
network. That way, the headend can use any of these PCEs in case its primary PCE fails.

All connected PCEs have the same knowledge of the topology since they all receive the same
topology feed from the IGP and/or BGP-LS.

When an SR Policy is instantiated, updated, or deleted, the headend sends an SR Policy state report to
all its connected PCEs. This keeps the SR Policy database of all these PCEs in sync, and thus one
PCE can act as a standby of another PCE.

A headend delegates control of an SR Policy path to a single SR PCE, the primary SR PCE that also
computed the path.
Failure of this primary PCE does not impact the SR Policies that are delegated to it, nor the traffic that is steered into them. Upon failure of this PCE, the headend maintains the SR Policies and re-
delegates them to another connected PCE. This new delegate PCE takes over control of the path,
verifies it and updates it if required. Given that the information available to all PCEs in this set is the
same and the PCEs use the same path computation algorithms, the path will not be updated if the
topology did not change in the meantime.

More details are available in chapter 13, "SR PCE".

4.3.2 SR PCE Computes Disjoint Paths


In this first centralized computation use-case, a centralized SR PCE computes disjoint paths from two
different headends. The SR PCE not only computes the paths but controls them as well. These are key
requirements to provide optimal disjoint paths from different headends.

The network in Figure 4‑4 is a single-area network, interconnecting two customer sites, Site 1 and
Site 2. The operator of this network wants to provide disjoint paths between the customer sites. The
disjoint paths start from two different headend nodes: one path from Node1 to Node4 and another
path from Node5 to Node8.
Figure 4-4: Network topology for disjoint paths

An SR PCE is added to the network. For illustration purposes the SR PCE is drawn as a single
separate entity. Chapter 13, "SR PCE" discusses the different SR PCE connectivity options.

Headends Node1 and Node5 cannot simply compute the usual IGP shortest paths to Node4 and Node8
respectively, since these paths are not disjoint, as illustrated in Figure 4‑5. The two IGP shortest paths
share Node6, Node7, and the link between Node6 and Node7.
Figure 4-5: IGP shortest paths are not disjoint

The headends cannot independently compute disjoint paths since neither of the headends knows about
the path computed by the other headend.

Providing disjoint paths in an IP/MPLS network has always been cumbersome, especially when the
disjoint paths must be achieved between distinct pairs of nodes. The operator could use path
constraints to accomplish disjoint paths. The operator could, for example, use affinity link colors to
mark the links traversed by the path from Node5 to Node8 and then exclude these links from the path
between Node1 and Node4. However, this solution provides no guarantee to find disjoint paths.
Moreover, it would be an operational burden, as the link colors would have to be updated whenever
the topology changes. A dynamic solution to provide a disjoint path service is highly desired. The SR
PCE provides this solution.

4.3.2.1 Disjoint Group


Disjoint paths are expressed by assigning them to an entity called a disjoint association group, or
disjoint group for short.

A group-id identifies a disjoint group. Paths with the same disjoint group-id (i.e., member of the same
disjoint group) are disjoint from each other. Disjoint groups can be applied to paths originating at the
same headend or different headends.

The operator indicates which paths must be disjoint from each other by assigning both paths the same
disjoint group-id. The PCE understands and enforces this constraint.

A disjoint group also specifies the diversity parameters, such as the desired type of disjoint paths:
link, node, SRLG, or node+SRLG. The type of disjoint paths indicates which resources are not shared
between the two disjoint paths, e.g., link-disjoint paths do not share any link, node-disjoint paths do
not share any node, etc.

4.3.2.2 Path Request, Reply, and Report


The operator first configures Node5 as a PCC by specifying the SR PCE’s address 1.1.1.10. Then the
operator configures an SR Policy to Node8 on Node5, as shown in Example 4‑10.

The SR Policy, named POLICY1, has a single candidate dynamic path with preference 100. The
keyword pcep under dynamic indicates that Node5 uses an SR PCE to compute the path. The
optimization objective is to minimize the IGP metric. As a constraint, the path is a member of the
disjoint group with identifier 1 and the paths in this group must be node disjoint (type disjoint
node).

Example 4-10: SR Policy disjoint path configuration on Node5

segment-routing
traffic-eng
policy POLICY1
color 20 end-point 1.1.1.8
candidate-paths
preference 100
dynamic
pcep
metric
type igp
constraints
association-group
type disjoint node identifier 1
!
pcc
pce address ipv4 1.1.1.10
Figure 4-6: First path of disjoint group

After configuring the SR Policy on Node5, Node5 sends a Path Computation Request (PCReq) to the
SR PCE 1.1.1.10 (see ➊ in Figure 4‑6). In that PCReq, Node5 provides all the necessary information
to enable the SR PCE to compute the path: the headend and endpoint, the optimization objective and
the constraints. In this example, Node5 requests the SR PCE to compute a path from Node5 to Node8,
optimizing the IGP metric, with the constraint that it must be node-disjoint from other paths with disjoint group-id 1.

Since the SR PCE does not know about any other existing paths with disjoint group-id 1, it computes
the unconstrained IGP metric optimized path to Node8. This is indicated as ➋ in Figure 4‑6. The SR
PCE then sends the computation result to Node5 in a Path Computation Reply (PCRep) message
(marked ➌). If the path computation was successful, then the PCRep contains the solution SID list. In
the example, the resulting SID list is <16008>, only containing the Prefix-SID of Node8.

Node5 instantiates the path for the SR Policy to Node8 (see ➍ in Figure 4‑6). Node5 reports the
information of this path to PCE, using a Path Computation Report (PCRpt) message (marked ➎). In
this PCRpt, Node5 provides all details about the state of the path to the SR PCE.

Figure 4-7: Second path of disjoint group

Sometime later, the operator configures the SR Policy to Node4 on headend Node1, as shown in
Example 4‑11. This configuration is almost an identical copy of the configuration on Node5 in
Example 4‑10, but we chose to use a different name although this is not required.
Example 4-11: SR Policy disjoint path configuration on Node1

segment-routing
traffic-eng
policy POLICY2
color 20 end-point 1.1.1.4
candidate-paths
preference 100
dynamic
pcep
metric
type igp
constraints
association-group
type disjoint node identifier 1
!
pcc
pce address ipv4 1.1.1.10

After configuring the SR Policy on Node1, Node1 sends a PCReq to the SR PCE (see ➊ in
Figure 4‑7). Node1 indicates in this PCReq that it needs a path to Node4, with optimized IGP metric
and node-disjoint from other paths with a disjoint group-id 1. The SR PCE finds another path with
disjoint group-id 1 in its SR-TE DB: the path from Node5 to Node8 that was instantiated before. The
SR PCE computes the path1 (marked ➋ in Figure 4‑7), disjoint from the existing one, and finds that
the path from Node1 to Node4 is 1→2→3→4, with SID list <16002, 24023, 16004>. It sends a
PCRep to Node1 (➌), which instantiates the path (➍) and sends a PCRpt to the SR PCE with the
path’s state (➎).

4.3.2.3 Path Delegation


The sequence described in the previous section works well when first configuring the SR Policy on
Node5 and then the SR Policy on Node1. To describe another important SR PCE mechanism, path
delegation, assume that the SR Policy on Node1 is configured first. Figure 4‑8 illustrates this.

After configuring the SR Policy on Node1, Node1 sends a Path Computation Request (PCReq) to the
SR PCE, see ➊ in Figure 4‑8.
Figure 4-8: SR PCE computes disjoint paths – step 1

Since the SR PCE does not know about any other existing paths with disjoint group-id 1, it computes
the unconstrained IGP metric optimized path to Node4. This is indicated with ➋ in Figure 4‑8. The
resulting SID list is <16004>, only containing the Prefix-SID of Node4. The SR PCE sends the result
to Node1 in a Path Computation Reply (PCRep) message, marked ➌. Node1 instantiates the path for
the SR Policy to Node4, see ➍ in Figure 4‑8. Node1 reports the information of this path to PCE,
using a Path Computation Report (PCRpt) message, marked ➎.

If now Node5 requests a path to Node8 that is node-disjoint from the path between Node1 and
Node4, no such path exists since all paths would have to traverse Node6 and Node7, violating the
diversity requirement. The only solution is to change the path between Node1 and Node4 to
1→2→3→4.
To make this possible, Node1 has delegated control of the SR PCE computed path to the SR PCE.
When a PCC delegates path control to the SR PCE, the SR PCE can autonomously indicate to the PCC
that it must update the path. This way the SR PCE can maintain the intent of the path, for example
following a topology change or after adding or changing a path of a disjoint path pair.

To delegate a path to the SR PCE, the PCC sets the delegate flag in the PCRpt message for that path.
An IOS XR PCC always automatically delegates control to the SR PCE when it has computed the
path.

The next step in the example is the configuration of the SR Policy to Node8 on headend Node5. After
configuring this SR Policy on Node5, Node5 sends a PCReq to the SR PCE (see ➊ in Figure 4‑9).
Node5 indicates in this PCReq that it needs a path to Node8, with optimized IGP metric and node-
disjoint from other paths with a disjoint group-id 1. SR PCE finds another path with disjoint group-id
1 in its SR-TE DB: the path from Node1 to Node4 that was instantiated before. Instead of simply
computing the path, disjoint from the existing path, the SR PCE concurrently computes both disjoint
paths, as this yields the optimal disjoint paths. This is marked ➋ in Figure 4‑9. As a result of this
computation, SR PCE finds that the path from Node1 to Node4 must be updated to a new path
1→2→3→4, otherwise no disjoint path is possible from Node5 to Node8.
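
The benefit of computing both members of a disjoint group concurrently can be sketched as follows. The Python fragment is deliberately simplified (a brute-force search over hypothetical pre-computed candidates with made-up costs), but it shows why fixing the first path in isolation can make the second request unsolvable, while a joint optimization finds the pair.

from itertools import product

def node_disjoint(path_a, path_b):
    return not (set(path_a[1:-1]) & set(path_b[1:-1]))

def best_disjoint_pair(paths_a, paths_b):
    # Concurrent sketch: consider both candidate sets together and return
    # the node-disjoint pair with the lowest combined cost
    pairs = [(cost_a + cost_b, pa, pb)
             for (cost_a, pa), (cost_b, pb) in product(paths_a, paths_b)
             if node_disjoint(pa, pb)]
    return min(pairs, default=None)

# Hypothetical candidates (cost, node sequence) for the two headends
paths_1_to_4 = [(30, [1, 6, 7, 4]), (40, [1, 2, 3, 4])]
paths_5_to_8 = [(30, [5, 6, 7, 8])]
print(best_disjoint_pair(paths_1_to_4, paths_5_to_8))
# (70, [1, 2, 3, 4], [5, 6, 7, 8]); keeping the initial 1->6->7->4 path
# would have left no node-disjoint option for the second headend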

Since Node1 has delegated control of the path to the SR PCE, SR PCE can autonomously update this
path when required. The SR PCE updates the path by sending a Path Computation Update (PCUpd) to
Node1 (marked ➌ in Figure 4‑9) with the new SID list <16002, 24023, 16004>. 16002 is the Prefix-
SID of Node2, 24023 is the Adj-SID of the link from Node2 to Node3, and 16004 is the Prefix-SID
of Node4.
Figure 4-9: SR PCE computes disjoint paths – step 2

Node1 updates the path (indicated by ➍ in Figure 4‑9) and Node1 reports the new path to SR PCE
using a PCRpt (marked ➎).

The SR PCE had also computed the path from Node5 to Node8 (5→6→7→8), with the solution SID
list <16008>, where 16008 is the Prefix-SID of Node8. SR PCE replies to Node5 with this solution
SID list <16008> in the PCRep message. This is ➏ in Figure 4‑10. Node5 instantiates the path
(indicated with ➐) and sends a PCRpt with the path information to SR PCE (marked ➑). With this
PCRpt, Node5 also delegates control of this path to the SR PCE such that the SR PCE can
autonomously instruct Node5 to update the path if required.

Figure 4-10: SR PCE computes disjoint paths – step 3

Whenever the topology changes, SR PCE autonomously re-computes the paths and updates them if
required to maintain disjoint paths.

The two headends of the disjoint paths need to use the same SR PCE to compute the paths to ensure
mutual diversity. However, any such pair of headends can use any SR PCE for the computation. To
expand the example that we described above, another SR PCE, or pair of SR PCEs, could be
introduced in the network. This SR PCE can for example compute disjoint paths from Node4 to
Node1 and from Node8 to Node5. The SR PCEs that compute paths for different disjoint groups do
not need to synchronize between each other. These computations are completely independent.
PCEs are typically deployed in pairs for redundancy reasons and intra-pair synchronization (state-
sync PCEP sessions) may be leveraged for high-availability. See chapter 13, "SR PCE" for more
details.

4.3.3 SR PCE Computes End-To-End Inter-Domain Paths


In the previous sections of this chapter, the dynamic path computations, either performed by the
headend node or by a PCE, were restricted to a single IGP area. In this section, we go one step further
by adding multiple domains to the mix.

Figure 4‑11 shows a topology consisting of three domains. All three domains are SR-enabled and, per
the SR design recommendation, all nodes use the same SRGB. Domain1 and Domain2 are different
Autonomous Systems (ASs), interconnected by two eBGP peering links, between Node14 and
Node24, and between Node15 and Node25. Domain2 and Domain3 are interconnected by two border
nodes: Node22 and Node23. If Domain2 and Domain3 are different IGP domains, these border nodes
run two IGP instances, one for each connected domain. If Domain2 and Domain3 are two areas in a
single IGP domain then Node22 and Node23 are Area Border Routers (ABRs).

Significant Inter-Domain Improvements


“In the ~2005/2009 timeframe, I worked with Martin Horneffer and Jim Uttaro to define the “Seamless MPLS design”. In
essence, we used RFC3107 to provide a best-effort MPLS LSP across multiple domains.

SR-TE drastically improves this solution as it allows to remove RFC3107 (less protocol) and it supports SLA-enabled inter-
domain policies (which the Seamless MPLS design could not support).

SR Policies can indeed be used to instantiate a best-effort path across domains. An operator could leverage this to remove
RFC3107 and hence simplify its operation.

Furthermore, SR policies natively provide SLA policies (low-delay, resource avoidance, disjoint paths) over the inter-domain
path while Seamless MPLS based on RFC3107 only delivers best-effort. ”

— Clarence Filsfils
Figure 4-11: Multi-domain network topology

An SR Policy can provide a seamless end-to-end inter-domain path. To illustrate this, we start by
looking at an SR Policy with an explicit inter-domain path, instead of directly jumping into dynamic
path computation for multi-domain networks.

To provide an end-to-end inter-domain path from Node11 to Node31 in the network of Figure 4‑11,
the operator configures an SR Policy with an explicit SID list on Node11. The SID list is: <16014,
51424, 16022, 16031>, where 160XX is the Prefix-SID of NodeXX. 51424 is the SID that steers the
traffic over the BGP peering link between Node14 and Node24: the Peering-SID. This Peering-SID is
described in chapter 14, "SR BGP Egress Peer Engineering". For now, you can view the Peering-SID
as the BGP equivalent of the IGP Adj-SID: packets received with a Peering-SID as top label are
steered towards the associated BGP neighbor.

Figure 4‑12 shows the path of the SR Policy. This illustration also shows a packet with its label stack
as it travels from Node11 to Node31 using the above SID list. Prefix-SID 16014 brings the packet
from Node11 to Node14 via the ECMP-aware IGP shortest path. The Peering-SID 51424 brings the
packet from Node14 to Node24, traversing the peering link. Prefix-SID 16022 then brings the packet
from Node24 to Node22 via the IGP shortest path, and finally, Prefix-SID 16031 brings the packet to
Node31 via the ECMP-aware IGP shortest path.
Figure 4-12: End-to-end inter-domain explicit path
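
As a purely illustrative model of the label stack processing (the instructions below simply restate the figure, they are not actual forwarding code):

# SID list of Figure 4-12, top of stack first; once a SID's instruction
# completes, the label is popped and the next SID becomes active
sid_list = [
    (16014, "Prefix-SID: ECMP-aware IGP shortest path to Node14"),
    (51424, "Peering-SID: cross the eBGP peering link Node14 -> Node24"),
    (16022, "Prefix-SID: IGP shortest path to border node Node22"),
    (16031, "Prefix-SID: ECMP-aware IGP shortest path to Node31"),
]
for label, instruction in sid_list:
    print(f"active label {label}: {instruction}")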

Thus, an SR Policy can provide a seamless end-to-end inter-domain path. But the operator wants a
dynamic solution that adjusts the end-to-end SLA paths to the changing network. Instantiating an SR
Policy on Node11 with a dynamic path that is locally computed by the headend node, does not work.
In a multi-domain network, a headend node can only see the topology of its local area; it cannot see
the network topology beyond the border nodes. The IGP floods the topology information in the local
area only. Therefore, the headend node does not have enough information in its SR-TE DB to compute
inter-domain paths.

To solve this problem, the headend node can use an SR PCE to compute the path. This SR PCE will
have to know the topologies of all domains.

4.3.3.1 SR PCE’s Multi-Domain Capability


Information about inter-AS BGP peering links is provided via the SR BGP EPE functionality.
Knowledge about all domain topologies and about the BGP inter-AS peering links is distributed by
BGP-LS.

BGP-LS

In previous sections of this chapter, we have seen that an SR PCE can get the topology information of
its connected IGP area. To get the topology information of remote domains, the SR PCE server can
use BGP and the BGP link-state address-family, commonly known as “BGP-LS”. BGP-LS transports
an IGP LS-DB using BGP signaling. Basically, BGP-LS can pack the content of the LS-DB and
transport it in BGP to a remote location, typically a PCE. BGP-LS benefits from all BGP route
propagation functionality, such as the use of Route Reflectors, as we will see further on.

Since its introduction, BGP-LS has outgrown this initial functionality and thanks to various extensions
it has become the mechanism of choice to feed any information to a controller. BGP-LS is described
in more detail in chapter 17, "BGP-LS".

The SR PCE server learns the topology information of the different domains in the network via BGP-
LS. In the example topology in Figure 4‑13, one node in each domain feeds its local LS-DB via a
BGP-LS session to a single local Route Reflector (RR). This minimal setup is for illustration
purposes only; in practice, redundancy would be provided.

Example 4‑12 shows the BGP-LS and IGP configuration of Node21. The BGP session to neighbor
1.1.1.29, which is the Domain2 RR, has the Link-state address-family (BGP-LS) enabled. Note that
both AFI and SAFI are named “link-state”, hence the double link-state keyword in the address-family configuration.

With the configuration distribute link-state under router isis, Node21 distributes its ISIS
LS-DB to BGP. The instance-id 102 in this command is a “routing universe” identifier that makes
it possible to differentiate multiple, possibly overlapping, topologies. This identifier is carried in all
BGP-LS advertisements and for each advertisement it identifies the IGP instance that fed the
information to BGP-LS. In this example, the topology information of the ISIS instance “SR” on
Node21 is identified by instance-id 102. All information that ISIS instance “SR” feeds to BGP-LS
carries the identifier 102. More information is available in chapter 17, "BGP-LS".

Remember from sections 4.2.1 and 4.3.1 in this chapter that this same command also enables
distribution of the IGP LS-DB to the SR-TE process on headend node and SR PCE.
Example 4-12: BGP-LS configuration on Node21

router isis SR !! or "router ospf SR"


distribute link-state instance-id 102
!
router bgp 2
bgp router-id 1.1.1.21
address-family link-state link-state
!
neighbor 1.1.1.29 !! Domain2 RR
remote-as 2
update-source Loopback0
address-family link-state link-state

In practice, multiple nodes will have a BGP-LS session to multiple local RRs for redundancy. The
link-state address-family (BGP-LS) can be carried in Internal BGP (iBGP) and External BGP (eBGP)
sessions, and it is subject to the standard BGP propagation rules.

In the example, a full-mesh of BGP-LS sessions interconnects the domain RRs, but other design
options are possible, such as hierarchical RRs. Eventually, each RR in the network has the LS-DB
information of all IGP domains in its BGP-LS database.

Figure 4-13: Multi-domain distribution of LS-DB information in BGP-LS

BGP Peering-SIDs
At this point, we have not yet discussed how BGP-LS gets the information about the BGP peering
links between Domain1 and Domain2. These links are not in the IGP LS-DB since there is no IGP
adjacency formed across them. Earlier in this section, when discussing the inter-domain explicit path,
we have introduced the BGP Peering-SID. This Peering-SID fulfills a function for BGP peers that is
similar to the function an Adj-SID has for IGP adjacencies. When applied to the inter-domain use-
case, BGP Peering-SIDs provide the functionality to cross the border between two ASs and encode a
path that spans different domains.

BGP Peering-SIDs are SIDs that are allocated for a BGP peer or BGP peering links when enabling
the Egress Peer Engineering (EPE) functionality. Chapter 14, "SR BGP Egress Peer Engineering" is
dedicated to EPE.

The EPE Peering SIDs are advertised in BGP-LS by each EPE-enabled peering node. For this
purpose, the EPE-enabled border nodes Node14, Node15, Node24, and Node25 have a BGP-LS
session to their local RR, as shown in Figure 4‑14. The BGP-LS sessions between the RRs also
distribute the EPE information.

Figure 4-14: Multi-domain distribution of EPE information in BGP-LS


Note that the information distributed by BGP-LS is not a periodic snapshot of the topology; it is a
real-time reactive feed. It is updated as new information becomes available.

SR PCE Learns Multi-Domain Information

Now that we have learned how all information that is required to compute inter-domain paths is
available in BGP-LS, we can feed this information into the SR PCE’s SR-TE DB. Therefore, the SR
PCE server connects via BGP-LS to its local RR to receive consolidated information about the whole
network.

The SR PCE server deployment model is similar to the BGP RR deployment model. Multiple SR
PCE servers in the network can perform the inter-domain path computation functionality, provided
they tap into the BGP-LS information feed. As multiple SR PCE servers can be deployed in the
network, the operator can specify which headend uses which SR PCE server, based on geographical
proximity or service type, for example. This provides for horizontal scalability, adding more SR
PCEs when scale requires it and dividing the work between them.

SR PCE scales as BGP RR


“A key scaling benefit of the SR solution is that its SR PCEs scale like BGP Route Reflectors.

A pair of SR PCEs can be deployed in each PoP or region and can be dedicated to serve the headends of that PoP/region.
As their load increases, further pairs can be added.

There is no need to synch all the SR PCEs of an SR domain.

There is only a need to synch within a pair. This can be done either in an indirect manner (via BGP-LS/PCEP
communications with the headends which reflect the same information to the two SR PCEs of their pair) or in a direct
manner (PCEP synch between the two SR PCEs) ”

— Clarence Filsfils

In the example in Figure 4‑15, two SR PCE servers have been added to the network, one in Domain1
and another in Domain3. Both connect to their local RR to tap into the real-time reactive BGP-LS
feed to get up-to-date topology information. Headend Node11 uses the SR PCE in Domain1 to
compute inter-domain paths, while Node31 can use the SR PCE in Domain3 to compute these paths.
Note that the headend nodes only need to use an SR PCE to compute paths if they do not have the
necessary information available in their own SR-TE DB. For example, Node11 can compute by
itself a delay-optimized path to Node14, which is located in its own domain. Node11 does not need
to involve the SR PCE, as all required information is in its own SR-TE DB.

Figure 4-15: PCE servers using BGP-LS to receive multi-domain topology information

The SR PCE servers in the example are IOS XR routers with the SR PCE server functionality
enabled. The SR PCE and BGP-LS configuration in Domain1’s SR PCE node in this example is
displayed in Example 4‑13. The SR PCE service functionality is enabled by configuring the local IP
address used for the PCEP sessions, 1.1.1.10 in this example. The SR PCE gets its SR-TE DB
information from BGP-LS, therefore it has a BGP-LS session to the Domain1 RR 1.1.1.19. The SR
PCE can combine the information received via BGP-LS with information of its local IGP instance.
Example 4-13: PCE server and BGP-LS configuration on Domain1’s SR PCE node

pce
address ipv4 1.1.1.10
!
router bgp 1
bgp router-id 1.1.1.10
address-family link-state link-state
!
neighbor 1.1.1.19 !! Domain1 RR
remote-as 1
update-source Loopback0
address-family link-state link-state

4.3.3.2 SR PCE Computes Inter-Domain Path


To provide an end-to-end delay-optimized path from Node11 to Node31, the operator configures an
SR Policy on Node11. The operator specifies that the SR PCE in Domain1 computes this SR Policy
dynamic path.

Example 4‑14 shows the SR Policy configuration on Node11. The SR Policy to endpoint 1.1.1.31 –
Node31’s loopback address – has a dynamic path computed by the SR PCE server. The SR PCE must
optimize the delay of the path (metric type delay). Node11 connects to the SR PCE server with
address 1.1.1.10 to compute the path, as configured under pcc section on Node11.

Example 4-14: SR Policy with delay-optimized inter-domain path on Node11

segment-routing
traffic-eng
policy POLICY1
color 20 end-point 1.1.1.31
candidate-paths
preference 100
dynamic
pcep
metric
type delay
!
pcc
pce address ipv4 1.1.1.10

All nodes in the network have enabled link delay measurement and distribute the link delay metrics in
IGP. How this works and how to use it is explained in chapter 15, "Performance Monitoring – Link
Delay". Together with the other topology information, the link delay metrics are distributed in BGP-
LS. For simplicity, the default link-delay metric in the illustration is 10.

The PCEP protocol exchange that occurs after configuring the SR Policy, as illustrated in Figure 4‑16,
is equivalent to the sequence that we have seen before in the single-domain disjoint paths example
(section 4.3.2).

Figure 4-16: Inter-domain path PCEP sequence

After configuring the SR Policy on Node11, Node11 sends a PCEP Request to its SR PCE, requesting
a delay-optimized path to endpoint Node31 (marked ➊ in Figure 4‑16). The SR PCE uses the multi-
domain information in its SR-TE DB to compute the end-to-end delay-optimized path (indicated with
➋).

The SR PCE returns the solution SID list <16015, 51525, 24523, 16031> to Node11 (➌ in the
illustration).

Node11 instantiates this path (marked ➍). This path follows the IGP shortest path to Node15 using
Prefix-SID 16015 of Node15. It then traverses the BGP peering link using the Peering-SID 51525. To
traverse the low-delay, high IGP metric link between Node25 and Node23, Node25’s Adj-SID for
this link is used: label 24523. Finally, the path follows the IGP shortest path to Node31, using
Node31’s Prefix-SID 16031.

Node11 reports the status of the path to the SR PCE in a PCEP Report message (marked ➎). In this
Report, Node11 also delegates the control of this path to the SR PCE. The SR PCE can autonomously
request Node11 to update the path if required.
4.3.3.3 SR PCE Updates Inter-Domain Path
To illustrate the reactivity of the SR PCE computed path, assume a failure occurred in the network.
The link between Node25 and Node23 went down (indicated with ➊ in Figure 4‑17). As a first
reaction to the failure, Node25 activates the TI-LFA backup path for the Adj-SID 24523 of the failed
link (➋). This quickly restores connectivity of the traffic carried in the SR Policy.

Figure 4-17: Stateful PCE updates inter-domain path after failure

The IGP floods the topology change and Node21 advertises it in BGP-LS (marked ➌). BGP
propagates the topology update to SR PCE (➍ in the illustration). The SR PCE re-computes the path
(➎), and sends a PCEP Update message to headend Node11 (marked ➏). Node11 updates the path
(➐) and sends a PCEP Report message to SR PCE with the status of the path (indicated with ➑).

The sequence of events in this example is event-driven: each event triggers the next. However, it is not instantaneous and is subject to delays, such as the time BGP needs to propagate the topology change to the SR PCE.
4.4 Summary
This chapter explains how traffic engineering paths are dynamically computed.

Computing a TE path is solving an optimization problem that has an optimization objective and
constraints.

The required information to compute SR-TE paths is stored in the SR-TE DB.

SR-optimized algorithms have been developed to compute paths and encode these paths in SID lists
while minimizing the number of SIDs and maximizing ECMP.

In many cases, the headend node itself can compute SLA paths in a single IGP area. Low-delay and
resource exclusions are typical examples. All information required for such computations is available
in the headend’s SR-TE DB.

When the headend does not have the necessary information in its SR-TE DB, the headend node uses
an SR PCE to compute paths. Computing disjoint paths from different headend nodes is a frequent
example. The two headends delegate the computation of their SR Policy to the same SR PCE to
ensure mutual path diversity.

The SR PCE is responsible for keeping a delegated SR path optimal. It recomputes delegated paths in
case of topology changes and instructs the headends to update the paths if required.

Another case requiring an SR PCE to compute paths is the multi-domain case. A headend node cannot
compute optimal end-to-end paths in a multi-domain network since the IGP only provides topology
information about the local area, not about the remote areas and domains. An SR PCE learns the
topology information about all domains in a network using BGP-LS and stores this information in its
SR-TE DB. The SR-TE DB is natively multi-domain capable. BGP-LS not only carries IGP topology
information in BGP, but also other information, such as EPE information. BGP-LS provides a real-
time reactive feed for this information that the PCE can tap into. The SR PCE uses the multi-domain
SR-TE DB to compute optimal end-to-end inter-domain paths for the headend nodes.

The SR PCE functionality is not concentrated in a single centralized server. SR PCE servers can be
distributed throughout the network (a typical analogy is BGP route reflector distribution).
The SR PCE servers do not need to synchronize with each other. They only need to get the necessary
information (e.g., topology) via their BGP-LS feed, and from PCEP reports of their connected PCCs.
4.5 References
[SIGCOMM2015] “A Declarative and Expressive Approach to Control Forwarding Paths in
Carrier-Grade Networks.”, Renaud Hartert, Stefano Vissicchio, Pierre Schaus, Olivier
Bonaventure, Clarence Filsfils, Thomas Telkamp, Pierre François, SIGCOMM 2015, October
2015, <http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p15.pdf>

[RFC2702] "Requirements for Traffic Engineering Over MPLS", Michael D. O'Dell, Joseph
Malcolm, Jim McManus, Daniel O. Awduche, Johnson Agogbua, RFC2702, September 1999

[RFC3107] "Carrying Label Information in BGP-4", Eric C. Rosen, Yakov Rekhter, RFC3107,
May 2001

[RFC3630] "Traffic Engineering (TE) Extensions to OSPF Version 2", Derek M. Yeung, Dave
Katz, Kireeti Kompella, RFC3630, October 2003

[RFC5305] "IS-IS Extensions for Traffic Engineering", Tony Li, Henk Smit, RFC5305, October
2008

[RFC5440] "Path Computation Element (PCE) Communication Protocol (PCEP)", JP Vasseur, Jean-Louis Le Roux, RFC5440, March 2009

[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752, March 2016

[RFC8231] "Path Computation Element Communication Protocol (PCEP) Extensions for Stateful
PCE", Edward Crabbe, Ina Minei, Jan Medved, Robert Varga, RFC8231, September 2017

[draft-ietf-idr-bgp-ls-segment-routing-ext] "BGP Link-State extensions for Segment Routing", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Hannes Gredler, Mach Chen, draft-ietf-idr-bgp-ls-segment-routing-ext-12 (Work in Progress), March 2019

1. Instead of simply computing the path, disjoint from the existing one, the SR PCE concurrently
computes both disjoint paths, as this has the highest probability to find a solution and it yields the
optimal solution. See the next section for details.↩
5 Automated Steering
What we will learn in this chapter:

A Provider Edge (PE) advertising a service route (BGP, Pseudowire (PW), or Locator/ID
Separation Protocol (LISP)) may tag it with a color, which indicates a specific intent (e.g., color
30 = “low-delay”, color 20 = “only via network plane blue”)

In BGP, the color is supported by the well-known color extended community attribute. The same
extension has been defined for LISP.

Per-Destination Automated Steering (AS) automatically steers a service route onto a valid SR
Policy based on its next-hop and color.

If no such valid SR Policy exists, the service route is instead installed “classically” in the forwarding plane, recursing on the route to the next-hop, i.e., via the Prefix-SID of the next-hop.

AS is enabled by default but can be disabled on a per-protocol (e.g., BGP) and per AFI/SAFI
basis.

A per-flow variant of AS will be detailed in a future revision of this book. Per-flow steering makes it possible to steer traffic flows by considering various packet header fields, such as source address or DSCP.

In the previous chapters, we have explored the SR Policy, its paths, and SID lists. We have seen how
such paths can be computed and instantiated on a headend node. But, how can we make use of these
SR Policies? How can we steer traffic into them?

In this chapter we first describe how an operator can indicate the service requirements of a route by
tagging it with a color, a BGP color community for BGP routes. This color is an identifier of an intent.
The ingress node can then use the route’s color and nexthop to automatically steer it into the matching
SR Policy. This is called Automated Steering (AS). Next, we explain how the granular per-
destination AS applies to various types of BGP service routes. Finally, we show how AS can be
disabled if necessary.
5.1 Introduction
The Automated Steering (AS) functionality automatically steers service routes into the SR Policy that
provides the desired intent or SLA1.

AS is a key component of the SR-TE solution as it automates the steering of the service traffic on the
SR Policies that deliver the required SLA.

AS can operate on a per-flow basis2. In this case, multiple flows to the same service destination (e.g.,
BGP route 1.0.0.0/8) may be automatically steered onto different SR Policies. While any per-flow
classification technique could be used, a commonly used one is DSCP/EXP classification. For
example, traffic to 1.0.0.0/8 with DSCP 1 would go in SR Policy 1 while traffic to 1.0.0.0/8 with
DSCP 2 would go in SR Policy 2.

In this revision of the book, we will focus on the per-destination AS solution. For example, two BGP
destinations 1.0.0.0/8 and 2.0.0.0/8 with the same next-hop 9.9.9.9 can be automatically steered on
two different SR Policies to 9.9.9.9.

While the concept applies to any type of service route, we will use BGP as an illustration.
Omnes viae Romam ducunt (all roads lead to Rome)
“The ODN/AS design came up in a taxi in Rome ☺.

We were on our way to an excellent trattoria in Trastevere. We were caught in the traffic on the right bank of the river
Tevere somewhere between Castel San Angelo and piazza Trilussa. My dear friends Alexander Preusche and Alberto
Donzelli were giving me a hard time on the need for an “SDN” solution that would be “easy to operate”.

I knew that SR was perfectly adapted to a centralized solution (e.g., see Paul Mattes’ talk [SR-for-DCI]).

But I resisted the push towards this sole solution as I thought that it required a level of operational investment that many
SPs would want to avoid.

I wanted a solution that would:

be much more automated.

not rely on a single mighty server.

be integrated with the BGP/VPN route distribution as this is the heart of the SP/enterprise services.

The traffic jam was a blessing as was the friendly pressure from Alex and Alberto. Finally, the idea came and looked very
simple to me: when it receives the VPN routes from an attached CPE, the egress PE marks the received routes with a
color that encodes the SLA required by the related customer; the route-reflector reflects this color attribute transparently;
the ingress PE automatically requests the local SR-TE process to instantiate an SR Policy using a template associated with
the color (this would become ODN); when instantiated, the SR-TE process calls back BGP and provides the BSID of the
instantiated policy; BGP automatically installs the colored service route on the BSID (this would become AS).

The idea naturally applied to multi-domain as the ODN template could request SR PCE delegation for SLA objectives that
exceed the information available at the ingress PE.

After a very good meal with Alberto and Alex, I called Siva and explained the idea. As usual, with his boundless energy, he
said, “let’s do it” and a few weeks later we had the first proof of concept.

Bertrand immediately understood the huge business appeal for the solution and was a key contributor to get this committed
in the product roadmap.

Thanks to Siva, Bertrand, Alberto and Alex … and Rome ☺ ”.

— Clarence Filsfils

Throughout this chapter we will use the network in Figure 5‑1 to illustrate the different concepts.

This network consists of a single IGP area. The default IGP link metric is 10, the links between
Node3 and Node4 and between Node5 and Node8 have a higher IGP link metric 100, as indicated in
the drawing. An affinity color red is assigned to the link between Node6 and Node7.

Figure 5-1: Automated Steering reference topology

The operator of the network in Figure 5‑1 has enabled delay measurement on all links in the network
(see chapter 15, "Performance Monitoring – Link Delay"). For simplicity, we assume that the
measured link-delays of all links are the same: 10 milliseconds.

The operator of the network provides various services that will be described in the remainder of this
chapter. Figure 5‑1 illustrates a VPN service for a customer NewCo, using CEs Node13 and Node43,
attached to PEs Node1 and Node4 respectively. The illustration also shows an Internet service, represented by a cloud attached to PE Node4.

CE Node43, which is configured in VRF NewCo on Node4, advertises prefixes 3.3.3.0/24 and
6.6.6.0/24 via BGP to Node4. PE Node4 also learns a global prefix 8.8.8.0/24. Node4 propagates these prefixes via BGP to Node1, setting itself (router-id 1.1.1.4) as the BGP nexthop.

PE Node1 receives these routes from Node4 and propagates the routes in VRF NewCo to CE
Node13. Node1 switches traffic to these routes (VPN and global) via the IGP shortest path to Node4
(1→7→6→5→4).
We will build further on this topology to explain how to provide the appropriate SLA for the different
service destinations.
5.2 Coloring a BGP Route
At the heart of the AS solution is the tagging concept: when the egress Provider Edge (PE) advertises
a BGP (service) route, the egress PE colors the route.

The color is chosen to indicate the SLA required by the BGP route.

There is no fixed mapping of color to SLA semantics. Any operator is free to design the allocation.
For example, in this chapter, we will use color 30 to represent a low-delay SLA.

5.2.1 BGP Color Extended Community


The SR-TE solution leverages the BGP Color Extended Community.

BGP communities
BGP communities (RFC 1997) are additional tags that can be added to routes to provide additional information for processing that
route in a specific way. The BGP communities of a route are bundled in a BGP communities attribute attached to that route. BGP
communities are typically used to influence routing decisions. BGP communities can be selectively added, modified, or removed as
the route travels from router to router.

A BGP community is a 32-bit number, typically consisting of the 16-bit AS number of the originator followed by a 16-bit number with
an operator-defined meaning.

RFC 4360 introduced a larger format of community, the 8-octet (64-bit) extended community. It has a type field to indicate the type and
dictate the structure of the remaining bytes in the community. Examples of extended community types are the Route-Target (RT)
community for VPN routes and the Route Origin community, both specified in RFC 4360.

The Opaque Extended Community is another type of extended community defined in RFC 4360. It is really a class of extended
communities. The Color extended community, specified in RFC 5512, is a sub-type of this class.

The BGP extended communities of a route are bundled in a BGP extended communities attribute attached to that route.

RFC 5512 specifies the format of the Color Extended Community shown in Figure 5‑2.

The first two octets indicate the type of the extended community. The first octet indicates it is a
transitive opaque extended community. “Transitive” means that a BGP node should pass it to its
neighbors, even if it does not recognize the attribute. The second octet indicates the type of Opaque
Extended Community; type 11 (0x0b) is Color.
Figure 5-2: Color Extended Community format

The Color Value is a flat, 32-bit number. The value is defined by the user and is opaque to BGP.
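
As a worked example, a color value of 30 is carried as the 8-octet extended community 0x03 0x0B 0x00 0x00 0x00 0x00 0x00 0x1E: type 0x03 (transitive opaque), sub-type 0x0B (Color), two reserved octets, and the 32-bit color value 30 (0x0000001E).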

To be complete, we note that draft-ietf-idr-segment-routing-te-policy specifies that when the Color Extended Community is used to steer the traffic into an SR policy, two bits of the Reserved field are used to carry the Color-Only bits (CO bits), as shown in Figure 5‑3.

Figure 5-3: Color Extended Community CO-flags

The use of these CO bits with a setting other than the default value “00” is not very common. The use-cases are explained in chapter 10, "Further Details on Automated Steering".

5.2.2 Coloring BGP Routes at the Egress PE


Automated Steering uses the color extended community to tag service routes with a color. This color
extended community is then matched to the color of an SR Policy. The functionality and the
configuration that is used to color BGP routes – i.e., attach a color extended community to a route – is
not specific to SR-TE, but we cover this aspect for completeness.

There are multiple ways to attach (extended) communities to BGP routes, all involving a route-policy,
but applied at different so-called “attach points”. A route-policy is a construct that implements a
routing policy. This routing policy instructs the router to inspect routes, filter them, and potentially
modify their attributes.

A route-policy is written in Routing Policy Language (RPL), which resembles a simplified programming language.

In this chapter we illustrate two different route-policy attach points, an ingress BGP route-policy and
an egress BGP route-policy. Other attach points are possible but not covered in this book.

In Figure 5‑1, Node4 receives the service routes from CE Node43. By applying an ingress route-
policy on Node4 for the BGP session to Node43, the received routes and their attributes can be
manipulated, such as adding a color extended community.

Node4 advertises the global prefix 8.8.8.0/24 to Node1. By applying an egress route-policy on
Node4 for the BGP session to Node1, the transmitted routes and their attributes can be manipulated,
such as adding a color extended community.

Before looking at the route-policies on Node4, let’s take a look at the color extended communities.
The color extended communities that Node4 uses in its route-policies, are defined in the
extcommunity-set sets Blue, Green, and Purple, as shown in Example 5‑1.

Example 5-1: color extended community sets on Node4

extcommunity-set opaque Blue
20
end-set
!
extcommunity-set opaque Green
30
end-set
!
extcommunity-set opaque Purple
40
end-set

Using the configuration in Example 5‑2, Node4 applies route-policy VRF-COLOR as an ingress
route-policy under the address-family ipv4 unicast for VRF NewCo neighbor 99.4.43.43 (Node43)
(see line 39). Since it is applied under the VRF NewCo BGP session, it is only applied to routes of
that VRF.
Route-policy VRF-COLOR is defined in lines 1 to 9. This route-policy attaches color extended
community Blue to 3.3.3.0/24 and Green to 6.6.6.0/24.

Route-policy GLOBAL-COLOR, defined in lines 11 to 16, attaches color extended community Purple
to 8.8.8.0/24. This route-policy is applied as an egress route-policy under the address-family ipv4
unicast for global neighbor 1.1.1.1 (Node1) (see line 27).

Example 5-2: Applying route-policies on Node4 to color routes

1 route-policy VRF-COLOR
2 if destination in (3.3.3.0/24) then
3 set extcommunity color Blue
4 endif
5 if destination in (6.6.6.0/24) then
6 set extcommunity color Green
7 endif
8 pass
9 end-policy
10 !
11 route-policy GLOBAL-COLOR
12 if destination in (8.8.8.0/24) then
13 set extcommunity color Purple
14 endif
15 pass
16 end-policy
17 !
18 router bgp 1
19 bgp router-id 1.1.1.4
20 address-family ipv4 unicast
21 address-family vpnv4 unicast
22 !
23 neighbor 1.1.1.1
24 remote-as 1
25 update-source Loopback0
26 address-family ipv4 unicast
27 route-policy GLOBAL-COLOR out
28 !
29 address-family vpnv4 unicast
30 !
31 vrf NewCo
32 rd auto
33 address-family ipv4 unicast
34 !
35 neighbor 99.4.43.43
36 remote-as 2
37 description to CE43
38 address-family ipv4 unicast
39 route-policy VRF-COLOR in

5.2.3 Conflict With Other Color Usage


The color extended community was defined many years ago and can safely be used concurrently for many distinct purposes.
The design rule for the operator is to allocate color ranges on a per application basis.

For example, an operator may have allocated the range 10-99 to indicate SLA requirements, while using the range 1000-1999 to track the PoP-of-origin of Internet routes.

Once a specific range is allocated for SLA/SR-TE purposes, the operator only uses colors from that range for the SR Policy (color, endpoint) configurations.

Conversely, in this example, an SR Policy will never be configured with a color 1000 and hence an
Internet route coming from PoP 1000 will never risk being automatically steered on an SR Policy of
color 1000.

Multiple extended and regular communities can be concurrently attached to a given route, e.g., one to
indicate SLA requirements and another to track the PoP-of-origin.
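
As a hedged sketch of this concurrent usage, a single route-policy may attach both an SLA color and a PoP-of-origin community. The policy name and the regular community value 1:1000 are assumptions; the extcommunity-set Green is assumed to be defined as earlier in this chapter:

route-policy SLA_AND_ORIGIN
  # SLA color used for SR-TE Automated Steering
  set extcommunity color Green
  # regular community tracking the PoP-of-origin (assumed value)
  set community (1:1000)
  pass
end-policy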
5.3 Automated Steering of a VPN Prefix
Another customer Acme uses the L3VPN service of the network shown in Figure 5‑4. This customer’s
CE Node12 connects to PE Node1 in VRF Acme and similarly CE Node42 connects to PE Node4.

Customer Acme requires a low-delay path from CE12 to CE42.

To satisfy this requirement, the operator configures on Node1 an SR Policy with color “green” (value
30, associated with low-delay SLA) and endpoint Node4’s loopback prefix 1.1.1.4/32. The name of
an SR Policy is user-defined. Here we have chosen the name “GREEN”, matching the name we used
for the color of this SR Policy, but we could have named it “delay_to_node4” or any other meaningful
name. The SR Policy GREEN has a single candidate path with preference 100. This candidate path is
a dynamic path with an optimized delay metric and no constraints. The configuration of this SR Policy
is shown in Example 5‑3.

Example 5-3: SR Policy GREEN configuration on Node1

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay

The path of SR Policy GREEN is shown in Figure 5‑4.


Figure 5-4: Automated Steering of a green BGP route

CE42 advertises prefix 2.2.2.0/24 via BGP to PE Node4. This is indicated by ➊ in Figure 5‑4.
Node4 allocates a VPN label 92220 for prefix 2.2.2.0/24 in VRF Acme and advertises this prefix via
BGP to Node1 with the BGP nexthop set to its loopback address 1.1.1.4.

To indicate the low-delay requirement for traffic destined for this prefix, Node4 adds a color
extended community with color “green” to the BGP update for VRF Acme prefix 2.2.2.0/24 (➋ in
Figure 5‑4). As described earlier, color “green” refers to color value 30 which was chosen to
identify the low-delay SLA.

Node1 receives the BGP route and tries to match the route’s attached color (30) and its nexthop
(1.1.1.4) to a valid SR Policy’s color and endpoint. BGP finds the matching SR Policy GREEN, with
color 30 and endpoint 1.1.1.4, and automatically installs the prefix 2.2.2.0/24 in the forwarding table
pointing to the SR Policy GREEN (➌), instead of pointing to the BGP nexthop. All traffic destined
for VRF Acme route 2.2.2.0/24 is then forwarded via SR Policy GREEN.

By directly mapping the color of the BGP prefix to the color of the SR Policy, no complex and
operationally intensive traffic steering configurations are required. By installing the BGP prefix in the
forwarding table, pointing to the SR Policy, the traffic steering has no impact on the forwarding
performance.
Simplicity of Automated Steering
“In our first SR customer deployment in 2017, we had to use the tunnel interface mode SR-TE since SR Policy mode was
not available at that time (see the “new CLI” section in chapter 1, "Introduction"). Consequently, legacy techniques like
SPP (Service Path Preference) were used to steer traffic into corresponding tunnels. This customer has been struggling
with the granularity and complexity of traffic steering. Therefore, when we introduced auto-steering capability of SR Policy
mode, this customer accepted it at once: per-flow granularity, only need to care about “colors”. Now the customer is
planning to migrate the existing tunnel interface mode SR-TE to SR Policy mode to enjoy the benefits of auto-steering. ”

— YuanChao Su

More Details

We will now explore the Automated Steering functionality in more detail by examining the BGP
routes and forwarding entries.

Example 5‑4 shows the BGP configuration of Node4. It is a regular base VPNv4 configuration.
Node4 has two BGP sessions, one VPNv4 iBGP session to Node1 (1.1.1.1) and an eBGP session in
VRF Acme to CE42 (99.4.42.42).

An extended-community-set Green is defined for color value 30 (lines 14-17). Note that the name
Green is only a locally significant user-defined identifier of this set.

To attach color 30 to the prefixes received from CE Node42, the operator has chosen to use an
ingress route-policy COLOR_GREEN for address-family ipv4 unicast of the BGP session to Node42
in VRF Acme. To keep the example simple, route-policy COLOR_GREEN unconditionally attaches
the extended-community-set Green to all incoming routes on that BGP session.
Example 5-4: BGP configuration on Node4

1 vrf Acme
2 address-family ipv4 unicast
3 import route-target
4 1:1
5 !
6 export route-target
7 1:1
8 !
9 interface GigabitEthernet0/0/0/1.100
10 vrf Acme
11 ipv4 address 99.4.41.4 255.255.255.0
12 encapsulation dot1q 100
13 !
14 extcommunity-set opaque Green
15 # low-delay SLA
16 30
17 end-set
18 !
19 route-policy COLOR_GREEN
20 set extcommunity color Green
21 end-policy
22 !
23 route-policy PASS
24 pass
25 end-policy
26 !
27 router bgp 1
28 bgp router-id 1.1.1.4
29 address-family vpnv4 unicast
30 !
31 neighbor 1.1.1.1
32 remote-as 1
33 update-source Loopback0
34 address-family vpnv4 unicast
35 !
36 vrf Acme
37 rd auto
38 address-family ipv4 unicast
39 !
40 neighbor 99.4.42.42
41 remote-as 2
42 description to CE42
43 address-family ipv4 unicast
44 route-policy COLOR_GREEN in

BGP on Node4 receives the route 2.2.2.0/24 from CE Node42. This BGP route is displayed in
Example 5‑5. Notice that the route has an attached color extended community Color:30, as a result of
the ingress BGP route-policy COLOR_GREEN, as described above. The other Extended community
of that route is a Route-Target (RT) community with value 1:1 which is used for L3VPN purposes.
Example 5-5: Node4 received BGP route 2.2.2.0/24 from CE42

RP/0/0/CPU0:xrvr-4#show bgp vrf Acme 2.2.2.0/24


BGP routing table entry for 2.2.2.0/24, Route Distinguisher: 1.1.1.4:0
Versions:
Process bRIB/RIB SendTblVer
Speaker 10 10
Local Label: 92220
Last Modified: Jun 5 09:30:24.894 for 01:58:01
Paths: (1 available, best #1)
Advertised to PE peers (in unique update groups):
1.1.1.1
Path #1: Received by speaker 0
Advertised to PE peers (in unique update groups):
1.1.1.1
2
99.4.42.42 from 99.4.42.42 (1.1.1.42)
Origin IGP, metric 0, localpref 100, valid, external, best, group-best, import-candidate
Received Path ID 0, Local Path ID 1, version 10
Extended community: Color:30 RT:1:1
Origin-AS validity: (disabled)

According to the regular L3VPN functionality, Node4 dynamically allocated a label 92220 for this
VPN prefix and advertised it to Node1, with the color extended community attached. Node4 has set
its router-id 1.1.1.4 as BGP nexthop for VPN prefixes. The BGP route as received by Node1 is
shown in Example 5‑6. The output shows that VRF Acme route 2.2.2.0/24 has BGP nexthop 1.1.1.4
and color 30.

Example 5-6: Node1 received BGP VPN route 2.2.2.0/24 from PE Node4

RP/0/0/CPU0:xrvr-1#show bgp vrf Acme 2.2.2.0/24


BGP routing table entry for 2.2.2.0/24, Route Distinguisher: 1.1.1.1:0
Versions:
Process bRIB/RIB SendTblVer
Speaker 7 7
Last Modified: Jun 5 09:31:12.880 for 02:04:00
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
2
1.1.1.4 C:30 (bsid:40001) (metric 40) from 1.1.1.4 (1.1.1.4)
Received Label 92220
Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
Received Path ID 0, Local Path ID 1, version 7
Extended community: Color:30 RT:1:1
SR policy color 30, up, registered, bsid 40001, if-handle 0x00000410

Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0

SR Policy GREEN on Node1, with color 30 and endpoint 1.1.1.4, matches the color 30 and BGP
nexthop 1.1.1.4 of VRF Acme route 2.2.2.0/24. Using the Automated Steering functionality, BGP
installs the route in the forwarding table, pointing to SR Policy GREEN. BGP uses the BSID of the
SR Policy as a key.

Since no explicit BSID was provided by the operator as part of the policy configuration
(Example 5‑3), the headend node dynamically allocated one; this is the default behavior. To find out
the actual BSID of an SR Policy, we can look at its status. Example 5‑7 shows the status of SR Policy
GREEN. The output shows the dynamically allocated Binding SID for this SR Policy: label 40001.
The SID list of this SR Policy is <16003, 24034>.

Example 5-7: Status of SR Policy GREEN on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.4


Name: srte_c_30_ep_1.1.1.4
Status:
Admin: up Operational: up for 07:45:16 (since May 28 09:48:34.563)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: GREEN
Requested BSID: dynamic
Dynamic (valid)
Metric Type: delay, Path Accumulated Metric: 30
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Attributes:
Binding SID: 40001
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

Without Automated Steering, BGP would install the VRF Acme route 2.2.2.0/24 recursing on its BGP
nexthop 1.1.1.4. Recursion basically means that the route refers to another route for its forwarding
information. When recursing on the BGP nexthop, all traffic destined for 2.2.2.0/24 would follow the
Prefix-SID of 1.1.1.4 to Node4.

With Automated Steering, BGP does not recurse the route on its BGP nexthop, but on the BSID of the
matching SR Policy. In this case, BGP installs the route 2.2.2.0/24 recursing on the BSID 40001 of
SR Policy GREEN, as shown in Example 5‑8. The RIB entry (the first output in Example 5‑8) shows
the BSID (Binding Label: 0x9c41 (40001)). The CEF entry (the second output) shows that the
path goes via local-label 40001, which resolves to SR Policy GREEN (with color 30 and
endpoint 1.1.1.4: next hop srte_c_30_ep_1.1.1.4). Hence, all traffic destined for 2.2.2.0/24 will
be steered into SR Policy GREEN.

As we had indicated above, BGP advertises the VPN route 2.2.2.0/24 with a VPN service label,
92220. Since the service label must be at the bottom of the label stack, Node1 imposes the VPN label
92220 on the packets before steering them into the SR Policy GREEN where the SR Policy’s SID list
is imposed. This is shown in the last line of the output, labels imposed {ImplNull 92220}. This
label stack is ordered top→bottom. The bottom label on the right is the VPN label. The top label is
the label needed to reach the BGP nexthop. In this case no such label is required since the SR Policy
transports the packets to the nexthop. Therefore, the top label is indicated as ImplNull, which is the
implicit-null label. Normally the implicit-null label signals the MPLS pop operation, in this case it
represents a no-operation.

Example 5-8: RIB and CEF forwarding entries for VRF Acme prefix 2.2.2.0/24 on Node1

RP/0/0/CPU0:xrvr-1#show route vrf Acme 2.2.2.0/24 detail

Routing entry for 2.2.2.0/24


Known via "bgp 1", distance 200, metric 0
Tag 2, type internal
Installed Jun 5 09:31:13.214 for 02:16:45
Routing Descriptor Blocks
1.1.1.4, from 1.1.1.4
Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id: 0xe0000000
Route metric is 0
Label: 0x1683c (92220)
Tunnel ID: None
Binding Label: 0x9c41 (40001)
Extended communities count: 0
Source RD attributes: 0x0001:257:17039360
NHID:0x0(Ref:0)
Route version is 0x1 (1)
No local label
IP Precedence: Not Set
QoS Group ID: Not Set
Flow-tag: Not Set
Fwd-class: Not Set
Route Priority: RIB_PRIORITY_RECURSIVE (12) SVD Type RIB_SVD_TYPE_REMOTE
Download Priority 3, Download Version 1
No advertising protos.

RP/0/0/CPU0:xrvr-1#show cef vrf Acme 2.2.2.0/24


2.2.2.0/24, version 1, internal 0x5000001 0x0 (ptr 0xa14f440c) [1], 0x0 (0x0), 0x208 (0xa16ac190)
Updated Jun 5 09:31:13.233
Prefix Len 24, traffic index 0, precedence n/a, priority 3
via local-label 40001, 3 dependencies, recursive [flags 0x6000]
path-idx 0 NHID 0x0 [0xa171651c 0x0]
recursion-via-label
next hop VRF - 'default', table - 0xe0000000
next hop via 40001/0/21
next hop srte_c_30_ep_1.1.1.4 labels imposed {ImplNull 92220}
To validate, let us use traceroute on Node1 to verify the path of the packets sent to 2.2.2.0/24. A
traceroute from Node1 to the VRF Acme prefix 2.2.2.0/24 reveals the actual label stack that is
imposed on the packets. The output of this traceroute is shown in Example 5‑9. By using the interface
addressing convention as specified in the front matter of this book, the last digit in each address of the
traceroute output indicates the number of the responding node. The traceroute probes follow the path
1→2→3→4→41. The label stack shown for the first hop, 16003/24034/92220, consists of the SR
Policy’s SID list <16003, 24034> as transport labels and the VPN label 92220 as service label.

Example 5-9: Traceroute on Node1 to VRF Acme prefix 2.2.2.0/24

RP/0/0/CPU0:xrvr-1#traceroute vrf Acme 2.2.2.2

Type escape sequence to abort.


Tracing the route to 2.2.2.2

1 99.1.2.2 [MPLS: Labels 16003/24034/92220 Exp 0] 19 msec 9 msec 9 msec


2 99.2.3.3 [MPLS: Labels 24034/92220 Exp 0] 9 msec 9 msec 9 msec
3 99.3.4.4 [MPLS: Label 92220 Exp 0] 9 msec 9 msec 9 msec
4 99.4.41.41 9 msec 9 msec 9 msec

Note that traffic to Node4’s loopback prefix 1.1.1.4/32 itself is not steered into an SR Policy but
follows the usual IGP shortest path 1→7→6→5→4, as illustrated in the output of the traceroute on
Node1 to 1.1.1.4 displayed in Example 5‑10.

Example 5-10: Traceroute on Node1 to Node4’s loopback address 1.1.1.4

RP/0/0/CPU0:xrvr-1#traceroute 1.1.1.4

Type escape sequence to abort.


Tracing the route to 1.1.1.4

1 99.1.7.7 [MPLS: Label 16004 Exp 0] 9 msec 0 msec 0 msec


2 99.6.7.6 [MPLS: Label 16004 Exp 0] 0 msec 0 msec 0 msec
3 99.5.6.5 [MPLS: Label 16004 Exp 0] 0 msec 0 msec 0 msec
4 99.4.5.4 0 msec 0 msec 0 msec
“More than 15 years of software development experience helped me to identify solutions that are less complex to
implement yet powerful enough to simplify network operations. When Clarence introduced the concept of ODN/AS, I
immediately realized that the proposed mechanism falls into such category. After developing a proof-of-concept ODN/AS
model for SR-TE disjoint path use-case, I generalized the model for other SR-TE use-cases. Currently, ODN/AS has
become a salient feature of SR-TE portfolio.

The operational simplicity of the ODN/AS solution did not come for free as it required a rather sophisticated
implementation. One of the key challenges of this functionality was the internal communication between the different
software processes (e.g., BGP and SR-TE processes became new collaborators). Since these software components are
developed by different teams, the realization of this functionality was an example of true teamwork. Thanks to the
excellent collaboration with the other component leads (particularly BGP, RIB, and FIB), the final implementation in the
shipping product has become robust, scalable, performance optimized while at the same time simple to operate.

Even though I have been part of SR-TE development from day one, because of its great customer appeal, realizing the
concept of ODN/AS in Cisco products has provided me with a great sense of satisfaction. ”

— Siva Sivabalan
5.4 Steering Multiple Prefixes With Different SLAs
Let us now assume that the operator has configured Node1 with two additional SR Policies to Node4.

We have conveniently named each SR Policy after the color of its SLA. The following SR Policies exist
on Node1:

SR Policy BLUE, color 20 (blue), endpoint 1.1.1.4

SR Policy GREEN, color 30 (green), endpoint 1.1.1.4

SR Policy PURPLE, color 40 (purple), endpoint 1.1.1.4

The paths of these three SR Policies are shown in Figure 5‑5.

Figure 5-5: Multiple SR Policies with endpoint Node4 on Node1

The SR Policy configuration of Node1 is shown in Example 5‑11.


SR Policy BLUE has an explicit candidate-path using segment list BLUE_PATH. This segment list is
defined at the top of the configuration.

SR Policy GREEN has a dynamic delay-optimized path without constraints.

SR Policy PURPLE has a dynamic path that optimizes the IGP metric and avoids links with Affinity
red. There is only one link with Affinity red in the network: the link between Node6 and Node7.

Example 5-11: SR Policy configuration on Node1

segment-routing
traffic-eng
affinity-map
name red bit-position 0
!
segment-list name BLUE_PATH
index 10 mpls label 16008
index 20 mpls label 24085
index 30 mpls label 16004
!
policy BLUE
color 20 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list BLUE_PATH
!
policy GREEN
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay
!
policy PURPLE
color 40 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type igp
constraints
affinity
exclude-any
name red

Different CEs of different customers are connected to PE Node4: Node42 of VRF Acme, Node43 of
VRF NewCo, and Node44 of VRF Widget. Each of these customers has different SLA needs. The
required SLA for each prefix is indicated by attaching the appropriate SLA-identifying color to the
prefix advertisement. PE Node4 colors the prefixes as follows (also shown in Figure 5‑6), when
advertising them to Node1:
VRF Acme prefix 2.2.2.0/24: color 30 (green)

VRF Acme prefix 5.5.5.0/24: color 30 (green)

VRF NewCo prefix 3.3.3.0/24: color 20 (blue)

VRF NewCo prefix 6.6.6.0/24: color 30 (green)

VRF Widget prefix 4.4.4.0/24: color 40 (purple)

VRF Widget prefix 7.7.7.0/24: no color

Figure 5-6: Advertising prefixes with their appropriate SLA color

All these prefixes have a BGP nexthop 1.1.1.4.


In the previous section we have seen that Node1 steers VRF Acme prefix 2.2.2.0/24 with color green
attached, into SR Policy GREEN since this SR Policy’s color and endpoint match the color and BGP
nexthop of the prefix. Node4 advertises the other prefix in VRF Acme, 5.5.5.0/24, with the same
color green (30), therefore Node1 also steers this prefix into SR Policy GREEN. Multiple service
routes that have the same color and same BGP nexthop share the same SR Policy.

Node1 receives VRF NewCo prefix 3.3.3.0/24 with nexthop 1.1.1.4 and color blue (20). BGP
installs this route recursing on the BSID of SR Policy BLUE, as this SR Policy matches the color and
nexthop of this route. The other prefix in VRF NewCo, 6.6.6.0/24, has color green (30). This prefix is
steered into SR Policy GREEN, since this SR Policy matches the color and nexthop of this prefix.

BGP installs VRF Widget prefix 4.4.4.0/24 with nexthop 1.1.1.4 and color purple (40) recursing on
the BSID of SR Policy PURPLE, the SR Policy that matches the color and nexthop of this route.
Node4 advertises the other prefix of VRF Widget, 7.7.7.0/24, without color, indicating that this prefix
requires no specific SLA. This BGP route as received by Node1 is presented in Example 5‑12. Since
it has no attached color, BGP on Node1 installs this route recursing on its BGP nexthop 1.1.1.4.
Consequently, the traffic destined for 7.7.7.0/24 will follow the IGP shortest path to Node4.

Example 5-12: BGP route without color on Node1

RP/0/0/CPU0:xrvr-1#show bgp vrf Widget 7.7.7.0/24


BGP routing table entry for 7.7.7.0/24, Route Distinguisher: 1.1.1.1:0
Versions:
Process bRIB/RIB SendTblVer
Speaker 10 10
Last Modified: Jun 5 14:02:44.880 for 00:03:39
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
2
1.1.1.4 (metric 40) from 1.1.1.4 (1.1.1.4)
Received Label 97770
Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
Received Path ID 0, Local Path ID 1, version 10
Extended community: RT:1:1
Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0

The RIB and CEF forwarding entries for this colorless prefix are shown in Example 5‑13.

The first output in Example 5‑13 shows that BGP installed the route in RIB without reference to a
BSID (Binding Label: None).
The second output shows that the CEF entry for VRF Widget prefix 7.7.7.0/24 recurses on its BGP
nexthop 1.1.1.4 (via 1.1.1.4/32). The last line in the output shows the resolved route, pointing to
the outgoing interface and nexthop (Gi0/0/0/1, 99.1.7.7/32). The imposed labels are <16004, 97770>
(labels imposed {16004 97770}), where 16004 (Node4’s Prefix-SID) is the label to reach the
BGP nexthop and 97770 is the VPN label that Node4 advertised for prefix 7.7.7.0/24 in VRF Widget.

Example 5-13: RIB and CEF forwarding entries for BGP route without color on Node1

RP/0/0/CPU0:xrvr-1#show route vrf Widget 7.7.7.0/24 detail

Routing entry for 7.7.7.0/24


Known via "bgp 1", distance 200, metric 0
Tag 2, type internal
Installed Jun 5 14:02:44.898 for 00:00:13
Routing Descriptor Blocks
1.1.1.4, from 1.1.1.4
Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id: 0xe0000000
Route metric is 0
Label: 0x17dea (97770)
Tunnel ID: None
Binding Label: None
Extended communities count: 0
Source RD attributes: 0x0001:257:17039360
NHID:0x0(Ref:0)
Route version is 0x3 (3)
No local label
IP Precedence: Not Set
QoS Group ID: Not Set
Flow-tag: Not Set
Fwd-class: Not Set
Route Priority: RIB_PRIORITY_RECURSIVE (12) SVD Type RIB_SVD_TYPE_REMOTE
Download Priority 3, Download Version 5
No advertising protos.

RP/0/0/CPU0:xrvr-1#show cef vrf Widget 7.7.7.0/24


7.7.7.0/24, version 5, internal 0x5000001 0x0 (ptr 0xa14f440c) [1], 0x0 (0x0), 0x208 (0xa16ac5c8)
Updated Jun 5 14:02:44.917
Prefix Len 24, traffic index 0, precedence n/a, priority 3
via 1.1.1.4/32, 3 dependencies, recursive [flags 0x6000]
path-idx 0 NHID 0x0 [0xa1715c14 0x0]
recursion-via-/32
next hop VRF - 'default', table - 0xe0000000
next hop 1.1.1.4/32 via 16004/0/21
next hop 99.1.7.7/32 Gi0/0/0/1 labels imposed {16004 97770}

The last example showed that a BGP route without a color recurses on its nexthop. This can be
generalized. If BGP on an ingress node receives a service route with color C and BGP nexthop E, and
no valid SR Policy (C, E) with color C and endpoint E exists on the ingress node, then BGP installs
the service route as usual, recursing on its BGP nexthop. In this case, any packet destined for this
service route will follow the default path to E, typically the IGP shortest path. This is the same behavior as for service routes without a color.
5.5 Automated Steering for EVPN
While the previous section showed an L3VPN example, this section illustrates the Automated
Steering functionality for a single-homed EVPN Virtual Private Wire Service (VPWS) service.

EVPN is the next generation solution for Ethernet services. It relies on BGP for auto-discovery and
signaling, using the same principles as L3VPNs. EVPN VPWS is specified in RFC 8214.

Figure 5‑7 shows the topology with Node1 and Node4 the PEs providing the EVPN VPWS service.
Node12 and Node42 are the CEs of a customer, connected with their Access Circuit (AC) to their
respective PEs.

A BGP session is established between Node1 and Node4. For simplicity, this illustration shows a
direct BGP session, but in general a BGP RR would be used.

A VPWS service is to be established between CE12 and CE42. This VPWS service requires low-delay transport; therefore, the BGP advertisement of this service route is colored with color 30, identifying the low-delay requirement.
Figure 5-7: EVPN VPWS Automated Steering reference topology

The VPWS requires the configuration of the following elements on the PEs:

EVPN Instance (EVI) that represents a VPN on a PE router. It serves the same role as an L3VPN
VRF.

Local AC identifier to identify the local end of the point-to-point VPWS

Remote AC identifier to identify the remote end of the VPWS

The L2VPN configuration of the VPWS service on Node4 is shown in Example 5‑14. The circuit named “EVI9” on interface Gi0/0/0/0.100 has EVI 9; its remote and local AC identifiers are 1 and 4, respectively.
Example 5-14: EVPN VPWS configuration on PE Node4

interface GigabitEthernet0/0/0/0.100 l2transport
encapsulation dot1q 100
!
l2vpn
xconnect group evpn-vpws
p2p EVI9
interface GigabitEthernet0/0/0/0.100
neighbor evpn evi 9 target 1 source 4

BGP signaling for EVPN uses the address-family l2vpn evpn. The BGP configuration of Node4
is displayed in Example 5‑15. To signal the low-delay requirement of the circuit of EVI 9 to Node1,
the color extended community 30 is attached to the service route. For this purpose, an outgoing route-
policy evpn_vpws_policy is applied under the EVPN address-family of neighbor 1.1.1.1 (Node1).

The route-policy evpn_vpws_policy matches on the route-distinguisher (RD) 1.1.1.4:9 of the service route, which is automatically allocated and consists of the router-id 1.1.1.4 and the EVI (9) of the service. The route-policy sets the color extended community Green for matching routes. “Green” refers to an extcommunity-set that specifies the color numerical value 30.
refers to an extcommunity-set that specifies the color numerical value 30.

To transport the EVPN VPWS service between CE12 and CE42 on a low-delay path in both
directions, the equivalent BGP configuration is used on Node1. Node1 then also colors the service
route with color 30 (low-delay).

Example 5-15: BGP configuration on PE Node4

extcommunity-set opaque Green
# color green identifies low-delay
30
end-set
!
route-policy evpn_vpws_policy
if rd in (1.1.1.4:9) then
set extcommunity color Green
endif
end-policy
!
router bgp 1
bgp router-id 1.1.1.4
address-family l2vpn evpn
!
neighbor 1.1.1.1
remote-as 1
update-source Loopback0
address-family l2vpn evpn
route-policy evpn_vpws_policy out
BGP on Node1 receives the service routes as shown in Example 5‑16. The EVI 9 routes are
highlighted. The first entry shows the local route, the second shows the imported remote route and the
third shows the received route.

Example 5-16: BGP EVPN routes on Node1

RP/0/0/CPU0:xrvr-1#show bgp l2vpn evpn


BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0 RD version: 0
BGP main routing table version 63
BGP NSR Initial initsync version 6 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best


i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 1.1.1.1:9 (default for vrf VPWS:9)
*> [1][0000.0000.0000.0000.0000][1]/120
0.0.0.0 0 i
*>i[1][0000.0000.0000.0000.0000][4]/120
1.1.1.4 C:30 100 0 i
Route Distinguisher: 1.1.1.4:9
*>i[1][0000.0000.0000.0000.0000][4]/120
1.1.1.4 C:30 100 0 i

Processed 3 prefixes, 3 paths

To display the details of the route, add the NLRI to the show command as in Example 5‑17. The NLRI
in this case is [1][0000.0000.0000.0000.0000][4]/120.

The next hop of this route is 1.1.1.4 (Node4) and a color extended community 30 is attached to this
route (Color:30 in the output). Node4 has allocated service label 90010 for this service route
(Received Label 90010 in the output).

The output 1.1.1.4 C:30 (bsid:40001) and the last line in the output are a result of the automated
steering that we will explain next.
Example 5-17: EVI 9 BGP EVPN routes on Node1

RP/0/0/CPU0:xrvr-1#show bgp l2vpn evpn rd 1.1.1.4:9 [1][0000.0000.0000.0000.0000][4]/120


BGP routing table entry for [1][0000.0000.0000.0000.0000][4]/120, Route Distinguisher: 1.1.1.4:9
Versions:
Process bRIB/RIB SendTblVer
Speaker 53 53
Last Modified: Jul 4 08:59:57.989 for 00:11:18
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
Local
1.1.1.4 C:30 (bsid:40001) (metric 50) from 1.1.1.4 (1.1.1.4)
Received Label 90010
Origin IGP, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf
Received Path ID 0, Local Path ID 1, version 48
Extended community: Color:30 RT:1:9
SR policy color 30, up, registered, bsid 40001, if-handle 0x00000470

Node1 has an SR Policy configured with color 30 that provides a low-delay path to Node4. The
configuration of this SR Policy is displayed in Example 5‑18.

Example 5-18: SR Policy GREEN configuration on Node1

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic mpls
metric
type delay

When receiving and importing the route, BGP finds that the route with nexthop 1.1.1.4 is tagged with
color extended community 30. BGP has learned from SR-TE that a valid SR Policy GREEN exists
that matches the nexthop and color of the service route. Therefore, BGP installs the service route
recursing on the Binding-SID 40001 of that matching SR Policy. This is indicated in the BGP route
output in Example 5‑17 with 1.1.1.4 C:30 (bsid:40001) and also in the last line of the same
output.

As a result, the VPWS traffic for EVI 9 from Node1 to Node4 follows the low-delay path instead of
the default IGP shortest path. The equivalent mechanism can be used to steer the reverse VPWS traffic
via the low-delay path.
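
As a hedged sketch, the corresponding coloring on Node1 could look as follows. The RD 1.1.1.1:9 is an assumption that follows the auto-allocation convention (router-id:EVI) described above, and the extcommunity-set Green is assumed to be defined on Node1 as it is on Node4:

route-policy evpn_vpws_policy
 if rd in (1.1.1.1:9) then
  set extcommunity color Green
 endif
end-policy
!
router bgp 1
 bgp router-id 1.1.1.1
 address-family l2vpn evpn
 !
 neighbor 1.1.1.4
  remote-as 1
  update-source Loopback0
  address-family l2vpn evpn
   route-policy evpn_vpws_policy out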
5.6 Other Service Routes
While the examples in the previous sections were restricted to VPNv4 L3VPN routes and EVPN
VPWS routes, Automated Steering functionality equally applies to other types of service routes, such
as global labeled and unlabeled IPv4 and IPv6 unicast routes, 6PE routes, and 6vPE routes. Also,
different types of EVPN routes (type 1, type 2, and type 5) will benefit from AS. At the time of
writing, AS support for EVPN routes was limited.

Automated Steering is a generic steering architecture that can be equally applied to other signaling
protocols such as LISP (Locator/ID Separation Protocol).
5.7 Disabling AS
Automated Steering is enabled by default.

If a headend H with a valid SR Policy P (C, E) receives a BGP route B/b3 with color C and next-hop
E, then H automatically installs B/b via P in the forwarding plane.

In some deployments, the operator may decide to only use AS for VPN services. In this case, the operator may
want to add a layer of configuration robustness by disabling AS for the Internet service.

This can be done as illustrated in Example 5‑19.

Example 5-19: Disable Automated Steering for IPv4 Unicast BGP routes

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.4
steering
bgp disable
ipv4 unicast
candidate-paths
preference 100
dynamic mpls
metric
type delay

More generically, Example 5‑20 shows the options to disable Automated Steering. When no AFI/SAFI is specified, Automated Steering is disabled for all BGP address-families. The per-AFI/SAFI functionality was not yet available at the time of writing this book.

Example 5-20: Options to disable Automated Steering

segment-routing
traffic-eng
policy <name>
color <color> end-point (ipv4|ipv6) <address>
steering
bgp disable
[<afi> <safi>]
[<afi> <safi> ...]
5.8 Applicability
Automated Steering functionality applies to any SR Policy, regardless of how it is instantiated.

In this chapter, for illustration simplicity, we used pre-configured SR Policies. AS applies in exactly the same way to any other type of SR Policy, whether instantiated via PCEP, BGP SR-TE, or ODN, as we will see in the next chapter.

ODN and AS are independent


“In this chapter, we explained the AS solution. In the next chapter, we will explain its close companion “ODN”. AS
automates the steering of traffic into an SR Policy while ODN automates the instantiation of an SR Policy.

Historically, I came up with the two solutions together. Later on, I insisted to break the original idea into two independent
modules: ODN for the on-demand SR Policy instantiation and AS for the Automated Steering of service routes in related
SR policies.

It was clear that some operators would ensure that the required SR Policies would be present at the ingress PE (e.g., via a
centralized controller) and these operators would need automated steering on existing SR policies (instead of on-demand
policies).

This is consistent with the overall modular architecture of the SR solution. Each independent behavior is defined as an
independent module. Different operators combine different modules based on their specific use-cases. ”

— Clarence Filsfils

Automated Steering is not restricted to the single IGP domain networks. It equally applies to multi-
domain networks.

Note that Automated Steering relies on the SR Policy color and endpoint matching the color and BGP
nexthop of the service route, also for end-to-end inter-domain SR Policies. Consequently, service
routes that must be steered into an end-to-end inter-domain SR Policy must have the egress PE (the
SR Policy’s endpoint in the remote domain) as BGP nexthop.

Therefore, the service routes must be propagated to the ingress PE keeping the BGP nexthop intact. If
there are eBGP sessions involved, as typically is the case, they must be configured with next-hop-
unchanged under the BGP address-family to ensure the BGP nexthop of the service route is not
updated when propagating it.
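
A minimal hedged sketch of such a configuration on an intermediate border router (the neighbor address 192.0.2.1 and the AS numbers are assumptions):

router bgp 1
 neighbor 192.0.2.1
  remote-as 2
  address-family vpnv4 unicast
   !! keep the egress PE as BGP nexthop when propagating the service routes
   next-hop-unchanged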
5.9 Summary
A service route is a BGP route (VPN or not), an EVPN route, a PW or a LISP route.

A Provider Edge (PE) advertises a service route and tags it with a color.

In BGP, the color is supported by the well-known color extended community attribute. The same
extension has been defined for LISP.

The operator allocates a range of colors to indicate the SLA requirement. For example, color = 30
means “low-delay” while color 50 means “only via network plane blue”.

For convenience, the operator allocates a name to the color. For example, color = 30 could be named
“green” or “low-delay”.

AS is enabled by default.

Per-Destination Automated Steering (also called “AS”) automatically steers a service route with
color C and next-hop N onto the SR Policy (N, C), if valid. If this SR Policy does not exist or is
invalid, the service route is installed “classically” in the forwarding plane: i.e., with a recursion on
the route to the next-hop N (typically the IGP path to N).

AS also supports per-flow steering. Per-flow steering allows different flows matching the same destination service route to be steered onto different SR Policies based on per-flow classification (e.g., DSCP value). This book revision only details the per-destination AS.

The operator may concurrently use the BGP color extended-community for different purposes.

AS can be disabled on a per-protocol basis (BGP) and per AFI/SAFI basis.

AS is one of the key components of the SR-TE solution.

It drastically simplifies the traffic-engineering operation. The operator only needs to color the service
routes and AS automatically steers them in the correct SR Policies.

Furthermore, by directly installing the service route recursing on the SR Policy, the forwarding
performance degradation of mechanisms such as policy-based routing is avoided.
5.10 References
[SR-for-DCI] “Segment Routing for Data Center Interconnect at Scale”, Paul Mattes (Microsoft),
Mohan Nanduri (Microsoft), MPLS + SDN WC2017 Upperside Conferences,
<http://www.segment-routing.net/conferences/2017-mpls-world-congress-paul-mattes/>, March
2017

[RFC4364] "BGP/MPLS IP Virtual Private Networks (VPNs)", Yakov Rekhter, Eric C. Rosen,
RFC4364, February 2006

[RFC4659] "BGP-MPLS IP Virtual Private Network (VPN) Extension for IPv6 VPN", Francois Le
Faucheur, Jeremy De Clercq, Dirk Ooms, Marco Carugi, RFC4659, September 2006

[RFC4798] "Connecting IPv6 Islands over IPv4 MPLS Using IPv6 Provider Edge Routers (6PE)",
Jeremy De Clercq, Dirk Ooms, Francois Le Faucheur, Stuart Prevost, RFC4798, February 2007

[RFC5512] "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP
Tunnel Encapsulation Attribute", Pradosh Mohapatra, Eric C. Rosen, RFC5512, April 2009

[RFC6830] "The Locator/ID Separation Protocol (LISP)", Dino Farinacci, Vince Fuller, David
Meyer, Darrel Lewis, RFC6830, January 2013

[RFC7432] "BGP MPLS-Based Ethernet VPN", John Drake, Wim Henderickx, Ali Sajassi, Rahul
Aggarwal, Dr. Nabil N. Bitar, Aldrin Isaac, Jim Uttaro, RFC7432, February 2015

[RFC8214] "Virtual Private Wire Service Support in Ethernet VPN", Sami Boutros, Ali Sajassi,
Samer Salam, John Drake, Jorge Rabadan, RFC8214, August 2017

[RFC4360] "BGP Extended Communities Attribute", Dan Tappan, Srihari S. Ramachandra, Yakov
Rekhter, RFC4360, February 2006

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018
[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano
Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-
segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-dukes-lisp-colored-engineered-underlays] "LISP Colored Engineered Underlays", Darren Dukes, Jesus Arango, draft-dukes-lisp-colored-engineered-underlays-01 (Work in Progress), December 2017

1. Service-Level Agreement – this may not be the most correct term to use in this context as it really
indicates a written agreement documenting the required levels of service. When using the term
SLA in this book, we mean the requirements of a given service.↩

2. Will be described in a later revision of this book.↩

3. Prefix B, prefix length b.↩


6 On-demand Next-hop
What we will learn in this chapter:

A candidate path of an SR Policy is either instantiated explicitly (configuration, PCEP, BGP-TE) or automatically by the “On-Demand Next-Hop” (ODN) solution.

If a head-end H has at least one service route with color C and nexthop E and has an SR Policy
path template for color C (“ODN template”), then H automatically instantiates a candidate path for
the SR Policy (C, E) based on this template.

The candidate path instantiation occurs even if an SR Policy (C, E) already exists on H with a non-
ODN-instantiated candidate path.

ODN applies to single-domain and multi-domain networks.

We first explain how a colored service route triggers the dynamic on-demand instantiation of an SR
Policy candidate path, based on an ODN template. This path is deleted again if it is no longer used.
We then explain the integration of ODN in the SR-TE solution and illustrate with examples. Finally,
we explain how to apply ODN restrictions.

In the previous chapters, we have explored the SR Policy, its paths, and SID lists. We have seen how
such paths can be computed and instantiated on a headend node. We have seen how to automatically
steer traffic on the SR Policy.

Up to now, the SR Policies were initiated by configuration or by a central controller. The configuration option can be cumbersome, especially if many SR Policies are to be configured. The controller option may not fit those operators who want to maximize distributed intelligence.

The On-demand Next-hop (ODN) functionality automates the instantiation of SR Policy paths by the
headend node without any need for an SDN controller. The SR Policy paths are based on templates,
each template specifying the requirements of the path.

The Automated Steering functionality (AS) applies to any type of SR candidate path: ODN or not.
SR Design Principle – Simplicity
“The ODN solution is a key achievement of the SR-TE innovation. The heart of the business of a network operator is to
support services and hence we needed to focus the solution on the service and not on the transport.

The RSVP-TE solution built complex meshes of pre-configured tunnels and then had lots of difficulty steering traffic into
the RSVP tunnels.

With ODN (and AS), the solution is natural: the service routes are tagged and the related SR Policies are automatically
instantiated and used for steering.

The operation is focused on the service, not on the underlying transport. ”

— Clarence Filsfils
6.1 Coloring
As for the AS solution, the coloring of service routes is at the heart of the ODN solution.

The ingress PE is pre-configured with a set of path templates, one per expected color indicating an
SLA/SR-TE requirement.

The egress PE colors the service routes it advertises with the color that indicates the desired SLA for
each service route.
6.2 On-Demand Candidate Path Instantiation
As soon as an ingress PE that has an ODN template for color C, receives at least one service route
with color C and nexthop E, this PE automatically instantiates an ODN candidate path for SR Policy
(C, E) according to the ODN template of color C. This candidate path is called ODN-instantiated (as
opposed to locally configured or signaled from a controller via PCEP or BGP-TE).

If SR Policy (C, E) already exists, this ODN-instantiated candidate path is added to the list of
candidate paths of the SR Policy. Else, the SR Policy is dynamically instantiated.

The ODN instantiation occurs even if non-ODN candidate paths already exist for the related SR
Policy (C, E).

The ODN template specifies the characteristics of the instantiated candidate path, such as its
preference, whether it is dynamic, if so which metric to minimize, which resources to exclude.
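
The instantiation behavior just described can be sketched in Python as follows. This is a simplified illustration with assumed names and data structures (the SrPolicy class and the odn_templates and sr_policies dictionaries are hypothetical); it is not an actual implementation.

# Illustrative sketch of ODN candidate-path instantiation.
# All names and data structures are hypothetical.
class SrPolicy:
    def __init__(self, color, endpoint):
        self.color = color
        self.endpoint = endpoint
        self.candidate_paths = []

def on_colored_route_received(route, odn_templates, sr_policies):
    # Head-end receives a service route with color C and next-hop E.
    template = odn_templates.get(route.color)
    if template is None:
        return  # no ODN template for this color: nothing to instantiate
    key = (route.color, route.next_hop)  # identifies SR Policy (C, E)
    # Create the SR Policy (C, E) if it does not exist yet.
    policy = sr_policies.setdefault(key, SrPolicy(*key))
    # Add the ODN-instantiated candidate path based on the template, even if
    # non-ODN candidate paths already exist for this SR Policy.
    policy.candidate_paths.append({"source": "ODN", "template": template})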

ODCPI is more accurate, ODN is nicer


“When the idea came in the taxi, Alberto and Alex’s problem was the instantiation of best-effort SR connectivity across
domains without RFC3107.

Without RFC3107, PE1 in the left-most domain does not know about the loopback of PE2 in the right-most domain and
hence PE1 cannot install a VPN route learned with next-hop = PE2.

Alex and Alberto needed an automated scheme to give PE1 a path to PE2’s loopback.

They needed an on-demand SR Policy to the next-hop (PE2).

Hence, in the taxi, I called that idea: “ODN” for “On-Demand Next-hop”: an automated SR path to the next-hop.

Quickly the idea became well known within Cisco and the lead operators group and we kept using the “ODN” name.

Technically, a better name could have been “ODCPI” for “On-Demand Candidate Path Instantiation”. This is much longer
and cumbersome and hence we kept using the ODN name. ”

— Clarence Filsfils
6.3 Seamless Integration in SR-TE Solution
Once the ODN candidate path is instantiated for the SR Policy (C, E), all the behaviors for this SR
Policy happen as usual:

The best candidate path is selected

The related SID list and Binding-SID (BSID) are inserted in the forwarding plane

If AS is applicable, service routes of color C and next-hop E are steered on the active candidate
path of the SR Policy (C, E)

All these points are applied regardless of the candidate path instantiation mechanism (configuration,
signalization or ODN).

SR Design Principle – Modularity


“Each solution is designed as an individual component that fits within the overall SR architecture. The ODN solution is
solely related to the dynamic instantiation of a candidate path. All the rest of the SR-TE solution is unchanged. Any other
behavior of the solution applies to a candidate path whether it is explicitly configured, explicitly signaled by an SDN
controller or automatically instantiated by ODN. ”

— Clarence Filsfils
6.4 Tearing Down an ODN Candidate Path
Assuming an ODN template for color C at headend H, when H receives the last BGP withdraw for a
service route with color C and endpoint E, then H tears down the ODN-instantiated candidate path for
the SR Policy (C, E).

At that time, as for the previous section, the SR Policy behavior is as usual:

The best candidate path is selected

The related SID list and BSID are inserted in the forwarding plane

If AS is applicable, service routes of color C and next-hop E are steered on the selected candidate
path of the SR Policy (C, E)

If the ODN candidate path was the last candidate path of the SR Policy, then the headend also tears
down the SR Policy.
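
Continuing the illustrative sketch of section 6.2, the teardown behavior could be expressed as follows (again with hypothetical names and data structures, not an actual implementation):

# Illustrative companion to the instantiation sketch: ODN teardown.
def on_last_colored_route_withdrawn(color, endpoint, sr_policies):
    # Last BGP withdraw received for a service route with color C and
    # endpoint E, at a head-end that has an ODN template for color C.
    key = (color, endpoint)
    policy = sr_policies.get(key)
    if policy is None:
        return
    # Tear down only the ODN-instantiated candidate path(s).
    policy.candidate_paths = [p for p in policy.candidate_paths
                              if p.get("source") != "ODN"]
    # If the ODN candidate path was the last one, also tear down the SR Policy.
    if not policy.candidate_paths:
        del sr_policies[key]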
6.5 Illustration: Intra-Area ODN
Node1 in Figure 6‑1 has two on-demand templates configured. One for color green (color value 30),
and another for color purple (color value 40). The configuration of these templates is presented in
Example 6‑1.

Example 6-1: On-demand color templates on Node1

segment-routing
traffic-eng
affinity-map
name RED bit-position 1
!
!! green ODN template
on-demand color 30
dynamic
metric
type delay
!
!! purple ODN template
on-demand color 40
dynamic
metric
type igp
affinity exclude-any
!! RED is defined in affinity-map above
name RED

Figure 6-1: Intra-area On-Demand Next-hop (ODN)


The green template (on-demand color 30) specifies to compute a dynamic candidate path,
optimizing the delay metric of the path.

The purple template (on-demand color 40) specifies to compute a dynamic candidate path,
optimizing the IGP metric and avoiding links with affinity color RED.

Colors
Note that we are using the color terminology for different purposes, in accordance with the common terminology used in the
networking industry.

The meaning or purpose of a color in this book can be derived from its context. Often it is indicated in the text, such as “affinity
color” that refers to the link affinity functionality. The link affinity color RED has nothing to do with the SLA color green (30).

Node4 advertises service route 2.2.2.0/24 in BGP with color green and BGP nexthop 1.1.1.4 (Node4)
to Node1, as illustrated in Figure 6‑1. How to attach a color to a prefix is explained in chapter 5,
"Automated Steering".

When this route arrives on Node1, since an on-demand template exists for color green on Node1,
BGP on Node1 requests the local SR-TE process to instantiate a candidate path for the SR Policy
with endpoint 1.1.1.4 and color green, based on the green on-demand template. This is the ODN
functionality.

As pointed out earlier in this chapter, the ODN candidate path is instantiated whether the
corresponding SR Policy already exists or not, and even if other non-ODN candidate paths already
exist for the related SR Policy. If the related SR Policy does not yet exist, then it is created when the
candidate path is instantiated.

SR-TE computes a dynamic path to endpoint 1.1.1.4 with optimized delay metric, as the green on-
demand template specifies. Since no SR Policy (green, 1.1.1.4) existed yet on Node1, Node1 creates
it with the dynamic path as candidate path.

At this point, an SR Policy (green, 1.1.1.4) exists on Node1 with the ODN instantiated candidate path
as selected path. The status of this SR Policy is shown in Example 6‑2.
Note that with the configuration in Example 6‑1, two candidate paths are automatically instantiated on-
demand: one with preference 100 and another with preference 200. Both paths are marked (BGP
ODN) in the output.

The candidate path with preference 200 is computed by the headend node.

The candidate path with preference 100 is computed by a PCE. Since no PCE is configured in this
example, this path is invalid.

The PCE computed path will be used if the headend cannot compute the path, e.g., because the
endpoint is located in a remote domain. In this case, the PCE address must be configured as
illustrated in the next section. Note that the pcep keyword under dynamic is only required if the path
must be computed exclusively by the PCE. This is explained further in this chapter.

Note the BSID label 40001 that was allocated for this SR Policy, shown as Binding SID: 40001 in the output.

Example 6-2: On-demand instantiated candidate path on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.4


Name: srte_c_30_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:15:42 (since Jul 5 07:57:23.382)
Candidate-paths:
Preference: 200 (BGP ODN) (active)
Requested BSID: dynamic
Dynamic (active)
Metric Type: delay, Path Accumulated Metric: 30
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Preference: 100 (BGP ODN)
Requested BSID: dynamic
PCC info:
Symbolic name: bgp_c_30_ep_1.1.1.4_discr_100
PLSP-ID: 30
Dynamic (pce) (invalid)
Metric Type: NONE, Path Accumulated Metric: 0
Attributes:
Binding SID: 40001
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

BGP uses Automated Steering as usual (see chapter 5, "Automated Steering") to steer the service
route 2.2.2.0/24 into this SR Policy (green, 1.1.1.4), because this service route’s color green and
nexthop 1.1.1.4 match with the SR Policy’s color and endpoint.

The Automated Steering can be verified in the BGP table and in the forwarding table.

The BGP table entry for VRF ACME prefix 2.2.2.0/24 is displayed in Example 6‑3. The output shows
that this prefix was received with a nexthop 1.1.1.4 (Node4) and a color extended community 30 (that
we named “green” in this text), 1.1.1.4 C:30. It also shows the BSID of the SR Policy,
(bsid:40001).

Example 6-3: BGP table entry on Node1 for VRF ACME prefix 2.2.2.0/24

RP/0/0/CPU0:xrvr-1#show bgp vrf ACME 2.2.2.0/24


BGP routing table entry for 2.2.2.0/24, Route Distinguisher: 1.1.1.1:0
Versions:
Process bRIB/RIB SendTblVer
Speaker 572 572
Last Modified: Jul 5 07:59:18.989 for 00:29:11
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
2
1.1.1.4 C:30 (bsid:40001) (metric 50) from 1.1.1.4 (1.1.1.4)
Received Label 90000
Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
Received Path ID 0, Local Path ID 1, version 556
Extended community: Color:30 RT:1:1
SR policy color 30, up, registered, bsid 40001, if-handle 0x00000490

Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0

The forwarding entry of VRF ACME prefix 2.2.2.0/24 in Example 6‑4 shows that this prefix recurses
on the BSID 40001 of the SR Policy (green (30), 1.1.1.4). BSID 40001 is associated with SR Policy
srte_c_30_ep_1.1.1.4, as indicated in the last line of the output. Note that Node1 imposes the VPN
service label 90000 for this prefix, as advertised by Node4, before steering it into the SR Policy.

Example 6-4: CEF table entry on Node1 for VRF ACME prefix 2.2.2.0/24

RP/0/0/CPU0:xrvr-1#show cef vrf ACME 2.2.2.0/24


2.2.2.0/24, version 218, internal 0x5000001 0x0 (ptr 0xa13a0d78) [1], 0x0 (0x0), 0x208 (0xa175f44c)
Updated Jul 5 07:59:18.595
Prefix Len 24, traffic index 0, precedence n/a, priority 3
via local-label 40001, 3 dependencies, recursive [flags 0x6000]
path-idx 0 NHID 0x0 [0xa17d4288 0x0]
recursion-via-label
next hop VRF - 'default', table - 0xe0000000
next hop via 40001/0/21
next hop srte_c_30_ep_1.1.1.4 labels imposed {ImplNull 90000}
Sometime later, Node4 advertises service route 5.5.5.0/24 in BGP with color green and BGP nexthop
1.1.1.4 (Node4). When Node1 receives this route, it finds that the ODN candidate path for SR Policy
(green, 1.1.1.4) already exists. BGP uses Automated Steering to steer the service route 5.5.5.0/24 into
this SR Policy. The output in Example 6‑5 shows that 5.5.5.0/24 indeed recurses on BSID label
40001, the BSID of SR Policy srte_c_30_ep_1.1.1.4. The VPN service label for this prefix is 90003.

Example 6-5: CEF table entry on Node1 for VRF ACME prefix 5.5.5.0/24

RP/0/0/CPU0:xrvr-1#show cef vrf ACME 5.5.5.0/24


5.5.5.0/24, version 220, internal 0x5000001 0x0 (ptr 0xa13a1868) [1], 0x0 (0x0), 0x208 (0xa175f1e4)
Updated Jul 5 08:47:16.587
Prefix Len 24, traffic index 0, precedence n/a, priority 3
via local-label 40001, 3 dependencies, recursive [flags 0x6000]
path-idx 0 NHID 0x0 [0xa17d4288 0x0]
recursion-via-label
next hop VRF - 'default', table - 0xe0000000
next hop via 40001/0/21
next hop srte_c_30_ep_1.1.1.4 labels imposed {ImplNull 90003}

The service route 5.5.5.0/24 belongs to the same VRF ACME as the previous one, but that is not a
requirement. The above behavior also applies if the service routes belong to different VRFs or even
different address-families. The two decisive elements are the color and nexthop of the service route.

Node4 also advertises a service route 4.4.4.0/24 of VRF Widget with nexthop 1.1.1.4 (Node4) and
color purple (color value 40). Since Node1 has an on-demand template for color 40, the same
procedure is used as before to instantiate an on-demand candidate path based on the template.
Figure 6-2: Intra-area On-Demand Next-hop (ODN) – using multiple ODN templates

BGP uses Automated Steering to steer the traffic into the matching SR Policy (purple, 1.1.1.4).
6.6 Illustration: Inter-domain ODN
The ODN functionality is not limited to single-area networks; it is equally applicable to multi-domain networks. Consider the network diagram in Figure 6‑3. This network consists of three independent domains. With “independent”, we mean that no reachability information is exchanged between the domains. We assume that, by default, Node1, for example, does not have reachability to
Node10.

At the time of writing, BGP requires an IP route to the nexthop (it could be a less specific prefix) to
consider the service route. This can be achieved by propagating an aggregate route to the nexthop in
BGP or by configuring a less specific static route to the nexthop (it can point to Null0) on the BGP
speakers1.

Figure 6-3: ODN Multi-domain topology

“ODN has the best properties to make a multi-domain network infrastructure to scale and very flexible, while keeping
everything simple at the same time. Still from an end-to-end perspective, it allows us to implement solutions for cases such
as a PE that belongs in a network domain that needs to establish an L3VPN with another PE that belongs to another
domain, while going through a middle "opaque" domain. An easy use case could be a virtual PE (vPE) hosted in a
datacenter attached to a core network connecting to a brownfield PE that belongs to a different area from the Core and
the DC. The implementation is done with a simple configuration enhanced with an SR PCE, enabling automated service
instantiation with specific criteria such as latency, optimized delay-metric, bandwidth, disjointness, etc. ”

— Daniel Voyer
In the single-area network example that we discussed in the previous section, the headend node itself
can dynamically compute the on-demand SR Policy path based on the optimization objective and
constraints specified in the ODN template for the SLA color. However, this node cannot locally
compute end-to-end inter-domain paths for the multi-domain case since it has no visibility into the
network topology beyond its local area. Instead, the headend node should request a PCE with
network-wide visibility to compute the inter-domain paths.

Similar to the single-area ODN case, the egress service node Node10 advertises its service routes to
the ingress node Node1, tagged with the colors that identify the required SLAs. This is marked with
➊ in Figure 6‑3. Typically, the service routes are advertised via a Route-Reflector infrastructure, but
BGP must ensure that the service routes are propagated keeping the initial BGP nexthop intact, i.e.,
Node1 must receive the service route with BGP nexthop Node10 (➋). Also, the color attached to the
service route should be propagated all the way to Node1. Different from the single-area case, the on-
demand template on Node1 specifies that a PCE must be used to compute the inter-domain path of the
on-demand SR Policy (➌ in Figure 6‑3). The ODN template of Node1 is shown in Example 6‑6. The
SR PCE has IP address 1.1.1.99.

Example 6-6: On-demand color templates on Node1 – PCE computes path

segment-routing
traffic-eng
!! green ODN template
on-demand color 30
dynamic
pcep
metric
type delay
!
pcc
pce address ipv4 1.1.1.99

By default, the headend first tries to compute the path locally. If that fails it requests the SR PCE to
compute the path. The keyword pcep in the color 30 ODN template specifies to only use the SR PCE
to compute the path.

Example 6‑7 shows the status of the ODN-instantiated SR Policy path. Since the configuration
specifies to use the PCE to compute the path, the locally computed candidate path (preference 200) is
shut down.
Example 6-7: On-demand instantiated candidate path for inter-domain SR Policy on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.10


Name: srte_c_30_ep_1.1.1.10
Status:
Admin: up Operational: up for 00:15:42 (since Jul 5 07:57:23.382)
Candidate-paths:
Preference: 200 (BGP ODN) (shutdown)
Requested BSID: dynamic
Dynamic (invalid)
Last error: No path found
Preference: 100 (BGP ODN) (active)
Requested BSID: dynamic
PCC info:
Symbolic name: bgp_c_30_ep_1.1.1.10_discr_100
PLSP-ID: 32
Dynamic (pce 1.1.1.99) (valid)
Metric Type: delay, Path Accumulated Metric: 600
16121 [Prefix-SID, 1.1.1.121]
16101 [Prefix-SID, 1.1.1.101]
16231 [Prefix-SID, 1.1.1.231]
16010 [Prefix-SID, 1.1.1.10]
Attributes:
Binding SID: 40060
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

ODN is a scalable mechanism to automatically provide end-to-end inter-domain reachability, with SLA if required. It offers a pull model, where the ingress node installs forwarding entries on-demand, only for service routes that are actually used. This is opposed to the traditional push model, where the network pushes routes to all endpoints, which may or may not steer traffic onto them.

“Most of the customers I work with as a solution architect, are struggling in designing and operate network architectures
that are either easy to scale and operate but flexible enough to implement different transport SLAs. They are always very
impressed by the simplicity of ODN solution and the benefits are immediately evident: no manual RSVP-TE tunnel setup,
no complicated Policy-Based Routing (PBR) configuration for selectively steering traffic, very granular traffic steering and
transparent applicability to inter area and inter domain scenarios that are in many cases fundamental for end-to-end
access-core-dc architecture scalability. ”

— Alberto Donzelli

The ODN model can also be used together with a default reachability model. For example, when
routes are redistributed between domains or in combination with unified MPLS, also known as
seamless MPLS2. ODN can provide SLA paths for a subset of routes, keeping the other routes on their
default forwarding path.

When transitioning from the traditional route push model to the ODN route pull model, ODN can be
introduced next to the existing default forwarding. Reachability to service routes can then be
gradually moved from the classic model to ODN.
6.7 ODN Only for Authorized Colors
The ODN functionality is only triggered when receiving a service route with an authorized color. A
color is authorized when an on-demand template is configured for that color. Appropriate ingress
filtering in BGP can also be used to restrict the received extended communities.

If required, the ODN functionality can be further restricted to a sub-set of BGP nexthops by
configuring a prefix-list ACL on the on-demand template. Only the nexthops passing the filter will
trigger on-demand instantiation for this template. At the time of writing, this functionality is not
available.

Example 6‑8 illustrates a possible configuration to restrict ODN functionality for this template to the
BGP nexthops matching the prefix 1.1.1.0/24.

Example 6-8: Restrict ODN to a subset of BGP nexthops

ipv4 prefix-list ODN_PL


10 permit 1.1.1.0/24
!
segment-routing
traffic-eng
!! green ODN template
on-demand color 30
restrict ODN_PL
dynamic
metric
type delay

SR Design Principle – Less protocols and Seamless Deployment


“With SR we want to eliminate any unnecessary protocol and maximize the use of the base IP protocols: IGP and BGP.
Hence, ODN/AS has been designed to re-use the already existing BGP coloring mechanism.

Simple color range allocation rules ensure seamless ODN/AS deployment into networks that were already using the BGP
coloring mechanism for other purposes.

For example, the operator dedicates range 1000-1999 to mark the PoP-of-origin of a BGP path while range 100-199 is used
for ODN/AS. ”

— Clarence Filsfils
6.8 Summary
An SR candidate path is either instantiated explicitly (configuration, PCEP, BGP SR-TE) or
automatically by the “On-Demand Next-Hop” (ODN) solution.

If a head-end H has at least one service route with color C and next-hop E, and has an ODN template for color C, then
H automatically instantiates an ODN candidate path for the SR Policy (C, E) based on the ODN
template of color C. The ODN instantiation occurs even if there is a pre-existing non-ODN candidate
path for the SR Policy (C, E).

The ODN solution is a key benefit of the SR-TE architecture. It drastically simplifies the operation of
the network by removing the need to maintain a complex configuration. It leverages the distributed
intelligence and robustness of the network. It does not require an SDN controller.

The ODN solution applies to intra and inter-domain use-cases.

Most likely, the ODN solution is combined with the AS solution. However, in theory, these two
components are independent and could be used separately in some use-cases.
6.9 References
[RFC5512] "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP
Tunnel Encapsulation Attribute", Pradosh Mohapatra, Eric C. Rosen, RFC5512, April 2009

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

[draft-ietf-mpls-seamless-mpls] "Seamless MPLS Architecture", Nicolai Leymann, Bruno Decraene, Clarence Filsfils, Maciek Konstantynowicz, Dirk Steinberg, draft-ietf-mpls-seamless-mpls-07 (Work in Progress), June 2014

1. Note that the configuration may allow reachability to the nexthop via the default route (nexthop
resolution allow-default) or may require nexthop reachability via a host route (nexthop resolution
prefix-length minimum 32).↩

2. A scalable architecture that integrates core, aggregation, and access domains into a single
(“seamless”) MPLS domain to provide an end-to-end service delivery [draft-ietf-mpls-seamless-
mpls-07].↩
7 Flexible Algorithm
What we will learn in this chapter:

An IGP Prefix-SID is associated with a prefix and an algorithm.

The default algorithm (0) is the IGP shortest-path according to the IGP metric.

Algorithms 0 to 127 are standardized by the IETF.

Algorithms 128 to 255 are customized by the operator. They are called SR IGP Flexible
Algorithms (Flex-Algo for short).

A Flex-Algo is defined with an optimization objective and a set of constraints.

Any node participating in a Flex-Algo advertises its support for this Flex-Algo.

Only the nodes participating in a Flex-Algo compute the paths to the Prefix-SIDs of that Flex-Algo.
This is an important benefit for scale.

Multiple prefix-SIDs of different algorithms can share the same loopback address. This is an
important benefit for operational simplicity and scale.

Flex-Algo is an intrinsic component of the SR architecture: the algorithm has been included in the
SR proposal since day one.

Flex-Algo is an intrinsic component of the SR-TE architecture: it leverages the Automated Steering
component to automatically steer service traffic onto Flex-Algo Prefix-SIDs. Upon ODN automated
instantiation of an inter-domain SR-TE policy, the SR PCE leverages any per-IGP-domain Flex-
Algo Prefix-SID that provides the required path.

Frequent use-cases involve dual-plane disjoint paths and low-delay routing.

IGP Prefix-SIDs steer the traffic along the shortest path to the corresponding prefix, as computed by
the IGP. Although this shortest path is usually assumed to be the shortest IGP path, minimizing the
accumulated IGP metric along the traversed links, there is in reality more than one definition of
shortest path. In order to ensure that all nodes in the IGP follow the same definition for a given Prefix-
SID, each Prefix-SID is associated with an algorithm. The Prefix-SID algorithm defines how the
shortest paths should be computed for that particular Prefix-SID.

Flexible Algorithms are a type of Prefix-SID algorithms defined within the scope of an IGP domain
and fully customizable by the operator. An operator can define one or several Flexible Algorithms
with optimization objectives and constraints tailored to its specific needs.

In this chapter, we start by detailing the concept of Prefix-SID algorithm. We then explain how
Flexible Algorithms are defined in a consistent manner within an IGP domain and how this capability
is integrated within the SR architecture. We conclude by illustrating the use and the benefits of
Flexible Algorithms within the context of real operator use-cases.
7.1 Prefix-SID Algorithms
Before delving into the Flexible Algorithm, we first go back to the basics of the Prefix-SID. The IGP
Prefix-SID is associated with an IGP prefix. It is a global SID that steers packets along the ECMP-
aware IGP shortest path to its associated prefix.

Figure 7-1: Example Prefix-SID of default algorithm (SPF)

On the example topology in Figure 7‑1, Node3 advertises a Prefix-SID 16003 associated with its
loopback prefix 1.1.1.3/32. All nodes in the network program a forwarding entry for this SID with the
instruction “go via the IGP shortest path to 1.1.1.3/32”. For example, the shortest path from Node6
to 1.1.1.3/32 is towards Node5, so Node6 installs the forwarding entry: incoming label 16003,
outgoing label 16003, via Node5. Eventually, a packet with top label 16003, injected anywhere in the
network, is forwarded via the IGP shortest path to 1.1.1.3/32.

This Prefix-SID description is correct, but incomplete. It only covers the default Prefix-SID
behavior, which is associated with algorithm 0.

The SR Algorithm information has been part of the SR architecture since the beginning. An IGP Prefix-SID
is associated with an algorithm K where K ranges from 0 to 255. The IGP advertises this Algorithm
information in its Prefix-SID and Router Capability advertisement structure.

Each Prefix-SID is advertised in ISIS with an associated Algorithm, as shown in the Prefix-SID TLV
format in Figure 7‑2. OSPF includes an Algorithm field in the Prefix-SID sub-TLV of the OSPF
Extended Prefix TLV (RFC7684), as shown in Figure 7‑3.

An IGP prefix can be associated with different Prefix-SIDs, each of a different algorithm.

Figure 7-2: Algorithm field in ISIS Prefix-SID TLV format

Figure 7-3: Algorithm field in OSPF Prefix-SID sub-TLV format

Algorithm identifiers between 0 and 127 are reserved for standardized (IETF/IANA) algorithms.

At the time of writing, the following standardized Algorithm identifiers are defined in RFC 8402
(Segment Routing Architecture).

0: Shortest Path First (SPF) algorithm based on IGP link metric. This is the well-known shortest
path algorithm based on IGP link metric, as computed by ISIS and OSPF. The traffic steered over
an SPF Prefix-SID follows the shortest IGP path to the associated prefix, unless the Prefix-SID’s
forwarding entry is overridden by a local policy on an intermediate node. In this case, the traffic is
processed as per the local policy and may deviate from the path expressed in the SID list. A
common example of such local policy is the autoroute announce mechanism: the traffic steered over
an SPF Prefix-SID can be autorouted into an SR Policy or RSVP-TE tunnel.

1: Strict Shortest Path First (Strict-SPF) algorithm based on IGP link metric. This algorithm is
identical to algorithm 0, except that it has an additional semantic instructing the intermediate nodes
to ignore any local path deviation policy. The traffic steered over a Strict-SPF Prefix-SID strictly
follows the unaltered shortest IGP path to the associated prefix1. In particular, this traffic cannot be
autorouted.

User-defined algorithms are identified by a number between 128 and 255. These can be customized
by each operator to their liking and are called the SR IGP Flexible Algorithms, or Flex-Algos for
short.

Component by component we build a powerful integrated solution for 5G


“In my first public talk on SR (October 2012), I described the use of Prefix Segments associated with different algorithms.

We kept that component for later execution because we first needed to develop the more general SR-TE solution: ODN,
AS and the inter-domain solution thanks to SR PCE.

Once these components are in place, the full power of Flex-Algo Prefix-SIDs can be leveraged.

For example, Automated Steering is key to automatically steer BGP/Service destinations on the Flex-Algo Prefix-SID that
delivers the required SLA to the BGP next-hop. Most deployments are inter-domain and hence ODN and SR PCE are key
to deliver automated inter-domain SR Policies that leverage per-IGP-domain Flex-Algo Prefix-SID.

Flex-Algo and the related per-link delay measurement is shipping since December 2018. As I write these lines, Anton
Karneliuk and I gave the first public presentation of a flag-ship deployment at Vodafone (Cisco Live!™ January 2019,
video available on segment-routing.net). The details of this low-delay slice design so important for 5G are described in this
chapter, both as a use-case and with a viewpoint offered by Anton. ”

— Clarence Filsfils

In this book, we use the notation Flex-Algo(K) to indicate the Flex-Algo with identifier K, and Prefix-
SID(K) to indicate the Prefix-SID of algorithm K. The notation Algo(0) is used to indicate algorithm
0.
A typical use-case is for an operator to define a Flex-Algo, for example 128, to minimize the delay
metric. This is illustrated on Figure 7‑4.

First, the operator enables the dynamic link-delay measurement, such that the per-link delay metric is
advertised by the IGP. This allows the paths to adapt to any link-delay changes, for example caused
by rerouting of the underlying optical network (see chapter 15, "Performance Monitoring – Link
Delay"). Alternatively, if link-delays are assumed to be constant, the operator could use the TE metric
to reflect the link-delay by statically configuring the TE metric on each link with the link’s delay
value.

Then, every IGP node is configured with one additional prefix-SID, associated with the Flex-
Algo(128). As a result, the IGP continuously delivers two different ways to reach every node: a
shortest-path according to the IGP metric (Prefix-SID with Algo(0) of the destination node) and a
shortest-path according to the delay metric (Prefix-SID with Flex-Algo(128) of the destination node).

Figure 7-4: Example Prefix-SID of low-delay algorithm


To ease the illustrations, we use the following numbering scheme:

Prefix-SID(0) of Algo(0) of Node J is 16000 + J

Prefix-SID(K) of Flex-Algo(K) of Node J is 16000 + (K −120) × 100 + J

In other words, the third digit of the label indicates the algorithm: 0 means Algo(0) and 8 means Flex-Algo(128).

Obviously, this is just for our illustration and this should not be used as a design rule.
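
For readers who want to verify the labels used in the figures, this convention translates into the following throwaway Python helper. It reproduces the illustration-only numbering above and nothing more.

# Reproduces the illustration-only label numbering convention.
def prefix_sid_label(node, algo=0):
    # Prefix-SID(0) of Node J is 16000 + J.
    if algo == 0:
        return 16000 + node                    # e.g., Node3, Algo(0) -> 16003
    # Prefix-SID(K) of Flex-Algo(K) of Node J is 16000 + (K - 120) x 100 + J.
    return 16000 + (algo - 120) * 100 + node   # e.g., Node3, Flex-Algo(128) -> 16803

assert prefix_sid_label(3) == 16003
assert prefix_sid_label(3, 128) == 16803
assert prefix_sid_label(3, 129) == 16903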

In this illustration, Node3 advertises the prefix 1.1.1.3/32 with two Prefix-SIDs:

16003: algorithm 0: shortest-path according to IGP metric

16803: algorithm 128: shortest-path according to delay metric

As a result, when Node1 sends the green packet with top-label 16803, the packet follows the delay-
metric shortest path 1→2→3 with an accumulated delay metric of 10. Similarly, when Node1 sends
the blue packet with top-label 16003, it follows the IGP-metric shortest path 1→6→5→3 with an
accumulated IGP metric of 30.

The low-delay path from Node1 to Node3 (1→2→3) could also be encoded in a SID list <16002,
24023>, where 16002 is the Algo(0) Prefix-SID of Node2, providing the IGP shortest path to Node2
(1→2), and 24023 the Adj-SID of Node2 to Node3, steering the traffic via the direct link to Node3
(2→3). However, the Flex-Algo functionality reduces the required SID list to the single segment
16803, as indicated above.
Low-Delay
“Working closely with the operators, I knew that huge business requirements were left unsupported in terms of concurrent
support of low-cost and delay optimization.

A key component of this solution is the real-time per-link delay measurement and advertisement in the IGP.

As soon as this metric is available, then it is clear that the operator can use it to differentiate services on its infrastructure.
Either via the SR Policy solution (dynamic path based on delay-metric) or via the Flex-Algo solution. ”

— Clarence Filsfils

Another very important benefit of the Flex-Algo solution is that no additional addresses need to be
configured. The same prefix can be associated with multiple Prefix-SIDs of multiple algorithms. This
is key for operational simplicity and scale.

The related configuration of Node3 is shown in Example 7‑1.

Example 7-1: ISIS Prefix-SID configuration on Node3

interface Loopback0
ipv4 address 1.1.1.3/32
!
router isis 1
flex-algo 128
!
interface Loopback0
address-family ipv4 unicast
prefix-sid absolute 16003
prefix-sid algorithm 128 absolute 16803

Aside from the algorithm value between 128 and 255, the configuration of Flex-Algo Prefix-SIDs is
similar to the well-known IGP Prefix-SID. This implies that Flex-Algo Prefix-SID labels are selected
from the SR Global Block (SRGB) label range, which is shared by all Prefix-SIDs, and that Prefix-
SID properties (e.g., Node-SID flag, PHP and explicit-null behaviors) are also configurable for Flex-
Algo Prefix-SIDs.

All the IGP nodes in the illustration support algorithms 0 and 128 and hence all the nodes install the
related Prefix-SIDs. The Prefix-SIDs of Algo(0) are installed along the IGP-metric shortest path
while the Prefix-SIDs of Flex-Algo(128) are installed along the delay-metric shortest path.
The ISIS advertisement of Node3’s prefix 1.1.1.3/32 with the above configuration is shown in
Example 7‑2. The prefix is advertised with Prefix-SID index 3 for Algo(0) and index 803 for
Flex-Algo(128). With the default SRGB [16000-23999], these indices respectively map to the
Prefix-SID label values 16003 (Algo(0)) and 16803 (Flex-Algo(128)).

The equivalent Prefix-SID algorithm capability is available for OSPF.

Similarly, the SR extensions of BGP-LS in ietf-idr-bgp-ls-segment-routing-ext define the Algorithm Identifier advertisement in BGP-LS.

Example 7-2: ISIS Prefix-SID advertisement example

RP/0/0/CPU0:xrvr-3#show isis database verbose xrvr-3

IS-IS 1 (Level-2) Link State Database


LSPID LSP Seq Num LSP Checksum LSP Holdtime ATT/P/OL
xrvr-3.00-00 * 0x0000013e 0x20ee 1198 0/0/0
<...>
Metric: 0 IP-Extended 1.1.1.3/32
Prefix-SID Index: 3, Algorithm:0, R:0 N:1 P:0 E:0 V:0 L:0
Prefix-SID Index: 803, Algorithm:128, R:0 N:1 P:0 E:0 V:0 L:0
Prefix Attribute Flags: X:0 R:0 N:1
Source Router ID: 1.1.1.3
<...>

Each ISIS node advertises its support for a given algorithm in the Algorithm sub-TLV of the Router-
Capability TLV, as shown in Figure 7‑5 and Example 7‑3. This sub-TLV lists all identifiers of the
algorithms that the node supports. Again, the equivalent functionality is available for OSPF, where
OSPF advertises the list of supported algorithms in an SR-Algorithm TLV in the Router Information
LSA, as shown in Figure 7‑6.

Figure 7-5: Algorithms in ISIS Algorithm sub-TLV of Router Capability TLV


Figure 7-6: Algorithms in OSPF Algorithm TLV of Router Information LSA

Example 7‑3 shows the ISIS Router Capability advertisement of a node that supports Algorithms 0
(SPF), 1 (strict-SPF) and 128 (Flex-Algo(128)). An SR-enabled IOS XR node always participates in
Algo(0) and Algo(1).

Example 7-3: ISIS Router Capability advertisement example

RP/0/0/CPU0:xrvr-3#show isis database verbose xrvr-3

IS-IS 1 (Level-2) Link State Database


LSPID LSP Seq Num LSP Checksum LSP Holdtime ATT/P/OL
xrvr-3.00-00 * 0x0000013e 0x20ee 1198 0/0/0
<...>
Hostname: xrvr-3
Router Cap: 1.1.1.3, D:0, S:0
Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
SR Algorithm:
Algorithm: 0
Algorithm: 1
Algorithm: 128
Node Maximum SID Depth:
Label Imposition: 10
Router ID: 1.1.1.3
<...>

The Router Capability TLV can also contain the Flex-Algo Definition sub-TLVs, which will be
covered in the next section.
7.2 Algorithm Definition
An operator can define its own custom Flexible Algorithms and assign an algorithm identifier to it.
The user-defined algorithms have an identifier value between 128 and 255.

The Flex-Algo definition consists of three elements:

A calculation type

An optimization objective

Optional constraints

For example: use SPF to minimize IGP metric and avoid red colored links, or use SPF to minimize
delay metric.

The calculation type indicates the method to compute the paths. At the time of writing, two calculation
types have been defined: SPF (0) and strict-SPF (1), same as the IETF-defined SR algorithms. Both
types use the same algorithm (Dijkstra’s SPF), but the strict-SPF type does not allow a local policy to
override the SPF-computed path with a different path. IOS XR uses only type 0 (SPF) for Flex-Algo
computations.

The optimization objective is defined as the minimization of a specific metric type. On IOS XR, the
metric type can be IGP, TE or link-delay. The path to each Prefix-SID of the Flex-Algo is computed
such that the accumulated metric of the specified type is minimal, considering the Flex-Algo
constraints.

These constraints consist of zero, one, or several sets of resources, typically identified by their affinity
color, to be included in or excluded from the path to each Prefix-SID of the Flex-Algo.

A node that has a local definition of a particular Flex-Algo can advertise the Flex-Algo definition in
its Router Capability TLV (ISIS) or Router Information LSA (OSPF). These definition advertisements
will be used to select a consistent Flex-Algo definition on all participating nodes. How this
consistency is ensured is discussed further in the next section. By default, a local Flex-Algo definition is not
advertised.
Figure 7-7: All three Prefix-SID paths from Node1 to Node3

Example 7‑4 illustrates the Flex-Algo definition configuration of Node3 in Figure 7‑7. Two Flex-
Algo definitions are configured. Flex-Algo(128) provides the low-delay path without constraints, as
explained in section 7.1, while Flex-Algo(129) provides a constrained IGP shortest path that avoids
links with affinity color RED. The priority of both Flex-Algo definitions is 100. The default priority
is 128. This priority value is used in the Flex-Algo Definition consistency mechanism as described in
the next section. Node3 advertises both Flex-Algo definitions.
Example 7-4: Flex-Algo definition and affinity configuration on Node3

1 router isis 1
2 affinity-map RED bit-position 2
3 !
4 flex-algo 128
5 priority 100
6 metric-type delay
7 advertise-definition
8 !
9 flex-algo 129
10 priority 100
11 !! default: metric-type igp
12 advertise-definition
13 affinity exclude-any RED
14 !
15 interface GigabitEthernet0/0/0/3
16 !! link to Node5
17 affinity flex-algo RED

For this illustration, we have colored the link between Node3 and Node5 RED by adding the affinity
RED to interface Gi0/0/0/3 (lines 15-17). RED is a locally significant user-defined name that
indicates the bit in position 2 of the affinity bitmap (line 2 in the example). Therefore, Flex-
Algo(129) paths will not traverse this link. Notice that the link affinity is configured under the IGP.
Flex-Algo link affinities are advertised as application specific TE link attributes, as further described
in section 7.2.2.

Node3 advertises a third Prefix-SID for prefix 1.1.1.3/32: 16903 with Flex-Algo(129).

Since all nodes participate in Flex-Algo(129), each node computes the path to 1.1.1.3/32, according
to the algorithm minimize IGP, avoid red links, and installs the forwarding entry for Prefix-SID
16903 accordingly, as illustrated in Figure 7‑8.
Figure 7-8: Example Prefix-SID of algorithm with constraints

As an example, the Flex-Algo(129) path from Node1 to Node3 is 1→6→5→4→3.

Without using the Flex-Algo SIDs, this path can be encoded in the SID list <16004, 16003>, where
16004 is the Algo(0) Prefix-SID of Node4 and 16003 the Algo(0) Prefix-SID of Node3.

The other Prefix-SIDs of Node3 still provide the path according to their associated algorithm: Prefix-
SID 16003 provides the IGP shortest path to Node3, and Prefix-SID 16803 provides the low-delay
path to Node3. All three paths from Node1 to Node3 are shown in Figure 7‑7.

7.2.1 Consistency
While IETF-defined SR Algorithms are standardized and hence globally significant, a Flex-Algo is
defined within the scope of an IGP domain. Different definitions for a given Flex-Algo identifier may
thus co-exist at the same time over the global Internet.
The paths towards Flex-Algo Prefix-SIDs are computed on each participating node as per the
definition of this Flex-Algo. To avoid routing loops, it is crucial that all nodes participating in a Flex-
Algo have a consistent definition of this algorithm. Therefore, the nodes must agree on a unique Flex-
Algo definition within the advertisement scope of the Flex-Algo identifier.

A node can participate in a Flex-Algo without a local definition for this Flex-Algo. At least one node
in the flooding domain must advertise the Flex-Algo definition. Multiple advertisers are
recommended for redundancy.

In order to ensure Flex-Algo definition consistency, every node that participates in a particular Flex-
Algo, selects the Flex-Algo definition based on the following rules:

1. From the Flex-Algo definition advertisements in the area (including both locally generated and
received advertisements), select the one(s) with the highest priority.

2. If there are multiple Flex-Algo definition advertisements with the same highest priority, select the
one that is originated from the node with the highest System-ID in case of ISIS or highest Router-
ID in case of OSPF.

The locally configured definition of the Flex-Algo is only considered if it is advertised. It is then
treated equally in the selection process as the received definitions advertised by remote nodes. Flex-
Algo definition advertisement is disabled by default.

If a node does not send nor receive any definition advertisement for a given Flex-Algo, or if it does
not support the selected Flex-Algo definition (e.g., an unsupported metric-type), the node locally
disables this Flex-Algo. It stops advertising support for the Flex-Algo and removes the forwarding
entries for all Prefix-SIDs of this Flex-Algo.

Thus, if the definition of a Flex-Algo is not advertised by any node, this Flex-Algo is non-functional.
Lacking a definition, all participating nodes disable this Flex-Algo.
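
These rules translate into a small deterministic selection function, sketched below in Python for the ISIS case (highest System-ID as tie-breaker). The data structures are assumptions made for this illustration, not an actual implementation.

# Illustrative sketch of the Flex-Algo definition selection (ISIS case).
# Each advertisement is assumed to be a dict with 'priority', 'system_id'
# and 'definition' keys.
def select_flex_algo_definition(advertisements):
    # advertisements: locally generated and received definition advertisements
    # for one Flex-Algo identifier within the area.
    if not advertisements:
        # No definition advertised: the Flex-Algo is non-functional and the
        # node locally disables it.
        return None
    # Rule 1: highest priority wins.
    # Rule 2: tie-break on the highest System-ID of the originating node.
    best = max(advertisements,
               key=lambda adv: (adv["priority"], adv["system_id"]))
    return best["definition"]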

The format of the Flex-Algo definition advertisement is described in the next section.
Consistent Flex-Algo definitions
“Common understanding of the Flex-Algo definition between all participating routers is absolutely needed for correct
operation of the Flex-Algo technology. Very often, when consistency between multiple devices is requested, people think of
it as a potential danger. What happens if there is a conflict? How do I make sure the consistency is achieved? What if
things are misconfigured?

When we designed Flex-Algo technology we tried to avoid these problems. The selection of the Flex-Algo definition is
done based on strict rules and with deterministic outcome. There can never be an inconsistency or a conflict. Every
participating node will use the same selection algorithm that is guaranteed to produce a single definition for the particular
Flex-Algo. ”

— Peter Psenak

7.2.2 Definition Advertisement


The Flex-Algo definition is advertised in ISIS using the Flex-Algo Definition sub-TLV shown in
Figure 7‑9. This sub-TLV is advertised as a sub-TLV of the Router Capability TLV.

Figure 7-9: ISIS Flex-Algo Definition sub-TLV format

The Flex-Algo definition is advertised in OSPF using the Flex-Algo Definition TLV shown in
Figure 7‑10. This TLV is advertised as a top-level TLV of the Router Information LSA.
Figure 7-10: OSPF Flex-Algo Definition TLV format

The fields in the Flex-Algo Definition sub-TLV (ISIS) and TLV (OSPFv2) (Figure 7‑9 and
Figure 7‑10) are:

Flex-Algorithm: Flex-Algo identifier. Value between 128 and 255.

Metric-Type: type of metric to be used during the calculation.

0. IGP metric – default in IOS XR

1. Minimum Unidirectional Link Delay (RFC7810)

2. TE metric (RFC5305)

Calc-Type: the calculation type used to compute paths for the Flex-Algorithm2. 0: SPF algorithm,
1: strict-SPF algorithm – in IOS XR: SPF (0)

Priority: the priority of the advertisement; higher value is more preferred – default in IOS XR: 128

Sub-TLVs: optional sub-TLVs.

At the time of writing, the available optional sub-TLVs specify the inclusion and exclusion of link
affinity colors (also known as Administrative Groups) in the path computation as follows.

Exclude-any (type 1): exclude links that have any of the specified link colors

Include-any (type 2): only include links that have any of the specified link colors
Include-all (type 3): only include links that have all of the specified link colors

The format of the ISIS Admin Group Sub-TLVs is common for all types. The format is shown in
Figure 7‑11.

Figure 7-11: ISIS Flex-Algo include/exclude admin-group sub-TLV

The format of the OSPF Admin Group Sub-TLVs is shown in Figure 7‑12.

Figure 7-12: OSPF Flex-Algo include/exclude admin-group sub-TLV

These sub-TLVs specify which extended administrative groups (link affinity colors) must be
excluded/included from the path, according to the type of the sub-TLV. The format of the Extended
Administrative Group field is defined in RFC7308.

IETF draft-ketant-idr-bgp-ls-flex-algo adds support for Flex-Algo definitions to BGP-LS.

When a Flex-Algo is enabled on a node, ISIS advertises support for it by adding the algorithm(s) in
the Router Capability TLV. If advertisement of the Flex-Algo definition(s) is enabled, ISIS includes
these in the Router Capability TLV as well. Example 7‑5 shows the ISIS Router Capability TLV as
advertised by Node3 with the configuration in Example 7‑4.

Example 7-5: Algorithm support in ISIS Router Capability TLV of Node3

RP/0/0/CPU0:xrvr-3#show isis database verbose xrvr-3


<...snip...>
Hostname: xrvr-3
Router Cap: 1.1.1.3, D:0, S:0
Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
SR Local Block: Base: 15000 Range: 1000
SR Algorithm:
Algorithm: 0
Algorithm: 1
Algorithm: 128
Algorithm: 129
Node Maximum SID Depth:
Label Imposition: 10
Flex-Algo Definition:
Algorith: 128 Metric-Type: 1 Alg-type: 0 Priority: 100
Flex-Algo Definition:
Algorith: 129 Metric-Type: 1 Alg-type: 0 Priority: 100
Flex-Algo Exclude Ext Admin Group:
0x00000004
<...snip...>

Note that the router isis configuration in Example 7‑4 also contains the affinity-map definition that
defines the Flex-Algo application specific affinity mappings. The advertisement of application
specific TE link attributes is specified for ISIS in draft-ietf-isis-te-app. In Example 7‑4, color RED is
defined as the bit in position 2 of the affinity bitmap. The color is attached to interface Gi0/0/0/3,
which is the link to Node5. Node3 now advertises its adjacency to Node5 with an affinity bitmap
where bit 2 is set (binary 100 = hexadecimal 0x4). Example 7‑6 shows the ISIS advertisement of
Node3 for its adjacency to Node5.

Example 7-6: Flex-algo affinity bitmap advertisement of Node3 in ISIS

<...snip...>
Metric: 10 IS-Extended xrvr-5.00
Interface IP Address: 99.3.5.3
Neighbor IP Address: 99.3.5.5
Link Average Delay: 12 us
Link Min/Max Delay: 12/12 us
Link Delay Variation: 0 us
Application Specific Link Attributes:
L flag: 0, SA-Length 1, UDA-Length 0
Standard Applications: FLEX-ALGO
Ext Admin Group:
0x00000004
Link Maximum SID Depth:
Label Imposition: 10
ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24035
<...snip...>
A node that participates in a given algorithm does not necessarily need to advertise a Prefix-SID for
that algorithm. However, in practice a Prefix-SID is likely not only intended to steer the traffic on the
desired path to that node, but also to use that node as part of a TI-LFA backup path. For that reason,
we will assume in this chapter that when a node participates in Flex-Algo(K), it also advertises a
Prefix-SID for Flex-Algo(K).
7.3 Path Computation
A node performs a computation per algorithm that it participates in. When computing the path for a
given Flex-Algo(K), the computing node first removes all the nodes that do not advertise their
participation for this Flex-Algo(K) as well as the resources that must be avoided according to the
constraints of Flex-Algo(K). Any link that does not have a metric as used in Flex-Algo(K) is also
pruned from the topology. The resulting topology is called Topo(K).

In Figure 7‑13, Flex-Algo(130) is defined as optimize IGP metric, exclude RED links and enabled
on all nodes except Node6. The Flex-Algo(130) topology, Topo(130), is thus derived from the
original network topology in Figure 7‑13 (a) by pruning Node6, which does not participate in Flex-
Algo(130), and the RED link between Node3 and Node5, which must be avoided as per the definition
of Flex-Algo(130). The topology Topo(130), represented in Figure 7‑13 (b), is used for the Flex-
Algo(130) path computations on all nodes participating in this algorithm.

Figure 7-13: Physical network topology and Flex-Algo(130) topology Topo(130)

The computing nodes leverage Dijkstra’s algorithm to compute the shortest path graph on Topo(K),
optimizing the type of metric as defined for Flex-Algo(K). ECMP is obviously supported in each
Flex-Algo, i.e., traffic-flows are load-balanced over multiple paths that have an equal cost for that
Flex-Algo.
Finally, the node installs the MPLS-to-MPLS forwarding entries for the Prefix-SIDs of Flex-Algo(K).
The node does not install any IP-to-MPLS or IP-to-IP forwarding entries for these Flex-Algo Prefix-
SIDs. This means that, by default, unlabeled IP packets are not steered onto a Flex-Algo Prefix-SID.
Instead, the operator can rely on SR-TE steering mechanisms to steer traffic flows on Flex-Algo
Prefix-SIDs.

For example, the Flex-Algo(130) path from Node1 to Node5 is 1→2→3→4→5.

Without using the Flex-Algo SIDs, this path can be encoded in the SID list <16002, 24023, 16004,
16005>, where 1600X is the Algo(0) Prefix-SID of NodeX and 24023 is an Adj-SID of the link from
Node2 to Node3.

Also note that, although the network topology and the Flex-Algo definition are the same as for Flex-Algo(129) in Example 7‑4, the fact that Node6 does not participate in Flex-Algo(130) drastically
changes the resulting path compared to Figure 7‑8 (1→6→5→4→3). This property of the Flex-Algo
solution is particularly useful for use-cases like dual-plane disjoint paths, as described in section 7.6.

The path computation is performed by any node that participates in Flex-Algo(K). If a node
participates in more than one Flex-Algo, then it performs the described computation independently for
each of these Flex-Algos. In this example, all nodes support Algo(0) and thus perform an independent
path computation for this algorithm, based on the regular IGP topology.
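
To summarize the computation described in this section, the Python sketch below derives Topo(K) by pruning non-participating nodes, excluded links, and links lacking the required metric, and then runs a plain Dijkstra SPF minimizing the Flex-Algo's metric type. It uses assumed data structures, omits ECMP bookkeeping and is only a simplified illustration, not the actual IGP implementation.

# Simplified sketch: derive Topo(K), then run Dijkstra with the Flex-Algo
# metric type. Illustration only; data structures are hypothetical.
import heapq

def flex_algo_spf(src, nodes, links, flex_algo):
    # nodes:     {node: set of algorithm identifiers it participates in}
    # links:     list of {'a', 'b', 'metrics': {type: value}, 'colors': set}
    # flex_algo: {'id', 'metric_type', 'exclude_any': set of colors}
    # Build Topo(K): keep participating nodes, drop excluded links and links
    # that do not advertise the metric type used by this Flex-Algo.
    topo = {n: [] for n, algos in nodes.items() if flex_algo["id"] in algos}
    for link in links:
        a, b = link["a"], link["b"]
        if a not in topo or b not in topo:
            continue                                  # non-participating node
        if link["colors"] & flex_algo["exclude_any"]:
            continue                                  # excluded by constraint
        metric = link["metrics"].get(flex_algo["metric_type"])
        if metric is None:
            continue                                  # metric not advertised
        topo[a].append((b, metric))
        topo[b].append((a, metric))
    if src not in topo:
        return {}                                     # source not in Topo(K)
    # Dijkstra SPF on Topo(K), returning the best metric to each node.
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, metric in topo[u]:
            if d + metric < dist.get(v, float("inf")):
                dist[v] = d + metric
                heapq.heappush(heap, (d + metric, v))
    return dist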

ECMP Load Balancing

ECMP-awareness is an intrinsic property of Prefix-SIDs: traffic flows steered over a Prefix-SID are
load-balanced over all the equal-cost shortest paths towards the associated prefix.

This concept is fairly simple to understand with the regular IGP Prefix-SIDs (algorithms 0 and 1),
since the traffic flows merely follow the IGP shortest paths. With Flex-Algo, it is important to
understand that the term “shortest paths” is meant in a generic way, considering the optimization
metric and constraints of the associated Prefix-SID algorithm definition. If the Prefix-SID
algorithm is a Flex-Algo, then the shortest paths are computed on the Flex-Algo topology, as defined
in the previous section, and with the optimization metric specified in the Flex-Algo definition as link
metric. These are the only shortest paths being considered for ECMP load-balancing. In particular, the
IGP metric has no impact at all on the ECMP decisions for a Flex-Algo defined as optimizing the TE
or delay metric.
This particular property allows SR paths that would have required multiple SID lists with Algo(0) or
Algo(1) Prefix-SIDs to be expressed with a single Flex-Algo SID or SID list. This is an important benefit
for operational efficiency and robustness.

In the network of Figure 7‑14, Flex-Algo(128) is enabled on all nodes alongside Algo(0). The
definition of Flex-Algo(128) is minimize delay metric.

All links in the network have an IGP metric 10, except for the links between Node2 and Node3 and
between Node4 and Node5 that have IGP metric 100.

All links in the network have a link-delay metric 10, except for the link between Node6 and Node7
that has link-delay metric 92.

Node8 advertises a Prefix-SID 16008 for Algo(0) and a Prefix-SID 16808 for Flex-Algo(128).

All nodes compute the IGP shortest path to Node8 and install the forwarding entry for Node8’s
Prefix-SID 16008 along that path. This Prefix-SID’s path from Node1 to Node8 (1→6→7→8) is
shown in Figure 7‑14.

All nodes compute the low-delay path to Node8 and install the forwarding entry for Node8’s Flex-
Algo(128)’s Prefix-SID 16808 following that path. Node1 has two equal delay paths to Node8, one
via Node2 and another via Node4. Therefore, Node1 installs these two paths for Node8’s Flex-
Algo(128)’s Prefix-SID 16808. This Prefix-SID’s equal cost paths (1→2→3→8 and 1→4→5→8)
are shown in Figure 7‑14. Node1 load-balances traffic-flows to this Prefix-SID over these two paths.
Figure 7-14: Example Flex-Algo ECMP

Without using the Flex-Algo SIDs, this low-delay ECMP path to Node8 requires two SID lists
<16002, 24023, 16008> and <16004, 24045, 16008>, where 16002, 16004, and 16008 are the
Algo(0) Prefix-SIDs of Node2, Node4, and Node8 respectively, and 24023 and 24045 the Adj-SIDs
of the links Node2-Node3 and Node4-Node5 respectively. The Adj-SIDs are required to cross the
high IGP metric links.
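
As a sketch, these two SID lists could be configured as explicit segment lists using the segment-list syntax shown elsewhere in this book; the list names are purely illustrative:

segment-routing
traffic-eng
segment-list DELAY-VIA-NODE2
index 10 mpls label 16002 !! Prefix-SID of Node2
index 20 mpls label 24023 !! Adj-SID Node2->Node3
index 30 mpls label 16008 !! Prefix-SID of Node8
!
segment-list DELAY-VIA-NODE4
index 10 mpls label 16004 !! Prefix-SID of Node4
index 20 mpls label 24045 !! Adj-SID Node4->Node5
index 30 mpls label 16008 !! Prefix-SID of Node8

With Flex-Algo(128), the single Prefix-SID 16808 replaces both of these SID lists and natively load-balances the traffic over the two low-delay paths.
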
7.4 TI-LFA Backup Path
If TI-LFA is enabled on a node, then this node provisions the TI-LFA backup paths for the Prefix-
SIDs of each Flex-Algo(K) that it supports. The backup path for a Prefix-SID(K) is computed and
optimized as per the definition of Flex-Algo(K). Only Flex-Algo(K) Prefix-SIDs and unprotected
Adj-SIDs are used in the SID list that encodes the Flex-Algo(K) backup path. This ensures that TI-
LFA backup paths meet the same constraints and minimize the same link metric as their primary paths.

The Flex-Algo TI-LFA functionality is illustrated with the topology in Figure 7‑15. It is similar to the topologies used earlier in the chapter, except that all links have the same IGP metric (10).

Figure 7-15: Flex-Algo TI-LFA topology

All nodes in the network participate in Algo(0) and in Flex-Algo(129). Flex-Algo(129) is defined to
optimize the IGP metric while avoiding RED links, as shown in the configuration in Example 7‑7.

Example 7-7: Flex-Algo definition configuration on all nodes

router isis 1
flex-algo 129
!! default: metric-type igp
advertise-definition
affinity exclude-any RED
Node1 advertises a Prefix-SID 16001 for Algo(0) and a Flex-Algo(129) Prefix-SID 16901. Node5
advertises Algo(0) Prefix-SID 16005 and Flex-Algo(129) Prefix-SID 16905.

TI-LFA protection is enabled on Node2, as shown in Example 7‑8. After enabling TI-LFA, Node2
computes the backup paths for all algorithms it participates in. We focus on the TI-LFA backup paths
for the Prefix-SIDs advertised by Node1.

Example 7-8: TI-LFA enabled on Node2

router isis 1
interface Gi0/0/0/0
!! link to Node1
point-to-point
address-family ipv4 unicast
fast-reroute per-prefix
fast-reroute per-prefix ti-lfa

The primary paths on Node2 for both of Node1’s Prefix-SIDs (16001 and 16901) are via the direct link
to Node1, as shown in Figure 7‑15. TI-LFA independently computes backup paths for the Prefix-SIDs
of each algorithm, using the topology as seen by the algorithm.

Algo(0) Backup Path

Let us start with the TI-LFA backup paths for the IGP SPF algorithm (Algo(0)) Prefix-SIDs. TI-LFA
uses the Algo(0) topology, Topo(0), to compute the backup paths for the Algo(0) Prefix-SIDs.
Topo(0) contains all nodes and all links in the network, since all nodes participate in Algo(0) and
Algo(0) has no constraints.

The Algo(0) TI-LFA link-protecting backup path on Node2 for prefix 1.1.1.1/32 is the post-
convergence path3 2→3→5→6→1, as illustrated in Figure 7‑16. This backup path can be encoded by
the SID list <16005, 16001>, only using Algo(0) Prefix-SIDs. This SID list first brings the packet via
the Algo(0) path (this is the IGP shortest path) to Node5 (2→3→5) using the Prefix-SID(0) 16005 of
Node5, and then via the Algo(0) path to Node1 (5→6→1) using the Prefix-SID(0) 16001 of Node1.
Figure 7-16: TI-LFA backup paths for each algorithm

Algo(129) Backup Path

To compute the Flex-Algo(129) TI-LFA backup path for the Flex-Algo(129) Prefix-SIDs, IGP on
Node2 first derives the topology Topo(129) to be used for Flex-Algo(129) computations. For this,
IGP removes all nodes that do not participate in Flex-Algo(129) and all links that are excluded by the
Flex-Algo(129) constraints.

In this example, all nodes participate in Flex-Algo(129), therefore all nodes stay in Topo(129). Flex-
Algo(129)’s definition specifies to avoid RED colored links, therefore the RED colored link between
Node3 and Node5 is removed from the topology.

The Flex-Algo(129) TI-LFA link-protecting backup path on Node2 for Prefix-SID 16901 is the post-
convergence path 2→3→4→5→6→1, as illustrated in Figure 7‑16. This backup path can be encoded
by the SID list <16905, 16901>, using Flex-Algo(129) Prefix-SIDs. This SID list first brings the
packet via the Flex-Algo(129) shortest path to Node5 (2→3→4→5) using the Prefix-SID(129)
16905 of Node5 and then via the Flex-Algo(129) shortest path to Node1 (5→6→1) using the Prefix-
SID(129) 16901 of Node1.

Notice that the Flex-Algo(129) TI-LFA backup path avoids the RED link, as specified in the Flex-
Algo(129) definition.
TI-LFA/Flex-Algo gain
“One of the major benefits of Flex-Algo is its ability to provision dynamic constrained paths based on a single SR label with
local repair (TI-LFA) respecting the same constraints as the primary path.

We looked into different options such as Multi-Topology (MT)-SIDs, various SR-TE policy-based routing solutions,
however none of the currently available techniques except Flex-Algo SR-TE approach could efficiently support ~50ms
recovery from a network node or link failure and at the same time to prevent the traffic traversal via an undesirable path.

Flex-Algo TI-LFA optimum fast-reroute capabilities could be well suited for unicast as well as multicast flows to
significantly reduce operational overhead and greatly simplify our overall end-to-end service model. ”

— Arkadiy Gulko

The properties of TI-LFA still hold with Flex-Algo: the backup path is tailored along the post-
convergence path, at most two SIDs are required to express the backup path in symmetric metric
networks, etc.
7.5 Integration With SR-TE
Flex-Algo is inherently part of SR. It leverages the algorithm functionality that is part of the Prefix-
SID definition since day 1.

Flex-Algo is also part of the SR-TE Architecture. It enriches the set of SIDs that are available for
SR-TE to encode SLA paths and is fully integrated with the other SR-TE mechanisms such as ODN
and AS.

Any path in the network can be encoded as a list of Adj-SIDs. Prefix-SIDs of any algorithm allow
SR-TE paths to be expressed in a better way by leveraging the IGP distributed path calculation and
robustness. Not only do they drastically reduce the number of SIDs required in the SID list, but they
also bring ECMP capabilities and all the IGP resiliency mechanisms to SR-TE.

Flex-Algo generalizes the concept of the Prefix-SID, which was previously limited to the unconstrained shortest IGP path, to any operator-defined intent. The Flex-Algo Prefix-SIDs thus allow SR-TE to meet more
accurately the operator’s intent for an SR path, with even fewer SIDs in the SID list, more ECMP
load-balancing, IGP-based re-optimization after a topology change and properly tailored TI-LFA
backup paths.

Assume Flex-Algo(128) is enabled on all nodes in the network of Figure 7‑17 with algorithm
definition minimize delay metric without constraints. By default, Algo(0) is also enabled on all
nodes, providing the unconstrained IGP shortest path.
Figure 7-17: Flex-Algo integration with SR-TE

When configuring on Node1 an SR Policy to Node3 with a dynamic path optimizing the IGP metric,
the computed path is 1→6→5→3. SR-TE would encode this path as SID list <16003>, only
containing the Algo(0) Prefix-SID 16003 of Node3.

SR-TE could have encoded the computed path using various other SID lists, such as <16006, 16005,
16003>, with 1600X the Algo(0) Prefix-SID of NodeX, or <24016, 24065, 24053>, with 240XY the
Adj-SID of NodeX to NodeY.

In this scenario, SR-TE decides to encode the path with the Prefix-SID 16003 because this SID is the
most appropriate one considering the dynamic path definition. It not only provides the shortest SID
list, but it also provides the best resiliency and scaling. If SR-TE would encode the path with SID list
<16006, 16005, 16003>, SR-TE would have to update the SID list when e.g., the link between Node1
and Node6 fails, since the existing SID list no longer encodes the new IGP shortest path 1→2→3.
When using SID list <16003>, SR-TE does not have to update the SID list after this failure since the
IGP updates the path of Prefix-SID 16003 to the new IGP shortest path.
Similarly, when an SR Policy to Node3 is configured on Node1 with a dynamic path optimizing
delay, SR-TE computes the low-delay path as 1→2→3. SR-TE can encode this path with the SID list
<16002, 24023>, with 16002 the Prefix-SID of Node2 and 24023 the Adj-SID of Node2 to Node3.

However, SR-TE has another more appropriate SID available, which is the Flex-Algo(128) Prefix-
SID 16803 of Node3. This Flex-Algo SID allows the path to be expressed as a single SID <16803>, and it also provides better resiliency and scaling for the same reasons as in the previous example. Upon
failure of the link between Node2 and Node3, SR-TE does not have to update the SID list since the
IGP will update the path of Prefix-SID 16803. In addition, the TI-LFA backup of 16803 is tailored
along the low-delay post-convergence path since IGP uses the Flex-Algo definition when computing
the TI-LFA backup path.
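
A sketch of such a delay-optimized SR Policy configuration on Node1 is shown below, using the dynamic-path syntax of chapter 8; the policy name and color value are illustrative:

segment-routing
traffic-eng
policy LOW-DELAY
color 20 end-point ipv4 1.1.1.3
candidate-paths
preference 100
dynamic
metric
type delay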

SR-TE can also combine SIDs of different types in a SID list, as illustrated in the next example.

The network in Figure 7‑18 illustrates the combination of Algo(0) Prefix-SIDs with a Flex-Algo
Prefix-SID. The network consists of three domains, two edge domains (Edge1 and Edge2) and a core
domain (Core).

Figure 7-18: Combining different types of SIDs in an SR Policy SID list

Since there is no significant delay difference in the edge domains, only Algo(0) is enabled in these
domains.
To provide low-delay paths in the core domain, a Flex-Algo(128) is defined as minimize delay
metric and all core nodes participate in this Flex-Algo(128).

The low-delay path from Node11 to Node31 combines Algo(0) SIDs in the edge domains with a
Flex-Algo(128) SID in the core domain. Node11 imposes a SID list <16001, 16803, 16031> on the
low-delay traffic. This SID list first steers the packet on the IGP shortest path to Node1 (Algo(0)
Prefix-SID 16001), then on the low-delay path to Node3, using the Flex-Algo(128) Prefix-SID 16803
and finally on the IGP shortest path to Node31 using Algo(0) Prefix-SID 16031.
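
As a sketch, Node11 could instantiate this path as an explicit SR Policy using the segment-list syntax shown elsewhere in this book; the names and color value are illustrative and the endpoint address 1.1.1.31 assumes the 1.1.1.X loopback convention used in this chapter:

segment-routing
traffic-eng
segment-list EDGE-CORE-EDGE
index 10 mpls label 16001 !! Algo(0) Prefix-SID of Node1
index 20 mpls label 16803 !! Flex-Algo(128) Prefix-SID of Node3
index 30 mpls label 16031 !! Algo(0) Prefix-SID of Node31
!
policy LOW-DELAY
color 30 end-point ipv4 1.1.1.31
candidate-paths
preference 100
explicit segment-list EDGE-CORE-EDGE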

7.5.1 ODN/AS
As we have seen in chapter 5, "Automated Steering" and chapter 6, "On-Demand Nexthop", the
Automated Steering functionality steers a service route (e.g., BGP route) into the SR Policy that is
identified by the nexthop and color of the service route. The On-demand Nexthop functionality
enables automatic instantiation of an SR Policy path when receiving a service route.

There is no change in the ODN and AS functionalities when using Flex-Algo SIDs to encode an SR
Policy path. ODN and AS work irrespective of the composition of the SID list.

One way to specifically enforce the use of Flex-Algo SIDs is to specify the SID algorithm identifier
as a constraint of the dynamic path. This way the Flex-Algo identifier is mapped to an SLA color. SR-
TE will express the path of that SLA color using SIDs of the mapped Flex-Algo identifier.

For example, assume that the operator identifies low-delay by SLA color green (value 30).

Flex-Algo(128) is defined to provide the low-delay path. Therefore, SLA color green (value 30) can
be mapped to Flex-Algo(128). For this purpose, an on-demand color template is configured on the
headend node.

Example 7‑9 shows the on-demand color template for color 30, indicating Flex-Algo(128) SIDs must
be used to encode the path.
Example 7-9: ODN configuration

segment-routing
traffic-eng
on-demand color 30
dynamic
sid-algorithm 128

Assume that the on-demand template in Example 7‑9 is applied on Node1 in Figure 7‑17.

When a service route arrives on headend Node1, with BGP nexthop 1.1.1.3 and color green (30), the
ODN functionality instantiates an SR Policy to 1.1.1.3 (Node3) based on the ODN template for color
green. The template restricts the SIDs of the SR Policy’s SID list to Flex-Algo(128) SIDs. The
solution SID list consists of a single SID: the Flex-Algo(128) Prefix-SID 16803 of Node3.
Example 7‑10 shows the status of this ODN SR Policy.

Example 7-10: ODN SR Policy using Flex-Algo SID

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.3


Name: srte_c_30_ep_1.1.1.3
Status:
Admin: up Operational: up for 00:02:03 (since Apr 9 17:11:52.137)
Candidate-paths:
Preference: 200 (BGP ODN) (active)
Requested BSID: dynamic
Constraints:
Prefix-SID Algorithm: 128
Dynamic (valid)
16803 [Prefix-SID: 1.1.1.3, Algorithm: 128]
Preference: 100 (BGP ODN)
Requested BSID: dynamic
Dynamic (pce) (invalid)
Metric Type: NONE, Path Accumulated Metric: 0
Attributes:
Binding SID: 40001
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

BGP on Node1 then uses Automated Steering to steer the service route into this ODN SR Policy.

7.5.2 Inter-Domain Paths


Consider the multi-domain network as shown in Figure 7‑19. All nodes in both domains participate in
Flex-Algo(128), which is defined as optimize delay metric without constraints. Dynamic link delay
measurement is enabled on all links in both domains. The measured link-delays are displayed next to
the links.

Figure 7-19: Flex-Algo ODN and AS Inter-Domain Delay

The operator wants to use ODN to automatically instantiate low-delay paths for service routes,
leveraging the low-delay Flex-Algo SIDs in both domains. Since these are inter-domain paths, an SR
PCE is required to compute these paths.

The operator uses SLA color 30 to identify low-delay service paths. Node9 advertises a service
route 128.9.9.0/24 that requires a low-delay path. Therefore, Node9 attaches color extended
community 30 to this route.

Upon receiving this route, BGP on Node0 requests SR-TE to provide a path to Node9, with the
characteristics that are specified in the on-demand template for color 30.

The operator wants to leverage the Flex-Algo SIDs, as indicated before. In this example there are two
options to achieve this, using two different ODN templates. In the first option, the ODN template on
headend Node0 only specifies the low-delay optimization objective and lets the PCE select the most
appropriate SIDs to meet this objective. In a second option, the ODN template on headend Node0
explicitly specifies to use SIDs of Flex-Algo(128).
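
As an illustration, the two on-demand templates could look along the following lines. This is a sketch that combines the on-demand template syntax of Example 7‑9 with the PCE-computed dynamic-path syntax used in chapter 8; the exact keywords available under an on-demand template may depend on the software release.

!! Option 1 - express the intent (low delay)
segment-routing
traffic-eng
on-demand color 30
dynamic
pcep
!
metric
type delay

!! Option 2 - constrain the SID algorithm
segment-routing
traffic-eng
on-demand color 30
dynamic
pcep
!
sid-algorithm 128
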
Both options lead to the same result in this example. The SR PCE computes the path and replies to
Node0 with the SID list <16805, 16809>. Both SIDs are Flex-Algo(128) Prefix-SIDs. The first
Prefix-SID in the SID list, 16805, brings the packet to Node5 via the low-delay path. The second
Prefix-SID, 16809, brings the packet from Node5 to Node9 via the low-delay path. SR-TE on Node0
instantiates an SR Policy to Node9 with the received SID list and steers the service route
128.9.9.0/24 into this SR Policy.

However, the second option requires that the operator associates the Flex-Algo definition optimize
delay metric without constraints with the same algorithm identifier, in this case 128, in both
domains. If this definition is associated with Flex-Algo(128) in Domain1 and Flex-Algo(129) in
Domain2, then the SID algorithm constraint would prevent SR-TE from returning the appropriate
path. It is thus recommended that dynamic paths and ODN templates are configured with the actual
intent for the SR path and that the SID algorithm constraint is only used when the intent cannot be
expressed otherwise. An example of intent that would require the SID algorithm constraint is
described in the next section.

The second option also requires support in PCEP to signal the required SID algorithm to the SR PCE.
At the time of writing, this PCEP extension is not available.
7.6 Dual-Plane Disjoint Paths Use-Case
Figure 7‑20 shows a dual-plane network topology. Plane1 consists of nodes 1 to 4 while Plane2
consists of nodes 5 to 8. Node0 and Node9 are part of both planes.

Figure 7-20: Flex-Algo dual-plane network design

In this use-case, the operator needs to provide strict disjoint paths for certain traffic streams on the
two planes, even during failure conditions. For other traffic, the operator wants to benefit from all
available ECMP and resiliency provided by the dual-plane design model. This other traffic should
therefore not be restricted to a single plane.

The operator achieves these requirements by using the Flex-Algo functionality. This functionality can
indeed provide strict disjoint paths, even during failure conditions when the traffic is directed on the
TI-LFA backup path. Furthermore, the Flex-Algo solution does not require the use of affinity link colors or additional loopback prefixes on the network nodes.
By default, all SR nodes automatically participate in the Algo(0) (IGP SPF) topology.

The operator defines two Flex-Algos and identifies them with numbers 128 and 129. Both algorithms
have the same definition: optimize IGP metric, no constraints. The operator enables Flex-Algo(128)
on all the Plane1 nodes (0, 1, 2, 3, 4, 9) and Flex-Algo(129) on all the Plane2 nodes (0, 5, 6, 7, 8, 9).

If a node advertises its participation in a Flex-Algo, it typically also advertises a Prefix-SID for that
Flex-Algo. As an example, Figure 7‑21 shows the different Prefix-SIDs that Node2, Node7, and
Node9 advertise for their loopback prefix. The Prefix-SIDs advertised by the other nodes are not
shown. Node9, for example, advertises the following Prefix-SIDs for its loopback prefix 1.1.1.9/32:

Algo(0) (SPF): 16009

Flex-Algo(128): 16809

Flex-Algo(129): 16909

Figure 7-21: Flex-Algo Prefix-SID assignment in dual-plane network

Example 7‑11 shows the relevant ISIS configuration of Node9. The flex-algo definitions are empty,
since IGP metric is optimized by default and no constraint is specified. The Prefix-SID configurations
for the different algorithms are shown under interface Loopback0.

Example 7-11: Flex-Algo use-case – Dual-plane, ISIS configuration of Node9

interface Loopback0
ipv4 address 1.1.1.9/32
!
router isis 1
flex-algo 128
!! default: metric-type igp
advertise-definition
!
flex-algo 129
!! default: metric-type igp
advertise-definition
!
address-family ipv4 unicast
segment-routing mpls
!
interface Loopback0
address-family ipv4 unicast
prefix-sid absolute 16009
prefix-sid algorithm 128 absolute 16809
prefix-sid algorithm 129 absolute 16909
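
A node that belongs to only one plane enables only the corresponding Flex-Algo. The following sketch shows what this could look like for Node2 (Plane1 only), in the same style; the Prefix-SID values are those of Figure 7‑21 and the loopback address assumes the 1.1.1.X convention used in this chapter:

interface Loopback0
ipv4 address 1.1.1.2/32
!
router isis 1
flex-algo 128
!! default: metric-type igp
advertise-definition
!
address-family ipv4 unicast
segment-routing mpls
!
interface Loopback0
address-family ipv4 unicast
prefix-sid absolute 16002
prefix-sid algorithm 128 absolute 16802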

On each node, ISIS computes the Shortest Path Trees (SPTs) for the different algorithms that the node
participates in. ISIS starts by deriving the topology for each algorithm by pruning the non-
participating nodes and excluded links from the topology graph. This results in the topologies shown
in Figure 7‑22: Topo(0), Topo(128), and Topo(129).
Figure 7-22: Flex-Algo Topologies – Topo(0), Topo(128), and Topo(129)

On each node, after computing the SPT for a given algorithm topology, ISIS installs the forwarding
entries for the Prefix-SIDs of the related algorithm. As an example, Figure 7‑23 shows the paths from
Node0 to Node9 for the three Prefix-SIDs that are advertised by Node9. Notice that the Algo(0)
Prefix-SID 16009 leverages all available ECMP, of both planes. Flex-Algo(128) and Flex-Algo(129)
Prefix-SIDs leverage the available ECMP, constrained to their respective plane.
Figure 7-23: Flex-Algo Prefix-SID path examples

With the exception of the Algo(0) Prefix-SID, only MPLS-to-MPLS (label swap) and MPLS-to-IP
(label pop) entries are installed for the Prefix-SIDs. IP-to-MPLS (label push) forwarding entries are
typically installed for Algo(0) Prefix-SIDs4.

As an example, the following MPLS forwarding entries are installed on Node0:

Algo(0):

In 16009, out 16009 via Node1 or Node5

In 16002, out 16002 via Node1

In 16007, out 16007 via Node5

Flex-Algo(128):

In 16809, out 16809 via Node1

In 16802, out 16802 via Node1

Flex-Algo(129):

In 16909, out 16909 via Node5

In 16907, out 16907 via Node5

Node1 installs the following forwarding entries:

Algo(0):

In 16009, out 16009 via Node2 or Node4

In 16002, out 16002 via Node2

In 16007, out 16007 via Node2, Node4, or Node5

Flex-Algo(128):
In 16809, out 16809 via Node2 or Node4

In 16802, out 16802 via Node2

Flex-Algo(129):

None, Node1 does not participate in Flex-Algo(129)

When enabling TI-LFA, the traffic on a given Flex-Algo Prefix-SID will be protected by a backup
path that is constrained to the topology of the Flex-Algo. This implies that even in failure cases the
protected traffic will stay in the Flex-Algo’s topology plane. Since traffic carried on Algo(0) Prefix-
SIDs is not constrained to a single plane, this does not apply to these Prefix-SIDs. Traffic to Algo(0)
Prefix-SIDs can be deviated to another plane in failure cases.
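
Obtaining these plane-constrained backup paths requires no Flex-Algo-specific configuration; TI-LFA is simply enabled per interface, as in Example 7‑8. A sketch for one of the interfaces of a plane node (the interface name is illustrative):

router isis 1
interface Gi0/0/0/0
point-to-point
address-family ipv4 unicast
fast-reroute per-prefix
fast-reroute per-prefix ti-lfa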

The operator can steer service traffic on any of the planes by attaching a specific SLA color to this
service route. If a service route has no color, then traffic flows are steered on the Algo(0) Prefix-SID
and are shared over both planes.

The operator chooses the SLA color value 1128 to indicate Flex-Algo(128), and SLA color value
1129 to indicate Flex-Algo(129).

Assume that Node9 advertises three service routes, for an L3VPN service in this example, to Node0.
L3VPN prefix 128.9.9.0/24 must be steered on a Flex-Algo(128) path, L3VPN prefix 129.9.9.0/24
must be steered on a Flex-Algo(129) path, and L3VPN prefix 9.9.9.0/24 must follow the regular IGP
shortest path to its nexthop. Therefore, Node9 advertises 128.9.9.0/24 with color extended
community 1128 and 129.9.9.0/24 with extended community color 1129. 9.9.9.0/24 is advertised
without color.
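
The colors are typically attached on Node9 with a BGP route-policy that sets the color extended community on the advertised routes. The following is a sketch of what this could look like in IOS XR route-policy language for color 1128; the set and policy names are hypothetical, and the prefix-matching logic and the attachment of the policy as a VRF export policy are omitted:

extcommunity-set opaque ALGO128
1128
end-set
!
route-policy COLOR-ALGO128
set extcommunity color ALGO128
end-policy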

Node0 is configured as shown in Example 7‑12, to enable the ODN functionality for both Flex-Algos
(lines 3 to 9). Color 1128 is mapped to Flex-Algo(128), and color 1129 is mapped to Flex-
Algo(129). Part of the BGP configuration on this node is included to illustrate that no BGP route-
policy is required for the ODN/AS functionality.
Example 7-12: Flex-Algo use-case – Dual-plane, Automated Steering on Node0

1 segment-routing
2 traffic-eng
3 on-demand color 1128
4 dynamic
5 sid-algorithm 128
6 !
7 on-demand color 1129
8 dynamic
9 sid-algorithm 129
10 !
11 router bgp 1
12 neighbor 1.1.1.9
13 remote-as 1
14 update-source Loopback0
15 address-family vpnv4 unicast
16 !
17 vrf Acme
18 rd auto
19 address-family ipv4 unicast

Figure 7‑24 illustrates the steering of the uncolored L3VPN prefix 9.9.9.0/24. The BGP advertisement
for this prefix arrives on Node0 without any color extended community, therefore, BGP installs the
prefix as usual, recursing on its BGP next-hop 1.1.1.9.

By default, Node0 imposes the (Algo(0)) Prefix-SID 16009 for traffic destined for 1.1.1.9/32, and for
traffic to prefixes that recurse on 1.1.1.9/32, such as 9.9.9.0/24. Since traffic flows to Prefix-SID
16009 are load-balanced over all available ECMP of both planes, this is also the case for traffic
flows to 9.9.9.0/24.

Figure 7-24: Algo(0) Prefix-SID uses both planes

The BGP advertisement for L3VPN prefix 128.9.9.0/24 arrives on Node0 with a color extended
community 1128, as shown in Figure 7‑25. Since Node0 has an SR-TE on-demand template for color
1128 configured, mapping color 1128 to Flex-Algo(128), BGP installs L3VPN prefix 128.9.9.0/24,
recursing on the Flex-Algo(128) Prefix-SID 16809 of the BGP next-hop 1.1.1.9/32. Traffic-flows
towards 128.9.9.0/24 are distributed over the available ECMP of the top plane (Flex-Algo(128))
only.

Figure 7-25: Flex-Algo(128) Prefix-SID uses one plane

Similarly, traffic flows to L3VPN prefix 129.9.9.0/24 will be distributed over the available ECMP of
the bottom plane (Flex-Algo(129)) only.

Flex-Algo Use Cases - Disjointness and many others


“It is very important to categorize operators use cases for unconventional flows that require traffic to be steered over
specific path/topology to meet explicit SLA requirements. The use cases for these flows could be categorized as follow:

1. Flex-Algo Only – Use Cases that could be framed into predefined path/topology options based on commonly used
constrains such as:

To contain the traffic within specific scoped boundaries (local/metro; sub- or multi-regional; tiered (parent traffic
should never traverse child);

Dual plane disjoint path based on strict/loose routing within a plane to support two unicast/multicast flows that
MUST traverse completely diverse logically and physically end to end path under any conditions;

Path meeting specific SLA (e.g., delay, loss)

2. Controller – Use Cases that do not fit into predefined constraint categories such as:

Dual plane Disjoint path where any path at a time can fallback to common plane based on SLA and still preserve
the disjointness;
Guaranteed BW path;

TE Multi-Layer path/visibility;

Flex-Algo could be applicable to both; in the first set of use cases Flex-Algo could provide a full solution, in the second set
of use cases Flex-Algo could effectively complement controller approach to enhance computation outcome.

Flex-Algo allows us to design any topology/path for unicast and multicast as per our intent to achieve specific business
objectives. For instance, we can define constraints, computation algorithm and apply it in a such a way to support any
diverse requirements for dual plane model per regional or global scope based on best SLA performance in a seamless
fashion with automated provisioning, minimum operational overhead and no additional protocol support.

In another example where business drives cost reduction, by reducing number of redundant circuits to 3 per location (by
decreasing availability/reliability of dual plane model) but still mandating required diversity based on best performant paths
for two streams, we could use Flex-Algo in combination with Controller to provide disjointness and best paths using a third
circuit as a common path that either plane might use at any given time. In addition to the business gain above, the
combination of Flex-Algo with Controller produces set of other benefits such as fast-recovery and use of minimum label
stack, which converts in business terms to risk mitigation and cost avoidance.

As you noted, multicast has been mentioned along the side with unicast. And this is a major advantage of Flex-Algo since
unicast and multicast traffic steering leverage common Flex-Algo infrastructure. Despite that the result of Flex-Algo
computation is programmed in the MPLS FIB, Flex-Algo could be decoupled from MPLS FIB and be used by other
applications such as mLDP and BIER.

Many other use cases could be demonstrated to reveal how impactful Flex-Algo technology might be in resolving various
business requirements. Operator creativity to define its own custom algorithm spreads far beyond of presented use cases.

— Arkadiy Gulko
7.7 Flex-Algo Anycast-SID Use-Case
In Figure 7‑26, the network operator wants to enforce for some Internet prefixes a low-delay path
from the Internet border router Node1 to a specific Broadband Network Gateway (BNG) Node8.

This could be easily achieved with SR IGP Flex-Algo if Flex-Algo were available from ingress to
egress, but unfortunately that is not the case. Part of the network consists of older LDP-only nodes
(Node4, Node5, and Node8). On those LDP-only nodes the delay metrics are not available and Flex-
Algo cannot be used.

All nodes are in the same single-level ISIS area.

Node3 and Node6 are on the border between the SR domain and the LDP domain, these are the SR
domain border nodes. Both SR and LDP are enabled on these nodes.

Link-delay measurement is enabled on all links in the SR domain.

Figure 7-26: Low-delay path in SR/LDP network

In order to provide a low-delay path where possible and an IGP shortest path where no delay
information is available, the operator combines a Flex-Algo SID (providing the low-delay path in the
SR domain) with a regular Prefix-SID (providing the IGP shortest path in the LDP domain).

The operator enables Flex-Algo(128) on Node1, Node2, Node3, Node6, and Node7 with definition
optimize delay metric.

Given that the link-delays in the LDP portion of the network are unknown, the low-delay path from
Node1 to the (delay-wise) closest SR domain border node is combined with the IGP shortest path
from the border node to BNG Node8.

The selection of the optimal SR domain border node is automated by leveraging the Anycast-SID functionality. This is simply done by letting these border nodes advertise the same (anycast) prefix
with the same Flex-Algo Anycast-SID. These SR domain border nodes are then grouped in an anycast
set, with the Anycast-SID leading to the closest node in this anycast set.

The Flex-Algo(128) configurations on the SR domain border nodes Node3 and Node6 are presented
in Example 7‑13 and Example 7‑14. Both nodes advertise their own Flex-Algo(128) Prefix-SID with
their Loopback0 address, and they both advertise Flex-Algo(128) Anycast-SID 16836 with their
loopback136 prefix 136.1.1.136/32.

An Anycast-SID is a sub-type of Prefix-SID that is advertised by several nodes, as opposed to the Node-SID, which is another sub-type of Prefix-SID advertised by a single node. Node-SIDs are advertised with the Node flag (N-flag) and a configured Prefix-SID is by default a Node-SID. Anycast-SIDs are advertised without the N-flag, as specified with the n-flag-clear configuration.
Example 7-13: Flex-Algo configuration on Node3

interface Loopback0
ipv4 address 1.1.1.3 255.255.255.255
!
interface Loopback136
description Anycast of Node3 and Node6
ipv4 address 136.1.1.136 255.255.255.255
!
router isis 1
flex-algo 128
metric-type delay
advertise-definition
!
interface Loopback0
address-family ipv4 unicast
prefix-sid absolute 16003
prefix-sid algorithm 128 absolute 16803
!
interface Loopback136
address-family ipv4 unicast
prefix-sid algorithm 128 absolute 16836 n-flag-clear

Example 7-14: Flex-Algo configuration on Node6

interface Loopback0
ipv4 address 1.1.1.6 255.255.255.255
!
interface Loopback136
description Anycast of Node3 and Node6
ipv4 address 136.1.1.136 255.255.255.255
!
router isis 1
flex-algo 128
metric-type delay
advertise-definition
!
interface Loopback0
address-family ipv4 unicast
prefix-sid absolute 16006
prefix-sid algorithm 128 absolute 16806
!
interface Loopback136
address-family ipv4 unicast
prefix-sid algorithm 128 absolute 16836 n-flag-clear

This Flex-Algo(128) Anycast-SID 16836 transports packets to the closest (delay-wise) SR domain
border node. If two or more SR domain border nodes are equally close in terms of delay, then the
traffic is per-flow load-balanced among them.

Given the delay metrics in Figure 7‑26, Node6 is the SR domain border node that is closest to Node1.
Therefore, the path of Flex-Algo(128) Anycast-SID 16836 from Node1 is via Node7 (1→7→6).
To traverse the LDP portion of the path, the IGP shortest path is used. Node8 is not SR enabled hence
it does not advertise a Prefix-SID. Therefore, one or more SR Mapping-Servers (SRMSs) advertise a
(regular Algo(0)) Prefix-SID 16008 for the BNG address 1.1.1.8/32. Refer to Part I of the Segment
Routing book series for more details of the SRMS.

Node2 is used as SRMS and its configuration is shown in Example 7‑15.

Example 7-15: SRMS configuration on Node2

segment-routing
mapping-server
prefix-sid-map
address-family ipv4
1.1.1.8/32 8 range 1 !! Prefix-SID 16008 (index 8)
!
router isis 1
address-family ipv4 unicast
segment-routing prefix-sid-map advertise-local

This Prefix-SID 16008 carries the traffic from the SR border node to the BNG via the IGP shortest
path. The SR domain border nodes automatically (no configuration required) take care of the
interworking with LDP by stitching the BNG Prefix-SID to the LDP LSP.

As shown on Example 7‑16, an SR Policy POLICY2 to the BNG Node8 is configured on the Internet
border router Node1. This SR Policy has an explicit path PATH2 consisting of two SIDs: the Flex-
Algo Anycast-SID 16836 of the SR domain border nodes and the Prefix-SID 16008 of BNG Node8.

Example 7-16: SR Policy configuration on Node1

segment-routing
traffic-eng
segment-list PATH2
index 20 mpls label 16836 !! Flex-Algo Anycast-SID of Node(3,6)
index 40 mpls label 16008 !! Prefix-SID of Node8
!
policy POLICY2
color 30 end-point ipv4 1.1.1.8
candidate-paths
preference 100
explicit segment-list PATH2

The status of the SR Policy is shown in Example 7‑17.


Example 7-17: SR Policy status on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.8


Name: srte_c_30_ep_1.1.1.8
Status:
Admin: up Operational: up for 00:13:05 (since Mar 30 14:57:53.285)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: POLICY2
Requested BSID: dynamic
Explicit: segment-list PATH2 (valid)
Weight: 1
16836 [Prefix-SID, 136.1.1.136]
16008
Attributes:
Binding SID: 40001
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The path of the SR Policy is illustrated in Figure 7‑27. The Flex-Algo(128) Anycast-SID 16836
provides the low-delay path to the closest of (Node3, Node6) and the packets arrive on Node3 or
Node6 with Prefix-SID label 16008 as top label. Given the link delays in this example, the node
closest to Node1 is Node6.
Figure 7-27: SR Policy path combining Flex-Algo and SRMS Prefix-SID

Node3 and Node6 automatically stitch Node8’s Prefix-SID 16008 to the LDP LSPs towards Node8.

Node6 has two equal cost paths to BNG Node8, as illustrated in Figure 7‑27. This is the case for
Node3 as well.

The MPLS forwarding table on Node6, displayed in Example 7‑18, shows that Node6 installs Prefix-
SID 16008 with outgoing LDP labels 90048 and 90058. These are the LDP labels allocated for
1.1.1.8/32 by Node4 and Node5 respectively (see output in Example 7‑18). Node6 load-balances
traffic-flows over these two paths.
Example 7-18: Node8’s Prefix-SID in MPLS forwarding table on Node6

RP/0/0/CPU0:xrvr-6#show mpls ldp bindings 1.1.1.8/32


1.1.1.8/32, rev 39
Local binding: label: 90068
Remote bindings: (3 peers)
Peer Label
----------------- ---------
1.1.1.3:0 90038
1.1.1.4:0 90048
1.1.1.5:0 90058

RP/0/0/CPU0:xrvr-6#show mpls forwarding labels 16008


Local Outgoing Prefix Outgoing Next Hop Bytes
Label Label or ID Interface Switched
------ ----------- ------------------ ------------ ------------- ---------
16008 90058 SR Pfx (idx 8) Gi0/0/0/0 99.5.6.5 4880
90048 SR Pfx (idx 8) Gi0/0/0/4 99.4.6.4 1600

This solution is fully automated and dynamic once configured. The IGP dynamically adjusts the path of the Flex-Algo(128) Anycast-SID 16836 to go via the closest (lowest delay) SR domain border node. The Algo(0) Prefix-SID 16008 then steers the traffic via the IGP shortest path to BNG Node8.

Anycast-SIDs also provide node resiliency. In case one of the SR domain border nodes fails, the
other seamlessly takes over.

The operator uses Automated Steering to steer the destination prefixes that require a low-delay path
into the SR Policy.

In this use-case the operator uses different SR building blocks (Flex-Algo, Anycast-SID, SRMS,
SR/LDP interworking) and combines them with the SR-TE infrastructure to provide a solution for the
problem. It is an example of the versatility of the solution.
7.8 Summary
Since day 1 of Segment Routing, a Prefix-SID is associated to a prefix and an algorithm.

Most often, the Prefix-SID algorithm is the default one (zero).

With algorithm zero, traffic steered on an IGP Prefix-SID follows the shortest IGP path to the associated prefix (i.e., the shortest path as computed by ISIS or OSPF).

Algorithms 0 to 127 are standardized by the IETF.

Algorithms 128 to 255 are customized by the operator. They are called SR IGP Flexible Algorithms
(Flex-Algo for short).

For example, one operator may define Flex-Algo(128) to minimize the delay while another operator
defines Flex-Algo(128) to minimize the IGP metric and avoid the TE-affinity RED. Yet another
operator could define Flex-Algo(128) as a minimization of the TE metric and avoid the SRLG 20.

Any node participating in a Flex-Algo advertises its support for this Flex-Algo.

Typically, a node that participates in a Flex-Algo has a prefix-SID for that Flex-Algo. This is not an
absolute requirement.

Adding a Prefix-SID for a Flex-Algo to a node does not require configuring an additional loopback address. Multiple Prefix-SIDs of different algorithms can share the same address. This is an
important benefit for operational simplicity and scale.

Any node participating in a Flex-Algo computes the paths to the Prefix-SIDs of that Flex-Algo.

Any node not participating in a Flex-Algo does not compute the paths to the Prefix-SIDs of that Flex-
Algo. This is an important benefit for scale.

The definition of a Flex-Algo must be consistent across the IGP domain. The Flex-Algo solution
includes a mechanism to enforce that consistency.

Flex-Algo is an intrinsic component of the SR-TE architecture; Flex-Algo SIDs with their
specificities are automatically considered for SR-TE path calculation and can be included in any SR
Policy. As such, the ODN and AS components of SR-TE natively leverage Flex-Algo.

Typical use-cases involve plane disjoint paths and low-latency routing.


7.9 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[RFC5305] "IS-IS Extensions for Traffic Engineering", Tony Li, Henk Smit, RFC5305, October
2008

[RFC7308] "Extended Administrative Groups in MPLS Traffic Engineering (MPLS-TE)", Eric Osborne, RFC7308, July 2014

[RFC7684] "OSPFv2 Prefix/Link Attribute Advertisement", Peter Psenak, Hannes Gredler, Rob
Shakir, Wim Henderickx, Jeff Tantsura, Acee Lindem, RFC7684, November 2015

[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer
Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016

[RFC8402] "Segment Routing Architecture", Clarence Filsfils, Stefano Previdi, Les Ginsberg,
Bruno Decraene, Stephane Litkowski, Rob Shakir, RFC8402, July 2018

[draft-ietf-lsr-flex-algo] "IGP Flexible Algorithm", Peter Psenak, Shraddha Hegde, Clarence Filsfils, Ketan Talaulikar, Arkadiy Gulko, draft-ietf-lsr-flex-algo-01 (Work in Progress), November 2018

[draft-ietf-isis-te-app] "IS-IS TE Attributes per application", Les Ginsberg, Peter Psenak, Stefano
Previdi, Wim Henderickx, John Drake, draft-ietf-isis-te-app-06 (Work in Progress), April 2019

[draft-ietf-ospf-segment-routing-extensions] "OSPF Extensions for Segment Routing", Peter Psenak, Stefano Previdi, Clarence Filsfils, Hannes Gredler, Rob Shakir, Wim Henderickx, Jeff Tantsura, draft-ietf-ospf-segment-routing-extensions-27 (Work in Progress), December 2018

[draft-ietf-isis-segment-routing-extensions] "IS-IS Extensions for Segment Routing", Stefano Previdi, Les Ginsberg, Clarence Filsfils, Ahmed Bashandy, Hannes Gredler, Bruno Decraene, draft-ietf-isis-segment-routing-extensions-23 (Work in Progress), March 2019

[draft-ketant-idr-bgp-ls-flex-algo] "Flexible Algorithm Definition Advertisement with BGP Link-
State", Ketan Talaulikar, Peter Psenak, Shawn Zandi, Gaurav Dawra, draft-ketant-idr-bgp-ls-flex-
algo-01 (Work in Progress), February 2019

1. “Strict” must not be confused with “unprotected”. FRR mechanisms such as TI-LFA normally
apply to Strict-SPF Prefix-SIDs.↩

2. IGP Algorithm Types in IANA registry: https://www.iana.org/assignments/igp-parameters/igp-parameters.xhtml#igp-algorithm-types↩

3. The TI-LFA backup path is tailored along the post-convergence path; this is the path that the
traffic will follow after IGP converges following a failure. Read Part I of the SR book series for
more details.↩

4. Unless LDP is enabled and the default label imposition preference is applied. See Part I of the
SR book series.↩
8 Network Resiliency
The resiliency of an SR policy may benefit from several detection mechanisms and several
convergence or protection solutions:

local trigger from a local link failure

remote intra-domain trigger through IGP flooding

remote inter-domain trigger through BGP-LS flooding

validation of explicit candidate paths

re-computation of dynamic paths

selection of the next best candidate path

IGP convergence of constituent IGP Prefix SIDs

Anycast Prefix SID leverage

TI-LFA Local Protection of constituent SIDs, including Flex-Algo SIDs

TI-LFA Node Protection for an intermediate segment part of an SR Policy

Invalidation of a candidate path based on end-to-end liveness detection

We will describe each of these mechanisms in isolation and will highlight how they can co-operate to
the benefit of the resiliency of an SR Policy.

We will explain how an SR Policy can be designed to avoid TI-LFA protection when it is not
desirable.
The Fast Convergence (FC) Project
“Around 2001, while working on the design and deployment of the first worldwide DiffServ network, I realized that the loss
of connectivity (packet drops) following a link/node down/up transition was much more important than the impact of
congestion (QoS/DiffServ).

I then started the Fast Convergence project at Cisco Systems where we step-by-step optimized all the components of the
routing resiliency solution. We first focused on the IGP convergence as this is the base and most general solution. We
drastically improved the IGP convergence on large networks from 10s of seconds down to 200msec. In parallel, we
pioneered the IP Fast Reroute technology and invented Loop Free Alternate (LFA), Remote LFA, TI-LFA for link/node
and SRLG protection, SR IGP microloop avoidance solution and, last but not least, the BGP Prefix-Independent (BGP PIC)
fast-reroute solution.

A key aspect of this research, engineering and deployment is that this is automated by IGP and BGP (simple), optimum (the
backup paths are computed on a per-destination basis) and complete (lots of focus has been placed on the hardware
data structures to provide sub-10msec failure detection and enable sub-25msec backup path in a prefix independent
manner).

All of these efforts led to over 30 patents and in many aspects are the source of the Segment Routing solution.

Indeed, when I studied the first SDN experiments with OpenFlow, it was very clear to me that it would not scale. The unit
of state (a per-flow entry at every hop through an inter-domain fabric) was way too low and would exhaust the controller. I
had spent years optimizing the installations of these states in the dataplane.

Instead, I thought that the SDN controller should work with IGP constructs (IGP prefix and IGP adjacency segments). The
SDN controller should focus on combining these segments and delegate their state maintenance to the distributed
intelligence of the network.

We should have written a book on this FC journey. Maybe one day... ”

— Clarence Filsfils
8.1 Local Failure Detection
The sooner a local link failure is detected, the sooner a solution can be found.

Some transmission media provide fast hardware-based notifications of connectivity loss. By default
in IOS XR, link down events are immediately notified to the control plane (i.e., default carrier-delay
for link down events is zero).

For transmission media that do not provide such fast fault notifications, or if the link is not a direct
connection but there is an underlying infrastructure providing the connectivity, a higher layer
mechanism must perform the connectivity verification.

Link liveness is traditionally assessed by IGPs using periodic hello messages. However, the
complexity of handling IGP hellos in the Route Processor imposes a rather large lower bound on the
hello interval, thus making this mechanism unsuitable for fast failure detection.

Bidirectional Forwarding Detection (BFD) is a generic light-weight hello mechanism that can be used to
detect connectivity failures quickly and efficiently. Processing the fixed format BFD packets is a
simple operation that can be offloaded to the linecards.

BFD sends a continuous fast-paced stream of BFD Control packets from both ends of the link. A
connectivity failure is reported when a series of consecutive packets is not received on one end.

A detected connectivity failure is indicated in the transmitted BFD packets. This way uni-directional
failures can be reported on both sides of the link.

In addition to the BFD Control packets, BFD can transmit Echo packets. This is the so-called Echo
mode. The Echo packets are transmitted on the link with the local IP address as destination IP
address. The remote end of the link forwards these Echo packets as regular traffic back to the local
node.

Since BFD on the remote node is not involved in processing Echo packets and less delay variation is
expected, BFD Echo packets can be sent at a faster pace, resulting in faster failure detection. The
transmission interval of the BFD Control packets can be slowed down.
In Example 8‑1, the IGP (ISIS or OSPF) bootstraps the BFD session to its neighbor on the interface.
Echo mode is enabled by default and the echo transmission interval is set to 50 milliseconds (bfd
minimum-interval 50). BFD Control packets (named Async in the output) are sent at a default two-
second interval. The default BFD multiplier is 3. The status of the BFD session is verified in
Example 8‑2.

With this configuration the connectivity detection time is < 150 ms (= 50 ms × 3), which is the hello
interval times the multiplier.

Example 8-1: ISIS and OSPF BFD configurations

router isis 1
interface TenGigE0/1/0/0
bfd minimum-interval 50
bfd fast-detect ipv4
!
router ospf 1
area 0
interface TenGigE0/1/0/1
bfd minimum-interval 50
bfd fast-detect

Example 8-2: BFD session status

RP/0/RSP0/CPU0:R2#show bfd session


Interface Dest Addr Local det time(int*mult) State
Echo Async H/W NPU
----------------- --------------- ---------------- ---------------- ------
Te0/1/0/0 99.2.3.3 150ms(50ms*3) 6s(2s*3) UP
No n/a
“It should be pointed out that in number of production networks the term “convergence” is often used to describe the
duration of the outage when upon node or link failure packets are dropped until the routing protocols (IGP and even BGP)
synchronize their databases or tables network wide. The duration of the outage usually takes seconds or even minutes for
BGP. That is also one of the reasons many people consider BGP as slow protocol.

Network convergence as described above is only one aspect of a sound network design. In practice network designs
should introduce forms of connectivity restoration hierarchy into any network. Below are the summarized best practices to
reduce the duration of packet drops upon any network failure:

BGP at Ingress and Egress of the network should have both internal and external path redundancy ahead of failure.
Possible options: “Distribution of Diverse BGP Paths” [RFC6774] or “Advertisement of Multiple Paths in BGP”
[RFC7911].

Routing in the core should be based on BGP next hop and in the cases where distribution of redundant paths to all
network elements would be problematic some choice of encapsulation should be applied.

Protection should be computed and inserted into FIB ahead of failure such that connectivity restoration is immediate
(closely related to failure detection time) while protocols take their time to synchronize routing layer.

That is why applying techniques of fast reroute (TI-LFA or PIC) are extremely important elements every operator should
highly consider to be used in his network regardless if the network is pure IP, MPLS or SR enabled.

The fundamental disruptive change which needs to be clearly articulated here is that until Segment Routing the only
practical fine grained traffic control and therefore its protection was possible to be accomplished with RSVP-TE. The
problem and in my opinion the main significant difference between those two technologies is the fact that soft-state RSVP-
TE mechanism required end-to-end path signaling with RSVP PATH MSG followed by RSVP RESV MSG. SR requires no
event-based signaling as network topology with SIDs (MPLS or IPv6) is known ahead of any network event. ”

— Robert Raszuk

A local interface failure or BFD session failure triggers two concurrent reactions: the IGP
convergence and local Fast Reroute.

The local Fast Reroute mechanism (TI-LFA) ensures a sub-50msec restoration of the traffic flows.
TI-LFA is detailed in section 8.9.

In parallel, the detection of the local link failure triggers IGP convergence.

The failure of a local link brings down the IGP adjacency over that link, which then triggers the link-
state IGP to generate a new Link-State PDU (LSP) that reflects the new situation (after the failure).We
use the ISIS term LSP in this section, but it equally applies to OSPF Link-State Advertisements
(LSAs).

A delay is introduced before actually generating the new LSP. This is not a static delay, but a delay that
dynamically varies with the stability of the network. The delay is minimal in a stable network. In
periods of network instability, the delay increases exponentially to throttle the triggers to the IGP.
This adaptive delay is handled by an exponential backoff timer, the LSP generation (LSP-gen) timer in
this case. When the network settles, the delay steadily decreases to its minimum again.

By default IOS XR sets the minimum (initial) LSP generation timer to 50 ms. 50 ms is a good tradeoff
between quick reaction to failure and some delay to possibly consolidate multiple changes in a single
LSP.

After the IGP has generated the new LSP, a number of events take place. First, the IGP floods this new
LSP to its IGP neighbors. Second, it triggers SPF computation that is scheduled to start after the SPF
backoff timer expires. Third, it feeds this new LSP to SR-TE. SR-TE updates its SR-TE DB and acts
upon this topology change notification as described further in this chapter.

Similar to the LSP generation mechanism, a delay is introduced between receiving the SPF
computation trigger and starting the SPF computation. This delay varies dynamically with the stability
of the network. It exponentially increases in periods of network instability to throttle the number of
SPF computations and it decreases in stable periods. This adaptive delay is handled by the SPF
backoff timer.
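
Both backoff timers can be tuned if needed. The following is a sketch of what such tuning could look like for ISIS in IOS XR; the values are illustrative, not a recommendation, and the exact keyword set may vary per release:

router isis 1
lsp-gen-interval initial-wait 50 secondary-wait 200 maximum-wait 5000
address-family ipv4 unicast
spf-interval initial-wait 50 secondary-wait 200 maximum-wait 5000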

As the IGP feeds the new topology information to SR-TE, this information is also received by BGP if
BGP-LS is enabled on this node. BGP then sends the updated topology information in BGP-LS to its
BGP neighbors.
8.2 Intra-Domain IGP Flooding
When a link in a network fails, the nodes adjacent to that link quickly detect the failure, which triggers
the generation of a new LSP as described in the previous section.

Nodes that are remote to the failure learn about it through the IGP flooding of this new LSP.

It takes some time to flood the LSP from the node connected to the failure to the SR-TE headend node.
This time is the sum of the bufferisation, serialization, propagation, and IGP processing time at each
hop flooding the LSP (see [subsec]). The paper [subsec] indicates that bufferisation and serialization
delays are negligible components of the flooding delay.

The propagation delay is a function of the cumulative fiber distance between the originating node and
the headend node and can be roughly assessed as 5 ms per 1000 km of fiber. For a US backbone the
rule of thumb indicates that the worst case distance is ~6000 km (30 ms) and that a very conservative
average is 3000 km (15 ms).

The IGP processing delay is minimized by a fast-flooding mechanism that ensures that a number of
incoming LSPs is quickly flooded on the other interfaces, without introducing delay or pacing. A lab
characterization described in [subsec] showed that in 90% of 1000 single-hop flooding delay
measurements, the delay was smaller than 2 ms. 95% of the measurements were smaller than 28
ms. Note that multiple flooding paths exist in a network, hence it is unlikely that the worst case per-
hop delay will be seen on all flooding paths.

Now that we have a per-hop flooding delay, we need to know the number of flooding hops.

The LSP is flooded in the local IGP area. Rule of thumb indicates that the number of flooding hops
should be small to very small (< 5). In some topologies, inter-site connections or large rings might
lead to 7 flooding hops.

If we add up the flooding delay contributors, we come to a delay of 50 ms (LSP-gen initial-wait) + 5 (number of hops) × 2 ms (IGP processing delay) + 15 ms (propagation delay) = 75 ms between failure detection and the headend receiving the LSP.

After receiving the new LSP of a remote node, the IGP handles it like a locally generated LSP. First,
the IGP floods this LSP to its IGP neighbors. Second, it triggers an SPF computation that is scheduled
to start after the SPF backoff timer expires. Third, it feeds this new LSP to SR-TE. SR-TE updates its
SR-TE DB and acts upon this topology change notification as described further in this chapter.

As the IGP feeds the new topology information to SR-TE, this information is also received by BGP if
BGP-LS is enabled on this node. BGP then sends the updated topology information in BGP-LS to its
BGP neighbors.

Figure 8‑1 shows a single-area IGP topology. An SR Policy is configured on Node1 with endpoint
Node4 and a single dynamic candidate path with optimized delay, as shown in Example 8‑3. This
low-delay path is illustrated in Figure 8‑1.

Example 8-3: Minimum delay SR Policy on Node1

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay

Figure 8-1: Remote trigger through IGP flooding – before failure


The link between Node3 and Node4 fails. Both nodes detect the failure, bring down the adjacency
and flood the new LSP. When the LSP reaches Node1, the IGP notifies SR-TE of the change. SR-TE
on Node1 re-computes the path and installs the new SID list for the SR Policy, as illustrated in
Figure 8‑2.

Figure 8-2: Remote trigger through IGP flooding – after failure


8.3 Inter-Domain BGP-LS Flooding
Topology change notifications are also propagated by BGP-LS.

Figure 8‑3 shows a multi-domain network with two independent domains, Domain1 and Domain2,
interconnected by eBGP peering links. An SR PCE is available in Domain1 to compute inter-domain
paths. This SR PCE receives the topology information of Domain2 via BGP-LS. While it can receive
the information of its local domain, Domain1, via IGP or BGP-LS, in this section we assume it gets it
via BGP-LS.

Each domain has a BGP Route-reflector (RR) and these RRs are inter-connected.

Figure 8-3: Remote trigger through BGP-LS update – before failure

Node5 in Domain2 feeds any changes of its IGP LS-DB into BGP-LS and sends this information to its
local RR. This RR propagates the BGP-LS information to the RR in Domain1. SR PCE taps into this
Domain1 RR to receive the topology information of both domains.
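
As a sketch, the configuration on Node5 that feeds the IGP LS-DB into BGP-LS could look along these lines; the AS number, the instance name, and the RR neighbor address 1.1.1.55 are illustrative:

router isis 1
distribute link-state
!
router bgp 2
address-family link-state link-state
!
neighbor 1.1.1.55
remote-as 2
update-source Loopback0
address-family link-state link-state
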
An SR Policy is configured on Node1 with endpoint Node10 and a single dynamic candidate path
with optimized delay metric that is computed by an SR PCE. Node1 has a PCEP session with the SR
PCE in Domain1 with address 1.1.1.100. The configuration of this SR Policy is displayed in
Example 8‑4.

Node1 automatically delegates control of the path to the SR PCE. This SR PCE is then responsible for
updating the path when required.

Example 8-4: SR Policy with PCE computed dynamic path

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.10
candidate-paths
preference 100
dynamic
pcep
!
metric
type delay
!
pcc
pce address ipv4 1.1.1.100

The link between Node6 and Node7 fails, as illustrated in Figure 8‑4. Both nodes detect the failure,
bring down the IGP adjacency and flood the new LSPs in the Domain2 area. When the LSPs reach
Node5, the IGP of this node feeds the information to BGP-LS. BGP on Node5 sends the BGP-LS
update to its RR which propagates the update to the RR in Domain1. This RR then propagates the
update to the SR PCE.

This topology update triggers the recomputation of the delegated path on the SR PCE. After
recomputing the path, the SR PCE instructs Node1 to update the path with the new SID list. Node1
then installs the new SID list for the SR Policy, as illustrated in Figure 8‑4.

Note that within 50 ms after detecting the failure, TI-LFA protection on Node7 restores connectivity
via the TI-LFA backup path. This allows more time for the sequence of events as described above to
execute.
Figure 8-4: Remote trigger through BGP-LS update – after failure
8.4 Validation of an Explicit Path
When a headend receives a notification that the topology has changed, it re-validates the explicit
paths of its local SR Policies.

If the topology change causes the active candidate path of an SR Policy to become invalid, then the
next-preferred, valid candidate path of the SR Policy is selected as active path.

An explicit path is invalidated when the headend determines that at least one of the SIDs in its SID
list is invalid. The validation procedure is described in detail in chapter 3, "Explicit Candidate Path".
Briefly, an operator can specify a SID in an explicit path as a label value or as a segment descriptor.
The headend validates all SIDs expressed as segment descriptors by attempting to resolve them into
their corresponding label values. It also validates the first SID in the list, regardless of how it is
expressed, by resolving it into an outgoing interface and next-hop. SIDs other than the first that are
expressed as MPLS labels cannot be validated by the headend and are therefore considered valid.

An explicit path also becomes invalid if it violates one of its associated constraints in the new
topology.

The next sections illustrate how an active candidate path expressed with segment descriptors or
MPLS label values can be invalidated following a topology change and how SR-TE falls back to the
next valid candidate path.

8.4.1 Segments Expressed as Segment Descriptors


Headend Node1 in Figure 8‑5 has an SR Policy to endpoint Node4. The SR Policy has two explicit
candidate paths, both are illustrated in Figure 8‑5.

The configuration of the SR Policy is shown in Example 8‑5.


Figure 8-5: Validation of explicit path

The segment lists are defined in the beginning of the configuration. The segments in the segment lists
are expressed as segment descriptors.

address ipv4 1.1.1.3 identifies the Prefix-SID of Node3 (1.1.1.3); address ipv4 99.3.4.3
identifies the Adj-SID of the point-to-point link of Node3 to Node4; address ipv4 1.1.1.4
identifies the Prefix-SID of Node4 (1.1.1.4).

The SR Policy’s candidate path with preference 200 refers to the segment list SIDLIST2; this is the
path 1→2→3→4 in Figure 8‑5. The candidate path with preference 100 refers to the segment list
SIDLIST1; this is the path 1→2→3→6→5→4 in Figure 8‑5.
Example 8-5: SR Policy with explicit paths – segments expressed as segment descriptors

segment-routing
traffic-eng
segment-list SIDLIST2
index 10 address ipv4 1.1.1.3 !! Prefix-SID Node3
index 20 address ipv4 99.3.4.3 !! Adj-SID Node3->Node4
!
segment-list SIDLIST1
index 10 address ipv4 1.1.1.3 !! Prefix-SID Node3
index 20 address ipv4 1.1.1.4 !! Prefix-SID Node4
!
policy EXP
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 200
explicit segment-list SIDLIST2
!
preference 100
explicit segment-list SIDLIST1

Both candidate paths are valid. The candidate path with the highest preference value (200) is selected
as active path, as shown in the SR Policy status in Example 8‑6.

Example 8-6: SR Policy with explicit paths

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:00:18 (since Jul 12 11:46:01.673)
Candidate-paths:
Preference: 200 (configuration) (active)
Name: EXP
Requested BSID: dynamic
Explicit: segment-list SIDLIST2 (valid)
Weight: 1
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Preference: 100 (configuration)
Name: EXP
Requested BSID: dynamic
Explicit: segment-list SIDLIST1 (invalid)
Weight: 1
Attributes:
Binding SID: 40013
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

As displayed in the output, the active path (with the highest preference 200) has a SID list <16003,
24034>, with 16003 the Prefix-SID of Node3 and 24034 the Adj-SID of Node3’s link to Node4.
The less preferred candidate path has a SID list <16003, 16004>, with 16003 and 16004 the Prefix-
SIDs of Node3 and Node4 respectively. Since this path is not active, its SID list is not shown in the
output of Example 8‑6.

The link between Node3 and Node4 fails, as illustrated in Figure 8‑6.

Node1 gets the topology change notification via IGP flooding. In the new topology the Adj-SID 24034
of the failed link is no longer valid. Because of this, Node1 invalidates the current active path since it
contains the now-invalid Adj-SID 24034. Node1 selects the next preferred candidate path and
installs its SID list in the forwarding table.

The new status of the SR Policy is displayed in Example 8‑7. The preference 100 candidate path is
now the active path.

Example 8-7: SR Policy with explicit paths – highest preference path invalidated after failure

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:05:47 (since Apr 9 18:03:56.673)
Candidate-paths:
Preference: 200 (configuration)
Name: EXP
Requested BSID: dynamic
Explicit: segment-list SIDLIST2 (invalid)
Weight: 1
Last error: Address 99.3.4.3 can not be resolved to a SID
Preference: 100 (configuration) (active)
Name: EXP
Requested BSID: dynamic
Explicit: segment-list SIDLIST1 (valid)
Weight: 1
16003 [Prefix-SID, 1.1.1.3]
16004 [Prefix-SID, 1.1.1.4]
Attributes:
Binding SID: 40013
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

All traffic steered into the SR Policy now follows the new explicit path.
Figure 8-6: Validation of explicit path – after failure

“Creation of explicit paths often raises concerns that static paths make the network more fragile and less robust to dynamic
repairs. While the text explicitly explains the dynamic selection between number of possible candidate SR paths based on
live link state updates it should also be pointed out that those are only optimizations.

When no explicit paths are available fallback to native IGP path will be the default behavior.

Moreover some implementations may also allow to configure load balancing ahead of failure across N parallel explicit
paths. ”

— Robert Raszuk

8.4.2 Segments Expressed as SID Values


The configuration of Node1 is now modified by expressing the segments in the segment lists using
SID label values instead of segment descriptors. The resulting SR Policy configuration of Node1 is
shown in Example 8‑8 and the candidate paths are illustrated in Figure 8‑5.

Remember that a SID expressed as a label value is validated by the headend only if it is in first
position in the SID list. The headend must be able to resolve this first entry to find outgoing
interface(s) and nexthop(s) of the SR Policy.
Example 8-8: SR Policy with explicit paths – SIDs expressed as label values

segment-routing
traffic-eng
segment-list SIDLIST12
index 10 mpls label 16003 !! Prefix-SID Node3
index 20 mpls label 24034 !! Adj-SID Node3-Node4
!
segment-list SIDLIST11
index 10 mpls label 16003 !! Prefix-SID Node3
index 20 mpls label 16004 !! Prefix-SID Node4
!
policy EXP
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 200
explicit segment-list SIDLIST12
!
preference 100
explicit segment-list SIDLIST11

After the failure of the link between Node3 and Node4, as illustrated in Figure 8‑6, the selected path
(the candidate path with preference 200) is not invalidated.

The first SID in the SID list (16003) is still valid since the link failure does not impact this SID. The
failure does impact the second SID in the list. However, this second SID (24034) is specified as a
SID label value and therefore its validity cannot be assessed by the headend, as discussed above.

The status of the SR Policy after the failure is displayed in Example 8‑9. The preference 200
candidate path is still the selected path.

Traffic steered into this SR Policy keeps following the path 1→2→3→4. Since the link between
Node3 and Node4 is down, traffic is now dropped at Node3.
Example 8-9: SR Policy with explicit paths – highest preference path is not invalidated after failure

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:59:29 (since Jul 12 11:46:01.081)
Candidate-paths:
Preference: 200 (configuration) (active)
Name: EXP
Requested BSID: dynamic
Explicit: segment-list SIDLIST12 (valid)
Weight: 1
16003 [Prefix-SID, 1.1.1.3]
24034
Preference: 100 (configuration)
Name: EXP
Requested BSID: dynamic
Explicit: segment-list SIDLIST11 (invalid)
Weight: 1
Attributes:
Binding SID: 40013
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

It is important to note that an explicit SID list is normally controlled by an SR PCE. This SR PCE
would be made aware of the topology change and would dynamically update the SID list as described
in section 8.6 below.
8.5 Recomputation of a Dynamic Path by a Headend
When a headend receives a notification that the topology has changed, it re-computes the non-
delegated dynamic paths of its SR Policies.

Headend Node1 in Figure 8‑7 has an SR Policy to endpoint Node4 with a delay-optimized dynamic
path. The link-delay metrics are indicated next to the links, the default IGP metric is 10. The
configuration of the SR Policy is shown in Example 8‑10. The computed low-delay path is illustrated
in the drawing.

Figure 8-7: Recomputation of a dynamic path by headend – before failure

Example 8-10: SR Policy with dynamic low-delay path

segment-routing
traffic-eng
policy DYN
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay

The status of this SR Policy is presented in Example 8‑11. The computed SID list is <16003, 24034>,
with 16003 the Prefix-SID of Node3 and 24034 the Adj-SID of Node3 for the link to Node4. The
accumulated delay metric of this path is 30 (= 12 + 11 + 7).

Example 8-11: SR Policy with dynamic low-delay path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:02:46 (since Jul 12 13:28:18.800)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: DYN
Requested BSID: dynamic
Dynamic (valid)
Metric Type: delay, Path Accumulated Metric: 30
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Attributes:
Binding SID: 40033
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The link between Node3 and Node4 fails. SR-TE on Node1 is notified of the topology change and
SR-TE re-computes the dynamic path of SR Policy DYN. The new low-delay path from Node1 to
Node4 is illustrated in Figure 8‑8.
Figure 8-8: Recomputation of a dynamic path by headend – after failure

Example 8‑12 shows the new status of the SR Policy. The cumulative delay of this path is 64 (= 12 +
11 + 10 + 15 + 16). The path is encoded in the SID list <16003, 16004> with the Prefix-SIDs of
Node3 and Node4 respectively.

Example 8-12: SR Policy with dynamic low-delay path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:05:46 (since Jul 12 13:28:18.800)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: DYN
Requested BSID: dynamic
Dynamic (valid)
Metric Type: delay, Path Accumulated Metric: 64
16003 [Prefix-SID, 1.1.1.3]
16004 [Prefix-SID, 1.1.1.4]
Attributes:
Binding SID: 40033
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes
All traffic steered into the SR Policy now follows the new low-delay path.
8.6 Recomputation of a Dynamic Path by an SR PCE
The headend node delegates control of all SR Policy paths computed by an SR PCE to that same SR
PCE. This occurs when the headend requested the path computation services of an SR PCE or when
the SR PCE itself initiated the path on the headend node. In either case, the SR PCE is responsible for
maintaining its delegated paths. Following a topology change, the SR PCE thus re-computes these
paths and, if necessary, updates the SID lists provided to their respective headend nodes.

The sequence of events following a topology change is equivalent to that of the previous section, but with
the potential additional delay of the failure notification to the SR PCE and the signaling of the updated
path from the SR PCE to the headend. If the SR PCE were connected via BGP-LS it would receive
the topology notification via BGP-LS (as in section 8.3 above), possibly incurring additional delay.

To illustrate, we add an SR PCE in the network of the previous section, see Figure 8‑9, and let the
headend Node1 delegate the low-delay path to the SR PCE.

When the link between Node3 and Node4 fails, the IGP floods the change throughout the network
where it also reaches the SR PCE.

The SR PCE re-computes the path and signals the updated path to the headend Node1 that updates the
path.
Figure 8-9: Recomputation of a dynamic path by SR PCE – after failure

Furthermore, PCE-initiated paths are signaled to the headend node as explicit candidate paths,
which implies that the explicit path validation procedure described in section 8.4 is also applied by
the headend. The resiliency of PCE-initiated paths is thus really the combination of PCE-based re-
computation and headend-based validation, although the latter is fairly limited since these SID lists
are often expressed with MPLS label values.
8.7 IGP Convergence Along a Constituent Prefix-SID
A Prefix-SID may be a constituent of an SR Policy’s candidate path (explicit or dynamic). A failure may
impact the shortest path of that Prefix-SID. For example, in this illustration, we see that Prefix-SID
16004 of Node4 is the second SID of the SID list of the SR Policy in Figure 8‑10. If the link between
Node3 and Node6 fails, this SID is impacted and hence the SR Policy is impacted.

In this section, we describe how the IGP convergence of a constituent Prefix-SID helps the resiliency
of the related SR Policy.

First we provide a brief reminder on the IGP convergence process and then we show how the IGP
convergence is beneficial both for an explicit and a dynamic candidate path.

8.7.1 IGP Reminder


A Prefix-SID follows the shortest path to its associated prefix. When the network topology changes,
the path of the Prefix-SID may change if the topology change affects the shortest path to its associated
prefix.

The IGP takes care of maintaining the Prefix-SID paths. If the Prefix-SID is used in an SR Policy’s
SID list and IGP updates the path of the Prefix-SID, then the path of the SR Policy follows this
change. Let us illustrate this in the next sections.

8.7.2 Explicit Candidate Path


Node1 in Figure 8‑10 has an SR Policy to Node4 with an explicit SID list. The configuration of this
SR Policy is displayed in Example 8‑13. The SIDs in the SID list are specified as segment
descriptors. The first entry in the SID list is the Prefix-SID of Node3, the second entry the Prefix-SID
of Node4. The path of this SR Policy is illustrated in the drawing.
Example 8-13: SR Policy with an explicit path

segment-routing
traffic-eng
segment-list SIDLIST1
index 10 address ipv4 1.1.1.3 !! Prefix-SID Node3
index 20 address ipv4 1.1.1.4 !! Prefix-SID Node4
!
policy EXP
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1

Figure 8-10: IGP convergence along a constituent Prefix SID – before failure

The link between Node3 and Node6 fails, as illustrated in Figure 8‑11. This link is along the shortest
path of the Prefix-SID of Node4 from Node3 in the topology before the failure.

IGP converges and updates the path of this Prefix-SID along the new shortest path from Node3 to
Node4: the direct high-cost link from Node3 to Node4.
Figure 8-11: IGP convergence along a constituent Prefix SID – after failure

In parallel to the IGP convergence, the SR-TE process of the headend Node1 receives a remote intra-
domain topology change (section 8.2) and triggers the re-validation of the explicit path (section 8.4).
The headend finds that the SIDs in the SID list are still valid (thanks to the IGP convergence) and
leaves the SR Policy as is.

Even though the topology changed due to a failure that impacted the SR Policy path, the explicit SID
list of the SR Policy remained valid, while the path it encodes changed thanks to IGP
convergence.

This is a straightforward illustration of the scaling benefits of SR over OpenFlow. In an OpenFlow
deployment, the controller cannot rely on the network to maintain IGP Prefix-SIDs. The OpenFlow
controller must itself reprogram all the per-flow states at every hop. This does not scale.

8.7.3 Dynamic Candidate Paths


Node1 in Figure 8‑12 has an SR Policy with a delay-optimized dynamic path. The configuration of
this SR Policy is displayed in Example 8‑14. The low-delay path as computed by Node1 is illustrated
in the drawing. The SID list that encodes this path is <16003, 24034>, as shown in the SR Policy
status output in Example 8‑15.
Figure 8-12: IGP convergence along a constituent Prefix SID – dynamic path before failure

Example 8-14: SR Policy configuration with dynamic path

segment-routing
traffic-eng
policy DYN
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay
Example 8-15: SR Policy status

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 18:16:20 (since Jul 12 13:28:18.800)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: DYN
Requested BSID: dynamic
Dynamic (valid)
Metric Type: delay, Path Accumulated Metric: 30
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Attributes:
Binding SID: 40033
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The link between Node2 and Node3 fails, as illustrated in Figure 8‑13.

Triggered by the topology change, Node1 re-computes the new low-delay path. This new low-delay
path is illustrated in Figure 8‑13.

It turns out that the SID list that encodes this new path is the same SID list as before the failure. The
output in Example 8‑16 confirms this. While the SID list is unchanged, the path of the first SID in the
list has been updated by the IGP. Notice in the output that the cumulative delay metric of the path has
increased to 58.
Example 8-16: SR Policy status after failure – SID list unchanged

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 18:23:47 (since Jul 12 13:28:18.800)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: DYN
Requested BSID: dynamic
Dynamic (valid)
Metric Type: delay, Path Accumulated Metric: 58
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Attributes:
Binding SID: 40033
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

Figure 8-13: IGP convergence along a constituent Prefix SID – dynamic path after failure

This is another example where the SR Policy headend benefits from the concurrent IGP convergence.
The SID list before and after the network failure is the same (although the actual network path is
different). The SR Policy headend does not need to change anything in its dataplane. It simply
leverages the network wide IGP convergence process.
8.8 Anycast-SIDs
An Anycast-SID is a Prefix-SID that is advertised by more than one node. The set of nodes
advertising the same anycast prefix with its Anycast-SID forms a group, known as an anycast set.
Using an Anycast-SID in the SID list of an SR Policy path provides improved traffic load-sharing and
node resiliency.

An Anycast-SID provides the IGP shortest path to the nearest node of the anycast set. If two or more
nodes of the anycast set are located at the same distance (metric), then traffic flows are load-balanced
among them.

If a node in an anycast set fails, then the other nodes in the set seamlessly take over. After IGP has
converged, the Anycast-SID follows the shortest path to the node of the anycast set that is now the
nearest in the new topology.

An Anycast-SID can be used as a SID in the SR Policy’s SID list, thereby leveraging the load-
balancing capabilities and node resiliency properties of the Anycast-SID.

Note that all this applies to Flex-Algo Prefix-SIDs as well. For example, if two nodes with a Flex-
Algo(K) Anycast-SID are at the same distance with respect to the Flex-Algo(K) metric, then traffic
flows are load-balanced among them.

In a typical use-case of the Anycast-SID, an Anycast-SID is assigned to a pair (or more) of border
nodes between two domains. To traverse the border between the domains, the Anycast-SID of the
border nodes can be included in the SR Policy SID list. Traffic is then load-balanced between the
border nodes if possible, and following a failure of a border node, the other border node(s)
seamlessly take(s) over. Finer grained traffic steering over a particular border node is still possible,
using this border node’s own Prefix-SID instead of the Anycast-SID.

Figure 8‑14 shows a network topology with three domains, connected by pairs of border nodes. An
Anycast-SID is assigned to each pair of border nodes: Anycast-SID 20012 for the border nodes
between Domain1 and Domain2, Node1 and Node2; Anycast-SID 20034 for the border nodes
between Domain2 and Domain3, Node3 and Node4. The domains are independent, no redistribution
is done between the domains.
Figure 8-14: Anycast-SID provides load-balancing

An SR Policy with endpoint Node31 is instantiated on headend Node11, using an explicit path. The
SID list is <20012, 20034, 16031>; Anycast-SID 20012 load-balances the traffic from Node11 to
Node1 and Node2; Anycast-SID 20034 load-balances the traffic to Node3 and Node4; Prefix-SID
16031 of Node31 then steers the traffic to Node31. By using the Anycast-SIDs of the border nodes in
the SID list instead of their individual Prefix-SIDs, the available ECMP in the network is better used.
See, for example, how Node11 load-balances the traffic flows between Node1 and Node2, and
Node1 load-balances the flows between Node3 and Node4.
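For illustration, the SR Policy on Node11 could be configured with an explicit SID list that references the Anycast-SIDs by label value, following the same configuration conventions as the earlier examples in this chapter. This is only a sketch: the policy name, color and endpoint address (1.1.1.31 is assumed for Node31) are chosen for this illustration.

segment-routing
 traffic-eng
  segment-list name SIDLIST1
   index 10 mpls label 20012       !! Anycast-SID of border nodes Node1/Node2
   index 20 mpls label 20034       !! Anycast-SID of border nodes Node3/Node4
   index 30 mpls label 16031       !! Prefix-SID of Node31
  !
  policy ANYCAST
   color 20 end-point ipv4 1.1.1.31
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1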

The use of Anycast-SIDs also provides node resiliency. Upon failure of Node3, traffic to Anycast-
SID 20034 is then carried by the remaining Node4, as illustrated in Figure 8‑15. The SID list of the
SR Policy on Node11 stays unchanged.
Figure 8-15: Anycast-SID provides node resiliency
8.9 TI-LFA protection
Local protection techniques ensure minimal packet loss (< 50msec) following a failure in the
network. Protection is typically provided by pre-installing a backup path before a failure occurs.
Upon a failure of the locally connected resource, traffic flows can then quickly be restored by sending
them on the pre-installed backup path.

While protection ensures the initial restoration of the traffic flows, the control plane (IGP, SR-TE,
…) computes the paths in the new topology and installs the new paths in the forwarding tables. The
traffic then follows the newly installed paths and new backup paths are derived and installed
according to the new topology.

Part I of the SR book explains Topology Independent LFA (TI-LFA) that uses SR functionality to
provide optimal and automatic sub-50msec Fast Reroute protection for any link, node, or local SRLG
failure in the network. TI-LFA is an IGP functionality.

IGP computes and installs TI-LFA backup paths for each IGP destination. The TI-LFA configuration
specifies which resource TI-LFA should protect: the link, the neighbor node, or the local SRLG.
Remote SRLGs are protected using Weighted SRLG TI-LFA Protection.

IGP then computes a per-prefix optimal TI-LFA backup path that steers the traffic around the protected
resource (link, node, SRLG). The TI-LFA backup path is tailored over the post-convergence path, i.e.,
the path that the traffic will follow after the IGP converges on the failure of the protected resource.

IGP then installs the computed TI-LFA backup paths in the forwarding table. When a failure is
detected on the local interface, the backup paths are activated in the data plane, deviating the traffic
away from the failure. The activation of the TI-LFA backup paths happens for all affected destinations
at the same time, in a prefix-independent manner. The backup paths stay active until IGP converges
and updates the forwarding entries on their new paths.

TI-LFA provides protection for Prefix-SIDs, Adj-SIDs, LDP labeled traffic and IP unlabeled traffic.

Since TI-LFA is a local protection mechanism, it should be enabled on all interfaces of all nodes to
provide automatic local protection against any failure.
“I have been working on network resiliency for 15 years, contributing to IGP fast convergence, LFA, LFA applicability,
LFA manageability and RLFA.

TI-LFA and IGP microloop avoidance are key improvements in terms of network resiliency, both powered by Segment
Routing. They are easy to use as they have general applicability: any topology and any traffic, both shortest path and TE.
They also follow the usual IGP shortest path which has many good properties: the best from a routing standpoint, well-
known to people (e.g., compared to RLFA paths), natively compliant with service providers' existing routing policies (as
those are already encoded as part of the IGP metrics), and typically provisioned with enough capacity (as the fast reroute
path is typically shared with IGP post convergence). ”

— Bruno Decraene

Example 8‑17 shows the configuration that is required to enable TI-LFA for ISIS and OSPF.

Example 8-17: TI-LFA configuration

router isis 1
interface Gi0/0/0/0
address-family ipv4 unicast
fast-reroute per-prefix
fast-reroute per-prefix ti-lfa
!
router ospf 1
area 0
interface GigabitEthernet0/0/0/0
fast-reroute per-prefix
fast-reroute per-prefix ti-lfa enable

Refer to Part I of the SR book for more details of the TI-LFA functionality.

TI-LFA provides local protection for Prefix-SIDs and Adj-SIDs. This protection also applies to SIDs
used in an SR Policy’s SID list, the constituent SIDs of the SID list.

The following sections illustrate the TI-LFA protection of such constituent Prefix-SIDs and Adj-SIDs.

8.9.1 Constituent Prefix-SID


Node1 in Figure 8‑16 has an SR Policy to endpoint Node4 with a SID list <16006, 16004>. The SIDs
in the list are the Prefix-SIDs of Node6 and Node4 respectively. This SID list could have been
explicitly specified or dynamically derived. The path of this SR Policy is illustrated in the drawing.
Figure 8-16: TI-LFA of a constituent Prefix-SID

In this example we focus on the protection by Node6. Packets that Node1 steers into the SR Policy of
Figure 8‑16 arrive on Node6 with the Prefix-SID label 16004 as top label. In steady-state Node6
forwards these packets towards Node5, via the IGP shortest path to Node4.

The operator wants to protect the traffic on Node6 against link failures and enables TI-LFA on
Node6.

The IGP on Node6 computes the TI-LFA backup paths for all destinations, including Node4.
Figure 8‑16 illustrates the TI-LFA backup path for Prefix-SID 16004.

Upon failure of the link to Node5, Node6 promptly steers the traffic destined for 16004 on this Prefix-
SID’s TI-LFA backup path (via Node3) to quickly restore connectivity. While this local protection is
active, the mechanisms described earlier in this chapter are triggered in order to update the SR Policy
path if required. IGP convergence also happens concurrently.

The backup path is 6→3→4. Let us verify the TI-LFA backup path for 16004 on Node6 using the
output of Example 8‑18, as collected on Node6.
Node6 imposes Adj-SID 24134 on the protected packets and sends them to Node3. Because the link
from Node3 to Node4 has a high metric, Node3’s Adj-SID 24134 for that link is imposed on the
protected packets to steer them over that link.

Example 8-18: Prefix-SID 16004 TI-LFA backup path on Node6

RP/0/0/CPU0:xrvr-6#show isis fast-reroute 1.1.1.4/32

L2 1.1.1.4/32 [30/115]
via 99.5.6.5, GigabitEthernet0/0/0/0, xrvr-5, SRGB Base: 16000, Weight: 0
Backup path: TI-LFA (link), via 99.3.6.3, GigabitEthernet0/0/0/1 xrvr-3, SRGB Base: 16000,
Weight: 0
P node: xrvr-3.00 [1.1.1.3], Label: ImpNull
Q node: xrvr-4.00 [1.1.1.4], Label: 24134
Prefix label: ImpNull
Backup-src: xrvr-4.00

RP/0/0/CPU0:xrvr-6#show mpls forwarding labels 16004 detail


Local Outgoing Prefix Outgoing Next Hop Bytes
Label Label or ID Interface Switched
------ ----------- ----------------- ----------- ------------- -----------
16004 16004 SR Pfx (idx 4) Gi0/0/0/0 99.5.6.5 59117
Updated: Jul 13 10:38:49.812
Path Flags: 0x400 [ BKUP-IDX:1 (0xa1a6e3c0) ]
Version: 245, Priority: 1
Label Stack (Top -> Bottom): { 16004 }
NHID: 0x0, Encap-ID: N/A, Path idx: 0, Backup path idx: 1, Weight: 0
MAC/Encaps: 14/18, MTU: 1500
Outgoing Interface: GigabitEthernet0/0/0/0 (ifhandle 0x00000020)
Packets Switched: 1126

Pop SR Pfx (idx 4) Gi0/0/0/1 99.3.6.3 0 (!)


Updated: Jul 13 10:38:49.812
Path Flags: 0xb00 [ IDX:1 BKUP, NoFwd ]
Version: 245, Priority: 1
Label Stack (Top -> Bottom): { Imp-Null 24134 }
NHID: 0x0, Encap-ID: N/A, Path idx: 1, Backup path idx: 0, Weight: 0
MAC/Encaps: 14/18, MTU: 1500
Outgoing Interface: GigabitEthernet0/0/0/1 (ifhandle 0x00000040)
Packets Switched: 0
(!): FRR pure backup

Traffic-Matrix Packets/Bytes Switched: 0/0

If TI-LFA is not enabled, then there is no protection: no backup path is computed and installed.
Following a failure, traffic loss occurs until the IGP and SR-TE have computed and installed the paths
for the new topology.

8.9.2 Constituent Adj-SID


As described in SR book Part I, two types of Adj-SIDs exist, those that are protected by TI-LFA and
those that are not protected by TI-LFA. These are respectively called “protected” and “unprotected”
Adj-SIDs.
By default in IOS XR, the IGP allocates an Adj-SID of each type for each of its adjacencies.
Additional protected and unprotected Adj-SIDs can be configured if needed.
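As a hedged illustration, additional Adj-SIDs could be configured manually for an ISIS adjacency as sketched below. The label values are arbitrary examples taken from the default SRLB range, and the exact CLI may vary per IOS XR release.

router isis 1
 interface GigabitEthernet0/0/0/0
  address-family ipv4 unicast
   adjacency-sid absolute 15234 protected   !! additional protected Adj-SID
   adjacency-sid absolute 15334             !! additional unprotected Adj-SID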

When TI-LFA is enabled on a node, IGP computes and installs a TI-LFA backup path for each of the
protected Adj-SIDs. The unprotected Adj-SIDs are left without TI-LFA backup path.

Example 8‑19 shows the two Adj-SIDs that Node3 has allocated for its link to Node4. 24034 is the
protected one (Adjacency SID: 24034 (protected)) and 24134 the unprotected one (Non-FRR
Adjacency SID: 24134).

Example 8-19: Adj-SIDs on Node3 for the adjacency to Node4

RP/0/0/CPU0:xrvr-3#show isis adjacency systemid xrvr-4 detail

IS-IS 1 Level-2 adjacencies:


System Id Interface SNPA State Hold Changed NSF IPv4 IPv6
BFD BFD
xrvr-4 Gi0/0/0/0 *PtoP* Up 27 1w2d Yes None None
Area Address: 49.0001
Neighbor IPv4 Address: 99.3.4.4*
Adjacency SID: 24034 (protected)
Backup label stack: [16004]
Backup stack size: 1
Backup interface: Gi0/0/0/1
Backup nexthop: 99.3.6.6
Backup node address: 1.1.1.4
Non-FRR Adjacency SID: 24134
Topology: IPv4 Unicast

Total adjacency count: 1

Node1 in Figure 8‑17 has an SR Policy to endpoint Node4 with a SID list <16003, 24034>. 16003 is
the Prefix-SID of Node3, 24034 is the protected Adj-SID of Node3 for the link to Node4. Whether this
SID list is explicitly specified or dynamically computed is unimportant to this discussion.
Figure 8-17: TI-LFA of a constituent Adj-SID

Since TI-LFA is enabled on Node3, IGP on Node3 derives and installs a TI-LFA backup path for the
protected Adj-SID 24034. This TI-LFA backup path, which steers the packets to the remote end of the
link (Node4) while avoiding the link itself, is illustrated in the drawing.

The output in Example 8‑20, as collected on Node3, shows the TI-LFA backup path for the Adj-SID
24034. The backup path is marked in the output with (!) on the far right. It imposes Node4’s Prefix-
SID 16004 on the protected packets and steers the traffic via Node6 in case the link to Node4
fails.
Example 8-20: Adj-SID TI-LFA backup on Node3

RP/0/0/CPU0:xrvr-3#show mpls forwarding labels 24034 detail


Local Outgoing Prefix Outgoing Next Hop Bytes
Label Label or ID Interface Switched
------ ----------- ----------------- ------------ ------------- ----------
24034 Pop SR Adj (idx 0) Gi0/0/0/0 99.3.4.4 0
Updated: Jul 13 07:44:17.297
Path Flags: 0x400 [ BKUP-IDX:1 (0xa19c0210) ]
Version: 121, Priority: 1
Label Stack (Top -> Bottom): { Imp-Null }
NHID: 0x0, Encap-ID: N/A, Path idx: 0, Backup path idx: 1, Weight: 100
MAC/Encaps: 14/14, MTU: 1500
Outgoing Interface: GigabitEthernet0/0/0/0 (ifhandle 0x00000040)
Packets Switched: 0

16004 SR Adj (idx 0) Gi0/0/0/1 99.3.6.6 0 (!)


Updated: Jul 13 07:44:17.297
Path Flags: 0x100 [ BKUP, NoFwd ]
Version: 121, Priority: 1
Label Stack (Top -> Bottom): { 16004 }
NHID: 0x0, Encap-ID: N/A, Path idx: 1, Backup path idx: 0, Weight: 40
MAC/Encaps: 14/18, MTU: 1500
Outgoing Interface: GigabitEthernet0/0/0/1 (ifhandle 0x00000060)
Packets Switched: 0
(!): FRR pure backup

After a link goes down, the IGP preserves the forwarding entries of its protected Adj-SIDs for some
time, so that the traffic that still arrives with this Adj-SID is not dropped but forwarded via the
backup path. This gives time for a headend or SR PCE to update any SR Policy path that uses this
Adj-SID to a new path avoiding the failed link.

Unprotected Adj-SIDs are not protected by TI-LFA, even if TI-LFA is enabled on the node. The IGP
immediately removes the forwarding entry of an unprotected Adj-SID from the forwarding table when
the associated link goes down. If an unprotected Adj-SID is included in the SID list of an SR Policy
and the link of that Adj-SID fails, then packets are dropped until the SID list is updated.

Node1 in Figure 8‑18 has an SR Policy to endpoint Node4 with a SID list <16003, 24134>. 16003 is
the Prefix-SID of Node3, 24134 is the unprotected Adj-SID of Node3 for the link to Node4.
Figure 8-18: Including an unprotected Adj-SID in the SID list

Even though TI-LFA is enabled on Node3, IGP on Node3 does not install a TI-LFA backup path for
the unprotected Adj-SID 24134, as shown in Example 8‑19. As a consequence, traffic to this Adj-SID
is dropped when the link fails.

8.9.3 TI-LFA Applied to Flex-Algo SID


As we have seen in chapter 7, "Flexible Algorithm", TI-LFA also applies to Flex-Algo Prefix-SIDs
and the post-convergence backup path is computed with the same optimization objective and
constraints as the primary path.

An SR Policy SID list built with Flex-Algo Prefix-SIDs benefits from the TI-LFA protection of its
constituent SIDs.
8.10 Unprotected SR Policy
Some use-cases demand that no local protection is applied for specific traffic streams. This can for
example be the case for two live-live streams (1+1 redundancy) carried in two disjoint SR Policies.
Rather than locally restoring the SR Policy traffic after a failure and possibly creating congestion, the
operator may want that traffic to be dropped. Yet this decision must not prevent other traffic
traversing the same nodes and links from being protected by TI-LFA.

Figure 8‑19 illustrates such a use-case. A Market Data Feed source on the left sends two Market Data
Feed multicast traffic streams, carried in Pseudowires, over two disjoint paths in the WAN (live-live
or 1+1 redundancy). The resiliency for this Market Data Feed is provided by concurrently streaming
the same data over both disjoint paths.

The operator requires that, if a failure occurs in the network, the Market Data Feed traffic is not locally
protected. Given the capacity of the links and the bandwidth utilization of the streams, local protection
would lead to congestion and hence packet loss.
Figure 8-19: Unprotected Adj-SIDs use-case

The use-case in Figure 8‑19 is solved by configuring SR Policies from Node1 to Node2 and from
Node3 to Node4. These two SR Policies have explicit paths providing disjointness. The explicit SID
lists of these SR Policies consist of unprotected Adj-SIDs only. The Pseudowire traffic is steered into
these SR Policies.

The SR Policy and L2VPN configuration of Node1 is shown in Example 8‑21. The SR Policy to
Node2 is POLICY1, using explicit segment list SIDLIST1. This segment list contains one entry: the
unprotected Adj-SID 24012 of the adjacency to Node2.

The L2VPN pseudowire uses statically configured PW labels, although LDP signaling can be used as
well. The pseudowire traffic is steered into POLICY1 (10, 1.1.1.2) by specifying this SR Policy as
preferred-path.
Example 8-21: Configuration of SR Policy using unprotected Adj-SIDs and invalidation drop

segment-routing
traffic-eng
segment-list name SIDLIST1
index 10 mpls label 24012 !! Adj-SID Node1->Node2
!
policy POLICY1
color 10 end-point ipv4 1.1.1.2
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
l2vpn
pw-class PWCLASS1
encapsulation mpls
preferred-path sr-te policy srte_c_10_ep_1.1.1.2 fallback disable
!
xconnect group XG1
p2p PW1
interface Gi0/0/0/0
neighbor ipv4 1.1.1.2 pw-id 1234
mpls static label local 3333 remote 2222
pw-class PWCLASS1

Upon a link failure, traffic is not protected by TI-LFA since only unprotected Adj-SIDs are used in the
SID list. IGP convergence also does not divert the paths of the SR Policies, since they are pinned to
the links using Adj-SIDs.

A failure of a link will make the SR Policy invalid. Since fallback is disabled in the preferred-path
configuration, the PW will go down as well.

Another mechanism to prevent fallback to the default forwarding path that applies to all service
traffic, not only L2VPN using preferred path, is the “drop upon invalidation” behavior. This
mechanism is described in chapter 16, "SR-TE Operations".
8.11 Other Mechanisms
The resiliency of an SR Policy may also benefit from the following mechanisms:

SR IGP microloop avoidance

SR Policy liveness detection

TI-LFA Protection for an Intermediate SID of an SR Policy

8.11.1 SR IGP Microloop Avoidance


Due to the distributed nature of routing protocols, the nodes in a network converge asynchronously
following a topology change, i.e., the nodes do not start and end their convergence at the same times.
During this period of convergence, transient forwarding inconsistencies may occur, leading to
transient routing loops, the so-called microloops. Microloops cause delay, out-of-order delivery and
packet loss. Depending on topology, hardware, and coincidence, the occurrence and duration of
microloops may vary. They are less likely to occur and shorter in networks with devices that offer
fast and deterministic convergence characteristics.

The SR IGP microloop avoidance solution eliminates transient loops during IGP convergence. This is
a local functionality that temporarily steers traffic on a loop-free path using a list of segments until the
network has settled and microloops may no longer occur. This mechanism is beneficial for the
resiliency of SR Policies as it eliminates delays and losses caused by those transient loops along its
constituent IGP SIDs.

An SR Policy is configured on Node1 in Figure 8‑20. The SR Policy’s SID list is <16003, 16004>
with 16003 and 16004 the Prefix-SIDs of Node3 and Node4.

The link between Node3 and Node4 has a higher IGP metric 50, but since it provides the only
connection to Node4, it is the shortest IGP path from Node3 to Node4.
Figure 8-20: IGP microloop avoidance

The operator brings the link between Node4 and Node5 up, as in Figure 8‑21. In this new topology
the shortest IGP path from Node3 to Node4 is no longer via the direct link, but via the path
3→6→5→4.

Assume in Figure 8‑21 that Node3 is the first to update the forwarding entry for Node4 for the new
topology. Node3 starts forwarding traffic destined for Node4 towards Node6, but the forwarding
entry for Node4 on Node6 still directs the traffic towards Node3: a microloop between Node3 and
Node6 exists. If now Node6 updates its forwarding entry before Node5 then a microloop between
Node6 and Node5 occurs until Node5 updates its forwarding entry.
Figure 8-21: IGP microloop avoidance – after bringing up link

In order to prevent these microloops from happening, Node3 uses SR IGP microloop avoidance and
temporarily steers traffic for 16004 on the loop-free path by imposing the SID list <16005, 24054>
on the incoming packets with top label 16004. 16005 is the Prefix-SID of Node5, 24054 is the Adj-
SID of Node5 for the link to Node4. After the network has settled, Node3 installs the regular
forwarding entry for 16004.

By eliminating the possible impact of microloops, SR IGP microloop avoidance provides a resiliency
gain that benefits SR Policies.
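For reference, SR IGP microloop avoidance is enabled per IGP instance on the nodes, Node3 in this example. A minimal sketch for ISIS in IOS XR is shown below; the instance name is an assumption and the optional RIB-update-delay timer is left at its default.

router isis 1
 address-family ipv4 unicast
  microloop avoidance segment-routing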

Part I of the SR book provides more information about IGP microloop avoidance.

8.11.2 SR Policy Liveness Detection


The IOS XR Performance Measurement (PM)1 infrastructure is leveraged to provide SR Policy
liveness detection, independently or in combination with end-to-end delay measurement. Combining
liveness detection with end-to-end delay measurement reduces the number of probes.

PM not only verifies liveness of the SR Policy endpoint, it can also verify liveness of all
ECMP paths of an SR Policy by exercising each individual atomic path.
For liveness detection, the headend imposes the forward SID list and, optionally, a return SID list on
the PM probes. The remote endpoint simply switches the probes as regular data packets. This makes it
a scalable solution.

By default the PM probes are encoded using the Two-Way Active Measurement Protocol (TWAMP)
encoding [RFC5357] which is a well-known standard deployed by many software and hardware
vendors. See draft-gandhi-spring-twamp-srpm. Other probe formats may be available – RFC6374
using MPLS GAL/G-Ach or IP/UDP (draft-gandhi-spring-rfc6374-srpm-udp).

When PM detects a failure, it notifies SR-TE such that SR-TE can invalidate the SID list or the active
path or trigger path protection switchover to a standby path.

Leveraging PM for liveness verification removes the need to run (S)BFD over the SR Policy.

8.11.3 TI-LFA Protection for an Intermediate SID of an SR Policy


An SR Policy with SID list <16003, 16004> is installed on headend Node1 in Figure 8‑22.

Consider the failure of Node3, the target node of Prefix-SID 16003.

TI-LFA Node Protection does not apply for the failure of the destination node, or in this case the
target of a Prefix-SID, since no alternate path would allow the traffic to reach a failed node.

Although there is indeed no way to “go around” a failure of the final destination of a traffic flow, the
problem is slightly different when the failure affects an intermediate SID in an SR Policy.

Intuitively one can see that the solution is to skip the SID that failed and continue with the remainder
of the SID list. In the example, Node2 would pop 16003 and forward to the next SID 16004.
Figure 8-22: TI-LFA Protection for an intermediate SID of an SR Policy

draft-ietf-rtgwg-segment-routing-ti-lfa specifies TI-LFA Extended Node Protection. At the time of
writing it is not implemented in IOS XR.
8.12 Concurrency
The network resiliency technology is often hard to grasp because it involves many interactions.

First, many nodes may be involved, each providing a piece of the overall solution: one node may
detect a problem, another may participate in the transmission of that information, another node may
compute the solution and yet another may update the actual forwarding entry.

Second, several mechanisms may operate concurrently.

The fastest is TI-LFA, which, within 50msec, applies a pre-computed backup path for a constituent
SID of the SR Policy.

Then it is not easy to know whether the IGP convergence or the SR Policy convergence will happen
first. We advise considering them as happening concurrently, within the order of a second following
the failure.

The IGP convergence involves the distributed system of the IGP domain. The nodes re-compute the
shortest path to impacted IGP destinations and the related IGP Prefix SIDs are updated. If an SR
Policy is using one of these IGP Prefix SIDs, the IGP convergence benefits the resiliency of the SR
Policy.

The SR Policy convergence involves the headend of the policy (and potentially its SR PCE). The SR
Policy convergence involves the re-validation of explicit candidate paths and the re-computation of
dynamic candidate paths. The result of the SR Policy convergence process may be the selection of a
new candidate path or the computation of a different SID list for the current dynamic path.
Worth studying
“The SR-TE resiliency solution may first seem a difficult concept to apprehend. I would encourage you to study it
incrementally, a few concepts at a time and more as experience is collected.

It is foundational to Segment Routing because, at the heart of Segment Routing, we find the concept of a Prefix SID which
is a shortest path to a node maintained by a large set of nodes acting as a distributed system.

Hence at the heart of Segment Routing, we have IP routing and hence IP routing resiliency.

Obviously at times, a few may regret the easier to comprehend ATM/FR/RSVP-TE circuits without any ECMP and
without any distributed behavior. But quickly, one will remember that this apparent simpler comprehension is only an illusion
hiding much more serious problems of scale and complexity. ”

— Clarence Filsfils
8.13 Summary
The resiliency of an SR policy may benefit from several detection mechanisms and several
convergence or protection solutions:

local trigger from a local link failure

remote intra-domain trigger through IGP flooding

remote inter-domain trigger through BGP-LS flooding

validation of explicit candidate paths

re-computation of dynamic paths

selection of the next best candidate path

IGP convergence of constituent IGP Prefix SIDs

Anycast Prefix SID leverage

TI-LFA Local Protection of constituent SIDs, including Flex-Algo SIDs

TI-LFA Node Protection for an intermediate segment part of an SR Policy

Invalidation of a candidate path based on end-to-end liveness detection

We described each of these mechanisms in isolation and highlighted how they can co-operate to the
benefit of the resiliency of an SR Policy.

We have explained how an SR Policy can be designed to avoid TI-LFA protection when it is not
desirable.

A local link failure and a remote IGP flooded failure trigger IGP convergence and provide the intra-
domain trigger to the headend SR-TE process. IGP convergence maintains the constituent SIDs of an
SR Policy.
A remote failure flooded by IGP or BGP-LS triggers the headend and SR PCE SR-TE processes to
validate the explicit paths, recompute the dynamic paths and reselect the active path.

End-to-end liveness detection triggers the invalidation of a candidate path.

A local link failure triggers TI-LFA that provides sub-50ms restoration for the constituent SIDs of an
SR Policy.
8.14 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[subsec] “Achieving sub-second IGP convergence in large IP networks”, Francois P, Filsfils C,
Evans J, Bonaventure O, ACM SIGCOMM Computer Communication Review, 35(3):33-44,
<https://inl.info.ucl.ac.be/publications/achieving-sub-second-igp-convergence->, July 2005.

[RFC6774] "Distribution of Diverse BGP Paths", Robert Raszuk, Rex Fernando, Keyur Patel,
Danny R. McPherson, Kenji Kumaki, RFC6774, November 2012

[RFC7880] "Seamless Bidirectional Forwarding Detection (S-BFD)", Carlos Pignataro, David


Ward, Nobo Akiya, Manav Bhatia, Santosh Pallagatti, RFC7880, July 2016

[RFC7911] "Advertisement of Multiple Paths in BGP", Daniel Walton, Alvaro Retana, Enke Chen,
John Scudder, RFC7911, July 2016

[RFC5357] "A Two-Way Active Measurement Protocol (TWAMP)", Jozef Babiarz, Roman M.
Krzanowski, Kaynam Hedayat, Kiho Yum, Al Morton, RFC5357, October 2008

[RFC6374] "Packet Loss and Delay Measurement for MPLS Networks", Dan Frost, Stewart
Bryant, RFC6374, September 2011

[draft-ietf-rtgwg-segment-routing-ti-lfa] "Topology Independent Fast Reroute using Segment
Routing", Stephane Litkowski, Ahmed Bashandy, Clarence Filsfils, Bruno Decraene, Pierre
Francois, Daniel Voyer, Francois Clad, Pablo Camarillo Garvia, draft-ietf-rtgwg-segment-routing-
ti-lfa-01 (Work in Progress), March 2019

[draft-gandhi-spring-rfc6374-srpm-udp] "In-band Performance Measurement Using UDP Path for
Segment Routing Networks", Rakesh Gandhi, Clarence Filsfils, Daniel Voyer, Stefano Salsano,
Pier Luigi Ventre, Mach Chen, draft-gandhi-spring-rfc6374-srpm-udp-00 (Work in Progress),
February 2019
[draft-gandhi-spring-twamp-srpm] "In-band Performance Measurement Using TWAMP for Segment
Routing Networks", Rakesh Gandhi, Clarence Filsfils, Daniel Voyer, draft-gandhi-spring-twamp-
srpm-00 (Work in Progress), February 2019

1. Also see chapter 15, "Performance Monitoring – Link Delay".


Section II – Further details
This section deepens the concepts of the Foundation section and extends it with more details about
SR-TE topics.
9 Binding-SID and SRLB
What we will learn in this chapter:

In SR-MPLS, a BSID is a local label bound to an SR Policy.

A BSID is associated with a single SR Policy.

The purpose of a Binding-SID (BSID) is to steer packets into its associated SR Policy.

The steering can be local or remote.

Any packet received with the BSID as top label is steered into the SR Policy associated with this
BSID: i.e., the BSID label is popped and the label stack of the associated SR Policy is pushed.

Automated Steering (AS) locally steers a BGP/service route onto the BSID of the SR Policy
delivering the required SLA/color to the BGP next-hop.

The BSID of an SR Policy is the BSID of the active candidate path.

We recommend that all the candidate paths of an SR Policy use the same BSID.

By default, a BSID is dynamically allocated and is inherited by all candidate paths without a BSID.

A BSID can be explicitly specified, in which case we recommend doing so within the SRLB.

BSIDs provide the benefits of simplification and scaling, network opacity, and service
independence.

Equivalent SR Policies can be configured with the same explicit BSID on all the nodes of an
anycast set, such that traffic can be remotely steered into the best suited SR Policy using the
Anycast-SID followed by the explicit BSID.

A BSID can be associated with any interface or tunnel. E.g., a BSID assigned to an RSVP-TE
tunnel enables SR-TE to steer traffic into that tunnel.
We start by defining the Binding-SID as the key to an SR Policy. The BSID is used to steer local and
remote traffic into the SR Policy.

The BSID is dynamically allocated by default. Explicitly specified BSIDs should be allocated from
the SRLB to simplify explicit BSID allocation by a controller or application.

We recommend allocating the same BSID to all candidate paths to provide a stable BSID. This
simplifies operations.

After this, the benefits of using BSIDs are highlighted: simplicity, scale, network opacity, and service
independence.

Finally, we show how a BSID allocated to an RSVP-TE tunnel can be used to traverse an RSVP-TE
domain.
9.1 Definition
As introduced in chapter 2, "SR Policy", a Binding Segment is a local1 segment used to steer packets
into an SR Policy. For SR-MPLS, the Binding Segment identifier, called Binding-SID or BSID, is an
MPLS label. The BSID is an attribute of the candidate path and each SR Policy inherits the BSID of
its active candidate path. If the SR Policy is reported to a controller, then its associated BSID is
included in the SR Policy’s status report.

In practice, when an SR Policy is initiated, the headend dynamically allocates a BSID label at the SR
Policy level, which is used for all candidate paths of the SR Policy. This ensures that each SR Policy is
associated with a unique BSID and, on a given headend node, a particular BSID is bound to a single
SR Policy at a time. The BSID label is allocated from the dynamic label range, by default [24000-
1048575] in IOS XR, which is the pool of labels available for automatic allocation.

This default behavior can be overridden by specifying an explicit BSID for the SR Policy or its
constituent candidate paths, but it is recommended that all candidate paths of an SR Policy always
have the same BSID. Examples of scenarios where explicit BSIDs are required and recommendations
on their usage are provided in the next section.

In the topology of Figure 9‑1, Node1 is the headend of an SR Policy “POLICY1” with color 30 and
endpoint Node4. Its SID list is <16003, 24034>. SR-TE has dynamically allocated BSID label 40104
for this SR Policy. When this SR Policy becomes active, Node1 installs the BSID MPLS forwarding
entry as indicated in the illustration:

incoming label: 40104

outgoing label operation: POP, PUSH (16003, 24034)

outgoing interface and nexthop: via Node2 (along Prefix-SID 16003)


Figure 9-1: Binding-SID

Example 9‑1 shows Node1’s SR Policy configuration.


Example 9-1: SR Policy on Node1 with dynamically allocated BSID

segment-routing
traffic-eng
segment-list name SIDLIST1
index 10 mpls label 16003
index 20 mpls label 24034
!
policy POLICY1
!! binding-sid is dynamically allocated (default)
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 200
explicit segment-list SIDLIST1
!
preference 100
dynamic
metric
type te

Example 9-2: Status of SR Policy with dynamically allocated BSID on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.4


Name: srte_c_30_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:00:17 (since Sep 28 11:17:04.764)
Candidate-paths:
Preference: 200 (configuration) (active)
Name: POLICY1
Requested BSID: dynamic
Explicit: segment-list SIDLIST1 (valid)
Weight: 1, Metric Type: TE
16003 [Prefix-SID, 1.1.1.3]
24034
Preference: 100 (configuration)
Name: POLICY1
Requested BSID: dynamic
Dynamic (invalid)
Attributes:
Binding SID: 40104
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

Remote Steering

The MPLS forwarding entry of the BSID label 40104 on Node1 is shown in Example 9‑3. Traffic
with top label 40104 is forwarded into the SR Policy (30, 1.1.1.4), named srte_c_30_ep_1.1.1.4,
after popping the top label.

The forwarding entry of SR Policy (30, 1.1.1.4) itself is displayed in Example 9‑4. It shows that
labels (16003, 24034) are imposed on packets steered into this SR Policy.
Example 9-3: BSID forwarding entry

RP/0/0/CPU0:xrvr-1#show mpls forwarding labels 40104


Local Outgoing Prefix Outgoing Next Hop Bytes
Label Label or ID Interface Switched
------ ----------- -------- --------------------- --------------- --------
40104 Pop No ID srte_c_30_ep_1.1.1.4 point2point 0

Example 9-4: SR Policy forwarding entry

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail


Color Endpoint Segment Outgoing Outgoing Next Hop Bytes
List Label Interface Switched
----- ------------ ----------- -------- ------------ ------------ --------
30 1.1.1.4 SIDLIST1 16003 Gi0/0/0/0 99.1.2.2 0
Label Stack (Top -> Bottom): { 16003, 24034 }
Path-id: 1, Weight: 64
Packets Switched: 0

Local Steering

The BSID not only steers BSID-labeled traffic into the SR Policy, it also serves as an internal key to
its bound SR Policy. To locally steer service traffic into an SR Policy on a service headend, the
service signaling protocol (e.g., BGP) installs the service route recursing on this SR Policy’s BSID,
instead of classically installing a service route recursing on its nexthop prefix. This makes the BSID a
fundamental element of the SR architecture.

Node4 in Figure 9‑1 advertises BGP route 2.2.2.0/24 to Node1 with BGP nexthop Node4 and color
30. Example 9‑5 shows this BGP route received by Node1.

Example 9-5: BGP IPv4 unicast route 2.2.2.0/24 received by Node1

RP/0/0/CPU0:xrvr-1#show bgp ipv4 unicast 2.2.2.0/24


BGP routing table entry for 2.2.2.0/24
Versions:
Process bRIB/RIB SendTblVer
Speaker 6 6
Last Modified: Sep 20 08:21:32.506 for 00:24:00
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
Local
1.1.1.4 C:30 (bsid:40104) (metric 40) from 1.1.1.4 (1.1.1.4)
Origin IGP, metric 0, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 1, version 6
Extended community: Color:30
SR policy color 30, up, registered, bsid 40104, if-handle 0x000000d0

Node1 uses Automated Steering to steer this route into SR Policy (30, Node4) since this SR Policy
matches the nexthop and color of the route. To forward the traffic destined for 2.2.2.0/24 into the SR
Policy, BGP installs the forwarding entry of this route recursing on BSID label 40104.

Example 9‑6 shows the forwarding entry of service route 2.2.2.0/24. The output shows that Node1
forwards packet to 2.2.2.0/24 via local-label 40104 (recursion-via-label). This BSID
40104 is bound to the SR Policy (30, 1.1.1.4).

Example 9-6: Service route recurses on BSID

RP/0/0/CPU0:xrvr-1#show cef 2.2.2.0/24


2.2.2.0/24, version 45, internal 0x5000001 0x0 (ptr 0xa140fe6c) [1], 0x0 (0x0), 0x0 (0x0)
Updated Sep 20 08:21:32.506
Prefix Len 24, traffic index 0, precedence n/a, priority 4
via local-label 40104, 3 dependencies, recursive [flags 0x6000]
path-idx 0 NHID 0x0 [0xa17cbd10 0x0]
recursion-via-label
next hop via 40104/1/21
next hop srte_c_30_ep_1.1.1.4
9.2 Explicit Allocation
By default, the headend dynamically allocates the BSID for an SR Policy. This is appropriate in most
cases as the operator, or the controller initiating the SR Policy, does not care which particular label is
used as BSID. However, some use-cases require that a specific label value is used as the BSID,
either for stability reasons or to ensure that similar SR Policies on different headend nodes have the
same BSID value. In such cases, the BSID of the SR Policy should be explicitly allocated.

In a simple example illustrated in Figure 9‑2, the operator wants to statically steer a traffic flow from
a host H into an SR Policy POLICY1 on Node1. Therefore, the host imposes this SR Policy’s BSID
on the packets in the flow. Since the configuration on the host is static, this BSID label value must be
persistent, even across reloads. If the BSID were not persistent and its value changed after a reload, the host configuration would have to be updated to impose the new BSID value. This is undesirable; therefore, the operator specifies an explicit BSID 15000 for the SR Policy.

Figure 9-2: Simple explicit BSID use-case

Example 9‑7 shows the explicit BSID label 15000 specified for SR Policy POLICY1 on Node1.
Example 9-7: An explicit BSID for a configured SR Policy

segment-routing
traffic-eng
segment-list name SIDLIST1
index 10 mpls label 16003
index 20 mpls label 24034
!
policy POLICY1
binding-sid mpls 15000
color 100 end-point ipv4 1.1.1.4
candidate-paths
preference 200
explicit segment-list SIDLIST1
!
preference 100
dynamic
metric
type te

For a configured SR Policy, the explicit BSID is defined under the SR Policy and inherited by all
configured candidate paths and all signaled candidate paths that do not have an explicit BSID.

Although controller-initiated candidate paths can be signaled with their own explicit BSIDs, we
recommend that all candidate paths of a given SR Policy have the same BSID. This greatly simplifies
network operations by ensuring a stable SR Policy BSID that is independent of the selected
candidate path. All SR-TE use-cases and deployments known to the authors follow this
recommendation.

When not following this recommendation, the BSID of the SR Policy may change over time as the
active path of the SR Policy changes. If a newly selected candidate path has a different BSID than the
former active path, the SR policy BSID is modified to the new active path’s BSID. As a consequence,
all remote nodes leveraging the BSID of this SR Policy must be updated to the new value.

The SR Policy status in Example 9‑8 shows that the BSID 15000 is explicitly specified and that its
forwarding entry has been programmed.
Example 9-8: Status of SR Policy with explicit BSID

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 100, End-point: 1.1.1.4


Name: srte_c_100_ep_1.1.1.4
Status:
Admin: up Operational: up for 1d15h (since Feb 5 18:12:38.507)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: POLICY1
Requested BSID: 15000
Explicit: segment-list SIDLIST1 (valid)
Weight: 1
16003 [Prefix-SID, 1.1.1.3]
24034
Attributes:
Binding SID: 15000 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The headend programmed the BSID label 15000 in the MPLS forwarding table, as shown in
Example 9‑9. The packets with top label 15000 are steered into SR Policy (100, 1.1.1.4).

Example 9-9: Verify MPLS forwarding entry for Binding-SID of SR Policy

RP/0/0/CPU0:xrvr-1#show mpls forwarding labels 15000


Local Outgoing Prefix Outgoing Next Hop Bytes
Label Label or ID Interface Switched
------ --------- ------------- ---------------------- ----------- --------
15000 Pop SRLB (idx 0) srte_c_100_ep_1.1.1.4 point2point 0

Label 15000 is the first (index 0) label in the SR Local Block (SRLB) label range, as indicated in the
output (SRLB (idx 0)). The SRLB is a label range reserved for allocating local SIDs for SR
applications and it is recommended that all explicit BSIDs are allocated from this range, as further
explained below.

The operator can enforce using explicit BSIDs for all SR Policies on a headend by disabling dynamic
BSID label allocation as illustrated in Example 9‑10. With this configuration, a candidate path
without a specified BSID is considered invalid.

Example 9-10: Disable dynamic label allocation for BSID

segment-routing
traffic-eng
binding-sid dynamic disable
If the requested explicit BSID label for a candidate path is not available on the headend, then no
BSID is allocated by default to that candidate path. Without a BSID, the candidate path is considered
invalid and cannot be selected as active path for the SR Policy. This behavior can be changed by
instructing the headend to fall back to a dynamically allocated label if the explicit BSID label is not
available, as illustrated in Example 9‑11.

Example 9-11: Enable fallback to dynamic label allocation for BSID

segment-routing
traffic-eng
binding-sid explicit fallback-dynamic

This fallback behavior applies to any SR Policy candidate path with an explicit BSID, regardless of how it is communicated to the headend (CLI, PCEP, BGP SR-TE, etc.).

Note that an explicit BSID cannot be configured under an ODN color template. Since multiple SR
Policies with different endpoints can be instantiated from such template, they would all be bound to
the same explicit BSID. This is not allowed as it would violate the BSID uniqueness requirement.

SR Local Block (SRLB)

In the simple scenario illustrated in Figure 9‑2, the operator can easily determine that the label
15000 is available and configure it as an explicit BSID for the SR Policy. It is much more complex
for a controller managing thousands of SR Policy paths to find an available label to be used as
explicit BSID of an SR Policy. To simplify the discovery of available labels on a headend node,
explicit BSIDs should be allocated from the headend’s SRLB.

The SR Local Block (SRLB) is a range of local labels that is reserved by the node for explicitly
specified local segments. The default SRLB in IOS XR is [15000-15999].

As with the SRGB, the SRLB is advertised in the IGP (ISIS and OSPF) and in BGP-LS such that
other devices in the network, controllers in particular, can learn the SRLB of each node in the
network.

Example 9‑12 and Example 9‑13 show the ISIS Router Capability TLV and OSPF Router Information
LSA specifying the default SRLB.
Example 9-12: ISIS Router Capability TLV with SRLB advertised by Node1

RP/0/0/CPU0:xrvr-1#show isis database verbose xrvr-1

IS-IS 1 (Level-2) Link State Database


LSPID LSP Seq Num LSP Checksum LSP Holdtime/Rcvd ATT/P/OL
xrvr-1.00-00 * 0x00000178 0x5d23 621 /* 0/0/0
<...>
Router Cap: 1.1.1.1 D:0 S:0
Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
SR Local Block: Base: 15000 Range: 1000
SR Algorithm:
Algorithm: 0
Algorithm: 1
Node Maximum SID Depth:
Label Imposition: 10
<...>
Example 9-13: OSPF Router Information LSA with SRLB

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 4.0.0.0 self-originate

OSPF Router with ID (1.1.1.1) (Process ID 1)

Type-10 Opaque Link Area Link States (Area 0)

LS age: 47
Options: (No TOS-capability, DC)
LS Type: Opaque Area Link
Link State ID: 4.0.0.0
Opaque Type: 4
Opaque ID: 0
Advertising Router: 1.1.1.1
LS Seq Number: 80000002
Checksum: 0x47ce
Length: 88

Router Information TLV: Length: 4


Capabilities:
Graceful Restart Helper Capable
Stub Router Capable
All capability bits: 0x60000000

Segment Routing Algorithm TLV: Length: 2


Algorithm: 0
Algorithm: 1

Segment Routing Range TLV: Length: 12


Range Size: 8000

SID sub-TLV: Length 3


Label: 16000

Node MSD TLV: Length: 2


Type: 1, Value 10

Segment Routing Local Block TLV: Length: 12


Range Size: 1000

SID sub-TLV: Length 3


Label: 15000

Dynamic Hostname TLV: Length: 6


Hostname: xrvr-1

A non-default SRLB can be configured as illustrated in Example 9‑14 where label range [100000-
100999] is allocated as SRLB.

Example 9-14: Non-default SRLB configuration

segment-routing
!! local-block <first-label> <last-label>
local-block 100000 100999
For operational simplicity, it is recommended that the operator configures the same SRLB position
(between 15000 and ~1M) and size (≤ 256k) on all nodes.

A controller can learn via IGP, BGP-LS and PCEP the SRLB of each node in the network, as well as
the list of local SIDs explicitly allocated from the node’s SRLB (BSIDs, Adj-SIDs, EPE SIDs). The
controller can mark the allocated local SIDs in the SRLB as unavailable in order to deduce the set of
available labels in the SRLB of any node.

As an example, a controller learns that Node1’s SRLB is [15000-15999] and that Node1 has allocated labels 15000 and 15001 for Adj-SIDs, 15002 for an EPE SID, and 15003 for a BSID. Therefore, the controller can pick any label from the SRLB sub-range [15004-15999].

Randomly picking one of the available SRLB labels makes a collision with another controller or application allocating the same label at the same time unlikely.

To further reduce the chance of allocation collisions, the operator could allocate sub-ranges of the
SRLB to different applications, e.g., [15000-15499] to application A and [15500-15999] to
application B. Each application independently manages its SRLB sub-range.

If, despite these techniques, a BSID collision still occurs, e.g., due to a race condition, the application is notified via system alerts or learns about it via BGP-LS or PCEP.

By default, IOS XR accepts allocating any free label (except the label values <16) as BSID, but we
have highlighted the importance of allocating explicit BSIDs from the SRLB. This recommendation
can be enforced in IOS XR by applying the configuration in Example 9‑15.

Example 9-15: Only allow explicit BSIDs from SRLB

segment-routing
traffic-eng
binding-sid explicit enforce-srlb

With this configuration, any explicit BSID outside of the SRLB is considered unavailable. It is not allocated, and the subject candidate path is invalid. This enforcement behavior applies to any SR Policy candidate path with an explicit BSID, regardless of how it is communicated to the headend (CLI, PCEP, BGP SR-TE, etc.).
9.3 Simplification and Scaling
Using a Transit Policy via its BSID brings simplification and several scaling benefits. It makes it
possible to reduce the number of segments that must be imposed on a given headend node and it
reduces churn by isolating the headend node from changes in remote domains or areas.

Figure 9‑3 shows a multi-domain network topology: two Data Center domains, DC1 and DC2, interconnected by a WAN Core domain. Each domain runs its own IGP and has SR enabled.

The domains in this example are independent. No redistribution is done between the different
domains, and no BGP is used for inter-domain connectivity; inter-domain connectivity is only
provided by SR Policies. Domain independence is not required to take advantage of the BSID
functionality, but makes its benefits appear more clearly.

A low-delay path is required between Node11 and Node31. The operator instantiates an SR Policy to
Node31 with an end-to-end low-delay path on headend Node11.

Given the link-delay metrics, this low-delay path is 11→12→1→21→22→23→3→32→31, as illustrated in Figure 9‑3. This path can be encoded with SID list <16001, 16022, 24023, 16003, 16031>, where a 160XX entry is the Prefix-SID of NodeXX; for example, 16022 is the Prefix-SID of Node22. 24023 is the Adj-SID of Node22 for its link to Node23. Because of the high IGP metric (100) of this link, the IGP shortest path from Node22 to Node3 is not via the direct link to Node23, but
via path 22→21→24→25→23→3. Therefore, the Adj-SID 24023 is required to steer traffic across
this link.
Figure 9-3: Binding-SID illustration

Instead of encoding the entire end-to-end path in the SID list at headend Node11, the operator
leverages a low-delay SR Policy from DCI Node1 to DCI Node3 as Transit Policy, as illustrated in
Figure 9‑4. Doing so brings a number of benefits as discussed further in this section.

The low-delay SR Policy from DCI Node1 to endpoint DCI Node3 with the path 1→21→22→23→3
is illustrated in Figure 9‑4. The SID list that encodes this path is <16022, 24023, 16003>. The
operator has associated an explicit BSID 15000 to this SR Policy on Node1.

Node11 leverages the low-delay Transit Policy to cross the Core domain by including its BSID in the
end-to-end SR Policy’s SID list. The end-to-end low-delay path on Node11 can then be encoded as
the SID list <16001, 15000, 16031>. This is illustrated in Figure 9‑4.
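
As an illustration, Node11’s end-to-end SR Policy could be configured with an explicit candidate path carrying this SID list, along the lines of the sketch below. The policy and segment-list names, the color value, and Node31’s loopback address 1.1.1.31 are assumptions made for this sketch; only the three SIDs are taken from the example.

segment-routing
traffic-eng
segment-list name SIDLIST-E2E
index 10 mpls label 16001 !! Prefix-SID of DCI Node1
index 20 mpls label 15000 !! BSID of the low-delay Transit Policy on Node1
index 30 mpls label 16031 !! Prefix-SID of Node31
!
policy E2E-LOW-DELAY
color 20 end-point ipv4 1.1.1.31 !! assumed loopback address of Node31
candidate-paths
preference 100
explicit segment-list SIDLIST-E2E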

When imposing this SID list on a packet, the first entry 16001 brings the packet via the IGP shortest
path to Node1. The second entry BSID 15000 steers the packet on its associated SR Policy that brings
the packet to Node3 via the low-delay path. And the last entry 16031 then brings the packet from
Node3 to its destination Node31 via the IGP shortest path.
Figure 9-4: End-to-end SR Policy path leveraging BSID

Reduce Segment List Size

Compare the SID list of the SR Policy on Node11 in Figure 9‑3 (<16001, 16022, 24023, 16003, 16031>) to the one in Figure 9‑4 (<16001, 15000, 16031>). The end-to-end path stays the same, while the SID list size has been reduced from five segments to three.

Reducing the number of SIDs to impose on a headend reduces the label imposition capability required of that headend node. This is especially beneficial for lower-end devices that typically have more limited label imposition capabilities.

Reduce Churn on Headend Node

Leveraging a Transit Policy reduces the churn on the local headend. A Transit Policy’s BSID is a
stable anchor point that isolates one domain from the churn of another domain. Upon a topology
change in a remote domain, the path of a Transit Policy in that domain may change, but its BSID does
not change2. This implies that the SID list of the SR Policy that includes the stable BSID does not
change either. Consequently, the local headend node is isolated from the churn in the remote domain.
And less churn on the headend node allows for increased scaling.

A change in the topology of the Core domain in Figure 9‑4 that necessitates an update of the low-
delay SR Policy path in the Core domain does not affect the SR Policy on Node11. For example, if
the link between Node22 and Node23 fails, Node1 updates the low-delay path of its SR Policy to
Node3 via Node24 and Node25. However, the BSID 15000 of this SR Policy remains unchanged.
Therefore, the SR Policy on Node11 does not need to be updated.

Simplify Intermediate Aggregation Nodes

Using the BSID simplifies intermediate aggregation nodes and increases their scale. These intermediate aggregation nodes offer paths with specific characteristics that are anchored by stable BSIDs.

The characteristics of these paths and their BSIDs are known and can be leveraged by the remote
ingress nodes. These ingress nodes classify packet flows and steer these flows on the paths with the
proper characteristics. The ingress nodes encode their steering decision as a BSID in the packet
header’s SID list (label stack). The Transit Policy’s headend node does not (re-)classify these packet
flows, it does not stitch paths and is not involved in any steering decision. This node simply executes
the instructions (segments) that are encoded in each packet’s SID list. This is why the SR Policies
(with their BSIDs) are advertised in BGP-LS.

In the example of Figure 9‑4, Node11 classifies the packet flows and steers the flows that require a
low-delay path on the SR Policy that is illustrated in the figure. Node11 hereby leverages the low-
delay Transit Policy from Node1 to Node3. Node1 simply forwards the packet flows based on the
BSID label 15000. Node1 does not re-classify the packets and no configuration is required to stitch
the flows to the low-delay path. All this is taken care of by including the BSID in the SID list imposed by Node11.
9.4 Network Opacity and Service Independence
A BSID provides network opacity and service independence between domains. The administrative authority of a domain, for example the Core domain in Figure 9‑5, may not want to share its topology information with the other domains. The use of a Transit Policy and its BSID keeps the network opaque and the offered service independent.

The Core domain operator can instantiate SR Policies with certain transport service characteristics
(e.g., low delay) and these SR Policies’ characteristics and their BSIDs can then be provided to the
other domains DC1 and DC2. This way, these other domains can use the transport services provided
by the Core domain without being aware of the Core topology or how the Core domain provides the
transport services. The other domains are also not aware of any changes that the Core domain makes to the SR Policy paths.

Figure 9-5: BSID provides network opacity

Combine BSID With Anycast-SID

The operator makes use of Anycast-SIDs to provide load-sharing and node resiliency for the border
nodes (see chapter 3, "Explicit Candidate Path" and SR book Part I). The operator can combine the
benefits of the Anycast-SID with the benefits of the BSID.

Figure 9‑6 shows that Anycast-SID 17012 is allocated to Node1 and Node2. Instead of steering the
traffic via a specific border node, headend Node11 can steer the traffic via any border node by
including the Anycast-SID in the SID list.
The Core domain operator instantiates low-delay SR Policies crossing the Core domain on both
border nodes and allocates the same explicit BSID to them.
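
A configuration sketch of these two Transit Policies is shown below; the color value, the delay-optimized dynamic path, and the endpoint loopback addresses are illustrative assumptions, the key point being that both border nodes use the same explicit BSID 15000. Because a BSID only needs to be unique locally on its headend, the two headends can share the same BSID value without conflict.

!! On Node1 (low-delay Transit Policy towards Node3)
segment-routing
traffic-eng
policy LOW-DELAY
binding-sid mpls 15000
color 20 end-point ipv4 1.1.1.3 !! assumed loopback address of Node3
candidate-paths
preference 100
dynamic
metric
type delay

!! On Node2: identical configuration, with end-point ipv4 1.1.1.4 (Node4)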

Figure 9-6: Anycast-SID and Binding-SID

To use the Core domain’s low-latency paths to domain DC2, Node11 imposes the SID list <17012,
15000, 16031> on the packets. Anycast-SID 17012 brings the packets from Node11 to either Node1
or Node2. BSID 15000 then brings the packets to Node3 or Node4, depending on the headend node (Node1 or Node2) on which the packet landed. Finally, Node31’s Prefix-SID 16031 brings the
packets to Node31.
9.5 Steering Into a Remote RSVP-TE Tunnel
A BSID can be bound to any type of tunnel or interface: RSVP-TE tunnel, IP tunnel, GRE tunnel, IP/UDP tunnel, etc. This way, traffic can be steered over such tunnels and interfaces by including their BSIDs in an SR Policy’s SID list, without having to establish a routing protocol adjacency over them.

A BSID can be allocated for an RSVP-TE tunnel as shown in Example 9‑16. The BSID can be dynamically allocated or explicitly specified, as in the example. To dynamically allocate a label, omit the label value (label <label>) from the command.

The RSVP-TE interface tunnel-te1 in the example provides a TE-metric optimized path.

Example 9-16: Binding-SID for RSVP-TE tunnel

interface tunnel-te1
ipv4 unnumbered Loopback0
destination 1.1.1.3
binding-sid mpls label 4001
path-selection metric te
path-option 1 dynamic

Equivalent to the behavior of an SR Policy BSID, when a packet arrives on the RSVP-TE tunnel
headend with the BSID label on top of the label stack, the headend pops the BSID label and steers the
packet into the RSVP-TE tunnel.

This functionality provides the capability to stitch an SR Policy to an RSVP-TE tunnel, for example to
traverse an RSVP-TE part of the network, as illustrated in Figure 9‑7.
Figure 9-7: Use BSID to steer traffic over an RSVP-TE part of the network

RSVP-TE tunnels have been instantiated in the Core domain. There is an RSVP-TE tunnel from DCI
Node1 to DCI Node3 with a path that optimizes the TE metric. To use this RSVP-TE tunnel in an end-
to-end SR-TE path from Node11 to Node31, the operator assigns a BSID 4001 to this RSVP-TE
tunnel on Node1. Node11 can then leverage this RSVP-TE tunnel by imposing a SID list <16001,
4001, 16031> on the packets. It works the same way as using an SR Policy’s BSID: Node1’s Prefix-
SID 16001 brings the packets from Node11 to Node1, BSID 4001 steers them into the RSVP-TE
tunnel, and Node31’s Prefix-SID 16031 brings the packets to destination Node31.

Using this mechanism, SR-TE provides a seamless end-to-end path crossing an RSVP-TE part of the network.

This functionality can also be used to stitch SR-TE carried traffic to an RSVP-TE tunnel that
terminates in the RSVP-TE island, such as an RSVP-TE tunnel from Node1 to Node25 in the
illustration.
9.6 Summary
A Binding-SID is a segment that is bound to an SR Policy and its purpose is to steer packets into its
associated SR Policy. In SR-MPLS, a BSID is a local label bound to an SR Policy.

A BSID is associated with a single SR Policy on a given headend node.

A headend installs the BSIDs of its local SR Policies in the forwarding table. Any packet received
with the BSID as top label is steered into its associated SR Policy.

Local service traffic on the headend is steered into an SR Policy by recursing the service route on the
SR Policy’s BSID.

A headend steers a packet through an SR Policy of a remote headend by including the SID sequence
<Prefix-SID of the remote headend, BSID of the targeted remote SR Policy> in its SID list.

The BSID of an SR Policy is the BSID of its active candidate path. We recommend that all the candidate paths of an SR Policy use the same BSID. This results in a stable BSID for the SR Policy, which simplifies operations.

By default, a BSID is dynamically allocated and is inherited by all candidate paths that do not have a BSID of their own.

We recommend allocating an explicitly specified BSID from the SRLB. This simplifies explicit BSID allocation by a controller or application. The SRLB is the range of labels reserved for allocating explicitly specified local SIDs, such as BSIDs.

BSIDs provide the benefits of simplification and scaling, network opacity, and service independence.
These benefits can be achieved by leveraging remote SR Policies as Transit Policies and using their
BSIDs as stable anchor points.

A BSID can be bound to any type of tunnel or interface: RSVP-TE tunnel, IP tunnel, GRE tunnel, IP/UDP tunnel, etc. For example, a BSID assigned to an RSVP-TE tunnel enables SR-TE to steer traffic into that tunnel.
9.7 References
[RFC8402] "Segment Routing Architecture", Clarence Filsfils, Stefano Previdi, Les Ginsberg,
Bruno Decraene, Stephane Litkowski, Rob Shakir, RFC8402, July 2018

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

1. Although the SR Architecture (RFC 8402) allows a Binding Segment to be a global segment, it is
a local segment in IOS XR.↩

2. A stable BSID of an SR Policy is a design objective. See section 9.2.↩


10 Further Details on Automated Steering
What we will learn in this chapter:

When a BGP route has multiple colors, Automated Steering (AS) installs the route on the valid SR
Policy that matches the color with the highest numerical value.

While coloring BGP routes on the egress node for AS is most common, an operator can also color
BGP routes on the ingress node.

Automated Steering is applied to each individual path of a BGP multi-path route.

For specific use-cases, the operator can steer traffic based on color only and can then steer IPv6
traffic in an IPv4 SR Policy. This color-only and family-agnostic steering functionality leverages
the concept of a null endpoint.

The fundamentals of Automated Steering are covered in chapter 5, "Automated Steering".

A few more specific elements of Automated Steering (AS) are described in this chapter. We start by
explaining that AS steers into the valid SR Policy with the highest matching color when a BGP route
has multiple colors attached. Next we explain that BGP routes can be colored on the ingress PE. We
continue by applying AS when using BGP multi-path. We conclude by describing how to steer traffic
based only on the color of the route, leveraging the concept of a null endpoint.
10.1 Service Routes With Multiple Colors
If a BGP route R/r (a prefix R with prefix length r) with BGP next-hop N has multiple color extended communities attached (C1, …, Ck), then BGP steers R/r into the valid SR Policy with endpoint N and the numerically highest color value.

In Figure 10‑1, Node1 receives BGP route 4.4.4.0/24 with next-hop 1.1.1.4 and two colors, blue (20)
and green (30). BLUE and GREEN are the names of two SR Policies on Node1 with endpoint 1.1.1.4
and respective colors blue and green. Both SR Policies are valid and authorized-to-steer.

Since color green’s value 30 is numerically higher than color blue’s value 20, BGP steers 4.4.4.0/24
into SR Policy GREEN.

Figure 10-1: service route with multiple colors

Example 10‑1 shows the SR Policy configuration of Node1.


Example 10-1: SR Policy configuration Node1 – service route with multiple colors

segment-routing
traffic-eng
policy GREEN
!! (green, Node4)
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
policy BLUE
!! (blue, Node4)
color 20 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST2
!
segment-list name SIDLIST1
index 10 mpls label 16003 !! Prefix-SID Node3
index 20 mpls label 24034 !! Adj-SID link 3->4
!
segment-list name SIDLIST2
index 10 mpls label 16008 !! Prefix-SID Node8
index 20 mpls label 24085 !! Adj-SID link 8->5
index 30 mpls label 16004 !! Prefix-SID Node4

Attaching multiple colors to a BGP route can be used as a mechanism to let the egress PE signal
primary and secondary SR Policy steering selection using BGP.
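
As a reference sketch, the egress PE Node4 could attach both colors to the route with an outbound route-policy similar to the following; the set and policy names are illustrative, the neighbor address and AS number follow the chapter’s conventions, and the multi-value extcommunity-set syntax should be verified for the software release in use.

extcommunity-set opaque COLORS-BLUE-GREEN
20,
30
end-set
!
route-policy SET-TWO-COLORS
if destination in (4.4.4.0/24) then
set extcommunity color COLORS-BLUE-GREEN
endif
pass
end-policy
!
router bgp 1
neighbor 1.1.1.1
address-family ipv4 unicast
route-policy SET-TWO-COLORS out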

If SR Policy GREEN becomes invalid, due to a failure as illustrated in Figure 10‑2, BGP re-
resolves1 route 4.4.4.0/24 on the valid SR Policy with the next lower numerical color value matching
a color advertised with the prefix, SR Policy BLUE in this example. Since this functionality relies on
BGP to re-resolve the route, it is not a protection mechanism.
Figure 10-2: service route with multiple colors for primary/backup

When SR Policy GREEN becomes valid again some time later, BGP re-resolves the prefix 4.4.4.0/24 onto SR Policy GREEN, since it is once again the valid SR Policy with the highest color that matches a color of the route.
10.2 Coloring Service Routes on Ingress PE
Chapter 5, "Automated Steering" describes the typical situation where the egress PE specifies the
service level agreement (SLA; the “intent”) of a service route by coloring the route with the color that
identifies the desired service level. The egress PE advertises the service route with this information
attached as color extended community to the ingress PE, possibly via one or more Route-Reflectors
(RR). The ingress PE can then directly use these attached colors to steer the routes into the desired
SR Policy without invoking a BGP route-policy to classify the routes.

However, the ingress PE can override the SLA of any received service route by manipulating the
color attached to this route. Such manipulation is realized through an ingress route-policy that sets,
adds, modifies, or deletes any color extended community of a given service route.

In the topology of Figure 10‑3, Node1 has two SR Policies with endpoint Node4:

GREEN with color 30 (green) with an explicit path via Node3

BLUE with color 20 (blue) with an explicit path via Node8 and Node5

The configuration of these SR Policies on Node1 is the same as Example 10‑1.

In this example, egress Node4 advertises two prefixes with next-hop 1.1.1.4 in BGP:

4.4.4.0/24 with color 40 (purple)

5.5.5.0/24 without color

The operator wants to override the service level selection that was done by the egress node and
configures ingress Node1 to:

set color of 4.4.4.0/24 to 20 (blue)

set color of 5.5.5.0/24 to 30 (green)


Figure 10-3: Setting color of BGP routes on ingress PE

The BGP route-policy configuration of ingress PE Node1 is shown in Example 10‑2.

Example 10-2: Ingress PE Node1 BGP configuration

extcommunity-set opaque COLOR-BLUE


20
end-set
!
extcommunity-set opaque COLOR-GREEN
30
end-set
!
route-policy SET-COLOR
if destination in (4.4.4.0/24) then
set extcommunity color COLOR-BLUE
endif
if destination in (5.5.5.0/24) then
set extcommunity color COLOR-GREEN
endif
end-policy
!
router bgp 1
neighbor 1.1.1.4
address-family ipv4 unicast
route-policy SET-COLOR in

The configuration starts with the two extended community sets COLOR-BLUE and COLOR-GREEN.
Color is a type of opaque extended community (see RFC 4360), therefore the keyword opaque. Each
set contains one color value, set COLOR-BLUE contains value 20 and set COLOR-GREEN value 30.
Route-policy SET-COLOR replaces any currently attached color extended community of prefix
4.4.4.0/24 with the value 20 in the COLOR-BLUE extended community set. The color of prefix
5.5.5.0/24 is set to value 30 of the COLOR-GREEN set.

Node1 applies an ingress route-policy SET-COLOR on its BGP session to PE Node4 (1.1.1.4).

The ingress route-policy is exercised when the route is received and before Automated Steering is
done. Hence, any color changes done in the route-policy are taken into account for AS.

As a result, prefix 4.4.4.0/24 is steered into SR Policy BLUE, since its next-hop (1.1.1.4) and the
updated color (20) match the endpoint and color of this SR Policy. Prefix 5.5.5.0/24 is steered into
SR Policy GREEN, based on the updated color (30) of this prefix.

Similarly, an ingress route-policy modifying the BGP next-hop will cause the matching traffic to be
steered into an SR policy whose endpoint is the new next-hop.
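
For example, a hedged sketch of such an ingress route-policy is shown below; the prefix and the alternative next-hop 1.1.1.5 are purely illustrative.

route-policy SET-NH
if destination in (4.4.4.0/24) then
set next-hop 1.1.1.5
endif
pass
end-policy

With this route-policy applied inbound, Automated Steering matches the route against SR Policies with endpoint 1.1.1.5 and the route’s color, instead of the originally advertised next-hop.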
10.3 Automated Steering and BGP Multi-Path
Using Automated Steering, BGP matches each of the paths of a route with an SR Policy based on the
path’s nexthop and color. So, if BGP installs multiple paths for a route, AS is applied for each
individual path.

In the network of Figure 10‑4, CE Node45 in vrf Acme is multi-homed to two egress PEs: Node4 and
Node5. CE Node45 advertises the route 45.1.1.0/24 and both egress PEs color this route in green
(30) when advertising it to the ingress PE Node1.

Figure 10-4: BGP multi-path with Automated Steering

Ingress PE Node1 receives both paths for vrf Acme route 45.1.1.0/24. When using BGP RRs in the
network instead of a full mesh of BGP sessions between the PEs, the RRs need to propagate both
routes. This is a generic BGP multi-path requirement. Here this is achieved by using a different
Route-Distinguisher for the route on both egress PEs.

Ingress PE Node1 is configured for iBGP multi-path, as shown in Example 10‑3. To enable iBGP
multi-path, use the command maximum-paths ibgp <number of paths> under the address-family.
<number of paths> specifies the maximum number of paths that will be installed for any given
route. Since the IGP costs to the egress PEs are not equal in this example, the parameter unequal-
cost is added to the command to relax the default multi-path requirement that the IGP metrics to the
BGP nexthops must be equal.

Example 10-3: Configuration of ingress PE Node1

router bgp 1
bgp router-id 1.1.1.1
address-family vpnv4 unicast
!
neighbor 1.1.1.4
remote-as 1
update-source Loopback0
address-family vpnv4 unicast
!
neighbor 1.1.1.5
remote-as 1
update-source Loopback0
address-family vpnv4 unicast
!
vrf Acme
rd auto
address-family ipv4 unicast
maximum-paths ibgp 4 unequal-cost

As a result of this multi-path configuration, BGP on Node1 installs two paths to 45.1.1.0/24, one via
Node4 and another via Node5, since these paths both satisfy the multi-path requirements.

The SR Policy configuration of Node1 is shown in Example 10‑4. Node1’s SR Policies to Node4 and
Node5 with color green (30) are named GREEN_TO4 and GREEN_TO5 respectively.
Example 10-4: SR Policy configuration Node1 – BGP multi-path

segment-routing
traffic-eng
policy GREEN_TO4
!! (green, Node4)
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1
!
policy GREEN_TO5
!! (green, Node5)
color 30 end-point ipv4 1.1.1.5
candidate-paths
preference 100
explicit segment-list SIDLIST2
!
segment-list name SIDLIST1
index 10 mpls label 16003 !! Prefix-SID Node3
index 20 mpls label 24034 !! Adj-SID link 3->4
!
segment-list name SIDLIST2
index 10 mpls label 16008 !! Prefix-SID Node8
index 20 mpls label 24085 !! Adj-SID link 8->5

The egress PEs both advertise route 45.1.1.0/24 with a color green (30). Using Automated Steering
functionality, Node1 matches the path via Node4 to SR Policy (green, Node4) and the path via Node5
to SR Policy (green, Node5). Applying multi-path and AS, BGP installs both paths to 45.1.1.0/24
recursing on the Binding-SIDs (BSIDs) of the respective SR Policies matching the color and nexthop
of each path.

Example 10‑5 shows the BGP route 45.1.1.0/24 with its two paths. The first path, which is the best-
path, goes via Node4 using the SR Policy with color 30 to 1.1.1.4 (Node4). The second path goes via
Node5 using the SR Policy with color 30 to 1.1.1.5 (Node5).
Example 10-5: BGP multi-path route

RP/0/0/CPU0:xrvr-1#show bgp vrf Acme 45.1.1.0/24


BGP routing table entry for 45.1.1.0/24, Route Distinguisher: 1.1.1.1:2
Versions:
Process bRIB/RIB SendTblVer
Speaker 82 82
Last Modified: Jul 19 18:50:03.624 for 00:19:55
Paths: (2 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
Local
1.1.1.4 C:30 (bsid:40001) (metric 120) from 1.1.1.4 (1.1.1.4)
Received Label 90000
Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, multipath, import-candidate,
imported
Received Path ID 0, Local Path ID 0, version 81
Extended community: Color:30 RT:1:1
SR policy color 30, up, registered, bsid 40001

Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0
Path #2: Received by speaker 0
Not advertised to any peer
Local
1.1.1.5 C:30 (bsid:40002) (metric 130) from 1.1.1.5 (1.1.1.5)
Received Label 90002
Origin IGP, metric 0, localpref 100, valid, internal, multipath, import-candidate, imported
Received Path ID 0, Local Path ID 0, version 0
Extended community: Color:30 RT:1:1
SR policy color 30, up, registered, bsid 40002

Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.5:0

Example 10‑6 shows the CEF forwarding entry of vrf Acme route 45.1.1.0/24. The first path of this
route is steered into SR Policy (30, 1.1.1.4) GREEN_TO4 via BSID 40001 (via local-label
40001). The second path is steered into SR Policy (30, 1.1.1.5) GREEN_TO5 via its BSID 40002
(via local-label 40002).

The last label (9000X) in the labels imposed stack of the output is the VPN label of the prefix, as advertised by the egress PE. The first label in the imposed label stack is the label to reach the BGP nexthop, but since the nexthop is reached via the SR Policy, this label is implicit-null (ImplNull), which represents a no-operation in this case.
Example 10-6: CEF entry of BGP multi-path route

RP/0/0/CPU0:xrvr-1#show cef vrf Acme 45.1.1.0/24


45.1.1.0/24, version 13, internal 0x5000001 0x0 (ptr 0xa13e8b04) [1], 0x0 (0x0), 0x208 (0xa18f907c)
Updated Jul 19 18:50:03.624
Prefix Len 24, traffic index 0, precedence n/a, priority 3
via local-label 40001, 3 dependencies, recursive, bgp-multipath [flags 0x6080]
path-idx 0 NHID 0x0 [0xa163e174 0x0]
recursion-via-label
next hop VRF - 'default', table - 0xe0000000
next hop via 40001/0/21
next hop srte_c_30_ep_1.1.1.4 labels imposed {ImplNull 90000}

via local-label 40002, 3 dependencies, recursive, bgp-multipath [flags 0x6080]


path-idx 1 NHID 0x0 [0xa163e74c 0x0]
recursion-via-label
next hop VRF - 'default', table - 0xe0000000
next hop via 24028/0/21
next hop srte_c_30_ep_1.1.1.5 labels imposed {ImplNull 90002}
10.4 Color-Only Steering
By default, AS requires an exact match of the service route’s nexthop with the SR Policy’s endpoint.
This implies that the BGP next-hop of the service route must be of the same address-family (IPv4 or
IPv6) as the SR Policy endpoint.

In rare cases an operator may want to steer traffic into an SR Policy based on color only or an IPv6
service route into an IPv4 SR Policy (address-family agnostic). This would allow the operator to
reduce the number of SR Policies maintained on a headend, by leveraging a single SR Policy to
forward traffic regardless of its nexthop and address-family.

Automated Steering based on color only is possible by using the Color-Only bits (CO-bits) of the
Color Extended Community attribute.

The format of the Color Extended Community is specified in RFC 5512 (to be replaced by draft-ietf-
idr-tunnel-encaps) and is augmented in draft-ietf-idr-segment-routing-te-policy for the cases where it
is used to steer traffic into an SR Policy. The format, as specified in draft-ietf-idr-segment-routing-te-
policy, is shown in Figure 10‑5.

Figure 10-5: Color extended community attribute format

With:

Color-Only bits (CO-bits): influence the Automated Steering selection preference, as described in
this section.

Color Value: a flat 32-bit number

The CO-bits of a Color Extended Community can be specified together with the color value in the extcommunity-set that configures the color, as illustrated in Example 10‑7. The example configures the extended community set named “INCL-NULL” with color value “100” and sets the CO-bits of this color to “01”. This setting of the CO-bits indicates to the headend that it should also consider SR Policies with a null endpoint, as discussed further below. This extcommunity-set can then be used to attach
the color extended community to a prefix as usual, using a route-policy.

Example 10-7: set CO-bits for Color Extended Community

extcommunity-set opaque INCL-NULL


100 co-flag 01
end-set

Table 10‑1 shows the behavior associated with each setting of the CO-bits in the Color Extended
Community. It shows the selection criteria and preference order (most to least preferred) that BGP
uses to select an SR Policy to be associated with a received route R/r. The table assumes that the
received BGP route R/r with next-hop N has a single color C.

Table 10-1: CO-bits – traffic steering preference order

CO=00:
1. SR Policy (C, N)
2. IGP to N

CO=01:
1. SR Policy (C, N)
2. SR Policy (C, null(AFN))
3. SR Policy (C, null(any))
4. IGP to N

CO=10:
1. SR Policy (C, N)
2. SR Policy (C, null(AFN))
3. SR Policy (C, null(any))
4. SR Policy (C, any(AFN))
5. SR Policy (C, any(any))
6. IGP to N

Terminology:

“IGP to N” is the IGP shortest path to N


null(AFN) is the null endpoint for the address-family of N (AFN)

null(any) is the null endpoint for any address-family


any(AFN) is any endpoint of the address-family of N (AFN)

any(any) is any endpoint of any address-family

By default, the CO-bits of a color are 00, which results in the default Automated Steering functionality described in chapter 5, "Automated Steering": steer a BGP route R/r with next-hop N and color C into the valid and authorized-to-steer SR Policy (C, N). If no such SR Policy exists, then steer the route R/r on the IGP shortest path to N. This default behavior is listed under CO=00 in the table.

While discussing the other settings of the CO-bits, we will introduce some new terms that are also
used in Table 10‑1. Null endpoint is the first term.

Null endpoint

An operator can specify a null endpoint for an SR Policy if only the color of a route is of
importance for the steering or if a “wildcard” SR Policy for a given color is required.
Depending on the address-family, the null-endpoint will be 0.0.0.0 (IPv4) or ::0 (IPv6).

Remember that the endpoint attribute of an SR Policy does not directly determine the location
where the packets will exit the SR Policy. This exit point derives from the SID list associated
with the SR Policy, in particular the last SID.

However, dynamic candidate paths are computed based on the location of the SR Policy
endpoint. Since the null endpoint does not have a specific location, it is therefore not compatible
with dynamic candidate paths. All the candidate paths of an SR Policy with a null endpoint must
be explicit.

Table 10‑1 uses the term “null(AFN)” to indicate the null endpoint for the Address-Family of
nexthop N (indicated by AFN). For example, if N is identified by an IPv6 address then null(AFN)
= null(IPv6) = ::0.

Note that only one SR Policy (C, null(AF)) for each AF can exist on a given head-end node
since the tuple (color, endpoint) uniquely identifies an SR Policy.
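
As an illustration, a color-only SR Policy with an IPv4 null endpoint could be configured as in the sketch below; the names, the color value, and the single-SID segment list are assumptions, and note that the candidate path is explicit, as required for a null endpoint.

segment-routing
traffic-eng
segment-list name SIDLIST-NULL
index 10 mpls label 16004 !! Prefix-SID of Node4
!
policy COLOR-ONLY
color 100 end-point ipv4 0.0.0.0 !! IPv4 null endpoint
candidate-paths
preference 100
explicit segment-list SIDLIST-NULL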

For a BGP route R/r that has a single color C with the CO-bits set to 01, BGP steers the route R/r according to the preference order listed under CO=01 in Table 10‑1.

BGP preferably steers the route R/r into a valid SR Policy (C, N).

If no such SR Policy is available, then BGP steers the route R/r into a valid SR Policy (C,
null(AFN)), which matches the route’s color C and has a null endpoint of the same address-family as
the nexthop N.
If such an SR Policy is not available either, then BGP steers the route R/r into a valid SR Policy (C, null(any)), with a matching color and a null endpoint of any address-family.

Finally, if none of the above SR Policies are available, then BGP steers the route R/r on the IGP
shortest path to its next-hop N.

The second new term is any endpoint.

Any endpoint

Table 10‑1 uses the term any(AFN) to indicate any endpoint that matches the address-family of
nexthop N (which is indicated by AFN). This allows any SR Policy with a matching color and
address family to be selected.

If, for example, route R/r has BGP nexthop 1.1.1.3 (i.e., N=1.1.1.3), then the address-family of N
(AFN) is IPv4. any(AFN) is then any(IPv4) and will match any IPv4 endpoint. any(any) goes one
step further by allowing an SR Policy of any endpoint and any address-family to be selected.

For a BGP route R/r that has a single color C with the CO-bits set to 10, BGP steers the route R/r according to the preference order listed under CO=10 in Table 10‑1.

The preference order is identical to the order for CO-bits 01, up to preference 3.

The next (4th) preference is a valid SR Policy (C, any(AFN)), which matches the route’s color C and
has an endpoint of the same address-family as the nexthop N.

If such an SR Policy is not available, then BGP steers the route into a valid SR Policy (C, any(any)).

If none of the above SR Policies are available, then BGP steers the route R/r on the IGP shortest path
to its next-hop N.

At the time of writing, handling of CO-bits=10 is not implemented in IOS XR.

Address-Family Agnostic Steering

As a consequence of the Color-Only steering mechanism, traffic of one address-family can be steered
into an SR Policy of the other address-family.
For example, headend H receives BGP IPv6 route 4::4:4:0/112 with nexthop B:4:: and color 20 with
the CO-bits set to 01. The non-default CO-bits setting 01 indicates to also consider SR Policies with
color 20 and a null endpoint as a fallback. In this example, H has no SR Policy with nexthop B:4::
and color 20, nor an SR Policy with IPv6 null endpoint ::0 and color 20, but H has an SR Policy with
IPv4 null endpoint 0.0.0.0 and color 20. Therefore, BGP steers 4::4:4:0/112 on this IPv4 SR Policy.

In this situation, a particular transport problem may arise if the last label of the label stack is popped
before the traffic reaches the endpoint node, due to the Penultimate Hop Popping (PHP) functionality.

MPLS (Multiprotocol Label Switching) transport can carry multiple types of traffic: IPv4, IPv6, L2, etc. As long as the packet contains an MPLS header, it is transported seamlessly, even by nodes that do not support the transported type of traffic.

However, when a node pops the last label and removes the MPLS header, the node needs to know and
support the type of packet that is behind the now removed MPLS header in order to correctly forward
this packet. “Correctly forward” implies elements such as using correct Layer 2 encapsulation and
updating TTL field in the header. An MPLS header does not contain a protocol field nor any other
indication of the type of the underlying packet. The node that popped the last label would need to
examine the newly exposed packet header to find out if it is IPv4 or IPv6 or something else.

If the SR Policy uses IPv4-based SIDs, then the nodes are able to transport MPLS and unlabeled IPv4, but possibly not unlabeled IPv6. Conversely, if the SR Policy uses IPv6-based SIDs, then the nodes are able to transport MPLS and unlabeled IPv6, but possibly not unlabeled IPv4.

To avoid this penultimate node problem, the packets of the address-family that differs from the SR
Policy’s address-family must be label-switched all the way to the endpoint of the SR Policy. The
exposed IP header problem on the penultimate node does not arise in such case.

When steering an IPv6 destination into an IPv4 SR Policy, BGP programs the route to impose an IPv6
explicit-null (label value 2) on unlabeled IPv6 packets steered into this SR Policy. With this IPv6
explicit-null label at the bottom of the label stack, the packet stays labeled all the way to the ultimate
hop node. This node then pops the IPv6 explicit-null label and does a lookup for the IPv6 destination
address in the IPv6 forwarding table to forward the packet.
For labeled IPv6 packets, it is the source node’s responsibility to impose a bottom label that is not
popped by the penultimate node, such as an IPv6 explicit-null label or an appropriate service label.

Note that for destinations carried in SR Policies of their own address-family (e.g., IPv4 traffic
carried by IPv4 SR Policy) nothing changes.

Figure 10‑6 illustrates an SR Policy with IPv4 null endpoint transporting the following types of
traffic. This SR Policy’s SID list is <16004>, only containing the Prefix-SID of Node4. Prefixes
4.4.4.0/24 and 4::4:4:0/112 have color 10 with CO-bits set to “01”.

Figure 10-6: Address-Family agnostic steering into IPv4 SR Policy

Unlabeled IPv4

Automated Steering steers unlabeled IPv4 packets into the matching (nexthop and color) IPv4 SR
Policy.
Labeled IPv4

Labeled IPv4 packets with the top label matching a BSID are steered into the IPv4 SR Policy
associated with the BSID.

Unlabeled IPv6

BGP imposes the IPv6-explicit-null label when steering unlabeled IPv6 packets into the
matching IPv4 SR Policy.

Labeled IPv6

The source node is expected to add IPv6-explicit-null (or another appropriate label) underneath
the BSID of the IPv4 SR Policy.

At the time of writing, steering IPv4 destinations into IPv6 SR Policies is not possible. Steering
IPv6 destinations into IPv4 SR Policies is possible by default. It can be disabled per SR Policy, as
shown in Example 10‑8.

Example 10-8: disable steering IPv6 destinations into IPv4 SR Policy with null endpoint

segment-routing
traffic-eng
policy POLICY1
ipv6 disable
color 20 end-point ipv4 0.0.0.0
candidate-paths
preference 100
dynamic

Color-only steering for the BGP-LU, 6PE, 6vPE, VPNv4, and EVPN cases follows the same semantics as for global IPv4/IPv6 traffic.
10.5 Summary
This chapter describes some specific elements of Automated Steering that are not covered in chapter
5, "Automated Steering".

A BGP route can have multiple SLA colors. Automated Steering installs such a route on the valid SR Policy that matches the color with the highest numerical value.

An ingress BGP route-policy on the ingress PE can be used to add, update, or delete the SLA color(s) advertised by the egress PE.

With BGP multi-path, Automated Steering is applied to each installed multi-path entry.

In very specific use-cases, an operator may desire to use color-only and address-family agnostic
steering. This allows steering traffic only based on color and enables steering traffic of one address-
family in an SR Policy of the other address-family.
10.6 References
[RFC5512] "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP Tunnel Encapsulation Attribute", Pradosh Mohapatra, Eric C. Rosen, RFC5512, April 2009

[RFC7911] "Advertisement of Multiple Paths in BGP", Daniel Walton, Alvaro Retana, Enke Chen, John Scudder, RFC7911, July 2016

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-ietf-idr-tunnel-encaps] "The BGP Tunnel Encapsulation Attribute", Eric C. Rosen, Keyur Patel, Gunter Van de Velde, draft-ietf-idr-tunnel-encaps-11 (Work in Progress), February 2019

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

1. BGP re-executes nexthop resolution to find the new route to reach the nexthop for this route.↩
11 Autoroute and Policy-Based Steering
What we will learn in this chapter:

Policy-based steering methods allow an operator to configure a local routing policy on a headend
that overrides any BGP/IGP path and steers specified traffic flows on an SR Policy.

Autoroute is an IGP functionality where the IGP automatically installs forwarding entries for destinations on or beyond the endpoint of the SR Policy into this SR Policy.

Pseudowire traffic can be steered into an SR Policy by pinning it to this SR Policy.

Traffic can be statically steered into an SR Policy by using a static route.

Automated Steering (AS) is the recommended steering functionality for SR-TE. It allows an operator
to automatically steer traffic with a fine granularity.

However, AS is not the only steering mechanism. Other steering mechanisms are available to the
operator to solve particular use-cases. These steering methods allow the operator to configure a local
routing policy on a headend that overrides any BGP or IGP shortest path and steers a specified traffic
flow into an SR Policy.

In this chapter we first describe how the IGP can steer prefixes into an SR Policy by using autoroute.
Autoroute has some limitations that will be highlighted. Next we explain steering pseudowire traffic
into an SR Policy by pinning the pseudowire to the SR Policy. Finally, we show how a static route can be used to point traffic into an SR Policy.
11.1 Autoroute
Autoroute [RFC 3906], also known as IGP shortcut, is a mechanism to let the IGP steer destination
prefixes into SR Policies.

Enabling autoroute on an SR Policy essentially instructs the IGP to steer destination prefixes that are advertised by the SR Policy’s endpoint node or its downstream nodes (downstream on the IGP shortest path graph) into this SR Policy. The IGP installs these destination prefixes in the forwarding table, pointing to the SR Policy.

Autoroute is a local behavior on the headend. No IGP adjacency is formed over the SR Policy and the
IGP does not advertise any autoroute information to the other nodes in the network.

If the IGP installs the forwarding entry for the SR Policy’s endpoint N into the SR Policy, all BGP
routes with nexthop N that are not steered by Automated Steering are steered into this SR Policy. BGP
installs these routes in the forwarding table, recursing on their nexthop N, and IGP steers this nexthop
into the SR Policy.

Automated Steering overrides the steering done by autoroute. BGP installs colored BGP routes that
match an SR Policy, recursing on the matching SR Policy’s BSID instead of recursing on their
nexthop.

Figure 11‑1 shows a two-area IGP network. It can be OSPF, where Node5 and Node6 are Area Border Routers, or ISIS, where Node5 and Node6 are Level-1-2 routers, Area 1 is Level-2, and Area 2 is Level-1.
Figure 11-1: Autoroute

An SR Policy GREEN (green, Node6) is programmed on Node1 with a SID list that enforces the path
1→2→3→6 (the IGP shortest path is 1→7→6) and configured with autoroute announce.
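
A configuration sketch for this autorouted SR Policy is shown below. The color value, Node6’s loopback address 1.1.1.6, and the segment-list name are assumptions; also note that the exact autoroute keywords under the policy (e.g., autoroute announce versus autoroute include ipv4 all) differ between IOS XR releases, so verify the syntax for your software version.

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.6 !! assumed loopback address of Node6
autoroute announce
candidate-paths
preference 100
explicit segment-list SIDLIST-GREEN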

As a result, the IGP installs the forwarding entries to Node4, Node5 and Node6, which are located
downstream of the SR Policy’s endpoint Node6, via SR Policy GREEN. These destinations are said
to be autorouted in this SR Policy.

Since Node1 sees the loopback address of Node8 as advertised by the area border nodes Node5 and
Node6, the forwarding entry to Node8 is also installed via the SR Policy.

The forwarding table of Node1 towards the other nodes in the network is shown on Figure 11‑1.

The IGP steers unlabeled and label imposition forwarding entries for autorouted destinations into the
SR Policy.

The IGP installs the label swap forwarding entries for Prefix-SIDs of autorouted destinations onto the
SID’s algorithm shortest path, with one exception: algorithm 0 Prefix-SIDs. Read chapter 7, "Flexible
Algorithm" for more information about Prefix-SID algorithms. For algorithm 0 (default SPF) Prefix-
SIDs the IGP installs the label swap forwarding entry into the SR Policy if this SR Policy’s SID list
only contains non-algorithm-0 Prefix-SIDs.
Otherwise, i.e., if the SR Policy contains at least one algorithm 0 Prefix-SID, the IGP installs the
label swap entry for algorithm 0 (default SPF) Prefix-SIDs onto the IGP shortest path.

In order to steer SR labeled traffic into an SR Policy using autoroute, e.g., on an aggregation node, the SR Policy must use only strict-SPF Prefix-SIDs and Adj-SIDs. Note that autorouting SR labeled traffic into an SR Policy is only possible for algorithm 0 Prefix-SIDs.

To steer service traffic into an SR Policy using autoroute, e.g., on a PE node, the SR Policy’s SID list
can contain any type of Prefix-SIDs since the label imposition forwarding entry will forward the
service traffic into the SR Policy.

LDP forwarding entries are installed as well for autorouted prefixes. The label swap forwarding
entry for an autorouted destination with an LDP local label is always installed, regardless of the
composition of the SR Policy’s SID list.

If a (targeted) LDP session with the SR Policy’s endpoint node exists, then the installed outgoing
label for the LDP forwarding entry is the learned LDP label. Otherwise the Prefix-SID of the prefix,
possibly advertised by an SR Mapping Server (SRMS), is used as outgoing label, hereby leveraging
LDP to SR interworking functionality. More information of SR and LDP interworking is provided in
Part I of the SR book series.

Limitations of Autoroute

Autoroute is widely used in classic MPLS TE deployments, but it has a number of disadvantages and
limitations that make its use less desirable as compared to the recommended SR-TE Automated
Steering functionality.

Limited to local IGP area SR Policies

Since autoroute is an IGP functionality and the IGP has only visibility in the topology of its local
area, autoroute applicability is limited to SR Policies with an endpoint located in the local IGP
area.

Limited to per-BGP nexthop steering

Autoroute steers the specific IGP prefixes into an SR Policy, as well as BGP routes recursing on
these IGP routes. Therefore, for BGP prefixes the steering is per-nexthop, not per-destination. If
a nexthop address N is autorouted into an SR Policy, then all BGP routes with N as nexthop are
steered into the SR Policy as well.
Note that Automated Steering overrides autoroute steering. Automated Steering steers a BGP
route into the SR Policy that matches the color and nexthop of that BGP route, even if this
nexthop prefix is autorouted into another SR Policy.

Blanket steering method

When enabling autoroute on an SR Policy, all IGP prefixes of the endpoint node and of the nodes
on the SPT behind the endpoint node are steered into the SR Policy, as well as all BGP routes
recursing on these prefixes.

Steering SR labeled traffic

As explained in the previous section, incoming labeled traffic with the Prefix-SID of one of these autorouted prefixes as top label is steered into the SR Policy only if this SR Policy’s SID list contains only strict-SPF Prefix-SIDs and Adj-SIDs. If the SID list contains one or more non-strict-SPF Prefix-SIDs, SR labeled traffic is steered on the IGP shortest path instead.
11.2 Pseudowire Preferred Path
The operator who wants to pin the path for an L2VPN Pseudowire (PW) to an SR Policy can use the
configuration shown in Example 11‑1. In this example configuration, the L2VPN PW is pinned to SR
Policy GREEN using the preferred-path sr-te policy srte_c_30_ep_1.1.1.4 configuration
in the applied pw-class.

With this configuration, if the SR Policy goes down then the pseudowire traffic will follow the default
(labeled) forwarding path to the L2VPN PW neighbor. This is typically the IGP shortest path.

This default fallback can be disabled by adding fallback disable to the preferred-path
configuration. When this option is specified and the SR Policy goes down, the pseudowire goes down
as well.

Example 11‑1 shows the configuration for a pseudowire XCON-P2P with statically specified pseudowire labels (mpls static label local 2222 remote 3333). Another option is to use LDP to exchange pseudowire labels with the neighbor. In that case, no labels are configured, but mpls ldp must be enabled globally on the node. Enabling LDP does not imply that LDP transport labels will be used. If LDP is only used for PW label negotiation and not for transport, then no interfaces need to be configured under mpls ldp, as shown in the sketch below.
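The sketch below is a minimal illustration only; the router-id value is hypothetical and no interfaces are listed under mpls ldp, so LDP is used for PW label signaling only:

mpls ldp
 router-id 1.1.1.1
 !! no interfaces configured: LDP is only used for PW label signaling,
 !! not for transport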

The other cross-connect in Example 11‑1 is XCON-EVPN-P2P, an EVPN VPWS signaled pseudowire.

In the pw-class EoMPLS-PWCLASS the SR Policy GREEN is configured as preferred path. This
pw-class is applied to both cross-connects to make SR Policy GREEN their preferred path.
Optionally fallback to the IGP shortest path can be disabled in the preferred-path configuration.
Example 11-1: L2VPN PW preferred-path

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay
!
l2vpn
pw-class EoMPLS-PWCLASS
encapsulation mpls
preferred-path sr-te policy srte_c_30_ep_1.1.1.4
!! or: preferred-path sr-te policy srte_c_30_ep_1.1.1.4 fallback disable
!
xconnect group XCONGRP
p2p XCON-P2P
interface TenGigE0/1/0/3
neighbor ipv4 1.1.1.4 pw-id 1234
!! below line only if not using LDP for PW signaling
mpls static label local 2222 remote 3333
pw-class EoMPLS-PWCLASS
!
p2p XCON-EVPN-P2P
interface TenGigE0/1/0/4
neighbor evpn evi 105 target 40101 source 10101
pw-class EoMPLS-PWCLASS
11.3 Static Route
The network operator can use static routing to steer specific destination prefixes into an SR Policy.
The configuration is illustrated in Example 11‑2.

Example 11-2: Static route into SR Policy – configuration

segment-routing
traffic-eng
policy GREEN
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay
!
router static
address-family ipv4 unicast
99.20.20.0/24 sr-policy srte_c_30_ep_1.1.1.4

The route points directly to the SR Policy, as shown in Example 11‑3. Note that a static route has a low administrative distance (1), lower than that of the IGPs and BGP, and therefore overrides any existing routing entry provided by those protocols.

Example 11-3: Static route into SR Policy

RP/0/0/CPU0:xrvr-1#show route 99.20.20.0/24

Routing entry for 99.20.20.0/24


Known via "static", distance 1, metric 0 (connected)
Installed Jun 19 11:54:54.290 for 00:08:07
Routing Descriptor Blocks
directly connected, via srte_c_30_ep_1.1.1.4
Route metric is 0, Wt is 1
No advertising protos.
11.4 Summary
Local routing policies such as autoroute, pseudowire preferred-path, and static routes can be used to steer traffic into an SR Policy.

Autoroute is an IGP functionality where the IGP installs the forwarding entries for prefixes located on
or beyond the endpoint of an SR Policy into that SR Policy.

Automated Steering overrides autoroute steering.

A pseudowire can be pinned to an SR Policy to steer all traffic of that pseudowire into this SR
Policy.
11.5 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[RFC3906] "Calculating Interior Gateway Protocol (IGP) Routes Over Traffic Engineering
Tunnels", Henk Smit, Naiming Shen, RFC3906, October 2004
12 SR-TE Database
What you will learn in this chapter:

SR-TE computes and validates SR Policy paths based on the information contained in the SR-TE
DB

The SR-TE DB is populated with local domain topology (nodes, links, metrics) and SR (SRGB,
SIDs) information through the IGP

Topology and SR information from remote domains and BGP-only domains is obtained through
BGP-LS

The SR-TE DB is intrinsically multi-domain capable: it can contain and consolidate information of
an entire multi-domain network

Multi-domain topology consolidation allows SR-TE to compute globally optimal inter-domain SR Policy paths

Detailed information about all active SR Policies in the network can be obtained through PCEP or
BGP-LS

The SR-TE database represents the view that the SR-TE process of a node has of its network. It is the one-stop source of information for SR-TE to compute and validate SR Policy paths. On a headend
node, it contains local area information fed by the IGP process. It is used to compute intra-area
dynamic paths and translate explicit paths expressed with Segment Descriptors into a list of MPLS
labels. On an SR PCE, it contains the entire multi-domain network topology, acquired via BGP-LS,
that is used to compute inter-domain SR Policy paths as well as disjoint SR Policy paths originating
from separate headend nodes. The topology of BGP-only networks can also be collected via BGP-
LS. In this chapter, we explain how the SR-TE DB is populated with IGP information on a headend,
with BGP-LS and PCEP on an SR PCE and how network topology information from different
domains is consolidated into a single multi-domain topology.
12.1 Overview
The SR-TE Database (SR-TE DB) contains the information for SR-TE to compute and validate SR
Policy paths. The information in the SR-TE DB includes:

Base IGP topology information (nodes, links, IGP metric, …)

Egress Peer Engineering (EPE) information

Segment Routing information (SRGB, Prefix-SID, Adj-SID, …)

TE Link Attributes (TE metric, link delay metric, SRLG, affinity colors, …)

SR Policy information (headend, endpoint, color, segment list, BSID, …)

The information in the SR-TE DB is protocol independent and it may be learnt via the link-state IGPs,
via BGP-LS, or via PCEP.

The SR-TE DB is intrinsically multi-domain capable. In some use-cases, the SR-TE DB may only
contain the topology of the local attached domain while in other use-cases the SR-TE DB contains the
topology of multiple domains.

Instance Identifier

Each routing domain, or routing protocol instance, in the network is identified by a network-wide
unique Instance Identifier (Instance-ID), which is a 64-bit value assigned by the operator. A given
routing protocol instance must be associated with the same Instance-ID on all nodes that participate in this instance. Conversely, different routing protocol instances, whether on the same or on different nodes in the network, must each be assigned a different Instance-ID.

Furthermore, a single Instance-ID is assigned to an entire multi-level or multi-area IGP instance. This
means that the same Instance-ID must be associated to the IGP instance on all nodes in the domain,
regardless of their level or area.

The Instance-ID is configured on a node with distribute link-state instance-id <Instance-ID> under the IGP instance. The configuration in Example 12‑1 shows a node running two IGP instances. Instance-ID 100 is configured for the ISIS instance named “SR-ISIS” and Instance-ID 101
for the OSPF instance “SR-OSPF”. The same Instance-ID 100 is associated to this ISIS instance on
the other nodes in the “SR-ISIS” domain, and similarly for the Instance-ID 101 in the “SR-OSPF”
domain.

If no instance-id is specified in the configuration, the default Instance-ID 0 is used. This default Instance-ID is reserved for networks running a single routing protocol instance.

Example 12-1: Specify the domain Instance-ID under the IGP

router isis SR-ISIS


distribute link-state instance-id 100
!
router ospf SR-OSPF
distribute link-state instance-id 101

The distribute link-state configuration enables the feeding of information from the IGP to the
SR-TE DB and to BGP-LS. It is thus required on SR-TE headend nodes, SR PCEs participating in an
IGP and BGP-LS Producers1.

Each domain-specific object (node, link, prefix) in the SR-TE DB is associated with the Instance-ID
of its domain identifying the domain where the object belongs. When distributing the object in BGP-
LS, the Instance-ID is carried within the Identifier field of the object’s BGP-LS NLRI. See chapter
17, "BGP-LS" for more details.

When the above Instance-ID assignment recommendations are not followed, that is, when non-unique Instance-IDs are used across multiple domains, duplicate entries for the same node, link, or prefix objects may be present in the SR-TE DB. This may also result in an inaccurate view of the network-wide topology.

Information Feeds

A headend node typically has an SR-TE DB containing only information of the locally attached IGP area, while the SR-TE DB of a Path Computation Element (PCE) usually includes more information, such as information about remote domains, BGP peering links, and remote SR Policies.

As illustrated in Figure 12‑1, different mechanisms exist to feed information into the SR-TE DB: IGP,
BGP, and Path Computation Element Protocol (PCEP). The BGP-LS feed configuration does not
specify an Instance-ID since BGP-LS NLRIs carry the Instance-ID that is provided by the BGP-LS
Producer.
Figure 12-1: Populating the SR-TE DB

Different combinations of information feeding mechanisms are used, depending on the role and
position of the node hosting the SR-TE DB. These mechanisms are detailed in the following sections.
12.2 Headend
In order for the SR-TE process of a node to learn the topology of a connected network area, it is
sufficient that the node participates in the IGP of that area and that the information feed from the IGP
to the SR-TE process is enabled. Since a headend node typically participates in the IGP, this is a
common method to populate the SR-TE DB of such a node. This local area information allows the
headend node to compute and maintain intra-area SR Policy paths, as described in chapter 4,
"Dynamic Candidate Path".

On a node in a single-IGP-domain network, the information feed of the IGP to the SR-TE process is
enabled by configuring distribute link-state under the IGP instance, as illustrated in
Example 12‑2. Since no Instance-ID is specified in this command, it uses the default Instance-ID 0 for
the only domain in the network.

Example 12-2: Single IGP domain – distribute link-state configuration

router isis SR !! or router ospf SR


distribute link-state

In a network with multiple IGP domains, it is required to specify the instance-id in the
distribute link-state configuration command, as was illustrated before in Figure 12‑1.

The distribute link-state configuration command has an optional throttle parameter that
specifies the time to wait before distributing the link-state update. The default throttle interval is 50
ms for ISIS and 5 ms for OSPF.
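As an illustration, a possible configuration is sketched below, assuming the throttle value is simply appended to the distribute link-state command; <interval> is a placeholder for the desired wait time:

router isis SR
 distribute link-state instance-id 100 throttle <interval>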

The two-node topology in Figure 12‑2 is used to illustrate the information present in the SR-TE DB.
This is an ISIS level-2 network, but the equivalent information is available for OSPF networks.
Figure 12-2: Two-node ISIS topology

Example 12‑3 shows the configuration of Node1 in this topology. First a brief description of the
configuration elements, from top to bottom.

The interface Loopback0 prefix is 1.1.1.1/32 (line 2).

ISIS distributes the LS-DB to SR-TE with Instance-ID 100 (line 7). The Instance-ID is specified for
illustration purposes. Since this is a single-domain network, the default Instance-ID 0 could have
been used.

The ISIS router-id is configured as 1.1.1.1, interface Loopback0’s address (line 10); this is the TE router-id. For OSPF, this TE router-id is configured as mpls traffic-eng router-id Loopback0. Segment routing is enabled for ISIS and Prefix-SID 16001 is associated with prefix 1.1.1.1/32 (line 16).

The link to Node2 is a point-to-point link and TI-LFA is enabled on this interface (lines 18-22).

The TE metric for interface Gi0/0/0/0 is configured as 15 under SR-TE (line 27). Bit 0 is set in the
affinity bitmap of this interface, using the user-defined name COLOR0 (lines 28-32). With this
configuration, Node1 advertises the affinity bit-map 0x00000001 for this link.

Performance-measurement is enabled to dynamically measure and advertise the link-delay metrics. In this example, the link-delay for interface Gi0/0/0/0 is configured as a static value of 12 µs (line 38). See chapter 15, "Performance Monitoring – Link Delay" for more details.

In the SRLG section an SRLG 1111 is assigned to interface Gi0/0/0/0 using the user-defined name
SRLG1 (lines 40-44).

The following elements are not shown in the configuration since they are the default: SR Global
Block (SRGB) is [16000-23999] and the SR Local Block (SRLB) is [15000-15999].
Example 12-3: Node1’s configuration – ISIS example

1 interface Loopback0
2 ipv4 address 1.1.1.1 255.255.255.255
3!
4 router isis SR
5 is-type level-2-only
6 net 49.0001.0000.0000.0001.00
7 distribute link-state instance-id 100
8 address-family ipv4 unicast
9 metric-style wide
10 router-id Loopback0
11 segment-routing mpls
12 !
13 interface Loopback0
14 passive
15 address-family ipv4 unicast
16 prefix-sid absolute 16001
17 !
18 interface GigabitEthernet0/0/0/0
19 point-to-point
20 address-family ipv4 unicast
21 fast-reroute per-prefix
22 fast-reroute per-prefix ti-lfa
23 !
24 segment-routing
25 traffic-eng
26 interface GigabitEthernet0/0/0/0
27 metric 15
28 affinity
29 name COLOR0
30 !
31 affinity-map
32 name COLOR0 bit-position 0
33 !
34 performance-measurement
35 interface GigabitEthernet0/0/0/0
36 delay-measurement
37 !! statically specified delay value
38 advertise-delay 12
39 !
40 srlg
41 interface GigabitEthernet0/0/0/0
42 name SRLG1
43 !
44 name SRLG1 value 1111

This example is a single-domain topology. All the information that is inserted in the SR-TE DB is
learned from the IGP. To allow comparing the SR-TE DB entry for Node1 to the ISIS link-state
advertisement of Node1, Node1’s LS-DB entry is shown in Example 12‑4.
Example 12-4: Node1’s ISIS LS-DB entry

RP/0/0/CPU0:xrvr-1#show isis database verbose xrvr-1

IS-IS SR (Level-2) Link State Database


LSPID LSP Seq Num LSP Checksum LSP Holdtime/Rcvd ATT/P/OL
xrvr-1.00-00 * 0x00000067 0xd912 1077 /* 0/0/0
Area Address: 49.0001
NLPID: 0xcc
IP Address: 1.1.1.1
Router ID: 1.1.1.1
Hostname: xrvr-1
Router Cap: 1.1.1.1, D:0, S:0
Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
SR Local Block: Base: 15000 Range: 1000
SR Algorithm:
Algorithm: 0
Algorithm: 1
Node Maximum SID Depth:
Label Imposition: 10
Metric: 0 IP-Extended 1.1.1.1/32
Prefix-SID Index: 1, Algorithm:0, R:0 N:1 P:0 E:0 V:0 L:0
Prefix Attribute Flags: X:0 R:0 N:1
Source Router ID: 1.1.1.1
Metric: 10 IS-Extended xrvr-2.00
Affinity: 0x00000001
Interface IP Address: 99.1.2.1
Neighbor IP Address: 99.1.2.2
Admin. Weight: 15
Link Average Delay: 12 us
Link Min/Max Delay: 12/12 us
Link Delay Variation: 0 us
Link Maximum SID Depth:
Label Imposition: 10
ADJ-SID: F:0 B:1 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24012
ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24112
Metric: 10 IP-Extended 99.1.2.0/24
Prefix Attribute Flags: X:0 R:0 N:0
MPLS SRLG: xrvr-2.00
Interface IP Address: 99.1.2.1
Neighbor IP Address: 99.1.2.2
Flags: 0x1
SRLGs:
[0]: 1111

Total Level-2 LSP count: 1 Local Level-2 LSP count: 1

The SR-TE process on Node1 receives the topology information from the IGP (distribute link-
state is configured under router isis SR).

Example 12‑5 shows the entry for Node1 in Node1’s own SR-TE DB; it can be compared with the information in the ISIS LS-DB entry shown in Example 12‑4.

The output starts with the node information, followed by the link information.
Node1’s hostname is xrvr-1 (line 7) and its TE router-id is 1.1.1.1 (line 6). Node1 is a level-2 ISIS
node with system-id 0000.0000.0001 (line 8). The Autonomous System number (ASN) is 0 since this
information is not received via BGP.

Node1 advertises a single Prefix-SID 16001, which is an algorithm 0 (regular) Prefix-SID associated with prefix 1.1.1.1/32 (line 11). It is a Node-SID since it has the N-flag set (flags: N).

The prefix 1.1.1.1/32 with its Prefix-SID is advertised in the domain with ID 100 (domain ID: 100)
(line 10). This domain ID is the instance-id as is configured in the distribute link-state
instance-id 100 command to distribute the IGP LS-DB to SR-TE. This information is not in the
ISIS LS-DB entry of Example 12‑4.

Node1 has an SRGB [16000-23999] (line 14) and an SRLB [15000-15999] (line 17).

There is one link to Node2. Node2 has a TE router-id 1.1.1.2 (line 23). The link is identified by its
local and remote IP addresses, 99.1.2.1 and 99.1.2.2 respectively (line 19).

The link metrics are IGP metric 10, TE metric 15, and link-delay metric 12 (line 26). Note that the
link-delay metric is the minimum-delay metric, as discussed in chapter 15, "Performance Monitoring
– Link Delay".

Two Adjacency-SIDs are advertised for this link, a protected Adj-SID 24012 and an unprotected
Adj-SID 24112 (line 27). The purpose of these different Adj-SIDs is explained in Part I of the SR
book series.

The other link attributes are the affinity bitmap (Admin-groups) with bit 0 set (0x00000001, line 28) and the SRLG 1111 (line 29).
Example 12-5: Entry for Node1 in SR-TE DB – ISIS example

1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng ipv4 topology traffic-eng 1.1.1.1


2
3 SR-TE topology database
4 ---------------------------------
5 Node 1
6 TE router ID: 1.1.1.1
7 Host name: xrvr-1
8 ISIS system ID: 0000.0000.0001 level-2 ASN: 0
9 Prefix SID:
10 ISIS system ID: 0000.0000.0001 level-2 ASN: 0 domain ID: 100
11 Prefix 1.1.1.1, label 16001 (regular), flags: N
12 SRGB INFO:
13 ISIS system ID: 0000.0000.0001 level-2 ASN: 0
14 SRGB Start: 16000 Size: 8000
15 SRLB INFO:
16 ISIS system ID: 0000.0000.0001 level-2 ASN: 0
17 SRLB Start: 15000 Size: 1000
18
19 Link[0]: local address 99.1.2.1, remote address 99.1.2.2
20 Local node:
21 ISIS system ID: 0000.0000.0001 level-2 ASN: 0
22 Remote node:
23 TE router ID: 1.1.1.2
24 Host name: xrvr-2
25 ISIS system ID: 0000.0000.0002 level-2 ASN: 0
26 Metric: IGP 10, TE 15, Delay 12
27 Adj SID: 24012 (protected) 24112 (unprotected)
28 Admin-groups: 0x00000001
29 SRLG Values: 1111
12.3 SR PCE
Please reference chapter 13, "SR PCE" for further details of the SR PCE.

The SR PCE can participate in the IGP to learn the topology of the connected IGP area, equivalent to
the headend as described in the previous section.

However, using IGP information feeds to populate the SR-TE DB with a complete view of a multi-
domain network would require establishing an IGP adjacency with each IGP area of the network. This
is neither practical nor scalable.

12.3.1 BGP-LS
Instead, BGP link-state (BGP-LS) is the mechanism of choice to learn the topology of multiple IGP
areas and domains. BGP-LS uses BGP to distribute the network information in a scalable manner. On
top of that, BGP-LS also carries other information that is not distributed by the IGP (e.g., BGP
peering links).

Each entry in the SR-TE DB has an associated Instance-ID identifying the domain where this entry
belongs. This Instance-ID is provided by the BGP-LS Producer of that entry and is carried in the
object’s BGP-LS NLRI. This allows the BGP-LS Consumer to learn the domain where the received
object belongs.

The name BGP-LS refers to the BGP link-state address-family, specified in “Link-State Info
Distribution Using BGP” [RFC7752]. As you can tell from its name, BGP-LS was initially introduced
to distribute the IGP LS-DB and TE-DB information in BGP. It has been extended and became the
preferred mechanism to carry other types of information, such as Egress Peering Engineering (EPE)
and TE Policy information. Chapter 17, "BGP-LS" discusses the BGP-LS protocol aspects in more
detail.

Since BGP-LS is just another BGP address-family, all existing BGP protocol mechanisms can be used
to transport and distribute the BGP-LS information in a scalable fashion. A typical BGP-LS
deployment uses BGP Route Reflectors to scale the distribution, but this is not a requirement.

The SR-TE DB of a node can be populated by combining the IGP and BGP-LS information feeds. The
node learns the topology of its local IGP area via the IGP, while the remote area and domain
topologies are obtained via BGP-LS.

Both of these topology learning mechanisms provide real-time topology feeds. Whenever the topology changes, the change is propagated immediately via the different feeding mechanisms. While this is obvious for the IGP mechanism, it is worth noting that this is also the case for BGP-LS. However, a direct IGP feed is likely to deliver the information quicker than a BGP-LS update that may be propagated across one or more BGP RRs.

The configuration on a BGP-LS Producer – a node that feeds its IGP’s LS-DB into BGP-LS – consists
of a BGP session enabled for address-family link-state link-state and the distribute
link-state command under the IGP specifying an Instance-ID. The configuration is illustrated in
Example 12‑6.

The BGP address-family is identified by an Address Family Identifier (AFI) and a Subsequent
Address Family Identifier (SAFI). For the BGP-LS address family, both AFI and SAFI are named
link-state, therefore the double keyword in the configuration.

Example 12-6: Configuration on BGP-LS Producer node

router isis SR
distribute link-state instance-id 100
!
router ospf SR
distribute link-state instance-id 101
!
router bgp 1
bgp router-id 1.1.1.1
address-family link-state link-state
!
neighbor 1.1.1.10
remote-as 1
address-family link-state link-state

The configuration shows one BGP neighbor 1.1.1.10 for BGP-LS. Distribute link-state is
configured under both IGP instances of this node to distribute their LS-DB information to the local
SR-TE process (if present) and to the BGP process. The respective Instance-IDs that the operator has
assigned to these IGP domains are specified in the distribute commands.

The SR PCE has a similar configuration as in Example 12‑6. The IGP configuration is needed when
acquiring the local IGP area’s topology directly from the IGP. SR PCE inserts the information
received via IGP and BGP-LS in its SR-TE DB.
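The content of the resulting SR-TE DB can be inspected on the SR PCE itself, for example with a command along the lines of the one below (output omitted); variants of this command are shown further in this chapter:

RP/0/0/CPU0:sr-pce#show pce ipv4 topology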
In addition to the network topology, SR Policy information can also be exchanged via BGP-LS, as
specified in draft-ietf-idr-te-lsp-distribution. Each SR Policy headend may advertise its local SR
Policies in BGP-LS, thus allowing a PCE, controller, or any other node to obtain this information via
its BGP-LS feed. At the time of writing, only the PCEP functionality to collect SR Policy information
was available.

BGP-Only Fabric

Some networks use BGP as the only routing protocol, like the Massive-Scale Data Centers (MSDCs) described in "BGP Routing in Data Centers" [RFC7938]. The implementation of SR in such networks is described in draft-ietf-spring-segment-routing-msdc.

To populate the SR-TE DB on an SR PCE or headend for such networks, each node advertises its local information in BGP-LS as specified in draft-ketant-idr-bgp-ls-bgp-only-fabric. This is needed since BGP does not provide a detailed consolidated topology view of the network similar to the one provided by the link-state IGPs. The information is inserted in the SR-TE DB, the same as the other BGP-LS topology information, and is used by the SR-TE process.

The BGP nodes in a BGP-only SR network advertise the following information in BGP-LS.

The attributes of the BGP node and the BGP Prefix-SIDs2 to reach that node

The attributes of all the links between the BGP nodes and their associated BGP PeerAdj-SIDs (and
other Peering-SIDs, see chapter 14, "SR BGP Egress Peer Engineering")

The SR Policies instantiated on each of the BGP nodes with their properties and attributes

The functionality in the above bullets is not available in IOS XR at the time of writing.

12.3.2 PCEP
The SR-TE DB also contains information about the existing SR Policies in the network. In particular, the local SR Policies on a headend and the delegated ones on a PCE are maintained in the SR-TE DB. The delegated SR Policies are under control of the SR PCE, as described in chapter 13, "SR PCE". In addition, the PCE can learn about the other SR Policies instantiated in the network via BGP-LS, as mentioned in the previous section, or via PCEP.
The latter is achieved by having each Path Computation Client (PCC) send to all its connected
(stateful) PCEs a PCEP state report for an SR Policy whenever it is added, updated, or deleted. A
PCE receiving such a state report updates its SR-TE DB with the new SR Policy information.

The PCE can receive this information directly from the PCC or via a state-sync connection with
another PCE (see draft-litkowski-pce-state-sync).
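On an IOS XR based SR PCE, such an inter-PCE state-sync session is configured under the pce section. The following is only a hedged sketch, assuming a state-sync keyword that takes the address of the peer SR PCE (both addresses are hypothetical):

pce
 address ipv4 1.1.1.10
 state-sync ipv4 1.1.1.11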

More PCEP information is provided in chapter 13, "SR PCE" and chapter 18, "PCEP".

Figure 12‑3 illustrates a network with an SR PCE Node10. For illustration purposes the headend
Node1 in this network uses a PCE to compute a dynamic path, even though it could compute the path
itself in this particular example.
Figure 12-3: ISIS network with SR PCE

An SR Policy GREEN to Node4 with a delay-optimized path is configured on Node1, as shown in Example 12‑7. PCE 1.1.1.10 (Node10) is specified in the pcc section of the configuration.


Example 12-7: Node1’s SR-TE configuration

segment-routing
traffic-eng
policy GREEN
binding-sid mpls 15000
color 30 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
pcep
!
metric
type delay
!
pcc
pce address ipv4 1.1.1.10

The configuration to enable the SR PCE functionality on Node10 is shown in Example 12‑8.

Example 12-8: SR PCE Node10’s PCE configuration

pce
address ipv4 1.1.1.10

SR PCE Node10 computes the path of Node1’s SR Policy GREEN and Node1 reports the state of the
SR Policy path to this SR PCE.

SR PCE Node10’s SR Policy database contains one entry, Node1’s SR Policy (PCC 1.1.1.1), shown
in Example 12‑9.

Node1’s SR Policy has a reported name cfg_GREEN_discr_100, where the prefix “cfg_” is added to
the configured name GREEN to indicate that it is a configured SR Policy and avoid name collisions
with controller-initiated SR Policies. The suffix “_discr_100” uses the preference of the candidate
path (100 in this example) to differentiate between multiple configured candidate paths.

The endpoint of this SR Policy is Node4 (destination 1.1.1.4) and its BSID is 15000 (Binding
SID: 15000).

The flags in the PCEP information indicate the path is delegated to this SR PCE (D:1), the path is
Administratively active (A:1) and operationally active (O:2). The meanings of these flags are
specified in “PCEP Extensions for Stateful PCE” [RFC8231].
A reported and a computed path are shown for this SR Policy since Node10 itself computed this path
(Computed path: (Local PCE)) and Node1 (PCC: 1.1.1.1) reported this path (Reported path).

Example 12-9: SR PCE Node10’s SR Policy database – Node1’s SR Policy

RP/0/0/CPU0:xrvr-10#show pce lsp pcc ipv4 1.1.1.1 detail

PCE's tunnel database:


----------------------
PCC 1.1.1.1:

Tunnel Name: cfg_GREEN_discr_100


LSPs:
LSP[0]:
source 1.1.1.1, destination 1.1.1.4, tunnel ID 22, LSP ID 2
State: Admin up, Operation up
Setup type: Segment Routing
Binding SID: 15000
Maximum SID Depth: 10
Bandwidth: signaled 0 kbps, applied 0 kbps
PCEP information:
PLSP-ID 0x80016, flags: D:1 S:0 R:0 A:1 O:2 C:0
LSP Role: Single LSP
State-sync PCE: None
PCC: 1.1.1.1
LSP is subdelegated to: None
Reported path:
Metric type: delay, Accumulated Metric 30
SID[0]: Node, Label 16003, Address 1.1.1.3
SID[1]: Adj, Label 24004, Address: local 99.3.4.3 remote 99.3.4.4
Computed path: (Local PCE)
Computed Time: Thu Oct 18 11:34:26 UTC 2018 (00:08:57 ago)
Metric type: delay, Accumulated Metric 30
SID[0]: Node, Label 16003, Address 1.1.1.3
SID[1]: Adj, Label 24004, Address: local 99.3.4.3 remote 99.3.4.4
Recorded path:
None
Disjoint Group Information:
None
12.4 Consolidating a Multi-Domain Topology
The SR-TE DB is intrinsically multi-domain capable; it can contain and consolidate the topology of
an entire multi-domain network. The multi-domain information in the SR-TE DB makes it possible to
compute optimal end-to-end inter-domain paths.

A multi-domain network is a network topology that consists of multiple sub-topologies. An IGP area
is an example of a sub-topology. A multi-area IGP network is therefore considered as a multi-domain
network. A network consisting of multiple interconnected Autonomous Systems is another example of
a multi-domain network.

The domains in the network can be isolated from each other or prefixes can be redistributed or
propagated between the domains.

The Segment Routing architecture can provide seamless unified end-to-end forwarding paths in multi-
domain networks. SR turns the network into a unified stateless fabric.

In order to compute globally optimal inter-domain paths, knowledge of the end-to-end topology is
needed. The term “globally optimal” indicates that the computed path is the best possible over the
whole multi-domain network, as opposed to per-domain locally optimal paths that may not produce a
globally optimal path when concatenated. This can be achieved by consolidating the different domain
topologies in the multi-domain SR-TE DB, and computing paths on this consolidated topology.

Techniques that use per-domain path computations (e.g., “Path Comp. for Inter-Domain TE LSPs” [RFC5152]) do not provide this optimality; these techniques may lead to sub-optimal paths, may make diverse or backup path computation hard, or may simply fail to find a path when one really does exist.
The importance of simple multi-domain TE
“From an end-to-end path point of view, the ability to unify multiple ASs into one logical network is critical, however, this
ability was not there in the IP world before SR was introduced. With BGP EPE & BGP-LS, a controller can learn and
stitch multiple AS networks together into one automatically, which lays out the foundation of massive SDN deployment.
This ability is vital for 5G transport, which may consist of hundreds of ASs and thousands of devices in case of Tier-1
operators. In this case flexible end-to-end SR-TE (e.g., from cell site to packet core to content) is a must for 5G slicing. It
is also vital for network and cloud synergy, especially where Cloud Native is the new norm and the requirement to use TE
from network devices to containers becomes common, hereby most likely traverse multiple ASs. ”

— YuanChao Su

Assumptions and Requirements

Since the topology information of all domains is collected and inserted in the same SR-TE DB, a trust
relationship between the domains is assumed (e.g., all domains belong to the same administrative
entity). This is a common assumption for many multi-domain functionalities.

While the meaning of physical quantities, such as delay or bandwidth, could be assumed to be the
same across different domains, this should be verified. For example, link-delay measurements should
provide compatible and comparable metrics, not only across domains but also across devices within
a domain, like between devices of different vendors.

Other link attributes (such as IGP link metric, TE link metric, affinity color, SRLG, etc.) may have
different meanings in different domains, possibly because they have been implemented by different
organizations or authorities.

When computing inter-domain paths, the PCE treats all the metrics and other link attributes as “global” (i.e., comparable between domains). This may not always be correct. For example, if the domains were originally managed by different administrators, then affinity link colors, metrics, etc. may have been assigned differently. Domains that use different routing protocols also use different metrics by default, notably ISIS versus OSPF. ISIS uses a default link metric of 10, while OSPF derives the default link metric from the reference bandwidth (100 Mbps by default), resulting in metric 1 for interfaces with a speed of 100 Mbps or higher.

12.4.1 Domain Boundary on a Node


The topology information (e.g., nodes, links) that a PCE receives from a domain enables it to
construct the network graph of this domain.

For a multi-domain network, the PCE receives multiple such individual domain topologies and inserts them in the multi-domain SR-TE DB.

Figure 12‑4 illustrates a three-domain network.

Figure 12-4: Example multi-domain topology

In order to compute end-to-end inter-domain paths, the different domains must be consolidated into a
single network topology graph, by interconnecting the topology graphs of the individual domains.

Two domain topologies can be interconnected on the nodes that are common to these two domains, the
border nodes.

A node that participates in multiple domains is present in the topologies of all these domains. In order to identify this node as a single entity across the domains, and thereby let it consolidate the domain topologies, a common network-wide node identifier is required. In a multi-area IGP instance, the ISIS system-id or OSPF router-id is used as the common identifier. In the case of multiple IGP instances, the TE router-id is used instead.

The OSPF router-id and ISIS system-id are the natural node identifiers in a single multi-area network.
Each area is inserted as a separate topology in the SR-TE DB. The Area Border Router (“L1L2
node” for ISIS) advertises its unique router-id (“system-id” for ISIS) in each of its areas. SR-TE
finds multiple nodes with the same identifier in the SR-TE DB and identifies these node instances as
the same node, thereby interconnecting the areas’ topologies at this node.

In multi-domain networks running several IGP instances, the TE router-id is used as a network-wide
unique identifier of a node. If two nodes are present in the SR-TE DB with the same TE router-id,
they are really the same node. Figure 12‑4 illustrates how the border nodes between the domains
interconnect the domain topologies. Node1 in Figure 12‑4 runs two IGP instances: an IGP instance for
Domain1 and another IGP instance for Domain2. A PCE learns all topologies of this network. The
topologies of Domain1 and Domain2 in this PCE’s SR-TE DB both contain Node1 since it
participates in both domains. SR-TE can identify Node1 in both domains since Node1 advertises the
same TE router-id (1.1.1.1) in both domains.

While the OSPF router-id and ISIS system-id are the natural node identifiers in a single multi-area
network, using the same IGP identifier (OSPF router-id or ISIS system-id) on multiple IGP instances
is discouraged. It could inadvertently lead to problems due to duplicate protocol identifiers in the
network.

Prevent duplicate OSPF router-ids or ISIS system-ids


It is recommended to not use identical ISIS system-ids on multiple ISIS instances of a node and to not use identical OSPF router-ids
on multiple OSPF instances of a node.

Using the same protocol identifier for two different instances of the same IGP, ISIS or OSPF, could be quite dangerous. The node would advertise the same identifier in two separate domains. If, at some moment, some “wires are crossed” by accident anywhere in the network, thereby directly connecting the domains to each other, then duplicate router-ids/system-ids will appear in the network, with all the consequences that entails: misrouting and flooding storms as the LSPs/LSAs are re-originated by the two nodes that have the same identifier. The risk of this happening, however small, should be avoided.

12.4.2 Domain Boundary on a Link


A multi-AS network is a multi-domain network that consists of multiple Autonomous Systems (ASs)
interconnected via BGP peering links.

Assuming that Egress Peer Engineering (EPE) is enabled on the BGP peering sessions, BGP
advertises these BGP peering sessions in BGP-LS as links, anchored to a local and a remote node.
These anchor nodes are identified by their BGP router-id. More information about EPE is available
in chapter 14, "SR BGP Egress Peer Engineering".

For the SR PCE to identify the EPE anchor node and the IGP ASBR as the same node, the BGP router-id of the EPE link anchor node and the TE router-id of the IGP ASBR must be equal.

Figure 12‑5 illustrates a multi-AS network consisting of two ASs, AS1 and AS2, interconnected by two BGP peering links. EPE has been enabled on the BGP peering sessions and these are inserted in the SR-TE DB as “links” between the ASBRs. The IGP TE router-id on each ASBR has been configured to be the same as its BGP router-id. For example, the TE router-id and the BGP router-id on Node1 are both configured as 1.1.1.1, as sketched below.
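A minimal sketch of this router-id alignment on Node1 follows, reusing only configuration commands that also appear in Example 12‑10 further below:

interface Loopback0
 ipv4 address 1.1.1.1 255.255.255.255
!
router isis SR
 address-family ipv4 unicast
  !! TE router-id, taken from Loopback0 (1.1.1.1)
  router-id Loopback0
!
router bgp 1
 !! BGP router-id matches the TE router-id
 bgp router-id 1.1.1.1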

Figure 12-5: Inter-AS BGP peering links

With this information in the SR-TE DB, SR-TE can interconnect the two ASs’ topologies via the EPE
peering sessions. The anchor nodes of the bottom EPE “link”, Node1 and Node3, are identified by
their BGP router-ids, 1.1.1.1 and 1.1.1.3 respectively. TE router-id 1.1.1.1 identifies Node1 in the
AS1 topology. Because of the common identifier, SR-TE knows this is the same node. The same
occurs for Node3. As a result, the two domains are consolidated in a single network topology.

End-to-end inter-AS paths can be computed on this topology and the Peering-SID label is used to
traverse the peering link.
Another possibility to provide inter-domain connectivity is configuring an SR Policy that steers
traffic over the inter-domain link. This SR Policy can be reported in PCEP and advertised in BGP-
LS. SR-TE can then provide inter-domain connectivity by including the BSID of the SR Policy in its
solution SID list. At the time of writing, this solution is not available in IOS XR.

Figure 12‑6 illustrates a two-node topology with a BGP peering link between the two nodes. ISIS is
enabled on each node but since ISIS is not enabled on the link they do not form an ISIS adjacency.
Enabling BGP and ISIS on the nodes illustrates how SR-TE consolidates the node entries of the ISIS
topology and BGP EPE topology.

Node1 is in Autonomous System (AS) number 1, Node2 is in AS2. A single-hop external BGP
(eBGP) session is established between the nodes and Egress Peer Engineering (EPE) is enabled on
this eBGP session.

Figure 12-6: Two-node BGP topology


The configuration of Node1 is shown in Example 12‑10.

The ISIS configuration is the same as in the previous section, except that the interface to Node2 is not
configured under ISIS.

A single-hop eBGP session is configured to Node2’s interface address 99.1.2.2 (line 26). EPE
(egress-engineering) is enabled on this session (line 28). Address-family IPv4 unicast is enabled
on this session, but other address-families can be configured with an equivalent configuration.

Since the session is an external BGP session, ingress and egress route-policies must be configured.
Here we have applied the route-policy PASS which allows all routes (lines 1-3 and 31-32).

Since we want to use this link to carry labeled traffic between the ASs, MPLS must be enabled on the
interface. BGP automatically enables MPLS forwarding on the interface when enabling a labeled
address-family (such as ipv4 labeled-unicast) under BGP. In this example, MPLS is explicitly
enabled on Gi0/0/0/0 using the mpls static interface configuration (lines 34-35).
Example 12-10: Node1’s configuration – EPE example

1 route-policy PASS
2 pass
3 end-policy
4!
5 interface Loopback0
6 ipv4 address 1.1.1.1 255.255.255.255
7!
8 router isis SR
9 is-type level-2-only
10 net 49.0001.0000.0000.0001.00
11 distribute link-state instance-id 100
12 address-family ipv4 unicast
13 metric-style wide
14 router-id Loopback0
15 segment-routing mpls
16 !
17 interface Loopback0
18 passive
19 address-family ipv4 unicast
20 prefix-sid absolute 16001
21 !
22 router bgp 1
23 bgp router-id 1.1.1.1
24 address-family ipv4 unicast
25 !
26 neighbor 99.1.2.2
27 remote-as 2
28 egress-engineering
29 description # to Node2 #
30 address-family ipv4 unicast
31 route-policy PASS in
32 route-policy PASS out
33 !
34 mpls static
35 interface GigabitEthernet0/0/0/0

Example 12‑11 shows the entry for Node1 in the SR-TE DB of an SR PCE that received the BGP-LS
information from Node1 and Node2.

Notice that SR-TE has consolidated Node1 of the ISIS topology with Node1 of the BGP topology
since the ISIS TE router-id and the BGP router-id are the same 1.1.1.1, as described in section
12.4.2. The entry for Node1 in the SR-TE DB shows both ISIS and BGP properties.

The ISIS node elements in the output are the same as in the ISIS example above.

The BGP ASN 1 and router-id 1.1.1.1 are shown with the node properties (line 8).

The link is connected to Node2, which has BGP router-id 1.1.1.2 in AS 2 (line 24). The link is identified by its local and remote IP addresses, 99.1.2.1 and 99.1.2.2 respectively (line 20). The link metrics (IGP, TE, and link-delay) are all 0 (line 25) and the affinity bitmap (Admin-groups) is also 0 (line 26). At the time of writing, the link metrics and attributes for an EPE link were not yet advertised in BGP-LS and therefore show the value 0.

The peerNode-SID label 50012 (represented as Adj-SID of the EPE link) is shown, marked as (epe)
in the output (line 27).

Example 12-11: Node1 as PCE – Entry for Node1 in SR-TE DB – EPE example

1 RP/0/0/CPU0:sr-pce#show pce ipv4 topology bgp 1.1.1.1


2
3 PCE's topology database - detail:
4 ---------------------------------
5 Node 1
6 TE router ID: 1.1.1.1
7 Host name: xrvr-1
8 BGP router ID: 1.1.1.1 ASN: 1
9 ISIS system ID: 0000.0000.0001 level-2 ASN: 1
10 Prefix SID:
11 ISIS system ID: 0000.0000.0001 level-2 ASN: 1 domain ID: 100
12 Prefix 1.1.1.1, label 16001 (regular), flags: N
13 SRGB INFO:
14 ISIS system ID: 0000.0000.0001 level-2 ASN: 1
15 SRGB Start: 16000 Size: 8000
16 SRLB INFO:
17 ISIS system ID: 0000.0000.0001 level-2 ASN: 1
18 SRLB Start: 15000 Size: 1000
19
20 Link[0]: local address 99.1.2.1, remote address 99.1.2.2
21 Local node:
22 BGP router ID: 1.1.1.1 ASN: 1
23 Remote node:
24 BGP router ID: 1.1.1.2 ASN: 2
25 Metric: IGP 0, TE 0, Delay 0
26 Admin-groups: 0x00000000
27 Adj SID: 50012 (epe)
12.5 Summary
The SR-TE DB is an essential component of the SR-TE functionality that contains all the network
information (nodes, links, prefixes, SR Policies) required to compute and validate paths.

The information in the SR-TE DB is protocol-independent. It combines information retrieved from different sources (IGP, BGP-LS, PCEP).

The SR-TE DB on a headend node contains the local area information acquired from the IGP process.
It uses this information to compute intra-area paths and to translate explicit paths expressed with
Segment Descriptors into a list of MPLS labels.

The SR-TE DB on an SR PCE contains the entire multi-domain network topology, acquired via BGP-
LS, possibly in combination with IGP. Information of BGP-only networks is collected via BGP-LS.
The SR PCE consolidates the network topology information from different domains into a single
network graph to compute optimal inter-domain paths. The PCE’s SR-TE DB also contains the active
SR Policies in the network, acquired via PCEP or BGP-LS. This information allows the SR PCE to
compute disjoint paths.
12.6 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[draft-ietf-idr-bgpls-segment-routing-epe] "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Keyur Patel, Saikat Ray, Jie Dong, draft-ietf-idr-bgpls-segment-routing-epe-18 (Work in Progress), March 2019

[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752, March 2016

[draft-ietf-idr-te-lsp-distribution] "Distribution of Traffic Engineering (TE) Policies and State using BGP-LS", Stefano Previdi, Ketan Talaulikar, Jie Dong, Mach Chen, Hannes Gredler, Jeff Tantsura, draft-ietf-idr-te-lsp-distribution-10 (Work in Progress), February 2019

[draft-litkowski-pce-state-sync] "Inter Stateful Path Computation Element (PCE) Communication Procedures", Stephane Litkowski, Siva Sivabalan, Cheng Li, Haomian Zheng, draft-litkowski-pce-state-sync-05 (Work in Progress), March 2019

[draft-ketant-idr-bgp-ls-bgp-only-fabric] "BGP Link-State Extensions for BGP-only Fabric", Ketan Talaulikar, Clarence Filsfils, Krishnaswamy Ananthamurthy, Shawn Zandi, Gaurav Dawra, Muhammad Durrani, draft-ketant-idr-bgp-ls-bgp-only-fabric-02 (Work in Progress), March 2019

1. BGP-LS Producers are nodes that feed information, such as their IGP LS-DB, into BGP-LS. See
chapter 17, "BGP-LS" for the different BGP-LS roles.↩

2. BGP Prefix-SIDs are described in Part I of the SR book series.↩


13 SR PCE
What we will learn in this chapter:

The SR-TE process can realize different roles, as the brain of a headend and as part of an SR Path
Computation Element (SR PCE) server.

SR PCE is a network function component integrated in any IOS XR base software image1. This
functionality can be enabled on any IOS XR node, physical or virtual.

SR PCE server is stateful, multi-domain capable (it computes and maintains inter-domain paths),
and SR-optimized (using SR-native algorithms).

SR PCE is a network entity that provides computation services: it can compute and maintain paths on behalf of a headend node. As such, it extends the SR-TE capability of a headend by computing paths in cases that the headend node cannot compute itself, such as inter-domain paths or disjoint paths.

SR PCE provides a north-bound interface to the network for external applications.

PCEP high-availability mechanisms provide resiliency against SR PCE failures without impacting
the SR Policies and traffic forwarding. Inter-PCE PCEP sessions can improve this resiliency.

In this chapter, we assume that the SR PCE communicates with its clients (SR-TE headends) via
PCEP. BGP SR-TE is an alternate possibility introduced in the last section of this chapter and
described in detail in chapter 19, "BGP SR-TE".

First, we explain that the SR-TE process is the brain of an SR PCE. Then we see that the SR PCE
computes and statefully maintains paths on behalf of headend nodes, particularly for the use-cases that
the headend cannot handle itself. We describe the PCEP protocol exchange between the SR PCE and
its headend client. Next, we discuss the role of SR PCE as an interface to the network. We conclude
by describing the SR PCE high availability mechanisms and showing a brief example of signaling an
SR Policy path via BGP.
13.1 SR-TE Process
The SR-TE process is at the core of the SR-TE implementation. As indicated in chapter 1,
"Introduction", it is a building block that can fulfill different roles:

Embedded in a headend node as the brain of the router, it provides SR-TE services to the local node.

In an SR PCE server, it provides SR-TE services to other nodes in the network.

While this chapter focuses on the latter, this SR PCE always has to interact with the SR-TE process
running on the headend nodes, where the SR Policies are instantiated and the traffic steering takes
place.

The capabilities of the SR-TE process in the SR PCE and in the headend are similar, but due to their different positions and roles in the network, they use the SR-TE process components differently. These components are illustrated in Figure 13‑1.

Figure 13-1: SR-TE Process

SR-TE DB: holds (multi-domain) topology information, SR information, SR Policies, and more.
See chapter 12, "SR-TE Database" for more details.
Compute engine: dynamic path computations using SR-native algorithms. See chapter 4, "Dynamic
Candidate Path" for more details.

Local SR Policy database: used on a headend to maintain, validate and select candidate paths
from different sources for SR Policies. Also see chapter 2, "SR Policy".

On-Demand Nexthop (ODN): used on a headend to instantiate SR Policies on demand. See chapter 6, "On-Demand Nexthop" for more details.

Automated Steering (AS): used on a headend to automatically steer traffic into SR Policies. See
chapter 5, "Automated Steering" for more details.

The SR-TE process interacts with various internal and external entities using different protocols and
interfaces, such as the ones listed below. These interfaces are illustrated in Figure 13‑1. The role of
the protocol/interface depends on the role of the SR-TE process.

IGP: receive network information distributed by the IGP, distribute TE attributes via the IGP. See
chapter 12, "SR-TE Database" for more details.

BGP-LS: receive topology and other network information and report SR Policy information. See
chapter 17, "BGP-LS" and chapter 12, "SR-TE Database" for more details.

Path Computation Element Communication Protocol (PCEP): communication between SR PCE and SR PCC. See further in this chapter, chapter 12, "SR-TE Database", and chapter 18, "PCEP" for more details.

BGP SR-TE: BGP address-family for communication between SR PCE and SR PCC. See chapter
19, "BGP SR-TE" for more details.

NETCONF: data-model based communication between SR PCE and SR PCC, and between
application and SR PCE.

REST: communication between application and SR PCE. See further in this chapter for more
details.

SR-TE Process on the Headend


The SR-TE process on a headend node is in charge of managing the local policies and the associated
traffic steering. Its interface with the IGP provides the SR-TE process the local domain topology that
it needs to perform its most common tasks, such as dynamically computing intra-domain SR Policy
paths or resolving segment descriptors in explicit paths. This is illustrated in Figure 13‑2, only
showing the IGP as interface.

Figure 13-2: SR-TE Process on headend

SR-TE Process on the SR PCE

While many SR-TE use-cases can be solved by only using the SR-TE process that is embedded on the
headend node, others require the delegation of SR-TE tasks to an external SR PCE. The headend then
becomes a client of the SR PCE server.

Figure 13‑3 illustrates the headend (PCC) and SR PCE server and their communication via PCEP.
Figure 13-3: SR-TE Processes on headend and on SR PCE

Here are some of the use-cases that require usage of SR PCE.

Inter-domain paths: For end-to-end inter-domain path computations, the computing node needs to
have knowledge about the different domains in the network. Architecturally, it is possible to feed the
topology of the whole network to the headend node such that it can locally compute inter-domain
paths. However, operators prefer to deploy dedicated SR PCEs for this functionality, as this provides
a more scalable solution. The SR PCEs get the network-wide information via their BGP-LS feed. The
headend nodes then use the services of these SR PCEs to compute and maintain these inter-domain
paths.

Disjoint paths from distinct headends: Disjoint paths in a network should preferably be computed
and maintained by a single entity. This method provides the highest probability that disjoint paths are
found and that they are optimal. Disjoint paths from a single headend can be computed and
maintained by the headend itself. For disjoint paths from distinct headends, an SR PCE is used as the
single central entity that computes and maintains both disjoint paths on behalf of their respective
headends.

North-bound Interface to the network: A third-party application requiring access to the network to
extract information, program the network devices or steer traffic flows may leverage the SR PCE
north-bound interfaces. Instead of directly accessing the individual devices in the network via various
protocols and interfaces, that application can access them through the SR PCE, which provides a
unified structured interface to the network. The SR PCE provides real-time topology and SR Policy
status information via its REST API and it can instantiate, update, and delete SR Policy candidate
paths via the same API.
13.2 Deployment
Figure 13‑4 shows the network used for the illustrations in this chapter. Node1 is the headend node. It
has a PCEP session with the SR PCE.

Figure 13-4: Network for PCEP protocol sequences illustrations

13.2.1 SR PCE Configuration


SR PCE is a built-in functionality in the IOS XR base software image. This functionality is available
on any physical or virtual IOS XR node and it can be activated with a single configuration command.
In practice, the IOS XR based SR PCE can be deployed on a physical hardware router, but also on a
virtual router such as the Cisco IOS XRv9000 router, as it involves control plane operations only.

Enabling SR PCE Mode

Example 13‑1 shows the configuration to enable SR PCE functionality on an IOS XR node. The
configured IP address, 1.1.1.10, indicates the local address that SR PCE uses for its PCEP sessions.
The PCEP session can be authenticated by specifying a password.

Example 13-1: PCE PCEP configuration

pce
address ipv4 1.1.1.10
!! password encrypted 00071A150754 (optional)

Populating the SR-TE DB

The SR PCE needs to populate its SR-TE DB with the network topology. As described in chapter 12,
"SR-TE Database", it can obtain and combine information from the IGP (limited to the local IGP
area(s)) and from BGP-LS.

In the example single-area network the SR PCE learns the topology via the IGP. The configuration on
the SR PCE is shown for both IGPs (ISIS and OSPF) in Example 13‑2.

The IGP information feed to the SR-TE process is enabled with the distribute link-state
configuration under the appropriate IGP. It is recommended to specify the instance-id with the
distribute link-state configuration. In single-area networks, the instance-id can be left at its
default value 0 (identifying the “Default Layer 3 Routing topology”).

Example 13-2: Feed IGP information to PCE SR-TE

router isis SR
distribute link-state instance-id 101
!
router ospf SR
distribute link-state instance-id 102

To learn the topologies of remote areas and domains, BGP-LS is required. Example 13‑3 shows a
basic configuration to feed BGP-LS information from BGP neighbor 1.1.1.11 into the SR-TE DB. It
can be combined with the configuration in Example 13‑2. Further details are provided in chapter 12,
"SR-TE Database" and chapter 17, "BGP-LS".
Example 13-3: Feed BGP-LS information to PCE SR-TE

router bgp 1
address-family link-state link-state
!
neighbor 1.1.1.11
remote-as 1
update-source Loopback0
address-family link-state link-state

13.2.2 Headend Configuration


Example 13‑4 shows the configuration on an IOS XR headend to enable PCC PCEP functionality. In
the example, the headend connects to an SR PCE at address 1.1.1.10. By default, the local PCEP
session address is the address of the lowest numbered loopback interface. The source address of the
PCEP session can be configured and the PCEP sessions can be authenticated by specifying the
password of the session under the PCE configuration.

Example 13-4: Headend PCEP configuration

segment-routing
traffic-eng
pcc
!! source-address ipv4 1.1.1.1 (optional)
pce address ipv4 1.1.1.10
!! password encrypted 13061E010803 (optional)

By default, a headend only sends PCEP Report messages for SR Policies that it delegates to the SR
PCE. SR Policy delegation is discussed further in this chapter. Sometimes it may be required for the
SR PCEs to also learn about the paths that are not delegated, such as headend-computed paths or
explicit paths. In that case, the configuration in Example 13‑5 lets the headend report all its local SR
Policies.

Example 13-5: Headend reports all local SR Policies

segment-routing
traffic-eng
pcc
report-all

Verify PCEP Session

The command show pce ipv4 peer is used to verify the PCEP sessions on the SR PCE, as
illustrated in Example 13‑6. In this example, one PCC with address 1.1.1.1 is connected (Peer
address: 1.1.1.1) and its PCEP capabilities, as reported in its PCEP Open message, are shown in
the output. This PCC supports the Stateful PCEP extensions (Update and Instantiation) and the
Segment-Routing PCEP extensions. These extensions are specified in “PCEP Extensions for Stateful
PCE” [RFC8231], “PCE-Initiated LSPs in Stateful PCE” [RFC8281], and draft-ietf-pce-segment-
routing respectively. By adding the detail keyword to the show command (not illustrated here),
additional information is provided, such as PCEP protocol statistics.

Example 13-6: Verify PCEP session on SR PCE

RP/0/0/CPU0:SR-PCE1#show pce ipv4 peer

PCE's peer database:


--------------------
Peer address: 1.1.1.1
State: Up
Capabilities: Stateful, Segment-Routing, Update, Instantiation

To verify the PCEP sessions on the headend, use the command show segment-routing traffic-
eng pcc ipv4 peer, as illustrated in Example 13‑7. The PCEP capabilities, as reported in the
PCEP Open message are shown in the output. This PCE has a precedence 255 on this headend, which
is the default precedence value. The precedence indicates a relative preference between multiple
configured PCEs as will be discussed further in this chapter.

Example 13-7: Verify PCEP session on headend

RP/0/0/CPU0:iosxrv-1#show segment-routing traffic-eng pcc ipv4 peer


PCC's peer database:
--------------------

Peer address: 1.1.1.10, Precedence: 255, (best PCE)


State up
Capabilities: Stateful, Update, Segment-Routing, Instantiation

By adding the detail keyword to the show command (not illustrated here), additional information is
provided, such as PCEP protocol statistics.

13.2.3 Recommendations
The SR PCE deployment model has similarities with the BGP RR deployment model. A PCE is a
control plane functionality and, as such, does not need to be in the forwarding path. It can even be
located remotely from the network it serves, for example as a virtual instance in a Data Center.

Since SR PCE is an IOS XR functionality, it can be deployed on routers in the network where it could
be managed as a regular IOS XR network node, or it can be enabled on virtual IOS XR instances on a
server.

Although the SR PCE functionality provides centralized path computation services, it is not meant to
be concentrated on a single box (a so-called “god box” or “all-seeing oracle in the sky”). Instead, this
functionality should be distributed among multiple instances, each offering path computation services to a subset of headend nodes in the network while acting as a backup for another subset.

We recommend that, for redundancy, every headend should have a PCEP session to two SR PCE servers, one primary and one backup.

Example 13‑8 shows a configuration of a headend with two SR PCE servers, 1.1.1.10 and 1.1.1.11,
configured with a different precedence. PCE 1.1.1.10 is the preferred PCE since it has the lowest
precedence value. The headend uses this preferred PCE for its path computations. PCE 1.1.1.11 is the
secondary PCE, used upon failure of the preferred SR PCE. More than two PCEs can be specified if
desired.

Example 13-8: Headend configuration – multiple SR PCEs

segment-routing
traffic-eng
pcc
pce address ipv4 1.1.1.10
precedence 100
!
pce address ipv4 1.1.1.11
precedence 200

As the network grows bigger, additional SR PCEs can be introduced to deal with the increasing load.
Each new SR PCE is configured as the primary PCE on a subset of the headend nodes, and as
secondary on another subset. Because every SR PCE has a complete view of the topology, it is able
to serve any request from its connected headends.

For example, all the PEs within a region may share the same pair of SR PCEs. Half of these PEs use
one SR PCE as primary (lowest precedence) and the other SR PCE as secondary, while the other half
is configured the other way around. This will distribute the load over both SR PCEs.
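
As an illustration, the PEs in the second group could simply swap the precedence values of Example 13‑8. A sketch, reusing the same SR PCE addresses, so that SR PCE 1.1.1.11 becomes the primary:

segment-routing
traffic-eng
pcc
pce address ipv4 1.1.1.10
precedence 200
!
pce address ipv4 1.1.1.11
precedence 100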

Once the PCE scale limit is reached, we introduce a second pair of PCEs: half of the PEs use the first pair and half use the second. More pairs of PCEs can be added as required. The SR PCE solution is horizontally scalable.
13.3 Centralized Path Computation
When enabling the SR PCE functionality on a node, this node acts as a server providing path
computation and path maintenance services to other nodes in the network.

Multi-Domain and Optimized for SR

An SR PCE is natively multi-domain capable. It maintains an SR-TE DB that can hold multiple
domains and it can compute optimal end-to-end inter-domain paths.

The SR PCE populates its SR-TE DB with multi-domain topology information obtained via BGP-LS,
possibly combined with an IGP information feed (ISIS or OSPF) for its local IGP area. The SR-TE
DB is described in more detail in chapter 12, "SR-TE Database".

Based on the information in its SR-TE DB, the SR PCE applies its SR-optimized path computation algorithms to solve path optimization problems. These algorithms compute single-area and inter-domain paths and encode the computed path in an optimized segment list.

Stateful

SR PCE is an Active Stateful type of PCE. It not only computes paths on request of a client (headend), but also maintains them. It takes control of the SR Policy paths and updates them when required.

This requirement for the SR PCE to be able to control SR Policy paths is evident. If a headend node
requests an SR PCE to compute a path, it is very likely that this headend cannot compute the path
itself. Therefore, it is likely not able to validate the path and request a new path if the current one is
invalid.

An Active Stateful PCE uses the PCEP path delegation mechanism specified in “PCEP Extensions for
Stateful PCE” [RFC8231] to update the parameters of a path that a client has delegated to it. This
PCEP delegation mechanism is described further in this chapter.

13.3.1 Headend-Initiated Path


IETF draft-ietf-pce-segment-routing adds SR support to the PCEP protocol and “PCE Path Setup
Type” [RFC8408] specifies how to signal the type of path, SR-TE or RSVP-TE. With these
extensions, the existing PCEP procedures can be used to signal SR Policy paths.
PCEP has been extended further with additional SR-TE functionalities such as draft-sivabalan-pce-
binding-label-sid that specifies how to signal the BSID. More detailed PCEP protocol information is
available in chapter 18, "PCEP".

Different PCEP packet sequences are used in the interaction between a headend (PCC) and an SR
PCE to initiate an SR Policy path and to maintain this path during its lifetime.

The SR Policy candidate path initiation can be done by configuration (CLI or NETCONF) or
automatically through the ODN functionality (see chapter 6, "On-Demand Nexthop"). The candidate
path is of dynamic type with the pcep keyword instructing the headend to request the computation
service of an SR PCE. This is the most likely use-case for an SP or Enterprise.
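
As a sketch of the ODN variant, a headend can automatically request PCE-computed paths for all service routes carrying a given color, for example color 10. The snippet below assumes the IOS XR on-demand configuration syntax covered in chapter 6, "On-Demand Nexthop"; verify the exact keywords for your software release.

segment-routing
traffic-eng
on-demand color 10
dynamic
pcep
!
metric
type te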

Two different protocol sequences are possible for these headend-initiated, PCE-computed paths.

In the first variant, the headend (PCC) starts by requesting the SR PCE to compute a path using the
stateless Request/Reply protocol exchange (as specified in “Path Computation Element (PCE)
Communication Protocol (PCEP)” [RFC5440]). The headend sends a path Request message to the SR
PCE with an optimization objective and a set of constraints. The SR PCE computes the path and
returns the solution to the headend in a Reply message. The headend installs that path and then
switches to stateful mode by sending a Report message for the path to the SR PCE with the Delegate
flag (D-flag) set. This action effectively delegates control of the path to the SR PCE.

In the second variant, the PCC starts in stateful mode by immediately delegating control of the path to
the SR PCE. Therefore, the headend sends a Report message for the (empty) path to the SR PCE with
the Delegate flag set. This variant is used by IOS XR headends and is the one illustrated in
Figure 13‑5, with an SR Policy dynamic candidate path initiated from headend Node1.

Node1’s configuration is shown in Example 13‑9. The configuration specifies to request the SR PCE
to compute the path (using the keyword pcep). The dynamic path must optimize the TE metric
(metric type te) and avoid links with affinity color name RED (affinity exclude-any name
RED).
Example 13-9: Configuration on headend Node1

segment-routing
traffic-eng
affinity-map
name RED bit-position 0
!
policy BLUE
binding-sid mpls 15000
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
pcep
metric
type te
!
constraints
affinity
exclude-any
name RED

After configuring the SR Policy, the headend Node1 sends the following PCEP Report (PCRpt)
message (marked ➊ in Figure 13‑5) to the SR PCE:

PCEP Report:

SR Policy status: Administratively Up, Operationally Down, Delegate flag set

Path setup type: SR

Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4

Symbolic name: cfg_BLUE_discr_100

BSID: 15000

Segment list: empty

Optimization objective: metric type TE

Constraints: exclude-any RED links


Figure 13-5: PCEP protocol sequence – Report/Update/Report

With this message Node1 delegates this path to the SR PCE. After receiving this message, SR PCE
finds the empty segment list and computes the path according to the optimization objective and
constraints that are specified in this Report message.

If, for any reason, the SR PCE needs to reject the delegation, it sends an empty Update message with
the Delegate flag set to 0.
After the computation (➋) SR PCE requests Node1 to update the path by sending the following PCEP
Update (PCUpd) message to Node1 (➌):

PCEP Update:

SR Policy status: Desired Admin Status Up, Delegate flag

Path setup type: SR

Segment list: <16003, 24034>

Headend Node1 installs the SR Policy path in the forwarding table (➍), and sends a status report to
SR PCE in the following PCEP Report (PCRpt) message (➎):

PCEP Report:

SR Policy status: Administratively Up, Operationally Up, Delegate flag set

Path setup type: SR

Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4

Symbolic name: cfg_BLUE_discr_100

BSID: 15000

Segment list: <16003, 24034>

Optimization objective: metric type TE

Constraints: exclude-any RED links

The status of the SR Policy is shown in the output of Example 13‑10.


Example 13-10: Status SR Policy on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy


Color: 10, End-point: 1.1.1.4
Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:51:55 (since Aug 9 13:30:20.164)
Candidate-paths:
Preference: 100 (configuration) (active)
Name: BLUE
Requested BSID: 15000
PCC info:
Symbolic name: cfg_BLUE_discr_100
PLSP-ID: 30
Dynamic (pce 1.1.1.10) (valid)
Metric Type: TE, Path Accumulated Metric: 30
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Attributes:
Binding SID: 15000 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

After completing this protocol exchange, the headend has installed the path and has delegated control
of the path to the SR PCE.

The SR PCE is now responsible for maintaining this path and may autonomously update it using the
PCEP sequence described below.

Following a topology change, the SR PCE re-computes the delegated paths and updates them if necessary. Whenever the SR PCE finds that the current path differs from the desired path, it requests the headend to update the path.

The illustration in Figure 13‑6 continues the example. Node1 has delegated the SR Policy path with
endpoint Node4 to the SR PCE. The link between Node3 and Node4 fails and SR PCE is notified of
this topology change via the IGP (marked ➊ in Figure 13‑6). The topology change triggers SR PCE to
recompute this path (➋). The new path is encoded in SID list <16004>.
Figure 13-6: PCEP protocol sequence – Update/Report

After computing the new path, SR PCE requests Node1 to update the path by sending the following
PCEP Update (PCUpd) message to Node1 (➌):

PCEP Update:

SR Policy status: Desired Admin Status Up, Delegate flag


Path setup type: SR

Segment list: <16004>

Headend Node1 installs the SR Policy path in the forwarding table (➍), and sends a status report to
SR PCE in the following PCEP Report (PCRpt) message (➎):

PCEP Report:

SR Policy status: Administratively Up, Operationally Up, Delegate flag set

Path setup type: SR

Endpoints: headend and endpoint identifiers 1.1.1.1 and 1.1.1.4

Symbolic name: cfg_BLUE_discr_100

BSID: 15000

Segment list: <16004>

Optimization objective: metric type TE

Constraints: exclude-any RED links

13.3.2 PCE-Initiated Path


An operator can configure an SR Policy definition on an SR PCE. The SR PCE then initiates this SR
Policy on the specified headend and maintains it.

Example 13‑11 illustrates the SR PCE configuration of an SR Policy named BROWN to be initiated
on headend Node1 with address 1.1.1.1 (peer ipv4 1.1.1.1). This SR Policy has color 50,
endpoint 1.1.1.4 and BSID 15111. The dynamic candidate path has preference 100 and optimizes the
TE metric. While this example SR Policy has a single dynamic candidate path, SR Policies with
multiple candidate paths and explicit paths can be configured on the PCE.
Example 13-11: SR Policy configuration on SR PCE

pce
segment-routing
traffic-eng
peer ipv4 1.1.1.1
policy BROWN
binding-sid mpls 15111
color 50 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type te

The SR PCE in Figure 13‑7 computes the dynamic path (➊) and sends the following PCEP Initiate
(PCInit) message with the solution SID list to headend Node1 (➋):

PCEP Initiate:

SR Policy status: Desired Admin Status Up, Delegate flag, Create flag set

Path setup type: SR

Endpoints: headend and endpoint identifiers 1.1.1.1 and 1.1.1.4

Color: 50

Symbolic name: BROWN

BSID: 15111

Preference: 100

Segment list: <16003, 24034>


Figure 13-7: PCE-initiated path – Initiate/Report PCEP protocol sequence

Headend Node1 installs the SR Policy path in the forwarding table (➌), and sends a status report to
SR PCE in the following PCEP Report (PCRpt) message (➍):

PCEP Report:

SR Policy status: Administratively Up, Operationally Up, Delegate flag set

Path setup type: SR


Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4

Symbolic name: BROWN

BSID: 15111

Segment list: <16003, 24034>

With this message the headend confirms that the path has been installed as instructed and delegates
control to the SR PCE.

The resulting SR Policy on Node1 is shown in Example 13‑12.

Example 13-12: SR Policy candidate path on the headend, as initiated by the SR PCE configuration

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 50, End-point: 1.1.1.4


Name: srte_c_50_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:00:15 (since Jul 30 38:37:12.967)
Candidate-paths:
Preference: 100 (PCEP) (active)
Name: BROWN
Requested BSID: 15111
PCC info:
Symbolic name: BROWN
PLSP-ID: 3
Dynamic (pce 1.1.1.10) (valid)
16003 [Prefix-SID, 1.1.1.3]
24004 [Adjacency-SID, 99.3.6.3 - 99.3.6.6]
Attributes:
Binding SID: 15111 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The SR PCE maintains this path in the same manner as a headend-initiated path.

When the SR Policy configuration on SR PCE is updated or removed, the SR Policy candidate path
on the headend is updated or deleted accordingly using PCEP.
13.4 Application-Driven Path
A likely use-case in the WEB/OTT market sees a central application programming the network with
SR Policies. In that case, the central application likely collects the view of the network using the SR
PCE’s north-bound interface. This interface to the network is available to any application. At the time
of writing, the available interfaces are REST and NETCONF.

North-bound and south-bound interfaces


A north-bound interface allows a network component to communicate with a higher-level component while a south-bound interface
allows a network component to communicate with a lower-level component.

The north-bound and south-bound terminology refers to the typical architectural drawing, as in Figure 13‑8, where the north-bound
interface is drawn on top (“on the North side”) of the applicable component, SR PCE in this case, and the south-bound interface is
drawn below it (“on the South side”).
Figure 13-8: North- and south-bound interfaces

For a controller, examples of the north-bound interface are: REST (Representational State Transfer) and NETCONF (Network
Configuration Protocol).

Controller south-bound interface examples are PCEP, BGP, classic XML, NETCONF.

Using the REST API, an application can request SR PCE to provide topology and SR Policy information.

An application can do a GET of the following REST URLs to retrieve the information:

Topology URL: http://<sr-pce-ip-addr>:8080/topo/subscribe/json

SR Policy URL: http://<sr-pce-ip-addr>:8080/lsp/subscribe/json

The returned topology information is the equivalent of the output of the command show pce ipv4
topology. When specifying “json” in the URL, the information is returned in telemetry format, which has base64-encoded Google Protocol Buffers (GPB) data wrapped into JSON object(s). The data is
encoded using the encoding path Cisco-IOS-XR-infra-xtc-oper:pce/topology-nodes/topology-node as
can be found in the YANG module Cisco-IOS-XR-infra-xtc-oper.yang. Other encoding formats (txt,
xml) are possible.

The returned SR Policy information is the equivalent of the output of the command show pce lsp.
When specifying json format, the information is returned in telemetry format using the encoding path
Cisco-IOS-XR-infra-xtc-oper:pce/tunnel-detail-infos/tunnel-detail-info as can be found in the YANG
module Cisco-IOS-XR-infra-xtc-oper.yang. Other encoding formats (txt, xml) are possible.

An application can request a snapshot of the current information or subscribe to a continuous real-
time feed of topology and SR Policy updates. If an application subscribes to the continuous topology
feed, it will automatically receive (push model) the new topology information of each changed node.

Equivalently, if subscribing to the SR Policy feed, the application automatically receives the new
information of each changed SR Policy.
Increasing flexibility and reducing complexity
“With centralized TE controller and SR policy, now we can dramatically reduce the complexity for our traffic engineering
deployment. We can push the traditional edge router function to be inside of the data center and control the path from our
server gateway. This not only increases the flexibility of path selection, but also reduces the complexity of the WAN
routers. ”

— Dennis Cai

After the central application collects the network information, it computes the required SR Policy
paths and deploys these paths using the SR PCE’s north-bound interface. The SR PCE then initiates
these paths via its south-bound PCEP interface.

If the SR Policy does not yet exist on the headend, then the headend instantiates it with the specified path as candidate path.

If the SR Policy already exists on the headend, then the PCEP-initiated SR Policy candidate path is added to the other candidate paths of the SR Policy. The selection of the active candidate path for an SR Policy is based on the preference of these paths, as described in chapter 2, "SR Policy". If the PCEP-initiated candidate path has the highest preference, it is selected as the active path and overrides any lower-preference candidate paths of the SR Policy.

PCEP-initiated candidate paths are ephemeral; they are not stored in the running configuration of the headend node. However, this does not mean that they are deleted as soon as the connection to the SR PCE is lost, as described in the SR PCE high-availability section 13.5.

In Figure 13‑9 an application instructs the SR PCE to initiate an explicit SR Policy path to Node4 on
headend Node1 using the SR PCE’s North-bound API2 (marked ➊ in Figure 13‑9). Following this
request, the SR PCE sends the following PCEP Initiate (PCInit) message to headend Node1 (➋):

PCEP Initiate:

SR Policy status: Desired Admin Status Up, Delegate flag, Create flag set

Path setup type: SR

Endpoints: headend and endpoint identifiers 1.1.1.1 and 1.1.1.4


Color: 30

Symbolic name: GREEN

BSID: 15001

Preference: 100

Segment list: <16003, 24034>


Figure 13-9: PCEP protocol sequence – Initiate/Report

Headend Node1 installs the SR Policy path in the forwarding table (➌), and sends a status report to
SR PCE in the following PCEP Report (PCRpt) message (➍):
PCEP Report:

SR Policy status: Administratively Up, Operationally Up, Delegate flag set

Path setup type: SR

Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4

Symbolic name: GREEN

BSID: 15001

Segment list: <16003, 24034>

With this message the headend confirms that the path has been installed as instructed and delegates
control to the SR PCE.

SR PCE stores the status information in its database and feeds the path information to the application
via its North-bound interface (➎).

The status of the SR Policy on headend Node1 is shown in Example 13‑13.

Example 13-13: Status PCE-initiated SR Policy on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.4


Name: srte_c_30_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:51:55 (since Aug 9 14:29:38.540)
Candidate-paths:
Preference: 100 (PCEP) (active)
Name: GREEN
Requested BSID: 15001
PCC info:
Symbolic name: GREEN
PLSP-ID: 3
Dynamic (pce 1.1.1.10) (valid)
16003 [Prefix-SID, 1.1.1.3]
24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
Attributes:
Binding SID: 15001 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

The central application is now responsible for maintaining the path. It receives a real-time feed of topology and SR Policy information via the SR PCE north-bound interface and can update the path as required via this north-bound interface.

The application can also delete the path via the SR PCE’s North-bound API. After receiving the path
delete request from the application, the SR PCE sends a PCEP Initiate message with the Remove flag
(R-flag) set and the headend acknowledges with a PCEP Report message that the SR Policy path has
been removed.

If the deleted path was the last candidate path of the SR Policy, then the headend will also delete the
SR Policy. Otherwise, the headend follows the selection procedure to select a new active path.
13.5 High-Availability
An SR PCE populates its SR-TE DB via BGP-LS, or via IGP for its local IGP area. Both BGP-LS
and IGP are distributed mechanisms that have their own well-seasoned high availability mechanisms.
Since the SR PCEs individually tap into these information feeds, typically via redundant connections,
the SR-TE DBs of the SR PCEs are independently and reliably synchronized to a common source of
information.

SR PCE uses the native PCEP high-availability capabilities to recover from PCEP failure situations.
These are described in the next sections.

13.5.1 Headend Reports to All PCEs


As stated in the previous section, we recommend that a headend node connects to a pair of SR PCEs
for redundancy. The headend specifies one SR PCE as primary and the other as secondary.

The headend establishes a PCEP session to both SR PCEs, but it only uses the most preferred one for
path computations if both are available.

Whenever the headend sends out a PCEP Report message reporting the state of an SR Policy path, it
sends it to all connected SR PCEs, but only sets the Delegate flag (D-flag) in the Report message
for the PCE that computed or initiated the SR Policy path. The other SR PCEs receive the Report
message with the Delegate flag unset. This way, the headend keeps the SR Policy databases of all
connected SR PCEs synchronized, while delegating the path to a single SR PCE.

This mechanism allows the less preferred SR PCEs to operate in hot-standby mode; no state
synchronization is needed at the time the preferred SR PCE fails.

The PCEP state synchronization mechanism is illustrated in Figure 13‑10. Two SR PCEs are
configured on Node1, SR PCE1 (1.1.1.10) and SR PCE2 (1.1.1.11), with SR PCE1 being the most
preferred (lowest precedence 100). The corresponding configuration of Node1 is displayed in
Example 13‑14.
Example 13-14: Configuration of Node1 – two SR PCEs

segment-routing
traffic-eng
policy POL1
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
pcep
!
metric
type te
!
pcc
pce address ipv4 1.1.1.10
precedence 100
!
pce address ipv4 1.1.1.11
precedence 200

Example 13‑15 shows the status of the PCEP sessions to both SR PCEs on headend Node1. Both
sessions are up and SR PCE1 (1.1.1.10) is selected as the primary PCE (best PCE).

Example 13-15: PCEP peers’ status on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng pcc ipv4 peer

PCC's peer database:


--------------------

Peer address: 1.1.1.10, Precedence: 100, (best PCE)


State up
Capabilities: Stateful, Update, Segment-Routing, Instantiation

Peer address: 1.1.1.11, Precedence: 200


State up
Capabilities: Stateful, Update, Segment-Routing, Instantiation

The operator configures SR Policy POL1 on Node1 with a single dynamic candidate path, as shown
in Example 13‑14. The operator indicates to use PCEP to compute this dynamic path, optimizing the
TE-metric. Node1 uses its primary SR PCE1 for path computations.

After configuring this SR Policy on Node1, Node1 sends a PCEP Report message with an empty
segment list to both SR PCE1 and SR PCE2, as indicated by ➊ and ➋ in Figure 13‑10. These
Report messages are identical, except for the Delegate flag.

Node1 delegates the path to its primary SR PCE1 by setting the Delegate flag (D-flag = 1) in this
Report message (➊), while the Delegate flag in the Report message sent to SR PCE2 ➋ is unset (D-
flag = 0).
Since the path is delegated to SR PCE1, it computes the path (➌) and replies to Node1 with the
solution in a PCEP Update message (➍). Node1 installs the path (➎) and sends a Report message to both SR PCE1 (➏) and SR PCE2 (➐), again setting the Delegate flag only in the Report to SR PCE1.

Both SR PCE1 and SR PCE2 are now aware of the status of the SR Policy path on headend Node1,
but only SR PCE1 has control over it.

Figure 13-10: Headend Node1 sends PCEP Report to all its connected PCEs

Example 13‑16 shows the SR Policy status output of SR PCE1 and Example 13‑17 shows the status
output of SR PCE2. Both SR PCEs show the Reported path, which is the SID list <16003, 24034>,
as shown in lines 24 to 27 in both outputs.

The two outputs are almost identical, except for two aspects. First, SR PCE1 also shows the
Computed path (lines 28 to 32 in Example 13‑16), while this information is not displayed on SR
PCE2 (Example 13‑17) since SR PCE2 did not compute this path. Second, the Delegate flag is set on
SR PCE1 (flags: D:1 on line 19) but not on SR PCE2 (flags: D:0 on line 19), since the path is
only delegated to SR PCE1.

Example 13-16: SR Policy status on primary SR PCE1

1 RP/0/0/CPU0:xrvr-10#show pce lsp detail


2
3 PCE's tunnel database:
4 ----------------------
5 PCC 1.1.1.1:
6
7 Tunnel Name: cfg_POL1_discr_100
8 LSPs:
9 LSP[0]:
10 source 1.1.1.1, destination 1.1.1.4, tunnel ID 10, LSP ID 1
11 State: Admin up, Operation up
12 Setup type: Segment Routing
13 Binding SID: 40006
14 Maximum SID Depth: 10
15 Absolute Metric Margin: 0
16 Relative Metric Margin: 0%
17 Bandwidth: signaled 0 kbps, applied 0 kbps
18 PCEP information:
19 PLSP-ID 0x80001, flags: D:1 S:0 R:0 A:1 O:1 C:0
20 LSP Role: Single LSP
21 State-sync PCE: None
22 PCC: 1.1.1.1
23 LSP is subdelegated to: None
24 Reported path:
25 Metric type: TE, Accumulated Metric 30
26 SID[0]: Node, Label 16003, Address 1.1.1.3
27 SID[1]: Adj, Label 24034, Address: local 99.3.4.3 remote 99.3.4.4
28 Computed path: (Local PCE)
29 Computed Time: Mon Aug 13 19:21:31 UTC 2018 (00:13:06 ago)
30 Metric type: TE, Accumulated Metric 30
31 SID[0]: Node, Label 16003, Address 1.1.1.3
32 SID[1]: Adj, Label 24034, Address: local 99.3.4.3 remote 99.3.4.4
33 Recorded path:
34 None
35 Disjoint Group Information:
36 None
Example 13-17: SR Policy status on secondary SR PCE2

1 RP/0/0/CPU0:xrvr-11#show pce lsp detail


2
3 PCE's tunnel database:
4 ----------------------
5 PCC 1.1.1.1:
6
7 Tunnel Name: cfg_POL1_discr_100
8 LSPs:
9 LSP[0]:
10 source 1.1.1.1, destination 1.1.1.4, tunnel ID 10, LSP ID 1
11 State: Admin up, Operation up
12 Setup type: Segment Routing
13 Binding SID: 40006
14 Maximum SID Depth: 10
15 Absolute Metric Margin: 0
16 Relative Metric Margin: 0%
17 Bandwidth: signaled 0 kbps, applied 0 kbps
18 PCEP information:
19 PLSP-ID 0x5, flags: D:0 S:0 R:0 A:1 O:1 C:0
20 LSP Role: Single LSP
21 State-sync PCE: None
22 PCC: 1.1.1.1
23 LSP is subdelegated to: None
24 Reported path:
25 Metric type: TE, Accumulated Metric 30
26 SID[0]: Node, Label 16003, Address 1.1.1.3
27 SID[1]: Adj, Label 24034, Address: local 99.3.4.3 remote 99.3.4.4
28 Computed path: (Local PCE)
29 None
30 Computed Time: Not computed yet
31 Recorded path:
32 None
33 Disjoint Group Information:
34 None

13.5.2 Failure Detection


The liveness of the SR PCE having delegation over SR Policy paths is important. This SR PCE is
responsible to maintain the delegated paths and update them if required, such as following a topology
change. Failure to do so can result in sub-optimal routing and packet loss.

A headend verifies the liveness of a PCEP session using a PCEP keepalive timer and dead timer.

By default, the headend and SR PCE send a PCEP message (Keepalive or other) at least every 30
seconds. This time period is the keepalive timer interval. The keepalive timer is restarted every time
a PCEP message (of any type) is sent. If no PCEP message has been sent for the keepalive time
period, the keepalive timer expires and a PCEP Keepalive message is sent.
A node uses the dead timer to detect if the PCEP peer is still alive. A node restarts the session’s dead
timer when receiving a PCEP message (of any type). The default dead timer interval is 120 seconds.
When not receiving any PCEP message for the dead-timer time period, the dead timer expires and the
PCEP session is declared down.

Keepalive/dead timers are exchanged in the PCEP Open message. The recommended values, which are also the IOS XR defaults mentioned above, are 30 seconds for the keepalive timer and four times the keepalive value (120 seconds) for the dead timer. For further details, please see chapter 18, "PCEP" and RFC5440.

The keepalive and dead timers are configurable on PCC and PCE, as shown in Example 13‑18 and
Example 13‑19. On the PCC, the keepalive timer interval is configured as 60 seconds and the dead timer interval as 180 seconds. On the PCE, the keepalive interval is configured as 60 seconds and the dead timer is left at its default value.

Example 13-18: PCC timer configuration

segment-routing
traffic-eng
pcc
pce address ipv4 1.1.1.10
!
timers keepalive 60
timers deadtimer 180

Example 13-19: PCE timer configuration

pce
address ipv4 1.1.1.10
!
timers
keepalive 60

The PCC also tracks reachability of the SR PCE’s address in the forwarding table. When the SR
PCE’s address becomes unreachable, then the PCEP session to that SR PCE is brought down
immediately without waiting for the expiration of the dead-timer.

13.5.3 Headend Re-Delegates Paths to Alternate PCE Upon Failure


It is important to note that the failure of the PCEP session to an SR PCE has no immediate impact on the SR Policy statuses and the traffic forwarded into these SR Policies, even for a failure of the primary SR PCE to which the headend has delegated the paths.
After the headend detects the failure of an SR PCE using the mechanisms of the previous section, it
attempts to re-delegate the paths maintained by this failed SR PCE to an alternate SR PCE.

Two cases are distinguished: headend-initiated paths and application-initiated paths.

13.5.3.1 Headend-Initiated Paths


The redelegation mechanism described in this section is specified in ”PCEP Extensions for Stateful
PCE” [RFC8231].

When a headend detects that the PCEP session to an SR PCE has failed, it starts a Redelegation timer.
This is the time to wait before redelegating paths to another SR PCE. At the same time the headend
starts a State-Timeout timer, which is the maximum time to keep SR Policy paths states after detecting
the PCEP session failure.

The Redelegation timer allows some time for the primary PCEP session to recover after the failure. If
it recovers before the Redelegation timer has expired, both the path state and the delegation state
remain unmodified, as if the failure had never occurred.

For headend-initiated paths, the Redelegation timer in IOS XR is fixed to 0. After detecting a PCEP session failure, the headend immediately redelegates the headend-initiated paths that were delegated to the failed SR PCE to the next preferred SR PCE.

When the delegation succeeds, the new delegate SR PCE verifies the path and updates it if needed.
Assuming that the SR PCEs have a synchronized SR-TE DB and use similar path computation
algorithms, no path changes are expected after a re-delegation.

Paths that cannot be redelegated to an alternate SR PCE are called orphaned paths. When the State-
Timeout timer expires, the headend invalidates the orphaned paths. The usual SR Policy path
selection procedure is then used to select the highest preference valid candidate path of the SR Policy
as active path. If no other valid candidate path is available, then the SR Policy is torn down.

The default state-timeout timer in IOS XR is 120 seconds and is configurable on the headend as
delegation-timeout.
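
As a sketch of how this could look on the headend, assuming the delegation-timeout keyword under the PCC timers (verify the exact syntax for your software release), a 60-second state-timeout would be configured as:

segment-routing
traffic-eng
pcc
timers delegation-timeout 60
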
If the PCEP session to a more preferred SR PCE comes up, either when a failed SR PCE is restored
or when a new one with a lower precedence is configured, then the headend immediately redelegates
its SR Policy paths to this more preferred SR PCE.

Primary PCE Failure Illustration

The redelegation mechanism for headend-initiated paths is illustrated in Figure 13‑11. The initial
situation of this illustration is the final situation of Figure 13‑10 where Node1 has delegated its SR
Policy to its primary SR PCE1.

At a given time, SR PCE1 becomes unreachable and Node1 brings down the PCEP session to SR
PCE1 (➊). This can be due to expiration of the dead-timer or triggered by the removal of SR PCE1’s
address from the routing table.

Upon detecting the PCEP session failure, Node1 starts its State-Timeout timer and immediately
attempts to redelegate its SR Policy path to its other connected PCE, SR PCE2. To redelegate the
path, Node1 sends a PCEP Report message for the SR Policy path to SR PCE2 with the Delegate flag
set (➋).

In the meantime, Node1 preserves the SR Policy path state and traffic steering as they were before the
PCEP failure.

SR PCE2 accepts the delegation (by not rejecting it) and computes the SR Policy path to validate its
correctness (➌). If the computed path differs from the reported path, then SR PCE2 instructs Node1
to update the SR Policy with the new path. In the illustration we assume that the failure does not
impact the SR Policy path. Therefore, SR PCE2 does not need to update the path.
Figure 13-11: Workflow after failure of the preferred SR PCE

When SR PCE1 recovers, Node1 redelegates the SR Policy path to SR PCE1 since it is the more preferred SR PCE. The message exchange is equivalent to the one in Figure 13‑11, but towards SR PCE1 instead of SR PCE2.

13.5.3.2 Application-Driven Paths


The redelegation mechanism described in this section is specified in RFC 8281 (PCEP extensions for
PCE-initiated paths). While it is the application that initiates paths and maintains them, we will
describe the behaviors from a PCEP point of view where the SR PCE initiates the paths on the
headend.

The SR PCE that has initiated a given SR Policy path is called the “owner SR PCE”. The headend
delegates the SR PCE initiated path to the owner SR PCE and keeps track of this owner.
Upon failure of the owner SR PCE, the paths should be redelegated to another SR PCE.

The redelegation behavior is PCE-centric. The headend does not simply redelegate an SR PCE-initiated path to the next preferred SR PCE after the owner SR PCE fails. Indeed, in contrast with a headend-initiated path, the headend cannot know which PCE is capable of maintaining the PCE-initiated path. Instead, another SR PCE can take the initiative to adopt the orphan path. This SR PCE requests the headend to delegate an orphaned SR PCE-initiated path to it by sending the headend a PCEP Initiate message that identifies the orphan path it wants to adopt.

As for the headend-initiated paths, the procedure is governed by two timers on the headend: the redelegation timer and the state-timeout timer. In IOS XR, these timers are distinct from the timers with the same name used for headend-initiated paths and are configured separately.

The redelegation timer interval is the time the headend waits for the owner PCE to recover. The state-
timeout timer interval is the time the headend waits before cleaning up the orphaned SR Policy path.
These two timers are started when the PCEP session for the owner PCE goes down.

If the owner SR PCE disconnects and reconnects before the redelegation timer expires, then the headend automatically redelegates the path to the owner SR PCE. Until this timer expires, only the owner SR PCE can regain ownership of the path.

If the redelegation timer expires before the owner SR PCE regains ownership, the path becomes an orphan. At that point, any SR PCE, not only the owner SR PCE, can request ownership of the orphan path (“adopt” it) by sending a proper PCEP Initiate message with the path identifier to the
headend. The path is identified by the PCEP-specific LSP identifier (PLSP-ID), as described in
chapter 18, "PCEP".

The redelegation timer for SR PCE initiated paths is named initiated orphan timer in IOS XR.
The redelegation timer is configurable, and its default value is 3 minutes.

Before the state-timeout timer expires, the SR Policy path is kept intact, and traffic is forwarded into it. If
this timer expires before any SR PCE claims ownership of the path, the headend brings down and
removes the path. The usual SR Policy path selection procedure is then used to select the highest
preference valid candidate path of the SR Policy as active path. If no other valid candidate path is
available, then the SR Policy is torn down.
The state-timeout timer for SR PCE initiated paths is named initiated state timer in IOS XR. The
state-timeout timer is configurable, and its default value is 10 minutes.
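
As a hypothetical sketch, assuming the initiated orphan and initiated state keywords under the PCC timers (the exact syntax may differ per release), both timers could be tuned on the headend as follows:

segment-routing
traffic-eng
pcc
timers initiated orphan 180
timers initiated state 600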

13.5.4 Inter-PCE State-Sync PCEP Session


An SR PCE can maintain PCEP sessions to other SR PCEs. These are the state-sync sessions as
specified in IETF draft-litkowski-pce-state-sync.

These state-sync sessions have a dual purpose: synchronize SR Policy databases of SR PCEs
independently from the headends and prevent split-brain situations for disjoint path computations.

The standard PCEP procedures maintain SR Policy database synchronization between SR PCEs by
letting the headends send their PCEP Report messages to all their connected SR PCEs. This way the
headends ensure that the SR Policy databases on the different connected SR PCEs stay synchronized.

If additional redundancy is desired, PCEP state-sync sessions can be established between the SR
PCEs. These state-sync sessions will ensure that the SR Policy databases of the SR PCEs stay
synchronized, regardless of their connectivity to the headends. If a given headend loses its PCEP
session to one of its SR PCEs, the state-sync connections ensure that this SR PCE still (indirectly)
receives the Reports emitted by that headend.

13.5.4.1 State-Sync Illustration


In Figure 13‑12, headend Node1 has a PCEP session only to SR PCE1, e.g., as a result of a failing
PCEP session between Node1 and SR PCE2. SR PCE1 has a state-sync PCEP session to SR PCE2.
Figure 13-12: State-sync session between SR PCE1 and SR PCE2

The state-sync PCEP session is configured on SR PCE1 as shown in Example 13‑20. SR PCE1 has IP
address 1.1.1.10, SR PCE2 has IP address 1.1.1.11.

Example 13-20: State-sync session configuration on SR PCE1

pce
address ipv4 1.1.1.10
state-sync ipv4 1.1.1.11

An SR Policy GREEN is configured on headend Node1, with a dynamic path computed by SR PCE1.
After SR PCE1 has computed the path and has sent an Update message to Node1, Node1 installs the
path and sends a Report message to its only connected SR PCE1 (marked ➊ in Figure 13‑12). The
Delegate flag is set since Node1 delegates this path to SR PCE1.
Upon receiving this Report message, SR PCE1 clears the Delegate flag, adds a TLV specifying the
identity of the headend (PCC) (➋) and then forwards the message to SR PCE2 via the PCEP state-
sync session.

When an SR PCE receives a Report message via a state-sync session, it does not forward this Report
message again via its own state-sync sessions. This prevents message loops between PCEs but, as a
consequence, a full mesh of state-sync sessions between PCEs is required for the complete
information to be available on all of them.
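
For completeness, a full mesh in this two-PCE example simply means that SR PCE2 carries the reciprocal configuration of Example 13‑20. A sketch, reusing the same addresses:

pce
address ipv4 1.1.1.11
state-sync ipv4 1.1.1.10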

Example 13‑21 and Example 13‑22 show the SR Policy databases on SR PCE1 and SR PCE2
respectively.

Example 13‑21 shows that SR Policy cfg_GREEN_discr_100 has been reported by headend Node1
(PCC: 1.1.1.1) and is delegated to this SR PCE1 (D:1). Both a Computed path and a Reported path are shown since SR PCE1 has computed this path and has received the Report message.
Example 13-21: SR Policy database on SR PCE1

RP/0/0/CPU0:SR-PCE1#show pce lsp detail

PCE's tunnel database:


----------------------
PCC 1.1.1.1:

Tunnel Name: cfg_GREEN_discr_100


LSPs:
LSP[0]:
source 1.1.1.1, destination 1.1.1.4, tunnel ID 29, LSP ID 8
State: Admin up, Operation active
Setup type: Segment Routing
Binding SID: 40006
Maximum SID Depth: 10
Absolute Metric Margin: 0
Relative Metric Margin: 0%
Bandwidth: signaled 0 kbps, applied 0 kbps
PCEP information:
PLSP-ID 0x8001d, flags: D:1 S:0 R:0 A:1 O:1 C:0
LSP Role: Single LSP
State-sync PCE: None
PCC: 1.1.1.1
LSP is subdelegated to: None
Reported path:
Metric type: TE, Accumulated Metric 30
SID[0]: Node, Label 16003, Address 1.1.1.3
SID[1]: Adj, Label 24023, Address: local 99.3.4.3 remote 99.3.4.4
Computed path: (Local PCE)
Computed Time: Wed Aug 01 12:21:21 UTC 2018 (00:02:25 ago)
Metric type: TE, Accumulated Metric 30
SID[0]: Node, Label 16003, Address 1.1.1.3
SID[1]: Adj, Label 24023, Address: local 99.3.4.3 remote 99.3.4.4
Recorded path:
None
Disjoint Group Information:
None

Example 13‑22 shows that SR Policy cfg_GREEN_discr_100 is not delegated to this SR PCE2 (D:0).
This SR Policy has been reported by SR PCE1 via the state-sync session (State-sync PCE:
1.1.1.10). Only a reported path is shown since this SR PCE2 did not compute this path; it only
received the Report message.
Example 13-22: SR Policy database on SR PCE2

RP/0/0/CPU0:SR-PCE2#show pce lsp detail

PCE's tunnel database:


----------------------
PCC 1.1.1.1:

Tunnel Name: cfg_GREEN_discr_100


LSPs:
LSP[0]:
source 1.1.1.1, destination 1.1.1.4, tunnel ID 29, LSP ID 8
State: Admin up, Operation active
Setup type: Segment Routing
Binding SID: 40006
Maximum SID Depth: 10
Absolute Metric Margin: 0
Relative Metric Margin: 0%
Bandwidth: signaled 0 kbps, applied 0 kbps
PCEP information:
PLSP-ID 0x8001d, flags: D:0 S:1 R:0 A:1 O:1 C:0
LSP Role: Single LSP
State-sync PCE: 1.1.1.10
PCC: None
LSP is subdelegated to: None
Reported path:
Metric type: TE, Accumulated Metric 30
SID[0]: Node, Label 16003, Address 1.1.1.3
SID[1]: Adj, Label 24023, Address: local 99.3.4.3 remote 99.3.4.4
Computed path: (Local PCE)
None
Computed Time: Not computed yet
Recorded path:
None
Disjoint Group Information:
None

13.5.4.2 Split-Brain
The inter-PCE state-sync PCEP connection is also used to solve possible split-brain situations. In a
split-brain situation, multiple brains (in this case, SR PCEs) independently compute parts of the
solution to a problem and fail to find a complete or optimal solution because none of the brains has
the authority to make the others reconsider their decisions.

In the context of SR-TE, this type of situation may occur for disjoint path computation, as illustrated
in Figure 13‑13.

Headends Node1 and Node5 have PCEP sessions to both SR PCE1 and SR PCE2, with SR PCE1
configured as the primary SR PCE. However, the PCEP session from Node5 to SR PCE1 has failed.
Therefore, SR PCE2 is now the primary (and only) SR PCE for Node5.

There is a state-sync PCEP session between SR PCE1 and SR PCE2.


Strict node-disjoint paths (see chapter 4, "Dynamic Candidate Path" for more details) are required
from Node1 to Node4 and from Node5 to Node8. At first the SR Policy to Node4 is configured on
Node1 and SR PCE1 computes the path. The result is path 1→2→6→7→3→4, which is the IGP
shortest path from Node1 to Node4. This path is encoded as segment list <16004>, as indicated in the
drawing. Node1 delegates the path to SR PCE1. SR PCE2 also learns about this path from Node1
since Node1 sends Report messages to both connected SR PCEs.

Then the SR Policy to Node8 is configured on Node5. Node5 requests a path that is strictly node-disjoint from the existing path from Node1 to Node4. SR PCE2 is unable to find such a path: all possible paths traverse nodes used by the existing path from Node1 to Node4. The problem is that SR PCE2 learned about the path from Node1 to Node4, but it cannot change this path in order to solve the disjoint-paths problem. SR PCE2 has no control over that path since it is delegated to SR PCE1.

If a single SR PCE were computing and maintaining both paths, it could update the path from Node1 to Node4 to 1→2→3→4 using segment list <16002, 24023, 16004>, and disjoint paths would be possible. But due to the split-brain situation (SR PCE1 controls one path, SR PCE2 controls the other), no solution is found.
Figure 13-13: Split-brain situation: no disjoint paths found

To solve the split-brain problem, a master/slave relationship is established between SR PCEs when
an inter-PCE state-sync session is configured. At the time of writing, the master PCE is the PCE with
the lowest PCEP session IP address.

The SR PCE master/slave relationship solves the split-brain situation by letting the master SR PCE
compute and maintain both disjoint paths. The slave SR PCE sub-delegates the disjoint path
computation and maintenance to this master SR PCE.

Figure 13‑14 illustrates this by showing how it solves the problem of the example above, requiring strict node-disjoint paths from Node1 to Node4 and from Node5 to Node8.
Figure 13-14: Sub-delegating SR Policy path to master SR PCE (1)

Assuming that SR PCE1 in Figure 13‑14 is the master SR PCE, it is responsible for the computation
and maintenance of disjoint paths. Assuming the SR Policy from Node1 to Node4 is configured first,
SR PCE1 computes the path as shown in the drawing as the lighter colored path 1→2→6→7→3→4.
Node1 installs the path and reports it to both SR PCEs.

Next, the SR Policy from Node5 to Node8 is configured on Node5. Node5 sends a PCEP Report with
empty segment list to SR PCE2 (➊). SR PCE2 does not compute the path but forwards the Report
message to SR PCE1 via the state-sync session (➋), thereby sub-delegating this SR Policy path to
SR PCE1. SR PCE1 then becomes responsible for computing and updating the path. SR PCE2 adds a
TLV to this Report message identifying the owner PCC of the SR Policy path, Node5.

SR PCE1 computes both disjoint paths (➌) and sends an Update message to Node1 (➍) to update the
path of Node1’s SR Policy as 1→2→3→4. Node1 updates the path (➎).
Figure 13-15: Sub-delegating SR Policy path to master SR PCE (2)

In Figure 13‑15 SR PCE1 then requests Node5 to update the path of the SR Policy to Node8 to
5→6→7→8 by sending an Update message to SR PCE2 via the state-sync session (➏). SR PCE2
forwards the Update message to Node5 (➐), which updates the path (➑).

After the headend nodes have updated their paths, they both send a PCEP Report reporting the new
state of the SR Policy paths (these messages are not shown in the illustration). Node1 sends the
Report directly to both SR PCE1 and SR PCE2. Node5 then sends a Report of the path to SR PCE2,
which forwards it to the master SR PCE1.
13.6 BGP SR-TE
This section provides a brief introduction to BGP SR-TE, the BGP way of signaling SR Policy
candidate paths from a controller to a headend. It is also known as “BGP SR Policy” and the BGP
extensions are specified in draft-ietf-idr-segment-routing-te-policy. A new Subsequent Address-
Family Indicator (SAFI) “SR Policy” is defined to convey the SR Policy information.

At the time of writing, SR PCE does not support BGP SR-TE as a south-bound interface, but BGP SR-TE is supported on the headend.

Why would anyone need that?


“Most network operators don’t have the luxury of doing a greenfield deployment of new technology, nor do they have
100% confidence from management that their new solution won't cause some unforeseen disaster. BGP SR-TE provides
an elegant, safe and flexible way to introduce global traffic engineering into an existing BGP-based network, and to
transition an existing RSVP-TE-based network to use SR-based traffic engineering. It leverages your existing BGP
expertise, such as understanding the semantics of peering sessions, updates, path attributes and Graceful Restart. As it
operates at the level of a BGP nexthop (an endpoint), it is straightforward for a team with BGP experience to comprehend.
Safety comes from having BGP and your IGP as fallbacks.

BGP SR-TE may come across at first as arbitrarily complex, but understanding each feature often leads to the same
pattern: “Why would anyone need that?”, followed by “Oh, I need that.” ”

— Paul Mattes

BGP SR-TE is covered in detail in chapter 19, "BGP SR-TE". Here we present a simple example
illustrated in Figure 13‑16.
Figure 13-16: BGP SR Policy illustration

The controller (ip address 1.1.1.10) signals the SR Policy candidate path using the BGP IPv4 SR
Policy address-family.

The SR Policy is identified in the NLRI with color 30 and endpoint 1.1.1.4. The distinguisher makes it possible to differentiate multiple candidate paths for the same SR Policy. It is important to note that this BGP Update only carries an SR Policy path; it does not carry any service route reachability information.

The attributes of the candidate path are carried in the Tunnel-encaps attribute of the BGP Update message. Multiple attributes are possible; here we only show the Binding-SID 15001 and the SID list <16003, 24034>.

The IPv4 SR Policy BGP address-family is configured on headend Node1 for its session to the
controller, as shown in Example 13‑23. SR-TE is enabled as well.

Example 13-23: BGP SR-TE configuration

router bgp 1
bgp router-id 1.1.1.1
address-family ipv4 sr-policy
!
neighbor 1.1.1.10
remote-as 1
update-source Loopback0
address-family ipv4 sr-policy
!
segment-routing
traffic-eng

Leveraging the BGP protocol


“BGP SR Policy leverages the scalable, extensible and pervasive BGP protocol; this differentiates it from other controller
south-bound interface protocols like CLI/NETCONF, PCEP, etc. An operator/vendor can easily develop a BGP controller
that could signal this new NLRI.

I’m seeing more and more adoption of BGP SR Policy because of its simplicity and scalability; actually, a couple of BGP
SR Policy controllers have been deployed in production networks in the past year. Moving forward, with the ability of
BGP-LS to report SR Policies, BGP has a big potential to become the unified SR Policy communication protocol between
controller and device. ”

— YuanChao Su

After receiving the BGP Update, Node1 instantiates the SR Policy’s candidate path with the attributes
signaled in BGP.

As for PCEP-initiated paths, if the SR Policy does not yet exist, it is created with the candidate path. The candidate path is added to the list of all candidate paths of the SR Policy, and the selection process decides which candidate path becomes the active path.

Example 13‑24 shows the status of the SR Policy on Node1.


Example 13-24: Status of BGP SR-TE signaled SR Policy on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 30, End-point: 1.1.1.4


Name: srte_c_30_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:00:16 (since Feb 26 13:14:45.488)
Candidate-paths:
Preference: 100 (BGP, RD: 12345) (active)
Requested BSID: 15001
Explicit: segment-list (valid)
Weight: 1, Metric Type: TE
16003 [Prefix-SID, 1.1.1.3]
24034
Attributes:
Binding SID: 15001 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

When BGP withdraws the entry, the candidate path is removed and a new candidate path is selected.
If no other candidate path exists, the SR Policy is removed as well.
13.7 Summary
The SR-TE process can fulfil different roles, as the brain of a headend and as part of an SR Path
Computation Element (SR PCE) server.

SR PCE is a network function component integrated in any IOS XR base image. This functionality
can be enabled on any IOS XR node, physical or virtual.

SR PCE server is stateful (it maintains paths on behalf of other nodes), multi-domain capable (it
computes and maintains inter-domain paths), and SR-optimized (using SR-native algorithms).

SR PCE is a network entity that provides path computation and maintenance services to headend
nodes. It extends the SR-TE capability of a headend by computing paths that the headend node
cannot compute, such as inter-domain or disjoint paths.

SR PCE provides an interface to the network via its north-bound interfaces. An application uses
this interface to collect a real-time view of the network and to add/update/delete SR Policies.

PCEP high-availability mechanisms provide resiliency against SR PCE failures without impacting
the SR Policies and traffic forwarding. Inter-PCE state-sync PCEP sessions can further improve
this resiliency.
13.8 References
[YANGModels] “YANG models”, https://github.com/YangModels/yang

[RFC4655] "A Path Computation Element (PCE)-Based Architecture", JP Vasseur, Adrian Farrel,
Gerald Ash, RFC4655, August 2006

[RFC5440] "Path Computation Element (PCE) Communication Protocol (PCEP)", JP Vasseur,


Jean-Louis Le Roux, RFC5440, March 2009

[RFC8231] "Path Computation Element Communication Protocol (PCEP) Extensions for Stateful
PCE", Edward Crabbe, Ina Minei, Jan Medved, Robert Varga, RFC8231, September 2017

[RFC8281] "Path Computation Element Communication Protocol (PCEP) Extensions for PCE-
Initiated LSP Setup in a Stateful PCE Model", Edward Crabbe, Ina Minei, Siva Sivabalan, Robert
Varga, RFC8281, December 2017

[RFC8408] "Conveying Path Setup Type in PCE Communication Protocol (PCEP) Messages", Siva
Sivabalan, Jeff Tantsura, Ina Minei, Robert Varga, Jonathan Hardwick, RFC8408, July 2018

[draft-ietf-pce-segment-routing] "PCEP Extensions for Segment Routing", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Wim Henderickx, Jonathan Hardwick, draft-ietf-pce-segment-routing-15 (Work in Progress), February 2019

[draft-sivabalan-pce-binding-label-sid] "Carrying Binding Label/Segment-ID in PCE-based Networks.", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Jonathan Hardwick, Stefano Previdi, Cheng Li, draft-sivabalan-pce-binding-label-sid-06 (Work in Progress), February 2019

[draft-litkowski-pce-state-sync] "Inter Stateful Path Computation Element (PCE) Communication Procedures.", Stephane Litkowski, Siva Sivabalan, Cheng Li, Haomian Zheng, draft-litkowski-pce-state-sync-05 (Work in Progress), March 2019

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018
1. In some cases, an additional software license may be required. Check with your sales
representative.↩

2. At the time of writing, the REST API to manage SR Policy candidate paths was in the process of
being updated and the exact details were not yet available for publication.↩
14 SR BGP Egress Peer Engineering
What we will learn in this chapter:

SR BGP Egress Peer Engineering (EPE) enables a centralized controller to instruct an ingress node
to steer traffic via a specific egress node to a specific BGP peer or peering link.

SR BGP EPE uses Peering-SIDs to steer to a specific BGP peer or peering link, regardless of the
BGP best-path. Peering-SIDs can be seen as the BGP variant of IGP Adj-SIDs.

SR BGP EPE does not change the existing BGP distribution mechanism in place nor does it make
any assumptions on the existing iBGP design.

The different types of Peering-SIDs are PeerNode-SIDs, PeerAdj-SIDs, and PeerSet-SIDs.


PeerNode-SIDs steer to the associated peer, PeerAdj-SIDs steer over a specific peering link of the associated peer, and PeerSet-SIDs steer over a set of PeerNode-SIDs and/or PeerAdj-SIDs.

The Peering-SID information is advertised in BGP-LS, such that the controller can insert it in its
SR-TE DB and use it for path computations.

The controller can instantiate an SR Policy on an ingress node to steer traffic flows via a specific egress node to a specific BGP peer, whereby the path to the egress node can also be specified, integrating intra-domain and inter-domain TE.

The SR PCE includes SR EPE peering-SIDs in the SR Policy’s segment list to provide end-to-end
unified inter-domain SR-TE paths.

We start by explaining how SR BGP EPE solves the BGP egress engineering problem. Then we introduce the different Peering SID types and describe how the peering information, including the SIDs, is distributed in BGP-LS. We then illustrate how the peering information is inserted into the controller's database. We conclude with two EPE use-cases.
14.1 Introduction
The network in Figure 14‑1 consists of four Autonomous Systems (ASs), where each AS is identified
by its AS Number (ASN): AS1, AS2, AS3, and AS4.

AS1 has BGP peerings with AS2 (via egress Node3 and Node4) and with AS3 (via egress Node4).

Node1 and Node2 in AS1 are shown as ingress nodes, but they can be any router, server or Top-of-
Rack (ToR) switch in AS1.

A given AS not only advertises its own prefixes to its peer ASs, but may also propagate the prefixes
it has received from its own peer ASs. Therefore, a given destination prefix can be reached through
multiple peering ASs and over multiple peering links.

Destination prefixes 11.1.1.0/24 and 22.2.2.0/24 of Figure 14‑1 are located in AS4. AS4 advertises
these prefixes to its peer ASs AS2 and AS3, which propagate these prefixes further to AS1. AS1 not
only receives these prefixes from peers in different ASs, such as Node5 in AS2 and Node6 in AS3,
but also from different eBGP peers of the same AS, such as Node6 and Node7 in AS3.
Figure 14-1: Multi-AS network

Given that multiple paths to reach a particular prefix are available, it is the task of the routers and the
network operators to select the best path for each prefix. Proper engineering of the ways that traffic
exits the AS is crucial for cost efficiency and better end-user experience. For the AS4 prefixes in the
example, there are four ways to exit AS1: Node3→Node5, Node4→Node5, Node4→Node6, and
Node4→Node7.

BGP uses path selection rules to select one path, the so-called “BGP best-path”, to the destination
prefix at each BGP speaker. This is the path that BGP installs in the forwarding table. BGP’s best-
path selection typically involves routing policies and rule-sets that are specified by the network
operator to influence the best-path selection.

Using such routing policies and rule-sets on the different BGP speakers provides some level of
flexibility and control over how traffic leaves the AS. However, this technique is limited by the BGP
best-path selection mechanism and operates at a per-destination-prefix granularity.
Assume in Figure 14‑1 that AS1’s egress ASBR Node4 sets a higher local preference on the path of
prefix 11.1.1.0/24 received from AS3’s peer Node7. As a result, Node4 selects this path via Node7
as best-path and sends out all packets destined for 11.1.1.0/24 towards Node7, regardless of where
the packet comes from.

Besides, the ingress ASBRs Node1 and Node2 may be configured with a route-policy to select a BGP
best-path via egress ASBR Node3 or Node4 for a particular prefix, but they have no control over the
egress peering link that these egress ASBRs will use for that prefix. The egress peering link selection
is made solely by the egress ASBR based on its own configuration.

Assume that the routing policies and rule-sets are such that the BGP best-path from Node1 and Node2
to prefixes 11.1.1.0/24 and 22.2.2.0/24 is via Node3 and its peer Node5 in AS2, as indicated in
Figure 14‑1.
14.2 SR BGP Egress Peer Engineering (EPE)
Due to the limitations of the classic BGP egress engineering based on routing policies and rule-sets, a
finer grained and more consistent control on this egress path selection is desired.

The SR BGP Egress Peer Engineering (EPE) solution enables a centralized controller to instruct an
ingress Provider Edge (PE) router or a content source within the domain to use a specific egress
ASBR and a specific external interface or neighbor to reach a particular destination prefix. See RFC
8402 and draft-ietf-spring-segment-routing-central-epe.

To provide this functionality, an SR EPE capable node allocates one or more segments for each of its
connected BGP neighbors. These segments are called BGP Peering Segments or BGP Peering SIDs.

These Peering-SIDs provide to BGP a similar functionality to the IGP Adj-SID, by steering traffic to
a specific BGP peer or over a specific peering interface. They can thus be included in an SR Policy’s
segment list to express a source-routed inter-AS path. Different types of Peering-SIDs exist and are
discussed in section 14.3 of this chapter.

SR BGP EPE is used to engineer egress traffic to external peers but, by providing the means for SR-
TE paths to cross AS boundaries, it also enables unified end-to-end SR-TE paths in multi-AS
networks. SR BGP EPE is a main constituent of the integrated intra-domain and inter-domain SR-TE
solution.

In the remainder of this chapter we will abbreviate “SR BGP EPE” to “SR EPE” or even “EPE”.

Segment Routing is enabled in AS1 of Figure 14‑2. AS1’s ASBRs Node3 and Node4 respectively
advertise the IGP Prefix-SID 16003 and 16004. AS2, AS3, and AS4 are external to AS1 and no SR
support is assumed for these ASs.
Figure 14-2: SR BGP EPE illustration
Single-hop and multi-hop eBGP sessions
The standard assumption for an eBGP session is that the peers are directly connected such that they can establish a single-hop
session. Therefore, by default eBGP sessions are only allowed between directly connected neighbors. Single-hop BGP sessions are
established between the interface addresses of the peering nodes. By default, IOS XR BGP checks if the BGP peering addresses
are directly connected (the “connected-check”) and it sends all BGP packets with a TTL=1 to ensure that only single-hop sessions
can be established.

Note that “single-hop” is somewhat of a misnomer since it excludes BGP sessions between the loopback addresses of two directly
connected nodes. But one could argue that the loopback addresses are not connected in this case.

Sometimes eBGP sessions between non-directly-connected neighbors (“multi-hop”) are desired, e.g., to establish eBGP sessions
between route reflectors in different ASs (e.g., inter-AS option C in RFC 4364) or to enable load-balancing between multiple peering
links. Multi-hop BGP sessions are typically established between loopback addresses of the peering nodes.

There are two options in IOS XR to enable a multi-hop eBGP session for a peer: ebgp-multihop <tx ttl> and ignore-connected-
check.

ebgp-multihop <tx ttl> disables the connected-check to allow non-connected neighbors and sets the TTL of the transmitted BGP
packets to the configured <tx ttl> value. This option must be used if the BGP neighbor nodes are not directly connected.

If the BGP neighbor nodes are directly connected, but the BGP session is established between their loopbacks, then it is preferred to
configure ignore-connected-check to disable the connected-check, possibly in combination with ttl-security to only accept BGP
packets of directly connected nodes (see RFC 5082).

A controller is present in AS1.

The BGP best-path for AS4’s prefixes 11.1.1.0/24 and 22.2.2.0/24 is via Node3 and its peer Node5
in AS2, as indicated in Figure 14‑2.

The operator of AS1 wants to have full control of the egress traffic flows, not only on how these
traffic flows exit AS1, but also on how they travel within AS1 itself. The operator wants to have very
granular and programmable control of these flows, without dependencies on the other ASs or on BGP
routing policies. This steering control is provided by SR-TE and the SR EPE functionality.

SR EPE is enabled on all BGP peering sessions of Node3 and Node4. Node4 allocated a BGP
Peering SID label 50407 for its peering session to Node7 and label 50405 for its peering session to
Node5. Packets that arrive on Node4 with a top label 50407 or 50405 are steered towards Node7 or
Node5 respectively.
A traffic flow to destination 11.1.1.0/24, marked “F1” in Figure 14‑2, enters AS1 via Node1. The
operator chooses to steer this flow F1 via egress Node4 towards peer Node5. The operator
instantiates an SR Policy on Node1 for this purpose, with segment list <16004, 50405> where 16004
is the Prefix-SID of Node4 and 50405 the BGP peering SID for Node4’s peering to Node5. This SR
Policy steers traffic destined for 11.1.1.0/24 on the IGP shortest path to egress Node4 and then on the
BGP peering link to Node5. The path of flow F1 is indicated in the drawing.

At the same time another flow “F2” that enters AS1 in Node1 with destination 22.2.2.0/24, should be
steered on an intra-AS path via Node8 towards egress Node4 and then towards AS3’s peer Node7.
The operator initiates a second SR Policy on Node1 that imposes a segment list <16008, 16004,
50407> on packets destined for 22.2.2.0/24. This segment list brings the packets via the IGP shortest
path to Node8 (using Node8’s Prefix-SID 16008), then via the IGP shortest path to Node4 (Node4’s
Prefix-SID 16004), and finally towards peer Node7 (using BGP Peering SID 50407). The path of
flow F2 is indicated in the drawing.

Any other traffic flows still follow their default forwarding paths, IGP shortest-path and BGP best-
path.
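To make the composition of these segment lists concrete, the short Python sketch below (purely illustrative, not part of any router or controller software; the helper function and data structures are our own) assembles the label stacks for flows F1 and F2 from the Prefix-SIDs and Peering-SIDs of Figure 14‑2.

# Illustrative sketch only: SID values taken from Figure 14-2.
PREFIX_SID = {"Node4": 16004, "Node8": 16008}            # IGP Prefix-SIDs
PEERING_SID = {("Node4", "Node5"): 50405,                # BGP PeerNode-SIDs on Node4
               ("Node4", "Node7"): 50407}

def epe_segment_list(intra_domain_hops, egress, peer):
    """Prefix-SIDs towards (and including) the egress ASBR, then the Peering-SID to the peer."""
    return [PREFIX_SID[n] for n in intra_domain_hops] + [PEERING_SID[(egress, peer)]]

print(epe_segment_list(["Node4"], "Node4", "Node5"))           # flow F1: [16004, 50405]
print(epe_segment_list(["Node8", "Node4"], "Node4", "Node7"))  # flow F2: [16008, 16004, 50407]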

14.2.1 SR EPE Properties


The SR EPE solution does not require changing the existing BGP distribution mechanism in place nor
does it make any assumptions on the existing iBGP design (route reflectors (RRs), confederations, or
iBGP full meshes). The SR EPE functionality is only required at the EPE egress border router and the
EPE controller.

The solution does not require any changes on the remote peering node in the external AS. This remote
peering node is not involved in the SR EPE solution. Furthermore, the solution allows egress PEs to
set next-hop-self on eBGP-learned routes that they announce to their iBGP peers.

SR EPE provides an integrated intra-domain and inter-domain traffic engineering solution. The
solution accommodates using an SR Policy at an ingress PE or directly at a source host within the
domain to steer traffic within the local domain all the way to the external peer.

SR EPE enables using BGP peering links on an end-to-end SR-TE path in a multi-AS network,
without requiring the peering links to be distributed in the IGP, nor having to declare them as
“passive TE interfaces”.

Although egress engineering usually refers to steering traffic on eBGP peering sessions, SR EPE
could equally be applied to iBGP sessions. However, contrary to eBGP, iBGP sessions are typically
not established between directly connected peers along the data forwarding path. Therefore, the applicability of
SR EPE for iBGP peering sessions may be limited to only those deployments where the iBGP peer is
along the data forwarding path. At the time of writing, IOS XR only supports SR EPE for eBGP
neighbors.
14.3 Segment Types
Earlier in this chapter, we have explained that an SR EPE capable node allocates one or more
segments for each of its connected BGP neighbors. These segments are called BGP Peering Segments
or BGP Peering SIDs and they provide a functionality to BGP that is similar to the Adj-SID for IGP:
Peering-SIDs steer traffic to a specific peer or over a specific peering interface.

A peering interface is an interface on the shortest path to the peer. In case of a single-hop BGP
session (between neighboring interface addresses), there is a single peering interface. In case of a
multi-hop BGP session, there can be multiple peering interfaces: the interfaces along the set of
shortest paths to the BGP neighbor address.

Several types of Peering-SIDs enable the egress peering selection with various levels of granularity.
The Segment Routing Architecture specification (RFC8402) defines three types of BGP Peering SIDs:

PeerNode-SID: steer traffic to the BGP peer, load-balancing traffic flows over all available
peering interfaces

SR header operation: POP

Action: Forward on any peering interface

PeerAdjacency-SID: steer traffic to the BGP peer via the specified peering interface

SR header operation: POP

Action: Forward on the specific peering interface

PeerSet-SID: steer traffic via an arbitrary set of BGP peers or peering interfaces, load-balancing
traffic flows over all members of the set

SR header operation: POP

Action: Forward on any interface to the set of peers or peering interfaces

These three segment types can be local or global segments. Remember that a local segment only has
local significance on the node originating it, while a global segment has significance in the whole SR
domain. In the SR MPLS implementation in IOS XR, BGP Peering Segments are local segments.

An SR EPE node allocates a PeerNode-SID for each EPE-enabled BGP session.

For multi-hop BGP sessions, the SR EPE node also allocates a PeerAdj-SID for each peering
interface of that BGP session, in addition to the PeerNode-SID. Even if there is only a single peering
interface to the BGP multi-hop neighbor, a PeerAdj-SID is allocated for it.

For single-hop BGP sessions, no PeerAdj-SID is allocated since for this type of BGP sessions the
PeerNode-SID also provides the PeerAdj-SID functionality.

In addition, each peer and peering interface may be part of a set identified by a PeerSet-SID. For
example, when assigning the same PeerSet-SID to two peers and a peering interface to a third peer,
traffic to the PeerSet-SID is load-balanced over both peers and the peering interface. At the time of
writing, IOS XR does not support PeerSet-SIDs.

The SR EPE node can dynamically allocate local labels for its BGP Peering SIDs, or the operator can
explicitly assign local labels for this purpose. At the time of writing, IOS XR only supports
dynamically allocated BGP Peering SID labels.
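The allocation rules described above can be summarized in a small Python sketch (our own illustration, not IOS XR code): one PeerNode-SID per EPE-enabled session, plus one PeerAdj-SID per peering interface when the session is multi-hop. PeerSet-SIDs are omitted since, as noted, they are not supported in IOS XR at the time of writing.

from itertools import count

_label_pool = count(50400)   # hypothetical dynamically-allocated label range

def allocate_peering_sids(session):
    """session: dict with 'multihop' (bool) and 'peering_interfaces' (list of interface names)."""
    sids = {"PeerNode-SID": next(_label_pool)}
    if session["multihop"]:
        # One PeerAdj-SID per interface towards the multi-hop neighbor,
        # even if there is only a single peering interface.
        sids["PeerAdj-SIDs"] = {ifc: next(_label_pool) for ifc in session["peering_interfaces"]}
    return sids

print(allocate_peering_sids({"multihop": False, "peering_interfaces": ["Gi0/0/0/0"]}))
print(allocate_peering_sids({"multihop": True, "peering_interfaces": ["Gi0/0/0/2", "Gi0/0/0/3"]}))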

Figure 14‑3 illustrates the various types of BGP Peering SIDs on a border node. Node4 has three
eBGP sessions: a single-hop session to Node5 in AS2, another single-hop session to Node6 in AS3
and a multi-hop session to Node7 in AS3. The multi-hop BGP session is established between the
loopback addresses of Node4 and Node7; the session is transported over two peering interfaces,
represented by the two links between these nodes.
Figure 14-3: SR EPE Peering SID types

Assume that Node4 allocated the following BGP Peering SIDs:


BGP PeerNode-SID 50405 for the session to Node5 in AS2

BGP PeerNode-SID 50406 for the session to Node6 in AS3

BGP PeerNode-SID 50407 for the session to Node7 in AS3

BGP PeerAdj-SID 50417 for the top peering link between Node4 and Node7

BGP PeerAdj-SID 50427 for the bottom peering link between Node4 and Node7
14.4 Configuration
The BGP configuration on Node4 for the topology of Figure 14‑3, is shown in Example 14‑1.

Node4 is in AS1 and has three BGP sessions, two single-hop sessions to 99.4.5.5 (Node5) and
99.4.6.6 (Node6), and one multi-hop session (ignore-connected-check) to 1.1.1.7 (Node7). The
ignore-connected-check command disables the verification that the eBGP neighbor is directly connected
(the connected-check). Node4 sends the BGP packets to Node7 with TTL=1, which is fine since the nodes are directly
connected.

In this example, address-family ipv4 unicast is enabled for all neighbors, but other address-
families are also supported.

Since the neighbors are in a different AS, an ingress and egress route-policy must be applied.
Example 14-1: Node4 SR EPE configuration

route-policy bgp_in
pass
end-policy
!
route-policy bgp_out
pass
end-policy
!
router bgp 1
bgp router-id 1.1.1.4
address-family ipv4 unicast
!
neighbor 99.4.5.5
remote-as 2
egress-engineering
description # single-hop eBGP peer Node5
address-family ipv4 unicast
route-policy bgp_in in
route-policy bgp_out out
!
!
neighbor 99.4.6.6
remote-as 3
egress-engineering
description # single-hop eBGP peer Node6
address-family ipv4 unicast
route-policy bgp_in in
route-policy bgp_out out
!
!
neighbor 1.1.1.7
remote-as 3
egress-engineering
description # multi-hop eBGP peer Node7
update-source Loopback0
ignore-connected-check
address-family ipv4 unicast
route-policy bgp_in in
route-policy bgp_out out

SR EPE is enabled by configuring egress-engineering under the BGP neighbor. With this
configuration, the node automatically allocates, programs and advertises the Peering-SIDs to the
controller in BGP-LS, as we will see in the next section.

For the example configuration of Node4 in Example 14‑1, the PeerNode-SID for the session to Node5
is shown in Example 14‑2. Node4 has allocated label 50405 as PeerNode-SID for this BGP session.
First Hop address 99.4.5.5 is the interface address on Node5 for the link to Node4.
Example 14-2: Node4 PeerNode-SID for peer Node5

RP/0/0/CPU0:xrvr-4#show bgp egress-engineering


<...snip...>

Egress Engineering Peer Set: 99.4.5.5/32 (10b291a4)


Nexthop: 99.4.5.5
Version: 5, rn_version: 5
Flags: 0x00000006
Local ASN: 1
Remote ASN: 2
Local RID: 1.1.1.4
Remote RID: 1.1.1.5
First Hop: 99.4.5.5
NHID: 1
Label: 50405, Refcount: 3
rpc_set: 105cfd34

<...snip...>

The PeerNode-SID for the multi-hop eBGP session to Node7 is shown in Example 14‑3. Node4 has
allocated label 50407 as PeerNode-SID for this BGP session. Node4 uses two next-hops (First Hop
in the output) for this BGP PeerNode-SID: via the top link to Node7 (99.4.7.7) and via the bottom
link to Node7 (77.4.7.7). Traffic is load-balanced over these two next-hops. Addresses 99.4.7.7 and
77.4.7.7 are Node7’s interface addresses for its links to Node4.

Example 14-3: Node4 PeerNode-SID for peer Node7

<...snip...>

Egress Engineering Peer Set: 1.1.1.7/32 (10b48fec)


Nexthop: 1.1.1.7
Version: 2, rn_version: 2
Flags: 0x00000006
Local ASN: 1
Remote ASN: 3
Local RID: 1.1.1.4
Remote RID: 1.1.1.7
First Hop: 99.4.7.7, 77.4.7.7
NHID: 0, 0
Label: 50407, Refcount: 3
rpc_set: 10c34c24

<...snip...>

One or more PeerAdj-SIDs are allocated for a multi-hop BGP session. The PeerAdj-SIDs for
Node4’s multi-hop eBGP session to Node7 are shown in Example 14‑4. Node4 has two equal cost
paths to the peer Node7: via the two links to Node7. Node4 has allocated labels 50417 and 50427 as
PeerAdj-SIDs for these two peering links. Hence, traffic that arrives on Node4 with top label 50417
will be forwarded on the top peering link, using next-hop (First Hop) 99.4.7.7.
Example 14-4: Node4 PeerAdj-SIDs for peer Node7

<...snip...>

Egress Engineering Peer Set: 99.4.7.7/32 (10d92234)


Nexthop: 99.4.7.7
Version: 3, rn_version: 5
Flags: 0x0000000a
Local ASN: 1
Remote ASN: 3
Local RID: 1.1.1.4
Remote RID: 1.1.1.7
First Hop: 99.4.7.7
NHID: 2
Label: 50417, Refcount: 3
rpc_set: 10e37684

Egress Engineering Peer Set: 77.4.7.7/32 (10c931f0)


Nexthop: 77.4.7.7
Version: 4, rn_version: 5
Flags: 0x0000000a
Local ASN: 1
Remote ASN: 3
Local RID: 1.1.1.4
Remote RID: 1.1.1.7
First Hop: 77.4.7.7
NHID: 4
Label: 50427, Refcount: 3
rpc_set: 10e58fa4

<...snip...>

BGP on Node4 allocates the Peering-SIDs and installs them in the forwarding table, as shown in the
MPLS forwarding table displayed in Example 14‑5. All the Peering SID forwarding entries pop the
label and forward on the appropriate interface. The traffic to the PeerNode-SID 50407 of the eBGP
multi-hop session to Node7 is load-balanced over the two links to Node7.

Example 14-5: SR EPE Peering-SID MPLS forwarding entries on Node4

RP/0/0/CPU0:xrvr-4#show mpls forwarding


Local Outgoing Prefix Outgoing Next Hop Bytes
Label Label or ID Interface Switched
------ ----------- ------------------ ------------ ------------- --------
50405 Pop No ID Gi0/0/0/0 99.4.5.5 0
50406 Pop No ID Gi0/0/0/1 99.4.6.6 0
50407 Pop No ID Gi0/0/0/2 99.4.7.7 0
Pop No ID Gi0/0/0/3 77.4.7.7 0
50417 Pop No ID Gi0/0/0/2 99.4.7.7 0
50427 Pop No ID Gi0/0/0/3 77.4.7.7 0
14.5 Distribution of SR EPE Information in BGP-LS
An SR EPE enabled border node allocates BGP Peering-SIDs and programs them in its forwarding
table. It also advertises these BGP Peering-SIDs in BGP-LS such that a controller can learn and use
this EPE information.

As explained in the previous chapters (chapter 12, "SR-TE Database" and chapter 13, "SR PCE"),
BGP-LS is the method of choice to feed network information to a controller. In the context of SR EPE,
the use of BGP-LS makes the advertisement of the BGP Peering SIDs completely independent from
the advertisement of any forwarding and reachability information in BGP. Indeed, BGP does not
directly use the BGP Peering SIDs, but simply conveys this information to the controller (or the
operator) that can leverage it in SR Policies.

In case RRs are used to scale BGP-LS distribution, the controller can then tap into one of them to
learn the BGP Peering SIDs from all SR EPE enabled border nodes, in addition to the other BGP-LS
information.

The SR EPE-enabled BGP peerings are represented as Link objects in the BGP-LS database and
advertised using Link-type BGP-LS NLRIs.

The controller inserts the SR EPE-enabled BGP peerings as links in the SR-TE DB. More details of
the SR-TE DB and the SR EPE peering entries in SR-TE DB are available in chapter 12, "SR-TE
Database".

Figure 14‑4 shows the format of a BGP-LS Link-type NLRI, as specified in BGP-LS RFC 7752.
Figure 14-4: BGP-LS Link-type NLRI format

The fields in the BGP-LS Link-type NLRI are:

Protocol-ID: Identifies the source protocol of this NLRI

Identifier: identifier of the “routing universe” that this link belongs to, commonly associated with
an instance-id in the IGP world

Local Node Descriptors: set of properties that uniquely identify the local node of the link

Remote Node Descriptors: set of properties that uniquely identify the remote node of the link

Link Descriptors: set of properties that uniquely identifies a link between the two anchor nodes

BGP-LS is extended by IETF draft-ietf-idr-bgpls-segment-routing-epe to support advertisement of
BGP Peering SIDs.

The Protocol-ID field in the NLRI for the SR EPE Peerings has value 7 “BGP” to indicate that BGP
is the source protocol of this NLRI. The Identifier field has value 0 in IOS XR, identifying the Default
Layer 3 Routing topology.

The possible values of the other fields in the Link-type NLRI (Local Node, Remote Node, and Link
Descriptors) as used for SR EPE peering advertisements are described in chapter 17, "BGP-LS".
The BGP Peering SID is included in the BGP-LS Attribute that is advertised with the EPE NLRI. It is
specified in a BGP Peering SID TLV, as described in the next section.

14.5.1 BGP Peering SID TLV


The format of the BGP Peering SID TLV is displayed in Figure 14‑5. This TLV contains the Peering-
SID and is advertised with the associated EPE NLRI in the BGP-LS Attribute.

Figure 14-5: Peering SID TLV format

The fields in the BGP Peering SID TLV are:

Type: PeerNode SID (1101); PeerAdj SID (1102); PeerSet SID (1103)

Length: length of TLV

Flags:

V-Flag: Value flag – If set, then the SID carries a value; set in current IOS XR releases.

L-Flag: Local flag – If set, then the value/index carried by the SID has local significance; set in
current IOS XR releases.
B-Flag: Backup flag – If set, then the SID refers to a path that is eligible for protection; unset in
current IOS XR releases.

P-Flag: Persistent flag – If set, then the SID is persistently allocated, i.e., the SID value remains
consistent across router restart and session/interface flap; unset in current IOS XR releases.

Weight: sets the distribution ratio for Weighted ECMP load-balancing; always 0 in current IOS XR
releases.

SID/Label/Index: According to the TLV length and to the Value (V) and Local (L) flags settings
(see the decoding sketch after this list), it contains either:

3-octet local label where the 20 rightmost bits are used for encoding the label value. The V and
L flags are set.

4-octet index defining the offset in the SRGB (Segment Routing Global Block) advertised by this
node. In this case, the SRGB is advertised using the extensions defined in draft-ietf-idr-bgp-ls-
segment-routing-ext.
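As an illustration of these encoding rules, the Python fragment below sketches how a receiver could interpret the Flags octet and the SID/Label/Index field. It is not taken from any BGP-LS implementation; in particular, the assumption that the four flags occupy the most significant bits of the Flags octet in the order listed above (V, L, B, P) is ours.

def decode_peering_sid_value(flags, sid_field):
    """flags: the Flags octet; sid_field: the bytes of the SID/Label/Index field."""
    v_flag = bool(flags & 0x80)   # Value flag (assumed bit order: V, L, B, P from the MSB)
    l_flag = bool(flags & 0x40)   # Local flag
    if len(sid_field) == 3 and v_flag and l_flag:
        # 3-octet field: the 20 rightmost bits carry a local MPLS label value
        return ("label", int.from_bytes(sid_field, "big") & 0xFFFFF)
    if len(sid_field) == 4:
        # 4-octet field: an index into the SRGB advertised by the node
        return ("index", int.from_bytes(sid_field, "big"))
    raise ValueError("unexpected SID/Label/Index field length")

# PeerNode-SID label 50405 encoded as a 3-octet field, with the V and L flags set (0xC0):
print(decode_peering_sid_value(0xC0, (50405).to_bytes(3, "big")))   # ('label', 50405)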

14.5.2 Single-hop BGP Session


Figure 14‑6 illustrates a single-hop eBGP session between Node4 in AS1 and Node5 in AS2.

Figure 14-6: SR EPE for single-hop BGP session


SR EPE is enabled on Node4 for this BGP session, using the configuration of Node4 as shown in
Example 14‑6. By default, eBGP sessions in IOS XR require ingress and egress route-policies that
specify which prefixes to accept and advertise. These are the bgp_in and bgp_out route-policies in
the example.

Example 14-6: BGP SR EPE on Single-hop BGP session – Node4’s configuration

route-policy bgp_in
pass
end-policy
!
route-policy bgp_out
pass
end-policy
!
router bgp 1
bgp router-id 1.1.1.4
address-family ipv4 unicast
!
neighbor 99.4.5.5
remote-as 2
egress-engineering
description # single-hop eBGP peer Node5 #
address-family ipv4 unicast
route-policy bgp_in in
route-policy bgp_out out

Node4 advertises the BGP Peering SIDs to the controller (1.1.1.10) via a BGP-LS BGP session.

The configuration of this BGP-LS session to 1.1.1.10 on Node4 is shown in Example 14‑7.

Example 14-7: Node4’s BGP-LS session to Controller

router bgp 1
address-family link-state link-state
!
neighbor 1.1.1.10
remote-as 1
update-source Loopback0
description iBGP to Controller
address-family link-state link-state

Node4 has allocated PeerNode-SID 50405 for the session to Node5 (also see the output in
Example 14‑2). Since it is a single-hop BGP session, no PeerAdj-SIDs are allocated for this session.

The controller receives the BGP-LS advertisement for this PeerNode-SID, as shown in
Example 14‑8. The long and cryptic command used to display the advertisement consists mainly of
the BGP-LS NLRI of the BGP PeerNode-SID in string format, specified as an argument of the
show command.
The output of show bgp link-state link-state (without specifying an NLRI) shows a legend to
interpret the NLRI fields, as is explained in chapter 17, "BGP-LS".

Refer to the format of the BGP-LS Link-type NLRI shown in Figure 14‑4. In the NLRI, we recognize
the Local and Remote Node Descriptors, which identify the local and remote anchor nodes Node4
and Node5. These anchor nodes are identified by their ASN and BGP router-id. The Link Descriptor
contains the local and remote interface addresses of this single-hop BGP session. The only entry in
the Link-state Attribute, shown at the bottom of the output, is the BGP PeerNode-SID 50405.

Example 14-8: BGP PeerNode-SID BGP table entry

RP/0/0/CPU0:xrvr-10#show bgp link-state link-state [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c2][b0.0.0.0]


[q1.1.1.5]][L[i99.4.5.4][n99.4.5.5]]/664 detail
BGP routing table entry for [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c2][b0.0.0.0][q1.1.1.5]][L[i99.4.5.4]
[n99.4.5.5]]/664
NLRI Type: Link
Protocol: BGP
Identifier: 0x0
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
BGP Router Identifier: 1.1.1.4
Remote Node Descriptor:
AS Number: 2
BGP Identifier: 0.0.0.0
BGP Router Identifier: 1.1.1.5
Link Descriptor:
Local Interface Address IPv4: 99.4.5.4
Neighbor Interface Address IPv4: 99.4.5.5

Versions:
Process bRIB/RIB SendTblVer
Speaker 3 3
Flags: 0x00000001+0x00000000;
Last Modified: Feb 7 16:39:56.876 for 00:19:02
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Flags: 0x4000000001060005, import: 0x20
Not advertised to any peer
Local
1.1.1.4 (metric 10) from 1.1.1.4 (1.1.1.4)
Origin IGP, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 3
Link-state: Peer-SID: 50405

The SR PCE receives the BGP-LS information and inserts the BGP Peering-SID information in its
SR-TE DB. This is shown in the output of Example 14‑9.

Node4 is presented as Node 1 in the SR-TE DB.


The BGP session on Node4 to Node5 is represented as a (uni-directional) link Link[0], anchored to
local Node4 and remote Node5. The controller created entries in the SR-TE DB for these two anchor
nodes, identified by their ASN and BGP router-id: Node4 (AS1, 1.1.1.4) and Node5 (AS2, 1.1.1.5).
The PeerNode-SID is shown as Adj SID (epe) in the SR-TE DB. At the time of writing, the other
link attributes (metrics, affinity, etc.) are not advertised for EPE links in IOS XR; therefore, the link
metrics are set to zero.

The remote Node5 is also inserted in the SR-TE DB as Node 2. Since Node5 itself does not
advertise any BGP Peering-SIDs in this example, no other information is available for this node
besides its ASN and BGP router-id (AS2, 1.1.1.5).

Example 14-9: BGP PeerNode-SID SR-TE DB entry

RP/0/0/CPU0:xrvr-10#show pce ipv4 topology

PCE's topology database - detail:


---------------------------------
Node 1
BGP router ID: 1.1.1.4 ASN: 1

Link[0]: local address 99.4.5.4, remote address 99.4.5.5


Local node:
BGP router ID: 1.1.1.4 ASN: 1
Remote node:
BGP router ID: 1.1.1.5 ASN: 2
Metric: IGP 0, TE 0, delay 0
Adj SID: 50405 (epe)

Node 2
BGP router ID: 1.1.1.5 ASN: 2

14.5.3 Multi-hop BGP Session


Figure 14‑7 illustrates a multi-hop eBGP session between Node4 in AS1 and Node7 in AS3.
Figure 14-7: SR EPE for multi-hop BGP session

BGP EPE is enabled on the BGP session of Node4 to Node7, as shown in the configuration of
Example 14‑10. Node4 and Node7 are adjacent. To allow the multi-hop BGP session between the
loopback interfaces of these nodes, it is sufficient to disable the connected check (ignore-
connected-check) in BGP.

Example 14-10: SR BGP EPE for multi-hop eBGP session – Node4’s configuration

route-policy bgp_in
pass
end-policy
!
route-policy bgp_out
pass
end-policy
!
router bgp 1
bgp router-id 1.1.1.4
address-family ipv4 unicast
neighbor 1.1.1.7
remote-as 3
ignore-connected-check
egress-engineering
description # multi-hop eBGP peer Node7 #
update-source Loopback0
address-family ipv4 unicast
route-policy bgp_in in
route-policy bgp_out out

The PeerNode-SID for this BGP session is 50407 and the PeerAdj-SIDs are 50417 and 50427 (also
see the output in Example 14‑3 and Example 14‑4). These SIDs are also shown in Figure 14‑7.
The BGP PeerNode-SID BGP-LS advertisement as received by the SR PCE is shown in
Example 14‑11.

The PeerNode entry is advertised as a link using a BGP-LS Link-type NLRI. This “link” is anchored
to its local node (AS1, 1.1.1.4) and remote node (AS3, 1.1.1.7), as shown in the Local and Remote
Node Descriptors in Example 14‑11. The Link Descriptor contains the local and remote BGP session
addresses, Node4’s BGP router-id 1.1.1.4 and Node7’s BGP router-id 1.1.1.7 in this example. The
only entry in the Link-state Attribute attached to this NLRI is the BGP PeerNode-SID 50407, as
shown at the bottom of the output.

Example 14-11: BGP PeerNode-SID BGP table entry for multi-hop session

RP/0/0/CPU0:xrvr-10#show bgp link-state link-state [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0]


[q1.1.1.7]][L[i1.1.1.4][n1.1.1.7]]/664 detail
BGP routing table entry for [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0][q1.1.1.7]][L[i1.1.1.4]
[n1.1.1.7]]/664
NLRI Type: Link
Protocol: BGP
Identifier: 0x0
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
BGP Router Identifier: 1.1.1.4
Remote Node Descriptor:
AS Number: 3
BGP Identifier: 0.0.0.0
BGP Router Identifier: 1.1.1.7
Link Descriptor:
Local Interface Address IPv4: 1.1.1.4
Neighbor Interface Address IPv4: 1.1.1.7

Versions:
Process bRIB/RIB SendTblVer
Speaker 16 16
Flags: 0x00000001+0x00000200;
Last Modified: Feb 7 17:24:22.876 for 00:01:08
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Flags: 0x4000000001060005, import: 0x20
Not advertised to any peer
Local
1.1.1.4 (metric 10) from 1.1.1.4 (1.1.1.4)
Origin IGP, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 16
Link-state: Peer-SID: 50407

For a multi-hop BGP session, PeerAdj-SIDs are also allocated for each link used in the BGP session.
In this example there are two peering links, so two PeerAdj-SIDs are allocated. Each PeerAdj-SID is
advertised in a BGP-LS Link-type NLRI.
One of the two PeerAdj-SID BGP-LS advertisements as received by the controller is shown in
Example 14‑12. The two anchor nodes Node4 and Node7 are specified in the Local and Remote
Node Descriptors. To distinguish this link from the other parallel link, the local (99.4.7.4) and remote
(99.4.7.7) interface addresses are included in the Link Descriptor.

The only entry in the Link-state Attribute attached to this NLRI is the BGP PeerAdj-SID 50427,
shown at the bottom of the output.

Example 14-12: BGP PeerAdj-SID BGP table entry for multi-hop session

RP/0/0/CPU0:xrvr-10#show bgp link-state link-state [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0]


[q1.1.1.7]][L[i99.4.7.4][n99.4.7.7]]/664 detail
BGP routing table entry for [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0][q1.1.1.7]][L[i99.4.7.4]
[n99.4.7.7]]/664
NLRI Type: Link
Protocol: BGP
Identifier: 0x0
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
BGP Router Identifier: 1.1.1.4
Remote Node Descriptor:
AS Number: 3
BGP Identifier: 0.0.0.0
BGP Router Identifier: 1.1.1.7
Link Descriptor:
Local Interface Address IPv4: 99.4.7.4
Neighbor Interface Address IPv4: 99.4.7.7

Versions:
Process bRIB/RIB SendTblVer
Speaker 18 18
Flags: 0x00000001+0x00000200;
Last Modified: Feb 7 17:24:22.876 for 00:02:41
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Flags: 0x4000000001060005, import: 0x20
Not advertised to any peer
Local
1.1.1.4 (metric 10) from 1.1.1.4 (1.1.1.4)
Origin IGP, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 18
Link-state: Peer-Adj-SID: 50427

The SR PCE inserts the BGP Peering-SID information in its SR-TE DB. The entry in the SR PCE’s
database is shown in Example 14‑13.

Three Peering link entries are present in the SR-TE DB entry of Node4: Link[0] is the PeerNode
entry, Link[1] is the PeerAdj entry for the top link to Node7, and Link[2] is the PeerAdj entry for
the bottom link. The two PeerAdj entries can be distinguished from each other by their local and
remote addresses.

The controller created a node entry for Node7. Since Node7 does not advertise any BGP Peering-
SIDs in this example, the controller does not have any other information about this node besides its
ASN and BGP router-id (AS3, 1.1.1.7).

Example 14-13: BGP PeerNode-SID SR-TE DB entry

RP/0/0/CPU0:xrvr-10#show pce ipv4 topology

PCE's topology database - detail:


---------------------------------
Node 1
BGP router ID: 1.1.1.4 ASN: 1

Link[0]: local address 1.1.1.4, remote address 1.1.1.7


Local node:
BGP router ID: 1.1.1.4 ASN: 1
Remote node:
BGP router ID: 1.1.1.7 ASN: 3
Metric: IGP 0, TE 0
Bandwidth: Total 0 Bps, Reservable 0 Bps
Adj SID: 50407 (epe)

Link[1]: local address 99.4.7.4, remote address 99.4.7.7


Local node:
BGP router ID: 1.1.1.4 ASN: 1
Remote node:
BGP router ID: 1.1.1.7 ASN: 3
Metric: IGP 0, TE 0
Bandwidth: Total 0 Bps, Reservable 0 Bps
Adj SID: 50417 (epe)

Link[2]: local address 77.4.7.4, remote address 77.4.7.7


Local node:
BGP router ID: 1.1.1.4 ASN: 1
Remote node:
BGP router ID: 1.1.1.7 ASN: 3
Metric: IGP 0, TE 0
Bandwidth: Total 0 Bps, Reservable 0 Bps
Adj SID: 50427 (epe)

Node 2
BGP router ID: 1.1.1.7 ASN: 3
14.6 Use-Cases
14.6.1 SR Policy Using Peering-SID
Assume in Figure 14‑8 that egress nodes Node3 and Node4 have next-hop-self configured. This
means that they advertise the BGP prefixes setting themselves as BGP nexthop. Both egress nodes
receive the AS4 BGP routes 11.1.1.0/24 and 22.2.2.0/24 from their neighbor AS.

Node1 receives the routes as advertised by Node3 with nexthop 1.1.1.3, and as advertised by Node4
with nexthop 1.1.1.4. Ingress Node1 selects the routes via Node3 as BGP best-path. The BGP best-
path on Node1 for both routes thus goes via egress Node3 to its peer Node5, as indicated in the
illustration.

Figure 14-8: SR Policies using Peering-SIDs

Instead of using the BGP best-path, the operator wants to steer 11.1.1.0/24 via the IGP shortest path to
egress Node4 and then to its peer Node5. The prefix 22.2.2.0/24 must be steered via Node8 towards
egress Node4 and then to its peer Node7.
For this purpose, the operator instantiates two SR Policies on Node1, both with endpoint Node4
(1.1.1.4), one with color 10 and the other with color 20.

The segment list of the SR Policy with color 10 is <16004, 50405>, where 16004 is the Prefix-SID of
Node4 and 50405 is the PeerNode-SID of Node4 to peer Node5. The segment list of the SR Policy
with color 20 is <16008, 16004, 50407>, with 16008 and 16004 the Prefix-SIDs of Node8 and
Node4 respectively, and 50407 the PeerNode-SID of Node4 to peer Node7.

To steer the two service routes on their desired SR Policies, Node1 changes their nexthops to 1.1.1.4
(Node4) and tags each with the color of the appropriate SR Policy. 11.1.1.0/24 gets color 10,
22.2.2.0/24 gets color 20. For this, Node1 applies an ingress route-policy on its BGP session as
explained in chapter 10, "Further Details on Automated Steering".

BGP on Node1 uses Automated Steering functionality to steer each of these two service routes on the
SR Policy matching its color and nexthop.

As a result, Node1 imposes segment list <16004, 50405> on packets destined for prefix 11.1.1.0/24,
and segment list <16008, 16004, 50407> on packets destined for prefix 22.2.2.0/24.
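Conceptually, the Automated Steering decision on Node1 is a lookup of each service route's (color, BGP nexthop) pair against the locally instantiated SR Policies. The Python sketch below (hypothetical data structures, populated with the values of this example) illustrates that matching logic.

# SR Policies instantiated on Node1, keyed by (color, endpoint) -- values from Figure 14-8.
sr_policies = {
    (10, "1.1.1.4"): [16004, 50405],
    (20, "1.1.1.4"): [16008, 16004, 50407],
}

# Service routes after the ingress route-policy has set nexthop 1.1.1.4 and a color.
service_routes = {
    "11.1.1.0/24": {"color": 10, "nexthop": "1.1.1.4"},
    "22.2.2.0/24": {"color": 20, "nexthop": "1.1.1.4"},
}

for prefix, attrs in service_routes.items():
    segment_list = sr_policies.get((attrs["color"], attrs["nexthop"]))
    print(prefix, "-> imposed segment list:", segment_list)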

14.6.2 SR EPE for Inter-Domain SR Policy Paths


In a multi-domain network, with multiple ASs and BGP peering links between the ASs, SR BGP EPE
can be used to transport the SR traffic across the BGP peering links.

When SR EPE is enabled on the BGP peering sessions, the ASBRs advertise these BGP sessions and
their BGP Peering SIDs in BGP-LS. The SR PCE receives this information, together with all the other
topology information of the network. The SR PCE consolidates all the topology information to form a
single network graph. The SR EPE enabled peering links are included in this network graph as links.

The controller can then compute the required end-to-end SR Policy paths using that consolidated
network graph. When an SR Policy path traverses a BGP peering link, this peering link’s Peering SID
is included in the SR Policy’s segment list.

More details are provided in chapter 12, "SR-TE Database".

MPLS Forwarding on Peering Link


To provide SR-MPLS inter-domain connectivity, labeled packets must be able to traverse the peering
link. This means that MPLS forwarding must be enabled on this link. Otherwise, labeled packets that
are switched on this interface will be dropped.

By default, MPLS forwarding is not enabled on a peering link, even if egress-engineering is
enabled on the BGP session.

Use the command illustrated in Example 14‑14 to list the interfaces with MPLS forwarding enabled.
The two interfaces in the output have MPLS forwarding enabled, as indicated in the last column. Any
unlisted interface does not have MPLS forwarding enabled.

Example 14-14: Show MPLS-enabled interfaces

RP/0/0/CPU0:xrvr-4#show mpls interfaces


Interface LDP Tunnel Static Enabled
-------------------------- -------- -------- -------- --------
GigabitEthernet0/0/0/0 No No No Yes
GigabitEthernet0/0/0/1 No No No Yes

MPLS forwarding is automatically enabled on the peering link when enabling a labeled address-
family (such as address-family ipv4 labeled-unicast) on the BGP session.

If no labeled address-family is required, enable MPLS forwarding on the interface by configuring it
under mpls static, as illustrated in Example 14‑15. This command enables MPLS forwarding on
interface Gi0/0/0/2.

Example 14-15: Enable MPLS forwarding using mpls static configuration

mpls static
interface GigabitEthernet0/0/0/2
14.7 Summary
Egress Peer Engineering enables a centralized controller to instruct an ingress node to steer traffic via
a specific egress node to a specific BGP peer or peering link. SR EPE peering SIDs steer traffic to a
specific BGP peer or peering link.

The different types of peering-SIDs are PeerNode-SIDs, PeerAdj-SIDs and PeerSet-SIDs. The
Peering-SID information is advertised in BGP-LS. This way, a controller can insert the information in
its SR-TE DB and use it for path computations.

SR EPE peering-SIDs can be used in an SR Policy’s segment list to steer traffic via a specific egress
node and a specific peering link to an external domain. SR EPE Peering-SIDs can also be used in the
segment list of an end-to-end inter-domain SR-TE path to cross the inter-domain BGP peerings.
14.8 References
[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information
Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752,
March 2016

[RFC8402] "Segment Routing Architecture", Clarence Filsfils, Stefano Previdi, Les Ginsberg,
Bruno Decraene, Stephane Litkowski, Rob Shakir, RFC8402, July 2018

[RFC5082] "The Generalized TTL Security Mechanism (GTSM)", Carlos Pignataro, Pekka
Savola, David Meyer, Vijay Gill, John Heasley, RFC5082, October 2007

[draft-ietf-idr-bgpls-segment-routing-epe] "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Keyur Patel, Saikat Ray, Jie Dong, draft-ietf-idr-bgpls-segment-routing-epe-18 (Work in Progress), March 2019

[draft-ketant-idr-bgp-ls-bgp-only-fabric] "BGP Link-State Extensions for BGP-only Fabric", Ketan Talaulikar, Clarence Filsfils, Krishnaswamy Ananthamurthy, Shawn Zandi, Gaurav Dawra, Muhammad Durrani, draft-ketant-idr-bgp-ls-bgp-only-fabric-02 (Work in Progress), March 2019

[draft-ietf-spring-segment-routing-central-epe] "Segment Routing Centralized BGP Egress Peer Engineering", Clarence Filsfils, Stefano Previdi, Gaurav Dawra, Ebben Aries, Dmitry Afanasiev, draft-ietf-spring-segment-routing-central-epe-10 (Work in Progress), December 2017
15 Performance Monitoring – Link Delay
What you will learn in this chapter:

The Performance Measurement (PM) framework enables measurement of various characteristics
(delay, loss, and consistency) for different network elements (link, SR Policy, node).

The PM functionality enables the dynamic measurement of link delays.

Link delay is measured using standard query-response messaging, in either one-way or two-way
mode.

Measured link delay information is advertised in the IGP and in BGP-LS.

The minimum delay measured over a time period represents the propagation delay of the link. This is
a stable metric used by SR-TE and Flex-Algo to steer traffic along a low-delay path across the network.

To reduce the flooding churn on the network, link delay metrics are only flooded when the
minimum delay changes significantly.

Detailed delay measurement reports are streamed via telemetry, for history and analysis.

After introducing the generic Performance Measurement framework, we provide a brief overview of
the components of link delay, highlighting why the minimum value is so important for SR-TE. We
explain in detail how the link delays are measured in the network, how to enable these measurements
in IOS XR and how to configure the functionality to meet the operator’s needs. We then explain how
the link delay information is advertised by the IGP, BGP-LS and telemetry, showing in particular how
variations in the link delay are reflected in the routing protocol advertisements. Finally, we recall
how the delay information is leveraged by SR-TE.
15.1 Performance Measurement Framework
In certain networks, network performance data such as packet loss, delay and delay variation (jitter),
as well as bandwidth utilization, is a critical measure for Traffic Engineering (TE). Such performance
data provides operators with the characteristics of their networks for performance evaluation, which is
required to ensure that Service Level Agreements (SLAs) are met.

The Performance Measurement (PM) functionality provides a generic framework to enable
dynamically measuring various characteristics (delay, loss, and consistency) for different network
elements (link, SR Policy, node). The focus of this chapter is the link delay measurement. The
description of other PM functionalities, such as measuring delay and loss of SR Policies and Data
Plane Monitoring1 (DPM), is deferred to a later revision of this book.

The PM functionalities can be employed to measure actual performance metrics and react
accordingly. For example, SR-TE leverages link delay measurements to maintain dynamic low-delay
paths, updating them as needed to always provide the lowest possible delay.

Before link delay measurement was available, the TE metric was often used to express a static link
delay value, and SR-TE computed delay optimized paths by minimizing the accumulated TE metric
along the path. However, these low-delay paths could not be adapted to varying link delays, such as
following a reroute in the underlay optical network, since TE metric values were statically
configured.
15.2 The Components of Link Delay
The Extended TE Link Delay Metrics are flooded in ISIS (RFC 7810), OSPF (RFC 7471), as well
as BGP-LS (draft-ietf-idr-te-pm-bgp).

The following Extended TE Link Delay Metrics are flooded:

Unidirectional Link Delay

Unidirectional Minimum/Maximum Link Delay

Unidirectional Delay Variation

Section 15.4 in this chapter covers the details of the link delay metrics advertisement in the IGP and
BGP-LS.

An Insight Into the Physical Topology

Of all these link delay statistics, only the minimum values are leveraged by SR-TE for dynamic path
computation. The minimum link delay measured over a period of time is the most representative of the
transmission or propagation delay of a link. This is a stable value that only depends on the underlying
optical circuit and is independent of the traffic traversing the link. If the physical topology does not
change, a constant fiber length over a constant speed of light provides a constant propagation delay2.

A significant modification of the minimum link delay thus indicates a change in the optical circuit,
with an impact on the traversed fiber length. For example, a fiber cut followed by a reroute of the
optical circuit is reflected in the measured minimum link delay.
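To put rough numbers on this (our own back-of-the-envelope figures, not taken from the examples in this book): light propagates in optical fiber at roughly 200,000 km/s, i.e., about 5 µs of one-way propagation delay per kilometer of fiber, so an optical reroute that adds 100 km of fiber adds on the order of 500 µs to the measured minimum delay.

SPEED_OF_LIGHT_IN_FIBER_KM_PER_S = 200_000   # approximate, refractive index ~1.5

def propagation_delay_us(fiber_length_km):
    """Approximate one-way propagation delay, in microseconds, for a given fiber length."""
    return fiber_length_km / SPEED_OF_LIGHT_IN_FIBER_KM_PER_S * 1e6

print(propagation_delay_us(100))    # ~500 us for 100 km of fiber
print(propagation_delay_us(1000))   # ~5000 us (5 ms) for 1000 km of fiber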

In practice, such optical network topology changes often occur without the IP team being aware,
although they may have a significant impact on the service quality perceived by the user. The link
delay measurement functionality provides the means to automatically respond to these changes by
moving sensitive traffic to alternative, lower-delay paths.

QoS-Controlled Variations

The other link delay statistics — average, maximum and variance — are affected by the traffic
flowing through the link and can thus be highly variable at any time scale. These values do not reflect
the forwarding path of the traffic but the QoS policy (e.g., queueing, tail-drop) configured by the
operator for that class of traffic. For that reason, they are not suitable as a routing metric.

Furthermore, rerouting traffic as a reaction to a change in the packet scheduling delays would cause
network instability and considerably deteriorate the perceived service quality.

An operator willing to reduce the maximum delay or jitter (delay variation) along an SR Policy
should consider modifying the QoS policy applied to the traffic steered into that SR Policy.
15.3 Measuring Link Delay
The measurement method uses (hardware) timestamping in Query and Response packets, as defined in
RFC 6374 “Packet Loss and Delay Measurement for MPLS Networks”, RFC 4656 “One-way Active
Measurement Protocol (OWAMP)”, and RFC 5357 “Two-Way Active Measurement Protocol
(TWAMP)”.

To measure the link delay at local Node1 over a link to remote Node2, the following steps are
followed, as illustrated in Figure 15‑1:

Figure 15-1: Delay Measurement method

1. Local Node1 (querier) sends a Delay Measurement (DM) Query packet to remote Node2
(responder). The egress Line Card (LC) on Node1 timestamps the packet just before sending it
on the wire (T1).

2. Remote Node2’s ingress LC timestamps the packet as soon as it receives it from the wire (T2).

3. Remote Node2 reflects the DM packet to the querier with the timestamps in the DM Response
packet. The remote Node2 timestamps (optional for one-way measurement) the packet just before
sending it over the wire (T3).
4. Local Node1 timestamps (optional for one-way measurement) the packet as soon as it receives
the packet from the wire (T4).

Each node measures the link delay independently. Therefore, Node1 and Node2 both send Query
messages.

A querier sends the Query messages at its own initiative; a responder replies to the received Query
messages with Response messages.

The link delay is computed using the timestamps in the DM Response packet. For one-way
measurements, the delay is computed as T2 – T1. For two-way measurements, the one-way delay is
computed as half of the two-way delay: ((T2 – T1) + (T4 – T3))/2 = ((T4 – T1) – (T3 – T2))/2.
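These computations can be captured in a few lines of Python (a sketch using the T1-T4 variables defined above, not an excerpt from any implementation). Note how, in the two-way case, only timestamps taken by the same node are subtracted, which is why clock synchronization is not needed there.

def one_way_delay(t1, t2):
    """One-way mode: T1 stamped by the querier at transmit, T2 by the responder at receipt.
    Requires the two clocks to be synchronized (e.g., using PTP)."""
    return t2 - t1

def one_way_delay_from_two_way(t1, t2, t3, t4):
    """Two-way mode: halve the round-trip time, excluding the responder's processing time."""
    return ((t4 - t1) - (t3 - t2)) / 2

# Arbitrary example values (seconds); the responder's clock is offset by 100 s to show
# that the two-way computation is insensitive to clock offsets.
print(one_way_delay_from_two_way(t1=100.000000, t2=200.000150, t3=200.000170, t4=100.000340))
# -> approximately 0.00016 s, i.e., 160 microseconds one-way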

The performance measurement functionality is disabled by default in IOS XR and is only enabled
when it is configured.

15.3.1 Probe Format


The Delay Measurement (DM) probe packets can be sent in various formats and various
encapsulations. Well-known formats are RFC 6374 “Packet Loss and Delay Measurement for MPLS
Networks” and RFC 4656/RFC 5357 “OWAMP”/”TWAMP”.

RFC 6374 probes are sent as native MPLS packets or IP/UDP encapsulated. This format is usable for
measurements in SR-MPLS networks.

draft-gandhi-spring-twamp-srpm describes a more generic mechanism for Performance Measurement
in MPLS and non-MPLS networks. While this draft uses TWAMP IP/UDP encapsulated probes, the
measurement mechanism is the same as described in this chapter.

At the time of writing, the above IETF draft was just released. Therefore, this section focuses on the
RFC 6374 encapsulation and the TWAMP encapsulation is deferred to a later revision of this book.

The RFC 6374 Delay Measurement (DM) probe packets are MPLS packets that are sent over the
Generic Associated Channel (G-ACh). Packets that are sent on the G-ACh are MPLS packets with the
G-ACh Label (GAL) – label value 13 – at the bottom of the MPLS label stack.
G-ACh, ACH, and GAL

Generic Associated Channel (G-ACh)

The Generic Associated Channel (G-ACh) (RFC5586) is a control channel that is associated with an MPLS Label Switched Path
(LSP), a pseudowire, or a section (link). It provides a supplementary logical data channel that can be used by various protocols,
mostly for maintenance functions. The G-ACh provides a link-layer-agnostic channel that can be used in MPLS networks to
communicate between two adjacent devices, similar to Cisco Discovery Protocol (CDP) or Link Layer Discovery Protocol (LLDP)
used on Ethernet links.

Associated Channel Header (ACH)

A packet that is sent on the G-ACh has an Associated Channel Header (ACH) that identifies the Channel Type. There is for
example a channel type for Delay Measurement (DM) packets.

G-ACh Label (GAL)

To enable devices to identify packets that contain an ACH (i.e., that are sent on an associated control channel), a label-based exception
mechanism is used.

A label from the MPLS reserved label range is used for this purpose: the G-ACh Label (GAL) with label value 13. When a packet is
received with the GAL label, the receiving device knows the packet is received on the G-ACh and that an ACH follows the bottom-
of-stack label.

The GAL is followed by an Associated Channel Header (ACH) that identifies the message type, and
the message body that contains the actual DM probe packet follows this ACH.

Because the DM probe packets are carried in the G-ACh, these probe packets are MPLS packets.
This means that the LC at the remote end of the link must be MPLS-capable in order to process these
packets.

The DM Query probe packet format is shown in Figure 15‑2. The packet consists of a Layer 2 header,
an MPLS shim header, an Associated Channel Header (ACH), and the DM probe packet.
Figure 15-2: DM Query packet using RFC 6374 Packet Format

Layer 2 header

The Destination MAC address of the DM query packets is the MPLS Multicast MAC address
01-00-5e-80-00-0d (RFC 7213). This way the user does not need to configure next-hop
addresses for the links (which would be needed to resolve a unicast destination MAC address).
The remote side’s LC needs to support this MPLS Multicast MAC address.

Since the packet is an MPLS packet, the L2 ethertype field value is 0x8847.

MPLS shim header


The G-ACh Label (GAL) (label value 13) follows the MAC header. This special label
indicates to the receiving node that this packet is a G-ACh packet.

Associated Channel Header (ACH)

The Associated Channel Header (ACH) that follows the GAL specifies the Channel Type. This
indicates the type of message carried in the associated control channel (G-ACh). For DM query
probe messages, the Channel Type is called “MPLS Delay Measurement (DM)”.

DM packet

The actual DM query probe message follows the ACH.

The fields in the DM packet are specified as follows:

Version: Currently set to 0.

Flags:

Query/Response indicator (R-flag): Set to 0 for a Query and 1 for a Response.

Traffic-class-specific measurement indicator (T-flag): Set to 1 when the measurement is done
for a particular traffic class (DSCP value) specified in DS field – 0 for IOS XR.

Control Code:

For query message:

0x0: In-band Response Requested – response is expected over G-ACh – IOS XR uses this
Control Code for two-way delay measurements

0x1: Out-of-band Response Requested – response is expected over an out-of-band channel –
IOS XR uses this Control Code for one-way delay measurements

0x2: No Response Requested – no response is expected.

For response message:


0x0: Success

other values: Notifications and Errors

Message Length: length of this message

Querier Timestamp Format (QTF): The format of the timestamp values written by the querier –
IOS XR: IEEE 1588-2008 (1588v2) format; 32-bit seconds field + 32-bit nanoseconds field (an encoding/decoding sketch follows this list)

Responder Timestamp Format (RTF): The format of the timestamp values written by the
responder – IOS XR: IEEE 1588-2008 (1588v2) format; 32-bit seconds field + 32-bit
nanoseconds field

Responder's Preferred Timestamp Format (RPTF): The timestamp format preferred by the
responder – IOS XR: not specified

Session Identifier: An arbitrary numerical value that uniquely identifies a measurement operation
(also called a session) that consists of a sequence of messages. All messages in the sequence
(session) have the same Session Identifier value.

DS: DSCP value of measured traffic-class – not used in IOS XR (T-flag = 0).

Timestamp 1-4: The timestamps as collected by the local-end and remote-end nodes.

TLV Block: zero or more TLVs – IOS XR includes UDP Return Object (URO) TLV (RFC 7876)
for one-way delay measurement.
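Since IOS XR writes the timestamps in the truncated IEEE 1588v2 format described above (a 32-bit seconds field followed by a 32-bit nanoseconds field), such a field can be encoded and decoded with a few lines of Python (an illustrative sketch, assuming network byte order):

import struct

def encode_1588v2_timestamp(seconds, nanoseconds):
    """Pack a timestamp as 32-bit seconds + 32-bit nanoseconds, network byte order."""
    return struct.pack("!II", seconds, nanoseconds)

def decode_1588v2_timestamp(field):
    seconds, nanoseconds = struct.unpack("!II", field)
    return seconds + nanoseconds / 1e9

# Round-trip the T1 value shown in Example 15-1 (1519157877.787756030 s):
print(decode_1588v2_timestamp(encode_1588v2_timestamp(1519157877, 787756030)))
# prints roughly 1519157877.787756 (limited by float precision)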

The delay measurement method allows for one-way and two-way delay measurements. By default,
two-way delay measurement is enabled in IOS XR when enabling link delay measurement.

Two-Way Measurements

For two-way measurements, all four timestamps (T1 to T4) defined in the DM packet are used.

Since the delay in both directions must be measured, both Query and Response messages are sent as
MPLS GAL packets, as shown in Figure 15‑2. Therefore, the querier requests to receive the DM
Response in-band and the UDP Return Object (URO) TLV is not included in the TLV block.
To compute the two-way link delay, the timestamps filled in by the same node are subtracted;
therefore, hardware clock synchronization is not required between querier and responder nodes. On
platforms that do not support Precision Time Protocol (PTP) for accurate time synchronization, two-
way delay measurement is the only option to support delay measurement.

From a two-way measurement, the one-way link delay is computed as the two-way delay divided by
2.

Example 15‑1 shows a packet capture of a DM Query message for a two-way delay measurement.
The querier node has indicated it desires an in-band response (line 16). The session identifier is
33554433 (line 21). Timestamp T1 has been filled in when the query was transmitted (line 22).

Example 15-1: Two-way delay measurement Query message example

1 Ethernet II, Src: fa:16:3e:59:50:91, Dst: IPv4mcast_01:81 (01:00:5e:00:01:81)


2 Destination: IPv4mcast_01:81 (01:00:5e:00:01:81)
3 Source: fa:16:3e:59:50:91 (fa:16:3e:59:50:91)
4 Type: MPLS label switched packet (0x8847)
5 MultiProtocol Label Switching Header, Label: 13 (Generic Associated Channel Label (GAL)), Exp: 0, S: 1, TTL:
1
6 Generic Associated Channel Header
7 .... 0000 = Channel Version: 0
8 Reserved: 0x00
9 Channel Type: MPLS Delay Measurement (DM) (0x000c)
10 MPLS Delay Measurement (DM)
11 0000 .... = Version: 0
12 .... 0000 = Flags: 0x0
13 0... = Response indicator (R): Not set
14 .0.. = Traffic-class-specific measurement indicator (T): Not set
15 ..00 = Reserved: Not set
16 Control Code: In-band Response Requested (0x00)
17 Message Length: 44
18 0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
19 .... 0000 = Responder timestamp format (RTF): Null Timestamp (0)
20 0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
21 Session Identifier: 33554433
22 Timestamp 1 (T1): 1519157877.787756030 seconds
23 Timestamp 2 (T2): 0.000000000 seconds
24 Timestamp 3: 0
25 Timestamp 4: 0

The responder fills in timestamp T2 (line 23 of Example 15‑1) when receiving the Query message.

The responder then sends an in-band (G-ACh) Response message, presented in Example 15‑2. It
copies the Session Identifier and Querier Timestamp Format (QTF) fields from the Query message,
and it copies the T1 and T2 values of the Query message into the Timestamp 3 and Timestamp 4 fields
of the Response message. When transmitting the Response message, it fills in timestamp T3 (line 22).
Timestamp T4 is filled in by the querier when it receives the message.

Example 15-2: Two-way delay measurement Response message example

1 Ethernet II, Src: fa:16:3e:72:db:5e, Dst: IPv4mcast_01:81 (01:00:5e:00:01:81)


2 Destination: IPv4mcast_01:81 (01:00:5e:00:01:81)
3 Source: fa:16:3e:72:db:5e (fa:16:3e:72:db:5e)
4 Type: MPLS label switched packet (0x8847)
5 MultiProtocol Label Switching Header, Label: 13 (Generic Associated Channel Label (GAL)), Exp: 0, S: 1, TTL:
1
6 Generic Associated Channel Header
7 .... 0000 = Channel Version: 0
8 Reserved: 0x00
9 Channel Type: MPLS Delay Measurement (DM) (0x000c)
10 MPLS Delay Measurement (DM)
11 0000 .... = Version: 0
12 .... 1000 = Flags: 0x8
13 1... = Response indicator (R): Set
14 .0.. = Traffic-class-specific measurement indicator (T): Not set
15 ..00 = Reserved: Not set
16 Control Code: Success (0x01)
17 Message Length: 44
18 0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
19 .... 0011 = Responder timestamp format (RTF): Truncated IEEE 1588v2 PTP Timestamp (3)
20 0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
21 Session Identifier: 33554433
22 Timestamp 1 (T3): 1519157877.798967010 seconds
23 Timestamp 2 (T4): 0.000000000 seconds
24 Timestamp 3 (T1): 1519157877.787756030 seconds
25 Timestamp 4 (T2): 1519157877.798967010 seconds

One-Way Measurements

When one-way delay is enabled, the querier requests to receive the DM Response out-of-band and it
adds a UDP Return Object (URO) TLV (defined in RFC 7876) in the DM Query packet, to receive the
DM Response message via an out-of-band UDP channel. The URO TLV contains the IP address (IPv4
or IPv6) and the destination UDP port to be used for this response packet.

Only two timestamps (T1 and T2) defined in the RFC 6374 DM packets are used for one-way
measurements since the Response packet is not sent in-band and may not traverse the link.
Timestamps filled in by different nodes are subtracted; therefore, the hardware clocks must be
accurately synchronized between querier and responder nodes. Clock synchronization can be
achieved by using e.g., Precision Time Protocol (PTP), which provides clock accuracy in the sub-
microsecond range.
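For comparison with the two-way case, here is a minimal sketch of the one-way computation, assuming the clocks of both nodes are synchronized (e.g., via PTP):

def one_way_delay(t1, t2):
    # t1: Query transmitted by the querier, t2: Query received by the responder.
    # Timestamps taken by different nodes are subtracted, so the result is
    # only meaningful if both clocks are accurately synchronized.
    return t2 - t1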

Figure 15‑3 shows the format of the out-of-band DM Response probe packet used for one-way delay
measurements.
Figure 15-3: out-of-band DM Response message for one-way delay measurement

The DM Response probe packet is carried in an IP/UDP packet, with the IP destination address and
UDP Destination port number as specified in the DM Query probe message. The source IP address
and source UDP port number are chosen by the responder.

The RFC 6374 DM packet immediately follows the UDP header; no ACH is included. The
correlation between Query and Response messages can be achieved by the UDP port number and the
Session Identifier field in the DM message.

Example 15‑3 shows a packet capture of a Query message for a one-way measurement. For a one-way measurement, the querier requests an out-of-band response (line 16) and adds a UDP Return Object (URO) TLV (lines 26 to 30), requesting that the DM Response be returned in a UDP packet.
Example 15-3: One-way delay measurement Query message example

1 Ethernet II, Src: fa:16:3e:59:50:91, Dst: IPv4mcast_01:81 (01:00:5e:00:01:81)


2 Destination: IPv4mcast_01:81 (01:00:5e:00:01:81)
3 Source: fa:16:3e:59:50:91 (fa:16:3e:59:50:91)
4 Type: MPLS label switched packet (0x8847)
5 MultiProtocol Label Switching Header, Label: 13 (Generic Associated Channel Label (GAL)), Exp: 0, S: 1, TTL:
1
6 Generic Associated Channel Header
7 .... 0000 = Channel Version: 0
8 Reserved: 0x00
9 Channel Type: MPLS Delay Measurement (DM) (0x000c)
10 MPLS Delay Measurement (DM)
11 0000 .... = Version: 0
12 .... 0000 = Flags: 0x0
13 0... = Response indicator (R): Not set
14 .0.. = Traffic-class-specific measurement indicator (T): Not set
15 ..00 = Reserved: Not set
16 Control Code: Out-of-band Response Requested (0x01)
17 Message Length: 52
18 0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
19 .... 0000 = Responder timestamp format (RTF): Null Timestamp (0)
20 0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
21 Session Identifier: 33554434
22 Timestamp 1 (T1): 1519157877.787756030 seconds
23 Timestamp 2 (T2): 0.000000000 seconds
24 Timestamp 3: 0
25 Timestamp 4: 0
26 UDP Return Object (URO)
27 URO type: 131
28 Length: 6
29 UDP-Destination-Port: 6634
30 Address: 10.0.0.5

The responder sends the DM Response in a UDP packet, presented in Example 15‑4. Even though the responder also fills in timestamp T3 when transmitting the Response, this timestamp is of little use since the DM Response is not transmitted in-band and may not traverse the measured link.
Example 15-4: One-way delay measurement Response message example

Ethernet II, Src: fa:16:3e:72:db:5e, Dst: fa:16:3e:59:50:91


Destination: fa:16:3e:59:50:91 (fa:16:3e:59:50:91)
Source: fa:16:3e:72:db:5e (fa:16:3e:72:db:5e)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.0.0.6 (10.0.0.6), Dst: 10.0.0.5 (10.0.0.5)
Protocol: UDP (17)
User Datagram Protocol, Src Port: 6634, Dst Port: 6634
Source Port: 6634
Destination Port: 6634
MPLS Delay Measurement (DM)
0000 .... = Version: 0
.... 1000 = Flags: 0x8
1... = Response indicator (R): Set
.0.. = Traffic-class-specific measurement indicator (T): Not set
..00 = Reserved: Not set
Control Code: Success (0x01)
Message Length: 44
0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
.... 0011 = Responder timestamp format (RTF): Truncated IEEE 1588v2 PTP Timestamp (3)
0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
Session Identifier: 33554434
Timestamp 1 (T1): 1519157877.787756030 seconds
Timestamp 2 (T2): 1519157877.798967010 seconds
Timestamp 3 (T3): 1519157877.798967010 seconds
Timestamp 4 (T4): 0.000000000 seconds

15.3.2 Methodology
To measure the delay, the operator sets up a delay measurement session by enabling delay measurement on an interface. Such a measurement session consists of a sequence of DM Query messages sent by the querier node. The responder node replies to each Query message with a DM Response message. All DM Query and Response messages in a given session use the same Session Identifier field value in the DM packet.

The characteristics of the measurement session (interval between queries, number of queries, …) are
chosen by the querier node.

The delay measurement operation follows the sequence below.

Every probe interval the querier node starts a probe. Each probe consists of one or more queries,
where each query – DM Query/Response exchange – provides a delay measurement.

The number of queries per probe is called burst count and the time interval between two queries in a
burst is called burst interval.
These terms are illustrated in Figure 15‑4. The illustration shows the sequence of two probes, with a
burst count of five.

Figure 15-4: Delay measurement – query, probe, and burst terminology

The different delay measurement parameters are configurable, as described in the next section. The
default probe interval is 30 seconds, with a default burst count of 10 and a default burst interval of 3
seconds. As a result, by default, the queries are evenly spread over the probe interval.

After receiving all responses to the queries in a probe (or after a timeout, in case one or more
messages have been lost), the various link delay metrics are computed over all the (successful)
queries in the probe: minimum, average, moving average, maximum, and variation. These statistical
metrics are provided to the operator via Telemetry (periodic and event-driven) and CLI/NETCONF.

At the end of a probe, the node also checks if the current link delay metrics must be advertised in IGP
using the accelerated advertisement. The IGP advertisement functionality is further discussed in
section 15.4 below.

In Figure 15‑4, let us assume that d1n is the measured delay of the nth query of Probe1 and that Probe0 is the probe before Probe1 (not illustrated). After receiving the responses to all queries of Probe1, the querier node computes the probe metrics as follows (a short computational sketch follows these formulas).

Average delay: Probe1.Average = AVG(d11, d12, d13, d14, d15)


Moving average3: Probe1.MovingAvg = Probe0.MovingAvg × (1 – ⍺) + Probe1.Average × ⍺ (⍺
is a weight factor, set to 0.5 at the time of writing)

Minimum delay: Probe1.Minimum = MIN(d11, d12, d13, d14, d15)

Maximum delay: Probe1.Maximum = MAX(d11, d12, d13, d14, d15)

Delay variation4: Probe1.Variation = Probe1.Average – Probe1.Minimum

Similarly, after receiving all responses to the queries of Probe2, the querier node computes:

Probe2.Average = AVG(d21, d22, d23, d24, d25)

Probe2.MovingAvg = Probe1.MovingAvg × (1 – ⍺) + Probe2.Average × ⍺

Probe2.Minimum = MIN(d21, d22, d23, d24, d25)

Probe2.Maximum = MAX(d21, d22, d23, d24, d25)

Probe2.Variation = Probe2.Average – Probe2.Minimum
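The per-probe computations above can be summarized in a short Python sketch (illustrative only; the weight factor of 0.5 is the value mentioned above):

ALPHA = 0.5  # moving-average weight factor (value at the time of writing)

def probe_metrics(delays, prev_moving_avg):
    # delays: the delay measured by each query of the probe (e.g., d11..d15)
    average = sum(delays) / len(delays)
    return {
        "average": average,
        "moving_avg": prev_moving_avg * (1 - ALPHA) + average * ALPHA,
        "minimum": min(delays),
        "maximum": max(delays),
        "variation": average - min(delays),   # PDV, RFC 5481 section 4.2
    }

# Probe2-metrics are computed from Probe1's moving average:
# probe2 = probe_metrics([d21, d22, d23, d24, d25], probe1["moving_avg"])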

The querier node aggregates the delay measurements of multiple probes every advertisement
interval. The advertisement interval is a multiple of the probe interval, hence it consists of a whole
number of probes. The advertisement interval is configurable, with a default interval of 120 seconds.
The advertisement interval is also called aggregation interval.

After computing the delay metrics of the last probe in the advertisement interval, the querier node
computes the aggregated delay metrics over all probes in the advertisement interval: average,
minimum, moving average, maximum, and variation. The querier node then provides these statistical
delay metrics via Telemetry (periodic and event-driven) and CLI/NETCONF.

At this point in time, the node also checks if the current link delay metrics must be advertised in IGP
as a periodic advertisement.

Figure 15‑5 shows three probes, with a burst count of five and the probe interval as indicated. The
aggregation interval is two times the probe interval in this illustration.
Assuming that the node has aggregated the metrics at the end of Probe1, the next aggregation interval
consists of Probe2 and Probe3.

At the end of Probe3, the node computes the delay metrics of Probe3 itself and it also computes the
aggregate metrics, since it is the last probe in the aggregation interval.

Figure 15-5: Delay measurement – probe metrics and aggregated metrics

The aggregate metrics computed after completing Probe3 are as follows:

Average delay: Agg2.Average = AVG(Probe2.Average, Probe3.Average)

Moving average: Agg2.MovingAvg = Probe3.MovingAvg

Minimum delay: Agg2.Minimum = MIN(Probe2.Minimum, Probe3.Minimum)

Maximum delay: Agg2.Maximum = MAX(Probe2.Maximum, Probe3.Maximum)

Delay variation: Agg2.Variation = AVG(Probe2.Variation, Probe3.Variation)
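Continuing the sketch above, the aggregation over the probes of one advertisement interval (here Probe2 and Probe3) could look as follows:

def aggregate_metrics(probes):
    # probes: per-probe metric dictionaries of this advertisement interval,
    # in chronological order (e.g., [probe2, probe3])
    return {
        "average": sum(p["average"] for p in probes) / len(probes),
        "moving_avg": probes[-1]["moving_avg"],   # moving average of last probe
        "minimum": min(p["minimum"] for p in probes),
        "maximum": max(p["maximum"] for p in probes),
        "variation": sum(p["variation"] for p in probes) / len(probes),
    }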

15.3.3 Configuration
Delay measurements can be enabled per interface by configuring delay-measurement on that
interface under the performance-measurement configuration section, as shown in Example 15‑5.

Example 15-5: Enable delay measurement on an interface

performance-measurement
interface GigabitEthernet0/0/0/0
delay-measurement
!
interface GigabitEthernet0/0/0/1
delay-measurement

The remote side of the link must be similarly configured to activate the responder.

The performance measurement parameters are configured under a profile. The global profile for link
delay measurements (delay-profile interfaces) is shown in Example 15‑6. The configuration
value ranges and defaults are indicated. The parameters in this delay-profile are applied for all
interface link delay measurements.

Example 15-6: Global link delay measurement profile

performance-measurement
!! Global default profile for link delay measurement
delay-profile interfaces
probe
interval < 30-3600 SEC > (default: 30 sec)
burst
count < 1-30 COUNT > (default: 10 count)
interval < 30-15000 msec > (default: 3000 msec)
one-way (default: two-way)
advertisement
periodic
interval < 30-3600 sec > (default: 120 sec)

By default, the delay measurements are performed in two-way mode. For one-way measurements, the
Precision Time Protocol (PTP) must be enabled for accurate time synchronization between the two
nodes and the delay profile must be configured with probe one-way.

A configured advertisement interval (advertisement periodic interval) is internally rounded up to the


next multiple of the probe interval, so that it equals a whole number of probe intervals. For
example, a configured advertisement interval of 45 seconds with a probe interval of 30 seconds is
rounded up to 60 seconds, or two probe intervals: 2×30 seconds. The configuration does not reflect
this internal round-up.
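The round-up rule is simple enough to express in a small sketch (illustrative only):

import math

def effective_advertisement_interval(configured_interval, probe_interval):
    # Round the configured interval up to a whole number of probe intervals.
    return math.ceil(configured_interval / probe_interval) * probe_interval

# effective_advertisement_interval(45, 30) returns 60, i.e., 2 x 30 seconds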
The other advertisement configuration options are discussed in the Delay Advertisement section
below.

Static Delay

If desired, a static link delay can be configured for an interface, using the advertise-delay
command illustrated in Example 15‑7.

Example 15-7: Static delay configuration

performance-measurement
interface GigabitEthernet0/0/0/0
delay-measurement
advertise-delay <static delay in usec>

When a static advertise-delay value is configured on an interface, the minimum, maximum and
average delay advertised by the node for this interface are all equal to the configured value, while the
advertised variance is 0.

Even if the link delay is statically configured, probes continue to be scheduled and delay metrics are aggregated, stored in the history buffers, and streamed in telemetry. However, the advertisement threshold checks are suppressed and, therefore, the actual measured link delay values are not advertised.

Platforms that do not support dynamic link delay measurements may still allow configuration of static
link delay values. This allows integration of such platforms for delay-optimized path computations
without using the TE metric as fallback link delay metric.

15.3.4 Verification
Various show commands are available to verify delay measurements on a router. These show commands are not covered in detail in this book; please refer to the available product documentation for more information.

We do want to mention that a node keeps a history of delay metric data. The following historical data
is available:

1. Delay metrics computed from the probe: the delay metrics computed at the end of the probe-interval (threshold crossed or not)

2. Delay metrics computed from the aggregation: the delay metrics computed at the end of the advertisement interval

3. Delay metrics computed from the advertisement: the delay metrics from the accelerated and periodic advertisements (i.e., threshold crossed)

Example 15‑8, Example 15‑9, and Example 15‑10 are a few examples. Example 15‑8 shows the
current information of a delay measurement session. Example 15‑9 and Example 15‑10 respectively
show historical information of the probe metrics and aggregated metrics.

Example 15-8: Delay measurement information

RP/0/0/CPU0:iosxrv-1#show performance-measurement interfaces

--------------------------------------------------------------------------
0/0/CPU0
--------------------------------------------------------------------------

Interface Name: GigabitEthernet0/0/0/0 (ifh: 0x20)


Delay-Measurement : Enabled
Local IPV4 Address : 99.1.2.1
Local IPV6 Address : ::
Local MAC Address : fa16.3e59.5091
Primary VLAN Tag : None
Secondary VLAN Tag : None
State : Up

Delay Measurement session:


Session ID : 33554434

Last advertisement:
Advertised at: 10:41:58 Thu 22 Feb 2018 (161 seconds ago)
Advertised reason: periodic timer, min delay threshold crossed
Advertised delays (uSec): avg: 17711, min: 14998, max: 24998, variance: 3124

Current advertisement:
Scheduled in 2 more probes (roughly every 120 seconds)
Current delays (uSec): avg: 18567, min: 14998, max: 19998, variance: 3000
Number of probes started: 2
Example 15-9: History of probe delay metric

RP/0/0/CPU0:iosxrv-1#show performance-measurement history probe interfaces

--------------------------------------------------------------------------
0/0/CPU0
--------------------------------------------------------------------------

Interface Name: GigabitEthernet0/0/0/0 (ifh: 0x20)


Delay-Measurement history (uSec):
Probe Start Timestamp Pkt(TX/RX) Average Min Max
10:46:01 Thu 22 Feb 2018 10/10 17998 14998 19998
10:45:31 Thu 22 Feb 2018 10/10 18498 14998 19998
10:45:01 Thu 22 Feb 2018 10/10 18998 14998 19998
10:44:31 Thu 22 Feb 2018 10/10 17998 9999 19998
10:44:01 Thu 22 Feb 2018 10/10 17998 14998 19998
10:43:31 Thu 22 Feb 2018 10/10 19498 14998 24998
10:43:01 Thu 22 Feb 2018 10/10 18998 14998 19998
10:42:31 Thu 22 Feb 2018 10/10 18998 14998 19998
10:42:01 Thu 22 Feb 2018 10/10 18498 14998 19998
10:41:31 Thu 22 Feb 2018 10/10 17498 14998 19998
10:41:01 Thu 22 Feb 2018 10/10 16998 14998 24998
10:40:31 Thu 22 Feb 2018 10/10 19498 14998 24998
10:40:01 Thu 22 Feb 2018 10/10 18498 14998 19998
10:39:31 Thu 22 Feb 2018 10/10 17998 14998 19998
10:39:01 Thu 22 Feb 2018 10/10 16998 9999 19998
--More--

Example 15-10: History of aggregated delay metric

RP/0/0/CPU0:iosxrv-1#show performance-measurement history aggregated interfaces

PER-NA : Periodic timer, no advertisements have occured


PER-AVG : periodic timer, avg delay threshold crossed
PER-MIN : periodic timer, min delay threshold crossed
PER-MAX : periodic timer, max delay threshold crossed
ACCEL-MIN : accel threshold crossed, min delay threshold crossed

--------------------------------------------------------------------------
0/0/CPU0
--------------------------------------------------------------------------

Interface Name: GigabitEthernet0/0/0/0 (ifh: 0x20)


Delay-Measurement history (uSec):
Aggregation Timestamp Average Min Max Action
10:45:59 Thu 22 Feb 2018 18372 9999 19998 PER-MIN
10:43:58 Thu 22 Feb 2018 18997 14998 24998 PER-NA
10:41:58 Thu 22 Feb 2018 18122 14998 24998 PER-MIN
10:39:58 Thu 22 Feb 2018 18247 9999 24998 PER-MIN
10:37:58 Thu 22 Feb 2018 18123 14998 19998 PER-NA
10:35:58 Thu 22 Feb 2018 17748 14998 19998 PER-NA
10:33:58 Thu 22 Feb 2018 18997 14998 24998 PER-MIN
10:31:58 Thu 22 Feb 2018 17747 9999 19998 PER-MIN
10:29:58 Thu 22 Feb 2018 18122 14998 24998 PER-NA
--More--
15.4 Delay Advertisement
As we have seen in the previous sections, the node periodically measures the link delay and computes
the various delay metrics. In order for other devices to use this information, the node must then share
these link delay metrics with the other devices in the network and/or with a controller.

To prevent churn in the network, flooding of link delay metrics in IGP should be reduced to the
minimum. The node must not flood the delay metrics whenever it has a new measurement. On the
other hand, a very detailed view on the evolution of the delay metrics may be desirable.

Therefore, delay measurement functionality provides reduced IGP flooding – only flooding the delay
metrics when exceeding certain thresholds – while using Event Driven Telemetry (EDT) to push the
delay metrics at a finer time scale.

15.4.1 Delay Metric in IGP and BGP-LS


At the beginning of this chapter, we explained that only the minimum delay metric should be
considered for delay-optimized routing, since it expresses the propagation delay of the link.
Consequently, IGP only floods the link delay metrics if the minimum delay changes significantly. If the
maximum, average and/or variance change but the minimum delay remains stable, then no IGP delay
metric advertisement is triggered. Whenever delay metric flooding is triggered due to minimum delay
change, all delay metrics for the link are advertised in the network.

The link delay measurements are flooded as Extended TE Link Delay Metrics in ISIS (RFC 7810)
and OSPF (RFC 7471). The measurements are added as sub-TLVs to the advertisement of the link.
BGP-LS supports advertisement of the Extended TE Link Delay Metrics, as specified in draft-ietf-
idr-te-pm-bgp.

The following Extended TE Link Delay Metrics will be flooded (in separate sub-TLVs):

Unidirectional Link Delay

Unidirectional Min/Max Link Delay

Unidirectional Delay Variation


These metrics are the delay metrics as computed in the previous section; the (moving) average delay
is advertised as the Unidirectional Link Delay metric.

The format of the delay metric TLVs is the same for ISIS, OSPF, and BGP-LS. The format of the
Unidirectional Link Delay (Sub-)TLV is shown in Figure 15‑6, the Min/Max Unidirectional Link
Delay (Sub-)TLV format is shown in Figure 15‑7, and the format of the Unidirectional Delay
Variation (Sub-)TLV is shown in Figure 15‑8.

Figure 15-6: Unidirectional Link Delay Sub-TLV format

Figure 15-7: Min/Max Unidirectional Link Delay Sub-TLV format

Figure 15-8: Unidirectional Delay Variation Sub-TLV format

The actual delay measurement fields (Delay, Min Delay, Max Delay, and Delay Variation) in these TLVs carry the delay value in microseconds, encoded as a 24-bit integer.
The Anomalous flag (A-flag) in the Unidirectional and Min/Max Unidirectional Link Delay
(Sub‑)TLV formats, is set when a measured value exceeds a configured maximum threshold. The
A‑flag is cleared again when the measured value falls below the configured reuse threshold. At the
time of writing, the A-flag is always unset in IOS XR.

An example of delay metrics advertised in ISIS is shown in Example 15‑11 and the matching
advertisement of the link in BGP-LS is shown in Example 15‑13.

Example 15-11: ISIS delay metric advertisement

RP/0/0/CPU0:iosxrv-1#show isis database verbose iosxrv-1 | begin IS-Extended iosxrv-2.00


Metric: 10 IS-Extended iosxrv-2.00
Interface IP Address: 99.1.2.1
Neighbor IP Address: 99.1.2.2
Link Average Delay: 8127 us
Link Min/Max Delay: 4999/14998 us
Link Delay Variation: 3374 us
Link Maximum SID Depth:
Subtype: 1, Value: 10
ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24004

OSPF advertises the link delay metrics in the TE (type 1) Opaque LSAs, as illustrated in
Example 15‑12.
Example 15-12: OSPF delay metric advertisement

RP/0/0/CPU0:iosxrv-1#show ospf database opaque-area 1.0.0.3 self-originate

OSPF Router with ID (1.1.1.1) (Process ID 1)

Type-10 Opaque Link Area Link States (Area 0)

LS age: 98
Options: (No TOS-capability, DC)
LS Type: Opaque Area Link
Link State ID: 1.0.0.3
Opaque Type: 1
Opaque ID: 3
Advertising Router: 1.1.1.1
LS Seq Number: 80000006
Checksum: 0xe45e
Length: 108

Link connected to Point-to-Point network


Link ID : 1.1.1.2
(all bandwidths in bytes/sec)
Interface Address : 99.1.2.1
Neighbor Address : 99.1.2.2
Admin Metric : 1
Maximum bandwidth : 125000000
IGP Metric : 1
Unidir Link Delay : 8127 micro sec, Anomalous: no
Unidir Link Min Delay : 4999 micro sec, Anomalous: no
Unidir Link Max Delay : 14998 micro sec
Unidir Link Delay Variance : 3374 micro sec

BGP-LS advertises these Extended TE Link Delay Metrics as Link Attribute TLVs, which are TLVs
that may be encoded in the BGP-LS attribute with a Link NLRI. The format is the same as for ISIS and
OSPF advertisements. An example is shown in Example 15‑13, showing a Link NLRI originated by
ISIS.
Example 15-13: BGP-LS delay metric advertisement

RP/0/0/CPU0:iosxrv-1#sh bgp link-state link-state [E][L2][I0x0][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1]


[b0.0.0.0][s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696 detail
BGP routing table entry for [E][L2][I0x0][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1][b0.0.0.0]
[s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696
NLRI Type: Link
Protocol: ISIS L2
Identifier: 0x0
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
ISO Node ID: 0000.0000.0001.00
Remote Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
ISO Node ID: 0000.0000.0002.00
Link Descriptor:
Local Interface Address IPv4: 99.1.2.1
Neighbor Interface Address IPv4: 99.1.2.2

Versions:
Process bRIB/RIB SendTblVer
Speaker 29 29
Flags: 0x00000001+0x00000200;
Last Modified: Feb 20 21:20:11.687 for 00:03:21
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Flags: 0x400000000104000b, import: 0x20
Not advertised to any peer
Local
0.0.0.0 from 0.0.0.0 (1.1.1.1)
Origin IGP, localpref 100, valid, redistributed, best, group-best
Received Path ID 0, Local Path ID 0, version 29
Link-state:
metric: 10, ADJ-SID: 24004(30) , MSD: Type 1 Value 10
Link Delay: 8127 us Flags: 0x00, Min Delay: 4999 us
Max Delay: 14998 us Flags: 0x00
Delay Variation: 3374 us

IGP delay metric flooding can be triggered periodically or in an accelerated manner, as described in the next sections.

Periodic Advertisements

When enabling link delay measurements, periodic IGP advertisements are enabled by default. A node
periodically checks if it needs to flood the current delay metrics in IGP. The periodicity is 120
seconds by default and is configurable as the advertisement interval.

IGP floods the delay metrics of a given link if both conditions in Equation 15‑1 are satisfied for that
link at the end of the periodic advertisement interval. Flooded.Minimum is the minimum delay metric
that was last flooded and Agg.Minimum is the aggregated minimum delay metric computed at the end
of the current aggregation/advertisement interval. The periodic threshold (%) and minimum (value) are configurable, with default values of 10% and 500 μs, respectively.

Equation 15-1: Periodic delay metric flooding conditions

|Agg.Minimum – Flooded.Minimum| / Flooded.Minimum ≥ threshold (default 10%) (a)

and

|Agg.Minimum – Flooded.Minimum| ≥ minimum (default 500 μs) (b)

In this default configuration, the delay metrics of a link are flooded at the end of the periodic advertisement interval if the minimum delay of that link over the advertisement interval differs (in either direction) from the last flooded minimum delay for that link by at least 10% and by at least 500 μs. The default minimum value of 500 μs is roughly equivalent to 100 km of fiber5.

When these conditions are met, IGP floods all current aggregate delay metrics for that link:
Agg.Minimum, Agg.Maximum, Agg.MovingAvg (as average delay), and Agg.Variance.

The node keeps these flooded metrics as Flooded.Minimum, Flooded.Maximum,


Flooded.MovingAvg, and Flooded.Variance, and it also pushes/streams these values with Event
Driven Telemetry to the collector.
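The decision of Equation 15‑1 can be sketched as follows (illustrative Python; the defaults correspond to the periodic thresholds):

def should_flood(measured_minimum, flooded_minimum, threshold=0.10, minimum=500):
    # measured_minimum, flooded_minimum and minimum are in microseconds.
    # Both the relative (threshold) and absolute (minimum) conditions must
    # hold. Assumes a non-zero previously flooded minimum delay.
    diff = abs(measured_minimum - flooded_minimum)
    return diff / flooded_minimum >= threshold and diff >= minimum

For the periodic check, measured_minimum is the aggregated minimum delay (Agg.Minimum); for the accelerated check described further below, it is the probe minimum (Probe.Minimum) and the default threshold is 20%.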

Figure 15‑9 illustrates the delay metric advertisement operation. The top of the illustration shows the minimum delay of a link (y-axis) as a function of time (x-axis). Both the measured minimum delay and the delay as advertised in the IGP are shown.

Below the minimum delay graph, the DM Queries that the node sends on the link are represented as a
series of arrows, each of them representing a query.

The Probes are shown below the Queries. In this example the burst-count is set to 4, i.e., every
probe-interval the node sends a burst of 4 queries. After receiving the last response of all queries in
the probe, the node computes the probe metrics. In the illustration this occurs at times p1, p2, …, p7
as indicated with arrows.

The advertisements are displayed below the Probes. The advertisement interval is two times the
probe-interval.

At the bottom of the illustration the advertised metrics are indicated. These are described in this
section.

Figure 15-9: Minimum-delay metric increase with periodic advertisements

Assume that at time p1, the end of the first periodic advertisement interval in the illustration, the
aggregate minimum delay (min1) is equal to the last advertised minimum delay (min1). The node
suppresses the periodic advertisement.

Sometime during Probe2 the minimum delay of the link increases from min1 to min2. This can occur
due to a failure in the optical network triggering an optical restoration that switches traffic to go via
the other side of the optical ring. In this example this new optical path is significantly longer, causing
a large, abrupt increase of the propagation delay.
The advertisement interval expires at time p3. The aggregate minimum delay over the interval is
min1, since the first query of Probe2 measured a delay min1. Even though all the other measurements
in Probe2 and Probe3 measured a minimum delay min2, the minimum delay of the whole
advertisement interval is still min1. Min1 is also the last advertised minimum delay, so the periodic
advertisement is suppressed.

The advertisement interval expires again at time p5. This time, the aggregate minimum delay of the
interval is min2. Min2 differs enough from the previously advertised min1 to exceed the periodic
threshold. The node advertises the aggregate delay metrics (min/max/avg/var of (Probe4-metrics,
Probe5-metrics)). The advertised minimum delay is min2.

Relying only on periodic advertisements, it takes between one and two times the advertisement
interval for a significant increase in the minimum link delay to be advertised in the IGP.

Accelerated Advertisements

As shown on the previous illustration, it can take up to two times the periodic advertisement interval
before a worse (increased) minimum delay metric is advertised in the IGP through the periodic
advertisements. If a faster reaction is important, the accelerated advertisement functionality, disabled
by default, can be enabled.

When enabled, an accelerated advertisement can be sent at the end of each probe interval, in-between
periodic advertisements, if the minimum delay metric has changed significantly compared to the previously advertised minimum delay.

When a probe finishes (the last response message has been received) and the probe’s minimum delay
metric crosses the accelerated threshold (see further), then the IGP floods all link delay metrics
(average, min, max, and variation) for that link. The advertised delay metrics are the delay metrics of
the last probe.

When the accelerated advertisement is triggered, the periodic advertisement interval is reset.

The IGP floods the delay metrics of a given link if both conditions in Equation 15‑2 are satisfied for that link at the end of a probe interval. These conditions are the same as in Equation 15‑1, except that they are evaluated against the minimum delay measured over the last probe (Probe.Minimum) and that the accelerated threshold and minimum are independent from those of the periodic advertisements. The accelerated threshold (%) and minimum (value) are configurable, with default values of 20% and 500 μs, respectively.

Equation 15-2: Accelerated delay metric flooding conditions

|Probe.Minimum – Flooded.Minimum| / Flooded.Minimum ≥ threshold (default 20%) (a)

and

|Probe.Minimum – Flooded.Minimum| ≥ minimum (default 500 μs) (b)

If accelerated advertisement is enabled, then the delay metrics of a given link are flooded at the end of the probe interval if the minimum delay of that link over the probe interval differs (in either direction) from the last flooded minimum delay of that link by at least 20% (by default) and by at least 500 μs (by default).

When these conditions are met, the IGP floods all current probe delay metrics for that link:
Probe.Minimum, Probe.Maximum, Probe.MovingAvg (as average delay), and Probe.Variance.

The node keeps these flooded metrics as Flooded.Minimum, Flooded.Maximum,


Flooded.MovingAvg, and Flooded.Variance, and it also pushes these values with Event Driven
Telemetry.

Figure 15‑10 illustrates the effect of accelerated advertisements on the scenario of Figure 15‑9. The
delay measurement parameters are the same as for the previous example, only accelerated
advertisements are enabled now.
Figure 15-10: Minimum-delay metric increase with accelerated advertisements

Assume that at time p1, the end of the first periodic advertisement interval in the illustration, the
aggregate minimum delay (min1) is equal to the last advertised minimum delay (min1). The node
suppresses the periodic advertisement.

Sometime during Probe2 the minimum delay of the link increases from min1 to min2.

At time p2, the node verifies if the Probe2-metrics exceed the accelerated thresholds. The minimum
delay measured over Probe2 is min1 since the first query of Probe2 measured a delay min1. Since
min1 is also the last advertised minimum delay, the threshold is not exceeded, the accelerated
advertisement is suppressed.

The periodic advertisement interval expires at time p3. The aggregate minimum delay over the
interval is min1, since the first query of Probe2 measured a delay min1. Min1 is also the last
advertised minimum delay, so the periodic advertisement is suppressed. However, the minimum delay
of the Probe3 interval is min2. The difference between min2 and the previously advertised min1
exceeds the accelerated thresholds. Therefore, the node advertises the Probe3-metrics as accelerated
advertisement.
Since the minimum delay stays constant, all further advertisements in the example are suppressed.

With accelerated advertisements enabled, the worst-case delay between the minimum delay increase
and the advertisement of this change is reduced to two times the probe-interval. The best-case delay
for that case is one probe-interval.

Another example combining periodic and accelerated advertisements is shown in Figure 15‑11. The
delay measurement parameters are the same as in the previous examples.

Figure 15-11: Delay metric advertisements – illustration

Before the illustrated time interval, the node advertises the minimum delay value min3 in the IGP.

At time p1, the end of the first probe-interval in the diagram, the node computes the delay metrics of this probe: Probe1-metrics. The node then checks whether an accelerated advertisement is needed, but since the measured minimum delay (min3) is the same as the last advertised minimum delay (min3), the accelerated thresholds are not exceeded and the accelerated advertisement is suppressed.

Time p1 is also the end of the periodic advertisement interval, and the node computes the aggregate
delay metrics of this advertisement interval, based on the probes in the advertisement interval. The
minimum delay is the same as the last advertised minimum delay and the periodic advertisement is
suppressed.

Shortly after time p1 and before the first query of Probe2, the minimum delay of the link decreases
from min3 to min1.

At time p2, the node computes the Probe2-metrics and then checks whether an accelerated advertisement is needed. The difference between the minimum delay metric in this probe-interval, min1, and the previously advertised minimum delay metric, min3, exceeds the accelerated thresholds. Therefore, the node floods the Probe2-metrics. The advertised minimum delay metric is now min1. The accelerated advertisement also resets the periodic advertisement interval.

Right after time p2 and before the first query of Probe3, the minimum delay of the link increases from
min1 to min2.

At time p3, the node computes the Probe3-metrics. The minimum delay metric in this probe-interval
is min2. The difference between min2 and the last advertised minimum delay, min1, does not exceed
the accelerated thresholds, therefore the accelerated advertisement is suppressed.

At time p4, the end of the periodic advertisement interval, the node computes the aggregated delay
metrics over the advertisement interval. The aggregated minimum delay metric is min2. The
difference between min2 and min1 exceeds the periodic thresholds, although it did not exceed the
accelerated thresholds. Therefore, the node advertises the aggregated delay metrics (min/max/avg/var
of (Probe3-metrics, Probe4-metrics)). The advertised minimum delay metric is min2.

At time p5, the accelerated thresholds are not exceeded.

Sometime during Probe5, the minimum delay of the link increases from min2 to min4.

At time p6, the end of the advertisement interval, the aggregated minimum delay over the
advertisement interval is min2. This is the lowest measured delay in Probe5 and Probe6. Since min2
is equal to the last advertised minimum delay metric, no periodic advertisement is triggered.
However, the difference between min2 and the minimum delay metric of the Probe6 interval, min4,
exceeds the accelerated threshold. Therefore, the node advertises the Probe6-metrics as accelerated
advertisement and resets the periodic advertisement interval.
Finally, at time p7, the Probe7-metrics do not exceed the accelerated thresholds, so the accelerated advertisement is suppressed.

15.4.2 Configuration
The IGP advertisement parameters can be customized, as shown in Example 15‑14. Periodic
advertisements are enabled by default. Accelerated advertisements are disabled by default. The IGP
automatically floods the delay metrics when requested by the performance measurement functionality.
No additional IGP configuration is required.

The relative thresholds and absolute minimum values of both advertisement modes can be configured.

Example 15-14: Global link delay measurement profile – advertisement configuration

performance-measurement
!! Global default profile for link delay measurement
delay-profile interfaces
advertisement
periodic
disabled (default: enabled)
interval < 30-3600 sec > (default: 120 sec)
threshold < 0-100 % > (default: 10%)
minimum < 0-100000 usec > (default: 500 usec)
accelerated (default: disabled)
threshold < 0-100 % > (default: 20%)
minimum < 1-100000 usec > (default: 500 usec)

Periodic IGP delay metric advertisements can be disabled altogether if desired. But even with
periodic advertisements disabled, the link delay metrics are still pushed via telemetry (EDT). This
way a controller can use the link delay metrics, as received via telemetry, while eliminating the
flooding of these metrics in the IGP.

15.4.3 Detailed Delay Reports in Telemetry


We have seen that the node not only floods the measured link delay metrics in IGP, but it also streams
the measurements via telemetry, periodic and event-driven. Chapter 20, "Telemetry" provides more
information about telemetry. This section only describes the applicability of telemetry for
performance measurement data.

Telemetry uses a push model as opposed to the traditional poll model to get data from the network.
Telemetry streams the data, skipping the overhead of polling.
Model Driven Telemetry (MDT) structures the streamed data based on a specified model. Native,
OpenConfig or IETF YANG models are available through model-driven telemetry.

MDT can periodically stream (push) telemetry data or only when the data has changed. The latter is
called Event Driven Telemetry (EDT). By only streaming data when it has changed, EDT reduces the
amount of unnecessary data that is streamed.

For the delay measurements, MDT is supported for the following data:

Summary, interface, session and counter show command content

Histogram data

In addition to the periodic streaming, EDT is supported for the following delay measurement data:

Delay metrics computed in the last probe-interval (event: probe-completed)

Delay metrics computed in the last aggregation-interval i.e., end of the periodic advertisement
interval (event: advertisement interval expired)

Delay metrics last flooded in the network (event: flooding-triggered)

The performance-measurement telemetry data entries are in the Cisco-IOS-XR-perf-meas-oper.yang


YANG module.
15.5 Usage of Link Delay in SR-TE
SR Policy dynamic path computation uses the minimum link delay metric for delay optimized paths.
As previously mentioned, this minimum link delay metric reflects the underlying optical circuit, which makes it possible to react to changes in the optical topology.

Example 15‑15 shows the configuration of an SR Policy ORANGE and an on-demand color template
for color 10 using the minimum link delay metric as the optimization objective.

Example 15-15: SR Policy with delay optimized dynamic path

segment-routing
traffic-eng
on-demand color 10
dynamic
metric
type delay
!
policy ORANGE
color 100 end-point ipv4 1.1.1.4
candidate-paths
preference 100
dynamic
metric
type delay

If the link delay metric of a given link is not advertised by the node, then SR-TE path computation falls back to the TE metric advertised for that link. The TE metric is used as if it expressed the link delay in μsec. For example, if a node does not advertise a link delay metric for a link but advertises a TE metric of 100 for this link, then SR-TE treats this TE metric 100 as a minimum link delay of 100 μsec in its delay-optimized path computations.

This fallback to TE metric helps incremental deployment of dynamic link delay metrics in networks.

The Flexible Algorithm (Flex-Algo) functionality (see chapter 7, "Flexible Algorithm") can use the
minimum link delay metric as optimization objective as well. However, Flex-Algo’s IGP path
computation does not fall back to using TE metric for links that do not advertise the minimum link
delay metric. Instead, any link that does not have a link delay metric is excluded from the Flex-Algo
topology.

Example 15‑16 shows a configuration of the Flex-Algo functionality using the measured minimum link
delay metric. Traffic is automatically steered on the low-delay path using the SR-TE Automated
Steering functionality (see chapter 5, "Automated Steering").

Example 15-16: Flexible Algorithm configuration using delay metric for path computation

router isis 1
flex-algo 128
metric-type delay
!
segment-routing
traffic-eng
on-demand color 10
dynamic
sid-algorithm 128
15.6 Summary
The Performance Monitoring functionality enables dynamic measurement of link delay.

SR-TE can use the measured link delay metrics to compute delay-optimized paths, including for the Flex-Algo functionality. The minimum delay measured over a time period is important for SR-TE, as it represents the (quasi-)static propagation delay of the link.

Link delay metrics are dynamically measured using a Query/Response packet exchange. This
mechanism is standardized in RFC 6374 for link delay measurements in an MPLS network. draft-
gandhi-spring-twamp-srpm provides an equivalent mechanism using a more generic TWAMP encoding.

The measured delay metrics are flooded in IGP and BGP-LS and streamed via telemetry. To reduce flooding churn on the network, new link delay metrics are only flooded when the minimum delay metric has changed significantly.

Link delays can be measured using a one-way or a two-way measurement. One-way measurement
requires accurate time synchronization between the local and remote node of the link.
15.7 References
[RFC6374] "Packet Loss and Delay Measurement for MPLS Networks", Dan Frost, Stewart
Bryant, RFC6374, September 2011

[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer
Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016

[RFC7471] "OSPF Traffic Engineering (TE) Metric Extensions", Spencer Giacalone, David Ward,
John Drake, Alia Atlas, Stefano Previdi, RFC7471, March 2015

[RFC8571] "BGP - Link State (BGP-LS) Advertisement of IGP Traffic Engineering Performance
Metric Extensions", Les Ginsberg, Stefano Previdi, Qin Wu, Jeff Tantsura, Clarence Filsfils,
RFC8571, March 2019

[RFC5586] "MPLS Generic Associated Channel", Martin Vigoureux, Stewart Bryant, Matthew
Bocci, RFC5586, June 2009

[RFC7213] "MPLS Transport Profile (MPLS-TP) Next-Hop Ethernet Addressing", Dan Frost,
Stewart Bryant, Matthew Bocci, RFC7213, June 2014

[RFC7876] "UDP Return Path for Packet Loss and Delay Measurement for MPLS Networks",
Stewart Bryant, Siva Sivabalan, Sagar Soni, RFC7876, July 2016

[RFC5481] "Packet Delay Variation Applicability Statement", Benoit Claise, Al Morton,


RFC5481, March 2009

[draft-gandhi-spring-twamp-srpm] "In-band Performance Measurement Using TWAMP for Segment


Routing Networks", Rakesh Gandhi, Clarence Filsfils, Daniel Voyer, draft-gandhi-spring-twamp-
srpm-00 (Work in Progress), February 2019

[RFC4656] "A One-way Active Measurement Protocol (OWAMP)", Matthew J. Zekauskas,


Anatoly Karp, Stanislav Shalunov, Jeff W. Boote, Benjamin R. Teitelbaum, RFC4656, September
2006
[RFC5357] "A Two-Way Active Measurement Protocol (TWAMP)", Jozef Babiarz, Roman M.
Krzanowski, Kaynam Hedayat, Kiho Yum, Al Morton, RFC5357, October 2008

1. DPM is a data-plane monitoring solution that provides highly scalable blackhole detection
capability. DPM validates control plane and data plane consistency, leveraging SR to steer
probes validating the local node’s own data plane.↩

2. Fiber propagation delay can vary over time, due to external influences. For example, as the
length of the fiber and its index of refraction are slightly temperature dependent, propagation
delay can change due to a changing temperature.↩

3. This is the exponential moving average, as described in Wikipedia:


https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average↩

4. The delay variance metric is computed as the Packet Delay Variation (PDV) specified in
RFC5481, section 4.2, based on the average and minimum delay: delay variance = average delay
– minimum delay.↩

5. The propagation delay of a fiber connection can be roughly assessed as 5 ms per 1000 km of
fiber.↩
16 SR-TE Operations
What we will learn in this chapter:

An SR Policy candidate path can consist of multiple segment lists, each associated with a weight.
Traffic-flows that are steered into the SR Policy are load-balanced over the segment lists in
proportion to their relative weights. This is called “Weighted ECMP”.

The path-invalidation drop functionality keeps an invalid SR Policy in the forwarding table as a
drop entry. All service traffic that is steered into the SR Policy is dropped instead of falling back
to the IGP shortest path to the nexthop of the service.

The MPLS PHP and explicit-null behaviors apply to segments in the SR Policies segment list. In
particular, the first entry in the segment list is not always applied on the packet steered into the SR
Policy since Penultimate Hop Popping (PHP) may apply to that segment.

TTL and TC/DSCP field values of incoming packets are propagated by default.

It is recommended to use the same SRGB on all nodes in an SR domain. Using heterogeneous
SRGBs significantly complicates network operations.

This chapter provides various practical indications for deploying and operating SR-TE. We start by
explaining how multiple SID lists can be associated with the same candidate path and how the traffic
is load-balanced among those. We describe a method that prevents traffic steered into an invalid SR
policy from being automatically redirected to the IGP shortest path. We then explain how generic
MPLS mechanisms, such as PHP, explicit null and TTL propagation, apply to SR-TE. Finally, we
provide some details on what to expect if some of the recommendations stated over the course of this
book are not followed.
16.1 Weighted Load-Sharing Within SR Policy Path
An SR Policy candidate path consists of one or more segment lists, as described in chapter 2, "SR
Policy".

If the active candidate path consists of multiple segment lists, traffic flows that are steered into the SR
Policy are load-balanced over these segment lists. The regular per-flow load-balancing mechanism is applied, using a hash function over the packet header fields.

Each segment list has an associated weight. The default weight is 1. Each segment list carries a
number of flows in proportion to its relative weight.

The fraction of flows carried by a given segment list with weight w is w/∑wi where w is the weight of
the segment list and ∑wi the sum of the weights of all segment lists of the candidate path. This is
called “Weighted ECMP” as opposed to “ECMP” where traffic-flows are uniformly distributed over
all paths. The accuracy of the weighted load-balancing depends on the platform implementation.

The weighted ECMP between segment lists can be used to distribute traffic over several paths in
proportion to the capacity available on each path.

Figure 16‑1 illustrates a network where the interconnections between the nodes have different
capacities. The connection between Node2 and Node4 consists of three GigabitEthernet links and the
connection between Node3 and Node4 consists of five GigabitEthernet links. The link between
Node2 and Node3 is a Ten GigabitEthernet link, as well as the other links in the drawing.

The operator wants to distribute the load of an SR Policy from headend Node1 to endpoint Node5
over the available capacity. Therefore, the operator configures two segment lists for the SR Policy,
SL1 via Node3 with weight 5 and SL2 directly to Node5 with weight 3. With this configuration 5/8
(62.5%) of the traffic-flows follow SL1 and 3/8 (37.5%) of the flows follow SL2.
Figure 16-1: Weighted ECMP between segment lists of an SR Policy

Example 16‑1 shows the SR Policy configuration on Node1. Two segment lists are configured: SL1
<16003, 16005> and SL2 <16005>. These segments lists are used in the candidate path with
preference 100 of the SR Policy named “WECMP”. SL1 is configured with weight 5 and SL2 has
weight 3.
Example 16-1: Weighted ECMP between segment lists of an SR Policy

segment-routing
traffic-eng
segment-list SL1
index 10 mpls label 16003
index 20 mpls label 16005
!
segment-list SL2
index 10 mpls label 16005
!
policy WECMP
color 20 end-point ipv4 1.1.1.5
candidate-paths
preference 100
explicit segment-list SL1
weight 5
!
explicit segment-list SL2
weight 3

The SR Policy forwarding entry, as displayed in Example 16‑2, shows the two segment lists’
forwarding entries with their weight values to distribute the traffic flows over the two paths.

Example 16-2: Weighted ECMP – SR Policy forwarding

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail


Color Endpoint Segment Outgoing Outgoing Next Hop Bytes
List Label Interface Switched
----- ----------- ------------ -------- ------------ ------------ --------
20 1.1.1.5 SL1 16003 Gi0/0/0/0 99.1.2.2 0
Label Stack (Top -> Bottom): { 16003, 16005 }
Path-id: 1, Weight: 320
Packets Switched: 0
SL2 16005 Gi0/0/0/1 99.1.7.7 0
Label Stack (Top -> Bottom): { 16005 }
Path-id: 2, Weight: 192
Packets Switched: 0

SR-TE has computed the weights in the output according to the configured weights. The computed
weight ratios in the output are the same as for the configured weights: 320/(320+192) = 5/8 and
192/(320+192) = 3/8.
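The configured and programmed weights can be checked to yield the same flow distribution with a small sketch (illustrative only):

def flow_fractions(weights):
    # Fraction of flows carried by each segment list: w / sum(w_i)
    total = sum(weights)
    return [w / total for w in weights]

# flow_fractions([5, 3])     -> [0.625, 0.375]   (5/8 and 3/8)
# flow_fractions([320, 192]) -> [0.625, 0.375]   (same ratios as configured)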
16.2 Drop on Invalid SR Policy
If an SR Policy becomes invalid, i.e., all its candidate-paths are invalid, then the default behavior is
to bring down the SR Policy and let all traffic that was steered into the SR Policy fall back on its
default forwarding entry, usually the IGP shortest path.

For example, prefix 2.2.2.0/24 with nexthop 1.1.1.4 and color 30 is steered into SR Policy GREEN.
When all candidate paths of SR Policy GREEN become invalid, then this SR Policy is invalidated
and its forwarding entry is removed. The prefix 2.2.2.0/24 is then steered onto the IGP shortest path
to its nexthop 1.1.1.4.

For some use-cases it is desirable for the traffic to be dropped when the SR Policy becomes invalid.
To prevent the traffic that is steered into an SR Policy from falling back on the IGP shortest path upon invalidation of this SR Policy, the path-invalidation-drop functionality can be enabled.

Path-invalidation-drop keeps the invalid SR Policy up, keeping its forwarding entry active but
dropping all traffic that is steered into it.

Let us consider an example use-case that requires strict separation between two disjoint paths. If one
of the paths becomes invalid, then the traffic that was steered into that SR Policy must not be steered
on the IGP shortest path since that is not guaranteed to be disjoint from the other path. The path-
invalidation-drop functionality drops the traffic of the invalid SR Policy such that the strict
disjointness requirement is not violated.

In another example, a traffic stream is replicated into two streams on two disjoint paths (live-live
redundancy). The rate of the streams is high compared to the capacity of the links carrying the traffic.
Some of the links cannot carry the combination of both streams. Steering both streams on such link
would result in congestion and packet loss affecting both streams. This is highly undesirable since it
defeats the purpose of live-live redundancy. Path-invalidation-drop will drop the stream’s packets at
the headend upon SR Policy invalidation. This prevents congestion and keeps the other stream
unaffected.

At the time of writing, this functionality was not available in IOS XR.
16.3 SR-MPLS Operations
The SR implementation for MPLS (SR-MPLS) leverages the existing MPLS architecture. This means
that the existing MPLS operations are also applicable to SR-MPLS and consequently also to SR-TE
for SR-MPLS.

16.3.1 First Segment


The first segment in an SR Policy’s segment list is sometimes not imposed on the packets steered into
this SR Policy as a consequence of applying the usual MPLS operations.

For simplicity, we assume that the active candidate path of a given SR Policy consists of a single
segment list <S1, S2, S3>. At first sight, one would expect that these three segments are imposed as
labels on the packets that are steered into the SR Policy. However, this is not always the case.
Sometimes the label of the first segment in the list is not pushed on these packets as a consequence of
applying the usual MPLS operations on this first segment.

Four cases can be distinguished with respect to the first segment in the list:

1. S1 is a Prefix-SID

a. S1 is a Prefix-SID of a downstream neighbor node (i.e., the shortest path to this connected node is via the direct link) and PHP is enabled for this Prefix-SID

b. S1 is a Prefix-SID of a downstream neighbor node and PHP is disabled for this Prefix-SID

c. S1 is a Prefix-SID of a node that is not a downstream neighbor node (i.e., the node is not a connected node or the shortest path to this connected node is not via the direct link)

2. S1 is an Adjacency-SID of a local adjacency or an EPE peering-SID of a local EPE peering session

In cases 1.a. and 2., the headend uses the first segment in the SID list to determine the outgoing
interface(s) and next-hop(s) for the SR Policy, but does not impose this first segment on the packet.

In cases 1.b. and 1.c., the head-end node uses the first segment to determine the outgoing interface(s)
and next-hop(s), and also imposes this first segment on the packet steered into the SR Policy.
16.3.2 PHP and Explicit-Null
The well-known MPLS operations on the penultimate hop node also apply to SR-MPLS packets.
These operations are: Penultimate Hop Popping (PHP), where the penultimate hop node pops the
label before forwarding, and Explicit-null, where the penultimate hop node swaps the top label with
the explicit-null label before forwarding.

PHP behavior is enabled by default for each Prefix-SID in IOS XR. This means that a node advertises
its Prefix-SIDs in the IGP with the PHP-off flag unset.

An operator may require QoS treatment based on the MPLS TC (formerly EXP) field value. In that
case, the packet must arrive at the final node with a label carrying the TC value. If the bottom label is
a service label, then the packet always arrives at the final node with a label. However, if the bottom
label is a transport label, then the penultimate hop must not pop the last label, i.e., PHP must not be
enabled for the bottom label.

In IOS XR, the default PHP behavior can be replaced by the explicit-null behavior. When enabling
explicit-null behavior, the penultimate hop no longer pops the bottom label but swaps it with the
explicit-null label.

In a first method to apply explicit-null behavior for an SR Policy, the bottom label is a Prefix-SID and
explicit-null is enabled for this Prefix-SID. This can be done by adding explicit-null to the
Prefix-SID configuration on the endpoint node, as shown in Example 16‑3.

Example 16-3: Explicit-null behavior for Prefix-SID

router isis 1
interface Loopback0
address-family ipv4 unicast
prefix-sid absolute 16001 explicit-null
!
router ospf 1
area 0
interface Loopback0
prefix-sid absolute 16001 explicit-null

The penultimate hop node of the endpoint node will receive the packets with this Prefix-SID as top
label and due to the requested explicit-null behavior, this penultimate hop node will swap the Prefix-
SID label with the explicit-null label.
Another method to convey the TC value up to the endpoint of an SR Policy consists in adding the
explicit-null label as the last segment (bottom label) in an explicit SR Policy segment list. This will
make the packets arrive at the SR Policy’s endpoint node with an explicit-null label as only label.

16.3.3 MPLS TTL and Traffic-Class


SR-TE for SR-MPLS can push one or more MPLS labels on packets (unlabeled and labeled) that are
steered into the SR Policy. The MPLS label imposition follows the generic procedures that are
described in detail in Segment Routing Part I.

In summary, for an incoming unlabeled packet, the default behavior is to copy the TTL of the IP
header to the MPLS TTL field of the MPLS label after decrementing it by one. In case multiple labels
are imposed, all imposed labels get the same MPLS TTL value.

This behavior can be disabled by configuring mpls ip-ttl-propagate disable. With this
configuration, the MPLS TTL fields of all imposed labels are set to 255, regardless of the IP TTL
value in the received IP packet.
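A minimal global configuration sketch of this option is shown below; it is illustrative only, so verify the exact command syntax for your software release.

!! Do not copy the IP TTL into imposed MPLS labels;
!! imposed labels then carry MPLS TTL 255
mpls ip-ttl-propagate disable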

For an incoming labeled packet, the TTL of the incoming top label is decremented and the swapped
label as well as all imposed labels get this decremented TTL value in their MPLS TTL field.

For an incoming unlabeled packet, the first 3 bits of the DSCP field in the IP header are copied to the
MPLS TC field of all imposed MPLS labels.

For an incoming labeled packet, the MPLS TC field of the top label is copied to the TC field of the
swapped label and all imposed MPLS labels.
16.4 Non-Homogenous SRGB
It is strongly recommended to use the same SRGB on all nodes in the SR domain, although the SR
architecture allows the use of different SRGBs on different nodes. Refer to Part I of the SR book series
for more information about the SRGB. The use of heterogeneous SRGBs has implications on SR-TE,
particularly when manually specifying Prefix-SID label values in an explicit path’s segment list of an
SR Policy.

“Using the same SRGB on all nodes within the SR Domain has undoubtedly significant advantages in simplification of
administration, operation and troubleshooting. Also programming the network is simplified by using this model, and Anycast
Segments can be used without added complexity. Using the same SRGB network-wide is expected to be a deployment
guideline. As mentioned in the expired draft-filsfils-spring-segment-routing-use-cases “Several operators have indicated
that they would deploy the SR technology in this way with a single consistent SRGB across all the nodes. They motivated
their choice based on operational simplicity ...”. ”

— Kris Michielsen

The Prefix-SID label value specified in the segment list is the label value for that Prefix-SID as
known by the node that needs to interpret it. This is most easily explained by an example. See
Figure 16‑2 (same as Figure 4.3. in Segment Routing Part I). The SRGB of the four nodes in the
topology is indicated above each node. Each node advertises a Prefix-SID index equal to its node
number.
Figure 16-2: Multi-label stacks and different SRGBs

Node1 applies a segment list <Prefix-SID(3), Prefix-SID(4)> to steer traffic towards Node4 via
Node3. Node1 needs to consider the SRGB of the correct nodes to determine which label values for
the Prefix-SIDs to push on the packets.

To compute the required label value for the first SID in the SID list – Prefix-SID(3) – Node1 uses the
SRGB of its nexthop towards Node3, which is Node2. Node2 would receive the packet with that
label value as top label and must be able to identify it as Prefix-SID(3). Hence, the label value must
be associated with Prefix-SID(3) in Node2’s SRGB. Node1 computes the label value for Prefix-
SID(3) by adding the SID index 3 to the SRGB base of Node2: 21000 + 3 = 21003.

Equivalently, Node1 uses the SRGB of Node3 to compute the label value for the second SID in the
SID list, which is Prefix-SID(4). Node3 receives the packets with this second SID’s label as top
label. Therefore, Node1 needs to compute the label value of the second SID in the label context of
Node3. Node1 adds the SID index 4 to the SRGB base of Node3: 22000 + 4 = 22004.

When imposing the SID list <Prefix-SID(3), Prefix-SID(4)> as label values, Node1 imposes the
label stack <21003, 22004> on the packets.

The SR Policy configuration on Node1 is shown in Example 16‑4. However, the first label value in
the configuration is 16003, which is different from 21003 that was derived above.
Example 16-4: Non-homogenous SRGB – SR Policy configuration on Node1

segment-routing
traffic-eng
segment-list name SIDLIST1
index 10 mpls label 16003 !! Prefix-SID(3)
index 20 mpls label 22004 !! Prefix-SID(4)
!
policy ORANGE
color 10 end-point ipv4 1.1.1.4
candidate-paths
preference 100
explicit segment-list SIDLIST1

The label value for the first segment in the segment list (16003 in the example) is the Prefix-SID(3)
label value in the context of the headend node, thus using the local SRGB of the head-end node
([16000-23999] in the example). The SID index of Prefix-SID(3) is 3, therefore the required label
value is 16000 + 3 = 16003.

Configuring the local label value for the first segment makes this configuration independent of the
downstream neighbor. Internally, the correct label value is computed matching the downstream
neighbor’s SRGB.

In the example, Node1 imposes the outgoing label value 21003 as top label, as shown in
Example 16‑5.

Example 16-5: Non-homogenous SRGB – SR Policy forwarding entry on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail


Color Endpoint Segment Outgoing Outgoing Next Hop Bytes
List Label Interface Switched
----- ----------- ------------ -------- ------------ ------------ --------
10 1.1.1.4 SIDLIST1 21003 Gi0/0/0/0 99.1.2.2 0
Label Stack (Top -> Bottom): { 21003, 16004 }
Path-id: 1, Weight: 64
Packets Switched: 0

Using a homogenous SRGB on all nodes in the SR domain avoids all the complexity of deriving SID
label values.
16.5 Candidate-Paths With Same Preference
The recommended design rule is to use a different preference for all candidate paths of a given SR
Policy. This makes it straightforward to select the active path of the SR Policy: the valid candidate
path with the highest preference value.

However, draft-ietf-spring-segment-routing-policy does not require uniqueness of preference, which
would be difficult to impose given that the candidate paths can come from multiple, possibly
independent sources.

Instead, a set of tie-breaking rules are used to select the active path for an SR Policy among a set of
candidate paths with the same preference value.

A candidate path of an SR Policy is uniquely identified by the tuple <Protocol-Origin, Originator,
Discriminator> in the context of a single SR Policy that is identified by the tuple <headend, color,
endpoint>.

Protocol-Origin:

Numerical value that identifies the component or protocol that originates or signals the candidate
path. Recommended values are 10: PCEP; 20: BGP-SR-TE; 30: configuration.

Originator:

Tuple <ASN, node-address> with ASN represented as a 4-byte number and node-address being
an IPv4 or IPv6 address.

Discriminator:

Numerical value to distinguish between multiple candidate paths with common Protocol-Origin
and Originator.

This identifier tuple is used in the tie-breaking rules to select a single valid candidate path for an SR
Policy out of a set of candidate paths with the same preference value.

The tie-breaking rules are evaluated in the following order:


1. Prefer higher Protocol-Origin value

2. Prefer existing installed path (optional, only if this rule is activated by configuration)

3. Prefer lower Originator value

4. Prefer higher Discriminator value


16.6 Summary
Traffic steered into an SR Policy can be load-balanced over different paths by specifying multiple
segment lists under the candidate path. Each segment list has a weight and traffic-flows are load-
balanced over the segment lists proportional to their relative weights.

We have seen how a service can be strictly steered onto a policy that acts as a bit bucket when the
selected candidate path is invalid. Specifically, when the SR Policy becomes invalid, instead of
falling back to the IGP shortest path to the next-hop of the service, the service route is kept on the
invalid policy, which remains in the forwarding table as a drop entry.

SR-MPLS, and thus also SR-TE for SR-MPLS, leverages the existing MPLS architecture. This implies
that the PHP and explicit-null behaviors also apply to the segments in an SR Policy’s segment list.

We have seen that the first entry in the segment list is not always applied on the packet steered into the
SR Policy since Penultimate Hop Popping (PHP) may apply to that segment.

When imposing labels on a packet, the TTL field of the imposed labels can be copied from the IP
header TTL field for unlabeled incoming traffic or it is copied from the top label for incoming
labeled traffic. The MPLS TC field is copied from the DSCP field in the IP header for incoming
unlabeled traffic or from the top label of an incoming labeled packet. All imposed labels get the same
TTL and TC field values.

Using different SRGBs on different nodes in the network complicates network operations. In
particular, defining an explicit segment list using label values becomes more difficult.
16.7 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar,
October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121,
<https://www.amazon.com/gp/product/B01I58LSUO>,
<https://www.amazon.com/gp/product/1542369126>

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

[draft-filsfils-spring-segment-routing-use-cases] "Segment Routing Use Cases", Clarence Filsfils, Pierre Francois, Stefano Previdi, Bruno Decraene, Stephane Litkowski, Martin Horneffer, Igor Milojevic, Rob Shakir, Saku Ytti, Wim Henderickx, Jeff Tantsura, Sriganesh Kini, Edward Crabbe, draft-filsfils-spring-segment-routing-use-cases-01 (Work in Progress), October 2014

[draft-filsfils-spring-sr-policy-considerations] "SR Policy Implementation and Deployment Considerations", Clarence Filsfils, Ketan Talaulikar, Przemyslaw Krol, Martin Horneffer, Paul Mattes, draft-filsfils-spring-sr-policy-considerations-02 (Work in Progress), October 2018

Section III – Tutorials
This section contains tutorials of the protocols used for SR-TE.
17 BGP-LS
BGP Link-State (BGP-LS) provides the mechanism to advertise network topology and other network
information via BGP. The initial specification of BGP-LS, as described in RFC 7752, defines how to
use BGP to convey the content of the link-state protocol database (LS-DB) and the Traffic-
Engineering Database (TED) to external components such as a PCE. This explains the name “Link-
state” of the BGP-LS address-family.

BGP transport makes it possible to convey the topology information of an IGP area to remote
locations, even across domain and AS boundaries, in a scalable manner using a robust and proven
protocol. External devices that are not participating in a link-state protocol (ISIS or OSPF) can
collect network-wide topology information using BGP-LS.

Visibility into the entire network allows applications, such as the Path Computation Element (PCE),
to extend the use of TE techniques to the whole network in an optimal way.

BGP-LS can also be used as a common interface to retrieve topology information from an entire
network for applications that require it. It uses its own abstract topology model that hides many of the
differences between the advertisements of the different protocols that are injected into BGP-LS.
BGP-LS can be used as an Application Programming Interface (API) to collect network information.

Since its initial specification in RFC 7752, BGP-LS has been extended in order to carry other types
of information, such as SR and performance data:

Segment Routing – draft-ietf-idr-bgp-ls-segment-routing-ext

IGP TE Performance Metric Extensions – draft-ietf-idr-te-pm-bgp: TE IGP metric extensions (from RFC7471 and RFC7810)

Egress Peer Engineering – draft-ietf-idr-bgpls-segment-routing-epe

TE Policies – draft-ietf-idr-te-lsp-distribution

It is important to note that the BGP-LS address-family is not limited to carrying routing information.
The address-family can be extended to carry any information, for example:
Information taken from other sources than link-state databases

Information related to topology

Information related to a node’s configuration and state

Information related to services

Examples of these BGP-LS extensions are draft-ketant-idr-bgp-ls-bgp-only-fabric, which specifies how
BGP-LS carries topology information of a BGP-only network, and draft-dawra-idr-bgp-ls-sr-service-
segments, which specifies how BGP-LS is used to advertise service segments.
17.1 BGP-LS Deployment Scenario
Figure 17‑1 illustrates a typical deployment scenario for BGP-LS. Multiple BGP speakers in the
network are enabled for BGP-LS and form BGP-LS sessions with one or more centralized BGP
speakers, such as RRs, over which they convey their local topology information. This local
information can come from IGP, BGP, or another source.

Figure 17-1: Typical BGP-LS deployment model

Using the regular BGP propagation mechanisms, any BGP speaker may obtain the consolidated BGP-
LS information of the entire network as provided by all other BGP speakers.

An external component, such as an SR PCE, can obtain this consolidated information by tapping into a
centralized BGP speaker or any other BGP speaker that has the aggregated BGP-LS information.

An internal component of a BGP-enabled node, such as the SR-TE process on a headend, can obtain
the aggregated BGP-LS information from its local BGP process.

The entities and nodes in Figure 17‑1 are assuming different roles in the dissemination of BGP-LS
information.

BGP-LS Producer:
The BGP speaker that advertises local information (e.g., IGP, SR, PM) into BGP-LS.

The BGP speakers Node3, Node7, Node9, and Node12 originate link-state information from
their IGP into BGP-LS. Node7 and Node9 are in the same IGP area, so they originate the
same link-state information into BGP-LS. A node may also originate non-IGP information into
BGP-LS, e.g., its local node information.

BGP-LS Propagator:

The BGP speaker that propagates BGP-LS information from producers to other BGP-LS
speakers, and eventually consumers.

The BGP speaker Node1 propagates the BGP-LS information between the BGP speakers Node3,
Node7, Node9, and Node12. Node1 performs BGP best-path selection and propagates BGP-LS
updates.

BGP-LS Consumer:

The application or process that leverages the BGP-LS information to compute paths or perform
network analysis. The BGP-LS consumer is not a BGP speaker.

The SR PCEs are BGP speakers that provide the BGP-LS information that they have collected to
a consumer application. The BGP protocol implementation and the consumer application may be
on the same or different nodes (e.g., local SR-TE process or remote application retrieving BGP-
LS information via a North-bound interface).

These roles are not mutually exclusive. The same node may be Producer for some link-state
information and Propagator for some other link-state information while also providing this
information to a Consumer application.
17.2 BGP-LS Topology Model
A node running a link-state IGP in essence distributes its local connectivity (neighbors and prefixes)
in a link state advertisement (LSP/LSA) to all other nodes in the IGP area.

Each node combines these LSPs/LSAs as pieces of a jigsaw puzzle to form a complete map or graph
of the topology. The IGP then uses this topology map to compute the shortest path tree (SPT) and
derive prefix shortest path reachability.

Figure 17‑2 illustrates this for a three-node topology. Each node announces its own adjacencies to the
other nodes. Note that this topology graph represents a logical topology, not a physical topology. For
example, nodes or links that are not enabled for this IGP are not present in the graph.

Figure 17-2: Link-state IGP LSPs/LSAs and topology model

BGP-LS does not simply encapsulate these IGP LSPs/LSAs in BGP, instead it transcodes the
information they contain into a more abstract topology model based on three classes of objects:
nodes, links, and prefixes.
The IGP topology is transcoded into the BGP-LS model in order to overcome the differences between
OSPF and ISIS, but also to include information from other sources, such as BGP.

The use of Type/Length/Value (TLV) structures to encode the data makes BGP-LS easily extensible
without requiring any change to the underlying transport protocol. For example, new classes of
objects, such as TE Policies, have been added to the model (draft-ietf-idr-te-lsp-distribution).

Figure 17-3: BGP-LS topology model

The three base BGP-LS classes of objects are illustrated in Figure 17‑3.

A node object represents a node, which is typically a router or a routing protocol instance of a
router.

A link object represents a directed link. A link object is anchored to two anchor nodes: a local
node and a remote node. One half-link can have different characteristics than the corresponding
half-link in the other direction.

A prefix object is anchored to the node that originates the prefix. Multiple nodes can originate the
same prefix, in which case multiple prefix objects exist, one for each node that originates the
prefix.

The network in Figure 17‑3 contains three nodes and six (half-) links. Only one prefix object is shown
in the illustration, although, in reality, a loopback prefix is defined for each node and an interface
prefix for each (half-)link.
17.3 BGP-LS Advertisement
BGP-LS uses the Multi-Protocol extensions of BGP (MP-BGP), as specified in RFC4760, to carry
the information. The base BGP-LS specification (RFC7752) defines a new address-family named
“Link-state”.

The Address Family Identifier (AFI) for BGP-LS is 16388; the Subsequent Address Family Identifier
(SAFI) is 71 for a non-VPN topology and 72 for a VPN topology.

When defining a new address-family, the Network Layer Reachability Information (NLRI) format for
that address-family must also be specified.

Three BGP-LS NLRI types are defined in RFC7752: Node, Link, and Prefix. IETF draft-ietf-idr-te-
lsp-distribution specifies a fourth type of BGP-LS NLRI: TE Policy. These different types of BGP-LS
NLRIs are described in the next sections.

RFC7752 also defines a Link-state Attribute that is used to carry additional parameters and
characteristics for a BGP-LS NLRI. This Link-state Attribute is advertised together with the BGP-LS
NLRI that it applies to.

Figure 17‑4 is a high-level illustration of a BGP-LS Update message, showing the different attributes
that are present in such a BGP-LS Update message.

A BGP-LS Update message contains the mandatory attributes ORIGIN, AS_PATH, and, for iBGP
advertisements, LOCAL_PREF.

The BGP-LS NLRI is included in the MP_REACH_NLRI attribute (RFC4760). This attribute also
contains a Next-hop, which is the BGP nexthop as we know it. It is the IPv4 or IPv6 BGP session
address if the advertising node applies next-hop-self.

The Link-state Attribute, as mentioned above, contains the properties of the NLRI included in the
Update message.
Figure 17-4: BGP-LS Update message showing the different attributes
The usual BGP protocol procedures, specifically best-path selection, also apply to BGP-LS paths.

Among all the paths received for a given BGP-LS NLRI, only the best-path is selected and
propagated to the other BGP neighbors. Furthermore, only the best-path is provided to the external
consumers such as SR-TE.

BGP applies the regular selection rules to select the best-path for each BGP-LS NLRI. One of the
conditions for BGP to consider a path is the reachability of the BGP nexthop. If the BGP nexthop is
unreachable then the path is not considered.

The best-path selection can be influenced by modifying the attributes of the paths using BGP route-
policies.
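As a hedged sketch of such an approach (the policy name, neighbor address, and local-preference value are illustrative), an inbound route-policy on the BGP-LS session of a preferred producer could raise the local-preference of its paths:

route-policy PREFER_THIS_PRODUCER
set local-preference 200
end-policy
!
router bgp 1
neighbor 1.1.1.2
address-family link-state link-state
route-policy PREFER_THIS_PRODUCER in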

In the next section, we will look at the two main parts of the BGP-LS advertisement: BGP-LS NLRI
and Link-state Attribute.

17.3.1 BGP-LS NLRI


Earlier in this chapter we have introduced the BGP-LS topology abstraction that models a network
using three classes of objects: nodes, links, prefixes.

Each BGP-LS object is identified by its NLRI. The NLRI is the key to the corresponding object entry
in the BGP-LS database, while the Link-state Attribute that is advertised with the NLRI contains the
properties and characteristics of this object.

Since the BGP-LS NLRI is the key to an object, it must contain sufficient data to uniquely identify an
object. The remaining data – that is not required to identify the object – is carried in the associated
Link-state Attribute.

The general format of a BGP-LS NLRI is shown in Figure 17‑5. The NLRI-type defines the object
class (node, link, prefix, …). The data of the NLRI is a set of TLVs.
Figure 17-5: BGP-LS NLRI format

RFC 7752 defines four types of BGP-LS NLRIs:

Node NLRI-Type (type 1): describes a node

Link NLRI-Type (type 2): describes a (directed) link

IPv4 Prefix NLRI-Type (type 3): describes an IPv4 prefix

IPv6 Prefix NLRI-Type (type 4): describes an IPv6 prefix

IETF draft-ietf-idr-te-lsp-distribution defines a fifth type:

TE Policy NLRI-Type (type 5): describes a TE Policy (SR TE Policy, MPLS Tunnel, IP Tunnel, MPLS state)

The general format of these BGP-LS NLRI-types is shown in Figure 17‑6. The first two fields in the
BGP-LS NLRI are the same for all NLRI types: Protocol-ID and Identifier. Depending on the NLRI
type, it contains various other Descriptor fields as described further.
Identifier and Instance-ID
The Identifier field in the BGP-LS NLRI is the 64-bit number that identifies a “routing universe”. RFC7752 also uses the
name “Instance-ID” for the Identifier field.

The NLRI “Identifier” field must not be confused with the “BGP-LS Identifier” that is a 32-bit Node Descriptor Sub-TLV.

The use of the BGP-LS Identifier is discouraged in IOS XR. The NLRI Identifier field containing the Instance-ID is the
recommended way to distinguish between different domains. The BGP-LS Identifier field has value 0 in IOS XR.

Figure 17-6: General format of the BGP-LS NLRI

Remember that a BGP-LS NLRI forms a unique key for an object in the BGP-LS database. The BGP-
LS database can consist of multiple (logical) topologies that can partially or completely overlap.
Even for overlapping topologies, each of the objects must be uniquely identifiable.

For example, when migrating a network from OSPF to ISIS, both IGPs can be enabled at the same
time on all nodes and links of the physical topology. In that case, the OSPF and ISIS topologies
completely overlap. BGP-LS must be able to distinguish the two independent topologies. For
example, a given node must have a key (NLRI) for its node object in the OSPF topology and a
different key (NLRI) for its node object in the ISIS topology.

The two fields that are present in all BGP-LS NLRIs, Protocol-ID and Identifier, make it possible to
maintain separation between multiple partially or completely overlapping topologies. We will start by
looking at these two fields.
17.3.1.1 Protocol-ID Field
The Protocol-ID field of a BGP-LS NLRI identifies the protocol that is the source of the topology
information. It identifies the protocol that led to the creation of this BGP-LS NLRI.

Table 17‑1 lists the possible values of the Protocol-ID field. For example, a BGP-LS NLRI that is
generated based on the information in an OSPFv2 LS-DB, will have a Protocol-ID value 3
(OSPFv2). A BGP-LS NLRI that is generated based on Level-2 ISIS LS-DB entry will have a
Protocol-ID value 2 (ISIS L2).

Table 17-1: BGP-LS Protocol-IDs

Protocol-ID   NLRI information source protocol   Specification
1             IS-IS Level 1                      RFC7752
2             IS-IS Level 2                      RFC7752
3             OSPFv2                             RFC7752
4             Direct                             RFC7752
5             Static configuration               RFC7752
6             OSPFv3                             RFC7752
7             BGP                                draft-ietf-idr-bgpls-segment-routing-epe
8             RSVP-TE                            draft-ietf-idr-te-lsp-distribution
9             Segment Routing                    draft-ietf-idr-te-lsp-distribution

17.3.1.2 Identifier Field


Multiple instances of a routing protocol may run over the same set of nodes and links, for example
using RFC6549 for multi-instance OSPFv2 or using RFC8202 for multi-instance ISIS.

Each routing protocol instance defines an independent topology, an independent context, also known
as a “routing universe”. The 64-bit Identifier field in each BGP-LS NLRI makes it possible to determine
which NLRI belongs to which routing protocol instance or routing universe.

The Identifier field is used to “stamp” each NLRI with a value that identifies the routing universe the
NLRI belongs to. All NLRIs that identify objects (nodes, links, prefixes) from a given routing
universe have the same Identifier value. NLRIs with different Identifier values are part of different
routing universes.

The Identifier is defined as a flat 64-bit value. RFC 7752 reserves Identifier values 0-31, with 0
indicating the “Default Layer 3 Routing topology”. Values in the range 32 to 2⁶⁴-1 are for “Private
Use” and can be freely used.

Figure 17‑7 illustrates the use of the Identifier field to discriminate between IGP topologies
advertised through the same protocol. Both nodes in the topology run two identically configured ISIS
instances, resulting in two logical topologies that entirely overlap. To enable BGP-LS to distinguish
these topologies, a different Identifier field value is used in the BGP-LS NLRIs for each ISIS
instance. This makes the NLRIs of each logical topology unique, even though all other NLRI fields are
equal for both topologies.

Figure 17-7: Example of using Identifier field as discriminator between IGP topologies

In IOS XR, the Identifier value can be configured using the instance-id keyword in the distribute
link-state command, as illustrated in Example 17‑1. The default Identifier value is 0, identifying
the “Default Layer 3 Routing topology”. ISIS instance SR-ISIS-1 in Example 17‑1 has Identifier 1000,
ISIS instance SR-ISIS-2 has Identifier 2000, and OSPF instance SR-OSPF has Identifier 32.
Example 17-1: Configuration of instance-id

router isis SR-ISIS-1


distribute link-state instance-id 1000
!
router isis SR-ISIS-2
distribute link-state instance-id 2000
!
router ospf SR-OSPF
distribute link-state instance-id 32

Two instances of the same protocol on a node cannot have the same instance-id. This is enforced
during configuration. The IGP instance of all nodes in the same IGP domain must have the same
instance-id since they belong to the same routing universe.

17.3.2 Node NLRI


The Node NLRI is the key to identify a node object in the BGP-LS database. This key must be
globally unique for each node within the entire network (BGP-LS database).

A physical node that is part of multiple routing universes is represented by multiple Node NLRIs, one
for each routing universe the physical node participates in.

The format of an BGP-LS Node NLRI is shown in Figure 17‑8.

Figure 17-8: Node NLRI format

Besides the Protocol-ID and the Identifier fields that were discussed in the previous sections, the
Node NLRI also contains a Local Node Descriptors field. This field consists of a set of one or more
TLVs that uniquely identifies the node.

The possible Node Descriptor TLVs are:


Autonomous System: Opaque 32-bit number, by default the AS number of the BGP-LS originator

BGP-LS Identifier (BGP-LS ID): Opaque 32-bit number, by default 0

OSPF Area-ID: 32-bit number identifying the OSPF area of the NLRI

IGP Router-ID: Opaque value of variable size, depending on the type of node:

ISIS non-pseudonode: 6-octet ISO system-id of the node

ISIS pseudonode: 6-octet ISO system-id of the Designated Intermediate System (DIS) followed
by the 1-octet, nonzero Pseudonode identifier (PSN ID)

OSPFv2 and OSPFv3 non-pseudonode: 4-octet Router-ID

OSPFv2 pseudonode: 4-octet Router-ID of the Designated Router (DR) followed by the 4-octet
IPv4 address of the DR's interface to the LAN

OSPFv3 pseudonode: 4-octet Router-ID of the DR followed by the 4-octet interface identifier of
the DR's interface to the LAN

BGP Router Identifier (BGP Router-ID): 4-octet BGP Router-ID, as defined in RFC4271 and
RFC6286.

Confederation Member ASN (Member-ASN): 4-octet number representing the member ASN
inside the Confederation.

See the sections 17.6 and 17.7 below for illustrations of Node NLRIs.

17.3.3 Link NLRI


A Link NLRI uniquely identifies a link object in the BGP-LS database. The format of a Link NLRI is
shown in Figure 17‑9.
Figure 17-9: Link NLRI format

A link object, as identified by a BGP-LS Link-type NLRI, represents a unidirectional link (“half-
link”). A link object is identified by the two end nodes of the link, the local node and the remote node,
also known as the “anchor nodes”, and further information to distinguish different links between the
same pair of nodes.

The BGP-LS Link-type NLRI contains a Local Node Descriptors field and a Remote Node
Descriptors field that define the anchor nodes. These two Node Descriptors fields can contain the
same Node Descriptor TLVs as discussed in the previous section about the Node NLRI.

The Link NLRI also contains a Link Descriptors field to further distinguish different links between the
same pair of nodes. These are the possible TLVs that can be added to the Link Descriptors field:

Link Local/Remote Identifier: local and remote interface identifiers, used when the link has no IP
address configured (unnumbered). In IOS XR, the SNMP ifIndex of the interface is used as
interface identifier.

IPv4 interface address: IPv4 address of the local node’s interface

IPv4 neighbor address: IPv4 address of the remote node’s interface

IPv6 interface address: IPv6 address of the local node’s interface

IPv6 neighbor address: IPv6 address of the remote node’s interface


Multi-Topology Identifier: MT-ID TLV containing the (single) MT-ID of the topology where the
link is reachable

See the sections 17.6 and 17.7 below for illustrations of Link NLRIs.

17.3.4 Prefix NLRI


A Prefix NLRI uniquely identifies a prefix object in the BGP-LS database. The Prefix NLRI format,
as displayed in Figure 17‑10, has a Local Node Descriptors field and a Prefix Descriptors field.

Figure 17-10: Prefix NLRI format

The Local Node Descriptors field uniquely identifies the node that originates the prefix, which is the
prefix’s anchor node. The same TLVs can be used for this Node Descriptors field as described in the
above section for the Node NLRI.

The Prefix Descriptors field is then used to uniquely identify the prefix advertised by the node. The
possible TLVs that can be included in the Prefix Descriptors field are:

Multi-topology Identifier: a TLV containing the (single) MT-ID of the topology where the prefix is
reachable

OSPF Route Type: the type of OSPF route as defined in the OSPF protocol – 1: Intra-area; 2:
Inter-area; 3: External type 1; 4: External type 2; 5: NSSA type 1; 6: NSSA type 2

IP Reachability Information: a TLV containing a single IP address prefix (IPv4 or IPv6) and its
prefix length; this is the prefix itself
See the sections 17.6 and 17.7 below for illustrations of Prefix NLRIs.

17.3.5 TE Policy NLRI


A TE Policy NLRI uniquely identifies a TE Policy object in the BGP-LS database. The format of a
BGP-LS TE Policy NLRI is shown in Figure 17‑11. This NLRI is specified in draft-ietf-idr-te-lsp-
distribution.

A TE Policy object represents a unidirectional TE Policy. A TE Policy object is anchored to the TE
Policy’s Headend node. The Headend Node Descriptors field in the NLRI identifies this anchor node.
This field can use the same Node Descriptors TLVs as described in the above section for the Node
NLRI.

The TE Policy Descriptors field in the NLRI is used to uniquely identify the TE Policy on the
headend anchor node.

Figure 17-11: TE Policy NLRI format

The possible TLVs that can be included in the TE Policy Descriptors field are:

Tunnel ID: 16-bits Tunnel Identifier, as specified in RFC3209 (RSVP-TE: Extensions to RSVP for
LSP Tunnels)

LSP ID: 16-bits LSP Identifier, as specified in RFC3209

IPv4/6 Tunnel Head-end address: IPv4/IPv6 Tunnel Head-End Address, as specified in RFC3209

IPv4/6 Tunnel Tail-end address: IPv4/IPv6 Tunnel Tail-End Address, as specified in RFC3209

SR Policy Candidate Path: the tuple (Endpoint, Policy Color, Protocol-Origin, Originator ASN,
Originator Address, Distinguisher) as defined in draft-ietf-spring-segment-routing-policy

Local MPLS Cross Connect: a local MPLS state in the form of an incoming label and an interface
followed by an outgoing label and an interface

At time of writing, IOS XR does not advertise TE Policies in BGP-LS.

17.3.6 Link-State Attribute


While a BGP-LS NLRI is the unique identifier or key of a BGP-LS object, the Link-state Attribute is
a container for additional characteristics of the object. Examples of object characteristics are the
node’s name, the link bandwidth, the prefix route-tag, etc.

Refer to the different IETF documents for the possible information that can be included in the Link-
state Attribute:

BGP-LS – RFC7752

Segment Routing – draft-ietf-idr-bgp-ls-segment-routing-ext

IGP TE Performance Metric Extensions – draft-ietf-idr-te-pm-bgp

Egress Peer Engineering – draft-ietf-idr-bgpls-segment-routing-epe

TE Policies – draft-ietf-idr-te-lsp-distribution

We will look at some examples in the illustrations sections 17.6 and 17.7 below.
17.4 SR BGP Egress Peer Engineering
Chapter 14, "SR BGP Egress Peer Engineering" describes the EPE functionality in detail. SR BGP
EPE allocates a Peering SID for a BGP session and advertises the BGP peering session information
in BGP-LS for a controller to use this information in its path computation.

EPE is enabled on a BGP session as shown in Example 17‑2.

Example 17-2: EPE configuration

router bgp 1
neighbor 99.4.5.5
egress-engineering

Since the EPE-enabled BGP Peerings are advertised in BGP-LS as links (using link-type NLRI), they
are identified by their local and remote anchor nodes plus some additional information to distinguish
multiple parallel links between the two nodes.

The advertisements of the different Peering Segment types are described in the following sections.

17.4.1 PeerNode SID BGP-LS Advertisement


The PeerNode SID is advertised with a BGP-LS Link-type NLRI, See Figure 17‑9. The following
information is provided in the BGP-LS NLRI and BGP-LS Attribute:

BGP-LS Link NLRI:

Protocol-ID: BGP

Identifier: 0 in IOS XR

Local Node Descriptors contains:

Local BGP Router-ID

Local ASN

Remote Node Descriptors contains:

Peer BGP Router-ID


Peer ASN

Link Descriptors contains (depending on the address-family used for the BGP session):

IPv4 Interface Address TLV contains the BGP session IPv4 local address

IPv4 Neighbor Address TLV contains the BGP session IPv4 peer address

IPv6 Interface Address TLV contains the BGP session IPv6 local address

IPv6 Neighbor Address TLV contains the BGP session IPv6 peer address

BGP-LS Attribute contains:

PeerNode SID TLV

Other TLV entries may be present for each of these fields, such as additional node descriptor fields as
specified in BGP-LS RFC 7752. In addition, BGP-LS Node and Link Attributes, as defined in RFC
7752, may be added in order to advertise the characteristics of the link.

Examples of a PeerNode-SID advertisements are included in chapter 14, "SR BGP Egress Peer
Engineering", section 14.5.

17.4.2 PeerAdj SID BGP-LS Advertisement


The PeerAdj SID is advertised with a BGP-LS Link-type NLRI, See Figure 17‑9. The following
information is provided in the BGP-LS NLRI and BGP-LS Attribute:

BGP-LS Link NLRI:

Protocol-ID: BGP

Identifier: 0 in IOS XR

Local Node Descriptors contains TLVs:

Local BGP Router-ID

Local ASN
Remote Node Descriptors contains TLVs:

Peer BGP Router-ID

Peer ASN

Link Descriptors contains TLV:

Link Local/Remote Identifiers contains the 4-octet Link Local Identifier followed by the 4-octet
value 0 indicating that the Link Remote Identifier is unknown

Link Descriptors may contain in addition TLVs:

IPv4 Interface Address TLV contains the IPv4 address of the local interface through which the BGP
session is established

IPv4 Neighbor Address TLV contains the BGP session IPv4 peer address

IPv6 Interface Address TLV contains the IPv6 address of the local interface through which the BGP
session is established

IPv6 Neighbor Address TLV contains the BGP session IPv6 peer address

BGP-LS Attribute:

PeerAdj SID TLV

Other TLV entries may be present for each of these fields, such as additional node descriptor fields as
described in BGP-LS RFC 7752. In addition, BGP-LS Node and Link Attributes, as defined in RFC
7752, may be added in order to advertise the characteristics of the link.

Examples of a PeerAdj-SID advertisements are included in chapter 14, "SR BGP Egress Peer
Engineering", section 14.5.

17.4.3 PeerSet SID BGP-LS Advertisement


The PeerSet SID TLV is added to the BGP-LS Attribute of a PeerNode SID or PeerAdj SID BGP-LS
advertisement, in addition to the PeerNode SID or PeerAdj SID TLV.
At the time of writing, PeerSet SIDs are not supported in IOS XR.
17.5 Configuration
To configure a BGP-LS session, use address-family link-state link-state.

BGP-LS can be enabled for IPv4 or IPv6 Internal BGP (iBGP) or External BGP (eBGP) sessions.

BGP-LS updates follow the regular BGP propagation functionality, for example, they are subject to
best-path selection and are reflected by Route Reflectors. Also the BGP nexthop attribute is updated
according to the regular procedures.

The node in Example 17‑3 has an iBGP BGP-LS session to neighbor 1.1.1.2 and an eBGP BGP-LS
session to 2001::1:1:1:2. The eBGP session in this example is multi-hop.

Example 17-3: BGP-LS session configuration

route-policy bgp_in
pass
end-policy
!
route-policy bgp_out
pass
end-policy
!
router bgp 1
address-family link-state link-state
!
neighbor 1.1.1.2
remote-as 1
update-source Loopback0
address-family link-state link-state
!
neighbor 2001::1:1:1:2
remote-as 2
ebgp-multihop 2
update-source Loopback0
address-family link-state link-state
route-policy bgp_in in
route-policy bgp_out out
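For completeness, the sketch below shows a route reflector that reflects BGP-LS to its clients. The neighbor address is hypothetical, and it assumes that route-reflector-client is configured under the link-state address-family in the same way as for other address-families.

router bgp 1
address-family link-state link-state
!
neighbor 1.1.1.3
remote-as 1
update-source Loopback0
address-family link-state link-state
route-reflector-client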
17.6 ISIS Topology

Figure 17-12: BGP-LS advertisements – network topology

The network topology in Figure 17‑12 consists of two ISIS Level-2 nodes interconnected by a link.
Both nodes have SR enabled for IPv4 and advertise a Prefix-SID for their loopback prefix. Both
nodes have TI-LFA enabled on their interface such that they advertise a protected and unprotected
Adj-SID for the adjacency.

The nodes also have a TE metric (also known as Administrative Weight) and an affinity link color
(also known as affinity bits or Administrative Group) configured for the link.

Link-delay measurement is enabled on the link to measure the link-delay and advertise the link-delay
metrics as described in chapter 15, "Performance Monitoring – Link Delay".

The relevant parts of Node1’s configuration are shown in Example 17‑4.


Example 17-4: ISIS topology – Node1 configuration

hostname xrvr-1
!
interface Loopback0
ipv4 address 1.1.1.1 255.255.255.255
!
interface GigabitEthernet0/0/0/0
description to xrvr-2
ipv4 address 99.1.2.1 255.255.255.0
!
router isis SR
is-type level-2-only
net 49.0001.0000.0000.0001.00
distribute link-state instance-id 101
address-family ipv4 unicast
metric-style wide
router-id Loopback0
segment-routing mpls
!
interface Loopback0
passive
address-family ipv4 unicast
prefix-sid absolute 16001
!
!
interface GigabitEthernet0/0/0/0
point-to-point
address-family ipv4 unicast
fast-reroute per-prefix
fast-reroute per-prefix ti-lfa
!
!
segment-routing
traffic-eng
interface GigabitEthernet0/0/0/0
affinity
name BLUE
!
metric 20
!
affinity-map
name BLUE bit-position 0
!
performance-measurement
interface GigabitEthernet0/0/0/0
delay-measurement

The ISIS Link-state database entry for Node1 is shown in Example 17‑5.

Node1 originates a Link-State PDU stating what area it is in (49.0001) and its hostname (xrvr-1). The
Network Layer Protocol IDentifier (NLPID) value indicates that the node supports IPv4 (0xcc).

Node1 advertises its loopback0 IPv4 address 1.1.1.1 as a router-id in the TLVs IP address, Router
ID, and Router Capabilities.
The Router Capability TLV contains more attributes of this node: SR is enabled for IPv4 (I:1), the
SRGB is [16000-23999] (first label 16000, size 8000), the SRLB is [15000-15999]. Node1 supports
SR algorithm 0 (SPF) and 1 (strict-SPF), and its type-1 Node Maximum SID Depth (MSD) is 10.

Node1 advertises its loopback0 prefix 1.1.1.1/32 with a Prefix-SID 16001 (index 1).

The adjacency to Node2 has an IGP metric 10, a TE metric (Admin. Weight) 20, and a minimum-
delay link metric 5047. The adjacency carries the affinity color identified by bit 0 in the affinity
bitmap. Two Adj-SIDs are advertised for this adjacency: protected (B:1) Adj-SID 24012 and
unprotected (B:0) Adj-SID 24112.

Example 17-5: ISIS database entry for Node1

RP/0/0/CPU0:xrvr-1#show isis database verbose xrvr-1

IS-IS SR (Level-2) Link State Database


LSPID LSP Seq Num LSP Checksum LSP Holdtime/Rcvd ATT/P/OL
xrvr-1.00-00 * 0x0000022a 0x2880 1180 /* 0/0/0
Area Address: 49.0001
NLPID: 0xcc
Hostname: xrvr-1
IP Address: 1.1.1.1
Router ID: 1.1.1.1
Router Cap: 1.1.1.1, D:0, S:0
Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
SR Local Block: Base: 15000 Range: 1000
SR Algorithm:
Algorithm: 0
Algorithm: 1
Node Maximum SID Depth:
Label Imposition: 10
Metric: 0 IP-Extended 1.1.1.1/32
Prefix-SID Index: 1, Algorithm:0, R:0 N:1 P:0 E:0 V:0 L:0
Prefix Attribute Flags: X:0 R:0 N:1
Source Router ID: 1.1.1.1
Metric: 10 IP-Extended 99.1.2.0/24
Prefix Attribute Flags: X:0 R:0 N:0
Metric: 10 IS-Extended xrvr-2.00
Affinity: 0x00000001
Interface IP Address: 99.1.2.1
Neighbor IP Address: 99.1.2.2
Admin. Weight: 20
Ext Admin Group: Length: 4
0x00000001
Link Average Delay: 5831 us
Link Min/Max Delay: 5047/7047 us
Link Delay Variation: 499 us
Link Maximum SID Depth:
Label Imposition: 10
ADJ-SID: F:0 B:1 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24012
ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24112

Total Level-2 LSP count: 1 Local Level-2 LSP count: 1


ISIS feeds the LS-DB information to BGP-LS and SR-TE. The content of the BGP-LS database is
shown in Example 17‑6. The output shows the BGP-LS NLRIs in string format. The legend of the
NLRI fields in these strings is indicated on top of the output. The first field ([V], [E], or [T])
indicates the BGP-LS NLRI type: [V] Node, [E] Link, or [T] Prefix.

To verify the Link-state Attribute that is associated with the NLRI, the command show bgp link-
state link-state <NLRI string format> [detail] must be used. In the next sections we will
show examples of each NLRI.

Example 17-6: BGP-LS database content

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state


BGP router identifier 1.1.1.1, local AS number 1
<... snip ...>

Status codes: s suppressed, d damped, h history, * valid, > best


i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Prefix codes: E link, V node, T IP reacheable route, u/U unknown
I Identifier, N local node, R remote node, L link, P prefix
L1/L2 ISIS level-1/level-2, O OSPF, D direct, S static/peer-node
a area-ID, l link-ID, t topology-ID, s ISO-ID,
c confed-ID/ASN, b bgp-identifier, r router-ID,
i if-address, n nbr-address, o OSPF Route-type, p IP-prefix
d designated router address
Network Next Hop Metric LocPrf Weight Path
*> [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]]/328
0.0.0.0 0 i
*> [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]]/328
0.0.0.0 0 i
*> [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1][b0.0.0.0][s0000.0000.0002.00]][L[i99.1.2.1]
[n99.1.2.2]]/696
0.0.0.0 0 i
*> [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]][R[c1][b0.0.0.0][s0000.0000.0001.00]][L[i99.1.2.2]
[n99.1.2.1]]/696
0.0.0.0 0 i
*> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][P[p99.1.2.0/24]]/392
0.0.0.0 0 i
*> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][P[p1.1.1.1/32]]/400
0.0.0.0 0 i
*> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]][P[p99.1.2.0/24]]/392
0.0.0.0 0 i
*> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]][P[p1.1.1.2/32]]/400
0.0.0.0 0 i

Processed 8 prefixes, 8 paths

17.6.1 Node NLRI


The Node NLRI for Node1 is shown in Example 17‑7. By using the detail keyword in the show
command, you get a breakdown of the NLRI fields at the top of the output.
These are the fields in the Node NLRI string:

[V] NLRI Type: Node

[L2] Protocol: ISIS L2

[I0x65] Identifier: 0x65 = 101 (this is the instance-id)

[N ...] Local Node Descriptor:

[c1] AS Number: 1

[b0.0.0.0] BGP Identifier: 0.0.0.0

[s0000.0000.0001.00] ISO Node ID: 0000.0000.0001.00

The prefix-length at the end of the NLRI string (/328) is the length of the NLRI in bits.

The Protocol and ISO Node ID can be derived from the ISIS IS-type and LSP-ID.

Example 17-7: BGP-LS Node NLRI of Node1

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]]/328


detail
BGP routing table entry for [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]]/328
NLRI Type: Node
Protocol: ISIS L2
Identifier: 0x65
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
ISO Node ID: 0000.0000.0001.00

<... snip ...>

Link-state: Node-name: xrvr-1, ISIS area: 49.00.01,


Local TE Router-ID: 1.1.1.1,
SRGB: 16000:8000, SR-ALG: 0 SR-ALG: 1, SRLB: 15000:1000,
MSD: Type 1 Value 10

The Link-state attribute is shown at the bottom of the output, preceded by Link-state:. These are the
elements in the Link-state attribute for the Node NLRI example:

Node-name: xrvr-1 ISIS dynamic hostname xrvr-1

ISIS area: 49.00.01 ISIS area address 49.0001

Local TE Router-ID: 1.1.1.1 Local node (Node1) IPv4 TE router-id 1.1.1.1


SRGB: 16000:8000 SRGB: [16000-23999]

SR-ALG: 0 SR-ALG: 1 SR Algorithms: SPF (0), strict-SPF (1)

SRLB: 15000:1000 SRLB: [15000-15999]

MSD: Type 1 Value 10 Node Maximum SID Depth (MSD): 10

These attributes correspond to the TLVs in the ISIS LSP as displayed in Example 17‑5.

17.6.2 Link NLRI


The Link NLRI for the (half-)link Node1→Node2 in the IPv4 topology is shown in Example 17‑8. By
using the detail keyword in the show command, you get a breakdown of the NLRI fields. These are
the fields in the Link NLRI string:

[E] NLRI Type: Link

[L2] Protocol: ISIS L2

[I0x65] Identifier: 0x65 = 101

[N ...] Local Node Descriptor:

[c1] AS Number: 1

[b0.0.0.0] BGP Identifier: 0.0.0.0

[s0000.0000.0001.00] ISO Node ID: 0000.0000.0001.00

[R ...] Remote Node Descriptor:

[c1] AS Number: 1

[b0.0.0.0] BGP Identifier: 0.0.0.0

[s0000.0000.0002.00] ISO Node ID: 0000.0000.0002.00

[L ...] Link Descriptor:

[i99.1.2.1] Local Interface Address IPv4: 99.1.2.1

[n99.1.2.2] Neighbor Interface Address IPv4: 99.1.2.2


The prefix-length at the end (/696) is the length of the NLRI in bits.

Example 17-8: BGP-LS Link NLRI of link Node1→Node2 in IPv4 topology

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1]


[b0.0.0.0][s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696 detail
BGP routing table entry for [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1][b0.0.0.0]
[s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696
NLRI Type: Link
Protocol: ISIS L2
Identifier: 0x65
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
ISO Node ID: 0000.0000.0001.00
Remote Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
ISO Node ID: 0000.0000.0002.00
Link Descriptor:
Local Interface Address IPv4: 99.1.2.1
Neighbor Interface Address IPv4: 99.1.2.2

<... snip ...>

Link-state: Local TE Router-ID: 1.1.1.1, Remote TE Router-ID: 1.1.1.2,


admin-group: 0x00000001, TE-default-metric: 20,
metric: 10, ADJ-SID: 24012(70), ADJ-SID: 24112(30),
MSD: Type 1 Value 10, Link Delay: 5831 us Flags: 0x00,
Min Delay: 5047 us Max Delay: 7047 us Flags: 0x00,
Delay Variation: 499 us

The Link-state attribute is shown in the output, preceded by Link-state. These are the elements in
the Link-state attribute for the Link NLRI example:

Local TE Router-ID: 1.1.1.1 Local node (Node1) IPv4 TE router-id: 1.1.1.1

Remote TE Router-ID: 1.1.1.2 Remote node (Node2) IPv4 TE router-id: 1.1.1.2

admin-group: 0x00000001 Affinity (color) bitmap: 0x00000001

TE-default-metric: 20 TE metric: 20

metric: 10 IGP metric: 10

ADJ-SID: 24012(70), ADJ-SID: 24112(30) Adj-SIDs: protected (flags 0x70 → B-flag=1) 24012; unprotected (flags 0x30 → B-flag=0) 24112 – refer to the ISIS LS-DB output in Example 17‑5 for the flags

MSD: Type 1 Value 10 Link Maximum SID Depth: 10

Link Delay: 5831 us Flags: 0x00 Average Link Delay: 5.831 ms

Min Delay: 5047 us Minimum Link Delay: 5.047 ms

Max Delay: 7047 us Flags: 0x00 Maximum Link Delay: 7.047 ms

Delay Variation: 499 us Link Delay Variation: 0.499 ms

These attributes correspond to the TLVs advertised with the adjacency in the ISIS LSP of
Example 17‑5.

17.6.3 Prefix NLRI


The Prefix NLRI for Node1’s loopback IPv4 prefix 1.1.1.1/32 is shown in Example 17‑9. By using
the detail keyword in the command, you get a breakdown of the NLRI fields. These are the fields in
the Prefix NLRI string:

[T] NLRI Type: Prefix

[L2] Protocol: ISIS L2

[I0x65] Identifier: 0x65 = 101

[N ...] Local Node Descriptor:

[c1] AS Number: 1

[b0.0.0.0] BGP Identifier: 0.0.0.0

[s0000.0000.0001.00] ISO Node ID: 0000.0000.0001.00

[P ...] Prefix Descriptor:

[p1.1.1.1/32] Prefix: 1.1.1.1/32

The prefix-length at the end (/696) is the length of the NLRI in bits.
Example 17-9: BGP-LS Prefix NLRI of Node1’s loopback IPv4 prefix

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]]


[P[p1.1.1.1/32]]/400 detail
BGP routing table entry for [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][P[p1.1.1.1/32]]/400
NLRI Type: Prefix
Protocol: ISIS L2
Identifier: 0x65
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
ISO Node ID: 0000.0000.0001.00
Prefix Descriptor:
Prefix: 1.1.1.1/32

<... snip ...>

Link-state: Metric: 0, PFX-SID: 1(40/0), Extended IGP flags: 0x20,


Source Router ID: 1.1.1.1

The Link-state attribute is shown in the output, preceded by Link-state:. These are the elements in
the Link-state attribute for the Prefix NLRI example:

Metric: 0 IGP metric: 0

PFX-SID: 1(40/0) Prefix-SID index: 1; SPF (algo 0); flags 0x40 → N-flag=1

Extended IGP flags: 0x20 Extended prefix attribute flags: 0x20 → N-flag=1

Source Router ID: 1.1.1.1 Originating node router-id: 1.1.1.1

These attributes correspond to the TLVs advertised with prefix 1.1.1.1/32 in the ISIS LSP of
Example 17‑5.
17.7 OSPF Topology

Figure 17-13: Two-node OSPF topology

The network topology in Figure 17‑13 consists of two OSPF area 0 nodes interconnected by a link.
Both nodes have SR enabled and advertise a Prefix-SID for their loopback prefix. Both nodes have
TI-LFA enabled on their interface such that they advertise a protected and unprotected Adj-SID for
the adjacency.

The nodes also have a TE metric (also known as Administrative Weight) and an affinity link color
(also known as affinity bits or Administrative Group) configured for the link. Delay measurement is
enabled on the link to measure the link delay and advertise the link-delay metrics as described in
chapter 15, "Performance Monitoring – Link Delay".

The relevant parts of Node1’s configuration are shown in Example 17‑10.


Example 17-10: OSPF topology – Node1 configuration

interface Loopback0
ipv4 address 1.1.1.1 255.255.255.255
!
interface GigabitEthernet0/0/0/0
description to xrvr-2
ipv4 address 99.1.2.1 255.255.255.0
!
router ospf SR
distribute link-state instance-id 102
log adjacency changes
router-id 1.1.1.1
segment-routing mpls
fast-reroute per-prefix
fast-reroute per-prefix ti-lfa enable
area 0
interface Loopback0
passive enable
prefix-sid absolute 16001
!
interface GigabitEthernet0/0/0/0
network point-to-point
!
!
mpls traffic-eng router-id 1.1.1.1
!
segment-routing
traffic-eng
interface GigabitEthernet0/0/0/0
affinity
name BLUE
!
metric 20
!
affinity-map
name BLUE bit-position 0
!
performance-measurement
interface GigabitEthernet0/0/0/0
delay-measurement

Example 17‑11 shows the OSPF LSAs as advertised by Node1.

The main one is the Router LSA (type 1), which advertises intra-area adjacencies and connected
prefixes.

The other LSAs are Opaque LSAs of various types, as indicated by the first digit in the Link ID: TE
(type 1), Router Information (type 4), Extended Prefix LSA (type 7), and Extended Link LSA (type 8).

BGP-LS consolidates information of the various LSAs in its Node, Link, and Prefix NLRIs. In the
next sections we will discuss the different BGP-LS NLRIs and how their information maps to the
OSPF LSAs.
Example 17-11: OSPF LSAs advertised by xrvr-1

RP/0/0/CPU0:xrvr-1#show ospf database adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1)

Router Link States (Area 0)

Link ID ADV Router Age Seq# Checksum Link count


1.1.1.1 1.1.1.1 1252 0x800000d9 0x005aff 3

Type-10 Opaque Link Area Link States (Area 0)

Link ID ADV Router Age Seq# Checksum Opaque ID


1.0.0.0 1.1.1.1 1252 0x800000d8 0x00a8a9 0
1.0.0.3 1.1.1.1 743 0x800000da 0x003601 3
4.0.0.0 1.1.1.1 1252 0x800000d9 0x004135 0
7.0.0.1 1.1.1.1 1252 0x800000d8 0x003688 1
8.0.0.3 1.1.1.1 113 0x800000d9 0x006d08 3

OSPF feeds its LS-DB content to BGP-LS and SR-TE. The content of the BGP-LS database is shown
in Example 17‑12. The output shows the BGP-LS NLRIs in string format. The legend of the NLRI
fields in these strings is indicated on top of the output. The first field ([V], [E], or [T]) indicates the
BGP-LS NLRI type: [V] Node, [E] Link, or [T] Prefix.

A detailed view of a given NLRI, including the Link-state Attribute that is associated with the NLRI,
can be displayed by using the command show bgp link-state link-state <NLRI string
format> [detail]. In the next sections we will show examples of each NLRI.
Example 17-12: BGP-LS database content – OSPF

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state


BGP router identifier 1.1.1.1, local AS number 1
<... snip ...>

Status codes: s suppressed, d damped, h history, * valid, > best


i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Prefix codes: E link, V node, T IP reacheable route, u/U unknown
I Identifier, N local node, R remote node, L link, P prefix
L1/L2 ISIS level-1/level-2, O OSPF, D direct, S static/peer-node
a area-ID, l link-ID, t topology-ID, s ISO-ID,
c confed-ID/ASN, b bgp-identifier, r router-ID,
i if-address, n nbr-address, o OSPF Route-type, p IP-prefix
d designated router address
Network Next Hop Metric LocPrf Weight Path
*> [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]]/376
0.0.0.0 0 i
*> [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]]/376
0.0.0.0 0 i
*> [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][R[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][L[i99.1.2.1]
[n99.1.2.2]]/792
0.0.0.0 0 i
*> [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][R[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][L[i99.1.2.2]
[n99.1.2.1]]/792
0.0.0.0 0 i
*> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01][p99.1.2.0/24]]/480
0.0.0.0 0 i
*> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01][p1.1.1.1/32]]/488
0.0.0.0 0 i
*> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][P[o0x01][p99.1.2.0/24]]/480
0.0.0.0 0 i
*> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][P[o0x01][p1.1.1.2/32]]/488
0.0.0.0 0 i

Processed 8 prefixes, 8 paths

17.7.1 Node NLRI


The Node NLRI for Node1 is shown in Example 17‑13. By using the detail keyword in the show
command, you get a breakdown of the NLRI fields at the top of the output.

These are the fields in the Node NLRI string:

[V] NLRI Type: Node

[O] Protocol: OSPF

[I0x66] Identifier: 0x66 = 102 (this is the instance-id)

[N ...] Local Node Descriptor:

[c1] AS Number: 1
[b0.0.0.0] BGP Identifier: 0.0.0.0

[a0.0.0.0] Area ID: 0.0.0.0

[r1.1.1.1] Router ID IPv4: 1.1.1.1

The prefix-length at the end of the NLRI string (/376) is the length of the NLRI in bits.

The area ID and OSPF router ID can be derived from the OSPF packet header.

Example 17-13: BGP-LS Node NLRI of xrvr-1

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]]/376


detail
BGP routing table entry for [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]]/376
NLRI Type: Node
Protocol: OSPF
Identifier: 0x66
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
Area ID: 0.0.0.0
Router ID IPv4: 1.1.1.1

<... snip ...>

Link-state: Local TE Router-ID: 1.1.1.1, SRGB: 16000:8000


SR-ALG: 0 SR-ALG: 1, MSD: Type 1 Value 10

The Link-state attribute is shown at the bottom of the output, preceded by Link-state:. These are the
elements in the Link-state attribute for the Node NLRI example:

Local TE Router-ID: 1.1.1.1 Local node (Node1) IPv4 TE router-id 1.1.1.1

SRGB: 16000:8000 SRGB: [16000-23999]

SR-ALG: 0 SR-ALG: 1 SR Algorithms: SPF (0), strict-SPF (1)

MSD: Type 1 Value 10 Node Maximum SID Depth (MSD): 10

The TE router-id can be retrieved from one of the TE Opaque LSAs, as shown in Example 17‑14. The
SR information (SRGB, Algorithms, MSD) can be retrieved from the Router Information Opaque
LSA. See Example 17‑15.
Example 17-14: OSPF TE router-ID

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 1.0.0.0 adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1)

Type-10 Opaque Link Area Link States (Area 0)

LS age: 410
Options: (No TOS-capability, DC)
LS Type: Opaque Area Link
Link State ID: 1.0.0.0
Opaque Type: 1
Opaque ID: 0
Advertising Router: 1.1.1.1
LS Seq Number: 800000d9
Checksum: 0xa6aa
Length: 28

MPLS TE router ID : 1.1.1.1

Number of Links : 0

Example 17-15: OSPF Router Information LSA

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 4.0.0.0 adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1)

Type-10 Opaque Link Area Link States (Area 0)

LS age: 521
Options: (No TOS-capability, DC)
LS Type: Opaque Area Link
Link State ID: 4.0.0.0
Opaque Type: 4
Opaque ID: 0
Advertising Router: 1.1.1.1
LS Seq Number: 800000da
Checksum: 0x3f36
Length: 60

Router Information TLV: Length: 4


Capabilities:
Graceful Restart Helper Capable
Stub Router Capable
All capability bits: 0x60000000

Segment Routing Algorithm TLV: Length: 2


Algorithm: 0
Algorithm: 1

Segment Routing Range TLV: Length: 12


Range Size: 8000

SID sub-TLV: Length 3


Label: 16000

Node MSD TLV: Length: 2


Type: 1, Value 10
17.7.2 Link NLRI
The Link NLRI for the (half-)link Node1→Node2 in the OSPF topology is shown in Example 17‑16.
By using the detail keyword in the show command, you get a breakdown of the NLRI fields.

These are the fields in the Link NLRI string:

[E] NLRI Type: Link

[O] Protocol: OSPF

[I0x66] Identifier: 0x66 = 102

[N ...] Local Node Descriptor:

[c1] AS Number: 1

[b0.0.0.0] BGP Identifier: 0.0.0.0

[a0.0.0.0] Area ID: 0.0.0.0

[r1.1.1.1] Router ID IPv4: 1.1.1.1

[R ...] Remote Node Descriptor:

[c1] AS Number: 1

[b0.0.0.0] BGP Identifier: 0.0.0.0

[a0.0.0.0] Area ID: 0.0.0.0

[r1.1.1.2] Router ID IPv4: 1.1.1.2

[L ...] Link Descriptor:

[i99.1.2.1] Local Interface Address IPv4: 99.1.2.1

[n99.1.2.2] Neighbor Interface Address IPv4: 99.1.2.2

The prefix-length at the end of the NLRI string (/792) is the length of the NLRI in bits.
Example 17-16: BGP-LS Link NLRI of link Node1→Node2 in IPv4 topology

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][R[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][L[i99.1.2.1][n99.1.2.2]]/792 detail
BGP routing table entry for [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][R[c1][b0.0.0.0][a0.0.0.0]
[r1.1.1.2]][L[i99.1.2.1][n99.1.2.2]]/792
NLRI Type: Link
Protocol: OSPF
Identifier: 0x66
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
Area ID: 0.0.0.0
Router ID IPv4: 1.1.1.1
Remote Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
Area ID: 0.0.0.0
Router ID IPv4: 1.1.1.2
Link Descriptor:
Local Interface Address IPv4: 99.1.2.1
Neighbor Interface Address IPv4: 99.1.2.2

<... snip ...>

Link-state: Local TE Router-ID: 1.1.1.1, Remote TE Router-ID: 1.1.1.2


admin-group: 0x00000001, max-link-bw (kbits/sec): 1000000
TE-default-metric: 20, metric: 1, ADJ-SID: 24012(e0)
ADJ-SID: 24112(60), MSD: Type 1 Value 10,
Link Delay: 5831 us Flags: 0x00 Min Delay: 5047 us
Max Delay: 7047 us Flags: 0x00, Delay Variation: 499 us
Link ID: Local:3 Remote:4

The Link-state attribute is shown in the output, preceded by Link-state. These are the elements in
the Link-state attribute for the Link NLRI example:

Local TE Router-ID: 1.1.1.1 – Local node (Node1) IPv4 TE router-id 1.1.1.1

Remote TE Router-ID: 1.1.1.2 – Remote node (Node2) IPv4 TE router-id 1.1.1.2

admin-group: 0x00000001 – Affinity (color) bitmap: 0x00000001

max-link-bw (kbits/sec): 1000000 – Bandwidth of the link: 1 Gbps

TE-default-metric: 20 – TE metric: 20

metric: 1 – IGP metric: 1

ADJ-SID: 24012(e0), ADJ-SID: 24112(60) – Adj-SIDs: protected (flags 0xe0 → B-flag=1) 24012; unprotected (flags 0x60 → B-flag=0) 24112

MSD: Type 1 Value 10 – Link Maximum SID Depth: 10

Link Delay: 5831 us Flags: 0x00 – Average Link Delay: 5.831 ms

Min Delay: 5047 us – Minimum Link Delay: 5.047 ms

Max Delay: 7047 us Flags: 0x00 – Maximum Link Delay: 7.047 ms

Delay Variation: 499 us – Link Delay Variation: 0.499 ms

Link ID: Local:3 Remote:4 – Local and remote interface identifiers: 3 and 4
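The B-flag interpretation above follows from the Adj-SID flags octet. As a small illustration (a Python sketch that is not part of the router output, assuming the B|V|L|G|P bit layout of the OSPFv2 segment routing extensions), the two flag values 0xe0 and 0x60 can be decoded as follows:

# Illustrative sketch: decode the OSPF Adj-SID flags octet (assumed bit layout B|V|L|G|P).
ADJ_SID_FLAGS = {
    0x80: "B (Backup, protection)",
    0x40: "V (Value, SID carries a label)",
    0x20: "L (Local significance)",
    0x10: "G (Group)",
    0x08: "P (Persistent)",
}

def decode_adj_sid_flags(flags):
    return [name for bit, name in ADJ_SID_FLAGS.items() if flags & bit]

print(decode_adj_sid_flags(0xE0))  # Adj-SID 24012: B, V, L set -> protected
print(decode_adj_sid_flags(0x60))  # Adj-SID 24112: V, L set, B unset -> unprotected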

The Link-state attribute contains information collected from the Router LSA for the adjacency, from the TE
Opaque LSA for the TE attributes, and from the Extended Link LSA for the Adj-SIDs, MSD, and Link IDs. These
LSAs are shown in Example 17‑17, Example 17‑18, and Example 17‑19.
Example 17-17: OSPF Router LSA

RP/0/0/CPU0:xrvr-1#show ospf database router adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1)

Router Link States (Area 0)

LS age: 1786
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 1.1.1.1
Advertising Router: 1.1.1.1
LS Seq Number: 800000da
Checksum: 0x5801
Length: 60
Number of Links: 3

Link connected to: a Stub Network


(Link ID) Network/subnet number: 1.1.1.1
(Link Data) Network Mask: 255.255.255.255
Number of TOS metrics: 0
TOS 0 Metrics: 1

Link connected to: another Router (point-to-point)


(Link ID) Neighboring Router ID: 1.1.1.2
(Link Data) Router Interface address: 99.1.2.1
Number of TOS metrics: 0
TOS 0 Metrics: 1

Link connected to: a Stub Network


(Link ID) Network/subnet number: 99.1.2.0
(Link Data) Network Mask: 255.255.255.0
Number of TOS metrics: 0
TOS 0 Metrics: 1
Example 17-18: OSPF TE LSA – link

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 1.0.0.3 adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1)

Type-10 Opaque Link Area Link States (Area 0)

LS age: 1410
Options: (No TOS-capability, DC)
LS Type: Opaque Area Link
Link State ID: 1.0.0.3
Opaque Type: 1
Opaque ID: 3
Advertising Router: 1.1.1.1
LS Seq Number: 800000db
Checksum: 0x3402
Length: 132

Link connected to Point-to-Point network


Link ID : 1.1.1.2
(all bandwidths in bytes/sec)
Interface Address : 99.1.2.1
Neighbor Address : 99.1.2.2
Admin Metric : 20
Maximum bandwidth : 125000000
Affinity Bit : 0x1
IGP Metric : 1
Extended Administrative Group : Length: 1
EAG[0]: 0x1
Unidir Link Delay : 5831 micro sec, Anomalous: no
Unidir Link Min Delay : 5047 micro sec, Anomalous: no
Unidir Link Max Delay : 7047 micro sec
Unidir Link Delay Variance : 499 micro sec

Number of Links : 1
Example 17-19: OSPF Extended Link LSA

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 8.0.0.3 adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1)

Type-10 Opaque Link Area Link States (Area 0)

LS age: 906
Options: (No TOS-capability, DC)
LS Type: Opaque Area Link
Link State ID: 8.0.0.3
Opaque Type: 8
Opaque ID: 3
Advertising Router: 1.1.1.1
LS Seq Number: 800000da
Checksum: 0x6b09
Length: 112

Extended Link TLV: Length: 88


Link-type : 1
Link ID : 1.1.1.2
Link Data : 99.1.2.1

Adj sub-TLV: Length: 7


Flags : 0xe0
MTID : 0
Weight : 0
Label : 24012

Adj sub-TLV: Length: 7


Flags : 0x60
MTID : 0
Weight : 0
Label : 24112

Local-ID Remote-ID sub-TLV: Length: 8


Local Interface ID: 3
Remote Interface ID: 4

Remote If Address sub-TLV: Length: 4


Neighbor Address: 99.1.2.2

Link MSD sub-TLV: Length: 2


Type: 1, Value 10

17.7.3 Prefix NLRI


The Prefix NLRI for Node1’s loopback IPv4 prefix 1.1.1.1/32 is shown in Example 17‑20. By using
the detail keyword in the command, you get a breakdown of the NLRI fields.

These are the fields in the Prefix NLRI string:

[T] NLRI Type: Prefix

[O] Protocol: OSPF


[I0x66] Identifier: 0x66 = 102

[N ...] Local Node Descriptor:

[c1] AS Number: 1

[b0.0.0.0] BGP Identifier: 0.0.0.0

[a0.0.0.0] Area ID: 0.0.0.0

[r1.1.1.1] Router ID IPv4: 1.1.1.1

[P ...] Prefix Descriptor:

[o0x01] OSPF Route Type: 0x01

[p1.1.1.1/32] Prefix: 1.1.1.1/32

The prefix-length at the end of the NLRI string (/488) is the length of the NLRI in bits.

Example 17-20: BGP-LS Prefix NLRI of Node1’s loopback IPv4 prefix

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01][p1.1.1.1/32]]/488 detail
BGP routing table entry for [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01][p1.1.1.1/32]]/488
NLRI Type: Prefix
Protocol: OSPF
Identifier: 0x66
Local Node Descriptor:
AS Number: 1
BGP Identifier: 0.0.0.0
Area ID: 0.0.0.0
Router ID IPv4: 1.1.1.1
Prefix Descriptor:
OSPF Route Type: 0x01
Prefix: 1.1.1.1/32

<... snip ...>

Link-state: Metric: 1, PFX-SID: 1(0/0), Extended IGP flags: 0x40

The Link-state attribute is shown in the output, preceded by Link-state:. These are the elements in
the Link-state attribute for the Prefix NLRI example:

Metric: 1 – IGP metric: 1

PFX-SID: 1(0/0) – Prefix-SID index: 1; SPF (algo 0); flags 0x0 (NP:0, M:0, E:0, V:0, L:0)

Extended IGP flags: 0x40 – Extended prefix attribute flags: 0x40 (A:0, N:1)
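The PFX-SID is advertised as an index. Though the resulting label is not shown in this output, it follows from the SRGB advertised in the Node NLRI (16000:8000). A minimal Python sketch of this standard computation (illustrative only, not router output):

# Illustrative sketch: derive the Prefix-SID label from the SID index and the SRGB.
def prefix_sid_label(srgb_base, srgb_size, sid_index):
    if sid_index >= srgb_size:
        raise ValueError("SID index outside SRGB")
    return srgb_base + sid_index

print(prefix_sid_label(16000, 8000, 1))  # Node1's Prefix-SID label: 16001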
The Link-state attribute contains information collected from the Router LSA for the prefix and from the
Extended Prefix LSA for the Prefix-SID. These LSAs are shown in Example 17‑17 and Example 17‑21.

Example 17-21: OSPF Extended Prefix LSA

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 7.0.0.1 adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1)

Type-10 Opaque Link Area Link States (Area 0)

LS age: 128
Options: (No TOS-capability, DC)
LS Type: Opaque Area Link
Link State ID: 7.0.0.1
Opaque Type: 7
Opaque ID: 1
Advertising Router: 1.1.1.1
LS Seq Number: 800000da
Checksum: 0x328a
Length: 44

Extended Prefix TLV: Length: 20


Route-type: 1
AF : 0
Flags : 0x40
Prefix : 1.1.1.1/32

SID sub-TLV: Length: 8


Flags : 0x0
MTID : 0
Algo : 0
SID Index : 1
17.8 References
[RFC3209] "RSVP-TE: Extensions to RSVP for LSP Tunnels", Daniel O. Awduche, Lou Berger,
Der-Hwa Gan, Tony Li, Dr. Vijay Srinivasan, George Swallow, RFC3209, December 2001

[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information


Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752,
March 2016

[RFC7471] "OSPF Traffic Engineering (TE) Metric Extensions", Spencer Giacalone, David Ward,
John Drake, Alia Atlas, Stefano Previdi, RFC7471, March 2015

[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer
Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016

[RFC8571] "BGP - Link State (BGP-LS) Advertisement of IGP Traffic Engineering Performance
Metric Extensions", Les Ginsberg, Stefano Previdi, Qin Wu, Jeff Tantsura, Clarence Filsfils,
RFC8571, March 2019

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano


Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-
segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-ietf-idr-bgp-ls-segment-routing-ext] "BGP Link-State extensions for Segment Routing",


Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Hannes Gredler, Mach Chen, draft-ietf-idr-
bgp-ls-segment-routing-ext-12 (Work in Progress), March 2019

[draft-ietf-idr-bgpls-segment-routing-epe] "BGP-LS extensions for Segment Routing BGP Egress


Peer Engineering", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Keyur Patel, Saikat Ray,
Jie Dong, draft-ietf-idr-bgpls-segment-routing-epe-18 (Work in Progress), March 2019

[draft-ietf-idr-te-lsp-distribution] "Distribution of Traffic Engineering (TE) Policies and State


using BGP-LS", Stefano Previdi, Ketan Talaulikar, Jie Dong, Mach Chen, Hannes Gredler, Jeff
Tantsura, draft-ietf-idr-te-lsp-distribution-10 (Work in Progress), February 2019
[draft-ketant-idr-bgp-ls-bgp-only-fabric] "BGP Link-State Extensions for BGP-only Fabric", Ketan
Talaulikar, Clarence Filsfils, krishnaswamy ananthamurthy, Shawn Zandi, Gaurav Dawra,
Muhammad Durrani, draft-ketant-idr-bgp-ls-bgp-only-fabric-02 (Work in Progress), March 2019

[draft-dawra-idr-bgp-ls-sr-service-segments] "BGP-LS Advertisement of Segment Routing


Service Segments", Gaurav Dawra, Clarence Filsfils, Daniel Bernier, Jim Uttaro, Bruno Decraene,
Hani Elmalky, Xiaohu Xu, Francois Clad, Ketan Talaulikar, draft-dawra-idr-bgp-ls-sr-service-
segments-01 (Work in Progress), January 2019

1. ISIS uses the term Link-State PDU (LSP), OSPF uses the term Link-State Advertisement (LSA).
Do not confuse the ISIS LSP with the MPLS term Label Switched Path (LSP).↩

2. Note that this example (enabling multiple ISIS instances on an interface) cannot be configured in
IOS XR, since a given non-loopback interface can only belong to one ISIS instance.↩

3. Although the AS number is configurable on IOS XR for backward compatibility reasons, we recommend leaving it at the default value.↩

4. Although the BGP-LS identifier is configurable on IOS XR for backward compatibility reasons, we recommend leaving it at the default value (0).↩

5. The 32-bit BGP-LS Identifier field is not related to the 64-bit BGP-LS NLRI Identifier field; the latter is also known as “instance-id”. It is just an unfortunate clash of terminology.↩

6. https://www.iana.org/assignments/ospf-opaque-types/ospf-opaque-types.xhtml↩
18 PCEP
18.1 Introduction
A Path Computation Element (PCE) is “… an entity that is capable of computing a network path or
route based on a network graph, and of applying computational constraints during the computation.”
(RFC 4655).

A Path Computation Client (PCC) is the entity using the services of a PCE.

For a PCC and a PCE, or two PCEs, to communicate, the Path Computation Element Communication
Protocol (PCEP) has been introduced (RFC 5440).

At the time PCEP was specified, the only existing TE solution was the classic RSVP-TE protocol.
Therefore, many aspects of PCEP are geared towards its application for RSVP-TE tunnels, such as
re-use of existing RSVP-TE objects. Since its introduction, PCEP has evolved and has been extended; for
example, support for SR-TE has been added (draft-ietf-pce-segment-routing).

Since this book is about SR-TE, the focus of this PCEP chapter is on the PCEP functionality that is
applicable to SR-TE. This implies that various PCEP packet types, packet fields, flags, etc. are not
described. Please refer to the relevant IETF specification of a given element that is not covered in
this chapter.

This book only describes PCEP message exchanges with an Active Stateful PCE that is SR-capable.

18.1.1 Short PCEP History


The initial PCEP specification in RFC 5440 only provided stateless PCE/PCC support using a PCEP
Request/Reply protocol exchange. In a client/server relationship, the PCC requests the PCE to compute a
path, and the PCE computes and returns the result. The PCE computes the path using an up-to-date
topology database. “Stateless” means that the PCE does not keep track of the computed path or the
other paths in the network.

RFC 8231 adds stateful PCE/PCC capabilities to PCEP. This RFC specifies the protocol mechanism
for a (stateful) PCE to learn the SR Policy paths from a PCC. This enables the PCE to keep track of
the SR Policy paths in the network and take them into account during the computation.

This RFC also specifies the PCEP mechanism for a PCC to delegate control of an SR Policy path to a
PCE, an Active Stateful PCE in this case. The Active Stateful PCE not only computes and learns SR
Policy paths, it can also take control of SR Policy paths and update these paths.

RFC 8281 further extends the stateful PCEP capabilities to support PCE initiation of SR Policies on
a PCC.

IETF draft-ietf-pce-segment-routing extends PCEP to support SR Policies. It mainly defines how an


SR Policy path is specified in PCEP.

PCEP has been further extended with SR-TE support. Some examples: RFC 8408 specifies how to
indicate the path setup type, since PCEP can be used for both SR-TE and RSVP-TE; draft-sivabalan-pce-
binding-label-sid specifies how PCEP conveys the Binding-SID of an SR Policy path.
18.2 PCEP Session Setup and Maintenance
A PCEP session can be set up between a PCC and a PCE or between two PCEs. A PCC may have
PCEP sessions with multiple PCEs, and a PCE may have PCEP sessions with multiple PCCs.

PCEP is a TCP-based protocol. A PCC establishes a TCP session to a PCE on TCP port 4189. The
PCC always initiates the PCEP connection.

After the TCP session is established, PCC and PCE initialize the PCEP session by exchanging the
session parameters using Open messages as illustrated in Figure 18‑1.

The Open message specifies the Keepalive interval that the node will use and a recommended Dead
timer interval for the peer to use. It also contains the capabilities of the node.

The Open message is acknowledged by a Keepalive message.


Figure 18-1: PCEP session initialization

Two timers are used to maintain the PCEP session, Keepalive timer and Dead timer. These two timers
and Keepalive messages are used to verify liveness of the PCEP session.

The default Keepalive timer is 30 seconds and the default Dead timer is 120 seconds.

Every time a node sends a PCEP message, it restarts the Keepalive timer of the session. When the
Keepalive timer expires, the node sends a Keepalive message. This mechanism ensures that a PCEP
message is sent at least once every Keepalive interval, without sending unnecessary Keepalive
messages.

Every time a node receives a PCEP message it restarts the Dead timer. When the Dead timer expires
the node tears down the PCEP session.

PCC and PCE independently send their Keepalive messages; they are not responded to. The
Keepalive timer can be different on both peers.
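
The timer behavior described above can be summarized in a small sketch. This is an illustration only (Python pseudo-logic with simplified, timestamp-based timers), not an actual PCEP implementation:

import time

class PcepSessionTimers:
    """Illustrative sketch of PCEP session liveness handling (not a real implementation)."""
    def __init__(self, keepalive=30, deadtimer=120):
        self.keepalive = keepalive      # local Keepalive interval (seconds)
        self.deadtimer = deadtimer      # Dead timer applied to the peer (seconds)
        now = time.monotonic()
        self.last_tx = now              # restarted whenever any PCEP message is sent
        self.last_rx = now              # restarted whenever any PCEP message is received

    def on_message_sent(self):
        self.last_tx = time.monotonic()

    def on_message_received(self):
        self.last_rx = time.monotonic()

    def tick(self):
        now = time.monotonic()
        if now - self.last_rx > self.deadtimer:
            return "dead timer expired: tear down the PCEP session"
        if now - self.last_tx > self.keepalive:
            self.on_message_sent()
            return "keepalive timer expired: send a Keepalive message"
        return "session alive"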

18.2.1 SR Policy State Synchronization


After the session is initialized and both PCE and PCC have stateful PCEP capability, the PCC
synchronizes its local SR Policies’ states to the PCE using PCEP State Report (PCRpt) messages.
This exchange is illustrated in Figure 18‑2. The format of the PCRpt message will be described
further in section 18.4.6 of this chapter.

During the state synchronization, the PCC reports its local SR Policies using PCRpt messages with
the Sync-flag set. The end of the synchronization is indicated by an empty Report message with the
Sync-flag unset.
Figure 18-2: SR Policy initial state synchronization

After the initial SR Policy state synchronization, the PCC sends a PCRpt message whenever the state
of any of its local SR Policies changes.

As illustrated in Figure 18‑3, following a state change of a local SR Policy path, the PCC sends a
PCRpt message to all its connected stateful PCEs. This way the SR Policy databases on all these
connected PCEs stay synchronized with the PCC’s SR Policy database.
Figure 18-3: SR Policy state synchronization
18.3 SR Policy Path Setup and Maintenance
In this book we focus on the use of an Active Stateful PCE. An Active Stateful PCE not only computes
an SR Policy path and keeps track of the path, it also takes the responsibility to maintain the path and
update it when required. The IOS XR SR PCE is an example of an Active Stateful PCE.

The PCC delegates control of a path to an Active Stateful PCE by sending a state Report message for
that path with the Delegate flag set.

An IOS XR PCC automatically delegates control of a PCE-computed path to the PCE that computed
this path. This makes sense since if the PCC relies on a PCE to compute the path, it will also rely on
the PCE to maintain this path. This includes configured (CLI, NETCONF) SR Policy paths and ODN
initiated SR Policy paths.

An IOS XR PCC also delegates control of an SR Policy path that a PCE initiates on the PCC. In that
case the PCC delegates control to the PCE that initiated the path.

An IOS XR headend does not delegate control of a locally computed path to a PCE, nor the control of
an explicit path.

The following sections describe the different PCEP protocol exchanges and message formats to
initiate an SR Policy path that is delegated to a PCE and to maintain this path using PCEP.

Two cases can be distinguished, PCC-initiated paths and PCE-initiated paths.

18.3.1 PCC-Initiated SR Policy Path


The PCC initiates the SR Policy path by configuration (CLI) or NETCONF, or it automatically instantiates
the path on-demand via a service protocol trigger (ODN). The PCEP protocol exchange between PCC
and PCE is the same in all these cases.

Two protocol exchange variants exist for the initial setup of the path.

In the first variant (Request/Reply/Report), the protocol exchange starts as a stateless Request/Reply
message exchange between PCC and PCE. The PCC installs the path and sends a Report message to
the PCE where the PCC delegates control of the path to the PCE.
In the second variant (Report/Update/Report), the PCC immediately delegates the (empty) path to the
PCE. The PCE accepts control of the path (by not rejecting it), computes the path and sends the
computed path in an Update message to the PCC. The PCC installs the path and sends a Report
message to the PCE.

While SR PCE supports both variants, an IOS XR PCC uses the second variant.

At the end of both variants of the initiation sequence, the PCC has installed the path and has delegated
control of the path to the SR PCE.

Request/Reply/Report

The Request/Reply/Report variant is illustrated in Figure 18‑4.

Figure 18-4: PCC-initiated SR Policy initial message exchange variant 1

The message sequence is as follows:

1. The operator configures an SR Policy on headend (PCC) with an SR PCE-computed dynamic


path. Or the headend automatically instantiates an SR Policy path using ODN. The headend
assigns a unique PLSP-ID (PCEP-specific LSP ID) to the SR Policy path. The PLSP-ID uniquely
identifies the SR Policy path on a PCEP session and remains constant during its lifetime.

2. The headend sends a PCReq message to the PCE and includes the path computation parameters in
an ENDPOINT object, a METRIC object and an LSPA (LSP Attributes) object. The headend also
includes an LSP object with this path’s PLSP-ID.
3. The PCE computes the path.

4. The PCE returns the path in a PCRep message with the computed path in an ERO (Explicit Route
Object). It also includes an LSP object with the unique PLSP-ID and a METRIC object with the
computed metric value.

5. The headend installs the path.

6. The headend sends a PCRpt message with the state of the path. It includes an ERO, a METRIC
object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend
sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE.

Report/Update/Report

The Report/Update/Report variant is illustrated in Figure 18‑5.

Figure 18-5: PCC-initiated SR Policy initial message exchange variant 2

This variant’s message sequence is as follows:

1. The operator configures an SR Policy on headend (PCC) with an SR PCE-computed dynamic


path. Or the headend automatically instantiates an SR Policy path using ODN. The headend
assigns a unique PLSP-ID (PCEP-specific LSP ID) to the SR Policy. The PLSP-ID uniquely
identifies the SR Policy on a PCEP session and remains constant during its lifetime.
2. The headend sends a PCRpt message to the PCE with an empty ERO. It includes a METRIC
object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend
sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE.

3. The SR PCE accepts the delegation (by not rejecting it), finds that the path must be updated and
computes a new path.

4. The PCE returns the new path in a PCUpd message to the headend. It includes an ERO with the
computed path, an LSP object with the unique PLSP-ID assigned by the headend, and an SRP
(Stateful Request Parameters) object with a unique SRP-ID number to track error and state report
messages received as a response to this Update message.

5. The headend installs the path.

6. The headend sends a PCRpt message with the state of the new path. It includes an ERO, a
METRIC object and an LSPA object. It also includes an LSP object with the unique PLSP-ID.
The headend sets the Delegate flag in the LSP object to delegate control of this path to the SR
PCE.

18.3.2 PCE-Initiated SR Policy Path


We call this case the PCE-initiated path since from a PCEP perspective the PCE takes the initiative to
initiate this path. In reality it will be an application (or orchestrator, or administrator) that requests a
PCE to initiate an SR Policy path on a given headend. This application uses one of the PCE’s north-
bound interfaces, such as REST, NETCONF, or even CLI, for this purpose.
Figure 18-6: PCE-initiated SR Policy path message exchange

The message sequence as illustrated in Figure 18‑6 is as follows:

1. An application requests SR PCE to initiate an SR Policy candidate path on a headend.

2. The application can provide the path it computed itself, or it can request the SR PCE to compute
the path. SR PCE sends a PCEP Initiate (PCInit) message to the headend with an ERO encoding
the path. It includes an LSP object with the PLSP-ID set to 0 as well as an SRP object with a
unique SRP-ID number to track error and state report messages received as a response to this
initiate message.

3. The headend initiates the path and allocates a PLSP-ID for it.

4. The headend sends a PCRpt message with the state of the path. It includes an ERO, a METRIC
object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend
sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE. It also sets
the Create flag in the LSP object to indicate it is a PCE-initiated path.

18.3.3 PCE Updates SR Policy Path


When the headend has delegated the control of a path to the PCE, the PCE is in full control and it
maintains the path and updates it when required, for example following a topology change or upon
request of an application. The same message exchange, as illustrated in Figure 18‑7, is used to update
the path, regardless of how the delegated path was initially set up.

Figure 18-7: SR Policy path update message exchange

The message sequence is as follows:

1. Following a topology change event, the SR PCE re-computes the delegated path. Or an
application requests to update the path of a given SR Policy.

2. The PCE sends the updated path in a PCUpd message to the headend. It includes an ERO
encoding the path, an LSP object with the unique PLSP-ID assigned by the headend as well as an
SRP object with a unique SRP-ID number to track error and state report messages received as a
response to this update message. The Delegate flag is set in the LSP object to indicate that the
path is delegated.

3. The headend updates the path.

4. The headend sends a PCRpt message with the state of the new path. It includes an ERO, a
METRIC object and an LSPA object. It also includes an LSP object with the unique PLSP-ID.
The headend sets the Delegate flag in the LSP object to delegate control of this path to the SR
PCE.
18.4 PCEP Messages
Different types of PCEP messages exist. The main ones have been introduced in the previous sections
of this chapter:

Open – start a PCEP session, including capability negotiation

Close – end a PCEP session

Keepalive – maintain PCEP session active

Error (PCErr) – error message sent when a protocol error occurs or when a received message is
not compliant with the specification

Request (PCReq) – sent by PCC to PCE to request path computation

Reply (PCRep) – sent by PCE to PCC in response to Request

Report (PCRpt) – sent by PCC for various purposes:

Initial state synchronization – after initializing the PCEP session, the PCC uses Report messages
to synchronize its local SR Policy path statuses to the PCE

Path status report – the PCC sends Report messages to all its connected PCEs whenever the state
of a local SR Policy path changes

Delegation control – the PCC delegates control of a local SR Policy path to the PCE by setting
the Delegate flag in the path’s Report message

Update (PCUpd) – sent by PCE to request PCC to update an SR Policy path

Initiate (PCInit) – sent by PCE to request PCC to initiate an SR Policy path

PCEP Header

Each PCEP message carries information organized as “Objects”, depending on the type of the
message.

All PCEP messages have a common header followed by the “Objects”.


The format of the common PCEP header is shown in Figure 18‑8.

Figure 18-8: Common PCEP Header format

Where:

Version: 1

Flags: No flags are currently defined

Message-Type:

1. Open

2. Keepalive

3. Path Computation Request (PCReq)

4. Path Computation Reply (PCRep)

5. Notification (PCNtf)

6. Error (PCErr)

7. Close

8. Path Computation Monitoring Request (PCMonReq)

9. Path Computation Monitoring Reply (PCMonRep)

10. Path Computation State Report (PCRpt)

11. Path Computation Update Request (PCUpd)

12. Path Computation LSP Initiate Message (PCInit)
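
To make the common header layout of Figure 18‑8 concrete, the following minimal Python sketch (an illustration, not taken from any PCEP implementation) packs the 4-byte header for one of the message types listed above:

import struct

MSG_KEEPALIVE = 2   # message type value as listed above

def pcep_common_header(msg_type, body_len, version=1):
    first_octet = (version << 5) | 0x00   # 3-bit version, 5 flag bits (none defined)
    return struct.pack("!BBH", first_octet, msg_type, 4 + body_len)  # length incl. header

print(pcep_common_header(MSG_KEEPALIVE, 0).hex())  # '20020004': a Keepalive message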


Object Header

The Objects in a PCEP message have various formats and can contain sub-Objects and TLVs,
depending on the type of Object. The header of an Object has a fixed format that is shown in
Figure 18‑9.

Figure 18-9: Common Object Header format

Where:

Object-Class: identifies the PCEP object class

Object-Type (OT): identifies the PCEP object type – The pair (Object-Class, Object-Type)
uniquely identifies each PCEP object

Flags:

Res (Reserved)

Processing-Rule flag (P-flag): used in Request (PCReq) message. If set, the object must be taken
into account during path computation. If unset, the object is optional.

Ignore flag (I flag): used in Reply (PCRep) message. Set if the optional object was ignored.
Unset if the optional object was processed.

Object body: contains the different elements of the object
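
A similar minimal sketch (illustrative only) packs the common object header of Figure 18‑9; the Object-Type and the P/I flag bits share the second octet:

import struct

def pcep_object_header(obj_class, obj_type, body_len, p_flag=False, i_flag=False):
    second_octet = (obj_type << 4) | (int(p_flag) << 1) | int(i_flag)  # OT | Res | P | I
    return struct.pack("!BBH", obj_class, second_octet, 4 + body_len)  # length incl. header

# Open object (Object-Class 1, Object-Type 1) with a 4-byte body:
print(pcep_object_header(1, 1, 4).hex())  # '01100008'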

18.4.1 PCEP Open Message


A PCEP Open Message contains an Open Object with the format as shown in Figure 18‑10. The
Object is preceded by the common PCEP header and the common Object header (not shown).
Figure 18-10: PCEP Open Object format

The PCEP message is of type “Open”. The Open Object is of Object-Class 1, Object-Type 1.

The fields in the Open Object body:

Version: 1

Flags: none defined

Keepalive: maximum period of time (in seconds) between two consecutive PCEP messages sent by
the sender of this Open message

Deadtimer: the Dead timer value that the sender of this Open message recommends its peer use for this
session

Session ID (SID): identifies the current PCEP session for logging and troubleshooting purposes. Can
be different in each direction

The Open Object can contain TLVs:

Stateful Capability TLV

Segment Routing Capability TLV

The format of the Stateful Capability TLV is shown in Figure 18‑11. If this TLV is present, then the
device (PCE or PCC) is Stateful PCEP capable.
Figure 18-11: Stateful Capability TLV format

The fields in this TLV are:

Flags (only showing the supported ones):

Update (U): PCE can update TE Policies, PCC accepts updates from PCE

Synch (S): PCC includes LSP-DB version number when the paths are reported to the PCE

Initiate (I): PCE can initiate LSPs, PCC accepts PCE initiated TE Policies

Other flags are being defined

The format of the Segment Routing Capability TLV is shown in Figure 18‑12. If this TLV is present,
then the device (PCE or PCC) is Segment Routing Capable.

Figure 18-12: Segment Routing Capability TLV format

The fields in this TLV are:

Flags:

L-flag (no MSD Limit): set by PCC that does not impose any limit on MSD
N-flag (NAI resolution capable): set by a PCC that can resolve a Node or Adjacency Identifier
(NAI) to a SID

Maximum SID Depth (MSD): Specifies the maximum size of the segment list (label stack) that this
node can impose.

Example

Example 18‑1 shows a packet capture of a PCEP Open message. The PCC sending this message
advertises a keepalive interval of 30 seconds and a dead-timer of 120 seconds. The PCC is stateful
and supports the update and initiation functionality. It also supports the SR PCEP extensions and the
MSD is 10 labels.

Example 18-1: Packet capture of a PCEP Open message

Path Computation Element communication Protocol


OPEN MESSAGE (1), length: 28
OPEN OBJECT
Object Class: OPEN OBJECT (1)
Object Flags: 0x0
Object Value:
PCEP version: 1
Open Object Flags: 0x0
Keepalive: 30
Deadtime: 120
Session ID (SID): 4
TLVs:
STATEFUL-PCE-CAPABILITY TLV (16), Type: 16, Length: 4
Stateful Capabilities Flags: 0x00000005
..0. .... = Triggered Initial Sync (F): Not Set
...0 .... = Delta LSP Sync Capability (D): Not Set
.... 0... = Triggered Resync (T): Not Set
.... .1.. = Instantiation (I): Set
.... ..0. = Include Db Version (S): Not Set
.... ...1 = Update (U): Set
SR-PCE-CAPABILITY TLV (26), Type: 26, Length: 4
Reserved: 0x0000
SR Capability Flags: 0x00
Maximum SID Depth (MSD): 10
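
The Stateful Capabilities Flags value 0x00000005 in this capture is simply the combination of the U and I bits. A small Python decode sketch, using the bit positions shown in the capture (U = 0x1, I = 0x4; the other capability bits are omitted):

U_FLAG, I_FLAG = 0x1, 0x4   # bit positions as displayed in Example 18-1

flags = 0x00000005
print("Update (U):       ", bool(flags & U_FLAG))   # True
print("Instantiation (I):", bool(flags & I_FLAG))   # True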

18.4.2 PCEP Close Message


A PCEP Close Message contains a Close Object with the format as shown in Figure 18‑13.
Figure 18-13: PCEP Close Object format

A Close Object has the following fields:

Flags: no flags defined yet

Reason: specifies the reason for closing the PCEP session; setting of this field is optional

18.4.3 PCEP Keepalive Message


A PCEP Keepalive message consists of only the common PCEP header, with message type
“Keepalive”.

18.4.4 PCEP Request Message


A PCEP Request message must contain at least the following objects:

Request Parameters (RP) Object: administrative info about the Request

End-points Object: source and destination of the SR Policy

A PCEP Request message may also contain (non-exhaustive list):

LSP Attributes (LSPA) Object: various LSP attributes (such as affinity, priority, protection desired)

Metric Object: Metric type to optimize, and set max-metric bound

RP Object

The Request Parameters Object specifies various characteristics of the path computation request.
Figure 18-14: PCEP RP Object format

Flags: various flags have been defined – Priority, Reoptimization, Bi-directional, strict/loose –
please refer to the different IETF PCEP specifications

Request-ID-number: ID number to identify Request, to match with Reply

Optional TLVs:

Path setup type TLV (draft-ietf-pce-lsp-setup-type): specifies which type of path to set up – 0 or
TLV not present: RSVP-TE; 1: Segment Routing

Endpoints Object

Source and destination IP addresses of the path for which a path computation is requested. Two types
exist, one for IPv4 and one for IPv6. Both have an equivalent format.

Figure 18-15: PCEP Endpoints Object format

LSPA Object
The LSP Attributes (LSPA) Object contains various path attributes to be taken into account during
path computation.

Figure 18-16: PCEP LSPA Object format

The fields in the LSPA Object are:

Affinity bitmaps:

Exclude-any: exclude links that have any of the indicated “colors”. Set 0 to exclude no links

Include-any: only use links that have any of the indicated “colors”. Set 0 to accept all links

Include-all: only use links that have all of the indicated “colors”. Set 0 to accept all

Note: only “standard” 32-bit color bitmaps are supported in this object

Priorities (Setup and Holding), used in tunnel preemption.

Flags:

L-flag (Local Protection Desired): when set, the computed path must include protected segments

Optional TLVs
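
The three affinity bitmaps are evaluated as simple bitwise checks against a link's admin-group (color) bitmap. The following Python sketch illustrates the typical interpretation (an illustration of the semantics described above, not code from an implementation):

def link_passes_affinity(link_colors, exclude_any, include_any, include_all):
    if link_colors & exclude_any:                       # link carries a forbidden color
        return False
    if include_any and not (link_colors & include_any): # none of the required-any colors
        return False
    if include_all and (link_colors & include_all) != include_all:
        return False                                    # a required-all color is missing
    return True

# Link with color bit 0x1, constraint include-any=0x1, no other constraints: accepted.
print(link_passes_affinity(0x00000001, 0x0, 0x00000001, 0x0))  # True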

Metric Object
The Metric object specifies which metric must be optimized in the path computation and which metric
bounds must be applied to the path computation.

Figure 18-17: PCEP Metric Object format

The fields in the Metric Object are:

Flags:

B-flag (Bound): if unset, the Metric Type is the optimization objective; if set, the Metric Value is
a bound of the Metric Type, i.e., the resulting path must have a cumulative metric ≤ Metric Value

C-flag (Computed Metric): if set, then PCE must return the computed path metric value

Metric Type:

IGP metric, TE metric, Hop Count, path delay (RFC 8233), …

Metric Value: path metric value
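
The role of the B-flag can be illustrated with a small selection sketch: metrics with the B-flag unset define the optimization objective, while metrics with the B-flag set act as upper bounds on the resulting path (illustrative Python only, with hypothetical path data):

def select_path(paths, objective, bounds):
    # paths: list of dicts mapping metric type to cumulative value, e.g. {"igp": 30, "te": 50}
    # bounds: dict mapping metric type to its maximum allowed value (B-flag set)
    feasible = [p for p in paths if all(p[m] <= limit for m, limit in bounds.items())]
    return min(feasible, key=lambda p: p[objective]) if feasible else None

paths = [{"igp": 30, "te": 50}, {"igp": 40, "te": 20}]
print(select_path(paths, "te", {"igp": 45}))  # optimize TE, IGP bound 45 -> {'igp': 40, 'te': 20}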

Example

Example 18‑2 shows a packet capture of a PCEP Path Computation Request message. The PCC
requests the computation of an SR Policy path from 1.1.1.1 (headend) to 1.1.1.4 (endpoint),
optimizing TE metric without constraints. The path must be encoded using protected SIDs.

This message’s Request-ID-Number is 2. See Example 18‑3 in the next section for the matching Reply
message.
Example 18-2: Packet capture of a PCEP Path Computation Request message

14:54:41.769244000: 1.1.1.1:46605 --> 1.1.1.10:4189


Path Computation Element communication Protocol
PATH COMPUTATION REQUEST MESSAGE (3), length: 92
Request Parameters (RP) OBJECT
Object Class: Request Parameters (RP) OBJECT (2)
Object Type: 1, Length: 20
Object Flags: 0x2
00.. = Reserved Flags: Not Set
..1. = Processing-Rule (P): Set
...0 = Ignore (I): Not Set
Object Value:
RP Object Flags: 0x80
.... .... .... .... .... .... ..0. .... = Strict/Loose (O): Not Set
.... .... .... .... .... .... ...0 .... = Bi-directional (B): Not Set
.... .... .... .... .... .... .... 0... = Reoptimization (R): Not Set
.... .... .... .... .... .... .... .000 = Priority (Pri): 0
Request-ID-Number: 0x00000002 (2)
TLVs:
PATH-SETUP-TYPE TLV (28), Type: 28, Length: 4
Reserved: 0x000000
Path Setup Type (PST): 1 (Segment Routing)
END-POINT OBJECT
Object Class: END-POINT OBJECT (4)
Object Type: 1, Length: 12
Object Flags: 0x0
Object Value:
Source IPv4 address: 1.1.1.1
Destination IPv4 address: 1.1.1.4
LSP Attributes (LSPA) OBJECT
Object Class: LSP Attributes (LSPA) OBJECT (9)
Object Type: 1, Length: 20
Object Flags: 0x0
Object Value:
Exclude-Any: 0x00000000
Include-Any: 0x00000000
Include-All: 0x00000000
Setup Priority: 7
Holding Priority: 7
LSPA Object Flags: 0x01
.... ...1 = Local Protection Desired (L): Set
Reserved: 0x00
METRIC OBJECT
Object Class: METRIC OBJECT (6)
Object Type: 1, Length: 12
Object Flags: 0x0
Object Value:
Reserved: 0x0000
Metric Object Flags: 0x00
.... ..0. = Computed Metric (C): Not Set
.... ..0. = Bound (B): Not Set
Metric Type: 2 (TE metric)
Metric Value: 0

18.4.5 PCEP Reply Message


A PCEP Reply message must contain at least the following objects:

Request Parameters (RP) Object: administrative info about the Request


If computation successful: Explicit Route Object (ERO): path of the LSP

If no path found: NO-PATH object

A PCEP Reply message may also contain (non-exhaustive list):

Metric Object: Metric of path

RP Object

See section 18.4.4.

ERO Object

An Explicit Route Object (ERO) describes the sequence of elements (segments, links, nodes, ASs,
…) that the path traverses. The explicit route is encoded as a series of sub-objects contained in the
ERO object. For an SR Policy path, the Sub-Objects are of type “Segment Routing” (SR-ERO).

The format of the ERO and ERO Sub-Object are shown in Figure 18‑18 and Figure 18‑19
respectively.

Figure 18-18: PCEP ERO format

Figure 18-19: PCEP ERO sub-object format

The format of the SR-ERO is shown in Figure 18‑20. Each SR-ERO Sub-Object represents a segment
in the segment list. Each segment is specified as a SID and/or a segment-descriptor, called “NAI”
(Node or Adjacency Identifier) in the SR PCEP specification.

Figure 18-20: PCEP SR-ERO Sub-Object format

The fields in the SR-ERO Sub-Object are:

L-Flag: Loose, if set, this is a loose-hop in the LSP, PCC may expand

NT: NAI (Node or Adjacency Identifier) Type:

0 – NAI is absent

1 – IPv4 node ID

2 – IPv6 node ID

3 – IPv4 adjacency

4 – IPv6 adjacency

5 – unnumbered adjacency with IPv4 node IDs

Flags:

F-flag: If set, NAI == Null

S-flag: if set, SID == Null, PCC chooses SID

C-flag: if set (with M-Flag set), the SID field is a full MPLS label (incl. TC, S, and TTL); if
unset (with M-Flag set), the PCC must ignore the TC, S, and TTL fields of the MPLS label in the
SID field

M-flag: if set, the SID field is an MPLS label value; if unset, the SID field is an index into a
label range

SID: Segment Identifier

NAI: Node or Adjacency Identifier (also known as SID descriptor), format depends on the NAI
type (NT) field value:

If NAI type = 1 (IPv4 Node ID): IPv4 address of Node ID

If NAI type = 2 (IPv6 Node ID): IPv6 address of Node ID

If NAI type = 3 (IPv4 Adjacency): Pair of IPv4 addresses: Local IPv4 address and Remote IPv4
address

If NAI type = 4 (IPv6 Adjacency): Pair of IPv6 addresses: Local IPv6 address and Remote IPv6
address

If NAI type = 5 (unnumbered adjacency with IPv4 node IDs): Pair of (Node ID, Interface ID)
tuples
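
When the M-flag is set, the 32-bit SID field carries the MPLS label in its 20 most significant bits, as visible in the packet captures of Example 18‑3 and Example 18‑4 later in this chapter. A small Python decode sketch using those captured values:

def sid_field_to_label(sid_field):
    return sid_field >> 12    # drop the TC (3 bits), S (1 bit) and TTL (8 bits) fields

print(sid_field_to_label(0x03E83000))  # 16003, Prefix-SID of Node3 (Example 18-3)
print(sid_field_to_label(0x05DE2000))  # 24034, Adj-SID Node3->Node4 (Example 18-3)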

NO-PATH Object

The NO-PATH object is included in the PCEP Reply message if the path computation request cannot
be satisfied. It may contain the reason for the computation failure.

Metric Object

See section 18.4.4.

Example

Example 18‑3 shows a packet capture of a Path Computation Reply message. This Reply message is a
response to the Request message in Example 18‑2 of the previous section.

The PCE computed an SR Policy path encoded in a SID list with two SIDs: <16003, 24034>. 16003
is the Prefix-SID of Node3 (1.1.1.3), 24034 is an Adj-SID of the link between Node3 and Node4.
The cumulative TE metric of this path is 30.
Example 18-3: Packet capture of a PCEP Path Computation Reply message

14:53:20.691651000: 1.1.1.10:4189 --> 1.1.1.1:55512


Path Computation Element communication Protocol
PATH COMPUTATION REPLY MESSAGE (4), length: 68
Request Parameters (RP) OBJECT
Object Class: Request Parameters (RP) OBJECT (2)
Object Type: 1, Length: 20
Object Flags: 0x2
00.. = Reserved Flags: Not Set
..1. = Processing-Rule (P): Set
...0 = Ignore (I): Not Set
Object Value:
RP Object Flags: 0x80
.... .... .... .... .... .... ..0. .... = Strict/Loose (O): Not Set
.... .... .... .... .... .... ...0 .... = Bi-directional (B): Not Set
.... .... .... .... .... .... .... 0... = Reoptimization (R): Not Set
.... .... .... .... .... .... .... .000 = Priority (Pri): 0
Request-ID-Number: 0x00000002 (2)
TLVs:
PATH-SETUP-TYPE TLV (28), Type: 28, Length: 4
Reserved: 0x000000
Path Setup Type (PST): 1 (Segment Routing)
ERO OBJECT
Object Class: ERO OBJECT (7)
Object Type: 1, Length: 32
Object Flags: 0x0
Object Value:
SR-ERO SUBOBJECT
Loose Flag = 0x0: Not Set
Sub Object Type: 36, Length: 12
SID Type: 1 (IPv4 Node ID)
SR-ERO Sub-Object Flags: 0x001
.... .... 0... = NAI==Null (F): Not Set
.... .... .0.. = SID==Null, PCC chooses SID (S): Not Set
.... .... ..0. = Complete MPLS label (C): Not Set
.... .... ...1 = MPLS label (M): Set
SID: 0x03e83000
SID MPLS Label: 16003
IPv4 Node ID: 1.1.1.3
SR-ERO SUBOBJECT
Loose Flag = 0x0: Not Set
Sub Object Type: 36, Length: 16
SID Type: 3 (IPv4 Adjacency)
SR-ERO Sub-Object Flags: 0x001
.... .... 0... = NAI==Null (F): Not Set
.... .... .0.. = SID==Null, PCC chooses SID (S): Not Set
.... .... ..0. = Complete MPLS label (C): Not Set
.... .... ...1 = MPLS label (M): Set
SID: 0x05de2000
SID MPLS Label: 24034
Local IPv4 Address: 99.3.4.3
Remote IPv4 Address: 99.3.4.4
METRIC OBJECT
Object Class: METRIC OBJECT (6)
Object Type: 1, Length: 12
Object Flags: 0x0
Object Value:
Reserved: 0x0000
Metric Object Flags: 0x00
.... ..0. = Computed Metric (C): Not Set
.... ..0. = Bound (B): Not Set
Metric Type: 2 (TE metric)
Metric Value: 30
18.4.6 PCEP Report Message
A PCEP Report message contains at least the following objects:

LSP Object: to identify the LSP

Explicit Route Object (ERO): path of the LSP

A PCEP Report message may also contain (non-exhaustive list):

Stateful PCE Request Parameters (SRP) Object: to correlate messages (e.g., PCUpd and PCRept),
mandatory if PCReport is result of PCUpdate

LSP Attributes (LSPA) Object: various LSP attributes (such as affinity, priority)

Metric Object: Metric of calculated path

LSP Object

The LSP object identifies the path and indicates the state of the path, its delegation status, and the state
synchronization flag.

The format of the LSP Object is displayed in Figure 18‑21.

Figure 18-21: PCEP LSP Object format

The fields in the LSP Object are:

PLSP-ID: A PCEP-specific Identifier for the LSP


A PCC creates a unique PLSP-ID for each path (LSP). The PLSP-ID remains constant for the
lifetime of a PCEP session. The tuple (PCC, PLSP-ID) is a globally unique PCEP identifier of a
path.

Flags:

O (Operational State):

0 – Down; 1 – Up; 2 – Active; 3 – Going-Down; 4 – Going-Up

A (Administrative State):

If set, the path is Administratively enabled

R (Remove):

If set, PCE should remove all path state

S (Sync):

Set during State Synchronization

D (Delegate):

If set in Report, PCC delegates path

If set in Update, PCE accepts delegation

TLVs: LSP Identifier TLV, Symbolic Path Name TLV, Binding-SID TLV
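
The flags field combines these bits into a single value; for instance, the capture in Example 18‑4 later in this section shows Flags 0x000029. A small Python decode sketch, using the bit positions displayed in that capture (D = 0x1, S = 0x2, R = 0x4, A = 0x8, followed by the 3-bit Operational state and the Create bit):

OPER_STATES = {0: "DOWN", 1: "UP", 2: "ACTIVE", 3: "GOING-DOWN", 4: "GOING-UP"}

def decode_lsp_flags(flags):
    return {
        "delegate": bool(flags & 0x1),
        "sync": bool(flags & 0x2),
        "remove": bool(flags & 0x4),
        "admin_up": bool(flags & 0x8),
        "operational": OPER_STATES.get((flags >> 4) & 0x7, "unknown"),
        "create": bool((flags >> 7) & 0x1),
    }

print(decode_lsp_flags(0x000029))
# {'delegate': True, 'sync': False, 'remove': False, 'admin_up': True,
#  'operational': 'ACTIVE', 'create': False}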

LSP Identifier TLV

The LSP Identifier TLV identifies the path using fields inherited from RSVP-TE. Its format is shown
in Figure 18‑22. The fields in this TLV refer to various RSVP-TE tunnel and LSP identifiers. Some of
these fields are re-used for the SR Policy paths.
Figure 18-22: LSP Identifier TLV format

The fields in the LSP Identifier TLV are:

IPv4 Tunnel Sender Address – headend IP address

LSP ID – used to differentiate LSPs with the same Tunnel ID

Tunnel ID – path identifier that remains constant over the life time of a path

Extended Tunnel ID – generally the headend IP address

IPv4 Tunnel Endpoint Address – endpoint IP address

Symbolic Path Name TLV

A symbolic name for a path, unique per PCC. The symbolic name of a path stays constant during the
path’s lifetime. The format of this TLV is shown in Figure 18‑23.

Figure 18-23: Symbolic Path Name TLV format


Binding-SID TLV

The SR Policy candidate path’s Binding-SID. The format of this TLV is presented in Figure 18‑24.

Figure 18-24: Binding-SID TLV format

The Binding-SID TLV has the following fields:

Binding Type (BT):

BT = 0: The Binding Value field contains an MPLS label where only the label value is valid,
other fields of the MPLS label (TC, S, and TTL) are ignored

BT = 1: The Binding Value field contains an MPLS label, where all the fields of the MPLS label
(TC, S, and TTL) are significant and filled in on transmission.

Binding Value: as specified by Binding Type value

ERO Object

See section 18.4.5.

SRP Object

The SRP (Stateful PCE Request Parameters) Object is used to correlate the Update requests sent
by the PCE with the error reports and state reports sent by the PCC. The format of the SRP Object is
shown in Figure 18‑25.
Figure 18-25: SRP Object format

The SRP Object contains the following fields:

Flags:

R-flag (Remove) – if set, it indicates a request to remove a path; if unset, indicates a request to
create a path – used in Initiate message

SRP-ID-number: uniquely identifies (in the current PCEP session) the operation that the PCE has
requested the PCC to perform. SRP-ID-number is incremented each time a new request is sent to
the PCC.

LSPA Object

See section 18.4.4.

Metric Object

See section 18.4.4.

Example

Example 18‑4 shows a packet capture of a Report message. This Report message follows the Reply
message in Example 18‑3 of the previous section, after the PCC instantiated the SR Policy path. The
SR Policy path with SID list <16003, 24034> is Active and Delegated to the PCE. The SR Policy has
the name “BROWN” and has a Binding-SID label 15111.

Example 18-4: Packet capture of a PCEP Path Computation Report message

14:54:42.769244000: 1.1.1.1:46605 --> 1.1.1.10:4189


Path Computation Element communication Protocol
Path Computation LSP State Report (PCRpt) Header
..1. .... = PCEP Version: 0x1
...0 0000 = Flags: 0x00
...0 0000 = Reserved Flags: Not set
Message Type: Path Computation LSP State Report (PCRpt) (10)
Message length: 156
SRP object
Object Class: SRP OBJECT (33)
0001 .... = SRP Object-Type: SRP (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 20
Flags: 0x00000000
.... .... .... .... .... .... .... ...0 = Remove (R): Not set
SRP-ID-number: 64
PATH-SETUP-TYPE
Type: PATH-SETUP-TYPE (28)
Length: 4
Reserved: 0x000000
Path Setup Type: Path is setup using Segment Routing (1)
LSP object
Object Class: LSP OBJECT (32)
0001 .... = LSP Object-Type: LSP (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 52
.... .... 1000 0000 0000 0010 1110 .... = PLSP-ID: 524334
Flags: 0x000029
.... .... .... ...1 = Delegate (D): Set
.... .... .... ..0. = SYNC (S): Not set
.... .... .... .0.. = Remove (R): Not set
.... .... .... 1... = Administrative (A): Set
.... .... .010 .... = Operational (O): ACTIVE (2)
.... .... 0... .... = Create (C): Not Set
.... 0000 .... .... = Reserved: Not set
IPV4-LSP-IDENTIFIERS
Type: IPV4-LSP-IDENTIFIERS (18)
Length: 16
IPv4 Tunnel Sender Address: 1.1.1.1
LSP ID: 2
Tunnel ID: 46
Extended Tunnel ID: 1.1.1.1 (16843009)
IPv4 Tunnel Endpoint Address: 1.1.1.4
SYMBOLIC-PATH-NAME
Type: SYMBOLIC-PATH-NAME (17)
Length: 5
SYMBOLIC-PATH-NAME: BROWN
Padding: 000000
TE-PATH-BINDING TLV
Type: TE-PATH-BINDING TLV (65505)
Length: 6
Binding Type: 0 (MPLS)
Binding Value: 0x03b07000
MPLS label: 15111
EXPLICIT ROUTE object (ERO)
Object Class: EXPLICIT ROUTE OBJECT (ERO) (7)
0001 .... = ERO Object-Type: Explicit Route (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 32
SR
0... .... = L: Strict Hop (0)
.010 0100 = Type: SUBOBJECT SR (36)
Length: 12
0001 .... = SID Type: IPv4 Node ID (1)
.... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
SID: 65548288 (Label: 16003, TC: 0, S: 0, TTL: 0)
NAI (IPv4 Node ID): 1.1.1.3
SR
0... .... = L: Strict Hop (0)
.010 0100 = Type: SUBOBJECT SR (36)
Length: 16
0011 .... = SID Type: IPv4 Adjacency (3)
.... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
SID: 98443264 (Label: 24034, TC: 0, S: 0, TTL: 0)
Local IPv4 address: 99.3.4.3
Remote IPv4 address: 99.3.4.4
LSPA object
Object Class: LSPA OBJECT (9)
0001 .... = LSPA Object-Type: LSP Attributes (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 20
Exclude-Any: 0x00000000
Include-Any: 0x00000000
Include-All: 0x00000000
Setup Priority: 7
Holding Priority: 7
Flags: 0x01
.... ...1 = Local Protection Desired (L): Set
Reserved: 0x00
METRIC object
Object Class: METRIC OBJECT (6)
0001 .... = METRIC Object-Type: Metric (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 12
Reserved: 0
Flags: 0x00
.... ..0. = (C) Cost: Not set
.... ...0 = (B) Bound: Not set
Type: TE Metric (2)
Metric Value: 30

18.4.7 PCEP Update Message


A PCEP Update message contains at least the following objects:

Stateful PCE Request Parameters (SRP) Object: to correlate messages (e.g., PCUpd and PCRept)

LSP Object: to identify the LSP

Explicit Route Object (ERO): path of the LSP


A PCEP Update message may also contain (non-exhaustive list):

Metric Object: Metric of calculated path

SRP Object

See section 18.4.6.

LSP Object

See section 18.4.6.

ERO Object

See section 18.4.5.

Metric Object

See section 18.4.4.

Example

Example 18‑5 shows the packet capture of a Path Computation Update message. The PCE requests the
PCC to update the SR Policy path’s SID list to <16003, 16004>; both SIDs are Prefix-SIDs.

Example 18-5: Packet capture of a PCEP Path Computation Update message

14:54:47.769244000: 1.1.1.10:4189 --> 1.1.1.1:46605


Path Computation Element communication Protocol
Path Computation LSP Update Request (PCUpd) Header
..1. .... = PCEP Version: 0x1
...0 0000 = Flags: 0x00
...0 0000 = Reserved Flags: Not set
Message Type: Path Computation LSP Update Request (PCUpd) (11)
Message length: 88
SRP object
Object Class: SRP OBJECT (33)
0001 .... = SRP Object-Type: SRP (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 20
Flags: 0x00000000
.... .... .... .... .... .... .... ...0 = Remove (R): Not set
SRP-ID-number: 47
PATH-SETUP-TYPE
Type: PATH-SETUP-TYPE (28)
Length: 4
Reserved: 0x000000
Path Setup Type: Path is setup using Segment Routing (1)
LSP object
Object Class: LSP OBJECT (32)
0001 .... = LSP Object-Type: LSP (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 24
.... .... 1000 0000 0000 0001 1101 .... = PLSP-ID: 524334
Flags: 0x000009
.... .... .... ...1 = Delegate (D): Set
.... .... .... ..0. = SYNC (S): Not set
.... .... .... .0.. = Remove (R): Not set
.... .... .... 1... = Administrative (A): Set
.... .... .000 .... = Operational (O): DOWN (0)
.... .... 0... .... = Create (C): Not Set
.... 0000 .... .... = Reserved: Not set
VENDOR-INFORMATION-TLV
Type: VENDOR-INFORMATION-TLV (7)
Length: 12
Enterprise Number: ciscoSystems (9)
Enterprise-Specific Information: 0003000400000001
EXPLICIT ROUTE object (ERO)
Object Class: EXPLICIT ROUTE OBJECT (ERO) (7)
0001 .... = ERO Object-Type: Explicit Route (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 28
SR
0... .... = L: Strict Hop (0)
.010 0100 = Type: SUBOBJECT SR (36)
Length: 12
0001 .... = SID Type: IPv4 Node ID (1)
.... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
SID: 65548288 (Label: 16003, TC: 0, S: 0, TTL: 0)
NAI (IPv4 Node ID): 1.1.1.3
SR
0... .... = L: Strict Hop (0)
.010 0100 = Type: SUBOBJECT SR (36)
Length: 12
0011 .... = SID Type: IPv4 Node ID (1)
.... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
SID: 65552384 (Label: 16004, TC: 0, S: 0, TTL: 0)
NAI (IPv4 Node ID): 1.1.1.4
METRIC object
Object Class: METRIC OBJECT (6)
0001 .... = METRIC Object-Type: Metric (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 12
Reserved: 0
Flags: 0x00
.... ..0. = (C) Cost: Not set
.... ...0 = (B) Bound: Not set
Type: TE Metric (2)
Metric Value: 50

18.4.8 PCEP Initiate Message


A PCEP Initiate message contains at least the following objects:

Stateful PCE Request Parameters (SRP) Object: to correlate messages (e.g., PCUpd and PCRept)
LSP Object: to identify the LSP

Explicit Route Object (ERO): path of the LSP

A PCEP Initiate message may also contain (non-exhaustive list):

End-points Object: source and destination of the SR Policy

Metric Object: Metric of calculated path

SRP Object

See section 18.4.6.

LSP Object

See section 18.4.6.

ERO Object

See section 18.4.5.

Endpoints Object

See section 18.4.4.

Metric Object

See section 18.4.4.

Example

Example 18‑6 shows the packet capture of an Initiate message. The PCE initiates an SR Policy path
on the PCC 1.1.1.1 to endpoint 1.1.1.4. The path is named “GREEN” and has a BSID 15222. The SID
list is <16003, 24034>, with 16003 the Prefix-SID of Node3 and 24034 the Adj-SID of the link from
Node3 to Node4.

The SR Policy has color 888 and the path’s preference is 200. At the time of writing, no standard PCEP
objects were specified for these elements.

Example 18-6: Packet capture of a PCEP Path Computation Initiate message

14:59:36.738294000: 1.1.1.10:4189 --> 1.1.1.1:46605


Path Computation Element communication Protocol
Path Computation LSP Initiate (PCInitiate) Header
..1. .... = PCEP Version: 0x1
...0 0000 = Flags: 0x00
...0 0000 = Reserved Flags: Not set
Message Type: Path Computation LSP Initiate (PCInitiate) (12)
Message length: 144
SRP object
Object Class: SRP OBJECT (33)
0001 .... = SRP Object-Type: SRP (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 20
Flags: 0x00000000
.... .... .... .... .... .... .... ...0 = Remove (R): Not set
SRP-ID-number: 63
PATH-SETUP-TYPE
Type: PATH-SETUP-TYPE (28)
Length: 4
Reserved: 0x000000
Path Setup Type: Path is setup using Segment Routing (1)
LSP object
Object Class: LSP OBJECT (32)
0001 .... = LSP Object-Type: LSP (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 48
.... .... 0000 0000 0000 0000 0000 .... = PLSP-ID: 0
Flags: 0x000089
.... .... .... ...1 = Delegate (D): Set
.... .... .... ..0. = SYNC (S): Not set
.... .... .... .0.. = Remove (R): Not set
.... .... .... 1... = Administrative (A): Set
.... .... .000 .... = Operational (O): DOWN (0)
.... .... 1... .... = Create (C): Set
.... 0000 .... .... = Reserved: Not set
SYMBOLIC-PATH-NAME
Type: SYMBOLIC-PATH-NAME (17)
Length: 5
SYMBOLIC-PATH-NAME: GREEN
Padding: 000000
TE-PATH-BINDING TLV
Type: TE-PATH-BINDING TLV (65505)
Length: 6
Binding Type: 0 (MPLS)
Binding Value: 0x03b76000
MPLS label: 15222
VENDOR-INFORMATION-TLV
Type: VENDOR-INFORMATION-TLV (7)
Length: 12
Enterprise Number: ciscoSystems (9)
Enterprise-Specific Information: 0003000400000001
END-POINT object
Object Class: END-POINT OBJECT (4)
0001 .... = END-POINT Object-Type: IPv4 addresses (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 12
Source IPv4 Address: 1.1.1.1
Destination IPv4 Address: 1.1.1.4
COLOR object
Object Class: COLOR OBJECT (36)
0001 .... = COLOR Object Type: color
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 8
Color: 888
PREFERENCE object
Object Class: PREFERENCE OBJECT (37)
0001 .... = PREFERENCE Object Type: preference
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 8
Preference: 200
EXPLICIT ROUTE object (ERO)
Object Class: EXPLICIT ROUTE OBJECT (ERO) (7)
0001 .... = ERO Object-Type: Explicit Route (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 32
SR
0... .... = L: Strict Hop (0)
.010 0100 = Type: SUBOBJECT SR (36)
Length: 12
0001 .... = SID Type: IPv4 Node ID (1)
.... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
SID: 65548288 (Label: 16003, TC: 0, S: 0, TTL: 0)
NAI (IPv4 Node ID): 1.1.1.3
SR
0... .... = L: Strict Hop (0)
.010 0100 = Type: SUBOBJECT SR (36)
Length: 16
0011 .... = SID Type: IPv4 Adjacency (3)
.... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
SID: 98443264 (Label: 24034, TC: 0, S: 0, TTL: 0)
Local IPv4 address: 99.3.4.3
Remote IPv4 address: 99.3.4.4
METRIC object
Object Class: METRIC OBJECT (6)
0001 .... = METRIC Object-Type: Metric (1)
.... 0000 = Object Header Flags: 0x0
...0 = Ignore (I): Not set
..0. = Processing-Rule (P): Not set
00.. = Reserved Flags: Not set
Object Length: 12
Reserved: 0
Flags: 0x00
.... ..0. = (C) Cost: Not set
.... ...0 = (B) Bound: Not set
Type: TE Metric (2)
Metric Value: 30

18.4.9 Disjointness Association Object


Identifying paths that must be disjoint from each other is done by grouping those paths in an
Association group. To indicate the membership of a path in an association group, an Association
Object is added to the PCEP messages. This Association Object is specified in IETF draft-ietf-pce-
association-group.

The format of the Association Object is displayed in Figure 18‑26. Two types of Association Objects
exist, one with IPv4 Association Source and one with IPv6 Association Source.

Figure 18-26: Association Object format

The different fields in the Association Object are:

Flags:

R (Removal): if set, the path is removed from the association group; if unset, the path is added
or kept as part of the association group – only considered in Report and Update messages

Association type: identifies the type of association, such as “disjointness”

Association ID: this identifier in combination with Association type and Association Source
uniquely identifies an association group

Association Source: this IPv4 or IPv6 address in combination with Association type and
Association ID uniquely identifies an association group

Optional TLVs

To associate paths that must be disjoint from each other, an Association Object with Disjointness
Association Type is used. This Disjointness Association Type is defined in draft-ietf-pce-association-
diversity.

The disjointness Association Object contains a Disjointness Configuration TLV that specifies the
disjointness configuration parameters.

The format of the Disjointness Configuration TLV is shown in Figure 18‑27.

Figure 18-27: Disjointness Configuration TLV format

Where the fields are (a short encoding sketch follows the field descriptions):

Flags:

L (Link diverse) – if set, the paths within the disjoint group must be link-diverse

N (Node diverse) – if set, the paths within the disjoint group must be node-diverse

S (SRLG diverse) – if set, the paths within the disjoint group must be SRLG-diverse

P (Shortest path) – if set, the path should be computed without considering the disjointness
constraint

T (strict disjointness) – if set, the path must not fall back to a lower disjoint type if no disjoint
path can be found; if unset, the path can fall back to a lower disjoint path or to a non-disjoint
path if no disjoint path can be found.
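
To make these flag semantics more tangible, the following minimal Python sketch shows how a PCE
implementation could build the Flags value of the Disjointness Configuration TLV. It is an
illustration only, not router or PCE code; in particular, the bit positions used below (L in the
least significant bit, then N, S, P and T) are an assumption made for this sketch, so refer to
draft-ietf-pce-association-diversity for the authoritative encoding.

# Assumed bit positions, for illustration only; see
# draft-ietf-pce-association-diversity for the authoritative layout.
FLAG_L = 0x01   # Link diverse
FLAG_N = 0x02   # Node diverse
FLAG_S = 0x04   # SRLG diverse
FLAG_P = 0x08   # Shortest path, ignore the disjointness constraint
FLAG_T = 0x10   # Strict disjointness, no fallback allowed

def disjointness_flags(link=False, node=False, srlg=False,
                       shortest_path=False, strict=False):
    """Build the Flags value of the Disjointness Configuration TLV."""
    flags = 0
    if link:
        flags |= FLAG_L
    if node:
        flags |= FLAG_N
    if srlg:
        flags |= FLAG_S
    if shortest_path:
        flags |= FLAG_P
    if strict:
        flags |= FLAG_T
    return flags

# Request node-diverse paths with no fallback if disjointness cannot be met:
print(hex(disjointness_flags(node=True, strict=True)))   # 0x12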
18.5 References
[RFC4655] "A Path Computation Element (PCE)-Based Architecture", JP Vasseur, Adrian Farrel,
Gerald Ash, RFC4655, August 2006

[RFC5440] "Path Computation Element (PCE) Communication Protocol (PCEP)", JP Vasseur,


Jean-Louis Le Roux, RFC5440, March 2009

[RFC8231] "Path Computation Element Communication Protocol (PCEP) Extensions for Stateful
PCE", Edward Crabbe, Ina Minei, Jan Medved, Robert Varga, RFC8231, September 2017

[RFC8281] "Path Computation Element Communication Protocol (PCEP) Extensions for PCE-
Initiated LSP Setup in a Stateful PCE Model", Edward Crabbe, Ina Minei, Siva Sivabalan, Robert
Varga, RFC8281, December 2017

[RFC8408] "Conveying Path Setup Type in PCE Communication Protocol (PCEP) Messages", Siva
Sivabalan, Jeff Tantsura, Ina Minei, Robert Varga, Jonathan Hardwick, RFC8408, July 2018

[draft-ietf-pce-segment-routing] "PCEP Extensions for Segment Routing", Siva Sivabalan,


Clarence Filsfils, Jeff Tantsura, Wim Henderickx, Jonathan Hardwick, draft-ietf-pce-segment-
routing-16 (Work in Progress), March 2019

[draft-sivabalan-pce-binding-label-sid] "Carrying Binding Label/Segment-ID in PCE-based


Networks.", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Jonathan Hardwick, Stefano Previdi,
Cheng Li, draft-sivabalan-pce-binding-label-sid-06 (Work in Progress), February 2019

[draft-ietf-pce-association-group] "PCEP Extensions for Establishing Relationships Between Sets


of LSPs", Ina Minei, Edward Crabbe, Siva Sivabalan, Hariharan Ananthakrishnan, Dhruv Dhody,
Yosuke Tanaka, draft-ietf-pce-association-group-09 (Work in Progress), April 2019

[draft-ietf-pce-association-diversity] "Path Computation Element communication Protocol (PCEP)


extension for signaling LSP diversity constraint", Stephane Litkowski, Siva Sivabalan, Colby
Barth, Mahendra Singh Negi, draft-ietf-pce-association-diversity-06 (Work in Progress), February
2019
[draft-litkowski-pce-state-sync] "Inter Stateful Path Computation Element (PCE) Communication
Procedures.", Stephane Litkowski, Siva Sivabalan, Cheng Li, Haomian Zheng, draft-litkowski-pce-
state-sync-05 (Work in Progress), March 2019
19 BGP SR-TE
Different mechanisms are available to instantiate SR Policy candidate paths on a head-end node, such
as CLI, NETCONF, PCEP, and BGP. This chapter explains the BGP mechanism to advertise a
candidate path of an SR Policy, also known as “BGP SR-TE” or “BGP SR Policy”.

IETF draft-ietf-idr-segment-routing-te-policy specifies the BGP extensions that are needed for this.
This draft specifies a new Multi-Protocol BGP (MP-BGP) address-family to convey the SR Policy
candidate paths using a new NLRI format.

It is important to note that BGP advertises the SR Policy candidate path on its own; it is a self-contained
BGP advertisement. The SR Policy candidate path advertisement is not a prefix advertisement: it
is not related to any prefix, nor does it express any prefix reachability. It is not a tunnel advertisement
and it is not related to any tunnel. It is also not an attribute of a prefix.

If a given SR Policy candidate path must be updated, only that SR Policy candidate path needs to be
re-advertised. If a new SR Policy candidate path is defined, only that new SR Policy candidate path
needs to be advertised.

In summary, the BGP protocol is used as a conveyor of SR Policy candidate paths, taking up the same
task as e.g., the PCEP SR Policy initiation mechanism. No prefix reachability information is sent with
the SR Policy.

The relation between SR Policy and a prefix is established by matching the SR Policy’s end-point and
color to the prefix’s BGP next-hop and color community; this is Automated Steering (AS) described
in chapter 5, "Automated Steering". AS is not specific to BGP-initiated SR Policies, but it applies to
all SR Policies, regardless of their initiation mechanism.
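
To give an intuition of this color/next-hop matching, the following minimal Python sketch illustrates
how a headend could resolve a BGP service route onto an SR Policy. It is purely illustrative: the
data structures and function names are invented for this sketch, and the policy names simply follow
the auto-generated naming convention shown later in this chapter.

# SR Policies instantiated on the headend, keyed by (color, endpoint).
policies = {
    (10, "1.1.1.4"): "srte_c_10_ep_1.1.1.4",
    (888, "1.1.1.4"): "srte_c_888_ep_1.1.1.4",
}

def steer(route_nexthop, route_colors):
    """Return the SR Policy a BGP route is steered into, if any.

    route_nexthop: BGP next-hop of the service route
    route_colors:  color extended communities attached to the route
    """
    for color in route_colors:
        policy = policies.get((color, route_nexthop))
        if policy is not None:
            return policy       # Automated Steering: steer into this SR Policy
    return None                 # no match: follow the regular forwarding path

print(steer("1.1.1.4", [10]))   # srte_c_10_ep_1.1.1.4
print(steer("1.1.1.4", [20]))   # None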
19.1 SR Policy Address-Family
The Multi-protocol extensions of BGP (MP-BGP, RFC4760) provide the ability to pass arbitrary
data in BGP protocol messages. The MP_REACH_NLRI and MP_UNREACH_NLRI attributes
introduced with MP-BGP are BGP's containers for carrying opaque information. This capability has
been leveraged by, for example, BGP-LS (RFC7752) and BGP flow-spec (RFC5575).

MP-BGP can also be used to carry SR Policy information in BGP. Therefore, IETF draft-ietf-idr-
segment-routing-te-policy introduces new BGP address-families with new AFI/SAFI combinations,
where the AFI is IPv4 or IPv6, combined with a new SR Policy SAFI.

Figure 19‑1 is a high-level illustration of a BGP SR Policy Update message showing the different
attributes that are present in such a BGP Update message.
Figure 19-1: BGP SR Policy Update message showing the different attributes

A BGP SR Policy Update message contains the mandatory attributes ORIGIN, AS_PATH, and for
iBGP exchanges also LOCAL_PREF.

The BGP SR Policy NLRIs are included in the MP_REACH_NLRI attribute, as specified in
RFC4760. This attribute also contains a Next-hop, which is the BGP nexthop as we know it. It is the
IPv4 or IPv6 BGP session address if the advertising node applies next-hop-self.

Note that, although a BGP next-hop is advertised with the SR Policy NLRI, the SR Policy is not
bound to this BGP next-hop. The BGP next-hop attribute is a mandatory BGP attribute, mandated by
the BGP protocol specification, so it must be included. But it is not related to any aspect of the
advertised SR Policy path. However, as for any BGP advertisement, BGP requires the nexthop of a
route to be reachable (read “present in RIB”) before it considers installing the route (RFC4271,
Section 9.1.2).

The Tunnel-encapsulation Attribute contains the properties of the candidate path associated with the
SR Policy that is identified in the NLRI included in the Update message.

One or more (extended) community attributes are present. These community attributes are Route-Target
(RT) extended communities and/or the NO_ADVERTISE community; their purpose is
described in section 19.2.3 of this chapter.

19.1.1 SR Policy NLRI


The new NLRI for the SR Policy AFI/SAFI identifies the SR Policy of the advertised candidate path.
This NLRI has the format as shown in Figure 19‑2.
Figure 19-2: SR Policy SAFI NLRI format

The fields in this NLRI are (a short encoding sketch follows the field descriptions):

Distinguisher: a 32-bit numerical value used to distinguish multiple advertisements of the same SR
Policy. The distinguisher has no semantics and is only used to make multiple occurrences of the
same SR Policy unique (from a NLRI perspective). It is comparable to the L3VPN Route
Distinguisher (RD).

Policy Color: a 32-bit color value used to distinguish multiple SR Policies to the same end-point.
The tuple (color, end-point) identifies an SR Policy on a given head-end. The color typically
identifies the policy, SLA, or intent of the SR Policy, such as “low-delay”.

Endpoint: an IPv4 (32-bit) or IPv6 (128-bit) address, according to the AFI of the NLRI, that
identifies the endpoint of the SR Policy. The endpoint can represent a single node or a set of nodes
(e.g., using an anycast address).
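
For the IPv4 case, this NLRI is simply the concatenation of a 4-octet distinguisher, a 4-octet color
and a 4-octet endpoint, i.e., 96 bits in total; this is why IOS XR displays these NLRIs with a /96
mask, as shown in the illustrations of section 19.3. The minimal Python sketch below packs such an
NLRI body; it is illustrative only and ignores the NLRI length field and any other wire-format
details of the draft.

import socket
import struct

def sr_policy_nlri_ipv4(distinguisher, color, endpoint):
    """Pack the body of an IPv4 SR Policy NLRI: distinguisher, color, endpoint."""
    return struct.pack("!II4s", distinguisher, color,
                       socket.inet_aton(endpoint))

nlri = sr_policy_nlri_ipv4(12345, 10, "1.1.1.4")
assert len(nlri) * 8 == 96    # 96-bit NLRI, shown as [12345][10][1.1.1.4]/96
print(nlri.hex())             # 000030390000000a01010104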

19.1.2 Tunnel Encapsulation Attribute


While the NLRI identifies the SR Policy of the candidate path, the properties of the candidate path
itself are provided in a Tunnel Encapsulation Attribute that is advertised with the NLRI. The Tunnel
Encapsulation Attribute is defined in IETF draft-ietf-idr-tunnel-encaps and consists of a set of TLVs,
where each TLV contains information corresponding to a particular Tunnel-type.

A new Tunnel-type (15) is defined in draft-ietf-idr-segment-routing-te-policy to encode an SR Policy


candidate path in the attribute. The SR Policy TLV in the attribute consists of a set of sub-TLVs and
sub-sub-TLVs.

The encoding of an SR Policy candidate path in the Tunnel Encapsulation Attribute is as illustrated in
Figure 19‑3. The structure of the TLVs in the attribute reflects the structure of the SR Policy, as
discussed in chapter 2, "SR Policy".

Figure 19-3: Tunnel Encapsulation Attribute structure for SR Policy candidate path
It is important to highlight that this Tunnel Encapsulation Attribute is opaque to BGP; it has no
influence on BGP best-path selection or the path propagation procedure.

A Tunnel Encapsulation Attribute contains a single SR Policy TLV, hence it describes a single
candidate path. The format of the SR Policy TLV is shown in Figure 19‑4.

Figure 19-4: SR Policy TLV format

The SR Policy TLV may contain multiple Segment-List sub-TLVs, and each Segment-List sub-TLV
may contain multiple Segment sub-sub-TLVs, as illustrated in Figure 19‑3. The different sub- and sub-
sub-TLVs are discussed in the following sections.
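
Before going through the individual sub-TLVs, it may help to view the candidate path information as
a nested structure. The short Python sketch below mirrors the hierarchy of Figure 19‑3 with
illustrative data classes (these types are invented for this sketch and are not the wire encoding);
the example values correspond to the candidate path used in the illustrations of section 19.3.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SegmentList:                  # Segment-List sub-TLV
    weight: Optional[int] = None    # Weight sub-sub-TLV
    segments: List[int] = field(default_factory=list)   # Segment sub-sub-TLVs

@dataclass
class SrPolicyCandidatePath:        # SR Policy TLV (one per attribute)
    preference: Optional[int] = None    # Preference sub-TLV
    binding_sid: Optional[int] = None   # Binding-SID sub-TLV
    name: Optional[str] = None          # Policy Name sub-TLV
    segment_lists: List[SegmentList] = field(default_factory=list)

path = SrPolicyCandidatePath(
    preference=100, binding_sid=15001,
    segment_lists=[SegmentList(weight=1, segments=[16003, 24034]),
                   SegmentList(weight=2, segments=[16003, 16004])])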

Preference Sub-TLV

The Preference sub-TLV is an optional sub-TLV that encodes the preference of this candidate path.
The preference value of a candidate path is used by SR-TE to select the preferred path among
multiple (valid) candidates, see chapter 2, "SR Policy". Note that the preferred path selection is done
by SR-TE, not by BGP.

The format of this sub-TLV is shown in Figure 19‑5. No Flags have been defined yet.

Figure 19-5: Preference sub-TLV format

Binding-SID Sub-TLV
A controller can specify an explicit Binding-SID value for the SR Policy candidate path, see chapter
9, "Binding-SID and SRLB". In that case, the controller includes this optional Binding-SID sub-TLV
in the SR Policy TLV. The format of this sub-TLV is shown in Figure 19‑6.

Figure 19-6: Binding-SID sub-TLV format

The Binding-SID can be empty, a 4-octet MPLS label (SR MPLS), or a 16-octet IPv6 SID (SRv6).

The flags are:

S-flag: if set then the “Specified-BSID-only” behavior is enabled; the candidate path is then only
considered as the active path if a BSID is specified and available.

I-flag: if set then the “Drop Upon Invalid” behavior is enabled; the invalid SR Policy and its BSID
are kept in the forwarding table with an action to drop packets steered into it.

Priority Sub-TLV

The priority sub-TLV is an optional sub-TLV that indicates the order in which SR Policies are
recomputed following a topology change.

The Priority sub-TLV has the format illustrated in Figure 19‑7. No flags have been defined.

Figure 19-7: Priority sub-TLV format


Policy Name Sub-TLV

The Policy Name sub-TLV is an optional sub-TLV that specifies a symbolic name of the SR Policy
candidate path.

The format of this sub-TLV is illustrated in Figure 19‑8.

Figure 19-8: Policy Name sub-TLV format

Explicit-Null Label Policy (ENLP) Sub-TLV

The Explicit-null Label Policy (ENLP) sub-TLV is an optional sub-TLV that indicates whether an
explicit-null label must be imposed prior to any other labels on an unlabeled packet that is steered
into the SR Policy.

Imposing an explicit-null label at the bottom of the label stack on an unlabeled packet enables
carrying packets of one address-family in an SR Policy of another address-family, such as carrying an
IPv6 packet in an IPv4 SR Policy.

First imposing the explicit-null label ensures that the packet is carried with an MPLS label all the way
to the endpoint of the SR Policy. No PHP will be done on this packet.

The intermediate nodes on the path can then label-switch the packet and do not have to support
forwarding of the type of packet carried underneath the MPLS label. Only the endpoint must be able
to process the packet that arrives with only the explicit-null label.

The format of this sub-TLV is illustrated in Figure 19‑9.


Figure 19-9: Explicit-null Label Policy (ENLP) Sub-TLV format

No flags have been defined.

The ENLP field has one of the following values (a short sketch of the resulting behavior follows the list):

1. Impose an IPv4 explicit-null label (0) on an unlabeled IPv4 packet and do not impose an IPv6
explicit-null label (2) on an unlabeled IPv6 packet

2. Do not impose an IPv4 explicit-null label (0) on an unlabeled IPv4 packet and impose an IPv6
explicit-null label (2) on an unlabeled IPv6 packet

3. Impose an IPv4 explicit-null label (0) on an unlabeled IPv4 packet and impose an IPv6 explicit-
null label (2) on an unlabeled IPv6 packet

4. Do not impose an explicit-null label
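
The following minimal Python sketch (illustrative only; the function and constant names are invented
for this sketch) maps these four ENLP values to the label, if any, that the headend would impose
first on an unlabeled packet. Label values 0 and 2 are the reserved IPv4 and IPv6 explicit-null
labels mentioned above.

IPV4_EXPLICIT_NULL = 0
IPV6_EXPLICIT_NULL = 2

def explicit_null_label(enlp, packet_af):
    """Return the explicit-null label to impose first, or None.

    enlp:      ENLP value 1..4 as listed above
    packet_af: address-family of the unlabeled packet, "ipv4" or "ipv6"
    """
    if enlp == 1 and packet_af == "ipv4":
        return IPV4_EXPLICIT_NULL
    if enlp == 2 and packet_af == "ipv6":
        return IPV6_EXPLICIT_NULL
    if enlp == 3:
        return IPV4_EXPLICIT_NULL if packet_af == "ipv4" else IPV6_EXPLICIT_NULL
    return None     # ENLP value 4, or no explicit-null for this address-family

print(explicit_null_label(2, "ipv6"))   # 2
print(explicit_null_label(4, "ipv4"))   # None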

Segment-List Sub-TLV

The Segment List, or SID list, of an SR Policy path encodes a single explicit path to the end-point, as
a list of segments. The format of the Segment-List sub-TLV is shown in Figure 19‑10.

Figure 19-10: Segment-List sub-TLV format


This Segment-List sub-TLV is optional. As discussed in chapter 2, "SR Policy", an SR Policy
candidate path can contain multiple segment lists, and each segment list in the set can have a weight
value for Weighted ECMP load-balancing. Therefore, the SR Policy TLV can contain multiple
Segment-List sub-TLVs.

Each Segment-List sub-TLV can contain an optional Weight sub-sub-TLV and zero, one, or multiple
Segment sub-sub-TLVs.

Weight Sub-Sub-TLV

The Weight sub-sub-TLV is an optional TLV in the Segment-List sub-TLV. The format of this sub-sub-
TLV is shown in Figure 19‑11. No Flags have been defined yet.

Figure 19-11: Weight sub-TLV format
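
As a simple worked example of this weighted load-balancing (a sketch of the general idea, not of any
particular forwarding implementation), each segment list receives a share of the flows that is roughly
proportional to its weight divided by the sum of the weights. With the two segment lists used in the
illustrations of section 19.3, weights 1 and 2 result in an approximate 1/3 vs. 2/3 split:

# Weighted ECMP intuition: share = weight / sum(weights). Illustrative only.
segment_lists = [
    {"segments": [16003, 24034], "weight": 1},
    {"segments": [16003, 16004], "weight": 2},
]

total = sum(sl["weight"] for sl in segment_lists)
for sl in segment_lists:
    share = sl["weight"] / total
    print(sl["segments"], "carries ~{:.0%} of the flows".format(share))
# [16003, 24034] carries ~33% of the flows
# [16003, 16004] carries ~67% of the flows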

Segment Sub-Sub-TLV

A Segment-List sub-TLV contains zero, one, or multiple Segment sub-sub-TLVs. As specified in IETF
draft-ietf-spring-segment-routing-policy, segments in the segment list can be specified using different
segment-descriptor types. These types are also repeated in IETF draft-ietf-idr-segment-routing-te-
policy.

As an example, Figure 19‑12 shows the format of a Type 1 Segment TLV. This type specifies the SID
in the form of an MPLS label value.
Figure 19-12: Segment sub-sub-TLV Type 1: SID only, in the form of MPLS Label

Where the flags are defined as:

V-flag: if set then the Segment Verification behavior is enabled, where SR-TE verifies the validity of
this segment.

A-flag: if set then the SR Algorithm id is present in the SR Algorithm field; this flag does not apply
to all segment-descriptor types, e.g., it does not apply to type 1.

Please refer to IETF draft-ietf-idr-segment-routing-te-policy for the formats of the other Segment-
descriptor types.
19.2 SR Policy BGP Operations
Since the SR Policy candidate path is conveyed by BGP, the existing BGP operations are applicable.
This section discusses these operational aspects.

As was indicated before, BGP is merely used as the transport mechanism to convey the SR Policy
candidate path information from the controller to the head-end. The SR Policy path information is
opaque to BGP. BGP is not involved in the processing of the information, nor in the installation of the
forwarding instructions. These aspects are handled by the SR-TE functionality.

19.2.1 BGP Best-Path Selection


When BGP receives multiple paths of a given NLRI, it uses the best-path selection rules to select the
single best path for that NLRI. This procedure equally applies for BGP SR Policy advertisements.
BGP selects the best-path among all received BGP SR Policy advertisements with the same NLRI,
using the regular selection rules. BGP sends only the best-path to SR-TE.

While the existing BGP procedures apply to the BGP SR Policy advertisements, the network operator
must ensure that the role of BGP is limited to being the conveyor of the information, without interfering
with SR-TE operations. Specifically, BGP must not be the one that makes a selection between different
candidate paths of the same SR Policy. Therefore, the operator must use the Distinguisher field in the
NLRI, which is described in the next section.

19.2.2 Use of Distinguisher NLRI Field


The Distinguisher field in the SR Policy NLRI is used to make multiple candidate path advertisements
of the same SR Policy (i.e., having the same color and endpoint) unique from a NLRI perspective.
Network Layer Reachability Information (NLRI)
When BGP receives a MP-BGP update message, it receives the Network Layer Reachability Information (NLRI) and the path
attributes that are the properties of the NLRI. BGP treats the NLRI as an opaque key to an entry in its database and the path
attributes are associated to this BGP database entry.

By default, BGP only installs a single path for a given NLRI, the best-path, and when BGP propagates an NLRI to its neighbors (e.g.,
a RR reflecting routes to its RR clients), BGP only propagates this best-path.

When BGP receives multiple paths with the same NLRI (the key of the database entry), it selects the best path for that NLRI using
the BGP best-path selection procedure and only that best path is propagated. The best-path selection is influenced by the path
attributes and other elements.

When BGP receives multiple paths, each with a different NLRI, then BGP propagates all of these paths.

For those familiar with L3VPN, the SR Policy Distinguisher has a function similar to the L3VPN
Route Distinguisher (RD) value that is added to an L3VPN prefix.

One or more controllers can advertise multiple candidate paths for a given SR Policy to a headend
node. Or these controllers can advertise candidate paths for SR Policies with the same color and
endpoint to multiple headend nodes. In all these cases, the controller needs to ensure that an SR
Policy path advertisement is not inadvertently suppressed due to best-path selection. Therefore, each
SR Policy path should be advertised with a unique NLRI. This explains the purpose of the
Distinguisher field in the SR Policy NLRI: make the NLRI unique to ensure that:

RRs propagate each of the SR Policy candidate paths

BGP at the headend hands over each SR Policy candidate path to SR-TE

It is important that BGP does not perform any selection among the different candidate paths. The
candidate path selection must only be done by SR-TE, based on path validity and preference, as
described in chapter 2, "SR Policy".

Let us illustrate this with Figure 19‑13. In this network a BGP Route-Reflector (RR) is used to scale
BGP route distribution.

Two controllers advertise a candidate path for SR Policy (4.4.4.4, green) intended for headend
Node1. Both controllers send the advertisement to the RR. Each controller sends its SR Policy path
with a different distinguisher field, such that the RR treats them as separate routes (NLRIs).
Therefore, the RR propagates both advertisements to headend Node1.

Also BGP on Node1 sees both paths as independent paths, thanks to their unique NLRI, and does not
apply any BGP path selection. Hence BGP hands over both paths to the SR-TE process.

SR-TE receives both paths and performs the path selection mechanism based on path validity and
preference, to select the active path among all candidate paths of the SR Policy (4.4.4.4, green).

Figure 19-13: Use of the Distinguisher NLRI field


If the controllers had used the same distinguisher for their advertisements, then the BGP RR would
have applied best-path selection and would have propagated only the best-path to Node1. SR-TE on
Node1 would then have received only one candidate path for the SR Policy to Node4 instead of the
expected two paths.
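
The selection that SR-TE performs among the candidate paths it receives can be sketched as follows:
among the valid candidate paths, the one with the highest preference becomes the active path. The
minimal Python sketch below illustrates only this basic rule (tie-breakers and the other details
described in chapter 2 are omitted, and the function name is invented for this sketch).

def select_active_path(candidate_paths):
    """Return the valid candidate path with the highest preference, or None."""
    valid = [path for path in candidate_paths if path["valid"]]
    if not valid:
        return None
    return max(valid, key=lambda path: path["preference"])

candidates = [
    {"preference": 100, "valid": True, "sid_list": [16003, 24034]},
    {"preference": 200, "valid": True, "sid_list": [16003, 16004]},
]
print(select_active_path(candidates)["sid_list"])   # [16003, 16004]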

19.2.3 Target Headend Node


A controller must specify the intended target headend, i.e., the headend node that should instantiate
the SR Policy candidate path specified in the BGP SR Policy advertisement.

There are two possible mechanisms to do this, as illustrated in Figure 19‑14. Which mechanism can
be used depends on whether the controller has a direct BGP session to the target headend node or not.

Figure 19-14: Direct session or via RR

For the first mechanism as shown in Figure 19‑14 (a), the controller may or may not have a direct
BGP session to the target headend. To identify the target headend, the controller attaches a Route-
Target (RT) Extended Community attribute to the SR Policy advertisement. This is the well-known
RT, as defined in RFC4360 and also used for BGP L3VPN. The RT is in IPv4-address format, where
the IPv4 address matches the BGP router-id of the intended headend node.
If the SR Policy path must be instantiated on multiple headend nodes, multiple RTs can be attached,
one for each targeted headend node. Only the nodes identified by one of the RTs will hand the SR
Policy candidate path information to the SR-TE process.

In the case that the controller has a direct BGP session to the intended headend node, as shown in
Figure 19‑14 (b), two methods are possible: attach an RT matching the headend node to the BGP SR
Policy advertisement as described before or attach a NO_ADVERTISE community.

The presence of the well-known NO_ADVERTISE community (value 0xFFFFFF02, RFC1997)
prevents the receiving node from propagating this advertisement to any of its BGP neighbors. If the
headend node receives the SR Policy advertisement with a NO_ADVERTISE community attached,
then it knows it is the intended target node.

If a node receives a BGP SR Policy advertisement that has neither an RT nor a NO_ADVERTISE
community attached, the advertisement is considered invalid.

As a summary, a controller can attach an RT that identifies the intended headend node to an SR Policy
advertisement. If the controller has a direct BGP session to the intended headend node, then it can
also attach the NO_ADVERTISE community to the SR Policy advertisement.
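
The resulting acceptance logic on a receiving node can be sketched as follows (a minimal,
illustrative Python sketch of the behavior described above, not an implementation; the function name
and data representation are invented for this sketch):

NO_ADVERTISE = "NO_ADVERTISE"    # well-known community 0xFFFFFF02

def hand_to_srte(local_router_id, route_targets, communities):
    """Decide whether a received SR Policy path is handed to the local SR-TE process.

    The path is used locally if an RT matches the local BGP router-id, or if the
    NO_ADVERTISE community is attached (direct-session case). With neither an RT
    nor NO_ADVERTISE attached, the advertisement is considered invalid.
    """
    if NO_ADVERTISE in communities:
        return True
    # RT in IPv4-address format, e.g., "1.1.1.1:0" targets router-id 1.1.1.1
    return any(rt.split(":")[0] == local_router_id for rt in route_targets)

print(hand_to_srte("1.1.1.1", ["1.1.1.1:0"], []))   # True
print(hand_to_srte("1.1.1.2", ["1.1.1.1:0"], []))   # False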
19.3 Illustrations
The BGP configuration of headend Node1 in Figure 19‑15 is shown in Example 19‑1.

Node1 has a BGP session to the controller (1.1.1.10) with address-family IPv4 SR Policy enabled.
Node1 also has SR-TE enabled.

Figure 19-15: BGP SR Policy illustration

Example 19-1: BGP SR-TE configuration

router bgp 1
bgp router-id 1.1.1.1
address-family ipv4 sr-policy
!
neighbor 1.1.1.10
remote-as 1
update-source Loopback0
address-family ipv4 sr-policy
!
segment-routing
traffic-eng
The controller sends a BGP update of an SR Policy to headend Node1. The SR Policy has a color 10
and an IPv4 endpoint 1.1.1.4. The advertised SR Policy NLRI consists of:

Distinguisher: 12345

Color: 10

Endpoint: 1.1.1.4

The preference of the advertised candidate path is 100 and its Binding-SID is 15001. The candidate
path consists of two SID lists: <16003, 24034> with weight 1 and <16003, 16004> with weight 2.

Example 19‑2 shows the BGP Update packet capture. The update message contains the
MP_REACH_NLRI attribute; the mandatory attributes ORIGIN, AS_PATH, and LOCAL_PREF;
EXTENDED_COMMUNITIES attribute; and TUNNEL_ENCAPSULATION attribute.

The MP_REACH_NLRI attribute contains the nexthop address 1.1.1.10, which is the BGP router-id
of the controller. The nexthop is not used but BGP requires that it is a reachable address.

The NLRI consists of the three fields described above (Distinguisher, Color, Endpoint).

The controller added the RT extended community 1.1.1.1:0 to indicate that Node1 (with router-id
1.1.1.1) is the intended headend node.

The TUNNEL_ENCAPSULATION attribute contains all the SR Policy’s candidate path information,
structured with TLVs and sub-TLVs.
Example 19-2: BGP SR-TE Update message packet capture

Border Gateway Protocol - UPDATE Message


Marker: ffffffffffffffffffffffffffffffff
Length: 153
Type: UPDATE Message (2)
Withdrawn Routes Length: 0
Total Path Attribute Length: 130
Path attributes
Path Attribute - MP_REACH_NLRI
Address family identifier (AFI): IPv4 (1)
Subsequent address family identifier (SAFI): SR TE Policy (73)
Next hop network address (4 bytes)
IPv4=1.1.1.10
Number of Subnetwork points of attachment (SNPA): 0
Network layer reachability information (13 bytes)
SR TE Policy
Distinguisher: 12345
Color: 10
Endpoint: IPv4=1.1.1.4
Path Attribute - ORIGIN: IGP
Path Attribute - AS_PATH: empty
Path Attribute - LOCAL_PREF: 100
Path Attribute - EXTENDED_COMMUNITIES
Carried extended communities: (1 community)
Community Transitive IPv4-Address Route Target: 1.1.1.1:0
Path Attribute - TUNNEL_ENCAPSULATION
TLV Encodings
SR TE Policy
Sub-TLV Encodings
Preference: 100
SR TE Binding SID: 15001
Segment List
Sub-sub-TLV Encodings
weight: 1
Segment: (Type 1) label: 16003, TC: 0, S: 0, TTL: 0
Segment: (Type 1) label: 24034, TC: 0, S: 0, TTL: 0
Segment List
Sub-sub-TLV Encodings
weight: 2
Segment: (Type 1) label: 16003, TC: 0, S: 0, TTL: 0
Segment: (Type 1) label: 16004, TC: 0, S: 0, TTL: 0

Example 19‑3 shows the summary of the BGP SR Policy paths on the headend Node1. In this case,
Node1 received a single SR Policy path with NLRI [12345][10][1.1.1.4]/96, as shown in the
output. The format of this string is indicated in the legend of the output: [distinguisher][color]
[endpoint]/mask.
Example 19-3: BGP IPv4 SR Policy update on Node1

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy


BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0 RD version: 16
BGP main routing table version 16
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best


i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network codes: [distinguisher][color][endpoint]/mask
Network Next Hop Metric LocPrf Weight Path
*>i[12345][10][1.1.1.4]/96
1.1.1.10 100 0 i

Processed 1 prefixes, 1 paths

To show more details, add the NLRI string to the show command as well as the detail keyword, as
shown in Example 19‑4.

Example 19-4: BGP IPv4 SR Policy update on Node1 – detail

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [12345][10][1.1.1.4]/96 detail


BGP routing table entry for [12345][10][1.1.1.4]/96
Versions:
Process bRIB/RIB SendTblVer
Speaker 16 16
Flags: 0x00001001+0x00000200;
Last Modified: Mar 27 10:06:04.986 for 00:05:50
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Flags: 0x4000000001060005, import: 0x20
Not advertised to any peer
Local
1.1.1.10 (metric 50) from 1.1.1.10 (1.1.1.10)
Origin IGP, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 16
Extended community: RT:1.1.1.1:0
Tunnel encap attribute type: 15 (SR policy)
bsid 15001, preference 100, num of segment-lists 2
segment-list 1, weight 1
segments: {16003} {24034}
segment-list 2, weight 2
segments: {16003} {16004}
SR policy state is UP, Allocated bsid 15001

Headend Node1 instantiates the SR Policy, as shown in Example 19‑5. The SR Policy has the
automatically assigned name srte_c_10_ep_1.1.1.4, where “c_10” indicates color 10 and
“ep_1.1.1.4” indicates endpoint 1.1.1.4. The candidate path has the automatically assigned name
bgp_c_10_ep_1.1.1.4_discr_12345, where “bgp” indicates it is a BGP-signaled path and
“discr_12345” indicates distinguisher 12345.

Example 19-5: SR Policy instantiated on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:00:07 (since Mar 27 10:06:05.106)
Candidate-paths:
Preference: 100 (BGP, RD: 12345) (active)
Requested BSID: 15001
PCC info:
Symbolic name: bgp_c_10_ep_1.1.1.4_discr_12345
PLSP-ID: 18
Explicit: segment-list (valid)
Weight: 1
16003 [Prefix-SID, 1.1.1.3]
24034
Explicit: segment-list (valid)
Weight: 2
16003 [Prefix-SID, 1.1.1.3]
16004
Attributes:
Binding SID: 15001 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes

19.3.1 Illustration NLRI Distinguisher


19.3.1.1 Same Distinguisher, Same NLRI
Consider headend Node1 where two candidate paths of the same SR Policy (1.1.1.4, 10) are signaled
via BGP and whose respective NLRIs have the same distinguisher:

NLRI 1 with distinguisher = 12345, color = 10, endpoint = 1.1.1.4

preference 100, segment list <16003, 24034>

NLRI 2 with distinguisher = 12345, color = 10, endpoint = 1.1.1.4

preference 200, segment list <16003, 16004>


Because the NLRIs are identical (same endpoint, same color, same distinguisher), BGP performs best-
path selection as usual and passes only the best-path to the SR-TE process. Note that there are no
changes to the BGP best-path selection algorithm; the candidate paths’ preference values do not have any
effect on the BGP best-path selection process.

The two advertisements as received by Node1 are shown in Example 19‑6. Both advertisements, one
from 1.1.1.10 and another from 1.1.1.9, are shown as paths of the same NLRI. The detailed BGP
output, shown in Example 19‑7, indicates that the advertisement from 1.1.1.10 is the best and is
selected as best-path.

As SR-TE receives only one path, shown in Example 19‑8, it does not perform any path selection.

While the controllers intended to instantiate two candidate paths for SR Policy (1.1.1.4, 10), a
preferred path with preference 200 and a less preferred path with preference 100, BGP only passed
one path to SR-TE. Since both paths were advertised with the same NLRI, the intended outcome was
not achieved.

Example 19-6: SR Policy BGP advertisements with the same NLRI

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy


BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0 RD version: 24
BGP main routing table version 24
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best


i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network codes: [distinguisher][color][endpoint]/mask
Network Next Hop Metric LocPrf Weight Path
*>i[12345][10][1.1.1.4]/96
1.1.1.10 100 0 i
* i 1.1.1.9 100 0 i

Processed 1 prefixes, 2 paths


Example 19-7: SR Policy BGP advertisements with the same NLRI – detail

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [12345][10][1.1.1.4]/96


BGP routing table entry for [12345][10][1.1.1.4]/96
Versions:
Process bRIB/RIB SendTblVer
Speaker 24 24
Last Modified: Mar 27 10:52:10.986 for 02:24:22
Paths: (2 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
Local
1.1.1.10 (metric 50) from 1.1.1.10 (1.1.1.10)
Origin IGP, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 24
Extended community: RT:1.1.1.1:0
Tunnel encap attribute type: 15 (SR policy)
bsid 15001, preference 100, num of segment-lists 1
segment-list 1, weight 1
segments: {16003} {24034}
SR policy state is UP, Allocated bsid 15001
Path #2: Received by speaker 0
Not advertised to any peer
Local
1.1.1.9 (metric 50) from 1.1.1.9 (1.1.1.9)
Origin IGP, localpref 100, valid, internal
Received Path ID 0, Local Path ID 0, version 0
Extended community: RT:1.1.1.1:0
Tunnel encap attribute type: 15 (SR policy)
bsid 15001, preference 200, num of segment-lists 1
segment-list 1, weight 1
segments: {16003} {16004}
SR policy state is Down (Request pending), Allocated bsid none

Example 19-8: SR-TE only receives one candidate path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 02:25:03 (since Mar 27 10:52:11.187)
Candidate-paths:
Preference 100 (BGP, RD: 12345) (active)
Requested BSID: 15000
PCC info:
Symbolic name: bgp_c_10_ep_1.1.1.4_discr_12345
PLSP-ID: 16
Explicit: segment-list (valid)
Weight: 1
16003
24034
Attributes:
Binding SID: 15001 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes
19.3.1.2 Different Distinguishers, Different NLRIs
Consider headend Node1 where two candidate paths of the same SR Policy (1.1.1.4, 10) are signaled
via BGP and whose respective NLRIs have different distinguishers:

NLRI 1 with distinguisher = 12345, color = 10, endpoint = 1.1.1.4

preference 100, segment list <16003, 24034>

NLRI 2 with distinguisher = 54321, color = 10, endpoint = 1.1.1.4

preference 200, segment list <16003, 16004>

The two advertisements as received by Node1 are shown in Example 19‑9. Both advertisements, one
from 1.1.1.10 and another from 1.1.1.9, have a different NLRI. The detailed BGP output of these paths
are shown in Example 19‑10 and Example 19‑11.

Because the NLRIs are different (different distinguisher), they each have a BGP best-path. Therefore,
BGP passes both paths to the SR-TE process. From these two candidate paths, SR-TE selects the path
with the highest preference as active path, since both paths are valid. Thus, the path advertised by
1.1.1.9 becomes the active path for SR Policy (1.1.1.4, 10). This is presented in Example 19‑12.

Since each path was advertised with a different NLRI, the resulting active path is the intended active
path for the SR Policy. The recommended approach is to use NLRIs with different distinguishers
when several candidate paths for the same SR Policy (color, endpoint) are signaled via BGP to a
headend.
Example 19-9: SR Policy BGP advertisements with different NLRI

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy


BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0 RD version: 25
BGP main routing table version 25
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best


i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network codes: [distinguisher][color][endpoint]/mask
Network Next Hop Metric LocPrf Weight Path
*>i[12345][10][1.1.1.4]/96
1.1.1.10 100 0 i
*>i[54321][10][1.1.1.4]/96
1.1.1.9 100 0 i

Processed 2 prefixes, 2 paths

Example 19-10: SR Policy BGP advertisements with different NLRI – path 1

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [12345][10][1.1.1.4]/96


BGP routing table entry for [12345][10][1.1.1.4]/96
Versions:
Process bRIB/RIB SendTblVer
Speaker 24 24
Last Modified: Mar 27 10:52:10.986 for 02:26:39
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
Local
1.1.1.10 (metric 50) from 1.1.1.10 (1.1.1.10)
Origin IGP, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 24
Extended community: RT:1.1.1.1:0
Tunnel encap attribute type: 15 (SR policy)
bsid 15001, preference 100, num of segment-lists 1
segment-list 1, weight 1
segments: {16003} {24034}
SR policy state is UP, Allocated bsid 15001
Example 19-11: SR Policy BGP advertisements with different NLRI – path 2

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [54321][10][1.1.1.4]/96


BGP routing table entry for [54321][10][1.1.1.4]/96
Versions:
Process bRIB/RIB SendTblVer
Speaker 25 25
Last Modified: Mar 27 13:18:27.986 for 00:00:31
Paths: (1 available, best #1)
Not advertised to any peer
Path #1: Received by speaker 0
Not advertised to any peer
Local
1.1.1.9 (metric 50) from 1.1.1.9 (1.1.1.9)
Origin IGP, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 25
Extended community: RT:1.1.1.1:0
Tunnel encap attribute type: 15 (SR policy)
bsid 15001, preference 200, num of segment-lists 1
segment-list 1, weight 1
segments: {16003} {16004}
SR policy state is UP, Allocated bsid 15001

Example 19-12: SR-TE receives two candidate paths and selects one as the active path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database


---------------------

Color: 10, End-point: 1.1.1.4


Name: srte_c_10_ep_1.1.1.4
Status:
Admin: up Operational: up for 00:28:01 (since Aug 8 20:23:00.736)
Candidate-paths:
Preference: 200 (BGP, RD: 54321) (active)
Requested BSID: 15001
PCC info:
Symbolic name: bgp_c_10_ep_1.1.1.4_discr_54321
PLSP-ID: 17
Explicit: segment-list (valid)
Weight: 1
16003 [Prefix-SID, 1.1.1.3]
16004
Preference: 100 (BGP, RD: 12345)
Requested BSID: 15001
PCC info:
Symbolic name: bgp_c_10_ep_1.1.1.4_discr_12345
PLSP-ID: 16
Explicit: segment-list (invalid)
Weight: 1
Attributes:
Binding SID: 15001 (SRLB)
Forward Class: 0
Steering BGP disabled: no
IPv6 caps enable: yes
19.4 References
[RFC1997] "BGP Communities Attribute", Tony Li, Ravi Chandra, Paul S. Traina, RFC1997,
August 1996

[RFC4760] "Multiprotocol Extensions for BGP-4", Ravi Chandra, Yakov Rekhter, Tony J. Bates,
Dave Katz, RFC4760, January 2007

[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information


Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752,
March 2016

[RFC5575] "Dissemination of Flow Specification Rules", Pedro R. Marques, Jared Mauch,


Nischal Sheth, Barry Greene, Robert Raszuk, Danny R. McPherson, RFC5575, August 2009

[draft-ietf-idr-tunnel-encaps] "The BGP Tunnel Encapsulation Attribute", Eric C. Rosen, Keyur


Patel, Gunter Van de Velde, draft-ietf-idr-tunnel-encaps-11 (Work in Progress), February 2019

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano


Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-
segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence


Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-
routing-policy-02 (Work in Progress), October 2018
20 Telemetry
Before the introduction of telemetry, a network operator used various methods to collect data from
networking devices: SNMP, syslog, CLI scripts, etc. These methods had a number of problems: they
were incomplete, unscalable, and unstructured. With telemetry, the network data harvesting model changes
from a pull model to a push model, from polling to streaming. Therefore, telemetry is often referred to
as “streaming telemetry”.

Polling – when the network management system (NMS) wants to get some network data from a
networking device, such as interface counter values, it sends a request to this device, specifying the
data it wants to receive. The device collects the data and returns it to the NMS in a reply. This
sequence is repeated whenever the NMS needs new data.

Streaming – data is continuously sent (“streamed”) from the networking devices to one or more
collectors, either at fixed intervals or when the data changes. A collector can subscribe to the
specific data streams it is interested in.

Streaming data is more efficient than polling. The collector can simply consume the received data
without periodic polling, and the networking device can collect the data more efficiently as it knows
beforehand what to collect, which allows the collection to be organized efficiently. Streaming also avoids
the per-request CPU hit associated with polling and, more importantly, if there are many
receivers the device can simply replicate the data instead of processing multiple requests for the same data.

The telemetry data is model-driven, which means telemetry uses data that is structured in a consistent
format, a data-model. The various types of data are structured in different models, with each data
model specified in a YANG module, using the YANG data modeling language (RFC 6020).

A YANG module is a file that describes a data-model, how the data is structured in a hierarchical
(tree) structure, using the YANG language.

There are multiple ways to structure data and therefore multiple types of YANG models exist: “Cisco
IOS XR native model”, Openconfig, IETF, … A number of modules (IETF, vendor-specific, …) are
available on the GitHub YangModels repository [Github-YANG]. For example, the Cisco IOS XR
native YANG models can be found in [YANG-Cisco]. Openconfig YANG modules can be found in
the Openconfig GitHub repository [Openconfig-YANG].
20.1 Telemetry Configuration
The model-driven telemetry configuration in Cisco IOS XR consists of three parts:

What data to stream

Where to send it and how

When to send it

20.1.1 What Data to Stream


We will use the example of a popular YANG module that describes the data-model of the Statistics
Infrastructure of a router, containing the interface statistics: Cisco-IOS-XR-infra-statsd-oper.yang.

Example 20‑1 shows the YANG module as presented by the pyang tool, which dumps the module
in a tree structure. Large chunks of the output have been removed for brevity, but it gives an idea of
the structure of this data-model.

Example 20-1: Cisco-IOS-XR-infra-statsd-oper.yang tree structure

$ pyang -f tree Cisco-IOS-XR-infra-statsd-oper.yang --tree-depth 4


module: Cisco-IOS-XR-infra-statsd-oper
+--ro infra-statistics
+--ro interfaces
+--ro interface* [interface-name]
+--ro cache
| ...
+--ro latest
| ...
+--ro total
| ...
+--ro interface-name xr:Interface-name
+--ro protocols
| ...
+--ro interfaces-mib-counters
| ...
+--ro data-rate
| ...
+--ro generic-counters
...

To find the data that you want to stream using telemetry, you first need to identify the data-model that
contains the required data, and then the leafs within that data-model that contain it.
In this example, we want to stream the generic interface statistics (packets/bytes received,
packet/bytes sent, etc.). The Cisco-IOS-XR-infra-statsd-oper YANG model that we saw earlier may
contain this data. Example 20‑2 shows the content of the “latest/generic-counters” container of the
Cisco-IOS-XR-infra-statsd-oper YANG model. It contains the statistics that we are interested in.

Example 20-2: Cisco-IOS-XR-infra-statsd-oper.yang

$ pyang -f tree Cisco-IOS-XR-infra-statsd-oper.yang --tree-path infra-


statistics/interfaces/interface/latest/generic-counters

module: Cisco-IOS-XR-infra-statsd-oper
+--ro infra-statistics
+--ro interfaces
+--ro interface* [interface-name]
+--ro latest
+--ro generic-counters
+--ro packets-received? uint64
+--ro bytes-received? uint64
+--ro packets-sent? uint64
+--ro bytes-sent? uint64
+--ro multicast-packets-received? uint64
+--ro broadcast-packets-received? uint64
...

Now that we have found the YANG model and the subtree within that model that contains the desired
data, we can configure the networking device to stream this data. A sensor-path is configured in the
sensor-group section of the telemetry model-driven configuration section, as shown in
Example 20‑3.

This sensor-path consists of two elements separated by a “:”. The first element is the name of the
YANG model. This is the name of the YANG module file without the “.yang” extension, “Cisco-IOS-
XR-infra-statsd-oper” in the example. The second element of the sensor-path is the subtree path,
“infra-statistics/interfaces/interface/latest/generic-counters” in the example. Multiple sensor-paths
can be configured under the sensor-group if desired.

Example 20-3: MDT sensor-path configuration

telemetry model-driven
sensor-group SGROUP1
sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters

Example 20‑4 shows a sample of data that is streamed when specifying the example sensor-path. The
data of all interfaces is streamed, the output only shows the data of a single interface.
Example 20-4: Sample of streamed telemetry data

...
{
"Timestamp": 1530541658167,
"Keys": {
"interface-name": "TenGigE0/1/0/0"
},
"Content": {
"applique": 0,
"availability-flag": 0,
"broadcast-packets-received": 2,
"broadcast-packets-sent": 1,
"bytes-received": 772890213,
"bytes-sent": 1245490036,
"carrier-transitions": 1,
"crc-errors": 0,
"framing-errors-received": 0,
"giant-packets-received": 0,
"input-aborts": 0,
"input-drops": 11,
"input-errors": 0,
"input-ignored-packets": 0,
"input-overruns": 0,
"input-queue-drops": 0,
"last-data-time": 1530541658,
"last-discontinuity-time": 1528316495,
"multicast-packets-received": 1768685,
"multicast-packets-sent": 1026962,
"output-buffer-failures": 0,
"output-buffers-swapped-out": 0,
"output-drops": 0,
"output-errors": 0,
"output-queue-drops": 0,
"output-underruns": 0,
"packets-received": 4671580,
"packets-sent": 9672832,
"parity-packets-received": 0,
"resets": 0,
"runt-packets-received": 0,
"seconds-since-last-clear-counters": 0,
"seconds-since-packet-received": 0,
"seconds-since-packet-sent": 0,
"throttled-packets-received": 0,
"unknown-protocol-packets-received": 0
}
},
...

Table 20‑1 shows some commonly used YANG models.

Table 20-1: Examples of commonly used YANG models

Feature YANG Model

Interfaces Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters

QoS Cisco-IOS-XR-qos-ma-oper:qos/interface-table/interface

Memory Cisco-IOS-XR-nto-misc-shmem-oper:memory-summary/nodes/node/summary
Cisco-IOS-XR-nto-misc-shprocmem-oper:processes-memory/nodes/node

CPU Cisco-IOS-XR-wdsysmon-fd-oper:system-monitoring/cpu-utilization

BGP Cisco-IOS-XR-ipv4-bgp-oper:bgp/instances/instance/instance-active/default-vrf/neighbors/neighbor

IP Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/protocols/protocol

The YANG models can also be found on the router itself. The modules are located in the directory
/pkg/yang that can be accessed after activating the shell (run) as shown in Example 20‑5. The
example lists the performance measurement YANG modules. Use more or less to peek inside a
module.

Example 20-5: Accessing YANG modules on the router

RP/0/RSP0/CPU0:ASR9904-B# run
# cd /pkg/yang
# ls Cisco-IOS-XR-perf-meas*
Cisco-IOS-XR-perf-meas-cfg.yang
Cisco-IOS-XR-perf-meas-oper-sub1.yang
Cisco-IOS-XR-perf-meas-oper-sub2.yang
Cisco-IOS-XR-perf-meas-oper.yang
# more Cisco-IOS-XR-perf-meas-oper.yang
module Cisco-IOS-XR-perf-meas-oper {

/*** NAMESPACE / PREFIX DEFINITION ***/

namespace "http://cisco.com/ns/yang/Cisco-IOS-XR-perf-meas-oper";

prefix "perf-meas-oper";

/*** LINKAGE (IMPORTS / INCLUDES) ***/

import Cisco-IOS-XR-types { prefix "xr"; }

include Cisco-IOS-XR-perf-meas-oper-sub2 {
revision-date 2017-10-17;
}

include Cisco-IOS-XR-perf-meas-oper-sub1 {
revision-date 2017-10-17;
}

/*** META INFORMATION ***/

--More--

# exit

20.1.2 Where to Send It and How


In the previous section we have identified which data needs to be streamed. In this section, we specify
where the data will be sent, using which protocol, and in which format.
The transport and encoding is configured under a destination-group section of the telemetry
model-driven configuration. In Example 20‑6, the collector’s address is 172.21.174.74 and it listens
at TCP port 5432.

Example 20-6: MDT destination-group configuration

telemetry model-driven
destination-group DGROUP1
address family ipv4 172.21.174.74 port 5432
encoding self-describing-gpb
protocol tcp

The other protocol options are UDP and gRPC [gRPC]. Note that the available configuration options
depend on platform and software release.

With gRPC there is an option to let the collector dial in to the router; in that case, the collector
initiates a gRPC session to the router and specifies a subscription. The router then streams the data
that is specified by the sensor-group(s) in that subscription.

To transport the data from the device to the collector, it must be encoded (or “serialized”) in a format
that can be transmitted across the network. The collector decodes (or “de-serializes”) the received
data into a semantically identical copy of the original data. Common text-based encodings are JSON
and XML.

Another popular encoding format is Google Protocol Buffers (GPB), “protobufs” for short. From the
GPB website [GPB]: “Protocol buffers are Google's language-neutral, platform-neutral, extensible
mechanism for serializing structured data – think XML, but smaller, faster, and simpler.” The GPB
encoding serializes data into a binary format that is compact but not self-describing (i.e., you need an
external specification to decode the data). Specifically, the field names are not included in the
transmitted data since these are very verbose, especially when compared to mostly numerical data,
and do not change between sample intervals. There is a self-describing variant of GPB that is less
efficient but simpler to use.
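
To illustrate why omitting the field names matters, the small Python sketch below compares a
self-describing encoding (JSON, with the field names repeated in every sample) against a compact,
positional encoding of the same two counters (values only, packed as fixed-size integers). This is
only meant to show the size difference; it is not the actual GPB wire format.

import json
import struct

sample = {"bytes-received": 772890213, "bytes-sent": 1245490036}

# Self-describing: the field names travel with every sample.
self_describing = json.dumps(sample).encode()

# Compact/positional: values only; an external specification of the message
# layout tells the collector what each position means.
compact = struct.pack("!QQ", sample["bytes-received"], sample["bytes-sent"])

print(len(self_describing), "bytes self-describing")   # 55 bytes
print(len(compact), "bytes compact")                   # 16 bytes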

The encoding is configured in the destination-group of the telemetry configuration. In Example 20‑6
the encoding is specified as self-describing-gpb.

20.1.3 When to Send It


A subscription configuration section under telemetry model-driven specifies which sensor-group(s)
must be streamed to which destination-id(s) and at what interval. Streaming telemetry can send the
data periodically at a fixed interval, or only when it changes. The latter is achieved by specifying a
sample-interval 0.

The configuration in Example 20‑7 specifies that the subscription SUB1 streams the sensor-path(s)
specified in the sensor-group SGROUP1 every 30 seconds (30000 ms) to the collector(s) specified in
destination-id DGROUP1.

Example 20-7: MDT subscription configuration

telemetry model-driven
subscription SUB1
sensor-group-id SGROUP1 sample-interval 30000
destination-id DGROUP1

Example 20‑8 shows the complete telemetry model-driven configuration.

Example 20-8: MDT complete configuration

telemetry model-driven
destination-group DGROUP1
address family ipv4 172.21.174.74 port 5432
encoding self-describing-gpb
protocol tcp
!
sensor-group SGROUP1
sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters
!
subscription SUB1
sensor-group-id SGROUP1 sample-interval 30000
destination-id DGROUP1
20.2 Collectors and Analytics
The collector and analytics platform are not in scope of this book. Worth mentioning is the open-
source pipeline [pipeline], a flexible, multi-function collection service. It can collect telemetry data
from the network devices, write the telemetry data to a text file, push the data to a Kafka bus and/or
format it for consumption by various analytics stacks.
20.3 References
[RFC6020] "YANG - A Data Modeling Language for the Network Configuration Protocol
(NETCONF)", Martin Björklund, RFC6020, October 2010

[Github-YANG] https://github.com/YangModels

[YANG-Cisco] https://github.com/YangModels/yang/tree/master/vendor/cisco

[Openconfig-YANG] https://github.com/openconfig/public/tree/master/release/models

[gRPC] https://grpc.io/

[GPB] https://developers.google.com/protocol-buffers/

[pipeline] https://github.com/cisco/bigmuddy-network-telemetry-pipeline

[Telemetry] https://xrdocs.io/telemetry/
Section IV – Appendices
A. Introduction of SR Book Part I
This appendix provides an unmodified copy of the subjective introduction of Part I of this SR book
series as written in 2016. It describes the intuitions and design objectives of the overall Segment
Routing solution.
A.1 Objectives of the Book
“Segment Routing – Part I” has several objectives:

To teach the basic elements of Segment Routing (SR) with a focus on the IGP extensions, the
IGP/LDP interaction, Topology-Independent Fast Reroute (TI-LFA FRR) and the MPLS data
plane. “Segment Routing – Part II” will focus on the Traffic Engineering applications, both
distributed and centralized. Part I describes the SR use-cases that have been first considered for
deployment. Part I lays the foundation for Part II.
To provide design guidelines and illustrations based on real use-cases we have deployed
To invite operators who have participated in this project to provide their view and offer tips
based on their experience defining and deploying SR
To explain why we have defined SR the way it is, what were our primary objectives, what were
our intuitions, what happened during the first 3 years of this technology

The first objective is natural and we will dedicate lots of content to teach SR, let’s say in an
“objective” manner.

We believe that it is important to complement the “objective” understanding of a technology with a


more “subjective” viewpoint. This helps to understand what part of the technology is really important
in real life, what to optimize for, what trade-offs to make or not make.

The other three objectives are related to getting this subjective experience. By reviewing deployed
use-cases, by incorporating the highlights of the operators, by explaining how we came up with SR
and what were our initial intuitions and goals; we hope that the readers will get a much more
practical understanding of the SR architecture.

The main part of the text is dedicated to the first goal, the objective explanation of the technology and
the description of use-cases. Important concepts are recalled in textboxes titled “Highlight”.

To clearly distinguish the subjective content from the main flow of the book (objective), the
subjective viewpoints will be inserted in text boxes attributed to the person providing the subjective
opinion.
This entire first chapter should be considered as a subjective entry. It is written by Clarence to
describe how he got started with SR, then the influence of SDN and OpenFlow, how he managed the
project and got the close involvement of operators to define the technology and drive its
implementation for specific use-cases.

We stress the subjective nature of this chapter and the opinion text boxes along this book. They
express personal viewpoints to help the reader forge his own viewpoint. They are not meant to be “it
must be this way” or “this is the only right way” type of guidelines. They are meant to clearly
describe how some people think of the technology so that the readers can leverage their viewpoints to
build their own. This is very different from the normal flow of the book where we try to be objective
and describe the technology as it is (“then it is really like this”).
A.2 Why Did We Start SR?
In 1998, I was hired in the European consulting team at Cisco to deploy a new technology called tag-
switching (later renamed MPLS).

This was a fantastic experience: witnessing the entire technology definition process, working closely
with the MPLS architecture team, having first-hand experience designing, deploying the first and
biggest MPLS networks and collecting feedback and requirements from operators.

Over these years, while the elegance of the MPLS data plane has rarely been challenged, it became
obvious that the MPLS “classic” (LDP and RSVP-TE) control-plane was too complex and lacked
scalability.

In 2016, when we write this text, it should be safe to write that LDP is redundant to the IGP and that it
is better to distribute labels bound to IGP signaled prefixes in the IGP itself rather than using an
independent protocol (LDP) to do it.

LDP adds complexity: it requires one more process to configure and manage and it creates
complicated interaction problems with the IGP (LDP-IGP synchronization issue, RFC 5443, RFC
6138).

Clearly, LDP was invented 20 years ago for various reasons that were good at that time. We are not
saying that mistakes were made in the 1990’s. We are saying that, in our opinion1, if an operator were
to deploy a greenfield MPLS network in 2016, considering the issues described above and the
experience learned with RLFA (RFC 7490), they would not think of using LDP and would prefer to
distribute the labels directly in the IGP. This requires straightforward IGP extensions as we will see
in chapter 5, “Segment Routing IGP Control Plane” in Part I of the SR book series.

On the RSVP-TE side, from a bandwidth admission control viewpoint, it seems safe to write that
there has been weak deployment and that the few who deployed have reported complex operation
models and scaling issues. In fact, most of the RSVP-TE deployments have been limited to fast re-
route (FRR) use-case.

Overall, we would estimate that 10% of the SP market and likely 0% of the Enterprise market have
used RSVP-TE and that among these deployments, the vast majority did it for FRR reasons.
Our point is not to criticize the RSVP-TE protocol definition or minimize its merits. Like for LDP,
there were good reasons 20 years ago for RSVP-TE and MPLS TE to have been defined the way they
are. It is also clear that 20 years ago, RSVP-TE and MPLS-TE provided a major innovation to IP
networks. At that time, there was no other bandwidth optimization solution. At that time, there was no
other FRR solution. RSVP-TE and MPLS-TE introduced great benefits 20 years ago.

Our point is to look at its applicability in IP networks in 2016. Does it fit the needs of modern IP
networks?

In our opinion, RSVP-TE and the classic MPLS TE solution have been defined to replicate FR/ATM
in IP. The objective was to create circuits whose state would be signaled hop-by-hop along the circuit
path. Bandwidth would be booked hop-by-hop. Each hop’s state would be updated. The available
bandwidth of each link would be flooded throughout the domain using IGP to enable distributed TE
computation.

We believe that these design goals are no longer consistent with the needs of modern IP networks.

First, RSVP-TE is not ECMP-friendly. This is a fundamental issue as the basic property of modern IP
networks is to offer multiple paths from a source to a destination. This ECMP-nature is fundamental
to spread traffic along multiple paths to add capacity as required and for redundancy reasons.

Second, to accurately book the used bandwidth, RSVP-TE requires all the IP traffic to run within so-
called “RSVP-TE tunnels”. This leads to much complexity and lack of scale in practice.

Let’s illustrate this by analyzing the most frequent status of a network: i.e., a correctly capacity-
planned network.

Such a network has enough capacity to accommodate without congestion a likely traffic volume under
a set of likely independent failures. The traffic is routed according to the IGP shortest-path and
enough capacity is present along these shortest-paths. This is the norm for the vast majority of SP and
Enterprise networks, either all of the time or at least for the vast majority of the time (this is
controlled by the “likeliness” of the traffic volume and the failure scenarios). Tools such as Cisco
WAN Automation Engine (WAE) Planning2 are essential to correctly capacity plan a network.
HIGHLIGHT: Capacity Planning
If one wants to succeed in traffic engineering, one should first learn capacity planning. An analogy could be that if one
wants to be healthy (SLA), one should focus on finding a life style (capacity planning process) which keeps the body and
mind balanced so as to minimize when medicine (tactical traffic engineering) needs to be taken. We would advise studying WAE Planning.2

Clearly, in these conditions, traffic engineering to avoid congestion is not needed. It seems obvious to
write it but as we will see further, this is not the case for an RSVP-TE network.

In the rare cases where the traffic is larger than expected or a non-expected failure occurs, congestion
occurs and a traffic engineering solution may be needed. We write “may” because once again it
depends on the capacity planning process.

Some operators might capacity plan the network via modeling such that these occurrences are so
unlikely that the resulting congestion might be tolerated. This is a very frequent approach.

Some other operators may not tolerate even these rare congestions and then require a tactical traffic-
engineering process.

A tactical traffic-engineering solution is a solution that is used only when needed.

To the contrary, the classic RSVP-TE solution is an “always-on” solution. At any time (even when no
congestion is occurring), all the traffic must be steered along circuits (RSVP-TE tunnels). This is
required to correctly account the used bandwidth at any hop.

This is the reason for the infamous full-mesh of RSVP-TE tunnels. Full-mesh implies that there must
be a tunnel from anywhere to anywhere on the network edge and that all traffic must ride on RSVP-TE
tunnels. IP forwarding spoils accurate traffic statistics.

Hence, traffic can never utilize IGP derived ECMP paths and to hide the lack of ECMP in RSVP-TE,
several tunnels have to be created between each source and destination (at least one per ECMP path).

Hence, while no traffic engineering is required in the most likely situation of an IP network, the
RSVP-TE solution always requires N²×K tunnels where N scales with the number of nodes in the
network and K with the number of ECMP paths. While no traffic engineering is required in the most
likely situation of an IP network, the classical MPLS TE solution always requires all the IP traffic to
not be switched as IP, but as MPLS TE circuits.
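To make the scale concrete, here is a quick back-of-the-envelope sketch; the node and ECMP counts below are purely illustrative assumptions, not measurements from any particular network:

```python
# Illustrative only: order of magnitude of the "always-on" RSVP-TE full mesh.
def rsvp_te_full_mesh_tunnels(n_nodes: int, k_ecmp: int) -> int:
    """Approximate tunnel count for an N^2 x K full mesh (N edge nodes, K tunnels per pair)."""
    return n_nodes * n_nodes * k_ecmp

# Hypothetical example: 100 edge nodes, 4 ECMP paths between each pair of nodes.
print(rsvp_te_full_mesh_tunnels(100, 4))  # 40000 tunnels, each with hop-by-hop state
# With SR, the same demands only require a per-demand segment list at the source.
```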

The consequence of this “full-mesh” is lots of operational complexity and limited scale, most of the
time, without any gain. Indeed, most of the times, all these tunnels follow the IGP shortest-path as the
network is correctly capacity planned and no traffic engineering is required.

This is largely suboptimal. An analogy would be that one needs to wear his raincoat and boots every
day while it rains only a few days a year.

RSVP operational complexity


“More than 15 years ago, when DT’s IP/MPLS multi-service network design was implemented, everyone assumed that
RSVP should be used as a traffic engineering technology in order to optimize the overall network efficiency. While the
routers eventually proved that they can do RSVP, the operational experience was devastating: the effect of ECMP routing
– even at that time – was completely underestimated, and mitigating the effects of parallel links with more and more TE
tunnels made the overlay topology of explicit paths even more complicated.

Eventually we found that IGP metric tuning, which cannot optimize network efficiency as perfectly as explicit paths, still
does a fair job in terms of network efficiency but at a much lower cost of operational complexity.

We continued to use RSVP for tactical cases of traffic-engineering for quite a while. We merged with a network that used
RSVP for the sake of Fast-Reroute. But finally we managed to fulfill all requirements of efficiency, Fast-Reroute and
disjoint paths just with proper IGP metric tuning, LFA based IP-FRR, and maintaining a suitable topology at the transport
layer. Removing RSVP from the network design – although technically working – was considered a great advantage from
all sides.”

— Martin Horneffer

Let’s remember the two origins (in our opinion) of the classical RSVP-TE complexity and scale
problem: the modeling as a reference to ATM/FR circuit; the decision to optimize bandwidth in a
distributed manner instead of a centralized one.

In the early 2000s, Thomas Telkamp was managing the worldwide Global Crossing backbone from Amsterdam, the Netherlands. This was one of the first RSVP-TE deployments and likely the biggest at
that time. I had the chance to work directly with him and learned the three following concepts through
the experience.
1. the always-on RSVP-TE full-mesh model is way too complex because it creates continuous pain
for no gain as, in most of the cases, the network is just fine routing along the IGP shortest-path. A
tactical TE approach is more appealing. Remember the analogy of the raincoat. One should only
need to wear the raincoat when it actually rains.
2. ECMP is key to IP. A traffic engineering approach for IP should natively support ECMP

3. for real networks, routing convergence has more impact on SLA than bandwidth optimization

Let’s illustrate the third learning.

Back in the early 2000s, Global Crossing was operating one of the first VoIP services. During any network failure, connectivity was lost for several tens of seconds waiting for the network (IGP plus full-mesh of RSVP-TE tunnels) to converge.

Considering that the network was correctly capacity planned for most expected failures, and that
independent failures occur very often on a large network, it is easy to understand that the SLA impact
of slow routing convergence is much more serious than the impact due to congestion.

While everyone was focusing at that time on QoS and TE (the rare congestion problem), a very
important problem was left without attention: routing convergence.

Thanks to Thomas and the Global Crossing experience, I then started the “Fast Convergence” project.

In 6 years, we improved IS-IS/OSPF convergence in a reference worldwide network from 9.5 seconds
to under 200msec. This involved considerable optimization in all parts of the routing system from the
IS-IS/OSPF process to the router’s linecard HW FIB organization including the RIB, LDP, LSD (the
MPLS RIB), BCDL (the bus from the route processor to the linecard) and the FIB process. This
involved considerable lab characterization either to monitor our progress towards our “sub-
200msec” target or to spot the next bottleneck.

In parallel to this fast IS-IS/OSPF convergence, we also researched an IP-based automated FRR
for sub-50msec protection.

As we noted earlier, we knew that RSVP-TE deployment was rare (10%) and that most of these
deployments were not motivated by BW optimization but rather by FRR. So, if we found an IP-
optimized FRR that was simpler to operate, we knew that this would attract a lot of operator interest.

We started that research in the 2001 timeframe at the cafeteria of the Brussels Cisco office. This was the
“water flow” discussion. If the course of a river through a valley gets blocked, it suffices to
explicitly steer the water to the top of the correct hill and, from there, let the water flow naturally to
its destination.

The intuition was fundamental to our “IPFRR” research: the shorter the explicit path, the better.

Contrary to RSVP-TE FRR, we never wanted to steer the packet around the failure and back to the other end of the failure. This model is ATM/FR “circuit” centric. We wanted a model that would be IP centric. We wanted to reroute around the failure as fast as possible so as to release the packet as soon as possible to a natural IP path.

Releasing the packet as soon as possible to a natural IP path was the target of our IPFRR project.

Very quickly, we found Loop-Free Alternate (LFA, RFC 6571).

LFA allowed the IGP to pre-compute FRR backup paths for 75 to 90% of the IGP destinations (please
refer to RFC 6571 for LFA applicability analysis and reports of solution coverage across real data
sets). This solution received a lot of interest as it offered a much simpler FRR alternative than RSVP-
TE.

Later on, we extended the IPFRR coverage to 95/99% with the introduction of Remote LFA (RLFA,
RFC 7490).

This was sufficient for most operators, and deployments of RSVP-TE solely for FRR reasons stopped in favor of the much simpler LFA/RLFA alternative.

Still, two problems were remaining: the theoretical lack of a 100% guarantee of coverage and the
possibility that the LFA backup/repair path can be non-optimum (not along the post-convergence
path).

We had done ample research on the matter and we knew that we could not deliver these properties
without explicit routing.
“In early 2000, RSVP-TE was the only solution available to provide fast-reroute. Implementing RSVP-TE for FRR in a
network already running LDP for primary path brings additional complexity in the network design and operations as three
protocols (IGP, LDP, RSVP-TE) will interact between each other: think about the sequence of protocols events when a
single link fails in the network, when something goes wrong, troubleshooting is not so easy.

The availability of IPFRR solutions was a great opportunity for network simplification by leveraging only one primary
protocol (IGP) to provide FRR.

Even if LFA/RLFA did not provide 100% guarantee of protection, the gain in simplicity is a good reason to use them: a
simple network is usually more robust.”

— Stéphane Litkowski

In parallel to the IGP fast convergence and the IPFRR research, we had a third related research topic: the search for a microloop avoidance solution3, i.e., a solution that would prevent packet drops due to inconsistent transient state between converging routers.

We found several theoretical solutions in the course of that research, but never one with enough coverage or
with enough robustness for real-life deployment.

Let’s stop for a moment here and summarize the key points we learned through these years of design,
deployment and research:

LDP was redundant and was creating unneeded complexity

For the bandwidth optimization problem, the RSVP-TE full-mesh was modelled on ATM/FR
“circuits” with a distributed optimization and as a result was too complex and not scalable. A
tactical ECMP-aware IP optimized solution would be better.
For the FRR problem, the RSVP-TE solution was modelled on SONET/SDH “circuit” and as a
result was too suboptimal from a bandwidth and latency viewpoint and was too complex to
operate. An ECMP-aware IP-optimized solution which would release the packet as soon as
possible to its shortest-path was much better. Our IPFRR research led to LFA and RLFA. Most
operators were satisfied with the 90-99% coverage. Still, the 100% coverage guarantee was
missing. We knew that a form of explicit routing would be required to get that property.
MPLS was perceived to be too complex to deploy and manage, and most Enterprise and some SP network operators stayed away from it
microloop avoidance was an unsolved problem

Around spring 2012, one of the research projects I was managing was related to OAM in ECMP
networks.

It was indeed difficult for operators to spot forwarding issues involving a subset of the ECMP paths
to a destination.

As part of that work, we came up with a proposal to allocate MPLS labels for each outgoing interface
of each router and then steer OAM probes over a deterministic path by pushing the appropriate label
stack on the probe. At each hop, the top label would be popped and would indicate the outgoing
interface to forward the packet.

The proposal was not very interesting for several reasons.

First, BFD was already largely deployed on a per link basis and hence issues impacting any traffic on
a link were already detected. What we had to detect was a failure related to a specific FIB destination, and this idea was missing the target.

Second, this idea would require too many labels as one label per link hop was required and the
network diameter could be very large.

However, at that time, I was planning a trip to Rome and the following intuition came to me: when I go to Rome from Brussels, I listen to the radio for traffic events. If I hear that the Gottardo tunnel is blocked, then I mentally switch to the path to Geneva, and from there to the path to Rome.

This intuition initiated the Segment Routing project: in our daily life, we do not plan our journeys turn
by turn; instead we plan them as a very small number of shortest-path hops. Applying the intuition to
real networks, real traffic engineering problems would be solved by one or two
intermediate shortest-paths, one or two segments in SR terminology.

The years of designing and deploying real technology in real networks had taught me that the simplest
ideas prevail and that unneeded sophistication leads to a lot of cost and complexity.

Obviously, I knew that I would want to prove this in a scientific manner by analyzing real-life
networks and applying traffic engineering problems to them. I knew we needed to confirm it
scientifically… but I had seen enough networks that I felt the intuition was correct.

HIGHLIGHT: Few segments would be required to express an explicit path


When we explain a path to a friend, we do not describe it turn by turn but instead as a small number of shortest-path hops
through intermediate cities.

Applying the intuition to real networks: real traffic engineering problems would be solved by one or two intermediate
shortest-paths, one or two segments in SR terminology.

While the theoretical bound scales with the network diameter, few segments would be required in practice.

This was the start of the Segment Routing project.

We would distribute labels in the routing protocol (i.e., the prefix segments) and then we would build
an ECMP-aware IP-optimized explicit path solution based on stacking few of these labels (i.e., the
segment list). Doing so, we would drastically simplify the MPLS control-plane by removing LDP and
RSVP-TE while we would improve its scalability and functionality (tactical bandwidth engineering,
IPFRR with 100% coverage, microloop avoidance, OAM). We will explain these concepts in detail
in this book.

In fact, the intuition of the “path to Rome” also gave the first name for “SR”. “SR” originally meant
“Strade Romane”, which is Italian for the network of roads that were built by the Roman Empire.
By combining any of these roads, the Romans could go from anywhere to anywhere within the
Empire.
Figure A-1: Clarence at the start of the “Appia” Roman road in Rome. The network of Roman roads

Later on, SR came to stand for Segment Routing ☺.


A.3 The SDN and OpenFlow Influences
In 2013, two fundamental papers were published at Sigcomm: SWAN (Software-driven WAN) from
Microsoft4 and B4 from Google5.

While reading these excellent papers, it occurred to me that while I agreed with the ideas, I would not have implemented them with an OpenFlow concept as, in my opinion, it would not scale.

Indeed, I had spent several years improving the routing convergence speed and hence I knew that the
problem was not so much in the control plane and the algorithmic computation of a path but much
more in the transferring of the FIB updates from the routing processor down to the linecards and then
the writing of the updates from the linecard CPU to the hardware forwarding logic.

To give an order of magnitude, while the routing processor would compute and update an entry in
µsec, the rest of the process (distribute to Linecards, update hardware forwarding) was in msec when
we started the fast convergence project. It took many years of work to get that component down to
tens of µsec.

Hence, by intuition I would have bet that the OpenFlow-driven approach would run into severe convergence problems: it would take way too much time for the centralized control-plane to send
updates to the switches and have them install these updates in HW.

Hint to the reader: do some reverse engineering of the numbers published in the OpenFlow papers
and realize how slow these systems were.

Nevertheless, reading these papers was essential to the SR project.

Using my “path to Rome” intuition, I thought that a much better solution would be to combine
centralized optimization with a traffic-engineering solution based on a list of prefix segments.

The need for centralized optimization is clearly described in a paper by Urs Holzle, also at Google.6
It details the drawbacks due to the distributed signaling mechanism at the heart of RSVP-TE: lack of
optimality, lack of predictability and slow convergence.

In a distributed signaling solution, each router behaves like a kid sitting at a table where bandwidth is
represented as a pile of candies at the center of the table. Each router vies for its bandwidth as a kid
for its candy. This uncoordinated fight leads to lack of optimality (each kid is not ensured to get its
preferred taste or what is required for his health condition), lack of predictability (no way to guess
which kid will get what candy) and slow convergence (kids fighting for the same candy, retrying for
others etc.).

HIGHLIGHT: Centralized Optimization

Read the paper by Urs Holzle6 for an analysis of the centralized optimization benefits over RSVP-TE distributed signaling:
optimality, predictability and convergence.

While the need for centralized optimization was clear and I agreed with these papers, the use of
OpenFlow-like technique was, in my opinion, the issue. It would lead to far too many interactions
between the centralized controller and the routers in the network.

The first reason is that the OpenFlow granularity is far too small: a per-flow entry on a per switch
basis. To build a single flow through the network, all the routers along the path need to be
programmed. Multiply this by tens of thousands or millions and this is clearly too much state.

The second reason is that every time the network changes, the controller needs to re-compute the flow
placement and potentially update a large part of the flow entries throughout the network.

Instead, the intuition was to let the centralized controller use segments as bricks of the LEGO®
construction toy. The network would offer basic bricks as ECMP-aware shortest path to destinations
(Prefix Segments) and the controller would combine these LEGO® bricks to build any explicit path it
would want (but with as little per-flow state as possible, in fact only one entry, at the source).

To route a demand from San Francisco (SFO) to New York City (NYC) via Denver (DEN), rather
than programming all the routers along the path with the appropriate per-flow entry, the controller
would only need to program one single state at the source of the demand (SFO). The state would
consist of a source-routed path expressed as an ordered list of segments {DEN, NYC}. This is what
we refer to as SR Traffic Engineering (SR-TE).
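To make this concrete, here is a minimal sketch of how such a one-state-at-the-source policy could be represented; the node names, SID label values, and data structures are illustrative assumptions, not an actual controller API:

```python
# Minimal sketch: a controller expressing an SR-TE path as a single segment list.
# Prefix-SID label values are hypothetical (assuming a common SRGB starting at 16000).
PREFIX_SID = {"SFO": 16001, "DEN": 16002, "NYC": 16003}

def build_segment_list(via_nodes):
    """Translate an ordered list of nodes into the corresponding prefix-SID labels."""
    return [PREFIX_SID[node] for node in via_nodes]

# One single state, programmed only at the source (SFO): steer the demand along the
# ECMP-aware shortest path to DEN, then along the ECMP-aware shortest path to NYC.
sr_policy = {
    "headend": "SFO",
    "endpoint": "NYC",
    "segment_list": build_segment_list(["DEN", "NYC"]),  # [16002, 16003]
}
print(sr_policy)
```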
“Segment Routing represents a major, evolutionary step forward for the design, control and operation of modern, large-
scale, data-center or WAN networks. It provides for an unprecedented level of control over traffic without the concomitant
state required by existing MPLS control plane technologies. That it can accomplish this by leveraging MPLS data plane
technologies means that it can be introduced into existing networks without major disruption. This accords it a significant
ease-of-deployability advantage over other SDN technologies. It is an exciting time to be a network provider because the
future of Segment Routing holds so much potential for transforming networks and the services provided by them.”

— Steven Lin

SR-TE is a hybrid centralized/distributed cooperation between the controller and the network. The
network maintains the multi-hop ECMP-aware segments and the centralized controller combines them
to form a source-routed path through the network. State is removed from the network. State is only
present at the ingress to the network and then in the packet header itself.

{DEN, NYC} is called a segment list. DEN is the first segment. NYC is the last segment. Steering a
demand through {DEN, NYC} means steering the demand along the ECMP-aware shortest-path to
Denver and then along the ECMP-aware shortest-path to NYC.

Note the similarity with the initial intuition for SR: if the shortest-path to Rome is jammed, then the
alternate path is {Geneva, Rome}. The human mind does not express a path as a turn-by-turn journey;
instead it expresses it as a minimum number of subsequent ECMP-aware shortest-paths.

In the early days of SR, we used a baggage tag analogy to explain SR to a non-technical audience, see Figure A-2. Imagine one needs to send a piece of baggage from Seattle to Berlin (TXL) with transit in Mexico
City (MEX) and Madrid (MAD). Clearly, the transportation system does not associate a single ID to
the baggage (flow ID) and then create circuit state in Mexico and Madrid to recognize the baggage ID
and route accordingly. Instead, the transportation system scales better by appending a tag “Mexico
then Madrid then Berlin” to the baggage at the source of the journey. This way, the transportation
system is unaware of each individual baggage’s specifics along the journey. It only knows about a few
thousands of airport codes (Prefix Segments) and routes the baggage from airport to airport according
to the baggage tag. Segment Routing does exactly the same: a Prefix Segment acts as an airport code;
the Segment List in the packet header acts as a tag on the baggage; the source knows the specifics of
the packet and encodes it as a Segment List in the packet header; the rest of the network is unaware of
that specific flow and only knows about a few hundreds or thousands of Prefix Segments.
Figure A-2: Baggage tag analogy to explain SR to non-technical audience

HIGHLIGHT: Combining segments to form an explicit path.


Hybrid coupling of centralized optimization with distributed intelligence

The network would offer basic LEGO® bricks as ECMP-aware shortest path to destinations (Prefix Segments). The
distributed intelligence would ensure these segments are always available (IGP convergence, IP FRR).

The controller would acquire the entire topology state (with the segments supported by the network) and would use
centralized optimization to express traffic-engineering policies as a combination of segments. It would consider segments as
LEGO® bricks. It would combine them in the way it wants, to express the policy it wants.

The controller would scale better thanks to the source-routing concept: it would only need to maintain state at the source,
not throughout the infrastructure.

The network would keep the right level of distributed intelligence (IS-IS and OSPF distributing the
so-called Prefix and Node Segments) and the controller would express any path it desires as a list of
segments. The controller’s programming job would be drastically reduced because state would only
need to be programmed at the source and not all along the flow path. The controller scaling would
also be much better thanks to the reliance on the IGP intelligence. Indeed, in real life, the underlying
distributed intelligence can be trusted to adapt correctly to the topology changes reducing the need for
the controller to react, re-compute and update the segment lists. For example, the inherent IGP support
for an FRR solution ensures that the connectivity to prefix segments is preserved and hence the
connectivity along a list of segments is preserved. Upon failure, the controller can leverage this IGP
FRR help to dampen its reaction and hence scale better.

This is why we say that Segment Routing is a hybrid centralized/distributed optimization architecture.

We marry distributed intelligence (shortest-path, FRR, microloop avoidance) with centralized optimization for a policy objective like latency, disjoint-ness, avoidance and bandwidth.

In SR, we would tackle the bandwidth engineering problem with a centralized controller in a tactical
manner: the controller would monitor the network and the requirements of applications and would
push SR Policies only when required. This would give us better optimality and predictability. We
would scale the controller by expressing traffic engineering policies with list of segments and hence
would only need per-demand state at the source.

Our intuition was that this solution would also converge faster.

Why? Because the comparison has to be made with an N²×K full-mesh of RSVP-TE tunnels that vie for bandwidth every time the topology changes. It was well-known that operators had seen convergence times as slow as 15 minutes. In comparison, our approach would only have to compute
SR Policies in the rare cases when congestion occurs. Few states would need to be programmed
thanks to the source-routed nature. By combining sufficient centralized compute performance with our
scaled source-routing paradigm, we would likely reduce the convergence time.
Figure A-3: John Leddy highlighting the importance to keep the network stateless and hence the role of SR for SDN done right

On SDN and the role of Segment Routing


“SR is SDN done right!”

— John Leddy

The provisioning of such tactical SR Policies can also be done directly on the edge router in the
network as a first transition, without the need for a controller. However, the benefits of bringing in the
centralized intelligence are obvious as described above, and enable the transition to an SDN
architecture. We will see in this book how basic Segment Routing addresses the requirements for
which operators used to turn to RSVP-TE and in the next book we will introduce real traffic
engineering use-cases at a scale that was challenging if not impossible until now.

Aside from this influence, SDN proved fundamental to SR; it simply allowed it to exist: the applicability
of SR to SDN provided us with the necessary tailwind to push through our proposal against the
dominance of the classic MPLS control-plane.
Over the last three years, seeing the numerous designs and deployments, we can say that SR has
become the de-facto network architecture for SDN. Watch for example the SR analysis for SWAN.7
A.4 100% Coverage for IPFRR and Optimum Repair Path
Now that we had a source-routed explicit solution, it was straightforward to close our IPFRR
research.

LFA and RLFA had many advantages but only gave us 99% coverage and were not always selecting
the optimum path.

Thanks to SR, we could easily solve both problems.

The point of local repair (PLR) would pre-compute the post-convergence path to a destination,
assuming its primary link, node or SRLG fails. The PLR would express this explicit path as source-
routed list of segments. The PLR would install this backup/repair path on a per destination basis in
the data plane. The PLR would detect the failure in sub-10msec and would trigger the backup/repair
paths in a prefix-independent manner.

This solution would give us 100% coverage in any topology. We would call it “Topology-
Independent LFA” (TI-LFA). It would also ensure optimality of the backup path (post-convergence
path on a per-destination basis).

We will explain all the details of TI-LFA in this book.
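As a rough conceptual sketch of the computation described above (a toy model using the networkx library; the topology, costs, and function names are assumptions for illustration, and real TI-LFA implementations additionally compress the repair path into a short P/Q segment list):

```python
# Conceptual sketch of the TI-LFA idea: pre-compute, per destination, the
# post-convergence path that avoids the protected resource, and use it as the
# backup path installed at the Point of Local Repair (PLR).
import networkx as nx

def tilfa_backup_path(graph, plr, dest, protected_link):
    """Post-convergence node path from the PLR to dest, assuming protected_link fails."""
    g = graph.copy()
    g.remove_edge(*protected_link)                 # simulate the link failure
    return nx.shortest_path(g, plr, dest, weight="weight")

# Hypothetical 4-node topology, all link metrics equal to 10.
g = nx.Graph()
g.add_edge("PLR", "B", weight=10)
g.add_edge("PLR", "C", weight=10)
g.add_edge("C", "B", weight=10)
g.add_edge("B", "D", weight=10)
print(tilfa_backup_path(g, "PLR", "D", ("PLR", "B")))  # ['PLR', 'C', 'B', 'D']
```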

HIGHLIGHT: TI-LFA benefits


Sub-50msec link, node and SRLG protection
100% coverage.
Simple to operate and understand
Automatically computed by the IGP, no other protocol required
No state created outside the protecting state at the PLR
Optimum: the backup path follows the post-convergence path
Incremental deployment
Also applies to IP and LDP traffic
A.5 Other Benefits
The following problems in IP/MPLS networks also motivated our SR research:

IP had no native explicit routing capability.


It was clear that with explicit routing capability, IPv6 would have a central role to play in
future infrastructure and service network programming.
SRv6 is playing a key role here. We will detail this in Part III of the book.
RSVP-TE did not scale inter-domain for basic services such as latency and disjoint-ness.
In a previous section, we analyzed the drawbacks of the RSVP-TE solution for the bandwidth
optimization problem. RSVP-TE has another issue: for latency or disjoint-ness services, it did
not scale across domains. The very nature of modern IP networks is to be multi-domain (e.g.,
data-center (DC), metro, core).
SR is now recognized as a scalable end-to-end policy-aware IP architecture that spans DC,
metro and backbone. This is an important concept which we will detail in this book (the
scalability and base design) but as well in Part II (the TE specific aspects).
MPLS did not make any inroads in the Data Center (DC)
We felt that while the MPLS data plane had a role to play in the DC, this could not be realized due to the inherent complexities of the classic MPLS control plane.
SR MPLS is now a reality in the DC. We will document this in the BGP Prefix-SID section and
the DC use-case.
OAM was fairly limited
Operators were reporting real difficulty in detecting forwarding issues in IP networks, especially when the issue is associated with a specific FIB entry (destination based), potentially for a
specific combination of ingress port at the faulty node and ECMP hash.
SR has delivered solutions here which will be detailed in the OAM section of this book.
Traffic Matrices were important but were actually unavailable to most operators.
Traffic matrices are the key input to the capacity planning analysis that is one of the most
important processes for an operator. Most operators did not have any traffic matrices. They
were basically too complex to be built with classic IP/MPLS architecture.
SR has delivered an automated traffic matrix collection solution.
A.6 Team
David Ward, SVP and Chief Architect for Cisco Engineering, was essential to realize the opportunity
with SR and approved the project in September 2012.

We have created the momentum by keeping focus on the execution and delivering an impressive list of
functionality and use-cases: IS-IS and OSPF support for SR, BGP support for SR, the SR/LDP
seamless interworking, the TI-LFA 100% FRR solution, microloop avoidance, distributed SR-TE,
centralized SR-TE, the egress peer traffic engineering use-case etc.

Ahmed Bashandy was the first person I recruited for the SR project. I had worked with him during the
Fast Convergence project and I knew him well. He had deep knowledge across the entire IOS XR
system (he had coded in IS-IS, BGP and FIB). This was essential for bringing up our first SR
implementation as we had to touch all the parts of the system. Ahmed is fun to work with. This is
essential. Finally, Ahmed is not slowed down by doubts. Many engineers would have spent hours
asking why we would do it, what was the difference against MPLS classic, whether we had to first
have IETF consensus, whether we had any chance to get something done in the end… all sorts of
existential questions that are the best way to never do anything. This was certainly not going to be an
issue with Ahmed and this was essential. We had no time to lose. Ahmed joined in December 2012
and we had a first public demo by March 2013.

Bertrand Duvivier was the second person to join the team. I needed someone to manage the
relationship with marketing and engineering once we became public. I had known Bertrand since 1998. He
is excellent at understanding a technology, the business benefits and managing the engineering and
marketing relationships. Bertrand joined the team in February 2013. We come from the same region
(Namur, Belgium), we speak the same language (French, with same accent and dialect), share the
same culture hence it is very easy to understand each other.

Kris Michielsen was the third person to join. Again, I knew him from the Fast Convergence project.
Kris had done all the Fast Convergence characterization. He is excellent at managing lab
characterization, thoroughly analyzing the origins of bottlenecks, creating training and transferring
information to the field and operators. We had known each other since 1996.

Stefano Previdi was the fourth person to join. I had known him since I started at Cisco in June 1996. We
had done all the MPLS deployments together and then IPFRR. Stefano would focus on the IETF and
we will talk more about this later.

Aside from the technical expertise, we were all in the same time-zone, we had known each other for a long
we could understand each other well and we had fun working together. These are essential
ingredients.

Later on, Siva Sivabalan and Ketan Talaulikar would join us. Siva would lead the SR-TE project and
Ketan would contribute to the IGP deployment.

Operators have played a fundamental role in the SR project.

In October 2012, at the annual Cisco Network Architecture Group (NAG) conference (we invite key
operator architects for 2 days and have open discussion on any relevant subject of networking,
without any marketing compromise), I presented the first session on SR, describing the improved
simplicity, the improved functionality (FRR, SR-TE), the improved scale (flow state only at the
source) and the opportunity for SDN (hybrid centralized and distributed optimization). This session
generated a lot of interest.

We immediately created a lead operator group and started defining SR.

John Leddy from Comcast, Stéphane Litkowski and Bruno Decraene from Orange, Daniel Voyer from
Bell Canada, Martin Horneffer from DT, Rob Shakir then at BT, Josh George then at Google, Ebben
Aries then at Facebook, Dmitry Afanasiev and Daniel Ginsburg at Yandex, Tim Laberge and Steven
Lin then at Microsoft, Mohan Nanduri and Paul Mattes from Microsoft were among the most active in
the group.

Later on, as the project evolved, many other engineers and operators joined the effort and we will
have the opportunity to introduce their viewpoints throughout this book.

Again, most of us were in the same time-zone; we knew each other from previous projects. It was
easy to understand each other and we had the same focus. This generated a lot of excellent
discussions that shaped most of the technology and use-cases illustrated in this book.

Within a few months of defining this technology, we received a formal letter of intent to deploy from
an operator. This was a fantastic proof for the business interest and this helped us fund our first phase
of the project.

Ravi Chandra proved essential to execute our project beyond its first phase. Ravi was leading the
IOS XR, IOS XE and NX-OS software at Cisco. He very quickly understood the SR opportunity and
funded it as a portfolio-wide program. We could then really tackle all the markets interested in SR
(hyper-scale WEB operators, SP and Enterprise) and all the network segments (DC,
metro/aggregation, edge, backbone).

“Segment Routing is a core innovation that I believe will change how we do networking like some of the earlier core
technologies like MPLS. Having the ability to Source Routing at scale will be an invaluable tool in many different use cases
in the future.”

— Ravi Chandra
SVP Cisco Systems Core Software Group
A.7 Keeping Things Simple
More than anything, we believe in keeping things simple.

This means that we do not like technology for the sake of technology. If we can avoid technology, we
avoid technology. If we can state a reasonable design guideline (that is correct and fair for real
networks) such that the problem becomes much simpler and we need less technology to solve it, then
we are not shy of making the assumption that such a guideline will be met.

With SR, we try to package complex technology and algorithms in a manner that is simple to use and
operate. We favor automated behaviors. We try to avoid parameters, options and tuning. TI-LFA,
microloop avoidance, centralized SR-TE optimization (Sigcomm 2015 paper8), and distributed SR-
TE optimization are such examples.

Simplicity
“The old rule to make things "as simple as possible, but not simpler" is always good advice for network design. In other
words it says that the parameter complexity should be minimized.

However, whenever you look closer at a specific problem, you will usually find that complexity is a rather multi-dimensional
parameter. Whenever you minimize one dimension, you will increase another. For example, in order to provide sets of
disjoint paths in an IP backbone, you can either ask operations to use complicated technology and operational procedures
to satisfy this requirement. Or you can ask planning to provide a suitable topology with strict requirements on the transport
layer.

The art of good network design is, in my eyes, based on a broad overview of possible dimensions of complexity, good
knowledge of the cost related to each dimension, and wise choice of the trade-offs that come with specific design options.

I am convinced that segment routing offers a lot of very interesting network design options that help to reduce overall
network cost and increase its robustness.”

— Martin Horneffer

The point of keeping things simple is to choose well where to put intelligence to solve a real
problem, how to package the intelligence to simplify its operation and where to simplify the problem
by an appropriate design approach.
A.8 Standardization and Multi-Vendor Consensus
When we created the lead-operator group, we pledged three characteristics of our joint work: (1)
transparency, (2) committed to standards, (3) committed to multi-vendor consensus.

The transparency meant that we would define the technology together; we would update the lead
operator group with progress, issues and challenges.

Our commitment to standardization meant that we would release all the necessary information to the
IETF and would ensure that our implementation would be entirely in line with the released information.
We clearly understood that this was essential to our operators and hence it was essential to us.

Clearly, getting our ideas through the IETF was a fantastic challenge. We were challenging 20 years of the classic MPLS control-plane and were promoting concepts (global label) that were seen as a non-
starter by some prominent part of the IETF community.

Our strategy was the following:

1. be positive. We knew that we would be attacked in cynical ways. The team was instructed to
never reply and never state anything negative or emotional
2. lead by the implementation. We had to take the risk of implementing what we were proposing. We
had to demonstrate the benefits of the use-cases.
3. get the operators to voice their requirements. The operators who wanted to have SR knew that
they would need to state their demands very clearly. The implementation and the demonstration
of the use-cases would strengthen their argumentation
4. get other vendors to contribute. Alcatel-Lucent (Wim Henderickx) and Ericsson (Jeff Tantsura)
quickly understood the SR benefits and joined the project. More or less a year after our initial
implementation, we (i.e., Cisco) could demonstrate interoperability with Alcatel and Ericsson
and it was then clear that the multi-vendor consensus would be obtained. Shortly after, Huawei
also joined the effort.
5. shield the team from this standardization activity while handling the process with the highest
quality
Stefano joined the team to handle this last point. He would lead and coordinate all our IETF SR
involvement on behalf of the team. The team would contribute by writing text but the team would
rarely enter in list discussion and argumentation. Stefano would handle it. This was essential to get
the highest quality of the argumentation, the consistency but as well to shield the team from the
emotional implication of the negotiation. While we were handling the most difficult steps of the IETF
process and negotiation, the rest of the engineers were focused on coding and releasing functionality
to our lead operators. This was key to preserve our implementation velocity, which was a key pillar
of the strategy.

Standardization and Multi-Vendor Focus for SR team


“Operators manage multi-vendor networks. Standardization and interoperability is thus essential, especially for routing
related subjects.

It would have been easier and faster for the SR team to work alone and create a proprietary technology as big vendors
sometimes do.

Yet, Clarence understood the point, committed to work with all vendors and did make it through the whole industry.”

— Bruno Decraene
A.9 Global Label
Clearly, SR was conceived with global labels in mind.

Since global labels had been discussed in the IETF in the early days of MPLS, we knew that some
members of the community would be strongly against this concept. Very early, we came up with the
indexing solution in the SRGB9 range. We thought it was a great idea because on one hand, operators
could still use the technology with global labels (simply by ensuring that all the routers use the same
SRGB range), while on the other hand, the technology was theoretically not defining global labels
(but rather global indexes) and hence would more easily pass the IETF process.
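A tiny illustration of the indexing mechanics described above (the SRGB base and index values are hypothetical; 16000 is a common default base on Cisco platforms, but any range can be configured):

```python
# Illustrative only: how a globally advertised index maps to a locally programmed label.
def prefix_sid_label(srgb_base: int, index: int) -> int:
    """Label a node programs for a prefix segment advertised with this index."""
    return srgb_base + index

# With the same SRGB everywhere (e.g., base 16000), the index behaves like a global label:
print(prefix_sid_label(16000, 65))  # 16065 on every node -> the "global label" feel
# With a different SRGB on one node (e.g., base 20000), the label differs locally:
print(prefix_sid_label(20000, 65))  # 20065 on that node only
```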

We presented this idea to the lead operator group in February 2013 and they rejected it. Their main concern was that they wanted global labels, they wanted the ease of operation of global labels, and hence they did not want this indexing operation.

We noted their preference and when we released the proposal publicly in March 2013, our first IETF
draft and our first implementation were using global labels without index.

As expected, global labels created lots of emotions… To get our multi-vendor consensus, we invited
several vendors to Rome for a two-day discussion (see Figure A-4). We always meet in Rome when
an important discussion must occur. This is a habit that we took during the IPFRR project.
Figure A-4: Note the bar “Dolce Roma” (“Sweet Rome”) behind the team ☺ . From left to right, Stefano Previdi, Peter Psenak,
Wim Henderickx, Clarence Filsfils and Hannes Gredler

During the first day of the meeting, we could resolve all issues but one: Juniper and Hannes were
demanding that we re-introduce the index in the proposal. Hannes’ main concern was that in some
contexts with legacy hardware-constrained systems, it may not be possible to come up with a single
contiguous SRGB block across vendors.

At the end of the first day, we then organized a call with all the lead operators, explained the status
and advised to re-introduce the index. Multi-vendor consensus and standardization was essential and
operators could still get the global label behavior by ensuring that the SRGB be the same on all
nodes.
Operators were not happy as they really wanted the simplicity of the global labels but in the end
agreed.

On the second day, we updated the architecture, IS-IS, OSPF and use-case drafts to reflect the
changes and from then on, we had our multi-vendor consensus.

This story is extremely important as it shows that SR (for MPLS) was really designed and desired
with global labels and that operators should always design and deploy SR with the same SRGB
across all their routers. This is the simplest way to use SR and the best way to collect all its benefits.

Sure, diverse SRGB is supported by the implementation and the technology… but really understand
that it is supported mostly for the rare interoperability reasons10 in case no common SRGB range
could be found in deployments involving legacy hardware-constrained systems. One should really use SR as per the original idea: with global labels.

During the call with the operators, as they were very reluctant to accept the index, I committed that the Cisco implementation would allow both the global-label form and the index form, in the configuration as well as in the show commands. Indeed, by preserving the look and feel of the global label in the configuration and in the show commands, and by designing with the same SRGB across all the routers, an operator would have what they want: global labels. They would never have to deal with indexes.

This section is a good example of the use of “opinion” textboxes. Without this historical explanation,
one could wonder why we have two ways to configure prefix segments in MPLS. This is the reason.

The use of the SRGB, the index, and global and local labels is described in detail in this book.
A.10 SR MPLS
It is straightforward to apply the SR architecture to the MPLS data plane.

A segment is a label. A list of segments is a stack of labels. The active segment is the top of the stack.
When a segment is completed, the related label is popped. When an SR policy is applied on a packet,
the related label stack is pushed on the packet.
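As a minimal mental model of these data-plane rules (a toy sketch, not router code; the label values are hypothetical):

```python
# Toy model of the SR-MPLS forwarding semantics described above.
def apply_sr_policy(packet_labels, segment_list):
    """Applying an SR policy = pushing the related label stack onto the packet."""
    return list(segment_list) + list(packet_labels)

def complete_segment(packet_labels):
    """Completing the active segment = popping the top label of the stack."""
    return packet_labels[0], packet_labels[1:]

stack = apply_sr_policy([], [16002, 16003])  # push the {DEN, NYC} label stack
active, stack = complete_segment(stack)      # DEN segment completed -> pop 16002
print(active, stack)                         # 16002 [16003]
```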

The SR architecture reuses the MPLS data plane without any change. Existing infrastructure only
requires a software upgrade to enable the SR control-plane.

Operators were complaining about the lack of scale, functionality and the inherent complexity of the
classic MPLS control plane. The MPLS data plane was mature and very well deployed. For these
reasons, our first priority has been to apply SR to the MPLS data plane. This was the main focus of
our first three years of the SR project and this is also the focus of this book.

It is important to remember that SR MPLS applies to both IPv4 and IPv6.

As part of the initial focus on SR MPLS, it was essential to devise a solution to seamlessly deploy
SR in existing MPLS networks. The SR team and the lead operator group dedicated much of the initial
effort to this problem. A great solution was found to allow SR and LDP brownfield networks to
seamlessly co-exist or, better, inter-work.

This SR/LDP seamless solution is described in detail in this book.


A.11 SRv6
The SR architecture was designed from day one for application to the IPv6 data plane. This is
referred to as “SRv6”.

All the research we did on automated TI-LFA FRR, microloop avoidance, distributed traffic
engineering, centralized traffic engineering etc. is directly applicable to SRv6.

We believe that SRv6 plays a fundamental role in the value of IPv6 and will significantly influence all future IP infrastructure deployments, whether in the DC, the large-scale aggregation, or the backbone.

SRv6’s influence will expand beyond the infrastructure layer. An IPv6 address can identify any
object, any content or any function applied to an object or a piece of content. SRv6 could offer
formidable opportunities for chaining micro-services in a distributed architecture or for content
networking.

We would recommend reading the paper by John Schanz from Comcast11, watching the session by John
Leddy on Comcast’s SRv6-based Smarter Network concept12 and watching the demonstration of the
“Spray” end-to-end SRv6 use-case13 to understand the potential of SRv6.
Figure A-5: John Leddy presenting on the “Smarter Network” concept and highlighting the SRv6 role

John Leddy played a key role in highlighting the SRv6 potential for drastically enhancing the
application interaction with the network. We will focus on the SRv6 technology and use-cases in the
third book. This first book introduces the Segment Routing architecture by first addressing the simpler
use-cases related to the MPLS data plane and the MPLS infrastructure from DC to aggregation to
backbone.

In this part of the book, we will only provide a small introduction to SRv6. We will dedicate much
more content to it in a following book.
A.12 Industry Benefits
We have seen SR being applied to the hyper-scale WEB, SP and Enterprise markets. We have seen
use-cases in the DC, in the metro/aggregation and in the WAN. We have seen (many) use-cases for
end-to-end policy-aware architecture from the server in the DC through the metro and the backbone.

We believe that the SR benefits are the following:

simplification of the control-plane (LDP and RSVP-TE removed, LDP/IGP synchronization removed)
topology-independent IP-optimal 50msec FRR (TI-LFA)

microloop avoidance
support for tactical traffic engineering (explicit path encoded as a segment list)
centralized optimization benefits (optimality, predictability, convergence)
scalability (per-flow state is only at the source, not throughout the infrastructure)
seamless deployment in existing networks (applies equally for SR-MPLS and SRv6)
de-facto architecture for SDN
standardized
multi-vendor consensus
strong requirement from operators

defined closely with operators for real use-cases


solving unsolved problems (TI-LFA, microloop, inter-domain disjointness/latency policies…)
cost optimization (through improved capacity planning with tactical traffic engineering)
an IPv6 segment can identify any object, any content, or any function applied to an object. This
will likely extend SR’s impact well beyond the infrastructure use-cases.

Most of these benefits will be described in this book. The TE and SRv6 benefits will be detailed in
the next book.
A.13 References
[draft-francois-rtgwg-segment-routing-uloop] Francois, P., Filsfils, C., Bashandy, A., Litkowski,
S. , “Loop avoidance using Segment Routing”, draft-francois-rtgwg-segment-routing-uloop (work
in progress), June 2016, https://datatracker.ietf.org/doc/draft-francois-rtgwg-segment-routing-
uloop
[RFC5443] Jork, M., Atlas, A., and L. Fang, “LDP IGP Synchronization”, RFC 5443, DOI
10.17487/RFC5443, March 2009, https://datatracker.ietf.org/doc/rfc5443.
[RFC6138] Kini, S., Ed., and W. Lu, Ed., “LDP IGP Synchronization for Broadcast Networks”,
RFC 6138, DOI 10.17487/RFC6138, February 2011, https://datatracker.ietf.org/doc/rfc6138.

[RFC6571] Filsfils, C., Ed., Francois, P., Ed., Shand, M., Decraene, B., Uttaro, J., Leymann, N.,
and M. Horneffer, “Loop-Free Alternate (LFA) Applicability in Service Provider (SP)
Networks”, RFC 6571, DOI 10.17487/RFC6571, June 2012,
https://datatracker.ietf.org/doc/rfc6571.
[RFC7490] Bryant, S., Filsfils, C., Previdi, S., Shand, M., and N. So, “Remote Loop-Free
Alternate (LFA) Fast Reroute (FRR)”, RFC 7490, DOI 10.17487/RFC7490, April 2015,
https://datatracker.ietf.org/doc/rfc7490.

1. Note that this entire chapter should be considered subjective.↩

2. Cisco WAN Automation Engine (WAE), http://www.cisco.com/c/en/us/products/routers/wae-planning/index.html and http://www.cisco.com/c/en/us/support/routers/wae-planning/model.html.↩

3. draft-francois-rtgwg-segment-routing-uloop, https://datatracker.ietf.org/doc/draft-francois-rtgwg-
segment-routing-uloop.↩

4. http://conferences.sigcomm.org/sigcomm/2013/papers/sigcomm/p15.pdf.↩
5. http://conferences.sigcomm.org/sigcomm/2013/papers/sigcomm/p3.pdf.↩

6. http://www.opennetsummit.org/archives/apr12/hoelzle-tue-openflow.pdf.↩

7. Paul Mattes, “Traffic Engineering in a Large Network with Segment Routing“, Tech Field Day,
https://www.youtube.com/watch?v=CDtoPGCZu3Y.↩

8. http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p15.pdf.↩

9. Segment Routing Global Block, see chapter 4, “Management of the SRGB” in Part I of the SR
book series.↩

10. In extreme/rare cases, this could help to migrate some hardware constrained legacy systems, but
one should be very careful to limit this usage to a transient migration. To the best of our
knowledge, all Cisco platforms support the homogenous SRGB design guideline at the base of
SR MPLS.↩

11. John D. Schanz, “How IPv6 lays the foundation for a smarter network”,
http://www.networkworld.com/article/3088322/internet/how-ipv6-lays-the-foundation-for-a-
smarter-network.html.↩

12. John Leddy, “Comcast and The Smarter Network with John Leddy”, Tech Field Day,
https://www.youtube.com/watch?v=GQkVpfgjiJ0.↩

13. Jose Liste, “Cisco Segment Routing Multicast Use Case Demo with Jose Liste”, Tech Field Day,
https://www.youtube.com/watch?v=W-q4T-vN0Q4.↩
B. Confirming the Intuition of SR Book Part I
This short chapter leverages an excellent public presentation from Alex Bogdanov at the MPLS WC
2019 in Paris, describing his traffic engineering experience operating one of the largest networks in
the world: B2, the Google backbone that carries all user-facing traffic. The video of the presentation
is available on our segment-routing.net website.
B.1 Raincoat and Boots on a Sunny Day
In chapter 1 of Part I we wrote:

“The consequence of this “full-mesh” is lots of operational complexity and limited scale,
most of the time, without any gain. Indeed, most of the times, all these tunnels follow the IGP
shortest-path as the network is correctly capacity planned and no traffic engineering is
required. This is largely suboptimal. An analogy would be that one needs to wear his raincoat
and boots every day while it rains only a few days a year.”

This is confirmed by the data analysis presented in Figure B‑1, reporting that more than 98% of
RSVP-TE LSPs follow the shortest-path.

Figure B-1: >98% of HIPRI LSPs remain on the shortest path 1

This is highly complex and inefficient. Just using a segment would be much simpler (automated from
the IGP), scalable (stateless), and efficient (a prefix segment is ECMP-aware, an RSVP-TE LSP is
not).
B.2 ECMP-Awareness and Diversity
In chapter 1 of Part I we wrote:

“First, RSVP-TE is not ECMP-friendly. This is a fundamental issue as the basic property of
modern IP networks is to offer multiple paths from a source to a destination. This ECMP-
nature is fundamental to spread traffic along multiple paths to add capacity as required and
for redundancy reasons.”

And we also wrote:

“Hence, traffic can never utilize IGP derived ECMP paths and to hide the lack of ECMP in
RSVP-TE, several tunnels have to be created between each source and destination (at least
one per ECMP path).

Hence, while no traffic engineering is required in the most likely situation of an IP network,
the RSVP-TE solution always requires N²×K tunnels where N scales with the number of nodes
in the network and K with the number of ECMP paths.”

This is confirmed by the data analysis presented in Figure B‑2 that reports the weak path diversity of
the RSVP-TE LSP solution.
Figure B-2: LSP pathing does not offer enough diversity 1

The slide in Figure B‑2 reminds us that this could be theoretically improved (as we noted in 2016) by
increasing K (the number of LSPs between a headend and a tailend) but this is obviously difficult in
practice as this amplifies the RSVP-TE-LSP scaling issue (K×N²).

1. ©2019 Google LLC. Presented at MPLS WC Paris in April 2019. Google and the Google logo
are registered trademarks of Google LLC.↩
