SR-MPLS on Cisco XRd: PCE, Flex-Algo, and TI-LFA in a Live Lab

Running a full SR-MPLS topology on Cisco IOS-XR 25.3.1 — including a live PCE computing paths dynamically via PCEP, Flex-Algo using measured delay as its routing metric, and TI-LFA delivering sub-second failover. All verified with real command output and a live failure/recovery scenario.

IOS-XR 25.3.1 SR-TE Flex-Algo PCE/PCEP TI-LFA BGP-LS Cisco XRd

The Setup

Cisco's DevNet XRd sandbox provides a pre-built SR-MPLS topology running containerized IOS-XR (XRd) routers. The sandbox gives you a working starting point — I focused on understanding the existing configuration in depth, running verification scenarios, and automating state collection against it using NETCONF.

The topology is 7 XRd routers plus a PCE and a virtual route reflector, running IS-IS as the IGP with SR-MPLS extensions. What makes it interesting is that it's not just basic SR — it includes Flex-Algo (delay-based routing), a live PCE doing dynamic path computation via PCEP, and on-demand SR-TE policies triggered by BGP color communities.

            xrd-7 (PCE)
            /         \
     xrd-3 ----------- xrd-4
    /    |             |    \
src -- xrd-1           xrd-2 -- dst
    \    |             |    /
     xrd-5 ----------- xrd-6
            \         /
            xrd-8 (vRR)

Four parallel paths exist between xrd-1 (ingress PE) and xrd-2 (egress PE) through the core. The topology uses anycast SIDs to provide ECMP across the parallel paths — label 17001 resolves to either xrd-3 or xrd-5, and label 17101 resolves to either xrd-4 or xrd-6.

Flex-Algo: Constraint-Based Topology Separation

The most interesting feature in this topology is Flex-Algo. Standard IS-IS metrics are administrative costs — they don't reflect actual fiber latency. Flex-Algo 128 is defined with metric-type delay, which means it runs a separate SPF using link delay as its metric rather than IS-IS cost. In a production environment with Performance Measurement running, those delay values would be dynamically measured and flooded as IS-IS delay TLVs — the SPF would then route around degraded fiber automatically.

Two algorithms are defined. FA-128 is configured for delay-metric routing, constrained to links tagged "blue" (intended low-latency) and excluding "red" links. FA-129 does the opposite: it uses the standard IGP cost, prefers red links, and avoids blue ones, providing intentional path diversity for traffic that should take a different physical route.

IOS-XR — Flex-Algo configuration
router isis 1
  affinity-map red  bit-position 128
  affinity-map blue bit-position 129
  !
  flex-algo 128
    metric-type delay          ! Use measured delay, not IS-IS cost
    advertise-definition
    affinity exclude-any red   ! Never use high-latency links
    affinity include-all blue  ! Only use verified low-latency links
  !
  flex-algo 129
    metric-type igp
    advertise-definition
    affinity exclude-any blue  ! Avoid low-lat links (path diversity)
    affinity include-any red

Each router advertises three prefix-SIDs on its Loopback0 — one per algorithm. The label value is always SRGB_base + index. With SRGB base 16000, xrd-1's three loopback SIDs are: 16101 (base algo 0), 16201 (FA-128 delay), 16301 (FA-129 affinity). The same three-tier structure applies to the anycast SIDs on Loopback1 — the shared loopback between xrd-3/xrd-5 and xrd-4/xrd-6. This is the key design insight: one Loopback1 per anycast pair, three SIDs on it, each anchoring a different forwarding topology:
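The label arithmetic behind all of these values can be sketched in a few lines — a toy illustration, using the index values from this lab's config:

```python
# SR-MPLS prefix-SID label arithmetic: label = SRGB base + SID index.
SRGB_BASE = 16000  # this lab's explicit global-block starts at 16000

def sid_label(index, base=SRGB_BASE):
    """Absolute MPLS label for a given prefix-SID index."""
    return base + index

# xrd-1 Loopback0 -- one SID per algorithm:
print(sid_label(101))   # algo 0  (base IGP SPF)  -> 16101
print(sid_label(201))   # FA-128  (delay)         -> 16201
print(sid_label(301))   # FA-129  (affinity)      -> 16301

# Anycast SIDs on the shared Loopback1 of xrd-3/xrd-5:
print(sid_label(1001))  # algo 0 -> 17001
print(sid_label(1002))  # FA-128 -> 17002
print(sid_label(1003))  # FA-129 -> 17003
```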

xrd-3 and xrd-5 — three anycast SIDs on shared Loopback1 (101.103.105.255)
interface Loopback1
  ipv4 address 101.103.105.255 255.255.255.255   ! shared anycast IP
  !
  prefix-sid index 1001 n-flag-clear             ! base algo  → label 17001
  prefix-sid algorithm 128 index 1002 n-flag-clear  ! FA-128 delay → label 17002
  prefix-sid algorithm 129 index 1003 n-flag-clear  ! FA-129 affinity → label 17003

The n-flag-clear is what makes these anycast — it clears the Node flag in the IS-IS prefix-SID advertisement, telling other routers this SID is shared by multiple nodes rather than being unique to one. Without it, two routers advertising the same SID index would be a conflict. With it, the SPF resolves the label to whichever node is topologically closest, providing automatic ECMP and resilience.

Production deployment vs. this lab

In a production low-latency network (HFT, mobile xhaul), Performance Measurement would be running on every link — TWAMP-Light probes measuring actual one-way delay, with IOS-XR flooding updated delay sub-TLVs into IS-IS whenever a link's measured latency changes. FA-128 would then automatically route around degraded fiber without any operator intervention.

In this sandbox, PM is not configured (PM process is not running), so no delay TLVs are being flooded — confirmed by show isis database verbose | include delay returning empty. FA-128 is correctly defined and its per-algorithm SIDs (17002/17102) are in the label table, but every link effectively has delay = 0, making the FA-128 SPF produce the same result as the base SPF. The configuration and SID structure are production-accurate; only the measurement input is missing.
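What would close that gap is a Performance Measurement block along these lines — a hedged sketch, since PM syntax and defaults vary across IOS-XR releases and the interface name here is illustrative:

```
performance-measurement
  interface GigabitEthernet0/0/0/0
    delay-measurement              ! enable delay probing on this link
  !
  delay-profile interfaces default
    probe
      measurement-mode one-way     ! one-way delay measurement
      protocol twamp-light         ! TWAMP-Light probes
```

With probes running, the measured delay is flooded as the IS-IS delay sub-TLVs this section looked for, and the FA-128 SPF starts consuming real numbers.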

PCE/PCEP: Dynamic Path Computation

xrd-7 is the PCE. It receives the full IS-IS topology via BGP-LS (the distribute link-state command in IS-IS exports the topology database into BGP, which xrd-7 receives). When xrd-1 needs a path computed, it sends a PCReq to xrd-7 via PCEP, and xrd-7 responds with a label stack.
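On the router side, the topology feed and the PCEP session are only a few lines of config. A sketch — the AS number 65000 is an assumption, since the post doesn't show the sandbox's BGP config:

```
router isis 1
  distribute link-state                 ! export the IS-IS LSDB into BGP-LS
!
router bgp 65000                        ! AS number assumed
  address-family link-state link-state
  !
  neighbor 100.100.100.107              ! xrd-7, the PCE
    remote-as 65000
    update-source Loopback0
    address-family link-state link-state
!
segment-routing
  traffic-eng
    pcc
      pce address ipv4 100.100.100.107  ! matches the PCEP peer verified later
```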

The on-demand color mechanism is what triggers this. When xrd-2 advertises a VPN route into BGP VPNv4 with color community 100, xrd-1 sees the color, matches it against its on-demand color 100 config, and automatically requests a path from the PCE. No static SR-TE policy config required on the head-end.
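Sketched end to end, the pieces look like this — the set and policy names are illustrative; only the color value 100 comes from the lab:

```
! --- xrd-2 (egress PE): color the VPN routes on export ---
extcommunity-set opaque COLOR-100
  100
end-set
!
route-policy SET-COLOR-100
  set extcommunity color COLOR-100
  pass
end-policy

! --- xrd-1 (ingress PE): ODN template for that color ---
segment-routing
  traffic-eng
    on-demand color 100
      dynamic
        pcep
        !
        metric
          type igp
```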

PCEP session verification on xrd-1
RP/0/RP0/CPU0:xrd-1# show segment-routing traffic-eng pcc ipv4 peer

PCC's peer database:
Peer address: 100.100.100.107
  Precedence: 255, (best PCE)
  State: up
  Capabilities: Stateful, Update, Segment-Routing, Instantiation

Once the PCEP session is up and BGP converges, the PCE-computed policy appears with the full label stack:

SR-TE policy — PCE computed label stack
Color: 100, End-point: 100.100.100.102
  Status: Admin: up  Operational: up
  Candidate-paths:
    Preference: 100 (BGP ODN) (active)
      Dynamic (pce 100.100.100.107) (valid)
        Metric Type: IGP, Path Accumulated Metric: 3
          SID[0]: 17001  [Prefix-SID, 101.103.105.255]  ← anycast xrd-3 or xrd-5
          SID[1]: 17101  [Prefix-SID, 101.104.106.255]  ← anycast xrd-4 or xrd-6
          SID[2]: 16102  [Prefix-SID, 100.100.100.102]  ← xrd-2 (destination PE)
  Binding SID: 24006

The PCE chose anycast SIDs for the middle hops — label 17001 resolves to whichever of xrd-3/xrd-5 is closer, providing automatic ECMP. The core doesn't need to know the specific path; it just executes the top label instruction at each hop.

Notice the metric type in the output: IGP, with an accumulated metric of 3. The on-demand color 100 policy on xrd-1 explicitly requests metric type igp from the PCE, so the PCE runs a standard IGP-cost SPF and returns base-algorithm anycast SIDs — 17001 and 17101. This is algorithm 0, the base topology.

If the policy instead requested metric type latency, the PCE would run the FA-128 delay-metric SPF and return the FA-128 anycast SIDs — 17002 and 17102 — steering traffic onto the delay-constrained topology. The PCE picks which SID tier to use based entirely on what metric the head-end requests. This is the elegance of the three-tier anycast design: one label stack change at the head-end shifts the entire traffic flow to a different forwarding topology without touching any P router config.
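Steering a second traffic class onto the delay topology would then be one more ODN block on the head-end. A sketch — color 200 is a hypothetical value, not part of the sandbox:

```
segment-routing
  traffic-eng
    on-demand color 200
      dynamic
        pcep
        !
        metric
          type latency   ! PCE now returns the FA-128 SIDs: 17002 / 17102
```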

Note on the lab setup

The DevNet sandbox uses Docker 24.0.5, which has a known bug where containers can't attach to a macvlan network and multiple bridge networks simultaneously. I worked around this by starting each container with only the mgmt network attached, then connecting the data plane networks one at a time post-start. If you're trying to replicate this and hitting "Container cannot be connected to network endpoints" errors — that's why.
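Roughly, the workaround looks like this — the image reference and network names are placeholders, not the sandbox's actual names:

```
# start each XRd container with only the mgmt network attached
docker run -d --name xrd-1 --network mgmt <xrd-control-plane-image>

# then attach the data plane networks one at a time, post-start
docker network connect xrd1-xrd3 xrd-1
docker network connect xrd1-xrd5 xrd-1
```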

TI-LFA: Sub-Second Failover

TI-LFA pre-computes backup paths for every protected prefix and installs them in the FIB before any failure occurs. When a link goes down, the FIB swap is instantaneous — no reconvergence wait, no RSVP signaling, no path computation under pressure.
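On IOS-XR, enabling this is per interface under IS-IS — a sketch with an illustrative interface name:

```
router isis 1
  interface GigabitEthernet0/0/0/0
    address-family ipv4 unicast
      fast-reroute per-prefix
      fast-reroute per-prefix ti-lfa
```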

Before the failure test, label 17001 shows two equal-cost ECMP paths — one via xrd-3, one via xrd-5 — and each is pre-programmed as the other's backup:

LFIB before failure — ECMP with mutual TI-LFA backups
17001  Pop  Gi0/0/0/0  100.101.103.103   ← xrd-3  (IDX:0, BKUP-IDX:1)
       Pop  Gi0/0/0/1  100.101.105.105   ← xrd-5  (IDX:1, BKUP-IDX:0)
Updated: 19:35:17

Shutting Gi0/0/0/0 (the link to xrd-3) and immediately checking:

LFIB immediately after failure — TI-LFA switchover
17001  Pop  Gi0/0/0/1  100.101.105.105   ← xrd-5 only
Updated: 19:52:47.575

! commit was at 19:52:47 — sub-second switchover
! TI-LFA backup was already programmed — zero reconvergence wait

At the same time, the PCE detects the topology change via BGP-LS and starts reoptimizing the SR-TE policy — but it does this while traffic continues flowing on the existing LSP:

SR-TE policy during PCE reoptimization
Status: (active) (reoptimizing)

LSPs:
  LSP[0]: LSP-ID 2  (active)       ← still carrying traffic
  LSP[1]: LSP-ID 3  (reoptimized)  ← PCE computed new path, installing

! Binding SID: 24006  ← unchanged throughout entire event

After restoring the interface, IS-IS reconverged and ECMP was restored in about 5 seconds. The Binding SID (24006) never changed through any of this — any system steering traffic by pushing that label was completely unaffected.

Key Takeaways

  • The anycast SID design is elegant. Using 17001 for the {xrd-3, xrd-5} pair means the PCE doesn't need to pin to a specific node — it gets ECMP and resilience for free from the anycast topology.
  • Flex-Algo and SR-TE solve different problems. Flex-Algo gives you a constraint-based forwarding plane that every router participates in automatically. SR-TE gives you explicit end-to-end path control with PCE optimization. In a production low-latency network you'd use both — FA for the default latency-optimized paths, SR-TE for specific traffic classes needing precise path control.
  • TI-LFA coverage is 100% in this topology. The anycast design means there's always an alternate path available. Classic LFA would only protect ~50-80% of destinations depending on topology.
  • The PCE reoptimization is hitless. LSP-ID 2 kept forwarding while LSP-ID 3 was being installed. The transition between them was invisible to any traffic flowing through the tunnel.
  • IOS-XR SRGB is explicit. global-block 16000 18000 must be configured. Arista cEOS auto-assigns base 900000. This is the first thing to check when comparing label values between platforms.

What's Next

The next post will cover the NETCONF automation side — specifically, how I found the correct YANG paths for SR-TE policies, IS-IS SR labels, and BGP neighbor state on IOS-XR 25.3.1. The short version: the model names in the capabilities advertisement don't always map cleanly to what you'd expect, and a few of them return null until you find the right container path.

I'm also planning to deliberately break some config — shut the PCE, change Flex-Algo affinities mid-convergence, modify the SRGB — to see how the network behaves at the edges. That's where the interesting learning happens.