rfc9696.original   rfc9696.txt 
RIFT WG Y. Wei, Ed. Internet Engineering Task Force (IETF) Y. Wei, Ed.
Internet-Draft Z. Zhang Request for Comments: 9696 Z. Zhang
Intended status: Informational ZTE Corporation Category: Informational ZTE Corporation
Expires: 19 December 2024 D. Afanasiev ISSN: 2070-1721 D. Afanasiev
Yandex Yandex
P. Thubert P. Thubert
Cisco Systems Individual
T. Przygienda T. Przygienda
Juniper Networks Juniper Networks
17 June 2024 December 2024
RIFT Applicability and Operational Considerations Routing in Fat Trees (RIFT) Applicability and Operational Considerations
draft-ietf-rift-applicability-17
Abstract Abstract
This document discusses the properties, applicability and operational This document discusses the properties, applicability, and
considerations of RIFT in different network scenarios. It intends to operational considerations of Routing in Fat Trees (RIFT) in
provide a rough guide how RIFT can be deployed to simplify routing different network scenarios with the intention of providing a rough
operations in Clos topologies and their variations. guide on how RIFT can be deployed to simplify routing operations in
Clos topologies and their variations.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This document is not an Internet Standards Track specification; it is
provisions of BCP 78 and BCP 79. published for informational purposes.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months This document is a product of the Internet Engineering Task Force
and may be updated, replaced, or obsoleted by other documents at any (IETF). It represents the consensus of the IETF community. It has
time. It is inappropriate to use Internet-Drafts as reference received public review and has been approved for publication by the
material or to cite them other than as "work in progress." Internet Engineering Steering Group (IESG). Not all documents
approved by the IESG are candidates for any level of Internet
Standard; see Section 2 of RFC 7841.
This Internet-Draft will expire on 19 December 2024. Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc9696.
Copyright Notice Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/ Provisions Relating to IETF Documents
license-info) in effect on the date of publication of this document. (https://trustee.ietf.org/license-info) in effect on the date of
Please review these documents carefully, as they describe your rights publication of this document. Please review these documents
and restrictions with respect to this document. Code Components carefully, as they describe your rights and restrictions with respect
extracted from this document must include Revised BSD License text as to this document. Code Components extracted from this document must
described in Section 4.e of the Trust Legal Provisions and are include Revised BSD License text as described in Section 4.e of the
provided without warranty as described in the Revised BSD License. Trust Legal Provisions and are provided without warranty as described
in the Revised BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Terminology
3. Problem Statement of Routing in Modern IP Fabric Fat Tree 3. Problem Statement of Routing in Modern IP Fabric Fat Tree
Networks . . . . . . . . . . . . . . . . . . . . . . . . 4 Networks
4. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 5 4. Applicability of RIFT to Clos IP Fabrics
4.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 5 4.1. Overview of RIFT
4.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 8 4.2. Applicable Topologies
4.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 8 4.2.1. Horizontal Links
4.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 8 4.2.2. Vertical Shortcuts
4.2.3. Generalizing to any Directed Acyclic Graph . . . . . 9 4.2.3. Generalizing to Any Directed Acyclic Graph
4.2.4. Reachability of Internal Nodes in the Fabric . . . . 10 4.2.4. Reachability of Internal Nodes in the Fabric
4.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 10 4.3. Use Cases
4.3.1. Data Center Topologies . . . . . . . . . . . . . . . 10 4.3.1. Data Center Topologies
4.3.2. Metro Networks . . . . . . . . . . . . . . . . . . . 11 4.3.2. Metro Networks
4.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 12 4.3.3. Building Cabling
4.3.4. Internal Router Switching Fabrics . . . . . . . . . . 12 4.3.4. Internal Router Switching Fabrics
4.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 12 4.3.5. CloudCO
5. Operational Considerations . . . . . . . . . . . . . . . . . 14 5. Operational Considerations
5.1. South Reflection . . . . . . . . . . . . . . . . . . . . 15 5.1. South Reflection
5.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 15 5.2. Suboptimal Routing on Link Failures
5.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 17 5.3. Black-Holing on Link Failures
5.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 18 5.4. Zero Touch Provisioning (ZTP)
5.5. Miscabling . . . . . . . . . . . . . . . . . . . . . . . 19 5.5. Miscabling
5.5.1. Miscabling Examples . . . . . . . . . . . . . . . . . 19 5.5.1. Miscabling Examples
5.5.2. Miscabling considerations . . . . . . . . . . . . . . 21 5.5.2. Miscabling Considerations
5.6. Multicast and Broadcast Implementations . . . . . . . . . 22 5.6. Multicast and Broadcast Implementations
5.7. Positive vs. Negative Disaggregation . . . . . . . . . . 23 5.7. Positive vs. Negative Disaggregation
5.8. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 24 5.8. Mobile Edge and Anycast
5.9. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 26 5.9. IPv4 over IPv6
5.10. In-Band Reachability of Nodes . . . . . . . . . . . . . . 27 5.10. In-Band Reachability of Nodes
5.11. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 28 5.11. Dual-Homing Servers
5.12. Fabric with A Controller . . . . . . . . . . . . . . . . 28 5.12. Fabric with a Controller
5.12.1. Controller Attached to ToFs . . . . . . . . . . . . 29 5.12.1. Controller Attached to ToFs
5.12.2. Controller Attached to Leaf . . . . . . . . . . . . 29 5.12.2. Controller Attached to Leaf
5.13. Internet Connectivity Within Underlay . . . . . . . . . . 29 5.13. Internet Connectivity Within Underlay
5.13.1. Internet Default on the Leaf . . . . . . . . . . . . 30 5.13.1. Internet Default on the Leaf
5.13.2. Internet Default on the ToFs . . . . . . . . . . . . 30 5.13.2. Internet Default on the ToFs
5.14. Subnet Mismatch and Address Families . . . . . . . . . . 30 5.14. Subnet Mismatch and Address Families
5.15. Anycast Considerations . . . . . . . . . . . . . . . . . 30 5.15. Anycast Considerations
5.16. IoT Applicability . . . . . . . . . . . . . . . . . . . . 31 5.16. IoT Applicability
5.17. Key Management . . . . . . . . . . . . . . . . . . . . . 32 5.17. Key Management
5.18. TTL/HopLimit of 1 vs. 255 on LIEs/TIEs . . . . . . . . . 33 5.18. TTL/Hop Limit of 1 vs. 255 on LIEs/TIEs
6. Security Considerations . . . . . . . . . . . . . . . . . . . 33 6. Security Considerations
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 33 7. IANA Considerations
8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 33 8. References
9. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 33 8.1. Normative References
10. Normative References . . . . . . . . . . . . . . . . . . . . 34 8.2. Informative References
11. Informative References . . . . . . . . . . . . . . . . . . . 35 Acknowledgments
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 36 Contributors
Authors' Addresses
1. Introduction 1. Introduction
This document discusses the properties and applicability of "Routing This document discusses the properties and applicability of "RIFT:
in Fat Trees" [RIFT] in different deployment scenarios and highlights Routing in Fat Trees" [RFC9692] in different deployment scenarios and
the operational simplicity of the technology compared to traditional highlights the operational simplicity of the technology compared to
routing solutions. It also documents special considerations when classical routing solutions. It also documents special
RIFT is used with or without overlays and/or controllers, and how considerations when RIFT is used with or without overlays and/or
RIFT identifies miscablings and reroutes around node and link controllers and how RIFT identifies miscablings and reroutes around
failures. node and link failures.
2. Terminology 2. Terminology
This document uses the terminology of RIFT [RIFT]. The most This document uses the terminology defined in [RFC9692]. The most
frequently used terminologies defined in RIFT are listed here. These frequently used terms and their definitions from that document are
terms are consistent with definition in RIFT [RIFT] listed here.
Clos/Fat Tree: Clos / Fat Tree:
This document uses the terms Clos and Fat Tree interchangeably This document uses the terms "Clos" and "Fat Tree" interchangeably
where it always refers to a folded spine-and-leaf topology with where it always refers to a folded spine-and-leaf topology with
possibly multiple Points of Delivery (PoDs) and one or multiple possibly multiple Points of Delivery (PoDs) and one or multiple
Top of Fabric (ToF) planes. Several modifications such as leaf- Top of Fabric (ToF) planes. Several modifications such as leaf-
2-leaf shortcuts and multiple level shortcuts are possible and 2-leaf shortcuts and multiple level shortcuts are possible and
described further in the document. described further in the document.
Crossbar: Crossbar:
Physical arrangement of ports in a switching matrix without Physical arrangement of ports in a switching matrix without
implying any further scheduling or buffering disciplines. implying any further scheduling or buffering disciplines.
Directed Acyclic Graph (DAG): Directed Acyclic Graph (DAG):
A finite directed graph with no directed cycles (loops). If links A finite directed graph with no directed cycles (loops). If links
in a Clos are considered as either being all directed towards the in a Clos are considered as either being all directed towards the
top or vice versa, each of such two graphs is a DAG. top or bottom, each of such two graphs is a DAG.
Disaggregation: Disaggregation:
Process in which a node decides to advertise more specific The process in which a node decides to advertise more specific
prefixes Southwards, either positively to attract the prefixes southwards, either positively to attract the
corresponding traffic, or negatively to repel it. Disaggregation corresponding traffic or negatively to repel it. Disaggregation
is performed to prevent traffic loss and suboptimal routing to the is performed to prevent traffic loss and suboptimal routing to the
more specific prefixes. more specific prefixes.
Leaf: Leaf:
A node without southbound adjacencies. Level 0 implies a leaf in A node without southbound adjacencies. Level 0 implies a leaf in
RIFT but a leaf does not have to be level 0. RIFT, but a leaf does not have to be level 0.
LIE: LIE:
This is an acronym for a "Link Information Element" exchanged on This is an acronym for "Link Information Element" exchanged on all
all the system's links running RIFT to form _ThreeWay_ adjacencies the system's links running RIFT to form _ThreeWay_ adjacencies and
and carry information used to perform RIFT Zero Touch Provisioning carry information used to perform RIFT Zero Touch Provisioning
(ZTP) of levels. (ZTP) of levels.
South Reflection: South Reflection:
Often abbreviated just as "reflection", it defines a mechanism Often abbreviated just as "reflection", South Reflection defines a
where South Node TIEs are "reflected" from the level south back up mechanism where South Node TIEs are "reflected" from the level
north to allow nodes in the same level without E-W links to be south back up north to allow nodes in the same level without East-
aware of each other's node Topology Information Elements (TIEs). West links to be aware of each other's node Topology Information
Elements (TIEs).
Spine: Spine:
Any nodes north of leaves and south of ToF nodes. Multiple layers Any nodes north of leaves and south of ToF nodes. Multiple layers
of spines in a PoD are possible. of spines in a PoD are possible.
TIE: TIE:
This is an acronym for a "Topology Information Element". TIEs are This is an acronym for "Topology Information Element". TIEs are
exchanged between RIFT nodes to describe parts of a network such exchanged between RIFT nodes to describe parts of a network such
as links and address prefixes. A TIE has always a direction and a as links and address prefixes. A TIE always has a direction and a
type. North TIEs (sometimes abbreviated as N-TIEs) are used when type. North TIEs (sometimes abbreviated as N-TIEs) are used when
dealing with TIEs in the northbound representation and South-TIEs dealing with TIEs in the northbound representation, and South-TIEs
(sometimes abbreviated as S-TIEs) for the southbound equivalent. (sometimes abbreviated as S-TIEs) are used for the southbound
TIEs have different types such as node and prefix TIEs. equivalent. TIEs have different types, such as node and prefix
TIEs.
3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks 3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks
Clos [CLOS] topologies (called commonly a fat tree/network in modern Clos [CLOS] topologies (commonly called a Fat Tree/network in modern
IP fabric considerations as homonym to the original definition of the IP fabric considerations as a similar term for the original
term Fat Tree [FATTREE]) have gained prominence in today's definition of the term Fat Tree [FATTREE]) have gained prominence in
networking, primarily as a result of the paradigm shift towards a today's networking, primarily as a result of the paradigm shift
centralized data-center based architecture that deliver a majority of towards a centralized data-center-based architecture that delivers a
computation and storage services. majority of computation and storage services.
Current routing protocols were geared towards a network with an Current routing protocols were geared towards a network with an
irregular topology with isotropic properties, and low degree of irregular topology with isotropic properties and a low degree of
connectivity. When applied to Fat Tree topologies: connectivity. When applied to Fat Tree topologies:
* They tend to need extensive configuration or provisioning during * They tend to need extensive configuration or provisioning during
initialization and adding or removing nodes from the fabric. initialization and adding or removing nodes from the fabric.
* For link state routing protocols, all nodes including spine and * For link-state routing protocols, all nodes including spine-and-
leaf nodes learn the entire network topology and routing leaf nodes learn the entire network topology and routing
information, which is in fact, not needed on the leaf nodes during information, which is actually not needed on the leaf nodes during
normal operation. They flood significant amounts of duplicate normal operation. They flood significant amounts of duplicate
link state information between spine and leaf nodes during link-state information between spine-and-leaf nodes during
topology updates and convergence events, requiring that additional topology updates and convergence events, requiring that additional
CPU and link bandwidth be consumed. This may impact the stability CPU and link bandwidth be consumed. This may impact the stability
and scalability of the fabric, make the fabric less reactive to and scalability of the fabric, make the fabric less reactive to
failures, and prevent the use of cheaper hardware at the lower failures, and prevent the use of cheaper hardware at the lower
levels (i.e. spine and leaf nodes). levels (i.e., spine-and-leaf nodes).
4. Applicability of RIFT to Clos IP Fabrics 4. Applicability of RIFT to Clos IP Fabrics
Further content of this document assumes that the reader is familiar Further content of this document assumes that the reader is familiar
with the terms and concepts used in OSPF (Open Shortest Path First) with the terms and concepts used in the Open Shortest Path First
[RFC2328], OSPF for IPv6 [RFC5340] and IS-IS (Intermediate System to (OSPF) [RFC2328], OSPF for IPv6 [RFC5340], and Intermediate System to
Intermediate System) [ISO10589-Second-Edition] link-state protocols. Intermediate System (IS-IS) [ISO10589-Second-Edition] link-state
The sections of RIFT [RIFT] outline the requirements of routing in IP protocols. [RFC9692] outlines the requirements of routing in IP
fabrics and RIFT protocol concepts. fabrics and RIFT protocol concepts.
4.1. Overview of RIFT 4.1. Overview of RIFT
RIFT is a dynamic routing protocol that is tailored for use in Clos, RIFT is a dynamic routing protocol that is tailored for use in Clos,
Fat-Tree, and other anisotropic topologies. A core property Fat Tree, and other anisotropic topologies. Therefore, a core
therefore of RIFT is that its operation is sensitive to the structure property of RIFT is that its operation is sensitive to the structure
of the fabric - it is anisotropic. RIFT acts as a link-state of the fabric -- it is anisotropic. RIFT acts as a link-state
protocol when "pointing north", advertising southwards routes to protocol when "pointing north", advertising southward routes to
northwards peers (parents) through flooding and database northward peers (parents) through flooding and database
synchronization. When "pointing south", RIFT operates hop-by-hop synchronization. When "pointing south", RIFT operates hop-by-hop
like a distance- vector protocol, typically advertising a fabric like a distance-vector protocol, typically advertising a fabric
default route towards the Top of Fabric (ToF, aka superspine) to default route towards the ToF, aka superspine, to southward peers
southwards peers (children). (children).
The fabric default is typically the default route, as described in The fabric default is typically the default route as described in
Section 6.3.8 "Southbound Default Route Origination" of RIFT [RIFT]. Section 6.3.8 ("Southbound Default Route Origination") of [RFC9692].
The ToF nodes may alternatively originate more specific prefixes (P') The ToF nodes may alternatively originate more specific prefixes (P')
southbound instead of the default route. In such a scenario, all southbound instead of the default route. In such a scenario, all
addresses carried within the RIFT domain must be contained within P', addresses carried within the RIFT domain must be contained within P',
and it is possible for a leaf that acts as gateway to the Internet to and it is possible for a leaf that acts as gateway to the Internet to
advertise the default route instead. advertise the default route instead.
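The containment requirement on P' can be checked mechanically. The following is a minimal sketch, not part of the RIFT specification: the function name `uncovered_prefixes` and the sample prefixes are assumptions made for illustration only. It reports any prefix carried in the fabric that does not fall within P'.

```python
import ipaddress

def uncovered_prefixes(fabric_prefixes, p_prime):
    """Return the fabric prefixes that are NOT contained in P' (hypothetical helper)."""
    covering = ipaddress.ip_network(p_prime)
    missing = []
    for text in fabric_prefixes:
        net = ipaddress.ip_network(text)
        # subnet_of() requires matching IP versions, so compare versions first.
        if net.version != covering.version or not net.subnet_of(covering):
            missing.append(text)
    return missing

# Illustrative values only; they do not come from the document.
prefixes = ["10.1.0.0/24", "10.1.7.0/24", "192.0.2.0/24"]
print(uncovered_prefixes(prefixes, "10.1.0.0/16"))
```

An empty result means every address in the domain is covered by P'. Any prefix listed would be unreachable via the prefix the ToF nodes originate southbound.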
RIFT floods flat link-state information northbound only so that each RIFT floods flat link-state information northbound only so that each
level obtains the full topology of levels south of it. That level obtains the full topology of the levels that are south of it.
information is never flooded east-west or back south again. So a top That information is never flooded East-West or back south again, so a
tier node has full set of prefixes from the Shortest Path First (SPF) top tier node has a full set of prefixes from the Shortest Path First
calculation. (SPF) calculation.
In the southbound direction, the protocol operates like a "fully In the southbound direction, the protocol operates like a "fully
summarizing, unidirectional" path-vector protocol or rather a summarizing, unidirectional" path-vector protocol or, rather, a
distance-vector with implicit split horizon. Routing information, distance-vector with implicit split horizon. Routing information,
normally just the default route, propagates one hop south and is "re- normally just the default route, propagates one hop south and is "re-
advertised" by nodes at next lower level. advertised" by nodes at next lower level.
+---------------+ +----------------+ +---------------+ +----------------+
| ToF | | ToF | LEVEL 2 | ToF | | ToF | LEVEL 2
+ ++------+--+--+-+ ++-+--+----+-----+ + ++------+--+--+-+ ++-+--+----+-----+
| | | | | | | | | ^ | | | | | | | | | ^
+ | | | +-------------------------+ | + | | | +-------------------------+ |
Distance | +-------------------+ | | | | | Distance- | +-------------------+ | | | | |
Vector | | | | | | | | + Vector | | | | | | | | +
South | | | | +--------+ | | | Link-State South | | | | +--------+ | | | Link-State
+ | | | | | | | | Flooding + | | | | | | | | Flooding
| | | +----------------+ | | | North | | | +----------------+ | | | North
v | | | | | | | | + v | | | | | | | | +
++---+-+ +------+ +-+----+ ++----++ | ++---+-+ +------+ +-+----+ ++----++ |
|SPINE | |SPINE | | SPINE| | SPINE| | LEVEL 1 |SPINE | |SPINE | | SPINE| | SPINE| | LEVEL 1
+ ++----++ ++---+-+ +-+--+-+ ++----++ | + ++----++ ++---+-+ +-+--+-+ ++----++ |
+ | | | | | | | | | ^ N + | | | | | | | | | ^ N
Distance | +-------+ | | +--------+ | | | E Distance- | +-------+ | | +--------+ | | | E
Vector | | | | | | | | | +------> Vector | | | | | | | | | +------>
South | +-------+ | | | +------+ | | | | South | +-------+ | | | +------+ | | | |
+ | | | | | | | | | + + | | | | | | | | | +
v ++--++ +-+-++ ++--++ ++--++ + v ++--++ +-+-++ ++--++ ++--++ +
|LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0 |LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0
+----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
Figure 1: RIFT overview Figure 1: RIFT Overview
A spine node has only information necessary for its level, which is A spine node only has information necessary for its level, which is
all destinations south of the node based on SPF calculation, default all destinations south of the node based on SPF calculation, the
route, and potentially disaggregated routes. default route, and potentially disaggregated routes.
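The effect of northbound flooding combined with a southbound default route can be illustrated with a toy model. This is a rough sketch under stated assumptions: the topology, the prefixes, and the names `PARENTS`, `southern_prefixes`, and `routing_table` are all hypothetical, and disaggregated routes are deliberately left out.

```python
# Hypothetical fabric: each node maps to its northbound neighbors (parents).
PARENTS = {
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
    "spine1": ["tof1"],
    "spine2": ["tof1"],
    "tof1": [],
}
# Hypothetical prefixes attached to the leaves.
LEAF_PREFIXES = {"leaf1": ["10.0.1.0/24"], "leaf2": ["10.0.2.0/24"]}
DEFAULT = "0.0.0.0/0"

def southern_prefixes(node):
    """Own prefixes plus everything flooded north by nodes south of `node`."""
    learned = set(LEAF_PREFIXES.get(node, []))
    for child, parents in PARENTS.items():
        if node in parents:
            learned |= southern_prefixes(child)
    return learned

def routing_table(node):
    """Southern prefixes, plus the default route if the node has a parent."""
    table = southern_prefixes(node)
    if PARENTS[node]:
        table.add(DEFAULT)
    return table

for name in ("leaf1", "spine1", "tof1"):
    print(name, sorted(routing_table(name)))
```

In this model, a leaf holds only its own prefix and the default route, a spine holds the prefixes south of it and the default route, and the ToF node holds every prefix with no default route.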
RIFT combines the advantage of both link-state and distance-vector: RIFT combines the advantages of both link-state and distance-vector
protocols:
* Fastest possible convergence * Fastest possible convergence
* Automatic detection of topology * Automatic detection of topology
* Minimal routes/information on Top-of-Rack (ToR) switches, aka leaf * Minimal routes/information on Top-of-Rack (ToR) switches, aka leaf
nodes nodes
* High degree of ECMP * High degree of ECMP
* Fast de-commissioning of nodes * Fast decommissioning of nodes
* Maximum propagation speed with flexible prefixes in an update * Maximum propagation speed with flexible prefixes in an update
So there are two types of link-state database which are "north There are two types of link-state databases that are "north
representation" North Topology Information Elements (N-TIEs) and representation" North Topology Information Elements (N-TIEs) and
"south representation" South Topology Information Elements (S-TIEs). "south representation" South Topology Information Elements (S-TIEs).
The N-TIEs contain a link-state topology description of lower levels The N-TIEs contain a link-state topology description of lower levels,
and S-TIEs carry simply default and disaggregated routes for the and the S-TIEs simply carry default and disaggregated routes for the
lower levels. lower levels.
RIFT also eliminates major disadvantages of link-state and distance- RIFT also eliminates major disadvantages of link-state and distance-
vector with: vector protocols with the following:
* Reduced and balanced flooding * Reduced and balanced flooding
* Level constrained automatic neighbor discovery * Level-constrained automatic neighbor discovery
To achieve this, RIFT builds on the art of IGPs, not only OSPF and To achieve this, RIFT builds on the art of IGPs, such as OSPF, IS-IS,
IS-IS but also MANET and IoT (Internet of Things), to provide unique Mobile Ad Hoc Network (MANET), and Internet of Things (IoT) to
features: provide unique features:
* Automatic (positive or negative) route disaggregation of * Automatic (positive or negative) route disaggregation of northward
northwards routes upon fallen leaves routes upon fallen leaves
* Recursive operation in the case of negative route disaggregation * Recursive operation in the case of negative route disaggregation
* Anisotropic routing that extends a principle seen in RPL [RFC6550] * Anisotropic routing that extends a principle seen in the Routing
to wide superspines Protocol for Low-Power and Lossy Networks (RPL) [RFC6550] to wide
superspines
* Optimal flooding reduction that derives from the concept of a * Optimal flooding reduction that derives from the concept of a
"multipoint relay" (MPR) found in OLSR [RFC3626] and balances the "multipoint relay" (MPR) found in Optimized Link State Routing
flooding load over northbound links and nodes. (OLSR) [RFC3626] and balances the flooding load over northbound
links and nodes
Additional advantages that are unique to RIFT are listed below, the Additional advantages that are unique to RIFT are listed below. The
details of which can be found in RIFT [RIFT]. details of these advantages can be found in RIFT [RFC9692].
* True ZTP (Zero Touch Provisioning) * True ZTP
* Minimal blast radius on failures * Minimal blast radius on failures
* Can utilize all paths through fabric without looping * Can utilize all paths through fabric without looping
* Simple leaf implementation that can scale down to servers * Simple leaf implementation that can scale down to servers
* Key-Value store * Key-value store
* Horizontal links used for protection only * Horizontal links used for protection only
4.2. Applicable Topologies 4.2. Applicable Topologies
Albeit RIFT is specified primarily for "proper" Clos or Fat Tree Albeit RIFT is specified primarily for "proper" Clos or Fat Tree
topologies, the protocol natively supports Points of Delivery (PoD) topologies, the protocol natively supports Points of Delivery (PoD)
concepts, which, strictly speaking, are not found in the original concepts, which, strictly speaking, are not found in the original
Clos concept. Clos concept.
Further, the specification explains and supports operations of multi- Further, the specification explains and supports operations of multi-
plane Clos variants where the protocol recommends the use of inter- plane Clos variants where the protocol recommends the use of inter-
plane rings at the Top-of-Fabric level to allow the reconciliation of plane rings at the ToF level to allow the reconciliation of topology
topology view of different planes to make the negative disaggregation view of different planes to make the Negative Disaggregation viable
viable in case of failures within a plane. These observations hold in case of failures within a plane. These observations hold not only
not only in case of RIFT but also in the generic case of dynamic in case of RIFT but also in the generic case of dynamic routing on
routing on Clos variants with multiple planes and failures in bi- Clos variants with multiple planes and failures in bisectional
sectional bandwidth, especially on the leafs. bandwidth, especially on the leaves.
4.2.1. Horizontal Links 4.2.1. Horizontal Links
RIFT is not limited to pure Clos divided into PoD and multi-planes RIFT is not limited to pure Clos divided into PoD and multi-planes
but supports horizontal (East-West) links below the top of fabric but supports horizontal (East-West) links below the ToF level. Those
level. Those links are used only for last resort northbound links are used only for last resort northbound forwarding when a
forwarding when a spine loses all its northbound links or cannot spine loses all its northbound links or cannot compute a default
compute a default route through them. route through them.
A full-mesh connectivity between nodes on the same level can be A full-mesh connectivity between nodes on the same level can be
employed and that allows N-SPF to provide for any node losing all its deployed, which allows North SPF (N-SPF) to provide for any node
northbound adjacencies (as long as any of the other nodes in the losing all its northbound adjacencies (as long as any of the other
level are northbound connected) to still participate in northbound nodes in the level are northbound connected) and still participate in
forwarding. northbound forwarding.
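The "last resort" role of horizontal links can be sketched as a simple selection rule. The code below is an illustration only and not the N-SPF computation defined in the RIFT specification. The function `default_next_hops` and its inputs are assumptions: each dictionary maps a neighbor name to whether it is currently usable for northbound forwarding.

```python
def default_next_hops(northbound, east_west):
    """Choose next hops for northbound (default) traffic.

    northbound: neighbor -> True if that northbound adjacency is usable
    east_west:  neighbor -> True if that same-level neighbor still has
                northbound connectivity of its own
    (Both inputs are hypothetical; a real implementation derives them
    from adjacency state and SPF results.)
    """
    usable_north = sorted(n for n, ok in northbound.items() if ok)
    if usable_north:
        # Horizontal links are never used while any northbound link works.
        return usable_north
    # Last resort: hand traffic to a same-level neighbor that can go north.
    return sorted(n for n, ok in east_west.items() if ok)

print(default_next_hops({"tof1": True, "tof2": False}, {"spine2": True}))
print(default_next_hops({"tof1": False, "tof2": False}, {"spine2": True}))
```

The first call keeps traffic on the remaining northbound link. The second call falls back to the East-West neighbor only because every northbound link is down.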
Note that a "ring" of horizontal links at any level below ToF does Note that a "ring" of horizontal links at any level below ToF does
not provide a "ring-based protection" scheme since the SPF not provide a "ring-based protection" scheme since the SPF
computation would have to deal necessarily with breaking of "loops", computation would have to deal with breaking of "loops", an
an application for which RIFT is not intended. application for which RIFT is not intended.
4.2.2. Vertical Shortcuts 4.2.2. Vertical Shortcuts
Through relaxations of the specified adjacency forming rules, RIFT Through relaxations of the specified adjacency forming rules, RIFT
implementations can be extended to support vertical "shortcuts". The implementations can be extended to support vertical "shortcuts". The
RIFT specification itself does not provide the exact details since RIFT specification itself does not provide the exact details since
the resulting solution suffers from either much larger blast radius the resulting solution suffers from either a much larger blast radius
with increased flooding volumes or in case of maximum aggregation with increased flooding volumes or bow tie problems in the case of
routing, bow-tie problems. maximum aggregation routing.
4.2.3. Generalizing to any Directed Acyclic Graph 4.2.3. Generalizing to Any Directed Acyclic Graph
RIFT is an anisotropic routing protocol, meaning that it has a sense RIFT is an anisotropic routing protocol, meaning that it has a sense
of direction (northbound, southbound, east-west) and that it operates of direction (northbound, southbound, and East-West) and operates
differently depending on the direction. differently depending on the direction.
Since a DAG provides a sense of north (the direction of the DAG) and Since a DAG provides a sense of north (the direction of the DAG) and
of south (the reverse), it can be used to apply RIFT——an edge in the south (the reverse), it can be used to apply RIFT -- an edge in the
DAG that has only incoming vertices is a ToF node. DAG that has only incoming vertices is a ToF node.
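As a small illustration, ToF candidates can be found in a DAG by looking for vertices that have incoming northbound edges but no outgoing ones, which is how this sketch reads the sentence above. The edge list and the function name `tof_nodes` are hypothetical, and each edge is assumed to be a (south node, north node) pair.

```python
def tof_nodes(edges):
    """Return vertices that only receive northbound edges (hypothetical helper).

    edges: iterable of (south_node, north_node) pairs pointing north.
    """
    edges = list(edges)
    sources = {south for south, _ in edges}  # nodes with a northbound edge
    targets = {north for _, north in edges}  # nodes reached from the south
    # A ToF candidate is reached from the south but points nowhere further north.
    return sorted(targets - sources)

# Illustrative DAG only; it does not come from the document.
dag = [
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine1"), ("leaf2", "spine2"),
    ("spine1", "tof1"), ("spine2", "tof1"), ("spine2", "tof2"),
]
print(tof_nodes(dag))
```

The same two sets also identify leaves as the nodes in `sources` that never appear in `targets`.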
There are a number of caveats though: There are a number of caveats though:
* The DAG structure must exist before RIFT starts, so there is a * The DAG structure must exist before RIFT starts, so there is a
need for a companion protocol to establish the logical DAG need for a companion protocol to establish the logical DAG
structure. structure.
* A generic DAG does not have a sense of east and west. The * A generic DAG does not have a sense of East and West. The
operation specified for east-west links and the southbound operation specified for East-West links and the southbound
reflection between nodes are not applicable. Also ZTP will derive reflection between nodes are not applicable. Also, ZTP will
a sense of depth that will eliminate some links. Variations of derive a sense of depth that will eliminate some links.
ZTP could be derived to meet specific objectives, e.g., make it so Variations of ZTP could be derived to meet specific objectives,
that most routers have at least 2 parents to reach the ToF. e.g., make it so that most routers have at least two parents to
reach the ToF.
* RIFT applies to any Destination-Oriented DAG (DODAG) where there's * RIFT applies to any Destination-Oriented DAG (DODAG) where there's
only one ToF node and the problem of disaggregation does not only one ToF node and the problem of disaggregation does not
exist. In that case, RIFT operates very much like RPL [RFC6550], exist. In that case, RIFT operates very much like RPL [RFC6550],
but using Link State for southbound routes (downwards in RPL's but uses link-state information for southbound routes (downwards
terms). For an arbitrary DAG with multiple destinations (ToFs) in RPL's terms). For an arbitrary DAG with multiple destinations
the way disaggregation happens has to be considered. (ToFs), the way disaggregation happens has to be considered.
* Positive disaggregation expects that most of the ToF nodes reach * Positive Disaggregation expects that most of the ToF nodes reach
most of the leaves, so disaggregation is the exception as opposed most of the leaves, so disaggregation is the exception as opposed
to the rule. When this is no longer true, it makes sense to turn to the rule. When this is no longer true, it makes sense to turn
off disaggregation and route between the ToF nodes over a ring, a off disaggregation and route between the ToF nodes over a ring, a
full mesh, transit network, or a form of area zero. There again, full mesh, a transit network, or a form of area zero. Then again,
this operation is similar to RPL operating as a single DODAG with this operation is similar to RPL operating as a single DODAG with
a virtual root. a virtual root.
* In order to aggregate and disaggregate routes, RIFT requires that * In order to aggregate and disaggregate routes, RIFT requires that
all the ToF nodes share the full knowledge of the prefixes in the all the ToF nodes share the full knowledge of the prefixes in the
fabric. This can be achieved with a ring as suggested by "RIFT" fabric. This can be achieved with a ring as suggested by RIFT
[RIFT], by some preconfiguration, or using a synchronization with [RFC9692], by some preconfiguration, or by using a synchronization
a common repository where all the active prefixes are registered. with a common repository where all the active prefixes are
registered.
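The DAG caveats above can be illustrated with a minimal sketch. It assumes the DAG is given as a list of (child, parent) pairs pointing north; the function names and the node names (leaf1, spine1, tof1, and so on) are hypothetical and are not taken from the RIFT specification. A ToF node is a vertex with no northbound edge, a DODAG is the case with exactly one such vertex, and the last helper lists routers that fall short of the "at least two parents" objective.

```python
def tof_nodes(northbound_edges):
    """Return the vertices that have no northbound (outgoing) edge."""
    nodes, has_parent = set(), set()
    for child, parent in northbound_edges:
        nodes.update((child, parent))
        has_parent.add(child)
    return nodes - has_parent


def is_dodag(northbound_edges):
    """A Destination-Oriented DAG has exactly one ToF node (root)."""
    return len(tof_nodes(northbound_edges)) == 1


def nodes_with_fewer_parents(northbound_edges, minimum=2):
    """Return non-ToF vertices that have fewer than `minimum` parents."""
    parents = {}
    for child, parent in northbound_edges:
        parents.setdefault(child, set()).add(parent)
    return {node for node, ps in parents.items() if len(ps) < minimum}


# Hypothetical three-level fabric; leaf2 is single-homed on purpose.
edges = [
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine2"),
    ("spine1", "tof1"), ("spine1", "tof2"),
    ("spine2", "tof1"), ("spine2", "tof2"),
]
```

With this input, two ToF nodes are found, so the graph is not a DODAG and disaggregation has to be considered; leaf2 is reported as having a single parent.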
4.2.4. Reachability of Internal Nodes in the Fabric 4.2.4. Reachability of Internal Nodes in the Fabric
RIFT does not require that nodes have reachable addresses in the RIFT does not require that nodes have reachable addresses in the
fabric, though it is clearly desirable for operational purposes. fabric, though it is clearly desirable for operational purposes.
Under normal operating conditions this can be easily achieved by Under normal operating conditions, this can be easily achieved by
injecting the node's loopback address into North and South Prefix injecting the node's loopback address into Prefix North TIEs and
TIEs or other implementation specific mechanisms. Prefix South TIEs or other implementation-specific mechanisms.
Special considerations arise when a node loses all northbound Special considerations arise when a node loses all northbound
adjacencies, but is not at the top of the fabric. If a spine node adjacencies but is not at the top of the fabric. If a spine node
loses all northbound links, the spine node doesn't advertise default loses all northbound links, the spine node doesn't advertise a
route. But if the level of the spine node is auto-determined by ZTP, default route. But if the level of the spine node is auto-determined
it will "fall down" as depicted in Figure 8. by ZTP, it will "fall down" as depicted in Figure 8.
4.3. Use Cases 4.3. Use Cases
4.3.1. Data Center Topologies 4.3.1. Data Center Topologies
4.3.1.1. Data Center Fabrics 4.3.1.1. Data Center Fabrics
RIFT is suited for applying in data center (DC) IP fabrics underlay RIFT is suited for applying underlay routing in data center (DC) IP
routing, vast majority of which seem to be currently (and for the fabrics, with the vast majority of these IP fabrics being Clos
foreseeable future) Clos architectures. It significantly simplifies architectures (and will be for the foreseeable future). It
operation and deployment of such fabrics as described in Section 5 significantly simplifies operation and deployment of such fabrics as
for environments compared to extensive proprietary provisioning and described in Section 5 for environments compared to extensive
operational solutions. proprietary provisioning and operational solutions.
4.3.1.2. Adaptations to Other Proposed Data Center Topologies 4.3.1.2. Adaptations to Other Proposed Data Center Topologies
. +-----+ +-----+ . +-----+ +-----+
. | | | | . | | | |
.+-+ S0 | | S1 | .+-+ S0 | | S1 |
.| ++---++ ++---++ .| ++---++ ++---++
.| | | | | .| | | | |
.| | +------------+ | .| | +------------+ |
.| | | +------------+ | .| | | +------------+ |
.| | | | | .| | | | |
.| ++-+--+ +--+-++ .| ++-+--+ +--+-++
.| | | | | .| | | | |
skipping to change at page 11, line 29 skipping to change at line 483
.| | | | | .| | | | |
.| +-+-+-+ +--+-++ .| +-+-+-+ +--+-++
.+-+ | | | .+-+ | | |
. | L0 | | L1 | . | L0 | | L1 |
. +-----+ +-----+ . +-----+ +-----+
Figure 2: Level Shortcut Figure 2: Level Shortcut
RIFT is not strictly limited to Clos topologies. The protocol only RIFT is not strictly limited to Clos topologies. The protocol only
requires a sense of "compass rose directionality" either achieved requires a sense of "compass rose directionality" either achieved
through configuration or derivation of levels. So, conceptually, through configuration or derivation of levels. So conceptually,
shortcuts between levels could be included. Figure 2 depicts an shortcuts between levels could be included. Figure 2 depicts an
example of a shortcut between levels. In this example, sub-optimal example of a shortcut between levels. In this example, suboptimal
routing will occur when traffic is sent from L0 to L1 via S0's routing will occur when traffic is sent from L0 to L1 via S0's
default route and back down through A0 or A1. In order to avoid default route and back down through A0 or A1. In order to avoid
that, only default routes from A0 or A1 are used, all leaves would be that, only default routes from A0 or A1 are used. All leaves would
required to install each other's routes. be required to install each other's routes.
While various technical and operational challenges may require the While various technical and operational challenges may require the
use of such modifications, discussion of those topics are outside the use of such modifications, discussion of those topics is outside the
scope of this document. scope of this document.
4.3.2. Metro Networks 4.3.2. Metro Networks
The demand for bandwidth is increasing steadily, driven primarily by The demand for bandwidth is increasing steadily, driven primarily by
environments close to content producers (server farms connection via environments close to content producers (server farms connection via
DC fabrics) but in proximity to content consumers as well. Consumers DC fabrics) but in proximity to content consumers as well. Consumers
are often clustered in metro areas with their own network are often clustered in metro areas with their own network
architectures that can benefit from simplified, regular Clos architectures that can benefit from simplified, regular Clos
structures and hence from RIFT. structures. Thus, they can also benefit from RIFT.
4.3.3. Building Cabling 4.3.3. Building Cabling
Commercial edifices are often cabled in topologies that are either Commercial edifices are often cabled in topologies that are either
Clos or its isomorphic equivalents. The Clos can grow rather high Clos or its isomorphic equivalents. The Clos can grow rather high
with many levels. That presents a challenge for traditional routing with many levels. That presents a challenge for classical routing
protocols (except BGP[RFC4271] and by now largely phased-out protocols (except BGP [RFC4271] and Private Network-Network Interface
PNNI[PNNI]) which do not support an arbitrary number of levels which (PNNI) [PNNI], which is largely phased-out by now) that do not
RIFT does naturally. Moreover, due to the limited sizes of support an arbitrary number of levels, which RIFT does naturally.
forwarding tables in network elements of building cabling, the Moreover, due to the limited sizes of forwarding tables in network
minimum FIB size RIFT maintains under normal conditions is cost- elements of building cabling, the minimum FIB size RIFT maintains
effective in terms of hardware and operational costs. under normal conditions is cost-effective in terms of hardware and
operational costs.
4.3.4. Internal Router Switching Fabrics 4.3.4. Internal Router Switching Fabrics
It is common in high-speed communications switching and routing It is common in high-speed communications switching and routing
devices to use switch fabrics which are interconnection networks devices to use switch fabrics that are interconnection networks
inside the devices connecting the input ports to their output ports. inside the devices connecting the input ports to their output ports.
For example, crossbar is one of the switch fabric techniques while a For example, a crossbar is one of the switch fabric techniques, even
crossbar is not feasible due to cost, head-of-line blocking or size though it is not feasible due to cost, head-of-line blocking, or size
trade-offs. And normally such fabrics are not self-healing or rely trade-offs. Normally, such fabrics are not self-healing or rely on
on 1:1 or 1+1 protection schemes but it is conceivable to use RIFT to 1:1 or 1+1 protection schemes, but it is conceivable to use RIFT to
operate Clos fabrics that can deal effectively with interconnections operate Clos fabrics that can deal effectively with interconnections
or subsystem failures in such module. RIFT is not IP specific and or subsystem failures in such a module. RIFT is not IP specific and
hence any link addressing connecting internal device subnets is hence any link addressing connecting internal device subnets is
conceivable. conceivable.
4.3.5. CloudCO 4.3.5. CloudCO
The Cloud Central Office (CloudCO) is a new stage of telecom Central The Cloud Central Office (CloudCO) is a new stage of the telecom
Office. It takes the advantage of Software Defined Networking (SDN) Central Office. It takes the advantage of Software-Defined
and Network Function Virtualization (NFV) in conjunction with general Networking (SDN) and Network Function Virtualization (NFV) in
purpose hardware to optimize current networks. The following figure conjunction with general purpose hardware to optimize current
illustrates this architecture at a high level. It describes a single networks. The following figure illustrates this architecture at a
instance or macro-node of cloud CO that provides a number of Value high level. It describes a single instance or macro-node of CloudCO
Added Services (VAS), a Broadband Access Abstraction (BAA), and that provides a number of value-added services (VASes), a Broadband
virtualized network services. An Access I/O module faces a Cloud CO Access Abstraction (BAA), and virtualized network services. An
access node, and the Customer Premises Equipments (CPEs) behind it. Access I/O module faces a CloudCO access node and the Customer
A Network I/O module is facing the core network. The two I/O modules Premises Equipment (CPE) behind it. A Network I/O module is facing
are interconnected by a leaf and spine fabric [TR-384]. the core network. The two I/O modules are interconnected by a spine-
and-leaf fabric [TR-384].
+---------------------+ +----------------------+ +---------------------+ +----------------------+
| Spine | | Spine | | Spine | | Spine |
| Switch | | Switch | | Switch | | Switch |
+------+---+------+-+-+ +--+-+-+-+-----+-------+ +------+---+------+-+-+ +--+-+-+-+-----+-------+
| | | | | | | | | | | | | | | | | | | | | | | |
| | | | | +-------------------------------+ | | | | | | +-------------------------------+ |
| | | | | | | | | | | | | | | | | | | | | | | |
| | | | +-------------------------+ | | | | | | | +-------------------------+ | | |
| | | | | | | | | | | | | | | | | | | | | | | |
skipping to change at page 13, line 45 skipping to change at line 586
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| || VAS7 || || VAS4 || || vIGMP || ||BAA || | | || VAS7 || || VAS4 || || vIGMP || ||BAA || |
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| +--------+ +--------+ +----------+ +-------+ | | +--------+ +--------+ +----------+ +-------+ |
| | | |
++-----------+ +---------++ ++-----------+ +---------++
|Network I/O | |Access I/O| |Network I/O | |Access I/O|
+------------+ +----------+ +------------+ +----------+
Figure 3: An example of CloudCO architecture Figure 3: CloudCO Architecture Example
The Spine-Leaf architecture deployed inside CloudCO meets the network The Spine-Leaf architecture deployed inside CloudCO meets the network
requirements of adaptable, agile, scalable and dynamic. requirements of being adaptable, agile, scalable, and dynamic.
5. Operational Considerations 5. Operational Considerations
RIFT presents the features for organizations building and operating RIFT presents the features for organizations building and operating
IP fabrics to simplify the operation and deployments while achieving IP fabrics to simplify the operation and deployments while achieving
many desirable properties of a dynamic routing protocol on such a many desirable properties of a dynamic routing protocol on such a
substrate: substrate:
* RIFT only floods routing information to the devices that need it. * RIFT only floods routing information to the devices that need it.
* RIFT allows for Zero Touch Provisioning within the protocol. In * RIFT allows for ZTP within the protocol. In its most extreme
its most extreme version, RIFT does not rely on any specific version, RIFT does not rely on any specific addressing and can
addressing and for IP fabric can operate using IPv6 ND [RFC4861] operate using IPv6 Neighbor Discovery (ND) [RFC4861] only for IP
only. fabric.
* RIFT has provisions to detect common IP fabric miscabling * RIFT has provisions to detect common IP fabric miscabling
scenarios. scenarios.
* RIFT negotiates automatically BFD per link. This allows for IP * RIFT automatically negotiates Bidirectional Forwarding Detection
and micro-BFD [RFC7130] to replace Link Aggregation Groups (LAGs) (BFD) per link. This allows for IP and micro-BFD [RFC7130] to
which do hide bandwidth imbalances in case of constituent replace Link Aggregation Groups (LAGs) that hide bandwidth
failures. Further automatic link validation techniques similar to imbalances in case of constituent failures. Further automatic
[RFC5357] could be supported as well. link validation techniques similar to those in [RFC5357] could be
supported as well.
* RIFT inherently solves many problems associated with the use of * RIFT inherently solves many problems associated with the use of
traditional routing topologies with dense meshes and high degrees classical routing topologies with dense meshes and high degrees of
of ECMP by including automatic bandwidth balancing, flood ECMP by including automatic bandwidth balancing, flood reduction,
reduction and automatic disaggregation on failures while providing and automatic disaggregation on failures while providing maximum
maximum aggregation of prefixes in default scenarios. ECMP in aggregation of prefixes in default scenarios. ECMP in RIFT
RIFT eliminates the need for more Loop-Free Alternates procedures. eliminates the need for more Loop-Free Alternate (LFA) procedures.
* RIFT reduces FIB size towards the bottom of the IP fabric where * RIFT reduces FIB size towards the bottom of the IP fabric where
most nodes reside and allows with that for cheaper hardware on the most nodes reside. This allows for cheaper hardware on the edges
edges and introduction of modern IP fabric architectures that and introduction of modern IP fabric architectures that encompass
encompass e.g. server multi-homing. server multihoming and other mechanisms.
* RIFT provides valley-free routing and with that is loop free. A * RIFT provides valley-free routing that is loop free. A valley-
valley-free path allows reversal of direction at most once from a free path allows for reversal of direction at most once from a
packet heading northbound to southbound while permitting traversal packet heading northbound to southbound while permitting traversal
of horizontal links in the northbound phase. This allows the use of horizontal links in the northbound phase. This allows for the
of any such valley-free path in bi-sectional fabric bandwidth use of any such valley-free path in bisectional fabric bandwidth
between two destinations irrespective of their metrics which can between two destinations irrespective of their metrics that can be
be used to balance load on the fabric in different ways. Valley- used to balance load on the fabric in different ways. Valley-free
free routing eliminates the need for any specific micro-loop routing eliminates the need for any specific micro-loop avoidance
avoidance procedures for RIFT. procedures for RIFT.
* RIFT includes a key-value distribution mechanism which allows for * RIFT includes a key-value distribution mechanism that allows for
future applications such as automatic provisioning of basic future applications such as automatic provisioning of basic
overlay services or automatic key roll-overs over whole fabrics. overlay services or automatic key rollovers over whole fabrics.
* RIFT is designed for minimum delay in case of prefix mobility on * RIFT is designed for minimum delay in case of prefix mobility on
the fabric. In conjunction with [RFC8505], RIFT can differentiate the fabric. In conjunction with [RFC8505], RIFT can differentiate
anycast advertisements from mobility events and retain only the anycast advertisements from mobility events and retain only the
most recent advertisement in the latter case. most recent advertisement in the latter case.
* Many further operational and design points collected over many * Many further operational and design points collected over many
years of routing protocol deployments have been incorporated in years of routing protocol deployments have been incorporated in
RIFT such as fast flooding rates, protection of information RIFT such as fast flooding rates, protection of information
lifetimes and operationally recognizable remote ends of links and lifetimes, and operationally recognizable remote ends of links and
node names. node names.
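The valley-free property listed above can be sketched as a small check. This is an illustration only, not part of RIFT: it assumes a path is represented by the list of levels of the nodes it traverses, and it reads the definition as allowing upward and horizontal steps during the northbound phase and only downward steps once the path has turned south. The function name is hypothetical.

```python
def is_valley_free(levels):
    """Check a path, given as the levels of its nodes in order.

    Northbound phase: steps may go up or stay horizontal.
    After the first downward step, every later step must go down.
    """
    descending = False
    for prev, cur in zip(levels, levels[1:]):
        if cur < prev:
            descending = True      # the single north-to-south reversal
        elif descending:
            return False           # went up or sideways after turning south
    return True
```

For example, a leaf-spine-ToF-spine-leaf path with levels [0, 1, 2, 1, 0] passes the check, while [0, 1, 0, 1, 0] contains a valley and fails.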
5.1. South Reflection 5.1. South Reflection
South reflection is a mechanism that South Node TIEs are "reflected" South reflection is a mechanism where South Node TIEs are "reflected"
back up north to allow nodes in same level without east-west links to back up north to allow nodes in the same level without East-West
"see" each other. links to "see" each other.
For example, in Figure 4, Spine111\Spine112\Spine121\Spine122 For example, in Figure 4, Spine111\Spine112\Spine121\Spine122
reflects Node S-TIEs from ToF21 to ToF22 separately. Respectively, reflects Node S-TIEs from ToF21 to ToF22 separately. Respectively,
Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22 Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22
to ToF21 separately. So ToF22 and ToF21 see each other's node to ToF21 separately, so ToF22 and ToF21 see each other's node
information as level 2 nodes. information as level 2 nodes.
In an equivalent fashion, as the result of the south reflection In an equivalent fashion, as the result of the south reflection
between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122,
Spine121 and Spine 122 knows each other at level 1. Spine121 and Spine 122 know each other at level 1.
5.2. Suboptimal Routing on Link Failures 5.2. Suboptimal Routing on Link Failures
+--------+ +--------+ +--------+ +--------+
| ToF21 | | ToF22 | LEVEL 2 | ToF21 | | ToF22 | LEVEL 2
++--+-+-++ ++-+--+-++ ++--+-+-++ ++-+--+-++
| | | | | | | + | | | | | | | +
| | | | | | | linkTS8 | | | | | | | linkTS8
+------------+ | +-+linkTS3+-+ | | | +-------------+ +------------+ | +-+linkTS3+-+ | | | +-------------+
| | | | | | + | | | | | | | + |
| +---------------------------+ | linkTS7 | | +---------------------------+ | linkTS7 |
| | | | + + + | | | | | + + + |
| | | +-------+linkTS4+------------+ | | | | +-------+linkTS4+------------+ |
skipping to change at page 16, line 31 skipping to change at line 697
| +-------------+ | + ++XX+linkSL6+---+ + | +-------------+ | + ++XX+linkSL6+---+ +
| | | | linkSL5 | | linkSL8 | | | | linkSL5 | | linkSL8
| +-----------+ | | + +---+linkSL7+-+ | + | +-----------+ | | + +---+linkSL7+-+ | +
| | | | | | | | | | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ +-+-----+ +-----+-+ +-+-----+ +-+-----+ +-+-----+ +-----+-+ +-+-----+
+ + + + + + + +
Prefix111 Prefix112 Prefix121 Prefix122 Prefix111 Prefix112 Prefix121 Prefix122
Figure 4: Suboptimal routing upon link failure use case Figure 4: Suboptimal Routing Upon Link Failure Use Case
As shown in Figure 4, as the result of the south reflection between As shown in Figure 4, as the result of the south reflection, Spine121
Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and and Spine 122 know each other via Leaf121 or Leaf 122 at level 1.
Spine 122 knows each other at level 1.
Without disaggregation mechanism, when linkSL6 fails, the packet from Without disaggregation mechanisms, the packet from leaf121 to
leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 prefix122 will probably go up through linkSL5 to linkTS3 when linkSL6
then go down through linkTS4 to linkSL8 to Leaf122 or go up through fails. Then, the packet will go down through linkTS4 to linkSL8 to
linkSL5 to linkTS6 then go down through linkTS8 and linkSL8 to Leaf122 or go up through linkSL5 to linkTS6, then go down through
Leaf122 based on pure default route. It's the case of suboptimal linkTS8 and linkSL8 to Leaf122 based on the pure default route. This
routing or bow-tieing. is the case of suboptimal routing or bow tying.
With disaggregation mechanism, when linkSL6 fails, Spine122 will With disaggregation mechanisms, Spine122 will detect the failure
detect the failure according to the reflected node S-TIE from according to the reflected node S-TIE from Spine121 when linkSL6
Spine121. Based on the disaggregation algorithm provided by RIFT, fails. Based on the disaggregation algorithm provided by RIFT,
Spine122 will explicitly advertise prefix122 in Disaggregated Prefix Spine122 will explicitly advertise prefix122 in Disaggregated Prefix
S-TIE PrefixTIEElement(prefix122, cost 1). The packet from leaf121 S-TIE PrefixTIEElement(prefix122, cost 1). The packet from leaf121
to prefix122 will only be sent to linkSL7 following a longest-prefix to prefix122 will only be sent to linkSL7 following a longest-prefix
match to prefix 122 directly then go down through linkSL8 to Leaf122 match to prefix 122 directly, then it will go down through linkSL8 to
. Leaf122.
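The effect of the disaggregated prefix on Leaf121 can be sketched with a longest-prefix-match lookup. This is a simplified illustration: the address 10.1.22.0/24 standing in for prefix122, the table layout, and the function name are assumptions made for the example and do not come from the RIFT specification. Only the link names follow Figure 4.

```python
import ipaddress


def lookup(rib, destination):
    """Return the next-hop links of the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    best = max((net for net in rib if addr in net),
               key=lambda net: net.prefixlen)
    return rib[best]


DEFAULT = ipaddress.ip_network("0.0.0.0/0")
PREFIX122 = ipaddress.ip_network("10.1.22.0/24")  # assumed value

# Leaf121 before disaggregation: only default routes from both spines.
rib_before = {DEFAULT: ["linkSL5", "linkSL7"]}

# After Spine122 advertises prefix122 in a disaggregated Prefix S-TIE,
# Leaf121 holds a more specific route via linkSL7 only.
rib_after = {DEFAULT: ["linkSL5", "linkSL7"], PREFIX122: ["linkSL7"]}
```

Before disaggregation, a lookup for 10.1.22.5 returns both linkSL5 and linkSL7, so part of the traffic takes the suboptimal path; afterwards it returns linkSL7 only, while all other destinations keep using both default routes.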
5.3. Black-Holing on Link Failures 5.3. Black-Holing on Link Failures
+--------+ +--------+ +--------+ +--------+
| ToF 21 | | ToF 22 | LEVEL 2 | ToF 21 | | ToF 22 | LEVEL 2
++-+--+-++ ++-+--+-++ ++-+--+-++ ++-+--+-++
| | | | | | | + | | | | | | | +
| | | | | | | linkTS8 | | | | | | | linkTS8
+--------------+ | +-+linkTS3+X+ | | | +--------------+ +--------------+ | +-+linkTS3+X+ | | | +--------------+
linkTS1 | | | | | + | linkTS1 | | | | | + |
skipping to change at page 17, line 34 skipping to change at line 747
+ +---------------+ | + +---+linkSL6+---+ + + +---------------+ | + +---+linkSL6+---+ +
linkSL1 | | | linkSL5 | | linkSL8 linkSL1 | | | linkSL5 | | linkSL8
+ +--+linkSL3+--+ | | + +---+linkSL7+-+ | + + +--+linkSL3+--+ | | + +---+linkSL7+-+ | +
| | | | | | | | | | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ +-+-----+ +-----+-+ +-----+-+ +-+-----+ +-+-----+ +-----+-+ +-----+-+
+ + + + + + + +
Prefix111 Prefix112 Prefix121 Prefix122 Prefix111 Prefix112 Prefix121 Prefix122
Figure 5: Black-holing upon link failure use case Figure 5: Black-Holing Upon Link Failure Use Case
This scenario illustrates a case when double link failure occurs and This scenario illustrates a case where double link failure occurs and
with that black-holing can happen. black-holing can happen.
Without disaggregation mechanism, when linkTS3 and linkTS4 both fail, Without disaggregation mechanisms, the packet from leaf111 to
the packet from leaf111 to prefix122 would suffer 50% black-holing prefix122 would suffer 50% black-holing based on pure default route
based on pure default route. The packet supposed to go up through when linkTS3 and linkTS4 both fail. The packet is supposed to go up
linkSL1 to linkTS1 then go down through linkTS3 or linkTS4 will be through linkSL1 to linkTS1 and then go down through linkTS3 or
dropped. The packet supposed to go up through linkSL3 to linkTS2 linkTS4 will be dropped. The packet is supposed to go up through
then go down through linkTS3 or linkTS4 will be dropped as well. linkSL3 to linkTS2, then go down through linkTS3 or linkTS4 will be
It's the case of black-holing. dropped as well. This is the case of black-holing.
With disaggregation mechanism, when linkTS3 and linkTS4 both fail, With disaggregation mechanisms, ToF22 will detect the failure
ToF22 will detect the failure according to the reflected node S-TIE according to the reflected node S-TIE of ToF21 from Spine111\Spine112
of ToF21 from Spine111\Spine112. Based on the disaggregation when linkTS3 and linkTS4 both fail. Based on the disaggregation
algorithm provided by RIFT, ToF22 will explicitly originate an S-TIE algorithm provided by RIFT, ToF22 will explicitly originate an S-TIE
with prefix 121 and prefix 122, that is flooded to spines 111, 112, with prefix 121 and prefix 122 that is flooded to spines 111, 112,
121 and 122. 121, and 122.
The packet from leaf111 to prefix122 will not be routed to linkTS1 or The packet from leaf111 to prefix122 will not be routed to linkTS1 or
linkTS2. The packet from leaf111 to prefix122 will only be routed to linkTS2. The packet from leaf111 to prefix122 will only be routed to
linkTS5 or linkTS7 following a longest-prefix match to prefix122. linkTS5 or linkTS7 following a longest-prefix match to prefix122.
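The 50% figure above can be reproduced with a small model. The sketch assumes equal-cost splitting at every hop and a per-destination next-hop table for traffic towards prefix122; the table contents are a simplified reading of Figure 5, and the function name is hypothetical. A node with no usable next hop drops the traffic it receives.

```python
from fractions import Fraction


def delivered_fraction(next_hops, src, dst):
    """Fraction of traffic from src that reaches dst under equal splitting.

    next_hops maps a node to its list of next hops towards dst; the
    graph must be loop free. An empty or missing list means a drop.
    """
    if src == dst:
        return Fraction(1)
    hops = next_hops.get(src, [])
    if not hops:
        return Fraction(0)
    total = sum((delivered_fraction(next_hops, h, dst) for h in hops),
                Fraction(0))
    return total / len(hops)


# Default routes only; ToF21 has lost linkTS3 and linkTS4, so it has
# no southbound next hop towards prefix122 (modeled as Leaf122).
default_only = {
    "Leaf111": ["Spine111", "Spine112"],
    "Spine111": ["ToF21", "ToF22"],
    "Spine112": ["ToF21", "ToF22"],
    "ToF21": [],
    "ToF22": ["Spine121", "Spine122"],
    "Spine121": ["Leaf122"],
    "Spine122": ["Leaf122"],
}

# With the disaggregated prefix from ToF22, the spines prefer ToF22.
disaggregated = dict(default_only,
                     Spine111=["ToF22"], Spine112=["ToF22"])
```

In this model, half of the traffic from Leaf111 is lost with default routes only, and all of it is delivered once Spine111 and Spine112 follow the more specific route via ToF22.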
5.4. Zero Touch Provisioning (ZTP) 5.4. Zero Touch Provisioning (ZTP)
RIFT is designed to require a very minimal configuration to simplify RIFT is designed to require a very minimal configuration to simplify
its operation and avoid human errors; based on that minimal its operation and avoid human errors; based on that minimal
information, Zero Touch Provisioning (ZTP) auto configures the key information, ZTP auto configures the key operational parameters of
operational parameters of all the RIFT nodes, including the SystemID all the RIFT nodes, including the System ID of the node that must be
of the node that must be unique in the RIFT network and the level of unique in the RIFT network and the level of the node in the Fat Tree,
the node in the Fat Tree, which determines which peers are northwards which determines which peers are northward "parents" and which are
"parents" and which are southwards "children". southward "children".
ZTP is always on, but its decisions can be overridden when a network ZTP is always on, but its decisions can be overridden when a network
administrator prefers to impose its own configuration. In that case, administrator prefers to impose its own configuration. In that case,
it is the responsibility of the administrator to ensure that the it is the responsibility of the administrator to ensure that the
configured parameters are correct, in other words that the SystemID configured parameters are correct, i.e., ensure that the System ID of
of each node is unique, and that the administratively set levels each node is unique and that the administratively set levels truly
truly reflect the relative position of the nodes in the fabric. It reflect the relative position of the nodes in the fabric. It is
is recommended to let ZTP configure the network, and when not, it is recommended to let ZTP configure the network, and when ZTP does not
recommended to configure the level of all the nodes to avoid an configure the network, it is recommended to configure the level of
undesirable interaction between ZTP and the manual configuration. all the nodes to avoid an undesirable interaction between ZTP and the
manual configuration.
ZTP requires that the administrator points out the Top-of-Fabric ZTP requires that the administrator points out the ToF nodes to set
(ToF) nodes to set the baseline from which the fabric topology is the baseline from which the fabric topology is derived. The ToF
derived. The Top-of-Fabric nodes are configured with TOP_OF_FABRIC nodes are configured with the TOP_OF_FABRIC flag, which are initial
flag which are initial 'seeds' needed for other ZTP nodes to derive 'seeds' needed for other ZTP nodes to derive their level in the
their level in the topology. ZTP computes the level of each node topology. ZTP computes the level of each node based on the Highest
based on the Highest Available Level (HAL) of the potential parent(s) Available Level (HAL) of the potential parent closest to that
nearest that baseline, which represents the superspine. In a baseline, which represents the superspine. In a fashion, RIFT can be
fashion, RIFT can be seen as a distance-vector protocol that computes seen as a distance-vector protocol that computes a set of feasible
a set of feasible successors towards the superspine and auto- successors towards the superspine and autoconfigures the rest of the
configures the rest of the topology. topology.
The auto configuration mechanism computes a global maximum of levels The autoconfiguration mechanism computes a global maximum of levels
by diffusion. The derivation of the level of each node happens then by diffusion. The derivation of the level of each node happens then
based on Link Information Elements (LIEs) received from its neighbors based on LIEs received from its neighbors, whereas each node (with
whereas each node (with possibly exceptions of configured leaves) possible exceptions of configured leaves) tries to attach at the
tries to attach at the highest possible point in the fabric. This highest possible point in the fabric. This guarantees that even if
guarantees that even if the diffusion front reaches a node from the diffusion front reaches a node from "below" faster than from
"below" faster than from "above", it will greedily abandon already "above", it will greedily abandon already negotiated levels derived
negotiated level derived from nodes topologically below it and from nodes topologically below it and properly peer with nodes above.
properly peer with nodes above.
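The level derivation by diffusion can be sketched as follows. This is a heavily simplified illustration, not the ZTP state machine of the RIFT specification: it assumes that a node derives its level as one below the highest level offered by any neighbor (HAL - 1), that ToF nodes are seeded with a level of 24, and it ignores leaf flags, offer validity, and timers. All names are hypothetical.

```python
def derive_levels(adjacency, tof_nodes, tof_level=24):
    """Derive node levels from the ToF seeds by repeated diffusion.

    adjacency maps each node to its neighbors. A non-ToF node takes
    max(HAL - 1, 0), where HAL is the highest level currently offered
    by its neighbors, and re-derives whenever a better offer appears.
    """
    levels = {node: tof_level for node in tof_nodes}
    changed = True
    while changed:
        changed = False
        for node, neighbors in adjacency.items():
            if node in tof_nodes:
                continue
            offers = [levels[n] for n in neighbors if n in levels]
            if not offers:
                continue  # nothing heard yet from any neighbor
            derived = max(max(offers) - 1, 0)
            if levels.get(node) != derived:
                levels[node] = derived
                changed = True
    return levels


# Hypothetical fabric; the leaf is listed first to show that the
# result does not depend on the order in which nodes are visited.
adjacency = {
    "leaf1": ["spine1", "spine2"],
    "spine1": ["tof1", "leaf1"],
    "spine2": ["tof1", "leaf1"],
    "tof1": ["spine1", "spine2"],
}
fabric_levels = derive_levels(adjacency, {"tof1"})
```

In this sketch, the spines settle one level below the ToF node and the leaf one level below the spines, because each node always prefers the highest offer it has seen.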
The achieved equilibrium can be disturbed massively by all nodes with The achieved equilibrium can be disturbed massively by all nodes with
highest level either leaving or entering the domain (with some finer the highest level either leaving or entering the domain (with some
distinctions not explained further). It is therefore recommended finer distinctions not explained further). It is therefore
that each node is multi-homed towards nodes with respective HAL recommended that each node is multihomed towards nodes with
offerings. Fortunately, this is the natural state of things for the respective HAL offerings. Fortunately, this is the natural state of
topology variants considered in RIFT. things for the topology variants considered in RIFT.
A RIFT node may also be configured to confine it to the leaf role A RIFT node may also be configured to confine it to the leaf role
with the LEAF_ONLY flag. A leaf node can also be configured to with the LEAF_ONLY flag. A leaf node can also be configured to
support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In either support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In both
case the node cannot be TOP_OF_FABRIC and its level cannot be cases, the node cannot be TOP_OF_FABRIC and its level cannot be
configured. RIFT will fully determine the node's level after it is configured. RIFT will fully determine the node's level after it is
attached to the topology and ensure that the node is at the "bottom attached to the topology and ensure that the node is at the "bottom
of the hierarchy" (southernmost). of the hierarchy" (southernmost).
5.5. Miscabling 5.5. Miscabling
5.5.1. Miscabling Examples 5.5.1. Miscabling Examples
+----------------+ +-----------------+ +----------------+ +-----------------+
| ToF21 | +------+ ToF22 | LEVEL 2 | ToF21 | +------+ ToF22 | LEVEL 2
skipping to change at page 19, line 42 skipping to change at line 853
+-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+
| | | | | | | | | | | | | | | | | |
| +---------+ | link-M | +---------+ | | +---------+ | link-M | +---------+ |
| | | | | | | | | | | | | | | | | |
| +-------+ | | | | +-------+ | | | +-------+ | | | | +-------+ | |
| | | | | | | | | | | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
Figure 6: A single plane miscabling example Figure 6: A Single-Plane Miscabling Example
Figure 6 shows a single plane miscabling example. It's a perfect Fat Figure 6 shows a single-plane miscabling example. It's a perfect Fat
Tree fabric except link-M connecting Leaf112 to ToF22. Tree fabric except for link-M connecting Leaf112 to ToF22.
The RIFT control protocol can discover the physical links The RIFT control protocol can discover the physical links
automatically and be able to detect cabling that violates Fat Tree automatically and is able to detect cabling that violates Fat Tree
topology constraints. It reacts accordingly to such miscabling topology constraints. It reacts accordingly to such miscabling
attempts, at a minimum preventing adjacencies between nodes from attempts, preventing adjacencies between nodes from being formed and
being formed and traffic from being forwarded on those miscabled traffic from being forwarded on those miscabled links at a minimum.
links. Leaf112 will in such a scenario use link-M to derive its level In such a scenario, Leaf112 will use link-M to derive its level (unless
(unless it is a leaf) and can report links to Spine111 and Spine112 as it is a leaf) and can report links to Spine111 and Spine112 as
miscabled unless the implementation allows horizontal links. miscabled unless the implementation allows horizontal links.
Figure 7 shows a multiple plane miscabling example. Since Leaf112 Figure 7 shows a multi-plane miscabling example. Since Leaf112 and
and Spine121 belong to two different PoDs, the adjacency between Spine121 belong to two different PoDs, the adjacency between Leaf112
Leaf112 and Spine121 can not be formed. Link-W would be detected and and Spine121 cannot be formed. Link-W would be detected and
prevented. prevented.
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
| | | | | | | | | | | | | | | |
| | | +-----------------+ | | | | | | +-----------------+ | | |
| +--------------------------+ | | | | | +--------------------------+ | | | |
| +------+ | | | +------+ | | +------+ | | | +------+ |
| | +-----------------+ | | | | | | | +-----------------+ | | | | |
skipping to change at page 20, line 36 skipping to change at line 895
| | | | | | | | | | | | | | | | | |
| +---------+ | | | +---------+ | | +---------+ | | | +---------+ |
| | | | link-W | | | | | | | | link-W | | | |
| +-------+ | | | | +-------+ | | | +-------+ | | | | +-------+ | |
| | | | | | | | | | | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
+--------PoD#1----------+ +---------PoD#2---------+ +--------PoD#1----------+ +---------PoD#2---------+
Figure 7: A multiple plane miscabling example Figure 7: A Multiple Plane Miscabling Example
RIFT provides an optional level determination procedure in its Zero RIFT provides an optional level determination procedure in its ZTP
Touch Provisioning mode. Nodes in the fabric without their level mode. Nodes in the fabric without their level configured determine
configured determine it automatically. This can have possibly it automatically. However, this can have possible counter-intuitive
counter-intuitive consequences however. One extreme failure scenario consequences. One extreme failure scenario is depicted in Figure 8,
is depicted in Figure 8 and it shows that if all northbound links of and it shows that if all northbound links of Spine11 fail at the same
spine11 fail at the same time, spine11 negotiates a lower level than time, Spine11 negotiates a lower level than Leaf111 and Leaf112.
Leaf111 and Leaf112.
To prevent such a scenario where leafs are expected to act as switches, To prevent such a scenario where leaves are expected to act as
LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is switches, the LEAF_ONLY flag can be set for Leaf111 and Leaf112.
invalid, Spine11 would not derive a valid level from the topology in Since level -1 is invalid, Spine11 would not derive a valid level
Figure 8. It will be isolated from the whole fabric and it would be from the topology in Figure 8. It will be isolated from the whole
up to the leafs to declare the links towards such a spine as miscabled. fabric, and it would be up to the leaves to declare the links towards
such a spine as miscabled.
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF A1| |ToF A2| |ToF A1| |ToF A2| |ToF A1| |ToF A2|
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
| | | | | | | | | | | |
| +-------+ | | | | +-------+ | | |
+ + | | ====> | | + + | | ====> | |
X X +------+ | +------+ | X X +------+ | +------+ |
+ + | | | | + + | | | |
+----+--+ +-+-----+ +-+-----+ +----+--+ +-+-----+ +-+-----+
skipping to change at page 21, line 30 skipping to change at line 936
+-+---+-+ +--+--+-+ +-----+-+ +-----+-+ +-+---+-+ +--+--+-+ +-----+-+ +-----+-+
|Leaf111| |Leaf112| |Leaf111| |Leaf112| |Leaf111| |Leaf112| |Leaf111| |Leaf112|
+-------+ +-------+ +-+-----+ +-+-----+ +-------+ +-------+ +-+-----+ +-+-----+
| | | |
| +--------+ | +--------+
| | | |
+-+---+-+ +-+---+-+
|Spine11| |Spine11|
+-------+ +-------+
Figure 8: Fallen spine Figure 8: Fallen Spine
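As a rough illustration of the level derivation and the fallen-spine case discussed above, the following Python sketch assumes a simplified rule: a node without a configured level takes the highest level offered by its neighbors (the HAL) minus one, and a negative result is treated as invalid. The function name derive_level and the numeric levels (including a ToF level of 24) are hypothetical and chosen only for this example; the normative ZTP procedure in [RFC9692] contains further rules that are not modeled here.

```python
from typing import Iterable, Optional


def derive_level(offered_levels: Iterable[int]) -> Optional[int]:
    """Return HAL - 1, or None when no valid level can be derived."""
    offers = list(offered_levels)
    if not offers:
        return None  # no neighbor offers a level yet
    hal = max(offers)  # Highest Available Level among the offers
    level = hal - 1
    return level if level >= 0 else None  # negative levels are invalid


TOF_LEVEL = 24  # illustrative value for the top of the fabric

# Healthy fabric: each tier attaches one level below its parents.
spine_level = derive_level([TOF_LEVEL, TOF_LEVEL])
leaf_level = derive_level([spine_level, spine_level])

# Fallen spine: all northbound links are lost, so only leaves offer a level
# and the spine ends up below them.
fallen_spine_level = derive_level([leaf_level, leaf_level])

# With LEAF_ONLY, leaves sit at level 0 (assumption of this sketch), so the
# fallen spine cannot derive a valid level and stays isolated.
isolated_spine_level = derive_level([0, 0])
```

In the LEAF_ONLY case the function returns None, which corresponds to the spine being isolated and the leaves being left to report the links as miscabled.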
5.5.2. Miscabling considerations 5.5.2. Miscabling Considerations
There are scenarios where operators may want to leverage ZTP and There are scenarios where operators may want to leverage ZTP and
implement additional cabling constraints that go beyond the implement additional cabling constraints that go beyond the
previously described topology violations. Enforcing cabling down to previously described topology violations. Enforcing cabling down to
specific level, node, and port combinations might make it simpler for specific level, node, and port combinations might make it simpler for
onsite staff to perform troubleshooting activities or replace optical onsite staff to perform troubleshooting activities or replace optical
transceivers and/or cabling as the physical layout will be consistent transceivers and/or cabling as the physical layout will be consistent
across the fabric. This is especially true for densely connected across the fabric. This is especially true for densely connected
fabrics where it is difficult to physically manipulate those fabrics where it is difficult to physically manipulate those
components. It is also easy to imagine other models, such as one components. It is also easy to imagine other models, such as one
where the strict port requirement is relaxed. where the strict port requirement is relaxed.
Figure 9 illustrates an example where the first port on Leaf1 must Figure 9 illustrates an example where the first port on Leaf1 must
connect to the first port on Spine1, the second port on Leaf1 must connect to the first port on Spine1, the second port on Leaf1 must
connect to the first port on Spine2, and so on. Consider a case connect to the first port on Spine2, and so on. Consider a case
where (Leaf1, Port1) and (Leaf1, Port2) were reversed. RIFT would where (Leaf1, Port1) and (Leaf1, Port2) were reversed. RIFT would
not consider this to be miscabled by default, however, an operator not consider this to be miscabled by default; however, an operator
might want to. might want to.
+--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
| Spine1 | | Spine2 | | Spine3 | | Spine4 | | Spine1 | | Spine2 | | Spine3 | | Spine4 |
+-1------+ +-1------+ +-1------+ +-1------+ +-1------+ +-1------+ +-1------+ +-1------+
+ + + + + + + +
| +----------+ | | | +----------+ | |
| | | | | | | |
| | +---------------------+ | | | +---------------------+ |
| | | | | | | |
| | | +--------------------------------+ | | | +--------------------------------+
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
+ + + + + + + +
+-1--2--3--4--+ +-1--2--3--4--+
| Leaf1 | ...... | Leaf1 | ......
+-------------+ +-------------+
Figure 9: Fallen spine Figure 9: Additional Cabling Constraint Example
RIFT allows implementations to provide programmable plugins that can RIFT allows implementations to provide programmable plug-ins that can
adjust ZTP operation or capture information during computation. adjust ZTP operation or capture information during computation.
While defining this is outside the scope of this document, such a While defining this is outside the scope of this document, such a
mechanism could be used to extend miscabling functionality. mechanism could be used to extend the miscabling functionality.
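A minimal sketch of what such an extension might check is given below. It is not an API defined by RIFT: the cabling plan, the cabling_violations function, and the way observed neighbors are represented are all assumptions made for illustration. The plan encodes the constraint of Figure 9, where port n of Leaf1 must connect to port 1 of Spine n.

```python
from typing import Dict, List, Tuple

Port = Tuple[str, int]  # (node name, port number)

# Hypothetical cabling plan for Figure 9: Leaf1 port n -> Spine n port 1.
EXPECTED: Dict[Port, Port] = {
    ("Leaf1", n): (f"Spine{n}", 1) for n in range(1, 5)
}


def cabling_violations(observed: Dict[Port, Port],
                       expected: Dict[Port, Port] = EXPECTED) -> List[str]:
    """Compare observed neighbors with the plan and describe each mismatch."""
    problems = []
    for local, want in sorted(expected.items()):
        got = observed.get(local)
        if got != want:
            found = f"{got[0]} port {got[1]}" if got else "no neighbor"
            problems.append(
                f"{local[0]} port {local[1]}: expected "
                f"{want[0]} port {want[1]}, found {found}"
            )
    return problems


# (Leaf1, Port1) and (Leaf1, Port2) reversed, as in the example above.
reversed_ports = dict(EXPECTED)
reversed_ports[("Leaf1", 1)] = ("Spine2", 1)
reversed_ports[("Leaf1", 2)] = ("Spine1", 1)
report = cabling_violations(reversed_ports)
```

The report names both affected ports at once, which is the kind of explicit guidance that lets onsite staff correct the cabling in a single pass.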
For other protocols to achieve this, it would require additional For other protocols to achieve this, it would require additional
operational overhead. Consider a fabric that is using unnumbered operational overhead. Consider a fabric that is using unnumbered
OSPF links, it is still very likely that a miscabled link will form OSPF links; it is still very likely that a miscabled link will form
an adjacency. Each attempts to move cables to the correct port may an adjacency. Each attempt to move cables to the correct port may
result in the need for additional troubleshooting as other links will result in the need for additional troubleshooting as other links will
become miscabled in the process. Without automation to explicitly become miscabled in the process. Without automation to explicitly
tell the operator which ports need to be moved where, the process tell the operator which ports need to be moved where, the process
becomes manually intensive and error-prone very quickly. Or if the becomes manually intensive and error-prone very quickly. If the
problem goes unnoticed, result in suboptimal performance in the problem goes unnoticed, it will result in suboptimal performance in
fabric. the fabric.
5.6. Multicast and Broadcast Implementations 5.6. Multicast and Broadcast Implementations
RIFT supports both multicast and broadcast implementations. While a RIFT supports both multicast and broadcast implementations. While a
multicast implementation is preferred, there might be cases where a multicast implementation is preferred, there might be cases where a

broadcast implementation is optimal or even required. For example, broadcast implementation is optimal or even required. For example,
operating systems on IoT devices and embedded devices may not have operating systems on IoT devices and embedded devices may not have
the required multicast support. Another example is containers, which the required multicast support. Another example is containers, which
in some cases do support multicast, but tend to be very CPU- do support multicast in some cases but tend to be very CPU-
inefficient and difficult to tune. inefficient and difficult to tune.
5.7. Positive vs. Negative Disaggregation 5.7. Positive vs. Negative Disaggregation
Disaggregation is the procedure whereby RIFT [RIFT] advertises a more Disaggregation is the procedure whereby RIFT [RFC9692] advertises a
specific route southwards as an exception to the aggregated fabric- more specific route southwards as an exception to the aggregated
default north. Disaggregation is useful when a prefix within the fabric-default north. Disaggregation is useful when a prefix within
aggregation is reachable via some of the parents but not the others the aggregation is reachable via some of the parents but not the
at the same level of the fabric. It is mandatory when the level is others at the same level of the fabric. It is mandatory when the
the ToF since a ToF node that cannot reach a prefix becomes a black level is the ToF since a ToF node that cannot reach a prefix becomes
hole for that prefix. The hard problem is to know which prefixes are a black hole for that prefix. The hard problem is to know which
reachable by whom. prefixes are reachable by whom.
In the general case, RIFT [RIFT] solves that problem by In the general case, RIFT [RFC9692] solves that problem by
interconnecting the ToF nodes. So the ToF nodes can exchange the interconnecting the ToF nodes so that the ToF nodes can exchange the
full list of prefixes that exist in the fabric and figure out when a full list of prefixes that exist in the fabric and figure out when a
ToF node lacks reachability to some prefixes. This requires ToF node lacks reachability to some prefixes. This requires
additional ports at the ToF, typically 2 ports per ToF node to form a additional ports at the ToF, typically two ports per ToF node to form
ToF-spanning ring. RIFT [RIFT] also defines the southbound a ToF-spanning ring. RIFT [RFC9692] also defines the southbound
reflection procedure that enables a parent to explore the direct reflection procedure that enables a parent to explore the direct
connectivity of its peers, meaning their own parents and children; connectivity of its peers, meaning their own parents and children;
based on the advertisements received from the shared parents and based on the advertisements received from the shared parents and
children, it may enable the parent to infer the prefixes its peers children, it may enable the parent to infer the prefixes its peers
can reach. can reach.
When a parent lacks reachability to a prefix, it may disaggregate the When a parent lacks reachability to a prefix, it may disaggregate the
prefix negatively, i.e., advertise that this parent can be used to prefix negatively, i.e., advertise that this parent can be used to
reach any prefix in the aggregation except that one. The Negative reach any prefix in the aggregation except that one. The Negative
Disaggregation signaling is simple and functions transitively from Disaggregation signaling is simple and functions transitively from
ToF to top-of-pod (ToP) and then from ToP to Leaf. But it is hard ToF to Top-of-Pod (ToP) and then from ToP to Leaf. However, it is
for a parent to figure which prefix it needs to disaggregate, because hard for a parent to figure out which prefix it needs to disaggregate
it does not know what it does not know; as a result, the use of a because it does not know what it does not know; as a result, the
spanning ring at the ToF is required to operate the Negative use of a spanning ring at the ToF is required to operate the Negative
Disaggregation. Also, though it is only an implementation problem, Disaggregation. Also, though it is only an implementation problem,
the programming of the FIB is complex compared to normal routes, and the programming of the FIB is complex compared to normal routes and
may incur recursions. may incur recursions.
The more classical alternative is, for the parents that can reach a The more classical alternative is, for the parents that can reach a
prefix that peers at the same level cannot, to advertise a more prefix that peers at the same level cannot, to advertise a more
specific route to that prefix. This leverages the normal longest specific route to that prefix. This leverages the normal longest
prefix match in the FIB, and does not require a special prefix match in the FIB and does not require a special
implementation. But as opposed to the Negative Disaggregation, the implementation. As opposed to the Negative Disaggregation, the
Positive Disaggregation is difficult and inefficient to operate Positive Disaggregation is difficult and inefficient to operate
transitively. transitively.
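The effect of the two styles on next-hop selection at a child node can be sketched as follows. This is a simplified model and not the FIB programming specified in [RFC9692]: the function next_hops, its arguments, and the prefixes are hypothetical. A Positive Disaggregation is modeled as a more specific route via the advertising parents (longest prefix match), and a Negative Disaggregation as the removal of the advertising parent from whatever next-hop set would otherwise apply.

```python
import ipaddress
from typing import Dict, Set


def next_hops(dest: str,
              parents: Set[str],
              positive: Dict[str, Set[str]],
              negative: Dict[str, Set[str]]) -> Set[str]:
    """Return the parents usable for dest under the simplified model."""
    addr = ipaddress.ip_address(dest)

    # Start from the default route, which points at every parent.
    hops = set(parents)

    # Positive Disaggregation: the longest matching more-specific route wins.
    best_len = -1
    for prefix, advertisers in positive.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len = net.prefixlen
            hops = set(advertisers)

    # Negative Disaggregation: drop parents that excluded a covering prefix.
    for prefix, excluded in negative.items():
        if addr in ipaddress.ip_network(prefix):
            hops -= excluded

    return hops


parents = {"ToF1", "ToF2"}
via_negative = next_hops("10.1.2.5", parents, {}, {"10.1.2.0/24": {"ToF2"}})
via_positive = next_hops("10.1.2.5", parents, {"10.1.2.0/24": {"ToF1"}}, {})
unaffected = next_hops("10.9.9.9", parents, {}, {"10.1.2.0/24": {"ToF2"}})
```

In this example both styles steer traffic for 10.1.2.0/24 to ToF1 only, while other destinations keep using both parents. The difference lies in who has to advertise: the node that lost reachability (negative) or the nodes that still have it (positive).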
Transitivity is not needed to a grandchild if all its parents Transitivity is not needed by a grandchild if all its parents
received the Positive Disaggregation, meaning that they shall all received the Positive Disaggregation, meaning that they shall all
avoid the black hole; when that is the case, they collectively build avoid the black hole; when that is the case, they collectively build
a ceiling that protects the grandchild. But until then, a parent a ceiling that protects the grandchild. Until then, a parent that
that received a Positive Disaggregation may believe that some peers received the Positive Disaggregation may believe that some peers are
are lacking the reachability and readvertise too early, or defer and lacking the reachability and re-advertise too early or defer and
maintain a black hole situation longer than necessary. maintain a black hole situation longer than necessary.
In a non-partitioned fabric, all the ToF nodes see one another In a non-partitioned fabric, all the ToF nodes see one another
through the reflection and can figure if one is missing a child. In through the reflection and can figure out if one is missing a child.
that case it is possible to compute the prefixes that the peer cannot In that case, it is possible to compute the prefixes that the peer
reach and disaggregate positively without a ToF-spanning ring. The cannot reach and disaggregate positively without a ToF-spanning ring.
ToF nodes can also ascertain that the ToP nodes are connected each to The ToF nodes can also ascertain that the ToP nodes are each
at least one ToF node that can still reach the prefix, meaning that the connected to at least one ToF node that can still reach the prefix,
transitive operation is not required. meaning that the transitive operation is not required.
The bottom line is that in a fabric that is partitioned (e.g., using The bottom line is that in a fabric that is partitioned (e.g., using
multiple planes) and/or where the ToP nodes are not guaranteed to multiple planes) and/or where the ToP nodes are not guaranteed to
always form a ceiling for their children, it is mandatory to use the always form a ceiling for their children, it is mandatory to use
Negative Disaggregation. On the other hand, in a highly symmetrical Negative Disaggregation. On the other hand, in a highly symmetrical
and fully connected fabric, (e.g., a canonical Clos Network), the and fully connected fabric (e.g., a canonical Clos Network), the
Positive Disaggregation methods allows to save the complexity and Positive Disaggregation methods save the complexity and cost
cost associated with the ToF-spanning ring. associated with the ToF-spanning ring.
Note that in the case of Positive Disaggregation, the first ToF Note that in the case of Positive Disaggregation, the first ToF nodes
node(s) that announces a more-specific route attracts all the traffic that announce a more-specific route attract all the traffic for that
for that route and may suffer from a transient incast. A ToP node route and may suffer from a transient incast. A ToP node that defers
that defers injecting the longer prefix in the FIB, in order to injecting the longer prefix in the FIB, in order to receive more
receive more advertisements and spread the packets better, also keeps advertisements and spread the packets better, also keeps on sending a
on sending a portion of the traffic to the black hole in the portion of the traffic to the black hole in the meantime. In the
meantime. In the case of Negative Disaggregation, the last ToF case of Negative Disaggregation, the last ToF nodes that inject the
node(s) that injects the route may also incur an incast issue; this route may also incur an incast issue; this problem would occur if a
problem would occur if a prefix that becomes totally unreachable is prefix that becomes totally unreachable is disaggregated.
disaggregated.
5.8. Mobile Edge and Anycast 5.8. Mobile Edge and Anycast
When a physical or a virtual node changes its point of attachment in When a physical or a virtual node changes its point of attachment in
the fabric from a previous-leaf to a next-leaf, new routes must be the fabric from a previous-leaf to a next-leaf, new routes must be
installed that supersede the old ones. Since the flooding flows installed that supersede the old ones. Since the flooding flows
northwards, the nodes (if any) between the previous-leaf and the northwards, the nodes (if any) between the previous-leaf and the
common parent are not immediately aware that the path via previous- common parent are not immediately aware that the path via the
leaf is obsolete, and a stale route may exist for a while. The previous-leaf is obsolete and a stale route may exist for a while.
common parent needs to select the freshest route advertisement in The common parent needs to select the freshest route advertisement in
order to install the correct route via the next-leaf. This requires order to install the correct route via the next-leaf. This requires
that the fabric determines the sequence of the movements of the that the fabric determines the sequence of the movements of the
mobile node. mobile node.
On the one hand, a classical sequence counter provides a total order On the one hand, a classical sequence counter provides a total order
for a while but it will eventually wrap. On the other hand, a for a while, but it will eventually wrap. On the other hand, a
timestamp provides a permanent order but it may miss a movement that timestamp provides a permanent order, but it may miss a movement that
happens too quickly vs. the granularity of the timing information. happens too quickly vs. the granularity of the timing information.
It is not envisioned that an average fabric supports Precision Time It is not envisioned that an average fabric supports the Precision
Protocol [IEEEstd1588] in the short term, and the precision Time Protocol [IEEEstd1588] in the short term, and the precision
available with the Network Time Protocol [RFC5905] (in the order of available with the Network Time Protocol [RFC5905] (in the order of
100 to 200ms) may not necessarily be enough to cover, e.g., the fast 100 to 200 ms) may not necessarily be enough to cover, e.g., the fast
mobility of a Virtual Machine. mobility of a Virtual Machine (VM).
Section 6.8.4 "Mobility" of RIFT [RIFT] specifies a hybrid method Section 6.8.4 ("Mobility") of [RFC9692] specifies a hybrid method
that combines a sequence counter from the mobile node and a timestamp that combines a sequence counter from the mobile node and a timestamp
from the network taken at the leaf when the route is injected. If from the network taken at the leaf when the route is injected. If
the timestamps of the concurrent advertisements are comparable (i.e., the timestamps of the concurrent advertisements are comparable (i.e.,
more distant than the precision of the timing protocol), then the more distant than the precision of the timing protocol), then the
timestamp alone is used to determine the relative freshness of the timestamp alone is used to determine the relative freshness of the
routes. Otherwise, the sequence counter from the mobile node, if routes. Otherwise, the sequence counter from the mobile node is used
available, is used. One caveat is that the sequence counter must not if it is available. One caveat is that the sequence counter must not
wrap within the precision of the timing protocol. Another is that wrap within the precision of the timing protocol. Another is that
the mobile node may not even provide a sequence counter, in which the mobile node may not even provide a sequence counter; in which
case the mobility itself must be slower than the precision of the case, the mobility itself must be slower than the precision of the
timing. timing.
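The hybrid comparison described above can be sketched in a few lines of Python. The Advert type, the fresher function, and the millisecond units are assumptions made for this illustration and do not reflect the encoding used by RIFT. Sequence counter wrap is deliberately not handled, in line with the caveat that the counter must not wrap within the timing precision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Advert:
    leaf: str                  # leaf that injected the route
    timestamp_ms: int          # taken by the network at the leaf
    seq: Optional[int] = None  # from the mobile node, if it provides one


def fresher(a: Advert, b: Advert, precision_ms: int) -> Optional[Advert]:
    """Return the fresher advertisement, or None if it cannot be decided."""
    # Timestamps further apart than the timing precision are comparable,
    # so the timestamp alone decides.
    if abs(a.timestamp_ms - b.timestamp_ms) > precision_ms:
        return a if a.timestamp_ms > b.timestamp_ms else b
    # Otherwise fall back to the sequence counter, when both carry one.
    if a.seq is not None and b.seq is not None and a.seq != b.seq:
        return a if a.seq > b.seq else b
    return None  # too close in time and no usable sequence counter


old = Advert("previous-leaf", timestamp_ms=1_000, seq=7)
new = Advert("next-leaf", timestamp_ms=1_050, seq=8)
slow_move = fresher(old, Advert("next-leaf", 5_000), precision_ms=200)
fast_move = fresher(old, new, precision_ms=200)
undecided = fresher(Advert("previous-leaf", 1_000), Advert("next-leaf", 1_050), 200)
```

The undecided case corresponds to the second caveat: without a sequence counter, a movement faster than the timing precision cannot be ordered.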
Mobility must not be confused with anycast. In both cases, a same Mobility must not be confused with anycast. In both cases, the same
address is injected in RIFT at different leaves. In the case of address is injected in RIFT at different leaves. In the case of
mobility, only the freshest route must be conserved, since mobile mobility, only the freshest route must be conserved since the mobile
node changed its point of attachment from a leaf to the next. In the node changes its point of attachment from a leaf to the next. In the
case of anycast, the node may be either multihomed (attached to case of anycast, the node may either be multihomed (attached to
multiple leaves in parallel) or reachable beyond the fabric via multiple leaves in parallel) or reachable beyond the fabric via
multiple routes that are redistributed to different leaves; either multiple routes that are redistributed to different leaves. Either
way, in the case of anycast, the multiple routes are equally valid way, the multiple routes are equally valid and should be conserved in
and should be conserved. Without further information from the the case of anycast. Without further information from the
redistributed routing protocol, it is impossible to sort out a redistributed routing protocol, it is impossible to sort out a
movement from a redistribution that happens asynchronously on movement from a redistribution that happens asynchronously on
different leaves. RIFT [RIFT] expects that anycast addresses are different leaves. RIFT [RFC9692] expects that anycast addresses are
advertised within the timing precision, which is typically the case advertised within the timing precision, which is typically the case
with a low-precision timing and a multihomed node. Beyond that time with a low-precision timing and a multihomed node. Beyond that time
interval, RIFT interprets the lag as a mobility and only the freshest interval, RIFT interprets the lag as a mobility and only the freshest
route is retained. route is retained.
When using IPv6 [RFC8200], RIFT suggests to leverage [RFC8505] as the When using IPv6 [RFC8200], RIFT suggests leveraging 6LoWPAN ND
IPv6 ND interaction between the mobile node and the leaf. This [RFC8505] as the IPv6 ND interaction between the mobile node and the
provides not only a sequence counter but also a lifetime and a leaf. This not only provides a sequence counter but also a lifetime
security token that may be used to protect the ownership of an and a security token that may be used to protect the ownership of an
address [RFC8928]. When using [RFC8505], the parallel registration address [RFC8928]. When using 6LoWPAN ND [RFC8505], the parallel
of an anycast address to multiple leaves is done with the same registration of an anycast address to multiple leaves is done with
sequence counter, whereas the sequence counter is incremented when the same sequence counter, whereas the sequence counter is
the point of attachment changes. This way, it is possible to incremented when the point of attachment changes. This way, it is
differentiate a mobile node from a multihomed node, even when the possible to differentiate a mobile node from a multihomed node, even
mobility happens within the timing precision. It is also possible when the mobility happens within the timing precision. It is also
for a mobile node to be multihomed as well, e.g., to change only one possible for a mobile node to be multihomed as well, e.g., to change
of its points of attachment. only one of its points of attachment.
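One possible reading of the distinction between mobility and anycast is sketched below. The classify function and its decision order are assumptions made for illustration and are not taken from [RFC9692] or [RFC8505]: advertisements further apart than the timing precision are treated as mobility, and within the precision a differing sequence counter still indicates mobility, while an equal or absent counter is treated as anycast.

```python
from typing import Optional


def classify(ts_a_ms: int, seq_a: Optional[int],
             ts_b_ms: int, seq_b: Optional[int],
             precision_ms: int) -> str:
    """Classify two advertisements of the same address from different leaves."""
    # Beyond the timing precision, the lag is interpreted as a movement.
    if abs(ts_a_ms - ts_b_ms) > precision_ms:
        return "mobility"
    # Within the precision, an incremented sequence counter reveals a move.
    if seq_a is not None and seq_b is not None and seq_a != seq_b:
        return "mobility"
    # Same counter (parallel registration) or no counter: keep both routes.
    return "anycast"


multihomed = classify(1_000, 5, 1_020, 5, precision_ms=200)
fast_mobile = classify(1_000, 5, 1_020, 6, precision_ms=200)
slow_mobile = classify(1_000, None, 9_000, None, precision_ms=200)
```

With "mobility", only the freshest route would be retained; with "anycast", both routes would be conserved.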
5.9. IPv4 over IPv6 5.9. IPv4 over IPv6
RIFT allows advertising IPv4 prefixes over IPv6 RIFT network. IPv6 RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. An
Address Family (AF) configures via the usual Neighbor Discovery (ND) IPv6 Address Family (AF) configures via the usual ND mechanisms and
mechanisms and then V4 can use V6 next-hops analogous to [RFC8950]. then V4 can use V6 next-hops analogous to [RFC8950]. It is expected
It is expected that the whole fabric supports the same type of that the whole fabric supports the same type of forwarding of AFs on
forwarding of address families on all the links. RIFT provides an all the links. RIFT provides an indication whether a node is capable
indication whether a node is v4 forwarding capable and of V4-forwarding and implementations are possible where different
implementations are possible where different routing tables are routing tables are computed per AF as long as the computation remains
computed per address family as long as the computation remains loop- loop-free.
free.
+-----+ +-----+ +-----+ +-----+
+---+---+ | ToF | | ToF | +---+---+ | ToF | | ToF |
^ +--+--+ +-----+ ^ +--+--+ +-----+
| | | | | | | | | |
| | +-------------+ | | | +-------------+ |
| | +--------+ | | | | +--------+ | |
+ | | | | + | | | |
V6 +-----+ +-+---+ V6 +-----+ +-+---+
Forwarding |Spine| |Spine| Forwarding |Spine| |Spine|
+ +--+--+ +-----+ + +--+--+ +-----+
| | | | | | | | | |
| | +-------------+ | | | +-------------+ |
| | +--------+ | | | | +--------+ | |
| | | | | | | | | |
v +-----+ +-+---+ v +-----+ +-+---+
+---+---+ |Leaf | | Leaf| +---+---+ |Leaf | | Leaf|
+--+--+ +--+--+ +--+--+ +--+--+
| | | |
IPv4 prefixes| |IPv4 prefixes IPv4 prefixes| |IPv4 prefixes
| | | |
+---+----+ +---+----+ +---+----+ +---+----+
| V4 | | V4 | | V4 | | V4 |
| subnet | | subnet | | subnet | | subnet |
+--------+ +--------+ +--------+ +--------+
Figure 10: IPv4 over IPv6 Figure 10: IPv4 over IPv6
5.10. In-Band Reachability of Nodes 5.10. In-Band Reachability of Nodes
RIFT doesn't precondition that nodes of the fabric have reachable RIFT doesn't precondition that nodes of the fabric have reachable
addresses. But the operational reasons to reach the internal nodes addresses, but the operational reasons to reach the internal nodes
may exist. Figure 11 shows an example in which the network management may exist. Figure 11 shows an example in which the network management
station (NMS) attaches to leaf1. station (NMS) attaches to Leaf1.
+-------+ +-------+ +-------+ +-------+
| ToF1 | | ToF2 | | ToF1 | | ToF2 |
++---- ++ ++-----++ ++---- ++ ++-----++
| | | | | | | |
| +----------+ | | +----------+ |
| +--------+ | | | +--------+ | |
| | | | | | | |
++-----++ +--+---++ ++-----++ +--+---++
|Spine1 | |Spine2 | |Spine1 | |Spine2 |
skipping to change at page 27, line 32 skipping to change at line 1212
| | | | | | | |
| +----------+ | | +----------+ |
| +--------+ | | | +--------+ | |
| | | | | | | |
++-----++ +--+---++ ++-----++ +--+---++
| Leaf1 | | Leaf2 | | Leaf1 | | Leaf2 |
+---+---+ +-------+ +---+---+ +-------+
| |
|NMS |NMS
Figure 11: In-Band reachability of node Figure 11: In-Band Reachability of Nodes
If NMS wants to access Leaf2, it simply works. Because loopback If the NMS wants to access Leaf2, it simply works because the
address of Leaf2 is flooded in its Prefix North TIE. loopback address of Leaf2 is flooded in its Prefix North TIE.
If NMS wants to access Spine2, it simply works too. Because spine If the NMS wants to access Spine2, it also works because a spine node
node always advertises its loopback address in the Prefix North TIE. always advertises its loopback address in the Prefix North TIE. The
NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/
ToF2-Spine2. ToF2-Spine2.
If NMS wants to access ToF2, ToF2's loopback address needs to be If the NMS wants to access ToF2, ToF2's loopback address needs to be
injected into its Prefix South TIE. This TIE must be seen by all injected into its Prefix South TIE. This TIE must be seen by all
nodes at the level below - the spine nodes in Figure 11 – that must nodes at the level below -- the spine nodes in Figure 11 -- that must
form a ceiling for all the traffic coming from below (south). form a ceiling for all the traffic coming from below (south).
Otherwise, the traffic from NMS may follow the default route to the Otherwise, the traffic from the NMS may follow the default route to
wrong ToF Node, e.g., ToF1. the wrong ToF Node, e.g., ToF1.
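The effect of injecting ToF2's loopback into its Prefix South TIE can be illustrated with a longest-prefix-match lookup on a spine node. The following is only a sketch: the address 192.0.2.2, the FIB layout, and the function name lookup are assumptions made for illustration and are not taken from the RIFT specification.

```python
import ipaddress

def lookup(fib, destination):
    """Return the next hops of the longest prefix in fib that covers destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hops)
        for prefix, next_hops in fib.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return []
    # The most specific (longest) matching prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical loopback address of ToF2 (documentation range).
TOF2_LOOPBACK = "192.0.2.2"

# Spine FIB holding only the default route learned from both ToF nodes.
spine_fib_default_only = {"0.0.0.0/0": ["ToF1", "ToF2"]}

# Same FIB once ToF2's loopback is seen in its Prefix South TIE.
spine_fib_with_loopback = {
    "0.0.0.0/0": ["ToF1", "ToF2"],
    "192.0.2.2/32": ["ToF2"],
}

print(lookup(spine_fib_default_only, TOF2_LOOPBACK))
print(lookup(spine_fib_with_loopback, TOF2_LOOPBACK))
```

With only the default route, the spine may pick ToF1 for traffic destined to ToF2. With the more specific loopback route, the spine acts as the ceiling described above and forwards only to ToF2.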
In case of failure between ToF2 and spine nodes, ToF2's loopback In the case of failure between ToF2 and spine nodes, ToF2's loopback
address must be disaggregated recursively all the way to the leaves. address must be disaggregated recursively all the way to the leaves.
In a partitioned ToF, even with recursive disaggregation a ToF node In a partitioned ToF, even with recursive disaggregation, a ToF node
is only reachable within its plane. is only reachable within its plane.
A possible alternative to recursive disaggregation is to use a ring A possible alternative to recursive disaggregation is to use a ring
that interconnects the ToF nodes to transmit packets between them for that interconnects the ToF nodes to transmit packets between them for
their loopback addresses only. The idea is that this is mostly their loopback addresses only. The idea is that this is mostly
control traffic and should not alter the load balancing properties of control traffic and should not alter the load-balancing properties of
the fabric. the fabric.
5.11. Dual Homing Servers 5.11. Dual-Homing Servers
Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode. It Each RIFT node may operate in ZTP mode. It has no configuration
has no configuration (unless it is a Top-of-Fabric at the top of the (unless it is a ToF node at the top of the topology or if it must
topology or the must operate in the topology as leaf and/or support operate in the topology as a leaf and/or support leaf-2-leaf
leaf-2-leaf procedures) and it will fully configure itself after procedures), and it will fully configure itself after being attached
being attached to the topology. to the topology.
+---+ +---+ +---+ +---+ +---+ +---+
|ToF| |ToF| |ToF| ToF |ToF| |ToF| |ToF| ToF
+---+ +---+ +---+ +---+ +---+ +---+
| | | | | | | | | | | |
| +----------------+ | | | +----------------+ | |
| +----------------+ | | +----------------+ |
| | | | | | | | | | | |
+----------+--+ +--+----------+ +----------+--+ +--+----------+
| ToR1 | | ToR2 | Spine | ToR1 | | ToR2 | Spine
skipping to change at page 28, line 40 skipping to change at line 1269
| +-----------------+ | | | | +-----------------+ | | |
| | | +-------------+ | | | | | +-------------+ | |
| | | | | +-----------------+ | | | | | | +-----------------+ |
| | | | +--------------+ | | | | | | | +--------------+ | | |
| | | | | | | | | | | | | | | |
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
| | | | | | | | | | | | | | | |
+---+ +---+ ............. +---+ +---+ +---+ +---+ ............. +---+ +---+
SV(1) SV(2) SV(n-1) SV(n) Leaf SV(1) SV(2) SV(n-1) SV(n) Leaf
Figure 12: Dual-homing servers Figure 12: Dual-Homing Servers
Sometimes, people may prefer to disaggregate from ToR to servers from Sometimes people may prefer to disaggregate from ToR nodes to servers
start on, i.e. the servers have couple tens of routes in FIB from from startup, i.e., the servers have multiple routes in the FIB from
start on beside default routes to avoid breakages at rack level. startup other than default routes to avoid breakages at the rack
Full disaggregation of the fabric could be achieved by configuration level. Full disaggregation of the fabric could be achieved by
supported by RIFT. configuration supported by RIFT.
5.12. Fabric with A Controller 5.12. Fabric with a Controller
There are many different ways to deploy the controller. One There are many different ways to deploy the controller. One
possibility is attaching a controller to the RIFT domain from ToF and possibility is attaching a controller to the RIFT domain from ToF and
another possibility is attaching a controller from the leaf. another possibility is attaching a controller from the leaf.
+------------+ +------------+
| Controller | | Controller |
++----------++ ++----------++
| | | |
| | | |
skipping to change at page 29, line 28 skipping to change at line 1306
RIFT domain |Spine| |Spine| RIFT domain |Spine| |Spine|
+--+--+ +-----+ +--+--+ +-----+
| | | | | | | | | |
| | +-------------+ | | | +-------------+ |
| | +--------+ | | | | +--------+ | |
| | | | | | | | | |
| +-----+ +-+---+ | +-----+ +-+---+
------- |Leaf | | Leaf| ------- |Leaf | | Leaf|
+-----+ +-----+ +-----+ +-----+
Figure 13: Fabric with a controller Figure 13: Fabric with a Controller
5.12.1. Controller Attached to ToFs 5.12.1. Controller Attached to ToFs
If a controller is attaching to the RIFT domain from ToF, it usually If a controller is attaching to the RIFT domain from ToF, it usually
uses dual-homing connections. The loopback prefix of the controller uses dual-homing connections. The loopback prefix of the controller
should be advertised down by the ToF and spine to leaves. If the should be advertised down by the ToF and spine to the leaves. If the
controller loses link to ToF, make sure the ToF withdraw the prefix controller loses the link to ToF, make sure the ToF withdraws the
of the controller. prefix of the controller.
5.12.2. Controller Attached to Leaf 5.12.2. Controller Attached to Leaf
If the controller is attaching from a leaf to the fabric, no special If the controller is attaching from a leaf to the fabric, no special
provisions are needed. provisions are needed.
5.13. Internet Connectivity Within Underlay 5.13. Internet Connectivity Within Underlay
If global addressing is running without overlay, an external default If global addressing is running without overlay, an external default
route needs to be advertised through RIFT fabric to achieve internet route needs to be advertised through the RIFT fabric to achieve
connectivity. For the purpose of forwarding of the entire RIFT internet connectivity. For the purpose of forwarding of the entire
fabric, an internal fabric prefix needs to be advertised in the South RIFT fabric, an internal fabric prefix needs to be advertised in the
Prefix TIE by ToF and spine nodes. Prefix South TIE by ToF and spine nodes.
5.13.1. Internet Default on the Leaf 5.13.1. Internet Default on the Leaf
In case that the internet gateway is a leaf, the leaf node as the In the case that the internet gateway is a leaf, the leaf node as the
internet gateway needs to advertise a default route in its Prefix internet gateway needs to advertise a default route in its Prefix
North TIE. North TIE.
5.13.2. Internet Default on the ToFs 5.13.2. Internet Default on the ToFs
In case that the internet gateway is a ToF, the ToF and spine nodes In the case that the internet gateway is a ToF, the ToF and spine
need to advertise a default route in the Prefix South TIE. nodes need to advertise a default route in the Prefix South TIE.
5.14. Subnet Mismatch and Address Families 5.14. Subnet Mismatch and Address Families
+--------+ +--------+ +--------+ +--------+
| | LIE LIE | | | | LIE LIE | |
| A | +----> <----+ | B | | A | +----> <----+ | B |
| +---------------------+ | | +---------------------+ |
+--------+ +--------+ +--------+ +--------+
X/24 Y/24 X/24 Y/24
Figure 14: subnet mismatch Figure 14: Subnet Mismatch
LIEs are exchanged over all links running RIFT to perform Link LIEs are exchanged over all links running RIFT to perform Link
(Neighbor) Discovery. A node must NOT originate LIEs on an address (Neighbor) Discovery. A node must NOT originate LIEs on an AF if it
family if it does not process received LIEs on that family. LIEs on does not process received LIEs on that family. LIEs on the same link
same link are considered part of the same negotiation independent on are considered part of the same negotiation independent from the AF
the address family they arrive on. An implementation must be ready they arrive on. An implementation must be ready to accept TIEs on
to accept TIEs on all addresses it used as source of LIE frames. all addresses it used as the source of LIE frames.
As shown in the above figure, without further checks adjacency of As shown in Figure 14, an adjacency of nodes A and B may form without
node A and B may form, but the forwarding between node A and node B further checks, but the forwarding between nodes A and B may fail
may fail because subnet X mismatches with subnet Y. because subnet X mismatches with subnet Y.
To prevent this a RIFT implementation should check for subnet To prevent this, a RIFT implementation should check for subnet
mismatch just like e.g. IS-IS does. This can lead to scenarios mismatch in a way that is similar to how IS-IS does. This can lead
where an adjacency, despite exchange of LIEs in both address families to scenarios where an adjacency, despite the exchange of LIEs in both
may end up having an adjacency in a single AF only. This is a AFs, may end up having an adjacency in a single AF only. This is
consideration especially in Section 5.9 scenarios. especially a consideration in scenarios relating to Section 5.9.
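A subnet-mismatch check of the kind described above could look roughly like the following sketch. It is not taken from any RIFT implementation: the function names, the per-AF dictionary layout, and the example addresses (documentation ranges) are assumptions, and which interface addresses an implementation actually compares is left open here.

```python
import ipaddress

def same_subnet(local_if, remote_if):
    """True if both interface addresses are in the same AF and the same subnet."""
    a = ipaddress.ip_interface(local_if)
    b = ipaddress.ip_interface(remote_if)
    return a.version == b.version and a.network == b.network

def matching_address_families(local_ifs, remote_ifs):
    """Return the AF labels for which both ends agree on the subnet.

    local_ifs and remote_ifs map a hypothetical AF label ("IPv4", "IPv6")
    to an interface address in CIDR notation.
    """
    return [
        af
        for af in sorted(local_ifs)
        if af in remote_ifs and same_subnet(local_ifs[af], remote_ifs[af])
    ]

# Node A and node B from Figure 14: IPv4 subnets X and Y differ,
# while the IPv6 subnets agree.
node_a = {"IPv4": "192.0.2.1/24", "IPv6": "2001:db8:1::1/64"}
node_b = {"IPv4": "198.51.100.1/24", "IPv6": "2001:db8:1::2/64"}

print(matching_address_families(node_a, node_b))
```

In this example, LIEs may be exchanged in both AFs, but only IPv6 passes the check, which corresponds to the case of an adjacency in a single AF only.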
5.15. Anycast Considerations 5.15. Anycast Considerations
+ traffic + traffic
| |
v v
+------+------+ +------+------+
| ToF | | ToF |
+---+-----+---+ +---+-----+---+
| | | | | | | |
+------------+ | | +------------+ +------------+ | | +------------+
| | | | | | | |
+---+---+ +-------+ +-------+ +---+---+ +---+---+ +-------+ +-------+ +---+---+
skipping to change at page 31, line 32 skipping to change at line 1398
| | | | | | | | | | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-----+-+ +-+-----+ ++------+ +-----+-+ +-----+-+
+ + + ^ + + + + ^ +
PrefixA PrefixB PrefixA | PrefixC PrefixA PrefixB PrefixA | PrefixC
| |
+ traffic + traffic
Figure 15: Anycast Figure 15: Anycast
If the traffic comes from ToF to Leaf111 or Leaf121 which has anycast If the traffic comes from ToF to Leaf111 or Leaf121, which has
prefix PrefixA, RIFT can deal with this case well. But if the anycast prefix PrefixA, RIFT can deal with this case well. However,
traffic comes from Leaf122, it arrives Spine21 or Spine22 at level 1. if the traffic comes from Leaf122, it arrives to Spine21 or Spine22
But Spine21 or Spine22 doesn't know another PrefixA attaching at LEVEL 1. Additionally, Spine21 or Spine22 doesn't know another
Leaf111. So it will always get to Leaf121 and never get to Leaf111. PrefixA attaching Leaf111, so it will always get to Leaf121 and never
If the intension is that the traffic should be offloaded to Leaf111, Leaf111. If the intention is that the traffic should be offloaded to
then use policy guided prefixes defined in RIFT [RIFT]. Leaf111, then use the policy-guided prefixes defined in RIFT
[RFC9692].
5.16. IoT Applicability 5.16. IoT Applicability
The design of RIFT inherits from RPL [RFC6550] the anisotropic design The design of RIFT inherits the anisotropic design of a default route
of a default route upwards (northwards); it also inherits the upwards (northwards) from RPL [RFC6550]. It also inherits the
capability to inject external host routes at the Leaf level using capability to inject external host routes at the Leaf level using
Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host Wireless ND (WiND) [RFC8505] [RFC8928] between a RIFT-agnostic host
and a RIFT router. Both the RPL and the RIFT protocols are meant for and a RIFT router. Both the RPL and the RIFT protocols are meant for
large scale, and WiND enables device mobility at the edge the same a large scale, and WiND enables device mobility at the edge the same
way in both cases. way in both cases.
The main difference between RIFT and RPL is that with RPL, there’s a The main difference between RIFT and RPL is that there's a single
single Root, whereas RIFT has many ToF nodes. This adds huge root with RPL, whereas RIFT has many ToF nodes. This adds huge
capabilities for leaf-2-leaf ECMP paths, but additional complexity capabilities for leaf-2-leaf ECMP paths but additional complexity
with the need to disaggregate. Also RIFT uses Link State flooding with the need to disaggregate. Also, RIFT uses link-state flooding
northwards, and is not designed for low-power operation. northwards and is not designed for low-power operation.
Still nothing prevents that the IP devices connected at the Leaf are Still, nothing prevents that the IP devices connected at the Leaf are
IoT devices, which typically expose their address using WiND – which IoT devices, which typically expose their address using WiND -- this
is an upgrade from 6LoWPAN ND [RFC6775]. is an upgrade from 6LoWPAN ND [RFC6775].
A network that serves high speed/ high power IoT devices should A network that serves high speed / high power IoT devices should
typically provide deterministic capabilities for applications such as typically provide deterministic capabilities for applications such as
high speed control loops or movement detection. The Fat Tree is high speed control loops or movement detection. The Fat Tree is
highly reliable, and in normal condition provides an equivalent highly reliable and, in normal conditions, provides an equivalent
multipath operation; but the ECMP doesn’t provide hard guarantees for multipath operation; however, the ECMP doesn't provide hard
either delivery or latency. As long as the fabric is non-blocking guarantees for either delivery or latency. As long as the fabric is
the result is the same; but there can be load unbalances resulting in non-blocking, the result is the same, but there can be load
incast and possibly congestion loss that will prevent the delivery unbalances resulting in incast and possibly congestion loss that will
within bounded latency. prevent the delivery within bounded latency.
This could be alleviated with Packet Replication, Elimination and This could be alleviated with Packet Replication, Elimination, and
Reordering (PREOF) [RFC8655] leaf-2-leaf but PREOF is hard to provide Ordering Functions (PREOF) [RFC8655] leaf-2-leaf, but PREOF is hard
at the scale of all flows, and the replication may increase the to provide at the scale of all flows and the replication may increase
probability of the overload that it attempts to solve. the probability of the overload that it attempts to solve.
Note that the load balancing is not RIFTs problem, but it is key to Note that the load balancing is not RIFT's problem, but it is key to
serve IoT adequately. serve IoT adequately.
5.17. Key Management 5.17. Key Management
As outlined in Section 9 "Security Considerations" of RIFT [RIFT], As outlined in Section 9 ("Security Considerations") of [RFC9692],
either a private shared key or a public/private key pair is used to either a private shared key or a public/private key pair is used to
authenticate the adjacency. Both the key distribution and key authenticate the adjacency. Both the key distribution and key
synchronization methods are out of scope for this document. Both synchronization methods are out of scope for this document. Both
nodes in the adjacency must share the same keys, key type, and nodes in the adjacency must share the same keys, key type, and
algorithm for a given key ID. Mismatched keys will not inter-operate algorithm for a given key ID. Mismatched keys will not interoperate
as their security envelopes will be unverifiable. as their security envelopes will be unverifiable.
Key roll-over while the adjacency is active may be supported. The Key rollover while the adjacency is active may be supported. The
specific mechanism is well documented in [RFC6518]. As outlined in specific mechanism is well documented in [RFC6518]. As outlined in
Section 9.9 "Host Implementations" of RIFT [RIFT], hosts as well as 9.9 ("Host Implementations") of [RFC9692], hosts as well as VMs
VMs act as RIFT devices are possible. KMP such as KV for key roll- acting as RIFT devices are possible. Key Management Protocols
over in the fabric using a symmetric key that can be changed easily (KMPs), such as Key Value (KV) for key rollover in the fabric, use a
when compromised. Wherein symmetric key of a host is more likely to symmetric key that can be changed easily when compromised; in which
be compromised than of a in-fabric networking node. case, the symmetric key of a host is more likely to be compromised
than an in-fabric networking node.
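The requirement that both ends of an adjacency share the same key and algorithm for a given key ID can be sketched with a generic keyed hash. This is an illustration only: it does not reproduce the RIFT security envelope format, and the key table layout, the key IDs, and the function names sign and verify are assumptions.

```python
import hashlib
import hmac

def sign(key_table, key_id, payload):
    """Compute a fingerprint over payload with the key stored under key_id."""
    algorithm, secret = key_table[key_id]
    return hmac.new(secret, payload, algorithm).digest()

def verify(key_table, key_id, payload, fingerprint):
    """Check a received fingerprint; unknown key IDs or mismatched keys fail."""
    entry = key_table.get(key_id)
    if entry is None:
        return False
    algorithm, secret = entry
    expected = hmac.new(secret, payload, algorithm).digest()
    return hmac.compare_digest(expected, fingerprint)

# Hypothetical key tables: key ID -> (algorithm name, shared secret).
# During rollover, both the old (1) and the new (2) key ID are present.
node_a_keys = {1: ("sha256", b"old-secret"), 2: ("sha256", b"new-secret")}
node_b_keys = {1: ("sha256", b"old-secret"), 2: ("sha256", b"new-secret")}
stale_keys = {1: ("sha256", b"wrong-secret")}

payload = b"example LIE payload"
fingerprint = sign(node_a_keys, 2, payload)

print(verify(node_b_keys, 2, payload, fingerprint))  # same key and algorithm
print(verify(stale_keys, 2, payload, fingerprint))   # key ID 2 not configured
```

Keeping both key IDs configured on both nodes for the duration of a rollover lets either side switch to the new key without the adjacency becoming unverifiable.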
5.18. TTL/HopLimit of 1 vs. 255 on LIEs/TIEs 5.18. TTL/Hop Limit of 1 vs. 255 on LIEs/TIEs
The use of a packet's Time to Live (TTL) (IPv4) or Hop Limit (IPv6) The use of a packet's Time to Live (TTL) (IPv4) or Hop Limit (IPv6)
to verify whether the packet was originated by an adjacent node on a to verify whether the packet was originated by an adjacent node on a
connected link has been used in RIFT.RIFT explicitly requires the use connected link has been used in RIFT. RIFT explicitly requires the
of a TTL/HL value of 1 *or* 255 when sending/receiving LIEs and TIEs use of a TTL/HL value of 1 or 255 when sending/receiving LIEs and
so that implementers have a choice between the two. TIEs so that implementers have a choice between the two.
TTL=1 or HL=1 protects against the information disseminating more TTL=1 or HL=1 protects against the information disseminating more
than 1 hop in the fabric and should be the default unless configured than 1 hop in the fabric and should be the default unless configured
otherwise. TTL=255 or HL=255 can lead RIFT TIE packet propagation to otherwise. TTL=255 or HL=255 can lead RIFT TIE packet propagation to
more than one hop (multicast address is already local subnetwork more than one hop (the multicast address is already in local
range) in case of implementation problems but does protect against a subnetwork range) in case of implementation problems but does protect
remote attack as well, and the receiving remote router will ignore against a remote attack as well, and the receiving remote router will
such TIE packet unless the remote router is exactly 254 hops away and ignore such TIE packet unless the remote router is exactly 254 hops
accepts only TTL=1 or HL=1. [RFC5082] defines a Generalized TTL away and accepts only TTL=1 or HL=1. [RFC5082] defines a Generalized
Security Mechanism (GTSM). The GTSM is applicable to LIEs/TIEs TTL Security Mechanism (GTSM). The GTSM is applicable to LIE/TIE
implementations that use a TTL or HL of 255. It provides a defense implementations that use a TTL or HL of 255. It provides a defense
from infrastructure attacks based on forged protocol packets from from infrastructure attacks based on forged protocol packets from
outside the fabric. outside the fabric.
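The arithmetic behind the two choices can be made explicit with a small sketch. The function names and the hop-counting convention (one TTL/HL decrement per forwarding router, with the packet dropped once the value reaches zero) are assumptions made for this illustration.

```python
def ttl_on_arrival(sent_ttl, forwarding_hops):
    """TTL/HL seen by the receiver after forwarding_hops routers forwarded
    the packet, or None if the packet was dropped on the way."""
    remaining = sent_ttl - forwarding_hops
    return remaining if remaining > 0 else None

def accepted(received_ttl, expected_ttl):
    """A receiver that only accepts packets arriving with expected_ttl."""
    return received_ttl is not None and received_ttl == expected_ttl

# Adjacent node, sender uses 255: arrives unchanged, passes a GTSM-style check.
print(accepted(ttl_on_arrival(255, 0), 255))

# Sender uses 1: the packet cannot survive even one forwarding hop.
print(ttl_on_arrival(1, 1))

# Sender uses 255, packet leaks three hops: rejected by a receiver expecting 255.
print(accepted(ttl_on_arrival(255, 3), 255))

# Corner case from the text: after 254 decrements the value is 1, so a
# receiver that accepts only TTL=1 or HL=1 would take the packet.
print(accepted(ttl_on_arrival(255, 254), 1))
```

The last case is the only way a packet sent with 255 can be mistaken for a locally originated packet sent with 1, which is why the text treats it as a remote corner case rather than a practical concern.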
6. Security Considerations 6. Security Considerations
This document presents applicability of RIFT. As such, it does not This document presents applicability of RIFT. As such, it does not
introduce any security considerations. However, there are a number introduce any security considerations. However, there are a number
of security concerns at RIFT [RIFT]. of security concerns in [RFC9692].
7. IANA Considerations 7. IANA Considerations
This document has no IANA actions. This document has no IANA actions.
8. Acknowledgments 8. References
The authors would like to thank Jaroslaw Kowalczyk, Alvaro Retana,
Jim Guichard and Jeffrey Zhang for providing invaluable concepts and
content for this document.
9. Contributors
The following people (listed in alphabetical order) contributed
significantly to the content of this document and should be
considered co-authors:
Jordan Head
Juniper Networks
Email: jhead@juniper.net
Tom Verhaeg
Juniper Networks
Email: tverhaeg@juniper.net
10. Normative References 8.1. Normative References
[ISO10589-Second-Edition] [ISO10589-Second-Edition]
International Organization for Standardization, ISO/IEC, "Information technology - Telecommunications and
"Intermediate system to Intermediate system intra-domain information exchange between systems - Intermediate System
routing information exchange protocol for use in to Intermediate System intra-domain routeing information
conjunction with the protocol for providing the exchange protocol for use in conjunction with the protocol
connectionless-mode Network Service (ISO 8473)", November for providing the connectionless-mode network service (ISO
2002. 8473)", ISO/IEC 10589:2002, November 2002,
<https://www.iso.org/standard/30932.html>.
[TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central
Office Reference Architectural Framework", January 2018.
[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328,
DOI 10.17487/RFC2328, April 1998, DOI 10.17487/RFC2328, April 1998,
<https://www.rfc-editor.org/info/rfc2328>. <https://www.rfc-editor.org/info/rfc2328>.
[RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
"Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
DOI 10.17487/RFC4861, September 2007, DOI 10.17487/RFC4861, September 2007,
<https://www.rfc-editor.org/info/rfc4861>. <https://www.rfc-editor.org/info/rfc4861>.
skipping to change at page 35, line 35 skipping to change at line 1566
"Deterministic Networking Architecture", RFC 8655, "Deterministic Networking Architecture", RFC 8655,
DOI 10.17487/RFC8655, October 2019, DOI 10.17487/RFC8655, October 2019,
<https://www.rfc-editor.org/info/rfc8655>. <https://www.rfc-editor.org/info/rfc8655>.
[RFC8950] Litkowski, S., Agrawal, S., Ananthamurthy, K., and K. [RFC8950] Litkowski, S., Agrawal, S., Ananthamurthy, K., and K.
Patel, "Advertising IPv4 Network Layer Reachability Patel, "Advertising IPv4 Network Layer Reachability
Information (NLRI) with an IPv6 Next Hop", RFC 8950, Information (NLRI) with an IPv6 Next Hop", RFC 8950,
DOI 10.17487/RFC8950, November 2020, DOI 10.17487/RFC8950, November 2020,
<https://www.rfc-editor.org/info/rfc8950>. <https://www.rfc-editor.org/info/rfc8950>.
[RIFT] Przygienda, T., Head, J., Sharma, A., Thubert, P., [RFC9692] Przygienda, T., Ed., Head, J., Ed., Sharma, A., Thubert,
Rijsman, B., and D. Afanasiev, "RIFT: Routing in Fat P., Rijsman, B., and D. Afanasiev, "RIFT: Routing in Fat
Trees", Work in Progress, Internet-Draft, draft-ietf-rift- Trees", RFC 9692, DOI 10.17487/RFC9692, December 2024,
rift-24, 23 May 2024, <https://www.rfc-editor.org/info/rfc9692>.
<https://datatracker.ietf.org/doc/html/draft-ietf-rift-
rift-24>.
11. Informative References [TR-384] Broadband Forum Technical Report, "TR-384: Cloud Central
Office Reference Architectural Framework", TR-384, Issue
1, January 2018,
<https://www.broadband-forum.org/pdfs/tr-384-1-0-0.pdf>.
[IEEEstd1588] 8.2. Informative References
IEEE standard for Information Technology, "IEEE Standard
for a Precision Clock Synchronization Protocol for
Networked Measurement and Control Systems",
<https://standards.ieee.org/standard/1588-2019.html>.
[CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
Communication Environments", IEEE International Parallel & Communication Environments", 2011 IEEE International
Distributed Processing Symposium, 2011. Parallel & Distributed Processing Symposium,
DOI 10.1109/IPDPS.2011.27, May 2011,
<https://ieeexplore.ieee.org/document/6012836>.
[FATTREE] Leiserson, C. E., "Fat-Trees: Universal Networks for [FATTREE] Leiserson, C. E., "Fat-Trees: Universal Networks for
Hardware-Efficient Supercomputing", 1985. Hardware-Efficient Supercomputing", IEEE Transactions on
Computers, vol. C-34, no. 10, pp. 892-901,
DOI 10.1109/TC.1985.6312192, October 1985,
<https://ieeexplore.ieee.org/document/6312192>.
[PNNI] ATM Forum Technical Committee, "Private Network-Network [IEEEstd1588]
Interface Specification, Version 1.1 (PNNI 1.1), af-pnni- IEEE, "IEEE Standard for a Precision Clock Synchronization
0055.002", 2003. Protocol for Networked Measurement and Control Systems",
IEEE Std 1588-2019, DOI 10.1109/IEEESTD.2020.9120376, June
2020, <https://ieeexplore.ieee.org/document/9120376>.
[PNNI] The ATM Forum Technical Committee, "Private Network-
Network Interface - Specification Version 1.1 - (PNNI
1.1)", af-pnni-0055.001, April 2002,
<https://www.broadband-forum.org/download/af-pnni-
0055.001.pdf>.
[RFC3626] Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link [RFC3626] Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link
State Routing Protocol (OLSR)", RFC 3626, State Routing Protocol (OLSR)", RFC 3626,
DOI 10.17487/RFC3626, October 2003, DOI 10.17487/RFC3626, October 2003,
<https://www.rfc-editor.org/info/rfc3626>. <https://www.rfc-editor.org/info/rfc3626>.
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
Border Gateway Protocol 4 (BGP-4)", RFC 4271, Border Gateway Protocol 4 (BGP-4)", RFC 4271,
DOI 10.17487/RFC4271, January 2006, DOI 10.17487/RFC4271, January 2006,
<https://www.rfc-editor.org/info/rfc4271>. <https://www.rfc-editor.org/info/rfc4271>.
skipping to change at page 36, line 43 skipping to change at line 1633
Perkins, "Registration Extensions for IPv6 over Low-Power Perkins, "Registration Extensions for IPv6 over Low-Power
Wireless Personal Area Network (6LoWPAN) Neighbor Wireless Personal Area Network (6LoWPAN) Neighbor
Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018,
<https://www.rfc-editor.org/info/rfc8505>. <https://www.rfc-editor.org/info/rfc8505>.
[RFC8928] Thubert, P., Ed., Sarikaya, B., Sethi, M., and R. Struik, [RFC8928] Thubert, P., Ed., Sarikaya, B., Sethi, M., and R. Struik,
"Address-Protected Neighbor Discovery for Low-Power and "Address-Protected Neighbor Discovery for Low-Power and
Lossy Networks", RFC 8928, DOI 10.17487/RFC8928, November Lossy Networks", RFC 8928, DOI 10.17487/RFC8928, November
2020, <https://www.rfc-editor.org/info/rfc8928>. 2020, <https://www.rfc-editor.org/info/rfc8928>.
Acknowledgments
The authors would like to thank Jaroslaw Kowalczyk, Alvaro Retana,
Jim Guichard, and Jeffrey Zhang for providing invaluable concepts and
content for this document.
Contributors
The following people contributed substantially to the content of this
document and should be considered coauthors:
Jordan Head
Juniper Networks
Email: jhead@juniper.net
Tom Verhaeg
Juniper Networks
Email: tverhaeg@juniper.net
Authors' Addresses Authors' Addresses
Yuehua Wei (editor) Yuehua Wei (editor)
ZTE Corporation ZTE Corporation
No.50, Software Avenue No.50, Software Avenue
Nanjing Nanjing
210012 210012
China China
Email: wei.yuehua@zte.com.cn Email: wei.yuehua@zte.com.cn
Zheng Zhang
Zheng (Sandy) Zhang
ZTE Corporation ZTE Corporation
No.50, Software Avenue No.50, Software Avenue
Nanjing Nanjing
210012 210012
China China
Email: zhang.zheng@zte.com.cn Email: zhang.zheng@zte.com.cn
Dmitry Afanasiev Dmitry Afanasiev
Yandex Yandex
Email: fl0w@yandex-team.ru Email: fl0w@yandex-team.ru
Pascal Thubert Pascal Thubert
Cisco Systems, Inc Individual
Building D
45 Allee des Ormes - BP1200
06254 MOUGINS - Sophia Antipolis
France France
Phone: +33 497 23 26 34 Email: pascal.thubert@gmail.com
Email: pthubert@cisco.com
Tony Przygienda Tony Przygienda
Juniper Networks Juniper Networks
1194 N. Mathilda Ave 1194 N. Mathilda Ave
Sunnyvale, CA, 94089 Sunnyvale, CA 94089
United States of America United States of America
Email: prz@juniper.net Email: prz@juniper.net
 End of changes. 211 change blocks. 
682 lines changed or deleted 699 lines changed or added
