BFD Working Group L. Melegassi
Internet-Draft Catellix
Intended status: Standards Track 22 May 2026
Expires: 23 November 2026
Coherence-BFD: Sub-Second Coherence Detection
Using Bidirectional Forwarding Detection Patterns
draft-melegassi-coherence-bfd-00
Abstract
This document specifies Coherence-BFD, a protocol that combines
the asynchronous heartbeat, demand-mode, echo function, and
detection-multiplier mechanisms of Bidirectional Forwarding
Detection (BFD, [RFC5880]) with the multi-vantage path coherence
detection of [I-D.melegassi-mvps-incremental-be]. The result is
a sub-second coherence failure detector with theoretical and
empirical detection latency of 55 ms (1091x faster than the
60-second tick baseline of the underlying BE-MVPS framework).
Five execution variants are specified: V0 (baseline), V1
(heartbeat-fast), V2 (demand), V3 (echo), and V4 (hybrid). Wall-
clock benchmarks confirm V3 (Echo) as the latency-optimal variant
at 55 ms median tau_detect with 39 680 B/s bandwidth. A new lower
bound theorem (Theorem 9 of [I-D.melegassi-mvps-incremental-be])
shows that tau_detect >= M * T_tick + tau_RTT is tight; V3
achieves this bound exactly for M=1.
The protocol is designed for deployment alongside conventional BFD
sessions: a Coherence-BFD session monitors not the binary up/down
state of a forwarding path but the coherence state across N
vantage observers.
NOTE ON DATA PROVENANCE. All wall-clock detection-latency and
bandwidth numbers reported in this document are obtained from
synthetic simulations (scripts/benchmark_coherence_bfd.py) under
controlled conditions, not from operational deployment data.
HARDWARE CAVEAT (v5.0 unified proof, 2026-05-22). The 55 ms
median tau_detect figure of Section 5 is a SOFTWARE-HARNESS
measurement on a single host, not a router-class measurement.
In particular, it does NOT account for the data-plane
forwarding asymmetries, micro-bursts, or ASIC/NPU scheduling
delays that real BFD hardware exhibits on production
forwarders. Operators MUST treat the 55 ms figure as a
theoretical-bound demonstration on a coherence-grade
observation pipeline (vantages -> broker), not as a guaranteed
service level over arbitrary BFD-capable hardware.
Validation against real BFD hardware (commercial routers,
merchant-silicon ASICs, software forwarders such as VPP or
DPDK) is identified as required future work before progression
past Experimental status. This caveat is the principal
reason the document is targeted at Experimental.
A REFERENCE IMPLEMENTATION of the wire format defined in Section
4 (mandatory section + Experimental-range TLVs + HMAC-SHA256
authentication) is provided in pure Python at
. It
demonstrates end-to-end interop of 1 broker and 4 vantages with
480 packets in 8 seconds, zero HMAC failures, and correct
triggering of ALARM and Byzantine-event detection. See
reference-impl/README.md for usage.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on November 23, 2026.
Melegassi Expires November 23, 2026 [Page 1]
Internet-Draft Coherence-BFD Protocol May 2026
Copyright Notice
Copyright (c) 2026 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Revised BSD License text as described in Section 4.e of the
Trust Legal Provisions and are provided without warranty as described
in the Revised BSD License.
Table of Contents
1. Introduction ....................................................2
1.1. The latency gap ...........................................3
1.2. Relationship to BFD .......................................3
1.3. Conventions used in this document .........................4
2. Protocol Overview .............................................4
3. Session State Machine .........................................5
4. Control Packet Format .........................................6
4.1. Mandatory section .........................................6
4.2. Coherence TLV ............................................7
5. Echo Function for Coherence ...................................8
6. Demand Mode and Polling .......................................9
7. Detection Multiplier and Confirmation .........................9
8. Negotiated Intervals .........................................10
9. Five Reference Variants ......................................10
10. Empirical Results (Wall-Clock Measurements) ..................11
11. Lower Bound Achievement ......................................12
12. Security Considerations ......................................12
13. IANA Considerations ..........................................13
14. References ...................................................13
15. Packet Sizing, MTU, and Network Stack Tuning ...............14
15.1. Packet size budget (all packet types) ..................14
15.2. MTU and fragmentation .................................15
15.3. PPS regimes and OS tuning requirements ................15
15.4. Recommended sysctl, ethtool, and queue settings ........16
15.5. NUMA and CPU isolation for the broker ................17
16. Privacy Considerations .......................................18
17. Manageability Considerations .................................19
Acknowledgements ..................................................20
Author's Address .................................................20
1. Introduction
Bidirectional Forwarding Detection [RFC5880] provides sub-second
failure detection between two endpoints of a forwarding path. Its
key mechanisms are:
o Asynchronous mode: each endpoint emits Hello packets at a
negotiated interval T_tx (typically 16-33 ms).
o Detection multiplier M: a session is declared Down only after
M consecutive missed Hello packets.
o Demand mode: Hello packets can be suspended when not needed.
o Echo function: a packet sent by one endpoint, looped back by
the other without inspection, used for path verification.
The Multi-Vantage Path Synchrony framework
[I-D.melegassi-ippm-mvps-bundle] performs anomaly detection across
Melegassi Expires November 23, 2026 [Page 2]
Internet-Draft Coherence-BFD Protocol May 2026
N vantage observers using the Mahalanobis distance D^2 over a
three-axis coherence vector (C_1, C_2, C_3). Its baseline tick
period is 60 s, suitable for path-coherence anomalies but
unsuitable for sub-second failover.
1.1. The latency gap
The two frameworks operate on different time scales:
BFD typical: 50 ms detection latency
BE-MVPS baseline: 60 000 ms detection latency
Gap: 1200x
This document closes the gap by adapting BFD mechanisms to drive
the BE-MVPS detector. Wall-clock measurements (Section 10) show
that the resulting Coherence-BFD protocol achieves 55 ms median
detection latency, within 10% of the BFD baseline.
1.2. Relationship to BFD
Coherence-BFD differs from BFD in three respects:
1. The monitored state is not binary (up/down) but a coherence
distance D^2 in R+. A WATCH threshold and an ALARM threshold
are defined per session.
2. The session is N-to-1: N vantage observers report to a single
broker, which computes D^2 and disseminates session state.
Conventional BFD is 1-to-1.
3. The Echo packet does not measure RTT; it carries a hash of
the cell-aggregated coherence sketch and immediately fires
an alarm if the aggregate has drifted above threshold/2 in
transit.
Apart from these differences, the wire format, state machine
transitions, and timer negotiation procedures of BFD are
preserved. Implementations MAY share code with conventional BFD
stacks.
1.3. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 [RFC2119] [RFC8174] when, and only when,
they appear in all capitals.
2. Protocol Overview
Melegassi Expires November 23, 2026 [Page 3]
Internet-Draft Coherence-BFD Protocol May 2026
A Coherence-BFD session consists of:
o one Broker process,
o N >= 2 Vantage processes,
o optionally, k <= N Cell-Coordinator processes that aggregate
pushes from disjoint subsets of vantages.
The session transitions through five states:
AdminDown -> Down -> Init -> WATCH -> ALARM
AdminDown is the operator-disabled state. Down indicates no
session has been established. Init indicates that vantages are
sending heartbeats but the broker has not yet received enough to
compute D^2. WATCH indicates D^2 has crossed chi^2_{d, 0.95}.
ALARM indicates D^2 has crossed chi^2_{d, 0.99}.
The Detection Multiplier M controls the number of consecutive
above-threshold observations required for state transition.
3. Session State Machine
The five-state machine extends BFD's three-state machine
(AdminDown / Down / Init+Up).
+---------+ +------+ +------+
|AdminDown|<------->| Down |------->| Init |
+---------+ +------+ +------+
|
v
+-------+ +-------+
| ALARM |<-------| WATCH |
+-------+ +-------+
| ^
+-------+-------+
v
(heartbeat
sustained)
State transitions:
Down -> Init: broker has received heartbeats from >= 2
vantages within T_negotiated_tx.
Init -> WATCH: D^2 > chi^2_{d, 0.95} for M consecutive ticks.
WATCH -> ALARM: D^2 > chi^2_{d, 0.99} for M consecutive ticks.
ALARM -> WATCH: D^2 < chi^2_{d, 0.95} for M consecutive ticks.
Melegassi Expires November 23, 2026 [Page 4]
Internet-Draft Coherence-BFD Protocol May 2026
WATCH -> Init: no above-threshold observation in 2M ticks.
any -> AdminDown: operator action.
4. Control Packet Format
The control packet format is a superset of BFD's mandatory
section ([RFC5880] Section 4.1).
4.1. Mandatory section
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Vers | Diag |Sta|P|F|C|A|D|M| Detect Mult | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| My Discriminator |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Your Discriminator |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Desired Min TX Interval |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Required Min RX Interval |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Required Min Echo RX Interval |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| D^2 (32-bit float) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Sta: 2-bit state field, encoded:
00 - AdminDown
01 - Init
10 - WATCH
11 - ALARM
New flag:
C (bit 5): Coherence flag. Set if this packet contains a
D^2 value. When clear, the D^2 field MUST be
transmitted as zero.
The D^2 field is appended immediately after the standard BFD
mandatory section. Length increases by 4 octets.
Backwards compatibility with [RFC5880] BFD:
o A receiver conformant to [RFC5880] but not implementing
this document MUST honour the Length field and silently
skip the 4-octet D^2 field as opaque trailing data.
o A receiver that does not recognise the 'C' flag MUST
treat the D^2 field as zero and behave per [RFC5880].
o A sender supporting this document but interacting with
an [RFC5880]-only receiver MUST clear the 'C' flag and
MUST omit the D^2 field; the Length field MUST reflect
the original [RFC5880] mandatory-section length.
o Capability discovery proceeds via Type 0x00
(Version-Negotiation) TLV exchanged during session
initialisation.
4.2. Coherence TLV
Optional sections following the mandatory section. Each TLV is
:
Melegassi Expires November 23, 2026 [Page 5]
Internet-Draft Coherence-BFD Protocol May 2026
Type Name Length Description
---- --------------------------- ------ --------------------
0x00 Version-Negotiation 3 uint16 supported set
0xE0 Vantage-Sketch 1+d*4 d float values
0xE1 Cell-Centroid 1+d*4 d float values
0xE2 Echo-Hash 34 SHA-256 of cell agg
0xE3 Watch-Threshold 5 float chi^2_{d,0.95}
0xE4 Alarm-Threshold 5 float chi^2_{d,0.99}
0xE5 Vantage-Count-N 5 uint32 N
0xE6 Cell-Count-k 5 uint32 k
0xE7 Phase-Label 2 uint8 phase code
0xE8 Byzantine-Suspect 6 uint32 cell ID + score
0xE9 AuthHMAC-SHA256 34 authentication
All TLV type codes used in this document fall in the Experimental
range 0xE0-0xEF. Early Allocation [RFC7120] within the IETF BFD
Registry will be requested upon Working Group adoption. Until
such time, only the Experimental code points above are valid for
interoperable implementation.
Type 0x00 (Version-Negotiation) is RESERVED. It MUST appear at
most once per packet. Its 2-octet value field carries a bitmap
of supported Coherence-BFD profile versions; bit n set indicates
profile version n is supported. This document defines version 0
only. Receivers encountering an unknown TLV Type MUST skip the
TLV using its Length field, MUST NOT discard the packet, and
MUST NOT signal an error.
When the Echo-Hash TLV is present, the receiver MUST recompute
the SHA-256 hash of the cell-aggregated centroid using its
currently-cached value and compare with the value in the TLV.
A mismatch indicates in-transit corruption or Byzantine
modification; the session SHOULD transition to ALARM
immediately, bypassing the M-multiplier requirement.
5. Echo Function for Coherence
The conventional BFD echo function ([RFC5880] Section 6.4)
measures forwarding-plane RTT and verifies that packets sent by
the local endpoint are looped back unmodified by the remote
endpoint.
The Coherence-BFD echo function carries an additional payload:
o Echo-Hash TLV (Section 4.2): SHA-256 of the cell-aggregated
centroid as observed at the broker at echo transmission time.
o Phase-Label TLV: the broker's current Phi_K classification.
When the echo packet returns to the broker, the broker MUST
verify that:
o The Echo-Hash TLV value is unchanged.
o The Phase-Label TLV value is unchanged.
o The total RTT does not exceed Required Min Echo RX Interval *
Detection Multiplier (early-warning timer).
A failure of any check transitions the session to ALARM with
diagnostic code 0x07 (Echo Function Failed).
The Echo function MAY be performed at sub-tick intervals (e.g.,
every 25 ms even when T_negotiated_tx = 50 ms). This is
responsible for the empirical 55 ms median tau_detect (Section
10).
6. Demand Mode and Polling
In Demand mode, vantages do not transmit heartbeats unless
explicitly polled by the broker. The broker sends a Poll
packet (F flag set) when:
o D^2 exceeds 0.7 * Watch-Threshold (suspicion threshold), OR
o The broker has not received a heartbeat for the current
Detection Time, indicating possible network partition.
The polled vantage MUST respond within Required Min RX Interval
with a Final packet (F flag set) containing the current Vantage-
Sketch TLV.
Demand mode trades latency for bandwidth: in BAU, no heartbeats
are sent (bandwidth approaches zero per vantage), but the first-
detection latency increases to one RTT plus T_negotiated_tx.
7. Detection Multiplier and Confirmation
The Detection Multiplier M (default 3) controls the number of
consecutive above-threshold observations required for state
transition. Operators MUST choose M to balance:
o False-positive rate: FPR = Pr[D^2 > threshold | BAU]^M. For
Gaussian BAU and 95th-percentile threshold, FPR <= 0.05^M;
at M = 3, FPR <= 1.25 * 10^-4.
o Detection latency: tau_detect >= M * T_negotiated_tx +
tau_RTT (Theorem 9 of [I-D.melegassi-mvps-incremental-be]).
Echo Function alarms (Section 5) bypass the M-multiplier: a
single echo-hash mismatch SHOULD trigger immediate ALARM.
8. Negotiated Intervals
Each endpoint advertises its Desired Min TX Interval and
Required Min RX Interval in the mandatory section (Section 4.1).
The negotiated T_tx is
T_negotiated_tx = max(local Desired Min TX Interval,
remote Required Min RX Interval).
Negotiation occurs at session establishment and MAY be renegotiated
after Init -> WATCH transition, allowing operators to dynamically
increase fidelity during anomalous periods.
9. Five Reference Variants
Implementations MAY default to any of the variants below. The
benchmark of Section 10 measures all five.
V0 Baseline: T_tx = 60 000 ms, M = 1, no echo.
Maps to the BE-MVPS baseline of
[I-D.melegassi-mvps-incremental-be].
V1 Heartbeat-Fast: T_tx = 50 ms, M = 3, no echo.
Continuous heartbeat; BAU bandwidth high.
V2 Demand: T_tx = 1 000 ms, M = 1, demand mode.
BAU bandwidth near zero; latency 1 s.
V3 Echo: T_tx = 50 ms, M = 1, echo every 2nd tick.
Empirically optimal (55 ms).
V4 Hybrid: T_tx = 50 ms, M = 3, push + echo + demand.
Highest robustness, comparable latency.
10. Empirical Results (Wall-Clock Measurements)
Reference benchmark: scripts/benchmark_coherence_bfd.py in the
Catellix research repository. N = 1000 vantages, 50 trials per
variant, calibrated coherence shock producing D^2 ~ 30 post-shock.
Variant tau_detect FPR Bandwidth
(median ms) (per 10^4) (B/s)
------------------------- ---------- -------- --------
V0 Baseline 60 005 0 32
V1 Heartbeat-Fast 155 0 118 400
V2 Demand 1 005 0 4 000
V3 Echo 55 0 39 680
V4 Hybrid 155 0 39 680
V3 (Echo) achieves a 1091x latency reduction over V0 baseline at
a 1240x bandwidth cost. The latency-bandwidth tradeoff is near-
linear; operators may select any variant matching their service
level requirements.
Compute cost per tick is sub-microsecond (3.8-4.1 us) for all
variants on commodity x86 hardware (single core), which is well
below the network RTT. Compute is therefore not the bottleneck.
11. Lower Bound Achievement
By Theorem 9 of [I-D.melegassi-mvps-incremental-be], the
minimum achievable detection latency for any variant with
Detection Multiplier M, tick period T_tick, and end-to-end RTT
tau_RTT is
tau_detect >= max( M * T_tick + tau_RTT, tau_C4 ).
Variant V3 with T_tick = 50 ms, M = 1, tau_RTT = 5 ms:
tau_detect_min = 1 * 50 + 5 = 55 ms.
Empirical measurement: 55 ms median. Bound is tight.
No Coherence-BFD variant can achieve faster detection without
reducing T_tick further, which costs bandwidth linearly.
Implementations targeting sub-50 ms detection MUST adopt
T_tick < 50 ms and accept the corresponding bandwidth cost.
12. Security Considerations
o Echo packets carrying Coherence TLVs are authentication targets.
Operators MUST authenticate Echo and Control packets via the
AuthHMAC-SHA256 TLV (Section 4.2, type 0xE9) to prevent
Byzantine modification of in-transit aggregates.
o Demand mode reduces bandwidth but exposes the protocol to
DoS-by-poll-flood from a malicious broker. Implementations
MUST rate-limit Poll responses to one per Required Min RX
Interval.
o Reducing T_tx below 50 ms allows finer detection but increases
bandwidth linearly. At T_tx = 1 ms, an N = 10 000 deployment
transmits ~5 GB/s in aggregate, which is impractical for
software brokers and requires P4-class data-plane offload
([MVPS-DATAPLANE-PROFILE]).
o The five-state machine adds two states (WATCH, ALARM) beyond
BFD's three (AdminDown/Down/Up). Implementations sharing code
with conventional BFD stacks MUST ensure the additional states
cannot be confused with Up; conventional consumers of BFD state
treating WATCH or ALARM as Up will produce silent failure.
o Transport security between vantages, cell coordinators, and
the broker. AuthHMAC-SHA256 TLV provides integrity but NOT
confidentiality. When the control plane crosses any segment
that is not fully under operator control (cross-AS,
cross-organisation, multi-tenant cloud underlay), the
implementation MUST encapsulate Coherence-BFD packets in DTLS
1.3 [RFC9147] or TLS 1.3 [RFC8446] following the
recommendations of BCP 195 [RFC9325]. The TLV format and
mandatory section are unchanged; only the transport layer
below UDP is replaced. Cipher-suite selection MUST follow
BCP 195 Section 4.
o Long-term key management for the AuthHMAC-SHA256 TLV. Keys
SHOULD be rotated at least every 30 days and MUST be rotated
whenever a vantage is decommissioned or a cell-coordinator
is re-elected.
12.1. DDoS resilience (the framework as detector, not victim)
A frequent misconception is that a high-rate volumetric DDoS
against the monitored infrastructure would saturate Coherence-BFD
itself. This is incorrect when the deployment respects the
following architectural invariants:
I1. Vantages and the broker operate on a SEPARATE control
plane (out-of-band management VLAN, dedicated NIC, or
SDN underlay). User traffic and MVPS telemetry MUST NOT
share the same NIC queues on the broker.
I2. Vantages OBSERVE the data plane (latency, jitter, loss
samples of user traffic) but do not forward user packets.
A vantage is a probe, not a middlebox.
I3. The broker dimensions its NIC for the legitimate
telemetry PPS only (Section 15.3), independent of user-
traffic volume.
When I1-I3 hold, the DDoS does the opposite of what the operator
fears: it produces an instantly observable, geographically
localised deformation of the coherence surface, which the
M-multiplier confirms within (M-1)*T_tick after onset.
Empirical validation (scripts/simulate_ddos_resilience.py,
summary in docs/SIM_DDOS_RESULTS.txt):
Topology : 10 000 vantages, 8 regions, T_tick = 50 ms
Attack : 10 Mpps volumetric DDoS on region 3
(1 250 vantages affected)
Coherence shock : cell-wise D^2 jumps from O(1) to >300
Detection latency: 100 ms after onset (M = 3, T_tick = 50 ms)
Attribution : R_cross localises to region 3 with
100% argmax accuracy across 275 windows
Broker health : 99% availability (single-broker, Regime C
tuned per Section 15)
Other regions : remain in BAU; D^2 < 5 throughout the attack
The detection latency 100 ms equals (M-1)*T_tick = 2*50 ms,
matching the lower bound of Theorem 9 within the slack permitted
by the M-multiplier confirmation count.
12.2. When the framework IS at risk
The honest negative results:
o If invariants I1 or I2 are violated (telemetry shares the
user-traffic data path), the broker's NIC saturates with
attack traffic and the framework degrades to default-deny.
This is a deployment defect, not a protocol defect.
o Byzantine breakdown: an attacker controlling more than
floor((k-1)/2) of the k cells can move the geometric
median and minimax aggregator arbitrarily (Theorem 7 of
[I-D.melegassi-mvps-incremental-be]). For k = 8 cells,
the breakdown bound is 3 compromised cells.
o Broker NIC at Regime D (>1 Mpps telemetry, e.g. N = 100k
vantages at T_tick = 50 ms) without AF_XDP/DPDK: the
kernel stack drops the telemetry itself, producing false
ALARM transitions on uncompromised vantages.
o Replay of historical Coherence TLVs: mitigated by the BFD
sequence numbers in the mandatory section, but requires
strictly monotonic implementation; rolling counters MUST
NOT wrap within the M*T_tick window.
These are bounded, documented failure modes. They are
substantially narrower than the failure modes of conventional
threshold-based alerting under DDoS, which produces silent
degradation across the entire alert pipeline.
13. IANA Considerations
This document requests the following IANA actions in the
Coherence-BFD Registry (new, created by this document):
1. New protocol code point for the C flag (Section 4.1).
2. Early Allocation [RFC7120] of TLV type codes 0xE0 through
0xE9 as defined in Section 4.2, plus reserved code 0x00
for version negotiation.
3. New diagnostic code 0x07 (Echo Function Failed) in the BFD
Diagnostic Registry, if shared with conventional BFD.
4. New phase code points (0x00 AdminDown, 0x01 Down, 0x02 Init,
0x03 WATCH, 0x04 ALARM) in the Coherence-BFD Phase Registry.
14. References
14.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997.
[RFC5880] Katz, D. and Ward, D., "Bidirectional Forwarding
Detection (BFD)", RFC 5880, DOI 10.17487/RFC5880,
June 2010.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in
RFC 2119 Key Words", BCP 14, RFC 8174,
DOI 10.17487/RFC8174, May 2017.
[I-D.melegassi-mvps-incremental-be]
Melegassi, L., "Incremental Bandwidth-Efficient
Multi-Vantage Path Synchrony (BE-MVPS):
Cell-Partitioned Coherence with epsilon-Gated
Sherman-Morrison Updates", Work in Progress,
Internet-Draft,
draft-melegassi-mvps-incremental-be-00, May 2026.
[RFC5706] Harrington, D., "Guidelines for Considering
Operations and Management of New Protocols and
Protocol Extensions", RFC 5706,
DOI 10.17487/RFC5706, November 2009.
[RFC6973] Cooper, A., Tschofenig, H., Aboba, B., Peterson, J.,
Morris, J., Hansen, M., and R. Smith, "Privacy
Considerations for Internet Protocols", RFC 6973,
DOI 10.17487/RFC6973, July 2013.
[RFC7120] Cotton, M., "Early IANA Allocation of Standards
Track Code Points", BCP 100, RFC 7120,
DOI 10.17487/RFC7120, January 2014.
[RFC8446] Rescorla, E., "The Transport Layer Security (TLS)
Protocol Version 1.3", RFC 8446,
DOI 10.17487/RFC8446, August 2018.
[RFC9127] Jethanandani, M., Patel, K., Pallagatti, S., and G.
Mirsky, "YANG Data Model for Bidirectional
Forwarding Detection (BFD)", RFC 9127,
DOI 10.17487/RFC9127, October 2021.
[RFC9147] Rescorla, E., Tschofenig, H., and N. Modadugu, "The
Datagram Transport Layer Security (DTLS) Protocol
Version 1.3", RFC 9147, DOI 10.17487/RFC9147,
April 2022.
[RFC9325] Sheffer, Y., Saint-Andre, P., and T. Fossati,
"Recommendations for Secure Use of Transport Layer
Security (TLS) and Datagram Transport Layer
Security (DTLS)", BCP 195, RFC 9325,
DOI 10.17487/RFC9325, November 2022.
14.2. Informative References
[I-D.melegassi-ippm-mvps-bundle]
Melegassi, L., "Multi-Vantage Path Synchrony Bundle",
Work in Progress, Internet-Draft,
draft-melegassi-ippm-mvps-bundle-00, May 2026.
[I-D.melegassi-mvps-ai-coherence]
Melegassi, L., "MVPS AI-Coherence Extensions",
Work in Progress, Internet-Draft,
draft-melegassi-mvps-ai-coherence-00, May 2026.
[MVPS-DATAPLANE-PROFILE]
Melegassi, L., "MVPS Dataplane Profile",
https://catellix.com/static/download/
MVPS_DATAPLANE_PROFILE.txt, 2026.
15. Packet Sizing, MTU, and Network Stack Tuning
The protocol is useless on paper if its packets fragment, its
broker's NIC saturates, or its IRQ handler stalls. This section
addresses three operational concerns that are easy to overlook in
the design phase: packet size budget, MTU constraints, and
PPS-driven OS tuning thresholds.
15.1. Packet size budget (all packet types)
Computed byte-by-byte for IPv4 transport (add +20 octets for
IPv6). All Coherence-BFD packets fit comfortably within standard
Ethernet MTU 1500.
Packet Composition Total
----------------------- ---------------------------- -----
Vantage heartbeat UDP(8) + IP(20) + BFD(24)
+ hash(4) 56 B
Vantage push UDP(8) + IP(20) + BFD(24)
+ D^2(4) + Sketch TLV(26)
+ HMAC TLV(34) 116 B
Echo packet UDP(8) + IP(20) + BFD(24)
+ Echo-Hash TLV(34)
+ Phase-Label(2) + HMAC(34) 122 B
Demand Poll / Final UDP(8) + IP(20) + BFD(24)
+ D^2(4) + Sketch(26) 82 B
Cell-Coord -> Broker UDP(8) + IP(20)
+ k * (id(4)+sketch(26))
+ HMAC(34) 82 + 30k
(k=10: 382 B)
Broker -> Subscriber UDP(8) + IP(20) + BFD(24)
+ D^2(4) + Phase(2) 58 B
All single packets are below 500 octets at k <= 14; below 1500
at k <= 47. Cells SHOULD be sized k <= 100 per coordinator,
producing aggregate packets up to 3082 octets, which exceeds
MTU 1500. In that regime, the Cell-Coord MUST either:
(a) split its centroid report into multiple sub-packets, or
(b) use Jumbo frames (MTU 9000) if the underlying L2 supports
them.
At Jumbo-MTU 9000, a single Cell-Coord packet can carry up to
~300 cells.
15.2. MTU and fragmentation
Implementations MUST set IP DF=1 (don't fragment) on all
Coherence-BFD packets. An ICMP Fragmentation Needed response
indicates an undersized path MTU and MUST trigger:
o fallback to MTU 1500 (down from Jumbo), or
o cell-split per (a) above.
The "Path MTU Black Hole" pathology described in [RFC4821] is
particularly damaging here because Coherence-BFD operates with
M-multiplier consecutive observations; silently dropped packets
manifest as false ALARM transitions. Implementations SHOULD
perform PLPMTUD ([RFC4821]) at session establishment and on
any AdminDown -> Down transition.
For the MVPS bundle envelope of [I-D.melegassi-ippm-mvps-bundle],
path snapshots of N >= 30 hops with full ICMP+TTL+timestamp
metadata commonly exceed 1500 octets. Coherence-BFD does not
carry bundles; bundles are exchanged out-of-band over TCP or
chunked over a different control channel. Bundle MTU concerns
are out of scope for this document.
15.3. PPS regimes and OS tuning requirements
The broker process receives one Vantage packet per tick per
vantage. Aggregate packets-per-second (PPS) at the broker:
PPS = N / T_tick_seconds.
The OS network stack has four well-known performance regimes
depending on PPS, summarised below. Operators MUST select OS
tuning matching the target regime; failure to do so causes IRQ
storm, RX queue overflow, and silent packet drop -- which the
Coherence-BFD M-multiplier interprets as anomaly.
Regime Target PPS Tuning required
------ ---------------- -----------------------------------
A <= 10 000 Default kernel suffices.
Single RX queue acceptable.
B 10 000 - 100 000 ethtool coalescing tuned;
RSS enabled with N_queues = N_cores;
irqbalance daemon active.
C 100 000 - 1 M irqbalance disabled, manual IRQ
affinity per RX queue;
SO_BUSY_POLL enabled;
NAPI weight raised;
per-queue RFS/aRFS.
D > 1 M AF_XDP or DPDK mandatory;
kernel network stack bypassed;
broker compiled with native zero-copy.
Operational examples for typical deployments:
Deployment N T_tick PPS Regime
-------------------------- ------ ------ -------- ------
Single rack monitor 100 50 ms 2 000 A
Single-DC monitor 1 000 50 ms 20 000 B
Multi-DC operator 10 000 50 ms 200 000 C
HFT / sub-second target 10 000 5 ms 2 000 000 D
Hyperscaler (full mesh) 100 000 50 ms 2 000 000 D
Implementers targeting Regime C or D MUST consult the data-plane
profile [MVPS-DATAPLANE-PROFILE] for hardware-accelerated
reference designs.
15.4. Recommended sysctl, ethtool, and queue settings
The following are minimum recommended settings for a broker
running in Regime B or C on a Linux 5.10+ host. Operators MAY
relax for Regime A or tighten for Regime D.
o ethtool RX/TX queue sizing:
ethtool -G rx 4096 tx 4096
o ethtool coalescing (RX side, reduce IRQ rate):
ethtool -C adaptive-rx on rx-usecs 50 rx-frames 64
For Regime C, set adaptive off and tune manually:
ethtool -C adaptive-rx off rx-usecs 10 rx-frames 16
o Enable RSS hash on UDP source port (for spreading vantages
across queues):
ethtool -N rx-flow-hash udp4 sdfn
o Enable RPS / RFS for software hashing on single-queue NICs:
echo ffff > /sys/class/net//queues/rx-0/rps_cpus
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 4096 > /sys/class/net//queues/rx-0/rps_flow_cnt
o Increase socket receive buffer:
sysctl -w net.core.rmem_default=33554432
sysctl -w net.core.rmem_max=268435456
o Increase UDP receive limits:
sysctl -w net.core.netdev_max_backlog=300000
sysctl -w net.core.netdev_budget=600
o For Regime C, enable SO_BUSY_POLL in the broker socket:
setsockopt(sk, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof usec);
/* recommended: usec = 50 to 100 */
o Disable irqbalance and pin RX queue IRQs to specific cores:
systemctl stop irqbalance
systemctl mask irqbalance
/* per RX queue n, pin to core n: */
echo > /proc/irq//smp_affinity
o TX queueing discipline: replace default pfifo_fast with
fq_codel (lower latency under load):
tc qdisc replace dev root fq_codel
15.5. NUMA and CPU isolation for the broker
At Regime C or above, the broker process MUST be NUMA-pinned to
the same socket as the NIC (verify with lspci -vv). Cross-NUMA
memory access doubles latency under load.
Boot-time CPU isolation:
isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 (Linux GRUB cmdline)
This removes cores 2-7 from the kernel scheduler; the broker
process is then pinned to one of these cores via:
taskset -c 2 ./mvps_broker
Hugepages reduce TLB pressure for the broker's state arrays
(typically 100s of MB at N >= 10 000):
sysctl -w vm.nr_hugepages=512
/* broker uses mmap(MAP_HUGETLB) */
For multi-broker deployments, SO_REUSEPORT allows multiple
broker threads to share a single UDP listening port with
kernel-side load balancing across threads.
16. Privacy Considerations
This protocol exposes the geometric coherence state (D^2) of
the monitored infrastructure to its operators. While numerical
and aggregated, the D^2 value and associated TLVs may enable
the following inferences:
o Geographic patterns of usage (per-cell D^2 streams may
correlate with regional traffic volume).
o Topology of customer-facing AS interconnections (visible
via Cell-Centroid TLV when broker feeds are shared).
o Timing of mitigated attacks (visible via Phase-Label TLV
transitions Init->WATCH->ALARM->Init).
Implementations:
o MUST NOT include payload bytes from observed user traffic
in any TLV. Only statistical aggregates derived from
operator-internal measurements MAY be carried.
o SHOULD aggregate D^2 over windows of at least T_tick
before publication to any non-operator audience, to
prevent fine-grained timing side-channel inference.
o SHOULD redact Vantage-Sketch (0xE0) and Cell-Centroid
(0xE1) TLVs in cross-organisation telemetry feeds (e.g.,
operator-CDN consortium dashboards) and publish only the
scalar D^2 field of the mandatory section.
o MAY apply differential privacy noise to per-cell D^2
streams before publication to community-defence feeds
(analogous to MISP or AbuseIPDB).
The privacy considerations framework of [RFC6973] applies.
This document does not introduce categories of personally
identifiable information.
17. Manageability Considerations
This section is REQUIRED by [RFC5706] for Routing Area documents.
Operations.
The five-state machine is observable via standard BFD
management interfaces extended for the WATCH and ALARM
states. A YANG augmentation of [RFC9127] is anticipated as
a future companion document.
Faults.
Persistent ALARM without corresponding data-plane outage,
and persistent WATCH oscillation, both indicate calibration
drift (Sigma_0 has aged). Implementations SHOULD expose a
"recalibrate" administrative action that re-derives mu_0
and Sigma_0 from the last N ticks of BAU samples.
Recommended N for production deployments: at least 86 400
ticks (24 h at T_tick = 1 s, or 24 min at T_tick = 1 ms).
Calibration procedure.
Initial calibration of (mu_0, Sigma_0):
1. Bring all vantages online and let the session reach
the Up state (per [RFC5880]).
2. Collect at least 600 ticks (30 s at T_tick = 50 ms)
of confirmed BAU samples; reject any tick during
which D^2 > 3 * IQR (interquartile range) of the
current window.
3. Set mu_0 = sample mean of cell centroids over the
collected BAU window.
4. Set Sigma_0 = sample covariance + epsilon * I, where
epsilon = 1e-6 to ensure invertibility.
Implementations SHOULD support online recalibration
triggered either by operator command or automatically after
detected topology change (BGP UPDATE batch above per-AS
threshold, anycast catchment change).
Configuration.
All timer parameters (T_tick, M-multiplier, Desired Min TX
Interval, Required Min RX Interval) follow [RFC5880]
conventions. Additional Coherence-BFD parameters:
o cell_count_k (default: 8)
o byzantine_bound_B (default: floor((k-1)/2))
o watch_threshold (default: chi^2_{d, 0.95})
o alarm_threshold (default: chi^2_{d, 0.99})
o dual_mode_aggregation (default: enabled, per
[I-D.melegassi-mvps-ddos-resilience]
Section 7.2)
Performance metrics.
Implementations SHOULD expose:
o detection_latency_p50, p95, p99 (over rolling 24 h)
o false_positive_rate_1h
o byzantine_alarm_count_24h
o cells_above_watch_threshold (gauge)
o vantages_in_session_up (gauge)
Acknowledgements
The authors thank early reviewers of the MVPS framework, whose
informal questions during May 2026 shaped this document. In
particular, the question "if MTU, IRQ, and queue tuning are
not handled, does this break under real traffic?" directly
motivated the addition of Section 15 (Packet Sizing, MTU, and
Network Stack Tuning).
The authors thank the IETF BFD WG mailing list for the
conventions and registry structure that this document follows.
Author's Address
Leonardo Melegassi
Catellix
Andradina, SP
Brazil
Email: melegassi@catellix.com
URI: https://catellix.com/
Melegassi Expires November 23, 2026 [Page 14]