dnsop | W. Hardaker |
Internet-Draft | USC/ISI |
Updates: 7583 (if approved) | W. Kumari |
Intended status: Standards Track | |
Expires: August 5, 2018 | February 01, 2018 |
Security Considerations for RFC5011 Publishers
draft-ietf-dnsop-rfc5011-security-considerations-11
This document extends the RFC5011 rollover strategy with timing advice that must be followed by the publisher in order to maintain security. Specifically, this document describes the math behind the minimum time-length that a DNS zone publisher must wait before signing exclusively with recently added DNSKEYs. This document also describes the minimum time-length that a DNS zone publisher must wait after publishing a revoked DNSKEY before assuming that all active RFC5011 resolvers should have seen the revocation-marked key and removed it from their list of trust anchors.
This document contains much math and complicated equations, but the summary is that the key rollover / revocation time is much longer than intuition would suggest. If you are not both publishing a DNSSEC DNSKEY, and using RFC5011 to advertise this DNSKEY as a new Secure Entry Point key for use as a trust anchor, you probably don't need to read this document.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 5, 2018.
Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
[RFC5011] defines a mechanism by which DNSSEC validators can update their list of trust anchors when they've seen a new key published in a zone, or remove a properly revoked key from a trust anchor list. However, RFC5011 [intentionally] provides no guidance to the publishers of DNSKEYs about how long they must wait before switching to exclusively using recently published keys for signing records, or how long they must wait before ceasing publication of a revoked key. Because of this lack of guidance, zone publishers may derive incorrect assumptions about safe usage of the RFC5011 DNSKEY advertising, rolling and revocation process. This document describes the minimum security requirements from a publisher's point of view and is intended to complement the guidance offered in RFC5011 (which is written to provide timing guidance solely from a Validating Resolver's point of view).
To explain the RFC5011 security analysis in this document better, Section 5 first describes an attack on a zone publisher. Then in Section 6.1 we break down each of the timing components that will be later used to define timing requirements for adding keys in Section 6.2 and revoking keys in Section 6.3.
To verify this lack of understanding is wide-spread, the authors reached out to 5 DNSSEC experts to ask them how long they thought they must wait before signing a zone exclusively with a new KSK [RFC4033] that was being introduced according to the 5011 process. All 5 experts answered with an insecure value, and we determined that this lack of mathematical understanding might cause security concerns in deployment. We hope that this companion document to RFC5011 will rectify this understanding and provide better guidance to zone publishers that wish to make use of the RFC5011 rollover process.
One important note about ICANN's (currently in process) 2017/2018 KSK rollover plan for the root zone: the timing values chosen for rolling the KSK in the root zone appear completely safe, and are not affected by the timing concerns introduced by this draft.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
RFC5011 describes a process by which a RFC5011 Resolver may accept a newly published KSK as a trust anchor for validating future DNSSEC signed records. It also describes the process for publicly revoking a published KSK. This document augments that information with additional constraints, from the SEP publisher's point of view. Note that this document does not define any other operational guidance or recommendations about the RFC5011 process and restricts itself solely to the security and operational ramifications of switching to exclusively using recently added keys or removing revoked keys too soon.
Failure of a DNSKEY publisher to follow the minimum recommendations associated with this draft can result in potential denial-of-service attack opportunities against validating resolvers. Failure of a DNSKEY publisher to publish a revoked key for a long enough period of time may result in RFC5011 Resolvers leaving that key in their trust anchor storage beyond the key's expected lifetime.
Also see Section 2 of [RFC4033] and [RFC7719] for additional terminology.
These sections define a high-level overview of [RFC5011] processing. These steps are not sufficient for proper RFC5011 implementation, but provide enough background for the reader to follow the discussion in this document. Readers need to fully understand [RFC5011] as well to fully comprehend the content and importance of this document.
RFC5011's process of safely publishing a new DNSKEY and then assuming RFC5011 Resolvers have adopted it for trust falls into a number of high-level steps to be performed by the SEP Publisher. This document discusses the following scenario, which is the principal way RFC5011 is currently being used (even though Section 6 of RFC5011 suggests having a stand-by key available):
This document discusses the time required to wait during step 2 of the above process. Some interpretations of RFC5011 have erroneously determined that the wait time is equal to RFC5011's "hold down time". Section 5 describes an attack based on this (common) erroneous belief, which can result in a denial of service attack against the zone.
RFC5011's process of advertising that an old key is to be revoked from RFC5011 Resolvers falls into a number of high-level steps:
This document discusses the time required to wait in step 3 of the above process. Some interpretations of RFC5011 have erroneously determined that the wait time is equal to RFC5011's "hold down time". This document describes an attack based on this (common) erroneous belief, which results in a revoked DNSKEY potentially remaining as a trust anchor in a RFC5011 Resolver long past its expected usage.
This section serves as an illustrative example of the problem being discussed in this document. Note that in order to keep the example simple enough to understand, some simplifications were made (such as by not creating a set of pre-signed RRSIGs and by not using values that result in the addHoldDownTime not being evenly divisible by the activeRefresh value); the mathematical formulas in Section 6 are, however, complete.
If an attacker is able to provide a RFC5011 Resolver with past responses, such as when it is in-path or able to perform any number of cache poisoning attacks, the attacker may be able to leave compliant RFC5011 Resolvers without an appropriate DNSKEY trust anchor. This scenario will remain until an administrator manually fixes the situation.
The time-line below illustrates an example of this situation.
The following example settings are used in the example scenario within this section:
Given these settings, the sequence of events in Section 5.1.1 depicts how a SEP Publisher that waits for only the RFC5011 hold time timer length of 30 days subjects its users to a potential Denial of Service attack. The timing schedule listed below is based on a SEP Publisher publishing a new Key Signing Key (KSK), with the intent that it will later be used as a trust anchor. We label this publication time as "T+0". All numbers in this sequence refer to days before and after this initial publication event. Thus, T-1 is the day before the introduction of the new key, and T+15 is the 15th day after the key was introduced into the fictitious zone being discussed.
In this dialog, we consider two keys within the example zone:
The steps show an attack that foils the adoption of a new DNSKEY by a 5011 Resolver when the SEP Publisher starts signing and publishing with the new DNSKEY too quickly.
This section defines the minimum timing requirements for making exclusive use of newly added DNSKEYs and timing requirements for ceasing the publication of DNSKEYs to be revoked. We break our timing solution requirements into two primary components: the mathematically-based security analysis of the RFC5011 publication process itself, and an extension of this that takes operational realities into account that further affect the recommended timings.
First, we define the term components used in all equations in Section 6.1.
The addHoldDownTime is defined in Section 2.4.1 of [RFC5011] as:
The add hold-down time is 30 days or the expiration time of the original TTL of the first trust point DNSKEY RRSet that contained the new key, whichever is greater. This ensures that at least two validated DNSKEY RRSets that contain the new key MUST be seen by the resolver prior to the key's acceptance.
lastSigExpirationTime is the latest value (i.e., the date and time furthest in the future) of any RRSIG Signature Expiration field covering any DNSKEY RRSet containing only the old trust anchor(s) that are being superseded. Note that for organizations pre-creating signatures this time may be fairly far in the future unless they can be significantly assured that none of their pre-generated signatures can be replayed at a later date.
sigExpirationTime is the amount of time between the DNSKEY RRSIG's Signature Inception field and the Signature Expiration field.
sigExpirationTimeRemaining is defined in Section 3.
activeRefresh time is defined by RFC5011 as:
A resolver that has been configured for an automatic update of keys from a particular trust point MUST query that trust point (e.g., do a lookup for the DNSKEY RRSet and related RRSIG records) no less often than the lesser of 15 days, half the original TTL for the DNSKEY RRSet, or half the RRSIG expiration interval and no more often than once per hour.
This translates to:
activeRefresh = MAX(1 hour, MIN(sigExpirationTime / 2, MAX(TTL of K_old DNSKEY RRSet) / 2, 15 days) )
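The activeRefresh formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `active_refresh` and its parameter names are hypothetical, and `max_dnskey_ttl` is assumed to already hold the largest TTL of the K_old DNSKEY RRSet.

```python
from datetime import timedelta


def active_refresh(sig_expiration_time: timedelta, max_dnskey_ttl: timedelta) -> timedelta:
    """Return MAX(1 hour, MIN(sigExpirationTime / 2, TTL / 2, 15 days))."""
    # Upper bound on the query interval: the lesser of the three limits.
    upper = min(sig_expiration_time / 2, max_dnskey_ttl / 2, timedelta(days=15))
    # A resolver never needs to query more often than once per hour.
    return max(timedelta(hours=1), upper)


# Illustrative values only: a 21-day RRSIG validity period and a 2-day DNSKEY TTL.
print(active_refresh(timedelta(days=21), timedelta(days=2)))
```

Using `timedelta` rather than bare numbers keeps the units explicit, which matters because the formula mixes hours and days.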
Mentally, it is easy to assume that the period of time required for SEP publishers to wait after making changes to SEP marked DNSKEY sets will be entirely based off the length of the addHoldDownTime. Unfortunately, analysis shows that both the design of the RFC5011 protocol and the operational realities of deploying it require waiting an additional period of time. In Section 6.1.6.1 to Section 6.1.6.3 below, we discuss three sources of additional delay. In the end, we will pick the largest of these delays as the minimum additional time that the SEP Publisher must wait in our final timingSafetyMargin value, which we define in Section 6.1.6.4.
Security analysis of the timing associated with the query rate of RFC5011 Resolvers shows that it may not perfectly align with the addHoldDownTime when the addHoldDownTime is not evenly divisible by the activeRefresh time. Consider the example of a zone with an activeRefresh period of 7 days. If an associated RFC5011 Resolver started its holdDown timer just after the SEP published a new DNSKEY (at time T), the resolver would send checking queries at T+7, T+14, T+21 and T+28 days and would finally accept the key at T+35 days, which is 5 days longer than the 30-day addHoldDownTime.
The activeRefreshOffset term defines this time difference and becomes:
activeRefreshOffset = addHoldDownTime % activeRefresh
The % symbol denotes the mathematical mod operator (calculating the remainder in a division problem). This will frequently be zero, but can be nearly as large as activeRefresh itself.
Even small clock drifts can have negative impacts upon the timing of the RFC5011 Resolver's measurements. Consider the simplest case where the RFC5011 Resolver's clock shifts over time to be 2 seconds slower near the end of the RFC5011 Resolver's addHoldDownTime period. That is, if the RFC5011 Resolver first noticed a new DNSKEY at:
firstSeen = sigExpirationTime + activeRefresh + 1 second
The effect of 2 second clock drift between the SEP Publisher and the RFC5011 Resolver may result in the RFC5011 Resolver querying again at:
justBefore = sigExpirationTime + addHoldDownTime + activeRefresh + 1 second - 2 seconds
which becomes:
justBefore = sigExpirationTime + addHoldDownTime + activeRefresh - 1 second
As a result, the addHoldDownTime will not have been reached from the perspective of the RFC5011 Resolver, even though it will have been reached from the perspective of the SEP Publisher. It may therefore take one additional activeRefresh period for this RFC5011 Resolver to accept the new key (at sigExpirationTime + addHoldDownTime + 2 * activeRefresh - 1 second).
We note that even the smallest clockskew errors can require waiting an additional activeRefresh period, and thus define the clockskewDriftMargin as:
clockskewDriftMargin = activeRefresh
Drift associated with a lost transmission and an accompanying re-transmission (see Section 2.3 of [RFC5011]) will also cause RFC5011 Resolvers to change the timing associated with query times such that it becomes impossible to predict, from the perspective of the SEP Publisher, when the final important measurement query will arrive. Similarly, any software that restarts/reboots without saving next-query timing state may also commence with a new random starting time. Thus, an additional activeRefresh is needed to handle both of these cases as well.
retryDriftMargin = activeRefresh
Note that we account for additional time associated with cumulative multiple retries, especially under high-loss conditions, in Section 6.1.6.4.
The activeRefreshOffset, clockskewDriftMargin, and retryDriftMargin parameters all deal with additional wait-periods that must be accounted for after analyzing the conditions under which the client will take longer than expected to make its last query while waiting for the addHoldDownTime period to pass. These values may be merged into a single term by waiting the longest of any of them. We define timingSafetyMargin as this "worst case" value:
timingSafetyMargin = MAX(activeRefreshOffset, clockskewDriftMargin, retryDriftMargin)
timingSafetyMargin = MAX(addHoldDownTime % activeRefresh, activeRefresh, activeRefresh)
timingSafetyMargin = activeRefresh
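As a small sketch of why the three margins collapse to a single activeRefresh, the following Python computes each term as defined in the text and takes the maximum. The function name `timing_safety_margin` is hypothetical; the three local variables mirror the document's terms.

```python
from datetime import timedelta


def timing_safety_margin(add_hold_down_time: timedelta, active_refresh: timedelta) -> timedelta:
    """Return MAX(activeRefreshOffset, clockskewDriftMargin, retryDriftMargin)."""
    # Remainder of the hold-down period after whole refresh intervals;
    # always strictly less than one activeRefresh.
    active_refresh_offset = add_hold_down_time % active_refresh
    # Clock skew and retry drift can each cost one full refresh interval.
    clockskew_drift_margin = active_refresh
    retry_drift_margin = active_refresh
    return max(active_refresh_offset, clockskew_drift_margin, retry_drift_margin)


# Illustrative values only: 30-day hold-down, 7-day activeRefresh.
print(timing_safety_margin(timedelta(days=30), timedelta(days=7)))
```

Because a remainder is always smaller than its divisor, the maximum is activeRefresh for any inputs, which is the simplification the text arrives at.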
The retrySafetyMargin is an extra period of time to account for caching, network delays, dropped packets, and other operational concerns otherwise beyond the scope of this document. The value operators should choose is highly dependent on the deployment situation associated with their zone. Note that no value of a retrySafetyMargin can protect against resolvers that are "down". Nonetheless, we do offer the following as one method for considering reasonable values to select from.
The following list of variables needs to be considered when selecting an appropriate retrySafetyMargin value:
Note that RFC5011 defines retryTime as:
If the query fails, the resolver MUST repeat the query until satisfied no more often than once an hour and no less often than the lesser of 1 day, 10% of the original TTL, or 10% of the original expiration interval. That is, retryTime = MAX (1 hour, MIN (1 day, .1 * origTTL, .1 * expireInterval)).
With the successRate and numResolvers values selected and the definition of retryTime from RFC5011, one method for determining how many retryTime intervals to wait in order to reduce the set of uncompleted servers to 0 assuming normal probability is thus:
x = (1/(1 - successRate))
retryCountWait = Log_base_x(numResolvers)
To reduce the need for readers to pull out a scientific calculator, we offer the following lookup table based on successRate and numResolvers:
                      retryCountWait lookup table
                      ---------------------------
                      Number of client RFC5011 Resolvers (numResolvers)
                      -------------------------------------------------
                      10,000   100,000   1,000,000   10,000,000   100,000,000
                0.01     917      1146        1375         1604          1833
Probability     0.05     180       225         270          315           360
of Success      0.10      88       110         132          153           175
Per Retry       0.15      57        71          86          100           114
Interval        0.25      33        41          49           57            65
(successRate)   0.50      14        17          20           24            27
                0.90       4         5           6            7             8
                0.95       4         4           5            6             7
                0.99       2         3           3            4             4
                0.999      2         2           2            3             3
Finally, a suggested value of retrySafetyMargin can then be this retryCountWait number multiplied by the retryTime from RFC5011:
retrySafetyMargin = retryCountWait * retryTime
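For readers who prefer code to a lookup table, the calculation above might be sketched in Python as below. The function names are hypothetical, and rounding retryCountWait up to a whole number of intervals is an assumption chosen because it appears to match the table entries; floating-point results that land exactly on an integer (for example a successRate of 0.90 with 10,000 resolvers) may round differently and should be checked by hand.

```python
import math
from datetime import timedelta


def retry_time(orig_ttl: timedelta, expire_interval: timedelta) -> timedelta:
    """RFC5011 retryTime: MAX(1 hour, MIN(1 day, .1 * origTTL, .1 * expireInterval))."""
    return max(timedelta(hours=1),
               min(timedelta(days=1), orig_ttl / 10, expire_interval / 10))


def retry_count_wait(success_rate: float, num_resolvers: int) -> int:
    """Number of retry intervals: log base x of numResolvers, x = 1 / (1 - successRate)."""
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    x = 1.0 / (1.0 - success_rate)
    # Assumption: round up to a whole number of retry intervals.
    return math.ceil(math.log(num_resolvers, x))


def retry_safety_margin(success_rate: float, num_resolvers: int,
                        orig_ttl: timedelta, expire_interval: timedelta) -> timedelta:
    """retrySafetyMargin = retryCountWait * retryTime."""
    return retry_count_wait(success_rate, num_resolvers) * retry_time(orig_ttl, expire_interval)


# Illustrative values only: 50% success per retry, 10,000 resolvers,
# a 2-day TTL and a 21-day signature validity period.
print(retry_safety_margin(0.50, 10_000, timedelta(days=2), timedelta(days=21)))
```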
Given the defined parameters and analysis from Section 6.1, we can now create a method for calculating the amount of time to wait until it is safe to start signing exclusively with a new DNSKEY (especially useful for writing code involving sleep based timers) in Section 6.2.1, and define a method for calculating a wall-clock value after which it is safe to start signing exclusively with a new DNSKEY (especially useful for writing code based on clock-based event triggers) in Section 6.2.2.
Given the attack description in Section 5, the correct minimum length of time required for the Zone Signer to wait after publishing K_new but before exclusively using it and newer keys is:
addWaitTime = addHoldDownTime + sigExpirationTimeRemaining + activeRefresh + timingSafetyMargin + retrySafetyMargin
Given the equation components defined in Section 6.1, the full expanded equation is:
addWaitTime = addHoldDownTime + sigExpirationTimeRemaining + 2 * MAX(1 hour, MIN(sigExpirationTime / 2, MAX(TTL of K_old DNSKEY RRSet) / 2, 15 days) ) + retrySafetyMargin
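A sleep-timer style calculation of addWaitTime could look like the following Python sketch. The function and parameter names are hypothetical, the retrySafetyMargin defaults to zero purely for illustration (operators would normally supply their own value), and timingSafetyMargin is taken to equal activeRefresh as derived in Section 6.1.6.4.

```python
from datetime import timedelta


def active_refresh(sig_expiration_time: timedelta, max_dnskey_ttl: timedelta) -> timedelta:
    """MAX(1 hour, MIN(sigExpirationTime / 2, TTL / 2, 15 days))."""
    return max(timedelta(hours=1),
               min(sig_expiration_time / 2, max_dnskey_ttl / 2, timedelta(days=15)))


def add_wait_time(add_hold_down_time: timedelta,
                  sig_expiration_time_remaining: timedelta,
                  sig_expiration_time: timedelta,
                  max_dnskey_ttl: timedelta,
                  retry_safety_margin: timedelta = timedelta(0)) -> timedelta:
    """Minimum wait after publishing K_new before signing exclusively with it."""
    refresh = active_refresh(sig_expiration_time, max_dnskey_ttl)
    timing_safety_margin = refresh  # worst case of the three drift sources
    return (add_hold_down_time + sig_expiration_time_remaining
            + refresh + timing_safety_margin + retry_safety_margin)


# Illustrative values only: 30-day hold-down, 21-day signatures, 2-day TTL.
wait = add_wait_time(timedelta(days=30), timedelta(days=21),
                     timedelta(days=21), timedelta(days=2))
print(wait)
```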
The equations in Section 6.2.1 are defined based upon how long to wait from a particular moment in time. An alternative, but equivalent, method is to calculate the date and time before which it is unsafe to use a key for signing. This calculation thus becomes:
addWallClockTime = lastSigExpirationTime + addHoldDownTime + activeRefresh + timingSafetyMargin + retrySafetyMargin
where lastSigExpirationTime is the latest value of any sigExpirationTime for which RRSIGs were created that could potentially be replayed. Fully expanded, this becomes:
addWallClockTime = lastSigExpirationTime + addHoldDownTime + 2 * MAX(1 hour, MIN(sigExpirationTime / 2, MAX(TTL of K_old DNSKEY RRSet) / 2, 15 days) ) + retrySafetyMargin
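For clock-triggered tooling, the wall-clock form can be sketched in Python with timezone-aware datetimes. The names below are hypothetical, and the sketch assumes the caller already knows lastSigExpirationTime as a UTC timestamp; retrySafetyMargin again defaults to zero only for illustration.

```python
from datetime import datetime, timedelta, timezone


def add_wall_clock_time(last_sig_expiration: datetime,
                        add_hold_down_time: timedelta,
                        sig_expiration_time: timedelta,
                        max_dnskey_ttl: timedelta,
                        retry_safety_margin: timedelta = timedelta(0)) -> datetime:
    """Earliest date and time at which signing exclusively with K_new is safe."""
    refresh = max(timedelta(hours=1),
                  min(sig_expiration_time / 2, max_dnskey_ttl / 2, timedelta(days=15)))
    # activeRefresh + timingSafetyMargin collapses to 2 * activeRefresh.
    return last_sig_expiration + add_hold_down_time + 2 * refresh + retry_safety_margin


# Illustrative values only: last replayable RRSIG expires 2018-01-01 00:00 UTC.
safe_at = add_wall_clock_time(datetime(2018, 1, 1, tzinfo=timezone.utc),
                              timedelta(days=30), timedelta(days=21), timedelta(days=2))
print(safe_at.isoformat())
```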
The important timing constraint introduced by this memo relates to the last point at which a RFC5011 Resolver may have received a replayed original DNSKEY set, containing K_old and not K_new. The next query of the RFC5011 validator at which K_new will be seen without the potential for a replay attack will occur after the old DNSKEY RRSIG's Signature Expiration Time. Thus, the latest time that a RFC5011 Validator may begin its hold down timer is an "Active Refresh" period after the last point that an attacker can replay the K_old DNSKEY set. The worst case scenario of this attack is if the attacker can replay K_old just seconds before the (DNSKEY RRSIG Signature Validity) field of the last K_old only RRSIG.
Note: our notion of addWaitTime is called "Itrp" in Section 3.3.4.1 of [RFC7583]. The equation for Itrp in RFC7583 is insecure as it does not include the sigExpirationTime listed above. The Itrp equation in RFC7583 also does not include the 2*TTL safety margin, though that is an operational consideration.
For the parameters listed in Section 5.1, our resulting addWaitTime is:
addWaitTime = 30 + 10 + 1 / 2 + 1 / 2 (days)
addWaitTime = 41 (days)
This addWaitTime of 41 days is 11 days longer than just the hold down timer, even with the needed retrySafetyMargin value being left out (which we exclude due to the lack of necessary operational parameters).
This issue affects not just the publication of new DNSKEYs intended to be used as trust anchors, but also the length of time required to continuously publish a DNSKEY with the revoke bit set.
Section 6.2.1 defines a method for calculating the amount of time operators need to wait until it is safe to cease publishing a DNSKEY (especially useful for writing code involving sleep based timers), and Section 6.2.2 defines a method for calculating a minimal wall-clock value after which it is safe to cease publishing a DNSKEY (especially useful for writing code based on clock-based event triggers).
Both of these publication timing requirements are affected by the attacks described in this document, but with revocation the key is revoked immediately and the addHoldDown timer does not apply. Thus the minimum amount of time that a SEP Publisher must wait before removing a revoked key from publication is:
remWaitTime = sigExpirationTimeRemaining + activeRefresh + timingSafetyMargin + retrySafetyMargin
remWaitTime = sigExpirationTimeRemaining + MAX(1 hour, MIN((sigExpirationTime) / 2, MAX(TTL of K_old DNSKEY RRSet) / 2, 15 days)) + activeRefresh + retrySafetyMargin
Note also that adding retryTime intervals to the remWaitTime may be wise, just as it was for addWaitTime in Section 6.
Like before, the above equations are defined based upon how long to wait from a particular moment in time. An alternative, but equivalent, method is to calculate the date and time before which it is unsafe to cease publishing a revoked key. This calculation thus becomes:
remWallClockTime = lastSigExpirationTime + activeRefresh + timingSafetyMargin + retrySafetyMargin
where lastSigExpirationTime is the latest value of any sigExpirationTime for which RRSIGs were created that could potentially be replayed. Fully expanded, this becomes:
remWallClockTime = lastSigExpirationTime + MAX(1 hour, MIN((sigExpirationTime) / 2, MAX(TTL of K_old DNSKEY RRSet) / 2, 15 days)) + timingSafetyMargin + retrySafetyMargin
Note that our notion of remWaitTime is called "Irev" in Section 3.3.4.2 of [RFC7583]. The equation for Irev in RFC7583 is insecure as it does not include the sigExpirationTime listed above. The Irev equation in RFC7583 also does not include a safety margin, though that is an operational consideration.
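The two revocation formulas, remWaitTime and remWallClockTime, can be sketched together in Python. All function and parameter names are hypothetical, the input values are made up for illustration, and retrySafetyMargin defaults to zero only to keep the example short.

```python
from datetime import datetime, timedelta, timezone


def _active_refresh(sig_expiration_time: timedelta, max_dnskey_ttl: timedelta) -> timedelta:
    return max(timedelta(hours=1),
               min(sig_expiration_time / 2, max_dnskey_ttl / 2, timedelta(days=15)))


def rem_wait_time(sig_expiration_time_remaining: timedelta,
                  sig_expiration_time: timedelta,
                  max_dnskey_ttl: timedelta,
                  retry_safety_margin: timedelta = timedelta(0)) -> timedelta:
    """Minimum time to keep publishing a revoked key (no addHoldDownTime term)."""
    refresh = _active_refresh(sig_expiration_time, max_dnskey_ttl)
    timing_safety_margin = refresh
    return sig_expiration_time_remaining + refresh + timing_safety_margin + retry_safety_margin


def rem_wall_clock_time(last_sig_expiration: datetime,
                        sig_expiration_time: timedelta,
                        max_dnskey_ttl: timedelta,
                        retry_safety_margin: timedelta = timedelta(0)) -> datetime:
    """Date and time before which it is unsafe to cease publishing the revoked key."""
    refresh = _active_refresh(sig_expiration_time, max_dnskey_ttl)
    return last_sig_expiration + 2 * refresh + retry_safety_margin


# Made-up values: 14-day signatures, 1-day TTL, last RRSIG expiring 2018-03-01 UTC.
print(rem_wait_time(timedelta(days=14), timedelta(days=14), timedelta(days=1)))
print(rem_wall_clock_time(datetime(2018, 3, 1, tzinfo=timezone.utc),
                          timedelta(days=14), timedelta(days=1)).isoformat())
```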
For the parameters listed in Section 5.1, our example remWaitTime is:
remWaitTime = 10 + 1 / 2 (days)
remWaitTime = 10.5 (days)
Note that the values in this example produce a length shorter than the recommended 30 days in RFC5011's Section 6.6, step 3. Other values of sigExpirationTime and the original TTL of the K_old DNSKEY RRSet, however, can produce values longer than 30 days.
Note that because revocation happens immediately, an attacker has a much harder job tricking a RFC5011 Resolver into leaving a trust anchor in place, as the attacker must successfully replay the old data for every query a RFC5011 Resolver sends, not just one.
This document contains no IANA considerations.
A companion document to RFC5011 was expected to be published that describes the best operational practice considerations from the perspective of a zone publisher and SEP Publisher. However, this companion document has yet to be published. The authors of this document hope that it will at some point in the future, as RFC5011 timing can be tricky as we have shown, and a BCP is clearly warranted. This document is intended only to fill a single operational void which, when left misunderstood, can result in serious security ramifications. This document does not attempt to document any other missing operational guidance for zone publishers.
This document is solely about the security considerations with respect to the SEP Publisher's ability to advertise new DNSKEYs via the RFC5011 automated trust anchor update process. Thus the entire document is a discussion of Security Considerations when adding or removing DNSKEYs from trust anchor storage using the RFC5011 process.
For simplicity, this document assumes that the SEP Publisher will use a consistent RRSIG validity period. SEP Publishers that vary the length of RRSIG validity periods will need to adjust the sigExpirationTime value accordingly so that the equations in Section 6 and Section 6.3 use a value that coincides with the last time a replay of older RRSIGs will no longer succeed.
The authors would especially like to thank Michael StJohns for his help and advice, the care and thought he put into RFC5011 itself, and his continued reviews and suggestions for this document. He also designed the math behind the suggested retrySafetyMargin values in Section 6.1.7.
We would also like to thank Bob Harold, Shane Kerr, Matthijs Mekking, Duane Wessels, Petr Spacek, Ed Lewis, and the dnsop working group who have assisted with this document.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC4033] Arends, R., Austein, R., Larson, M., Massey, D. and S. Rose, "DNS Security Introduction and Requirements", RFC 4033, DOI 10.17487/RFC4033, March 2005.
[RFC5011] StJohns, M., "Automated Updates of DNS Security (DNSSEC) Trust Anchors", STD 74, RFC 5011, DOI 10.17487/RFC5011, September 2007.
[RFC7583] Morris, S., Ihren, J., Dickinson, J. and W. Mekking, "DNSSEC Key Rollover Timing Considerations", RFC 7583, DOI 10.17487/RFC7583, October 2015.
[RFC7719] Hoffman, P., Sullivan, A. and K. Fujiwara, "DNS Terminology", RFC 7719, DOI 10.17487/RFC7719, December 2015.
In 2017 and 2018, ICANN expects to (or has, depending on when you're reading this) roll the key signing key (KSK) for the root zone. The relevant parameters associated with the root zone at the time of this writing are as follows:
addHoldDownTime: 30 days
Old DNSKEY sigExpirationTime: 21 days
Old DNSKEY TTL: 2 days
Thus, sticking this information into the equation in Section 6 yields (in days from publication time):
addWaitTime = 30 + 21
              + MAX(1 hour,          # activeRefresh
                    MIN(21 / 2,
                        MAX(2) / 2,
                        15 days))
              + activeRefresh
addWaitTime = 30 + 21 + 1 + 1
addWaitTime = 53 days
Also note that we exclude the retrySafetyMargin value, which is calculated based on the expected client deployment size.
Thus, ICANN must wait a minimum of 53 days before switching to the newly published KSK (and 26 days before removing the old revoked key once it is published as revoked). ICANN's current plans involve waiting over 3 months before using the new key and 69 days before removing the old, revoked key. Thus, their current rollover plans are sufficiently secure from the attack discussed in this memo.