Internet Engineering Task Force | M. Sivaraman |
Internet-Draft | Akira Systems Private Limited |
Intended status: Experimental | Liu |
Expires: December 27, 2020 | Infoblox |
June 25, 2020 |
The DNS thundering herd problem
draft-muks-dnsop-dns-thundering-herd-00
This document describes an observed regular pattern of spikes in queries that affects caching resolvers, and recommends software mitigations for it.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 27, 2020.
Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Typically, DNS caching resolvers prepare answers for multiple clients from a single cached RRset [RFC1034]. Depending on when in time the clients make their queries, caching resolvers reply with lower and lower valued TTLs, before the cached RRset from which answers are prepared expires. Clients themselves may cache and use their copies of RRsets until the TTL that the resolver replied with expires. A key property is that all these copies of answers, and the cached answer from which they are prepared, expire at the same absolute time.
As an example, consider the following query sequence received by a resolver from 10 clients all querying for a popular www.example.org./A RRset. We use this example to illustrate two kinds of spikes in queries.
Client | Query time (seconds since epoch) | Answer RRset TTL | Notes |
---|---|---|---|
C1 | 1591441620 | 600 | Answer was not found in cache. Resolver performs a resolution and caches authoritative answer with TTL=600. |
C2 | 1591441626 | 594 | Answered from cache. |
C3 | 1591441713 | 507 | Answered from cache. |
C4 | 1591441780 | 440 | Answered from cache. |
C5 | 1591441866 | 354 | Answered from cache. |
C6 | 1591442006 | 214 | Answered from cache. |
C7 | 1591442070 | 150 | Answered from cache. |
C8 | 1591442070 | 150 | Answered from cache. |
C9 | 1591442213 | 7 | Answered from cache. |
C3 | 1591442220 | 600 | Previously cached answer had expired in the resolver's cache. So the resolver performs a fresh resolution and caches authoritative answer with TTL=600. |
C5 | 1591442220 | 600 | Ditto if not joined with previous. |
C2 | 1591442220 | 600 | Ditto if not joined with previous. |
C6 | 1591442220 | 600 | Ditto if not joined with previous. |
C1 | 1591442221 | 599 | Answered from cache. |
C9 | 1591442221 | 599 | Answered from cache. |
C4 | 1591442221 | 599 | Answered from cache. |
C8 | 1591442221 | 599 | Answered from cache. |
C7 | 1591442221 | 599 | Answered from cache. |
C10 | 1591442227 | 593 | Answered from cache. |
C7 | 1591442820 | 600 | Previously cached answer had expired in the resolver's cache. So the resolver performs a fresh resolution and caches authoritative answer with TTL=600. |
C4 | 1591442820 | 600 | Ditto if not joined with previous. |
C1 | 1591442820 | 600 | Ditto if not joined with previous. |
C2 | 1591442820 | 600 | Ditto if not joined with previous. |
C10 | 1591442820 | 600 | Ditto if not joined with previous. |
C8 | 1591442820 | 600 | Ditto if not joined with previous. |
C3 | 1591442821 | 599 | Answered from cache. |
C9 | 1591442821 | 599 | Answered from cache. |
C5 | 1591442821 | 599 | Answered from cache. |
C6 | 1591442821 | 599 | Answered from cache. |
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
At a resolver, when multiple queries have arrived together asking the same question and there is no existing unexpired cached answer, DNS resolutions have to be performed to answer these queries. De-duplication of these multiple resolutions into a single DNS resolution by the resolver is RECOMMENDED where possible.
If such de-duplication is not performed, the client queries will effectively be forwarded 1:1 by the resolver to upstream nameservers, and they will significantly increase the upstream nameservers' query rate in spikes. Some nameserver operators may have deployed measures such as response rate limiting [RRL] and other IP-address based rate limiting, which may cause them to deny service to the resolver due to the query spikes of identical queries.
Caching resolvers are permitted to lower the TTLs of RRsets in their answers as they please [RFC2181]. This can be used to distribute the time at which RRset copies received by clients expire from a single absolute time to a time interval. However, this has to be done with some consideration such that the thundering herd doesn't re-converge at the expiry time of the cached RRset that is used to generate answers to the clients.
TBD.
With very low authoritative RRset TTLs (such as under 60s) for popular questions, the frequency of the thundering herd increases and including noise in response TTLs is less effective because the maximum TTL to work with is low. In other words, there is a shorter interval over which the thundering herd can be distributed by adding noise. Some implementations permit an operator to set a minimum TTL value such that authoritative RRset TTLs with lower values are increased and clamped to the minimum TTL value. This breaks currently accepted DNS protocol, and hence this document does not make any recommendation about it.
There are no security considerations.
There are no IANA considerations.
This document was prepared from thundering herd client query patterns noticed at resolvers of ISPs and large institutions, which resulted in traffic spikes that caused performance issues and lookup failures. The authors acknowledge the contribution of Ramesh Damodaran who participated in analysis of these patterns.
[RFC1034] | Mockapetris, P., "Domain names - concepts and facilities", STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987. |
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997. |
[RFC2181] | Elz, R. and R. Bush, "Clarifications to the DNS Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997. |
[RFC8174] | Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017. |
[RRL] | Vixie, P. and V. Schryver, "DNS Response Rate Limiting (DNS RRL)", 2012. |