fix(dns-resolve): add a lower-bound TTL for dns refreshing #3807
This branch makes a small, targeted fix to mitigate the excessive rate of DNS queries described in linkerd/linkerd2#13508. It bounds the rate at which the worker tasks returned by `<linkerd_dns_resolve::DnsResolve as tower::Service<T>>::call()` will attempt to refresh DNS records.

This branch does so in two small steps: first, the `linkerd_dns::Resolver::resolve_addrs()` function is changed to return a `tokio::time::Instant` instead of a `tokio::time::Sleep`; then, the `resolution` worker is extended with a minimum delay it will wait before continuing.
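To illustrate the shape of this change (a rough sketch rather than the code in this branch; the constant, names, and the five-second bound are illustrative), a resolution worker can clamp the resolver's expiry instant to a lower bound before refreshing:

```rust
use std::net::SocketAddr;
use std::time::Duration;

use tokio::time::{self, Instant};

/// Illustrative lower bound on how often a worker may refresh DNS records.
const MIN_REFRESH: Duration = Duration::from_secs(5);

/// Stand-in for the real resolver: returns the resolved addresses and the
/// instant at which the records expire (the TTL, expressed as an `Instant`).
async fn resolve_addrs(name: &str) -> (Vec<SocketAddr>, Instant) {
    // The real function would issue a DNS query; here we pretend the records
    // expire almost immediately, so the clamping below is what matters.
    let _ = name;
    (Vec::new(), Instant::now() + Duration::from_millis(10))
}

/// Simplified resolution worker: refresh whenever the records expire, but
/// never more often than `MIN_REFRESH`.
async fn resolution(name: String) {
    loop {
        let (addrs, expiry) = resolve_addrs(&name).await;
        // ...publish `addrs` to watchers here...
        let _ = addrs;

        // Clamp the expiry so a tiny (or zero) TTL cannot drive a tight loop
        // of queries against the DNS server.
        let floor = Instant::now() + MIN_REFRESH;
        time::sleep_until(expiry.max(floor)).await;
    }
}

#[tokio::main]
async fn main() {
    // Run the worker briefly, just to exercise the loop.
    let worker = tokio::spawn(resolution("web.default.svc.cluster.local".to_string()));
    time::sleep(Duration::from_secs(12)).await;
    worker.abort();
}
```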
I ran this locally with a small workload on a local cluster. Because the rate of these DNS queries scales with the number of control plane clients, I used this job/deployment pair to create 64 servers and a load generator sending 100 requests per second.
Then, I created a Prometheus config file to scrape both sets of metrics.
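Something along these lines should work, assuming CoreDNS metrics are forwarded to `localhost:9153` and a proxy's admin endpoint to `localhost:4191` (the job names, ports, and scrape interval here are illustrative, matching the port-forwards below):

```yaml
global:
  scrape_interval: 5s

scrape_configs:
  # CoreDNS serves Prometheus metrics on port 9153.
  - job_name: "coredns"
    static_configs:
      - targets: ["localhost:9153"]

  # The Linkerd proxy serves metrics on its admin port, 4191.
  - job_name: "linkerd-proxy"
    static_configs:
      - targets: ["localhost:4191"]
```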
Use `kubectl` to port-forward the CoreDNS and Linkerd proxy metrics endpoints.
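For example (the namespace and pod name here are placeholders for the test workload):

```sh
# Forward CoreDNS's metrics port (9153) to localhost.
kubectl port-forward -n kube-system deploy/coredns 9153:9153 &

# Forward the admin/metrics port (4191) of one of the meshed server pods;
# the namespace and pod name are placeholders.
kubectl port-forward -n default pod/server-0 4191:4191 &
```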
...then launch Prometheus by running `prometheus --config.file=prometheus.yml` with the configuration above. Once Prometheus is running, navigate to `localhost:9090/graph` to query the metrics being scraped.

After observing the steady-state rate of CoreDNS requests and cache misses, I loaded a new, patched proxy image into the cluster and restarted the control plane and servers so their pods would pick up the new proxy.
After the new, patched proxies came online, I could see the "thundering herd" described in this comment on linkerd/linkerd2#13508: