fix(dns-resolve): add a lower-bound TTL for dns refreshing #3807
This branch makes a small, targeted fix to mitigate the excessive rate of DNS queries described in linkerd/linkerd2#13508. It bounds the rate at which the worker tasks returned by `<linkerd_dns_resolve::DnsResolve as tower::Service<T>>::call()` will attempt to refresh DNS records.

This branch does so in two small steps: first, the `linkerd_dns::Resolver::resolve_addrs()` function is changed to return a `tokio::time::Instant` instead of a `tokio::time::Sleep`; then, the `resolution` worker is extended with a minimum delay it will wait before continuing.
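To illustrate the shape of this change (a rough sketch rather than the code in this branch; the constant, names, and the five-second bound are illustrative), a resolution worker can clamp the resolver's expiry instant to a lower bound before refreshing:

```rust
use std::net::SocketAddr;
use std::time::Duration;

use tokio::time::{self, Instant};

/// Illustrative lower bound on how often a worker may refresh DNS records.
const MIN_REFRESH: Duration = Duration::from_secs(5);

/// Stand-in for the real resolver: returns the resolved addresses and the
/// instant at which the records expire (the TTL, expressed as an `Instant`).
async fn resolve_addrs(name: &str) -> (Vec<SocketAddr>, Instant) {
    // The real function would issue a DNS query; here we pretend the records
    // expire almost immediately, so the clamping below is what matters.
    let _ = name;
    (Vec::new(), Instant::now() + Duration::from_millis(10))
}

/// Simplified resolution worker: refresh whenever the records expire, but
/// never more often than `MIN_REFRESH`.
async fn resolution(name: String) {
    loop {
        let (addrs, expiry) = resolve_addrs(&name).await;
        // ...publish `addrs` to watchers here...
        let _ = addrs;

        // Clamp the expiry so a tiny (or zero) TTL cannot drive a tight loop
        // of queries against the DNS server.
        let floor = Instant::now() + MIN_REFRESH;
        time::sleep_until(expiry.max(floor)).await;
    }
}

#[tokio::main]
async fn main() {
    // Run the worker briefly, just to exercise the loop.
    let worker = tokio::spawn(resolution("web.default.svc.cluster.local".to_string()));
    time::sleep(Duration::from_secs(12)).await;
    worker.abort();
}
```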
I ran this locally with a small workload on a local cluster. Because the rate of these DNS queries scales with the number of control plane clients, I used this job/deployment pair to create 64 servers and a load generator sending 100 requests per second.
Then, I created a Prometheus config file to scrape both sets of metrics.
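Something along these lines should work, assuming CoreDNS metrics are forwarded to `localhost:9153` and a proxy's admin endpoint to `localhost:4191` (the job names, ports, and scrape interval here are illustrative, matching the port-forwards below):

```yaml
global:
  scrape_interval: 5s

scrape_configs:
  # CoreDNS serves Prometheus metrics on port 9153.
  - job_name: "coredns"
    static_configs:
      - targets: ["localhost:9153"]

  # The Linkerd proxy serves metrics on its admin port, 4191.
  - job_name: "linkerd-proxy"
    static_configs:
      - targets: ["localhost:4191"]
```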
Use `kubectl` to port-forward the CoreDNS and Linkerd proxy metrics endpoints.
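For example (the namespace and pod name here are placeholders for the test workload):

```sh
# Forward CoreDNS's metrics port (9153) to localhost.
kubectl port-forward -n kube-system deploy/coredns 9153:9153 &

# Forward the admin/metrics port (4191) of one of the meshed server pods;
# the namespace and pod name are placeholders.
kubectl port-forward -n default pod/server-0 4191:4191 &
```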
...then launch Prometheus by running `prometheus --config.file=prometheus.yml` with the configuration above. Once Prometheus is running, navigate to `localhost:9090/graph` to query the metrics being scraped.

After observing the steady-state rate of CoreDNS requests and cache misses, I loaded a new, patched proxy image into the cluster and restarted the control plane and servers so their pods would pick up the new proxy.
After the new, patched proxies came online, I could see the "thundering herd" described in this comment on linkerd/linkerd2#13508: