fix(dns-resolve): add a lower-bound TTL for dns refreshing #3807

Merged: 2 commits into main from kate/dns-resolver-tasks-should-respect-a-minimum-ttl on Mar 25, 2025

Conversation

cratelyn (Member) commented Mar 25, 2025

this branch makes a small, targeted fix to mitigate the excessive rate of DNS queries described in linkerd/linkerd2#13508. it bounds the rate at which the worker tasks returned by <linkerd_dns_resolve::DnsResolve as tower::Service<T>>::call() will attempt to refresh DNS records.

this branch does so in two small steps: first, the linkerd_dns::Resolver::resolve_addrs() function is changed to return a tokio::time::Instant instead of a tokio::time::Sleep. then, the resolution worker is extended with a minimum delay it will wait before continuing.
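
to make that second step concrete, the sketch below shows the general shape of clamping a refresh delay to a lower bound. MIN_TTL, refresh_loop(), and resolve() are hypothetical stand-ins for illustration, not the actual linkerd_dns_resolve worker or its configuration.

use tokio::time::{self, Duration, Instant};

// hypothetical lower bound on how often the worker may re-query DNS.
const MIN_TTL: Duration = Duration::from_secs(5);

// stand-in for the resolver call: it returns the resolved addresses along
// with the instant at which the records expire (an `Instant`, not a `Sleep`).
async fn resolve() -> (Vec<std::net::IpAddr>, Instant) {
    // pretend DNS answered with a very short (1s) TTL.
    (Vec::new(), Instant::now() + Duration::from_secs(1))
}

// sketch of the worker's refresh loop: wait until the records expire, but
// never less than MIN_TTL from now, so a tiny TTL cannot trigger an
// immediate re-resolution.
async fn refresh_loop() {
    loop {
        let (_records, expiry) = resolve().await;
        let floor = Instant::now() + MIN_TTL;
        time::sleep_until(expiry.max(floor)).await;
    }
}

#[tokio::main]
async fn main() {
    // exercise the sketch for a few seconds, then stop.
    let _ = time::timeout(Duration::from_secs(12), refresh_loop()).await;
}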


i ran this locally against a small workload on a local cluster. because the rate of these DNS queries scales with the number of control plane clients, i used this job/deployment pair to create 64 servers and a load generator sending 100 requests per second.

# This is a small template to exercise linkerd's DNS resolution.
#
# There are two values in this file that should be replaced before applying
# it via `kubectl`:
#
#   - SERVER_REPLICAS: number of server replicas
#   - CLIENT_RPS: rate for slow-cooker to send requests
#
# apply this via a command like:
#
# ```
# ; cat deployment.yaml | sed -e 's/CLIENT_RPS/100/' -e 's/SERVER_REPLICAS/4/' | kubectl apply -f -
# ```
---
# An HTTP/1 server running on port 8080.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast
spec:
  replicas: SERVER_REPLICAS
  selector:
    matchLabels:
      app: terminus
  template:
    metadata:
      labels:
        app: terminus
    spec:
      containers:
      - name: fast
        image: buoyantio/bb:v0.0.6
        args:
        - terminus
        - "--h1-server-port=8080"
        - "--response-text=pong"
        ports:
        - containerPort: 8080
---
# A service to reach the HTTP/1 server.
apiVersion: v1
kind: Service
metadata:
  name: terminus-svc
  labels:
    app: terminus-svc
spec:
  selector:
    app: terminus
  ports:
  - name: http
    port: 8080
    targetPort: 8080
---
# Generate load, sending 100 requests per second to the service.
apiVersion: batch/v1
kind: Job
metadata:
  name: slow-cooker
spec:
  template:
    metadata:
      labels:
        app: slow-cooker
    spec:
      containers:
      - name: slow-cooker
        image: buoyantio/slow_cooker:1.3.0
        command:
        - "/bin/sh"
        args:
        - "-c"
        - |
          sleep 15 # wait for pods to start
          /slow_cooker/slow_cooker -qps CLIENT_RPS -metric-addr 0.0.0.0:9999 http://terminus-svc:8080
        ports:
        - containerPort: 9999
      restartPolicy: OnFailure

then, i created a prometheus config file, like so:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'coredns'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9153']
  - job_name: 'linkerd'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:4191']

use kubectl to port-forward coredns and linkerd metrics:

kubectl -n kube-system port-forward $(kubectl -n kube-system get pods -l k8s-app=kube-dns -o name) 9153:9153
kubectl -n linkerd port-forward $(kubectl -n linkerd get pods -l linkerd.io/control-plane-ns=linkerd -o jsonpath='{.items[0].metadata.name}') 4191:4191

...then launch prometheus by running prometheus --config.file=prometheus.yml using the configuration above. once prometheus is running, navigate to localhost:9090/graph to query the metrics being scraped.
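
the coredns counters exposed on port 9153 can then be graphed there; the queries below are illustrative examples of the request and cache-miss rates, not necessarily the exact queries behind the screenshot further down:

# per-second rate of DNS requests reaching coredns, and of cache misses
rate(coredns_dns_requests_total[1m])
rate(coredns_cache_misses_total[1m])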

after observing the steady-state rate of coredns requests and cache misses, i loaded a newly patched proxy image into the cluster and restarted the control plane and the servers by running:

kubectl -n linkerd rollout restart deployment linkerd-destination linkerd-identity linkerd-proxy-injector
kubectl rollout restart deployment fast

after the newly patched proxies came online, i could see the "thundering herd" described in this comment on linkerd/linkerd2#13508:

[screenshot: compare-coredns-cache-misses-prometheus]

this commit changes the signature of the `resolve_srv` and
`resolve_a_or_aaaa` methods so that they now return an `Instant`, rather
than a `Sleep` future.

Signed-off-by: katelyn martin <kate@buoyant.io>
cratelyn force-pushed the kate/dns-resolver-tasks-should-respect-a-minimum-ttl branch from 6c69242 to 5ed003c on March 25, 2025 at 20:18
cratelyn marked this pull request as ready for review on March 25, 2025 at 21:41
cratelyn requested a review from a team as a code owner on March 25, 2025 at 21:41
olix0r (Member) left a comment

🥇

olix0r merged commit a3ce719 into main on Mar 25, 2025
15 checks passed
olix0r deleted the kate/dns-resolver-tasks-should-respect-a-minimum-ttl branch on March 25, 2025 at 23:37