Prevent panic on zero DNS negative TTL during backoff #3888
This change addresses a panic that can occur in the control plane client's DNS resolution backoff logic in `linkerd/app/core/src/control.rs`.

Problem:
When resolving the control plane address, if a `dns::ResolveError` occurs and provides a negative TTL via `negative_ttl()`, this TTL is used to schedule the next resolution attempt using `tokio::time::interval(ttl)`.

However, if `negative_ttl()` returns `Some(Duration::ZERO)`, passing a zero duration to `time::interval` causes a panic.

This panic was observed in production environments, particularly during restarts of, or issues with, the `linkerd-destination` service. When the proxy sidecar panicked due to this error, meshed applications became unavailable, and deployments had to be restarted manually to recover connectivity.

Solution:
This commit introduces a minimum backoff duration (`min_duration` = 100ms) for cases where a negative TTL is provided by the DNS resolver. It uses `std::cmp::max(ttl, min_duration)` to ensure that the duration passed to `time::interval` is never zero.

This prevents the panic and ensures the proxy gracefully handles zero TTLs by applying a minimal delay before the next resolution attempt, improving resilience during control plane discovery issues.
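
The clamping step described above can be sketched in isolation as follows. This is a minimal illustration, not the code from `control.rs`: the helper name `backoff_period` and the constant `MIN_DURATION` are hypothetical, and the sketch only computes the period that would be handed to `time::interval` rather than creating a timer.

```rust
use std::cmp;
use std::time::Duration;

/// Floor applied to negative TTLs. The 100ms value comes from the PR
/// description; the constant name itself is an assumption for this sketch.
const MIN_DURATION: Duration = Duration::from_millis(100);

/// Hypothetical helper: maps the resolver's optional negative TTL to the
/// period used for the next resolution attempt. A zero (or very small) TTL
/// is raised to MIN_DURATION, so the result is never zero. `None` is passed
/// through for the caller to handle with its own default backoff.
fn backoff_period(negative_ttl: Option<Duration>) -> Option<Duration> {
    negative_ttl.map(|ttl| cmp::max(ttl, MIN_DURATION))
}
```

With this shape, `backoff_period(Some(Duration::ZERO))` yields the 100ms floor, while a TTL longer than the floor, such as 5 seconds, is returned unchanged.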