Prevent panic on zero DNS negative TTL during backoff #3888

s-starostin

This change addresses a panic that can occur in the control plane client's DNS resolution backoff logic within linkerd/app/core/src/control.rs.

Problem:

When resolving the control plane address, if a dns::ResolveError occurs and provides a negative TTL via negative_ttl(), this TTL is used to schedule the next resolution attempt using tokio::time::interval(ttl).

However, if negative_ttl() returns Some(Duration::ZERO), that zero duration is passed to time::interval, which panics:

thread 'main' panicked at linkerd/app/core/src/control.rs:87:49:
`period` must be non-zero.
   0:     0x563f974dd323 - <unknown>
   1:     0x563f9680ed2c - <unknown>
   2:     0x563f974b030d - <unknown>
   3:     0x563f974de99e - <unknown>
   4:     0x563f974de564 - <unknown>
   5:     0x563f974df5c5 - <unknown>
   6:     0x563f974eecc9 - <unknown>
   7:     0x563f974eec96 - <unknown>
   8:     0x563f96737dfa - <unknown>
   9:     0x563f974f116c - <unknown>
  10:     0x563f97233597 - <unknown>
  11:     0x563f9715cc1a - <unknown>
  12:     0x563f9727a651 - <unknown>
  13:     0x563f971a91af - <unknown>
  14:     0x563f96c5a01c - <unknown>
  15:     0x563f967902b3 - <unknown>
  16:     0x563f96c56ba5 - <unknown>
  17:     0x7fc1ccd2624a - <unknown>
  18:     0x7fc1ccd26305 - __libc_start_main
  19:     0x563f9674f711 - <unknown>
  20:                0x0 - <unknown>

This panic was observed in production environments, particularly during restarts of, or issues with, the linkerd-destination service. When the proxy sidecar panicked with this error, meshed applications became unavailable, and deployments had to be restarted manually to recover connectivity.

Solution:

This commit introduces a minimum backoff duration (min_duration = 100ms) for cases where a negative TTL is provided by the DNS resolver. It uses std::cmp::max(ttl, min_duration) to ensure that the duration passed to time::interval is never zero.

This prevents the panic and ensures the proxy gracefully handles zero TTLs by applying a minimal delay before the next resolution attempt, improving resilience during control plane discovery issues.
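
For illustration, the clamping described above can be sketched with the standard library alone. This is not the code from the commit: the names `MIN_BACKOFF` and `backoff_period` are hypothetical, the 100ms floor is taken from the description above, and returning `None` when no negative TTL is available is an assumption made only to keep the sketch self-contained.

```rust
use std::cmp;
use std::time::Duration;

/// Hypothetical floor for the backoff period; 100ms is the value named
/// in the PR description.
const MIN_BACKOFF: Duration = Duration::from_millis(100);

/// Returns a period that is safe to hand to a timer that rejects zero.
///
/// `negative_ttl` stands in for the value reported by the resolve error.
/// A zero (or very small) TTL is raised to `MIN_BACKOFF`; larger TTLs
/// pass through unchanged. `None` is passed through as `None` (an
/// assumption; the PR description does not cover that case).
fn backoff_period(negative_ttl: Option<Duration>) -> Option<Duration> {
    negative_ttl.map(|ttl| cmp::max(ttl, MIN_BACKOFF))
}
```

With this shape, `backoff_period(Some(Duration::ZERO))` yields `Some(100ms)`, so the value that would eventually reach `time::interval` is never zero.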

@s-starostin s-starostin requested a review from a team as a code owner April 29, 2025 07:09
Signed-off-by: StarostinSY <sergejj.starostin@vitech.team>
@s-starostin s-starostin force-pushed the fix-prevent-interval-panic-zero-ttl branch from 5793629 to f7f06ca Compare April 29, 2025 07:12
@cratelyn
Member

hi @s-starostin, can you confirm what version of the proxy you are currently using?

the line number included in the panic message above, thread 'main' panicked at linkerd/app/core/src/control.rs:87:49, no longer points to a line of code that could panic, as of today:

i believe this issue may have already been fixed in #3807, which also added a lower-bound TTL when refreshing DNS records, further down in the linkerd-dns and linkerd-dns-resolve components.

@s-starostin
Author

Hello,
Ah, yes - we're currently on v2.214.0.
I noticed that block still looked unchanged and wanted to suggest a fix, but if it's already been addressed, then never mind.