Implement otel retry metrics #12064
base: master
Conversation
Force-pushed from a7a16ce to 3529769.
Force-pushed from 3529769 to bd69ed5.
Optional<MetricData> metric = openTelemetryTesting.getMetrics().stream()
    .filter(m -> m.getName().equals(metricName))
    .findFirst();
if (metric.isPresent()) {
When will this metric not be present for the operations the test performs on the tracer?
Whenever the metrics are not enabled or have default values.
As this is test code, it is risky to have the assertion inside an if statement. If the absence of the metric is not expected, then it should not cause the test to pass.
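To illustrate the point, here is a minimal sketch (not the PR's code) that asserts the metric's presence directly, assuming the same openTelemetryTesting rule and metricName variable used in the test above, plus a static import of org.junit.Assert.assertTrue; a missing metric then fails the test instead of silently skipping the assertions.

// Sketch only: fail if the metric was not recorded, rather than skipping.
// Assumes `openTelemetryTesting` and `metricName` as in the PR's test.
Optional<MetricData> metric = openTelemetryTesting.getMetrics().stream()
    .filter(m -> m.getName().equals(metricName))
    .findFirst();
assertTrue("expected metric " + metricName + " to be recorded", metric.isPresent());
// ... assertions on metric.get() go here, no longer wrapped in an if ...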
if (info.isTransparentRetry()) {
  transparentRetriesPerCall.incrementAndGet();
} else if (info.isHedging()) {
  hedgedAttemptsPerCall.incrementAndGet();
} else {
  attemptsPerCall.incrementAndGet();
Should this not assume them to be mutually exclusive?
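For illustration only, a hedged sketch (not the PR's code) of counting the categories independently rather than through an if/else chain, so a stream that is both a transparent retry and part of a hedged call would bump both counters; whether that is the intended semantics is exactly the question above. It reuses the info.isTransparentRetry() and info.isHedging() accessors from the snippet.

// Illustrative sketch: no mutual-exclusivity assumption between the categories.
if (info.isTransparentRetry()) {
  transparentRetriesPerCall.incrementAndGet();
}
if (info.isHedging()) {
  hedgedAttemptsPerCall.incrementAndGet();
}
if (!info.isTransparentRetry() && !info.isHedging()) {
  attemptsPerCall.incrementAndGet();
}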
@@ -266,7 +266,8 @@ public ClientStreamTracer newClientStreamTracer(
       Metadata newHeaders = updateHeaders(headers, previousAttemptCount);
       // NOTICE: This set _must_ be done before stream.start() and it actually is.
-      sub.stream = newSubstream(newHeaders, tracerFactory, previousAttemptCount, isTransparentRetry);
+      sub.stream = newSubstream(newHeaders, tracerFactory, previousAttemptCount, isTransparentRetry,
+          isHedging);
isHedging is just whether there is a hedging policy. It doesn't mean this particular stream is a hedge.
Can you give the argument a different name to not shadow the field by the same name? I don't think we hate ourselves enough to want two booleans with the same name in the same scope, but with different semantics.
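A minimal sketch of the rename being asked for, with hypothetical names (hasHedgingPolicy and the class name are illustrative, not from the PR): the parameter that only says "a hedging policy exists" no longer shadows the field that says "this stream is a hedge".

// Hypothetical shapes, for illustration only.
final class HedgingNamesSketch {
  // True only when this particular stream is a hedge attempt.
  private boolean isHedging;

  // The parameter name states what it actually means: the call has a hedging
  // policy configured. It no longer shadows the isHedging field above.
  void newSubstream(int previousAttemptCount, boolean isTransparentRetry,
      boolean hasHedgingPolicy) {
    // ... create the substream, consulting hasHedgingPolicy where needed ...
  }
}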
@@ -331,6 +335,7 @@ public ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata metada
       }
       if (++activeStreams == 1 && attemptStopwatch.isRunning()) {
         attemptStopwatch.stop();
This attemptStopwatch looks to be for grpc.client.attempt.duration? This looks busted (pre-existing). It looks like attemptStopwatch is tracking the retry_delay right now, as it only runs when there are no streams. Each stream needs its own attempt stopwatch, right? But right now there is only one per call.
CallAttemptsTracerFactory.attemptEnded records that a stream has ended (StatsTraceContext.streamClosed calls it), so the attemptStopwatch is started at the end of the first attempt and of each retry attempt. It therefore measures the time between stream attempts, not a per-call duration.
But there is a problem: RetriableStream creates its own anonymous tracerFactory and does not take the one put by MetricsClientInterceptor in the callOptions, so this needs to change.
I have a question. A call can only have one stream, and the only way activeStreams can be > 1 is if the tracer factory is shared between calls (which would mess up the calculation of the time between stream attempts anyway). OpenTelemetryMetricsModule.MetricsClientInterceptor creates the factory per call, though, not shared between calls. So we can never have more than one active stream, and there should be no need for synchronization either.
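To make the measurement described above concrete, here is a minimal, self-contained sketch (not the PR's code; all names are illustrative) of a per-call stopwatch that runs only while no attempt is in flight, so its elapsed time accumulates the gaps between attempts, i.e. the retry delay. The synchronized methods also illustrate one way the retryDelayNanos synchronization question raised below could be addressed.

import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: retry delay measured as "time with no attempt in flight".
final class RetryDelayTracker {
  private final Stopwatch betweenAttempts = Stopwatch.createUnstarted();
  private int activeStreams;

  // Called when an attempt (stream) starts: stop timing the gap.
  synchronized void attemptStarted() {
    if (++activeStreams == 1 && betweenAttempts.isRunning()) {
      betweenAttempts.stop();
    }
  }

  // Called when an attempt ends: if nothing is in flight, start timing the gap.
  synchronized void attemptEnded() {
    if (--activeStreams == 0) {
      betweenAttempts.start();
    }
  }

  // Total time this call spent between attempts (the retry delay).
  synchronized long retryDelayNanos() {
    return betweenAttempts.elapsed(TimeUnit.NANOSECONDS);
  }
}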
What I said about RetriableStream's anonymous tracerFactory is not a problem; StatsTraceContext.streamClosed iterates over all tracers and calls streamClosed on each of them.
@@ -331,6 +335,7 @@ public ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata metada
       }
       if (++activeStreams == 1 && attemptStopwatch.isRunning()) {
         attemptStopwatch.stop();
+        retryDelayNanos = attemptStopwatch.elapsed(TimeUnit.NANOSECONDS);
I can't tell how retryDelayNanos is synchronized.
It's the same as with OpenCensus.

Implements A96.