The runtime already has the wiring in place to make that happen: https://github.com/chapel-lang/chapel/blob/main/runtime/src/chpl-gpu.c#L162-L177
However, as noted in the comment there, basic performance tests showed that it wasn't beneficial. @mppf pointed out that we could probably take a hybrid approach: busy-wait for 1000 or so iterations while the stream is not ready, and only then yield, instead of yielding after every check on the stream.
That makes good sense to me, and it is worth trying for better performance when host/device overlap is important.
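
For illustration, here is a rough sketch of what the hybrid wait could look like. This is not the runtime's code: `hybrid_wait`, `fake_stream_ready`, `counting_yield`, and the callback types are all hypothetical names, and the fake stream just counts down polls so the logic can be exercised without a GPU. In the runtime, the readiness check and the yield would be whatever the linked code already uses.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical stand-ins for the runtime's stream query and task yield.
typedef bool (*stream_ready_fn)(void* stream);
typedef void (*yield_fn)(void);

// Poll the stream; yield only after `spin_limit` consecutive not-ready
// checks, then start spinning again. Returns the number of yields made.
// With spin_limit == 1 this degenerates to yielding after every check.
static int hybrid_wait(void* stream, stream_ready_fn is_ready,
                       yield_fn yield, int spin_limit) {
  assert(spin_limit > 0);
  int yields = 0;
  int spins = 0;
  while (!is_ready(stream)) {
    if (++spins >= spin_limit) {
      yield();      // give other tasks a chance to run
      yields++;
      spins = 0;    // restart the busy-wait window
    }
  }
  return yields;
}

// Fake stream for demonstration: `stream` points to an int holding how
// many more polls will report "not ready".
static bool fake_stream_ready(void* stream) {
  int* polls_left = (int*)stream;
  if (*polls_left > 0) {
    (*polls_left)--;
    return false;
  }
  return true;
}

// Fake yield that only counts how often it was called.
static int yield_calls = 0;
static void counting_yield(void) { yield_calls++; }
```

With a spin limit of 1000, a stream that becomes ready within a few polls never triggers a yield, while a long-running one yields once per 1000 checks. Whether 1000 is the right threshold would have to come from measurements.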