-
Notifications
You must be signed in to change notification settings - Fork 243
feat: track rows processed during model evaluation #5162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
d722c67
to
c3a8760
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice start, I think I follow the logic. I've left some comments for your consideration.
I'm not sure about the track_row_count
parameter on the engine adapter methods. I'd like to understand more about the undesirable queries that were getting picked up - perhaps that indicates we were being too coarse with the statistics gathering / wrapping in the wrong place
Building on @erindru's feedback, I wonder if in general this tracking logic could get pushed to the SnapshotEvaluator layer. Right now the engine adapter methods return nothing but if they returned row count then it seems like the SnapshotEvaluator could then determine what to do with that information. I don't think though I deeply understand this problem and this comes from a quick review and my general intuition. |
1e405ce
to
9377241
Compare
If we pass row counts around through the entire call stack between the scheduler and the engine adapter, I believe it starts getting very messy with the number of method signature changes. The scheduler is where we want to display the information, but it's typically on a different thread and reasonably far removed from This design that Trey is proposing means that we don't have to change the return type of the engine adapter methods or have to worry about passing things back up the stack / rely on every component in the chain between the It essentially uses a combination of thread-local + global state to bypass the call stack. It emits row counts in imo this is a much simpler / less invasive implementation |
9377241
to
6957330
Compare
6957330
to
d4d5afa
Compare
d00e0b9
to
a2bdfdc
Compare
f072873
to
97db434
Compare
97db434
to
b66fc6a
Compare
b66fc6a
to
eebb26e
Compare
Captures the number of rows processed (and potentially other information) during a model evaluation for display to the user.
Example running sushi on Bigquery (includes bytes processed for each model batch):
Context
cursor.row_count
property that contains the number of rows affected by a DML command execution (e.g.,INSERT
,CTAS
)Implementation context
_execute
method, so we read the row_count property there immediately aftercursor.execute
Determining what to track
track_row_count=True
in the execute call within themengine_adapter.insert_append
Tracker implementation
QueryExecutionContext
dataclass_execute
calls singlerecord_execution
method, which determines which tracker is active and should be used (if any)