Vignette refresh #642
base: main
Conversation
This one leans heavily on the user guide for the Python version, but adapted for the R version.
You can relax the validation further by allowing `NULL` types in the schema, which means that the column can be of any type or even missing from the table.
<!-- This is useful when you want to validate the presence of a column without enforcing a specific type -->
Is there a way to check that a column exists without bothering with the type? I wasn't expecting the `NULL` to allow the column to be missing.
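For checking presence alone, pointblank's `col_exists()` validation step may be the more direct route. A minimal sketch using the package's built-in `small_table`:

```r
library(pointblank)

# Assert that columns a and b exist in the table; their types are
# never inspected, so this is a pure existence check
small_table %>%
  col_exists(columns = c(a, b))
```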
There is, but it's hacky, not well-explained, and should be improved in the future:

```r
small_table %>%
  expect_col_schema_match(
    schema = col_schema(
      date_time = "POSIXct",
      date = "Date",
      a = NULL, # column exists but type is ignored
      b = NULL, # column exists but type is ignored
      f = "character",
      e = "logical"
    ),
    complete = FALSE,
    in_order = FALSE,
    is_exact = FALSE # required for NULL to work
  )
```
This seems more like a side effect of exact type-matching and isn't very good API design.
I think your example is passing more because `complete = FALSE` and your schema is missing columns `c` and `d`. Only using `is_exact = FALSE` does not turn `b = NULL` into a check that `b` exists:
```r
library(pointblank)

# baseline: passes
data.frame(a = 1:2) |>
  col_schema_match(col_schema(a = "integer"))
#>   a
#> 1 1
#> 2 2

# add b to the data frame and to the schema as NULL:
# the strict check fails, as it should
data.frame(a = 1:2, b = 1:2) |>
  col_schema_match(col_schema(a = "integer", b = NULL))
#> Error: Failure to validate that column schemas match.
#> The `col_schema_match()` validation failed beyond the absolute threshold level (1).
#> * failure level (1) >= failure threshold (1)

# relaxing `is_exact` allows the check to pass
data.frame(a = 1:2, b = 1:2) |>
  col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE)
#>   a b
#> 1 1 1
#> 2 2 2

# but it still passes when b is missing from the data frame,
# i.e. it's not a check for existence
data.frame(a = 1:2) |>
  col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE)
#>   a
#> 1 1
#> 2 2
```
Created on 2025-08-22 with reprex v2.1.1
Thanks for checking how these options interact. Definitely need to just make `NULL` ignore the column type check (but still check column existence), regardless of the options!
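To make the intent concrete, a hypothetical sketch of how `NULL` could behave after that fix (this is not current pointblank behavior):

```r
library(pointblank)

# Hypothetical future semantics: b = NULL means "b must exist, any type"
data.frame(a = 1:2, b = letters[1:2]) |>
  col_schema_match(col_schema(a = "integer", b = NULL))
# would pass: b exists, its character type is ignored

data.frame(a = 1:2) |>
  col_schema_match(col_schema(a = "integer", b = NULL))
# would fail: b is missing entirely
```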
The default is to define the schema in R types like `"numeric"` or `"character"`, and you can use it to validate any of the tables pointblank supports: not just data frames in R but also tables in databases, such as `tbl_dbi` objects. While it may be convenient to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can define the schema in SQL types and validate directly against the SQL table without pulling data into R.
````
```{r}
#| label: types-sql
```
````
I'm not particularly database-savvy, so if you spot any ways to improve this example, please let me know!
You could say something like "...in SQL types (like `VARCHAR` and `BIGINT`) and validate..."
```r
schema_sql <- col_schema(
  amount = "REAL",
  customer_name = "TEXT",
  sale_date = "REAL",
```
Not loving this conversion of the date format from R; is there a way to make this better?
One option is to use DuckDB instead. It's much better with dates/times and it's a supported input format.
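A sketch of what that could look like with DuckDB (assuming the `duckdb` and `dplyr` packages are installed; the exact SQL type strings reported by the driver may differ):

```r
library(pointblank)
library(DBI)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "sales", data.frame(
  amount = 19.99,
  customer_name = "Ann",
  sale_date = as.Date("2024-01-15")
))

# DuckDB keeps a true DATE column, so the SQL schema can name DATE
# rather than the REAL that SQLite falls back to
schema_sql <- col_schema(
  amount = "DOUBLE",
  customer_name = "VARCHAR",
  sale_date = "DATE",
  .db_col_types = "sql"
)

dplyr::tbl(con, "sales") %>%
  col_schema_match(schema = schema_sql)

DBI::dbDisconnect(con, shutdown = TRUE)
```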
I've just read through the vignettes more carefully and I think they are both well written!
Everything looks good! I think we need to leave out the part about checking columns w/o column types/classes until we fix it in the codebase (I'll create an issue for that). Once that's implemented the vignette could be revised in a separate PR to put that example back in (it's a valuable usage example!).
As discussed, I'm chipping away at a refresh of the vignettes. Overall, I'm aiming to have
I'm gonna keep this PR as a draft for now but your comments on the first two vignettes would be very welcome already!