Vignette refresh #642

Draft
wants to merge 2 commits into
base: main

Collaborator

@hfrick hfrick commented Aug 21, 2025

As discussed, I'm chipping away at a refresh of the vignettes. Overall, I'm aiming to have

  • a vignette covering the core functionality of the various validation rules and how they can be combined into a validation plan/agent
  • a vignette focused on schema validation aka "validating the fundamentals" of the shape of the data before validating the content of the data
  • a vignette centered around "Taking action" covering getting notified, stopping automation, and inspecting results
  • and possibly one around tailoring the validation report to stakeholders/people who understand the context of the data well but may not be data-focused themselves

I'm gonna keep this PR as a draft for now but your comments on the first two vignettes would be very welcome already!

Collaborator Author

This one leans heavily on the user guide for the Python version, but is adapted for the R version.

Comment on lines +145 to +146
You can relax the validation further by allowing `NULL` types in the schema, which means that the column can be of any type or even missing from the table.
<!-- This is useful when you want to validate the presence of a column without enforcing a specific type or the column -->
Collaborator Author

Is there a way to check that a column exists without bothering with the type? I wasn't expecting the NULL to allow the column to be missing.

Member

There is, but it's hacky and not well explained, so it should be improved in the future:

library(pointblank)

small_table %>%
  expect_col_schema_match(
    schema = col_schema(
      date_time = "POSIXct",
      date = "Date",
      a = NULL,        # Column exists but type is ignored
      b = NULL,        # Column exists but type is ignored  
      f = "character",
      e = "logical"
    ),
    complete = FALSE,
    in_order = FALSE,
    is_exact = FALSE  # Required for NULL to work
  )

This seems more like a side effect of exact type-matching and isn't very good API design.

Collaborator Author

@hfrick hfrick Aug 22, 2025

I think your example passes mainly because complete = FALSE: your schema is missing columns c and d.

Using is_exact = FALSE on its own does not turn b = NULL into a check that b exists:

library(pointblank)

# baseline: passes
data.frame(a = 1:2) |>
  col_schema_match(col_schema(a = "integer"))
#>   a
#> 1 1
#> 2 2

# add b to the data frame and to the schema as NULL: the strict check fails, as it should
data.frame(a = 1:2, b = 1:2) |>
  col_schema_match(col_schema(a = "integer", b = NULL))
#> Error: Failure to validate that column schemas match.
#> The `col_schema_match()` validation failed beyond the absolute threshold level (1).
#> * failure level (1) >= failure threshold (1)

# relaxing `is_exact` allows the check to pass
data.frame(a = 1:2, b = 1:2) |>
  col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE)
#>   a b
#> 1 1 1
#> 2 2 2

# but it still passes when b is missing from the data frame
# i.e. it's not a check for existence
data.frame(a = 1:2) |>
  col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE)
#>   a
#> 1 1
#> 2 2

Created on 2025-08-22 with reprex v2.1.1

Member

Thanks for checking how these options interact. Definitely need to just make NULL ignore the column type check (but still check column existence), regardless of the options!

The default is to define the schema in R types like `"numeric"` or `"character"` and you can use it to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases such as `tbl_dbi` objects. While it may be convenient to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can define the schema in SQL types and validate directly against the SQL table without pulling data into R.

```{r}
#| label: types-sql
Collaborator Author

I'm not particularly database-savvy, so if you spot any ways to improve this example, please let me know!

Member

You could say something like "...in SQL types (like VARCHAR and BIGINT) and validate..."

schema_sql <- col_schema(
  amount = "REAL",
  customer_name = "TEXT",
  sale_date = "REAL",
Collaborator Author

Not loving this conversion of the date format from R. Is there a way to make this better?

Member

One option is to use DuckDB instead. It's much better with dates/times and it's a supported input format.

@hfrick hfrick requested a review from rich-iannone August 21, 2025 14:38
@rich-iannone
Member

I've just read through the vignettes more carefully and I think they are both well written!

Member

@rich-iannone rich-iannone left a comment

Everything looks good! I think we need to leave out the part about checking columns w/o column types/classes until we fix it in the codebase (I'll create an issue for that). Once that's implemented, the vignette could be revised in a separate PR to put that example back in (it's a valuable usage example!).
