Pointblank Versions

Data quality assessment and metadata reporting for data frames and database tables

v0.12.1

1 month ago
  • Ensured that the column string is a symbol before constructing the expression for the col_vals_*() functions.

  • No longer resolve columns with tidyselect when the target table cannot be materialized.

  • Relaxed tests on tidyselect error messages.

v0.12.0

2 months ago

New features

  • Complete {tidyselect} support for the columns argument of all validation functions, as well as in has_columns() and info_columns(). The columns argument can now take familiar column-selection expressions, just as one would use inside dplyr::select(). This also begins a process of deprecation:

    • columns = vars(...) will continue to work, but c() now supersedes vars().
    • If passing an external vector of column names, it should be wrapped in all_of().
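    The new selection styles can be sketched as follows, using the package's included small_table dataset (a minimal sketch; any selection helper that works in dplyr::select() should work here):

    ```r
    library(pointblank)

    # tidyselect expressions now work directly in `columns`
    agent <- create_agent(tbl = small_table) |>
      col_vals_not_null(columns = c(a, b)) |>        # c() supersedes vars()
      col_exists(columns = starts_with("date")) |>   # a selection helper
      interrogate()

    # An external character vector of names should be wrapped in all_of()
    key_cols <- c("a", "c")
    has_columns(small_table, columns = all_of(key_cols))
    ```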
  • The label argument of validation functions now exposes the following string variables via {glue} syntax:

    • "{.step}": The validation step name
    • "{.col}": The current column name
    • "{.seg_col}": The current segment's column name
    • "{.seg_val}": The current segment's value/group

    These dynamic values may be useful for validations that get expanded into multiple steps.
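    For instance, since a multi-column columns value expands into one step per column, each step's label can resolve "{.col}" separately (a minimal sketch against small_table):

    ```r
    library(pointblank)

    # `columns = c(a, c)` expands into two steps; the label is resolved
    # per step, so each gets its own "{.col}" value
    agent <- create_agent(tbl = small_table) |>
      col_vals_not_null(
        columns = c(a, c),
        label = "Null check on `{.col}` ({.step})"
      ) |>
      interrogate()
    ```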

  • interrogate() gains two new options for printing progress in the console output:

    • progress: Whether interrogation progress should be printed to the console (TRUE for interactive sessions, same as before)
    • show_step_label: Whether each validation step's label value should be printed alongside the progress.
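    Both options are passed directly to interrogate() (a minimal sketch; argument names as described above):

    ```r
    library(pointblank)

    agent <- create_agent(tbl = small_table) |>
      col_vals_gt(columns = a, value = 0, label = "a is positive")

    # Print per-step progress, with each step's `label`, to the console
    agent <- interrogate(agent, progress = TRUE, show_step_label = TRUE)
    ```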

Minor improvements and bug fixes

  • Fixes issue with rendering reports in Quarto HTML documents.

  • When no columns are returned from a {tidyselect} expression in columns, the agent's report now displays the originally supplied expression instead of a blank entry (e.g., in create_agent(small_table) |> col_vals_null(matches("z"))).

  • Fixes issue with the hashing implementation to improve performance and alignment of validation steps in the multiagent.

v0.11.4

1 year ago
  • Fixes issue with gt 0.9.0 compatibility.

v0.11.3

1 year ago
  • Fixes issue with tables not rendering due to interaction with the gt package.

v0.11.2

1 year ago
  • Internal changes were made to ensure compatibility with an in-development version of R.

v0.11.1

1 year ago
  • Updated all help files to pass HTML validation.

v0.11.0

1 year ago

New features

  • The row_count_match() function can now match the count of rows in the target table to a literal value (in addition to comparing row counts to a secondary table).

  • The analogous col_count_match() function was added to compare column counts in the target table to those of a secondary table, or to match against a literal value.
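    Matching against literal values can be sketched like this (small_table has 13 rows and 8 columns):

    ```r
    library(pointblank)

    agent <- create_agent(tbl = small_table) |>
      row_count_match(count = 13) |>   # literal row count
      col_count_match(count = 8) |>    # literal column count
      interrogate()
    ```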

  • Substitution syntax has been added to the tbl_store() function with {{ <name> }}. This is a great way to make table-prep more concise, readable, and less prone to errors.
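    A minimal sketch of the substitution syntax (the table names "sml_table" and "sml_high" are made up for illustration):

    ```r
    library(pointblank)

    # A later table-prep formula can reuse an earlier one via {{ <name> }}
    store <- tbl_store(
      sml_table ~ small_table,
      sml_high ~ {{ sml_table }} %>% dplyr::filter(f == "high")
    )

    # Materialize the derived table prep from the store
    tbl_get("sml_high", store = store)
    ```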

  • The get_informant_report() function has been enhanced with more width options. Aside from the "standard" and "small" sizes, we can now supply any pixel- or percent-based width to precisely size the reporting.

  • Added support for validating data in BigQuery tables.

Documentation

  • All functions in the package now have better usage examples.

v0.10.0

2 years ago

New features

  • The new function row_count_match() (plus expect_row_count_match() and test_row_count_match()) checks for exact matching of rows across two tables (the target table and a comparison table of your choosing). Works equally well for local tables and for database and Spark tables.

  • The new tbl_match() function (along with expect_tbl_match() and test_tbl_match()) checks for an exact matching of the target table with a comparison table. It will check for a strict match on table schemas, on equivalent row counts, and then exact matches on cell values across the two tables.
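    The test variants of both functions can be sketched as follows (argument names follow the current documentation; two copies of the same table will match):

    ```r
    library(pointblank)

    tbl_a <- small_table
    tbl_b <- small_table

    test_row_count_match(tbl_a, count = tbl_b)  # compare row counts
    test_tbl_match(tbl_a, tbl_compare = tbl_b)  # strict schema/row/cell match
    ```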

Minor improvements and bug fixes

  • The set_tbl() function was given the tbl_name and label arguments to provide an opportunity to set metadata on the new target table.

  • Support for mssql tables has been restored and works exceedingly well for the majority of validation functions (the few that are incompatible provide messaging about not being supported).

Documentation

  • All functions in the package now have usage examples.

  • An RStudio Cloud project has been prepared with .Rmd files that contain explainers and runnable examples for each function in the package. Look at the project README for a link to the project.

Breaking changes

  • The read_fn argument in create_agent() and create_informant() has been deprecated in favor of an enhanced tbl argument. Now, we can supply a variety of inputs to tbl for associating a target table to an agent or an informant. With tbl, it's now possible to provide a table (e.g., data.frame, tbl_df, tbl_dbi, tbl_spark, etc.), an expression (a table-prep formula or a function) to read in the table only at interrogation time, or a table source expression to get table preparations from a table store (as an in-memory object or as defined in a YAML file).
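    The different kinds of tbl input can be sketched like this (a minimal sketch using the included small_table dataset):

    ```r
    library(pointblank)

    # A table object: materialized immediately
    agent_1 <- create_agent(tbl = small_table)

    # A table-prep formula: the table is read only at interrogation time
    agent_2 <- create_agent(tbl = ~ small_table)

    # A function also defers reading until interrogation
    agent_3 <- create_agent(tbl = function() small_table)
    ```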

  • The set_read_fn(), remove_read_fn(), and remove_tbl() functions were removed since the read_fn argument has been deprecated (and there's virtually no need to remove a table from an object with remove_tbl() now).

v0.9.0

2 years ago

New features

  • The new rows_complete() validation function (along with the expect_rows_complete() and test_rows_complete() expectation and test variants) checks that rows are complete (i.e., contain no NA/NULL values), optionally constrained to a selection of specified columns.

  • The new function serially() (along with expect_serially() and test_serially()) allows for a series of tests to run in sequence before either culminating in a final validation step or simply exiting the series. This construction allows for pre-testing that may make sense before a validation step. For example, there may be situations where it's vital to check a column type before performing a validation on the same column.
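    The column-type pre-test described above can be sketched as a series (each formula wraps a test_*() function, with the final validation step last):

    ```r
    library(pointblank)

    # Pre-test that column `a` is numeric before validating its values
    agent <- create_agent(tbl = small_table) |>
      serially(
        ~ test_col_is_numeric(., columns = vars(a)),    # must pass first
        ~ col_vals_gt(., columns = vars(a), value = 0)  # final validation
      ) |>
      interrogate()
    ```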

  • The specially()/expect_specially()/test_specially() functions enable custom validations/tests/expectations with a user-defined function. The preconditions argument and other common arguments are still available for convenience. Because the user-defined function must return a logical vector of passing/failing test units (or a table where the rightmost column is logical), the results can be incorporated quite easily into the standard pointblank reporting.
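    A minimal sketch of a user-defined check (the function receives the target table and returns a logical vector, one element per test unit):

    ```r
    library(pointblank)

    # Custom check: is column `d` greater than column `a`, row by row?
    agent <- create_agent(tbl = small_table) |>
      specially(fn = function(x) x$d > x$a) |>
      interrogate()
    ```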

  • The info_columns_from_tbl() function is a super-convenient wrapper for the info_columns() function. Say you're making a data dictionary with an informant and you already have the table metadata somewhere as a table: you can use that here and not have to call info_columns() many, many times.
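    Paired with the datasets mentioned below, building a data dictionary becomes a one-liner (a minimal sketch):

    ```r
    library(pointblank)

    # Pull column metadata from an existing metadata table in one call,
    # instead of one info_columns() call per column
    informant <- create_informant(tbl = game_revenue) |>
      info_columns_from_tbl(tbl = game_revenue_info)
    ```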

  • Added the game_revenue_info dataset which contains metadata for the extant game_revenue dataset. Both datasets pair nicely together in examples that create a data dictionary with create_informant() and info_columns_from_tbl().

  • Added the table transformer function tt_tbl_colnames() to get a table's column names for validation.

Minor improvements and bug fixes

  • Input data tables with label attribute values in their columns will be displayed in the 'Variables' section of the scan_data() report. This is useful when scanning imported SAS tables (which often have labeled variables).

  • The all_passed() function has been improved such that failed validation steps (that return an evaluation error, perhaps because of a missing column) result in FALSE; the i argument has been added to all_passed() to optionally get a subset of validation steps before evaluation.

  • For those expect_*() functions that can handle multiple columns, pointblank now correctly stops at the first failure and provides the correct reporting for that. Passing multiple columns really should mean processing multiple steps in serial, and previously this was handled incorrectly.

v0.8.0

2 years ago

New features

  • The new draft_validation() function will create a starter validation .R or .Rmd file with just a table as an input. It uses a new 'column roles' feature to develop a starter set of validation steps based on what kind of data the columns contain (e.g., latitude/longitude values, URLs, email addresses, etc.).

  • The validation function col_vals_within_spec() (and the variants expect_col_vals_within_spec() and test_col_vals_within_spec()) will test column values against a specification like phone numbers ("phone"), VIN numbers ("VIN"), URLs ("url"), email addresses ("email"), and much more ("isbn", "postal_code[<country_code>]", "credit_card", "iban[<country_code>]", "swift", "ipv4", "ipv6", and "mac").
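    A sketch of a specification check (the table and its "email_col" column are made up for illustration; columns took vars() in this release):

    ```r
    library(pointblank)

    tbl <- data.frame(email_col = c("mara@example.com", "not-an-email"))

    # Validate the column against the built-in "email" specification
    agent <- create_agent(tbl = tbl) |>
      col_vals_within_spec(columns = vars(email_col), spec = "email") |>
      interrogate()
    ```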

  • A large cross section of row-based validation functions can now operate on segments of the target table, running a particular validation on slices of the data. The segmentation is made possible by the new segments argument, which takes an expression that segments the target table by column values. It can be given in one of two ways: (1) as one or more column names containing keys to segment on, or (2) as a two-sided formula where the LHS holds a column name and the RHS contains the column values to segment on (allowing for a subset of keys for segmentation).
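    Both forms of segments can be sketched against small_table, whose column f holds the keys "low", "mid", and "high":

    ```r
    library(pointblank)

    agent <- create_agent(tbl = small_table) |>
      col_vals_gt(
        columns = vars(d), value = 0,
        segments = vars(f)               # one step per distinct value of f
      ) |>
      col_vals_gt(
        columns = vars(d), value = 0,
        segments = f ~ c("low", "high")  # only these two segments
      ) |>
      interrogate()
    ```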

  • The default printing of the multiagent object is now a stacked display of agent reports. The wide report (useful for comparisons of validations targeting the same table over time) is available in the improved get_multiagent_report() function (with display_mode = "wide").

  • Exporting the reporting is now much easier with the new export_report() function. It will export objects such as the agent (for validations), the informant (for table metadata), and the multiagent (a series of validations), and, also those objects containing customized reports (from scan_data(), get_agent_report(), get_informant_report(), and get_multiagent_report()). You'll always get a self-contained HTML file of the report from any use of export_report().

  • A new family of functions has been added to pointblank: Table Transformers! These functions can radically transform a data table and either provide a wholly different table (like a summary table or table properties table) or do some useful filtering in a single step. This can be useful for preparing the target table for validation or when creating temporary tables (through preconditions) for a few validation steps (e.g., validating table properties or string lengths). As a nice bonus these transformer functions will work equally well with data frames, database tables, and Spark tables. The included functions are: tt_summary_stats(), tt_string_info(), tt_tbl_dims(), tt_time_shift(), and tt_time_slice().
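    The transformers can be used on their own or inside preconditions; a minimal sketch of each against small_table:

    ```r
    library(pointblank)

    tt_summary_stats(small_table)   # summary stats for numeric columns
    tt_string_info(small_table)     # string-length info for character columns
    tt_tbl_dims(small_table)        # row/column counts as a table

    # Shift all date/date-time values forward by two years
    tt_time_shift(small_table, time_shift = "2y")
    ```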

  • Two new datasets have been added: specifications and game_revenue. The former dataset can be used to test out the col_vals_within_spec() validation function. The latter dataset (with 2,000 rows) can be used to experiment with the tt_time_shift() and tt_time_slice() table transformer functions.

Minor improvements and bug fixes

  • Added the Polish ("pl"), Danish ("da"), Turkish ("tr"), Swedish ("sv"), and Dutch ("nl") translations.

  • The scan_data() function is now a bit more performant, testable, and better at communicating progress in generating the report.

  • The preconditions argument, used to modify the target table in a validation step, is now improved by (1) checking that a table object is returned after evaluation, and (2) correcting the YAML writing of any preconditions expression that's provided as a function.

  • The x_write_disk() and x_read_disk() functions have been extended to allow the writing and reading of ptblank_tbl_scan objects (returned by scan_data()).

  • Print methods received some love in this release. Now, scan_data() table scan reports look much better in R Markdown. Reporting objects from get_agent_report(), get_informant_report(), and get_multiagent_report() now have print methods and work beautifully in R Markdown as a result.

  • The incorporate() function, when called on an informant object, now emits styled messages to the console. And when using yaml_exec() to process an arbitrary amount of YAML-based agents and informants, you'll be given information about that progress in the console.

Documentation

  • Many help files were overhauled so that (1) things are clearer, (2) more details are provided (if things are complex), and (3) many ready-to-run examples are present. The functions with improved help in this release are: all_passed(), get_data_extracts(), get_multiagent_report(), get_sundered_data(), has_columns(), write_testthat_file(), x_write_disk(), and yaml_exec().