Synthetic data generation for tabular data
This release makes a number of changes to how id columns are generated. By default, id columns with a regex will now have their values scrambled in the output. Id columns without a regex that are numeric will be created randomly. If they're not numeric, they will have a random suffix.
Additionally, improvements were made to the visibility of the get_loss_values_plot.
This release adds support for Python 3.12! It also adds a number of feature improvements. It adds a simplify_schema utility function to the sdv.utils.poc module which simplifies multi-table schemas so they can be run using HMASynthesizer. Multi-table data dictionaries can now be saved directly to CSVs using the sdv.datasets.local.save_csvs utility function. Additionally, generator-discriminator loss values can now be plotted directly from CTGAN using the get_loss_values_plot method. This release also adds error messages when trying to load an SDV synthesizer on an older version of the SDV, or when trying to re-fit a synthesizer from an older version of the SDV.
This release also fixes a number of bugs. Metadata auto-detection now validates that all primary keys are unique, and the metadata correctly validates sdtypes in a column relationship. Bugs in the HMASynthesizer that would cause the diagnostic score to not be equal to 1.0 for cardinality and data validity were fixed. Finally, errors in constraints now correctly raise a ConstraintsNotMetError instead of an InvalidData error.
SingleTablePreset (including FastMLPreset) - Issue #1855 by @lajohn4747
sequence_key when using PARSynthesizer - Issue #1883 by @frances-h
'truncnorm' distribution - Issue #1831 by @frances-h
IDGenerator for Primary Key columns - Issue #1862 by @lajohn4747
This release adds the poc utility submodule to help users more easily create a proof-of-concept with multi-table datasets. The poc submodule includes the drop_unknown_references utility function to automatically drop unknown references in a multi-table dataset. Additionally, multiple columns in the metadata can now be updated at once using the update_columns and update_columns_metadata methods. The SDV now also warns users when a synthesizer is loaded that was fitted on a different version of the SDV.
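To illustrate what dropping unknown references means, here is a minimal standard-library sketch. It is not the SDV implementation and does not use the SDV API: the helper name drop_orphan_rows, the toy tables, and the column names are all hypothetical. It only shows the idea that child rows whose foreign key has no matching parent key are removed.

```python
def drop_orphan_rows(parent_rows, child_rows, parent_key, foreign_key):
    """Keep only the child rows whose foreign key matches a known parent key.

    Hypothetical helper for illustration; not part of the SDV.
    """
    known_keys = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] in known_keys]


# Hypothetical toy tables: 'users' is the parent, 'sessions' is the child.
users = [{"user_id": 1}, {"user_id": 2}]
sessions = [
    {"session_id": "a", "user_id": 1},
    {"session_id": "b", "user_id": 3},  # references a user that does not exist
    {"session_id": "c", "user_id": 2},
]

clean_sessions = drop_orphan_rows(users, sessions, "user_id", "user_id")
kept_ids = [row["session_id"] for row in clean_sessions]
```

In this toy example only session "b" is dropped, because user 3 does not appear in the parent table.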
get_parameters function consistent between synthesizers - Issue #1756 by @fealho
get_table_parameters for the multi-table synthesizers - Issue #1757 by @fealho
update_columns and update_columns_metadata methods to metadata - Issue #1804 by @R-Palazzo
get_column_names method to metadata - Issue #1805 by @frances-h
drop_unknown_references - Issue #1845 by @R-Palazzo
poc module for utilities that help with proof-of-concept - Issue #1846 by @pvk-developer
utils module: Make internal functions private - Issue #1793 by @R-Palazzo
This release adds multiple improvements to handling premium transformers and column relationships, including using premium transformers even if the PII flag is set to true. Additionally, the SDV now warns users to save the metadata after auto-detection has been used. Semantic sdtype detection has also been improved to tokenize column names to prevent unexpected substring matches.
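The difference between substring matching and token matching on column names can be sketched as follows. This is an illustration only, assuming a simple tokenizer that splits on non-alphanumeric characters and camelCase boundaries; the function names and the keyword are hypothetical, and the SDV's actual detection rules may differ.

```python
import re


def tokenize(column_name):
    """Split a column name into lowercase tokens (assumed tokenization rule)."""
    # Insert a space at camelCase boundaries, e.g. 'userSSN' -> 'user SSN'.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", column_name)
    # Split on anything that is not a letter or digit and drop empty pieces.
    return [t for t in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if t]


def matches_by_substring(column_name, keyword):
    """Naive check: the keyword appears anywhere in the column name."""
    return keyword in column_name.lower()


def matches_by_token(column_name, keyword):
    """Stricter check: the keyword must be a whole token of the column name."""
    return keyword in tokenize(column_name)


# 'classname' happens to contain the letters 's', 's', 'n' in a row.
false_positive = matches_by_substring("classname", "ssn")
avoided = matches_by_token("classname", "ssn")
true_match = matches_by_token("user_ssn", "ssn")
```

With substring matching, a column called classname would be flagged for the keyword ssn. With token matching it is not, while user_ssn still matches.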
This release also fixes a few warning bugs and fixes an issue that would cause metadata.to_dict to fail for metadata loaded from older versions of the SDV. A few synthesizer bugs were also resolved. The quality of the sequence_index for the PARSynthesizer has been improved, and an issue that would cause CTGANSynthesizer, TVAESynthesizer, and CopulaGANSynthesizer to crash if all columns were to be generated from scratch has been fixed.
ScalarRange constraint - Issue #1737 by @fealho
sequence_index: Move the start dates into the context model - Issue #1760 by @frances-h
'category' (CTGAN, TVAE) - Issue #1735 by @frances-h
version module to align with SDV Enterprise - Issue #1761 by @R-Palazzo
This release makes a number of improvements. It introduces a new concept to the metadata known as column relationships! Column relationships can be used to define when certain groups of columns in a table should be treated as a special concept (e.g. address). You can add a column relationship by using the new add_column_relationship method. The metadata detection was also improved by allowing semantic sdtypes (e.g. 'email', 'phone_number') to be detected as primary keys.
This release also patches some bugs. An issue that broke likelihood matching in the HMASynthesizer was resolved. The CTGANSynthesizer no longer fails when using the FixedCombinations constraint. The Inequality constraint was also patched to handle datetimes better.
set_address_columns method is deprecated in favor of add_column_relationship.
BaseIndependentSampler crashes because it tries to cast id columns - Issue #1712 by @pvk-developer
CTGANSynthesizer when applying FixedCombinations constraint - Issue #1717 by @pvk-developer
This release adds support for the new Diagnostic Report from SDMetrics. This report calculates scores for three basic but important properties of your data: data validity, data structure, and, in the multi-table case, relationship validity. Data validity checks that the columns of your data are valid (e.g. correct range or values). Data structure makes sure the synthetic data has the correct columns. Relationship validity checks that key references are correct and that the cardinality is within the ranges seen in the real data.
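The three properties described above can be sketched with plain Python. This is a rough illustration under stated assumptions, not the SDMetrics scoring code: every function name, the toy values, and the exact formulas (simple fractions) are hypothetical stand-ins for the idea behind each property.

```python
from collections import Counter


def structure_score(real_columns, synthetic_columns):
    """Data structure: share of column names the two tables have in common."""
    real, synthetic = set(real_columns), set(synthetic_columns)
    return len(real & synthetic) / len(real | synthetic)


def range_validity(real_values, synthetic_values):
    """Data validity: fraction of synthetic values inside the real [min, max]."""
    low, high = min(real_values), max(real_values)
    return sum(low <= v <= high for v in synthetic_values) / len(synthetic_values)


def reference_validity(parent_keys, foreign_keys):
    """Relationship validity: fraction of foreign keys with an existing parent."""
    known = set(parent_keys)
    return sum(k in known for k in foreign_keys) / len(foreign_keys)


def children_per_parent(parent_keys, foreign_keys):
    """Number of child rows for each parent key (0 for parents with no children)."""
    counts = Counter(foreign_keys)
    return [counts[k] for k in parent_keys]


def cardinality_validity(real_counts, synthetic_counts):
    """Fraction of synthetic parents whose child count is within the real range."""
    low, high = min(real_counts), max(real_counts)
    return sum(low <= c <= high for c in synthetic_counts) / len(synthetic_counts)


# Hypothetical toy data.
structure = structure_score(["id", "age"], ["id", "age"])
validity = range_validity([18, 30, 65], [20, 70, 40, 18])
references = reference_validity([1, 2, 3], [1, 1, 2, 9])
real_counts = children_per_parent([1, 2, 3], [1, 1, 2])
synthetic_counts = children_per_parent([10, 11], [10, 10, 10, 11])
cardinality = cardinality_validity(real_counts, synthetic_counts)
```

In the toy data, the synthetic age 70 falls outside the real range, the foreign key 9 has no parent, and one synthetic parent has three children while real parents have at most two.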
Additionally, a few bugs were fixed and functionality was improved around synthesizers. It is now possible to access the loss values for the TVAESynthesizer and CTGANSynthesizer by using the get_loss_values method. The get_parameters method is now more detailed and returns all the parameters used to make a synthesizer. The metadata is now capable of detecting some common pii sdtypes. Finally, a bug that made every parent row generated by the HMASynthesizer have at least one child row was patched. This should improve cardinality.
SettingWithCopyWarning (HMASynthesizer) - Issue #1557 by @pvk-developer
get_parameters method for all multi-table synthesizers - Issue #1674 by @frances-h
This release adds an alert to the CTGANSynthesizer during preprocessing. The alert informs the user if the fitting of the synthesizer is likely to be slow on their schema. Additionally, it is now possible to enforce that sampled datetime values stay within the range of the fitted data!
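One way to keep sampled datetimes inside the fitted range is to clip them to the minimum and maximum seen during fitting. The sketch below assumes clipping, which the notes above do not confirm as the mechanism the SDV uses; the function name and the dates are hypothetical.

```python
from datetime import datetime


def clip_to_fitted_range(sampled, fitted):
    """Clip each sampled datetime to the [min, max] range of the fitted data.

    Hypothetical helper; clipping is an assumed strategy, not confirmed SDV behavior.
    """
    low, high = min(fitted), max(fitted)
    return [min(max(value, low), high) for value in sampled]


# Hypothetical fitted data spans the year 2020.
fitted_dates = [datetime(2020, 1, 1), datetime(2020, 12, 31)]
sampled_dates = [
    datetime(2019, 12, 25),  # earlier than anything seen during fitting
    datetime(2020, 6, 1),    # already in range
    datetime(2021, 1, 2),    # later than anything seen during fitting
]

clipped_dates = clip_to_fitted_range(sampled_dates, fitted_dates)
```

Values already inside the range are returned unchanged, and out-of-range values are moved to the nearest boundary.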
This release also makes internal changes to support address data in SDV Enterprise.
This release improves user messaging in multiple ways. The most notable is that users will now see an alert if the HMASynthesizer is likely to be slow for their data's schema. Additionally, the logger messaging for constraints and the error messaging when setting distributions on non-parametric models were made more detailed.
The visualization plots in the sdv.evaluation sub-package all got a new parameter called plot_type, which lets users specify the plot type to use when the inferred one is not useful. The sdv.datasets.local.load_csvs method now has a parameter called read_csv_parameters that allows users to specify how the CSVs should be read during loading. The same change was also made to the sdv.metadata.multi_table.detect_table_from_csv, sdv.metadata.multi_table.detect_from_csvs and sdv.metadata.single_table.detect_from_csv methods.
Multiple bugs were resolved including one that caused new categories to be created during the sample step of CTGANSynthesizer.
Several improvements and bug fixes were made in this release. Most notably, the metadata detection was substantially improved. Support for the 'unknown' sdtype was added, providing more flexibility in data representation. The software now attempts to intelligently detect primary keys and identify parent-child relationships in the metadata, streamlining the metadata creation process.
Additionally, issues related to conditional sampling with negative float values, the inability to update transformers for columns created by constraints, and compatibility with numpy version 1.25 and higher were addressed. The default branch was also switched from 'master' to 'main' for better development practices. Various bugs and errors, including those involving HMA and datetime format detection, were also resolved.
id (leave others as unknown) - Issue #1598 by @amontanez24
'gaussian_kde' with HMA - Issue #1604 by @frances-h
KeyError) - Issue #1454 by @frances-h
ValueError: Invalid distribution specification when setting numerical_distributions on child table (HMA) - Issue #1605 by @fealho
This release makes multiple improvements to the metadata. Both the single and multi table metadata classes now have a validate_data method. This method runs checks to validate the data against the current specifications in the metadata. The SingleTableMetadata.visualize method is also improved. The sequence index is now shown in the same section as the sequence key. It also now shows all key and index information (e.g. sequence key, primary key, sequence index) in one section.
The CTGANSynthesizer has been made more efficient in the following ways:
preprocess like categorical columns are.
CTGAN skip the one-hot encoding step.
Additional changes include that the columns labeled with the sdtype id will now go through the IDGenerator transformer by default, and constraint transformations that were being overwritten during sampling will now be respected.