The Burden of Choice: Models in Synthetic Data Vault
Synthetic Data Generation - This article is part of a series.
Synthetic Data Vault (SDV) is an open-source Python library for modeling real datasets and sampling synthetic data from them. SDV handles single-table (tabular), multi-table (relational), and sequential (time series) data. It combines classical statistical models, such as Gaussian copulas, with deep learning models, such as GANs and recurrent networks, to learn the structure and dependencies of the real data and generate realistic synthetic data that can be used for various applications.
Please note that this article was written with the assistance of GPT. While I provided all of the points that I intended to convey, the article is not as tightly knit or polished as I’d like. I do believe, however, that this is a reasonably appropriate use case for GPT and that the major ideas come across. I would like to treat this article as a placeholder until I have the time to do that polishing.
What is a synthesizer?
A synthesizer is a machine learning model that learns from a real data source and generates synthetic data that mimics its characteristics and patterns. Each synthesizer targets a specific type of data, such as tabular, relational, or time series data, and can be customized and tuned according to the user’s needs and preferences.
Types of models available in SDV
SDV provides different types of models (synthesizers) for generating synthetic data from different types of real data sources. Some of the available synthesizers are:
- GaussianCopula: This synthesizer uses a Gaussian copula model to capture the marginal distributions and correlations of tabular data. It can handle numerical, categorical, ordinal, datetime, and boolean variables. It also supports conditional sampling and rejection sampling methods.
- CTGAN: This synthesizer uses a conditional generative adversarial network (GAN) to model the joint distribution of tabular data. It can handle numerical and categorical variables, as well as missing values. It also supports conditional sampling and data transformations.
- SDV (HMA): For relational data, SDV provides a hierarchical model (HMA1 in SDV 0.x, HMASynthesizer in SDV 1.x). It applies Gaussian copulas recursively across parent and child tables rather than relying on deep neural networks. It can handle multiple tables with primary and foreign keys, as well as numerical and categorical variables, and it samples child rows hierarchically from their parent rows.
- PAR: This synthesizer (Probabilistic AutoRegressive) uses a recurrent neural network (RNN) to model the temporal dependencies of time series data. It can handle univariate or multivariate time series, as well as irregular or missing timestamps. It can condition generation on context columns that stay constant within a sequence. It is meant for generating new sequences rather than forecasting existing ones.
- MultiMod: I could not find a synthesizer by this name in the SDV documentation. SDV’s documented synthesizers cover single-table, multi-table, and sequential data, so images, text, and audio fall outside its scope. Variational autoencoders do appear in SDV, but as TVAE, a synthesizer for tabular data.
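To make the Gaussian copula idea from the list above more concrete, here is a minimal two-column sketch using only the Python standard library. It is not SDV's implementation: the function names, the toy height and weight values, and the choice of empirical (rank-based) marginals are all my own assumptions for illustration.

```python
import math
import random
from bisect import bisect_right
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()


def pearson(a, b):
    """Plain Pearson correlation of two equally long sequences."""
    mean_a, mean_b = fmean(a), fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def to_normal_scores(values):
    """Map values to standard-normal scores through their empirical ranks."""
    ordered = sorted(values)
    n = len(values)
    # rank / (n + 1) keeps the CDF estimate strictly inside (0, 1).
    return [STANDARD_NORMAL.inv_cdf(bisect_right(ordered, v) / (n + 1)) for v in values]


def from_uniform(u, ordered):
    """Invert the empirical CDF: pick the observed value at quantile u."""
    index = min(int(u * len(ordered)), len(ordered) - 1)
    return ordered[index]


def fit_and_sample(x, y, num_rows, seed=0):
    """Fit a two-column Gaussian copula and sample num_rows synthetic pairs."""
    rng = random.Random(seed)
    # 1. Marginals: remember the sorted observed values of each column.
    x_sorted, y_sorted = sorted(x), sorted(y)
    # 2. Dependence: correlation of the columns after mapping to normal scores.
    rho = pearson(to_normal_scores(x), to_normal_scores(y))
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(num_rows):
        # 3. Draw a pair of standard normals with correlation rho.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual * rng.gauss(0, 1)
        # 4. Map each normal back to its column's own distribution.
        rows.append((
            from_uniform(STANDARD_NORMAL.cdf(z1), x_sorted),
            from_uniform(STANDARD_NORMAL.cdf(z2), y_sorted),
        ))
    return rows


# Toy data, made up for this example.
heights = [150, 155, 160, 165, 170, 175, 180, 185]
weights = [52, 50, 61, 58, 72, 69, 85, 80]
synthetic = fit_and_sample(heights, weights, num_rows=500, seed=0)
print(synthetic[:3])
```

The sketch separates the two things the copula captures: each column's marginal distribution (steps 1 and 4) and the dependence between columns (steps 2 and 3). A real implementation would fit parametric marginals and a full correlation matrix for many columns.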
When to use what
The choice of synthesizer depends on the type and characteristics of the real data source, as well as the desired quality and diversity of the synthetic data. For example:
- If the real data is tabular and has simple or linear relationships between variables, then GaussianCopula may be a good choice.
- If the real data is tabular and has complex or nonlinear relationships between variables, then CTGAN may be a better choice.
- If the real data is relational and has multiple tables with foreign keys, then SDV’s hierarchical relational model (HMA) may be the best choice.
- If the real data is time series and has temporal dependencies between variables, then PAR may be a suitable choice.
- If the real data is multi-modal, such as images, text, or audio, then SDV is probably not the right tool, since its documented synthesizers target tabular, relational, and sequential data.
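The rules of thumb above can be written down as a small lookup. The helper below is hypothetical and not part of SDV: the function name, its parameters, and the modality labels are my own, and it only returns the name of a model family as a string. For relational data it returns HMA, the hierarchical relational model that ships with SDV.

```python
def suggest_synthesizer(modality, complex_relationships=False):
    """Return the name of a synthesizer family for a data modality.

    `modality` is one of 'tabular', 'relational', or 'time_series'.
    `complex_relationships` only matters for tabular data.
    """
    if modality == "tabular":
        # Simple or linear relationships: copula; complex or nonlinear: GAN.
        return "CTGAN" if complex_relationships else "GaussianCopula"
    if modality == "relational":
        return "HMA"
    if modality == "time_series":
        return "PAR"
    raise ValueError(f"no synthesizer suggested for modality {modality!r}")


print(suggest_synthesizer("tabular"))
print(suggest_synthesizer("tabular", complex_relationships=True))
```

This is only a starting point. In practice it is worth fitting more than one candidate and comparing their scores.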
Utilizing quality reports
To evaluate the quality and validity of the synthetic data generated by SDV, users can use the metrics and tools that SDV provides. One of these tools is the quality report, a summary of how closely the statistical properties of the synthetic data match those of the real data. In SDV 1.x the report is generated with the sdv.evaluation.single_table.evaluate_quality function, which takes the real data, the synthetic data, and the metadata describing the columns. In SDV 0.x the equivalent entry point was the sdv.evaluation.evaluate function, which takes the synthetic data, the real data, and optionally a list of metrics to compute.
The quality report provides the following information about the real and synthetic data:
- Overview: An overall quality score between 0 and 1, computed as the average of the property scores described below.
- Data Types: The report relies on the metadata to know whether each column is numerical, categorical, datetime, or boolean, and picks the comparison metric accordingly.
- Distributions: Reported as the Column Shapes property, this measures how closely the marginal distribution of each column in the synthetic data matches the real one. Per-column plots are available to compare the two distributions visually.
- Correlations: Reported as the Column Pair Trends property, this measures how well the relationships between pairs of columns are preserved, using correlations for numerical pairs and contingency tables for categorical pairs. The pairwise scores can be viewed as a heatmap.
- Metrics: The per-column and per-pair details name the metric behind each score, such as KSComplement for numerical columns and TVComplement for categorical columns.
The quality report can help users to assess how well the synthesizer has learned from the real data and how realistic and diverse the synthetic data is. The quality report can also help users to identify and diagnose any potential issues or limitations of the synthesizer or the synthetic data.
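To give a feel for what a distribution score measures, here is a pure-Python sketch of two similarity scores of the kind used for single columns: one minus the Kolmogorov-Smirnov statistic for numerical values, and one minus the total variation distance for categorical values. As I understand it, the SDMetrics library behind the report calls these KSComplement and TVComplement. The function names and toy values below are my own, and this is not the library's code.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        # Empirical CDF: share of values less than or equal to x.
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance of category frequencies."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in set(real_freq) | set(synth_freq)
    )
    return 1.0 - distance


# Toy columns, made up for this example.
real_ages = [23, 35, 41, 52, 60]
synthetic_ages = [25, 33, 45, 50, 58]
real_plans = ["basic", "basic", "premium", "basic"]
synthetic_plans = ["basic", "premium", "premium", "basic"]

print(ks_complement(real_ages, synthetic_ages))
print(tv_complement(real_plans, synthetic_plans))
```

Both scores lie between 0 and 1, and identical columns score exactly 1, which is why higher values are read as better quality.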
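How to determine the best model via metrics
To determine the best model (synthesizer) for a given real data source, users can compare the quality reports of different synthesizers. As far as I can tell, SDV does not ship a single function that compares several synthesizers at once, so the comparison is a short loop: fit each synthesizer on the real data, sample a synthetic dataset from it, and score that dataset with the same evaluation function.
Collecting the scores gives a table with one row per synthesizer and one column per score. The scores lie between 0 and 1, where higher scores indicate that the synthetic data is more similar to the real data. Users can use the table to rank and select the best synthesizer for their specific use case.
For example, suppose we have a tabular data source and we want to compare three tabular synthesizers: GaussianCopula, CTGAN, and TVAE. Assuming the SDV 1.x API and one of its demo datasets, we can use the following code to generate and compare the quality reports:
import pandas as pd
from sdv.datasets.demo import download_demo
from sdv.evaluation.single_table import evaluate_quality
from sdv.single_table import CTGANSynthesizer, GaussianCopulaSynthesizer, TVAESynthesizer
# Load real data together with its metadata
real_data, metadata = download_demo(modality='single_table', dataset_name='fake_hotel_guests')
# Define synthesizers
synthesizers = {
    'GaussianCopula': GaussianCopulaSynthesizer(metadata),
    'CTGAN': CTGANSynthesizer(metadata),
    'TVAE': TVAESynthesizer(metadata),
}
# Fit each synthesizer, sample from it, and collect its quality report scores
rows = {}
for name, synthesizer in synthesizers.items():
    synthesizer.fit(real_data)
    synthetic_data = synthesizer.sample(num_rows=len(real_data))
    report = evaluate_quality(real_data, synthetic_data, metadata, verbose=False)
    rows[name] = report.get_properties().set_index('Property')['Score']
# Compare quality reports: one row per synthesizer, one column per property
table = pd.DataFrame(rows).T
print(table)
The output table may look something like this (illustrative numbers, not taken from an actual run):
Synthesizer | Column Shapes | Column Pair Trends |
---|---|---|
GaussianCopula | 0.87 | 0.91 |
CTGAN | 0.95 | 0.96 |
TVAE | 0.92 | 0.94 |
Other metrics, such as detection metrics that measure how easily a classifier can tell real rows from synthetic ones, can be added as further columns.
From the table, we can see that CTGAN has the highest score on both properties, which means that its synthetic data is the most similar to the real data. For this tabular data source, CTGAN would therefore be the model to choose. On a different dataset the ranking may well change, so the comparison is worth rerunning for each new data source.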
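When the table has more than a handful of rows or columns, the ranking step can be automated. The sketch below is a hypothetical helper, not part of SDV: it assumes every score lies between 0 and 1 with higher being better, and it weights all metrics equally by averaging them. The metric names and numbers are placeholders.

```python
def rank_synthesizers(scores):
    """Rank synthesizers by mean score, best first.

    `scores` maps a synthesizer name to a dict of {metric name: score}.
    """
    means = {
        name: sum(metrics.values()) / len(metrics)
        for name, metrics in scores.items()
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores, made up for this example.
example_scores = {
    "GaussianCopula": {"metric_1": 0.8, "metric_2": 0.6},
    "CTGAN": {"metric_1": 0.9, "metric_2": 0.7},
}
for name, mean_score in rank_synthesizers(example_scores):
    print(f"{name}: {mean_score:.2f}")
```

Averaging hides trade-offs: a model that wins on one metric and loses on another can end up with the same mean as a model that is mediocre on both. It is still worth looking at the individual columns before committing to a choice.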
Conclusion
SDV offers different models (synthesizers) for generating synthetic data from different types of real data sources. The right model depends on the type and characteristics of the real data, as well as the desired quality and diversity of the synthetic data. The quality report summarizes how closely the synthetic data matches the statistical properties of the real data, and comparing the scores of several synthesizers side by side makes it possible to rank them and select the best one for a specific use case. To learn more about SDV and how to use it with different models, you can visit the official documentation or the GitHub repository.