
Generating Synthetic Data for Machine Learning Models

You’ve probably heard of synthetic data, but what exactly is it? In simple terms, synthetic data is artificially generated information that mimics the statistical properties of real-world data. It’s like creating a digital stand-in for reality, without having to collect data from real people or events.

Now, why should we care about synthetic data in the context of machine learning (ML)? Here’s why:

  1. Privacy and Security: Sometimes, the data you need to improve your ML models may be sensitive, and you don’t want to expose it. Synthetic data is a viable solution. It doesn’t involve real data, which means it offers a layer of privacy while still providing enough structure to train models effectively.
  2. Data Scarcity and Imbalance: In many real-world scenarios, datasets can be too small or imbalanced. Synthetic data comes in handy here. It can be used to augment datasets, helping models generalize better. Unlike random noise, synthetic data maintains meaningful correlations and patterns, which makes it more valuable.

Tools for Generating Synthetic Data

When it comes to tools for synthetic data generation, there are many, but for this article, I’ll be using the SDV (Synthetic Data Vault) library in Python. While I’m still getting the hang of synthetic data, I believe SDV offers a straightforward solution for most use cases.

The SDV library supports tabular data (including relational and time-series formats) and provides models based on Gaussian Copulas, CTGANs, and more. So if you’re working with tabular data, you’re likely covered.
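For instance, if CTGAN (used below) turns out to be slow for your data, SDV’s Gaussian Copula model can be swapped in with a one-line change. Here’s a minimal sketch, assuming the SDV 1.x single-table API and reusing the metadata and df objects that we’ll define in the steps below:

from sdv.single_table import GaussianCopulaSynthesizer

# Copula-based alternative to CTGAN: much faster to fit,
# at the cost of a less flexible model of the data.
synthesizer = GaussianCopulaSynthesizer(metadata)  # metadata from Step 2
synthesizer.fit(df)                                # df from Step 1
syn_df = synthesizer.sample(num_rows=1000)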

Generating Synthetic Data Using SDV

Let’s dive into the practical steps of generating synthetic data using SDV. I’ll assume you already have the library installed (it’s available on PyPI as sdv), so let’s get started with some real data.

Step 1: Load the Data

For our example, we’ll use the California housing dataset, a simple dataset with features like median income, house age, average rooms, and population, plus the target we’ll predict later: the median house value (MedHouseVal).

from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the California housing dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Sample a subset of the dataset to keep training fast for this demo
df = df.sample(n=1000, random_state=42)

Step 2: Initialize the SDV Library

Here, we define the metadata and initialize the CTGAN (Conditional Tabular GAN) synthesizer, which will generate synthetic data.

from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Create metadata from the real data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

# Initialize the CTGAN synthesizer
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(df)

CTGAN (Conditional Tabular GAN) is a deep learning model used to generate synthetic tabular data. It uses two neural networks: a generator, which creates synthetic data, and a discriminator, which distinguishes between real and fake data.

These networks are trained together, improving the generator’s ability to produce realistic data over time.
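Training a GAN is iterative and can be slow, and the quality of the output depends on how long the two networks are trained. If the defaults don’t cut it, the synthesizer exposes training parameters; here’s a small sketch (the values are illustrative and assume the SDV 1.x CTGANSynthesizer API):

# Optional: configure the CTGAN training run instead of using the defaults
synthesizer = CTGANSynthesizer(
    metadata,
    epochs=300,      # more epochs usually means better fidelity, but longer training
    batch_size=500,  # CTGAN expects this to be divisible by its internal 'pac' size (10 by default)
    verbose=True,    # print generator/discriminator losses while training
)
synthesizer.fit(df)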

Step 3: Generate Synthetic Data

Once the synthesizer is trained on the real data, we can generate synthetic data. Let’s generate 1,000 synthetic rows:

syn_df = synthesizer.sample(num_rows=1000)

Now, you have a synthetic version of the dataset. To make a comparison, you can examine the statistics of both the real and synthetic data.

print("Real Data:\n", df.describe())
print("Synthetic Data:\n", syn_df.describe())

Training Models and Comparing Performance

Now that we have both real and synthetic datasets, let’s train models and compare their performance.

Training on Real Data

First, let’s train a Random Forest Regression model on the real data:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Define features and target
X_real = df.drop('MedHouseVal', axis=1)
y_real = df['MedHouseVal']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_real, y_real, test_size=0.3, random_state=42)

# Train the model
reg_real = RandomForestRegressor()
reg_real.fit(X_train, y_train)

# Make predictions and calculate RMSE
predictions = reg_real.predict(X_test)
real_rmse = np.sqrt(mean_squared_error(y_test, predictions))

Training on Synthetic Data

Next, we’ll train a model on the synthetic data and evaluate it on the same real test set, so the comparison with the real-data model is apples to apples (train on synthetic, test on real):

# Define features and target for synthetic data
X_syn = syn_df.drop('MedHouseVal', axis=1)
y_syn = syn_df['MedHouseVal']

# Train the model on synthetic data
reg_syn = RandomForestRegressor()
reg_syn.fit(X_syn, y_syn)

# Evaluate on the real held-out test set and calculate RMSE
syn_predictions = reg_syn.predict(X_test)
syn_rmse = np.sqrt(mean_squared_error(y_test, syn_predictions))

Comparing RMSE

Finally, let’s compare the Root Mean Squared Error (RMSE) of the two models on the real test set:

print(f"RMSE (Real): {real_rmse}")
print(f"RMSE (Synthetic): {syn_rmse}")

The model trained on synthetic data will typically score a bit worse on the real test set, but if the synthetic data closely mirrors the distribution of the original data, the gap can be small, and occasionally the synthetic-data model even comes out ahead. However, keep in mind that a synthesizer can also memorize or over-smooth its training data, so always validate results like this against real data.
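A single RMSE number only tells part of the story. One quick way to dig deeper with the models we already trained is to compare what they learned, for example their feature importances; this is a small sketch using only objects already defined above:

# Compare which features each model relies on; large differences suggest
# the synthesizer did not preserve some relationships in the real data.
importance_comparison = pd.DataFrame({
    'feature': X_real.columns,
    'real_model': reg_real.feature_importances_,
    'synthetic_model': reg_syn.feature_importances_,
}).sort_values('real_model', ascending=False)

print(importance_comparison)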

Conclusion

Synthetic data can be a powerful tool for ML, especially when privacy is a concern or when dealing with limited, imbalanced, or incomplete datasets. It allows you to generate large datasets without the risks associated with using real-world data. However, it’s not without its challenges.

For synthetic data to be truly effective:

  • It needs to be statistically accurate.
  • It should preserve privacy and prevent reverse-engineering.
  • It must be used responsibly, especially in regulated fields like healthcare or finance.

While it has advantages, like avoiding privacy issues and enabling rapid prototyping, it’s not always a perfect fit. Synthetic data may miss critical real-world correlations, and overfitting can be a concern if the generator learns too much from the training data.

Synthetic data is a promising solution, but like any tool, it should be used with caution, especially when transitioning to real-world applications.

Source: This article was adapted from an original piece by Ryuru. You can read the original article here.
