DTGAN: Differential Private Training for Tabular GANs

Figure 1: Privacy Preserving Generator

1. Introduction

Tabular generative adversarial networks (TGANs) have recently emerged to meet the need for synthesizing tabular data, the most widely used data format. Synthetic tabular data helps with compliance with privacy regulations, but it can still leak private information through inference attacks, because the generator interpolates properties of the real data during training. Differentially private (DP) training algorithms give theoretical privacy guarantees for machine learning models by injecting statistical noise during training. The challenge in applying DP to a TGAN is choosing the framework (PATE or DP-SGD) and the network (generator or discriminator) in which to inject noise, so that data utility is preserved under a given privacy budget.

2. DTGAN

DTGAN is a novel approach to generating tabular datasets with strong DP guarantees. It uses the DP-SGD framework to preserve privacy and the subsampled RDP moments accountant to track the privacy cost. It also uses the Wasserstein loss with gradient penalty, which bounds the gradient norms. As shown in the GS-WGAN work, this bound yields an analytically derived optimal clipping value, so that more gradient information survives the clipping step of DP-SGD.
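To make the clipping and noising steps concrete, here is a minimal sketch of one DP-SGD gradient computation in plain Python. It is an illustration under stated assumptions, not the DTGAN implementation: the function names, the list-based gradients, and the clipping bound of 1.0 in the usage lines are all hypothetical choices made for this example.

```python
import math
import random

def clip_by_l2_norm(grad, clip_norm):
    """Scale a gradient vector so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Noisy average of clipped per-sample gradients (one DP-SGD step)."""
    dim = len(per_sample_grads[0])
    # 1. Clip each sample's gradient so no single record dominates the sum.
    clipped = [clip_by_l2_norm(g, clip_norm) for g in per_sample_grads]
    # 2. Sum the clipped gradients coordinate by coordinate.
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # 3. Add Gaussian noise whose scale is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # 4. Average over the batch.
    return [v / len(per_sample_grads) for v in noisy]

# Hypothetical usage: three per-sample gradients, clipping bound 1.0 (assumed).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.1, 0.2], [-0.5, 0.5]]
update = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The closer the true gradient norms already sit to the clipping bound, the less information step 1 throws away, which is the benefit the gradient penalty is meant to provide.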

DTGAN has two variants: DTGAN_D trains the discriminator with DP-SGD, and DTGAN_G trains the generator with DP-SGD. Their trade-offs are as follows.

DTGAN_D, strengths:
  • The discriminator directly interacts with real data.
  • The discriminator gradient norms are directly bounded due to the gradient penalty.
  • Subsampling is highly efficient: the subsampling rate is γ = B/N, where B is the batch size and N is the size of the training dataset.

DTGAN_D, limitations:
  • The Wasserstein loss requires multiple discriminator updates for every single generator update.
  • Training multiple discriminators to perform distributed GAN training increases the privacy cost significantly.

DTGAN_G, strengths:
  • A DP generator is safe to release publicly after training.
  • Distributed discriminators do not increase the privacy cost.

DTGAN_G, limitations:
  • Subsampling introduces complexity because it relies on multiple discriminators: the subsampling rate is 1/N_d, where N_d is the number of discriminators.
  • Additional loss functions on the generator increase the privacy cost.
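The two subsampling rates and the way privacy cost accumulates over training steps can be sketched in a few lines. This is a simplified illustration, not the accountant used by DTGAN: it uses the standard RDP bound for the Gaussian mechanism and the standard RDP-to-(ε, δ) conversion, but it ignores the amplification from subsampling, so the figure it reports should be read as a loose, pessimistic estimate. All function names, the range of orders, and the numbers in the usage lines are assumptions made for the example.

```python
import math

def subsampling_rate_batch(batch_size, dataset_size):
    """gamma = B / N, the rate when a batch is drawn from the training data."""
    return batch_size / dataset_size

def subsampling_rate_discriminators(num_discriminators):
    """1 / N_d, the rate when one of N_d discriminators is picked per step."""
    return 1.0 / num_discriminators

def gaussian_rdp(alpha, noise_multiplier):
    """RDP of order alpha for a Gaussian mechanism with unit sensitivity."""
    return alpha / (2.0 * noise_multiplier ** 2)

def rdp_to_dp(rdp_eps, alpha, delta):
    """Convert an (alpha, rdp_eps)-RDP guarantee into (eps, delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def epsilon_without_amplification(steps, noise_multiplier, delta,
                                  orders=range(2, 64)):
    """Compose `steps` Gaussian releases additively in RDP, then convert.

    Takes the best (smallest) epsilon over the candidate orders.
    """
    return min(
        rdp_to_dp(steps * gaussian_rdp(a, noise_multiplier), a, delta)
        for a in orders
    )

# Hypothetical numbers, for illustration only.
gamma = subsampling_rate_batch(batch_size=64, dataset_size=6400)
rate_g = subsampling_rate_discriminators(num_discriminators=100)
eps_few = epsilon_without_amplification(100, noise_multiplier=5.0, delta=1e-5)
eps_many = epsilon_without_amplification(500, noise_multiplier=5.0, delta=1e-5)
```

Because RDP composes additively, every noisy update adds to the total. That is why multiple discriminator updates per generator update, or additional privatized losses, show up directly in the privacy budget.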

3. Inference Attacks

Membership inference attack: a binary classification problem in which an attacker tries to predict whether a particular target data point was used to train a victim generative model. This post assumes that the attacker needs only black-box access to the tabular GAN, a reference dataset, and the target data point for which the inference is made.
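The black-box setting can be illustrated with a deliberately simple distance-based attack. Note that this is a swapped-in simplification and not the attack evaluated in this post: the attacker only samples from the generator and guesses "member" when the target lies closer to the synthetic data than a typical reference record does. The toy generator below is a hypothetical stand-in for a leaky GAN, and every name and parameter is an assumption made for the example.

```python
import math
import random

def nearest_distance(point, samples):
    """Euclidean distance from point to its closest record in samples."""
    return min(math.dist(point, s) for s in samples)

def membership_guess(target, synthetic, reference):
    """Guess True (member) if the target is closer to the synthetic data
    than the median non-member reference record is."""
    ref_dists = sorted(nearest_distance(r, synthetic) for r in reference)
    threshold = ref_dists[len(ref_dists) // 2]  # median reference distance
    return nearest_distance(target, synthetic) < threshold

def toy_overfit_generator(train, n, rng, jitter=0.01):
    """Hypothetical leaky generator: replays training rows with small noise."""
    return [
        [x + rng.gauss(0.0, jitter) for x in rng.choice(train)]
        for _ in range(n)
    ]

# Hypothetical data: 20 training rows and 200 reference rows in 5 dimensions.
rng = random.Random(0)
train = [[rng.uniform(0, 1) for _ in range(5)] for _ in range(20)]
reference = [[rng.uniform(0, 1) for _ in range(5)] for _ in range(200)]
synthetic = toy_overfit_generator(train, 500, rng)

guesses_for_members = [membership_guess(t, synthetic, reference) for t in train]
guess_for_outsider = membership_guess([5.0] * 5, synthetic, reference)
```

A generator trained with DP guarantees should make the synthetic output depend less on any single training record, which is what weakens this kind of attack.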

Figure 2: Membership Inference Attack
Figure 3: Attribute Inference Attack

4. Results

ML utility: only the DTGAN_D model consistently improves across all metrics as the privacy budget is loosened. It also achieves the best F1-score and APR across all baselines and privacy budgets. This suggests that training the discriminator with DP guarantees (DTGAN_D) is more effective than training the generator with DP guarantees (DTGAN_G).
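As a rough sketch of how such a utility difference can be computed, the snippet below compares the F1-score of two classifiers on the same held-out labels. It assumes, hypothetically, that one classifier was trained on the original data and the other on the synthetic data, and that both are scored on the same real test set. The post does not spell out the exact protocol, so the function names and the toy predictions are illustrative only.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1-score for the positive class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def utility_gap(y_true, preds_real_model, preds_synth_model):
    """F1 of the model trained on real data minus F1 of the model trained
    on synthetic data; smaller means the synthetic data is more useful."""
    return f1_score(y_true, preds_real_model) - f1_score(y_true, preds_synth_model)

# Toy labels and predictions, for illustration only.
y_true = [1, 1, 0, 0]
preds_real_model = [1, 1, 0, 0]
preds_synth_model = [1, 0, 0, 0]
gap = utility_gap(y_true, preds_real_model, preds_synth_model)
```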

Table 1: Difference in accuracy (%), F1-score, AUC and APR between original and synthetic data, averaged over 3 datasets and reported for privacy budgets epsilon = 1 and epsilon = 100.
Table 2: Statistical similarity metrics averaged over 3 datasets for privacy budgets epsilon = 1 and epsilon = 100.
Table 3: Empirical privacy gain against membership inference attacks with naïve and correlation feature extraction, and against attribute inference attacks, averaged over 3 datasets with privacy budget epsilon = 1.

5. Conclusion

Motivated by the risk of privacy leakage through synthetic tabular data, we propose a novel DP conditional Wasserstein tabular GAN, DTGAN. We rigorously analyse DTGAN through its two variants, DTGAN_D and DTGAN_G, using the theoretical Rényi DP framework, and highlight the privacy cost of the additional losses the generator uses to enhance data quality. Moreover, we empirically compare the data utility achieved by applying DP-SGD to the discriminator versus the generator. Finally, we evaluate the privacy robustness against practical membership and attribute inference attacks.

Aditya Kunar

I am a researcher at Generatrix, an AI-based privacy-preserving data synthesis platform. I have an avid passion for new and emerging technologies in AI & ML.