# DTGAN: Differential Private Training for Tabular GANs


By Aditya Kunar, Robert Birke, Zilong Zhao & Lydia Y. Chen

# 1. Introduction

**Tabular generative adversarial networks (TGANs)** have recently emerged to cater to the need of synthesizing tabular data, the most widely used data format. While synthetic tabular data offers the advantage of complying with privacy regulations, there still exists a *risk of privacy leakage via inference attacks* because the model interpolates the properties of real data during training. **Differentially private (DP) training algorithms** provide theoretical guarantees for training machine learning models by *injecting statistical noise to prevent privacy leaks*. However, the **challenges of applying DP to TGANs** are **to determine the best framework (i.e., PATE/DP-SGD)** and **the neural network (i.e., Generator/Discriminator) into which to inject noise** such that the data utility is well maintained under a given privacy guarantee.

In this blog, we highlight **DTGAN**, *a novel conditional Wasserstein tabular GAN* that comes in two variants, **DTGAN_G** and **DTGAN_D**, for providing a detailed comparison of tabular GANs trained using **DP-SGD** for the *generator vs discriminator*, respectively. Moreover, we present the privacy analysis associated with training the generator with *complex loss functions (i.e., **classification and information losses**)* needed for high-quality tabular data synthesis. Additionally, we **empirically evaluate the theoretical privacy guarantees offered by DP against membership** and **attribute inference attacks**.

# 2. DTGAN

**DTGAN** is a novel approach to generate tabular datasets with strong DP guarantees. It utilizes the **DP-SGD framework** and the **subsampled RDP moments accountant technique** to preserve privacy and account for the cost, respectively. In addition, it makes use of the **Wasserstein loss with gradient penalty** to effectively *bound the gradient norms*, thereby providing an *analytically derived optimal clipping value* that better preserves gradient information after clipping in DP-SGD, as shown in the GS-WGAN work.

**DP-Discriminator**- Each discriminator update satisfies (𝜆, 2𝐵𝜆/𝜎^2)-RDP, where 𝐵 is the batch size, 𝜆 is the order of the *Rényi divergence* and 𝜎 is the *noise scale*.
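To make the clipping and noising step concrete, here is a minimal sketch of how DP-SGD typically sanitizes one batch of per-sample gradients. It is not DTGAN's actual implementation: the function names, the plain-list gradient representation, and the default `clip_norm=1.0` are assumptions chosen for illustration.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Scale a gradient vector down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, clip_norm=1.0, noise_scale=1.0, rng=None):
    """Return a noisy, averaged gradient in the style of DP-SGD.

    per_sample_grads: list of gradient vectors, one per training example.
    clip_norm:        clipping bound C (assumed value, not taken from the paper).
    noise_scale:      sigma; Gaussian noise has standard deviation sigma * C.
    """
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    # 1. Bound each example's influence by clipping its gradient.
    clipped = [clip_by_l2_norm(g, clip_norm) for g in per_sample_grads]
    # 2. Sum the clipped gradients coordinate-wise.
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # 3. Add Gaussian noise calibrated to the clipping bound.
    noisy = [s + rng.gauss(0.0, noise_scale * clip_norm) for s in summed]
    # 4. Average over the batch.
    return [v / len(per_sample_grads) for v in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dp_sgd_gradient(grads, clip_norm=1.0, noise_scale=1.0, rng=random.Random(0)))
```

When the gradient norms are already bounded by the loss design, fewer gradients are distorted by step 1, which is the motivation for pairing DP-SGD with a gradient penalty.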

Pros-

- The **discriminator directly interacts with real data**.
- The **discriminator gradient norms are directly bounded** due to the gradient penalty.
- **Subsampling is highly efficient**, as the sampling rate is defined as 𝛾=𝐵/𝑁 where 𝐵 is the batch size and 𝑁 is the size of the training dataset.

Cons-

- The use of the **Wasserstein loss requires multiple updates** to the discriminator for performing a single update to the generator.
- Training multiple discriminators to perform **distributed GAN training increases the privacy cost significantly**.

**DP-Generator**- Each generator update satisfies (𝜆, 6𝐵𝜆/𝜎^2)-RDP, where 𝐵 is the batch size, 𝜆 is the order of the *Rényi divergence* and 𝜎 is the *noise scale*.

Pros-

- A **DP-generator makes for a safe public release** after training.
- **Distributed discriminators do not increase privacy cost**.

Cons-

- **Subsampling introduces complexity via multiple discriminators**, and the sampling rate is defined as 1/N_d where N_d is the number of discriminators.
- **Added loss functions on the generator increase privacy cost**.
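The per-update RDP costs stated above can be compared with a small accounting sketch. It relies on two standard RDP facts: costs at a fixed order 𝜆 add up under composition, and an (𝜆, 𝜌)-RDP guarantee converts to (𝜀, 𝛿)-DP with 𝜀 = 𝜌 + log(1/𝛿)/(𝜆 − 1). The function names and the example hyperparameters are assumptions, and the sketch deliberately leaves out the privacy amplification from subsampling, so it only illustrates the relative cost of the two variants and not the budgets reported for DTGAN.

```python
import math


def rdp_per_step(order, batch_size, noise_scale, sensitivity_factor):
    """RDP cost of one update: factor * B * lambda / sigma^2.

    sensitivity_factor is 2 for the DP-discriminator and 6 for the
    DP-generator, following the expressions quoted in the text.
    """
    return sensitivity_factor * batch_size * order / noise_scale ** 2


def compose(rdp_step, num_steps):
    """RDP at a fixed order composes additively over training steps."""
    return rdp_step * num_steps


def rdp_to_epsilon(rdp, order, delta):
    """Convert (order, rdp)-RDP to (epsilon, delta)-DP."""
    return rdp + math.log(1.0 / delta) / (order - 1.0)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the calculation.
    order, batch_size, noise_scale, steps, delta = 2, 64, 8.0, 10, 1e-5
    for name, factor in [("DP-discriminator", 2), ("DP-generator", 6)]:
        step_cost = rdp_per_step(order, batch_size, noise_scale, factor)
        eps = rdp_to_epsilon(compose(step_cost, steps), order, delta)
        print(name, "per-step RDP:", step_cost, "epsilon (no amplification):", eps)
```

For the same batch size, order and noise scale, each generator update costs three times as much as a discriminator update, which is one way to read the last point in the list above.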

# 3. Inference Attacks

**Membership Inference Attack**- It is a *binary classification problem* in which an attacker tries to predict whether a particular target data point has been used to train a victim generative model. This post assumes that the attacker only needs access to a black-box tabular GAN model, a reference dataset and the target data point for which the inference must be made.

**Attribute Inference Attack**- It is defined as a *regression problem* where the attacker attempts to predict the values of a sensitive target column, provided they have black-box access to a generative model.

# 4. Results

**ML Utility**- **Only the DTGAN_D model consistently improves across all metrics with a looser privacy budget**. It also showcases the best performance for both F1-score and APR metrics across all baselines and privacy budgets. This suggests that **training the discriminator with DP guarantees, i.e. DTGAN_D, is more effective than training the generator with DP guarantees, i.e. DTGAN_G**.

**Statistical Similarity**- Among all DP models, **DTGAN_D is the only model which consistently improves across all three metrics when the privacy budget is increased**. Similarly, *DTGAN_G* sees an improvement across both the *Avg-JSD* and *Avg-WD*. The same is not true for *PATE-GAN* and *DP-WGAN*, of which DP-WGAN performs better across all metrics. Moreover, both perform worse than the two variants of *DTGAN* at both levels of epsilon. This highlights their inability to capture the statistical distributions during training despite a looser privacy budget, which we attribute to the lack of an effective training framework.
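For readers unfamiliar with the two named metrics, the sketch below shows one plausible way to compute them per column, assuming *Avg-JSD* averages the Jensen-Shannon divergence over categorical columns and *Avg-WD* averages the Wasserstein distance over continuous columns. The helper names are hypothetical, the exact normalization used in the paper may differ, and the Wasserstein helper assumes both samples have the same size.

```python
import math
from collections import Counter


def category_probs(values, categories):
    """Empirical probability of each category in a column."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a probability of zero contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(x, y):
    """1-Wasserstein distance between two equal-size 1D samples."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


if __name__ == "__main__":
    # Toy columns standing in for real and synthetic data.
    real_cat, synth_cat = ["a", "a", "b", "b"], ["a", "b", "b", "b"]
    cats = sorted(set(real_cat) | set(synth_cat))
    print("JSD:", jsd(category_probs(real_cat, cats), category_probs(synth_cat, cats)))
    print("WD:", wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

Lower values on both metrics mean the synthetic column distribution is closer to the real one.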

**Resilience to inference attacks**- With respect to *membership inference attacks*, all DP baselines provide an empirical privacy gain close to 0.25 for both feature extraction methods. This indicates that differentially private methods provide strong privacy protection against membership attacks: they ensure that the **average probability of success for any attack is close to the attacker’s original prior, i.e., 0.5**.

In terms of *attribute inference attacks*, *PATE-GAN* provides the greatest resilience, followed by *DP-WGAN*, *DTGAN_D* and *DTGAN_G*. The lower resilience of the *DTGAN* variants is due to the superior quality of their synthetic data, which enhances the attacker’s probability of successfully inferring sensitive information. Even though both variants of *DTGAN* are less resilient than the two DP baselines, they remain significantly more resilient than *TGAN*, which provides the worst (effectively no) resilience. These results highlight the **inherent trade-off between privacy and data utility, i.e., increasing the utility directly worsens the privacy and vice versa**.

# 5. Conclusion

Motivated by the risk of privacy leakage through synthetic tabular data, **we propose a novel DP conditional Wasserstein tabular GAN, DTGAN**. We rigorously analyse *DTGAN* using its two variants, namely *DTGAN_D* and *DTGAN_G*, via the *theoretical Rényi DP framework* and highlight the privacy cost of the additional losses used by the generator to enhance data quality. Moreover, we empirically showcase the data utility achieved by applying *DP-SGD* to train the *discriminator vs generator*, respectively. Additionally, **we rigorously evaluate the privacy robustness against practical membership and attribute inference attacks**.

**Our results on three tabular datasets show that synthetic tabular data generated by DP-SGD achieves higher data utility as compared to the PATE framework**. Moreover, we find that **DTGAN_D outperforms DTGAN_G, illustrating that training the discriminator with DP guarantees is the better choice under stringent privacy budgets**. Finally, in terms of data utility and resilience to privacy attacks, **DTGAN_D improves upon prior work by 18% across 4 ML models in terms of the average precision score and all DP baselines reduce the success rate of membership attacks by approx. 50%**. This showcases the effectiveness of DP for protecting the privacy of sensitive datasets used for training tabular GANs. **However, further enhancement of the quality of synthetic data at strict privacy budgets (i.e., epsilon < 1) is still needed. Ultimately, there is an inherent trade-off between privacy and utility, and obtaining the best balance between both requires future work**.

*Thank you for reading. Please feel free to access our full research paper **here**.*