Generating Synthetic Sequential Data using GANs

Original article was published on Deep Learning on Medium

Common approaches to sequential data generation

Most of the models for time-series data generation use one of the following approaches:

Dynamic stationary processes that work by representing each point in the time series as a sum of deterministic processes with some noise added. This is a widely used approach for modeling time-series with techniques like bootstrapping. However, some prior knowledge of long-term dependencies, like cyclical patterns, has to be incorporated to constrain the deterministic process. This makes it very difficult to model datasets with complex, unknown correlations.

Markov Models are a popular approach for modeling categorical time series by representing system dynamics as a conditional probability distribution. Variants, such as Hidden Markov models, have also been used for modeling the distributions of time series. The problem with this approach is its inability to capture long-term complex dependencies.

Autoregressive (AR) models are dynamic stationary processes where each point in the sequence is represented as a function of the previous n points. AR variants such as ARIMA (which is linear) are widely used, and nonlinear AR models can be very powerful. Still, AR models, like Markov models, have a fidelity problem: they produce simplistic models incapable of capturing complex temporal correlations.
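To make the autoregressive idea concrete, here is a minimal sketch of an AR(2) process in NumPy. The coefficients, the noise scale, and the helper name simulate_ar are illustrative assumptions and are not taken from any model discussed in this article.

```python
import numpy as np

def simulate_ar(coeffs, n_steps, noise_std=1.0, seed=0):
    """Simulate x_t = sum_k coeffs[k] * x_{t-1-k} + noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    x = np.zeros(n_steps + p)  # the first p values act as a zero warm-up
    for t in range(p, n_steps + p):
        past = x[t - p:t][::-1]  # most recent value first
        x[t] = np.dot(coeffs, past) + rng.normal(0.0, noise_std)
    return x[p:]

# Illustrative coefficients only: each point depends on the previous two.
series = simulate_ar(coeffs=[0.6, -0.2], n_steps=200)
```

Because each value depends only on a short, fixed window of past values, a process like this has no mechanism for reproducing the long-range structure discussed above.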

Recurrent Neural Networks (RNNs) have recently been used for time-series modeling in deep learning. Like autoregressive and Markov models, RNNs use a sliding window of previous timesteps to determine the next points in time. RNNs also store an internal state variable that captures the entire history of the time series. RNNs, such as long short-term memory networks (LSTMs), have had great success in learning discriminative models of time series data, which predict a label conditioned on a sample. However, RNNs are unable to learn certain simple time-series distributions.

GAN-based methods, or generative adversarial network models, have emerged as a popular technique for generating or augmenting datasets, especially with images and videos. However, GANs give poor fidelity on networking data, which has both complex temporal correlations and mixed discrete-continuous data types. Although GAN-based time-series generation exists (for instance, for medical time series), such techniques fail on more complex data, exhibiting poor autocorrelation scores on long sequences while being prone to mode collapse. This is because the data distribution is heavy-tailed and variable in length, which seems to affect GANs considerably.

Introducing DoppelGANger for generating high-quality, synthetic time-series data

In this section, I will explore a recent model for generating synthetic sequential data: DoppelGANger. I will use this GAN-based model, whose generator is composed of recurrent units, to generate synthetic versions of transactional data using two datasets: bank transactions and road traffic. We used a modification of the DoppelGANger model to address the limitations of generative models for sequential data.

Traditional Generative Adversarial Networks, or GANs, struggle to model sequential data due to the following issues:

  • Complex correlations between temporal features and their associated (immutable) attributes are not captured: for instance, credit card transaction patterns differ markedly depending on the owner’s characteristics (age, income, etc.).
  • Long-term correlations within time series, such as diurnal patterns, are hard to capture: these correlations are qualitatively very different from those found in images, which have a fixed dimension and do not need to be generated pixel by pixel.

DoppelGANger incorporates some innovative ideas, like:

  • using two networks (a multilayer perceptron, MLP, and a recurrent network) to capture temporal dependencies
  • decoupled attribute generation to better capture correlations between time series and their attributes (e.g., age, location, and gender of users)
  • batched generation: generation of small stacked batches for long sequences
  • decoupled normalization: the addition of normalization factors to the generator to constrain the range of features

DoppelGANger decouples the generation of attributes from time series while feeding attributes to the time series generator at each timestep. This contrasts with conventional approaches, where attributes and features are generated jointly.

DoppelGANger’s conditional generation architecture also offers the flexibility to change the attribute distribution and to generate features conditioned on chosen attributes. This also helps to hide the attribute distribution, thus increasing privacy.

Figure 1: Schematic representation of the original DoppelGANger model with two generator blocks and two discriminators. Credit:

Another neat feature of this model is how it handles extreme events, a very challenging problem. It’s not uncommon for sequential data to have a wide range of feature values across samples: some products may have thousands of transactions while others have just a few. For GANs this is problematic, as it is a sure recipe for mode collapse: samples will contain only the most common items and ignore the rare events. For images, the focus of almost all efforts on GANs, this isn’t an issue since the distributions are smooth. This is why the authors of DoppelGANger proposed an innovative way to handle these cases: auto-normalization. It consists of normalizing the data features prior to training and adding the minimum and maximum range of features as two additional attributes to each sample.

In the generated data, these two attributes usually scale features back to a realistic range. This is done in three steps:

  1. Generate attributes using the MultiLayer Perceptron (MLP) generator.
  2. With the generated attributes as inputs, generate the two “fake” (max/min) attributes using another MLP.
  3. With the generated real and fake attributes as inputs, generate the features.
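The auto-normalization idea can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single continuous feature per timestep and using (max+min)/2 and (max-min)/2 as the two extra attributes; the function names are hypothetical and this is not the DoppelGANger implementation.

```python
import numpy as np

def auto_normalize(features):
    """features: array [n_samples, length, 1]. Returns features scaled to [-1, 1]
    per sample, plus the two extra attributes needed to undo the scaling."""
    f_max = features.max(axis=(1, 2))
    f_min = features.min(axis=(1, 2))
    center = (f_max + f_min) / 2.0
    half_range = (f_max - f_min) / 2.0
    safe = np.where(half_range == 0, 1.0, half_range)  # avoid division by zero
    scaled = (features - center[:, None, None]) / safe[:, None, None]
    extra_attributes = np.stack([center, half_range], axis=1)
    return scaled, extra_attributes

def auto_denormalize(scaled, extra_attributes):
    """Scale features back to their original range using the extra attributes."""
    center, half_range = extra_attributes[:, 0], extra_attributes[:, 1]
    return scaled * half_range[:, None, None] + center[:, None, None]

# Two toy samples with very different value ranges.
x = np.array([[[1.0], [3.0], [5.0]], [[100.0], [300.0], [200.0]]])
scaled, extra = auto_normalize(x)
restored = auto_denormalize(scaled, extra)
```

After scaling, both samples live in the same [-1, 1] range, so the generator does not have to model the wide spread of magnitudes directly; the two extra attributes carry that information instead.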

Training the DoppelGANger model on bank transactions data

First, we evaluated DoppelGANger on a dataset of bank transactions. The data used for training is synthetic, so we know the real distributions, and can be accessed here. Our aim was to show that this model was able to learn the time dependencies in the data.

How to prepare the data?

Figure 2: Schematic representation of the data processed as a set of attributes and features of varied lengths.

We assume sequential data is composed of a set of sequences with maximum length Lmax — in our case we consider Lmax = 100. Each sequence contains a set of attributes A (fixed quantities) and features F (transactions). In our case, the only attribute is the initial bank Balance and the features are: Amount of the transaction (positive or negative) and two additional categories describing the transaction: Flag and Description.

To run the model we need three NumPy arrays:

  1. data_feature: training features, in NumPy float32 array format. The size is [(number of training samples) x (maximum length) x (total dimension of features)]. Categorical features are stored by one-hot encoding.
  2. data_attribute: Training attributes, in NumPy float32 array format. The size is [(number of training samples) x (total dimension of attributes)].
  3. data_gen_flag: An array of flags indicating the activation of features. The size is [(number of training samples) x (maximum length)].
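As an illustration of these shapes, the sketch below builds three toy arrays for sequences of different lengths. All sizes and values are made up for the example.

```python
import numpy as np

n_samples, max_len, feature_dim, attribute_dim = 3, 5, 4, 1
lengths = [5, 2, 3]  # hypothetical true length of each sequence

data_feature = np.zeros((n_samples, max_len, feature_dim), dtype=np.float32)
data_attribute = np.zeros((n_samples, attribute_dim), dtype=np.float32)
data_gen_flag = np.zeros((n_samples, max_len), dtype=np.float32)

rng = np.random.default_rng(0)
for i, n in enumerate(lengths):
    data_feature[i, :n, :] = rng.random((n, feature_dim))  # real timesteps
    data_gen_flag[i, :n] = 1.0  # active timesteps; the padded tail stays 0
```

Shorter sequences are zero-padded up to the maximum length, and data_gen_flag records which timesteps are real.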

Additionally, we need a list of objects of class Output that contains the data type for each variable, normalization, and cardinality. In this case, it is:

data_feature_outputs = [
# time intervals between transactions (Dif)
# binarized Amount
# Flag
# Description

The first element of the list is the time interval between events Dif, followed by the 1-hot encoded transaction value (Amount), followed by the Flag, and the fourth is the transaction Description. All gen_flags are set to False since it’s an internal flag to be later modified by the model itself.
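For readers who want to see what such a list might look like, here is a hedged sketch. The classes below are simplified stand-ins for those in gan.output so that the snippet runs on its own. The dimensions are assumptions inferred from elsewhere in this article (20 bins for Amount, 5 levels for Flag, 44 levels for Description), and the normalization chosen for Dif is a guess.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class OutputType(Enum):        # stand-in for gan.output.OutputType
    CONTINUOUS = "CONTINUOUS"
    DISCRETE = "DISCRETE"

class Normalization(Enum):     # stand-in for gan.output.Normalization
    ZERO_ONE = "ZERO_ONE"
    MINUSONE_ONE = "MINUSONE_ONE"

@dataclass
class Output:                  # stand-in for gan.output.Output
    type_: OutputType
    dim: int
    normalization: Optional[Normalization] = None
    is_gen_flag: bool = False

data_feature_outputs = [
    # time intervals between transactions (Dif); normalization is assumed
    Output(OutputType.CONTINUOUS, 1, Normalization.ZERO_ONE),
    Output(OutputType.DISCRETE, 20),  # binarized Amount (assumed 20 bins)
    Output(OutputType.DISCRETE, 5),   # Flag (assumed 5 levels)
    Output(OutputType.DISCRETE, 44),  # Description (assumed 44 levels)
]
total_feature_dim = sum(o.dim for o in data_feature_outputs)
```

Under these assumptions the total feature dimension is 70, which matches the default cardinality=70 used by format_data.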

The attribute is encoded as a continuous variable with normalization between -1 and 1 to account for negative balances:

data_attribute_outputs = [output.Output(type_=OutputType.CONTINUOUS,dim=1,normalization=Normalization.MINUSONE_ONE,is_gen_flag=False)]

The only attribute used in this simulation is the initial balance. The balance at each step is simply updated by adding the corresponding transaction amount.

We used Hazy processors to pre-process each sequence and reshape it in the right format.

n_bins = 20
processor_dict = {
    "by_type": {
        "float": {
            "processor": "FloatToOneHot",  # alternative: "FloatToBin"
            "kwargs": {"n_bins": n_bins}
        },
        "int": {
            "processor": "IntToFloat",
            "kwargs": {"n_bins": n_bins}
        },
        "category": {
            "processor": "CatToOneHot",
        },
        "datetime": {
            "processor": "DtToFloat",
        }
    }
}
from hazy_trainer.processing import HazyProcessor
processor = HazyProcessor(processor_dict)

Now we are going to read the data and process it using the function format_data. The auxiliary variables categories_n and categories_cum store respectively the cardinality and the cumulative sum of the cardinality of the variables.

data = pd.read_csv('data.csv', nrows=100000)  # read the data
categorical = ['Amount', 'Flag', 'Description']
continuous = ['Balance', 'Dif']
cols = categorical + continuous
processor = HazyProcessor(processor_dict)  # call Hazy processor
processor.setup_df(data[cols])  # set up the processor
categories_n = [] # Number of categories in each categorical variable
for cat in categorical:

categories_cum = list(np.cumsum(categories_n)) # Cumulative sum of number of categorical variables
categories_cum = [x for x in categories_cum] # We take one out because they will be indexes
categories_cum = [0] + categories_cum

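def format_data(data, cols, nsequences=1000, Lmax=100, cardinality=70):
    '''cols is a list of columns to be processed
    nsequences is the number of sequences to use for training
    Lmax is the maximum sequence length
    cardinality is the total feature dimension of each timestep'''
    idd = list(data.Account_id.unique())  # unique account ids
    data.Date = pd.to_datetime(data.Date)  # format date
    # zero-padded containers for features, attributes and flags
    data_all = np.zeros((nsequences, Lmax, cardinality))
    data_attribut = np.zeros((nsequences, 1))  # assumed shape: one attribute (Balance)
    data_gen_flag = np.zeros((nsequences, Lmax))  # assumed: 1 where a timestep is active
    real_df = pd.DataFrame()
    for i, ids in enumerate(idd[:nsequences]):
        user = data[data.Account_id == ids]
        user = user[cols]
        processed_df = processor.process_df(user)
        data_attribut[i] = processed_df['Balance'].values[0]
        processed_array = np.asarray(processed_df.iloc[:, 1:])
        # assumed fill step (missing in the source): copy up to Lmax timesteps
        n = min(len(processed_array), Lmax)
        data_all[i, :n, :] = processed_array[:n]
        data_gen_flag[i, :n] = 1
    return data_all, data_attribut, data_gen_flag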


The data consist of roughly 10 million bank transactions from which we will use just a sample of 100,000 containing 5,000 unique accounts with an average of 20 transactions per account. We consider the following fields:

  • Date of the transaction
  • Amount of transaction
  • Balance
  • Transaction Flag (5 levels)
  • Description (44 levels)

Below is the head of the data used:

Table 1: sample of bank transactions data

As mentioned before, the temporal information will be modeled as the time difference between two consecutive transactions (in seconds).

Figure 3: Histogram of transactions for different Description separated as income and outflows.
Figure 4: Heatmaps of transactions for different time distributions.
Figure 5: Distribution of transactions Amount.
Figure 6: Distribution of initial Balance. Note that some accounts have an initial negative balance due to overdraft.
Figure 7: Number of transactions over a month, income and outflow. Note that income has very distinct peaks. The synthetic data has to capture these peaks.

Running the code

We ran the code for only 100 epochs using the following parameters:

import sys
import os
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from gan import output
sys.modules["output"] = output

from gan.doppelganger import DoppelGANger
from gan.load_data import load_data
from gan.network import DoppelGANgerGenerator, Discriminator, \
    RNNInitialStateType, AttrDiscriminator
from gan.output import Output, OutputType, Normalization
from gan.util import add_gen_flag, normalize_per_sample, \
    renormalize_per_sample
sample_len = 10
epoch = 100
batch_size = 20
d_rounds = 2
g_rounds = 1
d_gp_coe = 10.0
attr_d_gp_coe = 10.0
g_attr_d_coe = 1.0

Note that the generator is composed of a list of layers with the softmax activation function for categorical inputs and linear activation for continuous variables. Both generator and discriminator are optimized using the Adam algorithm with a specified learning rate and momentum.

Now we prepare the data to feed the network. The real_attribute_mask is a list of True/False values with the same length as the number of attributes: False if the attribute is (max-min)/2 or (max+min)/2, and True otherwise. We then instantiate the generator and the discriminators:

# create the necessary input arrays
data_all, data_attribut, data_gen_flag = format_data(data, cols)
# normalise data
(data_feature, data_attribute, data_attribute_outputs,
 real_attribute_mask) = normalize_per_sample(
    data_all, data_attribut, data_feature_outputs,
    data_attribute_outputs)
# add generation flag to features
data_feature, data_feature_outputs = add_gen_flag(
    data_feature, data_gen_flag, data_feature_outputs, sample_len)
generator = DoppelGANgerGenerator(

discriminator = Discriminator()
attr_discriminator = AttrDiscriminator()

We used a neural network composed of two layers of 100 neurons for the generator and the discriminator. All data were normalized or 1-hot encoded. Then we train the model with the following parameters:

checkpoint_dir = "./results/checkpoint"
sample_path = "./results/time"
epoch = 100
batch_size = 50
g_lr = 0.0001
d_lr = 0.0001
vis_freq = 50
vis_num_sample = 5
d_rounds = 3
g_rounds = 1
d_gp_coe = 10.0
attr_d_gp_coe = 10.0
g_attr_d_coe = 1.0
extra_checkpoint_freq = 30
num_packing = 1

Some notes on training

If the dataset is large, you should use a larger number of epochs. The authors suggest 400 but, in our experiments, we found that we could go as high as 1000 without the networks degenerating into mode collapse. Also, consider that the number of epochs is related to batch size: smaller batches need more epochs and a lower learning rate.

For those new to neural networks, batch, stochastic, and minibatch gradient descent are the three main flavors of gradient descent. Batch size controls the accuracy of the estimate of the error gradient when training neural networks. The user should be aware of the trade-offs between batch size, speed, and stability during learning. Larger batches allow larger learning rates and the network will learn faster, but training can also be less stable, which is particularly problematic for GANs due to the mode collapse problem.

As a rule of thumb, learning rates of generators and discriminators should be small (in the range 10^-3 to 10^-5) and similar to each other. In our case, we use 10^-4, not the default 10^-3.

Another important parameter is the number of rounds on the generator and discriminator. Wasserstein GAN (WGAN) requires two components to work properly: a Lipschitz constraint on the discriminator (weight clipping in the original WGAN, or a gradient penalty, weighted here by d_gp_coe) and more rounds of the discriminator (d_rounds) than of the generator. Normally the number of rounds of the discriminator is between 3 and 5 for each round of the generator. Here we use d_rounds=3 and g_rounds=1.

In order to speed up the training, we used a cyclical learning rate for the generator and a fixed one for the discriminator.
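A cyclical learning rate can take several forms. The sketch below shows one common variant, a triangular schedule, as an illustration only: the bounds, the cycle length, and the function name are assumptions and not the values used in our runs.

```python
def triangular_lr(step, base_lr=1e-5, max_lr=1e-4, half_cycle=100):
    """Triangular cyclical learning rate (all default values are illustrative)."""
    cycle_pos = step % (2 * half_cycle)
    # 0 at the start or end of a cycle, 1 at its peak
    frac = 1.0 - abs(cycle_pos - half_cycle) / half_cycle
    return base_lr + (max_lr - base_lr) * frac

# Learning rate at the start, peak, and end of two consecutive cycles.
schedule = [triangular_lr(s) for s in range(0, 401, 100)]
```

The rate rises linearly from base_lr to max_lr over half_cycle steps, falls back at the same pace, and then repeats.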

The directory sample_path stores a set of samples collected at different checkpoints, which is useful for verification purposes. Visualizations of the loss functions can be done using TensorBoard on the checkpoint directory that you provide. You can control the frequency of checkpoints with the parameter extra_checkpoint_freq.

Be aware that this may take up a lot of disk space. The simulation took less than ten minutes on a MacBook Pro.

run_config = tf.ConfigProto()
tf.reset_default_graph() # if you are using spyder
with tf.Session(config=run_config) as sess:
gan = DoppelGANger(

Synthetic data generation

After the model is trained, you can use the generator to create synthetic data from noise. There are two ways to do it:

  1. Unconditional generation from pure noise
  2. Conditional generation on attributes

In the first case, we generate attributes and features. In the second, we explicitly specify which attributes we want to condition the feature generation with so that only features are generated.

Below is the code to generate samples from noise:

run_config = tf.ConfigProto()
total_generate_num_sample = 1000
with tf.Session(config=run_config) as sess:
gan = DoppelGANger(

# build the network

length = int(data_feature.shape[1] / sample_len)
real_attribute_input_noise = gan.gen_attribute_input_noise(
addi_attribute_input_noise = gan.gen_attribute_input_noise(
feature_input_noise = gan.gen_feature_input_noise(
total_generate_num_sample, length)
input_data = gan.gen_feature_input_data_free(
# load the weights / change the path accordingly

# generate features, attributes and lengths
features, attributes, gen_flags, lengths = gan.sample_from(
real_attribute_input_noise, addi_attribute_input_noise,
feature_input_noise, input_data, given_attribute=None,
#denormalise accordingly
features, attributes = renormalize_per_sample(
features, attributes, data_feature_outputs,
data_attribute_outputs, gen_flags,

We need a few extra steps to process the generated samples into a sequence format and return vectors in a 1-hot encoding format.

nfloat = len(continuous)
for i in range(features.shape[0]):
v = np.concatenate([np.zeros_like(attributes[i]), np.zeros_like(features[i])],axis=-1)
v[attributes[i].shape] = attributes[i]
v[attributes[i].shape[0]:attributes[i].shape[0]+1] = features[i,:,0]

for j, c in enumerate(categories_cum[:-1]):
ac = features[:,nfloat+categories_cum[j]-1: nfloat+categories_cum[j+1]-1]
a_hot = np.zeros((ac.shape[0], categories_n[j]))
a_hot[np.arange(ac.shape[0]),ac.argmax(axis=1)] = 1

synth = np.vstack([synth,v])

df = pd.DataFrame(synth[1:,1:],columns=processed_df.columns)
formated_df = processor.format_df(df)
formated_df['account_id'] = synth[1:,0] # add account_id (skip the first row, as above)

Below we present some comparisons between synthetic (generated) and real data. We can observe that, overall, the generated data distributions match the real ones relatively well (Figures 8 and 9).

Figure 8: Histograms of sequence length (top) time intervals between Transactions (middle) and Flags (bottom) for generated vs real data.

The only exception is the distribution of the variable Amount, as shown in Figure 9. This is due to the fact that this variable has a non-smooth distribution. To solve this issue we discretized it into 20 levels resulting in a much better match.

Figure 9: Amount real vs generated using a continuous encoding (top) and binarised one-hot encoding (bottom).

We then used the Hazy metrics to calculate the Similarity Score. This score is the mean of three scores: histogram and 2D-histogram similarity (how much the real and synthetic data histograms overlap) and Mutual Information between columns. The latter establishes how well the synthetic data preserves the correlations between columns.
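The Hazy metrics themselves are not reproduced here. As a rough illustration of the histogram-overlap idea only, the sketch below computes a histogram intersection between two one-dimensional samples with NumPy; the function name, bin count, and toy data are assumptions.

```python
import numpy as np

def histogram_overlap(real, synthetic, bins=20):
    """Histogram intersection in [0, 1]; 1 means identical binned distributions."""
    lo = min(real.min(), synthetic.min())
    hi = max(real.max(), synthetic.max())
    h_real, _ = np.histogram(real, bins=bins, range=(lo, hi))
    h_synth, _ = np.histogram(synthetic, bins=bins, range=(lo, hi))
    p = h_real / h_real.sum()
    q = h_synth / h_synth.sum()
    return float(np.minimum(p, q).sum())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 5000)
close = rng.normal(0.1, 1.0, 5000)  # nearly the same distribution
far = rng.normal(3.0, 1.0, 5000)    # clearly shifted distribution

overlap_close = histogram_overlap(real, close)
overlap_far = histogram_overlap(real, far)
```

A sample drawn from almost the same distribution scores close to 1, while a shifted one scores much lower.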

We got a similarity score of 0.57 when treating Amount as a continuous variable and 0.63 when we binarised it into 20 bins. The Similarity Score was obtained as follows:

from hazy_trainer.evaluation.similarity import Similarity
sim = Similarity(metrics=['hist','hist2d','mi'])
score = sim.score(real_df[cols], formated_df[cols])

However, we’ve noticed that this number does not really tell the whole story since it does not explicitly measure the temporal coherence of the synthetic data sequences — it treats each row independently.

Figure 10: Transactions Amount generated by the model over time (money in and money out).

For that purpose, we used an additional key metric: autocorrelation, which measures how an event at time t is related to events occurring at time t − ∆, where ∆ is a time lag. To measure the relation, we compare real and synthetic data in the following way:

$$AC = \frac{\sum_{i=1}^{T} \left(A^{\mathrm{real}}_i - A^{\mathrm{synthetic}}_i\right)^2}{\sum_{i=1}^{T} \left(A^{\mathrm{real}}_i\right)^2}$$

Below are the autocorrelation plots for the total amount spent (aggregated by day) on real and synthetic data. We can see that the two have very similar patterns.

This will only work for numerical data. For categorical data, we can use mutual information. For our data, we got AC = 0.71.
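The sketch below shows one way to compute such a comparison with NumPy. It assumes that the quantities being compared are the sample autocorrelations at lags 1 to T, which is one reading of the expression above; the function names are hypothetical.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of a 1-D series for lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-lag], x[lag:]) / denom
                     for lag in range(1, max_lag + 1)])

def ac_score(real, synthetic, max_lag=30):
    """Relative squared difference between the two autocorrelation curves."""
    a_real = autocorrelation(real, max_lag)
    a_synth = autocorrelation(synthetic, max_lag)
    return float(np.sum((a_real - a_synth) ** 2) / np.sum(a_real ** 2))

t = np.arange(500)
daily = np.sin(2 * np.pi * t / 24)  # toy series with a 24-step cycle
noise = np.random.default_rng(0).normal(size=500)
score_same = ac_score(daily, daily)
score_noise = ac_score(daily, noise)
```

Under this reading, two series with identical autocorrelation curves give a value of 0, and the value grows as the curves diverge.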

Figure 11: Auto-correlation for real and synthetic data for the bank transaction dataset.

The traffic dataset

To test the capabilities of the sequential data generator, we tried it on another, more challenging dataset: the Metro Interstate Traffic Volume Data Set. It’s a dataset with hourly traffic data from 2012 to 2018. As we can see in the next figures, the data is relatively coherent over time with some daily and weekly patterns and large hourly variability. The synthetic data produced by the generator has to reproduce all these trends.

Figure 12: Histogram of traffic volume (vehicles per hour).

The daily patterns can be quite complex as seen in the next figure containing traffic over the first month (October 2012):

Figure 14: Hourly traffic patterns for the month of October 2012. Each dip represents a day. Weekends are visible in lower-traffic patterns.

In order to generate good quality synthetic data, the network has to predict the right daily, weekly, monthly, and even yearly patterns, so long-term correlations are important.

Figure 15: Some more distributions of the data.

In terms of autocorrelation, we can see a smooth daily correlation, which makes sense since most traffic has a symmetric behavior: high intensity in the morning is correlated with high intensity in the evening.

Figure 16: Auto-correlation for real traffic data versus generated traffic data. For longer lags, the autocorrelation of synthetic data starts to depart from the one obtained from real data.

Running the model

In this case, the sequence lengths are fixed. To prepare the data, we generated 50,000 sequences using a sliding window of monthly and weekly data. This dataset is much larger than the previous one, and we expected the model to behave smoothly without mode collapse.
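The sliding-window preparation can be sketched as follows. The window and stride values are illustrative (one week of hourly data, shifted by one day), and the helper name is hypothetical.

```python
import numpy as np

def sliding_windows(series, window, stride=1):
    """Return an array [n_windows, window] of overlapping slices of a 1-D series."""
    series = np.asarray(series)
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

hourly = np.arange(24 * 14)  # two weeks of placeholder hourly values
weekly = sliding_windows(hourly, window=24 * 7, stride=24)
```

Each row of weekly is one fixed-length training sequence. With a stride shorter than the window, consecutive sequences overlap, which is how a single long series can yield many training samples.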

In this case, we also had a larger number of attributes, some of which (like Day of the week and Month) were constructed from the data:

  • Temperature
  • Rain_1h
  • Snow_1h
  • Clouds_all
  • Weather_description
  • Weather_main
  • Holiday
  • Day of the week
  • Month

As features, we have only the hourly traffic volume. Since we want to capture this variable with the highest granularity, all numeric values were discretized into 20 bins, except the traffic volume, which was discretized into 50 bins. The model ran for 200 epochs with a batch size of 20 and the same learning rate as before.
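Discretizing a continuous variable into one-hot encoded bins can be sketched as below. This is a minimal illustration assuming equal-width bins, not the Hazy processor used in our pipeline; the function names and toy values are hypothetical.

```python
import numpy as np

def to_one_hot_bins(values, n_bins):
    """Discretize values into equal-width bins and one-hot encode the bin index."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # use the inner edges only, so indices fall in 0..n_bins-1
    idx = np.digitize(values, edges[1:-1])
    one_hot = np.zeros((len(values), n_bins), dtype=np.float32)
    one_hot[np.arange(len(values)), idx] = 1.0
    return one_hot, edges

def from_one_hot_bins(one_hot, edges):
    """Map each one-hot row back to the centre of its bin."""
    centers = (edges[:-1] + edges[1:]) / 2.0
    return centers[one_hot.argmax(axis=1)]

volume = np.array([0.0, 100.0, 2500.0, 7280.0])  # toy traffic volumes
one_hot, edges = to_one_hot_bins(volume, n_bins=50)
approx = from_one_hot_bins(one_hot, edges)
```

Decoding returns bin centres, so the reconstruction error is at most half a bin width. More bins give finer granularity at the cost of a larger output dimension.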


Figure 17 contains a real and generated sample. We can see that the cyclic patterns are well kept and data looks realistic.

Figure 17: Real (top) and generated (bottom) sequences over a 500-hour period. The model was run unconditionally. We can see that the synthetic data captures the daily and weekly patterns very well.

To test the quality of the generated data, we present some metrics (see Table 2):

  • Similarity — measured by the overlap of histograms and mutual information
  • Auto-correlation — the ratio between real and synthetic over 30 time lags
  • Utility — measured by the relative ratio of forecasting error when trained with real and synthetic data

We used as a baseline an LSTM (long short-term memory) model with bootstrapping. This LSTM model is composed of two layers with 100 neurons each and uses a sliding window of 30 hours. The attributes were added through a dense layer and concatenated to the last hidden layer of the network.

As we can see from Table 2, DoppelGANger, trained with weekly data, performs relatively well, outperforming the bootstrapping technique by a good margin.

Table 2: Results for the traffic dataset.

We also added another metric, the Sequential Mutual Information (SMI). It evaluates the Mutual Information on a matrix containing T columns, where each column corresponds to the events occurring at the previous time steps t, t-1, t-2, … t-T, averaged over a subset of attributes.

We should note that the model can be conditioned on the attributes, so we can generate samples for a specific weather condition or day of the week or month.

Experiments on Differential Privacy

In the original work, the authors introduced differential privacy in the model through the well-known technique of clipping the discriminator’s gradients and adding noise to them (DPGAN).
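As a generic illustration of this clip-and-add-noise step, in the style of differentially private SGD, consider the NumPy sketch below. It is not the DPGAN or DoppelGANger implementation; the function name, the clipping norm, and the noise multiplier are assumptions.

```python
import numpy as np

def privatize_gradients(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to clip_norm, sum, add Gaussian noise, average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # toy per-example gradients
noiseless = privatize_gradients(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = privatize_gradients(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds the influence any single example can have on the update, and the noise scale is tied to that bound. A larger noise multiplier gives a smaller privacy budget at the cost of noisier updates.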

However, they found that as the privacy budget, ε, becomes relatively small (meaning that the synthetic data gets safer), the data also starts losing quality, measured by temporal coherence with respect to the real data. This could represent a major problem if the end-usage of the data is to extract detailed temporal information, like causality between events.

Based on recent work around PPGAN (Privacy-preserving Generative Adversarial Network), we introduced some modifications to the noise injected into the gradients of the discriminator. The moments accountant treats the privacy loss as a random variable and uses its moment-generating function to track and bound its distribution. This property makes the PPGAN model training more stable. The difference with DPGAN is particularly significant when generating very long sequences.

The noise is given by the following expression:


Where 𝞷 is the sensitivity to a query f from two neighbor points x and x’:


This expression means that the most informative points (those with the highest sensitivity) get more noise added to the gradient, without compromising the quality of the other points. By using this carefully designed noise, we were able to preserve 88 percent of the autocorrelation up to ε = 1 on the traffic data.


Synthetic sequential data generation is a challenging problem that has not yet been fully solved. Through the testing presented above, we showed that GANs are an effective way to address this problem.