Data Synthesis
In this notebook we'll demonstrate how causal-validation can be used to simulate
synthetic datasets. We'll start with very simple data to which a static treatment
effect may be applied. From there, we'll build up to complex datasets. Along the way,
we'll show how reproducibility can be ensured, plots can be generated, and unit-level
parameters may be specified.
from itertools import product

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import (
    norm,
    poisson,
)

from causal_validation import (
    Config,
    simulate,
)
from causal_validation.effects import StaticEffect
from causal_validation.plotters import plot
from causal_validation.transforms import (
    Periodic,
    Trend,
)
from causal_validation.transforms.parameter import UnitVaryingParameter
Simulating a Dataset
Simulating a dataset is as simple as specifying a Config object and then invoking the simulate function. Control and treated units, pre- and post-treatment periods, and treatment dosages are all defined through a treatment assignment matrix. Once simulated, we may visualise the data through the plot function.
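The code that constructs the assignment matrix is not shown here, so the following sketch illustrates one plausible convention for a matrix `D` like the one passed to `Config` below. The layout (rows index units, columns index time periods, entries give the dosage) is an assumption for illustration, not the library's documented format.

```python
import numpy as np

# Sketch of a treatment assignment matrix D (assumed convention:
# rows index units, columns index time periods, entries give the dosage).
# Here: 10 control units plus 1 treated unit, with 60 pre-treatment
# periods followed by 30 post-treatment periods of full dosage.
n_units, n_pre, n_post = 11, 60, 30
D = np.zeros((n_units, n_pre + n_post))
D[0, n_pre:] = 1.0  # unit 0 is treated with full dosage post-treatment

# Control units are the all-zero rows; treatment begins at column n_pre.
print(D.shape)  # (11, 90)
print(D.sum())  # 30.0 -- one treated unit for 30 periods
```

A zero row encodes a pure control unit, while fractional entries could encode partial dosages.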
Controlling baseline behaviour
We observe that we have 10 control units, each of which was sampled from a Gaussian distribution with mean 20 and scale 0.2. Had we wished for our underlying observations to have more or less noise, or a different global mean, then we could simply specify that through the config file.
means = [10, 50]
scales = [0.1, 0.5]

fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(10, 6), tight_layout=True)
for (m, s), ax in zip(product(means, scales), axes.ravel(), strict=False):
    cfg = Config(
        treatment_assignments=D,
        global_mean=m,
        global_scale=s,
    )
    data = simulate(cfg)
    _ = plot(data, ax=ax, title=f"Mean: {m}, Scale: {s}")
Reproducibility
In the above four panels, we can see that whilst the mean and scale of the underlying data generating process vary, the functional form of the data is the same. This is by design, to ensure that data sampling is reproducible. To sample a new dataset, you may either change the underlying seed in the config file, reusing the same config file across simulations, or manually specify and pass your own pseudorandom number generator key.
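The seed and generator-key arguments to `Config` are not shown above, so the following sketch illustrates the general principle with `numpy.random.default_rng` directly: identical seeds reproduce the same draws exactly, while a fresh seed yields a new dataset.

```python
import numpy as np

# Two generators built from the same seed produce identical draws...
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same = np.allclose(rng_a.normal(20.0, 0.2, size=5), rng_b.normal(20.0, 0.2, size=5))

# ...while a different seed produces a different stream.
rng_c = np.random.default_rng(123)
different = not np.allclose(np.random.default_rng(42).normal(size=5), rng_c.normal(size=5))

print(same, different)  # True True
```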
Simulating an effect
In the data we have seen up until now, the treated unit has been drawn from the same
data generating process as the control units. However, it can be helpful to also
inflate the treated unit to observe how well our model can recover the true
treatment effect. To do this, we simply compose our dataset with an Effect object.
Below, we shall inflate our data by 2%.
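As a numerical sketch of what a 2% static effect means, using plain NumPy rather than the `StaticEffect` class itself (whose exact constructor is not shown above): the post-treatment observations of the treated unit are multiplicatively inflated, while the pre-treatment period is left untouched.

```python
import numpy as np

# Pre- and post-treatment observations for the treated unit.
treated = np.full(90, 20.0)
n_pre = 60

# A 2% static effect multiplicatively inflates the post-treatment period only.
effect = 0.02
inflated = treated.copy()
inflated[n_pre:] *= 1.0 + effect

print(inflated[:n_pre].mean())  # 20.0 -- pre-treatment unchanged
print(inflated[n_pre:].mean())  # ~20.4 -- post-treatment inflated by 2%
```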
More complex generation processes
The example presented above shows a very simple stationary data generation process. However, we may make our example more complex by adding a non-stationary trend to the data.
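A minimal NumPy sketch of what a linear trend component adds to a stationary baseline (the `slope` and `intercept` parameter names here are illustrative, not necessarily those of the `Trend` transform):

```python
import numpy as np

# A linear (non-stationary) trend layered onto a stationary baseline.
n_periods = 90
t = np.arange(n_periods)
baseline = np.full(n_periods, 20.0)

slope, intercept = 0.1, 0.0  # illustrative parameter names
trended = baseline + intercept + slope * t

# The series now drifts upward rather than fluctuating around a fixed mean.
print(trended[0], round(trended[-1], 6))  # 20.0 28.9
```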
We may also include periodic components in our data.
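Likewise, a periodic component can be sketched as a mean-zero sinusoid layered onto the baseline (the `amplitude` and `period` names are illustrative, not necessarily those of the `Periodic` transform):

```python
import numpy as np

# A sinusoidal seasonal component with a given amplitude and period.
n_periods = 90
t = np.arange(n_periods)
amplitude, period = 2.0, 12  # e.g. a yearly cycle in monthly data

seasonal = amplitude * np.sin(2.0 * np.pi * t / period)
series = 20.0 + seasonal

# The component is mean-zero over whole cycles, so the global mean is preserved.
print(round(series[: 7 * period].mean(), 6))  # 20.0
```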
Unit-level parameterisation
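In the transforms above, a single parameter value, such as a trend's slope, is shared by every unit. The `UnitVaryingParameter` import suggests an alternative in which each unit draws its own value from a sampling distribution. The following NumPy/SciPy sketch illustrates that idea; the library's actual constructor and argument names are assumptions not shown here.

```python
import numpy as np
from scipy.stats import norm

# Each unit draws its own trend slope from a shared sampling distribution,
# instead of all units using one global value.
rng = np.random.default_rng(0)
n_units, n_periods = 11, 90
t = np.arange(n_periods)

slope_dist = norm(loc=0.05, scale=0.02)  # distribution over per-unit slopes
slopes = slope_dist.rvs(size=n_units, random_state=rng)

# Broadcast: one trend line per unit, each with its own slope.
panel = 20.0 + slopes[:, None] * t[None, :]

print(panel.shape)  # (11, 90)
```

In the library itself one would presumably pass such a distribution to `UnitVaryingParameter` and supply the result as the transform's coefficient; consult the package documentation for the exact signature.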
Conclusions
In this notebook we have shown how one can define their model's true underlying data generating process, starting from simple white-noise samples through to more complex examples with trend and periodic components, perhaps containing unit-level variation. In a follow-up notebook, we show how these datasets may be integrated with Amazon's own AZCausal library to compare the effect estimated by a model with the true effect of the underlying data generating process. A link to this notebook is here.