
Capturing Relationships Over Time With Synthetic Event Data from Tonic's AI Synthesizer

Joe Ferrara, PhD
July 1, 2022

    AI Synthesizer and Event Data

With the latest release of Tonic, our proprietary AI Synthesizer can now synthesize realistic event data like financial transactions, time-series sensor data, or other types of sequential tabular data. This synthetic event data not only captures the relationships between columns but also faithfully captures the temporal relationships between rows: times between events, autocorrelations, general trends, and more. In this post, we explore this new capability of AI Synthesizer through two examples of financial transactions datasets.

    Example Datasets

    Our two datasets are derived from the transaction table of the Czech Financial Dataset. The transaction table consists of anonymized bank transactions for approximately 4,500 users.

The first dataset, which we call the monthly aggregate dataset, summarizes each account's transactions at a monthly scale, with one row per account per month and columns for transaction_count, amount_out, and amount_in:

This table has 107,928 rows spanning 4,497 account_ids, with 24 rows for each account_id (4,497 * 24 = 107,928, yay!).
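To make the structure concrete, here is a minimal sketch of how a monthly aggregate like this could be built from a raw transaction table with pandas. The file name, column names (account_id, date, type, amount), and the "credit"/"debit" labels are illustrative assumptions, not the exact schema of the Czech Financial Dataset.

```python
import pandas as pd

# Minimal sketch: roll a raw transaction table up to one row per account per month.
# File and column names are assumptions for illustration.
trans = pd.read_csv("transactions.csv", parse_dates=["date"])
trans["month"] = trans["date"].dt.to_period("M")
trans["amount_in"] = trans["amount"].where(trans["type"] == "credit", 0.0)
trans["amount_out"] = trans["amount"].where(trans["type"] == "debit", 0.0)

monthly = (
    trans.groupby(["account_id", "month"])
         .agg(transaction_count=("amount", "size"),
              amount_in=("amount_in", "sum"),
              amount_out=("amount_out", "sum"))
         .reset_index()
)

print(monthly.shape)
```

In practice you would also zero-fill account-months with no transactions, so that every account contributes exactly 24 rows and the row count works out to 4,497 * 24.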

Our second dataset, the daily transaction dataset, records each user's transaction type and amount at a daily scale over a period of two months:

The daily transaction dataset has 41,551 rows and 4,479 account_ids, with the number of rows per account_id varying from account to account. As we can see from the table above, account_id = 1 has 9 rows in the dataset.

    Monthly Aggregate Dataset: Synthesizing Temporal Trends

So, what happens when we run this data through Tonic's AI Synthesizer? The monthly aggregate table has interesting temporal behavior that is realistically captured in the synthetic data, as demonstrated by the following two sets of graphs.

The graphs below display the 95% confidence intervals for the mean values of the transaction_count, amount_out, and amount_in columns for each month in the dataset.

Two important patterns stand out: for all three columns, the mean increases over time, and there are seasonal spikes, in January and June for transaction_count and amount_out, and in December and June for amount_in. Both patterns are very clearly captured by the synthetic data.
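As a rough sketch of how intervals like these can be computed, the snippet below applies a normal approximation (mean plus or minus 1.96 standard errors) per month. The file names are hypothetical; each table is assumed to have the layout of the monthly aggregate dataset above.

```python
import numpy as np
import pandas as pd

def monthly_mean_ci(df, col):
    """Per-month mean of `col` with a 95% normal-approximation confidence interval."""
    grouped = df.groupby("month")[col]
    mean = grouped.mean()
    sem = grouped.std() / np.sqrt(grouped.count())
    return mean - 1.96 * sem, mean, mean + 1.96 * sem

# Hypothetical file names for the real and synthetic monthly aggregate tables.
for name in ["monthly_real.csv", "monthly_synthetic.csv"]:
    df = pd.read_csv(name)
    lo, mean, hi = monthly_mean_ci(df, "transaction_count")
    print(name, mean.head())
```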

The next set of graphs displays the average autocorrelation for each lag value from 1 to 23. The autocorrelation of a sequence at a given lag measures how strongly the sequence correlates with a copy of itself shifted by that lag, so a large value indicates periodicity with period equal to the lag.

All three plots have large spikes at a lag of 12, corresponding to the yearly periodicity in the data. The fact that the autocorrelation curves line up for the real and synthetic data gives further evidence that the synthetic data is capturing the complex temporal behavior of the real data.
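One way to reproduce curves like these, assuming each account contributes a 24-step monthly sequence, is to compute the autocorrelation per account and then average across accounts at each lag; the sketch below makes that assumption explicit.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a 1-D sequence at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = (x * x).sum()
    return np.nan if denom == 0 else (x[:-lag] * x[lag:]).sum() / denom

def mean_autocorr_by_lag(df, col, max_lag=23):
    """Average the per-account autocorrelation of `col` over all accounts, per lag."""
    seqs = [g.sort_values("month")[col].to_numpy()
            for _, g in df.groupby("account_id")]
    return {lag: np.nanmean([autocorr(s, lag) for s in seqs if len(s) > lag])
            for lag in range(1, max_lag + 1)}
```

Running this on the real and synthetic monthly aggregate tables and plotting the results side by side would yield a comparison along the lines of the graphs above.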

    Daily Transaction Dataset: Synthesizing Fine-grained Temporal Behavior

The monthly aggregate dataset is great for seeing interesting temporal patterns, but because it is an aggregate, it has more structure than a raw transaction table would. In particular, every account has measurements for all 24 time steps, and the length of time between time steps is always one month. In a raw transaction table over a fixed window of time, different accounts have different numbers of transactions, and these transactions occur at irregular times. Let's examine this fine-grained transaction table to see how our AI Synthesizer captures these complicated temporal properties. Show me the fake money!

    Sequence Lengths, Inter-arrival Times, and Transactions Per Day

The sequence length of a given account is the number of rows in the table with that account's account_id. An inter-arrival time is the number of days between consecutive transactions for a given account. The number of transactions in a given day is exactly what it sounds like: the number of rows in the dataset that correspond to a transaction on that day. Together, these three properties capture the variation in types of accounts and in the transaction cadences of different types of accounts. They are displayed in the following graphs for the real and synthetic data.

    As the plots show, the synthetic data captures the complex behaviors of each of these three properties. 
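For reference, all three properties can be read off the daily transaction table with a few groupbys. The file name and column names (account_id, date) in the sketch below are assumptions for illustration.

```python
import pandas as pd

daily = pd.read_csv("daily_transactions.csv", parse_dates=["date"])

# Sequence length: number of rows per account.
sequence_lengths = daily.groupby("account_id").size()

# Inter-arrival time: days between consecutive transactions within each account.
inter_arrival_days = (
    daily.sort_values(["account_id", "date"])
         .groupby("account_id")["date"]
         .diff()
         .dt.days
         .dropna()
)

# Transactions per day: number of rows for each calendar date.
transactions_per_day = daily.groupby("date").size()
```

Plotting the distributions of these three quantities for the real and synthetic tables gives the kind of comparison shown above.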

    Categorical Values Across Time

    In addition to the frequency and cadence of transactions, it’s also important to preserve the types of transactions across time. The daily transaction dataset’s operation column specifies the type of each transaction. The following figure shows the categories and frequencies of the operation column for the real and synthetic data.

Job well done again, AI Synthesizer: this frequency plot looks good and shows that the operation column has similar category frequencies in the real and synthetic datasets when aggregated across time. Let's now look at how these frequencies vary across time for the three most common types of transactions: cash withdrawal, transfer to bank, and cash credit. The following graphs display the number of transactions per day for these three types of transactions.

We see that in addition to nailing the overall frequency of transaction types in the dataset, AI Synthesizer also preserves the trends in which types of transactions are more likely to occur at different times of the month. FTW!
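To reproduce a comparison like this, one option is to tally daily counts for the most common operation values in each table; as before, the file and column names are assumptions for illustration.

```python
import pandas as pd

daily = pd.read_csv("daily_transactions.csv", parse_dates=["date"])

# Daily counts for the three most common transaction types.
top_ops = daily["operation"].value_counts().head(3).index

per_day_by_op = (
    daily[daily["operation"].isin(top_ops)]
         .groupby(["date", "operation"])
         .size()
         .unstack(fill_value=0)
)
```

Repeating this for the real and synthetic tables and plotting each operation's daily counts side by side shows whether the within-month trends are preserved.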

    Conclusion

These example datasets are a small preview of the ability of Tonic and its AI Synthesizer to produce rich event data that mimics your real data with a high degree of fidelity. This synthetic event data accurately captures statistical properties of the underlying dataset, across columns and rows, and can be scaled up to produce as many synthetic sequences as you'd like. Curious to see it in action? We'd love to chat.

    Joe Ferrara, PhD
    Senior AI Scientist
    Joe is a Senior Data Scientist at Tonic. He has a PhD in mathematics from UC Santa Cruz, giving him a background in complex math research. At Tonic, Joe focuses on implementing the newest developments in generating synthetic data.