So let’s dive in, starting with how the DB Synthesizer works:
All the structural information and the types of data values will be almost exactly as they are in the original database. Foreign key distributions, which capture properties such as referential integrity and cardinality, are approximated from the original database and used to populate the destination database.
This means the destination database will have a similar, but not identical, distribution of referential integrity and cardinality to the original database.
Each column’s probability density function is also approximated, so the generated distributions look very similar to the original data without being exactly the same data. This is what makes the generated data private.
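To make this idea concrete, here is a minimal sketch of density approximation in Python. It is purely illustrative: the DB Synthesizer’s actual models are more sophisticated, and the “age” column below is made up for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical "age" column standing in for a real column
# from the original database.
rng = np.random.default_rng(0)
original_ages = rng.normal(45, 12, size=10_000)

# Approximate the column's probability density function...
density = gaussian_kde(original_ages)

# ...and sample brand-new values from it: the distribution is
# similar, but none of the original values are reused.
synthetic_ages = density.resample(10_000)[0]

print(original_ages.mean(), synthetic_ages.mean())  # close, not identical
```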
Now that you have an understanding of how the DB Synthesizer works, let’s look into an example.
The original database used in this report is from a credit system and contains historical information about users and the credit lines, products and campaigns they participated in, among other information. This database has seven tables along with several primary and foreign keys.
Table sizes in this database range from 10 to 1M rows, and while a database with only seven tables was chosen for the purposes of this report, the DB Synthesizer is scalable and able to work with databases of virtually any size (the software has been tested and validated against production-size databases with thousands of tables).
The DB Synthesizer can be run from a command line interface, which launches the Java application with a simple command:
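The exact invocation is documented at docs.synthesized.io; purely as an illustration of its shape (the jar name and flag names below are hypothetical, not the real CLI), the command looks something like this:

```
java -jar db-synthesizer.jar \
  --origin-url jdbc:postgresql://origin-host:5432/ \
  --origin-db credit \
  --destination-url jdbc:postgresql://dest-host:5432/ \
  --destination-db credit_synth \
  --action synthesize
```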
What you can see here is the origin connection URL and the name of the origin database, followed by the destination connection URL and destination database name, along with the other parameters that can be set.
The command also specifies the action to perform, such as copying or synthesizing the data, and configures the DB Synthesizer with information about the source database along with user and password details. These and other parameters are configurable.
More information can be found here: docs.synthesized.io
The documentation lists all the other configuration options you can use. The DB Synthesizer can also be run from an API, and more detailed information about additional parameters can be found in the configuration section of the documentation.
All of these parameters can be set for extra, more personalized output. Information such as primary keys, column types, et cetera, will be extracted automatically from the database, but there are some cases where information needs to be provided to the Synthesizer manually, and a configuration file lets us supply it.
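Purely as an illustration, a configuration file of this kind might declare relationships the Synthesizer cannot infer on its own. The keys below are hypothetical, not the real schema; see the configuration section of the documentation for the actual format:

```yaml
# Hypothetical illustration only, not the real configuration schema.
tables:
  customer_info:
    primary_key: id_customer
  customercreditline:
    foreign_keys:
      - column: id_customer
        references: customer_info.id_customer
```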
Now that the DB Synthesizer has been run for this database, let’s look at the results. To do so, we’ll connect to both the original and destination databases and compare them. Firstly, let’s confirm that the tables in the synthetic (destination) database are the same as in the original database.
So we have seven tables, which is the same number as in the original database.
The next thing we’ll look at is referential integrity. From the first five rows of a synthesized table, we can see that the data looks very similar to the original database. And if we look at the shapes of the tables, the numbers of rows and columns in the original and synthesized tables are the same.
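Here is a sketch of how these checks can be run with pandas and SQLAlchemy. The connection URLs are placeholders; the table names are read from the databases themselves:

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

orig = create_engine("postgresql://user:pass@origin-host/credit")       # placeholder URL
synth = create_engine("postgresql://user:pass@dest-host/credit_synth")  # placeholder URL

# Same set of tables in both databases?
orig_tables = set(inspect(orig).get_table_names())
synth_tables = set(inspect(synth).get_table_names())
assert orig_tables == synth_tables  # seven tables in each

# Same shape (rows x columns) for every table, and similar-looking rows?
for table in sorted(orig_tables):
    o = pd.read_sql_table(table, orig)
    s = pd.read_sql_table(table, synth)
    print(table, o.shape, s.shape)  # shapes match
    print(s.head())                 # first five rows resemble the original
```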
Now, if we do the same exercise and pick one random customer in the synthetic data, for example customer ID 265, this is how the row looks.
And then we can go to the customer credit line for the same user in the synthetic database and we’ll see that this one has five rows as well.
This is a quick example of how referential integrity is preserved. Furthermore, if we join these tables on the customer ID column in both the original and synthetic databases, we’ll see that some distributions are preserved even after joining.
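Continuing the sketch above (reusing the orig and synth engines), the spot check looks like this; the table and column names are assumptions based on the schema described in this report:

```python
# Spot-check one synthetic customer (id_customer = 265).
customer = pd.read_sql_query(
    "SELECT * FROM customer_info WHERE id_customer = 265", synth)      # assumed table name
credit_lines = pd.read_sql_query(
    "SELECT * FROM customercreditline WHERE id_customer = 265", synth)
print(len(credit_lines))  # five rows for this customer

# Join parent and child tables; the joined frame can then be used to
# compare statistics such as default rate and minimum credit score.
joined = customer.merge(credit_lines, on="id_customer")
print(joined.head())
```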
For example, if we look at the default rate, we can see that it matches the original database. The same applies to other fields like the minimum credit score: it has been approximated quite closely.
Let’s do the same exercise for the other tables. Looking at the customer information table, we pick a customer (id_customer = 11391) who has only one product, id_product = 15, which means it appears only once.
• Original products for id_customer = 11391, that has id_product = 15
While we might have thought we were looking at a one-to-many relationship, in this case we are actually looking at a one-to-one relationship. You can see that in the synthetic database it is also a one-to-one relationship: the corresponding customer has only one product (with ID 14), so the Synthesizer was able to keep that structural relationship.
• Synthesized customer info for id_customer = 265
• Synthesized customercreditline for id_customer = 265
And if we join these tables again, you can see from the first five rows that there are no duplicates, in either the original or the synthetic database. So the referential integrity of this structure was kept.
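The duplicate check itself is short (again reusing the engines from the earlier sketch; the table names and the id_customer join column are assumptions):

```python
# Verify that the one-to-one join introduces no duplicate rows
# in either database.
for engine in (orig, synth):
    info = pd.read_sql_table("customer_info", engine)  # assumed table names
    products = pd.read_sql_table("products", engine)
    joined = info.merge(products, on="id_customer")    # assumed join column
    assert not joined.duplicated().any()
```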
Let’s continue and plot the heat map for two columns, the age band and the campaign code, in the original and synthesized tables. It’s important to highlight that these columns sit in different tables: the age band is in the customer information table, while the campaign code is in the products table. You can see that even after joining both tables, the heat maps are quite similar.
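A sketch of the heat map comparison, continuing the snippets above (column names such as age_band and campaign_code are assumed from the report’s description):

```python
import matplotlib.pyplot as plt

# Cross-tabulate age band (customer info table) against campaign code
# (products table) after joining, for each database.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, engine, title in [(axes[0], orig, "original"), (axes[1], synth, "synthesized")]:
    info = pd.read_sql_table("customer_info", engine)   # assumed table names
    products = pd.read_sql_table("products", engine)
    joined = info.merge(products, on="id_customer")
    counts = pd.crosstab(joined["age_band"], joined["campaign_code"])  # assumed columns
    ax.imshow(counts, aspect="auto", cmap="viridis")
    ax.set_title(title)
    ax.set_xlabel("campaign code")
    ax.set_ylabel("age band")
plt.show()
```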
We can plot other combinations, like campaign length against age band, and the campaign length distributions are very similar. The distributions are also quite similar when looking at campaign codes and product codes, so you can see how the distributions are kept.
Let’s go a step further and join three tables. We’ll join the credit line table with the customer credit line table on the credit line ID, then join the products table on the product ID. We then apply the same query to the synthetic database and plot the distributions.
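As a sketch, the same query can be issued against both databases and the resulting distributions compared (table and column names are assumptions based on this report):

```python
# The same three-table join, issued against both databases.
query = """
    SELECT p.campaign_code, p.product_code
    FROM creditline cl
    JOIN customercreditline ccl ON ccl.id_credit_line = cl.id_credit_line
    JOIN products p ON p.id_product = ccl.id_product
"""
orig_join = pd.read_sql_query(query, orig)
synth_join = pd.read_sql_query(query, synth)

# Compare the campaign-code distributions after the join.
print(orig_join["campaign_code"].value_counts(normalize=True).head())
print(synth_join["campaign_code"].value_counts(normalize=True).head())
```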
You can see that even after joining three tables, the distributions are kept quite nicely, both for the campaign code and the product code. And if we look at customers per product, we’ll see the same result.
Now let’s take a look at the foreign keys. For foreign keys, we want to make sure that the value counts are preserved. For example, given customer and customer transaction tables, we would want the number of transactions per customer to be similar, to ensure referential integrity is preserved.
This is known as cardinality, and it is plotted above. In the first case, all values are unique in both the original and synthetic databases, and you can see that this distribution is approximated quite accurately. In the second, the majority of customer IDs have only one occurrence (around 8,000 of them), while the rest appear multiple times, and the synthetic database shows a similar picture: referential integrity is approximated quite nicely.
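Cardinality is easy to inspect with pandas (a sketch reusing the engines above; the table and foreign-key names are assumptions):

```python
# Cardinality: how many child rows does each foreign-key value have?
orig_counts = pd.read_sql_table("customercreditline", orig)["id_customer"].value_counts()
synth_counts = pd.read_sql_table("customercreditline", synth)["id_customer"].value_counts()

# Distribution of occurrence counts: how many customers have exactly
# one credit line, two credit lines, and so on.
print(orig_counts.value_counts().sort_index())
print(synth_counts.value_counts().sort_index())
```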
Last but not least, we also need to look at column distributions: if we look at a single column, is its distribution preserved? Here you can see the distributions of the individual columns of the customer information table.
You can see that continuous and categorical values are all preserved quite nicely. The obvious exception is the primary key, which is not preserved because we generate new primary keys to avoid privacy leakage; the primary keys are generated independently of the original values.
The rest of the values are fairly similar. Similarly, if we look at the customer credit line table, the credit line ID and customer ID columns are deliberately not approximated: new values are generated for them because they are identifiers, and the database could otherwise be attacked through these values. For the rest of the columns, you can see that the distributions are preserved.
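A sketch of a per-column comparison that skips the regenerated ID columns (names assumed, continuing the earlier snippets):

```python
# Compare per-column distributions, skipping the regenerated ID columns.
id_columns = {"id_credit_line", "id_customer"}  # regenerated for privacy

o = pd.read_sql_table("customercreditline", orig)
s = pd.read_sql_table("customercreditline", synth)
for col in o.columns:
    if col in id_columns:
        continue
    if pd.api.types.is_numeric_dtype(o[col]):
        # Continuous columns: compare basic summary statistics.
        print(col, round(o[col].mean(), 3), round(s[col].mean(), 3))
    else:
        # Categorical columns: compare value frequencies.
        freq = pd.concat(
            [o[col].value_counts(normalize=True),
             s[col].value_counts(normalize=True)],
            axis=1, keys=["original", "synthetic"])
        print(freq.head())
```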
In this report we demonstrated how the DB Synthesizer is able to explore a given database, extract and learn all the information contained within it, and generate an entirely new database whilst maintaining the referential integrity necessary to ensure the proper linkage between multiple tables within the database.
Please feel free to reach out with any questions, or if you’d like to learn more: letschat@synthesized.io