February 8, 2024

Synthesized creates accurate synthetic data to power innovation with BigQuery

Today, data powers the enterprise, helping to provide competitive advantage, inform business decisions, and drive innovation. However, accessing high-quality data can be costly and time-consuming, and using it often means complying with strict regulations.

Synthesized helps organizations gain faster access to data and navigate compliance restrictions by using generative AI to create shareable, compliant snapshots of large datasets. These snapshots can then be used to make faster and more informed business decisions, and power application development and testing. It does this by helping organizations overcome many of the obstacles to fast and compliant insights:

  • Accessing compliant data - BigQuery provides a wide range of capabilities that help data to be stored and governed in a secure and compliant way. However, when that data is used in a different context, for example to train an ML model, for testing, or to share information with a different department with different clearance levels, ensuring that data is accessed in a compliant way can become complex. Confidential datasets, such as those with personally identifiable information (PII), medical records, financial data, or other sensitive information that should not be disclosed, are often subject to different restrictions due to industry and local governmental regulations. This can make it difficult for international organizations to manage access for teams across regions and countries.
  • Ensuring data quality - One way to manage and protect confidential datasets is data masking, that is, obscuring data so that it cannot be read by certain users. While this is a powerful approach for many use cases, it’s less suited to scenarios where visibility of the underlying data is required, for example when training a machine learning model. On top of this, organizations are also tasked with uncovering insights from low-quality or unbalanced data, which makes it difficult to arrive at accurate and representative insights.

Unlocking data’s potential with accurate snapshots

Synthesized uses generative AI to help customers across healthcare, financial services, insurance, government, and more generate a new and accurate view of their data with confidentiality restrictions automatically applied.

The solution applies data transformations such as masking, subsetting, redaction, or generation to create high-fidelity snapshots of large datasets that can be used for modeling and testing. Synthesized uses generative AI to capture deep statistical properties and patterns, which are often hidden in the data, and recreate them in synthetic data. At the same time, Synthesized helps ensure adherence to enterprise data privacy regulations, because the output data is programmatically designed to be fully anonymized. The result is fast, easy access to high-quality data that enables better decision-making.
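To make the transformation names above concrete, here is a minimal Python sketch of masking, redaction, and subsetting on a toy table. The helper names (`mask`, `redact`, `subset`), the field names, and the sample rows are all hypothetical and chosen only for illustration. This is not how Synthesized implements these transformations, and generation with a generative model is deliberately left out.

```python
import random


def mask(value, visible=4):
    """Masking: hide every character except the last `visible` ones."""
    text = str(value)
    hidden = max(len(text) - visible, 0)
    return "*" * hidden + text[hidden:]


def redact(row, fields):
    """Redaction: replace the listed fields with a fixed placeholder."""
    return {k: ("[REDACTED]" if k in fields else v) for k, v in row.items()}


def subset(rows, fraction, seed=0):
    """Subsetting: keep a random sample of rows (at least one)."""
    rng = random.Random(seed)
    size = max(1, round(len(rows) * fraction))
    return rng.sample(rows, size)


# Hypothetical toy table with two sensitive columns.
rows = [
    {"name": f"Customer {i}", "iban": f"DE{i:020d}", "balance": 100 * i}
    for i in range(10)
]

# Keep 30% of the rows, redact names, and mask account numbers.
snapshot = [
    {**redact(row, {"name"}), "iban": mask(row["iban"])}
    for row in subset(rows, 0.3)
]

for row in snapshot:
    print(row)
```

Note that a plain random sample like this ignores relationships between tables; real subsetting of a relational database generally has to keep related rows together.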

With the click of a button, organizations can access insights from a synthetic snapshot that is representative of the entire original dataset, in a way that’s fast and compliant. In other words, the solution addresses the “chicken-and-egg” problem of data access: data consumers have to formulate their request for data access as a SQL query, but they can’t write the query without access to the data in the first place.

The newly generated synthetic data can be used for a variety of purposes, including:

  • Fast access to a compliant snapshot of the data for testing and development purposes.
  • Simplifying model training by programmatically creating diverse data snapshots that cover a wide range of scenarios, including edge cases and rare events. This diversity helps improve the robustness and generalization of machine learning models.
  • Accelerating and evaluating cloud migration with accurate test data that mimics the structure of cloud databases, so you can confidently add sanitized or synthetic data by extending existing CI/CD pipelines.
  • Creating full datasets from unbalanced data, when an original dataset has an unequal distribution of examples and analysis requires the extrapolation of additional reliable data points.
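As a rough illustration of what balancing an unbalanced dataset means, the sketch below uses plain random oversampling: minority-class rows are duplicated until every class is as large as the biggest one. This is a much simpler technique than the generative approach described in this post, which creates new data points instead of copying existing ones. The function name `oversample`, the `fraud` label, and the sample rows are assumptions made for the example.

```python
import random
from collections import Counter


def oversample(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical unbalanced table: 95 normal transactions, 5 fraudulent ones.
rows = [{"amount": a, "fraud": False} for a in range(95)] + [
    {"amount": a, "fraud": True} for a in range(5)
]

balanced = oversample(rows, "fraud")
counts = Counter(row["fraud"] for row in balanced)
print(counts)
```

Because oversampling only repeats rows that already exist, it adds no new information about the minority class, which is the gap that synthetic generation is meant to fill.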

German bank gets compliant, high-quality synthetic data

One of the largest banks in Germany turned to Synthesized to give its engineers and data science teams fast access to synthetic test data. The bank wanted to shorten the preparation time needed to query the data so that it could speed up testing and time to market, and increase accuracy. Synthesized provided non-traceable snapshots of the original datasets, enabling the bank to start data analysis, app migration and testing in the cloud, and experiment with large datasets for new AI/ML use cases and technologies.

Insurance company accelerates product development

Likewise, a leading insurance company wanted to move away from highly manual and resource-intensive data processes to help it remain competitive. Synthesized helped the company generate millions of highly representative test datasets that could be shared safely with third-party vendors for product development. The company was able to accelerate product development, save 200 man-hours per project and drastically reduce its volume of work.

Built with BigQuery

Synthesized extends the functions already available in BigQuery. For example, BigQuery covers masking and data loss prevention for redaction, while Synthesized applies transformations like subsetting and generation. Integrating Synthesized and BigQuery can help organizations to gain fast and secure access to ready-to-query datasets, extracting only the snapshots they need to inform testing or business intelligence. Once the snapshots are ready to be shared safely from a compliance perspective, they can be stored in an organization's own systems, or shared with third parties for analysis.

Because these snapshots remain in BigQuery, they can be easily used with the full range of Google Data and AI products, including training AI models with BigQuery ML and Vertex AI.

Synthesized has API access to BigQuery, so extracting snapshots and provisioning data is easy and automated. Synthesized also uses a generative model to synthesize data and create balanced datasets from unbalanced datasets, providing the necessary distribution of examples that are ready for sharing. This generative model is stored within the customer's tenant and can also be shared along with the data.

Here is a simple illustrative example query to generate a fast and compliant snapshot with 1,000 rows from an input table:

SELECT dataset.synthesize(
  'project.dataset.input_table',
  'project.dataset.output_table',
  '{"synthesize": {"num_rows": 1000, "produce_nans": true}}'
);
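Because the call is plain SQL, it can be generated from a script when snapshots need to be provisioned automatically. The following is a minimal Python sketch that builds the same query text as above with a configurable row count. The helper `build_synthesize_query` and its table ID check are hypothetical; only the `dataset.synthesize` call and the two configuration keys come from the example query, and any other options are not assumed here.

```python
import json
import re

# Matches a fully qualified table ID such as "project.dataset.table".
_TABLE_ID = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_synthesize_query(input_table, output_table, num_rows, produce_nans=True):
    """Return SQL text for the dataset.synthesize call shown in the article."""
    for table in (input_table, output_table):
        # Reject anything that is not a simple table ID before embedding it in SQL.
        if not _TABLE_ID.match(table):
            raise ValueError(f"not a fully qualified table ID: {table!r}")
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    # json.dumps renders Python True as JSON true, matching the original query.
    config = json.dumps(
        {"synthesize": {"num_rows": num_rows, "produce_nans": produce_nans}}
    )
    return (
        "SELECT dataset.synthesize(\n"
        f"  '{input_table}',\n"
        f"  '{output_table}',\n"
        f"  '{config}'\n"
        ");"
    )


query = build_synthesize_query(
    "project.dataset.input_table", "project.dataset.output_table", num_rows=1000
)
print(query)
```

To run the query, the resulting string could be passed to a BigQuery client, for example `Client.query` in the google-cloud-bigquery Python library. That step is omitted here because it needs credentials and a project where the function is installed.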

Synthesized Scientific Data Kit (SDK) is now available on Google Cloud Marketplace.

The Built with BigQuery advantage for ISVs and data providers

Built with BigQuery helps ISVs and data providers build innovative applications with Google’s Data Cloud. Participating companies can:

  • Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices
  • Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption

BigQuery gives ISVs the advantage of a powerful, highly scalable unified AI lakehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. Learn more about Built with BigQuery.

---

Originally published at cloud.google.com/blog/ on February 2, 2024.