Why synthetic data makes real AI better


Data is precious. By some accounts, it has become the world’s most valuable commodity.

And when it comes to training artificial intelligence (AI) and machine learning (ML) models, it’s absolutely essential.

Still, due to various factors, high-quality, real-world data can be hard – sometimes even impossible – to come by.

This is where synthetic data becomes so valuable.

Synthetic data reflects real-world data, both mathematically and statistically, but it’s generated in the digital world by computer simulations, algorithms, statistical modeling, simple rules and other techniques. This is opposed to data that’s collected, compiled, annotated and labeled based on real-world sources, scenarios and experimentation.
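As a rough illustration of the statistical-modeling route described above, the following Python sketch fits a simple normal model to a small "real" sample and draws synthetic values from it. The sample values, the function name `synthesize` and the choice of a normal distribution are all assumptions made for illustration; production generators are typically far more sophisticated.

```python
import random
import statistics


def synthesize(real_values, n, seed=0):
    """Draw n synthetic values from a normal model fitted to real_values."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements (for example, ages in a small survey).
real_ages = [23, 35, 41, 29, 52, 38, 47, 31, 60, 44]
synthetic_ages = synthesize(real_ages, n=5000)

# Compare summary statistics of the real and synthetic samples.
print(round(statistics.mean(real_ages), 1), round(statistics.mean(synthetic_ages), 1))
print(round(statistics.stdev(real_ages), 1), round(statistics.stdev(synthetic_ages), 1))
```

The synthetic values share the mean and spread of the original sample without reproducing any individual record, which is the basic idea behind using such data in place of sensitive originals.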

The concept of synthetic data has been around since the early 1990s, when Harvard statistics professor Donald Rubin generated a set of anonymized U.S. Census responses that mirrored those of the original dataset (but without identifying respondents by home address, phone number or Social Security number).

Synthetic data came to be more widely used in the 2000s, particularly in the development of autonomous vehicles. Now, synthetic data is increasingly being applied to numerous AI and ML use cases.

Synthetic data vs. real data

Real-world data is almost always the best source of insights for AI and ML models (because, well, it’s real). That said, it can often simply be unavailable, unusable due to privacy regulations and constraints, imbalanced or expensive. Errors can also be introduced through bias.

To this point, Gartner estimates that through 2022, 85% of AI projects will deliver erroneous outcomes.

“Real-world data is happenstance and does not contain all permutations of conditions or events possible in the real world,” Alexander Linden, VP analyst at Gartner, said in a firm-conducted Q&A.

Synthetic data may counter many of these challenges. According to experts and practitioners, it’s often quicker, easier and less expensive to produce and doesn’t need to be cleaned and maintained. It removes or reduces constraints in using sensitive and regulated data, can account for edge cases, can be tailored to certain conditions that might otherwise be unobtainable or have not yet occurred, and can allow for quicker insights. Also, training is less cumbersome and much more effective, particularly when real data can’t be used, shared or moved.

As Linden notes, sometimes information injected into AI models can prove more valuable than direct observation. Similarly, some assert that synthetic data is better than the real thing – even revolutionary.

Companies apply synthetic data to a variety of use cases: software testing, marketing, creating digital twins, testing AI systems for bias, or simulating the future, alternate futures or the metaverse. Banks and financial institutions use synthetic data to explore market behaviors, make better lending decisions or combat financial fraud, Linden explains. Retailers, meanwhile, rely on it for autonomous checkout systems, cashier-less stores and analysis of customer demographics.

“When combined with real data, synthetic data creates an enhanced dataset that often can mitigate the weaknesses of the real data,” Linden says.

Still, he cautions that synthetic data has risks and limitations. Its quality depends on the quality of the model that created it, it can be misleading and lead to inferior results, and it may not be “100% fail-safe” privacy-wise.

Then there’s user skepticism – some have referred to it as “fake data” or “inferior data.” Also, as it becomes more widely adopted, business leaders may raise questions about data generation techniques, transparency and explainability.

Real-world growth for synthetic data

In an oft-quoted prediction from Gartner, by 2024, 60% of data used for the development of AI and analytics projects will be synthetically generated. In fact, the firm said that high-quality, high-value AI models simply won’t be possible without the use of synthetic data. Gartner further estimates that by 2030, synthetic data will completely overshadow real data in AI models.

“The breadth of its applicability will make it a critical accelerator for AI,” Linden says. “Synthetic data makes AI possible where lack of data makes AI unusable due to bias or inability to recognize rare or unprecedented scenarios.”

According to Cognilytica, the market for synthetic data generation was roughly $110 million in 2021. The research firm expects that to reach $1.15 billion by 2027. Grand View Research expects the AI training dataset market to reach more than $8.6 billion by 2030, representing a compound annual growth rate (CAGR) of just over 22%.
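For a sense of scale, the growth rate implied by the two Cognilytica figures can be worked out with the standard CAGR formula. This is a back-of-the-envelope sketch: the implied rate is not a number published by Cognilytica, and it assumes six years of compounding between 2021 and 2027. The function name `implied_cagr` is hypothetical.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Cognilytica figures quoted above, in millions of US dollars.
start_2021 = 110
end_2027 = 1150
rate = implied_cagr(start_2021, end_2027, years=2027 - 2021)
print(f"Implied CAGR 2021-2027: {rate:.1%}")
```

Under those assumptions, the synthetic data generation market would need to grow by roughly 48% a year, more than double the rate Grand View Research cites for the broader AI training dataset market.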

And as the concept grows, so too do the contenders.

An increasing number of startups are entering the synthetic data space and receiving significant funding in doing so. These include Datagen, which recently closed a $50 million series B; Gretel.ai, with a $50 million series B; MostlyAI, with a $25 million series B; and Synthesis AI, with a $17 million series A.

Other companies in the space include Sky Engine, OneView, Cvedia and leading data engineering company Innodata, which recently launched an ecommerce portal where customers can purchase on-demand synthetic datasets and immediately train models. Several open-source tools are also available: Synner, Synthea, Synthetig and The Synthetic Data Vault.

Similarly, Google, Microsoft, Facebook, IBM and Nvidia are already using synthetic data or are developing engines and programs to do so.

Amazon, for its part, has relied on synthetic data to generate and fine-tune its Alexa virtual assistant. The company also offers WorldForge, which enables the generation of synthetic scenes, and just announced at its re:MARS (Machine Learning, Automation, Robotics and Space) conference last week that its SageMaker Ground Truth tool can now be used to generate labeled synthetic image data.

“Combining your real-world data with synthetic data helps to create more complete training datasets for training your ML models,” Antje Barth, principal developer advocate for AI and ML at Amazon Web Services (AWS), said in a blog post published in conjunction with re:MARS.

How synthetic data enhances the real world

Barth described the building of ML models as an iterative process involving data collection and preparation, model training and model deployment.

When starting out, a data scientist might spend months collecting hundreds of thousands of images from production environments. A major hurdle is representing all possible scenarios and annotating them correctly. Acquiring some variations might be impossible, such as in the case of rare product defects. In that instance, developers may have to intentionally damage products to simulate various scenarios.

Then comes the time-consuming, error-prone, expensive process of manually labeling images or building labeling tools, Barth points out.

AWS introduced the new synthetic data capability in SageMaker Ground Truth, Amazon’s data labeling service, to help simplify, streamline and enhance this process. The new capability creates synthetic, photorealistic images.

Through the service, developers can create an unlimited number of images of a given object in different positions, proportions, lighting conditions and other variations, Barth explains. This is critical, she notes, as models learn best when they have an abundance of sample images and training data covering numerous variations and scenarios.

Synthetic data can be created through the service in enormous quantities with “highly accurate” labels for annotations across thousands of images. Labeling can be done at fine granularity, such as at the subobject or pixel level, and across modalities including bounding boxes, polygons, depth and segments. Objects and environments can also be customized with variations in such elements as lighting, textures, poses, colors and background.
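One reason synthetic images can come with accurate labels is that the generator already knows where it placed each object. The toy Python sketch below illustrates that idea only; it is not the SageMaker Ground Truth API, and the function `render_scene`, the grid size and the brightness range are hypothetical stand-ins for a real renderer and its settings.

```python
import random


def render_scene(width, height, rng):
    """Place one rectangular 'object' on a blank grid and return image plus labels."""
    obj_w = rng.randint(2, width // 2)
    obj_h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - obj_w)
    y0 = rng.randint(0, height - obj_h)
    brightness = rng.randint(50, 255)  # stand-in for a lighting variation

    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]  # pixel-level label
    for y in range(y0, y0 + obj_h):
        for x in range(x0, x0 + obj_w):
            image[y][x] = brightness
            mask[y][x] = 1

    # Bounding box as (x_min, y_min, x_max, y_max), inclusive pixel coordinates.
    bbox = (x0, y0, x0 + obj_w - 1, y0 + obj_h - 1)
    return image, mask, bbox


rng = random.Random(42)
dataset = [render_scene(16, 16, rng) for _ in range(100)]
image, mask, bbox = dataset[0]
print("bounding box:", bbox, "labeled pixels:", sum(map(sum, mask)))
```

Because the mask and bounding box are derived from the same placement parameters as the image, no manual annotation step is needed, and varying those parameters produces as many labeled examples as required.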

“In other words, you can ‘order’ the exact use case you are training your ML model for,” Barth says.

She adds that “if you combine your real-world data with synthetic data, you can create more complete and balanced datasets, adding data variety that real-world data might lack.”
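The balancing idea in that quote can be sketched in a few lines of Python. This is a deliberately crude example: it "generates" synthetic records by adding small random noise to real ones, which is an assumption made here for brevity rather than how any product named in this article works. The record fields (`label`, `size_mm`) and the function `balance_with_synthetic` are hypothetical.

```python
import random
from collections import Counter


def balance_with_synthetic(records, label_key, feature_key, rng, noise=0.05):
    """Top up minority classes with jittered copies until every class matches the largest."""
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)
    target = max(len(group) for group in by_label.values())

    # Keep every real record and flag it so provenance stays visible.
    combined = [dict(rec, synthetic=False) for rec in records]
    for label, group in by_label.items():
        for _ in range(target - len(group)):
            base = rng.choice(group)
            jittered = base[feature_key] * (1 + rng.uniform(-noise, noise))
            combined.append({label_key: label, feature_key: jittered, "synthetic": True})
    return combined


rng = random.Random(7)
# Hypothetical inspection data: defects are rare in the real sample.
real = [{"label": "ok", "size_mm": rng.gauss(10.0, 0.2)} for _ in range(95)]
real += [{"label": "defect", "size_mm": rng.gauss(12.0, 0.5)} for _ in range(5)]

combined = balance_with_synthetic(real, "label", "size_mm", rng)
print(Counter(rec["label"] for rec in combined))
```

Flagging each record as real or synthetic keeps the two sources distinguishable, which helps when questions about data generation and transparency come up later.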

Any scenario

In SageMaker Ground Truth, users can request new synthetic data projects, monitor them in progress, and view batches of generated images once they are available for review.

After establishing project requirements, an AWS project development team creates small test batches by collecting inputs including reference photos and 2D and 3D sources, Barth explains. These are then customized to represent any variation or scenario, such as scratches, dents and textures. The team can also create and add new objects, configure distributions and locations of objects in a scene, and modify object size, shape, color and surface texture.

Once prepared, objects are rendered via a photorealistic physics engine and automatically labeled. Throughout the process, companies receive a fidelity and diversity report providing image- and object-level statistics to “help make sense” of synthetic images and compare them with real images, Barth said.

“With synthetic data,” she said, “you have the freedom to create any imagery environment.”

VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact.