The rise of GenAI has introduced a new dimension in the data sphere: artificial or synthetic data. It can help address three problems: the scarcity of training data, the lack of comprehensive training datasets that account for all scenarios, and data protection concerns when using real-world data for machine learning. AI models rely heavily on the “quality, accuracy, completeness, consistency, timeliness and relevance” of data. As such, artificial data is expected to further accelerate AI advancement.
GenAI enables the generation of artificial data: “information from almost any source is analyzed to detect structures and patterns, which are then used as the foundation for creating new datasets”. An AI model is trained on existing data and generates new labeled datasets with the same mathematical properties and predictive power. This minimizes labeling effort, and the result can replace or complement real-world data: “Retaining the structure and statistical integrity of original data, synthetic data has the same predictive power as the original data but replaces it rather than disguising or modifying it”. AI models, in turn, rely on this data as training input to learn and predict. In other words, AI generates synthetic data and also depends on it to reach its full potential.
This two-way dependency underscores the importance of the quality, completeness, and accuracy of both the original datasets and the models.
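The idea of learning structure from existing data and then sampling new records with similar statistical properties can be sketched in a few lines. The example below is only an illustration under strong assumptions: the "real" dataset of (age, income) pairs is invented, the helper names `fit_two_columns` and `sample_synthetic` are hypothetical, and a two-column Gaussian stands in for the far richer generative models used in practice. It offers no privacy guarantee of its own.

```python
import math
import random
import statistics


def fit_two_columns(rows):
    """Estimate mean, spread and correlation of two numeric columns."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, computed by hand to stay within the standard library.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = max(-1.0, min(1.0, cov / (std_x * std_y)))
    return mean_x, mean_y, std_x, std_y, rho


def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate Gaussian with the fitted parameters."""
    mean_x, mean_y, std_x, std_y, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = mean_y + std_y * (rho * z1 + math.sqrt(max(0.0, 1 - rho**2)) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Stand-in for "real" records: (age, income), invented for illustration only.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 1000 + 50 * age + rng.gauss(0, 200)))

real_params = fit_two_columns(real)
synthetic = sample_synthetic(real_params, 5000, rng)
synthetic_params = fit_two_columns(synthetic)

names = ("mean_x", "mean_y", "std_x", "std_y", "rho")
for name, r, s in zip(names, real_params, synthetic_params):
    print(f"{name}: real={r:.3f} synthetic={s:.3f}")
```

The printed comparison shows how closely the synthetic rows track the summary statistics of the source. It also illustrates the dependency described above: any flaw in the source data is carried straight into the fitted parameters and therefore into every generated row.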
Data scarcity is becoming an obstacle to AI training. Algorithms require vast amounts of information covering different scenarios and accounting for exceptional cases. Generating comprehensive training data via AI has proven to be much faster, cheaper, and more secure than conventional data collection and processing methods: “While actual data may lack quality, volume, or variety, synthetic data can overcome these limitations and be generated in all the permutations and combinations of any given condition”.
Several researchers and companies in the AI industry have highlighted that “The real-world data used to make smarter models is running out.” This results partly from fierce AI competition and partly from data owners placing more restrictions on the use of their information. Artificial data is an alluring alternative, and some AI firms are building models with the sole purpose of generating it. Artificial data can also help in handling bias by rebalancing a dataset so that it better reflects under-represented groups.
Synthetically generated data can be added to complement an existing dataset or even replace the full set.
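As a rough illustration of complementing an existing dataset to rebalance it, the sketch below tops up a minority group by interpolating between pairs of its existing records. This is a deliberately simplified relative of interpolation-based oversampling methods, not a production technique; the function name `balance_by_interpolation` and the toy data are hypothetical.

```python
import random
from collections import Counter


def balance_by_interpolation(rows, labels, rng):
    """Top up every smaller group to the size of the largest one.

    Each added row is a random point on the segment between two existing
    rows of the same group, so it stays inside that group's observed range.
    """
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        members = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            new_rows.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
            new_labels.append(label)
    return new_rows, new_labels


rng = random.Random(1)

# Toy dataset, invented for illustration: 90 rows for group "A", 10 for "B".
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
rows += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]
labels = ["A"] * 90 + ["B"] * 10

balanced_rows, balanced_labels = balance_by_interpolation(rows, labels, rng)
print(Counter(labels), "->", Counter(balanced_labels))
```

Interpolation can only recombine what the minority group already contains, so it cannot correct a group whose few original records are themselves unrepresentative.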
Data protection best practice implies data encryption at rest, in use, and in transit. There are limitations on data sharing, which become even stricter when teams work remotely, ultimately hindering advancement in prototype development. Kalyan Veeramachaneni, principal research scientist at MIT, says, “Data today is treated like the computer lab of yesteryear: Access is restricted — and so are opportunities for college students, professional developers, and data scientists, to test new ideas. With far fewer necessary limitations on who can use it, synthetic data can provide these opportunities”.
Synthetic data is becoming essential to research, innovation, and new product development. Its importance is even more apparent in fields with high data protection requirements, such as healthcare, pharmaceuticals, life sciences, finance, insurance, and legal. Synthetic data can be shared more easily and made available in a central location accessible to all concerned teams. It can reduce the need for anonymization and lower the risk of exposing personal information in a data breach.
Artificial data does have its challenges, though. GenAI can hallucinate, which might lead to a misrepresentation of the original information. In addition, there is no robust guarantee that a generative model will not retain personal data from the source. The model might also learn and amplify anomalies in the original data if these are not addressed beforehand.
Experts have warned of the danger of “model collapse” when models are trained indiscriminately on synthetic data. They highlight that AI training works better with a combination of real and synthetic data, and that artificial data is better when generated from real data, a concept referred to as “hybrid data”. Finding the right balance helps prevent models from degrading or becoming irrelevant.
With the mixing of real and artificial data in training, EY underlines the need for data tagging to differentiate between the two: “This metadata layer is essential for maintaining transparency, traceability and trust in AI systems.”
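A minimal sketch of such a tagging layer follows, assuming a simple in-memory dataset. The `Record` class, its `source` field, and the `build_hybrid` helper are hypothetical names chosen for illustration; they are not taken from EY or from any specific tool. The sketch keeps every real row, adds enough synthetic rows to reach a chosen share, and labels each record so the two kinds can always be told apart.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    features: tuple
    source: str  # provenance tag: "real" or "synthetic"


def build_hybrid(real_rows, synthetic_rows, synthetic_share, rng):
    """Keep all real rows and add synthetic ones up to the requested share."""
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1)")
    # Number of synthetic rows so that they make up synthetic_share of the mix.
    wanted = round(len(real_rows) * synthetic_share / (1 - synthetic_share))
    chosen = rng.sample(synthetic_rows, min(wanted, len(synthetic_rows)))
    mixed = [Record(tuple(r), "real") for r in real_rows]
    mixed += [Record(tuple(s), "synthetic") for s in chosen]
    rng.shuffle(mixed)
    return mixed


rng = random.Random(2)

# Placeholder rows, invented for illustration only.
real_rows = [(i, i * 2) for i in range(80)]
synthetic_rows = [(i + 0.5, i * 2 + 1.0) for i in range(100)]

hybrid = build_hybrid(real_rows, synthetic_rows, synthetic_share=0.2, rng=rng)

# The tag makes it possible to audit the mix or train on real rows only.
n_synthetic = sum(1 for rec in hybrid if rec.source == "synthetic")
real_only = [rec for rec in hybrid if rec.source == "real"]
print(f"{n_synthetic} synthetic of {len(hybrid)} records")
```

Because the tag travels with each record, the share of synthetic data can be checked or adjusted later without having to guess which rows were generated.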
What if artificial data is combined with other advanced methods, such as digital twins? This combination has the potential to revolutionize digital transformation, operational optimization, product development, and AI progression. Watch this space for more on digital twins in the next issue of Nash Insights!
Useful Links
- https://www.digitaldubai.ae/knowledge-hub/publications/synthetic-data
- https://aws.amazon.com/what-is/synthetic-data/
- https://mitsloan.mit.edu/ideas-made-to-matter/what-synthetic-data-and-how-can-it-help-you-competitively
- https://www.accenture.com/content/dam/accenture/final/accenture-com/document-3/Accenture-Is-Your-Data-Ready-To-Power-AI-Reinvention-Report.pdf
- https://www.pwc.com/gx/en/issues/technology/synthetic-data.html
- https://www.ey.com/en_us/insights/government-public-sector/synthetic-data-differentiation-and-ai-readiness
- https://www.ey.com/en_in/media/podcasts/tech-trends/2023/03/episode-3-synthetic-data-artificial-data-real-solutions
- https://www.businessinsider.com/ai-synthetic-data-industry-debate-over-fake-2024-8
