For organizations training AI models, access to sufficient volumes of high-quality data is quickly becoming a serious challenge. Privacy and regulatory compliance are among the biggest issues, with increasingly strict rules making it difficult to access the information needed to train robust AI models.
Even when data is available, quality is not always guaranteed. Real-world datasets often reflect existing inequalities or historic decisions that, left unaddressed, lead to flawed results in customer-facing applications. What’s more, in highly specialized industries, or where rare events are involved, the volume of usable data may be too small to yield meaningful insights.
Then, there’s the cost. Preparing real-world data for AI training is a labor-intensive process, often involving large-scale collection, tagging and validation. This can be time-consuming and prone to setbacks, particularly when teams are under pressure to deliver quick results. Put all this together, and it’s hardly surprising that some businesses are struggling to get their AI projects off the ground.
Diverse use cases
To bridge the gap, many are turning to artificially generated or ‘synthetic’ data as an alternative to real-world sources. This comes in various formats, ranging from structured tables and records to unstructured content such as text, images and videos. It’s even possible to create synthetic users or behaviors for more sophisticated training and testing scenarios.
Designed to reflect the properties of real data without including any personally identifiable information (PII), synthetic data can provide a flexible solution that overcomes many of the challenges associated with live datasets.
For regulated industries, synthetic data is already proving valuable. In healthcare, for example, recreating realistic datasets without referencing patient data avoids many of the legal and ethical issues typically associated with these use cases. In practical terms, this means hospitals and research institutions can use AI platforms that replicate the characteristics of medical records without including personal details.
Elsewhere, researchers can explore complex questions, such as predicting disease progression or optimizing treatment plans, by using synthetic datasets that behave like real patient populations. This means they can train AI models without putting privacy at risk, and because synthetic data maintains the core properties of the original, the output remains valid for modeling and analysis but with zero risk of re-identification.
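To make the idea concrete, here is a minimal Python sketch of how a synthetic table might be produced: fit simple per-column statistics to a toy, entirely invented "patient" dataset, then sample fresh rows from those statistics. The column names and the independent-columns assumption are illustrative only; production synthesizers also preserve correlations between fields.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a real patient table (no actual patient data).
rng = np.random.default_rng(42)
real = pd.DataFrame({
    "age": rng.normal(55, 12, 1000).clip(18, 95).round(),
    "systolic_bp": rng.normal(130, 15, 1000).round(),
    "diagnosis": rng.choice(["A", "B", "C"], 1000, p=[0.6, 0.3, 0.1]),
})

def synthesize(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Sample new rows that match each column's marginal distribution.

    Numeric columns are modeled as independent normals; categorical
    columns are resampled according to their observed frequencies.
    Real tools (copula- or GAN-based synthesizers, for example) also
    preserve relationships between columns, which this sketch ignores.
    """
    gen = np.random.default_rng(seed)
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = gen.normal(df[col].mean(), df[col].std(), n)
        else:
            freqs = df[col].value_counts(normalize=True)
            out[col] = gen.choice(freqs.index, n, p=freqs.values)
    return pd.DataFrame(out)

synthetic = synthesize(real, n=500)
print(synthetic.describe(include="all"))
```

The synthetic rows carry no link back to any individual record, which is what makes this class of technique attractive for privacy-sensitive settings.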
The right datasets at the right time
In other settings, real-world datasets often reflect the limitations or inequalities present in the systems they were drawn from – whether that’s underrepresentation of certain demographics or skewed outcomes caused by historic decision-making. Left uncorrected, these issues can carry through to the AI models being trained, resulting in flawed or unfair outputs.
Synthetic data offers a way to correct that imbalance. Because it’s generated artificially, datasets can be adjusted to better reflect a more diverse or representative sample, such as different age groups, ethnicities or behavioral patterns. It also allows organizations to create realistic simulations of scenarios that appear too rarely in real-world data to train against effectively, as the sketch below illustrates.
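As a simplified illustration of that rebalancing step, the Python sketch below oversamples underrepresented categories until every group matches the largest one. The `age_band` column is hypothetical, and duplicating existing rows is only a stand-in; a real pipeline would generate genuinely new synthetic records instead.

```python
import pandas as pd

def rebalance(df: pd.DataFrame, column: str, seed: int = 0) -> pd.DataFrame:
    """Oversample minority groups so every category in `column` appears
    as often as the largest one (sampling with replacement).

    Illustrative only: a production pipeline would synthesize new rows
    rather than duplicate existing ones, but the balancing logic is similar.
    """
    target = df[column].value_counts().max()
    parts = [
        group.sample(n=target, replace=True, random_state=seed)
        for _, group in df.groupby(column)
    ]
    return pd.concat(parts, ignore_index=True)

# Hypothetical example: an age-band column that skews heavily toward 25-44.
df = pd.DataFrame({
    "age_band": ["25-44"] * 80 + ["65+"] * 15 + ["18-24"] * 5,
    "outcome": [0, 1] * 50,
})
print(rebalance(df, "age_band")["age_band"].value_counts())
```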
Lack of data also manifests itself in other situations, such as those encountered by autonomous driving systems. In some countries, weather events such as hailstorms are rare, but when they do occur, they can present a real hazard for vehicles and their occupants.
Rather than wait for such conditions to happen naturally, AI developers can create synthetic simulations of low-visibility conditions and other unusual scenarios, which are then used to train vehicle systems to respond appropriately in live situations.
In a similar way, images of people or objects suddenly appearing in the path of the car can be computer-generated and tested from all angles to ensure all possibilities are addressed. Without this level of training, the model may fail to recognize a potentially hazardous situation and respond appropriately.
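As a rough illustration, the Python sketch below approximates a low-visibility condition at the image level by blending a camera frame with grey haze and adding noise. This is not how production driving simulators work (they typically rely on physics-based rendering or learned world models); it simply shows the kind of augmentation being described.

```python
import numpy as np

def add_fog(image: np.ndarray, density: float = 0.5, seed: int = 0) -> np.ndarray:
    """Crude low-visibility augmentation for an RGB image (H, W, 3), values 0-255.

    Blends the image toward a uniform grey 'haze' and adds mild sensor noise.
    Purely an image-level approximation of fog, not a physical simulation.
    """
    rng = np.random.default_rng(seed)
    img = image.astype(np.float32)
    haze = np.full_like(img, 200.0)          # light grey fog colour
    fogged = (1 - density) * img + density * haze
    fogged += rng.normal(0, 5, img.shape)    # sensor noise
    return np.clip(fogged, 0, 255).astype(np.uint8)

# Usage with a random stand-in frame (replace with a real camera image).
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
foggy = add_fog(frame, density=0.7)
```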
Cost and efficiency
Compared to the time, effort and budget needed to obtain and prepare large-scale real-world datasets, using synthetic data can offer a faster and more predictable alternative. In financial services, for example, using real customer transaction data typically requires extensive anonymization and compliance checks. In contrast, synthetic datasets that mimic transaction patterns without referencing any real customer data allow for quicker, lower-risk AI model development.
Out in the real world, financial institutions have used synthetic data to improve fraud detection model development without relying on sensitive customer transaction records. Because accessing real financial data typically requires costly anonymization, compliance checks and legal reviews, generating synthetic datasets that replicate real transaction patterns reduces the need for expensive data preparation, minimizes regulatory hurdles and makes AI projects more cost-effective.
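As a toy illustration of that approach, the Python sketch below fabricates a labelled transaction table with a small fraud minority drawn from deliberately different distributions. Every column and distribution here is invented for the example; real synthetic-data tools would fit these patterns to anonymized statistics from actual transactions rather than hard-coding them.

```python
import numpy as np
import pandas as pd

def synthetic_transactions(n: int = 10_000, fraud_rate: float = 0.01,
                           seed: int = 0) -> pd.DataFrame:
    """Generate a toy transaction table with a labelled fraud minority.

    Normal amounts follow one log-normal distribution; 'fraud' rows are
    drawn from a shifted distribution and overnight hours to mimic the
    kind of patterns a detection model is trained to spot.
    """
    rng = np.random.default_rng(seed)
    is_fraud = rng.random(n) < fraud_rate
    amount = np.where(is_fraud,
                      rng.lognormal(6.0, 1.0, n),   # larger, more erratic amounts
                      rng.lognormal(3.5, 0.8, n))
    hour = np.where(is_fraud,
                    rng.choice([1, 2, 3, 4], n),    # overnight activity
                    rng.integers(7, 23, n))
    return pd.DataFrame({
        "amount": amount.round(2),
        "hour": hour,
        "merchant_id": rng.integers(1, 500, n),
        "is_fraud": is_fraud.astype(int),
    })

df = synthetic_transactions()
print(len(df), "rows,", df["is_fraud"].mean(), "fraud rate")
```

A table like this can be handed to a model-development team on day one, with no anonymization or legal review needed, which is where the cost and speed advantages come from.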
Looking ahead, this kind of work represents the tip of the iceberg, and we can expect to see many more organizations turn to synthetic data to power their AI projects. Indeed, if predictions from Gartner are accurate, by 2030, “synthetic data will completely overshadow real data in AI models.”