The tech giant Nvidia has recently acquired synthetic data startup Gretel, marking a strategic step forward in addressing one of the major challenges in artificial intelligence (AI) training: data scarcity.
What Is Gretel and Why Does Synthetic Data Matter?
Founded in 2019 by Alex Watson, John Myers, and Ali Golshan, Gretel provides a platform and APIs for generating synthetic data. Synthetic data consists of computer-generated datasets that mimic the statistical properties of real-world data, which helps when real data is scarce, privacy-sensitive, or hard to collect at scale.
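To make the idea concrete, here is a minimal sketch of what "mimicking real-world data" can mean in its simplest form: fit a distribution to real observations, then sample new values from it. This is a toy illustration only, not Gretel's method or API. The `real_ages` values and the helper names `fit_gaussian` and `generate_synthetic` are hypothetical and made up for this example. Production systems use far richer generative models and add privacy safeguards.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and standard deviation of the real observations."""
    return statistics.mean(values), statistics.stdev(values)

def generate_synthetic(values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real values."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical "real" measurements, invented for illustration.
real_ages = [34, 45, 29, 52, 41, 38, 47, 33, 50, 36]

# 1,000 synthetic values that follow the same overall distribution
# without copying any individual real record.
synthetic_ages = generate_synthetic(real_ages, n=1000)

print(f"real mean:      {statistics.mean(real_ages):.1f}")
print(f"synthetic mean: {statistics.mean(synthetic_ages):.1f}")
```

The synthetic sample can be made as large as needed, which is the scalability argument in miniature. It can only be as representative as the data the model was fitted on.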
Before its acquisition by Nvidia, Gretel raised over $67 million in venture capital and achieved a valuation of around $320 million. The company has approximately 80 employees, and its technology will be integrated into Nvidia's expanding suite of cloud-based generative AI services for developers.
Nvidia's Strategic Move Towards Synthetic Data
Nvidia's CEO, Jensen Huang, has repeatedly highlighted three key issues in scaling AI efficiently:
Data scarcity: Where and how to source extensive datasets.
Model architecture: Optimizing structure for performance.
Scaling laws: Understanding how AI models scale with data and computational resources.
The acquisition of Gretel is a clear response to the first issue. Nvidia has already ventured into synthetic data with tools like Omniverse Replicator, launched in 2022, which lets developers generate custom, accurate 3D synthetic data for neural network training. Nvidia also introduced the Nemotron-4 340B models, designed to create synthetic training data for diverse sectors, including healthcare, finance, manufacturing, and retail.

Opportunities Presented by Synthetic Data
Synthetic data brings numerous advantages:
Scalability: Offers developers nearly unlimited access to training data, accelerating model development.
Privacy Protection: Crucial for sensitive industries such as healthcare, finance, and government services.
Bias Reduction: Enables the creation of diverse, balanced datasets to minimize inherent biases found in real-world data.
For instance, healthcare institutions can leverage synthetic data to train models for diagnosing rare diseases without compromising patient confidentiality.
Risks and Limitations
However, the use of synthetic data isn't without concerns. According to recent research published in Nature (July 2024), AI models risk “collapsing” if continually trained on synthetic data generated by other models, resulting in performance degradation.
Alexandr Wang, CEO of Scale AI, emphasizes the necessity of a balanced approach, combining human-generated and synthetic data. Similarly, Gretel's founders acknowledge that exclusively synthetic training scenarios aren't reflective of real-world AI development practices.
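The collapse effect and the case for mixing in real data can be illustrated with a toy simulation. The sketch below is not the experiment from the Nature paper. It makes a strong simplifying assumption: the "model" is just a Gaussian that is refitted each generation on a small sample, where part of the sample comes from the previous generation's model and the rest from the true distribution. The function name `simulate` and its parameters are hypothetical.

```python
import random
import statistics

def simulate(generations, n_samples, real_fraction, seed=0):
    """Refit a Gaussian each generation on data sampled from the previous fit.

    real_fraction is the share of each generation's training set drawn from
    the true distribution N(0, 1); the rest comes from the previous model.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the true distribution
    n_real = round(n_samples * real_fraction)
    history = []
    for _ in range(generations):
        # Fresh "human-generated" data from the true distribution.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # Synthetic data produced by the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        data = real + synthetic
        # "Train" the next model by refitting mean and standard deviation.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history

pure = simulate(generations=500, n_samples=20, real_fraction=0.0)
hybrid = simulate(generations=500, n_samples=20, real_fraction=0.5)

print(f"synthetic only: final std = {pure[-1]:.2e}")
print(f"50% real data:  final std = {hybrid[-1]:.2e}")
```

In the synthetic-only run, each refit loses a little of the distribution's spread, and the losses compound until the model produces almost identical values. Mixing real samples into every generation keeps re-anchoring the fit to the true distribution, which is the intuition behind the balanced approach described above.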

Industry Perspectives and Big Tech Involvement
Despite these risks, synthetic data continues to attract attention from big tech companies. Meta used synthetic datasets in training its Llama 3 AI model, while Amazon Bedrock allows developers to create synthetic data using Anthropic's Claude models. Microsoft's Phi-3 model employs synthetic data cautiously, noting potential accuracy reduction and bias amplification. Even Google DeepMind acknowledges the complexities of maintaining privacy and accuracy in synthetic datasets.
The emerging industry consensus favors a hybrid approach that combines synthetic and human-generated datasets to maintain data integrity and model effectiveness.
Nvidia's Vision for the Future
With the acquisition of Gretel, Nvidia strengthens its position as a leader in AI innovation. Synthetic data will likely become central in overcoming current and future challenges of data scarcity, privacy regulations, and scalability, driving the next wave of technological advancement in AI.
Integrating Gretel's data generation platform into Nvidia's ecosystem signals the company's commitment to staying ahead in the rapidly evolving AI landscape.