The Dawn of Synthetic Data: Rethinking the Storage Paradigm in AI Training

The training of Artificial Intelligence (AI) models has traditionally revolved around acquiring and storing colossal volumes of real-world data. This conventional approach, while effective, poses several challenges, including storage costs, data privacy concerns, and the ecological footprint of data centers. The emergence of synthetic training data, however, offers a compelling alternative. Models such as Wayve’s GAIA-1 and platforms like the Universal Simulator (UniSim) are at the forefront of this transition, demonstrating the ability to generate synthetic data for training AI models and potentially reducing the reliance on stored real-world data.

GAIA-1: A Forerunner in Synthetic Data Generation

Unveiled in June 2023, GAIA-1 is a generative world model aimed at enhancing the resolution of generated videos and improving world-model quality through larger-scale training. This 9-billion-parameter model is designed to offer a structured understanding of environmental dynamics, which is crucial for making informed driving decisions. Its ability to accurately predict future events is seen as a cornerstone for enhancing safety and efficiency on the roads, allowing autonomous vehicles to better anticipate and plan their actions in real-world scenarios.

UniSim: Bridging the Synthetic Data Gap

On another front, the Universal Simulator (UniSim) project explores the potential of synthetic data in simulating realistic interactions between humans, robots, and other interactive agents. By emulating human and agent interactions with the world, UniSim provides a glimpse into a future where AI systems can be trained on generated data, reducing the need to store extensive real-world datasets. The simulator has shown promising results in training both high-level vision-language planners and low-level reinforcement learning policies, exhibiting significant transfer from training in simulation to real-world scenarios.

The Speed of Data Generation Versus Ingestion

The remarkable pace at which synthetic data can be generated presents its own challenge: a disparity between the rate of data generation and the rate of data ingestion for training. Rapid generation of synthetic data may outpace the ability of AI models to consume it in real time. This scenario points to the need for a caching mechanism that temporarily stores generated data, ensuring a continuous and efficient training pipeline. While this does not equate to the long-term storage of real-world data, it suggests an approach in which temporary storage of generated data bridges the gap between generation and ingestion.

The Transition to Caching Generated Data

This rapid generation of synthetic data, although a boon, necessitates a strategy for handling the lag in ingestion rates. Caching emerges as a viable solution, acting as a buffer between data generation and data ingestion and ensuring a seamless training process. This approach does not entirely eliminate the need for data storage, but it significantly reduces the volume of data that must be stored and managed over time.
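The caching idea can be made concrete with a bounded producer-consumer buffer: the generator blocks when the cache is full, so temporary storage never grows past a fixed cap. This is a minimal sketch of my own, not code from GAIA-1 or UniSim; all names here are illustrative.

```python
import queue
import threading

def generator(cache, n_batches):
    # Produces synthetic batches, typically faster than the trainer consumes them.
    for i in range(n_batches):
        cache.put(f"batch-{i}")  # blocks while the cache is full
    cache.put(None)  # sentinel: generation finished

def trainer(cache, consumed):
    # Drains the cache at its own pace; stands in for a training loop.
    while True:
        batch = cache.get()
        if batch is None:
            break
        consumed.append(batch)  # stand-in for one training step

cache = queue.Queue(maxsize=8)  # bounded: caps temporary storage at 8 batches
consumed = []
t = threading.Thread(target=trainer, args=(cache, consumed))
t.start()
generator(cache, 32)
t.join()
```

The key design choice is the `maxsize` bound: it converts "store everything we generate" into "store at most a small working set," which is exactly the shift from long-term storage to transient caching described above.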


The advancements in synthetic data generation demonstrated by GAIA-1 and UniSim are redefining the landscape of AI training. The age-old practice of storing vast amounts of real-world data for training purposes might soon be eclipsed by more efficient and scalable synthetic data generation methodologies. The unfolding narrative in this domain is not only promising but indicative of a future where the training of AI systems is constrained only by the bounds of creativity, not storage capacity. The narrative around synthetic data resonates with the axiom that architecture matters. The pivot towards synthetic data underscores a significant architectural shift in AI training, which is bound to have a ripple effect across the broader spectrum of AI and machine learning domains.


What’s new, and side gigs.

I haven’t updated this blog in forever and I really think I should be using it. Though I think now it is going to be more focused on dumping my own ideas to a screen, rather than writing for anyone else to consume. With that being said, I won’t work hard to edit what I am writing; I will just write it. There might be errors, and there might be undocumented details, but at least there will be something!

I have been working on some side projects while I work my day job and spend time with my family (so many kids)! I realized I don’t really write down any of the things I am doing, so I think it’s time to start sharing.

While these articles will be available for everyone to read they really are for me so I can recreate things I have done, or be slightly consistent. Maybe they will be useful to someone else as well.

Was the passing of Steve that tragic?

Anyone who stumbles on this website might look at my last post and think: wow, he stopped writing when Steve Jobs passed away, he must have taken that seriously. Nope, sorry everyone, I just got busy with life. But I am back again!

Is this thing on?

Well, it has been about a year since I posted on Automatons Adrift. There have been some significant changes in my life. I completed my master’s degree with a clear pass, a very rare feat I am told; typically there are at least minor revisions. I left my job at UNBC and moved to the University of Alberta to be the System Administrator for the Faculty of Science. This of course means I moved to the wonderful city of Edmonton.

The most surprising thing is that all these changes happened in the last couple of months. Now that the research for my thesis is complete I have a lot less time pressure and I can devote more energy to Automatons Adrift. I have updated the website along with the hosting service for it. I have added some new content, including pages on the SDNEAT and NEAT algorithms which were at the center of my research. I have also started a page for the Neuroevolutionary Solver. This page will outline how to use the system and modify it to perform further experiments.

There is a lot of material to be posted as we move forward. I hope you enjoy the new Automatons Adrift!

If there is an algorithm for intelligence…

Then we could run it in about 50 atoms’ worth of space. That is assuming we could build the smallest state machine possible in that amount of space and then actually wire it up to some tiny interface. The smallest possible universal state machine was proven to exist a couple of days ago by Alex Smith of Birmingham, UK.

This really does have some significant impact. While I don’t think we would run the algorithm for intelligence on this particular state machine, we could. We could in fact run any program at all on this state machine and have it input and output any possible string of information. You can think of this as the smallest possible independent microprocessor. This could be a significant step in the advancement of massively parallel sensor networks.
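To make the "run any program on a tiny state machine" idea concrete, here is a minimal Turing-machine interpreter. The rule table below is a toy bit-inverter of my own invention, purely for illustration; it is not Smith's actual 2-state, 3-symbol machine, whose rule table is far more intricate to program.

```python
def run(rules, tape, state="run", head=0, blank="_"):
    # Repeatedly look up (state, symbol), write, move, and change state
    # until the machine reaches the halting state.
    tape = list(tape)
    while state != "halt":
        symbol = tape[head] if head < len(tape) else blank
        write, move, state = rules[(state, symbol)]
        if head == len(tape):
            tape.append(blank)  # grow the tape on demand
        tape[head] = write
        head += 1 if move == "R" else -1
    return "".join(tape).rstrip(blank)

# Toy machine: invert every bit, halt at the first blank cell.
invert = {
    ("run", "0"): ("1", "R", "run"),
    ("run", "1"): ("0", "R", "run"),
    ("run", "_"): ("_", "R", "halt"),
}

print(run(invert, "10110"))  # → 01001
```

Any computation can, in principle, be encoded as a rule table and tape like this; the remarkable part of the proof is how few states and symbols suffice for universality.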

Think of this: you construct a piece of e-paper made up of these tiny little state machines. You connect them in all eight directions to their neighbours using carbon nanotube circuitry, and connect the top layer to a set of output machines, something like the pixels in an LCD. Now bind a few of these sheets together into a magazine, put some more complex circuitry into the spine along with a power source (probably an external layer of solar energy molecules), and you have yourself an extremely powerful parallel computer that looks something like a book but is capable of far more: completely dynamic content, all controlled through a massively distributed network of basic microprocessor state machines.

I can’t take credit for this idea; it comes from “The Diamond Age”, a very good novel about nanotechnology. What is fantastic in that book is a step closer to reality thanks to this proof and several recent advances in nanotechnology.

Another common theme from the same book is how ubiquitous these massively parallel systems could be. They could consist of countless billions of tiny sensor nodes distributed through the air. You could breathe them in without destroying significant portions of the network, because they are so small (about the size of dust), yet the processing power in each is universal and the power of the entire system is extraordinary. Each node could perform complex sensing tasks and transmit its information through the network back to home base. These truly would be automatons adrift!

Of course the algorithms to do that efficiently don’t really exist yet, but they are being worked on.

I know it seems pretty wild, but that is just one example. When you distribute the processing of a program across millions of tiny universal state machines, you can dramatically reduce processing time. This would require a completely new direction in programming, but it is entirely possible. I originally found this on Slashdot. Thanks, guys!

Whew… Blogging Breakdown

What happens when you work a full-time job, get lots of overtime, and have a major deadline in your thesis work that requires you to code your butt off? You get a breakdown in the number of cool blog posts you get to put up.

I am working out some loose ends in my integration of NEAT into PicoEvo and Simbad. I am at the point where I have to integrate the genetic operators and the evolution epoch into the algorithm, and putting it together the right way is tricky. I suspect I will wind up just slamming it together so it works; then I can pull it apart and put it back together the right way after I meet my deadline!

Neurotic Agents

I got this from Slashdot a couple days ago and I wanted to share it.

When you are playing a real-time strategy game (or any video game, for that matter), the artificial intelligence you are playing against is usually a form of rule-based system. The AI is given large amounts of game information, has a complex set of rules it follows, and really would kick your butt every time if the game makers didn’t dumb it down. Some recent research into emotional AI in game playing shows that a neurotic personality does best at playing a real-time strategy game; it even beats the AI that is tuned to be difficult for humans.

I wonder if an evolutionary agent could learn the emotions of this AI? Perhaps it could evolve an efficient neural network structure for neurotic game play. Could we separate the neurosis from the game rules in the NN? Some interesting questions. The article is from New Scientist (which is an awesome magazine).

Turkey Weekend

I hope everyone had a nice Thanksgiving weekend. I enjoyed two dinners, one on Sunday and one on Monday. I have many leftovers.

I also spent a lot of time working on my implementation of NEAT in Picoevo. It is a very intricate process integrating NEAT into an evolutionary system like Picoevo, which wasn’t really designed to handle an algorithm like NEAT, though it is quite capable of it. I think there is more than one way to implement it, and I am following the approach that I think works right now. In the future I may revise the design to bring it more in line with the design of Picoevo.

One of the more interesting tasks of this project is deciding where each portion of NEAT belongs in the Picoevo environment. Picoevo uses inheritance heavily to stay flexible, and it has been designed to work with almost any type of Genome; one just has to decide how to extend each element to support what is required. So for implementing NEAT, I have each gene in the genome implemented as an Element, each genome as an Individual composed of multiple types of Elements, and each Population holding multiple NEAT genomes. Crossover, speciation, and innovation are controlled at the Population level; mutating genomes by adding links and nodes is controlled at the Individual level; and mutation of weights is controlled at the Element level.
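As a rough sketch of that division of responsibilities: Picoevo itself is a Java framework, so this Python analogue and all of its class names are purely illustrative, not the actual code.

```python
import random

class LinkGene:
    # Element level: one connection gene; owns its own weight mutation.
    def __init__(self, innovation, weight):
        self.innovation = innovation
        self.weight = weight

    def mutate_weight(self, sigma=0.1):
        self.weight += random.gauss(0.0, sigma)

class NEATGenome:
    # Individual level: a collection of gene Elements; owns structural mutation.
    def __init__(self, genes):
        self.genes = genes

    def add_link(self, innovation):
        self.genes.append(LinkGene(innovation, random.uniform(-1.0, 1.0)))

class Population:
    # Population level: owns crossover, speciation, and the global
    # innovation counter shared by all genomes.
    def __init__(self, genomes):
        self.genomes = genomes
        self.next_innovation = max(
            (g.innovation for gn in genomes for g in gn.genes), default=0) + 1

    def new_innovation(self):
        n = self.next_innovation
        self.next_innovation += 1
        return n
```

The point of the layering is that the innovation counter must be global (Population), while structural and weight mutations are local to a genome or a single gene, which mirrors the split described above.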

Picoevo was designed well and is indeed flexible but it wasn’t really designed with the idea that each Individual or genome might have more than one type of gene or Element. So I had to be creative when it comes to extending the Individual class in Picoevo. I think the solution will work out fine. When I post the Neuroevolutionary Solver for public consumption I will talk about some of these design ideas more.