I’ve had the pleasure of blogging for about a year and a half now, and this marks my 25th post since January 2023—so a little retrospective might be in order. While I started out covering quantum computing and post-quantum cryptography, the rapid rise of generative AI and ChatGPT led me to discuss artificial intelligence as well—and a pattern soon began to emerge. Every time I discovered and explained a problem with AI, and offered a possible solution, it wasn’t long before another problem cropped up. It started with AI’s carbon footprint, and moved on to inaccuracy and bias, followed by copyright and plagiarism. Then I wrote about hallucinations, and just when I thought I was done, along came the issues of hypnotism and gaslighting.
No sooner had my virtual ink dried on those topics than I came across an article in the UK online newspaper The Independent, which in turn quoted a scientific paper published in the journal Nature at the end of July, describing the ominous-sounding problem of AI model collapse. I have to ask myself: when will it end? Nonetheless, let’s dive in for another round.
Do you remember the scare of BSE—Bovine Spongiform Encephalopathy—popularly known as mad cow disease? It struck primarily in the United Kingdom but also to a degree in other parts of Europe and North America in the 1980s and 1990s—and when BSE was transmitted to humans through eating tainted beef, it could cause the often fatal variant Creutzfeldt-Jakob disease (vCJD). In the UK alone, 178 people died as a result, and some four million head of cattle had to be slaughtered in order to bring the disease under control.
BSE is an infectious disease caused by a misfolded protein known as a prion, and the infection spread rapidly through herds of cattle. How did it spread? It was passed on through the practice of feeding young calves meat-and-bone meal made from the intestinal and nervous-system remains of other, possibly infected cattle, or from sheep possibly infected with the related disease scrapie. Now if the notion of involuntary bovine cannibalism makes you feel a bit queasy, you’re not alone. The practice has since been strictly controlled, although not completely eradicated, but it might still give you pause before you bite into your next steak or hamburger.
Madness in the method
What is the connection between generative AI and mad cow disease? The answer, as always, is in the data.
During my tenure at a large consulting firm, I attended an online forum where some of the partners discussed new and disruptive technologies—and naturally the conversation was dominated by AI. One of them spoke up, saying that “Generative AI is the synthetic creation of data.” She repeated the line for emphasis, and everyone nodded sagely as if in full agreement with this apparently profound statement. Now, if you’ve read any of my previous posts, you know that I would never use the words ‘AI’ and ‘creation’ in the same sentence, and adding ‘synthetic’ seemed superfluous to me. But it turns out I was missing something, and the executive was partially right. Let’s drop the word ‘creation’—because there is such a thing as synthetic data. It’s actually very useful, and AI can even be used to generate it, although, as we’ll see, we must be careful of the consequences.
Synthetic data is broadly defined as any kind of data generated by an algorithm—and that can include even the output of flight simulators or music synthesizers. More commonly, though, since the early 1990s it has been used to supplement natural data in statistical and scientific work. Note that I did say supplement, not replace. There must be a sufficient amount of natural data first, in order to detect patterns and establish a baseline. Then, synthetic data can be generated to, for example, fill in gaps in the natural data or protect the confidentiality of human subjects in a statistical analysis or clinical trial.
One of the advantages of synthetic data is that it can be scaled up quickly. Once the baseline and parameters are set, the data can be replicated with variations that might not occur in natural data. This is very useful for training machine learning (ML) systems. It has been used to generate everything from a broader range of scenarios for fraud detection algorithms, to a wider variety of traffic situations for training self-driving cars.
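To make that concrete, here’s a minimal sketch in Python (using NumPy) of the fit-then-generate pattern: a small, purely hypothetical set of ‘natural’ transaction amounts establishes the baseline, and a much larger synthetic sample is then drawn from the fitted distribution with a slightly widened spread. The numbers and the lognormal shape are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "natural" sample: e.g. 500 observed transaction amounts.
# (Generated here for illustration; in practice this would be real data.)
natural = rng.lognormal(mean=3.5, sigma=0.8, size=500)

# Establish the baseline: estimate distribution parameters from the natural data.
log_amounts = np.log(natural)
mu, sigma = log_amounts.mean(), log_amounts.std()

# Scale up: generate 100x more synthetic records that follow the same pattern,
# with a controlled variation (a slightly widened spread to cover more edge cases).
synthetic = rng.lognormal(mean=mu, sigma=sigma * 1.1, size=50_000)

print(f"natural:   n={natural.size:>6}, median amount = {np.median(natural):8.2f}")
print(f"synthetic: n={synthetic.size:>6}, median amount = {np.median(synthetic):8.2f}")
```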
Naturally, the next step for synthetic data beyond ML is to train generative AI systems. Among the biggest problems for generative AI are the quantity and quality of data needed to train it, even for very specific limited use cases. If you’re going to build a Large Language Model (LLM) for a particular business purpose, and you need to train it with synthetic data, you will also need to be sure that the data is relevant, accurate and complete.
Many sufficiently large data sets follow the pattern of a statistical bell curve. The most common values cluster toward the middle, with less-common values and outliers spread out towards both ends of the curve, known as the tails. Population surveys, census results, medical data—there are all kinds of cases where this pattern emerges. The problem is that synthetic data tends to trend toward the middle, concentrating on the most common values found in the original natural data. As the authors of the Nature article point out, “tails of the original content distribution disappear.” This can have serious consequences for your AI system—consider what would happen if rare but life-threatening diseases were eliminated from a medical study, or if racial and sexual minorities were underrepresented in a survey or census.
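A toy example makes the danger plain. The sketch below is an illustration, not a real study: the hypothetical ‘natural’ data contains a rare subgroup of elevated readings, and a naive synthetic generator, fitted as a single bell curve to the pooled data, reproduces the middle faithfully while the rare subgroup all but disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "natural" medical data: 99% of patients have typical readings,
# while 1% belong to a rare but clinically important subgroup out in the tail.
common = rng.normal(loc=100, scale=10, size=99_000)
rare = rng.normal(loc=160, scale=5, size=1_000)
natural = np.concatenate([common, rare])

# A naive synthetic generator: fit a single bell curve to the pooled data...
mu, sigma = natural.mean(), natural.std()
synthetic = rng.normal(loc=mu, scale=sigma, size=natural.size)

# ...and the rare subgroup all but vanishes from the synthetic sample.
threshold = 145
print(f"rare cases in natural data:   {(natural >= threshold).mean():.4%}")
print(f"rare cases in synthetic data: {(synthetic >= threshold).mean():.4%}")
```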
Down the rabbit hole
That’s just one generation of synthetic data. Remember that training AI is often an iterative business; reinforcement learning, for example, repeats the training process over multiple rounds. If output from one run is used to start the next iteration, then the process of chopping off the tails from the data distribution gets repeated, each time on an already less diverse data set. Ongoing curation of your synthetic data becomes critically important.
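Here’s a deliberately crude simulation of that feedback loop, with a simple histogram standing in for the generative model. It is only a sketch of the mechanism, not the experiment from the Nature paper: each generation is trained solely on the previous generation’s output, and once an outer bin empties it can never be sampled again, so the tails erode away generation by generation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generation 0: "natural" data with a broad spread, including rare values in the tails.
data = rng.normal(loc=0.0, scale=1.0, size=500)
bins = np.linspace(-4, 4, 81)                 # a fixed grid, so generations stay comparable
centres = (bins[:-1] + bins[1:]) / 2

for generation in range(1, 21):
    # "Train" a crude generative model: the empirical histogram of the current data...
    counts, _ = np.histogram(data, bins=bins)
    probs = counts / counts.sum()
    # ...then train the next generation only on this model's output, never on the original data.
    data = rng.choice(centres, size=data.size, p=probs)
    survivors = int(np.count_nonzero(np.histogram(data, bins=bins)[0]))
    print(f"generation {generation:2d}: surviving bins = {survivors:2d}, "
          f"most extreme value = {np.abs(data).max():.2f}")
```

Run it and you should see the number of surviving bins and the most extreme values shrink steadily; nothing in the loop can ever put a lost tail back.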
Now let’s extrapolate this to large commercial generative AI systems like ChatGPT, Gemini and their many competitors. Their training data needs to be as broad as possible, amounting to as much as they can take from the internet itself. Ask yourself, what do most people use generative AI for? You guessed it, producing content—e-mails, essays, documents, even blog posts (but not this one!). And where is this content being published? Back to the internet, of course. So, in a recursive way, generative AI is consuming its own output as raw data, without necessarily any curation or quality control, to train future generations of itself. The problem is not big right now, because the amount of natural data on the internet is still much larger than the amount of synthetic data, but synthetic data is growing faster and may soon overtake it.
And there you have it—just as feeding bovine entrails to young calves turned out to be a bad idea both for the animals and their human consumers—carelessly feeding AI-generated data to AI systems is an equally bad idea both for the LLMs and their human users. The risk is the phenomenon known as model collapse, defined by Nature as “a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation. Being trained on polluted data, they then mis-perceive reality.” Not surprisingly, it’s becoming popularly known as MAD, or Model Autophagy[1] Disorder. Mad AI Disease, indeed.
The consequences of model collapse range from mild to severe. The authors of the Nature article found that in initial iterations, just a few errors were introduced into their test model, but after nine generations of repeatedly feeding its own data back in, their LLM was reduced to producing complete gibberish as output. Now, the article might be overly alarmist—their tests were run preserving no more than 10 per cent of the original natural data, and sometimes none of it. The semi-satirical British tech news site The Register notes that other academics argue against the inevitability of model collapse. Their point is that natural data won’t be replaced entirely, but will be mixed with synthetic data, and this will mitigate the negative effects.
A fine line between genius and insanity
Just as with cattle feed, the answer comes down to the quality and proportion of synthetic and natural data. First of all, not all synthetic data is bad, nor is all natural data good. Quality control and curation of both, as I’ve said, will be critical. Second, no matter how good it is, 100 per cent synthetic data will be problematic, while 100 per cent natural data will not be feasible. How much is enough? It will depend on the specific LLM, foundation model and use cases involved, so I don’t think there will be a single clear answer. But a healthy mix of natural data supplemented with synthetic data can help to improve generative AI while avoiding model collapse.
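To see what a ‘healthy mix’ buys you, here’s an extension of the histogram sketch from earlier: one pipeline retrains on pure synthetic output, while the other always blends the original natural data back in (an arbitrary 50/50 split, chosen purely for illustration). The blended pipeline never loses the original distribution’s support, whereas the pure synthetic pipeline typically sheds its tail bins for good.

```python
import numpy as np

rng = np.random.default_rng(2)

natural = rng.normal(loc=0.0, scale=1.0, size=500)    # the original natural data, kept on hand
bins = np.linspace(-4, 4, 81)
centres = (bins[:-1] + bins[1:]) / 2

def sample_from_model(training_data, size):
    """Fit a crude histogram 'model' to the training data and draw synthetic samples from it."""
    counts, _ = np.histogram(training_data, bins=bins)
    return rng.choice(centres, size=size, p=counts / counts.sum())

def support(data):
    """Count how many distinct bins of the original grid still contain any data."""
    return int(np.count_nonzero(np.histogram(data, bins=bins)[0]))

pure = natural.copy()     # pipeline A: each generation trains only on the previous generation's output
mixed = natural.copy()    # pipeline B: each generation trains on a 50/50 blend of natural and synthetic data
for generation in range(20):
    pure = sample_from_model(pure, natural.size)
    mixed = sample_from_model(np.concatenate([natural, mixed]), natural.size)

print("bins available to the next model after 20 generations:")
print(f"  pure synthetic pipeline:   {support(pure)}")
print(f"  natural + synthetic blend: {support(np.concatenate([natural, mixed]))}")
```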
To that end, IBM has stepped up with a methodology it calls Large-scale Alignment for Chatbots, or LAB for short. IBM has paired this with a new open-source project called InstructLab, a toolkit for generating synthetic data fit for specific use cases as defined by your business requirements. InstructLab includes something called the Taxonomy Explorer, which allows developers to build a visual model of data sets and their relationships. This taxonomy is then used to safely generate synthetic data, which can in turn be used for training generative AI. IBM notes that its tool allows the AI foundation model to assimilate the new synthetic data without overwriting what it has previously learned. No doubt IBM will face competition—Amazon’s SageMaker tool for fine-tuning AI, for example, also works with synthetic data—but this is a healthy sign that the industry is on its way to understanding the problem.
We may not be able to cure mad AI disease, but with good software tools and careful curation of our data we should be able to manage the condition—and we must do so if we want AI systems to become our trusted companions in our professional and personal lives.
[1] A compound word with Greek roots meaning self-consumption (I admit I had to look it up)