AI companies such as Microsoft, OpenAI, and Cohere seem to be doing everything in their power to find synthetic data to train their AI products. Given the limited availability of “organic,” human-generated data on the World Wide Web, these companies aim to use AI-generated (synthetic) data in a kind of closed loop: training on data that was itself produced by generative models.
“It would be great if we could get all the data we need from the web,” Aidan Gomez, chief executive of the $2 billion LLM startup Cohere, told the Financial Times. “In practice, the web is so noisy and messy that it can’t really represent the data we need. The web doesn’t do everything we need.”
There is also the issue of cost, as human-generated data is “very expensive,” Gomez said. This has already led to the creation of several “synthetic data” companies, such as Gretel.ai, which specializes in creating synthetic datasets to be sold for training purposes.
Data availability and provenance are among the biggest limiting factors of the current AI era, and training an AI network on synthetic data that has already been “chewed” and regenerated by AI itself carries real risks. One problem is the compounding of imperfections in the base training data: if the original, non-synthetic dataset already suffers from bias, that bias is carried into each subsequent training iteration, digested and amplified along the way.
But another, perhaps more pervasive issue stems from a recently discovered limitation: after roughly five training iterations on AI-generated synthetic data, output quality degrades significantly. Whether this “MAD” (Model Autophagy Disorder) state represents a soft or a hard constraint on AI training seems central to Microsoft’s and OpenAI’s intent to recursively train AI networks, and much research remains to be done here. Microsoft Research, for example, has published papers on a network trained on short, recursively generated stories (stories generated by one model and used to train another) and on a coding AI trained on AI-generated documentation for Python programming. Validating the risk of data degradation in these and in larger models (such as Meta’s recently open-sourced, 70-billion-parameter Llama 2) will be key to predicting how far (and how fast) AI can evolve.
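This fit-then-sample degradation can be illustrated with a deliberately toy simulation (a hypothetical sketch, not taken from any of the papers above): fit a Gaussian to a dataset, sample a fresh “synthetic” dataset from the fit, and repeat. Over many generations the estimated spread drifts and, on average, shrinks, a crude analogue of a model slowly forgetting the tails of its original data.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_generation(data):
    # "Train": fit a Gaussian (maximum likelihood) to the current dataset.
    mu, sigma = data.mean(), data.std()
    # "Generate": sample a new synthetic dataset from the fitted model.
    return rng.normal(mu, sigma, size=len(data))

data = rng.normal(0.0, 1.0, size=200)   # generation 0: "organic" data
spreads = [data.std()]
for _ in range(25):                     # 25 rounds of training on synthetic data
    data = one_generation(data)
    spreads.append(data.std())

print(f"std at generation 0:  {spreads[0]:.3f}")
print(f"std at generation 25: {spreads[-1]:.3f}")
```

Because the fitted spread is slightly biased low at every step and errors accumulate, the estimated standard deviation tends toward collapse over enough generations; no single run is guaranteed to shrink monotonically.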
As AI companies demand ever more data, it makes sense to try to generate high-quality datasets recursively. There are multiple ways to do this, but the one perhaps most likely to succeed is simply to have two AI networks interact, one acting as a tutor and the other as a student. Even so, human intervention will (and always will) be required to cull low-quality data points and suppress “hallucinations” (confident but untrue AI statements).
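To make the tutor–student pipeline concrete, here is a minimal Python sketch in which every name is hypothetical: a “tutor” stands in for the generating model and emits question–answer pairs, occasionally hallucinating a wrong answer, and a human-review step culls the untrue pairs before a “student” model would ever train on them.

```python
import random

random.seed(0)

def tutor_generate(n):
    """Stand-in for a tutor model: emit Q&A pairs, 'hallucinating'
    a wrong answer roughly 20% of the time."""
    pairs = []
    for _ in range(n):
        a, b = random.randint(1, 99), random.randint(1, 99)
        answer = a + b
        if random.random() < 0.2:          # simulated hallucination
            answer += random.randint(1, 10)
        pairs.append((f"What is {a} + {b}?", answer, a + b))
    return pairs

def human_filter(pairs):
    """Stand-in for human review: keep only pairs whose answer is true."""
    return [(q, ans) for q, ans, truth in pairs if ans == truth]

generated = tutor_generate(1000)
curated = human_filter(generated)
print(f"kept {len(curated)} of {len(generated)} generated pairs")
# The curated set is what the 'student' network would then be trained on.
```

The point of the sketch is the shape of the loop, not the arithmetic: generation is cheap, but the curation step is exactly where the human labor (and cost) that Gomez describes re-enters the pipeline.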
There are several obstacles on the way to the technocratic dream of self-evolving, self-learning AI: a model capable of internal deliberation and internal discovery, one that produces new knowledge that is not merely combinatorial (which is, after all, one of the hallmarks of creative artifacts).
Of course, it should be borne in mind that not all dreams are pleasant. We are already struggling to cope with nightmares of human making; how much machine-made “nightmares” will add to them remains to be seen.