OpenAI wants to work with organizations to create new AI training data sets

It’s an open secret that the data sets used to train AI models are highly flawed.

Image corpora tend to be US- and Western-centric, in part because Western images dominated the internet when the data sets were compiled. And as recently highlighted by a study from the Allen Institute for AI, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases.

Models magnify these errors in harmful ways. Now, OpenAI says it wants to counter them by partnering with outside institutions to create new, hopefully improved data sets.

OpenAI today announced Data Partnerships, an effort to collaborate with third-party organizations to build public and private data sets for AI model training. In a blog post, OpenAI says the Data Partnerships are intended to “enable more organizations to help guide the future of AI” and “benefit from more useful models.”

“In order to ultimately make (AI) safe and useful for all of humanity, we’d like AI models to deeply understand all subjects, industries, cultures and languages, which requires as broad a set of training data as possible,” writes OpenAI. “Incorporating your content can make AI models more helpful to you by increasing their understanding of your domain.”

As part of its Data Partnerships program, OpenAI says it will collect “large-scale” data sets that “reflect human society” and are not easily accessible online today. While the company plans to work with a wide range of modalities, including images, audio and video, it’s especially looking for data that “expresses human intent” (e.g. long-form writing or conversations) in different languages, topics and formats.

OpenAI says it will work with organizations to digitize training data where necessary, using a combination of optical character recognition and automatic speech recognition tools, and will remove sensitive or personal information as needed.
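For a sense of what that digitization step might look like in practice, here is a minimal sketch of an OCR-plus-speech-recognition pipeline with a crude personal-information scrub. It assumes the pytesseract and open source openai-whisper packages and hypothetical file names; OpenAI has not said which tools it actually uses.

```python
# Illustrative sketch of a digitization pipeline: OCR for scanned pages,
# ASR for audio, and a rough pass to strip obvious personal information.
# The libraries and regexes below are assumptions, not OpenAI's tooling.
import re

import pytesseract            # OCR wrapper around Tesseract
import whisper                # open source openai-whisper package
from PIL import Image


def ocr_scan(image_path: str) -> str:
    """Extract text from a scanned page image."""
    return pytesseract.image_to_string(Image.open(image_path))


def transcribe_audio(audio_path: str, model_size: str = "base") -> str:
    """Transcribe speech to text with a local Whisper model."""
    model = whisper.load_model(model_size)
    return model.transcribe(audio_path)["text"]


# Very rough scrubbing of emails and phone numbers; a real pipeline would
# use a dedicated PII-detection step rather than a handful of regexes.
_PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),    # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),       # phone-like numbers
]


def scrub_pii(text: str) -> str:
    for pattern in _PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


if __name__ == "__main__":
    # Hypothetical input files, purely for illustration.
    page_text = scrub_pii(ocr_scan("scanned_page.png"))
    talk_text = scrub_pii(transcribe_audio("interview.wav"))
    print(page_text[:200], talk_text[:200], sep="\n---\n")
```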

To begin with, OpenAI is looking to create two types of data sets: an open source data set that can be made public for anyone to use in AI model training, and a set of private data sets for training proprietary AI models. The private sets are intended for organizations that want to keep their data confidential but want OpenAI’s models to have a better understanding of their domain, OpenAI says; to date, OpenAI has been working with the Icelandic Government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic, and with the Free Law Project to improve its models’ understanding of legal documents.

“In general, we are looking for partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” OpenAI writes.

So, can OpenAI do better than the many data set-building efforts that came before it? I’m not so sure – minimizing data set bias is a problem that has stumped many of the world’s experts. At the very least, I hope the company is transparent about the process – and about the challenges it will inevitably encounter in creating these data sets.

Despite the blog post’s somewhat altruistic wording, there also seems to be a clear commercial motivation here: to improve the performance of OpenAI’s models at the expense of others’ – and without paying the data’s owners, so to speak. I suppose that is within OpenAI’s prerogative. But it seems a little tone-deaf given the open letters and lawsuits from creators who say OpenAI trained many of its models on their work without permission or compensation.
