It looks like OpenAI has been training Sora on game content – and legal experts say that could be a problem

OpenAI has never revealed exactly what data it used to train Sora, its video-generating AI. But from the looks of it, at least some of the data could come from Twitch streams and game walkthroughs.

Sora launched on Monday and I’ve been playing around with it for a while (as capacity issues allow). From a text prompt or image, Sora can generate videos up to 20 seconds long in various aspect ratios and resolutions.

When OpenAI first unveiled Sora in February, it indicated that it had trained the model on Minecraft videos. So I wondered: what other video game playthroughs might be lurking in the training set?

Apparently quite a few.

Sora can create a video of what is essentially a Super Mario Bros. clone (bug and all):

Photo credit: OpenAI

It can create gameplay footage of a first-person shooter that looks inspired by Call of Duty and Counter-Strike:

Photo credit: OpenAI

And it can spit out a clip showing an arcade fighter in the style of a ’90s Teenage Mutant Ninja Turtles game:

Photo credit: OpenAI

Sora also seems to understand what a Twitch stream should look like – suggesting it has seen a few. Check out the screenshot below, which gets the broad strokes right:

A screenshot of a video made with Sora. Photo credit: OpenAI

Another notable thing about the screenshot: It shows the likeness of popular Twitch streamer Raúl Álvarez Genes, who goes by the name Auronplay – right down to the tattoo on Genes’ left forearm.

Auronplay isn’t the only Twitch streamer that Sora seems to “know.” The model also produced a video of a character that resembles (with some artistic liberties) Imane Anys, better known as Pokimane.

Photo credit: OpenAI

Admittedly, I had to get a little creative with some of the prompts (e.g. “Italian plumbing game”). OpenAI has implemented filtering to prevent Sora from generating clips depicting trademarked characters. Type “Mortal Kombat 1 gameplay,” for example, and you won’t get anything resembling that title.

But my testing suggests that game content may have found its way into Sora’s training data.

OpenAI has been cautious about revealing where its training data comes from. In an interview with The Wall Street Journal in March, then-OpenAI CTO Mira Murati did not directly deny that Sora was trained on YouTube, Instagram and Facebook content. And in the technical specifications for Sora, OpenAI acknowledged that it used “publicly available” data to develop the model, along with licensed data from stock media libraries such as Shutterstock.

OpenAI did not immediately respond to a request for comment. But shortly after this story was published, a PR representative said they would “consult with the team.”

If game content is actually included in Sora’s training set, this could have legal implications – especially if OpenAI builds more interactive experiences on top of Sora.

“Companies that conduct training using unlicensed footage from video game playthroughs are taking on a lot of risks,” Joshua Weigensberg, IP attorney at Pryor Cashman, told TechCrunch. “Training a generative AI model generally involves copying the training data. If this data is video playthroughs of games, it is very likely that copyrighted materials will be included in the training set.”

Probabilistic models

Generative AI models like Sora are probabilistic. Using lots of data, they learn patterns in that data to make predictions – for example, that a person biting into a burger will leave a bite mark.

This is a useful capability. It allows models to “learn” how the world works, to a degree, through observation. But it can also be an Achilles’ heel. When prompted in specific ways, models – many of which are trained on public web data – can produce near-copies of their training examples.

A sample from Sora. Photo credit: OpenAI

This has understandably upset creators whose works were swept into training sets without their permission. A growing number of them are seeking redress in the courts.

Microsoft and OpenAI are currently being sued for allegedly allowing their AI tools to regurgitate licensed code. Three companies behind popular AI art apps – Midjourney, Runway and Stability AI – are in the crosshairs of a lawsuit accusing them of violating artists’ rights. And major music labels have filed copyright infringement lawsuits against Udio and Suno, two startups developing AI-powered song generators.

Many AI companies have long claimed fair use protections, arguing that their models create transformative, not plagiaristic, works. Suno, for example, contends that indiscriminate training is no different from “a kid writing their own rock songs after listening to the genre.”

However, game content comes with some unique complications, says Evan Everist, a copyright attorney at Dorsey & Whitney.

“Playthrough videos include at least two levels of copyright protection: the content of the game as the property of the game developer and the unique video created by the player or videographer that captures the player’s experience,” Everist told TechCrunch in an email. “And for some games, there is a potential third level of rights in the form of user-generated content that appears in the software.”

Everist cited Epic’s Fortnite as an example: the game allows players to create their own maps and share them for others to use. A video of a playthrough of one of these maps would implicate no fewer than three copyright holders, he said: (1) Epic, (2) the person using the map, and (3) the map’s creator.

A sample from Sora. Photo credit: OpenAI

“If courts find copyright liability for training AI models, each of those copyright holders would be a potential plaintiff or licensor,” Everist said. “For any company training AI on such videos, the risk is exponential.”

Weigensberg noted that games themselves have many “protectable” elements, such as proprietary textures, that a judge might consider in an IP lawsuit. “Unless these works are properly licensed,” he said, “training on them could constitute a violation.”

TechCrunch reached out to a number of game studios and publishers for comment, including Epic, Microsoft (which owns Minecraft), Ubisoft, Nintendo, Roblox and Cyberpunk 2077 developer CD Projekt Red. Only a few responded – and none would give an official statement.

“We cannot commit to an interview at this time,” said a CD Projekt Red spokesperson. EA told TechCrunch that there was “no comment at this time.”

Risky outputs

It is possible that AI companies will prevail in these disputes. The courts could decide that generative AI serves a “highly convincing transformative purpose,” following the precedent set roughly a decade ago in the publishing industry’s lawsuit against Google.

In that case, a court ruled that Google’s copying of millions of books for Google Books, a type of digital archive, was legal. Authors and publishers had argued that reproducing their intellectual property online constituted infringement.

But a ruling in favor of AI companies would not necessarily protect users from allegations of wrongdoing. If a generative model recreated a copyrighted work, a person who then published that work – or incorporated it into another project – could still be liable for intellectual property infringement.

“Generative AI systems often spit out recognizable, protectable IP assets as output,” Weigensberg said. “Simpler systems that generate text or static images often have problems preventing the generation of copyrighted material in their output, and therefore more complex systems may well have the same problem, regardless of the programmers’ intentions.”

A sample from Sora. Photo credit: OpenAI

Some AI companies have indemnity clauses to cover such situations should they arise. However, the clauses often contain exceptions: OpenAI’s, for example, applies only to enterprise customers, not individual users.

Beyond copyright, Weigensberg says, there are other risks to consider, such as trademark infringement.

“The output could also include assets used in connection with marketing and branding – including recognizable characters from games – which poses a brand risk,” he said. “And it could implicate name, image and likeness rights.”

The growing interest in world models could make this all even more complicated. One application of world models – which OpenAI considers Sora to be – is essentially the generation of video games in real time. If these “synthetic” games resemble the content on which the model was trained, this could be legally problematic.

“Training an AI platform on the voices, movements, characters, songs, dialogue and graphics in a video game constitutes copyright infringement, just as it would if these elements were used in other contexts,” said Avery Williams, an intellectual property attorney at McKool Smith. “The fair use questions that have arisen in so many lawsuits against generative AI companies will affect the video game industry as much as any other creative market.”
