OpenAI has never revealed the exact data used to train Sora, its video-generating AI. But some of it appears to have come from Twitch streams and game walkthroughs. Sora, which debuted on Monday, can create videos up to 20 seconds long from text prompts or images, in a range of aspect ratios and resolutions.
When Sora was first introduced in February, OpenAI hinted that it had trained the model on Minecraft videos. That raised the question of which other game playthroughs might be in the training set. It turns out there are quite a few. Sora can generate videos resembling a Super Mario Bros. clone, a first-person shooter inspired by Call of Duty and Counter-Strike, and an arcade fighter reminiscent of a ’90s Teenage Mutant Ninja Turtles game.
Sora also seems familiar with what a Twitch stream should look like, which suggests it has seen a few. It reproduces the look of a stream accurately, and one generated clip features a figure resembling popular Twitch streamer Raúl Álvarez Genes, known as Auronplay. Sora has also created a character resembling Imane Anys, also known as Pokimane.

OpenAI has been secretive about its data sources, but using game content in Sora’s training data could have legal implications. Training generative AI models on copyrighted material without authorization could pose risks, according to IP attorney Joshua Weigensberg.
Generative AI models like Sora are probabilistic: they learn patterns from large amounts of data and use those patterns to make predictions. This property helps them learn how the world works, but it can also cause problems. A model may generate near-copies of its training examples, which creates friction with creators concerned about their works being used without consent.
Legal disputes over generative AI models are on the rise, and they raise questions about copyright infringement and transformative use. Training AI on game content presents unique challenges, according to attorney Evan Everist. The legal landscape is complex, and the outcome of these disputes could shape the future development and use of AI technology.
The implications extend beyond copyright to trademark rights and the use of people's likenesses. As AI technology evolves, navigating these legal and ethical questions becomes crucial. The transformative potential of AI is promising, but the risk of unintentional infringement highlights the need for clarity and accountability in this rapidly evolving field.