Text-to-video model

Technology

A class of generative AI technology that creates video content from written text descriptions. Sora is presented as the latest and most advanced example of this technology.


First Mentioned

1/4/2026, 3:39:15 AM

Last Updated

1/4/2026, 3:42:46 AM

Research Retrieved

1/4/2026, 3:42:46 AM

Summary

Text-to-video (T2V) models represent a significant frontier in generative AI, building on text-to-image foundations and adding the harder dimension of time: every frame must look convincing on its own while staying coherent across seconds of motion. The field has used a range of architectures, from recurrent neural networks and GANs in earlier systems to the diffusion transformers behind models such as OpenAI's Sora. While early models like CogVideo and Meta's Make-A-Video emerged in 2022, recent work has focused on achieving an implicit understanding of physics and overcoming the "uncanny valley" in animation. The technology is currently characterized by rapid "state-of-the-art" turnover, with major tech entities like Google, Tencent, and Alibaba competing to increase parameter counts and context windows, while startups like Magic.dev explore AI "coworkers" and Meta applies similar logic to automated software testing.
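
The temporal-coherence point above is easiest to see in code. The sketch below is a minimal, illustrative factorized space-time transformer block of the kind used in many diffusion-based video generators; it is not Sora's actual architecture, and all dimensions, names, and shapes are assumptions chosen for demonstration.

```python
import torch
import torch.nn as nn


class SpaceTimeBlock(nn.Module):
    """Factorized spatial + temporal self-attention over video latent patches."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) -- latent patches of one video clip.
        b, t, p, d = x.shape

        # Spatial attention: patches attend within their own frame (per-frame quality).
        h = self.norm1(x).reshape(b * t, p, d)
        x = x + self.spatial_attn(h, h, h)[0].reshape(b, t, p, d)

        # Temporal attention: each patch position attends across frames (coherence over time).
        h = self.norm2(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        x = x + self.temporal_attn(h, h, h)[0].reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Position-wise MLP.
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    clip = torch.randn(2, 16, 64, 256)  # 2 clips, 16 frames, 64 latent patches per frame
    print(SpaceTimeBlock()(clip).shape)  # torch.Size([2, 16, 64, 256])
```

In a real diffusion transformer the block would also receive text conditioning and a diffusion timestep embedding; the point here is only that the temporal attention pass is what ties patches in one frame to the same locations in neighbouring frames.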

Referenced in 1 Document
Research Data
Extracted Attributes
  • Field

    Artificial Intelligence / Computer Vision

  • Primary Function

    Generating video from text instructions, animating still images, and extending or in-filling existing videos

  • Key Architectures

    Diffusion Models, Transformers, Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs)

  • Training Data Source

    Synthetic data and video-text pairs (e.g., YouTube videos paired with their titles and descriptions)

  • Max Video Length (Sora)

    60 seconds

  • Parameter Count (CogVideo)

    9.4 billion

  • Parameter Count (HunyuanVideo)

    13 billion+

Timeline
  • CogVideo, the earliest text-to-video model (9.4 billion parameters), has its open-source demo code first presented on GitHub. (Source: Wikipedia)

    2022-01-01

  • Meta Platforms releases a partial text-to-video model called Make-A-Video. (Source: Wikipedia)

    2022-09-29

  • Google Brain introduces Imagen Video, a text-to-video model utilizing a 3D U-Net architecture. (Source: Wikipedia)

    2022-10-05

  • OpenAI introduces Sora, a model capable of generating videos up to one minute long and presented as having an implicit understanding of physics. (Source: Document b5abf73b-f30b-41b8-b4d1-f22b8ed1c816)

    2024-02-15

  • Genmo releases Mochi, a 10 billion parameter text-to-video model. (Source: Modal.com)

    2024-10-01

  • Tencent releases HunyuanVideo with over 13 billion parameters. (Source: Modal.com)

    2024-12-01

  • Alibaba releases Wan2.2 in 5 billion and 14 billion parameter versions. (Source: Modal.com)

    2025-07-01

Web Search Results
  • Text-to-video model

    30. ^ Jin, Jiayao; Wu, Jianhang; Xu, Zhoucheng; Zhang, Hang; Wang, Yaxin; Yang, Jielong (4 August 2023). "Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network". 2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT). IEEE. pp. 108–114. doi:10.1109/CCPQT60491.2023.00024. ISBN 979-8-3503-4269-7. [...] There are several architectures that have been used to create text-to-video models. Similar to text-to-image models, these models can be trained using Recurrent Neural Networks (RNNs) such as long short-term memory (LSTM) networks, which have been used for Pixel Transformation Models and Stochastic Video Generation Models, aiding consistency and realism respectively. Alternatives to these include transformer models. Generative adversarial networks (GANs), Variational autoencoders [...] There are different models, including open source models. Chinese-language input CogVideo is the earliest text-to-video model "of 9.4 billion parameters" to be developed, with its demo version of open source code first presented on GitHub in 2022. That year, Meta Platforms released a partial text-to-video model called "Make-A-Video", and Google's Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with 3D U-Net.
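
The RNN/LSTM approach mentioned in the excerpt above can be sketched in a few lines: an LSTM carries temporal state across frames while a text embedding initializes it. This is a minimal illustration rather than any specific published model; the text-to-state mapping, dimensions, and names are assumptions.

```python
import torch
import torch.nn as nn


class TextConditionedFrameRNN(nn.Module):
    """Roll out a sequence of per-frame latents from a text embedding (illustrative)."""

    def __init__(self, text_dim: int = 128, latent_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.init_state = nn.Linear(text_dim, 2 * hidden)   # text embedding -> (h0, c0)
        self.rnn = nn.LSTMCell(latent_dim, hidden)
        self.to_latent = nn.Linear(hidden, latent_dim)       # hidden state -> next frame latent

    def forward(self, text_emb: torch.Tensor, num_frames: int) -> torch.Tensor:
        h, c = self.init_state(text_emb).chunk(2, dim=-1)
        frame = torch.zeros(text_emb.size(0), self.to_latent.out_features,
                            device=text_emb.device)
        frames = []
        for _ in range(num_frames):
            h, c = self.rnn(frame, (h, c))   # the recurrent state links consecutive frames
            frame = self.to_latent(h)        # decode the next frame's latent code
            frames.append(frame)
        return torch.stack(frames, dim=1)    # (batch, frames, latent_dim)


if __name__ == "__main__":
    latents = TextConditionedFrameRNN()(torch.randn(2, 128), num_frames=8)
    print(latents.shape)  # torch.Size([2, 8, 64])
```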

  • Top open-source text-to-video AI models - Modal

    | Model | Parameters | Created by | Released |
    | --- | --- | --- | --- |
    | HunyuanVideo | 13B+ | Tencent | Dec 2024 |
    | Mochi (deploy on Modal) | 10B | Genmo | Oct 2024 |
    | Wan2.2 | 5B and 14B | Alibaba | Jul 2025 |

    Text-to-video AI models build on text-to-image foundations but add a more difficult dimension: time. Every frame must not only look convincing on its own but also stay coherent across seconds of motion. This shift introduces new failure modes: [...]

    Text-to-video models introduce a whole new set of unique operational challenges. This means that the choice of a model isn't as easy as picking the one that looks good. We have to find a model that fits into our existing workflow and hardware budget.

    ### Things to Think About When Selecting a Model [...]

    ## Closing Thoughts

    The text-to-video space is moving at a very fast clip, with new models claiming "state-of-the-art" being released every few weeks. The common message across these models is that there is no single "best" option, but a growing set of trade-offs.

  • Sora: Creating video from text

    Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. [...] In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Learn more in our technical report⁠.
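
A common way to think about the three modes described above (generation from text alone, animating a still image, and extending or in-filling an existing clip) is masked conditioning: frames that are already known are held fixed while the remaining frames are generated. The sketch below illustrates only that bookkeeping under assumed shapes and names; it is not a description of Sora's internal method or API.

```python
import torch


def build_conditioning(num_frames: int, known: dict[int, torch.Tensor],
                       latent_shape: tuple[int, ...]) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (frames, mask) where mask[t] == 1 marks a frame to keep fixed.

    `known` maps a frame index to its (hypothetical) latent: an empty dict is
    pure text-to-video, {0: image_latent} animates a still image, and
    {0..k: clip_latents} extends an existing clip.
    """
    frames = torch.zeros(num_frames, *latent_shape)
    mask = torch.zeros(num_frames)
    for t, latent in known.items():
        frames[t] = latent      # clamp the known frame
        mask[t] = 1.0           # mark it as fixed during generation
    return frames, mask


if __name__ == "__main__":
    latent_shape = (4, 32, 32)                  # illustrative latent size
    image_latent = torch.randn(*latent_shape)
    frames, mask = build_conditioning(16, {0: image_latent}, latent_shape)
    print(mask)  # frame 0 fixed (the still image); frames 1..15 to be generated
```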

  • Video Generation from Text

    In comparison, the T2V model provides both background and motion features. The intermediate gist-generation step fixes the background style and structure, and the following Text2Filter step forces the synthesized motion to use the text information. These results demonstrate the necessity of both the gist generator and the Text2Filter components in our model. In the following subsections, we intentionally generate videos that do not usually happen in real world. This is to address concerns of [...] • Text-to-video generation with pair information (PT2V): DT2V is extended using the framework of (Reed et al. 2016). The discriminator judges whether the video and text pair are real, synthetic, or a mismatched pair. This is the method in Figure 3(b). We use a linear concatenation for the video and text feature in the discriminator. [...] Our contributions are summarized as follows: (i) By viewing the gist as an intermediate step, we propose an effective text-to-video generation framework. (ii) We demonstrate that using input text to generate a filter better models dynamic features. (iii) We propose a method to construct a training dataset based on YouTube (www.youtube.com) videos where the video titles and descriptions are used as the accompanying text. This allows abundant on-line video data to be used to construct robust and
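
The pair-aware discriminator described in this excerpt (PT2V, following Reed et al. 2016) can be sketched as a three-way classifier over linearly concatenated video and text features. The encoders, dimensions, and names below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class PairDiscriminator(nn.Module):
    """Judge a (video, text) pair as real, synthetic, or mismatched (illustrative)."""

    REAL, SYNTHETIC, MISMATCHED = 0, 1, 2

    def __init__(self, video_dim: int = 512, text_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),   # linear concatenation of the two features
            nn.ReLU(),
            nn.Linear(hidden, 3),            # logits for real / synthetic / mismatched
        )

    def forward(self, video_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.video_enc(video_feat), self.text_enc(text_feat)], dim=-1)
        return self.classifier(joint)


if __name__ == "__main__":
    d = PairDiscriminator()
    logits = d(torch.randn(4, 512), torch.randn(4, 128))
    print(logits.shape)  # torch.Size([4, 3])
```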

  • A collection of awesome video generation studies. - GitHub

    - Text2Video-Zero: Text-to-image Diffusion Models are Zero-shot Video Generators
    - Video Probabilistic Diffusion Models in Projected Latent Space
    - (ICCV) Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
    - Gen-1: Structure and Content-guided Video Synthesis with Diffusion Models
    - (NeurIPS) Video Diffusion Models
    - Learning Universal Policies via Text-Guided Video Generation [...]
    - Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis
    - (ICCV) Unified Video Generation via Next-Set Prediction in Continuous Domain
    - (NeurIPS) Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation
    - (ICLR) OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation [...]
    - HARIVO: Harnessing Text-to-Image Models for Video Generation
    - MEVG: Multi-event Video Generation with Text-to-Video Models
    - (NeurIPS) Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
    - (ICML) Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
    - (ICLR) VDT: General-purpose Video Diffusion Transformers via Mask Modeling