Mochi 1 Preview: Open Source Video Generation with High-Fidelity Motion

Experience state-of-the-art video generation with exceptional motion quality and prompt adherence, powered by our 10B parameter AsymmDiT architecture

Try on Playground

Mochi 1 Preview: Open Source Video Generation with High-Fidelity Motion

Featured Examples

Explore groundbreaking open-source video generation powered by our novel Asymmetric Diffusion Transformer

A static shot of a diner at night. The door swings open and a young blind lady enters...

A static shot of a diverse group of people standing on top of a high hill, overlooking a sprawling, vibrant cityscape...

An FPV drone swoops through the colorful coral-lined street of an underwater neighborhood...

A wide shot of the vast expanse of outer space, with planets and stars twinkling in the distance...

A close-up shot of a majestic horse's eye, with a glossy brown coat reflecting the light...

A static shot of a chubby orange and white cat sitting comfortably on a plush sofa...

A wide shot of a vast, otherworldly desert with a deep blue sky overhead...

A drone shot flies down into the central courtyard of a majestic medieval castle...

A cinematic anamorphic shot of a man walking through a dimly lit, foggy graveyard...

A static shot of actress Daisy Ridley, dressed in a sleek, strapless red gown with black detailing...

A stylish man in a three-piece suit and a gold chain draped across his vest exits a 1940s night club...

A close-up of a humanoid lemur with a playful face wearing a cyberpunk tactical suit...

A static shot of a young blind lady entering a small diner at night...

A static shot of a horse's face fills the frame, with the camera focusing on its deep brown eye...

A close-up shot of actress Daisy Ridley in a red dress, with black detailing and sheer sleeves...

A close-up shot of a steaming cup of coffee on a rustic wooden table...

A close-up shot of a smiling child holding a brightly colored spinning top...

A close-up shot of an adorable white, fluffy cat dressed in a vibrant, colorful outfit...

A static shot of a mystical figure dressed in flowing white robes...

A close-up shot of a time portal opening up in the modern city...

State-of-the-Art Video Generation

High-Fidelity Motion

Industry-leading motion quality through our 10B parameter diffusion model with strong prompt adherence in preliminary evaluation

Open Source Architecture

Built on the novel Asymmetric Diffusion Transformer (AsymmDiT) architecture, freely available under Apache 2.0 license

Advanced Compression

Featuring our open-source VAE that causally compresses videos to 128x smaller size with 8x8 spatial and 6x temporal compression

Efficient Processing

Streamlined text processing with single T5-XXL language model and optimized visual reasoning capacity

Multimodal Architecture

Joint attention to text and visual tokens with dedicated MLP layers for each modality and non-square QKV projection

Developer-Friendly

Simple, hackable architecture with comprehensive documentation and community support

Generate Videos with Mochi 1

Setup Environment

Clone the repository and install dependencies using uv package manager

Configure Parameters

Set model directory, CFG scale, and seed values for controlled generation

Generate Content

Run inference through Gradio UI or command line interface

Community Feedback

John Doe

The motion quality and prompt adherence set new standards for open-source video generation.

★★★★★

Alex Smith

The simple, hackable architecture makes it perfect for research and experimentation.

★★★★★

Emily Zhang

Impressive results with the efficient single T5-XXL language model approach.

★★★★★

Frequently Asked Questions

► What makes Mochi 1 unique?

Mochi 1 represents a significant advancement in open-source video generation, featuring a 10B parameter diffusion model with novel AsymmDiT architecture, offering high-fidelity motion and strong prompt adherence while being completely open source.

► What are the technical specifications?

Mochi 1 generates 480p videos using a 10B parameter model with 8x8 spatial and 6x temporal compression. The model requires 4 H100 GPUs for inference and uses a single T5-XXL for prompt encoding.

► How does the architecture work?

The AsymmDiT architecture efficiently processes prompts alongside compressed video tokens, using multi-modal self-attention with separate MLP layers for each modality and non-square QKV projections for reduced memory requirements.

► What are the current limitations?

The current research preview is limited to 480p resolution, may show minor warping in extreme motion cases, and is optimized for photorealistic rather than animated content.

Join the Open Source Video Generation Revolution

Experience the largest openly released video generative model

Try Mochi 1