Nvidia Launches Open Multimodal AI Model Nemotron 3 Nano Omni

Nvidia’s new open model processes video, audio, images and text in one system, reducing reliance on separate perception models.

MIT SMR Editors April 29, 2026

Nvidia on Tuesday introduced Nemotron 3 Nano Omni, an open multimodal artificial intelligence model designed to process video, audio, images and text within a single system.

The company said many AI agent systems still rely on separate models for vision, speech and language tasks, an approach that requires multiple inference passes and fragments context across inputs.

“AI agent systems today juggle separate models for vision, speech and language, losing time and context as they pass data from one model to the other,” Nvidia said in its official release.

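Nemotron 3 Nano Omni combines vision and audio encoders within what Nvidia described as a 30B-A3B hybrid mixture-of-experts architecture, a designation that typically denotes about 30 billion total parameters with roughly 3 billion active per token.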

According to the company, the model can process multimodal inputs without separate perception models and delivers up to nine times the throughput of comparable open omni models.

Nvidia said it is positioning the model for enterprise agent workflows involving screen recordings, uploaded audio, documents, spreadsheets, charts and mixed-media inputs.

The company said such systems are increasingly used in customer support, financial analysis, compliance workflows and computer-use agents that navigate graphical user interfaces.

Nvidia also said the model can operate alongside other open or proprietary AI models in larger agentic workflows.

Tags: Nemotron 3 Nano Omni, Nvidia
