Nvidia Launches Open Multimodal AI Model Nemotron 3 Nano Omni
Nvidia’s new open model processes video, audio, images and text in one system, reducing reliance on separate perception models.
Nvidia on Tuesday introduced Nemotron 3 Nano Omni, an open multimodal artificial intelligence model designed to process video, audio, images and text within a single system.
The company said many AI agent systems still rely on separate models for vision, speech and language tasks, an approach that requires multiple inference passes and fragments context across inputs.
“AI agent systems today juggle separate models for vision, speech and language, losing time and context as they pass data from one model to the other,” Nvidia said in its official release.
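Nemotron 3 Nano Omni combines vision and audio encoders within what Nvidia described as a 30B-A3B hybrid mixture-of-experts architecture, a designation that typically indicates about 30 billion total parameters with roughly 3 billion active for each token.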
According to the company, the model can process multimodal inputs without the need for separate perception models and delivers up to nine times higher throughput than comparable open omni models.
Nvidia said it is positioning the model for enterprise agent workflows involving screen recordings, uploaded audio, documents, spreadsheets, charts and mixed-media inputs.
The company said such systems are increasingly used in customer support, financial analysis, compliance workflows and computer-use agents that navigate graphical user interfaces.
Nvidia also said the model can operate alongside other open or proprietary AI models in larger agentic workflows.

