Overview

Pipecat provides two NVIDIA Nemotron Speech STT service implementations:
  • NvidiaSTTService — Real-time streaming transcription using Nemotron ASR Streaming models with interim results and continuous audio processing.
  • NvidiaSegmentedSTTService — Segmented transcription using Canary models with advanced language support, word boosting, and enterprise-grade accuracy.

  • NVIDIA Nemotron Speech STT API Reference: Pipecat’s API methods for NVIDIA Nemotron Speech STT integration
  • Example Implementation: Complete example with NVIDIA services integration
  • NVIDIA ASR NIM Documentation: Official NVIDIA ASR NIM documentation
  • NVIDIA Developer Portal: Access API keys and Nemotron Speech services

Installation

To use NVIDIA Nemotron Speech services, install the required dependency:
uv add "pipecat-ai[nvidia]"

Prerequisites

NVIDIA Nemotron Speech Setup

Before using NVIDIA Nemotron Speech STT services, you need:
  1. NVIDIA Developer Account (for cloud deployments): Sign up at NVIDIA Developer Portal
  2. API Key (for cloud deployments): Generate an NVIDIA API key for Nemotron Speech services
  3. Model Selection: Choose between Nemotron ASR Streaming (streaming) and Canary (segmented) models
For local deployments, you can run NVIDIA ASR NIM locally without an API key. See the NVIDIA ASR NIM documentation for deployment instructions.

Environment Variables

  • NVIDIA_API_KEY: Your NVIDIA API key for authentication (required for cloud endpoint, not needed for local deployments)
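For cloud usage, the key is typically exported in the shell before starting your app so that `os.getenv("NVIDIA_API_KEY")` can pick it up (the value shown below is a placeholder, not a real key):

```shell
# Set the API key in your shell before starting the app.
# The value is a placeholder, not a real key.
export NVIDIA_API_KEY="nvapi-your-key-here"
```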

NvidiaSTTService

Real-time streaming transcription using NVIDIA Nemotron Speech’s streaming ASR models.
api_key (str | None, default: None)
  NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.

server (str, default: "grpc.nvcf.nvidia.com:443")
  NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. localhost:50051).

model_function_map (Mapping[str, str])
  Mapping containing function_id and model_name for the ASR model.

sample_rate (int, default: None)
  Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.

params (NvidiaSTTService.InputParams, default: None; deprecated)
  Additional configuration parameters. Deprecated in v0.0.105. Use settings=NvidiaSTTService.Settings(...) instead.

settings (NvidiaSTTService.Settings, default: None)
  Runtime-configurable settings. See Settings below.

use_ssl (bool, default: True)
  Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.

audio_channel_count (int, default: 1)
  Number of audio channels.

start_history (int, default: -1)
  VAD start history in frames. Use -1 for the Nemotron Speech default.

start_threshold (float, default: -1.0)
  VAD start threshold. Use -1.0 for the Nemotron Speech default.

stop_history (int, default: 320)
  VAD stop history in frames. Use -1 for the Nemotron Speech default.

stop_threshold (float, default: -1.0)
  VAD stop threshold. Use -1.0 for the Nemotron Speech default.

stop_history_eou (int, default: -1)
  End-of-utterance stop history in frames. Use -1 for the Nemotron Speech default.

stop_threshold_eou (float, default: -1.0)
  End-of-utterance stop threshold. Use -1.0 for the Nemotron Speech default.

custom_configuration (str, default: "")
  Custom Nemotron Speech configuration string (e.g. "enable_vad_endpointing:true,neural_vad.onset:0.65").

ttfs_p99_latency (float, default: 1.0)
  P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.

Settings

Runtime-configurable settings passed via the settings constructor argument using NvidiaSTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language or str | Language.EN_US | Target language for transcription. (Inherited from base STT settings.) |
| profanity_filter | bool | False | Whether to filter profanity from results. |
| automatic_punctuation | bool | True | Whether to add automatic punctuation. |
| verbatim_transcripts | bool | True | Whether to return verbatim transcripts. |
| boosted_lm_words | list[str] | None | List of words to boost in the language model. |
| boosted_lm_score | float | 4.0 | Score boost for specified words. |
| max_alternatives | int | 1 | Maximum number of recognition alternatives. |
| interim_results | bool | True | Whether to return interim (partial) results. |
| word_time_offsets | bool | False | Whether to include word-level time offsets. |
| speaker_diarization | bool | False | Whether to enable speaker diarization. |
| diarization_max_speakers | int | 0 | Maximum number of speakers for diarization. |

Usage

import os

from pipecat.services.nvidia.stt import NvidiaSTTService

stt = NvidiaSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
)

Notes

  • Model cannot be changed after initialization: Use the model_function_map parameter in the constructor to specify the model and function ID.
  • Streaming: Provides real-time interim and final results through continuous audio streaming.
  • Metrics support: This service supports metrics generation (can_generate_metrics() returns True).

NvidiaSegmentedSTTService

Batch/segmented transcription using NVIDIA Nemotron Speech’s Canary models. Processes complete audio segments after VAD detects speech boundaries.
api_key (str | None, default: None)
  NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.

server (str, default: "grpc.nvcf.nvidia.com:443")
  NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. localhost:50051).

model_function_map (Mapping[str, str])
  Mapping containing function_id and model_name for the ASR model.

sample_rate (int, default: None)
  Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.

params (NvidiaSegmentedSTTService.InputParams, default: None; deprecated)
  Additional configuration parameters. Deprecated in v0.0.105. Use settings=NvidiaSegmentedSTTService.Settings(...) instead.

settings (NvidiaSegmentedSTTService.Settings, default: None)
  Runtime-configurable settings. See Settings below.

use_ssl (bool, default: True)
  Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.

custom_configuration (str, default: "")
  Custom Nemotron Speech configuration string (e.g. "enable_vad_endpointing:true,neural_vad.onset:0.65").

ttfs_p99_latency (float, default: 1.0)
  P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.

Settings

Runtime-configurable settings passed via the settings constructor argument using NvidiaSegmentedSTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language or str | Language.EN_US | Target language for transcription. (Inherited from base STT settings.) |
| profanity_filter | bool | False | Whether to filter profanity from results. |
| automatic_punctuation | bool | True | Whether to add automatic punctuation. |
| verbatim_transcripts | bool | False | Whether to return verbatim transcripts. |
| boosted_lm_words | list[str] | None | List of words to boost in the language model. |
| boosted_lm_score | float | 4.0 | Score boost for specified words. |
| max_alternatives | int | 1 | Maximum number of recognition alternatives. |
| word_time_offsets | bool | False | Whether to include word-level time offsets. |

Usage

import os

from pipecat.services.nvidia.stt import NvidiaSegmentedSTTService
from pipecat.transcriptions.language import Language

stt = NvidiaSegmentedSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    settings=NvidiaSegmentedSTTService.Settings(
        language=Language.ES,
        automatic_punctuation=True,
        boosted_lm_words=["Pipecat", "NVIDIA"],
        boosted_lm_score=6.0,
    ),
)

Notes

  • Model cannot be changed after initialization: Use the model_function_map parameter in the constructor to specify the model and function ID.
  • Segmented processing: Processes complete audio segments for higher accuracy compared to streaming.
  • Language support: Supports Arabic, English (US/GB), French, German, Hindi, Italian, Japanese, Korean, Portuguese (BR), Russian, and Spanish (ES/US). See the NVIDIA ASR NIM documentation for the complete list.
  • Word boosting: Use boosted_lm_words and boosted_lm_score to improve recognition of domain-specific terms.
The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.