This guide walks through the complete process of accessing the NVIDIA API Catalog: creating an account, setting up an organization, generating API keys, and making your first authenticated API call. The NVIDIA API Catalog is hosted at build.nvidia.com and provides access to hundreds of AI models through standard REST APIs.
Before you can generate API keys, you need an NVIDIA account.
You can also create an account at developer.nvidia.com — the same credentials work across both sites.
Once signed in, you arrive at the NVIDIA API Catalog on build.nvidia.com.
The catalog is organized by model category, and each model card shows the model's key details.
Before generating an API key, you can try any model in the browser playground — no key required. This is useful for quick evaluation.
NVIDIA’s API platform supports organizations — shared workspaces where teams can collaborate, share API keys, manage billing, and set usage policies.
Give your organization a display name (e.g., Acme AI Team) and a URL slug (e.g., acme-ai). You are automatically assigned the Owner role in the organization.
You can also invite collaborators to join your organization.
API keys authenticate your requests to the NVIDIA API endpoints. Keys can be scoped to a personal account or to an organization.
Give each personal key a descriptive name (e.g., dev-laptop, ci-pipeline, rag-prototype). Organization keys can be named by team or environment (e.g., prod-backend, data-team, staging). The standard way to use your API key is via an environment variable:
export NVIDIA_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Add this line to your ~/.bashrc, ~/.zshrc, or ~/.profile to make it persistent across terminal sessions.
set NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$env:NVIDIA_API_KEY = "nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Create a .env file in your project root (add .env to .gitignore):
NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Load it in your Python code:
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.environ["NVIDIA_API_KEY"]
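As a quick sanity check before making calls, you can verify the key is present and carries the nvapi- prefix shown in the examples above. The helper below is illustrative only, not part of any SDK:

```python
import os

def check_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Return the key from the environment, failing fast with a clear error."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or add it to your .env file")
    if not key.startswith("nvapi-"):
        # Catalog keys begin with 'nvapi-'; anything else is likely a paste error.
        raise RuntimeError(f"{env_var} does not look like an NVIDIA API key")
    return key
```

Failing at startup with a clear message is far easier to debug than a 401 response deep inside your application.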
With your API key ready, you can make authenticated calls to any model in the catalog.
The base URL for NVIDIA API calls is:
https://integrate.api.nvidia.com/v1
curl -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is NVIDIA CUDA and why does it matter for AI?"
      }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
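The same request can be assembled with Python's standard library alone. This sketch only constructs the request object so it can be inspected offline; the actual send is commented out:

```python
import json
import os
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
        {"role": "user", "content": "What is NVIDIA CUDA and why does it matter for AI?"}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('NVIDIA_API_KEY', '')}",
    },
    method="POST",
)

# Uncomment to actually send the request (requires a valid key):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```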
The endpoints are OpenAI-compatible, so the OpenAI Python SDK works without modification:
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"]
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant specializing in NVIDIA technologies."
        },
        {
            "role": "user",
            "content": "Explain the difference between NIM and TensorRT in simple terms."
        }
    ],
    max_tokens=512,
    temperature=0.6
)

print(response.choices[0].message.content)
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"]
)

response = client.embeddings.create(
    model="nvidia/nv-embed-v2",
    input="NVIDIA RAPIDS accelerates data science on GPUs",
    encoding_format="float"
)

embedding_vector = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding_vector)}")
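Embedding vectors are usually compared with cosine similarity. A minimal, dependency-free helper (illustrative, not part of the SDK; assumes equal-length vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In a RAG pipeline you would embed both the query and the documents, then rank documents by this score.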
Both examples require the OpenAI SDK (plus python-dotenv if you load keys from a .env file):

pip install openai python-dotenv
Each model in the catalog has an identifier in the format provider/model-name; you can find the exact identifier on each model's catalog page.
Example model identifiers:
| Model | Identifier |
|---|---|
| Llama 3.3 70B Instruct | meta/llama-3.3-70b-instruct |
| Mistral 7B Instruct | mistralai/mistral-7b-instruct-v0.3 |
| Mixtral 8x7B Instruct | mistralai/mixtral-8x7b-instruct-v0.1 |
| Microsoft Phi-3 Medium | microsoft/phi-3-medium-4k-instruct |
| DeepSeek R1 | deepseek-ai/deepseek-r1 |
| NV-Embed-v2 (embeddings) | nvidia/nv-embed-v2 |
| Stable Diffusion XL | stability-ai/sdxl |
| Llama 3.2 11B Vision | meta/llama-3.2-11b-vision-instruct |
For chat completions, streaming returns tokens as they are generated rather than waiting for the full response. This is important for responsive user interfaces:
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"]
)

stream = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain GPU memory hierarchy in detail."}],
    max_tokens=1024,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
The free tier enforces per-model rate limits.
Rate limit errors return HTTP 429 Too Many Requests. Implement exponential backoff retry logic:
import time
import openai

def call_with_retry(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
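When many clients back off on the same schedule they retry in lockstep; adding random jitter spreads the retries out. The compute_backoff helper below is an illustrative variant of the delay calculation, independent of any SDK:

```python
import random

def compute_backoff(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

You could substitute `time.sleep(compute_backoff(attempt))` for the fixed `wait_time` in the retry loop above.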
For production workloads, upgrade from the free tier through the NVIDIA Developer portal.
For building RAG systems, agents, or pipelines, the native NVIDIA integrations in popular frameworks are the most convenient path.
pip install langchain-nvidia-ai-endpoints
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
import os

# Chat model
llm = ChatNVIDIA(
    model="meta/llama-3.3-70b-instruct",
    api_key=os.environ["NVIDIA_API_KEY"]
)
response = llm.invoke("What are the key benefits of NVIDIA NIM?")
print(response.content)

# Embedding model
embedder = NVIDIAEmbeddings(
    model="nvidia/nv-embed-v2",
    api_key=os.environ["NVIDIA_API_KEY"]
)
vectors = embedder.embed_documents(["RAPIDS", "cuDNN", "TensorRT"])
pip install llama-index-llms-nvidia llama-index-embeddings-nvidia
from llama_index.llms.nvidia import NVIDIA
from llama_index.embeddings.nvidia import NVIDIAEmbedding
import os

llm = NVIDIA(
    model="meta/llama-3.3-70b-instruct",
    api_key=os.environ["NVIDIA_API_KEY"]
)

embed_model = NVIDIAEmbedding(
    model="nvidia/nv-embed-v2",
    api_key=os.environ["NVIDIA_API_KEY"]
)
The NVIDIA API Catalog is designed as a stepping stone. When your application is ready for production requirements — data privacy, predictable latency, cost efficiency at scale — you can migrate to self-hosted NIM with minimal code changes.
The key migration steps are:
Change the base URL from https://integrate.api.nvidia.com/v1 to your local NIM endpoint (e.g., http://localhost:8000/v1):

# API Catalog (prototyping)
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"]
)

# Self-hosted NIM (production)
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required"  # Or your NIM auth config
)
Because both use OpenAI-compatible APIs, your application code does not change — only the endpoint configuration changes.
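One way to keep the switch purely configuration-driven is to resolve the endpoint from the environment at startup. NIM_BASE_URL and NIM_API_KEY below are hypothetical variable names chosen for this sketch, not part of NIM itself:

```python
import os

def resolve_endpoint():
    """Return (base_url, api_key): self-hosted NIM if NIM_BASE_URL is set, else the API Catalog."""
    nim_url = os.environ.get("NIM_BASE_URL")
    if nim_url:
        # Self-hosted NIM may not require a key, depending on your auth config.
        return nim_url, os.environ.get("NIM_API_KEY", "not-required")
    return "https://integrate.api.nvidia.com/v1", os.environ["NVIDIA_API_KEY"]

# base_url, api_key = resolve_endpoint()
# client = OpenAI(base_url=base_url, api_key=api_key)
```

With this pattern, moving from prototyping to production is a deployment-time environment change rather than a code change.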
Getting started with the NVIDIA API Catalog takes only a few minutes.
The catalog’s OpenAI compatibility, free tier, and broad model selection make it the practical starting point for any NVIDIA AI integration project — whether you are building a chatbot, a RAG system, an embeddings pipeline, or a vision application.