Serve models via OpenAI-compatible API with Ollama
Replaces: NIM on DGX Spark
inferenceapi
Basic idea
Ollama's REST API is OpenAI-compatible, meaning any tool built for the OpenAI API (LangChain, Continue.dev, Cursor, Open WebUI, custom Python apps) works with local Ollama models by changing a single base URL. This is the Mac equivalent of NVIDIA NIM: a production-ready model inference microservice with a standard API, running entirely on your hardware.
The API compatibility is deep: Ollama implements the same JSON request/response schemas as OpenAI for `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, and `/v1/models`. The only differences are that the `model` field uses Ollama model names (e.g., `qwen2.5:7b` instead of `gpt-4o`), and the `api_key` client parameter accepts any non-empty string (Ollama validates the header format but ignores the value).
What you'll accomplish
A locally running OpenAI-compatible API server at http://localhost:11434/v1 that any OpenAI SDK client can connect to, with multiple models available for switching, embeddings support for RAG pipelines, and optional cross-machine access for running inference on a Mac from a different device on your network.
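To confirm which models are available for switching, the `/v1/models` endpoint can be queried with nothing but the standard library. A minimal sketch, assuming the default port (the `list_models` helper name is ours):

```python
import json
import urllib.request

def list_models(base_url="http://localhost:11434/v1"):
    """Return the ids of models the local Ollama server can serve."""
    with urllib.request.urlopen(f"{base_url}/models") as r:
        return [m["id"] for m in json.load(r)["data"]]

try:
    print(list_models())  # e.g. ['qwen2.5:7b', 'nomic-embed-text']
except OSError:
    print("Ollama is not running on localhost:11434")
```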
What to know before starting
The OpenAI API specification: a JSON REST API where `POST /v1/chat/completions` takes a `messages` array and returns a `choices` array. This standard was adopted by Anthropic, Mistral, Together, and many others; Ollama's compatibility means any code written against this spec works locally.
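At the wire level, the `messages`-in / `choices`-out shape looks like this stdlib-only sketch (the request is guarded so it degrades gracefully when no server is running; the model name is an assumption):

```python
import json
import urllib.request

# Request body follows the OpenAI chat-completions schema.
payload = {
    "model": "qwen2.5:7b",  # assumption: this Ollama model is already pulled
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer ollama",  # value is ignored by Ollama
    },
)

try:
    with urllib.request.urlopen(req) as r:
        body = json.load(r)
        # The response mirrors OpenAI's schema: a `choices` array.
        print(body["choices"][0]["message"]["content"])
except OSError:
    print("Ollama server not reachable on localhost:11434")
```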
API keys and why Ollama ignores the value: the OpenAI SDK requires an `api_key` parameter and sends it as an `Authorization: Bearer <key>` header. Ollama checks that the header is present and syntactically valid (non-empty string) but doesn't validate the actual value. Set it to `"ollama"` or any string.
Streaming via server-sent events: when `stream=True`, the server sends tokens as they're generated using the SSE protocol: each token is a `data: {...}` line, and the stream ends with `data: [DONE]`. This enables real-time display of output without waiting for the full response.
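The SSE framing can be illustrated without a live server. The lines below are hypothetical captured frames shaped like OpenAI-style streaming chunks, parsed the way a client would:

```python
import json

# Hypothetical captured SSE frames, shaped like OpenAI-style stream chunks:
# each token rides in choices[0].delta.content, and [DONE] ends the stream.
sse_lines = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]

tokens = []
for line in sse_lines:
    body = line[len("data: "):]
    if body == "[DONE]":  # sentinel: generation is complete
        break
    chunk = json.loads(body)
    tokens.append(chunk["choices"][0]["delta"]["content"])

print("".join(tokens))  # -> Hello
```

With the OpenAI SDK you never parse these frames yourself; passing `stream=True` yields the same deltas as iterable chunk objects.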
What embeddings are: a text embedding is a dense vector (e.g., 768 or 1536 floating-point numbers) that represents the semantic meaning of a text. Similar texts have vectors that are close together in this high-dimensional space. Embeddings power semantic search, RAG (retrieval-augmented generation), and document clustering.
Why a separate embedding model: `nomic-embed-text` is a 137M-parameter model fine-tuned specifically to produce high-quality embeddings. Using a chat model for embeddings works but is wasteful: you'd load a 7B model just to get embeddings that a 137M model produces better.
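"Close together in this high-dimensional space" is usually measured with cosine similarity. A toy sketch with 3-d stand-ins for real 768-d `nomic-embed-text` vectors (real ones would come from an embeddings call such as `client.embeddings.create(model="nomic-embed-text", input=[...])`):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d vectors stand in for real 768-d embedding output.
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.85, 0.2, 0.05]
v_car = [0.0, 0.1, 0.95]

# Semantically similar texts should score higher than unrelated ones.
print(cosine(v_cat, v_kitten) > cosine(v_cat, v_car))  # -> True
```

This ranking step is the core of a RAG retrieval pipeline: embed the query, then return the stored chunks whose vectors score highest against it.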
Prerequisites
• Ollama installed: `brew install ollama` or from [ollama.com](https://ollama.com)
• At least one model pulled (see Setup tab)
• For embeddings: `nomic-embed-text` model pulled
• For cross-machine access: Tailscale installed (optional; see the Tailscale playbook)
Time & risk
Duration:: 5 minutes
Risk level:: None. Ollama runs as a user process, not a system service; no ports opened externally by default