โšก
Mac Playbook
โฑ 5 min

Ollama API Serving

Serve models via an OpenAI-compatible API with Ollama

Replaces DGX Spark: NIM on Spark

Basic idea

Ollama's REST API is OpenAI-compatible, meaning any tool built for the OpenAI API โ€” LangChain, Continue.dev, Cursor, Open WebUI, custom Python apps โ€” works with local Ollama models by changing a single base URL. This is the Mac equivalent of NVIDIA NIM: a production-ready model inference microservice with a standard API, running entirely on your hardware.
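To make the "single base URL" point concrete, here is a standard-library sketch that builds (but does not send) such a request. The model name is a placeholder for whatever you've pulled, and the helper function is ours; with the official `openai` package the equivalent setup is just `OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # the only thing that changes vs. OpenAI's API

def build_chat_request(model, messages, api_key="ollama"):
    """Build (but don't send) an OpenAI-compatible chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Ollama requires the header to be present but ignores the key's value
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("qwen2.5:7b", [{"role": "user", "content": "Hello"}])
# With Ollama running, urllib.request.urlopen(req) returns the JSON response.
```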

The API compatibility is deep: Ollama implements the same JSON request/response schemas as OpenAI for /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models. The only differences are that the model field takes Ollama model names (e.g., qwen2.5:7b instead of gpt-4o), and the API key can be any non-empty string (Ollama checks that the Authorization header is present but ignores its value).
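As a sketch of that shared schema (the model name is a placeholder, `extract_reply` is our own helper, and the response dict is a truncated illustration, not real server output), a chat-completions request body and the matching response lookup look like:

```python
# Minimal OpenAI-style chat request body. "qwen2.5:7b" stands in for
# whatever model you've pulled with `ollama pull`.
request_body = {
    "model": "qwen2.5:7b",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is Ollama?"},
    ],
}

def extract_reply(response: dict) -> str:
    """The reply text sits at choices[0].message.content, same as OpenAI."""
    return response["choices"][0]["message"]["content"]

# Illustrative (heavily truncated) response in the shared schema:
example_response = {
    "choices": [{"message": {"role": "assistant", "content": "A local LLM runtime."}}]
}
print(extract_reply(example_response))  # A local LLM runtime.
```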

What you'll accomplish

A locally running OpenAI-compatible API server at http://localhost:11434/v1 that any OpenAI SDK client can connect to, with multiple models available for switching, embeddings support for RAG pipelines, and optional cross-machine access for running inference on a Mac from a different device on your network.

What to know before starting

The OpenAI API specification: a JSON REST API where `POST /v1/chat/completions` takes a `messages` array and returns a `choices` array. This standard has been adopted by Anthropic, Mistral, Together, and many others — Ollama's compatibility means any code written against this spec works locally.
API keys and why Ollama ignores the value: the OpenAI SDK requires an `api_key` parameter and sends it as an `Authorization: Bearer <key>` header. Ollama checks that the header is present and syntactically valid (a non-empty string) but doesn't validate the actual value. Set it to `"ollama"` or any string.
Streaming via server-sent events: when `stream=True`, the server sends tokens as they're generated using the SSE protocol: each token arrives as a `data: {...}` line, and the stream ends with `data: [DONE]`. This enables real-time display of output without waiting for the full response.
What embeddings are: a text embedding is a dense vector (e.g., 768 or 1536 floating-point numbers) that represents the semantic meaning of a text. Similar texts have vectors that are close together in this high-dimensional space. Embeddings power semantic search, RAG (retrieval-augmented generation), and document clustering.
Why a separate embedding model: `nomic-embed-text` is a 137M-parameter model fine-tuned specifically to produce high-quality embeddings. Using a chat model for embeddings works but is wasteful: you'd load a 7B model just to get embeddings that a 137M model produces better.
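The streaming format described above can be exercised without a live server. This sketch parses SSE lines the way a client would; the sample chunks are illustrative, written in the `delta` shape that OpenAI-compatible chat streaming uses:

```python
import json

def parse_sse_stream(lines):
    """Yield the JSON payload of each `data:` line, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(payload)

# Illustrative chunks in the streaming delta schema:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_stream(sample))
print(text)  # Hello
```

Real clients do the same thing incrementally, printing each `delta.content` fragment as it arrives instead of joining at the end.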

Prerequisites

โ€ข Ollama installed: `brew install ollama` or from [ollama.com](https://ollama.com)
โ€ข At least one model pulled (see Setup tab)
โ€ข For embeddings: `nomic-embed-text` model pulled
โ€ข For cross-machine access: Tailscale installed (optional โ€” see the Tailscale playbook)

Time & risk

Duration: 5 minutes
Risk level: None — Ollama runs as a user process, not a system service; no ports are opened externally by default
Rollback: `pkill ollama` to stop the server