Serve models via OpenAI-compatible API with Ollama
Replaces: NIM on DGX Spark
inferenceapi
Basic idea
Ollama's REST API is OpenAI-compatible, meaning any tool built for the OpenAI API (LangChain, Continue.dev, Cursor, Open WebUI, custom Python apps) works with local Ollama models by changing a single base URL. This is the Mac equivalent of NVIDIA NIM: a production-ready model inference microservice with a standard API, running entirely on your hardware.
The API compatibility is deep: Ollama implements the same JSON request/response schemas as OpenAI for `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, and `/v1/models`. The only differences are that the `model` field uses Ollama model names (e.g., `qwen2.5:7b` instead of `gpt-4o`), and the `api_key` client parameter accepts any non-empty string (Ollama validates the header format but ignores the value).
What you'll accomplish
A locally running OpenAI-compatible API server at http://localhost:11434/v1 that any OpenAI SDK client can connect to, with multiple models available for switching, embeddings support for RAG pipelines, and optional cross-machine access for running inference on a Mac from a different device on your network.
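To confirm which models are available for switching, the `/v1/models` endpoint can be queried with nothing but the standard library. A minimal sketch, assuming the default port (the `list_models` helper name is ours):

```python
import json
import urllib.request

def list_models(base_url="http://localhost:11434/v1"):
    """Return the ids of models the local Ollama server can serve."""
    with urllib.request.urlopen(f"{base_url}/models") as r:
        return [m["id"] for m in json.load(r)["data"]]

try:
    print(list_models())  # e.g. ['qwen2.5:7b', 'nomic-embed-text']
except OSError:
    print("Ollama is not running on localhost:11434")
```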
What to know before starting
The OpenAI API specification: a JSON REST API where `POST /v1/chat/completions` takes a `messages` array and returns a `choices` array. This standard was adopted by Anthropic, Mistral, Together, and many others; Ollama's compatibility means any code written against this spec works locally.
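At the wire level, the `messages`-in / `choices`-out shape looks like this stdlib-only sketch (the request is guarded so it degrades gracefully when no server is running; the model name is an assumption):

```python
import json
import urllib.request

# Request body follows the OpenAI chat-completions schema.
payload = {
    "model": "qwen2.5:7b",  # assumption: this Ollama model is already pulled
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer ollama",  # value is ignored by Ollama
    },
)

try:
    with urllib.request.urlopen(req) as r:
        body = json.load(r)
        # The response mirrors OpenAI's schema: a `choices` array.
        print(body["choices"][0]["message"]["content"])
except OSError:
    print("Ollama server not reachable on localhost:11434")
```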
API keys and why Ollama ignores the value: the OpenAI SDK requires an `api_key` parameter and sends it as an `Authorization: Bearer <key>` header. Ollama checks that the header is present and syntactically valid (non-empty string) but doesn't validate the actual value. Set it to `"ollama"` or any string.
Streaming via server-sent events: when `stream=True`, the server sends tokens as they're generated using the SSE protocol: each token is a `data: {...}` line, and the stream ends with `data: [DONE]`. This enables real-time display of output without waiting for the full response.
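The SSE framing can be illustrated without a live server. The lines below are hypothetical captured frames shaped like OpenAI-style streaming chunks, parsed the way a client would:

```python
import json

# Hypothetical captured SSE frames, shaped like OpenAI-style stream chunks:
# each token rides in choices[0].delta.content, and [DONE] ends the stream.
sse_lines = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]

tokens = []
for line in sse_lines:
    body = line[len("data: "):]
    if body == "[DONE]":  # sentinel: generation is complete
        break
    chunk = json.loads(body)
    tokens.append(chunk["choices"][0]["delta"]["content"])

print("".join(tokens))  # -> Hello
```

With the OpenAI SDK you never parse these frames yourself; passing `stream=True` yields the same deltas as iterable chunk objects.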
What embeddings are: a text embedding is a dense vector (e.g., 768 or 1536 floating-point numbers) that represents the semantic meaning of a text. Similar texts have vectors that are close together in this high-dimensional space. Embeddings power semantic search, RAG (retrieval-augmented generation), and document clustering.
Why a separate embedding model: `nomic-embed-text` is a 137M-parameter model fine-tuned specifically to produce high-quality embeddings. Using a chat model for embeddings works but is wasteful: you'd load a 7B model just to get embeddings that a 137M model produces better.
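"Close together in this high-dimensional space" is usually measured with cosine similarity. A toy sketch with 3-d stand-ins for real 768-d `nomic-embed-text` vectors (real ones would come from an embeddings call such as `client.embeddings.create(model="nomic-embed-text", input=[...])`):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d vectors stand in for real 768-d embedding output.
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.85, 0.2, 0.05]
v_car = [0.0, 0.1, 0.95]

# Semantically similar texts should score higher than unrelated ones.
print(cosine(v_cat, v_kitten) > cosine(v_cat, v_car))  # -> True
```

This ranking step is the core of a RAG retrieval pipeline: embed the query, then return the stored chunks whose vectors score highest against it.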
Prerequisites
• Ollama installed: `brew install ollama` or from [ollama.com](https://ollama.com)
• At least one model pulled (see Setup tab)
• For embeddings: `nomic-embed-text` model pulled
• For cross-machine access: Tailscale installed (optional; see the Tailscale playbook)
Time & risk
Duration:: 5 minutes
Risk level:: None. Ollama runs as a user process, not a system service; no ports opened externally by default