Polished GUI for local LLM inference using MLX backend
Replaces DGX Spark: LM Studio / SGLang
inferenceui
Basic idea
LM Studio is a native macOS application that wraps MLX (and llama.cpp for non-MLX models) with a polished interface for browsing, downloading, and running models. It abstracts away the command-line complexity of MLX LM while still using it as the inference backend, meaning you get the same performance as running `mlx_lm.generate` directly, but through a GUI.
Think of it this way: MLX LM gives you maximum control and scriptability; LM Studio gives you the same inference speed with a point-and-click interface and no terminal required. For daily use, model exploration, or sharing local AI access with less technical teammates, LM Studio is often a better choice than the raw CLIs.
The application has two main components:
1. A model library browser and downloader connected to Hugging Face
2. A local server that exposes an OpenAI-compatible API at `localhost:1234`, the same interface used by tools like Cursor, Continue.dev, and Obsidian Copilot
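Because the server speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it. A minimal stdlib sketch is below; it assumes the server is running with a model loaded, and the model name is a placeholder for whichever model you actually downloaded:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"

# OpenAI-style chat-completion request body; the model id here is an
# assumption -- substitute the model you have loaded in LM Studio.
payload = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "temperature": 0.7,
}

def chat_completion(payload: dict) -> dict:
    """POST the payload to LM Studio's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running and a model loaded:
# reply = chat_completion(payload)
# print(reply["choices"][0]["message"]["content"])
```

The same request works unchanged against any other OpenAI-compatible backend, which is why tools like Cursor and Continue.dev only need a base URL to point at LM Studio.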
On Apple Silicon, LM Studio prioritizes the MLX backend for any model that has an MLX variant. For models that only exist as GGUF files, it falls back to its bundled llama.cpp engine with Metal acceleration.
What you'll accomplish
After following this playbook you will have:
• LM Studio installed as a native Mac app
• At least one model downloaded (MLX variant for best performance)
• A working chat interface with persistent conversation history
• An OpenAI-compatible local server running at `localhost:1234` that external tools can connect to
What to know before starting
MLX vs GGUF in LM Studio's model search:: When you search for a model in LM Studio, results show both MLX variants and GGUF variants. MLX variants run through Apple's MLX framework and are significantly faster on Apple Silicon. GGUF variants run through llama.cpp's Metal backend, still GPU-accelerated but typically 10–30% slower than MLX. Always choose MLX when available.
What the Q suffix means for GGUF models:: In LM Studio's model browser, GGUF models show suffixes like Q4_K_M, Q5_K_M, Q8_0. These indicate quantization level. Q4_K_M is the standard recommendation: good quality, fits a 7B model in ~6 GB RAM. Q8_0 is near-lossless but uses ~2x the RAM. If you see an MLX and a GGUF variant, pick MLX.
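The RAM figures above follow from simple arithmetic. A rough sketch, where the bits-per-weight values are approximations (K-quants keep some tensors at higher precision, so effective size sits a bit above the nominal bit width):

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate in-RAM size of the weights alone, in GB."""
    return params * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight: Q4_K_M ~4.5, Q8_0 ~8.5
q4 = weight_gb(7e9, 4.5)   # ~3.9 GB of weights
q8 = weight_gb(7e9, 8.5)   # ~7.4 GB of weights

# Add a couple of GB for KV cache and runtime overhead and the Q4_K_M
# total lands near the ~6 GB figure quoted above; Q8_0 is roughly 2x.
print(f"Q4_K_M weights: {q4:.1f} GB, Q8_0 weights: {q8:.1f} GB")
```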
How LM Studio stores models:: Models are downloaded to `~/Documents/LM Studio/Models/` by default. Each MLX model is typically 3โ20 GB depending on parameter count and quantization. You can change the storage location in LM Studio's settings.
The local server vs chat interface:: LM Studio has two separate concerns: the chat UI (for you to talk to models directly in the app) and the developer server (an HTTP API for other apps to use). You load a model for chat separately from loading a model for the server. Both can run the same model simultaneously.
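To check what the developer server currently exposes, you can query the standard OpenAI `/v1/models` endpoint. A minimal sketch, assuming the server is enabled on its default port:

```python
import json
import urllib.request

def list_loaded_models(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """Ask the developer server which models it can serve
    (the standard OpenAI /v1/models endpoint)."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        data = json.load(resp)
    return [m["id"] for m in data["data"]]

# With the server enabled in LM Studio:
# print(list_loaded_models())
```

This is also a quick way to confirm the server half of LM Studio is up before pointing an external tool at it, independent of whatever you have loaded in the chat UI.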
Prerequisites
• macOS 13.1+ (macOS 14 Sonoma or later recommended for MLX performance)
• Apple Silicon Mac (M1 or later); Intel Macs are supported but will not use the MLX backend