AI Infrastructure 2026-05-26 3 min read

Ollama API Quickstart: How to Run a Local Model and Call It From Python

A practical Ollama guide showing how to start the local server, pull a model, call the HTTP API, and use a local LLM from Python without overcomplicating the stack.

Ollama became popular quickly because it gives developers a much shorter path from “I want to try a local model” to “I have a local API endpoint.” That speed is useful, but the value comes from using it in a disciplined way instead of turning local LLM work into random shell experiments.

What Ollama gives you

Ollama helps you download and run local models with a simpler command-line workflow than many lower-level inference setups. It also exposes a local HTTP API, which is what makes it useful beyond toy terminal chats.

That means you can:

run a model locally
call it from scripts
plug it into prototypes
experiment without immediately depending on a hosted API

Step 1: install Ollama

On macOS, install it from the official site or app bundle. After installation, verify:

ollama --version

If the command is missing, fix PATH or the installation before doing anything else.
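
If you would rather script that check, for example in a project setup script, a minimal sketch using only the Python standard library could look like this. The helper name cli_version is hypothetical and not part of Ollama.

```python
import shutil
import subprocess
from typing import Optional


def cli_version(binary: str = "ollama") -> Optional[str]:
    """Return the output of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None  # not installed, or PATH does not include it
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=10
    )
    # Some tools print version info to stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()
```

A None result corresponds to the "command is missing" case described above.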

Step 2: pull a model

Example:

ollama pull llama3.1

This downloads the model so it is available locally.
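
To confirm from code that the pull worked, one option is the /api/tags endpoint, which lists the models available locally. The sketch below assumes the server is running on the default http://localhost:11434 and that a model name without a tag refers to the latest tag. The helper names are illustrative.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def model_names(tags_payload: dict) -> list:
    """Extract model names from a parsed /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def normalize(name: str) -> str:
    """Treat a bare name such as 'llama3.1' as 'llama3.1:latest' (assumption)."""
    return name if ":" in name else f"{name}:latest"


def is_pulled(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` is among the locally available models."""
    local = {normalize(n) for n in model_names(tags_payload)}
    return normalize(wanted) in local


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """GET /api/tags from a running Ollama server and parse the JSON body."""
    with urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

With the server running, is_pulled(fetch_tags(), "llama3.1") should tell you whether the pull above succeeded.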

Step 3: run a quick local chat in the terminal

ollama run llama3.1

That confirms the local runtime works at the most basic level.

Step 4: use the HTTP API

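Ollama exposes a documented HTTP API, by default at http://localhost:11434. A simple request with curl looks like this (setting "stream" to false returns one complete JSON object instead of a stream of partial responses):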

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain what a Docker healthcheck does in one paragraph.",
  "stream": false
}'

That is the point where Ollama becomes more than a CLI novelty.

Python example

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a Python function that retries a request three times.",
        "stream": False,
    },
    timeout=120,
)

response.raise_for_status()
data = response.json()
print(data["response"])

That is enough to start wiring a local model into internal tools or experiments.
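
The example above turns streaming off. If you leave "stream" out or set it to true, the server sends newline-delimited JSON objects, each carrying a fragment of the text in "response", with the last one marked "done": true. A minimal sketch of a parser for that format follows. The function name iter_text is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/generate response.

    `lines` is any iterable of raw NDJSON lines, one JSON object per line.
    """
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final object signals the end of the stream
```

With requests, the lines would come from response.iter_lines() on a requests.post(...) call made with stream=True and "stream": True in the JSON body. Printing each fragment as it arrives gives the same typing effect as the terminal chat.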

When the chat endpoint is a better fit

Some teams begin with /api/generate, but for multi-turn workflows the chat-style API is often easier to reason about because the message history is explicit.

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [
            {"role": "system", "content": "You explain infrastructure clearly."},
            {"role": "user", "content": "What does a reverse proxy do?"},
        ],
        "stream": False,
    },
    timeout=120,
)

response.raise_for_status()
print(response.json()["message"]["content"])

That shape tends to age better once your prototype becomes a real assistant or internal tool.
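
To make "explicit message history" concrete, here is a minimal sketch of a conversation holder that builds each /api/chat request body and records each reply. The class and method names are hypothetical. Only the request and response shapes follow the chat example above.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    model: str
    messages: list = field(default_factory=list)

    def ask(self, text: str) -> dict:
        """Add a user turn and return the JSON body to POST to /api/chat."""
        self.messages.append({"role": "user", "content": text})
        return {
            "model": self.model,
            "messages": list(self.messages),  # copy, so later turns do not mutate it
            "stream": False,
        }

    def record(self, response_json: dict) -> str:
        """Store the assistant message from a parsed reply and return its text."""
        message = response_json["message"]
        self.messages.append(
            {"role": message["role"], "content": message["content"]}
        )
        return message["content"]
```

Each turn is then body = chat.ask(...), a POST of that body, and chat.record(response.json()). The model sees the whole history on every call because the API itself keeps no conversation state between requests.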

Why developers still get tripped up

They forget local models are still infrastructure

Local does not mean free of constraints. CPU vs GPU, RAM, model size, and latency still matter.
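
One way to put numbers on those constraints is to read the timing fields that a non-streaming /api/generate response includes. The sketch below assumes the field names total_duration, load_duration, eval_count, and eval_duration, with durations in nanoseconds, as I understand them from the Ollama API documentation. Verify them against your installed version.

```python
NS_PER_SECOND = 1_000_000_000


def timing_summary(data: dict) -> dict:
    """Summarize server-reported timings from a parsed /api/generate response."""
    eval_ns = data.get("eval_duration", 0)
    tokens = data.get("eval_count", 0)
    return {
        "total_s": data.get("total_duration", 0) / NS_PER_SECOND,
        "load_s": data.get("load_duration", 0) / NS_PER_SECOND,
        "output_tokens": tokens,
        # Generation throughput; None when the server reported no eval time.
        "tokens_per_s": tokens / (eval_ns / NS_PER_SECOND) if eval_ns else None,
    }
```

Passing the data dict from the Python example above gives a quick read on whether a model is usable on your hardware. A large load_s with a small tokens_per_s usually points at the machine rather than the prompt.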

They assume local privacy means zero design work

Running the model locally is only one part. You still need to think about logging, prompt handling, timeouts, and error behavior.
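
As one illustration of explicit error behavior, the sketch below retries a call a fixed number of times with exponential backoff and re-raises the last error when it runs out of attempts. It is a generic pattern, not something Ollama provides. The name call_with_retries and its parameters are hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    send: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send()` up to `attempts` times, retrying only on `retry_on` errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1x, 2x, 4x, ...
    raise AssertionError("unreachable")
```

With requests, send would wrap the requests.post(...) call and retry_on could be (requests.ConnectionError, requests.Timeout). HTTP errors raised by raise_for_status() are deliberately left out here, since a 4xx response usually means the request itself needs fixing.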

They keep switching models without defining the job

A local model experiment becomes much more useful when the task is concrete:

summarize logs
rewrite docs
classify support tickets
draft code comments
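
Taking ticket classification as an example, a concrete task means a fixed prompt and a strict way of reading the answer. The sketch below is one way to pin that down. The label set and helper names are hypothetical.

```python
LABELS = ("billing", "bug", "feature_request", "other")  # hypothetical label set


def build_prompt(ticket: str) -> str:
    """Build a classification prompt that asks for exactly one known label."""
    options = ", ".join(LABELS)
    return (
        f"Classify the support ticket into exactly one of: {options}.\n"
        "Reply with the label only.\n\n"
        f"Ticket: {ticket.strip()}"
    )


def parse_label(raw: str) -> str:
    """Map the model's raw reply onto a known label, defaulting to 'other'."""
    cleaned = raw.strip().lower().strip(".\"'` ").replace(" ", "_")
    return cleaned if cleaned in LABELS else "other"
```

The prompt goes in the "prompt" field of the /api/generate request, and parse_label is applied to the "response" text. Anything the parser cannot map falls back to "other", so a chatty answer never leaks into downstream code.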

When Ollama is the right tool

It is excellent when you want:

fast local experimentation
an easy API surface
lower-friction demos for internal tooling
local testing before deciding on a heavier stack

Common first-week mistakes

The most common mistake is downloading a model that is too heavy for the machine and then assuming the whole local-LLM idea is bad. Another is treating every model swap like progress instead of first defining the task. A better workflow is to pick one narrow job, measure response quality and latency, and only then decide whether you need a larger model.
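
Measuring quality and latency on one narrow job can be as small as the sketch below. It assumes you have a handful of prompts with known expected answers and a function ask that sends a prompt to the model and returns its text. Both are placeholders you would supply, and the exact-match comparison is a deliberate simplification.

```python
import statistics
import time
from typing import Callable, Iterable, Tuple


def evaluate(ask: Callable[[str], str], cases: Iterable[Tuple[str, str]]) -> dict:
    """Run each (prompt, expected) case and report accuracy and latency."""
    latencies = []
    correct = 0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = ask(prompt)
        latencies.append(time.perf_counter() - start)
        # Exact match after trimming and lowercasing; replace for fuzzier tasks.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    if not latencies:
        raise ValueError("need at least one case")
    return {
        "cases": len(latencies),
        "accuracy": correct / len(latencies),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
    }
```

Running the same cases against two models gives a like-for-like comparison, which makes a model swap a measured decision rather than a guess.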

Final recommendation

Do not judge Ollama by the first funny chatbot output it gives you. Judge it by whether the local API helps you prototype a real developer workflow faster.

That is the real reason to use it. Not local-model theater. Faster iteration with more control.