Local LLM Setup for Coding: A Practical Guide

# Local LLM Setup for Coding: A Practical Guide

Running a coding assistant locally isn’t a luxury—it’s a practical choice when you need speed, privacy, and zero API costs. This guide walks you through setting up a local LLM that actually works for code generation, debugging, and refactoring.

## Why Go Local for Coding

The big three (ChatGPT, Claude, Gemini) are solid, but they send your code to external servers. That breaks for proprietary projects, violates compliance requirements, or simply costs money when you hit rate limits.

Local LLMs give you:
– **Zero per-token costs** — run unlimited queries
– **Privacy** — your code never leaves your machine
– **Speed** — no network latency on repeated queries
– **Offline capability** — work on planes, in remote locations

The trade-off: you need decent hardware, and models aren’t as capable as GPT-4 on complex reasoning. But for most coding tasks, they’re good enough.

## Hardware Requirements

Your GPU determines what you can run:

| VRAM | Models You Can Run |
|——|——————-|
| 8GB | 7B parameter models (Q4_K_M quantization) |
| 16GB | 14B models (Q4_K_M), 7B at higher quality |
| 24GB+ | 14B+ at decent quantizations, 34B at Q4 |

A 24GB VRAM GPU (RTX 4090, RTX 3090, or A4000) hits the sweet spot for coding. With 8GB, you’re limited to smaller models that struggle with complex codebases.

CPU-only inference works but runs 10-50x slower. If you’re serious about this, get a GPU.

## Setting Up Ollama

Ollama is the easiest way to get a local LLM running. It handles model downloading, runtime, and provides an OpenAI-compatible API.

### Installation

“`bash
# macOS
brew install ollama

# Linux (Ubuntu/Debian)
curl -fsSL https://ollama.com/install.sh | sh

# Windows (WSL2 recommended)
# Install via the installer at ollama.com
“`

### Pull Your First Model

“`bash
# Pull a coding-focused model
ollama pull codellama

# Or try mistral for general purpose
ollama pull mistral

# Check available models
ollama list
“`

Codellama is trained on code datasets and performs better at coding tasks than general models at the same size.

### Running the Model

“`bash
# Start the server (runs on port 11434 by default)
ollama serve

# In another terminal, chat with it
ollama run codellama “write a python function to merge two sorted arrays”
“`

For a 7B model, expect 15-30 tokens/second on a decent GPU. That’s usable but not blazing.

## Connecting to Your Editor

Most editors expect an OpenAI-compatible API. Ollama provides this out of the box.

### VS Code with Continue Extension

Continue (continue.dev) is the best VS Code extension for local LLMs.

“`json
// In ~/.continue/config.json
{
“models”: [
{
“model”: “codellama”,
“provider”: “ollama”,
“api_base”: “http://localhost:11434”
}
],
“tabAutocompleteModel”: {
“model”: “starcoder”,
“provider”: “ollama”,
“api_base”: “http://localhost:11434”
}
}
“`

After installing the extension and reloading VS Code, you get:
– Inline autocomplete (Ctrl+Space for more)
– Chat panel for refactoring and explanation
– Highlight code + right-click for context actions

### Neovim

For the vim users, nvim-cmp and copilot.lua don’t support local models directly, but you can use the Ollama API:

“`lua
— In ~/.config/nvim/lua/plugins/copilot.lua
require(“copilot”).setup({
server = {
cmd = {
“node”,
“/path/to/copilot-server/dist/agent.js”,
“–stdio”
},
settings = {
— This part doesn’t work locally, use nvim-cmp with custom source instead
}
}
})
“`

Better option: use the `ollama`.nvim plugin or write a custom completion source that hits `http://localhost:11434/api/generate`. That’s beyond this guide, but it’s doable.

## Optimizing Performance

### Quantization Matters

When you pull a model, Ollama uses a default quantization (typically Q4_K_M). This is a good balance of size and quality. But you can optimize:

“`bash
# Pull with specific quantization
ollama pull codellama:7b-instruct-q4_K_M

# List available quantizations
ollama show codellama
“`

Q5_K_M gives better quality at the cost of ~30% more VRAM. Q3_K_M saves VRAM but degrades code generation quality noticeably.

### Context Length

Long context eats VRAM fast. A 7B model at 4096 context uses ~6GB VRAM; at 8192 it jumps to ~10GB. For code, 4096-8192 is usually sufficient.

“`bash
# Set context length when running
ollama run codellama –context 8192
“`

### Model Selection Trade-offs

| Model | Strengths | Weaknesses |
|——-|———–|————|
| codellama | Code-specific, good at explaining | Less capable at general reasoning |
| mistral | Balanced, good at following instructions | Not code-specific |
| starcoder | Fast autocomplete | Weaker at complex tasks |
| deepseek-coder | Strong on code, good context handling | Newer, less tested |

For pure coding, codellama or deepseek-coder win. For mixed tasks, mistral.

## What Actually Works

After running local LLMs for months, here’s the reality:

**Works well:**
– Explaining unfamiliar code
– Generating boilerplate
– Writing tests
– Simple refactoring
– Finding bugs in small functions

**Doesn’t work well:**
– Complex architecture decisions
– Multi-file refactoring (context limits)
– Debugging cryptic error messages (GPT-4 is much better here)
– Writing code that requires deep domain knowledge of your project

Local models need more hand-holding. You get better results with explicit prompts:

“`
Write a FastAPI endpoint that accepts a CSV upload, validates
the headers against a schema, and returns errors for invalid rows.
Use pandas. Include docstring.
“`

vs. just “write an endpoint.”

## Key Takeaways

– Local LLMs cost nothing after hardware and run offline—worth it for proprietary code
– 8GB VRAM limits you to 7B models; 16GB+ opens up 14B+ models
– Ollama makes setup trivial—pull, run, connect via API
– Continue extension in VS Code is the easiest editor integration
– Local models handle simple tasks well but struggle with complex reasoning
– Quantization (Q4 vs Q5) trades VRAM for quality—balance based on your GPU

## Next Steps

1. **Install Ollama** and pull codellama (`ollama pull codellama`)
2. **Test it** with a simple coding task: “write a python decorator that logs function execution time”
3. **Install Continue** in VS Code and configure it to use the Ollama endpoint
4. **Try it on real work** — explain a confusing file or generate tests for a module

Run it for a week. If the speed and privacy are worth the trade-off in capability, you’ve got your setup. If you need GPT-4 level reasoning for complex tasks, keep local as a quick tool and use the API for the hard stuff.