#+TITLE: Install local offline LLM runtime and model cache #+DATE: 2026-05-28 #+SOURCE_PROJECT: rulesets #+REQUEST_TYPE: install-feature #+STARTUP: showall * Request Please add local offline LLM support to =archsetup='s normal install process so machines can run a local coding agent when there is no network. This came from the =rulesets= generic-agent-runtime design pass. =rulesets= should become runtime-neutral, but it needs =archsetup= to provision the local model runtime and prefetch model files while network is available. * Hardware-specific recommendations ** High-end Strix Halo machine Detected with =inxi=: - AMD Ryzen AI Max+ 395 - 128 GiB RAM - Radeon 8060S / Strix Halo unified memory Install: - Default offline coding model: =Qwen3-Coder-30B-A3B-Instruct-GGUF=, prefer =Q6_K= on this machine. - Compatibility quant: =Qwen3-Coder-30B-A3B-Instruct-GGUF Q4_K_M=. - Larger general/long-context fallback: =Qwen3-Next-80B-A3B-Instruct-GGUF Q4_K_M=. ** velox Detected with =ssh velox inxi -C -G -m -S --filter=: - Intel Core i7-1370P - 64 GiB RAM - Intel Iris Xe integrated graphics Install: - Strongest practical offline coding default: =Qwen3-Coder-30B-A3B-Instruct-GGUF Q4_K_M=. - Add an 8B fallback model for quick edits and low-latency triage. Expect =velox= to be CPU/low-end-iGPU bound. The 30B model fits, but latency will be the limiting factor. * Runtime stack Recommended packages/components: - =llama.cpp= with CPU and Vulkan support where practical. - Optional =ollama= as a simple model manager/API for workflows that prefer it. - A shared local model cache, e.g. =~/.local/share/llm/models= or =/srv/models/llm=. - OpenAI-compatible local endpoints: - coding model on =127.0.0.1:8081= - larger/general model on =127.0.0.1:8082= when installed - leave =127.0.0.1:11434= for =ollama= if used * Install behavior - Install runtime packages during normal setup. - Prefetch model files when network is available. - Make model download idempotent: skip if exact file already exists. - Do not make the install fail hard if model download is unavailable; surface a clear follow-up saying local offline LLM support is incomplete. - Add a smoke test command that starts the local endpoint and asks a short prompt. * Why this belongs in archsetup =rulesets= can provide the runtime manifests, launcher behavior, and project instructions, but it should not own machine provisioning. =archsetup= already owns package installation and per-host setup, so it is the right place to install =llama.cpp=/=ollama= and maintain the machine-local model inventory. * Sources checked - Qwen3-Coder 30B GGUF quant listings show Q4_K_M around 18.6 GB and Q6_K around 25.1 GB. - Qwen3-Next 80B GGUF model card shows Q4_K_M around 48.4 GB and native 262K context. - =llama.cpp= supports CPU and GPU backends including Vulkan/HIP/ROCm; keep the backend configurable per host.