assets/outbox/2026-05-28-from-rulesets-local-llm-install.org


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88

#+TITLE: Install local offline LLM runtime and model cache
#+DATE: 2026-05-28
#+SOURCE_PROJECT: rulesets
#+REQUEST_TYPE: install-feature
#+STARTUP: showall

* Request

Please add local offline LLM support to =archsetup='s normal install process so
machines can run a local coding agent when there is no network.

This came from the =rulesets= generic-agent-runtime design pass. =rulesets=
should become runtime-neutral, but it needs =archsetup= to provision the local
model runtime and prefetch model files while network is available.

* Hardware-specific recommendations

** High-end Strix Halo machine

Detected with =inxi=:

- AMD Ryzen AI Max+ 395
- 128 GiB RAM
- Radeon 8060S / Strix Halo unified memory

Install:

- Default offline coding model:
  =Qwen3-Coder-30B-A3B-Instruct-GGUF=, prefer =Q6_K= on this machine.
- Compatibility quant:
  =Qwen3-Coder-30B-A3B-Instruct-GGUF Q4_K_M=.
- Larger general/long-context fallback:
  =Qwen3-Next-80B-A3B-Instruct-GGUF Q4_K_M=.

** velox

Detected with =ssh velox inxi -C -G -m -S --filter=:

- Intel Core i7-1370P
- 64 GiB RAM
- Intel Iris Xe integrated graphics

Install:

- Strongest practical offline coding default:
  =Qwen3-Coder-30B-A3B-Instruct-GGUF Q4_K_M=.
- Add an 8B fallback model for quick edits and low-latency triage.

Expect =velox= to be CPU/low-end-iGPU bound. The 30B model fits, but latency
will be the limiting factor.

* Runtime stack

Recommended packages/components:

- =llama.cpp= with CPU and Vulkan support where practical.
- Optional =ollama= as a simple model manager/API for workflows that prefer it.
- A shared local model cache, e.g. =~/.local/share/llm/models= or
  =/srv/models/llm=.
- OpenAI-compatible local endpoints:
  - coding model on =127.0.0.1:8081=
  - larger/general model on =127.0.0.1:8082= when installed
  - leave =127.0.0.1:11434= for =ollama= if used

* Install behavior

- Install runtime packages during normal setup.
- Prefetch model files when network is available.
- Make model download idempotent: skip if exact file already exists.
- Do not make the install fail hard if model download is unavailable; surface a
  clear follow-up saying local offline LLM support is incomplete.
- Add a smoke test command that starts the local endpoint and asks a short prompt.

* Why this belongs in archsetup

=rulesets= can provide the runtime manifests, launcher behavior, and project
instructions, but it should not own machine provisioning. =archsetup= already
owns package installation and per-host setup, so it is the right place to install
=llama.cpp=/=ollama= and maintain the machine-local model inventory.

* Sources checked

- Qwen3-Coder 30B GGUF quant listings show Q4_K_M around 18.6 GB and Q6_K around
  25.1 GB.
- Qwen3-Next 80B GGUF model card shows Q4_K_M around 48.4 GB and native 262K
  context.
- =llama.cpp= supports CPU and GPU backends including Vulkan/HIP/ROCm; keep the
  backend configurable per host.