blob: c3cbdaa776f351feb3398e93ad5f1e6d51c31f3e (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
|
#+TITLE: Install local offline LLM runtime and model cache
#+DATE: 2026-05-28
#+SOURCE_PROJECT: rulesets
#+REQUEST_TYPE: install-feature
#+STARTUP: showall
* Request
Please add local offline LLM support to =archsetup='s normal install process so
machines can run a local coding agent when there is no network.
This came from the =rulesets= generic-agent-runtime design pass. =rulesets=
should become runtime-neutral, but it needs =archsetup= to provision the local
model runtime and prefetch model files while network is available.
* Hardware-specific recommendations
** High-end Strix Halo machine
Detected with =inxi=:
- AMD Ryzen AI Max+ 395
- 128 GiB RAM
- Radeon 8060S / Strix Halo unified memory
Install:
- Default offline coding model:
=Qwen3-Coder-30B-A3B-Instruct-GGUF=, prefer =Q6_K= on this machine.
- Compatibility quant:
=Qwen3-Coder-30B-A3B-Instruct-GGUF Q4_K_M=.
- Larger general/long-context fallback:
=Qwen3-Next-80B-A3B-Instruct-GGUF Q4_K_M=.
** velox
Detected with =ssh velox inxi -C -G -m -S --filter=:
- Intel Core i7-1370P
- 64 GiB RAM
- Intel Iris Xe integrated graphics
Install:
- Strongest practical offline coding default:
=Qwen3-Coder-30B-A3B-Instruct-GGUF Q4_K_M=.
- Add an 8B fallback model for quick edits and low-latency triage.
Expect =velox= to be CPU/low-end-iGPU bound. The 30B model fits, but latency
will be the limiting factor.
* Runtime stack
Recommended packages/components:
- =llama.cpp= with CPU and Vulkan support where practical.
- Optional =ollama= as a simple model manager/API for workflows that prefer it.
- A shared local model cache, e.g. =~/.local/share/llm/models= or
=/srv/models/llm=.
- OpenAI-compatible local endpoints:
- coding model on =127.0.0.1:8081=
- larger/general model on =127.0.0.1:8082= when installed
- leave =127.0.0.1:11434= for =ollama= if used
* Install behavior
- Install runtime packages during normal setup.
- Prefetch model files when network is available.
- Make model download idempotent: skip if exact file already exists.
- Do not make the install fail hard if model download is unavailable; surface a
clear follow-up saying local offline LLM support is incomplete.
- Add a smoke test command that starts the local endpoint and asks a short prompt.
* Why this belongs in archsetup
=rulesets= can provide the runtime manifests, launcher behavior, and project
instructions, but it should not own machine provisioning. =archsetup= already
owns package installation and per-host setup, so it is the right place to install
=llama.cpp=/=ollama= and maintain the machine-local model inventory.
* Sources checked
- Qwen3-Coder 30B GGUF quant listings show Q4_K_M around 18.6 GB and Q6_K around
25.1 GB.
- Qwen3-Next 80B GGUF model card shows Q4_K_M around 48.4 GB and native 262K
context.
- =llama.cpp= supports CPU and GPU backends including Vulkan/HIP/ROCm; keep the
backend configurable per host.
|