By Z. Aw | Published 24 May 2026

When Your On-Prem AI Workstation Hits Its Ceiling

Anywhere you look, there's still paper. Delivery slips, signed forms, certificates, invoices, packing lists. They get scanned, photographed, emailed back and forth, and eventually someone - usually the most senior person available - retypes the numbers into a spreadsheet or an ERP. The cost is hidden because nobody books it as "data entry"; it's just "the way we've always done it."

Private LLMs running on a consumer-grade AI workstation are an attractive escape hatch from that. You buy a Strix Halo box for under SGD 3,000, install an OCR pipeline and a local model, and suddenly the paper-to-digital step happens in seconds, on-prem, with no data leaving the building. PDPA-friendly. Vendor-independent. Cheap on the margin.

I've been running exactly this kind of stack in production for a few weeks. Today's Saturday session was the first time it broke in an interesting way - and what broke it teaches you something about the difference between "works on the bench" and "works under real load."

The stack

The pipeline has three model-shaped pieces, all on one Strix Halo box:

Stage 1 - Surya (open-source layout + recognition): rasterises each page, returns OCR tokens with bounding boxes. Runs in its own podman container, talks to the GPU via ROCm. Exposes POST /layout on :8090.
Stage 1B - Qwen3-VL-32B (vision-language fallback): re-reads a page the structurer wasn't confident about. ROCm via toolbox, port :8080.
Stage 2 - Gemma 4 26B (structurer): takes the OCR tokens + a prompt, returns structured JSON. Runs via llama.cpp on Vulkan, port :8001.

Surya talks to the GPU through ROCm. Gemma talks to the same GPU through Vulkan/RADV. ComfyUI sits on the same box for occasional image generation. One physical AMD iGPU. One LPDDR5X unified memory pool.

In front of all of this: a small Bun-based gateway that adds auth, rate limiting, and a Cloudflare-tunnel keepalive heartbeat (CF closes the tunnel at 100s idle, but Surya OCR can take 200s+ on a dense page, so the gateway emits a whitespace byte every 60s to keep the pipe open).

Under steady-state load this stack runs beautifully. Cold-start of Gemma is 5-7 minutes, but once it's warm, structurer calls take ~30-90s. Surya takes anywhere from 12s to 200s depending on page density. Under typical "one document trickling in every few minutes" usage, you wouldn't know there was anything fragile underneath.

What broke

Today I uploaded eight photos in roughly thirty seconds. The pipeline was supposed to classify each, OCR them, structure the result, and file the row. Twenty minutes later, only five had completed. The other three were stuck: two were silently in tool-retry, and one had been "processing Stage 1 OCR" for six straight minutes.

Three findings, none of which I'd seen during normal trickle-load.

Finding 1: vk::DeviceLostError, 30+ times in an hour

Gemma's stack trace went ggml_vk_submit → vk::Queue::submit → ErrorDeviceLost → C++ uncaught_exception → terminate → abort. Then systemd respawned it. Then Gemma's first request after warm-up triggered the same chain. Crash loop.

The root cause is structural. Surya (ROCm) and Gemma (Vulkan) are both clients of the same kernel-level AMDGPU driver. RADV (Mesa's Vulkan driver for AMD) loses its GPU context when something else holding the same physical queue does sustained compute-heavy submits - which is exactly what happens when Surya is grinding through layout+recognition+OCR on a page that has 200 text lines. The Vulkan context loss propagates to every Vulkan client on the box. Gemma was idle waiting for its next request, and got killed anyway.

By the numbers

One physical AMD iGPU. Two heavy clients (Surya via ROCm + Gemma via Vulkan). 30+ Gemma ABRT crashes in a single hour the first time real burst load hit. Zero crashes under trickle load over the previous week.

Finding 2: the OCR server's single uvicorn worker was blocked on a 365-second call

Surya runs as a FastAPI app under uvicorn with one worker. When that worker is in a long /layout call doing real OCR compute, it can't even respond to its own /healthz GET. From outside, this looks identical to "the process has hung" - your probe times out at 5s and you have no idea whether the OCR is still progressing or the server has died.

It hadn't died. It was working. But the gateway's warmup probe and every other liveness check kept timing out for the full duration of that long call.

Finding 3: the gateway itself crashed six times during the incident

A single line of error-logging code did e.message where e could be undefined (a known edge in Bun's fetch error propagation). Each gateway crash was ~3 seconds of total outage before systemd respawned it. Cumulatively that's enough lost requests to fail several jobs that would otherwise have succeeded. This is the kind of thing that should never have shipped - but you never see it until production traffic includes the specific edge that triggers it.

The fix - three layers, deepest first

Layer 1: defensive error handling

Changed e.message to e?.message ?? String(e) in the gateway. A one-line fix, and it removed the crash that took the gateway down six times during the incident.

Layer 2: equalise the input pixel envelope

Phone-camera JPEGs come in at 3000×2250 pixels, often under 1.5MB after JPEG compression. The OCR pipeline's earlier downscale gate only fired if the file was >1.5MB, so phone shots went into Surya at full resolution and allocated ~5-6 GB of GPU memory per page. PDF rasterisations, by contrast, come in at ~1190×1684 pre-shrunk and use ~3 GB. Equal-content inputs, very different GPU footprints. Made the downscale unconditional - every input is normalised to 1800px on the long edge before Surya sees it. OCR accuracy unchanged at this resolution (text remains ≥40px tall).
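
The size arithmetic behind that normalisation is small enough to write down. A minimal sketch, assuming the rule is "shrink so the long edge is 1800px, never upscale" (the no-upscale part is my assumption; normalised_size is a hypothetical name, and the actual resize call is left to whatever image library the pipeline uses):

```python
from typing import Tuple

LONG_EDGE_PX = 1800  # target from the text


def normalised_size(width: int, height: int, long_edge: int = LONG_EDGE_PX) -> Tuple[int, int]:
    """Return the (width, height) an input should be resized to before OCR."""
    longest = max(width, height)
    if longest <= long_edge:
        # Assumption: inputs already within the envelope are left alone.
        return width, height
    # Scale both sides by the same ratio so the aspect ratio is preserved.
    new_w = max(1, round(width * long_edge / longest))
    new_h = max(1, round(height * long_edge / longest))
    return new_w, new_h


phone = normalised_size(3000, 2250)  # phone-camera JPEG from the text
pdf = normalised_size(1190, 1684)    # PDF rasterisation from the text
```

Under these assumptions the phone shot drops to 1800×1350, which is 36% of the original pixel count, while the PDF rasterisation passes through unchanged because its long edge is already below 1800px.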

Layer 3: external liveness watchdog with a CPU sanity check

This is the interesting one.

A naive watchdog says: "probe /healthz every 30 seconds, restart the container after N consecutive failures." That sounds right but is exactly wrong for a single-worker server like Surya. If Surya is happily processing a 200-second OCR call, /healthz will fail for the full duration. The naive watchdog would restart Surya mid-call, killing legitimate work, and possibly triggering another GPU context-loss cascade.

The fix is to add a second liveness signal: container CPU usage. A truly wedged process is at 0% CPU. A worker that's busy in real compute is at 60%+ CPU. So the watchdog only restarts when both are true:

/healthz has failed N consecutive times (180s without an HTTP response), AND
Container CPU is below 3% (process is alive but not doing anything).

With this guard, the watchdog never kills a long but legitimate OCR call. It only restarts when the process is genuinely stuck.
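
The two-signal rule is easier to trust once it is written as a pure decision function. The real watchdog is bash; this is a Python sketch of the logic only. Probe and should_restart are hypothetical names, and N = 6 is my assumption (30-second probes adding up to the 180s in the text). How the CPU figure is sampled from the container is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Probe:
    healthz_ok: bool    # did /healthz answer within the probe timeout?
    cpu_percent: float  # container CPU usage sampled at the same moment


FAILS_BEFORE_RESTART = 6  # assumption: 6 probes x 30 s = 180 s without a response
CPU_IDLE_BELOW = 3.0      # threshold from the text


def should_restart(history: List[Probe]) -> bool:
    """Restart only if /healthz has failed N times in a row AND the CPU is idle."""
    recent = history[-FAILS_BEFORE_RESTART:]
    if len(recent) < FAILS_BEFORE_RESTART:
        return False  # not enough evidence yet
    all_failed = all(not p.healthz_ok for p in recent)
    idle_now = recent[-1].cpu_percent < CPU_IDLE_BELOW
    return all_failed and idle_now


busy_ocr = [Probe(healthz_ok=False, cpu_percent=85.0)] * 6  # long but legitimate call
wedged = [Probe(healthz_ok=False, cpu_percent=0.4)] * 6     # alive but doing nothing
```

Here should_restart(busy_ocr) is False and should_restart(wedged) is True, which is the whole point: the HTTP signal alone cannot tell those two cases apart.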

For the LLM the logic is simpler - its /health returns 200 quickly when alive and 503 ("Loading model") during warm-up - so HTTP alone is enough, with one nuance: 503 is treated as "alive, warming" not as a failure. Five consecutive non-{200,503} responses → systemctl restart, 5-minute cooldown.
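
The LLM rule can be sketched as a small state machine. Again this is Python standing in for the bash, with hypothetical names (LlmWatchdog, observe). I am assuming that "cooldown" means no second restart within five minutes of the previous one, and that a timeout or connection error is passed in as None.

```python
from typing import Optional


class LlmWatchdog:
    """Counts consecutive bad /health probes; 200 and 503 both count as alive."""

    ALIVE = (200, 503)  # 503 = "Loading model", i.e. alive and warming up

    def __init__(self, max_failures: int = 5, cooldown_s: float = 300.0) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.last_restart = float("-inf")

    def observe(self, status: Optional[int], now: float) -> bool:
        """Record one probe result; return True if a restart should fire now."""
        if status in self.ALIVE:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures < self.max_failures:
            return False
        if now - self.last_restart < self.cooldown_s:
            return False  # assumption: stay quiet while the last restart settles
        self.failures = 0
        self.last_restart = now
        return True  # caller would run the systemctl restart here


wd = LlmWatchdog()
first_five = [wd.observe(None, now=float(t)) for t in range(0, 50, 10)]
```

In this sketch first_five ends with a single True on the fifth failed probe. Treating 503 as alive matters because Gemma's cold start takes minutes; counting warm-up as failure would restart the model before it ever finished loading.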

Both watchdogs send a Telegram message on every action - start, restart-firing, restart-recovered, restart-failed-to-recover - so I find out about restarts from my phone, not from staring at a dashboard.

Total code: about 150 lines of bash + a systemd unit. The script itself reads its bot token from the same on-box agent stack, so there's nothing new to deploy or rotate.

The harder question

The watchdog works. It's been running for about two hours as I write this. The queue drained cleanly after I added it. New uploads now flow through.

But the watchdog is a recovery mechanism, not a prevention one. It restarts services when they crash. It doesn't stop them crashing.

The structural problem hasn't gone away: two heavy GPU clients (OCR via ROCm, LLM via Vulkan) sharing one physical AMD iGPU. Under sustained burst load they will fight, and the Vulkan client will lose. The watchdog will catch each loss within 180 seconds and restart it. The user will see a 3-minute "your batch is processing…" pause. Sometimes more.

There are four directions out of this:

Throttle the queue worker - insert a 30-60 second gap between jobs. Trades user-perceived speed for stability. Probably fine for low-volume use, but feels like papering over the problem.
Make Stage 2 cloud-fallback aware - when the local LLM has crashed recently, route the next call to a cloud LLM instead. The pipeline already has the abstraction; the routing logic is small. Costs maybe SGD 0.50 per 100 documents at typical volume - real money but not breakage-money.
Add a watchdog and live with it (what we did today). Cheap, observable, doesn't change the architecture. But every restart costs user latency, and you'll see them in the logs forever.
Move the LLM off the workstation entirely - to a dedicated server (DGX Spark at SGD 7-8k), or to cloud full-time. Eliminates the contention root cause. Highest capital cost; lowest ongoing pain.
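
To show how small the routing logic for the cloud-fallback option could be, here is a minimal sketch. choose_backend, the "local"/"cloud" labels and the 10-minute quarantine window are all hypothetical; the text only says "crashed recently". The cost helper just restates the SGD 0.50 per 100 documents estimate.

```python
from typing import Optional

QUARANTINE_S = 600.0            # assumption: "recently" means the last 10 minutes
CLOUD_SGD_PER_100_DOCS = 0.50   # rough estimate from the text


def choose_backend(now: float, last_local_crash: Optional[float],
                   quarantine_s: float = QUARANTINE_S) -> str:
    """Route Stage 2 to the cloud while the local LLM is inside its quarantine window."""
    if last_local_crash is not None and now - last_local_crash < quarantine_s:
        return "cloud"
    return "local"


def fallback_cost_sgd(documents: int) -> float:
    """Estimated cloud spend if `documents` calls take the fallback path."""
    return CLOUD_SGD_PER_100_DOCS * documents / 100
```

For example, choose_backend(now=1000.0, last_local_crash=900.0) returns "cloud", and the same call with last_local_crash=None returns "local". The timestamp of the last crash could come from the watchdog's restart log, so the two options compose rather than compete.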

There's no objectively right answer; it depends on volume, budget, and how much variability in user-perceived latency you can tolerate. A workstation-tier deployment is a beautiful answer for prototyping and low-volume real use. It is not a beautiful answer for "burst of eight uploads at the same moment, every day, forever."

What this means if you're building something similar

A few rules I'd internalise before the next time:

Burst-load test before you call it production. Trickle load hides every structural concurrency problem. Half an hour spent firing forty documents at a stack in quick succession will tell you in advance what today's incident told me in retrospect.
Single-worker servers need a CPU-aware watchdog, not just an HTTP one. The two-signal check is twenty lines of bash. It pays for itself the first time it doesn't kill a legit long call.
Vulkan and ROCm on the same AMD GPU is a sharp edge. Not a deal-breaker; today's stack is still useful. But know that the failure mode exists, isn't documented well, and won't show up in any benchmark.
The right point to graduate from workstation to dedicated server is when burst load happens often enough that the watchdog fires more than once a week. Until then, you're optimising for the wrong constraint.

The actual line between "workstation is enough" and "you need a server" is a function of how concurrent your real users are. For most teams that line is much further out than the vendor brochures imply. Today I learned where mine is.

Working notes from a hands-on day. Today's stack ships in production again as I write this - the watchdog is armed, the inputs are pixel-normalised, the gateway has its error guard. Tomorrow's first job will tell me whether any of this needs another pass.