Building an AI news pipeline on a desktop GPU: cost, latency, and what surprised us

Building an AI news pipeline on a desktop GPU: cost, latency, and what surprised us

The news mosaic on altronis.sg is read by an AI pipeline running on a single desktop GPU. No cloud LLM bill. No rate limits. No vendor terms-of-service to re-read every quarter.

What follows is the architecture we settled on, the cost math we used to justify it, and the parts that surprised us after several months of operation. We see this shape of pipeline constantly in advisory work. Most teams reach for a cloud API on instinct, without doing the breakeven sum.

Why we built this in the first place

altronis.sg shows a rolling news feed of items relevant to the sectors we advise in: manufacturing, sport, energy, public-sector AI, and regional politics. The feed is not decorative; it feeds our sector advisor agents, our Lyra CRM context, and the weekly briefings we send to retainer clients.

The operational need was simple. Pull in roughly 90 to 100 items per week, classify them by sector and relevance, attach a sentiment tag and a one-line summary, surface the result. A junior analyst could do this in four hours a week. We wanted it to run hourly and never sleep.

We tried the obvious cloud-first approach for two weeks. It worked. But it produced a recurring cost line and a vendor-lock surface that did not feel right for a workload this predictable.

By the numbers

The operational need was simple: pull in roughly 90 to 100 items per week, classify them by sector and relevance, attach a sentiment tag and a one-line summary, then surface the result.

The architecture

The pipeline is intentionally boring. Boring is a feature when something has to run unattended for months.

A feedparser job pulls 16 RSS sources on an hourly cron. Each item gets a stable URL hash; if the hash already exists in Firestore we skip the rest of the work, so the dedupe surface is one indexed lookup per item rather than an embedding round-trip.

New items go to a local llama-server bound to port 8001, running Qwen3.6-35B-A3B in Unsloth's UD-Q8_K_XL dynamic quantisation. The model returns three structured fields: a sector label from a fixed taxonomy, a sentiment score, and a one-sentence editorial summary written in the house voice.

The result lands in Firestore with a 14-day rolling window for public display. Older items stay available to internal agents through sgdata-mcp, our open-source MCP server that exposes the same Firestore collection plus other Singapore data sources to LLM clients.

End-to-end, an item moves from RSS feed to live on the site in 6 to 9 seconds. The slow leg is almost always the RSS poll. Not the model.

A few details we have learned to care about. The URL hash uses canonical URL form, stripped of UTM and fbclid parameters, otherwise the same article appears three times under different campaign tags. The model prompt enforces a strict JSON schema and we reject any output that fails to parse, which costs us about 1.5% of items but keeps downstream code simple. The classifier is given the article title and the first 1,200 characters of body text only; full-article context did not move accuracy meaningfully and roughly tripled latency.

The hardware decision

The host machine is an AMD Ryzen AI Max+ 395, codename Strix Halo, with 128GB of unified LPDDR5X memory. The model lives entirely in unified RAM and runs through llama.cpp built against the Vulkan RADV backend.

We picked this hardware in late 2025 for one reason that turned into three. The headline reason was unified memory. 128GB shared between CPU and integrated GPU means we can hold a 35-billion-parameter MoE model comfortably, with room left over for a vision-language model on port 8080 and a ComfyUI instance for image work. All on the same box.

The second reason emerged once we started measuring: for MoE models like the Qwen3.5 and Qwen3.6 35B-A3B family, the Vulkan RADV path on Strix Halo is faster than the ROCm HIP path for both prompt processing and token generation. The community benchmarks at kyuz0/amd-strix-halo-toolboxes and the llama.cpp tracker on llm-tracker.info both show this clearly, and our own measurement on the production workload comes out roughly 5% in favour of Vulkan on token generation. ROCm still wins for the dense vision-language model on port 8080, which is why we run both backends side by side.

The third reason was sustained throughput. We see roughly 41 tokens per second of steady text generation on Qwen3.6-35B-A3B at UD-Q8_K_XL, which is plenty to keep up with hourly news ingestion and leaves a lot of headroom for the sector advisor agents that share the same endpoint.

A note on quantisation. We tested Q4_K_M, Q6_K, and UD-Q8_K_XL on the same prompt suite. The dynamic Q8 variant gave us the cleanest JSON adherence and the fewest hallucinated sector tags, at a memory cost we could absorb on a 128GB box. On a 64GB machine we would likely have settled for Q5 or Q6.

On configuration. We run --mmap for the Vulkan binary, --no-mmap for the ROCm binary, and flash attention on for both. These are not preferences. They are crash-avoidance settings on this hardware.

The cost math

Here is the breakeven sum we ran before committing. We will use round, conservative numbers and stay on the side that flatters the cloud option.

The workload is 91 items per week, 52 weeks a year, so 4,732 items annually. Each item sends roughly 1,500 input tokens and receives roughly 250 output tokens. That is 7.1 million input tokens and 1.18 million output tokens per year.

At gpt-4o-mini list pricing of $0.15 per million input tokens and $0.60 per million output tokens, the annual API spend works out to roughly $1.07 plus $0.71. Under $2 a year. Cloud-first looks unbeatable on this workload alone.

This is the trap. The news pipeline is not the only workload on the box. The same llama-server endpoint also serves our sector advisor agents (manufacturing and sport), the Lyra draft-meeting-artifact tool, batch reclassification when we change the taxonomy, the weekly knowledge-base refresh cron, and ad-hoc evaluation runs when we test new prompts. Real annual token volume across those workloads sits in the 800-million to 1.2-billion range, with output tokens running well above the news-only proportion because the advisor agents emit long-form drafts.

Re-running the sum at 1 billion mixed tokens per year against gpt-4o-mini puts us in the $250 to $400 annual range for inference alone, which is still small but no longer trivial. Then we add the workloads we know we will run but have not built yet: longer-context advisor briefings, evaluation harnesses, fine-tune-style preference distillations. By the time we look up two years out, the cloud line is plausibly four figures and growing with the business.

The hardware was a one-time capital cost which was already justified by other work we were doing on the same box, so for the news pipeline specifically the marginal inference cost is electricity. We measure roughly 90 to 110 watts of additional draw under sustained load, which at Singapore commercial rates is somewhere around SGD 0.30 a day for the relevant duty cycle.

Clean way to state the conclusion. At this volume, cloud-first is not expensive in absolute terms. It stops being free as soon as we grow into the rest of the stack, and the variance of that growth is what we did not want on the books.

What surprised us

Three things we did not predict.

Latency turned out to be a quiet win. We expected to pay a latency tax for staying local. Instead, end-to-end times held steady in the 6 to 9 second band even under burst. No network leg, no cold-start, no rate-limit backoff to handle. The variance of the local pipeline came in dramatically tighter than what we measured on cloud APIs during the two-week trial, which mattered more than the median for our scheduling.

The SWA cache-reuse limitation cost us a planned optimisation. Qwen3-family models with sliding-window attention plus hybrid recurrent components do not currently benefit from llama.cpp's cross-slot KV cache reuse — the upstream issue trail (ggml-org/llama.cpp #20225, #18497, #19794) shows this clearly, and we spent a weekend chasing what turned out to be a known limitation. In-slot, same-prompt caching still works, so we serialise like-shaped requests where we can. We are no longer pitching prefix-cache wins at the llama-server layer for these models until upstream fixes land.

Vulkan beat ROCm for the MoE workload, which inverts what most teams assume coming from CUDA-shaped intuitions. RADV on Strix Halo is roughly 5% ahead on token generation for Qwen3 35B-A3B in our measurements, with the kyuz0 benchmark grid reporting a similar pattern across the family. The setup is also simpler — fewer environment knobs, a more forgiving driver story. We still keep ROCm on the box for the dense vision-language model, but the default for new MoE work is Vulkan first, measure second.

A smaller surprise. Prompt engineering effort moved from latency-shaping to schema-discipline. With cloud APIs, the cycles go into trimming tokens to keep cost down. On a local box, the cycles go into tightening output schemas, because a parse failure is the only thing that wastes cycles we cannot get back.

When this approach breaks

We are not arguing this pattern fits every workload. Three shapes specifically do not fit.

Real-time chat with sub-second p99. A 35B MoE on Strix Halo gives us first-token latency in the high hundreds of milliseconds and full responses in the low seconds. That is fine for batch classification, not for interactive chat where users expect tokens streaming inside 300ms. For that, we would either move to a 7B-class local model or accept the cloud round-trip and pay for it.

Bursty workloads with very high peak concurrency. A single Strix Halo box serves one heavy request at a time well. Fan out to 50 concurrent users in a spike and we are either queueing or on cloud. Our pipeline is hourly and predictable, so this never bites.

Workloads where model quality at the frontier matters more than steady-state cost. If your task genuinely needs the strongest available reasoning model, run it in the cloud, period. We use cloud reasoning models inside Lyra for proposal drafting where the quality lift is visible to the customer; we use the local model for classification, summarisation, and extraction where Qwen3.6-35B-A3B is comfortably above the quality floor.

The honest framing is that local LLMs are an excellent backbone for the boring 80% of an AI workload, and a poor fit for the visible 20% where frontier capability is worth the spend.

The day-2 trap

The reason we wrote this up is the running cost story for SG SMEs. Reinventing.ai's April 2026 SMB AI Pricing brief reports that integration, ongoing optimisation and re-training routinely add 20 to 40 percent on top of an SMB's initial AI budget, and that consumption-based pricing surprises a majority of buyers in their first year. PwC Singapore's commentary on Budget 2026 says it more directly: SMEs explored AI tools, then "recurring operating costs such as the cost of tokens or licenses to use cloud and AI services soon outweighed perceived benefits", and many returned to manual processes.

Building the AI is the easy part. Sustaining it is the part that kills SME adoption in this market. The pricing of the build side has fallen sharply over the last 18 months; the pricing of the sustainment side has not, and a meaningful share of pilot abandonment in Singapore traces back to that single fact.

A local LLM does not solve this for every workload. It does change the shape of the cost curve from variable-with-usage to mostly-fixed-with-step-changes-when-you-upgrade-hardware, which is a much easier line item to defend in an SME budget conversation. For workloads with predictable throughput — classification, triage, content generation, internal search, document understanding — that shape is closer to what most owner-operated businesses actually want.

We are not arguing for local-everything. We are arguing for measuring the day-2 cost honestly before signing a 12-month commit.

Clone the pattern

Almost every component of this pipeline is either open-source or replicable in a weekend.

The llama-server stack is upstream llama.cpp built against Vulkan, with Qwen3.6-35B-A3B Unsloth GGUF weights from Hugging Face. The benchmark grid we used to short-list backends lives at kyuz0/amd-strix-halo-toolboxes; the wider Strix Halo performance tracker lives at llm-tracker.info. Both are excellent first stops.

The Firestore schema, the sector taxonomy, and the MCP tool surface for querying the news collection are all in sgdata-mcp, our open-source MCP server. If you are building on the same hardware family or you want a starting point for a Singapore-context AI workload, clone it and strip the parts you do not need.

If you would rather have us run the diagnostic on your specific workload — volume, latency profile, sensitivity, day-2 budget — we do that under our advisory engagements. The framework above is the same one we use internally before any cloud-vs-local recommendation goes to a client.

The engineering question is not "cloud or local". It is "which slice of this workload behaves like predictable backbone, and which slice behaves like spiky frontier work". Answer that honestly and the deployment shape draws itself.

Frequently asked

Why run a news pipeline on a single AI workstation instead of in the cloud?

Three reasons: cost (electricity beats metered API at scale), latency (no network round-trip for the LLM call), and privacy (the news-source URLs and your scoring criteria stay yours). For a 50-article-per-day pipeline, the break-even on a Strix Halo box is roughly 4 months.

What does the altronis.sg news pipeline actually do?

Crawls a curated list of SG/enterprise AI sources daily, deduplicates by URL and title, scores each article for SME-relevance with a small LLM, and surfaces the top 5 on the altronis.sg news feed. The whole pipeline runs in under 8 minutes on one Strix Halo box.

Is the news pipeline open-source?

The MCP-based crawler is open-source under sgdata-mcp's adjacent repo. The scoring prompts and ranking heuristics are Altronis-proprietary because they encode 18 months of editorial calibration.

Related reads

Last updated 3 May 2026.