The curiosity started with a simple question: what would it actually take to run AI locally? Not API calls to someone else's data center, not a monthly subscription to a cloud service — a real machine, in the building, with a GPU doing inference on a local network. No data leaving the premises. No per-token cost. No dependency on a company's uptime or pricing decisions.
That question turned into a machine, then an API, then a live integration across multiple production websites. This is the full story of how that happened.
The Starting Point: Understanding What Was Possible
The first serious research into local AI centered on a specific hardware question: what GPU do you actually need? The answer that kept coming back was 24GB VRAM as the practical minimum for running models large enough to be genuinely useful. Below that threshold you're either running models too small to compete with the free-tier cloud offerings, or you're quantizing so aggressively that quality degrades below the point of usefulness.
The NVIDIA RTX 3090 Ti landed as the answer. 24GB GDDR6X, still one of the strongest consumer cards for inference workloads even as newer generations have shipped. Not cheap, but a one-time hardware cost versus ongoing API subscription costs at scale.
The machine that was built around it:
- GPU: NVIDIA RTX 3090 Ti — 24GB VRAM
- RAM: 64GB system memory
- Storage: 1TB NVMe (916GB usable, primary OS and model storage)
- OS: Ubuntu (the standard Linux AI stack target)
- Hostname: ai1
- LAN IP: 10.0.1.190
It sits on the LAN segment of the LocalAd network alongside vc1, the NAS, and the mail server — all behind the Frankenrouter, all off the public internet.
The Software Stack: Ollama
The runtime that made local AI practical was Ollama. Ollama handles model download, quantization, VRAM management, and exposes a local API on port 11434. The installation is a single command. The model library covers essentially every major open-weight model. For a self-hosted setup where you don't want to manage CUDA environments, model quantization by hand, or write inference code from scratch, Ollama is the right answer.
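As a rough illustration of what talking to that local API looks like, here is a minimal sketch using only the Python standard library. It assumes Ollama's /api/generate route on the default port with streaming turned off; the helper names (build_generate_payload, generate) are made up for this example and are not taken from the deployment.

```python
import json
import urllib.request

# Ollama's default local endpoint (port 11434, localhost only)
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """POST the prompt to Ollama and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With stream=False, Ollama replies with one JSON object whose
        # "response" field holds the full completion.
        return json.loads(response.read())["response"]
```

Calling generate() with the name of any model that has already been pulled, plus a prompt string, should return the completion as plain text, provided the Ollama service is running on the same machine.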
The first models pulled were exploratory: smaller 7B and 14B models to understand the performance characteristics and quality floor. The machine handled them easily. The real test was the 32B class.
The primary model that settled into production use: qwen2.5:32b — Alibaba's Qwen 2.5 series at the 32 billion parameter count. At 19GB it fits in the 3090 Ti's VRAM with room to spare, runs at practical inference speeds, and produces output quality that competes with API-based models for the use cases that matter: content generation, question answering, code assistance, summarization.
Other models pulled and tested:
- qwen2.5-coder:7b — lighter weight, faster, used for code completion tasks
- qwen2.5-coder:32b — full 32B coding-tuned variant
- llama3.3:latest — Meta's Llama 3.3 at 42GB, tested but too large for comfortable single-GPU inference at this VRAM level
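The fit-or-not-fit reasoning behind those choices is simple arithmetic, sketched below. The 3GB reserve for context cache and runtime overhead is an assumed round number for illustration, not a measured figure from ai1; the model sizes are the ones quoted above.

```python
def fits_in_vram(model_gb: float, vram_gb: float = 24.0, reserve_gb: float = 3.0) -> bool:
    """Return True if the model weights plus a working reserve fit on the card.

    reserve_gb is an assumed allowance for the context cache and runtime
    overhead; tune it to what the GPU actually reports under load.
    """
    return model_gb + reserve_gb <= vram_gb


# Sizes as quoted in the text: qwen2.5:32b at 19GB, llama3.3:latest at 42GB.
print(fits_in_vram(19.0))  # 19 + 3 = 22, within 24GB
print(fits_in_vram(42.0))  # 42 + 3 = 45, far beyond 24GB
```

When a model does not fit, the layers that spill over have to run from system RAM on the CPU, which is the usual reason a 42GB model feels uncomfortable on a 24GB card.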
The FastAPI Bridge
Ollama's local API on port 11434 is bound to 127.0.0.1 by default — localhost only, which is correct for security. But the other servers on the network (ser2 running petblip.com, volusiamarket.com, forumla.us) need to call it. That required a bridge layer.
A Python FastAPI application was written and deployed on ai1 at /mnt/ai-data/home/ai1/blip-api/server.py. The storage path starts with /mnt/ai-data/ because ai1's primary drive is mounted there rather than at the standard home directory — a quirk of the machine's setup.
What the FastAPI server does:
- Listens on port 8000 (later moved to 8001), bound to all interfaces so other LAN machines can reach it
- Validates every incoming request against a Bearer token (e1fb58a27a8ca4019889182321b414b9f86e077bc41e562ea1e1d611239a5d40), rejecting anything that doesn't match
- Accepts a JSON body with the message, optional conversation history, and an optional system prompt override
- Builds a full structured prompt including the system persona, conversation history, and the new message
- POSTs to Ollama's local API with the model name and prompt
- Returns the response as JSON: {"reply": "..."}
- Logs every interaction to a JSONL file with timestamp, input, and output
The system prompt for the PetBlip persona locked it into pet-only responses: the model answers questions about pet care, health, food, and products, and redirects anything off-topic back to pets. That constraint is enforced at the prompt level — the model doesn't get to decide what it's willing to talk about.
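The core pieces of that bridge can be sketched with the standard library alone. This is not the contents of server.py; the function names, the shape of the history entries (dicts with "role" and "content"), and the prompt layout are assumptions made for illustration. In a FastAPI app, helpers like these would be called from the route handler.

```python
import hmac
import json
from datetime import datetime, timezone


def is_authorized(authorization_header: str, expected_token: str) -> bool:
    """Return True only for an exact 'Bearer <token>' match."""
    scheme, _, token = authorization_header.partition(" ")
    # compare_digest avoids leaking the token through timing differences
    return scheme == "Bearer" and hmac.compare_digest(
        token.encode("utf-8"), expected_token.encode("utf-8")
    )


def build_prompt(system_prompt: str, history: list, message: str) -> str:
    """Flatten persona, prior turns, and the new message into one prompt."""
    lines = [system_prompt, ""]
    for turn in history:
        lines.append(f"{turn['role'].capitalize()}: {turn['content']}")
    lines.append(f"User: {message}")
    lines.append("Assistant:")
    return "\n".join(lines)


def log_interaction(log_path: str, user_input: str, output: str) -> None:
    """Append one JSON object per line (JSONL) with a UTC timestamp."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "output": output,
    }
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Putting the persona first in every prompt is what keeps the constraint at the prompt level: the caller can add history and a message, but the system text always leads.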
The call chain across the network:
Browser / tablet UI
↓
ser2 (petblip.com backend)
↓ REST call, Bearer token, LAN only
ai1 (10.0.1.190:8001) — FastAPI
↓
Ollama (localhost:11434)
↓
qwen2.5:32b — GPU inference
↓
Response back to ser2
↓
WebSocket push to wall display
The AI machine is never directly exposed to public traffic. Every external request goes through ser2, which handles authentication, session management, prompt formatting, and database logging before calling ai1 on the internal network.
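For illustration, the LAN hop from a calling server to ai1 might look like the sketch below. ser2's real backend is PHP, so this Python version only shows the shape of the request. The /chat route and the "message" and "history" field names are assumptions; the {"reply": ...} response shape comes from the bridge description.

```python
import json
import urllib.request

# LAN address and port of the FastAPI bridge; the /chat route is assumed
AI1_CHAT_URL = "http://10.0.1.190:8001/chat"


def build_chat_request(token: str, message: str, history=None) -> urllib.request.Request:
    """Build an authenticated POST carrying the message and prior turns."""
    payload = {"message": message, "history": history or []}
    return urllib.request.Request(
        AI1_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask_ai1(token: str, message: str, history=None, timeout: float = 60.0) -> str:
    """Send the request over the LAN and return the model's reply text."""
    request = build_chat_request(token, message, history)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["reply"]
```

Reading the token from an environment variable or a config file, rather than hard-coding it in the caller, keeps it out of source control.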
First Production Integration: The PetBlip Wall
The first live integration was the PetBlip interactive wall display — a system designed for Rachel's Pet Supply where customers can interact with an AI pet advisor on a large display screen, with the conversation appearing on wall monitors.
The architecture:
- A customer-facing tablet runs the Blip terminal UI in a browser
- Questions go to ser2's backend (ask.php → blip_ask.php → blip_process.php)
- ser2 formats the prompt with the Blip persona and POSTs to ai1
- ai1 runs inference on qwen2.5:32b
- ser2 receives the response, writes it to the wall_sessions table, and inserts into the stream
- A Node.js WebSocket server on ser2 (port 3000) pushes the response to the wall display clients in real time
- The wall monitors show the AI response as it arrives
Getting this pipeline fully working was not straightforward. Debugging surfaced several real problems: the /api/answer.php endpoint, which reads completed answers back from the sessions table for the polling browser, was missing; the polling function had a JavaScript variable scope bug; curl calls were pointed at the external HTTPS domain instead of localhost; and the WebSocket server needed an /ask_async endpoint that wasn't initially present. Each problem required tracing the full call chain to find where the data was stopping.
The lesson embedded in that debugging process: in a multi-hop pipeline this long (browser → PHP → PHP → CLI → ai1 → database → Node → WebSocket → browser), you have to be able to confirm each hop independently before assuming the problem is at the end. The AI was working the whole time. The last mile from the database back to the browser was what was broken.
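One way to make "confirm each hop independently" concrete is a small checker that walks the hops in order and names the first one that fails. The sketch below is hypothetical: the hop names and the check functions are placeholders for whatever real probes (an HTTP call, a database query, a WebSocket ping) apply to each stage.

```python
from typing import Callable, List, Optional, Tuple

# A hop is a label plus a zero-argument check that returns True when healthy
Hop = Tuple[str, Callable[[], bool]]


def first_broken_hop(hops: List[Hop]) -> Optional[str]:
    """Run checks in pipeline order; return the first failing hop's name."""
    for name, check in hops:
        try:
            healthy = check()
        except Exception:
            # A check that raises counts as a failed hop
            healthy = False
        if not healthy:
            return name
    return None  # every hop passed


# Placeholder checks standing in for real probes of each stage
pipeline = [
    ("ser2 -> ai1", lambda: True),
    ("ai1 -> database row", lambda: True),
    ("database -> browser poll", lambda: False),
]
print(first_broken_hop(pipeline))
```

Because the checks run in order, a failure late in the list also confirms that everything upstream of it is working, which is exactly the information that was missing during the original debugging.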
External Access: ai.localad.pro
The LAN-only restriction on ai1 was appropriate for petblip.com and the other sites on ser2, which can reach ai1 directly. But some use cases required external access — the VolusiaMarket Business Advisor, ForumLA's AI writing assistant, and future integrations where the calling server isn't on the same LAN.
The solution used the existing Frankenrouter HAProxy infrastructure. A domain ai.localad.pro was added to ser1's HestiaCP, with a Let's Encrypt certificate issued normally. Rather than fighting HestiaCP's managed nginx configuration (which gets overwritten on domain rebuild), a custom nginx include file was created at:
/home/la1/conf/web/ai.localad.pro/nginx.ssl.conf_ai_proxy
This file contains a regex location block that intercepts API routes (/health, /models, /chat, /generate, /docs) and proxies them through to 10.0.1.190:8001 on the LAN. Everything else falls through to the normal Apache handler. HestiaCP respects custom include files in that path and doesn't overwrite them on rebuild.
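The routing decision that location block makes can be modeled in a few lines. The exact regex in the nginx include is not reproduced here, so the pattern below is a guess at its shape, written in Python's re syntax purely to show which paths would be proxied and which would fall through.

```python
import re

# Assumed pattern: the five API routes, optionally followed by a sub-path
API_ROUTE = re.compile(r"^/(health|models|chat|generate|docs)(/|$)")


def routes_to_ai1(path: str) -> bool:
    """Return True if the path would be proxied to the FastAPI bridge."""
    return API_ROUTE.match(path) is not None


for path in ["/health", "/chat", "/docs/oauth2-redirect", "/index.html", "/healthcheck"]:
    target = "ai1 (10.0.1.190:8001)" if routes_to_ai1(path) else "Apache"
    print(f"{path} -> {target}")
```

Anchoring the pattern with (/|$) is a deliberate choice in this sketch: it stops a lookalike path such as /healthcheck from being proxied by accident.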
The routing path for external requests:
External client
↓ HTTPS
HAProxy (Frankenrouter, port 443)
↓ TCP SNI passthrough
ser1 nginx (SSL termination, Let's Encrypt)
↓ HTTP proxy, LAN only
ai1 FastAPI (10.0.1.190:8001)
↓
Ollama → qwen2.5:32b
The FastAPI service was also moved from the original startup command to a systemd unit (ai-api.service) so it survives reboots without manual intervention — one of the standing maintenance issues from the early deployment where both the Node.js server on ser2 and the FastAPI server on ai1 required manual restart after any reboot.
Content Generation: The Forge
A secondary use case that came online early was bulk content generation for the LocalAd network sites. A tool called the Forge (localad.us/forge.php) runs as a simple PHP interface that sends a topic and site-specific context to ai1 and generates a full blog post draft — headline, body, structured sections — tuned to the voice of whichever site is selected from a dropdown: ForumLA, PetBlip, LocalAd, VolusiaMarket, Rachel's Pet Supply.
The Forge runs entirely on the local AI stack. No Anthropic API call, no external cost. The 3090 Ti generates a 500-900 word article draft in roughly 30-60 seconds at 32B parameter quality. The output goes into the textarea for review and editing before publishing. It's a workflow assist, not autopublish — the human reads it, fixes it, adds specifics, then posts it.
A version of the same pattern was built into ForumLA's new-thread page as the AI Writing Assistant sidebar — same architecture, different interface, generating forum thread drafts rather than blog posts.
Site Integrations: Where ai1 Is Currently Used
As of the current deployment, ai1 is integrated into or available to these properties:
- PetBlip — Ask Blip AI pipeline (tablet → wall display), pet care Q&A, persona-locked to pet topics
- ForumLA — AI Writing Assistant in new-thread creation, generates thread drafts with selectable tone and length; ai1 as primary, Claude API as fallback
- VolusiaMarket — Business Advisor, AI-powered small business guidance for Volusia County trades and service businesses; Anthropic API primary with ai1 fallback (timeout issues with ai1 on longer prompts led to this ordering)
- LocalAd Forge — bulk content generation tool for all network sites
- ai.localad.pro — external HTTPS API endpoint, currently used for integrations that call from outside the LAN
The API token used across all integrations: c238f59f2a00a4f2c31766aa8cea1472d5d6f4a6d8e2c3415876444e69305826 (set in /home/ai1/ai-api/.env).
The Reliability Problem
The most persistent operational issue with ai1 is service persistence. Historically, both the FastAPI server and (on ser2) the Node.js WebSocket server required manual startup after a reboot. The FastAPI service has a systemd unit now, but it has not been confirmed to survive every reboot cleanly. The Node server on ser2 is still started manually.
This is an outstanding reliability gap. A server that handles production AI requests for a live customer-facing display at Rachel's Pet Supply should not require an SSH session to bring back up after a power event. The systemd unit approach is the right solution — it just needs to be verified solid across actual reboots rather than assumed to be working.
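Verifying that across real reboots can be scripted from another LAN machine. The sketch below polls a health URL until the service answers or the attempts run out. The retry counts are arbitrary, and using the /health route as the probe target is an assumption based on the routes the bridge exposes.

```python
import time
import urllib.request
from typing import Callable


def http_health_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status == 200


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry the probe until it succeeds or attempts are exhausted."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out: service is not up yet
            pass
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

After rebooting ai1, running wait_until_healthy(lambda: http_health_probe("http://10.0.1.190:8001/health")) from ser2 gives a yes-or-no answer without anyone opening an SSH session.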
A related issue: inference latency on longer prompts at 32B can push past PHP's default request timeout, particularly when ser2 is calling ai1 for a complex Business Advisor response. That's why the VolusiaMarket Business Advisor runs Anthropic's Claude API as primary and ai1 as fallback rather than the other way around — the local model is capable, but the response time on a cold prompt can exceed what a web user will tolerate waiting for.
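The primary-then-fallback ordering reduces to a small pattern, sketched here with placeholder backends. The real integrations are PHP and the function names are invented for this example; the point is only that the order of the two arguments decides which backend a user normally waits on.

```python
from typing import Callable, Tuple

Backend = Callable[[str], str]


def answer_with_fallback(prompt: str, primary: Backend, fallback: Backend) -> Tuple[str, str]:
    """Try the primary backend; on a timeout or network error use the fallback.

    Returns the reply text and which backend produced it.
    """
    try:
        return primary(prompt), "primary"
    except OSError:
        # TimeoutError and connection failures are both OSError subclasses
        return fallback(prompt), "fallback"


# Placeholder backends: one that always times out, one that always answers
def slow_backend(prompt: str) -> str:
    raise TimeoutError("inference exceeded the request timeout")


def fast_backend(prompt: str) -> str:
    return f"answer to: {prompt}"


print(answer_with_fallback("How do I price a roofing job?", slow_backend, fast_backend))
```

Swapping the two arguments is all it takes to move a property from cloud-first to ai1-first once the timeout problem is solved.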
What's Planned
The longer-term goal for ai1 is to reduce the dependency on the Anthropic API across the network. VolusiaMarket's Business Advisor is currently the main property still running cloud AI as primary. The plan is to migrate that to ai1-first by July 2026, with Anthropic as fallback only. That requires solving the timeout issue — either through streaming responses (returning tokens as they're generated rather than waiting for the full response), or through prompt optimization that reduces the token count on the initial response generation.
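The streaming option is worth a concrete look. With "stream" set to true, Ollama's generate endpoint sends newline-delimited JSON objects, each carrying a fragment in "response" and a final object with "done" set to true. The sketch below parses such a stream; the function name is invented for this example, and how the fragments would be relayed on to the browser is left open.

```python
import json
from typing import Iterable, Iterator


def iter_stream_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # final object: generation is complete


# Sample lines in the shape Ollama streams them
sample = [
    b'{"response": "Hel", "done": false}\n',
    b'{"response": "lo", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print("".join(iter_stream_tokens(sample)))
```

An HTTP response object from urllib.request.urlopen can be iterated line by line, so it could be passed to this function directly. The first fragment arrives as soon as generation starts, so the caller has something to show well before the full reply is finished.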
The AceMagic mini PC worker nodes being planned for vc1's video transcoding queue represent a parallel capability expansion — dedicated hardware for a specific heavy workload, keeping the primary servers free. A similar approach could eventually apply to AI: if inference demand grows beyond what the 3090 Ti can handle at acceptable latency, additional worker nodes running smaller models for specific tasks (summarization, classification, short Q&A) could offload work from the main 32B model.
The core principle behind all of it stays the same as when the machine was first built: AI capability that's owned outright, runs on-premises, costs nothing per inference, and doesn't depend on anyone else's infrastructure being up.