We’re launching the newest addition to the FOSSonCloud lineup: Open WebUI on AWS by FOSSonCloud, version 1.1.0. This is the first FOSSonCloud pattern that isn’t a typical web application — it’s AI infrastructure. Subscribe, launch the CloudFormation template, and you get a private, ChatGPT-style web UI talking to a vLLM inference server running open-source models on your own GPU instance.
What you get
A turnkey GPU-accelerated LLM stack on AWS:
- Open WebUI — a polished ChatGPT-style web interface with chat history, multi-user accounts, model switching, and a built-in OpenAI-compatible API
- vLLM 0.20.0 — production-grade LLM inference engine with FP8 KV cache and a 32K default context window, sized for agentic clients whose tool-definition system prompts blow past 8-12K tokens
- CUDA 13.0.2 / cuDNN 9.13.0 / NVIDIA driver 580.95.05 — fully baked into the AMI, so first-boot is fast and reliable
- Persistent EBS data volume — chats, accounts, and configuration survive instance replacements
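To see why the FP8 KV cache matters at a 32K context window, here’s a back-of-the-envelope sizing sketch. It assumes Qwen3-8B’s published architecture (36 transformer layers, 8 KV heads under grouped-query attention, head dimension 128); vLLM’s actual allocation differs because it pre-reserves paged KV blocks, but the FP8-vs-FP16 ratio holds.

```python
# Rough KV-cache sizing for the default 32K context window.
# Assumes Qwen3-8B's published architecture: 36 layers, 8 KV heads
# (grouped-query attention), head dimension 128. With FP8 each cached
# value is 1 byte; FP16 would be 2.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT = 32 * 1024

def kv_cache_bytes(tokens: int, bytes_per_value: int) -> int:
    # 2x for the separate key and value tensors at each layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * tokens

fp8 = kv_cache_bytes(CONTEXT, 1)
fp16 = kv_cache_bytes(CONTEXT, 2)
print(f"FP8  KV cache @ 32K tokens: {fp8 / 2**30:.2f} GiB")   # 2.25 GiB
print(f"FP16 KV cache @ 32K tokens: {fp16 / 2**30:.2f} GiB")  # 4.50 GiB
```

Halving the cache is what leaves headroom for model weights plus a 32K window on a single 24 GB card.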
Models in the dropdown
The 1.1.0 release ships with a curated set of recently released open-weight models, all tested on g6e.xlarge:
- Qwen/Qwen3-8B (default) — Qwen3 generation with strong tool-calling support, fits a g6e.xlarge (24 GB VRAM)
- Qwen/Qwen3-Coder-30B-A3B-Instruct — Apache 2.0 MoE, 30B total / 3B active, for code-focused workloads
- openai/gpt-oss-20b — Apache 2.0 general-purpose 20B
- microsoft/phi-4 (14B) — general-purpose
- microsoft/Phi-4-mini-reasoning (3.8B) — reasoning-tuned
Want a model not in the dropdown? ModelOverride accepts any Hugging Face model identifier.
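Overriding the model is a stack-parameter change; a sketch of what that might look like (the stack name is illustrative, and a real update would carry UsePreviousValue entries for the stack’s other parameters):

```shell
# Hypothetical example: point the stack at a different Hugging Face model.
# Only ModelOverride comes from the pattern; the stack name is made up,
# and other existing parameters (omitted) would use UsePreviousValue=true.
aws cloudformation update-stack \
  --stack-name open-webui \
  --use-previous-template \
  --parameters ParameterKey=ModelOverride,ParameterValue=mistralai/Mistral-7B-Instruct-v0.3 \
  --capabilities CAPABILITY_IAM
```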
Tool-calling out of the box
A common pain point with self-hosted LLMs is getting OpenAI-compatible clients (opencode, aider, etc.) to actually invoke tools correctly. The 1.1.0 pattern picks the right vLLM tool-calling parser for each model automatically:
- Qwen / nvidia OpenReasoning Nemotron → hermes parser
- microsoft/Phi-4-mini-reasoning → phi4_mini_json parser
- openai/gpt-oss-* → gpt_oss parser
Custom overrides via CustomVllmConfigParameterArn still take precedence.
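Under the hood this maps to vLLM’s standard tool-calling flags. For the default Qwen3-8B, the effective serve arguments would look roughly like this (a sketch, not the pattern’s exact command line):

```shell
# Illustrative vLLM invocation for the default model; the pattern's actual
# flags may differ. --tool-call-parser hermes is the Qwen mapping above.
vllm serve Qwen/Qwen3-8B \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```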
Architecture
- Singleton ASG with a g6e.xlarge default (24 GB GPU memory) — scales up to g6e.16xlarge (384 GB GPU memory) for larger models
- ALB with HTTPS terminating against an ACM certificate you supply
- Route 53 DNS integration via parameter
- CloudWatch Logs wired up for vLLM, Open WebUI, nginx, and system logs
- Optional AlbIngressCidr — restrict access by IP at the load balancer
Customization
Two SSM Parameter ARN parameters let you override config without forking the pattern:
- CustomOpenWebuiConfigParameterArn — Open WebUI environment variables
- CustomVllmConfigParameterArn — vLLM CLI flags (model length, quantization, parser, etc.)
This is the same escape-hatch approach we use across our other patterns: customization without code changes.
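For example, to override the vLLM flags you might store the flag string in SSM and pass its ARN as CustomVllmConfigParameterArn at launch (the parameter name and value here are illustrative, not part of the pattern):

```shell
# Illustrative: store custom vLLM CLI flags in an SSM parameter, then pass
# the resulting ARN as CustomVllmConfigParameterArn when launching the stack.
aws ssm put-parameter \
  --name /open-webui/vllm-flags \
  --type String \
  --value "--max-model-len 16384 --tool-call-parser hermes"
```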
Fresh deployments
You’ll need a Route 53 hosted zone, an ACM certificate, and access to the g6e instance family in your target region. The template provisions everything else (VPC, ALB, ASG, IAM, CloudWatch, persistent data volume, Route 53 record).
What’s next
We’re tracking the upstream Open WebUI roadmap closely — this is a fast-moving project. Expect more model dropdown updates as new open-weight models land, plus eventual support for higher-throughput multi-GPU configurations once the surrounding tooling catches up.
If you hit anything in 1.1.0, ping us on GitHub.
— FOSSonCloud
