We’re launching the newest addition to the FOSSonCloud lineup: Open WebUI on AWS by FOSSonCloud, version 1.1.0. This is the first FOSSonCloud pattern that isn’t a typical web application — it’s AI infrastructure. Subscribe, launch the CloudFormation template, and you get a private, ChatGPT-style web UI talking to a vLLM inference server running open-source models on your own GPU instance.
What you get
A turnkey GPU-accelerated LLM stack on AWS:
- Open WebUI — a polished ChatGPT-style web interface with chat history, multi-user accounts, model switching, and a built-in OpenAI-compatible API
- vLLM 0.20.0 — production-grade LLM inference engine with FP8 KV cache and a 32K default context window, sized for agentic clients whose tool-definition system prompts blow past 8-12K tokens
- CUDA 13.0.2 / cuDNN 9.13.0 / NVIDIA driver 580.95.05 — fully baked into the AMI, so first-boot is fast and reliable
- Persistent EBS data volume — chats, accounts, and configuration survive instance replacements
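To see why the FP8 KV cache matters at a 32K context window, here’s a back-of-the-envelope sizing sketch. It assumes Qwen3-8B’s published architecture (36 transformer layers, 8 KV heads under grouped-query attention, head dimension 128); vLLM’s actual allocation differs because it pre-reserves paged KV blocks, but the FP8-vs-FP16 ratio holds.

```python
# Rough KV-cache sizing for the default 32K context window.
# Assumes Qwen3-8B's published architecture: 36 layers, 8 KV heads
# (grouped-query attention), head dimension 128. With FP8 each cached
# value is 1 byte; FP16 would be 2.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT = 32 * 1024

def kv_cache_bytes(tokens: int, bytes_per_value: int) -> int:
    # 2x for the separate key and value tensors at each layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * tokens

fp8 = kv_cache_bytes(CONTEXT, 1)
fp16 = kv_cache_bytes(CONTEXT, 2)
print(f"FP8  KV cache @ 32K tokens: {fp8 / 2**30:.2f} GiB")   # 2.25 GiB
print(f"FP16 KV cache @ 32K tokens: {fp16 / 2**30:.2f} GiB")  # 4.50 GiB
```

Halving the cache is what leaves headroom for model weights plus a 32K window on a single 24 GB card.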
Models in the dropdown
The 1.1.0 release ships with a curated set of recently released open-weight models, all tested on g6e.xlarge:
- Qwen/Qwen3-8B (default) — Qwen3 generation with strong tool-calling support, fits a g6e.xlarge (24 GB VRAM)
- Qwen/Qwen3-Coder-30B-A3B-Instruct — Apache 2.0 MoE, 30B total / 3B active, for code-focused workloads
- openai/gpt-oss-20b — Apache 2.0 general-purpose 20B
- microsoft/phi-4 (14B) — general-purpose
- microsoft/Phi-4-mini-reasoning (3.8B) — reasoning-tuned
Want a model not in the dropdown? ModelOverride accepts any Hugging Face model identifier.
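Overriding the model is a stack-parameter change; a sketch of what that might look like (the stack name is illustrative, and a real update would carry UsePreviousValue entries for the stack’s other parameters):

```shell
# Hypothetical example: point the stack at a different Hugging Face model.
# Only ModelOverride comes from the pattern; the stack name is made up,
# and other existing parameters (omitted) would use UsePreviousValue=true.
aws cloudformation update-stack \
  --stack-name open-webui \
  --use-previous-template \
  --parameters ParameterKey=ModelOverride,ParameterValue=mistralai/Mistral-7B-Instruct-v0.3 \
  --capabilities CAPABILITY_IAM
```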
Tool-calling out of the box
A common pain point with self-hosted LLMs is getting OpenAI-compatible clients (opencode, aider, etc.) to actually invoke tools correctly. The 1.1.0 pattern picks the right vLLM tool-calling parser for each model automatically:
- Qwen / nvidia OpenReasoning Nemotron → hermes parser
- microsoft/Phi-4-mini-reasoning → phi4_mini_json parser
- openai/gpt-oss-* → gpt_oss parser
Custom overrides via CustomVllmConfigParameterArn still take precedence.
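Under the hood this maps to vLLM’s standard tool-calling flags. For the default Qwen3-8B, the effective serve arguments would look roughly like this (a sketch, not the pattern’s exact command line):

```shell
# Illustrative vLLM invocation for the default model; the pattern's actual
# flags may differ. --tool-call-parser hermes is the Qwen mapping above.
vllm serve Qwen/Qwen3-8B \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```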
Architecture
- Singleton ASG with a g6e.xlarge default (24 GB GPU memory) — scales up to g6e.16xlarge (384 GB GPU memory) for larger models
- ALB with HTTPS terminating against an ACM certificate you supply
- Route 53 DNS integration via parameter
- CloudWatch Logs wired up for vLLM, Open WebUI, nginx, and system logs
- Optional AlbIngressCidr — restrict access by IP at the load balancer
Customization
Two SSM Parameter ARN parameters let you override config without forking the pattern:
- CustomOpenWebuiConfigParameterArn — Open WebUI environment variables
- CustomVllmConfigParameterArn — vLLM CLI flags (model length, quantization, parser, etc.)
This is the same escape-hatch approach we use across our other patterns: customization without code changes.
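For example, to override the vLLM flags you might store the flag string in SSM and pass its ARN as CustomVllmConfigParameterArn at launch (the parameter name and value here are illustrative, not part of the pattern):

```shell
# Illustrative: store custom vLLM CLI flags in an SSM parameter, then pass
# the resulting ARN as CustomVllmConfigParameterArn when launching the stack.
aws ssm put-parameter \
  --name /open-webui/vllm-flags \
  --type String \
  --value "--max-model-len 16384 --tool-call-parser hermes"
```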
Fresh deployments
You’ll need a Route 53 hosted zone, an ACM certificate, and access to the g6e instance family in your target region. The template provisions everything else (VPC, ALB, ASG, IAM, CloudWatch, persistent data volume, Route 53 record).
What’s next
We’re tracking the upstream Open WebUI roadmap closely — this is a fast-moving project. Expect more model dropdown updates as new open-weight models land, plus eventual support for higher-throughput multi-GPU configurations once the surrounding tooling catches up.
If you hit anything in 1.1.0, ping us on GitHub.
— FOSSonCloud
