Unlocking AI Potential: How DetoServe Empowers Your GPU Clusters with One Endpoint

nikhilprakash1911
Mar 23
6 min read

https://video.wixstatic.com/video/c64468_b8e9914abb0947269da0936f2e1d191d/1080p/mp4/file.mp4

*DetoServe is an open-source, production-ready platform that lets you define an AI model once and deploy it across any number of GPU clusters (cloud, on-prem, or hybrid) behind a single API endpoint. Built in Go. Powered by SkyPilot. Integrated with NVIDIA Dynamo. Apache 2.0.*

---

If you're running large language models in production, you've probably felt this pain:

You have GPUs in AWS. You have a few nodes on-prem. Maybe you grabbed some capacity on GCP during a crunch. Each cluster has its own setup, its own endpoints, its own deployment scripts. Your application code is littered with routing logic, retries, and fallback URLs. Scaling means SSH-ing into machines and editing YAML files.

We built **DetoServe** because we were tired of this.

---

## The Problem

Every organization running LLMs at scale eventually hits the same wall: **multi-cluster inference is a mess**.

You need to:

- Deploy the same model across different clusters in different regions
- Give your consumers **one URL** to hit, regardless of where the model is running
- Route requests intelligently: not round-robin, but to the cluster that already has the user's context cached
- Scale GPU resources up and down without a human in the loop
- Manage multiple teams, each with their own GPU quotas and API keys

There's no single tool that does all of this. So teams stitch together a Frankenstein of scripts, Kubernetes operators, custom proxies, and Slack alerts. It works, until it doesn't.

---

## What DetoServe Does

DetoServe is a **multi-cluster AI inference platform**. Here's the idea in one sentence:

> **Define your model once. Deploy it anywhere. One endpoint.**

### Functions: Your Model as a Unit of Deployment

In DetoServe, you create a **Function**.
A Function wraps everything about how your model should run:

- Which model to serve (e.g., `meta-llama/Llama-3-70B-Instruct`)
- Which inference runtime to use (vLLM, SGLang, TensorRT-LLM)
- How many GPUs it needs
- Min/max replicas
- Which clusters it should be deployed to

You define it once. DetoServe handles the rest: it calls [SkyPilot](https://github.com/skypilot-org/skypilot) to provision resources, deploys the model pods, and registers the endpoints in the global routing table.

Want to deploy the same model to 5 clusters across 3 cloud providers? One API call.

*For the technical details on how Functions work, see the [Function Manager source](https://github.com/your-org/detoserve/tree/main/control-plane/function-manager).*

### One API Gateway: OpenAI-Compatible

Your consumers (apps, agents, chatbots) hit **one URL**:

```
POST https://api.your-company.com/v1/chat/completions
```

That's it. They don't know which cluster serves their request. They don't need to.

DetoServe's API Gateway, a lightweight Go/Fiber service, receives the request, resolves which backends are available for the requested model, and proxies it to the best one.

It's fully **OpenAI-compatible**. If your app works with the OpenAI SDK, it works with DetoServe. Just change the base URL.

*Source: [API Gateway](https://github.com/your-org/detoserve/tree/main/control-plane/api-gateway)*

### Smart Routing: Not Just Round-Robin

This is where it gets interesting. A naive load balancer sends requests wherever there's capacity. DetoServe's **Smart Router** weighs more than capacity:

- **KV-cache awareness**: If Cluster A already has this user's conversation context cached in GPU memory, the request goes there. Reusing the cached context avoids recomputing prefill for it, which cuts time-to-first-token in half.
- **Session affinity**: Multi-turn conversations stick to the same backend.
- **Load-based balancing**: Factor in queue depth, GPU utilization, and active request count.
- **Latency scoring**: Prefer clusters with lower network latency to the consumer.

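
To make the routing signals above concrete, here is a minimal Go sketch of how they could be folded into one score per backend. It is not the Smart Router's actual code: the `Backend` type, the `score` and `pickBackend` functions, and every weight are hypothetical, and session affinity and KV-cache awareness are collapsed into a single flag for brevity.

```go
package main

import "fmt"

// Backend is a hypothetical snapshot of one cluster serving a model.
// Field names are illustrative, not DetoServe's real types.
type Backend struct {
	Name           string
	HasKVCache     bool    // this session's context is already cached here
	QueueDepth     int     // requests waiting on the cluster
	GPUUtilization float64 // 0.0 (idle) to 1.0 (saturated)
	LatencyMs      float64 // network latency to the consumer
}

// Illustrative weights (assumptions). A real router would tune these.
const (
	cacheBonus     = 100.0
	queuePenalty   = 5.0
	utilPenalty    = 50.0
	latencyPenalty = 0.5
)

// score rewards a cache hit and penalizes load and network latency.
func score(b Backend) float64 {
	s := 0.0
	if b.HasKVCache {
		s += cacheBonus
	}
	s -= queuePenalty * float64(b.QueueDepth)
	s -= utilPenalty * b.GPUUtilization
	s -= latencyPenalty * b.LatencyMs
	return s
}

// pickBackend returns the highest-scoring backend.
// The bool is false when no backend is available.
func pickBackend(backends []Backend) (Backend, bool) {
	if len(backends) == 0 {
		return Backend{}, false
	}
	best := backends[0]
	for _, b := range backends[1:] {
		if score(b) > score(best) {
			best = b
		}
	}
	return best, true
}

func main() {
	backends := []Backend{
		{Name: "aws-east", HasKVCache: true, QueueDepth: 4, GPUUtilization: 0.7, LatencyMs: 20},
		{Name: "gcp-eu", HasKVCache: false, QueueDepth: 0, GPUUtilization: 0.2, LatencyMs: 90},
	}
	if best, ok := pickBackend(backends); ok {
		fmt.Println(best.Name) // prints "aws-east"
	}
}
```

With these weights the cache hit outweighs moderate load, but a long enough queue on the cache-holding cluster pushes the request to another backend, which is the spill-over behavior you want from cache-aware routing.
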
The result: faster responses, better GPU utilization, and lower costs.

*Source: [Smart Router](https://github.com/your-org/detoserve/tree/main/control-plane/smart-router)*

### GPU Discovery: Automatic, Across Every Cluster

DetoServe deploys a lightweight **Cluster Agent** on each Kubernetes cluster. The agent:

1. Discovers all GPUs on the cluster using SkyPilot's Kubernetes API
2. Reports GPU type, count, availability, plus CPU/memory per node
3. Sends a heartbeat to the control plane every 10 seconds

No manual inventory. No spreadsheets. You add a cluster, deploy the agent via Helm, and DetoServe knows what hardware is available within seconds.

*Source: [Cluster Agent](https://github.com/your-org/detoserve/tree/main/cluster-agent)*

---

## NVIDIA Dynamo: The Engine Inside Each Cluster

One of the most exciting parts of DetoServe is the integration with [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo).

Here's how they fit together:

- **DetoServe** handles the **inter-cluster** layer: deploying models, routing between clusters, managing tenants
- **Dynamo** handles the **intra-cluster** layer: the actual inference runtime inside each cluster

What does Dynamo bring to the table?

**Disaggregated prefill and decode.** Instead of running both phases on the same GPU, Dynamo splits them into independent worker pools. Prefill (the compute-heavy part) and decode (the memory-bound part) scale independently. This alone can give you **7x higher throughput per GPU**.

**Multi-tier KV caching.** Dynamo's KV Block Manager moves cache between GPU memory → CPU memory → SSD → remote storage. Long conversations don't evict other users' caches; they spill to cheaper tiers.

**SLA-driven autoscaling.** Dynamo's Planner watches latency SLAs and scales workers up or down to meet them. Not based on CPU percentage, but on actual inference performance.

**Fast cold starts.** Dynamo's ModelExpress streams model weights to new pods, cutting cold start time by 7x.
New capacity comes online in seconds, not minutes.

DetoServe generates the Dynamo deployment manifests automatically. When you create a Function with `runtime: dynamo`, DetoServe produces the correct `DynamoGraphDeployment` CRDs, applies them to the target cluster, and Dynamo takes it from there.

*Full architecture doc: [Dynamo Integration Architecture](https://github.com/your-org/detoserve/tree/main/docs/dynamo-integration-architecture.md)*

---

## The Tech Stack

We made a deliberate choice: **everything in Go**.

The control plane (all 10 services) is written in Go using [GoFiber](https://gofiber.io/). Each service compiles to a single static binary, ships in a ~20 MB Docker image, and runs with minimal memory. No JVM warmup. No Python GIL. No Node.js event loop surprises.

| Component | What It Does |
|-----------|-------------|
| API Gateway | Single consumer endpoint, OpenAI-compatible |
| Smart Router | KV-cache-aware, session-affine routing |
| Function Manager | Define models, deploy instances |
| SkyPilot Bridge | Agent heartbeats, SkyPilot orchestration |
| Cluster Manager | Cluster lifecycle and metadata |
| Tenant Manager | Multi-tenant quotas and API keys |
| Deployment Manager | Orchestrates deployments via SkyPilot |
| Config Store | GitOps-compatible config persistence |
| Autoscaler | Metrics-driven scaling (queue depth, GPU util, latency) |
| Cluster Agent | Per-cluster GPU discovery + heartbeats |

The frontend is React (Vite) with a clean management UI and embedded Swagger API docs.

*Full project structure: [GitHub README](https://github.com/your-org/detoserve)*

---

## How It Works in Practice

Let's walk through a real scenario.

**You have three clusters:**

- 8x A100 on AWS (us-east-1)
- 4x H100 on your on-prem rack
- 4x A100 on GCP (europe-west1) for EU latency requirements

**Step 1: Register your clusters.** Deploy the DetoServe agent on each cluster via Helm. The agents auto-discover GPUs and start sending heartbeats.

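
To give a feel for what those heartbeats might carry, here is a small Go sketch. It is a guess at the shape, not the Cluster Agent's real wire format: the `Heartbeat` and `GPUInfo` types, the JSON field names, and the "three missed heartbeats" staleness rule are all assumptions. Only the 10-second interval comes from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeatIntervalSeconds matches the 10-second interval described above.
const heartbeatIntervalSeconds = 10

// GPUInfo and Heartbeat are hypothetical payload types for illustration.
type GPUInfo struct {
	Type      string `json:"type"`
	Total     int    `json:"total"`
	Available int    `json:"available"`
}

type Heartbeat struct {
	ClusterID  string    `json:"cluster_id"`
	GPUs       []GPUInfo `json:"gpus"`
	SentAtUnix int64     `json:"sent_at_unix"`
}

// buildHeartbeat serializes one heartbeat as JSON.
func buildHeartbeat(clusterID string, gpus []GPUInfo, nowUnix int64) ([]byte, error) {
	return json.Marshal(Heartbeat{ClusterID: clusterID, GPUs: gpus, SentAtUnix: nowUnix})
}

// isStale reports whether a cluster has been silent for longer than
// `missed` heartbeat intervals. The threshold is an assumption.
func isStale(lastUnix, nowUnix int64, missed int) bool {
	return nowUnix-lastUnix > int64(missed*heartbeatIntervalSeconds)
}

func main() {
	now := int64(1700000000)
	payload, err := buildHeartbeat("aws-east", []GPUInfo{{Type: "A100", Total: 8, Available: 8}}, now)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))

	// A cluster last heard from 45 seconds ago has missed more than three beats.
	fmt.Println(isStale(now-45, now, 3)) // prints "true"
}
```

A control plane could use a check like `isStale` to stop routing to a cluster whose agent has gone quiet, without waiting for requests to fail.
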
**Step 2: Create a Function.**

```json
{
  "name": "llama-3-70b",
  "model": "meta-llama/Llama-3-70B-Instruct",
  "runtime": "dynamo",
  "gpu_type": "A100",
  "gpu_count": 4,
  "min_replicas": 1,
  "max_replicas": 3,
  "target_clusters": ["aws-east", "onprem-rack", "gcp-eu"]
}
```

DetoServe deploys the model to all three clusters, each running NVIDIA Dynamo as the inference runtime.

**Step 3: Give your consumers one URL.**

```bash
curl https://api.your-company.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-70b", "messages": [{"role": "user", "content": "Hello"}]}'
```

The API Gateway routes this to the best cluster. If the user chats again, the Smart Router sends them back to the same cluster (session affinity + KV-cache hit). If that cluster is overloaded, it spills to another.

**Step 4: Scale.** Traffic spikes. The autoscaler sees queue depth rising and tells SkyPilot to launch new pods. Dynamo's ModelExpress streams the weights and the new capacity is serving requests in under 30 seconds. Traffic drops. Pods scale down. GPUs aren't wasted.

---

## Why Open Source

We believe AI infrastructure should be open. The companies building AI applications shouldn't be locked into a single vendor's inference platform. They should be able to:

- Run on **any cloud** or on-prem hardware
- Use **any inference runtime** (vLLM, SGLang, TensorRT-LLM, Dynamo)
- Inspect, modify, and extend **every component**
- Deploy with **no license fees**, ever

DetoServe is Apache 2.0 licensed. Use it, fork it, extend it, sell services on top of it. We don't care. We just want the infrastructure layer to be solved so teams can focus on building AI products.


---

## What's Next

We're actively working on:

- **Phase 2: Full NVIDIA Dynamo integration**: automated CRD generation, cross-cluster KV-cache routing, two-tier autoscaling
- **Phase 3: Advanced multi-tenancy**: per-tenant SLAs, priority queues, chargeback/billing
- **Phase 4: Observability**: distributed tracing, per-request latency dashboards, cost attribution

---

## Get Involved

DetoServe is early. The architecture is solid, the core services work, and we're running it in local development (k3d) and testing on real clusters. But we need help.

**Areas where contributors can make an immediate impact:**

- **Go backend**: Smart Router algorithms, autoscaler policies, Dynamo CRD generation
- **React frontend**: Dashboard improvements, real-time metrics visualization
- **Kubernetes**: Helm charts, operator patterns, KAI Scheduler integration
- **Documentation**: Tutorials, deployment guides, architecture explainers
- **Testing**: Integration tests, load testing, chaos engineering

If any of this interests you:

- **Star the repo:** [github.com/your-org/detoserve](https://github.com/your-org/detoserve)
- **Read the architecture docs:** [Dynamo Integration Architecture](https://github.com/your-org/detoserve/tree/main/docs/dynamo-integration-architecture.md)
- **Check the issues tab** for good first issues
- **Join the discussion**: open an issue or start a thread

We're building the inference platform we wish existed. Come build it with us.
