<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Examples on LWS</title><link>/docs/examples/</link><description>Recent content in Examples on LWS</description><generator>Hugo</generator><language>en</language><atom:link href="/docs/examples/index.xml" rel="self" type="application/rss+xml"/><item><title>vLLM</title><link>/docs/examples/vllm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/vllm/</guid><description>&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed inference service with vLLM on GPUs.
&lt;a href="https://docs.vllm.ai/en/latest/index.html"&gt;vLLM&lt;/a&gt; supports distributed tensor-parallel inference and serving. Currently, it supports Megatron-LM’s tensor parallel algorithm. It manages the distributed runtime with &lt;a href="https://docs.ray.io/en/latest/index.html"&gt;Ray&lt;/a&gt;. See the doc &lt;a href="https://docs.vllm.ai/en/latest/serving/distributed_serving.html"&gt;vLLM Distributed Inference and Serving&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h2 id="deploy-leaderworkerset-of-vllm"&gt;Deploy LeaderWorkerSet of vLLM&lt;/h2&gt;
&lt;p&gt;We use LeaderWorkerSet to deploy two vLLM model replicas. The deployment comes in two flavors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPU: Each vLLM replica has 2 pods (&lt;code&gt;pipeline_parallel_size=2&lt;/code&gt;) and 8 GPUs per pod (&lt;code&gt;tensor_parallel_size=8&lt;/code&gt;); a sample manifest follows this list.&lt;/li&gt;
&lt;li&gt;TPU: The example assumes that you have a GKE cluster with two TPU v5e-16 slices. You can learn how to create a cluster with multiple TPU slices &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/tpus"&gt;here&lt;/a&gt;. Each TPU slice has 4 hosts, and each host has 4 TPU chips. The vLLM server is deployed across the two slices with &lt;code&gt;pipeline_parallel_size=2&lt;/code&gt; (one slice per pipeline stage) and &lt;code&gt;tensor_parallel_size=16&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
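&lt;p&gt;For reference, a minimal LeaderWorkerSet sketch for the GPU flavor might look like the following; the image is a placeholder, and only the sizing fields mirror the numbers above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                # two vLLM model replicas
  leaderWorkerTemplate:
    size: 2                  # pods per replica, matching pipeline_parallel_size
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: registry.example.com/vllm:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "8"   # tensor_parallel_size GPUs per pod
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: registry.example.com/vllm:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "8"
&lt;/code&gt;&lt;/pre&gt;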
&lt;p&gt;In both examples, Ray uses the leader pod as the head node and the worker pods as the worker nodes. The leader pod runs the vLLM server, with a ClusterIP Service exposing the port.&lt;/p&gt;</description></item><item><title>TensorRT-LLM</title><link>/docs/examples/tensorrt-llm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/tensorrt-llm/</guid><description>&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed inference service with Triton TensorRT-LLM on GPUs.
&lt;a href="https://nvidia.github.io/TensorRT-LLM/"&gt;TensorRT-LLM&lt;/a&gt; supports multinode serving using tensor and pipeline parallelism. It manages the distributed runtime with &lt;a href="https://www.open-mpi.org/"&gt;MPI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="build-the-triton-tensorrt-llm-image"&gt;Build the Triton TensorRT-LLM image&lt;/h2&gt;
&lt;p&gt;We provide a &lt;a href="https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/tensorrt-llm/build/Dockerfile"&gt;Dockerfile&lt;/a&gt; to build the image. The Dockerfile contains an installation script that downloads any Llama model from Hugging Face and prepares it for use with TensorRT-LLM. It also includes a Python script that initializes MPI and starts the server.&lt;/p&gt;</description></item><item><title>llama.cpp</title><link>/docs/examples/llamacpp/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/llamacpp/</guid><description>&lt;h2 id="deploy-distributed-inference-service-with-llamacpp"&gt;Deploy Distributed Inference Service with llama.cpp&lt;/h2&gt;
&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed
inference service using &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;.
llama.cpp began as a project to support CPU-only inference on a single node, but has
since expanded to support accelerators and distributed inference.&lt;/p&gt;
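&lt;p&gt;llama.cpp’s distributed mode builds on its RPC backend: each worker exposes its resources through &lt;code&gt;rpc-server&lt;/code&gt;, and the leader’s &lt;code&gt;llama-server&lt;/code&gt; is pointed at them. A rough sketch of the container commands, where the model path, hostnames, port, and exact flags are assumptions to check against the llama.cpp docs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# worker pods: expose the host to the leader over RPC
command: ["rpc-server", "--host", "0.0.0.0", "--port", "50052"]
# leader pod: load the model and offload layers to the workers
command: ["llama-server", "-m", "/models/model.gguf",
          "--rpc", "worker-0:50052,worker-1:50052"]
&lt;/code&gt;&lt;/pre&gt;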
&lt;h3 id="deploy-leaderworkerset-of-llamacpp"&gt;Deploy LeaderWorkerSet of llama.cpp&lt;/h3&gt;
&lt;p&gt;We use LeaderWorkerSet to deploy a llama.cpp leader and two llama.cpp workers.
The leader pod loads the model and distributes layers to the workers; the workers
perform the majority of the computation.&lt;/p&gt;</description></item><item><title>SGLang</title><link>/docs/examples/sglang/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/sglang/</guid><description>&lt;h2 id="deploy-distributed-inference-service-with-sglang-and-lws-on-gpus"&gt;Deploy Distributed Inference Service with SGLang and LWS on GPUs&lt;/h2&gt;
&lt;p&gt;In this example, we demonstrate how to deploy a distributed inference service using LeaderWorkerSet (LWS) with &lt;a href="https://docs.sglang.ai/"&gt;SGLang&lt;/a&gt; on GPU clusters.&lt;/p&gt;
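&lt;p&gt;In LWS terms, the leader pod runs node rank 0 and each worker pod runs one of the remaining ranks, all pointed at the leader’s address. A sketch of the leader’s container command, where the parallelism sizes, port, and the use of the LWS-provided leader address are illustrative assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# leader container (node rank 0); workers differ only in --node-rank
command:
- python3
- -m
- sglang.launch_server
- --model-path=meta-llama/Meta-Llama-3.1-8B-Instruct
- --tp=2               # tensor-parallel size across the whole group
- --nnodes=2
- --node-rank=0
- --dist-init-addr=$(LWS_LEADER_ADDRESS):20000   # env var assumed to be injected by LWS
- --port=8080
&lt;/code&gt;&lt;/pre&gt;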
&lt;p&gt;SGLang provides native support for distributed tensor-parallel inference and serving, enabling efficient deployment of large language models (LLMs) such as DeepSeek-R1 671B and Llama-3.1-405B across multiple nodes. This example uses the &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/a&gt; model to demonstrate multi-node serving capabilities. For implementation details on distributed execution, see the SGLang docs &lt;a href="https://docs.sglang.ai/references/multi_node_deployment/multi_node.html"&gt;Run Multi-Node Inference&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Horizontal Pod Autoscaler (HPA)</title><link>/docs/examples/hpa/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/hpa/</guid><description>&lt;p&gt;LeaderWorkerSet supports Horizontal Pod Autoscaler (HPA) through its scale subresource. This allows you to automatically scale the number of replica groups based on resource utilization metrics like CPU or memory.&lt;/p&gt;
&lt;h2 id="how-hpa-works-with-leaderworkerset"&gt;How HPA Works with LeaderWorkerSet&lt;/h2&gt;
&lt;p&gt;When using HPA with LeaderWorkerSet (see the example manifest after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;HPA monitors &lt;strong&gt;leader pods&lt;/strong&gt; only (not worker pods)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;hpaPodSelector&lt;/code&gt; in LeaderWorkerSet status helps HPA identify which pods to monitor&lt;/li&gt;
&lt;li&gt;Scaling affects the number of &lt;strong&gt;replica groups&lt;/strong&gt;, not individual pods within a group&lt;/li&gt;
&lt;li&gt;Each replica group (leader + workers) is scaled as a unit&lt;/li&gt;
&lt;/ul&gt;
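&lt;p&gt;For reference, a minimal HPA manifest targeting a LeaderWorkerSet might look like the following, where the name and thresholds are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-lws-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: my-lws            # illustrative LeaderWorkerSet name
  minReplicas: 1
  maxReplicas: 5            # counts replica groups, not pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # evaluated against leader pods only
&lt;/code&gt;&lt;/pre&gt;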
&lt;h2 id="prerequisites"&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before setting up HPA, ensure you have:&lt;/p&gt;</description></item><item><title>Topology Aware Scheduling with Kueue</title><link>/docs/examples/tas/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/tas/</guid><description>&lt;p&gt;AI Inference workloads require constant Pod-to-Pod communication. This makes the network bandwidth an important requirement of
running workloads efficiently. The bandwidth between the Pods depends on the placement of the Nodes in the data center. Topology Aware Scheduling (TAS) aims to place the Pods as close together as possible to maximize the network bandwidth. To learn more about TAS, visit the &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/"&gt;Kueue website&lt;/a&gt;.&lt;/p&gt;
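&lt;p&gt;With Kueue installed, opting a LeaderWorkerSet into TAS comes down to pointing it at a local queue and annotating its pod templates with a topology preference. A sketch, where the queue name and topology level are assumptions for a GKE cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # assumed LocalQueue name
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        annotations:
          kueue.x-k8s.io/podset-preferred-topology: cloud.google.com/gce-topology-block
    workerTemplate:
      metadata:
        annotations:
          kueue.x-k8s.io/podset-preferred-topology: cloud.google.com/gce-topology-block
&lt;/code&gt;&lt;/pre&gt;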
&lt;p&gt;This example will cover how to deploy a vLLM multi-host workload using TAS.&lt;/p&gt;</description></item></channel></rss>