<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LWS</title><link>/</link><description>Recent content on LWS</description><generator>Hugo</generator><language>en</language><atom:link href="/index.xml" rel="self" type="application/rss+xml"/><item><title>Configure external cert-manager</title><link>/docs/manage/cert_manager/</link><pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate><guid>/docs/manage/cert_manager/</guid><description>&lt;p&gt;This page shows how you can use a third-party certificate authority solution like
Cert Manager.&lt;/p&gt;
&lt;h2 id="before-you-begin"&gt;Before you begin&lt;/h2&gt;
&lt;p&gt;Make sure the following conditions are met:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster is running.&lt;/li&gt;
&lt;li&gt;The kubectl command-line tool can communicate with your cluster.&lt;/li&gt;
&lt;li&gt;Cert Manager is &lt;a href="https://cert-manager.io/docs/installation/"&gt;installed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LWS supports installation via either Kustomize or a Helm chart.&lt;/p&gt;
&lt;h3 id="internal-certificate-management"&gt;Internal Certificate management&lt;/h3&gt;
&lt;p&gt;In all cases, LWS&amp;rsquo;s internal certificate management must be turned off
if you want to use Cert Manager.&lt;/p&gt;
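&lt;p&gt;As a rough sketch only (the exact keys depend on your LWS version and installation method; the &lt;code&gt;internalCertManagement&lt;/code&gt; block below is an assumption, so check the controller manager configuration shipped with your release), this typically means setting something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# controller manager configuration fragment (hypothetical keys)
internalCertManagement:
  enable: false   # let Cert Manager provision the webhook certificates instead
&lt;/code&gt;&lt;/pre&gt;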
</description></item><item><title>vLLM</title><link>/docs/examples/vllm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/vllm/</guid><description>&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed inference service with vLLM on GPUs.
&lt;a href="https://docs.vllm.ai/en/latest/index.html"&gt;vLLM&lt;/a&gt; supports distributed tensor-parallel inference and serving. Currently, it supports Megatron-LM’s tensor parallel algorithm. It manages the distributed runtime with &lt;a href="https://docs.ray.io/en/latest/index.html"&gt;Ray&lt;/a&gt;. See the doc &lt;a href="https://docs.vllm.ai/en/latest/serving/distributed_serving.html"&gt;vLLM Distributed Inference and Serving&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h2 id="deploy-leaderworkerset-of-vllm"&gt;Deploy LeaderWorkerSet of vLLM&lt;/h2&gt;
&lt;p&gt;We use LeaderWorkerSet to deploy two vLLM model replicas. We have two flavors of the deployment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPU: Each vLLM replica has 2 pods (&lt;code&gt;pipeline_parallel_size=2&lt;/code&gt;) and 8 GPUs per pod (&lt;code&gt;tensor_parallel_size=8&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;TPU: The example assumes that you have a GKE cluster with two TPU v5e-16 slices. You can view how to create a cluster with multiple TPU slices &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/tpus"&gt;here&lt;/a&gt;. Each TPU slice has 4 hosts, and each host has 4 TPUs. The vLLM server is deployed on the TPU slice with &lt;code&gt;pipeline_parallel_size=2&lt;/code&gt; and &lt;code&gt;tensor_parallel_size=16&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
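&lt;p&gt;For orientation, here is a trimmed sketch of the GPU flavor (the image is a placeholder, and the real manifest also starts Ray on the leader and workers via startup scripts, which is omitted here):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                     # two vLLM model replicas
  leaderWorkerTemplate:
    size: 2                       # pods per replica (pipeline_parallel_size=2)
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm-image:placeholder
          # The real example starts a Ray head node here before launching the server.
          command: ["sh", "-c"]
          args:
          - "python3 -m vllm.entrypoints.openai.api_server --tensor-parallel-size 8 --pipeline-parallel-size 2"
          resources:
            limits:
              nvidia.com/gpu: "8" # 8 GPUs per pod (tensor_parallel_size=8)
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm-image:placeholder
          # The real example runs ray start here so the worker joins the leader's cluster.
          resources:
            limits:
              nvidia.com/gpu: "8"
&lt;/code&gt;&lt;/pre&gt;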
&lt;p&gt;In both examples, Ray uses the leader pod as the head node and the worker pods as the worker nodes. The leader pod runs the vLLM server, with a ClusterIP Service exposing the port.&lt;/p&gt;</description></item><item><title>Configure Prometheus</title><link>/docs/manage/prometheus/</link><pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate><guid>/docs/manage/prometheus/</guid><description>&lt;p&gt;This page shows how to configure LWS to expose Prometheus metrics.&lt;/p&gt;
&lt;h2 id="before-you-begin"&gt;Before you begin&lt;/h2&gt;
&lt;p&gt;Make sure the following conditions are met:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster is running.&lt;/li&gt;
&lt;li&gt;The kubectl command-line tool can communicate with your cluster.&lt;/li&gt;
&lt;li&gt;Prometheus is &lt;a href="https://prometheus-operator.dev/docs/getting-started/installation/"&gt;installed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Optionally, Cert Manager is &lt;a href="https://cert-manager.io/docs/installation/"&gt;installed&lt;/a&gt; (required only if you want TLS verification for the metrics endpoint)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LWS supports installation via either Kustomize or a Helm chart.&lt;/p&gt;
&lt;h3 id="kustomize-installation"&gt;Kustomize Installation&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Enable &lt;code&gt;prometheus&lt;/code&gt; in &lt;code&gt;config/default/kustomization.yaml&lt;/code&gt; and uncomment all sections with &amp;lsquo;PROMETHEUS&amp;rsquo;.&lt;/li&gt;
&lt;/ol&gt;
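&lt;p&gt;After step 1, the relevant fragment of &lt;code&gt;config/default/kustomization.yaml&lt;/code&gt; looks roughly like the following (the surrounding entries are representative and vary by release):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# config/default/kustomization.yaml (fragment)
resources:
- ../crd
- ../rbac
- ../manager
# [PROMETHEUS] uncommented so the Prometheus monitoring manifests are included
- ../prometheus
&lt;/code&gt;&lt;/pre&gt;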
&lt;h4 id="kustomize-prometheus-with-certificates"&gt;Kustomize Prometheus with certificates&lt;/h4&gt;
&lt;p&gt;If you want to enable TLS verification for the metrics endpoint, follow the directions below.&lt;/p&gt;</description></item><item><title>TensorRT-LLM</title><link>/docs/examples/tensorrt-llm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/tensorrt-llm/</guid><description>&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed inference service with Triton TensorRT-LLM on GPUs.
&lt;a href="https://nvidia.github.io/TensorRT-LLM/"&gt;TensorRT-LLM&lt;/a&gt; supports multinode serving using tensor and pipeline parallelism. It manages the distributed runtime with &lt;a href="https://www.open-mpi.org/"&gt;MPI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="build-the-triton-tensorrt-llm-image"&gt;Build the Triton TensorRT-LLM image&lt;/h2&gt;
&lt;p&gt;We provide a &lt;a href="https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/tensorrt-llm/build/Dockerfile"&gt;Dockerfile&lt;/a&gt; to build the image. The Dockerfile contains an installation script to download any Llama model from Hugging Face and prepare it to be used by TensorRT-LLM. It also has a Python script to initialize MPI and start the server.&lt;/p&gt;</description></item><item><title>llama.cpp</title><link>/docs/examples/llamacpp/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/llamacpp/</guid><description>&lt;h2 id="deploy-distributed-inference-service-with-llamacpp"&gt;Deploy Distributed Inference Service with llama.cpp&lt;/h2&gt;
&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed
inference service using &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;.
llama.cpp began as a project to support CPU-only inference on a single node, but has
since expanded to support accelerators and distributed inference.&lt;/p&gt;
&lt;h3 id="deploy-leaderworkerset-of-llamacpp"&gt;Deploy LeaderWorkerSet of llama.cpp&lt;/h3&gt;
&lt;p&gt;We use LeaderWorkerSet to deploy a llama.cpp leader and two llama.cpp workers.
The leader pod loads the model and distributes layers to the workers; the workers perform the majority of the computation.&lt;/p&gt;
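&lt;p&gt;As a rough sketch of how the pieces fit together (this assumes llama.cpp&amp;rsquo;s RPC backend; the image, model path, and worker addresses are placeholders, and exact flag spellings can differ between llama.cpp versions):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# LeaderWorkerSet fragment: workers run rpc-server, the leader points --rpc at them.
leaderWorkerTemplate:
  size: 3                         # 1 leader + 2 workers
  workerTemplate:
    spec:
      containers:
      - name: llamacpp-worker
        image: llamacpp-image:placeholder
        command: ["rpc-server", "--host", "0.0.0.0", "--port", "50052"]
  leaderTemplate:
    spec:
      containers:
      - name: llamacpp-leader
        image: llamacpp-image:placeholder
        command:
        - llama-server
        - -m
        - /models/model.gguf      # placeholder model path
        - --rpc
        - worker-0:50052,worker-1:50052   # placeholders for the worker pod addresses
&lt;/code&gt;&lt;/pre&gt;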
</description></item><item><title>SGLang</title><link>/docs/examples/sglang/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/sglang/</guid><description>&lt;h2 id="deploy-distributed-inference-service-with-sglang-and-lws-on-gpus"&gt;Deploy Distributed Inference Service with SGLang and LWS on GPUs&lt;/h2&gt;
&lt;p&gt;In this example, we demonstrate how to deploy a distributed inference service using LeaderWorkerSet (LWS) with &lt;a href="https://docs.sglang.ai/"&gt;SGLang&lt;/a&gt; on GPU clusters.&lt;/p&gt;
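&lt;p&gt;Concretely, each pod launches the SGLang server with multi-node flags; a minimal sketch of the launch command is shown below (flag names are taken from SGLang&amp;rsquo;s multi-node documentation and may differ between versions; ports and parallelism are illustrative, and &lt;code&gt;LWS_LEADER_ADDRESS&lt;/code&gt; / &lt;code&gt;LWS_WORKER_INDEX&lt;/code&gt; are injected by LWS):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Container command fragment for a 2-node deployment
command:
- python3
- -m
- sglang.launch_server
- --model-path=meta-llama/Meta-Llama-3.1-8B-Instruct
- --tp=2                                  # total GPUs across both nodes (illustrative)
- --nnodes=2
- --node-rank=$(LWS_WORKER_INDEX)         # 0 on the leader, 1 on the worker
- --dist-init-addr=$(LWS_LEADER_ADDRESS):20000
- --host=0.0.0.0
- --port=40000
&lt;/code&gt;&lt;/pre&gt;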
&lt;p&gt;SGLang provides native support for distributed tensor-parallel inference and serving, enabling efficient deployment of large language models (LLMs) such as DeepSeek-R1 671B and Llama-3.1-405B across multiple nodes. This example uses the &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/a&gt; model to demonstrate multi-node serving capabilities. For implementation details on distributed execution, see the SGLang docs &lt;a href="https://docs.sglang.ai/references/multi_node_deployment/multi_node.html"&gt;Run Multi-Node Inference&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Horizontal Pod Autoscaler (HPA)</title><link>/docs/examples/hpa/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/hpa/</guid><description>&lt;p&gt;LeaderWorkerSet supports Horizontal Pod Autoscaler (HPA) through its scale subresource. This allows you to automatically scale the number of replica groups based on resource utilization metrics like CPU or memory.&lt;/p&gt;
&lt;h2 id="how-hpa-works-with-leaderworkerset"&gt;How HPA Works with LeaderWorkerSet&lt;/h2&gt;
&lt;p&gt;When using HPA with LeaderWorkerSet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;HPA monitors &lt;strong&gt;leader pods&lt;/strong&gt; only (not worker pods)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;hpaPodSelector&lt;/code&gt; in LeaderWorkerSet status helps HPA identify which pods to monitor&lt;/li&gt;
&lt;li&gt;Scaling affects the number of &lt;strong&gt;replica groups&lt;/strong&gt;, not individual pods within a group&lt;/li&gt;
&lt;li&gt;Each replica group (leader + workers) is scaled as a unit&lt;/li&gt;
&lt;/ul&gt;
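&lt;p&gt;A minimal sketch of an HPA targeting a LeaderWorkerSet through its scale subresource (the name and thresholds are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lws-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: my-lws                  # illustrative LeaderWorkerSet name
  minReplicas: 1
  maxReplicas: 5                  # each replica is a whole leader + workers group
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80    # utilization is measured on the leader pods
&lt;/code&gt;&lt;/pre&gt;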
&lt;h2 id="prerequisites"&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before setting up HPA, ensure you have:&lt;/p&gt;</description></item><item><title>Topology Aware Scheduling with Kueue</title><link>/docs/examples/tas/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/tas/</guid><description>&lt;p&gt;AI Inference workloads require constant Pod-to-Pod communication. This makes the network bandwidth an important requirement of
running workloads efficiently. The bandwidth between the Pods depends on the placement of the Nodes in the data center. Topology Aware Scheduling (TAS) aims to place the Pods as close together as possible to maximize the network bandwidth. To learn more about TAS, see the &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/"&gt;Kueue website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This example will cover how to deploy a vLLM multi-host workload using TAS.&lt;/p&gt;
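&lt;p&gt;At a high level, the LeaderWorkerSet is pointed at a Kueue queue and its pod templates carry a TAS annotation; a rough sketch (the queue name and topology level are placeholders, and the annotation keys assume Kueue&amp;rsquo;s TAS feature):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;metadata:
  name: vllm
  labels:
    kueue.x-k8s.io/queue-name: tas-queue             # placeholder LocalQueue name
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        annotations:
          # Ask Kueue to place the pods as close together as the topology allows.
          kueue.x-k8s.io/podset-preferred-topology: cloud.google.com/gke-nodepool
    workerTemplate:
      metadata:
        annotations:
          kueue.x-k8s.io/podset-preferred-topology: cloud.google.com/gke-nodepool
&lt;/code&gt;&lt;/pre&gt;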
</description></item><item><title>Labels, Annotations and Environment Variables</title><link>/docs/reference/labels-annotations-and-environment-variables/</link><pubDate>Fri, 14 Mar 2025 00:00:00 +0000</pubDate><guid>/docs/reference/labels-annotations-and-environment-variables/</guid><description>&lt;h1 id="labels"&gt;Labels&lt;/h1&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Key&lt;/th&gt;
 &lt;th&gt;Description&lt;/th&gt;
 &lt;th&gt;Example&lt;/th&gt;
 &lt;th&gt;Applies to&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/name&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The name of the LeaderWorkerSet object to which these resources belong.&lt;/td&gt;
 &lt;td&gt;leaderworkerset-multi-template&lt;/td&gt;
 &lt;td&gt;Pod, StatefulSet, Service&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/template-revision-hash&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Hash used to track the controller revision that matches a LeaderWorkerSet object.&lt;/td&gt;
 &lt;td&gt;5c5fcdfb44&lt;/td&gt;
 &lt;td&gt;Pod, StatefulSet&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/group-index&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The index of the group to which the resource belongs.&lt;/td&gt;
 &lt;td&gt;0&lt;/td&gt;
 &lt;td&gt;Pod, StatefulSet (only worker)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/group-key&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Unique key identifying the group.&lt;/td&gt;
 &lt;td&gt;689ce1b5&amp;hellip;b07&lt;/td&gt;
 &lt;td&gt;Pod, StatefulSet (only worker)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/worker-index&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The index or identity of the pod within the group.&lt;/td&gt;
 &lt;td&gt;0&lt;/td&gt;
 &lt;td&gt;Pod&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/subgroup-index&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Tracks which subgroup the pod is part of.&lt;/td&gt;
 &lt;td&gt;0&lt;/td&gt;
 &lt;td&gt;Pod (only if SubGroup is set)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/subgroup-key&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Pods that are part of the same subgroup will have the same unique hash value.&lt;/td&gt;
 &lt;td&gt;92904e74&amp;hellip;801&lt;/td&gt;
 &lt;td&gt;Pod (only if SubGroup is set)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h1 id="annotations"&gt;Annotations&lt;/h1&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Key&lt;/th&gt;
 &lt;th&gt;Description&lt;/th&gt;
 &lt;th&gt;Example&lt;/th&gt;
 &lt;th&gt;Applies to&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/size&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The total number of pods in each group.&lt;/td&gt;
 &lt;td&gt;4&lt;/td&gt;
 &lt;td&gt;Pod&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/replicas&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The number of leader-worker groups (replicas).&lt;/td&gt;
 &lt;td&gt;3&lt;/td&gt;
 &lt;td&gt;StatefulSet (only leader)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/leader-name&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The name of the leader pod.&lt;/td&gt;
 &lt;td&gt;leaderworkerset-multi-template-0&lt;/td&gt;
 &lt;td&gt;Pod (only worker)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/exclusive-topology&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Specifies the topology for exclusive 1:1 scheduling.&lt;/td&gt;
 &lt;td&gt;cloud.google.com/gke-nodepool&lt;/td&gt;
 &lt;td&gt;LeaderWorkerSet, Pod (only if exclusive-topology is used)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/subdomainPolicy&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Determines what type of subdomain will be injected.&lt;/td&gt;
 &lt;td&gt;UniquePerReplica&lt;/td&gt;
 &lt;td&gt;Pod (only if leader and subdomainPolicy set to UniquePerReplica)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/subgroup-size&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The number of pods per subgroup.&lt;/td&gt;
 &lt;td&gt;2&lt;/td&gt;
 &lt;td&gt;Pod (only if SubGroup is set)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Specifies the topology for exclusive 1:1 scheduling within a subgroup.&lt;/td&gt;
 &lt;td&gt;topologyKey&lt;/td&gt;
 &lt;td&gt;LeaderWorkerSet, Pod (only if SubGroup is set and subgroup-exclusive-topology is used)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;leaderworkerset.sigs.k8s.io/leader-requests-tpus&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Indicates if the leader pod requests TPU.&lt;/td&gt;
 &lt;td&gt;true&lt;/td&gt;
 &lt;td&gt;Pod (only if leader pod requests TPU)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h1 id="environment-variables"&gt;Environment Variables&lt;/h1&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Key&lt;/th&gt;
 &lt;th&gt;Description&lt;/th&gt;
 &lt;th&gt;Example&lt;/th&gt;
 &lt;th&gt;Applies to&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;LWS_LEADER_ADDRESS&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The address of the leader via the headless service.&lt;/td&gt;
 &lt;td&gt;leaderworkerset-multi-template-0.leaderworkerset-multi-template.default&lt;/td&gt;
 &lt;td&gt;Pod&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;LWS_GROUP_SIZE&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Tracks the size of the LWS group.&lt;/td&gt;
 &lt;td&gt;4&lt;/td&gt;
 &lt;td&gt;Pod&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;LWS_WORKER_INDEX&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;The index or identity of the pod within the group.&lt;/td&gt;
 &lt;td&gt;2&lt;/td&gt;
 &lt;td&gt;Pod&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;TPU_WORKER_HOSTNAMES&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Hostnames of the TPU workers, limited to those in the same subgroup.&lt;/td&gt;
 &lt;td&gt;test-sample-1-5.default,test-sample-1-6.default,test-sample-1-7.default,test-sample-1-8.default&lt;/td&gt;
 &lt;td&gt;Pod (only if TPU enabled)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;TPU_WORKER_ID&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;ID of the TPU worker.&lt;/td&gt;
 &lt;td&gt;0&lt;/td&gt;
 &lt;td&gt;Pod (only if TPU enabled)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;TPU_NAME&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Name of the TPU.&lt;/td&gt;
 &lt;td&gt;test-sample-1&lt;/td&gt;
 &lt;td&gt;Pod (only if TPU enabled)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you need values that are not listed in the Environment Variables section, many of them are available as labels or annotations.
You can obtain them by using the &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/downward-api/"&gt;Downward API&lt;/a&gt; to pass the Pod&amp;rsquo;s label as an environment variable to the container.&lt;/p&gt;
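&lt;p&gt;For example, a minimal sketch of exposing the group index label as an environment variable (the container and variable names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Pod template fragment
containers:
- name: worker                    # illustrative container name
  env:
  - name: GROUP_INDEX             # illustrative variable name
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/group-index']
&lt;/code&gt;&lt;/pre&gt;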
</description></item><item><title>LeaderWorkerSet API</title><link>/docs/reference/leaderworkerset.v1/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/reference/leaderworkerset.v1/</guid><description>&lt;h2 id="resource-types"&gt;Resource Types&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/docs/reference/leaderworkerset.v1/#leaderworkerset-x-k8s-io-v1-LeaderWorkerSet"&gt;LeaderWorkerSet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="leaderworkerset-x-k8s-io-v1-LeaderWorkerSet"&gt;&lt;code&gt;LeaderWorkerSet&lt;/code&gt; &lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Appears in:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;LeaderWorkerSet is the Schema for the leaderworkersets API&lt;/p&gt;
&lt;table class="table"&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th width="30%"&gt;Field&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;apiVersion&lt;/code&gt;&lt;br/&gt;string&lt;/td&gt;&lt;td&gt;&lt;code&gt;leaderworkerset.x-k8s.io/v1&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;kind&lt;/code&gt;&lt;br/&gt;string&lt;/td&gt;&lt;td&gt;&lt;code&gt;LeaderWorkerSet&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;spec&lt;/code&gt; &lt;B&gt;[Required]&lt;/B&gt;&lt;br/&gt;
&lt;a href="#leaderworkerset-x-k8s-io-v1-LeaderWorkerSetSpec"&gt;&lt;code&gt;LeaderWorkerSetSpec&lt;/code&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;
 &lt;p&gt;spec defines the desired behavior of LeaderWorkerSet.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;status&lt;/code&gt; &lt;B&gt;[Required]&lt;/B&gt;&lt;br/&gt;
&lt;a href="#leaderworkerset-x-k8s-io-v1-LeaderWorkerSetStatus"&gt;&lt;code&gt;LeaderWorkerSetStatus&lt;/code&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;
 &lt;p&gt;status represents the current status of LeaderWorkerSet.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="leaderworkerset-x-k8s-io-v1-LeaderWorkerSetSpec"&gt;&lt;code&gt;LeaderWorkerSetSpec&lt;/code&gt; &lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Appears in:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/docs/reference/leaderworkerset.v1/#leaderworkerset-x-k8s-io-v1-LeaderWorkerSet"&gt;LeaderWorkerSet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One group consists of a single leader and M workers, and the total number of pods in a group is M+1.
LeaderWorkerSet will create N replicas of leader-worker pod groups (hereinafter referred to as a group).&lt;/p&gt;
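&lt;p&gt;For example, with &lt;code&gt;replicas: 3&lt;/code&gt; (N=3) and &lt;code&gt;size: 4&lt;/code&gt; (M+1=4), the controller manages 3 groups of 4 pods each, 12 pods in total:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spec:
  replicas: 3                     # N leader-worker groups
  leaderWorkerTemplate:
    size: 4                       # pods per group: 1 leader + 3 workers (M = 3)
&lt;/code&gt;&lt;/pre&gt;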
</description></item><item><title>Search Results</title><link>/search/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/search/</guid><description/></item></channel></rss>