<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Examples on LWS</title><link>/docs/examples/</link><description>Recent content in Examples on LWS</description><generator>Hugo</generator><language>en</language><atom:link href="/docs/examples/index.xml" rel="self" type="application/rss+xml"/><item><title>vLLM</title><link>/docs/examples/vllm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/vllm/</guid><description>&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed inference service with vLLM on GPUs.
&lt;a href="https://docs.vllm.ai/en/latest/index.html"&gt;vLLM&lt;/a&gt; supports distributed tensor-parallel inference and serving. Currently, it supports Megatron-LM’s tensor parallel algorithm. It manages the distributed runtime with &lt;a href="https://docs.ray.io/en/latest/index.html"&gt;Ray&lt;/a&gt;. See the doc &lt;a href="https://docs.vllm.ai/en/latest/serving/distributed_serving.html"&gt;vLLM Distributed Inference and Serving&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h2 id="deploy-leaderworkerset-of-vllm"&gt;Deploy LeaderWorkerSet of vLLM&lt;/h2&gt;
&lt;p&gt;We use LeaderWorkerSet to deploy two vLLM model replicas. The deployment comes in two flavors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPU: Each vLLM replica has 2 pods (&lt;code&gt;pipeline_parallel_size=2&lt;/code&gt;) and 8 GPUs per pod (&lt;code&gt;tensor_parallel_size=8&lt;/code&gt;); a sample manifest follows this list.&lt;/li&gt;
&lt;li&gt;TPU: The example assumes that you have a GKE cluster with two TPU v5e-16 slices. You can learn how to create a cluster with multiple TPU slices &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/tpus"&gt;here&lt;/a&gt;. Each TPU slice has 4 hosts, and each host has 4 TPU chips. The vLLM server is deployed across the two slices with &lt;code&gt;pipeline_parallel_size=2&lt;/code&gt; (one slice per pipeline stage) and &lt;code&gt;tensor_parallel_size=16&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
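&lt;p&gt;For reference, a minimal LeaderWorkerSet sketch for the GPU flavor might look like the following; the image is a placeholder, and only the sizing fields mirror the numbers above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                # two vLLM model replicas
  leaderWorkerTemplate:
    size: 2                  # pods per replica, matching pipeline_parallel_size
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: registry.example.com/vllm:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "8"   # tensor_parallel_size GPUs per pod
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: registry.example.com/vllm:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "8"
&lt;/code&gt;&lt;/pre&gt;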
&lt;p&gt;In both examples, Ray uses the leader pod as the head node and the worker pods as the worker nodes. The leader pod runs the vLLM server, with a ClusterIP Service exposing the port.&lt;/p&gt;</description></item><item><title>TensorRT-LLM</title><link>/docs/examples/tensorrt-llm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/tensorrt-llm/</guid><description>&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed inference service with Triton TensorRT-LLM on GPUs.
&lt;a href="https://nvidia.github.io/TensorRT-LLM/"&gt;TensorRT-LLM&lt;/a&gt; supports multinode serving using tensor and pipeline parallelism. It manages the distributed runtime with &lt;a href="https://www.open-mpi.org/"&gt;MPI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="build-the-triton-tensorrt-llm-image"&gt;Build the Triton TensorRT-LLM image&lt;/h2&gt;
&lt;p&gt;We provide a &lt;a href="https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/tensorrt-llm/build/Dockerfile"&gt;Dockerfile&lt;/a&gt; to build the image. The Dockerfile contains an installation script that downloads any Llama model from Hugging Face and prepares it for use with TensorRT-LLM. It also includes a Python script that initializes MPI and starts the server.&lt;/p&gt;</description></item><item><title>llama.cpp</title><link>/docs/examples/llamacpp/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/llamacpp/</guid><description>&lt;h2 id="deploy-distributed-inference-service-with-llamacpp"&gt;Deploy Distributed Inference Service with llama.cpp&lt;/h2&gt;
&lt;p&gt;In this example, we will use LeaderWorkerSet to deploy a distributed
inference service using &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;.
llama.cpp began as a project to support CPU-only inference on a single node, but has
since expanded to support accelerators and distributed inference.&lt;/p&gt;
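&lt;p&gt;llama.cpp’s distributed mode builds on its RPC backend: each worker exposes its resources through &lt;code&gt;rpc-server&lt;/code&gt;, and the leader’s &lt;code&gt;llama-server&lt;/code&gt; is pointed at them. A rough sketch of the container commands, where the model path, hostnames, port, and exact flags are assumptions to check against the llama.cpp docs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# worker pods: expose the host to the leader over RPC
command: ["rpc-server", "--host", "0.0.0.0", "--port", "50052"]
# leader pod: load the model and offload layers to the workers
command: ["llama-server", "-m", "/models/model.gguf",
          "--rpc", "worker-0:50052,worker-1:50052"]
&lt;/code&gt;&lt;/pre&gt;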
&lt;h3 id="deploy-leaderworkerset-of-llamacpp"&gt;Deploy LeaderWorkerSet of llama.cpp&lt;/h3&gt;
&lt;p&gt;We use LeaderWorkerSet to deploy a llama.cpp leader and two llama.cpp workers.
The leader pod loads the model and distributes layers to the workers; the workers
perform the majority of the computation.&lt;/p&gt;</description></item><item><title>SGLang</title><link>/docs/examples/sglang/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/sglang/</guid><description>&lt;h2 id="deploy-distributed-inference-service-with-sglang-and-lws-on-gpus"&gt;Deploy Distributed Inference Service with SGLang and LWS on GPUs&lt;/h2&gt;
&lt;p&gt;In this example, we demonstrate how to deploy a distributed inference service using LeaderWorkerSet (LWS) with &lt;a href="https://docs.sglang.ai/"&gt;SGLang&lt;/a&gt; on GPU clusters.&lt;/p&gt;
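&lt;p&gt;In LWS terms, the leader pod runs node rank 0 and each worker pod runs one of the remaining ranks, all pointed at the leader’s address. A sketch of the leader’s container command, where the parallelism sizes, port, and the use of the LWS-provided leader address are illustrative assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# leader container (node rank 0); workers differ only in --node-rank
command:
- python3
- -m
- sglang.launch_server
- --model-path=meta-llama/Meta-Llama-3.1-8B-Instruct
- --tp=2               # tensor-parallel size across the whole group
- --nnodes=2
- --node-rank=0
- --dist-init-addr=$(LWS_LEADER_ADDRESS):20000   # env var assumed to be injected by LWS
- --port=8080
&lt;/code&gt;&lt;/pre&gt;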
&lt;p&gt;SGLang provides native support for distributed tensor-parallel inference and serving, enabling efficient deployment of large language models (LLMs) such as DeepSeek-R1 671B and Llama-3.1-405B across multiple nodes. This example uses the &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/a&gt; model to demonstrate multi-node serving capabilities. For implementation details on distributed execution, see the SGLang docs &lt;a href="https://docs.sglang.ai/references/multi_node_deployment/multi_node.html"&gt;Run Multi-Node Inference&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Horizontal Pod Autoscaler (HPA)</title><link>/docs/examples/hpa/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/hpa/</guid><description>&lt;p&gt;LeaderWorkerSet supports Horizontal Pod Autoscaler (HPA) through its scale subresource. This allows you to automatically scale the number of replica groups based on resource utilization metrics like CPU or memory.&lt;/p&gt;
&lt;h2 id="how-hpa-works-with-leaderworkerset"&gt;How HPA Works with LeaderWorkerSet&lt;/h2&gt;
&lt;p&gt;When using HPA with LeaderWorkerSet (see the example manifest after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;HPA monitors &lt;strong&gt;leader pods&lt;/strong&gt; only (not worker pods)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;hpaPodSelector&lt;/code&gt; in LeaderWorkerSet status helps HPA identify which pods to monitor&lt;/li&gt;
&lt;li&gt;Scaling affects the number of &lt;strong&gt;replica groups&lt;/strong&gt;, not individual pods within a group&lt;/li&gt;
&lt;li&gt;Each replica group (leader + workers) is scaled as a unit&lt;/li&gt;
&lt;/ul&gt;
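&lt;p&gt;For reference, a minimal HPA manifest targeting a LeaderWorkerSet might look like the following, where the name and thresholds are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-lws-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: my-lws            # illustrative LeaderWorkerSet name
  minReplicas: 1
  maxReplicas: 5            # counts replica groups, not pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # evaluated against leader pods only
&lt;/code&gt;&lt;/pre&gt;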
&lt;h2 id="prerequisites"&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before setting up HPA, ensure you have:&lt;/p&gt;</description></item><item><title>Topology Aware Scheduling with Kueue</title><link>/docs/examples/tas/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/examples/tas/</guid><description>&lt;p&gt;AI Inference workloads require constant Pod-to-Pod communication. This makes the network bandwidth an important requirement of
running workloads efficiently. The bandwidth between the Pods depends on the placement of the Nodes in the data center. Topology Aware Scheduling (TAS) aims to place the Pods as close together as possible to maximize the network bandwidth. To learn more about TAS, visit the &lt;a href="https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/"&gt;Kueue website&lt;/a&gt;.&lt;/p&gt;
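&lt;p&gt;With Kueue installed, opting a LeaderWorkerSet into TAS comes down to pointing it at a local queue and annotating its pod templates with a topology preference. A sketch, where the queue name and topology level are assumptions for a GKE cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # assumed LocalQueue name
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        annotations:
          kueue.x-k8s.io/podset-preferred-topology: cloud.google.com/gce-topology-block
    workerTemplate:
      metadata:
        annotations:
          kueue.x-k8s.io/podset-preferred-topology: cloud.google.com/gce-topology-block
&lt;/code&gt;&lt;/pre&gt;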
&lt;p&gt;This example will cover how to deploy a vLLM multi-host workload using TAS.&lt;/p&gt;</description></item></channel></rss>