Quick Start: Write an RL sketch in 10 minutes ============================================= In this tutorial, we will use a simplified example to show you how to write an RL program with ``rlite`` in tens of lines, while still scales well. 1. Import packages ------------------ First, import ``rlite`` as a common python package. .. code-block:: python import torch import torch.nn.functional as F from accelerate import init_empty_weights from transformers import Qwen2ForCausalLM import rlite import rlite.nn from rlite import NeedParallel, RliteInferenceEngine, RliteTrainEngine 2. Define your ``torch.nn.Module`` -------------------------------- Then, define a pytorch module that implements the logic of your algorithm. Here we use the language loss for simplicity. .. code-block:: python class MyQwenModel(rlite.nn.HuggingFaceFsdp2TrainModule, Qwen2ForCausalLM): def on_weights_materialized(self): self.optimizer = torch.optim.AdamW(self.parameters(), lr=1e-5) def train_step(self, input_ids: list[list[int]], grad_acc_steps: int = 1): input_ids = torch.tensor(input_ids, dtype=torch.long, device="cuda") # Language loss logits = self(input_ids).logits[:, :-1].contiguous() labels = input_ids[:, 1:].contiguous() loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1)).mean() # Backward and gradient accumulation loss = loss / grad_acc_steps loss.backward() return loss.item() def optimizer_step(self): self.optimizer.step() self.optimizer.zero_grad() Here we are subclassing the ``Qwen2ForCausalLM`` from ``transformers`` to reuse its forward implementation. We also subclass ``rlite.nn.HuggingFaceFsdp2TrainModule``, which is basically a ``torch.nn.Module`` that further implements 1. serialization of the module, so that this module can be sent/received using `ray`; 2. necessary interfaces (e.g., get the checkpoint path) for materializing and training the module on the cluster. .. note :: In ``rlite``, we use ``rlite.nn.xxxTrainModule`` as the carrier for user-defined logics. Such modules **must** be initialized on torch's meta device to ensure efficient module transfer between processes (i.e. ray actors). This enables algorithm-computation decoupling, allowing users to focus on algorithms only (writing ``nn.Module``s). RLite will handle everything else. .. note:: ``on_weights_materialized`` is an important hook for operations that are expected to happen after the module is materialized on GPUs. For example, this is useful for optimizer and LR schedulers. You don't need to implement this hook if your module does not need to be optimized, e.g. reference model in PPO. 3. Initialize ``rlite`` ----------------------- Call ``rlite.init()`` to config ``rlite`` globally and initialize the resource manager. .. code-block:: python rlite.init() 4. Initialize a vLLM actor for inference ---------------------------------------- Then, we initialize a vLLM actor and use it to generate rollouts. .. code-block:: python vllm_engine = RliteInferenceEngine("Qwen/Qwen2.5-7B-Instruct", executor="vllm") vllm_engine.build(tensor_parallel_size=4) prompts = ["你好,世界!", "Hello, world!"] * 8 rollouts = vllm_engine.generate(prompts) 5. Initialize a FSDP2 actor for training ---------------------------------------- After generation, we drop all the weights of this vLLM actor and initialize the train actor. Note that the module **must** be initialized on `meta` device to get a low-memory module instance. .. code-block:: python vllm_engine.meta() # Release everything from GPU with init_empty_weights(): module = MyQwenModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct") fsdp2_engine = RliteTrainEngine(module, executor="fsdp2") fsdp2_engine.build(tensor_parallel_size=4, colocate_with=vllm_engine) .. note:: You can use ``.cpu()``, ``.cuda(*args, **kwargs)``, or ``.meta()``, to move the engine to CPU, to GPU, or release all memory usage in both CPU and GPU. 6. Call the user-defined training logic --------------------------------------- Let's start training! .. code-block:: python input_ids = [[x.outputs[0].token_ids for x in rollouts] for _ in range(4)] grad_acc_steps = 4 for step in range(grad_acc_steps): fsdp2_engine.train_step( NeedParallel(input_ids[step]), # This will be split among the workers (DP) grad_acc_steps # This will be copied to all workers ) fsdp2_engine.optimizer_step() As you can see, you can call the function you just defined through the ``RliteTrainEngine``. This gives the full flexibility to users to design new algorithms, without worrying about adapting them to ``rlite``. 7. Sync weight from FSDP2 actor to vLLM actor --------------------------------------------- After training, sync the updated weights to vLLM actor and generate again. .. code-block:: python vllm_engine.cuda("weights") # Only the weights are loaded fsdp2_engine.p2p_weight_sync(vllm_engine) # GPU-to-GPU weight sync via CUDAIPC fsdp2_engine.cpu() vllm_engine.cuda("kv_cache") # The KV cache of vLLM is back to GPU That's all 🎉! Writing an RL program should be this simple 😄! The complete code of this tutorial can be found `here `_.