Inference with RLite

RLite’s inference engine provides a generate interface similar to those of Hugging Face and vLLM. The currently supported backends are:

  • vLLM

Inference with Language Models

Initializing the inference engine requires only the model path and the backend engine. The model’s parallelism strategy is then specified in the build function through the following parameters:

  • tensor_parallel_size: Size of tensor parallelism

  • pipeline_parallel_size: Size of pipeline parallelism

  • data_parallel_size: Size of data parallelism. Defaults to -1, which uses as many DP executors as possible

Other backend parameters, such as gpu_memory_utilization, can also be passed through the build function.
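To make the relationship between the three sizes concrete, the sketch below estimates how many data-parallel replicas fit on a machine when data_parallel_size is left at -1. The helper resolve_data_parallel_size is hypothetical and not part of RLite. It assumes each replica occupies tensor_parallel_size × pipeline_parallel_size GPUs, which is a common convention; RLite's actual resolution logic may differ.

```python
def resolve_data_parallel_size(
    num_gpus: int,
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
    data_parallel_size: int = -1,
) -> int:
    """Hypothetical helper (not an RLite API): pick the number of DP replicas."""
    # Assumption: one replica needs TP * PP GPUs.
    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    if gpus_per_replica < 1 or gpus_per_replica > num_gpus:
        raise ValueError("one replica does not fit on the available GPUs")

    max_replicas = num_gpus // gpus_per_replica
    if data_parallel_size == -1:
        # -1 means "use as many replicas as possible".
        return max_replicas
    if not 1 <= data_parallel_size <= max_replicas:
        raise ValueError("requested data_parallel_size does not fit")
    return data_parallel_size
```

Under these assumptions, a machine with 8 GPUs and tensor_parallel_size=4 would host 2 replicas.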

import rlite

engine = rlite.InferenceEngine(
    model_path="path/to/model",
    backend="vllm",
)
engine.build(
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)

Prompts and generation parameters are passed to the generate function. Generation parameters can be given either as a vLLM SamplingParams object or as a plain dictionary. The generate function also supports tqdm progress bars. By default, all prompts are processed in parallel, and the backend engine distributes them randomly across the executors.

engine.generate(
    ["Hello, world!"] * 3,
    use_tqdm=True,
    tqdm_desc="Generating random samples",
)
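The random distribution of prompts across executors can be pictured with a small, framework-independent sketch. The functions shard_prompts_randomly and merge_results are hypothetical illustrations, not RLite internals. The sketch also assumes each prompt keeps its original index so that outputs could be put back in input order; the source does not state how RLite orders its results.

```python
import random


def shard_prompts_randomly(prompts, num_executors, seed=None):
    """Hypothetical sketch: randomly assign prompts to executors."""
    rng = random.Random(seed)
    indices = list(range(len(prompts)))
    rng.shuffle(indices)  # random assignment order

    shards = [[] for _ in range(num_executors)]
    for position, prompt_index in enumerate(indices):
        # Round-robin over the shuffled order keeps shard sizes balanced.
        shards[position % num_executors].append(
            (prompt_index, prompts[prompt_index])
        )
    return shards


def merge_results(shard_results, total):
    """Reassemble (index, output) pairs into the original prompt order."""
    merged = [None] * total
    for shard in shard_results:
        for prompt_index, output in shard:
            merged[prompt_index] = output
    return merged
```

Shuffling before a round-robin split gives each executor a random but evenly sized share of the work, which tends to balance load when prompt lengths vary.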

Inference with Vision-Language Models

Inference with vision-language models works the same way as with language models: only the model path and backend engine need to be specified. Each input prompt is a dictionary containing the text and the raw image.

import PIL.Image

engine.generate(
    [
        {
            "prompt": "What is the image about?",
            "multi_modal_data": {
                "image": PIL.Image.open("path/to/image.jpg")
            }
        }
        for _ in range(8)
    ],
    sampling_params={
        "temperature": 0.7,
        "seed": 42,
    },
    use_tqdm=True,
    tqdm_desc="VLM generating",
)
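When many images share the same question, the prompt dictionaries can be built with a small helper. The function build_vlm_prompts below is a hypothetical convenience, not part of RLite. It only reproduces the dictionary layout shown above ("prompt" plus "multi_modal_data" with an "image" entry) and accepts images that have already been loaded, for example with PIL.Image.open.

```python
from typing import Any


def build_vlm_prompts(question: str, images: list[Any]) -> list[dict[str, Any]]:
    """Hypothetical helper: one prompt dictionary per already-loaded image."""
    return [
        {
            "prompt": question,
            # Same layout as the example above: raw image under "image".
            "multi_modal_data": {"image": image},
        }
        for image in images
    ]
```

The returned list could then be passed as the first argument to engine.generate, in place of the inline list comprehension.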