Inference with RLite

RLite’s inference engine provides a generate interface similar to those of HuggingFace and vLLM. The currently supported backends are:

vLLM

Inference with Language Models

Initializing the inference engine requires only the model path and the backend engine. The model’s parallelism strategy is then set in the build function through the following parameters:

tensor_parallel_size: Size of tensor parallelism
pipeline_parallel_size: Size of pipeline parallelism
data_parallel_size: Size of data parallelism. Defaults to -1, which uses as many DP executors as possible

Other backend parameters, such as gpu_memory_utilization in the example below, can also be specified through the build function.

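import rlite

# Point the engine at a model and select the backend.
engine = rlite.InferenceEngine(
    model_path="path/to/model",
    backend="vllm",
)
# Set the parallelism strategy and any other backend parameters.
engine.build(
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)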
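
To make the data_parallel_size=-1 default more concrete, here is a minimal sketch of how "as many DP executors as possible" could be computed from the GPU count. The helper resolve_data_parallel_size and the rule that each replica occupies tensor_parallel_size * pipeline_parallel_size GPUs are assumptions made for illustration; they are not taken from RLite's source, and RLite may resolve the value differently.

```python
def resolve_data_parallel_size(
    num_gpus: int,
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
    data_parallel_size: int = -1,
) -> int:
    """Return the number of data-parallel replicas.

    Hypothetical helper, not part of RLite. It assumes every replica
    occupies tensor_parallel_size * pipeline_parallel_size GPUs.
    """
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("tensor and pipeline parallel sizes must be >= 1")

    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    max_replicas = num_gpus // gpus_per_replica
    if max_replicas < 1:
        raise ValueError(
            f"one replica needs {gpus_per_replica} GPUs, only {num_gpus} available"
        )

    # -1 means "use as many replicas as fit on the available GPUs".
    if data_parallel_size == -1:
        return max_replicas

    if not 1 <= data_parallel_size <= max_replicas:
        raise ValueError(
            f"data_parallel_size must be -1 or between 1 and {max_replicas}"
        )
    return data_parallel_size


if __name__ == "__main__":
    # 8 GPUs with tensor_parallel_size=4 leaves room for 2 replicas.
    print(resolve_data_parallel_size(8, tensor_parallel_size=4))
```

Under these assumptions, the build call above (tensor_parallel_size=4) on an 8-GPU node would yield two data-parallel replicas.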

Prompts and generation parameters are passed to the generate function. Generation parameters can be given either as vLLM’s SamplingParams or directly as a dictionary. The generate function also supports tqdm progress bars. By default, all prompts are processed in parallel, and the backend engine distributes them randomly across the executors.

engine.generate(
    ["Hello, world!"] * 3,
    use_tqdm=True,
    tqdm_desc="Generating random samples",
)
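
The random distribution of prompts across executors can be pictured with the following sketch. It is illustrative only: distribute_prompts and gather_in_order are hypothetical helpers, not RLite functions, and the source does not state how RLite balances shards or whether results are returned in input order. The sketch assumes round-robin dealing of shuffled indices and reassembly by original index.

```python
import random


def distribute_prompts(prompts, num_executors, seed=None):
    """Randomly assign prompt indices to executors.

    Illustrative stand-in, not RLite's scheduler. Returns one list of
    prompt indices per executor.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be >= 1")
    rng = random.Random(seed)
    indices = list(range(len(prompts)))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so shard sizes differ by at most one.
    return [indices[i::num_executors] for i in range(num_executors)]


def gather_in_order(shards, shard_outputs):
    """Place per-executor outputs back into the original prompt order."""
    results = [None] * sum(len(shard) for shard in shards)
    for shard, outputs in zip(shards, shard_outputs):
        for index, output in zip(shard, outputs):
            results[index] = output
    return results


if __name__ == "__main__":
    prompts = ["Hello, world!"] * 3
    shards = distribute_prompts(prompts, num_executors=2, seed=0)
    # Stand-in for generation: each "executor" just upper-cases its prompts.
    outputs = [[prompts[i].upper() for i in shard] for shard in shards]
    print(gather_in_order(shards, outputs))
```

Tracking indices rather than the prompt strings themselves keeps the reassembly correct even when several prompts are identical, as in the example above.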

Inference with Vision-Language Models

Inference with vision-language models works the same way as with language models: only the model path and backend engine need to be specified. Each input prompt is a dictionary containing the text prompt and the raw image.

import PIL.Image

engine.generate(
    [
        {
            "prompt": "What is the image about?",
            "multi_modal_data": {
                "image": PIL.Image.open("path/to/image.jpg")
            }
        }
        for _ in range(8)
    ],
    sampling_params={
        "temperature": 0.7,
        "seed": 42,
    },
    use_tqdm=True,
    tqdm_desc="VLM generating",
)
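
When there are many question and image pairs, the prompt dictionaries can be assembled by a small helper. The sketch below is a hypothetical convenience function, not part of RLite: build_vlm_prompts and its load_image parameter are names introduced here for illustration. It assumes one image path per question and reuses an already opened image when the same path appears more than once. Only the dictionary layout ("prompt", "multi_modal_data", "image") comes from the example above.

```python
from typing import Any, Callable, Sequence


def _open_with_pillow(path: str) -> Any:
    # Imported lazily so the helper can be exercised without Pillow installed.
    from PIL import Image

    return Image.open(path)


def build_vlm_prompts(
    questions: Sequence[str],
    image_paths: Sequence[str],
    load_image: Callable[[str], Any] = _open_with_pillow,
) -> list:
    """Build one prompt dictionary per (question, image path) pair.

    Hypothetical helper, not part of RLite.
    """
    if len(questions) != len(image_paths):
        raise ValueError("expected exactly one image path per question")

    cache = {}  # path -> loaded image, so each file is opened only once
    prompts = []
    for question, path in zip(questions, image_paths):
        if path not in cache:
            cache[path] = load_image(path)
        prompts.append(
            {
                "prompt": question,
                "multi_modal_data": {"image": cache[path]},
            }
        )
    return prompts
```

The resulting list can be passed as the first argument of engine.generate in place of the inline list comprehension, with sampling_params and the tqdm options unchanged.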