Inference with RLite
RLite’s inference engine provides a generate interface similar to HuggingFace and vLLM. Currently supported backends include:
vLLM
Inference with Language Models
Initializing the inference engine requires only the model path and the backend engine. The model's parallelism strategy is then set in the build function through the following parameters:
tensor_parallel_size: Size of tensor parallelism
pipeline_parallel_size: Size of pipeline parallelism
data_parallel_size: Size of data parallelism. Defaults to -1, which means using as many DP executors as possible
Other backend parameters can also be specified through the build function.
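import rlite

# Create the engine from a model path and a backend name.
engine = rlite.InferenceEngine(
    model_path="path/to/model",
    backend="vllm",
)
# Set the parallelism strategy and any other backend parameters.
engine.build(
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)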
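The default data_parallel_size of -1 is described above as using as many DP executors as possible. One plausible reading, sketched here as an assumption rather than as RLite's actual logic, is that each executor occupies tensor_parallel_size × pipeline_parallel_size GPUs, so the executor count is the number of available GPUs divided by that product. The function resolve_data_parallel_size and its num_gpus argument are hypothetical names used only for illustration.

```python
def resolve_data_parallel_size(
    num_gpus: int,
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
    data_parallel_size: int = -1,
) -> int:
    """Hypothetical helper: number of DP executors that fit on num_gpus GPUs."""
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive")
    # Assumption: one executor spans TP x PP GPUs.
    gpus_per_executor = tensor_parallel_size * pipeline_parallel_size
    max_executors = num_gpus // gpus_per_executor
    if max_executors == 0:
        raise ValueError("not enough GPUs for a single executor")
    if data_parallel_size == -1:
        # -1 means "as many executors as possible".
        return max_executors
    if not 1 <= data_parallel_size <= max_executors:
        raise ValueError("data_parallel_size does not fit on the available GPUs")
    return data_parallel_size


# With 8 GPUs and tensor_parallel_size=4, this reading yields 2 executors.
print(resolve_data_parallel_size(num_gpus=8, tensor_parallel_size=4))
```

Under this reading, the build call above with tensor_parallel_size=4 on an 8-GPU node would run two executors, each holding a full copy of the model.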
Pass prompts and generation parameters to the generate function. Generation parameters can be given either as a vLLM SamplingParams object or directly as a dictionary. The generate function also supports tqdm progress bars. By default, all prompts are processed in parallel, and the backend engine distributes them randomly across the executors.
engine.generate(
    ["Hello, world!"] * 3,
    use_tqdm=True,
    tqdm_desc="Generating random samples",
)
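The random distribution of prompts across executors can be pictured with the sketch below. It is not RLite's implementation: distribute_prompts, gather_in_order, and the seed argument are hypothetical, and reassembling outputs in input order is an assumption made for the illustration. The idea shown is to shuffle the prompts while keeping their original indices, split them into near-equal shards, and use the indices to put results back in order.

```python
import random


def distribute_prompts(prompts, num_executors, seed=None):
    """Randomly assign (index, prompt) pairs to executors in near-equal shards."""
    rng = random.Random(seed)
    indexed = list(enumerate(prompts))
    rng.shuffle(indexed)
    # Striding over the shuffled list keeps shard sizes within one of each other.
    return [indexed[i::num_executors] for i in range(num_executors)]


def gather_in_order(shard_results, total):
    """Place per-executor (index, output) pairs back into input order."""
    outputs = [None] * total
    for shard in shard_results:
        for index, output in shard:
            outputs[index] = output
    return outputs


prompts = [f"prompt {i}" for i in range(7)]
shards = distribute_prompts(prompts, num_executors=3, seed=0)
# Stand-in for generation: each "executor" upper-cases its prompts.
results = [[(i, p.upper()) for i, p in shard] for shard in shards]
print(gather_in_order(results, len(prompts)))
```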
Inference with Vision-Language Models
Inference with vision-language models works the same way as with language models: only the model path and backend engine need to be specified. Each input prompt is a dictionary containing the text prompt and the raw image.
import PIL.Image

engine.generate(
    [
        {
            "prompt": "What is the image about?",
            "multi_modal_data": {
                "image": PIL.Image.open("path/to/image.jpg")
            }
        }
        for _ in range(8)
    ],
    # Generation parameters passed directly as a dictionary.
    sampling_params={
        "temperature": 0.7,
        "seed": 42,
    },
    use_tqdm=True,
    tqdm_desc="VLM generating",
)
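When many images share one question, the prompt dictionaries can be built with a small helper. The sketch below is an illustration only: build_vlm_prompts is a hypothetical function, not part of RLite, and it relies solely on the "prompt" and "multi_modal_data" keys shown in the example above. A placeholder object stands in for the image so that the sketch runs without an image file.

```python
def build_vlm_prompts(question, images):
    """Pair one text question with each image, using the prompt layout shown above."""
    return [
        {"prompt": question, "multi_modal_data": {"image": image}}
        for image in images
    ]


# Placeholder standing in for a loaded image (an assumption for this sketch).
placeholder_image = object()
prompts = build_vlm_prompts("What is the image about?", [placeholder_image] * 8)
print(len(prompts), sorted(prompts[0]))
```

In real use, each image would be loaded once, for example with PIL.Image.open, and the resulting list passed as the first argument to engine.generate.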