Inference with RLite
====================

RLite's inference engine provides a ``generate`` interface similar to those of Hugging Face Transformers and vLLM. The currently supported backends are:

- vLLM

Inference with Language Models
------------------------------

Creating the inference engine requires only the model path and the backend engine. The model's parallelism strategy is then set in the ``build`` function through the following parameters:

- ``tensor_parallel_size``: Size of tensor parallelism
- ``pipeline_parallel_size``: Size of pipeline parallelism
- ``data_parallel_size``: Size of data parallelism. Defaults to ``-1``, which uses as many data-parallel (DP) executors as possible

Other backend parameters, such as ``gpu_memory_utilization``, can also be passed to the ``build`` function.

.. code-block:: python

    import rlite

    engine = rlite.InferenceEngine(
        model_path="path/to/model",
        backend="vllm",
    )
    engine.build(
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9,
    )
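
This page does not spell out how ``data_parallel_size=-1`` is resolved. One plausible reading, sketched below, is that each model replica occupies ``tensor_parallel_size * pipeline_parallel_size`` GPUs and that as many replicas as fit on the available GPUs are started. The helper ``resolve_data_parallel_size`` and its ``num_gpus`` argument are hypothetical and are not part of RLite's API.

.. code-block:: python

    def resolve_data_parallel_size(
        num_gpus: int,
        tensor_parallel_size: int = 1,
        pipeline_parallel_size: int = 1,
        data_parallel_size: int = -1,
    ) -> int:
        """Hypothetical helper: number of DP replicas to run on ``num_gpus``."""
        if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
            raise ValueError("parallel sizes must be positive")
        # Assumption: one replica spans TP x PP GPUs.
        gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
        max_replicas = num_gpus // gpus_per_replica
        if max_replicas == 0:
            raise ValueError("not enough GPUs for a single replica")
        if data_parallel_size == -1:
            # -1 means "use as many replicas as fit".
            return max_replicas
        if not 1 <= data_parallel_size <= max_replicas:
            raise ValueError("data_parallel_size does not fit on the available GPUs")
        return data_parallel_size


    # With 8 GPUs and tensor_parallel_size=4, two replicas fit.
    print(resolve_data_parallel_size(num_gpus=8, tensor_parallel_size=4))

Under this assumption, passing an explicit ``data_parallel_size`` mainly serves to leave some GPUs free for other work.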

Prompts and generation parameters are passed to the ``generate`` function. Generation parameters can be given either as a vLLM ``SamplingParams`` object or directly as a dictionary. The ``generate`` function also supports a ``tqdm`` progress bar. By default, all prompts are processed in parallel, and the backend engine distributes them randomly across the executors.

.. code-block:: python

    engine.generate(
        ["Hello, world!"] * 3,
        use_tqdm=True,
        tqdm_desc="Generating random samples",
    )
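
To make the random distribution concrete, here is a minimal, self-contained sketch of one way prompts could be spread across executors and the outputs put back in input order. It is an illustration only and not RLite's implementation: ``scatter_prompts``, ``gather_outputs``, and the upper-casing stand-in for generation are hypothetical, and this page does not state how RLite orders its results.

.. code-block:: python

    import random


    def scatter_prompts(prompts, num_executors, seed=None):
        """Randomly assign prompt indices to executors (illustrative only)."""
        rng = random.Random(seed)
        indices = list(range(len(prompts)))
        rng.shuffle(indices)
        # Striding over the shuffled indices gives each executor a near-equal share.
        return [indices[i::num_executors] for i in range(num_executors)]


    def gather_outputs(shards, shard_outputs, num_prompts):
        """Place per-executor outputs back at their original prompt positions."""
        ordered = [None] * num_prompts
        for shard, outputs in zip(shards, shard_outputs):
            for index, output in zip(shard, outputs):
                ordered[index] = output
        return ordered


    prompts = [f"prompt {i}" for i in range(5)]
    shards = scatter_prompts(prompts, num_executors=2, seed=0)
    # Stand-in for generation: each executor upper-cases the prompts it received.
    shard_outputs = [[prompts[i].upper() for i in shard] for shard in shards]
    results = gather_outputs(shards, shard_outputs, len(prompts))

Tracking indices rather than the prompt strings keeps the mapping correct even when the same prompt appears several times, as in the ``["Hello, world!"] * 3`` call above.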

Inference with Vision-Language Models
-------------------------------------

Inference with vision-language models works the same way as with language models: only the model path and backend engine need to be specified. Each input prompt is a dictionary containing the text (``prompt``) and the raw image (``multi_modal_data``).

.. code-block:: python

    import PIL.Image

    engine.generate(
        [
            {
                "prompt": "What is the image about?",
                "multi_modal_data": {
                    "image": PIL.Image.open("path/to/image.jpg")
                }
            }
            for _ in range(8)
        ],
        sampling_params={
            "temperature": 0.7,
            "seed": 42,
        },
        use_tqdm=True,
        tqdm_desc="VLM generating",
    )
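
When the same question is asked about many images, the input list can be built with a small helper. The sketch below follows the dictionary layout used in the call above. ``build_vlm_inputs`` is a hypothetical convenience function, not part of RLite, and plain strings stand in for ``PIL.Image`` objects so that the snippet runs without image files.

.. code-block:: python

    def build_vlm_inputs(question, images):
        """Pair one text prompt with each image, one dictionary per request."""
        return [
            {"prompt": question, "multi_modal_data": {"image": image}}
            for image in images
        ]


    # Placeholders; in practice these would be images opened with PIL.Image.open.
    images = ["<image 0>", "<image 1>"]
    inputs = build_vlm_inputs("What is the image about?", images)

The resulting list can then be passed as the first argument to ``engine.generate``.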