Basic usage of inference module for image-based inference¶
In this tutorial, we use the inference module from urban-worm, which supports three frameworks for running multimodal large language models (MLLMs): Ollama (built on top of llama.cpp), Llama.cpp, and Unsloth. We showcase inference with single and multiple images, using InternVL3 for Ollama and Llama.cpp and Qwen3-VL for Unsloth.
Three types of output schema are demonstrated for inference:
- plain text generation
- multiple questions with binary answers
- multiple choices
from urbanworm.inference.llama import InferenceLlamacpp, InferenceOllama
# Optional fast local VLM backend (requires `pip install "urban-worm[unsloth]"`)
from urbanworm import InferenceUnsloth
First, let's set up some schema for defining output format and prompts for demonstrating inference tasks.
# define the schema for model output
# this is the default built-in schema for plain text generation
normal_format = {
"questions": (str, ...),
"answer": (str, ...),
}
# binary answer
bool_format = {
"questions": (str, ...),
"answer": (bool, ...),
}
# multiple choice
from typing import Literal
multiple_choice_format = {
"questions": (str, ...),
"answer": (Literal['occupied', 'unoccupied'], ...),
"explanation": (str, ...),
}
# define the inference task and emphasize the output format in the prompt
multi_questions_prompt = '''
Question 1 - Is there any damage on the roof?
Question 2 - Is any window broken or boarded?
Question 3 - Is any door broken, missing, or boarded?
For each question, you have to respond in the following format:
yes (true) / no (false)
'''
multi_choice_prompt = '''
Does the house look occupied?
For each question, you have to respond in the following format:
'occupied' / 'unoccupied'
'''
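The `(type, ...)` tuples in the schemas above resemble the field-definition convention of pydantic's `create_model`, where `...` marks a field as required. How urban-worm consumes these dictionaries internally is not shown in this tutorial, so the following is only a minimal standard-library sketch of what such a schema constrains. `check_record` is a hypothetical helper written for illustration, not part of urban-worm.

```python
from typing import Literal, get_args, get_origin

# Schemas copied from the tutorial: field name -> (type, ...), where ... means "required"
bool_format = {
    "questions": (str, ...),
    "answer": (bool, ...),
}
multiple_choice_format = {
    "questions": (str, ...),
    "answer": (Literal['occupied', 'unoccupied'], ...),
    "explanation": (str, ...),
}


def check_record(record: dict, schema: dict) -> list:
    """Hypothetical helper: list the ways one answer record violates a schema."""
    problems = []
    for field, (expected, default) in schema.items():
        if field not in record:
            if default is ...:
                problems.append(f"missing required field: {field}")
            continue
        value = record[field]
        if get_origin(expected) is Literal:
            # Literal types restrict the value to a fixed set of choices
            if value not in get_args(expected):
                problems.append(f"{field}: {value!r} not in {get_args(expected)}")
        elif not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


good = {
    "questions": "Does the house look occupied?",
    "answer": "unoccupied",
    "explanation": "The porch area appears empty.",
}
bad = {"questions": "Does the house look occupied?", "answer": "maybe"}

print(check_record(good, multiple_choice_format))
print(check_record(bad, multiple_choice_format))
```

The first record satisfies the multiple-choice schema, while the second is rejected both for an answer outside the allowed choices and for the missing explanation. This is the kind of constraint a structured-output schema is meant to enforce on the model's reply.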
We will be using three street views that capture a single residential property from different angles:
1 One-time inference¶
1.1 Ollama¶
# build constructor
# All three images passed to the constructor are used together in a single inference
data = InferenceOllama(llm='hf.co/ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
image=["./data/img_1.jpg",
"./data/img_2.jpg",
"./data/img_3.jpg",],
schema=normal_format)
# inference
result = data.one_inference(prompt='what is the color of the house?')
result
| | questions1 | answer1 | data |
|---|---|---|---|
| 0 | What is the color of the house? | The house in each image appears to be light-co... | [./data/img_1.jpg, ./data/img_2.jpg, ./data/im... |
result['answer1'][0]
"The images depict a two-story house with white siding and multiple windows. The yard appears to be fenced, and there's an assortment of items near the entrance such as trash bins and possibly gardening tools. There is also a sidewalk leading up to the front door."
# an image can also be passed directly to one_inference for a single call
data.schema = bool_format # replace the output format
result = data.one_inference(prompt=multi_questions_prompt,
image="./data/img_1.jpg")
result
| | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg] |
# multiple choice
data.schema = multiple_choice_format # replace the output format
result = data.one_inference(prompt=multi_choice_prompt,
image="./data/img_1.jpg")
result
| | questions1 | answer1 | explanation1 | data |
|---|---|---|---|---|
| 0 | Does the house look occupied? | unoccupied | The porch area appears empty and there are no ... | [./data/img_1.jpg] |
1.2 Llama.cpp¶
# build constructor
data = InferenceLlamacpp(
# if the model and mmproj files are already downloaded,
# you can specify the paths to them directly in the constructor, for example:
# llm = "model/InternVL3-8B-Instruct-Q8_0.gguf"
# mp = "model/mmproj-InternVL3-8B-Instruct-Q8_0.gguf"
# you can also provide the model's Hugging Face repo id and its quantization directly:
llm='ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
image=["./data/img_1.jpg",
"./data/img_2.jpg",
"./data/img_3.jpg",], # all three images passed to the constructor are used together in a single inference
# schema=normal_format
)
# inference
result = data.one_inference(prompt='what is the color of the house?')
result
| | questions1 | answer1 | data |
|---|---|---|---|
| 0 | What is the color of the house? | The house in each image appears to be light-co... | [./data/img_1.jpg, ./data/img_2.jpg, ./data/im... |
# single image inference
data.schema = bool_format
result = data.one_inference(prompt=multi_questions_prompt, image="./data/img_1.jpg")
result
| | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg] |
# multiple choice
data.schema = multiple_choice_format # replace the output format
result = data.one_inference(prompt=multi_choice_prompt,
image="./data/img_1.jpg")
result
| | questions1 | answer1 | explanation1 | data |
|---|---|---|---|---|
| 0 | Does the house look occupied? | unoccupied | The porch area appears empty and there are no ... | [./data/img_1.jpg] |
1.3 Unsloth¶
InferenceUnsloth runs a small VLM locally via Unsloth's FastVisionModel. Compared to Ollama / llama.cpp it typically delivers 2–4× faster inference on a CUDA GPU, and the batch_inference(..., batch_size=N) argument lets you process multiple images in a single forward pass for further speedup.
Tested small VLM checkpoints:
- unsloth/Qwen3-VL-3B-Instruct: fastest, lowest VRAM
- unsloth/Qwen3-VL-8B-Instruct: strongest 8B-class
- unsloth/gemma-3-4b-it: Gemma 3 multimodal, balanced
- unsloth/Qwen2-VL-2B-Instruct
- unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit
Audio inference is not supported by Unsloth.
# build constructor
# All three images are passed to a single inference call
data = InferenceUnsloth(
llm='unsloth/Qwen3-VL-3B-Instruct',
load_in_4bit=True,
image=["./data/img_1.jpg",
"./data/img_2.jpg",
"./data/img_3.jpg"],
schema=normal_format,
)
result = data.one_inference(prompt='what is the color of the house?')
result
# single-image inference with the boolean schema
data.schema = bool_format
result = data.one_inference(prompt=multi_questions_prompt,
image="./data/img_1.jpg")
result
# multiple choice
data.schema = multiple_choice_format
result = data.one_inference(prompt=multi_choice_prompt,
image="./data/img_1.jpg")
result
2 Batched inference with multiple-image input¶
To run batched inference with multi-image input, pack the image paths into a nested list or tuple: each inner list is one inference item, and all images in it are passed to the model together.
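When there are many images, the nested list can be built programmatically rather than typed by hand. The sketch below is one possible way to do it with a sliding window. `pack_images` is a hypothetical helper, not an urban-worm function, and the window-based grouping is an assumption chosen to reproduce the overlapping pairs used in this tutorial.

```python
from typing import List, Sequence


def pack_images(paths: Sequence[str], group_size: int, step: int = 1) -> List[List[str]]:
    """Hypothetical helper: slide a window of `group_size` over `paths`.

    Only full windows are returned; trailing images that cannot fill
    a complete window are left out.
    """
    if group_size < 1 or step < 1:
        raise ValueError("group_size and step must be >= 1")
    last_start = len(paths) - group_size
    return [list(paths[i:i + group_size]) for i in range(0, last_start + 1, step)]


paths = ["./data/img_1.jpg", "./data/img_2.jpg", "./data/img_3.jpg"]
groups = pack_images(paths, group_size=2)
print(groups)
```

With `group_size=2` and the default `step=1`, the three street views yield the two overlapping pairs `[img_1, img_2]` and `[img_2, img_3]`. The resulting nested list can be assigned to `data.imgs` in the same way as a hand-written one.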
2.1 Ollama¶
data = InferenceOllama(llm='hf.co/ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
schema=bool_format)
data.imgs = [
["./data/img_1.jpg",
"./data/img_2.jpg",],
["./data/img_2.jpg",
"./data/img_3.jpg",]
]
# uncomment the code below to do batched single-image inference
# data.imgs = [
# ["./data/img_1.jpg",
# "./data/img_2.jpg",
# "./data/img_3.jpg",]
# ]
data.batch_inference(prompt=multi_questions_prompt)
Processing...: 100%|█████████████████████████| 2/2 [00:23<00:00, 11.56s/it]
| | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg, ./data/img_2.jpg] |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_2.jpg, ./data/img_3.jpg] |
data.results
{'responses': [[QnA(questions='Is there any damage on the roof?', answer=False),
QnA(questions='Is any window broken or boarded?', answer=False),
QnA(questions='Is any door broken, missing, or boarded?', answer=True)],
[QnA(questions='Is there any damage on the roof?', answer=False),
QnA(questions='Is any window broken or boarded?', answer=False),
QnA(questions='Is any door broken, missing, or boarded?', answer=True)]],
'data': [['./data/img_1.jpg', './data/img_2.jpg'],
['./data/img_2.jpg', './data/img_3.jpg']]}
data.df
| | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg, ./data/img_2.jpg] |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_2.jpg, ./data/img_3.jpg] |
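The relationship between the nested `data.results` dictionary and the flat `data.df` table shown above can be sketched in plain Python. This is an illustration of the reshaping, not urban-worm's actual implementation: the `QnA` dataclass is a stand-in for the objects printed in `data.results`, and `results_to_rows` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class QnA:
    """Stand-in for the QnA objects printed in data.results (assumption)."""
    questions: str
    answer: bool


def results_to_rows(results: dict) -> list:
    """Hypothetical helper: one flat dict per inference item, numbered per question."""
    rows = []
    for responses, images in zip(results["responses"], results["data"]):
        row = {}
        for i, qna in enumerate(responses, start=1):
            row[f"questions{i}"] = qna.questions
            row[f"answer{i}"] = qna.answer
        row["data"] = images
        rows.append(row)
    return rows


# Same structure and values as the data.results output above
answers = [
    QnA("Is there any damage on the roof?", False),
    QnA("Is any window broken or boarded?", False),
    QnA("Is any door broken, missing, or boarded?", True),
]
results = {
    "responses": [list(answers), list(answers)],
    "data": [
        ["./data/img_1.jpg", "./data/img_2.jpg"],
        ["./data/img_2.jpg", "./data/img_3.jpg"],
    ],
}

rows = results_to_rows(results)
for row in rows:
    print(row)
```

Each row carries the numbered question and answer columns followed by the list of images used for that item, mirroring the table layout. If pandas is available, `pandas.DataFrame(rows)` would turn these rows into a comparable frame.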
2.2 Llama.cpp¶
data = InferenceLlamacpp(llm='ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0', schema=bool_format)
# pack images in a nested list to batch multiple-image inference
data.imgs = [
["./data/img_1.jpg",
"./data/img_2.jpg",],
["./data/img_2.jpg",
"./data/img_3.jpg",]
]
# uncomment the code below to batch single-image inference
# data.imgs = [
# ["./data/img_1.jpg",
# "./data/img_2.jpg",
# "./data/img_3.jpg",]
# ]
data.batch_inference(prompt=multi_questions_prompt)
Processing...: 100%|█████████████████████████| 2/2 [00:16<00:00, 8.16s/it]
| | questions_1 | answer_1 | questions_2 | answer_2 | questions_3 | answer_3 | data_1 | data_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_1.jpg | ./data/img_2.jpg |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_2.jpg | ./data/img_3.jpg |
data.df
| | questions_1 | answer_1 | questions_2 | answer_2 | questions_3 | answer_3 | data_1 | data_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_1.jpg | ./data/img_2.jpg |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_2.jpg | ./data/img_3.jpg |
2.3 Unsloth¶
batch_inference accepts a batch_size argument. Setting it above 1 groups multiple items into a single GPU forward pass, which is typically the biggest single throughput win. A practical sweet spot for 7-8B VLMs on a 24GB GPU is batch_size=4–8; smaller models (3-4B) can go higher.
data = InferenceUnsloth(llm='unsloth/Qwen3-VL-3B-Instruct',
load_in_4bit=True,
schema=bool_format)
# pack images in a nested list to batch multiple-image inference
data.imgs = [
["./data/img_1.jpg", "./data/img_2.jpg"],
["./data/img_2.jpg", "./data/img_3.jpg"],
]
# uncomment for batched single-image inference
# data.imgs = ["./data/img_1.jpg", "./data/img_2.jpg", "./data/img_3.jpg"]
# batch_size=2 -> the two items above run in one forward pass
data.batch_inference(prompt=multi_questions_prompt, batch_size=2)
data.df
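The effect of batch_size on the number of forward passes can be pictured with a small chunking sketch. How InferenceUnsloth groups items internally is not shown in this tutorial, so `plan_batches` is a hypothetical helper that only illustrates the arithmetic: n items with a given batch_size need ceil(n / batch_size) passes.

```python
import math


def plan_batches(items: list, batch_size: int) -> list:
    """Hypothetical helper: split items into consecutive chunks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# The two multi-image items from the example above
items = [
    ["./data/img_1.jpg", "./data/img_2.jpg"],
    ["./data/img_2.jpg", "./data/img_3.jpg"],
]
batches = plan_batches(items, batch_size=2)
print(len(batches))  # both items fit in one chunk

# Ten items with batch_size=4 need three chunks; the last one is partial
ten = plan_batches(list(range(10)), batch_size=4)
print(len(ten), math.ceil(10 / 4))
```

Under this picture, raising batch_size reduces the number of passes until GPU memory becomes the limit, which is why larger models call for smaller values.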