Basic usage of inference module for image-based inference¶

In this tutorial, we will use the inference module from urban-worm, which supports three backends for running multimodal LLMs (MLLMs): Ollama (built on top of llama.cpp), llama.cpp, and Unsloth. We showcase inference with single and multiple images, using InternVL3 with Ollama and llama.cpp and Qwen3-VL with Unsloth.

Three types of output schema will be demonstrated:

plain text generation
multiple questions with binary answers
multiple choices

In [1]:

from urbanworm.inference.llama import InferenceLlamacpp, InferenceOllama
# Optional fast local VLM backend (requires `pip install "urban-worm[unsloth]"`)
from urbanworm import InferenceUnsloth

First, let's define the output schemas and the prompts used in the inference tasks.

In [2]:

# define the schema for model output

# this is the default built-in schema for plain text generation
normal_format = {
    "questions": (str, ...),
    "answer": (str, ...),
}

# binary answer
bool_format = {
    "questions": (str, ...),
    "answer": (bool, ...),
}

# multiple choice
from typing import Literal
multiple_choice_format = {
    "questions": (str, ...),
    "answer": (Literal['occupied', 'unoccupied'], ...),
    "explanation": (str, ...),
}

# define the inference task and emphasize the output format in the prompt
multi_questions_prompt =  '''
    Question 1 - Is there any damage on the roof?
    Question 2 - Is any window broken or boarded?
    Question 3 - Is any door broken, missing, or boarded?

    For each question, you have to respond in the following format:
    yes (true) / no (false)
'''

multi_choice_prompt = '''
    Does the house look occupied?
    For each question, you have to respond in the following format:
    'occupied' / 'unoccupied'
'''
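
Each schema entry above has the form name: (type, default). This appears to follow the convention used by pydantic's create_model, where `...` marks a required field; that reading is an assumption, not something the tutorial states. As a rough illustration of what these constraints mean for a model's answer, the sketch below checks a record against a schema dict using only the standard library. The `conforms` helper is hypothetical and is not part of urban-worm, which may validate output differently.

```python
from typing import Any, Literal, get_args, get_origin

# Same shape as the schema dictionaries defined above: name -> (type, default).
# Assumption: `...` (Ellipsis) marks a required field.
bool_format = {
    "questions": (str, ...),
    "answer": (bool, ...),
}
multiple_choice_format = {
    "questions": (str, ...),
    "answer": (Literal["occupied", "unoccupied"], ...),
    "explanation": (str, ...),
}


def conforms(record: dict[str, Any], schema: dict[str, tuple]) -> bool:
    """Hypothetical helper: check one answer record against a schema dict."""
    for name, (expected, default) in schema.items():
        if name not in record:
            # A missing field is only acceptable when it is not required.
            if default is ...:
                return False
            continue
        value = record[name]
        if get_origin(expected) is Literal:
            # Literal[...] restricts the value to the listed choices.
            if value not in get_args(expected):
                return False
        elif not isinstance(value, expected):
            return False
    return True


# A boolean answer satisfies bool_format.
print(conforms({"questions": "Is any window broken or boarded?", "answer": False},
               bool_format))  # True
# "maybe" is not one of the allowed choices.
print(conforms({"questions": "Does the house look occupied?", "answer": "maybe",
                "explanation": "unclear"}, multiple_choice_format))  # False
```

This is why the prompts restate the expected answer format: the schema constrains the structure, while the prompt tells the model which values are meaningful.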

We will be using three street view images (./data/img_1.jpg, ./data/img_2.jpg, and ./data/img_3.jpg) that capture a single residential property from different angles.

1 One-time inference¶

1.1 Ollama¶

In [3]:

# build the constructor
# all three images passed to the constructor are used together in a single inference
data = InferenceOllama(llm='hf.co/ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
                       image=["./data/img_1.jpg",
                              "./data/img_2.jpg",
                              "./data/img_3.jpg",],
                       schema=normal_format)
# inference
result = data.one_inference(prompt='what is the color of the house?')
result

Out[3]:

	questions1	answer1	data
0	What is the color of the house?	The house in each image appears to be light-co...	[./data/img_1.jpg, ./data/img_2.jpg, ./data/im...

In [4]:

result['answer1'][0]

Out[4]:

"The images depict a two-story house with white siding and multiple windows. The yard appears to be fenced, and there's an assortment of items near the entrance such as trash bins and possibly gardening tools. There is also a sidewalk leading up to the front door."

In [5]:

# an image can also be passed directly to one_inference for a single call
data.schema = bool_format # replace the output format
result = data.one_inference(prompt=multi_questions_prompt,
                            image="./data/img_1.jpg")
result

Out[5]:

	questions1	answer1	questions2	answer2	questions3	answer3	data
0	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	True	[./data/img_1.jpg]

In [16]:

# multiple choice
data.schema = multiple_choice_format # replace the output format
result = data.one_inference(prompt=multi_choice_prompt,
                            image="./data/img_1.jpg")
result

Out[16]:

	questions1	answer1	explanation1	data
0	Does the house look occupied?	unoccupied	The porch area appears empty and there are no ...	[./data/img_1.jpg]

1.2 Llama.cpp¶

In [10]:

# build constructor
data = InferenceLlamacpp(
    # if the model and mmproj files are already downloaded,
    # you can pass their paths directly to the constructor, for example:
    # llm = "model/InternVL3-8B-Instruct-Q8_0.gguf"
    # mp = "model/mmproj-InternVL3-8B-Instruct-Q8_0.gguf"

    # you can also just provide the model's Hugging Face repo id and its quant directly:
    llm='ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
    image=["./data/img_1.jpg",
           "./data/img_2.jpg",
           "./data/img_3.jpg",], # All these three images in constructor will be used together for the inference
    # schema=normal_format
)

In [14]:

# inference
result = data.one_inference(prompt='what is the color of the house?')
result

Out[14]:

	questions1	answer1	data
0	What is the color of the house?	The house in each image appears to be light-co...	[./data/img_1.jpg, ./data/img_2.jpg, ./data/im...

In [18]:

# single image inference
data.schema = bool_format
result = data.one_inference(prompt=multi_questions_prompt, image="./data/img_1.jpg")
result

Out[18]:

	questions1	answer1	questions2	answer2	questions3	answer3	data
0	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	True	[./data/img_1.jpg]

In [17]:

# multiple choice
data.schema = multiple_choice_format # replace the output format
result = data.one_inference(prompt=multi_choice_prompt,
                            image="./data/img_1.jpg")
result

Out[17]:

	questions1	answer1	explanation1	data
0	Does the house look occupied?	unoccupied	The porch area appears empty and there are no ...	[./data/img_1.jpg]

1.3 Unsloth¶

InferenceUnsloth runs a small VLM locally via Unsloth's FastVisionModel. Compared to Ollama / llama.cpp it typically delivers 2–4× faster inference on a CUDA GPU, and the batch_inference(..., batch_size=N) argument lets you process multiple images in a single forward pass for further speedup.

Tested small VLM checkpoints:

unsloth/Qwen3-VL-3B-Instruct — fastest, lowest VRAM
unsloth/Qwen3-VL-8B-Instruct — strongest 8B-class
unsloth/gemma-3-4b-it — Gemma 3 multimodal, balanced
unsloth/Qwen2-VL-2B-Instruct
unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit

Audio inference is not supported by Unsloth.

In [ ]:

# build constructor
# All three images are passed to a single inference call
data = InferenceUnsloth(
    llm='unsloth/Qwen3-VL-3B-Instruct',
    load_in_4bit=True,
    image=["./data/img_1.jpg",
           "./data/img_2.jpg",
           "./data/img_3.jpg"],
    schema=normal_format,
)
result = data.one_inference(prompt='what is the color of the house?')
result

In [ ]:

# single-image inference with the boolean schema
data.schema = bool_format
result = data.one_inference(prompt=multi_questions_prompt,
                            image="./data/img_1.jpg")
result

In [ ]:

# multiple choice
data.schema = multiple_choice_format
result = data.one_inference(prompt=multi_choice_prompt,
                            image="./data/img_1.jpg")
result

2 Batched inference with multiple-image input¶

To run batched inference with multi-image input, pack the image paths into a nested list or tuple. Each inner list is one inference item whose images are processed together.
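
When the image groups follow a regular pattern, the nested list can be built programmatically instead of by hand. The sketch below pairs each image with the next one, which reproduces the grouping used in the cells that follow. The `consecutive_pairs` helper is hypothetical and not part of urban-worm.

```python
# Hypothetical helper, not part of urban-worm: group consecutive images
# into overlapping pairs so that each pair becomes one inference item.
def consecutive_pairs(paths: list[str]) -> list[list[str]]:
    return [list(pair) for pair in zip(paths, paths[1:])]


paths = ["./data/img_1.jpg", "./data/img_2.jpg", "./data/img_3.jpg"]
items = consecutive_pairs(paths)
print(items)
# [['./data/img_1.jpg', './data/img_2.jpg'], ['./data/img_2.jpg', './data/img_3.jpg']]
```

The resulting `items` list has the same shape as the nested list assigned to `data.imgs` in the examples of this section.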

2.1 Ollama¶

In [4]:

data = InferenceOllama(llm='hf.co/ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
                       schema=bool_format)
data.imgs = [
    ["./data/img_1.jpg",
     "./data/img_2.jpg",],
    ["./data/img_2.jpg",
     "./data/img_3.jpg",]
]

# uncomment the code below to do batched single-image inference
# data.imgs = [
#     ["./data/img_1.jpg",
#      "./data/img_2.jpg",
#      "./data/img_3.jpg",]
# ]

data.batch_inference(prompt=multi_questions_prompt)

Processing...: 100%|█████████████████████████| 2/2 [00:23<00:00, 11.56s/it]

Out[4]:

	questions1	answer1	questions2	answer2	questions3	answer3	data
0	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	True	[./data/img_1.jpg, ./data/img_2.jpg]
1	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	True	[./data/img_2.jpg, ./data/img_3.jpg]

In [5]:

data.results

Out[5]:

{'responses': [[QnA(questions='Is there any damage on the roof?', answer=False),
   QnA(questions='Is any window broken or boarded?', answer=False),
   QnA(questions='Is any door broken, missing, or boarded?', answer=True)],
  [QnA(questions='Is there any damage on the roof?', answer=False),
   QnA(questions='Is any window broken or boarded?', answer=False),
   QnA(questions='Is any door broken, missing, or boarded?', answer=True)]],
 'data': [['./data/img_1.jpg', './data/img_2.jpg'],
  ['./data/img_2.jpg', './data/img_3.jpg']]}
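
The nested structure above relates to the wide table in a simple way: each item becomes one row, and the fields of the i-th response get the suffix i. The sketch below reproduces that mapping with plain Python for illustration only. The `QnA` dataclass is a stand-in for the response objects shown above, and `to_rows` is a hypothetical helper; urban-worm builds its own DataFrame and may do so differently.

```python
from dataclasses import dataclass, fields


@dataclass
class QnA:  # stand-in for the response objects shown above (assumption)
    questions: str
    answer: bool


results = {
    "responses": [
        [QnA("Is there any damage on the roof?", False),
         QnA("Is any window broken or boarded?", False)],
    ],
    "data": [["./data/img_1.jpg", "./data/img_2.jpg"]],
}


def to_rows(results: dict) -> list[dict]:
    """Hypothetical helper: one wide row per item, fields suffixed by question index."""
    rows = []
    for responses, images in zip(results["responses"], results["data"]):
        row = {}
        for i, qna in enumerate(responses, start=1):
            for f in fields(qna):
                row[f"{f.name}{i}"] = getattr(qna, f.name)
        row["data"] = images
        rows.append(row)
    return rows


rows = to_rows(results)
print(list(rows[0]))
# ['questions1', 'answer1', 'questions2', 'answer2', 'data']
```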

In [6]:

data.df

Out[6]:

	questions1	answer1	questions2	answer2	questions3	answer3	data
0	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	True	[./data/img_1.jpg, ./data/img_2.jpg]
1	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	True	[./data/img_2.jpg, ./data/img_3.jpg]

2.2 Llama.cpp¶

In [3]:

data = InferenceLlamacpp(llm='ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0', schema=bool_format)
# pack images in a nested list to batch multiple-image inference
data.imgs = [
    ["./data/img_1.jpg",
     "./data/img_2.jpg",],
    ["./data/img_2.jpg",
     "./data/img_3.jpg",]
]

# uncomment the code below to batch single-image inference
# data.imgs = [
#     ["./data/img_1.jpg",
#      "./data/img_2.jpg",
#      "./data/img_3.jpg",]
# ]

data.batch_inference(prompt=multi_questions_prompt)

Processing...: 100%|█████████████████████████| 2/2 [00:16<00:00,  8.16s/it]

Out[3]:

	questions_1	answer_1	questions_2	answer_2	questions_3	answer_3	data_1	data_2
0	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	False	./data/img_1.jpg	./data/img_2.jpg
1	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	False	./data/img_2.jpg	./data/img_3.jpg

In [4]:

data.df

Out[4]:

	questions_1	answer_1	questions_2	answer_2	questions_3	answer_3	data_1	data_2
0	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	False	./data/img_1.jpg	./data/img_2.jpg
1	Is there any damage on the roof?	False	Is any window broken or boarded?	False	Is any door broken, missing, or boarded?	False	./data/img_2.jpg	./data/img_3.jpg

2.3 Unsloth¶

batch_inference accepts a batch_size argument. Setting it above 1 groups multiple items into a single GPU forward pass, which is typically the largest single throughput gain. A practical sweet spot for 7-8B VLMs on a 24GB GPU is batch_size=4–8; smaller models (3-4B) can go higher.
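
To make the effect of batch_size concrete, the sketch below shows how a list of items could be split into groups of at most batch_size, with one forward pass per group. It is a generic illustration of chunking, not Unsloth's or urban-worm's actual implementation; the `chunk` helper and the item names are hypothetical.

```python
from math import ceil


def chunk(items: list, batch_size: int) -> list[list]:
    """Hypothetical helper: split items into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


items = [f"item_{i}" for i in range(10)]
batches = chunk(items, batch_size=4)

# 10 items with batch_size=4 give 3 groups instead of 10 separate passes.
print([len(b) for b in batches])  # [4, 4, 2]
print(len(batches) == ceil(len(items) / 4))  # True
```

A larger batch_size lowers the number of passes but raises peak memory use, which is why the workable value depends on model size and available VRAM.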

In [ ]:

data = InferenceUnsloth(llm='unsloth/Qwen3-VL-3B-Instruct',
                        load_in_4bit=True,
                        schema=bool_format)
# pack images in a nested list to batch multiple-image inference
data.imgs = [
    ["./data/img_1.jpg", "./data/img_2.jpg"],
    ["./data/img_2.jpg", "./data/img_3.jpg"],
]

# uncomment for batched single-image inference
# data.imgs = ["./data/img_1.jpg", "./data/img_2.jpg", "./data/img_3.jpg"]

# batch_size=2 -> the two items above run in one forward pass
data.batch_inference(prompt=multi_questions_prompt, batch_size=2)

In [ ]:

data.df