Inference Backends¶

All backends share the same `one_inference` / `batch_inference` interface defined in `urbanworm.inference.Inference.Inference`.
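Because the backends expose the same two entry points, calling code can be written once against that shape. The sketch below illustrates the idea with `typing.Protocol`; `Backend`, `EchoBackend`, and `describe_all` are hypothetical names invented for this example, and the stand-in backend returns plain dictionaries rather than the DataFrames the real classes produce.

```python
from typing import Any, Protocol


class Backend(Protocol):
    """Structural type covering the two shared entry points."""

    def one_inference(self, system: str = "", prompt: str = "", image: Any = None) -> Any: ...

    def batch_inference(self, system: str = "", prompt: str = "") -> Any: ...


class EchoBackend:
    """Hypothetical stand-in backend; it only echoes its inputs."""

    def __init__(self, images: list[str]) -> None:
        self.images = images

    def one_inference(self, system: str = "", prompt: str = "", image: Any = None) -> dict:
        return {"data": image, "responses": [prompt]}

    def batch_inference(self, system: str = "", prompt: str = "") -> list[dict]:
        return [self.one_inference(system, prompt, img) for img in self.images]


def describe_all(backend: Backend, prompt: str) -> Any:
    """Works with any object that exposes the shared interface."""
    return backend.batch_inference(prompt=prompt)


rows = describe_all(EchoBackend(["a.jpg", "b.jpg"]), "Is there a sidewalk?")
```

Under this assumption, swapping one backend for another would only change the constructor call, not the code that consumes the results.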

Unsloth (recommended)¶

GPU-accelerated local VLM inference via Unsloth's FastVisionModel. Automatically spreads the model across all visible GPUs and retries OOM-failed chunks item-by-item.

Install: pip install "urban-worm[unsloth]" (pre-install CUDA torch first)

`urbanworm.inference.unsloth.InferenceUnsloth(llm=None, load_in_4bit=True, max_seq_length=4096, device=None, max_memory=None, model_dir=None, dtype=None, disable_compile=True, skip_errors=True, **kwargs)` ¶

Bases: Inference

Vision-language model inference via Unsloth's FastVisionModel.

Parameters:

Name	Type	Description	Default
`llm`	`str \| None`	Unsloth-compatible model id or local path. Defaults to `unsloth/Qwen3-VL-3B-Instruct`.	`None`
`load_in_4bit`	`bool`	Load weights in 4-bit (bitsandbytes). Big VRAM win, small quality cost. Default `True`.	`True`
`max_seq_length`	`int`	Maximum tokenized prompt+generation length passed to `FastVisionModel.from_pretrained`. Default 4096.	`4096`
`device`	`str \| None`	Override the device map string (e.g. `"cuda:0"`, `"auto"`). `None` (default) auto-detects: uses `device_map="auto"` when multiple CUDA GPUs are present so the model is spread across all of them, otherwise falls back to a single GPU or CPU.	`None`
`max_memory`	`dict \| None`	Per-device VRAM budget passed to `from_pretrained` as `max_memory`. `None` (default) auto-computes 90 % of each GPU's total capacity when multi-GPU is detected. Example: `{0: "10GiB", 1: "10GiB"}`.	`None`
`model_dir`	`str \| None`	Local directory where HuggingFace model weights are cached. Passed as `cache_dir` to `FastVisionModel.from_pretrained`. `None` (default) uses the HuggingFace default (`~/.cache/huggingface/hub` or the `HF_HOME` env var).	`None`
`dtype`	`Any`	Override the compute dtype. `None` = auto.	`None`
`disable_compile`	`bool`	Disable Unsloth / Torch auto-compilation before the model is loaded. Default `True`. Strongly recommended for production or large-scale inference jobs: it prevents the `AlignDevicesHook` / Torch-Dynamo recompile crashes that occur when Accelerate device hooks conflict with compiled code. Sets `UNSLOTH_COMPILE_DISABLE=1`, `UNSLOTH_DISABLE_FAST_GENERATION=1`, and `TORCH_COMPILE_DISABLE=1` via `configure_runtime`.	`True`
`skip_errors`	`bool`	If `True` (default), schema-validation failures yield an empty `Response(responses=[])` so batch loops continue instead of crashing. Mirrors `InferenceOllama`.	`True`
`**kwargs`		Forwarded to `Inference` (`image`, `images`, `audio`, `audios`, `geo_tagged_data`, `schema`).	`{}`
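The `device` and `max_memory` defaults described in the table can be pictured as a small planning step. The following is a minimal sketch under stated assumptions, not the library's code: `plan_placement` is a hypothetical helper, the single-GPU result `"cuda:0"` is a guess at what "a single GPU" means in practice, and per-GPU capacities are passed in as plain integers so the example runs without a GPU.

```python
GIB = 1024 ** 3


def plan_placement(gpu_total_bytes: list[int], fraction: float = 0.9):
    """Return (device_map, max_memory) for the given per-GPU capacities.

    Hypothetical helper that mirrors the documented defaults only.
    """
    if not gpu_total_bytes:
        return "cpu", None            # no CUDA device: CPU fallback
    if len(gpu_total_bytes) == 1:
        return "cuda:0", None         # single GPU: no sharding budget needed
    # Multi-GPU: shard with device_map="auto" and cap each GPU at `fraction`.
    budget = {
        idx: f"{int(total * fraction) // GIB}GiB"
        for idx, total in enumerate(gpu_total_bytes)
    }
    return "auto", budget


print(plan_placement([24 * GIB, 24 * GIB]))  # ('auto', {0: '21GiB', 1: '21GiB'})
```

With PyTorch installed, the capacities could be read from `torch.cuda.get_device_properties(i).total_memory`. Passing an explicit `max_memory` dictionary to the constructor overrides any such automatic budget.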

Source code in urbanworm/inference/unsloth.py

def __init__(
    self,
    llm: str | None = None,
    load_in_4bit: bool = True,
    max_seq_length: int = 4096,
    device: str | None = None,
    max_memory: dict | None = None,
    model_dir: str | None = None,
    dtype: Any = None,
    disable_compile: bool = True,
    skip_errors: bool = True,
    **kwargs,
) -> None:
    # Configure runtime env-vars before any unsloth/torch import happens.
    configure_runtime(disable_compile)
    super().__init__(**kwargs)
    self.llm = llm or self.DEFAULT_MODEL
    self.load_in_4bit = load_in_4bit
    self.max_seq_length = max_seq_length
    self.device = device
    self.max_memory = max_memory
    self.model_dir = model_dir
    self.dtype = dtype
    self.disable_compile = disable_compile
    self.skip_errors = skip_errors
    self._model = None
    self._processor = None
    self._model_dtype = None   # set by _ensure_loaded; cached to avoid repeated next(parameters())

Functions¶

`one_inference(system='', prompt='', image=None, audio=None, temp=0.0, top_k=20, top_p=0.8, max_new_tokens=512)` ¶

Run inference on a single image (or list of images for one prompt).

Returns a one-row DataFrame in the same shape as `InferenceOllama.one_inference`.

Source code in urbanworm/inference/unsloth.py

def one_inference(
    self,
    system: str = "",
    prompt: str = "",
    image: str | list | tuple = None,
    audio: str | list | tuple = None,
    temp: float = 0.0,
    top_k: int = 20,
    top_p: float = 0.8,
    max_new_tokens: int = 512,
) -> pd.DataFrame:
    """Run inference on a single image (or list of images for one prompt).

    Returns a one-row DataFrame in the same shape as
    :meth:`InferenceOllama.one_inference`.
    """
    if audio is not None:
        raise NotImplementedError(
            "Unsloth VLMs do not currently support audio input."
        )
    self._ensure_loaded()

    img = image if image is not None else self.img
    if img is None:
        raise ValueError("No image provided to one_inference().")
    imgs = [img] if isinstance(img, str) else list(img)
    # If a list, validate it's flat str list
    if not all(isinstance(i, str) for i in imgs):
        raise TypeError("`image` must be a path/url/base64 string or a flat list of those.")

    schema = create_format(self.schema)
    # Single-prompt call: one batch of one (or one batch of N images for
    # a multi-image-per-prompt scenario).
    responses = self._generate_batch(
        systems=[system],
        prompts=[prompt],
        images_per_prompt=[imgs],
        schema=schema,
        temp=temp,
        top_k=top_k,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
    )
    dic = {"responses": [responses[0].responses], "data": [imgs]}
    try:
        return response2df(dic)
    except Exception as e:
        # Empty/malformed responses (e.g. skip_errors path) — return a
        # minimally-shaped frame instead of crashing.
        logger.warning("one_inference: response2df failed (%s); returning raw frame.", e)
        return pd.DataFrame({"responses": [responses[0].responses], "data": [imgs]})

`batch_inference(system='', prompt='', temp=0.0, top_k=20, top_p=0.8, max_new_tokens=512, batch_size=1, task_chunk_size=None, disableProgressBar=False, checkpoint_path=None, failed_log_path=None)` ¶

Run inference over self.batch_images with optional GPU batching.

Parameters:

Name	Type	Description	Default
`batch_size`	`int`	Number of items per `model.generate` call (the GPU batch). Larger values trade VRAM for throughput; sweet spot for 7–8 B VLMs on a 24 GB GPU is ~4–8.	`1`
`task_chunk_size`	`int \| None`	Logical job partition size, independent of `batch_size`. When set, the dataset is split into segments of this size and progress is reported at the task-chunk level, making long runs (e.g. 144 k samples) easier to monitor. `None` (default) disables task-level chunking. Example: `batch_size=4, task_chunk_size=1000` → ~145 task chunks, each processed internally in batches of 4.	`None`
`checkpoint_path`	`str \| None`	Path to a JSONL file for resume-safe checkpointing. Already-completed items are skipped on the next run.	`None`
`failed_log_path`	`str \| None`	Optional path to a CSV file where permanently failed sample indices and error messages are appended. Lets you rerun only the failures later.	`None`
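The relationship between `batch_size` and `task_chunk_size` is plain ceiling arithmetic over the items that remain after resuming. The sketch below reproduces that arithmetic; `plan_run` is a hypothetical helper, and the count of `model.generate` calls is a lower bound because retries after a failed batch add further calls.

```python
import math


def plan_run(n_items: int, batch_size: int = 1, task_chunk_size=None, start_idx: int = 0):
    """Return (n_task_chunks, n_generate_calls) for the remaining items."""
    remaining = max(0, n_items - start_idx)
    bs = max(1, int(batch_size))                 # batch sizes below 1 are clamped
    n_calls = math.ceil(remaining / bs)
    if task_chunk_size and task_chunk_size > 0:
        n_chunks = math.ceil(remaining / int(task_chunk_size))
    else:
        n_chunks = 1                             # task-level chunking disabled
    return n_chunks, n_calls


print(plan_run(144_000, batch_size=4, task_chunk_size=1000))  # (144, 36000)
```

For exactly 144,000 items this gives 144 task chunks of 250 batches each. Choosing a `task_chunk_size` that is a multiple of `batch_size` keeps chunk boundaries aligned with batch boundaries.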

Returns:

Type	Description
`DataFrame`	A DataFrame with the same shape as the output of `InferenceOllama.batch_inference`.
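Resume-safe checkpointing with a JSONL file amounts to appending one record per finished item and skipping recorded items on the next run. The sketch below shows that pattern with the standard library only. `append_record`, `load_records`, and `run` are hypothetical stand-ins for the library's own checkpoint helpers, the record layout (`idx`, `responses`, `data`) follows the source listing, and the "model" is simulated with `str.upper`. Unlike the source, which resumes from the number of stored records, this sketch keys on `idx`, an alternative that stays aligned even when some items were never recorded.

```python
import json
import os
import tempfile


def append_record(path: str, record: dict) -> None:
    """Append one completed item as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_records(path: str) -> list[dict]:
    """Read all intact records; a missing file means a fresh run."""
    if not os.path.exists(path):
        return []
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # stop at a line truncated by a crash
    return records


def run(items: list[str], path: str, stop_after=None) -> list[dict]:
    """Process items, skipping those already present in the checkpoint."""
    done = load_records(path)
    finished = {rec["idx"] for rec in done}
    for idx, item in enumerate(items):
        if idx in finished:
            continue
        if stop_after is not None and idx >= stop_after:
            break  # simulate an interrupted job
        rec = {"idx": idx, "responses": [item.upper()], "data": item}
        append_record(path, rec)
        done.append(rec)
    return sorted(done, key=lambda r: r["idx"])


ckpt = os.path.join(tempfile.mkdtemp(), "run.jsonl")
first = run(["a", "b", "c"], ckpt, stop_after=2)   # interrupted after two items
second = run(["a", "b", "c"], ckpt)                # picks up at the third item
```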

Source code in urbanworm/inference/unsloth.py

def batch_inference(
    self,
    system: str = "",
    prompt: str = "",
    temp: float = 0.0,
    top_k: int = 20,
    top_p: float = 0.8,
    max_new_tokens: int = 512,
    batch_size: int = 1,
    task_chunk_size: int | None = None,
    disableProgressBar: bool = False,
    checkpoint_path: str | None = None,
    failed_log_path: str | None = None,
) -> pd.DataFrame:
    """Run inference over ``self.batch_images`` with optional GPU batching.

    Args:
        batch_size: Number of items per ``model.generate`` call (the GPU
            batch).  Larger values trade VRAM for throughput; sweet spot
            for 7–8 B VLMs on a 24 GB GPU is ~4–8.
        task_chunk_size: Logical job partition size, independent of
            ``batch_size``.  When set, the dataset is split into segments
            of this size and progress is reported at the task-chunk level,
            making long runs (e.g. 144 k samples) easier to monitor.
            ``None`` (default) disables task-level chunking.
            Example: ``batch_size=4, task_chunk_size=1000`` → ~145 task
            chunks, each processed internally in batches of 4.
        checkpoint_path: Path to a JSONL file for resume-safe
            checkpointing.  Already-completed items are skipped on the
            next run.
        failed_log_path: Optional path to a CSV file where permanently
            failed sample indices and error messages are appended.  Lets
            you rerun only the failures later.

    Returns:
        DataFrame, same shape as :meth:`InferenceOllama.batch_inference`.
    """
    import csv

    import torch

    self._ensure_loaded()

    imgs = self.batch_images if self.batch_images is not None else self.imgs
    if not imgs:
        raise ValueError("No images to run inference on.")

    items: list[list[str]] = [
        [it] if isinstance(it, str) else list(it) for it in imgs
    ]
    schema = create_format(self.schema)
    bs = max(1, int(batch_size))
    n = len(items)

    # ── resume from checkpoint ───────────────────────────────────────
    done_records = load_inference_checkpoint(checkpoint_path) if checkpoint_path else []
    # Resume from exactly where we left off.  The old formula
    # ``(len // bs) * bs`` rounded down to the nearest batch boundary,
    # which caused the trailing partial batch to be re-processed on every
    # restart.  Since checkpoints are written per-item (not per-batch),
    # len(done_records) is always the exact number of completed items.
    start_idx = len(done_records)
    dic = restore_ollama_results(done_records)

    # ── task-chunk boundaries for progress reporting ─────────────────
    tcs = int(task_chunk_size) if task_chunk_size and task_chunk_size > 0 else None
    n_task_chunks = ((n - start_idx) + tcs - 1) // tcs if tcs else 1

    def _save_checkpoint(img_idx, responses):
        if not checkpoint_path:
            return
        try:
            responses_dump = [r.model_dump() for r in responses]
        except Exception:
            responses_dump = [dict(r) for r in responses] if responses else []
        append_inference_checkpoint(checkpoint_path, {
            "idx": img_idx,
            "responses": responses_dump,
            "data": imgs[img_idx] if isinstance(imgs[img_idx], str) else list(imgs[img_idx]),
        })

    def _log_failed(img_idx, error_msg):
        if not failed_log_path:
            return
        import os
        write_header = not os.path.exists(failed_log_path)
        with open(failed_log_path, "a", newline="", encoding="utf-8") as fh:
            w = csv.writer(fh)
            if write_header:
                w.writerow(["idx", "data", "error"])
            w.writerow([img_idx,
                         imgs[img_idx] if isinstance(imgs[img_idx], str) else str(imgs[img_idx]),
                         error_msg[:500]])

    task_chunk_idx = 0
    with tqdm(total=n - start_idx, desc="Processing", ncols=80,
              disable=disableProgressBar) as pbar:
        for start in range(start_idx, n, bs):
            # ── task-chunk boundary logging ──────────────────────────
            if tcs:
                rel = start - start_idx
                if rel % tcs == 0:
                    task_chunk_idx = rel // tcs
                    logger.info(
                        "Task chunk %d/%d: rows [%d, %d)",
                        task_chunk_idx + 1, n_task_chunks,
                        start, min(start + tcs, n),
                    )

            chunk = items[start:start + bs]
            responses_list = self._run_chunk_with_retry(
                chunk=chunk,
                start=start,
                system=system,
                prompt=prompt,
                schema=schema,
                temp=temp,
                top_k=top_k,
                top_p=top_p,
                max_new_tokens=max_new_tokens,
            )

            for k, (responses, err) in enumerate(responses_list):
                img_idx = start + k
                dic["responses"].append(responses)
                dic["data"].append(imgs[img_idx])
                if responses:
                    _save_checkpoint(img_idx, responses)
                if err is not None:
                    _log_failed(img_idx, err)

            # Proactively free reserved-but-unused allocator pool.
            try:
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
            except Exception:
                pass

            pbar.update(len(chunk))

    self.results = dic
    return self.to_df(output=True)
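The loop above delegates each chunk to `_run_chunk_with_retry`, which yields one `(responses, err)` pair per item. Its body is not shown on this page, so the following is only a sketch of the documented behaviour (retry an out-of-memory chunk item by item): `run_chunk_with_fallback` and `fake_generate` are hypothetical, and the built-in `MemoryError` stands in for whatever exception the real code catches, presumably `torch.cuda.OutOfMemoryError`.

```python
from typing import Callable, Optional


def run_chunk_with_fallback(
    chunk: list[str],
    generate: Callable[[list[str]], list[str]],
) -> list[tuple[Optional[str], Optional[str]]]:
    """Return one (result, error) pair per item, in input order."""
    try:
        return [(out, None) for out in generate(chunk)]
    except MemoryError:
        pass  # the whole batch did not fit; fall back to item-by-item
    results = []
    for item in chunk:
        try:
            results.append((generate([item])[0], None))
        except MemoryError as exc:
            results.append((None, f"OOM: {exc}"))  # permanent failure for this item
    return results


def fake_generate(batch: list[str]) -> list[str]:
    """Hypothetical model call: fails on batches > 1 and on the item 'huge'."""
    if len(batch) > 1 or batch[0] == "huge":
        raise MemoryError("out of memory")
    return [f"caption for {x}" for x in batch]


pairs = run_chunk_with_fallback(["a.jpg", "huge", "b.jpg"], fake_generate)
```

Returning a pair for every item, including failures, keeps the output positionally aligned with the input, which is what lets the caller compute `img_idx = start + k`.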

`to_df(output=True)` ¶

Convert `self.results` into a DataFrame. Mirrors `InferenceOllama.to_df`. Returns `None` if no results are available or if `output` is `False`.

Source code in urbanworm/inference/unsloth.py

def to_df(self, output: bool = True) -> pd.DataFrame | None:
    """Convert ``self.results`` into a DataFrame. Mirrors InferenceOllama."""
    if self.results is None:
        return None
    try:
        self.df = response2df(self.results)
    except Exception as e:
        logger.warning("to_df: response2df failed (%s); returning raw dict.", e)
        self.df = pd.DataFrame(self.results)
    return self.df if output else None

Ollama¶

Inference via a locally running Ollama server. No GPU required; any GGUF-backed vision model works.

Install: pip install "urban-worm[ollama]" + the Ollama app

`urbanworm.inference.llama.InferenceOllama(llm=None, ollama_key=None, model_dir=None, **kwargs)` ¶

Bases: Inference

Constructor for vision inference using MLLMs with Ollama.

Parameters:

Name	Type	Description	Default
`llm`	`str`	Name of the model checkpoint; it is pulled with `ollama.pull` before inference.	`None`
`ollama_key`	`str`	The Ollama API key.	`None`
`model_dir`	`str`	Directory where Ollama stores downloaded models. Sets the `OLLAMA_MODELS` environment variable before each `ollama.pull` call. Note: for the running Ollama server to store new downloads here, it must also have been started with `OLLAMA_MODELS` pointing to the same directory (e.g. `OLLAMA_MODELS=/data/models ollama serve`). If the server is already running with a different directory, this setting affects only where the client looks, not where the server saves.	`None`
`**kwargs`		Forwarded to `Inference`: `image` (str\|list[str]\|tuple[str]), `images` (list\|tuple), `geo_tagged_data` (GeoTaggedData), and `schema` (dict).	`{}`

Source code in urbanworm/inference/llama.py

def __init__(self,
             llm: str = None,
             ollama_key: str = None,
             model_dir: str | None = None,
             **kwargs) -> None:
    super().__init__(**kwargs)
    self.llm = llm
    self.skip_errors = True
    self.ollama_key = ollama_key
    self.model_dir = model_dir

Functions¶

`one_inference(system='', prompt='', image=None, audio=None, temp=0.0, top_k=20.0, top_p=0.8, max_new_tokens=512)` ¶

Chat with an MLLM about one image (or a flat list of images for a single prompt).

Parameters:

Name	Type	Description	Default
`system`	`str`	The system message.	`''`
`prompt`	`str`	The prompt message.	`''`
`image`	`str \| list[str] \| tuple[str]`	The image path.	`None`
`audio`	`str \| list[str] \| tuple[str]`	The audio path.	`None`
`temp`	`float`	The temperature value.	`0.0`
`top_k`	`int`	The top_k value.	`20.0`
`top_p`	`float`	The top_p value.	`0.8`
`max_new_tokens`	`int`	Maximum number of tokens to generate. Default 512.	`512`

Notes

Ollama does not currently support audio input. The `audio` argument is a placeholder for future development.

Returns:

Type	Description
`DataFrame`	A one-row DataFrame containing the responses and the input image(s).

Source code in urbanworm/inference/llama.py

def one_inference(self,
                  system: str = '',
                  prompt: str = '',
                  image: str | list[str] | tuple[str] = None,
                  audio: str | list[str] | tuple[str] = None,
                  temp: float = 0.0,
                  top_k: int = 20.0,
                  top_p: float = 0.8,
                  max_new_tokens: int = 512):

    '''
    Chat with MLLM model with one image.

    Args:
        system (str, optional): The system message.
        prompt (str): The prompt message.
        image (str | list[str] | tuple[str]): The image path.
        audio (str | list[str] | tuple[str]): The audio path.
        temp (float): The temperature value.
        top_k (int): The top_k value.
        top_p (float): The top_p value.
        max_new_tokens (int): Maximum number of tokens to generate.  Default 512.

    Notes:
        Ollama currently does not support audio input.
        The argument `audio` is just a placeholder for the future development.

    Returns:
        pd.DataFrame: A one-row DataFrame with the responses and the input image(s).
    '''

    ollama, _ = _lazy_ollama()
    if self.model_dir is not None:
        import os
        _prev_ollama_models = os.environ.get("OLLAMA_MODELS")
        os.environ["OLLAMA_MODELS"] = self.model_dir
        try:
            ollama.pull(self.llm, stream=True)
        finally:
            if _prev_ollama_models is None:
                os.environ.pop("OLLAMA_MODELS", None)
            else:
                os.environ["OLLAMA_MODELS"] = _prev_ollama_models
    else:
        ollama.pull(self.llm, stream=True)
    multiImg = False
    if image is None and audio is not None:
        # Audio is not supported by Ollama yet; fall through with the
        # path treated as an image so the user gets a clear error from
        # the model rather than a NameError here.
        image = audio
    if image is not None:
        img = image
    else:
        img = self.img
    if isinstance(img, list) or isinstance(img, tuple):
        if not isinstance(img[0], str):
            self.logger.warning("a list of images can only be a flatten list")
        multiImg = True
    else:
        img = [img]

    schema = create_format(self.schema)

    dic = {'responses': [], 'data': []}
    r = self._mtmd(model=self.llm,
                   system=system, prompt=prompt,
                   img=img,
                   temp=temp, top_k=top_k, top_p=top_p,
                   num_predict=max_new_tokens,
                   schema=schema,
                   one_shot_lr=[],
                   multiImgInput=multiImg)
    dic['responses'] += [r.responses]
    dic['data'] += [img]
    return response2df(dic)
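The save, set, and restore handling of `OLLAMA_MODELS` around `ollama.pull` appears in both inference methods. The same behaviour can be expressed once as a context manager; the sketch below is an illustration using only the standard library, and `temporary_env` is a hypothetical name that is not part of the package.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable for the duration of a `with` block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)   # variable did not exist before
        else:
            os.environ[name] = previous  # restore the caller's value


os.environ.pop("OLLAMA_MODELS", None)    # start from a known state for the demo
with temporary_env("OLLAMA_MODELS", "/data/models"):
    inside = os.environ["OLLAMA_MODELS"]  # the pull call would go here
after = os.environ.get("OLLAMA_MODELS")
```

As the `model_dir` description notes, this changes the environment of the Python client process only; a server that is already running keeps the directory it was started with.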

`batch_inference(system='', prompt='', temp=0.0, top_k=20, top_p=0.8, max_new_tokens=512, disableProgressBar=False, checkpoint_path=None)` ¶

Run the MLLM on each image (or image group) in the batch.

Parameters:

Name	Type	Description	Default
`system`	`str`	The system message.	`''`
`prompt`	`str`	The prompt message.	`''`
`temp`	`float`	The temperature value.	`0.0`
`top_k`	`int`	The top_k value.	`20`
`top_p`	`float`	The top_p value.	`0.8`
`max_new_tokens`	`int`	Maximum number of tokens to generate per call. Default 512.	`512`
`disableProgressBar`	`bool`	Whether to disable the progress bar that tracks the items processed so far.	`False`
`checkpoint_path`	`str`	Path to a JSONL file for resume-safe checkpointing. Already-completed items are skipped on the next run automatically.	`None`

Returns:

Type	Description
`DataFrame`	A DataFrame containing the responses and the associated input data for the items in the batch.

Source code in urbanworm/inference/llama.py

def batch_inference(self,
                    system: str = '',
                    prompt: str = '',
                    temp: float = 0.0,
                    top_k: int = 20,
                    top_p: float = 0.8,
                    max_new_tokens: int = 512,
                    disableProgressBar: bool = False,
                    checkpoint_path: str | None = None) -> pd.DataFrame:
    '''
    Chat with MLLM model for each image.

    Args:
        system (str, optional): The system message.
        prompt (str): The prompt message.
        temp (float): The temperature value.
        top_k (float): The top_k value.
        top_p (float): The top_p value.
        max_new_tokens (int): Maximum number of tokens to generate per call.  Default 512.
        disableProgressBar (bool): The progress bar for showing the progress of data analysis over the units.
        checkpoint_path (str, optional): Path to a JSONL file for resume-safe checkpointing.
            Already-completed items are skipped on the next run automatically.

    Returns:
        pd.DataFrame: A DataFrame containing the responses and the associated input data.
    '''

    ollama, _ = _lazy_ollama()
    if self.model_dir is not None:
        import os
        _prev_ollama_models = os.environ.get("OLLAMA_MODELS")
        os.environ["OLLAMA_MODELS"] = self.model_dir
        try:
            ollama.pull(self.llm, stream=True)
        finally:
            if _prev_ollama_models is None:
                os.environ.pop("OLLAMA_MODELS", None)
            else:
                os.environ["OLLAMA_MODELS"] = _prev_ollama_models
    else:
        ollama.pull(self.llm, stream=True)

    if self.batch_images is not None:
        imgs = self.batch_images
    else:
        imgs = self.imgs

    schema = create_format(self.schema)

    multiImgInput = False
    if isinstance(imgs[0], list) or isinstance(imgs[0], tuple):
        multiImgInput = True

    # ── resume from checkpoint ───────────────────────────────────────
    done_records = load_inference_checkpoint(checkpoint_path) if checkpoint_path else []
    start_idx = len(done_records)
    dic = restore_ollama_results(done_records)

    for i in tqdm(range(start_idx, len(imgs)), desc="Processing...", ncols=75, disable=disableProgressBar):
        img = imgs[i]
        try:
            r = self._mtmd(model=self.llm,
                           system=system, prompt=prompt,
                           img=img if multiImgInput else [img],
                           temp=temp, top_k=top_k, top_p=top_p,
                           num_predict=max_new_tokens,
                           schema=schema,
                           one_shot_lr=[],
                           multiImgInput=multiImgInput)
            rr = r.responses
        except Exception as e:
            # Log and continue; capture an error stub so downstream stays consistent
            self.logger.warning("batch_inference: image %d failed (%s). Continuing.", i, e)
            rr = []

        dic['responses'] += [rr]
        dic['data'] += [imgs[i]]

        if checkpoint_path:
            try:
                responses_dump = [item.model_dump() for item in rr]
            except Exception:
                responses_dump = [dict(item) for item in rr] if rr else []
            append_inference_checkpoint(checkpoint_path, {
                'idx': i,
                'responses': responses_dump,
                'data': imgs[i] if isinstance(imgs[i], str) else list(imgs[i]),
            })

    self.results = dic
    return self.to_df(output=True)

`to_df(output=True)` ¶

Convert the stored MLLM responses (from `.batch_inference()`) into a DataFrame.

Parameters:

Name	Type	Description	Default
`output`	`bool`	Whether to return a DataFrame. Defaults to True.	`True`

Returns:

Type	Description
`DataFrame \| None`	A DataFrame containing responses and associated metadata. `None` if `.batch_inference()` has not been run or if `output` is `False`.

Source code in urbanworm/inference/llama.py

def to_df(self, output: bool = True) -> pd.DataFrame | None:
    """
    Convert the stored MLLM responses (from .batch_inference) into a DataFrame.

    Args:
        output (bool): Whether to return a DataFrame. Defaults to True.
    Returns:
        pd.DataFrame: A DataFrame containing responses and associated metadata.
        None: If `.batch_inference()` has not been run or `output` is False.
    """

    if self.results is not None:
        self.df = response2df(self.results)
        if output:
            return self.df
    return None

llama.cpp¶

Inference via the llama-mtmd-cli binary. Supports audio input; highly configurable sampling parameters.

Install: pip install "urban-worm[llamacpp]" + brew install llama.cpp

`urbanworm.inference.llama.InferenceLlamacpp(llm=None, mp=None, model_dir=None, **kwargs)` ¶

Bases: Inference

Constructor for vision inference using MLLMs with llama.cpp.

Parameters:

Name	Type	Description	Default
`llm`	`str`	model checkpoint to download (e.g. `ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0`) or a local path to a `.gguf` model file.	`None`
`mp`	`str`	If `llm` is a local `.gguf` path, `mp` must be the local path to the multimodal projector file (`mmproj.gguf`).	`None`
`model_dir`	`str`	Directory used as `HF_HUB_CACHE` when llama-mtmd-cli downloads a model via the `-hf` flag. GGUF files from HuggingFace will be cached here instead of the default `~/.cache/huggingface/hub`. Has no effect when `llm` is already a local file path.	`None`
`**kwargs`		Forwarded to `Inference`: `image` (str\|list[str]\|tuple[str]), `images` (list\|tuple), `geo_tagged_data` (GeoTaggedData), and `schema` (dict).	`{}`

Source code in urbanworm/inference/llama.py

def __init__(self, llm: str = None, mp: str = None,
             model_dir: str | None = None,
             **kwargs):
    super().__init__(**kwargs)
    self.llm = llm
    self.mp = mp
    self.model_dir = model_dir

Functions¶

`one_inference(system='', prompt='', image=None, audio=None, temp=0.2, top_k=20, top_p=0.8, ctx_size=4096, max_new_tokens=512, audio_input=False)` ¶

Chat with MLLM model with one image.

Parameters:

Name	Type	Description	Default
`system`	`str`	The system message.	`''`
`prompt`	`str`	The prompt message.	`''`
`image`	`str \| list \| tuple`	The image path.	`None`
`audio`	`str \| list \| tuple`	The audio path.	`None`
`temp`	`float`	The temperature value.	`0.2`
`top_k`	`int`	The top_k value.	`20`
`top_p`	`float`	The top_p value.	`0.8`
`ctx_size`	`int`	Size of the context.	`4096`
`max_new_tokens`	`int`	Maximum number of tokens to generate.	`512`
`audio_input`	`bool`	Whether to run inference with audio input.	`False`

Returns:

Type	Description
`DataFrame`	Response from the MLLM as a DataFrame.

Source code in urbanworm/inference/llama.py

def one_inference(self,
                  system: str = '',
                  prompt: str = '',
                  image: str | list | tuple = None,
                  audio: str | list | tuple = None,
                  temp: float = 0.2,
                  top_k: int = 20,
                  top_p: float = 0.8,
                  ctx_size: int = 4096,
                  max_new_tokens: int = 512,
                  audio_input: bool = False
                  ) -> Any:
    '''
        Chat with MLLM model with one image.
        Args:
             system (str, optional): The system message.
             prompt (str): The prompt message.
             image (str | list | tuple, optional): The image path.
             audio (str | list | tuple, optional): The audio path.
             temp (float): The temperature value.
             top_k (int): The top_k value.
             top_p (float): The top_p value.
             ctx_size (int): Size of context (The default is 4096)
             max_new_tokens (int): Maximum number of tokens to generate.  Default 512.
             audio_input (bool, optional): Whether to run inference with audio input

        Returns: response from MLLM as a dataframe
    '''

    llm = self.llm
    mp = self.mp

    if not audio_input:
        if image is not None:
            im = [image] if isinstance(image, str) else image
        else:
            im = [self.img] if isinstance(self.img, str) else self.img

    else:
        if audio is not None:
            im = [audio] if isinstance(audio, str) else audio
        else:
            im = [self.audio] if isinstance(self.audio, str) else self.audio

    if isinstance(im, list) or isinstance(im, tuple):
        if not isinstance(im[0], str):
            self.logger.warning("a list of images can only be a flatten list")
            return None

    # ims_origin = None
    im_ = []
    if not audio_input:
        for i in im:
            if is_base64(i):
                tmp_path = base64img2temp(i)
                im_ += [tmp_path]
            elif is_url(i):
                tmp_path = url2temp(i)
                im_ += [tmp_path]
            else:
                pass
    else:
        for i in range(len(im)):
            if is_url(im[i]):
                tmp_path = sound_url_to_temp(im[i])
                im_ += [tmp_path]
            else:
                pass

    if len(im_) == len(im):
        # ims_origin = im
        im = im_

    if llm is None:
        self.logger.warning("model cannot be None")
        return None

    schema = create_format(self.schema)

    r = self._mtmd(llm, mp,
                   system,
                   prompt,
                   im,
                   temperature=temp,
                   top_k=top_k,
                   top_p=top_p,
                   ctx_size=ctx_size,
                   max_new_tokens=max_new_tokens,
                   schema=schema,
                   audio_input=audio_input)
    r = extract_last_json(r)
    r = pd.DataFrame(r['responses'])
    df = responses_to_wide_all_columns(r)
    # df['data'] = ''
    # df.loc[0, 'data'] = im
    if len(im_) >= 1:
        for each in im_:
            try:
                os.remove(each)
            except OSError:
                pass
    return df
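Since the model runs as a command-line binary, its raw output may contain log lines around the structured answer, which is presumably why the result passes through `extract_last_json`. That helper's implementation is not shown here. The sketch below is one possible way to recover the last top-level JSON object from mixed text using `json.JSONDecoder.raw_decode`; `last_json_object` is a hypothetical stand-in and the sample output is invented.

```python
import json


def last_json_object(text: str):
    """Return the last top-level JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    found = None
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return found
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1      # not valid JSON here; try the next brace
            continue
        found = obj
        pos = end                # skip past the object that was just parsed


raw = (
    "load: model ready\n"
    '{"responses": [{"answer": "no"}]}\n'
    "noise {bad\n"
    '{"responses": [{"answer": "yes"}]}\n'
)
parsed = last_json_object(raw)
```

In this example `parsed["responses"]` is the list from the final object, and text without any valid object yields `None`, which a caller would need to handle before building a DataFrame.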

`batch_inference(system='', prompt='', temp=0.2, top_k=20, top_p=0.8, min_p=0.0, seed=3407, ctx_size=4096, max_new_tokens=512, audio_input=False, disableProgressBar=False, checkpoint_path=None)` ¶

Chat with MLLM model for each image in a list.

Parameters:

Name	Type	Description	Default
`system`	`str`	The system message.	`''`
`prompt`	`str`	The prompt message.	`''`
`temp`	`float`	The temperature value.	`0.2`
`top_k`	`int`	The top_k value.	`20`
`top_p`	`float`	The top_p value.	`0.8`
`min_p`	`float`	Min-p sampling threshold; `0.0` disables it.	`0.0`
`seed`	`int`	The seed value.	`3407`
`ctx_size`	`int`	Size of the context.	`4096`
`max_new_tokens`	`int`	Maximum number of tokens to generate per call.	`512`
`audio_input`	`bool`	Whether to run inference with audio input.	`False`
`disableProgressBar`	`bool`	Whether to disable the progress bar.	`False`
`checkpoint_path`	`str \| None`	Path to a JSONL file for resume-safe checkpointing. Already-completed items are skipped on the next run automatically.	`None`

Returns:

Type	Description
`DataFrame`	Response from the MLLM as a DataFrame.

Source code in urbanworm/inference/llama.py

def batch_inference(self,
                    system: str = '',
                    prompt: str = '',
                    temp: float = 0.2,
                    top_k: int = 20,
                    top_p: float = 0.8,
                    min_p: float = 0.0,
                    seed: int = 3407,
                    ctx_size: int = 4096,
                    max_new_tokens: int = 512,
                    audio_input = False,
                    disableProgressBar: bool = False,
                    checkpoint_path: str | None = None):
    '''
        Chat with MLLM model for each image in a list.
        Args:
            system (str, optional): The system message.
            prompt (str): The prompt message.
            temp (float): The temperature value (default: 0.2)
            top_k (float): The top_k value (default: 20)
            top_p (float): The top_p value (default: 0.8)
            min_p (float): min-p sampling (default: 0.0, 0.0 = disabled)
            seed (int): The seed value (Default is 3407)
            ctx_size (int): Size of context (Default is 4096)
            max_new_tokens (int): Maximum number of tokens to generate per call.  Default 512.
            audio_input (bool): Whether to run inference with audio input
            disableProgressBar (bool): Whether to disable progress bar.
            checkpoint_path (str, optional): Path to a JSONL file for resume-safe checkpointing.
                Already-completed items are skipped on the next run automatically.
        Returns: response from MLLM as a dataframe
    '''

    llm = self.llm
    mp = self.mp
    clips = None
    if not audio_input:
        if self.batch_images is not None:
            imgs = self.batch_images
        else:
            imgs = self.imgs
    else:
        if self.batch_audios is not None:
            imgs = self.batch_audios
            clips = self.batch_audios_slice
        else:
            imgs = self.audios

    schema = create_format(self.schema)

    # ── resume from checkpoint ───────────────────────────────────────
    done_records = load_inference_checkpoint(checkpoint_path) if checkpoint_path else []
    start_idx = len(done_records)
    dic = restore_llamacpp_results(done_records)

    for i in tqdm(range(start_idx, len(imgs)), desc="Processing...", ncols=75, disable=disableProgressBar):
        ims = [imgs[i]] if isinstance(imgs[i], str) else imgs[i]

        ims_origin = None
        ims_ = []
        if not audio_input:
            for im in ims:
                if is_base64(im):
                    tmp_path = base64img2temp(im)
                    ims_ += [tmp_path]
                elif is_url(im):
                    tmp_path = url2temp(im)
                    ims_ += [tmp_path]
                else:
                    pass
        else:
            for j in range(len(ims)):
                im = ims[j]
                if is_url(im):
                    if clips is not None:
                        clip_range = clips[j]
                        tmp_path = sound_url_to_temp(im, clip_range)
                        ims_ += [tmp_path]
                    else:
                        tmp_path = sound_url_to_temp(im)
                        ims_ += [tmp_path]
                else:
                    pass

        if len(ims_) == len(ims):
            ims_origin = ims
            ims = ims_

        try:
            r = None
            try_times = 0
            while r is None and try_times <= 5:
                r = self._mtmd(llm,
                               mp,
                               system,
                               prompt,
                               ims,
                               temperature=temp,
                               top_k=top_k,
                               top_p=top_p,
                               min_p=min_p,
                               seed=seed,
                               ctx_size=ctx_size,
                               max_new_tokens=max_new_tokens,
                               schema=schema,
                               audio_input=audio_input)
                r = extract_last_json(r)
                try_times += 1

            if r is None:
                r = 'Bad response'
            dic['responses'] += [r]
            stored_data = ims if ims_origin is None else ims_origin
            dic['data'] += [stored_data]

            if checkpoint_path:
                append_inference_checkpoint(checkpoint_path, {
                    'idx': i,
                    'responses': r,
                    'data': stored_data,
                })

            if len(ims_) >= 1:
                for each in ims_:
                    try:
                        os.remove(each)
                    except OSError:
                        pass
        except Exception as e:
            print(e)
            pass

    self.results = dic
    return self.to_df(output=True)
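The retry loop in the listing above calls the model until a JSON object can be extracted, up to six attempts (`try_times <= 5`), and then falls back to the string `'Bad response'`. A minimal sketch of that pattern follows. `extract_last_json_sketch` and `generate_with_retry` are hypothetical names, and the sketch assumes, from the helper's name only, that `extract_last_json` returns the last JSON object in the model's text output or `None` when there is none.

```python
import json


def extract_last_json_sketch(text):
    """Return the last top-level JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    last = None
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return last
        try:
            # raw_decode parses one JSON value starting at `start`
            # and returns it together with the index where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1
            continue
        last = obj
        pos = end


def generate_with_retry(generate, max_attempts=6):
    """Call generate() until its output contains a JSON object or attempts run out."""
    for _ in range(max_attempts):
        parsed = extract_last_json_sketch(generate())
        if parsed is not None:
            return parsed
    return "Bad response"


# Demo: the first output has no JSON, the second has two objects.
outputs = iter([
    "no json here",
    'thinking... {"a": 1} final: {"responses": [{"answer": "yes"}]}',
])
result = generate_with_retry(lambda: next(outputs))
```

Note that the fallback value is a plain string rather than a dictionary, so any downstream code that indexes into a response should check its type first.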

`to_df(output=True)` ¶

Convert the output from an MLLM response (from .batch_inference) into a DataFrame.

Parameters:

Name	Type	Description	Default
`output`	`bool`	Whether to return a DataFrame. Defaults to True.	`True`

Returns:

Type	Description
`DataFrame \| None`	A DataFrame containing responses and associated metadata, or `None` if `output` is `False` or no results are available.

Source code in urbanworm/inference/llama.py

def to_df(self, output: bool = True) -> Any:
    """
        Convert the output from an MLLM response (from .batch_inference) into a DataFrame.

        Args:
            output (bool): Whether to return a DataFrame. Defaults to True.
        Returns:
            pd.DataFrame: A DataFrame containing responses and associated metadata.
    """

    if self.results is not None:
        df_list = []
        responses = self.results['responses']
        imgs = self.results['data']

        for inx in range(len(responses)):
            r = responses[inx]
            i = imgs[inx]

            r = pd.DataFrame(r['responses'])
            r = responses_to_wide_all_columns(r)
            for j in range(len(i)):
                r[f'data_{j + 1}'] = i[j]

            df_list += [r]
        self.df = pd.concat(df_list, ignore_index=True)
        if output:
            return self.df
        return None
    else:
        return None

Cloud API¶

Inference via hosted providers (Anthropic, OpenAI, Google).

Install: pip install "urban-worm[api]"

`urbanworm.inference.api.InferenceAPI(llm, provider='anthropic', api_key=None, max_tokens=1024, skip_errors=True, **kwargs)` ¶

Bases: Inference

Vision-language inference via hosted API providers.

Parameters:

Name	Type	Description	Default
`llm`	`str`	Model name for the chosen provider, e.g. `"claude-opus-4-6"`, `"gpt-4o"`, `"gemini-2.0-flash"`.	required
`provider`	`str`	One of `"anthropic"`, `"openai"`, or `"google"`.	`'anthropic'`
`api_key`	`str \| None`	API key. If `None`, each provider falls back to its standard environment variable (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`).	`None`
`max_tokens`	`int`	Maximum tokens to generate per call. Default 1024.	`1024`
`skip_errors`	`bool`	If `True` (default), API / parse errors per image are logged and that image gets an empty response instead of crashing the batch.	`True`
`**kwargs`		Forwarded to `urbanworm.inference.Inference` (`image`, `images`, `geo_tagged_data`, `schema`).	`{}`

Source code in urbanworm/inference/api.py

def __init__(
    self,
    llm: str,
    provider: str = "anthropic",
    api_key: str | None = None,
    max_tokens: int = 1024,
    skip_errors: bool = True,
    **kwargs,
) -> None:
    super().__init__(**kwargs)
    if provider not in _PROVIDERS:
        raise ValueError(
            f"provider must be one of {_PROVIDERS!r}; got {provider!r}"
        )
    self.llm = llm
    self.provider = provider
    self.api_key = api_key
    self.max_tokens = max_tokens
    self.skip_errors = skip_errors
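The provider check and the documented `api_key` fallback can be illustrated as follows. This is a sketch under assumptions: `resolve_api_key` and `ENV_VARS` are hypothetical names, and the real package may leave the environment lookup to each provider's SDK rather than doing it itself. Only the three provider names and their environment variables come from the parameter table above.

```python
import os

# Environment variable per provider, as listed in the parameter table.
ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def resolve_api_key(provider, api_key=None, environ=None):
    """Return the explicit key if given, else the provider's environment variable."""
    environ = os.environ if environ is None else environ
    if provider not in ENV_VARS:
        raise ValueError(
            f"provider must be one of {tuple(ENV_VARS)!r}; got {provider!r}"
        )
    if api_key is not None:
        return api_key
    return environ.get(ENV_VARS[provider])


# Demo with a fake environment so the example does not depend on real keys.
fake_env = {"OPENAI_API_KEY": "sk-from-env"}
explicit = resolve_api_key("openai", "sk-explicit", fake_env)
fallback = resolve_api_key("openai", None, fake_env)
missing = resolve_api_key("google", None, fake_env)
```

An explicit key takes precedence over the environment, and an unknown provider fails early with a `ValueError`, mirroring the check in `__init__`.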

Functions¶

`one_inference(system='', prompt='', image=None, audio=None, **_kwargs)` ¶

Run inference on a single image.

Parameters:

Name	Type	Description	Default
`system`	`str`	System prompt.	`''`
`prompt`	`str`	User prompt.	`''`
`image`	`str \| list \| tuple \| None`	Path, URL, or base64 string (or list of those for multi-image-per-prompt).	`None`
`audio`	`str \| list \| tuple \| None`	Accepted for API parity only; any value other than `None` raises `NotImplementedError`.	`None`

Returns:

Type	Description
`DataFrame`	One-row DataFrame.

Source code in urbanworm/inference/api.py

def one_inference(
    self,
    system: str = "",
    prompt: str = "",
    image: str | list | tuple | None = None,
    audio: str | list | tuple | None = None,
    **_kwargs,
) -> pd.DataFrame:
    """Run inference on a single image.

    Args:
        system: System prompt.
        prompt: User prompt.
        image: Path, URL, or base64 string (or list of those for
            multi-image-per-prompt).
        audio: Accepted for API parity; raises ``NotImplementedError``.

    Returns:
        One-row DataFrame.
    """
    if audio is not None:
        raise NotImplementedError(
            "InferenceAPI does not support audio input."
        )

    img = image if image is not None else self.img
    if img is None:
        raise ValueError("No image provided to one_inference().")
    imgs = [img] if isinstance(img, str) else list(img)

    response_dict = self._call(system, prompt, imgs)
    dic = {
        "responses": [[_MockResponse(response_dict)]],
        "data": [imgs],
    }
    try:
        return response2df(dic)
    except Exception as e:
        logger.warning("one_inference: response2df failed (%s); returning raw.", e)
        return pd.DataFrame({"responses": [[response_dict]], "data": [imgs]})
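The way `one_inference` turns its `image` argument into a list is worth isolating, because a single string must be wrapped rather than iterated. A small sketch, with `normalize_images` as a hypothetical helper name:

```python
def normalize_images(image, default=None):
    """Return a list of image references from a string, list, or tuple."""
    img = image if image is not None else default
    if img is None:
        raise ValueError("No image provided.")
    # A str is itself iterable, so list("a.jpg") would yield single characters;
    # wrap strings explicitly and only call list() on real sequences.
    return [img] if isinstance(img, str) else list(img)


single = normalize_images("a.jpg")
several = normalize_images(("a.jpg", "b.jpg"))
from_default = normalize_images(None, default="c.jpg")
```

The `default` argument plays the role of the image stored on the instance at construction time, which the method uses when no `image` is passed.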

`batch_inference(system='', prompt='', disableProgressBar=False, checkpoint_path=None, **_kwargs)` ¶

Run inference over all collected images.

Parameters:

Name	Type	Description	Default
`system`	`str`	System prompt.	`''`
`prompt`	`str`	User prompt.	`''`
`disableProgressBar`	`bool`	Suppress tqdm bar.	`False`
`checkpoint_path`	`str \| None`	Path to a JSONL file for resume-safe checkpointing. On the next run items already in the file are skipped automatically.	`None`

Returns:

Type	Description
`DataFrame`	DataFrame — same shape as the other backends.

Source code in urbanworm/inference/api.py

def batch_inference(
    self,
    system: str = "",
    prompt: str = "",
    disableProgressBar: bool = False,
    checkpoint_path: str | None = None,
    **_kwargs,
) -> pd.DataFrame:
    """Run inference over all collected images.

    Args:
        system: System prompt.
        prompt: User prompt.
        disableProgressBar: Suppress tqdm bar.
        checkpoint_path: Path to a JSONL file for resume-safe
            checkpointing.  On the next run items already in the file
            are skipped automatically.

    Returns:
        DataFrame — same shape as the other backends.
    """
    imgs = self.batch_images if self.batch_images is not None else self.imgs
    if not imgs:
        raise ValueError("No images to run inference on.")

    # ── resume from checkpoint ───────────────────────────────────────
    done_records = load_inference_checkpoint(checkpoint_path) if checkpoint_path else []
    start_idx = len(done_records)

    dic = restore_ollama_results(done_records)

    # ── process remaining images ─────────────────────────────────────
    for i in tqdm(
        range(start_idx, len(imgs)),
        desc="Processing",
        ncols=75,
        disable=disableProgressBar,
    ):
        img = imgs[i]
        img_list = [img] if isinstance(img, str) else list(img)
        try:
            response_dict = self._call(system, prompt, img_list)
            wrapped = [_MockResponse(response_dict)]
        except Exception as e:
            logger.warning(
                "batch_inference: image %d failed (%s). Continuing.", i, e
            )
            if self.skip_errors:
                response_dict = {}
                wrapped = []
            else:
                raise

        dic["responses"].append(wrapped)
        dic["data"].append(img)

        if checkpoint_path:
            append_inference_checkpoint(checkpoint_path, {
                "idx": i,
                "responses": [response_dict],
                "data": img if isinstance(img, str) else list(img),
            })

    self.results = dic
    return self.to_df(output=True)
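The `skip_errors` handling above keeps one entry per input even when a call fails, so responses stay aligned with their images. The following stand-alone sketch shows the same pattern; `run_batch` and `flaky` are hypothetical stand-ins and no real API is called.

```python
import logging

logger = logging.getLogger("skip_errors_sketch")


def run_batch(items, call, skip_errors=True):
    """Call `call` on each item; on failure store an empty response or re-raise."""
    results = {"responses": [], "data": []}
    for i, item in enumerate(items):
        try:
            response = [call(item)]
        except Exception as exc:
            logger.warning("item %d failed (%s)", i, exc)
            if not skip_errors:
                raise
            response = []  # empty placeholder keeps rows aligned with inputs
        results["responses"].append(response)
        results["data"].append(item)
    return results


def flaky(item):
    """Stand-in for an API call that fails on one specific input."""
    if item == "broken.jpg":
        raise RuntimeError("simulated API error")
    return {"answer": f"ok:{item}"}


out = run_batch(["a.jpg", "broken.jpg", "c.jpg"], flaky)
```

With `skip_errors=False` the same failure propagates to the caller, which is useful when a partial result would be misleading.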

`to_df(output=True)` ¶

Convert self.results into a DataFrame.

Source code in urbanworm/inference/api.py

def to_df(self, output: bool = True) -> pd.DataFrame | None:
    """Convert ``self.results`` into a DataFrame."""
    if self.results is None:
        return None
    try:
        self.df = response2df(self.results)
    except Exception as e:
        logger.warning("to_df: response2df failed (%s); returning raw frame.", e)
        self.df = pd.DataFrame(self.results)
    return self.df if output else None

Output schema¶

`urbanworm.inference.format.create_format(fields, *, item_model_name='QnA', wrapper_model_name=None)` ¶

Create a typed Response[CustomQnA] model using a dynamically defined schema.

Parameters:

Name	Type	Description	Default
`fields`	`dict[str, FieldSpec]`	field definitions for the inner model.	required
`item_model_name`	`str`	name of the inner model.	`'QnA'`
`wrapper_model_name`	`str \| None`	optional pretty name for the specialized wrapper class.	`None`

Returns:

Type	Description
`type[BaseModel]`	A concrete Pydantic model class: Response[CustomQnA].

Source code in urbanworm/inference/format.py

def create_format(
    fields: dict[str, FieldSpec],
    *,
    item_model_name: str = "QnA",
    wrapper_model_name: str | None = None,
) -> type[BaseModel]:
    """
    Create a typed `Response[CustomQnA]` model using a dynamically defined schema.

    Args:
        fields: field definitions for the inner model.
        item_model_name: name of the inner model.
        wrapper_model_name: optional pretty name for the specialized wrapper class.

    Returns:
        A concrete Pydantic model class: Response[CustomQnA].
    """
    CustomQnA = schema(fields, model_name=item_model_name)
    Model = cast(type[BaseModel], Response[CustomQnA])  # concrete generic specialization

    # Give the specialized model a stable readable name (optional)
    if wrapper_model_name is None:
        wrapper_model_name = f"Response_{item_model_name}"
    try:
        Model.__name__ = wrapper_model_name  # type: ignore[attr-defined]
    except Exception:
        pass

    return Model

`urbanworm.inference.format.Response` ¶

Bases: BaseModel, Generic[T]

Wrapper schema: {"responses": [ ... ]}
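A generic wrapper specialized with a dynamically built inner model can be sketched with Pydantic directly. This assumes Pydantic v2 and Python 3.9 or later. `ResponseSketch` and `create_format_sketch` are hypothetical names, and the `(type, default)` tuples are the convention of `pydantic.create_model`, not necessarily the format of urbanworm's `FieldSpec`.

```python
from typing import Generic, TypeVar

from pydantic import BaseModel, create_model

T = TypeVar("T")


class ResponseSketch(BaseModel, Generic[T]):
    """Wrapper with the documented shape: {"responses": [ ... ]}."""

    responses: list[T]


def create_format_sketch(fields, item_model_name="QnA"):
    """Build an inner model from {name: (type, default)} and wrap it."""
    item_model = create_model(item_model_name, **fields)
    return ResponseSketch[item_model]


# In create_model, `...` as the default marks a field as required.
Model = create_format_sketch({"question": (str, ...), "answer": (bool, ...)})
parsed = Model.model_validate(
    {"responses": [{"question": "Is there a tree?", "answer": True}]}
)
json_schema = Model.model_json_schema()
```

The resulting class validates model output against the wrapper shape, and its JSON schema is the kind of structure a backend can pass to a model to constrain generation.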
