Dataset¶

The GeoTaggedData class is the central data container for the collection pipeline. It fetches building footprints, retrieves geo-located media (street views, photos, audio), and hands the results off to an inference backend.

GeoTaggedData¶

`urbanworm.dataset.GeoTaggedData(locations=None, units=None)` ¶

Parameters:

Name	Type	Description	Default
`locations`	`list \| tuple \| dict \| DataFrame`	A list of coordinates (longitude/x and latitude/y), a dictionary keyed by longitude and latitude, or a DataFrame with columns "longitude" and "latitude".	`None`
`units`	`GeoDataFrame`	A GeoDataFrame, or the path to a shapefile or GeoJSON file.	`None`

Examples:

retrieve street view with building footprints (OSM)¶

gtd = GeoTaggedData()
gtd.getBuildings(bbox=(-83.235572,42.348092,-83.235154,42.348806))
gtd.get_svi_from_locations(key="your Mapillary token")

locations - a nested list of coordinates¶

gtd = GeoTaggedData(locations=[[-83.235572,42.348092],[-83.235154,42.348806]])

locations - a dataframe with columns "longitude" and "latitude"¶

df = pd.DataFrame({"longitude":[-83.235572, -83.235154], "latitude":[42.348092, 42.348806]})
gtd = GeoTaggedData(locations=df)

Source code in urbanworm/dataset.py

def __init__(self,
             locations: list|tuple|dict|pd.DataFrame=None,
             units: GeoDataFrame=None):
    '''
    Args:
        locations (list|tuple|dict|DataFrame): A list of coordinates (longitude/x and latitude/y), a dictionary keyed by longitude and latitude, or a DataFrame with columns "longitude" and "latitude".
        units (GeoDataFrame): A GeoDataFrame, or the path to a shapefile or GeoJSON file.

    Examples:
        # retrieve street view with building footprints (OSM)
        gtd = GeoTaggedData()
        gtd.getBuildings(bbox=(-83.235572,42.348092,-83.235154,42.348806))
        gtd.get_svi_from_locations(key="your Mapillary token")

        # locations - a nested list of coordinates
        gtd = GeoTaggedData(locations=[[-83.235572,42.348092],[-83.235154,42.348806]])
        # locations - a dataframe with columns "longitude" and "latitude"
        df = pd.DataFrame({"longitude":[-83.235572, -83.235154], "latitude":[42.348092, 42.348806]})
        gtd = GeoTaggedData(locations=df)
    '''

    self.images = None
    self.locations = locations
    self.units = units
    if locations is not None and units is None:
        self.construct_units()

    # NOTE: each must be its own dict literal — chained assignment would
    # alias all three names to the SAME underlying dict object.
    def _empty_payload():
        return {'loc_id': [], 'id': [], 'data': [], 'path': []}

    self.svis = _empty_payload()
    self.photos = _empty_payload()
    self.audios = _empty_payload()

    self.svi_metadata = None
    self.photo_metadata = None
    self.audio_metadata = None
    self.plot = None
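
The note in the constructor about chained assignment can be checked in isolation. The snippet below is a minimal sketch in plain Python, independent of the library; `empty_payload` is a hypothetical stand-in for the private `_empty_payload` helper shown above.

```python
# Chained assignment binds all three names to ONE shared dict object.
a = b = c = {'loc_id': [], 'id': [], 'data': [], 'path': []}
a['id'].append('img-1')   # the append is visible through b and c as well


def empty_payload():
    """Return a fresh payload dict on every call (hypothetical stand-in)."""
    return {'loc_id': [], 'id': [], 'data': [], 'path': []}


# Separate calls give three independent containers.
svis, photos, audios = empty_payload(), empty_payload(), empty_payload()
svis['id'].append('img-1')  # photos and audios stay empty
```

With separate containers, appending a street view ID cannot leak into the photo or audio payloads.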

Functions¶

`getBuildings(bbox=None, source='osm', min_area=0, max_area=None, random_sample=None, gba_path=None, gba_cache_dir=None)` ¶

Extract buildings from a public source using the bbox.

Parameters:

Name	Type	Description	Default
`bbox`	`list or tuple`	The bounding box (min_lon, min_lat, max_lon, max_lat).	`None`
`source`	`str`	One of: `'osm'` (default): OpenStreetMap via the Overpass API; footprints only, no height. `'microsoft'`: Microsoft GlobalMLBuildingFootprints (Bing); footprints only, no height. `'globfp3d'`: 3D-GloBFP (https://zenodo.org/records/15487037; Che et al., ESSD 2024); footprints plus per-building height. Auto-fetches from Zenodo + Figshare and caches under `gba_cache_dir` (default `~/.cache/urbanworm/globfp3d`). Pass `gba_path` to load from a pre-downloaded local file. The resulting `self.units` keeps a `height_m` column. `'gba'`: GlobalBuildingAtlas (https://github.com/zhu-xlab/GlobalBuildingAtlas; Zhu et al., ESSD 2025), a different dataset from 3D-GloBFP, hosted on HuggingFace + mediaTUM. Auto-fetches polygon tiles from `zhu-xlab/GBA.LoD1` using `representative/lod1.geojson` as the manifest. Caches under `~/.cache/urbanworm/gba`. Note: GBA polygons ship without per-row height; heights live in a separate mediaTUM dataset (`m1837832`) which isn't joined yet (tracking issue in CHANGELOG). For per-building height today, use `source='globfp3d'`.	`'osm'`
`min_area`	`float or int`	The minimum area in m².	`0`
`max_area`	`float or int`	The maximum area in m².	`None`
`random_sample`	`int`	If set, randomly subsample the result to this many buildings.	`None`
`gba_path`	`str`	Optional. Path to a pre-downloaded file (skips network calls). Used by both `'globfp3d'` and `'gba'` sources.	`None`
`gba_cache_dir`	`str`	Optional. Where to cache auto-fetched files. Defaults differ per source (`~/.cache/urbanworm/globfp3d` or `.../gba`).	`None`
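
Because `bbox` is a bare sequence, the order `(min_lon, min_lat, max_lon, max_lat)` is easy to get wrong. The following is a minimal sketch of a pre-flight check. `validate_bbox` is a hypothetical helper, not part of urbanworm, and it assumes the box does not cross the antimeridian.

```python
def validate_bbox(bbox):
    """Hypothetical helper: check a (min_lon, min_lat, max_lon, max_lat) box."""
    if bbox is None or len(bbox) != 4:
        raise ValueError(
            "bbox must have four values: (min_lon, min_lat, max_lon, max_lat)"
        )
    min_lon, min_lat, max_lon, max_lat = (float(v) for v in bbox)
    # Assumption: boxes crossing the antimeridian are not supported here.
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError("longitudes must satisfy -180 <= min_lon < max_lon <= 180")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= min_lat < max_lat <= 90")
    return (min_lon, min_lat, max_lon, max_lat)


# The bounding box used in the examples on this page passes the check.
bbox = validate_bbox((-83.235572, 42.348092, -83.235154, 42.348806))
```

The check catches reversed corners and out-of-range values. It cannot detect every longitude/latitude swap, since a swapped pair can still fall inside the valid ranges.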

Source code in urbanworm/dataset.py

def getBuildings(self,
                 bbox: list | tuple = None,
                 source: str = 'osm',
                 min_area: float | int = 0,
                 max_area: float | int = None,
                 random_sample: int = None,
                 gba_path: str = None,
                 gba_cache_dir: str = None) -> None:
    '''
        Extract buildings from a public source using the bbox.

        Args:
            bbox (list or tuple): The bounding box (min_lon, min_lat, max_lon, max_lat).
            source (str): One of:

              * ``'osm'`` (default) — OpenStreetMap via the Overpass API.
                Footprints only; no height.
              * ``'microsoft'`` — Microsoft GlobalMLBuildingFootprints
                (Bing). Footprints only; no height.
              * ``'globfp3d'`` — `3D-GloBFP <https://zenodo.org/records/15487037>`_
                (Che et al., ESSD 2024). **Footprints + per-building
                height.** Auto-fetches from Zenodo + Figshare and
                caches under ``gba_cache_dir`` (default
                ``~/.cache/urbanworm/globfp3d``). Pass ``gba_path``
                to load from a pre-downloaded local file. The
                resulting ``self.units`` keeps a ``height_m`` column.
              * ``'gba'`` — `GlobalBuildingAtlas <https://github.com/zhu-xlab/GlobalBuildingAtlas>`_
                (Zhu et al., ESSD 2025). **A different dataset from
                3D-GloBFP** — hosted on HuggingFace + mediaTUM. Auto-
                fetches polygon tiles from
                ``zhu-xlab/GBA.LoD1`` using ``representative/lod1.geojson``
                as the manifest. Caches under ``~/.cache/urbanworm/gba``.
                NOTE: GBA polygons ship without per-row height — heights
                live in a separate mediaTUM dataset (`m1837832`) which
                isn't joined yet (tracking issue in CHANGELOG). For
                per-building height today, use ``source='globfp3d'``.

            min_area (float or int): The minimum area in m².
            max_area (float or int): The maximum area in m².
            random_sample (int): If set, randomly subsample to this many.
            gba_path (str): Optional. Path to a pre-downloaded file
                (skips network calls). Used by both ``'globfp3d'`` and
                ``'gba'`` sources.
            gba_cache_dir (str): Optional. Where to cache auto-fetched
                files. Defaults differ per source
                (``~/.cache/urbanworm/globfp3d`` or ``.../gba``).
    '''

    if source not in ('osm', 'microsoft', 'gba', 'globfp3d'):
        raise ValueError(
            f'Unsupported building source {source!r}; '
            f'choose from "osm", "microsoft", "globfp3d", or "gba".'
        )

    if source == 'osm':
        buildings = getOSMbuildings(bbox, min_area, max_area)
    elif source == 'microsoft':
        buildings = getGlobalMLBuilding(bbox, min_area, max_area)
    elif source == 'globfp3d':
        buildings = getGloBFP3DBuildings(
            bbox, gba_path=gba_path, min_area=min_area,
            max_area=max_area, cache_dir=gba_cache_dir,
        )
    else:  # 'gba'
        buildings = getGBABuildings(
            bbox, gba_path=gba_path, min_area=min_area,
            max_area=max_area, cache_dir=gba_cache_dir,
        )

    if buildings is None or buildings.empty:
        if source == 'osm':
            logger.warning(
                "No buildings found in the bounding box. "
                "Check https://overpass-turbo.eu/ for areas with buildings."
            )
        elif source == 'microsoft':
            logger.warning(
                "No buildings found in the bounding box. "
                "Check https://github.com/microsoft/GlobalMLBuildingFootprints "
                "for areas with buildings."
            )
        elif source == 'globfp3d':
            logger.warning(
                "No 3D-GloBFP buildings found in the bounding box %s "
                "(local: %s).", bbox, gba_path,
            )
        else:
            logger.warning(
                "No GBA buildings found in the bounding box %s "
                "(local: %s).", bbox, gba_path,
            )
        return None
    if random_sample is not None:
        buildings = buildings.sample(random_sample)
    self.units = buildings.to_crs(4326)
    with_height = (
        int(buildings["height_m"].notna().sum())
        if "height_m" in buildings.columns else 0
    )
    if with_height:
        logger.info(
            "%d buildings found in the bounding box (%d with height_m).",
            len(buildings), with_height,
        )
    else:
        logger.info("%d buildings found in the bounding box.", len(buildings))
    return None
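
In pandas, `DataFrame.sample(n)` with the default `replace=False` raises a `ValueError` when `n` exceeds the number of rows, so a `random_sample` larger than the number of buildings found would fail at the sampling step above. The following is a minimal sketch of a clamped subsample using only the standard library. `subsample` is a hypothetical helper and works on a plain list rather than a GeoDataFrame.

```python
import random


def subsample(items, n=None, seed=None):
    """Hypothetical helper: return at most n randomly chosen items."""
    items = list(items)
    if n is None:
        return items
    rng = random.Random(seed)
    # Clamp the sample size to the population so sampling never raises.
    k = min(max(n, 0), len(items))
    return rng.sample(items, k)


footprints = ['b0', 'b1', 'b2']
picked = subsample(footprints, n=10, seed=0)  # asks for 10, gets all 3
```

The same clamp could be applied before calling `getBuildings` by choosing `random_sample` conservatively, or afterwards by sampling `gtd.units` directly.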

`get_svi_from_locations(id_column=None, distance=50, key=None, source='mapillary', pano=True, reoriented=True, multi_num=1, interval=1, fov=80, heading=None, pitch=5, height=500, width=700, year=None, season=None, time_of_day='day', fov_margin=0.1, fov_min=30.0, fov_max=120.0, building_height=9.0, silent=True, checkpoint_path=None)` ¶

Retrieve the closest street view image(s) near each coordinate. The street view image will be reoriented to look at the coordinate when reoriented=True (Mapillary) or always (Google).
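
Reorienting an image toward a coordinate comes down to computing the compass bearing from the camera position to the target. The sketch below uses the standard initial great-circle bearing formula. It illustrates the geometry only and is not taken from the library; `bearing_deg` is a hypothetical name.

```python
import math


def bearing_deg(cam_lon, cam_lat, tgt_lon, tgt_lat):
    """Initial bearing from camera to target, in degrees clockwise from north."""
    phi1 = math.radians(cam_lat)
    phi2 = math.radians(tgt_lat)
    dlon = math.radians(tgt_lon - cam_lon)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2 returns (-180, 180]; wrap into [0, 360).
    return math.degrees(math.atan2(y, x)) % 360.0


# Camera and target are the two example coordinates used on this page.
heading = bearing_deg(-83.235572, 42.348092, -83.235154, 42.348806)
```

A panorama can then be cropped so that this heading sits at the horizontal centre of the output image.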

Parameters:

Name	Type	Description	Default
`id_column`	`str`	The name of the column that holds a unique identifier (or something similar) for each location.	`None`
`distance`	`int`	The max distance in meters between the centroid and the street view.	`50`
`key`	`str`	API access token for the chosen source. Mapillary — pass token or set env var `MAPILLARY_API_KEY`. Google — pass token or set env var `GOOGLE_STREETVIEW_API_KEY`.	`None`
`source`	`str`	Street view data source. One of `"mapillary"` (default) or `"google"`.	`'mapillary'`
`pano`	`bool`	Whether to search for pano street view images only. Mapillary only — ignored for Google. (Default is True)	`True`
`reoriented`	`bool`	Whether to reorient and crop street view images. Mapillary only — Google always faces the target. (Default is True)	`True`
`multi_num`	`int`	The number of SVIs to retrieve per location. Mapillary only; Google always returns 1. (Default is 1)	`1`
`interval`	`int`	The interval in meters between each SVI. Mapillary only. (Default is 1)	`1`
`fov`	`int \| float \| str`	Field of view in degrees (default 80). Pass `'auto'` (with `reoriented=True`) to size the FOV per image so the building footprint at each location is just framed. The polygon used is each unit's `row.geometry` from `self.units` — i.e. the building footprint loaded by `getBuildings()`. Falls back to a distance-based heuristic if a unit's geometry is a point. Mapillary only for `'auto'`.	`80`
`heading`	`int`	Camera heading in degrees. If None, it will be computed based on the house orientation.	`None`
`pitch`	`int`	Camera pitch angle. (Default is 5).	`5`
`height`	`int`	Height in pixels of the returned image. (Default is 500).	`500`
`width`	`int`	Width in pixels of the returned image. (Default is 700).	`700`
`year`	`list[str]`	Year of data (start year, end year). Mapillary only — ignored for Google with a warning.	`None`
`season`	`str`	Season of data. One of ["spring","summer","fall","autumn","winter"]. Mapillary only — ignored for Google with a warning.	`None`
`time_of_day`	`str`	Time of data. One of ["day","night"] (Default is 'day'). Mapillary only — ignored for Google with a warning.	`'day'`
`fov_margin`	`float`	When `fov='auto'`, fractional padding added to the auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.	`0.1`
`fov_min`	`float`	Lower clamp for `fov='auto'` (degrees). Default 30°. Mapillary only.	`30.0`
`fov_max`	`float`	Upper clamp for `fov='auto'` (degrees). Default 120°. Mapillary only.	`120.0`
`building_height`	`float`	Assumed building height in meters used by `fov='auto'` (default 9 m, ~3 stories). Mapillary only.	`9.0`
`silent`	`bool`	If True, do not show error traceback (Default is True).	`True`
`checkpoint_path`	`str`	Path to a JSONL file for resume-safe checkpointing. When provided, each successfully fetched location is written to the file immediately, and base64 images are saved to a companion directory (`<checkpoint_stem>_files/` next to the JSONL) so the session can be resumed after a crash.	`None`
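
To make the interplay of `fov_margin`, `fov_min` and `fov_max` concrete, here is a simplified sketch of how an automatic field of view could be derived. It assumes a flat facade of known width seen head-on from a known distance, and it ignores `building_height`. This is an illustration of the padding and clamping, not the library's actual `fov='auto'` computation, and `auto_fov` is a hypothetical name.

```python
import math


def auto_fov(facade_width_m, distance_m, margin=0.10, fov_min=30.0, fov_max=120.0):
    """Angle (degrees) subtended by a facade, padded by margin and clamped."""
    if distance_m <= 0:
        return fov_max  # camera on top of the target: use the widest view
    # Angle subtended by a segment of the given width centred on the view axis.
    subtended = 2.0 * math.degrees(math.atan((facade_width_m / 2.0) / distance_m))
    padded = subtended * (1.0 + margin)
    return max(fov_min, min(fov_max, padded))


typical = auto_fov(20.0, 10.0)   # 90 degrees subtended, padded by 10 percent
far_away = auto_fov(2.0, 50.0)   # tiny angle, raised to fov_min
too_close = auto_fov(200.0, 5.0) # huge angle, capped at fov_max
```

The clamps keep distant buildings from producing a heavily zoomed crop and very close ones from producing a distorted wide-angle view.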

Source code in urbanworm/dataset.py

def get_svi_from_locations(self,
                           id_column:str=None,
                           distance:int = 50,
                           key: str = None,
                           source: str = "mapillary",
                           pano: bool = True, reoriented: bool = True,
                           multi_num: int = 1, interval: int = 1,
                           fov: int | float | str = 80, heading: int = None, pitch: int = 5,
                           height: int = 500, width: int = 700,
                           year: list | tuple = None, season: str = None, time_of_day: str = 'day',
                           fov_margin: float = 0.10,
                           fov_min: float = 30.0,
                           fov_max: float = 120.0,
                           building_height: float = 9.0,
                           silent: bool = True,
                           checkpoint_path: str | None = None):
    """
        get_svi_from_locations

        Retrieve the closest street view image(s) near each coordinate.
        The street view image will be reoriented to look at the coordinate when
        ``reoriented=True`` (Mapillary) or always (Google).

        Args:
            id_column (str, optional): The name of column that has unique identifier (or something similar) for each location.
            distance (int): The max distance in meters between the centroid and the street view.
            key (str): API access token for the chosen source.
                Mapillary — pass token or set env var ``MAPILLARY_API_KEY``.
                Google    — pass token or set env var ``GOOGLE_STREETVIEW_API_KEY``.
            source (str): Street view data source. One of ``"mapillary"`` (default)
                or ``"google"``.
            pano (bool): Whether to search for pano street view images only.
                Mapillary only — ignored for Google. (Default is True)
            reoriented (bool): Whether to reorient and crop street view images.
                Mapillary only — Google always faces the target. (Default is True)
            multi_num (int): The number of multiple SVIs.
                Mapillary only — Google always returns 1. (Default is 1)
            interval (int): The interval in meters between each SVI.
                Mapillary only. (Default is 1)
            fov (int | float | str): Field of view in degrees (default 80). Pass
                ``'auto'`` (with ``reoriented=True``) to size the FOV per image
                so the building footprint at each location is just framed.
                The polygon used is each unit's ``row.geometry`` from
                ``self.units`` — i.e. the building footprint loaded by
                ``getBuildings()``. Falls back to a distance-based heuristic
                if a unit's geometry is a point. Mapillary only for ``'auto'``.
            heading (int): Camera heading in degrees. If None, it will be computed based on the house orientation.
            pitch (int): Camera pitch angle. (Default is 5).
            height (int): Height in pixels of the returned image. (Default is 500).
            width (int): Width in pixels of the returned image. (Default is 700).
            year (list[str], optional): Year of data (start year, end year).
                Mapillary only — ignored for Google with a warning.
            season (str, optional): Season of data. One of ["spring","summer","fall","autumn","winter"].
                Mapillary only — ignored for Google with a warning.
            time_of_day (str, optional): Time of data. One of ["day","night"] (Default is 'day').
                Mapillary only — ignored for Google with a warning.
            fov_margin (float): When ``fov='auto'``, fractional padding added to the
                auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.
            fov_min (float): Lower clamp for ``fov='auto'`` (degrees). Default 30°.
                Mapillary only.
            fov_max (float): Upper clamp for ``fov='auto'`` (degrees). Default 120°.
                Mapillary only.
            building_height (float): Assumed building height in meters used by
                ``fov='auto'`` (default 9 m, ~3 stories). Mapillary only.
            silent (bool): If True, do not show error traceback (Default is True).
            checkpoint_path (str, optional): Path to a JSONL file for
                resume-safe checkpointing. When provided, each successfully
                fetched location is written to the file immediately, and
                base64 images are saved to a companion directory
                (``<checkpoint_stem>_files/`` next to the JSONL) so the
                session can be resumed after a crash.
        """

    self.svis = {
        'loc_id': [],
        'id': [],
        'data': [],
        'path': [],
    }
    self.svi_metadata = None

    if id_column is None:
        id_column = 'loc_id'
        if id_column not in self.units.columns:
            self.units[id_column] = list(range(len(self.units)))
    # Resolve API key once with env var fallback.
    # Use the appropriate env var depending on the source.
    _env_var = "GOOGLE_STREETVIEW_API_KEY" if source.lower() == "google" else "MAPILLARY_API_KEY"
    resolved_key = key or os.getenv(_env_var)
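    if not resolved_key:
        # Name the source and env var that were actually looked up, so the
        # message is also correct when source="google".
        raise ValueError(
            f"Missing {source} access token. Pass key=... or set env var {_env_var}."
        )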

    # ── resume from checkpoint ────────────────────────────────────────
    # The checkpoint records which loc_ids have been fetched and stores
    # the raw fetched data (base64 strings or URLs) so the full
    # self.svis payload can be restored without re-hitting the API.
    # File downloading is handled separately by download_to_dir().
    done_ids: set = set()
    if checkpoint_path is not None:
        done_ids, ckpt_records = load_collection_checkpoint(checkpoint_path)
        restored_svis, restored_frames = restore_svis_from_checkpoint(ckpt_records)
        self.svis = restored_svis
    else:
        restored_frames = []

    # Accumulate per-location frames and concat once for O(n) instead of O(n^2)
    frames: list[pd.DataFrame] = list(restored_frames)
    skip_count = 0
    for _index, row in tqdm(self.units.iterrows(), total=len(self.units)):
        loc_id = row[id_column]

        # Skip already-checkpointed locations
        if loc_id in done_ids:
            continue

        try:
            # Pass the unit's polygon to enable fov='auto' framing.
            # Points (no `.exterior`) become None, so getSV will fall
            # back to its distance-based heuristic.
            target_poly = getattr(row.geometry, "exterior", None)
            target_poly = row.geometry if target_poly is not None else None

            # Per-building height: if the units GeoDataFrame has a
            # height_m column (e.g. from source='globfp3d'), use that
            # row's value. Fall back to the global ``building_height``
            # parameter when the row is missing/NaN.
            row_height = building_height
            if "height_m" in self.units.columns:
                rh = row.get("height_m")
                try:
                    if rh is not None and not pd.isna(rh) and float(rh) > 0:
                        row_height = float(rh)
                except (TypeError, ValueError):
                    pass

            svis, output_df = getSV(
                [row.geometry.centroid.x, row.geometry.centroid.y],
                loc_id=loc_id,
                distance=distance,
                key=resolved_key,
                source=source,
                pano=pano,
                reoriented=reoriented,
                multi_num=multi_num,
                interval=interval,
                fov=fov, heading=heading, pitch=pitch,
                height=height, width=width,
                year=year, season=season, time_of_day=time_of_day,
                target_polygon=target_poly,
                fov_margin=fov_margin, fov_min=fov_min, fov_max=fov_max,
                building_height=row_height,
                silent=silent,
            )
            if svis is None:
                skip_count += 1
                continue

            self.svis['data'] += svis
            self.svis['loc_id'] += output_df['loc_id'].tolist()
            self.svis['id'] += output_df['id'].tolist()

            # ── checkpoint: record the fetched data ──────────────────
            # Stores raw fetched data (base64 or URLs) so the session
            # can be fully restored without re-hitting the street view API.
            # File downloading remains the responsibility of download_to_dir().
            if checkpoint_path is not None:
                append_collection_checkpoint(checkpoint_path, {
                    'loc_id': loc_id,
                    'ids': output_df['id'].tolist(),
                    'paths': [],           # populated later by download_to_dir()
                    'data': svis,          # base64 strings or URLs as returned by getSV
                    'metadata': output_df.to_dict(orient='records'),
                })

            frames.append(output_df)
        except Exception as e:
            if not silent:
                logger.warning(
                    'skipping %s: %s',
                    [row.geometry.centroid.x, row.geometry.centroid.y], e,
                )
            skip_count += 1
            continue
    self.svi_metadata = pd.concat(frames, ignore_index=True) if frames else None
    if skip_count > 0:
        logger.info(
            'Collected data for %d locations; skipped %d (no data found).',
            len(self.units) - skip_count, skip_count,
        )
    return None
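
The resume logic above relies on an append-only checkpoint file that is re-read at start-up. The following is a minimal, self-contained sketch of that pattern with JSON Lines. `append_record` and `load_done_ids` are hypothetical helpers written for this example; they are not the library's `append_collection_checkpoint` and `load_collection_checkpoint`, whose internals are not shown on this page.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one JSON object per line so a crash loses at most one record."""
    with open(path, 'a', encoding='utf-8') as fh:
        fh.write(json.dumps(record) + '\n')


def load_done_ids(path):
    """Return the set of loc_id values already present in the checkpoint."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)['loc_id'])
            except (json.JSONDecodeError, KeyError):
                continue  # tolerate a partially written trailing line
    return done


ckpt = os.path.join(tempfile.mkdtemp(), 'svi_checkpoint.jsonl')
for loc_id in [0, 1, 2]:
    if loc_id in load_done_ids(ckpt):
        continue  # already fetched in an earlier run
    append_record(ckpt, {'loc_id': loc_id, 'ids': [f'img-{loc_id}'], 'data': []})

resumed = load_done_ids(ckpt)
```

Running the loop a second time against the same file skips every location, which is the behaviour `checkpoint_path` is meant to provide after an interrupted session.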

`get_photo_from_location(id_column=None, distance=50, key=None, query=None, geo_context=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, exclude_personal_photo=True, exclude_from_location=None, silent=True, checkpoint_path=None)` ¶

Retrieve geotagged photos from Flickr

Parameters:

Name	Type	Description	Default
`id_column`	`str`	The name of the column that holds a unique identifier (or something similar) for each location.	`None`
`distance`	`int`	Search radius in meters (converted to km; Flickr radius max is 32 km).	`50`
`key`	`str`	Flickr API key. If None, reads env var FLICKR_API_KEY.	`None`
`query`	`str`	Query string to search for.	`None`
`geo_context`	`int`	Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)	`None`
`tag`	`str \| list[str]`	Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.	`None`
`max_return`	`int`	Number of photos to return (after filters).	`1`
`year`	`list \| tuple`	[Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.	`None`
`season`	`str`	One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).	`None`
`time_of_day`	`str`	One of {"morning","afternoon","evening","night"} (post-filter by taken hour).	`None`
`exclude_personal_photo`	`bool`	If True, drop photos that the bundled face-detection model flags as selfies. (Default is True)	`True`
`exclude_from_location`	`int`	Drop retrieved data with a distance from the given location.	`None`
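`silent`	`bool`	If True, do not show error traceback (Default is True).	`True`
`checkpoint_path`	`str`	Path to a checkpoint file. When provided, locations already recorded in the file are skipped and each newly fetched location is appended, so an interrupted session can be resumed.	`None`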
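
Two of the parameters above describe small transformations that are easy to show directly: the metre-to-kilometre radius conversion with Flickr's 32 km cap, and the season post-filter on the taken month. The sketch below is illustrative only. The helper names are hypothetical, and the month-to-season mapping (northern-hemisphere meteorological seasons) is an assumption, since the page does not state which mapping the library uses.

```python
from datetime import datetime

# Assumption: northern-hemisphere meteorological seasons.
SEASON_MONTHS = {
    'spring': {3, 4, 5},
    'summer': {6, 7, 8},
    'fall': {9, 10, 11},
    'autumn': {9, 10, 11},
    'winter': {12, 1, 2},
}


def radius_km(distance_m, max_km=32.0):
    """Convert a search radius in metres to km, capped at the Flickr maximum."""
    return min(distance_m / 1000.0, max_km)


def in_season(taken, season):
    """True if the photo's taken date falls in the named season."""
    return taken.month in SEASON_MONTHS[season.lower()]


photos = [
    {'id': 'p1', 'taken': datetime(2021, 7, 4, 15, 30)},
    {'id': 'p2', 'taken': datetime(2021, 12, 24, 9, 0)},
]
summer_ids = [p['id'] for p in photos if in_season(p['taken'], 'summer')]
```

Because season and time of day are post-filters, fewer than `max_return` photos may remain for a location once they are applied.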

Source code in urbanworm/dataset.py

def get_photo_from_location(self,
                            id_column:str=None,
                            distance: int = 50,
                            key: str = None,
                            query: str | list[str] = None,
                            geo_context: int = None,
                            tag: str | list[str] = None,
                            max_return: int = 1,
                            year: list | tuple = None,
                            season: str = None,
                            time_of_day: str = None,
                            exclude_personal_photo: bool = True,
                            exclude_from_location:int = None,
                            silent: bool = True,
                            checkpoint_path: str | None = None,
                            ):
    '''
        get_photo_from_location

        Retrieve geotagged photos from Flickr

        Args:
            id_column (str, optional): The name of the column that holds a unique identifier (or something similar) for each location.
            distance (int): Search radius in meters (converted to km; Flickr radius max is 32 km).
            key (str): Flickr API key. If None, reads env var FLICKR_API_KEY.
            query (str, optional): Query string to search for.
            geo_context (int, optional): Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)
            tag (str | list[str]): Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.
            max_return (int): Number of photos to return (after filters).
            year: [Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.
            season (str): One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).
            time_of_day (str): One of {"morning","afternoon","evening","night"} (post-filter by taken hour).
            exclude_personal_photo (bool): If True, exclude personal photo from locations. (Default is True)
            exclude_from_location (int, optional): Drop retrieved data with a distance from the given location.
            silent (bool): If True, do not show error traceback (Default is True).
    '''

    from importlib.resources import as_file, files

    self.photos = {
        'loc_id': [],
        'id': [],
        'data': [],
        'path': [],
    }
    self.photo_metadata = None

    if id_column is None:
        id_column = 'loc_id'
        if id_column not in self.units.columns:
            self.units[id_column] = list(range(len(self.units)))

    # ── resume from checkpoint ────────────────────────────────────────
    done_ids_ph: set = set()
    if checkpoint_path is not None:
        done_ids_ph, ckpt_records_ph = load_collection_checkpoint(checkpoint_path)
        restored_photos, restored_frames_ph = restore_photos_from_checkpoint(ckpt_records_ph)
        self.photos = restored_photos
    else:
        restored_frames_ph = []

    frames: list[pd.DataFrame] = list(restored_frames_ph)
    skip_count = 0
    for _index, row in tqdm(self.units.iterrows(), total=len(self.units)):
        loc_id = row[id_column]
        if loc_id in done_ids_ph:
            continue
        try:
            output_df = getPhoto([row.geometry.centroid.x, row.geometry.centroid.y],
                                 loc_id,
                                 distance,
                                 key,
                                 query,
                                 geo_context,
                                 tag,
                                 max_return,
                                 year,
                                 season,
                                 time_of_day,
                                 exclude_from_location,
                                 output_df=True)
            if exclude_personal_photo:
                model_res = files("urbanworm.models") / "face_detection_yunet_2023mar.onnx"
                drop_list = []
                for ind, r in output_df.iterrows():
                    with as_file(model_res) as model_path:
                        is_selfie = is_selfie_photo(model_path, r['url'])
                        if is_selfie:
                            drop_list += [ind]
                if len(drop_list) > 0:
                    output_df.drop(drop_list, axis=0, inplace=True)
                    if len(output_df) == 0:
                        continue

            self.photos['loc_id'] += output_df['loc_id'].tolist()
            self.photos['data'] += output_df['url'].tolist()
            self.photos['id'] += output_df['id'].tolist()

            if checkpoint_path is not None:
                append_collection_checkpoint(checkpoint_path, {
                    'loc_id': loc_id,
                    'ids': output_df['id'].tolist(),
                    'paths': [],
                    'data': output_df['url'].tolist(),
                    'metadata': output_df.to_dict(orient='records'),
                })

            frames.append(output_df)
        except Exception as e:
            if not silent:
                logger.warning("photo fetch error: %s", e)
            skip_count += 1
            continue
    self.photo_metadata = pd.concat(frames, ignore_index=True) if frames else None
    if skip_count > 0:
        logger.info(
            'Collected data for %d locations; skipped %d (no data found).',
            len(self.units) - skip_count, skip_count,
        )
    return None

`get_sound_from_location(id_column=None, distance=50, source='freesound', key=None, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=None, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, silent=True, checkpoint_path=None)` ¶

Retrieve geotagged sound recordings from Freesound (default) or from a Radio Aporee catalog you provide as a CSV / DataFrame.

Parameters:

Name	Type	Description	Default
`id_column`	`str`	The name of the column that holds a unique identifier (or something similar) for each location.	`None`
`distance`	`int`	Search radius in meters (converted to km for Freesound geofilt).	`50`
`source`	`str`	One of {"freesound", "aporee"} (Default is "freesound").	`'freesound'`
`key`	`str`	Freesound API key. Required only when source="freesound". If None, reads env var FREESOUND_API_KEY.	`None`
`catalog`	`str \| DataFrame`	Required only when source="aporee". Path to a CSV or an in-memory DataFrame containing at minimum the columns `url`, `latitude`, `longitude`. Optional columns recognised: `id`/`identifier`, `name`/`title`, `description`, `tags`, `created` (ISO timestamp), `duration_s`.	`None`
`query`	`str`	Query string to search for.	`None`
`tag`	`str \| list[str]`	Tag string or list of tags (used as filters).	`None`
`max_return`	`int`	Number of sounds to return (after post-filters).	`1`
`year`	`int \| list`	[Y] or (Y,) or (Y1, Y2) inclusive (filters by upload date "created").	`None`
`season`	`str`	One of {"spring","summer","fall","autumn","winter"} (post-filter by created month).	`None`
`time_of_day`	`str`	One of {"morning","afternoon","evening","night"} (post-filter by created hour).	`None`
`duration`	`int \| list[int] \| tuple[int]`	Maximum duration in seconds (<= duration). For a range, pass a tuple/list (min, max).	`None`
`exclude_from_location`	`int`	Drop retrieved data with a distance from the given location.	`None`
`slice_duration`	`int`	Split the original sound signal into clips with the given duration.	`None`
`slice_max_num`	`int`	Maximum number of clips sliced from the original sound signal.	`None`
`probe_durations`	`bool`	Aporee-only. When `slice_duration` is set but the catalog has no `duration_s` column, probe each selected URL once to learn its length. Set False to skip slicing instead. Default True.	`True`
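`silent`	`bool`	If True, do not show error traceback (Default is True).	`True`
`checkpoint_path`	`str \| None`	Optional path to a checkpoint file. When set, locations already recorded in the checkpoint are skipped and each newly processed location is appended to it, so an interrupted run can be resumed.	`None`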
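A minimal sketch of preparing an Aporee catalog CSV, assuming only the column names listed in the table above. The file name `aporee_catalog.csv`, the placeholder URL, and the helper `missing_catalog_columns` are illustrative and not part of urbanworm.

```python
import csv
import os
import tempfile

# Columns the `catalog` parameter requires at minimum (from the table above).
REQUIRED_COLUMNS = {"url", "latitude", "longitude"}


def missing_catalog_columns(csv_path):
    """Return the required columns absent from the CSV header (hypothetical helper)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return sorted(REQUIRED_COLUMNS - set(header))


# Write a one-row catalog; `duration_s` is one of the optional columns.
catalog_path = os.path.join(tempfile.mkdtemp(), "aporee_catalog.csv")
with open(catalog_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["url", "latitude", "longitude", "duration_s"]
    )
    writer.writeheader()
    writer.writerow({
        "url": "https://example.org/recording.mp3",  # placeholder URL
        "latitude": 42.348092,
        "longitude": -83.235572,
        "duration_s": 120,
    })

print(missing_catalog_columns(catalog_path))
```

The resulting path could then be passed as `catalog=catalog_path` together with `source="aporee"`.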

Source code in urbanworm/dataset.py

def get_sound_from_location(self,
                            id_column: str = None,
                            distance: int = 50,
                            source: str = 'freesound',
                            key: str = None,
                            catalog: str | pd.DataFrame = None,
                            query: str | list[str] = None,
                            tag: str | list[str] = None,
                            max_return: int = 1,
                            year: list | tuple = None,
                            season: str = None,
                            time_of_day: str = None,
                            duration: int | list[int] | tuple[int] = None,
                            exclude_from_location: int = None,
                            slice_duration: int = None,
                            slice_max_num: int = None,
                            probe_durations: bool = True,
                            silent: bool = True,
                            checkpoint_path: str | None = None,
                            ):

    '''
        get_sound_from_location

        Retrieve geotagged sound recordings from Freesound (default) or
        from a Radio Aporee catalog you provide as a CSV / DataFrame.

        Args:
            id_column (str, optional): The name of the column holding a unique identifier (or something similar) for each location.
                If None, a ``loc_id`` column is used and created when missing.
            distance (int): radius in meters (converted to km for Freesound geofilt).
            source (str): one of {"freesound", "aporee"} (Default is "freesound").
            key (str): Freesound API key. Required only when source="freesound".
                If None, reads env var FREESOUND_API_KEY.
            catalog (str | pandas.DataFrame): Required only when source="aporee".
                Path to a CSV or an in-memory DataFrame containing at minimum the columns
                ``url``, ``latitude``, ``longitude``. Optional columns recognised:
                ``id``/``identifier``, ``name``/``title``, ``description``, ``tags``,
                ``created`` (ISO timestamp), ``duration_s``.
            query (str, optional): Query string to search for.
            tag (str | list[str]): tag string or list of tags (used as filters).
            max_return (int): number of sounds to return (after post-filters).
            year (list | tuple): [Y] or (Y,) or (Y1, Y2) inclusive (filters by upload date "created").
            season (str): one of {"spring","summer","fall","autumn","winter"} (post-filter by created month).
            time_of_day (str): one of {"morning","afternoon","evening","night"} (post-filter by created hour).
            duration (int | list[int] | tuple[int]): maximum duration in seconds (<= duration). If you want a range, pass a tuple/list (min,max).
            exclude_from_location (int, optional): Drop retrieved data with a distance from the given location.
            slice_duration (int, optional): Split the original sound signal into clips with the given duration.
            slice_max_num (int, optional): Maximum number of clips sliced from the original sound signal.
            probe_durations (bool): Aporee-only. When ``slice_duration`` is set
                but the catalog has no ``duration_s`` column, probe each
                selected URL once to learn its length. Set False to skip
                slicing instead. Default True.
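            silent (bool): If True, do not show error traceback (Default is True).
            checkpoint_path (str | None): Optional path to a checkpoint file. When set,
                locations already recorded in the checkpoint are skipped and each newly
                processed location is appended to it, so an interrupted run can be resumed.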
    '''

    self.audios = {
        'loc_id': [],
        'id': [],
        'data': [],
        'path': [],
    }
    self.audio_metadata = None

    if slice_duration is not None:
        self.audios['slice'] = []

    if id_column is None:
        id_column = 'loc_id'
        if id_column not in self.units.columns:
            self.units[id_column] = list(range(len(self.units)))

    # ── resume from checkpoint ────────────────────────────────────────
    done_ids_au: set = set()
    if checkpoint_path is not None:
        done_ids_au, ckpt_records_au = load_collection_checkpoint(checkpoint_path)
        restored_audios, restored_frames_au = restore_audios_from_checkpoint(ckpt_records_au)
        self.audios = restored_audios
    else:
        restored_frames_au = []

    frames: list[pd.DataFrame] = list(restored_frames_au)
    skip_count = 0
    for _index, row in tqdm(self.units.iterrows(), total=len(self.units)):
        loc_id = row[id_column]
        if loc_id in done_ids_au:
            continue
        try:
            output_df = getSound([row.geometry.centroid.x, row.geometry.centroid.y],
                                 loc_id=loc_id,
                                 distance=distance,
                                 source=source,
                                 key=key,
                                 catalog=catalog,
                                 query=query,
                                 tag=tag,
                                 max_return=max_return,
                                 year=year,
                                 season=season,
                                 time_of_day=time_of_day,
                                 duration=duration,
                                 exclude_from_location=exclude_from_location,
                                 slice_duration=slice_duration,
                                 slice_max_num=slice_max_num,
                                 probe_durations=probe_durations,
                                 output_df=True)

            # `slice` may be missing if the source couldn't compute it
            # (e.g. Aporee catalog with no duration_s and probe_durations
            # disabled). Fall back to the un-sliced path in that case.
            if slice_duration is not None and 'slice' in output_df.columns:
                slice_list = output_df['slice'].tolist()
                loc_id_list = output_df['loc_id'].tolist()
                data_list = output_df['preview-hq-mp3'].tolist()
                id_list = output_df['id'].tolist()

                # `slice_list[i]` is always a list of [start_ms, end_ms] pairs
                # (one per generated clip). Flatten and replicate metadata
                # to match the per-clip cardinality.
                flattened_slice_list = [
                    item for sublist in slice_list for item in sublist
                ]
                repeated_loc, repeated_data, repeated_id = [], [], []
                for sublist, lid, d, sid in zip(
                        slice_list, loc_id_list, data_list, id_list, strict=False):
                    n = len(sublist)
                    repeated_loc.extend([lid] * n)
                    repeated_data.extend([d] * n)
                    repeated_id.extend([sid] * n)
                self.audios['loc_id'] += repeated_loc
                self.audios['data'] += repeated_data
                self.audios['id'] += repeated_id
                self.audios['slice'] += flattened_slice_list

                if checkpoint_path is not None:
                    append_collection_checkpoint(checkpoint_path, {
                        'loc_id': loc_id,
                        'ids': repeated_id,
                        'paths': [],
                        'data': repeated_data,
                        'slices': flattened_slice_list,
                        'metadata': output_df.to_dict(orient='records'),
                    })
            else:
                self.audios['loc_id'] += output_df['loc_id'].tolist()
                self.audios['data'] += output_df['preview-hq-mp3'].tolist()
                self.audios['id'] += output_df['id'].tolist()

                if checkpoint_path is not None:
                    append_collection_checkpoint(checkpoint_path, {
                        'loc_id': loc_id,
                        'ids': output_df['id'].tolist(),
                        'paths': [],
                        'data': output_df['preview-hq-mp3'].tolist(),
                        'slices': None,
                        'metadata': output_df.to_dict(orient='records'),
                    })

            frames.append(output_df)
        except Exception as e:
            if not silent:
                logger.warning("sound fetch error: %s", e)
            skip_count += 1
            continue
    self.audio_metadata = pd.concat(frames, ignore_index=True) if frames else None
    if skip_count > 0:
        logger.info(
            'Collected data for %d locations; skipped %d (no data found).',
            len(self.units) - skip_count, skip_count,
        )
    return None
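The per-clip bookkeeping in the source above (flattening `slice` and repeating each sound's metadata once per clip) can be tried in isolation. A minimal sketch with made-up IDs and millisecond ranges:

```python
# Each sound carries a list of [start_ms, end_ms] pairs, one per clip.
slice_list = [[[0, 5000], [5000, 10000]], [[0, 5000]]]
loc_id_list = [7, 7]      # illustrative location IDs
id_list = ["a", "b"]      # illustrative sound IDs

# One entry per clip.
flattened = [pair for sublist in slice_list for pair in sublist]

# Repeat each sound's metadata once per clip so all lists stay aligned.
repeated_loc, repeated_id = [], []
for sublist, lid, sid in zip(slice_list, loc_id_list, id_list):
    repeated_loc.extend([lid] * len(sublist))
    repeated_id.extend([sid] * len(sublist))

print(flattened)
print(repeated_id)
```

After this step every list has one entry per clip, which is what lets `download_to_dir` index `loc_id`, `id`, and `slice` with the same position.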

`download_to_dir(data=None, to_dir=None, prefix=None)` ¶

download_to_dir

Download retrieved data (fetched by get_svi_from_locations, get_photo_from_location, or get_sound_from_location) to a local directory and populate the corresponding path list on the dataset object.

This method is resume-safe by default: if a file already exists at its target path it is never re-downloaded. You can safely re-run this call after a crash and it will only fetch the files that are still missing, then rebuild the complete path list from what is on disk.

Parameters:

Name	Type	Description	Default
`data`	`str`	Type of data to download: one of 'svi', 'audio', or 'photo'.	`None`
`to_dir`	`str`	The directory in which to save the downloaded data. It is created if it does not exist.	`None`
`prefix`	`str`	The prefix to add to each output filename.	`None`
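The resume-safe behavior described above comes down to "skip any file that already exists at its target path". A minimal sketch of that pattern, assuming a stand-in downloader; `target_path`, `fetch_missing`, and `fake_fetch` are hypothetical names, not urbanworm functions:

```python
import os
import tempfile


def target_path(to_dir, loc_id, file_id, ext, prefix=None):
    """Build the output name [prefix_]loc_id_file_id.ext inside to_dir."""
    stem = f"{prefix}_{loc_id}" if prefix is not None else f"{loc_id}"
    return os.path.join(to_dir, f"{stem}_{file_id}{ext}")


def fetch_missing(items, to_dir, fetch, ext=".png", prefix=None):
    """Fetch only files not yet on disk; return (paths, number_fetched)."""
    paths, fetched = [], 0
    for loc_id, file_id, source in items:
        p = target_path(to_dir, loc_id, file_id, ext, prefix)
        if not os.path.exists(p):
            try:
                fetch(source, p)
                fetched += 1
            except Exception:
                paths.append(" ")  # placeholder keeps list lengths aligned
                continue
        paths.append(p)
    return paths, fetched


def fake_fetch(source, path):
    # Stand-in for a real downloader: writes the source string to disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)


out_dir = tempfile.mkdtemp()
items = [(0, "img1", "data-1"), (1, "img2", "data-2")]
first_paths, first_fetched = fetch_missing(items, out_dir, fake_fetch)
second_paths, second_fetched = fetch_missing(items, out_dir, fake_fetch)
print(first_fetched, second_fetched)
```

The second call fetches nothing and still returns the full path list, which mirrors re-running `download_to_dir` after a crash.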

Source code in urbanworm/dataset.py

def download_to_dir(self, data:str = None, to_dir:str = None, prefix: str = None)-> None:
    '''
        download_to_dir

        Download retrieved data (fetched by get_svi_from_locations,
        get_photo_from_location, or get_sound_from_location) to a local
        directory and populate the corresponding ``path`` list on the
        dataset object.

        This method is **resume-safe by default**: if a file already
        exists at its target path it is never re-downloaded.  You can
        safely re-run this call after a crash and it will only fetch
        the files that are still missing, then rebuild the complete
        path list from what is on disk.

        Args:
            data (str): Type of data to download: ['svi', 'audio', 'photo'].
            to_dir (str): the directory to save the downloaded data.
            prefix (str, optional):  The prefix to add to the output filename.
    '''
    if data not in ['svi', 'audio', 'photo']:
        raise ValueError('Invalid data type provided. It has to be one of ["svi", "audio", "photo"].')
    if to_dir is None:
        raise ValueError("to_dir must be provided.")
    if not os.path.exists(to_dir):
        logger.info("Directory %s does not exist; creating.", to_dir)
        Path(to_dir).mkdir(parents=True, exist_ok=True)
    if data == 'svi':
        if len(self.svis['id']) == 0:
            return None
        self.svis['path'] = []
        for i in tqdm(range(len(self.svis['data'])), total=len(self.svis['data'])):
            loc_id = self.svis['loc_id'][i]
            img_id = self.svis['id'][i]
            path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
            p = path + f'_{img_id}.png'
            if not os.path.exists(p):
                try:
                    if is_base64(self.svis['data'][i]):
                        save_base64(self.svis['data'][i], p)
                    else:
                        download_image_requests(self.svis['data'][i], p)
                except Exception:
                    self.svis['path'] += [" "]
                    continue
            self.svis['path'] += [p]
    elif data == 'audio':
        if len(self.audios['id']) == 0:
            return None
        self.audios['path'] = []
        if 'slice' in self.audios:
            for i in tqdm(range(len(self.audios['data'])), total=len(self.audios['data'])):
                loc_id = self.audios['loc_id'][i]
                audio_id = self.audios['id'][i]
                slices = self.audios['slice'][i]
                path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
                start = slices[0]
                end = slices[1]
                p = path + f'_{audio_id}_clip_{start}_{end}.mp3'
                if not os.path.exists(p):
                    try:
                        clip(self.audios['data'][i], start, end, p)
                    except Exception:
                        self.audios['path'] += [" "]
                        continue
                self.audios['path'] += [p]
        else:
            for i in tqdm(range(len(self.audios['data'])), total=len(self.audios['data'])):
                loc_id = self.audios['loc_id'][i]
                audio_id = self.audios['id'][i]
                path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
                p = path + f'_{audio_id}.mp3'
                if not os.path.exists(p):
                    try:
                        download_freesound_preview(self.audios['data'][i], p)
                    except Exception:
                        self.audios['path'] += [" "]
                        continue
                self.audios['path'] += [p]
    elif data == 'photo':
        if len(self.photos['id']) == 0:
            return None
        self.photos['path'] = []
        for i in tqdm(range(len(self.photos['data'])), total=len(self.photos['data'])):
            loc_id = self.photos['loc_id'][i]
            photo_id = self.photos['id'][i]
            path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
            p = path + f'_{photo_id}.png'
            if not os.path.exists(p):
                try:
                    download_image_requests(self.photos['data'][i], p)
                except Exception:
                    # download failed: align list lengths with sentinel
                    self.photos['path'] += [" "]
                    continue
            self.photos['path'] += [p]
    return None

`export(output_dir, data='svi', labels=None)` ¶

Export collected data as an organized dataset folder.

Creates:

output_dir/
    metadata.csv          # loc_id, file_id, file_type, file_path, source_data
                          # + optional label columns from `labels`
    images/               # when data in {'svi', 'photo'}
        {loc_id}_{file_id}.png
    audio/                # when data == 'audio'
        {loc_id}_{file_id}.mp3

If a file already exists on disk at the target path it is not downloaded again, so the method is safe to call repeatedly.

Parameters:

Name	Type	Description	Default
`output_dir`	`str`	Root directory for the exported dataset.	required
`data`	`str`	Which modality to export. One of `'svi'`, `'photo'`, or `'audio'`.	`'svi'`
`labels`	`DataFrame`	Optional DataFrame produced by `batch_inference()`. Must contain a `loc_id` column; it is left-joined onto the metadata table so each file row gets the matching label columns.	`None`

Returns:

Type	Description
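`str`	Path to `metadata.csv` inside `output_dir` (absolute only if `output_dir` is absolute). When there is no data to export, the path is returned but the file is not written.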
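A minimal sketch of the folder layout and of the left join on `loc_id`, written in plain Python rather than pandas. The helpers `export_file_path` and `left_join_labels` and the sample label values are hypothetical; unlike a pandas left join, unmatched rows here simply lack the label keys instead of receiving NaN.

```python
from pathlib import Path


def export_file_path(output_dir, data, loc_id, file_id):
    """Target path for one exported file, following the layout above."""
    if data not in ("svi", "photo", "audio"):
        raise ValueError(f"unsupported data type: {data!r}")
    subdir, ext = ("audio", ".mp3") if data == "audio" else ("images", ".png")
    return Path(output_dir) / subdir / f"{loc_id}_{file_id}{ext}"


def left_join_labels(rows, labels_by_loc):
    """Keep every file row; add label columns where loc_id has a match."""
    return [{**row, **labels_by_loc.get(row["loc_id"], {})} for row in rows]


rows = [
    {"loc_id": 0, "file_id": "a",
     "file_path": str(export_file_path("out", "svi", 0, "a"))},
    {"loc_id": 1, "file_id": "b",
     "file_path": str(export_file_path("out", "svi", 1, "b"))},
]
labels = {0: {"label": "residential"}}  # illustrative label for loc_id 0 only
merged = left_join_labels(rows, labels)
print(len(merged))
```

Every file row survives the join, so files at locations without a label still appear in the metadata.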

Source code in urbanworm/dataset.py

def export(
    self,
    output_dir: str,
    data: str = 'svi',
    labels: pd.DataFrame = None,
) -> str:
    """Export collected data as an organized dataset folder.

    Creates::

        output_dir/
            metadata.csv          # loc_id, file_id, file_type, file_path, source_data
                                  # + optional label columns from `labels`
            images/               # when data in {'svi', 'photo'}
                {loc_id}_{file_id}.png
            audio/                # when data == 'audio'
                {loc_id}_{file_id}.mp3

    If a file already exists on disk at the target path it is not
    downloaded again, so the method is safe to call repeatedly.

    Args:
        output_dir: Root directory for the exported dataset.
        data: Which modality to export. One of ``'svi'``, ``'photo'``,
            or ``'audio'``.
        labels: Optional DataFrame produced by ``batch_inference()``.
            Must contain a ``loc_id`` column; it is left-joined onto the
            metadata table so each file row gets the matching label
            columns.

    Returns:
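        Path to ``metadata.csv`` inside ``output_dir`` (absolute only if
        ``output_dir`` is absolute). When there is no data to export, the
        path is returned but the file is not written.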
    """
    if data not in ('svi', 'photo', 'audio'):
        raise ValueError(
            "data must be one of 'svi', 'photo', 'audio'; "
            f"got {data!r}"
        )

    out_root = Path(output_dir)

    if data in ('svi', 'photo'):
        files_dir = out_root / 'images'
    else:
        files_dir = out_root / 'audio'
    files_dir.mkdir(parents=True, exist_ok=True)

    payload = (
        self.svis if data == 'svi'
        else self.photos if data == 'photo'
        else self.audios
    )

    if not payload['id']:
        logger.warning("export: no %s data to export.", data)
        return str(out_root / 'metadata.csv')

    ext = '.png' if data != 'audio' else '.mp3'
    rows: list[dict] = []

    for i in range(len(payload['id'])):
        loc_id = payload['loc_id'][i]
        file_id = payload['id'][i]
        source = payload['data'][i]
        existing_path = payload['path'][i] if i < len(payload['path']) else ''

        fname = f"{loc_id}_{file_id}{ext}"
        local_path = str(files_dir / fname)

        # Download / copy only if the file is not already in place
        if not Path(local_path).exists():
            try:
                if existing_path and Path(existing_path).exists():
                    import shutil as _shutil
                    _shutil.copy2(existing_path, local_path)
                elif data != 'audio':
                    if is_url(source):
                        download_image_requests(source, local_path)
                    elif is_image_path(source):
                        import shutil as _shutil
                        _shutil.copy2(source, local_path)
                    else:
                        # assume base64
                        save_base64(source, local_path)
                else:
                    # audio
                    slices = (
                        payload['slice'][i]
                        if 'slice' in payload and i < len(payload['slice'])
                        else None
                    )
                    if slices is not None:
                        clip(source, slices[0], slices[1], local_path)
                    else:
                        download_freesound_preview(source, local_path)
            except Exception as _dl_err:
                logger.warning(
                    "export: could not save %s (loc_id=%s, file_id=%s): %s",
                    local_path, loc_id, file_id, _dl_err,
                )

        rows.append({
            'loc_id': loc_id,
            'file_id': file_id,
            'file_type': data,
            'file_path': local_path,
            'source_data': source if is_url(source) else '<local>',
        })

    meta_df = pd.DataFrame(rows)

    if labels is not None:
        if 'loc_id' in labels.columns:
            meta_df = meta_df.merge(labels, on='loc_id', how='left')
        else:
            logger.warning(
                "export: labels DataFrame has no 'loc_id' column; skipping merge."
            )

    out_csv = out_root / 'metadata.csv'
    meta_df.to_csv(out_csv, index=False)
    logger.info("export: wrote %d rows to %s", len(meta_df), out_csv)
    return str(out_csv)

`set_images(img_type)` ¶

set_images

Set the retrieved street view images or Flickr photos as the image dataset.

Parameters:

Name	Type	Description	Default
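`img_type`	`str`	Which collection to use: 'svi' (street view images) or 'photo' (Flickr photos). Any other value leaves the image dataset unchanged.	required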

Source code in urbanworm/dataset.py

def set_images(self, img_type: str):
    '''
        set_images

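        Set the retrieved street view images or Flickr photos as the image dataset.

        Args:
            img_type (str): Which collection to use: 'svi' (street view images)
                or 'photo' (Flickr photos). Any other value leaves the image
                dataset unchanged.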
    '''
    if img_type == 'svi':
        self.images = self.svis
    elif img_type == 'photo':
        self.images = self.photos
    return None

`plot_data(data=None, export_gdf=False)` ¶

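Plot retrieved data on an interactive map.

Parameters:

Name	Type	Description	Default
`data`	`str`	Type of data to plot: one of 'svi', 'audio', or 'photo'. If None, nothing is plotted.	`None`
`export_gdf`	`bool`	If True, return the GeoDataFrame instead of the map.	`False`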

Source code in urbanworm/dataset.py

def plot_data(self, data: str = None, export_gdf: bool = False):
    '''
    Plot retrieved data on an interactive map.

    Args:
        data (str): Type of data to plot: one of 'svi', 'audio', or 'photo'.
            If None, nothing is plotted.
        export_gdf (bool): If True, return the GeoDataFrame instead of the map.
    '''
    if data is None:
        return None

    if data == 'svi':
        temp = self.svi_metadata
        geometry = gpd.points_from_xy(temp['image_lon'], temp['image_lat'])
        temp['detail'] = temp.apply(
            lambda row: f'<a href="{row["url"]}">View image details</a>',
            axis=1
        )
        gdf = gpd.GeoDataFrame(temp, geometry=geometry, crs="EPSG:4326")
        popup = ["id", "captured_at", "detail"]
    elif data == 'photo':
        temp = self.photo_metadata
        geometry = gpd.points_from_xy(temp['longitude'], temp['latitude'])
        temp['detail'] = temp.apply(
            lambda row: f'<a href="{row["url"]}">View photo details</a>',
            axis=1
        )
        gdf = gpd.GeoDataFrame(temp, geometry=geometry, crs="EPSG:4326")
        popup = ["id", "datetaken", "detail"]
    elif data == 'audio':
        temp = self.audio_metadata
        geometry = gpd.points_from_xy(temp['longitude'], temp['latitude'])
        temp['detail'] = temp.apply(
            lambda row: f'<a href="{row["url"]}">Listen to the sound</a>',
            axis=1
        )
        gdf = gpd.GeoDataFrame(temp, geometry=geometry, crs="EPSG:4326")
        popup = ["id", "created_dt", "detail"]
    else:
        raise ValueError('Invalid data type provided. It has to be one of ["svi", "audio", "photo"].')

    self.plot = gdf.explore(
        popup=popup,
        color="red",
        marker_kwds=dict(radius=5, fill=True),
        tiles="CartoDB positron",
        name="map",
    )
    return gdf if export_gdf else self.plot

Standalone helpers¶

These functions are also available at the top level (from urbanworm import getSV, …) but are more commonly called through GeoTaggedData.

`urbanworm.dataset.getSV(location, loc_id=None, distance=50, key=None, source='mapillary', pano=False, reoriented=False, multi_num=1, interval=1, fov=80, heading=None, pitch=5, height=500, width=700, year=None, season=None, time_of_day=None, target_polygon=None, fov_margin=0.1, fov_min=30.0, fov_max=120.0, building_height=9.0, output_df=True, silent=False)` ¶

getSV

Retrieve the closest street view image(s) near a coordinate. Supports multiple sources; the image is reoriented to face the target coordinate when reoriented=True (Mapillary) or always (Google).

Parameters:

Name	Type	Description	Default
`location`	`list \| tuple`	coordinates (longitude/x and latitude/y)	required
`loc_id`	`int \| str`	The id of the location.	`None`
`distance`	`int`	The max distance in meters between the centroid and the street view.	`50`
`key`	`str`	API access token for the chosen source. Mapillary — pass token or set env var `MAPILLARY_API_KEY`. Google — pass token or set env var `GOOGLE_STREETVIEW_API_KEY`.	`None`
`source`	`str`	Street view data source. One of `"mapillary"` (default) or `"google"`.	`'mapillary'`
`pano`	`bool`	Whether to search for panoramic images only. Mapillary only — ignored for Google. (Default is False)	`False`
`reoriented`	`bool`	Whether to reorient and crop street view images to face the target. Mapillary only — Google always faces the target. (Default is False)	`False`
`multi_num`	`int`	The number of SVIs to retrieve. Mapillary only; Google always returns 1. (Default is 1)	`1`
`interval`	`int`	The interval in meters between each SVI. Mapillary only. (Default is 1)	`1`
`fov`	`int \| float \| str`	Field of view in degrees for the perspective image (default 80). Pass `'auto'` together with `reoriented=True` to size the FOV per image so the target building is just framed — see `target_polygon` / `fov_margin` / `fov_min` / `fov_max`. When `target_polygon` is None, `'auto'` falls back to a distance-based heuristic (assumes ~15 m wide building). Mapillary only — for Google, `fov` is passed directly to the API and clamped to [10, 120]; `'auto'` is not supported.	`80`
`heading`	`int`	Camera heading in degrees. If None, computed from the bearing to the target location.	`None`
`pitch`	`int`	Camera pitch angle. (Default is 5)	`5`
`height`	`int`	Height in pixels of the returned image. (Default is 500)	`500`
`width`	`int`	Width in pixels of the returned image. (Default is 700)	`700`
`year`	`list \| tuple`	Year range of the data as (start year, end year). Mapillary only; ignored for Google with a warning.	`None`
`season`	`str`	Season of the data. Mapillary only; ignored for Google with a warning.	`None`
`time_of_day`	`str`	Time of day of the data. Mapillary only; ignored for Google with a warning.	`None`
`target_polygon`	`Polygon`	Building footprint used by `fov='auto'` to compute the angular extent of the target. Coordinates are assumed to be `(lon, lat)` in WGS84. Mapillary only.	`None`
`fov_margin`	`float`	Fractional padding added to the auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.	`0.1`
`fov_min`	`float`	Lower clamp for `fov='auto'` (degrees). Default 30°. Mapillary only.	`30.0`
`fov_max`	`float`	Upper clamp for `fov='auto'` (degrees). Default 120°. Mapillary only.	`120.0`
`building_height`	`float`	Assumed building height in meters used by `fov='auto'` (default 9 m, ~3 stories). Mapillary only.	`9.0`
`output_df`	`bool`	Whether to also return a DataFrame of metadata. (Default is True)	`True`
`silent`	`bool`	Whether to silence warnings. (Default is False)	`False`
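The exact computation behind `fov='auto'` is not shown on this page. As a rough illustration of the distance-based fallback, the sketch below assumes the FOV is the angle subtended by a 15 m wide facade viewed head-on, padded by `fov_margin` and clamped to `[fov_min, fov_max]`. The function name `auto_fov` and the geometry are assumptions, not the library's implementation.

```python
import math


def auto_fov(distance_m, width_m=15.0, fov_margin=0.10,
             fov_min=30.0, fov_max=120.0):
    """Padded, clamped angle (degrees) subtended by a facade of width_m."""
    if distance_m <= 0:
        return fov_max  # camera at the target: use the widest allowed view
    angle = math.degrees(2.0 * math.atan((width_m / 2.0) / distance_m))
    return max(fov_min, min(fov_max, angle * (1.0 + fov_margin)))


for d in (1.0, 7.5, 100.0):
    print(d, round(auto_fov(d), 2))
```

Under these assumptions a camera 7.5 m away gets roughly 99°, while distant and very close cameras hit the lower and upper clamps respectively.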

Returns:

Type	Description
`list[str]`	A list of images in base64 format.
`DataFrame`	A dataframe containing metadata about the street view images. `captured_at` format is `"YYYY-M-D-H"` for Mapillary and `"YYYY-MM-1-1"` for Google (day and hour are nominal placeholders).
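Because `captured_at` uses dash-separated fields without fixed zero-padding, it may be easiest to split it manually. A minimal sketch, assuming the two formats described above; `parse_captured_at` is a hypothetical helper and the sample values are made up:

```python
def parse_captured_at(value):
    """Split a captured_at string into integer (year, month, day, hour)."""
    year, month, day, hour = (int(part) for part in value.split("-"))
    return year, month, day, hour


print(parse_captured_at("2021-7-4-15"))  # Mapillary-style value
print(parse_captured_at("2021-07-1-1"))  # Google-style value; day/hour are placeholders
```

For Google results only the year and month carry information, so downstream filters should ignore the last two fields.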

Source code in urbanworm/dataset.py

def getSV(location: list|tuple,
          loc_id: int | str = None,
          distance:int = 50,
          key: str = None,
          source: str = "mapillary",
          pano: bool = False,
          reoriented: bool = False,
          multi_num: int = 1,
          interval: int = 1,
          fov: int | float | str = 80, heading: int = None, pitch: int = 5,
          height: int = 500, width: int = 700,
          year: list | tuple = None,
          season: str = None,
          time_of_day: str = None,
          target_polygon=None,
          fov_margin: float = 0.10,
          fov_min: float = 30.0,
          fov_max: float = 120.0,
          building_height: float = 9.0,
          output_df: bool = True,
          silent: bool = False) -> pd.DataFrame | list | None:
    """
        getSV

        Retrieve the closest street view image(s) near a coordinate.
        Supports multiple sources; the image is reoriented to face the target
        coordinate when ``reoriented=True`` (Mapillary) or always (Google).

        Args:
            location: coordinates (longitude/x and latitude/y)
            loc_id (int|str, optional): The id of the location.
            distance (int): The max distance in meters between the centroid and the street view.
            key (str): API access token for the chosen source.
                Mapillary — pass token or set env var ``MAPILLARY_API_KEY``.
                Google    — pass token or set env var ``GOOGLE_STREETVIEW_API_KEY``.
            source (str): Street view data source. One of ``"mapillary"`` (default)
                or ``"google"``.
            pano (bool): Whether to search for panoramic images only.
                Mapillary only — ignored for Google. (Default is False)
            reoriented (bool): Whether to reorient and crop street view images to face
                the target. Mapillary only — Google always faces the target.
                (Default is False)
            multi_num (int): The number of multiple SVIs. Mapillary only — Google
                always returns 1. (Default is 1)
            interval (int): The interval in meters between each SVI.
                Mapillary only. (Default is 1)
            fov (int | float | str): Field of view in degrees for the perspective image
                (default 80). Pass ``'auto'`` together with ``reoriented=True`` to
                size the FOV per image so the target building is just framed —
                see ``target_polygon`` / ``fov_margin`` / ``fov_min`` / ``fov_max``.
                When ``target_polygon`` is None, ``'auto'`` falls back to a
                distance-based heuristic (assumes ~15 m wide building).
                Mapillary only — for Google, ``fov`` is passed directly to the API
                and clamped to [10, 120]; ``'auto'`` is not supported.
            heading (int): Camera heading in degrees. If None, computed from the
                bearing to the target location.
            pitch (int): Camera pitch angle. (Default is 5)
            height (int): Height in pixels of the returned image. (Default is 500)
            width (int): Width in pixels of the returned image. (Default is 700)
            year (list[str], optional): Year of data (start year, end year).
                Mapillary only — ignored for Google with a warning.
            season (str, optional): Season of data.
                Mapillary only — ignored for Google with a warning.
            time_of_day (str, optional): Time of data.
                Mapillary only — ignored for Google with a warning.
            target_polygon (shapely.geometry.Polygon, optional): Building footprint
                used by ``fov='auto'`` to compute the angular extent of the target.
                Coordinates are assumed to be ``(lon, lat)`` in WGS84.
                Mapillary only.
            fov_margin (float): Fractional padding added to the auto-computed
                FOV (0.10 = +10%). Default 0.10. Mapillary only.
            fov_min (float): Lower clamp for ``fov='auto'`` (degrees). Default 30°.
                Mapillary only.
            fov_max (float): Upper clamp for ``fov='auto'`` (degrees). Default 120°.
                Mapillary only.
            building_height (float): Assumed building height in meters used by
                ``fov='auto'`` (default 9 m, ~3 stories). Mapillary only.
            output_df (bool, optional): Whether to also return a DataFrame of metadata.
                (Default is True)
            silent (bool, optional): Whether to silence warnings. (Default is False)

        Returns:
            list[str]: A list of images in base64 format.
            DataFrame: A dataframe containing metadata about the street view images.
                ``captured_at`` format is ``"YYYY-M-D-H"`` for Mapillary and
                ``"YYYY-MM-1-1"`` for Google (day and hour are nominal placeholders).
    """
    source = source.lower().strip()

    if source == "google":
        # Warn about params that Google does not support.
        # warnings.warn() deduplicates by call-site, so each message appears
        # only once even when getSV() is called in a loop (e.g. from
        # get_svi_from_locations), unlike logger.warning() which fires every time.
        if multi_num > 1:
            warnings.warn(
                "getSV: multi_num > 1 is not supported for source='google'; using 1.",
                stacklevel=2,
            )
        if any([year, season, time_of_day]):
            warnings.warn(
                "getSV: year/season/time_of_day filtering is not supported for "
                "source='google' (API does not expose historical imagery). "
                "These parameters will be ignored.",
                stacklevel=2,
            )
        if isinstance(fov, str) and fov.strip().lower() == "auto":
            warnings.warn(
                "getSV: fov='auto' is not supported for source='google'. "
                "Falling back to fov=80.",
                stacklevel=2,
            )
            fov = 80
        return _getSV_google(
            location=location, loc_id=loc_id, distance=distance, key=key,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            output_df=output_df, silent=silent,
        )

    if source == "mapillary":
        return _getSV_mapillary(
            location=location, loc_id=loc_id, distance=distance, key=key,
            pano=pano, reoriented=reoriented, multi_num=multi_num, interval=interval,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            year=year, season=season, time_of_day=time_of_day,
            target_polygon=target_polygon, fov_margin=fov_margin,
            fov_min=fov_min, fov_max=fov_max, building_height=building_height,
            output_df=output_df, silent=silent,
        )

    raise ValueError(
        f"getSV: unknown source '{source}'. Choose 'mapillary' or 'google'."
    )
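The comment in the Google branch relies on Python's default warning filter, which reports a given message once per call site. The following is a minimal sketch of that behavior, not part of urbanworm: `fetch` and `count_warnings` are hypothetical stand-ins, and the `"default"` filter is set explicitly so the result does not depend on how the interpreter was started.

```python
import warnings


def fetch(multi_num=1):
    # Hypothetical stand-in for the Google branch of getSV.
    if multi_num > 1:
        # stacklevel=2 attributes the warning to the caller's line, so the
        # once-per-location rule is applied per call site.
        warnings.warn("multi_num > 1 is not supported; using 1.", stacklevel=2)
    return "image"


def count_warnings(n_calls):
    """Call fetch() n_calls times from a single line and count warnings shown."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("default")  # report each (message, location) once
        for _ in range(n_calls):
            fetch(multi_num=3)  # same call site on every iteration
    return len(caught)


shown = count_warnings(5)
```

With the `"always"` filter instead, every call would be reported, which is closer to what a logger-based warning would do inside a loop.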

`urbanworm.dataset.getPhoto(location, loc_id=None, distance=50, key=None, query=None, geo_context=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, exclude_from_location=None, output_df=True)` ¶

getPhoto

Fetch public Flickr photos with geotags near a location (or within a Flickr place).

Parameters:

Name	Type	Description	Default
`location`	`list \| tuple`	Coordinates `(longitude, latitude)` around which to search for geotagged photos.	required
`loc_id`	`int \| str`	The id of the location.	`None`
`distance`	`int`	Search radius in meters (converted to km; Flickr radius max is 32 km).	`50`
`key`	`str`	Flickr API key. If None, reads env var FLICKR_API_KEY.	`None`
`query`	`str \| list[str]`	Query parameters to pass to Flickr API (free text search).	`None`
`geo_context`	`int`	Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)	`None`
`tag`	`str \| list[str]`	Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.	`None`
`max_return`	`int`	Number of photos to return (after filters).	`1`
`year`	`list \| tuple`	`[Y]`, `(Y,)`, or `(Y1, Y2)`, inclusive. Filters by taken-date range.	`None`
`season`	`str`	One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).	`None`
`time_of_day`	`str`	One of {"morning","afternoon","evening","night"} (post-filter by taken hour).	`None`
`exclude_from_location`	`int`	Drop retrieved photos within this distance (in meters) of the given location.	`None`
`output_df`	`bool`	If True, return a pandas.DataFrame; otherwise return dict (if max_return==1) or list[dict].	`True`

Returns:

Type	Description
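`dict \| list[dict] \| pandas.DataFrame`	A DataFrame when `output_df` is True; otherwise a dict (if `max_return == 1`) or a list of dicts.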
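As a rough illustration of how `distance` and `max_return` translate into request sizes, here is a minimal sketch that mirrors the arithmetic in the source listing for this function. The helper names `flickr_radius_km` and `flickr_per_page` are hypothetical and do not exist in urbanworm.

```python
def flickr_radius_km(distance_m):
    """Convert a search radius in meters to the km value sent to Flickr."""
    radius_km = max(float(distance_m) / 1000.0, 0.01)  # floor at 10 m
    return min(radius_km, 32.0)  # cap at the 32 km maximum radius


def flickr_per_page(max_return):
    """Page size: 20x the requested count, kept within 50..250."""
    return min(250, max(50, int(max_return) * 20))


radii = [flickr_radius_km(d) for d in (5, 50, 50_000)]
page_sizes = [flickr_per_page(n) for n in (1, 5, 100)]
```

Requesting more candidates than `max_return` leaves room for the season, time-of-day, and exclusion post-filters to discard some results.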

Source code in urbanworm/dataset.py

def getPhoto(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        key: str = None,
        query: str | list[str] = None,
        geo_context: int = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        exclude_from_location:int = None,
        output_df: bool = True
):
    """
        getPhoto

        Fetch public Flickr photos with geotags near a location (or within a Flickr place).

        Args:
            location (list|tuple): Coordinates (longitude, latitude) around which to search for geotagged photos.
            loc_id (int | str): The id of the location.
            distance (int): Search radius in meters (converted to km; Flickr radius max is 32 km).
            key (str): Flickr API key. If None, reads env var FLICKR_API_KEY.
            query (str | list[str]): Query parameters to pass to Flickr API (free text search).
            geo_context (int): Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)
            tag (str | list[str]): Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.
            max_return (int): Number of photos to return (after filters).
            year (list | tuple): [Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.
            season (str): One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).
            time_of_day (str): One of {"morning","afternoon","evening","night"} (post-filter by taken hour).
            exclude_from_location (int, optional): Drop retrieved photos within this distance (in meters) of the given location. (Default is None)
            output_df (bool): If True, return a pandas.DataFrame; otherwise return dict (if max_return==1)
                       or list[dict].

        Returns:
            dict | list[dict] | pandas.DataFrame
    """

    import os
    from datetime import datetime, timedelta, timezone

    import requests

    if exclude_from_location is not None:
        # Build the exclusion area from the exclusion radius, not the search radius.
        drop_area = projection(location, r=exclude_from_location)

    # -------------------------
    # Validate inputs
    # -------------------------
    if max_return is None or int(max_return) < 1:
        raise ValueError("max_return must be >= 1.")
    max_return = int(max_return)

    api_key = key or os.getenv("FLICKR_API_KEY")
    if not api_key:
        raise ValueError("Missing Flickr API key. Pass key=... or set env var FLICKR_API_KEY.")

    lon, lat = location
    months = season_months(season)
    hours = tod_hours(time_of_day)
    y_range = year_range(year)

    # Radius in km (Flickr max 32km)
    radius_km = max(float(distance) / 1000.0, 0.01)
    radius_km = min(radius_km, 32.0)

    # Geo queries need a "limiting agent"; tags or min/max dates qualify.
    # If user provided none, default to last 365 days so results aren’t silently limited to ~12 hours.
    now_utc = datetime.now(timezone.utc)
    default_min_upload_date = int((now_utc - timedelta(days=365)).timestamp())

    # -------------------------
    # Build Flickr request
    # -------------------------
    endpoint = "https://api.flickr.com/services/rest/"

    extras = ",".join(
        [
            "description",
            "license",
            "date_upload",
            "date_taken",
            "owner_name",
            "geo",
            "tags",
            "views",
            "media",
            "url_sq",
            "url_t",
            "url_s",
            "url_q",
            "url_m",
            "url_n",
            "url_z",
            "url_c",
            "url_l",
            "url_o",
        ]
    )

    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "format": "json",
        "nojsoncallback": 1,
        "extras": extras,
        "safe_search": 1, # safe only for un-authed calls
        "media": "photos",
        "has_geo": 1,
        "content_types": 0, # photos
        "sort": "relevance",
        "lat": lat,
        "lon": lon,
        "radius": radius_km,
        "radius_units": "km"
    }

    if query:
        q = query_string(query)
        if q:
            params["text"] = q

    if geo_context is not None:
        params["geo_context"] = geo_context

    # tags
    if tag:
        if isinstance(tag, (list, tuple)):
            tags = ",".join([str(t).strip() for t in tag if str(t).strip()])
            params["tags"] = tags
            params["tag_mode"] = "all"
        else:
            params["tags"] = str(tag).strip()

    # date range (taken) if specified
    if y_range is not None:
        params["min_taken_date"], params["max_taken_date"] = y_range
    else:
        # If no explicit limiting agent, set min_upload_date (acts as limiting agent for geo queries).
        if not tag and season is None and time_of_day is None:
            params["min_upload_date"] = default_min_upload_date

    # -------------------------
    # Fetch + post-filter
    # -------------------------
    session = requests.Session()

    # Geo/bbox queries only return up to 250/page.
    per_page = min(250, max(50, max_return * 20))
    params["per_page"] = per_page

    results = []
    seen = set()

    max_pages = 150
    for page in range(1, max_pages + 1):
        params["page"] = page
        r = session.get(endpoint, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()

        if data.get("stat") != "ok":
            msg = data.get("message") or data.get("error") or str(data)
            raise RuntimeError(f"Flickr API error: {msg}")

        photos = (data.get("photos") or {}).get("photo") or []
        if not photos:
            break

        for p in photos:
            if exclude_from_location is not None:
                if is_coordinate_in_bbox(p["longitude"], p["latitude"], drop_area):
                    continue
            pid = p.get("id")
            if not pid or pid in seen:
                continue
            seen.add(pid)

            taken_dt = parse_taken(p)
            if months and taken_dt and taken_dt.month not in months:
                continue
            if hours and taken_dt and taken_dt.hour not in hours:
                continue

            s_lat = float(p["latitude"]) if "latitude" in p and p["latitude"] not in (None, "") else None
            s_lon = float(p["longitude"]) if "longitude" in p and p["longitude"] not in (None, "") else None

            url = best_url(p)
            out = {
                "loc_id": '',
                "id": pid,
                "title": p.get("title"),
                "owner": p.get("owner"),
                # "ownername": p.get("ownername"),
                "datetaken": p.get("datetaken") or p.get("date_taken"),
                "latitude": s_lat,
                "longitude": s_lon,
                # "accuracy": int(p["accuracy"]) if "accuracy" in p and str(p["accuracy"]).isdigit() else None,
                "distance_m": haversine_m(lat, lon, s_lat, s_lon) if (s_lat is not None and s_lon is not None) else None,
                "tags": p.get("tags"),
                "description": p.get("description"),
                "views": int(p["views"]) if "views" in p and str(p["views"]).isdigit() else None,
                "license": p.get("license"),
                "url": url,
                # "page_url": f"https://www.flickr.com/photos/{p.get('owner')}/{pid}",
            }

            if loc_id is not None:
                out["loc_id"] = loc_id
            else:
                del out["loc_id"]

            results.append(out)

            # if len(results) >= max_return:
            #     break

        if len(results) >= max_return:
            break

    if output_df:
        import pandas as pd
        df = pd.DataFrame(results)
        if df.empty:
            # No photos matched: there is no 'distance_m' column to sort on.
            return df
        df = df.sort_values(by='distance_m', ascending=True)
        return df.head(max_return)

    if max_return == 1:
        return results[0] if results else None
    return results
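The tag and date handling in the listing above decides which "limiting agent" accompanies the geo query. A minimal sketch of just that decision is shown here, assuming a hypothetical helper `build_filter_params` (not part of urbanworm) and placeholder date strings for `y_range`, whose real format comes from a helper that is not shown in this page.

```python
from datetime import datetime, timedelta, timezone


def build_filter_params(tag=None, y_range=None, season=None, time_of_day=None, now=None):
    """Return the tag/date entries that would be merged into the request params."""
    params = {}
    if tag:
        if isinstance(tag, (list, tuple)):
            params["tags"] = ",".join(str(t).strip() for t in tag if str(t).strip())
            params["tag_mode"] = "all"  # every listed tag must match
        else:
            params["tags"] = str(tag).strip()

    if y_range is not None:
        params["min_taken_date"], params["max_taken_date"] = y_range
    elif not tag and season is None and time_of_day is None:
        # No limiting agent supplied: fall back to uploads from the last 365 days.
        now = now or datetime.now(timezone.utc)
        params["min_upload_date"] = int((now - timedelta(days=365)).timestamp())
    return params


fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
with_tags = build_filter_params(tag=["park", " bench "])
# The date strings below are illustrative placeholders only.
with_years = build_filter_params(y_range=("2020-01-01 00:00:00", "2021-12-31 23:59:59"))
fallback = build_filter_params(now=fixed_now)
```

Note that, as in the listing, passing only `season` or `time_of_day` sets neither a date range nor the upload-date fallback, because those two filters are applied after the response is received.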

`urbanworm.dataset.getSound(location, loc_id=None, distance=50, source='freesound', key=None, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=300, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, output_df=True)` ¶

Dispatch to the per-source helpers.

Parameters:

Name	Type	Description	Default
`source`	`str`	One of {"freesound", "aporee"}.	`'freesound'`
`catalog`	`str \| DataFrame`	Required when source="aporee"; see `getSoundAporee`.	`None`
`probe_durations`	`bool`	Aporee only; see `getSoundAporee`.	`True`

All other arguments are forwarded; `key` is only used by Freesound, `catalog` and `probe_durations` only by Aporee.

Source code in urbanworm/dataset.py

def getSound(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        source: str = 'freesound',
        key: str = None,
        catalog: str | pd.DataFrame = None,
        query: str | list[str] | None = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        duration: int = 300,
        exclude_from_location: int = None,
        slice_duration: int = None,
        slice_max_num: int = None,
        probe_durations: bool = True,
        output_df: bool = True,
) -> pd.DataFrame | dict | list | None:
    """Dispatch to the per-source helpers.

    Args:
        source (str): one of {"freesound", "aporee"}. Default "freesound".
        catalog: required when source="aporee" — see :func:`getSoundAporee`.
        probe_durations: Aporee-only. See :func:`getSoundAporee`.

    All other arguments are forwarded; ``key`` is only used by Freesound,
    ``catalog`` and ``probe_durations`` only by Aporee.
    """
    src = (source or 'freesound').lower()
    if src == 'freesound':
        return _getSoundFreesound(
            location=location, loc_id=loc_id, distance=distance, key=key,
            query=query, tag=tag, max_return=max_return, year=year,
            season=season, time_of_day=time_of_day, duration=duration,
            exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            output_df=output_df,
        )
    elif src == 'aporee':
        return getSoundAporee(
            location=location, loc_id=loc_id, distance=distance,
            catalog=catalog, query=query, tag=tag, max_return=max_return,
            year=year, season=season, time_of_day=time_of_day,
            duration=duration, exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            probe_durations=probe_durations,
            output_df=output_df,
        )
    else:
        raise ValueError(
            f"Unsupported sound source {source!r}; choose 'freesound' or 'aporee'."
        )
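The dispatch pattern above can be sketched in isolation. In this minimal example the two helpers are hypothetical stubs that only report which keyword arguments they received, `dispatch_sound` is an illustrative name, and `"catalog.csv"` is a placeholder path; none of these are part of urbanworm.

```python
def _freesound_stub(**kwargs):
    # Hypothetical stand-in for the Freesound helper.
    return ("freesound", sorted(kwargs))


def _aporee_stub(**kwargs):
    # Hypothetical stand-in for the Aporee helper.
    return ("aporee", sorted(kwargs))


def dispatch_sound(source="freesound", key=None, catalog=None, **common):
    """Route to a per-source helper, forwarding only what that source uses."""
    src = (source or "freesound").lower()  # None falls back to the default
    if src == "freesound":
        return _freesound_stub(key=key, **common)
    if src == "aporee":
        return _aporee_stub(catalog=catalog, **common)
    raise ValueError(
        f"Unsupported sound source {source!r}; choose 'freesound' or 'aporee'."
    )


default_route = dispatch_sound(source=None, key="token", distance=50)
aporee_route = dispatch_sound(source="Aporee", catalog="catalog.csv", distance=50)
try:
    dispatch_sound(source="xeno-canto")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

The source name is matched case-insensitively, and an unrecognized name fails loudly instead of silently falling back to a default backend.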
