
Dataset

The GeoTaggedData class is the central data container for the collection pipeline. It fetches building footprints, retrieves geo-located media (street views, photos, audio), and hands the results off to an inference backend.

GeoTaggedData

urbanworm.dataset.GeoTaggedData(locations=None, units=None)

Parameters:

Name Type Description Default
locations list | tuple | dict | DataFrame

A list of coordinates (longitude/x and latitude/y), a dictionary keyed by longitude and latitude, or a DataFrame with columns "longitude" and "latitude".

None
units GeoDataFrame

A GeoDataFrame, or the path to a shapefile or GeoJSON file.

None

Examples:

retrieve street view with building footprints (OSM)

gtd = GeoTaggedData()
gtd.getBuildings(bbox=(-83.235572,42.348092,-83.235154,42.348806))
gtd.get_svi_from_locations(key="your Mapillary token")

locations - a nested list of coordinates

gtd = GeoTaggedData(locations=[[-83.235572,42.348092],[-83.235154,42.348806]])

locations - a dataframe with columns "longitude" and "latitude"

df = pd.DataFrame({"longitude":[-83.235572, -83.235154], "latitude":[42.348092, 42.348806]})
gtd = GeoTaggedData(locations=df)
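
The accepted shapes of locations can all be reduced to a list of (longitude, latitude) pairs. The sketch below illustrates that idea with the standard library only. normalize_locations is a hypothetical helper and is not part of urbanworm; how the class itself converts its input is not shown on this page.

```python
from collections.abc import Mapping


def normalize_locations(locations):
    """Return (longitude, latitude) pairs from a list/tuple or a column mapping.

    Hypothetical helper for illustration; not part of urbanworm.
    """
    # Mapping-style input: {"longitude": [...], "latitude": [...]}.
    if isinstance(locations, Mapping):
        lons, lats = locations["longitude"], locations["latitude"]
        if len(lons) != len(lats):
            raise ValueError("longitude and latitude must have the same length")
        return [(float(x), float(y)) for x, y in zip(lons, lats)]
    # A single [lon, lat] pair.
    if len(locations) == 2 and all(isinstance(v, (int, float)) for v in locations):
        return [(float(locations[0]), float(locations[1]))]
    # A nested sequence of [lon, lat] pairs.
    return [(float(x), float(y)) for x, y in locations]


pairs = normalize_locations([[-83.235572, 42.348092], [-83.235154, 42.348806]])
```

A pandas DataFrame could be passed through the same helper after converting it with df.to_dict("list"), which yields the mapping form handled by the first branch.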

Source code in urbanworm/dataset.py
def __init__(self,
             locations: list|tuple|dict|pd.DataFrame=None,
             units: GeoDataFrame=None):
    '''
    Args:
        locations (list|tuple|dict|DataFrame): A list of coordinates (longitude/x and latitude/y), a dictionary keyed by longitude and latitude, or a DataFrame with columns "longitude" and "latitude".
        units (GeoDataFrame): The path to the shapefile or geojson file, or GeoDataFrame.

    Examples:
        # retrieve street view with building footprints (OSM)
        gtd = GeoTaggedData()
        gtd.getBuildings(bbox=(-83.235572,42.348092,-83.235154,42.348806))
        gtd.get_svi_from_locations(key="your Mapillary token")

        # locations - a nested list of coordinates
        gtd = GeoTaggedData(locations=[[-83.235572,42.348092],[-83.235154,42.348806]])
        # locations - a dataframe with columns "longitude" and "latitude"
        df = pd.DataFrame({"longitude":[-83.235572, -83.235154], "latitude":[42.348092, 42.348806]})
        gtd = GeoTaggedData(locations=df)
    '''

    self.images = None
    self.locations = locations
    self.units = units
    if locations is not None and units is None:
        self.construct_units()

    # NOTE: each must be its own dict literal — chained assignment would
    # alias all three names to the SAME underlying dict object.
    def _empty_payload():
        return {'loc_id': [], 'id': [], 'data': [], 'path': []}

    self.svis = _empty_payload()
    self.photos = _empty_payload()
    self.audios = _empty_payload()

    self.svi_metadata = None
    self.photo_metadata = None
    self.audio_metadata = None
    self.plot = None
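
The NOTE in the constructor about chained assignment can be checked in isolation. The snippet below is a standalone illustration: the names a, b, c and empty_payload are made up for the demo and only mirror the pattern used in __init__.

```python
# Chained assignment binds every name to ONE dict object.
a = b = c = {'loc_id': [], 'id': [], 'data': [], 'path': []}
a['id'].append('img-1')
aliased = b['id']            # b sees the append made through a


# A factory returns a fresh dict (and fresh lists) on every call.
def empty_payload():
    return {'loc_id': [], 'id': [], 'data': [], 'path': []}


svis, photos, audios = empty_payload(), empty_payload(), empty_payload()
svis['id'].append('img-1')
independent = photos['id']   # untouched by the append to svis
```

With the factory, appending street view records cannot leak into the photo or audio payloads.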

Functions

getBuildings(bbox=None, source='osm', min_area=0, max_area=None, random_sample=None, gba_path=None, gba_cache_dir=None)

Extract buildings from a public source using the bbox.

Parameters:

Name Type Description Default
bbox list or tuple

The bounding box (min_lon, min_lat, max_lon, max_lat).

None
source str

One of:

  • 'osm' (default): OpenStreetMap via the Overpass API. Footprints only; no height.
  • 'microsoft': Microsoft GlobalMLBuildingFootprints (Bing). Footprints only; no height.
  • 'globfp3d': 3D-GloBFP (https://zenodo.org/records/15487037; Che et al., ESSD 2024). Footprints + per-building height. Auto-fetches from Zenodo + Figshare and caches under gba_cache_dir (default ~/.cache/urbanworm/globfp3d). Pass gba_path to load from a pre-downloaded local file. The resulting self.units keeps a height_m column.
  • 'gba': GlobalBuildingAtlas (https://github.com/zhu-xlab/GlobalBuildingAtlas; Zhu et al., ESSD 2025). A different dataset from 3D-GloBFP, hosted on HuggingFace + mediaTUM. Auto-fetches polygon tiles from zhu-xlab/GBA.LoD1 using representative/lod1.geojson as the manifest. Caches under ~/.cache/urbanworm/gba. NOTE: GBA polygons ship without per-row height; heights live in a separate mediaTUM dataset (m1837832) which isn't joined yet (tracking issue in CHANGELOG). For per-building height today, use source='globfp3d'.
'osm'
min_area float or int

The minimum area in m².

0
max_area float or int

The maximum area in m².

None
random_sample int

If set, randomly subsample to this many.

None
gba_path str

Optional. Path to a pre-downloaded file (skips network calls). Used by both 'globfp3d' and 'gba' sources.

None
gba_cache_dir str

Optional. Where to cache auto-fetched files. Defaults differ per source (~/.cache/urbanworm/globfp3d or .../gba).

None
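
Because bbox is positional, (min_lon, min_lat, max_lon, max_lat), a reversed or incomplete box is an easy mistake to make. The sketch below shows one way to check a box before calling getBuildings. validate_bbox is a hypothetical helper, not an urbanworm function, and it assumes WGS84 degrees and a box that does not cross the antimeridian.

```python
def validate_bbox(bbox):
    """Check a (min_lon, min_lat, max_lon, max_lat) bounding box in degrees.

    Hypothetical helper for illustration; not part of urbanworm.
    """
    if bbox is None or len(bbox) != 4:
        raise ValueError("bbox must have four values: (min_lon, min_lat, max_lon, max_lat)")
    min_lon, min_lat, max_lon, max_lat = (float(v) for v in bbox)
    # The lower-left corner must come first on both axes.
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError("longitudes must satisfy -180 <= min_lon < max_lon <= 180")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= min_lat < max_lat <= 90")
    return (min_lon, min_lat, max_lon, max_lat)


bbox = validate_bbox((-83.235572, 42.348092, -83.235154, 42.348806))
```

The check catches reversed corners and out-of-range values. It cannot detect longitude and latitude being swapped when both happen to fall inside the valid ranges.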
Source code in urbanworm/dataset.py
def getBuildings(self,
                 bbox: list | tuple = None,
                 source: str = 'osm',
                 min_area: float | int = 0,
                 max_area: float | int = None,
                 random_sample: int = None,
                 gba_path: str = None,
                 gba_cache_dir: str = None) -> None:
    '''
        Extract buildings from a public source using the bbox.

        Args:
            bbox (list or tuple): The bounding box (min_lon, min_lat, max_lon, max_lat).
            source (str): One of:

              * ``'osm'`` (default) — OpenStreetMap via the Overpass API.
                Footprints only; no height.
              * ``'microsoft'`` — Microsoft GlobalMLBuildingFootprints
                (Bing). Footprints only; no height.
              * ``'globfp3d'`` — `3D-GloBFP <https://zenodo.org/records/15487037>`_
                (Che et al., ESSD 2024). **Footprints + per-building
                height.** Auto-fetches from Zenodo + Figshare and
                caches under ``gba_cache_dir`` (default
                ``~/.cache/urbanworm/globfp3d``). Pass ``gba_path``
                to load from a pre-downloaded local file. The
                resulting ``self.units`` keeps a ``height_m`` column.
              * ``'gba'`` — `GlobalBuildingAtlas <https://github.com/zhu-xlab/GlobalBuildingAtlas>`_
                (Zhu et al., ESSD 2025). **A different dataset from
                3D-GloBFP**, hosted on HuggingFace + mediaTUM.
                Auto-fetches polygon tiles from
                ``zhu-xlab/GBA.LoD1`` using ``representative/lod1.geojson``
                as the manifest. Caches under ``~/.cache/urbanworm/gba``.
                NOTE: GBA polygons ship without per-row height — heights
                live in a separate mediaTUM dataset (`m1837832`) which
                isn't joined yet (tracking issue in CHANGELOG). For
                per-building height today, use ``source='globfp3d'``.

            min_area (float or int): The minimum area in m².
            max_area (float or int): The maximum area in m².
            random_sample (int): If set, randomly subsample to this many.
            gba_path (str): Optional. Path to a pre-downloaded file
                (skips network calls). Used by both ``'globfp3d'`` and
                ``'gba'`` sources.
            gba_cache_dir (str): Optional. Where to cache auto-fetched
                files. Defaults differ per source
                (``~/.cache/urbanworm/globfp3d`` or ``.../gba``).
    '''

    if source not in ('osm', 'microsoft', 'gba', 'globfp3d'):
        raise ValueError(
            f'Unsupported building source {source!r}; '
            f'choose from "osm", "microsoft", "globfp3d", or "gba".'
        )

    if source == 'osm':
        buildings = getOSMbuildings(bbox, min_area, max_area)
    elif source == 'microsoft':
        buildings = getGlobalMLBuilding(bbox, min_area, max_area)
    elif source == 'globfp3d':
        buildings = getGloBFP3DBuildings(
            bbox, gba_path=gba_path, min_area=min_area,
            max_area=max_area, cache_dir=gba_cache_dir,
        )
    else:  # 'gba'
        buildings = getGBABuildings(
            bbox, gba_path=gba_path, min_area=min_area,
            max_area=max_area, cache_dir=gba_cache_dir,
        )

    if buildings is None or buildings.empty:
        if source == 'osm':
            logger.warning(
                "No buildings found in the bounding box. "
                "Check https://overpass-turbo.eu/ for areas with buildings."
            )
        elif source == 'microsoft':
            logger.warning(
                "No buildings found in the bounding box. "
                "Check https://github.com/microsoft/GlobalMLBuildingFootprints "
                "for areas with buildings."
            )
        elif source == 'globfp3d':
            logger.warning(
                "No 3D-GloBFP buildings found in the bounding box %s "
                "(local: %s).", bbox, gba_path,
            )
        else:
            logger.warning(
                "No GBA buildings found in the bounding box %s "
                "(local: %s).", bbox, gba_path,
            )
        return None
    if random_sample is not None:
        # Clamp so sampling never asks for more rows than exist.
        buildings = buildings.sample(min(random_sample, len(buildings)))
    self.units = buildings.to_crs(4326)
    with_height = (
        int(buildings["height_m"].notna().sum())
        if "height_m" in buildings.columns else 0
    )
    if with_height:
        logger.info(
            "%d buildings found in the bounding box (%d with height_m).",
            len(buildings), with_height,
        )
    else:
        logger.info("%d buildings found in the bounding box.", len(buildings))
    return None
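
The cache locations described for 'globfp3d' and 'gba' follow a simple rule: an explicit gba_cache_dir wins, otherwise a per-source default applies. The sketch below restates that rule using only the two defaults named in the parameter description. resolve_cache_dir and DEFAULT_CACHE_DIRS are hypothetical names, and the real resolution logic inside the fetch helpers may differ.

```python
from pathlib import Path

# Per-source defaults as stated in the parameter description.
DEFAULT_CACHE_DIRS = {
    "globfp3d": "~/.cache/urbanworm/globfp3d",
    "gba": "~/.cache/urbanworm/gba",
}


def resolve_cache_dir(source, gba_cache_dir=None):
    """Pick the explicit cache dir if given, else the documented per-source default.

    Hypothetical helper for illustration; not part of urbanworm.
    """
    if source not in DEFAULT_CACHE_DIRS:
        raise ValueError(f"no documented cache default for source {source!r}")
    # expanduser() turns a leading "~" into the current user's home directory.
    return Path(gba_cache_dir or DEFAULT_CACHE_DIRS[source]).expanduser()


explicit = resolve_cache_dir("globfp3d", "/tmp/urbanworm-cache")
```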

get_svi_from_locations(id_column=None, distance=50, key=None, source='mapillary', pano=True, reoriented=True, multi_num=1, interval=1, fov=80, heading=None, pitch=5, height=500, width=700, year=None, season=None, time_of_day='day', fov_margin=0.1, fov_min=30.0, fov_max=120.0, building_height=9.0, silent=True, checkpoint_path=None)

get_svi_from_locations

Retrieve the closest street view image(s) near each coordinate. The street view image will be reoriented to look at the coordinate when reoriented=True (Mapillary) or always (Google).

Parameters:

Name Type Description Default
id_column str

The name of the column that holds a unique identifier for each location.

None
distance int

The max distance in meters between the centroid and the street view.

50
key str

API access token for the chosen source. Mapillary — pass token or set env var MAPILLARY_API_KEY. Google — pass token or set env var GOOGLE_STREETVIEW_API_KEY.

None
source str

Street view data source. One of "mapillary" (default) or "google".

'mapillary'
pano bool

Whether to search for pano street view images only. Mapillary only — ignored for Google. (Default is True)

True
reoriented bool

Whether to reorient and crop street view images. Mapillary only — Google always faces the target. (Default is True)

True
multi_num int

The number of SVIs to retrieve per location. Mapillary only — Google always returns 1. (Default is 1)

1
interval int

The interval in meters between each SVI. Mapillary only. (Default is 1)

1
fov int | float | str

Field of view in degrees (default 80). Pass 'auto' (with reoriented=True) to size the FOV per image so the building footprint at each location is just framed. The polygon used is each unit's row.geometry from self.units — i.e. the building footprint loaded by getBuildings(). Falls back to a distance-based heuristic if a unit's geometry is a point. Mapillary only for 'auto'.

80
heading int

Camera heading in degrees. If None, it will be computed based on the house orientation.

None
pitch int

Camera pitch angle. (Default is 5).

5
height int

Height in pixels of the returned image. (Default is 500).

500
width int

Width in pixels of the returned image. (Default is 700).

700
year list[str]

Year of data (start year, end year). Mapillary only — ignored for Google with a warning.

None
season str

Season of data. One of ["spring","summer","fall","autumn","winter"]. Mapillary only — ignored for Google with a warning.

None
time_of_day str

Time of data. One of ["day","night"] (Default is 'day'). Mapillary only — ignored for Google with a warning.

'day'
fov_margin float

When fov='auto', fractional padding added to the auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.

0.1
fov_min float

Lower clamp for fov='auto' (degrees). Default 30°. Mapillary only.

30.0
fov_max float

Upper clamp for fov='auto' (degrees). Default 120°. Mapillary only.

120.0
building_height float

Assumed building height in meters used by fov='auto' (default 9 m, ~3 stories). Mapillary only.

9.0
silent bool

If True, do not show error traceback (Default is True).

True
checkpoint_path str

Path to a JSONL file for resume-safe checkpointing. When provided, each successfully fetched location is written to the file immediately, and base64 images are saved to a companion directory (<checkpoint_stem>_files/ next to the JSONL) so the session can be resumed after a crash.

None
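
To make the interaction of fov='auto', fov_margin, fov_min and fov_max concrete, the sketch below computes a field of view that just frames a facade of known width at a known distance, pads it, and clamps it. This is an illustrative estimate under a simple assumption (the camera faces the middle of a flat facade); auto_fov is a hypothetical function and is not urbanworm's actual computation, which works from the footprint polygon.

```python
import math


def auto_fov(footprint_width_m, distance_m,
             fov_margin=0.10, fov_min=30.0, fov_max=120.0):
    """Sketch: horizontal FOV in degrees that just frames a facade.

    Hypothetical helper for illustration; not part of urbanworm.
    """
    if footprint_width_m <= 0 or distance_m <= 0:
        raise ValueError("width and distance must be positive")
    # Angle subtended by the facade as seen from the camera.
    subtended = 2.0 * math.degrees(math.atan((footprint_width_m / 2.0) / distance_m))
    # Add fractional padding (0.10 = +10%), then clamp to the allowed range.
    padded = subtended * (1.0 + fov_margin)
    return max(fov_min, min(fov_max, padded))


fov = auto_fov(footprint_width_m=10.0, distance_m=10.0)  # about 58.4 degrees
```

A distant or narrow building hits the fov_min clamp, and a very close or wide one hits fov_max, which is why both clamps are exposed as parameters.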
Source code in urbanworm/dataset.py
def get_svi_from_locations(self,
                           id_column:str=None,
                           distance:int = 50,
                           key: str = None,
                           source: str = "mapillary",
                           pano: bool = True, reoriented: bool = True,
                           multi_num: int = 1, interval: int = 1,
                           fov: int | float | str = 80, heading: int = None, pitch: int = 5,
                           height: int = 500, width: int = 700,
                           year: list | tuple = None, season: str = None, time_of_day: str = 'day',
                           fov_margin: float = 0.10,
                           fov_min: float = 30.0,
                           fov_max: float = 120.0,
                           building_height: float = 9.0,
                           silent: bool = True,
                           checkpoint_path: str | None = None):
    """
        get_svi_from_locations

        Retrieve the closest street view image(s) near each coordinate.
        The street view image will be reoriented to look at the coordinate when
        ``reoriented=True`` (Mapillary) or always (Google).

        Args:
            id_column (str, optional): The name of the column that holds a unique identifier for each location.
            distance (int): The max distance in meters between the centroid and the street view.
            key (str): API access token for the chosen source.
                Mapillary — pass token or set env var ``MAPILLARY_API_KEY``.
                Google    — pass token or set env var ``GOOGLE_STREETVIEW_API_KEY``.
            source (str): Street view data source. One of ``"mapillary"`` (default)
                or ``"google"``.
            pano (bool): Whether to search for pano street view images only.
                Mapillary only — ignored for Google. (Default is True)
            reoriented (bool): Whether to reorient and crop street view images.
                Mapillary only — Google always faces the target. (Default is True)
            multi_num (int): The number of SVIs to retrieve per location.
                Mapillary only — Google always returns 1. (Default is 1)
            interval (int): The interval in meters between each SVI.
                Mapillary only. (Default is 1)
            fov (int | float | str): Field of view in degrees (default 80). Pass
                ``'auto'`` (with ``reoriented=True``) to size the FOV per image
                so the building footprint at each location is just framed.
                The polygon used is each unit's ``row.geometry`` from
                ``self.units`` — i.e. the building footprint loaded by
                ``getBuildings()``. Falls back to a distance-based heuristic
                if a unit's geometry is a point. Mapillary only for ``'auto'``.
            heading (int): Camera heading in degrees. If None, it will be computed based on the house orientation.
            pitch (int): Camera pitch angle. (Default is 5).
            height (int): Height in pixels of the returned image. (Default is 500).
            width (int): Width in pixels of the returned image. (Default is 700).
            year (list[str], optional): Year of data (start year, end year).
                Mapillary only — ignored for Google with a warning.
            season (str, optional): Season of data. One of ["spring","summer","fall","autumn","winter"].
                Mapillary only — ignored for Google with a warning.
            time_of_day (str, optional): Time of data. One of ["day","night"] (Default is 'day').
                Mapillary only — ignored for Google with a warning.
            fov_margin (float): When ``fov='auto'``, fractional padding added to the
                auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.
            fov_min (float): Lower clamp for ``fov='auto'`` (degrees). Default 30°.
                Mapillary only.
            fov_max (float): Upper clamp for ``fov='auto'`` (degrees). Default 120°.
                Mapillary only.
            building_height (float): Assumed building height in meters used by
                ``fov='auto'`` (default 9 m, ~3 stories). Mapillary only.
            silent (bool): If True, do not show error traceback (Default is True).
            checkpoint_path (str, optional): Path to a JSONL file for
                resume-safe checkpointing. When provided, each successfully
                fetched location is written to the file immediately, and
                base64 images are saved to a companion directory
                (``<checkpoint_stem>_files/`` next to the JSONL) so the
                session can be resumed after a crash.
        """

    self.svis = {
        'loc_id': [],
        'id': [],
        'data': [],
        'path': [],
    }
    self.svi_metadata = None

    if id_column is None:
        id_column = 'loc_id'
        if id_column not in self.units.columns:
            self.units[id_column] = list(range(len(self.units)))
    # Resolve API key once with env var fallback.
    # Use the appropriate env var depending on the source.
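    _is_google = source.lower() == "google"
    _env_var = "GOOGLE_STREETVIEW_API_KEY" if _is_google else "MAPILLARY_API_KEY"
    resolved_key = key or os.getenv(_env_var)
    if not resolved_key:
        # Name the provider and env var that actually apply to this source.
        _provider = "Google Street View" if _is_google else "Mapillary"
        raise ValueError(
            f"Missing {_provider} access token. Pass key=... or set env var {_env_var}."
        )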

    # ── resume from checkpoint ────────────────────────────────────────
    # The checkpoint records which loc_ids have been fetched and stores
    # the raw fetched data (base64 strings or URLs) so the full
    # self.svis payload can be restored without re-hitting the API.
    # File downloading is handled separately by download_to_dir().
    done_ids: set = set()
    if checkpoint_path is not None:
        done_ids, ckpt_records = load_collection_checkpoint(checkpoint_path)
        restored_svis, restored_frames = restore_svis_from_checkpoint(ckpt_records)
        self.svis = restored_svis
    else:
        restored_frames = []

    # Accumulate per-location frames and concat once for O(n) instead of O(n^2)
    frames: list[pd.DataFrame] = list(restored_frames)
    skip_count = 0
    for _index, row in tqdm(self.units.iterrows(), total=len(self.units)):
        loc_id = row[id_column]

        # Skip already-checkpointed locations
        if loc_id in done_ids:
            continue

        try:
            # Pass the unit's polygon to enable fov='auto' framing.
            # Points (no `.exterior`) become None, so getSV will fall
            # back to its distance-based heuristic.
            target_poly = getattr(row.geometry, "exterior", None)
            target_poly = row.geometry if target_poly is not None else None

            # Per-building height: if the units GeoDataFrame has a
            # height_m column (e.g. from source='globfp3d'), use that
            # row's value. Fall back to the global ``building_height``
            # parameter when the row is missing/NaN.
            row_height = building_height
            if "height_m" in self.units.columns:
                rh = row.get("height_m")
                try:
                    if rh is not None and not pd.isna(rh) and float(rh) > 0:
                        row_height = float(rh)
                except (TypeError, ValueError):
                    pass

            svis, output_df = getSV(
                [row.geometry.centroid.x, row.geometry.centroid.y],
                loc_id=loc_id,
                distance=distance,
                key=resolved_key,
                source=source,
                pano=pano,
                reoriented=reoriented,
                multi_num=multi_num,
                interval=interval,
                fov=fov, heading=heading, pitch=pitch,
                height=height, width=width,
                year=year, season=season, time_of_day=time_of_day,
                target_polygon=target_poly,
                fov_margin=fov_margin, fov_min=fov_min, fov_max=fov_max,
                building_height=row_height,
                silent=silent,
            )
            if svis is None:
                skip_count += 1
                continue

            self.svis['data'] += svis
            self.svis['loc_id'] += output_df['loc_id'].tolist()
            self.svis['id'] += output_df['id'].tolist()

            # ── checkpoint: record the fetched data ──────────────────
            # Stores raw fetched data (base64 or URLs) so the session
            # can be fully restored without re-hitting the Mapillary API.
            # File downloading remains the responsibility of download_to_dir().
            if checkpoint_path is not None:
                append_collection_checkpoint(checkpoint_path, {
                    'loc_id': loc_id,
                    'ids': output_df['id'].tolist(),
                    'paths': [],           # populated later by download_to_dir()
                    'data': svis,          # base64 strings or URLs as returned by getSV
                    'metadata': output_df.to_dict(orient='records'),
                })

            frames.append(output_df)
        except Exception as e:
            if not silent:
                logger.warning(
                    'skipping %s: %s',
                    [row.geometry.centroid.x, row.geometry.centroid.y], e,
                )
            skip_count += 1
            continue
    self.svi_metadata = pd.concat(frames, ignore_index=True) if frames else None
    if skip_count > 0:
        logger.info(
            'Collected data for %d locations; skipped %d (no data found).',
            len(self.units) - skip_count, skip_count,
        )
    return None
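
The resume logic above relies on an append-only JSONL file: one record per fetched location, re-read on start-up to learn which loc_id values can be skipped. The sketch below shows that pattern with the standard library. append_record and load_done_ids are hypothetical stand-ins; they are not the library's append_collection_checkpoint or load_collection_checkpoint, whose record format and return values may differ.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one JSON object per line and force it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def load_done_ids(path):
    """Return the set of loc_ids already recorded; skip malformed lines."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["loc_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # e.g. a record cut short by a crash
    return done


# Simulate a run interrupted after two of four locations, then resumed.
ckpt = os.path.join(tempfile.mkdtemp(), "svi.jsonl")
for loc_id in (0, 1):
    append_record(ckpt, {"loc_id": loc_id, "ids": [f"img-{loc_id}"]})
done = load_done_ids(ckpt)
remaining = [loc_id for loc_id in range(4) if loc_id not in done]
```

Writing each record as soon as its location succeeds means a crash costs at most the location that was in flight.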

get_photo_from_location(id_column=None, distance=50, key=None, query=None, geo_context=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, exclude_personal_photo=True, exclude_from_location=None, silent=True, checkpoint_path=None)

get_photo_from_location

Retrieve geotagged photos from Flickr

Parameters:

Name Type Description Default
id_column str

The name of the column that holds a unique identifier for each location.

None
distance int

Search radius in meters (converted to km; Flickr radius max is 32 km).

50
key str

Flickr API key. If None, reads env var FLICKR_API_KEY.

None
query str

Query string to search for.

None
geo_context int

Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)

None
tag str | list[str]

Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.

None
max_return int

Number of photos to return (after filters).

1
year list | tuple

[Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.

None
season str

One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).

None
time_of_day str

One of {"morning","afternoon","evening","night"} (post-filter by taken hour).

None
exclude_personal_photo bool

If True, drop photos that face detection flags as selfies. (Default is True)

True
exclude_from_location int

Drop retrieved data with a distance from the given location.

None
silent bool

If True, do not show error traceback (Default is True).

True
checkpoint_path str

Path to a JSONL file for resume-safe checkpointing. When provided, each successfully fetched location is appended to the file, and locations already recorded there are skipped on the next run.

None
Source code in urbanworm/dataset.py
def get_photo_from_location(self,
                            id_column:str=None,
                            distance: int = 50,
                            key: str = None,
                            query: str | list[str] = None,
                            geo_context: int = None,
                            tag: str | list[str] = None,
                            max_return: int = 1,
                            year: list | tuple = None,
                            season: str = None,
                            time_of_day: str = None,
                            exclude_personal_photo: bool = True,
                            exclude_from_location:int = None,
                            silent: bool = True,
                            checkpoint_path: str | None = None,
                            ):
    '''
        get_photo_from_location

        Retrieve geotagged photos from Flickr

        Args:
            id_column (str, optional): The name of the column that holds a unique identifier for each location.
            distance (int): Search radius in meters (converted to km; Flickr radius max is 32 km).
            key (str): Flickr API key. If None, reads env var FLICKR_API_KEY.
            query (str, optional): Query string to search for.
            geo_context (int, optional): Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)
            tag (str | list[str]): Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.
            max_return (int): Number of photos to return (after filters).
            year: [Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.
            season (str): One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).
            time_of_day (str): One of {"morning","afternoon","evening","night"} (post-filter by taken hour).
            exclude_personal_photo (bool): If True, drop photos that face detection flags as selfies. (Default is True)
            exclude_from_location (int, optional): Drop retrieved data with a distance from the given location.
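            silent (bool): If True, do not show error traceback (Default is True).
            checkpoint_path (str, optional): Path to a JSONL file for
                resume-safe checkpointing. When provided, each successfully
                fetched location is appended to the file, and locations
                already recorded there are skipped on the next run.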
    '''

    from importlib.resources import as_file, files

    self.photos = {
        'loc_id': [],
        'id': [],
        'data': [],
        'path': [],
    }
    self.photo_metadata = None

    if id_column is None:
        id_column = 'loc_id'
        if id_column not in self.units.columns:
            self.units[id_column] = list(range(len(self.units)))

    # ── resume from checkpoint ────────────────────────────────────────
    done_ids_ph: set = set()
    if checkpoint_path is not None:
        done_ids_ph, ckpt_records_ph = load_collection_checkpoint(checkpoint_path)
        restored_photos, restored_frames_ph = restore_photos_from_checkpoint(ckpt_records_ph)
        self.photos = restored_photos
    else:
        restored_frames_ph = []

    frames: list[pd.DataFrame] = list(restored_frames_ph)
    skip_count = 0
    for _index, row in tqdm(self.units.iterrows(), total=len(self.units)):
        loc_id = row[id_column]
        if loc_id in done_ids_ph:
            continue
        try:
            output_df = getPhoto([row.geometry.centroid.x, row.geometry.centroid.y],
                                 loc_id,
                                 distance,
                                 key,
                                 query,
                                 geo_context,
                                 tag,
                                 max_return,
                                 year,
                                 season,
                                 time_of_day,
                                 exclude_from_location,
                                 output_df=True)
            if exclude_personal_photo:
                model_res = files("urbanworm.models") / "face_detection_yunet_2023mar.onnx"
                # Materialise the bundled model once per location, not once per photo.
                with as_file(model_res) as model_path:
                    drop_list = [
                        ind for ind, r in output_df.iterrows()
                        if is_selfie_photo(model_path, r['url'])
                    ]
                if drop_list:
                    output_df = output_df.drop(drop_list, axis=0)
            # Nothing usable for this location: count it as skipped so the
            # summary log stays accurate, and do not checkpoint it.
            if len(output_df) == 0:
                skip_count += 1
                continue

            self.photos['loc_id'] += output_df['loc_id'].tolist()
            self.photos['data'] += output_df['url'].tolist()
            self.photos['id'] += output_df['id'].tolist()

            if checkpoint_path is not None:
                append_collection_checkpoint(checkpoint_path, {
                    'loc_id': loc_id,
                    'ids': output_df['id'].tolist(),
                    'paths': [],
                    'data': output_df['url'].tolist(),
                    'metadata': output_df.to_dict(orient='records'),
                })

            frames.append(output_df)
        except Exception as e:
            if not silent:
                logger.warning("photo fetch error: %s", e)
            skip_count += 1
            continue
    self.photo_metadata = pd.concat(frames, ignore_index=True) if frames else None
    if skip_count > 0:
        logger.info(
            'Collected data for %d locations; skipped %d (no data found).',
            len(self.units) - skip_count, skip_count,
        )
    return None
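
The season and time_of_day arguments are described as post-filters on the month and hour at which a photo was taken. The sketch below shows what such a filter could look like. The month and hour boundaries are assumptions chosen for the example (Northern Hemisphere meteorological seasons and four fixed day parts); the boundaries getPhoto really uses are not documented here, and matches is a hypothetical function.

```python
from datetime import datetime

# Assumed month sets; "autumn" is treated as an alias of "fall".
SEASON_MONTHS = {
    "spring": {3, 4, 5},
    "summer": {6, 7, 8},
    "fall": {9, 10, 11},
    "autumn": {9, 10, 11},
    "winter": {12, 1, 2},
}
# Assumed half-open hour ranges [start, end); "night" wraps past midnight.
DAY_PARTS = {
    "morning": (6, 12),
    "afternoon": (12, 18),
    "evening": (18, 22),
    "night": (22, 6),
}


def matches(taken, season=None, time_of_day=None):
    """Return True if a 'taken' timestamp passes the optional filters."""
    if season is not None and taken.month not in SEASON_MONTHS[season.lower()]:
        return False
    if time_of_day is not None:
        start, end = DAY_PARTS[time_of_day.lower()]
        hour = taken.hour
        if start < end:
            in_range = start <= hour < end
        else:  # range wraps around midnight
            in_range = hour >= start or hour < end
        if not in_range:
            return False
    return True


keep = matches(datetime(2023, 7, 4, 9, 30), season="summer", time_of_day="morning")
```

Because the filter runs after the search, a location can end up with fewer than max_return photos even when the API returned more.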

get_sound_from_location(id_column=None, distance=50, source='freesound', key=None, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=None, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, silent=True, checkpoint_path=None)

get_sound_from_location

Retrieve geotagged sound recordings from Freesound (default) or from a Radio Aporee catalog you provide as a CSV / DataFrame.

Parameters:

Name Type Description Default
id_column str

The name of the column that holds a unique identifier for each location.

None
distance int

Search radius in meters (converted to km for Freesound geofilt).

50
source str

One of {"freesound", "aporee"} (Default is "freesound").

'freesound'
key str

Freesound API key. Required only when source="freesound". If None, reads env var FREESOUND_API_KEY.

None
catalog str | DataFrame

Required only when source="aporee". Path to a CSV or an in-memory DataFrame containing at minimum the columns url, latitude, longitude. Optional columns recognised: id/identifier, name/title, description, tags, created (ISO timestamp), duration_s.

None
query str

Query string to search for.

None
tag str | list[str]

Tag string or list of tags (used as filters).

None
max_return int

Number of sounds to return (after post-filters).

1
year int | list

[Y] or (Y,) or (Y1, Y2) inclusive (filters by upload date "created").

None
season str

One of {"spring","summer","fall","autumn","winter"} (post-filter by created month).

None
time_of_day str

One of {"morning","afternoon","evening","night"} (post-filter by created hour).

None
duration int | list[int] | tuple[int]

Maximum duration in seconds (<= duration). For a range, pass a tuple or list (min, max).

None
exclude_from_location int

Drop retrieved data with a distance from the given location.

None
slice_duration int

Split the original sound signal into clips with the given duration.

None
slice_max_num int

Maximum number of clips sliced from the original sound signal.

None
probe_durations bool

Aporee-only. When slice_duration is set but the catalog has no duration_s column, probe each selected URL once to learn its length. Set False to skip slicing instead. Default True.

True
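silent bool

If True, do not log the error when fetching sounds for a location fails; the location is skipped either way (Default is True).

True
checkpoint_path str | None

Path to a checkpoint file. When set, locations already recorded in the checkpoint are skipped and each newly processed location is appended to it, so an interrupted run can be resumed.

None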
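The year and duration arguments accept several shapes. The sketch below shows one way such inputs could be normalized into (min, max) pairs before filtering. normalize_year and normalize_duration are hypothetical helpers written for illustration; they are not part of urbanworm, and the library's own handling may differ.

```python
def normalize_year(year):
    """Turn [Y], (Y,) or (Y1, Y2) into an inclusive (start, end) pair.

    Hypothetical helper that mirrors the documented shapes of `year`.
    """
    if year is None:
        return None
    if isinstance(year, int):
        return (year, year)
    years = [int(y) for y in year]
    if len(years) == 1:
        return (years[0], years[0])
    if len(years) == 2:
        return (min(years), max(years))
    raise ValueError("year must hold one or two values")


def normalize_duration(duration):
    """Turn an int (maximum) or a (min, max) pair into (min, max) seconds."""
    if duration is None:
        return None
    if isinstance(duration, (int, float)):
        return (0, duration)
    low, high = duration
    return (low, high)


print(normalize_year([2021]))        # (2021, 2021)
print(normalize_year((2019, 2022)))  # (2019, 2022)
print(normalize_duration(30))        # (0, 30)
print(normalize_duration((10, 60)))  # (10, 60)
```

Reducing every accepted shape to a pair keeps the later filtering step to a single range comparison.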
Source code in urbanworm/dataset.py
def get_sound_from_location(self,
                            id_column: str = None,
                            distance: int = 50,
                            source: str = 'freesound',
                            key: str = None,
                            catalog: str | pd.DataFrame = None,
                            query: str | list[str] = None,
                            tag: str | list[str] = None,
                            max_return: int = 1,
                            year: list | tuple = None,
                            season: str = None,
                            time_of_day: str = None,
                            duration: int | list[int] | tuple[int] = None,
                            exclude_from_location: int = None,
                            slice_duration: int = None,
                            slice_max_num: int = None,
                            probe_durations: bool = True,
                            silent: bool = True,
                            checkpoint_path: str | None = None,
                            ):

    '''
        get_sound_from_location

        Retrieve geotagged sound recordings from Freesound (default) or
        from a Radio Aporee catalog you provide as a CSV / DataFrame.

        Args:
            id_column (str, optional): The name of column that has unique identifier (or something similar) for each location.
            distance (int): radius in meters (converted to km for Freesound geofilt).
            source (str): one of {"freesound", "aporee"} (Default is "freesound").
            key (str): Freesound API key. Required only when source="freesound".
                If None, reads env var FREESOUND_API_KEY.
            catalog (str | pandas.DataFrame): Required only when source="aporee".
                Path to a CSV or an in-memory DataFrame containing at minimum the columns
                ``url``, ``latitude``, ``longitude``. Optional columns recognised:
                ``id``/``identifier``, ``name``/``title``, ``description``, ``tags``,
                ``created`` (ISO timestamp), ``duration_s``.
            query (str, optional): Query string to search for.
            tag (str | list[str]): tag string or list of tags (used as filters).
            max_return (int): number of sounds to return (after post-filters).
            year (list | tuple): A single year as [Y] or (Y,), or an inclusive range (Y1, Y2). Filters by the upload date "created".
            season (str): one of {"spring","summer","fall","autumn","winter"} (post-filter by created month).
            time_of_day (str): one of {"morning","afternoon","evening","night"} (post-filter by created hour).
            duration (int | list[int] | tuple[int]): maximum duration in seconds (<= duration). If you want a range, pass a tuple/list (min,max).
            exclude_from_location (int, optional): Drop retrieved data with a distance from the given location.
            slice_duration (int, optional): Split the original sound signal into clips with the given duration.
            slice_max_num (int, optional): Maximum number of clips sliced from the original sound signal.
            probe_durations (bool): Aporee-only. When ``slice_duration`` is set
                but the catalog has no ``duration_s`` column, probe each
                selected URL once to learn its length. Set False to skip
                slicing instead. Default True.
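            silent (bool): If True, do not log the error when fetching sounds for a
                location fails; the location is skipped either way (Default is True).
            checkpoint_path (str | None): Path to a checkpoint file. When set, locations
                already recorded in the checkpoint are skipped and each newly processed
                location is appended to it, so an interrupted run can be resumed.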
    '''

    self.audios = {
        'loc_id': [],
        'id': [],
        'data': [],
        'path': [],
    }
    self.audio_metadata = None

    if slice_duration is not None:
        self.audios['slice'] = []

    if id_column is None:
        id_column = 'loc_id'
        if id_column not in self.units.columns:
            self.units[id_column] = list(range(len(self.units)))

    # ── resume from checkpoint ────────────────────────────────────────
    done_ids_au: set = set()
    if checkpoint_path is not None:
        done_ids_au, ckpt_records_au = load_collection_checkpoint(checkpoint_path)
        restored_audios, restored_frames_au = restore_audios_from_checkpoint(ckpt_records_au)
        self.audios = restored_audios
    else:
        restored_frames_au = []

    frames: list[pd.DataFrame] = list(restored_frames_au)
    skip_count = 0
    for _index, row in tqdm(self.units.iterrows(), total=len(self.units)):
        loc_id = row[id_column]
        if loc_id in done_ids_au:
            continue
        try:
            output_df = getSound([row.geometry.centroid.x, row.geometry.centroid.y],
                                 loc_id=loc_id,
                                 distance=distance,
                                 source=source,
                                 key=key,
                                 catalog=catalog,
                                 query=query,
                                 tag=tag,
                                 max_return=max_return,
                                 year=year,
                                 season=season,
                                 time_of_day=time_of_day,
                                 duration=duration,
                                 exclude_from_location=exclude_from_location,
                                 slice_duration=slice_duration,
                                 slice_max_num=slice_max_num,
                                 probe_durations=probe_durations,
                                 output_df=True)

            # `slice` may be missing if the source couldn't compute it
            # (e.g. Aporee catalog with no duration_s and probe_durations
            # disabled). Fall back to the un-sliced path in that case.
            if slice_duration is not None and 'slice' in output_df.columns:
                slice_list = output_df['slice'].tolist()
                loc_id_list = output_df['loc_id'].tolist()
                data_list = output_df['preview-hq-mp3'].tolist()
                id_list = output_df['id'].tolist()

                # `slice_list[i]` is always a list of [start_ms, end_ms] pairs
                # (one per generated clip). Flatten and replicate metadata
                # to match the per-clip cardinality.
                flattened_slice_list = [
                    item for sublist in slice_list for item in sublist
                ]
                repeated_loc, repeated_data, repeated_id = [], [], []
                for sublist, lid, d, sid in zip(
                        slice_list, loc_id_list, data_list, id_list, strict=False):
                    n = len(sublist)
                    repeated_loc.extend([lid] * n)
                    repeated_data.extend([d] * n)
                    repeated_id.extend([sid] * n)
                self.audios['loc_id'] += repeated_loc
                self.audios['data'] += repeated_data
                self.audios['id'] += repeated_id
                self.audios['slice'] += flattened_slice_list

                if checkpoint_path is not None:
                    append_collection_checkpoint(checkpoint_path, {
                        'loc_id': loc_id,
                        'ids': repeated_id,
                        'paths': [],
                        'data': repeated_data,
                        'slices': flattened_slice_list,
                        'metadata': output_df.to_dict(orient='records'),
                    })
            else:
                self.audios['loc_id'] += output_df['loc_id'].tolist()
                self.audios['data'] += output_df['preview-hq-mp3'].tolist()
                self.audios['id'] += output_df['id'].tolist()

                if checkpoint_path is not None:
                    append_collection_checkpoint(checkpoint_path, {
                        'loc_id': loc_id,
                        'ids': output_df['id'].tolist(),
                        'paths': [],
                        'data': output_df['preview-hq-mp3'].tolist(),
                        'slices': None,
                        'metadata': output_df.to_dict(orient='records'),
                    })

            frames.append(output_df)
        except Exception as e:
            if not silent:
                logger.warning("sound fetch error: %s", e)
            skip_count += 1
            continue
    self.audio_metadata = pd.concat(frames, ignore_index=True) if frames else None
    if skip_count > 0:
        logger.info(
            'Collected data for %d locations; skipped %d (no data found).',
            len(self.units) - skip_count, skip_count,
        )
    return None
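When slice_duration is set, the method stores one entry per clip rather than one per recording. The sketch below reproduces that flatten-and-replicate step on plain dictionaries. flatten_clips and the example URLs are illustrative assumptions, not library code.

```python
def flatten_clips(records):
    """Expand per-recording rows into per-clip rows.

    `records` is a list of dicts with keys 'loc_id', 'id', 'data' and
    'slice', where 'slice' is a list of [start_ms, end_ms] pairs.
    """
    audios = {'loc_id': [], 'id': [], 'data': [], 'slice': []}
    for rec in records:
        for start_end in rec['slice']:
            # Repeat the recording's metadata once per generated clip.
            audios['loc_id'].append(rec['loc_id'])
            audios['id'].append(rec['id'])
            audios['data'].append(rec['data'])
            audios['slice'].append(start_end)
    return audios


records = [
    {'loc_id': 0, 'id': 'a', 'data': 'https://example.org/a.mp3',
     'slice': [[0, 5000], [5000, 10000]]},
    {'loc_id': 0, 'id': 'b', 'data': 'https://example.org/b.mp3',
     'slice': [[0, 5000]]},
]
audios = flatten_clips(records)
print(audios['id'])     # ['a', 'a', 'b']
print(audios['slice'])  # [[0, 5000], [5000, 10000], [0, 5000]]
```

All four lists end up the same length, which is what lets download_to_dir index them in parallel.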

download_to_dir(data=None, to_dir=None, prefix=None)

download_to_dir

Download retrieved data (fetched by get_svi_from_locations, get_photo_from_location, or get_sound_from_location) to a local directory and populate the corresponding path list on the dataset object.

This method is resume-safe by default: a file that already exists at its target path is never downloaded again. After a crash you can re-run the call; only the missing files are fetched, and the path list is rebuilt to match the data list. A download that fails is recorded as " " in the path list.

Parameters:

Name Type Description Default
data str

Type of data to download. One of 'svi', 'audio', or 'photo'.

None
to_dir str

Directory in which to save the downloaded data. Created if it does not exist.

None
prefix str

The prefix to add to the output filename.

None
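The resume-safe behaviour described above can be sketched with the standard library alone. In this illustration download_all, fake_fetch and the file naming are assumptions made for the example; only the skip-if-exists check and the " " sentinel follow the documented behaviour.

```python
import os
import tempfile


def download_all(urls, to_dir, fetch):
    """Save each URL under `to_dir`, skipping files that already exist.

    `fetch(url, path)` is a caller-supplied function that writes the file.
    Returns one path per URL; failed downloads get the " " sentinel so the
    returned list stays aligned with `urls`.
    """
    os.makedirs(to_dir, exist_ok=True)
    paths = []
    for i, url in enumerate(urls):
        path = os.path.join(to_dir, f"{i}.bin")
        if not os.path.exists(path):
            try:
                fetch(url, path)
            except Exception:
                paths.append(" ")
                continue
        paths.append(path)
    return paths


calls = []


def fake_fetch(url, path):
    # Stand-in for a network download: records the call and writes a file.
    calls.append(url)
    if "bad" in url:
        raise IOError("simulated failure")
    with open(path, "wb") as fh:
        fh.write(b"data")


with tempfile.TemporaryDirectory() as tmp:
    urls = ["u0", "bad1", "u2"]
    first = download_all(urls, tmp, fake_fetch)
    second = download_all(urls, tmp, fake_fetch)  # re-run after a "crash"

print(calls)  # ['u0', 'bad1', 'u2', 'bad1']
```

On the second run only the item that failed is attempted again, because the other two files are already on disk.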
Source code in urbanworm/dataset.py
def download_to_dir(self, data:str = None, to_dir:str = None, prefix: str = None)-> None:
    '''
        download_to_dir

        Download retrieved data (fetched by get_svi_from_locations,
        get_photo_from_location, or get_sound_from_location) to a local
        directory and populate the corresponding ``path`` list on the
        dataset object.

        This method is **resume-safe by default**: if a file already
        exists at its target path it is never re-downloaded.  You can
        safely re-run this call after a crash and it will only fetch
        the files that are still missing, then rebuild the complete
        path list from what is on disk.

        Args:
            data (str): Type of data to download: ['svi', 'audio', 'photo'].
            to_dir (str): the directory to save the downloaded data.
            prefix (str, optional):  The prefix to add to the output filename.
    '''
    if data not in ['svi', 'audio', 'photo']:
        raise ValueError('Invalid data type provided. It has to be one of ["svi", "audio", "photo"].')
    if to_dir is None:
        raise ValueError("to_dir must be provided.")
    if not os.path.exists(to_dir):
        logger.info("Directory %s does not exist; creating.", to_dir)
        Path(to_dir).mkdir(parents=True, exist_ok=True)
    if data == 'svi':
        if len(self.svis['id']) == 0:
            return None
        self.svis['path'] = []
        for i in tqdm(range(len(self.svis['data'])), total=len(self.svis['data'])):
            loc_id = self.svis['loc_id'][i]
            img_id = self.svis['id'][i]
            path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
            p = path + f'_{img_id}.png'
            if not os.path.exists(p):
                try:
                    if is_base64(self.svis['data'][i]):
                        save_base64(self.svis['data'][i], p)
                    else:
                        download_image_requests(self.svis['data'][i], p)
                except Exception:
                    self.svis['path'] += [" "]
                    continue
            self.svis['path'] += [p]
    elif data == 'audio':
        if len(self.audios['id']) == 0:
            return None
        self.audios['path'] = []
        if 'slice' in self.audios:
            for i in tqdm(range(len(self.audios['data'])), total=len(self.audios['data'])):
                loc_id = self.audios['loc_id'][i]
                audio_id = self.audios['id'][i]
                slices = self.audios['slice'][i]
                path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
                start = slices[0]
                end = slices[1]
                p = path + f'_{audio_id}_clip_{start}_{end}.mp3'
                if not os.path.exists(p):
                    try:
                        clip(self.audios['data'][i], start, end, p)
                    except Exception:
                        self.audios['path'] += [" "]
                        continue
                self.audios['path'] += [p]
        else:
            for i in tqdm(range(len(self.audios['data'])), total=len(self.audios['data'])):
                loc_id = self.audios['loc_id'][i]
                audio_id = self.audios['id'][i]
                path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
                p = path + f'_{audio_id}.mp3'
                if not os.path.exists(p):
                    try:
                        download_freesound_preview(self.audios['data'][i], p)
                    except Exception:
                        self.audios['path'] += [" "]
                        continue
                self.audios['path'] += [p]
    elif data == 'photo':
        if len(self.photos['id']) == 0:
            return None
        self.photos['path'] = []
        for i in tqdm(range(len(self.photos['data'])), total=len(self.photos['data'])):
            loc_id = self.photos['loc_id'][i]
            photo_id = self.photos['id'][i]
            path = f'{to_dir}/{prefix}_{loc_id}' if prefix is not None else f'{to_dir}/{loc_id}'
            p = path + f'_{photo_id}.png'
            if not os.path.exists(p):
                try:
                    download_image_requests(self.photos['data'][i], p)
                except Exception:
                    # download failed: align list lengths with sentinel
                    self.photos['path'] += [" "]
                    continue
            self.photos['path'] += [p]
    return None

export(output_dir, data='svi', labels=None)

Export collected data as an organized dataset folder.

Creates:

output_dir/
    metadata.csv          # loc_id, file_id, file_type, file_path, source_data
                          # + optional label columns from `labels`
    images/               # when data in {'svi', 'photo'}
        {loc_id}_{file_id}.png
    audio/                # when data == 'audio'
        {loc_id}_{file_id}.mp3

If a file already exists on disk at the target path it is not downloaded again, so the method is safe to call repeatedly.

Parameters:

Name Type Description Default
output_dir str

Root directory for the exported dataset.

required
data str

Which modality to export. One of 'svi', 'photo', or 'audio'.

'svi'
labels DataFrame

Optional DataFrame produced by batch_inference(). Must contain a loc_id column; it is left-joined onto the metadata table so each file row gets the matching label columns.

None

Returns:

Type Description
str

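Path to the created metadata.csv, built from output_dir (absolute only when output_dir is absolute). When there is no data to export, the path is returned but no file is written.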
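export attaches labels with a pandas left merge on loc_id. The pure-Python sketch below illustrates what that join does to the file rows. left_join_on_loc_id and the land_use column are hypothetical, the sketch assumes at most one label row per loc_id, and it fills gaps with None where pandas would use NaN.

```python
def left_join_on_loc_id(file_rows, label_rows):
    """Attach label columns to every file row that shares a loc_id.

    Rows without a matching label are kept, with the label columns set
    to None. Assumes at most one label row per loc_id.
    """
    labels_by_loc = {row['loc_id']: row for row in label_rows}
    label_cols = sorted(
        {key for row in label_rows for key in row if key != 'loc_id'}
    )
    joined = []
    for row in file_rows:
        match = labels_by_loc.get(row['loc_id'], {})
        merged = dict(row)
        for col in label_cols:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


file_rows = [
    {'loc_id': 0, 'file_id': 'a'},
    {'loc_id': 0, 'file_id': 'b'},
    {'loc_id': 1, 'file_id': 'c'},
]
label_rows = [{'loc_id': 0, 'land_use': 'residential'}]

for row in left_join_on_loc_id(file_rows, label_rows):
    print(row)
```

Every file row survives the join: both files at location 0 receive the same label, and the file at location 1 keeps an empty label.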

Source code in urbanworm/dataset.py
def export(
    self,
    output_dir: str,
    data: str = 'svi',
    labels: pd.DataFrame = None,
) -> str:
    """Export collected data as an organized dataset folder.

    Creates::

        output_dir/
            metadata.csv          # loc_id, file_id, file_type, file_path, source_data
                                  # + optional label columns from `labels`
            images/               # when data in {'svi', 'photo'}
                {loc_id}_{file_id}.png
            audio/                # when data == 'audio'
                {loc_id}_{file_id}.mp3

    If a file already exists on disk at the target path it is not
    downloaded again, so the method is safe to call repeatedly.

    Args:
        output_dir: Root directory for the exported dataset.
        data: Which modality to export. One of ``'svi'``, ``'photo'``,
            or ``'audio'``.
        labels: Optional DataFrame produced by ``batch_inference()``.
            Must contain a ``loc_id`` column; it is left-joined onto the
            metadata table so each file row gets the matching label
            columns.

    Returns:
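        Path to the created ``metadata.csv``, built from ``output_dir``
        (absolute only when ``output_dir`` is absolute). When there is no
        data to export, the path is returned but no file is written.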
    """
    if data not in ('svi', 'photo', 'audio'):
        raise ValueError(
            "data must be one of 'svi', 'photo', 'audio'; "
            f"got {data!r}"
        )

    out_root = Path(output_dir)

    if data in ('svi', 'photo'):
        files_dir = out_root / 'images'
    else:
        files_dir = out_root / 'audio'
    files_dir.mkdir(parents=True, exist_ok=True)

    payload = (
        self.svis if data == 'svi'
        else self.photos if data == 'photo'
        else self.audios
    )

    if not payload['id']:
        logger.warning("export: no %s data to export.", data)
        return str(out_root / 'metadata.csv')

    ext = '.png' if data != 'audio' else '.mp3'
    rows: list[dict] = []

    for i in range(len(payload['id'])):
        loc_id = payload['loc_id'][i]
        file_id = payload['id'][i]
        source = payload['data'][i]
        existing_path = payload['path'][i] if i < len(payload['path']) else ''

        fname = f"{loc_id}_{file_id}{ext}"
        local_path = str(files_dir / fname)

        # Download / copy only if the file is not already in place
        if not Path(local_path).exists():
            try:
                if existing_path and Path(existing_path).exists():
                    import shutil as _shutil
                    _shutil.copy2(existing_path, local_path)
                elif data != 'audio':
                    if is_url(source):
                        download_image_requests(source, local_path)
                    elif is_image_path(source):
                        import shutil as _shutil
                        _shutil.copy2(source, local_path)
                    else:
                        # assume base64
                        save_base64(source, local_path)
                else:
                    # audio
                    slices = (
                        payload['slice'][i]
                        if 'slice' in payload and i < len(payload['slice'])
                        else None
                    )
                    if slices is not None:
                        clip(source, slices[0], slices[1], local_path)
                    else:
                        download_freesound_preview(source, local_path)
            except Exception as _dl_err:
                logger.warning(
                    "export: could not save %s (loc_id=%s, file_id=%s): %s",
                    local_path, loc_id, file_id, _dl_err,
                )

        rows.append({
            'loc_id': loc_id,
            'file_id': file_id,
            'file_type': data,
            'file_path': local_path,
            'source_data': source if is_url(source) else '<local>',
        })

    meta_df = pd.DataFrame(rows)

    if labels is not None:
        if 'loc_id' in labels.columns:
            meta_df = meta_df.merge(labels, on='loc_id', how='left')
        else:
            logger.warning(
                "export: labels DataFrame has no 'loc_id' column; skipping merge."
            )

    out_csv = out_root / 'metadata.csv'
    meta_df.to_csv(out_csv, index=False)
    logger.info("export: wrote %d rows to %s", len(meta_df), out_csv)
    return str(out_csv)

set_images(img_type)

set_images

Select the retrieved street view images or Flickr photos as the images of the dataset (self.images).

Parameters:

Name Type Description Default
img_type str

'svi' for street view images or 'photo' for Flickr photos.

required
Source code in urbanworm/dataset.py
def set_images(self, img_type: str):
    '''
        set_images

        Select the retrieved street view images or Flickr photos as the images of the dataset (``self.images``).

        Args:
            img_type (str): 'photo' or 'svi'
    '''
    if img_type == 'svi':
        self.images = self.svis
    elif img_type == 'photo':
        self.images = self.photos
    return None

plot_data(data=None, export_gdf=False)

Plot the retrieved data as points on an interactive map.

Parameters:

Name Type Description Default
data str

Type of data to plot. One of 'svi', 'audio', or 'photo'.

None
export_gdf bool

If True, return the GeoDataFrame instead of the interactive map.

False
Source code in urbanworm/dataset.py
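def plot_data(self, data: str = None, export_gdf: bool = False):
    '''
    Plot the retrieved data as points on an interactive map.

    Args:
        data (str): Type of data to plot. One of 'svi', 'audio', or 'photo'.
        export_gdf (bool): If True, return the GeoDataFrame instead of the map.

    Returns:
        The interactive map, the GeoDataFrame when export_gdf is True,
        or None when data is None.
    '''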
    if data is None:
        return None

    if data == 'svi':
        temp = self.svi_metadata
        geometry = gpd.points_from_xy(temp['image_lon'], temp['image_lat'])
        temp['detail'] = temp.apply(
            lambda row: f'<a href="{row["url"]}">View image details</a>',
            axis=1
        )
        gdf = gpd.GeoDataFrame(temp, geometry=geometry, crs="EPSG:4326")
        popup = ["id", "captured_at", "detail"]
    elif data == 'photo':
        temp = self.photo_metadata
        geometry = gpd.points_from_xy(temp['longitude'], temp['latitude'])
        temp['detail'] = temp.apply(
            lambda row: f'<a href="{row["url"]}">View photo details</a>',
            axis=1
        )
        gdf = gpd.GeoDataFrame(temp, geometry=geometry, crs="EPSG:4326")
        popup = ["id", "datetaken", "detail"]
    elif data == 'audio':
        temp = self.audio_metadata
        geometry = gpd.points_from_xy(temp['longitude'], temp['latitude'])
        temp['detail'] = temp.apply(
            lambda row: f'<a href="{row["url"]}">Listen to the sound</a>',
            axis=1
        )
        gdf = gpd.GeoDataFrame(temp, geometry=geometry, crs="EPSG:4326")
        popup = ["id", "created_dt", "detail"]
    else:
        raise ValueError('Invalid data type provided. It has to be one of ["svi", "audio", "photo"].')

    self.plot = gdf.explore(
        popup=popup,
        color="red",
        marker_kwds=dict(radius=5, fill=True),
        tiles="CartoDB positron",
        name="map",
    )
    return gdf if export_gdf else self.plot

Standalone helpers

These functions are also available at the top level (from urbanworm import getSV, …) but are more commonly called through GeoTaggedData.

urbanworm.dataset.getSV(location, loc_id=None, distance=50, key=None, source='mapillary', pano=False, reoriented=False, multi_num=1, interval=1, fov=80, heading=None, pitch=5, height=500, width=700, year=None, season=None, time_of_day=None, target_polygon=None, fov_margin=0.1, fov_min=30.0, fov_max=120.0, building_height=9.0, output_df=True, silent=False)

getSV

Retrieve the closest street view image(s) near a coordinate. Supports multiple sources; the image is reoriented to face the target coordinate when reoriented=True (Mapillary) or always (Google).

Parameters:

Name Type Description Default
location list | tuple

coordinates (longitude/x and latitude/y)

required
loc_id int | str

The id of the location.

None
distance int

The max distance in meters between the centroid and the street view.

50
key str

API access token for the chosen source. Mapillary — pass token or set env var MAPILLARY_API_KEY. Google — pass token or set env var GOOGLE_STREETVIEW_API_KEY.

None
source str

Street view data source. One of "mapillary" (default) or "google".

'mapillary'
pano bool

Whether to search for panoramic images only. Mapillary only — ignored for Google. (Default is False)

False
reoriented bool

Whether to reorient and crop street view images to face the target. Mapillary only — Google always faces the target. (Default is False)

False
multi_num int

Number of SVIs to retrieve. Mapillary only; Google always returns 1. (Default is 1)

1
interval int

The interval in meters between each SVI. Mapillary only. (Default is 1)

1
fov int | float | str

Field of view in degrees for the perspective image (default 80). Pass 'auto' together with reoriented=True to size the FOV per image so the target building is just framed — see target_polygon / fov_margin / fov_min / fov_max. When target_polygon is None, 'auto' falls back to a distance-based heuristic (assumes ~15 m wide building). Mapillary only — for Google, fov is passed directly to the API and clamped to [10, 120]; 'auto' is not supported.

80
heading int

Camera heading in degrees. If None, computed from the bearing to the target location.

None
pitch int

Camera pitch angle. (Default is 5)

5
height int

Height in pixels of the returned image. (Default is 500)

500
width int

Width in pixels of the returned image. (Default is 700)

700
year list | tuple

Year of data (start year, end year). Mapillary only — ignored for Google with a warning.

None
season str

Season of data. Mapillary only — ignored for Google with a warning.

None
time_of_day str

Time of day of the data. Mapillary only; ignored for Google with a warning.

None
target_polygon Polygon

Building footprint used by fov='auto' to compute the angular extent of the target. Coordinates are assumed to be (lon, lat) in WGS84. Mapillary only.

None
fov_margin float

Fractional padding added to the auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.

0.1
fov_min float

Lower clamp for fov='auto' (degrees). Default 30°. Mapillary only.

30.0
fov_max float

Upper clamp for fov='auto' (degrees). Default 120°. Mapillary only.

120.0
building_height float

Assumed building height in meters used by fov='auto' (default 9 m, ~3 stories). Mapillary only.

9.0
output_df bool

Whether to also return a DataFrame of metadata. (Default is True)

True
silent bool

Whether to silence warnings. (Default is False)

False
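The fallback for fov='auto' is described as a distance-based heuristic with a margin and clamps. One plausible reading is sketched below: take the angular width of a 15 m facade seen head-on, pad it by fov_margin, then clamp to [fov_min, fov_max]. The formula and the auto_fov helper are assumptions made for illustration; the library's actual heuristic is not shown in this section and may differ.

```python
import math


def auto_fov(distance_m, building_width_m=15.0, fov_margin=0.10,
             fov_min=30.0, fov_max=120.0):
    """Angular width of a facade seen head-on, padded and clamped (degrees).

    Assumed formula: 2 * atan((width / 2) / distance).
    """
    if distance_m <= 0:
        return fov_max
    half_angle = math.atan((building_width_m / 2.0) / distance_m)
    angle = 2.0 * math.degrees(half_angle)
    padded = angle * (1.0 + fov_margin)
    return max(fov_min, min(fov_max, padded))


for distance in (5, 20, 50):
    print(distance, round(auto_fov(distance), 1))
```

Under these assumptions a camera 5 m away hits the upper clamp and one 50 m away hits the lower clamp, which shows why both bounds are exposed as parameters.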

Returns:

Type Description
list[str]

A list of images in base64 format.

DataFrame

A dataframe containing metadata about the street view images. captured_at format is "YYYY-M-D-H" for Mapillary and "YYYY-MM-1-1" for Google (day and hour are nominal placeholders).
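Because captured_at is a hyphen-separated string rather than an ISO timestamp, it needs a small parser before it can be used for date arithmetic. The sketch below is one way to do that; parse_captured_at is a hypothetical helper and the sample values are made up.

```python
from datetime import datetime


def parse_captured_at(value):
    """Parse a "YYYY-M-D-H" string into a datetime (minutes and seconds zero)."""
    year, month, day, hour = (int(part) for part in value.split("-"))
    return datetime(year, month, day, hour)


print(parse_captured_at("2021-7-4-15"))  # 2021-07-04 15:00:00
print(parse_captured_at("2023-06-1-1"))  # 2023-06-01 01:00:00
```

For Google results only the year and month of the parsed value are meaningful, since the day and hour are placeholders.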

Source code in urbanworm/dataset.py
def getSV(location: list|tuple,
          loc_id: int | str = None,
          distance:int = 50,
          key: str = None,
          source: str = "mapillary",
          pano: bool = False,
          reoriented: bool = False,
          multi_num: int = 1,
          interval: int = 1,
          fov: int | float | str = 80, heading: int = None, pitch: int = 5,
          height: int = 500, width: int = 700,
          year: list | tuple = None,
          season: str = None,
          time_of_day: str = None,
          target_polygon=None,
          fov_margin: float = 0.10,
          fov_min: float = 30.0,
          fov_max: float = 120.0,
          building_height: float = 9.0,
          output_df: bool = True,
          silent: bool = False) -> pd.DataFrame | list | None:
    """
        getSV

        Retrieve the closest street view image(s) near a coordinate.
        Supports multiple sources; the image is reoriented to face the target
        coordinate when ``reoriented=True`` (Mapillary) or always (Google).

        Args:
            location: coordinates (longitude/x and latitude/y)
            loc_id (int|str, optional): The id of the location.
            distance (int): The max distance in meters between the centroid and the street view.
            key (str): API access token for the chosen source.
                Mapillary — pass token or set env var ``MAPILLARY_API_KEY``.
                Google    — pass token or set env var ``GOOGLE_STREETVIEW_API_KEY``.
            source (str): Street view data source. One of ``"mapillary"`` (default)
                or ``"google"``.
            pano (bool): Whether to search for panoramic images only.
                Mapillary only — ignored for Google. (Default is False)
            reoriented (bool): Whether to reorient and crop street view images to face
                the target. Mapillary only — Google always faces the target.
                (Default is False)
            multi_num (int): The number of multiple SVIs. Mapillary only — Google
                always returns 1. (Default is 1)
            interval (int): The interval in meters between each SVI.
                Mapillary only. (Default is 1)
            fov (int | float | str): Field of view in degrees for the perspective image
                (default 80). Pass ``'auto'`` together with ``reoriented=True`` to
                size the FOV per image so the target building is just framed —
                see ``target_polygon`` / ``fov_margin`` / ``fov_min`` / ``fov_max``.
                When ``target_polygon`` is None, ``'auto'`` falls back to a
                distance-based heuristic (assumes ~15 m wide building).
                Mapillary only — for Google, ``fov`` is passed directly to the API
                and clamped to [10, 120]; ``'auto'`` is not supported.
            heading (int): Camera heading in degrees. If None, computed from the
                bearing to the target location.
            pitch (int): Camera pitch angle. (Default is 5)
            height (int): Height in pixels of the returned image. (Default is 500)
            width (int): Width in pixels of the returned image. (Default is 700)
            year (list[str], optional): Year of data (start year, end year).
                Mapillary only — ignored for Google with a warning.
            season (str, optional): Season of data.
                Mapillary only — ignored for Google with a warning.
            time_of_day (str, optional): Time of data.
                Mapillary only — ignored for Google with a warning.
            target_polygon (shapely.geometry.Polygon, optional): Building footprint
                used by ``fov='auto'`` to compute the angular extent of the target.
                Coordinates are assumed to be ``(lon, lat)`` in WGS84.
                Mapillary only.
            fov_margin (float): Fractional padding added to the auto-computed
                FOV (0.10 = +10%). Default 0.10. Mapillary only.
            fov_min (float): Lower clamp for ``fov='auto'`` (degrees). Default 30°.
                Mapillary only.
            fov_max (float): Upper clamp for ``fov='auto'`` (degrees). Default 120°.
                Mapillary only.
            building_height (float): Assumed building height in meters used by
                ``fov='auto'`` (default 9 m, ~3 stories). Mapillary only.
            output_df (bool, optional): Whether to also return a DataFrame of metadata.
                (Default is True)
            silent (bool, optional): Whether to silence warnings. (Default is False)

        Returns:
            list[str]: A list of images in base64 format.
            DataFrame: A dataframe containing metadata about the street view images.
                ``captured_at`` format is ``"YYYY-M-D-H"`` for Mapillary and
                ``"YYYY-MM-1-1"`` for Google (day and hour are nominal placeholders).
    """
    source = source.lower().strip()

    if source == "google":
        # Warn about params that Google does not support.
        # warnings.warn() deduplicates by call-site, so each message appears
        # only once even when getSV() is called in a loop (e.g. from
        # get_svi_from_locations), unlike logger.warning() which fires every time.
        if multi_num > 1:
            warnings.warn(
                "getSV: multi_num > 1 is not supported for source='google'; using 1.",
                stacklevel=2,
            )
        if any([year, season, time_of_day]):
            warnings.warn(
                "getSV: year/season/time_of_day filtering is not supported for "
                "source='google' (API does not expose historical imagery). "
                "These parameters will be ignored.",
                stacklevel=2,
            )
        if isinstance(fov, str) and fov.strip().lower() == "auto":
            warnings.warn(
                "getSV: fov='auto' is not supported for source='google'. "
                "Falling back to fov=80.",
                stacklevel=2,
            )
            fov = 80
        return _getSV_google(
            location=location, loc_id=loc_id, distance=distance, key=key,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            output_df=output_df, silent=silent,
        )

    if source == "mapillary":
        return _getSV_mapillary(
            location=location, loc_id=loc_id, distance=distance, key=key,
            pano=pano, reoriented=reoriented, multi_num=multi_num, interval=interval,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            year=year, season=season, time_of_day=time_of_day,
            target_polygon=target_polygon, fov_margin=fov_margin,
            fov_min=fov_min, fov_max=fov_max, building_height=building_height,
            output_df=output_df, silent=silent,
        )

    raise ValueError(
        f"getSV: unknown source '{source}'. Choose 'mapillary' or 'google'."
    )
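The comment in the Google branch relies on how Python's `warnings` module de-duplicates messages per call site. The following is a minimal sketch of that behavior, not the library's code: `fetch` and `count_warnings` are hypothetical names, and the sketch assumes the "default" filter action is in effect.

```python
import warnings


def fetch(source: str, multi_num: int = 1) -> int:
    """Hypothetical stand-in for a fetcher that warns about unsupported options."""
    if source == "google" and multi_num > 1:
        # stacklevel=2 attributes the warning to the caller's line, so the
        # "default" filter action de-duplicates on that call site.
        warnings.warn(
            "multi_num > 1 is not supported for source='google'; using 1.",
            stacklevel=2,
        )
        multi_num = 1
    return multi_num


def count_warnings(n_calls: int) -> int:
    """Call fetch() n_calls times from one line and count the warnings recorded."""
    with warnings.catch_warnings(record=True) as caught:
        # "default" shows a warning once per (message, category, line number).
        warnings.simplefilter("default")
        for _ in range(n_calls):
            fetch("google", multi_num=3)
    return len(caught)
```

Calling `count_warnings(5)` should record a single warning, because all five calls originate from the same line. Switching the filter to `"always"` would record one warning per call, which is closer to what a logger would do.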

urbanworm.dataset.getPhoto(location, loc_id=None, distance=50, key=None, query=None, geo_context=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, exclude_from_location=None, output_df=True)

getPhoto

Fetch public Flickr photos with geotags near a location (or within a Flickr place).

Parameters:

Name Type Description Default
location list | tuple

Coordinates of the search center as (longitude, latitude).

required
loc_id int | str

The id of the location.

None
distance int

Search radius in meters (converted to km; Flickr radius max is 32 km).

50
key str

Flickr API key. If None, reads env var FLICKR_API_KEY.

None
query str | list[str]

Free-text search terms passed to the Flickr API.

None
geo_context int

Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)

None
tag str | list[str]

Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.

None
max_return int

Number of photos to return (after filters).

1
year list | tuple

[Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.

None
season str

One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).

None
time_of_day str

One of {"morning","afternoon","evening","night"} (post-filter by taken hour).

None
exclude_from_location int

Drop retrieved photos within this distance (in meters) of the given location. (Default is None)

None
output_df bool

If True, return a pandas.DataFrame; otherwise return dict (if max_return==1) or list[dict].

True

Returns:

Type Description

dict | list[dict] | pandas.DataFrame
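The request sizing described above (meters converted to kilometers, capped at Flickr's 32 km limit) can be sketched in isolation. This mirrors the arithmetic in the source listing for this function; the helper names `radius_km` and `per_page` are hypothetical and exist only for illustration.

```python
FLICKR_MAX_RADIUS_KM = 32.0  # upper bound stated in the parameter description
MIN_RADIUS_KM = 0.01         # floor applied in the function's source
MAX_PER_PAGE = 250           # per-page cap for geo queries noted in the source


def radius_km(distance_m: float) -> float:
    """Convert a search radius in meters to km, clamped to [0.01, 32]."""
    km = float(distance_m) / 1000.0
    return min(max(km, MIN_RADIUS_KM), FLICKR_MAX_RADIUS_KM)


def per_page(max_return: int) -> int:
    """Over-fetch 20x the requested count, bounded to 50..250 per page."""
    return min(MAX_PER_PAGE, max(50, max_return * 20))
```

The over-fetch leaves room for the season and time-of-day post-filters, which discard photos after the API has already returned them.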

Source code in urbanworm/dataset.py
def getPhoto(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        key: str = None,
        query: str | list[str] = None,
        geo_context: int = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        exclude_from_location:int = None,
        output_df: bool = True
):
    """
        getPhoto

        Fetch public Flickr photos with geotags near a location (or within a Flickr place).

        Args:
            location (list | tuple): Coordinates of the search center as (longitude, latitude).
            loc_id (int | str): The id of the location.
            distance (int): Search radius in meters (converted to km; Flickr radius max is 32 km).
            key (str): Flickr API key. If None, reads env var FLICKR_API_KEY.
            query (str | list[str]): Free-text search terms passed to the Flickr API.
            geo_context (int): Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)
            tag (str | list[str]): Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.
            max_return (int): Number of photos to return (after filters).
            year (list | tuple): [Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.
            season (str): One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).
            time_of_day (str): One of {"morning","afternoon","evening","night"} (post-filter by taken hour).
            exclude_from_location (int, optional): Drop retrieved photos within this distance (in meters) of the given location. (Default is None)
            output_df (bool): If True, return a pandas.DataFrame; otherwise return dict (if max_return==1)
                       or list[dict].

        Returns:
            dict | list[dict] | pandas.DataFrame
    """

    import os
    from datetime import datetime, timedelta, timezone

    import requests

    if exclude_from_location is not None:
        # Build the exclusion area from exclude_from_location, not the search radius.
        drop_area = projection(location, r=exclude_from_location)

    # -------------------------
    # Validate inputs
    # -------------------------
    if max_return is None or int(max_return) < 1:
        raise ValueError("max_return must be >= 1.")
    max_return = int(max_return)

    api_key = key or os.getenv("FLICKR_API_KEY")
    if not api_key:
        raise ValueError("Missing Flickr API key. Pass key=... or set env var FLICKR_API_KEY.")

    lon, lat = location
    months = season_months(season)
    hours = tod_hours(time_of_day)
    y_range = year_range(year)

    # Radius in km (Flickr max 32 km)
    radius_km = max(float(distance) / 1000.0, 0.01)
    radius_km = min(radius_km, 32.0)

    # Geo queries need a "limiting agent"; tags or min/max dates qualify.
    # If user provided none, default to last 365 days so results aren’t silently limited to ~12 hours.
    now_utc = datetime.now(timezone.utc)
    default_min_upload_date = int((now_utc - timedelta(days=365)).timestamp())

    # -------------------------
    # Build Flickr request
    # -------------------------
    endpoint = "https://api.flickr.com/services/rest/"

    extras = ",".join(
        [
            "description",
            "license",
            "date_upload",
            "date_taken",
            "owner_name",
            "geo",
            "tags",
            "views",
            "media",
            "url_sq",
            "url_t",
            "url_s",
            "url_q",
            "url_m",
            "url_n",
            "url_z",
            "url_c",
            "url_l",
            "url_o",
        ]
    )

    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "format": "json",
        "nojsoncallback": 1,
        "extras": extras,
        "safe_search": 1, # safe only for un-authed calls
        "media": "photos",
        "has_geo": 1,
        "content_types": 0, # photos
        "sort": "relevance",
        "lat": lat,
        "lon": lon,
        "radius": radius_km,
        "radius_units": "km"
    }

    if query:
        q = query_string(query)
        if q:
            params["text"] = q

    # Compare against None so that geo_context=0 ("Not defined") is still sent.
    if geo_context is not None:
        params["geo_context"] = geo_context

    # tags
    if tag:
        if isinstance(tag, (list, tuple)):
            tags = ",".join([str(t).strip() for t in tag if str(t).strip()])
            params["tags"] = tags
            params["tag_mode"] = "all"
        else:
            params["tags"] = str(tag).strip()

    # date range (taken) if specified
    if y_range is not None:
        params["min_taken_date"], params["max_taken_date"] = y_range
    else:
        # If no explicit limiting agent, set min_upload_date (acts as limiting agent for geo queries).
        if not tag and season is None and time_of_day is None:
            params["min_upload_date"] = default_min_upload_date

    # -------------------------
    # Fetch + post-filter
    # -------------------------
    session = requests.Session()

    # Geo/bbox queries only return up to 250/page.
    per_page = min(250, max(50, max_return * 20))
    params["per_page"] = per_page

    results = []
    seen = set()

    max_pages = 150
    for page in range(1, max_pages + 1):
        params["page"] = page
        r = session.get(endpoint, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()

        if data.get("stat") != "ok":
            msg = data.get("message") or data.get("error") or str(data)
            raise RuntimeError(f"Flickr API error: {msg}")

        photos = (data.get("photos") or {}).get("photo") or []
        if not photos:
            break

        for p in photos:
            if exclude_from_location is not None:
                if is_coordinate_in_bbox(p["longitude"], p["latitude"], drop_area):
                    continue
            pid = p.get("id")
            if not pid or pid in seen:
                continue
            seen.add(pid)

            taken_dt = parse_taken(p)
            if months and taken_dt and taken_dt.month not in months:
                continue
            if hours and taken_dt and taken_dt.hour not in hours:
                continue

            s_lat = float(p["latitude"]) if "latitude" in p and p["latitude"] not in (None, "") else None
            s_lon = float(p["longitude"]) if "longitude" in p and p["longitude"] not in (None, "") else None

            url = best_url(p)
            out = {
                "loc_id": '',
                "id": pid,
                "title": p.get("title"),
                "owner": p.get("owner"),
                # "ownername": p.get("ownername"),
                "datetaken": p.get("datetaken") or p.get("date_taken"),
                "latitude": s_lat,
                "longitude": s_lon,
                # "accuracy": int(p["accuracy"]) if "accuracy" in p and str(p["accuracy"]).isdigit() else None,
                "distance_m": haversine_m(lat, lon, s_lat, s_lon) if (s_lat is not None and s_lon is not None) else None,
                "tags": p.get("tags"),
                "description": p.get("description"),
                "views": int(p["views"]) if "views" in p and str(p["views"]).isdigit() else None,
                "license": p.get("license"),
                "url": url,
                # "page_url": f"https://www.flickr.com/photos/{p.get('owner')}/{pid}",
            }

            if loc_id is not None:
                out["loc_id"] = loc_id
            else:
                del out["loc_id"]

            results.append(out)

            # if len(results) >= max_return:
            #     break

        if len(results) >= max_return:
            break

    if output_df:
        import pandas as pd
        df = pd.DataFrame(results)
        # An empty frame has no 'distance_m' column, so return it before sorting.
        if df.empty:
            return df
        df = df.sort_values(by='distance_m', ascending=True)
        return df.head(max_return)

    if max_return == 1:
        return results[0] if results else None
    return results
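The season and time-of-day post-filter in the loop above can be sketched on its own. The mappings below are assumptions for illustration only: the library's `season_months()` and `tod_hours()` helpers are not shown on this page and may draw the boundaries differently. `keep_photo` is a hypothetical name.

```python
from datetime import datetime

# Assumed mappings (northern-hemisphere seasons, arbitrary hour bands);
# the library's own helpers may use different boundaries.
SEASON_MONTHS = {
    "spring": {3, 4, 5}, "summer": {6, 7, 8},
    "fall": {9, 10, 11}, "autumn": {9, 10, 11}, "winter": {12, 1, 2},
}
TOD_HOURS = {
    "morning": set(range(6, 12)), "afternoon": set(range(12, 18)),
    "evening": set(range(18, 22)),
    "night": set(range(22, 24)) | set(range(0, 6)),
}


def keep_photo(taken, season=None, time_of_day=None) -> bool:
    """Return True if a photo passes the month and hour post-filters."""
    months = SEASON_MONTHS.get(season) if season else None
    hours = TOD_HOURS.get(time_of_day) if time_of_day else None
    # As in the source loop, a missing taken date skips the check entirely.
    if months and taken and taken.month not in months:
        return False
    if hours and taken and taken.hour not in hours:
        return False
    return True
```

One consequence of this structure is that a photo with no parseable taken date is kept regardless of the requested season or time of day. Callers who need strict filtering may want to drop rows with a missing `datetaken` afterwards.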

urbanworm.dataset.getSound(location, loc_id=None, distance=50, source='freesound', key=None, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=300, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, output_df=True)

Dispatch to the per-source helpers.

Parameters:

Name Type Description Default
source str

one of {"freesound", "aporee"}. Default "freesound".

'freesound'
catalog str | DataFrame

Required when source="aporee"; see getSoundAporee.

None
probe_durations bool

Aporee-only. See getSoundAporee.

True

All other arguments are forwarded; key is only used by Freesound, catalog and probe_durations only by Aporee.
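A usage pattern for the dispatcher, sketched so that it runs offline: `sounds_per_site` and `fake_get_sound` are hypothetical helpers, and the stand-in only echoes its arguments. With the real library you would pass `urbanworm.dataset.getSound` instead, together with `key=` for Freesound or `catalog=` for Aporee.

```python
from typing import Any, Callable


def sounds_per_site(get_sound: Callable[..., Any], sites: dict, **kwargs) -> dict:
    """Call get_sound once per site and key the results by loc_id."""
    return {
        loc_id: get_sound(location=lon_lat, loc_id=loc_id, **kwargs)
        for loc_id, lon_lat in sites.items()
    }


def fake_get_sound(location, loc_id=None, source="freesound", **_):
    """Offline stand-in that validates the source and echoes its arguments."""
    if source not in ("freesound", "aporee"):
        raise ValueError(f"Unsupported sound source {source!r}")
    return {"loc_id": loc_id, "source": source, "location": tuple(location)}


# Coordinates are (longitude, latitude), matching the location argument.
sites = {"a": (-83.235572, 42.348092), "b": (-83.235154, 42.348806)}
demo = sounds_per_site(fake_get_sound, sites, source="aporee", max_return=2)
```

Passing the fetcher as an argument keeps the loop testable without network access or API keys.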

Source code in urbanworm/dataset.py
def getSound(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        source: str = 'freesound',
        key: str = None,
        catalog: str | pd.DataFrame = None,
        query: str | list[str] | None = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        duration: int = 300,
        exclude_from_location: int = None,
        slice_duration: int = None,
        slice_max_num: int = None,
        probe_durations: bool = True,
        output_df: bool = True,
) -> pd.DataFrame | dict | list | None:
    """Dispatch to the per-source helpers.

    Args:
        source (str): one of {"freesound", "aporee"}. Default "freesound".
        catalog: required when source="aporee" — see :func:`getSoundAporee`.
        probe_durations: Aporee-only. See :func:`getSoundAporee`.

    All other arguments are forwarded; ``key`` is only used by Freesound,
    ``catalog`` and ``probe_durations`` only by Aporee.
    """
    src = (source or 'freesound').lower()
    if src == 'freesound':
        return _getSoundFreesound(
            location=location, loc_id=loc_id, distance=distance, key=key,
            query=query, tag=tag, max_return=max_return, year=year,
            season=season, time_of_day=time_of_day, duration=duration,
            exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            output_df=output_df,
        )
    elif src == 'aporee':
        return getSoundAporee(
            location=location, loc_id=loc_id, distance=distance,
            catalog=catalog, query=query, tag=tag, max_return=max_return,
            year=year, season=season, time_of_day=time_of_day,
            duration=duration, exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            probe_durations=probe_durations,
            output_df=output_df,
        )
    else:
        raise ValueError(
            f"Unsupported sound source {source!r}; choose 'freesound' or 'aporee'."
        )