Data Sources

These modules contain the per-provider helpers used internally by GeoTaggedData. You can also call them directly for lower-level control.

Mapillary (street views)

`urbanworm.sources.mapillary`

Mapillary street-view source. Thin re-export of `urbanworm.dataset.getSV`.

Functions

`getSV(location, loc_id=None, distance=50, key=None, source='mapillary', pano=False, reoriented=False, multi_num=1, interval=1, fov=80, heading=None, pitch=5, height=500, width=700, year=None, season=None, time_of_day=None, target_polygon=None, fov_margin=0.1, fov_min=30.0, fov_max=120.0, building_height=9.0, output_df=True, silent=False)`

getSV

Retrieve the closest street view image(s) near a coordinate. Supports multiple sources; the image is reoriented to face the target coordinate when reoriented=True (Mapillary) or always (Google).
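
Facing the target amounts to pointing the camera along the bearing from the camera position to the target coordinate. The following is a minimal sketch of that calculation, assuming the standard great-circle initial-bearing formula. `initial_bearing` is a hypothetical helper written for illustration; the computation inside urbanworm may differ.

```python
import math


def initial_bearing(cam_lon, cam_lat, tgt_lon, tgt_lat):
    """Compass bearing in degrees (0 = north, 90 = east) from camera to target.

    Hypothetical helper, not part of urbanworm.
    """
    phi1, phi2 = math.radians(cam_lat), math.radians(tgt_lat)
    dlon = math.radians(tgt_lon - cam_lon)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2 returns a value in (-180, 180]; shift it into [0, 360).
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


north = initial_bearing(0.0, 0.0, 0.0, 1.0)   # target due north of the camera
east = initial_bearing(0.0, 0.0, 1.0, 0.0)    # target due east of the camera
west = initial_bearing(0.0, 0.0, -1.0, 0.0)   # target due west of the camera
```

Coordinates follow the `(longitude, latitude)` order that `location` uses.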

Parameters:

Name	Type	Description	Default
`location`	`list \| tuple`	coordinates (longitude/x and latitude/y)	required
`loc_id`	`int \| str`	The id of the location.	`None`
`distance`	`int`	The max distance in meters between the centroid and the street view.	`50`
`key`	`str`	API access token for the chosen source. Mapillary — pass token or set env var `MAPILLARY_API_KEY`. Google — pass token or set env var `GOOGLE_STREETVIEW_API_KEY`.	`None`
`source`	`str`	Street view data source. One of `"mapillary"` (default) or `"google"`.	`'mapillary'`
`pano`	`bool`	Whether to search for panoramic images only. Mapillary only — ignored for Google. (Default is False)	`False`
`reoriented`	`bool`	Whether to reorient and crop street view images to face the target. Mapillary only — Google always faces the target. (Default is False)	`False`
`multi_num`	`int`	The number of street view images (SVIs) to retrieve. Mapillary only — Google always returns 1. (Default is 1)	`1`
`interval`	`int`	The interval in meters between each SVI. Mapillary only. (Default is 1)	`1`
`fov`	`int \| float \| str`	Field of view in degrees for the perspective image (default 80). Pass `'auto'` together with `reoriented=True` to size the FOV per image so the target building is just framed — see `target_polygon` / `fov_margin` / `fov_min` / `fov_max`. When `target_polygon` is None, `'auto'` falls back to a distance-based heuristic (assumes ~15 m wide building). Mapillary only — for Google, `fov` is passed directly to the API and clamped to [10, 120]; `'auto'` is not supported.	`80`
`heading`	`int`	Camera heading in degrees. If None, computed from the bearing to the target location.	`None`
`pitch`	`int`	Camera pitch angle. (Default is 5)	`5`
`height`	`int`	Height in pixels of the returned image. (Default is 500)	`500`
`width`	`int`	Width in pixels of the returned image. (Default is 700)	`700`
`year`	`list[str]`	Year of data (start year, end year). Mapillary only — ignored for Google with a warning.	`None`
`season`	`str`	Season of data. Mapillary only — ignored for Google with a warning.	`None`
`time_of_day`	`str`	Time of data. Mapillary only — ignored for Google with a warning.	`None`
`target_polygon`	`Polygon`	Building footprint used by `fov='auto'` to compute the angular extent of the target. Coordinates are assumed to be `(lon, lat)` in WGS84. Mapillary only.	`None`
`fov_margin`	`float`	Fractional padding added to the auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.	`0.1`
`fov_min`	`float`	Lower clamp for `fov='auto'` (degrees). Default 30°. Mapillary only.	`30.0`
`fov_max`	`float`	Upper clamp for `fov='auto'` (degrees). Default 120°. Mapillary only.	`120.0`
`building_height`	`float`	Assumed building height in meters used by `fov='auto'` (default 9 m, ~3 stories). Mapillary only.	`9.0`
`output_df`	`bool`	Whether to also return a DataFrame of metadata. (Default is True)	`True`
`silent`	`bool`	Whether to silence warnings. (Default is False)	`False`
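
To make the roles of `fov_margin`, `fov_min`, and `fov_max` concrete, here is a rough sketch of a distance-based FOV heuristic of the kind described for `fov='auto'` without a `target_polygon`. The formula (angle subtended by a facade seen head-on) and the function name `auto_fov` are assumptions made for illustration; only the 15 m width, the margin, and the clamp values come from the parameter descriptions, and `building_height` is ignored here.

```python
import math


def auto_fov(distance_m, building_width_m=15.0, fov_margin=0.10,
             fov_min=30.0, fov_max=120.0):
    """Horizontal FOV in degrees that roughly frames a facade seen head-on.

    Hypothetical sketch; urbanworm's actual heuristic may differ.
    """
    distance_m = max(float(distance_m), 1e-6)  # avoid division by zero
    # Angle subtended at the camera by a facade of the assumed width.
    subtended = 2.0 * math.degrees(math.atan((building_width_m / 2.0) / distance_m))
    # Pad by the fractional margin, then clamp to the allowed range.
    padded = subtended * (1.0 + fov_margin)
    return max(fov_min, min(fov_max, padded))


mid_range = auto_fov(7.5)     # 90 degrees subtended, plus 10% margin
far_away = auto_fov(1000.0)   # tiny angle, raised to fov_min
very_close = auto_fov(0.5)    # huge angle, capped at fov_max
```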

Returns:

Type	Description
`list[str]`	A list of images in base64 format.
`DataFrame`	A DataFrame containing metadata about the street view images, returned when `output_df=True`. `captured_at` format is `"YYYY-M-D-H"` for Mapillary and `"YYYY-MM-1-1"` for Google (day and hour are nominal placeholders).
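
Because `captured_at` is a hyphen-separated string rather than an ISO timestamp, it usually needs a small parser before it can be compared or sorted. A minimal sketch, assuming the four fields are always present and the hour is in the range 0 to 23; `parse_captured_at` is a hypothetical helper and the sample strings are made up:

```python
from datetime import datetime


def parse_captured_at(value):
    """Parse a 'YYYY-M-D-H' string into a datetime (hypothetical helper)."""
    year, month, day, hour = (int(part) for part in value.split("-"))
    return datetime(year, month, day, hour)


mapillary_ts = parse_captured_at("2021-7-4-15")
# For Google, only the year and month are meaningful; day and hour are placeholders.
google_ts = parse_captured_at("2019-08-1-1")
```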

Source code in urbanworm/dataset.py

def getSV(location: list|tuple,
          loc_id: int | str = None,
          distance:int = 50,
          key: str = None,
          source: str = "mapillary",
          pano: bool = False,
          reoriented: bool = False,
          multi_num: int = 1,
          interval: int = 1,
          fov: int | float | str = 80, heading: int = None, pitch: int = 5,
          height: int = 500, width: int = 700,
          year: list | tuple = None,
          season: str = None,
          time_of_day: str = None,
          target_polygon=None,
          fov_margin: float = 0.10,
          fov_min: float = 30.0,
          fov_max: float = 120.0,
          building_height: float = 9.0,
          output_df: bool = True,
          silent: bool = False) -> pd.DataFrame | list | None:
    """
        getSV

        Retrieve the closest street view image(s) near a coordinate.
        Supports multiple sources; the image is reoriented to face the target
        coordinate when ``reoriented=True`` (Mapillary) or always (Google).

        Args:
            location: coordinates (longitude/x and latitude/y)
            loc_id (int|str, optional): The id of the location.
            distance (int): The max distance in meters between the centroid and the street view.
            key (str): API access token for the chosen source.
                Mapillary — pass token or set env var ``MAPILLARY_API_KEY``.
                Google    — pass token or set env var ``GOOGLE_STREETVIEW_API_KEY``.
            source (str): Street view data source. One of ``"mapillary"`` (default)
                or ``"google"``.
            pano (bool): Whether to search for panoramic images only.
                Mapillary only — ignored for Google. (Default is False)
            reoriented (bool): Whether to reorient and crop street view images to face
                the target. Mapillary only — Google always faces the target.
                (Default is False)
            multi_num (int): The number of multiple SVIs. Mapillary only — Google
                always returns 1. (Default is 1)
            interval (int): The interval in meters between each SVI.
                Mapillary only. (Default is 1)
            fov (int | float | str): Field of view in degrees for the perspective image
                (default 80). Pass ``'auto'`` together with ``reoriented=True`` to
                size the FOV per image so the target building is just framed —
                see ``target_polygon`` / ``fov_margin`` / ``fov_min`` / ``fov_max``.
                When ``target_polygon`` is None, ``'auto'`` falls back to a
                distance-based heuristic (assumes ~15 m wide building).
                Mapillary only — for Google, ``fov`` is passed directly to the API
                and clamped to [10, 120]; ``'auto'`` is not supported.
            heading (int): Camera heading in degrees. If None, computed from the
                bearing to the target location.
            pitch (int): Camera pitch angle. (Default is 5)
            height (int): Height in pixels of the returned image. (Default is 500)
            width (int): Width in pixels of the returned image. (Default is 700)
            year (list[str], optional): Year of data (start year, end year).
                Mapillary only — ignored for Google with a warning.
            season (str, optional): Season of data.
                Mapillary only — ignored for Google with a warning.
            time_of_day (str, optional): Time of data.
                Mapillary only — ignored for Google with a warning.
            target_polygon (shapely.geometry.Polygon, optional): Building footprint
                used by ``fov='auto'`` to compute the angular extent of the target.
                Coordinates are assumed to be ``(lon, lat)`` in WGS84.
                Mapillary only.
            fov_margin (float): Fractional padding added to the auto-computed
                FOV (0.10 = +10%). Default 0.10. Mapillary only.
            fov_min (float): Lower clamp for ``fov='auto'`` (degrees). Default 30°.
                Mapillary only.
            fov_max (float): Upper clamp for ``fov='auto'`` (degrees). Default 120°.
                Mapillary only.
            building_height (float): Assumed building height in meters used by
                ``fov='auto'`` (default 9 m, ~3 stories). Mapillary only.
            output_df (bool, optional): Whether to also return a DataFrame of metadata.
                (Default is True)
            silent (bool, optional): Whether to silence warnings. (Default is False)

        Returns:
            list[str]: A list of images in base64 format.
            DataFrame: A dataframe containing metadata about the street view images.
                ``captured_at`` format is ``"YYYY-M-D-H"`` for Mapillary and
                ``"YYYY-MM-1-1"`` for Google (day and hour are nominal placeholders).
    """
    source = source.lower().strip()

    if source == "google":
        # Warn about params that Google does not support.
        # warnings.warn() deduplicates by call-site, so each message appears
        # only once even when getSV() is called in a loop (e.g. from
        # get_svi_from_locations), unlike logger.warning() which fires every time.
        if multi_num > 1:
            warnings.warn(
                "getSV: multi_num > 1 is not supported for source='google'; using 1.",
                stacklevel=2,
            )
        if any([year, season, time_of_day]):
            warnings.warn(
                "getSV: year/season/time_of_day filtering is not supported for "
                "source='google' (API does not expose historical imagery). "
                "These parameters will be ignored.",
                stacklevel=2,
            )
        if isinstance(fov, str) and fov.strip().lower() == "auto":
            warnings.warn(
                "getSV: fov='auto' is not supported for source='google'. "
                "Falling back to fov=80.",
                stacklevel=2,
            )
            fov = 80
        return _getSV_google(
            location=location, loc_id=loc_id, distance=distance, key=key,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            output_df=output_df, silent=silent,
        )

    if source == "mapillary":
        return _getSV_mapillary(
            location=location, loc_id=loc_id, distance=distance, key=key,
            pano=pano, reoriented=reoriented, multi_num=multi_num, interval=interval,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            year=year, season=season, time_of_day=time_of_day,
            target_polygon=target_polygon, fov_margin=fov_margin,
            fov_min=fov_min, fov_max=fov_max, building_height=building_height,
            output_df=output_df, silent=silent,
        )

    raise ValueError(
        f"getSV: unknown source '{source}'. Choose 'mapillary' or 'google'."
    )
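
The comment in the Google branch relies on how Python's `warnings` module deduplicates messages. The sketch below illustrates that behavior with a hypothetical stand-in function (`fetch` is not part of urbanworm): under the `"default"` filter action, a warning attributed to the same call site is recorded once, even when the call runs in a loop.

```python
import warnings


def fetch(multi_num):
    """Hypothetical stand-in for a function that warns on unsupported input."""
    if multi_num > 1:
        # stacklevel=2 attributes the warning to the caller's line.
        warnings.warn("multi_num > 1 is not supported; using 1.", stacklevel=2)
    return 1


with warnings.catch_warnings(record=True) as caught:
    # "default" shows each (message, category, line) combination once.
    warnings.simplefilter("default")
    for _ in range(5):
        fetch(multi_num=3)  # same call site on every iteration

shown = len(caught)  # number of warnings actually recorded
```

If the active filter is `"always"`, every call emits the warning, so the deduplication depends on the filter configuration of the calling program.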

Flickr (photos)

`urbanworm.sources.flickr`

Flickr photo source. Thin re-export of `urbanworm.dataset.getPhoto`.

Functions

`getPhoto(location, loc_id=None, distance=50, key=None, query=None, geo_context=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, exclude_from_location=None, output_df=True)`

getPhoto

Fetch public Flickr photos with geotags near a location (or within a Flickr place).

Parameters:

Name	Type	Description	Default
`location`	`list \| tuple`	Coordinates `(lon, lat)` of the location around which to search for geotagged photos.	required
`loc_id`	`int \| str`	The id of the location.	`None`
`distance`	`int`	Search radius in meters (converted to km; Flickr radius max is 32 km).	`50`
`key`	`str`	Flickr API key. If None, reads env var FLICKR_API_KEY.	`None`
`query`	`str \| list[str]`	Query parameters to pass to Flickr API (free text search).	`None`
`geo_context`	`int`	Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)	`None`
`tag`	`str \| list[str]`	Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.	`None`
`max_return`	`int`	Number of photos to return (after filters).	`1`
`year`	`list \| tuple`	`[Y]`, `(Y,)`, or `(Y1, Y2)`, inclusive. Filters by date-taken range.	`None`
`season`	`str`	One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).	`None`
`time_of_day`	`str`	One of {"morning","afternoon","evening","night"} (post-filter by taken hour).	`None`
`exclude_from_location`	`int`	Drop retrieved photos that lie within this distance (in meters) of the given location. (Default is None)	`None`
`output_df`	`bool`	If True, return a pandas.DataFrame; otherwise return dict (if max_return==1) or list[dict].	`True`
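
The `distance` conversion can be summarized in a few lines. This sketch mirrors the clamping shown in the source listing (a floor of 0.01 km and a ceiling of 32 km); the function name `flickr_radius_km` is hypothetical and exists only for illustration.

```python
def flickr_radius_km(distance_m):
    """Convert a search radius in meters to the kilometer value sent to Flickr."""
    radius_km = max(float(distance_m) / 1000.0, 0.01)  # floor at 10 m
    return min(radius_km, 32.0)                        # cap at 32 km


default_radius = flickr_radius_km(50)      # the default distance of 50 m
tiny_radius = flickr_radius_km(5)          # raised to the 0.01 km floor
huge_radius = flickr_radius_km(50_000)     # capped at 32 km
```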

Returns:

Type	Description
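`dict \| list[dict] \| DataFrame`	A `pandas.DataFrame` when `output_df=True`; otherwise a dict (if `max_return == 1`, or `None` when nothing is found) or a list of dicts.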

Source code in urbanworm/dataset.py

def getPhoto(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        key: str = None,
        query: str | list[str] = None,
        geo_context: int = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        exclude_from_location:int = None,
        output_df: bool = True
):
    """
        getPhoto

        Fetch public Flickr photos with geotags near a location (or within a Flickr place).

        Args:
            location (list|tuple): (lon, lat) required. Coordinates of location (longitude, latitude) for searching for geotagged photos
            loc_id (int | str): The id of the location.
            distance (int): Search radius in meters (converted to km; Flickr radius max is 32 km).
            key (str): Flickr API key. If None, reads env var FLICKR_API_KEY.
            query (str | list[str]): Query parameters to pass to Flickr API (free text search).
            geo_context (int): Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)
            tag: Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.
            max_return: Number of photos to return (after filters).
            year (list | tuple): [Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.
            season (str): One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).
            time_of_day (str): One of {"morning","afternoon","evening","night"} (post-filter by taken hour).
            exclude_from_location (int, optional): drop retrieved photos within a distance (in meter) from the given location. (Default is None)
            output_df (bool): If True, return a pandas.DataFrame; otherwise return dict (if max_return==1)
                       or list[dict].

        Returns:
            dict | list[dict] | pandas.DataFrame
    """

    import os
    from datetime import datetime, timedelta, timezone

    import requests

    if exclude_from_location is not None:
        # Build the exclusion area from exclude_from_location, not the search radius.
        drop_area = projection(location, r=exclude_from_location)

    # -------------------------
    # Validate inputs
    # -------------------------
    if max_return is None or int(max_return) < 1:
        raise ValueError("max_return must be >= 1.")
    max_return = int(max_return)

    api_key = key or os.getenv("FLICKR_API_KEY")
    if not api_key:
        raise ValueError("Missing Flickr API key. Pass key=... or set env var FLICKR_API_KEY.")

    lon, lat = location
    months = season_months(season)
    hours = tod_hours(time_of_day)
    y_range = year_range(year)

    # Radius in km (Flickr max 32km)
    radius_km = max(float(distance) / 1000.0, 0.01)
    radius_km = min(radius_km, 32.0)

    # Geo queries need a "limiting agent"; tags or min/max dates qualify.
    # If user provided none, default to last 365 days so results aren’t silently limited to ~12 hours.
    now_utc = datetime.now(timezone.utc)
    default_min_upload_date = int((now_utc - timedelta(days=365)).timestamp())

    # -------------------------
    # Build Flickr request
    # -------------------------
    endpoint = "https://api.flickr.com/services/rest/"

    extras = ",".join(
        [
            "description",
            "license",
            "date_upload",
            "date_taken",
            "owner_name",
            "geo",
            "tags",
            "views",
            "media",
            "url_sq",
            "url_t",
            "url_s",
            "url_q",
            "url_m",
            "url_n",
            "url_z",
            "url_c",
            "url_l",
            "url_o",
        ]
    )

    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "format": "json",
        "nojsoncallback": 1,
        "extras": extras,
        "safe_search": 1, # safe only for un-authed calls
        "media": "photos",
        "has_geo": 1,
        "content_types": 0, # photos
        "sort": "relevance",
        "lat": lat,
        "lon": lon,
        "radius": radius_km,
        "radius_units": "km"
    }

    if query:
        q = query_string(query)
        if q:
            params["text"] = q

    if geo_context:
        params["geo_context"] = geo_context

    # tags
    if tag:
        if isinstance(tag, (list, tuple)):
            tags = ",".join([str(t).strip() for t in tag if str(t).strip()])
            params["tags"] = tags
            params["tag_mode"] = "all"
        else:
            params["tags"] = str(tag).strip()

    # date range (taken) if specified
    if y_range is not None:
        params["min_taken_date"], params["max_taken_date"] = y_range
    else:
        # If no explicit limiting agent, set min_upload_date (acts as limiting agent for geo queries).
        if not tag and season is None and time_of_day is None:
            params["min_upload_date"] = default_min_upload_date

    # -------------------------
    # Fetch + post-filter
    # -------------------------
    session = requests.Session()

    # Geo/bbox queries only return up to 250/page.
    per_page = min(250, max(50, max_return * 20))
    params["per_page"] = per_page

    results = []
    seen = set()

    max_pages = 150
    for page in range(1, max_pages + 1):
        params["page"] = page
        r = session.get(endpoint, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()

        if data.get("stat") != "ok":
            msg = data.get("message") or data.get("error") or str(data)
            raise RuntimeError(f"Flickr API error: {msg}")

        photos = (data.get("photos") or {}).get("photo") or []
        if not photos:
            break

        for p in photos:
            if exclude_from_location is not None:
                if is_coordinate_in_bbox(p["longitude"], p["latitude"], drop_area):
                    continue
            pid = p.get("id")
            if not pid or pid in seen:
                continue
            seen.add(pid)

            taken_dt = parse_taken(p)
            if months and taken_dt and taken_dt.month not in months:
                continue
            if hours and taken_dt and taken_dt.hour not in hours:
                continue

            s_lat = float(p["latitude"]) if "latitude" in p and p["latitude"] not in (None, "") else None
            s_lon = float(p["longitude"]) if "longitude" in p and p["longitude"] not in (None, "") else None

            url = best_url(p)
            out = {
                "loc_id": '',
                "id": pid,
                "title": p.get("title"),
                "owner": p.get("owner"),
                # "ownername": p.get("ownername"),
                "datetaken": p.get("datetaken") or p.get("date_taken"),
                "latitude": s_lat,
                "longitude": s_lon,
                # "accuracy": int(p["accuracy"]) if "accuracy" in p and str(p["accuracy"]).isdigit() else None,
                "distance_m": haversine_m(lat, lon, s_lat, s_lon) if (s_lat is not None and s_lon is not None) else None,
                "tags": p.get("tags"),
                "description": p.get("description"),
                "views": int(p["views"]) if "views" in p and str(p["views"]).isdigit() else None,
                "license": p.get("license"),
                "url": url,
                # "page_url": f"https://www.flickr.com/photos/{p.get('owner')}/{pid}",
            }

            if loc_id is not None:
                out["loc_id"] = loc_id
            else:
                del out["loc_id"]

            results.append(out)

            # if len(results) >= max_return:
            #     break

        if len(results) >= max_return:
            break

    if output_df:
        import pandas as pd
        df = pd.DataFrame(results)
        # An empty result has no 'distance_m' column, so only sort when rows exist.
        if not df.empty:
            df = df.sort_values(by='distance_m', ascending=True)
        return df.head(max_return)

    if max_return == 1:
        return results[0] if results else None
    return results
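
The listing calls a `haversine_m` helper that is not shown here and then keeps the nearest `max_return` rows. As a rough illustration of that ranking step, the sketch below uses a standard haversine implementation as a stand-in (the real helper may differ, and the Earth radius is an assumption) together with a hypothetical `nearest` function that works on plain dicts.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (stand-in for the library helper)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest(photos, lat, lon, max_return=1):
    """Return the max_return photos closest to (lat, lon). Hypothetical helper."""
    ranked = sorted(
        photos,
        key=lambda p: haversine_m(lat, lon, p["latitude"], p["longitude"]),
    )
    return ranked[:max_return]


one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)  # one degree of longitude at the equator
photos = [
    {"id": "far", "latitude": 0.0, "longitude": 0.02},
    {"id": "near", "latitude": 0.0, "longitude": 0.001},
]
closest = nearest(photos, lat=0.0, lon=0.0)
```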

Freesound (audio)

`urbanworm.sources.freesound`

Freesound audio source. Thin re-export of `urbanworm.dataset.getSound`.

Functions

`getSound(location, loc_id=None, distance=50, source='freesound', key=None, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=300, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, output_df=True)`

Dispatch to the per-source helpers.

Parameters:

Name	Type	Description	Default
`source`	`str`	one of {"freesound", "aporee"}. Default "freesound".	`'freesound'`
`catalog`	`str \| DataFrame`	required when source="aporee" — see :func:`getSoundAporee`.	`None`
`probe_durations`	`bool`	Aporee-only. See :func:`getSoundAporee`.	`True`

All other arguments are forwarded; key is only used by Freesound, catalog and probe_durations only by Aporee.
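
The split between Freesound-only and Aporee-only arguments can be illustrated with a small wrapper. This is a sketch only: `nearest_sounds` and `fake_get_sound` are hypothetical, the coordinates, file name, and token are placeholders, and the stub stands in for `getSound` so the example runs without network access or API keys.

```python
def nearest_sounds(get_sound, location, catalog=None, key=None, **filters):
    """Call a getSound-style function, choosing the source from the arguments."""
    if catalog is not None:
        # Aporee reads from a catalog; key is not used.
        return get_sound(location, source="aporee", catalog=catalog, **filters)
    # Freesound needs an API key; catalog is not used.
    return get_sound(location, source="freesound", key=key, **filters)


def fake_get_sound(location, **kwargs):
    """Stub used only to show which arguments are forwarded."""
    return {"location": location, **kwargs}


aporee_call = nearest_sounds(
    fake_get_sound, (-83.74, 42.28), catalog="aporee.csv", max_return=3
)
freesound_call = nearest_sounds(fake_get_sound, (-83.74, 42.28), key="TOKEN")
```

In real use, the first argument would be the `getSound` function imported from `urbanworm.sources.freesound`.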

Source code in urbanworm/dataset.py

def getSound(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        source: str = 'freesound',
        key: str = None,
        catalog: str | pd.DataFrame = None,
        query: str | list[str] | None = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        duration: int = 300,
        exclude_from_location: int = None,
        slice_duration: int = None,
        slice_max_num: int = None,
        probe_durations: bool = True,
        output_df: bool = True,
) -> pd.DataFrame | dict | list | None:
    """Dispatch to the per-source helpers.

    Args:
        source (str): one of {"freesound", "aporee"}. Default "freesound".
        catalog: required when source="aporee" — see :func:`getSoundAporee`.
        probe_durations: Aporee-only. See :func:`getSoundAporee`.

    All other arguments are forwarded; ``key`` is only used by Freesound,
    ``catalog`` and ``probe_durations`` only by Aporee.
    """
    src = (source or 'freesound').lower()
    if src == 'freesound':
        return _getSoundFreesound(
            location=location, loc_id=loc_id, distance=distance, key=key,
            query=query, tag=tag, max_return=max_return, year=year,
            season=season, time_of_day=time_of_day, duration=duration,
            exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            output_df=output_df,
        )
    elif src == 'aporee':
        return getSoundAporee(
            location=location, loc_id=loc_id, distance=distance,
            catalog=catalog, query=query, tag=tag, max_return=max_return,
            year=year, season=season, time_of_day=time_of_day,
            duration=duration, exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            probe_durations=probe_durations,
            output_df=output_df,
        )
    else:
        raise ValueError(
            f"Unsupported sound source {source!r}; choose 'freesound' or 'aporee'."
        )

Radio Aporee (audio)

`urbanworm.sources.aporee`

Radio Aporee audio source.

Re-exports the helpers that live in `urbanworm.dataset`:

`getSoundAporee`: filter a catalog by spatial proximity
`fetch_aporee_catalog`: fetch the catalog from Internet Archive
`enrich_aporee_catalog`: probe URLs for `duration_s`

Functions

`enrich_aporee_catalog(catalog, out_path=None, min_duration=None, skip_existing=True, timeout=60.0)`

Add a duration_s column to an Aporee catalog by probing each URL.

Aporee URLs don't carry duration metadata, so this helper downloads each file once, reads its length with pydub (or mutagen as a fallback), and annotates the catalog. Optionally drops rows shorter than min_duration.

Use this once after building / updating your catalog so that subsequent `getSoundAporee` calls with `slice_duration` can compute clip windows without paying the per-row probe cost every time.

Parameters:

Name	Type	Description	Default
`catalog`	`str \| DataFrame`	CSV path or in-memory DataFrame. Must have a `url` column.	required
`out_path`	`str`	If provided, writes the enriched DataFrame back to this CSV path.	`None`
`min_duration`	`float`	Drop rows shorter than this many seconds (after probing). `None` keeps all rows.	`None`
`skip_existing`	`bool`	If `True` (default) and `duration_s` is already populated for a row, leave it alone. Set `False` to re-probe every row.	`True`
`timeout`	`float`	Per-URL request timeout (seconds).	`60.0`

Returns:

Type	Description
`DataFrame`	The enriched `pandas.DataFrame`.
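
The interaction between `skip_existing` and `min_duration` can be shown on a toy catalog. The sketch below is a plain-Python analogue that uses a list of dicts instead of a DataFrame; `rows_to_probe`, `drop_short`, and the example URLs are hypothetical and only illustrate the selection and filtering rules described above.

```python
def rows_to_probe(rows, skip_existing=True):
    """Indices of rows whose duration still has to be measured."""
    return [
        i for i, row in enumerate(rows)
        if not skip_existing or row.get("duration_s") is None
    ]


def drop_short(rows, min_duration=None):
    """Keep rows with a known duration of at least min_duration seconds."""
    if min_duration is None:
        return list(rows)
    return [
        row for row in rows
        if row.get("duration_s") is not None and row["duration_s"] >= min_duration
    ]


catalog = [
    {"url": "https://example.org/a.mp3", "duration_s": 12.5},
    {"url": "https://example.org/b.mp3", "duration_s": None},
    {"url": "https://example.org/c.mp3", "duration_s": 95.0},
]
pending = rows_to_probe(catalog)                      # only the unprobed row
everything = rows_to_probe(catalog, skip_existing=False)
kept = drop_short(catalog, min_duration=30)           # unknown and short rows go
```

Note that rows whose duration could not be determined are also dropped once `min_duration` is set, which matches the `notna()` condition in the source listing.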

Source code in urbanworm/dataset.py

def enrich_aporee_catalog(
        catalog: str | pd.DataFrame,
        out_path: str | None = None,
        min_duration: float | None = None,
        skip_existing: bool = True,
        timeout: float = 60.0,
) -> pd.DataFrame:
    """Add a ``duration_s`` column to an Aporee catalog by probing each URL.

    Aporee URLs don't carry duration metadata, so this helper downloads each
    file once, reads its length with pydub (or mutagen as a fallback), and
    annotates the catalog. Optionally drops rows shorter than
    ``min_duration``.

    Use this once after building / updating your catalog so that subsequent
    :func:`getSoundAporee` calls with ``slice_duration`` can compute clip
    windows without paying the per-row probe cost every time.

    Args:
        catalog (str | pandas.DataFrame): CSV path or in-memory DataFrame.
            Must have a ``url`` column.
        out_path (str, optional): If provided, writes the enriched DataFrame
            back to this CSV path.
        min_duration (float, optional): Drop rows shorter than this many
            seconds (after probing). ``None`` keeps all rows.
        skip_existing (bool): If ``True`` (default) and ``duration_s`` is
            already populated for a row, leave it alone. Set ``False`` to
            re-probe every row.
        timeout (float): Per-URL request timeout (seconds).

    Returns:
        The enriched ``pandas.DataFrame``.
    """
    from .utils.utils import probe_audio_duration

    if isinstance(catalog, str):
        df = pd.read_csv(catalog)
    elif isinstance(catalog, pd.DataFrame):
        df = catalog.copy()
    else:
        raise TypeError(
            "catalog must be a CSV path (str) or a pandas.DataFrame; "
            f"got {type(catalog).__name__}."
        )

    if "url" not in df.columns:
        raise ValueError("Aporee catalog must have a 'url' column.")
    if "duration_s" not in df.columns:
        df["duration_s"] = pd.NA

    needs_probe = df.index if not skip_existing else df.index[df["duration_s"].isna()]
    logger.info(
        "enrich_aporee_catalog: probing %d / %d rows", len(needs_probe), len(df),
    )

    for i in tqdm(needs_probe, desc="probing", ncols=75):
        url = df.at[i, "url"]
        if not isinstance(url, str) or not url.startswith("http"):
            continue
        d = probe_audio_duration(url, timeout=timeout)
        if d is not None:
            df.at[i, "duration_s"] = round(float(d), 2)

    if min_duration is not None:
        before = len(df)
        df = df[
            df["duration_s"].notna()
            & (pd.to_numeric(df["duration_s"], errors="coerce") >= float(min_duration))
        ].reset_index(drop=True)
        logger.info(
            "enrich_aporee_catalog: dropped %d rows shorter than %ss",
            before - len(df), min_duration,
        )

    if out_path is not None:
        df.to_csv(out_path, index=False)
        logger.info("enrich_aporee_catalog: wrote %d rows to %s", len(df), out_path)

    return df

`fetch_aporee_catalog(bbox=None, year=None, hour=None, season=None, southern=False, rows=0, verify_urls=False, out_path=None, enrich_durations=False, min_duration=None, timeout=60.0, page_size=500)`

Fetch the Aporee sound-map catalog from Internet Archive.

All Aporee field recordings are mirrored on archive.org under the radio-aporee-maps collection. This helper queries IA's Scrape API with optional server-side bbox / year filters and applies hour / season filters client-side, then returns a DataFrame in the schema `getSoundAporee` expects.

Parameters:

Name	Type	Description	Default
`bbox`	`tuple[float, float, float, float] \| list \| None`	`(lat_min, lon_min, lat_max, lon_max)` to filter server-side. Pass `None` for the whole world.	`None`
`year`	`int \| tuple[int, int] \| list \| None`	Single year (`2021`) or inclusive range (`(2018, 2022)`). Filtered server-side via IA's `date` field.	`None`
`hour`	`int \| tuple[int, int] \| list \| None`	UTC hour or inclusive range (`(9, 17)` or `(22, 4)` for midnight-wrap). Applied client-side against `capture_time`.	`None`
`season`	`str \| list[str] \| None`	One of `"spring" \| "summer" \| "autumn"/"fall" \| "winter"`, or a list. Hemisphere is auto-detected from each row's latitude; pass `southern=True` to force southern interpretation.	`None`
`southern`	`bool`	Force southern-hemisphere season interpretation.	`False`
`rows`	`int`	Maximum number of records to fetch. `0` means all.	`0`
`verify_urls`	`bool`	If True, query IA's metadata API for each identifier to find the exact mp3 filename. Slow but accurate. Default False uses the `<identifier>.mp3` fallback (works for the vast majority of items).	`False`
`out_path`	`str`	If provided, write the resulting DataFrame to this CSV path.	`None`
`enrich_durations`	`bool`	If True, also probe each fetched URL for its duration via :func:`enrich_aporee_catalog` (slow — one request per row).	`False`
`min_duration`	`float`	When `enrich_durations=True`, drop rows shorter than this many seconds.	`None`
`timeout`	`float`	Per-request HTTP timeout (seconds).	`60.0`
`page_size`	`int`	Records per Scrape-API page (min 100).	`500`
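
The `hour` filter accepts ranges that wrap past midnight, which is easy to get wrong. A minimal sketch of an inclusive range check with wrap-around follows; `hour_matches` is a hypothetical helper, not the library's implementation.

```python
def hour_matches(hour, wanted):
    """True if hour (0-23) equals wanted or lies in the inclusive (start, end) range."""
    if isinstance(wanted, int):
        return hour == wanted
    start, end = wanted
    if start <= end:
        return start <= hour <= end
    # The range wraps past midnight, e.g. (22, 4) covers 22, 23, 0, 1, 2, 3, 4.
    return hour >= start or hour <= end


daytime = [h for h in range(24) if hour_matches(h, (9, 17))]
overnight = [h for h in range(24) if hour_matches(h, (22, 4))]
```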

Returns:

Type	Description
`DataFrame`	`pandas.DataFrame` with columns `identifier`, `id`, `latitude`, `longitude`, `url`, `capture_time`, `created`, `year`, `month`, `hour`, `season`, `title`, `name`, `description`, `tags`, `licence`, `duration_s`. `id` aliases `identifier` and `name` aliases `title` for compatibility with the filters in `getSoundAporee`.
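
The `season` column depends on the hemisphere of each recording. The sketch below shows one way to derive it from `month` and `latitude`, assuming three-month meteorological seasons (December to February is northern winter). The month boundaries and the `season_for` helper are assumptions; the documentation above does not state which convention the library uses.

```python
# Assumed meteorological seasons for the northern hemisphere.
NORTHERN_SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
OPPOSITE = {
    "winter": "summer", "summer": "winter",
    "spring": "autumn", "autumn": "spring",
}


def season_for(month, latitude, southern=False):
    """Season name for a month, flipped for southern latitudes (hypothetical helper)."""
    season = NORTHERN_SEASONS[month]
    if southern or latitude < 0:
        return OPPOSITE[season]
    return season


paris_january = season_for(1, 48.86)
sydney_january = season_for(1, -33.87)
forced_south = season_for(7, 10.0, southern=True)
```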

Source code in urbanworm/dataset.py

def fetch_aporee_catalog(
        bbox: tuple[float, float, float, float] | list | None = None,
        year: int | tuple[int, int] | list | None = None,
        hour: int | tuple[int, int] | list | None = None,
        season: str | list[str] | None = None,
        southern: bool = False,
        rows: int = 0,
        verify_urls: bool = False,
        out_path: str | None = None,
        enrich_durations: bool = False,
        min_duration: float | None = None,
        timeout: float = 60.0,
        page_size: int = 500,
) -> pd.DataFrame:
    """Fetch the Aporee sound-map catalog from Internet Archive.

    All Aporee field recordings are mirrored on archive.org under the
    ``radio-aporee-maps`` collection. This helper queries IA's Scrape API
    with optional server-side ``bbox`` / ``year`` filters and applies
    ``hour`` / ``season`` filters client-side, then returns a DataFrame in
    the schema :func:`getSoundAporee` expects.

    Args:
        bbox: ``(lat_min, lon_min, lat_max, lon_max)`` to filter server-side.
            Pass ``None`` for the whole world.
        year: Single year (``2021``) or inclusive range (``(2018, 2022)``).
            Filtered server-side via IA's ``date`` field.
        hour: UTC hour or inclusive range (``(9, 17)`` or ``(22, 4)`` for
            midnight-wrap). Applied client-side against ``capture_time``.
        season: One of ``"spring" | "summer" | "autumn"/"fall" | "winter"``,
            or a list. Hemisphere is auto-detected from each row's latitude;
            pass ``southern=True`` to force southern interpretation.
        southern (bool): Force southern-hemisphere season interpretation.
        rows (int): Maximum number of records to fetch. ``0`` means all.
        verify_urls (bool): If True, query IA's metadata API for each
            identifier to find the exact mp3 filename. Slow but accurate.
            Default False uses the ``<identifier>.mp3`` fallback (works
            for the vast majority of items).
        out_path (str, optional): If provided, write the resulting DataFrame
            to this CSV path.
        enrich_durations (bool): If True, also probe each fetched URL for
            its duration via :func:`enrich_aporee_catalog` (slow — one
            request per row).
        min_duration (float, optional): When ``enrich_durations=True``,
            drop rows shorter than this many seconds.
        timeout (float): Per-request HTTP timeout (seconds).
        page_size (int): Records per Scrape-API page (min 100).

    Returns:
        ``pandas.DataFrame`` with columns:
        ``identifier, id, latitude, longitude, url, capture_time, created,
        year, month, hour, season, title, name, description, tags, licence,
        duration_s``. ``id`` aliases ``identifier`` and ``name`` aliases
        ``title`` for compatibility with :func:`getSoundAporee`'s filters.
    """
    import requests

    # Build query
    query = f"collection:{_APOREE_COLLECTION}"
    whole_world = (-90.0, -180.0, 90.0, 180.0)
    bbox_t = tuple(bbox) if bbox is not None else whole_world
    if bbox_t != whole_world:
        lat_min, lon_min, lat_max, lon_max = bbox_t
        query += (
            f" AND lat:[{lat_min:g} TO {lat_max:g}]"
            f" AND lon:[{lon_min:g} TO {lon_max:g}]"
        )
    if year is not None:
        if isinstance(year, (list, tuple)):
            y1, y2 = int(year[0]), int(year[-1])
            if y2 < y1:
                y1, y2 = y2, y1
        else:
            y1 = y2 = int(year)
        query += f" AND date:[{y1}-01-01T00:00:00Z TO {y2}-12-31T23:59:59Z]"

    # Normalize hour filter
    hour_range: tuple[int, int] | None = None
    if hour is not None:
        if isinstance(hour, (list, tuple)):
            hour_range = (int(hour[0]), int(hour[-1]))
        else:
            hour_range = (int(hour), int(hour))

    # Normalize season filter to a set of months
    season_set: set[int] | None = None
    if season is not None:
        from .utils.utils import season_months as _sm
        names = season if isinstance(season, (list, tuple)) else [season]
        season_set = set()
        for s in names:
            season_set |= _sm(s)

    logger.info("fetch_aporee_catalog: query=%s", query)
    headers = {"User-Agent": "urban-worm/0.x (+aporee fetcher)"}
    page_size = max(100, int(page_size))

    items: list[dict] = []
    cursor: str | None = None
    fetched = 0
    skip_no_geo = skip_hour = skip_season = 0

    pbar = tqdm(desc="fetch aporee", unit="rec", disable=False)
    while True:
        if rows and fetched >= rows:
            break
        page_n = page_size if not rows else min(page_size, rows - fetched)
        params = {
            "q": query,
            "fields": ",".join(_IA_FIELDS),
            "count": max(100, page_n),
        }
        if cursor:
            params["cursor"] = cursor

        r = requests.get(_IA_SCRAPE, params=params, headers=headers, timeout=timeout)
        r.raise_for_status()
        data = r.json()
        if "items" not in data:
            raise RuntimeError(f"IA scrape API error: {data}")

        docs = data["items"]
        next_cursor = data.get("cursor")
        if not docs:
            break

        for doc in docs:
            try:
                lat_v = float(doc.get("latitude") or "")
                lon_v = float(doc.get("longitude") or "")
            except (ValueError, TypeError):
                skip_no_geo += 1
                continue

            ident = doc.get("identifier", "")
            title = doc.get("title", "")
            ctime = (doc.get("date") or "").strip()
            description = doc.get("description", "")
            licence = doc.get("licenseurl", "")
            subject = doc.get("subject", "")
            # `subject` may come back as a list — collapse to comma-string
            if isinstance(subject, list):
                subject = ",".join(str(s) for s in subject)

            # Client-side hour filter
            if hour_range is not None:
                hh = _ia_extract_hour(ctime)
                if hh is None:
                    skip_hour += 1
                    continue
                h_start, h_end = hour_range
                if h_start <= h_end:
                    matched = h_start <= hh <= h_end
                else:
                    matched = hh >= h_start or hh <= h_end
                if not matched:
                    skip_hour += 1
                    continue

            # Client-side season filter
            if season_set is not None:
                mm = _ia_extract_month(ctime)
                if mm is None:
                    skip_season += 1
                    continue
                row_month = mm
                if southern or lat_v < 0:
                    row_month = ((mm - 1 + 6) % 12) + 1
                if row_month not in season_set:
                    skip_season += 1
                    continue

            url = (
                _ia_verify_mp3_url(ident, timeout=timeout)
                if verify_urls
                else f"{_IA_DOWNLOAD}/{ident}/{ident}.mp3"
            )

            items.append({
                "identifier": ident,
                "id": ident,                        # alias for getSoundAporee
                "latitude": lat_v,
                "longitude": lon_v,
                "url": url,
                "capture_time": ctime,              # script's column name
                "created": ctime,                   # getSoundAporee filter name
                "title": title,
                "name": title,                      # alias for getSoundAporee.query
                "description": description,
                "tags": subject,                    # IA's `subject` -> our tags
                "licence": licence,
                "duration_s": None,
            })
            fetched += 1
            pbar.update(1)
            if rows and fetched >= rows:
                break

        # Cursor is the source of truth for "more pages available" — IA's
        # scrape API can return a partial page mid-stream, so don't bail
        # out just because len(docs) < page_n.
        if not next_cursor:
            break
        cursor = next_cursor

    pbar.close()
    logger.info(
        "fetch_aporee_catalog: kept %d, skipped no_geo=%d hour=%d season=%d",
        len(items), skip_no_geo, skip_hour, skip_season,
    )

    df = pd.DataFrame(items)
    if df.empty:
        if out_path:
            df.to_csv(out_path, index=False)
        return df

    # Enrich with derived time columns (year/month/hour/season) for
    # downstream convenience. ``parse_iso_created`` handles missing
    # fractional-seconds gracefully.
    from .utils.utils import parse_iso_created
    parsed = df["capture_time"].apply(parse_iso_created)
    df["year"] = parsed.apply(lambda d: d.year if d is not None else None)
    df["month"] = parsed.apply(lambda d: d.month if d is not None else None)
    df["hour"] = parsed.apply(lambda d: d.hour if d is not None else None)
    df["season"] = df.apply(
        # Use pd.notna: a missing month becomes NaN, which is truthy in Python.
        lambda r: _season_for(r["month"], r["latitude"], southern) if pd.notna(r["month"]) else "",
        axis=1,
    )

    if enrich_durations:
        df = enrich_aporee_catalog(df, min_duration=min_duration, timeout=timeout)

    if out_path:
        df.to_csv(out_path, index=False)
        logger.info("fetch_aporee_catalog: wrote %d rows to %s", len(df), out_path)

    return df
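
The two mechanics at the heart of the listing above, assembling the search query and following the cursor until it disappears, can be tried in isolation. The sketch below is a simplified stand-in, not the library's code: `build_query` and `paginate` are hypothetical names, `PAGES` fakes the HTTP responses, and unlike the listing `build_query` treats only `bbox=None` as "whole world".

```python
from typing import Callable, Iterator, Optional


def build_query(collection: str, bbox=None, year=None) -> str:
    """Assemble a search query with optional bbox and year clauses."""
    query = f"collection:{collection}"
    if bbox is not None:
        lat_min, lon_min, lat_max, lon_max = bbox
        query += f" AND lat:[{lat_min:g} TO {lat_max:g}]"
        query += f" AND lon:[{lon_min:g} TO {lon_max:g}]"
    if year is not None:
        years = year if isinstance(year, (list, tuple)) else (year,)
        y1, y2 = sorted((int(years[0]), int(years[-1])))
        query += f" AND date:[{y1}-01-01T00:00:00Z TO {y2}-12-31T23:59:59Z]"
    return query


def paginate(fetch_page: Callable[[Optional[str]], dict], limit: int = 0) -> Iterator[dict]:
    """Yield items page by page until a response carries no cursor."""
    cursor: Optional[str] = None
    seen = 0
    while True:
        data = fetch_page(cursor)
        items = data.get("items", [])
        if not items:
            return
        for item in items:
            yield item
            seen += 1
            if limit and seen >= limit:
                return
        cursor = data.get("cursor")
        if not cursor:  # a short page alone does not mean we are done
            return


# Stand-in for the HTTP call: three pages, the second one deliberately short.
PAGES = {
    None: {"items": [{"identifier": "a"}, {"identifier": "b"}], "cursor": "p2"},
    "p2": {"items": [{"identifier": "c"}], "cursor": "p3"},
    "p3": {"items": [{"identifier": "d"}, {"identifier": "e"}]},
}

ids = [doc["identifier"] for doc in paginate(PAGES.__getitem__)]
print(ids)
print(build_query("radio-aporee-maps", bbox=(52.3, 13.0, 52.7, 13.8), year=(2022, 2018)))
```

Stopping on a missing cursor rather than on a short page keeps the loop from ending early when the server returns a partial page mid-stream.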

`getSoundAporee(location, loc_id=None, distance=50, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=None, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, output_df=True)` ¶

Filter a Radio Aporee catalog (CSV or DataFrame) by spatial proximity.

Aporee (radio aporee ::: maps) does not expose a public geo-query API the way Freesound does, so this helper takes a pre-built catalog of geotagged Aporee URLs and filters it with the same semantics as `_getSoundFreesound`. The resulting DataFrame uses the same column names, so the downstream `GeoTaggedData` / `download_to_dir` pipeline needs no changes.
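
The core idea, filtering a catalog by great-circle distance and keeping the nearest rows, can be sketched without pandas. This is a simplified illustration under assumptions: `great_circle_m` is a hypothetical stand-in for the library's distance helper (whose implementation is not shown here), `nearest` is a hypothetical name, and the catalog rows and URLs are invented.

```python
import math


def great_circle_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in meters on a sphere of radius 6,371 km."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest(catalog: list[dict], location, distance: float = 50, max_return: int = 1) -> list[dict]:
    """Keep rows within `distance` meters of (lon, lat), nearest first."""
    lon, lat = location
    hits = []
    for row in catalog:
        d = great_circle_m(lat, lon, row["latitude"], row["longitude"])
        if d <= distance:
            hits.append({**row, "distance_m": d})
    hits.sort(key=lambda r: r["distance_m"])
    return hits[:max_return]


# Invented catalog rows around a query point at (lon=13.4050, lat=52.5200).
catalog = [
    {"url": "https://example.org/far.mp3", "latitude": 52.5300, "longitude": 13.4050},
    {"url": "https://example.org/mid.mp3", "latitude": 52.5203, "longitude": 13.4050},
    {"url": "https://example.org/near.mp3", "latitude": 52.5201, "longitude": 13.4050},
]

hits = nearest(catalog, (13.4050, 52.5200), distance=50, max_return=2)
print([h["url"] for h in hits])
```

Note the argument order: `location` is `(lon, lat)`, while the distance helper takes latitude first.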

Parameters:

Name	Type	Description	Default
`location`	`list \| tuple`	(lon, lat) of the query point.	required
`loc_id`	`int \| str`	Identifier for the query location.	`None`
`distance`	`int`	Search radius in meters.	`50`
`catalog`	`str \| DataFrame`	Path to a CSV file or an in-memory DataFrame. Required columns: `url`, `latitude`, `longitude`. Optional columns recognised by the filters: `id`/`identifier`, `name`/`title`, `description`, `tags`, `created` (ISO timestamp), `duration_s`.	`None`
`query`	`str \| list[str]`	Substring(s) matched against `name`/`title` and `description` (case-insensitive). Skipped silently if neither column is present.	`None`
`tag`	`str \| list[str]`	Substring(s) matched against `tags` (case-insensitive). Skipped if column is absent.	`None`
`max_return`	`int`	Number of nearest sounds to return.	`1`
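`year`	`list \| tuple`	Same semantics as `getSound`. Applied against the `created` column if present.	`None`
`season`	`str`	Same semantics as `getSound`. Applied against the `created` column if present.	`None`
`time_of_day`	`str`	Same semantics as `getSound`. Applied against the `created` column if present.	`None`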
`duration`	`int \| list[int] \| tuple[int]`	Filter on `duration_s` if present. Pass an int for max-only or (min, max) for a range.	`None`
`exclude_from_location`	`int`	Drop rows inside this radius (m) around the query point — useful for "what's nearby but not at this exact spot".	`None`
`slice_duration`	`int`	Pre-compute clip windows on top of the chosen recording's `duration_s` (mirrors Freesound path).	`None`
`slice_max_num`	`int`	Cap on number of clips per recording.	`None`
`probe_durations`	`bool`	If True (default) and `slice_duration` is requested but the catalog has no `duration_s` column, fetch each selected recording once with `urbanworm.utils.utils.probe_audio_duration` to learn its length so slice windows can be computed. Set False to skip slicing instead (faster; no per-row download).	`True`
`output_df`	`bool`	If True (default) return a `pandas.DataFrame`.	`True`
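
The `query`, `tag`, and `duration` semantics in the table can be sketched per row in plain Python. This is an approximation for illustration, not the library's pandas implementation: `as_terms` and `keep` are hypothetical names, and the sample row is invented. Every term must match as a substring, filters are skipped when their column is absent, and a row with an unknown duration fails a duration bound.

```python
def as_terms(x) -> list[str]:
    """Normalize a string or a list of strings to lowercase, stripped terms."""
    if x is None:
        return []
    if isinstance(x, (list, tuple)):
        return [str(t).strip().lower() for t in x if str(t).strip()]
    return [str(x).strip().lower()]


def keep(row: dict, query=None, tag=None, duration=None) -> bool:
    """Return True if `row` passes the text, tag, and duration filters."""
    text_cols = [c for c in ("name", "title", "description") if c in row]
    if text_cols:
        text = " ".join(str(row[c]) for c in text_cols).lower()
        if not all(q in text for q in as_terms(query)):
            return False
    if "tags" in row:
        tags = str(row["tags"]).lower()
        if not all(t in tags for t in as_terms(tag)):
            return False
    if duration is not None and "duration_s" in row:
        d = row["duration_s"]
        if d is None:
            return False  # unknown length cannot satisfy a duration bound
        if isinstance(duration, (list, tuple)) and len(duration) == 2:
            lo, hi = sorted(float(v) for v in duration)
            return lo <= float(d) <= hi
        return float(d) <= float(duration)  # a single number is a maximum
    return True


# Invented sample row.
row = {"title": "Tram at night", "description": "Rain on tracks",
       "tags": "tram,rain,city", "duration_s": 95}

print(keep(row, query=["tram", "rain"]))  # both terms appear in title/description
print(keep(row, duration=60))             # 95 s exceeds the 60 s maximum
print(keep(row, duration=(120, 30)))      # reversed bounds are swapped
```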

Returns:

Type	Description
`DataFrame \| dict \| list \| None`	A `pandas.DataFrame` when `output_df=True` (empty if nothing matches). When `output_df=False`: a `dict` if `max_return == 1`, otherwise a `list[dict]`, or `None` if the filtered catalog is empty.
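
Because the non-DataFrame return value can be `None`, a `dict`, or a `list[dict]`, callers may want to normalize it before iterating. The sketch below mimics that return shape on already-filtered records and shows one way to normalize it; `records_result` and `as_list` are hypothetical helpers, not part of the library.

```python
def records_result(records: list[dict], max_return: int = 1):
    """Mimic the output_df=False return shape for already-filtered records."""
    if not records:
        return None
    if max_return == 1:
        return records[0]
    return records[:max_return]


def as_list(result) -> list[dict]:
    """Normalize None / dict / list[dict] into a list for uniform handling."""
    if result is None:
        return []
    if isinstance(result, dict):
        return [result]
    return list(result)


one = records_result([{"id": "a"}, {"id": "b"}], max_return=1)
many = records_result([{"id": "a"}, {"id": "b"}], max_return=2)
none = records_result([], max_return=2)

for result in (one, many, none):
    print([r["id"] for r in as_list(result)])
```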

Source code in urbanworm/dataset.py

def getSoundAporee(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        catalog: str | pd.DataFrame = None,
        query: str | list[str] | None = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        duration: int | list | tuple = None,
        exclude_from_location: int = None,
        slice_duration: int = None,
        slice_max_num: int = None,
        probe_durations: bool = True,
        output_df: bool = True,
) -> pd.DataFrame | dict | list | None:
    """Filter a Radio Aporee catalog (CSV or DataFrame) by spatial proximity.

    Aporee (radio aporee ::: maps) does not expose a public geo-query API the
    way Freesound does, so this helper takes a pre-built catalog of geotagged
    Aporee URLs and filters it with the same semantics as
    :func:`_getSoundFreesound`. The resulting DataFrame uses the same column
    names so the downstream ``GeoTaggedData`` / ``download_to_dir`` pipeline
    needs no changes.

    Args:
        location (list | tuple): (lon, lat) of the query point.
        loc_id (int | str, optional): Identifier for the query location.
        distance (int): Search radius in meters.
        catalog (str | pandas.DataFrame): Path to a CSV file or an in-memory
            DataFrame. Required columns: ``url``, ``latitude``, ``longitude``.
            Optional columns recognised by the filters: ``id``/``identifier``,
            ``name``/``title``, ``description``, ``tags``, ``created`` (ISO
            timestamp), ``duration_s``.
        query (str | list[str], optional): Substring(s) matched against
            ``name``/``title`` and ``description`` (case-insensitive). Skipped
            silently if neither column is present.
        tag (str | list[str], optional): Substring(s) matched against ``tags``
            (case-insensitive). Skipped if column is absent.
        max_return (int): Number of nearest sounds to return.
        year, season, time_of_day: Same semantics as :func:`getSound`. Applied
            against the ``created`` column if present.
        duration (int | list[int] | tuple[int]): Filter on ``duration_s`` if
            present. Pass an int for max-only or (min, max) for a range.
        exclude_from_location (int, optional): Drop rows inside this radius
            (m) around the query point — useful for "what's nearby but not
            *at* this exact spot".
        slice_duration (int, optional): Pre-compute clip windows on top of
            the chosen recording's ``duration_s`` (mirrors Freesound path).
        slice_max_num (int, optional): Cap on number of clips per recording.
        probe_durations (bool): If True (default) and ``slice_duration`` is
            requested but the catalog has no ``duration_s`` column, fetch
            each selected recording once with
            :func:`urbanworm.utils.utils.probe_audio_duration` to learn its
            length so slice windows can be computed. Set False to skip
            slicing instead (faster; no per-row download).
        output_df (bool): If True (default) return a ``pandas.DataFrame``.

    Returns:
        ``pandas.DataFrame``, ``dict``, ``list[dict]``, or ``None`` if the
        filtered catalog is empty.
    """
    import os

    from .utils.utils import (
        haversine_m,
        is_coordinate_in_bbox,
        parse_iso_created,
        probe_audio_duration,
        season_months,
        sliced_duration,
        tod_hours,
    )

    # -------------------------
    # Validate inputs
    # -------------------------
    if max_return is None or int(max_return) < 1:
        raise ValueError("max_return must be >= 1.")
    max_return = int(max_return)

    if catalog is None:
        env_path = os.getenv("APOREE_CATALOG")
        if env_path:
            catalog = env_path
        else:
            raise ValueError(
                "source='aporee' requires a catalog (CSV path or DataFrame). "
                "Pass catalog=... or set APOREE_CATALOG env var."
            )

    if isinstance(catalog, str):
        df = pd.read_csv(catalog)
    elif isinstance(catalog, pd.DataFrame):
        df = catalog.copy()
    else:
        raise TypeError(
            "catalog must be a CSV path (str) or a pandas.DataFrame; "
            f"got {type(catalog).__name__}."
        )

    # Accept the alternate column names produced by fetch_aporee_catalog()
    # (which mirrors archive.org's `lat` / `lon` / `date` field names).
    _aliases = {"lat": "latitude", "lon": "longitude", "capture_time": "created"}
    for src, dst in _aliases.items():
        if src in df.columns and dst not in df.columns:
            df = df.rename(columns={src: dst})

    required = {"url", "latitude", "longitude"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(
            f"Aporee catalog is missing required columns: {sorted(missing)}. "
            "At minimum it needs 'url', 'latitude', 'longitude' "
            "(or 'lat'/'lon' which will be renamed)."
        )

    if df.empty:
        return None if not output_df else pd.DataFrame()

    lon, lat = location

    # Coerce coords to float and drop rows that aren't usable.
    df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
    df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce")
    df = df.dropna(subset=["latitude", "longitude", "url"]).copy()
    if df.empty:
        return None if not output_df else pd.DataFrame()

    # -------------------------
    # Spatial filter
    # -------------------------
    df["distance_m"] = df.apply(
        lambda r: haversine_m(lat, lon, float(r["latitude"]), float(r["longitude"])),
        axis=1,
    )
    df = df[df["distance_m"] <= float(distance)]

    if exclude_from_location is not None and not df.empty:
        drop_area = projection(location, r=exclude_from_location)
        mask = df.apply(
            lambda r: not is_coordinate_in_bbox(
                float(r["longitude"]), float(r["latitude"]), drop_area
            ),
            axis=1,
        )
        df = df[mask]

    # -------------------------
    # Text / tag filters
    # -------------------------
    def _as_list(x):
        if x is None:
            return []
        if isinstance(x, (list, tuple)):
            return [str(t).strip().lower() for t in x if str(t).strip()]
        return [str(x).strip().lower()]

    qterms = _as_list(query)
    tterms = _as_list(tag)

    if qterms:
        text_cols = [c for c in ("name", "title", "description") if c in df.columns]
        if text_cols:
            haystack = df[text_cols].astype(str).agg(" ".join, axis=1).str.lower()
            df = df[haystack.apply(lambda s: all(q in s for q in qterms))]

    if tterms and "tags" in df.columns:
        tag_haystack = df["tags"].astype(str).str.lower()
        df = df[tag_haystack.apply(lambda s: all(t in s for t in tterms))]

    # -------------------------
    # Time filters (only if `created` column is present)
    # -------------------------
    if "created" in df.columns and (year is not None or season or time_of_day):
        parsed = df["created"].apply(parse_iso_created)
        if year is not None:
            ys = year if isinstance(year, (list, tuple)) else [year]
            y1 = int(ys[0])
            y2 = int(ys[-1])
            if y2 < y1:
                y1, y2 = y2, y1
            df = df[parsed.apply(lambda dt: dt is not None and y1 <= dt.year <= y2)]
            parsed = parsed[df.index]
        if season:
            months = season_months(season)
            df = df[parsed.apply(lambda dt: dt is not None and dt.month in months)]
            parsed = parsed[df.index]
        if time_of_day:
            hours = tod_hours(time_of_day)
            df = df[parsed.apply(lambda dt: dt is not None and dt.hour in hours)]

    # -------------------------
    # Duration filter (only if `duration_s` column is present)
    # -------------------------
    if duration is not None and "duration_s" in df.columns:
        ds = pd.to_numeric(df["duration_s"], errors="coerce")
        if isinstance(duration, (list, tuple)) and len(duration) == 2:
            dmin, dmax = float(duration[0]), float(duration[1])
            if dmax < dmin:
                dmin, dmax = dmax, dmin
            df = df[(ds >= dmin) & (ds <= dmax)]
        else:
            df = df[ds <= float(duration)]

    if df.empty:
        return None if not output_df else pd.DataFrame()

    # -------------------------
    # Normalize output schema to match Freesound path
    # -------------------------
    df = df.sort_values(by="distance_m", ascending=True).head(max_return).reset_index(drop=True)

    # `id` column: prefer existing, then `identifier`, else fall back to row index.
    if "id" not in df.columns:
        if "identifier" in df.columns:
            df["id"] = df["identifier"]
        else:
            df["id"] = [f"aporee_{i}" for i in range(len(df))]

    # Alias `url` as `preview-hq-mp3` so downstream ``download_to_dir`` works
    # without any branching.
    df["preview-hq-mp3"] = df["url"]

    if loc_id is not None:
        df["loc_id"] = loc_id
    elif "loc_id" not in df.columns:
        df["loc_id"] = ""

    # Optional slice column to mirror Freesound behavior. Aporee catalogs
    # often lack a `duration_s` column because that metadata isn't on the
    # site — probe each selected URL on-demand if requested, or skip
    # slicing with a clear warning.
    if slice_duration is not None:
        if "duration_s" not in df.columns:
            if probe_durations:
                logger.info(
                    "Aporee catalog has no 'duration_s' column; probing %d "
                    "selected recordings to determine clip windows. "
                    "(Pass probe_durations=False to skip.)",
                    len(df),
                )
                # Wrap in a lambda so pandas doesn't see attributes like
                # `.keys()` on a callable (e.g. when probe_audio_duration is
                # patched with a MagicMock in tests) and mistakenly take the
                # dict-like apply codepath.
                df["duration_s"] = df["url"].apply(lambda u: probe_audio_duration(u))
            else:
                logger.warning(
                    "Aporee catalog has no 'duration_s' column and "
                    "probe_durations=False; skipping slice generation. "
                    "Run urbanworm.dataset.enrich_aporee_catalog() once to "
                    "permanently add duration_s to your CSV."
                )

        if "duration_s" in df.columns:
            df["slice"] = df["duration_s"].apply(
                lambda d: sliced_duration(int(d), slice_duration, slice_max_num)
                if pd.notna(d) and float(d) > 0 else [[0, 0]]
            )

    if output_df:
        return df

    records = df.to_dict(orient="records")
    if max_return == 1:
        return records[0] if records else None
    return records
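
The input handling at the top of the listing (falling back to the `APOREE_CATALOG` environment variable, renaming alias columns, and checking for required columns) can be exercised on its own. The following is a pandas-free sketch under assumptions: `resolve_catalog`, `normalize_columns`, and `missing_required` are hypothetical names that operate on plain column-name lists rather than a DataFrame.

```python
import os

ALIASES = {"lat": "latitude", "lon": "longitude", "capture_time": "created"}
REQUIRED = {"url", "latitude", "longitude"}


def resolve_catalog(catalog=None, env=None):
    """Return the explicit catalog, else the APOREE_CATALOG path, else raise."""
    env = os.environ if env is None else env
    if catalog is not None:
        return catalog
    path = env.get("APOREE_CATALOG")
    if path:
        return path
    raise ValueError("Pass catalog=... or set the APOREE_CATALOG env var.")


def normalize_columns(columns: list[str]) -> list[str]:
    """Rename alias columns unless the canonical name is already present."""
    present = set(columns)
    return [ALIASES[c] if c in ALIASES and ALIASES[c] not in present else c
            for c in columns]


def missing_required(columns: list[str]) -> list[str]:
    """List required columns that are absent after alias normalization."""
    return sorted(REQUIRED - set(normalize_columns(columns)))


print(normalize_columns(["url", "lat", "lon", "capture_time"]))
print(missing_required(["url", "lat"]))
print(resolve_catalog(env={"APOREE_CATALOG": "catalog.csv"}))
```

An alias is left untouched when the canonical column already exists, so a catalog carrying both `capture_time` and `created` keeps both.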
