
Data Sources

These modules contain the per-provider helpers used internally by GeoTaggedData. You can also call them directly for lower-level control.


Mapillary (street views)

urbanworm.sources.mapillary

Mapillary street-view source. Thin re-export of :func:urbanworm.dataset.getSV.

Functions

getSV(location, loc_id=None, distance=50, key=None, source='mapillary', pano=False, reoriented=False, multi_num=1, interval=1, fov=80, heading=None, pitch=5, height=500, width=700, year=None, season=None, time_of_day=None, target_polygon=None, fov_margin=0.1, fov_min=30.0, fov_max=120.0, building_height=9.0, output_df=True, silent=False)

getSV

Retrieve the closest street view image(s) near a coordinate. Supports multiple sources; the image is reoriented to face the target coordinate when reoriented=True (Mapillary) or always (Google).

Parameters:

Name Type Description Default
location list | tuple

coordinates (longitude/x and latitude/y)

required
loc_id int | str

The id of the location.

None
distance int

The max distance in meters between the centroid and the street view.

50
key str

API access token for the chosen source. Mapillary — pass token or set env var MAPILLARY_API_KEY. Google — pass token or set env var GOOGLE_STREETVIEW_API_KEY.

None
source str

Street view data source. One of "mapillary" (default) or "google".

'mapillary'
pano bool

Whether to search for panoramic images only. Mapillary only — ignored for Google. (Default is False)

False
reoriented bool

Whether to reorient and crop street view images to face the target. Mapillary only — Google always faces the target. (Default is False)

False
multi_num int

The number of street view images (SVIs) to return. Mapillary only; Google always returns 1. (Default is 1)

1
interval int

The interval in meters between each SVI. Mapillary only. (Default is 1)

1
fov int | float | str

Field of view in degrees for the perspective image (default 80). Pass 'auto' together with reoriented=True to size the FOV per image so the target building is just framed — see target_polygon / fov_margin / fov_min / fov_max. When target_polygon is None, 'auto' falls back to a distance-based heuristic (assumes ~15 m wide building). Mapillary only — for Google, fov is passed directly to the API and clamped to [10, 120]; 'auto' is not supported.

80
heading int

Camera heading in degrees. If None, computed from the bearing to the target location.

None
pitch int

Camera pitch angle. (Default is 5)

5
height int

Height in pixels of the returned image. (Default is 500)

500
width int

Width in pixels of the returned image. (Default is 700)

700
year list[str]

Year of data (start year, end year). Mapillary only — ignored for Google with a warning.

None
season str

Season of data. Mapillary only — ignored for Google with a warning.

None
time_of_day str

Time of day of the data. Mapillary only; ignored for Google with a warning.

None
target_polygon Polygon

Building footprint used by fov='auto' to compute the angular extent of the target. Coordinates are assumed to be (lon, lat) in WGS84. Mapillary only.

None
fov_margin float

Fractional padding added to the auto-computed FOV (0.10 = +10%). Default 0.10. Mapillary only.

0.1
fov_min float

Lower clamp for fov='auto' (degrees). Default 30°. Mapillary only.

30.0
fov_max float

Upper clamp for fov='auto' (degrees). Default 120°. Mapillary only.

120.0
building_height float

Assumed building height in meters used by fov='auto' (default 9 m, ~3 stories). Mapillary only.

9.0
output_df bool

Whether to also return a DataFrame of metadata. (Default is True)

True
silent bool

Whether to silence warnings. (Default is False)

False
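The fov='auto' fallback described above (no target_polygon, a building assumed to be about 15 m wide) can be pictured with a little trigonometry. The sketch below is an illustration only: the function name auto_fov_sketch and the exact formula are assumptions, not the library's implementation. It computes the angle subtended by the assumed width at a given camera distance, pads it by fov_margin, and clamps the result to [fov_min, fov_max].

```python
import math


def auto_fov_sketch(distance_m, building_width_m=15.0,
                    fov_margin=0.10, fov_min=30.0, fov_max=120.0):
    """Hypothetical distance-based FOV heuristic (not urbanworm's code).

    Returns a field of view in degrees that would roughly frame a facade
    of ``building_width_m`` seen head-on from ``distance_m`` away.
    """
    if distance_m <= 0:
        # Camera on top of the target: use the widest allowed view.
        return fov_max
    # Angle subtended by the facade, then padded by the fractional margin.
    half_angle = math.atan((building_width_m / 2.0) / distance_m)
    fov = math.degrees(2.0 * half_angle) * (1.0 + fov_margin)
    # Clamp to the configured bounds.
    return max(fov_min, min(fov_max, fov))


print(auto_fov_sketch(7.5))    # close camera: wide view
print(auto_fov_sketch(100.0))  # distant camera: hits the lower clamp
```

Under these assumptions a distant camera always lands on fov_min, which is why the lower clamp matters for far-away images.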

Returns:

Name Type Description
list[str] DataFrame | list | None

A list of images in base64 format.

DataFrame DataFrame | list | None

A dataframe containing metadata about the street view images. captured_at format is "YYYY-M-D-H" for Mapillary and "YYYY-MM-1-1" for Google (day and hour are nominal placeholders).
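Because captured_at is a dash-separated string rather than a timestamp, downstream code usually needs to parse it. A minimal sketch, assuming the two formats stated above; the helper name parse_captured_at is hypothetical and not part of urbanworm:

```python
from datetime import datetime


def parse_captured_at(value: str) -> datetime:
    """Parse 'YYYY-M-D-H' (Mapillary) or 'YYYY-MM-1-1' (Google).

    For Google rows the day and hour are nominal placeholders, so only
    the year and month of the result are meaningful.
    """
    year, month, day, hour = (int(part) for part in value.split("-"))
    return datetime(year, month, day, hour)


print(parse_captured_at("2021-7-4-15"))
print(parse_captured_at("2021-07-1-1"))
```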

Source code in urbanworm/dataset.py
def getSV(location: list|tuple,
          loc_id: int | str = None,
          distance:int = 50,
          key: str = None,
          source: str = "mapillary",
          pano: bool = False,
          reoriented: bool = False,
          multi_num: int = 1,
          interval: int = 1,
          fov: int | float | str = 80, heading: int = None, pitch: int = 5,
          height: int = 500, width: int = 700,
          year: list | tuple = None,
          season: str = None,
          time_of_day: str = None,
          target_polygon=None,
          fov_margin: float = 0.10,
          fov_min: float = 30.0,
          fov_max: float = 120.0,
          building_height: float = 9.0,
          output_df: bool = True,
          silent: bool = False) -> pd.DataFrame | list | None:
    """
        getSV

        Retrieve the closest street view image(s) near a coordinate.
        Supports multiple sources; the image is reoriented to face the target
        coordinate when ``reoriented=True`` (Mapillary) or always (Google).

        Args:
            location: coordinates (longitude/x and latitude/y)
            loc_id (int|str, optional): The id of the location.
            distance (int): The max distance in meters between the centroid and the street view.
            key (str): API access token for the chosen source.
                Mapillary — pass token or set env var ``MAPILLARY_API_KEY``.
                Google    — pass token or set env var ``GOOGLE_STREETVIEW_API_KEY``.
            source (str): Street view data source. One of ``"mapillary"`` (default)
                or ``"google"``.
            pano (bool): Whether to search for panoramic images only.
                Mapillary only — ignored for Google. (Default is False)
            reoriented (bool): Whether to reorient and crop street view images to face
                the target. Mapillary only — Google always faces the target.
                (Default is False)
            multi_num (int): The number of street view images (SVIs) to return.
                Mapillary only; Google always returns 1. (Default is 1)
            interval (int): The interval in meters between each SVI.
                Mapillary only. (Default is 1)
            fov (int | float | str): Field of view in degrees for the perspective image
                (default 80). Pass ``'auto'`` together with ``reoriented=True`` to
                size the FOV per image so the target building is just framed —
                see ``target_polygon`` / ``fov_margin`` / ``fov_min`` / ``fov_max``.
                When ``target_polygon`` is None, ``'auto'`` falls back to a
                distance-based heuristic (assumes ~15 m wide building).
                Mapillary only — for Google, ``fov`` is passed directly to the API
                and clamped to [10, 120]; ``'auto'`` is not supported.
            heading (int): Camera heading in degrees. If None, computed from the
                bearing to the target location.
            pitch (int): Camera pitch angle. (Default is 5)
            height (int): Height in pixels of the returned image. (Default is 500)
            width (int): Width in pixels of the returned image. (Default is 700)
            year (list[str], optional): Year of data (start year, end year).
                Mapillary only — ignored for Google with a warning.
            season (str, optional): Season of data.
                Mapillary only — ignored for Google with a warning.
            time_of_day (str, optional): Time of day of the data.
                Mapillary only; ignored for Google with a warning.
            target_polygon (shapely.geometry.Polygon, optional): Building footprint
                used by ``fov='auto'`` to compute the angular extent of the target.
                Coordinates are assumed to be ``(lon, lat)`` in WGS84.
                Mapillary only.
            fov_margin (float): Fractional padding added to the auto-computed
                FOV (0.10 = +10%). Default 0.10. Mapillary only.
            fov_min (float): Lower clamp for ``fov='auto'`` (degrees). Default 30°.
                Mapillary only.
            fov_max (float): Upper clamp for ``fov='auto'`` (degrees). Default 120°.
                Mapillary only.
            building_height (float): Assumed building height in meters used by
                ``fov='auto'`` (default 9 m, ~3 stories). Mapillary only.
            output_df (bool, optional): Whether to also return a DataFrame of metadata.
                (Default is True)
            silent (bool, optional): Whether to silence warnings. (Default is False)

        Returns:
            list[str]: A list of images in base64 format.
            DataFrame: A dataframe containing metadata about the street view images.
                ``captured_at`` format is ``"YYYY-M-D-H"`` for Mapillary and
                ``"YYYY-MM-1-1"`` for Google (day and hour are nominal placeholders).
    """
    source = source.lower().strip()

    if source == "google":
        # Warn about params that Google does not support.
        # warnings.warn() deduplicates by call-site, so each message appears
        # only once even when getSV() is called in a loop (e.g. from
        # get_svi_from_locations), unlike logger.warning() which fires every time.
        if multi_num > 1:
            warnings.warn(
                "getSV: multi_num > 1 is not supported for source='google'; using 1.",
                stacklevel=2,
            )
        if any([year, season, time_of_day]):
            warnings.warn(
                "getSV: year/season/time_of_day filtering is not supported for "
                "source='google' (API does not expose historical imagery). "
                "These parameters will be ignored.",
                stacklevel=2,
            )
        if isinstance(fov, str) and fov.strip().lower() == "auto":
            warnings.warn(
                "getSV: fov='auto' is not supported for source='google'. "
                "Falling back to fov=80.",
                stacklevel=2,
            )
            fov = 80
        return _getSV_google(
            location=location, loc_id=loc_id, distance=distance, key=key,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            output_df=output_df, silent=silent,
        )

    if source == "mapillary":
        return _getSV_mapillary(
            location=location, loc_id=loc_id, distance=distance, key=key,
            pano=pano, reoriented=reoriented, multi_num=multi_num, interval=interval,
            fov=fov, heading=heading, pitch=pitch, height=height, width=width,
            year=year, season=season, time_of_day=time_of_day,
            target_polygon=target_polygon, fov_margin=fov_margin,
            fov_min=fov_min, fov_max=fov_max, building_height=building_height,
            output_df=output_df, silent=silent,
        )

    raise ValueError(
        f"getSV: unknown source '{source}'. Choose 'mapillary' or 'google'."
    )
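The comment in the Google branch relies on how Python's warnings module deduplicates: under the "default" filter action, a warning with the same text, category, and reported location is shown only once. The standalone sketch below demonstrates that behaviour with a stand-in function (fetch_one is hypothetical, not an urbanworm function); stacklevel=2 attributes the warning to the caller's line, so a loop calling from one line produces a single record.

```python
import warnings


def fetch_one(multi_num=1):
    """Stand-in for a getSV-like call that warns about an unsupported option."""
    if multi_num > 1:
        warnings.warn("multi_num > 1 is not supported; using 1.", stacklevel=2)
    return 1


def count_warnings(n_calls):
    """Call fetch_one n_calls times from one call site and count warnings."""
    with warnings.catch_warnings(record=True) as caught:
        # "default" shows each unique (message, category, location) once.
        warnings.simplefilter("default")
        for _ in range(n_calls):
            fetch_one(multi_num=3)
    return len(caught)


print(count_warnings(5))
```

If the application installs an "always" filter instead, the warning is recorded on every call, so the once-per-call-site behaviour depends on the active filter configuration.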

Flickr (photos)

urbanworm.sources.flickr

Flickr photo source. Thin re-export of :func:urbanworm.dataset.getPhoto.

Functions

getPhoto(location, loc_id=None, distance=50, key=None, query=None, geo_context=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, exclude_from_location=None, output_df=True)

getPhoto

Fetch public Flickr photos with geotags near a location (or within a Flickr place).

Parameters:

Name Type Description Default
location list | tuple

Coordinates (longitude, latitude) of the location around which to search for geotagged photos.

required
loc_id int | str

The id of the location.

None
distance int

Search radius in meters (converted to km; Flickr radius max is 32 km).

50
key str

Flickr API key. If None, reads env var FLICKR_API_KEY.

None
query str | list[str]

Free-text search terms passed to the Flickr API.

None
geo_context int

Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)

None
tag str | list[str]

Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.

None
max_return int

Number of photos to return (after filters).

1
year list | tuple

[Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.

None
season str

One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).

None
time_of_day str

One of {"morning","afternoon","evening","night"} (post-filter by taken hour).

None
exclude_from_location int

Drop retrieved photos that lie within this distance (in meters) of the given location. (Default is None)

None
output_df bool

If True, return a pandas.DataFrame; otherwise return dict (if max_return==1) or list[dict].

True

Returns:

Type Description

dict | list[dict] | pandas.DataFrame
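Two of the numeric conversions inside getPhoto are easy to miss from the parameter table: distance is converted from meters to kilometers and clamped to Flickr's 32 km limit (with a 0.01 km floor), and the page size is scaled from max_return but capped at 250. The sketch below mirrors the expressions in the source listing that follows; the wrapper function flickr_search_window is a hypothetical name used only for illustration.

```python
def flickr_search_window(distance_m: float, max_return: int) -> tuple[float, int]:
    """Return (radius_km, per_page) as getPhoto derives them."""
    # Meters to kilometers, floored at 10 m and capped at Flickr's 32 km.
    radius_km = min(max(float(distance_m) / 1000.0, 0.01), 32.0)
    # Over-fetch 20x so post-filters have candidates; 50 <= per_page <= 250.
    per_page = min(250, max(50, max_return * 20))
    return radius_km, per_page


print(flickr_search_window(50, 1))
print(flickr_search_window(50_000, 20))
```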

Source code in urbanworm/dataset.py
def getPhoto(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        key: str = None,
        query: str | list[str] = None,
        geo_context: int = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        exclude_from_location:int = None,
        output_df: bool = True
):
    """
        getPhoto

        Fetch public Flickr photos with geotags near a location (or within a Flickr place).

        Args:
            location (list|tuple): Coordinates (longitude, latitude) of the location around which to search for geotagged photos.
            loc_id (int | str): The id of the location.
            distance (int): Search radius in meters (converted to km; Flickr radius max is 32 km).
            key (str): Flickr API key. If None, reads env var FLICKR_API_KEY.
            query (str | list[str]): Free-text search terms passed to the Flickr API.
            geo_context (int): Specify whether a geotagged photo was taken indoors or outdoors. 0: Not defined; 1: Indoors; 2: Outdoors. (Default is None)
            tag: Tag string or list of tags (comma-separated). Acts as a "limiting agent" for geo queries.
            max_return: Number of photos to return (after filters).
            year (list | tuple): [Y] or (Y,) or (Y1, Y2) inclusive. Filters by taken date range.
            season (str): One of {"spring","summer","fall","autumn","winter"} (post-filter by taken month).
            time_of_day (str): One of {"morning","afternoon","evening","night"} (post-filter by taken hour).
            exclude_from_location (int, optional): Drop retrieved photos that lie within this distance (in meters) of the given location. (Default is None)
            output_df (bool): If True, return a pandas.DataFrame; otherwise return dict (if max_return==1)
                       or list[dict].

        Returns:
            dict | list[dict] | pandas.DataFrame
    """

    import os
    from datetime import datetime, timedelta, timezone

    import requests

    if exclude_from_location is not None:
        drop_area = projection(location, r=distance)

    # -------------------------
    # Validate inputs
    # -------------------------
    if max_return is None or int(max_return) < 1:
        raise ValueError("max_return must be >= 1.")
    max_return = int(max_return)

    api_key = key or os.getenv("FLICKR_API_KEY")
    if not api_key:
        raise ValueError("Missing Flickr API key. Pass key=... or set env var FLICKR_API_KEY.")

    lon, lat = location
    months = season_months(season)
    hours = tod_hours(time_of_day)
    y_range = year_range(year)

    # Radius in km (Flickr max 32km)
    radius_km = max(float(distance) / 1000.0, 0.01)
    radius_km = min(radius_km, 32.0)

    # Geo queries need a "limiting agent"; tags or min/max dates qualify.
    # If user provided none, default to last 365 days so results aren’t silently limited to ~12 hours.
    now_utc = datetime.now(timezone.utc)
    default_min_upload_date = int((now_utc - timedelta(days=365)).timestamp())

    # -------------------------
    # Build Flickr request
    # -------------------------
    endpoint = "https://api.flickr.com/services/rest/"

    extras = ",".join(
        [
            "description",
            "license",
            "date_upload",
            "date_taken",
            "owner_name",
            "geo",
            "tags",
            "views",
            "media",
            "url_sq",
            "url_t",
            "url_s",
            "url_q",
            "url_m",
            "url_n",
            "url_z",
            "url_c",
            "url_l",
            "url_o",
        ]
    )

    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "format": "json",
        "nojsoncallback": 1,
        "extras": extras,
        "safe_search": 1, # safe only for un-authed calls
        "media": "photos",
        "has_geo": 1,
        "content_types": 0, # photos
        "sort": "relevance",
        "lat": lat,
        "lon": lon,
        "radius": radius_km,
        "radius_units": "km"
    }

    if query:
        q = query_string(query)
        if q:
            params["text"] = q

    if geo_context:
        params["geo_context"] = geo_context

    # tags
    if tag:
        if isinstance(tag, (list, tuple)):
            tags = ",".join([str(t).strip() for t in tag if str(t).strip()])
            params["tags"] = tags
            params["tag_mode"] = "all"
        else:
            params["tags"] = str(tag).strip()

    # date range (taken) if specified
    if y_range is not None:
        params["min_taken_date"], params["max_taken_date"] = y_range
    else:
        # If no explicit limiting agent, set min_upload_date (acts as limiting agent for geo queries).
        if not tag and season is None and time_of_day is None:
            params["min_upload_date"] = default_min_upload_date

    # -------------------------
    # Fetch + post-filter
    # -------------------------
    session = requests.Session()

    # Geo/bbox queries only return up to 250/page.
    per_page = min(250, max(50, max_return * 20))
    params["per_page"] = per_page

    results = []
    seen = set()

    max_pages = 150
    for page in range(1, max_pages + 1):
        params["page"] = page
        r = session.get(endpoint, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()

        if data.get("stat") != "ok":
            msg = data.get("message") or data.get("error") or str(data)
            raise RuntimeError(f"Flickr API error: {msg}")

        photos = (data.get("photos") or {}).get("photo") or []
        if not photos:
            break

        for p in photos:
            if exclude_from_location is not None:
                if is_coordinate_in_bbox(p["longitude"], p["latitude"], drop_area):
                    continue
            pid = p.get("id")
            if not pid or pid in seen:
                continue
            seen.add(pid)

            taken_dt = parse_taken(p)
            if months and taken_dt and taken_dt.month not in months:
                continue
            if hours and taken_dt and taken_dt.hour not in hours:
                continue

            s_lat = float(p["latitude"]) if "latitude" in p and p["latitude"] not in (None, "") else None
            s_lon = float(p["longitude"]) if "longitude" in p and p["longitude"] not in (None, "") else None

            url = best_url(p)
            out = {
                "loc_id": '',
                "id": pid,
                "title": p.get("title"),
                "owner": p.get("owner"),
                # "ownername": p.get("ownername"),
                "datetaken": p.get("datetaken") or p.get("date_taken"),
                "latitude": s_lat,
                "longitude": s_lon,
                # "accuracy": int(p["accuracy"]) if "accuracy" in p and str(p["accuracy"]).isdigit() else None,
                "distance_m": haversine_m(lat, lon, s_lat, s_lon) if (s_lat is not None and s_lon is not None) else None,
                "tags": p.get("tags"),
                "description": p.get("description"),
                "views": int(p["views"]) if "views" in p and str(p["views"]).isdigit() else None,
                "license": p.get("license"),
                "url": url,
                # "page_url": f"https://www.flickr.com/photos/{p.get('owner')}/{pid}",
            }

            if loc_id is not None:
                out["loc_id"] = loc_id
            else:
                del out["loc_id"]

            results.append(out)

            # if len(results) >= max_return:
            #     break

        if len(results) >= max_return:
            break

    if output_df:
        import pandas as pd
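        df = pd.DataFrame(results)
        # An empty result has no 'distance_m' column, so sorting would raise KeyError.
        if df.empty:
            return df
        df = df.sort_values(by='distance_m', ascending=True)
        return df.head(max_return)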

    if max_return == 1:
        return results[0] if results else None
    return results
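The distance_m field above comes from a helper, haversine_m, whose body is not shown on this page. As a rough guide to what such a helper computes, here is a minimal great-circle distance sketch. The name haversine_m_sketch, the mean Earth radius of 6,371,000 m, and the argument order (copied from the call site) are assumptions; the library's own helper may differ.

```python
import math


def haversine_m_sketch(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in meters between two WGS84 points (sketch)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length.
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2)
    return 2.0 * radius_m * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
print(haversine_m_sketch(0.0, 0.0, 0.0, 1.0))
```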

Freesound (audio)

urbanworm.sources.freesound

Freesound audio source. Thin re-export of :func:urbanworm.dataset.getSound.

Functions

getSound(location, loc_id=None, distance=50, source='freesound', key=None, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=300, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, output_df=True)

Dispatch to the per-source helpers.

Parameters:

Name Type Description Default
source str

one of {"freesound", "aporee"}. Default "freesound".

'freesound'
catalog str | DataFrame

required when source="aporee" — see :func:getSoundAporee.

None
probe_durations bool

Aporee-only. See :func:getSoundAporee.

True

All other arguments are forwarded; key is only used by Freesound, catalog and probe_durations only by Aporee.

Source code in urbanworm/dataset.py
def getSound(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        source: str = 'freesound',
        key: str = None,
        catalog: str | pd.DataFrame = None,
        query: str | list[str] | None = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        duration: int = 300,
        exclude_from_location: int = None,
        slice_duration: int = None,
        slice_max_num: int = None,
        probe_durations: bool = True,
        output_df: bool = True,
) -> pd.DataFrame | dict | list | None:
    """Dispatch to the per-source helpers.

    Args:
        source (str): one of {"freesound", "aporee"}. Default "freesound".
        catalog: required when source="aporee" — see :func:`getSoundAporee`.
        probe_durations: Aporee-only. See :func:`getSoundAporee`.

    All other arguments are forwarded; ``key`` is only used by Freesound,
    ``catalog`` and ``probe_durations`` only by Aporee.
    """
    src = (source or 'freesound').lower()
    if src == 'freesound':
        return _getSoundFreesound(
            location=location, loc_id=loc_id, distance=distance, key=key,
            query=query, tag=tag, max_return=max_return, year=year,
            season=season, time_of_day=time_of_day, duration=duration,
            exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            output_df=output_df,
        )
    elif src == 'aporee':
        return getSoundAporee(
            location=location, loc_id=loc_id, distance=distance,
            catalog=catalog, query=query, tag=tag, max_return=max_return,
            year=year, season=season, time_of_day=time_of_day,
            duration=duration, exclude_from_location=exclude_from_location,
            slice_duration=slice_duration, slice_max_num=slice_max_num,
            probe_durations=probe_durations,
            output_df=output_df,
        )
    else:
        raise ValueError(
            f"Unsupported sound source {source!r}; choose 'freesound' or 'aporee'."
        )

Radio Aporee (audio)

urbanworm.sources.aporee

Radio Aporee audio source.

Re-exports the helpers that live in :mod:urbanworm.dataset:

  • :func:getSoundAporee — filter a catalog by spatial proximity
  • :func:fetch_aporee_catalog — fetch the catalog from Internet Archive
  • :func:enrich_aporee_catalog — probe URLs for duration_s

Functions

enrich_aporee_catalog(catalog, out_path=None, min_duration=None, skip_existing=True, timeout=60.0)

Add a duration_s column to an Aporee catalog by probing each URL.

Aporee URLs don't carry duration metadata, so this helper downloads each file once, reads its length with pydub (or mutagen as a fallback), and annotates the catalog. Optionally drops rows shorter than min_duration.

Use this once after building / updating your catalog so that subsequent :func:getSoundAporee calls with slice_duration can compute clip windows without paying the per-row probe cost every time.

Parameters:

Name Type Description Default
catalog str | DataFrame

CSV path or in-memory DataFrame. Must have a url column.

required
out_path str

If provided, writes the enriched DataFrame back to this CSV path.

None
min_duration float

Drop rows shorter than this many seconds (after probing). None keeps all rows.

None
skip_existing bool

If True (default) and duration_s is already populated for a row, leave it alone. Set False to re-probe every row.

True
timeout float

Per-URL request timeout (seconds).

60.0

Returns:

Type Description
DataFrame

The enriched pandas.DataFrame.
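The interplay of skip_existing and min_duration can be sketched without pandas or network access. The example below works on a list of dicts and takes the probe as a plain callable; enrich_rows and fake_probe are hypothetical stand-ins, written to follow the same order of operations as the source listing below (probe missing rows first, then drop short or unprobed rows).

```python
def enrich_rows(rows, probe, skip_existing=True, min_duration=None):
    """Annotate rows with 'duration_s' using probe(url) -> seconds or None."""
    out = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's data
        needs_probe = not skip_existing or row.get("duration_s") is None
        url = row.get("url")
        if needs_probe and isinstance(url, str) and url.startswith("http"):
            seconds = probe(url)
            if seconds is not None:
                row["duration_s"] = round(float(seconds), 2)
        out.append(row)
    if min_duration is not None:
        # Rows whose duration is still unknown are dropped as well.
        out = [r for r in out
               if r.get("duration_s") is not None
               and r["duration_s"] >= float(min_duration)]
    return out


def fake_probe(url):
    """Pretend probe: only one URL yields a duration."""
    return {"http://example.org/a.mp3": 12.5}.get(url)


rows = [
    {"url": "http://example.org/a.mp3"},
    {"url": "http://example.org/b.mp3"},
    {"url": "http://example.org/c.mp3", "duration_s": 40.0},
]
print(enrich_rows(rows, fake_probe, min_duration=10))
```

With skip_existing=True the third row is never probed, which is the saving the docstring refers to when a catalog is enriched once and reused.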

Source code in urbanworm/dataset.py
def enrich_aporee_catalog(
        catalog: str | pd.DataFrame,
        out_path: str | None = None,
        min_duration: float | None = None,
        skip_existing: bool = True,
        timeout: float = 60.0,
) -> pd.DataFrame:
    """Add a ``duration_s`` column to an Aporee catalog by probing each URL.

    Aporee URLs don't carry duration metadata, so this helper downloads each
    file once, reads its length with pydub (or mutagen as a fallback), and
    annotates the catalog. Optionally drops rows shorter than
    ``min_duration``.

    Use this once after building / updating your catalog so that subsequent
    :func:`getSoundAporee` calls with ``slice_duration`` can compute clip
    windows without paying the per-row probe cost every time.

    Args:
        catalog (str | pandas.DataFrame): CSV path or in-memory DataFrame.
            Must have a ``url`` column.
        out_path (str, optional): If provided, writes the enriched DataFrame
            back to this CSV path.
        min_duration (float, optional): Drop rows shorter than this many
            seconds (after probing). ``None`` keeps all rows.
        skip_existing (bool): If ``True`` (default) and ``duration_s`` is
            already populated for a row, leave it alone. Set ``False`` to
            re-probe every row.
        timeout (float): Per-URL request timeout (seconds).

    Returns:
        The enriched ``pandas.DataFrame``.
    """
    from .utils.utils import probe_audio_duration

    if isinstance(catalog, str):
        df = pd.read_csv(catalog)
    elif isinstance(catalog, pd.DataFrame):
        df = catalog.copy()
    else:
        raise TypeError(
            "catalog must be a CSV path (str) or a pandas.DataFrame; "
            f"got {type(catalog).__name__}."
        )

    if "url" not in df.columns:
        raise ValueError("Aporee catalog must have a 'url' column.")
    if "duration_s" not in df.columns:
        df["duration_s"] = pd.NA

    needs_probe = df.index if not skip_existing else df.index[df["duration_s"].isna()]
    logger.info(
        "enrich_aporee_catalog: probing %d / %d rows", len(needs_probe), len(df),
    )

    for i in tqdm(needs_probe, desc="probing", ncols=75):
        url = df.at[i, "url"]
        if not isinstance(url, str) or not url.startswith("http"):
            continue
        d = probe_audio_duration(url, timeout=timeout)
        if d is not None:
            df.at[i, "duration_s"] = round(float(d), 2)

    if min_duration is not None:
        before = len(df)
        df = df[
            df["duration_s"].notna()
            & (pd.to_numeric(df["duration_s"], errors="coerce") >= float(min_duration))
        ].reset_index(drop=True)
        logger.info(
            "enrich_aporee_catalog: dropped %d rows shorter than %ss",
            before - len(df), min_duration,
        )

    if out_path is not None:
        df.to_csv(out_path, index=False)
        logger.info("enrich_aporee_catalog: wrote %d rows to %s", len(df), out_path)

    return df

fetch_aporee_catalog(bbox=None, year=None, hour=None, season=None, southern=False, rows=0, verify_urls=False, out_path=None, enrich_durations=False, min_duration=None, timeout=60.0, page_size=500)

Fetch the Aporee sound-map catalog from Internet Archive.

All Aporee field recordings are mirrored on archive.org under the radio-aporee-maps collection. This helper queries IA's Scrape API with optional server-side bbox / year filters and applies hour / season filters client-side, then returns a DataFrame in the schema :func:getSoundAporee expects.

Parameters:

Name Type Description Default
bbox tuple[float, float, float, float] | list | None

(lat_min, lon_min, lat_max, lon_max) to filter server-side. Pass None for the whole world.

None
year int | tuple[int, int] | list | None

Single year (2021) or inclusive range ((2018, 2022)). Filtered server-side via IA's date field.

None
hour int | tuple[int, int] | list | None

UTC hour or inclusive range ((9, 17) or (22, 4) for midnight-wrap). Applied client-side against capture_time.

None
season str | list[str] | None

One of "spring" | "summer" | "autumn"/"fall" | "winter", or a list. Hemisphere is auto-detected from each row's latitude; pass southern=True to force southern interpretation.

None
southern bool

Force southern-hemisphere season interpretation.

False
rows int

Maximum number of records to fetch. 0 means all.

0
verify_urls bool

If True, query IA's metadata API for each identifier to find the exact mp3 filename. Slow but accurate. Default False uses the <identifier>.mp3 fallback (works for the vast majority of items).

False
out_path str

If provided, write the resulting DataFrame to this CSV path.

None
enrich_durations bool

If True, also probe each fetched URL for its duration via :func:enrich_aporee_catalog (slow — one request per row).

False
min_duration float

When enrich_durations=True, drop rows shorter than this many seconds.

None
timeout float

Per-request HTTP timeout (seconds).

60.0
page_size int

Records per Scrape-API page (min 100).

500
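The hour parameter accepts a range that may wrap past midnight, such as (22, 4). A minimal sketch of how an inclusive, wrap-aware hour test can be written; hour_matches is a hypothetical helper, not the function used inside fetch_aporee_catalog:

```python
def hour_matches(hour, spec):
    """Return True if ``hour`` (0-23) satisfies ``spec``.

    ``spec`` is a single hour or an inclusive (start, end) pair; when
    start > end the range wraps around midnight.
    """
    if isinstance(spec, int):
        return hour == spec
    start, end = spec
    if start <= end:
        return start <= hour <= end
    # Wrapped range, e.g. (22, 4) covers 22, 23, 0, 1, 2, 3, 4.
    return hour >= start or hour <= end


print([h for h in range(24) if hour_matches(h, (22, 4))])
```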

Returns:

Type Description
DataFrame

pandas.DataFrame with columns: identifier, id, latitude, longitude, url, capture_time, created, year, month, hour, season, title, name, description, tags, licence, duration_s. id aliases identifier and name aliases title for compatibility with :func:getSoundAporee's filters.
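The season column depends on hemisphere, which the function infers from each row's latitude unless southern=True is passed. The sketch below shows one way such a mapping can work. It assumes meteorological seasons (December to February is northern winter) and uses a hypothetical helper name, season_for; the library's exact month boundaries are not stated on this page.

```python
# Northern-hemisphere meteorological seasons (an assumption of this sketch).
_NORTHERN = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
_FLIP = {"winter": "summer", "summer": "winter",
         "spring": "autumn", "autumn": "spring"}


def season_for(month, latitude, southern=False):
    """Season name for a month, flipped for southern latitudes."""
    season = _NORTHERN[month]
    if southern or latitude < 0:
        season = _FLIP[season]
    return season


print(season_for(7, 52.5))    # Berlin in July
print(season_for(7, -33.9))   # Sydney in July
```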

Source code in urbanworm/dataset.py
def fetch_aporee_catalog(
        bbox: tuple[float, float, float, float] | list | None = None,
        year: int | tuple[int, int] | list | None = None,
        hour: int | tuple[int, int] | list | None = None,
        season: str | list[str] | None = None,
        southern: bool = False,
        rows: int = 0,
        verify_urls: bool = False,
        out_path: str | None = None,
        enrich_durations: bool = False,
        min_duration: float | None = None,
        timeout: float = 60.0,
        page_size: int = 500,
) -> pd.DataFrame:
    """Fetch the Aporee sound-map catalog from Internet Archive.

    All Aporee field recordings are mirrored on archive.org under the
    ``radio-aporee-maps`` collection. This helper queries IA's Scrape API
    with optional server-side ``bbox`` / ``year`` filters and applies
    ``hour`` / ``season`` filters client-side, then returns a DataFrame in
    the schema :func:`getSoundAporee` expects.

    Args:
        bbox: ``(lat_min, lon_min, lat_max, lon_max)`` to filter server-side.
            Pass ``None`` for the whole world.
        year: Single year (``2021``) or inclusive range (``(2018, 2022)``).
            Filtered server-side via IA's ``date`` field.
        hour: UTC hour or inclusive range (``(9, 17)`` or ``(22, 4)`` for
            midnight-wrap). Applied client-side against ``capture_time``.
        season: One of ``"spring" | "summer" | "autumn"/"fall" | "winter"``,
            or a list. Hemisphere is auto-detected from each row's latitude;
            pass ``southern=True`` to force southern interpretation.
        southern (bool): Force southern-hemisphere season interpretation.
        rows (int): Maximum number of records to fetch. ``0`` means all.
        verify_urls (bool): If True, query IA's metadata API for each
            identifier to find the exact mp3 filename. Slow but accurate.
            Default False uses the ``<identifier>.mp3`` fallback (works
            for the vast majority of items).
        out_path (str, optional): If provided, write the resulting DataFrame
            to this CSV path.
        enrich_durations (bool): If True, also probe each fetched URL for
            its duration via :func:`enrich_aporee_catalog` (slow — one
            request per row).
        min_duration (float, optional): When ``enrich_durations=True``,
            drop rows shorter than this many seconds.
        timeout (float): Per-request HTTP timeout (seconds).
        page_size (int): Records per Scrape-API page (min 100).

    Returns:
        ``pandas.DataFrame`` with columns:
        ``identifier, id, latitude, longitude, url, capture_time, created,
        year, month, hour, season, title, name, description, tags, licence,
        duration_s``. ``id`` aliases ``identifier`` and ``name`` aliases
        ``title`` for compatibility with :func:`getSoundAporee`'s filters.
    """
    import requests

    # Build query
    query = f"collection:{_APOREE_COLLECTION}"
    whole_world = (-90.0, -180.0, 90.0, 180.0)
    bbox_t = tuple(bbox) if bbox is not None else whole_world
    if bbox_t != whole_world:
        lat_min, lon_min, lat_max, lon_max = bbox_t
        query += (
            f" AND lat:[{lat_min:g} TO {lat_max:g}]"
            f" AND lon:[{lon_min:g} TO {lon_max:g}]"
        )
    if year is not None:
        if isinstance(year, (list, tuple)):
            y1, y2 = int(year[0]), int(year[-1])
            if y2 < y1:
                y1, y2 = y2, y1
        else:
            y1 = y2 = int(year)
        query += f" AND date:[{y1}-01-01T00:00:00Z TO {y2}-12-31T23:59:59Z]"

    # Normalize hour filter
    hour_range: tuple[int, int] | None = None
    if hour is not None:
        if isinstance(hour, (list, tuple)):
            hour_range = (int(hour[0]), int(hour[-1]))
        else:
            hour_range = (int(hour), int(hour))

    # Normalize season filter to a set of months
    season_set: set[int] | None = None
    if season is not None:
        from .utils.utils import season_months as _sm
        names = season if isinstance(season, (list, tuple)) else [season]
        season_set = set()
        for s in names:
            season_set |= _sm(s)

    logger.info("fetch_aporee_catalog: query=%s", query)
    headers = {"User-Agent": "urban-worm/0.x (+aporee fetcher)"}
    page_size = max(100, int(page_size))

    items: list[dict] = []
    cursor: str | None = None
    fetched = 0
    skip_no_geo = skip_hour = skip_season = 0

    pbar = tqdm(desc="fetch aporee", unit="rec", disable=False)
    while True:
        if rows and fetched >= rows:
            break
        page_n = page_size if not rows else min(page_size, rows - fetched)
        params = {
            "q": query,
            "fields": ",".join(_IA_FIELDS),
            "count": max(100, page_n),
        }
        if cursor:
            params["cursor"] = cursor

        r = requests.get(_IA_SCRAPE, params=params, headers=headers, timeout=timeout)
        r.raise_for_status()
        data = r.json()
        if "items" not in data:
            raise RuntimeError(f"IA scrape API error: {data}")

        docs = data["items"]
        next_cursor = data.get("cursor")
        if not docs:
            break

        for doc in docs:
            try:
                lat_v = float(doc.get("latitude") or "")
                lon_v = float(doc.get("longitude") or "")
            except (ValueError, TypeError):
                skip_no_geo += 1
                continue

            ident = doc.get("identifier", "")
            title = doc.get("title", "")
            ctime = (doc.get("date") or "").strip()
            description = doc.get("description", "")
            licence = doc.get("licenseurl", "")
            subject = doc.get("subject", "")
            # `subject` may come back as a list — collapse to comma-string
            if isinstance(subject, list):
                subject = ",".join(str(s) for s in subject)

            # Client-side hour filter
            if hour_range is not None:
                hh = _ia_extract_hour(ctime)
                if hh is None:
                    skip_hour += 1
                    continue
                h_start, h_end = hour_range
                if h_start <= h_end:
                    matched = h_start <= hh <= h_end
                else:
                    matched = hh >= h_start or hh <= h_end
                if not matched:
                    skip_hour += 1
                    continue

            # Client-side season filter
            if season_set is not None:
                mm = _ia_extract_month(ctime)
                if mm is None:
                    skip_season += 1
                    continue
                row_month = mm
                if southern or lat_v < 0:
                    row_month = ((mm - 1 + 6) % 12) + 1
                if row_month not in season_set:
                    skip_season += 1
                    continue

            url = (
                _ia_verify_mp3_url(ident, timeout=timeout)
                if verify_urls
                else f"{_IA_DOWNLOAD}/{ident}/{ident}.mp3"
            )

            items.append({
                "identifier": ident,
                "id": ident,                        # alias for getSoundAporee
                "latitude": lat_v,
                "longitude": lon_v,
                "url": url,
                "capture_time": ctime,              # script's column name
                "created": ctime,                   # getSoundAporee filter name
                "title": title,
                "name": title,                      # alias for getSoundAporee.query
                "description": description,
                "tags": subject,                    # IA's `subject` -> our tags
                "licence": licence,
                "duration_s": None,
            })
            fetched += 1
            pbar.update(1)
            if rows and fetched >= rows:
                break

        # Cursor is the source of truth for "more pages available" — IA's
        # scrape API can return a partial page mid-stream, so don't bail
        # out just because len(docs) < page_n.
        if not next_cursor:
            break
        cursor = next_cursor

    pbar.close()
    logger.info(
        "fetch_aporee_catalog: kept %d, skipped no_geo=%d hour=%d season=%d",
        len(items), skip_no_geo, skip_hour, skip_season,
    )

    df = pd.DataFrame(items)
    if df.empty:
        if out_path:
            df.to_csv(out_path, index=False)
        return df

    # Enrich with derived time columns (year/month/hour/season) for
    # downstream convenience. ``parse_iso_created`` handles missing
    # fractional-seconds gracefully.
    from .utils.utils import parse_iso_created
    parsed = df["capture_time"].apply(parse_iso_created)
    df["year"] = parsed.apply(lambda d: d.year if d is not None else None)
    df["month"] = parsed.apply(lambda d: d.month if d is not None else None)
    df["hour"] = parsed.apply(lambda d: d.hour if d is not None else None)
    df["season"] = df.apply(
        lambda r: _season_for(r["month"], r["latitude"], southern) if pd.notna(r["month"]) else "",
        axis=1,
    )

    if enrich_durations:
        df = enrich_aporee_catalog(df, min_duration=min_duration, timeout=timeout)

    if out_path:
        df.to_csv(out_path, index=False)
        logger.info("fetch_aporee_catalog: wrote %d rows to %s", len(df), out_path)

    return df
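The two client-side filters in the listing above can be isolated into small functions. This is a minimal sketch: hour_matches and northern_equivalent_month are hypothetical names, and SUMMER_MONTHS assumes the meteorological definition (June to August), which may differ from what the library's season_months helper returns.

```python
def hour_matches(hh, hour_range):
    """Inclusive hour-window test; start > end means the window wraps midnight."""
    h_start, h_end = hour_range
    if h_start <= h_end:
        return h_start <= hh <= h_end
    return hh >= h_start or hh <= h_end


def northern_equivalent_month(month, latitude, southern=False):
    """Shift a month by six for southern-hemisphere rows, so one set of
    northern-hemisphere season months can be reused for both hemispheres."""
    if southern or latitude < 0:
        return ((month - 1 + 6) % 12) + 1
    return month


# Assumed northern-hemisphere summer months (meteorological definition).
SUMMER_MONTHS = {6, 7, 8}

# A January recording in Cape Town (latitude about -33.9) counts as summer.
print(northern_equivalent_month(1, -33.9) in SUMMER_MONTHS)
# A 23:00 recording falls inside the wrapped window (22, 4).
print(hour_matches(23, (22, 4)))
```

Shifting the month instead of keeping a second season table means the hemisphere logic lives in one place.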

getSoundAporee(location, loc_id=None, distance=50, catalog=None, query=None, tag=None, max_return=1, year=None, season=None, time_of_day=None, duration=None, exclude_from_location=None, slice_duration=None, slice_max_num=None, probe_durations=True, output_df=True)

Filter a Radio Aporee catalog (CSV or DataFrame) by spatial proximity.

Aporee (radio aporee ::: maps) does not expose a public geo-query API the way Freesound does, so this helper takes a pre-built catalog of geotagged Aporee URLs and filters it with the same semantics as :func:_getSoundFreesound. The resulting DataFrame uses the same column names so the downstream GeoTaggedData / download_to_dir pipeline needs no changes.

Parameters:

Name Type Description Default
location list | tuple

(lon, lat) of the query point.

required
loc_id int | str

Identifier for the query location.

None
distance int

Search radius in meters.

50
catalog str | DataFrame

Path to a CSV file or an in-memory DataFrame. Required columns: url, latitude, longitude. Optional columns recognised by the filters: id/identifier, name/title, description, tags, created (ISO timestamp), duration_s.

None
query str | list[str]

Substring(s) matched against name/title and description (case-insensitive). Skipped silently if neither column is present.

None
tag str | list[str]

Substring(s) matched against tags (case-insensitive). Skipped if column is absent.

None
max_return int

Number of nearest sounds to return.

1
year, season, time_of_day

Same semantics as :func:getSound. Applied against the created column if present.

None
duration int | list[int] | tuple[int]

Filter on duration_s if present. Pass an int for max-only or (min, max) for a range.

None
exclude_from_location int

Drop rows inside this radius (m) around the query point — useful for "what's nearby but not at this exact spot".

None
slice_duration int

Pre-compute clip windows on top of the chosen recording's duration_s (mirrors Freesound path).

None
slice_max_num int

Cap on number of clips per recording.

None
probe_durations bool

If True (default) and slice_duration is requested but the catalog has no duration_s column, fetch each selected recording once with :func:urbanworm.utils.utils.probe_audio_duration to learn its length so slice windows can be computed. Set False to skip slicing instead (faster; no per-row download).

True
output_df bool

If True (default) return a pandas.DataFrame.

True

Returns:

Type Description
DataFrame | dict | list | None

A pandas.DataFrame when output_df=True (empty if nothing matches the filters). When output_df=False: a dict if max_return is 1, otherwise a list[dict], or None if the filtered catalog is empty.
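The core spatial step (keep rows within distance metres, sort by distance, return the nearest max_return) can be sketched without pandas. great_circle_m is a stand-in for the library's haversine_m helper, whose implementation is not shown here; the Earth radius constant and the sample rows are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumed constant


def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in metres (stand-in implementation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_recordings(location, catalog, distance=50, max_return=1):
    """Return up to `max_return` catalog rows within `distance` metres."""
    lon, lat = location  # note the (lon, lat) order of the query point
    hits = []
    for row in catalog:
        d = great_circle_m(lat, lon, row["latitude"], row["longitude"])
        if d <= distance:
            hits.append({**row, "distance_m": d})
    hits.sort(key=lambda r: r["distance_m"])
    return hits[:max_return]


# Made-up rows carrying the three required columns plus an id.
catalog = [
    {"id": "far", "url": "https://example.org/far.mp3", "latitude": 52.5300, "longitude": 13.4050},
    {"id": "near", "url": "https://example.org/near.mp3", "latitude": 52.5203, "longitude": 13.4050},
    {"id": "here", "url": "https://example.org/here.mp3", "latitude": 52.5200, "longitude": 13.4050},
]
print([r["id"] for r in nearest_recordings((13.4050, 52.5200), catalog, max_return=5)])
```

The query point is given as (lon, lat) while the catalog stores latitude and longitude in separate columns, so the order is worth double-checking when calling the real function.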

Source code in urbanworm/dataset.py
def getSoundAporee(
        location: list | tuple,
        loc_id: int | str = None,
        distance: int = 50,
        catalog: str | pd.DataFrame = None,
        query: str | list[str] | None = None,
        tag: str | list[str] = None,
        max_return: int = 1,
        year: list | tuple = None,
        season: str = None,
        time_of_day: str = None,
        duration: int | list | tuple = None,
        exclude_from_location: int = None,
        slice_duration: int = None,
        slice_max_num: int = None,
        probe_durations: bool = True,
        output_df: bool = True,
) -> pd.DataFrame | dict | list | None:
    """Filter a Radio Aporee catalog (CSV or DataFrame) by spatial proximity.

    Aporee (radio aporee ::: maps) does not expose a public geo-query API the
    way Freesound does, so this helper takes a pre-built catalog of geotagged
    Aporee URLs and filters it with the same semantics as
    :func:`_getSoundFreesound`. The resulting DataFrame uses the same column
    names so the downstream ``GeoTaggedData`` / ``download_to_dir`` pipeline
    needs no changes.

    Args:
        location (list | tuple): (lon, lat) of the query point.
        loc_id (int | str, optional): Identifier for the query location.
        distance (int): Search radius in meters.
        catalog (str | pandas.DataFrame): Path to a CSV file or an in-memory
            DataFrame. Required columns: ``url``, ``latitude``, ``longitude``.
            Optional columns recognised by the filters: ``id``/``identifier``,
            ``name``/``title``, ``description``, ``tags``, ``created`` (ISO
            timestamp), ``duration_s``.
        query (str | list[str], optional): Substring(s) matched against
            ``name``/``title`` and ``description`` (case-insensitive). Skipped
            silently if neither column is present.
        tag (str | list[str], optional): Substring(s) matched against ``tags``
            (case-insensitive). Skipped if column is absent.
        max_return (int): Number of nearest sounds to return.
        year, season, time_of_day: Same semantics as :func:`getSound`. Applied
            against the ``created`` column if present.
        duration (int | list[int] | tuple[int]): Filter on ``duration_s`` if
            present. Pass an int for max-only or (min, max) for a range.
        exclude_from_location (int, optional): Drop rows inside this radius
            (m) around the query point — useful for "what's nearby but not
            *at* this exact spot".
        slice_duration (int, optional): Pre-compute clip windows on top of
            the chosen recording's ``duration_s`` (mirrors Freesound path).
        slice_max_num (int, optional): Cap on number of clips per recording.
        probe_durations (bool): If True (default) and ``slice_duration`` is
            requested but the catalog has no ``duration_s`` column, fetch
            each selected recording once with
            :func:`urbanworm.utils.utils.probe_audio_duration` to learn its
            length so slice windows can be computed. Set False to skip
            slicing instead (faster; no per-row download).
        output_df (bool): If True (default) return a ``pandas.DataFrame``.

    Returns:
        ``pandas.DataFrame``, ``dict``, ``list[dict]``, or ``None`` if the
        filtered catalog is empty.
    """
    import os

    from .utils.utils import (
        haversine_m,
        is_coordinate_in_bbox,
        parse_iso_created,
        probe_audio_duration,
        season_months,
        sliced_duration,
        tod_hours,
    )

    # -------------------------
    # Validate inputs
    # -------------------------
    if max_return is None or int(max_return) < 1:
        raise ValueError("max_return must be >= 1.")
    max_return = int(max_return)

    if catalog is None:
        env_path = os.getenv("APOREE_CATALOG")
        if env_path:
            catalog = env_path
        else:
            raise ValueError(
                "source='aporee' requires a catalog (CSV path or DataFrame). "
                "Pass catalog=... or set APOREE_CATALOG env var."
            )

    if isinstance(catalog, str):
        df = pd.read_csv(catalog)
    elif isinstance(catalog, pd.DataFrame):
        df = catalog.copy()
    else:
        raise TypeError(
            "catalog must be a CSV path (str) or a pandas.DataFrame; "
            f"got {type(catalog).__name__}."
        )

    # Accept alternate column names: archive.org-style `lat` / `lon`, and the
    # `capture_time` column that fetch_aporee_catalog() writes.
    _aliases = {"lat": "latitude", "lon": "longitude", "capture_time": "created"}
    for src, dst in _aliases.items():
        if src in df.columns and dst not in df.columns:
            df = df.rename(columns={src: dst})

    required = {"url", "latitude", "longitude"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(
            f"Aporee catalog is missing required columns: {sorted(missing)}. "
            "At minimum it needs 'url', 'latitude', 'longitude' "
            "(or 'lat'/'lon' which will be renamed)."
        )

    if df.empty:
        return None if not output_df else pd.DataFrame()

    lon, lat = location

    # Coerce coords to float and drop rows that aren't usable.
    df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
    df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce")
    df = df.dropna(subset=["latitude", "longitude", "url"]).copy()
    if df.empty:
        return None if not output_df else pd.DataFrame()

    # -------------------------
    # Spatial filter
    # -------------------------
    df["distance_m"] = df.apply(
        lambda r: haversine_m(lat, lon, float(r["latitude"]), float(r["longitude"])),
        axis=1,
    )
    df = df[df["distance_m"] <= float(distance)]

    if exclude_from_location is not None and not df.empty:
        drop_area = projection(location, r=exclude_from_location)
        mask = df.apply(
            lambda r: not is_coordinate_in_bbox(
                float(r["longitude"]), float(r["latitude"]), drop_area
            ),
            axis=1,
        )
        df = df[mask]

    # -------------------------
    # Text / tag filters
    # -------------------------
    def _as_list(x):
        if x is None:
            return []
        if isinstance(x, (list, tuple)):
            return [str(t).strip().lower() for t in x if str(t).strip()]
        return [str(x).strip().lower()]

    qterms = _as_list(query)
    tterms = _as_list(tag)

    if qterms:
        text_cols = [c for c in ("name", "title", "description") if c in df.columns]
        if text_cols:
            haystack = df[text_cols].astype(str).agg(" ".join, axis=1).str.lower()
            df = df[haystack.apply(lambda s: all(q in s for q in qterms))]

    if tterms and "tags" in df.columns:
        tag_haystack = df["tags"].astype(str).str.lower()
        df = df[tag_haystack.apply(lambda s: all(t in s for t in tterms))]

    # -------------------------
    # Time filters (only if `created` column is present)
    # -------------------------
    if "created" in df.columns and (year is not None or season or time_of_day):
        parsed = df["created"].apply(parse_iso_created)
        if year is not None:
            ys = year if isinstance(year, (list, tuple)) else [year]
            y1 = int(ys[0])
            y2 = int(ys[-1])
            if y2 < y1:
                y1, y2 = y2, y1
            df = df[parsed.apply(lambda dt: dt is not None and y1 <= dt.year <= y2)]
            parsed = parsed[df.index]
        if season:
            months = season_months(season)
            df = df[parsed.apply(lambda dt: dt is not None and dt.month in months)]
            parsed = parsed[df.index]
        if time_of_day:
            hours = tod_hours(time_of_day)
            df = df[parsed.apply(lambda dt: dt is not None and dt.hour in hours)]

    # -------------------------
    # Duration filter (only if `duration_s` column is present)
    # -------------------------
    if duration is not None and "duration_s" in df.columns:
        ds = pd.to_numeric(df["duration_s"], errors="coerce")
        if isinstance(duration, (list, tuple)) and len(duration) == 2:
            dmin, dmax = float(duration[0]), float(duration[1])
            if dmax < dmin:
                dmin, dmax = dmax, dmin
            df = df[(ds >= dmin) & (ds <= dmax)]
        else:
            df = df[ds <= float(duration)]

    if df.empty:
        return None if not output_df else pd.DataFrame()

    # -------------------------
    # Normalize output schema to match Freesound path
    # -------------------------
    df = df.sort_values(by="distance_m", ascending=True).head(max_return).reset_index(drop=True)

    # `id` column: prefer existing, then `identifier`, else fall back to row index.
    if "id" not in df.columns:
        if "identifier" in df.columns:
            df["id"] = df["identifier"]
        else:
            df["id"] = [f"aporee_{i}" for i in range(len(df))]

    # Alias `url` as `preview-hq-mp3` so downstream ``download_to_dir`` works
    # without any branching.
    df["preview-hq-mp3"] = df["url"]

    if loc_id is not None:
        df["loc_id"] = loc_id
    elif "loc_id" not in df.columns:
        df["loc_id"] = ""

    # Optional slice column to mirror Freesound behavior. Aporee catalogs
    # often lack a `duration_s` column because that metadata isn't on the
    # site — probe each selected URL on-demand if requested, or skip
    # slicing with a clear warning.
    if slice_duration is not None:
        if "duration_s" not in df.columns:
            if probe_durations:
                logger.info(
                    "Aporee catalog has no 'duration_s' column; probing %d "
                    "selected recordings to determine clip windows. "
                    "(Pass probe_durations=False to skip.)",
                    len(df),
                )
                # Wrap in a lambda so pandas doesn't see attributes like
                # `.keys()` on a callable (e.g. when probe_audio_duration is
                # patched with a MagicMock in tests) and mistakenly take the
                # dict-like apply codepath.
                df["duration_s"] = df["url"].apply(lambda u: probe_audio_duration(u))
            else:
                logger.warning(
                    "Aporee catalog has no 'duration_s' column and "
                    "probe_durations=False; skipping slice generation. "
                    "Run urbanworm.dataset.enrich_aporee_catalog() once to "
                    "permanently add duration_s to your CSV."
                )

        if "duration_s" in df.columns:
            df["slice"] = df["duration_s"].apply(
                lambda d: sliced_duration(int(d), slice_duration, slice_max_num)
                if pd.notna(d) and float(d) > 0 else [[0, 0]]
            )

    if output_df:
        return df

    records = df.to_dict(orient="records")
    if max_return == 1:
        return records[0] if records else None
    return records
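Two filter semantics in the listing above are easy to miss: every query term must match (AND, not OR), and duration accepts either a single maximum or a (min, max) pair. The sketch below restates both on a single row represented as a dict; the function names are hypothetical and the sample row is made up.

```python
def normalize_terms(x):
    """Lower-case and strip a str or list of str; None yields an empty list."""
    if x is None:
        return []
    if isinstance(x, (list, tuple)):
        return [str(t).strip().lower() for t in x if str(t).strip()]
    return [str(x).strip().lower()]


def matches_query(row, query, text_cols=("name", "title", "description")):
    """True when every term occurs somewhere in the row's text columns."""
    terms = normalize_terms(query)
    present = [c for c in text_cols if c in row]
    if not terms or not present:
        return True  # the filter is skipped when there is nothing to match on
    haystack = " ".join(str(row[c]) for c in present).lower()
    return all(t in haystack for t in terms)


def duration_ok(duration_s, duration):
    """An int or float means max-only; a 2-item sequence is an inclusive range."""
    if isinstance(duration, (list, tuple)) and len(duration) == 2:
        dmin, dmax = sorted((float(duration[0]), float(duration[1])))
        return dmin <= duration_s <= dmax
    return duration_s <= float(duration)


row = {"title": "Morning Tram", "description": "Birdsong near the depot", "duration_s": 45}
print(matches_query(row, ["tram", "BIRDSONG"]))   # both terms are present
print(matches_query(row, ["tram", "rain"]))       # "rain" is missing
print(duration_ok(row["duration_s"], (30, 60)))   # 45 s lies inside 30-60 s
```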