Implementing DBSCAN for stay-point clustering in Python

Implementing DBSCAN for stay-point clustering in Python requires converting GPS trajectories into radian coordinates, applying density-based spatial clustering with a geographic distance metric, and filtering resulting clusters by minimum temporal duration. The most reliable production approach uses scikit-learn’s DBSCAN with metric='haversine', algorithm='ball_tree', and a post-processing step that validates cluster dwell time against a configurable threshold (typically 5–15 minutes). This pipeline transforms noisy, irregularly sampled pings into semantically stable locations suitable for downstream mobility analytics.

Stay-point detection is a foundational preprocessing stage in Movement Pattern Extraction & Trajectory Analysis, where raw telemetry must be distilled into actionable spatial anchors. Traditional threshold-based methods (fixed radius + fixed time) fail under variable sampling rates, GPS drift, and urban canyon multipath errors. Density-based clustering adapts to local point concentration, making it the preferred choice for modern Stay-Point Detection Algorithms deployed in fleet telematics, ride-hailing routing, and pedestrian flow modeling.

Environment & Compatibility Notes

  • Python: 3.9+ (the first release that ships the standard-library zoneinfo module)
  • Core Libraries: scikit-learn>=1.2, pandas>=1.5, numpy>=1.23; geopandas>=0.12 is optional and only needed if stay points are exported as geometries
  • Coordinate System: DBSCAN with metric='haversine' expects input in radians, not decimal degrees. Earth radius is assumed as 6,371,000 meters.
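  • Memory Scaling: A brute-force pairwise distance matrix scales O(N²). algorithm='ball_tree' avoids building that matrix, but DBSCAN still stores the eps-neighborhood of every point, so memory grows with eps and local point density. The approach is practical for N < 100k. For larger trajectories, chunk by device ID or time windows.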
  • Temporal Handling: Timestamps must be timezone-aware (UTC recommended) to avoid DST-induced duration miscalculations. See the official pandas time-zone handling guide for best practices.
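
To make the radian convention concrete, the short sketch below converts a 100 m threshold into an angle and checks one pair of points with scikit-learn's haversine_distances. The two sample coordinates are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, as assumed above

# A 100 m neighborhood expressed as an angle on the unit sphere
eps_meters = 100.0
eps_rad = eps_meters / EARTH_RADIUS_M  # about 1.57e-5 radians

# Two illustrative points (lat, lon in decimal degrees), 0.0005 degrees of latitude apart
points_deg = np.array([[48.8584, 2.2945],
                       [48.8589, 2.2945]])
points_rad = np.radians(points_deg)

# haversine_distances returns angular distances in radians; scale by the radius for meters
dist_rad = haversine_distances(points_rad)[0, 1]
dist_m = dist_rad * EARTH_RADIUS_M  # roughly 56 m

print(f"eps_rad={eps_rad:.3e}, distance={dist_m:.1f} m, within eps: {dist_rad <= eps_rad}")
```

Both the coordinates and eps have to be in radians for the comparison in the last line to be meaningful.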

Production-Ready Implementation

The following snippet demonstrates a complete pipeline. It accepts a trajectory DataFrame, normalizes timestamps to UTC, converts coordinates to radians, runs DBSCAN per device, filters clusters by dwell time, and returns structured stay points.

PYTHON
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def detect_stay_points(
    trajectory_df: pd.DataFrame,
    eps_meters: float = 100.0,
    min_samples: int = 3,
    min_duration_minutes: float = 5.0
) -> pd.DataFrame:
    """
    Detect stay points from GPS trajectory using DBSCAN + temporal filtering.

    trajectory_df must contain: ['device_id', 'timestamp', 'lat', 'lon']
    """
    df = trajectory_df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
    df = df.sort_values(['device_id', 'timestamp']).reset_index(drop=True)

    # Convert decimal degrees to radians for Haversine metric
    df['lat_rad'] = np.radians(df['lat'])
    df['lon_rad'] = np.radians(df['lon'])

    # Convert spatial threshold to radians (Earth radius in meters)
    EARTH_RADIUS_M = 6371000.0
    eps_rad = eps_meters / EARTH_RADIUS_M

    stay_points = []

    # Process each device independently to prevent cross-trajectory leakage
    for device_id, group in df.groupby('device_id'):
        coords = group[['lat_rad', 'lon_rad']].values

        # Initialize and fit DBSCAN
        # Reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
        db = DBSCAN(
            eps=eps_rad,
            min_samples=min_samples,
            metric='haversine',
            algorithm='ball_tree'
        ).fit(coords)

        group = group.copy()
        group['cluster'] = db.labels_

        # Filter out noise points (-1)
        clusters = group[group['cluster'] != -1]

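        # Duration spans the first to the last ping in the cluster, so repeat
        # visits to the same location are measured as one interval
        for cluster_id, cluster_data in clusters.groupby('cluster'):
            duration_min = (cluster_data['timestamp'].max() -
                            cluster_data['timestamp'].min()).total_seconds() / 60.0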

            if duration_min >= min_duration_minutes:
                # Arithmetic-mean centroid in decimal degrees; adequate for
                # compact clusters that do not straddle the antimeridian
                centroid_lat = np.degrees(np.mean(cluster_data['lat_rad']))
                centroid_lon = np.degrees(np.mean(cluster_data['lon_rad']))

                stay_points.append({
                    'device_id': device_id,
                    'cluster_id': int(cluster_id),
                    'centroid_lat': round(centroid_lat, 6),
                    'centroid_lon': round(centroid_lon, 6),
                    'start_time': cluster_data['timestamp'].min(),
                    'end_time': cluster_data['timestamp'].max(),
                    'duration_minutes': round(duration_min, 2),
                    'point_count': len(cluster_data)
                })

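    # Fix the column order so an empty result still has the expected schema
    columns = ['device_id', 'cluster_id', 'centroid_lat', 'centroid_lon',
               'start_time', 'end_time', 'duration_minutes', 'point_count']
    return pd.DataFrame(stay_points, columns=columns)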

Step-by-Step Pipeline Breakdown

  1. Temporal Normalization: Raw GPS logs often arrive out of order or with mixed timezones. Sorting by device_id and timestamp guarantees monotonic progression. Converting to UTC eliminates daylight saving time edge cases that corrupt duration calculations.
  2. Radian Conversion: The Haversine formula operates on angular distances. Multiplying decimal degrees by π/180 via np.radians() aligns inputs with scikit-learn’s expectations. If degrees are passed unconverted, the metric treats them as radians and returns distorted, mostly much larger distances, so a radian-scaled eps matches almost no neighbors and nearly every point is labeled noise.
  3. Spatial Clustering: DBSCAN links points that lie within eps of each other (eps_meters here, converted to radians before the call). A point becomes a core point when its eps-neighborhood contains at least min_samples points, counting the point itself. The ball_tree algorithm accelerates great-circle neighbor lookups, avoiding the O(N²) brute-force fallback.
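  4. Temporal Validation: Spatial density alone cannot distinguish between a traffic jam and a genuine stop. Filtering by min_duration_minutes removes transient congestion while preserving meaningful dwell events. Because the clustering is purely spatial, repeat visits to the same place share one cluster label and the duration is measured from the first to the last ping in that cluster; split clusters on large time gaps if visits must be reported separately.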
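  5. Centroid Aggregation: Each stay point reports the arithmetic mean latitude/longitude in decimal degrees, providing a clean, human-readable anchor for mapping or geocoding downstream. The plain mean is adequate for clusters a few hundred meters wide but is not valid for clusters that straddle the antimeridian.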
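
One way to handle repeat visits in the temporal validation step is to split each spatial cluster into temporally contiguous runs before measuring duration. The sketch below is an illustration under stated assumptions, not part of the pipeline above: split_into_visits and its max_gap_minutes parameter are hypothetical names, and the trajectory is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0

def split_into_visits(cluster_df: pd.DataFrame, max_gap_minutes: float = 30.0) -> pd.Series:
    """Label temporally contiguous runs within one spatial cluster.

    max_gap_minutes is a hypothetical parameter: a gap longer than this
    between consecutive pings starts a new visit.
    """
    ordered = cluster_df.sort_values('timestamp')
    gaps = ordered['timestamp'].diff() > pd.Timedelta(minutes=max_gap_minutes)
    return gaps.cumsum().rename('visit_id')

# Synthetic single-device trajectory: two visits to the same spot, 6 hours apart
base = pd.Timestamp('2024-01-01 08:00', tz='UTC')
times = [base + pd.Timedelta(minutes=m) for m in range(0, 10, 2)]            # visit 1: 5 pings
times += [base + pd.Timedelta(hours=6, minutes=m) for m in range(0, 10, 2)]  # visit 2: 5 pings
df = pd.DataFrame({
    'timestamp': times,
    'lat': 52.5200 + np.linspace(0, 0.0001, len(times)),
    'lon': 13.4050,
})

coords = np.radians(df[['lat', 'lon']].to_numpy())
df['cluster'] = DBSCAN(eps=100.0 / EARTH_RADIUS_M, min_samples=3,
                       metric='haversine', algorithm='ball_tree').fit_predict(coords)

# All ten pings share one spatial cluster, but they form two separate visits
df['visit_id'] = split_into_visits(df[df['cluster'] == 0])
visits = df.groupby(['cluster', 'visit_id'])['timestamp'].agg(['min', 'max'])
visits['duration_minutes'] = (visits['max'] - visits['min']).dt.total_seconds() / 60.0
print(visits)
```

Measured per visit, each stay lasts 8 minutes. Measured per cluster, the same pings would be reported as a single stay of more than six hours.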

Performance Scaling & Parameter Tuning

  • Chunking Strategy: For datasets exceeding 100k points per device, memory pressure spikes during tree construction and neighborhood queries. Partition trajectories by calendar day or fixed 4-hour windows before clustering, then merge overlapping stay points in post-processing.
  • eps Calibration: Start with 50–150 meters for pedestrian/urban routing, and 200–500 meters for highway/fleet tracking. Validate against known landmarks (e.g., parking garages, transit hubs) to calibrate spatial tolerance.
  • min_samples Trade-offs: Lower values (2–3) capture brief stops but increase false positives from GPS drift. Higher values (5–8) enforce stricter density, ideal for high-frequency sampling (>1 Hz).
  • Noise Handling: Points labeled -1 are excluded from the stay-point output, but they should not be deleted from the source data; they represent transit segments or isolated pings. Preserve them for route reconstruction or speed profiling.
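
A common heuristic for choosing eps, not described in the text above, is to look at the sorted distance from each point to its k-th nearest neighbor and pick a value near the point where the curve bends. The sketch below is a minimal illustration of that idea on synthetic pings; kth_neighbor_distances_m is a hypothetical helper name.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_M = 6_371_000.0

def kth_neighbor_distances_m(lat_deg, lon_deg, k: int = 3) -> np.ndarray:
    """Sorted distance (meters) from each point to its k-th nearest other point."""
    coords = np.radians(np.column_stack([lat_deg, lon_deg]))
    # k + 1 neighbors because each query point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm='ball_tree', metric='haversine')
    dist_rad, _ = nn.fit(coords).kneighbors(coords)
    return np.sort(dist_rad[:, k]) * EARTH_RADIUS_M

# Synthetic pings: a tight stop (roughly 10 m scatter) plus a sparse transit segment
rng = np.random.default_rng(0)
stop_lat = 40.7128 + rng.normal(0, 1e-4, 50)
stop_lon = -74.0060 + rng.normal(0, 1e-4, 50)
transit_lat = np.linspace(40.72, 40.80, 20)
transit_lon = np.full(20, -74.0)

k_dist = kth_neighbor_distances_m(
    np.concatenate([stop_lat, transit_lat]),
    np.concatenate([stop_lon, transit_lon]),
)
# Stop points sit at the low end of the curve; the jump toward transit
# spacing suggests where a candidate eps might lie.
print(np.round(np.percentile(k_dist, [25, 50, 75, 95]), 1))
```

Setting k equal to the intended min_samples keeps the two parameters consistent. The resulting candidate should still be validated against known landmarks as described above.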

Validation Checklist

Before deploying to production, verify:

  • All timestamps are UTC-aware and sorted in non-decreasing order per device
  • lat/lon columns contain no NaN or out-of-bounds values (-90 to 90, -180 to 180)
  • eps is converted to radians before passing to DBSCAN
  • Output DataFrame contains only clusters meeting both spatial and temporal thresholds
  • Centroids fall within the expected service area (sanity check against swapped lat/lon columns)
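
The input-side checks in this list can be automated. The following sketch assumes the column names used by detect_stay_points; validate_trajectory is a hypothetical helper, and it only tests for timezone awareness rather than UTC specifically.

```python
import numpy as np
import pandas as pd

def validate_trajectory(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if not isinstance(df['timestamp'].dtype, pd.DatetimeTZDtype):
        problems.append('timestamp column is not timezone-aware')
    if df[['lat', 'lon']].isna().any().any():
        problems.append('lat/lon contain NaN')
    if not df['lat'].dropna().between(-90, 90).all():
        problems.append('lat outside [-90, 90]')
    if not df['lon'].dropna().between(-180, 180).all():
        problems.append('lon outside [-180, 180]')
    # Timestamps should be non-decreasing within each device
    ordered = df.groupby('device_id')['timestamp'].apply(lambda s: s.is_monotonic_increasing)
    if not ordered.all():
        problems.append('timestamps are not sorted within every device')
    return problems

good = pd.DataFrame({
    'device_id': ['a', 'a', 'b'],
    'timestamp': pd.to_datetime(
        ['2024-01-01 08:00', '2024-01-01 08:05', '2024-01-01 09:00'], utc=True),
    'lat': [52.52, 52.52, 48.85],
    'lon': [13.40, 13.40, 2.35],
})
bad = good.assign(lat=[52.52, 152.52, np.nan])

print(validate_trajectory(good))
print(validate_trajectory(bad))
```

Returning a list of messages instead of raising on the first failure lets a batch job log every problem in one pass.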

This pipeline delivers deterministic, scalable stay-point extraction that integrates cleanly with mobility data stacks. By decoupling spatial density from temporal validation, it adapts to irregular sampling while maintaining geographic precision.