Match Catch Surveys to GPS Trips — match_surveys_to_gps

Universal two-step matching workflow that links catch survey records to GPS trip data via device identifiers (IMEI). Works for both Kenya and Zanzibar by accepting either an explicit device registry or constructing an implicit one from the trips data.

Usage

match_surveys_to_gps_trips(
  surveys,
  trips,
  registry = NULL,
  reg_threshold = 0.15,
  name_threshold = 0.25
)

Arguments

surveys

Data frame containing:

submission_id: Unique survey identifier
landing_date: Date of landing
Boat identifiers: registration_number (or variants), boat_name, fisher_name (or captain)

trips

Data frame containing:

trip: Unique trip identifier
imei: Device identifier
ended: Trip end timestamp
Boat identifiers: registration_number (or variants), boat_name, fisher_name (or captain)

registry

Optional data frame with device-boat mappings containing:

imei: Device identifier
Boat identifiers: registration_number (or variants), boat_name, fisher_name (or captain)

If NULL, will be constructed from trips data (Zanzibar approach).

reg_threshold

Numeric. Maximum normalized Levenshtein distance (0-1) for registration number fuzzy matching. Default is 0.15 (15% difference allowed).

name_threshold

Numeric. Maximum normalized Levenshtein distance (0-1) for boat name and fisher name fuzzy matching. Default is 0.25 (25% difference allowed).

Value

Data frame combining matched and unmatched records with columns:

submission_id: Survey identifier (NA for unmatched trips)
landing_date: Landing date
imei: Device identifier
n_fields_used: Number of fields available for matching (0-3)
n_fields_ok: Number of fields that matched within threshold
match_ok: Logical indicating successful match (at least 1 field matched)
trip: GPS trip identifier (NA for unmatched surveys)
started, ended: Trip timestamps
registration_number_survey, registration_number_trip: For comparison
boat_name_survey, boat_name_trip: For comparison
fisher_name_survey, fisher_name_trip: For comparison
boat, duration_seconds, range_meters, distance_meters: Trip metadata
Additional trip columns (gear, community, etc.)

Details

The function implements a two-step matching process:

Step 1: Survey -> Registry (IMEI Assignment)

Standardizes column names across datasets
Cleans text fields (lowercase, remove punctuation, normalize whitespace)
Uses fuzzy matching (Levenshtein distance) on registration number, boat name, and fisher name
Assigns each survey to the registry entry with the most matching fields
Requires at least 1 field to match within threshold (match_ok = TRUE)

Step 2: IMEI -> Trips (Date Matching)

Joins surveys to trips by IMEI and landing_date
Only creates matches when there is exactly ONE survey and ONE trip per IMEI-date
Records with multiple trips/surveys per day remain unmatched
Preserves all unmatched surveys and trips in the output

Matching Thresholds

The normalized Levenshtein distance ranges from 0 (exact match) to 1 (completely different):

reg_threshold = 0.15: Allows ~15% character differences in registration numbers
name_threshold = 0.25: Allows ~25% character differences in names

Site-Specific Usage

Kenya: Uses explicit device registry from Airtable/PDS
Zanzibar: Builds implicit registry from historical trip data (registry = NULL)

Examples

if (FALSE) { # \dontrun{
# Kenya: With explicit device registry
results <- match_surveys_to_gps_trips(
  surveys = kefs_surveys,
  trips = pds_trips,
  registry = devices
)

# Zanzibar: Without explicit registry (builds from trips)
results <- match_surveys_to_gps_trips(
  surveys = wf_surveys,
  trips = pds_trips,
  registry = NULL
)

# With custom thresholds
results <- match_surveys_to_gps_trips(
  surveys = surveys,
  trips = trips,
  registry = devices,
  reg_threshold = 0.10,  # Stricter registration matching
  name_threshold = 0.30  # More lenient name matching
)
} # }