
Match Catch Surveys to GPS Trips
Universal two-step matching workflow that links catch survey records to GPS trip data via device identifiers (IMEI). Works for both Kenya and Zanzibar by accepting either an explicit device registry or constructing an implicit one from the trips data.
Usage
match_surveys_to_gps_trips(
  surveys,
  trips,
  registry = NULL,
  reg_threshold = 0.15,
  name_threshold = 0.25
)
Arguments
- surveys
Data frame containing:
submission_id: Unique survey identifier
landing_date: Date of landing
Boat identifiers: registration_number (or variants), boat_name, fisher_name (or captain)
- trips
Data frame containing:
trip: Unique trip identifier
imei: Device identifier
ended: Trip end timestamp
Boat identifiers: registration_number (or variants), boat_name, fisher_name (or captain)
- registry
Optional data frame with device-boat mappings containing:
imei: Device identifier
Boat identifiers: registration_number (or variants), boat_name, fisher_name (or captain)
If NULL, the registry is constructed from the trips data (Zanzibar approach).
- reg_threshold
Numeric. Maximum normalized Levenshtein distance (0-1) for registration number fuzzy matching. Default is 0.15 (15% difference allowed).
- name_threshold
Numeric. Maximum normalized Levenshtein distance (0-1) for boat name and fisher name fuzzy matching. Default is 0.25 (25% difference allowed).
Value
Data frame combining matched and unmatched records with columns:
submission_id: Survey identifier (NA for unmatched trips)
landing_date: Landing date
imei: Device identifier
n_fields_used: Number of fields available for matching (0-3)
n_fields_ok: Number of fields that matched within threshold
match_ok: Logical indicating successful match (at least 1 field matched)
trip: GPS trip identifier (NA for unmatched surveys)
started, ended: Trip timestamps
registration_number_survey, registration_number_trip: For comparison
boat_name_survey, boat_name_trip: For comparison
fisher_name_survey, fisher_name_trip: For comparison
boat, duration_seconds, range_meters, distance_meters: Trip metadata
Additional trip columns (gear, community, etc.)
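Because matched and unmatched records share one data frame, the NA pattern in submission_id and trip can be used to separate them. The following is a minimal sketch on a hypothetical toy result (the values are invented for illustration and only two of the documented columns are shown):

```r
# Toy stand-in for the function's output; real results carry many more columns.
results <- data.frame(
  submission_id = c("s1", "s2", NA),
  trip          = c("t1", NA, "t3"),
  stringsAsFactors = FALSE
)

has_survey <- !is.na(results$submission_id)
has_trip   <- !is.na(results$trip)

matched     <- results[has_survey & has_trip, ]   # survey linked to a trip
survey_only <- results[has_survey & !has_trip, ]  # unmatched surveys
trip_only   <- results[!has_survey & has_trip, ]  # unmatched trips
```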
Details
The function implements a two-step matching process:
Step 1: Survey -> Registry (IMEI Assignment)
Standardizes column names across datasets
Cleans text fields (lowercase, remove punctuation, normalize whitespace)
Uses fuzzy matching (Levenshtein distance) on registration number, boat name, and fisher name
Assigns each survey to the registry entry with the most matching fields
Requires at least 1 field to match within threshold (match_ok = TRUE)
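The logic of Step 1 can be sketched in base R as follows. This is an illustration rather than the package's internal code: clean_text(), norm_lev() and count_matches() are hypothetical helpers, the distance is assumed to be normalized by the length of the longer string, and ties between registry entries are simply resolved in favour of the first one.

```r
# Hypothetical cleaning helper: lowercase, strip punctuation, collapse whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Normalized Levenshtein distance (assumption: divide by the longer string length).
norm_lev <- function(a, b) {
  utils::adist(a, b)[1, 1] / max(nchar(a), nchar(b), 1)
}

# Count how many fields of one survey match one registry row within threshold.
count_matches <- function(survey, reg, reg_threshold = 0.15, name_threshold = 0.25) {
  thresholds <- c(registration_number = reg_threshold,
                  boat_name = name_threshold,
                  fisher_name = name_threshold)
  n_ok <- 0
  for (f in names(thresholds)) {
    a <- clean_text(survey[[f]])
    b <- clean_text(reg[[f]])
    if (is.na(a) || is.na(b) || a == "" || b == "") next  # field not usable
    if (norm_lev(a, b) <= thresholds[[f]]) n_ok <- n_ok + 1
  }
  n_ok
}

survey <- list(registration_number = "KL-1234/A",
               boat_name = "Sea  Star",
               fisher_name = "J. Omondi")
registry <- data.frame(
  imei = c("111", "222"),
  registration_number = c("kl 1234a", "mz 9876"),
  boat_name = c("Sea Star", "Blue Wave"),
  fisher_name = c("j omondi", "a hassan"),
  stringsAsFactors = FALSE
)

# Score every registry row and keep the one with the most matching fields.
scores <- vapply(seq_len(nrow(registry)),
                 function(i) count_matches(survey, registry[i, ]),
                 numeric(1))
best <- which.max(scores)          # first row wins on ties
assigned_imei <- registry$imei[best]
match_ok <- scores[best] >= 1      # at least one field within threshold
```

In this toy case the first registry row matches on all three fields, so the survey is assigned that row's IMEI.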
Step 2: IMEI -> Trips (Date Matching)
Joins surveys to trips by IMEI and landing_date
Only creates matches when there is exactly ONE survey and ONE trip per IMEI-date
Records with multiple trips/surveys per day remain unmatched
Preserves all unmatched surveys and trips in the output
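The one-to-one rule in Step 2 can be illustrated with a small base R sketch. The data are invented, count_per_key() is a hypothetical helper, and the sketch assumes that a trip's date is the calendar date of its ended timestamp (in UTC); the actual function may derive the date differently.

```r
surveys <- data.frame(
  submission_id = c("s1", "s2", "s3"),
  imei = c("111", "222", "222"),
  landing_date = as.Date(c("2024-03-01", "2024-03-01", "2024-03-01")),
  stringsAsFactors = FALSE
)
trips <- data.frame(
  trip = c("t1", "t2", "t3"),
  imei = c("111", "222", "333"),
  ended = as.POSIXct(c("2024-03-01 14:00:00", "2024-03-01 15:30:00",
                       "2024-03-02 09:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Assumption: the trip date is the date part of `ended`.
trips$landing_date <- as.Date(trips$ended, tz = "UTC")

# Number of rows sharing each IMEI-date combination.
count_per_key <- function(d) {
  k <- paste(d$imei, d$landing_date)
  ave(rep(1, nrow(d)), k, FUN = sum)
}

# Keep only IMEI-dates with exactly one survey and exactly one trip.
surveys_one <- surveys[count_per_key(surveys) == 1, ]
trips_one   <- trips[count_per_key(trips) == 1, ]
matched <- merge(surveys_one, trips_one, by = c("imei", "landing_date"))

# Everything else is preserved as unmatched.
unmatched_surveys <- surveys[!surveys$submission_id %in% matched$submission_id, ]
unmatched_trips   <- trips[!trips$trip %in% matched$trip, ]
```

Here device 222 has two surveys on the same day, so neither survey is linked and trip t2 stays unmatched, while s1 and t1 form the only match.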
Matching Thresholds
The normalized Levenshtein distance ranges from 0 (exact match) to 1 (completely different):
reg_threshold = 0.15: Allows ~15% character differences in registration numbers
name_threshold = 0.25: Allows ~25% character differences in names
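To make the thresholds concrete, here is a minimal sketch of a normalized Levenshtein distance using base R's utils::adist(). The helper norm_lev() is hypothetical, and dividing by the length of the longer string is an assumption; the package may normalize differently.

```r
# Hypothetical helper: edit distance scaled to the 0-1 range.
# Assumption: the raw distance is divided by the length of the longer string.
norm_lev <- function(a, b) {
  max_len <- max(nchar(a), nchar(b))
  if (max_len == 0) return(0)  # two empty strings are identical
  utils::adist(a, b)[1, 1] / max_len
}

d1 <- norm_lev("kl 1234 a", "kl 1234 b")  # 1 edit over 9 characters
d2 <- norm_lev("kl 1234 a", "kl 1284 b")  # 2 edits over 9 characters

d1 <= 0.15  # within the default reg_threshold
d2 <= 0.15  # exceeds the default reg_threshold
```

Under this normalization, a 9-character registration number tolerates one differing character at the default reg_threshold but not two.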
Site-Specific Usage
- Kenya
Uses explicit device registry from Airtable/PDS
- Zanzibar
Builds implicit registry from historical trip data (registry = NULL)
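When no registry is supplied, an implicit one can be thought of as the distinct device-boat combinations observed in the trips data. The sketch below shows that idea on invented data; the function's actual construction (for example, how it handles a device seen with conflicting boat details) is not documented here and may differ.

```r
trips <- data.frame(
  trip = c("t1", "t2", "t3"),
  imei = c("111", "111", "222"),
  registration_number = c("ZNZ 001", "ZNZ 001", "ZNZ 014"),
  boat_name = c("Pweza", "Pweza", "Kasa"),
  fisher_name = c("Ali Juma", "Ali Juma", "Said Omar"),
  stringsAsFactors = FALSE
)

# Assumption: one registry row per distinct IMEI and boat identifier combination.
id_cols <- c("imei", "registration_number", "boat_name", "fisher_name")
implicit_registry <- unique(trips[, id_cols])
```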
Examples
if (FALSE) { # \dontrun{
# Kenya: With explicit device registry
results <- match_surveys_to_gps_trips(
  surveys = kefs_surveys,
  trips = pds_trips,
  registry = devices
)

# Zanzibar: Without explicit registry (builds from trips)
results <- match_surveys_to_gps_trips(
  surveys = wf_surveys,
  trips = pds_trips,
  registry = NULL
)

# With custom thresholds
results <- match_surveys_to_gps_trips(
  surveys = surveys,
  trips = trips,
  registry = devices,
  reg_threshold = 0.10,  # Stricter registration matching
  name_threshold = 0.30  # More lenient name matching
)
} # }