Preprocess KEFS (CATCH ASSESSMENT QUESTIONNAIRE) Survey Data — preprocess_kefs_surveys

This function preprocesses raw KEFS (CATCH ASSESSMENT QUESTIONNAIRE) survey data stored in Google Cloud Storage. It cleans and reshapes the data, standardizes field names, converts field types, and maps taxonomic, gear, vessel, and landing site labels to standardized names using Airtable reference tables.

Usage

preprocess_kefs_surveys_v2(log_threshold = logger::DEBUG)

Arguments

log_threshold: Logging threshold level (default: logger::DEBUG)

Value

No return value. The function is called for its side effects: it processes the data and uploads the result as a Parquet file to Google Cloud Storage.

Details

The function performs the following main operations:

Fetches metadata assets: Retrieves taxonomic, gear, vessel, and landing site mappings from Airtable based on the KEFS Kobo form asset ID
Downloads raw data: Retrieves raw survey data from Google Cloud Storage
Extracts trip information: Selects and renames relevant trip-level fields including:
- Landing details (date, site, district, BMU)
- Fishing ground and JCMA (Joint Co-Management Area) information
- Vessel details (type, name, registration, motorization, horsepower)
- Trip details (crew size, start/end times, gear, mesh size, fuel)
- Catch outcome indicators
Reshapes catch data: Transforms catch details from wide to long format using reshape_priority_species() and reshape_overall_sample()
Type conversions and calculations:
- Converts date/time fields to proper datetime format
- Calculates trip duration in hours from start and end times
- Converts numeric fields (hp, fishers, mesh size, fuel) to appropriate types
Joins trip and catch data: Combines trip information with catch records using a full join on submission_id, so trips without catch records and catch records without trip information are both retained
Standardizes names: Maps survey labels to standardized names using map_surveys():
- Taxonomic names to scientific names and alpha3 codes
- Gear types to standardized gear names
- Vessel types to standardized vessel categories
- Landing site codes to full site names
Uploads processed data: Saves preprocessed data as a Parquet file to Google Cloud Storage
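
The reshaping, type conversion, joining, and name-mapping steps above can be sketched on toy data. This is a minimal base R illustration, not the package's implementation: the wide column names (species_1, weight_1, ...), the gear labels, the lookup table, and the UTC time zone are all assumptions made for the example, and the real function relies on reshape_priority_species(), reshape_overall_sample(), and map_surveys() with Airtable reference tables instead.

```r
# Toy trip-level data. Values are invented; fields arrive as character strings.
trips <- data.frame(
  submission_id      = c("a1", "a2"),
  fishing_trip_start = c("2024-03-01 05:30:00", "2024-03-02 18:00:00"),
  fishing_trip_end   = c("2024-03-01 11:00:00", "2024-03-03 02:15:00"),
  no_of_fishers      = c("4", "6"),
  gear               = c("gill_net", "hand_line"),  # hypothetical survey labels
  stringsAsFactors   = FALSE
)

# Toy catch data in wide format: one column pair per species slot
# (hypothetical column naming).
catch_wide <- data.frame(
  submission_id    = c("a1", "a3"),
  species_1        = c("Siganus sutor", "Siganus sutor"),
  weight_1         = c(12.5, 7.2),
  species_2        = c("Lethrinus lentjan", NA),
  weight_2         = c(3.0, NA),
  stringsAsFactors = FALSE
)

# Wide to long: stack one data frame per species slot, then drop empty slots.
catch_long <- do.call(rbind, lapply(1:2, function(i) {
  data.frame(
    submission_id      = catch_wide$submission_id,
    scientific_name    = catch_wide[[paste0("species_", i)]],
    total_catch_weight = catch_wide[[paste0("weight_", i)]],
    stringsAsFactors   = FALSE
  )
}))
catch_long <- catch_long[!is.na(catch_long$scientific_name), ]

# Type conversions (the time zone is an assumption for this sketch).
trips$fishing_trip_start <- as.POSIXct(trips$fishing_trip_start, tz = "UTC")
trips$fishing_trip_end   <- as.POSIXct(trips$fishing_trip_end, tz = "UTC")
trips$no_of_fishers      <- as.integer(trips$no_of_fishers)

# Trip duration in hours from start and end times.
trips$trip_duration <- as.numeric(
  difftime(trips$fishing_trip_end, trips$fishing_trip_start, units = "hours")
)

# Map survey gear labels to standardized names via a lookup table
# (a stand-in for the Airtable reference table).
gear_lookup <- data.frame(
  survey_label     = c("gill_net", "hand_line"),
  standard_name    = c("Gillnet", "Handline"),
  stringsAsFactors = FALSE
)
trips$gear <- gear_lookup$standard_name[match(trips$gear, gear_lookup$survey_label)]

# Full join on submission_id: unmatched rows from either side are kept with NAs.
joined <- merge(trips, catch_long, by = "submission_id", all = TRUE)
```

In this toy case, submission a2 has trip details but no catch record and a3 has a catch record but no trip details; the full join keeps both, with missing values on the unmatched side.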

Data Structure

The preprocessed output includes the following key fields:

Trip identifiers: submission_id
Temporal: landing_date, fishing_trip_start, fishing_trip_end, trip_duration
Spatial: district, BMU, landing_site, fishing_ground, jcma, jcma_site
Vessel: vessel_type, boat_name, vessel_reg_number, motorized, hp
Crew: captain_name, no_of_fishers
Gear: gear, mesh_size
Catch: scientific_name, alpha3_code, total_catch_weight, price_per_kg, total_value
Operations: fuel, catch_outcome, catch_shark
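
As a quick sanity check, a table can be compared against the fields listed above. The helper below, missing_fields(), is hypothetical and not part of the package; it only reports which of the documented column names are absent from a data frame.

```r
# Field names as listed in the groups above.
expected_fields <- c(
  "submission_id",
  "landing_date", "fishing_trip_start", "fishing_trip_end", "trip_duration",
  "district", "BMU", "landing_site", "fishing_ground", "jcma", "jcma_site",
  "vessel_type", "boat_name", "vessel_reg_number", "motorized", "hp",
  "captain_name", "no_of_fishers",
  "gear", "mesh_size",
  "scientific_name", "alpha3_code", "total_catch_weight", "price_per_kg",
  "total_value",
  "fuel", "catch_outcome", "catch_shark"
)

# Hypothetical helper: returns the expected fields that are not columns of df.
missing_fields <- function(df, expected = expected_fields) {
  setdiff(expected, names(df))
}

# Toy table with only two of the documented columns.
toy <- data.frame(
  submission_id = "a1",
  landing_date  = as.Date("2024-03-01")
)
missing_fields(toy)
```

In practice the table would come from the uploaded Parquet file, for example read with arrow::read_parquet() after downloading it; that step is left out here because it needs storage credentials.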

Pipeline Integration

This function is part of the KEFS data pipeline sequence:

ingest_kefs_surveys_v2() - Downloads raw data from Kobo
preprocess_kefs_surveys_v2() - Cleans and standardizes data (this function)
Validation step (to be implemented)
Export step (to be implemented)
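
One way to run such a sequence is to call each step in order and stop at the first failure, so that preprocessing never runs on data that failed to ingest. The sketch below uses stand-in functions and a hypothetical run_steps() helper; neither is part of the package, and in real use the stand-ins would be replaced by ingest_kefs_surveys_v2() and preprocess_kefs_surveys_v2().

```r
# Stand-ins for the real pipeline functions (illustration only).
ingest_step     <- function() message("ingest: raw data downloaded")
preprocess_step <- function() message("preprocess: data cleaned and uploaded")

# Hypothetical helper: runs named steps in order, stops at the first error,
# and returns the names of the steps that completed.
run_steps <- function(steps) {
  completed <- character(0)
  for (name in names(steps)) {
    ok <- tryCatch(
      {
        steps[[name]]()
        TRUE
      },
      error = function(e) {
        message(name, " failed: ", conditionMessage(e))
        FALSE
      }
    )
    if (!ok) break
    completed <- c(completed, name)
  }
  completed
}

done <- run_steps(list(ingest = ingest_step, preprocess = preprocess_step))
```

Because run_steps() returns the names of the completed steps, a scheduler or calling script can tell how far the sequence got before an error.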

Examples

if (FALSE) { # \dontrun{
# Preprocess KEFS survey data
preprocess_kefs_surveys_v2()

# Run with custom logging level
preprocess_kefs_surveys_v2(log_threshold = logger::INFO)
} # }