peskas.mozambique.data.pipeline 2.3.1

Major Changes

  • Streamlined Validation Workflow: Replaced KoboToolbox API updates with direct MongoDB storage to improve performance.
    • New export_validation_flags() function exports validation flags directly to MongoDB (idea sketched after this list)
    • Validation status queries now only identify manually edited submissions rather than updating them
    • Disabled sync_validation_submissions() workflow steps in GitHub Actions
    • Significantly reduced pipeline execution time by avoiding slow KoboToolbox API calls
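
A minimal sketch of the direct-to-MongoDB export idea, using the mongolite package. The helper name and collection handling are illustrative assumptions; the package's actual export_validation_flags() may rely on its own storage helpers (e.g. mdb_collection_push()).

    # Hypothetical sketch: push a data frame of validation flags straight to
    # MongoDB, replacing the previous contents of the collection
    library(mongolite)

    export_flags_sketch <- function(flags, connection_string, db, collection) {
      con <- mongo(collection = collection, db = db, url = connection_string)
      con$drop()         # discard stale flags before writing the new set
      con$insert(flags)  # flags: one row per submission with its flag columns
      invisible(NULL)
    }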

Improvements

  • Validation System:
    • Validation functions now preserve manual human approvals while updating system-generated statuses
    • Added fetch_error field to get_validation_status() for better error tracking
    • Improved error handling in validation status queries
  • Code Quality:
    • Fixed SeaLifeBase API calls by pinning to version 24.07 to avoid server errors (see the sketch after this list)
    • Standardized function parameter formatting across validation and preprocessing modules
    • Removed empty R/data.R file
  • Pipeline Configuration:
    • Removed survey activity filter in Lurio preprocessing to include all submissions
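
A small illustration of the version-pinning idea, assuming the server and version arguments of the rfishbase package; the species name is only an example.

    # Pin SeaLifeBase queries to a fixed release so upstream data changes
    # cannot break the pipeline mid-run (assumed rfishbase interface)
    library(rfishbase)

    sl <- species(
      species_list = "Octopus vulgaris",
      server = "sealifebase",
      version = "24.07"
    )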

peskas.mozambique.data.pipeline 2.3.0

Major Changes

  • Enumerator Name Standardization: New intelligent name cleaning and matching system to handle data entry inconsistencies.
    • Introduced standardize_enumerator_names() function with fuzzy string matching using Levenshtein distance (approach sketched after this list)
    • Automatically removes special characters, numbers, and extra whitespace from enumerator names
    • Matches similar names with typos (e.g., “john smith” and “jhon smith”) and consolidates to a standard form
    • Integrated into Lurio preprocessing pipeline with configurable distance threshold
    • Marks single-word entries as “undefined” to ensure quality control
    • Returns cleaner, more consistent enumerator tracking for performance analysis
  • Enhanced Validation System: Improved validation logic with manual approval tracking and new quality checks.
    • Added manual approval tracking from KoboToolbox to distinguish human-reviewed approvals from system approvals
    • Manual approvals by human reviewers now properly bypass automatic validation flags
    • New validation flag 20: Detects landing date after submission date inconsistencies
    • New validation flag 11: Flags zero fishers with positive catch outcome
    • Parallel processing for KoboToolbox validation status queries using furrr for faster bulk validation
    • Improved handling of infinite CPUE/RPUE values in composite indicator validation
    • Raised the price per kg threshold from 1,875 to 2,500 MZN (~30 EUR) for more realistic outlier detection
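
A minimal sketch of the Levenshtein-based name consolidation described above, assuming the stringdist package; the cleaning rules, threshold, and helper name are illustrative, not the exact implementation of standardize_enumerator_names().

    library(stringdist)

    standardize_names_sketch <- function(names, max_dist = 2) {
      # Clean: lowercase, strip digits/special characters, squeeze whitespace
      clean <- tolower(names)
      clean <- gsub("[^a-z ]", "", clean)
      clean <- trimws(gsub("\\s+", " ", clean))
      # Single-word entries are marked "undefined" for quality control
      clean[!grepl(" ", clean)] <- "undefined"
      # Consolidate each name to the first earlier name within the threshold
      canon <- clean
      for (i in seq_along(clean)) {
        if (canon[i] == "undefined" || i == 1) next
        earlier <- canon[seq_len(i - 1)]
        d <- stringdist(canon[i], earlier, method = "lv")
        hit <- which(d <= max_dist & earlier != "undefined")[1]
        if (!is.na(hit)) canon[i] <- earlier[hit]
      }
      canon
    }

    standardize_names_sketch(c("John Smith", "jhon smith!", "Maria"))
    #> [1] "john smith" "john smith" "undefined"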

Improvements

  • Data Structure Enhancements:
    • Added scientific_name field to catch data for better species traceability and validation
    • Renamed tot_fishers to n_fishers throughout the codebase for naming consistency
    • Fixed column name inconsistency where survey_label and catch_taxon were swapped in some contexts
    • Improved data pipeline clarity with consistent field naming across preprocessing and validation stages
  • Validation Pipeline Improvements:
    • Enhanced validation logic to handle edge cases such as infinite values and zero denominators (guard sketched after this list)
    • Better separation of basic quality checks (flags 1-7, 20) from composite indicators (flags 8-11)
    • Improved flag consolidation logic to properly aggregate multiple validation issues per submission
    • More robust handling of submissions with missing or invalid fishers count
    • Enhanced logging for validation status queries with submission counts
  • Code Quality:
    • Fixed assignment operators (= → <-) for R style consistency
    • Improved function parameter formatting throughout the codebase
    • Added proper roxygen2 documentation for new standardize_enumerator_names() function
    • Enhanced inline comments explaining validation thresholds and logic
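
A hedged sketch of the zero-denominator guard for rate indicators such as CPUE and RPUE; the helper name is hypothetical.

    safe_rate <- function(numerator, denominator) {
      out <- numerator / denominator
      # Zero fishers or zero trip duration yields Inf/NaN; treat as missing
      out[!is.finite(out)] <- NA_real_
      out
    }

    safe_rate(c(10, 5), c(2, 0))
    #> [1]  5 NA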

Bug Fixes

  • Fixed infinite CPUE/RPUE calculations when fishers count or trip duration is zero
  • Corrected validation flag order to properly prioritize data quality issues
  • Fixed column mapping issue in Lurio validation where survey labels were misaligned
  • Resolved issue where submission date validation was not being performed
  • Fixed enumerator name column selection to use correct nested field path

Infrastructure & Dependencies

  • Added stringdist package dependency for fuzzy name matching capabilities
  • Enhanced parallel processing configuration for validation status queries
  • Updated NAMESPACE with new exported function standardize_enumerator_names()
  • Added man page documentation for enumerator name standardization function

peskas.mozambique.data.pipeline 2.2.0

Major Changes

  • Redesigned Length Frequency Processing: Complete rebuild of catch data reshaping with simplified, row-by-row processing architecture.
    • Introduced expand_length_frequency() for processing individual species rows (pattern sketched after this list)
    • Refactored reshape_catch_data() to use row-wise expansion instead of complex joins
    • Eliminated data loss issues caused by multiple join operations
    • Preserves all metadata (counting_method, species, n_buckets, etc.) throughout transformation
    • Simpler and more maintainable code with clear step-by-step logic
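
A minimal sketch of the row-by-row expansion pattern, using the rowwise() |> group_split() |> map_dfr() idiom mentioned under Code Architecture below; column names and data are illustrative.

    library(dplyr)
    library(tidyr)
    library(purrr)

    catch <- tibble(
      species = c("sardine", "mackerel"),
      counting_method = c("bucket", "individual"),
      no_individuals_5_10 = c(12, 0),
      no_individuals_10_15 = c(3, 7)
    )

    expanded <- catch |>
      rowwise() |>
      group_split() |>
      map_dfr(\(row) {
        # Expand this row's length bins in place; metadata columns
        # (species, counting_method) are carried along automatically
        pivot_longer(
          row,
          cols = starts_with("no_individuals_"),
          names_to = "length_bin",
          names_prefix = "no_individuals_",
          values_to = "n_individuals"
        )
      })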

Improvements

  • Enhanced Catch Data Processing:
    • Fixed critical bug where counting_method was being lost during length frequency expansion
    • Improved handling of NA values in separate_wider_delim() with too_few = "align_start" (see the sketch after this list)
    • Better support for length frequency data (fish under 100cm) with proper regex pattern matching
    • Clearer inline documentation explaining each processing step
    • More robust error handling for empty length bins
  • Code Architecture:
    • New expand_length_frequency() function processes one species row at a time
    • Deprecated process_regular_length_groups() in favor of the simpler row-by-row approach
    • Retained process_over100_length_groups() for backwards compatibility with large fish data
    • Eliminated complex join logic that was prone to losing metadata
    • Uses rowwise() |> group_split() |> map_dfr() pattern for cleaner row processing
  • Documentation Quality:
    • Updated all function documentation to reflect new implementation
    • Added detailed @details sections explaining the row-by-row approach
    • Improved @keywords for better pkgdown organization
    • Clear documentation of deprecated functions
    • Enhanced examples showing length frequency analysis
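
A small illustration of why too_few = "align_start" matters for short or missing length ranges; the data are made up.

    library(tidyr)

    tibble::tibble(length_range = c("5_10", "100", NA)) |>
      separate_wider_delim(
        length_range,
        delim = "_",
        names = c("length_min", "length_max"),
        too_few = "align_start"  # "100" and NA no longer error out
      )
    # Result: length_min = c("5", "100", NA), length_max = c("10", NA, NA)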

Bug Fixes

  • Fixed counting_method = NA issue where metadata was lost during length data expansion (#issue)
  • Corrected regex pattern to match the no_individuals_5_10 format (previously matched only 5_10)
  • Fixed separate_wider_delim() failure on NA length ranges
  • Eliminated extra length group columns appearing in final output
  • Resolved data preservation issues in complex join operations

Technical Details

  • Length Frequency Data Flow:
    • Old approach: Extract all length data → Join back → Lose metadata
    • New approach: Process each row → Expand in place → Keep everything
    • Result: 100% metadata preservation with simpler logic
  • Performance: Row-by-row processing with purrr::map_dfr() provides clean, functional approach while maintaining good performance for typical survey sizes

peskas.mozambique.data.pipeline 2.1.0

Major Changes

  • Enhanced Validation Sync System: Restructured validation synchronization following Kenya pipeline best practices.
    • Added sync_validation_submissions() for bidirectional validation status updates with rate limiting
    • Implemented process_submissions_parallel() helper function for consistent API interactions (pattern sketched after this list)
    • Rate limiting (0.1-0.2s delays) prevents overwhelming the KoboToolbox API
    • Manual approval respect: Human review decisions are never overwritten by system updates
    • Optimized API usage: Skips already-approved submissions to minimize unnecessary calls
    • Fetches current validation status BEFORE making updates for smarter decision-making
    • Automated approval/rejection of submissions in KoboToolbox based on validation results
    • Stores validation metadata in MongoDB for enumerator performance tracking
    • Enhanced error tracking with success/failure logging
  • Improved Asset Management: Enhanced preprocessing with form-specific asset filtering.
    • Preprocessing functions now automatically filter Airtable assets by form_id
    • Better separation of metadata between Lurio and ADNAP survey forms
    • Improved handling of shared assets across multiple survey versions
    • More reliable species, gear, vessel, and site mappings
  • Optimized GitHub Actions Workflow: Streamlined pipeline execution by combining ingestion and preprocessing stages.
    • Combined ingest and preprocess jobs for each data source (Lurio, ADNAP, PDS), reducing container startups by ~30%
    • Simplified dependency graph from 10 jobs to 7 jobs for faster pipeline execution
    • Maintained separation of validation stages for better error isolation and independent re-runs
    • Aligned workflow structure with Kenya pipeline best practices
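
A hedged sketch of the rate-limited parallel pattern, assuming the furrr and future packages the pipeline already imports; the status stub stands in for real KoboToolbox calls, and the helper name is illustrative.

    library(future)
    library(furrr)
    plan(multisession, workers = 4)

    # Stub standing in for a real KoboToolbox status lookup
    get_status_stub <- function(id) {
      list(id = id, manually_approved = id %% 2 == 0)
    }

    process_submissions_sketch <- function(ids, delay = 0.15) {
      future_map(ids, function(id) {
        Sys.sleep(delay)               # rate limiting between API calls
        status <- get_status_stub(id)  # fetch current status BEFORE updating
        # Never overwrite a manual approval; only system statuses change
        if (!isTRUE(status$manually_approved)) status$updated <- TRUE
        status
      })
    }

    results <- process_submissions_sketch(1:10)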

Improvements

  • Validation System Enhancements:
    • Manual approvals by human reviewers now properly bypass automatic validation flags
    • System-generated approvals are re-validated to ensure data quality
    • Better logging of validation status queries with submission counts
    • Enhanced validation flag preservation for monitoring and reporting
    • Improved handling of catch_taxon field mapping in Lurio surveys
  • Configuration Management:
    • Restructured MongoDB connection strings to support separate validation database
    • Added KOBO_TOKEN authentication for ADNAP asset
    • Improved configuration structure for multiple database contexts
    • Enhanced PDS storage configuration organization
    • Added explicit assets configuration for Airtable integration
  • Workflow Performance:
    • Reduced overall pipeline execution time through job consolidation
    • Parallel execution of independent data streams (Lurio, ADNAP, PDS)
    • Cleaner job naming for better CI/CD monitoring
    • Maintained robust error handling with granular validation stages

Bug Fixes

  • Fixed asset fetching logic to properly filter by target form_id
  • Corrected catch_taxon column mapping in Lurio validation (changed to alpha3_code)
  • Fixed validation status query to exclude system approvals from manual approval overrides
  • Fixed MongoDB configuration path typo (collection → collections) in enumerators_stats
  • Removed redundant asset fetching code in preprocessing functions
  • Added missing KOBO_USERNAME configuration for ADNAP asset
  • Fixed sync function to never overwrite manual approvals from human reviewers

Infrastructure & Dependencies

  • Added MONGODB_CONNECTION_STRING_VALIDATION environment variable for separate validation database
  • Enhanced GitHub Actions workflow with combined ingest-preprocess jobs
  • Improved parallel processing configuration in validation sync
  • Updated NAMESPACE with new imports for future, furrr, and progressr packages
  • Maintained compatibility with existing storage and authentication systems

peskas.mozambique.data.pipeline 2.0.0

Major Changes

  • Dual Survey System Integration: Full support for both Lurio and ADNAP fisheries surveys with parallel processing workflows.
  • Enhanced Validation System for ADNAP: Advanced validation with KoBoToolbox integration.
    • Integrated KoBoToolbox validation status API for manual approval workflow
    • Added get_validation_status() to query submission approval status (request shape sketched after this list)
    • Implemented parallel processing for validation status queries across multiple submissions
    • Manual approvals in KoBoToolbox now bypass automatic validation flags
    • Maintained two-stage validation (7 basic checks + 3 composite economic indicators)
  • Flexible Survey Data Reshaping: New module for handling multiple survey form structures.
  • Improved GitHub Actions Workflow: Enhanced automation with clearer job naming and parallel execution.
    • Renamed workflow jobs for better visibility (e.g., “Ingest Lurio landings”, “Validate ADNAP landings”)
    • Parallel execution of Lurio and ADNAP pipelines for faster processing
    • Separate PDS data ingestion and preprocessing jobs
    • Clear dependency chains between ingestion, preprocessing, and validation stages
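
A hedged sketch of what such a status query can look like with httr2; the endpoint path follows KoboToolbox's v2 REST API, but the exact route, fields, and helper name here are assumptions, not the package's implementation.

    library(httr2)

    get_validation_status_sketch <- function(base_url, asset_uid,
                                             submission_id, token) {
      request(base_url) |>
        req_url_path_append(
          "api", "v2", "assets", asset_uid,
          "data", submission_id, "validation_status"
        ) |>
        req_headers(Authorization = paste("Token", token)) |>
        req_perform() |>
        resp_body_json()
    }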

Improvements

  • Documentation Overhaul:
    • Corrected all storage backend references from MongoDB to Google Cloud Storage
    • Updated function documentation to accurately reflect Parquet file usage
    • Added specific titles to distinguish Lurio and ADNAP functions
    • Removed misleading unused parameters from ingestion functions
    • Added explicit invisible(NULL) returns to all workflow functions for consistency (pattern sketched after this list)
    • Enhanced documentation for KoBoToolbox validation integration
    • Made parameter documentation more accurate (noting hardcoded values)
  • Code Quality Enhancements:
    • Cleaned up function signatures by removing unused parameters
    • Made return values explicit across all workflow functions
    • Improved consistency between function documentation and implementation
    • Enhanced examples to reflect actual usage patterns
  • Survey Processing Pipeline:
    • Added support for multiple species field normalization
    • Improved handling of separate length group structures for large fish
    • Enhanced catch data validation with species-specific thresholds
    • Better integration with Airtable form assets for data mapping
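
The invisible(NULL) convention noted above, shown as a minimal sketch; the logger-based signature is an assumption about the workflow functions' shape, not a copy of them.

    library(logger)

    ingest_landings_sketch <- function(log_threshold = logger::DEBUG) {
      logger::log_threshold(log_threshold)
      # ... download raw submissions, upload to cloud storage ...
      logger::log_info("Ingestion complete")
      invisible(NULL)  # called for side effects; nothing printed at the console
    }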

Bug Fixes

  • Fixed incorrect package reference in documentation (removed non-existent KoboconnectR package)
  • Corrected validation threshold documentation (200 individuals, not 100) for ADNAP surveys
  • Fixed duplicate GitHub Actions job names that made debugging difficult
  • Corrected storage backend documentation throughout codebase (MongoDB → GCS)
  • Updated validation flag numbering documentation for consistency across surveys

Infrastructure & Dependencies

  • Maintained compatibility with parallel processing packages (future, furrr)
  • Enhanced configuration system to support multiple survey sources
  • Improved separation of concerns between Lurio and ADNAP processing pipelines
  • Updated documentation generation with roxygen2
  • Package maintains clean R CMD check status

peskas.mozambique.data.pipeline 1.0.0

Major Changes

  • GPS Tracking Integration with Pelagic Data Systems (PDS): Full support for vessel tracking data ingestion and preprocessing.
  • Airtable Integration Module: Complete suite of functions for two-way synchronization with Airtable.
  • Comprehensive Data Validation Framework: Implemented multi-stage validation adapted from Peskas Zanzibar pipeline.
    • Redesigned validate_landings() with 10 validation flags across two stages (idea sketched after this list)
    • Stage 1: Basic data quality checks (form completeness, catch info, length validation, bucket/individual counts)
    • Stage 2: Composite economic indicators (price per kg, CPUE, RPUE) following Zanzibar thresholds
    • Created modular validation functions: validate_catch_taxa(), validate_price(), validate_total_catch()
    • Validation results exclude flagged submissions from final dataset while preserving flags for monitoring
  • Taxa Modeling and Species Intelligence: New module for automated species identification and biological data enrichment.
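
A hedged sketch of the two-stage idea: cheap completeness checks first, composite economic indicators second. Column names, thresholds, and the helper name are illustrative, not the package's exact rules.

    library(dplyr)

    flag_landings_sketch <- function(landings, max_price_kg = 2500) {
      landings |>
        mutate(
          # Stage 1: basic data quality check (example)
          flag_missing_catch = is.na(catch_kg),
          # Stage 2: composite economic indicator (example)
          price_per_kg = if_else(catch_kg > 0, price_mzn / catch_kg, NA_real_),
          flag_price = !is.na(price_per_kg) & price_per_kg > max_price_kg
        )
    }

    flag_landings_sketch(
      tibble::tibble(catch_kg = c(10, NA), price_mzn = c(30000, 500))
    )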

Improvements

  • Storage System Enhancements:
  • Configuration Management:
    • Switched to dotenv package for environment variable management
    • Added load_dotenv() function with configurable .env file paths (see the sketch after this list)
    • Updated read_config() to automatically load environment variables
    • Expanded configuration schema to support PDS, Airtable, and multi-cloud storage
    • Added support for separate storage buckets for different data types (surveys vs. tracks)
  • Data Preprocessing Pipeline:
    • Enhanced preprocess_landings() with metadata table joins (landing sites, boats, enumerators)
    • Implemented process_species_group() for handling species group disaggregation
    • Added species validation and enrichment with FishBase/SeaLifeBase data
    • Integrated length-weight conversion using local coefficient database
    • Added habitat information from species area data
    • Improved catch weight calculation with multiple estimation methods
  • Export Functionality:
    • Expanded export_landings() to generate multiple analytical outputs
    • Added calculate_fishery_metrics() for aggregated statistics
    • Created MongoDB portal collections for dashboard integration
    • Implemented trip-level summarization for GPS track data
    • Enhanced data transformation for consumption by visualization tools
  • Workflow Automation:
    • Added GitHub Actions workflow for automated releases (release.yaml)
    • Updated data pipeline workflow with improved error handling and notifications
    • Integrated cloud authentication in CI/CD pipeline
    • Added support for scheduled and manual workflow triggers
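
A minimal sketch of the .env loading idea with the dotenv package; the wrapper name mirrors the changelog, but its body here is an illustrative assumption.

    library(dotenv)

    load_dotenv_sketch <- function(path = ".env") {
      # Silently skip when the file is absent (e.g. in CI, where secrets
      # arrive as real environment variables instead)
      if (file.exists(path)) dotenv::load_dot_env(file = path)
      invisible(NULL)
    }

    load_dotenv_sketch()
    Sys.getenv("MONGODB_CONNECTION_STRING")  # now visible to read_config()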

Bug Fixes

  • Fixed price validation logic that was incorrectly flagging valid entries
  • Corrected global variable bindings in validation functions to prevent R CMD check warnings
  • Removed invalid geo parameter from mdb_collection_push() function call
  • Fixed customer_name and submission_id variable scoping issues using .data$ notation

Infrastructure & Dependencies

  • Added new package dependencies: furrr, future, glue, readr for enhanced functionality
  • Updated .Rbuildignore to exclude development files (.env, .claude, CLAUDE.md)
  • Package now passes R CMD check with no warnings or notes
  • Improved documentation coverage with 34 new exported functions
  • Enhanced type safety and code consistency throughout codebase

peskas.mozambique.data.pipeline 0.2.0

New features

  • Updates data ingestion and preprocessing workflows
  • Renames ingest_surveys to ingest_landings
  • Adds new metadata joins and data transformations in preprocess_landings
  • Introduces calculate_catch function for catch weight estimation (formula sketched after this list)
  • Updates configuration to include Google Cloud Storage and additional metadata tables
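
A minimal sketch of the standard length-weight relationship (W = a * L^b) that typically underlies this kind of catch estimation; the coefficients and helper name are illustrative, not the actual calculate_catch implementation.

    # Hypothetical catch estimation from length and count; a and b are
    # species-specific length-weight coefficients (placeholder values here)
    calculate_catch_sketch <- function(length_cm, n_individuals,
                                       a = 0.01, b = 3) {
      weight_g <- a * length_cm^b      # individual weight in grams
      n_individuals * weight_g / 1000  # total catch in kilograms
    }

    calculate_catch_sketch(length_cm = 25, n_individuals = 12)
    #> [1] 1.875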

peskas.mozambique.data.pipeline 0.1.0

  • Initial CRAN submission.