Introduction 🎯
Now that we’ve learned essential preprocessing techniques, let’s apply them to real Small-Scale Fisheries data. We’ll work with landing site data from Malawi that contains various challenges commonly encountered in field research. By working through this practical case study, you’ll strengthen your data cleaning skills using actual field-collected data.
Think of this as your first real fish to clean - while our previous tutorial showed you the techniques, now you’ll get hands-on experience with real-world data complexities. Just as each fish presents unique cleaning challenges, field data often has unexpected issues that require careful attention.
By working through this case study, you will:

- Apply preprocessing techniques to real SSF field data
- Handle common data quality issues from field collection
- Create an end-to-end preprocessing workflow
- Practice standardizing categorical variables
- Deal with real-world missing data patterns
- Clean actual numeric data with outliers
Understanding Our Dataset 📋
Let’s load and examine our landing site data from Malawi:
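A minimal sketch of the loading step using the tidyverse. The file path `data/malawi_landings.csv` and the object name `landings_raw` are placeholders; point `read_csv()` at wherever your copy of the data lives:

```r
library(tidyverse)

# Read the raw landing site survey data (path is a placeholder)
landings_raw <- read_csv("data/malawi_landings.csv")

# First look at the structure and column types
glimpse(landings_raw)
```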
Our dataset comes from landing site surveys in Salima, Malawi, and includes variables that are crucial for understanding fishing activities:
Data Structure Overview
- Survey Information:
  - `submission_id`: Unique identifier for each survey
  - `survey_id`: Detailed survey identifier with trip information
  - `landing_date`: Date when catch was landed
  - `landing_site`: Name of the landing site
  - `lat`, `lon`: Geographic coordinates of the landing
- Fishing Effort Variables:
  - `n_fishers`: Number of fishers involved
  - `n_boats`: Number of boats used
  - `trip_length`: Duration of fishing trip
  - `gear`: Type of fishing gear used
- Catch Information:
  - `catch_usage`: Purpose of the catch (trade, food, etc.)
  - `catch_taxon`: Species or taxonomic group caught
  - `catch_kg`: Weight of catch in kilograms
  - `catch_price`: Total price of catch
  - `price_kg`: Price per kilogram
Initial Data Assessment
Let’s examine the quality issues we need to address:
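One way to surface these issues, a sketch assuming the raw data is in `landings_raw` and already carries the column names listed in the structure overview:

```r
# Count missing values in each column
landings_raw %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))

# Inspect distinct spellings in key categorical variables
landings_raw %>% count(catch_usage, sort = TRUE)
landings_raw %>% count(landing_site, sort = TRUE)
```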
Our initial assessment reveals several data quality challenges that we’ll need to address:
- Missing Data Issues:
  - Complete survey rows with all values missing
  - Scattered missing values in effort variables
  - Empty strings and NA values mixed together
- Categorical Inconsistencies:
  - Mixed capitalization in `catch_usage` ("trade" vs "TRADE")
  - Variations in species names
  - Inconsistent landing site spellings
- Numeric Data Problems:
  - Text mixed with numbers in some numeric fields
  - Potential outliers in catch and price data
  - Effort variables with unrealistic values
Starting with Clean Names 🏷️
The first step in our preprocessing workflow is to standardize variable names using the techniques we learned in the previous tutorial. This ensures consistent naming patterns throughout our analysis:
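A sketch of this step using `clean_names()` from the janitor package (the object names carry over from the loading sketch above):

```r
library(janitor)

# Standardize all column names to snake_case
landings <- landings_raw %>%
  clean_names()

names(landings)
```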
The `clean_names()` function has:

- Converted all names to lowercase
- Replaced dots and spaces with underscores
- Removed special characters
- Created consistent naming patterns
String Standardization 🔤
Our categorical variables need standardization. Let’s clean up species names and catch usage categories:
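A sketch of this cleanup using stringr, assuming the inconsistencies are mostly case and whitespace; site-specific misspellings would need an explicit recode (e.g. with `case_when()`):

```r
landings <- landings %>%
  mutate(
    # Lowercase and collapse whitespace in usage categories ("TRADE " -> "trade")
    catch_usage = str_to_lower(str_squish(catch_usage)),
    # Same treatment for species names
    catch_taxon = str_to_lower(str_squish(catch_taxon)),
    # Title case keeps site names readable ("salima bay" -> "Salima Bay")
    landing_site = str_to_title(str_squish(landing_site))
  )
```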
Our standardization has:

- Created consistent categories for catch usage
- Standardized species names to lowercase
- Removed unnecessary whitespace
- Made categories ready for analysis
Handling Missing Values 🔍
Our dataset has various types of missing values that need careful handling:
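A quick way to see the different forms of missingness side by side:

```r
# Empty strings and true NA values can appear as separate categories
landings %>% count(catch_usage)

# How many survey rows are missing every value except the ID?
landings %>%
  filter(if_all(-submission_id, is.na)) %>%
  nrow()
```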
Let’s handle these missing values systematically:
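A sketch of that strategy with dplyr; the zero-to-NA step assumes the effort columns are already numeric (if they still contain stray text, convert them first as shown in the next section):

```r
landings <- landings %>%
  # 1. Convert empty strings to NA across character columns
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  # 2. Drop records where everything except the ID is missing
  filter(!if_all(-submission_id, is.na)) %>%
  # 3. Zeros in effort variables almost certainly mean "not recorded";
  #    legitimate zeros in catch_kg and catch_price are left alone
  mutate(across(c(n_fishers, n_boats, trip_length), ~ na_if(.x, 0)))
```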
Our missing value strategy:

1. Converted empty strings to NA for consistency
2. Removed completely empty survey records
3. Converted zeros to NA where they likely represent missing data
4. Left legitimate zeros in catch and price data
Cleaning Numeric Data 🔢
After handling categorical variables and missing values, we need to ensure our numeric data is clean and properly formatted. Our dataset includes several important numeric measurements:

- Catch weights (`catch_kg`)
- Prices (`price_kg` and `catch_price`)
- Effort metrics (`n_fishers`, `n_boats`, `trip_length`)
Let’s examine these variables:
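A sketch of the examination, assuming the catch columns may still be stored as character because of the stray text:

```r
# Summary statistics for the numeric measurements
landings %>%
  select(catch_kg, catch_price, price_kg, n_fishers, n_boats, trip_length) %>%
  summary()

# Values that refuse to parse as numbers (e.g. "12 kg")
landings %>%
  filter(!is.na(catch_kg) & is.na(suppressWarnings(as.numeric(catch_kg)))) %>%
  select(submission_id, catch_kg)
```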
From our examination, we can see several issues:

1. Some values contain non-numeric characters
2. There are potential outliers in catch and price data
3. Some effort variables have unrealistic values
Let’s clean these issues systematically:
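A sketch of the cleanup using `readr::parse_number()`; the outlier thresholds (50 fishers, 72-hour trips) are illustrative assumptions, not values from the original analysis, and should come from local knowledge of the fishery:

```r
landings <- landings %>%
  mutate(
    # Strip stray text, keeping the numeric part ("12 kg" -> 12)
    across(c(catch_kg, catch_price, n_fishers, n_boats, trip_length),
           ~ parse_number(as.character(.x))),
    # Drop implausible effort values (thresholds are illustrative)
    n_fishers   = if_else(n_fishers > 50, NA_real_, n_fishers),
    trip_length = if_else(trip_length > 72, NA_real_, trip_length),
    # Recompute price per kg from the cleaned weight and price
    price_kg = catch_price / catch_kg
  )
```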
Working with Dates and Times ⏰
Our dataset includes landing dates that need to be properly formatted for analysis. The `landing_date` column contains date information we need to parse:
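A minimal parsing sketch with lubridate, assuming the raw dates are written year-month-day (swap in `dmy()` or `mdy()` if your format differs):

```r
library(lubridate)

landings <- landings %>%
  # Parse "2023-04-15"-style strings into proper Date objects
  mutate(landing_date = ymd(landing_date))

# Check the parsed range for obviously wrong dates
range(landings$landing_date, na.rm = TRUE)
```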
Now we can also analyze temporal patterns in our data:
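For instance, a quick monthly count of landings (assuming `landing_date` was parsed as above):

```r
# Landings recorded per month
landings %>%
  count(landing_month = floor_date(landing_date, "month"))
```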
Creating Date-Based Groups
For analysis purposes, we can add additional time-based categorizations:
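A sketch of these categorizations with lubridate (the new column names are illustrative):

```r
landings <- landings %>%
  mutate(
    month_name = month(landing_date, label = TRUE),
    weekday    = wday(landing_date, label = TRUE),
    # lubridate codes Sunday as 1 and Saturday as 7 by default
    is_weekend = wday(landing_date) %in% c(1, 7)
  )
```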
These date transformations allow us to:

- Track temporal patterns in fishing activity
- Analyze seasonal variations
- Compare weekday vs weekend patterns
- Group data by meaningful time periods
Conclusion 🎯
Let’s examine our fully preprocessed dataset:
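A final check, assuming the cleaned data is still in `landings`:

```r
# Final structure check
glimpse(landings)

# Confirm no empty strings survive in character columns
landings %>%
  summarise(across(where(is.character), ~ sum(.x == "", na.rm = TRUE)))
```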
Our preprocessing workflow has:

- Standardized all variable names and categories
- Cleaned and validated numeric data
- Handled missing values appropriately
- Created consistent date formats
- Removed unrealistic values
The cleaned dataset is now ready for analysis! 🎯
Next: Data validation