Introduction 🎯
Now that we’ve learned essential preprocessing techniques, let’s apply them to real Small-Scale Fisheries data. We’ll work with landing site data from Malawi that contains various challenges commonly encountered in field research. By working through this practical case study, you’ll strengthen your data cleaning skills using actual field-collected data.
Think of this as your first real fish to clean - while our previous tutorial showed you the techniques, now you’ll get hands-on experience with real-world data complexities. Just as each fish presents unique cleaning challenges, field data often has unexpected issues that require careful attention.
By working through this case study, you will:

- Apply preprocessing techniques to real SSF field data
- Handle common data quality issues from field collection
- Create an end-to-end preprocessing workflow
- Practice standardizing categorical variables
- Deal with real-world missing data patterns
- Clean actual numeric data with outliers
Understanding Our Dataset 📋
Let’s load and examine our landing site data from Malawi:
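A minimal sketch of the loading step using the tidyverse. The file path `data/malawi_landings.csv` and the object name `landings_raw` are placeholders; point `read_csv()` at wherever your copy of the data lives:

```r
library(tidyverse)

# Read the raw landing site survey data (path is a placeholder)
landings_raw <- read_csv("data/malawi_landings.csv")

# First look at the structure and column types
glimpse(landings_raw)
```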
Our dataset comes from landing site surveys in Salima, Malawi, and includes variables that are crucial for understanding fishing activities:
Data Structure Overview
- Survey Information:
  - `submission_id`: Unique identifier for each survey
  - `survey_id`: Detailed survey identifier with trip information
  - `landing_date`: Date when catch was landed
  - `landing_site`: Name of the landing site
  - `lat`, `lon`: Geographic coordinates of the landing
- Fishing Effort Variables:
  - `n_fishers`: Number of fishers involved
  - `n_boats`: Number of boats used
  - `trip_length`: Duration of fishing trip
  - `gear`: Type of fishing gear used
- Catch Information:
  - `catch_usage`: Purpose of the catch (trade, food, etc.)
  - `catch_taxon`: Species or taxonomic group caught
  - `catch_kg`: Weight of catch in kilograms
  - `catch_price`: Total price of catch
  - `price_kg`: Price per kilogram
Initial Data Assessment
Let’s examine the quality issues we need to address:
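One way to surface these issues, a sketch assuming the raw data is in `landings_raw` and already carries the column names listed in the structure overview:

```r
# Count missing values in each column
landings_raw %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))

# Inspect distinct spellings in key categorical variables
landings_raw %>% count(catch_usage, sort = TRUE)
landings_raw %>% count(landing_site, sort = TRUE)
```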
Our initial assessment reveals several data quality challenges that we’ll need to address:
- Missing Data Issues:
  - Complete survey rows with all values missing
  - Scattered missing values in effort variables
  - Empty strings and NA values mixed together
- Categorical Inconsistencies:
  - Mixed capitalization in `catch_usage` ("trade" vs "TRADE")
  - Variations in species names
  - Inconsistent landing site spellings
- Numeric Data Problems:
  - Text mixed with numbers in some numeric fields
  - Potential outliers in catch and price data
  - Effort variables with unrealistic values
Starting with Clean Names 🏷️
The first step in our preprocessing workflow is to standardize variable names using the techniques we learned in the previous tutorial. This ensures consistent naming patterns throughout our analysis:
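A sketch of this step using `clean_names()` from the janitor package (the object names carry over from the loading sketch above):

```r
library(janitor)

# Standardize all column names to snake_case
landings <- landings_raw %>%
  clean_names()

names(landings)
```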
The `clean_names()` function has:

- Converted all names to lowercase
- Replaced dots and spaces with underscores
- Removed special characters
- Created consistent naming patterns
String Standardization 🔤
Our categorical variables need standardization. Let’s clean up species names and catch usage categories:
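A sketch of this cleanup using stringr, assuming the inconsistencies are mostly case and whitespace; site-specific misspellings would need an explicit recode (e.g. with `case_when()`):

```r
landings <- landings %>%
  mutate(
    # Lowercase and collapse whitespace in usage categories ("TRADE " -> "trade")
    catch_usage = str_to_lower(str_squish(catch_usage)),
    # Same treatment for species names
    catch_taxon = str_to_lower(str_squish(catch_taxon)),
    # Title case keeps site names readable ("salima bay" -> "Salima Bay")
    landing_site = str_to_title(str_squish(landing_site))
  )
```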
Our standardization has:

- Created consistent categories for catch usage
- Standardized species names to lowercase
- Removed unnecessary whitespace
- Made categories ready for analysis
Handling Missing Values 🔍
Our dataset has various types of missing values that need careful handling:
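A quick way to see the different forms of missingness side by side:

```r
# Empty strings and true NA values can appear as separate categories
landings %>% count(catch_usage)

# How many survey rows are missing every value except the ID?
landings %>%
  filter(if_all(-submission_id, is.na)) %>%
  nrow()
```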
Let’s handle these missing values systematically:
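A sketch of that strategy with dplyr; the zero-to-NA step assumes the effort columns are already numeric (if they still contain stray text, convert them first as shown in the next section):

```r
landings <- landings %>%
  # 1. Convert empty strings to NA across character columns
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  # 2. Drop records where everything except the ID is missing
  filter(!if_all(-submission_id, is.na)) %>%
  # 3. Zeros in effort variables almost certainly mean "not recorded";
  #    legitimate zeros in catch_kg and catch_price are left alone
  mutate(across(c(n_fishers, n_boats, trip_length), ~ na_if(.x, 0)))
```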
Our missing value strategy:

1. Converted empty strings to NA for consistency
2. Removed completely empty survey records
3. Converted zeros to NA where they likely represent missing data
4. Left legitimate zeros in catch and price data
Cleaning Numeric Data 🔢
After handling categorical variables and missing values, we need to ensure our numeric data is clean and properly formatted. Our dataset includes several important numeric measurements:

- Catch weights (`catch_kg`)
- Prices (`price_kg` and `catch_price`)
- Effort metrics (`n_fishers`, `n_boats`, `trip_length`)
Let’s examine these variables:
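A sketch of the examination, assuming the catch columns may still be stored as character because of the stray text:

```r
# Summary statistics for the numeric measurements
landings %>%
  select(catch_kg, catch_price, price_kg, n_fishers, n_boats, trip_length) %>%
  summary()

# Values that refuse to parse as numbers (e.g. "12 kg")
landings %>%
  filter(!is.na(catch_kg) & is.na(suppressWarnings(as.numeric(catch_kg)))) %>%
  select(submission_id, catch_kg)
```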
From our examination, we can see several issues:

1. Some values contain non-numeric characters
2. There are potential outliers in catch and price data
3. Some effort variables have unrealistic values
Let’s clean these issues systematically:
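A sketch of the cleanup using `readr::parse_number()`; the outlier thresholds (50 fishers, 72-hour trips) are illustrative assumptions, not values from the original analysis, and should come from local knowledge of the fishery:

```r
landings <- landings %>%
  mutate(
    # Strip stray text, keeping the numeric part ("12 kg" -> 12)
    across(c(catch_kg, catch_price, n_fishers, n_boats, trip_length),
           ~ parse_number(as.character(.x))),
    # Drop implausible effort values (thresholds are illustrative)
    n_fishers   = if_else(n_fishers > 50, NA_real_, n_fishers),
    trip_length = if_else(trip_length > 72, NA_real_, trip_length),
    # Recompute price per kg from the cleaned weight and price
    price_kg = catch_price / catch_kg
  )
```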
Working with Dates and Times ⏰
Our dataset includes landing dates that need to be properly formatted for analysis. The `landing_date` column contains date information we need to parse:
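A minimal parsing sketch with lubridate, assuming the raw dates are written year-month-day (swap in `dmy()` or `mdy()` if your format differs):

```r
library(lubridate)

landings <- landings %>%
  # Parse "2023-04-15"-style strings into proper Date objects
  mutate(landing_date = ymd(landing_date))

# Check the parsed range for obviously wrong dates
range(landings$landing_date, na.rm = TRUE)
```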
Now we can also analyze temporal patterns in our data:
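For instance, a quick monthly count of landings (assuming `landing_date` was parsed as above):

```r
# Landings recorded per month
landings %>%
  count(landing_month = floor_date(landing_date, "month"))
```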
Creating Date-Based Groups
For analysis purposes, we can add additional time-based categorizations:
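A sketch of these categorizations with lubridate (the new column names are illustrative):

```r
landings <- landings %>%
  mutate(
    month_name = month(landing_date, label = TRUE),
    weekday    = wday(landing_date, label = TRUE),
    # lubridate codes Sunday as 1 and Saturday as 7 by default
    is_weekend = wday(landing_date) %in% c(1, 7)
  )
```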
These date transformations allow us to:

- Track temporal patterns in fishing activity
- Analyze seasonal variations
- Compare weekday vs weekend patterns
- Group data by meaningful time periods
Conclusion 🎯
Let’s examine our fully preprocessed dataset:
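A final check, assuming the cleaned data is still in `landings`:

```r
# Final structure check
glimpse(landings)

# Confirm no empty strings survive in character columns
landings %>%
  summarise(across(where(is.character), ~ sum(.x == "", na.rm = TRUE)))
```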
Our preprocessing workflow has:

- Standardized all variable names and categories
- Cleaned and validated numeric data
- Handled missing values appropriately
- Created consistent date formats
- Removed unrealistic values
The cleaned dataset is now ready for analysis! 🎯
Next: Data validation