Cleans and standardizes enumerator names by removing special characters, fixing typos, and matching similar names using string distance.

Usage

standardize_enumerator_names(data = NULL, max_distance = 3)

Arguments

data

A data frame with columns 'submission_id' and 'enumerator_name'

max_distance

Maximum Levenshtein distance for matching similar names. Lower values are stricter. Default is 3.
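To give a feel for the scale of max_distance, the sketch below measures the distance between the two example spellings using utils::adist() from base R. Whether standardize_enumerator_names() uses adist() internally is an assumption; it may rely on a different string-distance implementation.

```r
# Assumption: base R's utils::adist() stands in for whatever distance
# routine the package uses. It computes generalized Levenshtein distance.
d <- utils::adist("john smith", "jhon smith")

# Plain Levenshtein has no transposition operation, so swapping "o" and "h"
# counts as two single-character edits.
d[1, 1]  # 2

# With the default max_distance = 3, these two spellings would be matched.
within_threshold <- d[1, 1] <= 3
```

A transposed pair of letters already costs 2, so max_distance = 1 would not match these spellings, while larger values risk merging genuinely different short names.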

Value

A data frame with two columns: 'submission_id' and 'enumerator_name_clean'

Details

The function:

  • Removes numbers and special characters

  • Converts to lowercase

  • Removes extra whitespace

  • Marks single-word entries as "undefined"

  • Matches similar names (e.g., "john smith" and "jhon smith")

  • Uses the shorter or alphabetically first variant as the standard name for each group of matched names
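The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: clean_one() and pick_standard() are hypothetical helpers, the regular expressions are guesses at what "special characters" covers, and the rule "shorter first, ties broken alphabetically" is one possible reading of the last bullet.

```r
# Hypothetical helper: cleaning steps applied to a character vector.
clean_one <- function(x) {
  x <- gsub("[^[:alpha:][:space:]]", "", x)  # drop digits and special characters
  x <- tolower(x)                            # convert to lowercase
  x <- trimws(gsub("\\s+", " ", x))          # collapse and trim whitespace
  # entries without a space (a single word, or empty) become "undefined"
  ifelse(grepl(" ", x, fixed = TRUE), x, "undefined")
}

# Hypothetical helper: for each name, pick a standard variant among all
# names within max_distance edits (assumed rule: shortest, then alphabetical).
pick_standard <- function(x, max_distance = 3) {
  d <- utils::adist(x)  # pairwise Levenshtein distances
  vapply(seq_along(x), function(i) {
    if (x[i] == "undefined") return("undefined")
    cand <- x[d[i, ] <= max_distance & x != "undefined"]
    cand[order(nchar(cand), cand)][1]
  }, character(1))
}

raw <- c("John  Smith#1", "jhon smith", "Mary", " Ana  Lopez 22")
cleaned <- clean_one(raw)
result <- pick_standard(cleaned, max_distance = 3)
```

Under this reading of the rule, both spellings of the first name collapse to the same variant, but that variant can be the misspelled one when the candidates have equal length, so the output is worth reviewing by hand.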

Examples

if (FALSE) { # \dontrun{
clean_names <- standardize_enumerator_names(raw_dat, max_distance = 2)
} # }